# Exploratory Data Analysis

Given the characteristics of the data, in this EDA we'll plot it using different techniques, in both the time domain and the frequency domain.

### Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option("display.max_colwidth", None) # setting the maximum width in characters when displaying pandas column. "None" value means unlimited.

import matplotlib.pyplot as plt  # plotting
from glob import glob     # pathname management

import seaborn as sns
from scipy.interpolate import interp1d  # interpolating a 1-D function
import matplotlib.mlab as mlab  # some MATLAB commands

import librosa
import librosa.display

### Setup variables

In [None]:
training_labels=pd.read_csv("data/training_labels.csv")
training_labels.head()

To make things easier, let's merge each file's path into the DataFrame that holds the targets.

With glob we can list all the files in the train directory.

In [None]:
training_paths = glob("D:/Projects/G2Net-Gravitational-Wave-Detection/data/train/*/*/*/*")
print("The total number of files in the training set:", len(training_paths))

In [None]:
from pathlib import Path
ids = [Path(path).stem for path in training_paths]  # file name without extension, whatever the path separator
paths_df = pd.DataFrame({"path":training_paths, "id": ids})
train_data = pd.merge(left=training_labels, right=paths_df, on="id")
train_data.head()

In [None]:
train_data.to_csv("data/data_path.csv", index=False)  # index=False avoids writing an extra unnamed index column

## Plots

Let's check whether the target in the source data is balanced.

In [None]:
train_data['target'].value_counts()

In [None]:
sns.countplot(data=train_data, x="target")

As we can see the source data is balanced.
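
The balance can also be checked numerically rather than read off the plot. The following is a minimal sketch on a made-up `targets` series (a hypothetical stand-in for `train_data['target']`, not the real labels); `normalize=True` turns the counts into class shares.

```python
import pandas as pd

# Hypothetical stand-in for train_data["target"]; the real labels come from the CSV
targets = pd.Series([0, 1, 1, 0, 1, 0, 0, 1], name="target")

# normalize=True returns the share of each class instead of raw counts
shares = targets.value_counts(normalize=True).sort_index()
print(shares)

# Largest deviation of any class share from a perfect 50/50 split
imbalance = (shares - 0.5).abs().max()
print(f"max deviation from 50/50: {imbalance:.3f}")
```

On the real data, replacing `targets` with `train_data['target']` should give the same kind of summary.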

Let's plot the signals

In [None]:
def plot_raw_data(paths,
           df,
           target,
           labels = ('LIGO Hanford', 'LIGO Livingston', 'Virgo')
):
    # pick one reproducible sample id with the requested target
    sample_id = df[df['target'] == target].sample(random_state=42)['id'].values[0]
    # ids are hex strings, not integer positions, so look the file up by name
    # in the list of file paths
    sample_path = next(p for p in paths if sample_id in p)
    data = np.load(sample_path)  # one row of strain values per detector
    fig, ax = plt.subplots(3, 1, figsize=(12, 10), sharey=True)
    fig.suptitle(f"Strain data for three observatories from sample: {sample_id} | Target: {target}")
    for i in range(3):
        sns.lineplot(data=data[i], ax=ax[i], color=sns.color_palette()[i])
        ax[i].legend([labels[i]])
        ax[i].set_xlim(0, 4096)
        # 4096 samples at 2048 Hz: relabel the ticks in seconds
        ax[i].set_xticks(ticks=[0, 2048, 4096])
        ax[i].set_xticklabels(labels=[0, 1, 2])
    ax[-1].set_xlabel("Time (s)")


In [None]:
# plot the sample with gravitational wave signal
plot_raw_data(training_paths, train_data, 1)

In [None]:
# plot the sample without gravitational wave signal
plot_raw_data(training_paths, train_data, 0)

A rough description of what we see:

The three plots above show the strain values sampled for 2 s at 2048 Hz for id 882722dba9. Of the three readings, the two LIGO signals are similar in amplitude, while the Virgo signal is smaller. Even though this particular sample contains a gravitational wave signal, it is buried deep in the instrument noise.
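
To make the sampling figures concrete, here is a small sketch of how the time axis follows from the sample rate, and how the per-detector amplitude could be compared numerically instead of by eye. The array `fake_sample` is random noise standing in for one loaded file (an assumption for illustration, not competition data), and `rms` is a hypothetical helper name.

```python
import numpy as np

sample_rate = 2048                   # Hz, as stated above
duration = 2                         # seconds
n_samples = sample_rate * duration   # 4096 samples per detector

# Time stamp of every sample, in seconds
t = np.arange(n_samples) / sample_rate

# Hypothetical stand-in for one file: 3 detectors x 4096 samples of random noise
rng = np.random.default_rng(0)
fake_sample = rng.standard_normal((3, n_samples))

# Root-mean-square amplitude per detector, one number per row
rms = np.sqrt(np.mean(fake_sample**2, axis=1))
print(n_samples, t[-1], rms)
```

With a real file, `fake_sample` would be replaced by the array returned from `np.load`.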


Similarly, for sample 05552e5b6a, which has no gravitational wave signal, we cannot see any visual sign of a difference. The strain is of the order of $10^{-20}$, which is extremely small and can be affected by many external factors. However, as both sample plots show, the strain data is a combination of many frequencies, so analysing the signals in the frequency domain instead of the time domain might give us better insights.

The Fourier transform is the most commonly used method in mathematics and signal processing for decomposing a signal into its constituent frequencies. The resulting spectrum can be analyzed in terms of the signal's average, power or energy to get a spectral density plot. We will follow some of the concepts from this tutorial. As it says, one way to visualize a raw signal in the frequency domain is to plot its amplitude spectral density (ASD), which is the square root of the power spectral density (PSD).
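
As an illustration of the idea, the sketch below estimates the ASD of a synthetic signal, a 100 Hz sine wave buried in noise, using `scipy.signal.welch`. The signal, the tone frequency and the segment length are all assumptions made for the example; this is not the method used in the cells that follow, which rely on `TimeSeries.asd` and `mlab.psd`.

```python
import numpy as np
from scipy.signal import welch

sample_rate = 2048   # Hz, same as the competition data
duration = 2         # seconds
t = np.arange(sample_rate * duration) / sample_rate

# Synthetic signal (assumption): a 100 Hz tone plus white noise
rng = np.random.default_rng(0)
tone_freq = 100.0
signal = np.sin(2 * np.pi * tone_freq * t) + 0.5 * rng.standard_normal(t.size)

# Welch's method averages the periodograms of overlapping 1 s segments
freqs, psd = welch(signal, fs=sample_rate, nperseg=sample_rate)

# The ASD is the square root of the PSD
asd = np.sqrt(psd)

# The tone is hard to see in the time domain but dominates the spectrum
peak_freq = freqs[np.argmax(asd)]
print(f"strongest frequency: {peak_freq} Hz")
```

The frequency axis runs from 0 Hz up to the Nyquist frequency, `sample_rate / 2`, in steps of 1 Hz because each segment is 1 s long.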

### Spectral density plots

In [None]:
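# let's define some signal parameters
sample_rate = 2048 # data is provided at 2048 Hz
signal_length = 2 # each signal lasts 2 s
NFFT = 4 * sample_rate # FFT length in samples; longer than one signal (4096 samples), so mlab.psd zero-pads
f_min = 20. # lower frequency limit of the plots (Hz)
f_max = sample_rate / 2 # the Nyquist frequency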

In [None]:
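# function to plot the amplitude spectral density (ASD) plot
# TimeSeries is not imported in the Libraries cell; assuming it is gwpy's TimeSeries
from gwpy.timeseries import TimeSeries

def plot_asd(paths,
             df,
             target,
             signal_length,
             sample_rate,
             labels = ('LIGO Hanford', 'LIGO Livingston', 'Virgo')
):
    # pick one reproducible sample id with the requested target
    sample_id = df[df['target'] == target].sample(random_state=42)['id'].values[0]
    # ids are hex strings, not integer positions, so look the file up by name
    sample_path = next(p for p in paths if sample_id in p)
    data = np.load(sample_path)

    for i in range(data.shape[0]):
        ts = TimeSeries(data[i], sample_rate=sample_rate)
        # FFT length in seconds: here the whole signal
        ax = ts.asd(signal_length).plot(figsize=(12, 5)).gca()
        ax.set_xlim(10, 1024)
        ax.set_title(f"ASD plots for sample: {sample_id} from {labels[i]}")
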

In [None]:
def plot_asd_mix(paths,
                 df,
                 target,
                 sample_rate,
                 NFFT,
                 f_min,
                 f_max,
                 labels = ('LIGO Hanford', 'LIGO Livingston', 'Virgo')):

    # pick one reproducible sample id with the requested target
    sample_id = df[df['target'] == target].sample(random_state=42)['id'].values[0]
    # ids are hex strings, not integer positions, so look the file up by name
    sample_path = next(p for p in paths if sample_id in p)
    sample = np.load(sample_path)

    # power spectral density (PSD) of each detector
    Pxx_1, freqs = mlab.psd(sample[0], Fs = sample_rate, NFFT = NFFT)
    Pxx_2, freqs = mlab.psd(sample[1], Fs = sample_rate, NFFT = NFFT)
    Pxx_3, freqs = mlab.psd(sample[2], Fs = sample_rate, NFFT = NFFT)

    fig, ax = plt.subplots(ncols=1, nrows=1, figsize=(10, 5))
    # the ASD is the square root of the PSD
    ax.loglog(freqs, np.sqrt(Pxx_1),"g",label=labels[0])
    ax.loglog(freqs, np.sqrt(Pxx_2),"r",label=labels[1])
    ax.loglog(freqs, np.sqrt(Pxx_3),"b",label=labels[2])

    ax.set_xlim([f_min, f_max])
    ax.set_xlabel("Frequency (Hz)")
    ax.set_ylabel("ASD (strain / Hz^1/2)")
    ax.set_title(f"ASD plots for sample: {sample_id}")
    ax.legend()

    plt.show()

In [None]:
# plot ASD for sample w/ GW
plot_asd(training_paths, train_data, 1, signal_length, sample_rate)

In [None]:
plot_asd_mix(training_paths, train_data, 1, sample_rate, NFFT, f_min, f_max)

These plots use a log scale on the x-axis, which ranges from about 10 Hz to 1000 Hz. Although these limits are for visualization purposes only, they help us see some peaks for each observatory. A particular frequency can stand out in one measurement, but remember that a GW signal has to be detected in all three detectors to be confirmed. The data here still seems a bit noisy and, as shown in the tutorial, sampling for longer periods of time (on real data) can give some valuable insights. However, the data in this competition is simulated, so we try to find other ways to visualize it.
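
One reason longer recordings help is that a spectral estimate averaged over more segments is less noisy. The sketch below illustrates this on synthetic white noise (an assumption; no competition data is used) with `scipy.signal.welch`: the relative scatter of the PSD estimate shrinks when 64 s of data are available instead of 2 s. The helper name `psd_scatter` is made up for this example.

```python
import numpy as np
from scipy.signal import welch

sample_rate = 2048
rng = np.random.default_rng(0)

def psd_scatter(duration):
    """Relative scatter (std / mean) of a Welch PSD estimate of white noise."""
    noise = rng.standard_normal(sample_rate * duration)
    # 1 s segments with the default 50% overlap
    _, psd = welch(noise, fs=sample_rate, nperseg=sample_rate)
    # drop the DC and Nyquist bins, which are scaled differently
    psd = psd[1:-1]
    return psd.std() / psd.mean()

short_scatter = psd_scatter(2)    # 2 s of data, as in this competition
long_scatter = psd_scatter(64)    # a hypothetical longer stretch of data
print(f"2 s: {short_scatter:.2f}, 64 s: {long_scatter:.2f}")
```

The true PSD of white noise is flat, so any scatter across frequency bins is estimation noise; the longer stretch gives a visibly flatter estimate.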

Just for the sake of completeness, we also plot the spectral density plots for a sample without GW.

In [None]:
# plot ASD for sample w/o GW
plot_asd(training_paths, train_data, 0, signal_length, sample_rate)

In [None]:
plot_asd_mix(training_paths, train_data, 0, sample_rate, NFFT, f_min, f_max)

They do seem to have fewer peaks, especially around 200 Hz, but there is so much variability in this data that this cannot be concluded with certainty.

In [None]:
# function to plot the distribution of strain values for both targets side-by-side
def plot_distribution(paths,
                      df,
                      labels=("LIGO Hanford", "LIGO Livingston", "Virgo")
                      ):
    # Get one reproducible sample per target; ids are hex strings,
    # so each file is looked up by name in the list of paths
    id_1 = df[df['target'] == 1].sample(random_state=42)['id'].values[0]
    id_0 = df[df['target'] == 0].sample(random_state=42)['id'].values[0]
    sample_1 = np.load(next(p for p in paths if id_1 in p))
    sample_0 = np.load(next(p for p in paths if id_0 in p))

    plt.figure(figsize=(15, 8))
    k = 1
    # top row: sample with a gravitational wave signal
    for i in range(3):
        plt.subplot(2, 3, k)
        # rescale by 1e20 so the values are of order one;
        # histplot replaces distplot, which is deprecated in recent seaborn releases
        sns.histplot(sample_1[i]*10**(20), kde=True, stat="density", label=labels[i])
        plt.legend()
        plt.title('Target: 1')
        k+=1

    # bottom row: sample without a gravitational wave signal
    for i in range(3):
        plt.subplot(2, 3, k)
        sns.histplot(sample_0[i]*10**(20), kde=True, stat="density", label=labels[i], color='r')
        plt.title('Target: 0')
        plt.legend()
        k+=1

    plt.tight_layout()

In [None]:
# a single call draws both targets: target 1 on the top row, target 0 on the bottom row
plot_distribution(training_paths, train_data)