# Exploratory Data Analysis

Given the characteristics of the data, in this EDA we'll plot it using different techniques, such as plotting in the time domain and in the frequency domain.

### Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option("display.max_colwidth", None) # setting the maximum width in characters when displaying pandas column. "None" value means unlimited.

import matplotlib.pyplot as plt  # plotting
from glob import glob     # pathname management

import random    # generating (pseudo)-random numbers
import seaborn as sns # for data visualization
import matplotlib.mlab as mlab  # some MATLAB commands
from scipy.interpolate import interp1d  # interpolating a 1-D function
from src.plot.plot import *

In [None]:
!pip install gwpy
import gwpy

### Reading the files

In [None]:
training_labels=pd.read_csv("data/training_labels.csv")
training_labels.head()

Let's check whether the target in the source data is balanced.

In [None]:
training_labels['target'].value_counts()

In [None]:
sns.countplot(data=training_labels, x="target")  # same DataFrame as loaded above

As we can see, the source data is balanced.
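
To put a number on that balance, we can look at the share of each class rather than the raw counts. The snippet below is a minimal sketch on a made-up stand-in for `training_labels` (the ids and targets in `toy_labels` are hypothetical, not taken from the competition data).

```python
import pandas as pd

# Toy stand-in for training_labels: ids and targets are invented for illustration
toy_labels = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "target": [0, 1, 1, 0, 1, 0],
})

# Fraction of rows per class, ordered by class label
class_share = toy_labels["target"].value_counts(normalize=True).sort_index()

# Ratio between the largest and smallest class; 1.0 means perfectly balanced
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print("imbalance ratio:", imbalance_ratio)
```

The same two lines applied to `training_labels["target"]` would give the class shares for the real labels.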

To make things easier, let's merge each file's path into the DataFrame that holds the target.

In [None]:
training_paths = glob("../input/g2net-gravitational-wave-detection/train/*/*/*/*")
print("The total number of files in the training set:", len(training_paths))

With glob we can get all the files in the train directory.

In [None]:
ids = [path.split("/")[-1].split(".")[0] for path in training_paths]
paths_df = pd.DataFrame({"path":training_paths, "id": ids})
train_data = pd.merge(left=training_labels, right=paths_df, on="id")
train_data.head()
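
The merge above relies on every label having exactly one matching file. As a hedged sketch of how that could be checked, the example below builds the same kind of `path`/`id` table from a few made-up paths (the file names, the `.npy` extension and the `toy_*` names are assumptions for illustration) and asks pandas to validate the one-to-one relationship.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file paths and labels, invented for this example
toy_paths = [
    "train/a/a/a/aaa0000001.npy",
    "train/b/b/b/bbb0000002.npy",
    "train/c/c/c/ccc0000003.npy",
]
toy_labels = pd.DataFrame({
    "id": ["aaa0000001", "bbb0000002", "ccc0000003"],
    "target": [1, 0, 1],
})

# Path(...).stem drops the directories and the file extension, leaving the id
toy_paths_df = pd.DataFrame({
    "path": toy_paths,
    "id": [Path(p).stem for p in toy_paths],
})

# how="left" keeps every label; validate raises if an id appears more than once on either side
merged = pd.merge(toy_labels, toy_paths_df, on="id", how="left", validate="one_to_one")

# Labels that found no file would show up as missing paths
missing = int(merged["path"].isna().sum())
print(merged)
print("labels without a file:", missing)
```

With `how="left"`, a label whose file is absent stays in the table with an empty path instead of silently disappearing, which makes gaps easy to spot.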

In [None]:
train_data.to_csv("data/data_path.csv")

In [None]:
# draw a random sample from the train data
sample_gw_id = train_data[train_data['target'] == 1].sample(random_state=42)['id'].values[0]

In [None]:
# plot the sample with gravitational wave signal
visualize_sample(sample_gw_id)

A rough description of what we see:

The three plots above show the strain values sampled for 2 s at 2048 Hz for id 882722dba9. Of the three readings, the two LIGO values are similar in amplitude, while the Virgo values are smaller. Even though this particular sample contains a gravitational wave signal, it is buried deep in the instrument noise.
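
To make the numbers in that description concrete, here is a small sketch of how the time axis and a per-detector amplitude summary could be computed. It does not load a real file: `strain` is synthetic noise, and the `(3, n_samples)` layout, the row order and the `scales` values are assumptions chosen only so that the third row is smaller, as in the plots.

```python
import numpy as np

sample_rate = 2048                      # Hz
duration = 2                            # seconds
n_samples = sample_rate * duration      # samples per detector

# Synthetic stand-in for one sample (assumption): one row per detector
detectors = ["LIGO Hanford", "LIGO Livingston", "Virgo"]
scales = np.array([1.0, 1.0, 0.3])      # arbitrary relative noise levels
rng = np.random.default_rng(42)
strain = rng.standard_normal((3, n_samples)) * scales[:, None]

# Time stamp of every sample, in seconds
time = np.arange(n_samples) / sample_rate

# Root-mean-square amplitude per detector as a simple size summary
rms = np.sqrt(np.mean(strain ** 2, axis=1))
for name, value in zip(detectors, rms):
    print(f"{name}: RMS = {value:.3f}")
```

Comparing RMS values is only a rough way to express "similar in amplitude" versus "smaller"; it says nothing about whether a signal is present.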


In [None]:
# draw another random sample from train without gravitational wave signal
sample_no_gw_id = train_data[train_data['target'] == 0].sample(random_state=42)['id'].values[0]

In [None]:
# plot the sample without gravitational wave signal
visualize_sample(sample_no_gw_id)

Similarly, for the sample 05552e5b6a without a gravitational wave signal, we cannot visually see any signs of one. The strain is extremely small in magnitude and can be affected by many external factors. However, as both sample plots show, the strain data is a combination of many frequencies, so analysing the signals in the frequency domain instead of the time domain might give us better insights.

The Fourier transform is the most commonly used method in mathematics and signal processing for decomposing a signal into its constituent frequencies. The resulting spectrum can be analyzed in terms of the signal's average amplitude, power, or energy to produce a spectral density plot. We will follow some of the concepts from this tutorial. As it says, one way to visualize a raw signal in the frequency domain is to plot its amplitude spectral density (ASD).
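
As a rough illustration of the idea, the sketch below estimates an ASD with plain NumPy on a synthetic signal (a 100 Hz tone plus white noise, both assumptions made for the example). It uses a single Hann-windowed periodogram, which is simpler than the segment-averaged estimates often used in practice and is not necessarily what `plot_asd` does internally.

```python
import numpy as np

sample_rate = 2048                        # Hz
signal_length = 2                         # seconds
n_samples = sample_rate * signal_length

# Synthetic stand-in signal (assumption): a 100 Hz tone buried in white noise
rng = np.random.default_rng(0)
t = np.arange(n_samples) / sample_rate
tone_hz = 100.0
x = np.sin(2 * np.pi * tone_hz * t) + 0.5 * rng.standard_normal(n_samples)

# Taper with a Hann window to reduce spectral leakage, then take the real FFT
window = np.hanning(n_samples)
spectrum = np.fft.rfft(x * window)
freqs = np.fft.rfftfreq(n_samples, d=1 / sample_rate)

# One-sided power spectral density (units^2 / Hz)
psd = np.abs(spectrum) ** 2 / (sample_rate * np.sum(window ** 2))
psd[1:-1] *= 2                            # fold in negative frequencies, except DC and Nyquist

# The ASD is the square root of the PSD (units / sqrt(Hz))
asd = np.sqrt(psd)

peak_hz = freqs[np.argmax(asd)]
print("frequency resolution (Hz):", freqs[1] - freqs[0])
print("highest frequency (Hz):", freqs[-1])
print("strongest component (Hz):", peak_hz)
```

The frequency grid also shows two limits that matter when reading such plots: the resolution is the inverse of the signal duration, and the highest representable frequency is half the sampling rate.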

### Spectral density plots

In [None]:
# let's define some signal parameters
sample_rate = 2048 # data is provided at 2048 Hz
signal_length = 2 # each signal lasts 2 s
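NFFT = 4*sample_rate    # FFT length in samples
f_min = 20.             # lower frequency limit for the plot, in Hz
f_max = sample_rate/2   # the Nyquist frequency: half the sampling rate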

In [None]:
# plot ASD for sample w/ GW
plot_asd(sample_gw_id,sample_rate,signal_length)

In [None]:
plot_asd_mix(sample_gw_id,sample_rate,NFFT,f_min,f_max)

These plots use a log scale on the x-axis, which ranges from roughly 10 Hz to 1000 Hz. Although these limits are for visualization purposes only, they help us see some peaks for each observatory. A particular frequency can stand out in one measurement, but remember that a GW signal has to be detected in all three signals to be confirmed. The data here still seems a bit noisy and, as shown in the tutorial, sampling for longer periods of time (on real data) can give some valuable insights. However, the data in this competition is simulated, so we try to find other ways to visualize it.

For the sake of completeness, we also show the spectral density plots for a sample without a GW signal.

In [None]:
# plot ASD for sample w/o GW
plot_asd(sample_no_gw_id,sample_rate,signal_length)

In [None]:
plot_asd_mix(sample_no_gw_id,sample_rate,NFFT,f_min,f_max)

They do seem to have fewer peaks, especially around 200 Hz, but there is so much variability in this data that this cannot be concluded with certainty.