# Gravitational wave detection


If you are looking at a static version of this notebook and would like to run its contents, head over to [GitHub](https://github.com/giotto-ai/giotto-tda/blob/master/examples/gravitational_waves_detection.ipynb) and download the source.

## Useful references

* [Topology of time series](https://giotto-ai.github.io/gtda-docs/latest/notebooks/topology_time_series.html), in which the *Takens embedding* technique used here is explained in detail and illustrated via simple examples.
* [Detection of gravitational waves using topological data analysis and convolutional neural network: An improved approach](https://arxiv.org/abs/1910.08245) by Christopher Bresten and Jae-Hun Jung. We thank Christopher Bresten for sharing the code and data used in the article.



## Motivation
The videos below show different representations of the gravitational waves that we aim to detect. Our goal is to pick out the "chirp" signal of two colliding black holes from a very noisy background.

## Generate the data

In the article, the authors create a synthetic training set as follows: 

* Generate gravitational wave signals that correspond to non-spinning binary black hole mergers
* Generate a noisy time series and embed a gravitational wave signal with probability 0.5 at a random time.

The result is a set of time series of the form

$$ s = g + \epsilon \frac{1}{R}\xi $$

where $g$ is a gravitational wave signal from the reference set, $\xi$ is Gaussian noise, $\epsilon=10^{-19}$ scales the noise amplitude to the signal, and $R \in (0.075, 0.65)$ is a parameter that controls the signal-to-noise ratio (SNR).
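
To make the formula concrete, here is a minimal sketch of the noise model. The chirp `g` below is a made-up stand-in (it is not one of the article's reference signals), and the names `rng`, `t` and `noise_std` are ours; only $\epsilon$ and $R$ follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a reference signal g: a chirp with amplitude of order 1e-19.
# This is an illustrative assumption, not data from the article.
t = np.linspace(0.0, 1.0, 2048)
g = 1e-19 * t**2 * np.sin(2 * np.pi * (5 + 40 * t**2) * t)

epsilon = 1e-19                    # noise scale, as in the text
R = 0.65                           # SNR coefficient (upper end of the interval)
xi = rng.standard_normal(t.size)   # Gaussian noise xi

# s = g + epsilon * (1 / R) * xi
s = g + epsilon / R * xi

# The noise term has standard deviation close to epsilon / R
noise_std = np.std(s - g)
print(f"empirical noise std: {noise_std:.3e}")
print(f"epsilon / R:         {epsilon / R:.3e}")
```

Since the noise term is $\epsilon\,\xi / R$, its standard deviation is $\epsilon / R$, so a smaller $R$ yields a noisier time series.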

## Constant signal-to-noise ratio

As a warm-up, let's generate some noisy signals with a constant SNR of 17.98. As shown in Table 1 of the article, this corresponds to an $R$ value of 0.65. By picking the upper end of the interval, we place ourselves in a favorable scenario and can thus gain a sense of the best possible performance of our time series classifier. We pick a small number of samples to make the computations run fast; in practice, one would scale this up by 1-2 orders of magnitude, as done in the original article.

In [24]:
import numpy as np # Import the NumPy library for numerical operations
from pathlib import Path # Import the Path class from the pathlib module to handle file system paths

# Add a comment on each line explaining what it does
# Somewhere the code ensures that the data are balanced.
# Task: generate imbalanced data (75 signal, 25 noise) (25-75) (90-10) (10-90)

# Define the main function `make_gravitational_waves`
def make_gravitational_waves(
    path_to_data: Path, # Path to the directory containing the data
    n_signals: int = 30, # Number of signals to generate
    downsample_factor: int = 2, # Factor by which to downsample the signals
    r_min: float = 0.075,# Minimum signal-to-noise ratio (SNR) coefficient
    r_max: float = 0.65, # Maximum signal-to-noise ratio (SNR) coefficient
    n_snr_values: int = 10, # Number of distinct SNR values to use
        ):
    def padrand(V, n, kr):
        cut = np.random.randint(n) # Generate a random integer to determine the split point for padding
        rand1 = np.random.randn(cut) # Create random noise for the first part of the padding
        rand2 = np.random.randn(n - cut) # Create random noise for the second part of the padding
        
        # Concatenate the first padding, the input signal `V`, and the second padding
        # Scale the padding by the factor `kr`
        out = np.concatenate((rand1 * kr, V, rand2 * kr))
        return out

    Rcoef = np.linspace(r_min, r_max, n_snr_values) # Generate a list of SNR coefficients evenly spaced between `r_min` and `r_max`
    Npad = 500  # total number of padding points, split at random between the two sides of the vector
    gw = np.load("../data/gravitational_wave_signals.npy") # Load data (the path is hard-coded; `path_to_data` is not used here)
    Norig = len(gw["data"][0]) # Get the original number of data points in each signal
    Ndat = len(gw["signal_present"]) # Get the total number of signals in the dataset
    N = int(Norig / downsample_factor) # Calculate the number of data points after downsampling

    # Initialize lists to store noise coefficients and SNR coefficients
    ncoeff = []
    Rcoeflist = []

    # Loop through the number of signals to generate noise coefficients and SNR coefficients
    for j in range(n_signals):
        # Calculate the noise coefficient based on the SNR coefficient
        ncoeff.append(10 ** (-19) * (1 / Rcoef[j % n_snr_values]))
        # Append the corresponding SNR coefficient to the list
        Rcoeflist.append(Rcoef[j % n_snr_values])

    # Initialize variables
    noisy_signals = []
    gw_signals = []
    k = 0
    labels = np.zeros(n_signals)

    # Loop through the number of signals to generate noisy signals and labels
    for j in range(n_signals):
        # Select a signal from the dataset and downsample it
        signal = gw["data"][j % Ndat][range(0, Norig, downsample_factor)]
        
        # Randomly decide if the signal is present (1) or absent (0)
        sigp = int((np.random.randn() < 0))
        
        # Generate random noise scaled by the noise coefficient
        noise = ncoeff[j] * np.random.randn(N)
        
        # Assign the label based on whether the signal is present
        labels[j] = sigp
        
        # If the signal is present, add it to the noise and pad the result
        if sigp == 1:
            rawsig = padrand(signal + noise, Npad, ncoeff[j])
            
            # Record that at least one signal-bearing series was generated (`k` is not used afterwards)
            if k == 0:
                k = 1
        else:
            # If the signal is absent, pad only the noise
            rawsig = padrand(noise, Npad, ncoeff[j])
        
        # Append the padded noisy signal to the list
        noisy_signals.append(rawsig.copy())
        
        # Append the original signal to the list
        gw_signals.append(signal)
    
    # Return the generated noisy signals, original signals, and labels
    return noisy_signals, gw_signals, labels



# generate and get data
R = 0.50 # SNR coefficient R, passed as both `r_min` and `r_max` below
n_signals = 100 # Number of signals to generate
DATA = Path(".") # Path to the directory containing the data

noisy_signals_50_50, gw_signals_50_50, labels_50_50 = make_gravitational_waves(
    path_to_data=DATA, n_signals=n_signals, r_min=R, r_max=R, n_snr_values=1
)

print(f"Number of noisy signals: {len(noisy_signals_50_50)}")
print(f"Number of timesteps per series: {len(noisy_signals_50_50[0])}")

Number of noisy signals: 100
Number of timesteps per series: 8692
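
Note that the generator marks a signal as present when `np.random.randn() < 0`, which happens with probability 0.5, so the classes are balanced only on average. A minimal sketch of how that draw could be generalised to other class ratios is given below; `draw_labels` and `p_signal` are hypothetical names that do not appear in the function above.

```python
import numpy as np


def draw_labels(n_signals, p_signal=0.5, seed=None):
    """Hypothetical helper: 1 = signal present, 0 = pure noise."""
    rng = np.random.default_rng(seed)
    # A uniform draw in [0, 1) falls below p_signal with probability p_signal
    return (rng.random(n_signals) < p_signal).astype(int)


labels_balanced = draw_labels(1000, p_signal=0.5, seed=0)
labels_skewed = draw_labels(1000, p_signal=0.75, seed=0)

print(f"fraction with signal (p=0.50): {labels_balanced.mean():.3f}")
print(f"fraction with signal (p=0.75): {labels_skewed.mean():.3f}")
```

The realised fractions fluctuate around `p_signal`; an exact split would require shuffling a fixed label vector instead.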


Next let's visualise the two different types of time series that we wish to classify: one that is pure noise vs. one that is composed of noise plus an embedded gravitational wave signal:

In [25]:
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# get the index corresponding to the first pure noise time series
background_idx = np.argmin(labels_50_50)
# get the index corresponding to the first noise + gravitational wave time series
signal_idx = np.argmax(labels_50_50)

ts_noise = noisy_signals_50_50[background_idx]
ts_background = noisy_signals_50_50[signal_idx]
ts_signal = gw_signals_50_50[signal_idx]

fig = make_subplots(rows=1, cols=2)

fig.add_trace(
    go.Scatter(x=list(range(len(ts_noise))), y=ts_noise, mode="lines", name="noise"),
    row=1,
    col=1,
)

fig.add_trace(
    go.Scatter(
        x=list(range(len(ts_background))),
        y=ts_background,
        mode="lines",
        name="background",
    ),
    row=1,
    col=2,
)

fig.add_trace(
    go.Scatter(x=list(range(len(ts_signal))), y=ts_signal, mode="lines", name="signal"),
    row=1,
    col=2,
)
fig.show()

We make two observations:
1. It is hard to distinguish the signal by eye,
2. The signal features some regularity or periodicity.

Both observations lead us to examine the _**Takens embedding**_ of the signal $s(t)$, in order to pick up the recurrent structure. Indeed, if $s$ is sampled from a dynamical system with a non-trivial recurrent structure, then, for appropriate parameters, the image of the embedding will have non-trivial topology.

More formally, we extract a sequence of vectors in $\mathbb{R}^{d}$ of the form

$$
TD_{d,\tau} s : \mathbb{R} \to \mathbb{R}^{d}\,, \qquad t \mapsto \begin{bmatrix}
           s(t) \\
           s(t + \tau) \\
           s(t + 2\tau) \\
           \vdots \\
           s(t + (d-1)\tau)
         \end{bmatrix},
$$
where $d$ is the embedding dimension and $\tau$ is the time delay. The quantity $(d-1)\tau$ is known as the "window size", and the difference between consecutive sampling times $t_i$ and $t_{i+1}$ is called the stride.
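
The definition can be sketched in a few lines of NumPy. This is an illustrative implementation of the formula only; the function name `takens_embedding` is ours, and the transformers in ``giotto-tda`` may align or index the windows differently.

```python
import numpy as np


def takens_embedding(s, dimension, time_delay, stride=1):
    """Rows are [s[t], s[t + tau], ..., s[t + (d - 1) * tau]] for t = 0, stride, 2 * stride, ..."""
    s = np.asarray(s)
    window = (dimension - 1) * time_delay          # the "window size" (d - 1) * tau
    n_points = (len(s) - window - 1) // stride + 1
    if n_points <= 0:
        raise ValueError("time series is too short for this dimension and time delay")
    starts = np.arange(n_points) * stride          # sampling times t_0, t_1, ...
    offsets = np.arange(dimension) * time_delay    # 0, tau, 2 * tau, ...
    # Broadcasting builds an (n_points, dimension) index array
    return s[starts[:, None] + offsets[None, :]]


emb = takens_embedding(np.arange(10), dimension=3, time_delay=2)
print(emb.shape)   # (6, 3)
print(emb[0])      # first embedded point: s(0), s(2), s(4)
```

Each row is one point of the embedded point cloud, so a series of length $n$ yields roughly $(n - (d-1)\tau)/\text{stride}$ points in $\mathbb{R}^d$.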

Let's examine what the time delay embedding of a pure gravitational wave signal looks like:

In [26]:
# Replace Takens with sliding windows (keep both versions)

from gtda.time_series import SingleTakensEmbedding
embedding_dimension = 30
embedding_time_delay = 30
stride = 5

embedder = SingleTakensEmbedding(
    parameters_type="search", n_jobs=6, time_delay=embedding_time_delay, dimension=embedding_dimension, stride=stride
)

y_gw_embedded = embedder.fit_transform(gw_signals_50_50[0])

We can use PCA to project our high-dimensional space to three dimensions for visualisation:

In [27]:
from sklearn.decomposition import PCA
from gtda.plotting import plot_point_cloud

pca = PCA(n_components=3)
y_gw_embedded_pca = pca.fit_transform(y_gw_embedded)

plot_point_cloud(y_gw_embedded_pca)

From the plot we can see that the decaying periodic signal generated by a black hole merger emerges as a _spiral_ in the time delay embedding space! For contrast, let's compare this to one of the pure noise time series in our sample:

In [28]:
embedding_dimension = 30
embedding_time_delay = 30
stride = 5

embedder = SingleTakensEmbedding(
    parameters_type="search", n_jobs=6, time_delay=embedding_time_delay, dimension=embedding_dimension, stride=stride
)

y_noise_embedded = embedder.fit_transform(noisy_signals_50_50[background_idx])

pca = PCA(n_components=3)
y_noise_embedded_pca = pca.fit_transform(y_noise_embedded)

plot_point_cloud(y_noise_embedded_pca)

Evidently, pure noise resembles a high-dimensional ball in the time delay embedding space. Let's see if we can use persistent homology to tease apart which time series contain a gravitational wave signal versus those that don't. To do so we will adapt the strategy from the original article:

1. Generate 200-dimensional time delay embeddings of each time series
2. Use PCA to reduce the time delay embeddings to three dimensions
3. Use the Vietoris-Rips construction to calculate persistence diagrams of $H_0$ and $H_1$ generators
4. Extract feature vectors using persistence entropy
5. Train a binary classifier on the topological features
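
Step 4 condenses each persistence diagram into a single number per homology dimension. The sketch below assumes the usual definition of persistence entropy, namely the Shannon entropy of the lifetimes (death minus birth) normalised to sum to one, with a base-2 logarithm. The function name is ours, and the sketch omits the normalisation and NaN handling that ``PersistenceEntropy`` offers through its parameters.

```python
import numpy as np


def persistence_entropy(diagram):
    """Shannon entropy (base 2) of the normalised lifetimes of one diagram.

    `diagram` is an array of (birth, death) pairs for a single homology dimension.
    """
    diagram = np.asarray(diagram, dtype=float)
    lifetimes = diagram[:, 1] - diagram[:, 0]
    lifetimes = lifetimes[lifetimes > 0]   # drop points with zero persistence
    if lifetimes.size == 0:
        return 0.0
    p = lifetimes / lifetimes.sum()        # lifetimes as a probability vector
    return float(-(p * np.log2(p)).sum())


# Four features with equal lifetimes: maximal entropy for four points
uniform = persistence_entropy([[0, 1], [0, 1], [0, 1], [0, 1]])
# One long-lived feature and two short-lived ones: entropy is much lower
dominated = persistence_entropy([[0, 10], [0, 0.1], [0, 0.1]])
print(uniform, dominated)
```

A diagram dominated by one long-lived feature therefore has low entropy, while a diagram with many features of similar lifetime has high entropy.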

### Define the topological feature generation pipeline

We can do steps 1 and 2 by using the following ``giotto-tda`` tools:

- The ``TakensEmbedding`` transformer – instead of ``SingleTakensEmbedding`` – which will transform each time series in ``noisy_signals`` separately and return a collection of point clouds;
- ``CollectionTransformer``, which is a convenience "meta-estimator" for applying the same PCA to each point cloud resulting from step 1.

Using the ``Pipeline`` class from ``giotto-tda``, we can chain all operations up to and including step 4 as follows:

In [29]:
from gtda.diagrams import PersistenceEntropy, Scaler
from gtda.homology import VietorisRipsPersistence
from gtda.metaestimators import CollectionTransformer
from gtda.pipeline import Pipeline
from gtda.time_series import TakensEmbedding

embedding_dimension = 200
embedding_time_delay = 10 # Modify to improve the score
stride = 10 # Modify to improve the score

embedder = TakensEmbedding(time_delay=embedding_time_delay,
                           dimension=embedding_dimension,
                           stride=stride)

batch_pca = CollectionTransformer(PCA(n_components=3), n_jobs=-1)

persistence = VietorisRipsPersistence(homology_dimensions=[0, 1], n_jobs=-1) # Look for H_0 and H_1 homology (H_1 captures periodicity)

scaling = Scaler()

entropy = PersistenceEntropy(normalize=True, nan_fill_value=-10)


steps = [("embedder", embedder),
         ("pca", batch_pca),
         ("persistence", persistence),
         ("scaling", scaling),
         ("entropy", entropy)]
topological_transfomer = Pipeline(steps)

In [30]:
features = topological_transfomer.fit_transform(noisy_signals_50_50)

### Train and evaluate a model

For the final step, let's train a simple classifier on our topological features. As usual we create training and validation sets

In [31]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    features, labels_50_50, test_size=0.1, random_state=42
)

and then fit and evaluate our model:

In [32]:
from sklearn.metrics import accuracy_score, roc_auc_score


def print_scores(fitted_model):
    res = {
        "Accuracy on train:": accuracy_score(y_train, fitted_model.predict(X_train)),
        "ROC AUC on train:": roc_auc_score(
            y_train, fitted_model.predict_proba(X_train)[:, 1]
        ),
        "Accuracy on valid:": accuracy_score(y_valid, fitted_model.predict(X_valid)),
        "ROC AUC on valid:": roc_auc_score(
            y_valid, fitted_model.predict_proba(X_valid)[:, 1]
        ),
    }

    for k, v in res.items():
        print(k, round(v, 3))

In [33]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
print_scores(model)

Accuracy on train: 0.6
ROC AUC on train: 0.752
Accuracy on valid: 0.6
ROC AUC on valid: 0.88


## Exercise

As a simple baseline, this model is not too bad: it outperforms the deep learning baseline in the article, which typically fares little better than random on the raw data. However, the significant performance gains come from combining deep learning with persistent homology.
Write code to perform this combination.
