This assignment is the first part of two exercises, in which we will analyze LIGO data to find the gravitational wave transient caused by the coalescence of two neutron stars (GW170817).

You are given a true 2048-second segment of Hanford LIGO data, sampled at 4096 Hz (down-sampled from the original 16 kHz data). Along with this PDF, you should have:

1. `strain.npy`, readable by NumPy, containing the strain data.
2. `gw_search_functions`, containing helpful functions and constants.
3. The timestamps corresponding to the strain are not uploaded because of their size; they are instead provided through `gw_search_functions`.

---

In this notebook, you will create and use a **template bank**, attempt to find the famous GW170817 event, and quantify your confidence in the detection in the form of a false-alarm rate.

It is advised to get this code from https://github.com/JonathanMushkin/GW_search_tutorial, and use the pyproject.toml to define an environment.

Please contact jonathan.mushkin[at]weizmann.ac.il for any help, question or comment.



## 0 Load data, evaluate ASD and whitening filter

In [None]:
import time
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal, stats
import gw_search_functions

filename = "strain.npy"
event_name = "GW170817"
detector_name = "H"
fs = 2**12  # Hz

strain = np.load(filename)
times = np.arange(len(strain)) / fs
dt = times[1] - times[0]
freqs = np.fft.rfftfreq(len(strain), d=dt)
df = freqs[1] - freqs[0]

tukey_window = signal.windows.tukey(M=len(strain), alpha=0.1)
strain_f = np.fft.rfft(strain * tukey_window)

seg_duration = 64
overlap_duration = 32
nperseg = int(seg_duration * fs)
noverlap = int(overlap_duration * fs)
welch_dict = {
    "x": strain,
    "fs": fs,
    "nperseg": nperseg,
    "noverlap": noverlap,
    "average": "median",
    "scaling": "density",
}
psd_freqs, psd_estimation = signal.welch(**welch_dict)
asd_estimation = psd_estimation ** (1 / 2)
fmin = 20
asd = np.interp(freqs, psd_freqs, asd_estimation)

# Create high-pass filter
# make it go like sin-squared from 0 to 1 over (fmin, fmin+1Hz) interval
highpass_filter = np.zeros(len(freqs))
i1, i2 = np.searchsorted(freqs, (fmin, fmin + 1))
highpass_filter[i1:i2] = np.sin(np.linspace(0, np.pi / 2, i2 - i1)) ** 2
highpass_filter[i2:] = 1.0

# whitening filter is 1/asd(f) * high-pass filter
whitening_filter_raw = highpass_filter / np.interp(
    x=freqs, xp=psd_freqs, fp=asd_estimation
)

# To avoid ripples in Fourier domain, we apply a windowing in time domain

padded_tukey_window = np.fft.fftshift(
    np.pad(
        signal.windows.tukey(M=nperseg, alpha=0.1),
        pad_width=(len(strain) - nperseg) // 2,
        constant_values=0,
    )
)
# transform to time domain, apply the window, and return to frequency domain
whitening_filter = (
    highpass_filter
    * np.fft.rfft(padded_tukey_window * np.fft.irfft(whitening_filter_raw))
).real * np.sqrt(2 * dt)

wht_strain_f = strain_f * whitening_filter
wht_strain_t = np.fft.irfft(wht_strain_f)

i1 = np.searchsorted(freqs, fmin)




Now you know how to conduct a search with a single template. We now go on to prepare a bank of templates.

The goal is to make the template bank "dense" enough that the mismatch between a true signal in the data and the closest template is never too large, while keeping it "sparse" enough not to be wasteful. For example, if two parameter sets give the exact same waveform, it is a waste to include both.


## 1.  Theoretical questions:

Assume you have an error budget of 5\% in the $\text{SNR}^2$. 

1. You conduct a search with template $h_1$. The data is $d = n + A h_1$, with amplitude $A$. Find the $\ln\mathcal{L}={\rm SNR}^2/2$. It is useful to use the inner-product notation:
\begin{equation}
\langle a \mid b \rangle = \sum_{f = 0}^{f_{\rm max}} \frac{a(f)b^{\ast}(f)}{S_n(f)}
\end{equation} 

2. Assume data $d = h_1 + n$. You conduct a search with $h_2\neq h_1$. Relate the $\text{SNR}^2$ loss to the overlap between the waveforms, 

\begin{equation}
\mathcal{O}_{ij} = \frac{\vert\langle h_i \mid h_j \rangle\vert}{\sqrt{\langle h_i \mid h_i \rangle \langle h_j \mid h_j \rangle}}
\end{equation}
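
As a numerical companion to the two definitions above, here is a minimal sketch of the inner product and the overlap on toy data. The frequency grid, the flat noise spectrum, and the two phase laws are illustrative assumptions, not LIGO data, and the helper names `inner` and `overlap` are hypothetical.

```python
import numpy as np


def inner(a, b, psd):
    """Noise-weighted inner product <a|b> = sum_f a(f) b*(f) / S_n(f)."""
    return np.sum(a * np.conj(b) / psd)


def overlap(h_i, h_j, psd):
    """Normalized overlap O_ij = |<h_i|h_j>| / sqrt(<h_i|h_i><h_j|h_j>)."""
    norm = np.sqrt(inner(h_i, h_i, psd).real * inner(h_j, h_j, psd).real)
    return np.abs(inner(h_i, h_j, psd)) / norm


# Toy setup: every number below is an illustrative assumption
f = np.linspace(20.0, 500.0, 2000)
psd = np.ones_like(f)  # flat (white) noise spectrum
amp = f ** (-7 / 6)  # inspiral-like amplitude
h1 = amp * np.exp(1j * 1.000e4 * f ** (-5 / 3))
h2 = amp * np.exp(1j * 1.001e4 * f ** (-5 / 3))  # slightly different phase

o_11 = overlap(h1, h1, psd)
o_12 = overlap(h1, h2, psd)
print(f"O_11 = {o_11:.6f}, O_12 = {o_12:.6f}")
```

This sketch does not maximize over a relative time shift; in the actual search that maximization is done by the correlation over time.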


 ## Template bank
 
One option to create a bank is to draw samples of the template parameters and create all the templates. Then, evaluate the overlaps $\mathcal{O}_{ij}$ between all template pairs.

Time and phase shifts are handled in the search itself. But the cost of this process is huge (it scales like the bank size squared).

If instead we find a good basis to describe the banks, we can evaluate the match / mismatch between templates based on their new found coordinates.
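
To make the quadratic cost concrete, the sketch below builds the full overlap matrix for a small toy bank. The bank size, frequency grid, and random phase model are illustrative assumptions; the point is only that an $N$-template bank requires an $N \times N$ matrix, each entry of which is a sum over all frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_templates, n_freqs = 50, 400  # illustrative sizes
f = np.linspace(20.0, 500.0, n_freqs)

# Whitened amplitude with unit norm, so each template is normalized
w = f ** (-7 / 6)
w /= np.sqrt(np.sum(w**2))

# Toy phases: a random multiple of a single power law per template
scale = rng.normal(size=(n_templates, 1))
phases = scale * (f / 100.0) ** (-5 / 3)
h = w * np.exp(1j * phases)  # shape (n_templates, n_freqs)

# All pairwise overlaps at once: cost grows like n_templates**2 * n_freqs
overlaps = np.abs(h @ h.conj().T)
n_pairs = n_templates * (n_templates - 1) // 2
print(f"{n_pairs} distinct pairs for {n_templates} templates")
```

With coordinates in an orthonormal phase basis, each pairwise comparison reduces to a distance between short coordinate vectors instead of a sum over the whole frequency grid.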

## Phase templates

The main idea (explained in https://arxiv.org/abs/1904.01683) is that the $h_+$ waveform can be represented by an amplitude and phase. For a small enough region of parameter space, the amplitudes are the same, and the phases are mostly similar. It is therefore useful to write the phase $\Psi(f)$ as a common phase evolution $\bar{\Psi}(f)$  (mean over templates per frequency) and a deviation from it. We will write the deviations as a linear combination of orthonormal phase functions:

$$
\Psi_i(f) = \bar{\Psi}(f) + \sum_\alpha c_\alpha \psi_\alpha(f)
$$

We assume the phases are linear-free (standardized with respect to global phase and time shifts), as discussed in the previous notebook. The conversion is available as
`gw_search_functions.phases_to_linear_free_phases`.

Orthonormality is defined using the whitened-amplitude weights:

$$
\sum_f \frac{A^2(f)}{S_n(f)} \psi_i(f)\psi_j(f) = \delta_{ij}
$$

Given two normalized templates $h_i(f)$ and $h_j(f)$, their complex overlap, whose absolute value is the match $\mathcal{O}_{ij}$, is:

$$
\langle h_i| h_j\rangle = \sum_f \frac{A_i(f) A_j(f)}{S_n(f)} e^{i(\Psi_i(f)-\Psi_j(f))}
$$

Assuming a common amplitude $A_i = A_j = A$ and expanding to second order in $\Delta \Psi = \Psi_i - \Psi_j$:

$$
\langle h_i | h_j \rangle \approx \sum_f\frac{A^2(f)}{S_n(f)} \left(1 + i \Delta \Psi(f) - \frac{1}{2}(\Delta\Psi(f))^2 \right)
$$

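The imaginary part will not matter for the SNR calculation, so we ignore it (return to the Gaussian likelihood to understand why). Substituting $\Delta\Psi(f) = \sum_\alpha \Delta c_\alpha \psi_\alpha(f)$ and using the orthonormality of the $\psi_\alpha$, together with the normalization $\sum_f A^2(f)/S_n(f) = 1$, the expansion becomes: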

$$
\langle h_i | h_j \rangle \approx 1 - \frac{1}{2} \sum_\alpha (\Delta c_\alpha)^2
$$
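
The quadratic approximation can be checked numerically. The sketch below is a toy check, not part of the assignment code: it builds two phase functions that are orthonormal under the weights $A^2/S_n$ (here a unit-norm $f^{-7/6}$ weight with flat noise, an assumption), picks an arbitrary coordinate difference, and compares the real part of the exact overlap with $1 - \frac{1}{2}\sum_\alpha (\Delta c_\alpha)^2$.

```python
import numpy as np

f = np.linspace(20.0, 500.0, 4000)
w = f ** (-7 / 6)
w /= np.sqrt(np.sum(w**2))  # plays the role of A / sqrt(S_n), with sum(w**2) = 1

# Two weighted-orthonormal phase functions, built by QR on weighted polynomials
x = (f - f.mean()) / (f.max() - f.min())
raw = np.stack([x**2, x**3], axis=1) * w[:, None]
q, _ = np.linalg.qr(raw)  # columns of q are orthonormal
psi = (q / w[:, None]).T  # psi[alpha] satisfies sum_f w^2 psi_a psi_b = delta_ab

delta_c = np.array([0.1, -0.05])  # arbitrary small coordinate difference
delta_psi = delta_c @ psi

exact = np.sum(w**2 * np.exp(1j * delta_psi)).real
approx = 1.0 - 0.5 * np.sum(delta_c**2)
print(f"exact = {exact:.6f}, quadratic approximation = {approx:.6f}")
```

The difference between the two numbers is of fourth order in $\Delta\Psi$, so the agreement is expected to degrade as the coordinate distance grows.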


## SVD of phase-basis
We can create an orthonormal basis by taking the sample phases, subtracting the common phase evolution, and performing SVD on a matrix $X$ of shape $N_{\rm samples} \times N_{\rm frequencies}$:

$$
X_{ij} = \left(\Psi_i(f_j) - \bar{\Psi}(f_j)\right) \cdot \frac{A(f_j)}{\sqrt{S_n(f_j)}}
$$

To obtain the desired $\psi_\alpha(f)$, divide the resulting right-singular vectors by the weights. The singular values from the SVD indicate how many components are needed to represent the waveform set accurately.


**Note:** The phase evolution is smooth. You can downsample the frequency grid starting at 20 Hz with steps of $2^{-4}$ Hz to reduce computational cost. Performing SVD on the full-resolution grid may be too demanding for standard hardware.
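
The recipe above can be sketched on synthetic phases. In this toy example the phases are random mixtures of two power laws (an assumption made only so that the expected rank is known), the noise is flat, and all variable names are hypothetical. It is not the solution on the real mass samples.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples = 32
f = np.arange(20.0, 500.0, 1 / 16)  # coarse grid with 2^-4 Hz steps

w = f ** (-7 / 6)
w /= np.sqrt(np.sum(w**2))  # whitened-amplitude weights, unit norm

# Toy phases: random mixtures of two power laws
a = rng.normal(size=(n_samples, 1))
b = rng.normal(size=(n_samples, 1))
phases = a * (f / 100.0) ** (-5 / 3) + b * (f / 100.0) ** (-1.0)

mean_phase = phases.mean(axis=0)  # common phase evolution
X = (phases - mean_phase) * w  # weighted deviations

# Thin SVD: avoids building an (n_freqs x n_freqs) matrix
u, s, vh = np.linalg.svd(X, full_matrices=False)

ndim = int(np.sum(s > 1e-8 * s[0]))  # keep only non-negligible components
basis = vh[:ndim] / w  # psi_alpha(f), orthonormal under w^2
coords = u[:, :ndim] * s[:ndim]  # c_alpha for each sample

reconstructed = mean_phase + coords @ basis
print(f"kept {ndim} components, max residual "
      f"{np.abs(reconstructed - phases).max():.2e}")
```

Because the toy phases span only two functions, two components reconstruct them to numerical precision; for real waveforms the singular values decay gradually and the cut is a judgment call.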

In [None]:
n_samples = 2**6
m1, m2 = gw_search_functions.draw_mass_samples(n_samples)


plt.scatter(m1, m2)
plt.xlabel(r"$m_1\; (M_\odot)$")
plt.ylabel(r"$m_2\; (M_\odot)$")


In [None]:
fmin = 20
freq_jump = 1 / 2**4
step_jump = int(round(freq_jump / df))  # number of full-resolution bins per coarse step
fslice = slice(np.searchsorted(freqs, (fmin + 0.5)), len(freqs), step_jump)
f_sparse = freqs[fslice]  # sparse frequency grid

amp = f_sparse ** (-7 / 6)
wht_amp = amp * whitening_filter[fslice]
wht_amp = wht_amp / np.sqrt(np.sum(wht_amp**2))  # renormalize

Psi = gw_search_functions.masses_to_phases(m1, m2, f_sparse)
Psi_linear_free = gw_search_functions.phases_to_linear_free_phases(
    Psi, f_sparse, wht_amp
)

1. Perform the SVD, using the proper weights and excluding the common phase evolution term.

2. Plot the singular values of the different components on a logarithmic y-axis. How many components will you take? Why?

3. Plot all (or some) of the phases (without the common phase evolution) against frequency. Reconstruct the phases using the smaller number of components. Plot the residuals against frequency. Are you content with the phase differences? 

In [None]:
common_phase_evolution = 0 # wrong
phases_without_common_evolution = Psi_linear_free # wrong
svd_phase = phases_without_common_evolution
svd_weights = 1 # wrong


In [None]:
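# could take up to 1-5 minutes.
# full_matrices=False returns only the first min(n_samples, n_freqs) rows of v,
# which avoids allocating an (n_freqs x n_freqs) matrix
u, d, v = np.linalg.svd(svd_phase * svd_weights, full_matrices=False)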

In [None]:
fig, ax = plt.subplots()
ax.semilogy(d, ".")
ax.grid()

In [None]:
# cut the number of components to the approximation level you want
ndim = len(d) # wrong
u = u[:, :ndim]
d = d[:ndim]
v = v[:ndim, :]

# create the phase basis vectors (without weights) from the SVD components,
# and the new set of coordinates
coordinates = u # wrong
phase_basis_sparse_freqs = v # wrong

In [None]:
fig, axs = plt.subplots(ncols=1, nrows=2)
_ = axs[0].plot(f_sparse, 
         phases_without_common_evolution.T)

axs[0].set_ylabel(r"Deviation from $\bar{\Psi}$")
axs[0].set_xlabel("Frequency (Hz)")
residual = phases_without_common_evolution - coordinates @ phase_basis_sparse_freqs

_ = axs[1].plot(f_sparse, residual.T)
axs[1].set_ylabel("Residuals")
axs[1].set_xlabel("Frequency (Hz)")

fig.tight_layout()

Interpolate the phase-basis to full frequency resolution, so the templates can be correlated with the data. 

In [None]:
phase_basis = np.array(
    [
        np.interp(x=freqs, xp=freqs[fslice], fp=phase_base, left=0)
        for phase_base in phase_basis_sparse_freqs
    ]
)

##  2

Let us check our underlying assumption about the coordinates & overlap relation. 

Create a reference template with some coordinates $\mathbf{c}=\{c_{\alpha=1},c_{\alpha=2},\ldots\}$.

Create more templates at increasingly greater coordinate distances from it, up to a distance of about 1.

Alternatively, use the already existing phases and their coordinates.

**Plot the overlap between the reference template and the rest, evaluated using the full inner product, together with the overlap predicted from the coordinate distance.**

Do the results agree? Do they fully agree? Do you expect them to fully agree?



In [None]:
# do the calculation

In [None]:
# plot the results

## 3 Bank Creation with Random Placement

Following the option suggested by Barak in the lecture, we will attempt the "brute force" random placement approach. 

It is possible to try other methods. 

1. For the random placement method, draw $2^{13}$ mass samples. Create the phase for each, and find the coordinates of each. Use the coordinates / vectors found in the SVD from the last section.

2. Using the coordinates, select a subset such that the distance between any two samples is not smaller than 0.1. This can be done iteratively: look at a template, compare it to the already accepted templates, and accept or reject it. Repeat for the next template, and so on. You can implement it yourself, or use the function `gw_search_functions.select_points_without_clutter`.

3. On the same plot, create a scatter plot of the coordinates of the $2^{13}$ samples and of the selected subset.

4. On the plot, write down the size of subset. This subset defines the search bank.
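
The iterative accept/reject step in item 2 can be sketched as follows. This is a minimal stand-alone version on uniformly random 2D points, under the assumption of a plain Euclidean distance; `greedy_select` is a hypothetical name and is not claimed to match the internals of `gw_search_functions.select_points_without_clutter`.

```python
import numpy as np


def greedy_select(points, min_distance):
    """Go through points in order; keep a point only if it is at least
    min_distance away from every point kept so far. Returns kept indices."""
    kept = [0]  # the first point is always accepted
    for idx in range(1, len(points)):
        dist = np.linalg.norm(points[kept] - points[idx], axis=1)
        if np.all(dist >= min_distance):
            kept.append(idx)
    return np.array(kept)


rng = np.random.default_rng(3)
pts = rng.uniform(0.0, 1.0, size=(2000, 2))  # toy "coordinates"
min_distance = 0.1

idx = greedy_select(pts, min_distance)
bank = pts[idx]
print(f"kept {len(bank)} of {len(pts)} points")
```

By construction every rejected point lies within `min_distance` of some accepted point, so the subset covers the sampled region at that resolution. The result depends on the order in which the points are visited.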

In [None]:
m1, m2 = gw_search_functions.draw_mass_samples(2**13)

In [None]:
phases_on_coarse_freqs = gw_search_functions.masses_to_phases(m1, m2, f_sparse)
linear_free_phases = gw_search_functions.phases_to_linear_free_phases(
    phases_on_coarse_freqs, f_sparse, wht_amp
)
phases_without_common_evolution = linear_free_phases - common_phase_evolution

In [None]:
coordinates = (
    svd_weights**2 * phases_without_common_evolution
) @ phase_basis_sparse_freqs.T

distance_scale = 1 # wrong

bank_coordinates, bank_indices = (
    gw_search_functions.select_points_without_clutter(
        coordinates, distance_scale
    )
)


In [None]:
plt.scatter(
    *coordinates.T,
    s=1,
    alpha=0.5,
    c="r",
    label=f"full set ({len(coordinates)} points)",
)
plt.scatter(
    *bank_coordinates.T,
    s=5,
    c="k",
    label=f"subset ({len(bank_coordinates)} points)",
)
print(bank_coordinates.shape)
plt.legend(bbox_to_anchor=(1.01, 0.99))

Create whitened-amplitudes for the templates. Make sure they are properly normalized for the correlation functions, which isn't necessarily the same as for the inner product used in previous sections.

In the previous exercise we defined the correlation function as 

`correlation = np.fft.irfft( data_wht * template_wht.conj())`

You can use the `gw_search_functions.correlate`.

You can either implement your own SNR calculation code, or use `snr2_timeseries` and `complex_overlap_timeseries` in `gw_search_functions`.
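
To illustrate what the correlation expression above computes, here is a toy example with a short random "template" injected into white noise at a known circular shift. The signal, noise level, and shift are illustrative assumptions; no whitening or normalization is applied, so this shows only the FFT-based correlation itself.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4096

# Short toy template starting at sample 0
template_t = np.zeros(n)
template_t[:64] = rng.normal(size=64)

# Data: the template shifted by a known number of samples, plus weak noise
true_shift = 1000
data_t = 3.0 * np.roll(template_t, true_shift) + 0.1 * rng.normal(size=n)

data_f = np.fft.rfft(data_t)
template_f = np.fft.rfft(template_t)

# Circular cross-correlation for all time shifts in a single inverse FFT
correlation = np.fft.irfft(data_f * template_f.conj(), n=n)
recovered_shift = int(np.argmax(correlation))
print(f"injected shift = {true_shift}, recovered shift = {recovered_shift}")
```

The index of the correlation peak is the time shift at which the template best matches the data, which is why one inverse FFT per template is enough to scan all arrival times.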

In [None]:
amp = np.zeros_like(freqs)
amp[i1:] = freqs[i1:] ** (-7 / 6)
wht_amp = amp * whitening_filter
normalization = 1 # wrong
amp /= normalization
wht_amp /= normalization

## 4 The Search

A search actually includes quite a lot of by-products. In the following we will try to conduct a search while also not keeping too much un-wanted information. 

**Before using the entire bank, try a small subset and see that the results make sense. The entire search could take several minutes, depending on hardware.**

1. Run the search for each template in the bank individually (including glitch removal).

2. For each interval of 0.1 seconds, record which template gave the maximal SNR, and what that SNR was.

3. Plot the time series of the maximal $\text{SNR}^2$ per 0.1 seconds.

4. Plot a histogram of the maximal values per 0.1 seconds.
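
The per-interval maximization in item 2 can be sketched with a reshape. `max_per_block` below is a hypothetical helper written for illustration; the notebook itself uses `gw_search_functions.max_argmax_over_n_samples`, whose exact return convention is not assumed here.

```python
import numpy as np


def max_per_block(x, block):
    """Maximum of x within consecutive blocks of `block` samples, and the
    index (into x) where each maximum occurs. Trailing samples that do not
    fill a whole block are dropped."""
    n_blocks = len(x) // block
    blocks = x[: n_blocks * block].reshape(n_blocks, block)
    local = blocks.argmax(axis=1)
    maxs = blocks[np.arange(n_blocks), local]
    return maxs, local + np.arange(n_blocks) * block


fs = 4096
block = int(0.1 * fs)  # 409 samples per 0.1 s interval

# Toy SNR^2 time series with two known peaks
x = np.zeros(5 * block + 7)
x[3] = 2.0
x[block + 10] = 5.0

maxs, argmaxs = max_per_block(x, block)
print(maxs, argmaxs)
```

Keeping only one maximum and one index per interval reduces the stored output per template from the full time series to a few numbers per second.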

In [None]:

# this is how we found the frequency-split to h_low and h_high

frac_snr2 = np.cumsum(np.abs(wht_amp) ** 2)
frac_snr2 /= frac_snr2[-1]
j = np.searchsorted(frac_snr2, 0.5)

In [None]:
indices_lists = []
snr2_lists = []
glitch_mask_list = []
snr2_lists_raw = []
indices_lists_raw = []

common_phase_evolution_high_res = np.interp(
    x=freqs, xp=f_sparse, fp=common_phase_evolution
)

f_sampling = 1 / dt
t_start = time.time()
segment_for_maximization = int(0.1 * fs)
glitch_test_threshold = stats.chi2(df=2).isf(0.01)
min_snr2_for_glitch_removal = 10

for template_index, template_coordinate in tqdm(
    enumerate(bank_coordinates),
    total=len(bank_coordinates),
    desc="Conducting a search",
):
    phase = common_phase_evolution_high_res + template_coordinate @ phase_basis

    # conduct a search without accounting for glitches
    wht_h = wht_amp * np.exp(1j * phase)
    snr2 = ()

    maxs, argmaxs = gw_search_functions.max_argmax_over_n_samples(
        snr2, segment_for_maximization
    )
    indices_lists_raw.append(argmaxs)
    snr2_lists_raw.append(maxs)

    # conduct the search while accounting for glitches
    wht_h_low, wht_h_high = np.zeros((2, len(freqs)), complex)
    wht_h_low[:j] = wht_h[:j]
    wht_h_high[j:] = wht_h[j:]
    z_low = ()
    z_high = ()

    glitch_test_statistic = np.abs(z_low - z_high) ** 2
    glitch_mask = ()
    glitch_mask_list.append(glitch_mask)

    snr2_after_glitch_test = snr2 * ~glitch_mask
    maxs, argmaxs = gw_search_functions.max_argmax_over_n_samples(
        snr2_after_glitch_test, segment_for_maximization
    )
    indices_lists.append(argmaxs)  # glitch-removed results go to the non-raw list
    snr2_lists.append(maxs)

snr2_per_template = np.array(snr2_lists)
time_indices_per_template = np.array(indices_lists)
snr2_per_template_raw = np.array(snr2_lists_raw)
time_indices_per_template_raw = np.array(indices_lists_raw)


print("Done!")

In [None]:
binned_times = np.linspace(0, times[-1], snr2_per_template.shape[1])


In [None]:
fig, axs = plt.subplots(ncols=2, nrows=1, sharex=True, sharey=True)
axs[0].plot(binned_times, np.zeros_like(binned_times))
axs[0].set_xlabel("time (s)")
axs[0].set_ylabel(r"Bestfit ${\rm SNR}^2$ without glitch removal")

axs[1].plot(binned_times, np.zeros_like(binned_times))
axs[1].set_xlabel("time (s)")
axs[1].set_ylabel(r"Bestfit ${\rm SNR}^2$ with glitch removal")

In [None]:
fig, ax = plt.subplots()
hist_kwargs = {"histtype": "step", "density": True, "log": True, "bins": 200}
counts, edges, patches = ax.hist(
    np.zeros_like(binned_times),
    **hist_kwargs,
    alpha=0.5,
    label="With glitch removal",
)

hist_kwargs = {"histtype": "step", "density": True, "log": True, "bins": 200}
counts, edges, patches = ax.hist(
    np.zeros_like(binned_times),
    label="Before glitch removal",
    **hist_kwargs,
    ls="--",
)

ax.set_xlabel(r"${\rm SNR}^2$")
ax.set_ylabel("counts (normalized)")
leg = ax.legend()


## 5.
If you detected an event, **report its time, the masses of the template, and an estimate or an upper bound of the false-alarm rate for such an SNR**. Consider the number of templates you used and the fact that waveforms have a typical auto-correlation length of 1 ms.

In [None]:
# compare with https://arxiv.org/pdf/1710.05832 Table 1
best_template_index, best_timestamp_index = np.unravel_index(
    snr2_per_template.argmax(), snr2_per_template.shape
)

bestfit_m1 = 1 # wrong
bestfit_m2 = 1 # wrong
bestfit_mchirp = gw_search_functions.m1m2_to_mchirp(bestfit_m1, bestfit_m2)
bestfit_snr2 = snr2_per_template.max()
# convert the index of the best 0.1 s bin to the start time of that bin (seconds)
bestfit_time = best_timestamp_index * segment_for_maximization * dt

print(f"Maximal SNR^2 found : {bestfit_snr2:.5g} at time {bestfit_time:.4f}")
print(
    f"Template of masses ({bestfit_m1:.3g},{bestfit_m2:.3g}), or chirp-mass {bestfit_mchirp:.5g} (solar masses)"
)

## 5 Post analysis

1. What is the time of detection, masses of best-fit template, and chirp-mass of it?

2. What is the false-alarm rate? Consider the number of attempts (trials), based on the number of templates, the duration of the data, and the auto-correlation length of the templates (roughly 1 ms).
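
One way to turn the trial count into a number is sketched below. It rests on two assumptions that should be checked against your own histogram: that in pure Gaussian noise the phase-maximized $\text{SNR}^2$ of a single trial follows a chi-squared distribution with 2 degrees of freedom, and that summing the single-trial probabilities over all trials gives an upper bound (a union bound, valid even when neighboring templates are correlated). The bank size and threshold are placeholders, not results.

```python
import numpy as np

n_templates = 500  # placeholder bank size, replace with your own
duration = 2048.0  # seconds of data
autocorr_length = 1e-3  # seconds, rough template auto-correlation length
snr2_threshold = 60.0  # placeholder SNR^2 of the candidate

# Number of roughly independent trials per second of data
trials_per_second = n_templates / autocorr_length

# Survival function of chi-squared with 2 degrees of freedom: exp(-x / 2)
p_single_trial = np.exp(-snr2_threshold / 2)

# Upper bound on the false-alarm rate, and on the expected number of
# false alarms in the analyzed segment
far_per_second = trials_per_second * p_single_trial
expected_false_alarms = far_per_second * duration
print(f"FAR <= {far_per_second:.3g} per second, "
      f"expected false alarms in segment <= {expected_false_alarms:.3g}")
```

Real detector noise has non-Gaussian tails, so it is worth comparing this Gaussian estimate with the empirical tail of the histogram of maximal $\text{SNR}^2$ values.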



## 6. Look at the spectrum

Create a spectrogram (using e.g. `matplotlib.pyplot.specgram`), localized in time and frequency around the event you found.
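
The cropping logic can be tried first on synthetic data. The sketch below uses `scipy.signal.spectrogram` instead of `matplotlib.pyplot.specgram` so that it returns plain arrays, and injects a toy linear chirp into white noise; the chirp parameters, the event time, and the window sizes are all illustrative assumptions.

```python
import numpy as np
from scipy import signal

fs = 4096
t = np.arange(0, 8.0, 1 / fs)

# White noise plus a toy chirp sweeping 50 -> 400 Hz between t = 3 s and t = 5 s
rng = np.random.default_rng(5)
x = rng.normal(size=len(t))
on = (t >= 3.0) & (t < 5.0)
x[on] += 2.0 * signal.chirp(t[on] - 3.0, f0=50.0, t1=2.0, f1=400.0)

f_spec, t_spec, sxx = signal.spectrogram(
    x, fs=fs, nperseg=int(0.5 * fs), noverlap=int(0.25 * fs)
)

# Crop to a window around the (assumed) event time and the band of interest
t_event = 4.0
t_sel = (t_spec > t_event - 2.0) & (t_spec < t_event + 2.0)
f_sel = (f_spec >= 20.0) & (f_spec <= 1000.0)
zoom = sxx[np.ix_(f_sel, t_sel)]
print(zoom.shape)
```

The array `zoom` can then be displayed with, for example, `plt.pcolormesh(t_spec[t_sel], f_spec[f_sel], zoom)`, with the color range set from the noise-only part of the spectrogram.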

In [None]:
# create the spectrogram in 2 steps, so the dynamic range of the second spectrogram can be calibrated using the first one
specgram_kwargs = {
    "x": np.fft.irfft(strain_f * whitening_filter),
    "NFFT": int(fs * 0.5),
    "noverlap": int(fs * 0.25),
    "scale": "linear",
    "vmin": 0,
    "vmax": 25,
    "Fs": fs,
}

o = plt.specgram(**specgram_kwargs)

In [None]:
specgram_kwargs = {}  # fill it with relevant parameters

o = plt.specgram(**specgram_kwargs)
tmin = 1 # probably wrong
tmax = 2 # probably wrong
fmin = 20 # probably ok
fmax = 1000 # probably ok
plt.xlim(tmin, tmax)
plt.ylim(fmin, fmax)