# Kaggle g2net-dataset preprocessing with whiten, bandpass and notch filters

This notebook describes one [approach](https://doi.org/10.1088/1361-6382/ab685e)
to processing [gravitational-wave signals](https://www.kaggle.com/c/g2net-gravitational-wave-detection/overview).

First, install the gwnet package, a simple tool for processing the g2net dataset.
For more detailed research, use [GWpy](https://github.com/gwpy/gwpy) or
[PyCBC](https://pycbc.org/), which provide more procedures for GW processing.

In [None]:
!pip install --upgrade git+https://github.com/Sunnesoft/g2net-challenge.git

from gwnet import GwTimeseries

Mount the Google Drive folder that will store the dataset processed by the whitening, bandpass and notch filters:

In [None]:
import os
from google.colab import drive

BASE_PATH = '/content/'
G2NET_PATH = os.path.join(BASE_PATH, 'g2net/')
DRIVE_PATH = os.path.join(BASE_PATH, 'drive')
drive.mount(DRIVE_PATH)

To use the Kaggle API, sign up for a Kaggle account at
[kaggle](https://www.kaggle.com). Then go to the 'Account' tab of your user
profile and select 'Create API Token'. This triggers the download of
kaggle.json, a file containing your API credentials. Upload this
file to the $BASE_PATH folder:

In [None]:
from google.colab import files
files.upload()

Install the Kaggle CLI and copy kaggle.json to the target folder:

In [None]:
!pip install kaggle==1.5.12
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Download and unzip the 'g2net-gravitational-wave-detection' dataset:

In [None]:
!kaggle competitions download -c g2net-gravitational-wave-detection -p $BASE_PATH
!unzip -qq /content/g2net-gravitational-wave-detection.zip -d $G2NET_PATH
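
Before processing, it can help to confirm what a single sample looks like. The sketch below assumes the competition layout, where each sample ID is stored as a .npy file nested under directories named after the first three characters of the ID, and each file holds one row per detector. The helpers `sample_path` and `describe_sample` and the sample ID are hypothetical, and a synthetic file stands in for real data so the cell runs without the download.

```python
import os
import tempfile

import numpy as np

SAMPLE_RATE = 2048  # Hz, matches the configuration used in this notebook


def sample_path(split_dir, sample_id):
    """Assumed layout: <split_dir>/<id[0]>/<id[1]>/<id[2]>/<id>.npy"""
    return os.path.join(split_dir, sample_id[0], sample_id[1], sample_id[2],
                        sample_id + '.npy')


def describe_sample(path, sample_rate=SAMPLE_RATE):
    """Return the array shape and the duration implied by the sample rate."""
    data = np.load(path)
    n_detectors, n_samples = data.shape
    return {'detectors': n_detectors,
            'samples': n_samples,
            'duration_s': n_samples / sample_rate}


# Demonstration on a synthetic file so the cell runs without the dataset.
with tempfile.TemporaryDirectory() as tmp:
    path = sample_path(tmp, 'abc1234567')  # hypothetical sample ID
    os.makedirs(os.path.dirname(path), exist_ok=True)
    np.save(path, np.zeros((3, 4096)))     # assumed shape: 3 detectors x 4096 samples
    info = describe_sample(path)

print(info)
```

On the real dataset, the same call would be `describe_sample(sample_path(os.path.join(G2NET_PATH, 'train'), some_id))`, with `some_id` taken from the training labels.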

Set up the configuration:

In [None]:
from tqdm import tqdm

SAMPLE_RATE = 2048
FREQ_RANGE = (50, 250)
WINDOWED_FILTER = ('tukey', 0.1)
TRAIN_PATH = os.path.join(G2NET_PATH, 'train/')
OUTPUT_PATH = os.path.join(BASE_PATH, 'filtered/train/')
ZIP_GD_OUT_FILE = os.path.join(DRIVE_PATH, 'MyDrive/g2net/filtered_train.zip')
OUTPUT_ZIP_FILE = os.path.join(BASE_PATH, 'filtered_train.zip')

!mkdir -p $OUTPUT_PATH
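
To see what the `('tukey', 0.1)` setting implies, the following sketch builds the window with SciPy rather than gwnet. It assumes 2-second segments (4096 samples at 2048 Hz) and that gwnet interprets the tuple the same way `scipy.signal.get_window` does. Neither is confirmed by the package itself.

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 2048                 # Hz
WINDOWED_FILTER = ('tukey', 0.1)   # (window name, taper fraction alpha)
n_samples = 2 * SAMPLE_RATE        # assumption: 2-second segments

# Symmetric Tukey window: cosine tapers at both ends, flat (== 1) in between.
window = signal.get_window(WINDOWED_FILTER, n_samples, fftbins=False)

flat_fraction = np.mean(window == 1.0)
print(f'window length: {len(window)}')
print(f'flat fraction: {flat_fraction:.3f}')
print(f'edge values:   {window[0]:.1f}, {window[-1]:.1f}')
```

With alpha = 0.1, roughly 5% of the samples at each end are tapered towards zero, which suppresses edge discontinuities before the Fourier-domain whitening step while leaving most of the segment untouched.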

Process dataset:

In [None]:
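createdf_count = 0          # files written so far
createdf_count_view = 0     # next threshold at which progress is reported
createdf_count_step = 1000  # report roughly every 1000 files

# `fnames` avoids shadowing the `files` module imported from google.colab.
for root, dirs, fnames in tqdm(os.walk(TRAIN_PATH)):
    # Mirror the input directory tree under OUTPUT_PATH.
    rel_path = os.path.relpath(root, TRAIN_PATH)
    out_path = os.path.join(OUTPUT_PATH, rel_path)
    os.makedirs(out_path, exist_ok=True)

    for fname in fnames:
        in_fn = os.path.join(root, fname)
        out_fn = os.path.join(out_path, os.path.splitext(fname)[0] + '.npy')

        # Skip files that already exist so an interrupted run can be resumed.
        if os.path.exists(out_fn):
            continue

        tss = GwTimeseries.load(in_fn, SAMPLE_RATE)

        for ts in tss:
            # Estimate the PSD first, then window, whiten and filter the series.
            f, Pxx = ts.psd(fftlength=ts.duration, nperseg=2048, overlap=0.75, window=('tukey', 0.5))
            ts.apply_window(window=WINDOWED_FILTER)
            ts.whiten(psd_val=(f, Pxx))
            ts.filter(frange=FREQ_RANGE,
                      psd_val=(f, Pxx),
                      outlier_threshold=3.0)

        GwTimeseries.save(out_fn, tss)
        createdf_count += 1

    # Report progress once per directory after crossing the next threshold.
    if createdf_count > createdf_count_view:
        print(f'{createdf_count} files processed.')
        createdf_count_view += createdf_count_step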
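
For readers without gwnet, the steps in the loop above (PSD estimate, Tukey window, whitening, bandpass, optional notch) can be approximated with NumPy and SciPy alone. This is a minimal sketch, not gwnet's implementation. The function `whiten_and_bandpass`, the Butterworth order, the Welch segment length, the ASD floor and the explicit `notch_freqs` argument are all assumptions chosen for illustration, and the amplitude scale of the output is left unnormalised.

```python
import numpy as np
from scipy import signal

SAMPLE_RATE = 2048      # Hz
FREQ_RANGE = (50, 250)  # Hz, pass band


def whiten_and_bandpass(x, fs=SAMPLE_RATE, frange=FREQ_RANGE,
                        alpha=0.1, notch_freqs=()):
    """Window, whiten, band-pass and optionally notch a 1-D series."""
    n = len(x)
    # 1. Welch estimate of the one-sided PSD of the raw series.
    freqs, pxx = signal.welch(x, fs=fs, nperseg=fs // 2,
                              window=('tukey', 0.5))
    # 2. Taper the edges to limit spectral leakage.
    xw = x * signal.windows.tukey(n, alpha)
    # 3. Whiten: divide the spectrum by the interpolated amplitude
    #    spectral density. The floor (an assumption) avoids dividing by
    #    near-zero values, e.g. in the detrended DC bin.
    asd = np.sqrt(np.interp(np.fft.rfftfreq(n, d=1.0 / fs), freqs, pxx))
    asd = np.maximum(asd, 1e-3 * np.median(asd))
    white = np.fft.irfft(np.fft.rfft(xw) / asd, n=n)
    # 4. Zero-phase Butterworth band-pass.
    sos = signal.butter(4, frange, btype='bandpass', fs=fs, output='sos')
    y = signal.sosfiltfilt(sos, white)
    # 5. Optional zero-phase notch filters at user-supplied frequencies.
    for f0 in notch_freqs:
        b, a = signal.iirnotch(f0, 30.0, fs=fs)
        y = signal.filtfilt(b, a, y)
    return y


# Demonstration on synthetic white noise (2 s at 2048 Hz).
rng = np.random.default_rng(0)
x = rng.standard_normal(2 * SAMPLE_RATE)
y = whiten_and_bandpass(x)
y_notched = whiten_and_bandpass(x, notch_freqs=(60.0,))
print(y.shape, y_notched.shape)
```

Unlike the `outlier_threshold` argument used above, which presumably selects the lines to suppress from the PSD, this sketch takes the notch frequencies explicitly. The value 60 Hz in the demonstration is only an example.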

Archive results and upload them to Google Drive:

In [None]:
!zip -rq $OUTPUT_ZIP_FILE $OUTPUT_PATH
!cp $OUTPUT_ZIP_FILE $ZIP_GD_OUT_FILE