# Goal / Explanation

In this competition, you'll aim to detect gravitational-wave (GW) signals from the mergers of binary black holes. Specifically, you'll build a model to analyze simulated GW time-series data from a network of Earth-based detectors.

These signals are unimaginably tiny ripples in the fabric of space-time. Even though the detectors in the global GW network are some of the most sensitive instruments on the planet, the signals are buried in detector noise.

In this competition you are provided with a training set of time-series data containing simulated gravitational-wave measurements from a network of 3 gravitational-wave interferometers (LIGO Hanford, LIGO Livingston, and Virgo).

Each time series contains either detector noise or detector noise plus a simulated gravitational-wave signal.  
The task is to identify when a signal is present in the data (target=1), that is, to output the probability that a signal is present.  
Open question: since each series is a detector signal, does it perhaps not show the gravitational waveform directly?

Each data sample (npy file) contains 3 time series (1 for each detector); each spans 2 s and is sampled at 2,048 Hz.

Signal-to-noise ratio (SNR): the most informative measure of how detectable a signal is.
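The sample dimensions stated above follow from the duration and the sampling rate. A minimal sketch of that arithmetic (the variable names are illustrative and not part of the competition code):

```python
# Each detector records 2 seconds at 2,048 samples per second.
duration_s = 2
sampling_rate_hz = 2048
n_detectors = 3  # LIGO Hanford, LIGO Livingston, Virgo

samples_per_series = duration_s * sampling_rate_hz
expected_shape = (n_detectors, samples_per_series)
print(expected_shape)  # (3, 4096)
```

Under these assumptions, `np.load(path).shape` for a single file should match `expected_shape`.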

## Competition Metric
Area under the ROC curve (ROC AUC)
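One way to read this metric: it is the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score. The sketch below implements that definition in plain Python on made-up labels and scores; `roc_auc` is an illustrative helper, not competition code.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75
```

The pairwise loop is quadratic, so it only suits small examples; for real data, `sklearn.metrics.roc_auc_score` is the usual choice. Because only the ranking matters, the metric can be computed on raw scores without choosing a threshold.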

## Reference
* [Kaggle beginner walkthrough (in Japanese)](https://www.kaggle.com/tensorchoko/g2net-gravitational-timm-efn-train#%F0%9F%8E%B5import)
* [Top Kaggler modeling](https://www.kaggle.com/ihelon/g2net-eda-and-modeling)  
* [Good EDA tutorial](https://www.kaggle.com/andradaolteanu/g2net-searching-the-sky-pytorch-effnet-w-meta)

### Library

In [None]:
import os
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time

### Look at training_labels.csv

In [None]:
train = pd.read_csv('../input/g2net-gravitational-wave-detection/training_labels.csv')
train.head()

* `target = 1`: a GW signal is present / `target = 0`: no GW signal is present

### Look at sample_submission.csv

In [None]:
test = pd.read_csv('../input/g2net-gravitational-wave-detection/sample_submission.csv')
test.head()

#### 👉  The goal of this competition is to predict the probability that a GW signal is present in the time-series data

### Distribution of the training targets

In [None]:
sns.countplot(data=train, x="target")

#### Check the details

In [None]:
# check the detail
print(train.shape)
num_target_exist = train["target"] == 1
# count target == 1
print("exist:{0}".format(num_target_exist.sum()))
num_target_notexist = ~(train["target"] == 1)
print("not exist:{0}".format(num_target_notexist.sum()))

In [None]:
def convert_image_id_2_path(image_id: str, is_train: bool = True) -> str:
    folder = "train" if is_train else "test"
    return "../input/g2net-gravitational-wave-detection/{}/{}/{}/{}/{}.npy".format(
        folder, image_id[0],  image_id[1],  image_id[2],  image_id 
    )
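As a quick check of the directory layout this helper assumes (the first three characters of the id select three nested folders), the sketch below repeats the function and applies it to a made-up id. The id is hypothetical and chosen only to make the nesting visible.

```python
def convert_image_id_2_path(image_id: str, is_train: bool = True) -> str:
    folder = "train" if is_train else "test"
    return "../input/g2net-gravitational-wave-detection/{}/{}/{}/{}/{}.npy".format(
        folder, image_id[0], image_id[1], image_id[2], image_id
    )


example_id = "0a1b2c3d4e"  # hypothetical id, used only for illustration
train_path = convert_image_id_2_path(example_id)
test_path = convert_image_id_2_path(example_id, is_train=False)
print(train_path)
print(test_path)
```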

# EDA

Each data sample (npy file) contains 3 time series (1 for each detector) and each spans 2 sec and is sampled at 2,048 Hz.  

In [None]:
# Open question: how do we know in advance that the channels are ordered "LIGO Hanford", "LIGO Livingston", "Virgo"?
def visualize_sample(_id, target, colors=("black", "red", "green"), signal_names=("LIGO Hanford", "LIGO Livingston", "Virgo")):
    # visualize_train_data
    path = convert_image_id_2_path(_id)
    x = np.load(path)
    # print(x) # show the source data
    plt.figure(figsize=(16, 7))
    for i in range(3):
        plt.subplot(4, 1, i+1)
        plt.plot(x[i], color=colors[i])
        plt.legend([signal_names[i]], fontsize=12, loc="lower right")
        
        plt.subplot(4, 1, 4)
        plt.plot(x[i], color=colors[i])
    
    plt.subplot(4, 1, 4)
    plt.legend(signal_names, fontsize=12, loc="lower right")
    
    plt.suptitle(f"id: {_id} target: {target}", fontsize=16)
    plt.show()

In [None]:
for i in random.sample(train.index.tolist(), 3):
    _id = train.iloc[i]['id']
    target = train.iloc[i]['target']
    visualize_sample(_id, target)

### Signal transformation - Spectrogram

In [None]:
import librosa
import librosa.display

In [None]:
def visualize_sample_spectogram(_id, target, signal_names=("LIGO Hanford", "LIGO Livingston", "Virgo")):
    x = np.load(convert_image_id_2_path(_id))
    plt.figure(figsize=(16, 5))
    for i in range(3):
        X = librosa.stft(x[i] / x[i].max())  # dividing by the peak value rescales the series so its maximum is 1
        Xdb = librosa.amplitude_to_db(abs(X))
        plt.subplot(1, 3, i + 1)
        librosa.display.specshow(Xdb, sr=2048, x_axis="time", y_axis="hz", vmin=-3, vmax=50)
        plt.colorbar()
        plt.title(signal_names[i], fontsize=14)
        
        
    plt.suptitle(f"id: {_id} target: {target}", fontsize=16)
    plt.show()
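A possible reason for dividing by the peak value before the STFT: strain amplitudes are extremely small (on the order of 1e-20 is assumed here), and a decibel conversion with a lower floor would map every such magnitude to the same value. The sketch below uses a simplified stand-in, `to_db`, with a floor of 1e-5, which is assumed to match the default `amin` of `librosa.amplitude_to_db`. It is not the library function itself and ignores its other options.

```python
import math

AMIN = 1e-5  # assumed floor, mirroring the default `amin` of librosa.amplitude_to_db


def to_db(amplitude, amin=AMIN):
    """Simplified amplitude-to-decibel conversion with a lower floor."""
    return 20.0 * math.log10(max(amin, abs(amplitude)))


raw = [1e-20, 3e-21, 5e-22]            # hypothetical unscaled magnitudes
scaled = [v / max(raw) for v in raw]   # the same values after dividing by the peak

raw_db = [to_db(v) for v in raw]
scaled_db = [to_db(v) for v in scaled]
print(raw_db)     # every value sits on the floor
print(scaled_db)  # relative differences are preserved
```

With the unscaled values, all three magnitudes collapse to the floor and the spectrogram would carry no contrast; after scaling, the differences between them survive the conversion.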

In [None]:
for i in random.sample(train.index.tolist(), 3):
    _id = train.iloc[i]['id']
    target = train.iloc[i]['target']
    visualize_sample_spectogram(_id, target)

### Signal transformation - Q-transform

### nnAudio
* nnAudio is an audio processing toolbox that uses PyTorch convolutional neural networks as its backend.  
* Spectrograms can be generated from audio on the fly during neural network training, and the Fourier kernels (e.g. the CQT kernels) can be trained.

In [None]:
!pip install -q nnAudio
import torch
from nnAudio.Spectrogram import CQT1992v2  # efficient algorithm for computing the constant-Q transform

In [None]:
Q_TRANSFORM = CQT1992v2(sr=2048, fmin=20, fmax=1024, hop_length=32)

def visualize_sample_qtransform(_id, target, signal_names=("LIGO Hanford", "LIGO Livingston", "Virgo"), sr=2048):
    x = np.load(convert_image_id_2_path(_id))
    plt.figure(figsize=(16, 5))
    for i in range(3):
        waves = x[i] / np.max(x[i])
        waves = torch.from_numpy(waves).float()
        image = Q_TRANSFORM(waves)
        
        plt.subplot(1, 3, i + 1)
        plt.imshow(image.squeeze())
        plt.title(signal_names[i], fontsize=14)
        
    plt.suptitle(f"id: {_id} target: {target}", fontsize=16)
    plt.show()
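To anticipate the size of each Q-transform image, the arithmetic below estimates the number of frequency bins and time frames from the parameters passed to `CQT1992v2`. Two assumptions are made and should be checked against `image.shape`: the transform uses 12 bins per octave and derives the bin count from `fmin` and `fmax`, and frames are centered so that one extra frame is produced.

```python
import math

sr = 2048
fmin, fmax = 20, 1024
hop_length = 32
bins_per_octave = 12       # assumed default
n_samples = 2 * sr         # 2 seconds per detector

# Assumption: bins span fmin..fmax at `bins_per_octave` bins per octave.
n_bins = math.ceil(bins_per_octave * math.log2(fmax / fmin))
# Assumption: centered framing yields one frame per hop plus one.
n_frames = n_samples // hop_length + 1
print(n_bins, n_frames)
```

If both assumptions hold, `image.squeeze()` has shape `(n_bins, n_frames)`, and stacking the three detectors gives a 3-channel image of that size.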

In [None]:
for i in random.sample(train.index.tolist(), 3):
    _id = train.iloc[i]['id']
    target = train.iloc[i]['target']
    
    visualize_sample(_id, target)
    visualize_sample_qtransform(_id, target)

# Model

#### EfficientNet  
EfficientNet scales three dimensions of the model in a balanced way: depth, width, and resolution (= input image size).  
Note: the layer architecture itself is not changed.
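The balance between depth, width, and resolution is the compound-scaling rule from the EfficientNet paper: a single coefficient φ scales the three dimensions as α^φ, β^φ, and γ^φ. The sketch below uses the constants the paper reports for the B0 baseline (α = 1.2, β = 1.1, γ = 1.15). The function name is illustrative, and the released B1 to B7 configurations need not match these multipliers exactly.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # constants reported for the B0 baseline


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


# The constants are chosen so that FLOPs roughly double per unit of phi.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 3))

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(phi, round(depth, 2), round(width, 2), round(resolution, 2))
```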

In [None]:
!pip install efficientnet_pytorch -qq

In [None]:
import torch
from torch import nn
from torch.utils import data as torch_data
from sklearn import model_selection as sk_model_selection
from torch.nn import functional as torch_functional
import efficientnet_pytorch
from sklearn.model_selection import StratifiedKFold  # K-fold cross-validation that keeps the class ratio in every train/validation split; useful when the class distribution is imbalanced

#### DataLoader
A `DataLoader` draws samples from a dataset one batch at a time.  
dataset = [all samples]  
DataLoader = [[batch_1], [batch_2], ... [batch_n]]  
That is:  
len(dataset) = total number of samples  
len(DataLoader) = number of iterations (batches) per epoch  
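The relationship between the two lengths can be illustrated without PyTorch. The generator below is a plain-Python stand-in for an unshuffled loader (the name `iterate_batches` and the toy dataset are illustrative): the number of batches is the dataset size divided by the batch size, rounded up.

```python
import math


def iterate_batches(dataset, batch_size):
    """Yield consecutive slices of `dataset`, mimicking an unshuffled loader."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]


dataset = list(range(10))  # hypothetical dataset with 10 samples
batch_size = 4
batches = list(iterate_batches(dataset, batch_size))

print(len(dataset))   # number of samples
print(len(batches))   # number of iterations per epoch
print(batches[-1])    # the last batch may be smaller than batch_size
```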

In [None]:
class DataRetriever(torch_data.Dataset):
    def __init__(self, paths, targets):
        self.paths = paths
        self.targets = targets
        
        self.q_transform = CQT1992v2(sr=2048, fmin=20, fmax=1024, hop_length=32)
        
    def __len__(self):
        return len(self.paths)
    
    def __get_qtransform(self, x):
        image = []
        for i in range(3):
            waves = x[i] / np.max(x[i])
            waves = torch.from_numpy(waves).float()
            channel = self.q_transform(waves).squeeze().numpy()
            image.append(channel)
        
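        # Stack the three channels into one array first; building a tensor from a list of arrays is slow
        return torch.from_numpy(np.stack(image)).float()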
    
    def __getitem__(self, index):
        # Given a sample index, preprocess and return it (X: image produced by the Q-transform, y: target as a tensor)
        file_path = convert_image_id_2_path(self.paths[index])
        x = np.load(file_path)
        image = self.__get_qtransform(x)
        y = torch.tensor(self.targets[index], dtype=torch.float)
        
        return {"X": image, "y": y}

#### Model: EfficientNet

In [None]:
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = efficientnet_pytorch.EfficientNet.from_pretrained("efficientnet-b7")
        n_features = self.net._fc.in_features
        self.net._fc = nn.Linear(in_features=n_features, out_features=1, bias=True)
    
    def forward(self, x):
        out = self.net(x)
        return out

In [None]:
class LossMeter:
    def __init__(self):
        self.avg = 0
        self.n = 0
        
    def update(self, val):
        self.n += 1
        # Incremental (running) mean: new_avg = val / n + (n - 1) / n * old_avg,
        # so the per-batch values do not have to be stored.
        self.avg = val / self.n + (self.n - 1) / self.n * self.avg
        
        
class AccMeter:
    def __init__(self):
        self.avg = 0
        self.n = 0
        
    def update(self, y_true, y_pred):
        y_true = y_true.cpu().numpy().astype(int)
        # The model outputs logits, so logit >= 0 corresponds to probability >= 0.5
        y_pred = y_pred.cpu().numpy() >= 0
        last_n = self.n
        self.n += len(y_true)
        true_count = np.sum(y_true == y_pred)
        # Running accuracy: fold this batch's correct count into the previous average
        self.avg = true_count / self.n + last_n / self.n * self.avg
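The incremental update used by both meters gives the same result as averaging all values at the end, without keeping them in memory. A small sketch with made-up loss values (the class name `RunningMean` is illustrative):

```python
class RunningMean:
    """Keeps the mean of all values seen so far without storing them."""

    def __init__(self):
        self.avg = 0.0
        self.n = 0

    def update(self, val):
        self.n += 1
        # new_avg = val / n + (n - 1) / n * old_avg
        self.avg = val / self.n + (self.n - 1) / self.n * self.avg


losses = [0.9, 0.7, 0.5, 0.3]  # hypothetical per-batch losses
meter = RunningMean()
for loss in losses:
    meter.update(loss)

batch_mean = sum(losses) / len(losses)
print(meter.avg, batch_mean)
```

The two printed numbers agree up to floating-point rounding.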

In [None]:
class Trainer:
    # do train
    def __init__(self, model, device, optimizer, criterion, loss_meter, score_meter):
        self.model = model
        self.device = device
        self.optimizer = optimizer
        self.criterion = criterion
        self.loss_meter = loss_meter
        self.score_meter = score_meter
        
        self.best_valid_score = -np.inf
        self.n_patience = 0  # early-stopping counter: epochs since the last improvement (this run passes patience=100)
        
        self.messages = {
            "epoch": "[Epoch {}: {}] loss: {:.5f}, score: {:.5f}, time: {} s",
            "checkpoint": "The score improved from {:.5f} to {:.5f}. Save model to '{}'",
            "patience": "\nValid score didn't improve last {} epochs."
        }
        
    def fit(self, epochs, train_loader, valid_loader, save_path, patience):
        for n_epoch in range(1, epochs + 1):
            self.info_message("EPOCH: {}", n_epoch)
            
            train_loss, train_score, train_time = self.train_epoch(train_loader)
            valid_loss, valid_score, valid_time = self.valid_epoch(valid_loader)
            
            self.info_message(
                self.messages["epoch"], "Train", n_epoch, train_loss, train_score, train_time
            )
            self.info_message(
                self.messages["epoch"], "Valid", n_epoch, valid_loss, valid_score, valid_time
            )
            
            if self.best_valid_score < valid_score:
                self.info_message(
                    self.messages["checkpoint"], self.best_valid_score, valid_score, save_path
                )
                self.best_valid_score = valid_score
                self.save_model(n_epoch, save_path)
                self.n_patience = 0
            else:
                self.n_patience += 1
            
            if self.n_patience >= patience:
                self.info_message(self.messages["patience"], patience)
                break
            
                
    def train_epoch(self, train_loader):
        self.model.train()
        t = time.time()
        train_loss = self.loss_meter()
        train_score = self.score_meter()
        
        for step, batch in enumerate(train_loader, 1):
            X = batch["X"].to(self.device)
            targets = batch["y"].to(self.device)
            self.optimizer.zero_grad()
            outputs = self.model(X).squeeze(1)
            
            loss = self.criterion(outputs, targets)
            loss.backward()
            
            train_loss.update(loss.detach().item())
            train_score.update(targets, outputs.detach())
            
            self.optimizer.step() # update weight
            
            _loss, _score = train_loss.avg, train_score.avg
            message = 'Train Step {}/{}, train_loss: {:.5f}, train_score: {:.5f}'
            self.info_message(message, step, len(train_loader), _loss, _score, end="\r")
                   
        return train_loss.avg, train_score.avg, int(time.time() - t)
            
    def valid_epoch(self, valid_loader):
        self.model.eval()
        t = time.time()
        valid_loss = self.loss_meter()
        valid_score = self.score_meter()
        
        for step, batch in enumerate(valid_loader, 1):
            with torch.no_grad():
                X = batch["X"].to(self.device)
                targets = batch["y"].to(self.device)

                outputs = self.model(X).squeeze(1)
                loss = self.criterion(outputs, targets)

                valid_loss.update(loss.detach().item())
                valid_score.update(targets, outputs)
                
            _loss, _score = valid_loss.avg, valid_score.avg
            message = 'Valid Step {}/{}, valid_loss: {:.5f}, valid_score: {:.5f}' 
            self.info_message(message, step, len(valid_loader), _loss, _score, end="\r")
            
        return valid_loss.avg, valid_score.avg, int(time.time() - t)
        
    def save_model(self, n_epoch, save_path):
        torch.save(
        {
            "model_state_dict" : self.model.state_dict(),
            "optimizer_state_dict" : self.optimizer.state_dict(),
            "best_valid_score" : self.best_valid_score,
            "n_epoch" : n_epoch,
        },
        save_path
        )
    @staticmethod        
    def info_message(message, *args, end="\n"):
        print(message.format(*args), end=end)
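The patience counter in `fit` is intended for early stopping. The sketch below isolates that logic, assuming a checkpoint is taken only when the validation score improves; the function name and the scores are illustrative.

```python
def epochs_until_stop(valid_scores, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best_score = float("-inf")
    n_patience = 0
    for epoch, score in enumerate(valid_scores, 1):
        if score > best_score:
            best_score = score   # improvement: a checkpoint would be saved here
            n_patience = 0
        else:
            n_patience += 1      # no improvement in this epoch
        if n_patience >= patience:
            return epoch
    return None


scores = [0.60, 0.65, 0.64, 0.63, 0.66]  # hypothetical validation scores
stop_epoch = epochs_until_stop(scores, patience=2)
print(stop_epoch)  # 4
```

With `patience=2`, training stops after epoch 4, so the better score at epoch 5 is never reached; a larger patience trades extra training time for the chance to recover from such dips.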

## Training

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

skf = StratifiedKFold(n_splits=2, random_state=42, shuffle=True)
for fold, (train_index, valid_index) in enumerate(skf.split(train, train["target"])):
    # prepare Dataset
    train_X = train.iloc[train_index]
    valid_X = train.iloc[valid_index][:20000]  # Reduce calculation time
    print(train_X.shape, valid_X.shape)
    
    train_data_retriever = DataRetriever(
        train_X["id"].values,
        train_X["target"].values
    )
    valid_data_retriever = DataRetriever(
        valid_X["id"].values, 
        valid_X["target"].values
    )
    
    train_loader = torch_data.DataLoader(train_data_retriever, batch_size=32, shuffle=True, num_workers=8)
    valid_loader = torch_data.DataLoader(valid_data_retriever, batch_size=32, shuffle=False, num_workers=8)
    
    model = Model()
    model.to(device)
    
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = torch_functional.binary_cross_entropy_with_logits
    
    trainer = Trainer(model, device, optimizer, criterion, LossMeter, AccMeter)
    
    # Train for 1 epoch; fit() returns nothing, so there is no history to capture
    trainer.fit(1, train_loader, valid_loader, f"best-model-{fold}.pth", 100)

### Load the best model from each fold

In [None]:
models = []
for i in range(2):
    model = Model()
    model.to(device)
    
    checkpoint = torch.load(f"best-model-{i}.pth")
    model.load_state_dict(checkpoint["model_state_dict"])
    model.eval()
    
    models.append(model)

### Prepare test data

In [None]:
class DataRetriever(torch_data.Dataset):
    def __init__(self, paths):
        self.paths = paths

        self.q_transform = CQT1992v2(
            sr=2048, fmin=20, fmax=1024, hop_length=32
        )
    
    def __len__(self):
        return len(self.paths)
    
    def __get_qtransform(self, x):
        image = []
        for i in range(3):
            waves = x[i] / np.max(x[i])
            waves = torch.from_numpy(waves).float()  # NumPy array to tensor
            channel = self.q_transform(waves).squeeze().numpy()  # back to a NumPy array, which lives on the CPU
            image.append(channel)
        
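        # Stack the three channels into one array first; building a tensor from a list of arrays is slow
        return torch.from_numpy(np.stack(image)).float()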
    
    def __getitem__(self, index):
        file_path = convert_image_id_2_path(self.paths[index], is_train=False)
        x = np.load(file_path)
        image = self.__get_qtransform(x)
        
        return {"X": image, "id": self.paths[index]}

In [None]:
test_data_retriever = DataRetriever(
    test["id"].values,
)

test_loader = torch_data.DataLoader(
    test_data_retriever,
    batch_size=32,
    shuffle=False,
    num_workers = 8
)

### Inference

In [None]:
y_pred = []
ids = []

for e, batch in enumerate(test_loader):
    print(f"{e}/{len(test_loader)}", end="\r")
    with torch.no_grad():
        tmp_pred = np.zeros((batch["X"].shape[0], ))
        for model in models:
            tmp_res = torch.sigmoid(model(batch["X"].to(device))).cpu().numpy().squeeze()
            tmp_pred += tmp_res / len(models)  # average over however many fold models were loaded
        y_pred.extend(tmp_pred)
        ids.extend(batch["id"])
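The inference loop turns each model's logits into probabilities with a sigmoid and averages them across the fold models. A plain-Python sketch of that averaging on made-up logits (the helper names are illustrative):

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_probabilities(logits_per_model):
    """Average sigmoid(logit) across models for every sample."""
    n_models = len(logits_per_model)
    n_samples = len(logits_per_model[0])
    return [
        sum(sigmoid(model_logits[i]) for model_logits in logits_per_model) / n_models
        for i in range(n_samples)
    ]


# Hypothetical logits from two fold models for three test samples.
logits = [
    [0.0, 2.0, -1.0],
    [0.0, 1.0, -3.0],
]
probs = ensemble_probabilities(logits)
print(probs)
```

Averaging probabilities rather than hard 0/1 decisions keeps the ranking information that the ROC AUC metric depends on.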

## Submission

In [None]:
submission = pd.DataFrame({"id": ids, "target": y_pred})
submission.to_csv("model_submission.csv", index=False)

In [None]:
submission