# SETI Breakthrough Listen - E.T. Signal Search - Exploratory Data Analysis

Quick exploratory data analysis for the [SETI Breakthrough Listen - E.T. Signal Search](https://www.kaggle.com/c/seti-breakthrough-listen/) challenge.

**“Are we alone in the Universe?”**


Identify anomalous signals in the data.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/23652/logos/header.png)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:black; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

* [Overview](#1)
* [Visualizations](#2)
* [Targets](#3)
    
    

* [Competition Metric](#10)
* [Sample Submission](#20)
* [Prepared Submission](#30)

<a id="1"></a>
<h2 style='background:black; border:0; color:white'><center>Overview</center></h2>

In this competition, you are tasked with looking for technosignature signals in cadence snippets taken from the Green Bank Telescope (GBT).

**train/** - a training set of cadence snippet files stored in numpy float16 format (v1.20.1), one file per cadence snippet id, with corresponding labels found in the train_labels.csv file. Each file has dimension (6, 273, 256), with the 1st dimension representing the 6 positions of the cadence, and the 2nd and 3rd dimensions representing the 2D spectrogram.  
**test/** - the test set cadence snippet files; you must predict whether or not the cadence contains a "needle", which is the target for this competition  
**sample_submission.csv** - a sample submission file in the correct format  
**train_labels.csv** - targets corresponding (by id) to the cadence snippet files found in the train/ folder  
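
As a minimal sketch of the layout described above, the snippet below builds a synthetic array with the same shape and dtype as a cadence file (random noise standing in for real telescope data) and splits it into alternating ON and OFF observations. It assumes the six positions alternate ON, OFF, ON, OFF, ON, OFF, which matches the labels drawn by the plotting code in this notebook; `fake_cadence`, `on_target`, and `off_target` are illustrative names, not part of the competition data.

```python
import numpy as np

# Synthetic stand-in for one cadence snippet: same shape and dtype as the
# files in train/, but filled with random noise rather than telescope data.
rng = np.random.default_rng(0)
fake_cadence = rng.normal(size=(6, 273, 256)).astype(np.float16)

# Cast to float32 before doing arithmetic; float16 has very limited precision.
cadence = fake_cadence.astype(np.float32)

# Assumption: positions alternate ON, OFF, ON, OFF, ON, OFF.
on_target = cadence[0::2]   # positions 0, 2, 4
off_target = cadence[1::2]  # positions 1, 3, 5

print(on_target.shape, off_target.shape)
```

With a real file, `fake_cadence` would be replaced by the result of `np.load(...)` on one of the `.npy` files.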

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
df_train = pd.read_csv("../input/seti-breakthrough-listen/train_labels.csv")
df_train

In [None]:
df_train['target'].value_counts()

In [None]:
plt.figure(figsize=(16, 5))
sn.countplot(y="target", data=df_train)
plt.title("Target Distribution");

<a id="2"></a>
<h2 style='background:black; border:0; color:white'><center>Visualizations</center></h2>

In [None]:
def get_train_filename_by_id(_id: str) -> str:
    return f"../input/seti-breakthrough-listen/train/{_id[0]}/{_id}.npy"


get_train_filename_by_id(df_train.iloc[0]["id"])

In [None]:
tmp_filename = get_train_filename_by_id(df_train.iloc[0]["id"])
print(tmp_filename)
arr = np.load(tmp_filename)
arr.shape

In [None]:
def show_cadence(filename: str, label: int) -> None:
    """Plot the six spectrograms of one cadence snippet, stacked vertically."""
    plt.figure(figsize=(16, 10))
    arr = np.load(filename)
    for i in range(6):
        plt.subplot(6, 1, i + 1)
        if i == 0:
            plt.title(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
        # Cast the float16 data to float64 before plotting.
        plt.imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
        # Even positions are labeled ON, odd positions OFF.
        plt.text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        plt.xticks([])
    plt.show()

In [None]:
df_tmp = df_train[df_train["target"] == 0].sample(1)
for ind, row in df_tmp.iterrows():
    show_cadence(get_train_filename_by_id(row["id"]), row["target"])

df_tmp = df_train[df_train["target"] == 1].sample(5)
for ind, row in df_tmp.iterrows():
    show_cadence(get_train_filename_by_id(row["id"]), row["target"])

In [None]:
def show_channels(filename: str, label: int) -> None:
    """Plot the six spectrograms of one cadence snippet in a 2x3 grid."""
    plt.figure(figsize=(16, 10))
    plt.suptitle(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
    arr = np.load(filename)
    for i in range(6):
        plt.subplot(2, 3, i + 1)
        # Cast the float16 data to float64 before plotting.
        plt.imshow(arr[i].astype(float))
    plt.show()

In [None]:
df_tmp = df_train[df_train["target"] == 0].sample(1)
for ind, row in df_tmp.iterrows():
    show_channels(get_train_filename_by_id(row["id"]), row["target"])

df_tmp = df_train[df_train["target"] == 1].sample(1)
for ind, row in df_tmp.iterrows():
    show_channels(get_train_filename_by_id(row["id"]), row["target"])

<a id="3"></a>
<h2 style='background:black; border:0; color:white'><center>Targets</center></h2>

#### Easy to find
![](https://i.imgur.com/5ohQpvE.png)

#### Medium
![](https://i.imgur.com/Pz6YdoV.png)
![](https://i.imgur.com/81jL2N7.png)

#### Hard
![](https://i.imgur.com/Sgu0k7n.png)

<a id="10"></a>
<h2 style='background:black; border:0; color:white'><center>Competition Metric</center></h2>

Submissions are evaluated on [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) between the predicted probability and the observed target.
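
To build some intuition for the metric, here is a small pure-Python sketch of one equivalent way to read AUC: the fraction of (positive, negative) pairs in which the positive sample gets the higher score, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration; the loop is quadratic, so for real evaluation a library routine such as scikit-learn's `roc_auc_score` is the better choice.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
constant = pairwise_auc(y_true, [0.5] * 6)                           # 0.5
perfect = pairwise_auc(y_true, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1])       # 1.0
rescaled = pairwise_auc(y_true, [0.09, 0.08, 0.07, 0.03, 0.02, 0.01])  # 1.0
print(constant, perfect, rescaled)
```

The last two calls give the same result because only the ordering of the scores matters, not their absolute values. A constant prediction scores 0.5 regardless of class balance.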

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, auc

In [None]:
list_y_true = [
    [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
    [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
    [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.], #  IMBALANCE
]
list_y_pred = [
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    [0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.5],
    [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], #  IMBALANCE
]

for y_true, y_pred in zip(list_y_true, list_y_pred):
    fpr, tpr, _ = roc_curve(y_true, y_pred)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([-0.01, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

<a id="20"></a>
<h2 style='background:black; border:0; color:white'><center>Sample Submission</center></h2>

In [None]:
df_submission = pd.read_csv("../input/seti-breakthrough-listen/sample_submission.csv")
df_submission

In [None]:
df_submission["target"] = 0.51
df_submission.to_csv("submission.csv", index=False)

<a id="30"></a>
<h2 style='background:black; border:0; color:white'><center>Prepared Submission</center></h2>

I experimented with these two excellent kernels, retraining and ensembling them:  
[SETI / NFNet_l0 starter [inference]](https://www.kaggle.com/yasufuminakama/seti-nfnet-l0-starter-inference)   
[SETI-BL: TF Starter TPU 🚀](https://www.kaggle.com/awsaf49/seti-bl-tf-starter-tpu)
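
This notebook does not show how the two models were combined. As one possible sketch (an assumption, not necessarily the method used here), the snippet below rank-averages two submissions, which is a common choice when the metric is AUC because only the ordering of predictions matters. The frames `sub_a` and `sub_b` and their values are made up; real submissions would be read from CSV files, assumed here to have `id` and `target` columns.

```python
import pandas as pd

# Made-up predictions from two hypothetical models for the same four ids.
sub_a = pd.DataFrame({"id": ["a", "b", "c", "d"], "target": [0.10, 0.80, 0.40, 0.95]})
sub_b = pd.DataFrame({"id": ["d", "c", "b", "a"], "target": [0.70, 0.20, 0.60, 0.30]})

# Align rows by id so each prediction is matched with its counterpart.
merged = sub_a.merge(sub_b, on="id", suffixes=("_a", "_b"))

# Convert each model's scores to percentile ranks in (0, 1], then average.
rank_a = merged["target_a"].rank(pct=True)
rank_b = merged["target_b"].rank(pct=True)
ensemble = pd.DataFrame({"id": merged["id"], "target": (rank_a + rank_b) / 2})

print(ensemble)
```

Ranking first puts both models on the same scale, so a model that outputs systematically larger probabilities does not dominate the average.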

In [None]:
df_prepared = pd.read_csv("../input/signal-search-submissions/submission_2021-05-13_20-00-00.csv", index_col=0)
df_prepared.to_csv("submission_2021-05-13_20-00-00.csv")
df_prepared = pd.read_csv("../input/signal-search-submissions/submission_2021-05-13_21-00-00.csv", index_col=0)
df_prepared.to_csv("submission_2021-05-13_21-00-00.csv")

## WORK IN PROGRESS...