# 🛸 E.T. Signal Search 👽 Data Analysis & Baseline Model [布尔艺数 BoolArt]

<div align=center>
<img src="https://storage.googleapis.com/kaggle-media/competitions/SETI-Berkeley/DSC_4014-Edit_2.jpg" width = "600" height = "100" alt="SETI Breakthrough Listen">
</div>
   

**"Are we alone in the universe?"** This is one of humanity's most profound and enduring questions.

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:Blue; border:0' role="tab" aria-controls="home"><center>Table of Contents</center></h3>
    
# Contents

* Competition Background
* Data Exploration
* Evaluation Metric
* Baseline Model
* Directions for Improvement
* References

## Competition Background

[Competition page: SETI Breakthrough Listen - E.T. Signal Search](https://www.kaggle.com/c/seti-breakthrough-listen/) 

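To search for extraterrestrial signals, the Breakthrough Listen instrument, a digital spectrometer, is installed on the Green Bank Telescope (GBT), a large radio telescope. It takes raw data from the telescope (hundreds of TB per day) and performs a Fourier transform to generate spectrograms. Each spectrogram is very wide, typically spanning several GHz of the radio spectrum, so the data files are huge. To keep the data manageable, the competition data keeps only a small frequency range of each spectrogram for prediction (these excerpts are called snippets).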

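To guard against interference from human-made radio signals such as radio stations and Wi-Fi routers, Breakthrough Listen alternates observations of the primary target star with three nearby stars: 5 minutes on star "A", then 5 minutes on star "B", then back to "A" for 5 minutes, then "C", then back to "A", and finally 5 minutes on star "D". One set of six observations (ABACAD) is called a cadence. Because only a small range of frequencies is provided for each cadence, the datasets to be analyzed are called cadence snippets.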
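
The ABACAD ordering described above can be sketched in a few lines of plain Python. This is only an illustration: the names `CADENCE_ORDER` and `on_off_labels` are made up here and are not part of the competition data or any library.

```python
# Observation order within one cadence, as described above (ABACAD).
CADENCE_ORDER = ["A", "B", "A", "C", "A", "D"]


def on_off_labels(order):
    """Label an observation ON if it points at the primary target "A", else OFF."""
    return ["ON" if star == "A" else "OFF" for star in order]


labels = on_off_labels(CADENCE_ORDER)
on_indices = [i for i, label in enumerate(labels) if label == "ON"]
off_indices = [i for i, label in enumerate(labels) if label == "OFF"]

print(labels)       # ['ON', 'OFF', 'ON', 'OFF', 'ON', 'OFF']
print(on_indices)   # [0, 2, 4]
print(off_indices)  # [1, 3, 5]
```

With zero-based indexing, the on-target scans are therefore the even positions of the cadence.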

<div align=center>
    <img src="https://storage.googleapis.com/kaggle-media/competitions/SETI-Berkeley/Screen%20Shot%202021-05-03%20at%2011.39.42.png" width = "500" height = "600">
</div>


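The figure above shows a cadence snippet of the Voyager 1 spacecraft, which is 20 billion kilometers from Earth. The first, third and fifth panels are the "A" target (Voyager 1). The yellow diagonal line is the radio signal from Voyager. It is detected when the telescope points at the spacecraft and disappears when the telescope points away. It appears as a diagonal line because the relative motion of Earth and the spacecraft produces a Doppler drift, so the frequency changes over time. Human-made signals from Earth, by contrast, tend to stay at a fixed frequency, so checking for Doppler drift helps tell the two apart. Training purely on observations of already launched spacecraft would be ideal, but there are few such examples, and the goal is also to find a wider range of signal types. The organizers therefore simulated data, adding signals similar to Voyager 1's to tens of thousands of recorded cadence snippets; this makes up the training data.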
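
To make the idea of a drifting signal that shows up only in the on-target panels concrete, here is a rough NumPy sketch. It is not the organizers' simulation pipeline: the function `inject_drifting_signal`, its parameters, the Gaussian noise background, and the assumption that axis 1 is time and axis 2 is frequency are all illustrative choices.

```python
import numpy as np


def inject_drifting_signal(cadence, start_channel, drift_per_row, amplitude):
    """Add a narrowband signal whose frequency drifts linearly over time.

    cadence: array of shape (6, time_steps, freq_channels) in ABACAD order.
    The signal is written only into the ON observations (indices 0, 2, 4);
    it keeps drifting during the OFF scans, when it is not observed.
    """
    out = cadence.copy()
    n_obs, n_time, n_freq = out.shape
    for obs in range(n_obs):
        if obs % 2 == 1:  # OFF observation: the telescope points elsewhere
            continue
        for t in range(n_time):
            elapsed = obs * n_time + t  # rows since the start of the cadence
            channel = int(round(start_channel + drift_per_row * elapsed))
            if 0 <= channel < n_freq:
                out[obs, t, channel] += amplitude
    return out


rng = np.random.default_rng(0)
noise = rng.normal(size=(6, 273, 256)).astype(np.float32)
with_signal = inject_drifting_signal(noise, start_channel=20, drift_per_row=0.1, amplitude=5.0)

# Which of the six panels were modified?
changed = (with_signal != noise).any(axis=(1, 2))
print(changed.tolist())  # [True, False, True, False, True, False]
```

Plotting `with_signal` panel by panel should show a faint diagonal in panels 1, 3 and 5 that lines up across the gaps, similar in spirit to the Voyager example.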

## Data Exploration

In [None]:
!pip install efficientnet_pytorch

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Standard library
import glob
import os
import random
import sys

# Numerics and data handling
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from colorama import Fore, Back, Style
r_ = Fore.WHITE
from plotly.offline import iplot
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Image processing
import cv2
from skimage.io import imshow, imread, imsave
from skimage.transform import rotate, AffineTransform, warp, rescale, resize, downscale_local_mean
from skimage import color, data
from skimage.exposure import adjust_gamma
from skimage.util import random_noise

# Modelling
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm
import torch
import torch.nn as nn
from efficientnet_pytorch import model as enet

In [None]:
print(os.listdir("../input/seti-breakthrough-listen/"))

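### Data Files

**train/** - Training set. Each file is a NumPy float16 array of shape (6, 273, 256): the first dimension indexes the six observations of a cadence, and the second and third dimensions hold the spectrogram of each observation. The label for each file can be found in train_labels.csv.  
**test/** - Test set, with the same data structure as the training set.  
**sample_submission.csv** - Example of the submission format.  
**train_labels.csv** - Labels for the training set.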
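
As a quick sanity check of the layout described above, the sketch below writes a synthetic array with the same dtype and shape to a temporary `.npy` file and reads it back. The file name `example.npy` and the random contents are placeholders; no real competition file is touched.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one competition file: float16, shape (6, 273, 256).
fake_cadence = np.random.default_rng(0).normal(size=(6, 273, 256)).astype(np.float16)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.npy")  # hypothetical file name
    np.save(path, fake_cadence)
    arr = np.load(path)

print(arr.shape, arr.dtype)     # (6, 273, 256) float16

# float16 has limited precision, so cast before doing arithmetic on the data
arr32 = arr.astype(np.float32)
first_observation = arr32[0]    # spectrogram of the first observation
print(first_observation.shape)  # (273, 256)
```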

In [None]:
def get_train_filename_by_id(_id: str) -> str:
    """Build the path of a training file; files are grouped by the first character of the id."""
    return f"../input/seti-breakthrough-listen/train/{_id[0]}/{_id}.npy"


def show_cadence(filename: str, label: int) -> None:
    """Plot the six observations of one cadence stacked vertically."""
    plt.figure(figsize=(8, 8))
    arr = np.load(filename)
    for i in range(6):
        plt.subplot(6, 1, i + 1)
        if i == 0:
            plt.title(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
        plt.imshow(arr[i].astype(float), interpolation='nearest', aspect='auto')
        # Even indices (0, 2, 4) are the on-target "A" scans, odd indices are off-target
        plt.text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        plt.xticks([])
    plt.show()
    
    
def show_channels(filename: str, label: int) -> None:
    """Plot the six observations of one cadence in a 2 x 3 grid."""
    plt.figure(figsize=(10, 8))
    plt.suptitle(f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
    arr = np.load(filename)
    for i in range(6):
        plt.subplot(2, 3, i + 1)
        plt.imshow(arr[i].astype(float))
    plt.show()

### Target Labels

There are 50,165 samples in total. This is a binary classification problem with a class ratio of roughly 10:1.
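
A class ratio of about 10:1 has a practical consequence: plain accuracy becomes misleading. The toy example below uses made-up labels (100 of one class, 10 of the other, chosen only to mimic the ratio) to show how high the accuracy of a trivial majority-class predictor already is.

```python
from collections import Counter

# Toy labels with a 10:1 class ratio (illustrative, not the real data).
toy_targets = [0] * 100 + [1] * 10

counts = Counter(toy_targets)
majority = max(counts.values())
minority = min(counts.values())
ratio = majority / minority
print(counts)  # Counter({0: 100, 1: 10})
print(ratio)   # 10.0

# A classifier that always predicts the majority class already reaches this accuracy.
majority_accuracy = majority / len(toy_targets)
print(round(majority_accuracy, 3))  # 0.909
```

This is one reason to prefer a ranking-based metric over accuracy for this kind of data.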

In [None]:
train_labels = pd.read_csv("../input/seti-breakthrough-listen/train_labels.csv")
print(train_labels.head())
print("-" * 20)
print(train_labels.shape)
print("-" * 20)
print(train_labels.target.value_counts())

### Visualization

In [None]:
# Samples that contain a signal
df_tmp = train_labels[train_labels["target"] == 1].sample(3)
for ind, row in df_tmp.iterrows():
    show_cadence(get_train_filename_by_id(row["id"]), row["target"])

In [None]:
# Samples that do not contain a signal
df_tmp = train_labels[train_labels["target"] == 0].sample(3)
for ind, row in df_tmp.iterrows():
    show_cadence(get_train_filename_by_id(row["id"]), row["target"])

#### Easy-to-spot signals

<div align=center>
    <img src="https://i.imgur.com/5ohQpvE.png" width = "500" height = "600">
</div>

#### Medium-difficulty signals

<div align=center>
    <img src="https://i.imgur.com/Pz6YdoV.png" width = "500" height = "600">
</div>

<div align=center>
    <img src="https://i.imgur.com/81jL2N7.png" width = "500" height = "600">
</div>


#### Hard-to-spot signals

<div align=center>
    <img src="https://i.imgur.com/Sgu0k7n.png" width = "500" height = "600">
</div>

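## Evaluation Metric

This is a binary classification problem with imbalanced classes, so the organizers chose AUC (Area Under the ROC Curve) as the evaluation metric. The following briefly introduces how AUC is computed and what its properties are.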

For a binary classification problem, the predictions can be summarized in a confusion matrix, as shown in the figure below.

<div align=center>
    <img src="https://i.loli.net/2021/06/06/82nuZpo1fQxHjNK.png" width = "500" height = "600">
</div>



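- TP (true positive): the true class is positive and the model also predicts positive
- FP (false positive): the model predicts positive but the true class is negative, so prediction and truth disagree
- FN (false negative): the model predicts negative but the true class is positive, so prediction and truth disagree
- TN (true negative): the true class is negative and the model also predicts negative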
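
The four counts above, and the two rates derived from them, can be computed by hand. The helper `confusion_counts` and the toy labels below are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 2 1 1 4

tpr = tp / (tp + fn)  # recall of the positive class
fpr = fp / (fp + tn)  # 1 - recall of the negative class
print(round(tpr, 3), round(fpr, 3))  # 0.667 0.2
```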

### ROC curve

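The vertical axis of the ROC curve, the True Positive Rate (TPR = TP / (TP + FN)), equals the recall of the positive class; the horizontal axis, the False Positive Rate (FPR = FP / (FP + TN)), equals 1 minus the recall of the negative class. Sweeping the classification threshold θ (0.5 by default) across its range [0, 1], in either direction, yields one (FPR, TPR) pair per threshold, and plotting these pairs in order gives the ROC curve. Since a higher TPR and a lower FPR are both better, the closer the curve lies to the top-left corner (0, 1), the better the model; equivalently, the larger the area enclosed by the ROC curve, the horizontal axis and the line FPR = 1 (the AUC), the better.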

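To compare models, compute the AUC of each ROC curve. By definition, the AUC is the area enclosed by the ROC curve, the horizontal axis and the line FPR = 1. It usually lies between 0.5 and 1.0, and larger is better.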
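
The threshold sweep and the area computation can be written out directly. The sketch below is a plain-Python illustration with made-up scores, using the trapezoid rule for the area; `roc_points` and `trapezoid_auc` are hypothetical helper names, and in practice a library routine would normally be used instead.

```python
def roc_points(y_true, scores):
    """Sweep the threshold over every distinct score, from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))  # (FPR, TPR)
    return points


def trapezoid_auc(points):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points[-1])                       # (1.0, 1.0)
print(round(trapezoid_auc(points), 3))  # 0.889
```

Only one of the nine positive/negative pairs is ranked the wrong way round (0.6 versus 0.7), which is why the area comes out as 8/9.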

<div align=center>
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/36/Roc-draft-xkcd-style.svg/1920px-Roc-draft-xkcd-style.svg.png" width = "500" height = "600">
</div>


### AUC
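AUC copes well with imbalanced samples. The simulated groups of data below illustrate this.

- Groups 1 and 2 are balanced samples;
- Groups 3 and 4 are imbalanced samples.

Note that for group 3, even though every sample is predicted as positive, the AUC is still only 0.5.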
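
One way to see why the third group scores 0.5 is to read AUC as the fraction of (positive, negative) pairs that the model ranks correctly, with ties counting one half. The sketch below uses the labels of the third group and contrasts AUC with accuracy; `pairwise_auc` and `accuracy` are illustrative helpers, not library functions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Share of samples whose thresholded score matches the label."""
    hits = sum(1 for t, s in zip(y_true, scores) if int(s >= threshold) == t)
    return hits / len(y_true)


# Third group: 9 positives, 3 negatives, every sample predicted positive.
y_true = [1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
all_positive = [1.0] * 12

print(accuracy(y_true, all_positive))      # 0.75
print(pairwise_auc(y_true, all_positive))  # 0.5
```

A constant prediction ties every pair, so it cannot beat 0.5 no matter how skewed the classes are, while its accuracy simply mirrors the class proportions.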

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, auc
import numpy as np

list_y_true = [
    [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
    [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
    [1., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0.], #  IMBALANCE
    [1., 0., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0.], #  IMBALANCE
]
list_y_pred = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 
    [0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.5],
    [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],#  IMBALANCE
    [1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 0.], #  IMBALANCE
]

for y_true, y_pred in zip(list_y_true, list_y_pred):
    fpr, tpr, _ = roc_curve(y_true, y_pred)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([-0.01, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC curve example')
    plt.legend(loc="lower right")
    plt.show()

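## Baseline Model

The spectrogram data can be treated as a 6-channel image, so the baseline uses a transfer-learning network from computer vision: EfficientNet-B1. Because the model was designed for ordinary 3-channel images, a convolution layer is added in front of EfficientNet to reduce the channels from 6 to 3 (a 3×3 kernel in the code below), and the final fully connected layer is replaced so that the output dimension is 1.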
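
Besides changing the channel count, the extra convolution can also change the spatial size, depending on its kernel size and padding. The small helper below applies the standard convolution size arithmetic (dilation 1) to the settings of the `conv1` layer in the model code that follows; `conv2d_output_size` is a hypothetical helper written for this note.

```python
def conv2d_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a 2-D convolution along one axis (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1


# Input snippets are 273 x 256; conv1 uses kernel_size=3, stride=1, padding=3.
height = conv2d_output_size(273, kernel_size=3, stride=1, padding=3)
width = conv2d_output_size(256, kernel_size=3, stride=1, padding=3)
print(height, width)  # 277 260

# For comparison, a 1x1 kernel without padding leaves the spatial size unchanged.
print(conv2d_output_size(273, kernel_size=1), conv2d_output_size(256, kernel_size=1))  # 273 256
```

So with these settings the backbone receives 3-channel inputs of 277 x 260 rather than 273 x 256.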

In [None]:
# Treat the input signal as a 2-D image and fine-tune the vision model EfficientNet

class enetv2(nn.Module):
    def __init__(self, backbone, out_dim):
        super(enetv2, self).__init__()
        self.enet = enet.EfficientNet.from_name(backbone)
        self.enet.load_state_dict(torch.load(pretrained_model[backbone]))
        self.myfc = nn.Linear(self.enet._fc.in_features, out_dim)
        self.enet._fc = nn.Identity()
        self.conv1 = nn.Conv2d(6, 3, kernel_size=3, stride=1, padding=3, bias=False)

    def extract(self, x):
        return self.enet(x)

    def forward(self, x):
        x = self.conv1(x)
        x = self.extract(x)
        x = self.myfc(x)
        return x

In [None]:
# Configuration

def set_seed(seed = 0):
    '''Set the random seeds so that results are reproducible.'''
    np.random.seed(seed)
    random_state = np.random.RandomState(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ['PYTHONHASHSEED'] = str(seed)
    return random_state

random_state = set_seed(76)


if torch.cuda.is_available():
    device = torch.device("cuda")
    print("GPU is available")
else:
    device = torch.device("cpu")
    print("GPU not available, CPU used")

In [None]:
# Classification dataset for the E.T. signal search

class ClassificationDataset:
    
    def __init__(self, image_paths, targets): 
        self.image_paths = image_paths
        self.targets = targets

    def __len__(self):
        return len(self.image_paths)
    
    def __getitem__(self, item):      
        image = np.load(self.image_paths[item]).astype(float)

        targets = self.targets[item]
                
        return {
            "image": torch.tensor(image, dtype=torch.float),
            "targets": torch.tensor(targets, dtype=torch.long),
        }
    
df_train=pd.read_csv('../input/seti-breakthrough-listen/train_labels.csv')
df_train['img_path']=df_train['id'].apply(lambda x:f'../input/seti-breakthrough-listen/train/{x[0]}/{x}.npy')

In [None]:
baseline_name = 'efficientnet-b1'
pretrained_model = {
    baseline_name: '../input/efficientnet-pytorch/efficientnet-b1-dbc7070a.pth'
}

model = enetv2(baseline_name, out_dim=1)

In [None]:
# Training helper functions

def train(data_loader, model, optimizer, device):
    
    model.train()
    
    for data in tqdm(data_loader, position=0, leave=True, desc='Training'):
        inputs = data["image"]
        targets = data['targets']
        
        inputs = inputs.to(device, dtype=torch.float)
        targets = targets.to(device, dtype=torch.float)
        
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))
        loss.backward()
        optimizer.step()


def evaluate(data_loader, model, device):
    model.eval()
    
    final_targets = []
    final_outputs = []
    
    with torch.no_grad():
        
        for data in tqdm(data_loader, position=0, leave=True, desc='Evaluating'):
            inputs = data["image"]
            targets = data["targets"]
            inputs = inputs.to(device, dtype=torch.float)
            targets = targets.to(device, dtype=torch.float)
            
            output = model(inputs)
            
            targets = targets.detach().cpu().numpy().tolist()
            output = output.detach().cpu().numpy().tolist()
            
            final_targets.extend(targets)
            final_outputs.extend(output)
            
    return final_outputs, final_targets

In [None]:
baseline_name = 'efficientnet-b1'
pretrained_model = {
    baseline_name: '../input/efficientnet-pytorch/efficientnet-b1-dbc7070a.pth'
}
models = []
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # fall back to CPU when no GPU is present
epochs = 3
Batch_Size = 32
X = df_train.img_path.values
Y = df_train.target.values
skf = StratifiedKFold(n_splits=5)
fold = 0

for train_index, test_index in skf.split(X, Y):
    
    model = enetv2(baseline_name, out_dim=1)
    model.to(device)

    train_images, valid_images = X[train_index], X[test_index]
    train_targets, valid_targets = Y[train_index], Y[test_index]

    train_dataset = ClassificationDataset(image_paths=train_images, targets=train_targets)
    valid_dataset = ClassificationDataset(image_paths=valid_images, targets=valid_targets)
    train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=Batch_Size,shuffle=True, num_workers=4)
    valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=Batch_Size,shuffle=False, num_workers=4)

    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

    for epoch in range(epochs):
        train(train_loader, model, optimizer, device=device)
        predictions, valid_targets = evaluate(valid_loader, model, device=device)
        roc_auc = metrics.roc_auc_score(valid_targets, predictions)
        print(f"Epoch={epoch}, Valid ROC AUC={roc_auc}")
        
    torch.save(model.state_dict(),baseline_name + '-' + str(fold) + '.pt')
    models.append(model)
    fold += 1

In [None]:
submission=pd.read_csv('../input/seti-breakthrough-listen/sample_submission.csv')
submission['img_path']=submission['id'].apply(lambda x:f'../input/seti-breakthrough-listen/test/{x[0]}/{x}.npy')
test_dataset=ClassificationDataset(image_paths=submission.img_path.values, targets=submission.target.values)
test_loader=torch.utils.data.DataLoader(test_dataset, batch_size=16,shuffle=False,num_workers=4)

sig=torch.nn.Sigmoid()
outs=[]
for model in models:
    predictions,valid_targets=evaluate(test_loader, model, device=device)
    predictions=np.array(predictions)[:,0]
    out=sig(torch.from_numpy(predictions))
    out=out.detach().numpy()
    outs.append(out)
    
pred=np.mean(np.array(outs),axis=0)
submission.target=pred
submission.drop(['img_path'],axis=1,inplace=True)
submission.to_csv('submission.csv', index=False)

In [None]:
submission.head()

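## Directions for Improvement

- Try other pretrained models: ResNeXt, SENet, ViT, etc.;
- Try using only observations 1, 3 and 5 of the six in each cadence;
- Try a two-tower architecture: model A takes observations 1, 3 and 5, model B takes observations 2, 4 and 6, and the two are fused on feature maps of different sizes before prediction;
- Strategies against class imbalance (loss functions, sampling, etc.);
- Overfitting;
- Model ensembling;
- Time-series models;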
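
As one possible reading of the class-imbalance item above, the loss can up-weight the rarer positive class. The pure-Python sketch below spells out a weighted binary cross-entropy on raw logits, which is the idea behind the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`. The function name and the weight of 10 (borrowed from the rough 10:1 class ratio) are assumptions for illustration, not settings from this notebook.

```python
import math


def weighted_bce_with_logits(logits, targets, pos_weight=1.0):
    """Mean binary cross-entropy on raw logits, with positives up-weighted by pos_weight."""
    total = 0.0
    for x, y in zip(logits, targets):
        # log(sigmoid(x)) written in a numerically stable form
        log_sig = -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))
        # log(1 - sigmoid(x)) = log(sigmoid(x)) - x
        log_one_minus_sig = log_sig - x
        total += -(pos_weight * y * log_sig + (1 - y) * log_one_minus_sig)
    return total / len(logits)


logits = [0.0, 0.0]  # the model is undecided on both samples
targets = [1, 0]     # one positive, one negative

print(round(weighted_bce_with_logits(logits, targets), 4))                 # 0.6931
print(round(weighted_bce_with_logits(logits, targets, pos_weight=10), 4))  # 3.8123
```

With the weight applied, a mistake on a positive sample costs ten times as much as the same mistake on a negative one, which pushes the model away from ignoring the minority class.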


**Feel free to follow; open-source content for this competition will keep being updated.**

## Refs

- https://www.kaggle.com/ihelon/signal-search-exploratory-data-analysis
- https://www.kaggle.com/c/seti-breakthrough-listen/overview/data-information
- https://www.kaggle.com/c/seti-breakthrough-listen/discussion/238298
- https://www.cnblogs.com/wuliytTaotao/p/9285227.html
- https://www.kaggle.com/robert76/efficientnet-pretrained/data