# SETI Breakthrough Listen - E.T. Signal Search - Exploratory Data Analysis

A quick exploratory data analysis for the [SETI Breakthrough Listen - E.T. Signal Search](https://www.kaggle.com/c/seti-breakthrough-listen/) challenge.

**“Are we alone in the Universe?”**


In this competition, use your data science skills to help identify anomalous signals in scans of Breakthrough Listen targets. Because there are no confirmed examples of alien signals to use to train machine learning algorithms, the team included some simulated signals (that they call “needles”) in the haystack of data from the telescope. They have identified some of the hidden needles so that you can train your model to find more. The data consist of two-dimensional arrays, so there may be approaches from computer vision that are promising, as well as digital signal processing, anomaly detection, and more. The algorithm that’s successful at identifying the most needles will win a cash prize, but also has the potential to help answer one of the biggest questions in science.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/23652/logos/header.png)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:black; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

* [Overview](#1)
* [Visualizations](#2)
* [Targets](#3)
    
    

* [Competition Metric](#10)
* [Sample Submission](#20)
* [Prepared Submission](#30)

<a id="1"></a>
<h2 style='background:black; border:0; color:white'><center>Overview</center></h2>

In this competition you are tasked with looking for technosignature signals in cadence snippets taken from the Green Bank Telescope (GBT).

**train/** - a training set of cadence snippet files stored in numpy float16 format (v1.20.1), one file per cadence snippet id, with corresponding labels found in the train_labels.csv file. Each file has dimension (6, 273, 256), with the 1st dimension representing the 6 positions of the cadence, and the 2nd and 3rd dimensions representing the 2D spectrogram.  
**test/** - the test set cadence snippet files; you must predict whether or not the cadence contains a "needle", which is the target for this competition  
**sample_submission.csv** - a sample submission file in the correct format  
**train_labels.csv** - targets corresponding (by id) to the cadence snippet files found in the train/ folder
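
As a quick sanity check of the layout described above, here is a minimal sketch that writes a synthetic float16 array with the stated shape to a temporary `.npy` file and reads it back. The random data and the file name `example_cadence.npy` are assumptions for illustration only; the real files live under `train/` and `test/`.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for one cadence snippet file (assumption: random noise,
# not real telescope data). Shape and dtype follow the data description.
rng = np.random.default_rng(0)
fake_cadence = rng.normal(size=(6, 273, 256)).astype(np.float16)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_cadence.npy")  # hypothetical file name
    np.save(path, fake_cadence)
    cadence = np.load(path)

# Axis 0 holds the 6 cadence positions; axes 1 and 2 hold one spectrogram each.
panels_135 = cadence[0::2]  # panels 1, 3 and 5
panels_246 = cadence[1::2]  # panels 2, 4 and 6
print(cadence.shape, cadence.dtype)
print(panels_135.shape, panels_246.shape)
```

Slicing with a step of 2 along the first axis is the same trick the dataset code in this notebook uses to pick panels 1, 3 and 5.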

In [None]:
!pip install -q git+https://github.com/rwightman/pytorch-image-models.git
!pip install -q torchsummary
!pip install -q -U git+https://github.com/albu/albumentations --no-cache-dir
!pip install -q neptune-client 

In [None]:
import gc
import math
import os
import random
import warnings
from typing import Dict, List, Optional, Union

import albumentations as A
import cuml
import cv2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
import timm
import torch
import torch.nn.functional as F
import torchvision
from albumentations.pytorch import ToTensorV2
from IPython.display import clear_output
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from torch import nn
from torch.autograd import Variable
from torch.optim.lr_scheduler import (CosineAnnealingLR,
                                      CosineAnnealingWarmRestarts,
                                      ReduceLROnPlateau, _LRScheduler)
from torch.optim.optimizer import Optimizer
from torch.utils.data import DataLoader, Dataset
from torchsummary import summary
from torchvision import models
from tqdm.notebook import tqdm

warnings.filterwarnings("ignore")
clear_output()


## Config

In [None]:
CONFIG = {
    "COMPETITION_NAME": "SETI Breakthrough Listen - E.T. Signal Search",
    "MODEL": {"MODEL_FACTORY": "timm", "MODEL_NAME": "efficientnet_b3"},
    "WORKSPACE": "KAGGLE",
    "DATA": {
        "TARGET_COL_NAME": "target",
        "IMAGE_COL_NAME": "id",
        "NUM_CLASSES": 1,
        "CLASS_LIST": [0, 1],
        "IMAGE_SIZE": 512,
        "CHANNEL_MODE": "spatial_3ch",
        "IS_TRANSPOSE": False,
        "USE_MIXUP": True
    },
    "CROSS_VALIDATION": {"SCHEMA": 'StratifiedKFold', "NUM_FOLDS": 5},
    "TRAIN": {
        "DATALOADER": {
            "batch_size": 32,
            "shuffle": True,  # using random sampler
            "num_workers": 4,
            "drop_last": False,
        },
        "SETTINGS": {
            "IMAGE_SIZE": 512,
            "NUM_EPOCHS": 20,
            "USE_AMP": True,
            "USE_GRAD_ACCUM": False,
            "ACCUMULATION_STEP": 1,
            "DEBUG": False,
            "VERBOSE": True,
            "VERBOSE_STEP": 10,
        },
    },
    "VALIDATION": {
        "DATALOADER": {
            "batch_size": 32,
            "shuffle": False,
            "num_workers": 4,
            "drop_last": False,
        }
    },
    "TEST": {
        "DATALOADER": {
            "batch_size": 32,
            "shuffle": False,
            "num_workers": 4,
            "drop_last": False,
        }
    },
    "OPTIMIZER": {
        "NAME": "AdamW",
        # if use big model like nfnet change lr to 1e-5
        "OPTIMIZER_PARAMS": {"lr": 1e-4, "eps": 1.0e-8, "weight_decay": 1.0e-3},
    },
    "SCHEDULER": {
        "NAME": "CosineAnnealingWarmRestarts",
        "SCHEDULER_PARAMS": {
            "T_0": 19,
            "T_mult": 1,
            "eta_min": 1.0e-7,
            "last_epoch": -1,
            "verbose": True,
            # "NAME": "CosineAnnealingLR",
            # "SCHEDULER_PARAMS": {
            #     "T_max": 16,
            #     "eta_min": 1.0e-7,
            #     "last_epoch": -1,
            #     "verbose": True,
        },
        "CUSTOM": "GradualWarmupSchedulerV2",
        "CUSTOM_PARAMS": {"multiplier": 10, "total_epoch": 1},
        "VAL_STEP": False,
    },
    "CRITERION_TRAIN": {
        "NAME": "BCEWithLogitsLoss",
        "LOSS_PARAMS": {
            "weight": None,
            "size_average": None,
            "reduce": None,
            "reduction": "mean",
            "pos_weight": None
        },
    },
    "CRITERION_VALIDATION": {
        "NAME": "BCEWithLogitsLoss",
        "LOSS_PARAMS": {
            "weight": None,
            "size_average": None,
            "reduce": None,
            "reduction": "mean",
            "pos_weight": None
        },
    },
    "TRAIN_TRANSFORMS": {
        # "RandomResizedCrop": {"height": 384, "width": 384, "scale": [0.9, 1.0], "p": 1},
        "VerticalFlip": {"p": 0.4},
        "HorizontalFlip": {"p": 0.4},
        "ShiftScaleRotate": {"rotate_limit": 10, "p": 0.4},
        "Resize": {"height": 512, "width": 512, "p": 1},
        # "Normalize": {"mean": (0.485, 0.456, 0.406), "std": (0.229, 0.224, 0.225)},

    },
    "VALID_TRANSFORMS": {
        "Resize": {"height": 512, "width": 512, "p": 1},
        # "Normalize": {"mean": (0.485, 0.456, 0.406), "std": (0.229, 0.224, 0.225)},
    },
    "TEST_TRANSFORMS": {
        "Resize": {"height": 512, "width": 512, "p": 1},
        # "Normalize": {"mean": (0.485, 0.456, 0.406), "std": (0.229, 0.224, 0.225)},
    },
    "PATH": {
        "ROOT_DIR": "../input/seti-breakthrough-listen",
        "TRAIN_CSV": "../input/seti-breakthrough-listen/train_labels.csv",
        "TRAIN_PATH": "../input/seti-breakthrough-listen/train",
        "TEST_CSV": "../input/seti-breakthrough-listen/sample_submission.csv",
        "TEST_PATH": "../input/seti-breakthrough-listen/test",
        "WEIGHTS_PATH": "../input/et-alien-weights",
        "OOF_PATH": "",
        "LOG_PATH": "./log.txt"
    },
    "SEED": 19921930,
    "DEVICE": "cuda",
    "GPU": "P100",
}

In [None]:
config = CONFIG

In [None]:
df_train = pd.read_csv(config["PATH"]["TRAIN_CSV"])
df_test  = pd.read_csv(config["PATH"]["TEST_CSV"])

Targets:

- 6000 positives (`target == 1`, the cadence contains a needle)
- 54000 negatives (`target == 0`)

In [None]:
px.histogram(df_train, y="target", color="target", title='Target Distribution')

<a id="2"></a>
<h2 style='background:black; border:0; color:white'><center>Visualizations</center></h2>

In [None]:
def get_train_filename_by_id(image_id: str) -> str:
    """Return the path of the train file for a given id.

    Args:
        image_id (str): Id of a cadence snippet, e.g. cd73ff1954feeb9.
                        image_id[0] is the name of the subfolder that holds
                        all files whose id starts with that character.

    Returns:
        str: Path to the .npy file in the train folder.
    """
    return f"../input/seti-breakthrough-listen/train/{image_id[0]}/{image_id}.npy"


def get_test_filename_by_id(image_id: str) -> str:
    """Return the path of the test file for a given id.

    Args:
        image_id (str): Id of a cadence snippet, e.g. 10013eb3e11e199.
                        image_id[0] is the name of the subfolder that holds
                        all files whose id starts with that character.

    Returns:
        str: Path to the .npy file in the test folder.
    """
    return f"../input/seti-breakthrough-listen/test/{image_id[0]}/{image_id}.npy"

In [None]:
# The easy/hard example ids below may need to be re-picked whenever the competition data is refreshed.

easy_image_1 = get_train_filename_by_id(image_id = 'cd73ff1954feeb9')
easy_image_2 = get_train_filename_by_id(image_id = '0e55f80554f8d36')
medium_image_1 = get_train_filename_by_id(image_id = '886f7aa765d6282')
medium_image_2 = get_train_filename_by_id(image_id = '56fe32cddc6d17e')
# hard_image_1 is difficult because every channel appears to contain a signal, but that is not actually the case.
hard_image_1 = get_train_filename_by_id(image_id = 'cc9526e839463b1')
hard_image_2 = get_train_filename_by_id(image_id = 'a5abd8d2eafd618')
test_image = get_test_filename_by_id(image_id = '10013eb3e11e199')


image_list : List[str] = [easy_image_1, easy_image_2,
                           medium_image_1, medium_image_2,
                           hard_image_1, hard_image_2]

Apply the two functions on train and test csv to add an additional column `file_path` that indicates the filepath of the images. This is useful as we can directly query the paths during training later.

In [None]:
df_train['file_path'] = df_train['id'].apply(get_train_filename_by_id)
df_test['file_path']  = df_test['id'].apply(get_test_filename_by_id)
display(df_train.head())

## Basic Plotting

[REFERENCE: signal-search-exploratory-data-analysis](https://www.kaggle.com/ihelon/signal-search-exploratory-data-analysis)

[REFERENCE: eda-baseline-boolart](https://www.kaggle.com/evilpsycho42/eda-baseline-boolart)


In [None]:
def show_cadence_channels(filename: str, label: int, show_text: bool = True) -> None:
    """Plot the 6 snippets of a cadence as separate images, one per row.

    The panels alternate between ON and OFF observations (panels 1, 3 and 5 are ON).
    The host says extraterrestrial signals should appear only in the ON panels.

    Args:
        filename (str): Path to the .npy cadence file.
        label (int): Target of the cadence, shown in the title.
        show_text (bool): If True, label each panel as ON or OFF.
    """
    plt.figure(figsize=(16, 10))
    # load .npy files into np.array
    image_arr = np.load(filename)

    for i in range(6):
        plt.subplot(6, 1, i + 1)
        if i == 0:
            plt.title(
                f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
        plt.tight_layout()
        plt.imshow(image_arr[i].astype(float),
                   interpolation='nearest', aspect='auto')
        if show_text:
            plt.text(5, 100, ["ON", "OFF"][i % 2], bbox={'facecolor': 'white'})
        plt.xticks(np.arange(0, 255, step=25))
        plt.yticks(np.arange(0, 275, step=100))
        plt.ylabel(ylabel='Time Axis')
    plt.show()


Note that each panel `image_arr[i]` is a 2D array of shape (273, 256), i.e. H x W. `plt.imshow()` draws rows along the vertical axis, so the vertical axis is the time axis (273 steps) and the horizontal axis is the frequency axis (256 bins).

In [None]:
df_tmp = df_train[df_train["target"] == 0].sample(1)
for ind, row in df_tmp.iterrows():
    show_cadence_channels(get_train_filename_by_id(row["id"]), row["target"], show_text=True)

df_tmp = df_train[df_train["target"] == 1].sample(1)
for ind, row in df_tmp.iterrows():
    show_cadence_channels(get_train_filename_by_id(row["id"]), row["target"], show_text=True)

The function `show_cadence_spatial` below does the following:

1. Each file has shape (6, 273, 256): 6 is the number of cadence snippets, 273 is the time axis and 256 is the frequency axis. Each snippet is an H x W array, so when it is plotted with `plt.imshow()` the vertical axis is time and the horizontal axis is frequency.
> Each file has dimension (6, 273, 256), with the 1st dimension representing the 6 positions of the cadence, and the 2nd and 3rd dimensions representing the 2D spectrogram.

2. Given the host's description of the data, we have a few options for arranging the snippets.
    - spatial 6 channels - shape = (273 * 6, 256) or (256, 273 * 6) if you transpose it.
        - We **concatenate** all 6 channels **vertically along the time axis (axis=0)**.
        
    - spatial 3 channels - shape = (273 * 3, 256) or (256, 273 * 3) if you transpose it.
        - We **concatenate** channels 1, 3 and 5 (array indices 0, 2 and 4) **vertically along the time axis (axis=0)**. To me it does not make sense to concatenate them horizontally along the frequency axis: one cadence consists of 6 snippets taken at different times, so concatenating along the time axis gives a "full overview" of the sequence of events. One could also model the sequence with an RNN or another time-series approach, but I won't do that here.
    
    > Not all of the “needle” signals look like diagonal lines, and they may not be present for the entirety of all three “A” observations, but what they do have in common is that they are only present in some or all of the “A” observations (panels 1, 3, and 5 in the cadence snippets).

    - channel wise: the snippets are kept as separate image channels. We won't be using it here because, empirically, it does not perform as well as the spatial arrangement.


![Image](https://i.ibb.co/JFQ44tB/channel-vs-spatial.png)
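
The arrangements listed above can be sketched on a synthetic array as follows. The array values are arbitrary (an assumption, there is no real telescope data here); only the shapes and the stacking order matter.

```python
import numpy as np

# Synthetic cadence with the documented shape (assumption: arbitrary values).
cadence = np.arange(6 * 273 * 256, dtype=np.float32).reshape(6, 273, 256)

# spatial 6 channels: stack all snippets along the time axis.
spatial_6ch = np.vstack(cadence)        # shape (1638, 256)

# spatial 3 channels: keep panels 1, 3, 5 (indices 0, 2, 4), then stack.
spatial_3ch = np.vstack(cadence[::2])   # shape (819, 256)

# Optional transpose: frequency becomes the vertical axis.
spatial_3ch_t = spatial_3ch.T           # shape (256, 819)

# channel wise: the snippets become image channels (H, W, C) instead.
channel_3ch = np.transpose(cadence[::2], (1, 2, 0))  # shape (273, 256, 3)

# vstack preserves order: rows 273..545 of the 3-panel stack are panel 3.
assert np.array_equal(spatial_3ch[273:546], cadence[2])
```

The spatial variants produce a single-channel 2D image, which is why the model further down is created with one input channel for these modes.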

In [None]:
def show_cadence_spatial(filename: str, label: int, show_text: bool = True, spatial_mode: str = 'spatial_3ch', transpose: bool = True) -> None:
    """Plot a cadence as a single image with the snippets stacked along the time axis.

    Args:
        filename (str): Path to the .npy cadence file.
        label (int): Target of the cadence, shown in the title.
        show_text (bool): Unused; kept so the signature matches `show_cadence_channels`.
        spatial_mode (str): 'spatial_3ch' stacks panels 1, 3 and 5; 'spatial_6ch' stacks all 6 panels.
        transpose (bool): If True, transpose the stacked image so that frequency is the vertical axis.
    """
    plt.figure(figsize=(16, 10))
    # load .npy files into np.array
    image_arr = np.load(filename)
    assert image_arr.shape == (6, 273, 256)

    if spatial_mode == 'spatial_3ch':
        # keep panels 1, 3 and 5 (indices 0, 2, 4)
        image_arr = image_arr[::2, :, :].astype(np.float32)
        assert image_arr.shape == (3, 273, 256)
    elif spatial_mode == 'spatial_6ch':
        # keep all 6 panels; astype returns a float32 copy
        image_arr = image_arr.astype(np.float32)
        assert image_arr.shape == (6, 273, 256)
    else:
        raise ValueError(f"Unknown spatial_mode: {spatial_mode}")

    num_rows = image_arr.shape[0] * 273
    # stack the panels vertically along the time axis
    stacked = np.vstack(image_arr)
    assert stacked.shape == (num_rows, 256)
    if transpose:
        stacked = stacked.transpose((1, 0))
        assert stacked.shape == (256, num_rows)

    plt.title(
        f"ID: {os.path.basename(filename)} TARGET: {label}", fontsize=18)
    plt.tight_layout()
    plt.imshow(stacked)
    plt.show()

In [None]:
df_tmp = df_train[df_train["target"] == 0].sample(1)
for ind, row in df_tmp.iterrows():
    show_cadence_spatial(get_train_filename_by_id(row["id"]), row["target"], show_text=True)

df_tmp = df_train[df_train["target"] == 1].sample(1)
for ind, row in df_tmp.iterrows():
    show_cadence_spatial(get_train_filename_by_id(row["id"]), row["target"], show_text=True)

In [None]:
# for images in image_list:
#     show_cadence_channels(filename=images, label = 1)

In [None]:
show_cadence_spatial(filename=easy_image_1, label = 1, transpose=True)

In [None]:
show_cadence_spatial(filename=easy_image_1, label = 1, transpose=False)

## Augmentations

[REFERENCE: search-for-effective-data-augmentation](https://www.kaggle.com/shionhonda/search-for-effective-data-augmentation)

In [None]:
class Transform:
    # This variant appends `ToTensorV2`, which converts an HWC (or HW) numpy array
    # into a CHW torch tensor, so no manual transpose is needed in the Dataset.
    def __init__(self, aug_kwargs: Dict):

        albu_augs = [getattr(A, name)(**kwargs)
                     for name, kwargs in aug_kwargs.items()]
        albu_augs.append(ToTensorV2(p=1))

        self.transform = A.Compose(albu_augs)

    def __call__(self, image: np.ndarray) -> torch.Tensor:
        image = self.transform(image=image)["image"]
        return image



## CNN Embeddings

[REFERENCE: eda-seti-e-t-train-v-s-test-by-cnn-embedding](https://www.kaggle.com/ttahara/eda-seti-e-t-train-v-s-test-by-cnn-embeddings)

In [None]:
def seed_all(seed: int = 1930) -> None:
    """Seed every random number generator used in this notebook.

    Args:
        seed (int, optional): Seed value. Defaults to 1930.
    """

    print("Using Seed Number {}".format(seed))

    # set PYTHONHASHSEED env var at fixed value
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)  # pytorch (both CPU and CUDA)
    torch.cuda.manual_seed(seed)  # current CUDA device
    torch.cuda.manual_seed_all(seed)  # all CUDA devices
    np.random.seed(seed)  # for numpy pseudo-random generator

    # set fixed value for python built-in pseudo-random generator
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # note: this turns cuDNN off entirely, trading speed for reproducibility
    torch.backends.cudnn.enabled = False


def seed_worker(_worker_id: int) -> None:
    """Seed numpy and random inside a DataLoader worker (pass as `worker_init_fn`).

    Args:
        _worker_id (int): Worker index supplied by the DataLoader; unused.
    """
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

In [None]:
seed_all(config['SEED'])

## Dataset

In [None]:
class AlienTrainDataset(Dataset):
    def __init__(self, df: pd.DataFrame, config: Dict, transform: Optional[Transform] = None, mode: str = 'train'):
        self.df = df
        self.config = config
        self.transform = transform
        self.mode = mode

        # paths of the .npy files, one per row of df
        self.file_names: np.ndarray = df['file_path'].values

        # targets; for the test csv this is the placeholder column of the sample submission
        self.labels: np.ndarray = df[config['DATA']['TARGET_COL_NAME']].values

    def __len__(self) -> int:
        """Return the number of cadence snippets in the dataset."""
        return len(self.df)

    def __getitem__(self, idx):
        image_arr = np.load(self.file_names[idx]).astype(np.float32)
        assert image_arr.shape == (6, 273, 256)

        if self.config['DATA']['CHANNEL_MODE'] == 'spatial_6ch':
            
            if self.config['DATA']['IS_TRANSPOSE']:
                image_arr = np.vstack(image_arr).transpose((1, 0))
                assert image_arr.shape == (256, 1638)
            else:
                image_arr = np.vstack(image_arr)
                assert image_arr.shape == (1638, 256)

        elif self.config['DATA']['CHANNEL_MODE'] == 'spatial_3ch':
            image_arr = image_arr[::2, :, :].astype(np.float32)
            assert image_arr.shape == (3, 273, 256)
            # print(np.array_equal(image_arr_3ch, image[::2].astype(np.float32), equal_nan=False))
            if self.config['DATA']['IS_TRANSPOSE']:
                image_arr = np.vstack(image_arr).transpose((1, 0))
                assert image_arr.shape == (256, 819)
            else:
                image_arr = np.vstack(image_arr)
                assert image_arr.shape == (819, 256)

        elif self.config['DATA']['CHANNEL_MODE'] == '6_channel':
            image_arr = image_arr.astype(np.float32)
            image_arr = np.transpose(image_arr, (1, 2, 0))

        elif self.config['DATA']['CHANNEL_MODE'] == '3_channel':
            image_arr = image_arr[::2].astype(np.float32)
            image_arr = np.transpose(image_arr, (1, 2, 0))

        if self.transform:
            image_arr = self.transform(image_arr)
        else:
            image_arr = torch.from_numpy(image_arr).float()

        if self.mode == 'test':
            return {"image": image_arr}
        else:
            label = torch.tensor(self.labels[idx]).float()
            return {"image": image_arr, "target": label}


In [None]:
train_dataset = AlienTrainDataset(df=df_train, config=config,
                                  transform=Transform(config["TRAIN_TRANSFORMS"]),
                                  mode='train')

for i in range(2):
    image, label = train_dataset[i]['image'], train_dataset[i]['target']
    plt.imshow(image[0])
    plt.title(f'label: {label}')
    plt.show()
image.shape

## Model

In [None]:
sigmoid = torch.nn.Sigmoid()


class Swish(torch.autograd.Function):
    @staticmethod
    def forward(ctx, i):
        result = i * sigmoid(i)
        ctx.save_for_backward(i)
        return result

    @staticmethod
    def backward(ctx, grad_output):
        i = ctx.saved_tensors[0]  # `saved_variables` is deprecated
        sigmoid_i = sigmoid(i)
        return grad_output * (sigmoid_i * (1 + i * (1 - sigmoid_i)))


class Swish_Module(torch.nn.Module):
    def forward(self, x):
        return Swish.apply(x)

In [None]:
class AlienSingleHead(torch.nn.Module):
    """A custom model."""

    def __init__(
        self,
        config: Dict,
        pretrained: bool = True,
    ):
        """Construct a custom model."""
        super().__init__()
        self.config = config
        self.pretrained = pretrained
        print("Pretrained is {}".format(self.pretrained))
        self.activation = Swish_Module()
        self.architecture = {
            "backbone": None,
            "bottleneck": None,
            "classifier_head": None,
        }

        _model_factory = (
            timm.create_model
            if self.config["MODEL"]["MODEL_FACTORY"] == "timm"
            else geffnet.create_model
        )
        # spatial modes stack the snippets into one 2D image, so the input has 1 channel;
        # otherwise 3 input channels are assumed (this matches the '3_channel' mode only)
        is_spatial = config['DATA']['CHANNEL_MODE'] in ('spatial_6ch', 'spatial_3ch')
        self.model = _model_factory(
            model_name=self.config["MODEL"]["MODEL_NAME"],
            pretrained=self.pretrained,
            in_chans=1 if is_spatial else 3,
        )

        # reset head
        self.model.reset_classifier(num_classes=0, global_pool="avg")
        # after resetting, there is no longer any classifier head, therefore it is the backbone now.
        self.architecture["backbone"] = self.model
        # get out features of the last cnn layer from backbone, which is also the in features of the next layer

        self.in_features = self.architecture["backbone"].num_features
        print(self.in_features)

        # self.single_head_fc = torch.nn.Sequential(
        #     torch.nn.Linear(self.in_features, self.config["DATA"]["NUM_CLASSES"])
        # )
        self.single_head_fc = torch.nn.Sequential(
            torch.nn.Linear(self.in_features, self.in_features),
            self.activation,
            torch.nn.Dropout(p=0.5),
            torch.nn.Linear(self.in_features, self.config["DATA"]["NUM_CLASSES"]),
        )
        self.architecture["classifier_head"] = self.single_head_fc


    # feature map after cnn layer
    def extract_features(self, x):
        feature_logits = self.architecture["backbone"](x)
        assert feature_logits.shape[1] == self.in_features, (
            "feature_logits is the output of the CNN backbone and the input to the head, "
            "so its shape must be [batch_size, in_features], e.g. [4, in_features] for a batch of 4."
        )
        
        # TODO: caution, if you use forward_features, then you need reshape. See test.py
        return feature_logits

    def forward(self, x):
        feature_logits = self.extract_features(x)
        # print(self.architecture["classifier_head"][3])
        classifier_logits = self.architecture["classifier_head"](feature_logits)
        return classifier_logits


`torch.nn.Linear` takes the following parameters:

- in_features – size of each input sample
- out_features – size of each output sample
- bias – If set to False, the layer will not learn an additive bias. Default: True

---


It applies a linear transformation to the incoming data: y = xW^T + b.  
Our incoming data x is the backbone output, which has shape [4, in_features] for a batch of 4. If the head is the plain `torch.nn.Linear(self.in_features, self.config["DATA"]["NUM_CLASSES"])`, then the feature logits must have a compatible shape: W has shape [NUM_CLASSES, in_features], so the product xW^T is only defined when the last dimension of x equals in_features.
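
The shape requirement can be checked with a small NumPy stand-in for `torch.nn.Linear`. Using NumPy instead of PyTorch is an assumption made to keep the sketch dependency-free, and the sizes below are illustrative rather than read from the model.

```python
import numpy as np

batch_size, in_features, num_classes = 4, 1536, 1  # illustrative sizes

rng = np.random.default_rng(0)
x = rng.normal(size=(batch_size, in_features))   # stand-in for the backbone output
W = rng.normal(size=(num_classes, in_features))  # same layout as nn.Linear.weight
b = np.zeros(num_classes)

y = x @ W.T + b  # y = xW^T + b
print(y.shape)   # (4, 1)

# A mismatched feature dimension makes the product undefined.
bad_x = rng.normal(size=(batch_size, in_features - 1))
try:
    bad_x @ W.T
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
print(mismatch_raises)
```

This is the same condition that the assertion inside `extract_features` guards against.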

In [None]:
model = AlienSingleHead(config,pretrained=False)
train_dataset = AlienTrainDataset(df_train, config, transform=Transform(config["TRAIN_TRANSFORMS"]))
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True,
                          num_workers=4, pin_memory=True, drop_last=True)

for data in train_loader:
    image, label = data['image'], data['target']
    output = model(image)
    print(output)
    break

In [None]:
dataset_train = AlienTrainDataset(
    config=config,
    df=df_train,
    mode="train",
    transform=Transform(config["VALID_TRANSFORMS"]), # use valid transforms no augmentation
)
dataset_test = AlienTrainDataset(
    config=config,
    df=df_test,
    mode="test",
    transform=Transform(config["TEST_TRANSFORMS"]),
)

train_loader = torch.utils.data.DataLoader(
    dataset_train,
    # sampler=RandomSampler(dataset_train),
    **config["VALIDATION"]["DATALOADER"],  # do not use the train dataloader settings: shuffle=True would break the row order that the embeddings rely on
)
test_loader = torch.utils.data.DataLoader(
    dataset_test, **config["TEST"]["DATALOADER"]
)



In [None]:
efficientnet_b3 = AlienSingleHead(config, pretrained=False)

In [None]:
model_path = "../input/et-alien-weights/efficientnet_b3_fold_1_epoch_18.pt"
efficientnet_b3.load_state_dict(torch.load(model_path, map_location=config['DEVICE'])["model_state_dict"])

In [None]:
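def extract_features(model, loader):
    """Run the model over a loader and collect the backbone embeddings and head logits.

    When you plot the feature embeddings, there should be a clear boundary between classes. See ArcFace MNIST Notebook.

    Args:
        model (AlienSingleHead): Trained model exposing `extract_features` and `architecture`.
        loader (DataLoader): Non-shuffled loader yielding dicts with an "image" key.

    Returns:
        Tuple[np.ndarray, np.ndarray]: Embeddings of shape (N, in_features) and logits of shape (N, NUM_CLASSES).
    """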
    model.to(config['DEVICE'])
    model.eval()
    emb_list = []
    pred_list = []
    with torch.no_grad():
        for batch in tqdm(loader):
            x = batch['image'].to(config['DEVICE'])
            embeddings = model.extract_features(x) 
            # embeddings = model.architecture["backbone"](x) 
            y_logits   = model.architecture["classifier_head"](embeddings) 

            emb_list.append(embeddings.detach().cpu().numpy())
            pred_list.append(y_logits.detach().cpu().numpy())

        emb_arr = np.concatenate(emb_list)
        pred_arr = np.concatenate(pred_list)
        del emb_list
        del pred_list
    return emb_arr, pred_arr

In [None]:
train_emb, train_pred = extract_features(efficientnet_b3, train_loader)

In [None]:
test_emb, test_pred = extract_features(efficientnet_b3, test_loader)

In [None]:
print(train_emb.shape, train_pred.shape)
print(test_emb.shape, test_pred.shape)

In [None]:
del efficientnet_b3, train_loader, test_loader
torch.cuda.empty_cache()
gc.collect()

In [None]:
all_emb = np.concatenate([train_emb, test_emb], axis=0)
all_pred = np.concatenate([train_pred, test_pred], axis=0)
print(all_emb.shape, all_pred.shape)

In [None]:
all_df = pd.concat([df_train, df_test], axis=0, ignore_index=True)
all_df["target"].value_counts()

In [None]:
all_df["data_type"] = ""
all_df.loc[all_df.target == 1.0, "data_type"] = "train_pos"
all_df.loc[all_df.target == 0.0, "data_type"] = "train_neg"
all_df.loc[all_df.target == 0.5, "data_type"] = "test"
all_df["data_type"].value_counts()

In [None]:
tsne = cuml.TSNE(n_components=2, perplexity=10.0)
all_emb_2d = tsne.fit_transform(all_emb)

neg_emb_2d = all_emb_2d[all_df.query("data_type == 'train_neg'").index.values]
pos_emb_2d = all_emb_2d[all_df.query("data_type == 'train_pos'").index.values]
test_emb_2d = all_emb_2d[all_df.query("data_type == 'test'").index.values]

In [None]:
fig = plt.figure(figsize=(20,20))
ax_neg = fig.add_subplot(2,2,1)
ax_pos = fig.add_subplot(2,2,2)
ax_posneg = fig.add_subplot(2,2,3)

ax_neg.scatter(neg_emb_2d[:, 0],neg_emb_2d[:, 1],color='red',s=10,label='train_non-needles', alpha=0.3)
ax_neg.legend(fontsize=13)
ax_neg.set_title('non-"needles" in Train', fontsize=18)
ax_pos.scatter(pos_emb_2d[:, 0],pos_emb_2d[:, 1],color='blue',s=10,label='train_needles', alpha=0.3)
ax_pos.legend(fontsize=13)
ax_pos.set_title('"needles" in Train', fontsize=18)

ax_posneg.scatter(neg_emb_2d[:, 0],neg_emb_2d[:, 1],color='red',s=10,label='train_non-needles', alpha=0.3)
ax_posneg.scatter(pos_emb_2d[:, 0],pos_emb_2d[:, 1],color='blue',s=10,label='train_needles', alpha=0.3)
ax_posneg.legend(fontsize=13)
ax_posneg.set_title('"needles" v.s. non-"needles" in Train', fontsize=18)

In [None]:
fig = plt.figure(figsize=(20,25))

ax_posneg = fig.add_subplot(3,2,1)
ax_test = fig.add_subplot(3,2,2)
ax_negtest = fig.add_subplot(3,2,3)
ax_postest = fig.add_subplot(3,2,4)
ax_all = fig.add_subplot(3,2,5)

ax_posneg.scatter(neg_emb_2d[:, 0],neg_emb_2d[:, 1],color='red',s=10, label='train_non-needles', alpha=0.3)
ax_posneg.scatter(pos_emb_2d[:, 0],pos_emb_2d[:, 1],color='blue',s=10, label='train_needles', alpha=0.3)
ax_posneg.legend(fontsize=13)
ax_posneg.set_title('"needles" v.s. non-"needles" in Train', fontsize=18)

ax_test.scatter(test_emb_2d[:, 0],test_emb_2d[:, 1],color='limegreen',s=10, label='test_examples', alpha=0.3)
ax_test.legend(fontsize=13)
ax_test.set_title('examples in Test', fontsize=18)

ax_negtest.scatter(test_emb_2d[:, 0],test_emb_2d[:, 1],color='limegreen',s=10, label='test_examples', alpha=0.3)
ax_negtest.scatter(neg_emb_2d[:, 0],neg_emb_2d[:, 1],color='red',s=10, label='train_non-needles', alpha=0.3)
ax_negtest.legend(fontsize=13)
ax_negtest.set_title('non-"needles" in Train  v.s. examples in Test', fontsize=18)

ax_postest.scatter(test_emb_2d[:, 0],test_emb_2d[:, 1],color='limegreen',s=10, label='test_examples', alpha=0.3)
ax_postest.scatter(pos_emb_2d[:, 0],pos_emb_2d[:, 1],color='blue',s=10, label='train_needles', alpha=0.3)
ax_postest.legend(fontsize=13)
ax_postest.set_title('"needles" in Train  v.s. examples in Test', fontsize=18)

ax_all.scatter(test_emb_2d[:, 0],test_emb_2d[:, 1],color='limegreen',s=10, label='test_examples', alpha=0.3)
ax_all.scatter(neg_emb_2d[:, 0],neg_emb_2d[:, 1],color='red',s=10, label='train_non-needles', alpha=0.3)
ax_all.scatter(pos_emb_2d[:, 0],pos_emb_2d[:, 1],color='blue',s=10, label='train_needles', alpha=0.3)
ax_all.legend(fontsize=13)
ax_all.set_title('Train v.s. Test', fontsize=18)

In [None]:
##### AUGMENTATIONS
# A.Resize(height  = image_size, 
#                            width   = image_size),
# note ToTensorV2 expands grayscale to one more dim.
augs = A.Compose([
                  A.transforms.Normalize(mean=(-0.0001,), std=(0.9055,), max_pixel_value=255.0, always_apply=False, p=1.0),
                  ToTensorV2(p=1.0),
                  
                  ])#

In [None]:
# ###### DATASET & DATALOADER

# # dataset
# image_dataset = ImageData(df        = df, 
#                           transform = augs)

# # data loader
# image_loader = DataLoader(image_dataset, 
#                           batch_size  = batch_size, 
#                           shuffle     = False, 
#                           num_workers = num_workers)

In [None]:
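# NOTE: `image_dataset` is only created in the commented-out cell above, and `ImageData` is not defined in this notebook.
filenames: List = [_id.split("/")[-1] for _id in image_dataset.file_names]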

In [None]:
filenames.index(hard_image_1.split("/")[-1])

In [None]:
image_dataset[47987]

In [None]:
image_dataset[47987].shape

In [None]:
plt.imshow(image_dataset[47987][0], cmap='gray')

<a id="3"></a>
<h2 style='background:black; border:0; color:white'><center>Targets</center></h2>

#### Easy to find
![](https://i.imgur.com/5ohQpvE.png)

#### Medium
![](https://i.imgur.com/Pz6YdoV.png)
![](https://i.imgur.com/81jL2N7.png)

#### Hard
![](https://i.imgur.com/Sgu0k7n.png)

<a id="10"></a>
<h2 style='background:black; border:0; color:white'><center>Competition Metric</center></h2>

Submissions are evaluated on [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) between the predicted probability and the observed target.
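
For intuition, the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below implements that definition in plain Python; `pairwise_auc` is a hypothetical helper written for illustration, not the competition's scoring code.

```python
from itertools import product


def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Constant predictions carry no ranking information.
constant = pairwise_auc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
# Three of the four (positive, negative) pairs are ordered correctly.
ranked = pairwise_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(constant, ranked)  # 0.5 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predictions leaves the metric unchanged.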

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve, auc

In [None]:
list_y_true = [
    [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
    [1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
    [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.], #  IMBALANCE
]
list_y_pred = [
    [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    [0.9, 0.9, 0.9, 0.9, 0.1, 0.9, 0.9, 0.1, 0.9, 0.1, 0.1, 0.5],
    [1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], #  IMBALANCE
]

for y_true, y_pred in zip(list_y_true, list_y_pred):
    fpr, tpr, _ = roc_curve(y_true, y_pred)
    roc_auc = auc(fpr, tpr)

    plt.figure(figsize=(5, 5))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
    plt.xlim([-0.01, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()

<a id="20"></a>
<h2 style='background:black; border:0; color:white'><center>Sample Submission</center></h2>

In [None]:
df_submission = pd.read_csv("../input/seti-breakthrough-listen/sample_submission.csv")
df_submission

In [None]:
df_submission["target"] = 0.51
df_submission.to_csv("submission.csv", index=False)

<a id="30"></a>
<h2 style='background:black; border:0; color:white'><center>Prepared Submission</center></h2>

I experimented with these two excellent kernels, trying to retrain and ensemble them:   
[SETI / NFNet_l0 starter [inference]](https://www.kaggle.com/yasufuminakama/seti-nfnet-l0-starter-inference)   
[SETI-BL: TF Starter TPU 🚀](https://www.kaggle.com/awsaf49/seti-bl-tf-starter-tpu)

In [None]:
df_prepared = pd.read_csv("../input/signal-search-submissions/submission_2021-05-13_20-00-00.csv", index_col=0)
df_prepared.to_csv("submission_2021-05-13_20-00-00.csv")
df_prepared = pd.read_csv("../input/signal-search-submissions/submission_2021-05-13_21-00-00.csv", index_col=0)
df_prepared.to_csv("submission_2021-05-13_21-00-00.csv")

## WORK IN PROGRESS...