# CNNs for SETI ET signal detection! (using fastai and PyTorch)

**BEFORE YOU COPY AND EDIT NOTEBOOK, PLEASE SUPPORT AND UPVOTE**

Spectrograms can also be represented as images and passed into CNNs, so here we will train a CNN with PyTorch and fastai, starting from ImageNet-pretrained weights. I will walk you through all of the code, so you can use this notebook as a jumping-off point for your own experiments! 🙂

If you want to learn more about this competition, please check out my EDA over [here](https://www.kaggle.com/tanlikesmath/seti-simple-eda-to-help-you-get-started)!

In [None]:
!pip install --upgrade fastai

In [None]:
!pip install timm

## Setup
Here, let's import the required modules and set a random seed for reproducibility.

In [None]:
import cv2
import pandas as pd
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from fastai.vision.all import *

In [None]:
set_seed(999,reproducible=True)

Now let's load our CSV file and process it.

In [None]:
dataset_path = Path('../input/seti-breakthrough-listen')
df = pd.read_csv(dataset_path/'train_labels.csv')

In [None]:
df['path'] = df['id'].apply(lambda x: str(dataset_path/'train'/x[0]/x)+'.npy') #adding the path for each id for easier processing

In [None]:
df.head()
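
The path logic can be checked in isolation. The sketch below assumes, as the lambda above does, that each file sits in a subdirectory named after the first character of its id. The id `abc123` and the helper name `id_to_path` are made up for illustration.

```python
from pathlib import PurePosixPath

def id_to_path(root, split, sample_id):
    """Hypothetical helper mirroring the lambda: <root>/<split>/<first char of id>/<id>.npy"""
    return str(PurePosixPath(root) / split / sample_id[0] / sample_id) + ".npy"

# "abc123" is an invented id, used only to show the resulting layout
example_path = id_to_path("../input/seti-breakthrough-listen", "train", "abc123")
print(example_path)
```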

## PyTorch Dataset

We can make a simple dataset class as shown below. Here, I have written a class that supports both the channel and spatial approaches, each with either 3 or 6 channels, which were the main approaches discussed on the forums. For this notebook, we will train with the spatial approach as mentioned [here](https://www.kaggle.com/c/seti-breakthrough-listen/discussion/238611).

In [None]:
class SETIDataset:
    def __init__(self, df, spatial=True, sixchan=True):
        self.df = df
        self.spatial = spatial
        self.sixchan = sixchan
        
        
    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        row = self.df.iloc[index]
        label = row.target
        data = np.load(row.path).astype(np.float32)
        # optionally keep only arrays 0, 2 and 4 of the six stored in each file
        if not self.sixchan: data = data[::2]
        if self.spatial:
            # stack all arrays into one 2D image, then transpose it
            data = np.vstack(data).transpose((1, 0))
            data = cv2.resize(data, dsize=(256,256))
            data_tensor = torch.tensor(data).float().unsqueeze(0)  # add a single channel dimension
        else:
            # treat each array as a channel; cv2 expects channels last
            data = np.transpose(data, (1,2,0))
            data = cv2.resize(data, dsize=(256,256))
            data = np.transpose(data, (2, 0, 1)).astype(np.float32)  # back to channels first
            data_tensor = torch.tensor(data).float()

        return (data_tensor, torch.tensor(label))
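
To make the difference between the two approaches concrete, here is a small NumPy-only sketch of the array shapes involved. It assumes each `.npy` file holds an array of shape `(6, 273, 256)`, uses random data as a stand-in for a real file, and leaves out the `cv2.resize` step.

```python
import numpy as np

# Stand-in for one np.load(...) result; the (6, 273, 256) shape is an assumption
cadence = np.random.rand(6, 273, 256).astype(np.float32)

# Spatial approach: stack the six arrays vertically, then transpose
spatial = np.vstack(cadence).transpose((1, 0))

# Channel approach with sixchan=False: keep arrays 0, 2, 4 and move channels last
three = cadence[::2]
channels_last = np.transpose(three, (1, 2, 0))

print(spatial.shape, channels_last.shape)
```

In the spatial case the network sees one wide single-channel image, while in the channel case each array becomes its own input channel.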

Split the `DataFrame` into `train_df` and `valid_df` in a reproducible manner.

In [None]:
train_df, valid_df = train_test_split(df, test_size=0.2, random_state=999)

In [None]:
train_ds = SETIDataset(train_df)
valid_ds = SETIDataset(valid_df)

bs = 128
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=bs, num_workers=8, shuffle=True)  # reshuffle training samples every epoch
valid_dl = torch.utils.data.DataLoader(valid_ds, batch_size=bs, num_workers=8)

Now we will use fastai functionality to wrap the dataloaders into the `DataLoaders` class, which gathers all the dataloaders into a single object which can be passed into fastai's `Learner`. You can see how flexible fastai really is: you can use any custom PyTorch DataLoader!

In [None]:
dls = DataLoaders(train_dl, valid_dl)

## Training

We will use Zachary Mueller's `timm_learner` function to create an already-instantiated `Learner` object with the `DataLoaders` and an appropriately defined CNN model taken from Ross Wightman's amazing `timm` package. The code for `timm_learner` (see the hidden cell below) is based on fastai's `cnn_learner` function. We can tell `timm_learner` which CNN backbone to use, as well as the number of input channels and outputs, and fastai automatically defines the appropriate model. We also pass in the metrics and the loss function. fastai's default optimizer is AdamW, but here we pass `opt_func=ranger` to use the Ranger optimizer instead. Finally, we can also use mixed precision training easily by calling `to_fp16()`.

We'll use a simple ImageNet-pretrained ResNeXt50_32x4d model.

In [None]:
from timm import create_model
from fastai.vision.learner import _update_first_layer

def create_timm_body(arch:str, pretrained=True, cut=None, n_in=3):
    "Creates a body from any model in the `timm` library."
    model = create_model(arch, pretrained=pretrained, num_classes=0, global_pool='')
    _update_first_layer(model, n_in, pretrained)
    if cut is None:
        ll = list(enumerate(model.children()))
        cut = next(i for i,o in reversed(ll) if has_pool_type(o))
    if isinstance(cut, int): return nn.Sequential(*list(model.children())[:cut])
    elif callable(cut): return cut(model)
    else: raise TypeError("cut must be either integer or function")
        
def create_timm_model(arch:str, n_out, cut=None, pretrained=True, n_in=3, init=nn.init.kaiming_normal_, custom_head=None,
                     concat_pool=True, **kwargs):
    "Create custom architecture using `arch`, `n_in` and `n_out` from the `timm` library"
    body = create_timm_body(arch, pretrained, cut, n_in)  # forward `cut` instead of discarding it
    if custom_head is None:
        nf = num_features_model(nn.Sequential(*body.children()))
        head = create_head(nf, n_out, concat_pool=concat_pool, **kwargs)
    else: head = custom_head
    model = nn.Sequential(body, head)
    if init is not None: apply_init(model[1], init)
    return model

def timm_learner(dls, arch:str, loss_func=None, pretrained=True, cut=None, splitter=None,
                y_range=None, config=None, n_in=3, n_out=None, normalize=True, **kwargs):
    "Build a convnet style learner from `dls` and `arch` using the `timm` library"
    if config is None: config = {}
    if n_out is None: n_out = get_c(dls)
    assert n_out, "`n_out` is not defined, and could not be inferred from data, set `dls.c` or pass `n_out`"
    if y_range is None and 'y_range' in config: y_range = config.pop('y_range')
    # pass `cut` (not the splitter) as the cut point, and honor a user-supplied `splitter`
    model = create_timm_model(arch, n_out, cut, pretrained, n_in=n_in, y_range=y_range, **config)
    learn = Learner(dls, model, loss_func=loss_func, splitter=default_split if splitter is None else splitter, **kwargs)
    if pretrained: learn.freeze()
    return learn
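
With `n_in=1`, the pretrained first convolution has to be adapted from 3 input channels to 1. To my understanding, fastai's `_update_first_layer` handles this by summing the pretrained kernels over the input-channel axis; treat that as an assumption and check the fastai source for your version. The NumPy sketch below only illustrates why summing is a sensible choice: a summed kernel applied to a single-channel patch gives the same response as the original kernel applied to that patch copied into all three channels. The kernel shape `(64, 3, 7, 7)` and the random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
w_rgb = rng.normal(size=(64, 3, 7, 7))   # stand-in for pretrained first-layer kernels
patch = rng.normal(size=(7, 7))          # one single-channel input patch

# Response of the 3-channel kernels to the patch replicated across R, G and B
out_rgb = np.einsum("oikl,ikl->o", w_rgb, np.stack([patch] * 3))

# Response of the summed (1-channel) kernels to the single-channel patch
w_gray = w_rgb.sum(axis=1, keepdims=True)
out_gray = np.einsum("oikl,ikl->o", w_gray, patch[None])

print(np.allclose(out_rgb, out_gray))
```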

In [None]:
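def roc_auc(preds,targ):
    # AUC only depends on the ranking of the scores, so raw logits can be scored directly
    try: return roc_auc_score(targ.cpu(),preds.squeeze().cpu())
    except ValueError: return 0.5  # AUC is undefined when a batch contains only one class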
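
The metric receives raw model outputs rather than probabilities. That is fine for AUC, because it depends only on how the scores are ranked. The sketch below is a hand-rolled pairwise AUC (not scikit-learn's implementation; `pairwise_auc` is a name invented here) on made-up scores, showing that applying a sigmoid leaves the value unchanged and that a single-class input has no defined AUC.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC is undefined when only one class is present")
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

logits = np.array([-2.0, -0.5, 0.3, 1.7, 0.1])   # made-up model outputs
labels = np.array([0, 1, 0, 1, 0])
probs = 1.0 / (1.0 + np.exp(-logits))            # sigmoid is monotonic, so ranks are kept

auc_logits = pairwise_auc(logits, labels)
auc_probs = pairwise_auc(probs, labels)
print(auc_logits, auc_probs)
```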

In [None]:
learn = timm_learner(dls,'resnext50_32x4d',pretrained=True,n_in=1,n_out=1,metrics=[roc_auc], opt_func=ranger, loss_func=BCEWithLogitsLossFlat()).to_fp16()
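
`BCEWithLogitsLossFlat` expects raw logits and applies the sigmoid internally, which is why the model has a single output with no final activation. As a rough illustration of what such a loss computes, here is a NumPy sketch of the standard numerically stable formulation, compared against the naive "sigmoid, then cross-entropy" version. It is not fastai's or PyTorch's code, and the function names and inputs are made up.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits (stable form)."""
    x = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    return float(np.mean(np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))))

def bce_naive(logits, targets):
    """Reference version: sigmoid first, then cross-entropy on probabilities."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=np.float64)))
    y = np.asarray(targets, dtype=np.float64)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

logits = np.array([-3.0, -0.2, 0.4, 2.5])   # made-up model outputs
targets = np.array([0, 1, 0, 1])
stable = bce_with_logits(logits, targets)
naive = bce_naive(logits, targets)
print(stable, naive)
```

The stable form avoids computing `log(0)` when a logit is very large in magnitude, where the naive version would break down.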

fastai provides a useful function to help find a good learning rate:

In [None]:
learn.lr_find()

The idea is that the learning rate at which the loss decreases most steeply is likely a good choice. In this case, that is around 3e-2.
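
The "steepest decrease" heuristic can be sketched in a few lines. The example below uses a synthetic loss curve rather than real `lr_find` output, and it is only an illustration of the idea, not fastai's suggestion logic.

```python
import numpy as np

log_lrs = np.linspace(-6, 0, 61)          # log10 of the learning rates tried
lrs = 10.0 ** log_lrs
losses = -np.tanh(2 * (log_lrs + 2))      # synthetic curve, constructed to drop fastest near lr = 1e-2

slopes = np.gradient(losses, log_lrs)     # d(loss) / d(log10 lr)
steepest_lr = lrs[np.argmin(slopes)]      # learning rate with the most negative slope
print(steepest_lr)
```

Working in log space matters here, because the learning rates are spaced exponentially during the sweep.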

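Let's fine-tune the frozen pretrained model with fastai's `fit_one_cycle` function, which uses a one-cycle learning rate schedule. The call below trains for 3 epochs with a maximum learning rate of 0.1 and adds a `ReduceLROnPlateau` callback. Note that no `wd` argument is passed, so the optimizer's default weight decay applies; pass `wd` to `fit_one_cycle` if you want stronger regularization against overfitting.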

In [None]:
learn.fit_one_cycle(3, 0.1, cbs=[ReduceLROnPlateau()])
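
For intuition, here is a minimal sketch of a one-cycle schedule: the learning rate warms up along a cosine curve to its maximum, then anneals down to a very small value. The parameter values `div=25`, `div_final=1e5` and `pct_start=0.25` are what I believe fastai's defaults to be, so treat them as assumptions. The sketch also ignores momentum scheduling and any effect of the `ReduceLROnPlateau` callback.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max=0.1, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pos` in [0, 1]."""
    if pos < pct_start:
        # warm-up phase: lr_max/div -> lr_max
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # annealing phase: lr_max -> lr_max/div_final
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))

schedule = [one_cycle_lr(i / 10) for i in range(11)]
print([f"{lr:.4g}" for lr in schedule])
```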

In [None]:
learn.recorder.plot_loss()

In [None]:
learn = learn.to_fp32()

Let's save our model in case we need it later:

In [None]:
learn.save('resnext50_32x4d-3epochs')
learn = learn.load('resnext50_32x4d-3epochs')

## Inference

Inference is also quite straightforward. Let's load the sample submission CSV file and create our test dataloader.

In [None]:
test_df = pd.read_csv(dataset_path/'sample_submission.csv')
test_df['path'] = test_df['id'].apply(lambda x: str(dataset_path/'test'/x[0]/x)+'.npy')
test_ds = SETIDataset(test_df)

bs = 128
test_dl = torch.utils.data.DataLoader(test_ds, batch_size=bs, num_workers=8, shuffle=False)

While fastai provides inference functions if we use their specific data API, in this case we used plain PyTorch dataloaders. So we'll just have to iterate over the dataloader and apply the model:

In [None]:
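preds = []
learn.model.eval()  # disable dropout and use running batch-norm statistics
for xb, _ in tqdm(test_dl):
    with torch.no_grad(): output = learn.model(xb.cuda())
    # (bs, 1) -> (bs,); squeezing only the last axis is safe for a final batch of one
    preds.append(torch.sigmoid(output.float()).squeeze(-1).cpu())
preds = torch.cat(preds)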
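
One detail worth keeping in mind when collecting per-batch outputs of shape `(batch, 1)`: squeezing only the last axis preserves the batch dimension even when the final batch has a single sample. The sketch below uses NumPy arrays as a stand-in for tensors (the same shape rules apply to `torch.Tensor.squeeze`), with made-up values.

```python
import numpy as np

last_batch = np.array([[0.7]])           # output for a final batch of one, shape (1, 1)
full_batch = np.array([[0.1], [0.9]])    # shape (2, 1)

collapsed = last_batch.squeeze()         # shape (): the batch dimension is lost
kept = last_batch.squeeze(-1)            # shape (1,): the batch dimension survives

# Concatenation needs at least one dimension per piece, so use the last-axis squeeze
stacked = np.concatenate([full_batch.squeeze(-1), kept])
print(collapsed.shape, kept.shape, stacked.shape)
```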

Create the submission file and save:

In [None]:
sample_df = pd.read_csv(dataset_path/'sample_submission.csv')
sample_df['target'] = preds.numpy()  # convert the tensor to a NumPy array before assigning
sample_df.to_csv('submission.csv', index=False)

Now, **WE ARE DONE!**

If you enjoyed this notebook, please give it an upvote. If you have any questions or suggestions, please leave a comment!