This notebook breaks down the code used to train the second part of the model, the LSTM, based on the code from the [2nd place repo here](https://github.com/darraghdog/rsna/blob/master/scripts/trainlstm.py).

### Main Observations
I really enjoyed taking a look under the hood of this LSTM training script. The main thing that stood out for me was the need **to be very careful when handling the data with the custom Dataset and Collate Function**. Making sure to keep the correct sequence of images when processing the data and feeding it to the model was the goal here.

Batches containing sequences of different lengths were also handled nicely: shorter sequences are padded with dummy embeddings, which are masked out before the loss is calculated. The LSTM model is surprisingly easy to understand, yet clearly very effective.

### Loading already trained embeddings

The winners also enabled us to download the 2nd Place Stage 1 train and validation embeddings and the Stage 2 test embeddings from their trained Resnext101 model. 

As described in their repo, the following commands download the Stage 1 train, validation and test embeddings. The `gdown` package is used because the file on Google Drive is quite large (16 GB):

`pip install gdown
gdown https://drive.google.com/uc?id=13hqPFdCjoMxtAwF863J3Dk33TcBN_wie -O resnext101v12fold1.tar.gz
gunzip resnext101v12fold1.tar.gz
tar -xvf resnext101v12fold1.tar`

The Stage 2 test embeddings can then be downloaded: 

`gdown https://drive.google.com/uc?id=1YxCJ0mWIYXfYLN15DPpQ6OLSt4Y54Hp0`

### Core Modelling Parameters
- Model : Custom LSTM, from stage 1 Kaggle Toxic comp
- Epochs : 10
- Folds : 2 (this notebook doesn't implement folds; the 2nd place solution only had time to train 2 folds, although the authors wanted to train more)
- Optimizer : Adam
- Batch Size : 4

### This Notebook
The initial cells are a little messy as they are a copy and paste from the Python script, and I don't have a huge amount of time to reformat them to be more suitable for Jupyter.

Also, this notebook is a modification from the original training script in the following ways:
- The Stage-1 train and validation datasets were used as they were available to download from the 2nd place solution repo
- The Stage-2 test dataset was used, in order to submit a prediction to Kaggle
- Folds were not used as I just wanted to get a demo working, and not spend gpu time replicating the initial score
- You might have to tweak the directory settings to match your own folder names/structure
- The section around processing and loading the saved embeddings looks different to the original script, but the outcome is the same

### Imports...

In [1]:
import numpy as np
import csv, gzip, os, sys, gc
import math
import traceback  # needed by gpu_mem_restore_ctx further down
import torch
from torch import nn
import torch.optim as optim
from torch.nn import functional as F

import logging
import datetime
import optparse
import pandas as pd
import ast
from sklearn.metrics import log_loss
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from scipy.ndimage import uniform_filter
from torch.optim.lr_scheduler import StepLR

from apex.parallel import DistributedDataParallel as DDP
from apex.fp16_utils import *
from apex import amp, optimizers
from apex.multi_tensor_apply import multi_tensor_applier

### Keep scrolling...

In [2]:
# Parse the command-line options (the defaults are used when run inside the notebook)
parser = optparse.OptionParser()
parser.add_option('-s', '--seed', action="store", dest="seed", help="model seed", default="1234")
parser.add_option('-o', '--fold', action="store", dest="fold", help="Fold for split", default="0")
parser.add_option('-p', '--nbags', action="store", dest="nbags", help="Number of bags for averaging", default="4")
parser.add_option('-e', '--epochs', action="store", dest="epochs", help="epochs", default="10")
parser.add_option('-b', '--batchsize', action="store", dest="batchsize", help="batch size", default="4")
parser.add_option('-r', '--rootpath', action="store", dest="rootpath", help="root directory", default="")
parser.add_option('-i', '--imgpath', action="store", dest="imgpath", help="image directory", default="data/mount/512X512X6/")
parser.add_option('-w', '--workpath', action="store", dest="workpath", help="Working path", default="weights/")
parser.add_option('-f', '--weightsname', action="store", dest="weightsname", help="Weights file name", default="pytorch_model.bin")
parser.add_option('-l', '--lr', action="store", dest="lr", help="learning rate", default="0.00005")
parser.add_option('-g', '--logmsg', action="store", dest="logmsg", help="log message", default="Recursion-pytorch")
parser.add_option('-c', '--size', action="store", dest="size", help="model size", default="512")
parser.add_option('-a', '--globalepoch', action="store", dest="globalepoch", help="global epoch", default="3")
parser.add_option('-n', '--loadcsv', action="store", dest="loadcsv", help="Convert csv embeddings to numpy", default="F")
parser.add_option('-j', '--lstm_units', action="store", dest="lstm_units", help="Lstm units", default="128")
parser.add_option('-d', '--dropout', action="store", dest="dropout", help="LSTM input spatial dropout", default="0.3")
parser.add_option('-z', '--decay', action="store", dest="decay", help="Weight Decay", default="0.0")
parser.add_option('-m', '--lrgamma', action="store", dest="lrgamma", help="Scheduler Learning Rate Gamma", default="1.0")
parser.add_option('-k', '--ttahflip', action="store", dest="ttahflip", help="Bag with horizontal flip on and off", default="F")
parser.add_option('-q', '--ttatranspose', action="store", dest="ttatranspose", help="Bag with transpose on and off", default="F")
parser.add_option('-x', '--datapath', action="store", dest="datapath", help="Data path", default="data")

options, args = parser.parse_args()
package_dir = options.rootpath
sys.path.append(package_dir)
sys.path.insert(0, 'scripts')
from logs import get_logger
from utils import dumpobj, loadobj, GradualWarmupScheduler

### Keep scrolling...

In [3]:
# Print info about environments
logger = get_logger(options.logmsg, 'INFO') # noqa
logger.info('Cuda set up : time {}'.format(datetime.datetime.now().time()))

device=torch.device('cuda')
logger.info('Device : {}'.format(torch.cuda.get_device_name(0)))
logger.info('Cuda available : {}'.format(torch.cuda.is_available()))
n_gpu = torch.cuda.device_count()
logger.info('Cuda n_gpus : {}'.format(n_gpu ))


logger.info('Load params : time {}'.format(datetime.datetime.now().time()))
for (k,v) in options.__dict__.items():
    logger.info('{}{}'.format(k.ljust(20), v))

WDIR = 'resnext101v01'
GEPOCH=0
epochs = 12
fold = 0
lr = 0.00001
batchsize = 4
workpath = f'scripts/{WDIR}'
ttahflip = 'T'
ttatranspose = 'T'
lrgamma = 0.95
nbags = 12
globalepoch = f'{GEPOCH}'
loadcsv = 'F'
lstm_units = 2048
    
SEED = int(options.seed)
SIZE = 408 # int(options.size)
EPOCHS = int(options.epochs)
GLOBALEPOCH= globalepoch #int(options.globalepoch)
n_epochs = EPOCHS 
lr=float(options.lr)
#lrgamma=float(options.lrgamma)
DECAY=float(options.decay)
batch_size = batchsize  #int(options.batchsize)
ROOT = options.rootpath
path_data = os.path.join(ROOT, options.datapath)
path_img = os.path.join(ROOT, options.imgpath)
WORK_DIR = os.path.join(ROOT, options.workpath)
path_emb = os.path.join(ROOT, options.workpath)
WEIGHTS_NAME = options.weightsname
FOLD = 0   #int(options.fold)
LOADCSV= options.loadcsv=='T'
LSTM_UNITS=2048   #int(options.lstm_units)
#nbags=int(options.nbags)
DROPOUT=float(options.dropout)
TTAHFLIP= 'T'  #'T' if options.ttahflip=='T' else ''
TTATRANSPOSE='P' if options.ttatranspose=='T' else ''

n_classes = 6
label_cols = ['epidural', 'intraparenchymal', 'intraventricular', 'subarachnoid', 'subdural', 'any']
logmsg = f'Rsna-lstm-{GEPOCH}-{FOLD}-fp16'

2020-01-29 12:46:44,515 - Recursion-pytorch - INFO - Cuda set up : time 12:46:44.515726
2020-01-29 12:46:44,524 - Recursion-pytorch - INFO - Device : GeForce RTX 2080 Ti
2020-01-29 12:46:44,525 - Recursion-pytorch - INFO - Cuda available : True
2020-01-29 12:46:44,526 - Recursion-pytorch - INFO - Cuda n_gpus : 1
2020-01-29 12:46:44,526 - Recursion-pytorch - INFO - Load params : time 12:46:44.526832
2020-01-29 12:46:44,527 - Recursion-pytorch - INFO - seed                1234
2020-01-29 12:46:44,527 - Recursion-pytorch - INFO - fold                0
2020-01-29 12:46:44,528 - Recursion-pytorch - INFO - nbags               4
2020-01-29 12:46:44,528 - Recursion-pytorch - INFO - epochs              10
2020-01-29 12:46:44,529 - Recursion-pytorch - INFO - batchsize           4
2020-01-29 12:46:44,529 - Recursion-pytorch - INFO - rootpath            
2020-01-29 12:46:44,530 - Recursion-pytorch - INFO - imgpath             data/mount/512X512X6/
2020-01-29 12:46:44,530 - Recursion-pytorch - INFO

# PyTorch Dataset

In [10]:
class IntracranialDataset(Dataset):
    def __init__(self, df, mat, labels=label_cols):
        self.data = df
        self.mat = mat
        #print(self.mat.shape)
        self.labels = labels
        self.patients = df.SliceID.unique()
        self.data = self.data.set_index('SliceID')

    def __len__(self):
        return len(self.patients)

    def __getitem__(self, idx):
        
        # Get the SliceID (one patient/series/study sequence) for the given index
        patidx = self.patients[idx]
        
        # For this SliceID, sort the rows according to the seq key
        # Wrap the argument to .loc in a list to ensure a dataframe is returned every time
        patdf = self.data.loc[[patidx]].sort_values('seq')            
            
        # Select the embedding index values for this sequence
        # and index into the embedding matrix .mat with those indices
        patemb = self.mat[patdf['embidx'].values]

        # Feed in the embeddings in sequence on key - Patient, Study and Series - 
        # also concat on the deltas between current and previous/next embeddings
        # (current - previous and current - next) 
        # to give the model knowledge of changes around the image.
        
        # This means that every item from the Dataset combines 3 embeddings:
        #   -  The embedding for each image in the sequence ("patemb")
        #   -  The difference between an embedding and its previous embedding ("patdeltalag")
        #   -  The difference between an embedding and its next embedding ("patdeltalead")
        
        patdeltalag  = np.zeros(patemb.shape)
        patdeltalead = np.zeros(patemb.shape)
        patdeltalag [1:] = patemb[1:]-patemb[:-1]   # e.g. patemb.shape = (36, 2048) , patemb[1:].shape = (35, 2048)
        patdeltalead[:-1] = patemb[:-1]-patemb[1:]

        # The 3 embeddings are concatted together going from 3 x (36, 2048) to (36, 6144)
        patemb = np.concatenate((patemb, patdeltalag, patdeltalead), -1)
        
        ids = torch.tensor(patdf['embidx'].values)

        if self.labels:
            labels = torch.tensor(patdf[label_cols].values)
            return {'emb': patemb, 'embidx' : ids, 'labels': labels}    
        else:      
            return {'emb': patemb, 'embidx' : ids}
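
To make the lag/lead feature construction in `__getitem__` concrete, here is a minimal sketch on a toy array. The 3 x 2 values are made up for illustration; the real embeddings are 2048 wide.

```python
import numpy as np

# Toy sequence: 3 slices, 2 features each (illustrative values only)
emb = np.array([[1., 2.],
                [4., 6.],
                [9., 12.]])

lag = np.zeros(emb.shape)
lead = np.zeros(emb.shape)
lag[1:] = emb[1:] - emb[:-1]    # current - previous; the first slice has no previous
lead[:-1] = emb[:-1] - emb[1:]  # current - next; the last slice has no next

# Concatenate along the feature axis: (3, 2) -> (3, 6)
features = np.concatenate((emb, lag, lead), -1)
print(features.shape)
```

The first row of `lag` and the last row of `lead` stay at zero, which is how the edges of a sequence are handled in the class above.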

# Prep Metadata DataFrames

### Generate SliceID 
based on ['PatientID', 'SeriesInstanceUID', 'StudyInstanceUID']

In [4]:
from random import sample 

# Print info about environments
logger.info('Cuda set up : time {}'.format(datetime.datetime.now().time()))

# Get image sequences
trnmdf = pd.read_csv(os.path.join(path_data, 'rsna_darraghdog/darraghdog_train_metadata.csv'))
#trnmdf.Image = 'train_' + trnmdf.Image
tstmdf = pd.read_csv(os.path.join(path_data, 'rsna_darraghdog/darraghdog_test_metadata.csv'))
#tstmdf.Image = 'test_' + tstmdf.Image

trnmdf['SliceID'] = trnmdf[['PatientID', 'SeriesInstanceUID', 'StudyInstanceUID']].apply(lambda x: '{}__{}__{}'.format(*x.tolist()), 1)
tstmdf['SliceID'] = tstmdf[['PatientID', 'SeriesInstanceUID', 'StudyInstanceUID']].apply(lambda x: '{}__{}__{}'.format(*x.tolist()), 1)

print(len(trnmdf), len(tstmdf))

2020-01-29 12:46:44,549 - Recursion-pytorch - INFO - Cuda set up : time 12:46:44.549043


752803 121232


### Generate Sequence Count
- Generate Sequence numbers for each series of images based on ['SliceID', 'ImagePos1', 'ImagePos2', 'ImagePos3']
- Reduce the number of columns

In [5]:
# Generate poscols like this: ['ImagePos1', 'ImagePos2', 'ImagePos3']
poscols = ['ImagePos{}'.format(i) for i in range(1, 4)]

# Parse the ImagePositionPatient string (e.g. "['-125.000', '-144.700', '109.750']" ) and add each value
# to one of 3 "ImagePos" columns
# ast.literal_eval()
#     Safely evaluate an expression node or a string containing a Python literal 
#     or container display. The string or node provided may only consist of the 
#     following Python literal structures: strings, bytes, numbers, tuples, lists, 
#     dicts, sets, booleans, and None.
# This can be used for safely evaluating strings containing Python values from untrusted sources 
# without the need to parse the values oneself. It is not capable of evaluating arbitrarily 
# complex expressions, for example involving operators or indexing.

trnmdf[poscols] = pd.DataFrame(trnmdf['ImagePositionPatient']\
              .apply(lambda x: list(map(float, ast.literal_eval(x)))).tolist())
tstmdf[poscols] = pd.DataFrame(tstmdf['ImagePositionPatient']\
              .apply(lambda x: list(map(float, ast.literal_eval(x)))).tolist())

# 1. Sort values by ['SliceID', 'ImagePos1', 'ImagePos2', 'ImagePos3']
# 2. Only select the following columns:  [PatientID, SliceID,SOPInstanceUID,ImagePos1,ImagePos2,ImagePos3]
# 3. Reset index
trnmdf = trnmdf.sort_values(['SliceID']+poscols)\
                [['PatientID', 'SliceID', 'SOPInstanceUID']+poscols].reset_index(drop=True)
tstmdf = tstmdf.sort_values(['SliceID']+poscols)\
                [['PatientID', 'SliceID', 'SOPInstanceUID']+poscols].reset_index(drop=True)

# Group by the SliceID col and then do a cumulative count within each group
# Because these samples have already been sorted by ['SliceID', 'ImagePos1', 'ImagePos2', 'ImagePos3']
# this count gives the position of each image within its sequence of images
trnmdf['seq'] = (trnmdf.groupby(['SliceID']).cumcount() + 1)
tstmdf['seq'] = (tstmdf.groupby(['SliceID']).cumcount() + 1)

# Further reduce the columns kept
keepcols = ['PatientID', 'SliceID', 'SOPInstanceUID', 'seq']
trnmdf = trnmdf[keepcols]
tstmdf = tstmdf[keepcols]

# rename SOPInstanceUID to Image to prepare to join to the dataframe with labels
trnmdf.columns = tstmdf.columns = ['PatientID', 'SliceID', 'Image', 'seq']

trnmdf.Image = 'train_' + trnmdf.Image
tstmdf.Image = 'test_' + tstmdf.Image

print(len(trnmdf), len(tstmdf))

trnmdf.head()

752803 121232


| | PatientID | SliceID | Image | seq |
|---|---|---|---|---|
| 0 | ID_0002cd41 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | train_ID_45785016b | 1 |
| 1 | ID_0002cd41 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | train_ID_37f32aed2 | 2 |
| 2 | ID_0002cd41 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | train_ID_1b9de2922 | 3 |
| 3 | ID_0002cd41 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | train_ID_d61a6a7b9 | 4 |
| 4 | ID_0002cd41 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | train_ID_406c82112 | 5 |
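
The sequence numbering can be checked on a tiny made-up frame. This is a sketch of the same parse, sort and `cumcount` steps; the `SliceID` values and positions are hypothetical.

```python
import ast
import pandas as pd

# Hypothetical metadata: two series ('a' and 'b') with shuffled slices
toy = pd.DataFrame({
    'SliceID': ['a', 'a', 'b', 'a', 'b'],
    'ImagePositionPatient': ["['-125.0', '-144.7', '30.0']",
                             "['-125.0', '-144.7', '10.0']",
                             "['-100.0', '-120.0', '5.0']",
                             "['-125.0', '-144.7', '20.0']",
                             "['-100.0', '-120.0', '0.0']"],
})

poscols = ['ImagePos{}'.format(i) for i in range(1, 4)]

# Parse each position string into three floats, one per ImagePos column
parsed = toy['ImagePositionPatient'].apply(lambda x: list(map(float, ast.literal_eval(x))))
toy[poscols] = pd.DataFrame(parsed.tolist(), columns=poscols, index=toy.index)

# Sort within each series by position, then number the slices 1..n
toy = toy.sort_values(['SliceID'] + poscols).reset_index(drop=True)
toy['seq'] = toy.groupby('SliceID').cumcount() + 1
print(toy[['SliceID', 'ImagePos3', 'seq']])
```

Because `cumcount` restarts at zero for every group, each series gets its own 1-based sequence.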


### Load the datasets

In [6]:
def return_stg1_2_embs_data():
    imgpath = 'data/rsna_darraghdog/darraghdog_proc'

    stg_1_trn_loader = loadobj('weights/stg1_downloaded_embeddings_resnext101v12fold1/loader_trn_size480_fold1_ep4')
    stg_1_val_loader = loadobj('weights/stg1_downloaded_embeddings_resnext101v12fold1/loader_val_size480_fold1_ep4')
    
    stg_1_trn_loader.dataset.path = imgpath
    stg_1_val_loader.dataset.path = imgpath

    stg_1_trn_df = stg_1_trn_loader.dataset.data
    stg_1_val_df = stg_1_val_loader.dataset.data

    stg_1_trn_df.Image = ['train_ID_' + x.split('_', 1)[1] for x in stg_1_trn_df.Image]
    stg_1_val_df.Image = ['train_ID_' + x.split('_', 1)[1] for x in stg_1_val_df.Image]

    # LOAD Stg2Test
    stg_2_tst_loader = loadobj('weights/stg2tst/loader_tst2_size480_fold1_ep5')
    stg_2_tst_loader.dataset.path = imgpath
    stg_2_tst_df = stg_2_tst_loader.dataset.data
    
    return stg_1_trn_df, stg_1_val_df, stg_2_tst_df

### More Data Processing
- Load Stage 1 + 2 Datasets
- Set the embedding index

In [11]:
use_stg1_embs = True

if use_stg1_embs:
    
    # Retrieve the Stage 1 data and stage 2 test data from the saved dataloaders
    stg_1_trn_df, stg_1_val_df, stg_2_tst_df = return_stg1_2_embs_data()
    
    # THE ORDER OF THE ORIGINAL TRAIN DATASET SHOULD BE PRESERVED
    # Save the current order of the embeddings (as the loaded ones are in this order)
    stg_1_trn_df['embidx'] = range(stg_1_trn_df.shape[0])
    stg_1_val_df['embidx'] = range(stg_1_val_df.shape[0])
    
    # MERGE stg1 train and val sets
    train = pd.concat([stg_1_trn_df, stg_1_val_df], axis=0, sort=False)
    
    # IMPORT STAGE 2 TRAIN (which has all labels)
    stg2_train = pd.read_csv(os.path.join(path_data, 'rsna_darraghdog/darraghdog_train.csv.gz'))
    stg2_test = pd.read_csv(os.path.join(path_data, 'rsna_darraghdog/darraghdog_test.csv.gz'))
    
    stg2_train.Image = 'train_' + stg2_train.Image
    stg2_test.Image = 'test_' + stg2_test.Image
    
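    # DROP the fold column as we want to use the 
    # NOTE: the result of drop() is not assigned, so this line has no effect and
    # the 'fold' column from stg2_train is still present after the merge below
    stg2_train.drop(['fold'], axis=1)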
    
    # DROP the label columns (some are missing due to the stg_1_tst_df merge), PatientID and fold FROM train
    drop_cols = label_cols.copy()
    drop_cols.append('PatientID')
    drop_cols.append('fold')
    train.drop(drop_cols, axis=1, inplace=True)
    
    # MERGE train with stg2_train
    train = train.merge(stg2_train, on = 'Image', sort=False)
    
    # RENAME STAGE 2 TEST
    test = stg2_test.copy()
        
    # MERGE with METADATA DF to get full picture
    trndf = train.merge(trnmdf.drop('PatientID', axis=1), on = 'Image')
    tstdf = test.merge(tstmdf, on = 'Image')
    
    # THE ORDER OF THE TEST EMBEDDINGS SHOULD BE PRESERVED
    # Save the current order of the embeddings (as the loaded ones are in this order)
    tstdf['embidx'] = range(tstdf.shape[0])

    # Sort into sequence order; 'embidx' still points at the matching row of the embedding matrix
    
    trndf = trndf.sort_values(['PatientID','SliceID','seq']).reset_index(drop=True)
    tstdf = tstdf.sort_values(['PatientID','SliceID','seq']).reset_index(drop=True)

trndf.head()

| | Image | embidx | any | epidural | intraparenchymal | intraventricular | subarachnoid | subdural | PatientID | fold | SliceID | seq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_ID_45785016b | 146382 | 0 | 0 | 0 | 0 | 0 | 0 | ID_0002cd41 | 4 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | 1 |
| 1 | train_ID_37f32aed2 | 117917 | 0 | 0 | 0 | 0 | 0 | 0 | ID_0002cd41 | 4 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | 2 |
| 2 | train_ID_1b9de2922 | 58394 | 0 | 0 | 0 | 0 | 0 | 0 | ID_0002cd41 | 4 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | 3 |
| 3 | train_ID_d61a6a7b9 | 451680 | 0 | 0 | 0 | 0 | 0 | 0 | ID_0002cd41 | 4 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | 4 |
| 4 | train_ID_406c82112 | 135730 | 0 | 0 | 0 | 0 | 0 | 0 | ID_0002cd41 | 4 | ID_0002cd41__ID_e22a5534e6__ID_66929e09d4 | 5 |
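
The reason for recording `embidx` before sorting is easiest to see on a toy case. In this sketch (image names and values are hypothetical) the embedding matrix stays in its saved order, while the dataframe is re-sorted into sequence order and `embidx` is used to look up the matching rows.

```python
import numpy as np
import pandas as pd

# Embedding rows in the order they were saved (row i belongs to the i-th image below)
mat = np.array([[0., 0.],
                [1., 1.],
                [2., 2.]])

df = pd.DataFrame({'Image': ['img_c', 'img_a', 'img_b'],
                   'SliceID': ['s1', 's1', 's1'],
                   'seq': [3, 1, 2]})

# Record each row's position in the embedding matrix BEFORE any sorting
df['embidx'] = range(df.shape[0])

# Sorting the dataframe is now safe: embidx still points at the right matrix row
df = df.sort_values(['SliceID', 'seq']).reset_index(drop=True)
ordered = mat[df['embidx'].values]
print(df[['Image', 'seq', 'embidx']])
print(ordered)
```

If `embidx` were assigned after the sort instead, the lookup would silently return embeddings for the wrong images.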


### Train/Val Split
Split into Train and Val sets again, as per Stg1 split

In [12]:
valdf = trndf.loc[trndf.Image.isin(stg_1_val_df.Image.values)].copy()
trndf = trndf.loc[trndf.Image.isin(stg_1_trn_df.Image.values)].copy()
len(trndf), len(valdf)

(539827, 134430)

### Load the Embeddings
- Load the Stage 1 Train + Val Embeddings and the Stage 2 Test Embeddings
- Assuming Stage2 embeddings came from the same model as the Stage1 embeddings

In [14]:
def load_saved_emb(e_pth, emb_path):
    return np.load(os.path.join(e_pth, emb_path))['arr_0']

stg_1_emb_path = 'weights/stg1_downloaded_embeddings_resnext101v12fold1'
stg_2_path_emb = 'weights/stg2tst'

# Paths for the Stage 1 embeddings
trn_emb_path = 'emb_trn_size480_fold1_ep4.npz'
val_emb_path = 'emb_val_size480_fold1_ep4.npz'

# Paths for the Stage 2 embeddings
tst_emb_path = 'emb_tst2_size480_fold1_ep5.npz'

logger.info('Load embeddings...')
trnembls = [load_saved_emb(stg_1_emb_path, trn_emb_path)]
valembls = [load_saved_emb(stg_1_emb_path, val_emb_path)]
tstembls = [load_saved_emb(stg_2_path_emb, tst_emb_path)]

trnemb = sum(trnembls)/len(trnembls)
valemb = sum(valembls)/len(valembls)
tstemb = sum(tstembls)/len(tstembls)

logger.info('Trn shape {} {}'.format(*trnemb.shape))
logger.info('Val shape {} {}'.format(*valemb.shape))
logger.info('Tst shape {} {}'.format(*tstemb.shape))

2020-01-29 13:04:31,858 - Recursion-pytorch - INFO - Load embeddings...
2020-01-29 13:05:12,360 - Recursion-pytorch - INFO - Trn shape 539827 2048
2020-01-29 13:05:12,368 - Recursion-pytorch - INFO - Val shape 134430 2048
2020-01-29 13:05:12,369 - Recursion-pytorch - INFO - Tst shape 121232 2048
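
A short sketch of why the loader reads the `'arr_0'` key, and what the `sum(...)/len(...)` lines do. It assumes the embeddings were saved as a single positional array with `np.savez_compressed`; the file name here is made up.

```python
import os
import tempfile
import numpy as np

emb = np.arange(12, dtype=np.float32).reshape(4, 3)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'emb_toy.npz')  # hypothetical file name
    np.savez_compressed(path, emb)              # positional arrays are stored as arr_0, arr_1, ...
    with np.load(path) as archive:
        keys = archive.files
        loaded = archive['arr_0']

# Averaging a list of embedding arrays ("bags"); with a single bag this is a no-op
bags = [loaded, loaded + 2]
averaged = sum(bags) / len(bags)
print(keys, loaded.shape, averaged[0])
```

In this notebook each list holds only one array, so `trnemb`, `valemb` and `tstemb` are simply the loaded embeddings.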


### Collate Function
- Define a collate function to pass to collate_fn in the DataLoader in order to operate on the Dataset as needed
- https://www.kaggle.com/bminixhofer/speed-up-your-rnn-with-sequence-bucketing

In [None]:
def collatefn(batch):
    maxlen = max([l['emb'].shape[0] for l in batch])
    
    embdim = batch[0]['emb'].shape[1]
    withlabel = 'labels' in batch[0]
    if withlabel:
        labdim= batch[0]['labels'].shape[1]
    
    for b in batch:
        # batch shape could be (3, 40, 6144)
        # For sequences of different length:
        #     - pad them to the same length
        #     - make a dummy embedding of zeros
        #     - then throw the results for these away before calculating loss and saving the predictions.
        #
        # "masklen"  :  The number of dummy image embeddings to add to make sure each sequence in the batch 
        # has the same sequence length. Calculated as the difference between:
        #     - the length of the LONGEST sequence of embeddings
        #       in the batch (longest seq of images from the same patient)
        #     MINUS
        #     - The number of image embeddings of this particular sequence
        #
        # A batch contains sequences of images from multiple patients. Some patients
        # might have 28 images in their sequence, while some others might have 40. 
        # If there are patients with different numbers of images (different lengths of sequences)
        # in a batch then these sequences can't be stacked, as items in a batch must 
        # have the same dimensions. 
        # This scenario is addressed by creating dummy embeddings full of zeros and adding them to 
        # a patient's sequence of embeddings. The number of dummy embeddings needed is dictated by 
        # the "masklen"
        masklen = maxlen-len(b['emb'])
        
        # Stack a number ("masklen") of dummy embeddings onto the current sequence of embeddings
        # to make sure all sequences in the batch are the same length
        b['emb'] = np.vstack((np.zeros((masklen, embdim)), b['emb']))
        
        # Adjust the embedding index "embidx" by adding a number ("masklen") of -1's to it
        # e.g. tensor([-1, -1, -1, -1])
        b['embidx'] = torch.cat((torch.ones((masklen),dtype=torch.long)*-1, b['embidx']))
        
        # "mask" is a flag to indicate whether the embedding is a dummy or not "array([1., 1., 1., 1.])"
        # Create it to be the length of the longest sequence and fill it with 1's
        b['mask'] = np.ones((maxlen))
        
        # Change the first "masklen" flags to 0, meaning those positions hold dummy embeddings
        # This works because the dummy embeddings were inserted ahead of the real embeddings in "b['emb']"
        b['mask'][:masklen] = 0.
        if withlabel:
            # Add dummy labels for the dummy embeddings
            b['labels'] = np.vstack((np.zeros((maxlen-len(b['labels']), labdim)), b['labels']))
    
    # Expand the array dimensions to be the correct dims for the LSTM model.
    # nn.LSTM takes inputs of shape: (batch, seq_len, input_size) when batch_first=True in the lstm definition
    # numpy.expand_dims(a, axis)
    #     Expand the shape of an array. Insert a new axis that will appear at the axis position
    #     in the expanded array shape.
    #     Returns: View of a with the number of dimensions increased by one.
    # e.g. Expands b['emb'] from (36, 6144) to (1, 36, 6144)
    outbatch = {'emb' : torch.tensor(np.vstack([np.expand_dims(b['emb'], 0) \
                                                for b in batch])).float()}  
    
    outbatch['mask'] = torch.tensor(np.vstack([np.expand_dims(b['mask'], 0) \
                                                for b in batch])).float()
    outbatch['embidx'] = torch.tensor(np.vstack([np.expand_dims(b['embidx'], 0) \
                                                for b in batch])).float()
    if withlabel:
        outbatch['labels'] = torch.tensor(np.vstack([np.expand_dims(b['labels'], 0) for b in batch])).float()
    return outbatch
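
The padding logic in `collatefn` can be sketched with NumPy alone. The helper name `pad_front` and the toy shapes are illustrative and not part of the original script.

```python
import numpy as np

def pad_front(seq, maxlen):
    """Prepend zero rows so that seq has maxlen rows; return the padded array and its mask."""
    masklen = maxlen - len(seq)
    padded = np.vstack((np.zeros((masklen, seq.shape[1])), seq))
    mask = np.ones(maxlen)
    mask[:masklen] = 0.  # 0 marks a dummy position, 1 marks a real embedding
    return padded, mask

# Two sequences of different length (2 and 4 slices), 3 features each
seqs = [np.ones((2, 3)), np.ones((4, 3))]
maxlen = max(len(s) for s in seqs)

padded, masks = zip(*(pad_front(s, maxlen) for s in seqs))
batch = np.stack(padded)  # (batch, seq_len, features)
mask = np.stack(masks)    # (batch, seq_len)
print(batch.shape, mask)
```

Padding at the front rather than the back means the real slices always sit at the end of each row, which is why the mask is zeroed with `[:masklen]`.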

### Create the PyTorch Datasets and Dataloaders based on loaded info

In [None]:
logger.info('Create loaders...')
trndataset = IntracranialDataset(trndf, trnemb, labels=True)
valdataset = IntracranialDataset(valdf, valemb, labels=True)
tstdataset = IntracranialDataset(tstdf, tstemb, labels=False)

batch_size=4
trnloader = DataLoader(trndataset, batch_size=batch_size, shuffle=True, num_workers=8, collate_fn=collatefn)
valloader = DataLoader(valdataset, batch_size=batch_size*4, shuffle=False, num_workers=8, collate_fn=collatefn)
tstloader = DataLoader(tstdataset, batch_size=batch_size*4, shuffle=False, num_workers=8, collate_fn=collatefn)

SpatialDropout not used in the final solution, leaving it in here just for reference

In [None]:
# class SpatialDropout(nn.Dropout2d):
#     def forward(self, x):
#         x = x.unsqueeze(2)    # (N, T, 1, K)
#         x = x.permute(0, 3, 2, 1)  # (N, K, 1, T)
#         x = super(SpatialDropout, self).forward(x)  # (N, K, 1, T), some features are masked
#         x = x.permute(0, 3, 2, 1)  # (N, T, 1, K)
#         x = x.squeeze(2)  # (N, T, K)
#         return x

# LSTM Model

Define the LSTM model

In [22]:
class NeuralNet(nn.Module):
    def __init__(self, embed_size=trnemb.shape[-1]*3, LSTM_UNITS=64, DO = 0.3):  # embed_size=trnemb.shape[-1]*3
        super(NeuralNet, self).__init__()
        
        # embed_size=trnemb.shape[-1]*3 because of the earlier creation of lagging and leading embeddings
        # e.g. trnemb.shape is (26805, 2048)
        # trnemb.shape[-1] is 2048
        # trnemb.shape[-1]*3 is 6144
        
        # This SpatialDropout doesn't seem to be used, commenting out
        #self.embedding_dropout = SpatialDropout(0.0) #DO)
        
        # LSTM
        #  input of shape (batch, seq_len, input_size) when batch_first=True: tensor 
        #    containing the features of the input sequence.
        #  h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor 
        #    containing the initial hidden state for each element in the batch. If 
        #    the LSTM is bidirectional, num_directions should be 2, else it should be 1.
        #    batch_first does not change the layout of h_0 or c_0.
        #  c_0 of shape (num_layers * num_directions, batch, hidden_size): tensor
        #    containing the initial cell state for each element in the batch.
        
        # Note, bidirectional=True, meaning the lstm runs through the sequence 
        # forwards and backwards, returning the CONCATENATION of the two outputs
        self.lstm1 = nn.LSTM(input_size=embed_size,   # 6144
                             hidden_size=LSTM_UNITS,  # 64
                             bidirectional=True,
                             batch_first=True)
        # lstm1 output size will be: torch.Size([3, 36, 4096]), when LSTM_UNITS=2048
        # output of shape (batch, seq_len, num_directions * hidden_size), i.e. (3, 36, 2 * 64) when LSTM_UNITS=64
        
        self.lstm2 = nn.LSTM(LSTM_UNITS * 2,   # 128
                             LSTM_UNITS,       # 64
                             bidirectional=True, 
                             batch_first=True)

        self.linear1 = nn.Linear(in_features=LSTM_UNITS*2,  # 128
                                 out_features=LSTM_UNITS*2  # 128
                                )
        self.linear2 = nn.Linear(LSTM_UNITS*2,  # 128
                                 LSTM_UNITS*2   # 128
                                )

        self.linear = nn.Linear(LSTM_UNITS*2,   # 128
                                n_classes       # 6
                               )

    def forward(self, x, lengths=None):
        # x.size() is torch.Size([2, 36, 6144])
        h_embedding = x
        
        n = 2048
        # while h_embedding[:,:,:n].size() is torch.Size([2, 36, 2048])
        # Selecting and duplicating the first (and original) of the 3 embeddings (original, lag, lead) 
        # that were earlier concatted 
        
        # h_embadd.size is torch.Size([2, 36, 4096]), not needed now though
        h_embadd = torch.cat((h_embedding[:,:,:n], h_embedding[:,:,:n]), -1)
        
        h_lstm1, _ = self.lstm1(h_embedding)
        h_lstm2, _ = self.lstm2(h_lstm1)
        
        h_conc_linear1  = F.relu(self.linear1(h_lstm1))
        h_conc_linear2  = F.relu(self.linear2(h_lstm2))
        
        # SUM the original embedding ("h_embadd") back onto the outputs from the other layers
        # "h_embadd" duplicates the 2048-wide embedding to 4096 so that it matches the 
        # bidirectional output width (LSTM_UNITS * 2), which only holds when LSTM_UNITS = 2048
        # However this sum to calculate "hidden" then kills my 13GB memory:
        hidden = h_lstm1 + h_lstm2 + h_conc_linear1 + h_conc_linear2 + h_embadd

        output = self.linear(hidden)
        
        return output
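
The shapes produced by a bidirectional `nn.LSTM` with `batch_first=True` can be checked on a tiny layer. This sketch uses small made-up sizes (input 6, hidden 4) rather than the 6144 and 2048 used in the model.

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = 4
lstm = nn.LSTM(input_size=6, hidden_size=hidden, bidirectional=True, batch_first=True)

x = torch.randn(3, 5, 6)     # (batch, seq_len, input_size)
out, (h_n, c_n) = lstm(x)

print(out.shape)  # (batch, seq_len, 2 * hidden): forward and backward outputs concatenated
print(h_n.shape)  # (num_directions, batch, hidden): batch_first does not apply here

# The first half of the last axis is the forward direction, the second half the backward one
fwd_matches = torch.allclose(out[:, -1, :hidden], h_n[0])
bwd_matches = torch.allclose(out[:, 0, hidden:], h_n[1])
print(fwd_matches, bwd_matches)
```

This is why `lstm2` and the linear layers take `LSTM_UNITS * 2` input features.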

In [None]:
# Try free up some memory
del(trnemb, valemb, tstemb)
gc.collect()

### Loss function
A custom BCEWithLogitsLoss loss function, the same as the image classifier loss function 

In [25]:
def criterion(data, targets, criterion = torch.nn.BCEWithLogitsLoss()):
    ''' Weighted BCE: loss over all six labels (weight 6) plus an extra term for the last column, 'any' (weight 1) '''
    loss_all = criterion(data, targets)
    loss_any = criterion(data[:,-1:], targets[:,-1:])
    return (loss_all*6 + loss_any*1)/7
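
It may help to see what the `(loss_all*6 + loss_any*1)/7` weighting amounts to per label. The NumPy sketch below uses random toy logits and a hand-written element-wise BCE in place of `torch.nn.BCEWithLogitsLoss`. It shows that the combined loss equals a weighted mean of the per-column losses in which `any` (the last column) counts twice.

```python
import numpy as np

def bce_with_logits(x, y):
    """Numerically stable element-wise binary cross-entropy on raw logits."""
    return np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 6))                        # 5 slices, 6 labels
targets = rng.integers(0, 2, size=(5, 6)).astype(float)

elem = bce_with_logits(logits, targets)
loss_all = elem.mean()           # mean over every label
loss_any = elem[:, -1:].mean()   # mean over the last column only ('any')
combined = (loss_all * 6 + loss_any * 1) / 7

# Equivalent view: per-column means weighted 1,1,1,1,1,2 and divided by 7
col_means = elem.mean(axis=0)
weights = np.array([1, 1, 1, 1, 1, 2]) / 7
weighted = (col_means * weights).sum()
print(combined, weighted)
```

The equivalence holds because `loss_all * 6` is the sum of the six per-column means, so adding `loss_any` simply counts the `any` column a second time.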

### Reclaim GPU RAM
- context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted
- taken from fastai: https://docs.fast.ai/troubleshoot.html

In [27]:
class gpu_mem_restore_ctx():
    " context manager to reclaim GPU RAM if CUDA out of memory happened, or execution was interrupted"
    def __enter__(self): return self
    def __exit__(self, exc_type, exc_val, exc_tb):
        if not exc_val: return True
        traceback.clear_frames(exc_tb)
        raise exc_type(exc_val).with_traceback(exc_tb) from None

In [8]:
# ls = []
# for i in range(10):
#     print(1e-4*((0.95*1.0)**(i)))
#     ls.append(1e-4*((0.95*1.0)**(i)))
    
# from matplotlib import pyplot as plt
# plt.plot(ls)
# plt.xlabel('Epoch number')
# plt.ylabel('Learning rate')

How many epochs are we doing?

In [9]:
EPOCHS

10

### Prediction
- Used for retrieving predictions to calculate the validation loss and get the test set predictions
- Ignores the mask embeddings

In [24]:
def predict(loader):
    valls = []
    imgls = []
    imgdf = loader.dataset.data.reset_index().set_index('embidx')[['Image']].copy()
    for step, batch in enumerate(loader):
        inputs = batch["emb"]
        mask = batch['mask'].to(device, dtype=torch.int)
        inputs = inputs.to(device, dtype=torch.float)
        logits = model(inputs)
        # get the mask for masked labels
        maskidx = mask.view(-1)==1
        # reshape for
        logits = logits.view(-1, n_classes)[maskidx]
        valls.append(torch.sigmoid(logits).detach().cpu().numpy())
        # Get the list of images
        embidx = batch["embidx"].detach().cpu().numpy().astype(np.int32)
        embidx = embidx.flatten()[embidx.flatten()>-1]
        images = imgdf.loc[embidx].Image.tolist() 
        imgls += images
    return np.concatenate(valls, 0), imgls
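
How `predict` discards the dummy positions while keeping predictions and image ids aligned can be sketched with NumPy. The values are toy values, and `-1` marks a padded `embidx` as in the collate function.

```python
import numpy as np

n_classes = 6
# Toy "logits" for a batch of 2 sequences of length 3
logits = np.arange(2 * 3 * n_classes, dtype=float).reshape(2, 3, n_classes)
mask = np.array([[0, 1, 1],
                 [1, 1, 1]])          # the first position of sequence 0 is padding
embidx = np.array([[-1, 10, 11],
                   [20, 21, 22]])     # -1 marks the padded slot

# Flatten (batch, seq_len) into one axis and keep only the real positions
keep = mask.reshape(-1) == 1
flat = logits.reshape(-1, n_classes)[keep]

# The same flattening order applies to the ids, so row i of flat belongs to real_idx[i]
real_idx = embidx.flatten()[embidx.flatten() > -1]
print(flat.shape, real_idx)
```

Both arrays are flattened in the same row-major order, which is what lets the function return predictions and image names as two parallel lists.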

### Make Submission to Kaggle
Gets the output predictions formatted correctly for submission

In [26]:
def makeSub(ypred, imgs):
    imgls = np.array(imgs).repeat(len(label_cols)) 
    icdls = pd.Series(label_cols*ypred.shape[0])   
    yidx = ['{}_{}'.format(i,j) for i,j in zip(imgls, icdls)]
    subdf = pd.DataFrame({'ID' : yidx, 'Label': ypred.flatten()})
    return subdf
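
A quick check of the row layout that `makeSub` produces, using two hypothetical image names and made-up predictions. The function body is copied from the cell above so that the example runs on its own.

```python
import numpy as np
import pandas as pd

label_cols = ['epidural', 'intraparenchymal', 'intraventricular', 'subarachnoid', 'subdural', 'any']

def makeSub(ypred, imgs):
    imgls = np.array(imgs).repeat(len(label_cols))   # each image repeated once per label
    icdls = pd.Series(label_cols * ypred.shape[0])   # the label list repeated once per image
    yidx = ['{}_{}'.format(i, j) for i, j in zip(imgls, icdls)]
    return pd.DataFrame({'ID': yidx, 'Label': ypred.flatten()})

ypred = np.arange(12).reshape(2, 6) / 100            # toy predictions for 2 images
sub = makeSub(ypred, ['test_ID_aaa', 'test_ID_bbb'])  # hypothetical image names
print(sub)
```

Each image contributes six consecutive rows, one per label, and `ypred.flatten()` walks the predictions in the same row-major order.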

### Initialise Model

In [31]:
logger.info('Create model')
lrgamma = 0.95

#LSTM_UNITS = 1024
LSTM_UNITS = 2048
model = NeuralNet(LSTM_UNITS=LSTM_UNITS, DO = DROPOUT)
model = model.to(device)

param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
plist = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)], 'weight_decay': DECAY},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

optimizer = optim.Adam(plist, lr=lr)
scheduler = StepLR(optimizer, step_size=1, gamma=lrgamma, last_epoch=-1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

ypredls = []

2020-01-27 18:10:39,282 - Recursion-pytorch - INFO - Create model


Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
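
The optimizer setup above splits parameters into two groups by substring match on their names, so biases and LayerNorm parameters are exempt from weight decay. A minimal sketch of that split follows. The parameter names are hypothetical examples in the style of `model.named_parameters()`, not the real names from `NeuralNet`.

```python
# Hypothetical parameter names; the real ones come from model.named_parameters().
names = ['lstm1.weight_ih_l0', 'lstm1.bias_ih_l0', 'linear1.weight',
         'linear1.bias', 'LayerNorm.weight']
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']

# A name is exempt if ANY of the no_decay strings occurs anywhere inside it.
decayed = [n for n in names if not any(nd in n for nd in no_decay)]
exempt = [n for n in names if any(nd in n for nd in no_decay)]

print(decayed)  # ['lstm1.weight_ih_l0', 'linear1.weight']
print(exempt)   # ['lstm1.bias_ih_l0', 'linear1.bias', 'LayerNorm.weight']
```

Since the test is a plain substring check, `'bias'` also catches LSTM biases such as `bias_ih_l0`. The `LayerNorm.*` entries only match if a module is actually registered under the name `LayerNorm`.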


### Training Loop

In [32]:
# Run Training
for epoch in range(EPOCHS):
    logger.info(f'EPOCH {epoch}')
    tr_loss = 0.
    for param in model.parameters():
        param.requires_grad = True
    model.train()  
    for step, batch in enumerate(trnloader):
        y = batch['labels'].to(device, dtype=torch.float)
        mask = batch['mask'].to(device, dtype=torch.int)
        
        x = batch['emb'].to(device, dtype=torch.float)
        x = torch.autograd.Variable(x, requires_grad=True)
        
        y = torch.autograd.Variable(y)
        logits = model(x).to(device, dtype=torch.float)
        
        # get the mask for masked labels
        maskidx = mask.view(-1)==1
        y = y.view(-1, n_classes)[maskidx]
        logits = logits.view(-1, n_classes)[maskidx]
        # Get loss
        loss = criterion(logits, y)
        
        tr_loss += loss.item()
        optimizer.zero_grad()
        
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            with gpu_mem_restore_ctx():
                scaled_loss.backward()
        optimizer.step()
        if step%50==0:
            logger.info('Trn step {} of {} trn lossavg {:.5f}'. \
                        format(step, len(trnloader), (tr_loss/(1+step))))
    
        del(batch)
        gc.collect()
        # Hitting memory errors, possibly with amp
        torch.cuda.empty_cache()
        
    output_model_file = os.path.join(WORK_DIR, 'lstm_gepoch{}_lstmepoch{}_fold{}.bin'.format(GLOBALEPOCH, epoch, fold))
    torch.save(model.state_dict(), output_model_file)

    scheduler.step()
    
    # Get Validation Loss
    model.eval()
    logger.info('Prep val score...')
    ypred, imgval = predict(valloader)
    ypredls.append(ypred)
     
    yvalpred = sum(ypredls[-nbags:])/len(ypredls[-nbags:])
    yvalout = makeSub(yvalpred, imgval)
    yvalp = makeSub(ypred, imgval)
    
    # get Val score
    weights = ([1, 1, 1, 1, 1, 2] * ypred.shape[0])
    yact = valloader.dataset.data[label_cols].values#.flatten()
    yact = makeSub(yact, valloader.dataset.data['Image'].tolist())
    yact = yact.set_index('ID').loc[yvalout.ID].reset_index()
    valloss = log_loss(yact['Label'].values, yvalp['Label'].values.clip(.00001,.99999) , sample_weight = weights)
    vallossavg = log_loss(yact['Label'].values, yvalout['Label'].values.clip(.00001,.99999) , sample_weight = weights)
    logger.info('Epoch {} val logloss {:.5f} bagged {:.5f}'.format(epoch, valloss, vallossavg))

    del(ypred, yvalout, yvalp, imgval, yact)
    gc.collect()
    
print('DONE!')

2020-01-27 18:10:42,198 - Recursion-pytorch - INFO - EPOCH 0
2020-01-27 18:10:43,044 - Recursion-pytorch - INFO - Trn step 0 of 1955 trn lossavg 0.67631
2020-01-27 18:10:49,770 - Recursion-pytorch - INFO - Trn step 50 of 1955 trn lossavg 0.10079
2020-01-27 18:10:56,523 - Recursion-pytorch - INFO - Trn step 100 of 1955 trn lossavg 0.08413
2020-01-27 18:11:03,324 - Recursion-pytorch - INFO - Trn step 150 of 1955 trn lossavg 0.07459
2020-01-27 18:11:10,198 - Recursion-pytorch - INFO - Trn step 200 of 1955 trn lossavg 0.07127
2020-01-27 18:11:17,022 - Recursion-pytorch - INFO - Trn step 250 of 1955 trn lossavg 0.06836
2020-01-27 18:11:23,844 - Recursion-pytorch - INFO - Trn step 300 of 1955 trn lossavg 0.06579
2020-01-27 18:11:30,691 - Recursion-pytorch - INFO - Trn step 350 of 1955 trn lossavg 0.06505
2020-01-27 18:11:37,489 - Recursion-pytorch - INFO - Trn step 400 of 1955 trn lossavg 0.06321
2020-01-27 18:11:44,411 - Recursion-pytorch - INFO - Trn step 450 of 1955 trn lossavg 0.06199
20

2020-01-27 18:21:05,169 - Recursion-pytorch - INFO - Trn step 350 of 1955 trn lossavg 0.04909
2020-01-27 18:21:12,139 - Recursion-pytorch - INFO - Trn step 400 of 1955 trn lossavg 0.04983
2020-01-27 18:21:19,113 - Recursion-pytorch - INFO - Trn step 450 of 1955 trn lossavg 0.04931
2020-01-27 18:21:26,055 - Recursion-pytorch - INFO - Trn step 500 of 1955 trn lossavg 0.04907
2020-01-27 18:21:32,960 - Recursion-pytorch - INFO - Trn step 550 of 1955 trn lossavg 0.04937
2020-01-27 18:21:39,916 - Recursion-pytorch - INFO - Trn step 600 of 1955 trn lossavg 0.04928
2020-01-27 18:21:46,859 - Recursion-pytorch - INFO - Trn step 650 of 1955 trn lossavg 0.04890
2020-01-27 18:21:53,774 - Recursion-pytorch - INFO - Trn step 700 of 1955 trn lossavg 0.04905
2020-01-27 18:22:00,583 - Recursion-pytorch - INFO - Trn step 750 of 1955 trn lossavg 0.04910
2020-01-27 18:22:07,547 - Recursion-pytorch - INFO - Trn step 800 of 1955 trn lossavg 0.04877
2020-01-27 18:22:14,489 - Recursion-pytorch - INFO - Trn ste

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0


2020-01-27 18:28:45,401 - Recursion-pytorch - INFO - Trn step 1600 of 1955 trn lossavg 0.04624
2020-01-27 18:28:52,320 - Recursion-pytorch - INFO - Trn step 1650 of 1955 trn lossavg 0.04625
2020-01-27 18:28:59,174 - Recursion-pytorch - INFO - Trn step 1700 of 1955 trn lossavg 0.04636
2020-01-27 18:29:06,092 - Recursion-pytorch - INFO - Trn step 1750 of 1955 trn lossavg 0.04657
2020-01-27 18:29:12,915 - Recursion-pytorch - INFO - Trn step 1800 of 1955 trn lossavg 0.04662
2020-01-27 18:29:19,891 - Recursion-pytorch - INFO - Trn step 1850 of 1955 trn lossavg 0.04659
2020-01-27 18:29:26,913 - Recursion-pytorch - INFO - Trn step 1900 of 1955 trn lossavg 0.04650
2020-01-27 18:29:33,645 - Recursion-pytorch - INFO - Trn step 1950 of 1955 trn lossavg 0.04653
2020-01-27 18:29:35,681 - Recursion-pytorch - INFO - Prep val score...
2020-01-27 18:29:45,253 - Recursion-pytorch - INFO - Epoch 3 val logloss 0.06401 bagged 0.06296
2020-01-27 18:29:45,346 - Recursion-pytorch - INFO - Prep test sub...
202

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 262144.0


2020-01-27 18:38:15,637 - Recursion-pytorch - INFO - Trn step 1500 of 1955 trn lossavg 0.04211
2020-01-27 18:38:22,712 - Recursion-pytorch - INFO - Trn step 1550 of 1955 trn lossavg 0.04203
2020-01-27 18:38:29,823 - Recursion-pytorch - INFO - Trn step 1600 of 1955 trn lossavg 0.04200
2020-01-27 18:38:36,894 - Recursion-pytorch - INFO - Trn step 1650 of 1955 trn lossavg 0.04218
2020-01-27 18:38:43,918 - Recursion-pytorch - INFO - Trn step 1700 of 1955 trn lossavg 0.04212
2020-01-27 18:38:51,049 - Recursion-pytorch - INFO - Trn step 1750 of 1955 trn lossavg 0.04220
2020-01-27 18:38:58,149 - Recursion-pytorch - INFO - Trn step 1800 of 1955 trn lossavg 0.04223
2020-01-27 18:39:05,195 - Recursion-pytorch - INFO - Trn step 1850 of 1955 trn lossavg 0.04214
2020-01-27 18:39:12,579 - Recursion-pytorch - INFO - Trn step 1900 of 1955 trn lossavg 0.04216
2020-01-27 18:39:19,629 - Recursion-pytorch - INFO - Trn step 1950 of 1955 trn lossavg 0.04207
2020-01-27 18:39:21,651 - Recursion-pytorch - INFO

2020-01-27 18:48:03,168 - Recursion-pytorch - INFO - Trn step 1550 of 1955 trn lossavg 0.03478
2020-01-27 18:48:10,104 - Recursion-pytorch - INFO - Trn step 1600 of 1955 trn lossavg 0.03464


Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 524288.0


2020-01-27 18:48:16,905 - Recursion-pytorch - INFO - Trn step 1650 of 1955 trn lossavg 0.03473
2020-01-27 18:48:23,795 - Recursion-pytorch - INFO - Trn step 1700 of 1955 trn lossavg 0.03467
2020-01-27 18:48:30,705 - Recursion-pytorch - INFO - Trn step 1750 of 1955 trn lossavg 0.03452
2020-01-27 18:48:37,907 - Recursion-pytorch - INFO - Trn step 1800 of 1955 trn lossavg 0.03443
2020-01-27 18:48:44,992 - Recursion-pytorch - INFO - Trn step 1850 of 1955 trn lossavg 0.03427
2020-01-27 18:48:52,092 - Recursion-pytorch - INFO - Trn step 1900 of 1955 trn lossavg 0.03432
2020-01-27 18:48:59,115 - Recursion-pytorch - INFO - Trn step 1950 of 1955 trn lossavg 0.03432
2020-01-27 18:49:01,153 - Recursion-pytorch - INFO - Prep val score...
2020-01-27 18:49:10,744 - Recursion-pytorch - INFO - Epoch 7 val logloss 0.07367 bagged 0.06319
2020-01-27 18:49:10,841 - Recursion-pytorch - INFO - Prep test sub...
2020-01-27 18:49:17,041 - Recursion-pytorch - INFO - EPOCH 8
2020-01-27 18:49:17,715 - Recursion-p

Gradient overflow.  Skipping step, loss scaler 0 reducing loss scale to 524288.0


2020-01-27 18:53:25,159 - Recursion-pytorch - INFO - Trn step 1750 of 1955 trn lossavg 0.02874
2020-01-27 18:53:32,248 - Recursion-pytorch - INFO - Trn step 1800 of 1955 trn lossavg 0.02877
2020-01-27 18:53:39,437 - Recursion-pytorch - INFO - Trn step 1850 of 1955 trn lossavg 0.02865
2020-01-27 18:53:46,578 - Recursion-pytorch - INFO - Trn step 1900 of 1955 trn lossavg 0.02867
2020-01-27 18:53:53,613 - Recursion-pytorch - INFO - Trn step 1950 of 1955 trn lossavg 0.02862
2020-01-27 18:53:55,626 - Recursion-pytorch - INFO - Prep val score...
2020-01-27 18:54:05,258 - Recursion-pytorch - INFO - Epoch 8 val logloss 0.08200 bagged 0.06319
2020-01-27 18:54:05,352 - Recursion-pytorch - INFO - Prep test sub...
2020-01-27 18:54:11,508 - Recursion-pytorch - INFO - EPOCH 9
2020-01-27 18:54:12,171 - Recursion-pytorch - INFO - Trn step 0 of 1955 trn lossavg 0.02686
2020-01-27 18:54:19,194 - Recursion-pytorch - INFO - Trn step 50 of 1955 trn lossavg 0.01849
2020-01-27 18:54:26,191 - Recursion-pytorc

DONE!
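
The log reports two validation numbers per epoch: the loss of the current epoch's predictions and the loss of the "bagged" predictions, which average the last `nbags` epochs. The sketch below shows both calculations in NumPy under stated assumptions: `nbags = 2` and the toy predictions are made up, and `weighted_log_loss` is a hypothetical stand-in for the weighted `log_loss` call in the loop, not the library function itself.

```python
import numpy as np

def weighted_log_loss(y_true, y_prob, weights, eps=1e-5):
    """Weighted mean binary cross-entropy over flattened labels."""
    p = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return float(np.sum(weights * losses) / np.sum(weights))

nbags = 2  # assumed value; the notebook sets nbags elsewhere
# Hypothetical per-epoch predictions for 2 images x 6 labels.
ypredls = [np.full((2, 6), 0.2), np.full((2, 6), 0.4), np.full((2, 6), 0.6)]
bagged = sum(ypredls[-nbags:]) / len(ypredls[-nbags:])   # mean of the last nbags epochs

y_true = np.zeros((2, 6))
weights = np.array([1, 1, 1, 1, 1, 2] * y_true.shape[0])  # last label ('any') counts double

loss = weighted_log_loss(y_true.flatten(), bagged.flatten(), weights)
print(bagged[0, 0])
print(loss)
```

Averaging over several epochs smooths out a single overfit epoch, which is consistent with the log above, where the per-epoch loss rises in later epochs while the bagged loss stays nearly flat.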


### Make Submission

In [63]:
ypredtstls = []
model.eval()
logger.info('Prep test submission...')
ypred, imgtst = predict(tstloader)
ypredtstls.append(ypred)

logger.info('Write out bagged prediction to preds folder')
ytstpred = sum(ypredtstls[-nbags:])/len(ypredtstls[-nbags:])
ytstout = makeSub(ytstpred, imgtst)
ytstout.ID = [x.split('_', 1)[1] for x in ytstout.ID.values]

ytstout.to_csv('preds/mg_lstm_20200127.csv.gz', index = False, compression = 'gzip')
print('DONE!')

2020-01-27 21:18:58,318 - Recursion-pytorch - INFO - Prep test sub...
2020-01-27 21:19:04,697 - Recursion-pytorch - INFO - Write out bagged prediction to preds folder


DONE!
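
The line `x.split('_', 1)[1]` in the cell above drops everything up to and including the first underscore of each ID. A minimal sketch of that behaviour, using made-up IDs with a hypothetical leading token:

```python
# Hypothetical IDs carrying an extra leading token before the first underscore.
ids = ['x_ID_abc_epidural', 'x_ID_abc_any']

# maxsplit=1 splits only at the first underscore, so the rest of the ID stays intact.
stripped = [s.split('_', 1)[1] for s in ids]
print(stripped)  # ['ID_abc_epidural', 'ID_abc_any']
```

An ID with no underscore at all would raise an `IndexError` here, so this step relies on every ID having at least one.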


In [67]:
ytstout.head(6)

| | ID | Label |
|---|---|---|
| 0 | ID_9b2d4a7b3_epidural | 0.0 |
| 1 | ID_9b2d4a7b3_intraparenchymal | 0.0 |
| 2 | ID_9b2d4a7b3_intraventricular | 0.0 |
| 3 | ID_9b2d4a7b3_subarachnoid | 0.0 |
| 4 | ID_9b2d4a7b3_subdural | 0.0 |
| 5 | ID_9b2d4a7b3_any | 0.0 |


### Create CSV download link

In [66]:
from IPython.display import FileLink, FileLinks

FileLink('preds/mg_lstm_20200127.csv.gz')

### Result = 1.29793