<p align="center"><img width="100%" src='https://drive.google.com/uc?id=1ecTqgg55aswLO-s1vqG2fTyeRoo7pmJ5'></p>


## **Competition Example Code:**  
### 1. Train Example Model using training dataset
### 2. Generate Predictions from X test data
### 3. Submit to competition leaderboard



## 1. Open the Google Drive folder linked below. Please request access via email.
https://drive.google.com/drive/folders/1pQtbHymHwvBups8jvGgIuaWBxGXRP0Ie?usp=sharing

***The competition training data is stored in four zipfiles in the TrainingSet folder:***

TrainingSet

*   trainbatch01.zip
*   trainbatch02.zip
*   trainbatch03.zip
*   trainbatch04.zip

## 2. After receiving access, right-click the folder name ("Harvard Hospital") in Google Drive and click "Add shortcut to Drive"

Google Colab users can directly connect to this Google Drive folder and begin model training with Colab by following instructions to mount this drive below.

This ensures the data in this folder is accessible from your personal Drive.

If you do not wish to use Colab, you can skip this step and simply download the competition data and train your model with your preferred approach.

<p align="center"><img width="40%" src='https://drive.google.com/uc?id=1WTeV9qblf19IpPpW0ejXMFTeZlftn29v'></p>


In [1]:
# Connect to your Google Drive to train models using Google Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
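
If you work outside Colab, the mount cell above will not run, because the `google.colab` package is only available there. The sketch below is one possible way to pick a data directory for either environment. The helper names `in_colab` and `resolve_train_dir` and the local folder `data/TrainingSet` are assumptions made for illustration; they are not part of the competition helper files.

```python
import importlib.util


def in_colab():
    """Return True when the google.colab package can be found."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # ImportError covers the case where the parent "google" package is missing
        return False


def resolve_train_dir(colab_dir, local_dir, colab=None):
    """Pick the Drive path on Colab, otherwise a local download folder."""
    if colab is None:
        colab = in_colab()
    return colab_dir if colab else local_dir


# Hypothetical paths: adjust both to where your data actually lives.
train_dir = resolve_train_dir(
    colab_dir="drive/MyDrive/Harvard Hospital/trainingSet",
    local_dir="data/TrainingSet",
)
print(train_dir)
```

Passing `colab=True` or `colab=False` explicitly overrides the detection, which can help when testing the same notebook in both environments.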


In [2]:
# Download helper files from github repo and copy to working directory
!git clone https://github.com/AIModelShare/harvarddatautils


Cloning into 'harvarddatautils'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), 4.25 KiB | 4.25 MiB/s, done.


In [3]:
# Copy utility files to working directory
import shutil
import os
 
# path to source directory
src_dir = 'harvarddatautils'
 
# path to destination directory
dest_dir = os.getcwd()
 
shutil.copytree(src_dir, dest_dir,dirs_exist_ok=True)

'/content'

Please first run `bash init.sh` to install the dependencies.


In [None]:
# Install missing libraries using the init.sh helper file
!bash init.sh

In [5]:
from importlib import reload
import datetime
import os
import random

import torch


DEBUG = True
CHECKPOINT_DIR = os.path.join('checkpoints')
os.makedirs(CHECKPOINT_DIR, exist_ok=True)

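# Note - Adjust this filepath to the directory where your training zipfiles are saved.
# The path must match the folder name in your Drive exactly, including capitalization
# (the folder listing above shows it as "TrainingSet").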
trainDir = "drive/MyDrive/Harvard Hospital/trainingSet"

In [6]:
# Extract a batch of training data from zipfile to working directory
# Use this batch of data to train a model.

# Extract more batches to continue to train model on more data.

from zipfile import ZipFile
  
# extracting training batch to working directory into folder named "batch01"
with ZipFile(trainDir+"/trainbatch1.zip", 'r') as zObject:
    zObject.extractall(
        path=os.getcwd())
    
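# Data is now extracted to the "batch01" folder.
# To extract more training data, change the zipfile name to the next batch
# (trainbatch2.zip, trainbatch3.zip, or trainbatch4.zip).
# If a file is not found, check the exact names in your TrainingSet folder:
# the listing above shows them as trainbatch01.zip ... trainbatch04.zip.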
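
The cell above extracts one batch at a time. As a minimal sketch of how several batches could be extracted in one pass, the hypothetical helper `extract_batches` below loops over a list of zipfile names and reports any name it cannot find instead of stopping partway. The zipfile names in the example call follow the cell above and may need to be changed to match your folder.

```python
import os
from zipfile import ZipFile


def extract_batches(zip_dir, zip_names, dest_dir):
    """Extract each named zipfile from zip_dir into dest_dir.

    Returns (extracted, missing): the names that were extracted and the
    names that were not found in zip_dir.
    """
    extracted, missing = [], []
    for name in zip_names:
        zip_path = os.path.join(zip_dir, name)
        if not os.path.isfile(zip_path):
            missing.append(name)
            continue
        with ZipFile(zip_path, "r") as archive:
            archive.extractall(path=dest_dir)
        extracted.append(name)
    return extracted, missing


# Example call with assumed names; nothing is extracted if the files are absent.
done, not_found = extract_batches(
    "drive/MyDrive/Harvard Hospital/trainingSet",
    ["trainbatch1.zip", "trainbatch2.zip"],
    os.getcwd(),
)
print("extracted:", done)
print("not found:", not_found)
```

Extracting every batch at once uses more disk space in the working directory, so extracting batch by batch, as the notebook does, may still be preferable on a small Colab instance.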


## Examine Data

In [7]:
# Example data file path
path=os.listdir("batch01")[0]
print(path)

sample043265.mat


In [8]:
# Check that the file exists. If it does not, you may have to adjust the "trainDir" filepath above to point to the location where you saved the shortcut in step 2
import os.path

path = os.path.join("batch01", path)
check_file = os.path.isfile(path)

print(check_file)

True


In [9]:
# Examine data (you can learn more about this data on the competition's Eventbrite webpage)
import mat73
_sample = mat73.loadmat(path)
print(f"Keys: {_sample.keys()}")
print(f"data_50sec: Shape={_sample['data_50sec'].shape}")
print(f"votes: {_sample['votes']}")
print(f"subject_ID: {_sample['subject_ID']}")
for _ in range(4):
    print(f"spec_10min[{_}]: ['{_sample['spec_10min'][_][0]}', numpy.ndarray with shape={_sample['spec_10min'][_][1].shape}]")

Keys: dict_keys(['data_50sec', 'spec_10min', 'subject_ID', 'votes'])
data_50sec: Shape=(20, 10000)
votes: [ 0.  0.  1. 10.  0.  0.]
subject_ID: sid1056
spec_10min[0]: ['LL', numpy.ndarray with shape=(100, 300)]
spec_10min[1]: ['RL', numpy.ndarray with shape=(100, 300)]
spec_10min[2]: ['LP', numpy.ndarray with shape=(100, 300)]
spec_10min[3]: ['RP', numpy.ndarray with shape=(100, 300)]
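
The `votes` array has six entries, the same number as the classes the model predicts. The sketch below shows one way to turn a vote vector into a majority label and a normalized vote distribution. It assumes that `votes` follows the same order as the `CLASSES` list used in the prediction cell of this notebook; that ordering is an assumption and should be checked against the competition description. `summarize_votes` is a hypothetical helper, not part of `data_utils`.

```python
# Class order copied from the prediction cell later in this notebook.
# Assumption: the `votes` array uses this same order.
CLASSES = ['Other', 'Seizure', 'LPD', 'GPD', 'LRDA', 'GRDA']


def summarize_votes(votes, classes=CLASSES):
    """Return (majority_label, vote_fractions) for one sample's votes."""
    if len(votes) != len(classes):
        raise ValueError("expected one vote count per class")
    total = sum(votes)
    if total == 0:
        raise ValueError("sample has no votes")
    fractions = [v / total for v in votes]
    # max() returns the first maximum, so ties go to the earlier class
    majority_index = max(range(len(votes)), key=lambda i: votes[i])
    return classes[majority_index], fractions


# Vote vector from the sample printed above
label, fractions = summarize_votes([0.0, 0.0, 1.0, 10.0, 0.0, 0.0])
print(label)
print([round(f, 3) for f in fractions])
```

The vote fractions can be useful when the raters disagree, since a sample with a 6 to 5 split carries less certainty than the 10 to 1 split shown here.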


In [10]:
# We will just load the data using data_utils. Note that this step may take several minutes.
# In the following, the test set is not loaded yet - we will load it in batches later 

#Load training data for one batch of data
#NOTE: change batchdir to batch02, batch03, or batch04 to load more training data.
batchdir="batch01"

import data_utils
reload(data_utils)
train_data = data_utils.ColumbiaData('train', split_ratio=[0.7, 0.3, 0.], data_dir=batchdir, debug=DEBUG)
valid_data = data_utils.ColumbiaData('val', split_ratio=[0.7, 0.3, 0.], data_dir=batchdir, debug=DEBUG)
#test_data = data_utils.ColumbiaData('test', data_dir=trainDir, debug=DEBUG)
for _split, _data in zip(['train', 'valid'], [train_data, valid_data]):
    print(f"number of {_split} samples={len(_data)}")
    print(f"number of {_split} subjects={_data._infos['subject_ID'].nunique()}")


/root/.persist_to_disk does not exist. Creating it for persist_to_disk
/root/.persist_to_disk/cache does not exist. Creating it for persist_to_disk
/root/.persist_to_disk/cache/content-1 does not exist. Creating it for persist_to_disk
/root/.persist_to_disk/cache/content-1/data_utils does not exist. Creating it for persist_to_disk
/root/.persist_to_disk/cache/content-1/data_utils/_read_and_transform_x does not exist. Creating it for persist_to_disk
/root/.persist_to_disk/cache/content-1/data_utils/_read_labels does not exist. Creating it for persist_to_disk
Reading labels...


100%|██████████| 200/200 [00:07<00:00, 28.17it/s]


Splitting Patients...
Reading signals...


100%|██████████| 136/136 [00:03<00:00, 44.46it/s]


Reading labels...


100%|██████████| 200/200 [00:00<00:00, 4815.59it/s]


Splitting Patients...
Reading signals...


100%|██████████| 64/64 [00:01<00:00, 45.18it/s]

number of train samples=136
number of train subjects=59
number of valid samples=64
number of valid subjects=26
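
The "Splitting Patients..." log lines and the separate subject counts suggest that the loader splits by subject rather than by sample, which keeps recordings from one patient out of both sets. The following is a minimal sketch of that idea, not the `data_utils` implementation; `split_by_subject` and the subject IDs in the example are hypothetical.

```python
import random


def split_by_subject(subject_ids, train_fraction=0.7, seed=0):
    """Assign every sample of a subject to the same split.

    subject_ids: one subject ID per sample.
    Returns (train_indices, valid_indices) into subject_ids.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    n_train = int(round(train_fraction * len(subjects)))
    train_subjects = set(subjects[:n_train])
    train_idx = [i for i, s in enumerate(subject_ids) if s in train_subjects]
    valid_idx = [i for i, s in enumerate(subject_ids) if s not in train_subjects]
    return train_idx, valid_idx


# Made-up subject IDs for illustration only
ids = ["sid1", "sid1", "sid2", "sid3", "sid3", "sid3", "sid4"]
train_idx, valid_idx = split_by_subject(ids)
print(train_idx, valid_idx)
```

Because the split fraction applies to subjects, the fraction of samples in each split can differ from 0.7 and 0.3 when subjects contribute different numbers of samples.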





## Train the Model

In [11]:
# We use a CNN on spectrogram - the model code was loaded with the helper files earlier
from model import CNNEncoder2D_IIIC

model = CNNEncoder2D_IIIC(nclass=6, num_channels=16)
model

CNNEncoder2D_IIIC(
  (conv1): Sequential(
    (0): Conv2d(32, 48, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ELU(alpha=1.0, inplace=True)
  )
  (conv2): ResBlock(
    (conv1): Conv2d(48, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
    (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (relu): ELU(alpha=1.0, inplace=True)
    (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (downsample): Sequential(
      (0): Conv2d(48, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (conv

In [12]:
from torch.utils.data import DataLoader
import tqdm
import numpy as np
from utils import concat_list_of_dict

@torch.no_grad()
def eval_model(model, valid_data, batch_size=128, device='cpu', num_workers=0):
    model.eval()
    valid_loader = DataLoader(dataset=valid_data, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    preds = []
    for batch in tqdm.tqdm(valid_loader):
        logits = model(batch.pop('data').to(device))
        batch['logits'] = logits.cpu()
        preds.append(batch)
    return concat_list_of_dict(preds, ['logits', 'target', 'index'])

def train(model, train_data, valid_data=None, batch_size=128, epochs=30, debug=False, device='cuda:0', num_workers=4):
    if debug:
        device, num_workers, epochs = 'cpu', 0, 3
    model = model.to(device)

    ckpt_path = os.path.join(CHECKPOINT_DIR, 'best_ckpt.pth')
    curr_val_loss = best_val_loss = np.inf
    train_loss = []

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()

    train_loader = DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True, num_workers=num_workers)

    for curr_epoch in range(epochs):
        model.train()
        for batch in tqdm.tqdm(train_loader, desc=f'Training Epoch={curr_epoch}'):
            optimizer.zero_grad()
            logits = model(batch['data'].to(device))
            loss = criterion(logits, batch['target'].to(device))
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
        if valid_data is not None:
            eval_results = eval_model(model, valid_data, batch_size=batch_size, device=device, num_workers=num_workers)
            curr_val_loss = criterion(torch.tensor(eval_results['logits']),
                                      torch.tensor(eval_results['target'])).item()
            if curr_val_loss < best_val_loss:
                best_val_loss = curr_val_loss
                torch.save(model.state_dict(), ckpt_path)
        print(f"Train Loss={np.mean(train_loss):.3f}, Val Loss={curr_val_loss:.3f}, Best Val Loss={best_val_loss:.3f}")
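    # Reload the best checkpoint only if one was saved during this run
    # (no checkpoint is written when valid_data is None).
    if np.isfinite(best_val_loss):
        model.load_state_dict(torch.load(ckpt_path, map_location=device))
    return model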

In [13]:
trained_model = train(model, train_data, valid_data, debug=DEBUG)

Training Epoch=0: 100%|██████████| 2/2 [00:04<00:00,  2.42s/it]
100%|██████████| 1/1 [00:00<00:00,  1.21it/s]


Train Loss=1.886, Val Loss=1.909, Best Val Loss=1.909


Training Epoch=1: 100%|██████████| 2/2 [00:03<00:00,  1.98s/it]
100%|██████████| 1/1 [00:00<00:00,  1.65it/s]


Train Loss=1.862, Val Loss=1.692, Best Val Loss=1.692


Training Epoch=2: 100%|██████████| 2/2 [00:03<00:00,  1.80s/it]
100%|██████████| 1/1 [00:00<00:00,  1.65it/s]

Train Loss=1.820, Val Loss=1.564, Best Val Loss=1.564
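
When reading the log above, note that `Train Loss` is the mean over every batch since training started, because the `train_loss` list in `train` is never cleared between epochs, while `Val Loss` is computed fresh each epoch. The sketch below uses made-up batch losses to show how the cumulative mean lags behind the per-epoch mean; `epoch_means` is a hypothetical helper for illustration only.

```python
def epoch_means(batch_losses, batches_per_epoch):
    """Return (cumulative_means, per_epoch_means), one value per epoch."""
    cumulative, per_epoch = [], []
    for end in range(batches_per_epoch, len(batch_losses) + 1, batches_per_epoch):
        seen = batch_losses[:end]                          # all batches so far
        current = batch_losses[end - batches_per_epoch:end]  # this epoch only
        cumulative.append(sum(seen) / len(seen))
        per_epoch.append(sum(current) / len(current))
    return cumulative, per_epoch


# Made-up batch losses: 3 epochs of 2 batches each
losses = [2.0, 1.8, 1.6, 1.4, 1.2, 1.0]
cumulative, per_epoch = epoch_means(losses, batches_per_epoch=2)
print([round(x, 2) for x in cumulative])
print([round(x, 2) for x in per_epoch])
```

If a per-epoch training loss is preferred, one option is to reset the list at the start of each epoch inside the training loop.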





## Generate test predictions to submit to competition leaderboard

In [14]:
# Generate predicted labels for the entire test dataset (in four batches).
# The test data is stored in pickle files. NOTE: you may need to adjust the file paths
# below if the pickle files were saved anywhere other than the default location.
import pandas as pd

def get_prediction_labels(trainedmodel, testdata):
    CLASSES = ['Other', 'Seizure', 'LPD', 'GPD', 'LRDA', 'GRDA']
    # Run inference on the device the model already lives on, so this also works
    # when the model was trained on a GPU (eval_model defaults to 'cpu').
    device = next(trainedmodel.parameters()).device
    test_preds = eval_model(trainedmodel, testdata, device=device)
    test_pred_df = pd.DataFrame(test_preds['logits'], columns=[f'pred_{_}' for _ in range(6)])
    # The predicted class index is the position of the largest logit
    test_pred_df['pred'] = np.argmax(test_preds['logits'], axis=1)
    test_pred_df['predicted_labels'] = [CLASSES[i] for i in test_pred_df['pred']]
    test_pred_df['index'] = test_preds['index']
    test_pred_df['target'] = test_preds['target']
    return list(test_pred_df['predicted_labels'])

import pickle

def predict_pickled_batch(pickle_path):
    # Load one pickled test batch and predict its labels. The batch goes out of
    # scope when the function returns, so only one batch is held in memory at a time.
    with open(pickle_path, 'rb') as file:
        batch_data = pickle.load(file)
    return get_prediction_labels(trained_model, batch_data)

predicted_labels_batch1 = predict_pickled_batch('drive/MyDrive/Harvard Hospital/batch1.pickle')
predicted_labels_batch2 = predict_pickled_batch('drive/MyDrive/Harvard Hospital/batch2.pickle')
predicted_labels_batch3 = predict_pickled_batch('drive/MyDrive/Harvard Hospital/batch3.pickle')
predicted_labels_batch4 = predict_pickled_batch('drive/MyDrive/Harvard Hospital/batch4.pickle')

100%|██████████| 78/78 [02:08<00:00,  1.65s/it]
100%|██████████| 79/79 [01:52<00:00,  1.42s/it]
100%|██████████| 79/79 [01:48<00:00,  1.37s/it]
100%|██████████| 45/45 [00:56<00:00,  1.26s/it]


In [15]:
#Combine predicted labels to submit to leaderboard 
predicted_labels=predicted_labels_batch1+predicted_labels_batch2+predicted_labels_batch3+predicted_labels_batch4
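
Before submitting, it can help to check the combined list: every entry should be one of the six class names, and a list dominated by a single class usually points to an undertrained model. The sketch below is one possible check; `check_labels` is a hypothetical helper, and the example predictions are made up.

```python
from collections import Counter

# Class names copied from the prediction cell above
CLASSES = ['Other', 'Seizure', 'LPD', 'GPD', 'LRDA', 'GRDA']


def check_labels(labels, classes=CLASSES):
    """Count predictions per class and reject labels outside `classes`."""
    unknown = sorted(set(labels) - set(classes))
    if unknown:
        raise ValueError(f"unexpected labels: {unknown}")
    counts = Counter(labels)
    return {name: counts.get(name, 0) for name in classes}


# Made-up predictions for illustration
example = ['GPD', 'GPD', 'Other', 'LPD']
print(check_labels(example))
```

In the notebook, the same call would be `check_labels(predicted_labels)`.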

In [None]:
# Install the aimodelshare library to make competition leaderboard submissions
! pip install aimodelshare --upgrade

To submit models to the competition leaderboard, you will need credentials for modelshare.ai.

[Create a free account here](https://www.modelshare.ai/)

In [27]:
from aimodelshare import ModelPlayground
from aimodelshare.aws import set_credentials

playground_id="https://negjs28m2m.execute-api.us-east-2.amazonaws.com/prod/m"
myplayground = ModelPlayground(playground_url = playground_id, task_type="classification")
set_credentials(apiurl=playground_id)

In [None]:
# Submit Model predictions to Competition Leaderboard
# To win the competition, be sure to share the code you used to preprocess data and generate your model...
#...on the code tab of the competition webpage here: https://www.modelshare.ai/detail/model:3691 

myplayground.submit_model(model=None,
                          preprocessor=None,
                          prediction_submission=predicted_labels,submission_type="competition")

In [36]:
# Check competition leaderboard
data = myplayground.get_leaderboard(submission_type="competition")
myplayground.stylize_leaderboard(data)

| | accuracy | f1_score | precision | recall | ml_framework | model_type | username | version |
|---|---|---|---|---|---|---|---|---|
| 0 | 17.06% | 15.68% | 17.17% | 16.65% | unknown | unknown | AIModelShare | 1 |
| 1 | 26.51% | 10.73% | 8.68% | 17.07% | unknown | unknown | newusertest | 3 |
| 2 | 26.51% | 10.73% | 8.68% | 17.07% | unknown | unknown | newusertest | 4 |
| 3 | 26.51% | 10.73% | 8.68% | 17.07% | unknown | unknown | newusertest | 5 |
| 4 | 22.40% | 12.65% | 13.87% | 15.99% | unknown | unknown | gstreett | 2 |