- Section with imports + notes on what they are
- Section on loading data w/reference back to dataloaders
- Include note on importing from kaggle
- Model + dataset section w/extra notes
    - talk about context in dataset section 



# Importing Libraries



In [1]:
#These libraries help to interact with the operating system and the runtime environment respectively
import os
import sys

#Model/Training related libraries
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import pandas as pd

#Dataloader libraries
from torch.utils.data import DataLoader, Dataset
from sklearn.metrics import accuracy_score, multilabel_confusion_matrix

# Setting up Data

To be able to use our data in training, the data must be loaded and stored in a workable form. In our case, we want to reformat our data into a pytorch tensor - which we can then store into a Dataset and Dataset Loader that can be used by our model.

### 1. Load Data



All homework datasets are accessible through Kaggle. Ideally, you don't want to download all the data locally onto your laptop. Instead, they can be loaded to the temporary process your notebook is working on. We will show an example here, with some tips on dataformatting. For more details, reference the Colab recitation. 

In [2]:
#Intall Kaggle API and create kaggle directory
!pip install kaggle
!mkdir .kaggle
#This data is used to login  into your Kaggle account
import json
token = {"username":"samruddhi98","key":"db26269bc2e5ae4d4f8456490e50b5a8"}
with open('/content/.kaggle/kaggle.json', 'w') as file:
    json.dump(token, file)




In [4]:
!chmod 600 /content/.kaggle/kaggle.json
!cp /content/.kaggle/kaggle.json /root/.kaggle/
!kaggle config set -n path -v /content

- path is now set to: /content


In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
#Download dataset .npz files from kaggle
!kaggle competitions download -c idl-fall2021-hw1p2

Downloading train.npy.zip to /content/competitions/idl-fall2021-hw1p2
 99% 1.89G/1.92G [00:08<00:00, 273MB/s]
100% 1.92G/1.92G [00:08<00:00, 238MB/s]
Downloading sample.csv.zip to /content/competitions/idl-fall2021-hw1p2
  0% 0.00/4.03M [00:00<?, ?B/s]
100% 4.03M/4.03M [00:00<00:00, 133MB/s]
Downloading dev.npy.zip to /content/competitions/idl-fall2021-hw1p2
 97% 238M/246M [00:00<00:00, 270MB/s]
100% 246M/246M [00:00<00:00, 267MB/s]
Downloading test.npy.zip to /content/competitions/idl-fall2021-hw1p2
 94% 227M/241M [00:05<00:00, 40.6MB/s]
100% 241M/241M [00:05<00:00, 42.6MB/s]
Downloading dev_labels.npy.zip to /content/competitions/idl-fall2021-hw1p2
  0% 0.00/617k [00:00<?, ?B/s]
100% 617k/617k [00:00<00:00, 86.3MB/s]
Downloading train_labels.npy.zip to /content/competitions/idl-fall2021-hw1p2
  0% 0.00/5.16M [00:00<?, ?B/s]
100% 5.16M/5.16M [00:00<00:00, 84.7MB/s]


In [8]:
!unzip -d competitions/idl-fall2021-hw1p2/ competitions/idl-fall2021-hw1p2/train.npy.zip 
!unzip -d competitions/idl-fall2021-hw1p2/ competitions/idl-fall2021-hw1p2/train_labels.npy.zip 
!unzip -d competitions/idl-fall2021-hw1p2/ competitions/idl-fall2021-hw1p2/dev.npy.zip 
!unzip -d competitions/idl-fall2021-hw1p2/ competitions/idl-fall2021-hw1p2/dev_labels.npy.zip 
!unzip -d competitions/idl-fall2021-hw1p2/ competitions/idl-fall2021-hw1p2/test.npy.zip 

Archive:  competitions/idl-fall2021-hw1p2/train.npy.zip
  inflating: competitions/idl-fall2021-hw1p2/train.npy  
Archive:  competitions/idl-fall2021-hw1p2/train_labels.npy.zip
  inflating: competitions/idl-fall2021-hw1p2/train_labels.npy  
Archive:  competitions/idl-fall2021-hw1p2/dev.npy.zip
  inflating: competitions/idl-fall2021-hw1p2/dev.npy  
Archive:  competitions/idl-fall2021-hw1p2/dev_labels.npy.zip
  inflating: competitions/idl-fall2021-hw1p2/dev_labels.npy  
Archive:  competitions/idl-fall2021-hw1p2/test.npy.zip
  inflating: competitions/idl-fall2021-hw1p2/test.npy  


### 2. Set up Dataset Class

The dataset class is used to format the input/output pairs and store them as well. For more details, recall the Dataset and Dataloader recitation on how each of these features work, as well as the OOP lecture.

In [9]:
class MLPDataset(Dataset):
    
    def __init__(self, data, labels, context=0):
        self.data = data 
        self.labels = labels 
        self.length = len(self.labels)
        self.context = context
        
    def __len__(self):
        return self.length
    
    def __getitem__(self,index):
        x = self.data[index:index+2*self.context+1,:]
        y = self.labels[index]
        return x,y
    
    def collate_fn(batch):
        batch_x = [x for x,y in batch]
        batch_y = [y for x,y in batch]
        batch_x = torch.as_tensor(batch_x)
        batch_y = torch.as_tensor(batch_y)
        return batch_x,batch_y

In [10]:
### Load the Data
x_train = np.load('competitions/idl-fall2021-hw1p2/train.npy',allow_pickle=True) # Data file name
labels_train = np.load('competitions/idl-fall2021-hw1p2/train_labels.npy',allow_pickle=True) # Label file name
x_val = np.load('competitions/idl-fall2021-hw1p2/dev.npy',allow_pickle=True) # Data file name
labels_val = np.load('competitions/idl-fall2021-hw1p2/dev_labels.npy',allow_pickle=True) # Label file name 
x_test = np.load('competitions/idl-fall2021-hw1p2/test.npy',allow_pickle=True) # Data file name

(14542,)
(14542,)
(2683,)
(2683,)
(2600,)


In [11]:
context = 25

## Concatenating and padding the data
# Training
x_train = np.concatenate(x_train,axis=0)
x_train = np.pad(x_train, ((context, context), (0, 0)), 'constant', constant_values=0)
labels_train = np.concatenate(labels_train,axis=0)
# Validation 
x_val = np.concatenate(x_val,axis=0)
x_val = np.pad(x_val, ((context, context), (0, 0)), 'constant', constant_values=0)
labels_val = np.concatenate(labels_val,axis=0)
# Testing
x_test = np.concatenate(x_test,axis=0)
labels_test = np.random.randint(71,size=x_test.shape[0])
x_test = np.pad(x_test, ((context, context), (0, 0)), 'constant', constant_values=0)
print(x_test.shape,labels_test.shape)

(1910062, 40) (1910012,)


In [27]:
## Dataloaders
batch_size = 128
test_batch_size = 4
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print('Device is ',device)

# Training dataloader
train_data = MLPDataset(x_train, labels_train, context=context)
train_args = dict(shuffle = True, batch_size = batch_size, num_workers=4, collate_fn=MLPDataset.collate_fn)
train_loader = DataLoader(train_data, **train_args)

# Validation dataloader
val_data = MLPDataset(x_val, labels_val, context=context)
val_args = dict(shuffle = False, batch_size = batch_size, num_workers=4, collate_fn=MLPDataset.collate_fn)
val_loader = DataLoader(val_data, **val_args)

# Testing dataloader
test_data = MLPDataset(x_test, labels_test, context=context)
test_args = dict(shuffle = False, batch_size = test_batch_size, num_workers=4, collate_fn=MLPDataset.collate_fn)
test_loader = DataLoader(test_data, **test_args)

Device is  cuda:0


In [13]:
## Model Architecture definition

class MLP(nn.Module):

    # define model elements
    def __init__(self, size):
        super(MLP, self).__init__()
        
        # Sequential model definition: Input -> Linear -> ReLU -> Linear -> Softmax -> Output
        
        self.model = nn.Sequential(nn.Linear(size[0], size[1]), 
                                   nn.BatchNorm1d(size[1]),
                                   nn.ReLU(),
                                   nn.Dropout(0.25,inplace=False),
                                   nn.Linear(size[1], size[2]),
                                   nn.BatchNorm1d(size[2]),
                                   nn.ReLU(),
                                   nn.Dropout(0.25,inplace=False),
                                   nn.Linear(size[2], size[3]), 
                                   nn.BatchNorm1d(size[3]),
                                   nn.ReLU(),
                                   nn.Dropout(0.25,inplace=False),
                                   nn.Linear(size[3], size[4]),
                                   nn.BatchNorm1d(size[4]),
                                   nn.ReLU(),
                                   nn.Linear(size[4], size[5]),
                                   nn.ReLU(),
                                   nn.Linear(size[5], size[6])
                                   )

    def forward(self, x):
        # Model forward pass
        self.x = self.model(x)
        return self.x

In [14]:
# Model
input_size = (2*context+1)*40
model = MLP([input_size, 4096, 2048, 2048, 1024, 512, 71])
model.to(device)
# Define Criterion/ Loss function
criterion = nn.CrossEntropyLoss()

# Define Adam Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr = 0.001)
# exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer, 
#                                              step_size=7, gamma=0.1)
rlr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='max', patience=2, threshold=0.01, min_lr=0.0001)
print(model)

MLP(
  (model): Sequential(
    (0): Linear(in_features=2040, out_features=4096, bias=True)
    (1): BatchNorm1d(4096, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): ReLU()
    (3): Dropout(p=0.25, inplace=False)
    (4): Linear(in_features=4096, out_features=2048, bias=True)
    (5): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): ReLU()
    (7): Dropout(p=0.25, inplace=False)
    (8): Linear(in_features=2048, out_features=2048, bias=True)
    (9): BatchNorm1d(2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (10): ReLU()
    (11): Dropout(p=0.25, inplace=False)
    (12): Linear(in_features=2048, out_features=1024, bias=True)
    (13): BatchNorm1d(1024, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (14): ReLU()
    (15): Linear(in_features=1024, out_features=512, bias=True)
    (16): ReLU()
    (17): Linear(in_features=512, out_features=71, bias=True)
  )
)


In [15]:
# Train the model
from tqdm import tqdm
def train_model(train_loader, model,batch_size):
    training_loss = 0
    
    # Set model in 'Training mode'
    model.train()
    
    # enumerate mini batches
    with tqdm(train_loader, unit="batch") as tepoch:
        for inputs, targets in tepoch:
            if (inputs.size()[0]==batch_size):
            # clear the gradients
                optimizer.zero_grad()
                inputs = inputs.type(torch.FloatTensor).to(device)
                targets = targets.type(torch.LongTensor).to(device)
                # compute the model output
                out = model(inputs.view(batch_size,-1))
                # calculate loss
                loss = criterion(out,targets)
                # Backward pass
                loss.backward()
                # Update model weights
                optimizer.step()

                training_loss += loss.item()
        training_loss /= len(train_loader)
        return training_loss

In [16]:
# Evaluate the model

def evaluate_model(val_loader, model,batch_size):
    # with torch.no_grad():
    predictions = []
    actuals = []
    
    # Set model in validation mode
    model.eval()
    
    with tqdm(val_loader, unit="batch") as tepoch:
        for inputs, targets in tepoch:    
        # evaluate the model on the validation set 
            if (inputs.size()[0]==batch_size):
                inputs = inputs.type(torch.FloatTensor).to(device)
                targets = targets.type(torch.LongTensor).to(device)
                out = model(inputs.view(batch_size,-1))
                # Calculate validation loss
                loss = criterion(out,targets)
                # retrieve numpy array
                out = out.cpu().detach().numpy()
                actual = targets.cpu().numpy()
                # convert to class labels
                out = np.argmax(out,axis=1)
                predictions.append(out)
                actuals.append(actual)
        predictions, actuals = np.asarray(predictions).flatten(), np.asarray(actuals).flatten()
        # print(predictions.shape,actuals.shape)
        acc = accuracy_score(actuals, predictions)
        return acc, loss.item()

In [28]:
# Test the model

def predict_model(test_loader, model,batch_size):
    predictions = []
    model.eval()
    
    with tqdm(test_loader, unit="batch") as tepoch:
        for inputs, targets in tepoch:
            if (inputs.size()[0]==batch_size):
                inputs = inputs.type(torch.FloatTensor).to(device)
                targets = targets.type(torch.LongTensor).to(device)
                out = model(inputs.view(batch_size,-1))
                out = out.cpu().detach().numpy()
                out = np.argmax(out,axis=1)
                predictions.append(out)
                
        predictions = np.asarray(predictions).flatten()
        return predictions

In [None]:
# Define number of epochs

if __name__=='__main__':
    
    epochs = 20
    for epoch in range(11,epochs):
        print('Starting epoch ',epoch)
        # Train
        training_loss = train_model(train_loader, model, batch_size)
        # Validation
        val_acc, val_loss = evaluate_model(val_loader, model, batch_size)
        rlr_scheduler.step(val_acc)

        PATH = 'model-4_epoch%d_val_acc-%0.4f.pt'%(epoch,val_acc)

        # Save
        torch.save({
            'epoch': epochs,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            'scheduler': rlr_scheduler.state_dict(),
            'loss': training_loss,
            'val_accuracy': val_acc,
            'val_loss': val_loss
            }, PATH)
        
        print('Epoch {}, lr {}'.format(epoch, optimizer.param_groups[0]['lr']))

        # Print log of accuracy and loss
        print("Epoch: "+str(epoch)+", Training loss: "+str(training_loss)+", Validation loss:"+str(val_loss)+
          ", Validation accuracy:"+str(val_acc*100)+"%")
    

Starting epoch  11


100%|██████████| 144399/144399 [53:36<00:00, 44.89batch/s]
100%|██████████| 15123/15123 [04:49<00:00, 52.17batch/s]


Epoch 11, lr 0.0005
Epoch: 11, Training loss: 0.541172926608208, Validation loss:3.2272121906280518, Validation accuracy:77.87195394127761%
Starting epoch  12


  2%|▏         | 2302/144399 [00:52<50:45, 46.66batch/s]

In [34]:
# Loading the model for predicting labels on test data
epoch = 11 # load the best model
val_acc = 0.7787 # the corresponding accuracy
PATH = 'model-4_epoch%d_val_acc-%0.4f.pt'%(epoch,val_acc)
print('Loading %s model '%(PATH))
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
rlr_scheduler.load_state_dict(checkpoint['scheduler'])
epochs = checkpoint['epoch']
training_loss = checkpoint['loss']
val_acc = checkpoint['val_accuracy']
val_loss = checkpoint['val_loss']

Loading drive/MyDrive/IDL-HW1P2/model-4_epoch10_val_acc-0.7787.pt model 


In [29]:
# Predicting data
preds = predict_model(test_loader, model, batch_size=test_batch_size)
print('Shape of predicted data is ',preds.shape)

100%|██████████| 477503/477503 [15:34<00:00, 511.00batch/s]


Shape of predicted data is  (1910012,)


In [30]:
# Saving in csv format
import pandas as pd
filename = 'drive/MyDrive/IDL-HW1P2/submissions_2.csv'
index = np.arange(0,preds.shape[0])
data = {'id':list(index),
        'label':list(preds)}
pred_df = pd.DataFrame(data)
pred_df = pred_df.rename_axis('id',axis=1)
print(pred_df)
pred_df.to_csv(filename,index=False)

id            id  label
0              0      0
1              1      0
2              2      0
3              3      0
4              4      0
...          ...    ...
1910007  1910007      0
1910008  1910008      0
1910009  1910009      0
1910010  1910010      0
1910011  1910011      0

[1910012 rows x 2 columns]


In [31]:
# Submitting predictions to kaggle
!kaggle competitions submit -c idl-fall2021-hw1p2 -f {filename} -m "Model-1 submission"

100% 18.5M/18.5M [00:08<00:00, 2.21MB/s]
Successfully submitted to IDL-Fall21-HW1P2