<h1 align='center'><b><font color ='pickle'>Refactor Custom Shallow NN</font></b></h1>

Changes Made
- Use Embedding Layer instead of TF-IDF
- Move dataloaders, loss functions, optimizer after wandb.init()
- Check transformation using check loader
- Training Loops
> Add functionality for 
>- model.train(); model.eval()
>- Gradient Clipping
>- log batch loss and accuracy 
>- print time it takes to run epoch
>- Save checkpoints
>- Early stopping

- Add Dictionary for Hyperparameters
- Learning Rate Scheduler
- Weight Decay
- Enhance wandb logging
- Add seed for reproducibility
- Sanity Check 





# <Font color = 'pickle'>**Load Libraries/Install Software**

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
else:
  print('Not running on CoLab')

Running on CoLab


In [3]:
# Install wandb and update it to the latest version
if 'google.colab' in str(get_ipython()):
    !pip install wandb --upgrade -q

In [4]:
# mount google drive
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
# Importing the necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.vocab import  vocab

import random
from datetime import datetime
import numpy as np
import pandas as pd
import joblib
from collections import Counter
from types import SimpleNamespace

from pathlib import Path
import sys

from sklearn.model_selection import train_test_split
import wandb

In [None]:
#if 'google.colab' in str(get_ipython()):
#  !python -m spacy download 'en_core_web_sm'

In [7]:
# Login to W&B
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
[34m[1mwandb[0m: Currently logged in as: [33mhsingh-utd[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

# <Font color = 'pickle'>**Specify Project Folders**

In [8]:
# This is the path where we will download and save data
if 'google.colab' in str(get_ipython()):
  base_folder = Path('/content/drive/MyDrive/data')
else:
  base_folder = Path('/home/harpreet/Insync/google_drive_shaannoor/data')

In [9]:
data_folder = base_folder/'datasets/aclImdb'
model_folder = base_folder/'models/nlp_fall_2022/imdb'
custom_functions = base_folder/'custom-functions'

In [10]:
sys.path.append(str(custom_functions))

In [11]:
sys.path

['/content',
 '/env/python',
 '/usr/lib/python37.zip',
 '/usr/lib/python3.7',
 '/usr/lib/python3.7/lib-dynload',
 '',
 '/usr/local/lib/python3.7/dist-packages',
 '/usr/lib/python3/dist-packages',
 '/usr/local/lib/python3.7/dist-packages/IPython/extensions',
 '/root/.ipython',
 '/content/drive/MyDrive/data/custom-functions']

In [12]:
import custom_preprocessor as cp

# <Font color = 'pickle'>**IMDB Dataset**

For this notebook, we will use the IMDB movie review dataset. <br>
Link for the complete dataset: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

We downloaded the dataset in the Lecture 2 notebook (3_Faster_tokenization_spacy_final.ipynb).

In Lecture 2 we created two CSV files, train.csv and test.csv. The files are available in the Lecture2/data folder on eLearning. For this lecture, I applied the custom preprocessor to clean the datasets, pickled them, and saved them as files. The files are available in the Lecture_6/data folder. We will download the following files as well.

- 'x_train_cleaned_bag_of_words.pkl'
- 'x_valid_cleaned_bag_of_words.pkl'
- 'x_test_cleaned_bag_of_words.pkl'

In [13]:
# location of train and test files
train_file = data_folder /'train.csv'
test_file = data_folder /'test.csv'

In [14]:
# creating Pandas Dataframe
train_data = pd.read_csv(train_file, index_col=0)
test_data = pd.read_csv(test_file, index_col=0)

In [15]:
# print shape of the datasets
print(f'Shape of Training data set is : {train_data.shape}')
print(f'Shape of Test data set is : {test_data.shape}')

Shape of Training data set is : (25000, 2)
Shape of Test data set is : (25000, 2)


In [16]:
train_data.head()

Unnamed: 0,Reviews,Labels
0,Ever wanted to know just how much Hollywood co...,1
1,The movie itself was ok for the kids. But I go...,1
2,You could stage a version of Charles Dickens' ...,1
3,this was a fantastic episode. i saw a clip fro...,1
4,and laugh out loud funny in many scenes.<br />...,1


## <Font color = 'pickle'>**Create Train/Test/Valid Split**


In [17]:
X, y = train_data['Reviews'].values, train_data['Labels'].values

In [18]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.20, random_state=42)

In [19]:
X_test , y_test = test_data['Reviews'].values, test_data['Labels'].values

## <Font color = 'pickle'>**Data PreProcessing**

In [20]:
# X_train_cleaned = cp.SpacyPreprocessor(model = 'en_core_web_sm', batch_size=1000).transform(X_train)

In [21]:
# X_valid_cleaned = cp.SpacyPreprocessor(model = 'en_core_web_sm', batch_size=1000).transform(X_valid)
# X_test_cleaned = cp.SpacyPreprocessor(model = 'en_core_web_sm', batch_size=1000).transform(X_test)

In [22]:
X_train_cleaned_file = data_folder / 'x_train_cleaned_bag_of_words.pkl'
X_valid_cleaned_file = data_folder / 'x_valid_cleaned_bag_of_words..pkl'
X_test_cleaned_file = data_folder / 'x_test_cleaned_bag_of_words..pkl'

In [23]:
# joblib.dump(X_train_cleaned, X_train_cleaned_file)
# joblib.dump(X_valid_cleaned, X_valid_cleaned_file)
# joblib.dump(X_test_cleaned, X_test_cleaned_file)

In [24]:
X_train_cleaned = joblib.load(X_train_cleaned_file)
X_valid_cleaned = joblib.load(X_valid_cleaned_file)
X_test_cleaned = joblib.load(X_test_cleaned_file)

In [25]:
print(type(X_train_cleaned))
print(type(y_train))

<class 'list'>
<class 'numpy.ndarray'>


## <Font color = 'pickle'>**Custom Dataset Class**

In [26]:
class CustomDataset(torch.utils.data.Dataset):
    """IMDB dataset."""

    def __init__(self, X, y):
        self.X = np.array(X)
        self.y = y

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        text = self.X[idx]
        labels = self.y[idx]
        sample = (text, labels)
        
        return sample

In [27]:
trainset = CustomDataset(X_train_cleaned,y_train)
validset = CustomDataset(X_valid_cleaned,y_valid)
testset = CustomDataset(X_test_cleaned,y_test)

## <Font color = 'pickle'>**Create Vocab**

In [28]:
def create_vocab(dataset, min_freq):
  counter = Counter()
  for (text, _) in dataset:
    counter.update(str(text).split())
  my_vocab = vocab(counter, min_freq=min_freq)
  my_vocab.insert_token('<unk>', 0)
  my_vocab.set_default_index(0)
  return my_vocab

The vocab should always be built from the training set only, so that no information from the validation or test sets leaks into the model.

In [29]:
imdb_vocab = create_vocab(trainset, min_freq = 2)

In [30]:
len(imdb_vocab)

36241

In [31]:
imdb_vocab.get_itos()[0:5]

['<unk>', 'production', 'absolutely', 'storyline', 'acting']

In [32]:
imdb_vocab['abracadabra']

0
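The lookup above returns 0 because 'abracadabra' is not in the vocabulary, and index 0 (`<unk>`) is the default index. A minimal plain-Python sketch of the same idea on a toy corpus follows. The names `build_toy_vocab` and `lookup` are hypothetical helpers written for this illustration and are not part of torchtext.

```python
from collections import Counter

def build_toy_vocab(texts, min_freq):
    """Map tokens seen at least min_freq times to indices; index 0 is '<unk>'."""
    counter = Counter()
    for text in texts:
        counter.update(text.split())
    # Rare tokens (below min_freq) are left out of the vocabulary
    itos = ['<unk>'] + [tok for tok, freq in counter.items() if freq >= min_freq]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, token):
    # Unknown or rare tokens fall back to index 0, mirroring set_default_index(0)
    return stoi.get(token, 0)

toy_train = ["good movie", "good acting", "bad movie"]
toy_stoi = build_toy_vocab(toy_train, min_freq=2)
print(toy_stoi)
print(lookup(toy_stoi, "acting"), lookup(toy_stoi, "movie"))
```

In this toy corpus, 'acting' and 'bad' appear only once, so with `min_freq=2` they map to the `<unk>` index just like a word that was never seen.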

## <Font color = 'pickle'>**Collate_fn for Data Loaders**

In [33]:
# Create lambda functions that map text to a list of vocab indices and a label to an int
text_pipeline = lambda x: [imdb_vocab[token] for token in str(x).split()]
label_pipeline = lambda x: int(x)

In [34]:
'''
The input to an embedding layer is a sequence of word indices from the vocab.
collate_batch() accepts a batch of (text, label) pairs, converts each text to vocab indices,
and returns the indices, the labels and the offsets.
We pass collate_batch() as the collate_fn argument of DataLoader,
so each batch contains word indices and the corresponding labels.
EmbeddingBag needs one extra input: offsets.
offsets holds the starting index of each bag (sequence) in the flattened input.
'''
def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_text, _label) in batch:
         label_list.append(label_pipeline(_label))
         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
         text_list.append(processed_text)
         offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return text_list, label_list, offsets
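To make the role of `offsets` concrete, here is a small plain-Python sketch of the flattening that `collate_batch` performs. It uses toy token ids and no PyTorch; the helper names `flatten_with_offsets` and `split_bags` are hypothetical and exist only for this illustration.

```python
from itertools import accumulate

def flatten_with_offsets(sequences):
    """Concatenate token-id lists and record where each one starts."""
    lengths = [len(seq) for seq in sequences]
    flat = [tok for seq in sequences for tok in seq]
    # Start positions: cumulative sum of the lengths, shifted right by one
    offsets = [0] + list(accumulate(lengths))[:-1]
    return flat, offsets

def split_bags(flat, offsets):
    """Recover the original sequences from the flat list and the offsets."""
    ends = offsets[1:] + [len(flat)]
    return [flat[start:end] for start, end in zip(offsets, ends)]

batch = [[5, 2, 9], [7, 1], [3, 3, 8, 4]]
flat, offsets = flatten_with_offsets(batch)
print(flat)     # [5, 2, 9, 7, 1, 3, 3, 8, 4]
print(offsets)  # [0, 3, 5]
```

Because the offsets are enough to recover each review from the flat tensor, reviews of different lengths can share a batch without padding.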

## <Font color = 'pickle'>**Check Data Loaders**

Let us check that our collate function works by creating a dataloader with a small batch size.

In [35]:
batch_size=2
check_loader= torch.utils.data.DataLoader(dataset=trainset,
                                        batch_size=batch_size,
                                        shuffle=True,
                                        collate_fn=collate_batch,
                                        num_workers=4)

In [36]:
for text, label, offsets in check_loader:
  print(label, text, offsets)
  break

tensor([0, 1]) tensor([  122,   472,  1181,   272,    69,   126,   619,  2349,   472,   242,
         1420,  1862,   135,    97,     4,   395,  2738,  1493,  7696,   782,
          941,   746,   941,   172,  1401,    47,   154,    93,  1873,   984,
        10675,   122,   635, 16091,   537,   802,  3466,   376,  2591,  8728,
          130,  8789,  1161,  1800,    69,   122,   314,  4266,   542, 22252,
          220,   874,  4048, 16787,  1252,  1069,   741,  1832,   526,  6979,
         1776,  6115,   582,   961, 18422,  2600,  2396, 31744,   321,  1695,
        27206,  2837,   331, 22258,  3142,  1445, 18976,   441,   102, 11971,
        30472, 19007, 18422,   483,  8258,  9684,  4261,   542,  2336,   258,
         2034,  4305,   749, 14840, 15360, 16787,    82,  1872,  5991,  1472,
           67,     5, 18504,  3088, 11062,  2766,  4155,  6950,  1808,   209,
          362,   352,  1147,  1322,  1252,  1069,   172,  5076,  2135,   567,
         7262,  1665,   172,   349,  7039,    69]

# <font color = 'pickle'> **Functions to implement NN Training**

Now, we will implement the functions needed to train our neural network.

We will now create following functions:

- **Model**
- **Loss Function** (we use `nn.CrossEntropyLoss`, which takes class indices directly, so no separate one-hot encoding step is needed)
- **Training Loop for 1 epoch**
- **Validation Loop for 1 epoch**
- **Model Training** - repeat the training and validation loops for given number of epochs
- **Function to get the accuracy given the model**

## <Font color = 'pickle'>**Model**

In [37]:
class MLPCustom(nn.Module):
  def __init__(self, embed_dim, vocab_size, hidden_dim1, hidden_dim2, output_dim, non_linearity):

    super().__init__()    
    self.hidden_dim1 = hidden_dim1
    self.hidden_dim2 = hidden_dim2
    self.output_dim = output_dim
    self.vocab_size = vocab_size
    self.embed_dim = embed_dim

    self.non_linearity = non_linearity

    

    # embedding_layer
    self.embedding = nn.EmbeddingBag(self.vocab_size, self.embed_dim)

    # hidden layer1
    self.hidden_layer1 = nn.Linear(self.embed_dim, self.hidden_dim1)

    # dropout layer 1
    self.drop1 = nn.Dropout(p= 0.5)

    # batch norm layer 1
    self.batchnorm1 = nn.BatchNorm1d(num_features=self.hidden_dim1)

    # hidden layer2
    self.hidden_layer2 = nn.Linear(self.hidden_dim1, self.hidden_dim2)
    
    # dropout layer 2
    self.drop2 = nn.Dropout(p= 0.5)

    # batch norm layer 2
    self.batchnorm2 = nn.BatchNorm1d(num_features=self.hidden_dim2)
    
    # output layer
    self.output_layer = nn.Linear(self.hidden_dim2, self.output_dim)

    # the non-linearity is passed in as a callable and stored in self.non_linearity above


  def forward(self, input_, offsets):
    embed_out = self.embedding(input_, offsets) # batchsize, embedding_dim

    hout1 = self.non_linearity(self.hidden_layer1(embed_out)) # batchsize, hidden_dim1
    hout1 = self.batchnorm1(hout1)
    hout1 = self.drop1(hout1)
    
    hout2 = self.non_linearity(self.hidden_layer2(hout1)) # batchsize, hidden_dim2
    hout2 = self.batchnorm2(hout2)
    hout2 = self.drop2(hout2)
    
    ypred = self.output_layer(hout2) # batchsize, output_dim
    
    # Note: we do not apply softmax here because nn.CrossEntropyLoss expects raw logits

    return ypred

## <Font color = 'pickle'>**Function for Training  Loops**

**Model Training** involves the following steps:

- Step 0: Randomly initialize parameters / weights
- Step 1: Compute model's predictions - forward pass
- Step 2: Compute loss
- Step 3: Compute the gradients
- Step 4: Update the parameters
- Step 5: Repeat steps 1 - 4

Model training is repeating this process over and over, for many **epochs**.

We will specify number of ***epochs*** and during each epoch we will iterate over the complete dataset and will keep on updating the parameters.

***Learning rate*** and ***epochs*** are hyperparameters. We tune their values based on performance on the validation dataset.
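The steps listed above can be sketched on a toy problem in plain Python. This is only an illustration under simplifying assumptions: a single weight, a mean squared error loss instead of cross entropy, and hand-written gradients instead of autograd.

```python
import random

random.seed(0)

# Toy data: y = 2 * x, so the ideal weight is 2.0
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

learning_rate = 0.01   # hyperparameter
epochs = 200           # hyperparameter

# Step 0: randomly initialize the parameter
w = random.uniform(-1.0, 1.0)

for epoch in range(epochs):          # Step 5: repeat steps 1 - 4
    # Step 1: forward pass, compute the model's predictions
    preds = [w * x for x in xs]
    # Step 2: compute the loss (mean squared error in this sketch)
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # Step 3: compute the gradient of the loss with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Step 4: update the parameter
    w -= learning_rate * grad

print(f'learned w = {w:.4f}, final loss = {loss:.6f}')
```

In the notebook, PyTorch takes over steps 3 and 4 through `loss.backward()` and `optimizer.step()`.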

We will now create functions for steps 1 to 4.

In [38]:
def train(train_loader, loss_function, model, optimizer, grad_clipping, max_norm, log_batch, log_interval):

  # Training Loop 

  # Declare the batch counter as global;
  # it keeps counting across epochs
  global batch_ct_train

  # Initialize running loss and correct count at the start of the epoch
  running_train_loss = 0
  running_train_correct = 0
  
  # put the model in training mode

  model.train()
  # Iterate on batches from the dataset using train_loader
  for input_, targets, offsets in train_loader:
    
    # move inputs and outputs to GPUs
    input_ = input_.to(device)
    targets = targets.to(device)
    offsets = offsets.to(device)


    # Step 1: Forward Pass: Compute model's predictions 
    output = model(input_, offsets)
    
    # Step 2: Compute loss
    loss = loss_function(output, targets)

    # Correct prediction
    y_pred = torch.argmax(output, dim = 1)
    correct = torch.sum(y_pred == targets)

    batch_ct_train += 1

    # Step 3: Backward pass -Compute the gradients
    optimizer.zero_grad()
    loss.backward()

    # Gradient Clipping
    if grad_clipping:
      nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm, norm_type=2)

    # Step 4: Update the parameters
    optimizer.step()
          
    # Add train loss of a batch 
    running_train_loss += loss.item()

    # Add correct counts of a batch
    running_train_correct += correct

    # log batch loss and accuracy
    if log_batch:
      if ((batch_ct_train + 1) % log_interval) == 0:
        wandb.log({f"Train Batch Loss  :": loss})
        wandb.log({f"Train Batch Acc :": correct/len(targets)})

  
  # Calculate mean train loss for the whole dataset for a particular epoch
  train_loss = running_train_loss/len(train_loader)

  # Calculate accuracy for the whole dataset for a particular epoch
  train_acc = running_train_correct/len(train_loader.dataset)
  

  return train_loss, train_acc
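The gradient clipping step in `train` rescales all gradients together when their combined L2 norm exceeds `max_norm`. The plain-Python sketch below illustrates that rescaling on a flat list of numbers. The helper `clip_by_global_norm` is hypothetical and only approximates what `nn.utils.clip_grad_norm_` does on real parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1.0, so small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]                      # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients already below the threshold pass through untouched
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(small)
```

One consequence worth noting: under this scheme a `max_norm` of 0 scales every gradient to zero, so a positive value is needed whenever clipping is switched on.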

## <Font color = 'pickle'>**Function for Validation Loops**


In [39]:
def validate(valid_loader, loss_function, model, log_batch, log_interval):

  # Declare the batch counter as global;
  # it keeps counting across epochs
  global batch_ct_valid

  # Validation/Test loop
  # Initialize running loss and correct count at the start of the epoch
  running_val_loss = 0
  running_val_correct = 0

  # put the model in evaluation mode
  model.eval()

  with torch.no_grad():
    for input_, targets, offsets in valid_loader:

      # move inputs and outputs to GPUs
      input_ = input_.to(device)
      targets = targets.to(device)
      offsets = offsets.to(device)

      # Step 1: Forward Pass: Compute model's predictions 
      output = model(input_, offsets)

      # Step 2: Compute loss
      loss = loss_function(output, targets)

      # Correct Predictions
      y_pred = torch.argmax(output, dim = 1)
      correct = torch.sum(y_pred == targets)

      batch_ct_valid += 1

      # Add val loss of a batch 
      running_val_loss += loss.item()

      # Add correct count for each batch
      running_val_correct += correct

      # log batch loss and accuracy
      if log_batch:
        if ((batch_ct_valid + 1) % log_interval) == 0:
          wandb.log({f"Valid Batch Loss  :": loss})
          wandb.log({f"Valid Batch Accuracy :": correct/len(targets)})

    # Calculate mean val loss for the whole dataset for a particular epoch
    val_loss = running_val_loss/len(valid_loader)

    # Calculate accuracy for the whole dataset for a particular epoch
    val_acc = running_val_correct/len(valid_loader.dataset)

    # scheduler step
    # scheduler.step(valid_loss)
    # scheduler.step()
    
  return val_loss, val_acc

## <Font color = 'pickle'>**Function for Model Training**
    
We will now create a function for step 5 of model training: repeating the training and validation loops for a given number of epochs.


In [40]:
def train_loop(train_loader, valid_loader, model, optimizer, loss_function, epochs, device, patience, early_stopping,
               file_model):
    
  """ 
  Train the model for a number of epochs, saving a checkpoint whenever the validation loss improves.
  Input: train and validation data loaders, model, optimizer, loss function, number of epochs, device,
         patience, early-stopping flag, and the file path for the model checkpoint.
  Output: per-epoch histories of train loss, train accuracy, validation loss and validation accuracy.
  """

  # Create lists to store train and val loss at each epoch
  train_loss_history = []
  valid_loss_history = []
  train_acc_history = []
  valid_acc_history = []

  # initialize variables for early stopping

  delta = 0
  best_score = None
  valid_loss_min = np.inf
  counter_early_stop=0
  early_stop=False

  # Iterate for the given number of epochs
  # Step 5: Repeat steps 1 - 4

  for epoch in range(epochs):

    t0 = datetime.now()

    # Get train loss and accuracy for one epoch
    train_loss, train_acc = train(train_loader, loss_function, model, optimizer, 
                                  wandb.config.GRAD_CLIPPING, wandb.config.MAX_NORM,
                                  wandb.config.LOG_BATCH, wandb.config.LOG_INTERVAL)
    valid_loss, valid_acc   = validate(valid_loader, loss_function, model, 
                                       wandb.config.LOG_BATCH, wandb.config.LOG_INTERVAL)

    dt = datetime.now() - t0

    # Save history of the Losses and accuracy
    train_loss_history.append(train_loss)
    train_acc_history.append(train_acc)

    valid_loss_history.append(valid_loss)
    valid_acc_history.append(valid_acc)

    # Log the train and valid loss to wandb
    wandb.log({f"Train Loss :": train_loss, "epoch": epoch})
    wandb.log({f"Train Acc :": train_acc, "epoch": epoch})

    wandb.log({f"Valid Loss :": valid_loss, "epoch": epoch})
    wandb.log({f"Valid Acc :": valid_acc, "epoch": epoch})

    if early_stopping:
      score = -valid_loss
      if best_score is None:
        best_score=score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving Model...')
        torch.save(model.state_dict(), file_model)
        valid_loss_min = valid_loss

      elif score < best_score + delta:
        counter_early_stop += 1
        print(f'Early stopping counter: {counter_early_stop} out of {patience}')
        if counter_early_stop > patience:
          early_stop = True


      else:
        best_score = score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving model...')
        torch.save(model.state_dict(), file_model)
        counter_early_stop=0
        valid_loss_min = valid_loss

      if early_stop:
        print('Early Stopping')
        break

    else:

      score = -valid_loss
      if best_score is None:
        best_score=score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving Model...')
        torch.save(model.state_dict(), file_model)
        valid_loss_min = valid_loss

      elif score < best_score + delta:
        print(f'Validation loss has not decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Not Saving Model...')
      
      else:
        best_score = score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving model...')
        torch.save(model.state_dict(), file_model)
        valid_loss_min = valid_loss
    
    # Print the epoch summary: elapsed time, loss and accuracy on the train and validation sets
    print(f'Epoch : {epoch+1} / {epochs}')
    print(f'Time to complete {epoch+1} is {dt}')
    # print(f'Learning rate: {scheduler._last_lr[0]}')
    print(f'Train Loss: {train_loss : .4f} | Train Accuracy: {train_acc * 100 : .4f}%')
    print(f'Valid Loss: {valid_loss : .4f} | Valid Accuracy: {valid_acc * 100 : .4f}%')
    print()
    torch.cuda.empty_cache()

  return train_loss_history, train_acc_history, valid_loss_history, valid_acc_history
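The early-stopping branch of `train_loop` can be hard to follow in the middle of the logging and checkpointing code. The sketch below isolates the same counter logic and replays it over a made-up list of validation losses. The function `early_stopping_epoch` is a hypothetical helper for illustration; it saves no model and the loss values are invented.

```python
def early_stopping_epoch(valid_losses, patience, delta=0.0):
    """Replay the patience logic over per-epoch validation losses.

    Returns (stop_epoch, best_epoch), both 0-based.
    stop_epoch is None if training would run through all epochs.
    """
    best_loss = None
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(valid_losses):
        if best_loss is None or loss <= best_loss - delta:
            # Improvement: this is where a checkpoint would be saved
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            # Same comparison as train_loop: stop once the counter exceeds patience
            if counter > patience:
                return epoch, best_epoch
    return None, best_epoch

toy_losses = [0.55, 0.51, 0.49, 0.50, 0.52, 0.53, 0.60]
print(early_stopping_epoch(toy_losses, patience=2))   # (5, 2)
print(early_stopping_epoch(toy_losses, patience=10))  # (None, 2)
```

With `patience=2` the loop stops at the third consecutive epoch without improvement, and the best checkpoint is the one from epoch index 2.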

## <Font color = 'pickle'>**Function for Accuracy and Predictions**

Now we have final values for weights and bias after training the model. We will use these values to make predictions on the test dataset.

In [41]:
def get_acc_pred(data_loader, model, device):
    
  """ 
  Function to get predictions and accuracy for a given dataset using the trained model
  Input: data loader, trained model, device
  Output: predictions and accuracy for the given dataset
  """

  # Array to store predicted labels
  predictions = torch.Tensor() # empty tensor
  predictions = predictions.to(device) # move predictions to GPU

  # Array to store actual labels
  y = torch.Tensor() # empty tensor
  y = y.to(device)

  # put the model in evaluation mode
  model.eval()
  
  # Iterate over batches from data iterator
  with torch.no_grad():
    for input_, targets, offsets in data_loader:
      
      # move inputs and outputs to GPUs
      
      input_ = input_.to(device)
      targets = targets.to(device)
      offsets = offsets.to(device)
      
      # Calculated the predicted labels
      output = model(input_, offsets)

      # Choose the label with maximum probability
      prediction = torch.argmax(output, dim = 1)

      # Add the predicted labels to the array
      predictions = torch.cat((predictions, prediction)) 

      # Add the actual labels to the array
      y = torch.cat((y, targets)) 

  # Check for complete dataset if actual and predicted labels are same or not
  # Calculate accuracy
  acc = (predictions == y).float().mean()

  # Return tuple containing predictions and accuracy
  return predictions, acc  

# <Font color = 'pickle'>**Meta Data**

In [42]:
hyperparameters = SimpleNamespace(
    EMBED_DIM = 300,
    VOCAB_SIZE = len(imdb_vocab),
    OUTPUT_DIM = 2,
    HIDDEN_DIM1 = 200,
    HIDDEN_DIM2 = 100,
    NON_LINEARITY= F.relu,
    EPOCHS = 50,
    
    BATCH_SIZE = 128,
    LEARNING_RATE = 0.001,
    DATASET="IMDB",
    ARCHITECTUREe="Embed_2_hidden_layers",
    LOG_INTERVAL = 25,
    LOG_BATCH = True,
    FILE_MODEL = model_folder/'imdb_2_hidden_layers.pt',
    GRAD_CLIPPING = False,
    MAX_NORM = 0,
    MOMENTUM = 0,
    PATIENCE = 5,
    EARLY_STOPPING = True,
    # SCHEDULER_FACTOR = 0,
    # SCHEDULER_PATIENCE = 0,
    WEIGHT_DECAY = 0.001
    )

# <Font color = 'pickle'>**Data Loaders, Loss Function, Optimizer**

In [43]:
# Initialize a new W&B run (random is already imported above)
wandb.init(name = 'Embed_2_hidden_layers', project = 'NLP_MLP_imdb')

In [44]:
wandb.config = hyperparameters
wandb.config

namespace(ARCHITECTUREe='Embed_2_hidden_layers', BATCH_SIZE=128, DATASET='IMDB', EARLY_STOPPING=True, EMBED_DIM=300, EPOCHS=50, FILE_MODEL=PosixPath('/content/drive/MyDrive/data/models/nlp_fall_2022/imdb/imdb_2_hidden_layers.pt'), GRAD_CLIPPING=False, HIDDEN_DIM1=200, HIDDEN_DIM2=100, LEARNING_RATE=0.001, LOG_BATCH=True, LOG_INTERVAL=25, MAX_NORM=0, MOMENTUM=0, NON_LINEARITY=<function relu at 0x7fa9b6bc9200>, OUTPUT_DIM=2, PATIENCE=5, VOCAB_SIZE=36241, WEIGHT_DECAY=0.001)
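Assigning a namespace to `wandb.config` rebinds the name, so the values may not be recorded with the run on the W&B side. One possible alternative, sketched here under the assumption that only plain values should be logged, is to convert the namespace to a dictionary first. The helper `to_loggable_dict`, the stand-in `relu` function and the toy values are hypothetical.

```python
from pathlib import Path
from types import SimpleNamespace

def to_loggable_dict(namespace):
    """Convert a SimpleNamespace into a dict of plain, serializable values."""
    loggable = {}
    for key, value in vars(namespace).items():
        if value is None or isinstance(value, (bool, int, float, str)):
            loggable[key] = value
        elif isinstance(value, Path):
            loggable[key] = value.as_posix()
        elif callable(value):
            # e.g. an activation function: keep only its name
            loggable[key] = getattr(value, '__name__', repr(value))
        else:
            loggable[key] = repr(value)
    return loggable

def relu(x):
    # Stand-in for F.relu so that this sketch does not need PyTorch
    return max(0.0, x)

toy_hparams = SimpleNamespace(EMBED_DIM=300, LEARNING_RATE=0.001,
                              NON_LINEARITY=relu,
                              FILE_MODEL=Path('models') / 'toy.pt')
config_dict = to_loggable_dict(toy_hparams)
print(config_dict)
# The dict could then be passed when the run is created, for example:
# wandb.init(name='Embed_2_hidden_layers', project='NLP_MLP_imdb', config=config_dict)
```

The original namespace can still be kept for code that needs the real objects, such as the activation function or the `Path`.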

In [45]:
# Fix seed value
SEED = 2345
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Data Loader
train_loader = torch.utils.data.DataLoader(trainset, batch_size=wandb.config.BATCH_SIZE, shuffle = True, 
                                           collate_fn=collate_batch, num_workers = 4)
valid_loader = torch.utils.data.DataLoader(validset, batch_size=wandb.config.BATCH_SIZE, shuffle = False, 
                                           collate_fn=collate_batch,  num_workers = 4)
test_loader = torch.utils.data.DataLoader(testset, batch_size=wandb.config.BATCH_SIZE,   shuffle = False, 
                                          collate_fn=collate_batch,  num_workers = 4)

# cross entropy loss function
loss_function = nn.CrossEntropyLoss()

# use GPUs
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
wandb.config.DEVICE = device

# model 
model_imdb = MLPCustom(wandb.config.EMBED_DIM, 
                       wandb.config.VOCAB_SIZE, 
                       wandb.config.HIDDEN_DIM1, 
                       wandb.config.HIDDEN_DIM2,
                       wandb.config.OUTPUT_DIM, 
                       wandb.config.NON_LINEARITY)

model_imdb.to(wandb.config.DEVICE)

def init_weights(m):
  if isinstance(m, nn.Linear):
      torch.nn.init.kaiming_normal_(m.weight)
      torch.nn.init.zeros_(m.bias)
        
# apply initialization recursively to all modules
model_imdb.apply(init_weights)

# Initialize the Adam optimizer
optimizer = torch.optim.Adam(model_imdb.parameters(), 
                             lr = wandb.config.LEARNING_RATE, 
                             weight_decay=wandb.config.WEIGHT_DECAY)

wandb.config.OPTIMIZER = optimizer

# scheduler = ReduceLROnPlateau(optimizer, mode='min', factor= wandb.config.scheduler_factor, 
#                              patience=wandb.config.scheduler_patience, verbose=True)

#scheduler = StepLR(optimizer, gamma=0.4,step_size=1, verbose=True)

In [46]:
wandb.config

namespace(ARCHITECTUREe='Embed_2_hidden_layers', BATCH_SIZE=128, DATASET='IMDB', DEVICE=device(type='cuda', index=0), EARLY_STOPPING=True, EMBED_DIM=300, EPOCHS=50, FILE_MODEL=PosixPath('/content/drive/MyDrive/data/models/nlp_fall_2022/imdb/imdb_2_hidden_layers.pt'), GRAD_CLIPPING=False, HIDDEN_DIM1=200, HIDDEN_DIM2=100, LEARNING_RATE=0.001, LOG_BATCH=True, LOG_INTERVAL=25, MAX_NORM=0, MOMENTUM=0, NON_LINEARITY=<function relu at 0x7fa9b6bc9200>, OPTIMIZER=Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 0.001
    maximize: False
    weight_decay: 0.001
), OUTPUT_DIM=2, PATIENCE=5, VOCAB_SIZE=36241, WEIGHT_DECAY=0.001)

# <Font color = 'pickle'>**Sanity Check**
- Check the loss before any training. With freshly initialized weights the model should predict roughly uniform class probabilities, so the cross-entropy loss should be close to log(number of classes).

In [47]:
for input_, targets, offsets in train_loader:
  
  # move inputs and outputs to GPUs
  input_ = input_.to(device)
  targets = targets.to(device)
  offsets = offsets.to(device)
  model_imdb.eval()
  # Forward pass
  output = model_imdb(input_, offsets)
  loss = loss_function(output, targets)
  print(f'Actual loss: {loss}')
  break

print(f'Expected Theoretical loss: {np.log(2)}')

Actual loss: 0.7161325216293335
Expected Theoretical loss: 0.6931471805599453
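The theoretical value can be checked by hand: if every class gets the same logit, softmax assigns probability 1/C to each class, and the cross-entropy loss is -log(1/C) = log(C). A minimal plain-Python sketch of that calculation follows; `cross_entropy_from_logits` is a hypothetical helper for a single example, not the PyTorch implementation.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Cross-entropy loss for one example: -log(softmax(logits)[target])."""
    max_logit = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    return -math.log(exps[target] / sum(exps))

num_classes = 2
uniform_logits = [0.0] * num_classes                   # a model with no preference between classes
loss = cross_entropy_from_logits(uniform_logits, target=1)
print(loss, math.log(num_classes))
```

The measured loss of about 0.716 above is slightly higher than log(2), which is plausible because a randomly initialized network is only approximately uniform.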


# <Font color = 'pickle'>**Training Model**

In [48]:
wandb.watch(model_imdb, log = 'all', log_freq=25, log_graph=True)

[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`


[<wandb.wandb_torch.TorchGraph at 0x7fa8a76a7590>]

In [49]:
# See live graphs in the notebook.
#%%wandb 
batch_ct_train, batch_ct_valid = 0, 0
train_loss_history, train_acc_history, valid_loss_history, valid_acc_history = train_loop(train_loader, 
                                                                                          valid_loader, 
                                                                                          model_imdb, 
                                                                                          optimizer,
                                                                                          loss_function, 
                                                                                          wandb.config.EPOCHS, 
                                                                                          wandb.config.DEVICE,
                                                                                          wandb.config.PATIENCE, 
                                                                                          wandb.config.EARLY_STOPPING,
                                                                                          wandb.config.FILE_MODEL)

Validation loss has decreased (inf --> 0.546675). Saving Model...
Epoch : 1 / 50
Time to complete 1 is 0:00:03.683584
Train Loss:  0.8629 | Train Accuracy:  60.1350%
Valid Loss:  0.5467 | Valid Accuracy:  72.6200%

Validation loss has decreased (0.546675 --> 0.508445). Saving model...
Epoch : 2 / 50
Time to complete 2 is 0:00:03.835472
Train Loss:  0.6051 | Train Accuracy:  69.0250%
Valid Loss:  0.5084 | Valid Accuracy:  75.9800%

Validation loss has decreased (0.508445 --> 0.485896). Saving model...
Epoch : 3 / 50
Time to complete 3 is 0:00:05.329176
Train Loss:  0.5267 | Train Accuracy:  73.8300%
Valid Loss:  0.4859 | Valid Accuracy:  77.8200%

Validation loss has decreased (0.485896 --> 0.456259). Saving model...
Epoch : 4 / 50
Time to complete 4 is 0:00:04.208974
Train Loss:  0.4857 | Train Accuracy:  76.7550%
Valid Loss:  0.4563 | Valid Accuracy:  79.8200%

Validation loss has decreased (0.456259 --> 0.434745). Saving model...
Epoch : 5 / 50
Time to complete 5 is 0:00:04.243996
Tr

# <Font color = 'pickle'>**Get Accuracy, Predictions**

In [50]:
device

device(type='cuda', index=0)

In [51]:
model_nn = MLPCustom(wandb.config.EMBED_DIM, wandb.config.VOCAB_SIZE, wandb.config.HIDDEN_DIM1, wandb.config.HIDDEN_DIM2, 
                  wandb.config.OUTPUT_DIM, wandb.config.NON_LINEARITY)
model_nn.to(device)
model_nn.load_state_dict(torch.load(wandb.config.FILE_MODEL))

<All keys matched successfully>

In [52]:
# Get the predictions and accuracy for the test, train and validation datasets
predictions_test, acc_test = get_acc_pred(test_loader, model_nn, device)
predictions_train, acc_train = get_acc_pred(train_loader, model_nn, device)
predictions_valid, acc_valid = get_acc_pred(valid_loader, model_nn, device)

In [53]:
# Print test, train and validation accuracy
print('Test accuracy', acc_test * 100)
print('Train accuracy', acc_train * 100)
print('Valid accuracy', acc_valid * 100)

Test accuracy tensor(84.5040, device='cuda:0')
Train accuracy tensor(95.4350, device='cuda:0')
Valid accuracy tensor(87.1400, device='cuda:0')


In [54]:
wandb.log({'Best_test_Acc': acc_test})
wandb.log({'Best_train_Acc': acc_train})
wandb.log({'Best_valid_Acc': acc_valid})

# <Font color = 'pickle'>**Confusion Matrix for Test Data**

Now, we will make some visualizations for the predictions that we obtained.

We will construct a `confusion matrix` which will help us to visualize the performance of our classification model on the test dataset as we know the true values for the test data.
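As a reminder of what the confusion matrix contains, the sketch below counts (true label, predicted label) pairs in plain Python. The labels are made up and `confusion_counts` is a hypothetical helper; the notebook itself uses `wandb.plot.confusion_matrix` on the real test predictions.

```python
def confusion_counts(y_true, y_pred, num_classes):
    """Return a num_classes x num_classes table; rows are true labels, columns are predictions."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 1, 0, 0]
matrix = confusion_counts(toy_true, toy_pred, num_classes=2)
# Accuracy is the sum of the diagonal divided by the number of examples
accuracy = sum(matrix[i][i] for i in range(2)) / len(toy_true)
print(matrix, accuracy)
```

The off-diagonal cells show which kind of mistake the model makes: negative reviews predicted as positive, or the reverse.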

In [55]:
# Get an array containing actual labels
testing_labels = testset.y

In [56]:
np.unique(testing_labels)

array([0, 1])

In [57]:
# Log a confusion matrix to W&B
wandb.log({"conf_mat" : wandb.plot.confusion_matrix(
                        probs = None,
                        y_true = testing_labels,
                        preds = predictions_test.to('cpu').numpy(),
                        class_names =['negative', 'positive'])})

In [58]:
wandb.finish()

0,1
Best_test_Acc,▁
Best_train_Acc,▁
Best_valid_Acc,▁
Train Acc :,▁▃▄▅▅▅▆▆▆▇▇▇▇▇▇██████
Train Batch Acc :,▁▁▃▃▂▄▅▅▄▃▅▅▅▄▅▅▆▆▅▆▅▆▆▆▆▇▆▇▇▇▆█▆▇▇█▇█▇▇
Train Batch Loss :,█▇▆▆▆▅▄▄▅▅▄▄▄▃▄▄▃▄▄▂▃▂▃▃▃▂▃▂▂▂▃▂▃▂▂▁▂▂▂▃
Train Loss :,█▅▅▄▄▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁
Valid Acc :,▁▃▃▄▅▆▆▆▇▇▇▇█████████
Valid Batch Accuracy :,▂▁▂▂▃▁▂▂▃▅▃▄▅▃▃▄▅▅▄▆▅▆▃▆▆▅▅█▇▅▄▅▃
Valid Batch Loss :,▇██▇▆▇▇▆▆▄▆▄▃▅▆▅▄▃▅▁▂▄▅▄▄▂▆▁▁▄▅▄▆

0,1
Best_test_Acc,0.84504
Best_train_Acc,0.95435
Best_valid_Acc,0.8714
Train Acc :,0.93115
Train Batch Acc :,0.89844
Train Batch Loss :,0.23515
Train Loss :,0.17563
Valid Acc :,0.8722
Valid Batch Accuracy :,0.82812
Valid Batch Loss :,0.40363
