# <font color = 'pickle'> **Fine-tuning Transformer Model**


<font color = 'indianred'>  **Objective:**
- Learn to fine-tune Transformer Models</font>

# <Font color = 'pickle'>**Load Libraries/Install Software**

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
if 'google.colab' in str(get_ipython()):
  print('Running on CoLab')
else:
  print('Not running on CoLab')

Running on CoLab


In [3]:
#!pip install -U spacy -qq
if 'google.colab' in str(get_ipython()):
  !pip install -U gensim -qq


In [4]:
# Install wandb and update it to the latest version
if 'google.colab' in str(get_ipython()):
    !pip install wandb --upgrade -q


In [5]:
# mount google drive
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
# Import libraries

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.vocab import  vocab
from torch.optim.lr_scheduler import ReduceLROnPlateau, OneCycleLR, StepLR

from torch.nn.utils.rnn import pad_sequence
from torch.utils.data.sampler import Sampler


import wandb


import random
from datetime import datetime
import numpy as np
from pathlib import Path
import pandas as pd
import joblib
from collections import Counter
import sys


from sklearn.model_selection import train_test_split


from types import SimpleNamespace

We will be using W&B for visualization.

In [7]:
# Login to W&B
wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

# <Font color = 'pickle'>**Specify Project Folders**

In [8]:
# This is the path where we will download and save data
if 'google.colab' in str(get_ipython()):
  base_folder = Path('/content/drive/MyDrive/NLP_Fall22/HW7')
else:
  base_folder = Path('/home/harpreet/Insync/google_drive_shaannoor/data')

In [9]:
data_folder = base_folder/'Data'
model_folder = base_folder/'Models'

# <Font color = 'pickle'>**IMDB Dataset**

For this notebook, we will use the IMDB movie review dataset. <br>
Link to the complete dataset: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz.

We downloaded the dataset in the Lecture 2 notebook (notebook: 3_Faster_tokenization_spacy_final.ipynb).

We created two CSV files in Lecture 2: train.csv and test.csv. The files are available in the Lecture2/data folder on eLearning. I have applied the custom pre-processor and cleaned the dataset for this lecture. I pickled the datasets and saved them as files. The files are available in the Lecture_6/data folder. We will download the following files as well.

- 'x_train_cleaned_bag_of_words.pkl'
- 'x_valid_cleaned_bag_of_words.pkl'
- 'x_test_cleaned_bag_of_words.pkl'

In [10]:
# location of train and test files
train_file = data_folder /'train.csv'
test_file = data_folder /'test.csv'

In [11]:
# creating Pandas Dataframe
train_data = pd.read_csv(train_file, index_col=0)
test_data = pd.read_csv(test_file, index_col=0)

In [12]:
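train_data = train_data.sample(frac = 0.1, random_state=1)
# NOTE: this line samples from the already subsampled train_data, not from the test file,
# so the 250 "test" rows overlap with the training/validation rows.
# For a truly held-out test set, use test_data.sample(frac = 0.1, random_state=1) instead.
test_data = train_data.sample(frac = 0.1, random_state=1)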

In [13]:
# print shape of the datasets
print(f'Shape of Training data set is : {train_data.shape}')
print(f'Shape of Test data set is : {test_data.shape}')

Shape of Training data set is : (2500, 2)
Shape of Test data set is : (250, 2)


In [14]:
train_data.head()

Unnamed: 0,Reviews,Labels
21492,"The movie starts with a pair of campers, a man...",0
9488,"I'm a pretty old dude, old enough to remember ...",1
16933,When they killed off John Amos's character the...,0
12604,"Despite some occasionally original touches, li...",0
8222,I found this movie to be very well-paced. The ...,1


## <Font color = 'pickle'>**Create Train/Test/Valid Split**


In [15]:
X, y = train_data['Reviews'].values, train_data['Labels'].values

In [16]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.20, random_state=42)

In [17]:
X_test , y_test = test_data['Reviews'].values, test_data['Labels'].values

## <Font color = 'pickle'>**Data PreProcessing**

In [18]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [21]:
from transformers import  AutoTokenizer

In [22]:
checkpoint = 'distilbert-base-uncased'

In [23]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


In [34]:
train_encodings = tokenizer(list(X_train), truncation = True, padding = True)

In [35]:
valid_encodings = tokenizer(list(X_valid), truncation = True, padding = True)
test_encodings = tokenizer(list(X_test), truncation = True, padding = True)

In [36]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])

In [37]:
train_encodings.keys()

dict_keys(['input_ids', 'attention_mask'])
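
The `truncation = True, padding = True` arguments used above can be pictured with a small pure-Python sketch. This is only an illustration under simplifying assumptions: `pad_and_truncate` is a hypothetical helper, the token ids are made up, and the real tokenizer also takes care to keep special tokens such as the final separator when it truncates, which this toy version does not.

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Toy stand-in for padding/truncation; not the tokenizer's real code."""
    # Truncation: keep at most max_length tokens per sequence
    truncated = [ids[:max_length] for ids in batch_ids]
    # Padding: extend every sequence to the longest one left in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up sequences of different lengths
toy_batch = [[101, 7, 8, 9, 10, 102], [101, 7, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside `input_ids` in the training functions.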

In [38]:
for key,value in train_encodings.items():
  print(key, value[0])
  break

input_ids [101, 9826, 1010, 2073, 2064, 1045, 4088, 999, 2023, 2001, 1037, 2659, 5166, 1010, 27762, 6051, 2143, 1010, 2009, 2001, 2061, 18178, 2229, 2100, 2009, 2018, 2149, 2035, 21305, 2007, 7239, 2000, 2129, 3294, 2128, 7559, 5732, 2009, 2001, 999, 1996, 4690, 3554, 5019, 4694, 1005, 1056, 2130, 4690, 9590, 1010, 2027, 2020, 2652, 2105, 2007, 2070, 6081, 10689, 2027, 4149, 2012, 24547, 1011, 20481, 1998, 2035, 2027, 2020, 2725, 2001, 2074, 22653, 2000, 3046, 1998, 2191, 2009, 2298, 2066, 2027, 2020, 8084, 999, 999, 2033, 1998, 2026, 2155, 2001, 1999, 1996, 6888, 2005, 1037, 2428, 2204, 2895, 3185, 2028, 2154, 1010, 2061, 2057, 2787, 2000, 2175, 2000, 1996, 3573, 1998, 2298, 2005, 2028, 1010, 1998, 2045, 2009, 2001, 1996, 2387, 19392, 2479, 3185, 1012, 1045, 2812, 2009, 2246, 2061, 2307, 2021, 2043, 2057, 3427, 2009, 2012, 2188, 1045, 8134, 2351, 2044, 1996, 2034, 3496, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 2821, 1998, 1996, 5436, 1997, 1996, 2143, 1010, 1996, 2466, 26

## <font color = 'pickle'> **Custom Dataset Class**

In [39]:
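class IMDbDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    # encodings: tokenizer output (input_ids, attention_mask); labels: array of targets
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    # Build one example: a tensor per encoding key, plus the label
    item = {key: torch.tensor(value[idx]) for key, value in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    # Number of examples in the dataset
    return len(self.labels)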

In [40]:
trainset = IMDbDataset(train_encodings, y_train)
validset = IMDbDataset(valid_encodings, y_valid)
testset = IMDbDataset(test_encodings, y_test)
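
To see what the dataset class does without needing torch, here is a plain-Python analogue. It is a sketch for illustration only: `ToyEncodedDataset` and the toy encodings are hypothetical, and the real class additionally wraps each value in a tensor.

```python
class ToyEncodedDataset:
    """Plain-Python analogue of IMDbDataset (no tensors), for illustration."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict: key -> list with one entry per example
        self.labels = labels        # list with one label per example

    def __getitem__(self, idx):
        # Pick the idx-th entry of every encoding key, then attach the label
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


toy_encodings = {
    "input_ids": [[101, 7, 102], [101, 8, 102]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
toy = ToyEncodedDataset(toy_encodings, [0, 1])
print(len(toy))
print(toy[1])
```

The key idea is the change of layout: the tokenizer returns one list per field covering all examples, while a data loader wants one dictionary per example.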

In [41]:
trainset[0]

{'input_ids': tensor([  101,  9826,  1010,  2073,  2064,  1045,  4088,   999,  2023,  2001,
          1037,  2659,  5166,  1010, 27762,  6051,  2143,  1010,  2009,  2001,
          2061, 18178,  2229,  2100,  2009,  2018,  2149,  2035, 21305,  2007,
          7239,  2000,  2129,  3294,  2128,  7559,  5732,  2009,  2001,   999,
          1996,  4690,  3554,  5019,  4694,  1005,  1056,  2130,  4690,  9590,
          1010,  2027,  2020,  2652,  2105,  2007,  2070,  6081, 10689,  2027,
          4149,  2012, 24547,  1011, 20481,  1998,  2035,  2027,  2020,  2725,
          2001,  2074, 22653,  2000,  3046,  1998,  2191,  2009,  2298,  2066,
          2027,  2020,  8084,   999,   999,  2033,  1998,  2026,  2155,  2001,
          1999,  1996,  6888,  2005,  1037,  2428,  2204,  2895,  3185,  2028,
          2154,  1010,  2061,  2057,  2787,  2000,  2175,  2000,  1996,  3573,
          1998,  2298,  2005,  2028,  1010,  1998,  2045,  2009,  2001,  1996,
          2387, 19392,  2479,  3185,  1

# <font color = 'pickle'> **Training Functions**

## <Font color = 'pickle'>**Function for Training  Loops**

**Model Training** involves the following steps: 

- Step 0: Initialize parameters / weights (here, pretrained weights plus a randomly initialized classification head)
- Step 1: Compute model's predictions - forward pass
- Step 2: Compute loss
- Step 3: Compute the gradients
- Step 4: Update the parameters
- Step 5: Repeat steps 1 - 4

Model training is repeating this process over and over, for many **epochs**.

We will specify the number of ***epochs***; during each epoch we iterate over the complete training dataset and keep updating the parameters.

***Learning rate*** and ***epochs*** are known as hyperparameters. We have to adjust the values of these two based on validation dataset.
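
The steps listed above can be sketched on a tiny problem in pure Python. This is a minimal illustration, assuming a one-feature logistic regression with hand-written gradients instead of a Transformer and an optimizer; the data, learning rate and epoch count are made up.

```python
import math
import random

random.seed(0)

# Toy data: the label is 1 when the single feature is positive
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

# Step 0: initialize parameters
w, b = random.uniform(-0.1, 0.1), 0.0
learning_rate, epochs = 0.5, 50


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


losses = []
for epoch in range(epochs):  # Step 5: repeat steps 1 - 4
    # Step 1: forward pass (predicted probability of class 1)
    probs = [sigmoid(w * x + b) for x in xs]
    # Step 2: mean binary cross-entropy loss
    loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(ys, probs)
    ) / len(xs)
    losses.append(loss)
    # Step 3: gradients of the loss with respect to w and b
    grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
    # Step 4: gradient descent update
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

In the functions that follow, PyTorch computes the gradients (`loss.backward()`) and the optimizer applies the update (`optimizer.step()`), but the structure of the loop is the same.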

We will now create functions for step 1 to 4.

In [43]:
def train(train_loader, model, optimizer, grad_clipping, max_norm, log_batch, log_interval, device):

  # Training Loop 
  # declare the batch counter as global
  # this count keeps increasing across epochs
  global batch_ct_train

  # Initialize running loss and correct count at the start of the epoch

  running_train_loss = 0
  running_train_correct = 0

  # put the model in training mode
  model.train()

  # Iterate on batches from the dataset using train_loader
  for batch in train_loader:
    # move inputs and labels to the device (GPU if available)

    input_ids = batch['input_ids'].to(device)

    attention_mask = batch['attention_mask'].to(device)

    labels = batch['labels'].to(device)

    # Outputs & loss

    outputs = model(input_ids, attention_mask = attention_mask, labels = labels)

    loss, output = outputs['loss'], outputs['logits']

    # correct predictions
    y_pred = torch.argmax(output, dim =1)
    correct = torch.sum(y_pred==labels)

    batch_ct_train+=1

    # Compute gradients
    optimizer.zero_grad()
    loss.backward()

    # Gradient Clipping

    if grad_clipping:
      nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm, norm_type=2)


    # Step 4: Update the parameters
    optimizer.step()

    # Add train loss of a batch 
    running_train_loss += loss.item()

    # Add correct counts of a batch
    running_train_correct += correct

    # log batch loss and accuracy
    if log_batch:
      if ((batch_ct_train + 1) % log_interval) == 0:
        wandb.log({f"Train Batch Loss  :": loss})
        wandb.log({f"Train Batch Acc :": correct/len(labels)})

  # Calculate mean train loss for the whole dataset for a particular epoch
  train_loss = running_train_loss/len(train_loader)

  # Calculate accuracy for the whole dataset for a particular epoch
  train_acc = running_train_correct/len(train_loader.dataset)

  return train_loss, train_acc
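
The gradient clipping step in `train` rescales all gradients together when their combined L2 norm exceeds `max_norm`. The sketch below shows the idea on a flat list of numbers. `clip_by_global_norm` is a hypothetical helper that only approximates what `nn.utils.clip_grad_norm_` does, and the small constant in the denominator is an assumption made here for numerical safety.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale is capped at 1 so that small gradients are left unchanged
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


# Gradient with norm 5 is scaled down to (almost exactly) norm 1
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradient with norm 0.5 is already within the limit and is not changed
unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

Because every component is multiplied by the same factor, clipping changes the length of the update but not its direction.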

## <Font color = 'pickle'>**Function for Validation Loops**


In [47]:
def validate(valid_loader, model, log_batch, log_interval, device):

  # declare the batch counter as global
  # this count keeps increasing across epochs
  global batch_ct_valid

  # Validation/Test loop
  # Initialize running loss and correct count at the start of the epoch
  running_val_loss = 0
  running_val_correct = 0

  # put the model in evaluation mode
  model.eval()

  with torch.no_grad():
    for batch in valid_loader:
    # move inputs and outputs to GPUs

      input_ids = batch['input_ids'].to(device)

      attention_mask = batch['attention_mask'].to(device)

      labels = batch['labels'].to(device)


      # Step 1: Forward Pass: Compute model's predictions 
    
      outputs = model(input_ids, attention_mask = attention_mask, labels = labels)


      # Step 2: Compute loss
      loss, output = outputs['loss'], outputs['logits']

      # Correct Predictions
      y_pred = torch.argmax(output, dim=1)

      correct = torch.sum(y_pred==labels)

      batch_ct_valid += 1

      # Add val loss of a batch 
      running_val_loss += loss.item()

      # Add correct count for each batch
      running_val_correct += correct

      # log batch loss and accuracy
      if log_batch:
        if ((batch_ct_valid + 1) % log_interval) == 0:
          wandb.log({f"Valid Batch Loss  :": loss})
          wandb.log({f"Valid Batch Accuracy :": correct/len(labels)})

    # Calculate mean val loss for the whole dataset for a particular epoch
    val_loss = running_val_loss/len(valid_loader)

    # Calculate accuracy for the whole dataset for a particular epoch
    val_acc = running_val_correct/len(valid_loader.dataset)

    # scheduler step
    # scheduler.step(val_loss)
    # scheduler.step()
    
  return val_loss, val_acc

## <Font color = 'pickle'>**Function for Model Training**
    
We will now create a function for step 5 of model training


In [48]:
def train_loop(train_loader, valid_loader, model, optimizer, epochs, device, patience, early_stopping,
               file_model, save_best_model, grad_clipping, max_norm, log_batch, log_interval):
    
  """ 
  Train the model for the given number of epochs, validating after each epoch.
  Input: train and validation data loaders, model, optimizer, number of epochs, device,
         early-stopping and checkpointing settings, gradient-clipping settings, logging settings.
  Output: per-epoch histories of train loss, train accuracy, validation loss and validation accuracy.
  """

  # Create lists to store train and val loss at each epoch
  train_loss_history = []
  valid_loss_history = []
  train_acc_history = []
  valid_acc_history = []

  # initialize variables for early stopping

  delta = 0
  best_score = None
  valid_loss_min = np.inf
  counter_early_stop=0
  early_stop=False

  # Iterate for the given number of epochs
  # Step 5: Repeat steps 1 - 4

  for epoch in range(epochs):

    t0 = datetime.now()

    # Get train loss and accuracy for one epoch
    train_loss, train_acc = train(train_loader, model, optimizer, grad_clipping, max_norm, log_batch, log_interval, device)
    valid_loss, valid_acc  = validate(valid_loader, model, log_batch, log_interval, device)

    dt = datetime.now() - t0

    # Save history of the Losses and accuracy
    train_loss_history.append(train_loss)
    train_acc_history.append(train_acc)

    valid_loss_history.append(valid_loss)
    valid_acc_history.append(valid_acc)

    # Log the train and valid loss to wandb
    wandb.log({f"Train Loss :": train_loss, "epoch": epoch})
    # wandb.log({f"Train Acc :": train_acc, "epoch": epoch})

    wandb.log({f"Valid Loss :": valid_loss, "epoch": epoch})
    # wandb.log({f"Valid Acc :": valid_acc, "epoch": epoch})

    if early_stopping:
      score = -valid_loss
      if best_score is None:
        best_score=score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving Model...')
        torch.save(model.state_dict(), file_model)
        valid_loss_min = valid_loss

      elif score < best_score + delta:
        counter_early_stop += 1
        print(f'Early stopping counter: {counter_early_stop} out of {patience}')
        if counter_early_stop > patience:
          early_stop = True


      else:
        best_score = score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving model...')
        torch.save(model.state_dict(), file_model)
        counter_early_stop=0
        valid_loss_min = valid_loss

      if early_stop:
        print('Early Stopping')
        break

    elif save_best_model:

      score = -valid_loss
      if best_score is None:
        best_score=score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving Model...')
        torch.save(model.state_dict(), file_model)
        valid_loss_min = valid_loss

      elif score < best_score + delta:
        print(f'Validation loss has not decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Not Saving Model...')
      
      else:
        best_score = score
        print(f'Validation loss has decreased ({valid_loss_min:.6f} --> {valid_loss:.6f}). Saving model...')
        torch.save(model.state_dict(), file_model)
        valid_loss_min = valid_loss
        
    else:
        torch.save(model.state_dict(), file_model)
    
    print(f'Epoch : {epoch+1} / {epochs}')
    print(f'Time to complete {epoch+1} is {dt}')
    # print(f'Learning rate: {scheduler._last_lr[0]}')
    print(f'Train Loss: {train_loss : .4f} | Train Accuracy: {train_acc * 100 : .4f}%')
    print(f'Valid Loss: {valid_loss : .4f} | Valid Accuracy: {valid_acc * 100 : .4f}%')
    print()
    torch.cuda.empty_cache()

  return train_loss_history, train_acc_history, valid_loss_history, valid_acc_history
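
The early-stopping rule inside `train_loop` can be isolated in a few lines. The sketch below is a simplified illustration with a hypothetical helper, `epochs_until_stop`, and made-up validation losses. It follows the same rule as the function above: the counter grows when the validation loss fails to improve, resets on an improvement, and training stops once the counter exceeds `patience`.

```python
def epochs_until_stop(valid_losses, patience, delta=0.0):
    """Return (epochs actually run, best validation loss seen)."""
    best_score = None
    best_loss = float("inf")
    counter = 0
    for epoch, valid_loss in enumerate(valid_losses, start=1):
        score = -valid_loss  # higher score means lower loss
        if best_score is None or score >= best_score + delta:
            # Improvement: remember it (a checkpoint would be saved here)
            best_score, best_loss, counter = score, valid_loss, 0
        else:
            # No improvement: count it and stop once patience is exceeded
            counter += 1
            if counter > patience:
                return epoch, best_loss
    return len(valid_losses), best_loss


toy_losses = [0.60, 0.50, 0.55, 0.52, 0.51, 0.40]
print(epochs_until_stop(toy_losses, patience=2))
print(epochs_until_stop(toy_losses, patience=3))
```

With a patience of 2 the run stops before the late improvement at the sixth epoch, while a patience of 3 reaches it, which is the trade-off this hyperparameter controls.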

## <Font color = 'pickle'>**Function for Accuracy and Predictions**

Once the model is trained, we will use its learned weights to make predictions on the test dataset. The function below returns the predictions and the accuracy for any data loader.

In [49]:
def get_acc_pred(data_loader, model, device):
    
  """ 
  Function to get predictions and accuracy for a given dataset using the trained model
  Input: data loader, trained model, device
  Output: predictions and accuracy for the given dataset
  """

  # Array to store predicted labels
  predictions = torch.Tensor() # empty tensor
  predictions = predictions.to(device) # move predictions to GPU

  # Array to store actual labels
  y = torch.Tensor() # empty tensor
  y = y.to(device)

  # put the model in evaluation mode
  model.eval()
  
  # Iterate over batches from data iterator
  with torch.no_grad():
    for batch in data_loader:
      
      # move inputs and outputs to GPUs
      
      input_ids = batch['input_ids'].to(device)
      attention_mask = batch['attention_mask'].to(device)
      labels = batch['labels'].to(device)
      
      outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
      loss, output = outputs['loss'], outputs['logits']

      # Choose the label with maximum probability
      prediction = torch.argmax(output, dim = 1)

      # Add the predicted labels to the array
      predictions = torch.cat((predictions, prediction)) 

      # Add the actual labels to the array
      y = torch.cat((y, labels)) 

  # Check for complete dataset if actual and predicted labels are same or not
  # Calculate accuracy
  acc = (predictions == y).float().mean()

  # Return tuple containing predictions and accuracy
  return predictions, acc  
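
The core of `get_acc_pred` is an argmax over the logits followed by a comparison with the true labels. Here is a small pure-Python sketch of that computation; the logits and labels are made up for illustration. Since softmax preserves ordering, the largest logit is also the class with the highest probability.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


# One row of two logits (negative, positive) per example
toy_logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, -0.5], [0.9, 0.1]]
toy_labels = [0, 1, 0, 0]

toy_predictions = [argmax(row) for row in toy_logits]
toy_accuracy = sum(p == y for p, y in zip(toy_predictions, toy_labels)) / len(toy_labels)

print(toy_predictions)
print(toy_accuracy)
```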

# <font color = 'pickle'> **Model Training**

## <font color = 'pickle'> **Meta data**



In [50]:
hyperparameters = SimpleNamespace(

    EPOCHS = 1,
    BATCH_SIZE = 16,
    LEARNING_RATE = 5e-5,
    DATASET="IMDB",
    ARCHITECTURE="distilbert",
    LOG_INTERVAL = 25,
    LOG_BATCH = True,
    FILE_MODEL = model_folder/'distilbert.PT',
    GRAD_CLIPPING = True,
    MAX_NORM = 1,
    MOMENTUM = 0,
    PATIENCE = 10,
    SAVE_BEST_MODEL = True,
    EARLY_STOPPING = False,
    WEIGHT_DECAY = 0.00,
    DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
   )
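
A `SimpleNamespace` gives attribute-style access to the settings, and `vars()` turns it into a plain dictionary when one is more convenient, for example for printing or serialization. A short sketch with made-up values (`toy_hparams` is a hypothetical name, not part of this notebook):

```python
from types import SimpleNamespace

toy_hparams = SimpleNamespace(EPOCHS=1, BATCH_SIZE=16, LEARNING_RATE=5e-5)

# Attribute access, as used throughout this notebook
print(toy_hparams.BATCH_SIZE)

# Plain dictionary view of the same settings
config_dict = vars(toy_hparams)
print(config_dict)
```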


# <Font color = 'pickle'>**Data Loaders, Loss Function, Optimizer**

In [51]:
wandb.init(name = 'distilbert', project = '', config = hyperparameters)

[34m[1mwandb[0m: Currently logged in as: [33mhamshaks1[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [52]:
wandb.config = hyperparameters
wandb.config

namespace(ARCHITECTURE='distilbert', BATCH_SIZE=16, DATASET='IMDB', DEVICE=device(type='cuda', index=0), EARLY_STOPPING=False, EPOCHS=1, FILE_MODEL=PosixPath('/content/drive/MyDrive/NLP_Fall22/HW7/Models/distilbert.PT'), GRAD_CLIPPING=True, LEARNING_RATE=5e-05, LOG_BATCH=True, LOG_INTERVAL=25, MAX_NORM=1, MOMENTUM=0, PATIENCE=10, SAVE_BEST_MODEL=True, WEIGHT_DECAY=0.0)

In [53]:
# Fix seed value
SEED = 2345
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Data Loader

train_loader = torch.utils.data.DataLoader(trainset, batch_size=wandb.config.BATCH_SIZE, shuffle=True)
valid_loader = torch.utils.data.DataLoader(validset, batch_size=wandb.config.BATCH_SIZE, shuffle=False)
test_loader = torch.utils.data.DataLoader(testset, batch_size=wandb.config.BATCH_SIZE, shuffle=False)

# cross entropy loss function
# loss_function = nn.CrossEntropyLoss()

# model 
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

model.to(wandb.config.DEVICE)

# Initialize stochastic gradient descent optimizer (alternative, not used here)
#optimizer = torch.optim.SGD(model.parameters(), lr = wandb.config.LEARNING_RATE, weight_decay=wandb.config.WEIGHT_DECAY, 
#                          momentum=wandb.config.MOMENTUM)

optimizer = torch.optim.Adam(model.parameters(), 
                             lr = wandb.config.LEARNING_RATE, 
                             weight_decay=wandb.config.WEIGHT_DECAY)

wandb.config.optimizer = optimizer



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [54]:
wandb.config

namespace(ARCHITECTURE='distilbert', BATCH_SIZE=16, DATASET='IMDB', DEVICE=device(type='cuda', index=0), EARLY_STOPPING=False, EPOCHS=1, FILE_MODEL=PosixPath('/content/drive/MyDrive/NLP_Fall22/HW7/Models/distilbert.PT'), GRAD_CLIPPING=True, LEARNING_RATE=5e-05, LOG_BATCH=True, LOG_INTERVAL=25, MAX_NORM=1, MOMENTUM=0, PATIENCE=10, SAVE_BEST_MODEL=True, WEIGHT_DECAY=0.0, optimizer=Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    eps: 1e-08
    foreach: None
    lr: 5e-05
    maximize: False
    weight_decay: 0.0
))

## <font color = 'pickle'> **Sanity Check**
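- Check the loss before any training. For cross-entropy, the expected value is approximately log(number of classes), using the natural logarithm, because the newly initialized classification head predicts close to uniform probabilities.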

In [55]:
# Fix seed value
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
SEED = 2345
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

for batch in train_loader:
  
  # move inputs and outputs to GPUs
  input_ids = batch['input_ids'].to(device)
  attention_mask = batch['attention_mask'].to(device)
  labels = batch['labels'].to(device)

  # Step 1: Forward Pass: Compute model's predictions 
  outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
  loss, output = outputs['loss'], outputs['logits']
  print(f'Actual loss: {loss}')
  break

print(f'Expected Theoretical loss: {np.log(2)}')

Actual loss: 0.6892105340957642
Expected Theoretical loss: 0.6931471805599453
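
The theoretical value follows from the definition of cross-entropy: if every class gets probability 1/C, the loss is -log(1/C) = log(C). The sketch below checks this numerically in pure Python; `cross_entropy_from_logits` is a hypothetical helper written for this illustration, with equal logits standing in for an untrained classifier.

```python
import math


def cross_entropy_from_logits(logits, label):
    """Cross-entropy of one example: -log(softmax(logits)[label])."""
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


for num_classes in (2, 3, 10):
    # Equal logits give a uniform distribution over the classes
    loss = cross_entropy_from_logits([0.0] * num_classes, label=0)
    print(num_classes, round(loss, 4), round(math.log(num_classes), 4))
```

A starting loss far from this value usually points to a problem in the data pipeline or the labels rather than in the optimizer.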


## <font color = 'pickle'> **Train Model and Save best model**

In [56]:
wandb.watch(model, log = 'all', log_freq=25, log_graph=True)

[34m[1mwandb[0m: logging graph, to disable use `wandb.watch(log_graph=False)`


[<wandb.wandb_torch.TorchGraph at 0x7f299e88c4f0>]

In [57]:
# See live graphs in the notebook.
#%%wandb 
batch_ct_train, batch_ct_valid = 0, 0
train_loss_history, train_acc_history, valid_loss_history, valid_acc_history = train_loop(
    train_loader,
    valid_loader,
    model,
    optimizer,
    wandb.config.EPOCHS,
    wandb.config.DEVICE,
    wandb.config.PATIENCE,
    wandb.config.EARLY_STOPPING,
    wandb.config.FILE_MODEL,
    wandb.config.SAVE_BEST_MODEL,
    wandb.config.GRAD_CLIPPING,
    wandb.config.MAX_NORM,
    wandb.config.LOG_BATCH,
    wandb.config.LOG_INTERVAL)

Validation loss has decreased (inf --> 0.260541). Saving Model...
Epoch : 1 / 1
Time to complete 1 is 0:01:45.609718
Train Loss:  0.4071 | Train Accuracy:  81.7500%
Valid Loss:  0.2605 | Valid Accuracy:  90.0000%



# <Font color = 'pickle'>**Get Accuracy, Predictions**

Now we have final values for weights and bias after training the model. We will use these values to make predictions on the test dataset.

## <font color = 'pickle'> **Load saved model from file** 

In [58]:
model_nn = model
model_nn.to(device)
model_nn.load_state_dict(torch.load(wandb.config.FILE_MODEL))

<All keys matched successfully>

In [59]:
# Get the predictions and accuracy for the test, train and validation datasets
predictions_test, acc_test = get_acc_pred(test_loader, model_nn, device)
predictions_train, acc_train = get_acc_pred(train_loader, model_nn, device)
predictions_valid, acc_valid = get_acc_pred(valid_loader, model_nn, device)

In [60]:
# Print Accuracy
print('Test accuracy', acc_test * 100)
print('Train accuracy', acc_train * 100)
print('Valid accuracy', acc_valid * 100)

Test accuracy tensor(92.8000, device='cuda:0')
Train accuracy tensor(93.8000, device='cuda:0')
Valid accuracy tensor(90., device='cuda:0')


In [61]:
# Print Accuracy based on saved Model
print('acc_train', acc_train * 100)
print('acc_valid', acc_valid * 100)
print('acc_test', acc_test * 100)

acc_train tensor(93.8000, device='cuda:0')
acc_valid tensor(90., device='cuda:0')
acc_test tensor(92.8000, device='cuda:0')


In [62]:
wandb.log({'Best_test_Acc': acc_test})
wandb.log({'Best_train_Acc': acc_train})
wandb.log({'Best_valid_Acc': acc_valid})

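**We have obtained 92.8% accuracy on the test dataset.** Note that the test sample in this notebook was drawn from the subsampled training data (cell 12), so this figure is likely optimistic compared with accuracy on truly held-out reviews.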


# <Font color = 'pickle'>**Confusion Matrix for Test Data**

Now, we will make some visualizations for the predictions that we obtained.

We will construct a `confusion matrix` which will help us to visualize the performance of our classification model on the test dataset as we know the true values for the test data.
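
Before logging the matrix to W&B, it may help to see how a confusion matrix is built. The sketch below uses made-up labels and a hypothetical helper; rows are the true classes and columns are the predicted classes, so the diagonal holds the correct predictions.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy, num_classes=2)
for row in cm:
    print(row)

# Accuracy is the share of examples on the diagonal
toy_cm_accuracy = sum(cm[i][i] for i in range(2)) / len(y_true_toy)
print(toy_cm_accuracy)
```

Unlike a single accuracy number, the off-diagonal cells show which kind of mistake the model makes, for example negative reviews predicted as positive.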

In [63]:
# Get an array containing actual labels
testing_labels = y_test

In [64]:
np.unique(testing_labels)

array([0, 1])

In [65]:
# Log a confusion matrix to W&B
wandb.log({"conf_mat" : wandb.plot.confusion_matrix(
                        probs = None,
                        y_true = testing_labels,
                        preds = predictions_test.to('cpu').numpy(),
                        class_names =['negative', 'positive'])})

In [66]:
wandb.finish()

Metric,History
Best_test_Acc,▁
Best_train_Acc,▁
Best_valid_Acc,▁
Train Batch Acc :,▁▁███
Train Batch Loss :,▆█▁▁▁
Train Loss :,▁
Valid Batch Accuracy :,▁
Valid Batch Loss :,▁
Valid Loss :,▁
epoch,▁▁

Metric,Final value
Best_test_Acc,0.928
Best_train_Acc,0.938
Best_valid_Acc,0.9
Train Batch Acc :,0.9375
Train Batch Loss :,0.17961
Train Loss :,0.40707
Valid Batch Accuracy :,0.75
Valid Batch Loss :,0.66594
Valid Loss :,0.26054
epoch,0.0
