# **CSE354 HW3**
**Due date: 11:59 pm EST on April 24, 2022 (Sunday)**

---
For this assignment, we will use Google Colab, which gives us access to resources, such as GPUs, that some of us might not have on our local machines. You can use your Stony Brook (*.stonybrook.edu) account or your personal Gmail account for coding, and Google Drive to save your results.

## **Google Colab Tutorial**
---
Go to https://colab.research.google.com/notebooks/, where you will see a tutorial file named "Welcome to Colaboratory" that covers the basics of using Google Colab.

**This notebook requires you to train your models on Colab's GPU. However, GPU runtimes are limited, so ensure that your code works on the default CPU runtime before switching over to the GPU runtime.**

## **Problem statement**
---
In this homework, you will use language models to predict the sentiment of a given movie review. The dataset provided to you is sampled from the [IMDB dataset of 50k movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews); it has been reduced to a smaller set to allow quicker computation on Colab. Each row contains a review and an associated sentiment label, which is either *positive* or *negative*. You have been given three files: train_data.csv, val_data.csv and test_data.csv. The training data is used to fine-tune the language model, the validation data is used to select the best model during training, and the test data gives the model's final performance.

To perform this task you will be using a pre-trained DistilBERT model. DistilBERT is a BERT-based language model. It is 40% smaller than BERT, retains around 97% of BERT's language understanding capabilities, and is 60% faster. You can read more about DistilBERT at https://arxiv.org/abs/1910.01108.

You will use the model through the `transformers` library provided by Hugging Face (https://huggingface.co/), which is installed by the command in the cell below. You will be training four different DistilBERT models for this assignment.

**Todos for the assignment:**
*   Fill in the # TODO(students) portions in this Colab file for this assignment.
*   Run the experiment code blocks and note down the colab outputs in a separate text file.
*   Use the aforementioned Colab outputs to write the report, following the submission guidelines described at the end of this Colab file.



In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
     |████████████████████████████████| 4.0 MB 13.4 MB/s
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
     |████████████████████████████████| 596 kB 46.9 MB/s
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
     |████████████████████████████████| 6.6 MB 30.4 MB/s
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
     |████████████████████████████████| 895 kB 48.8 MB/s
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
     |████████████████████████████████| 77 kB 6.7 MB/s
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    F

## **Imports**
---

All the allowed imports have been done for you in the code block below. You do not need, and will not be allowed to use, any imports other than the ones below.

In [3]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AdamW
import os
from sklearn.metrics import precision_score, recall_score, f1_score
torch.manual_seed(42)
np.random.seed(42)

## **Mounting your drive**
---

I would highly recommend mounting your Google Drive while running this notebook. The drive can hold your dataset and will also be used to save your fine-tuned models. If you instead save the models only in your Colab workspace, they will be lost once the runtime disconnects.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
!ls

drive  sample_data


In [6]:
#Set the path of the folder where your colab file and data exist in Google Drive in the ------ portion

# TODO(students): start
%cd "drive/MyDrive/cse354/Assignment_3/HW3_Release/"
# TODO(students): end

/content/drive/MyDrive/cse354/Assignment_3/HW3_Release


## **Constants in the file**
---

The code block below contains a few constants.


1.   **BATCH_SIZE**: The batch size input to the models. This has been set to 16 and should not be changed. The only exception is if you encounter CUDA out-of-memory errors while training your models, in which case you may reduce it below 16; please mention this in your submission report.
2.   **EPOCHS**: The number of epochs to train your model. This should not be changed.
3. **TEST_PATH**: This is the path to the test_data.csv file.
4. **TRAIN_PATH**: This is the path to the train_data.csv file.
5. **VAL_PATH**: This is the path to the val_data.csv file.
6. **SAVE_PATH**: This is the path to the directory where your model will be saved. Note: This path will be altered further down in the code by appending the name of the kind of DistilBERT model you train as per your experiments.
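
As a quick sanity check on `BATCH_SIZE`, the number of batches per epoch is the dataset size divided by the batch size, rounded up. The sketch below assumes the row counts reported by the progress bars in this notebook's run (5130 training, 270 validation and 600 test reviews) and a loader that keeps the last partial batch; the helper name `num_batches` is made up for illustration.

```python
import math

BATCH_SIZE = 16  # same value as the notebook constant

def num_batches(num_examples, batch_size=BATCH_SIZE):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)

# Row counts assumed from the progress bars in this notebook's run.
dataset_sizes = {"train": 5130, "val": 270, "test": 600}

for split, size in dataset_sizes.items():
    print(split, num_batches(size))
```

If you reduce the batch size because of memory errors, the batch counts shown by the progress bars will grow accordingly.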



In [7]:
#DO NOT CHANGE THE CONSTANTS
BATCH_SIZE = 16
EPOCHS = 3
TEST_PATH = "data/test_data.csv"
TRAIN_PATH = "data/train_data.csv"
VAL_PATH = "data/val_data.csv"
SAVE_PATH = "models/DistilBERT"

In [8]:
def load_dataset(path):
  dataset = pd.read_csv(path)
  return dataset

In [9]:
train_data = load_dataset(TRAIN_PATH)
val_data = load_dataset(VAL_PATH)
test_data = load_dataset(TEST_PATH)

## **Problem 1 (Initialize the Model Class)**
---

Here, we will set up the pre-trained DistilBERT model class for our binary sentiment analysis task. In the code block below, you need to load a pre-trained DistilBERT model and its tokenizer using Hugging Face's library. The model to load is called "distilbert-base-uncased". The number of output labels of the model must also be set to *num_classes* (in this case 2: positive and negative). Please write your code between the given TODO block.



More about the model and how to load it can be read at - https://huggingface.co/distilbert-base-uncased.

In [10]:
class DistillBERT():

  def __init__(self, model_name='distilbert-base-uncased', num_classes=2):
    # TODO(students): start
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForSequenceClassification.from_pretrained(model_name , num_labels = num_classes)
    
    # TODO(students): end

  def get_tokenizer_and_model(self):
    return self.model, self.tokenizer

## **Problem 2 (Initialize the Dataloader Class)**
---
Here, we will set up the dataloader class, which reads the data, tokenizes it using the DistilBERT tokenizer, and converts the tokenized sentences and the labels to tensors. The code block below takes your dataset (train, validation or test) and the tokenizer you loaded in the previous block and generates the DataLoader object for it. You need to implement part of the tokenize_data method. This method takes the given data and generates a list of token IDs for each review along with its label. You need to use the tokenizer to generate the token IDs for each review (hint: refer to tokenizer.encode_plus for more details). **Please ensure that the maximum length of an encoded review is 512 tokens. If any input is longer than 512 tokens, truncate it to the first 512.**

You also need to convert each label to its corresponding numerical class using the label_dict dictionary. Please write your code between the given TODO block.
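
The effect of truncating and then padding can be illustrated without a tokenizer. The following is a plain-Python sketch, not the notebook's implementation: the token IDs are made up, the padding ID of 0 is an assumption (it is the default fill value of `pad_sequence`), and the real tokenizer also reserves room for special tokens, which this sketch ignores.

```python
MAX_LEN = 512  # maximum encoded length required by the assignment
PAD_ID = 0     # assumed padding value (pad_sequence pads with 0 by default)

def truncate(token_ids, max_len=MAX_LEN):
    """Keep only the first max_len token ids."""
    return token_ids[:max_len]

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

# Hypothetical token ids for three reviews of different lengths.
reviews = [list(range(1, 6)), list(range(1, 601)), list(range(1, 4))]

truncated = [truncate(r) for r in reviews]
padded = pad_batch(truncated)

print([len(r) for r in truncated])  # lengths after truncation
print([len(r) for r in padded])     # lengths after padding
```

After padding, every review has the same length, which is what allows the reviews to be stacked into a single tensor.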

In [11]:
class DatasetLoader(Dataset):

  def __init__(self, data, tokenizer):
    self.data = data
    self.tokenizer = tokenizer

  def tokenize_data(self):
    print("Processing data..")
    tokens = []
    labels = []
    label_dict = {'positive': 1, 'negative': 0}

    review_list = self.data['review'].to_list()
    label_list = self.data['sentiment'].to_list()

    for (review, label) in tqdm(zip(review_list, label_list), total=len(review_list)):
      # TODO(students): start
      # review_encoding = self.tokenizer.encode_plus( review, truncation_strategy="only_first")
      review_encoding = self.tokenizer.encode_plus( review,truncation=True, max_length = 512)
      review_encoding = review_encoding['input_ids']
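      # Create the tensor on the GPU only when one is available, so the code also runs on the CPU runtime
      review_encoding = torch.tensor(review_encoding, device="cuda" if torch.cuda.is_available() else "cpu")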
      tokens.append(review_encoding)
      
      labels.append(label_dict[label])
      
      # TODO(students): end
    # tokens = torch.FloatTensor(tokens)
    tokens = pad_sequence(tokens, batch_first=True)
    labels = torch.tensor(labels)
    dataset = TensorDataset(tokens, labels)
    return dataset

  def get_data_loaders(self, batch_size=32, shuffle=True):
    processed_dataset = self.tokenize_data()

    data_loader = DataLoader(
        processed_dataset,
        shuffle=shuffle,
        batch_size=batch_size
    )

    return data_loader

## **Problem 3 (Training Function)**
---
In this problem, you will write the code that will be used to run your model class on the dataset class, both of which you have written in the previous problems.

The class below provides a method to train a given model. It takes a dictionary with the following parameters:


1.   device: The device to run the model on.
2.   train_data: The train_data dataframe.
3.   val_data: The val_data dataframe.
4.   batch_size: The batch_size which is input to the model.
5.   epochs: The number of epochs to train the model.
6.   training_type: The type of training that your model will be undergoing. This can take four values - 'frozen_embeddings', 'top_2_training', 'top_4_training' and 'all_training'.
7.   save_path: The directory where the trained model will be saved.

#### **Problem 3(a)**

Your first problem here would be to implement the set_training_parameters() method. In this method you will select the layers of your model to train based on the training_type. **Note: By default the Hugging Face DistilBERT has 6 layers.**

1. frozen_embeddings: This setting is supposed to train the DistilBERT model with embeddings that are 'frozen' i.e., not trainable. You would need to ensure that the embedding layers in your model are not trainable.
2. top_2_training: This setting is supposed to train just the final two layers of DistilBERT (layer 5 and layer 4). All other layers before these would need to be frozen.
3. top_4_training: This setting is supposed to train just the final four layers of DistilBERT (layer 5, layer 4, layer 3 and layer 2). All other layers before these would need to be frozen.
4. all_training: All layers of DistilBERT would need to be trained.

Please write your code between the given TODO block.

**Note: The classifier head on top of the final DistilBERT layer would always need to be trained, please do not freeze that layer.**

**Note: You can use model.named_parameters() and iterate over all the named parameters of the model. To make a layer non-trainable, set layer.requires_grad = False**
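
The name-based selection described in the notes above can be sketched without loading a model. This is a minimal illustration, not the reference solution. The parameter names are examples that follow the `distilbert.transformer.layer.<index>` pattern, the helper names are made up, and freezing the embeddings in the two "top" settings is an assumption based on the instruction to freeze everything before the trained layers.

```python
NUM_LAYERS = 6  # DistilBERT layer count stated above

def frozen_fragments(training_type):
    """Return name fragments whose parameters should be frozen."""
    if training_type == "all_training":
        return []
    if training_type == "frozen_embeddings":
        return ["embeddings"]
    if training_type == "top_2_training":
        first_trainable = NUM_LAYERS - 2
    elif training_type == "top_4_training":
        first_trainable = NUM_LAYERS - 4
    else:
        raise ValueError(f"unknown training_type: {training_type}")
    # Assumption: embeddings are frozen together with the lower layers.
    return ["embeddings"] + [f"layer.{i}." for i in range(first_trainable)]

def is_trainable(param_name, training_type):
    """True when no frozen fragment occurs in the parameter name."""
    return not any(frag in param_name for frag in frozen_fragments(training_type))

# Example parameter names (illustrative, not an exhaustive list).
names = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "distilbert.transformer.layer.3.ffn.lin1.weight",
    "distilbert.transformer.layer.5.output_layer_norm.bias",
    "pre_classifier.weight",
    "classifier.bias",
]

for name in names:
    print(name, is_trainable(name, "top_2_training"))
```

With a real model, the same decision would be applied inside a loop over `model.named_parameters()` by assigning the result to each parameter's `requires_grad`. Note that the classifier head never matches a frozen fragment, so it stays trainable in every setting.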

#### **Problem 3(b)**

Your second problem would be to implement a single training step in the given loop inside the train() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would also need to propagate the loss backwards to the model and update the given optimizer's parameters.

Please write your code between the given TODO block.

#### **Problem 3(c)**

Your third problem would be to implement a single validation step in the given loop inside the eval() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

**Note: Consult the PyTorch demos by the TAs during class for Problem 3(b) and 3(c).** (https://colab.research.google.com/drive/1Nf_5z4_g09KqOy0km4fyG4Kj2bRcEcCK?usp=sharing) 
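
To make the per-batch metrics concrete, here is a plain-Python sketch of binary precision, recall and F1 with class 1 (positive) as the target class. It is only an illustration of the definitions; the notebook itself uses scikit-learn. Returning 0 when a denominator is zero mirrors the `zero_division=0` setting used in get_performance_metrics(). The function name and the example labels are made up.

```python
def binary_metrics(labels, preds):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

    # Fall back to 0.0 when a metric is undefined (no predicted or true positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical batch of six reviews: true labels and predicted classes.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]

precision, recall, f1 = binary_metrics(labels, preds)
print(f"precision: {precision:.4f} recall: {recall:.4f} f1: {f1:.4f}")
```

In this batch there are two true positives, one false positive and one false negative, so all three metrics equal 2/3.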

In [12]:
class Trainer():

  def __init__(self, options):
    self.device = options['device']
    self.train_data = options['train_data']
    self.val_data = options['val_data']
    self.batch_size = options['batch_size']
    self.epochs = options['epochs']
    self.save_path = options['save_path']
    self.training_type = options['training_type']
    transformer = DistillBERT()
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def set_training_parameters(self):
    # TODO(students): start
    model = self.model
    # for name, layer in model.named_parameters():
    #   print("param name:", name, "requires_grad:", layer.requires_grad) #distilbert.transformer.layer.5.output_layer_norm.bias

    if self.training_type == "frozen_embeddings":
      for name, layer in model.named_parameters():
        if "embeddings"  in name:
          layer.requires_grad = False    

    elif self.training_type == "top_2_training":
      for name, layer in model.named_parameters():
        if "embeddings"  in name:
          layer.requires_grad = False
        if "layer.0"  in name:
          layer.requires_grad = False
        if "layer.1"  in name:
          layer.requires_grad = False
        if "layer.2"  in name:
          layer.requires_grad = False
        if "layer.3"  in name:
          layer.requires_grad = False

    elif self.training_type == "top_4_training":
      for name, layer in model.named_parameters():
        if "embeddings"  in name:
          layer.requires_grad = False
        if "layer.0"  in name:
          layer.requires_grad = False
        if "layer.1"  in name:
          layer.requires_grad = False


    # TODO(students): end

  def train(self, data_loader, optimizer):
    self.model.train()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()
      # TODO(students): start

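      # Move the batch to the same device as the model
      reviews = reviews.to(self.device)
      labels = labels.to(self.device)
      outputs = self.model(reviews, labels = labels)
      cur_loss = outputs.loss

      # Backpropagate the loss and update the trainable parameters
      optimizer.zero_grad()
      cur_loss.backward()
      optimizer.step()

      # Store the loss and metrics of each batch; .item() keeps a plain float instead of a tensor
      total_loss += cur_loss.item()
      result = self.get_performance_metrics(outputs['logits'].cpu().detach(), labels.cpu().detach())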
      total_precision += result[0]
      total_recall += result[1]
      total_f1 += result[2]

      # TODO(students): end

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def eval(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        
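        # Move the batch to the same device as the model
        reviews = reviews.to(self.device)
        labels = labels.to(self.device)
        outputs = self.model(reviews, labels = labels)
        cur_loss = outputs.loss

        # Store the loss and metrics of each batch; .item() keeps a plain float instead of a tensor
        total_loss += cur_loss.item()
        result = self.get_performance_metrics(outputs['logits'].cpu().detach(), labels.cpu().detach())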
        total_precision += result[0]
        total_recall += result[1]
        total_f1 += result[2]

        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def save_transformer(self):
    self.model.save_pretrained(self.save_path)
    self.tokenizer.save_pretrained(self.save_path)

  def execute(self):
    last_best = 0
    train_dataset = DatasetLoader(self.train_data, self.tokenizer)
    train_data_loader = train_dataset.get_data_loaders(self.batch_size)
    val_dataset = DatasetLoader(self.val_data, self.tokenizer)
    val_data_loader = val_dataset.get_data_loaders(self.batch_size)
    optimizer = AdamW(self.model.parameters(), lr = 3e-5, eps = 1e-8)
    self.set_training_parameters()
    for epoch_i in range(0, self.epochs):
      train_precision, train_recall, train_f1, train_loss = self.train(train_data_loader, optimizer)
      print(f'Epoch {epoch_i + 1}: train_loss: {train_loss:.4f} train_precision: {train_precision:.4f} train_recall: {train_recall:.4f} train_f1: {train_f1:.4f}')
      val_precision, val_recall, val_f1, val_loss = self.eval(val_data_loader)
      print(f'Epoch {epoch_i + 1}: val_loss: {val_loss:.4f} val_precision: {val_precision:.4f} val_recall: {val_recall:.4f} val_f1: {val_f1:.4f}')

      if val_f1 > last_best:
        print("Saving model..")
        self.save_transformer()
        last_best = val_f1
        print("Model saved.")

**Notes: Run the following blocks in order to train the models and save them in your Google Drive. Results will vary due to random initialization. Most validation and test scores should fall between 80% and 95%. Each experiment should take no more than 30 minutes to run when the runtime is set to GPU.**
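
The `execute()` method of the Trainer saves a checkpoint only when the validation F1 improves on the best value seen so far, so the saved model is not necessarily the one from the last epoch. The sketch below isolates that rule in plain Python; the function name and the scores are hypothetical.

```python
def epochs_that_save(val_f1_per_epoch):
    """Return the 1-based epochs at which a checkpoint would be written."""
    last_best = 0
    saved_epochs = []
    for epoch, val_f1 in enumerate(val_f1_per_epoch, start=1):
        if val_f1 > last_best:  # strictly better than the best so far
            saved_epochs.append(epoch)
            last_best = val_f1
    return saved_epochs

# Hypothetical validation F1 scores for three epochs.
print(epochs_that_save([0.90, 0.88, 0.92]))
print(epochs_that_save([0.91, 0.90, 0.89]))
```

Because each save overwrites the same directory, the model left on disk after training is the one from the last epoch in the returned list.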

#### **Experiment 1**
---
Training your DistilBERT with frozen embeddings.



In [14]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_frozen_embeddings'
options['epochs'] = EPOCHS
options['training_type'] = 'frozen_embeddings'
trainer = Trainer(options)
trainer.execute()

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 936.32it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 928.02it/s]
100%|██████████| 321/321 [07:52<00:00,  1.47s/it]


Epoch 1: train_loss: 0.4304 train_precision: 0.7858 train_recall: 0.7801 train_f1: 0.7572


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 1: val_loss: 0.2257 val_precision: 0.8919 val_recall: 0.9641 val_f1: 0.9208
Saving model..
Model saved.


100%|██████████| 321/321 [07:53<00:00,  1.48s/it]


Epoch 2: train_loss: 0.2029 train_precision: 0.9267 train_recall: 0.9273 train_f1: 0.9213


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 2: val_loss: 0.2396 val_precision: 0.9477 val_recall: 0.8587 val_f1: 0.8960


100%|██████████| 321/321 [07:50<00:00,  1.46s/it]


Epoch 3: train_loss: 0.1054 train_precision: 0.9657 train_recall: 0.9685 train_f1: 0.9646


100%|██████████| 17/17 [00:08<00:00,  1.91it/s]

Epoch 3: val_loss: 0.3550 val_precision: 0.8321 val_recall: 0.9881 val_f1: 0.8923





#### **Experiment 2**
---
Training your DistilBERT with only top 2 layers being trained. 



In [15]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_2_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_2_training'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 977.03it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 1034.62it/s]
100%|██████████| 321/321 [04:29<00:00,  1.19it/s]


Epoch 1: train_loss: 0.3632 train_precision: 0.8418 train_recall: 0.8404 train_f1: 0.8238


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 1: val_loss: 0.2797 val_precision: 0.8594 val_recall: 0.9784 val_f1: 0.9108
Saving model..
Model saved.


100%|██████████| 321/321 [04:29<00:00,  1.19it/s]


Epoch 2: train_loss: 0.2422 train_precision: 0.9074 train_recall: 0.9064 train_f1: 0.8978


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 2: val_loss: 0.2233 val_precision: 0.8801 val_recall: 0.9329 val_f1: 0.9019


100%|██████████| 321/321 [04:29<00:00,  1.19it/s]


Epoch 3: train_loss: 0.1824 train_precision: 0.9328 train_recall: 0.9346 train_f1: 0.9286


100%|██████████| 17/17 [00:08<00:00,  1.91it/s]


Epoch 3: val_loss: 0.2047 val_precision: 0.8883 val_recall: 0.9557 val_f1: 0.9184
Saving model..
Model saved.


#### **Experiment 3**
---
Training your DistilBERT with only top 4 layers being trained. 



In [19]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_4_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_4_training'
trainer = Trainer(options)
trainer.execute()

Downloading:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/483 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/256M [00:00<?, ?B/s]

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classi

Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 990.03it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 984.69it/s]
100%|██████████| 321/321 [05:54<00:00,  1.10s/it]


Epoch 1: train_loss: 0.3618 train_precision: 0.8424 train_recall: 0.8487 train_f1: 0.8246


100%|██████████| 17/17 [00:08<00:00,  1.99it/s]


Epoch 1: val_loss: 0.2376 val_precision: 0.8797 val_recall: 0.9416 val_f1: 0.9064
Saving model..
Model saved.


100%|██████████| 321/321 [05:54<00:00,  1.11s/it]


Epoch 2: train_loss: 0.2010 train_precision: 0.9248 train_recall: 0.9256 train_f1: 0.9182


100%|██████████| 17/17 [00:08<00:00,  2.00it/s]


Epoch 2: val_loss: 0.2038 val_precision: 0.8833 val_recall: 0.9620 val_f1: 0.9187
Saving model..
Model saved.


100%|██████████| 321/321 [05:55<00:00,  1.11s/it]


Epoch 3: train_loss: 0.1227 train_precision: 0.9524 train_recall: 0.9569 train_f1: 0.9510


100%|██████████| 17/17 [00:08<00:00,  1.99it/s]

Epoch 3: val_loss: 0.2159 val_precision: 0.9000 val_recall: 0.9203 val_f1: 0.9066





#### **Experiment 4**
---
Training your DistilBERT with all layers being trained. 



In [17]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_all_training'
options['epochs'] = EPOCHS
options['training_type'] = 'all_training'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'pre_classifi

Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 948.42it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 952.55it/s]
100%|██████████| 321/321 [07:59<00:00,  1.49s/it]


Epoch 1: train_loss: 0.3488 train_precision: 0.8391 train_recall: 0.8645 train_f1: 0.8336


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 1: val_loss: 0.2329 val_precision: 0.8962 val_recall: 0.9360 val_f1: 0.9111
Saving model..
Model saved.


100%|██████████| 321/321 [07:58<00:00,  1.49s/it]


Epoch 2: train_loss: 0.1796 train_precision: 0.9307 train_recall: 0.9320 train_f1: 0.9250


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 2: val_loss: 0.2076 val_precision: 0.9123 val_recall: 0.9521 val_f1: 0.9267
Saving model..
Model saved.


100%|██████████| 321/321 [07:58<00:00,  1.49s/it]


Epoch 3: train_loss: 0.0697 train_precision: 0.9791 train_recall: 0.9785 train_f1: 0.9771


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]

Epoch 3: val_loss: 0.2573 val_precision: 0.9072 val_recall: 0.9376 val_f1: 0.9198





## **Problem 4 (Test Function)**
---
Here, you will write the code for the testing of the models that you trained in the previous code blocks. 

The class below provides a method to test a given model. It takes a dictionary with the following parameters:

1.   device: The device to run the model on.
2.   test_data: The test_data dataframe.
3.   batch_size: The batch_size which is input to the model.
4.   save_path: The directory of your saved model.

You would need to implement a single test step in the given loop inside the test() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

Hint: This problem is very similar to 3(c).

In [13]:
class Tester():

  def __init__(self, options):
    self.save_path = options['save_path']
    self.device = options['device']
    self.test_data = options['test_data']
    self.batch_size = options['batch_size']
    transformer = DistillBERT(self.save_path)
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def test(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
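        # Move the batch to the same device as the model
        reviews = reviews.to(self.device)
        labels = labels.to(self.device)
        outputs = self.model(reviews, labels = labels)
        cur_loss = outputs.loss

        # Store the loss and metrics of each batch; .item() keeps a plain float instead of a tensor
        total_loss += cur_loss.item()
        result = self.get_performance_metrics(outputs['logits'].cpu().detach(), labels.cpu().detach())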
        total_precision += result[0]
        total_recall += result[1]
        total_f1 += result[2]


        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def execute(self):
    test_dataset = DatasetLoader(self.test_data, self.tokenizer)
    test_data_loader = test_dataset.get_data_loaders(self.batch_size)

    test_precision, test_recall, test_f1, test_loss = self.test(test_data_loader)

    print()
    print(f'test_loss: {test_loss:.4f} test_precision: {test_precision:.4f} test_recall: {test_recall:.4f} test_f1: {test_f1:.4f}')

**Notes: Run these blocks only after Experiments 1 to 4 are completed and the models are saved in the "models" folder. Copy the output blocks into another text file for report writing.**

#### **Experiment 5**
---
Testing your DistilBERT trained with frozen embeddings.



In [19]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_frozen_embeddings'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 982.31it/s]
100%|██████████| 38/38 [00:19<00:00,  1.91it/s]


test_loss: 0.2712 test_precision: 0.8627 test_recall: 0.8980 test_f1: 0.8675





#### **Experiment 6**
---
Testing your DistilBERT trained with all layers frozen except the final two layers.



In [20]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_2_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 1030.43it/s]
100%|██████████| 38/38 [00:19<00:00,  1.91it/s]


test_loss: 0.2880 test_precision: 0.8523 test_recall: 0.9275 test_f1: 0.8804





#### **Experiment 7**
---
Testing your DistilBERT trained with all layers frozen except the final four layers.



In [20]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_4_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 1003.46it/s]
100%|██████████| 38/38 [00:18<00:00,  2.01it/s]


test_loss: 0.2863 test_precision: 0.8557 test_recall: 0.9313 test_f1: 0.8850





#### **Experiment 8**
---
Testing your DistilBERT trained with all layers trainable.



In [22]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_all_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 937.04it/s]
100%|██████████| 38/38 [00:20<00:00,  1.90it/s]


test_loss: 0.2595 test_precision: 0.8755 test_recall: 0.9067 test_f1: 0.8841





## **Submission guidelines**
---
You would need to submit the following files:


1.   `NLP_HW3.ipynb` - This jupyter notebook.
2.   `gdrive_link.txt` - Should contain a wget-able link to a folder that contains your four DistilBERT models. Please make sure you provide the necessary permissions.
3. `<SBUID>_Report.pdf` - A PDF report as detailed below.

Your PDF report should contain the answers to the following questions. Use the outputs of the code blocks that you were asked to save for this task:

1.   An explanation of your code implementation for each TODO task.
2.   A table containing the precision, recall and F1 scores of each DistilBERT model during testing (Experiments 5 to 8).
3. An analysis explaining your understanding of the impact freezing/training different layers has on the model's performance.
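
One possible way to assemble the results table for the report is sketched below in plain Python. The numbers are copied from the test logs of Experiments 5 to 8 in this particular run and will differ in yours; the helper name `to_markdown_table` and the markdown layout are illustrative choices, not a required format.

```python
# Test metrics (precision, recall, F1) copied from this run's test logs.
results = {
    "frozen_embeddings": (0.8627, 0.8980, 0.8675),
    "top_2_training": (0.8523, 0.9275, 0.8804),
    "top_4_training": (0.8557, 0.9313, 0.8850),
    "all_training": (0.8755, 0.9067, 0.8841),
}

def to_markdown_table(results):
    """Format the metrics as a markdown table, one row per model."""
    lines = [
        "| Model | Precision | Recall | F1 |",
        "| --- | --- | --- | --- |",
    ]
    for name, (precision, recall, f1) in results.items():
        lines.append(f"| {name} | {precision:.4f} | {recall:.4f} | {f1:.4f} |")
    return "\n".join(lines)

table = to_markdown_table(results)
print(table)
```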


**Colab design credit**: TA Dhruv Verma