# **CSE354 HW3**
**Due date: 11:59 pm EST on April 24, 2022 (Sunday)**

---
For this assignment, we will use Google Colab, which gives us access to resources, such as GPUs, that some of us might not have on our local machines. You can use your Stony Brook (*.stonybrook.edu) account or your personal Gmail account for coding, and Google Drive to save your results.

## **Google Colab Tutorial**
---
Go to https://colab.research.google.com/notebooks/. You will see a tutorial notebook named "Welcome to Colaboratory", where you can learn the basics of using Google Colab.

**This notebook requires you to train your model on Colab's GPU. However, GPU runtimes are limited, so ensure that your code works on the default CPU runtime before switching over to the GPU runtime.**

## **Problem statement**
---
In this homework, you will be using language models to predict the sentiment of a given movie review. The dataset, which is given to you, is sampled from the [IMDB dataset of 50k movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The reviews are sampled down to a smaller set to allow quicker computation on Colab. Each row contains a review and an associated sentiment label, which is either *positive* or *negative*. You have been given three files: train_data.csv, val_data.csv and test_data.csv. The training data will be used to fine-tune the language model, the validation data will be used to select the best model during training, and the test data will give the model's final performance.
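
As a quick sanity check before training, it can help to count how many reviews carry each label in a split. The sketch below assumes the CSV files have `review` and `sentiment` columns (the column names used by the dataloader in this notebook) and uses two made-up rows in place of the real files; `count_labels` is a hypothetical helper, not part of the assignment code.

```python
import csv
import io
from collections import Counter

# Made-up stand-in for the contents of a split such as train_data.csv.
SAMPLE_CSV = """review,sentiment
"A warm, funny film.",positive
"Two hours I will never get back.",negative
"""

def count_labels(csv_text):
    """Count how many rows carry each sentiment label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["sentiment"] for row in reader)

label_counts = count_labels(SAMPLE_CSV)
print(label_counts)
```

A heavily skewed count would be worth noting in the report, since precision and recall are sensitive to class balance.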

To perform this task you will be using a pre-trained DistilBERT model. DistilBERT is a BERT-based language model. It is 40% smaller than BERT, retains around 97% of BERT's language understanding capabilities, and is 60% faster. You can read more about DistilBERT at https://arxiv.org/abs/1910.01108.

You will be using the model through the libraries provided by Hugging Face (https://huggingface.co/). The library must first be installed using the command in the cell below. You will be training four different DistilBERT models for this assignment.


**Todos for the assignment:**
*   Fill in the # TODO(students) portions in this Colab file for this assignment.
*   Run the experiment code blocks and note down the colab outputs in a separate text file.
*   Use the aforementioned Colab outputs to write the report as per the submission guidelines described at the end of this Colab file.



In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel fo

## **Imports**
---

All the allowed imports have been done for you in the code block below. You do not need, and will not be allowed to use, any imports other than the ones below.

In [None]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AdamW
import os
from sklearn.metrics import precision_score, recall_score, f1_score
torch.manual_seed(42)
np.random.seed(42)

## **Mounting your drive**
---

I would highly recommend mounting your Google Drive while running this notebook. Your Drive can hold your dataset, and it will also be used to save your fine-tuned models. If you choose to save the models only in your Colab workspace, they will be lost once the runtime disconnects.
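
Once the Drive is mounted and the working directory is set, a small check like the one below can confirm that the data files are where the notebook expects them. This is only a sketch: `missing_files` is a hypothetical helper, and the relative paths are the ones used by the constants in this notebook.

```python
import os

def missing_files(base_dir, relative_paths):
    """Return the relative paths that do not exist under base_dir."""
    return [p for p in relative_paths
            if not os.path.exists(os.path.join(base_dir, p))]

# Relative paths used by TRAIN_PATH, VAL_PATH and TEST_PATH in this notebook.
expected = ["data/train_data.csv", "data/val_data.csv", "data/test_data.csv"]
print(missing_files(".", expected))
```

An empty list means all three files were found relative to the current directory.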

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls

drive  sample_data


In [None]:
# Set the path of the Google Drive folder that contains your Colab file and data in the TODO block below

# TODO(students): start
%cd "drive/MyDrive/HW3_Release"
!ls
# TODO(students): end

/content/drive/MyDrive/HW3_Release
'Copy of NLP_HW3.ipynb'   data	 models


## **Constants in the file**
---

The code block below contains a few constants.


1.   **BATCH_SIZE**: The batch size input to the models. This has been set to 16 and should not be changed, with one exception: if you encounter CUDA out-of-memory errors while training your models, you may reduce this value. If you do, please mention it in your submission report.
2.   **EPOCHS**: The number of epochs to train your model. This should not be changed.
3. **TEST_PATH**: This is the path to the test_data.csv file.
4. **TRAIN_PATH**: This is the path to the train_data.csv file.
5. **VAL_PATH**: This is the path to the val_data.csv file.
6. **SAVE_PATH**: This is the path to the directory where your model will be saved. Note: This path will be altered further down in the code by appending the name of the kind of DistilBERT model you train in each experiment.
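
To make the SAVE_PATH note concrete, the sketch below builds the per-experiment directory names the same way the experiment cells do (the base path, an underscore, then the training type). `save_path_for` is a hypothetical helper used only for illustration.

```python
SAVE_PATH = "models/DistilBERT"
TRAINING_TYPES = ["frozen_embeddings", "top_2_training", "top_4_training", "all_training"]

def save_path_for(training_type, base=SAVE_PATH):
    """Build the per-experiment save directory by appending the training type."""
    return f"{base}_{training_type}"

for training_type in TRAINING_TYPES:
    print(save_path_for(training_type))
```

Each experiment therefore writes to its own directory, so a later run does not overwrite the model saved by an earlier one.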



In [None]:
#DO NOT CHANGE THE CONSTANTS
BATCH_SIZE = 16
EPOCHS = 3
TEST_PATH = "data/test_data.csv"
TRAIN_PATH = "data/train_data.csv"
VAL_PATH = "data/val_data.csv"
SAVE_PATH = "models/DistilBERT"

In [None]:
def load_dataset(path):
  dataset = pd.read_csv(path)
  return dataset

In [None]:
train_data = load_dataset(TRAIN_PATH)
val_data = load_dataset(VAL_PATH)
test_data = load_dataset(TEST_PATH)

## **Problem 1 (Initialize the Model Class)**
---

Here, we will set up the pre-trained DistilBERT model class for our binary sentiment analysis task. In the code block below, you need to load a pre-trained DistilBERT model and its tokenizer using Hugging Face's library. The model to load is called "distilbert-base-uncased". The number of output labels of the model must be set to *num_classes* (here 2: positive and negative). Please write your code between the given TODO block.



More about the model and how to load it can be read at - https://huggingface.co/distilbert-base-uncased.

In [None]:
class DistillBERT():

  def __init__(self, model_name='distilbert-base-uncased', num_classes=2):
    # TODO(students): start
    self.num_classes = num_classes
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = self.num_classes)
    
    # TODO(students): end

  def get_tokenizer_and_model(self):
    return self.model, self.tokenizer

## **Problem 2 (Initialize the Dataloader Class)**
---
Here, we will set up the dataloader class, which reads the data, tokenizes it using the DistilBERT tokenizer, and converts the tokenized sentences and the labels to tensors. The code block below takes your dataset (train, validation or test) and the tokenizer you loaded in the previous block and generates the DataLoader object for it. You need to implement part of the tokenize_data method. This method takes the given data and generates a list of token IDs for each review along with its label. You need to use the tokenizer to generate the token IDs for each review (hint: refer to tokenizer.encode_plus for more details). **Please ensure that the maximum length of an encoded review is 512 tokens. If any input is longer than 512 tokens, truncate it to the first 512.**

You would also need to convert the labels to a corresponding numerical class using the label_dict dictionary. Please write your code between the given TODO block.
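
The shape of the processing described above (truncate each review to 512 token IDs, right-pad the batch to a common length, map labels to integers) can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual behavior: the token IDs are made up, plain slicing ignores the special tokens a real tokenizer keeps at the ends, and `truncate` and `pad_batch` are hypothetical helpers. The pad value of 0 mirrors the default padding value of `pad_sequence`.

```python
MAX_LENGTH = 512
PAD_ID = 0  # pad_sequence pads with 0 by default

label_dict = {"positive": 1, "negative": 0}

def truncate(token_ids, max_length=MAX_LENGTH):
    """Keep only the first max_length token IDs."""
    return token_ids[:max_length]

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]

# Made-up token IDs standing in for tokenizer output: one long, one short review.
encoded = [truncate(list(range(1, 601))), truncate([101, 2023, 102])]
padded = pad_batch(encoded)
labels = [label_dict["positive"], label_dict["negative"]]

print([len(seq) for seq in padded], labels)
```

After padding, every row has the same length, which is what allows the reviews to be stacked into a single tensor.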

In [None]:
class DatasetLoader(Dataset):

  def __init__(self, data, tokenizer):
    self.data = data
    self.tokenizer = tokenizer

  def tokenize_data(self):
    print("Processing data..")
    tokens = []
    labels = []
    label_dict = {'positive': 1, 'negative': 0}

    review_list = self.data['review'].to_list()
    label_list = self.data['sentiment'].to_list()

    for (review, label) in tqdm(zip(review_list, label_list), total=len(review_list)):
      # TODO(students): start
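      # truncation=True explicitly cuts reviews longer than max_length down to 512 tokens.
      token_dict = self.tokenizer.encode_plus(review, max_length = 512, truncation = True)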
      input_ids = token_dict["input_ids"]
      tokens.append(torch.tensor(input_ids))
      labels.append(torch.tensor(label_dict[label]))
      # TODO(students): end
    tokens = pad_sequence(tokens, batch_first=True)
    labels = torch.tensor(labels)
    dataset = TensorDataset(tokens, labels)
    return dataset

  def get_data_loaders(self, batch_size=32, shuffle=True):
    processed_dataset = self.tokenize_data()

    data_loader = DataLoader(
        processed_dataset,
        shuffle=shuffle,
        batch_size=batch_size
    )

    return data_loader

## **Problem 3 (Training Function)**
---
In this problem, you will write the code that will be used to run your model class on the dataset class, both of which you have written in the previous problems.

The class below provides a method to train a given model. It takes a dictionary with the following parameters:


1.   device: The device to run the model on.
2.   train_data: The train_data dataframe.
3.   val_data: The val_data dataframe.
4.   batch_size: The batch_size which is input to the model.
5.   epochs: The number of epochs to train the model.
6.   training_type: The type of training that your model will be undergoing. This can take four values - 'frozen_embeddings', 'top_2_training', 'top_4_training' and 'all_training'.

#### **Problem 3(a)**

Your first problem here would be to implement the set_training_parameters() method. In this method you will select the layers of your model to train based on the training_type. **Note: By default the Hugging Face DistilBERT has 6 layers.**

1. frozen_embeddings: This setting is supposed to train the DistilBERT model with embeddings that are 'frozen' i.e., not trainable. You would need to ensure that the embedding layers in your model are not trainable.
2. top_2_training: This setting is supposed to train just the final two layers of DistilBERT (layer 5 and layer 4). All other layers before these would need to be frozen.
3. top_4_training: This setting is supposed to train just the final four layers of DistilBERT (layer 5, layer 4, layer 3 and layer 2). All other layers before these would need to be frozen.
4. all_training: All layers of DistilBERT would need to be trained.

Please write your code between the given TODO block.

**Note: The classifier head on top of the final DistilBERT layer would always need to be trained, please do not freeze that layer.**

**Note: You can use model.named_parameters() to iterate over all the named parameters of the model. To make a parameter non-trainable, set param.requires_grad = False.**
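
One way to read the four settings is as a lookup from training type to the name fragments that should be frozen. The sketch below works on plain strings so it runs without loading a model. The example parameter names are assumptions about how Hugging Face names DistilBERT's parameters, `should_freeze` is a hypothetical helper, and treating the embeddings as frozen in the top-2 and top-4 settings is one interpretation of "all other layers before these".

```python
# Name fragments to freeze for each training type (an interpretation of the
# settings above; the classifier head never appears in any of these lists).
FROZEN_SUBSTRINGS = {
    "frozen_embeddings": ["embeddings"],
    "top_2_training": ["embeddings", "layer.0.", "layer.1.", "layer.2.", "layer.3."],
    "top_4_training": ["embeddings", "layer.0.", "layer.1."],
    "all_training": [],
}

def should_freeze(param_name, training_type):
    """Return True if the named parameter should have requires_grad set to False."""
    return any(fragment in param_name for fragment in FROZEN_SUBSTRINGS[training_type])

# Assumed examples of the names yielded by model.named_parameters().
example_names = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.1.attention.q_lin.weight",
    "distilbert.transformer.layer.4.ffn.lin1.weight",
    "pre_classifier.weight",
    "classifier.bias",
]

frozen = [name for name in example_names if should_freeze(name, "top_2_training")]
print(frozen)
```

Printing the names from `model.named_parameters()` on the real model is a good way to confirm the fragments before relying on them.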

#### **Problem 3(b)**

Your second problem would be to implement a single training step in the given loop inside the train() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would also need to propagate the loss backwards to the model and update the given optimizer's parameters.

Please write your code between the given TODO block.
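
To see what the per-batch metrics measure, the sketch below reproduces the calculation in plain Python: take the argmax of each row of logits as the predicted class, then compute precision, recall and F1 for the positive class (label 1), returning 0 when a denominator is zero. The logits and labels are made up, and `argmax_rows` and `precision_recall_f1` are hypothetical helpers standing in for the NumPy and scikit-learn calls in get_performance_metrics().

```python
def argmax_rows(logits):
    """Index of the largest logit in each row, i.e. the predicted class."""
    return [row.index(max(row)) for row in logits]

def precision_recall_f1(preds, labels, positive=1):
    """Binary precision, recall and F1 for the positive class (0 on empty denominators)."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up logits for a batch of four reviews: [score_negative, score_positive].
logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [-0.5, 0.9]]
labels = [1, 0, 0, 1]

preds = argmax_rows(logits)
precision, recall, f1 = precision_recall_f1(preds, labels)
print(preds, precision, recall, f1)
```

Here the third review is a false positive, so recall is perfect while precision drops to two thirds.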

#### **Problem 3(c)**

Your third problem would be to implement a single validation step in the given loop inside the eval() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.
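
The training and evaluation loops report the mean of the per-batch scores (each total divided by the number of batches). This is generally not the same as the score computed once over all examples, which is worth keeping in mind when interpreting the numbers. The sketch below illustrates the difference with made-up predictions for two batches; `f1_score_binary` is a hypothetical helper.

```python
def f1_score_binary(preds, labels, positive=1):
    """Binary F1 for the positive class, returning 0 when it is undefined."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up (predictions, labels) pairs for two batches.
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 1, 1]),
]

# Mean of per-batch F1, which is what the loops in this notebook compute.
mean_f1 = sum(f1_score_binary(p, y) for p, y in batches) / len(batches)

# F1 over all examples pooled together.
all_preds = [p for preds, _ in batches for p in preds]
all_labels = [y for _, labels in batches for y in labels]
pooled_f1 = f1_score_binary(all_preds, all_labels)

print(round(mean_f1, 4), round(pooled_f1, 4))
```

The two values are close but not equal, so scores from this notebook are best compared against other scores computed the same way.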

**Note: Consult the pytorch demos by the TAs during class for Problem 3(b) and 3(c).** (https://colab.research.google.com/drive/1Nf_5z4_g09KqOy0km4fyG4Kj2bRcEcCK?usp=sharing) 

In [None]:
class Trainer():

  def __init__(self, options):
    self.device = options['device']
    self.train_data = options['train_data']
    self.val_data = options['val_data']
    self.batch_size = options['batch_size']
    self.epochs = options['epochs']
    self.save_path = options['save_path']
    self.training_type = options['training_type']
    transformer = DistillBERT()
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def set_training_parameters(self):
    # TODO(students): start
    # Name fragments to freeze per training type. The classifier head
    # ("pre_classifier", "classifier") is never listed, so it always stays trainable.
    # Per the problem statement, everything below the trainable layers,
    # including the embeddings, is frozen for the top-2 and top-4 settings.
    frozen_fragments = {
        "frozen_embeddings": ["embeddings"],
        "top_2_training": ["embeddings", "layer.0.", "layer.1.", "layer.2.", "layer.3."],
        "top_4_training": ["embeddings", "layer.0.", "layer.1."],
        "all_training": [],
    }
    for name, param in self.model.named_parameters():
      if any(fragment in name for fragment in frozen_fragments.get(self.training_type, [])):
        param.requires_grad = False
    
    # TODO(students): end

  def train(self, data_loader, optimizer):
    self.model.train()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()
      # TODO(students): start
      reviews = reviews.to(self.device)
      labels = labels.to(self.device)

      # Forward pass; passing labels makes the model return the loss as well.
      outputs = self.model(reviews, labels = labels)
      current_loss = outputs.loss
      # .item() stores a plain float so the computation graph is not kept alive.
      total_loss += current_loss.item()

      # Move predictions and labels to the CPU for the sklearn metrics.
      logits = outputs.logits.detach().cpu().numpy()
      label2 = labels.to("cpu").numpy()
      current_precision, current_recall, current_f1 = self.get_performance_metrics(logits, label2)
      total_precision += current_precision
      total_recall += current_recall
      total_f1 += current_f1

      # Backward pass and parameter update.
      optimizer.zero_grad()
      current_loss.backward()
      optimizer.step()

      # TODO(students): end

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def eval(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        reviews = reviews.to(self.device)
        labels = labels.to(self.device)
        outputs = self.model(reviews, labels = labels)
        logits = outputs.logits
        current_loss = outputs.loss
        # .item() accumulates a plain float instead of a tensor on the device.
        total_loss += current_loss.item()
        
        logits = logits.detach().cpu().numpy()
        label2 = labels.to("cpu").numpy()

        current_precision, current_recall, current_f1 = self.get_performance_metrics(logits, label2)
        total_precision += current_precision
        total_recall += current_recall
        total_f1 += current_f1
        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def save_transformer(self):
    self.model.save_pretrained(self.save_path)
    self.tokenizer.save_pretrained(self.save_path)

  def execute(self):
    last_best = 0
    train_dataset = DatasetLoader(self.train_data, self.tokenizer)
    train_data_loader = train_dataset.get_data_loaders(self.batch_size)
    val_dataset = DatasetLoader(self.val_data, self.tokenizer)
    val_data_loader = val_dataset.get_data_loaders(self.batch_size)
    optimizer = AdamW(self.model.parameters(), lr = 3e-5, eps = 1e-8)
    self.set_training_parameters()
    for epoch_i in range(0, self.epochs):
      train_precision, train_recall, train_f1, train_loss = self.train(train_data_loader, optimizer)
      print(f'Epoch {epoch_i + 1}: train_loss: {train_loss:.4f} train_precision: {train_precision:.4f} train_recall: {train_recall:.4f} train_f1: {train_f1:.4f}')
      val_precision, val_recall, val_f1, val_loss = self.eval(val_data_loader)
      print(f'Epoch {epoch_i + 1}: val_loss: {val_loss:.4f} val_precision: {val_precision:.4f} val_recall: {val_recall:.4f} val_f1: {val_f1:.4f}')

      if val_f1 > last_best:
        print("Saving model..")
        self.save_transformer()
        last_best = val_f1
        print("Model saved.")

**Notes: Run the following blocks in order to train each model and save it in your Google Drive. Results will vary between runs due to random initialization. Validation and test accuracy for most experiments should fall between 80% and 95%. Each experiment should take no more than 30 minutes when the runtime is set to GPU.**
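
The execute() method above only saves a model when its validation F1 beats the best value seen so far. The sketch below isolates that rule; `epochs_that_save` is a hypothetical helper. The first list holds the validation F1 values from the Experiment 1 output in this notebook, and the second list is made up to show an epoch being skipped.

```python
def epochs_that_save(val_f1_per_epoch):
    """Return the 1-based epochs at which a new best validation F1 triggers a save."""
    last_best = 0
    saved = []
    for epoch, val_f1 in enumerate(val_f1_per_epoch, start=1):
        if val_f1 > last_best:
            saved.append(epoch)
            last_best = val_f1
    return saved

print(epochs_that_save([0.9186, 0.9389, 0.9426]))  # Experiment 1 values from this notebook
print(epochs_that_save([0.90, 0.88, 0.93]))        # made-up values with a dip at epoch 2
```

The model left on disk after training is therefore the one from the best validation epoch, not necessarily the last epoch.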

#### **Experiment 1**
---
Training your DistilBERT with frozen embeddings.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_frozen_embeddings'
options['epochs'] = EPOCHS
options['training_type'] = 'frozen_embeddings'
trainer = Trainer(options)
trainer.execute()


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi

Processing data..


  0%|          | 0/5130 [00:00<?, ?it/s]
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 5130/5130 [00:04<00:00, 1034.14it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 941.45it/s]
100%|██████████| 321/321 [07:34<00:00,  1.41s/it]


Epoch 1: train_loss: 0.3516 train_precision: 0.8674 train_recall: 0.8643 train_f1: 0.8474


100%|██████████| 17/17 [00:08<00:00,  1.91it/s]


Epoch 1: val_loss: 0.2140 val_precision: 0.9392 val_recall: 0.9095 val_f1: 0.9186
Saving model..
Model saved.


100%|██████████| 321/321 [07:36<00:00,  1.42s/it]


Epoch 2: train_loss: 0.1933 train_precision: 0.9284 train_recall: 0.9268 train_f1: 0.9212


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 2: val_loss: 0.2119 val_precision: 0.9179 val_recall: 0.9673 val_f1: 0.9389
Saving model..
Model saved.


100%|██████████| 321/321 [07:40<00:00,  1.44s/it]


Epoch 3: train_loss: 0.1113 train_precision: 0.9591 train_recall: 0.9639 train_f1: 0.9581


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 3: val_loss: 0.2012 val_precision: 0.9266 val_recall: 0.9642 val_f1: 0.9426
Saving model..
Model saved.


#### **Experiment 2**
---
Training your DistilBERT with only the top 2 layers being trained.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_2_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_2_training'
trainer = Trainer(options)
trainer.execute()


Processing data..


  0%|          | 0/5130 [00:00<?, ?it/s]
100%|██████████| 5130/5130 [00:05<00:00, 983.03it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 989.12it/s]
100%|██████████| 321/321 [06:00<00:00,  1.12s/it]


Epoch 1: train_loss: 0.6486 train_precision: 0.6622 train_recall: 0.7758 train_f1: 0.6856


100%|██████████| 17/17 [00:09<00:00,  1.87it/s]


Epoch 1: val_loss: 0.5492 val_precision: 0.7591 val_recall: 0.8496 val_f1: 0.7895
Saving model..
Model saved.


100%|██████████| 321/321 [06:02<00:00,  1.13s/it]


Epoch 2: train_loss: 0.3975 train_precision: 0.8589 train_recall: 0.8524 train_f1: 0.8446


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 2: val_loss: 0.3123 val_precision: 0.8770 val_recall: 0.8941 val_f1: 0.8765
Saving model..
Model saved.


100%|██████████| 321/321 [06:02<00:00,  1.13s/it]


Epoch 3: train_loss: 0.2246 train_precision: 0.9271 train_recall: 0.9165 train_f1: 0.9160


100%|██████████| 17/17 [00:09<00:00,  1.87it/s]


Epoch 3: val_loss: 0.2703 val_precision: 0.8678 val_recall: 0.9240 val_f1: 0.8886
Saving model..
Model saved.


#### **Experiment 3**
---
Training your DistilBERT with only the top 4 layers being trained.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_4_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_4_training'
trainer = Trainer(options)
trainer.execute()


Processing data..


  0%|          | 0/5130 [00:00<?, ?it/s]
100%|██████████| 5130/5130 [00:04<00:00, 1027.73it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 1040.86it/s]
100%|██████████| 321/321 [07:14<00:00,  1.35s/it]


Epoch 1: train_loss: 0.4210 train_precision: 0.7769 train_recall: 0.7745 train_f1: 0.7482


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 1: val_loss: 0.2790 val_precision: 0.8291 val_recall: 0.9829 val_f1: 0.8959
Saving model..
Model saved.


100%|██████████| 321/321 [07:14<00:00,  1.35s/it]


Epoch 2: train_loss: 0.1949 train_precision: 0.9256 train_recall: 0.9291 train_f1: 0.9202


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 2: val_loss: 0.2085 val_precision: 0.9329 val_recall: 0.8921 val_f1: 0.9047
Saving model..
Model saved.


100%|██████████| 321/321 [07:14<00:00,  1.35s/it]


Epoch 3: train_loss: 0.1037 train_precision: 0.9609 train_recall: 0.9685 train_f1: 0.9617


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 3: val_loss: 0.2031 val_precision: 0.9337 val_recall: 0.9423 val_f1: 0.9346
Saving model..
Model saved.


#### **Experiment 4**
---
Training your DistilBERT with all layers being trained. 



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_all_training'
options['epochs'] = EPOCHS
options['training_type'] = 'all_training'
trainer = Trainer(options)
trainer.execute()


Processing data..


  0%|          | 0/5130 [00:00<?, ?it/s]
100%|██████████| 5130/5130 [00:05<00:00, 1017.53it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 1017.50it/s]
100%|██████████| 321/321 [07:49<00:00,  1.46s/it]


Epoch 1: train_loss: 0.4801 train_precision: 0.7384 train_recall: 0.7480 train_f1: 0.7068


100%|██████████| 17/17 [00:09<00:00,  1.88it/s]


Epoch 1: val_loss: 0.2460 val_precision: 0.9167 val_recall: 0.8878 val_f1: 0.8974
Saving model..
Model saved.


100%|██████████| 321/321 [07:44<00:00,  1.45s/it]


Epoch 2: train_loss: 0.2014 train_precision: 0.9301 train_recall: 0.9219 train_f1: 0.9195


100%|██████████| 17/17 [00:08<00:00,  1.90it/s]


Epoch 2: val_loss: 0.2295 val_precision: 0.8808 val_recall: 0.9590 val_f1: 0.9137
Saving model..
Model saved.


100%|██████████| 321/321 [07:44<00:00,  1.45s/it]


Epoch 3: train_loss: 0.1004 train_precision: 0.9713 train_recall: 0.9690 train_f1: 0.9678


100%|██████████| 17/17 [00:08<00:00,  1.91it/s]


Epoch 3: val_loss: 0.2476 val_precision: 0.9207 val_recall: 0.9504 val_f1: 0.9320
Saving model..
Model saved.


## **Problem 4 (Test Function)**
---
Here, you will write the code for the testing of the models that you trained in the previous code blocks. 

The class below provides a method to test a given model. It takes a dictionary with the following parameters:

1.   device: The device to run the model on.
2.   test_data: The test_data dataframe.
3.   batch_size: The batch_size which is input to the model.
4.   save_path: The directory of your saved model.

You would need to implement a single test step in the given loop inside the test() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

Hint: This problem is very similar to 3(c).

In [None]:
class Tester():

  def __init__(self, options):
    self.save_path = options['save_path']
    self.device = options['device']
    self.test_data = options['test_data']
    self.batch_size = options['batch_size']
    transformer = DistillBERT(self.save_path)
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def test(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        reviews = reviews.to(self.device)
        labels = labels.to(self.device)
        outputs = self.model(reviews, labels = labels)
        logits = outputs.logits
        current_loss = outputs.loss
        # .item() accumulates a plain float instead of a tensor on the device.
        total_loss += current_loss.item()
        
        logits = logits.detach().cpu().numpy()
        label2 = labels.to("cpu").numpy()

        current_precision, current_recall, current_f1 = self.get_performance_metrics(logits, label2)
        total_precision += current_precision
        total_recall += current_recall
        total_f1 += current_f1

        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def execute(self):
    test_dataset = DatasetLoader(self.test_data, self.tokenizer)
    test_data_loader = test_dataset.get_data_loaders(self.batch_size)

    test_precision, test_recall, test_f1, test_loss = self.test(test_data_loader)

    print()
    print(f'test_loss: {test_loss:.4f} test_precision: {test_precision:.4f} test_recall: {test_recall:.4f} test_f1: {test_f1:.4f}')
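
Note that test() averages the per-batch precision, recall and F1 over the number of batches. That is generally not the same number as the metric computed once over all test examples pooled together, which is worth keeping in mind when interpreting the reported scores. The sketch below illustrates the difference with a plain-Python reimplementation of the binary metrics (zero when undefined, mirroring `zero_division=0`); the two batches are made-up values, not assignment data.

```python
def precision_recall_f1(preds, labels):
    """Binary precision, recall and F1 with the positive class encoded as 1."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up (predictions, labels) pairs for two batches.
batches = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 1, 1]),
]

# Approach used in test(): compute F1 per batch, then average over batches.
per_batch_f1 = [precision_recall_f1(p, y)[2] for p, y in batches]
mean_of_batches = sum(per_batch_f1) / len(per_batch_f1)

# Alternative: pool every prediction and label, then compute F1 once.
all_preds = [p for preds, _ in batches for p in preds]
all_labels = [y for _, labels in batches for y in labels]
pooled_f1 = precision_recall_f1(all_preds, all_labels)[2]

print(f"mean of per-batch F1: {mean_of_batches:.4f}")
print(f"F1 over pooled data:  {pooled_f1:.4f}")
```

On this toy input the two approaches give roughly 0.7619 and 0.8000 respectively.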

**Note: Run these blocks only after Experiments 1 to 4 are complete and the models are saved in the "models" folder. Copy the output blocks into a separate text file for writing your report.**

#### **Experiment 5**
---
Testing your DistilBERT trained with frozen embeddings.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_frozen_embeddings'
tester = Tester(options)
tester.execute()

Processing data..


  0%|          | 0/600 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 600/600 [00:00<00:00, 1007.77it/s]
100%|██████████| 38/38 [00:19<00:00,  1.93it/s]


test_loss: 0.2694 test_precision: 0.8759 test_recall: 0.8820 test_f1: 0.8683





#### **Experiment 6**
---
Testing your DistilBERT trained with all layers frozen except the final two layers.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_2_training'
tester = Tester(options)
tester.execute()

Processing data..


  0%|          | 0/600 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 600/600 [00:00<00:00, 1020.61it/s]
100%|██████████| 38/38 [00:19<00:00,  1.93it/s]


test_loss: 0.3423 test_precision: 0.8264 test_recall: 0.8581 test_f1: 0.8312





#### **Experiment 7**
---
Testing your DistilBERT trained with all layers frozen except the final four layers.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_4_training'
tester = Tester(options)
tester.execute()

Processing data..


  0%|          | 0/600 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 600/600 [00:00<00:00, 990.00it/s] 
100%|██████████| 38/38 [00:19<00:00,  1.93it/s]


test_loss: 0.3247 test_precision: 0.8634 test_recall: 0.8849 test_f1: 0.8636





#### **Experiment 8**
---
Testing your DistilBERT trained with all layers trainable.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_all_training'
tester = Tester(options)
tester.execute()

Processing data..


  0%|          | 0/600 [00:00<?, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
100%|██████████| 600/600 [00:00<00:00, 1016.86it/s]
100%|██████████| 38/38 [00:19<00:00,  1.92it/s]


test_loss: 0.3479 test_precision: 0.8605 test_recall: 0.9112 test_f1: 0.8777





## **Submission guidelines**
---
Submit the following files:


1.   `NLP_HW3.ipynb` - This Jupyter notebook.
2.   `gdrive_link.txt` - A wget-able link to a folder that contains your four DistilBERT models. Please make sure you grant the necessary access permissions.
3.   `<SBUID>_Report.pdf` - A PDF report as detailed below.

Your PDF report should contain the answers to the following questions. Use the outputs of the code blocks that you were asked to save for this task:

1.   An explanation of your code implementation for each TODO task.
2.   A table containing the precision, recall and F1 scores of each DistilBERT model during testing (Experiments 5 to 8).
3.   An analysis explaining your understanding of the impact that freezing or training different layers has on the model's performance.
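
One possible way to build the table for the report is to parse the `test_...` lines saved from Experiments 5 to 8. The sketch below is only a convenience helper, not a required part of the submission; the row names are illustrative labels, and the output lines are the ones shown in this notebook, so replace them with the lines printed by your own runs.

```python
import re

# Test output lines copied from Experiments 5 to 8 in this notebook.
# Replace them with the lines printed by your own runs.
outputs = {
    "Frozen embeddings": "test_loss: 0.2694 test_precision: 0.8759 test_recall: 0.8820 test_f1: 0.8683",
    "Top 2 layers trained": "test_loss: 0.3423 test_precision: 0.8264 test_recall: 0.8581 test_f1: 0.8312",
    "Top 4 layers trained": "test_loss: 0.3247 test_precision: 0.8634 test_recall: 0.8849 test_f1: 0.8636",
    "All layers trained": "test_loss: 0.3479 test_precision: 0.8605 test_recall: 0.9112 test_f1: 0.8777",
}

# Matches pairs such as "test_precision: 0.8759" and captures name and value.
pattern = re.compile(r"test_(\w+): ([0-9.]+)")

rows = ["| Model | Precision | Recall | F1 |", "|---|---|---|---|"]
for name, line in outputs.items():
    metrics = dict(pattern.findall(line))
    rows.append(f"| {name} | {metrics['precision']} | {metrics['recall']} | {metrics['f1']} |")

table = "\n".join(rows)
print(table)
```

The printed text is a markdown table that can be pasted into the report or converted to another format.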


**Colab design credit**: TA Dhruv Verma