# **CSE354 HW3**
**Due date: 11:59 pm EST on April 21, 2023 (Friday)**

---
For this assignment, we will use Google Colab, which allows us to utilize resources that some of us might not have in their local machines such as GPUs. You can use your Stony Brook (*.stonybrook.edu) account or your personal gmail account for coding and Google Drive to save your results.

## **Google Colab Tutorial**
---
Go to https://colab.research.google.com/notebooks/, you will see a tutorial named "Welcome to Colaboratory" file, where you can learn the basics of using google colab.

**This notebook would need you to train your model on Colab's GPU. However, the runtimes are limited. So ensure that your code works on the default CPU runtime before switching over to the GPU runtime.**

## **Problem statement**
---
In this homework, you will be using language models to predict the sentiment of a given movie review. The dataset, which is given to you, is sampled from the [IMDB dataset of 50k movie reviews](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). The sentences are sampled to a smaller set to help with quicker computation on Colab. The data contains a review and an associated label for the sentiment of that review. The label can either be *positive* or *negative*. You have been given three files - train_data.csv, val_data.csv and test_data.csv. The training data will be used to fine-tune the language model, the val data will be used to select the best model while training and finally the test data will give the model's final performance on the data.

To perform this task you will be using a pre-trained DistilBERT model. DistilBERT is a BERT based language model. Its size is 40% lesser than BERT, it has around 97% of BERT's language understanding capabilities and is 60% faster. You can read more about DistilBERT - https://arxiv.org/abs/1910.01108.

You will be using the model by taking advantage of the libraries provided by Hugging Face (https://huggingface.co/). In order to use this library, it will need to be installed using the command in the cell below. You will be training four different DistilBERT models for this assignment.

**Todos for the assignment:**
*   Fill in the # TODO(students) portions in this Colab file for this assignment.
*   Run the experiment code blocks and note down the colab outputs in a separate text file.
*   Use the aformentioned colab outputs for writing the report as per submission guideline that is described at the end of this colab file.



In [None]:
!pip install transformers

## **Imports**
---

All the allowed imports have been done for you in the code block below. You do need and will not be allowed to use any more imports other than the ones done below.

In [1]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import AdamW
import os
from sklearn.metrics import precision_score, recall_score, f1_score
torch.manual_seed(42)
np.random.seed(42)

  from .autonotebook import tqdm as notebook_tqdm
2023-04-16 01:18:23.569756: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-16 01:18:23.774908: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-16 01:18:23.826122: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-04-16 01:18:24.614425: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not l

## **Mounting your drive**
---

I would highly recommend mounting you Google Drive while running this notebook. This drive could contain the path to your dataset and it will also be used to save your fine-tuned models. In case you choose to simply save the models on your Colab workspace, the models will cease to exist after the runtime disconnects.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!ls

In [None]:
#Set the path of the folder where your colab file and data exist in Google Drive in the ------ porition

# TODO(students): start
%cd "drive/MyDrive/--------"
# TODO(students): end

## **Constants in the file**
---

The code block below contains a few constants.


1.   **BATCH_SIZE**: The batch size input to the models. This has been set to 16 and should not be changed. In case you encounter any CUDA - out of memory errors while training your models, this value may be reduced from 16. But please mention this in your submission report.
2.   **EPOCHS**: The number of epochs to train your model. This should not be changed.
3. **TEST_PATH**: This is the path to the test_data.csv file.
4. **TRAIN_PATH**: This is the path to the train_data.csv file.
5. **VAL_PATH**: This is the path to the val_data.csv file.
6. **SAVE_PATH**: This is the path to directory your model will be saved. Note: This path will be altered further down in the code by appending the name of the kind of DistilBERT model you train as per your experiments.



In [2]:
#DO NOT CHANGE THE CONSTANTS
BATCH_SIZE = 16
EPOCHS = 3
TEST_PATH = "data/test_data.csv"
TRAIN_PATH = "data/train_data.csv"
VAL_PATH = "data/val_data.csv"
SAVE_PATH = "models/DistilBERT"

In [3]:
def load_dataset(path):
  dataset = pd.read_csv(path)
  return dataset

In [4]:
train_data = load_dataset(TRAIN_PATH)
val_data = load_dataset(VAL_PATH)
test_data = load_dataset(TEST_PATH)

## **Problem 1 (Initialize the Model Class)**
---

Here, we will setup the pre-trained DistillBert model class in order to do our binary sentiment analysis task. In the code block below, you would need to load a pre-trained DistilBERT model and it's tokenizer using Hugging Face's library. The model you would need to load is called "distilbert-base-uncased". It would also need to have the model hyperparameter set to *num_classes* as the output shape of the model (in this case it is going to be 2, positive and negative). Please write your code between the given TODO block.



More about the model and how to load it can be read at - https://huggingface.co/distilbert-base-uncased.

In [5]:
class DistillBERT():

  def __init__(self, model_name='distilbert-base-uncased', num_classes=2):
    # TODO(students): start
    self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
    # TODO(students): end

  def get_tokenizer_and_model(self):
    return self.model, self.tokenizer
  
#x = DistillBERT()
# i=0
# for name, layer in x.model.named_parameters():
#   print(f"{i}:{name}")
#   i+=1

# for y in x.model.modules():
#   print(f"{i}:{y}")
#   i+=1


## **Problem 2 (Initialize the Dataloader Class)**
---
Here, we will setup the dataloader class which will read data, tokenize it using the DistillBert tokenizer, converts the tokenized sentence to tensors and the labels to tensors. The code block below takes your dataset(train,validation or test) and the tokenizer you loaded in the previous block and generates the DataLoader object for it. You would need to implement a part of the tokenize_data method. This method takes the given data and generates a list of token IDs for a given review along with it's label. You would need to use the tokenizer to generated the token ids (hint:refer to tokenizer.encode_plus for more details) values for each review. **Please ensure that the maximum length of an encoded review is 512 tokens. If any input data is longer than 512 words/tokens, truncate it to first 512.** 

You would also need to convert the labels to a corresponding numerical class using the label_dict dictionary. Please write your code between the given TODO block.

In [6]:
class DatasetLoader(Dataset):

  def __init__(self, data, tokenizer):
    self.data = data
    self.tokenizer = tokenizer

  def tokenize_data(self):
    print("Processing data..")
    tokens = []
    labels = []
    label_dict = {'positive': 1, 'negative': 0}

    review_list = self.data['review'].to_list()
    label_list = self.data['sentiment'].to_list()

    for (review, label) in tqdm(zip(review_list, label_list), total=len(review_list)):
      # TODO(students): start
      encoding = self.tokenizer.encode_plus(review, max_length=512, truncation=True, pad_to_max_length=True, add_special_tokens=True, return_tensors='pt')['input_ids'].to('cuda')
      tokens.append(encoding)
      labels.append(label_dict[label])
      # TODO(students): end
    
    tokens = pad_sequence(tokens, batch_first=True)
    labels = torch.tensor(labels)
    dataset = TensorDataset(tokens, labels)
    return dataset

  def get_data_loaders(self, batch_size=32, shuffle=True):
    processed_dataset = self.tokenize_data()

    data_loader = DataLoader(
        processed_dataset,
        shuffle=shuffle,
        batch_size=batch_size
    )

    return data_loader

## **Problem 3 (Training Function)**
---
In this problem, you will write the code that will be used to run your model class on the dataset class, both of which you have written in the previous problems.

The class below provides method to train a given model. It takes a dictionary with the following parameters:


1.   device: The device to run the model on.
2.   train_data: The train_data dataframe.
3.   val_data: The val_data dataframe.
4.   batch_size: The batch_size which is input to the model.
5.   epochs: The number of epochs to train the model.
6.   training_type: The type of training that your model will be undergoing. This can take four values - 'frozen_embeddings', 'top_2_training', 'top_4_training' and 'all_training'.

#### **Problem 3(a)**

Your first problem here would be to implement the set_training_parameters() method. In this method you will select the layers of your model to train based on the training_type. **Note: By default the Hugging Face DistilBERT has 6 layers.**

1. frozen_embeddings: This setting is supposed to train the DistilBERT model with embeddings that are 'frozen' i.e., not trainable. You would need to ensure that the embedding layers in your model are not trainable.
2. top_2_training: This setting is supposed to train just the final two layers of DistilBERT (layer 5 and layer 4). All other layers before these would need to be frozen.
3. top_4_training: This setting is supposed to train just the final four layers of DistilBERT (layer 5, layer 4, layer 3 and layer 2). All other layers before these would need to be frozen.
4. all_training: All layers of DistilBERT would need to trained.

Please write your code between the given TODO block.

**Note: The classifier head on top of the final DistilBERT layer would always need to be trained, please do not freeze that layer.**

**Note: You can use model.named_parameters() and iterate over all the named parameters of the model. To set the layers to not be trainable, apply layer.requires_grad = false**

#### **Problem 3(b)**

Your second problem would be to implement a single training step in the given loop inside the train() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would also need to propagate the loss backwards to the model and update the given optimizer's parameters.

Please write your code between the given TODO block.

#### **Problem 3(c)**

Your second problem would be to implement a single validation step in the given loop inside the eval() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

**Note: Consult the pytorch demos by the TAs during class for Problem 3(b) and 3(c).** (https://colab.research.google.com/drive/1Nf_5z4_g09KqOy0km4fyG4Kj2bRcEcCK?usp=sharing) 

In [7]:
class Trainer():

  def __init__(self, options):
    self.device = options['device']
    self.train_data = options['train_data']
    self.val_data = options['val_data']
    self.batch_size = options['batch_size']
    self.epochs = options['epochs']
    self.save_path = options['save_path']
    self.training_type = options['training_type']
    transformer = DistillBERT()
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def set_training_parameters(self):
    # TODO(students): start
    t = self.training_type
    if t == 'frozen_embeddings':
      # will not turn on require_grad = True for any layer
      for name, layer in self.model.named_parameters():
        if 'classifier' in name:
          continue
        layer.require_grad = False
    elif t == 'top_2_training':
      # require_grad = True for layers 4,5
      for name, layer in self.model.named_parameters():
        if 'classifier' in name:
          continue
        if 'layer.4' in name or 'layer.5' in name:
          layer.require_grad = True
        else:
          layer.require_grad = False  
    elif t == 'top_4_training':
      #require_grad = True for layers 2,3,4,5
      for name, layer in self.model.named_parameters():
        if 'classifier' in name:
          continue
        if 'layer.2' in name or 'layer.3' in name or 'layer.4' in name or 'layer.5' in name:
          layer.require_grad = True
        else:
          layer.require_grad = False 
    elif t == 'all_training':
      # require_grad = True for layers 0,1,2,3,4,5
      for name, layer in self.model.named_parameters():
        layer.require_grad = True
    else:
      raise KeyError(f"training_type={t} not found")
    # TODO(students): end

  def train(self, data_loader, optimizer):
    self.model.train()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()
      # TODO(students): start
      logits = self.model(reviews.squeeze(1)).logits
      loss = torch.nn.CrossEntropyLoss()(logits.to('cpu'), labels.to('cpu'))

      loss.backward()
      optimizer.step()
      total_loss += loss
      precision, recall, f1 = self.get_performance_metrics(preds=logits.detach().cpu().numpy(), labels=labels.detach().cpu().numpy())

      total_recall += recall
      total_precision += precision
      total_f1 += f1
      # TODO(students): end

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def eval(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        # self.set_training_parameters() not sure if necessary?
        logits = self.model(reviews.squeeze(1)).logits
        loss = torch.nn.CrossEntropyLoss()(logits.to('cpu'), labels.to('cpu'))
        
        total_loss += loss
        precision, recall, f1 = self.get_performance_metrics(preds=logits.detach().cpu().numpy(), labels=labels.detach().cpu().numpy())

        total_recall += recall
        total_precision += precision
        total_f1 += f1
        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def save_transformer(self):
    self.model.save_pretrained(self.save_path)
    self.tokenizer.save_pretrained(self.save_path)

  def execute(self):
    last_best = 0
    train_dataset = DatasetLoader(self.train_data, self.tokenizer)
    train_data_loader = train_dataset.get_data_loaders(self.batch_size)
    val_dataset = DatasetLoader(self.val_data, self.tokenizer)
    val_data_loader = val_dataset.get_data_loaders(self.batch_size)
    optimizer = torch.optim.AdamW(self.model.parameters(), lr = 3e-5, eps = 1e-8)
    self.set_training_parameters()
    for epoch_i in range(0, self.epochs):
      train_precision, train_recall, train_f1, train_loss = self.train(train_data_loader, optimizer)
      print(f'Epoch {epoch_i + 1}: train_loss: {train_loss:.4f} train_precision: {train_precision:.4f} train_recall: {train_recall:.4f} train_f1: {train_f1:.4f}')
      val_precision, val_recall, val_f1, val_loss = self.eval(val_data_loader)
      print(f'Epoch {epoch_i + 1}: val_loss: {val_loss:.4f} val_precision: {val_precision:.4f} val_recall: {val_recall:.4f} val_f1: {val_f1:.4f}')

      if val_f1 > last_best:
        print("Saving model..")
        self.save_transformer()
        last_best = val_f1
        print("Model saved.")

**Notes: Run the following blocks in order to train the model and save it in your Google Drive. There will be variations due to random initializations. Most of the experiment validation and test accuracy should be between 80%-95%. Each experiment should not take more than 30 minutes to run when runtime is set to GPU.**

#### **Experiment 1**
---
Training your DistilBERT with frozen embeddings.



In [8]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_frozen_embeddings'
options['epochs'] = EPOCHS
options['training_type'] = 'frozen_embeddings'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

Processing data..


100%|██████████| 5130/5130 [00:03<00:00, 1298.95it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 1279.91it/s]
100%|██████████| 321/321 [01:50<00:00,  2.91it/s]


Epoch 1: train_loss: 0.4494 train_precision: 0.7516 train_recall: 0.7502 train_f1: 0.7258


100%|██████████| 17/17 [00:01<00:00,  8.92it/s]


Epoch 1: val_loss: 0.2209 val_precision: 0.8983 val_recall: 0.9293 val_f1: 0.9032
Saving model..
Model saved.


100%|██████████| 321/321 [01:53<00:00,  2.82it/s]


Epoch 2: train_loss: 0.1945 train_precision: 0.9321 train_recall: 0.9281 train_f1: 0.9249


100%|██████████| 17/17 [00:01<00:00,  8.60it/s]


Epoch 2: val_loss: 0.2547 val_precision: 0.9250 val_recall: 0.8543 val_f1: 0.8836


100%|██████████| 321/321 [02:57<00:00,  1.81it/s]


Epoch 3: train_loss: 0.0933 train_precision: 0.9701 train_recall: 0.9689 train_f1: 0.9667


100%|██████████| 17/17 [00:05<00:00,  3.35it/s]

Epoch 3: val_loss: 0.2651 val_precision: 0.8776 val_recall: 0.9302 val_f1: 0.8971





#### **Experiment 2**
---
Training your DistilBERT with only top 2 layers being trained. 



In [9]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_2_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_2_training'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

Processing data..


100%|██████████| 5130/5130 [00:05<00:00, 894.07it/s] 


Processing data..


100%|██████████| 270/270 [00:00<00:00, 710.27it/s]
100%|██████████| 321/321 [04:39<00:00,  1.15it/s]


Epoch 1: train_loss: 0.4269 train_precision: 0.7788 train_recall: 0.8032 train_f1: 0.7633


100%|██████████| 17/17 [00:05<00:00,  3.36it/s]


Epoch 1: val_loss: 0.2894 val_precision: 0.8243 val_recall: 0.9727 val_f1: 0.8860
Saving model..
Model saved.


100%|██████████| 321/321 [03:34<00:00,  1.49it/s]


Epoch 2: train_loss: 0.2051 train_precision: 0.9217 train_recall: 0.9278 train_f1: 0.9184


100%|██████████| 17/17 [00:02<00:00,  6.39it/s]


Epoch 2: val_loss: 0.2053 val_precision: 0.9058 val_recall: 0.9599 val_f1: 0.9274
Saving model..
Model saved.


100%|██████████| 321/321 [02:26<00:00,  2.19it/s]


Epoch 3: train_loss: 0.0921 train_precision: 0.9664 train_recall: 0.9732 train_f1: 0.9674


100%|██████████| 17/17 [00:02<00:00,  6.45it/s]

Epoch 3: val_loss: 0.2592 val_precision: 0.9119 val_recall: 0.9288 val_f1: 0.9183





#### **Experiment 3**
---
Training your DistilBERT with only top 4 layers being trained. 



In [10]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_top_4_training'
options['epochs'] = EPOCHS
options['training_type'] = 'top_4_training'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

Processing data..


100%|██████████| 5130/5130 [00:04<00:00, 1108.09it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 1147.21it/s]
100%|██████████| 321/321 [02:31<00:00,  2.12it/s]


Epoch 1: train_loss: 0.3749 train_precision: 0.8243 train_recall: 0.8186 train_f1: 0.7989


100%|██████████| 17/17 [00:02<00:00,  6.36it/s]


Epoch 1: val_loss: 0.1812 val_precision: 0.9317 val_recall: 0.9131 val_f1: 0.9159
Saving model..
Model saved.


100%|██████████| 321/321 [02:35<00:00,  2.07it/s]


Epoch 2: train_loss: 0.1739 train_precision: 0.9325 train_recall: 0.9389 train_f1: 0.9296


100%|██████████| 17/17 [00:02<00:00,  6.98it/s]


Epoch 2: val_loss: 0.2185 val_precision: 0.9126 val_recall: 0.9254 val_f1: 0.9164
Saving model..
Model saved.


100%|██████████| 321/321 [02:34<00:00,  2.08it/s]


Epoch 3: train_loss: 0.0790 train_precision: 0.9723 train_recall: 0.9743 train_f1: 0.9712


100%|██████████| 17/17 [00:02<00:00,  6.40it/s]

Epoch 3: val_loss: 0.2665 val_precision: 0.9160 val_recall: 0.8934 val_f1: 0.8997





#### **Experiment 4**
---
Training your DistilBERT with all layers being trained. 



In [11]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_all_training'
options['epochs'] = EPOCHS
options['training_type'] = 'all_training'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

Processing data..


100%|██████████| 5130/5130 [00:04<00:00, 1220.46it/s]


Processing data..


100%|██████████| 270/270 [00:00<00:00, 1191.19it/s]
100%|██████████| 321/321 [02:34<00:00,  2.08it/s]


Epoch 1: train_loss: 0.3470 train_precision: 0.8551 train_recall: 0.8579 train_f1: 0.8381


100%|██████████| 17/17 [00:02<00:00,  6.39it/s]


Epoch 1: val_loss: 0.2284 val_precision: 0.8750 val_recall: 0.9712 val_f1: 0.9165
Saving model..
Model saved.


100%|██████████| 321/321 [02:33<00:00,  2.09it/s]


Epoch 2: train_loss: 0.1785 train_precision: 0.9364 train_recall: 0.9291 train_f1: 0.9270


100%|██████████| 17/17 [00:02<00:00,  6.34it/s]


Epoch 2: val_loss: 0.2252 val_precision: 0.8910 val_recall: 0.9383 val_f1: 0.9073


100%|██████████| 321/321 [02:35<00:00,  2.07it/s]


Epoch 3: train_loss: 0.0721 train_precision: 0.9793 train_recall: 0.9741 train_f1: 0.9750


100%|██████████| 17/17 [00:02<00:00,  6.39it/s]


Epoch 3: val_loss: 0.2594 val_precision: 0.9033 val_recall: 0.9655 val_f1: 0.9270
Saving model..
Model saved.


## **Problem 4 (Test Function)**
---
Here, you will write the code for the testing of the models that you trained in the previous code blocks. 

The class below provides method to test a given model. It takes a dictionary with the following parameters:

1.   device: The device to run the model on.
2.   test_data: The test_data dataframe.
3.   batch_size: The batch_size which is input to the model.
4.   save_path: The directory of your saved model.

You would need to implement a single test step in the given loop inside the test() method. You would need to pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. You would need to ensure that the loss is not propagated backwards.

Please write your code between the given TODO block.

Hint: This problem is very similar to 3(c).

In [12]:
class Tester():

  def __init__(self, options):
    self.save_path = options['save_path']
    self.device = options['device']
    self.test_data = options['test_data']
    self.batch_size = options['batch_size']
    transformer = DistillBERT(self.save_path)
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0)
    recall = recall_score(labels_flat, pred_flat, zero_division=0)
    f1 = f1_score(labels_flat, pred_flat, zero_division=0)
    return precision, recall, f1

  def test(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):
        # TODO(students): start
        logits = self.model(reviews.squeeze(1)).logits
        loss = torch.nn.CrossEntropyLoss()(logits.to('cpu'), labels.to('cpu'))
        
        total_loss += loss
        precision, recall, f1 = self.get_performance_metrics(preds=logits.detach().cpu().numpy(), labels=labels.detach().cpu().numpy())

        total_recall += recall
        total_precision += precision
        total_f1 += f1

        # TODO(students): end
    
    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def execute(self):
    test_dataset = DatasetLoader(self.test_data, self.tokenizer)
    test_data_loader = test_dataset.get_data_loaders(self.batch_size)

    test_precision, test_recall, test_f1, test_loss = self.test(test_data_loader)

    print()
    print(f'test_loss: {test_loss:.4f} test_precision: {test_precision:.4f} test_recall: {test_recall:.4f} test_f1: {test_f1:.4f}')

**Notes: Run these blocks only after Experiment 1 to 4 are completed and the models are saved in the "models" folder. Copy the output blocks into another text file for report writing.**

#### **Experiment 5**
---
Testing your DistilBERT trained with frozen embeddings.



In [13]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_frozen_embeddings'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 1132.25it/s]
100%|██████████| 38/38 [00:04<00:00,  9.12it/s]


test_loss: 0.2823 test_precision: 0.8429 test_recall: 0.9198 test_f1: 0.8713





#### **Experiment 6**
---
Testing your DistilBERT trained with all layers frozen except the final two layers.



In [14]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_2_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 885.90it/s]
100%|██████████| 38/38 [00:04<00:00,  8.75it/s]


test_loss: 0.2785 test_precision: 0.8210 test_recall: 0.9404 test_f1: 0.8673





#### **Experiment 7**
---
Testing your DistilBERT trained with all layers frozen except the final four layers.



In [15]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_top_4_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 986.47it/s] 
100%|██████████| 38/38 [00:04<00:00,  8.55it/s]


test_loss: 0.2995 test_precision: 0.9178 test_recall: 0.8802 test_f1: 0.8888





#### **Experiment 8**
---
Testing your DistilBERT trained with all layers trainable.



In [18]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_all_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 600/600 [00:00<00:00, 1006.56it/s]
100%|██████████| 38/38 [00:04<00:00,  8.53it/s]


test_loss: 0.3401 test_precision: 0.8593 test_recall: 0.9093 test_f1: 0.8741





## **Results**
---

Answer the following questions based on the analyses you have performed above:

### 1. Briefly explain your code implementations for each TO-DO task.

TODO [STUDENT]

Problem 1.
I used the imported classes AutoModelForSequenceClassification and AutoTokenizer from the transformers library in order to load the pretrained DistilledBert model, making sure to pass num_labels=num_classes in order to set the output dimensions. 

Problem 2.

I used the tokenizer.encode_plus method to encode the input review text into a padded and/or truncated pytorch cuda tensor with shape (1,512). The loop does this for each review in the dataset, making sure to mark the corresponding correct labels in another pytorch long tensor.

encoding = self.tokenizer.encode_plus(review, max_length=512, truncation=True, pad_to_max_length=True, add_special_tokens=True, return_tensors='pt')['input_ids'].to('cuda')

Problem 3a.

When iterating over self.model.named_parameters(), the first element of the tuple is the name which gives information about the corresponding layer (e.g. You would see 'layer.0' in the name for layer 0 as a substring). I used a bunch of if-elif statements to hard code which layers to turn .require_grad on.

Problem 3b 3c 4.

All of these problems are very similar, except 3c and 4 do not require the model to compute the gradients and backpropogate (via loss.backward and optimizer.step). 

In order to input the reviews to the model in order to output the logits, I had to squeeze the reviews tensor along dimension 1 because it was a (batch_size, 1, 512) shape tensor and needed to be of shape (batch_size, 512). 

I used the provided self.get_performance_metrics in order to calculate the current batch's precision, recall, and f1 score (making sure to pass the logits and labels as numpy arrays). Finally, I added the current scores to the total scores.

logits = self.model(reviews.squeeze(1)).logits
loss = torch.nn.CrossEntropyLoss()(logits.to('cpu'), labels.to('cpu'))

loss.backward()
optimizer.step()
total_loss += loss
precision, recall, f1 = self.get_performance_metrics(preds=logits.detach().cpu().numpy(), labels=labels.detach().cpu().numpy())

total_recall += recall
total_precision += precision
total_f1 += f1


### 2. A table containing the precision, recall and F1 scores of each DistilBERT model during validation and testing.

|   Experiment   | Precision | Recall | F1 |
| -------------- | -------- | -------- | -------- | 
| 1 (Validation) | 0.8776 | 0.9302 | 0.8971 |
| 2 (Validation) | 0.9119 | 0.9288 | 0.9183 |
| 3 (Validation) | 0.9160 | 0.8934 | 0.8997 |
| 4 (Validation) | 0.9033 | 0.9655 | 0.9270 |
| 5 (Test)       | 0.8429 | 0.9198 | 0.8713 |
| 6 (Test)       | 0.8210 | 0.9404 | 0.8673 |
| 7 (Test)       | 0.9178 | 0.8802 | 0.8888 |
| 8 (Test)       | 0.8593 | 0.9093 | 0.8741 |

TODO [STUDENT]

### 3. An analysis explaining your understanding of the impact freezing/training different layers has on the model's performance.


The first observation I had was that freezing all embeddings (meaning keeping the pretrained model as is) still had good performance on this sentiment task. In terms of the metrics, freezing all embeddings was about 3-5 percentage points lower than the highest score in the corresponding dataset (validation or test). This is expected -- the Professor mentioned in lecture that these pretrained language models become pretty good at these tasks even without finetuning because the MLM task helps the model "understand" the language.

However, finetuning to some degree seems to provide benefit in the metric scores. From what I found, Experiments 3 and 7 (where only the top 4 layers were trainable) provided the most improvement. In this case, these experiments resulted in the highest precision score for both the validation and test set. 

Overall, the data seems to imply that training ALL the layers does not seem worthwhile. The benefit provided compared to only training some of the layers is marginal whereas the amount of resources/time might be costly in comparison.



## **Submission guidelines**
---
You would need to submit the following files:


1.   `NLP_HW3.ipynb` - This jupyter notebook. It will also work as your report, so please add description to code wherever required. Also make sure to write your analyses outcomes in the RESULTS section above.
2.   `gdrive_link.txt` - Should contain a wgetable to a folder that contains your four DistilBERT models. Please make sure you provide the necessary permissions.

**Colab design credit**: TA Dhruv Verma