# **CSE 354 Project**
---
## **Problem statement**

In this project, we use language models to predict the sentiment of a given news article. The dataset is sampled from the [PerSent](https://stonybrooknlp.github.io/PerSenT/) corpus. The data contains around 5K documents and 38K paragraphs annotated with the author's sentiment towards the main entity in the news article. The label is one of *positive*, *negative*, or *neutral*. We have been given four files: train_data.csv, val_data.csv, random_test.csv, and fixed_test.csv. The training data is used to fine-tune the language model, the validation data is used to evaluate the model during training, and the final evaluation is run on the randomly organized test instances in random_test.csv.

To perform this task we use a pre-trained DistilBERT model. DistilBERT is a BERT-based language model that is 40% smaller than BERT, retains around 97% of BERT's language understanding capabilities, and is 60% faster. You can read more about DistilBERT at https://arxiv.org/abs/1910.01108.

We will be using the model by taking advantage of the libraries provided by Hugging Face (https://huggingface.co/). In order to use this library, it will need to be installed using the command in the cell below. We will be training four different DistilBERT models for this project.

Code based on CSE 354 HW3

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


## **Imports**

In [None]:
import torch
import pandas as pd
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from tqdm import tqdm
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.optim import AdamW  # transformers.AdamW is deprecated; training below uses torch.optim.AdamW
import os
from sklearn.metrics import precision_score, recall_score, f1_score
torch.manual_seed(42)
np.random.seed(42)

## **Mounting Drive**

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
!ls

drive  sample_data


In [None]:
%cd "drive/MyDrive/CSE-354-Project"

/content/drive/MyDrive/CSE-354-Project


## **Constants**

The code block below contains a few constants.


1.   **BATCH_SIZE**: The batch size input to the models. This has been set to 16. If we encounter CUDA out-of-memory errors while training our models, this value can be reduced.
2.   **EPOCHS**: The number of epochs to train our model. This has been set to 3.
3.   **TEST_PATH**: The path to the test data file (random_test.csv in the cell below).
4.   **TRAIN_PATH**: The path to the training data file (train.csv in the cell below).
5.   **VAL_PATH**: The path to the validation data file (dev.csv in the cell below).
6.   **SAVE_PATH**: The path to the directory where our model will be saved. Note: This path is altered further down in the code by appending the name of the kind of DistilBERT model we train in each experiment.



In [None]:
BATCH_SIZE = 16
EPOCHS = 3
TEST_PATH = "data/random_test.csv"
TRAIN_PATH = "data/train.csv"
VAL_PATH = "data/dev.csv"
SAVE_PATH = "models/DistilBERT"

In [None]:
def load_dataset(path):
  dataset = pd.read_csv(path)
  return dataset

In [None]:
train_data = load_dataset(TRAIN_PATH)
val_data = load_dataset(VAL_PATH)
test_data = load_dataset(TEST_PATH)

## **Initialize the Model Class**

Here, we set up the pre-trained DistilBERT model class for our three-class sentiment analysis task. In the code block below, we load a pre-trained DistilBERT model and its tokenizer using Hugging Face's library. The model we load is called "distilbert-base-uncased". The *num_classes* argument sets the size of the model's output layer (in this case it is 3: positive, negative, and neutral).



More about the model and how to load it can be read at - https://huggingface.co/distilbert-base-uncased.

In [None]:
class DistillBERT():

  def __init__(self, model_name='distilbert-base-uncased', num_classes=3):
    self.model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels = num_classes)
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)

  def get_tokenizer_and_model(self):
    return self.model, self.tokenizer

## **Initialize the Dataloader Class**

Here, we set up the dataloader class, which reads the data, tokenizes it using the DistilBERT tokenizer, and converts the tokenized articles and their labels to tensors. The code block below takes our dataset (train, validation, or test) and the tokenizer we loaded in the previous block and generates the DataLoader object for it. We implement a tokenize_data method that generates a list of token IDs for each article along with its label. The token IDs are generated with tokenizer.encode_plus. We ensure that the maximum length of an encoded article is 512 tokens: any input longer than 512 tokens is truncated to its first 512. We also convert each label to its numerical class using the label_dict dictionary.
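One detail of the padding step is worth illustrating. `pad_sequence` pads with 0 by default, and the `DatasetLoader` class in this section passes only the token IDs to the model, so padded positions are not masked out. The sketch below uses made-up token IDs and assumes a padding ID of 0 to show how an attention mask could be derived from the padded tensor. It is an illustration under those assumptions, not part of the original pipeline.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumption: the tokenizer's padding token ID is 0

# Made-up token ID sequences of different lengths (illustrative only).
sequences = [
    torch.tensor([101, 7592, 2088, 102]),
    torch.tensor([101, 2204, 102]),
    torch.tensor([101, 102]),
]

# Pad every sequence to the length of the longest one in the list.
tokens = pad_sequence(sequences, batch_first=True, padding_value=PAD_ID)

# 1 marks real tokens, 0 marks padding positions.
attention_mask = (tokens != PAD_ID).long()

print(tokens.shape)
print(attention_mask)
```

A mask like this could be stored in the `TensorDataset` next to the token IDs and passed to the model through its `attention_mask` argument.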

In [None]:
class DatasetLoader(Dataset):

  def __init__(self, data, tokenizer):
    self.data = data
    self.tokenizer = tokenizer

  def tokenize_data(self):
    print("Processing data..")
    tokens = []
    labels = []
    label_dict = {'Positive': 2, 'Negative': 1, 'Neutral': 0}

    review_list = self.data['DOCUMENT'].to_list()
    label_list = self.data['TRUE_SENTIMENT'].to_list()

    for (review, label) in tqdm(zip(review_list, label_list), total=len(review_list)):
      encoding = self.tokenizer.encode_plus(review, truncation=True, max_length=512)
      input_ids = encoding['input_ids']
      tokens.append(torch.tensor(input_ids))
      labels.append((label_dict[label]))

    tokens = pad_sequence(tokens, batch_first=True)
    labels = torch.tensor(labels)
    dataset = TensorDataset(tokens, labels)
    return dataset

  def get_data_loaders(self, batch_size=32, shuffle=True):
    processed_dataset = self.tokenize_data()

    data_loader = DataLoader(
        processed_dataset,
        shuffle=shuffle,
        batch_size=batch_size
    )

    return data_loader

## **Training Function**

Here, we write the code that runs our model class on the dataset class, both of which were defined in the previous sections.

The class below provides methods to train a given model. It takes a dictionary with the following keys:


1.   device: The device to run the model on.
2.   train_data: The train_data dataframe.
3.   val_data: The val_data dataframe.
4.   batch_size: The batch_size which is input to the model.
5.   epochs: The number of epochs to train the model.
6.   save_path: The directory in which the best model is saved.
7.   training_type: The type of training that our model will be undergoing. This can take four values - 'frozen_embeddings', 'top_2_training', 'top_4_training' and 'all_training'.
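Because the trainer selects layers by comparing `training_type` against fixed strings, a misspelled value would silently leave every layer trainable. A minimal sketch of an early check is shown below. The helper name `validate_options` and the two constants are hypothetical and not part of the original code; the key names are the ones read by the `Trainer` constructor, including `save_path`.

```python
# Hypothetical constants: the keys the Trainer constructor reads and the
# training types described in this section.
REQUIRED_KEYS = {
    "device", "train_data", "val_data", "batch_size",
    "epochs", "save_path", "training_type",
}
VALID_TRAINING_TYPES = {
    "frozen_embeddings", "top_2_training", "top_4_training", "all_training",
}


def validate_options(options):
    """Hypothetical helper: fail early if the options dictionary is unusable."""
    missing = REQUIRED_KEYS - set(options)
    if missing:
        raise KeyError(f"Missing options: {sorted(missing)}")
    if options["training_type"] not in VALID_TRAINING_TYPES:
        raise ValueError(f"Unknown training_type: {options['training_type']!r}")
    return options


# Placeholder values stand in for the real device and dataframes.
example = {
    "device": "cpu", "train_data": None, "val_data": None,
    "batch_size": 16, "epochs": 3,
    "save_path": "models/DistilBERT_all_training",
    "training_type": "all_training",
}
checked = validate_options(example)
print(checked["training_type"])
```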

#### **Set Training Parameters**

Here we implement the set_training_parameters() method. In this method we select the layers of our model to train based on the training_type. **Note: By default the Hugging Face DistilBERT has 6 layers.**

1. frozen_embeddings: This setting trains the DistilBERT model with embeddings that are 'frozen' i.e., not trainable. We ensure that the embedding layers in our model are not trainable.
2. top_2_training: This setting trains just the final two layers of DistilBERT (layer 5 and layer 4). All other layers before these are frozen.
3. top_4_training: This setting trains just the final four layers of DistilBERT (layer 5, layer 4, layer 3 and layer 2). All other layers before these are frozen.
4. all_training: All layers of DistilBERT are trained.

**Note: The classifier head on top of the final DistilBERT layer is always trained, so we do not freeze that layer.**

**Note: We use model.named_parameters() to iterate over all the named parameters of the model. To make a parameter non-trainable, we set param.requires_grad = False.**
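The selection rules above can also be written as a lookup table from training type to the name fragments that should be frozen. The sketch below is one possible formulation, not the notebook's implementation. The table `FROZEN_SUBSTRINGS`, the helper `should_freeze`, and the example parameter names (written in the style that `model.named_parameters()` yields) are illustrative assumptions.

```python
# Hypothetical table: name fragments to freeze for each training type.
# The trailing dot in "layer.0." avoids matching names such as "layer.10."
# in models with more layers.
FROZEN_SUBSTRINGS = {
    "frozen_embeddings": ["embeddings"],
    "top_2_training": ["embeddings", "layer.0.", "layer.1.", "layer.2.", "layer.3."],
    "top_4_training": ["embeddings", "layer.0.", "layer.1."],
    "all_training": [],
}


def should_freeze(param_name, training_type):
    """Return True if the named parameter should not be trained."""
    return any(s in param_name for s in FROZEN_SUBSTRINGS[training_type])


# Illustrative parameter names; the real ones come from model.named_parameters().
example_names = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "distilbert.transformer.layer.3.ffn.lin1.weight",
    "distilbert.transformer.layer.5.ffn.lin2.bias",
    "pre_classifier.weight",
    "classifier.weight",
]

for name in example_names:
    print(name, should_freeze(name, "top_2_training"))
```

With a real model, the same helper could be applied by setting `param.requires_grad = not should_freeze(name, training_type)` for each `(name, param)` pair. The classifier head never matches any fragment, so it stays trainable in every setting.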


#### **Single Training Step**

Here we implement a single training step in the given loop inside the train() method. We pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. We also propagate the loss backwards to the model and update the given optimizer's parameters.
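The order of operations in one training step can be sketched on a toy problem. The example below uses a small linear layer and random data as stand-ins for DistilBERT and the tokenized articles, so every name in it is illustrative; only the zero_grad, forward, backward, step sequence mirrors the description above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-ins: a tiny classifier with 3 output classes and a random batch.
model = nn.Linear(4, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))

model.train()
before = model.weight.detach().clone()

optimizer.zero_grad()           # clear gradients left over from the previous step
logits = model(inputs)          # forward pass
loss = loss_fn(logits, labels)  # loss for this batch
loss.backward()                 # propagate the loss backwards
optimizer.step()                # update the trainable parameters

# .item() yields a plain Python float, which is safe to accumulate across batches.
running_loss = loss.item()
weights_changed = not torch.equal(before, model.weight.detach())
print(f"loss={running_loss:.4f} weights_changed={weights_changed}")
```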


#### **Single Validation Step**

Here we implement a single validation step in the given loop inside the eval() method. We pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. We ensure that the loss is not propagated backwards.
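As a small illustration of what "not propagated backwards" means in practice, the sketch below runs a forward pass inside `torch.no_grad()`. The linear layer and random inputs are stand-ins for the real model and data.

```python
import torch
from torch import nn

# Stand-ins for the fine-tuned model and a batch of inputs.
model = nn.Linear(4, 3)
inputs = torch.randn(2, 4)

model.eval()  # switch layers such as dropout to inference behaviour
with torch.no_grad():
    # No autograd graph is recorded inside this block.
    logits = model(inputs)

# Predicted class index for each example in the batch.
predictions = logits.argmax(dim=1)
print(logits.requires_grad)
```

Since no graph is recorded, calling `backward()` on a loss computed in this block would raise an error, which is the behaviour wanted during validation.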

In [None]:
class Trainer():

  def __init__(self, options):
    self.device = options['device']
    self.train_data = options['train_data']
    self.val_data = options['val_data']
    self.batch_size = options['batch_size']
    self.epochs = options['epochs']
    self.save_path = options['save_path']
    self.training_type = options['training_type']
    transformer = DistillBERT()
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0, average='micro')
    recall = recall_score(labels_flat, pred_flat, zero_division=0, average='micro')
    f1 = f1_score(labels_flat, pred_flat, zero_division=0, average='micro')
    return precision, recall, f1

  def set_training_parameters(self):
    if self.training_type == 'frozen_embeddings':
        for name, param in self.model.named_parameters():
            if 'embeddings' in name:
                param.requires_grad = False
    elif self.training_type == 'top_2_training':
        for name, param in self.model.named_parameters():
            if 'embeddings' in name:
                param.requires_grad = False
            if 'layer.0' in name:
                param.requires_grad = False
            if 'layer.1' in name:
                param.requires_grad = False
            if 'layer.2' in name:
                param.requires_grad = False
            if 'layer.3' in name:
                param.requires_grad = False
    elif self.training_type == 'top_4_training':
        for name, param in self.model.named_parameters():
            if 'embeddings' in name:
                param.requires_grad = False
            if 'layer.0' in name:
                param.requires_grad = False
            if 'layer.1' in name:
                param.requires_grad = False

  def train(self, data_loader, optimizer):
    self.model.train()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    for batch_idx, (reviews, labels) in enumerate(tqdm(data_loader)):
      self.model.zero_grad()

      reviews = reviews.to(self.device)
      labels = labels.to(self.device)
      output = self.model(reviews, labels=labels)
      loss = output.loss

      precision, recall, f1 = self.get_performance_metrics(output.logits.detach().cpu(), labels.cpu())
      total_loss += loss.item()  # accumulate a Python float so the autograd graph is not kept alive
      total_precision += precision
      total_recall += recall
      total_f1 += f1

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def eval(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):

        reviews = reviews.to(self.device)
        labels = labels.to(self.device)
        output = self.model(reviews, labels=labels)
        loss = output.loss

        precision, recall, f1 = self.get_performance_metrics(output.logits.detach().cpu(), labels.cpu())
        total_loss += loss.item()  # accumulate a Python float rather than a tensor
        total_precision += precision
        total_recall += recall
        total_f1 += f1

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def save_transformer(self):
    self.model.save_pretrained(self.save_path)
    self.tokenizer.save_pretrained(self.save_path)

  def execute(self):
    last_best = 0
    train_dataset = DatasetLoader(self.train_data, self.tokenizer)
    train_data_loader = train_dataset.get_data_loaders(self.batch_size)
    val_dataset = DatasetLoader(self.val_data, self.tokenizer)
    val_data_loader = val_dataset.get_data_loaders(self.batch_size)
    optimizer = torch.optim.AdamW(self.model.parameters(), lr = 3e-5, eps = 1e-8)
    self.set_training_parameters()
    for epoch_i in range(0, self.epochs):
      train_precision, train_recall, train_f1, train_loss = self.train(train_data_loader, optimizer)
      print(f'Epoch {epoch_i + 1}: train_loss: {train_loss:.4f} train_precision: {train_precision:.4f} train_recall: {train_recall:.4f} train_f1: {train_f1:.4f}')
      val_precision, val_recall, val_f1, val_loss = self.eval(val_data_loader)
      print(f'Epoch {epoch_i + 1}: val_loss: {val_loss:.4f} val_precision: {val_precision:.4f} val_recall: {val_recall:.4f} val_f1: {val_f1:.4f}')

      if val_f1 > last_best:
        print("Saving model..")
        self.save_transformer()
        last_best = val_f1
        print("Model saved.")

#### **Training Experiment**

Training our DistilBERT with all layers being trained.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['train_data'] = train_data
options['val_data'] = val_data
options['save_path'] = SAVE_PATH + '_all_training'
options['epochs'] = EPOCHS
options['training_type'] = 'all_training'
trainer = Trainer(options)
trainer.execute()

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

Processing data..


100%|██████████| 3355/3355 [00:05<00:00, 663.73it/s]


Processing data..


100%|██████████| 578/578 [00:00<00:00, 756.26it/s]
100%|██████████| 210/210 [02:45<00:00,  1.27it/s]


Epoch 1: train_loss: 0.9306 train_precision: 0.5261 train_recall: 0.5261 train_f1: 0.5261


100%|██████████| 37/37 [00:09<00:00,  3.75it/s]


Epoch 1: val_loss: 0.8349 val_precision: 0.5693 val_recall: 0.5693 val_f1: 0.5693
Saving model..
Model saved.


100%|██████████| 210/210 [02:45<00:00,  1.27it/s]


Epoch 2: train_loss: 0.8518 train_precision: 0.5804 train_recall: 0.5804 train_f1: 0.5804


100%|██████████| 37/37 [00:09<00:00,  3.75it/s]


Epoch 2: val_loss: 0.8321 val_precision: 0.5946 val_recall: 0.5946 val_f1: 0.5946
Saving model..
Model saved.


100%|██████████| 210/210 [02:45<00:00,  1.27it/s]


Epoch 3: train_loss: 0.7223 train_precision: 0.6484 train_recall: 0.6484 train_f1: 0.6484


100%|██████████| 37/37 [00:09<00:00,  3.75it/s]

Epoch 3: val_loss: 0.8789 val_precision: 0.5473 val_recall: 0.5473 val_f1: 0.5473





## **Test Function**

Here, we write the code for the testing of the models that we trained in the previous code blocks.

The class below provides a method to test a given model. It takes a dictionary with the following keys:

1.   device: The device to run the model on.
2.   test_data: The test_data dataframe.
3.   batch_size: The batch_size which is input to the model.
4.   save_path: The directory of our saved model.

We implement a single test step in the given loop inside the test() method. We pass the review and label in the given batch to the model, take the output and compute the Precision, Recall and F1 for that batch using the get_performance_metrics() method. We ensure that the loss is not propagated backwards.


In [None]:
class Tester():

  def __init__(self, options):
    self.save_path = options['save_path']
    self.device = options['device']
    self.test_data = options['test_data']
    self.batch_size = options['batch_size']
    transformer = DistillBERT(self.save_path)
    self.model, self.tokenizer = transformer.get_tokenizer_and_model()
    self.model.to(self.device)

  def get_performance_metrics(self, preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    precision = precision_score(labels_flat, pred_flat, zero_division=0, average='micro')
    recall = recall_score(labels_flat, pred_flat, zero_division=0, average='micro')
    f1 = f1_score(labels_flat, pred_flat, zero_division=0, average='micro')
    return precision, recall, f1

  def test(self, data_loader):
    self.model.eval()
    total_recall = 0
    total_precision = 0
    total_f1 = 0
    total_loss = 0

    with torch.no_grad():
      for (reviews, labels) in tqdm(data_loader):

        reviews = reviews.to(self.device)
        labels = labels.to(self.device)
        output = self.model(reviews, labels=labels)
        loss = output.loss

        precision, recall, f1 = self.get_performance_metrics(output.logits.detach().cpu(), labels.cpu())
        total_loss += loss.item()  # accumulate a Python float rather than a tensor
        total_precision += precision
        total_recall += recall
        total_f1 += f1

    precision = total_precision/len(data_loader)
    recall = total_recall/len(data_loader)
    f1 = total_f1/len(data_loader)
    loss = total_loss/len(data_loader)

    return precision, recall, f1, loss

  def execute(self):
    test_dataset = DatasetLoader(self.test_data, self.tokenizer)
    test_data_loader = test_dataset.get_data_loaders(self.batch_size)

    test_precision, test_recall, test_f1, test_loss = self.test(test_data_loader)

    print()
    print(f'test_loss: {test_loss:.4f} test_precision: {test_precision:.4f} test_recall: {test_recall:.4f} test_f1: {test_f1:.4f}')

**Note: Run the blocks below only after the Training Experiment has completed and the best model has been saved in the "models" folder.**

#### **Testing Experiment**

Testing our DistilBERT trained with all layers trainable.



In [None]:
options = {}
options['batch_size'] = BATCH_SIZE
options['device'] = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
options['test_data'] = test_data
options['save_path'] = SAVE_PATH + '_all_training'
tester = Tester(options)
tester.execute()

Processing data..


100%|██████████| 579/579 [00:01<00:00, 563.57it/s]
100%|██████████| 37/37 [00:10<00:00,  3.59it/s]


test_loss: 0.8685 test_precision: 0.5895 test_recall: 0.5895 test_f1: 0.5895





## **Results**

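Analysis of our model's performance. Precision, recall, and F1 are micro-averaged per batch, so for this single-label task the three values are identical and equal to the batch-averaged accuracy.

Training Experiment (epoch 2, the checkpoint with the best validation F1 and therefore the one that was saved and tested):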

train_loss: 0.8518 train_precision: 0.5804 train_recall: 0.5804 train_f1: 0.5804

val_loss: 0.8321 val_precision: 0.5946 val_recall: 0.5946 val_f1: 0.5946


Testing Experiment:

test_loss: 0.8685 test_precision: 0.5895 test_recall: 0.5895 test_f1: 0.5895
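The identical precision, recall, and F1 values reported here follow from micro-averaging: when every example has exactly one true label and one predicted label, each wrong prediction counts as one false positive and one false negative, so the pooled precision and recall both reduce to accuracy. The sketch below demonstrates this on made-up labels. The function `micro_scores` is a hand-written illustration of the pooling, not the scikit-learn implementation used in the notebook.

```python
def micro_scores(labels, preds, classes):
    """Pool true positives, false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for c in classes:
        for y, p in zip(labels, preds):
            if p == c and y == c:
                tp += 1
            elif p == c and y != c:
                fp += 1
            elif p != c and y == c:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels and predictions (0 = neutral, 1 = negative, 2 = positive).
labels = [0, 1, 2, 2, 1, 0, 2, 1]
preds = [0, 2, 2, 1, 1, 0, 0, 1]

accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
precision, recall, f1 = micro_scores(labels, preds, classes=[0, 1, 2])
print(accuracy, precision, recall, f1)
```

To see differences between the three classes, macro-averaged or per-class scores would be more informative than micro-averaged ones.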



