## Emotion Classification using Fine-tuned BERT model

This notebook is adapted from one provided by the authors of Saravia et al. (2018), found [here](https://colab.research.google.com/drive/1nwCE6b9PXIKhv2hvbqf1oZKIGkXMTi1X#scrollTo=t23zHggkEpc-). Notably, the `TokenizersCollateFn` class, the architecture of the classification head, and the methods of the LightningModule wrapper class are taken from there and adapted to my purposes. The DataModule and the rest of the data processing, as well as the training setup and all evaluation (including the Callback class), were written by me.

## Setup

In [1]:
import torch
from torch import nn
from typing import List
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelWithLMHead, AdamW, get_linear_schedule_with_warmup
import os
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from argparse import Namespace
from sklearn.metrics import classification_report
from datasets import load_dataset
torch.__version__



'2.2.2+cu121'

In [2]:
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
os.environ["HF_DATASETS_CACHE"] = "./data"
os.environ["HF_MODELS_CACHE"] = "./model"
os.environ['TRANSFORMERS_CACHE'] = "./model"


## Load and create dataset

In [3]:
## emotion labels
label2int = {
  "sadness": 0,
  "joy": 1,
  "love": 2,
  "anger": 3,
  "fear": 4,
  "surprise": 5
}

emotions = [ "sadness", "joy", "love", "anger", "fear", "surprise"]
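
Predictions come back as class indices, so an inverse mapping is useful when reading results. The following is a minimal sketch; `int2label` and `decode` are names introduced here for illustration and do not appear elsewhere in the notebook.

```python
# Label mapping copied from the cell above.
label2int = {
    "sadness": 0, "joy": 1, "love": 2,
    "anger": 3, "fear": 4, "surprise": 5,
}

# Hypothetical helper: invert the mapping so class indices can be read as names.
int2label = {idx: name for name, idx in label2int.items()}

def decode(indices):
    """Map an iterable of class indices to emotion names."""
    return [int2label[i] for i in indices]

# The `emotions` list is the same mapping written out in index order.
emotions = decode(range(len(label2int)))
print(emotions)
```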

In [4]:
class TokenizersCollateFn:
    def __init__(self, max_tokens=512):

        # RoBERTa uses BPE tokenizer AutoTokenizer.from_pretrained('distilroberta-base')
        # that I downloaded and saved in "tokenizer" directory
        t = ByteLevelBPETokenizer(
            "tokenizer/vocab.json",
            "tokenizer/merges.txt"
        )
        t._tokenizer.post_processor = BertProcessing(
            ("</s>", t.token_to_id("</s>")),
            ("<s>", t.token_to_id("<s>")),
        )
        t.enable_truncation(max_tokens)
        t.enable_padding(length=max_tokens, pad_id=t.token_to_id("<pad>"))
        self.tokenizer = t

    def __call__(self, batch):
        encoded = self.tokenizer.encode_batch([x[0] for x in batch])
        sequences_padded = torch.tensor([enc.ids for enc in encoded])
        attention_masks_padded = torch.tensor([enc.attention_mask for enc in encoded])
        labels = torch.tensor([x[1] for x in batch])

        return (sequences_padded, attention_masks_padded), labels
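
What the collate function hands to the model can be illustrated without the tokenizer library. The sketch below is plain Python and only mimics the padding and attention-mask step; `pad_batch` is a hypothetical helper, `pad_id=1` is an assumption about the vocabulary (the class above looks the id up with `t.token_to_id("<pad>")`), and the naive slice used for truncation ignores special tokens.

```python
def pad_batch(batch_ids, max_tokens=8, pad_id=1):
    """Truncate/pad each id sequence to max_tokens and build attention masks.

    pad_id=1 is an assumption for illustration; the notebook reads the real
    id from the tokenizer vocabulary.
    """
    padded, masks = [], []
    for ids in batch_ids:
        ids = ids[:max_tokens]                      # naive truncation
        n_pad = max_tokens - len(ids)
        padded.append(ids + [pad_id] * n_pad)       # right-padding
        masks.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, masks

ids, mask = pad_batch([[0, 574, 2], [0, 574, 3808, 1092, 2]], max_tokens=6)
print(ids)
print(mask)
```

Because `enable_padding(length=max_tokens)` pads every sequence to the full length, the default of 512 means most positions are padding for short texts; a smaller `max_tokens` may reduce the cost per batch.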

In [5]:
batch_size = 32

class EmotionDataModule(pl.LightningDataModule):
    def setup(self, stage=None):
        # Load all splits once; each example becomes a (text, label) tuple
        dataset = load_dataset("dair-ai/emotion")
        self.train_dataset = [(ex['text'], ex['label']) for ex in dataset["train"]]
        self.val_dataset = [(ex['text'], ex['label']) for ex in dataset["validation"]]
        self.test_dataset = [(ex['text'], ex['label']) for ex in dataset["test"]]
        
    def train_dataloader(self):    
        return DataLoader(self.train_dataset, batch_size=batch_size, shuffle=True,
                    collate_fn=TokenizersCollateFn())
    
    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=batch_size, shuffle=False,
                    collate_fn=TokenizersCollateFn())
    
    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=batch_size, shuffle=False,
                    collate_fn=TokenizersCollateFn())

## Building the Model

### Load the Pretrained Language Model

[RoBERTa](https://arxiv.org/abs/1907.11692) is a variant of BERT which "*modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates*".

Knowledge distillation helps to train smaller LMs with similar performance and potential. The model used here, `distilroberta-base`, is a distilled version of RoBERTa.

In [6]:
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base')
roberta_model = AutoModelWithLMHead.from_pretrained("distilroberta-base", cache_dir="./model")
base_model = roberta_model.base_model

Some weights of the model checkpoint at distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


#### Testing:

In [7]:
text = "LIP12005: Corpora in Speech and Language Processing"
enc = tokenizer.encode_plus(text)
print(enc)
print(tokenizer.convert_ids_to_tokens(enc['input_ids']))

{'input_ids': [0, 574, 3808, 1092, 31866, 35, 26091, 102, 11, 27242, 8, 22205, 28395, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['<s>', 'L', 'IP', '12', '005', ':', 'ĠCorpor', 'a', 'Ġin', 'ĠSpeech', 'Ġand', 'ĠLanguage', 'ĠProcessing', '</s>']
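
In the token list above, the `Ġ` prefix is how the byte-level BPE vocabulary marks a token that was preceded by a space. A small sketch of reversing this by hand; `detokenize` is a hypothetical helper and only handles plain ASCII text like this example (the tokenizer's own decoding is the right tool in general).

```python
# Token list copied from the output above.
tokens = ['<s>', 'L', 'IP', '12', '005', ':', 'ĠCorpor', 'a', 'Ġin',
          'ĠSpeech', 'Ġand', 'ĠLanguage', 'ĠProcessing', '</s>']

def detokenize(tokens, special=("<s>", "</s>", "<pad>")):
    """Rebuild the text from byte-level BPE tokens (ASCII-only simplification)."""
    pieces = [t for t in tokens if t not in special]  # drop special tokens
    return "".join(pieces).replace("Ġ", " ")          # Ġ marks a leading space

print(detokenize(tokens))
```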


In [8]:
# Example input: We need to unsqueeze to get a batch of size 1. Input size = output size = [1, num_tokens, 768]
out = base_model(torch.tensor(enc["input_ids"]).unsqueeze(0), torch.tensor(enc["attention_mask"]).unsqueeze(0))
out[0].shape

torch.Size([1, 14, 768])

### Building Custom Classification head on top of LM base model

Use the Mish activation function, as proposed in the original tutorial.

In [9]:
# from https://github.com/digantamisra98/Mish/blob/b5f006660ac0b4c46e2c6958ad0301d7f9c59651/Mish/Torch/mish.py
@torch.jit.script
def mish(input):
    return input * torch.tanh(F.softplus(input))

class Mish(nn.Module):
    def forward(self, input):
        return mish(input)
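
To see what the activation does to individual values, here is a scalar version in plain Python. It is only a reference sketch of the same formula, `x * tanh(softplus(x))`; `softplus` and `mish_scalar` are names introduced for illustration.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish_scalar(x):
    """Mish activation for a single float: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))

for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"{x:+.1f} -> {mish_scalar(x):+.4f}")
```

Large positive inputs pass through almost unchanged, while negative inputs are squashed towards zero but keep a small negative output, which is the main difference from ReLU.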

In [10]:
class EmoModel(nn.Module):
    def __init__(self, base_model, n_classes, base_model_output_size=768, dropout=0.05):
        super().__init__()
        self.base_model = base_model

        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(base_model_output_size, base_model_output_size),
            Mish(),
            nn.Dropout(dropout),
            nn.Linear(base_model_output_size, n_classes)
        )

        for layer in self.classifier:
            if isinstance(layer, nn.Linear):
                layer.weight.data.normal_(mean=0.0, std=0.02)
                if layer.bias is not None:
                    layer.bias.data.zero_()

    def forward(self, input_, *args):
        X, attention_mask = input_
        hidden_states = self.base_model(X, attention_mask=attention_mask)

        # maybe do some pooling / RNNs... go crazy here!

        # use the <s> representation
        return self.classifier(hidden_states[0][:, 0, :])
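
The comment in `forward` mentions pooling as an alternative to taking the `<s>` representation. One common option is masked mean pooling, sketched below in plain Python for a single sequence. This is not what the model above does; `masked_mean_pool` is a hypothetical helper shown only to make the idea concrete.

```python
def masked_mean_pool(hidden, mask):
    """Average the token vectors of one sequence, skipping padded positions.

    hidden: list of token vectors (num_tokens x dim)
    mask:   list of 0/1 flags, 1 = real token, 0 = padding
    """
    dim = len(hidden[0])
    n_real = max(sum(mask), 1)  # guard against an all-padding sequence
    return [
        sum(vec[d] for vec, m in zip(hidden, mask) if m) / n_real
        for d in range(dim)
    ]

hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last position is padding
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))
```

With the padding used in this notebook, ignoring masked positions matters: an unmasked mean over 512 positions would be dominated by padding vectors.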

### Putting the model together

In [11]:
class EmotionClassifier(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.model = EmoModel(AutoModelWithLMHead.from_pretrained("distilroberta-base").base_model, len(emotions))
        self.loss = nn.CrossEntropyLoss() ## combines LogSoftmax() and NLLLoss()
        #self.hparams = hparams
        self.hparams.update(vars(hparams))
        self.training_step_outputs = []
        self.max_val_acc = 0

    def step(self, batch, step_name="train"):
        X, y = batch
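        loss = self.loss(self.forward(X), y)
        #print(loss)
        if step_name == "train":
            # keep only detached training losses for the per-epoch mean in LossCallback
            self.training_step_outputs.append(loss.detach())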
        loss_key = f"{step_name}_loss"
        tensorboard_logs = {loss_key: loss}

        return { ("loss" if step_name == "train" else loss_key): loss, 'log': tensorboard_logs,
               "progress_bar": {loss_key: loss}}

    def forward(self, X, *args):
        return self.model(X, *args)

    def training_step(self, batch, batch_idx):
        #print(batch[0])
        return self.step(batch, "train")

    def validation_step(self, batch, batch_idx):
        return self.step(batch, "val")

    def validation_end(self, outputs: List[dict]):
        loss = torch.stack([x["val_loss"] for x in outputs]).mean()
        return {"val_loss": loss}

    def test_step(self, batch, batch_idx):
        return self.step(batch, "test")

    #@lru_cache()
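    def total_steps(self):
        # optimizer steps per epoch = examples // (batch size * accumulation), times epochs
        steps_per_epoch = self.hparams.data_size // (self.hparams.batch_size * self.hparams.accumulate_grad_batches)
        return steps_per_epoch * self.hparams.epochs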

    def configure_optimizers(self):
        optimizer = AdamW(self.model.parameters(), lr=self.hparams.lr)
        lr_scheduler = get_linear_schedule_with_warmup(
                    optimizer,
                    num_warmup_steps=self.hparams.warmup_steps,
                    num_training_steps=self.total_steps(),
        )
        return [optimizer], [{"scheduler": lr_scheduler, "interval": "step"}]
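
The comment on `nn.CrossEntropyLoss` says it combines `LogSoftmax()` and `NLLLoss()`. For a single example this can be written out in plain Python as below; `log_softmax` and `cross_entropy` are illustrative helpers, and the logits are made up.

```python
import math

def log_softmax(logits):
    """Stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    return -log_softmax(logits)[target]

uniform = cross_entropy([0.0] * 6, target=2)  # no preference among 6 classes
confident = cross_entropy([0.0, 0.0, 8.0, 0.0, 0.0, 0.0], target=2)
print(uniform, confident)
```

A classifier with no preference among the six emotions has a loss of `log(6)`, roughly 1.79, which gives a baseline for judging the mean training loss printed after each epoch.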

In [12]:
class LossCallback(pl.Callback):
    def on_train_epoch_end(self, trainer, model):
        #if trainer.current_epoch % 5 == 0:
        epoch_mean = torch.stack(model.training_step_outputs).mean()
        print(epoch_mean.item(), "\n")
        with torch.no_grad():
            model.eval()
            model.cuda()
                
            true_y, pred_y = [], []
            # select the best model on the validation split; the test split is kept for the final evaluation
            for i, batch_ in enumerate(trainer.datamodule.val_dataloader()):
                (X, attn), y = batch_
                batch = (X.cuda(), attn.cuda())
                y_pred = torch.argmax(model(batch), dim=1)
                true_y.extend(y.cpu())
                pred_y.extend(y_pred.cpu())
            correct = (torch.tensor(true_y) == torch.tensor(pred_y)).float().sum() / len(true_y)
            print("on dev: ", correct.item())
            if correct.item() > model.max_val_acc and correct.item() > 0.92:
                torch.save(model.state_dict(), "model/roberta_best.pl")
                model.max_val_acc = correct.item()
                print("model saved at ", correct.item())
        model.train()  # restore training mode (dropout) after the evaluation pass
        model.training_step_outputs.clear()

## Training the Emotion Classifier

In [None]:
hparams = Namespace(
    batch_size=batch_size,
    warmup_steps=100,
    epochs=4,
    lr=1e-4,
    accumulate_grad_batches=1,
    data_size=16000 
)
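
How `warmup_steps`, `epochs` and `data_size` shape the learning rate can be checked with a few lines of plain Python. The sketch below reproduces the usual linear warmup and linear decay rule, under the assumption that the schedule is meant to span the optimizer steps actually taken (500 per epoch at batch size 32, matching the training log). `lr_factor` is a hypothetical helper, not a library function.

```python
def lr_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # linear warmup from 0 to 1
    remaining = total_steps - step
    return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay to 0

# Values taken from hparams above.
data_size, batch_size, accumulate, epochs = 16000, 32, 1, 4
warmup_steps, base_lr = 100, 1e-4

steps_per_epoch = data_size // (batch_size * accumulate)
total = steps_per_epoch * epochs

for step in (0, 50, 100, 1050, 2000):
    print(step, base_lr * lr_factor(step, warmup_steps, total))
```

Passing a `num_training_steps` far larger than the number of steps actually run (64000, for example) would leave the factor at about 0.97 after step 2000, so the decay would barely take effect.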

In [14]:
## garbage collection
import gc; gc.collect()
torch.cuda.empty_cache()

In [15]:
model = EmotionClassifier(hparams)
data_module = EmotionDataModule()
trainer = pl.Trainer(max_epochs=hparams.epochs, accumulate_grad_batches=hparams.accumulate_grad_batches, enable_progress_bar=True,
                     callbacks=[LossCallback()])

trainer.fit(model, data_module)



huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/home/users1/zabereus/zabereus/corpora_env/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


                                                                           

/home/users1/zabereus/zabereus/corpora_env/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Epoch 0: 100%|██████████| 500/500 [04:39<00:00,  1.79it/s, v_num=63]0.44704288244247437 

Epoch 1: 100%|██████████| 500/500 [04:40<00:00,  1.78it/s, v_num=63]0.15304653346538544 

Epoch 2: 100%|██████████| 500/500 [04:23<00:00,  1.90it/s, v_num=63]0.12714065611362457 

Epoch 3:  13%|█▎        | 63/500 [00:32<03:47,  1.92it/s, v_num=63] 

/home/users1/zabereus/zabereus/corpora_env/lib64/python3.12/site-packages/pytorch_lightning/trainer/call.py:54: Detected KeyboardInterrupt, attempting graceful shutdown...


## Evaluation

In [16]:
with torch.no_grad():
    progress = ["/", "-", "\\", "|", "/", "-", "\\", "|"]
    model.eval()
    model.cuda()
    true_y, pred_y = [], []
    for i, batch_ in enumerate(data_module.test_dataloader()):
        (X, attn), y = batch_
        batch = (X.cuda(), attn.cuda())
        print(progress[i % len(progress)], end="\r")
        y_pred = torch.argmax(model(batch), dim=1)
        true_y.extend(y.cpu())
        pred_y.extend(y_pred.cpu())
print("\n" + "_" * 80)
print(classification_report(true_y, pred_y, target_names=emotions, digits=6))
misclassified = []
# iterate over all predictions, not just the last batch held in y_pred
for i in range(len(pred_y)):
    if true_y[i] != pred_y[i]:
        misclassified.append(data_module.test_dataset[i])

\
________________________________________________________________________________
              precision    recall  f1-score   support

     sadness   0.985455  0.932874  0.958444       581
         joy   0.985893  0.905036  0.943736       695
        love   0.756098  0.974843  0.851648       159
       anger   0.859873  0.981818  0.916808       275
        fear   0.878661  0.937500  0.907127       224
    surprise   0.833333  0.681818  0.750000        66

    accuracy                       0.925500      2000
   macro avg   0.883219  0.902315  0.887961      2000
weighted avg   0.933125  0.925500  0.926492      2000
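
The summary rows of the report can be recomputed from the per-class rows, which makes the difference between the two averages explicit: the macro average treats every class equally, while the weighted average weights each class by its support. The sketch below copies precision, recall and support from the table above; since those inputs are rounded, the results should agree with the report only to about four decimals.

```python
# (precision, recall, support) per class, copied from the report above.
report = {
    "sadness":  (0.985455, 0.932874, 581),
    "joy":      (0.985893, 0.905036, 695),
    "love":     (0.756098, 0.974843, 159),
    "anger":    (0.859873, 0.981818, 275),
    "fear":     (0.878661, 0.937500, 224),
    "surprise": (0.833333, 0.681818, 66),
}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

total = sum(s for _, _, s in report.values())
f1_scores = {name: f1(p, r) for name, (p, r, _) in report.items()}

macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[n] * s for n, (_, _, s) in report.items()) / total
accuracy = sum(r * s for _, r, s in report.values()) / total  # support-weighted recall

print(round(macro_f1, 4), round(weighted_f1, 4), round(accuracy, 4))
```

The gap between the macro and weighted F1 comes mostly from the small `surprise` and `love` classes, which score lowest but carry little weight in the weighted average.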

