# Text classification using Transformers.
This lab focuses on text classification on the IMDb dataset.
We will use the encoder-based transformer architecture, through the lens of its most famous model: **BERT**.

---

# Introduction

## HuggingFace

We have already experimented with some components provided by the HuggingFace library:
- the `datasets` library,
- the `tokenizer`.

The HuggingFace `transformers` library provides a convenient API for working with transformer models such as BERT and GPT. To quote their website: *Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Using pretrained models can reduce your compute costs, carbon footprint, and save you the time and resources required to train a model from scratch. Transformers support framework interoperability between PyTorch, TensorFlow, and JAX.*

## Goal of the lab session

We will experiment with the HuggingFace library. You will have to load a model and run it on your task.

Important things to keep in mind:
- Even though every model is a Transformer, each one has its peculiarities.
- What is the exact input format expected by the model?
- What is its exact output?
- Can you use the available model as is, or should you modify it for your task?

These questions are part of the daily life of an NLP scientist. We will address some of them in this lab and in the next lessons, labs and homework.

In [None]:
%%capture
%pip install transformers datasets

In [None]:
%pip install transformers==4.30

In [None]:
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']
import matplotlib.pyplot as plt

from transformers import DistilBertTokenizer


import numpy as np
import torch
import torch.nn.functional as F
import torch.nn as nn
import math
from torch.utils.data import DataLoader
from tabulate import tabulate
from datasets import load_dataset
# pretrained
from tqdm.notebook import tqdm

# If the machine you run this on has a GPU available with CUDA installed,
# use it. Using a GPU for learning often leads to huge speedups in training.
# See https://developer.nvidia.com/cuda-downloads for installing CUDA
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
DEVICE

## Download the training data

In [None]:
dataset = load_dataset("scikit-learn/imdb", split="train")
print(dataset)

## Prepare model inputs

The input format of BERT may look "over-specified", especially if you focus on a single type of task (sequence classification, word tagging, paraphrase detection, etc.). The format requires us to:
- Add special tokens to the start and end of each sentence.
- Pad and truncate all sentences to a single constant length.
- Explicitly differentiate real tokens from padding tokens with the "attention mask".

It looks like this:

<img src="https://drive.google.com/uc?export=view&id=1cb5xeqLu_5vPOgs3eRnail2Y00Fl2pCo" width="600">
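To make the format concrete, here is a minimal plain-Python sketch that builds `input_ids` and an attention mask by hand. The vocabulary is a toy one: the four special-token IDs are the ones used by `bert-base-uncased`, while the word IDs and the helper name `build_inputs` are made up for this illustration.

```python
# Toy vocabulary: special-token IDs follow bert-base-uncased,
# the word IDs are invented for this example.
VOCAB = {
    "[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
    "my": 7, "name": 8, "is": 9, "kevin": 10,
}


def build_inputs(tokens, max_length):
    """Return (input_ids, attention_mask), both of length `max_length`."""
    # Truncate, keeping room for [CLS] and [SEP].
    tokens = tokens[: max_length - 2]
    ids = [VOCAB["[CLS]"]]
    ids += [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    ids.append(VOCAB["[SEP]"])
    # Real tokens get 1 in the mask, padding tokens get 0.
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    ids += [VOCAB["[PAD]"]] * n_pad
    mask += [0] * n_pad
    return ids, mask


input_ids, attention_mask = build_inputs(["my", "name", "is", "kevin"], max_length=8)
print(input_ids)       # [101, 7, 8, 9, 10, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the two trailing `[PAD]` positions.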

If you don't want to build these inputs by hand, you can use the pre-trained tokenizer associated with BERT. The tokenizer will:
- Tokenize the sentence.
- Prepend the `[CLS]` token to the start.
- Append the `[SEP]` token to the end.
- Map tokens to their IDs.
- Pad or truncate the sentence to `max_length`.
- Create attention masks for `[PAD]` tokens.


> 💡 *Note:* For computational reasons, we will use the [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) model, which is 40% smaller than the original BERT model but still achieves about 95% of its performance.

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained(
    "distilbert-base-uncased", do_lower_case=True
)

Let's see how the tokenizer actually processes a sequence:

In [None]:
# Some useful steps:
message = "hello my name is footballman"
tok = tokenizer.tokenize(message)
print("Tokens in the sequence:", tok)
enc = tokenizer.encode(tok)
table = np.array(
    [
        enc,
        [tokenizer.ids_to_tokens[w] for w in enc],
    ]
).T
print("Encoded inputs:")
print(tabulate(table, headers=["Token IDs", "Tokens"], tablefmt="fancy_grid"))
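An out-of-vocabulary word such as `footballman` is split into sub-word pieces. As a rough sketch of the idea, the snippet below implements greedy longest-match-first WordPiece splitting in plain Python. The vocabulary `toy_vocab` and the function name `wordpiece` are assumptions made for this illustration; the real tokenizer uses a vocabulary of about 30,000 entries and may split words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"hello", "my", "name", "is", "foot", "football", "##ball", "##man"}
print(wordpiece("footballman", toy_vocab))  # ['football', '##man']
```

Because the longest match wins, `football` is preferred over `foot` followed by `##ball`.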

🚧 **Question** 🚧

You noticed special tokens like `[CLS]` and `[SEP]` in the sequence. Note how they were added automatically by HuggingFace.

- Why are there such special tokens?

**Answer**

TODO

🚧 **TODO** 🚧

- Run the code below to make sure you can control this behavior.

In [None]:
text = "my name is kevin"
# With add_special_tokens=False, no [CLS] / [SEP] IDs are added.
tokenized_text_without_special_tokens = tokenizer.encode(text, add_special_tokens=False)
print(tokenized_text_without_special_tokens)

## Data pre-processing

This is the usual data pre-processing for PyTorch, the same as in the previous lab.

In [None]:
def preprocessing_fn(x, tokenizer):
    # Keep the special tokens: the classifier below reads the [CLS] vector at position 0.
    x["input_ids"] = tokenizer.encode(
        x["review"],
        add_special_tokens=True,
        truncation=True,
        max_length=256,
        padding=False,
        return_attention_mask=False,
    )
    x["labels"] = 0 if x["sentiment"] == "negative" else 1
    return x

In [None]:
n_samples = 2000  # the number of examples to keep (train + validation)

# We first shuffle the data!
dataset = dataset.shuffle()

# Select n_samples
splitted_dataset = dataset.select(range(n_samples))

# Tokenize the dataset
splitted_dataset = splitted_dataset.map(
    preprocessing_fn, fn_kwargs={"tokenizer": tokenizer}
)


# Remove useless columns
splitted_dataset = splitted_dataset.select_columns(["input_ids", "labels"])

# Split the train and validation
splitted_dataset = splitted_dataset.train_test_split(test_size=0.2)

train_set = splitted_dataset["train"]
valid_set = splitted_dataset["test"]

In [None]:
train_set[0]["labels"]

In [None]:
class DataCollator:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, batch):
        # `batch` is a list of dictionaries with keys "input_ids" and "labels".
        features = self.tokenizer.pad(
            batch, padding="longest", max_length=256, return_tensors="pt"
        )
        return features

In [None]:
data_collator = DataCollator(tokenizer)

In [None]:
batch_size = 4

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=data_collator
)
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=data_collator
)
n_valid = len(valid_set)
n_train = len(train_set)

# Model from scratch

For this first exercise, we will start from a randomly initialized model.

## Retrieve the architecture configuration

In HuggingFace, a model's architecture hyperparameters are specified through a `config` object, which is JSON-like.

We can retrieve the one from the official model with the following code:

In [None]:
from transformers import DistilBertConfig

model_config = DistilBertConfig.from_pretrained("distilbert-base-uncased")
print(model_config)

🚧 **Question** 🚧

Make sure you understand the parameters of the configuration.
- Which ones are task-agnostic parameters?
- Which ones are not?
- Why are there different parameters for different tasks?

**Answer**

TODO



Several architectures are available for DistilBERT on HuggingFace, designed for a variety of NLP tasks. Though these interfaces are all built on top of the same DistilBERT backbone, each has different top layers and output types designed to accommodate its specific NLP task.

Here is the current list of classes provided for fine-tuning:
* DistilBertModel
* DistilBertForMaskedLM
* DistilBertForSequenceClassification
* DistilBertForMultipleChoice
* DistilBertForTokenClassification
* DistilBertForQuestionAnswering

The documentation for these can be found [here](https://huggingface.co/docs/transformers/model_doc/distilbert).




🚧 **TODO** 🚧

For our first experiment, we want to build from a standard stack of transformer layers, without any additional task-specific head.

Which architecture is the corresponding one?

Choose the right one and initialize the model below, with the config.

In [None]:
from transformers import DistilBertModel

bert = DistilBertModel(model_config)

In [None]:
print(bert)

Just for curiosity's sake, we can browse all of the model's parameters by name.

The cells below print the names and dimensions of the weights for:

- The embedding layer
- The first of the six transformer layers

Note that `DistilBertModel` has no task-specific output layer.



In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(bert.named_parameters())

In [None]:
print("The BERT model has {:} different named parameters.\n".format(len(params)))

print("==== Embedding Layer ====\n")

for p in params[0:4]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print("\n==== First Transformer Layer ====\n")

for p in params[4:20]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
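As a sanity check on the printed shapes, the parameter count can be recomputed by hand. The sketch below assumes the values of the `distilbert-base-uncased` configuration printed earlier (vocabulary of 30522, 512 positions, hidden size 768, feed-forward size 3072, 6 layers); if your config differs, adjust the constants.

```python
# Assumed to match the printed `distilbert-base-uncased` config.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6


def linear(n_in, n_out):
    """Number of parameters of a linear layer: weight matrix plus bias."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # one scale and one shift vector

# Embedding layer: word embeddings, position embeddings, LayerNorm.
embeddings = vocab_size * dim + max_positions * dim + layer_norm

# One transformer layer: q, k, v and output projections, then the feed-forward
# block, each followed by a LayerNorm.
attention = 4 * linear(dim, dim)
ffn = linear(dim, hidden_dim) + linear(hidden_dim, dim)
per_layer = attention + layer_norm + ffn + layer_norm

total = embeddings + n_layers * per_layer
n_named = 4 + n_layers * 16  # 4 embedding tensors + 16 tensors per layer

print("Named parameters:", n_named)
print("Total parameters: {:,}".format(total))
```

The first number can be compared with `len(params)`, and the second with `sum(p.numel() for p in bert.parameters())`.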

🚧 **TODO** 🚧

Test your `bert`.
We can already try the model on the validation set. Before that, just look at the output of the model on one batch.
- Interpret the output.
- Do you understand everything?


In [None]:
batch = next(iter(train_dataloader))

input_ids = batch["input_ids"]
attention_mask = batch["attention_mask"]
output = bert(input_ids=input_ids, attention_mask=attention_mask)
print(output["last_hidden_state"].shape)
print(type(output))
print(output)

## Building a classifier

Our `bert` model is simply a stack of transformer layers. We would like to use it as a backbone for text classification.

🚧 **TODO** 🚧

Wrap the model into a classifier.

> 💡 *Hint*: Use the last hidden [CLS] vector representation to perform classification.

In [None]:
class DistilBertClassifier(nn.Module):
    def __init__(self, bert, dropout=0.1):
        super().__init__()
        self.bert = bert
        self.dp = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, 1)

    def forward(self, input_ids, attention_mask):
        bert_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        out = bert_out["last_hidden_state"]
        cls_embed = out[:, 0]
        out = self.dp(cls_embed)
        return self.classifier(out)

In [None]:
bert = DistilBertModel(model_config)
model = DistilBertClassifier(bert)
model.to(DEVICE)

🚧 **TODO** 🚧

Test your model on the batch.
Make sure it has the right shape.

In [None]:
out = model(input_ids.to(DEVICE), attention_mask.to(DEVICE))
print(out.shape)  # expected: (batch_size, 1)
print(out)

### Training

🚧 **TODO** 🚧

Train your model.
Make sure you track the following quantities per epoch:
- training loss
- training accuracy
- validation loss
- validation accuracy

In [None]:
# Redefine the dataloaders to adjust the batch size.
batch_size = 16

train_dataloader = DataLoader(
    train_set, batch_size=batch_size, collate_fn=data_collator
)
valid_dataloader = DataLoader(
    valid_set, batch_size=batch_size, collate_fn=data_collator
)
n_valid = len(valid_set)
n_train = len(train_set)

In [None]:
def validation(model, valid_dataloader):
    total_size = 0
    acc_total = 0
    loss_total = 0
    criterion = nn.BCEWithLogitsLoss()
    model.eval()
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            input_ids = batch["input_ids"]
            labels = batch["labels"]
            attention_mask = batch["attention_mask"]
            labels = labels.float()
            preds = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(preds.squeeze(), labels)
            acc = (preds.squeeze() > 0) == labels
            total_size += acc.shape[0]
            acc_total += acc.sum().item()
            loss_total += loss.item()
    model.train()
    return loss_total / len(valid_dataloader), acc_total / total_size


validation(model, valid_dataloader)
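The validation loop thresholds the raw logits at 0 rather than a probability at 0.5. The plain-Python sketch below shows why the two are equivalent for a single-logit classifier trained with binary cross-entropy. It is a naive re-implementation for illustration only (the helper names are ours), whereas `nn.BCEWithLogitsLoss` uses a numerically more stable formulation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy computed from raw logits (naive version)."""
    losses = [
        -(y * math.log(sigmoid(z)) + (1 - y) * math.log(1 - sigmoid(z)))
        for z, y in zip(logits, labels)
    ]
    return sum(losses) / len(losses)


def accuracy(logits, labels):
    # sigmoid(z) > 0.5 exactly when z > 0, so we can threshold the logit itself.
    preds = [1 if z > 0 else 0 for z in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


logits = [2.0, -1.0, 0.5, -3.0]
labels = [1, 0, 0, 0]
print("loss:", bce_with_logits(logits, labels))
print("accuracy:", accuracy(logits, labels))  # 0.75
```

A logit of 0 corresponds to a predicted probability of exactly 0.5, which gives a loss of `log(2)` whatever the label.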

In [None]:
def training(model, n_epochs, train_dataloader, valid_dataloader, lr=5e-5):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        eps=1e-08,
    )
    list_val_acc = []
    list_train_acc = []
    list_train_loss = []
    list_val_loss = []
    criterion = nn.BCEWithLogitsLoss()
    for e in range(n_epochs):
        # ========== Training ==========

        # Set model to training mode
        model.train()
        model.to(DEVICE)

        # Tracking variables
        train_loss = 0
        epoch_train_acc = 0
        for batch in tqdm(train_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            input_ids, attention_mask, labels = (
                batch["input_ids"],
                batch["attention_mask"],
                batch["labels"],
            )
            optimizer.zero_grad()
            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            # Backward pass
            loss = criterion(outputs.squeeze(), labels.squeeze().float())
            loss.backward()
            optimizer.step()
            train_loss += loss.detach().cpu().item()
            acc = (outputs.squeeze() > 0) == labels.squeeze()
            epoch_train_acc += acc.float().mean().item()
        list_train_acc.append(100 * epoch_train_acc / len(train_dataloader))
        list_train_loss.append(train_loss / len(train_dataloader))

        # ========== Validation ==========

        l, a = validation(model, valid_dataloader)
        list_val_loss.append(l)
        list_val_acc.append(a * 100)
        print(
            e,
            "\n\t - Train loss: {:.4f}".format(list_train_loss[-1]),
            "Train acc: {:.4f}".format(list_train_acc[-1]),
            "Val loss: {:.4f}".format(l),
            "Val acc:{:.4f}".format(a * 100),
        )
    return list_train_loss, list_train_acc, list_val_loss, list_val_acc

In [None]:
training(model, 3, train_dataloader, valid_dataloader)

## Pre-trained model

Now we are going to compare with a pre-trained model.

First, we are going to load the model's weights from the HuggingFace hub.

In [None]:
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")

## Fine-Tuning

With our model loaded and ready, we need to choose the training hyperparameters.

For the purposes of fine-tuning, the authors recommend choosing from the following values (from Appendix A.3 of the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf)):

- Batch size: 16, 32  
- Learning rate (Adam): 5e-5, 3e-5, 2e-5  
- Number of epochs: 2, 3, 4

We chose:
* **Batch size**: 16 (set when creating our DataLoaders)
* **Learning rate**: 5e-5
* **Epochs**: 3 (we'll see that this is probably too many...)

The epsilon parameter `eps = 1e-8` is "a very small number to prevent any division by zero in the implementation" (from [here](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)).

You can find the creation of the AdamW optimizer in `run_glue.py` [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109).
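To see where `eps` enters, here is a hedged single-parameter sketch of one AdamW update in plain Python. The function name `adamw_step` is ours, and `weight_decay` defaults to 0 here to keep the example short (PyTorch's `AdamW` uses a non-zero default). It follows the standard update rule and is not the PyTorch implementation.

```python
import math


def adamw_step(p, grad, m, v, t, lr=5e-5, betas=(0.9, 0.999), eps=1e-8,
               weight_decay=0.0):
    """One AdamW update for a scalar parameter `p` at step `t` (t >= 1)."""
    beta1, beta2 = betas
    p = p * (1 - lr * weight_decay)            # decoupled weight decay
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias corrections
    v_hat = v / (1 - beta2 ** t)
    # `eps` keeps the denominator strictly positive, even when v_hat is 0.
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# With a zero gradient, v_hat is 0: without eps this would divide by zero.
p, m, v = adamw_step(p=1.0, grad=0.0, m=0.0, v=0.0, t=1)
print(p)

# On the first step, the update size is close to lr whatever the gradient scale.
p2, _, _ = adamw_step(p=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(1.0 - p2)
```

The second call illustrates why the learning rate, rather than the gradient magnitude, sets the step size in Adam-style optimizers.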

🚧 **TODO** 🚧

Build the classifier and train it with the pre-trained checkpoint.

In [None]:
model = DistilBertClassifier(bert.to(DEVICE))
training(model, 3, train_dataloader, valid_dataloader)

🚧 **Question** 🚧

What do you think of the results?

**Answer**

TODO

## Pre-built models

So far, you built your own classifier by wrapping the raw output of a transformer into a classification model.

For many tasks, however, you can directly download a model that already includes the necessary blocks.

For instance, we could have directly loaded `DistilBertForSequenceClassification`.

Let's see how it differs from our model.

🚧 **TODO** 🚧

Load a pre-trained `DistilBertForSequenceClassification`.

In [None]:
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=1,
    output_attentions=False,
    output_hidden_states=False,
)
print(model)

🚧 **Question** 🚧

Here there might be a lot of questions:
- What does the warning mean?
- Why `num_labels=1`?
- What about the other options?

**Answer**

TODO

🚧 **Question** 🚧

How is the classification made? Is it the same as with our model? You might need to check the official implementation [here](https://github.com/huggingface/transformers/blob/v4.34.1/src/transformers/models/distilbert/modeling_distilbert.py#L730).

**Answer**

TODO

🚧 **TODO** 🚧

The output of such a model is not directly the logits, but a wrapper that can return several objects. Analyze it and modify the training and validation loop accordingly.

Launch the training.

Do you observe any differences with our own classification model?
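To help analyze the loss returned by the wrapper, here is a simplified paraphrase, in plain Python, of how we understand the sequence-classification head picks its loss when `labels` is passed and `config.problem_type` is unset. The function name `select_loss` is hypothetical, and the logic should be checked against the official implementation for your installed version.

```python
def select_loss(num_labels, labels_are_integers):
    """Return (problem_type, loss_name) following our reading of the HF logic."""
    if num_labels == 1:
        # A single output is treated as a regression target.
        return "regression", "MSELoss"
    if num_labels > 1 and labels_are_integers:
        # Integer class indices: one class among several.
        return "single_label_classification", "CrossEntropyLoss"
    # Float targets with several labels: independent binary decisions.
    return "multi_label_classification", "BCEWithLogitsLoss"


print(select_loss(1, labels_are_integers=False))
print(select_loss(2, labels_are_integers=True))
```

Under this reading, the loss in `out.loss` with `num_labels=1` is not the same as the `nn.BCEWithLogitsLoss` used in our own training loop, which matters when comparing the two models.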

In [None]:
model.to(DEVICE)
out = model(
    input_ids=batch["input_ids"].to(DEVICE),
    attention_mask=batch["attention_mask"].to(DEVICE),
)

In [None]:
def validation_hf(model, valid_dataloader):
    total_size = 0
    acc_total = 0
    loss_total = 0
    model.eval()
    with torch.no_grad():
        for batch in tqdm(valid_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            input_ids = batch["input_ids"]
            labels = batch["labels"]
            attention_mask = batch["attention_mask"]
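            labels = labels.float()
            out = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels,
            )
            loss = out.loss
            # With num_labels=1, HuggingFace trains with an MSE (regression) loss
            # on the 0/1 targets, so we threshold the output at 0.5.
            acc = (out.logits.squeeze() > 0.5) == labels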
            total_size += acc.shape[0]
            acc_total += acc.sum().item()
            loss_total += loss.item()
    model.train()
    return loss_total / len(valid_dataloader), acc_total / total_size

def training_hf(model, n_epochs, train_dataloader, valid_dataloader, lr=5e-5):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=lr,
        eps=1e-08,
    )
    list_val_acc = []
    list_train_acc = []
    list_train_loss = []
    list_val_loss = []
    for e in range(n_epochs):
        # ========== Training ==========

        # Set model to training mode
        model.train()
        model.to(DEVICE)

        # Tracking variables
        train_loss = 0
        epoch_train_acc = 0
        for batch in tqdm(train_dataloader):
            batch = {k: v.to(DEVICE) for k, v in batch.items()}
            input_ids, attention_mask, labels = (
                batch["input_ids"],
                batch["attention_mask"],
                batch["labels"],
            )
            optimizer.zero_grad()
            # Forward pass
            out = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels.float(),
            )
            # Backward pass
            loss = out.loss
            loss.backward()
            optimizer.step()
            train_loss += loss.detach().cpu().item()
            # num_labels=1 means an MSE (regression) loss on 0/1 targets: threshold at 0.5.
            acc = (out.logits.squeeze() > 0.5) == labels.squeeze()
            epoch_train_acc += acc.float().mean().item()
        list_train_acc.append(100 * epoch_train_acc / len(train_dataloader))
        list_train_loss.append(train_loss / len(train_dataloader))

        # ========== Validation ==========

        l, a = validation_hf(model, valid_dataloader)
        list_val_loss.append(l)
        list_val_acc.append(a * 100)
        print(
            e,
            "\n\t - Train loss: {:.4f}".format(list_train_loss[-1]),
            "Train acc: {:.4f}".format(list_train_acc[-1]),
            "Val loss: {:.4f}".format(l),
            "Val acc:{:.4f}".format(a * 100),
        )
    return list_train_loss, list_train_acc, list_val_loss, list_val_acc

In [None]:
list_train_loss, list_train_acc, list_val_loss, list_val_acc = training_hf(
    model, 3, train_dataloader, valid_dataloader, lr=3e-5
)

# Interpretability

So far we have models able to predict quite accurately whether a review is positive or negative. But can we interpret the results?

A usual way to do so with transformers is simply to look at the **attention weights**. Let's see if we can get some insights into the model's prediction using this technique.
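As a reminder of what these weights are, the sketch below computes scaled dot-product attention weights, `softmax(Q K^T / sqrt(d))`, for a toy sequence of three tokens in plain Python. The query and key vectors are made up for the example; in the model they come from learned projections, and there is one such matrix per head and per layer.

```python
import math


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Row i gives how much token i attends to every token in the sequence."""
    d = len(queries[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d) for k in keys]
        for q in queries
    ]
    return [softmax(row) for row in scores]


# Made-up 2-dimensional queries and keys for three tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = attention_weights(Q, K)
for row in A:
    print([round(a, 3) for a in row])
```

Each row sums to 1, so a heatmap of this matrix shows, for every token, how its attention is distributed over the sequence.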


First, tokenize the text.

In [None]:
text = (
    "captain corelli's mandolin is a beautiful film with a lovely cast"
    " including the wonderful nicolas cage, who as always is brilliant in the movie."
    " the music in the film is really nice too. i'd advise anyone to go and see it. brilliant! 10 / 10 "
)
tokenized_text = tokenizer(text, return_tensors="pt")

🚧 **TODO** 🚧

Now feed it as input to the model. Use the keyword argument `output_attentions=True`.


In [None]:
model.eval()  # disable dropout for inference
with torch.no_grad():
    model_output = model(
        input_ids=tokenized_text["input_ids"].to(DEVICE), output_attentions=True
    )

🚧 **TODO** 🚧

Check the prediction and plot the attention weights matrix.

In [None]:
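print("Logits:", model_output.logits.item())
# With num_labels=1 the model regresses towards 0 (negative) or 1 (positive),
# hence the 0.5 threshold.
print("Prediction:", int((model_output.logits > 0.5).item()))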

In [None]:
def print_attention(attention, tokenized_text, tokenizer, layer=0):
    if layer == "all":
        attention_array = np.concatenate(
            [layer.detach().cpu().numpy() for layer in attention]
        )
    if isinstance(layer, int):
        attention_array = attention[layer].detach().cpu().numpy()
    if isinstance(layer, list):
        attention_array = np.concatenate(
            [attention[i].detach().cpu().numpy() for i in layer]
        )
    attention_array = attention_array.max(axis=(0, 1))
    fig, axs = plt.subplots(1, 1, figsize=(10, 10))
    axs.imshow(attention_array)
    tokens = [
        tokenizer.ids_to_tokens[w] for w in tokenized_text["input_ids"][0].tolist()
    ]
    axs.set_xticks(np.arange(len(tokens)), tokens, rotation="vertical", fontsize=8)
    plt.show()


print_attention(model_output["attentions"], tokenized_text, tokenizer, layer=4)

🚧 **Question** 🚧

Is your interpretability experiment conclusive?