# Fine-tune a transformer model for sentence classification

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-09-28 |

This notebook shows how to use the Hugging Face 🤗 `transformers` library to train a Transformer-based sentence classifier with transfer learning (i.e., "fine-tuning").

**_Source:_** The notebook is adapted from the one distributed with this tutorial: https://huggingface.co/docs/transformers/tasks/sequence_classification

## Setup

If you run this notebook on Google Colab or you have not yet installed the `transformers` and `datasets` python libraries, you need to do so first:

In [None]:
# Transformers installation
! pip install transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

In [95]:
import os
import shutil

import numpy as np

# dataset loading
from datasets import load_dataset, DatasetDict 

# used to tokenize text
from transformers import AutoTokenizer

# used to load the pre-trained model
from transformers import AutoModelForSequenceClassification

# used to finetune the pre-trained model
from transformers import Trainer, TrainingArguments


## Supervised text classification

Text classification means assigning a label or class to each text in a corpus.
It is a common NLP and computational text analysis task.

A common and popular text classification task is **_sentiment analysis_**.
Sentiment analysis assigns a label like 🙂 'positive', 🙁 'negative', or 😐 'neutral' to a sequence of text, for example a sentence or paragraph.

### Ingredients

Here is what you need for training a supervised sequence classifier through finetuning (i.e. transfer learning):

- a pre-defined set of **label classes** (e.g., 'positive', 'neutral', 'negative')
- a **labeled dataset**, i.e., a corpus of texts (e.g., sentences) in which each document has been assigned to a single label class
- a **pre-trained model** you can fine-tune for sequence classification
- some **metric to quantify classification performance** so that we know how well our classifier is doing

### This notebook

In this notebook, we will use

1. the [IMDb](https://huggingface.co/datasets/imdb) corpus, which records movie reviews that have been classified as positive or negative, and
2. finetune a [DistilBERT](https://huggingface.co/distilbert-base-uncased) model

## Load the dataset

We'll use the IMDb dataset from the 🤗 `datasets` library.
However, the train and test splits ([?](https://chat.openai.com/share/d71207ff-d374-4540-927a-83ac5370cd8f)) of this dataset each contain 25,000 documents.
Using them all would make training very slow, so we'll use only a subset.

In [87]:
n_train = 3000
n_dev = 1000
n_test = 1000

# sample without replacement
idxs = np.random.choice(25_000, n_train + n_dev + n_test, replace=False)

# split the indices into train, dev, and test
train_idxs = idxs[:n_train]
dev_idxs = idxs[n_train:(n_train+n_dev)]
test_idxs = idxs[-n_test:]

# show the number of examples in each split
len(train_idxs), len(dev_idxs), len(test_idxs)

(3000, 1000, 1000)
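
Because `np.random.choice` is called without a seed, the cell above draws different indices on every run. Here is a minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 42 (both are illustrative choices, not part of the original notebook):

```python
import numpy as np

n_train, n_dev, n_test = 3000, 1000, 1000

# a seeded generator makes the sample reproducible (the seed value is arbitrary)
rng = np.random.default_rng(42)

# sample indices without replacement from the 25,000 available documents
idxs = rng.choice(25_000, size=n_train + n_dev + n_test, replace=False)

# split the indices into train, dev, and test
train_idxs = idxs[:n_train]
dev_idxs = idxs[n_train:n_train + n_dev]
test_idxs = idxs[-n_test:]

# dev and test are both taken from the original "test" split,
# so their indices must not overlap
overlap = set(dev_idxs.tolist()) & set(test_idxs.tolist())
print(len(train_idxs), len(dev_idxs), len(test_idxs), len(overlap))
```

Re-running this cell yields the same three index arrays each time, which makes later results easier to compare across sessions.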

In [88]:
imdb = DatasetDict({
    "train": load_dataset("imdb", split='train').select(train_idxs),
    "dev": load_dataset("imdb", split='test').select(dev_idxs),
    "test": load_dataset("imdb", split='test').select(test_idxs),
})

The `imdb` object is an instance of the `datasets` `DatasetDict` class.

In [89]:
type(imdb)

datasets.dataset_dict.DatasetDict

This class is there to gather several pre-defined splits of a dataset.

Among these splits, one is usually named "train" and another one "test" (see next cell).

**_Note:_** It'll become clearer further below why we need these splits. 

In [90]:
imdb.keys()

dict_keys(['train', 'dev', 'test'])

In [91]:
len(imdb['train']), len(imdb['dev']), len(imdb['test'])

(3000, 1000, 1000)

Here is how you can access one "example" (i.e., observation) in the "test" split:

In [92]:
imdb["test"][0]

{'text': "It's hard to believe that with a cast as strong as this one has, that this movie can be such a dud. It's such an incredibly horrible film. How was it ever made? How did so many good actors wind up in such a terrible film? Don't waste your life. Don't watch even one moment of this film.",
 'label': 0}

This shows that in the test split, there are two fields for *each* example:

- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.

**_Important:_** check that the splits have about equal label class distributions:

In [93]:
print('% "pos" in train:', np.mean([ex['label'] for ex in imdb["train"]]))
print('% "pos" in dev:', np.mean([ex['label'] for ex in imdb["dev"]]))
print('% "pos" in test:', np.mean([ex['label'] for ex in imdb["test"]]))


% "pos" in train: 0.5076666666666667
% "pos" in dev: 0.499
% "pos" in test: 0.498
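
The means above only report the share of positive examples. As a sketch on a hypothetical toy list of labels (not the IMDb data), `collections.Counter` gives the share of every class, which also works with more than two label classes:

```python
from collections import Counter

# hypothetical labels standing in for the `label` field of one split
labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

counts = Counter(labels)
shares = {label: n / len(labels) for label, n in sorted(counts.items())}
print(shares)  # {0: 0.4, 1: 0.6}
```

With the real data, you would pass the labels of a split (e.g., `[ex['label'] for ex in imdb["train"]]`) instead of the toy list.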


## Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the `text` field:

In [58]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

The tokenizer is a so-called *callable* and can thus be used like a function:
If you input a text string, it returns a dictionary with the tokenized text and additional information.

In [59]:
toks = tokenizer("Hello, this one sentence!")
print(toks.keys())

dict_keys(['input_ids', 'attention_mask'])


- The field 'input_ids' contains the numeric IDs used to represent the tokens in the example sentence.
- The 'attention_mask' tells the model which tokens in a batch of sentences it should pay attention to when fine-tuning, and which it can ignore.

Let's create a helper function that tokenizes the `text` value of an input called example.
This will allow us to iterate over examples in our dataset splits (e.g., `imdb["test"]`) and pre-process them one by one.

**_Note:_** By setting `truncation=True`, we ensure that none of the text sequences we'll use for fine-tuning is too long for DistilBERT to handle.

In [60]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [94]:
tokenized_imdb = imdb.map(preprocess_function, batched=True)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). 

Our `data_collator` instance of this class will handle preprocessing and a thing called "padding" when sampling batches of examples during finetuning to iteratively update our classifier's parameters.

*Padding* means that you make all text sequences in a set of sequences the same length.
To do this, we just append the `<PAD>` special token to shorter text sequences in the set.
For example, the (tokenized) sequences in the following set 

```json
[
    ['Hello', 'world', '!'               ],
    ['Have',  'a',     'nice', 'day', '!']
]
```

will be "padded" to 

```json
[
    ['Hello', 'world', '!',    '<PAD>', '<PAD>'],
    ['Have',  'a',     'nice', 'day',   '!'    ]
]
```
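
To make the idea concrete, here is a toy sketch of padding in plain Python. It is not how `DataCollatorWithPadding` is implemented (the real collator works on token IDs, not strings); the helper name `pad_batch` is hypothetical. It also shows how the attention mask marks which positions are real tokens and which are padding:

```python
def pad_batch(sequences, pad_token="<PAD>"):
    """Pad all sequences to the length of the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    padded, masks = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(seq + [pad_token] * n_pad)
        # 1 = real token the model should attend to, 0 = padding it can ignore
        masks.append([1] * len(seq) + [0] * n_pad)
    return padded, masks

batch = [
    ["Hello", "world", "!"],
    ["Have", "a", "nice", "day", "!"],
]
padded, masks = pad_batch(batch)
print(padded[0])  # ['Hello', 'world', '!', '<PAD>', '<PAD>']
print(masks[0])   # [1, 1, 1, 0, 0]
```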

In [62]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance.

**_Note:_** You could also just load an evaluation metric with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric).

Let's create a function that passes your predictions and labels to calculate some central metrics (explanations below):

In [73]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, balanced_accuracy_score

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    p, r, f1, _ = precision_recall_fscore_support(y_true=labels, y_pred=predictions, average='macro', zero_division=0)
    # ba = balanced_accuracy_score(y_true=labels, y_pred=predictions)
    metrics = {
        "macro_f1": f1,
        "macro_precision": p,
        "macro_recall": r,
        # "balanced_accuracy": ba
    }
    return metrics

We compute the following metrics:

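- precision: the share of examples a classifier has assigned to a class that truly belong to that class
- recall: the share of examples truly belonging to a class that the classifier labels correctly
- F1: a measure combining recall and precision (their harmonic mean)
- balanced accuracy: an accuracy metric adjusting for class imbalance (commented out in the function above)

Because we set `average='macro'`, precision, recall, and F1 are computed separately for each label class and then averaged.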
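
To see how these numbers come about, here is a small worked example in plain Python. The labels and predictions are made up for illustration, and the helper `per_class_scores` is hypothetical; it mirrors what `precision_recall_fscore_support(..., average='macro', zero_division=0)` computes, but is only a sketch:

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall, and F1 for one label class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    # fall back to 0.0 when a denominator is zero (like zero_division=0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# made-up true labels and predictions for ten examples
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# compute the scores per class, then average them ("macro" averaging)
scores = [per_class_scores(y_true, y_pred, label) for label in (0, 1)]
macro_precision, macro_recall, macro_f1 = (sum(vals) / len(vals) for vals in zip(*scores))
print(round(macro_precision, 3), round(macro_recall, 3), round(macro_f1, 3))  # 0.7 0.708 0.697
```

In this example the macro recall (0.708) is also the balanced accuracy, since balanced accuracy is defined as the average of the per-class recalls.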

<p><a href="https://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" alt="Precisionrecall.svg" height="800" width="440"></a><br>By <a href="https://commons.wikimedia.org/wiki/User:Walber" title="User:Walber">Walber</a> - Own work, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=36926283">Link</a></p>

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

Before you start training your model, create two dictionaries that map the labels' numeric IDs to their character values and vice versa:

In [22]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
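
With more than a handful of classes, typing both mappings by hand becomes error-prone. As a small sketch, the second dictionary can be derived from the first with a dictionary comprehension:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# invert the mapping instead of typing it out a second time
label2id = {label: idx for idx, label in id2label.items()}
print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}
```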

**_Tip:_** If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [24]:
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", # <== the name of the pre-trained model (downloaded from huggingface hub)
    num_labels=2, # number of label classes
    id2label=id2label,
    label2id=label2id
)

Downloading model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


To make training as fast as possible, you want to utilize GPU computing.
When you run notebooks on Colab, you can enable GPU computing by 

1. clicking on "Runtime" in the menu,
2. selecting "Change runtime type", and
3. choosing "GPU" in the "Hardware accelerator" section of the pop-up.

If you are running this notebook elsewhere, you need to determine which kind of device you have access to:

- with a GPU &rarr; "cuda"
- with an Apple Silicon (M1/M2) chip on macOS &rarr; "mps"
- else "cpu"

We do so like this:

In [97]:
import torch
# check if GPU or MPS is available, else use CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
device = torch.device(device)
device

device(type='mps')

Once we've figured this out, we put our model on that device:

In [98]:
model.to(device);

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. You could also push the model to the Hub by setting `push_to_hub=True` (you need to be signed in to Hugging Face to upload your model), but we don't do this here. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the model on the dev split using `compute_metrics` and save a training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

In [99]:
# define the path where you want to save the fine tuned model
model_path = os.path.join('..', 'data', 'models', 'distillbert_ibmd_sentiment')

training_args = TrainingArguments(
    output_dir=model_path,
    # leave the following unchanged ;)
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    num_train_epochs=1, # <== increase this value to train for longer
    evaluation_strategy="epoch",
    save_strategy="epoch",
    # metric_for_best_model="macro_f1", # <== needs to match one of the names of the dictionary returned by `compute_metrics()` function
    save_total_limit=2,
    load_best_model_at_end=True,
)

In [100]:
trainer = Trainer(
    model=model, # the model instance you loaded above
    args=training_args, # the training args you created one cell above
    train_dataset=tokenized_imdb["train"], # the training data split
    eval_dataset=tokenized_imdb["dev"], # the dev (validation) data split
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Now we can finetune the model!

**_Warning:_** This will take long if you are using only your CPU 🥹

In [101]:
trainer.train()



  0%|          | 0/188 [00:00<?, ?it/s]

  0%|          | 0/63 [00:00<?, ?it/s]

{'eval_loss': 0.3237026631832123, 'eval_macro_f1': 0.8649987849890648, 'eval_macro_precision': 0.8650313620071685, 'eval_macro_recall': 0.8650094600378402, 'eval_runtime': 23.2899, 'eval_samples_per_second': 42.937, 'eval_steps_per_second': 2.705, 'epoch': 1.0}
{'train_runtime': 252.5856, 'train_samples_per_second': 11.877, 'train_steps_per_second': 0.744, 'train_loss': 0.6630656059752119, 'epoch': 1.0}


TrainOutput(global_step=188, training_loss=0.6630656059752119, metrics={'train_runtime': 252.5856, 'train_samples_per_second': 11.877, 'train_steps_per_second': 0.744, 'train_loss': 0.6630656059752119, 'epoch': 1.0})
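
The progress bars above count 188 training steps and 63 evaluation steps. As a quick sanity check, and assuming training on a single device without gradient accumulation (an assumption, since neither is set explicitly in this notebook), these numbers follow from the split sizes and the batch size:

```python
import math

n_train, n_dev = 3000, 1000
batch_size = 16   # per_device_train_batch_size and per_device_eval_batch_size
num_epochs = 1    # num_train_epochs

# the last batch may hold fewer than 16 examples, hence rounding up
train_steps = math.ceil(n_train / batch_size) * num_epochs
eval_steps = math.ceil(n_dev / batch_size)
print(train_steps, eval_steps)  # 188 63
```

Increasing `num_train_epochs` multiplies the number of training steps accordingly.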

In [102]:
# evaluate the final model on the held-out test set
trainer.evaluate(tokenized_imdb["test"])

  0%|          | 0/63 [00:00<?, ?it/s]

{'eval_loss': 0.3573879599571228,
 'eval_macro_f1': 0.8508566732630058,
 'eval_macro_precision': 0.8526682227784228,
 'eval_macro_recall': 0.8511376182018913,
 'eval_runtime': 23.3862,
 'eval_samples_per_second': 42.76,
 'eval_steps_per_second': 2.694,
 'epoch': 1.0}

**Interpretation**

- the macro precision of 0.85 tells us that, averaged over both label classes, the classifier is correct about 17 out of 20 times when it assigns a text to a class
- the macro recall of 0.85 tells us that, averaged over both label classes, the classifier correctly classifies about 17 in every 20 examples that truly belong to a class
- the macro F1 score just summarizes these values in one score

Overall our classifier performs pretty well even with only 3000 training examples. 🥳

### Save the model and tokenizer for re-use

In [126]:
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

('./../models/distillbert_ibmd_sentiment/tokenizer_config.json',
 './../models/distillbert_ibmd_sentiment/special_tokens_map.json',
 './../models/distillbert_ibmd_sentiment/vocab.txt',
 './../models/distillbert_ibmd_sentiment/added_tokens.json',
 './../models/distillbert_ibmd_sentiment/tokenizer.json')

Clean up all other checkpoints:

In [134]:
import os

os.listdir(model_path)

['tokenizer_config.json',
 'special_tokens_map.json',
 'config.json',
 'tokenizer.json',
 'training_args.bin',
 'checkpoint-79',
 'vocab.txt',
 'pytorch_model.bin',
 'checkpoint-188']

In [137]:
checkpoints = [fn for fn in os.listdir(model_path) if fn.startswith('checkpoint-')]
checkpoints

['checkpoint-79', 'checkpoint-188']

In [138]:
import shutil

# remove the checkpoint folders
for checkpoint in checkpoints:
    dir_path = os.path.join(model_path, checkpoint)
    shutil.rmtree(dir_path)
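
Since `shutil.rmtree` deletes folders irreversibly, it can be worth rehearsing the clean-up on a throwaway directory first. The following sketch uses `tempfile` and `pathlib` to mimic the layout of the model folder; the file and folder names are only stand-ins:

```python
import shutil
import tempfile
from pathlib import Path

# build a throwaway directory that mimics the layout of the model folder
model_dir = Path(tempfile.mkdtemp())
(model_dir / "config.json").write_text("{}")
for name in ("checkpoint-79", "checkpoint-188"):
    (model_dir / name).mkdir()

# remove only sub-directories whose name starts with "checkpoint-"
for path in list(model_dir.glob("checkpoint-*")):
    if path.is_dir():
        shutil.rmtree(path)

remaining = sorted(p.name for p in model_dir.iterdir())
print(remaining)  # ['config.json']

# tidy up the throwaway directory itself
shutil.rmtree(model_dir)
```

The `is_dir()` check guards against deleting a regular file that merely happens to start with `checkpoint-`.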


### Detailed look at the classifier's output

Let's create predictions for the first three examples in the test set:

In [103]:
preds = trainer.predict(tokenized_imdb["test"].select([0, 1, 2]))

  0%|          | 0/1 [00:00<?, ?it/s]

In [104]:
type(preds)

transformers.trainer_utils.PredictionOutput

In [105]:
type(preds.predictions)

numpy.ndarray

In [106]:
preds.predictions.shape

(3, 2)

The prediction array has two dimensions:

- the first axis ('rows') corresponds to the *number of examples* for which we generated predictions
- the second axis ('columns') corresponds to the *number of label classes* for which the model outputs raw scores (logits) when predicting

Let's look at the scores for the first example:

In [112]:
preds.predictions[0]

[1.7259933948516846, -1.9657995700836182]

The first score is larger than the second one.
This means that the model considers the given example more similar to examples from the first label class: documents with negative sentiment.

In [113]:
id2label[0]

'NEGATIVE'

To convert those scores into something probability-like, we apply the so-called softmax transformation, which rescales the values such that they each range between 0 and 1 and sum to 1:

In [115]:
from scipy.special import softmax

softmax(preds.predictions[0])

array([0.975679  , 0.02432101], dtype=float32)
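
For intuition, the softmax of a score vector $z$ is $\exp(z_i) / \sum_j \exp(z_j)$. Here is a minimal plain-Python sketch of that formula (an illustration, not scipy's implementation), applied to the two scores of the first example:

```python
import math

def softmax(scores):
    """Rescale raw scores (logits) so they lie between 0 and 1 and sum to 1."""
    # subtracting the maximum does not change the result but avoids overflow
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.7259933948516846, -1.9657995700836182]  # scores of the first example
probs = softmax(logits)
print([round(p, 4) for p in probs])  # [0.9757, 0.0243]
```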

We can also call this function on all examples' prediction scores in our current batch:

In [118]:
pred_probs = softmax(preds.predictions, axis=1)
pred_probs

array([[0.975679  , 0.02432101],
       [0.97882915, 0.02117081],
       [0.05613839, 0.9438616 ]], dtype=float32)

Now if you want to know for each row in which cell the value is the largest, you can call the `argmax()` method on the numpy array:

In [119]:
pred_probs.argmax(axis=1)

array([0, 0, 1])

This turns prediction scores into predicted labels:

In [123]:
[id2label[pp] for pp in preds.predictions.argmax(axis=1)]

['NEGATIVE', 'NEGATIVE', 'POSITIVE']
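
Often you want the predicted label together with its probability. As a sketch in plain Python, using the probabilities printed above (copied by hand) and the same `id2label` mapping:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# predicted probabilities copied from the output above
pred_probs = [
    [0.975679, 0.02432101],
    [0.97882915, 0.02117081],
    [0.05613839, 0.9438616],
]

results = []
for row in pred_probs:
    best = max(range(len(row)), key=lambda i: row[i])  # index of the largest value
    results.append({"label": id2label[best], "score": row[best]})
print(results[2])  # {'label': 'POSITIVE', 'score': 0.9438616}
```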

## Using the model for labeling texts 

When you have saved your finetuned model, you can always re-load it to label texts.
In machine learning this is called "inference" &mdash; which is unfortunate given the meaning of the term in positive social science methodology.

So let's just call it **prediction**.

In [129]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it:

In [130]:
from transformers import pipeline

# set environment variable TOKENIZERS_PARALLELISM=false
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

classifier = pipeline("sentiment-analysis", model=model_path)
classifier(text)

[{'label': 'POSITIVE', 'score': 0.9153976440429688}]

You can also manually replicate the results of the `pipeline` if you'd like. To do so, tokenize the text and return PyTorch tensors, pass your inputs to the model to get the `logits`, and then look up the label of the class with the highest logit:

In [133]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_path)
inputs = tokenizer(text, return_tensors="pt")

model = AutoModelForSequenceClassification.from_pretrained(model_path)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id] # <== use the 'id2label' we've added to the model we saved

'POSITIVE'

## Outlook

DistilBERT is not the only pre-trained model you can fine-tune for sequence classification.
The Hugging Face `transformers` library also supports this for many other models, for example:

- [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert)
- [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert)
- [DeBERTa-v2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta-v2)
- [DistilBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/distilbert)
- [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2)
- [LLaMA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/llama)
- [Longformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longformer)
- [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart)
- [RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta)
- [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta)

