# Fine-tune a transformer model for sentence classification

| Authors | Last update |
|:------ |:----------- |
| Hauke Licht (https://github.com/haukelicht) | 2023-09-29 |

This notebook shows how to use the Hugging Face 🤗 `transformers` library to train a Transformer-based sentence classifier with transfer learning (i.e., "fine-tuning").

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/haukelicht/advanced_text_analysis/blob/main/notebooks/04_transformers_sentence_classification.ipynb)

**_Source:_** The notebook is adapted from the one distributed with this tutorial: https://huggingface.co/docs/transformers/tasks/sequence_classification

**Table of contents**

- Setup
- Supervised text classification
- Preparing the data
    - loading the dataset
    - preprocessing
- How to evaluate model performance
- Train
    - setting up the GPU (if available)
    - preparing the trainer
    - training
    - test set evaluation
    - storing the model
- Inference
- Outlook
- Appendix
    - reproducibility
    - data loading
    - data splitting
    - multi-class classification

## Setup

If you run this notebook on Google Colab or you have not yet installed the `transformers` and `datasets` python libraries, you need to do so first:

In [None]:
# Transformers installation
! pip install accelerate transformers datasets
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

In [95]:
import os
import shutil

import numpy as np

# dataset loading
from datasets import load_dataset, DatasetDict 

# used to tokenize text
from transformers import AutoTokenizer

# used to load the pre-trained model
from transformers import AutoModelForSequenceClassification

# used to finetune the pre-trained model
from transformers import Trainer, TrainingArguments


## Supervised text classification

Text classification means assigning a label or class to each text in a corpus.
It is a common NLP and computational text analysis task.

A common and popular text classification task is **_sentiment analysis_**.
Sentiment analysis assigns a label like 🙂 'positive', 🙁 'negative', or 😐 'neutral' to a sequence of text, for example a sentence or paragraph.

### Ingredients

Here is what you need for training a supervised sequence classifier through finetuning (i.e. transfer learning):

- a **pre-trained model** you can fine-tune for sequence classification
    - the tokenizer comes with this too
- a pre-defined set of **label classes** (e.g., 'positive', 'neutral', 'negative')
- a **labeled dataset**, i.e., a corpus of texts (e.g., sentences) in which each document has been assigned to a single label class
    - this we split into train, development, and test set
- some **metric to quantify classification performance** so that we know how well our classifier is doing

### This notebook

In this notebook, we will use

1. the [IMDb](https://huggingface.co/datasets/imdb) corpus, which records movie reviews that have been classified as positive or negative, and
2. finetune a [DistilBERT](https://huggingface.co/distilbert-base-uncased) model

## Preparing the data

### Load the dataset

We'll use the IMDb dataset from the 🤗 `datasets` library.
However, the train and test splits ([?](https://chat.openai.com/share/d71207ff-d374-4540-927a-83ac5370cd8f)) of this dataset each contain 25,000 documents.
Using them all would result in very slow training, so we'll just use a subset.

In [87]:
n_train = 3000
n_dev = 1000
n_test = 1000

# sample without replacement
idxs = np.random.choice(25_000, n_train + n_dev + n_test, replace=False)

# split the indices into train, dev, and test
train_idxs = idxs[:n_train]
dev_idxs = idxs[n_train:(n_train+n_dev)]
test_idxs = idxs[-n_test:]

# show the number of examples in each split
len(train_idxs), len(dev_idxs), len(test_idxs)

(3000, 1000, 1000)

In [88]:
imdb = DatasetDict({
    "train": load_dataset("imdb", split='train').select(train_idxs),
    "dev": load_dataset("imdb", split='test').select(dev_idxs),
    "test": load_dataset("imdb", split='test').select(test_idxs),
})

The `imdb` object is an instance of the `DatasetDict` class from the `datasets` library.

In [142]:
type(imdb)

datasets.dataset_dict.DatasetDict

This class is there to gather several pre-defined splits of a dataset.

Among these splits, one is usually named "train" and another one "test" (see next cell).

**_Note:_** It'll become clearer further below why we need these splits. 

In [143]:
imdb.keys()

dict_keys(['train', 'dev', 'test'])

In [144]:
len(imdb['train']), len(imdb['dev']), len(imdb['test'])

(3000, 1000, 1000)

Here is how you can access one "example" (i.e., observation) in the "test" split:

In [145]:
type(imdb["test"])

datasets.arrow_dataset.Dataset

In [146]:
imdb["test"][0]

{'text': "It's hard to believe that with a cast as strong as this one has, that this movie can be such a dud. It's such an incredibly horrible film. How was it ever made? How did so many good actors wind up in such a terrible film? Don't waste your life. Don't watch even one moment of this film.",
 'label': 0}

This shows that *each* example in the test split has two fields:

- `text`: the movie review text.
- `label`: a value that is either `0` for a negative review or `1` for a positive review.

**_Important:_** check that the splits have about equal label class distributions:

In [147]:
print('% "pos" in train:', np.mean([ex['label'] for ex in imdb["train"]]))
print('% "pos" in dev:', np.mean([ex['label'] for ex in imdb["dev"]]))
print('% "pos" in test:', np.mean([ex['label'] for ex in imdb["test"]]))


% "pos" in train: 0.5076666666666667
% "pos" in dev: 0.499
% "pos" in test: 0.498
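The mean of the labels equals the share of positive examples only because the two classes are coded as `0` and `1`. With more than two label classes, counting examples per class is the more general check. Here is a minimal sketch; the helper `label_shares` and the toy labels are made up for illustration and are not part of the IMDb data:

```python
from collections import Counter

def label_shares(labels):
    """Return the share of each label class in a list of labels."""
    counts = Counter(labels)
    n = len(labels)
    return {label: count / n for label, count in sorted(counts.items())}

# hypothetical labels with three classes (not taken from the IMDb data)
toy_labels = [0, 1, 1, 0, 2, 1, 0, 1]
shares = label_shares(toy_labels)
print(shares)  # {0: 0.375, 1: 0.5, 2: 0.125}
```

You could apply the same helper to a list like `[ex['label'] for ex in imdb["train"]]` and compare the resulting dictionaries across splits.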


### Preprocessing texts

The next step is to load a DistilBERT tokenizer to preprocess the `text` field:

In [158]:
MODEL_NAME = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# read about the "WordPiece" (used by DistilBERT), "byte-pair encoding", and "sentence-piece" algorithms if you are interested in how tokenizers work

The tokenizer is a so-called *callable* and can thus be used like a function:
If you input a text string, it returns a dictionary with the tokenized text and additional information.

In [160]:
toks = tokenizer("Hello, this one sentence! [SEP] And this is another one.")
print(toks.keys())

dict_keys(['input_ids', 'attention_mask'])


- The field 'input_ids' holds the integer IDs that represent the tokens of the example sentence.
- The 'attention_mask' tells the model which tokens in a batch of sentences it should pay attention to (`1`) and which it can ignore (`0`), for example padding tokens.

In [163]:
toks['input_ids']
print(toks['input_ids'])
tokenizer.convert_ids_to_tokens(toks['input_ids'])

[101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 2003, 2178, 2028, 1012, 0, 102]


['[CLS]',
 'hello',
 ',',
 'this',
 'one',
 'sentence',
 '!',
 '[SEP]',
 'and',
 'this',
 'is',
 'another',
 'one',
 '.',
 '[PAD]',
 '[SEP]']

In [164]:
print(toks['attention_mask'])

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


Let's create a helper function that tokenizes the `text` values of its input, which we call `examples`.
This will allow us to iterate over examples in our dataset splits (e.g., `imdb["test"]`) and pre-process them one by one.

**_Note:_** By setting `truncation=True`, we ensure that none of the text sequences we'll use for fine-tuning is too long for DistilBERT to handle.

In [60]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
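To make concrete what truncation does, here is a minimal sketch in plain Python. It assumes a maximum sequence length of 512 tokens (DistilBERT's limit) and reuses the IDs `101` and `102` that the tokenizer output above shows for `[CLS]` and `[SEP]`. The helper `truncate_ids` is hypothetical and only mimics the idea; it is not the tokenizer's actual implementation:

```python
CLS_ID, SEP_ID = 101, 102  # IDs of [CLS] and [SEP] as shown in the tokenizer output above

def truncate_ids(content_ids, max_length=512):
    """Hypothetical helper: cut the content tokens so that the full sequence,
    including [CLS] and [SEP], has at most `max_length` tokens."""
    n_special = 2  # one [CLS] at the start, one [SEP] at the end
    kept = content_ids[: max_length - n_special]
    return [CLS_ID] + kept + [SEP_ID]

long_review = list(range(1000, 1700))  # 700 made-up content token IDs
short_review = [7592, 2088]            # 2 made-up content token IDs

print(len(truncate_ids(long_review)))   # 512: the tail of the long text is dropped
print(len(truncate_ids(short_review)))  # 4: short texts are left untouched
```

The practical consequence is that for very long reviews the classifier only ever sees the beginning of the text.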

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [165]:
# we need to do this because we want to add the input IDs and 
#  attention mask values to each example in each of the data splits
tokenized_imdb = imdb.map(preprocess_function, batched=True)

Map:   0%|          | 0/3000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
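With `batched=True`, the function passed to `map` does not receive one example at a time but a dictionary of lists, with one list per field. This is why `examples["text"]` in `preprocess_function` is a list of strings. The following plain-Python sketch imitates this idea; `map_batched` and `toy_preprocess` are hypothetical stand-ins and not the `datasets` implementation:

```python
def toy_preprocess(examples):
    """Stand-in for `preprocess_function`: receives a dict of lists when batched."""
    # `examples["text"]` is a list of strings, one per example in the batch
    return {"n_chars": [len(text) for text in examples["text"]]}

def map_batched(rows, fn, batch_size=2):
    """Hypothetical imitation of the idea behind `map(..., batched=True)`."""
    out = []
    for start in range(0, len(rows), batch_size):
        batch_rows = rows[start:start + batch_size]
        # turn a list of dicts into a dict of lists (one list per field)
        batch = {key: [row[key] for row in batch_rows] for key in batch_rows[0]}
        new_cols = fn(batch)
        # add the newly computed fields back to each example
        for i, row in enumerate(batch_rows):
            out.append({**row, **{key: values[i] for key, values in new_cols.items()}})
    return out

rows = [
    {"text": "great film", "label": 1},
    {"text": "boring", "label": 0},
    {"text": "fine", "label": 1},
]
mapped = map_batched(rows, toy_preprocess)
print(mapped[0])  # {'text': 'great film', 'label': 1, 'n_chars': 10}
```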

In [174]:
type(imdb)
imdb.keys()
imdb['train']
imdb['train'][0]


{'text': 'As far as the movie goes, it\'s an OK science fiction movie. It has a lot of cool stuff in it, and some quality scenes. That said, it\'s not that good, and some of the stuff is pretty far fetched...<br /><br />As for calling this another cube-movie is utter and complete bullsh!t. This is the very definition of milking a great and inventive original movie... The whole feel to it can be somewhat translated into the core of the first, but the introduction of people/androids as part of the "team" behind the cube itself is somewhat a stretch...<br /><br />I gave this a 3*** because of the backstabbing of the original. This one should have been kept sterile in so many parts of the movie that there is no place or time to mention them all...<br /><br />Watchable for those who have not seen Cube & Hypercube, but not recommendable for fans of the series...',
 'label': 0}

In [175]:
type(tokenized_imdb)
tokenized_imdb.keys()
tokenized_imdb['train'][0]#.keys()

{'text': 'As far as the movie goes, it\'s an OK science fiction movie. It has a lot of cool stuff in it, and some quality scenes. That said, it\'s not that good, and some of the stuff is pretty far fetched...<br /><br />As for calling this another cube-movie is utter and complete bullsh!t. This is the very definition of milking a great and inventive original movie... The whole feel to it can be somewhat translated into the core of the first, but the introduction of people/androids as part of the "team" behind the cube itself is somewhat a stretch...<br /><br />I gave this a 3*** because of the backstabbing of the original. This one should have been kept sterile in so many parts of the movie that there is no place or time to mention them all...<br /><br />Watchable for those who have not seen Cube & Hypercube, but not recommendable for fans of the series...',
 'label': 0,
 'input_ids': [101,
  2004,
  2521,
  2004,
  1996,
  3185,
  3632,
  1010,
  2009,
  1005,
  1055,
  2019,
  7929,


Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). 

Our `data_collator` instance of this class will handle preprocessing and a thing called "padding" when sampling batches of examples during finetuning to iteratively update our classifier's parameters.

*Padding* means that you make all text sequences in a set of sequences the same length.
To do this, we just append a special padding token (`[PAD]` in DistilBERT's vocabulary, written `<PAD>` in the illustration below) to the shorter text sequences in the set.
For example, the (tokenized) sequences in the following set 

```json
[
    ['Hello', 'world', '!'               ],
    ['Have',  'a',     'nice', 'day', '!']
]
```

will be "padded" to 

```json
[
    ['Hello', 'world', '!',    '<PAD>', '<PAD>'],
    ['Have',  'a',     'nice', 'day',   '!'    ]
]
```

In [182]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
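To see how padding and the attention mask fit together, here is a minimal sketch in plain Python. It assumes that `[PAD]` has the ID `0`, as in the tokenizer output further above. The helper `pad_batch` is hypothetical and only illustrates what a padding collator does conceptually:

```python
PAD_ID = 0  # ID of [PAD], as shown in the tokenizer output further above

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Hypothetical helper: pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding it can ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[101, 7592, 999, 102], [101, 2023, 2028, 6251, 999, 102]])
print(batch["input_ids"][0])       # [101, 7592, 999, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 0, 0]
```

Padding per batch (rather than padding the whole dataset to one length) keeps batches of short texts small, which saves computation.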

## How to evaluate model performance

Including a metric during training is often helpful for evaluating your model's performance.

**_Note:_** You could also just load an evaluation metric with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric).

Let's create a function that passes your predictions and labels to calculate some central metrics (explanations below):

In [207]:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    p, r, f1, _ = precision_recall_fscore_support(y_true=labels, y_pred=predictions, average='binary', zero_division=0)
    metrics = {
        "f1": f1,
        "precision": p,
        "recall": r,
    }
    return metrics

We consider the following metrics:

- precision: the share of examples the classifier assigns to the positive class that truly belong to it
- recall: the share of truly positive examples the classifier labels as positive
- F1: the harmonic mean of precision and recall
- balanced accuracy: an accuracy metric that adjusts for class imbalance (not computed by `compute_metrics()` above)
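Balanced accuracy is commonly defined as the mean of the per-class recalls. Below is a minimal from-scratch sketch of that definition in plain Python; the function name `balanced_accuracy` is our own, and the toy labels are made up:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in sorted(set(y_true)):
        # positions of all examples that truly belong to this class
        idxs = [i for i, y in enumerate(y_true) if y == cls]
        n_correct = sum(y_pred[i] == cls for i in idxs)
        recalls.append(n_correct / len(idxs))
    return sum(recalls) / len(recalls)

bal_acc = balanced_accuracy(y_true=[0, 1, 1], y_pred=[0, 0, 1])
print(bal_acc)  # (1/1 + 1/2) / 2 = 0.75
```

Because each class contributes equally to the average, a classifier cannot score well by only predicting the majority class.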

<p><a href="https://commons.wikimedia.org/wiki/File:Precisionrecall.svg#/media/File:Precisionrecall.svg"><img src="https://upload.wikimedia.org/wikipedia/commons/2/26/Precisionrecall.svg" alt="Precisionrecall.svg" height="800" width="440"></a><br>By <a href="//commons.wikimedia.org/wiki/User:Walber" title="User:Walber">Walber</a> - <span class="int-own-work" lang="en">Own work</span>, <a href="https://creativecommons.org/licenses/by-sa/4.0" title="Creative Commons Attribution-Share Alike 4.0">CC BY-SA 4.0</a>, <a href="https://commons.wikimedia.org/w/index.php?curid=36926283">Link</a></p>

In [208]:
p, r, f1, _ = precision_recall_fscore_support(y_true=[0, 1, 1], y_pred=[0, 0, 1], average='binary', zero_division=0)
p, r, f1

(1.0, 0.5, 0.6666666666666666)

In [211]:
2*(p*r / (p+r))

0.6666666666666666
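The same numbers can be derived by hand from the counts of true positives, false positives, and false negatives. Here is a minimal sketch in plain Python; the helper `precision_recall_f1` is our own and only meant to make the definitions explicit:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# same toy example as above: precision 1.0, recall 0.5, F1 about 0.667
prec, rec, f1_score = precision_recall_f1(y_true=[0, 1, 1], y_pred=[0, 0, 1])
print(prec, rec, round(f1_score, 3))
```

In the toy example there is one true positive, no false positive, and one false negative, which is why precision is perfect while recall is only one half.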

Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

## Train

Before you start training your model, create two dictionaries that map the labels' numeric IDs to their character values and vice versa:

In [22]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
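Writing both dictionaries by hand invites mismatches once you have more than a few label classes. One option, sketched below, is to define only `id2label` and derive `label2id` from it with a dictionary comprehension:

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# invert the mapping so that both dictionaries always agree
label2id = {label: idx for idx, label in id2label.items()}
print(label2id)  # {'NEGATIVE': 0, 'POSITIVE': 1}

# sanity check: mapping an ID to its label and back returns the same ID
round_trip_ok = all(label2id[id2label[idx]] == idx for idx in id2label)
print(round_trip_ok)  # True
```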

**_Note:_** If you aren't familiar with finetuning a model with the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer), take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/tasks/../training#train-with-pytorch-trainer)!

You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [212]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, # <== the name of the pre-trained model (downloaded from huggingface hub)
    num_labels=2, # number of label classes (adapt this if you have, e.g., 4 label classes)
    id2label=id2label,
    label2id=label2id
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Inspect the model architecture

In [213]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

### Using the GPU (if available)

To make training as fast as possible, you want to utilize GPU computing.
When you run notebooks on Colab, you can enable GPU computing by 

1. clicking on "Runtime" in the menu,
2. selecting "Change runtime type", and
3. choosing "GPU" in the "Hardware accelerator" section of the pop-up.

If you are running this notebook elsewhere, you need to determine what kind of device you have access to:

- with a GPU &rarr; "cuda"
- with MacOS's M1/M2 chip &rarr; "mps"
- else "cpu"

We do so like this:

In [226]:
import torch
# check if GPU or MPS is available, else use CPU
device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
device = torch.device(device)
device

device(type='mps')

Once we've figured this out, we put our model on that device:

In [98]:
# IMPORTANT: put this thing to the respective device (e.g., GPU)
model.to(device);

At this point, only three steps remain:

1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments). The only required parameter is `output_dir`, which specifies where to save your model. At the end of each epoch, the [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) will evaluate the model on the development set (using `compute_metrics`) and save the training checkpoint.
2. Pass the training arguments to [Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
3. Call [train()](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.train) to finetune your model.

### Preparing the trainer

In [227]:
# define the path where you want to save the fine tuned model
model_path = os.path.join('..', 'data', 'models', 'distillbert_ibmd_sentiment')

training_args = TrainingArguments(
    output_dir=model_path,
    # leave the following unchanged ;)
    learning_rate=2e-5,
    per_device_train_batch_size=16, # <== reduce only if you get a "CUDA out of memory" error
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    # increase this value to train for longer
    num_train_epochs=2,
    evaluation_strategy="epoch",
    # how to save and determine ("best") model
    save_strategy="epoch",
    metric_for_best_model="f1", # <== needs to match one of the names of the dictionary returned by `compute_metrics()` function
    load_best_model_at_end=True,
    save_total_limit=2,
)

In [229]:
trainer = Trainer(
    model=model, # the model instance you loaded two cells above
    args=training_args, # the training args you created one cell above
    train_dataset=tokenized_imdb["train"], # the training data split
    eval_dataset=tokenized_imdb["dev"], # the development data split (used for evaluation after each epoch)
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

### Train

Now we can finetune the model!

**_Warning:_** This will take long if you are using only your CPU 🥹

In [230]:
trainer.train()



  0%|          | 0/376 [00:00<?, ?it/s]

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/63 [00:00<?, ?it/s]

{'eval_loss': 0.2353682667016983, 'eval_f1': 0.9089068825910932, 'eval_precision': 0.918200408997955, 'eval_recall': 0.8997995991983968, 'eval_runtime': 26.9237, 'eval_samples_per_second': 37.142, 'eval_steps_per_second': 2.34, 'epoch': 1.0}


  0%|          | 0/63 [00:00<?, ?it/s]

{'eval_loss': 0.25140172243118286, 'eval_f1': 0.9069767441860467, 'eval_precision': 0.8780487804878049, 'eval_recall': 0.9378757515030061, 'eval_runtime': 26.6467, 'eval_samples_per_second': 37.528, 'eval_steps_per_second': 2.364, 'epoch': 2.0}
{'train_runtime': 609.4697, 'train_samples_per_second': 9.845, 'train_steps_per_second': 0.617, 'train_loss': 0.2808544077771775, 'epoch': 2.0}


TrainOutput(global_step=376, training_loss=0.2808544077771775, metrics={'train_runtime': 609.4697, 'train_samples_per_second': 9.845, 'train_steps_per_second': 0.617, 'train_loss': 0.2808544077771775, 'epoch': 2.0})
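The step counts in the output above (376 training steps, 63 evaluation steps) follow from the dataset sizes and the batch size. The sketch below reproduces this arithmetic, assuming training on a single device without gradient accumulation:

```python
import math

n_train, n_dev = 3000, 1000  # sizes of the train and dev splits
batch_size = 16              # per-device batch size set in the training arguments
n_epochs = 2                 # number of training epochs

# the last batch of an epoch may be smaller, hence rounding up
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * n_epochs
eval_steps = math.ceil(n_dev / batch_size)

print(steps_per_epoch, total_steps, eval_steps)  # 188 376 63
```

This also explains the names of the checkpoint folders: with `save_strategy="epoch"`, a checkpoint is written after steps 188 and 376.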

### Test set evaluation

In [231]:
# evaluate the final model on the held-out test set
trainer.evaluate(tokenized_imdb["test"])

  0%|          | 0/63 [00:00<?, ?it/s]

{'eval_loss': 0.2442506104707718,
 'eval_f1': 0.8967611336032388,
 'eval_precision': 0.9040816326530612,
 'eval_recall': 0.8895582329317269,
 'eval_runtime': 27.5243,
 'eval_samples_per_second': 36.332,
 'eval_steps_per_second': 2.289,
 'epoch': 2.0}

**Interpretation**

- the precision of 0.90 tells us that the classifier is correct about 9 out of 10 times when it says a text has "positive" sentiment (our positive label class)
- the recall of 0.89 tells us that the classifier correctly classifies about 9 in every 10 "true" positive-sentiment examples
- the F1 score of 0.90 just summarizes these two values in one score

Overall, our classifier performs pretty well even with only 3,000 training examples. 🥳

### Save the model and tokenizer for re-use

In [232]:
trainer.save_model(model_path)

Clean up all other checkpoints:

In [233]:
os.listdir(model_path)

['.DS_Store',
 'tokenizer_config.json',
 'special_tokens_map.json',
 'config.json',
 'tokenizer.json',
 'training_args.bin',
 'vocab.txt',
 'pytorch_model.bin',
 'checkpoint-188',
 'checkpoint-376']

In [234]:
checkpoints = [fn for fn in os.listdir(model_path) if fn.startswith('checkpoint-')]
checkpoints

['checkpoint-188', 'checkpoint-376']

In [235]:
import shutil

# remove the checkpoint folders
for checkpoint in checkpoints:
    dir_path = os.path.join(model_path, checkpoint)
    shutil.rmtree(dir_path)
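Because `shutil.rmtree()` deletes folders irreversibly, it can be worth rehearsing the clean-up on a throwaway directory first. The sketch below does this with a temporary directory whose contents are made up to mimic the listing above; it additionally checks that a path is a directory before deleting it:

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # made-up folder contents mimicking a saved model directory
    for fn in ["config.json", "vocab.txt"]:
        open(os.path.join(tmp_dir, fn), "w").close()
    for dn in ["checkpoint-188", "checkpoint-376"]:
        os.makedirs(os.path.join(tmp_dir, dn))

    for name in os.listdir(tmp_dir):
        path = os.path.join(tmp_dir, name)
        # only delete directories whose name marks them as checkpoints
        if name.startswith("checkpoint-") and os.path.isdir(path):
            shutil.rmtree(path)

    remaining = sorted(os.listdir(tmp_dir))

print(remaining)  # ['config.json', 'vocab.txt']
```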


### Detailed look at the classifier's output

Let's create predictions for the first three examples in the test set:

In [103]:
preds = trainer.predict(tokenized_imdb["test"].select([0, 1, 2]))

  0%|          | 0/1 [00:00<?, ?it/s]

In [104]:
type(preds)

transformers.trainer_utils.PredictionOutput

In [105]:
type(preds.predictions)

numpy.ndarray

In [106]:
preds.predictions.shape

(3, 2)

The prediction array has two dimensions:

- the first axis ('rows') corresponds to the *number of examples* for which we generated predictions
- the second axis ('columns') corresponds to the *number of label classes* we generate probability-like scores for when predicting

Let's look at the scores for the first example:

In [112]:
preds.predictions[0]

[1.7259933948516846, -1.9657995700836182]

The first score is larger than the second one.
This means that the classifier considers the given example more likely to belong to the first label class: documents with negative sentiment.

In [113]:
id2label[0]

'NEGATIVE'

To convert those scores into something probability-like, we apply the so-called softmax transformation, which rescales the values such that each ranges between 0 and 1 and they sum to 1:

In [244]:
from scipy.special import softmax

softmax(preds.predictions[0])

array([0.975679  , 0.02432101], dtype=float32)

We can also call this function on all examples' prediction scores in our current batch:

In [118]:
pred_probs = softmax(preds.predictions, axis=1)
pred_probs

array([[0.975679  , 0.02432101],
       [0.97882915, 0.02117081],
       [0.05613839, 0.9438616 ]], dtype=float32)
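The softmax transformation itself is simple: exponentiate each score and divide by the sum of the exponentiated scores in the same row. Here is a minimal NumPy sketch of this; the function `softmax_rows` is our own, and subtracting the row maximum first is a common trick for numerical stability that does not change the result:

```python
import numpy as np

def softmax_rows(scores):
    """Apply the softmax transformation to each row of a 2D array of scores."""
    scores = np.asarray(scores, dtype=np.float64)
    # subtract the row-wise maximum to avoid overflow in np.exp()
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# scores of the first test example shown above
probs = softmax_rows([[1.7259933948516846, -1.9657995700836182]])
print(probs.round(4))  # approximately 0.9757 and 0.0243
```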

In [236]:
[ex['text'] for ex in tokenized_imdb["test"].select([0, 1, 2])]

["It's hard to believe that with a cast as strong as this one has, that this movie can be such a dud. It's such an incredibly horrible film. How was it ever made? How did so many good actors wind up in such a terrible film? Don't waste your life. Don't watch even one moment of this film.",
 "One of the worst Arnold movies I've seen. Special effects were terrible. Script was horrible. Hopefully his next movie will be much better like T2, Total Recall, True Lies and Eraser(not as good as the rest). Watch Stigmata if you want to see an apocalyptic future movie. It's much better.",
 "I saw this on DVD with subtitles, which made it a little frustrating to get through, because of the film's length. But I was riveted throughout all of it. That I was fascinated by the characters and always engrossed in the story, despite the subtitles, is a testament to the film's power. It's an amazing piece of work. I have it on my list of ten favorite films of all time. It's easily the best foreign film I'v

Now if you want to know for each row in which cell the value is the largest, you can call the `argmax()` method on the numpy array:

In [119]:
pred_probs.argmax(axis=1)

array([0, 0, 1])

This turns prediction scores into predicted labels:

In [123]:
[id2label[pp] for pp in preds.predictions.argmax(axis=1)]

['NEGATIVE', 'NEGATIVE', 'POSITIVE']

## Inference

"Inference" = "Using the model for labeling texts"

When you have saved your finetuned model, you can always re-load it to label texts.
In machine learning this is called "inference" &mdash; which is unfortunate given the meaning of the term in positive social science methodology.

So let's just call it **prediction**.

In [246]:
from transformers import pipeline

sentiment_classifier = pipeline(task="text-classification", model=model_path)

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it:

In [247]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
print(sentiment_classifier(text))

[{'label': 'POSITIVE', 'score': 0.945286750793457}]


In [248]:
text = imdb["test"][0]["text"]
sentiment_classifier(text)
# TODO: figure out if you can make this deterministic

[{'label': 'NEGATIVE', 'score': 0.9645718932151794}]

You can also manually replicate the results of the `pipeline` if you'd like.
To do so, tokenize the text and return PyTorch tensors, then pass your inputs to the model and return the `logits`:

In [133]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained(model_path)
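inputs = tokenizer(text, truncation=True, return_tensors="pt") # truncate (as during fine-tuning) so that long texts don't exceed the model's maximum length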

model = AutoModelForSequenceClassification.from_pretrained(model_path)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id] # <== use the 'id2label' we've added to the model we saved

'POSITIVE'
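The cell above only recovers the predicted label. To also obtain a score like the one the pipeline reports, you can apply the softmax transformation to the logits and take the probability of the predicted class. The following plain-Python sketch illustrates this under the assumption that the score is the softmax probability of the top class. The helper `logits_to_prediction` is hypothetical, and the logits are the scores of the first test example from the section on the classifier's output, so the result need not match the cell output above:

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}

def logits_to_prediction(logits):
    """Hypothetical helper: turn one example's logits into a pipeline-style dict."""
    max_logit = max(logits)
    # softmax with the maximum subtracted for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

pred = logits_to_prediction([1.7259933948516846, -1.9657995700836182])
print(pred["label"], round(pred["score"], 4))  # NEGATIVE, with a score of about 0.9757
```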

## Outlook

DistilBERT is not the only pre-trained model you can fine-tune for sequence classification.
The Hugging Face `transformers` library also supports this for many other models, for example

- [BERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/bert)
- [CamemBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/camembert)
- [DeBERTa-v2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/deberta-v2)
- [DistilBERT](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/distilbert)
- [OpenAI GPT-2](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/gpt2)
- [LLaMA](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/llama)
- [Longformer](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/longformer)
- [mBART](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/mbart)
- [RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/roberta)
- [XLM-RoBERTa](https://huggingface.co/docs/transformers/main/en/tasks/../model_doc/xlm-roberta)



## Appendix

### Reproducibility

In programming, randomness is governed by *Random Number Generator* (RNG) algorithms.
You can control randomness by setting a so called ["seed"](https://towardsdatascience.com/random-seeds-and-reproducibility-933da79446e3) that determines an RNG's initial state.
By setting a seed at the beginning of your script, each random value you generate (e.g., like [this](https://realpython.com/numpy-random-number-generator/)) will be the same at each run of your script, as long as you run all the code in the script in the same order (e.g., cell by cell, from top to bottom).

**_Important:_** If you set the seed but program interactively, the order in which you call individual code chunks will vary from one interactive session to the next.
So setting a seed at the beginning of the script makes its execution reproducible only for runs from top to bottom (without user interaction in between).
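A small sketch can make the role of the seed tangible. It uses NumPy's `default_rng()`; the helper `draw` and the seed values are arbitrary choices for illustration:

```python
import numpy as np

def draw(seed, n=3):
    """Create a fresh RNG with the given seed and draw `n` random integers."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 100, size=n)

first_run = draw(seed=42)
second_run = draw(seed=42)

# the same seed and the same sequence of calls give the same values
same_values = np.array_equal(first_run, second_run)
print(same_values)  # True
```

If you instead kept drawing from one and the same `rng` object, each call would advance its internal state, which is why the order of execution matters.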

The packages you will use for randomizing computations have notes on the topic of reproducibility you should read:

- `random`: https://docs.python.org/3/library/random.html#notes-on-reproducibility
- `numpy`: https://numpy.org/doc/stable/reference/random/generator.html (but read also [here](https://builtin.com/data-science/numpy-random-seed), [here](https://albertcthomas.github.io/good-practices-random-number-generators/), and [here](https://stackoverflow.com/a/5837352))
- `pandas`: random number generation inside `pandas` code uses `numpy` under the hood.
- `torch` and all packages using it (e.g., `transformers`): https://pytorch.org/docs/stable/notes/randomness.html
- `transformers`: introduced [here](https://github.com/huggingface/transformers/pull/16907)

In [None]:
import random
import numpy as np
import torch
from transformers import set_seed

SEED = 42 # you can choose any integer value

# set seed for random package
random.seed(SEED) # see https://docs.python.org/3/library/random.html

# set numpy seed for reproducibility
np.random.seed(SEED)
# or better: use a RNG in all your custom functions and classes
rng = np.random.default_rng(SEED) 

# set seed for torch
torch.manual_seed(SEED)
# make operations deterministic where possible (raises an error if an operation has no deterministic implementation)
torch.use_deterministic_algorithms(True) # see https://pytorch.org/docs/stable/notes/randomness.html#avoiding-nondeterministic-algorithms

# in transformers
set_seed(SEED) # sets the seeds of `random`, `numpy`, and `torch` in one call, see https://github.com/hasansalimkanmaz/transformers/blob/87ac401e776b865e92b1c5f31b0e1f67c76404c4/src/transformers/trainer_utils.py#L69

**_Important:_** When fine-tuning with `transformers`'s `Trainer` (like above), you also need to set the following arguments in your call to `TrainingArguments()`:

```python
TrainingArguments(
    ...
    # ensure reproducibility
    full_determinism = True,
    seed = SEED,
    data_seed = SEED,
    ...
)
```

### Using your own datasets

In the example above, we downloaded our labeled data and directly converted it into a `DatasetDict` (a dictionary of `Dataset` instances).

But in your research, you'll have your own labeled data.
So you'll need to create `Dataset` instances from python objects.

Below, I show how to do this for lists of dictionaries and pandas data frames, respectively.
More options are shown [here](https://huggingface.co/docs/datasets/loading#inmemory-data).

In [274]:
import pandas as pd
from datasets import Dataset

labeled_data = [
    {'text': 'I am happy', 'label': 1},
    {'text': 'I am sad', 'label': 0}
]

In [275]:
# from list (of dictionaries)
dataset = Dataset.from_list(labeled_data)
print(type(dataset))
print(dataset.features)
print(dataset[0])
print(dataset['text'])
print(dataset['label'])

<class 'datasets.arrow_dataset.Dataset'>
{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}
{'text': 'I am happy', 'label': 1}
['I am happy', 'I am sad']
[1, 0]


In [273]:
df = pd.DataFrame(labeled_data)

# from pandas data frame
dataset = Dataset.from_pandas(df)
print(type(dataset))
print(dataset.features)
print(dataset[0])
print(dataset['text'])
print(dataset['label'])

<class 'datasets.arrow_dataset.Dataset'>
{'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}
{'text': 'I am happy', 'label': 1}
['I am happy', 'I am sad']
[1, 0]


### Dataset splitting

To split your dataset into train, dev, and test sets, you should

- rely on scikit-learn's pre-defined functions, and
- always set the seed.

This helps you avoid accidentally placing the same example in more than one set, and it makes your data splitting reproducible.

In [2]:
# example data (see https://chat.openai.com/share/2d5af33c-acb9-4b0f-a260-7c38651fa7b0)
labeled_data = [
    {"text": "I'm over the moon with joy!", "label": "happy"},
    {"text": "Happiness radiates from every fiber of my being.", "label": "happy"},
    {"text": "I can't stop smiling because life is beautiful.", "label": "happy"},
    {"text": "Tears stream down my face; I'm so heartbroken.", "label": "sad"},
    {"text": "I feel so lonely and despondent right now.", "label": "sad"},
    {"text": "It's a gloomy day, and my spirits are low.", "label": "sad"},
    {"text": "Laughter fills the air, and my heart is light.", "label": "happy"},
    {"text": "I'm ecstatic about the news! Pure bliss!", "label": "happy"},
    {"text": "The weight of sadness bears down on me like a ton of bricks.", "label": "sad"},
    {"text": "Every moment without them feels like an eternity of sorrow.", "label": "sad"}
]


We usually specify the set sizes in percentages:

In [3]:
test_size = 0.20
dev_size = 0.20

# compute N
import math
n = len(labeled_data)
n_test = math.ceil(n*test_size)
n_dev = math.ceil(n*dev_size)
n_train = n-n_test-n_dev

#### Simple random splitting

The simplest splitting strategy is to assign examples randomly to the three sets.

In [9]:
# use train_test_split from sklearn
from sklearn.model_selection import train_test_split
SEED = 1234

tmp, test_idxs = train_test_split(range(n), test_size=n_test, random_state=SEED)
train_idxs, dev_idxs = train_test_split(tmp, test_size=n_dev, random_state=SEED)
del tmp

print(len(train_idxs), len(dev_idxs), len(test_idxs)) # should be approx 60%, 20%, 20%

print(len(set(train_idxs).intersection(set(dev_idxs)))) # should be 0
print(len(set(train_idxs).intersection(set(test_idxs)))) # should be 0
print(len(set(dev_idxs).intersection(set(test_idxs)))) # should be 0

train_data = [labeled_data[i] for i in train_idxs]
dev_data = [labeled_data[i] for i in dev_idxs]
test_data = [labeled_data[i] for i in test_idxs]

6 2 2
0
0
0


With purely random splitting, however, the label proportions can differ considerably between the splits.

In [10]:
print(np.mean([d['label'] == 'happy' for d in train_data]))
print(np.mean([d['label'] == 'happy' for d in dev_data]))
print(np.mean([d['label'] == 'happy' for d in test_data]))

0.16666666666666666
1.0
1.0


**_Note:_** In a larger dataset, the differences would typically be far less dramatic, so the problem is usually less severe than in this toy example.
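
A quick way to check this for your own splits is to tabulate the share of each label per split. The following is a minimal sketch that only uses the standard library; the helper name `label_shares` and the toy splits are made up for illustration and are not part of any library.

```python
from collections import Counter

def label_shares(examples):
    """Return the share of each label among a list of {'text': ..., 'label': ...} dicts."""
    counts = Counter(d["label"] for d in examples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# hypothetical toy splits (not the ones created above)
toy_train = [
    {"text": "a", "label": "happy"},
    {"text": "b", "label": "sad"},
    {"text": "c", "label": "sad"},
    {"text": "d", "label": "sad"},
]
toy_test = [
    {"text": "e", "label": "happy"},
    {"text": "f", "label": "happy"},
]

print(label_shares(toy_train))  # {'happy': 0.25, 'sad': 0.75}
print(label_shares(toy_test))   # {'happy': 1.0}
```

If the shares differ a lot between splits, stratified splitting is probably the better choice.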

#### Stratify by label class

If you want to ensure that the label proportions in your train, dev, and test splits are identical,
you want to stratify by examples' true labels:

In [7]:
# use train_test_split from sklearn
import numpy as np
from sklearn.model_selection import train_test_split
SEED = 1234

labels = np.array([d['label'] for d in labeled_data])

tmp, test_idxs = train_test_split(range(n), test_size=n_test, random_state=SEED, stratify=labels)
train_idxs, dev_idxs = train_test_split(tmp, test_size=n_dev, random_state=SEED, stratify=labels[tmp])
del tmp

print(len(train_idxs), len(dev_idxs), len(test_idxs)) # should be approx 60%, 20%, 20%

print(len(set(train_idxs).intersection(set(dev_idxs)))) # should be 0
print(len(set(train_idxs).intersection(set(test_idxs)))) # should be 0
print(len(set(dev_idxs).intersection(set(test_idxs)))) # should be 0

train_data = [labeled_data[i] for i in train_idxs]
dev_data = [labeled_data[i] for i in dev_idxs]
test_data = [labeled_data[i] for i in test_idxs]

# the label proportions are now (approx.) the same in all splits
print(np.mean([d['label'] == 'happy' for d in train_data]))
print(np.mean([d['label'] == 'happy' for d in dev_data]))
print(np.mean([d['label'] == 'happy' for d in test_data]))

6 2 2
0
0
0
0.5
0.5
0.5


**_Note:_** 
You could also stratify by other indicators, such as document authors' IDs.
In this case, you'd have similar proportions of each author's documents in the different splits.

#### Grouped sampling

Sometimes you want to develop a classifier that can classify text from held-out documents.
For example, if you have collected annotations for sentences sampled from parties' election manifestos, some manifestos might not be represented in your sample at all.
In this case, at prediction time (i.e., when applying your final model to the entire corpus of election manifestos), you'd need to make "out-of-document" classifications.

Achieving good "out-of-document" classification performance requires **generalization**.
But this can be difficult because language use tends to be more similar within documents than across documents.

To assess how reliably your classifier predicts "out-of-document", you want to mirror this setup in your splitting strategy.
This is done by assigning sentences to the train, dev, and test splits based on their document membership.

For example, the table below illustrates how we assign all sentences from document 1 to the test set, and all sentences from documents 2 and 3 to the train set:

| sentence ID | doc ID | set |
|:----------- |:------ |:--- |
| 1 | 1 | &rarr; 'test' |
| 2 | 1 | &rarr; 'test' |
| 1 | 2 | &rarr; 'train' |
| 2 | 2 | &rarr; 'train' |
| 3 | 2 | &rarr; 'train' |
| 1 | 3 | &rarr; 'train' |
| 2 | 3 | &rarr; 'train' |
| 3 | 3 | &rarr; 'train' |
| ... | ... | ... |





In [13]:
labeled_data = [
    {"doc_id": 0, "text": "I'm over the moon with joy!", "label": "happy"},
    {"doc_id": 0,"text": "Happiness radiates from every fiber of my being.", "label": "happy"},
    {"doc_id": 1,"text": "I can't stop smiling because life is beautiful.", "label": "happy"},
    {"doc_id": 1,"text": "Tears stream down my face; I'm so heartbroken.", "label": "sad"},
    {"doc_id": 2,"text": "I feel so lonely and despondent right now.", "label": "sad"},
    {"doc_id": 2,"text": "It's a gloomy day, and my spirits are low.", "label": "sad"},
    {"doc_id": 3,"text": "Laughter fills the air, and my heart is light.", "label": "happy"},
    {"doc_id": 3,"text": "I'm ecstatic about the news! Pure bliss!", "label": "happy"},
    {"doc_id": 4,"text": "The weight of sadness bears down on me like a ton of bricks.", "label": "sad"},
    {"doc_id": 4,"text": "Every moment without them feels like an eternity of sorrow.", "label": "sad"}
]
# note: in reality, the number of sentences per document might vary. 
#       But this is not a problem for running the code below!

In [23]:
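# use the GroupShuffleSplit strategy from sklearn
from sklearn.model_selection import GroupShuffleSplit

doc_ids = np.array([d["doc_id"] for d in labeled_data])

# note: with GroupShuffleSplit, a float `test_size` refers to the share of groups (documents), not sentences
gss = GroupShuffleSplit(n_splits=2, test_size=test_size, random_state=SEED)
tmp, test_idxs = next(gss.split(range(n), groups=doc_ids))
# the second split returns positions within `tmp`, so map them back to the original indices
train_pos, dev_pos = next(gss.split(tmp, groups=doc_ids[tmp]))
train_idxs, dev_idxs = tmp[train_pos], tmp[dev_pos]
del tmp, train_pos, dev_pos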

print(len(train_idxs), len(dev_idxs), len(test_idxs)) # should be approx 60%, 20%, 20%

print(len(set(train_idxs).intersection(set(dev_idxs)))) # should be 0
print(len(set(train_idxs).intersection(set(test_idxs)))) # should be 0
print(len(set(dev_idxs).intersection(set(test_idxs)))) # should be 0

train_data = [labeled_data[i] for i in train_idxs]
dev_data = [labeled_data[i] for i in dev_idxs]
test_data = [labeled_data[i] for i in test_idxs]

train_doc_ids = [d["doc_id"] for d in train_data]
dev_doc_ids = [d["doc_id"] for d in dev_data]
test_doc_ids = [d["doc_id"] for d in test_data]

# no document should appear in more than one split
print(len(set(train_doc_ids).intersection(set(dev_doc_ids)))) # should be 0
print(len(set(train_doc_ids).intersection(set(test_doc_ids)))) # should be 0
print(len(set(dev_doc_ids).intersection(set(test_doc_ids)))) # should be 0


6 2 2
0
0
0
0
0
0
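
To make the logic of group-wise splitting explicit, here is a minimal standard-library sketch of the same idea: shuffle the document IDs, assign whole documents to a split, and let each sentence follow its document. The function `split_by_group` and its arguments are illustrative names, not part of scikit-learn; in practice, `GroupShuffleSplit` does this for you.

```python
import math
import random

def split_by_group(group_ids, test_share=0.2, dev_share=0.2, seed=1234):
    """Assign each example index to 'train', 'dev', or 'test' based on its group ID."""
    groups = sorted(set(group_ids))
    rng = random.Random(seed)  # local random number generator so the split is reproducible
    rng.shuffle(groups)

    # the shares refer to the number of groups (documents), not sentences
    n_test = math.ceil(len(groups) * test_share)
    n_dev = math.ceil(len(groups) * dev_share)
    test_groups = set(groups[:n_test])
    dev_groups = set(groups[n_test:n_test + n_dev])

    splits = {"train": [], "dev": [], "test": []}
    for idx, group in enumerate(group_ids):
        if group in test_groups:
            splits["test"].append(idx)
        elif group in dev_groups:
            splits["dev"].append(idx)
        else:
            splits["train"].append(idx)
    return splits

# five documents with two sentences each, mirroring the toy data above
toy_doc_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
splits = split_by_group(toy_doc_ids)
print({name: len(idxs) for name, idxs in splits.items()})  # {'train': 6, 'dev': 2, 'test': 2}
```

Because documents, not sentences, are assigned to splits, the split sizes in terms of sentences are only approximate when documents differ in length.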


### Multi-class classification

In many use cases, you have more than two label classes.

#### Model setup

`transformers`' `AutoModelForSequenceClassification` can handle this well.
You just need to 

1. adapt your `id2label` and `label2id` dictionaries accordingly. So if you have three label classes "positive", "neutral", and "negative", you would define:

```python
id2label = {0: "negative", 1: "neutral", 2: "positive"}
label2id = {"negative": 0, "neutral": 1, "positive": 2}
```

2. adapt the `num_labels` argument accordingly:

```python
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,
    id2label=id2label,
    label2id=label2id
)
```
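
If your labels are stored as strings, you can derive both dictionaries from the data instead of typing them by hand, and then convert the labels to integer IDs before training. The following is a minimal sketch, assuming each example is a dictionary with a `'label'` key; the example data and variable names are hypothetical.

```python
# hypothetical labeled examples with string labels
examples = [
    {"text": "Great!", "label": "positive"},
    {"text": "It is fine.", "label": "neutral"},
    {"text": "Awful.", "label": "negative"},
    {"text": "Love it.", "label": "positive"},
]

# sort the label classes so that the mapping is the same on every run
classes = sorted({d["label"] for d in examples})
label2id = {label: i for i, label in enumerate(classes)}
id2label = {i: label for label, i in label2id.items()}

# replace the string labels by their integer IDs
encoded = [{"text": d["text"], "label": label2id[d["label"]]} for d in examples]

print(label2id)      # {'negative': 0, 'neutral': 1, 'positive': 2}
print(len(classes))  # 3, the value to pass as `num_labels`
```

Deriving the mappings from the data keeps `num_labels`, `id2label`, and `label2id` consistent with each other.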

#### Evaluation metrics

If you have more than two classes, you will need to adapt your evaluation metrics.
This is because the *precision* and *recall* metrics, for example, only differentiate between correct and incorrect classifications ("hits" and "misses") of "positive" vs. "negative" examples.
So for each label class, you will convert ("dichotomize") your multi-class labels and predicted classifications into so-called "one vs. rest" indicators.

**_Example:_** If you are interested in how well your model classifies "neutral" samples and the other two label classes are "positive" and "negative", you will redefine the label categories as follows:

- "positive" &rarr; 0
- "neutral" &rarr; 1
- "negative" &rarr; 0

In this way, you can compute **_"neutral"-specific_ recall, precision, and F1 scores**.
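
To illustrate the "one vs. rest" recoding, here is a minimal sketch that computes class-specific scores by hand, without any library. The function name `one_vs_rest_scores` and the toy labels are assumptions made for this illustration.

```python
def one_vs_rest_scores(y_true, y_pred, target):
    """Precision, recall, and F1 for `target` after recoding labels to target (1) vs. rest (0)."""
    true_bin = [int(y == target) for y in y_true]
    pred_bin = [int(y == target) for y in y_pred]

    tp = sum(t == 1 and p == 1 for t, p in zip(true_bin, pred_bin))  # hits
    fp = sum(t == 0 and p == 1 for t, p in zip(true_bin, pred_bin))  # false alarms
    fn = sum(t == 1 and p == 0 for t, p in zip(true_bin, pred_bin))  # misses

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# hypothetical labels: 0 = "negative", 1 = "neutral", 2 = "positive"
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 0, 1, 2, 1]

precision, recall, f1 = one_vs_rest_scores(y_true, y_pred, target=1)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.33 0.5 0.4
```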

Evaluation functions in the `sklearn.metrics` module like `f1_score()` support multi-class classification:

In [251]:
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 0, 1, 2, 1]

from sklearn.metrics import f1_score

# get one F1 score per label class 0, 1, and 2 (in ascending order)
f1_score(y_true=true_labels, y_pred=pred_labels, average=None)

array([0.5       , 0.4       , 0.66666667])

But for finding the "best" model, you'll still need a single performance score.

So what we do is **average** class-specific scores into one performance estimate.
The most common averaging strategy is the so-called **_macro_ average**.
It is simply the unweighted mean of the class-specific scores.

So given the example above, the "macro F1 score" is `(0.5 + 0.4 + 0.667)/3 ≈ 0.522`.

In [252]:
f1_score(y_true=true_labels, y_pred=pred_labels, average='macro')

0.5222222222222223

The alternative strategy is the **_micro_ average**.
In this strategy, we pool all classifications and simply count which labels we got right and which we got wrong. So in single-label classification (as in this example), the "micro F1 score" is just the accuracy:

In [262]:
f1_score(y_true=true_labels, y_pred=pred_labels, average='micro')

0.5
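
To see why the micro average coincides with accuracy here, the following sketch pools the true positives, false positives, and false negatives across all classes by hand. It is a plain-Python illustration under the assumption that every example has exactly one true and one predicted label; the variable names are made up for this example.

```python
# hypothetical labels for three classes 0, 1, and 2
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 0, 1, 2, 1]
classes = sorted(set(y_true) | set(y_pred))

# pool true positives, false positives, and false negatives across all classes
tp = sum(t == c and p == c for c in classes for t, p in zip(y_true, y_pred))
fp = sum(t != c and p == c for c in classes for t, p in zip(y_true, y_pred))
fn = sum(t == c and p != c for c in classes for t, p in zip(y_true, y_pred))

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(micro_f1, accuracy)  # 0.5 0.5
```

Every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), which is why pooled precision, recall, and F1 all equal the share of correct classifications.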

There is also a function that summarizes everything:

In [261]:
from sklearn.metrics import classification_report

id2label = {0: "negative", 1: "neutral", 2: "positive"}

rep = classification_report(
    y_true=true_labels, 
    y_pred=pred_labels, 
    labels=list(id2label.keys()),
    target_names=list(id2label.values()),
)
print(rep)

              precision    recall  f1-score   support

    negative       0.50      0.50      0.50         2
     neutral       0.33      0.50      0.40         2
    positive       1.00      0.50      0.67         2

    accuracy                           0.50         6
   macro avg       0.61      0.50      0.52         6
weighted avg       0.61      0.50      0.52         6
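
The `weighted avg` row in the report averages the class-specific scores, too, but weights each class by its support (the number of true examples in that class). Since all classes have the same support in this example, it equals the macro average. The sketch below illustrates the computation; the helper `weighted_average` and the unequal supports are hypothetical.

```python
def weighted_average(scores, supports):
    """Average class-specific scores, weighting each class by its number of true examples."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

# class-specific F1 scores from the example above
f1_per_class = [0.5, 0.4, 2 / 3]

# equal supports (as in the report above): identical to the macro average
print(round(weighted_average(f1_per_class, supports=[2, 2, 2]), 2))  # 0.52

# hypothetical unequal supports: the score is pulled toward the largest class
print(round(weighted_average(f1_per_class, supports=[8, 1, 1]), 2))  # 0.51
```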

