## Finetune BERT-Models on IMDB

[code on huggingface](https://huggingface.co/docs/transformers/tasks/sequence_classification)

checked 07.02.2024 GPaaß

This code can be executed with PyTorch or, with some minor changes, with TensorFlow.

You may need to remove all files in `.cache` before running it.

The task illustrated in this tutorial is supported by the following model architectures in Hugging Face Transformers:

ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BioGpt, BLOOM, CamemBERT, CANINE, CodeLlama, ConvBERT, CTRL, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, ESM, Falcon, FlauBERT, FNet, Funnel Transformer, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLM, LayoutLMv2, LayoutLMv3, LED, LiLT, LLaMA, Longformer, LUKE, MarkupLM, mBART, MEGA, Megatron-BERT, Mistral, Mixtral, MobileBERT, MPNet, MPT, MRA, MT5, MVP, Nezha, Nyströmformer, OpenLlama, OpenAI GPT, OPT, Perceiver, Persimmon, Phi, PLBart, QDQBert, Qwen2, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, SqueezeBERT, T5, TAPAS, Transformer-XL, UMT5, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO

[Advanced classification on GLUE](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/text_classification.ipynb)

In [1]:
!pip install transformers datasets evaluate accelerate


Collecting datasets
  Downloading datasets-2.18.0-py3-none-any.whl (510 kB)
Collecting evaluate
  Downloading evaluate-0.4.1-py3-none-any.whl (84 kB)
Collecting accelerate
  Downloading accelerate-0.27.2-py3-none-any.whl (279 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)

In [None]:
# Cell tagged "parameters" for papermill (View > Cell Toolbar > Tags). Requires the papermill library.
#prm = "small"             # small: just use 1 epoch
#prm = "full"              # full: run the complete training

In [None]:
# clear GPU memory
import torch, gc
gc.collect()
torch.cuda.empty_cache()

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

## Load IMDB Data
Start by loading the IMDb dataset from the Datasets library:

In [None]:
from datasets import load_dataset
imdb = load_dataset("imdb")

There are two fields in this dataset:

* `text`: the movie review text.
* `label`: a value that is either 0 for a negative review or 1 for a positive review.

In [None]:
print("----- example of a review with negative rating -----")
imdb["test"][2]

In [None]:
print("----- example of a review with positive rating -----")
imdb["test"][20000]


## Tokenization
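Select the type of model. Only the uncommented assignment takes effect:

In [None]:
# model_type = "bert-base-cased"        # alternative: BERT base, cased
model_type = "distilbert-base-uncased"  # model used in this run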


In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_type)

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT’s maximum input length:

In [None]:
def tokenize_function(examples):
    # truncation=True cuts each review at the tokenizer's maximum length
    return tokenizer(examples["text"], truncation=True)

tokenized_imdb = imdb.map(tokenize_function, batched=True)
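
To make the effect of `truncation=True` concrete, here is a toy sketch that uses a whitespace split in place of the real subword tokenizer. The helper `toy_tokenize` and the limit of 8 tokens are illustrative assumptions; the real tokenizer cuts at the model's own limit (512 tokens for DistilBERT).

```python
def toy_tokenize(text, max_length=8, truncation=True):
    """Whitespace stand-in for a subword tokenizer (illustration only)."""
    tokens = text.split()
    if truncation and len(tokens) > max_length:
        tokens = tokens[:max_length]  # drop everything past the limit
    return tokens


review = "This film starts slowly but the final act is absolutely worth the wait"
print(len(review.split()))   # number of tokens before truncation
print(toy_tokenize(review))  # at most max_length tokens survive
```

Texts shorter than the limit pass through unchanged, so truncation only affects long reviews.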

Now create a batch of examples using [DataCollatorWithPadding](https://huggingface.co/docs/transformers/main/en/main_classes/data_collator#transformers.DataCollatorWithPadding). It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
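
The saving from dynamic padding can be illustrated with a small sketch in plain Python. The function `pad_batch`, the token ids, and the pad id `0` are assumptions made for this example and are not part of the `DataCollatorWithPadding` API.

```python
def pad_batch(batch, pad_id=0, length=None):
    """Pad every sequence in `batch` to `length` (default: longest in the batch)."""
    target = length if length is not None else max(len(seq) for seq in batch)
    return [seq + [pad_id] * (target - len(seq)) for seq in batch]


# Hypothetical token-id sequences of different lengths
sequences = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
    [101, 5, 6, 7, 8, 9, 10, 102],
]
batches = [sequences[:2], sequences[2:]]  # two batches of two examples each

global_max = max(len(seq) for seq in sequences)
# Static padding: every sequence is padded to the longest one in the dataset
static_total = sum(len(row) for b in batches for row in pad_batch(b, length=global_max))
# Dynamic padding: every sequence is padded to the longest one in its own batch
dynamic_total = sum(len(row) for b in batches for row in pad_batch(b))
print(static_total, dynamic_total)
```

In this toy case static padding processes 32 token positions and dynamic padding 26; the gap grows when sequence lengths vary widely, as they do for movie reviews.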

In [None]:
import evaluate
accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to `accuracy.compute` to calculate the accuracy:

In [None]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
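
As a rough illustration of what `compute_metrics` does with its input, the following sketch applies the same `argmax` step to made-up logits and computes the accuracy by hand as the fraction of matching labels. The logits and labels are hypothetical values chosen for the example.

```python
import numpy as np

# Hypothetical logits for four reviews (columns: class 0, class 1)
logits = np.array([[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 2.2]])
labels = np.array([0, 1, 1, 1])

predictions = np.argmax(logits, axis=1)  # index of the larger logit in each row
manual_accuracy = float(np.mean(predictions == labels))
print(predictions, manual_accuracy)
```

Three of the four predictions match their labels, so the accuracy is 0.75.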

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
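
The two dictionaries must be exact inverses of each other. One way to guarantee this, shown here as a small sketch, is to derive `label2id` from `id2label` instead of typing both by hand.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {label: idx for idx, label in id2label.items()}  # invert the mapping
print(label2id)
```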

If you aren’t familiar with finetuning a model with the Trainer, take a look at the basic tutorial [here](https://huggingface.co/docs/transformers/main/en/training#train-with-pytorch-trainer)!


Load the selected model (DistilBERT in this run) with `AutoModelForSequenceClassification`, along with the number of expected labels and the label mappings:

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained(
    model_type, num_labels=2, id2label=id2label, label2id=label2id
)

At this point, only three steps remain:

1. Define your training hyperparameters in `TrainingArguments`. The only required parameter is `output_dir`, which specifies where to save your model. At the end of each epoch, the `Trainer` will evaluate the accuracy and save the training checkpoint.
1. Pass the training arguments to `Trainer` along with the model, dataset, tokenizer, data collator, and `compute_metrics` function.
1. Call `train()` to finetune your model.

In [None]:
TrainingArguments?

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Training takes about 11:30 minutes on an A100 40GB. RAM usage: 6.8 GB of 40.1 GB. Validation loss: 0.2263, accuracy: 0.932.

Evaluating the test data takes about 1:30 minutes.

Check GPU activity in terminal with `watch nvidia-smi`

| GPU | Val loss | Val acc | Execution time (min:s) |
|:--------:|:--------:|:--------:|:--------:|
| A100 40GB | 0.2263 | 0.932 | 11:30 |
| T4 16GB | 0.2263 | 0.932 | 36:00 |

In [None]:
trainer.train()

## Inference

Now we can use the model for inference to classify a new text.

### Application to a single text with `pipeline`

In [None]:
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The model was saved to the directory `my_awesome_model`. You can open a terminal (bottom left) and enter the command `ls -l` to see the directory.

The simplest way to try out your finetuned model for inference is to use it in a `pipeline()`. Instantiate a pipeline for sentiment analysis with your model, and pass your text to it:

In [None]:
import os
cwd = os.getcwd()
cwd

In [None]:
model_dir = "my_awesome_model/checkpoint-3126"
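
The number in the checkpoint directory is the global training step at which it was saved. The following back-of-the-envelope calculation shows where 3126 comes from, assuming a single GPU and no gradient accumulation (neither is set in the training arguments used here).

```python
import math

train_examples = 25_000  # size of the IMDb training split
batch_size = 16          # per_device_train_batch_size
epochs = 2               # num_train_epochs

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 1563 3126
```

With a different batch size, number of epochs, or number of devices, the checkpoint directory name changes accordingly.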

In [None]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model=model_dir)
classifier(text)

`trainer.evaluate()` runs an evaluation loop and returns the metrics.

In [None]:
metrik = trainer.evaluate()
metrik

`trainer.predict()` returns predictions on a test set, together with metrics if labels are available.

In [None]:
preds = trainer.predict(test_dataset=tokenized_imdb["test"])

In [None]:
def softmax(x):
    """Compute row-wise softmax values for a 2-D array of scores x."""
    mx = np.max(x, axis=1, keepdims=True)       # maximum of each row, shape (n, 1)
    ex = np.exp(x - mx)                         # subtract maximum (avoid overflow), then exponentiate
    ex_sum = np.sum(ex, axis=1, keepdims=True)  # sum of each row, shape (n, 1)
    return ex / ex_sum                          # [exp(x_1),...,exp(x_k)]/(exp(x_1)+...+exp(x_k))
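
The reason for subtracting the row maximum in `softmax` becomes visible with large scores. The sketch below compares a naive version with a shifted one; the function names and the score values are chosen only for this illustration.

```python
import numpy as np


def naive_softmax(x):
    ex = np.exp(x)  # overflows to inf for large scores
    return ex / np.sum(ex, axis=1, keepdims=True)


def stable_softmax(x):
    shifted = x - np.max(x, axis=1, keepdims=True)  # largest entry per row becomes 0
    ex = np.exp(shifted)
    return ex / np.sum(ex, axis=1, keepdims=True)


scores = np.array([[1000.0, 1001.0], [0.5, -0.5]])
with np.errstate(over="ignore", invalid="ignore"):  # silence the overflow warnings
    print(naive_softmax(scores))   # first row is nan because inf / inf is undefined
print(stable_softmax(scores))      # every row sums to 1
```

Shifting by the maximum does not change the result mathematically, because the common factor cancels in the ratio.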

In [None]:
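print(len(preds.predictions))
probs = softmax(preds.predictions)
test_labels = tokenized_imdb["test"]["label"]  # read the label column once
for i in range(10):
    # column 0 holds the probability of class 0 (NEGATIVE)
    print(i, "prob=", probs[i, 0], "\tlabel=", test_labels[i])
#print(probs[:10])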


### Detailed Execution of `pipeline` Commands

Tokenize the text and return PyTorch tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_dir)
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:

In [None]:
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_dir)
with torch.no_grad():
  logits = model(**inputs).logits  # predict output
print("logits",logits)

Get the class with the highest probability, and use the model’s id2label mapping to convert it to a text label:

In [None]:
predicted_class_id = logits.argmax().item()
id2label[predicted_class_id]  # equivalently: model.config.id2label[predicted_class_id]
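
Taken together, the steps after the forward pass amount to a softmax over the logits followed by an argmax and a label lookup. The following plain-Python sketch produces a dictionary with `label` and `score` keys, similar in shape to a `pipeline` result. The function `postprocess` and the logits are hypothetical, and the actual post-processing inside `pipeline` may differ in detail.

```python
import math

id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def postprocess(logits):
    """Turn one row of logits into a {'label', 'score'} dictionary."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return {"label": id2label[best], "score": probs[best]}


result = postprocess([-2.0, 3.0])  # hypothetical logits for one review
print(result)
```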