#  🧑‍💻 Fine-tuning BERT-base 🧑‍💻

📅 _Data Science Summer School 2023, 22.08.2023_

👨‍🏫 By [Moritz Laurer](https://www.linkedin.com/in/moritz-laurer/).
For questions, reach out to: m.laurer@vu.nl


<a href="https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/3_tune_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is partly based on my paper:

Laurer, Moritz, Van Atteveldt, W., Casas, A., & Welbers, K. (2023). [Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI](https://www.cambridge.org/core/journals/political-analysis/article/less-annotating-more-classifying-addressing-the-data-scarcity-issue-of-supervised-machine-learning-with-deep-transfer-learning-and-bertnli/05BB05555241762889825B080E097C27). Political Analysis, 1-17. doi:10.1017/pan.2023.20

## Activate a GPU runtime

To run this notebook on a GPU, click on "Runtime" > "Change runtime type" > select "GPU" in the menu bar at the top left. Training a Transformer is much faster on a GPU. Given Colab's usage limits for GPUs, it is advisable to first test your non-training code on a CPU and only use the GPU once you know that everything is working.

## Install relevant packages

In [None]:
!pip install transformers[sentencepiece]~=4.31.0
!pip install accelerate~=0.21.0
!pip install datasets~=2.14.0
!pip install optuna~=3.3.0


In [None]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable

In [None]:
# set random seed for reproducibility
SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

## Download data

In [None]:
## Download clean train and test data from github
# data source: https://manifesto-project.wzb.eu/
# codebook with definitions of classes: https://manifesto-project.wzb.eu/down/data/2023a/codebooks/codebook_MPDataset_MPDS2023a.pdf
# more cleaned datasets for testing other tasks: https://github.com/MoritzLaurer/less-annotating-with-bert-nli/tree/master/data_clean

df_train = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_train.csv", index_col="idx")
df_test = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_test.csv", index_col="idx")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets:  3970  (train)  9537  (test).


In [None]:
# full training data table
DataTable(df_train, num_rows_per_page=5)

idx,label,label_text,text_original,label_domain_text,label_subcat_text,text_preceding,text_following,manifesto_id,doc_id,country_name,date,party,cmp_code_hb4,cmp_code,label_subcat_text_simple
71033,0,Military: Negative,a direct role for military forces in the provi...,External Relations,Military: Negative,emergency relief in situations of armed confli...,aid programs should not be used to influence t...,63110_201008,70,Australia,201008,63110,105,105.0,Military
26600,2,Other,and the seriousness of these animal cruelty of...,Welfare and Quality of Life,Environmental Protection,the introduction of a NI Register of Animal Cr...,the PSNI in their efforts to target criminal g...,51903_201706,31,United Kingdom,201706,51903,501,501.0,Environmental Protection
64968,1,Military: Positive,We reject Congressional earmarks that put pers...,External Relations,Military: Positive,We will implement sound management policies to...,"We recognize the need for, and value of, compe...",61620_201211,65,United States,201211,61620,104,104.0,Military
119889,2,Other,"As a partner in Government, the Māori Party ha...",Welfare and Quality of Life,Welfare State Expansion,reviewing the appeal process for ethics commit...,secured $65.3 million for rheumatic fever prev...,64901_201409,141,New Zealand,201409,64901,504,504.0,Welfare State
69067,1,Military: Positive,"Whatever their disagreements, both the Republi...",External Relations,Military: Positive,Both recognized the need to stand with friends...,— and strength meant American military superio...,61620_202011,67,United States,202011,61620,104,104.0,Military
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5284,2,Other,We must make substantial changes in the way we...,Welfare and Quality of Life,Environmental Protection,The healthy future of our environment is one o...,The UK played a leading role in the Kyoto conf...,51320_200106,7,United Kingdom,200106,51320,501,501.0,Environmental Protection
27167,1,Military: Positive,"Armed Forces Personnel, born outside the Unite...",External Relations,Military: Positive,The DUP supports: A new long-term plan for Arm...,The DUP supports: Waiving indefinite leave to ...,51903_201912,32,United Kingdom,201912,51903,104,104.0,Military
101296,2,Other,"Increase support for mothers, before birth and...",Welfare and Quality of Life,Welfare State Expansion,Increase funding for community-based providers...,Establish family health clinics within family ...,64421_200207,120,New Zealand,200207,64421,504,504.0,Welfare State
93038,2,Other,This will enable people to work and rest on th...,Economy,Technology and Infrastructure: Positive,The Green Party is committed to passenger rail...,The train would take around two hours and 15 m...,64110_201709,102,New Zealand,201709,64110,411,411.0,Technology and Infrastructure: Positive


**If you want to run the notebook on your own dataset:**

You can load your own training and test data above to fine-tune your own BERT model. Your own dataframe only needs three columns to be compatible with the code below: (1) a "label" column with a numeric label; (2) a "label_text" column with the label name in plain language, (3) a "text" column with the texts for training (you might need to delete/adapt the text preparation code cell below for your dataset).

In [None]:
## you can also load your own .csv files from Google Drive
"""
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)

# set the path to your data
os.chdir("/content/drive/My Drive/PhD/data")
print(os.getcwd())

df_train = pd.read_csv("./df_manifesto_morality_train.csv")
df_test = pd.read_csv("./df_manifesto_morality_test.csv")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")
"""



In [None]:
# optional: use training data sample size of e.g. 1000 for faster testing
sample_size = 1000
df_train = df_train.sample(n=min(sample_size, len(df_train)), random_state=SEED_GLOBAL).copy(deep=True)
df_test = df_test.sample(n=min(sample_size*4, len(df_test)), random_state=SEED_GLOBAL).copy(deep=True)

print("Length of training and test sets after sampling: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets after sampling:  1000  (train)  4000  (test).


In [None]:
## inspect the data
# label distribution train set
print("Train set label distribution:\n", df_train.label_text.value_counts(), "\n")
# label distribution test set
print("Test set label distribution:\n", df_test.label_text.value_counts())


Train set label distribution:
 Other                 516
Military: Positive    399
Military: Negative     85
Name: label_text, dtype: int64 

Test set label distribution:
 Other                 3646
Military: Positive     248
Military: Negative     106
Name: label_text, dtype: int64


## Data preprocessing

**Prepare the input text**

We surround each target text with its preceding and following sentence. Adding context like this systematically increases performance. No further reformatting of the texts is applied here: the model is fine-tuned directly on the concatenated text.
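
As an illustration of the context concatenation described above, here is a minimal sketch in plain Python rather than pandas. The function name `add_context` and the example sentences are hypothetical; missing neighbours (e.g. for the first sentence of a document) are treated as empty strings, similar to `fillna("")` in the cell below.

```python
def add_context(preceding, original, following):
    """Join a target sentence with its neighbouring sentences.

    Missing neighbours (None) are replaced by an empty string,
    similar to what `fillna("")` does for a pandas column.
    """
    parts = [preceding or "", original or "", following or ""]
    return " ".join(parts)


example = add_context(
    "We will cut taxes.",
    "We will also increase defence spending.",
    None,  # no following sentence available
)
print(repr(example))
```

Note that a missing neighbour leaves a leading or trailing space in the result. This is harmless for the tokenizer, but you could call `.strip()` on the result if you prefer clean strings.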


In [None]:
df_train["text"] = df_train.text_preceding.fillna("") + " " + df_train.text_original.fillna("") + " " + df_train.text_following.fillna("")
df_test["text"] = df_test.text_preceding.fillna("") + " " + df_test.text_original.fillna("") + " " + df_test.text_following.fillna("")


In [None]:
df_train = df_train[["label", "label_text", "text"]]
df_test = df_test[["label", "label_text", "text"]]

In [None]:
DataTable(df_train, num_rows_per_page=5)

idx,label,label_text,text
71033,0,Military: Negative,emergency relief in situations of armed confli...
26600,2,Other,the introduction of a NI Register of Animal Cr...
64968,1,Military: Positive,We will implement sound management policies to...
119889,2,Other,reviewing the appeal process for ethics commit...
69067,1,Military: Positive,Both recognized the need to stand with friends...
...,...,...,...
5284,2,Other,The healthy future of our environment is one o...
27167,1,Military: Positive,The DUP supports: A new long-term plan for Arm...
101296,2,Other,Increase funding for community-based providers...
93038,2,Other,The Green Party is committed to passenger rail...


## Load a Transformer

We use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) for loading and training our model. They provide great documentation and also a very good [course](https://huggingface.co/course/chapter1/1) on how to use Transformers.

**Choosing a Transformer model**

You can use any classification model on the [Hugging Face Hub](https://huggingface.co/models?sort=downloads). I suggest testing these models:



*   Original BERT (good, but outdated): `bert-base-uncased`
*   Small efficient model: `distilbert-base-uncased`
*   Newer version of BERT: `microsoft/deberta-v3-base`
*   Large, high-performance model: `microsoft/deberta-v3-large`
*   Multilingual model: `microsoft/mdeberta-v3-base`





In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch

## load a model and its tokenizer
model_name = "microsoft/deberta-v3-base"  # replace e.g. with "bert-base-uncased" or "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# link the numeric labels to the label texts
# note: ids are assigned in alphabetical order of the label texts,
# which matches the numeric "label" column in this dataset
label_text = np.sort(df_test.label_text.unique()).tolist()
label2id = {label: i for i, label in enumerate(label_text)}
id2label = {i: label for i, label in enumerate(label_text)}
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));
print("\n", label2id, "\n")

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



 {'Military: Negative': 0, 'Military: Positive': 1, 'Other': 2} 



Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.weight', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device: cuda


## Tokenize data

In [None]:
# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing
import datasets

dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})

# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

dataset = dataset.map(tokenize, batched=True)

# remove unnecessary columns for model training
dataset = dataset.remove_columns(['label_text'])   #'text_original', 'label_domain_text', 'label_subcat_text', 'text_preceding', 'text_following', 'manifesto_id', 'doc_id', 'country_name', 'date', 'party', 'cmp_code_hb4', 'cmp_code', 'label_subcat_text_simple'])



**Inspect processed data**

In [None]:
print("The overall structure of the pre-processed train and test sets:\n")
print(dataset)

The overall structure of the pre-processed train and test sets:

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4000
    })
})


**Explanation of different elements in the tokenized dataset:**

1. **idx**:
Purpose: A unique identifier or index for each data point in the dataset. It's useful for tracking, referencing, or debugging individual examples.
Example: Any integer value like 42.

2. **input_ids**:
Purpose: These are token IDs that represent each token in the text. The raw text is tokenized, and each token is mapped to an ID based on the tokenizer's vocabulary. This is the primary input to BERT and other transformer models.
Example: For the word "hugging", the tokenizer might produce an ID like 20345.

3. **token_type_ids**:
Purpose: Used in models like BERT that can take pairs of texts as input (e.g., Next Sentence Prediction or Question-Answering). It differentiates between the first and the second text. For a single text, this is a sequence of 0s; for a pair, the tokens of the first text get 0s and the tokens of the second text get 1s. In this notebook every input is a single text, so all values are 0.
Example: If "Hugging Face is great." and "I love it!" are passed to the tokenizer as a pair, the token_type_ids might look like [0, 0, 0, 0, 0, 1, 1, 1].

4. **attention_mask**:
Purpose: Specifies which tokens should be attended to by the model. A value of 1 means the token should be considered, while 0 means it should be ignored. This is especially useful when batching sequences of different lengths together, requiring padding tokens to be added. The padding tokens have an attention_mask value of 0.
Example: For the sentence "I love Hugging Face." with padding, the attention_mask might be [1, 1, 1, 1, 0, 0, 0] where the 0s represent the padding tokens.
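
To make the relation between padding and the attention_mask more concrete, here is a minimal sketch in plain Python. The helper `pad_batch` and the token ids are made up for illustration, and `pad_token_id=0` is an assumption; in practice the tokenizer (or the data collator used by the Trainer) handles padding for you.

```python
def pad_batch(batch_input_ids, pad_token_id=0):
    """Pad token-id sequences to the same length and build attention masks."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded, masks = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        # real tokens keep their ids, padding positions get pad_token_id
        padded.append(ids + [pad_token_id] * n_pad)
        # 1 = attend to this token, 0 = ignore (padding)
        masks.append([1] * len(ids) + [0] * n_pad)
    return padded, masks


# two made-up sequences of token ids with different lengths
input_ids, attention_mask = pad_batch([[1, 2644, 3478, 2], [1, 267, 2]])
print(input_ids)
print(attention_mask)
```

The shorter sequence receives one padding token, and its attention_mask ends in a 0 at exactly that position.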

In [None]:
print("\n\nAn example for a row in the tokenized dataset:\n")
for key, value in dataset["train"][0].items():
    print(key, ":    ", value)



An example for a row in the tokenized dataset:

label :     0
text :     emergency relief in situations of armed conflict should be carried out by civilians and must be clearly distinguished from any military activities. a direct role for military forces in the provision of relief should be restricted to situations involving natural disasters where ambiguity over the military role is unlikely to arise. aid programs should not be used to influence the democratic preferences of any nation.
idx :     71033
input_ids :     [1, 2644, 3478, 267, 3335, 265, 5652, 3423, 403, 282, 2635, 321, 293, 10936, 263, 516, 282, 2117, 10045, 292, 356, 1681, 1157, 260, 266, 1670, 985, 270, 1681, 2499, 267, 262, 5048, 265, 3478, 403, 282, 6790, 264, 3335, 3849, 1008, 12871, 399, 26756, 360, 262, 1681, 985, 269, 5220, 264, 7246, 260, 2777, 1309, 403, 298, 282, 427, 264, 2399, 262, 8221, 7162, 265, 356, 2080, 260, 2]
token_type_ids :     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## Setting training arguments / hyperparameters

The following cells set several important hyperparameters. We chose parameters that work well in general to avoid the need for hyperparameter search. Further below, we also provide code for hyperparameter search, if researchers want to try to increase performance by a few percentage points.

In [None]:
# Set the directory to write the fine-tuned model and training logs to.
# With google colab, this will create a temporary folder, which will be deleted once you disconnect.
# You can connect to your personal google drive to save models and logs properly.
training_directory = "BERT-demo"

# FP16 is a hyperparameter which can increase training speed and reduce memory consumption, but only on GPU and if batch-size > 8, see here: https://huggingface.co/transformers/performance.html?#fp16
# FP16 does not work on CPU or for multilingual mDeBERTa models
#fp16_bool = True if torch.cuda.is_available() else False
#if "mdeberta" in model_name.lower(): fp16_bool = False  # multilingual mDeBERTa does not support FP16 yet: https://github.com/microsoft/DeBERTa/issues/77
# in case of hyperparameter search at the end: FP16 has to be set to False. The integrated hyperparameter search with the Hugging Face Trainer can lead to errors otherwise.


In [None]:
from transformers import TrainingArguments, Trainer, logging

# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
# Hugging Face tips to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
train_args = TrainingArguments(
    num_train_epochs=7,  # this can be increased, but higher values increase training time. Good values are usually between 3 and 20.
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # if you get an out-of-memory error, reduce this value to 8 or 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values here are always a multiple of 8.
    per_device_eval_batch_size=64,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    #gradient_accumulation_steps=2, # Can be used in case of memory problems: gradients are accumulated over X batches before each parameter update, so a smaller per-device batch size gives the same effective batch size. Decreases memory usage, but also slightly decreases speed. (!adapt/halve batch size accordingly)
    warmup_ratio=0.06,  # 0.06 is a good default value for BERT-base models
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    #fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    #fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    #eval_steps=10_000,  # evaluate after n steps if evaluation_strategy=='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of updates steps before two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to="all",  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
)


**Explanation of different training arguments:**

You can find more arguments, explanations and examples in the Hugging Face [documentation](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).

* **num_train_epochs**:
Specifies the number of times the entire training dataset is passed through the model. For example, num_train_epochs=3 means the trainer will iterate over the entire training dataset three times.

* **per_device_train_batch_size**:
The model does not learn from the entire dataset at once, but in batches of e.g. 16 texts. For example, if per_device_train_batch_size=16, then the model analyses 16 texts and sees how wrong it was on these 16 texts. After analysing these 16 texts, the model's parameters are updated/optimised to make the model less wrong on these texts. The degree to which the model's parameters are updated is called 'learning rate'.

* **learning_rate**:
The "rate" or speed with which the model's parameters are updated by the optimisation algorithm. A smaller value makes the model's parameters update more slowly after each batch, while a larger value updates the model's parameters more drastically. A good general value is 2e-5 (which means 0.00002).

* **per_device_eval_batch_size**:
The number of evaluation examples used in one batch during evaluation. This batch size is irrelevant for the model's learning. A higher value makes evaluation faster (more texts are processed at the same time), but higher values also cost memory and increase the risk of out-of-memory errors (OOM).

* **gradient_accumulation_steps**:
The number of batches over which gradients are accumulated before the model's parameters are updated. With gradient_accumulation_steps=2, for example, the gradients of two consecutive batches are added up before one update step is performed. This is useful for training with larger effective batch sizes using limited memory.

* **warmup_ratio**:
Specifies the ratio of total training steps for which the learning rate will be linearly increased (warm-up phase) before it's decayed.

* **weight_decay**:
A regularization technique which penalizes large weights by shrinking them slightly towards zero at every update step. It helps prevent overfitting.

* **seed**:
Sets a seed for reproducibility, so that multiple runs with the same seed produce the same or very similar results (some GPU operations are not fully deterministic).

* **load_best_model_at_end**:
Loads the best model (according to metric_for_best_model) at the end of training instead of the last model.

* **metric_for_best_model**:
Determines which metric to use for evaluating and determining the best model during training.

* **evaluation_strategy**:
Defines when to evaluate the model. For example, evaluation_strategy="epoch" evaluates the model after every epoch.

* **save_strategy**:
Specifies when to save the model. For example, save_strategy="epoch" saves the model after every epoch.

* **fp16**:
Enables mixed precision training if set to True, which can speed up training and reduce memory usage. But this does not work with every model (e.g. not with multilingual mDeBERTa), only works on a GPU, and is only beneficial with larger batch sizes (as a rule of thumb, a batch size > 8).
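
To see how num_train_epochs, per_device_train_batch_size, learning_rate and warmup_ratio interact, the following sketch computes the number of update steps and the learning rate at a few points during training. It is a simplified illustration, assuming a single GPU and the Trainer's default linear learning-rate schedule; the function names are hypothetical and not part of the Transformers library.

```python
import math


def count_training_steps(n_examples, batch_size, n_epochs, grad_accum_steps=1):
    """Number of parameter updates for a single-device training run."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * n_epochs


def linear_schedule_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate at `step`: linear warm-up, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# values from this notebook: 1000 training texts, batch size 16, 7 epochs
total_steps = count_training_steps(n_examples=1000, batch_size=16, n_epochs=7)
print("total update steps:", total_steps)  # 441

for step in [0, 10, 27, 200, 441]:
    lr = linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.06)
    print(f"step {step:>3}: learning rate = {lr:.2e}")
```

With these settings the learning rate rises from 0 to 2e-5 during the first 27 steps (6% of 441, rounded up) and then decreases linearly to 0 at the last step.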


## Metrics

In [None]:
# Function to calculate metrics
# documentation on all metrics: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
import warnings

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        # metrics
        precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(labels, preds_max, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        acc_balanced = balanced_accuracy_score(labels, preds_max)
        acc_not_balanced = accuracy_score(labels, preds_max)

        metrics = {
            'accuracy': acc_not_balanced,
            'f1_macro': f1_macro,
            'accuracy_balanced': acc_balanced,
            'f1_micro': f1_micro,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'precision_micro': precision_micro,
            'recall_micro': recall_micro,
        }

        return metrics


**Explanation of key metrics**:

Imagine we have a machine that identifies apples in a basket of mixed fruits. The task of the machine is: label each fruit as "apple" or "not-apple".

1. **Accuracy**:
    - Definition: The proportion of all predictions that were correct.
    - Example: If our machine looked at 100 fruits and correctly identified 90 of them (either as apples or not-apples), its accuracy is 90%.
    - Think of it as the overall correctness of the machine.

2. **Precision**:
    - Definition: Of all the items labeled by the machine as "apple," what proportion was actually apples?
    - Simple Explanation: Suppose our machine pointed out 50 fruits as apples, but only 25 of them were real apples. The precision would be 25/50 or 50%.
    - It tells us how "precise" our machine is when calling something an apple.

3. **Recall** (also known as Sensitivity or True Positive Rate):
    - Definition: Of all the real apples in the basket, what proportion did the machine correctly identify as apples?
    - Simple Explanation: If there were 50 actual apples and our machine only identified 40 of them, then the recall is 40/50 or 80%.
    - Recall answers the question: How many actual apples did we manage to recall/identify?

4. **F1 Score**:
    - Definition: The harmonic mean of precision and recall, giving a balance between the two.
    - Simple Explanation: If our precision is 80% and our recall is also 80%, the F1 score is also 80%. But if either drops, the F1 score drops too. It's a single metric that tries to balance precision and recall.
    - Consider it a balanced score: if either precision or recall is low, the F1 score will be low.

Using the apple example, let's say:
- The machine points at 10 fruits saying "these are apples!"
- In reality, only 8 of them are apples (so, precision = 8/10 = 80%).
- But there were 20 apples in total, and the machine only identified 8 of them (so, recall = 8/20 = 40%).

While precision is high, recall is low, which means the machine is good at being sure when it says something is an apple, but it's missing a lot of the actual apples. The F1 score would reflect this imbalance.
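
The numbers from the apple example can be checked with a few lines of plain Python. This is a minimal sketch with a hypothetical helper function; it does not handle the edge case where a denominator is zero, which scikit-learn's metric functions take care of.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# 10 fruits flagged as apples, 8 of them correct, 20 apples in the basket:
# 8 true positives, 2 false positives, 12 missed apples (false negatives)
precision, recall, f1 = precision_recall_f1(
    true_positives=8, false_positives=2, false_negatives=12
)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

The F1 score of about 0.53 lies closer to the low recall (0.40) than to the high precision (0.80), because the harmonic mean is pulled towards the smaller of the two values.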

## Fine-tuning and evaluation

Let's start fine-tuning the model!

If you get an 'out-of-memory' error, reduce the 'per_device_train_batch_size' to 8 or 4 in the TrainingArguments above and restart the runtime (menu at the top left: 'Runtime' > 'Restart runtime'). If you don't restart your runtime and rerun the entire script, the 'out-of-memory' error will probably not go away.
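
Reducing the batch size also changes how many texts contribute to each parameter update. If you want to keep this number constant, you can combine a smaller per_device_train_batch_size with gradient_accumulation_steps (commented out in the TrainingArguments above). The helper below is a hypothetical illustration of the arithmetic, not a Transformers function.

```python
def effective_batch_size(per_device_batch_size, grad_accum_steps=1, n_devices=1):
    """Number of texts that contribute to one parameter update."""
    return per_device_batch_size * grad_accum_steps * n_devices


# three settings that need less and less memory per forward/backward pass
settings = [(16, 1), (8, 2), (4, 4)]
for per_device, accum in settings:
    total = effective_batch_size(per_device, accum)
    print(f"batch size {per_device:>2} x accumulation {accum} = {total} texts per update")
```

All three settings update the model on 16 texts at a time, but the later ones hold fewer texts in GPU memory at once and are therefore somewhat slower.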

In [None]:
# training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics_standard
)

trainer.train()


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Accuracy Balanced,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,No log,0.276927,0.917,0.511335,0.602162,0.917,0.470297,0.602162,0.917,0.917
2,No log,0.224581,0.92275,0.527121,0.636834,0.92275,0.480992,0.636834,0.92275,0.92275
3,No log,0.201088,0.939,0.717543,0.76968,0.939,0.691836,0.76968,0.939,0.939
4,No log,0.229019,0.93825,0.71933,0.788898,0.93825,0.670972,0.788898,0.93825,0.93825
5,No log,0.278089,0.935,0.721329,0.794286,0.935,0.674044,0.794286,0.935,0.935
6,No log,0.319131,0.92925,0.706651,0.77363,0.92925,0.679862,0.77363,0.92925,0.92925
7,No log,0.306214,0.93475,0.723522,0.785897,0.93475,0.692216,0.785897,0.93475,0.93475


TrainOutput(global_step=441, training_loss=0.29311645923017643, metrics={'train_runtime': 597.86, 'train_samples_per_second': 11.708, 'train_steps_per_second': 0.738, 'total_flos': 380189262098880.0, 'train_loss': 0.29311645923017643, 'epoch': 7.0})

In [None]:
# Evaluate the fine-tuned model on the held-out test set
#results = trainer.evaluate()
#print(results)

**Careful: do not tune hyperparameters on the test-set**

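For simplicity's sake, we have split our dataset into two parts: the training set and the test set. In practice, the data is often split into three parts: data_train, data_validation, data_test. Note that the training run above evaluates on the test set after every epoch and uses these scores to select the best checkpoint (load_best_model_at_end=True), which is fine for a demo but makes the reported test scores slightly optimistic.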

* **data_train**: This data split is used to train the model
* **data_validation**: This data split is used to validate the choices of hyperparameters. For example, we might want to try a learning_rate of [2e-5, 9e-6, 3e-5], or a batch_size of [16, 32, 64]. We don't know which of these hyperparameters are best for our model and data. We therefore need to validate them.
* **data_test**: This data split is only used for testing our model at the very end, once we have decided on our hyperparameters.
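
A validation set can be carved out of the training data, so that the test set stays untouched until the very end. The sketch below shows the idea with plain Python on row indices; the function name and the 20% validation share are assumptions for illustration. For imbalanced labels like in this dataset, a stratified split (e.g. with scikit-learn's `train_test_split` and its `stratify` argument) is usually preferable.

```python
import random


def split_indices(n_examples, val_fraction=0.2, seed=42):
    """Randomly split row indices into a training and a validation part."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_val = round(n_examples * val_fraction)
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(1000)
print(len(train_idx), "training rows,", len(val_idx), "validation rows")
```

The resulting index lists could then be used to select rows, for example with `df_train.iloc[train_idx]` and `df_train.iloc[val_idx]`.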


## Inference with your fine-tuned model

In [None]:
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU

# documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
pipe_classifier = pipeline(
    "text-classification",
    model=model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    tokenizer=tokenizer,
    framework="pt",
    device=device,
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


We now apply the pipeline to unseen texts. We re-use the df_test data-frame here for simplicity, but it could be any other dataset. It only needs a text column. Note that we do not need to tokenize or re-format the text data ourselves here, as this is handled internally by the Hugging Face text-classification pipeline. If you want to better understand the arguments in the pipeline below, we recommend reading the [documentation here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline).

In [None]:
# create a dummy data frame for illustration
df_inference = df_test[["text", "label_text"]].sample(n=1000, random_state=42).copy(deep=True)
text_lst = df_inference["text"].tolist()

# use the pipeline with your chosen model for inference (prediction)
pipe_output = pipe_classifier(
    text_lst,  # input any list of texts here
    batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
)
print(pipe_output)

df_output = pd.DataFrame(pipe_output)

# add inference data to your original dataframe
df_inference["label_text_pred"] = df_output["label"].tolist()
df_inference["label_text_pred_probability"] = df_output["score"].round(2).tolist()


[{'label': 'Other', 'score': 0.9950510859489441}, {'label': 'Other', 'score': 0.998818576335907}, {'label': 'Other', 'score': 0.9988555908203125}, {'label': 'Other', 'score': 0.9989451766014099}, {'label': 'Military: Positive', 'score': 0.9851321578025818}, {'label': 'Other', 'score': 0.9982683658599854}, {'label': 'Other', 'score': 0.9988805651664734}, {'label': 'Other', 'score': 0.9989067316055298}, {'label': 'Other', 'score': 0.9984219074249268}, {'label': 'Other', 'score': 0.768255352973938}, {'label': 'Other', 'score': 0.9988068342208862}, {'label': 'Military: Negative', 'score': 0.9762265086174011}, {'label': 'Other', 'score': 0.9988584518432617}, {'label': 'Other', 'score': 0.9988420605659485}, {'label': 'Other', 'score': 0.9988879561424255}, {'label': 'Other', 'score': 0.998663067817688}, {'label': 'Other', 'score': 0.9985699653625488}, {'label': 'Other', 'score': 0.9987300038337708}, {'label': 'Other', 'score': 0.9988242983818054}, {'label': 'Other', 'score': 0.998863935470581
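
To make the `label` and `score` fields in this output less of a black box: with its default settings and a single-label model with more than one class, the text-classification pipeline applies a softmax to the model's logits and returns the most probable class. The sketch below reproduces that step in plain Python. The logits are made up, and the `id2label` mapping is an assumption based on the alphabetically sorted labels used in this notebook.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# assumed mapping (sorted label names) and made-up logits for one text
id2label = {0: "Military: Negative", 1: "Military: Positive", 2: "Other"}
logits = [-2.1, -1.3, 4.6]

probs = softmax(logits)
best_id = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": id2label[best_id], "score": probs[best_id]}
print(prediction)
```

The `score` is therefore the model's probability for the predicted class only. It says nothing about how the remaining probability is distributed over the other classes.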

In [None]:
df_inference

idx,text,label_text,label_text_pred,label_text_pred_probability
103403,• To continue to maintain inflation in the ran...,Other,Other,1.00
13031,End the 1 % cap on pay rises in the public sec...,Other,Other,1.00
27347,These savings include: Reducing the size of th...,Other,Other,1.00
104422,This will allow victims to say more in their V...,Other,Other,1.00
62765,"Throughout the Cold War, our international bro...",Other,Military: Positive,0.99
...,...,...,...,...
40325,"Legislate to regulate the charity sector, and ...",Other,Other,1.00
47579,Support the further development of farmers’ ma...,Other,Other,1.00
73225,Small businesses will benefit from the instant...,Other,Other,1.00
13290,We will: Uprate working-age benefits at least ...,Other,Other,1.00
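
To get a feeling for the error patterns in these predictions, it can help to count which gold label gets confused with which predicted label, and to list the predictions the model is unsure about. The helper below is a hypothetical sketch (the name `summarise_errors` and the 0.9 threshold are assumptions). It works on the same list-of-dicts format that the pipeline returns, and the toy inputs are made up.

```python
from collections import Counter

def summarise_errors(gold_labels, pipe_output, low_confidence=0.9):
    """Count (gold, predicted) confusions and collect positions of low-confidence predictions."""
    confusions = Counter()
    uncertain = []
    for position, (gold, pred) in enumerate(zip(gold_labels, pipe_output)):
        if gold != pred["label"]:
            confusions[(gold, pred["label"])] += 1
        if pred["score"] < low_confidence:
            uncertain.append(position)
    return confusions, uncertain

# toy inputs in the same shape as the notebook's data
gold = ["Other", "Other", "Military: Positive", "Other"]
preds = [
    {"label": "Other", "score": 0.995},
    {"label": "Military: Positive", "score": 0.985},
    {"label": "Military: Positive", "score": 0.97},
    {"label": "Other", "score": 0.768},
]
confusions, uncertain = summarise_errors(gold, preds)
print(confusions.most_common())
print(uncertain)
```

In this notebook, the same function could be called with `df_inference["label_text"].tolist()` and `pipe_output`.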


## Save and load your fine-tuned model

In [None]:
run_code_below = False
assert run_code_below, "Stopping code here to avoid accidental runs of the code below with Runtime > Run all"

AssertionError: ignored

### Saving your model to Google Drive

In [None]:
## first you need to connect to your google drive with your google account
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)
#drive.flush_and_unmount()

# insert the path where you want to save the model
os.chdir("/content/drive/My Drive/")
print(os.getcwd())


In [None]:
### save best model to google drive
directory_save_model = f"{training_directory}/"
model_name_custom = f"{model_name.split('/')[-1]}-custom"
mode_custom_path = directory_save_model + model_name_custom

trainer.save_model(output_dir=mode_custom_path)

### Upload your model to the Hugging Face Hub

In [None]:
### Push to Hugging Face hub
# install necessary dependencies
# you need to create an account on https://huggingface.co/ for this
!sudo apt-get install git-lfs
!huggingface-cli login

In [None]:
# load the model and tokenizer you saved to disk before
model = AutoModelForSequenceClassification.from_pretrained(mode_custom_path)
tokenizer = AutoTokenizer.from_pretrained(mode_custom_path, use_fast=True, model_max_length=512)  # trainer.save_model also saved the tokenizer to this directory

In [None]:
# https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub
repo_id = '<your-user-name>/<your-model-name>'  # e.g. "JaneJones/DeBERTa-v3-nli-custom". note that the repo name is case-sensitive
model.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")
tokenizer.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")


## Bonus: Hyperparameter Search

To increase performance, you can also run a hyperparameter search (hp-search) to find better hyperparameters for your specific task and dataset. The trade-off is that hp-search is very compute-intensive. Make sure to conduct the hp-search on a held-out subset of the training set (i.e. a validation set) and not on the final test set, to avoid leaking information from the test set before final testing.

Note that for small datasets, running the hp-search on only one train-validation split is not ideal. For datasets with fewer than around 2000 training data points, we recommend running the hp-search on two different random train-validation splits. We implemented this for our paper, but not in this notebook, as it would make the code harder to understand.
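
The idea of searching over more than one random split can be sketched as follows: score every configuration on each split and pick the configuration with the best average. This is not the implementation from the paper. `search_over_splits` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "train on the split defined by this seed and return a validation metric".

```python
import random
from statistics import mean

def search_over_splits(configs, score_fn, seeds):
    """Return the config with the highest mean score across several train-validation splits."""
    results = []
    for config in configs:
        scores = [score_fn(config, seed) for seed in seeds]
        results.append((mean(scores), config))
    best_score, best_config = max(results, key=lambda result: result[0])
    return best_config, best_score

def toy_score(config, seed):
    # hypothetical stand-in: a real version would split the data with `seed`,
    # fine-tune with `config` and return e.g. the macro F1 on the validation split
    noise = random.Random(seed).uniform(-0.02, 0.02)
    return 0.7 - abs(config["learning_rate"] - 2e-5) * 1e3 + noise

configs = [{"learning_rate": lr} for lr in (9e-6, 2e-5, 4e-5)]
best_config, best_score = search_over_splits(configs, toy_score, seeds=[42, 43])
print(best_config, round(best_score, 3))
```

Averaging over splits makes the choice less sensitive to one lucky or unlucky validation set, at the cost of training each configuration once per split.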

Documentation with more information on hp-search with Hugging Face Transformers is available [here](https://huggingface.co/docs/transformers/main/hpo_train).

In [None]:
## train-validation split - test set should not be visible during hp-search
# https://huggingface.co/docs/datasets/v2.5.1/en/package_reference/main_classes#datasets.Dataset.train_test_split

# the ideal size of the validation set depends on the size of your training data. Each label should have at the very least a few dozen examples in the validation set (ideally several hundred)
validation_set_size = 0.4  # for a training data size of 1000 with 3 classes we use 40% of the training data for validating hyperparameters

# reformatting of label column to enable dataset stratification
from datasets import ClassLabel
new_features = dataset["train"].features.copy()
label_names = list(model.config.label2id.keys())

new_features['label'] = ClassLabel(names=label_names)
dataset = dataset.cast(new_features)

# train-validation split for hp-search
dataset_hp = dataset["train"].train_test_split(test_size=validation_set_size, seed=SEED_GLOBAL, shuffle=True, stratify_by_column="label")
print(dataset_hp)
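
Since each label should have at least a few dozen examples in the validation set, a quick count per label is a useful sanity check after splitting. The helper below is a hypothetical sketch. The name `labels_below_minimum` and the threshold of 30 are assumptions, and the toy labels are made up.

```python
from collections import Counter

def labels_below_minimum(labels, minimum=30):
    """Return the labels (with their counts) that occur fewer than `minimum` times."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < minimum}

# toy validation labels: 400 examples, heavily imbalanced
validation_labels = ["Other"] * 360 + ["Military: Positive"] * 25 + ["Military: Negative"] * 15
too_small = labels_below_minimum(validation_labels)
print(too_small)
```

With the split created in the cell above, the check could be run as `labels_below_minimum(dataset_hp["test"]["label"])`. If it returns any labels, consider increasing `validation_set_size`.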

In [None]:
# helper function to clean memory and reduce risk of out-of-memory error
import gc
def clean_memory():
  #del(model)
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  gc.collect()

clean_memory()

In [None]:
## Reinitialize trainer for hp-search
# https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10

def model_init():
    clean_memory()

    # link the numeric labels to the label texts
    label_text = np.sort(df_test.label_text.unique()).tolist()
    label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
    id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
    config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

    return AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True).to(device)

trainer = Trainer(
    model_init=model_init,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset_hp["train"],
    eval_dataset=dataset_hp["test"],
    compute_metrics=compute_metrics_standard
);


**Define the hyperparameters you want to optimise**

For a detailed discussion of different hyperparameters, see the appendix of our paper.

In [None]:
# we use Optuna for hp-search: https://optuna.readthedocs.io/en/stable/
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [9e-6, 2e-5, 4e-5]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 4, 24, log=False, step=4),   # increasing the maximum number of epochs here could increase performance but will take (much) longer to train
        #"warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.6, log=True),
        "per_device_train_batch_size": 16,  # lower this value in case of out-of-memory errors and restart the runtime
        #"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "evaluation_strategy": "no",
        "save_strategy": "no",
    }
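
The search space defined above is small enough to enumerate, which helps when deciding how many trials to run. The sketch below lists all combinations of the three learning rates and the epoch values 4, 8, ..., 24. The variable names are illustrative, and `n_trials = 10` is only an example value.

```python
from itertools import product

learning_rates = [9e-6, 2e-5, 4e-5]
epoch_options = list(range(4, 24 + 1, 4))  # same values as suggest_int(..., 4, 24, step=4)

grid = list(product(learning_rates, epoch_options))
n_trials = 10  # example value

print(len(grid))  # number of distinct configurations in the search space
print(f"{n_trials} trials can cover at most {n_trials / len(grid):.0%} of the grid")
```

Because the sampler can propose the same configuration more than once, the share of the grid that is actually explored can be lower than this upper bound.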


**Run HP search!**

Choose the number of hyperparameter configurations you want to test. In our experiments we found that after 10 to 15 trials with around 4 hyperparameters, performance is unlikely to increase meaningfully. 15 trials seems to be a safe value, but can take a while to run.

In [None]:
import optuna

# number of different hp configurations to test
number_of_trials = 10  # increasing this value can lead to better hyperparameters, but will take longer

# choose the sampler for sampling hp configurations
optuna_sampler = optuna.samplers.TPESampler(
    seed=SEED_GLOBAL, consider_prior=True, prior_weight=1.0, consider_magic_clip=True,
    consider_endpoints=False, n_startup_trials=number_of_trials // 2, n_ei_candidates=24,  # integer division: n_startup_trials should be an int
    multivariate=False, group=False, warn_independent_sampling=True, constant_liar=False
)  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.TPESampler.html#optuna.samplers.TPESampler

# Hugging Face Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.hyperparameter_search
best_run = trainer.hyperparameter_search(
    n_trials=number_of_trials,
    direction="maximize",
    hp_space=my_hp_space,
    backend='optuna',
    **{"sampler": optuna_sampler}
)

In [None]:
# show best hyperparameters based on hp-search
print(best_run)

**Training Time with optimised hyperparameters!**

Here we can use the original train and test set again.

In [None]:
# update the training arguments with the best hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(train_args, k, v)
print("\n", train_args)

# hp-search with hf causes errors with FP16 for some reason
#setattr(train_args, "fp16", False)
#setattr(train_args, "fp16_full_eval", False)

In [None]:
# reinitialize the model to avoid re-using a trained model from a step further above
#model_name = "XXX"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# link the numeric labels to the label texts
label_text = np.sort(df_test.label_text.unique()).tolist()
label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);


In [None]:
# Training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],  #.shard(index=1, num_shards=100),  # https://huggingface.co/docs/datasets/processing.html#sharding-the-dataset-shard
    eval_dataset=dataset["test"],  #.shard(index=1, num_shards=100),
    compute_metrics=compute_metrics_standard
)

trainer.train()


In [None]:
## Evaluate the fine-tuned model on the held-out test set
results = trainer.evaluate()


In [None]:
print(results)

Note that a hyperparameter search does not necessarily lead to better results. The search has to be run on a validation set carved out of the training set, so each trial trains on less data, and hyperparameters that work best on this smaller split might not generalise to the full training set. Especially for smaller training sets, a hyperparameter search might end up with values similar to good defaults. The default values discussed in the paper often provide good results.

---


## Questions

**Reading, thinking & asking:** (5-10 min)
* What are the three data splits usually used for training and testing a model? Why is each one important?
* In your own words, try and write down the difference between the different metrics.
* What is imbalanced data? Which metrics are better for measuring performance on imbalanced data?
* Choose a hyperparameter from the full list of training arguments in the [HF documentation](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and google its meaning or ask ChatGPT to explain it to you.
* Inspect `df_inference` and see where the model made mistakes and try to get a feeling for the data and patterns of errors.
* You can also run the notebook yourself and try a different model or different hyperparameters.

* **Ask any questions you have in the chat. We will collect them and answer them after a few minutes.**
