<a href="https://colab.research.google.com/github/MST47/Fine-Tuning-BERT-Models/blob/main/1_Fine_Tune_BERT_Base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Install relevant packages

In [None]:
!pip install transformers[sentencepiece]~=4.31.0
!pip install accelerate~=0.21.0
!pip install datasets~=2.14.0
!pip install optuna~=3.3.0

Collecting transformers[sentencepiece]~=4.31.0
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers[sentencepiece]~=4.31.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.19.1
    Uninstalling tokenizers-0.19.1:
      Successfully uninstalled tokenizers-0.19.1
  Attempting uninstall: transformers
    Found existing installation: transformers 4.42.4
    Uninstalling transformers-4.42.4:
      Successfully uninstalled transformers-4.42.4
Successfully installed tokenizers-0.13.3 transformers-4.31.0


In [None]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable

In [None]:
# set random seed for reproducibility
SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

## 1.1 Download data

In [None]:
## Download clean train and test data from github
# data source: https://manifesto-project.wzb.eu/
# codebook with definitions of classes: https://manifesto-project.wzb.eu/down/data/2023a/codebooks/codebook_MPDataset_MPDS2023a.pdf
# more cleaned datasets for testing other tasks: https://github.com/MoritzLaurer/less-annotating-with-bert-nli/tree/master/data_clean

df_train = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_train.csv", index_col="idx")
df_test = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_test.csv", index_col="idx")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets:  3970  (train)  9537  (test).


In [None]:
# full training data table
DataTable(df_train, num_rows_per_page=5)

idx,label,label_text,text_original,label_domain_text,label_subcat_text,text_preceding,text_following,manifesto_id,doc_id,country_name,date,party,cmp_code_hb4,cmp_code,label_subcat_text_simple
24087,2,Other,Under the SNP Government at Holyrood: Tuition ...,Welfare and Quality of Life,Education Expansion,Crimes of handling an offensive weapon (includ...,We’ve introduced the new Curriculum for Excell...,51902_201505,27,United Kingdom,201505,51902,506,506.0,Education
114426,2,Other,We’ve partnered with industry through 22 R&D p...,Social Groups,Agriculture and Farmers: Positive,We will allow young farming families to buy th...,These programmes will help us achieve our goal...,64620_201709,133,New Zealand,201709,64620,703,703.1,Agriculture and Farmers: Positive
102346,2,Other,It is UnitedFuture Policy to: Accept the rec...,Freedom and Democracy,Constitutionalism: Negative,UnitedFuture will endeavour to listen to all e...,UnitedFuture will work with the government to ...,64421_201409,123,New Zealand,201409,64421,204,204.0,Constitutionalism
17562,2,Other,"They form a key part of British life, valued f...",Economy,Incentives: Positive,We understand that small businesses are the we...,We will continue to support small businesses t...,51620_201706,19,United Kingdom,201706,51620,402,402.0,Incentives: Positive
53059,2,Other,f) Jobs created for the under-25s.,Social Groups,Non-economic Demographic Groups,Provide start-up funding and other support co-...,"A Youth Jobs Fund to create 20,000 new jobs an...",53951_201102,55,Ireland,201102,53951,706,706.0,Non-economic Demographic Groups
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69007,1,Military: Positive,Those reductions are dangerous and counter-pro...,External Relations,Military: Positive,and oppose the current Administration’s cuts t...,Guard and Reserve forces are currently deploye...,61620_202011,67,United States,202011,61620,104,104.0,Military
81219,1,Military: Positive,Labor also failed to place a single order at a...,External Relations,Military: Positive,Labor’s decisions led to 119 defence projects ...,Under Labor the Australian Defence industry sh...,63620_201607,84,Australia,201607,63620,104,104.0,Military
64975,1,Military: Positive,and will not accept attempts to undermine mili...,External Relations,Military: Positive,We reject the use of the military as a platfor...,"Consistent with this commitment, we believe co...",61620_201211,65,United States,201211,61620,104,104.0,Military
103534,1,Military: Positive,and we backed our words with troops.,External Relations,Military: Positive,New Zealand earned respect for encouraging APE...,National will continue to give New Zealand inf...,64620_199911,128,New Zealand,199911,64620,104,104.0,Military


**Running the notebook on your own dataset:**

Load your own training and test data above to fine-tune your own BERT model. Your own dataframe only needs three columns to be compatible with the code below: <br>
(1) a "label" column with a numeric label <br>
(2) a "label_text" column with the label name in plain language <br>
(3) a "text" column with the texts for training (you might need to delete/adapt the text preparation code cell below for your dataset).
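
A quick way to confirm that a custom dataframe matches this layout is to compare its column names against the three required ones. The sketch below is only an illustration: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Hypothetical helper: report which of the required columns are absent.
REQUIRED_COLUMNS = ("label", "label_text", "text")

def missing_columns(columns):
    """Return the required column names that are not in `columns`, in a fixed order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]

# Example with plain lists of column names; with pandas, pass `df.columns`.
print(missing_columns(["label", "label_text", "text", "party"]))  # []
print(missing_columns(["label", "text_original"]))  # ['label_text', 'text']
```

An empty list means the dataframe can be used with the cells below without renaming columns.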

In [None]:
# optional: use training data sample size of e.g. 1000 for faster testing
sample_size = 1000
df_train = df_train.sample(n=min(sample_size, len(df_train)), random_state=SEED_GLOBAL).copy(deep=True)
df_test = df_test.sample(n=min(sample_size*4, len(df_test)), random_state=SEED_GLOBAL).copy(deep=True)

print("Length of training and test sets after sampling: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets after sampling:  1000  (train)  4000  (test).


In [None]:
## inspect the data
# label distribution train set
print("Train set label distribution:\n", df_train.label_text.value_counts(), "\n")
# label distribution test set
print("Test set label distribution:\n", df_test.label_text.value_counts())


Train set label distribution:
 label_text
Other                 516
Military: Positive    399
Military: Negative     85
Name: count, dtype: int64 

Test set label distribution:
 label_text
Other                 3646
Military: Positive     248
Military: Negative     106
Name: count, dtype: int64


## 2. Data preprocessing

**Prepare the input text**

1.) We build one input text per example from the columns `text_preceding`, `text_original` and `text_following`. Missing context sentences are replaced with empty strings. A standard classifier needs no hypothesis template, so the target text is not wrapped in any additional string.

2.) We surround the target text with its preceding and following sentence. Adding context like this systematically increases performance.
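
The idea of adding context can be sketched on plain Python strings. This is a simplified illustration, not the notebook's pandas code: `add_context` is a hypothetical helper, it treats `None` as a missing neighbour, and unlike the `fillna("")` concatenation in the next cell it does not leave leading or trailing spaces.

```python
def add_context(preceding, target, following):
    """Join a target sentence with its neighbours, skipping missing ones."""
    parts = [preceding, target, following]
    # None or "" marks a missing sentence, e.g. at the start or end of a document.
    return " ".join(part.strip() for part in parts if part)

print(add_context("We value our troops.", "We will raise defence spending.", None))
# We value our troops. We will raise defence spending.
```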


In [None]:
df_train["text"] = df_train.text_preceding.fillna("") + " " + df_train.text_original.fillna("") + " " + df_train.text_following.fillna("")
df_test["text"] = df_test.text_preceding.fillna("") + " " + df_test.text_original.fillna("") + " " + df_test.text_following.fillna("")


In [None]:
df_train = df_train[["label", "label_text", "text"]]
df_test = df_test[["label", "label_text", "text"]]

In [None]:
DataTable(df_train, num_rows_per_page=5)

idx,label,label_text,text
71033,0,Military: Negative,emergency relief in situations of armed confli...
26600,2,Other,the introduction of a NI Register of Animal Cr...
64968,1,Military: Positive,We will implement sound management policies to...
119889,2,Other,reviewing the appeal process for ethics commit...
69067,1,Military: Positive,Both recognized the need to stand with friends...
...,...,...,...
5284,2,Other,The healthy future of our environment is one o...
27167,1,Military: Positive,The DUP supports: A new long-term plan for Arm...
101296,2,Other,Increase funding for community-based providers...
93038,2,Other,The Green Party is committed to passenger rail...


## 3. Load a Transformer

We use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) for loading and training our model. They provide great documentation and also a very good [course](https://huggingface.co/course/chapter1/1) on how to use Transformers.

**Choosing a Transformer model**

You can use any compatible pretrained model on the [Hugging Face Hub](https://huggingface.co/models?sort=downloads). I suggest testing these models:



*   Original BERT (good, but outdated): `bert-base-uncased`
*   Small efficient model: `distilbert-base-uncased`
*   Newer, improved BERT-style model: `microsoft/deberta-v3-base`
*   Large, high-performance model: `microsoft/deberta-v3-large`
*   Multilingual model: `microsoft/mdeberta-v3-base`





In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch

## load a model and its tokenizer
model_name = "microsoft/deberta-v3-base"  # replace e.g. with "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# link the numeric labels to the label texts
# note: ids are assigned in alphabetical order of the label names; this must match the numeric "label" column in the data
label_text = sorted(df_test.label_text.unique().tolist())
label2id = {label: i for i, label in enumerate(label_text)}
id2label = {i: label for label, i in label2id.items()}
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id))
print("\n", label2id, "\n")

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/579 [00:00<?, ?B/s]

spm.model:   0%|          | 0.00/2.46M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



 {'Military: Negative': 0, 'Military: Positive': 1, 'Other': 2} 





pytorch_model.bin:   0%|          | 0.00/371M [00:00<?, ?B/s]

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-base and are newly initialized: ['classifier.weight', 'pooler.dense.weight', 'classifier.bias', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device: cuda
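
The label mapping printed above is derived from the alphabetical order of the label names, so it is worth checking that it agrees with the numeric `label` column of the data. The following is a minimal sketch on toy `(label, label_text)` pairs; `build_label_maps` and `mapping_conflicts` are hypothetical helpers, not functions from Transformers.

```python
def build_label_maps(label_texts):
    """Map alphabetically sorted label names to consecutive integer ids."""
    names = sorted(set(label_texts))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

def mapping_conflicts(pairs, label2id):
    """Return the (label, label_text) pairs that disagree with label2id."""
    return sorted({(lab, txt) for lab, txt in pairs if label2id.get(txt) != lab})

# Toy (label, label_text) pairs mirroring the three classes in this notebook.
pairs = [(2, "Other"), (1, "Military: Positive"), (0, "Military: Negative"), (2, "Other")]
label2id, id2label = build_label_maps(txt for _, txt in pairs)
print(label2id)
print(mapping_conflicts(pairs, label2id))  # an empty list means the numeric labels agree
```

With a real dataframe, the pairs could be taken from the `label` and `label_text` columns, e.g. via `zip(df.label, df.label_text)`.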


## 4. Tokenize data

In [None]:
# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing
import datasets

dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})

# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

dataset = dataset.map(tokenize, batched=True)

# remove unnecessary columns for model training
dataset = dataset.remove_columns(['label_text'])   #'text_original', 'label_domain_text', 'label_subcat_text', 'text_preceding', 'text_following', 'manifesto_id', 'doc_id', 'country_name', 'date', 'party', 'cmp_code_hb4', 'cmp_code', 'label_subcat_text_simple'])


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

**Inspect processed data**

In [None]:
print("The overall structure of the pre-processed train and test sets:\n")
print(dataset)

The overall structure of the pre-processed train and test sets:

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4000
    })
})


**Explanation of different elements in the tokenized dataset:**

1. **idx**:
Purpose: A unique identifier or index for each data point in the dataset. It's useful for tracking, referencing, or debugging individual examples.
Example: Any integer value like 42.

2. **input_ids**:
Purpose: These are token IDs that represent each token in the text. The raw text is tokenized, and each token is mapped to an ID based on the tokenizer's vocabulary. This is the primary input to BERT and other transformer models.
Example: For the word "hugging", the tokenizer might produce an ID like 20345.

3. **token_type_ids**:
Purpose: Used by models like BERT for tasks whose input is a pair of sentences (e.g., Next Sentence Prediction or Question Answering). It differentiates between the first and the second sentence. For a single input text, this is a sequence of 0s; for a sentence pair, the tokens of the first sentence get 0s and those of the second get 1s.
Example: For the sentence pair ("Hugging Face is great.", "I love it!"), the token_type_ids might look like [0, 0, 0, 0, 0, 1, 1, 1]. In this notebook each example is a single text, so all values are 0.

4. **attention_mask**:
Purpose: Specifies which tokens should be attended to by the model. A value of 1 means the token should be considered, while 0 means it should be ignored. This is especially useful when batching sequences of different lengths together, requiring padding tokens to be added. The padding tokens have an attention_mask value of 0.
Example: For the sentence "I love Hugging Face." with padding, the attention_mask might be [1, 1, 1, 1, 0, 0, 0] where the 0s represent the padding tokens.
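
To make the link between padding and the attention mask concrete, here is a small sketch in plain Python. It assumes right-padding with a padding id of 0 and uses made-up token ids; `pad_batch` is a hypothetical helper. In this notebook the tokenizer and the Trainer take care of padding, so this is only meant to show the principle.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real tokens, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
    return input_ids, attention_mask

# Made-up token ids; real ids come from the tokenizer's vocabulary.
ids, mask = pad_batch([[1, 2644, 3478, 2], [1, 516, 2]])
print(ids)   # [[1, 2644, 3478, 2], [1, 516, 2, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```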

In [None]:
print("\n\nAn example for a row in the tokenized dataset:\n")
for key, value in dataset["train"][0].items():
    print(key, ":    ", value)



An example for a row in the tokenized dataset:

label :     0
text :     emergency relief in situations of armed conflict should be carried out by civilians and must be clearly distinguished from any military activities. a direct role for military forces in the provision of relief should be restricted to situations involving natural disasters where ambiguity over the military role is unlikely to arise. aid programs should not be used to influence the democratic preferences of any nation.
idx :     71033
input_ids :     [1, 2644, 3478, 267, 3335, 265, 5652, 3423, 403, 282, 2635, 321, 293, 10936, 263, 516, 282, 2117, 10045, 292, 356, 1681, 1157, 260, 266, 1670, 985, 270, 1681, 2499, 267, 262, 5048, 265, 3478, 403, 282, 6790, 264, 3335, 3849, 1008, 12871, 399, 26756, 360, 262, 1681, 985, 269, 5220, 264, 7246, 260, 2777, 1309, 403, 298, 282, 427, 264, 2399, 262, 8221, 7162, 265, 356, 2080, 260, 2]
token_type_ids :     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

## 5. Setting training arguments / hyperparameters

The following cells set several important hyperparameters. We chose parameters that work well in general to avoid the need for hyperparameter search. Further below, we also provide code for hyperparameter search, if researchers want to try to increase performance by a few percentage points.

In [None]:
# Set the directory to write the fine-tuned model and training logs to.
# With google colab, this will create a temporary folder, which will be deleted once you disconnect.
# You can connect to your personal google drive to save models and logs properly.
training_directory = "BERT-demo"

# FP16 is a hyperparameter which can increase training speed and reduce memory consumption, but only on GPU and if batch-size > 8, see here: https://huggingface.co/transformers/performance.html?#fp16
# FP16 does not work on CPU or for multilingual mDeBERTa models
#fp16_bool = True if torch.cuda.is_available() else False
#if "mdeberta" in model_name.lower(): fp16_bool = False  # multilingual mDeBERTa does not support FP16 yet: https://github.com/microsoft/DeBERTa/issues/77
# in case of hyperparameter search at the end: FP16 has to be set to False. The integrated hyperparameter search with the Hugging Face Trainer can lead to errors otherwise.


In [None]:
from transformers import TrainingArguments, Trainer, logging

# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
# Hugging Face tips to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
train_args = TrainingArguments(
    num_train_epochs=7,  # this can be increased, but higher values increase training time. Reasonable values are typically between 3 and 20.
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # if you get an out-of-memory error, reduce this value to 8 or 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values here are usually a multiple of 8.
    per_device_eval_batch_size=64,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    #gradient_accumulation_steps=2, # Can be used in case of memory problems: gradients are accumulated over X batches before each parameter update, so a smaller per-device batch size gives the same effective batch size. Decreases memory usage, but also slightly decreases speed. (!adapt/halve batch size accordingly)
    warmup_ratio=0.06,  # share of training steps used for learning-rate warm-up; 0.06 is a good default value for BERT-base-sized models
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    #fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    #fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    #eval_steps=10_000,  # evaluate every n steps if evaluation_strategy=='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of update steps between two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to="all",  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
)


**Explanation of different training arguments:**

You can find more arguments, explanations and examples in the Hugging Face [documentation](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).

* **num_train_epochs**:
Specifies the number of times the entire training dataset is passed through the model. For example, num_train_epochs=3 means the trainer will iterate over the entire training dataset three times.

* **per_device_train_batch_size**:
The model does not learn from the entire dataset at once, but in batches of e.g. 16 texts. For example, if per_device_train_batch_size=16, then the model analyses 16 texts and sees how wrong it was on these 16 texts. After analysing these 16 texts, the model's parameters are updated/optimised to make the model less wrong on these texts. The degree to which the model's parameters are updated is called 'learning rate'.

* **learning_rate**:
The "rate" or speed with which the model's parameters are updated by the optimisation algorithm. A smaller value updates the model's parameters more cautiously after each batch, while a larger value updates them more drastically. A good general value is 2e-5 (which means 0.00002).

* **per_device_eval_batch_size**:
The number of evaluation examples used in one batch during evaluation. This batch size is irrelevant for the model's learning. A higher value makes evaluation faster (more texts are processed at the same time), but higher values also cost memory and increase the risk of out-of-memory errors (OOM).

* **gradient_accumulation_steps**:
The number of batches over which gradients are accumulated before the model's parameters are updated. Instead of updating after every batch, the trainer sums the gradients of gradient_accumulation_steps batches and then performs one update. Useful for training with larger effective batch sizes using limited memory.

* **warmup_ratio**:
Specifies the ratio of total training steps for which the learning rate will be linearly increased (warm-up phase) before it's decayed.

* **weight_decay**:
A regularization technique which penalizes large weights by adding a penalty term to the loss. It helps prevent overfitting.

* **seed**:
Sets a seed for reproducibility, so that multiple runs with the same seed produce the same (or, on GPU, very similar) results.

* **load_best_model_at_end**:
Loads the best model (according to metric_for_best_model) at the end of training instead of the last model.

* **metric_for_best_model**:
Determines which metric to use for evaluating and determining the best model during training.

* **evaluation_strategy**:
Defines when to evaluate the model. For example, evaluation_strategy="epoch" evaluates the model after every epoch.

* **save_strategy**:
Specifies when to save the model. For example, save_strategy="epoch" saves the model after every epoch.

* **fp16**:
Enables mixed precision training if set to True, which can speed up training and reduce memory usage. But this does not work with every model (e.g. multilingual mDeBERTa), only works on GPU, and mainly pays off with larger batch sizes (above 8).
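
To see how these arguments interact, the sketch below works out the number of parameter updates and the learning rate over the course of training. It assumes a single device, no gradient accumulation, and a linear warm-up followed by a linear decay to zero (to our knowledge the Trainer's default schedule); the function names are hypothetical. With 1000 training texts, a batch size of 16 and 7 epochs, it arrives at the same 441 steps that the training output further below reports as `global_step`.

```python
import math

def total_update_steps(n_examples, batch_size, epochs):
    """Number of optimizer updates on one device without gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs

def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = total_update_steps(n_examples=1000, batch_size=16, epochs=7)
warmup = math.ceil(total * 0.06)  # warmup_ratio=0.06
print(total, warmup)  # 441 27

# Learning rate at the start, at the end of warm-up, and at the last step.
for step in (0, warmup, total):
    print(step, linear_warmup_decay(step, total, warmup, peak_lr=2e-5))
```

The learning rate starts at 0, reaches the configured `learning_rate` of 2e-5 after the warm-up steps and then falls back to 0 by the final step.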


## 6. Calculate Metrics

In [None]:
# Function to calculate metrics
# documentation on all metrics: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
import warnings

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")

        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        # metrics
        precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(labels, preds_max, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        acc_balanced = balanced_accuracy_score(labels, preds_max)
        acc_not_balanced = accuracy_score(labels, preds_max)

        metrics = {
            'accuracy': acc_not_balanced,
            'f1_macro': f1_macro,
            'accuracy_balanced': acc_balanced,
            'f1_micro': f1_micro,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'precision_micro': precision_micro,
            'recall_micro': recall_micro,
        }

        return metrics
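
The function above reports both macro and micro averages. For single-label classification the micro-averaged F1 equals plain accuracy, which is why the two columns are identical in the training log. The macro average instead gives every class the same weight, so a rare class that is predicted badly pulls it down. The following pure-Python sketch illustrates this on made-up labels; it is a simplified re-implementation for illustration, not the scikit-learn code used above.

```python
def per_class_f1(labels, preds, cls):
    """F1 for one class, treating it as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    return sum(per_class_f1(labels, preds, c) for c in classes) / len(classes)

def accuracy(labels, preds):
    """Share of predictions that match the true label."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

# Toy data: class 2 dominates, class 0 is rare and always missed.
labels = [2, 2, 2, 2, 2, 2, 2, 2, 1, 0]
preds  = [2, 2, 2, 2, 2, 2, 2, 2, 1, 2]
print(accuracy(labels, preds))  # 0.9
print(macro_f1(labels, preds))  # clearly lower, because class 0 has an F1 of 0
```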


**Explanation of key metrics**:

Imagine we have a machine that identifies apples in a basket of mixed fruits. The task of the machine is: label each fruit as "apple" or "not-apple".

1. **Accuracy**:
    - Definition: The proportion of all predictions that were correct.
    - Example: If our machine looked at 100 fruits and correctly identified 90 of them (either as apples or not-apples), its accuracy is 90%.
    - Think of it as the overall correctness of the machine.

2. **Precision**:
    - Definition: Of all the items labeled by the machine as "apple," what proportion was actually apples?
    - Simple Explanation: Suppose our machine pointed out 50 fruits as apples, but only 25 of them were real apples. The precision would be 25/50 or 50%.
    - It tells us how "precise" our machine is when calling something an apple.

3. **Recall** (also known as Sensitivity or True Positive Rate):
    - Definition: Of all the real apples in the basket, what proportion did the machine correctly identify as apples?
    - Simple Explanation: If there were 50 actual apples and our machine only identified 40 of them, then the recall is 40/50 or 80%.
    - Recall answers the question: How many actual apples did we manage to recall/identify?

4. **F1 Score**:
    - Definition: The harmonic mean of precision and recall, giving a balance between the two.
    - Simple Explanation: If our precision is 80% and our recall is also 80%, the F1 score is also 80%. But if either drops, the F1 score drops too. It's a single metric that tries to balance precision and recall.
    - Consider it a balanced score: if either precision or recall is low, the F1 score will be low.

Using the apple example, let's say:
- The machine points at 10 fruits saying "these are apples!"
- In reality, only 8 of them are apples (so, precision = 8/10 = 80%).
- But there were 20 apples in total, and the machine only identified 8 of them (so, recall = 8/20 = 40%).

While precision is high, recall is low, which means the machine is good at being sure when it says something is an apple, but it's missing a lot of the actual apples. The F1 score would reflect this imbalance.
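
The F1 score for this apple example can be worked out directly from the definition as the harmonic mean of precision and recall. The helper name `f1_score` below is our own choice for this small sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision = 8 / 10   # 8 of the 10 fruits flagged as apples are apples
recall = 8 / 20      # 8 of the 20 real apples were found
print(round(f1_score(precision, recall), 3))  # 0.533
```

At about 53%, the F1 score sits much closer to the low recall (40%) than a simple average of the two values (60%) would.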

## 7. Fine-tuning and evaluation

Let's start fine-tuning the model!

If you get an 'out-of-memory' error, reduce the 'per_device_train_batch_size' to 8 or 4 in the TrainingArguments above and restart the runtime. If you don't restart your runtime (menu at the top left: 'Runtime' > 'Restart runtime') and rerun the entire script, the 'out-of-memory' error will probably not go away.

In [None]:
# training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics_standard
)

trainer.train()


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy,F1 Macro,Accuracy Balanced,F1 Micro,Precision Macro,Recall Macro,Precision Micro,Recall Micro
1,No log,0.27777,0.9065,0.496602,0.590806,0.9065,0.457734,0.590806,0.9065,0.9065
2,No log,0.252643,0.91875,0.614681,0.685318,0.91875,0.76195,0.685318,0.91875,0.91875
3,No log,0.339462,0.90725,0.682582,0.811617,0.90725,0.617591,0.811617,0.90725,0.90725
4,No log,0.382712,0.91375,0.656996,0.731637,0.91375,0.672022,0.731637,0.91375,0.91375
5,No log,0.296117,0.93625,0.734561,0.7931,0.93625,0.705259,0.7931,0.93625,0.93625
6,No log,0.389924,0.92575,0.688283,0.751839,0.92575,0.692489,0.751839,0.92575,0.92575
7,No log,0.327059,0.9355,0.726502,0.786328,0.9355,0.70297,0.786328,0.9355,0.9355


TrainOutput(global_step=441, training_loss=0.26228801355340314, metrics={'train_runtime': 632.7482, 'train_samples_per_second': 11.063, 'train_steps_per_second': 0.697, 'total_flos': 380189262098880.0, 'train_loss': 0.26228801355340314, 'epoch': 7.0})

In [None]:
# Evaluate the fine-tuned model on the held-out test set
results = trainer.evaluate()
print(results)

{'eval_loss': 0.29611748456954956, 'eval_accuracy': 0.93625, 'eval_f1_macro': 0.7345606944589896, 'eval_accuracy_balanced': 0.7930998121765716, 'eval_f1_micro': 0.93625, 'eval_precision_macro': 0.7052587940630941, 'eval_recall_macro': 0.7930998121765716, 'eval_precision_micro': 0.93625, 'eval_recall_micro': 0.93625, 'eval_runtime': 38.3205, 'eval_samples_per_second': 104.383, 'eval_steps_per_second': 1.644, 'epoch': 7.0}


**Careful: do not tune hyperparameters on the test-set**

For simplicity's sake, we have split our dataset in two parts: the training set and the test set. In reality, the data is often split in three parts: data_train, data_validation, data_test.

* **data_train**: This data split is used to train the model
* **data_validation**: This data split is used to validate the choices of hyperparameters. For example, we might want to try a learning_rate of [2e-5, 9e-6, 3e-5], or a batch_size of [16, 32, 64]. We don't know which of these hyperparameters are best for our model and data. We therefore need to validate them.
* **data_test**: This data split is only used for testing our model at the very end, once we have decided on our hyperparameters.
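
A three-way split can be sketched with the standard library alone. The helper `three_way_split` and its fractions are illustrative assumptions, not part of this notebook; it shuffles row positions with a fixed seed and cuts them into three lists. The split is not stratified, so for imbalanced data like ours a stratified split (for example with scikit-learn's `train_test_split` and its `stratify` argument) may be preferable.

```python
import random

def three_way_split(n_rows, frac_validation=0.1, frac_test=0.2, seed=42):
    """Shuffle row positions and cut them into train/validation/test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # fixed seed for reproducibility
    n_test = round(n_rows * frac_test)
    n_val = round(n_rows * frac_validation)
    test = positions[:n_test]
    validation = positions[n_test:n_test + n_val]
    train = positions[n_test + n_val:]
    return train, validation, test

train_pos, val_pos, test_pos = three_way_split(1000)
print(len(train_pos), len(val_pos), len(test_pos))  # 700 100 200
```

With pandas, the resulting positions could be used to select rows, e.g. `df.iloc[val_pos]`.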


## 8. Inference with your fine-tuned model

In [None]:
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU

# documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
pipe_classifier = pipeline(
    "text-classification",
    model=model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    tokenizer=tokenizer,
    framework="pt",
    device=device,
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.


We now apply the pipeline to unseen texts. We re-use the df_test data-frame here for simplicity, but it could be any other dataset. It only needs a text column. Note that we do not need to tokenize the texts ourselves here, as this is handled internally by the Hugging Face text-classification pipeline. If you want to better understand the arguments in the pipeline below, we recommend reading the [documentation here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline).
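
Each pipeline result consists of a `label` and a `score`. As far as we know, for a single-label model with several classes the score is the softmax probability of the most likely class. The sketch below mimics this step on made-up logits; `softmax` and `top_prediction` are hypothetical helpers written for illustration, not the pipeline's actual code.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_prediction(logits, id2label):
    """Return the most probable label and its probability, like one pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

id2label = {0: "Military: Negative", 1: "Military: Positive", 2: "Other"}
print(top_prediction([-2.0, 0.5, 4.0], id2label))  # made-up logits; the label is 'Other'
```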

In [None]:
# create a dummy data frame for illustration
df_inference = df_test[["text", "label_text"]].sample(n=1000, random_state=42).copy(deep=True)
text_lst = df_inference["text"].tolist()

# use the pipeline with your chosen model for inference (prediction)
pipe_output = pipe_classifier(
    text_lst,  # input any list of texts here
    batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
)
print(pipe_output)

df_output = pd.DataFrame(pipe_output)

# add inference data to your original dataframe
df_inference["label_text_pred"] = df_output["label"].tolist()
df_inference["label_text_pred_probability"] = df_output["score"].round(2).tolist()


[{'label': 'Other', 'score': 0.998685896396637}, {'label': 'Other', 'score': 0.9991280436515808}, {'label': 'Other', 'score': 0.9991200566291809}, {'label': 'Other', 'score': 0.9991393089294434}, {'label': 'Military: Positive', 'score': 0.9910440444946289}, {'label': 'Other', 'score': 0.9985438585281372}, {'label': 'Other', 'score': 0.9990531802177429}, {'label': 'Other', 'score': 0.9990807771682739}, {'label': 'Other', 'score': 0.9987145662307739}, {'label': 'Other', 'score': 0.997471809387207}, {'label': 'Other', 'score': 0.9990853071212769}, {'label': 'Military: Negative', 'score': 0.9286819100379944}, {'label': 'Other', 'score': 0.9991012811660767}, {'label': 'Other', 'score': 0.999097466468811}, {'label': 'Other', 'score': 0.9990671277046204}, {'label': 'Other', 'score': 0.9991342425346375}, {'label': 'Other', 'score': 0.9990742206573486}, {'label': 'Other', 'score': 0.9991135001182556}, {'label': 'Other', 'score': 0.9991039633750916}, {'label': 'Other', 'score': 0.999064862728118

In [None]:
df_inference

idx,text,label_text,label_text_pred,label_text_pred_probability
103403,• To continue to maintain inflation in the ran...,Other,Other,1.00
13031,End the 1 % cap on pay rises in the public sec...,Other,Other,1.00
27347,These savings include: Reducing the size of th...,Other,Other,1.00
104422,This will allow victims to say more in their V...,Other,Other,1.00
62765,"Throughout the Cold War, our international bro...",Other,Military: Positive,0.99
...,...,...,...,...
40325,"Legislate to regulate the charity sector, and ...",Other,Other,1.00
47579,Support the further development of farmers’ ma...,Other,Other,1.00
73225,Small businesses will benefit from the instant...,Other,Other,1.00
13290,We will: Uprate working-age benefits at least ...,Other,Other,1.00


## 9. Save and load your fine-tuned model

### Saving your model to Google Drive

In [None]:
## first you need to connect to your google drive with your google account
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)
#drive.flush_and_unmount()

# insert the path where you want to save the model
os.chdir("/content/drive/My Drive/")
print(os.getcwd())


Mounted at /content/drive
/content/drive/My Drive


In [None]:
### save best model to google drive
directory_save_model = f"{training_directory}/"
model_name_custom = f"{model_name.split('/')[-1]}-custom"
model_custom_path = directory_save_model + model_name_custom
print(model_custom_path)

trainer.save_model(output_dir=model_custom_path)

BERT-demo/deberta-v3-base-custom
