#  🧑‍💻 Fine-tuning BERT-NLI 🧑‍💻

📅 _Data Science Summer School 2023, 22.08.2023_

👨‍🏫 By [Moritz Laurer](https://www.linkedin.com/in/moritz-laurer/).
For questions, reach out to: m.laurer@vu.nl


<a href="https://github.com/MoritzLaurer/summer-school-transformers-2023/blob/main/3_tune_bert_nli.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is partly based on my paper: Laurer, Moritz, Van Atteveldt, W., Casas, A., & Welbers, K. (2023). [Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI](https://www.cambridge.org/core/journals/political-analysis/article/less-annotating-more-classifying-addressing-the-data-scarcity-issue-of-supervised-machine-learning-with-deep-transfer-learning-and-bertnli/05BB05555241762889825B080E097C27). Political Analysis, 1-17. doi:10.1017/pan.2023.20

## BERT-base vs. BERT-NLI

... metaphor on absurdity of BERT-base; and label verbalisation, universal task ...

## Activate a GPU runtime

To run this notebook on a GPU, click on "Runtime" > "Change runtime type" > select "GPU" in the menu bar at the top left. Training a Transformer is much faster on a GPU. Given Colab's usage limits for GPUs, it is advisable to first test your non-training code on a CPU and only use the GPU once you know that everything is working.

## Install relevant packages

In [None]:
!pip install transformers[sentencepiece]~=4.31.0
!pip install accelerate~=0.21.0
!pip install datasets~=2.14.0
!pip install optuna~=3.3.0


In [None]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable

In [None]:
# set random seed for reproducibility
SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

## Download and prepare data

In [None]:
## Download clean train and test data from github
# data source: https://manifesto-project.wzb.eu/
# codebook with definitions of classes: https://manifesto-project.wzb.eu/down/data/2023a/codebooks/codebook_MPDataset_MPDS2023a.pdf
# more cleaned datasets for testing other tasks: https://github.com/MoritzLaurer/less-annotating-with-bert-nli/tree/master/data_clean

df_train = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_train.csv", index_col="idx")
df_test = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_test.csv", index_col="idx")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets:  3970  (train)  9537  (test).


**If you want to run the notebook on your own dataset:**

You can load your own training and test data above to fine-tune your own BERT model. Your own dataframe only needs three columns to be compatible with the code below: (1) a "label" column with a numeric label; (2) a "label_text" column with the label name in plain language, (3) a "text" column with the texts for training (you might need to delete/adapt the text preparation code cell below for your dataset).

In [None]:
## you can also load your own .csv files from Google Drive
"""
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)

# set the path to your data
os.chdir("/content/drive/My Drive/PhD/data")
print(os.getcwd())

df_train = pd.read_csv("./df_manifesto_morality_train.csv")
df_test = pd.read_csv("./df_manifesto_morality_test.csv")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")
"""



In [None]:
# optional: use training data sample size of e.g. 1000 for faster testing
sample_size = 1000
df_train = df_train.sample(n=min(sample_size, len(df_train)), random_state=SEED_GLOBAL).copy(deep=True)
df_test = df_test.sample(n=min(sample_size*4, len(df_test)), random_state=SEED_GLOBAL).copy(deep=True)

print("Length of training and test sets after sampling: ", len(df_train), " (train) ", len(df_test), " (test).")

Length of training and test sets after sampling:  1000  (train)  4000  (test).


In [None]:
## inspect the data
# label distribution train set
print("Train set label distribution:\n", df_train.label_text.value_counts())
# label distribution test set
print("Test set label distribution:\n", df_test.label_text.value_counts())

# full training data table
DataTable(df_train, num_rows_per_page=5)

Train set label distribution:
 Other                 516
Military: Positive    399
Military: Negative     85
Name: label_text, dtype: int64
Test set label distribution:
 Other                 3646
Military: Positive     248
Military: Negative     106
Name: label_text, dtype: int64


idx,label,label_text,text_original,label_domain_text,label_subcat_text,text_preceding,text_following,manifesto_id,doc_id,country_name,date,party,cmp_code_hb4,cmp_code,label_subcat_text_simple
71033,0,Military: Negative,a direct role for military forces in the provi...,External Relations,Military: Negative,emergency relief in situations of armed confli...,aid programs should not be used to influence t...,63110_201008,70,Australia,201008,63110,105,105.0,Military
26600,2,Other,and the seriousness of these animal cruelty of...,Welfare and Quality of Life,Environmental Protection,the introduction of a NI Register of Animal Cr...,the PSNI in their efforts to target criminal g...,51903_201706,31,United Kingdom,201706,51903,501,501.0,Environmental Protection
64968,1,Military: Positive,We reject Congressional earmarks that put pers...,External Relations,Military: Positive,We will implement sound management policies to...,"We recognize the need for, and value of, compe...",61620_201211,65,United States,201211,61620,104,104.0,Military
119889,2,Other,"As a partner in Government, the Māori Party ha...",Welfare and Quality of Life,Welfare State Expansion,reviewing the appeal process for ethics commit...,secured $65.3 million for rheumatic fever prev...,64901_201409,141,New Zealand,201409,64901,504,504.0,Welfare State
69067,1,Military: Positive,"Whatever their disagreements, both the Republi...",External Relations,Military: Positive,Both recognized the need to stand with friends...,— and strength meant American military superio...,61620_202011,67,United States,202011,61620,104,104.0,Military
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5284,2,Other,We must make substantial changes in the way we...,Welfare and Quality of Life,Environmental Protection,The healthy future of our environment is one o...,The UK played a leading role in the Kyoto conf...,51320_200106,7,United Kingdom,200106,51320,501,501.0,Environmental Protection
27167,1,Military: Positive,"Armed Forces Personnel, born outside the Unite...",External Relations,Military: Positive,The DUP supports: A new long-term plan for Arm...,The DUP supports: Waiving indefinite leave to ...,51903_201912,32,United Kingdom,201912,51903,104,104.0,Military
101296,2,Other,"Increase support for mothers, before birth and...",Welfare and Quality of Life,Welfare State Expansion,Increase funding for community-based providers...,Establish family health clinics within family ...,64421_200207,120,New Zealand,200207,64421,504,504.0,Welfare State
93038,2,Other,This will enable people to work and rest on th...,Economy,Technology and Infrastructure: Positive,The Green Party is committed to passenger rail...,The train would take around two hours and 15 m...,64110_201709,102,New Zealand,201709,64110,411,411.0,Technology and Infrastructure: Positive


**If you want to run the notebook on your own dataset:**

You can load your own training and test data above to fine-tune your own BERT-NLI model. Your own dataframe only needs two columns to be compatible with the code below: (1) a "label_text" column with the label texts of your classes, (2) a "text" column with the texts for training (you might need to delete/adapt the text preparation code cell below for your dataset).

## Create NLI hypotheses

**Formulate a hypothesis, which verbalises the classes/task you are interested in.**

For this example, we base our task on the Manifesto Project codebook:  https://manifesto-project.wzb.eu/coding_schemes/mp_v4

We store the hypotheses in a dictionary: the keys of the dictionary should be the names of the respective labels from the training dataframe ('label_text' column); the values of the dictionary should be your manually formulated hypotheses for the respective labels.

In [None]:
# dictionary mapping the dataset's label to manually formulated hypotheses based on the codebook
hypothesis_label_dic = {
    "Military: Positive": "The quote is positive towards the military, for example for military spending, defense, military treaty obligations.",
    "Military: Negative": "The quote is negative towards the military, for example against military spending, for disarmament, against conscription.",
    "Other": "The quote is not about military or defense"
}
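
Because the keys must match the values in the 'label_text' column exactly, a quick consistency check can catch typos early. The sketch below is only an illustration: the helper `check_hypothesis_keys` and the toy labels and hypotheses are hypothetical and not part of the notebook. With the notebook's data you could pass `df_train.label_text.unique()` and `hypothesis_label_dic` instead.

```python
def check_hypothesis_keys(labels_in_data, hypo_label_dic):
    """Return (labels without a hypothesis, dictionary keys that match no label)."""
    labels = set(labels_in_data)
    keys = set(hypo_label_dic)
    return sorted(labels - keys), sorted(keys - labels)

# toy stand-ins for the label column and the hypothesis dictionary (hypothetical values)
toy_labels = ["Other", "Military: Positive", "Military: Negative", "Other"]
toy_hypotheses = {
    "Military: Positive": "The quote is positive towards the military.",
    "Military: negative": "The quote is negative towards the military.",  # key misspelled on purpose
    "Other": "The quote is not about military or defense",
}

missing, unused = check_hypothesis_keys(toy_labels, toy_hypotheses)
print("Labels without a hypothesis:", missing)
print("Keys without a matching label:", unused)
```

Both lists should be empty before you continue. A misspelled key would otherwise produce no 'true' training examples for that class in the formatting step further below.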

**Prepare the input text**

1.) We prepare the target texts so that they fit the hypothesis more naturally. Here we simply wrap each target text into the string ' The quote: "{target_text}" - end of the quote. '

2.) We surround the target text with its preceding and following sentence. Adding context like this systematically increases performance.
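
To make the two steps concrete, the following sketch applies the same string template to a single invented example. The helper name `wrap_quote` and the example sentences are hypothetical; the notebook applies the template to the whole dataframe in the next cell.

```python
def wrap_quote(preceding: str, target: str, following: str) -> str:
    """Wrap the target text in the quote template and surround it with its context."""
    return f'{preceding}. The quote: "{target}" - end of the quote. {following}'

# invented example sentences, not taken from the Manifesto data
example = wrap_quote(
    preceding="We will review the defence budget",
    target="Spending on the armed forces must increase",
    following="This protects our national security.",
)
print(example)
```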


In [None]:
df_train["text_prepared"] = df_train.text_preceding.fillna("") + '. The quote: "' + df_train.text_original.fillna("") + '" - end of the quote. ' + df_train.text_following.fillna("")
df_test["text_prepared"] = df_test.text_preceding.fillna("") + '. The quote: "' + df_test.text_original.fillna("") + '" - end of the quote. ' + df_test.text_following.fillna("")


## Format the training and test datasets for NLI classification


**Format the training data**

1.) For each text with a specific class (label), the corresponding class-hypothesis needs to be added in the same row with the label 'true' (also expressed with the numeric label value 0).

2.) Adding 'false' examples: The NLI task consists of predicting whether a hypothesis is true or false given a context.
If we only give 'true' hypothesis-context pairs to the algorithm, it will not learn the 'false' class properly.
For each class-hypothesis, we therefore also add rows where the hypothesis is paired with randomly sampled texts from the other classes and give them the NLI label 'false' (also expressed with the numeric label value 1). This increases the training data by up to 2x.

See the table below for the concrete format the training data takes after this pre-processing step.
Note that NLI can be formulated as a 3-class (entailment/neutral/contradiction) or 2-class (entailment/not-entailment) task. Both can be used here. We use the 2-class variant.
Note that the words entailment/neutral/contradiction and true/neutral/false are used interchangeably here. Both terminologies are used in the literature and coding instructions.


In [None]:
## function for reformatting the train set
def format_nli_trainset(df_train=None, hypo_label_dic=None, random_seed=42):
  print(f"Length of df_train before formatting step: {len(df_train)}.")
  length_original_data_train = len(df_train)

  df_train_lst = []
  for label_text, hypothesis in hypo_label_dic.items():
    ## entailment
    df_train_step = df_train[df_train.label_text == label_text].copy(deep=True)
    df_train_step["hypothesis"] = [hypothesis] * len(df_train_step)
    df_train_step["label"] = [0] * len(df_train_step)
    ## not_entailment
    df_train_step_not_entail = df_train[df_train.label_text != label_text].copy(deep=True)
    df_train_step_not_entail = df_train_step_not_entail.sample(n=min(len(df_train_step), len(df_train_step_not_entail)), random_state=random_seed)
    df_train_step_not_entail["hypothesis"] = [hypothesis] * len(df_train_step_not_entail)
    df_train_step_not_entail["label"] = [1] * len(df_train_step_not_entail)
    # append
    df_train_lst.append(pd.concat([df_train_step, df_train_step_not_entail]))
  df_train = pd.concat(df_train_lst)

  # shuffle
  df_train = df_train.sample(frac=1, random_state=random_seed)
  df_train["label"] = df_train.label.apply(int)
  df_train["label_nli_explicit"] = ["True" if label == 0 else "Not-True" for label in df_train["label"]]  # adding this just to simplify readability

  print(f"After adding not_entailment training examples, the training data was augmented to {len(df_train)} texts.")
  print(f"Max augmentation could be: len(df_train) * 2 = {length_original_data_train*2}. It can also be lower, if there are more entail examples than not-entail for a majority class.")

  return df_train.copy(deep=True)


df_train_formatted = format_nli_trainset(df_train=df_train, hypo_label_dic=hypothesis_label_dic, random_seed=SEED_GLOBAL)

Length of df_train before formatting step: 1000.
After adding not_entailment training examples, the training data was augmented to 1968 texts.
Max augmentation could be: len(df_train) * 2 = 2000. It can also be lower, if there are more entail examples than not-entail for a majority class.
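
The 1968 rows reported above can be reproduced by hand. The sketch below assumes the label counts of the sampled training set printed earlier (516 / 399 / 85) and mirrors the sampling rule in `format_nli_trainset`: for each hypothesis, as many not-entailed texts are drawn as there are entailed ones, capped by the number of texts available from the other classes.

```python
# label counts of the sampled training set, as printed in the label distribution above
label_counts = {"Other": 516, "Military: Positive": 399, "Military: Negative": 85}
n_total = sum(label_counts.values())

rows_per_hypothesis = {}
for label_text, n_entail in label_counts.items():
    n_other_classes = n_total - n_entail            # texts that do not belong to this class
    n_not_entail = min(n_entail, n_other_classes)   # same cap as in format_nli_trainset
    rows_per_hypothesis[label_text] = n_entail + n_not_entail

print(rows_per_hypothesis)
print("Total rows after formatting:", sum(rows_per_hypothesis.values()))
```

Only the majority class "Other" falls short of doubling: it has 516 entailed texts, but only 484 texts from other classes to sample from.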


**Inspect reformatted training dataset**

Label 0 means that the hypothesis is 'true', label 1 means that the hypothesis is 'not-true'.


In [None]:
DataTable(df_train_formatted[["label", "label_nli_explicit", "hypothesis", "text_prepared"]], num_rows_per_page=5)

idx,label,label_nli_explicit,hypothesis,text_prepared
71032,0,True,"The quote is negative towards the military, fo...", with full implementation of international h...
19581,0,True,The quote is not about military or defense,Guaranteeing the full economic benefits of Bre...
68913,0,True,"The quote is positive towards the military, fo...",a resurgent Russia occupying parts of Ukraine ...
86045,0,True,The quote is not about military or defense,While local governments bear the brunt of the ...
32614,0,True,"The quote is negative towards the military, fo...",We oppose wars and military interventions wage...
...,...,...,...,...
99019,0,True,The quote is not about military or defense,ACT is the only Party that recognises welfare ...
84462,0,True,The quote is not about military or defense,Wealthy industrialised countries like Australi...
23532,0,True,"The quote is negative towards the military, fo...",We will oppose plans for a new generation of T...
57999,0,True,The quote is not about military or defense,We will support comprehensive services for sur...


**Format the test set**

To know which class-hypothesis is true for a specific text, we need to test every possible class-hypothesis for each text. We therefore multiply the rows/texts in the test set by the number of hypotheses and pair each text with all possible hypotheses. The table below shows what the reformatted test set looks like.

In [None]:
## function for reformatting the test set
def format_nli_testset(df_test=None, hypo_label_dic=None):
  ## explode test dataset for N hypotheses
  hypothesis_lst = [value for key, value in hypo_label_dic.items()]
  print("Number of hypotheses/classes: ", len(hypothesis_lst))

  # label lists with 0 at the position of their true hypo in hypothesis_lst (dictionary order), 1 for not-true hypos
  label_text_label_dic_explode = {}
  for key, value in hypo_label_dic.items():
    label_lst = [0 if value == hypo else 1 for hypo in hypothesis_lst]
    label_text_label_dic_explode[key] = label_lst

  df_test["label"] = df_test.label_text.map(label_text_label_dic_explode)
  df_test["hypothesis"] = [hypothesis_lst] * len(df_test)
  print(f"Original test set size: {len(df_test)}")

  # explode dataset to have K-1 additional rows with not_entail label and K-1 other hypotheses
  # ! after exploding, cannot sample anymore, because distorts the order to true label values, which needs to be preserved for evaluation code
  df_test = df_test.explode(["hypothesis", "label"])  # multi-column explode requires pd.__version__ >= '1.3.0'
  print(f"Test set size for NLI classification: {len(df_test)}\n")

  df_test["label_nli_explicit"] = ["True" if label == 0 else "Not-True" for label in df_test["label"]]  # adding this just to simplify readability

  return df_test.copy(deep=True)


df_test_formatted = format_nli_testset(df_test=df_test, hypo_label_dic=hypothesis_label_dic)

Number of hypotheses/classes:  3
Original test set size: 4000
Test set size for NLI classification: 12000



**Inspect the reformatted test dataset**


Each text now appears 3 times in the test dataset. There are 3 classes and each text is paired with each hypothesis once.
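
The pairing logic can be illustrated without pandas. The sketch below uses two invented texts and shortened hypotheses (both are assumptions for illustration only) and builds one row per text-hypothesis pair, with label 0 for the hypothesis that belongs to the text's class and label 1 for all others.

```python
# shortened hypotheses, for illustration only
toy_hypotheses = {
    "Military: Positive": "The quote is positive towards the military.",
    "Military: Negative": "The quote is negative towards the military.",
    "Other": "The quote is not about military or defense",
}

# invented (text, gold class) pairs
toy_texts = [
    ("Spending on the armed forces must increase", "Military: Positive"),
    ("We will build more hospitals", "Other"),
]

rows = []
for text, gold_label_text in toy_texts:
    for label_text, hypothesis in toy_hypotheses.items():
        rows.append({
            "text": text,
            "hypothesis": hypothesis,
            "label": 0 if label_text == gold_label_text else 1,  # 0 = true, 1 = not-true
        })

for row in rows:
    print(row["label"], "|", row["hypothesis"], "|", row["text"])
```

Each text contributes exactly one row with label 0, and the rows of one text stay next to each other, which the evaluation step relies on.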

In [None]:
DataTable(df_test_formatted[["label", "label_nli_explicit", "hypothesis", "text_prepared"]].sort_values(["text_prepared", "hypothesis"]), num_rows_per_page=6, max_rows=10_000)



idx,label,label_nli_explicit,hypothesis,text_prepared
73386,1,Not-True,"The quote is negative towards the military, fo...",$50 million to expand the GP after hours helpl...
73386,0,True,The quote is not about military or defense,$50 million to expand the GP after hours helpl...
73386,1,Not-True,"The quote is positive towards the military, fo...",$50 million to expand the GP after hours helpl...
108881,1,Not-True,"The quote is negative towards the military, fo...",$54 million is being used to help owners of le...
108881,0,True,The quote is not about military or defense,$54 million is being used to help owners of le...
...,...,...,...,...
105394,0,True,The quote is not about military or defense,Ensure New Zealand Superannuation for married...
105394,1,Not-True,"The quote is positive towards the military, fo...",Ensure New Zealand Superannuation for married...
108272,1,Not-True,"The quote is negative towards the military, fo..."," Required financial service providers, includ..."
108272,0,True,The quote is not about military or defense," Required financial service providers, includ..."


## Fine-tuning

We use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) for loading and training our model. They provide great documentation and also a very good [course](https://huggingface.co/course/chapter1/1) on how to use Transformers.

**Loading an NLI model**

You can use any NLI model on the Hugging Face Hub. For normal English use-cases, we recommend this [base-size model](https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c); for multilingual/non-English use-cases, we recommend this [multilingual model](https://huggingface.co/MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7); for best performance in English (but high compute and memory requirements) we recommend this [large model](https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli).


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

## load the BERT-NLI model and its tokenizer
# you can choose any of the NLI models here: https://huggingface.co/MoritzLaurer
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c"  # English model: "MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c"; multilingual model: "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);



Device: cuda


**Tokenize data**

In [None]:
# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing
import datasets

dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train_formatted),
    "test": datasets.Dataset.from_pandas(df_test_formatted)
})

# tokenize
def tokenize_nli_format(examples):
  return tokenizer(examples["text_prepared"], examples["hypothesis"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

dataset = dataset.map(tokenize_nli_format, batched=True)

# remove unnecessary columns for model training
dataset = dataset.remove_columns([
    'label_text', 'text_original', 'label_domain_text', 'label_subcat_text',
    'text_preceding', 'text_following', 'manifesto_id', 'doc_id', 'country_name',
    'date', 'party', 'cmp_code_hb4', 'cmp_code', 'label_subcat_text_simple',
    "label_nli_explicit"])



**Inspect processed data**

In [None]:
print("The overall structure of the pre-processed train and test sets:\n")
print(dataset)

The overall structure of the pre-processed train and test sets:

DatasetDict({
    train: Dataset({
        features: ['label', 'text_prepared', 'hypothesis', 'label_nli_explicit', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1968
    })
    test: Dataset({
        features: ['label', 'text_prepared', 'hypothesis', 'label_nli_explicit', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 12000
    })
})


**Explanation of different elements in the tokenized dataset:**

1. **idx**:
Purpose: A unique identifier or index for each data point in the dataset. It's useful for tracking, referencing, or debugging individual examples.
Example: Any integer value like 42.

2. **input_ids**:
Purpose: These are token IDs that represent each token in the text. The raw text is tokenized, and each token is mapped to an ID based on the tokenizer's vocabulary. This is the primary input to BERT and other transformer models.
Example: For the word "hugging", the tokenizer might produce an ID like 20345.

3. **token_type_ids**:
Purpose: Used in models like BERT that handle different pairs of sentences in tasks (e.g., Next Sentence Prediction or Question-Answering). It differentiates between the first sentence and the second sentence. Typically, for a single sentence, this would be a sequence of 0s, and for a two-sentence pair, the first sentence would have 0s, and the second one would have 1s.
Example: For the text "Hugging Face is great. I love it!", the token_type_ids might look like [0, 0, 0, 0, 0, 1, 1, 1].

4. **attention_mask**:
Purpose: Specifies which tokens should be attended to by the model. A value of 1 means the token should be considered, while 0 means it should be ignored. This is especially useful when batching sequences of different lengths together, requiring padding tokens to be added. The padding tokens have an attention_mask value of 0.
Example: For the sentence "I love Hugging Face." with padding, the attention_mask might be [1, 1, 1, 1, 0, 0, 0] where the 0s represent the padding tokens.
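
The relationship between these three lists is easiest to see on a toy pair. The sketch below is not the real tokenizer: it assumes whitespace-split "tokens" and BERT-style special tokens (`[CLS]`, `[SEP]`, `[PAD]`), and the helper `encode_pair` is hypothetical. The actual tokenizer loaded above produces sub-word IDs and uses its own special tokens.

```python
def encode_pair(tokens_a, tokens_b, max_length):
    """Toy text-pair encoding: returns (tokens, token_type_ids, attention_mask), padded to max_length.

    Does not truncate; if the pair is longer than max_length, no padding is added.
    """
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # first segment (incl. [CLS] and its [SEP]) gets 0, second segment (incl. final [SEP]) gets 1
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # real tokens are attended to
    n_pad = max(max_length - len(tokens), 0)
    tokens = tokens + ["[PAD]"] * n_pad
    token_type_ids = token_type_ids + [0] * n_pad
    attention_mask = attention_mask + [0] * n_pad  # padding is ignored
    return tokens, token_type_ids, attention_mask

text = "the army grows".split()        # plays the role of text_prepared
hypothesis = "about military".split()  # plays the role of the hypothesis
tokens, token_type_ids, attention_mask = encode_pair(text, hypothesis, max_length=10)
print(tokens)
print(token_type_ids)
print(attention_mask)
```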

In [None]:
print("\n\nAn example for a row in the tokenized dataset:\n")
# print each field of the first training example on its own line
for key, value in dataset["train"][0].items():
  print(key, ":    ", value)



An example for a row in the tokenized dataset:

label :     0
text_prepared :       with full implementation of international humanitarian law, refugee law and human rights.. The quote: "emergency relief in situations of armed conflict should be carried out by civilians and must be clearly distinguished from any military activities." - end of the quote. a direct role for military forces in the provision of relief should be restricted to situations involving natural disasters where ambiguity over the military role is unlikely to arise.
hypothesis :     The quote is negative towards the military, for example against military spending, for disarmament, against conscription.
label_nli_explicit :     True
idx :     71032
input_ids :     [1, 507, 3, 275, 540, 3450, 265, 1155, 10793, 818, 261, 11407, 818, 263, 857, 1335, 260, 260, 279, 3554, 294, 307, 41165, 3478, 267, 3335, 265, 5652, 3423, 403, 282, 2635, 321, 293, 10936, 263, 516, 282, 2117, 10045, 292, 356, 1681, 1157, 260, 309, 341, 5


### Setting training arguments / hyperparameters

The following cell sets several important hyperparameters. We chose parameters that work well in general to avoid the need for hyperparameter search. Further below, we also provide code for hyperparameter search, if researchers want to try to increase performance by a few percentage points.

In [None]:
from transformers import TrainingArguments, Trainer, logging

# Set the directory to write the fine-tuned model and training logs to.
# With google colab, this will create a temporary folder, which will be deleted once you disconnect.
# You can connect to your personal google drive to save models and logs properly.
training_directory = "BERT-nli-demo"

# FP16 is a hyperparameter which can increase training speed and reduce memory consumption, but only on GPU and if batch-size > 8, see here: https://huggingface.co/transformers/performance.html?#fp16
# FP16 does not work on CPU or for multilingual mDeBERTa models
#fp16_bool = True if torch.cuda.is_available() else False
#if "mdeberta" in model_name.lower(): fp16_bool = False  # multilingual mDeBERTa does not support FP16 yet: https://github.com/microsoft/DeBERTa/issues/77
# in case of hyperparameter search at the end: FP16 has to be set to False. The integrated hyperparameter search with the Hugging Face Trainer can lead to errors otherwise.
#fp16_bool = False

# Hugging Face tips to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
train_args = TrainingArguments(
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
    learning_rate=2e-5,
    per_device_train_batch_size=16,  # if you get an out-of-memory error, reduce this value to 8 or 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values here are always a multiple of 8.
    per_device_eval_batch_size=80,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    #gradient_accumulation_steps=4, # Can be used in case of memory problems: accumulates gradients over X batches and only then updates the weights. This lets you reduce per_device_train_batch_size (e.g. divide it by X) while keeping the same effective batch size. Decreases memory usage, but also slightly decreases speed.
    num_train_epochs=2,  # this can be increased, but higher values increase training time. Good values for NLI are between 3 and 20.
    warmup_ratio=0.25,  # a good normal default value is 0.06 for normal BERT-base models, but since we want to reuse prior NLI knowledge and avoid catastrophic forgetting, we set the value higher
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    #fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    #fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    #eval_steps=10_000,  # evaluate every n steps if evaluation_strategy=='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of updates steps before two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to="all",  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
)


**Explanation of different training arguments:**

You can find more arguments, explanations and examples in the Hugging Face [documentation](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments).

* **num_train_epochs**:
Specifies the number of times the entire training dataset is passed through the model. For example, num_train_epochs=3 means the trainer will iterate over the entire training dataset three times.

* **per_device_train_batch_size**:
The model does not learn from the entire dataset at once, but in batches of e.g. 16 texts. For example, if per_device_train_batch_size=16, then the model analyses 16 texts and sees how wrong it was on these 16 texts. After analysing these 16 texts, the model's parameters are updated/optimised to make the model less wrong on these texts. The degree to which the model's parameters are updated is called 'learning rate'.

* **learning_rate**:
The "rate" or speed with which the model's parameters are updated by the optimisation algorithm. A smaller value updates the model's parameters more cautiously after each batch, while a larger value updates the model's parameters more drastically. A good general value is 2e-5 (which means 0.00002).

* **per_device_eval_batch_size**:
The number of evaluation examples used in one batch during evaluation. This batch size is irrelevant for the model's learning. A higher value makes evaluation faster (more texts are processed at the same time), but higher values also cost memory and increase the risk of out-of-memory errors (OOM).

* **gradient_accumulation_steps**:
Indicates the number of batches over which gradients are accumulated before the model's parameters are updated. Instead of updating after every batch, the trainer sums the gradients of gradient_accumulation_steps batches and then performs one update. Useful for training with larger effective batch sizes using limited memory.

* **warmup_ratio**:
Specifies the ratio of total training steps for which the learning rate will be linearly increased (warm-up phase) before it's decayed.

* **weight_decay**:
A regularization technique which penalizes large weights by adding a penalty term to the loss. It helps prevent overfitting.

* **seed**:
Sets the random seed so that runs are reproducible. Runs with the same seed, code and environment should produce the same (or nearly the same) results.

* **load_best_model_at_end**:
Loads the best model (according to metric_for_best_model) at the end of training instead of the last model.

* **metric_for_best_model**:
Determines which metric to use for evaluating and determining the best model during training.

* **evaluation_strategy**:
Defines when to evaluate the model. For example, evaluation_strategy="epoch" evaluates the model after every epoch.

* **save_strategy**:
Specifies when to save the model. For example, save_strategy="epoch" saves the model after every epoch.

* **fp16**:
Enables mixed precision training if set to True, which can speed up training and reduce memory usage. But this does not work with every model and is only beneficial with a batch_size >= 16.
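
To see how several of these arguments interact, the following sketch works out the effective batch size, the number of parameter updates and the number of warm-up steps for a run. It is a simplified illustration, not the Trainer's actual code, and the dataset size (1,968 rows), the single device and the `warmup_ratio` of 0.06 are illustrative assumptions; substitute your own values. The helper names `training_schedule` and `lr_at_step` are hypothetical.

```python
import math

def training_schedule(n_rows, per_device_batch, n_devices=1, grad_accum=1,
                      epochs=3, warmup_ratio=0.0):
    """Derive step counts from TrainingArguments-style values (simplified)."""
    effective_batch = per_device_batch * n_devices * grad_accum
    batches_per_epoch = math.ceil(n_rows / (per_device_batch * n_devices))
    # one parameter update happens every `grad_accum` batches
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    total_updates = updates_per_epoch * epochs
    warmup_steps = math.ceil(total_updates * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "warmup_steps": warmup_steps,
    }

def lr_at_step(step, base_lr, warmup_steps, total_updates):
    """Linear warm-up to base_lr, followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_updates - step)
    return base_lr * remaining / max(1, total_updates - warmup_steps)

# assumed example values, not taken from this notebook's configuration
schedule = training_schedule(n_rows=1968, per_device_batch=16, epochs=5, warmup_ratio=0.06)
print(schedule)
for step in [0, schedule["warmup_steps"], schedule["total_updates"]]:
    print(step, lr_at_step(step, 2e-5, schedule["warmup_steps"], schedule["total_updates"]))
```

Halving `per_device_batch` to 8 and setting `grad_accum=2` keeps the effective batch size at 16, which is the usual way to trade memory for time when a batch does not fit on the GPU.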


### Metrics

In the test set, each text was duplicated N times, once for each class hypothesis, and an NLI model can only predict 2 or 3 classes: true/not-true or true/neutral/false. This means that we cannot use standard functions for computing metrics. The following function reformats the model's output in a way that allows for the calculation of standard metrics like accuracy, F1-macro etc.

In [None]:
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report

def compute_metrics_nli_binary(eval_pred, label_text_alphabetical=None):
    predictions, labels = eval_pred

    ### reformat model output to enable calculation of standard metrics
    # split in chunks with predictions for each hypothesis for one unique premise
    def chunks(lst, n):  # Yield successive n-sized chunks from lst. https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    # for each chunk/premise, select the most likely hypothesis
    prediction_chunks_lst = list(chunks(predictions, len(set(label_text_alphabetical)) ))
    hypo_position_highest_prob = []
    for i, chunk in enumerate(prediction_chunks_lst):
        hypo_position_highest_prob.append(np.argmax(np.array(chunk)[:, 0]))  # only accesses the first column of the array, i.e. the entailment/true prediction logit of all hypos and takes the highest one

    label_chunks_lst = list(chunks(labels, len(set(label_text_alphabetical)) ))
    label_position_gold = []
    for chunk in label_chunks_lst:
        label_position_gold.append(np.argmin(chunk))  # argmin to detect the position of the 0 among the 1s

    #print("Highest probability prediction per premise: ", hypo_position_highest_prob)
    #print("Correct label per premise: ", label_position_gold)

    ### calculate standard metrics
    precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(label_position_gold, hypo_position_highest_prob, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
    precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(label_position_gold, hypo_position_highest_prob, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
    acc_balanced = balanced_accuracy_score(label_position_gold, hypo_position_highest_prob)
    acc_not_balanced = accuracy_score(label_position_gold, hypo_position_highest_prob)
    metrics = {
        'accuracy': acc_not_balanced,
        'f1_macro': f1_macro,
        'accuracy_balanced': acc_balanced,
        'f1_micro': f1_micro,
        'precision_macro': precision_macro,
        'recall_macro': recall_macro,
        'precision_micro': precision_micro,
        'recall_micro': recall_micro,
        #'label_gold_raw': label_position_gold,
        #'label_predicted_raw': hypo_position_highest_prob
    }
    #print("Aggregate metrics: ", {key: metrics[key] for key in metrics if key not in ["label_gold_raw", "label_predicted_raw"]} )  # print metrics but without label lists
    #print("Detailed metrics: ", classification_report(label_position_gold, hypo_position_highest_prob, labels=np.sort(pd.factorize(label_text_alphabetical, sort=True)[0]), target_names=label_text_alphabetical, sample_weight=None, digits=2, output_dict=True,
    #                            zero_division='warn'), "\n")
    return metrics

# Create alphabetically ordered list of the original dataset classes/labels
# This is necessary to be sure that the ordering of the test set labels and predictions is the same. Otherwise there is a risk that labels and predictions are in a different order and resulting metrics are wrong.
label_text_alphabetical = np.sort(df_train.label_text.unique())
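
To make the reformatting step more concrete, here is a minimal sketch on hypothetical toy values: 2 texts, each paired with 3 class hypotheses, scored by a 2-class NLI model. It uses `reshape` instead of the `chunks` helper, which gives the same grouping as long as the rows are ordered text by text and every text has exactly one row per class. All numbers below are invented for illustration.

```python
import numpy as np

# Hypothetical logits for 2 texts x 3 hypotheses (rows ordered text by text).
# Column 0 = entailment ("true") logit, column 1 = not-entailment logit.
toy_logits = np.array([
    [ 2.1, -1.9],   # text 0, hypothesis for class 0
    [-0.5,  0.7],   # text 0, hypothesis for class 1
    [-1.2,  1.4],   # text 0, hypothesis for class 2
    [-0.8,  0.9],   # text 1, hypothesis for class 0
    [-1.0,  1.1],   # text 1, hypothesis for class 1
    [ 1.7, -1.5],   # text 1, hypothesis for class 2
])
# NLI labels: 0 = entailment (the correct class hypothesis), 1 = not-entailment
toy_labels = np.array([0, 1, 1, 1, 1, 0])
n_classes = 3

# one row per text, one column per class hypothesis
entailment_logits = toy_logits[:, 0].reshape(-1, n_classes)
predicted_class = entailment_logits.argmax(axis=1)  # hypothesis with the highest entailment logit
gold_class = toy_labels.reshape(-1, n_classes).argmin(axis=1)  # position of the 0 among the 1s

print("predicted:", predicted_class.tolist())
print("gold:     ", gold_class.tolist())
```

After this step, predictions and gold labels are ordinary class indices (one per text), so standard scikit-learn metric functions can be applied to them.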


**Explanation of key metrics**:

Imagine we have a machine that identifies apples in a basket of mixed fruits. The task of the machine is: label each fruit as "apple" or "not-apple".

1. **Accuracy**:
    - Definition: The proportion of all predictions that were correct.
    - Example: If our machine looked at 100 fruits and correctly identified 90 of them (either as apples or not-apples), its accuracy is 90%.
    - Think of it as the overall correctness of the machine.

2. **Precision**:
    - Definition: Of all the items labeled by the machine as "apple," what proportion was actually apples?
    - Simple Explanation: Suppose our machine pointed out 50 fruits as apples, but only 25 of them were real apples. The precision would be 25/50 or 50%.
    - It tells us how "precise" our machine is when calling something an apple.

3. **Recall** (also known as Sensitivity or True Positive Rate):
    - Definition: Of all the real apples in the basket, what proportion did the machine correctly identify as apples?
    - Simple Explanation: If there were 50 actual apples and our machine only identified 40 of them, then the recall is 40/50 or 80%.
    - Recall answers the question: How many actual apples did we manage to recall/identify?

4. **F1 Score**:
    - Definition: The harmonic mean of precision and recall, giving a balance between the two.
    - Simple Explanation: If our precision is 80% and our recall is also 80%, the F1 score is also 80%. But if either drops, the F1 score drops too. It's a single metric that tries to balance precision and recall.
    - Consider it a balanced score: if either precision or recall is low, the F1 score will be low.

Using the apple example, let's say:
- The machine points at 10 fruits saying "these are apples!"
- In reality, only 8 of them are apples (so, precision = 8/10 = 80%).
- But there were 20 apples in total, and the machine only identified 8 of them (so, recall = 8/20 = 40%).

While precision is high, recall is low, which means the machine is good at being sure when it says something is an apple, but it's missing a lot of the actual apples. The F1 score would reflect this imbalance.
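
The numbers from this apple example can be checked with a few lines of Python. This is a small sketch for the binary case only; the function name `precision_recall_f1` is made up for this illustration.

```python
def precision_recall_f1(true_positives, predicted_positives, actual_positives):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    recall = true_positives / actual_positives if actual_positives else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# 10 fruits labelled "apple", 8 of them correct, 20 real apples in the basket
precision, recall, f1 = precision_recall_f1(
    true_positives=8, predicted_positives=10, actual_positives=20
)
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.3f}")
```

The F1 score of roughly 0.53 sits closer to the low recall (0.40) than to the high precision (0.80), because the harmonic mean penalises the weaker of the two values.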

### Fine-tuning and evaluation

Let's start fine-tuning the model!

If you get an 'out-of-memory' error, reduce the 'per_device_train_batch_size' to 8 or 4 in the TrainingArguments above, restart the runtime (menu in the top left: 'Runtime' > 'Restart runtime') and rerun the entire script. Without a restart, the 'out-of-memory' error will probably not go away.

In [None]:
# training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical=label_text_alphabetical)
)

trainer.train()


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1 Macro,F1 Micro,Accuracy Balanced,Accuracy Not B
1,No log,0.377376,0.612391,0.883,0.72595,0.883
2,No log,0.160391,0.737461,0.94175,0.798321,0.94175
3,No log,0.291391,0.686245,0.91725,0.824198,0.91725
4,No log,0.291348,0.728259,0.9375,0.817512,0.9375
5,0.176500,0.3169,0.727767,0.93675,0.814498,0.93675


Highest probability prediction per premise:  [2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 0, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 2, 2, 2, 0, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 

TrainOutput(global_step=615, training_loss=0.145575746675817, metrics={'train_runtime': 332.1587, 'train_samples_per_second': 29.624, 'train_steps_per_second': 1.852, 'total_flos': 681108070346496.0, 'train_loss': 0.145575746675817, 'epoch': 5.0})

In [None]:
## Evaluate the fine-tuned model on the held-out test set
#results = trainer.evaluate()
#print(results)

**Any questions while the training is running?**

## Inference with your fine-tuned model or any other BERT-NLI (0-shot)

The code above showed how to fine-tune a custom BERT-NLI model. This section shows how to use this fine-tuned model (or an existing general BERT-NLI model) to predict classes on unseen texts. There are several ways of implementing this. We show a simple implementation with the Hugging Face pipeline.

First define your hypotheses. If you want to use a model you have trained above, you can just reuse the same `hypothesis_label_dic` as above. If you want to test 0-shot performance with new hypotheses, you can define them here.

In [None]:
hypothesis_label_dic_inference = {
    "Military: Positive": "The quote is positive towards the military, for example for military spending, defense, military treaty obligations.",
    "Military: Negative": "The quote is negative towards the military, for example against military spending, for disarmament, against conscription.",
    "Other": "The quote is not about military or defense"
}
hypothesis_lst = list(hypothesis_label_dic_inference.values())

We now load the Hugging Face 0-shot pipeline to implement predictions on unseen texts (inference). While this pipeline is called "zero-shot", you can also use it with your own custom NLI models that have been trained on your task of interest. That's the beauty of universal BERT-NLI classification: any task is converted into the NLI format, which means that both general BERT-NLI models and your fine-tuned BERT-NLI model function in essentially the same way.

In [None]:
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU

# documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.ZeroShotClassificationPipeline
pipe_classifier = pipeline(
    "zero-shot-classification",
    model=model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    # or load a model from the Hugging Face hub, e.g. for 0-shot classification
    #model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c",
    tokenizer=tokenizer,
    framework="pt",
    device=device,
)

We now apply the pipeline to unseen texts. We re-use the df_test data-frame here for simplicity, but it could be any other dataset. It only needs a text column. Note that we do not need to re-format the text data anymore here, as this is handled internally by the Hugging Face zero-shot pipeline. If you want to better understand the arguments in the pipeline below, we recommend reading the [documentation here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.ZeroShotClassificationPipeline).
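
As a rough mental model of what the pipeline handles internally, the sketch below pairs every text with every hypothesis and, for the single-label case (`multi_label=False`), normalises the entailment logits across hypotheses with a softmax. It is a simplified illustration and not the pipeline's source code; the helper names, the example texts and the logits are hypothetical.

```python
import math

def build_nli_pairs(texts, candidate_labels, hypothesis_template="{}"):
    """Pair every text (premise) with every candidate label (hypothesis)."""
    return [
        (text, hypothesis_template.format(label))
        for text in texts
        for label in candidate_labels
    ]

def softmax(values):
    """Turn a list of logits into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical inputs for illustration
texts = ["We will increase defence spending.", "We will build more schools."]
hypotheses = [
    "The quote is positive towards the military.",
    "The quote is not about military or defense",
]
pairs = build_nli_pairs(texts, hypotheses)
print(len(pairs), "premise-hypothesis pairs")

# hypothetical entailment logits of the two hypotheses for the first text
scores = softmax([3.2, -1.5])
print(scores)
```

With `hypothesis_template="{}"`, the candidate labels are used verbatim as hypotheses, which is why full sentences are passed as `candidate_labels` in the cell below.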

In [None]:
# create a dummy data frame for illustration
df_inference = df_test[["text_prepared", "label_text"]].sample(n=1000, random_state=42).copy(deep=True)
text_lst = df_inference["text_prepared"].tolist()

# use the pipeline with your chosen model for inference (prediction)
pipe_output = pipe_classifier(
    text_lst,  # input any list of texts here
    candidate_labels=hypothesis_lst,
    hypothesis_template="{}",
    multi_label=False,  # here you can decide if, for your task, only one hypothesis can be true, or multiple can be true
    batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
)
print(pipe_output)

# extract the predictions from pipe_output
hypothesis_pred_true_probability = []
hypothesis_pred_true = []
for dic in pipe_output:
    hypothesis_pred_true_probability.append(dic["scores"][0])
    hypothesis_pred_true.append(dic["labels"][0])

# map the long hypotheses to their corresponding short label names
hypothesis_label_dic_inference_inverted = {value: key for key, value in hypothesis_label_dic_inference.items()}
label_pred = [hypothesis_label_dic_inference_inverted[hypo] for hypo in hypothesis_pred_true]

# add inference data to your original dataframe
df_inference["label_text_pred"] = label_pred
df_inference["label_text_pred_proba"] = hypothesis_pred_true_probability


[{'sequence': '• To continue to maintain inflation in the range of 0-2 per cent.. The quote: "• Increased volume and value of exports to ail markets." - end of the quote. • Continued diversification among the goods and services we export.', 'labels': ['The quote is not about military or defense', 'The quote is positive towards the military, for example for military spending, defense, military treaty obligations.', 'The quote is negative towards the military, for example against military spending, for disarmament, against conscription.'], 'scores': [0.9987605810165405, 0.000637443270534277, 0.0006019522552378476]}, {'sequence': 'End the 1 % cap on pay rises in the public sector,. The quote: "and uprate wages in line with inflation." - end of the quote. The Conservative pursuit of hard Brexit will have serious impacts on the UK’s national finances – impacts which current government plans may not fully take into account.', 'labels': ['The quote is not about military or defense', 'The quot

In [None]:
df_inference

idx,text_prepared,label_text,label_text_pred,label_text_pred_proba
103403,• To continue to maintain inflation in the ran...,Other,Other,0.998761
13031,End the 1 % cap on pay rises in the public sec...,Other,Other,0.998987
27347,These savings include: Reducing the size of th...,Other,Other,0.999194
104422,This will allow victims to say more in their V...,Other,Other,0.999290
62765,"Throughout the Cold War, our international bro...",Other,Other,0.832301
...,...,...,...,...
40325,"Legislate to regulate the charity sector,. The...",Other,Other,0.999367
47579,Support the further development of farmers’ ma...,Other,Other,0.999427
73225,Small businesses will benefit from the instant...,Other,Other,0.999548
13290,We will: Uprate working-age benefits at least ...,Other,Other,0.998997
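
A quick way to inspect where the model goes wrong is to cross-tabulate gold and predicted labels and to list the misclassified rows. The sketch below uses a small hypothetical data frame with the same column names as `df_inference`; the rows and probabilities are invented, so replace `df_toy` with `df_inference` to run it on the real predictions.

```python
import pandas as pd

# hypothetical stand-in for df_inference (same column names, invented values)
df_toy = pd.DataFrame({
    "label_text": ["Other", "Military: Positive", "Military: Negative", "Other", "Military: Positive"],
    "label_text_pred": ["Other", "Military: Positive", "Other", "Other", "Other"],
    "label_text_pred_proba": [0.99, 0.91, 0.62, 0.98, 0.55],
})

# rows = gold labels, columns = predicted labels
confusion = pd.crosstab(df_toy["label_text"], df_toy["label_text_pred"])
print(confusion)

# misclassified rows, most confident mistakes first
errors = df_toy[df_toy["label_text"] != df_toy["label_text_pred"]]
errors = errors.sort_values("label_text_pred_proba", ascending=False)
print(errors)

accuracy = (df_toy["label_text"] == df_toy["label_text_pred"]).mean()
print(f"accuracy: {accuracy:.2f}")
```

When one class dominates the data, as "Other" does in the sample shown above, plain accuracy can look high even if the rare classes are often missed, so it is worth reading the confusion table row by row.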


## Save and load your fine-tuned model

In [None]:
run_code_below = False
assert run_code_below, "Stopping code here to avoid accidental runs of the code below with Runtime > Run all"

This segment provides code for saving the model to your hard-disk or for uploading it to the Hugging Face hub.

In [None]:
## first you need to connect to your google drive with your google account
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)
#drive.flush_and_unmount()

# insert the path where you want to save the model
os.chdir("/content/drive/My Drive/")
print(os.getcwd())


Mounted at /content/drive
/content/drive/My Drive


In [None]:
### save best model to disk
directory_save_model = f"{training_directory}/"
model_name_custom = f"{model_name.split('/')[-1]}-custom"
mode_custom_path = directory_save_model + model_name_custom

# save the model to google drive
trainer.save_model(output_dir=mode_custom_path)

Saving model checkpoint to BERT-nli-demo/DeBERTa-v3-base-mnli-fever-docnli-ling-2c-custom
Configuration saved in BERT-nli-demo/DeBERTa-v3-base-mnli-fever-docnli-ling-2c-custom/config.json
Model weights saved in BERT-nli-demo/DeBERTa-v3-base-mnli-fever-docnli-ling-2c-custom/pytorch_model.bin
tokenizer config file saved in BERT-nli-demo/DeBERTa-v3-base-mnli-fever-docnli-ling-2c-custom/tokenizer_config.json
Special tokens file saved in BERT-nli-demo/DeBERTa-v3-base-mnli-fever-docnli-ling-2c-custom/special_tokens_map.json


In [None]:
### Push to Hugging Face hub
# install necessary dependencies
# you need to create an account on https://huggingface.co/ for this
!sudo apt-get install git-lfs
!huggingface-cli login

In [None]:
# load your models and tokenizer saved before from disk
model = AutoModelForSequenceClassification.from_pretrained(mode_custom_path)
tokenizer = AutoTokenizer.from_pretrained(mode_custom_path, use_fast=True, model_max_length=512)  # trainer.save_model also saved the tokenizer files to this directory

In [None]:
# https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub
repo_id = '<your-user-name>/<your-model-name>'  # e.g. "JaneJones/DeBERTa-v3-nli-custom". note that the repo name is case-sensitive
model.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")
tokenizer.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")


## Bonus: Hyperparameter Search

To increase performance, you can also conduct a hyperparameter search (hp-search), to try and find the best hyperparameters for your specific task and dataset. The trade-off is that hp-search is very compute intensive, but finding better hyperparameters for your task can increase performance. Make sure to conduct hp-search on a sub-set of the training set (i.e. validation set) and not the final test set to avoid data leakage of the test set before final testing.

Note that for small datasets, running the hp-search only on one train-validation split is not ideal. For datasets with fewer than around 2000 training data points, we recommend running the hp-search on two different random train-validation splits. We implemented this for our paper, but not in this notebook as this would make the code harder to understand.

Documentation with more information on hp-search with Hugging Face Transformers is available [here](https://huggingface.co/docs/transformers/main/hpo_train).

In [None]:
## train-validation split - test set should not be visible during hp-search
# https://huggingface.co/docs/datasets/v2.5.1/en/package_reference/main_classes#datasets.Dataset.train_test_split

# the ideal size of the validation set depends on the size of your training data. Each label should have at the very least a few dozen examples in the validation set (ideally several hundred)
validation_set_size = 0.4  # for a training data size of 1000 with 3 classes we use 40% of the training data for validating hyperparameters

# reformatting of label column to enable dataset stratification
from datasets import ClassLabel
new_features = dataset["train"].features.copy()
if len(model.config.label2id.keys()) == 2:  # for 2-class NLI model
  label_names = ["entailment", "not-entailment"]
elif len(model.config.label2id.keys()) == 3:  # for 3-class NLI model
  label_names = ["entailment", "neutral", "contradiction"]
new_features['label'] = ClassLabel(names=label_names)
dataset = dataset.cast(new_features)

# train-validation split for hp-search
dataset_hp = dataset["train"].train_test_split(test_size=validation_set_size, seed=SEED_GLOBAL, shuffle=True, stratify_by_column="label")
print(dataset_hp)

In [None]:
# helper function to clean memory and reduce risk of out-of-memory error
import gc
def clean_memory():
  #del(model)
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  gc.collect()

clean_memory()

In [None]:
## Reinitialize trainer for hp-search
# https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10

def model_init():
  clean_memory()
  return AutoModelForSequenceClassification.from_pretrained(model_name).to(device)  # return_dict=True

trainer = Trainer(
    model_init=model_init,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset_hp["train"],
    eval_dataset=dataset_hp["test"],
    compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical=label_text_alphabetical)
);


**Define the hyperparameters you want to optimise**

For a detailed discussion of different hyperparameters, see the appendix of our paper.

In [None]:
# we use Optuna for hp-search: https://optuna.readthedocs.io/en/stable/
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [9e-6, 2e-5, 4e-5]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 4, 24, log=False, step=4),   # increasing the maximum number of epochs here could increase performance but will take (much) longer to train
        "warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.6, log=True),
        "per_device_train_batch_size": 16,  # lower this value in case of out-of-memory errors and restart the runtime
        #"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
    }
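
To get a feeling for the search space defined above, the following sketch draws a few configurations from the same ranges using plain random sampling from the standard library. This is only an illustration of the ranges and of what `log=True` means; it is not how Optuna's TPE sampler chooses configurations, and the function name `sample_hp_configuration` is hypothetical.

```python
import math
import random

def sample_hp_configuration(rng):
    """Draw one configuration from the same ranges as my_hp_space (plain random sampling)."""
    return {
        "learning_rate": rng.choice([9e-6, 2e-5, 4e-5]),
        # integers from 4 to 24 in steps of 4: 4, 8, 12, 16, 20, 24
        "num_train_epochs": rng.choice(list(range(4, 24 + 1, 4))),
        # log=True: sample uniformly on the log scale, so smaller ratios are drawn more often
        "warmup_ratio": math.exp(rng.uniform(math.log(0.1), math.log(0.6))),
        # fixed value, not searched
        "per_device_train_batch_size": 16,
    }

rng = random.Random(42)
for _ in range(3):
    print(sample_hp_configuration(rng))
```

The discrete part of the space has 3 x 6 = 18 combinations, and `warmup_ratio` is continuous, so 10 to 15 trials cover only a part of the space.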


**Run HP search!**

Choose the number of hyperparameter configurations you want to test. In our experiments we found that after 10 to 15 trials with around 4 hyperparameters, performance is unlikely to increase meaningfully. 15 trials seems to be a safe value, but can take a while to run.

In [None]:
import optuna

# number of different hp configurations to test
numer_of_trials = 10  # increasing this value can lead to better hyperparameters, but will take longer
# choose the sampler for sampling hp configurations
optuna_sampler = optuna.samplers.TPESampler(
    seed=SEED_GLOBAL, consider_prior=True, prior_weight=1.0, consider_magic_clip=True,
    consider_endpoints=False, n_startup_trials=numer_of_trials // 2, n_ei_candidates=24,  # integer division: n_startup_trials is documented as an int
    multivariate=False, group=False, warn_independent_sampling=True, constant_liar=False
)  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.TPESampler.html#optuna.samplers.TPESampler

# Hugging Face Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.hyperparameter_search
best_run = trainer.hyperparameter_search(
    n_trials=numer_of_trials,
    direction="maximize",
    hp_space=my_hp_space,
    backend='optuna',
    **{"sampler": optuna_sampler}
)
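
One detail worth knowing: when no `compute_objective` is passed and `compute_metrics` returns several metrics, the Trainer's default objective is, to the best of my knowledge, the sum of all returned evaluation metrics (excluding the loss and speed metrics). To optimise a single metric instead, a `compute_objective` function can be passed to `hyperparameter_search`. A minimal sketch, assuming the metrics are reported with the `eval_` prefix; the function name and the example values are hypothetical.

```python
def objective_f1_macro(metrics):
    """Return the single value the hp-search should maximise."""
    return metrics["eval_f1_macro"]

# hypothetical metrics dict in the shape returned by trainer.evaluate()
example_metrics = {"eval_loss": 0.16, "eval_accuracy": 0.94, "eval_f1_macro": 0.74}
print(objective_f1_macro(example_metrics))

# It would be passed to the search like this (not run here):
# best_run = trainer.hyperparameter_search(
#     n_trials=numer_of_trials,
#     direction="maximize",
#     hp_space=my_hp_space,
#     backend="optuna",
#     compute_objective=objective_f1_macro,
#     sampler=optuna_sampler,
# )
```

F1-macro is a natural choice for imbalanced data like the example task here, but any key returned by `compute_metrics_nli_binary` could be used in the same way.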

In [None]:
# show best hyperparameters based on hp-search
print(best_run)

**Training Time with optimised hyperparameters!**

Here we can use the original train and test set again.

In [None]:
# update the training arguments with the best hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(train_args, k, v)
print("\n", train_args)

# hp-search with hf causes errors with FP16 for some reason
#setattr(train_args, "fp16", False)
#setattr(train_args, "fp16_full_eval", False)
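
The update loop above can be tried out without a GPU on small stand-in objects. In the sketch below, `BestRunStandIn` mimics the `run_id`, `objective` and `hyperparameters` fields of the object returned by `hyperparameter_search`, and `TrainArgsStandIn` is a hypothetical, heavily reduced replacement for `TrainingArguments`; all values are invented. The added `hasattr` check is a suggestion of mine, since `setattr` would otherwise silently create a new attribute for a misspelled name.

```python
from dataclasses import dataclass
from typing import Any, Dict, NamedTuple

class BestRunStandIn(NamedTuple):
    """Hypothetical stand-in for the result of trainer.hyperparameter_search."""
    run_id: str
    objective: float
    hyperparameters: Dict[str, Any]

@dataclass
class TrainArgsStandIn:
    """Hypothetical, reduced stand-in for TrainingArguments."""
    learning_rate: float = 2e-5
    num_train_epochs: int = 3
    warmup_ratio: float = 0.06
    per_device_train_batch_size: int = 16

# invented example values
best_run_demo = BestRunStandIn(
    run_id="3",
    objective=0.74,
    hyperparameters={"learning_rate": 4e-5, "num_train_epochs": 8, "warmup_ratio": 0.25},
)

args_demo = TrainArgsStandIn()
for name, value in best_run_demo.hyperparameters.items():
    if not hasattr(args_demo, name):  # guard against misspelled argument names
        raise AttributeError(f"unknown training argument: {name}")
    setattr(args_demo, name, value)
print(args_demo)
```

Arguments that were not part of the search, such as the batch size in this stand-in, keep their previous values.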

In [None]:
# reinitialize the model to avoid re-using a trained model from a step further above
model_name = "MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c"  # English model: "MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c"; multilingual model: "MoritzLaurer/mDeBERTa-v3-base-xnli-multilingual-nli-2mil7"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)
model = AutoModelForSequenceClassification.from_pretrained(model_name)


In [None]:
# Training
trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],  #.shard(index=1, num_shards=100),  # https://huggingface.co/docs/datasets/processing.html#sharding-the-dataset-shard
    eval_dataset=dataset["test"],  #.shard(index=1, num_shards=100),
    compute_metrics=lambda eval_pred: compute_metrics_nli_binary(eval_pred, label_text_alphabetical=label_text_alphabetical)
)

trainer.train()


In [None]:
## Evaluate the fine-tuned model on the held-out test set
results = trainer.evaluate()


In [None]:
print(results)

Note that a hyperparameter search does not necessarily lead to better results: the search has to be run on a train-validation split of the training set, so each trial trains on less data and is validated on a small sample, which might impact generalisation. Especially for smaller training sets, hyperparameter searches might lead to values similar to good default values. The default values discussed in the paper often provide good results.

---


## Reflection  +  Q&A

**Reading, thinking & asking:** (5 min)

* In your own words, write down the main difference between BERT-base and BERT-NLI. What are the main advantages and disadvantages when comparing BERT-base and BERT-NLI?
* Inspect `df_inference` and see where the model made mistakes and try to get a feeling for the data and patterns of errors.
* You can also run the notebook yourself and try a different model or different hyperparameters.

