<a href="https://colab.research.google.com/github/MoritzLaurer/transformers-workshop-comptext-2023/blob/master/tune_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##  🧑‍💻 Workshop Notebook 🧑‍💻 

📅 _COMPTEXT 2023 tutorial, 11.05.2023_

👨‍🏫 By [Moritz Laurer](https://twitter.com/MoritzLaurer). 
For questions, reach out to: m.laurer@vu.nl


This notebook is based on the paper ["Less Annotating, More Classifying: Addressing the Data Scarcity Issue of Supervised Machine Learning with Deep Transfer Learning and BERT-NLI"](https://github.com/MoritzLaurer/less-annotating-with-bert-nli) by Moritz Laurer, Wouter van Atteveldt, Andreu Casas, Kasper Welbers.

## Activate a GPU runtime

To run this notebook on a GPU, click "Runtime" > "Change runtime type" in the menu bar at the top left and select "GPU". Training a Transformer is much faster on a GPU. Given Google's usage limits for GPUs, it is advisable to first test your non-training code on a CPU (hardware accelerator "None" instead of "GPU") and to switch to the GPU only once you know that everything works.

## Install relevant packages

In [1]:
!pip install transformers[sentencepiece]==4.28  #4.23
!pip install datasets==2.12  #2.6
!pip install optuna==3.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
## Load general packages
# some more specialised packages are loaded in each sub section
import pandas as pd
import numpy as np
from google.colab.data_table import DataTable

In [3]:
# set random seed for reproducibility
SEED_GLOBAL = 42
np.random.seed(SEED_GLOBAL)

## Download data

In [4]:
## Download cleaned train and test data from the paper's GitHub repository
# you can choose any dataset from this repository (copy & paste the link to the raw files below): 
# https://github.com/MoritzLaurer/less-annotating-with-bert-nli/tree/master/data_clean

df_train = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_train.csv", index_col="idx") 
df_test = pd.read_csv("https://raw.githubusercontent.com/MoritzLaurer/less-annotating-with-bert-nli/master/data_clean/df_manifesto_military_test.csv", index_col="idx")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets:  3970  (train)  9537  (test).


**If you want to run the notebook on your own dataset:**

You can load your own training and test data above to fine-tune your own BERT model. Your dataframe only needs three columns to be compatible with the code below: (1) a "label" column with a numeric label; (2) a "label_text" column with the label name in plain language; (3) a "text" column with the texts for training (you might need to delete or adapt the text preparation code cell below for your dataset).
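Before running the rest of the notebook on your own data, it can help to confirm that the three expected columns are present. The following is a minimal sketch; the helper name `missing_columns` and the example column list are hypothetical and not part of the original notebook.

```python
# Hypothetical helper (not part of the original notebook): report which of
# the three columns expected by this notebook are absent.
REQUIRED_COLUMNS = ("label", "label_text", "text")

def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that do not appear in `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]

# Works with any iterable of column names, e.g. `df_train.columns`.
example_columns = ["idx", "label", "text_original"]
print(missing_columns(example_columns))  # ['label_text', 'text']
```

An empty list means the dataframe has everything the later cells rely on.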

In [5]:
## alternatively, you can also load your own .csv files from Google Drive
"""
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)

# set the path to your data
os.chdir("/content/drive/My Drive/PhD/other/COMPTEXT-2023-workshop/data")  
print(os.getcwd())

df_train = pd.read_csv("./df_manifesto_morality_train.csv")
df_test = pd.read_csv("./df_manifesto_morality_test.csv")
print("Length of training and test sets: ", len(df_train), " (train) ", len(df_test), " (test).")
"""



In [6]:
# optional: use training data sample size of e.g. 1000 for faster testing
sample_size = 1000
df_train = df_train.sample(n=min(sample_size, len(df_train)), random_state=SEED_GLOBAL).copy(deep=True)
df_test = df_test.sample(n=min(sample_size*4, len(df_test)), random_state=SEED_GLOBAL).copy(deep=True)

print("Length of training and test sets after sampling: ", len(df_train), " (train) ", len(df_test), " (test).")


Length of training and test sets after sampling:  1000  (train)  4000  (test).


In [7]:
## inspect the data
# label distribution train set 
print("Train set label distribution:\n", df_train.label_text.value_counts(), "\n")
# label distribution test set 
print("Test set label distribution:\n", df_test.label_text.value_counts())


Train set label distribution:
 Other                 516
Military: Positive    399
Military: Negative     85
Name: label_text, dtype: int64 

Test set label distribution:
 Other                 3646
Military: Positive     248
Military: Negative     106
Name: label_text, dtype: int64
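The counts above show that the sampled training set and the sampled test set have quite different label distributions. As a small worked example, the sketch below converts the printed counts into shares and computes the accuracy of a trivial classifier that always predicts the most frequent test label. The helper name `label_shares` is hypothetical; the numbers are copied from the output above.

```python
# Label counts copied from the printed output above.
train_counts = {"Other": 516, "Military: Positive": 399, "Military: Negative": 85}
test_counts = {"Other": 3646, "Military: Positive": 248, "Military: Negative": 106}

def label_shares(counts):
    """Convert absolute label counts into shares of the total."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

train_shares = label_shares(train_counts)
test_shares = label_shares(test_counts)

# Accuracy of a trivial classifier that always predicts the most frequent test label.
majority_baseline = max(test_shares.values())

for label in train_counts:
    print(f"{label:<20} train: {train_shares[label]:.1%}   test: {test_shares[label]:.1%}")
print(f"Majority-class accuracy on the test sample: {majority_baseline:.4f}")
```

Because roughly 91% of the test sample is "Other", plain accuracy is a weak indicator here, which is one reason to also look at macro-averaged metrics.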


In [None]:
# full training data table
DataTable(df_train, num_rows_per_page=5)

## Data preprocessing

**Prepare the input text**

1.) For NLI-based classification, the target texts can be made to fit the hypothesis more naturally by wrapping each target text into the string ' The quote: "{target_text}" - end of the quote. '. Note that the code cell below does not apply this wrapping; it only performs step 2.

2.) We surround the target text with its preceding and following sentences. Adding context in this way systematically increases performance. 


In [9]:
df_train["text"] = df_train.text_preceding.fillna("") + " " + df_train.text_original.fillna("") + " " + df_train.text_following.fillna("")
df_test["text"] = df_test.text_preceding.fillna("") + " " + df_test.text_original.fillna("") + " " + df_test.text_following.fillna("")
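The concatenation above can also be expressed for a single example. This is a minimal sketch on plain Python strings, under the assumption that missing neighbours are represented as `None` (pandas represents them as `NaN`, which is why the cell above calls `fillna("")` first). The function name `join_with_context` is hypothetical.

```python
def join_with_context(preceding, original, following):
    """Join a target sentence with its neighbours, treating None as empty."""
    # Keep only parts that exist, then strip surrounding whitespace.
    cleaned = [part.strip() for part in (preceding, original, following) if part]
    # Drop parts that were whitespace-only so no double spaces are produced.
    return " ".join(part for part in cleaned if part)

print(join_with_context(None, "We will cut military spending.", "Aid comes first."))
```

Unlike the vectorised cell above, this variant does not leave a leading or trailing space when a neighbouring sentence is missing.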


In [10]:
df_train = df_train[["label", "label_text", "text"]]
df_test = df_test[["label", "label_text", "text"]]

In [None]:
DataTable(df_train, num_rows_per_page=5)

## Load a Transformer

We use [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) for loading and training our model. They provide great documentation and also a very good [course](https://huggingface.co/course/chapter1/1) on how to use Transformers. 

**Choosing a Transformer model**

You can use any classification model on the [Hugging Face Hub](https://huggingface.co/models?sort=downloads). I suggest testing these models: 



*   Original BERT: `bert-base-uncased`
*   Small efficient model: `distilbert-base-uncased`
*   Newer version of BERT: `microsoft/deberta-v3-base`
*   Large, high-performance model: `microsoft/deberta-v3-large`
*   Multilingual model: `microsoft/mdeberta-v3-base`





In [12]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoConfig
import torch

## load a model and its tokenizer
model_name = "microsoft/deberta-v3-large"  
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# link the numeric labels to the label texts
label_text = np.sort(df_test.label_text.unique()).tolist()
label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);

# use GPU (cuda) if available, otherwise use CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Device: {device}")
model.to(device);


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Some weights of the model checkpoint at microsoft/deberta-v3-large were not used when initializing DebertaV2ForSequenceClassification: ['lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.weight', 'lm_predictions.lm_head.bias', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.classifier.bias', 'mask_predictions.classifier.weight', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'mask_predictions.LayerNorm.bias']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

Device: cuda
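The label mapping built in the cell above can be illustrated without pandas or NumPy. The sketch below sorts the three label names from this dataset alphabetically and assigns consecutive integers, which should give the same mapping as the cell above. It only matches your data if the numeric `label` column was itself created from alphabetically sorted label names; that is an assumption here.

```python
# Label names as they appear in the label_text column of this dataset.
label_names = ["Other", "Military: Positive", "Military: Negative"]

# Sort alphabetically so the mapping is deterministic.
sorted_labels = sorted(label_names)
label2id = {label: idx for idx, label in enumerate(sorted_labels)}
id2label = {idx: label for label, idx in label2id.items()}

print(label2id)  # {'Military: Negative': 0, 'Military: Positive': 1, 'Other': 2}
```

It is worth comparing this mapping against a few rows of your dataframe, since a mismatch would silently attach the wrong label names to the predictions.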


## Tokenize data

In [13]:
# convert pandas dataframes to Hugging Face dataset object to facilitate pre-processing
import datasets

dataset = datasets.DatasetDict({
    "train": datasets.Dataset.from_pandas(df_train),
    "test": datasets.Dataset.from_pandas(df_test)
})

# tokenize
def tokenize(examples):
  return tokenizer(examples["text"], truncation=True, max_length=512)  # max_length can be reduced to e.g. 256 to increase speed, but long texts will be cut off

dataset["train"] = dataset["train"].map(tokenize, batched=True)  
dataset["test"] = dataset["test"].map(tokenize, batched=True) 

# remove unnecessary columns for model training
dataset = dataset.remove_columns(['label_text'])   #'text_original', 'label_domain_text', 'label_subcat_text', 'text_preceding', 'text_following', 'manifesto_id', 'doc_id', 'country_name', 'date', 'party', 'cmp_code_hb4', 'cmp_code', 'label_subcat_text_simple'])


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/4000 [00:00<?, ? examples/s]

**Inspect processed data**

In [14]:
print("The overall structure of the pre-processed train and test sets:\n")
print(dataset)

print("\n\nAn example of a tokenized text:\n")
print(dataset["train"][0])

The overall structure of the pre-processed train and test sets:

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 4000
    })
})


An example of a tokenized text:

{'label': 0, 'text': 'emergency relief in situations of armed conflict should be carried out by civilians and must be clearly distinguished from any military activities. a direct role for military forces in the provision of relief should be restricted to situations involving natural disasters where ambiguity over the military role is unlikely to arise. aid programs should not be used to influence the democratic preferences of any nation.', 'idx': 71033, 'input_ids': [1, 2644, 3478, 267, 3335, 265, 5652, 3423, 403, 282, 2635, 321, 293, 10936, 263, 516, 282, 2117, 10045, 

## Setting training arguments / hyperparameters

The following cells set several important hyperparameters. We chose parameters that work well in general to avoid the need for hyperparameter search. Further below, we also provide code for hyperparameter search, if researchers want to try to increase performance by a few percentage points. 

In [None]:
# Set the directory to write the fine-tuned model and training logs to.
# With google colab, this will create a temporary folder, which will be deleted once you disconnect. 
# You can connect to your personal google drive to save models and logs properly.
training_directory = "BERT-demo"

# FP16 (mixed precision) can increase training speed and reduce memory consumption, but only on a GPU and with a batch size > 8, see here: https://huggingface.co/transformers/performance.html?#fp16
# FP16 does not work on CPU or for multilingual mDeBERTa models
fp16_bool = torch.cuda.is_available()
if "mdeberta" in model_name.lower(): fp16_bool = False  # multilingual mDeBERTa does not support FP16 yet: https://github.com/microsoft/DeBERTa/issues/77
# FP16 is switched off here because of the hyperparameter search at the end of the notebook: the integrated hyperparameter search of the Hugging Face Trainer can lead to errors otherwise.
# Remove the following line if you do not run the hyperparameter search and want to use FP16.
fp16_bool = False

In [15]:
from transformers import TrainingArguments, Trainer, logging

LEARNING_RATE = 2e-5
EPOCHS = 5

# Hugging Face tips to increase training speed and decrease out-of-memory (OOM) issues: https://huggingface.co/transformers/performance.html?
# Overview of all training arguments: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
train_args = TrainingArguments(
    output_dir=f'./results/{training_directory}',
    logging_dir=f'./logs/{training_directory}',
    num_train_epochs=EPOCHS,  # this can be increased, but higher values increase training time. Good values for NLI are between 3 and 20.
    learning_rate=LEARNING_RATE,
    per_device_train_batch_size=8,  # if you get an out-of-memory error, reduce this value to 4 and restart the runtime. Higher values increase training speed, but also increase memory requirements. Ideal values are multiples of 8.
    per_device_eval_batch_size=80,  # if you get an out-of-memory error, reduce this value, e.g. to 40 and restart the runtime
    gradient_accumulation_steps=2,  # accumulates gradients over X batches before each optimizer update, so the effective batch size is per_device_train_batch_size * X. Useful in case of memory problems: it allows a smaller per-device batch size, at the cost of slightly lower speed.
    warmup_ratio=0.06,  # share of training steps used for learning rate warmup; 0.06 is a good default value for normal BERT-base models
    weight_decay=0.1,
    seed=SEED_GLOBAL,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=fp16_bool,  # Can speed up training and reduce memory consumption, but only makes sense at batch-size > 8. loads two copies of model weights, which creates overhead. https://huggingface.co/transformers/performance.html?#fp16
    fp16_full_eval=fp16_bool,
    evaluation_strategy="epoch", # options: "no"/"steps"/"epoch"
    #eval_steps=10_000,  # evaluate every n steps if evaluation_strategy=='steps'. defaults to logging_steps
    save_strategy = "epoch",  # options: "no"/"steps"/"epoch"
    #save_steps=10_000,              # Number of update steps between two checkpoint saves.
    #save_total_limit=10,             # If a value is passed, will limit the total amount of checkpoints. Deletes the older checkpoints in output_dir
    #logging_strategy="steps",
    report_to="all",  # "all"  # logging
    #push_to_hub=False,
    #push_to_hub_model_id=f"{model_name}-finetuned-{task}",
)
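To make the interaction between batch size, gradient accumulation, epochs and warmup more concrete, here is a small worked calculation with the values set above. It is a sketch that assumes a single device and assumes that a leftover partial accumulation group at the end of an epoch is not counted as an update; under these assumptions the result agrees with the `global_step=310` reported in the training output further below.

```python
import math

num_examples = 1000        # training sample size used above
per_device_batch_size = 8  # per_device_train_batch_size
accumulation_steps = 2     # gradient_accumulation_steps
epochs = 5                 # num_train_epochs
warmup_ratio = 0.06

# Assumes a single device, so the per-device batch size is the batch size per forward pass.
effective_batch_size = per_device_batch_size * accumulation_steps
batches_per_epoch = math.ceil(num_examples / per_device_batch_size)

# One optimizer update per `accumulation_steps` batches (leftover batches are assumed not to count).
updates_per_epoch = batches_per_epoch // accumulation_steps
total_updates = updates_per_epoch * epochs
warmup_updates = math.ceil(total_updates * warmup_ratio)

print(f"Effective batch size: {effective_batch_size}")
print(f"Optimizer updates in total: {total_updates}")
print(f"Updates spent on warmup: {warmup_updates}")
```

Halving `per_device_train_batch_size` while doubling `gradient_accumulation_steps` keeps the effective batch size constant, which is the usual way to trade speed for memory.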


In [16]:
### Function to calculate metrics
from sklearn.metrics import balanced_accuracy_score, precision_recall_fscore_support, accuracy_score, classification_report
import warnings

def compute_metrics_standard(eval_pred):
    with warnings.catch_warnings():
        warnings.filterwarnings("ignore")
        
        labels = eval_pred.label_ids
        pred_logits = eval_pred.predictions
        preds_max = np.argmax(pred_logits, axis=1)  # argmax on each row (axis=1) in the tensor

        ## metrics
        precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(labels, preds_max, average='macro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        precision_micro, recall_micro, f1_micro, _ = precision_recall_fscore_support(labels, preds_max, average='micro')  # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html
        acc_balanced = balanced_accuracy_score(labels, preds_max)
        acc_not_balanced = accuracy_score(labels, preds_max)

        metrics = {
            'accuracy': acc_not_balanced,
            'f1_macro': f1_macro,
            'accuracy_balanced': acc_balanced,
            'f1_micro': f1_micro,
            'precision_macro': precision_macro,
            'recall_macro': recall_macro,
            'precision_micro': precision_micro,
            'recall_micro': recall_micro,
            #'label_gold_raw': labels,
            #'label_predicted_raw': preds_max
        }
        #print("Aggregate metrics: ", {key: metrics[key] for key in metrics if key not in ["label_gold_raw", "label_predicted_raw"]} )  # print metrics but without label lists
        #print("Detailed metrics: ", classification_report(labels, preds_max, labels=np.sort(pd.factorize(label_text, sort=True)[0]), target_names=label_text, sample_weight=None, digits=2, output_dict=True,
        #                            zero_division='warn'), "\n")
        
        return metrics
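The function above reports both accuracy and macro-averaged F1. The toy example below illustrates why both are useful on imbalanced data: a classifier that always predicts the majority class reaches a high accuracy but a low macro F1. This is a plain-Python sketch for illustration only; the notebook itself uses scikit-learn, and the helper names `accuracy` and `macro_f1` as well as the toy labels are made up.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that equal the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no true positives scores 0."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: 8 examples of class 0, one each of classes 1 and 2.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10  # always predict the majority class

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")  # 0.800
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")  # 0.296
```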


## Fine-tuning and evaluation

Let's start fine-tuning the model! 

If you get an 'out-of-memory' error, reduce 'per_device_train_batch_size' in the TrainingArguments above (e.g. to 4) and restart the runtime (menu at the top left: 'Runtime' > 'Restart runtime'). If you do not restart the runtime and rerun the entire script, the 'out-of-memory' error will probably not go away. 

In [17]:
# training
trainer = Trainer( 
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics_standard  
)

trainer.train()


You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 Macro | Accuracy Balanced | F1 Micro | Precision Macro | Recall Macro | Precision Micro | Recall Micro |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No log | 0.172034 | 0.94075 | 0.552534 | 0.63089 | 0.94075 | 0.508462 | 0.63089 | 0.94075 | 0.94075 |
| 2 | No log | 0.24183 | 0.93825 | 0.605839 | 0.649234 | 0.93825 | 0.811053 | 0.649234 | 0.93825 | 0.93825 |
| 2 | No log | 0.201216 | 0.9485 | 0.743127 | 0.749353 | 0.9485 | 0.754048 | 0.749353 | 0.9485 | 0.9485 |
| 4 | No log | 0.270814 | 0.9385 | 0.725317 | 0.794548 | 0.9385 | 0.6771 | 0.794548 | 0.9385 | 0.9385 |
| 4 | No log | 0.3317 | 0.93475 | 0.727196 | 0.798815 | 0.93475 | 0.688309 | 0.798815 | 0.93475 | 0.93475 |


TrainOutput(global_step=310, training_loss=0.3029970968923261, metrics={'train_runtime': 1153.433, 'train_samples_per_second': 4.335, 'train_steps_per_second': 0.269, 'total_flos': 871524828676656.0, 'train_loss': 0.3029970968923261, 'epoch': 4.96})

In [18]:
# Evaluate the fine-tuned model on the held-out test set
#results = trainer.evaluate()
#print(results)

## Inference with your fine-tuned model

In [19]:
from transformers import pipeline
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use GPU (cuda) if available, otherwise use CPU

# documentation: https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline
pipe_classifier = pipeline(
    "text-classification", 
    model=model,  # if you have trained a model above, load_best_model_at_end in the training arguments has automatically replaced model with the fine-tuned model
    tokenizer=tokenizer,
    framework="pt",
    device=device,
)

We now apply the pipeline to unseen texts. We re-use the df_test data frame here for simplicity, but it could be any other dataset; it only needs a text column. Note that we do not need to tokenize the text data ourselves here, as this is handled internally by the Hugging Face text-classification pipeline. If you want to better understand the arguments of the pipeline, we recommend reading the [documentation here](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.TextClassificationPipeline). 

In [20]:
# create a dummy data frame for illustration
df_inference = df_test[["text", "label_text"]].sample(n=1000, random_state=42).copy(deep=True)
text_lst = df_inference["text"].tolist()

# use the pipeline with your chosen model for inference (prediction)
pipe_output = pipe_classifier(
    text_lst,  # input any list of texts here
    batch_size=32  # reduce this number to 8 or 16 if you get an out-of-memory error
)
print(pipe_output)

df_output = pd.DataFrame(pipe_output)

# add inference data to your original dataframe
df_inference["label_text_pred"] = df_output["label"].tolist()
df_inference["label_text_pred_probability"] = df_output["score"].round(2).tolist() 


[{'label': 'Other', 'score': 0.9949029684066772}, {'label': 'Other', 'score': 0.9950873255729675}, {'label': 'Other', 'score': 0.9955647587776184}, {'label': 'Other', 'score': 0.9951300621032715}, {'label': 'Other', 'score': 0.9864142537117004}, {'label': 'Other', 'score': 0.9945248365402222}, {'label': 'Other', 'score': 0.9956797361373901}, {'label': 'Other', 'score': 0.9959794282913208}, {'label': 'Other', 'score': 0.9926238059997559}, {'label': 'Other', 'score': 0.9946034550666809}, {'label': 'Other', 'score': 0.9963145852088928}, {'label': 'Military: Negative', 'score': 0.6199309825897217}, {'label': 'Other', 'score': 0.9955765008926392}, {'label': 'Other', 'score': 0.9954009652137756}, {'label': 'Other', 'score': 0.9953039884567261}, {'label': 'Other', 'score': 0.9950782060623169}, {'label': 'Other', 'score': 0.9944655299186707}, {'label': 'Other', 'score': 0.9949158430099487}, {'label': 'Other', 'score': 0.9957613348960876}, {'label': 'Other', 'score': 0.9951813817024231}, {'labe
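The pipeline returns one dictionary per text, with the predicted label and its score. One possible use of the score is to flag uncertain predictions for manual review. The sketch below does this on a toy list in the same shape as `pipe_output`; the helper name `flag_uncertain`, the rounded toy scores and the threshold of 0.8 are illustrative assumptions, not recommendations from the original notebook.

```python
# Toy output in the same shape as `pipe_output` above (a list of dicts).
toy_output = [
    {"label": "Other", "score": 0.9949},
    {"label": "Military: Negative", "score": 0.6199},
    {"label": "Other", "score": 0.9864},
]

def flag_uncertain(predictions, threshold=0.8):
    """Return the positions of predictions whose score is below `threshold`."""
    return [i for i, pred in enumerate(predictions) if pred["score"] < threshold]

print(flag_uncertain(toy_output))  # [1]
```

The returned positions line up with the order of the input texts, so they can be used to select the corresponding rows for inspection.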

In [None]:
df_inference

In [22]:
raise Exception("Stopping code here to avoid accidental runs of the code below")

Exception: ignored

## Save and load your fine-tuned model

This section provides code for saving the model to disk (here: Google Drive) or for uploading it to the Hugging Face Hub. 

In [None]:
## first you need to connect to your google drive with your google account
from google.colab import drive
import os
drive.mount('/content/drive', force_remount=False)
#drive.flush_and_unmount()

# insert the path where you want to save the model
os.chdir("/content/drive/My Drive/")  
print(os.getcwd())


In [None]:
### save best model to google drive
directory_save_model = f"{training_directory}/"
model_name_custom = f"{model_name.split('/')[-1]}-custom"
mode_custom_path = directory_save_model + model_name_custom

trainer.save_model(output_dir=mode_custom_path)

In [None]:
### Push to Hugging Face hub
# install necessary dependencies
# you need to create an account on https://huggingface.co/ for this
!sudo apt-get install git-lfs
!huggingface-cli login

In [None]:
# load your models and tokenizer saved before from disk
model = AutoModelForSequenceClassification.from_pretrained(mode_custom_path)
tokenizer = AutoTokenizer.from_pretrained(mode_custom_path, use_fast=True, model_max_length=512)  # trainer.save_model also saved the tokenizer passed to the Trainer in the same directory

In [None]:
# https://huggingface.co/docs/transformers/main_classes/model#transformers.PreTrainedModel.push_to_hub
repo_id = '<your-user-name>/<your-model-name>'  # e.g. "JaneJones/DeBERTa-v3-nli-custom". note that the repo name is case-sensitive
model.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")
tokenizer.push_to_hub(repo_id=repo_id, use_temp_dir=True, private=True, use_auth_token="<your-huggingface-token>")


## Bonus: Hyperparameter Search

To increase performance, you can also conduct a hyperparameter search (hp-search) to try to find the best hyperparameters for your specific task and dataset. The trade-off is that hp-search is very compute intensive, but finding better hyperparameters for your task can increase performance. Make sure to conduct hp-search on a validation set split off from the training set, not on the final test set, to avoid leaking test data before final testing.

Note that for small datasets, running the hp-search on only one train-validation split is not ideal. For datasets with fewer than around 2000 training data points, we recommend running the hp-search on two different random train-validation splits. We implemented this for our paper, but not in this notebook, as this would make the code harder to understand. 

Documentation with more information on hp-search with Hugging Face Transformers is available [here](https://huggingface.co/docs/transformers/main/hpo_train). 
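The idea of two different random, stratified train-validation splits can be sketched in plain Python. This is an illustration on toy labels, not the implementation used in the paper: the function name `stratified_split`, the toy label list and the second seed (43) are assumptions made for this example.

```python
import random
from collections import defaultdict

def stratified_split(labels, validation_fraction, seed):
    """Split example indices into train/validation lists, label by label."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        # Take the same fraction from every label to preserve the label distribution.
        n_val = round(len(indices) * validation_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

toy_labels = ["Other"] * 10 + ["Military: Positive"] * 5 + ["Military: Negative"] * 5

# Two splits with different seeds; the hp-search would be run once per split.
splits = [stratified_split(toy_labels, validation_fraction=0.4, seed=s) for s in (42, 43)]
for train_idx, val_idx in splits:
    print(len(train_idx), "train /", len(val_idx), "validation")
```

Running the search on each split and comparing the selected hyperparameters gives a rough sense of how stable the result is.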

In [None]:
## train-validation split - test set should not be visible during hp-search
# https://huggingface.co/docs/datasets/v2.5.1/en/package_reference/main_classes#datasets.Dataset.train_test_split

# the ideal size of the validation set depends on the size of your training data. Each label should have at the very least a few dozen examples in the validation set (ideally several hundred)
validation_set_size = 0.4  # for a training data size of 1000 with 3 classes we use 40% of the training data for validating hyperparameters

# reformatting of label column to enable dataset stratification
from datasets import ClassLabel
new_features = dataset["train"].features.copy()
label_names = list(model.config.label2id.keys())

new_features['label'] = ClassLabel(names=label_names)
dataset = dataset.cast(new_features)

# train-validation split for hp-search
dataset_hp = dataset["train"].train_test_split(test_size=validation_set_size, seed=SEED_GLOBAL, shuffle=True, stratify_by_column="label") 
print(dataset_hp)

In [None]:
# helper function to clean memory and reduce risk of out-of-memory error
import gc
def clean_memory():
  #del(model)
  if torch.cuda.is_available():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
  gc.collect()

clean_memory()

In [None]:
## Reinitialize trainer for hp-search
# https://discuss.huggingface.co/t/using-hyperparameter-search-in-trainer/785/10

def model_init():
    clean_memory()

    # link the numeric labels to the label texts
    label_text = np.sort(df_test.label_text.unique()).tolist()
    label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
    id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
    config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

    return AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True).to(device)

trainer = Trainer( 
    model_init=model_init,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset_hp["train"],
    eval_dataset=dataset_hp["test"],
    compute_metrics=compute_metrics_standard  
);


**Define the hyperparameters you want to optimise**

For a detailed discussion of different hyperparameters, see the appendix of our paper.

In [None]:
# we use Optuna for hp-search: https://optuna.readthedocs.io/en/stable/
def my_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [9e-6, 2e-5, 4e-5]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 4, 24, log=False, step=4),   # increasing the maximum number of epochs here could increase performance but will take (much) longer to train
        #"warmup_ratio": trial.suggest_float("warmup_ratio", 0.1, 0.6, log=True),
        "per_device_train_batch_size": 16,  # lower this value in case of out-of-memory errors and restart the runtime
        #"per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16, 32]),
        "evaluation_strategy": "no",
        "save_strategy": "no",
    }
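To get a feeling for the size of the search space defined above, the following sketch enumerates all combinations of the two searched hyperparameters. It only mirrors the values from `my_hp_space`; the variable names are illustrative.

```python
from itertools import product

learning_rates = [9e-6, 2e-5, 4e-5]
epoch_options = list(range(4, 24 + 1, 4))  # mirrors suggest_int("num_train_epochs", 4, 24, step=4)

configurations = list(product(learning_rates, epoch_options))
print(epoch_options)        # [4, 8, 12, 16, 20, 24]
print(len(configurations))  # 18
```

With 3 learning rates and 6 epoch values there are 18 distinct configurations, so the 10 trials used further below can cover only part of this space.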


**Run HP search!**

Choose the number of hyperparameter configurations you want to test. In our experiments we found that after 10 to 15 trials with around 4 hyperparameters, performance is unlikely to increase meaningfully. 15 trials seems to be a safe value, but can take a while to run. 

In [None]:
import optuna

# number of different hp configurations to test
numer_of_trials = 10  # increasing this value can lead to better hyperparameters, but will take longer
# choose the sampler for sampling hp configurations
optuna_sampler = optuna.samplers.TPESampler(
    seed=SEED_GLOBAL, consider_prior=True, prior_weight=1.0, consider_magic_clip=True, 
    consider_endpoints=False, n_startup_trials=numer_of_trials // 2, n_ei_candidates=24,  # integer division: n_startup_trials should be an int
    multivariate=False, group=False, warn_independent_sampling=True, constant_liar=False
)  # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.samplers.TPESampler.html#optuna.samplers.TPESampler

# Hugging Face Documentation: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer.hyperparameter_search
best_run = trainer.hyperparameter_search(
    n_trials=numer_of_trials,
    direction="maximize", 
    hp_space=my_hp_space,
    backend='optuna',
    **{"sampler": optuna_sampler}
)

In [None]:
# show best hyperparameters based on hp-search
print(best_run)

**Training Time with optimised hyperparameters!**

Here we can use the original train and test set again.

In [None]:
# update the training arguments with the best hyperparameters
for k,v in best_run.hyperparameters.items():
    setattr(train_args, k, v)
print("\n", train_args)

# hp-search with hf causes errors with FP16 for some reason
#setattr(train_args, "fp16", False)
#setattr(train_args, "fp16_full_eval", False)
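The `setattr` loop above overwrites attributes on the training arguments by name. The sketch below shows the same pattern on a small stand-in class, with an added check that rejects names the object does not define. `ToyArgs` and `apply_hyperparameters` are hypothetical names for illustration; the notebook itself applies the loop directly to `train_args`.

```python
from dataclasses import dataclass

@dataclass
class ToyArgs:
    """Stand-in for TrainingArguments with only two fields (hypothetical)."""
    learning_rate: float = 2e-5
    num_train_epochs: int = 5

def apply_hyperparameters(args, hyperparameters):
    """Copy values onto `args`, refusing names that `args` does not define."""
    for name, value in hyperparameters.items():
        if not hasattr(args, name):
            raise AttributeError(f"Unknown training argument: {name}")
        setattr(args, name, value)
    return args

toy_args = apply_hyperparameters(ToyArgs(), {"learning_rate": 4e-5, "num_train_epochs": 12})
print(toy_args)
```

The extra check guards against typos in hyperparameter names, which a bare `setattr` would silently accept by creating a new attribute.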

In [None]:
# reinitialize the model to avoid re-using a trained model from a step further above
#model_name = "XXX"  
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True, model_max_length=512)

# link the numeric labels to the label texts
label_text = np.sort(df_test.label_text.unique()).tolist()
label2id = dict(zip(np.sort(label_text), np.sort(pd.factorize(label_text, sort=True)[0]).tolist()))
id2label = dict(zip(np.sort(pd.factorize(label_text, sort=True)[0]).tolist(), np.sort(label_text)))
config = AutoConfig.from_pretrained(model_name, label2id=label2id, id2label=id2label, num_labels=len(label2id));

# load model with config
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config, ignore_mismatched_sizes=True);


In [None]:
# Training
trainer = Trainer( 
    model=model,
    tokenizer=tokenizer,
    args=train_args,
    train_dataset=dataset["train"],  #.shard(index=1, num_shards=100),  # https://huggingface.co/docs/datasets/processing.html#sharding-the-dataset-shard
    eval_dataset=dataset["test"],  #.shard(index=1, num_shards=100),  
    compute_metrics=compute_metrics_standard  # the metrics function defined earlier in this notebook
)

trainer.train()


In [None]:
## Evaluate the fine-tuned model on the held-out test set
results = trainer.evaluate()


In [None]:
print(results)

Note that hyperparameter searches do not necessarily lead to better results: they have to be run on a smaller validation split of the training set, which might impact generalisation. Especially for smaller training sets, a hyperparameter search might end up with values similar to good default values. The default values discussed in the paper often provide good results. 