## Setup

Note that we use Weights & Biases (wandb) to collect detailed metrics and the results of our hyperparameter tuning. It is an external tool that requires user credentials. If you do not wish to use it, please adjust the logging settings in the `fine_tune` function accordingly.
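
As a minimal sketch of one way to make the tracker optional, the helper below (the name `pick_report_to` is hypothetical) assumes credentials are supplied through the `WANDB_API_KEY` environment variable and returns a value that could be passed as `report_to` in `TrainingArguments`. Credentials stored by `wandb login` would not be detected by this check.

```python
import os

def pick_report_to(env=None):
    """Return 'wandb' when an API key is present in the environment, else 'none'.

    Assumption: credentials are provided via the WANDB_API_KEY variable.
    """
    env = os.environ if env is None else env
    return "wandb" if env.get("WANDB_API_KEY") else "none"

# Toy environments, passed explicitly so the example does not depend on the real one
print(pick_report_to({"WANDB_API_KEY": "dummy-key"}))
print(pick_report_to({}))
```

With `report_to="none"`, the Trainer should skip all logging integrations, so no wandb account would be needed.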

In [1]:
# Dependencies to run the notebook 
# Uncomment below to install - may require additional installations depending on your python version

# python version == 3.10.14
# %pip install torch transformers peft datasets scikit-learn wandb accelerate -U --quiet

In [5]:
import datetime, os, itertools, torch, wandb
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, accuracy_score
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, PeftModel
from transformers import RobertaTokenizerFast, DataCollatorWithPadding, RobertaForSequenceClassification, \
                         TrainingArguments, Trainer, EarlyStoppingCallback

In [2]:
# Pretrained model's name from HuggingFace
PT_MODEL_NAME = 'roberta-large'

# Fine-tuned model path (best model after hyperparameter tuning)
FT_MODEL_PATH = './LoRA/Final-model'

# Path to dataset in which all *.csv files are stored
DATASET_PATH = './data'
TRAIN_FILE_NAME, VAL_FILE_NAME = 'train.csv', 'dev.csv'

## Data preparation

In [11]:
# Load training and validation datasets from csv files as a DatasetDict
data_files = {"train": TRAIN_FILE_NAME, "validation": VAL_FILE_NAME}
dataset = load_dataset("csv", data_dir=DATASET_PATH, data_files=data_files)
dataset

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 26944
    })
    validation: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 6737
    })
})

In [None]:
# Load model's tokenizer
tokenizer = RobertaTokenizerFast.from_pretrained(PT_MODEL_NAME)

In [12]:
def text_preprocessing(samples):
    ''' Replace None instances in hypothesis by empty strings. '''
    samples['hypothesis'] = "" if samples['hypothesis'] is None else samples['hypothesis']
    return samples

# Preprocess texts (hypothesis) from the dataset
dataset = dataset.map(text_preprocessing, batched=False)

In [13]:
def text_tokenization(samples):
    ''' Tokenize the premise and hypothesis into sentence pair classification format. '''
    return tokenizer(samples['premise'], samples['hypothesis'], truncation=True)

# Tokenize dataset (premises and hypothesis) for sentence pair classification
tokenized_dataset = dataset.map(text_tokenization, batched=True, remove_columns=['premise', 'hypothesis'])
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 26944
    })
    validation: Dataset({
        features: ['label', 'input_ids', 'attention_mask'],
        num_rows: 6737
    })
})

We create batches of samples using `DataCollatorWithPadding`. Dynamically padding the sentences to the longest length in each batch during collation is more efficient than padding the whole dataset to the maximum length.
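
To illustrate why per-batch padding saves work, here is a small library-free sketch. The helper names and the toy token ids are made up for the example; it only mimics the idea behind dynamic padding and is not the collator's actual implementation.

```python
def pad_batch(batch, pad_id, length):
    """Right-pad every sequence in the batch to the given length."""
    return [seq + [pad_id] * (length - len(seq)) for seq in batch]

def dynamic_padding(sequences, batch_size, pad_id=1):
    """Split into batches and pad each batch only to its own longest sequence."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        longest = max(len(seq) for seq in batch)
        batches.append(pad_batch(batch, pad_id, longest))
    return batches

def count_tokens(batches):
    """Total number of token positions (real + padding) the model would process."""
    return sum(len(seq) for batch in batches for seq in batch)

# Toy tokenized sequences of lengths 3, 5, 3 and 10 (ids are illustrative only)
toy = [[0, 5, 2], [0, 7, 8, 9, 2], [0, 4, 2], [0, 6, 3, 4, 5, 6, 7, 8, 9, 2]]

dynamic = dynamic_padding(toy, batch_size=2)

# Static alternative: pad everything to the longest sequence in the dataset
global_max = max(len(seq) for seq in toy)
static = [pad_batch(toy[i:i + 2], 1, global_max) for i in range(0, len(toy), 2)]

print(count_tokens(dynamic), count_tokens(static))  # 30 40
```

In this toy case the first batch is padded to length 5 instead of 10, so a quarter of the token positions are avoided.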

In [14]:
# Load a data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

## LoRA (Low-Rank Adaptation) preparation

In [9]:
def model_init_lora():
    ''' Load a pre-trained model on the most efficient available device,
        and set up LoRA (Low-Rank Adaptation) to speed up training and lower computational costs.

        LoRA is a technique designed to fine-tune very large language models by keeping the
        pretrained parameters of the model frozen and introducing trainable low-rank matrices that adapt
        the model's behavior for a specific task. This significantly reduces the number of trainable
        parameters during fine-tuning, leading to faster training and reduced computational costs.

        In a typical transformer architecture, attention and feed-forward layers play crucial roles.
        LoRA therefore targets weight matrices in these layers. No target_modules are passed below,
        so PEFT selects its default modules for this architecture.

        Publication: https://arxiv.org/abs/2106.09685
    '''
    # Load pretrained model
    pretrained_model = RobertaForSequenceClassification.from_pretrained(
        PT_MODEL_NAME, num_labels=2, device_map='cuda' if torch.cuda.is_available() else 'cpu',
    )
    print('Device used:', 'cuda' if torch.cuda.is_available() else 'cpu')

    # Configure LoRA variation of the model by freezing layers and adding trainable low-rank matrices
    peft_config = LoraConfig(task_type="SEQ_CLS", inference_mode=False, r=8, lora_alpha=16, lora_dropout=0.1)
    peft_model = get_peft_model(pretrained_model, peft_config)
    print('LoRA parameters:', end=' ')
    peft_model.print_trainable_parameters()
    return peft_model
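
The parameter saving described in the docstring can be made concrete with a little arithmetic. The sketch below is illustrative only: it counts the weights of a single square projection of width 1024 (the hidden size of `roberta-large`), ignores biases, and does not attempt to reproduce which modules PEFT actually adapts. The helper names are hypothetical.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable weights of one LoRA pair: A has shape (r, d_in), B has shape (d_out, r)."""
    return r * d_in + d_out * r

def lora_delta(B, A, alpha, r):
    """Scaled low-rank update (alpha / r) * B @ A, computed with plain lists."""
    scale = alpha / r
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(len(A))) for j in range(len(A[0]))]
        for i in range(len(B))
    ]

hidden = 1024                      # hidden size of roberta-large
full = hidden * hidden             # weights in one frozen square projection
lora = lora_param_count(hidden, hidden, r=8)
print(full, lora, full // lora)    # 1048576 16384 64

# Tiny rank-1 example: the update added to the frozen weight is a full-size
# matrix even though only B (2x1) and A (1x2) would be trained.
delta = lora_delta(B=[[1.0], [2.0]], A=[[3.0, 4.0]], alpha=2, r=1)
print(delta)                       # [[6.0, 8.0], [12.0, 16.0]]
```

With `r=8` and `lora_alpha=16` as configured above, the scaling factor `alpha / r` is 2.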

## Fine-tuning preparation

In [16]:
def compute_metrics(eval_pred):
    ''' Calculate metrics given model predictions and expected labels.
        Metrics: F1 (micro/macro/weighted), Precision (macro), Recall (macro), ROC AUC (macro), and Accuracy.
        We use macro averages as the classes are imbalanced.
    '''
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "F1_micro": f1_score(labels, predictions, average='micro'),
        "F1_macro": f1_score(labels, predictions, average='macro'),
        "F1_weighted": f1_score(labels, predictions, average='weighted'),
        "Precision_macro": precision_score(labels, predictions, average='macro'),
        "Recall_macro": recall_score(labels, predictions, average='macro'),
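        # Note: computed from hard predictions (not probabilities), so for two classes this equals macro recall
        "ROC_macro": roc_auc_score(labels, predictions, average='macro'),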
        "Accuracy": accuracy_score(labels, predictions),
    }
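
To see why macro averaging matters under class imbalance, the following library-free sketch recomputes micro and macro F1 on a toy example. The labels are invented for illustration and the helpers are simplified stand-ins, not scikit-learn's implementation.

```python
def f1_for_class(labels, preds, cls):
    """F1 of a single class, treating it as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
    fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
    fn = sum(1 for y, p in zip(labels, preds) if y == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(labels) | set(preds))
    return sum(f1_for_class(labels, preds, c) for c in classes) / len(classes)

def micro_f1(labels, preds):
    """For single-label classification, micro-F1 equals accuracy."""
    return sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)

# Invented toy data: 9 negatives, 1 positive, and a model that always predicts 0
labels = [0] * 9 + [1]
preds = [0] * 10

print(round(micro_f1(labels, preds), 4))  # 0.9
print(round(macro_f1(labels, preds), 4))  # 0.4737
```

The degenerate classifier looks strong under micro-F1 but is penalised by macro-F1, because the missed minority class counts as much as the majority class.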

In [11]:
def fine_tune(hyperparameters, dataset):
  ''' Fine-tune a pretrained model with the Hugging Face Trainer API.
      Returns the final evaluation results and the run name.
  '''
  # Model's and logs directory
  RUN_NAME = f"Run - {datetime.datetime.now().strftime('%m-%d-%H-%M')}"
  DIR = f"./LoRA/Results/{RUN_NAME}"
  os.environ["WANDB_PROJECT"] = 'NLU-CWK' # set the wandb project where this run will be logged

  # Define training args
  training_args = TrainingArguments(
    run_name=RUN_NAME.replace(' ', '_'),
    output_dir=f"{DIR}/checkpoints",

    # Parameters
    per_device_train_batch_size = hyperparameters["batch_size"],
    per_device_eval_batch_size = hyperparameters["batch_size"],
    gradient_accumulation_steps = hyperparameters["gradient_accumulation_steps"],
    learning_rate = hyperparameters["learning_rate"],
    weight_decay = hyperparameters['weight_decay'],
    num_train_epochs = 10,  # Use early stopping (so this is maximum epochs)
    fp16 = True,            # Use 16-bit (mixed) precision instead of 32-bit (ONLY POSSIBLE ON CUDA!)
    optim = "adamw_torch",
    
    use_cpu=False,
    seed=42,                # Use a seed for reproducibility

    # Logging
    logging_dir=f"{DIR}/training_logs",
    logging_strategy="epoch",
    report_to="wandb",
    # Saving
    save_strategy="epoch",
    # Evaluating (Use validation loss for model selection and early stopping)
    evaluation_strategy="epoch",
    save_safetensors=True, save_total_limit=1, load_best_model_at_end=True, 
    metric_for_best_model="eval_loss", greater_is_better=False,
  )

  # Create a Trainer instance
  trainer = Trainer(
    model_init=model_init_lora,
    args=training_args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2, early_stopping_threshold=0.0)],
  )

  # Fine-tune the model
  trainer.train()

  # Evaluate the final model
  evaluation_results = trainer.evaluate()

  return evaluation_results, RUN_NAME
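
The early-stopping rule configured above (patience of 2 on the validation loss) can be sketched roughly as follows. This is a simplified stand-in for the callback's behaviour, and the loss values are invented for the example.

```python
def early_stopping_epoch(eval_losses, patience=2, threshold=0.0):
    """Return the 1-based epoch at which training would stop.

    An epoch counts as an improvement only if its loss is lower than the best
    loss so far by more than `threshold`; training stops after `patience`
    consecutive epochs without improvement.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best - threshold:
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(eval_losses)  # ran through all epochs without triggering a stop

# Invented validation losses: the best value is reached at epoch 2
print(early_stopping_epoch([0.40, 0.30, 0.31, 0.32, 0.20]))  # 4
```

Training would stop after epoch 4 here, so the later, lower loss is never reached; with `load_best_model_at_end=True` the epoch-2 checkpoint would be the one restored.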

## Hyperparameter tuning & Training

In [12]:
# Hyperparameter values recommended for BERT and RoBERTa
HYPERPARAMETERS = {
    'weight_decay': [0.01, 0.02], 
    'learning_rate': [5e-5, 3e-5, 2e-5], 
    'gradient_accumulation_steps': [1, 2, 3], 
    'batch_size': [8], # effective batch sizes of 8, 16, 24 (with gradient accumulation)
}

# Get all possible combinations of hyperparameter sets (for grid search)
keys, values = zip(*HYPERPARAMETERS.items())
HYPERPARAMETERS_COMB = [dict(zip(keys, v)) for v in itertools.product(*values)] # length = 18
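
As a quick check of the grid, the sketch below rebuilds the same combinations and derives the effective batch size of each one (per-device batch size times gradient accumulation steps). It assumes a single device; the variable names are chosen for the example.

```python
import itertools

grid = {
    'weight_decay': [0.01, 0.02],
    'learning_rate': [5e-5, 3e-5, 2e-5],
    'gradient_accumulation_steps': [1, 2, 3],
    'batch_size': [8],
}

# Cartesian product of all value lists, one dict per combination
keys, values = zip(*grid.items())
combos = [dict(zip(keys, v)) for v in itertools.product(*values)]

# Effective batch size seen by each optimizer step (single-device assumption)
effective = sorted({c['batch_size'] * c['gradient_accumulation_steps'] for c in combos})

print(len(combos), effective)  # 18 [8, 16, 24]
```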

In [None]:
# Print and write logs during fine-tuning
os.makedirs('LoRA', exist_ok=True)  # make sure the log directory exists
with open('LoRA/training_logs.txt', 'a+') as log:

    # Fine-tune on each hyperparameter set (grid search)
    for HYPERPARAM in HYPERPARAMETERS_COMB: # already trained one
        print('--- STARTING FINE-TUNING ---')
        print('Hyperparameters:', HYPERPARAM)
        evaluation_results, run_name = fine_tune(hyperparameters=HYPERPARAM, dataset=tokenized_dataset)
        print('Run name:', run_name)
        print('Final evaluation results:', evaluation_results, '\n')
        
        log.write(f"Run name: {run_name}\n")
        log.write(f"Hyperparameters:\n{', '.join([f'{key}={value}' for key, value in HYPERPARAM.items()])}.\n")
        log.write(f"Final evaluation results: {', '.join([f'{key}={value}' for key, value in evaluation_results.items()])}\n\n")

## Evaluation of best model

In [3]:
# Select best model by hand based on validation loss and macro-F1 score
# Best model: Run - 04-10-14-59
best_model_dir = FT_MODEL_PATH
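
The manual selection could also be scripted. The sketch below ranks runs by validation loss and breaks ties with macro-F1; the run names and metric values are placeholders invented for the example, not results from this notebook.

```python
def rank_runs(results):
    """Sort (run_name, metrics) pairs: lowest eval_loss first, ties broken by highest macro-F1."""
    return sorted(
        results.items(),
        key=lambda item: (item[1]["eval_loss"], -item[1]["eval_F1_macro"]),
    )

# Placeholder values only (hypothetical runs)
placeholder = {
    "run_a": {"eval_loss": 0.30, "eval_F1_macro": 0.88},
    "run_b": {"eval_loss": 0.25, "eval_F1_macro": 0.90},
    "run_c": {"eval_loss": 0.25, "eval_F1_macro": 0.91},
}

best_name, best_metrics = rank_runs(placeholder)[0]
print(best_name)  # run_c
```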

In [None]:
# Load pretrained model
pretrained_model = RobertaForSequenceClassification.from_pretrained(
    PT_MODEL_NAME, num_labels=2, device_map='cuda' if torch.cuda.is_available() else 'cpu',
)

# Add-in trained LoRA layers
model = PeftModel.from_pretrained(
    pretrained_model, best_model_dir, device_map='cuda' if torch.cuda.is_available() else 'cpu',
)

In [17]:
# Prepare the trained model and compute_metrics function to evaluate the model
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Evaluate the model on the validation set
eval_results = trainer.evaluate(tokenized_dataset['validation'])
eval_results

{'eval_loss': 0.22703786194324493,
 'eval_F1_micro': 0.9159863440700609,
 'eval_F1_macro': 0.9158481606898208,
 'eval_F1_weighted': 0.9159590110937497,
 'eval_Precision_macro': 0.9161779177848881,
 'eval_Recall_macro': 0.9156366383815087,
 'eval_ROC_macro': 0.9156366383815085,
 'eval_Accuracy': 0.9159863440700609,
 'eval_runtime': 154.0185,
 'eval_samples_per_second': 43.741,
 'eval_steps_per_second': 5.473}