# Implementing Transformer Models
## Practical XI
Carel van Niekerk & Hsien-Chin Lin

13-17.01.2025

---

In this practical we will implement GPU training and mixed precision training for a simple entailment model. We will use the [Hugging Face Transformers](https://huggingface.co/transformers/) library to implement the model and [Datasets](https://huggingface.co/docs/datasets/) to load the data. The data is the [QNLI](https://huggingface.co/datasets/viewer/?dataset=glue&config=qnli) dataset from the [GLUE](https://huggingface.co/datasets/glue) benchmark. QNLI is a binary entailment task: given a question and a sentence, the model must predict whether the sentence contains the answer to the question. For example, the sentence 'The dog is playing with a ball' answers the question 'Is the dog playing with a ball?', so the pair is labelled as entailment.

#### 1. Getting Started

We will use the Google Colab environment for this practical. To get started, download this notebook and upload it to a Google Colab session. You will also need to change the runtime type to a 'Python 3: T4 GPU' session. This can be done using the 'Change runtime type' option in the 'Runtime' dropdown menu. Once this is done, you can run the following code to install and import the required libraries.

In [None]:
!pip install datasets

In [5]:
import torch
from transformers import (RobertaForSequenceClassification, RobertaTokenizer,
                          get_linear_schedule_with_warmup)
from torch.utils.data import DataLoader
from datasets import load_dataset
import tqdm
from sklearn.metrics import f1_score

#### 2. Dataset and Model

We will fine-tune a RoBERTa sequence classification model (`RobertaForSequenceClassification`) on the QNLI task from GLUE. This is a simple binary classification task.

##### 2.1. Initialising the model and tokenizer

Here we initialise the Roberta model and its accompanying tokenizer.

In [16]:
# Configuration

MODEL_NAME_OR_PATH = 'roberta-base'
MAX_INPUT_LENGTH = 256
BATCH_SIZE = 16
TRAINING_EPOCHS = 2
WEIGHT_DECAY = 0.01
LEARNING_RATE = 2e-5
WARMUP_PROPORTION = 0.1
MAX_GRAD_NORM = 1.0
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MIXED_PRECISION_TRAINING = torch.cuda.is_available()

In [None]:
model = RobertaForSequenceClassification.from_pretrained(MODEL_NAME_OR_PATH)
tokenizer = RobertaTokenizer.from_pretrained(MODEL_NAME_OR_PATH)

##### 2.2. Loading and preparing the dataset

We will load the dataset from the datasets library and tokenize all observations to prepare them for model training and evaluation.
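
As a rough illustration of what `padding='max_length'` and truncation do to each observation, the sketch below pads or truncates a list of token ids to a fixed length and builds the matching attention mask. The helper `pad_or_truncate`, the toy token ids and the default `pad_id=1` are assumptions made for this example. The real tokenizer also inserts special tokens and, with `truncation='longest_first'`, removes tokens from the longer of the two input sequences rather than simply cutting the end.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (input_ids, attention_mask), both of length max_length.

    Hypothetical helper for illustration only; pad_id=1 is an assumed value.
    """
    kept = token_ids[:max_length]          # simple right truncation
    n_pad = max_length - len(kept)         # number of padding positions
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A short sequence is padded, a long one is truncated (toy token ids)
short_ids, short_mask = pad_or_truncate([0, 713, 16, 2], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_length=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Positions where the mask is 0 are padding and are ignored by the model's attention.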

In [None]:
qnli_dataset = load_dataset('glue', 'qnli')

In [8]:
def convert_example_to_features(example: dict) -> dict:
    """
    Convert example to features.
    
    Args:
        example (dict): Example from the QNLI dataset.
    Returns:
        features (dict): Features for the example.
    """
    features = tokenizer(example['question'], example['sentence'], max_length=MAX_INPUT_LENGTH,
                         padding='max_length', truncation='longest_first')

    features['labels'] = example['label']

    return features

def collate(batch: list) -> dict:
    """
    Function to collate the batch.
    
    Args:
        batch (list): List of examples from the QNLI dataset.
    Returns:
        features (dict): Features for the batch.
    """
    features = {
        'input_ids': torch.tensor([itm['input_ids'] for itm in batch]),
        'attention_mask': torch.tensor([itm['attention_mask'] for itm in batch]),
        'labels': torch.tensor([itm['labels'] for itm in batch]),
    }

    return features

In [None]:
# Apply tokenization to the datasets
train_dataset = qnli_dataset['train'].map(convert_example_to_features)
validation_dataset = qnli_dataset['validation'].map(convert_example_to_features)

# Create dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, collate_fn = collate)
validation_dataloader = DataLoader(validation_dataset, batch_size=BATCH_SIZE, collate_fn = collate)

#### 3. Training the Model

To train the model we initialise the optimiser and learning rate scheduler.
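
The scheduler scales the base learning rate by a factor that rises linearly from 0 to 1 during warmup and then decays linearly back to 0. The following is a minimal sketch of that factor, assuming this piecewise-linear shape. `linear_warmup_decay` is a hypothetical helper written for illustration, not the library's implementation, and the step counts are toy values.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier in [0, 1] at a given optimiser step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 over the warmup steps
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5            # same value as LEARNING_RATE in the configuration
total, warmup = 1000, 100  # toy step counts (assumed for the example)
for step in (0, 50, 100, 550, 1000):
    factor = linear_warmup_decay(step, warmup, total)
    print(step, factor, base_lr * factor)
```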

In [10]:
# Exercise 2: Update the initialisation to incorporate GPU training

# Specify the weight decay for each parameter
no_decay = ["bias", "LayerNorm.weight"]
optimizer_grouped_parameters = [
    {
        "params": [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)],
        "weight_decay": WEIGHT_DECAY,
        "lr": LEARNING_RATE
    },
    {
        "params": [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)],
        "weight_decay": 0.0,
        "lr": LEARNING_RATE
    },
]

# Initialise the optimizer
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=LEARNING_RATE)
# Initialise the learning rate scheduler
num_training_steps = len(train_dataloader) * TRAINING_EPOCHS
num_warmup_steps = int(WARMUP_PROPORTION * num_training_steps)  # step counts should be integers
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps,
                                               num_training_steps=num_training_steps)

In [None]:
# Exercise 3: Initialise the scaler for mixed precision training
scaler = torch.amp.GradScaler(DEVICE, enabled=MIXED_PRECISION_TRAINING)
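
The scaler exists because float16 cannot represent very small values: numbers much smaller than about 6e-8 (the smallest positive float16 value) are rounded to zero. The sketch below illustrates the idea with the standard library only, using the half-precision format of `struct` to mimic float16 rounding. The helper `to_float16`, the gradient value and the scale factor of 65536 are assumptions chosen for the example. `GradScaler` additionally adjusts its scale during training, which is not modelled here.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack('e', struct.pack('e', x))[0]


tiny_grad = 1e-8   # toy gradient value, too small for float16
scale = 65536.0    # assumed scale factor (2**16)

# Without scaling the gradient underflows to zero in float16
unscaled = to_float16(tiny_grad)

# With scaling the value is representable; unscale afterwards in higher precision
scaled = to_float16(tiny_grad * scale)
recovered = scaled / scale

print(unscaled, recovered)
```

Here `unscaled` is 0.0 because the gradient underflows, while `recovered` stays close to the original value because the scaled gradient fits into float16 and is divided by the scale afterwards in higher precision.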

##### 3.1. Training and evaluating the model

Here we implement a training step and evaluation function for updating and evaluating the model.
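
With mixed precision, the gradients produced by the backward pass are still multiplied by the scale factor, so they should be unscaled before clipping. Otherwise the clipping threshold is compared against inflated norms. The sketch below shows clipping by global norm on plain Python lists. `clip_by_global_norm` is a hypothetical helper that follows the same idea as `torch.nn.utils.clip_grad_norm_`, and the gradient values and scale factor are toy assumptions.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads], total_norm


true_grads = [0.3, 0.4]   # L2 norm 0.5, below the threshold of 1.0
scale = 65536.0           # assumed loss scale
scaled_grads = [g * scale for g in true_grads]

# Correct order: unscale first, then clip. Nothing is clipped.
unscaled = [g / scale for g in scaled_grads]
clipped_ok, norm_ok = clip_by_global_norm(unscaled, max_norm=1.0)

# Wrong order: clipping the scaled gradients shrinks them drastically.
clipped_bad, norm_bad = clip_by_global_norm(scaled_grads, max_norm=1.0)
clipped_bad = [g / scale for g in clipped_bad]

print(norm_ok, clipped_ok)
print(norm_bad, clipped_bad)
```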

In [13]:
# Exercise 2: Update the functions to incorporate GPU training
# Exercise 3: Update the functions to incorporate mixed precision training

def training_step(batch):
    """
    Function to perform a training step.
    
    Args:
        batch (dict): Batch of data.
    Returns:
        loss (torch.Tensor): Loss for the batch.
    """
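    # Move the batch to the same device as the model
    batch = {k: v.to(DEVICE) for k, v in batch.items()}

    optimizer.zero_grad()
    # Forward pass under autocast (float16 where safe when mixed precision is enabled)
    with torch.autocast(DEVICE, dtype=torch.float16, enabled=MIXED_PRECISION_TRAINING):
        loss = model(**batch).loss

    # Scale the loss before the backward pass to avoid float16 gradient underflow
    scaler.scale(loss).backward()
    # Unscale the gradients so that clipping acts on their true magnitudes
    scaler.unscale_(optimizer)
    # Clip gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)

    # Update step
    scaler.step(optimizer)
    scaler.update()
    lr_scheduler.step()
    model.zero_grad()

    return loss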

def evaluate(dataloader):
    """
    Function to evaluate the model.
    
    Args:
        dataloader (torch.utils.data.DataLoader): Dataloader for the data.
    Returns:
        f1 (float): F1 Score for the model.
    """
    # Set model to evaluation mode
    model.eval()
    predictions = list()
    labels = list()

    for batch in tqdm.tqdm(dataloader, desc="Eval"):
        # Move the batch to the same device as the model
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        # Forward pass data
        with torch.no_grad(), torch.autocast(DEVICE, dtype=torch.float16, enabled=MIXED_PRECISION_TRAINING):
            logits = model(**batch).logits.detach().cpu()
        pred = logits.argmax(-1)

        predictions.append(pred.reshape(-1))
        labels.append(batch['labels'].cpu().reshape(-1))

    # Reset model to training mode
    model.zero_grad()
    model.train()

    # Compute the F1 Score
    predictions = torch.concat(predictions, 0)
    labels = torch.concat(labels, 0)
    f1 = f1_score(labels, predictions)

    return f1
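
For reference, the F1 score returned by `evaluate` is the harmonic mean of precision and recall for the positive class, which `f1_score` takes to be label 1 by default. The following sketch computes it by hand on toy labels. `binary_f1` and the example values are illustrative assumptions and not part of the practical's code.

```python
def binary_f1(labels, predictions, positive=1):
    """F1 score for the positive class, computed from raw counts."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: define F1 as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_labels = [1, 0, 1, 1, 0]
toy_predictions = [1, 0, 0, 1, 1]
toy_f1 = binary_f1(toy_labels, toy_predictions)  # tp=2, fp=1, fn=1
print(toy_f1)
```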


In [None]:
# Training the model

# Prepare model for training
model.train()
model.to(DEVICE)
model.zero_grad()
optimizer.zero_grad()
for e in range(TRAINING_EPOCHS):
    iterator = tqdm.tqdm(train_dataloader, desc=f"Epoch {e+1}/{TRAINING_EPOCHS}")
    # Perform an epoch of training
    for batch in iterator:
        loss = training_step(batch)
        iterator.set_postfix({'Loss': loss.item()})
    # Evaluate the model and report F1 Score
    f1 = evaluate(validation_dataloader)
    print(f"Validation F1 Score: {f1}")

# Exercises

1. Study the code used for training this entailment Roberta model.
2. Make the necessary changes to train this model on the GPU rather than the CPU.
3. Make the necessary changes to train this model using mixed precision training (see the [documentation](https://pytorch.org/blog/accelerating-training-on-nvidia-gpus-with-pytorch-automatic-mixed-precision/) for more details).
4. Discuss the differences in training time and model performance for the three approaches (CPU, GPU and GPU Mixed precision).
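
For Exercise 4, one possible way to compare training times is to wrap each run in a small timer. The sketch below is a hypothetical helper, not part of the practical: `timed` and its `sync` argument are names introduced here. CUDA operations run asynchronously, so when timing on the GPU it is worth passing `sync=torch.cuda.synchronize` so that pending work is included in the measurement. The toy workload below runs on the CPU only.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    if sync is not None:
        sync()  # wait for previously queued work before starting the clock
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for work queued inside the block
        results[label] = time.perf_counter() - start


results = {}
with timed("toy", results):
    total = sum(i * i for i in range(100_000))  # stand-in for a training run

print(results)
```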