# Processing The Data

To access a dataset we will use the `datasets` library. For this example we will use the *MRPC* (*Microsoft Research Paraphrase Corpus*) dataset, which consists of 5,801 pairs of sentences, each with a label indicating whether the two sentences are paraphrases (i.e., whether they mean the same thing). It is one of the 10 datasets composing the *GLUE* benchmark, an academic benchmark used to measure the performance of ML models across 10 different text classification tasks. We can use this dataset to fine-tune a *BERT* model (`bert-base-uncased`) to classify paraphrases.

In [None]:
from datasets import load_dataset

# download the mrpc dataset
raw_datasets = load_dataset(
    path="glue",
    name="mrpc"
)

In [None]:
raw_datasets

As we can see, we get a `DatasetDict` object containing three datasets: one for training, one for validation and one for testing. Each of them has the columns '*sentence1*', '*sentence2*', '*label*' and '*idx*', and there are *3668* rows in the training dataset, *408* rows in the validation dataset and *1725* rows in the test dataset.

Now to access any of the data:

In [None]:
raw_train_dataset = raw_datasets['train']
raw_train_dataset[0]

We see that the label column has an integer value of 1. To see what it corresponds to:

In [None]:
raw_train_dataset.features

As we can see, the label column is of type *ClassLabel*: **0** corresponds to **not_equivalent** and **1** corresponds to **equivalent**.

In [None]:
from pprint import pprint

print("Data at index 15 in the training dataset:")
pprint(raw_train_dataset[15])
print("\nData at index 800:")
pprint(raw_train_dataset[800])

> Note: As we can see above, the index used to access the data (e.g., `raw_train_dataset[15]`) is not always the same as the *idx* value. This is probably because of how the original dataset was split into train/validation/test sets.

# Preprocessing The Data

Each example in our data is a pair of sequences that the model has to process together for classification. This means the tokeniser has to convert the two sequences into tokens as a pair. Conveniently, the tokeniser does this for us; we simply have to pass both sequences together:

In [None]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

inputs = tokenizer("This is the first sentence.", "This is the second one.")
pprint(inputs)

This time the tokeniser returns an additional feature, *token_type_ids*, which tells the model which part of the input is the first sentence and which is the second. If we decode the IDs back into tokens, we can line them up with their token type IDs:

In [None]:
print(tokenizer.convert_ids_to_tokens(inputs['input_ids']))

```python
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]
```
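
To make the alignment above concrete, here is a minimal pure-Python sketch of how such a pair could be assembled. `build_pair_inputs` is a hypothetical helper written for illustration, not part of the tokeniser's API; it only mirrors the `[CLS] A [SEP] B [SEP]` layout shown in the output above.

```python
def build_pair_inputs(tokens_a, tokens_b):
    """Join two token lists BERT-style and mark the segment of every token."""
    first = ["[CLS]"] + tokens_a + ["[SEP]"]   # segment 0, including its [SEP]
    second = tokens_b + ["[SEP]"]              # segment 1, including the final [SEP]
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_inputs(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)

# Print every token next to the segment it belongs to.
for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:>10} -> {type_id}")
```

The first `[SEP]` belongs to segment 0 and the last one to segment 1, which matches the two rows printed above.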

But as we saw earlier, the training dataset alone has 3668 examples. Even though that is not really big, passing a whole dataset directly to the tokeniser is not good practice, because everything has to be held in memory at once, which can easily cause a *RAM Out-Of-Memory* issue on larger datasets. In addition, passing only the sequences to the tokeniser returns only the `input_ids`, `attention_mask`, and `token_type_ids` as inputs for the model, so we would lose other important information from our original dataset, such as `label`. Therefore, we will use the built-in `map()` method of `datasets`. The `map()` method works by applying a function to each element of the dataset, so let’s define a function that tokenises our inputs:

In [None]:
def tokeniser_function(data):
    return tokenizer(data['sentence1'], data["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. It also works when the dictionary contains several samples (each key mapping to a list of sentences), because the tokeniser accepts lists of sentence pairs. This allows us to use the option `batched=True` in our call to `map()`, which passes multiple elements of the dataset to `tokeniser_function()` at once rather than one element at a time. This greatly speeds up the tokenisation, because this tokeniser is very fast, but only if we give it lots of inputs at once.
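
As a rough illustration of what a batched function receives, the sketch below converts a few rows into one dictionary of lists and applies a function to the whole batch. Both `to_batch` and `count_words` are hypothetical helpers (the latter is a stand-in for the tokeniser), and the sentences are made up; the real `map()` does considerably more.

```python
def to_batch(rows):
    """Turn a list of row dicts into one dict of lists (the batched shape)."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def count_words(batch):
    """Hypothetical stand-in for the tokeniser: one output value per input row."""
    return {
        "n_words": [
            len(first.split()) + len(second.split())
            for first, second in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


rows = [
    {"sentence1": "A cat sat.", "sentence2": "A cat was sitting."},
    {"sentence1": "It rained.", "sentence2": "The sun was out."},
]

batch = to_batch(rows)        # {"sentence1": [...], "sentence2": [...]}
result = count_words(batch)   # called once for the whole batch
print(result)
```

The function is called once per batch instead of once per row, which is why a tokeniser that is efficient on lists benefits from `batched=True`.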

Furthermore, we can also see that we've left the `padding` parameter out in our tokenisation function for now. This is because padding all the samples to the maximum length is not efficient: it’s better to pad the samples when we’re building a batch, as then we only need to pad to the maximum length in that batch, and not the maximum length in the entire dataset. This can save a lot of time and processing power when the inputs have very variable lengths!

In [None]:
tokenizer_datasets = raw_datasets.map(tokeniser_function, batched=True)
tokenizer_datasets

As we can see, we haven't lost any columns from our data, and new columns have been added. (Note that we could also have changed existing fields if our preprocessing function returned a new value for an existing key in the dataset to which we applied `map()`.)

The processing was also quick. You can make preprocessing even faster by passing the `num_proc` argument to `map()`, which enables multiprocessing. However, since the fast tokenisers from the Tokenizers library already use multiple threads, it is not needed here.

# Dynamic padding
What we need to do now is pad all the examples to the length of the longest element when we batch them together, before passing the input to the model. This technique is referred to as **dynamic padding**.

Although this way of padding speeds things up on a CPU or GPU, that is not always the case on an accelerator like a TPU, because TPUs prefer fixed shapes, even when that requires extra padding.
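
To get a feel for the saving, the following sketch counts how many token slots are processed when every sample is padded to one global maximum versus the maximum of its own batch. The sequence lengths and the batch size are made-up illustrative values, and `padded_tokens` is a hypothetical helper.

```python
def padded_tokens(lengths, batch_size, pad_to=None):
    """Total token slots after padding, batch by batch.

    If `pad_to` is given, every sample is padded to that fixed length;
    otherwise each batch is padded to its own longest sample.
    """
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = pad_to if pad_to is not None else max(batch)
        total += target * len(batch)
    return total


lengths = [50, 59, 47, 67, 59, 50, 62, 32]  # made-up sequence lengths

fixed = padded_tokens(lengths, batch_size=4, pad_to=max(lengths))
dynamic = padded_tokens(lengths, batch_size=4)
print(f"fixed padding:   {fixed} token slots")
print(f"dynamic padding: {dynamic} token slots")
```

The gap grows with the spread of sequence lengths in the dataset; with very uniform lengths the two strategies cost about the same.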

!["Dynamic Padding"](data/chapter_3/dynamic_padding.png "Dynamic Padding")

Here we will use a function that is responsible for putting samples together inside a batch, also known as a **collate function**. The *collate function* will also apply the correct amount of padding to the items of the dataset we want to batch together. Fortunately, the Transformers library provides us with such a function via `DataCollatorWithPadding`. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding to be on the left or on the right of the inputs) and will do everything you need.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Let's try it on a subset of our dataset and treat it as a single batch:

In [None]:
samples = tokenizer_datasets["train"][:10]
# filter out the unnecessary columns
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

Here we can see that the sequences inside this batch have different lengths. To pad them all to the maximum length inside this particular batch (which is 67), we simply need to pass the samples to the `data_collator`:

In [None]:
batch = data_collator(samples)
pprint({k: v.shape for k, v in batch.items()})
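
Conceptually, the padding step of a collate function can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions (right-side padding, a pad token ID of 0, lists instead of tensors, illustrative input IDs); `pad_batch` is a hypothetical helper, not the implementation of `DataCollatorWithPadding`.

```python
def pad_batch(samples, pad_token_id=0):
    """Pad every sequence in `samples` to the longest one in this batch."""
    longest = max(len(ids) for ids in samples)
    input_ids, attention_mask = [], []
    for ids in samples:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (illustrative IDs).
padded = pad_batch([[101, 2023, 102], [101, 2023, 2003, 1037, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```

Every row ends up with the same length, which is what allows the batch to be stacked into a single rectangular tensor.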

### Example
Let's try to preprocess the **GLUE SST-2** dataset:
!["The GLUE SST-2 Dataset"](data/chapter_3/glue_sst2.png "The GLUE SST-2 Dataset")

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset(
    path='glue',
    name='sst2'
)
raw_datasets

As we can see, there is only one sentence per example, so the `tokenisation_function()` only needs to handle one sentence per example.

In [None]:
from transformers import AutoTokenizer

checkpoint = 'bert-base-uncased'

tokeniser = AutoTokenizer.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data['sentence'], truncation=True)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)
tokenised_datasets

Now let's clean the data by removing the unnecessary columns, renaming the *label* column to *labels*, and finally converting the data format to torch.

In [None]:
tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=['idx', 'sentence']
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name='label',
    new_column_name='labels'
)
tokenised_datasets = tokenised_datasets.with_format("torch")

Now, to apply padding per batch using the `data_collator`, we will use the `DataLoader` class from the `torch.utils.data` module, which creates batches from the given data and passes each one to the **collate function**.

In [None]:
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokeniser)
train_dataloader = DataLoader(
    tokenised_datasets['train'],
    batch_size=16,
    shuffle=True,
    collate_fn=data_collator
)

for step, batch in enumerate(train_dataloader):
    print(batch["input_ids"].shape)
    if step > 5:
        break

# Fine-tuning a model with the Trainer API

The Transformers library provides a **Trainer** API that allows us to easily fine-tune a transformer model on any dataset. It takes the *model*, the *dataset* and the *training hyperparameters*, and can run training on any available resource: *CPU*, *GPUs* or *TPUs*. It can also compute predictions and evaluate the model if metrics are provided. Furthermore, it can handle the final stages of data preprocessing for us, such as *dynamic padding*, provided the *tokeniser* or the *data collator* is given.

!["The Trainer API"](data/chapter_3/trainer.png "The Trainer API")



Now let's run an example on the **Trainer** API using the same *GLUE MRPC* dataset.

> Note: If you don’t have a GPU set up, you can get access to free GPUs or TPUs on [Google Colab](https://colab.research.google.com/), as otherwise, it will run very slowly on a CPU.


Let's first set up the preprocessing:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset('glue', 'mrpc')
checkpoint = 'bert-base-uncased'
tokeniser = AutoTokenizer.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data['sentence1'], data['sentence2'], truncation=True)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokeniser)

### Training
First we set the *hyperparameters* that the *Trainer* needs for training and evaluation. We use the **TrainingArguments** class from *transformers* to define them all. For this example, we only need to provide the directory where the model, as well as the checkpoints along the way, will be saved. For everything else, you can leave the defaults, which should work pretty well for basic fine-tuning. <br />
Furthermore, if you want to automatically upload your model to the Hub during training, pass `push_to_hub=True` to `TrainingArguments`.

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='data/chapter_3/model/'
)

Now we will define our model. However, this time when we tell the **AutoModelForSequenceClassification** class that we want the '*bert-base-uncased*' checkpoint, we will get a *warning*. This is because *bert-base-uncased* was not pretrained on sequence classification, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Now we pass everything we have defined so far to the `Trainer`: the `model`, the `training_args`, `tokenised_datasets["train"]`, `tokenised_datasets["validation"]`, the `data_collator`, and the `tokeniser` as the `processing_class` parameter (which simply tells the *Trainer* which tokeniser to use for processing).
> Note: When you pass a tokenizer as the processing_class, the default data_collator used by the Trainer will be a DataCollatorWithPadding. You can skip the data_collator=data_collator line in this case.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_datasets['train'],
    eval_dataset=tokenised_datasets['validation'],
    data_collator=data_collator,
    processing_class=tokeniser
)

Now the main bit: to fine-tune the model on our dataset, we just have to call the `train()` method of our *Trainer*. However, this will only report the training loss every 500 steps and won't give any evaluation info. There are two reasons for this: when we initialised the `TrainingArguments` we didn't set `eval_strategy` to either *steps* or *epoch* (i.e., evaluate every n steps or at the end of every epoch), and we didn't give the `Trainer` a `compute_metrics()` function to compute a metric describing the performance of the model at every *evaluation stage* (this is necessary because otherwise each *evaluation stage* would only report the loss, which is not a very intuitive number).

In [None]:
trainer.train()

# Evaluation

Let's build the `compute_metrics()` function to see detailed evaluation info at the end of every *epoch*. For this to work, `compute_metrics()` needs to take an `EvalPrediction` object, which the `Trainer` builds whenever it performs an evaluation; it is a named tuple with a `predictions` field and a `label_ids` field. To get some predictions to build the function around, we can call the `predict()` method of the `Trainer`. Its output is a named tuple with three fields: `predictions`, `label_ids` and `metrics`. `predictions` contains the logits for every example, `label_ids` contains the original labels that come with the data (when they are available, which is not the case at inference time), and `metrics` is a dictionary mapping metric names to float values. For now it only contains the loss on the dataset passed plus some time metrics. Once we complete our `compute_metrics()` function and pass it to the Trainer, the `metrics` field will also contain the metrics returned by `compute_metrics()`.

In [None]:
from pprint import pprint

predictions = trainer.predict(tokenised_datasets["validation"])
print(predictions._fields)

print(predictions.predictions.shape, predictions.label_ids.shape)
pprint(predictions.metrics)

pprint(predictions.predictions[:10])

As you can see, `predictions.predictions` is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used and 2 being the number of logits predicted for each element). To transform them into predictions that we can compare to our labels, we take the index of the maximum logit along the last axis (`axis=-1`), i.e., across the 2 logits of each element. Taking the maximum directly works here because this is a simple single-label classification task; applying a softmax first would turn the logits into probabilities but would not change which index is largest.

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
preds[:10]
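
The claim that a softmax does not change the winning index can be checked with a small pure-Python sketch. The logits below are made up, and `softmax` and `argmax` are hypothetical helpers written out by hand for illustration.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in `row`."""
    return max(range(len(row)), key=row.__getitem__)


logits = [[-2.1, 1.9], [0.7, -0.4], [-0.3, 0.2]]  # made-up logits, one row per example

preds_from_logits = [argmax(row) for row in logits]
preds_from_probs = [argmax(softmax(row)) for row in logits]
print(preds_from_logits, preds_from_probs)
```

Because the softmax is monotonic, both lists are identical; the probabilities are only needed when you care about how confident the model is.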

We can now compare the original labels `predictions.label_ids` with the predicted ones from above. To evaluate the results with proper metrics, we will rely on the `Evaluate` library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:

In [None]:
import evaluate

metric = evaluate.load(path="glue", config_name="mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

Here, we can see our model has an accuracy of *85.62%* on the validation set and an F1 score of *89.54*. Those are the two metrics used to evaluate results on the *MRPC* dataset for the *GLUE* benchmark. The table in the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf) reported an F1 score of 88.9 for the base model. Note that we are using the **uncased** checkpoint (`bert-base-uncased`) here, so a cased/uncased difference cannot explain the gap; the exact numbers vary from run to run because the classification head is randomly initialised, and the paper reports results on the test set rather than the validation set.
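
For intuition about the two reported numbers, here is a hand-written sketch of accuracy and F1 for a binary task, treating label 1 as the positive class. The predictions and references are made up, and `accuracy_and_f1` is a hypothetical helper; in the notebook the values come from the `evaluate` metric, not from this function.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Accuracy and F1 for binary labels, computed from raw counts."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives

    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


scores = accuracy_and_f1(
    predictions=[1, 1, 0, 1, 0, 1],
    references=[1, 0, 0, 1, 1, 1],
)
print(scores)
```

Accuracy counts every correct prediction, while F1 balances precision and recall on the positive class, which is why the two numbers can differ noticeably when the classes are imbalanced.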

Now, let's wrap everything inside our `compute_metrics()` function:

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    pred_logits, labels = eval_preds
    preds = np.argmax(pred_logits, axis=-1)

    return metric.compute(predictions=preds, references=labels)

Let's now upgrade our `TrainingArguments` with `eval_strategy` set to **epoch** and pass our `compute_metrics()` function to the `compute_metrics` parameter of the `Trainer`. The new `model` we initialise below will then be evaluated with the metrics defined in `compute_metrics()` at the end of every epoch.
> Note: This time we won't pass `data_collator` because, as discussed previously, we are already passing the `tokeniser` as the `processing_class` parameter, meaning the `Trainer` will use `DataCollatorWithPadding` with that `tokeniser` by itself.

In [None]:
training_args = TrainingArguments('data/chapter_3/model', eval_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_datasets["train"],
    eval_dataset=tokenised_datasets["validation"],
    processing_class=tokeniser,
    compute_metrics=compute_metrics
)

trainer.train()

Furthermore, the `Trainer` comes with many built-in features that make modern deep learning best practices accessible: 

**Mixed Precision Training**: Use fp16=True in your training arguments for faster training and reduced memory usage:

```python
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    fp16=True,  # Enable mixed precision
)
```

> Mixed precision training uses 16-bit floats for forward pass and 32-bit for gradients, improving speed and reducing memory usage.
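
One reason some quantities are kept in 32 bits is that 16-bit floats carry few significant digits, so small updates can be rounded away entirely. The sketch below illustrates this with Python's `struct` module, which can round-trip a value through the IEEE half-precision format; `to_fp16` is a hypothetical helper and the numbers are illustrative only.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest 16-bit float and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


weight = 1.0
update = 1e-4  # a small weight update

# In 16-bit precision the update is smaller than the gap between
# neighbouring representable values near 1.0, so it is lost.
print(to_fp16(weight + update))

# In Python's default double precision the update survives.
print(weight + update)
```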

**Gradient Accumulation**: For effective larger batch sizes when GPU memory is limited:

```python
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
)
```
> This allows you to simulate larger batch sizes by accumulating gradients over multiple forward passes.
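
The idea can be checked on a toy problem: for a loss that is a mean over samples, averaging the gradients of 4 micro-batches of 4 samples gives the same gradient as one batch of 16. The one-parameter model and the data below are made up for illustration, and `grad_mse` is a hypothetical helper.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [float(i) for i in range(1, 17)]   # 16 made-up inputs
ys = [3.0 * x + 1.0 for x in xs]        # made-up targets
w = 0.5

# Gradient computed on the full batch of 16 samples.
full_grad = grad_mse(w, xs, ys)

# Same gradient, accumulated over 4 micro-batches of 4 samples each.
micro_batch_size = 4
accumulation_steps = 4
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro_x = xs[start:start + micro_batch_size]
    micro_y = ys[start:start + micro_batch_size]
    # Divide by the number of accumulation steps so the sum is an average.
    accumulated += grad_mse(w, micro_x, micro_y) / accumulation_steps

print(full_grad, accumulated)
```

The optimizer step is taken only after all micro-batches have contributed, so memory use stays at the micro-batch level while the update behaves like a larger batch.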

**Learning Rate Scheduling**: The Trainer uses linear decay by default, but you can customize this:

```python
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",  # Try different schedulers
)
```

>Note: Modern features like mixed precision (fp16=True) and gradient accumulation can significantly improve training efficiency

#  A Full Training Loop in PyTorch

Writing your own training loop without relying on the *Trainer API* is a valuable skill, because it lets us customise each step of the training loop to our needs and makes the loop easier to debug.

!["Full Training Loop"](data/chapter_3/full_training_loop.png "Full Training Loop")

During training, in simple words, the **model**, a giant *formula* with many adjustable knobs (the **model's weights**), is shown lots of *flash-cards* that pair an input with the correct answer. The *model* makes a guess for each *card*, and a single number called the **loss** tells the *model* how wrong that guess was. Using *clever math*, the model works out which knobs should turn and by how much (i.e., the **gradients**) to shrink the loss, and an **optimizer** performs those tiny adjustments. We then move to the next card and repeat the cycle (*guess*, measure *error*, tweak) thousands of times. Over all these loops the *model* gradually tunes its knobs so that its guesses become consistently accurate, meaning it has learned the **patterns hidden** in the training data.
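
The flash-card picture can be played out with a single knob. This is a toy sketch with made-up data and a hand-derived gradient, far simpler than a transformer, but the guess / measure / tweak cycle is the same.

```python
# Flash-cards: (input, correct answer). The hidden pattern is answer = 2 * input.
cards = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

knob = 0.0            # the single adjustable weight
learning_rate = 0.05  # how big each tweak is

for epoch in range(50):
    for x, answer in cards:
        guess = knob * x                      # 1. guess
        loss = (guess - answer) ** 2          # 2. measure how wrong the guess was
        gradient = 2 * (guess - answer) * x   # 3. which way, and how much, to turn
        knob -= learning_rate * gradient      # 4. tweak the knob

print(f"learned knob: {knob:.4f}, last loss: {loss:.2e}")
```

After enough passes over the cards, the knob settles very close to 2, the value that makes every guess correct.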


Let's first prepare the data the same way as before:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data["sentence1"], data["sentence2"], truncation=True)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokeniser)

Now let's refine the dataset by removing the unnecessary columns, renaming the '*label*' column to '*labels*', and converting the data format to *torch*, because the model expects PyTorch tensors as input:

In [None]:
tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=["sentence1", "sentence2", "idx"]
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name="label",
    new_column_name="labels"
)
tokenised_datasets.set_format("torch")
# let's see
tokenised_datasets["train"].column_names

Now let's divide the data into batches using the `DataLoader` class with our *collate* function for the right padding:
> Note: For the training dataset we set `shuffle=True`, to make sure the data gets reshuffled into different batches at every epoch. Small tweaks like this, which change the order in which the model sees the data, can noticeably improve results in machine learning.

In [None]:
from torch.utils.data import DataLoader
from pprint import pprint



train_dataloader = DataLoader(
    tokenised_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)

eval_dataloader = DataLoader(
    tokenised_datasets["validation"], batch_size=8, collate_fn=data_collator
)


for batch in train_dataloader:
    pprint({k: v.shape for k, v in batch.items()})
    break

Let's now initialise our *model*:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

For testing let's pass a batch to the model:

> Note: All Transformers models will return the loss when labels are provided, and we also get the logits (two for each input in our batch, so a tensor of size 8 x 2).

In [None]:
for batch in train_dataloader:
    outputs = model(**batch)
    break

print(outputs.loss, outputs.logits.shape)
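
For a classification head like this one, the loss returned when integer labels are provided is a cross-entropy between the logits and the labels. The sketch below computes that quantity by hand for made-up logits; `cross_entropy` is a hypothetical helper for illustration, not the code path the model actually runs.

```python
import math


def cross_entropy(logits, labels):
    """Mean negative log-probability of the correct class."""
    losses = []
    for row, label in zip(logits, labels):
        shift = max(row)  # for numerical stability
        log_sum_exp = shift + math.log(sum(math.exp(x - shift) for x in row))
        # -log softmax(row)[label] == log_sum_exp - row[label]
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Example 1: the model is undecided. Example 2: it is confident and correct.
loss = cross_entropy(logits=[[0.0, 0.0], [2.0, -1.0]], labels=[1, 0])
print(f"mean loss: {loss:.4f}")
```

An undecided prediction over two classes costs `log(2)`, roughly 0.69, so a loss near that value at the start of training is what you would expect from a freshly initialised head.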

Now, let's define an *optimiser*. We will use **AdamW**, which is also the default *optimiser* used by the *Trainer API*:

In [None]:
from torch.optim import AdamW

optimiser = AdamW(
    params=model.parameters(),
    lr=5e-5
)

As we can see, we have passed two parameters to the *AdamW* optimiser: `params`, the model’s weights (the “knobs”) that the optimizer will adjust, and `lr`, the learning rate, which sets how big each step is when we move the weights in the direction suggested by the gradient.

The learning rate controls how big a jump the optimizer makes when it updates the model’s weights, and the ideal jump size changes during training. Early on, large jumps help the model move quickly toward a good region of solutions, but once it gets close, those same large jumps can cause it to overshoot or wobble around the optimum. Conversely, a small fixed learning rate might be safe near the end but would make early progress painfully slow. To overcome this, we use a *learning rate scheduler*, which starts with a higher learning rate for fast initial learning and then gradually lowers it so the optimizer can take finer, steadier steps as it zeroes in on the best weights. We will use `get_scheduler` from `transformers` with a *linear decay* (i.e., a straight-line decay from `5e-5` to 0). To define it properly, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (the length of our training dataloader). The Trainer uses three epochs by default, so we will follow that:

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimiser,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

print(num_training_steps)
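
The step count and the shape of the linear schedule can be worked out by hand. The sketch below assumes the 3668 training rows and batch size of 8 used above; `linear_lr` is a hypothetical helper that mimics a linear schedule with optional warmup, not the scheduler object returned by `get_scheduler`.

```python
import math


def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` for a linear warmup followed by linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


train_rows, batch_size, epochs = 3668, 8, 3
steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is smaller
total_steps = epochs * steps_per_epoch
print(steps_per_epoch, total_steps)

# Learning rate at the start, in the middle and at the end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, base_lr=5e-5, num_training_steps=total_steps))
```

The rate starts at `5e-5`, falls on a straight line, and reaches 0 exactly at the last step.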

### The training loop

One last thing: we will want to use the GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes and I would suggest using *Google Colab* for the free GPU use there). To do this, we define a device we will put our model and our batches on:


In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model.to(device)
device

Now let's write the training loop:

In [None]:
# Import a handy progress-bar
from tqdm.auto import tqdm

# Tell the progress-bar, how many total update steps we expect.
progress_bar = tqdm(range(num_training_steps))

# Put the model in *training* mode, which turns on training-only behaviour (like dropout).
model.train()

# go through the entire dataset several times (“epochs”)
for epoch in range(num_epochs):

    # fetch a batch
    for batch in train_dataloader:
        # Move every tensor in the batch onto the GPU (or CPU) we’re using.
        # `k` is the column name (e.g. "input_ids"), `v` is the tensor itself.
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass → the model makes a prediction and
        # also returns the loss (how wrong the prediction was).
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass → compute gradients (how each weight
        # should change to reduce that loss).
        loss.backward()

        # Before stepping, put a speed limit on the update: if the combined
        # size (L2 norm) of all gradients exceeds 1.0, scale them down so the
        # norm is (approximately) 1.0. This “gradient clipping” prevents an
        # out-of-control batch from giving the model an enormous, unstable shove.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Take one optimizer step:
        #   • Looks at the gradients
        #   • Nudges the weights ("knobs") in the right direction
        optimiser.step()
        
        # Update the learning-rate scheduler so the “step size”
        # gets a bit smaller as training progresses.
        lr_scheduler.step()

        # Reset gradients to zero so they don’t accumulate
        # into the next batch.
        optimiser.zero_grad()

        # Tell tqdm we’ve finished one training step
        # so it can advance the progress bar.
        progress_bar.update(1)
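
The gradient clipping step in the loop above can be illustrated without PyTorch. The sketch below rescales a made-up gradient vector so that its L2 norm does not exceed `max_norm`; `clip_by_global_norm` is a hypothetical helper, and PyTorch's own implementation differs in small details.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Scale `grads` down so their combined L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already small enough: leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, clipped, norm_after)
```

The direction of the gradient is preserved; only its length is capped, so an unusually large batch gradient cannot throw the weights far off course.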

Now let's add the evaluation loop as well to get some metrics:

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")

# Put the model in *evaluation* mode, which turns off training-only behaviour (like dropout).
model.eval()

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}

    # We’re only doing a forward pass, so turn off gradient tracking to
    # save memory and speed things up.
    with torch.no_grad():
        outputs = model(**batch)

        logits = outputs.logits
        references = batch["labels"]
        predictions = torch.argmax(logits, dim=-1)

        # Hand this batch’s predictions + references to the metric object;
        # it will store them internally until we ask for the final score.
        metric.add_batch(predictions=predictions, references=references)

# Calculate and return the overall Accuracy and F1 once all batches are processed.
metric.compute()


### Example

Let's build the training loop again, this time for the **SST-2** data, and also evaluate the model after every epoch during training (for the *SST-2* dataset, *GLUE* only provides the *accuracy* metric). This is one of the reasons training is organised in epochs: you can evaluate the model's performance after every epoch, and if the performance starts to degrade, stop training and keep the model from the best-performing epoch.
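
The "stop when it gets worse and keep the best epoch" idea can be sketched as a small selection rule. `pick_best_epoch` and its `patience` parameter are hypothetical names introduced here, and the validation scores are made up; the loop in this example only prints the metric after each epoch and does not implement this rule.

```python
def pick_best_epoch(val_scores, patience=1):
    """Index of the best epoch, stopping after `patience` epochs without improvement."""
    best_epoch = 0
    for epoch in range(1, len(val_scores)):
        if val_scores[epoch] > val_scores[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training here
    return best_epoch


val_scores = [0.88, 0.91, 0.90, 0.92]  # made-up validation accuracy per epoch

print(pick_best_epoch(val_scores, patience=1))  # stops at the first dip
print(pick_best_epoch(val_scores, patience=2))  # tolerates one bad epoch
```

A larger `patience` tolerates temporary dips at the cost of training for longer; in practice you would also save a checkpoint whenever the best score improves.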

Preparing the data

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding
from pprint import pprint


raw_datasets = load_dataset(path="glue", name="sst2")

pprint(raw_datasets)

*tokenise* the data:

In [None]:
checkpoint = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data["sentence"], truncation=True)

data_collator = DataCollatorWithPadding(tokenizer=tokeniser)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)

pprint(tokenised_datasets)

refine the data:

In [None]:
tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=["sentence", "idx"]
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name="label",
    new_column_name="labels"
)
tokenised_datasets.set_format("torch")
tokenised_datasets["train"]

Initialise the *dataloader*:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    dataset=tokenised_datasets["train"],
    batch_size=30,
    shuffle=True,
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    dataset=tokenised_datasets["validation"],
    batch_size=30,
    collate_fn=data_collator
)

for batch in train_dataloader:
    pprint({k : v.shape for k, v in batch.items()})
    break

initialise the *model*

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

for batch in train_dataloader:
    outputs = model(**batch)
    print(outputs.loss, outputs.logits.shape)
    break



set the *optimiser* and *learning rate scheduler*:

In [None]:
from torch.optim import AdamW
from transformers import get_scheduler

optimiser = AdamW(
    params=model.parameters(),
    lr=5e-5
)

num_epochs = 2
num_training_steps = num_epochs * len(train_dataloader)

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimiser,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

print(num_training_steps)

Put the *model* on the available *device*:

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model.to(device)
device

Create an *evaluation* function that will get called every time an epoch is finished:

In [None]:
import evaluate

def perform_evaluation():
    metric = evaluate.load("glue", "sst2")

    # set the model to evaluation
    model.eval()

    for batch in eval_dataloader:
        batch = {k : v.to(device) for k, v in batch.items()}

        with torch.no_grad():
            outputs = model(**batch)

            logits = outputs.logits
            refs = batch["labels"]
            preds = torch.argmax(logits, dim=-1)

            metric.add_batch(predictions=preds, references=refs)

    pprint(metric.compute())

the final *training* loop:

In [None]:
from tqdm.auto import tqdm


progress_bar = tqdm(range(num_training_steps))

model.train()

for epoch in range(num_epochs):

    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}

        outputs = model(**batch)
        loss = outputs.loss

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimiser.step()

        lr_scheduler.step()

        optimiser.zero_grad()

        progress_bar.update(1)

    perform_evaluation()
    # perform_evaluation() puts the model in eval() mode,
    # so we need to set it back to training mode
    # before the next epoch
    model.train()
    

Let's test the evaluation function; it should print the same metrics as the last line of the cell output above:

In [None]:
perform_evaluation()

# Supercharge your training loop with Accelerate

The training loop we defined earlier works fine on a single CPU or GPU. But with the `Accelerator` from the `Accelerate` library, a few adjustments are enough to enable distributed training on multiple GPUs or TPUs. When you instantiate an `Accelerator` object, it looks at your environment and initialises the proper distributed setup, so you can remove the lines that put the model on the device (or, if you prefer, change them to use `accelerator.device` instead of `device`).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to `accelerator.prepare()`. This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`.

Furthermore, during evaluation in a distributed run, each process only sees its own share of the validation data. So, just before passing the `predictions` and `references` to `metric.add_batch()` (and later computing the metric with `metric.compute()`), we need to *gather* them across processes using `accelerator.gather(torch.argmax(logits, dim=-1))` and `accelerator.gather(batch["labels"])`.
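
To see why gathering matters, here is a minimal pure-Python sketch. It makes no Accelerate calls: the nested lists stand in for the prediction shards held by two hypothetical processes, and `simulated_gather` is an illustrative helper, not a library function. The real `accelerator.gather()` works on tensors, but the idea of joining every process's results before scoring is the same.

```python
# Pure-Python illustration only: no Accelerate calls are made here.
# Each inner list is the slice of the validation set seen by one
# hypothetical process in a two-process run (made-up values).
predictions_per_process = [[1, 0, 1, 1], [0, 0, 1, 0]]
references_per_process = [[1, 0, 0, 1], [1, 1, 1, 0]]


def simulated_gather(shards):
    """Concatenate the per-process shards into one flat list (hypothetical helper)."""
    return [value for shard in shards for value in shard]


def accuracy(predictions, references):
    """Fraction of predictions that match their reference label."""
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / len(references)


all_predictions = simulated_gather(predictions_per_process)
all_references = simulated_gather(references_per_process)

# Without gathering, each process would only score its own shard.
per_process_accuracy = [
    accuracy(p, r)
    for p, r in zip(predictions_per_process, references_per_process)
]
global_accuracy = accuracy(all_predictions, all_references)

print(per_process_accuracy)
print(global_accuracy)
```

Each process on its own reports a different accuracy, while the gathered lists give a single score for the whole validation set, which is the number we actually want to log.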

> Note: If you are using *Google Colab*, in order to benefit from the speed-up offered by Cloud TPUs, we recommend padding your samples to a fixed length with the `padding="max_length"` and `max_length` arguments of the tokenizer.

Lastly, we will define the training loop inside a `training_function()` and, to start the training, run:
```python
from accelerate import notebook_launcher

notebook_launcher(training_function)
```
> Note: Putting this in a `train.py` script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command `accelerate config`, which will prompt you to answer a few questions and save your answers in a configuration file. That file is then used by the command `accelerate launch train.py`, which launches the distributed training.


So,

First, prepare the data:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from pprint import pprint
from torch.utils.data import DataLoader
from torch.optim import AdamW


raw_datasets = load_dataset(path="glue", name="mrpc")

pprint(raw_datasets)

checkpoint = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data['sentence1'], data['sentence2'], truncation=True)

data_collator = DataCollatorWithPadding(tokenizer=tokeniser)
tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)

pprint(tokenised_datasets)

tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=['sentence1', 'sentence2', 'idx']
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name='label',
    new_column_name='labels'
)
tokenised_datasets.set_format("torch")
pprint(tokenised_datasets['train'])

train_dataloader = DataLoader(
    dataset=tokenised_datasets['train'], 
    batch_size=8,
    shuffle=True,
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    dataset=tokenised_datasets['validation'],
    batch_size=8,
    collate_fn=data_collator
)

for batch in train_dataloader:
    pprint({k: v.shape for k, v in batch.items()})
    break


model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

for batch in train_dataloader:
    outputs = model(**batch)
    print(outputs.loss, outputs.logits.shape)
    break

optimiser = AdamW(
    params=model.parameters(),
    lr=5e-5
)

Then, initiate the `Accelerator` and `lr_scheduler`:

In [None]:
from accelerate import Accelerator
from transformers import get_scheduler

# initiate the Accelerator
accelerator = Accelerator()

# send the data and model to the accelerator
train_dl, eval_dl, model, optimiser = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimiser
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimiser,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)


Now, define the *evaluation* function with `accelerator.gather()` to gather predictions and references:

In [None]:
import evaluate

def perform_evaluation():
    metric = evaluate.load("glue", "mrpc")

    model.eval()

    for batch in eval_dl:
        with torch.no_grad():
            outputs = model(**batch)

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)

            metric.add_batch(
                predictions=accelerator.gather(predictions),
                references=accelerator.gather(batch["labels"])
            )

    pprint(metric.compute())

Finally, define the training loop `training_function()`:

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

def training_function():
    model.train()
    for epoch in range(num_epochs):
        for batch in train_dl:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)

            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimiser.step()
            lr_scheduler.step()
            optimiser.zero_grad()

            progress_bar.update(1)

        perform_evaluation()
        model.train()


Launch the training_function:

In [None]:
from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=1)

# Understanding Learning Curves

Now that we have learned how to fine-tune a model using both the *Trainer API* and *custom training loop*, it's crucial to understand how to interpret the results using the **Learning curves**. <br />
**Learning curves** are visual representations of our model’s performance metrics, and they help us evaluate the model's performance over time during training. They also help in identifying potential issues before they cause the model's performance to degrade. <br />
The two most important curves to monitor are:
+ *Loss curves*: Show how the model’s error (loss) changes over training steps or epochs.
+ *Accuracy curves*: Show the percentage of correct predictions over training steps or epochs.


!['The Learning curves.'](data/chapter_3/learning_curves.png "The Learning curves.")

So, these curves help us understand whether our model is learning effectively and can guide us in making adjustments to improve performance. In Transformers, these metrics are individually computed for each batch and then logged to the disk. We can then use libraries like **livelossplot**, **tensorboard**, **wandb**, etc., to visualize these curves and track our model’s performance over time.

### Loss curve
In a typical successful training run, you’ll see a loss curve with characteristics similar to those below:
+ *High initial loss*: The model starts without optimization, so predictions are initially poor.
+ *Decreasing loss*: As training progresses, the loss should generally decrease.
+ *Convergence*: Eventually, the loss stabilizes at a low value, indicating that the model has learned the patterns in the data.

!["A Healthy Loss Curve"](data/chapter_3/loss_curve.png "A Healthy Loss Curve")

Here we can see that the loss is high initially and then gradually decreases, indicating that the model is improving. A decrease in the loss value suggests that the model is making better predictions, as the loss represents the error between the predicted output and the true output.

### Accuracy curve
Unlike loss curves, accuracy curves should generally increase as the model learns and can typically include more steps than the loss curve.
+ *Start low*: Initial accuracy should be low, as the model has not yet learned the patterns in the data.
+ *Increase with training*: Accuracy should generally improve as the model learns, if it is able to learn the patterns in the data.
+ *May show plateaus*: Accuracy often increases in discrete jumps rather than smoothly, as the model makes predictions that are close to the true labels.

!["A Healthy Accuracy Curve"](data/chapter_3/accuracy_curve.png "A Healthy Accuracy Curve")

The accuracy curve begins at a low value and increases as training progresses. Accuracy measures the proportion of correctly classified instances. So, as the accuracy curve rises, it signifies that the model is making more correct predictions.

> The reason why a loss curve is much smoother than an accuracy curve is that loss is continuous while accuracy is discrete. For example, in a binary classifier distinguishing cats (0) from dogs (1), if the model predicts 0.3 for an image of a dog (true value 1), this is rounded to 0 and is an incorrect classification. If in the next step it predicts 0.4, it’s still incorrect. The loss will have decreased because 0.4 is closer to 1 than 0.3, but the accuracy remains unchanged, creating a plateau. The accuracy will only jump up when the model predicts a value greater than 0.5 that gets rounded to 1.
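
The cat/dog example above can be worked through numerically. The sketch below assumes binary cross-entropy as the loss (the note does not name a specific loss) and adds a third, hypothetical prediction of 0.6 to show the moment the accuracy finally jumps.

```python
import math

# The note's example: the true label is 1 (dog) and the model's predicted
# probability of "dog" improves over three consecutive steps.
# The third value (0.6) is an assumed follow-up step, not from the note.
true_label = 1
predicted_probabilities = [0.3, 0.4, 0.6]


def binary_cross_entropy(probability, label):
    """Loss for a single example; smaller means the prediction is closer to the label."""
    return -(label * math.log(probability) + (1 - label) * math.log(1 - probability))


def is_correct(probability, label, threshold=0.5):
    """Round the probability at the threshold and compare it with the label."""
    return int(probability > threshold) == label


losses = [binary_cross_entropy(p, true_label) for p in predicted_probabilities]
correct = [is_correct(p, true_label) for p in predicted_probabilities]

for p, loss, ok in zip(predicted_probabilities, losses, correct):
    print(f"p={p:.1f}  loss={loss:.3f}  correct={ok}")
```

The loss falls at every step, while `correct` stays `False` for the first two predictions and only flips once the probability crosses 0.5. Averaged over many examples, this is what produces a smooth loss curve next to a stepped accuracy curve.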


### Convergence

We can tell that a model has learned the patterns in the data, and can now be used to make predictions on new data, when both the loss and accuracy curves have settled at a stable level. So, **convergence** occurs when the model’s performance stabilizes and the loss and accuracy curves level off.

!["A converged curve"](data/chapter_3/converge_curve.png "A converged curve")
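
One simple way to turn "the curves level off" into a check is to look at how much the metric changed over the last few epochs. The sketch below is only an illustration: `has_converged`, its `window` and `tolerance` defaults, and the loss values are all assumptions made for this example, not recommended settings.

```python
def has_converged(values, window=3, tolerance=0.01):
    """Return True when the last `window` changes are all smaller than `tolerance`.

    `window` and `tolerance` are illustrative defaults, not recommended settings.
    """
    if len(values) < window + 1:
        return False  # not enough history to judge
    recent = values[-(window + 1):]
    changes = [abs(later - earlier) for earlier, later in zip(recent, recent[1:])]
    return all(change < tolerance for change in changes)


# Made-up per-epoch validation losses, for illustration only.
still_improving = [0.69, 0.55, 0.46, 0.40, 0.36]
levelled_off = [0.69, 0.48, 0.41, 0.405, 0.402, 0.401]

print(has_converged(still_improving))
print(has_converged(levelled_off))
```

The first series is still dropping by several hundredths per epoch, so it is not treated as converged; the second barely moves over its last three epochs, so it is.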

# Interpreting Learning Curve Patterns
Different curve shapes reveal different aspects of our model’s training, so during and after the training process we should monitor the following key indicators:

+ **During Training**
    1. **Loss convergence**: Is the loss still decreasing or has it plateaued?
    2. **Overfitting signs**: Is validation loss starting to increase while training loss decreases?
    3. **Learning rate**: Are the curves too erratic (LR too high) or too flat (LR too low)?
    4. **Stability**: Are there sudden spikes or drops that indicate problems?


+ **After Training**
    1. **Final performance**: Did the model reach acceptable performance levels?
    2. **Efficiency**: Could the same performance be achieved with fewer epochs?
    3. **Generalization**: How close are training and validation performance?
    4. **Trends**: Would additional training likely improve performance?


### Healthy Learning curves
Characteristics of healthy curves:

+ **Smooth decline in loss**: Both training and validation loss decrease steadily
+ **Close training/validation performance**: Small gap between training and validation metrics
+ **Convergence**: Curves level off, indicating the model has learned the patterns

!["Healthy Learning curves"](data/chapter_3/learning_curves.png "Healthy Learning curves")


### Overfitting
Overfitting occurs when the model learns too much from the training data and is unable to generalize to different data (represented by the validation set).

!["An Overfitted Model Learning Curve"](data/chapter_3/overfitting.png "An Overfitted Model Learning Curve")

+ **Symptoms**:
    - Training loss continues to decrease while validation loss increases or plateaus.
    - Large gap between training and validation accuracy.
    - Training accuracy much higher than validation accuracy.

+ **Solution for overfitting**:
    - **Regularisation**:  Add dropout, weight decay, or other regularisation techniques.
    - **Early stopping**: Stop training when validation performance stops improving.
    - **Data augmentation**: Increase training data diversity.
    - **Reduce Model Complexity**: Use a smaller model or fewer parameters.
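
Of these remedies, early stopping is the easiest to sketch in a few lines. The function below is a minimal illustration of the patience idea, not the implementation used by any library; `early_stopping_epoch` and the validation losses are made up for this example.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Made-up validation losses showing the overfitting pattern: they improve, then rise.
val_losses = [0.58, 0.45, 0.41, 0.44, 0.49, 0.55]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
best_epoch = val_losses.index(min(val_losses))
print(stop_epoch, best_epoch)
```

Training would stop at epoch index 4, two epochs after the best validation loss at index 2, so in practice we would also want to keep the weights saved at the best epoch rather than the last one.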


### Underfitting

Underfitting occurs when the model is too simple to capture the underlying patterns in the data. This can happen for several reasons:

+ The model is too small or lacks capacity to learn the patterns
+ The learning rate is too low, causing slow learning
+ The dataset is too small or not representative of the problem
+ The model is not properly regularized (for example, dropout or weight decay is set too high)

!["An Underfitted Model Learning Curve"](data/chapter_3/underfitting.png "An Underfitted Model Learning Curve")

+ **Symptoms**:
    - Both training and validation loss remain high.
    - Model performance plateaus early in training.
    - Training accuracy is lower than expected.

+ **Solutions for underfitting**:
    - **Increase model capacity**: Use a larger model or more parameters.
    - **Train longer**: Increase the number of epochs.
    - **Adjust learning rate**: Try different learning rates.
    - **Check data quality**: Ensure your data is properly preprocessed.

### Erratic Learning Curves
Erratic learning curves occur when the model is not learning effectively. This can happen for several reasons:
+ The learning rate is too high, causing the model to overshoot the optimal parameters.
+ The batch size is too small, causing noisy gradient estimates that change a lot from one step to the next
+ The model is not properly regularized, causing it to overfit to the training data
+ The dataset is not properly preprocessed, causing the model to learn from noise

!["Erratic Learning Curves"](data/chapter_3/erratic_learning_curves.png "Erratic Learning Curves")

+ **Symptoms**:
    - Frequent fluctuations in loss or accuracy.
    - Curves show high variance or instability.
    - Performance oscillates without clear trend

+ **Solutions for erratic curves**:
    - **Lower Learning Rate**: Reduce step size for more stable training.
    - **Increase Batch Size**: Larger batches provide more stable gradients.
    - **Gradient Clipping**: Prevent exploding gradients.
    - **Better data preprocessing**: Ensure consistent data quality.
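
Gradient clipping is the remedy our training loops already use, via `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)`. The pure-Python sketch below shows the underlying idea on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper written for this example, and PyTorch's own implementation differs in its details.

```python
import math


def clip_by_global_norm(gradients, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in gradients))
    if total_norm <= max_norm:
        return list(gradients), total_norm  # already small enough: leave untouched
    scale = max_norm / total_norm
    return [g * scale for g in gradients], total_norm


# Made-up gradient values: one exploding step and one ordinary step.
exploding = [3.0, -4.0]  # L2 norm 5.0
ordinary = [0.3, -0.4]   # L2 norm 0.5

clipped, norm_before = clip_by_global_norm(exploding, max_norm=1.0)
print(norm_before, clipped)

unchanged, _ = clip_by_global_norm(ordinary, max_norm=1.0)
print(unchanged)
```

The exploding gradient keeps its direction but is shrunk to length 1, while the ordinary one passes through unchanged, so a single bad batch can no longer throw the weights far off course.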

# Full Training Loop Example

Here we will use the **livelossplot** lib to plot the learning curve for simplicity and perform evaluation after every epoch.

Prepare the data:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding, AutoModelForSequenceClassification
from torch.utils.data import DataLoader
from pprint import pprint

raw_datasets = load_dataset(path="glue", name="mrpc")

pprint(raw_datasets)

checkpoint = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
data_collator = DataCollatorWithPadding(tokenizer=tokeniser)

def tokenisation_function(data):
    return tokeniser(data['sentence1'], data['sentence2'], truncation=True)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)

pprint(tokenised_datasets)

tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=["sentence1", "sentence2", "idx"]
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name = "label",
    new_column_name = "labels"
)
tokenised_datasets.set_format("torch")

pprint(tokenised_datasets)

Now, to make sure that our model *converges* to a good, stable performance, we need to define how the data will be fed to the model, i.e., the batch size for both the training dataset and the evaluation dataset.
+ **Training batch** must hold data + gradients in GPU memory; keeping it around 8-32 samples leaves room for the gradients and keeps some randomness (“noise”) that often helps the model generalise.

+ **Evaluation batch** only does a forward pass—no gradients—so memory use is lower; stuffing in as many samples as fit (64, 128, or all) finishes validation faster without hurting accuracy.

> **Practical rule**: start with the largest train batch that fits your GPU and still gives good validation scores (often 16 ± 8); then crank the eval batch way up because speed is the only concern there.

In [None]:
train_batch_size = 16
eval_batch_size = min(128, len(tokenised_datasets["validation"]))

train_dataloader = DataLoader(
    dataset=tokenised_datasets['train'],
    batch_size=train_batch_size,
    shuffle=True,
    collate_fn=data_collator
)

eval_dataloader = DataLoader(
    dataset=tokenised_datasets["validation"],
    batch_size=eval_batch_size,
    collate_fn=data_collator
)

print(f"So there are, {len(train_dataloader)} batches of size {train_batch_size} in the training dataset, and\n {len(eval_dataloader)} batches of size {eval_batch_size} in the evaluation dataset.")

Now that we have the batch sizes for both *training* and *validation*, we need to focus on selecting the right **hyper-parameters**:
+ **Learning-rate (2 × 10⁻⁵)** – Think of this as how big a step the optimizer takes each time it adjusts the weights; 2e-5 is small enough to avoid wild jumps yet big enough for BERT-base to learn a GLUE-sized task in a handful of epochs, and you tune it by doubling or halving if the validation curve diverges or crawls.

+ **Warm-up (10 % of steps)** – During the first 10 % of updates the learning-rate ramps smoothly from zero to its full value, preventing a sudden kick that can blow up the loss; a 5–10 % ramp is a widely used safe zone, and you shorten it on very long runs or lengthen it if training spans only a few hundred steps.

+ **Linear LR decay** – After warm-up the learning-rate is reduced a little each step until it hits zero, which makes later updates gentler and helps the model settle into a minimum; linear decay is the simplest schedule that works about as well as fancier shapes, so use it unless you have evidence cosine or one-cycle clearly beats it on your dataset.

+ **Epochs (5 passes)** – An epoch is one full sweep through the training data; around five passes (≈1,150 updates with a 16-sample batch, since MRPC's 3,668 training rows make 230 batches per epoch) are usually enough for a small set like MRPC to converge without starting to memorise, and the reliable way to pick this number is to watch the validation metric and stop once it has not improved for two epochs.
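
To make these numbers concrete, the sketch below works out the step counts for MRPC and traces the learning rate at a few points. It is a hand-written illustration of linear warm-up followed by linear decay, not code taken from `get_scheduler`; `linear_schedule_lr` is a hypothetical helper, and the calculation assumes a single device with no gradient accumulation.

```python
import math

# Values taken from this section; 3,668 is the MRPC training-set size.
num_train_rows = 3668
train_batch_size = 16
num_epochs = 5
peak_lr = 2e-5

# Round up: the last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(num_train_rows / train_batch_size)
num_training_steps = num_epochs * steps_per_epoch
num_warmup_steps = (10 * num_training_steps) // 100  # 10 % warm-up


def linear_schedule_lr(step):
    """Learning rate at `step`: linear ramp up during warm-up, then linear decay to zero."""
    if step < num_warmup_steps:
        return peak_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return peak_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


print(steps_per_epoch, num_training_steps, num_warmup_steps)
for step in (0, num_warmup_steps, num_training_steps // 2, num_training_steps):
    print(step, linear_schedule_lr(step))
```

The learning rate starts at zero, reaches its peak of 2e-5 exactly when warm-up ends, and then shrinks steadily until it is back at zero on the final step.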

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import get_scheduler

optimiser = AdamW(
    params=model.parameters(),
    lr=2e-5
)

accelerator = Accelerator()
train_dl, eval_dl, model, optimiser = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimiser
)

num_epochs = 5
num_training_steps = num_epochs * len(train_dl)

num_warmup_steps = (10 * num_training_steps)//100

lr_scheduler = get_scheduler(
    name="linear",
    optimizer=optimiser,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps
)

Now the final training loop with the evaluation process. Here, we need to make sure **gradients flow only through the parts of the code that actually teach the model**: keep them on during the forward-and-backward pass, then immediately cut them off for anything else.

+ Wrap the entire validation loop in `torch.no_grad()` because we’re just measuring, not learning—this skips gradient bookkeeping and slashes memory use.

+ After the backward pass in training, call `loss.detach().item()` and `logits.detach()` before saving them so they don’t drag the whole computation graph into your Python lists.

+ Using `no_grad()` again around metric calculations prevents PyTorch from building a second, useless graph while you tally accuracy or F1.

Doing these three things keeps GPU RAM from creeping up, speeds every batch, and guarantees that only the intended updates influence your learning curve.

In [None]:
import evaluate
import torch
from tqdm.notebook import tqdm
from livelossplot import PlotLosses

progress_bar = tqdm(range(num_training_steps))

def perform_evaluation():
    """
    Perform evaluation on the validation set
    """
    # Set model to evaluation mode
    model.eval()
    eval_epoch_loss = []

    eval_metric = evaluate.load(path="glue", config_name="mrpc")

    for batch in eval_dl:
        # Disable gradient computation for evaluation (saves memory and computation)
        with torch.no_grad():
            outputs = model(**batch)
            # Store loss inside no_grad for memory efficiency
            eval_epoch_loss.append(outputs.loss.item())

            # Get predictions for metrics (logits already created without gradients)
            logits = outputs.logits
            refs = batch["labels"]
            preds = torch.argmax(logits, dim=-1)

            # Add batch to evaluation metric
            eval_metric.add_batch(
                predictions=accelerator.gather(preds),
                references=accelerator.gather(refs)
            )
    
    eval_avg_loss = sum(eval_epoch_loss) / len(eval_epoch_loss)
    eval_pred_stats = eval_metric.compute()
    
    return eval_avg_loss, eval_pred_stats


def training_function():

    # initialise the plotter for the learning curve
    plotter = PlotLosses(mode='notebook')

    for epoch in range(num_epochs):
        # Ensure model is in training mode
        model.train()
        train_epoch_loss = []

        # Create fresh metrics for each epoch to avoid accumulation across epochs
        train_metric = evaluate.load(path="glue", config_name="mrpc")

        for batch in train_dl:
            #  FORWARD PASS (keep gradients attached)
            outputs = model(**batch)
            loss = outputs.loss

            # BACKWARD PASS (while gradients are still attached)
            accelerator.backward(loss)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimiser.step()
            lr_scheduler.step()
            optimiser.zero_grad()

            # METRICS COMPUTATION (after backward pass is complete)
            with torch.no_grad():
                # Detach loss for logging to prevent keeping computation graph in memory
                train_epoch_loss.append(loss.detach().item())

                # Detach logits for metric computation (no gradients needed for metrics)
                logits = outputs.logits.detach()
                # No need to detach labels (they don't have gradients)
                refs = batch['labels']
                preds = torch.argmax(logits, dim=-1)
                
                # Add batch to training metric
                train_metric.add_batch(
                    predictions=accelerator.gather(preds),
                    references=accelerator.gather(refs)
                )

            progress_bar.update(1)

        # COMPUTE TRAINING METRICS
        train_avg_loss = sum(train_epoch_loss)/len(train_epoch_loss)
        train_pred_stats = train_metric.compute()

        # EVALUATION PHASE
        eval_avg_loss, eval_pred_stats = perform_evaluation()

        # set back to train mode
        model.train()

        # update live learning curve
        plotter.update({
            'loss': train_avg_loss,
            'val_loss' : eval_avg_loss,
            'acc' : train_pred_stats['accuracy'],
            'val_acc' : eval_pred_stats['accuracy'],
            'f1' : train_pred_stats['f1'],
            'val_f1' : eval_pred_stats['f1']
        })
        plotter.send()

Finally, launch the training with accelerator:

In [None]:
from accelerate import notebook_launcher

notebook_launcher(training_function, num_processes=1)

We can see from the learning curves above that the model is overfitting: the validation loss starts increasing while the training loss keeps decreasing.

# Trainer API Example
This time we will use the `tensorboard` lib to visualise the learning curves and integrate it with the `Trainer` API.

First, let's prepare the data (same as before, but this time we don't need the `DataLoader` or `DataCollatorWithPadding`, because the `Trainer` API builds the batches itself and, when given the tokeniser, pads them dynamically by default):

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from pprint import pprint

raw_datasets = load_dataset(path="glue", name="mrpc")

pprint(raw_datasets)

checkpoint = "bert-base-uncased"
tokeniser = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def tokenisation_function(data):
    return tokeniser(data['sentence1'], data['sentence2'], truncation=True)

tokenised_datasets = raw_datasets.map(tokenisation_function, batched=True)

pprint(tokenised_datasets)

tokenised_datasets = tokenised_datasets.remove_columns(
    column_names=["sentence1", "sentence2", "idx"]
)
tokenised_datasets = tokenised_datasets.rename_column(
    original_column_name = "label",
    new_column_name = "labels"
)
tokenised_datasets.set_format("torch")

pprint(tokenised_datasets)

Define a metric function:

In [None]:
import numpy as np
import evaluate

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    pred_logits, labels = eval_preds
    preds = np.argmax(pred_logits, axis=-1)

    return metric.compute(predictions=preds, references=labels)

Finally, set the `TrainingArguments` and pass them to the `Trainer` API. Below, we also use the `EarlyStoppingCallback` provided by `transformers` to stop training early if the validation loss stops improving, i.e., in the case of *overfitting* (the callback knows which metric to watch from the `metric_for_best_model` parameter in the `TrainingArguments`).

> After starting the cell below, please open a terminal and go to this repo's directory. First, activate the Python venv:
>```bash
>source .venv/bin/activate
>```
>then, run the following command to start the TensorBoard server and open the link it prints to see the learning curve plots:
>```bash
>tensorboard --logdir data/chapter_3/model_results
>```

In [None]:
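import math

from transformers import Trainer, TrainingArguments
from transformers import EarlyStoppingCallback # for detecting overfitting with early stopping

train_batch_size = 16
eval_batch_size = min(128,  len(tokenised_datasets["validation"]))
num_epochs = 5

# round up: the last, smaller batch of each epoch still counts as a training step
# (assuming a single device and no gradient accumulation)
steps_per_epoch = math.ceil(len(tokenised_datasets["train"]) / train_batch_size)
total_training_steps = num_epochs * steps_per_epoch
num_warmup_steps = int(0.1 * total_training_steps)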

training_args = TrainingArguments(
    output_dir="data/chapter_3/model_results",
    eval_strategy="steps", 
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    eval_steps=50,
    save_steps=100,
    fp16=True, # reduce memory usage
    warmup_steps=num_warmup_steps,
    logging_steps=10,  # Log metrics every 10 steps
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    report_to="tensorboard",  # Send logs to tensorboard
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenised_datasets["train"],
    eval_dataset=tokenised_datasets["validation"],
    processing_class=tokeniser,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)] # stop early after 3 consecutive evaluations without improvement in eval_loss
)

# Train and automatically log metrics
trainer.train()

# The End!