In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Map:   0%|          | 0/1725 [00:00<?, ? examples/s]


# Prepare for training

Before actually writing our training loop, we will need to define a few objects. The first ones are the dataloaders we will use to iterate over batches. But before we can define those dataloaders, we need to apply a bit of postprocessing to our tokenized_datasets, to take care of some things that the Trainer did for us automatically. Specifically, we need to:

- Remove the columns corresponding to values the model does not expect (like the sentence1 and sentence2 columns).
- Rename the column label to labels (because the model expects the argument to be named labels).
- Set the format of the datasets so they return PyTorch tensors instead of lists.

Our tokenized_datasets has one method for each of those steps:

In [6]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

['labels', 'input_ids', 'token_type_ids', 'attention_mask']

Now that this is done, we can easily define our dataloaders:

In [8]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator)
eval_dataloader = DataLoader(tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator)

In [9]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

{'labels': torch.Size([8]),
 'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67])}

Note that the actual shapes will probably be slightly different for you since we set shuffle=True for the training dataloader and we are padding to the maximum length inside the batch.
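To make "padding to the maximum length inside the batch" concrete, here is a minimal pure-Python sketch. `pad_batch` is a hypothetical helper written for illustration, not the implementation of `DataCollatorWithPadding`, and the padding id of 0 is an assumption.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch.

    Hypothetical helper for illustration; pad_id=0 is an assumption.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with different longest sequences get different widths
batch_a = pad_batch([[101, 7, 8, 102], [101, 9, 102]])
batch_b = pad_batch([[101, 9, 102], [101, 5, 102]])
print(len(batch_a["input_ids"][0]), len(batch_b["input_ids"][0]))  # 4 3
```

Because the width depends on which examples land in a batch, shuffling changes the second dimension from one batch to the next.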

Now that we’re completely finished with data preprocessing (a satisfying yet elusive goal for any ML practitioner), let’s turn to the model. We instantiate it exactly as we did in the previous section:

In [10]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [11]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

tensor(0.7454, grad_fn=<NllLossBackward0>) torch.Size([8, 2])


All 🤗 Transformers models will return the loss when labels are provided, and we also get the logits (two for each input in our batch, so a tensor of size 8 x 2).
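The loss of roughly 0.75 is close to ln 2 ≈ 0.693, which is what a two-class classifier gets when it has no preference between the classes. The sketch below computes a mean cross-entropy from raw logits in pure Python to show where that number comes from. It illustrates the computation only and is not the library’s code path; `cross_entropy` is a name chosen for this example.

```python
import math

def cross_entropy(logits, labels):
    """Mean cross-entropy over a batch of raw logits (one row per example)."""
    total = 0.0
    for row, label in zip(logits, labels):
        # log-sum-exp with the max subtracted for numerical stability
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[label]
    return total / len(labels)

# A classification head that is undecided between the two classes
undecided = cross_entropy([[0.0, 0.0]] * 8, [0, 1, 1, 0, 1, 1, 0, 1])
print(round(undecided, 4))  # 0.6931, i.e. ln(2)
```

A freshly initialized classification head should therefore start near this value, and training should push the loss below it.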

We’re almost ready to write our training loop! We’re just missing two things: an optimizer and a learning rate scheduler. Since we are trying to replicate what the Trainer was doing by hand, we will use the same defaults. The optimizer used by the Trainer is AdamW, which is the same as Adam, but with a twist for weight decay regularization (see “Decoupled Weight Decay Regularization” by Ilya Loshchilov and Frank Hutter):

In [12]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

| **Component**     | **Why it exists**                                |
|-------------------|--------------------------------------------------|
| **Optimizer**     | Updates model weights to minimize loss           |
| **Adam**          | Adaptive steps with momentum                     |
| **AdamW**         | Fixes Adam’s weight decay issue                  |
| **Learning Rate** | Controls how big each update is                  |
| **Scheduler**     | Dynamically adjusts learning rate while training |
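To illustrate the "twist" in AdamW, here is a minimal sketch of one update for a single scalar weight. The weight decay is applied directly to the weight instead of being added to the gradient, which is what "decoupled" means. `adamw_step` is a hypothetical function for illustration; the default `betas`, `eps`, and `weight_decay` values are assumptions chosen to mirror common defaults.

```python
import math

def adamw_step(param, grad, state, lr=5e-5, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter (illustrative sketch)."""
    beta1, beta2 = betas
    state["t"] += 1
    # Decoupled weight decay: shrink the weight directly, independent of the gradient
    param = param * (1 - lr * weight_decay)
    # Adam part: moving averages of the gradient and of its square
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized averages
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
w = adamw_step(1.0, grad=0.5, state=state)
print(w)
```

With a zero gradient the Adam part does nothing and the weight still shrinks by the factor `1 - lr * weight_decay`, which is an easy way to see that the decay is independent of the gradient statistics.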

Finally, the learning rate scheduler used by default is just a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the number of training steps we will take, which is the number of epochs we want to run multiplied by the number of training batches (which is the length of our training dataloader). The Trainer uses three epochs by default, so we will follow that:

In [13]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)

1377
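The number 1377 and the shape of the schedule can both be checked by hand. The sketch below is a pure-Python approximation of a linear schedule with optional warmup; `linear_lr` is a hypothetical helper, not the function that `get_scheduler` returns. It relies on the MRPC training split having 3,668 examples.

```python
import math

def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

# 3,668 training examples with batch_size=8: the last batch of each epoch is partial
steps_per_epoch = math.ceil(3668 / 8)
total_steps = 3 * steps_per_epoch
print(steps_per_epoch, total_steps)  # 459 1377

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, 5e-5, total_steps))
```

The learning rate starts at 5e-5, is about half of that midway through training, and reaches 0 on the final step.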


# The training loop

One last thing: we will want to use the GPU if we have access to one (on a CPU, training might take several hours instead of a couple of minutes). To do this, we define a device we will put our model and our batches on:

In [14]:
import torch

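# Prefer an NVIDIA GPU, then an Apple Silicon GPU, and fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")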
model.to(device)
device

device(type='mps')

We are now ready to train! To get some sense of when training will be finished, we add a progress bar over our number of training steps, using the tqdm library:

In [15]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

  0%|          | 0/1377 [00:00<?, ?it/s]

You can see that the core of the training loop looks a lot like the one in the introduction. We didn’t ask for any reporting, so this training loop will not tell us anything about how the model fares. We need to add an evaluation loop for that.

# The evaluation loop

As we did earlier, we will use a metric provided by the 🤗 Evaluate library. We’ve already seen the metric.compute() method, but metrics can actually accumulate batches for us as we go over the prediction loop with the method add_batch(). Once we have accumulated all the batches, we can get the final result with metric.compute(). Here’s how to implement all of this in an evaluation loop:

In [16]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

{'accuracy': 0.8284313725490197, 'f1': 0.8801369863013698}
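The accumulate-then-compute pattern can be sketched in a few lines of pure Python. `BinaryClassificationMetric` is a toy stand-in written for this example and is not how the 🤗 Evaluate library is implemented; it only mimics the `add_batch()` and `compute()` interface for binary labels.

```python
class BinaryClassificationMetric:
    """Toy accumulator that mimics the add_batch()/compute() pattern."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        # Only store the values; nothing is computed until compute() is called
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        tp = sum(1 for p, r in pairs if p == 1 and r == 1)
        fp = sum(1 for p, r in pairs if p == 1 and r == 0)
        fn = sum(1 for p, r in pairs if p == 0 and r == 1)
        accuracy = sum(1 for p, r in pairs if p == r) / len(pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        return {"accuracy": accuracy, "f1": f1}


toy_metric = BinaryClassificationMetric()
toy_metric.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
toy_metric.add_batch(predictions=[0, 1], references=[1, 1])
result = toy_metric.compute()
print(result)
```

Accumulating first matters for F1 in particular: the F1 score over the whole validation set is generally not the average of per-batch F1 scores.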

# Supercharge your training loop with 🤗 Accelerate

The training loop we defined earlier works fine on a single CPU or GPU. But using the 🤗 Accelerate library, with just a few adjustments we can enable distributed training on multiple GPUs or TPUs. 🤗 Accelerate handles the complexity of distributed training, mixed precision, and device placement automatically. Reusing the training and validation dataloaders created earlier, here is what our manual training loop looks like with 🤗 Accelerate:

In [17]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/1377 [00:00<?, ?it/s]

In [21]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()

for batch in eval_dl:
    with torch.no_grad():
        output = model(**batch)

    logits = output.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=accelerator.gather(predictions), references=accelerator.gather(batch["labels"]))

metric.compute()

{'accuracy': 0.8602941176470589, 'f1': 0.9025641025641026}

The first line to add is the import line. The second line instantiates an Accelerator object that will look at the environment and initialize the proper distributed setup. 🤗 Accelerate handles the device placement for you, so you can remove the lines that put the model on the device (or, if you prefer, change them to use accelerator.device instead of device).

Then the main bulk of the work is done in the line that sends the dataloaders, the model, and the optimizer to accelerator.prepare(). This will wrap those objects in the proper container to make sure your distributed training works as intended. The remaining changes to make are removing the line that puts the batch on the device (again, if you want to keep this you can just change it to use accelerator.device) and replacing loss.backward() with accelerator.backward(loss).

Putting this in a train.py script will make that script runnable on any kind of distributed setup. To try it out in your distributed setup, run the command:

```
accelerate config
```

which will prompt you to answer a few questions and dump your answers in a configuration file used by this command:

```
accelerate launch train.py
```

which will launch the distributed training.

If you want to try this in a Notebook (for instance, to test it with TPUs on Colab), just paste the code in a training_function() and run a last cell with:


```
from accelerate import notebook_launcher

notebook_launcher(training_function)
```

# Understanding Learning Curves

Now that you’ve learned how to implement fine-tuning using both the Trainer API and custom training loops, it’s crucial to understand how to interpret the results. Learning curves are invaluable tools that help you evaluate your model’s performance during training and identify potential issues before they reduce performance.

In this section, we’ll explore how to read and interpret accuracy and loss curves, understand what different curve shapes tell us about our model’s behavior, and learn how to address common training issues.

## What are Learning Curves?

Learning curves are visual representations of your model’s performance metrics over time during training. The two most important curves to monitor are:

- Loss curves: Show how the model’s error (loss) changes over training steps or epochs
- Accuracy curves: Show the percentage of correct predictions over training steps or epochs

These curves help us understand whether our model is learning effectively and can guide us in making adjustments to improve performance. In Transformers, these metrics are individually computed for each batch and then logged to the disk. We can then use libraries like Weights & Biases to visualize these curves and track our model’s performance over time.

## Loss Curves

The loss curve shows how the model’s error decreases over time. In a typical successful training run, you’ll see a curve similar to the one below:

![Alt Txt](images/1.png "loss")

- High initial loss: The model starts without optimization, so predictions are initially poor
- Decreasing loss: As training progresses, the loss should generally decrease
- Convergence: Eventually, the loss stabilizes at a low value, indicating that the model has learned the patterns in the data
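Per-step losses are noisy, so the overall trend is often easier to read after smoothing. The sketch below applies an exponential moving average in pure Python. `ema_smooth` is a hypothetical helper, `alpha=0.3` is an arbitrary choice, and the loss values are made up for illustration only.

```python
def ema_smooth(values, alpha=0.3):
    """Exponential moving average; a smaller alpha smooths more."""
    smoothed = []
    for value in values:
        previous = smoothed[-1] if smoothed else value
        smoothed.append(alpha * value + (1 - alpha) * previous)
    return smoothed

# Illustrative per-step losses (made up for this sketch, not from a real run)
noisy_loss = [0.70, 0.66, 0.71, 0.58, 0.62, 0.49, 0.55, 0.41]
smoothed_loss = ema_smooth(noisy_loss)
for raw, smooth in zip(noisy_loss, smoothed_loss):
    print(f"{raw:.2f} -> {smooth:.3f}")
```

The smoothed series lags behind the raw one, so heavy smoothing can hide a recent change in trend.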

As in previous chapters, we can use the Trainer API to track these metrics and visualize them in a dashboard. Below is an example of how to do this with Weights & Biases.

In [22]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)



In [27]:
import evaluate
import numpy as np

def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [28]:
from transformers import Trainer, TrainingArguments
import wandb

wandb.init(project="transformer-fine-tuning", name="bert-mrpc-analysis")

training_args = TrainingArguments(
    output_dir="./models",
    eval_strategy="steps",
    eval_steps=50,
    save_steps=100,
    logging_steps=10,  # Log metrics every 10 steps
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    report_to="wandb",  # Send logs to Weights & Biases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

# Train and automatically log metrics
trainer.train()

| Step | Training Loss | Validation Loss | Accuracy | F1       |
|------|---------------|-----------------|----------|----------|
| 50   | 0.1175        | 0.897941        | 0.845588 | 0.891192 |
| 100  | 0.059         | 0.881686        | 0.828431 | 0.882943 |
| 150  | 0.1075        | 0.710855        | 0.823529 | 0.875433 |
| 200  | 0.0686        | 0.767951        | 0.82598  | 0.87522  |
| 250  | 0.0151        | 0.830319        | 0.835784 | 0.876155 |
| 300  | 0.0729        | 0.945711        | 0.840686 | 0.886165 |
| 350  | 0.0097        | 0.801936        | 0.840686 | 0.888508 |
| 400  | 0.0939        | 0.926723        | 0.852941 | 0.894737 |
| 450  | 0.0677        | 0.957365        | 0.838235 | 0.880435 |
| 500  | 0.0107        | 0.958333        | 0.848039 | 0.894558 |




TrainOutput(global_step=690, training_loss=0.07302113775757776, metrics={'train_runtime': 559.7057, 'train_samples_per_second': 19.66, 'train_steps_per_second': 1.233, 'total_flos': 444815961302640.0, 'train_loss': 0.07302113775757776, 'epoch': 3.0})

## Accuracy Curves

The accuracy curve shows the percentage of correct predictions over time. Unlike loss curves, accuracy curves should generally increase as the model learns and can typically include more steps than the loss curve.

- Start low: Initial accuracy should be low, as the model has not yet learned the patterns in the data
- Increase with training: Accuracy should generally improve as the model learns if it is able to learn the patterns in the data
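- May show plateaus: Accuracy often increases in discrete jumps rather than smoothly, because it only changes when a prediction flips from wrong to right (or back). The loss can keep improving as predictions move closer to the true labels while accuracy stays flat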
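The following sketch shows how a loss can improve while accuracy stays on a plateau. It computes both metrics for two snapshots of two-class logits in pure Python. The logits are made up for illustration, and `loss_and_accuracy` is a hypothetical helper.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_accuracy(logits, labels):
    """Mean cross-entropy and accuracy for a batch of logits."""
    probs = [softmax(row) for row in logits]
    loss = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    correct = sum(1 for p, y in zip(probs, labels) if p.index(max(p)) == y)
    return loss, correct / len(labels)

labels = [1, 0, 1, 0]
# Illustrative logits: the later snapshot gives the correct class more probability
# on every example, but the third example is still on the wrong side of the boundary
earlier = [[0.0, 0.2], [0.3, 0.0], [0.8, 0.0], [0.1, 0.0]]
later = [[0.0, 1.5], [1.4, 0.0], [0.2, 0.0], [1.0, 0.0]]

print(loss_and_accuracy(earlier, labels))
print(loss_and_accuracy(later, labels))
```

Both snapshots classify three of the four examples correctly, so accuracy is unchanged even though the loss is clearly lower in the second one.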

## Convergence

Convergence occurs when the model’s performance stabilizes and the loss and accuracy curves level off. This is a sign that the model has learned the patterns in the data and is ready to be used. In simple terms, we are aiming for the model to converge to a stable performance every time we train it.

![alt text](images/4.png "convergence")

Once models have converged, we can use them to make predictions on new data and refer to evaluation metrics to understand how well the model is performing.

## Interpreting Learning Curve Patterns

Different curve shapes reveal different aspects of your model’s training. Let’s examine the most common patterns and what they mean.

### Healthy Learning Curves

A well-behaved training run typically shows curve shapes similar to the one below:

![alt](images/5.png "curve")

Characteristics of healthy curves:

- Smooth decline in loss: Both training and validation loss decrease steadily
- Close training/validation performance: Small gap between training and validation metrics
- Convergence: Curves level off, indicating the model has learned the patterns

## Practical Examples

Let’s work through some practical examples of learning curves. First, we will highlight some approaches to monitor the learning curves during training. Below, we will break down the different patterns that can be observed in the learning curves.

### During Training

During the training process (after you’ve hit trainer.train()), you can monitor these key indicators:

- Loss convergence: Is the loss still decreasing or has it plateaued?
- Overfitting signs: Is validation loss starting to increase while training loss decreases?
- Learning rate: Are the curves too erratic (LR too high) or too flat (LR too low)?
- Stability: Are there sudden spikes or drops that indicate problems?

### After Training

After the training process is complete, you can analyze the complete curves to understand the model’s performance.

- Final performance: Did the model reach acceptable performance levels?
- Efficiency: Could the same performance be achieved with fewer epochs?
- Generalization: How close are training and validation performance?
- Trends: Would additional training likely improve performance?

## Overfitting

Overfitting occurs when the model learns too much from the training data and is unable to generalize to different data (represented by the validation set).

![alt](images/2-2.png "overfitting")

### Symptoms:

- Training loss continues to decrease while validation loss increases or plateaus
- Large gap between training and validation accuracy
- Training accuracy much higher than validation accuracy
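These symptoms can be checked numerically on the values logged by the Weights & Biases run earlier in this section. The sketch below computes the gap between validation and training loss at each logged step and finds where validation loss was lowest. The helper names are hypothetical.

```python
# (step, training loss, validation loss) copied from the logged run earlier in this section
history = [
    (50, 0.1175, 0.897941),
    (100, 0.059, 0.881686),
    (150, 0.1075, 0.710855),
    (200, 0.0686, 0.767951),
    (250, 0.0151, 0.830319),
    (300, 0.0729, 0.945711),
    (350, 0.0097, 0.801936),
    (400, 0.0939, 0.926723),
    (450, 0.0677, 0.957365),
    (500, 0.0107, 0.958333),
]

def generalization_gaps(history):
    """Validation loss minus training loss at each logged step."""
    return [(step, val - train) for step, train, val in history]

def best_validation_step(history):
    """Step at which validation loss was lowest."""
    return min(history, key=lambda row: row[2])[0]

gaps = generalization_gaps(history)
for step, gap in gaps:
    print(f"step {step}: gap {gap:.3f}")
print("lowest validation loss at step", best_validation_step(history))  # 150
```

In that run the training loss stays near zero while the validation loss never drops below about 0.71, which is the large, persistent gap described in the symptoms.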

### Solutions for overfitting:

- Regularization: Add dropout, weight decay, or other regularization techniques
- Early stopping: Stop training when validation performance stops improving
- Data augmentation: Increase training data diversity
- Reduce model complexity: Use a smaller model or fewer parameters

In the sample below, we use early stopping to prevent overfitting. We set early_stopping_patience to 3, which means training stops once the validation loss has failed to improve for 3 consecutive evaluations. Because we evaluate every 100 steps here, that is 300 steps without improvement, not 3 epochs.

In [29]:
# Example of detecting overfitting with early stopping
from transformers import EarlyStoppingCallback
import wandb


training_args = TrainingArguments(
    output_dir="./models",
    eval_strategy="steps",
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    logging_steps=10,  # Log metrics every 10 steps
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    num_train_epochs=10,  # Set high, but we'll stop early
    report_to="wandb",  # Send logs to Weights & Biases
)

# Add early stopping to prevent overfitting
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [31]:
trainer.train()



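| Step | Training Loss | Validation Loss | Accuracy | F1       |
|------|---------------|-----------------|----------|----------|
| 100  | 0.0303        | 1.212625        | 0.821078 | 0.871705 |
| 200  | 0.0395        | 1.1983          | 0.821078 | 0.879736 |
| 300  | 0.1477        | 0.913811        | 0.823529 | 0.882736 |
| 400  | 0.1644        | 0.898734        | 0.838235 | 0.887755 |
| 500  | 0.0768        | 0.916105        | 0.813725 | 0.865248 |
| 600  | 0.0771        | 1.078762        | 0.833333 | 0.881119 |
| 700  | 0.0919        | 1.12546         | 0.840686 | 0.886562 |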




TrainOutput(global_step=700, training_loss=0.14143310962777053, metrics={'train_runtime': 349.5383, 'train_samples_per_second': 104.938, 'train_steps_per_second': 13.132, 'total_flos': 214184732393760.0, 'train_loss': 0.14143310962777053, 'epoch': 1.5250544662309369})
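The stop at step 700 can be reproduced by replaying the logged validation losses through a simplified version of the patience logic. This is a sketch, not the code of `EarlyStoppingCallback`: `early_stopping_step` is a hypothetical function, and it ignores the callback’s optional improvement threshold.

```python
def early_stopping_step(eval_log, patience=3):
    """Return the step at which training would stop, or None if it never triggers.

    eval_log is a list of (step, validation_loss) pairs, one per evaluation.
    """
    best = float("inf")
    bad_evals = 0
    for step, val_loss in eval_log:
        if val_loss < best:
            # New best value: reset the patience counter
            best = val_loss
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return step
    return None

# Validation losses copied from the training log above
eval_log = [(100, 1.212625), (200, 1.1983), (300, 0.913811), (400, 0.898734),
            (500, 0.916105), (600, 1.078762), (700, 1.12546)]
print(early_stopping_step(eval_log))  # 700
```

The best validation loss is reached at step 400, and the evaluations at steps 500, 600, and 700 all fail to beat it, which exhausts a patience of 3. Patience is counted in evaluations, so its meaning depends on how often you evaluate.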

## Underfitting

Underfitting occurs when the model is too simple to capture the underlying patterns in the data. This can happen for several reasons:

- The model is too small or lacks capacity to learn the patterns
- The learning rate is too low, causing slow learning
- The dataset is too small or not representative of the problem
- The model is regularized too heavily (for example, dropout or weight decay set too high)

![alt](images/7.png "underfit")

### Symptoms:

- Both training and validation loss remain high
- Model performance plateaus early in training
- Training accuracy is lower than expected

### Solutions for underfitting:

- Increase model capacity: Use a larger model or more parameters
- Train longer: Increase the number of epochs
- Adjust learning rate: Try different learning rates
- Check data quality: Ensure your data is properly preprocessed

## Erratic Learning Curves

Erratic learning curves occur when the model is not learning effectively. This can happen for several reasons:

- The learning rate is too high, causing the model to overshoot the optimal parameters
- The batch size is too small, causing noisy gradient estimates that vary a lot from step to step
- The model is not properly regularized, causing it to overfit to the training data
- The dataset is not properly preprocessed, causing the model to learn from noise

### Symptoms:

- Frequent fluctuations in loss or accuracy
- Curves show high variance or instability
- Performance oscillates without clear trend
- Both training and validation curves show erratic behavior

### Solutions for erratic curves:

- Lower learning rate: Reduce step size for more stable training
- Increase batch size: Larger batches provide more stable gradients
- Gradient clipping: Prevent exploding gradients
- Better data preprocessing: Ensure consistent data quality
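Of these solutions, gradient clipping is the least self-explanatory, so here is a minimal pure-Python sketch of clipping by global norm. `clip_by_global_norm` is a hypothetical helper for illustration. In a manual PyTorch loop you would typically call `torch.nn.utils.clip_grad_norm_` before `optimizer.step()`, and with the Trainer the `max_grad_norm` training argument plays this role.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradient values so their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The scale factor is capped at 1 so small gradients are left untouched
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

The direction of the update is preserved and only its length is reduced, which limits the damage a single bad batch can do.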


In the sample below, we lower the learning rate and increase the batch size.

``` python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=1e-5,  # lowered from 1e-4 for smaller, more stable updates
    per_device_train_batch_size=32,  # raised from 16 for less noisy gradients
)
```