# Fine-tuning a model with the Trainer API

🤗 Transformers provides a `Trainer` class to help you fine-tune any of the pretrained models it provides on your dataset with modern best practices. Once you've done all the data preprocessing work in the last section, you have just a few steps left to define the `Trainer`. The hardest part is likely to be preparing the environment to run `Trainer.train()`, as it will run very slowly on a CPU. If you don't have a GPU set up, you can get access to free GPUs or TPUs on Google Colab.



The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need:



In [1]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting evaluate
  Downloading evaluate-0.4.6-py3-none-any.whl.metadata (9.5 kB)
Downloading evaluate-0.4.6-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.6


In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_dataset = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)




## 1. Training

The first step before we can define our `Trainer` is to define a `TrainingArguments` class that will contain all the hyperparameters the `Trainer` will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.



In [7]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="trainer"
)

If you want to automatically upload your model to the Hub during training, pass along `push_to_hub=True` in the `TrainingArguments`.

The second step is to define our model. As in the previous chapter, we will use the `AutoModelForSequenceClassification` class, with two labels:



In [3]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


BertForSequenceClassification LOAD REPORT from: bert-base-cased
Key                                        | Status     | 
-------------------------------------------+------------+-
cls.seq_relationship.weight                | UNEXPECTED | 
cls.predictions.bias                       | UNEXPECTED | 
cls.predictions.transform.dense.weight     | UNEXPECTED | 
cls.seq_relationship.bias                  | UNEXPECTED | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED | 
cls.predictions.transform.dense.bias       | UNEXPECTED | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED | 
classifier.bias                            | MISSING    | 
classifier.weight                          | MISSING    | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


You will get a warning after instantiating this pretrained model (the load report above). This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The report shows that some weights were not used (the `UNEXPECTED` keys, which belong to the dropped pretraining head) and that others were randomly initialized (the `MISSING` keys, which belong to the new classification head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.



Once we have our model, we can define a `Trainer` by passing it all the objects constructed up to now: the model, the `training_args`, the training and validation datasets, our `data_collator`, and our `processing_class`. The `processing_class` parameter is a newer addition that tells the `Trainer` which tokenizer to use for processing:



In [9]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    processing_class=tokenizer
)

When you pass a tokenizer as the `processing_class`, the default `data_collator` used by the `Trainer` will be a `DataCollatorWithPadding`. You can skip the `data_collator=data_collator` line in this case, but we included it here to show you this important part of the processing pipeline.
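To make the role of the collator concrete, here is a minimal pure-Python sketch of dynamic padding: each batch is padded only to the length of its own longest sequence. It illustrates the idea and is not the library's implementation; the helper name `pad_batch`, the token ids, and the pad id of 0 are assumptions chosen for illustration.

```python
# Conceptual sketch of dynamic padding (not how DataCollatorWithPadding is written).
PAD_TOKEN_ID = 0  # assumed pad id, used here only for illustration


def pad_batch(sequences, pad_token_id=PAD_TOKEN_ID):
    """Pad every sequence to the length of the longest one in this batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in sequences]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two made-up token id sequences of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because padding is decided per batch, a batch of short sentences stays short instead of being padded to the longest sentence in the whole dataset.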

To fine-tune the model on our dataset, we just have to call the `train()` method of our `Trainer`:

In [None]:
trainer.train()

This will start the fine-tuning (which should take a couple of minutes on a GPU) and report the training loss every 500 steps. It won't, however, tell you how well (or badly) your model is performing. This is because:

1. We didn't tell the `Trainer` to evaluate during training by setting `eval_strategy` in `TrainingArguments` to either `"steps"` (evaluate every `eval_steps`) or `"epoch"` (evaluate at the end of each epoch).

2. We didn't provide the `Trainer` with a `compute_metrics()` function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).
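The difference between the evaluation strategies can be sketched with a small helper that lists the training steps after which an evaluation would run. The function `evaluation_points` is hypothetical and not part of Transformers, and the run length (3 epochs of 459 steps) is an assumed example.

```python
def evaluation_points(strategy, total_steps, steps_per_epoch, eval_steps=None):
    """Return the training steps after which an evaluation would run."""
    if strategy == "no":
        return []
    if strategy == "steps":
        return list(range(eval_steps, total_steps + 1, eval_steps))
    if strategy == "epoch":
        return list(range(steps_per_epoch, total_steps + 1, steps_per_epoch))
    raise ValueError(f"unknown strategy: {strategy!r}")


# Assumed run: 3 epochs of 459 steps each.
per_epoch = 459
total = 3 * per_epoch

print(evaluation_points("no", total, per_epoch))
print(evaluation_points("steps", total, per_epoch, eval_steps=500))
print(evaluation_points("epoch", total, per_epoch))
```

With `"no"` the list is empty, which is why the first training run reports no validation numbers at all.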

## 2. Evaluation

Let's see how we can build a useful `compute_metrics()` function and use it the next time we train. The function must take an `EvalPrediction` object (which is a named tuple with a `predictions` field and a `label_ids` field) and return a dictionary mapping strings to floats (the strings being the names of the metrics returned, and the floats their values). To get some predictions from our model, we can use the `Trainer.predict()` method:



In [16]:
predictions = trainer.predict(tokenized_dataset["validation"])


In [17]:
print(predictions.predictions.shape)
print(predictions.label_ids.shape)

(408, 2)
(408,)


In [18]:
print(predictions.predictions[:5])
print(predictions.label_ids[:5])

[[-1.0200864   1.2071925 ]
 [ 0.60737586 -0.785268  ]
 [-0.8965366   1.2125367 ]
 [-1.0168092   1.2354373 ]
 [-0.97496283  1.2462667 ]]
[1 0 0 1 0]


The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`. The `metrics` field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we complete our `compute_metrics()` function and pass it to the `Trainer`, that field will also contain the metrics returned by `compute_metrics()`.



To transform logits into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [19]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
print(preds[:5])
print(predictions.label_ids[:5])

[1 0 1 1 1]
[1 0 0 1 0]


We can now compare those preds to the labels. To build our `compute_metrics()` function, we will rely on the metrics from the 🤗 Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method we can use to do the metric calculation:



In [20]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.7892156862745098, 'f1': 0.8594771241830066}
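To see what these two numbers measure, here is a minimal sketch that computes accuracy and F1 by hand on the five sample logits and labels printed earlier. It is an illustration of the definitions, not the Evaluate library's code; the `sample_*` variable names are introduced here, and F1 is assumed to be the binary F1 with label 1 as the positive class.

```python
# The first five logits and labels printed earlier in this section.
sample_logits = [
    [-1.0200864, 1.2071925],
    [0.60737586, -0.785268],
    [-0.8965366, 1.2125367],
    [-1.0168092, 1.2354373],
    [-0.97496283, 1.2462667],
]
sample_labels = [1, 0, 0, 1, 0]

# argmax over the class axis: index of the largest logit in each row.
sample_preds = [row.index(max(row)) for row in sample_logits]
pairs = list(zip(sample_preds, sample_labels))

accuracy = sum(p == y for p, y in pairs) / len(pairs)

# Counts for the positive class (label 1).
tp = sum(p == 1 and y == 1 for p, y in pairs)
fp = sum(p == 1 and y == 0 for p, y in pairs)
fn = sum(p == 0 and y == 1 for p, y in pairs)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(sample_preds)  # [1, 0, 1, 1, 1]
print(accuracy)      # 0.6
print(round(f1, 4))  # 0.6667
```

On this tiny sample the model predicts class 1 too often: every true paraphrase is found (recall 1.0), but half of the predicted paraphrases are wrong (precision 0.5).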

Wrapping everything together, we get our `compute_metrics()` function:



In [21]:
from transformers import EvalPrediction

def compute_metrics(eval_preds: EvalPrediction) -> dict[str, float]:
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

And to see it used in action to report metrics at the end of each epoch, here is how we define a new `Trainer` with this `compute_metrics()` function:

In [22]:
training_args = TrainingArguments(
    output_dir="trainer",
    eval_strategy="epoch"
)

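# Re-create the model so this run starts from the pretrained checkpoint,
# not from the weights we have already fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(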
    model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator,
    processing_class=tokenizer,
)

Note that we create a new `TrainingArguments` with its `eval_strategy` set to `"epoch"` and a new model; otherwise, we would just be continuing the training of the model we have already trained. To launch a new training run, we execute:



In [23]:
trainer.train()

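| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----------|
| 1 | No log | 0.46059 | 0.791667 | 0.849023 |
| 2 | 0.454532 | 0.669434 | 0.816176 | 0.875208 |
| 3 | 0.321693 | 0.781904 | 0.835784 | 0.886633 |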



TrainOutput(global_step=1377, training_loss=0.33757612440321183, metrics={'train_runtime': 219.9944, 'train_samples_per_second': 50.019, 'train_steps_per_second': 6.259, 'total_flos': 419446300011600.0, 'train_loss': 0.33757612440321183, 'epoch': 3.0})

This time, the `Trainer` reports the validation loss and metrics at the end of each epoch on top of the training loss. The exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark.
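The step count in the training output can be checked by hand. The sketch below assumes a single device, a per-device batch size of 8, and 3 epochs (values not set explicitly in this section, so treat them as assumptions), together with the 3,668 MRPC training examples and the 500-step logging interval mentioned earlier.

```python
import math

num_examples = 3668   # MRPC training examples
batch_size = 8        # assumed per-device batch size, single device
num_epochs = 3        # assumed number of epochs
logging_steps = 500   # training loss is reported every 500 steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# Steps at which a training loss value gets logged.
logged_at = list(range(logging_steps, total_steps + 1, logging_steps))

print(steps_per_epoch)  # 459
print(total_steps)      # 1377
print(logged_at)        # [500, 1000]
```

Under these assumptions the total matches `global_step=1377`, and it suggests why the first epoch shows "No log": the epoch ends at step 459, before the first training loss is logged at step 500.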

## 3. Advanced Training Features

The `Trainer` comes with many built-in features that make modern deep learning best practices accessible:



**Mixed Precision Training**: Use `fp16=True` in your training arguments for faster training and reduced memory usage:



In [None]:
training_args = TrainingArguments(
    "trainer",
    eval_strategy="epoch",
    fp16=True
)
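Half precision trades numeric range for speed and memory, which is why mixed-precision setups commonly keep some values in higher precision and scale the loss before the backward pass. The standard-library sketch below only illustrates that range issue; it says nothing about how `Trainer` implements `fp16=True` internally, and the helper `to_fp16`, the gradient value, and the scale factor are assumptions chosen for illustration.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


tiny_gradient = 1e-8

# Too small for half precision: the value underflows to zero.
print(to_fp16(tiny_gradient))

# Scaling before the cast keeps the value representable;
# dividing by the same factor afterwards (in higher precision) recovers it.
scale = 2.0 ** 16
scaled = to_fp16(tiny_gradient * scale)
recovered = scaled / scale
print(recovered)
```

The recovered value is close to, but not exactly, the original, because half precision keeps only about three significant decimal digits.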

**Gradient Accumulation**: For effective larger batch sizes when GPU memory is limited:



In [None]:
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
)
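Why accumulation gives the same update as a larger batch can be checked on a toy problem. The sketch below uses a one-parameter linear model with a mean squared error loss (both assumptions made for illustration, unrelated to BERT) and compares the gradient on 16 examples with the sum of four micro-batch gradients, each divided by the number of accumulation steps.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [float(i) for i in range(1, 17)]  # 16 toy inputs
ys = [3.0 * x for x in xs]             # targets from y = 3x

# Gradient computed on the full batch of 16 examples.
full_grad = mse_grad(w, xs, ys)

# Same gradient, accumulated over 4 micro-batches of 4 examples each.
micro_batch_size = 4
accumulation_steps = 4
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    chunk_x = xs[start:start + micro_batch_size]
    chunk_y = ys[start:start + micro_batch_size]
    # Dividing by the number of steps keeps the result a mean over all 16 examples.
    accumulated_grad += mse_grad(w, chunk_x, chunk_y) / accumulation_steps

print(full_grad, accumulated_grad)
```

The two values agree, so the optimizer step taken after four small forward/backward passes matches the step a batch of 16 would have produced, while only four examples need to fit in memory at a time.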

**Learning Rate Scheduling**: The `Trainer` uses linear decay by default, but you can customize this:



In [None]:
training_args = TrainingArguments(
    "test-trainer",
    eval_strategy="epoch",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",  # Try different schedulers
)
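To get a feel for how the two schedules differ, here is a minimal sketch of the usual linear and cosine decay formulas. It assumes no warmup and a single half cosine cycle, the function names are hypothetical, and the 1,377 total steps are taken from the training run shown earlier.

```python
import math


def linear_lr(step, total_steps, base_lr):
    """Learning rate that decays linearly from base_lr to 0 (no warmup)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


def cosine_lr(step, total_steps, base_lr):
    """Learning rate that follows half a cosine wave from base_lr to 0 (no warmup)."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


base_lr = 2e-5
total_steps = 1377

for step in (0, 300, 700, 1100, total_steps):
    print(step, linear_lr(step, total_steps, base_lr), cosine_lr(step, total_steps, base_lr))
```

Both schedules start at `base_lr` and end at zero. The cosine schedule stays above the linear one during the first half of training and drops below it in the second half.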

## 4. Key Takeaways

1. The `Trainer` API provides a high-level interface that handles most training complexity

2. Use `processing_class` to specify your tokenizer for proper data handling

3. `TrainingArguments` controls all aspects of training: learning rate, batch size, evaluation strategy, and optimizations

4. `compute_metrics` enables custom evaluation metrics beyond just training loss

5. Modern features like mixed precision (`fp16=True`) and gradient accumulation can significantly improve training efficiency