# Fine-tuning a model with the Trainer API or Keras

Fine-tuning a model with the Trainer API
Install the Transformers, Datasets, and Evaluate libraries to run this notebook.



Watch this video: https://youtu.be/nvBXf7s7vTI

In [2]:
!pip install datasets evaluate transformers[sentencepiece]


🤗 Transformers provides a Trainer class to help you fine-tune any of the pretrained models it provides on your dataset. Once you’ve done all the data preprocessing work in the last section, you have just a few steps left to define the Trainer. The hardest part is likely to be preparing the environment to run Trainer.train(), as it will run very slowly on a CPU. If you don’t have a GPU set up, you can get access to free GPUs or TPUs on Google Colab.

The code examples below assume you have already executed the examples in the previous section. Here is a short summary recapping what you need:

In [3]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


This code prepares a dataset for training a model on the MRPC (Microsoft Research Paraphrase Corpus) task using Hugging Face Transformers. Here's a breakdown of what each part does:

**1. Imports:**

- `from datasets import load_dataset`: Imports the `load_dataset` function from the `datasets` library to load datasets from the Hugging Face Hub.
- `from transformers import AutoTokenizer, DataCollatorWithPadding`: Imports the `AutoTokenizer` and `DataCollatorWithPadding` classes from the `transformers` library.
    - `AutoTokenizer`: Automatically loads the tokenizer associated with a particular pre-trained checkpoint.
    - `DataCollatorWithPadding`: Pads and batches your data for training.

**2. Loading the Dataset:**

- `raw_datasets = load_dataset("glue", "mrpc")`: Loads the MRPC dataset from the GLUE benchmark collection using the `load_dataset` function. This will download the dataset if it's not already cached.

**3. Pre-trained Model Selection:**

- `checkpoint = "bert-base-uncased"`: Sets the pre-trained checkpoint to "bert-base-uncased". This is the base-size, uncased BERT model that will be fine-tuned.

**4. Tokenizer Creation:**

- `tokenizer = AutoTokenizer.from_pretrained(checkpoint)`: Loads the tokenizer associated with the chosen pre-trained model ("bert-base-uncased") using `AutoTokenizer`. This tokenizer handles converting text into numerical representations suitable for the model.

**5. Tokenization Function:**

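- `def tokenize_function(example):`: Defines a function named `tokenize_function`. Because it is later called through `map()` with `batched=True`, its `example` argument is a batch of examples (a dictionary mapping each column name to a list of values) rather than a single data point.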
- `return tokenizer(example["sentence1"], example["sentence2"], truncation=True)`:
    - Uses the `tokenizer` to convert both "sentence1" and "sentence2" keys from the example into numerical representations.
    - Sets `truncation=True` to truncate longer sentences to fit a maximum length (a common practice in transformer models).

**6. Tokenized Dataset Creation:**

- `tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)`: Applies the `tokenize_function` to every split in `raw_datasets`.
    - `batched=True` passes the examples to the function in batches rather than one at a time, which is faster. This creates a new dataset (`tokenized_datasets`) that keeps the original columns and adds the tokenizer's outputs (such as `input_ids` and `attention_mask`).
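To make the effect of `batched=True` concrete, here is a minimal sketch of the data shape the mapped function receives. The `toy_tokenize` function and its whitespace splitting are hypothetical stand-ins for the real tokenizer, used only so the example runs without downloading anything.

```python
def toy_tokenize(batch):
    # With batched=True, the function receives a dict of lists:
    # one list per column, one entry per example in the batch.
    pairs = zip(batch["sentence1"], batch["sentence2"])
    # Hypothetical tokenization: lowercase, split on whitespace, join with [SEP].
    tokens = [s1.lower().split() + ["[SEP]"] + s2.lower().split() for s1, s2 in pairs]
    # The function returns a dict of lists with one entry per example.
    return {"tokens": tokens}


# A made-up batch of two sentence pairs, shaped like an MRPC batch.
batch = {
    "sentence1": ["He said hello.", "It rained."],
    "sentence2": ["He greeted us.", "The sun shone."],
}
out = toy_tokenize(batch)
print(len(out["tokens"]), out["tokens"][1])
```

The real tokenizer accepts lists of sentences in the same way, which is why `tokenize_function` works unchanged in batched mode.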

**7. Data Collator:**

- `data_collator = DataCollatorWithPadding(tokenizer=tokenizer)`: Creates an instance of `DataCollatorWithPadding` with the loaded tokenizer.
    - This class pads the tokenized sequences in each batch to the length of the longest sequence in that batch (dynamic padding) and batches them together for model training.

**Overall, this code snippet takes a raw text dataset (MRPC), preprocesses it by tokenizing each sentence pair using a pre-trained model's tokenizer, and prepares it for training with a transformer model.**
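The padding step can be illustrated with a small sketch. This is a simplified stand-in for what a padding collator does conceptually, not the library's implementation; the function name `pad_batch`, the token IDs, and the padding ID of 0 are assumptions made for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the length of the longest one in this batch."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two batches with made-up token IDs: each is padded only as far as it needs.
short_batch = pad_batch([[101, 7592, 102], [101, 102]])
long_batch = pad_batch([[101, 7592, 2088, 999, 102], [101, 102]])
print(len(short_batch["input_ids"][0]), len(long_batch["input_ids"][0]))
```

Because the target length is chosen per batch, a batch of short sentences does not pay for the longest sentence in the whole dataset.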


Training
The first step before we can define our Trainer is to create a TrainingArguments object that will contain all the hyperparameters the Trainer will use for training and evaluation. The only argument you have to provide is a directory where the trained model will be saved, as well as the checkpoints along the way. For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.

In [4]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

**Understanding `from transformers import TrainingArguments`**

This line of code imports the `TrainingArguments` class from the `transformers` library. This class is crucial for configuring the training process of a machine learning model, particularly those based on transformer architectures.

**Breaking Down `training_args = TrainingArguments("test-trainer")`**

When you instantiate a `TrainingArguments` object with a single argument, like `"test-trainer"`, you're essentially creating a default configuration with a specific output directory.

Here's what this default configuration typically entails:

- **Output Directory:** The specified directory, "test-trainer", will be used to save model checkpoints, training logs, and other output files during the training process.
- **Default Hyperparameters:** The library will use default hyperparameters for various training settings, such as learning rate, number of epochs, batch size, etc. These defaults are often suitable for many common training scenarios.

**Customizing Training Arguments**

You can customize the training process by passing additional arguments to the `TrainingArguments` constructor. For example:

```python
training_args = TrainingArguments(
    output_dir="my_output_dir",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=100,
    save_steps=1000,
    evaluation_strategy="epoch"
)
```

Here's a brief explanation of some of the common arguments:

- **output_dir:** Specifies the output directory.
- **num_train_epochs:** Sets the total number of training epochs.
- **per_device_train_batch_size:** Determines the batch size per device.
- **learning_rate:** Sets the learning rate for the optimizer.
- **weight_decay:** Sets the weight decay coefficient applied by the optimizer (by default to all parameters except biases and LayerNorm weights).
- **logging_dir:** Specifies the directory to save training logs.
- **logging_steps:** Sets the frequency of logging training metrics.
- **save_steps:** Sets the frequency of saving model checkpoints.
- **evaluation_strategy:** Determines when to evaluate the model (e.g., "epoch" for every epoch).

By customizing these arguments, you can fine-tune the training process to your specific needs and achieve better results.


If you want to automatically upload your model to the Hub during training, pass along push_to_hub=True in the TrainingArguments. We will learn more about this in Chapter 4.

The second step is to define our model. As in the previous chapter, we will use the AutoModelForSequenceClassification class, with two labels:

In [5]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Understanding `AutoModelForSequenceClassification.from_pretrained()`**

This line of code loads a pre-trained model from the Hugging Face Transformers library and adds a head designed for sequence classification tasks (e.g., sentiment analysis, text classification).

**Breakdown of the Code:**

1. **Import:**
   - `from transformers import AutoModelForSequenceClassification`: Imports the `AutoModelForSequenceClassification` class from the `transformers` library. This auto class selects the model architecture that matches the checkpoint (here, `BertForSequenceClassification`) and can be used for various sequence classification tasks.

2. **Model Loading:**
   - `model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)`:
     - **`from_pretrained(checkpoint)`**: Loads a pre-trained model from the specified checkpoint. The checkpoint can be a Hugging Face model identifier (e.g., "bert-base-uncased") or a local path to a saved model.
     - **`num_labels=2`**: Sets the number of output labels for the classification task. In this case, the model will output two classes.

**What the Model Does:**

Once loaded, this model can be used to classify text sequences into one of two categories. It works by:

1. **Tokenization:** The input text is tokenized into a sequence of tokens, which are numerical representations of words or subwords.
2. **Embedding:** Each token is mapped to a dense vector representation, capturing its semantic and syntactic information.
3. **Encoding:** The sequence of token embeddings is processed through multiple layers of self-attention and feed-forward neural networks to extract relevant features.
4. **Classification:** The final layer of the model applies a classification layer with two output neurons, each corresponding to one of the two classes. The output neuron with the highest activation score determines the predicted class.
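The final classification step can be sketched in a few lines. The logits below are made-up numbers standing in for the two output neurons, and `softmax` is written by hand here purely for illustration.

```python
import math


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # made-up scores for class 0 and class 1
probs = softmax(logits)
# The predicted class is the index of the largest score.
predicted = max(range(len(probs)), key=probs.__getitem__)
print(predicted, [round(p, 3) for p in probs])
```

Since softmax preserves the ordering of the scores, taking the argmax of the raw logits gives the same predicted class.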

**Common Use Cases:**

- **Sentiment Analysis:** Classifying text as positive or negative.
- **Text Classification:** Categorizing text into predefined topics or categories.
- **Intent Classification:** Identifying the intent of a user's query or command.

By fine-tuning this pre-trained model on a specific dataset, you can adapt it to perform well on your particular classification task.


You will notice that unlike in Chapter 2, you get a warning after instantiating this pretrained model. This is because BERT has not been pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead. The warnings indicate that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.

Once we have our model, we can define a Trainer by passing it all the objects constructed up to now — the model, the training_args, the training and validation datasets, our data_collator, and our tokenizer:

In [6]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)



**Understanding the `Trainer` Class in Hugging Face Transformers**

The `Trainer` class in Hugging Face Transformers is a powerful tool for training and fine-tuning machine learning models, particularly those based on transformer architectures. It simplifies the training process by handling various aspects, including data loading, model optimization, evaluation, and logging.

**Breaking Down the Code:**

```python
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
```

**Key Components:**

1. **`model`**: This refers to the pre-trained model (loaded here with `AutoModelForSequenceClassification`) that you want to fine-tune.
2. **`training_args`**: This is an instance of the `TrainingArguments` class, which defines various training hyperparameters like learning rate, batch size, number of epochs, and output directory.
3. **`train_dataset`**: This is the training dataset, which has been tokenized and prepared for training.
4. **`eval_dataset`**: This is the validation dataset, used to evaluate the model's performance during training.
5. **`data_collator`**: This is a data collator that handles batching and padding of the input data.
6. **`tokenizer`**: This is the tokenizer used to process the text data.

**What the `Trainer` Does:**

Once you've instantiated the `Trainer` class with these components, you can start the training process by calling the `train()` method:

```python
trainer.train()
```

This will trigger the following steps:

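1. **Data Loading:** The `Trainer` will load the training dataset in batches. The validation dataset is only read when an evaluation runs.
2. **Model Training:** The model will be trained on the training data using the configured optimizer and the loss computed from the labels.
3. **Model Evaluation:** If an evaluation strategy is set in the `TrainingArguments`, the model will be evaluated on the validation dataset after each epoch or at specified intervals. With the default settings, no evaluation runs during training.
4. **Model Saving:** Checkpoints will be saved to the specified output directory at regular intervals. By default these are periodic checkpoints, not a selection of the best-performing model.
5. **Logging:** The `Trainer` will log the training loss at regular intervals. Metrics such as accuracy are only reported when you provide a `compute_metrics()` function.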

By using the `Trainer` class, you can efficiently train and fine-tune your models without having to manually implement many of the underlying training and evaluation steps.


Note that when you pass the tokenizer as we did here, the default data_collator used by the Trainer will be a DataCollatorWithPadding as defined previously, so you can skip the line data_collator=data_collator in this call. It was still important to show you this part of the processing in section 2!

To fine-tune the model on our dataset, we just have to call the train() method of our Trainer:

In [7]:
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.




| Step | Training Loss |
|------|---------------|
| 500  | 0.5065        |
| 1000 | 0.2812        |


TrainOutput(global_step=1377, training_loss=0.3259274573454237, metrics={'train_runtime': 243.9373, 'train_samples_per_second': 45.11, 'train_steps_per_second': 5.645, 'total_flos': 405114969714960.0, 'train_loss': 0.3259274573454237, 'epoch': 3.0})

This will start the fine-tuning (which should take a couple of minutes on a GPU) and report the training loss every 500 steps. It won’t, however, tell you how well (or badly) your model is performing. This is because:

- We didn’t tell the Trainer to evaluate during training by setting evaluation_strategy to either "steps" (evaluate every eval_steps) or "epoch" (evaluate at the end of each epoch).
- We didn’t provide the Trainer with a compute_metrics() function to calculate a metric during said evaluation (otherwise the evaluation would just have printed the loss, which is not a very intuitive number).
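The `global_step=1377` in the training output above can be reproduced with a little arithmetic. The sketch below assumes a single device and a per-device batch size of 8 (neither is stated explicitly in this section); the 3,668 training examples, the 3 epochs, and the 500-step logging interval come from the outputs and text above.

```python
import math

train_examples = 3668  # size of the MRPC training split
batch_size = 8         # assumption: default per-device batch size, one device
epochs = 3             # matches 'epoch': 3.0 in the training output
logging_steps = 500    # the training loss is reported every 500 steps

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Steps at which a training loss is logged.
logged_steps = list(range(logging_steps, total_steps + 1, logging_steps))
print(steps_per_epoch, total_steps, logged_steps)
```

Under these assumptions the run has 459 steps per epoch and 1,377 steps in total, with losses logged at steps 500 and 1000, which is consistent with the output shown.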

Evaluation
Let’s see how we can build a useful compute_metrics() function and use it the next time we train. The function must take an EvalPrediction object (which is a named tuple with a predictions field and a label_ids field) and will return a dictionary mapping strings to floats (the strings being the names of the metrics returned, and the floats their values). To get some predictions from our model, we can use the Trainer.predict() method:

In [8]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The output of the predict() method is another named tuple with three fields: predictions, label_ids, and metrics. The metrics field will just contain the loss on the dataset passed, as well as some time metrics (how long it took to predict, in total and on average). Once we complete our compute_metrics() function and pass it to the Trainer, that field will also contain the metrics returned by compute_metrics().

As you can see, predictions is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used). Those are the logits for each element of the dataset we passed to predict() (as you saw in the previous chapter, all Transformer models return logits). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [9]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

We can now compare those preds to the labels. To build our compute_metrics() function, we will rely on the metrics from the 🤗 Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the evaluate.load() function. The object returned has a compute() method we can use to do the metric calculation:

In [10]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)


{'accuracy': 0.8676470588235294, 'f1': 0.9072164948453608}

The exact results you get may vary, as the random initialization of the model head might change the metrics it achieves. Here, we can see our model has an accuracy of 86.76% on the validation set and an F1 score of 90.72. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the BERT paper reported an F1 score of 88.9 for the base model. We are using the same uncased checkpoint here, so the gap is not a matter of cased versus uncased. Note that the paper's figure was measured on the GLUE test set, whereas ours comes from the validation set, so the two numbers are not directly comparable.
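To see what these two numbers measure, here is a plain-Python sketch of the standard definitions of accuracy and F1 for binary labels. It is not the Evaluate library's code; the function name `accuracy_and_f1` and the toy predictions are made up for illustration, and class 1 is assumed to be the positive class.

```python
def accuracy_and_f1(preds, labels, positive=1):
    pairs = list(zip(preds, labels))
    correct = sum(p == y for p, y in pairs)
    # Count true positives, false positives, and false negatives.
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Six made-up predictions compared against six made-up labels.
toy = accuracy_and_f1(preds=[1, 1, 0, 1, 0, 0], labels=[1, 0, 0, 1, 1, 0])
print(toy)
```

Accuracy counts every correct prediction equally, while F1 focuses on how well the positive class is recovered, which is why the two values can differ noticeably on an imbalanced dataset.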

Wrapping everything together, we get our compute_metrics() function:

In [11]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
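
A function with this shape can be checked offline before handing it to the Trainer. The sketch below uses a hypothetical named tuple, `FakeEvalPrediction`, as a stand-in for the object the Trainer passes in, and a simplified `accuracy_only` function so that nothing has to be downloaded; the logits and labels are made up.

```python
from collections import namedtuple

import numpy as np

# Hypothetical stand-in for the (predictions, label_ids) object the Trainer passes.
FakeEvalPrediction = namedtuple("FakeEvalPrediction", ["predictions", "label_ids"])


def accuracy_only(eval_preds):
    # Same unpacking and argmax as compute_metrics(), with a hand-written metric.
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Return plain Python floats keyed by metric name.
    return {"accuracy": float((predictions == labels).mean())}


fake = FakeEvalPrediction(
    predictions=np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.5, 1.0]]),
    label_ids=np.array([0, 1, 0, 0]),
)
result = accuracy_only(fake)
print(result)
```

The check confirms the two parts of the contract described earlier: the input unpacks into logits and labels, and the output is a dictionary mapping metric names to floats.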

And to see it used in action to report metrics at the end of each epoch, here is how we define a new Trainer with this compute_metrics() function:

In [12]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



Note that we create a new TrainingArguments with its evaluation_strategy set to "epoch" and a new model — otherwise, we would just be continuing the training of the model we have already trained. To launch a new training run, we execute:

In [13]:
trainer.train()

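| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.371126        | 0.857843 | 0.901024 |
| 2     | 0.527200      | 0.45997         | 0.845588 | 0.893401 |
| 3     | 0.279600      | 0.774146        | 0.848039 | 0.894915 |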


TrainOutput(global_step=1377, training_loss=0.3341553964386319, metrics={'train_runtime': 298.648, 'train_samples_per_second': 36.846, 'train_steps_per_second': 4.611, 'total_flos': 405114969714960.0, 'train_loss': 0.3341553964386319, 'epoch': 3.0})

This time, it will report the validation loss and metrics at the end of each epoch on top of the training loss. Again, the exact accuracy/F1 score you reach might be a bit different from what we found, because of the random head initialization of the model, but it should be in the same ballpark.
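A likely explanation for the "No log" entry in the first epoch is that the training loss is only logged every 500 steps, and the first epoch ends before step 500. The sketch below works this out, assuming 459 steps per epoch (3,668 examples with a batch size of 8 on one device, which is an assumption) and the 500-step logging interval mentioned earlier.

```python
steps_per_epoch = 459  # assumption: ceil(3668 / 8) with batch size 8 on one device
logging_steps = 500    # the training loss is reported every 500 steps

last_logged_at = {}
for epoch in (1, 2, 3):
    end_step = epoch * steps_per_epoch
    # Most recent logging step that has been reached by the end of this epoch.
    last_step = (end_step // logging_steps) * logging_steps
    last_logged_at[epoch] = last_step if last_step > 0 else None

print(last_logged_at)
```

Under these assumptions, no loss has been logged by the end of epoch 1, the value shown for epoch 2 comes from step 500, and the value shown for epoch 3 comes from step 1000.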

The Trainer will work out of the box on multiple GPUs or TPUs and provides lots of options, like mixed-precision training (use fp16=True in your training arguments). We will go over everything it supports in Chapter 10.

This concludes the introduction to fine-tuning using the Trainer API. An example of doing this for most common NLP tasks will be given in Chapter 7, but for now let’s look at how to do the same thing in pure PyTorch.

✏️ Try it out! Fine-tune a model on the GLUE SST-2 dataset, using the data processing you did in section 2.