# **FINE-TUNING A PRETRAINED MODEL**


In this chapter, you'll fine-tune a pretrained model on your own dataset. Here's an overview of what you'll learn:

1. **Preparing a Large Dataset from the Hub:** Learn how to download and prepare a large dataset from the Hub for fine-tuning.

2. **Utilizing the High-Level Trainer API:** Explore the Trainer API, a high-level interface designed to simplify the fine-tuning process for various tasks.

3. **Implementing a Custom Training Loop:** Learn how to develop a custom training loop, offering more flexibility and control over the fine-tuning process.

4. **Leveraging the 🤗 Accelerate Library:** Discover how to use the 🤗 Accelerate library to effortlessly run your custom training loop on different distributed setups.

By the end of this chapter, you'll have the tools to tailor pretrained models to your specific needs.

## **Processing the data**

Building on the example from the previous chapter, here is how we would train a sequence classifier on a single batch in PyTorch:

In [1]:
import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Same as before
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "This course is amazing!",
]
batch = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")

# This is new
batch["labels"] = torch.tensor([1, 1])

optimizer = AdamW(model.parameters())
loss = model(**batch).loss
loss.backward()
optimizer.step()

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Of course, training a model on just two sentences won't yield good results. To do better, we need to prepare a larger dataset.

In this section, we'll use the MRPC (Microsoft Research Paraphrase Corpus) dataset as an example. This dataset, introduced in a paper by William B. Dolan and Chris Brockett, comprises 5,801 pairs of sentences. Each pair is labeled to indicate whether the sentences are paraphrases (i.e., if both sentences convey the same meaning). We've chosen this dataset for our chapter because of its modest size, making it conducive for experimentation and training exercises.

#### Loading a dataset from the Hub

The Hub isn't limited to models; it also hosts various datasets in numerous languages. You can explore the available datasets [here](https://huggingface.co/datasets), and we encourage you to try loading and processing a new dataset after completing this section (refer to the general documentation [here](https://huggingface.co/docs/datasets/loading_datasets.html#from-the-huggingface-hub)). However, for now, let's concentrate on the MRPC dataset. This dataset is one of the 10 datasets that constitute the [GLUE benchmark](https://gluebenchmark.com/), an academic benchmark used to assess the performance of machine learning models across 10 different text classification tasks.

The 🤗 Datasets library simplifies the process of downloading and caching a dataset from the Hub. We can obtain the MRPC dataset using the following command:

In [2]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

The output reveals a `DatasetDict` object containing the training set, validation set, and test set. Each set includes several columns such as 'sentence1', 'sentence2', 'label', and 'idx', along with a variable number of rows corresponding to the elements in each set. Specifically, there are 3,668 pairs of sentences in the training set, 408 in the validation set, and 1,725 in the test set.

By default, this command downloads and caches the dataset in the directory ~/.cache/huggingface/datasets. 

To access individual pairs of sentences in the `raw_datasets` object, you can use indexing similar to a dictionary:

In [3]:
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

The labels are already in integer format, eliminating the need for additional preprocessing. To understand which integer corresponds to each label, we can examine the features of our `raw_train_dataset`. This will provide information about the type of each column:

In [4]:
raw_train_dataset.features

{'sentence1': Value(dtype='string', id=None),
 'sentence2': Value(dtype='string', id=None),
 'label': ClassLabel(names=['not_equivalent', 'equivalent'], id=None),
 'idx': Value(dtype='int32', id=None)}

Behind the scenes, the `label` column is of type `ClassLabel`, and the mapping of integers to label names is stored in its `names` attribute: 0 corresponds to `not_equivalent`, and 1 corresponds to `equivalent`.

Try it out! Look at element 15 of the training set and element 87 of the validation set. What are their labels?

#### Preprocessing a dataset

To preprocess the dataset, we need to convert the text into numerical representations that the model can comprehend. As demonstrated in the previous chapter, this task is accomplished using a tokenizer. The tokenizer can handle either a single sentence or a list of sentences. Therefore, we can tokenize all the first sentences and all the second sentences of each pair directly, like so:

In [5]:
from transformers import AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenized_sentences_1 = tokenizer(raw_datasets["train"]["sentence1"])
tokenized_sentences_2 = tokenizer(raw_datasets["train"]["sentence2"])

However, we can't just pass two separate sequences to the model and get a prediction of whether the sentences are paraphrases. We need to handle the two sequences as a pair and apply the appropriate preprocessing. Fortunately, the tokenizer can also take a pair of sequences and prepare it the way our BERT model expects:

In [6]:
inputs = tokenizer("This is the first sentence.", "This is the second one.")
inputs

{'input_ids': [101, 2023, 2003, 1996, 2034, 6251, 1012, 102, 2023, 2003, 1996, 2117, 2028, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In Chapter 2, we covered the `input_ids` and `attention_mask` keys, but we postponed the discussion of `token_type_ids`. In this example, `token_type_ids` tells the model which part of the input is the first sentence and which is the second.

Try it out! Take element 15 of the training set and tokenize the two sentences separately and as a pair. What's the difference between the two results?

If we decode the IDs inside `input_ids` back to tokens:

In [7]:
tokenizer.convert_ids_to_tokens(inputs["input_ids"])

['[CLS]',
 'this',
 'is',
 'the',
 'first',
 'sentence',
 '.',
 '[SEP]',
 'this',
 'is',
 'the',
 'second',
 'one',
 '.',
 '[SEP]']

Therefore, we observe that the model anticipates inputs in the form of `[CLS] sentence1 [SEP] sentence2 [SEP]` when there are two sentences. Aligning this expectation with the `token_type_ids` yields:

In [8]:
['[CLS]', 'this', 'is', 'the', 'first', 'sentence', '.', '[SEP]', 'this', 'is', 'the', 'second', 'one', '.', '[SEP]']
[      0,      0,    0,     0,       0,          0,   0,       0,      1,    1,     1,        1,     1,   1,       1]

[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

As evident, the segments of the input corresponding to `[CLS] sentence1 [SEP]` all have a token type ID of 0, while the other segments, corresponding to `sentence2 [SEP]`, all have a token type ID of 1.
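
To make the layout concrete, here is a minimal sketch that rebuilds the same structure by hand. The `build_pair` helper is hypothetical and works on tokens that are already split; it is not how the real tokenizer is implemented, only an illustration of the `[CLS] sentence1 [SEP] sentence2 [SEP]` format and its token type IDs.

```python
def build_pair(tokens_a, tokens_b):
    """Toy illustration of BERT-style pair formatting (not the real tokenizer)."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], the first sentence, and the first [SEP];
    # segment 1 covers the second sentence and the final [SEP].
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair(
    ["this", "is", "the", "first", "sentence", "."],
    ["this", "is", "the", "second", "one", "."],
)

# Print each token next to its segment ID.
for token, type_id in zip(tokens, token_type_ids):
    print(f"{token:>10} {type_id}")
```

The printed pairs should line up with the `token_type_ids` returned by the tokenizer for the same two sentences.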

Note that if you choose a different checkpoint, you won't necessarily have `token_type_ids` in your tokenized inputs (for instance, they are not returned when using a DistilBERT model). They are only returned when the model knows what to do with them, because it saw them during pretraining.

BERT is pretrained with token type IDs. In addition to the masked language modeling objective discussed in Chapter 1, it has an extra objective known as *next sentence prediction*. This task involves providing the model with pairs of sentences (with randomly masked tokens) and asking it to predict whether the second sentence follows the first. To make the task non-trivial, half of the time the sentences follow each other in the original document, and the other half of the time they come from two different documents.

Generally, you don't need to concern yourself with whether there are token_type_ids in your tokenized inputs. As long as you use the same checkpoint for the tokenizer and the model, everything will function correctly, as the tokenizer knows what information to provide to the model.

Now that we've demonstrated how the tokenizer handles a pair of sentences, we can utilize it to tokenize the entire dataset. Similar to the previous chapter, we can feed the tokenizer a list of pairs of sentences by providing it with the list of first sentences followed by the list of second sentences. This approach is also compatible with the padding and truncation options discussed in Chapter 2. Therefore, one way to preprocess the training dataset is as follows:

In [9]:
tokenized_dataset = tokenizer(
    raw_datasets["train"]["sentence1"],
    raw_datasets["train"]["sentence2"],
    padding=True,
    truncation=True,
)

While this approach is effective, it has the drawback of returning a dictionary (with keys such as input_ids, attention_mask, and token_type_ids, and values that are lists of lists). Moreover, it will only work if you have sufficient RAM to store your entire dataset during tokenization. In contrast, datasets from the 🤗 Datasets library are Apache Arrow files stored on disk, meaning you only load the samples you request into memory.

To maintain the data as a dataset, we can leverage the Dataset.map() method. This also provides additional flexibility, allowing for additional preprocessing beyond tokenization. The map() method functions by applying a specified function to each element of the dataset. Therefore, let's define a function that handles the tokenization:

In [10]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

This function takes a dictionary (like the items of our dataset) and returns a new dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. It also works if the `example` dictionary contains several samples (each key then holding a list of sentences), since the tokenizer operates on lists of pairs of sentences, as demonstrated earlier. This lets us use the option `batched=True` in our call to `map()`, which greatly speeds up tokenization. The tokenizer is backed by a Rust implementation from the 🤗 Tokenizers library, which can be very fast, but only if we give it many inputs at once.
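
The contract behind `batched=True` can be sketched without any library. In the toy code below, `toy_batched_map` and `toy_tokenize` are hypothetical stand-ins (they are not part of 🤗 Datasets): the function receives a dictionary whose values are lists covering several examples, and each key it returns becomes a new column.

```python
def toy_tokenize(batch):
    # With batched processing, every value in `batch` is a list of examples.
    return {
        "n_words": [
            len(s1.split()) + len(s2.split())
            for s1, s2 in zip(batch["sentence1"], batch["sentence2"])
        ]
    }


def toy_batched_map(columns, function, batch_size=2):
    """Hypothetical stand-in for a batched map over a dict of columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch (a dict of lists).
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned keys are added as new columns next to the existing ones.
    return {**columns, **new_columns}


columns = {
    "sentence1": ["a b c", "d e", "f"],
    "sentence2": ["g h", "i", "j k l m"],
}
result = toy_batched_map(columns, toy_tokenize)
print(result["n_words"])
```

Because the function is called once per batch rather than once per example, any per-call overhead is shared across the whole batch, which is where the speedup comes from.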

Notably, we have deferred incorporating the padding argument into our tokenization function. This is because padding all the samples to the maximum length is inefficient. Instead, it's more effective to pad the samples when constructing a batch, as this only requires padding to the maximum length within that specific batch, rather than the maximum length across the entire dataset. This approach can lead to substantial time and processing power savings when dealing with inputs of varying lengths.

Here's how we can apply the tokenization function to all our datasets at once. Because we pass `batched=True` in our call to `map()`, the function is applied to multiple elements of the dataset per call, which makes preprocessing faster:

In [11]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

The 🤗 Datasets library applies this processing by adding new fields to the datasets, one for each key in the dictionary returned by the preprocessing function.

You can further expedite the application of your preprocessing function using multiprocessing with `map()` by including a `num_proc` argument. In this case, we didn't employ multiprocessing because the 🤗 Tokenizers library already utilizes multiple threads to tokenize our samples faster. However, if you're not using a fast tokenizer supported by this library, incorporating multiprocessing could enhance your preprocessing speed.

Our `tokenize_function` returns a dictionary with the keys `input_ids`, `attention_mask`, and `token_type_ids`. Consequently, these three fields are added to all splits of our dataset. Note that we could also have changed existing fields if our preprocessing function had returned new values for keys already present in the dataset to which we applied `map()`.

The final step is to pad all the examples to the length of the longest element when we batch elements together, a technique known as *dynamic padding*.

#### Dynamic padding

The function responsible for assembling samples inside a batch is referred to as a *collate function*. It's an argument that you can provide when creating a DataLoader, with the default being a function that converts your samples to PyTorch tensors and concatenates them (recursively if your elements are lists, tuples, or dictionaries). However, in our case, this default behavior won't be suitable since our inputs won't all be of the same size. We intentionally delayed the padding step to apply it only as needed for each batch, reducing the amount of padding and improving training efficiency. However, note that if you're training on a TPU, this approach might cause issues, as TPUs prefer fixed shapes, even if it requires additional padding.

In practice, you need to define a collate function that applies the correct amount of padding to the items in the dataset you want to batch together. Fortunately, the 🤗 Transformers library provides a solution through the `DataCollatorWithPadding` class. When instantiated, it takes a tokenizer as an argument (to determine which padding token to use and whether the model expects padding on the left or right of the inputs) and handles all the necessary padding:

In [12]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

The collate function outputs a dictionary of padded tensors ready to be fed to the model. Its keys (`input_ids`, `attention_mask`, `token_type_ids`, and `labels`) match the arguments the BERT model expects, and within a batch every sequence is padded to the length of the longest example in that batch.

To test this, let's select a few samples from our training set that we want to batch together. Here, we remove the columns `idx`, `sentence1`, and `sentence2`, as they are unnecessary and contain strings (and we can't create tensors with strings), then look at the length of each entry in the batch:

In [13]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples["input_ids"]]

[50, 59, 47, 67, 59, 50, 62, 32]

As expected, we have samples of varying lengths in the batch, ranging from 32 to 67. With dynamic padding, these samples should all be padded to the length of 67, which is the maximum length within the batch. Without dynamic padding, all samples would need to be padded to the maximum length in the entire dataset or the maximum length the model can accept. Let's double-check that our data collator is dynamically padding the batch correctly:

In [14]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
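
As a rough illustration of what dynamic padding saves, the sketch below pads a batch with the same lengths as above to its longest member and compares the result with padding everything to 512 tokens. The `pad_to_longest` helper is hypothetical, and the pad ID of 0, right-side padding, and the 512-token limit are assumptions that match `bert-base-uncased`; this is not the `DataCollatorWithPadding` implementation.

```python
def pad_to_longest(sequences, pad_id=0):
    """Pad every sequence in one batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


lengths = [50, 59, 47, 67, 59, 50, 62, 32]
fake_batch = [[1] * n for n in lengths]  # placeholder token IDs, not real text
padded = pad_to_longest(fake_batch)

# Total number of token positions the model has to process for this batch.
dynamic_cost = len(padded["input_ids"]) * len(padded["input_ids"][0])
fixed_cost = len(lengths) * 512  # assumed model maximum length
print(dynamic_cost, fixed_cost)
```

For this batch, padding to the longest example means far fewer token positions than padding to the model maximum, which is why the padding step is deferred to batch construction.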

We now have a fully prepared dataset: it is tokenized, and the data collator pads each batch dynamically. We're ready to fine-tune!

## **Fine-tuning a model with the Trainer API** 

If you're continuing from the previous section, you've already preprocessed your dataset, and you can now fine-tune a pretrained model with the `Trainer` class. Make sure your environment is set up first; if you don't have access to a GPU, platforms such as Google Colab provide free GPU resources.

The code below recaps the preprocessing from the previous section:

In [15]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

### Training

Before we dive into the Trainer setup, let's establish the hyperparameters using the TrainingArguments class. The primary requirement is to specify a directory to save the trained model and its checkpoints. The default values for other hyperparameters are generally suitable for basic fine-tuning, so there's no need to adjust them unless you have specific requirements. We're essentially preparing the configuration that the Trainer will use during the training and evaluation process. Once we have these arguments defined, we'll proceed to initialize the Trainer and start the fine-tuning process.

In [16]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

Next, let's define our model. Similar to what we did in the previous chapter, we'll use the `AutoModelForSequenceClassification` class. In this case, our model will be configured to handle a binary classification task with two labels. This step sets the foundation for the specific architecture we want to fine-tune on our dataset. After defining the model, we'll proceed to set up the Trainer.

In [17]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


After defining the model, you might notice a warning. This is because BERT, in its pretrained form, wasn't specifically pretrained for classifying pairs of sentences. Consequently, the head of the pretrained model, which was designed for a different task, has been replaced with a new head suitable for sequence classification. The warning informs you that some weights from the original head were not used, and some were randomly initialized for the new head. This is expected behavior when fine-tuning models.

Now, let's proceed to define a Trainer. This is done by passing all the necessary components we've constructed so far—our model, training arguments, training and validation datasets, data collator, and tokenizer—to the Trainer class:

In [18]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

When providing the tokenizer during the Trainer instantiation, the default data collator used by the Trainer will be a `DataCollatorWithPadding` (as defined earlier), so you can omit the `data_collator=data_collator` argument in the Trainer call. This part of the processing was highlighted in section 2 to give you a deeper understanding.

Now, to initiate the fine-tuning process on our dataset, we simply call the `train()` method of our Trainer:

In [19]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


This initiates the fine-tuning process, which typically takes a few minutes on a GPU. The training loss will be reported every 500 steps. However, the Trainer won't provide information on how well or poorly your model is performing. This is because:

1. We didn't instruct the Trainer to evaluate during training by setting `evaluation_strategy` to either "steps" (evaluate every `eval_steps`) or "epoch" (evaluate at the end of each epoch). In recent 🤗 Transformers releases, this argument is named `eval_strategy`.
2. We didn't supply the Trainer with a `compute_metrics()` function to calculate a metric during the evaluation. Otherwise, the evaluation would only print the loss, which isn't the most informative metric.

### Evaluation

Next, let's explore how to construct a useful `compute_metrics()` function and apply it during training. The function should accept an `EvalPrediction` object (a named tuple with `predictions` and `label_ids` fields) and return a dictionary mapping strings to floats. The strings represent the metric names, and the floats are their corresponding values. To obtain predictions from our model, we can use the `Trainer.predict()` command:

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

The output of the `predict()` method is another named tuple with three fields: `predictions`, `label_ids`, and `metrics`. The `metrics` field will contain the loss on the dataset passed, as well as some time metrics (indicating how long it took to predict, both in total and on average). Once we complete our `compute_metrics()` function and pass it to the Trainer, that field will also include the metrics returned by `compute_metrics()`.

The `predictions` field is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used). These are the logits for each element of the dataset passed to `predict()` (as seen in the previous chapter, all Transformer models return logits). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

Now we can compare those `preds` to the labels. To build our `compute_metrics()` function, we will rely on the metrics from the 🤗 Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the `evaluate.load()` function. The object returned has a `compute()` method that we can use to perform the metric calculation:

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

The exact results you get may vary, as the random initialization of the model head might change the metrics it achieves. In our run, the model reached an accuracy of 85.78% on the validation set and an F1 score of 89.97. Those are the two metrics used to evaluate results on the MRPC dataset for the GLUE benchmark. The table in the BERT paper reported an F1 score of 88.9 for the base model. That figure was measured on the GLUE test set, whereas ours comes from the validation set, so the two numbers are not directly comparable.
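
To see what these two metrics measure, here is a minimal pure-Python sketch that turns logits into predictions and computes accuracy and F1 by hand. The `accuracy_and_f1` helper and the toy logits are made up for illustration, and treating label 1 (`equivalent`) as the positive class is an assumption; in practice the 🤗 Evaluate metric does this work for you.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(logits, labels, positive=1):
    preds = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    # Counts for the positive class (assumed to be label 1).
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(labels), "f1": f1}


# Made-up logits for five sentence pairs, two values per pair.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4], [-0.5, 0.9]]
toy_labels = [1, 0, 0, 1, 1]
scores = accuracy_and_f1(toy_logits, toy_labels)
print(scores)
```

Accuracy counts every correct prediction equally, while F1 balances precision and recall on the positive class, which matters when the two classes are not equally frequent.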

Combining everything together, we get our `compute_metrics()` function:

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

And to see it used in action to report metrics at the end of each epoch, here is how we define a new Trainer with this compute_metrics() function:

In [None]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Note that we create a new `TrainingArguments` with its `evaluation_strategy` set to "epoch" and a new model; otherwise, we would just be continuing the training of the model we have already trained. To launch the new training run, we execute:

In [None]:
trainer.train()

This time, in addition to reporting the training loss, the validation loss and metrics will also be reported at the end of each epoch. The exact accuracy and F1 scores you achieve may vary slightly from ours due to the model's random head initialization, but they should be within a similar range.

The Trainer works out of the box on multiple GPUs or TPUs and offers a variety of features, such as mixed-precision training (enable it by setting `fp16=True` in your training arguments). We'll explore all of its capabilities in Chapter 10.

This concludes the introductory section on fine-tuning using the Trainer API.

## **Full Training**

We'll now demonstrate how to achieve the same results as in the previous section without using the Trainer class. This assumes you've already completed the data processing steps outlined in section 2. Here's a concise summary of everything you'll need:

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

#### Prepare for training

Before diving into the training loop, we need to define a few essential objects. The first are the dataloaders that will enable us to iterate over batches of training data. However, before constructing these dataloaders, we need to perform some postprocessing on our tokenized datasets to handle tasks that the Trainer class typically handles automatically. Specifically, we need to:

1. Remove columns containing values that the model doesn't expect, such as the "sentence1" and "sentence2" columns.

2. Rename the "label" column to "labels" to match the model's expected argument name.

3. Convert the datasets' format to return PyTorch tensors instead of lists.

Our tokenized_datasets provides a convenient method for each of these steps:


In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
tokenized_datasets["train"].column_names

The last line lets us verify that the resulting dataset only contains columns that our model expects. With that done, we can define our dataloaders:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

To ensure that there are no errors in the data processing, we can inspect a batch as follows:

In [None]:
for batch in train_dataloader:
    break
{k: v.shape for k, v in batch.items()}

Keep in mind that the actual shapes might differ slightly for you since we set shuffle=True for the training dataloader and are padding to the maximum length within the batch.

Now that we've conquered the data preprocessing phase (an ever-elusive yet satisfying milestone for any ML practitioner), let's shift our focus to the model. We'll instantiate it precisely as we did in the previous section:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

To make sure that everything will run smoothly during training, we first pass our batch to the model:

In [None]:
outputs = model(**batch)
print(outputs.loss, outputs.logits.shape)

All 🤗 Transformers models return the loss when `labels` are provided, along with the logits (two for each input in our batch, resulting in a tensor of size 8 x 2).

We're almost ready to craft our training loop! Only two components are missing: an optimizer and a learning rate scheduler. To replicate the Trainer's behavior manually, we'll use the same default settings. The Trainer employs the AdamW optimizer, an enhanced version of Adam that incorporates a modification for weight decay regularization (explained in Ilya Loshchilov and Frank Hutter's paper "[Decoupled Weight Decay Regularization](https://arxiv.org/abs/1711.05101)"):

In [None]:
from torch.optim import AdamW  # the AdamW class in transformers is deprecated

optimizer = AdamW(model.parameters(), lr=5e-5)

To complete our setup, we need to define the learning rate scheduler. By default, the Trainer uses a linear decay from the maximum value (5e-5) to 0. To properly define it, we need to know the total number of training steps, which is the product of the number of epochs and the number of training batches (the length of our training dataloader). The Trainer uses three epochs by default, so we will follow suit:

In [None]:
from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
print(num_training_steps)
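
The arithmetic behind the step count and the linear decay can be sketched in plain Python. The dataset size below (3,668 examples, the size of the MRPC training split) is an assumed input, and `linear_schedule_factor` is a hypothetical helper that mirrors the shape of a linear schedule with optional warmup rather than reproducing the library code.

```python
import math


def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Assumed numbers for illustration: 3,668 training examples, batch size 8, 3 epochs
num_examples, batch_size, num_epochs = 3668, 8, 3
batches_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is smaller
num_training_steps = num_epochs * batches_per_epoch

base_lr = 5e-5
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, base_lr * linear_schedule_factor(step, num_training_steps))
```

With no warmup, the learning rate starts at its maximum value on the first step and reaches 0 exactly at the last one.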

#### The training loop

Before we move on to the training loop, let's ensure we use the GPU if available. We'll define a device to place our model and batches on:

In [None]:
import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
device

Now, let's proceed with training. We'll add a progress bar to get an estimate of when the training will be completed, using the tqdm library:

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
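
The order of calls in the loop matters because PyTorch adds each new gradient to the one already stored rather than overwriting it. The toy sketch below uses a hypothetical `ToyParam` class in plain Python instead of PyTorch, and mimics that accumulate-then-reset behavior on a one-parameter problem.

```python
class ToyParam:
    """A scalar parameter whose gradient accumulates across backward passes."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def backward(self, x, y):
        # Gradient of the squared error (value * x - y) ** 2 w.r.t. value,
        # added to whatever is already stored in self.grad
        self.grad += 2 * (self.value * x - y) * x

    def step(self, lr):
        self.value -= lr * self.grad

    def zero_grad(self):
        self.grad = 0.0


w = ToyParam(0.0)
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)] * 20  # samples of y = 2x
for x, y in data:
    w.backward(x, y)   # compute the gradient for this "batch"
    w.step(lr=0.05)    # update the parameter
    w.zero_grad()      # reset, so the next gradient starts from zero
print(round(w.value, 3))
```

Without the `zero_grad()` call, every update would also include the gradients of all earlier batches.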

Let's add an evaluation loop to get insights into how our model is performing.

#### The evaluation loop

In the evaluation loop, we use a metric from the 🤗 Evaluate library. We have already seen the metric.compute() method, which returns the final result. Metrics can also accumulate predictions batch by batch with the add_batch() method as we go through the prediction loop. Once every batch has been added, metric.compute() returns the result over the whole evaluation set.

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

The results may vary slightly because of the random initialization of the model head and the data shuffling, but they should be in the same ballpark.
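
The accumulate-then-compute pattern can be sketched in plain Python. The class `RunningAccuracyF1` below is a hypothetical stand-in that reports accuracy and F1 for binary labels. It only illustrates the pattern and is not the 🤗 Evaluate implementation.

```python
class RunningAccuracyF1:
    """Accumulates predictions batch by batch, then computes accuracy and F1."""

    def __init__(self):
        self.predictions = []
        self.references = []

    def add_batch(self, predictions, references):
        self.predictions.extend(predictions)
        self.references.extend(references)

    def compute(self):
        pairs = list(zip(self.predictions, self.references))
        correct = sum(p == r for p, r in pairs)
        tp = sum(p == 1 and r == 1 for p, r in pairs)
        fp = sum(p == 1 and r == 0 for p, r in pairs)
        fn = sum(p == 0 and r == 1 for p, r in pairs)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        return {"accuracy": correct / len(pairs), "f1": f1}


toy_metric = RunningAccuracyF1()
toy_metric.add_batch(predictions=[1, 0, 1, 1], references=[1, 0, 0, 1])
toy_metric.add_batch(predictions=[0, 1], references=[1, 1])
result = toy_metric.compute()
print(result)
```

Computing the score once over all accumulated predictions matters for metrics such as F1, since averaging per-batch F1 values would generally give a different number.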

#### Enhance the efficiency of your training loop with 🤗 Accelerate

The training loop we wrote above works well on a single CPU or GPU. With the 🤗 Accelerate library and a few small adjustments, the same loop can run as distributed training across multiple GPUs or TPUs. Starting from the creation of the training and validation dataloaders, here is what our manual training loop looks like:

In [None]:
# Manual training Loop

from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

And here are the changes (lines starting with + are added, lines starting with - are removed):

    + from accelerate import Accelerator
      from torch.optim import AdamW
      from transformers import AutoModelForSequenceClassification, get_scheduler

    + accelerator = Accelerator()

      model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
      optimizer = AdamW(model.parameters(), lr=3e-5)

    - device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
    - model.to(device)

    + train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
    +     train_dataloader, eval_dataloader, model, optimizer
    + )

      num_epochs = 3
      num_training_steps = num_epochs * len(train_dataloader)
      lr_scheduler = get_scheduler(
          "linear",
          optimizer=optimizer,
          num_warmup_steps=0,
          num_training_steps=num_training_steps
      )

      progress_bar = tqdm(range(num_training_steps))

      model.train()
      for epoch in range(num_epochs):
          for batch in train_dataloader:
    -         batch = {k: v.to(device) for k, v in batch.items()}
              outputs = model(**batch)
              loss = outputs.loss
    -         loss.backward()
    +         accelerator.backward(loss)

              optimizer.step()
              lr_scheduler.step()
              optimizer.zero_grad()
              progress_bar.update(1)

The first line to add is the import statement. The second line involves instantiating an Accelerator object that analyzes the environment and initializes the appropriate distributed setup. 🤗 Accelerate takes care of device placement, so you can omit the lines responsible for placing the model on the device. Alternatively, you can modify them to use `accelerator.device` instead of `device`.

The key step is the line that sends the dataloaders, the model, and the optimizer to `accelerator.prepare()`. This call wraps those objects in the proper containers so that distributed training works as intended. The remaining changes are removing the line that places the batch on the device (or, if you prefer to keep it, changing it to use `accelerator.device`) and replacing `loss.backward()` with `accelerator.backward(loss)`. The complete training loop with 🤗 Accelerate looks like this:

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_scheduler

accelerator = Accelerator()

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
optimizer = AdamW(model.parameters(), lr=3e-5)

train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dl:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
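
One detail worth noting in the loop above is that `num_training_steps` is computed from the prepared dataloader. In a multi-process run, each process typically sees only its share of the batches, so the per-process length is smaller. The sketch below illustrates the idea in plain Python with a hypothetical `shard_batches` helper. It is a conceptual picture under that assumption, not how 🤗 Accelerate is implemented.

```python
def shard_batches(batches, num_processes, process_index):
    """Give each process every num_processes-th batch, starting at its own index."""
    return batches[process_index::num_processes]


all_batches = list(range(12))  # stand-ins for 12 batches
num_processes = 4
shards = [shard_batches(all_batches, num_processes, i) for i in range(num_processes)]

for i, shard in enumerate(shards):
    print(f"process {i}: {shard}")

steps_single = 3 * len(all_batches)     # 3 epochs on one process
steps_per_process = 3 * len(shards[0])  # 3 epochs when split over 4 processes
print(steps_single, steps_per_process)  # 36 9
```

Computing the step count after the split keeps the learning rate schedule aligned with the number of optimizer steps each process actually takes.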

## **Summary**

Fantastic job! To summarize, in this chapter you learned about datasets on the Hub and how to load and preprocess them, including dynamic padding and collators. You also implemented your own fine-tuning and evaluation of a model, wrote a lower-level training loop, and used 🤗 Accelerate to adapt that loop to multiple GPUs or TPUs. You now have a solid understanding of how to fine-tune models on your own data. Great work!