# Fine-tuning a model with the Trainer API


Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

The cell below installs the three packages this notebook depends on: `datasets`, `evaluate`, and `transformers[torch]`.

In a notebook, the leading `!` runs `pip install` as a shell command rather than as Python code.

The `datasets` package provides tools for downloading, loading, and preprocessing datasets, including many used for text classification, question answering, and machine translation.

The `evaluate` package provides metrics for measuring model performance, such as accuracy, precision, recall, and F1 score.

The `transformers` package provides pre-trained models for natural language processing, such as BERT, GPT, and RoBERTa. The `[torch]` extra installs PyTorch along with it, which the `Trainer` API used in this notebook relies on.

In [1]:
!pip install datasets evaluate transformers[torch]

Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting transformers[torch]
  Downloading transformers-4.32.0-py3-none-any.whl (7.5 MB)

# Code Documentation

## Importing Libraries

The code begins by importing the necessary libraries. The `load_dataset` function is imported from the `datasets` module, while the `AutoTokenizer` and `DataCollatorWithPadding` classes are imported from the `transformers` module.

## Loading the Dataset

The code then loads the MRPC (Microsoft Research Paraphrase Corpus) dataset, which is part of the GLUE benchmark, using the `load_dataset` function from the `datasets` module. The resulting train, validation, and test splits are stored in the `raw_datasets` variable.

## Initializing the Tokenizer

The code initializes the tokenizer using the `AutoTokenizer.from_pretrained` method from the `transformers` module. The tokenizer is initialized with the "bert-base-uncased" checkpoint, which is a pre-trained BERT model.

## Tokenizing the Dataset

The code defines a `tokenize_function` that takes a batch of examples and tokenizes each pair of sentences (`sentence1` and `sentence2`) by calling the tokenizer with `truncation=True`. Applying this function to `raw_datasets` with the `map` method and `batched=True` produces the tokenized datasets, which are stored in the `tokenized_datasets` variable. Finally, a `DataCollatorWithPadding` is created from the tokenizer to assemble the examples into batches.

In [2]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


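def tokenize_function(example):
    # Truncate only; padding is left to the data collator, which pads each batch dynamically.
    # The Trainer moves batches to the GPU itself, so no tensors or .to("cuda") are needed here.
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)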

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Generating train split:   0%|          | 0/3668 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/408 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1725 [00:00<?, ? examples/s]
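The idea behind padding at batch time can be illustrated with a minimal sketch. This is a simplified stand-in written for illustration, not the implementation of `DataCollatorWithPadding`; the helper name `pad_batch` and the token ids are hypothetical. It shows that each batch is padded only to the length of its own longest sequence, which avoids padding every example to a single global maximum.

```python
# Simplified stand-in for dynamic padding; not the library's implementation.
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in `features` to the longest one in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding that the model should ignore.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Hypothetical token ids, chosen only for illustration.
short_batch = pad_batch([{"input_ids": [101, 7592, 102]}, {"input_ids": [101, 102]}])
long_batch = pad_batch([{"input_ids": [101, 1, 2, 3, 4, 102]}, {"input_ids": [101, 102]}])

print(len(short_batch["input_ids"][0]))  # 3
print(len(long_batch["input_ids"][0]))   # 6
```

The two batches end up with different widths (3 and 6), which is the point of padding per batch rather than once for the whole dataset.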

The code below imports the `TrainingArguments` class from the `transformers` library and creates an instance of it.

`TrainingArguments` holds the hyperparameters and settings that the `Trainer` uses for training and evaluation.

The only argument passed here is the output directory, "test-trainer", where checkpoints of the model are saved.

Every other setting keeps its default value, which includes 3 training epochs, a per-device batch size of 8, and a learning rate of 5e-5. These defaults are a reasonable starting point and can be overridden for a specific task.

In [3]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer")

The code below imports the `AutoModelForSequenceClassification` class from the `transformers` library and uses it to load the model.

`AutoModelForSequenceClassification` loads a pre-trained model with a classification head on top, for tasks such as sentiment analysis or, as here, paraphrase detection.

The `from_pretrained` method loads the pre-trained weights. Its first argument, `checkpoint`, is the name or path of the model to load.

The `num_labels` parameter sets the number of output classes. It is set to 2 because MRPC is a binary classification task.

Loading the model prints a warning that `classifier.weight` and `classifier.bias` are newly initialized. This is expected: the pre-trained BERT checkpoint has no classification head for this task, so a new one is created with random weights and trained during fine-tuning.

In [4]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The code below imports the `Trainer` class from the `transformers` library and creates an instance from the objects defined so far.

1. The `model` argument is the pre-trained model to fine-tune.

2. The `training_args` argument contains the training configuration and hyperparameters.

3. The `train_dataset` and `eval_dataset` arguments are the training and evaluation datasets. Each is a single tokenized split, here `tokenized_datasets["train"]` and `tokenized_datasets["validation"]`.

4. The `data_collator` argument builds batches from individual examples, padding each batch to the length of its longest sequence.

5. The `tokenizer` argument is the tokenizer that was used to preprocess the data. The `Trainer` saves it together with the model.

In [5]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

The `trainer.train()` call starts the fine-tuning process.
It is a method of the `Trainer` instance created above.
It runs the training loop over the training dataset and updates the model's parameters in place.
The model, datasets, and training arguments must already be set on the `Trainer` before calling it.
It returns a `TrainOutput` containing the final step count, the average training loss, and runtime metrics. Because no evaluation strategy or `compute_metrics` function was provided, only the training loss is reported, once every 500 steps by default.

In [6]:
trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|------|---------------|
| 500  | 0.5627        |
| 1000 | 0.3479        |


TrainOutput(global_step=1377, training_loss=0.3894335789012147, metrics={'train_runtime': 94.9278, 'train_samples_per_second': 115.92, 'train_steps_per_second': 14.506, 'total_flos': 581185599073680.0, 'train_loss': 0.3894335789012147, 'epoch': 3.0})
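The `global_step=1377` in the output above can be reproduced with a little arithmetic. The sketch below assumes the default `TrainingArguments` values (a per-device batch size of 8 and 3 epochs), a single device, and no gradient accumulation; the variable names are chosen here for illustration.

```python
import math

train_examples = 3668   # size of the MRPC train split reported when loading
batch_size = 8          # default per_device_train_batch_size; assumes one device
num_epochs = 3          # default num_train_epochs

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 459 1377
```

With the loss logged every 500 steps, this also explains why only steps 500 and 1000 appear in the training log.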

The code below generates predictions for the validation set with the fine-tuned model.

`trainer.predict()` is called with `tokenized_datasets["validation"]` and returns a named tuple with three fields: `predictions`, `label_ids`, and `metrics`.

`predictions.predictions` holds the model's logits. Its shape is `(408, 2)`: one row for each of the 408 validation examples and one column for each of the two classes.

`predictions.label_ids` holds the true labels. Its shape is `(408,)`: one label per example.

In [7]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)

(408, 2) (408,)


The code below imports `numpy` and converts the logits into class predictions.

The `np.argmax` function returns the indices of the maximum values along a given axis.

Here it is applied to `predictions.predictions` with `axis=-1`, so the maximum is taken along the last axis, across the two class logits of each example.

The resulting `preds` array contains, for each example, the index of the class with the highest logit, which is the model's predicted label.

In [8]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)
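A small worked example may make the `argmax` step more concrete. The logits below are hypothetical values chosen for illustration, not model output. The sketch also applies a softmax to show that converting logits to probabilities does not change which class is selected.

```python
import numpy as np

# Hypothetical logits for three examples and two classes.
toy_logits = np.array([[2.0, -1.0], [0.3, 0.9], [-0.5, -0.2]])

# Softmax, shifted by the row maximum for numerical stability.
shifted = toy_logits - toy_logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

toy_preds = np.argmax(toy_logits, axis=-1)
print(toy_preds)  # [0 1 1]

# Softmax preserves the ordering within each row, so the argmax is unchanged.
print(np.array_equal(toy_preds, np.argmax(probs, axis=-1)))  # True
```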

The code below uses the `evaluate` library to score these predictions.

First, the `evaluate` module is imported.

Next, the metric is loaded with `evaluate.load`. The arguments "glue" and "mrpc" select the metric associated with the MRPC task of the GLUE benchmark.

Finally, the metric's `compute` method is called with two arguments, `predictions` and `references`. It compares the predicted labels (`preds`) with the true labels (`predictions.label_ids`).

The result is a dictionary with two entries, `accuracy` and `f1`. In this run the model reaches an accuracy of about 86.5% and an F1 score of about 90.3% on the validation set. The exact values can vary between runs because the classification head is randomly initialized.

In [9]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8651960784313726, 'f1': 0.9033391915641477}
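To see what the two reported numbers measure, accuracy and F1 can be computed by hand on a tiny example. This is a sketch with hypothetical predictions and labels, not the `evaluate` implementation; it treats label 1 (paraphrase) as the positive class.

```python
import numpy as np

# Hypothetical predictions and labels (1 = paraphrase, 0 = not a paraphrase).
toy_preds = np.array([1, 1, 0, 1, 0, 1])
toy_labels = np.array([1, 0, 0, 1, 1, 1])

# Accuracy: fraction of predictions that match the labels.
accuracy = (toy_preds == toy_labels).mean()

tp = np.sum((toy_preds == 1) & (toy_labels == 1))  # true positives
fp = np.sum((toy_preds == 1) & (toy_labels == 0))  # false positives
fn = np.sum((toy_preds == 0) & (toy_labels == 1))  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(float(accuracy), 4), round(float(f1), 4))  # 0.6667 0.75
```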

The `compute_metrics` function below wraps the metric computation so that the `Trainer` can report metrics during evaluation.

The `Trainer` calls it with a single argument, `eval_preds`, which unpacks into the logits and the labels.

The function first loads the evaluation metric using the `load` function from the `evaluate` module. In this case, it is loading the "glue" metric for the "mrpc" task.

Next, it assigns the logits and labels from `eval_preds` to separate variables.

Then, it uses `np.argmax` to get the predicted labels by finding the index of the maximum value in the logits array along the last axis.

Finally, it calls the `compute` method of the loaded metric, passing in the predicted labels (`predictions`) and the true labels (`labels`). The computed metrics are returned as the output of the function.

In [10]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

The code below puts these pieces together to train the model again, this time reporting metrics during training.

1. `training_args` is a new `TrainingArguments` instance. The output directory is again "test-trainer", and the evaluation strategy is set to "epoch", so evaluation runs at the end of each epoch.

2. `model` is a new `AutoModelForSequenceClassification` instance loaded from the pretrained checkpoint with two labels. Creating a new model means that training starts again from the pre-trained weights rather than continuing from the model fine-tuned earlier.

3. `trainer` is a new `Trainer` instance that takes the `model`, the `training_args`, and the other arguments. It is responsible for training and evaluating the model. The `train_dataset` and `eval_dataset` are the tokenized train and validation splits.

4. `data_collator` is the `DataCollatorWithPadding` created earlier. It assembles examples into batches and pads each batch to a common length.

5. `tokenizer` is the tokenizer that converts the input text into the numerical representation the model processes.

6. `compute_metrics` is the function defined above. The `Trainer` calls it at each evaluation to compute the accuracy and F1 score.

In [11]:
training_args = TrainingArguments("test-trainer", evaluation_strategy="epoch")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calling `trainer.train()` starts the training process again.
The `trainer` object must have been created by the previous cell.
As before, the method iterates over the training dataset and updates the model's parameters.
This time, the `Trainer` also evaluates the model at the end of each epoch and reports the validation loss, accuracy, and F1 score alongside the training loss.
The training loss for the first epoch shows "No log" because the loss is logged every 500 steps by default and the first epoch ends before step 500.

In [12]:
trainer.train()

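| Epoch | Training Loss | Validation Loss | Accuracy | F1       |
|-------|---------------|-----------------|----------|----------|
| 1     | No log        | 0.36852         | 0.835784 | 0.883882 |
| 2     | 0.506600      | 0.510749        | 0.848039 | 0.896667 |
| 3     | 0.262900      | 0.710089        | 0.852941 | 0.895833 |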


TrainOutput(global_step=1377, training_loss=0.30816014053688245, metrics={'train_runtime': 102.3584, 'train_samples_per_second': 107.505, 'train_steps_per_second': 13.453, 'total_flos': 581095154648400.0, 'train_loss': 0.30816014053688245, 'epoch': 3.0})
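The per-epoch results above can be read in more than one way. The validation loss rises from epoch to epoch while the training loss falls, which may indicate overfitting, yet accuracy still improves slightly. The sketch below copies the values from the results table and picks the best epoch under each criterion; the variable names are chosen here for illustration.

```python
# Values copied from the results table: (epoch, validation loss, accuracy, F1).
epoch_results = [
    (1, 0.368520, 0.835784, 0.883882),
    (2, 0.510749, 0.848039, 0.896667),
    (3, 0.710089, 0.852941, 0.895833),
]

best_by_loss = min(epoch_results, key=lambda row: row[1])[0]
best_by_accuracy = max(epoch_results, key=lambda row: row[2])[0]
best_by_f1 = max(epoch_results, key=lambda row: row[3])[0]

print(best_by_loss, best_by_accuracy, best_by_f1)  # 1 3 2
```

The three criteria disagree in this run, so which checkpoint counts as best depends on the metric that matters for the task.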