# Training and fine-tuning

Model classes in 🤗 Transformers are designed to be compatible with native PyTorch and TensorFlow 2 and can be used seamlessly with either. In this quickstart, we will show how to fine-tune (or train from scratch) a model using the standard training tools available in either framework. We will also show how to use our included `Trainer()` class, which handles much of the complexity of training for you.

This guide assumes that you are already familiar with loading our models and using them for inference; otherwise, see the [task summary](https://huggingface.co/transformers/v3.0.2/task_summary.html). We also assume that you are familiar with training deep neural networks in either PyTorch or TF2, and focus specifically on the nuances and tools for training models in 🤗 Transformers.




## Fine-tuning in native TensorFlow 2

Models can also be trained natively in TensorFlow 2. Just as with PyTorch, TensorFlow models can be instantiated with `from_pretrained()` to load the weights of the encoder from a pretrained model.

Let us review the code before we continue.

This code snippet is for setting up a machine learning model for sequence classification using TensorFlow and the BERT (Bidirectional Encoder Representations from Transformers) architecture. Here's what each line does:

1. `model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')`
   - This line initializes a BERT model for the task of sequence classification using the TensorFlow framework.
   - `TFBertForSequenceClassification` is a class from the `transformers` library specifically designed for the task of classifying sequences (e.g., sentences or paragraphs) into categories.
   - The `from_pretrained` method is used to load a pre-trained BERT model. In this case, `'bert-base-uncased'` refers to a BERT model that has been pre-trained on a large corpus of English data in an uncased format (i.e., the text has been converted to lowercase).

2. `tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')`
   - This line creates a tokenizer that will be used to convert text data into a format that can be fed into the BERT model.
   - `BertTokenizer` is a class that provides tokenization for BERT models.
   - The `from_pretrained` method is again used to load a tokenizer that is compatible with the `'bert-base-uncased'` model.

3. `data = tfds.load('glue/mrpc')`
   - This line loads a dataset using TensorFlow Datasets (`tfds`).
   - `'glue/mrpc'` refers to the MRPC (Microsoft Research Paraphrase Corpus) task of the GLUE (General Language Understanding Evaluation) benchmark, which consists of sentence pairs labeled as either semantically equivalent or not.

4. `train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')`
   - This line converts the training examples from the loaded dataset into features that can be used by the model.
   - `glue_convert_examples_to_features` is a utility function that processes the examples using the provided tokenizer, setting a maximum sequence length (`max_length=128`), and specifying the task (`task='mrpc'`) to ensure that the data is processed in a way that is suitable for the MRPC task.

5. `train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)`
   - This line shuffles the training dataset with a buffer size of 100 to introduce randomness in the order of examples.
   - It then groups the data into batches of 32 examples each, which is a common practice to make training more efficient.
   - Finally, the `repeat(2)` method is called to repeat the dataset for 2 epochs, meaning that the model will see the entire dataset twice during training.

6. `test_dataset = glue_convert_examples_to_features(data['test'], tokenizer, max_length=128, task='mrpc')`
   - This line is similar to line 4 but is applied to the test data. It processes the test examples into features in the same way as the training data.

7. `test_dataset = test_dataset.shuffle(100).batch(32).repeat(2)`
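   - Similar to line 5, this line prepares the test dataset for evaluation by shuffling, batching, and repeating it. However, a test dataset is typically neither shuffled nor repeated: you usually evaluate on the test set once, and shuffling changes the order of the examples, which makes it harder to line predictions up with their labels. With `repeat(2)`, `model.predict` returns two sets of predictions for every test example. The shuffling and repetition here might be an error or specific to some experimental setup.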

This code is typically used in the context of fine-tuning a pre-trained BERT model on a specific task, in this case, the MRPC task of the GLUE benchmark, and then evaluating its performance.
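To make the effect of `batch(32).repeat(2)` concrete, the short sketch below counts how many batches such a pipeline yields. The helper name `batches_yielded` is hypothetical, and the training-split size of 3,668 examples is an assumption about the MRPC data; check it against your own copy of the dataset.

```python
import math


def batches_yielded(num_examples, batch_size, repeats):
    """Count the batches produced by dataset.batch(batch_size).repeat(repeats).

    batch() keeps a final partial batch by default, hence the ceiling.
    """
    batches_per_pass = math.ceil(num_examples / batch_size)
    return batches_per_pass * repeats


# Assumption: 3,668 training examples in MRPC (verify for your dataset version).
total = batches_yielded(num_examples=3668, batch_size=32, repeats=2)
print(total)  # 230 (115 batches per pass, two passes)
```

Under that assumption, the repeated training dataset holds 230 batches in total, which is the pool that `model.fit` draws from later.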



In [1]:
from transformers import TFBertForSequenceClassification
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased')

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let’s use `tensorflow_datasets` to load in the [MRPC dataset](https://www.tensorflow.org/datasets/catalog/glue#gluemrpc) from GLUE. We can then use our built-in `glue_convert_examples_to_features()` to tokenize MRPC and convert it to a TensorFlow `Dataset` object. Note that tokenizers are framework-agnostic, so there is no need to prepend `TF` to the pretrained tokenizer name.


In [2]:
from transformers import BertTokenizer, glue_convert_examples_to_features
import tensorflow_datasets as tfds

When using the `glue/mrpc` dataset from TensorFlow Datasets (TFDS), the task at hand is to determine whether two sentences are semantically equivalent or not. This task is a binary classification problem, where each pair of sentences is labeled with one of two classes:

- `0`: The sentences are not equivalent.
- `1`: The sentences are equivalent.

The model outputs logits, which are raw predictions that have not been normalized into probabilities. Each logit corresponds to one of the two classes. For each pair of sentences in your `test_dataset`, the model will output two numbers:

- The first number corresponds to the model's confidence that the sentences are not equivalent (class `0`).
- The second number corresponds to the model's confidence that the sentences are equivalent (class `1`).

To get from these logits to an actual prediction, you would typically do the following:

1. Apply the softmax function to the logits to convert them into probabilities. The softmax function will convert the raw logit scores into values between 0 and 1 that sum to 1, effectively giving you the probability of each class.
   
2. Take the argmax of the probabilities. This step involves choosing the index of the highest probability, which corresponds to the predicted class. If the first number is higher, the predicted class is `0` (not equivalent); if the second number is higher, the predicted class is `1` (equivalent).
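The two steps above can be sketched with plain NumPy. The logit values are made up for illustration, and `predict_classes` is a hypothetical helper, not part of 🤗 Transformers.

```python
import numpy as np


def predict_classes(logits):
    """Turn raw logits of shape (batch, num_classes) into probabilities and class ids."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probabilities = exp / exp.sum(axis=-1, keepdims=True)
    return probabilities, probabilities.argmax(axis=-1)


# Made-up logits for two sentence pairs: [not equivalent, equivalent].
example_logits = [[1.2, -0.3], [-0.8, 2.1]]
probs, classes = predict_classes(example_logits)
print(classes)  # [0 1]
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly gives the same predicted class; the softmax step matters when you also want the probabilities.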

In [3]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
data = tfds.load('glue/mrpc')
train_dataset = glue_convert_examples_to_features(data['train'], tokenizer, max_length=128, task='mrpc')
train_dataset = train_dataset.shuffle(100).batch(32).repeat(2)
test_dataset = glue_convert_examples_to_features(data['test'], tokenizer, max_length=128, task='mrpc')
test_dataset = test_dataset.shuffle(100).batch(32).repeat(2)



In [4]:
import tensorflow as tf

The model can then be compiled and trained as any Keras model:

### Compile and Train

Let us walk through the code before we run it.

The given code snippet is configuring and initiating the training process for the machine learning model (presumably the BERT model for sequence classification we discussed earlier) using TensorFlow's Keras API. Here's the breakdown of each line:

1. `optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)`
   - This line initializes an Adam optimizer, which is an algorithm for gradient-based optimization of stochastic objective functions.
   - `tf.keras.optimizers.Adam` refers to the Adam optimizer class in TensorFlow's Keras API.
   - `learning_rate=3e-5` sets the learning rate to `0.00003`. The learning rate is a hyperparameter that controls how much to adjust the model's parameters in response to the estimated error each time the model weights are updated.

2. `loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)`
   - This line creates a loss function that the model will use to measure its performance.
   - `tf.keras.losses.SparseCategoricalCrossentropy` is a loss function that is used when the labels are integers (as opposed to one-hot encoded vectors).
   - `from_logits=True` indicates that the output values of the model are not normalized (e.g., with a softmax function), and the loss function will perform the normalization as part of its calculation.

3. `model.compile(optimizer=optimizer, loss=loss)`
   - This line configures the model for training by setting the optimizer and loss function.
   - `model.compile` is a method to compile the model, preparing it for training by associating it with the specified optimizer and loss function.
   - `optimizer=optimizer` sets the optimizer for the training process, and `loss=loss` sets the loss function to calculate the errors.

4. `model.fit(train_dataset, epochs=2, steps_per_epoch=64)`
   - This line starts training the model on the dataset that has been prepared.
   - `model.fit` is the method to train the model for a fixed number of epochs (iterations over a dataset).
   - `train_dataset` is the training dataset that the model will learn from.
   - `epochs=2` tells Keras to run 2 training epochs. Because `steps_per_epoch` is set, each epoch here is a fixed number of batches rather than necessarily one complete pass over the training dataset.
   - `steps_per_epoch=64` sets the number of batch updates that make up one epoch. With batches of 32, each epoch covers 64 × 32 = 2,048 examples. Specifying `steps_per_epoch` is useful when the exact size of the dataset is not known, when the dataset repeats (as with `repeat(2)` above), or when using generators to produce data indefinitely.

Together, these lines of code set up the optimizer and loss function, compile the model with these settings, and then train the model using the training dataset.
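To illustrate what `from_logits=True` means for a loss over integer labels, here is a minimal NumPy sketch of the computation. It is an illustration of the idea under stated assumptions, not the Keras implementation; the function name is hypothetical, and it assumes the per-example losses are averaged over the batch.

```python
import numpy as np


def sparse_crossentropy_from_logits(labels, logits):
    """Mean cross-entropy for integer labels and unnormalized logits."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Log-softmax, computed stably by shifting each row by its maximum.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the true class for each example.
    per_example = -log_probs[np.arange(len(labels)), labels]
    return per_example.mean()


# With equal logits the model is maximally unsure between two classes,
# so the loss is ln(2) regardless of the labels.
loss_value = sparse_crossentropy_from_logits([0, 1], [[0.0, 0.0], [0.0, 0.0]])
print(round(loss_value, 4))  # 0.6931
```

The labels stay as plain integers (`0` or `1`), which is why the sparse variant of the loss fits the MRPC labels without one-hot encoding.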

In [5]:
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer, loss=loss)
model.fit(train_dataset, epochs=2, steps_per_epoch=64)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x78d864239bd0>

### Prediction

In [6]:
yhat = model.predict(test_dataset)



### Evaluate Performance

In [7]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
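The `compute_metrics` function above expects an object with `label_ids` and `predictions` attributes. As a rough, dependency-free sketch of the quantities it reports for binary labels, the hypothetical helper below computes them by hand, treating class `1` as the positive class. The labels and predictions are made up, and the zero-division handling here is a simplification of this sketch.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall, and F1 with class 1 as the positive class."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        'accuracy': correct / len(pairs),
        'f1': f1,
        'precision': precision,
        'recall': recall,
    }


# Made-up example: 2 true positives, 1 false positive, 1 false negative.
metrics = binary_metrics(labels=[1, 0, 1, 1, 0], preds=[1, 1, 1, 0, 0])
print(metrics['accuracy'])  # 0.6
```

In this example, precision, recall, and F1 all come out to 2/3, since there are as many false positives as false negatives.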

In [19]:
# Apply softmax to logits to get probabilities.
probabilities = tf.nn.softmax(yhat.logits, axis=-1)

# Use argmax to get the predicted class index.
predicted_class_indices = tf.argmax(probabilities, axis=-1)

# Assuming you have two classes, map indices to class names.
class_names = ['not equivalent', 'equivalent']
predicted_classes = [class_names[index] for index in predicted_class_indices.numpy()]

# Print the first 10 predictions.
print(predicted_classes[:10])

['not equivalent', 'equivalent', 'equivalent', 'equivalent', 'equivalent', 'equivalent', 'not equivalent', 'equivalent', 'equivalent', 'equivalent']


## Save Models

With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it as a PyTorch model (or vice-versa):

In [7]:
from transformers import BertForSequenceClassification
model.save_pretrained('./my_mrpc_model/')
pytorch_model = BertForSequenceClassification.from_pretrained('./my_mrpc_model/', from_tf=True)

All TF 2.0 model weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassification were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForSequenceClassification for predictions without further training.


## Trainer

We also provide a simple but feature-complete training and evaluation interface through `Trainer()` and `TFTrainer()`. You can train, fine-tune, and evaluate any 🤗 Transformers model with a wide range of training options and with built-in features like logging, gradient accumulation, and mixed precision.



In [1]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

In [2]:
model = BertForSequenceClassification.from_pretrained("bert-large-uncased")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note that creating a `Trainer` may fail with the following error if `accelerate` is missing or out of date:

```
ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
```

In [12]:
pip install transformers[torch]

Collecting accelerate>=0.20.3 (from transformers[torch])
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.24.1


In [14]:
pip install accelerate -U



You might have to restart the runtime.

In [16]:
from transformers import TFBertForSequenceClassification, TFTrainer, TFTrainingArguments

model = TFBertForSequenceClassification.from_pretrained("bert-large-uncased")

training_args = TFTrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = TFTrainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # tensorflow_datasets training dataset
    eval_dataset=test_dataset            # tensorflow_datasets evaluation dataset
)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Now simply call `trainer.train()` to train and `trainer.evaluate()` to evaluate. You can use your own module as well, but the first argument returned from `forward` must be the loss which you wish to optimize.

