# Fine-tuning a Pretrained BERT Model Using the Trainer API
Here we fine-tune a BERT sequence classifier on the GLUE MRPC paraphrase dataset using PyTorch and the Trainer API.


In [21]:
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

### Loading the dataset
The 🤗 Datasets library provides a very simple command to download and cache a dataset from the Hub.

In [22]:
from datasets import load_dataset

raw_datasets = load_dataset("glue", "mrpc")
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx'],
        num_rows: 1725
    })
})

In [23]:
# We can access each pair of sentences in our raw_datasets object by indexing, like with a dictionary
raw_train_dataset = raw_datasets["train"]
raw_train_dataset[0]

{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
 'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
 'label': 1,
 'idx': 0}

In [24]:
# The labels are already integers, so we won't have to do any preprocessing there. To know which integer corresponds to which label, we can inspect the features of our raw_train_dataset

raw_train_dataset.features

{'sentence1': Value('string'),
 'sentence2': Value('string'),
 'label': ClassLabel(names=['not_equivalent', 'equivalent']),
 'idx': Value('int32')}

### Preprocess the dataset
To preprocess the dataset, we need to convert the text to numbers the model can make sense of. This is done with a tokenizer.
To keep the data as a dataset, we will use the Dataset.map() method. This also gives us some extra flexibility if we need more preprocessing than just tokenization. The map() method works by applying a function to each element of the dataset (or to batches of elements when batched=True), so let's define a function that tokenizes our inputs.

You can also use multiprocessing when applying your preprocessing function with map() by passing a num_proc argument. This is only worthwhile when your tokenizer is not a fast tokenizer backed by the 🤗 Tokenizers library, since fast tokenizers already use multiple threads.

In [25]:
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3668
    })
    validation: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 408
    })
    test: Dataset({
        features: ['sentence1', 'sentence2', 'label', 'idx', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1725
    })
})
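
To make the new columns concrete, here is a minimal sketch of how a BERT-style tokenizer lays out a sentence pair. It relies on a hypothetical whitespace tokenizer (`toy_encode_pair`, which is not part of 🤗 Transformers) and keeps tokens as strings instead of vocabulary ids, so only the layout mirrors the real output: `[CLS] A [SEP] B [SEP]`, with `token_type_ids` set to 0 for the first segment and 1 for the second.

```python
def toy_encode_pair(sentence1, sentence2):
    """Hypothetical stand-in for tokenizer(sentence1, sentence2)."""
    # First segment: [CLS] + tokens of sentence 1 + [SEP]
    first = ["[CLS]"] + sentence1.lower().split() + ["[SEP]"]
    # Second segment: tokens of sentence 2 + [SEP]
    second = sentence2.lower().split() + ["[SEP]"]
    tokens = first + second
    return {
        "tokens": tokens,
        "token_type_ids": [0] * len(first) + [1] * len(second),
        "attention_mask": [1] * len(tokens),  # no padding yet, so all ones
    }


encoded = toy_encode_pair("This is the first sentence .", "This is the second one .")
for token, segment in zip(encoded["tokens"], encoded["token_type_ids"]):
    print(f"{token:>10}  segment={segment}")
```

The real tokenizer additionally splits words into WordPiece subwords and maps them to integer ids, which is why `input_ids` in the dataset above contains numbers rather than strings.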

The last thing we need to do is pad all the examples to the length of the longest element when we batch elements together, a technique we refer to as dynamic padding.

The Transformers library provides a collate function for this via DataCollatorWithPadding. It takes a tokenizer when you instantiate it (to know which padding token to use, and whether the model expects padding on the left or on the right of the inputs) and does everything you need.

In [26]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

To test this new toy, let's grab a few samples from our training set that we would like to batch together. Here, we remove the columns idx, sentence1, and sentence2, as they won't be needed and contain strings (and we can't create tensors with strings), and have a look at the lengths of each entry in the batch.

In [27]:
samples = tokenized_datasets["train"][:8]
samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
[len(x) for x in samples['input_ids']]

[50, 59, 47, 67, 59, 50, 62, 32]

Without dynamic padding, all of the samples would have to be padded to the maximum length in the whole dataset, or the maximum length the model can accept. Let's double-check that our data_collator is dynamically padding the batch properly:

In [28]:
batch = data_collator(samples)
{k: v.shape for k, v in batch.items()}

{'input_ids': torch.Size([8, 67]),
 'token_type_ids': torch.Size([8, 67]),
 'attention_mask': torch.Size([8, 67]),
 'labels': torch.Size([8])}
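
The padding logic itself is simple enough to sketch in plain Python. The helper below (`pad_batch`, a hypothetical name) is not how DataCollatorWithPadding is implemented; it only illustrates the idea of padding each batch to its own longest sequence. It assumes right padding and a padding id of 0, and it uses dummy token ids with the same lengths as the eight samples above.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence to the longest one in this batch (right padding)."""
    max_len = max(len(ids) for ids in features["input_ids"])
    input_ids, attention_mask = [], []
    for ids in features["input_ids"]:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Dummy token ids (assumption: the actual values do not matter here)
lengths = [50, 59, 47, 67, 59, 50, 62, 32]
toy_features = {"input_ids": [[101] * n for n in lengths]}

padded = pad_batch(toy_features)
print({len(row) for row in padded["input_ids"]})           # {67}
print([sum(mask) for mask in padded["attention_mask"]])    # the original lengths
```

Every row ends up with length 67, the longest sample in this batch, rather than the longest sample in the dataset.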

### Training

Transformers provides a Trainer class to help you fine-tune any of the pretrained models it provides on your dataset with modern best practices.

The first step before we can define our Trainer is to define a TrainingArguments class that will contain all the hyperparameters the Trainer will use for training and evaluation.

The only argument you have to provide is a directory where the trained model will be saved. Here we also set the evaluation, saving, and logging intervals, the batch sizes, and Weights & Biases reporting.

In [None]:
from transformers import TrainingArguments
import wandb

# Start a Weights & Biases run so the Trainer can log metrics to it
wandb.init(project="transformer-fine-tuning", name="bert-mrpc-analysis")

training_args = TrainingArguments(
    output_dir="Models/results",      # where checkpoints are written
    eval_strategy="steps",            # evaluate every eval_steps
    eval_steps=50,
    save_steps=100,
    logging_steps=10,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    report_to="wandb",
)

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice:

The second step is to define our model.

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Once we have our model, we can define a Trainer by passing it all the objects constructed up to now: the model, the training_args, the training and validation datasets, our data_collator, and our processing_class. The processing_class parameter is a newer addition that tells the Trainer which tokenizer to use for processing.

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer
)

To fine-tune the model, we call the train() method of our Trainer:

In [None]:
trainer.train()



Step,Training Loss
500,0.4998
1000,0.2617




TrainOutput(global_step=1377, training_loss=0.3103358144767058, metrics={'train_runtime': 345.4547, 'train_samples_per_second': 31.854, 'train_steps_per_second': 3.986, 'total_flos': 405114969714960.0, 'train_loss': 0.3103358144767058, 'epoch': 3.0})
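
The `global_step` value can be checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; `total_optimizer_steps` is a hypothetical helper, not a Trainer API.

```python
import math


def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Optimizer updates for a single-device run that keeps the last partial batch."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs


# MRPC has 3668 training examples and we train for 3 epochs
print(total_optimizer_steps(3668, batch_size=8, num_epochs=3))   # 1377
print(total_optimizer_steps(3668, batch_size=16, num_epochs=3))  # 690
```

Under these assumptions, the logged 1377 steps correspond to a batch size of 8 (the default per-device batch size) rather than the 16 set in the arguments above, so the output shown here may come from an earlier run with different settings. That is an inference from the numbers, not something the log states.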

### Evaluation
Let's see how we can build a useful compute_metrics() function and use it the next time we train.

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
predictions.predictions.shape, predictions.label_ids.shape



((408, 2), (408,))

As you can see, predictions is a two-dimensional array with shape 408 x 2 (408 being the number of elements in the dataset we used). Those are the logits for each element of the dataset we passed to predict() (all Transformer models return logits). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:

In [None]:
import numpy as np
preds = np.argmax(predictions.predictions, axis=-1)
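
As a small illustration of why taking the argmax of the raw logits is enough, the sketch below uses made-up logits for three examples and shows that softmax does not change which class scores highest. The values and the helper names (`softmax`, `argmax`) are illustrative assumptions written in plain Python, not the NumPy calls used above.

```python
import math


def softmax(row):
    """Convert one row of logits into probabilities."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)


toy_logits = [[-1.2, 2.3], [0.8, -0.4], [0.1, 0.3]]  # made-up values
toy_preds = [argmax(row) for row in toy_logits]
toy_probs = [softmax(row) for row in toy_logits]

print(toy_preds)                            # [1, 0, 1]
print([argmax(row) for row in toy_probs])   # same classes as above
```

Because softmax is monotonic, the predicted class is the same whether you compare logits or probabilities.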

We can now compare those preds to the labels. To build our compute_metrics() function, we will rely on the metrics from the 🤗 Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the evaluate.load() function. The object returned has a compute() method we can use to do the metric calculation:

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)

{'accuracy': 0.8602941176470589, 'f1': 0.9018932874354562}
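
To see what these two numbers measure, here is a minimal plain-Python sketch that computes accuracy and the F1 score of the positive class on a handful of made-up predictions. `accuracy_and_f1` is a hypothetical helper for illustration and is not how the Evaluate library implements the metric.

```python
def accuracy_and_f1(preds, refs, positive=1):
    """Accuracy and F1 of the positive class for binary labels."""
    pairs = list(zip(preds, refs))
    tp = sum(p == positive and r == positive for p, r in pairs)  # true positives
    fp = sum(p == positive and r != positive for p, r in pairs)  # false positives
    fn = sum(p != positive and r == positive for p, r in pairs)  # false negatives
    correct = sum(p == r for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(refs), "f1": f1}


example_preds = [1, 1, 0, 1, 0, 1]  # made-up predictions
example_refs = [1, 0, 0, 1, 1, 1]   # made-up labels
scores = accuracy_and_f1(example_preds, example_refs)
print(scores)
```

In this toy case four of six predictions are correct, and precision and recall are both 3/4, so the F1 score is 0.75.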

To wrap things up, we put everything together in our compute_metrics() function:

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")
    # eval_preds holds the model logits and the reference labels
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Now we define a trainer with the new compute_metrics function to display our metrics

In [None]:
# training_args = TrainingArguments("Models/test-trainer-compute", eval_strategy="epoch", fp16=True)
# model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# trainer = Trainer(
#     model,
#     training_args,
#     train_dataset=tokenized_datasets['train'],
#     eval_dataset=tokenized_datasets['validation'],
#     data_collator=data_collator,
#     processing_class=tokenizer,
#     compute_metrics=compute_metrics,
# )

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note that we create a new TrainingArguments with its eval_strategy set to "epoch" and a new model. Otherwise, we would just be continuing the training of the model we have already trained.

In [None]:
# trainer.train()

Epoch,Training Loss,Validation Loss


TypeError: object of type 'NoneType' has no len()

### Advanced Training Features
The Trainer comes with many built-in features that make modern deep learning best practices accessible:

Mixed Precision Training: Use fp16=True in your training arguments for faster training and reduced memory usage:

In [None]:
# training_args = TrainingArguments(
#     "Models/test-trainer",
#     eval_strategy="epoch",
#     fp16=True,  # Enable mixed precision
# )
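
The memory saving comes from storing values in 16 bits instead of 32. The standard-library sketch below only illustrates the storage size and rounding behaviour of half precision; it says nothing about how the Trainer implements mixed precision, and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


fp16_bytes = struct.calcsize("<e")  # bytes per half-precision value
fp32_bytes = struct.calcsize("<f")  # bytes per single-precision value
print(fp16_bytes, fp32_bytes)

print(to_fp16(0.1))   # close to 0.1, but not exactly 0.1
print(to_fp16(1e-8))  # too small for half precision, rounds to 0.0
```

Half precision uses half the bytes but has fewer significant digits and a narrower range, which is one reason training keeps some computations in full precision.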

Gradient Accumulation: Simulate a larger effective batch size when GPU memory is limited by accumulating gradients over several smaller batches before each optimizer step:

In [None]:
# training_args = TrainingArguments(
#     "Models/test-trainer",
#     eval_strategy="epoch",
#     per_device_train_batch_size=4,
#     gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
# )
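
The reason this works can be checked on a toy problem. The sketch below assumes a one-parameter linear model with a mean squared error loss (both chosen purely for illustration) and shows that averaging the gradients of four micro-batches of 4 gives the same gradient as one batch of 16.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Toy data: 16 points on the line y = 3x + 1 (made-up example)
xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]
w = 0.5

full_batch_grad = grad_mse(w, xs, ys)

micro_batch_size, accumulation_steps = 4, 4
accumulated_grad = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    end = start + micro_batch_size
    # Scale each micro-batch gradient so the sum is an average over all 16 points
    accumulated_grad += grad_mse(w, xs[start:end], ys[start:end]) / accumulation_steps

print(full_batch_grad, accumulated_grad)
```

Only one micro-batch needs to fit in memory at a time, at the cost of four forward and backward passes per optimizer step.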

Learning Rate Scheduling: The Trainer uses linear decay by default, but you can customize this

In [None]:
# training_args = TrainingArguments(
#     "Models/test-trainer",
#     eval_strategy="epoch",
#     learning_rate=2e-5,
#     lr_scheduler_type="cosine",  # Try different schedulers
# )
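
To get a feel for the difference between the two schedules, the sketch below computes the learning-rate multiplier for linear decay and for a half-cosine decay. It assumes no warmup and a made-up total of 1000 steps; `linear_factor` and `cosine_factor` are hypothetical helpers that mirror the usual formulas, not functions from the library.

```python
import math


def linear_factor(step, total_steps):
    """Multiplier that falls in a straight line from 1 to 0."""
    return max(0.0, (total_steps - step) / total_steps)


def cosine_factor(step, total_steps):
    """Multiplier that follows half a cosine wave from 1 to 0."""
    return 0.5 * (1.0 + math.cos(math.pi * step / total_steps))


base_lr, total_steps = 2e-5, 1000  # total_steps is an assumed value
for step in (0, 250, 500, 750, 1000):
    linear_lr = base_lr * linear_factor(step, total_steps)
    cosine_lr = base_lr * cosine_factor(step, total_steps)
    print(f"step {step:4d}  linear={linear_lr:.2e}  cosine={cosine_lr:.2e}")
```

Compared with linear decay, the cosine schedule keeps the learning rate higher early in training and lower towards the end, while both start at the base rate and finish at zero.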