<a href="https://colab.research.google.com/github/SSRavipati/LLM-course/blob/main/chapter_2/Finetuning_%20with_%20trainer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning with the Trainer API
Transformers provides a Trainer class to help you fine-tune any of its pretrained models on your own dataset.

In [None]:
!pip install transformers datasets torch

In [None]:
!pip install --upgrade datasets fsspec

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
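
DataCollatorWithPadding pads each batch to the length of its longest sequence instead of padding the whole dataset to one global length. The sketch below imitates that idea in plain Python. Note that pad_batch is a hypothetical helper written for illustration, not part of Transformers, and the token ids and the padding id of 0 are assumptions.

```python
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in a batch to the length of the longest one.

    Hypothetical stand-in for the idea behind dynamic padding; the real
    collator also handles tensors, token_type_ids and labels.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Made-up token ids of different lengths
toy_features = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
toy_batch = pad_batch(toy_features)
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

Padding per batch keeps short batches short, which usually wastes less computation than padding everything to the longest example in the dataset.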

# **Training**

---


The first step before we can define our Trainer is to define a TrainingArguments class that will contain all the hyperparameters the Trainer will use for training and evaluation.

The only argument you have to provide is a directory ("test-trainer" in this case) where the trained model will be saved, as well as the checkpoints along the way.

For all the rest, you can leave the defaults, which should work pretty well for a basic fine-tuning.
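
To get a feel for what the defaults imply, the arithmetic below estimates how many optimization steps training will take. It assumes a single device, no gradient accumulation, a default batch size of 8, a default of 3 epochs, and the 3,668 examples of the MRPC training split. Check these values against your installed version, since defaults can change.

```python
import math

train_examples = 3668   # size of the MRPC training split
batch_size = 8          # assumed default per_device_train_batch_size
num_epochs = 3          # assumed default num_train_epochs

# The last batch of each epoch is smaller, hence the ceiling
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 459 1377
```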

In [None]:
from transformers import TrainingArguments

training_args = TrainingArguments("test-trainer", report_to="none")

# **Define our model**

*   The following cell emits a warning because BERT was not pretrained on classifying pairs of sentences, so the head of the pretrained model has been discarded and a new head suitable for sequence classification has been added instead.
*   The warning indicates that some weights were not used (the ones corresponding to the dropped pretraining head) and that some others were randomly initialized (the ones for the new head). It concludes by encouraging you to train the model, which is exactly what we are going to do now.



In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

Once we have our model, we can define a Trainer by passing it all the objects constructed so far. These are the model, the training_args, the training and validation datasets, our data_collator, and our processing_class (e.g., a tokenizer, feature extractor, or processor):

In [None]:
from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

To fine-tune the model on our dataset, we just have to call the train() method of our Trainer:

In [None]:
trainer.train()

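The output will only show the training loss.

The Trainer will not evaluate the model during training, for two reasons: we did not set eval_strategy to "steps" or "epoch" in the TrainingArguments, and we did not give the Trainer a compute_metrics() function to calculate a metric during that evaluation. Without such a function, an evaluation would only have reported the loss.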

# Evaluation

To get some predictions from our model, we can use the Trainer.predict() method:

In [None]:
predictions = trainer.predict(tokenized_datasets["validation"])
print(predictions.predictions.shape, predictions.label_ids.shape)



*   The output of the predict() method is a named tuple with three fields: predictions, label_ids, and metrics.
*   For now, the metrics field contains only the loss on the dataset passed, along with some timing information. Once we complete our compute_metrics() function and pass it to the Trainer, that field will also contain the metrics returned by compute_metrics().
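
Because the result is a named tuple, its fields can be read by name or unpacked in order. The sketch below uses a hypothetical stand-in class, PredictionSketch, with made-up values so that it runs without a trained model. Only the three field names come from the description above.

```python
from typing import NamedTuple


class PredictionSketch(NamedTuple):
    """Hypothetical stand-in mirroring the three fields described above."""
    predictions: list
    label_ids: list
    metrics: dict


# Made-up values for two examples
toy_output = PredictionSketch(
    predictions=[[0.2, 1.3], [2.1, -0.4]],
    label_ids=[1, 0],
    metrics={"test_loss": 0.35},
)

# Access a field by name...
print(toy_output.metrics["test_loss"])
# ...or unpack all three in field order
toy_logits, toy_labels, toy_metrics = toy_output
print(len(toy_logits), toy_labels)
```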





---
The predictions are a 2D array of logits with shape 408 x 2 (408 being the number of examples in the validation set we passed to predict(), and 2 the number of labels). To transform them into predictions that we can compare to our labels, we need to take the index with the maximum value on the second axis:


In [None]:
import numpy as np

preds = np.argmax(predictions.predictions, axis=-1)

We can now compare those preds to the labels.
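
As a small worked example of that comparison, the snippet below takes the argmax of a few made-up logits and counts how many predictions match made-up labels. It uses plain Python lists rather than NumPy so that it has no dependencies, and the numbers are purely illustrative.

```python
# Made-up logits for four examples and two classes
toy_logits = [[0.1, 2.3], [1.7, -0.5], [0.4, 0.9], [3.0, 0.2]]
toy_labels = [1, 0, 0, 0]

# Index of the largest value in each row (argmax over the last axis)
toy_preds = [max(range(len(row)), key=row.__getitem__) for row in toy_logits]

# Number of predictions that agree with the labels
matches = sum(p == y for p, y in zip(toy_preds, toy_labels))
print(toy_preds, matches)  # [1, 0, 1, 0] 3
```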

To build our compute_metrics() function, we will rely on the metrics from the Evaluate library. We can load the metrics associated with the MRPC dataset as easily as we loaded the dataset, this time with the evaluate.load() function.

The object returned has a compute() method we can use to do the metric calculation:

In [None]:
!pip install evaluate

In [None]:
import evaluate

metric = evaluate.load("glue", "mrpc")
metric.compute(predictions=preds, references=predictions.label_ids)
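
For MRPC, the GLUE metric reports accuracy and F1. To make those two numbers concrete, here is a minimal plain-Python sketch that computes them by hand on made-up predictions, treating label 1 as the positive class. Here accuracy_and_f1 is a hypothetical helper, not the Evaluate implementation.

```python
def accuracy_and_f1(predictions, references):
    """Compute accuracy and binary F1 (positive class = 1) by hand."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up predictions and labels
toy_result = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(toy_result)  # {'accuracy': 0.6, 'f1': 0.6666666666666666}
```

F1 is worth tracking next to accuracy when the classes are imbalanced, because a model that always predicts the majority class can still score a high accuracy.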

Wrapping everything together, we get our compute_metrics() function:

In [None]:
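def compute_metrics(eval_preds):
    # Load the metric associated with MRPC
    metric = evaluate.load("glue", "mrpc")
    # eval_preds unpacks into the model's logits and the reference labels
    logits, labels = eval_preds
    # Convert logits to predicted class indices
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)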

And to see it used in action to report metrics at the end of each epoch, here is how we define a new Trainer with this compute_metrics() function:

In [None]:
training_args = TrainingArguments("test-trainer", eval_strategy="epoch", report_to="none")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

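# SST-2 dataset

The same workflow applies to SST-2, another GLUE task, which classifies the sentiment of a single sentence instead of comparing a pair of sentences: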

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

# SST-2 has a single "sentence" column per example
raw_datasets2 = load_dataset("glue", "sst2")
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)


def tokenize_function2(example):
    return tokenizer(example["sentence"], truncation=True)


tokenized_datasets = raw_datasets2.map(tokenize_function2, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
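
The only change to preprocessing is that SST-2 examples have one sentence, while MRPC examples have two. For a BERT-style tokenizer, a pair is laid out as [CLS] A [SEP] B [SEP], with token type ids separating the two segments, whereas a single sentence is just [CLS] A [SEP]. The sketch below mimics that layout on whitespace-split words. Note that build_inputs is a hypothetical helper, and real tokenizers produce subword ids rather than words.

```python
def build_inputs(sentence_a, sentence_b=None):
    """Lay out one or two sentences the way a BERT-style tokenizer does.

    Hypothetical illustration: splits on whitespace instead of using
    a real subword vocabulary.
    """
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"]
    token_type_ids = [0] * len(tokens)  # first segment is type 0
    if sentence_b is not None:
        second = sentence_b.split() + ["[SEP]"]
        tokens += second
        token_type_ids += [1] * len(second)  # second segment is type 1
    return tokens, token_type_ids


print(build_inputs("a great movie"))
print(build_inputs("he left", "he went away"))
```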

In [None]:
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "sst2")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer", eval_strategy="epoch", report_to="none")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()