<a href="https://colab.research.google.com/github/Lakshmi-Adhikari-AI/LLM-HuggingFace/blob/main/ch3/mod3_fine_tuning_trainer_api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning a model with the Trainer API

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

## 🗂️ Data Preparation for Fine-Tuning a Model with the Trainer API

**In this section, we:**

- 📥 **Load the MRPC dataset** (sentence pairs with labels for paraphrase detection)
- 🔤 **Load the BERT tokenizer** to convert text into input IDs
- ✂️ **Tokenize the dataset** – prepare sentence pairs for the model
- 🛒 **Set up a data collator** for dynamic padding during batching
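
To make the last step concrete, here is a minimal plain-Python sketch of what dynamic padding does: each batch is padded only to the length of its own longest sequence, not to a global maximum. The helper `pad_batch` and the token IDs are hypothetical and for illustration only; this is not how `DataCollatorWithPadding` is implemented, and the real collator also handles tensors and other fields.

```python
# Illustrative sketch of dynamic padding.
# pad_batch is a hypothetical helper, not the DataCollatorWithPadding implementation.
def pad_batch(features, pad_token_id=0):
    """Pad every sequence in one batch to the longest sequence in that batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Real tokens keep attention 1; padding positions get attention 0
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch

# Two made-up tokenized examples of different lengths
features = [
    {"input_ids": [101, 2023, 102]},
    {"input_ids": [101, 2023, 2003, 1037, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the padding length depends on the batch, short batches stay short, which is the main reason to pad at collation time and not during tokenization.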

In [None]:
# Import the necessary libraries
from datasets import load_dataset                       # For loading datasets
from transformers import AutoTokenizer, DataCollatorWithPadding  # For tokenization and padding

# Load the GLUE MRPC dataset, which contains pairs of sentences and labels indicating if they are paraphrases
raw_datasets = load_dataset("glue", "mrpc")

# Specify the pretrained model checkpoint for BERT (uncased version)
checkpoint = "bert-base-uncased"

# Load the tokenizer corresponding to the pretrained BERT model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Define a function to tokenize pairs of sentences in the dataset, truncating sequences longer than the model allows
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

# Apply the tokenizer function to the entire dataset with batching for speed and efficiency
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

# Initialize a data collator that dynamically pads inputs in each batch to the longest sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)


## 🏋️ Training & Fine-Tuning with the Trainer API

**In this section, we:**
- ⚙️ **Define training arguments** (where to save checkpoints, how often to evaluate, number of epochs, and more)
- 🧠 **Load our classification model** (BERT with a sequence classification head)
- 🤖 **Set up the Trainer** to handle the training loop and evaluation logic
- 🚀 **Start fine-tuning** the model on our preprocessed dataset
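
When `TrainingArguments` is created with only an output directory, the remaining settings fall back to library defaults. As a rough sanity check, the sketch below estimates how many optimization steps a run on MRPC should take, assuming a single device, no gradient accumulation, the default batch size of 8, and the default 3 epochs. The variable names are illustrative; if your installed version uses different defaults, adjust the numbers accordingly.

```python
import math

num_train_examples = 3668  # size of the GLUE MRPC training split
batch_size = 8             # default per-device train batch size (single device assumed)
num_epochs = 3             # default number of training epochs

# The last, partially filled batch still counts as one step
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 459 1377
```

The total should roughly match the step count shown in the progress bar of `trainer.train()`; a large mismatch usually means a different batch size or several devices are in use.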


In [None]:
# Import required libraries
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification

# Step 1: Define training arguments ("test-trainer" is the output directory for checkpoints)
training_args = TrainingArguments("test-trainer")

# Step 2: Load the model with a 2-label sequence classification head
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Step 3: Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,
)

# Step 4: Fine-tune the model on the MRPC dataset
trainer.train()


## 📊 Evaluation During Fine-Tuning

**In this section, we:**
- 🔍 Use the `Trainer.predict()` method to get model predictions on the validation set.
- 🎯 Define a `compute_metrics()` function that converts raw model outputs (logits) into class predictions and calculates evaluation metrics like accuracy and F1 score.
- 📈 Pass `compute_metrics()` to the `Trainer` so that validation performance is reported automatically during training.
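
The logits-to-metrics conversion described above can be sketched with NumPy alone. `Trainer.predict()` returns an object whose `predictions` field holds the logits and whose `label_ids` field holds the labels; the arrays below are made-up stand-ins for those fields, and the accuracy and F1 are computed by hand only to show what the metric measures. This is an illustration, not the implementation used by the `evaluate` library.

```python
import numpy as np

# Made-up logits for 6 examples and 2 classes, standing in for the
# `predictions` array returned by Trainer.predict()
logits = np.array([
    [2.0, -1.0],
    [-0.5, 1.5],
    [0.3, 0.9],
    [1.2, 0.1],
    [-1.0, 2.0],
    [0.8, -0.2],
])
# Made-up reference labels, standing in for `label_ids`
labels = np.array([0, 1, 0, 0, 1, 1])

# The predicted class is the index of the largest logit in each row
predictions = np.argmax(logits, axis=-1)

# Accuracy: fraction of predictions that match the labels
accuracy = float((predictions == labels).mean())

# F1 for the positive class (label 1), from true/false positives and false negatives
tp = int(((predictions == 1) & (labels == 1)).sum())
fp = int(((predictions == 1) & (labels == 0)).sum())
fn = int(((predictions == 0) & (labels == 1)).sum())
f1 = 2 * tp / (2 * tp + fp + fn)

print(predictions, round(accuracy, 3), round(f1, 3))
```

In this toy example four of six predictions are correct, and precision and recall for the positive class are both 2/3, so accuracy and F1 happen to coincide.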


In [None]:

# Import Required Libraries
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer, AutoModelForSequenceClassification

# Step 1: Define the compute_metrics function
def compute_metrics(eval_preds):
    # Load the standard evaluation metric for MRPC (from the GLUE benchmark)
    metric = evaluate.load("glue", "mrpc")

    # Unpack the predictions and labels from the eval_preds tuple
    logits, labels = eval_preds

    # Convert logits to predicted class indices by taking the argmax of each prediction's scores
    predictions = np.argmax(logits, axis=-1)

    # Compute and return the evaluation metrics (accuracy and F1)
    return metric.compute(predictions=predictions, references=labels)

# Step 2: Set up TrainingArguments with evaluation enabled at the end of each epoch
training_args = TrainingArguments("test-trainer", eval_strategy="epoch")

# Step 3: Load model (new instance for fresh training with metrics)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Step 4: Initialize the Trainer with compute_metrics function
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    processing_class=tokenizer,  # this argument is named `tokenizer` in older transformers releases
    compute_metrics=compute_metrics,  # pass our metric function
)

# Step 5: Train the model; metrics will be reported each epoch
trainer.train()
