
# A full training loop

Install the Transformers, Datasets, Evaluate, and Accelerate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl

# 📘 Chapter 3: Fine-Tuning with a Custom Training Loop

In this notebook, we implement the full training loop for fine-tuning a BERT model on the MRPC task **without using the Hugging Face `Trainer` API**.

This approach gives us complete control over the training process, from data preparation and batching to loss computation, backpropagation, and evaluation.

We will also see how to evaluate model performance on the validation dataset after training.


## 🧩 Data Preparation

- Load the GLUE MRPC dataset.
- Use the BERT tokenizer to process sentence pairs.
- Remove the raw text columns, rename `label` to `labels`, and set the dataset format to PyTorch tensors.
- Prepare a data collator that pads each batch dynamically to the length of its longest sequence.
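
The dynamic padding step can be illustrated with a short plain-Python sketch. This is only a simplified model of what a padding collator does, not the `DataCollatorWithPadding` implementation. The helper name `pad_batch` and the toy token IDs are hypothetical, and the pad ID of 0 is an assumption that matches the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(features, pad_id=0):
    """Pad every example in `features` to the longest sequence in this batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # Real tokens get mask 1, padding positions get mask 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
    return batch


# Toy token IDs (hypothetical values, not real tokenizer output).
features = [
    {"input_ids": [101, 7592, 102]},
    {"input_ids": [101, 7592, 2088, 999, 102]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is the longest sequence in the current batch rather than in the whole dataset, batches of short sentences stay small.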


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorWithPadding

#  Load the MRPC dataset (sentence pairs with labels)
raw_datasets = load_dataset("glue", "mrpc")

#  Load the tokenizer matching the pretrained BERT base model
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

#  Function to tokenize pairs of sentences, applying truncation to fit model max length
def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

#  Tokenize the entire dataset efficiently in batched mode
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

#  Prepare the data collator that dynamically pads batches for efficient processing
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

#  Clean dataset: remove raw text columns, rename 'label' to 'labels', and set format to PyTorch tensors
tokenized_datasets = tokenized_datasets.remove_columns(["sentence1", "sentence2", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")


## 🛍️ DataLoader Setup

- Create PyTorch dataloaders for the training and validation splits.
- Shuffle the training data each epoch so that batches differ between epochs.
- Use the data collator as `collate_fn` to pad each batch dynamically.
- Inspect one batch to confirm the tensor shapes.
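
As a rough picture of what shuffling and batching amount to, the sketch below splits example indices into batches in plain Python. It is not how `DataLoader` is implemented; `make_batches` is a hypothetical helper, and the only real-world number used is the size of the MRPC training split (3,668 examples).

```python
import math
import random


def make_batches(num_examples, batch_size, shuffle, seed=None):
    """Return a list of index batches, optionally shuffled first."""
    indices = list(range(num_examples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Slice the (possibly shuffled) indices into consecutive chunks.
    return [indices[i:i + batch_size] for i in range(0, num_examples, batch_size)]


num_train_examples = 3668  # size of the MRPC training split
batches = make_batches(num_train_examples, batch_size=8, shuffle=True, seed=0)

# Number of batches per epoch, and the size of the final (partial) batch.
print(len(batches), math.ceil(num_train_examples / 8), len(batches[-1]))
```

With a batch size of 8, one epoch consists of 459 batches, and the last one holds only 4 examples because 3,668 is not a multiple of 8.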


In [None]:
from torch.utils.data import DataLoader

#  Create the training dataloader with shuffling enabled and batch size 8
train_dataloader = DataLoader(
    tokenized_datasets["train"], shuffle=True, batch_size=8, collate_fn=data_collator
)

#  Create the validation dataloader with batch size 8 and no shuffling
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], batch_size=8, collate_fn=data_collator
)

#  Inspect a batch to confirm shapes
for batch in train_dataloader:
    print({k: v.shape for k, v in batch.items()})
    break


## ⚙️ Model, Optimizer, and Scheduler Setup

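- Load pretrained BERT with a classification head for two labels.
- Use the AdamW optimizer, which applies decoupled weight decay (PyTorch's default `weight_decay` of 0.01 applies, since the code does not set it).
- Set up a linear learning rate schedule that decays to zero over all training steps, with no warmup.
- Set the number of training epochs.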
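
To make the linear schedule concrete, here is a minimal plain-Python sketch of the learning rate at a given step. It follows the usual definition of a linear schedule with optional warmup and is meant as an illustration, not as the code behind `get_scheduler`. The function name `linear_lr` is hypothetical, and the figure of 459 steps per epoch assumes the 3,668 MRPC training examples with a batch size of 8.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under a linear schedule."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, num_warmup_steps)
    # Linear decay from base_lr down to 0 at the final step.
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


num_epochs = 3
steps_per_epoch = 459  # assumption: 3,668 examples, batch size 8
num_training_steps = num_epochs * steps_per_epoch

for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With no warmup, the rate starts at the base value of 5e-5 and reaches zero exactly at the last of the 1,377 training steps.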


In [None]:
from transformers import AutoModelForSequenceClassification, get_scheduler
from torch.optim import AdamW

#  Load BERT model for sequence classification with 2 output labels
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

#  Initialize AdamW optimizer with learning rate 5e-5
optimizer = AdamW(model.parameters(), lr=5e-5)

#  Define total training steps (epochs × steps per epoch)
num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)

#  Setup learning rate scheduler with linear decay and no warmup
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)


## 💻 Device Setup and Progress Bar

- Detect GPU availability and move model accordingly.
- Create progress bar for visual feedback on training progress.


In [None]:
import torch
from tqdm.auto import tqdm

#  Select GPU if available, otherwise CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

#  Initialize a progress bar for all training steps across epochs
progress_bar = tqdm(range(num_training_steps))


## 🔄 Training Loop

- Loop over epochs and batches.
- Send batch to the correct device.
- Forward pass to compute outputs and loss.
- Backpropagation to compute gradients.
- Optimizer step to update parameters.
- Scheduler step to update learning rate.
- Zero gradients before next step.
- Update progress bar for visualization.
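
The reason for zeroing gradients can be seen in a toy example that needs no deep learning library. The sketch below fits a single weight to `y = 3x` and models gradient accumulation with a plain attribute, on the assumption that this mirrors how PyTorch adds new gradients into `.grad`. The class `ToyParam` and the helpers are hypothetical and exist only for this illustration.

```python
class ToyParam:
    """A single scalar weight with a gradient buffer that accumulates."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, x, y):
    # Gradient of the squared error (w*x - y)**2 with respect to w,
    # *added* to the buffer rather than overwriting it.
    param.grad += 2 * (param.value * x - y) * x


def train(zero_grad, lr=0.05, num_epochs=3):
    w = ToyParam(0.0)
    data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]  # toy samples of y = 3x
    for _ in range(num_epochs):
        for x, y in data:
            backward(w, x, y)       # compute the gradient
            w.value -= lr * w.grad  # optimizer step
            if zero_grad:
                w.grad = 0.0        # reset before the next sample
    return w.value


print("with zero_grad:   ", train(zero_grad=True))
print("without zero_grad:", train(zero_grad=False))
```

With the reset in place the weight ends close to 3. Without it, stale gradients from earlier samples keep being applied and the weight ends far from the target.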


In [None]:
model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        #  Move batch data to device (CPU/GPU)
        batch = {k: v.to(device) for k, v in batch.items()}

        #  Forward pass: compute model outputs and loss
        outputs = model(**batch)
        loss = outputs.loss

        #  Backpropagation: compute gradients
        loss.backward()

        #  Optimizer step: update model weights
        optimizer.step()

        #  Scheduler step: decay learning rate
        lr_scheduler.step()

        #  Zero gradients before next iteration
        optimizer.zero_grad()

        #  Update progress bar
        progress_bar.update(1)


## 📊 Evaluation Loop

- Switch model to evaluation mode.
- Loop over validation batches without gradient calculation.
- Compute logits and predicted classes.
- Collect all predictions and true labels for metric computation.
- Use the 🤗 Evaluate library to compute accuracy and F1.
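
To show what the two reported numbers mean, the following sketch computes accuracy and F1 by hand from a few made-up logits. It is a hand-rolled illustration, not the implementation used by the Evaluate library. The logits, labels, and helper names are hypothetical, and treating class 1 as the positive class for F1 is an assumption.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_and_f1(predictions, references, positive=1):
    pairs = list(zip(predictions, references))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Hypothetical logits for four sentence pairs, two classes each.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [-0.5, 0.9]]
toy_references = [1, 0, 0, 1]

toy_predictions = [argmax(row) for row in toy_logits]
toy_results = accuracy_and_f1(toy_predictions, toy_references)
print(toy_predictions, toy_results)
```

Here three of four predictions are correct, so accuracy is 0.75, while precision of 2/3 and recall of 1 combine to an F1 of 0.8.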


In [None]:
import evaluate

#  Load GLUE MRPC evaluation metric (accuracy and F1)
metric = evaluate.load("glue", "mrpc")

model.eval()  # Switch model to evaluation mode

for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}

    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits

    #  Get the predicted class indices by selecting max logit
    predictions = torch.argmax(logits, dim=-1)

    #  Add predictions and references for metric calculation
    metric.add_batch(predictions=predictions, references=batch["labels"])

#  Compute final metrics
results = metric.compute()
print(results)


## ⚡ Supercharge Your Training Loop with 🤗 Accelerate

- Accelerate abstracts away device management and distributed training complexities.
- Supports mixed precision training (such as fp16) and runs the same code on CPUs, GPUs, and TPUs.
- Allows you to keep writing familiar PyTorch training loops with minimal changes.


### Set Up Accelerate and Prepare Components

In [None]:
from accelerate import Accelerator
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification

#  Initialize the accelerator to manage device and distributed setup
accelerator = Accelerator()

#  Load classification model as before
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

#  Define optimizer
optimizer = AdamW(model.parameters(), lr=3e-5)

#  Prepare dataloaders, model & optimizer with accelerator for distributed/mixed precision compatibility
train_dl, eval_dl, model, optimizer = accelerator.prepare(
    train_dataloader, eval_dataloader, model, optimizer
)


### Scheduler and Progress Bar for the Accelerate Loop

In [None]:
from transformers import get_scheduler
from tqdm.auto import tqdm

num_epochs = 3
num_training_steps = num_epochs * len(train_dl)

#  Set up linear LR scheduler
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

#  Create progress bar for training steps
progress_bar = tqdm(range(num_training_steps))


### Training Loop Using Accelerate

In [None]:
model.train()

for epoch in range(num_epochs):
    for batch in train_dl:
        #  Forward pass & loss computation
        outputs = model(**batch)
        loss = outputs.loss

        #  Backpropagation using accelerator (supporting mixed precision etc.)
        accelerator.backward(loss)

        #  Update optimizer parameters
        optimizer.step()

        #  Update learning rate
        lr_scheduler.step()

        #  Zero gradients
        optimizer.zero_grad()

        #  Update progress bar
        progress_bar.update(1)
