Fine Tuning with Hugging Face:

Hugging Face is an open-source ML platform. It provides the transformers library for NLP applications and allows users to share ML models and datasets.

## Defining the dataset

Datasets hosted on Hugging Face can be loaded with the `load_dataset` function (`from datasets import load_dataset`).

Let's load the Yelp review dataset.

## Yelp Review Dataset

A list-like object consisting of user reviews and accompanying metadata from the Yelp platform. Each review is a dictionary that typically contains the review text under the key `text` and its rating under the key `label`.

## Installing required Libraries

In [1]:
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install portalocker==2.8.2
!pip install torchdata==0.7.1
!pip install pandas
!pip install matplotlib==3.9.0 scikit-learn==1.5.0
!pip install numpy==1.26.0
!pip install --user transformers==4.42.1
!pip install --user datasets # 2.20.0
!pip install "portalocker>=2.0.0"
!pip install torch==2.3.1
!pip install --user torchmetrics==1.4.0.post0
!pip install numpy==1.26.4
!pip install peft==0.11.1
!pip install evaluate==0.4.2
!pip install -q bitsandbytes==0.43.1
!pip install --user accelerate==0.31.0
!pip install --user torchvision==0.18.1


!pip install --user trl==0.9.4
!pip install --user protobuf==3.20.*
!pip install matplotlib

!pip install --upgrade trl

Collecting torch==2.2.2
  Using cached torch-2.2.2-cp312-none-macosx_11_0_arm64.whl.metadata (25 kB)
Using cached torch-2.2.2-cp312-none-macosx_11_0_arm64.whl (59.7 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.3.1
    Uninstalling torch-2.3.1:
      Successfully uninstalled torch-2.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.18.1 requires torch==2.3.1, but you have torch 2.2.2 which is incompatible.
torchtext 0.18.0 requires torch>=2.3.0, but you have torch 2.2.2 which is incompatible.
Successfully installed torch-2.6.0
Collecting torchtext==0.17.2
  Using cached torchtext-0.17.2-cp312-cp312-macosx_11_0_arm64.whl.metadata (7.9 kB)
Collecting torch==2.2.2 (from torchtext==0.17.2)
  Using cached torch-2.2.2-cp312-none-macosx_11_0_arm64.whl.metadata (25 kB)
U

In [3]:
!pip install torchmetrics



## Importing required Libraries

In [4]:
!pip install --upgrade torchtext




In [5]:
import torch
import torchtext
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchtext.vocab import build_vocab_from_iterator,GloVe,Vocab,Vectors
# trl (Transformer Reinforcement Learning) provides SFTTrainer for supervised fine-tuning
from trl import SFTConfig, SFTTrainer #DataCollatorForCompletionOnlyLM

from datasets import load_dataset
import pickle
import os
import math
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSequenceClassification, BertConfig, BertForMaskedLM, TrainingArguments,Trainer
from transformers import pipeline


# Tokenizer
from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

from tqdm.auto import tqdm
import time

import warnings
def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')



In [38]:
torch.__version__, torchtext.__version__

('2.3.1', '0.18.0')

## Dataset Preparations

The Yelp review dataset is a widely used dataset in natural language processing (NLP) and sentiment analysis research. It consists of user reviews and accompanying metadata from the Yelp platform, which is a popular online platform for reviewing and rating local businesses such as restaurants, hotels, and shops.

The dataset includes 6,990,280 reviews written by Yelp users, covering a wide range of businesses and locations. Each review typically contains the text of the review itself along with the star rating given by the user (ranging from 1 to 5). The `yelp_review_full` version loaded below provides 650,000 training reviews and 50,000 test reviews.

Our aim in this lab is to fine-tune a pretrained BERT model to predict the ratings from the reviews.
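The sample records in this notebook carry labels such as 0 and 4 rather than star counts, which suggests the ratings are stored as class indices. A minimal sketch of that mapping follows, assuming label k stands for k + 1 stars; the helper name `label_to_stars` is hypothetical and not part of any library.

```python
def label_to_stars(label: int) -> int:
    """Convert a class index (0-4) to a star rating (1-5).

    Assumption: label k corresponds to k + 1 stars.
    """
    if not 0 <= label <= 4:
        raise ValueError(f"expected a label in 0..4, got {label}")
    return label + 1


for label in range(5):
    print(f"label {label} -> {label_to_stars(label)} star(s)")
```

Five distinct labels is also why the classifier head later in the lab is created with `num_labels = 5`.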

In [7]:
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [8]:
## Check a sample record of the dataset
dataset['train'][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [9]:
dataset['train'][15]['label']

4

In [10]:
dataset['train'][15]['text']

"Can't miss stop for the best Fish Sandwich in Pittsburgh."

In [11]:
dataset['train'][15]

{'label': 4,
 'text': "Can't miss stop for the best Fish Sandwich in Pittsburgh."}

In [12]:
dataset['train'] = dataset['train'].select(range(1000))
dataset['test'] = dataset['test'].select(range(200))

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 200
    })
})

## Tokenizing Data

In [14]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Define a function to tokenize examples
def tokenize_function(examples):
    # tokenize the text using the tokenizer
    # apply padding to ensure all sequences have the same length
    # apply truncation to limit the maximum sequence length
    return tokenizer(examples['text'], padding = 'max_length', truncation = True)


# Apply the tokenizer function to the dataset in batches
tokenized_datasets = dataset.map(tokenize_function, batched = True)


In [15]:
#tokenized_datasets['train'][0]

1. input_ids
This is the actual tokenized text — each word/subword is converted into a numerical ID based on the tokenizer’s vocabulary.

Example: "Hello world" → [101, 8667, 1362, 102] (IDs from BERT’s vocab).

These are what get fed into the embedding layer of the model.

2. token_type_ids (aka segment IDs)
Used only by some models (like BERT) for tasks involving two sentences in one input (e.g., question + answer, sentence A + sentence B).

The values tell the model which tokens belong to which segment:

0 → tokens from sentence A

1 → tokens from sentence B

For single-sentence tasks, all values are 0, and you can usually ignore this unless you’re working with paired inputs.

3. attention_mask
Tells the model which tokens are real and which are padding:

1 → keep this token (attend to it)

0 → ignore this token (it’s padding)

Essential when you use padding="max_length": without the mask, the padding tokens would influence the attention scores of the real tokens.
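The three fields above can be illustrated without a tokenizer. The sketch below pads an already tokenized sequence by hand and builds the matching masks. It is a simplification: `pad_encoding` is a hypothetical helper, the padding ID 0 is taken from the zeros seen in this notebook's outputs, and the plain slice used for truncation does not preserve the closing special token the way the real tokenizer does.

```python
def pad_encoding(token_ids, max_length, pad_id=0):
    """Pad (or truncate) token IDs to max_length and build the matching masks."""
    ids = token_ids[:max_length]          # naive truncation (simplified)
    n_pad = max_length - len(ids)         # number of padding positions to add
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "token_type_ids": [0] * max_length,              # single segment: all zeros
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


# "Hello world" token IDs from the example above, padded to 8 positions
encoded = pad_encoding([101, 8667, 1362, 102], max_length=8)
for name, values in encoded.items():
    print(name, values)
```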

In [17]:
tokenized_datasets['train'][0].keys()

dict_keys(['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'])

In [18]:
# Remove the text column because the model does not accept raw text as input
tokenized_datasets = tokenized_datasets.remove_columns(['text'])

# Rename the label column to labels because the model expects the argument to be named labels
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# set the format of the dataset to return PyTorch tensors instead of lists
tokenized_datasets.set_format('torch')

In [19]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 200
    })
})

In [20]:
dataset['train'][100], tokenized_datasets['train'][100]

({'label': 0,
  'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years

In [21]:
tokenized_datasets['train'][0].keys()

dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])

## DataLoader

In [22]:
# Create a training data loader
train_dataloader = DataLoader(tokenized_datasets["train"],shuffle = True, batch_size = 2)

# Create an evaluation dataloader
eval_dataloader = DataLoader(tokenized_datasets["test"], batch_size = 2)


## Train the model

### Load a pretrained model

In [23]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels = 5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Optimizer and learning rate scheduler


In [24]:
optimizer = torch.optim.AdamW(model.parameters(), lr = 5e-4)

loss_fn = torch.nn.CrossEntropyLoss()

# set the number of epochs
num_epochs = 10

# calculate the total number of training steps
num_training_steps = num_epochs * len(train_dataloader)

# define the learning rate scheduler
from torch.optim.lr_scheduler import LambdaLR
lr_scheduler = LambdaLR(optimizer, lr_lambda = lambda current_step: (1- current_step/num_training_steps))
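`LambdaLR` multiplies the optimizer's initial learning rate by the value the lambda returns, so the schedule above decays linearly from the initial rate to zero. The sketch below reproduces that arithmetic in plain Python. The function name `linear_decay_factor` is hypothetical, and the value 5000 assumes 10 epochs of 500 batches each.

```python
def linear_decay_factor(current_step, num_training_steps):
    """Multiplicative factor applied to the base learning rate at a given step."""
    return 1 - current_step / num_training_steps


base_lr = 5e-4              # initial learning rate passed to the optimizer
num_training_steps = 5000   # assumed: 10 epochs x 500 batches

for step in (0, 1250, 2500, 3750, 5000):
    lr = base_lr * linear_decay_factor(step, num_training_steps)
    print(f"step {step:>4}: lr = {lr:.6f}")
```

Because the factor reaches zero exactly at the last step, the final updates have almost no effect. That is the intended behavior of a linear decay without warmup.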

In [25]:
len(train_dataloader),len(dataset['train'])

(500, 1000)
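The output above follows from the batch size: 1000 examples in batches of 2 give 500 batches per epoch. A small sketch of that calculation is shown here; `num_batches` is a hypothetical helper that mirrors how a data loader counts batches when it keeps the final, possibly smaller, batch.

```python
import math


def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a data loader yields per epoch."""
    if drop_last:
        return num_examples // batch_size        # incomplete final batch is dropped
    return math.ceil(num_examples / batch_size)  # incomplete final batch is kept


train_batches = num_batches(1000, 2)   # training subset used in this lab
eval_batches = num_batches(200, 2)     # test subset used in this lab
total_steps = 10 * train_batches       # num_epochs * len(train_dataloader)

print(train_batches, eval_batches, total_steps)
```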

## Device Agnostic Code

In [26]:
device = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
device

'mps'

In [27]:
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e

## Training Loop

In [28]:
from torchmetrics import Accuracy

def train(dataloader, model):
    progress_bar = tqdm(range(num_training_steps))
    model.train()

    train_losses = []
    for epoch in range(num_epochs):
        total_loss = 0
        for batch in dataloader:
            # move the batch to the appropriate device
            batch = {k: v.to(device) for k, v in batch.items()}

            # forward pass (the model returns a loss because the batch contains labels)
            output = model(**batch)

            # read the loss computed by the model
            loss = output.loss

            # backward pass
            loss.backward()

            total_loss += loss.item()

            # update the model parameters
            optimizer.step()

            # update the learning rate scheduler
            lr_scheduler.step()

            # clear the gradients
            optimizer.zero_grad()

            # update the progress bar
            progress_bar.update(1)

        # record the average loss of this epoch
        train_losses.append(total_loss / len(dataloader))

    # plot loss
    plt.plot(train_losses)
    plt.title("Training Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.show()

## Evaluate

In [29]:
def evaluate(model,eval_dataloader):

    # Create an instance of the accuracy metric for multiclass classification with 5 classes

    metric = Accuracy(task = "multiclass", num_classes = 5).to(device)


    # set the model in evaluation mode
    model.eval()

    with torch.no_grad():
        for batch in eval_dataloader:

            batch = {k: v.to(device) for k,v in batch.items()}

            # forward pass through the model
            outputs = model(**batch)

            # get the predicted class labels
            logits = outputs.logits
            predictions = torch.argmax(logits, dim = -1)

            # Accumulate the predictions and labels for the metric
            metric(predictions, batch['labels'])


    # compute the accuracy
    accuracy = metric.compute()

    # print the accuracy
    print(f"Accuracy: {accuracy.item()}")

            

In [30]:
evaluate(model,eval_dataloader)

Accuracy: 0.14499999582767487
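An accuracy of about 0.145 on 200 test reviews corresponds to 29 correct predictions, which is close to what random guessing would give. The sketch below shows the metric itself in plain Python. The `accuracy` helper and the example lists are hypothetical, and the chance level assumes the five classes are roughly balanced.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the predicted class equals the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical predictions and labels for six reviews (classes 0-4)
example_predictions = [4, 4, 0, 2, 1, 3]
example_labels = [2, 4, 0, 2, 3, 3]
print(accuracy(example_predictions, example_labels))

# Expected accuracy of uniform random guessing over 5 balanced classes
chance_level = 1 / 5
print(chance_level)
```

A freshly initialized classification head scoring near the chance level is expected. The score should rise only after fine-tuning.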


In [None]:
train(dataloader = train_dataloader, model = model )

  0%|          | 0/5000 [00:00<?, ?it/s]

## Loading the saved Model

In [31]:
!wget 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/wFhKpkBMSgjmZKRSyayvsQ/bert-classification-model.pt'
model.load_state_dict(torch.load('bert-classification-model.pt', map_location = device))

--2025-08-13 05:06:58--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/wFhKpkBMSgjmZKRSyayvsQ/bert-classification-model.pt
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 198.23.119.245
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|198.23.119.245|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 433341834 (413M) [binary/octet-stream]
Saving to: ‘bert-classification-model.pt.3’


2025-08-13 05:14:34 (935 KB/s) - ‘bert-classification-model.pt.3’ saved [433341834/433341834]



<All keys matched successfully>

In [35]:
evaluate(model,eval_dataloader)

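RuntimeError: Placeholder storage has not been allocated on MPS device!

This error usually means that the model and the input tensors are on different devices. Moving the model to the device again with model.to(device) before calling evaluate typically resolves it.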

## Training a conversational model using SFTTrainer (Supervised Fine-Tuning Trainer)

The SFTTrainer from the trl (Transformer Reinforcement Learning) library is a tool for supervised fine-tuning of language models. It refines pretrained models on specific datasets to improve their performance on targeted tasks.

## Objective

Explore how fine-tuning a decoder transformer on a specific dataset affects the quality of the generated responses in a question-answering task.

Step 1: Load the train split of the "timdettmers/openassistant-guanaco" dataset from Hugging Face

In [33]:
from datasets import load_dataset
train_dataset = load_dataset("timdettmers/openassistant-guanaco", split = 'train')
train_dataset[0]

Repo card metadata block was not found. Setting CardData to empty.


{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining po
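Each record in this dataset is a single string in which the speakers are marked with "### Human:" and "### Assistant:". The sketch below splits such a string into turns, which is handy for inspecting samples. The `split_turns` helper is hypothetical, and the sample string is a shortened stand-in written for illustration, not an actual dataset record.

```python
import re


def split_turns(text):
    """Split '### Human: ...### Assistant: ...' text into (speaker, message) pairs."""
    # With one capturing group, re.split returns
    # ['', speaker, message, speaker, message, ...]
    parts = re.split(r"### (Human|Assistant): ", text)
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts) - 1, 2)]


# Shortened, illustrative sample in the same format as the dataset
sample = "### Human: What is a monopsony?### Assistant: A market with a single buyer."
turns = split_turns(sample)
for speaker, message in turns:
    print(f"{speaker}: {message}")
```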

Step 2: Load the pretrained causal language model "facebook/opt-350m" along with its tokenizer from Hugging Face

In [36]:
model = AutoModelForCausalLM.from_pretrained('facebook/opt-350m')
tokenizer = AutoTokenizer.from_pretrained('facebook/opt-350m')

ValueError: Due to a serious vulnerability issue in `torch.load`, even with `weights_only=True`, we now require users to upgrade torch to at least v2.6 in order to use the function. This version restriction does not apply when loading files with safetensors.
See the vulnerability report here https://nvd.nist.gov/vuln/detail/CVE-2025-32434

In [37]:
torch.__version__

'2.3.1'

In [44]:
output = next(iter(train_dataloader))
output,output.keys()


({'labels': tensor([2, 4]),
  'input_ids': tensor([[ 101, 2677, 2940,  ...,    0,    0,    0],
          [ 101,  146, 1138,  ...,    0,    0,    0]]),
  'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
          [0, 0, 0,  ..., 0, 0, 0]]),
  'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
          [1, 1, 1,  ..., 0, 0, 0]])},
 dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask']))

In [41]:
output['input_ids'], output['attention_mask']

(tensor([[  101,   146,  1400,  ...,     0,     0,     0],
         [  101,  1422, 23197,  ...,     0,     0,     0]]),
 tensor([[1, 1, 1,  ..., 0, 0, 0],
         [1, 1, 1,  ..., 0, 0, 0]]))

In [45]:
logits = model(output['input_ids'].to(device), output['attention_mask'].to(device))


In [52]:
logits.logits

tensor([[-4.0062e-01, -2.4161e-01, -3.5332e-01, -3.0424e-01, -1.4994e-02],
        [-7.9479e-01, -3.3666e-01, -7.7516e-02, -3.0356e-01, -3.0619e-04]],
       device='mps:0', grad_fn=<LinearBackward0>)
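The logits above are unnormalized scores, one per class. A softmax turns each row into probabilities, and the index of the largest value is the predicted class. The sketch below applies this to the two rows printed above using only the standard library; the `softmax` helper is written here for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# The two rows of logits printed above
batch_logits = [
    [-4.0062e-01, -2.4161e-01, -3.5332e-01, -3.0424e-01, -1.4994e-02],
    [-7.9479e-01, -3.3666e-01, -7.7516e-02, -3.0356e-01, -3.0619e-04],
]

probabilities = [softmax(row) for row in batch_logits]
predictions = [max(range(len(p)), key=p.__getitem__) for p in probabilities]

for probs, pred in zip(probabilities, predictions):
    print(pred, [round(p, 3) for p in probs])
```

Taking the argmax of the probabilities gives the same class as taking the argmax of the raw logits. This is why the evaluate function skips the softmax.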

In [55]:
output.items()

dict_items([('labels', tensor([2, 4])), ('input_ids', tensor([[ 101, 2677, 2940,  ...,    0,    0,    0],
        [ 101,  146, 1138,  ...,    0,    0,    0]])), ('token_type_ids', tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]])), ('attention_mask', tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]]))])