Fine-Tuning with Hugging Face:
1. Hugging Face is an open-source ML platform.
2. It provides the `transformers` library for NLP applications.
3. It allows users to share ML models and datasets.
 

## Defining the dataset

Datasets hosted on Hugging Face can be loaded with the `load_dataset` function:
`from datasets import load_dataset`

Let's load the Yelp review dataset.

### Yelp Review Dataset

A list-like object consisting of user reviews and accompanying metadata from the Yelp platform.
Each review is a dictionary that typically contains the review text under the key `text` and the rating under the key `label`.
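
As a rough illustration of that structure, the sketch below builds a toy stand-in with plain Python objects. The two records are invented for illustration and are not taken from the dataset; only the key names `label` and `text` come from the description above.

```python
# Toy stand-in for one split of the dataset: a list-like sequence of dictionaries.
# The records below are invented for illustration and are not real Yelp reviews.
toy_split = [
    {"label": 4, "text": "Great coffee and friendly staff."},
    {"label": 0, "text": "Cold food and a long wait."},
]

# Index a record first, then look up a field by key.
first = toy_split[0]
print(sorted(first.keys()))
print(first["label"], first["text"])
```

The same two-step access pattern (row index, then key) is what the notebook uses later on the real dataset.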


## Installing required Libraries

In [1]:
!pip install torch==2.2.2
!pip install torchtext==0.17.2
!pip install portalocker==2.8.2
!pip install torchdata==0.7.1
!pip install pandas
!pip install matplotlib==3.9.0 scikit-learn==1.5.0
!pip install numpy==1.26.0
!pip install --user transformers==4.42.1
!pip install --user datasets # 2.20.0
!pip install "portalocker>=2.0.0"
!pip install torch==2.3.1
!pip install --user torchmetrics==1.4.0.post0
!pip install numpy==1.26.4
!pip install peft==0.11.1
!pip install evaluate==0.4.2
!pip install -q bitsandbytes==0.43.1
!pip install --user accelerate==0.31.0
!pip install --user torchvision==0.18.1


!pip install --user trl==0.9.4
!pip install --user protobuf==3.20.*
!pip install matplotlib

!pip install --upgrade trl

Collecting torch==2.2.2
  Using cached torch-2.2.2-cp312-none-macosx_11_0_arm64.whl.metadata (25 kB)
Using cached torch-2.2.2-cp312-none-macosx_11_0_arm64.whl (59.7 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.3.1
    Uninstalling torch-2.3.1:
      Successfully uninstalled torch-2.3.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.18.1 requires torch==2.3.1, but you have torch 2.2.2 which is incompatible.
Successfully installed torch-2.2.2
Collecting numpy==1.26.0
  Using cached numpy-1.26.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (53 kB)
Using cached numpy-1.26.0-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Succ
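
Because the install cell pins some packages more than once (for example `torch==2.2.2` and `torch==2.3.1`), it can help to check which versions actually ended up installed. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical, and the package list is simply copied from the install cell.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Package names copied from the install cell above.
for name in ["torch", "torchvision", "torchtext", "numpy", "transformers", "datasets", "trl"]:
    print(f"{name}: {installed_version(name)}")
```

A `None` in the output would indicate that the package is not installed in the environment the kernel is using.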

## Importing required Libraries

In [4]:
import torch
import torchtext
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from torch.utils.data import Dataset, DataLoader
from torchtext.vocab import build_vocab_from_iterator, GloVe, Vocab, Vectors
# trl: supervised fine-tuning (SFT) utilities
from trl import SFTConfig, SFTTrainer  # DataCollatorForCompletionOnlyLM

from datasets import load_dataset
import pickle
import os
import math
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, AutoModelForSequenceClassification, BertConfig, BertForMaskedLM, TrainingArguments, Trainer
from transformers import pipeline


# Tokenizer
from transformers import AutoTokenizer, BertTokenizer, BertTokenizerFast

from tqdm.auto import tqdm
import time

import warnings

# Suppress all warnings to keep the notebook output readable
def warn(*args, **kwargs):
    pass

warnings.warn = warn
warnings.filterwarnings('ignore')

## Dataset Preparations

The Yelp review dataset is a widely used dataset in natural language processing (NLP) and sentiment analysis research. It consists of user reviews and accompanying metadata from the Yelp platform, which is a popular online platform for reviewing and rating local businesses such as restaurants, hotels, and shops.

The full Yelp dataset includes 6,990,280 reviews written by Yelp users, covering a wide range of businesses and locations. The `yelp_review_full` version loaded below contains 650,000 training reviews and 50,000 test reviews. Each review contains the review text along with the star rating given by the user (1 to 5 stars), stored in the `label` field as an integer from 0 to 4.

Our aim in this lab is to fine-tune a pretrained BERT model to predict the ratings from the reviews.
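
Since the ratings range from 1 to 5 while the sample records in this notebook carry labels such as 0 and 4, a small conversion helper can make predictions easier to read. The sketch below assumes that label `k` corresponds to `k + 1` stars; both function names are hypothetical and not part of any library.

```python
def label_to_stars(label):
    """Map a class label (0-4) to a star rating (1-5), assuming label k means k + 1 stars."""
    if label not in range(5):
        raise ValueError(f"expected a label in 0..4, got {label}")
    return label + 1


def stars_to_label(stars):
    """Inverse mapping: star rating (1-5) to class label (0-4)."""
    if stars not in range(1, 6):
        raise ValueError(f"expected a rating in 1..5, got {stars}")
    return stars - 1


print(label_to_stars(0), label_to_stars(4))
```

Framed this way, the task is a five-class classification problem, which is why a sequence-classification head is a natural fit.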

In [5]:
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [6]:
# Check a sample record of the dataset
dataset['train'][100]

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. 

In [8]:
dataset['train'][15]['label']

4

In [9]:
dataset['train'][15]['text']

"Can't miss stop for the best Fish Sandwich in Pittsburgh."

In [10]:
dataset['train'][15]

{'label': 4,
 'text': "Can't miss stop for the best Fish Sandwich in Pittsburgh."}

In [11]:
# Keep only the first 1,000 training examples and the first 200 test examples
dataset['train'] = dataset['train'].select(range(1000))
dataset['test'] = dataset['test'].select(range(200))

In [12]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 200
    })
})
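
Taking the first rows is simple and deterministic, but it is not a random sample. If a random subset is preferred, one option is to draw the row indices with a fixed seed, as in the sketch below. The helper `sample_indices` is hypothetical; the sizes are the ones shown in the notebook output.

```python
import random


def sample_indices(num_rows, k, seed=0):
    """Pick k distinct row indices out of num_rows, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_rows), k))


# Sizes taken from the notebook: 650,000 training rows, of which 1,000 are kept.
train_indices = sample_indices(650_000, 1_000, seed=42)
print(len(train_indices), train_indices[:5])
```

A list like `train_indices` could then be passed to `dataset['train'].select(...)` in place of `range(1000)`. The `datasets` library also offers `shuffle(seed=...)`, which can be chained with `select` for a similar effect.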

## Tokenizing Data

In [13]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

# Define a function to tokenize examples
def tokenize_function(examples):
    # tokenize the text using the tokenizer
    # apply padding to ensure all sequences have the same length
    # apply truncation to limit the maximum sequence length
    return tokenizer(examples['text'], padding='max_length', truncation=True)


# Apply the tokenizer function to the dataset in batches
tokenized_datasets = dataset.map(tokenize_function, batched = True)
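
To make the effect of `padding='max_length'` and `truncation=True` concrete, here is a minimal pure-Python sketch of the idea. It is not the tokenizer's actual implementation. It assumes BERT's special-token IDs (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) and uses a tiny `max_length` of 6 for readability; the helper `pad_or_truncate` and the IDs 11 to 16 are hypothetical placeholders.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed BERT special-token IDs


def pad_or_truncate(token_ids, max_length):
    """Wrap token_ids in [CLS]/[SEP], then truncate or pad to exactly max_length.

    Assumes max_length >= 2 so that both special tokens fit.
    """
    body = token_ids[: max_length - 2]  # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    return ids + [PAD_ID] * (max_length - len(ids))


short = pad_or_truncate([8667, 1362], max_length=6)
long = pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=6)
print(short)  # [101, 8667, 1362, 102, 0, 0]
print(long)   # [101, 11, 12, 13, 14, 102]
```

Either way, every example ends up with the same length, which is what allows a batch to be stacked into a single tensor.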



In [14]:
tokenized_datasets['train'][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 'input_ids': [101, 173, 1197, 119, 2284, 2953, 3272, 1917, 178, 1440, 1111, 1107, 170, 1704,
  22351, 119, 1119, 112, 188, 3505, 1105, 3123, 1106, 2037, 1106, 1443, 1217, 10063, 4404, 132,
  1119, 112, 188, 1579, 1113, 1159, 1107, 3195, 1117, 4420, 132, 1119, 112, 188, 6559, 1114,
  170, 1499, 118, 23555, 2704, 113, 183, 9379, 114,
  1

1. input_ids
This is the tokenized text: each word or subword is converted into a numerical ID based on the tokenizer’s vocabulary.

Example: "Hello world" → [101, 8667, 1362, 102] (IDs from BERT’s vocab, where 101 and 102 are the special `[CLS]` and `[SEP]` tokens).

These IDs are what get fed into the embedding layer of the model.

2. token_type_ids (aka segment IDs)
Used only by some models (like BERT) for tasks involving two sentences in one input (e.g., question + answer, sentence A + sentence B).

The values tell the model which tokens belong to which segment:

0 → tokens from sentence A

1 → tokens from sentence B

For single-sentence tasks, all values are 0, and you can usually ignore this unless you’re working with paired inputs.

3. attention_mask
Tells the model which tokens are real and which are padding:

1 → keep this token (attend to it)

0 → ignore this token (it’s padding)

Essential when you use padding="max_length": without the mask, the model would attend to padding tokens as if they were real tokens, and they would affect its output.
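
The three fields can be sketched together for a sentence pair. This is a simplified illustration of the conventions described above, not the tokenizer's own code. It assumes BERT's special-token IDs (101, 102, and 0 for padding) and does no truncation; the function name `encode_pair` is hypothetical.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed BERT special-token IDs


def encode_pair(ids_a, ids_b, max_length):
    """Build input_ids, token_type_ids and attention_mask for a sentence pair."""
    segment_a = [CLS_ID] + ids_a + [SEP_ID]  # [CLS] A [SEP] belongs to segment 0
    segment_b = ids_b + [SEP_ID]             # B [SEP] belongs to segment 1
    n_real = len(segment_a) + len(segment_b)
    if n_real > max_length:
        raise ValueError("pair is longer than max_length; this sketch does not truncate")
    n_pad = max_length - n_real
    return {
        "input_ids": segment_a + segment_b + [PAD_ID] * n_pad,
        "token_type_ids": [0] * len(segment_a) + [1] * len(segment_b) + [0] * n_pad,
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


encoded = encode_pair([8667], [1362], max_length=7)
for key, value in encoded.items():
    print(key, value)
```

In this toy output the last two positions are padding, so their `attention_mask` entries are 0, while `token_type_ids` switches from 0 to 1 where the second sentence starts.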

In [15]:
tokenized_datasets['train'][0].keys()

dict_keys(['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'])

In [16]:
type(tokenized_datasets)

datasets.dataset_dict.DatasetDict

In [17]:
# Remove the text column because the model does not accept raw text as input
tokenized_datasets = tokenized_datasets.remove_columns(['text'])

# Rename the label column to labels because the model expects the argument to be named labels
tokenized_datasets = tokenized_datasets.rename_column('label', 'labels')

# Set the format of the dataset to return PyTorch tensors instead of lists
tokenized_datasets.set_format('torch')

In [18]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 200
    })
})
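
The reason for these column changes becomes clearer when a batch is passed to the model as keyword arguments. The sketch below uses a hypothetical stand-in function, `toy_forward`, whose parameter names mirror the remaining columns; it is not the real model. It assumes the training loop will call the model in the style `model(**batch)`.

```python
def toy_forward(input_ids, attention_mask, token_type_ids=None, labels=None):
    """Hypothetical stand-in for a model's forward signature."""
    return {"has_labels": labels is not None, "n_tokens": len(input_ids)}


raw = {"label": 4, "text": "some review", "input_ids": [101, 102],
       "token_type_ids": [0, 0], "attention_mask": [1, 1]}

# Mirror remove_columns(['text']) and rename_column('label', 'labels').
batch = {key: value for key, value in raw.items() if key != "text"}
batch["labels"] = batch.pop("label")

# The raw record has keys the function does not accept, so the call fails.
try:
    toy_forward(**raw)
    raw_accepted = True
except TypeError:
    raw_accepted = False

out = toy_forward(**batch)
print(raw_accepted, out)
```

With the cleaned record, every key matches a parameter name, so the call goes through and the labels are available for computing a loss.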

In [20]:
dataset['train'][100], tokenized_datasets['train'][100]

({'label': 0,
  'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years

In [21]:
tokenized_datasets['train'][0].keys()

dict_keys(['labels', 'input_ids', 'token_type_ids', 'attention_mask'])

## DataLoader

In [22]:
# Create a training data loader
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, batch_size=2)

# Create an evaluation dataloader
eval_dataloader = DataLoader(tokenized_datasets["test"], batch_size=2)
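
As a rough mental model of what the two loaders do, the pure-Python sketch below groups row indices into batches of 2, shuffling only for training. It is not how `DataLoader` is implemented (for example, it does not collate tensors); the generator `iter_batches` and its `seed` parameter are hypothetical. The row counts are the ones from this notebook.

```python
import math
import random


def iter_batches(num_rows, batch_size, shuffle=False, seed=0):
    """Yield lists of row indices, batch_size at a time, optionally shuffled."""
    indices = list(range(num_rows))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, num_rows, batch_size):
        yield indices[start:start + batch_size]


train_batches = list(iter_batches(1000, batch_size=2, shuffle=True))
eval_batches = list(iter_batches(200, batch_size=2))

# Number of batches per epoch is ceil(num_rows / batch_size).
print(len(train_batches), math.ceil(1000 / 2))  # 500 500
print(len(eval_batches), math.ceil(200 / 2))    # 100 100
```

With a batch size of 2, one pass over the training subset therefore takes 500 steps, and one evaluation pass takes 100.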


## Train the model