<a href="https://colab.research.google.com/github/Pavun-KumarCH/Agentic-RAG-Systems/blob/main/Finetune_Huggingface_Text_Classification_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune a Pre-trained Model Using HuggingFace Transformers
Fine-tuning a pretrained model allows you to leverage the vast amount of knowledge encoded in the model from its initial training on large datasets. This approach significantly reduces the time and computational resources required compared to training a model from scratch. It also helps achieve high performance with relatively small amounts of task-specific data, making it a powerful technique in machine learning and AI development.

## Steps for Fine-Tuning a Pretrained Model
### Choose a Pretrained Model:
Select a model from the Hugging Face Model Hub that suits your task. For example, if you're working on text classification, models like BERT or RoBERTa are popular choices.

### Prepare Your Dataset:
Ensure your dataset is properly formatted. For text tasks, this usually involves tokenizing your text data. You can use a tokenizer from the Transformers library (for example, `AutoTokenizer`) to convert your text into input IDs and attention masks.
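
To make the terms "input IDs" and "attention mask" concrete, here is a minimal toy sketch of what padding to a fixed length and truncation do to a sequence of token IDs. It is not the real tokenizer: the function `pad_or_truncate`, the constants `PAD_ID` and `MAX_LENGTH`, and the token IDs are all made up for illustration.

```python
# Toy illustration only: the IDs are invented and the helper is hypothetical.
PAD_ID = 0       # assumed padding token ID
MAX_LENGTH = 6   # assumed maximum sequence length


def pad_or_truncate(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Return fixed-length input IDs plus the matching attention mask."""
    kept = token_ids[:max_length]      # truncation: drop tokens past max_length
    n_pad = max_length - len(kept)     # padding: fill up to max_length
    return {
        "input_ids": kept + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(kept) + [0] * n_pad,
    }


short_example = pad_or_truncate([101, 2023, 102])
long_example = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102])

print(short_example)
print(long_example)
```

The short sequence is padded with three zeros and its mask ends in three zeros; the long sequence is cut to six tokens and its mask is all ones. A real tokenizer additionally keeps special tokens in place when truncating, which this toy version does not attempt.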

### Set Up Training Arguments:
Define your training parameters using TrainingArguments. This includes specifying the output directory, evaluation strategy, learning rate, batch size, and number of epochs.

### Create a Trainer:
Instantiate a Trainer object, which will handle the training process. You need to provide your model, training arguments, training dataset, evaluation dataset, and a function to compute metrics.

### Train the Model:
Call the train() method on your Trainer object to start the fine-tuning process.

### Evaluate the Model:
After training, you can evaluate the model's performance on the validation dataset to check its accuracy and other metrics.

## Goal of Fine-tuning
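We are going to train a model using the Yelp review dataset. The primary goal is to fine-tune the pretrained model so it can accurately predict the star rating of a Yelp review. The `yelp_review_full` dataset has five classes (labels 0 to 4, corresponding to 1 to 5 stars), which is why the model below is loaded with `num_labels = 5`.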

In [None]:
#@title Install all the necessary libraries
!pip install --q transformers datasets evaluate accelerate torch scikit-learn

In [None]:
#@title Begin by loading the Yelp Reviews dataset:
from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][90]

{'label': 4,
 'text': 'It was just what we were looking for. The service was great. My husband had the veal sausage on green pepper appetizer in the larger size. It was great (spicy) and enough for a meal. I had the stuffed eggplant. That was very good but the sauce could have been a bit heartier. I would definitely go back. Really nice atmosphere and free parking in the back.'}
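
The `label` field is a class index rather than a star count. As a small sketch, assuming the encoding described on the dataset card (index 0 is "1 star" and index 4 is "5 stars"), the conversion looks like this. The helper name `label_to_stars` is hypothetical.

```python
def label_to_stars(label: int) -> int:
    """Map a class index (0-4) to a star rating (1-5).

    Assumes the yelp_review_full encoding where index 0 means 1 star.
    """
    if not 0 <= label <= 4:
        raise ValueError(f"expected a label between 0 and 4, got {label}")
    return label + 1


example = {"label": 4}  # the label of the review printed above
stars = label_to_stars(example["label"])
print(stars)  # 5
```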

## Tokenize the text data to prepare it for the model.


In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenizer_function(examples):
  return tokenizer(examples["text"], padding = 'max_length', truncation = True)

tokenized_datasets = dataset.map(tokenizer_function, batched = True)


If you like, you can create a smaller subset of the full dataset to fine-tune on to reduce the time it takes:


In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed = 42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed = 42).select(range(1000))

## Train with PyTorch Trainer


In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("google-bert/bert-base-cased", num_labels = 5)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at google-bert/bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Training hyperparameters
Create a `TrainingArguments` object, which contains all the hyperparameters you can tune as well as flags for activating different training options. Here only the output directory is set, so every other hyperparameter keeps its default value.



In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir = "test_trainer")
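
For reference, here is a sketch of the main hyperparameters spelled out as keyword arguments. The values are the library defaults as I understand them for recent Transformers releases (learning rate 5e-5, batch size 8 per device, 3 epochs), so check them against the documentation for your installed version. The dictionary name `training_kwargs` is just an illustration.

```python
# Keyword arguments that mirror the defaults this notebook relies on.
# Assumption: these match the TrainingArguments defaults of your installed version.
training_kwargs = {
    "output_dir": "test_trainer",        # where checkpoints are written
    "learning_rate": 5e-5,               # initial learning rate
    "per_device_train_batch_size": 8,    # training batch size on each device
    "per_device_eval_batch_size": 8,     # evaluation batch size on each device
    "num_train_epochs": 3,               # full passes over the training set
    "weight_decay": 0.0,                 # no weight decay by default
}

# With transformers installed, these can be passed straight through:
#     training_args = TrainingArguments(**training_kwargs)
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```

Writing the values out makes it easier to change one of them later, for example lowering the batch size if you run out of GPU memory.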

## Evaluate
`Trainer` does not automatically evaluate model performance during training. You’ll need to pass `Trainer` a function to compute and report metrics.

In [None]:
import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis = -1)
  return metric.compute(predictions = predictions, references = labels)
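
To see what `compute_metrics` does with its inputs, here is a dependency-free sketch of the same two steps: take the highest-scoring class for each row of logits, then compare with the labels. The helper names `argmax` and `accuracy_from_logits` and the logit values are made up for illustration; the notebook itself uses `np.argmax` and the `evaluate` accuracy metric.

```python
def argmax(row):
    """Index of the largest value in a list (plain-Python stand-in for np.argmax)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four toy examples with five class scores each (invented numbers).
logits = [
    [0.1, 0.2, 0.3, 2.5, 0.4],  # highest score at index 3
    [1.9, 0.2, 0.1, 0.0, 0.3],  # highest score at index 0
    [0.0, 0.1, 0.2, 0.3, 1.2],  # highest score at index 4
    [0.5, 1.5, 0.2, 0.1, 0.0],  # highest score at index 1
]
labels = [3, 0, 2, 1]

result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```

Three of the four predictions match their labels, so the accuracy is 0.75.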


## Trainer
Create a Trainer object with your model, training arguments, training and test datasets, and evaluation function.

In [None]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = small_train_dataset,
    eval_dataset = small_eval_dataset,
    compute_metrics = compute_metrics,
)

## Fine-tune your model by calling train()


In [None]:
trainer.train()


TrainOutput(global_step=375, training_loss=1.0014737955729167, metrics={'train_runtime': 296.7648, 'train_samples_per_second': 10.109, 'train_steps_per_second': 1.264, 'total_flos': 789354427392000.0, 'train_loss': 1.0014737955729167, 'epoch': 3.0})

### Here's a detailed explanation of each component:
**global_step=375:**

This indicates the total number of steps (batches) the model has been trained on. Each step corresponds to one batch of data passed through the model.
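
The step count can be reproduced by hand. This sketch assumes the defaults were in effect (a per-device batch size of 8 and 3 epochs), a single device, and no gradient accumulation.

```python
import math

num_train_samples = 1000  # size of small_train_dataset
batch_size = 8            # assumed default per-device batch size, one device
num_epochs = 3            # assumed default number of epochs

# The last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
global_step = steps_per_epoch * num_epochs

print(steps_per_epoch, global_step)  # 125 375
```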

**training_loss=1.0014737955729167:**

The average training loss over all batches and epochs. Loss measures how far the model's predictions are from the training labels; a lower value indicates a better fit. Here, the training loss is approximately 1.001.
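
A loss near 1.0 is easier to judge against a baseline. The sketch below assumes the model is trained with the usual cross-entropy loss for single-label classification, and computes the loss of a model that guesses uniformly among the five classes. The helper `cross_entropy` is written here for illustration.

```python
import math

NUM_LABELS = 5


def cross_entropy(probabilities, true_label):
    """Cross-entropy for one example: negative log-probability of the true class."""
    return -math.log(probabilities[true_label])


# A model with no information assigns 1/5 to every class.
uniform = [1 / NUM_LABELS] * NUM_LABELS
baseline_loss = cross_entropy(uniform, true_label=2)
print(f"uniform-guess loss: {baseline_loss:.4f}")  # 1.6094

# An average loss of 1.0 corresponds to a geometric-mean probability
# of exp(-1.0) on the correct class.
implied_probability = math.exp(-1.0)
print(f"implied probability at loss 1.0: {implied_probability:.3f}")  # 0.368
```

So a loss of about 1.0 is clearly better than uniform guessing (about 1.609), while still leaving room for improvement.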

**metrics:train_runtime=296.7648:**

The total time taken to complete the training, in seconds (approximately 297 seconds, or just under 5 minutes).

**train_samples_per_second=10.109:**

The number of training samples processed per second, averaged over the whole run. This value depends heavily on the hardware, the batch size, and the sequence length.

**train_steps_per_second=1.264:**

The number of training steps (batches) processed per second.
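
Both throughput figures follow from the runtime and the step count in the printed `TrainOutput`. The sketch below recomputes them, assuming 1,000 training examples seen in each of the 3 epochs.

```python
train_runtime = 296.7648       # seconds, from the printed TrainOutput
num_samples_seen = 1000 * 3    # 1,000 examples per epoch x 3 epochs
global_step = 375              # from the printed TrainOutput

samples_per_second = num_samples_seen / train_runtime
steps_per_second = global_step / train_runtime

print(f"{samples_per_second:.3f} samples/s")  # 10.109 samples/s
print(f"{steps_per_second:.3f} steps/s")      # 1.264 steps/s
```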

**total_flos=789354427392000.0:**

The estimated total number of floating-point operations performed during training (about 7.9 × 10^14). This is a cumulative count, not a per-second rate, and gives an indication of the computational workload.

**train_loss=1.0014737955729167:**

The same as the training loss mentioned earlier.

**epoch=3.0:**

Indicates that the training process ran for 3 epochs (full passes over the training dataset).


## Evaluate the model

After training, you can evaluate the model to see its performance on the evaluation dataset.

In [None]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 1.0252549648284912, 'eval_accuracy': 0.597, 'eval_runtime': 29.0504, 'eval_samples_per_second': 34.423, 'eval_steps_per_second': 4.303, 'epoch': 3.0}


### Here's a detailed explanation of each component:
**eval_loss=1.0252549648284912:**

The loss computed on the evaluation (validation) dataset. Similar to training loss, but measured on data the model was not trained on, so it indicates how well the model generalizes. Here, the evaluation loss is approximately 1.025.

**eval_accuracy=0.597:**

The accuracy of the model on the evaluation dataset. It represents the proportion of correctly classified instances. An accuracy of 0.597 means the model correctly classified 59.7% of the evaluation samples.
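
With only 1,000 evaluation examples, the accuracy is an estimate with some sampling noise. As a rough sketch, the printed `eval_accuracy` of 0.597 can be compared against chance and given an approximate 95% margin. Two assumptions are made here: the five classes are roughly balanced in the evaluation subset, and the normal approximation to the binomial is good enough.

```python
import math

eval_accuracy = 0.597      # from the printed evaluation results
num_eval_samples = 1000    # size of small_eval_dataset
num_labels = 5

num_correct = round(eval_accuracy * num_eval_samples)
chance_accuracy = 1 / num_labels  # assumes roughly balanced classes

# Standard error of a proportion, then a normal-approximation 95% margin.
standard_error = math.sqrt(eval_accuracy * (1 - eval_accuracy) / num_eval_samples)
margin_95 = 1.96 * standard_error

print(f"correct predictions: {num_correct}")       # 597
print(f"chance accuracy: {chance_accuracy:.2f}")   # 0.20
print(f"95% margin: +/- {margin_95:.3f}")          # +/- 0.030
```

Under these assumptions the accuracy is about 0.60 plus or minus 0.03, well above the 0.20 expected from random guessing.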

**eval_runtime=29.0504:**

The total time taken to complete the evaluation, in seconds (approximately 29 seconds).

**eval_samples_per_second=34.423:**

The number of evaluation samples processed per second. This value is higher than the training samples per second, which is common since evaluation usually involves forward passes only, without backpropagation.

**eval_steps_per_second=4.303:**

The number of evaluation steps (batches) processed per second. This value is also higher than the training steps per second, for similar reasons.

You have fine-tuned a pretrained BERT model (`google-bert/bert-base-cased`) on a 1,000-example subset of the Yelp review dataset. The model adapted its general language understanding to the specific task of predicting a review's star rating, reaching about 60% accuracy on the evaluation subset.