<a href="https://colab.research.google.com/github/AVI18794/Udacity_Generative_AI_Nanodegree_Projects/blob/main/Udacity_Project_1_Lightweight_Fine_Tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Lightweight Fine-Tuning Project
In this project, a pre-trained RoBERTa model is used to classify emotions in the Emotion dataset.

The workflow first evaluates the pre-trained model on the dataset to establish a baseline. Lightweight fine-tuning is then applied using PEFT techniques, including QLoRA and Prefix Tuning. Finally, the results of the fine-tuned models are compared with the pre-trained model's performance.

Dataset link: https://huggingface.co/datasets/dair-ai/emotion

## Project Components:
### PEFT Techniques:

1) QLoRA (Quantized Low-Rank Adaptation): Combines LoRA with quantization. The base model's weights are stored in a quantized (typically 4-bit) form and kept frozen, which reduces memory usage, while only small low-rank adapter matrices are trained.

2) Prefix Tuning: Prepends trainable prefix vectors (virtual tokens) to the inputs of the model's attention layers, enabling efficient task adaptation without modifying the model's core weights.
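
To make the "small number of trainable parameters" claim concrete, the arithmetic behind LoRA can be sketched in a few lines. The rank (`RANK = 8`) and the choice of adapting the query and value projections are assumptions made for illustration, not settings taken from this notebook; the hidden size (768) and layer count (12) are those of roberta-base. Biases are ignored.

```python
# Hypothetical LoRA settings; the notebook's actual values may differ.
HIDDEN_SIZE = 768       # roberta-base hidden dimension
NUM_LAYERS = 12         # roberta-base encoder layers
RANK = 8                # assumed LoRA rank r
ADAPTED_PER_LAYER = 2   # assumed: query and value projections only


def lora_params_per_matrix(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank pair B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


def dense_params_per_matrix(d_out: int, d_in: int) -> int:
    """Parameters in the dense weight matrix that stays frozen."""
    return d_out * d_in


per_matrix = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, RANK)
lora_total = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
frozen_total = (
    dense_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE)
    * ADAPTED_PER_LAYER
    * NUM_LAYERS
)

print(f"LoRA parameters per adapted matrix: {per_matrix}")
print(f"LoRA parameters in total:           {lora_total}")
print(f"Frozen weights being adapted:       {frozen_total}")
print(f"Trainable fraction of those:        {lora_total / frozen_total:.2%}")
```

Under these assumptions the adapters add roughly 2% as many parameters as the projection matrices they modify, which is why the optimizer state and gradients stay small.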

### Model Selection:

1) roberta-base: A BERT variant with an improved pre-training procedure that performs well on text classification, offering a strong balance between accuracy and computational efficiency.

### Evaluation Metric:

1) Accuracy (🤗 Evaluate library): A straightforward and effective measure of classification performance.
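
As a reminder of what the metric measures, accuracy is the fraction of predictions that match the reference labels. The plain-Python sketch below is only an illustration of that definition; the `accuracy` helper and the sample label lists are hypothetical and are not the Evaluate library's implementation.

```python
from typing import Sequence


def accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Return the fraction of predictions equal to their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if len(references) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up class ids in the range 0..5, one per emotion class.
sample_predictions = [0, 1, 1, 3, 4, 5]
sample_references = [0, 1, 2, 3, 4, 4]

score = accuracy(sample_predictions, sample_references)
print(f"accuracy = {score:.4f}")  # 4 of the 6 predictions match
```

Accuracy treats every class equally per sample, so on an imbalanced dataset it can be worth reporting a per-class metric alongside it.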

### Fine-Tuning Dataset:

1) Emotion dataset: Comprises text samples labeled with six emotions: sadness, joy, love, anger, fear, and surprise.



# Loading and Evaluating a Foundation Model
In this step, the chosen pre-trained Hugging Face model is loaded along with an appropriate tokenizer. The Emotion dataset is also loaded and tokenized for evaluation. The model's performance is evaluated on the dataset prior to fine-tuning to establish a baseline.

In [1]:
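# Upgrade torchvision together with torch: upgrading torch alone can leave the
# preinstalled torchvision build mismatched, which can surface later as
# "operator torchvision::nms does not exist" when transformers is imported.
%pip install --upgrade transformers torch torchvision bitsandbytes accelerate peft scikit-learn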

Collecting transformers
  Downloading transformers-4.49.0-py3-none-any.whl.metadata (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.0/44.0 kB[0m [31m1.5 MB/s[0m eta [36m0:00:00[0m
Collecting torch
  Downloading torch-2.6.0-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting accelerate
  Downloading accelerate-1.4.0-py3-none-any.whl.metadata (19 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu1

In [2]:
%pip install evaluate scikit-learn

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-3.3.2-py3-none-any.whl.metadata (19 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.9-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.17-py311-none-any.whl.metadata (7.2 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting multiprocess (from evaluate)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m84.0/84.0 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading datasets-3.3.2-py3-none-any.whl (485 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━

In [3]:
import random
import numpy as np
import torch

# Set random seed for reproducibility
random_seed = 42
random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)

# If using GPU
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

In [4]:
from datasets import load_dataset

# Load the Emotion dataset
dataset = load_dataset("dair-ai/emotion")

# View dataset structure
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


# Dataset Structure
The Emotion dataset consists of three predefined splits:

* Train: 16,000 samples
* Validation: 2,000 samples
* Test: 2,000 samples

Each sample contains the following features:

* Text: The input text.
* Label: The emotion class.
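
The split sizes above correspond to an 80/10/10 partition of 20,000 samples. A short sketch of that arithmetic, using only the sizes printed by the dataset (the `split_sizes` dictionary is a hand-written stand-in, not an object returned by the `datasets` library):

```python
# Split sizes copied from the DatasetDict output above.
split_sizes = {"train": 16_000, "validation": 2_000, "test": 2_000}

total = sum(split_sizes.values())
fractions = {name: size / total for name, size in split_sizes.items()}

print(f"total samples: {total}")
for name, fraction in fractions.items():
    print(f"{name:<10} {split_sizes[name]:>6} samples ({fraction:.0%})")
```

With a real `DatasetDict`, the same sizes could be read from each split's `num_rows` attribute instead of being typed in by hand.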

In [5]:
# View labels in the dataset
print(dataset["train"].features["label"].names)


['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


# Dataset Labels
The Emotion dataset includes six emotion classes:
* sadness
* joy
* love
* anger
* fear
* surprise
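
The integer stored in each sample's label field is the position of the class name in this list. A minimal sketch of the two lookup tables that follow from that ordering; the names `id2label` and `label2id` are chosen here to mirror the keyword arguments that Hugging Face model configs accept, but passing them to the model is an optional step that this notebook does not show.

```python
# Class names in the order printed by the dataset's label feature.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Integer id -> class name, and the inverse mapping.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)
print(f"class id 2 is '{id2label[2]}'")
print(f"'fear' has class id {label2id['fear']}")
```

Supplying such mappings when loading a classification model lets predictions be reported as emotion names rather than generic placeholders such as `LABEL_0`.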




In [6]:
# View three random samples and their labels
random_indices = random.sample(range(len(dataset['train'])), 3)
for idx in random_indices:
    print(f"Text: {dataset['train'][idx]['text']}")
    print(f"Label: {dataset['train'].features['label'].names[dataset['train'][idx]['label']]}")
    print("-" * 50)

Text: i do find new friends i m going to try extra hard to make them stay and if i decide that i don t want to feel hurt again and just ride out the last year of school on my own i m going to have to try extra hard not to care what people think of me being a loner
Label: sadness
--------------------------------------------------
Text: i asked them to join me in creating a world where all year old girls could grow up feeling hopeful and powerful
Label: joy
--------------------------------------------------
Text: i feel when you are a caring person you attract other caring people into your life
Label: love
--------------------------------------------------


In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load RoBERTa tokenizer and model
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(dataset["train"].features["label"].names)  # Number of emotion labels
)

# Freeze model parameters to prevent weight updates
for param in model.parameters():
    param.requires_grad = False


RuntimeError: Failed to import transformers.models.roberta.modeling_roberta because of the following error (look up to see its traceback):
operator torchvision::nms does not exist