# Lightweight Fine-Tuning Project

In this project, a pre-trained RoBERTa model is utilized to perform emotion classification on the Emotion dataset.

The workflow first evaluates the pre-trained model on the dataset. Lightweight fine-tuning is then applied using two PEFT techniques, QLoRA and Prefix Tuning. Finally, the results from the fine-tuned models are compared with the pre-trained model's performance.

Link to the dataset: https://huggingface.co/datasets/dair-ai/emotion

### Choices for the Project:

* **PEFT technique**: 
  - **QLoRA** (Quantized Low Rank Adaptation), which combines LoRA with quantization to reduce memory usage while fine-tuning a small subset of model parameters, making it highly efficient.
  - **Prefix Tuning**, which prepends trainable prefix vectors (virtual tokens) to the attention keys and values of each layer, enabling efficient adaptation of the model to new tasks without modifying its core weights.
* **Model**: 
  - `roberta-base`, a robustly optimized BERT variant known for its strong performance in text classification tasks, providing a good balance of accuracy and computational requirements.
* **Evaluation approach**: 
  - Accuracy metric from the 🤗 Evaluate library, as it provides an intuitive measure of model performance for classification tasks.
* **Fine-tuning dataset**: 
  - Emotion dataset, which contains text samples labeled with one of six emotions (`sadness`, `joy`, `love`, `anger`, `fear`, `surprise`).
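
The low-rank idea behind LoRA can be illustrated with a small pure-Python sketch. This is not the PEFT implementation; the helper names (`matmul`, `lora_forward`, `lora_param_count`) and the toy matrices are hypothetical and only show that the frozen weight `W` is combined with a scaled low-rank update `B @ A`.

```python
# Illustrative LoRA sketch (hypothetical helpers, not the PEFT library code).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Return W x + (alpha / r) * B (A x) for a column vector x."""
    base = matmul(W, x)                # frozen path
    update = matmul(B, matmul(A, x))   # trainable low-rank path
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

def lora_param_count(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r

# Toy example: 4 inputs, 3 outputs, rank 2.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
A = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
B = [[0.0, 0.0] for _ in range(3)]     # B starts at zero, so the update is zero
x = [[1.0], [2.0], [3.0], [4.0]]

h0 = lora_forward(x, W, A, B, alpha=2, r=2)
print(h0)                               # identical to W x while B is zero

# For one 768 x 768 projection with r = 32:
print(768 * 768, lora_param_count(768, 768, 32))
```

With `B` initialized to zero the adapted layer starts out identical to the frozen layer, and for a 768 x 768 projection the rank-32 update needs 49,152 parameters instead of 589,824.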


## Loading and Evaluating a Foundation Model

In this step, the chosen pre-trained Hugging Face model is loaded along with an appropriate tokenizer. The Emotion dataset is also loaded and tokenized for evaluation. The model's performance is evaluated on the dataset prior to fine-tuning to establish a baseline.

In [1]:
%pip install --upgrade transformers torch bitsandbytes accelerate peft scikit-learn


In [2]:
%pip install evaluate scikit-learn

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3
Note: you may need to restart the kernel to use updated packages.


**Note**: Kernel restart required after running the above pip commands.

In [3]:
import random
import numpy as np
import torch

# Set random seed for reproducibility
random_seed = 42
random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)

# If using GPU
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(random_seed)

In [4]:
from datasets import load_dataset

# Load the Emotion dataset
dataset = load_dataset("dair-ai/emotion")

# View dataset structure
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})


### Dataset Structure
The Emotion dataset consists of three predefined splits:
- **Train**: 16,000 samples
- **Validation**: 2,000 samples
- **Test**: 2,000 samples

Each sample contains the following features:
- **Text**: The input text.
- **Label**: The emotion class.

In [5]:
# View labels in the dataset
print(dataset["train"].features["label"].names)

['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']


### Labels in the Dataset
The Emotion dataset includes six emotion classes:
- `sadness`
- `joy`
- `love`
- `anger`
- `fear`
- `surprise`
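
As a small illustration, the label list can be turned into lookup tables between class ids and names. The variable names below are hypothetical; dictionaries of this shape are commonly passed to `from_pretrained` as `id2label` and `label2id` so that predictions display readable names, although the notebook itself does not do this.

```python
# Label names in the order printed above; the index is the class id.
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Hypothetical lookup tables between integer ids and label names.
id2label = dict(enumerate(label_names))
label2id = {name: idx for idx, name in id2label.items()}

print(id2label)
print(label2id)
```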

In [6]:
# View three random samples and their labels
random_indices = random.sample(range(len(dataset['train'])), 3)
for idx in random_indices:
    print(f"Text: {dataset['train'][idx]['text']}")
    print(f"Label: {dataset['train'].features['label'].names[dataset['train'][idx]['label']]}")
    print("-" * 50)

Text: i do find new friends i m going to try extra hard to make them stay and if i decide that i don t want to feel hurt again and just ride out the last year of school on my own i m going to have to try extra hard not to care what people think of me being a loner
Label: sadness
--------------------------------------------------
Text: i asked them to join me in creating a world where all year old girls could grow up feeling hopeful and powerful
Label: joy
--------------------------------------------------
Text: i feel when you are a caring person you attract other caring people into your life
Label: love
--------------------------------------------------


- Three random samples from the dataset are displayed to get a feel for the text and corresponding emotions.

In [7]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load RoBERTa tokenizer and model
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, 
    num_labels=len(dataset["train"].features["label"].names)  # Number of emotion labels
)

# Freeze model parameters to prevent weight updates
for param in model.parameters():
    param.requires_grad = False

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


- The RoBERTa tokenizer and model are loaded.
- The model is initialized with six emotion classes corresponding to the dataset labels.
- All model parameters are frozen to prevent weight updates during evaluation.

In [8]:
# Tokenize the dataset using a lambda function
tokenized_dataset = dataset.map(
    lambda examples: tokenizer(
        examples["text"], 
        truncation=True, 
        padding=True, 
        max_length=512
    ), 
    batched=True
)


- The dataset is tokenized using a lambda function for simplicity.
- Text sequences are truncated to at most 512 tokens and padded in the same call. With `padding=True` inside a batched `map`, each map batch is padded to the length of its own longest sequence.

In [9]:
# Use predefined splits
train_dataset = tokenized_dataset["train"]
eval_dataset = tokenized_dataset["validation"]
test_dataset = tokenized_dataset["test"]

- The predefined splits of the tokenized dataset are assigned to separate variables:
  - `train_dataset` for training.
  - `eval_dataset` for validation.
  - `test_dataset` for testing.

In [10]:
from transformers import DataCollatorWithPadding

# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

- A data collator is initialized to dynamically pad sequences in each batch during training or evaluation, ensuring uniform input sizes.
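
Dynamic padding can be sketched in plain Python as follows. This is only an illustration of the idea, not the `DataCollatorWithPadding` implementation; the function name `pad_batch` and the token ids are made up, and the pad id of 1 is assumed for `roberta-base` (it can be checked with `tokenizer.pad_token_id`).

```python
PAD_ID = 1  # assumed pad token id for roberta-base

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks real tokens, 0 marks padding positions
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Two hypothetical tokenized examples of different lengths
batch = pad_batch([[0, 100, 2], [0, 100, 200, 300, 2]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Because the target length is chosen per batch, short batches are not padded up to the global maximum length.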

In [11]:
import evaluate

# Load the accuracy metric
accuracy = evaluate.load("accuracy")

# Define compute_metrics function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


- The `accuracy` metric is loaded using the 🤗 Evaluate library.
- A `compute_metrics` function is defined to calculate accuracy by comparing predictions with the reference labels.
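
The logic of `compute_metrics` can be checked by hand on a toy example. The sketch below re-implements the argmax and accuracy steps in plain Python; `accuracy_from_logits` and the toy values are hypothetical and stand in for the NumPy arrays that `Trainer` passes in.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per example and compare with the labels."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [
    [2.0, 0.1, 0.0, 0.0, 0.0, 0.0],  # predicts class 0
    [0.0, 3.0, 0.2, 0.0, 0.0, 0.0],  # predicts class 1
    [0.0, 0.0, 0.1, 1.5, 0.0, 0.0],  # predicts class 3
    [0.3, 0.0, 0.0, 0.0, 0.0, 0.9],  # predicts class 5
]
toy_labels = [0, 1, 2, 5]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)  # three of the four predictions match their labels
```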

In [12]:
from transformers import TrainingArguments, Trainer

# Evaluation configuration
training_args = TrainingArguments(
    output_dir="./results",          # Output directory (required by Trainer)
    per_device_eval_batch_size=16,   # Batch size for evaluation
    logging_dir="./logs",            # Directory for logs
    logging_steps=10,                # Logging interval; only relevant during training
    eval_strategy="no",              # No scheduled evaluation; evaluate() is called manually
    save_strategy="no",              # No checkpoints saved
    report_to="none"                 # Disable external experiment trackers
)

# Initialize Trainer for evaluation
trainer = Trainer(
    model=model,                           # The pre-trained RoBERTa model
    args=training_args,                    # Evaluation configuration
    eval_dataset=eval_dataset,             # Validation dataset for evaluation
    data_collator=data_collator,           # Data collator for batching
    compute_metrics=compute_metrics        # Metrics for evaluation
)

- The pre-trained RoBERTa model is evaluated on the validation dataset to establish a baseline.
- Training-specific configurations and datasets are omitted as no fine-tuning is performed.

In [13]:
# Evaluate the pre-trained model
results = trainer.evaluate()
print("Validation Dataset Results:", results)

# Evaluate on the test dataset
test_results = trainer.evaluate(test_dataset)
print("Test Dataset Results:", test_results)

Validation Dataset Results: {'eval_loss': 1.8424019813537598, 'eval_model_preparation_time': 0.0032, 'eval_accuracy': 0.0405, 'eval_runtime': 4.6993, 'eval_samples_per_second': 425.596, 'eval_steps_per_second': 26.6}
Test Dataset Results: {'eval_loss': 1.8429139852523804, 'eval_model_preparation_time': 0.0032, 'eval_accuracy': 0.033, 'eval_runtime': 4.1472, 'eval_samples_per_second': 482.249, 'eval_steps_per_second': 30.141}


### Baseline Evaluation of the Pre-trained Model

The pre-trained RoBERTa model was evaluated on both the validation and test datasets without any fine-tuning. Performance is poor out of the box: the encoder is pre-trained, but the classification head is newly initialized (see the warning printed when the model was loaded) and has never been trained on emotion labels:

- **Validation Dataset Results**: Accuracy = 4.05%, Loss = 1.8424
- **Test Dataset Results**: Accuracy = 3.3%, Loss = 1.8429

These results highlight the need for task-specific fine-tuning to adapt the model to the Emotion dataset.
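
For context, the reported losses can be compared with the cross-entropy of a classifier that assigns equal probability to all six classes, which is ln 6. The short calculation below is only a sanity check; the variable names are illustrative.

```python
import math

num_classes = 6

# Cross-entropy of a uniform prediction over six classes
uniform_loss = math.log(num_classes)

# Accuracy expected from guessing uniformly at random
chance_accuracy = 1 / num_classes

# Validation loss reported above for the untrained classification head
reported_val_loss = 1.8424
gap = reported_val_loss - uniform_loss

print(f"uniform loss: {uniform_loss:.4f}")
print(f"chance accuracy: {chance_accuracy:.4f}")
print(f"gap to reported validation loss: {gap:.4f}")
```

The reported losses (about 1.84) sit close to ln 6 (about 1.79), which is consistent with near-uniform outputs from an untrained head. The accuracy is below the 16.7% expected from random guessing; one plausible explanation is that the untrained head assigns almost every example to a single infrequent class.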


In [14]:
# Unfreeze model parameters to allow weight updates
for param in model.parameters():
    param.requires_grad = True

- Model parameters are unfrozen to allow weight updates during fine-tuning on the Emotion dataset.

In [15]:
# Fine-tuning configuration
fine_tuning_training_args = TrainingArguments(
    output_dir="./results_fine_tuning",   # Directory to save model checkpoints
    eval_strategy="epoch",                # Evaluate after each epoch
    save_strategy="epoch",                # Save model after each epoch
    learning_rate=2e-5,                   # Learning rate
    per_device_train_batch_size=16,       # Batch size for training
    per_device_eval_batch_size=16,        # Batch size for evaluation
    num_train_epochs=3,                   # Number of epochs
    weight_decay=0.01,                    # Weight decay to reduce overfitting
    logging_dir="./logs_fine_tuning",     # Directory for training logs
    logging_steps=10,                     # Log training metrics every 10 steps
    fp16=True,                            # Enable mixed precision
    load_best_model_at_end=True,          # Save and load the best model
    report_to="none"
)

# Initialize Trainer for fine-tuning
fine_tuning_trainer = Trainer(
    model=model,                           # The RoBERTa model
    args=fine_tuning_training_args,        # Fine-tuning configuration
    train_dataset=train_dataset,           # Training dataset
    eval_dataset=eval_dataset,             # Validation dataset for evaluation
    data_collator=data_collator,           # Data collator for batching
    compute_metrics=compute_metrics        # Metrics for evaluation
)

- Fine-tuning arguments are configured to adapt the pre-trained RoBERTa model to the Emotion dataset.
- The model will train for three epochs, with checkpoints and evaluations performed after each epoch.
- The `Trainer` is initialized with the fine-tuning configuration, training dataset, validation dataset, and metrics for evaluation.
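
The number of optimizer steps implied by this configuration can be worked out directly. The sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_examples = 16_000   # size of the training split
batch_size = 16           # per_device_train_batch_size
epochs = 3                # num_train_epochs

# Assumes one device and gradient_accumulation_steps=1
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions the result agrees with the `global_step=3000` reported in the training output.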

In [16]:
# Fine-tune the model
fine_tuning_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2938        | 0.196004        | 0.9245   |
| 2     | 0.1889        | 0.174853        | 0.935    |
| 3     | 0.1578        | 0.152957        | 0.9385   |


TrainOutput(global_step=3000, training_loss=0.26430604681372644, metrics={'train_runtime': 483.2819, 'train_samples_per_second': 99.321, 'train_steps_per_second': 6.208, 'total_flos': 2043903636503424.0, 'train_loss': 0.26430604681372644, 'epoch': 3.0})

In [17]:
# Evaluate on the validation dataset
fine_tuned_results = fine_tuning_trainer.evaluate()
print("Validation Dataset Results after Fine-Tuning:", fine_tuned_results)

# Evaluate on the test dataset
test_fine_tuned_results = fine_tuning_trainer.evaluate(test_dataset)
print("Test Dataset Results after Fine-Tuning:", test_fine_tuned_results)

Validation Dataset Results after Fine-Tuning: {'eval_loss': 0.15295740962028503, 'eval_accuracy': 0.9385, 'eval_runtime': 4.8062, 'eval_samples_per_second': 416.13, 'eval_steps_per_second': 26.008, 'epoch': 3.0}
Test Dataset Results after Fine-Tuning: {'eval_loss': 0.171091228723526, 'eval_accuracy': 0.923, 'eval_runtime': 4.4886, 'eval_samples_per_second': 445.569, 'eval_steps_per_second': 27.848, 'epoch': 3.0}


### Fine-Tuned Model Results

After fine-tuning the pre-trained RoBERTa model on the Emotion dataset for three epochs, the performance improved significantly:

- **Validation Dataset Results**: Accuracy = 93.85%, Loss = 0.1530
- **Test Dataset Results**: Accuracy = 92.3%, Loss = 0.1711

This demonstrates that task-specific fine-tuning enables the model to adapt effectively to emotion classification, achieving a substantial increase in accuracy compared to the baseline.

## Performing Parameter-Efficient Fine-Tuning

In this section, two PEFT models are created from the pre-trained RoBERTa model using the **QLoRA** and **Prefix Tuning** techniques. Each model is fine-tuned on the Emotion dataset, and the fine-tuned weights are saved for later evaluation and comparison.

In [18]:
from transformers import BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model

# Configure quantization using BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # Enable 4-bit quantization
    bnb_4bit_compute_dtype=torch.float32,   # Use float32 for stability
    llm_int8_skip_modules=["classifier"],   # Skip quantizing the classifier layers
)

# Load the model with quantization
qlora_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    num_labels=len(dataset["train"].features["label"].names),  # Number of emotion labels
    torch_dtype=torch.float32,                                 # Use float32 for stability
    low_cpu_mem_usage=True                                     # Optimize for low memory usage
)

# Freeze the base model's parameters
for param in qlora_model.parameters():
    param.requires_grad = False

# Configure QLoRA
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,   # Sequence Classification task
    r=32,                         # Low-rank dimension
    lora_alpha=32,                # Scaling factor
    target_modules=[              # Layers to apply LoRA
        "query",
        "key",
        "value"
    ],
    lora_dropout=0.1,             # Dropout for regularization
    bias="none"                   # No additional bias
)

# Convert the model to a PEFT model with QLoRA
peft_qlora_model = get_peft_model(qlora_model, lora_config)

# Print trainable parameters
peft_qlora_model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 2,364,678 || all params: 127,014,924 || trainable%: 1.8617


#### QLoRA Configuration and Trainable Parameters

- The RoBERTa model was configured with **QLoRA** for parameter-efficient fine-tuning on the Emotion dataset.
- **Key Configuration**:
  - **LoRA rank (`r`)**: 32
  - **Scaling factor (`lora_alpha`)**: 32
  - **Target modules**: `query`, `key`, `value`
  - **Dropout**: 0.1
  - **Task Type**: Sequence Classification
  - **Quantization**: 4-bit quantization of the base model weights to reduce memory usage.
  - **Precision (`torch_dtype`)**: Set to `float32` for stability in quantized computations.
- **Results**:
  - **Trainable Parameters**: 2,364,678
  - **Total Parameters**: 127,014,924
  - **Percentage of Trainable Parameters**: 1.8617%

This configuration shows how QLoRA concentrates updates on a small fraction of the model's parameters (about 1.86%): the LoRA matrices and the classification head are trained, while the quantized base weights stay frozen.
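
The reported trainable-parameter count can be reproduced with simple arithmetic. The sketch assumes the `roberta-base` dimensions (hidden size 768, 12 layers) and that, for the sequence classification task type, the classification head stays trainable alongside the LoRA matrices; the variable names are illustrative.

```python
hidden = 768        # roberta-base hidden size (assumed)
num_layers = 12     # roberta-base encoder layers (assumed)
r = 32              # LoRA rank from the configuration above
num_targets = 3     # query, key, value
num_labels = 6

# Each adapted projection adds A (r x hidden) and B (hidden x r)
lora_per_module = r * hidden + hidden * r
lora_total = lora_per_module * num_targets * num_layers

# Classification head: dense (768 -> 768) and out_proj (768 -> 6), with biases
classifier = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

trainable = lora_total + classifier
reported_total = 127_014_924   # "all params" from the output above
trainable_pct = 100 * trainable / reported_total

print(lora_total, classifier, trainable)
print(f"{trainable_pct:.4f}%")
```

The sum matches the 2,364,678 trainable parameters and the 1.8617% printed by `print_trainable_parameters()`.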

In [19]:
# Training configuration
qlora_training_args = TrainingArguments(
    output_dir="./qlora_results",   # Directory to save model checkpoints
    eval_strategy="epoch",          # Evaluate after each epoch
    save_strategy="epoch",          # Save model after each epoch
    learning_rate=2e-5,             # Learning rate for QLoRA
    per_device_train_batch_size=16, # Batch size for training
    per_device_eval_batch_size=16,  # Batch size for evaluation
    num_train_epochs=3,             # Number of epochs
    weight_decay=0.01,              # Weight decay for regularization
    logging_dir="./qlora_logs",     # Directory for training logs
    logging_steps=10,               # Log training metrics every 10 steps
    fp16=True,                      # Enable mixed precision
    load_best_model_at_end=True,     # Save and load the best model at the end
    report_to="none"
)

# Initialize Trainer
qlora_trainer = Trainer(
    model=peft_qlora_model,            # The QLoRA model
    args=qlora_training_args,          # Training arguments
    train_dataset=train_dataset,       # Training dataset
    eval_dataset=eval_dataset,         # Validation dataset
    data_collator=data_collator,       # Data collator for batching
    compute_metrics=compute_metrics,   # Metrics for evaluation
)

- Fine-tuning arguments are configured to adapt the pre-trained RoBERTa model with QLoRA to the Emotion dataset.

In [20]:
# Train the QLoRA model
qlora_trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.8684        | 0.737334        | 0.743    |
| 2     | 0.6103        | 0.584866        | 0.7885   |
| 3     | 0.6381        | 0.551868        | 0.8005   |


TrainOutput(global_step=3000, training_loss=0.8707235889434815, metrics={'train_runtime': 394.0054, 'train_samples_per_second': 121.826, 'train_steps_per_second': 7.614, 'total_flos': 2100332193543936.0, 'train_loss': 0.8707235889434815, 'epoch': 3.0})

In [21]:
# Save the final fine-tuned QLoRA model
peft_qlora_model.save_pretrained("./qlora_finetuned_model")

# Save the tokenizer
tokenizer.save_pretrained("./qlora_finetuned_model")

('./qlora_finetuned_model/tokenizer_config.json',
 './qlora_finetuned_model/special_tokens_map.json',
 './qlora_finetuned_model/vocab.json',
 './qlora_finetuned_model/merges.txt',
 './qlora_finetuned_model/added_tokens.json',
 './qlora_finetuned_model/tokenizer.json')

- The QLoRA model fine-tuned on the Emotion dataset is saved to the `./qlora_finetuned_model` directory.
- This allows for easy reloading of the model and tokenizer for future inference or evaluation tasks.

In [22]:
from peft import PrefixTuningConfig

# Configure Prefix Tuning
prefix_config = PrefixTuningConfig(
    task_type=TaskType.SEQ_CLS,  # Sequence Classification task
    num_virtual_tokens=20,       # Number of virtual tokens to prepend
    encoder_hidden_size=768,     # Hidden size of the prefix encoder; used only when prefix_projection=True
)

# Load the model
prefix_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(dataset["train"].features["label"].names),  # Number of emotion labels
)

# Convert the pre-trained RoBERTa model into a Prefix Tuning PEFT model
peft_prefix_model = get_peft_model(prefix_model, prefix_config)

# Print trainable parameters
peft_prefix_model.print_trainable_parameters()

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


trainable params: 963,846 || all params: 125,614,092 || trainable%: 0.7673


#### Prefix Tuning Configuration and Trainable Parameters

- The RoBERTa model was configured with **Prefix Tuning** for parameter-efficient fine-tuning on the Emotion dataset.
- **Key Configuration**:
  - **Number of Virtual Tokens**: 20
  - **Encoder Hidden Size**: 768
  - **Task Type**: Sequence Classification
- **Results**:
  - **Trainable Parameters**: 963,846
  - **Total Parameters**: 125,614,092
  - **Percentage of Trainable Parameters**: 0.7673%

This configuration demonstrates the efficiency of Prefix Tuning by introducing trainable prefix tokens that adapt the model's outputs to the specific task with minimal resource usage, while keeping the majority of the model's parameters frozen.
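
The trainable-parameter count for Prefix Tuning can be reproduced in the same way. The sketch assumes the `roberta-base` dimensions, no prefix projection (the configuration above does not enable it), and a trainable classification head; the variable names are illustrative.

```python
num_virtual_tokens = 20   # from the configuration above
num_layers = 12           # roberta-base encoder layers (assumed)
hidden = 768              # roberta-base hidden size (assumed)
num_labels = 6

# One key vector and one value vector per virtual token and per layer
prefix_params = num_virtual_tokens * num_layers * 2 * hidden

# Classification head: dense (768 -> 768) and out_proj (768 -> 6), with biases
classifier = (hidden * hidden + hidden) + (hidden * num_labels + num_labels)

trainable = prefix_params + classifier
reported_total = 125_614_092   # "all params" from the output above
trainable_pct = 100 * trainable / reported_total

print(prefix_params, classifier, trainable)
print(f"{trainable_pct:.4f}%")
```

The sum matches the 963,846 trainable parameters reported above, and most of them belong to the classification head rather than to the prefix itself.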


In [23]:
# Training configuration for Prefix Tuning
prefix_training_args = TrainingArguments(
    output_dir="./prefix_results",      # Directory to save model checkpoints
    eval_strategy="epoch",             # Evaluate after each epoch
    save_strategy="epoch",             # Save model after each epoch
    learning_rate=2e-5,                # Learning rate for Prefix Tuning
    per_device_train_batch_size=16,    # Batch size for training
    per_device_eval_batch_size=16,     # Batch size for evaluation
    num_train_epochs=3,                # Number of epochs
    weight_decay=0.01,                 # Weight decay for regularization
    logging_dir="./prefix_logs",       # Directory for training logs
    logging_steps=10,                  # Log training metrics every 10 steps
    fp16=True,                         # Enable mixed precision
    load_best_model_at_end=True,        # Save and load the best model at the end
    report_to="none"
)

# Initialize Trainer
prefix_trainer = Trainer(
    model=peft_prefix_model,           # The Prefix Tuning model
    args=prefix_training_args,         # Training arguments
    train_dataset=train_dataset,       # Training dataset
    eval_dataset=eval_dataset,         # Validation dataset
    data_collator=data_collator,       # Data collator for batching
    compute_metrics=compute_metrics    # Metrics for evaluation
)

- Fine-tuning arguments are configured to adapt the pre-trained RoBERTa model with Prefix Tuning to the Emotion dataset.

In [24]:
# Train the Prefix Tuning model
prefix_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.6189        | 1.564848        | 0.3855   |
| 2     | 1.5576        | 1.554388        | 0.4015   |
| 3     | 1.6021        | 1.551315        | 0.4115   |


TrainOutput(global_step=3000, training_loss=1.5710192108154297, metrics={'train_runtime': 307.8463, 'train_samples_per_second': 155.922, 'train_steps_per_second': 9.745, 'total_flos': 2058107099182848.0, 'train_loss': 1.5710192108154297, 'epoch': 3.0})

In [25]:
# Save the final fine-tuned Prefix Tuning model
peft_prefix_model.save_pretrained("./prefix_finetuned_model")

# Save the tokenizer
tokenizer.save_pretrained("./prefix_finetuned_model")

('./prefix_finetuned_model/tokenizer_config.json',
 './prefix_finetuned_model/special_tokens_map.json',
 './prefix_finetuned_model/vocab.json',
 './prefix_finetuned_model/merges.txt',
 './prefix_finetuned_model/added_tokens.json',
 './prefix_finetuned_model/tokenizer.json')

- The Prefix Tuning model fine-tuned on the Emotion dataset is saved to the `./prefix_finetuned_model` directory.
- This allows for easy reloading of the model and tokenizer for future inference or evaluation tasks.

## Performing Inference with PEFT Models

In this final step, the following models are evaluated on the test dataset:

1. **Out-of-the-Box RoBERTa Model** (baseline performance).  
2. **Fine-Tuned RoBERTa Model** (fully fine-tuned).  
3. **QLoRA Model** (parameter-efficient fine-tuning).  
4. **Prefix Tuning Model** (parameter-efficient fine-tuning).

The saved weights for the QLoRA and Prefix Tuning models are loaded, and their performance is compared to the pre-trained and fully fine-tuned models.

In [26]:
from peft import AutoPeftModelForSequenceClassification

# Generalized function to evaluate PEFT models
def evaluate_peft_model(model_path, test_dataset):
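    # Note: no quantization config is passed here, so the saved adapter
    # is applied to full-precision roberta-base weights.
    model = AutoPeftModelForSequenceClassification.from_pretrained(
        model_path,
        num_labels=len(test_dataset.features["label"].names)
    )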
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./peft_test_results",
            report_to="none",
        ),
        data_collator=data_collator,           
        compute_metrics=compute_metrics
    )
    results = trainer.evaluate(test_dataset)
    return results

# Evaluate PEFT models
qlora_results = evaluate_peft_model("./qlora_finetuned_model", test_dataset)  # QLoRA model
prefix_results = evaluate_peft_model("./prefix_finetuned_model", test_dataset)  # Prefix Tuning model

# Print results
print("Test Data Evaluation Results:")
print(f"QLoRA Model Accuracy: {qlora_results['eval_accuracy']:.4f}")
print(f"Prefix Tuning Model Accuracy: {prefix_results['eval_accuracy']:.4f}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Test Data Evaluation Results:
QLoRA Model Accuracy: 0.7245
Prefix Tuning Model Accuracy: 0.4140


### Final Performance Comparison of Models

Below is the final evaluation of the four models used for emotion classification on the test dataset. The table reports the test accuracy of each model.

| Model               | Training Method      | Accuracy  |
|---------------------|----------------------|-----------|
| **Out-of-the-Box**  | Pre-trained RoBERTa | 3.30%     |
| **Fine-Tuned**      | Fully Fine-Tuned    | 92.30%    |
| **QLoRA**           | QLoRA PEFT          | 72.45%    |
| **Prefix Tuning**   | Prefix Tuning PEFT  | 41.40%    |

#### Analysis
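- **Out-of-the-Box**: Performs poorly (3.30%) because the classification head is untrained.
- **Fine-Tuned**: Achieves the best accuracy (92.30%) but updates all model parameters and therefore requires the most resources.
- **QLoRA**: Reaches 72.45% while training about 1.86% of the parameters, roughly 20 points below full fine-tuning. Its test accuracy is also below its final validation accuracy (80.05%); one possible contributor is that the adapter was trained on a 4-bit base model but reloaded onto an unquantized one for this evaluation.
- **Prefix Tuning**: Trains the fewest parameters (about 0.77%) but reaches only 41.40% under these settings.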

#### Conclusion
In this experiment, QLoRA offered the better trade-off between accuracy and resource efficiency of the two PEFT methods, although neither matched full fine-tuning with the settings used here.
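
The trade-off can be made concrete by expressing each PEFT result relative to the fully fine-tuned model. The numbers below are copied from the results in this notebook; the dictionary layout and variable names are illustrative.

```python
full_ft_accuracy = 0.9230

# Test accuracy and share of trainable parameters reported above
peft_runs = {
    "QLoRA": {"accuracy": 0.7245, "trainable_pct": 1.8617},
    "Prefix Tuning": {"accuracy": 0.4140, "trainable_pct": 0.7673},
}

retained = {}
gap_points = {}
for name, run in peft_runs.items():
    # Fraction of the fully fine-tuned accuracy that the method reaches
    retained[name] = run["accuracy"] / full_ft_accuracy
    # Absolute gap in percentage points
    gap_points[name] = (full_ft_accuracy - run["accuracy"]) * 100
    print(f"{name}: {retained[name]:.1%} of full fine-tuning accuracy, "
          f"{gap_points[name]:.2f} points lower, "
          f"{run['trainable_pct']}% of parameters trained")
```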

### Future Plan

The current models were trained for only three epochs with basic hyperparameter settings, including the same learning rate (2e-5) for full fine-tuning and for both PEFT methods. Future experiments with tuned configurations (for example, a higher learning rate for the PEFT runs) and longer training could improve performance, especially for the QLoRA and Prefix Tuning models. This would allow a more thorough evaluation of their potential for resource-efficient fine-tuning.