Lightweight Fine-Tuning Project
TODO: In this cell, describe your choices for each of the following

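PEFT technique: LoRA (configured in the code below, but currently commented out, so the second training run updates all model weights)
Model: distilbert-base-uncased with a two-label sequence classification head
Evaluation approach: accuracy and evaluation loss on the test subset, computed with Trainer.evaluate()
Fine-tuning dataset: rotten_tomatoes, reduced to 150 shuffled examples per split (seed 42)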
Loading and Evaluating a Foundation Model
TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

Performing Parameter-Efficient Fine-Tuning
TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

Performing Inference with a PEFT Model
TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.


In [1]:
import torch
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments, Trainer, AutoModelForSequenceClassification
from datasets import load_dataset
import pandas as pd
import numpy as np
from peft import LoraConfig, LoraModel


splits = ["train", "test"]
dataSet = {split: data for split, data in zip(splits, load_dataset("rotten_tomatoes", split=splits))}

# Thin out the dataset for quicker execution
for split in splits:
    dataSet[split] = dataSet[split].shuffle(seed=42).select(range(150))
    
'''


def preprocessFunction(examples):
    outputs = tokenizer(examples["text"], padding="max_length", truncation=True, return_tensors="pt")
    return {
        "input_ids": outputs["input_ids"].squeeze(),
        "attention_mask": outputs["attention_mask"].squeeze(),
        "labels": torch.tensor(examples["label"])
    }    
    
tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = dataSet[split].map(preprocessFunction, batched=True)
    tokenized_ds[split] = tokenized_ds[split].rename_column("label", "original_label")
    tokenized_ds[split] = tokenized_ds[split].with_format("torch")
'''
# Pre-process the dataset
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocessFunction(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenizedDataSet = {split: dataSet[split].map(preprocessFunction, batched=True) for split in splits}

# Initialize the base model
baseModel = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)


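# Freeze the DistilBERT encoder so that only the classification head
# (pre_classifier and classifier) is trained in the first run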
for param in baseModel.base_model.parameters():
    param.requires_grad = False
    

def computeMetrics(evalPred):
    predictions, labels = evalPred
    preds = np.argmax(predictions, axis=1)
    return {"accuracy": (preds == labels).mean()}


# Train the classification head on top of the frozen encoder (the "base model" run)
trainingArgsBase = TrainingArguments(
    output_dir="./data/sentiment_analysis_base_model",
    learning_rate=3e-3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainerBase = Trainer(
    model=baseModel,
    args=trainingArgsBase,
    train_dataset=tokenizedDataSet["train"],
    eval_dataset=tokenizedDataSet["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=computeMetrics,
)

trainerBase.train()

# Evaluate the base model
baseModelEvaluation = trainerBase.evaluate()

"""
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "distilbert.transformer.layer.0.attention.q_lin",
        "distilbert.transformer.layer.0.attention.k_lin",
        "distilbert.transformer.layer.0.attention.v_lin",
        "distilbert.transformer.layer.1.attention.q_lin",
        "distilbert.transformer.layer.1.attention.k_lin",
        "distilbert.transformer.layer.1.attention.v_lin",
        "distilbert.transformer.layer.5.attention.q_lin",
        "distilbert.transformer.layer.5.attention.k_lin",
        "distilbert.transformer.layer.5.attention.v_lin",
    ],
    fan_in_fan_out=False  # DistilBERT attention projections are nn.Linear, which stores weights as (out, in)
)
"""
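# Unfreeze all parameters. With the LoRA code disabled, this second run is full
# fine-tuning that continues from the head-only model trained above.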
for param in baseModel.parameters():
    param.requires_grad = True
    
'''
# Wrap the model with LoRA
peft_model = LoraModel(
    baseModel,
    lora_config,
    adapter_name="lora"
)
'''
# Training the fine-tuned model
trainingArgsPeft = TrainingArguments(
    output_dir="./data/sentiment_analysis_peft_model",
    learning_rate=3e-5,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    num_train_epochs=5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainerPeft = Trainer(
    model=baseModel,
    args=trainingArgsPeft,
    train_dataset=tokenizedDataSet["train"],
    eval_dataset=tokenizedDataSet["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=computeMetrics,
)

trainerPeft.train()

# Evaluate the fine-tuned model
peftModelEvaluation = trainerPeft.evaluate()

# Compare the results 
print("Base Model Evaluation:")
print(baseModelEvaluation)

print("\nFine-Tuned Model Evaluation:")
print(peftModelEvaluation)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
                                               
{'eval_loss': 0.6833470463752747, 'eval_accuracy': 0.64, 'eval_runtime': 90.8698, 'eval_samples_per_second': 1.651, 'eval_steps_per_second': 0.209, 'epoch': 1.0}
{'eval_loss': 0.669525682926178, 'eval_accuracy': 0.6866666666666666, 'eval_runtime': 139.1511, 'eval_samples_per_second': 1.078, 'eval_steps_per_second': 0.137, 'epoch': 2.0}
{'eval_loss': 0.5093522071838379, 'eval_accuracy': 0.76, 'eval_runtime': 100.0462, 'eval_samples_per_second': 1.499, 'eval_steps_per_second': 0.19, 'epoch': 3.0}
{'train_runtime': 698.5293, 'train_samples_per_second': 0.644, 'train_steps_per_second': 0.082, 'train_loss': 0.5029377184416118, 'epoch': 3.0}

{'eval_loss': 0.6064311265945435, 'eval_accuracy': 0.74, 'eval_runtime': 105.249, 'eval_samples_per_second': 1.425, 'eval_steps_per_second': 0.143, 'epoch': 1.0}
{'eval_loss': 0.6371561884880066, 'eval_accuracy': 0.82, 'eval_runtime': 103.9243, 'eval_samples_per_second': 1.443, 'eval_steps_per_second': 0.144, 'epoch': 2.0}
{'eval_loss': 1.0088210105895996, 'eval_accuracy': 0.72, 'eval_runtime': 101.3506, 'eval_samples_per_second': 1.48, 'eval_steps_per_second': 0.148, 'epoch': 3.0}
{'eval_loss': 0.9266982078552246, 'eval_accuracy': 0.8, 'eval_runtime': 103.0859, 'eval_samples_per_second': 1.455, 'eval_steps_per_second': 0.146, 'epoch': 4.0}
{'eval_loss': 0.9356085062026978, 'eval_accuracy': 0.7933333333333333, 'eval_runtime': 104.5756, 'eval_samples_per_second': 1.434, 'eval_steps_per_second': 0.143, 'epoch': 5.0}
{'train_runtime': 2934.4925, 'train_samples_per_second': 0.256, 'train_steps_per_second': 0.026, 'train_loss': 0.09685407638549805, 'epoch': 5.0}

Base Model Evaluation:
{'eval_loss': 0.5093522071838379, 'eval_accuracy': 0.76, 'eval_runtime': 99.2006, 'eval_samples_per_second': 1.512, 'eval_steps_per_second': 0.192, 'epoch': 3.0}

Fine-Tuned Model Evaluation:
{'eval_loss': 0.6064311265945435, 'eval_accuracy': 0.74, 'eval_runtime': 98.7266, 'eval_samples_per_second': 1.519, 'eval_steps_per_second': 0.152, 'epoch': 5.0}





Base Model Evaluation:

Evaluation Loss: 0.5094
Accuracy: 0.76
Runtime: 99.2006 seconds
Samples Per Second: 1.512
Steps Per Second: 0.192
Epochs: 3.0

Fine-Tuned Model Evaluation:

Evaluation Loss: 0.6064
Accuracy: 0.74
Runtime: 98.7266 seconds
Samples Per Second: 1.519
Steps Per Second: 0.152
Epochs: 5.0


Analysis:

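Evaluation Loss: The Fine-Tuned Model (0.6064) has a higher evaluation loss than the Base Model (0.5094). Lower loss indicates a better fit to the test data, so the Base Model outperforms the Fine-Tuned Model on this metric. The 0.6064 figure is the epoch-1 loss of the fine-tuning run: load_best_model_at_end=True restores the checkpoint with the lowest evaluation loss, and the loss rose in later epochs (reaching 1.0088 at epoch 3).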
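
The effect of checkpoint selection on the reported figures can be replayed from the per-epoch metrics in the training log. The sketch below is illustrative only: the history list is copied by hand from the log of the second run (rounded to four decimals), and best_epoch is a hypothetical helper, not part of the Trainer API. It assumes the Trainer default of selecting by lowest evaluation loss when load_best_model_at_end=True and no metric_for_best_model is given.

```python
# Per-epoch evaluation metrics of the second (fine-tuning) run, copied from the log.
history = [
    {"epoch": 1, "eval_loss": 0.6064, "eval_accuracy": 0.74},
    {"epoch": 2, "eval_loss": 0.6372, "eval_accuracy": 0.82},
    {"epoch": 3, "eval_loss": 1.0088, "eval_accuracy": 0.72},
    {"epoch": 4, "eval_loss": 0.9267, "eval_accuracy": 0.80},
    {"epoch": 5, "eval_loss": 0.9356, "eval_accuracy": 0.7933},
]


def best_epoch(history, metric, greater_is_better):
    """Return the epoch whose metric value is best (hypothetical helper)."""
    pick = max if greater_is_better else min
    return pick(history, key=lambda row: row[metric])["epoch"]


by_loss = best_epoch(history, "eval_loss", greater_is_better=False)
by_accuracy = best_epoch(history, "eval_accuracy", greater_is_better=True)

print(f"lowest eval_loss: epoch {by_loss}")
print(f"highest eval_accuracy: epoch {by_accuracy}")
```

Selecting by loss picks epoch 1, while selecting by accuracy would pick epoch 2. Setting metric_for_best_model="accuracy" in TrainingArguments is one way to make the Trainer use accuracy for this choice.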

Accuracy: The Base Model reached an accuracy of 0.76, compared with 0.74 for the Fine-Tuned Model. On the 150-example test subset this corresponds to 114 versus 111 correct predictions, a difference of three examples.
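
To put the two-point accuracy gap in context, the sketch below converts each accuracy into a count of correct predictions and estimates a binomial standard error. This is a rough check under an assumption the notebook does not state: that test examples are independent, so the usual binomial formula applies. The function names are illustrative.

```python
import math

N_TEST = 150  # size of the test subset used in the notebook


def correct_count(accuracy, n):
    """Number of correct predictions implied by an accuracy on n examples."""
    return round(accuracy * n)


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy estimate (assumes independent examples)."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


base_acc, tuned_acc = 0.76, 0.74
print(correct_count(base_acc, N_TEST), correct_count(tuned_acc, N_TEST))
print(round(standard_error(base_acc, N_TEST), 3))
```

The standard error of roughly 0.035 is larger than the 0.02 gap between the two models, so under these assumptions the accuracy difference is within sampling noise for a test set of this size.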

Processing speed and runtime:
    The evaluation runtimes of the Base Model and the Fine-Tuned Model are comparable, at 99.2006 and 98.7266 seconds, respectively.

    The Fine-Tuned Model processed 1.519 samples per second, marginally more than the Base Model (1.512). It completed fewer steps per second (0.152 versus 0.192), but this reflects the evaluation batch size rather than model speed: the first run evaluated with a batch size of 8 (19 steps for 150 examples) and the second with a batch size of 10 (15 steps).
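
The throughput figures follow directly from the number of test examples, the evaluation batch size, and the runtime. The sketch below recomputes them from the values in the notebook (150 examples, per_device_eval_batch_size of 8 and 10, and the two reported runtimes); eval_stats is a hypothetical helper written for this illustration and assumes a single device.

```python
import math


def eval_stats(num_examples, batch_size, runtime_seconds):
    """Recompute step count and throughput for one evaluation pass."""
    steps = math.ceil(num_examples / batch_size)  # the last batch may be partial
    return {
        "steps": steps,
        "samples_per_second": round(num_examples / runtime_seconds, 3),
        "steps_per_second": round(steps / runtime_seconds, 3),
    }


base = eval_stats(150, 8, 99.2006)
tuned = eval_stats(150, 10, 98.7266)
print(base)
print(tuned)
```

Both runs process about 1.5 samples per second. The steps-per-second values differ only because a step covers 8 examples in one run and 10 in the other.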

    

Conclusion: 

The Base Model fared better than the Fine-Tuned Model in both evaluation loss and accuracy, in contrast to the first example, where fine-tuning produced notable gains. Two details of the setup qualify this comparison. First, the Base Model is not an untrained baseline: its classification head was trained for three epochs on a frozen encoder, and the second run continued from that model with all parameters unfrozen, because the LoRA code is commented out. Second, the reported Fine-Tuned figures come from the epoch-1 checkpoint, which load_best_model_at_end=True selected for its lowest evaluation loss; accuracy on the test subset peaked at 0.82 in epoch 2. Over the five epochs the evaluation loss rose from 0.6064 to 0.9356 while the average training loss was only 0.0969, which is consistent with overfitting on 150 training examples. Further work should therefore enable the LoRA adapter and revisit the fine-tuning hyperparameters and the checkpoint-selection metric.
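
As a rough illustration of what enabling the commented-out LoRA configuration would change, the sketch below counts the parameters involved. It uses the values from that configuration (r=8, lora_alpha=32, nine targeted q_lin/k_lin/v_lin modules) and assumes the DistilBERT hidden size of 768, so each targeted projection is a 768 x 768 weight matrix. It is plain arithmetic rather than a call into the peft library, and lora_params is a hypothetical helper.

```python
HIDDEN = 768          # assumed DistilBERT hidden size (q_lin, k_lin, v_lin are 768 x 768)
RANK = 8              # r in the commented-out LoraConfig
ALPHA = 32            # lora_alpha in the commented-out LoraConfig
TARGETED_MODULES = 9  # layers 0, 1 and 5, each with q_lin, k_lin and v_lin


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in) and B (out x rank); the original weight stays frozen."""
    return rank * (in_features + out_features)


frozen = TARGETED_MODULES * HIDDEN * HIDDEN
trainable = TARGETED_MODULES * lora_params(HIDDEN, HIDDEN, RANK)

print(f"frozen weights in targeted modules: {frozen:,}")
print(f"trainable LoRA parameters:          {trainable:,}")
print(f"ratio: {trainable / frozen:.2%}")
print(f"update scaling (alpha / r): {ALPHA / RANK}")
```

Under these assumptions the adapter trains about 2% as many parameters as the matrices it modifies, which is far fewer than the full set of weights updated in the second run. The classification head would still need to be trainable alongside the adapter.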

