# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA (Low-Rank Adaptation), which keeps the pre-trained weights frozen and trains only small low-rank matrices added to selected layers, so far fewer parameters are updated.
* Model: DistilBERT (`distilbert-base-uncased`), a smaller and faster distilled version of BERT.
* Evaluation approach: Accuracy on a held-out test split, measured before and after fine-tuning.
* Fine-tuning dataset: `carblacac/twitter-sentiment-analysis`, a set of tweets labeled as negative (0) or positive (1).
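To make the LoRA idea above concrete, here is a minimal NumPy sketch of a single adapted linear layer. It is an illustration under stated assumptions, not the PEFT library's implementation: the layer sizes, the initialization scale, and the helper name `lora_forward` are hypothetical, and dropout is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed): a 768x768 projection adapted with rank 8
d_out, d_in, r, alpha = 768, 768, 8, 16

W = rng.normal(size=(d_out, d_in))      # frozen pre-trained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor, small random init
B = np.zeros((d_out, r))                # trainable low-rank factor, zero init

def lora_forward(x):
    """Frozen projection plus the scaled low-rank update, for x of shape (n, d_in)."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
y = lora_forward(x)

full_params = W.size            # parameters a full fine-tune would update
lora_params = A.size + B.size   # parameters LoRA trains for this layer
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because `B` starts at zero, the adapted layer initially produces the same output as the frozen layer; training then only moves `A` and `B`.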

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
"""
Steps:
1. Load a pre-trained model and evaluate its performance.
2. Perform parameter-efficient fine-tuning using the pre-trained model.
3. Perform inference using the fine-tuned model and compare its performance to the original model.
"""

# Load all necessary libraries
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, DataCollatorWithPadding
import numpy as np
from peft import LoraConfig, get_peft_model, TaskType, AutoPeftModelForSequenceClassification

In [2]:
# Load the selected dataset and split into train and test sets
dataset = load_dataset("carblacac/twitter-sentiment-analysis", split="train").train_test_split(test_size=0.2, shuffle=True, seed=23)

# Add dictionaries to convert between labels and ids
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# Print a few samples from the training split
for tweet in dataset["train"].select(range(3)):
    msg = tweet["text"]
    label_id = tweet["feeling"]
    print(f"feeling={id2label[label_id]}, msg={msg}")

# Build the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenize the text; the label column is carried through unchanged by map()
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Rename the 'feeling' column to 'labels' as expected by the model
tokenized_dataset = tokenized_dataset.rename_column("feeling", "labels")

# Set the format to PyTorch tensors
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Load pretrained model
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label=id2label,
    label2id=label2id,
)

# Define the evaluation metric (accuracy)
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Define learning rate and batch size
lr = 2e-5
bsize = 32

# Build a Trainer for the pre-trained model; it is only used for evaluation in the next cell
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/tweet_sentiment",
        learning_rate=lr,
        per_device_train_batch_size=bsize,
        per_device_eval_batch_size=bsize,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=1,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)



feeling=Negative, msg=Best movie ever! Oh and my brother thinks its the end of the world and wants to sacrafice me
feeling=Positive, msg=Watch this vid, very funny  -- Scary Gay Marriage Ad http://bit.ly/4HNJN
feeling=Positive, msg=http://twitpic.com/5cubr - I compared the meerkats yesterday  I liked Loki best...



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [3]:
# Evaluate the pretrained model
pretrained_eval_results = trainer.evaluate()
print("Pre-trained model evaluation results:", pretrained_eval_results)

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Pre-trained model evaluation results: {'eval_loss': 0.6932274699211121, 'eval_accuracy': 0.5006250520876739, 'eval_runtime': 410.9721, 'eval_samples_per_second': 58.393, 'eval_steps_per_second': 1.825}
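These baseline numbers are what one would expect from a classification head that has not been trained yet, as the initialization warning above indicates. A quick check, assuming the model's predictions are close to uniform over the two classes: the cross-entropy of a uniform two-class prediction is ln 2, and chance accuracy on a balanced binary task is about 0.5.

```python
import math

num_classes = 2
uniform_prob = 1.0 / num_classes

# Cross-entropy when every example gets probability 1/2 for the true class
chance_loss = -math.log(uniform_prob)

# Values copied from the evaluation output above
reported_loss = 0.6932274699211121
reported_accuracy = 0.5006250520876739

gap = abs(reported_loss - chance_loss)
print(f"chance loss: {chance_loss:.6f}, reported loss: {reported_loss:.6f}, gap: {gap:.6f}")
```

The reported loss sits within a small margin of ln 2, so the pre-trained model is effectively guessing before fine-tuning.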


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [4]:
# Configure PEFT (Parameter-Efficient Fine-Tuning) using LoRA (Low-Rank Adaptation)
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_lin", "v_lin"]
    )

# Create a PEFT model using the LoRA configuration
lora_model = get_peft_model(model, peft_config)

In [5]:
# Inspect the number of trainable parameters
lora_model.print_trainable_parameters()

trainable params: 1,331,716 || all params: 67,694,596 || trainable%: 1.967241225577297
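As a rough cross-check of the printed figure, the LoRA matrices alone can be counted by hand. The sketch below assumes the standard `distilbert-base-uncased` shape (6 transformer layers, hidden size 768) and the configuration above (`r=8`, targets `q_lin` and `v_lin`); the variable names are illustrative.

```python
hidden_size = 768        # assumed DistilBERT hidden size
num_layers = 6           # assumed number of transformer layers
rank = 8                 # r from the LoraConfig above
targets_per_layer = 2    # q_lin and v_lin

# Each adapted 768x768 projection gets A (rank x in) and B (out x rank)
params_per_module = rank * (hidden_size + hidden_size)
lora_params = num_layers * targets_per_layer * params_per_module

# Values copied from the print_trainable_parameters() output above
reported_trainable = 1_331_716
reported_total = 67_694_596
trainable_pct = 100 * reported_trainable / reported_total

print(f"LoRA matrices: {lora_params} parameters")
print(f"Reported trainable share: {trainable_pct:.3f}%")
```

Under these assumptions the LoRA matrices account for only part of the reported trainable count; the remainder presumably comes from the classification head, which stays trainable for sequence classification tasks.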


In [6]:
# Define learning rate and batch size
lr = 2e-5
bsize = 32

# Create the LoRA trainer
lora_trainer = Trainer(
    model=lora_model,
    args=TrainingArguments(
        output_dir="./data/tweet_sentiment_lora",
        learning_rate=lr,
        per_device_train_batch_size=bsize,
        per_device_eval_batch_size=bsize,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        num_train_epochs=1,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

lora_trainer.train()


| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.4612        | 0.447046        | 0.791191 |


Checkpoint destination directory ./data/tweet_sentiment_lora/checkpoint-3000 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=3000, training_loss=0.4876828409830729, metrics={'train_runtime': 3976.3008, 'train_samples_per_second': 24.141, 'train_steps_per_second': 0.754, 'total_flos': 1.293363566333952e+16, 'train_loss': 0.4876828409830729, 'epoch': 1.0})
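The step count and throughput in this output can be reproduced from the dataset size and batch size. This is a small sanity check that assumes a single device and no gradient accumulation; the 95,990 training examples come from the tokenization progress output earlier in the notebook.

```python
import math

train_examples = 95_990   # size of the training split after the 80/20 split
batch_size = 32           # per_device_train_batch_size
train_runtime = 3976.3008 # seconds, from TrainOutput

# One optimizer step per batch; the last batch is partial
steps_per_epoch = math.ceil(train_examples / batch_size)

samples_per_second = train_examples / train_runtime
steps_per_second = steps_per_epoch / train_runtime

print(f"steps per epoch: {steps_per_epoch}")
print(f"samples/s: {samples_per_second:.3f}, steps/s: {steps_per_second:.3f}")
```

The computed values line up with `global_step=3000`, `train_samples_per_second`, and `train_steps_per_second` reported above.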

In [7]:
# Save the trained adapter weights and the tokenizer
lora_model.save_pretrained('distilbert-lora')
tokenizer.save_pretrained('distilbert-lora')

('distilbert-lora/tokenizer_config.json',
 'distilbert-lora/special_tokens_map.json',
 'distilbert-lora/vocab.txt',
 'distilbert-lora/added_tokens.json',
 'distilbert-lora/tokenizer.json')

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [8]:
# Load the saved LoRA model (adapter weights applied on top of the base model)
lora_model = AutoPeftModelForSequenceClassification.from_pretrained('distilbert-lora')

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
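# Evaluate the LoRA model. Note: lora_trainer still wraps the in-memory model trained above,
# not the copy reloaded from 'distilbert-lora' in the previous cell.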
lora_eval_results = lora_trainer.evaluate()

print("Results:", lora_eval_results)

Results: {'eval_loss': 0.4470459520816803, 'eval_accuracy': 0.7911909325777148, 'eval_runtime': 432.9942, 'eval_samples_per_second': 55.423, 'eval_steps_per_second': 1.732, 'epoch': 1.0}


In [10]:
# Compare model results
pretrained_model_results = pretrained_eval_results['eval_accuracy']
peft_model_results = lora_eval_results['eval_accuracy']

print(f"Pretrained Model Accuracy: {pretrained_model_results}\n"
      f"PEFT Model Accuracy: {peft_model_results}\n"
      f"Value difference: {peft_model_results - pretrained_model_results}")

Pretrained Model Accuracy: 0.5006250520876739
PEFT Model Accuracy: 0.7911909325777148
Value difference: 0.29056588049004084
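The raw difference can also be read as a gain in percentage points and as a relative reduction in error rate. The sketch below only re-expresses the two accuracies printed above; the variable names are illustrative.

```python
# Accuracies copied from the comparison output above
pretrained_accuracy = 0.5006250520876739
peft_accuracy = 0.7911909325777148

# Absolute gain in percentage points
gain_points = 100 * (peft_accuracy - pretrained_accuracy)

# Share of the baseline's errors that the fine-tuned model no longer makes
pretrained_error = 1 - pretrained_accuracy
peft_error = 1 - peft_accuracy
relative_error_reduction = (pretrained_error - peft_error) / pretrained_error

print(f"Gain: {gain_points:.2f} percentage points")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```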
