# Lightweight Fine-Tuning Project

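This project fine-tunes a small pre-trained model to classify SMS messages as spam or not spam.

* PEFT technique: LoRA, applied to the attention projection layers (`q_lin`, `k_lin`, `v_lin`, `out_lin`)
* Model: `distilbert-base-uncased` with a two-label sequence classification head
* Evaluation approach: accuracy on a held-out test split, computed with `Trainer.evaluate()` before and after fine-tuning
* Fine-tuning dataset: `sms_spam` from Hugging Face, split 80/20 into train and test, then subsampled to one fifth of each split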

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [61]:
# Load the sms_spam dataset
# See: https://huggingface.co/datasets/sms_spam

from datasets import load_dataset

# The sms_spam dataset only has a train split, so we use the train_test_split method to split it into train and test
dataset = load_dataset("sms_spam", split="train").train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

splits = ["train", "test"]


# View the dataset characteristics

In [62]:
for split in splits:
    # Keep one fifth of each split so that retraining stays quick
    dataset[split] = dataset[split].shuffle(seed=42).select(range(len(dataset[split]) // 5))

In [63]:
print(f"The train dataset is of type: {type(dataset['train'])}")
print(f"The length of the train dataset is: {len(dataset['train'])}")
print(f"The length of the test dataset is: {len(dataset['test'])}")
print(f"Here is one sample from the train dataset: {dataset['train'][0]=}")

The train dataset is of type: <class 'datasets.arrow_dataset.Dataset'>
The length of the train dataset is: 891
The length of the test dataset is: 223
Here is one sample from the train dataset: dataset['train'][0]={'sms': "How would my ip address test that considering my computer isn't a minecraft server\n", 'label': 0}
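
Before evaluating a model, it can help to know how the labels are distributed, since accuracy alone can look better than it is when one class dominates. The sketch below is an illustration only: `majority_baseline` is a hypothetical helper and the `labels` list is made-up data, not the actual distribution of this subset. On the real data one could pass `dataset["train"]["label"]` instead.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Made-up labels for illustration: 0 = not spam, 1 = spam
labels = [0, 0, 0, 1, 0, 0, 1, 0]
label, accuracy = majority_baseline(labels)
print(f"Always predicting {label} gives accuracy {accuracy:.2f}")
```

A model whose accuracy is close to this baseline has learned little beyond the class prior.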


In [64]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Using map() inspired by the example provided inside the Udacity course
tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(
        lambda x: tokenizer(x["sms"], truncation=True), batched=True
    )

# Inspect the available columns in the dataset
print(f"The tokenized train dataset is of type: {type(tokenized_dataset['train'])}")
print(f"The length of the tokenized train dataset is: {len(tokenized_dataset['train'])}")
print(f"Here is one sample from the train dataset: {tokenized_dataset['train'][0]=}")



The tokenized train dataset is of type: <class 'datasets.arrow_dataset.Dataset'>
The length of the tokenized train dataset is: 891
Here is one sample from the train dataset: tokenized_dataset['train'][0]={'sms': "How would my ip address test that considering my computer isn't a minecraft server\n", 'label': 0, 'input_ids': [101, 2129, 2052, 2026, 12997, 4769, 3231, 2008, 6195, 2026, 3274, 3475, 1005, 1056, 1037, 3067, 10419, 8241, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [65]:
from transformers import AutoModelForSequenceClassification

def load_not_retrained_model():
    return AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=2,
        id2label={0: "not spam", 1: "spam"},
        label2id={"not spam": 0, "spam": 1},
    )

model = load_not_retrained_model()

# Freeze all parameters, since the base model itself will not be retrained
for param in model.parameters():
    param.requires_grad = False

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [66]:
# Inspired by the example provided inside the Udacity course
import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
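
Accuracy treats both classes alike, which may hide weak spam detection if most messages are not spam. As a sketch, the variant below adds precision, recall, and F1 for the spam class (label 1) using only NumPy. `compute_metrics_with_f1` is a hypothetical name and is not used elsewhere in this notebook; it accepts the same `(logits, labels)` pair that `compute_metrics` receives.

```python
import numpy as np

def compute_metrics_with_f1(eval_pred, positive_label=1):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    labels = np.asarray(labels)

    # Confusion counts for the positive (spam) class
    tp = int(np.sum((predictions == positive_label) & (labels == positive_label)))
    fp = int(np.sum((predictions == positive_label) & (labels != positive_label)))
    fn = int(np.sum((predictions != positive_label) & (labels == positive_label)))

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": float((predictions == labels).mean()),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up logits for four messages; columns are [not spam, spam]
logits = np.array([[2.0, 0.1], [0.2, 1.5], [1.0, 0.3], [0.1, 0.9]])
labels = np.array([0, 1, 1, 1])
print(compute_metrics_with_f1((logits, labels)))
```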


In [67]:
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
# We really just need the trainer for easier evaluation :)
# Packed into a function to allow reloading if the notebook is restarted
def create_trainer(model, tokenizer, tokenized_dataset):
    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./data/sms_spam_classifier",
            learning_rate=2e-3,
            # Reduce the batch size if you don't have enough memory
            per_device_train_batch_size=4,
            per_device_eval_batch_size=4,
            num_train_epochs=1,
            weight_decay=0.01,
            evaluation_strategy="epoch",
            save_strategy="epoch",
            load_best_model_at_end=True,
        ),
        train_dataset=tokenized_dataset["train"],
        eval_dataset=tokenized_dataset["test"],
        tokenizer=tokenizer,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics,
    )
    return trainer
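
The tokenizer above was called with `truncation=True` but without padding, so the sequences in `tokenized_dataset` have different lengths; `DataCollatorWithPadding` pads each batch to its longest sequence at collation time. The pure-Python sketch below imitates that idea for illustration only. `pad_batch` is a hypothetical helper, not the library's implementation, and the default pad id of 0 is an assumption matching the `[PAD]` token of `distilbert-base-uncased`.

```python
def pad_batch(features, pad_token_id=0):
    """Pad input_ids and attention_mask to the longest sequence in the batch."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padded positions get mask 0 so the model ignores them
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
    return batch

features = [
    {"input_ids": [101, 2129, 102], "attention_mask": [1, 1, 1]},
    {"input_ids": [101, 2129, 2052, 2026, 102], "attention_mask": [1, 1, 1, 1, 1]},
]
batch = pad_batch(features)
print(batch["input_ids"][0])       # [101, 2129, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 0, 0]
```

Padding per batch rather than to a global maximum avoids wasted computation on short messages.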

In [68]:
# Evaluate the frozen base model to get a baseline before fine-tuning

trainer = create_trainer(model, tokenizer, tokenized_dataset)
trainer.evaluate()


You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 0.6778894066810608,
 'eval_accuracy': 0.7623318385650224,
 'eval_runtime': 6.0641,
 'eval_samples_per_second': 36.774,
 'eval_steps_per_second': 9.235}

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [69]:
from peft import LoraConfig, get_peft_model, TaskType, AutoPeftModelForSequenceClassification

config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    # Attach LoRA adapters only to the attention projections; this keeps retraining
    # quick and still gives good results, as the evaluation after training shows
    target_modules=["q_lin", "k_lin", "v_lin", "out_lin"],
    inference_mode=False
)
lora_model = get_peft_model(model, config)
lora_model.print_trainable_parameters()


trainable params: 1,479,172 || all params: 67,842,052 || trainable%: 2.180317305260755
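
The reported count can be sanity-checked by hand. Each LoRA-adapted linear layer with `d_in` inputs and `d_out` outputs adds two matrices with `r * (d_in + d_out)` parameters in total. The sketch below assumes the standard DistilBERT-base shape (6 transformer layers, hidden size 768), which this notebook does not state. The last step is an inference from the numbers rather than something the output confirms: the reported total is reproduced exactly if the classification head (`pre_classifier` and `classifier`) is counted twice, which would be consistent with PEFT keeping a trainable copy of the head next to the original.

```python
hidden = 768     # assumed DistilBERT-base hidden size
n_layers = 6     # assumed number of transformer layers
r = 8            # LoRA rank from the config above
targets = 4      # q_lin, k_lin, v_lin, out_lin

# Each adapted layer gets A (r x hidden) and B (hidden x r)
lora_params = n_layers * targets * r * (hidden + hidden)

# Classification head: pre_classifier (768 -> 768) and classifier (768 -> 2), with biases
head_params = (hidden * hidden + hidden) + (hidden * 2 + 2)

print(lora_params)                    # 294912
print(head_params)                    # 592130
print(lora_params + 2 * head_params)  # 1479172, the reported trainable params
```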


In [70]:
trainer_lora = create_trainer(lora_model, tokenizer, tokenized_dataset)
print("First, evaluate the LoRA model before retraining")
trainer_lora.evaluate()

{'eval_loss': 0.6778894066810608,
 'eval_accuracy': 0.7623318385650224,
 'eval_runtime': 5.8966,
 'eval_samples_per_second': 37.818,
 'eval_steps_per_second': 9.497}

In [71]:
print("Now we retrain the selected layers of the LoRA model")
trainer_lora.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.207639,0.950673


Checkpoint destination directory ./data/sms_spam_classifier/checkpoint-223 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=223, training_loss=0.2891566528867713, metrics={'train_runtime': 110.3047, 'train_samples_per_second': 8.078, 'train_steps_per_second': 2.022, 'total_flos': 10373903680320.0, 'train_loss': 0.2891566528867713, 'epoch': 1.0})
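
The `global_step` and throughput figures follow from the dataset size and the batch size. As a quick check, assuming a single device and no gradient accumulation (the training arguments above do not set either):

```python
import math

train_examples = 891      # length of the train subset printed earlier
batch_size = 4            # per_device_train_batch_size
epochs = 1                # num_train_epochs
train_runtime = 110.3047  # seconds, from the TrainOutput above

# The last batch is partial, so round up
steps = math.ceil(train_examples / batch_size) * epochs
samples_per_second = train_examples * epochs / train_runtime

print(steps)                         # 223
print(round(samples_per_second, 3))  # 8.078
```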

In [72]:
print("Now we re-evaluate the LoRA model after retraining")
trainer_lora.evaluate()

{'eval_loss': 0.20763860642910004,
 'eval_accuracy': 0.9506726457399103,
 'eval_runtime': 5.8135,
 'eval_samples_per_second': 38.359,
 'eval_steps_per_second': 9.633,
 'epoch': 1.0}
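
Because the test subset has only 223 messages, these accuracies translate into small absolute counts, which is worth keeping in mind when interpreting the improvement. A short calculation from the values reported above:

```python
test_size = 223
acc_before = 0.7623318385650224  # eval_accuracy before training
acc_after = 0.9506726457399103   # eval_accuracy after one epoch of LoRA training

correct_before = round(acc_before * test_size)
correct_after = round(acc_after * test_size)

print(correct_before, test_size - correct_before)  # 170 correct, 53 wrong
print(correct_after, test_size - correct_after)    # 212 correct, 11 wrong
```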

In [74]:
sms_spam_classifier_lora_refined_save_location = "sms_spam_classifier_lora_refined"
lora_model.save_pretrained(sms_spam_classifier_lora_refined_save_location)
print(f"saved the trained Lora model in {sms_spam_classifier_lora_refined_save_location}")

saved the trained Lora model in sms_spam_classifier_lora_refined


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [77]:
test_sample = dataset["test"].shuffle().select(range(10))
print(f"A test sample of size {len(test_sample)} was chosen")

A test sample of size 10 was chosen


In [78]:
from peft import AutoPeftModelForSequenceClassification
lora_model=AutoPeftModelForSequenceClassification.from_pretrained(sms_spam_classifier_lora_refined_save_location)

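# get_peft_model() injects the LoRA layers into the base model in place, so a
# `model` left over from the training cells is no longer an untrained baseline.
# Reload a fresh copy for the "before fine-tuning" comparison.
model = load_not_retrained_model()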


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [79]:
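import pandas as pd
import torch

inputs = tokenizer(test_sample["sms"], padding=True, truncation=True, return_tensors="pt")
# Inference only: skip gradient tracking and convert predictions to plain Python lists
with torch.no_grad():
    results_lora = lora_model(**inputs).logits.argmax(dim=1).tolist()
    results_not_retrained = model(**inputs).logits.argmax(dim=1).tolist()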

df = pd.DataFrame({
    "sms": [item["sms"] for item in test_sample],
    "predictions_lora": results_lora,
    "predictions_not_retrained": results_not_retrained,
    "label": test_sample["label"]
})

print(df)

                                                                                                                                sms  \
0  Love isn't a decision, it's a feeling. If we could decide who to love, then, life would be much simpler, but then less magical\n   
1                                                                 I'm working technical support :)voice process.networking field.\n   
2                                                                                Whats the staff name who is taking class for us?\n   
3                                                                                          Awesome, text me when you're restocked\n   
4                                                                                            Me i'm not workin. Once i get job...\n   
5                                                         Lol yes. Our friendship is hanging on a thread cause u won't buy stuff.\n   
6                                                      