# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

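* PEFT technique: LoRA (Low-Rank Adaptation)
* Model: distilbert-base-uncased
* Evaluation approach: accuracy on binary sentiment classification, computed with `Trainer.evaluate()` on a 500-example test subset
* Fine-tuning dataset: IMDB (500 training and 500 test examples, shuffled with seed 42)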

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
pip install transformers datasets peft

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
from datasets import load_dataset


splits = ["train", "test"]
ds = {split: ds for split, ds in zip(splits, load_dataset("imdb", split=splits))}


for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(500))


ds

# # Select small subsets for training and evaluation
# train_subset = ds["train"].select(range(20))
# test_subset = ds["test"].select(range(10))

Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 21.0M/21.0M [00:00<00:00, 27.6MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:00<00:00, 36.0MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:00<00:00, 42.5MB/s]


Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

In [3]:


from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"},  
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

# Freeze the DistilBERT encoder so that only the classification head stays trainable.
for param in model.base_model.parameters():
    param.requires_grad = False

model.classifier

config.json:   0%|          | 0.00/483 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=2, bias=True)

In [4]:



from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")


def preprocess_function(examples):
    """Preprocess the imdb dataset by returning tokenized examples."""
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)



tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [5]:


import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}


trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./finaldata/sentiment_analysis",
        learning_rate=2e-5,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        num_train_epochs=4,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)
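# Evaluate first to record the baseline before any training takes place.
eval_result = trainer.evaluate()
# Then train the classification head (the encoder was frozen in an earlier cell).
trainer.train()

print(eval_result)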

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.68123,0.626
2,No log,0.67442,0.708
3,No log,0.669779,0.738
4,No log,0.668401,0.742


Checkpoint destination directory ./finaldata/sentiment_analysis/checkpoint-63 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finaldata/sentiment_analysis/checkpoint-126 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finaldata/sentiment_analysis/checkpoint-189 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finaldata/sentiment_analysis/checkpoint-252 already exists and is non-empty.Saving will proceed but saved results may be invalid.


{'eval_loss': 0.6945008039474487, 'eval_accuracy': 0.492, 'eval_runtime': 8.2022, 'eval_samples_per_second': 60.959, 'eval_steps_per_second': 7.681}
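The baseline accuracy of 0.492 is about what random guessing gives on a two-class task, which fits the warning above that the classifier weights are newly initialized. As a minimal sketch of what `compute_metrics` does, the pure-Python function below takes the argmax of each logit row and compares it with the label. The function name and the toy logits are made up for illustration and are not part of the notebook.

```python
def accuracy_from_logits(logits, labels):
    """Return the fraction of rows whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        # argmax over the class scores of one example
        predicted = max(range(len(row)), key=lambda i: row[i])
        correct += int(predicted == label)
    return correct / len(labels)


# Toy logits for four reviews (columns: NEGATIVE, POSITIVE); the values are invented.
toy_logits = [[0.2, 0.1], [0.0, 0.3], [0.4, 0.5], [0.9, 0.1]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)
```

Three of the four toy predictions match their labels, so the printed value is 0.75.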


In [6]:
pip install peft

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.
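To get a feel for why LoRA counts as lightweight, the sketch below counts the parameters that the adapters add for the configuration used in this notebook: rank 8, `lora_alpha` 64, and six attention projections. It assumes each of DistilBERT's `q_lin`, `k_lin`, and `v_lin` layers is a 768 x 768 linear layer. The helper name is hypothetical, and the count covers only the low-rank matrices, not the classification head.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters added by one LoRA adapter: A is (rank x in), B is (out x rank)."""
    return rank * in_features + out_features * rank


hidden = 768      # assumed DistilBERT hidden size
rank = 8          # r in LoraConfig
alpha = 64        # lora_alpha in LoraConfig
num_targets = 6   # q/k/v projections in layers 0 and 1

per_module = lora_param_count(hidden, hidden, rank)
total_lora = per_module * num_targets
full_module = hidden * hidden + hidden  # weight and bias of one frozen projection
scaling = alpha / rank                  # factor applied to the low-rank update

print(per_module, total_lora, full_module, scaling)
```

Under these assumptions each adapter adds 12,288 parameters next to a frozen projection of 590,592 parameters, and the low-rank update is scaled by 8.0.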

In [7]:

pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.4 MB)
Collecting joblib>=1.2.0
  Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.5.0-py3-none-any.whl (18 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.4.2 scikit-learn-1.5.1 threadpoolctl-3.5.0
Note: you may need to restart the kernel to use updated packages.


In [8]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset
from peft import get_peft_config, PeftType, PeftConfig, PeftModel
import numpy as np

model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
from datasets import load_dataset


splits = ["train", "test"]
ds = {split: ds for split, ds in zip(splits, load_dataset("imdb", split=splits))}


for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(500))

ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

In [11]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EvalPrediction,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset, load_metric
from peft import LoraConfig, TaskType, get_peft_model
import numpy as np

model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
    id2label={0: "NEGATIVE", 1: "POSITIVE"}, 
    label2id={"NEGATIVE": 0, "POSITIVE": 1},
)

for param in model.base_model.parameters():
    param.requires_grad=False
    

tokenizer = AutoTokenizer.from_pretrained(model_name)

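# DistilBERT's tokenizer already defines [PAD], so the next three calls act as a
# safeguard and should leave the vocabulary size unchanged.
tokenizer.add_special_tokens({'pad_token': '[PAD]'})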

model.resize_token_embeddings(len(tokenizer))

model.config.pad_token_id = tokenizer.pad_token_id


peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=64,
    lora_dropout=0.2,
    target_modules=["distilbert.transformer.layer.0.attention.q_lin", 
                    "distilbert.transformer.layer.0.attention.k_lin", 
                    "distilbert.transformer.layer.0.attention.v_lin",
                    "distilbert.transformer.layer.1.attention.q_lin", 
                    "distilbert.transformer.layer.1.attention.k_lin", 
                    "distilbert.transformer.layer.1.attention.v_lin"]
)

lora_model = get_peft_model(model, peft_config)

def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding="max_length", max_length=128)

tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)

metric = load_metric("accuracy")

def compute_metrics(p: EvalPrediction):
    preds = p.predictions.argmax(-1)
    return metric.compute(predictions=preds, references=p.label_ids)


training_args = TrainingArguments(
    output_dir="./finalresults",
    learning_rate=2e-3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=6,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)


trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    tokenizer=tokenizer  
)

trainer.train()
final_result=trainer.evaluate()


# Print the final metrics, then save the Trainer output and the LoRA adapter weights.
print(final_result)
trainer.save_model(output_dir="./finaldata/sentiment_analysis")
lora_model.save_pretrained("./peft_model")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Map:   0%|          | 0/500 [00:00<?, ? examples/s]

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.472673,0.762
2,No log,0.69981,0.672
3,No log,0.462517,0.822
4,No log,0.678098,0.824
5,No log,0.774526,0.792
6,No log,0.787886,0.828


Checkpoint destination directory ./finalresults/checkpoint-32 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finalresults/checkpoint-64 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finalresults/checkpoint-96 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finalresults/checkpoint-128 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finalresults/checkpoint-160 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./finalresults/checkpoint-192 already exists and is non-empty.Saving will proceed but saved results may be invalid.
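The checkpoint numbers in both training runs follow from the dataset size and batch size. The sketch below reproduces them, assuming a single device and no gradient accumulation. The helper name is hypothetical. The "No log" entries in the training-loss column are also consistent with this arithmetic: if `logging_steps` is left at its default (500, to the best of my knowledge), neither run reaches the first logging step.

```python
import math


def checkpoint_steps(num_examples, batch_size, epochs):
    """Global step at the end of each epoch, assuming one optimizer step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return [steps_per_epoch * epoch for epoch in range(1, epochs + 1)]


# First run: 500 examples, batch size 8, 4 epochs.
baseline_run = checkpoint_steps(500, 8, 4)
# LoRA run: 500 examples, batch size 16, 6 epochs.
lora_run = checkpoint_steps(500, 16, 6)

print(baseline_run)
print(lora_run)
```

The two lists match the `checkpoint-63` to `checkpoint-252` and `checkpoint-32` to `checkpoint-192` directories named in the warnings.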


{'eval_loss': 0.4625169634819031, 'eval_accuracy': 0.822, 'eval_runtime': 1.9708, 'eval_samples_per_second': 253.709, 'eval_steps_per_second': 16.237, 'epoch': 6.0}


## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.
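One simple way to compare the two evaluations is to report the absolute and relative change in accuracy. The sketch below hard-codes the loss and accuracy values that this notebook reports for the baseline and for the reloaded PEFT model. The helper name `summarize_gain` is hypothetical.

```python
# Values copied from the evaluation outputs reported in this notebook.
baseline = {"eval_loss": 0.6945008039474487, "eval_accuracy": 0.492}
peft = {"eval_loss": 0.342551589012146, "eval_accuracy": 0.87}


def summarize_gain(before, after, key="eval_accuracy"):
    """Return the absolute and relative change of one metric between two runs."""
    absolute = after[key] - before[key]
    relative = absolute / before[key]
    return absolute, relative


abs_gain, rel_gain = summarize_gain(baseline, peft)
print(f"absolute gain: {abs_gain:.3f}")
print(f"relative gain: {rel_gain:.1%}")
```

Note that the inference cells tokenize without `max_length=128`, unlike the training cell, so the reloaded model sees longer inputs. That may partly explain why it scores 0.87 here against 0.822 at the end of training, though this is only a guess.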

In [34]:



import os

weight_folder = "./peft_model"


if os.path.isdir(weight_folder):
    files_in_weight_folder = os.listdir(weight_folder)
    print(f"Files in '{weight_folder}': {files_in_weight_folder}")
else:
    print(f"The directory '{weight_folder}' does not exist.")

!tar -czf peft_model.tar.gz peft_model
if os.path.isfile("peft_model.tar.gz"):
    print("Compressed file 'peft_model.tar.gz' created successfully.")
else:
    print("Error: Compressed file 'peft_model.tar.gz' was not created.")

Files in './peft_model': ['adapter_model.bin', 'adapter_config.json', 'README.md']



Compressed file 'peft_model.tar.gz' created successfully.


In [15]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel,AutoPeftModelForSequenceClassification

# model_name = "distilbert-base-uncased"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForSequenceClassification.from_pretrained(model_name)

peft_model_path = "./peft_model"
num_labels = 2
peft_model = AutoPeftModelForSequenceClassification.from_pretrained(peft_model_path, num_labels=num_labels)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
from datasets import load_dataset

splits = ["train", "test"]
ds = {split: ds for split, ds in zip(splits, load_dataset("imdb", split=splits))}

for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(500))

ds


{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

In [17]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding=True)

tokenized_ds = {}
for split in splits:
    tokenized_ds[split] = ds[split].map(preprocess_function, batched=True)



Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [18]:
pip install evaluate



Defaulting to user installation because normal site-packages is not writeable
Collecting evaluate
  Downloading evaluate-0.4.2-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.2
Note: you may need to restart the kernel to use updated packages.


In [19]:
pip install scikit-learn


Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [20]:

import evaluate
import numpy as np

metric_load = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric_load.compute(predictions=predictions, references=labels)


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]

In [28]:
!tar -czf peft_model.tar.gz peft_model



In [27]:
from transformers import Trainer, TrainingArguments

# Evaluation-only arguments: no training is run on the reloaded PEFT model.
eval_args = TrainingArguments(
    output_dir="./peftresults",
    per_device_eval_batch_size=16,
    do_train=False,
    do_eval=True,
    logging_dir="./logs",
)

trainer = Trainer(
    model=peft_model,
    args=eval_args,
    eval_dataset=tokenized_ds["test"],
    compute_metrics=compute_metrics
)

final_eval_result = trainer.evaluate()
print(final_eval_result)


{'eval_loss': 0.342551589012146, 'eval_accuracy': 0.87, 'eval_runtime': 8.9518, 'eval_samples_per_second': 55.855, 'eval_steps_per_second': 3.575}


In [32]:
import json

# Save the evaluation results to a file
with open('./peftresults/eval_results.json', 'w') as f:
    json.dump(final_eval_result, f)

In [24]:

original_eval_result = {
    'eval_loss': eval_result['eval_loss'] ,
    'eval_accuracy':eval_result['eval_accuracy']
}

print("Original  Results:", original_eval_result)
print("PEFT  Results:", final_eval_result)


original_accuracy = original_eval_result["eval_accuracy"]
peft_accuracy = final_eval_result["eval_accuracy"]

# Print the plain values; wrapping them in {} outside an f-string would print a set.
print("Original Accuracy:", original_accuracy)
print("PEFT  Accuracy:", peft_accuracy)


Original  Results: {'eval_loss': 0.6945008039474487, 'eval_accuracy': 0.492}
PEFT  Results: {'eval_loss': 0.342551589012146, 'eval_accuracy': 0.87, 'eval_runtime': 8.9162, 'eval_samples_per_second': 56.077, 'eval_steps_per_second': 3.589}
Original Accuracy: 0.492
PEFT  Accuracy: 0.87
