# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

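* PEFT technique: LoRA (Low-Rank Adaptation), configured through `peft.LoraConfig`
* Model: GPT-2 (`gpt2`), loaded as `GPT2ForSequenceClassification` with two labels
* Evaluation approach: accuracy and loss on the test split, reported by `Trainer.evaluate()`
* Fine-tuning dataset: IMDB movie reviews (`imdb`), subsampled to 1,000 training and 1,000 test examples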

#### PEFT
PEFT (parameter-efficient fine-tuning) adapts a large pre-trained model by updating only a small subset of its parameters, which reduces the compute and memory needed for training. The subset could be, for example, the last layer of a multilayer neural network.

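This example uses Low-Rank Adaptation (LoRA), one of the PEFT techniques. Rather than updating a layer's weights directly, LoRA learns a low-rank 'delta' that is added to the frozen weights and that, for large layers, has far fewer parameters than the original layer.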
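
The saving can be sketched with a simple parameter count. For a weight matrix of shape `d_out x d_in`, LoRA trains two small matrices, `B` (`d_out x r`) and `A` (`r x d_in`), whose product forms the delta. The layer shapes and helper names below are illustrative assumptions, not values read from a checkpoint.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Number of weights in a dense d_out x d_in matrix (no bias)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Number of weights in the LoRA pair A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r


# Hypothetical 768 x 768 projection, a typical hidden-layer shape.
wide_full = full_params(768, 768)
wide_lora = lora_params(768, 768, r=2)

# A 768 -> 2 classification head, the shape of a binary sentiment classifier.
head_full = full_params(768, 2)
head_lora = lora_params(768, 2, r=2)

print(wide_full, wide_lora)
print(head_full, head_lora)
```

The count shows that the low-rank pair is only smaller than the original layer when `r` is much smaller than both dimensions. For a head with just two outputs, a rank-2 adapter is about the same size as the layer it adapts.
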
#### Model
The model is GPT-2, loaded through the transformers class `GPT2ForSequenceClassification`, which adds a classification head suited to tasks such as text sentiment. The IMDB dataset used for fine-tuning contains only two features, `text` and `label`.

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    GPT2ForSequenceClassification,
)

print(transformers.__version__)
!python --version


4.36.0
Python 3.10.11


# Loading Datasets

In [2]:
from datasets import load_dataset,get_dataset_split_names,load_dataset_builder
dataset = load_dataset("imdb")
dataset_builder = load_dataset_builder("imdb")
#dataset_builder.info.description

Downloading readme:   0%|          | 0.00/7.81k [00:00<?, ?B/s]

Downloading data: 100%|██████████| 21.0M/21.0M [00:00<00:00, 23.7MB/s]
Downloading data: 100%|██████████| 20.5M/20.5M [00:00<00:00, 28.3MB/s]
Downloading data: 100%|██████████| 42.0M/42.0M [00:01<00:00, 38.0MB/s]


Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [3]:
splits = ["train","test"]
dataset = {split: dataset for split, dataset in zip(splits,load_dataset("imdb",split=splits))}

In [4]:
for split in splits:
    dataset[split] = dataset[split].shuffle(seed=42).select(range(1000))

In [5]:
#dataset["train"][0]

### Dataset Tokenizer

In [6]:
model_type = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_type)
tokenizer.pad_token = tokenizer.eos_token

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [7]:
def preprocess_function(examples):
    # Truncate only; padding is applied per batch by DataCollatorWithPadding.
    return tokenizer(examples["text"], truncation=True)

In [8]:
tokenized_ds = {}

In [9]:
for split in splits:
    tokenized_ds[split] = dataset[split].map(preprocess_function, batched=True)

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]
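
The tokenization step truncates long reviews (to the model's maximum of 1,024 tokens for GPT-2) but does not pad them, so the examples keep different lengths. Padding is deferred to `DataCollatorWithPadding`, which pads each batch to its longest member. The sketch below is a simplified pure-Python stand-in for that idea, not the library's implementation; `pad_batch` and the toy token ids are hypothetical.

```python
from typing import Dict, List


def pad_batch(batch: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence to the longest one in this batch."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token ids; 50256 is GPT-2's end-of-text id, reused here as the pad id.
padded = pad_batch([[11, 12, 13], [21]], pad_id=50256)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a fixed maximum keeps short reviews from being inflated to the full context length, which saves compute on a dataset with widely varying review lengths.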

In [10]:
print(tokenized_ds["train"])

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1000
})



### Define Model Performance Metrics

In [11]:
def compute_metrics(eval_pred):
    # eval_pred unpacks to (logits, labels) as NumPy arrays
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
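
As a small worked example of the metric, the snippet below applies the same argmax-then-compare logic to made-up logits. The four rows and their labels are hypothetical values chosen for illustration; the label convention (0 = negative, 1 = positive) follows the IMDB dataset.

```python
import numpy as np

# Hypothetical logits for four reviews: columns are [negative, positive].
logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
labels = np.array([0, 1, 0, 0])

predictions = np.argmax(logits, axis=1)     # index of the larger logit per row
accuracy = (predictions == labels).mean()   # fraction of rows that match

print(predictions.tolist(), float(accuracy))
```

Three of the four predictions match their labels, so the accuracy is 0.75.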

### Load PreTrained Model Definition

In [12]:
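# Load the pre-trained model with a new two-label classification head
model = GPT2ForSequenceClassification.from_pretrained(model_type, num_labels=2)
# GPT-2 has no pad token; reuse EOS so the model can locate the last real token
model.config.pad_token_id = tokenizer.eos_token_id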
modelName = "GPT2ForSequenceClassification"
runDescription="Original Pretrained"

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Trainable Parameters

In [13]:
def print_trainable_parameters(model, modelName, runDescription):
    """
    Prints the number of trainable parameters in the model and
    returns the counts as a one-row DataFrame.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    trainable_pct = 100 * trainable_params / all_param
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {trainable_pct}"
    )
    data = {
        "Model": modelName,
        "Description": runDescription,
        "Trainable Parameters": trainable_params,
        "All Parameters": all_param,
        "Trainable%": trainable_pct,
    }
    return pd.DataFrame(data, index=[0])

In [14]:
df1 = print_trainable_parameters(model,modelName,runDescription)
display(df1)

trainable params: 124441344 || all params: 124441344 || trainable%: 100.0


| | Model | Description | Trainable Parameters | All Parameters | Trainable% |
|---|---|---|---|---|---|
| 0 | GPT2ForSequenceClassification | Original Pretrained | 124441344 | 124441344 | 100.0 |


In [15]:
#print(tokenized_ds["train"])

In [16]:
# Freeze the pre-trained transformer body so that only the classification head (model.score) is updated during fine-tuning
for param in model.transformer.parameters():
    param.requires_grad = False

In [17]:
runDescription="Requires_Grad = False"
df2=print_trainable_parameters(model,modelName,runDescription) 

trainable params: 1536 || all params: 124441344 || trainable%: 0.00123431646639882


### Training the Model

In [18]:
from transformers import Trainer, DataCollatorWithPadding, TrainingArguments
training_args=TrainingArguments(
    output_dir="./data",
    learning_rate=2e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    )
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [19]:
trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.485632 | 0.753 |


Checkpoint destination directory ./data/checkpoint-250 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=250, training_loss=0.7853065795898437, metrics={'train_runtime': 80.4306, 'train_samples_per_second': 12.433, 'train_steps_per_second': 3.108, 'total_flos': 267126934339584.0, 'train_loss': 0.7853065795898437, 'epoch': 1.0})
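
The `global_step=250` value and the `No log` entry in the Training Loss column both follow from the run's size. The sketch below assumes a single device, no gradient accumulation, and the `TrainingArguments` default of logging every 500 steps, since `logging_steps` is not set in this notebook.

```python
import math

num_train_examples = 1000   # size of the subsampled train split
batch_size = 4              # per_device_train_batch_size
num_epochs = 1              # num_train_epochs
logging_steps = 500         # assumed TrainingArguments default

steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# The running training loss is only reported once logging_steps is reached.
loss_logged = total_steps >= logging_steps
print(total_steps, loss_logged)
```

Under these assumptions the run ends after 250 steps, before the first logging point, so the table shows no training loss. Passing a smaller `logging_steps` to `TrainingArguments` should make it appear.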

In [20]:
runDescription="Trained"
df3=print_trainable_parameters(model,modelName,runDescription)
df_123 = pd.concat([df1, df2,df3])

trainable params: 1536 || all params: 124441344 || trainable%: 0.00123431646639882


| | Model | Description | Trainable Parameters | All Parameters | Trainable% |
|---|---|---|---|---|---|
| 0 | GPT2ForSequenceClassification | Original Pretrained | 124441344 | 124441344 | 100.0 |
| 0 | GPT2ForSequenceClassification | Requires_Grad = False | 1536 | 124441344 | 0.001234 |
| 0 | GPT2ForSequenceClassification | Trained | 1536 | 124441344 | 0.001234 |


In [21]:
full_eval_data = trainer.evaluate()

In [22]:
df_full = pd.DataFrame(full_eval_data, index=[0])

In [23]:
df_full

| | eval_loss | eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|
| 0 | 0.485632 | 0.753 | 36.7625 | 27.202 | 6.8 | 1.0 |


In [24]:
eval_accuracy = df_full["eval_accuracy"].iloc[0]  # select by column name rather than by position
print(eval_accuracy)

0.753


In [25]:
items_for_manual_review = tokenized_ds["train"].select(range(2))

results = trainer.predict(items_for_manual_review)
df = pd.DataFrame(
    {
        "text": [item["text"] for item in items_for_manual_review],
        "predictions": results.predictions.argmax(axis=1),
        "labels": results.label_ids,
    }
)
pd.set_option("display.max_colwidth", None)
df

| | text | predictions | labels |
|---|---|---|---|
| 0 | There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all... | 1 | 1 |
| 1 | This movie is a great. The plot is very true to the book which is a classic written by Mark Twain. The movie starts of with a scene where Hank sings a song with a bunch of kids called "when you stub your toe on the moon" It reminds me of Sinatra's song High Hopes, it is fun and inspirational. The Music is great throughout and my favorite song is sung by the King, Hank (bing Crosby) and Sir "Saggy" Sagamore. OVerall a great family movie or even a great Date movie. This is a movie you can watch over and over again. The princess played by Rhonda Fleming is gorgeous. I love this movie!! If you liked Danny Kaye in the Court Jester then you will definitely like this movie. | 1 | 1 |


In [26]:
print(tokenized_ds)

{'train': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1000
}), 'test': Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1000
})}


In [27]:
from safetensors.torch import load_model, save_model
save_model(model, "model.safetensors")

### PEFT Model

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [28]:
from peft import TaskType

In [29]:
# Create LoRA configuration
lora_config = LoraConfig(
    r=2,           # Rank of the low-rank matrices
    lora_alpha=32, # Scaling factor for LoRA
    target_modules=["score"], # Target the classification head
    lora_dropout=0.1, # Dropout probability for LoRA
    task_type=TaskType.SEQ_CLS,
)

# Wrap the model with PEFT
peft_model = get_peft_model(model, lora_config)


In [30]:
runDescription="peft without training"
modelName = "peft_model"
df4=print_trainable_parameters(peft_model,modelName,runDescription)
df_1234 = pd.concat([df1, df2, df3,df4])
display(df_1234)

trainable params: 6152 || all params: 124445960 || trainable%: 0.004943511223666883


| | Model | Description | Trainable Parameters | All Parameters | Trainable% |
|---|---|---|---|---|---|
| 0 | GPT2ForSequenceClassification | Original Pretrained | 124441344 | 124441344 | 100.0 |
| 0 | GPT2ForSequenceClassification | Requires_Grad = False | 1536 | 124441344 | 0.001234 |
| 0 | GPT2ForSequenceClassification | Trained | 1536 | 124441344 | 0.001234 |
| 0 | peft_model | peft without training | 6152 | 124445960 | 0.004944 |


In [31]:
from safetensors.torch import load_model, save_model

# Save under a separate name so the head-only checkpoint saved earlier is not overwritten
save_model(peft_model, "peft_model.safetensors")

In [32]:

from transformers import Trainer, TrainingArguments

peft_training_args=TrainingArguments(
    output_dir="./data1",
    learning_rate=2e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    #save_strategy="epoch",
    #load_best_model_at_end=True,
    )
peft_trainer = Trainer(
    model=peft_model,
    args=peft_training_args,
    train_dataset= tokenized_ds['train'].rename_column('label', 'labels'),
    eval_dataset= tokenized_ds['test'].rename_column('label', 'labels'),
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [33]:
peft_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.459315 | 0.77 |


TrainOutput(global_step=250, training_loss=0.5472578735351562, metrics={'train_runtime': 79.8875, 'train_samples_per_second': 12.518, 'train_steps_per_second': 3.129, 'total_flos': 267141431090688.0, 'train_loss': 0.5472578735351562, 'epoch': 1.0})

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [34]:
peft_eval_data = peft_trainer.evaluate()  # evaluate with the PEFT trainer, not the earlier head-only trainer

In [35]:
#peft_model.save_pretrained('model/peft_model')

In [36]:
print(type(peft_eval_data))
df_peft = pd.DataFrame(peft_eval_data, index=[0])


<class 'dict'>


In [37]:
modelName = "peft_model"
runDescription = "peft with Training"
df5=print_trainable_parameters(peft_model,modelName,runDescription)
df_12345 = pd.concat([df1, df2, df3,df4,df5])
display(df_12345)

trainable params: 6152 || all params: 124445960 || trainable%: 0.004943511223666883


| | Model | Description | Trainable Parameters | All Parameters | Trainable% |
|---|---|---|---|---|---|
| 0 | GPT2ForSequenceClassification | Original Pretrained | 124441344 | 124441344 | 100.0 |
| 0 | GPT2ForSequenceClassification | Requires_Grad = False | 1536 | 124441344 | 0.001234 |
| 0 | GPT2ForSequenceClassification | Trained | 1536 | 124441344 | 0.001234 |
| 0 | peft_model | peft without training | 6152 | 124445960 | 0.004944 |
| 0 | peft_model | peft with Training | 6152 | 124445960 | 0.004944 |


In [38]:
df_both = pd.concat([df_full, df_peft])

In [39]:
display(df_both)

| | eval_loss | eval_accuracy | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch |
|---|---|---|---|---|---|---|
| 0 | 0.485632 | 0.753 | 36.7625 | 27.202 | 6.8 | 1.0 |
| 0 | 0.459315 | 0.77 | 36.5088 | 27.391 | 6.848 | 1.0 |
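
To put the two rows side by side, the sketch below computes the change in accuracy and loss and compares the accuracy gain with a rough binomial standard error for 1,000 evaluation examples. The dictionary names `head_only` and `with_lora` are labels introduced here for the first and second rows. Treating the two accuracies as independent is a simplifying assumption, since both were measured on the same test subset.

```python
import math

n = 1000  # evaluation examples in the subsampled test split
head_only = {"eval_loss": 0.485632, "eval_accuracy": 0.753}
with_lora = {"eval_loss": 0.459315, "eval_accuracy": 0.770}

acc_gain = with_lora["eval_accuracy"] - head_only["eval_accuracy"]
loss_drop = head_only["eval_loss"] - with_lora["eval_loss"]


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


# Rough yardstick only: treats the two accuracies as independent estimates.
se_diff = math.sqrt(
    binomial_se(head_only["eval_accuracy"], n) ** 2
    + binomial_se(with_lora["eval_accuracy"], n) ** 2
)
print(round(acc_gain, 3), round(loss_drop, 6), round(se_diff, 4))
```

The accuracy gain of 0.017 is smaller than this rough standard error of about 0.019, so on a 1,000-example test subset the difference between the two runs is hard to distinguish from sampling noise.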


# Summary
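Freezing the transformer body of the pre-trained GPT2ForSequenceClassification model reduced the trainable parameters from 124,441,344 to 1,536, leaving only the classification head trainable. Wrapping the model with a LoRA config through PEFT increased the trainable count to 6,152 and the total to 124,445,960. This is most likely due to the additional parameters that PEFT adds around the targeted layer, including the low-rank matrices introduced by LoRA.

The PEFT run brought no clear improvement in accuracy or runtime: evaluation accuracy moved from 0.753 to 0.770 and evaluation loss from 0.486 to 0.459, while evaluation runtime was essentially unchanged (36.8 s versus 36.5 s). Because the LoRA adapter was added to a model whose head had already been trained for one epoch, part of that small gain may simply reflect the extra epoch of training. Increasing the number of epochs or the sample size also produced no discernible improvement with PEFT.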