# Lightweight Fine-Tuning Project

* PEFT technique: LoRA
* Model: gpt2
* Evaluation approach: the `evaluate` method of the Hugging Face `Trainer`
* Fine-tuning dataset: 'dair-ai/emotion'

## Loading and Evaluating a Foundation Model

In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
#importing the dataset loader, tokenizer, and model classes
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [2]:
#loading the train and test splits of the emotion dataset
splits = ["train", "test"]
dataset = 'dair-ai/emotion'
ds = dict(zip(splits, load_dataset(dataset, split=splits, trust_remote_code=True)))

#keeping a shuffled 500-example subset of each split
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(500))

ds

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 500
 })}

In [3]:
#converting to tokens for this model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

#setting the pad token
tokenizer.pad_token = tokenizer.eos_token

def preprocessing(text):
    '''Preprocess the dataset to return tokenized results'''
    return tokenizer(text['text'], padding='max_length', truncation=True)

In [4]:
token_ds = {}
for split in splits:
    token_ds[split] = ds[split].map(preprocessing, batched=True)

token_ds

{'train': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 500
 }),
 'test': Dataset({
     features: ['text', 'label', 'input_ids', 'attention_mask'],
     num_rows: 500
 })}

In [5]:
#making sure it worked
token_ds['train'][0]['input_ids'][:10]

[4514, 16259, 287, 262, 1499, 50256, 50256, 50256, 50256, 50256]
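
The run of `50256` values in the output above is padding: GPT-2 ships without a pad token, so the notebook reuses the end-of-sequence token (id 50256), and the attention mask marks which positions are real. A minimal pure-Python sketch of that idea follows. `pad_to_length` is a hypothetical helper written for illustration, not part of the tokenizer API, and the short `max_length` is chosen only to keep the output readable.

```python
EOS_TOKEN_ID = 50256  # GPT-2's end-of-sequence id, reused here as the pad id


def pad_to_length(input_ids, max_length, pad_id=EOS_TOKEN_ID):
    """Right-pad (or truncate) a list of token ids and build its attention mask."""
    ids = list(input_ids[:max_length])
    n_real = len(ids)
    ids += [pad_id] * (max_length - n_real)
    attention_mask = [1] * n_real + [0] * (max_length - n_real)
    return ids, attention_mask


# The first five ids are the real tokens from the output above.
ids, mask = pad_to_length([4514, 16259, 287, 262, 1499], max_length=10)
print(ids)
print(mask)
```

Because the pad id and the end-of-sequence id are the same number, the attention mask is the only reliable way to tell padding from content.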

In [6]:
#setting up model
model = AutoModelForSequenceClassification.from_pretrained('gpt2',
                                                           num_labels=6,
                                                           id2label={0:'sadness', 1:'joy', 2:'love', 3:'anger', 4:'fear', 5:'surprise'},
                                                           label2id={'sadness':0, 'joy':1, 'love':2, 'anger':3, 'fear':4, 'surprise':5}
                                                          )

model.config.pad_token_id = tokenizer.pad_token_id


#freezing all parameters of the base model
for param in model.base_model.parameters():
    param.requires_grad = False

model.score

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Linear(in_features=768, out_features=6, bias=False)

In [7]:
#setting up evaluation and training
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments


def compute_metrics(eval_pred):
    '''Compute accuracy from the (logits, labels) pair supplied by the Trainer'''
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": (predictions == labels).mean()}

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
training_args = TrainingArguments(
        output_dir="./data/emotions",
        learning_rate= 2e-3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,)
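
`compute_metrics` receives a pair of arrays, logits of shape `(n_examples, n_labels)` and integer labels, and reduces the logits to class ids with `argmax`. A small sketch with made-up logits shows the computation in isolation. The numbers are illustrative only and do not come from the model.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, mirroring the notebook's function."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    return {"accuracy": float((predictions == labels).mean())}


# Four made-up examples over three classes (illustrative values only).
toy_logits = np.array([
    [2.0, 0.1, -1.0],   # argmax -> 0
    [0.3, 1.5, 0.2],    # argmax -> 1
    [0.0, 0.2, 3.0],    # argmax -> 2
    [1.0, 0.9, 0.8],    # argmax -> 0
])
toy_labels = np.array([0, 1, 2, 1])

result = compute_metrics((toy_logits, toy_labels))
print(result)
```

Three of the four toy predictions match their labels, so the reported accuracy is 0.75.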


In [8]:
#creating training object
trainer = Trainer(
    model=model,
    args= training_args,
    train_dataset=token_ds["train"],
    eval_dataset=token_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)


In [9]:
#running the model on the test set
trainer.evaluate()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'eval_loss': 6.020493984222412,
 'eval_accuracy': 0.06,
 'eval_runtime': 42.2303,
 'eval_samples_per_second': 11.84,
 'eval_steps_per_second': 2.96}
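
One way to read these baseline numbers: with six classes, a classifier that assigns equal probability to every label has a cross-entropy of ln 6 and, on a balanced dataset, an accuracy of 1/6. The sketch below computes both reference points. The balanced-classes assumption is made here for simplicity and may not hold for this dataset, so treat the accuracy figure as a rough yardstick.

```python
import math

NUM_LABELS = 6
baseline_eval_loss = 6.020493984222412   # from the evaluate() output above
baseline_eval_accuracy = 0.06            # from the evaluate() output above

uniform_loss = math.log(NUM_LABELS)      # -ln(1/6)
uniform_accuracy = 1 / NUM_LABELS        # assumes balanced classes

print(f"uniform-guess loss:     {uniform_loss:.4f}")
print(f"uniform-guess accuracy: {uniform_accuracy:.4f}")
print("baseline loss above uniform-guess loss:", baseline_eval_loss > uniform_loss)
```

A loss well above ln 6 suggests the randomly initialized `score` head is confidently wrong on many examples rather than merely uninformed, which is plausible before any training.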

In [12]:
#viewing the results
import pandas as pd
from sklearn.metrics import f1_score

df = pd.DataFrame(token_ds["test"])
df = df[['text', 'label']]

#adding predictions to the dataframe
predictions = trainer.predict(token_ds["test"])
df["predicted_label"] = np.argmax(predictions[0], axis=1)

df.head()

| | text | label | predicted_label |
|---|---|---|---|
| 0 | i was feeling really troubled and down over wh... | 0 | 2 |
| 1 | i feel so thrilled to have three such distingu... | 1 | 2 |
| 2 | i feel is that the most likeable characters ar... | 1 | 2 |
| 3 | i tune out the rest of the world and focus on ... | 1 | 2 |
| 4 | i sit here writing this i feel unhappy inside | 0 | 2 |
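
The `label` and `predicted_label` columns hold integer class ids. A short sketch of mapping them back to names with the same `id2label` dictionary that was passed to the model; the two example pairs are copied from rows 0 and 1 of the table above.

```python
id2label = {0: 'sadness', 1: 'joy', 2: 'love', 3: 'anger', 4: 'fear', 5: 'surprise'}

# (label, predicted_label) pairs for rows 0 and 1 of the table above
rows = [(0, 2), (1, 2)]

decoded = [(id2label[label], id2label[pred]) for label, pred in rows]
for true_name, pred_name in decoded:
    print(f"true={true_name:<8} predicted={pred_name}")
```

On the DataFrame itself, `df["label"].map(id2label)` should give the same mapping for a whole column.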


In [14]:
#some incorrect predictions
df[df["label"] != df["predicted_label"]].head()

| | text | label | predicted_label |
|---|---|---|---|
| 0 | i was feeling really troubled and down over wh... | 0 | 2 |
| 1 | i feel so thrilled to have three such distingu... | 1 | 2 |
| 2 | i feel is that the most likeable characters ar... | 1 | 2 |
| 3 | i tune out the rest of the world and focus on ... | 1 | 2 |
| 4 | i sit here writing this i feel unhappy inside | 0 | 2 |


In [16]:
base_f1 = f1_score(df['label'], df['predicted_label'], average='weighted')
base_f1

0.006805293005671078
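
Weighted F1 averages the per-class F1 scores, using each class's share of the true labels as its weight, so a model that collapses onto a single class is penalized heavily. A pure-Python sketch of the metric on made-up labels follows. `weighted_f1` is a hypothetical helper intended to mirror `f1_score(..., average='weighted')`, and the toy data is not taken from the dataset.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n_true / total) * f1
    return score


# Made-up example: ten labels, and a model that always predicts class 2.
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 3, 4]
y_pred = [2] * 10

toy_score = weighted_f1(y_true, y_pred)
print(toy_score)
```

In the toy case only class 2 earns a non-zero F1 (2/11), and it carries a weight of just 1/10, so the weighted score ends up far below the 10% accuracy.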

## Performing Parameter-Efficient Fine-Tuning

In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [17]:
from peft import LoraConfig, TaskType, get_peft_model

In [18]:
#creating the lora config
config = LoraConfig(task_type = TaskType.SEQ_CLS,
                    inference_mode=False,
                    r=8,
                    lora_alpha=10,
                    lora_dropout=0.1,
                    )

#creating new model
new_model = AutoModelForSequenceClassification.from_pretrained('gpt2',
                                                           num_labels=6,
                                                           id2label={0:'sadness', 1:'joy', 2:'love', 3:'anger', 4:'fear', 5:'surprise'},
                                                           label2id={'sadness':0, 'joy':1, 'love':2, 'anger':3, 'fear':4, 'surprise':5}
                                                          )

new_model.config.pad_token_id = tokenizer.pad_token_id

#making the lora model
lora_model = get_peft_model(new_model, config)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
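
For readers new to LoRA: the adapter leaves each targeted weight matrix `W` frozen and learns a low-rank correction, so the effective weight is `W + (lora_alpha / r) * B @ A`, with `A` of shape `(r, d_in)` and `B` of shape `(d_out, r)`. The NumPy sketch below illustrates that arithmetic with the `r=8` and `lora_alpha=10` values from the config. The layer sizes and the initialization (small random `A`, zero `B`) are assumptions chosen for illustration, dropout is omitted, and this is not PEFT's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 768, 768      # illustrative layer sizes (assumption)
r, lora_alpha = 8, 10       # values from the LoraConfig above
scaling = lora_alpha / r    # 1.25

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight (random stand-in)
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable, small random init
B = np.zeros((d_out, r))                    # trainable, zero init


def lora_forward(x):
    """Base projection plus the scaled low-rank correction."""
    return x @ W.T + scaling * (x @ A.T @ B.T)


x = rng.normal(size=(2, d_in))
base_out = x @ W.T

# With B at zero the adapter contributes nothing, so outputs match the base layer.
print(np.allclose(lora_forward(x), base_out))

# Trainable values in the adapter versus the full matrix.
print(A.size + B.size, "adapter values vs", W.size, "in W")
```

Only `A` and `B` would receive gradients in this setup, which is where the parameter savings come from.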


In [19]:
lora_model.print_trainable_parameters()

trainable params: 304,128 || all params: 124,743,936 || trainable%: 0.2438018309763771
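
The reported count can be reproduced by hand under two assumptions that the notebook does not state: that PEFT's default LoRA target for GPT-2 is the fused attention projection `c_attn` (768 inputs, 2304 outputs) in each of the 12 transformer blocks, and that the 768×6 classification head is counted twice because a trainable copy is kept alongside the original. Treat the sketch below as a plausibility check rather than a description of PEFT internals.

```python
hidden_size = 768
n_layers = 12                    # GPT-2 small
c_attn_out = 3 * hidden_size     # fused query/key/value projection (assumed target)
r = 8                            # from the LoraConfig above
num_labels = 6

lora_per_layer = r * hidden_size + c_attn_out * r   # A (r x d_in) plus B (d_out x r)
lora_total = n_layers * lora_per_layer

score_head = hidden_size * num_labels      # Linear(768, 6, bias=False)
trainable = lora_total + 2 * score_head    # head assumed to be counted twice

all_params = 124_743_936                   # from the printed output above
print(f"LoRA matrices: {lora_total:,}")
print(f"trainable:     {trainable:,}")
print(f"trainable%:    {100 * trainable / all_params:.4f}")
```

Under these assumptions the arithmetic lands on 304,128 trainable parameters, matching the printed figure.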


In [20]:
#creating new training_args and trainer for lora model
lora_training_args = TrainingArguments(
        output_dir="./data/emotions_lora_results",
        learning_rate= 2e-3,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True)

lora_trainer = Trainer(
    model=lora_model,
    args= lora_training_args,
    train_dataset=token_ds["train"],
    eval_dataset=token_ds["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics)

In [21]:
#training lora model
lora_trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 1.309206 | 0.538 |


TrainOutput(global_step=125, training_loss=2.050177734375, metrics={'train_runtime': 160.7924, 'train_samples_per_second': 3.11, 'train_steps_per_second': 0.777, 'total_flos': 262226313216000.0, 'train_loss': 2.050177734375, 'epoch': 1.0})
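
The `global_step=125` figure and the `No log` entry both follow from the settings above: 500 training examples in batches of 4 give 125 optimizer steps per epoch, assuming a single device and no gradient accumulation. `Trainer` logs the training loss every `logging_steps` steps; the default is 500 as far as I know, so no training loss would be logged within a 125-step run. A quick check of that arithmetic:

```python
import math

n_train = 500                 # rows selected for the train split
batch_size = 4                # per_device_train_batch_size
num_epochs = 1                # num_train_epochs
logging_steps = 500           # assumed Trainer default; not set in this notebook

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs
loss_logged = total_steps >= logging_steps

print("total optimizer steps:", total_steps)
print("training loss logged during the run:", loss_logged)
```

Passing a smaller `logging_steps` to `TrainingArguments` (for example 25) would be one way to see intermediate training loss.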

In [22]:
#saving model weights
lora_model.save_pretrained('gpt-lora')

## Performing Inference with a PEFT Model

In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [23]:
from peft import AutoPeftModelForSequenceClassification

In [24]:
lora_model = AutoPeftModelForSequenceClassification.from_pretrained('gpt-lora', num_labels=6)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [27]:
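#note: lora_trainer still wraps the model trained in memory above, so this call
#does not exercise the weights reloaded from 'gpt-lora'; a new Trainer built with
#model=lora_model would be needed to evaluate the reloaded model
lora_trainer.evaluate()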

{'eval_loss': 1.3092061281204224,
 'eval_accuracy': 0.538,
 'eval_runtime': 50.1105,
 'eval_samples_per_second': 9.978,
 'eval_steps_per_second': 2.494,
 'epoch': 1.0}

In [28]:
#viewing the results
lora_df = pd.DataFrame(token_ds["test"])
lora_df = lora_df[['text', 'label']]

#adding predictions to the dataframe
predictions = lora_trainer.predict(token_ds["test"])
lora_df["predicted_label"] = np.argmax(predictions[0], axis=1)

lora_df.head()

| | text | label | predicted_label |
|---|---|---|---|
| 0 | i was feeling really troubled and down over wh... | 0 | 0 |
| 1 | i feel so thrilled to have three such distingu... | 1 | 1 |
| 2 | i feel is that the most likeable characters ar... | 1 | 1 |
| 3 | i tune out the rest of the world and focus on ... | 1 | 1 |
| 4 | i sit here writing this i feel unhappy inside | 0 | 0 |


In [29]:
#some incorrect predictions
lora_df[lora_df["label"] != lora_df["predicted_label"]].head()

| | text | label | predicted_label |
|---|---|---|---|
| 5 | im feeling and if ive liked being pregnant | 2 | 0 |
| 7 | i used to be able to hang around talk with the... | 3 | 0 |
| 9 | i vented my feelings towards the pathetic excu... | 0 | 1 |
| 11 | i get the feeling that this could be dangerous | 3 | 1 |
| 13 | i cant write a review for a book i adore unles... | 2 | 1 |


In [30]:
peft_f1 = f1_score(lora_df['label'], lora_df['predicted_label'], average='weighted')
peft_f1

0.45407357273697435

Original model: 'eval_accuracy': 0.06

PEFT model: 'eval_accuracy': 0.538

PEFT model is more accurate.

In [35]:
print('Original Model f1:', base_f1)
print('PEFT Model f1:', peft_f1)

if base_f1 > peft_f1:
    print('Original model is more accurate.')
elif base_f1 == peft_f1:
    print('Both models are equivalent.')
else:
    print('PEFT model is more accurate.')

Original Model f1: 0.006805293005671078
PEFT Model f1: 0.45407357273697435
PEFT model is more accurate.
