# Lightweight Fine-Tuning Project

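Choices made for this project:

* PEFT technique: LoRA (Low-Rank Adaptation), configured with the `peft` library
* Model: GPT-2 (`gpt2`), loaded with `GPT2ForSequenceClassification`
* Evaluation approach: accuracy on the SST-2 validation subset, computed with the Hugging Face `Trainer`
* Fine-tuning dataset: SST-2 (Stanford Sentiment Treebank), loaded from the GLUE collection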

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

For this project, we selected the **SST-2 dataset** from Hugging Face (https://huggingface.co/datasets/sst2), which contains **sentences from movie reviews with human sentiment annotations**, provided by the **Stanford Sentiment Treebank**. We use two subsets of this dataset:

- **Train subset**, with 67,349 samples, each a sequence labeled 0 (negative) or 1 (positive)

- **Validation subset**, with 872 samples, each a sequence labeled 0 (negative) or 1 (positive)


In [1]:
from transformers import GPT2ForSequenceClassification, GPT2Tokenizer
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
import pandas as pd



#### Load the dataset and obtain statistics

In [2]:
# Load the Stanford Sentiment Treebank dataset
# See: https://huggingface.co/datasets/sst2

# Define the splits we want to load (training and validation)
splits = ["train", "validation"]
# Load the SST-2 dataset splits using a dictionary comprehension.
# The 'load_dataset' function fetches the dataset from the Hugging Face Hub.
# 'glue' is the broader benchmark collection; 'sst2' is the sentiment analysis task within it.
# Iterating over the splits list loads both the training and validation sets.
dataset = {split: load_dataset("glue", "sst2", split=split) for split in splits}

In [3]:
def compute_statistics_subset(dataset, name_subset):
    """
    Print some statistics for a subset of a Hugging Face dataset
    :dataset: A Hugging Face dataset object
    :name_subset: A string with the name of the subset
    
    :returns: No return
    """
    # print number of samples in subset
    print('Number of samples of', name_subset,'subset:',dataset[name_subset].num_rows)
    # print the maximum sentence length (in characters) in the subset
    print('Max length of sentence in', name_subset, 'subset', max(len(sentence) for sentence in dataset[name_subset]['sentence']))
    # print the minimum sentence length (in characters) in the subset
    print('Min length of sentence in', name_subset, 'subset', min(len(sentence) for sentence in dataset[name_subset]['sentence']))
    # print labels in the subset
    print('Labels in', name_subset,':', set(dataset[name_subset]['label']))
    # print percentages of each label
    print('Percentages for each label in subset:')
    # compute frequencies for each label in the dataset
    frequencies = {x: dataset[name_subset]['label'].count(x) for x in set(dataset[name_subset]['label'])}
    # compute percentages
    percentages = {x: (count / dataset[name_subset].num_rows) * 100 for x, count in frequencies.items()}
    # loop over the keys in percentages and print values
    for key, value in percentages.items():
        print('- Label',key,':',round(value,2),'%')

In [4]:
# obtain statistics for the train subset
compute_statistics_subset(dataset=dataset, name_subset='train')

Number of samples of train subset: 67349
Max length of sentence in train subset 268
Min length of sentence in train subset 2
Labels in train : {0, 1}
Percentages for each label in subset:
- Label 0 : 44.22 %
- Label 1 : 55.78 %


In [5]:
# obtain statistics for the validation subset
compute_statistics_subset(dataset=dataset, name_subset='validation')

Number of samples of validation subset: 872
Max length of sentence in validation subset 244
Min length of sentence in validation subset 6
Labels in validation : {0, 1}
Percentages for each label in subset:
- Label 0 : 49.08 %
- Label 1 : 50.92 %
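
The per-label percentages above can also be computed in a single pass with `collections.Counter`. The sketch below is only illustrative: `label_percentages` is a hypothetical helper, and the toy label list stands in for a real column such as `dataset['train']['label']`.

```python
from collections import Counter


def label_percentages(labels):
    """Return {label: percentage} for a sequence of class labels."""
    counts = Counter(labels)  # single pass over the labels
    total = sum(counts.values())
    return {label: round(100 * count / total, 2)
            for label, count in sorted(counts.items())}


# Toy labels standing in for a real label column (assumption, not SST-2 data)
toy_labels = [0, 1, 1, 0, 1, 1, 1, 0]
print(label_percentages(toy_labels))
```

Counting once avoids calling `list.count` separately for every label, which matters little for two labels but scales better with many classes.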


#### Load the tokenizer and tokenize the dataset

In [6]:
# load the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [7]:
# GPT-2 has no dedicated pad token, so reuse the EOS (end-of-sequence) token for padding
tokenizer.pad_token = tokenizer.eos_token

In [8]:
def preprocess_function(examples):
    """Preprocess the SST-2 dataset by returning tokenized examples.
    :examples: a batch of examples (dict of lists) with a 'sentence' field
    
    :returns: the tokenizer output (input_ids and attention_mask)
    """
    # convert the text into token ids using the tokenizer, truncating the text
    # to the maximum length and padding shorter sequences to a uniform length
    # return the result
    return tokenizer(examples['sentence'], padding=True, truncation=True)

In [9]:
# Initialize an empty dictionary to store the tokenized datasets.
tokenized_ds = {}
# Iterate over each data split ('train' and 'validation').
for split in splits:
    # Apply the preprocess_function to the dataset corresponding to the current split.
    # The 'map' function applies the preprocess_function to each example in the dataset.
    # 'batched=True' allows processing multiple examples at once for efficiency.
    tokenized_ds[split] = dataset[split].map(preprocess_function, batched=True)

In [10]:
# print a sample of sentence and its tokenization in train subset
print(tokenized_ds["train"][0]['sentence'])
print(tokenized_ds["train"][0]["input_ids"])

hide new secretions from the parental units 
[24717, 649, 3200, 507, 422, 262, 21694, 4991, 220, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]


In [11]:
# print a sample of sentence and its tokenization in validation subset
print(tokenized_ds["validation"][0]['sentence'])
print(tokenized_ds["validation"][0]["input_ids"])

it 's a charming and often affecting journey . 
[270, 705, 82, 257, 23332, 290, 1690, 13891, 7002, 764, 220, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]
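
The runs of `50256` in the outputs above are padding: `50256` is GPT-2's end-of-text token id, which this notebook reuses as the pad token. Below is a minimal pure-Python sketch of what padding to the longest sequence in a batch does. The helper `pad_batch` is hypothetical and only mimics the idea; it is not the tokenizer's implementation.

```python
PAD_ID = 50256  # GPT-2 end-of-text id, reused as the pad token in this notebook


def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Non-padding token ids of the two example sentences printed above
batch = [
    [24717, 649, 3200, 507, 422, 262, 21694, 4991, 220],
    [270, 705, 82, 257, 23332, 290, 1690, 13891, 7002, 764, 220],
]
padded = pad_batch(batch)
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

The zeros in the attention mask mark the padded positions so that the model can ignore them.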


#### Load the model and freeze base parameters

In [12]:
# Load the pre-trained 'gpt2' model for sequence classification.
# This model is designed for tasks like sentiment analysis, where each sequence (such as a sentence)
# is classified into categories (such as positive/negative).
# Here we specify the number of labels (2 for sentiment classification) and the
# id2label and label2id mappings for the NEGATIVE and POSITIVE labels.
model = GPT2ForSequenceClassification.from_pretrained('gpt2',
                                                      num_labels=2,
                                                      id2label={0: "NEGATIVE", 1: "POSITIVE"},
                                                      label2id={"NEGATIVE": 0, "POSITIVE": 1})

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# Set the model's pad token id to match the tokenizer's pad token id
model.config.pad_token_id = tokenizer.pad_token_id

In [14]:
# Freeze all the parameters of the base model so that only the
# classification head ('model.score') is trained.
for param in model.base_model.parameters():
    # Setting 'requires_grad' to False disables gradient computation for this
    # parameter, so its weights are not updated during training.
    param.requires_grad = False

In [15]:
# check model architecture
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


In [16]:
# The 'model.score' is the classification head that will be trained to adapt 
# the base model for our specific task (sentiment analysis in this case).
model.score

Linear(in_features=768, out_features=2, bias=False)
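
With the base model frozen, this linear head holds the only trainable weights. A quick back-of-the-envelope count, using the shape printed above (the helper name `linear_param_count` is hypothetical):

```python
def linear_param_count(in_features, out_features, bias):
    """Number of parameters in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Shape taken from the printed layer: Linear(768 -> 2, bias=False)
head_params = linear_param_count(in_features=768, out_features=2, bias=False)
print(head_params)  # 1536
```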

#### Train the model using Trainer from HuggingFace

In [17]:
def compute_metrics(eval_pred):
    """
    Compute the accuracy metric
    :eval_pred: a tuple with predictions and labels
    
    :returns: a dictionary with the mean accuracy
    """
    predictions, labels = eval_pred
    # Convert the predictions to discrete labels by taking the argmax,
    # which is the index of the highest value in the prediction (logits).
    predictions = np.argmax(predictions, axis=1)
    # Calculate and return the accuracy as the mean of the instances where
    # predictions match the true labels.
    return {"accuracy": (predictions == labels).mean()}

In [18]:
# The HuggingFace Trainer class handles the training and eval loop for PyTorch for us.
# Initialize the Trainer, a high-level API for training transformer models.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis", # Directory where the model outputs will be saved.
        learning_rate=2e-5, # Learning rate for the optimizer.
        # Reduce the batch size if you don't have enough memory
        per_device_train_batch_size=16, # Batch size for training per device.
        per_device_eval_batch_size=16, # Batch size for evaluation per device.
        num_train_epochs=1, # Number of training epochs.
        weight_decay=0.01, # Weight decay for regularization.
        evaluation_strategy="epoch", # Evaluation is performed at the end of each epoch.
        save_strategy="epoch", # Model is saved at the end of each epoch.
        load_best_model_at_end=True, # Load the best model at the end of training.
    ),
    train_dataset=tokenized_ds["train"], # The tokenized training dataset.
    eval_dataset=tokenized_ds["validation"], # The tokenized evaluation dataset.
    tokenizer=tokenizer, # The tokenizer used for encoding the data.
    # Data collator that will dynamically pad the batches during training.
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics, # Function to compute metrics during evaluation.
)

In [19]:
# Start the training process.
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6456,0.641169,0.719037


TrainOutput(global_step=4210, training_loss=0.6930750975982594, metrics={'train_runtime': 291.336, 'train_samples_per_second': 231.173, 'train_steps_per_second': 14.451, 'total_flos': 2161131327553536.0, 'train_loss': 0.6930750975982594, 'epoch': 1.0})
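
The `global_step=4210` reported above is consistent with one epoch over the training set at batch size 16. The check below assumes a single device and no gradient accumulation; the notebook does not state the hardware, so treat `num_devices` as an assumption.

```python
import math

train_samples = 67349  # from the statistics cell
batch_size = 16        # per_device_train_batch_size
num_devices = 1        # assumption: single device, no gradient accumulation
epochs = 1             # num_train_epochs

steps_per_epoch = math.ceil(train_samples / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs
print(total_steps)  # 4210
```

For reference, `math.log(2)` is about 0.6931, the cross-entropy of a uniform guess over two classes, which is close to the reported mean training loss.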

#### Evaluate the model

In [20]:
# Show the performance of the model on the validation set
trainer.evaluate()

{'eval_loss': 0.6411694884300232,
 'eval_accuracy': 0.7190366972477065,
 'eval_runtime': 3.385,
 'eval_samples_per_second': 257.603,
 'eval_steps_per_second': 16.248,
 'epoch': 1.0}
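
To put this accuracy in context, it helps to compare it with a majority-class baseline, that is, always predicting the most frequent label. The sketch below uses only numbers printed earlier in this notebook; the variable names are illustrative.

```python
val_samples = 872                    # validation subset size
positive_share = 0.5092              # Label 1 share from the statistics cell
model_accuracy = 0.7190366972477065  # eval_accuracy reported above

# Accuracy obtained by always predicting the most frequent label
majority_baseline = max(positive_share, 1 - positive_share)
# Number of validation sentences the model classified correctly
correct = round(model_accuracy * val_samples)

print(f"majority-class baseline: {majority_baseline:.4f}")
print(f"correct predictions: {correct} of {val_samples}")
```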

#### View the results

In [21]:
# Convert the tokenized validation dataset to a pandas DataFrame
df = pd.DataFrame(tokenized_ds["validation"])
# Select only the 'sentence' and 'label' columns for simplicity
df = df[["sentence", "label"]]
# Use the trained model to make predictions on the validation dataset
predictions = trainer.predict(tokenized_ds["validation"])
# Convert the raw prediction logits to discrete labels (0 or 1 in our case)
# The argmax function is used to select the index (label) with the highest prediction score.
df["predicted_label"] = np.argmax(predictions[0], axis=1)
# Display the first ten rows of the DataFrame to check the data.
# This shows the actual and predicted labels alongside the sentences.
df.head(10)

index,sentence,label,predicted_label
0,it 's a charming and often affecting journey .,1,1
1,unflinchingly bleak and desperate,0,1
2,allows us to hope that nolan is poised to emba...,1,1
3,"the acting , costumes , music , cinematography...",1,1
4,"it 's slow -- very , very slow .",0,1
5,although laced with humor and a few fanciful t...,1,0
6,a sometimes tedious film .,0,1
7,or doing last year 's taxes with your ex-wife .,0,1
8,you do n't have to know about music to appreci...,1,1
9,"in exactly 89 minutes , most of which passed a...",0,1


#### Look at some incorrect predictions

In [22]:
# Set a pandas display option to show the full content of each column in the DataFrame.
pd.set_option("display.max_colwidth", None)
# Filter the DataFrame to only include rows where the model's predictions do not match the actual labels.
df[df["label"] != df["predicted_label"]].head(10)

index,sentence,label,predicted_label
1,unflinchingly bleak and desperate,0,1
4,"it 's slow -- very , very slow .",0,1
5,"although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women .",1,0
6,a sometimes tedious film .,0,1
7,or doing last year 's taxes with your ex-wife .,0,1
9,"in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .",0,1
13,"we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .",1,0
16,the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .,1,0
19,"in its best moments , resembles a bad high school production of grease , without benefit of song .",0,1
21,the iditarod lasts for days - this just felt like it did .,0,1
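
A small confusion count makes the direction of the errors easier to see. The sketch below uses only the ten rows from the `df.head(10)` output earlier in this section, so it illustrates the technique rather than summarizing the whole validation set.

```python
from collections import Counter

# Labels and predictions for the first ten validation rows shown earlier
labels      = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
predictions = [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

# Keys are (true label, predicted label) pairs
confusion = Counter(zip(labels, predictions))
for (true, pred), count in sorted(confusion.items()):
    print(f"true={true} predicted={pred}: {count}")
```

In this small sample, every negative sentence is predicted as positive, which hints that the classifier with a frozen base leans toward the positive class.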


## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [23]:
from peft import LoraConfig, get_peft_model

#### Load GPT-2 model

In [24]:
# Load the pre-trained 'gpt2' model for sequence classification.
# This model is designed for tasks like sentiment analysis, where each sequence (such as a sentence)
# is classified into categories (such as positive/negative).
# Here we specify the number of labels (2 for sentiment classification) and the
# id2label and label2id mappings for the NEGATIVE and POSITIVE labels.
model = GPT2ForSequenceClassification.from_pretrained('gpt2',
                                                      num_labels=2,
                                                      id2label={0: "NEGATIVE", 1: "POSITIVE"},
                                                      label2id={"NEGATIVE": 0, "POSITIVE": 1})

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [25]:
# Set the model's pad token id to match the tokenizer's pad token id
model.config.pad_token_id = tokenizer.pad_token_id

#### Configure LoRA using PEFT

In [26]:
# Create a PEFT config for LoRA using the library's default settings
config = LoraConfig()

In [27]:
# Convert the GPT-2 model into a PEFT model
lora_model = get_peft_model(model, config)



In [28]:
# print the number of trainable parameters in LoRA
lora_model.print_trainable_parameters()

trainable params: 294,912 || all params: 124,736,256 || trainable%: 0.236428452686603
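
The trainable-parameter count printed above can be reproduced by hand, assuming that the default `LoraConfig()` uses rank `r=8` and adapts only the `c_attn` projection in each GPT-2 block. These defaults are assumptions about the installed PEFT version, not something the notebook states.

```python
hidden = 768          # GPT-2 hidden size (from the model printout)
qkv_out = 3 * hidden  # c_attn projects to concatenated query, key and value
num_layers = 12       # number of GPT2Block modules (from the model printout)
rank = 8              # assumed LoRA rank

# Each adapted layer adds two matrices: A (rank x hidden) and B (qkv_out x rank)
per_layer = rank * hidden + qkv_out * rank
lora_params = num_layers * per_layer
print(lora_params)  # 294912

total_params = 124_736_256  # "all params" printed above
print(round(100 * lora_params / total_params, 4))
```

Under these assumptions the count matches the printed value exactly, which also suggests that the classification head `score` is not among the trainable parameters in this configuration.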


#### Train the model

In [29]:
# Rename 'label' to 'labels' to match the Trainer's expectation
tokenized_ds["train"] = tokenized_ds["train"].rename_column("label", "labels")
tokenized_ds["validation"] = tokenized_ds["validation"].rename_column("label", "labels")

In [30]:
tokenized_ds["train"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_ds["validation"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

In [31]:
# Initialize the Trainer
trainer = Trainer(
    model=lora_model,  # Make sure to pass the PEFT model here
    args=TrainingArguments(
        output_dir="./data/sentiment_analysis_lora",
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_dir='./logs',  # If you want to log metrics and/or losses during training
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=512),
    compute_metrics=compute_metrics,
)

In [32]:
# Start the training process
trainer.train()

IndexError: Invalid key: 65300 is out of bounds for size 0
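
This `IndexError` is commonly reported when `Trainer` drops every dataset column before training: by default it removes columns that do not match the argument names of the model's `forward` method, and a generic PEFT wrapper may not expose those names. A possible fix, sketched below as plain keyword arguments rather than a tested configuration, is to set `task_type="SEQ_CLS"` in `LoraConfig` (which should also keep the classification head trainable) and, if needed, `remove_unused_columns=False` and `label_names=["labels"]` in `TrainingArguments`. The dictionary names are hypothetical, and the choices are assumptions to verify against the installed `peft` and `transformers` versions.

```python
# Hypothetical keyword arguments, intended to be passed as
# LoraConfig(**lora_kwargs) and TrainingArguments(**training_kwargs).
lora_kwargs = {
    "task_type": "SEQ_CLS",  # use the sequence-classification PEFT wrapper
}

training_kwargs = {
    "output_dir": "./data/sentiment_analysis_lora",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 32,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    # Keep the dataset columns instead of filtering them by forward() signature;
    # set_format() above already restricts batches to the three model inputs.
    "remove_unused_columns": False,
    # Tell the Trainer which column holds the labels for evaluation metrics
    "label_names": ["labels"],
}

for name, value in training_kwargs.items():
    print(f"{name} = {value!r}")
```

Note that changing `task_type` would also change the trainable-parameter count printed earlier, so that cell should be rerun after any such change.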

#### Save the model

In [None]:
lora_model.save_pretrained("gpt2_lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.