# Lightweight Fine-Tuning Project

TODO: In this cell, describe your choices for each of the following

* PEFT technique: LoRA (Low-Rank Adaptation) was chosen because it fine-tunes a large language model by training only a small number of additional parameters, which cuts compute and memory costs while largely preserving performance. It does this by freezing the pre-trained weights and adding trainable low-rank update matrices to selected weight matrices (here, the `c_attn` and `c_proj` projections).
* Model: `GPT2ForSequenceClassification`, loaded through `AutoModelForSequenceClassification` from the `gpt2` checkpoint. This architecture adds a classification head to GPT-2, which fits the sentiment analysis task at hand. GPT-2 is a well-established architecture known for its effectiveness in various NLP tasks, including classification, and reusing its pre-trained weights lets the classifier build on the linguistic patterns learned during pre-training, potentially improving performance.
* Evaluation approach: The model is evaluated before and after fine-tuning with the Trainer's `evaluate()` method. Both runs use the same validation dataset and the same accuracy metric, so the two results are directly comparable and show the effect of fine-tuning objectively.
* Fine-tuning dataset: Stanford Sentiment Treebank (SST-2), a binary sentiment classification dataset whose two labels match the two-class classification head of the model.
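
To make the low-rank idea concrete, the following is a minimal NumPy sketch of a LoRA-style update, not the PEFT library's implementation. The layer size, the random stand-in weights, and the names `W`, `A`, `B`, and `lora_forward` are illustrative assumptions; the rank and scaling values mirror the `r=8` and `lora_alpha=32` used later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out = 768, 768   # assumed layer size (GPT-2 small hidden size)
r, lora_alpha = 8, 32    # rank and scaling factor, as in the LoRA config used later

W = rng.normal(size=(d_in, d_out))       # frozen pre-trained weight (stand-in values)
A = rng.normal(size=(d_in, r)) * 0.01    # trainable down-projection
B = np.zeros((r, d_out))                 # trainable up-projection, zero at the start

def lora_forward(x):
    """Frozen path plus the scaled low-rank update: x @ W + (alpha / r) * x @ A @ B."""
    return x @ W + (lora_alpha / r) * (x @ A @ B)

x = rng.normal(size=(4, d_in))           # a batch of 4 hypothetical activations

# Because B starts at zero, the adapted layer initially matches the frozen layer.
unchanged_at_start = np.allclose(lora_forward(x), x @ W)

full_params = W.size             # weights updated by full fine-tuning of this layer
lora_params = A.size + B.size    # weights updated by LoRA for this layer
print(unchanged_at_start, full_params, lora_params)
```

For this single layer, LoRA trains `r * (d_in + d_out)` weights instead of `d_in * d_out`, which is where the savings come from.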

## Loading and Evaluating a Foundation Model

TODO: In the cells below, load your chosen pre-trained Hugging Face model and evaluate its performance prior to fine-tuning. This step includes loading an appropriate tokenizer and dataset.

In [1]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import numpy as np
import pandas as pd



In [2]:
# Load the Stanford Sentiment Treebank dataset
# See: https://huggingface.co/datasets/sst2

# Define the splits we want to load (training and validation)
splits = ["train", "validation"]
# Load the SST-2 dataset splits using a dictionary comprehension.
# The 'load_dataset' function fetches the dataset from the Hugging Face Hub.
# 'glue' is the benchmark collection; 'sst2' is its sentiment analysis task.
# Iterate over the splits list to load both the training and validation sets.
dataset = {split: load_dataset("glue", "sst2", split=split) for split in splits}

Generating test split: 1821 examples
Generating train split: 67349 examples
Generating validation split: 872 examples

In [3]:
def compute_statistics_subset(dataset, name_subset):
    """
    Print summary statistics for one subset of a Hugging Face dataset.
    :dataset: A dictionary mapping split names to Hugging Face dataset objects
    :name_subset: A string with the name of the subset
    
    :returns: None
    """
    # print number of samples in subset
    print('Number of samples of', name_subset,'subset:',dataset[name_subset].num_rows)
    # print maximum sentence length (in characters, not tokens) in the subset
    print('Max length of sentence in', name_subset, 'subset', max(len(sentence) for sentence in dataset[name_subset]['sentence']))
    # print minimum sentence length (in characters, not tokens) in the subset
    print('Min length of sentence in', name_subset, 'subset', min(len(sentence) for sentence in dataset[name_subset]['sentence']))
    # print labels in the subset
    print('Labels in', name_subset,':', set(dataset[name_subset]['label']))
    # print percentages of each label
    print('Percentages for each label in subset:')
    # compute frequencies for each label in the dataset
    frequencies = {x: dataset[name_subset]['label'].count(x) for x in set(dataset[name_subset]['label'])}
    # compute percentages
    percentages = {x: (count / dataset[name_subset].num_rows) * 100 for x, count in frequencies.items()}
    # loop over the keys in percentages and print values
    for key, value in percentages.items():
        print('- Label',key,':',round(value,2),'%')

In [4]:
# obtain statistics for the train subset
compute_statistics_subset(dataset=dataset, name_subset='train')

Number of samples of train subset: 67349
Max length of sentence in train subset 268
Min length of sentence in train subset 2
Labels in train : {0, 1}
Percentages for each label in subset:
- Label 0 : 44.22 %
- Label 1 : 55.78 %


In [5]:
# obtain statistics for the validation subset
compute_statistics_subset(dataset=dataset, name_subset='validation')

Number of samples of validation subset: 872
Max length of sentence in validation subset 244
Min length of sentence in validation subset 6
Labels in validation : {0, 1}
Percentages for each label in subset:
- Label 0 : 49.08 %
- Label 1 : 50.92 %
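
The label percentages above can also be computed in a single pass with `collections.Counter`. The sketch below is a hedged alternative to the counting logic in `compute_statistics_subset`; the helper name `label_percentages` and the toy label list are hypothetical and stand in for `dataset[name_subset]['label']`.

```python
from collections import Counter

def label_percentages(labels):
    """Return {label: percentage of samples}, rounded to 2 decimals."""
    counts = Counter(labels)   # one pass over the labels
    total = len(labels)
    return {label: round(100 * count / total, 2) for label, count in sorted(counts.items())}

# Toy labels for illustration only (not taken from SST-2).
toy_labels = [0, 1, 1, 0, 1, 1, 1, 0]
result = label_percentages(toy_labels)
print(result)
```

Counting once avoids rescanning the full label list for every distinct label, which matters more as the number of labels grows.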


### Load Tokenizer and tokenize the dataset

In [6]:
# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')



In [7]:
# GPT-2 has no padding token, so reuse the EOS (end-of-sequence) token as the pad token
tokenizer.pad_token = tokenizer.eos_token

In [8]:
def preprocess_function(examples):
    """Preprocess the SST-2 dataset by returning tokenized examples.
    :examples: A batch of examples with a 'sentence' field
    
    :returns: The tokenizer output (input_ids and attention_mask) for the batch
    """
    # Convert the sentences to token ids with the tokenizer, truncating any text
    # that exceeds the maximum length and padding shorter sequences to the
    # longest sequence in the batch.
    # Return the result.
    return tokenizer(examples['sentence'], padding=True, truncation=True)

In [9]:
# Initialize an empty dictionary to store the tokenized datasets.
tokenized_ds = {}
# Iterate over each data split ('train' and 'validation').
for split in splits:
    # Apply the preprocess_function to the dataset corresponding to the current split.
    # The 'map' function applies the preprocess_function to each example in the dataset.
    # 'batched=True' allows processing multiple examples at once for efficiency.
    tokenized_ds[split] = dataset[split].map(preprocess_function, batched=True)



In [10]:
# print a sample of sentence and its tokenization in train subset
print(tokenized_ds["train"][0]['sentence'])
print(tokenized_ds["train"][0]["input_ids"])

hide new secretions from the parental units 
[24717, 649, 3200, 507, 422, 262, 21694, 4991, 220, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]


In [11]:
# print a sample of sentence and its tokenization in validation subset
print(tokenized_ds["validation"][0]['sentence'])
print(tokenized_ds["validation"][0]["input_ids"])

it 's a charming and often affecting journey . 
[270, 705, 82, 257, 23332, 290, 1690, 13891, 7002, 764, 220, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256]
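
The trailing run of `50256` values in both outputs is padding: `50256` is GPT-2's end-of-sequence token id, which was assigned as the pad token above. The following pure-Python sketch shows, under simplifying assumptions, what right-padding a batch to its longest sequence looks like and how the matching attention mask is built. The helper `pad_batch` is hypothetical and only mimics the tokenizer's behaviour; the token ids are the unpadded prefixes of the two samples printed above.

```python
PAD_ID = 50256  # GPT-2's <|endoftext|> id, reused as the pad token in this notebook

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

train_sample = [24717, 649, 3200, 507, 422, 262, 21694, 4991, 220]
validation_sample = [270, 705, 82, 257, 23332, 290, 1690, 13891, 7002, 764, 220]

batch = pad_batch([train_sample, validation_sample])
print(batch["input_ids"][0])
print(batch["attention_mask"][0])
```

In the notebook the padded sequences are longer because `padding=True` inside a batched `map` pads to the longest sentence in each batch of examples that `map` processes.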


### Load model and freeze base parameters

In [12]:
# Load the pre-trained 'gpt2' model for sequence classification.
# This model is designed for tasks like sentiment analysis, where each sequence
# (such as a sentence) is classified into categories (such as positive/negative).
# Here, we specify the number of labels (2 for sentiment classification) and the
# id2label and label2id mappings for the NEGATIVE and POSITIVE labels.
model = AutoModelForSequenceClassification.from_pretrained('gpt2',
                                                      num_labels=2,
                                                      id2label={0: "NEGATIVE", 1: "POSITIVE"},
                                                      label2id={"NEGATIVE": 0, "POSITIVE": 1})

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# Set the model's pad token id to match the tokenizer's pad token id
model.config.pad_token_id = tokenizer.pad_token_id
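
Setting `pad_token_id` matters because a GPT-2 sequence classifier reads its prediction from the last non-padding position of each sequence. The sketch below illustrates that lookup in plain Python under the assumption of right-padded inputs; `last_token_index` is a hypothetical helper, not the library's code, and the example sequence is the validation sample shown earlier with three pad tokens appended.

```python
PAD_ID = 50256  # pad token id set on the model config in this notebook

def last_token_index(input_ids, pad_id=PAD_ID):
    """Index of the last non-padding token in a right-padded sequence."""
    for position, token_id in enumerate(input_ids):
        if token_id == pad_id:
            # Everything from here on is padding, so step back one position.
            return position - 1
    return len(input_ids) - 1  # no padding present: use the final token

padded = [270, 705, 82, 257, 23332, 290, 1690, 13891, 7002, 764, 220, 50256, 50256, 50256]
index = last_token_index(padded)
print(index, padded[index])
```

Without a pad token id, the classifier could not tell real tokens from padding when sequences in a batch have different lengths.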

In [14]:
# Freeze all the parameters of the base "gpt2" model.
for param in model.base_model.parameters():
    # Setting 'requires_grad' to False disables gradient calculation for the
    # parameter, so its weights are not updated during training.
    param.requires_grad = False

In [15]:
# check model architecture
print(model)

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)


In [16]:
# The 'model.score' is the classification head that will be trained to adapt 
# the base model for our specific task (sentiment analysis in this case).
model.score

Linear(in_features=768, out_features=2, bias=False)

In [17]:
def compute_metrics(eval_pred):
    """
    Compute the accuracy metric.
    :eval_pred: a tuple with predictions (logits) and labels
    
    :returns: a dictionary with the mean accuracy
    """
    predictions, labels = eval_pred
    # Convert the predictions to discrete labels by taking the argmax,
    # which is the index of the highest value in the prediction (logits).
    predictions = np.argmax(predictions, axis=1)
    # Calculate and return the accuracy as the mean of the instances where
    # predictions match the true labels.
    return {"accuracy": (predictions == labels).mean()}
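
As a quick sanity check of `compute_metrics`, the sketch below calls it on a handful of made-up logits and labels. The function body is repeated so the example runs on its own; the numbers are illustrative assumptions, not model outputs.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return the accuracy for a (logits, labels) tuple."""
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)   # highest logit -> predicted class
    return {"accuracy": (predictions == labels).mean()}

# Four hypothetical examples with two logits each (NEGATIVE, POSITIVE).
logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
labels = np.array([0, 1, 1, 1])

metrics = compute_metrics((logits, labels))
print(metrics)
```

Three of the four argmax predictions match the labels here, so the function should report an accuracy of 0.75.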

In [18]:
# The Hugging Face Trainer class handles the training and evaluation loops in PyTorch for us.
# Define the training arguments, then initialize the Trainer, a high-level API for training transformer models.
training_args = TrainingArguments(
    output_dir="./model_output", # Directory where the model outputs will be saved.
    learning_rate=2e-5, # Learning rate for the optimizer.
    # Reduce the batch size if you don't have enough memory
    per_device_train_batch_size=16, # Batch size for training per device.
    per_device_eval_batch_size=16, # Batch size for evaluation per device.
    num_train_epochs=1, # Number of training epochs.
    weight_decay=0.01, # Weight decay for regularization.
    evaluation_strategy="epoch", # Evaluation is performed at the end of each epoch.
    save_strategy="epoch", # Model is saved at the end of each epoch.
    load_best_model_at_end=True, # Load the best model at the end of training.
)

pretrain_trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_ds["train"], # The tokenized training dataset.
    eval_dataset=tokenized_ds["validation"], # The tokenized evaluation dataset.
    tokenizer=tokenizer, # The tokenizer used for encoding the data.
    # Data collator that will dynamically pad the batches during training.
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics, # Function to compute metrics during evaluation.
)

In [19]:
# Evaluate the model on the validation set before fine-tuning
pretrain_results = pretrain_trainer.evaluate()

# Print the evaluation results before fine-tuning
print("Evaluation results before fine-tuning:", pretrain_results)

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Evaluation results before fine-tuning: {'eval_loss': 1.800602912902832, 'eval_accuracy': 0.4908256880733945, 'eval_runtime': 3.7146, 'eval_samples_per_second': 234.751, 'eval_steps_per_second': 14.807}


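In the cell output above, the model achieves an evaluation accuracy of about 0.49, which is no better than flipping a coin. This is expected, because the classification head (`score.weight`) is newly initialized and has not been trained yet. Let's fine-tune the model and see how much this result improves.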
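
One way to read the 0.49 figure: it coincides with the share of label 0 in the validation subset (49.08 %), which is what a classifier that always predicts label 0 would score. The sketch below checks that arithmetic. The per-label counts are derived here from the reported percentage and the 872 validation samples, so treat them as an assumption; the notebook does not show the model's actual predictions.

```python
total = 872                    # validation samples reported above
share_label_0 = 49.08 / 100    # reported percentage of label 0

# Assumed counts, recovered by rounding the reported percentage.
count_label_0 = round(total * share_label_0)
count_label_1 = total - count_label_0

always_0_accuracy = count_label_0 / total   # constant "NEGATIVE" classifier
always_1_accuracy = count_label_1 / total   # constant "POSITIVE" classifier

reported_accuracy = 0.4908256880733945      # eval_accuracy before fine-tuning
print(count_label_0, count_label_1)
print(round(always_0_accuracy, 4), round(always_1_accuracy, 4))
print(abs(always_0_accuracy - reported_accuracy) < 1e-12)
```

If the match is not a coincidence, the untrained head is effectively a constant predictor, which gives a useful floor for judging the fine-tuned model.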

## Performing Parameter-Efficient Fine-Tuning

TODO: In the cells below, create a PEFT model from your loaded model, run a training loop, and save the PEFT model weights.

In [20]:
from peft import LoraConfig, get_peft_model, TaskType

In [21]:
# Reload the pre-trained 'gpt2' model for sequence classification so that
# fine-tuning starts from a fresh copy of the model.
# As before, we specify the number of labels (2 for sentiment classification) and
# the id2label and label2id mappings for the NEGATIVE and POSITIVE labels.
model = AutoModelForSequenceClassification.from_pretrained('gpt2',
                                                      num_labels=2,
                                                      id2label={0: "NEGATIVE", 1: "POSITIVE"},
                                                      label2id={"NEGATIVE": 0, "POSITIVE": 1})

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [22]:
# Set the model's pad token id to match the tokenizer's pad token id
model.config.pad_token_id = tokenizer.pad_token_id

In [23]:
# Create a PEFT Config for LoRA
config = LoraConfig(
                    r=8, # Rank
                    lora_alpha=32,
                    target_modules=['c_attn', 'c_proj'],
                    lora_dropout=0.1,
                    bias="none",
                    task_type=TaskType.SEQ_CLS
                )

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()



trainable params: 814,080 || all params: 125,253,888 || trainable%: 0.6499438963523432
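
The trainable-parameter figure can be approximately reproduced by hand. Each LoRA adapter on a layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below assumes the standard GPT-2 small shapes (12 blocks, hidden size 768, `c_attn` mapping 768 to 2304, attention `c_proj` 768 to 768, MLP `c_proj` 3072 to 768), which the printed architecture does not show, and assumes that the `c_proj` target matches both the attention and the MLP projection.

```python
r = 8             # LoRA rank from the config above
num_blocks = 12   # GPT-2 small has 12 transformer blocks

# (d_in, d_out) of every targeted layer inside one block (assumed GPT-2 small shapes).
targeted_layers = {
    "attn.c_attn": (768, 2304),
    "attn.c_proj": (768, 768),
    "mlp.c_proj": (3072, 768),
}

per_block = sum(r * (d_in + d_out) for d_in, d_out in targeted_layers.values())
lora_params = per_block * num_blocks

reported_trainable = 814_080
reported_total = 125_253_888

print(per_block, lora_params)
print(reported_trainable - lora_params)
print(100 * reported_trainable / reported_total)
```

Under these assumptions the adapters account for 811,008 of the 814,080 reported parameters. The remaining 3,072 presumably belong to the classification head, which PEFT keeps trainable for `TaskType.SEQ_CLS`; that attribution is an assumption, not something the output states.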


In [24]:
# Rename 'label' to 'labels' to match the Trainer's expectation
tokenized_ds["train"] = tokenized_ds["train"].map(lambda e: {'labels': e['label']}, batched=True, remove_columns=['label'])
tokenized_ds["validation"] = tokenized_ds["validation"].map(lambda e: {'labels': e['label']}, batched=True, remove_columns=['label'])



In [25]:
tokenized_ds["train"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_ds["validation"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

In [26]:
# Initialize the Trainer
trainer = Trainer(
    model=peft_model,  # Make sure to pass the PEFT model here
    args=TrainingArguments(
        output_dir="./lora_model_output",
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=2,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
        logging_dir='./logs',  # If you want to log metrics and/or losses during training
    ),
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer, padding=True, max_length=512),
    compute_metrics=compute_metrics,
)

In [27]:
# Start the training process
trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.3708        | 0.299321        | 0.879587 |
| 2     | 0.3483        | 0.29056         | 0.891055 |




TrainOutput(global_step=4210, training_loss=0.47350006284736396, metrics={'train_runtime': 1361.6448, 'train_samples_per_second': 98.923, 'train_steps_per_second': 3.092, 'total_flos': 4437101443461120.0, 'train_loss': 0.47350006284736396, 'epoch': 2.0})
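
The `global_step=4210` value follows from the dataset size and the batch size. The sketch below reproduces it, assuming a single device and no gradient accumulation (neither is stated in the notebook).

```python
import math

train_samples = 67349   # size of the train subset
batch_size = 32         # per_device_train_batch_size
epochs = 2              # num_train_epochs

steps_per_epoch = math.ceil(train_samples / batch_size)   # the last batch is partial
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

The same arithmetic helps when estimating how long a run will take before changing the batch size or the number of epochs.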

In [28]:
# Save the fine-tuned PEFT model (adapter weights)
peft_model.save_pretrained("gpt-lora")

## Performing Inference with a PEFT Model

TODO: In the cells below, load the saved PEFT model weights and evaluate the performance of the trained PEFT model. Be sure to compare the results to the results from prior to fine-tuning.

In [29]:
import torch
from peft import AutoPeftModelForSequenceClassification

NUM_LABELS = 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

lora_model = AutoPeftModelForSequenceClassification.from_pretrained("gpt-lora", num_labels=NUM_LABELS, ignore_mismatched_sizes=True).to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [30]:
# Set the model's pad token id to match the tokenizer's pad token id
lora_model.config.pad_token_id = tokenizer.pad_token_id

In [31]:
# Reuse the Hugging Face Trainer API to evaluate the fine-tuned model.
# Only evaluate() is called on this Trainer; no further training is run here.
training_args = TrainingArguments(
    output_dir="./data/sentiment_analysis", # Directory where the model outputs will be saved.
    learning_rate=2e-5, # Learning rate for the optimizer.
    # Reduce the batch size if you don't have enough memory
    per_device_train_batch_size=16, # Batch size for training per device.
    per_device_eval_batch_size=16, # Batch size for evaluation per device.
    num_train_epochs=1, # Number of training epochs.
    weight_decay=0.01, # Weight decay for regularization.
    evaluation_strategy="epoch", # Evaluation is performed at the end of each epoch.
    save_strategy="epoch", # Model is saved at the end of each epoch.
    load_best_model_at_end=True, # Load the best model at the end of training.
)

finetuned_trainer = Trainer(
    model=lora_model,  # The fine-tuned PEFT model.
    args=training_args,# Training arguments, defined above.
    train_dataset=tokenized_ds["train"], # The tokenized training dataset.
    eval_dataset=tokenized_ds["validation"], # The tokenized evaluation dataset.
    tokenizer=tokenizer, # The tokenizer used for encoding the data.
    # Data collator that will dynamically pad the batches during training.
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics, # Function to compute metrics during evaluation.
)

In [32]:
# Evaluate the fine-tuned model on the validation set
finetuned_results = finetuned_trainer.evaluate()

# Print the evaluation results for the fine-tuned model
print("Evaluation results for the fine-tuned model:", finetuned_results)

Evaluation results for the fine-tuned model: {'eval_loss': 0.29056042432785034, 'eval_accuracy': 0.8910550458715596, 'eval_runtime': 3.6751, 'eval_samples_per_second': 237.272, 'eval_steps_per_second': 14.966}


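In the cell output above, the fine-tuned model achieves an evaluation accuracy of about 0.89, up from about 0.49 before fine-tuning. This is a strong result for only 2 epochs of LoRA fine-tuning, in which about 0.65% of the parameters were trainable.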
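
The before/after comparison can be summarized numerically. The sketch below uses only the two `evaluate()` outputs reported in this notebook; converting accuracy to a count of correct predictions assumes all 872 validation samples were scored.

```python
validation_samples = 872

before = {"eval_loss": 1.800602912902832, "eval_accuracy": 0.4908256880733945}
after = {"eval_loss": 0.29056042432785034, "eval_accuracy": 0.8910550458715596}

accuracy_gain = after["eval_accuracy"] - before["eval_accuracy"]
loss_drop = before["eval_loss"] - after["eval_loss"]

correct_before = round(before["eval_accuracy"] * validation_samples)
correct_after = round(after["eval_accuracy"] * validation_samples)

print(f"accuracy: {before['eval_accuracy']:.4f} -> {after['eval_accuracy']:.4f} (+{accuracy_gain:.4f})")
print(f"loss:     {before['eval_loss']:.4f} -> {after['eval_loss']:.4f} (-{loss_drop:.4f})")
print(f"correct:  {correct_before} -> {correct_after} of {validation_samples}")
```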