# Lightweight Fine-Tuning Project

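The choices made for this project:

* PEFT technique: LoRA
* Model: gpt2
* Evaluation approach: accuracy on the test split, computed with `Trainer.evaluate` before and after fine-tuning
* Fine-tuning dataset: cornell-movie-review-data/rotten_tomatoes (8,530 training and 1,066 test reviews with binary sentiment labels)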

## Loading and Evaluating a Foundation Model

The cells below load the pre-trained gpt2 model, its tokenizer, and the Rotten Tomatoes dataset, then evaluate the model on the test split prior to fine-tuning.

In [34]:
!pip list | grep "peft" || pip install peft

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


peft                      0.5.0


In [35]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
import numpy as np
import pandas as pd

In [36]:
splits = ["train", "test"]
dataset = {split: load_dataset("cornell-movie-review-data/rotten_tomatoes", split=split) for split in splits}

In [37]:
dataset

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 8530
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 1066
 })}

In [38]:
import numpy as np
from collections import Counter

def compute_stats(dataset, subset_name):
    """
    Compute and print statistics for a subset of a HuggingFace dataset.
    
    :param dataset: A Hugging Face dataset object.
    :param subset_name: A string with the name of the subset.
    :returns: A dictionary containing the computed statistics.
    """
    try:
        subset = dataset[subset_name]
    except KeyError:
        raise ValueError(f"Subset '{subset_name}' not found in the dataset.")

    if 'text' not in subset.features or 'label' not in subset.features:
        raise ValueError("Dataset must contain 'text' and 'label' columns.")

    sentences = subset['text']
    labels = subset['label']
    
    # Number of samples in subset
    num_samples = len(sentences)
    
    # Sentence length statistics (lengths are counted in characters, not tokens)
    sentence_lengths = np.array([len(sentence) for sentence in sentences])
    
    # Labels statistics
    label_counts = Counter(labels)
    unique_labels = list(label_counts.keys())
    label_percentages = {label: (count / num_samples) * 100 for label, count in label_counts.items()}
    
    # Compute statistics
    stats = {
        'subset_name': subset_name,
        'num_samples': num_samples,
        'max_length': int(np.max(sentence_lengths)),
        'min_length': int(np.min(sentence_lengths)),
        'mean_length': float(np.mean(sentence_lengths)),
        'median_length': float(np.median(sentence_lengths)),
        'unique_labels': unique_labels,
        'label_counts': dict(label_counts),
        'label_percentages': label_percentages
    }
    
    # Print statistics
    print(f"Statistics for {subset_name} subset:")
    print(f"Number of samples: {stats['num_samples']}")
    print(f"Sentence length - Max: {stats['max_length']}, Min: {stats['min_length']}, "
          f"Mean: {stats['mean_length']:.2f}, Median: {stats['median_length']:.2f}")
    print(f"Labels: {stats['unique_labels']}")
    print("Label distribution:")
    for label, percentage in stats['label_percentages'].items():
        print(f"- Label {label}: {percentage:.2f}% ({stats['label_counts'][label]} samples)")
    
    return stats

In [39]:
compute_stats(dataset, "train")

Statistics for train subset:
Number of samples: 8530
Sentence length - Max: 267, Min: 4, Mean: 113.97, Median: 111.00
Labels: [1, 0]
Label distribution:
- Label 1: 50.00% (4265 samples)
- Label 0: 50.00% (4265 samples)


{'subset_name': 'train',
 'num_samples': 8530,
 'max_length': 267,
 'min_length': 4,
 'mean_length': 113.97162954279015,
 'median_length': 111.0,
 'unique_labels': [1, 0],
 'label_counts': {1: 4265, 0: 4265},
 'label_percentages': {1: 50.0, 0: 50.0}}

In [40]:
compute_stats(dataset, "test")

Statistics for test subset:
Number of samples: 1066
Sentence length - Max: 261, Min: 14, Mean: 115.52, Median: 113.00
Labels: [1, 0]
Label distribution:
- Label 1: 50.00% (533 samples)
- Label 0: 50.00% (533 samples)


{'subset_name': 'test',
 'num_samples': 1066,
 'max_length': 261,
 'min_length': 14,
 'mean_length': 115.5234521575985,
 'median_length': 113.0,
 'unique_labels': [1, 0],
 'label_counts': {1: 533, 0: 533},
 'label_percentages': {1: 50.0, 0: 50.0}}

In [41]:

tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [42]:
tokenizer.pad_token = tokenizer.eos_token

In [43]:
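def preprocess(examples):
    # Tokenize a batch of reviews, padding to the longest one in the batch
    # and truncating anything beyond the model's maximum input length.
    return tokenizer(examples["text"], padding=True, truncation=True)

tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(preprocess, batched=True)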


In [44]:
base_model = AutoModelForSequenceClassification.from_pretrained('gpt2',
                                                      num_labels=2,
                                                      id2label={0: "NEGATIVE", 1: "POSITIVE"},
                                                      label2id={"NEGATIVE": 0, "POSITIVE": 1})

base_model.config.pad_token_id = tokenizer.pad_token_id

for param in base_model.base_model.parameters():
    param.requires_grad = False

base_model

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

In [45]:
base_model.score

Linear(in_features=768, out_features=2, bias=False)

In [46]:
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}
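
The metric function can be checked in isolation before it is handed to the `Trainer`. The sketch below repeats the function from the cell above and applies it to a handful of made-up logits; `toy_logits` and `toy_labels` are hypothetical values chosen for illustration, not data from the notebook.

```python
import numpy as np

def compute_metrics(eval_prediction):
    # Same logic as the notebook cell: argmax over the class axis, then mean accuracy.
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# Hypothetical logits for four reviews (columns: NEGATIVE, POSITIVE).
toy_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
toy_labels = np.array([0, 1, 1, 1])

# Predicted classes are [0, 1, 1, 0], so three of the four labels match.
toy_result = compute_metrics((toy_logits, toy_labels))
print(toy_result)
```

The last toy review is misclassified on purpose, so the function should report an accuracy of 0.75 rather than a trivially perfect score.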

In [47]:
training_args = TrainingArguments(
    output_dir="./model_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

pretrain_trainer = Trainer(
    model=base_model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

base_model_evaluation = pretrain_trainer.evaluate()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [48]:
print("Evaluation result before fine-tuning:")
base_model_evaluation

Evaluation result before fine-tuning:


{'eval_loss': 1.9285240173339844,
 'eval_accuracy': 0.5,
 'eval_runtime': 4.1019,
 'eval_samples_per_second': 259.876,
 'eval_steps_per_second': 16.334}
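
An accuracy of exactly 0.5 on a perfectly balanced test split is what a classifier obtains when it assigns the same label to every review, which is one plausible reading of this result given that the `score` head is newly initialized. The sketch below illustrates that reading with the label counts reported earlier; the constant prediction is an assumption for illustration, since the notebook does not inspect the individual predictions.

```python
import math
import numpy as np

# Label counts from the test-split statistics: 533 positive and 533 negative reviews.
labels = np.array([1] * 533 + [0] * 533)

# Assumption for illustration: the untrained head assigns every review to class 1.
constant_predictions = np.ones_like(labels)
constant_accuracy = float((constant_predictions == labels).mean())

# Cross-entropy of a uniform two-class prediction, the "no information" reference point.
uniform_loss = math.log(2)

reported_loss = 1.9285240173339844  # eval_loss copied from the notebook output

print(f"constant-prediction accuracy: {constant_accuracy:.3f}")
print(f"uniform-prediction loss:      {uniform_loss:.4f}")
print(f"reported loss above uniform:  {reported_loss > uniform_loss}")
```

A reported loss well above ln 2 suggests the untrained head is confidently wrong on many examples rather than merely uncertain.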

## Performing Parameter-Efficient Fine-Tuning

The cells below create a PEFT model from a freshly loaded gpt2 classifier, train it for one epoch, and save the PEFT model weights.

In [49]:
from peft import get_peft_model, LoraConfig, TaskType

In [50]:
model = AutoModelForSequenceClassification.from_pretrained('gpt2',
                                                      num_labels=2,
                                                      id2label={0: "NEGATIVE", 1: "POSITIVE"},
                                                      label2id={"NEGATIVE": 0, "POSITIVE": 1})

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [51]:
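config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=32,                        # the update is scaled by lora_alpha / r
    target_modules=['c_attn', 'c_proj'],  # 'c_proj' matches both the attention and the MLP projection
    lora_dropout=0.1,
    bias="none",
    task_type=TaskType.SEQ_CLS
)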

peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()

trainable params: 814,080 || all params: 125,253,888 || trainable%: 0.6499438963523432
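
The reported count can be reproduced by hand. The sketch below uses the standard GPT-2 small layer shapes and the `LoraConfig` values above. The double counting of the classification head is an inference from the reported totals (it matches the printed number exactly), not something the notebook or this PEFT version's output states explicitly.

```python
# GPT-2 small dimensions, consistent with the model printout above.
hidden = 768
num_layers = 12
rank = 8  # r in LoraConfig

# (in_features, out_features) of each targeted Conv1D layer inside one GPT2Block.
targeted_layers = {
    "attn.c_attn": (hidden, 3 * hidden),  # fused query/key/value projection
    "attn.c_proj": (hidden, hidden),      # attention output projection
    "mlp.c_proj": (4 * hidden, hidden),   # MLP output projection
}

# Each adapted layer gains two matrices: A (rank x in) and B (out x rank).
lora_per_block = sum(rank * (n_in + n_out) for n_in, n_out in targeted_layers.values())
lora_total = num_layers * lora_per_block

# Classification head: Linear(768 -> 2, bias=False).
score_params = hidden * 2

# Assumption: the head is counted twice (the original module plus the
# trainable copy PEFT keeps for saving); this reproduces the printed total.
trainable = lora_total + 2 * score_params
all_params = 125_253_888  # copied from the notebook output

print(f"LoRA parameters: {lora_total:,}")
print(f"trainable params: {trainable:,} ({100 * trainable / all_params:.4f}% of all params)")
```

Under these assumptions the adapters account for 811,008 parameters and the head for the remaining 3,072, so well under 1% of the model is updated during training.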


In [52]:
model.config.pad_token_id = tokenizer.pad_token_id

In [53]:
# Rename 'label' column to 'labels'
tokenized_dataset["train"] = tokenized_dataset["train"].map(lambda e: {'labels': e['label']}, batched=True, remove_columns=['label'])
tokenized_dataset["test"] = tokenized_dataset["test"].map(lambda e: {'labels': e['label']}, batched=True, remove_columns=['label'])

# Print to verify
print(tokenized_dataset["train"][0]["text"])
print(tokenized_dataset["train"][0]["labels"])

# Set the format for PyTorch
tokenized_dataset["train"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_dataset["test"].set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])


the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
1


In [54]:
trainer_peft = Trainer(
    model=peft_model,
    args=TrainingArguments(
        output_dir="./lora_model_output",
        learning_rate=2e-5,
        logging_strategy="steps",
        per_device_train_batch_size=12,
        per_device_eval_batch_size=12,
        num_train_epochs=1,
        weight_decay=0.01,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [55]:
trainer_peft.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.6669,0.595125,0.733583


TrainOutput(global_step=711, training_loss=0.6522077085599618, metrics={'train_runtime': 113.829, 'train_samples_per_second': 74.937, 'train_steps_per_second': 6.246, 'total_flos': 330172861962240.0, 'train_loss': 0.6522077085599618, 'epoch': 1.0})
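
The step count and throughput figures in `TrainOutput` follow from the dataset size and batch size. The sketch below reproduces them, assuming a single device and no gradient accumulation (neither is stated in the notebook, but both are consistent with the reported numbers).

```python
import math

train_samples = 8530     # rows in the train split
batch_size = 12          # per_device_train_batch_size
epochs = 1               # num_train_epochs
train_runtime = 113.829  # seconds, copied from TrainOutput

# Assumption: one device and no gradient accumulation, so each optimizer
# step consumes exactly one batch (the last batch is smaller).
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

samples_per_second = train_samples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(f"total steps:        {total_steps}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"steps per second:   {steps_per_second:.3f}")
```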

In [56]:
peft_model.save_pretrained("gpt-lora")

## Performing Inference with a PEFT Model

The cells below load the saved PEFT model weights and evaluate the trained PEFT model on the test split, so that the results can be compared with those obtained prior to fine-tuning.

In [57]:
import torch
from peft import AutoPeftModelForSequenceClassification

NUM_LABELS = 2
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

lora_model = AutoPeftModelForSequenceClassification.from_pretrained("gpt-lora", num_labels=NUM_LABELS, ignore_mismatched_sizes=True).to(device)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [58]:
lora_model.config.pad_token_id = tokenizer.pad_token_id

In [59]:
training_args = TrainingArguments(
    output_dir="./data/sentiment_analysis",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=1,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

finetuned_trainer = Trainer(
    model=lora_model,  # The fine-tuned PEFT model.
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
)

In [60]:
# Evaluate the fine-tuned model on the test split
finetuned_results = finetuned_trainer.evaluate()

# Print the evaluation results for the fine-tuned model
print("Evaluation results for the fine-tuned model:")
finetuned_results

Evaluation results for the fine-tuned model:


{'eval_loss': 0.5951250195503235,
 'eval_accuracy': 0.7335834896810507,
 'eval_runtime': 4.5234,
 'eval_samples_per_second': 235.664,
 'eval_steps_per_second': 14.812}
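
To compare the two runs directly, the metrics can be placed side by side. The sketch below copies the loss and accuracy values from the two `evaluate()` outputs in this notebook, so it runs without the models; inside the notebook the `base_model_evaluation` and `finetuned_results` dictionaries could be used as they are.

```python
# Metrics copied from the two evaluate() outputs in this notebook.
base_model_evaluation = {"eval_loss": 1.9285240173339844, "eval_accuracy": 0.5}
finetuned_results = {"eval_loss": 0.5951250195503235, "eval_accuracy": 0.7335834896810507}

print(f"{'metric':<15}{'base':>10}{'LoRA':>10}{'change':>10}")
deltas = {}
for key in ("eval_loss", "eval_accuracy"):
    deltas[key] = finetuned_results[key] - base_model_evaluation[key]
    print(f"{key:<15}{base_model_evaluation[key]:>10.4f}"
          f"{finetuned_results[key]:>10.4f}{deltas[key]:>+10.4f}")

# Number of the 1066 test reviews classified correctly after fine-tuning.
test_samples = 1066
correct_after = round(finetuned_results["eval_accuracy"] * test_samples)
print(f"correct after fine-tuning: {correct_after} / {test_samples}")
```

After one epoch of LoRA training, accuracy on the test split rises from 0.50 to about 0.73 and the evaluation loss falls from about 1.93 to about 0.60.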