---
# Title: Fine-Tuning GPT-2 with LoRA for IMDb Sentiment Classification
---

## Introduction
This notebook demonstrates how to fine-tune a GPT-2 model using Low-Rank Adaptation (LoRA) for sentiment classification on the IMDb dataset, distinguishing positive and negative movie reviews.

**Overview:**
1. Load and evaluate the baseline GPT-2 model
2. Apply LoRA for parameter-efficient fine-tuning
3. Save the fine-tuned model into a structured folder for easy access
4. Evaluate the fine-tuned model and compare it to the baseline

**Key Details:**
- PEFT Technique: LoRA reduces computational overhead by introducing trainable low-rank matrices into specific model layers, keeping most parameters frozen.
- Model: GPT-2, originally designed for language generation, is adapted for sequence classification with a classification head.
- Evaluation Approach: Accuracy is used to compare the baseline and fine-tuned models on the test set.
- Dataset: The IMDb dataset includes 25,000 training and 25,000 test movie reviews labeled as positive (1) or negative (0). To keep runtimes manageable, this notebook uses a shuffled 5,000-review subset of each split.
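
To make the LoRA idea from Key Details concrete, here is a minimal, dependency-free sketch of a single adapted layer. It is not the PEFT library's implementation: the helper names (`matmul`, `lora_forward`), the matrix sizes, and all values are illustrative assumptions. The frozen weight `W` is never changed; only the small factors `A` and `B` would be trained.

```python
# Toy LoRA forward pass in plain Python; all sizes and values are made up.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, alpha, r):
    """Compute x @ W + (alpha / r) * (x @ A) @ B."""
    frozen = matmul(x, W)               # output of the frozen pretrained weight
    low_rank = matmul(matmul(x, A), B)  # output of the trainable low-rank path
    scaling = alpha / r
    return [[f + scaling * u for f, u in zip(fr, ur)]
            for fr, ur in zip(frozen, low_rank)]

d_in, d_out, r, alpha = 4, 3, 2, 4

W = [[0.1 * (i + j) for j in range(d_out)] for i in range(d_in)]  # frozen
A = [[0.01 * (i + 1) for _ in range(r)] for i in range(d_in)]     # trainable, small init
B = [[0.0] * d_out for _ in range(r)]                             # trainable, zero init

x = [[1.0, 2.0, 3.0, 4.0]]  # one input row

# With B at zero, the adapter contributes nothing and the layer matches the frozen one.
out_at_init = lora_forward(x, W, A, B, alpha, r)

# Once training moves B away from zero, the output shifts without touching W.
B_trained = [[0.5] * d_out for _ in range(r)]
out_after = lora_forward(x, W, A, B_trained, alpha, r)

print(out_at_init)
print(out_after)
```

Starting `B` at zero means fine-tuning begins from exactly the pretrained behavior, which is the usual LoRA initialization.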

## Attribution
Parts of this code are adapted from examples I worked through earlier in this course.

---
### Importing Libraries
---

In [1]:
!pip install transformers datasets peft scikit-learn torch evaluate
!pip install --upgrade datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Collecting peft
  Downloading peft-0.13.2-py3-none-any.whl (320 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pya

In [2]:
print("\n--- Importing necessary libraries ---\n")
import os
import time
import json
import torch
import pandas as pd
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, TaskType, AutoPeftModelForSequenceClassification
from evaluate import load
from sklearn.metrics import accuracy_score

print("All libraries imported successfully!")


--- Importing necessary libraries ---

All libraries imported successfully!


In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"\nUsing device: {device}")


Using device: cuda


---
### Loading and Preprocessing the IMDb Dataset
---

In [4]:
print("\n--- Loading the IMDb dataset ---\n")
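splits = ["train", "test"]
# load_dataset returns one Dataset per requested split, in the same order as `splits`
ds = {split: dataset for split, dataset in zip(splits, load_dataset("imdb", split=splits))}

# Keep a shuffled 5,000-example subset of each split to limit training and evaluation time
for split in splits:
    ds[split] = ds[split].shuffle(seed=42).select(range(5000))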

print(f"Train dataset size: {len(ds['train'])}")
print(f"Test dataset size: {len(ds['test'])}")


--- Loading the IMDb dataset ---

Train dataset size: 5000
Test dataset size: 5000


In [5]:
print(f"Example review:\n{ds['train'][0]['text']}")

Example review:
There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...


In [6]:
# Initialize the tokenizer
print("\n--- Initializing the GPT-2 tokenizer ---\n")
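tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 has no dedicated pad token, so reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

def preprocess_data(example):
    # Pad or truncate every review to the tokenizer's maximum length (1,024 tokens for GPT-2)
    return tokenizer(
        example["text"],
        padding="max_length",
        truncation=True,
    )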

print("\n--- Tokenizing the dataset ---\n")
tokenized_ds = {split: ds[split].map(preprocess_data, batched=True) for split in splits}
tokenized_ds = {split: tokenized_ds[split].remove_columns(["text"]) for split in splits}


--- Initializing the GPT-2 tokenizer ---


--- Tokenizing the dataset ---





In [7]:
print(f"Example tokenized review:\n{tokenized_ds['train'][0]['input_ids']}")

Example tokenized review:
[1858, 318, 645, 8695, 379, 477, 1022, 6401, 959, 290, 4415, 5329, 475, 262, 1109, 326, 1111, 389, 1644, 2168, 546, 6590, 6741, 13, 4415, 5329, 3073, 42807, 11, 6401, 959, 3073, 6833, 13, 4415, 5329, 21528, 389, 2407, 2829, 13, 6401, 959, 338, 7110, 389, 1290, 517, 8253, 986, 6401, 959, 3073, 517, 588, 5537, 8932, 806, 11, 611, 356, 423, 284, 4136, 20594, 986, 383, 1388, 2095, 318, 4939, 290, 7650, 78, 11, 475, 423, 366, 27659, 40024, 590, 1911, 4380, 588, 284, 8996, 11, 284, 5052, 11, 284, 13446, 13, 1374, 546, 655, 13226, 30, 40473, 1517, 1165, 11, 661, 3597, 6401, 959, 3073, 1605, 475, 11, 319, 262, 584, 1021, 11, 11810, 484, 4702, 1605, 2168, 357, 10185, 737, 6674, 340, 338, 262, 3303, 11, 393, 262, 4437, 11, 475, 314, 892, 428, 2168, 318, 517, 3594, 621, 1605, 13, 2750, 262, 835, 11, 262, 10544, 389, 1107, 922, 290, 8258, 13, 383, 7205, 318, 407, 31194, 379, 477, 986, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 5025
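
The run of `50256` values at the end of the tokenized review is padding: `50256` is GPT-2's end-of-text id, which this notebook reuses as the pad token. The sketch below shows, under simplified assumptions, how right-padding and the matching attention mask fit together. `pad_to_length` is a hypothetical helper written for illustration, not a tokenizer API.

```python
PAD_ID = 50256  # GPT-2's end-of-text id, reused as the pad token in this notebook

def pad_to_length(input_ids, max_length, pad_id=PAD_ID):
    """Truncate or right-pad a list of token ids and build the matching attention mask."""
    ids = input_ids[:max_length]
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

# First four ids of the example review, padded to a toy length of 8
example = pad_to_length([1858, 318, 645, 8695], max_length=8)
print(example["input_ids"])
print(example["attention_mask"])
```

The attention mask is what lets the model tell real tokens from padding, which is why the dataset format in the next section keeps both `input_ids` and `attention_mask`.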

---
### Evaluating the Baseline GPT-2 Model
---

In [8]:
print("\n--- Setting dataset format for baseline evaluation ---\n")
for split in splits:
    tokenized_ds[split].set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

print("\n--- Loading the GPT-2 baseline model ---\n")
baseline_model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2", 
    num_labels=2,  # Binary classification: positive/negative
    id2label={0: "NEGATIVE", 1: "POSITIVE"},
    label2id={"NEGATIVE": 0, "POSITIVE": 1}    
)
baseline_model.config.pad_token_id = tokenizer.pad_token_id
baseline_model.to(device)
print("GPT-2 baseline model loaded successfully!")


--- Setting dataset format for baseline evaluation ---


--- Loading the GPT-2 baseline model ---

GPT-2 baseline model loaded successfully!


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
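def compute_metrics(eval_pred):
    """Convert logits to class predictions and report accuracy."""
    predictions, labels = eval_pred  # logits and true labels as NumPy arrays
    preds = predictions.argmax(axis=-1)  # index of the highest logit per example
    return {"accuracy": accuracy_score(labels, preds)}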
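
As a rough illustration of what `compute_metrics` does, the following plain-Python sketch applies the same two steps (argmax over logits, then fraction correct) to a handful of made-up values. The helper names and the toy logits and labels are assumptions for illustration only.

```python
def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)

# Four toy examples: columns are [negative, positive] logits
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 1.1]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75: the third example is predicted positive but labeled negative
```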

In [10]:
baseline_trainer = Trainer(
    model=baseline_model,
    args=TrainingArguments(
        output_dir="./baseline-results",
        evaluation_strategy="epoch",
        per_device_eval_batch_size=4,
    ),
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

print("\n--- Evaluating the baseline model ---\n")
baseline_results = baseline_trainer.evaluate()
print(f"Baseline evaluation results: {baseline_results}")


--- Evaluating the baseline model ---

Baseline evaluation results: {'eval_loss': 3.2555651664733887, 'eval_accuracy': 0.4988, 'eval_runtime': 397.173, 'eval_samples_per_second': 12.589, 'eval_steps_per_second': 3.147}
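
One way to read these numbers: a binary classifier that always outputs 50/50 probabilities has a cross-entropy loss of ln 2. The short check below compares that reference value with the reported baseline loss. The interpretation in the comments is a plausible reading, not something the run itself confirms.

```python
import math

# Cross-entropy of a classifier that assigns probability 0.5 to each of two classes
chance_loss = math.log(2)

# Value copied from the baseline evaluation output above
baseline_loss = 3.2555651664733887

print(f"Chance-level loss: {chance_loss:.4f}")
print(f"Baseline loss:     {baseline_loss:.4f}")
# A loss well above ln 2, combined with roughly 50% accuracy, suggests the randomly
# initialized classification head is making confident predictions that are often wrong.
```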


You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [11]:
os.makedirs("./results", exist_ok=True)
with open("./results/baseline_results.json", "w") as f:
    json.dump(baseline_results, f)
print("Baseline evaluation results saved to 'results/baseline_results.json'.")

Baseline evaluation results saved to 'results/baseline_results.json'.


---
### Preparing Dataset for LoRA Fine-Tuning
---

In [12]:
torch.cuda.empty_cache()

In [13]:
print("\n--- Renaming 'label' to 'labels' for LoRA compatibility ---\n")
for split in splits:
    tokenized_ds[split] = tokenized_ds[split].rename_column("label", "labels")

print("\n--- Setting dataset format to PyTorch tensors for LoRA ---\n")
for split in splits:
    tokenized_ds[split].set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])


--- Renaming 'label' to 'labels' for LoRA compatibility ---


--- Setting dataset format to PyTorch tensors for LoRA ---



---
### Setting Up and Fine-Tuning with LoRA
---

In [14]:
print("\n--- Setting up LoRA for fine-tuning ---\n")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # Specify the task type as Sequence Classification
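    r=8,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor; the update is scaled by lora_alpha / r
    target_modules=["c_attn"],  # GPT-2's fused query/key/value projection in each attention block
    lora_dropout=0.1,
    bias="none"  # Keep bias terms frozen
)

# get_peft_model injects the adapters into baseline_model itself,
# which is why the baseline was evaluated before this step
peft_model = get_peft_model(baseline_model, lora_config)
peft_model.to(device)
peft_model.print_trainable_parameters()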


--- Setting up LoRA for fine-tuning ---

trainable params: 296,448 || all params: 124,737,792 || trainable%: 0.2377
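
The trainable-parameter count can be reproduced by hand. The sketch below assumes the standard GPT-2 (small) dimensions of 12 layers and a hidden size of 768, and assumes that the bias-free classification head (`score`) is trained alongside the LoRA matrices for this task type. The total of 124,737,792 is taken from the output above rather than recomputed.

```python
# GPT-2 (small) dimensions
n_layers = 12
hidden = 768
c_attn_out = 3 * hidden  # fused query/key/value projection: 768 -> 2304
r = 8                    # LoRA rank from lora_config
num_labels = 2

# Each adapted c_attn layer adds A (hidden x r) and B (r x c_attn_out)
lora_per_layer = hidden * r + r * c_attn_out
lora_total = n_layers * lora_per_layer

# Classification head: a linear layer without bias (only 'score.weight' exists)
score_head = hidden * num_labels

trainable = lora_total + score_head
reported_total = 124_737_792  # "all params" from the output above
pct = 100 * trainable / reported_total

print(f"{trainable:,} trainable ({pct:.4f}%)")  # 296,448 trainable (0.2377%)
```

For comparison, a single full `c_attn` weight matrix has 768 × 2304 = 1,769,472 entries, while its LoRA adapter adds only 24,576.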




In [15]:
print("\n--- Starting fine-tuning with LoRA ---\n")
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=4,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",
    greater_is_better=True
)

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_ds["train"],
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()
print("Fine-tuning completed!")


--- Starting fine-tuning with LoRA ---

Fine-tuning completed!




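| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.7075        | 0.562448        | 0.723    |
| 2     | 0.6467        | 0.64426         | 0.739    |
| 3     | 0.6077        | 0.632877        | 0.755    |
| 4     | 0.568         | 0.665605        | 0.751    |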
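
Because training used `load_best_model_at_end=True` with `metric_for_best_model="eval_accuracy"`, the checkpoint kept at the end is the one with the highest validation accuracy, not the last epoch. The sketch below replays that selection over the logged values. It mimics the selection rule in plain Python and does not call the `Trainer` itself.

```python
# Per-epoch metrics copied from the training log above
history = [
    {"epoch": 1, "eval_loss": 0.562448, "eval_accuracy": 0.723},
    {"epoch": 2, "eval_loss": 0.64426, "eval_accuracy": 0.739},
    {"epoch": 3, "eval_loss": 0.632877, "eval_accuracy": 0.755},
    {"epoch": 4, "eval_loss": 0.665605, "eval_accuracy": 0.751},
]

# greater_is_better=True: pick the epoch with the highest accuracy
best_by_accuracy = max(history, key=lambda row: row["eval_accuracy"])

# For contrast: selecting on validation loss would pick a different epoch
best_by_loss = min(history, key=lambda row: row["eval_loss"])

print(f"Best by accuracy: epoch {best_by_accuracy['epoch']}")
print(f"Best by loss:     epoch {best_by_loss['epoch']}")
```

Accuracy peaks at epoch 3 while validation loss is lowest at epoch 1, so the choice of selection metric changes which checkpoint is kept.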


---
### Saving the Fine-Tuned Model
---

In [16]:
print("\n--- Saving the fine-tuned model ---\n")
save_path = "saved_models/lora_finetuned"
os.makedirs(save_path, exist_ok=True)
peft_model.save_pretrained(save_path)
print(f"Model saved successfully at {save_path}!")


--- Saving the fine-tuned model ---

Model saved successfully at saved_models/lora_finetuned!




---
### Loading the Fine-Tuned Model
---

In [17]:
save_path = "saved_models/lora_finetuned"

print("\n--- Loading the fine-tuned model ---\n")
fine_tuned_model = AutoPeftModelForSequenceClassification.from_pretrained(save_path)
fine_tuned_model.config.pad_token_id = tokenizer.pad_token_id
fine_tuned_model.to(device)
print("Fine-tuned model loaded successfully!")

trainer = Trainer(
    model=fine_tuned_model,
    args=TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        per_device_eval_batch_size=8
    ),
    eval_dataset=tokenized_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)


--- Loading the fine-tuned model ---

Fine-tuned model loaded successfully!


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


---
### Evaluating the Fine-Tuned Model
---

In [18]:
print("\n--- Evaluating the fine-tuned model ---\n")
fine_tuned_results = trainer.evaluate(eval_dataset=tokenized_ds["test"])
print(f"Fine-tuned evaluation results: {fine_tuned_results}")


--- Evaluating the fine-tuned model ---

Fine-tuned evaluation results: {'eval_loss': 0.6328773498535156, 'eval_accuracy': 0.755, 'eval_runtime': 404.931, 'eval_samples_per_second': 12.348, 'eval_steps_per_second': 1.543}


In [19]:
with open("./results/fine_tuned_lora_results.json", "w") as f:
    json.dump(fine_tuned_results, f)
print("Fine-tuned evaluation results saved to 'results/fine_tuned_lora_results.json'.")

Fine-tuned evaluation results saved to 'results/fine_tuned_lora_results.json'.


---
### Comparing Baseline and Fine-Tuned Results
---

In [20]:
print("\n--- Comparing baseline and fine-tuned results ---\n")

baseline_results_path = "results/baseline_results.json"
fine_tuned_results_path = "results/fine_tuned_lora_results.json"

if "baseline_results" not in globals():
    if os.path.exists(baseline_results_path):
        with open(baseline_results_path, "r") as f:
            baseline_results = json.load(f)
        print("Baseline results loaded from file.")
    else:
        raise FileNotFoundError(f"Baseline results file not found at {baseline_results_path}")

if "fine_tuned_results" not in globals():
    if os.path.exists(fine_tuned_results_path):
        with open(fine_tuned_results_path, "r") as f:
            fine_tuned_results = json.load(f)
        print("Fine-tuned results loaded from file.")
    else:
        raise FileNotFoundError(f"Fine-tuned results file not found at {fine_tuned_results_path}")

baseline_accuracy = baseline_results.get("eval_accuracy", 0)
fine_tuned_accuracy = fine_tuned_results.get("eval_accuracy", 0)

print(f"Baseline Accuracy: {baseline_accuracy * 100:.2f}%")
print(f"Fine-Tuned Accuracy: {fine_tuned_accuracy * 100:.2f}%")


--- Comparing baseline and fine-tuned results ---

Baseline Accuracy: 49.88%
Fine-Tuned Accuracy: 75.50%


In [21]:
comparison_df = pd.DataFrame(
    {
        "Model": ["Baseline", "Fine-Tuned"],
        "Accuracy (%)": [baseline_accuracy * 100, fine_tuned_accuracy * 100],
    }
)
print("\nComparison Table:\n")
print(comparison_df)


Comparison Table:

        Model  Accuracy (%)
0    Baseline         49.88
1  Fine-Tuned         75.50
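
To put the comparison in perspective, the improvement can be expressed both in percentage points and relative to the baseline. The values below are copied from the table above. The variable names are illustrative.

```python
baseline_pct = 49.88
fine_tuned_pct = 75.50

# Absolute improvement in percentage points
absolute_gain = fine_tuned_pct - baseline_pct

# Improvement relative to the baseline accuracy
relative_gain = absolute_gain / baseline_pct * 100

print(f"Absolute gain: {absolute_gain:.2f} percentage points")
print(f"Relative gain: {relative_gain:.1f}% over the baseline")
```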


---
### Inference Example
---

In [22]:
print("\n--- Performing inference with the fine-tuned model ---\n")
test_sentence = "The movie was fantastic! I loved every moment of it."

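# A single sentence needs no padding; truncate only if it exceeds the model's maximum length
inputs = tokenizer(test_sentence, return_tensors="pt", truncation=True).to(device)

fine_tuned_model.eval()  # Disable dropout so predictions are deterministic
with torch.no_grad():  # Gradients are not needed for inference
    outputs = fine_tuned_model(**inputs)
prediction = outputs.logits.argmax(-1).item()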

sentiment = "Positive" if prediction == 1 else "Negative"
print(f"Test Sentence: {test_sentence}")
print(f"Predicted Sentiment: {sentiment}")


--- Performing inference with the fine-tuned model ---

Test Sentence: The movie was fantastic! I loved every moment of it.
Predicted Sentiment: Positive
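
The argmax above gives only the predicted class. To also get a confidence score, the logits can be passed through a softmax. The sketch below does this in plain Python with hypothetical logit values, since the notebook does not print the actual logits for this sentence.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical [negative, positive] logits for illustration only
hypothetical_logits = [-1.2, 2.3]

probs = softmax(hypothetical_logits)
pred = probs.index(max(probs))
label = "Positive" if pred == 1 else "Negative"

print(f"Predicted Sentiment: {label} (confidence {probs[pred]:.2%})")
```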
