# Lightweight Fine-Tuning Project

* PEFT technique: LoRA
* Model: GPT-2
* Evaluation approach: Hugging Face Trainer (evaluate method)
* Fine-tuning dataset: IMDb dataset

## Loading and Evaluating a Foundation Model

In [1]:
from datasets import load_dataset

# Load the train and test splits of the imdb dataset
splits = ["train", "test"]
dataset = {split: ds for split, ds in zip(splits, load_dataset("imdb", split=splits))}

# Thin out the dataset to make it run faster for this example
for split in splits:
    dataset[split] = dataset[split].shuffle(seed=42).select(range(1000))

# Show the dataset
dataset

{'train': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 }),
 'test': Dataset({
     features: ['text', 'label'],
     num_rows: 1000
 })}

In [3]:
dataset["train"][0]

{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...',
 'label': 1}

In [4]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [5]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [6]:
#tokenize the dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = {}
for split in splits:
    tokenized_dataset[split] = dataset[split].map(tokenize, batched=True)

In [7]:
#verifying features of tokenized dataset
print(tokenized_dataset["train"])

#verifying a tokenized sample
print(tokenized_dataset["train"][0])

Dataset({
    features: ['text', 'label', 'input_ids', 'attention_mask'],
    num_rows: 1000
})
{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...', 'label': 1, 'input_ids': [1858, 318, 645, 8695, 379, 477, 1022, 6401, 959, 290, 4415, 5329, 475, 262, 1109, 326, 1111, 389, 1644, 2

In [17]:
#load and setup the model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    id2label={0:"Negative", 1:"Positive"},
    label2id={"Negative":0, "Positive":1}
)

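# GPT-2 has no pad token by default, so point the model at the tokenizer's (EOS)
model.config.pad_token_id = tokenizer.pad_token_id

# Freeze the GPT-2 backbone; only the new classification head (score) stays trainable
for param in model.base_model.parameters():
    param.requires_grad = False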

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
model

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)

In [19]:
#print model parameters
total_params = sum(p.numel() for p in model.parameters())
total_trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_params:,} total parameters.")
print(f"{total_trainable_params:,} trainable parameters.")
print(f"{total_trainable_params/total_params:.4%} of parameters are trainable.")

124,441,344 total parameters.
1,536 trainable parameters.
0.0012% of parameters are trainable.
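
The counts above can be reproduced by hand from the layer shapes in the printed model. The sketch below is a back-of-the-envelope tally, assuming the standard GPT-2 small dimensions (hidden size 768, 12 blocks, MLP width 3072) and that each Conv1D projection carries a bias. The variable and helper names are illustrative only.

```python
# Tally GPT-2 small parameters from the shapes shown in the model printout.
hidden, vocab, positions, n_blocks = 768, 50257, 1024, 12
mlp_width = 4 * hidden  # 3072, assumed from the standard GPT-2 config

def dense(n_in, n_out):
    """Weights plus bias for one Conv1D projection."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors
block = (
    2 * layer_norm                  # ln_1 and ln_2
    + dense(hidden, 3 * hidden)     # attn.c_attn (fused query/key/value)
    + dense(hidden, hidden)         # attn.c_proj
    + dense(hidden, mlp_width)      # mlp.c_fc
    + dense(mlp_width, hidden)      # mlp.c_proj
)
# Token embeddings, position embeddings, all blocks, and the final ln_f.
backbone = vocab * hidden + positions * hidden + n_blocks * block + layer_norm
score_head = hidden * 2  # Linear(768 -> 2) without bias

total = backbone + score_head
print(f"{total:,} total parameters.")
print(f"{score_head:,} trainable parameters.")
print(f"{score_head / total:.4%} of parameters are trainable.")
```

With the backbone frozen, the only weights left to train are the 768 x 2 matrix of the score head, which is why the trainable share is so small.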


In [20]:
#training
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir=f"./output/{model_name}/",
    num_train_epochs=5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 7.039577 | 0.489 |
| 2 | No log | 6.419224 | 0.489 |
| 3 | No log | 5.975307 | 0.489 |
| 4 | No log | 5.708392 | 0.489 |
| 5 | No log | 5.6185 | 0.489 |


TrainOutput(global_step=315, training_loss=6.313909427703373, metrics={'train_runtime': 897.237, 'train_samples_per_second': 5.573, 'train_steps_per_second': 0.351, 'total_flos': 2612967505920000.0, 'train_loss': 6.313909427703373, 'epoch': 5.0})

In [21]:
#evaluation
trainer.evaluate()

{'eval_loss': 5.618500232696533,
 'eval_accuracy': 0.489,
 'eval_runtime': 86.9259,
 'eval_samples_per_second': 11.504,
 'eval_steps_per_second': 0.725,
 'epoch': 5.0}
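
An accuracy that stays at exactly 0.489 across all five epochs is what one would see if the classifier predicted a single class for every review, although the log alone does not confirm this. Below is a minimal sketch of how that could be checked, using made-up toy logits rather than the real model output. `summarize_predictions` is a hypothetical helper, not part of transformers.

```python
from collections import Counter

def summarize_predictions(logits, labels):
    """Return accuracy and a count of predicted classes from raw logits."""
    # Same rule as compute_metrics: pick the class with the largest logit.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    return accuracy, Counter(preds)

# Toy data (made up): every row favours class 0, so accuracy equals
# the share of class-0 labels regardless of how the loss moves.
toy_logits = [[2.0, -1.0], [0.5, 0.1], [3.2, 1.0], [1.1, 0.9]]
toy_labels = [0, 1, 1, 0]
accuracy, counts = summarize_predictions(toy_logits, toy_labels)
print(accuracy, dict(counts))
```

On the real run, the logits would come from `trainer.predict(tokenized_dataset["test"]).predictions` and the labels from the same call's `label_ids`.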

## Performing Parameter-Efficient Fine-Tuning

In [8]:
from peft import LoraConfig, get_peft_model

num_labels = 2
id2label = {0: "Negative", 1: "Positive"}
label2id = {"Negative": 0, "Positive": 1}

# Reuse the tokenizer loaded earlier
print(tokenizer)

# Re-instantiate the base model so LoRA starts from the original GPT-2 weights
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
)

# GPT-2 has no pad token by default, so point the model at the tokenizer's (EOS)
model.config.pad_token_id = tokenizer.pad_token_id

GPT2TokenizerFast(name_or_path='gpt2', vocab_size=50257, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'pad_token': '<|endoftext|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	50256: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [9]:
model.named_modules

<bound method Module.named_modules of GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D()
          (c_proj): Conv1D()
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D()
          (c_proj): Conv1D()
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (score): Linear(in_features=768, out_features=2, bias=False)
)>

In [10]:
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    target_modules=["c_attn", "c_fc", "c_proj"],  # GPT-2 attention (c_attn, c_proj) and MLP (c_fc, c_proj) projections
    r=8,
    lora_alpha=32,
    lora_dropout=0.1
)
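
For intuition about what `r` and `lora_alpha` control: LoRA leaves each targeted weight matrix W frozen and learns a low-rank correction scaled by `lora_alpha / r`, which is 32 / 8 = 4 in the config above. Below is a minimal pure-Python sketch of that update on a tiny made-up matrix. The sizes, values, and helper names are illustrative assumptions, and this is not how PEFT implements it internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, W, A, B, r, lora_alpha):
    """Compute x @ W + (lora_alpha / r) * (x @ A @ B) for row-vector inputs."""
    scale = lora_alpha / r
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]

# Toy shapes: 4 inputs -> 3 outputs with rank 2 (the config above uses r = 8).
r, lora_alpha = 2, 8
x = [[1.0, 2.0, 3.0, 4.0]]
W = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0], [0.5, 0.0, 0.1], [0.0, 0.2, 0.0]]  # frozen
A = [[0.01, 0.02], [0.03, 0.04], [0.05, 0.06], [0.07, 0.08]]  # trainable, 4 x r
B = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # trainable, r x 3, starts at zero

# With B at zero the adapter contributes nothing, so the output equals x @ W.
assert lora_forward(x, W, A, B, r, lora_alpha) == matmul(x, W)
```

Starting B at zero means fine-tuning begins from the unmodified pretrained behaviour, and only the small A and B matrices receive gradient updates.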

In [11]:
model = get_peft_model(model, lora_config)



In [12]:
model.print_trainable_parameters()

trainable params: 1,182,720 || all params: 125,622,528 || trainable%: 0.9414871829358505
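
The trainable count can be mostly accounted for from the adapter shapes. The sketch below assumes that `c_proj` matches both the attention and the MLP output projection in every block, and that each adapter adds an `n_in x r` matrix A and an `r x n_out` matrix B. The helper name is illustrative.

```python
hidden, n_blocks, r = 768, 12, 8
mlp_width = 4 * hidden  # 3072, assumed from the standard GPT-2 config

def lora_params(n_in, n_out, rank):
    """Parameters in one adapter: A is n_in x rank, B is rank x n_out."""
    return n_in * rank + rank * n_out

per_block = (
    lora_params(hidden, 3 * hidden, r)    # attn.c_attn
    + lora_params(hidden, hidden, r)      # attn.c_proj
    + lora_params(hidden, mlp_width, r)   # mlp.c_fc
    + lora_params(mlp_width, hidden, r)   # mlp.c_proj
)
adapters = n_blocks * per_block
reported_trainable = 1_182_720  # from print_trainable_parameters() above

print(f"{adapters:,} adapter parameters")
print(f"{reported_trainable - adapters:,} trainable parameters outside the adapters")
```

The remainder is exactly twice the 1,536-weight score head. One plausible reading is that PEFT keeps the classification head trainable for SEQ_CLS tasks and counts both the original and a saved copy, but that explanation is an assumption rather than something the output confirms.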


In [16]:
import numpy as np
from transformers import DataCollatorWithPadding, Trainer, TrainingArguments

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments(
    output_dir="./output/gpt2-lora-imdb/",
    num_train_epochs=5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
)

trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 1.9319 | 1.285838 | 0.495 |
| 2 | 0.7407 | 1.284292 | 0.711 |
| 3 | 0.7363 | 1.050212 | 0.808 |
| 4 | 0.7778 | 1.051349 | 0.816 |
| 5 | 0.5758 | 1.00726 | 0.829 |


TrainOutput(global_step=5000, training_loss=1.119337890625, metrics={'train_runtime': 1771.0077, 'train_samples_per_second': 2.823, 'train_steps_per_second': 2.823, 'total_flos': 2649253478400000.0, 'train_loss': 1.119337890625, 'epoch': 5.0})

In [17]:
trainer.evaluate()

{'eval_loss': 1.0072604417800903,
 'eval_accuracy': 0.829,
 'eval_runtime': 100.6587,
 'eval_samples_per_second': 9.935,
 'eval_steps_per_second': 9.935,
 'epoch': 5.0}

In [19]:
model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Linear(
                in_features=768, out_features=2304, bias=True
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_

In [20]:
# Merge the LoRA weights into the base model and save the merged result
model = model.merge_and_unload()
model.save_pretrained("/tmp/gpt2-lora-imdb-full")
tokenizer.save_pretrained("/tmp/gpt2-lora-imdb-full")

('/tmp/gpt2-lora-imdb-full/tokenizer_config.json',
 '/tmp/gpt2-lora-imdb-full/special_tokens_map.json',
 '/tmp/gpt2-lora-imdb-full/vocab.json',
 '/tmp/gpt2-lora-imdb-full/merges.txt',
 '/tmp/gpt2-lora-imdb-full/added_tokens.json',
 '/tmp/gpt2-lora-imdb-full/tokenizer.json')

## Performing Inference with a PEFT Model

In [31]:
# Load the merged model saved above (the LoRA weights are already folded in, so no PEFT wrapper is needed)
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("/tmp/gpt2-lora-imdb-full")

In [39]:
#inference with test set
import torch
import torch.nn.functional as F

for example in tokenized_dataset["test"].shuffle(seed=42).select(range(5)):
    text = example["text"]
    true_label = "Positive" if example["label"] == 1 else "Negative"
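    # Truncate to the model's 1024-token limit so long reviews do not overflow the position embeddings
    inputs = tokenizer(text, return_tensors="pt", truncation=True)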
    
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = F.softmax(logits, dim=-1)
        pred_class = torch.argmax(probs, dim=1).item()
        label = "Positive" if pred_class == 1 else "Negative"
        
    print("Review:", text[:100].replace("\n", " ") + "...")
    print("True Label:", true_label)
    print("Predicted:", label)
    print("-" * 60)

Review: I watched this movie which I really thought had a promising beginning but then it just led me to fee...
True Label: Negative
Predicted: Negative
------------------------------------------------------------
Review: This movie is perfect for any aspiring screen writer, actor or director. By watching this movie you ...
True Label: Negative
Predicted: Negative
------------------------------------------------------------
Review: I'm a Boorman fan, but this is arguably his least successful film. Comedy has never been his strong ...
True Label: Negative
Predicted: Positive
------------------------------------------------------------
Review: I wonder why I haven't heard of this movie before. It's truly a magnificent comedy (I'd say farce, b...
True Label: Positive
Predicted: Positive
------------------------------------------------------------
Review: Finally! Third time lucky. This film has been always been on my mind, but my first viewing I forgot ...
True Label: Positive
Predicted: 
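
In the loop above, softmax turns the two logits into probabilities before argmax picks the class. Because softmax preserves ordering, taking argmax of the raw logits gives the same prediction, so the probabilities are only needed when a confidence score is wanted. Below is a small pure-Python sketch with made-up logits. The label names follow the id2label mapping used earlier, and `predict` is a hypothetical helper.

```python
import math

id2label = {0: "Negative", 1: "Positive"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Made-up logits for one review; not taken from the model above.
label, confidence = predict([-1.2, 0.8])
print(label, round(confidence, 3))
```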

## Notes:
1. Accuracy on the 1,000-example test subset improved from 48.9% (frozen GPT-2 with only the classification head trained) to 82.9% after 5 epochs of LoRA fine-tuning.