# Project 1: Applying Lightweight Fine-Tuning to a Foundation Model
## Project Introduction
In this project, you will explore the power of parameter-efficient fine-tuning (PEFT) for adapting large foundation models to your specific needs—without requiring extensive computational resources. Leveraging the Hugging Face peft library, you will implement a workflow that demonstrates how modern generative AI models can be efficiently customized for downstream tasks.

The challenge is to bring together all the essential components of a PyTorch + Hugging Face training and inference pipeline. You will load a pre-trained transformer model, perform lightweight fine-tuning using the LoRA (Low-Rank Adaptation) technique, and compare the performance of the original and fine-tuned models on a sequence classification task. This project highlights the practical advantages of PEFT, including reduced training costs and model size, while maintaining strong performance.

## Project Structure
The current project is broken into the following parts:

1. **Loading Base Model and Dataset:** Select and load a compatible transformer model and a text classification dataset from Hugging Face. Tokenize and preprocess the data for training and evaluation.
2. **Baseline Evaluation:** Evaluate the pre-trained model’s performance on the selected dataset to establish a reference point.
3. **PEFT Configuration and Model Conversion:** Create a LoRA configuration and convert the base model into a parameter-efficient trainable model.
4. **Fine-Tuning and Saving:** Fine-tune the PEFT model on the dataset, monitor training progress, and save the adapter weights.
5. **Inference and Comparison:** Load the fine-tuned PEFT model, run inference, and compare its performance to the original model to assess the impact of PEFT.
6. **Results and Insights:** Summarize findings, discuss trade-offs, and highlight practical considerations for deploying PEFT in real-world scenarios.


# Loading Base Model and Dataset 

## Base Model
As the base model, we use GPT-2 (the `gpt2` checkpoint on Hugging Face).

In [1]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
import evaluate
import numpy as np


In [3]:
model_name = "gpt2"
num_labels = 2
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    pad_token_id=tokenizer.pad_token_id
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## Dataset
We use the full IMDB dataset, which has 25,000 samples in each split (train and test; the test split serves as the validation set).

In [4]:
# Load the full train and test splits (25,000 reviews each)
dataset = load_dataset("imdb", split=["train", "test"])
train_dataset, eval_dataset = dataset[0], dataset[1]

In [5]:
dataset[0].to_polars()

| text (str) | label (i64) |
|---|---|
| I rented I AM CURIOUS-YELLOW f… | 0 |
| "I Am Curious: Yellow" is a ri… | 0 |
| If only to avoid making this t… | 0 |
| This film was probably inspire… | 0 |
| Oh, brother...after hearing ab… | 0 |
| … | … |
| Sometime in 1998, Saban had ac… | 0 |
| This is the biggest insult to … | 0 |
| I did not like the idea of the… | 0 |
| I cannot stay indifferent to L… | 0 |


# Tokenizing the text samples

In [9]:
# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        truncation=True, 
        padding=True, 
        max_length=512
    )

# Tokenize datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

Map: 100%|███████████████████████████████████████████| 25000/25000 [00:03<00:00, 8258.68 examples/s]


# Defining LoRA Configuration and PEFT model

In [10]:
# Create LoRA configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,  # rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_proj", "c_attn"],
    bias="none",
)
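
With `r=8` and `lora_alpha=32`, the standard LoRA formulation scales the low-rank update by `lora_alpha / r = 4`. The following is a minimal pure-Python sketch of that update on a toy weight matrix, not PEFT's implementation. The helpers `matvec` and `lora_forward` and all toy values are hypothetical; the toy uses rank 1 with `alpha=4` so that the scaling factor matches the configuration above.

```python
# Toy illustration of the LoRA update: h = W x + (alpha / r) * B (A x).
# matvec and lora_forward are hypothetical helpers, not part of peft.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base projection plus the scaled low-rank correction."""
    scaling = alpha / r
    base = matvec(W, x)                 # frozen pre-trained weight
    low_rank = matvec(B, matvec(A, x))  # project down to rank r, then back up
    return [b + scaling * l for b, l in zip(base, low_rank)]

W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]       # frozen weight, shape (out=2, in=3)
A = [[0.5, -0.5, 1.0]]       # shape (r=1, in=3)
B_zero = [[0.0], [0.0]]      # shape (out=2, r=1), zero-initialised
B_trained = [[1.0], [2.0]]   # pretend values after some training

x = [1.0, 2.0, 3.0]
out_init = lora_forward(W, A, B_zero, x, alpha=4, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=4, r=1)
print(out_init)     # [7.0, -1.0], identical to the frozen layer
print(out_trained)  # [17.0, 19.0], shifted by the adapter
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly; only `A` and `B` receive gradient updates during training.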

In [11]:
# Create PEFT model
peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()


trainable params: 812,544 || all params: 125,253,888 || trainable%: 0.6487



fan_in_fan_out is set to False but the target module is `Conv1D`. Setting fan_in_fan_out to True.
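
The reported count of 812,544 trainable parameters can be reproduced by hand. The sketch below assumes that `"c_proj"` matches both the attention output projection and the MLP output projection in each of the 12 blocks, and that the classification head (`score`, 768 x 2, no bias) stays trainable for the `SEQ_CLS` task type. The helper `lora_params` is a hypothetical name.

```python
# Reproduce the reported trainable-parameter count from GPT-2 small shapes.
hidden, n_layers, rank, num_labels = 768, 12, 8, 2

def lora_params(in_features, out_features, r):
    """A is (r x in) and B is (out x r), so LoRA adds r * (in + out) weights."""
    return r * (in_features + out_features)

per_block = (
    lora_params(hidden, 3 * hidden, rank)    # attn.c_attn: 768 -> 2304
    + lora_params(hidden, hidden, rank)      # attn.c_proj: 768 -> 768
    + lora_params(4 * hidden, hidden, rank)  # mlp.c_proj: 3072 -> 768
)
lora_total = n_layers * per_block            # adapter weights across all blocks
score_head = hidden * num_labels             # classification head, no bias

trainable = lora_total + score_head
trainable_pct = round(100 * trainable / 125_253_888, 4)
print(f"{trainable:,}")   # 812,544
print(trainable_pct)      # 0.6487
```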



# Training PEFT model

In [12]:
# Evaluation metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
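
For intuition, `compute_metrics` takes the arg-max over each row of logits and compares it with the reference label. A dependency-free sketch of the same computation, with hypothetical helper names and made-up toy logits, might look like this:

```python
# Pure-Python equivalent of "argmax over logits, then accuracy".
# argmax and accuracy are hypothetical helpers; the toy data is invented.

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]
toy_result = accuracy(toy_logits, toy_labels)
print(toy_result)  # {'accuracy': 0.75}
```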


In [13]:
# Training arguments
training_args = TrainingArguments(
    output_dir="gpt2-imdb-peft",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [14]:
# Initialize trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer),
)



In [16]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2737 | 0.22035 | 0.9266 |
| 2 | 0.2785 | 0.245032 | 0.92592 |
| 3 | 0.2615 | 0.242862 | 0.93384 |


TrainOutput(global_step=9375, training_loss=0.2927575398763021, metrics={'train_runtime': 62423.8985, 'train_samples_per_second': 1.201, 'train_steps_per_second': 0.15, 'total_flos': 1.9784466432e+16, 'train_loss': 0.2927575398763021, 'epoch': 3.0})
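
As a sanity check, the numbers in `TrainOutput` are consistent with training on 25,000 examples. The sketch below assumes a single device and no gradient accumulation, so one optimizer step consumes one batch of 8.

```python
import math

# Values taken from the training arguments and the TrainOutput above.
train_samples, batch_size, epochs = 25_000, 8, 3
runtime_s = 62423.8985

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = round(train_samples * epochs / runtime_s, 3)
runtime_hours = round(runtime_s / 3600, 1)

print(total_steps)         # 9375, matches global_step
print(samples_per_second)  # 1.201, matches train_samples_per_second
print(runtime_hours)       # 17.3 hours on CPU
```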

In [17]:
# Save the PEFT model
peft_model.save_pretrained("gpt2-imdb-peft/best_model")

# Comparing the 2 models

In [36]:
base_model

GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): lora.Linear(
            (base_layer): Conv1D(nf=2304, nx=768)
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=768, out_features=8, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=8, out_features=2304, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
            (lora_magnitude_vector): ModuleDict()
          )
          (c_proj): lora.Linear(
            (base_layer): Conv1D(nf=768, nx=768)

In [37]:
peft_model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D(nf=2304, nx=768)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B):

In [52]:
loaded_peft_model

PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): lora.Linear(
                (base_layer): Conv1D(nf=2304, nx=768)
                (lora_dropout): ModuleDict(
                  (default): Dropout(p=0.1, inplace=False)
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B):

# Inference
In this section we define how to run inference with both the base model (GPT-2) and the fine-tuned model we trained.

In [18]:
from peft import AutoPeftModelForSequenceClassification

# Load the fine-tuned model
loaded_peft_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "gpt2-imdb-peft/best_model"
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
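# Inference function
def predict_sentiment(text):
    # Disable dropout and gradient tracking for inference
    loaded_peft_model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = loaded_peft_model(**inputs)
    # Convert logits to class probabilities (index 0 = negative, 1 = positive)
    prediction = torch.nn.functional.softmax(outputs.logits, dim=1)
    return "Positive" if prediction[0][1] > prediction[0][0] else "Negative"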
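
The label decision relies on softmax, which turns the two logits into probabilities that sum to 1 without changing their order. A small standalone sketch with a hypothetical pair of logits (not real model output) illustrates this:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_from_logits(logits):
    """Index 0 is treated as negative and index 1 as positive."""
    probs = softmax(logits)
    return "Positive" if probs[1] > probs[0] else "Negative"

toy_logits = [-1.2, 2.3]  # invented values for illustration
probs = softmax(toy_logits)
toy_label = label_from_logits(toy_logits)
print(toy_label)  # Positive
```

Since softmax is monotonic, comparing the probabilities gives the same label as comparing the raw logits; the probabilities are mainly useful when a confidence value is reported.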

In [21]:
# Test the model
test_text = "This movie was absolutely fantastic! I loved every minute of it."
print(f"Sentiment: {predict_sentiment(test_text)}")

Sentiment: Positive


# Evaluating models

In [26]:
def evaluate_model(model, dataset, tokenizer):
    # Set up trainer for evaluation only
    eval_trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./eval_output",
            per_device_eval_batch_size=8,
            remove_unused_columns=True,
        ),
        eval_dataset=dataset,
        compute_metrics=compute_metrics,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    
    # Run evaluation
    eval_results = eval_trainer.evaluate()
    return eval_results

## Evaluating GPT2 base model

In [27]:
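# Evaluate the "base" model.
# Caveat: get_peft_model() injected the LoRA layers into base_model in place,
# so this call actually scores the fine-tuned weights, not pristine GPT-2.
print("Evaluating base GPT-2 model...")
base_model_results = evaluate_model(base_model, tokenized_eval, tokenizer)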

Evaluating base GPT-2 model...


In [29]:
base_model_results

{'eval_loss': 0.22035004198551178,
 'eval_model_preparation_time': 0.0023,
 'eval_accuracy': 0.9266,
 'eval_runtime': 4279.2884,
 'eval_samples_per_second': 5.842,
 'eval_steps_per_second': 0.73}
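
These scores are identical to the epoch 1 validation row of the training run, which is not what an untrained classification head should produce. The printout of `base_model` earlier in the notebook already contains `lora.Linear` layers, which suggests the adapters were injected into the original object rather than into a copy. The toy classes below are hypothetical stand-ins (not PEFT code) that illustrate how wrapping an object in place changes what the original variable refers to.

```python
# Toy illustration of in-place wrapping; ToyModel and ToyWrapper are invented.

class ToyModel:
    def __init__(self):
        self.layers = {"c_attn": "Conv1D"}

class ToyWrapper:
    """Stand-in for an adapter wrapper that edits the wrapped model's layers."""
    def __init__(self, model):
        for name in model.layers:
            model.layers[name] = f"lora.Linear({model.layers[name]})"
        self.base_model = model  # keeps a reference, not a copy

base = ToyModel()
wrapped = ToyWrapper(base)

print(base.layers["c_attn"])       # lora.Linear(Conv1D)
print(wrapped.base_model is base)  # True
```

Under that reading, a fair baseline requires reloading the checkpoint into a fresh object, or making a deep copy before the adapters are attached.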

In [45]:
fresh_base_model = AutoModelForSequenceClassification.from_pretrained(
    "gpt2",
    num_labels=2,
    pad_token_id=tokenizer.pad_token_id
)
base_results = evaluate_model(fresh_base_model, tokenized_eval, tokenizer)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [46]:
base_results

{'eval_loss': 0.8233535885810852,
 'eval_model_preparation_time': 0.001,
 'eval_accuracy': 0.50304,
 'eval_runtime': 3923.0139,
 'eval_samples_per_second': 6.373,
 'eval_steps_per_second': 0.797}

In [30]:
def evaluate_model(model, dataset, tokenizer):
    # Ensure the model has a padding token set in its config
    if hasattr(model, 'config') and model.config.pad_token_id is None:
        model.config.pad_token_id = tokenizer.pad_token_id
    
    # Set up trainer for evaluation only
    eval_trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./eval_output",
            per_device_eval_batch_size=8,
            remove_unused_columns=True,
        ),
        eval_dataset=dataset,
        compute_metrics=compute_metrics,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    
    # Run evaluation
    eval_results = eval_trainer.evaluate()
    return eval_results

In [39]:
# Evaluate PEFT model
print("Evaluating PEFT fine-tuned model...")
peft_model_results = evaluate_model(peft_model, tokenized_eval, tokenizer)

Evaluating PEFT fine-tuned model...


In [42]:
peft_model_results

{'eval_loss': 0.22035004198551178,
 'eval_model_preparation_time': 0.0024,
 'eval_accuracy': 0.9266,
 'eval_runtime': 4164.9138,
 'eval_samples_per_second': 6.003,
 'eval_steps_per_second': 0.75}

In [48]:
# Compare results
print("\n=== Model Comparison ===")
print(f"Base GPT-2 Accuracy: {base_results['eval_accuracy']:.4f}")
print(f"PEFT Fine-tuned Accuracy: {peft_model_results['eval_accuracy']:.4f}")
print(f"Improvement: {(peft_model_results['eval_accuracy'] - base_results['eval_accuracy'])*100:.2f}%")


=== Model Comparison ===
Base GPT-2 Accuracy: 0.5030
PEFT Fine-tuned Accuracy: 0.9266
Improvement: 42.36%
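
Note that the 42.36 figure is an absolute difference in accuracy, that is, percentage points rather than a relative percentage. A short sketch using the two accuracies reported above shows both views:

```python
# Accuracies copied from the evaluation results above.
base_acc, peft_acc = 0.50304, 0.9266

absolute_gain_pp = (peft_acc - base_acc) * 100              # percentage points
relative_gain_pct = (peft_acc - base_acc) / base_acc * 100  # relative to baseline

print(f"Absolute gain: {absolute_gain_pp:.2f} percentage points")  # 42.36
print(f"Relative gain: {relative_gain_pct:.2f}%")                  # 84.20
```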


In [50]:
# Compare results
print("\n=== Model Comparison ===")
print(f"Base GPT-2 Accuracy: {base_results['eval_accuracy']:.4f}")
print(f"PEFT Fine-tuned Accuracy: {peft_model_results['eval_accuracy']:.4f}")
print(f"Improvement: {(peft_model_results['eval_accuracy'] - base_results['eval_accuracy'])*100:.2f}%")

# Optional: Detailed comparison of specific examples
def compare_predictions(model1, model2, examples, tokenizer, num_samples=5):
    print("\n=== Sample Predictions Comparison ===")
    for i in range(min(num_samples, len(examples))):
        text = examples[i]["text"]
        true_label = "Positive" if examples[i]["label"] == 1 else "Negative"
        
        # Get predictions
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        
        with torch.no_grad():
            # Base model prediction
            base_outputs = model1(**inputs)
            base_pred = torch.nn.functional.softmax(base_outputs.logits, dim=1)
            base_sentiment = "Positive" if base_pred[0][1] > base_pred[0][0] else "Negative"
            
            # PEFT model prediction
            peft_outputs = model2(**inputs)
            peft_pred = torch.nn.functional.softmax(peft_outputs.logits, dim=1)
            peft_sentiment = "Positive" if peft_pred[0][1] > peft_pred[0][0] else "Negative"
        
        print(f"Example {i+1}:")
        print(f"Text: {text[:100]}...")
        print(f"True label: {true_label}")
        print(f"Base model prediction: {base_sentiment} (confidence: {max(base_pred[0]).item():.2f})")
        print(f"PEFT model prediction: {peft_sentiment} (confidence: {max(peft_pred[0]).item():.2f})")
        print("-" * 50)

# Get a few examples from the validation set
sample_examples = [eval_dataset[i] for i in range(5)]
compare_predictions(fresh_base_model, loaded_peft_model, sample_examples, tokenizer)


=== Model Comparison ===
Base GPT-2 Accuracy: 0.5030
PEFT Fine-tuned Accuracy: 0.9266
Improvement: 42.36%

=== Sample Predictions Comparison ===
Example 1:
Text: I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-a...
True label: Negative
Base model prediction: Negative (confidence: 0.88)
PEFT model prediction: Negative (confidence: 1.00)
--------------------------------------------------
Example 2:
Text: Worth the entertainment value of a rental, especially if you like action movies. This one features t...
True label: Negative
Base model prediction: Negative (confidence: 0.78)
PEFT model prediction: Negative (confidence: 0.72)
--------------------------------------------------
Example 3:
Text: its a totally average film with a few semi-alright action sequences that make the plot seem a little...
True label: Negative
Base model prediction: Negative (confidence: 0.70)
PEFT model prediction: Negative (confidence: 0.99)
--------------------