# Udacity GenAI Nanodegree Project 1: Applying Lightweight Fine-Tuning to a Foundation Model
## Overview
Large language models (LLMs) are expensive to train because training requires substantial computing resources. One way to avoid training the whole model for a specific purpose is to use Parameter-Efficient Fine-Tuning (PEFT) methods, which train only a small portion of the model's parameters.

In some cases, PEFT achieves performance comparable to fine-tuning the full model, at a fraction of the cost.

## Project objective
- The objective of this project is to use PEFT to fine-tune an LLM.

## Practical approach
In this project, you will bring together all of the essential components of a PyTorch + Hugging Face training and inference process. Specifically, you will:

1. Load a pre-trained model and evaluate its performance
2. Perform parameter-efficient fine tuning using the pre-trained model
3. Perform inference using the fine-tuned model and compare its performance to the original model

# Loading base model

In [1]:
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from peft import LoraConfig, get_peft_model, TaskType
import evaluate
import numpy as np


In [3]:
# Load model and tokenizer
model_name = "gpt2"
num_labels = 2  # Binary classification
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

In [4]:
# Load base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    pad_token_id=tokenizer.pad_token_id
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


# Loading dataset
We use the full IMDB dataset, which has 25,000 samples in each split. The train split is used for training and the test split for validation.

In [7]:
# Load and prepare dataset (using IMDB for sentiment analysis)
dataset = load_dataset("imdb", split=['train[:25000]', 'test[:25000]'])
train_dataset, eval_dataset = dataset[0], dataset[1]


In [25]:
dataset[0].to_polars()

| text (str) | label (i64) |
|---|---|
| "I rented I AM CURIOUS-YELLOW f…" | 0 |
| ""I Am Curious: Yellow" is a ri…" | 0 |
| "If only to avoid making this t…" | 0 |
| "This film was probably inspire…" | 0 |
| "Oh, brother...after hearing ab…" | 0 |
| … | … |
| "A hit at the time but now bett…" | 1 |
| "I love this movie like no othe…" | 1 |
| "This film and it's sequel Barr…" | 1 |
| "'The Adventures Of Barry McKen…" | 1 |


# Tokenizing the text samples

In [9]:
# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["text"], 
        truncation=True, 
        padding=True, 
        max_length=512
    )

# Tokenize datasets
tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_eval = eval_dataset.map(tokenize_function, batched=True)

Map: 100%|███████████████████████████████████████████| 25000/25000 [00:03<00:00, 8258.68 examples/s]
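To make the `truncation`, `padding`, and `max_length` arguments concrete, here is a minimal, dependency-free sketch of what they do to a batch of token-id lists. The helper `pad_and_truncate` and the pad id `0` are hypothetical and exist only for illustration. The real tokenizer pads with GPT-2's EOS token id, as configured earlier, and this sketch is not how the Hugging Face tokenizer is implemented internally.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then right-pad to the longest one.

    Returns (input_ids, attention_mask); mask entries are 1 for real tokens
    and 0 for padding.
    """
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Hypothetical token ids; real ids come from the GPT-2 vocabulary.
ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [3, 4]], max_length=4)
print(ids)   # [[5, 6, 7, 8], [3, 4, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what lets the model ignore the padded positions, which is why the pad token can safely reuse the EOS token id.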


# Defining LoRA Configuration and PEFT model

In [10]:
# Create LoRA configuration
peft_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,  # rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_proj", "c_attn"],
    bias="none",
)
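As background for the values above, the standard LoRA formulation (general to the method, not something stated in this notebook) freezes each targeted weight matrix and learns a low-rank update next to it. The symbols below are introduced here for illustration: $W_0$ is the frozen pre-trained weight, $A$ and $B$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is `lora_alpha`.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},
\quad A \in \mathbb{R}^{r \times d_{\text{in}}}
```

With `r=8` and `lora_alpha=32`, the update is scaled by $\alpha / r = 4$, and each targeted layer adds $r \,(d_{\text{in}} + d_{\text{out}})$ trainable parameters.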

In [11]:
# Create PEFT model
peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()


trainable params: 812,544 || all params: 125,253,888 || trainable%: 0.6487



fan_in_fan_out is set to False but the target module is `Conv1D`. Setting fan_in_fan_out to True.
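The reported count of 812,544 trainable parameters can be cross-checked by hand. The sketch below rests on assumptions that are not stated in this notebook: the standard GPT-2 small dimensions (12 blocks, hidden size 768, MLP width 3072), `c_proj` matching both the attention and the MLP output projections in every block, and the bias-free classification head `score` being trained alongside the adapters for `TaskType.SEQ_CLS`.

```python
# Assumed GPT-2 small dimensions (not stated in the notebook).
N_BLOCKS = 12
HIDDEN = 768
MLP_WIDTH = 4 * HIDDEN  # 3072
RANK = 8                # r in LoraConfig
NUM_LABELS = 2


def lora_params(d_in, d_out, r=RANK):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


per_block = (
    lora_params(HIDDEN, 3 * HIDDEN)    # attn.c_attn: fused query/key/value projection
    + lora_params(HIDDEN, HIDDEN)      # attn.c_proj: attention output projection
    + lora_params(MLP_WIDTH, HIDDEN)   # mlp.c_proj: MLP output projection
)
lora_total = N_BLOCKS * per_block
score_head = HIDDEN * NUM_LABELS       # bias-free classification head
trainable = lora_total + score_head

print(f"LoRA adapters: {lora_total:,}")   # 811,008
print(f"Score head:    {score_head:,}")   # 1,536
print(f"Trainable:     {trainable:,}")    # 812,544
```

Under these assumptions the total matches the figure printed by `print_trainable_parameters()`.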



# Training PEFT model

In [12]:
# Evaluation metrics
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
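
For intuition, the accuracy computation in `compute_metrics` reduces to an argmax over each row of logits followed by a comparison with the labels. The following dependency-free sketch uses made-up logits and a hypothetical helper, `accuracy_from_logits`. It mirrors the logic only and is not the `evaluate` library's implementation.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Made-up logits for four reviews; columns are (negative, positive).
logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.8]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```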


In [13]:
# Training arguments
training_args = TrainingArguments(
    output_dir="gpt2-imdb-peft",
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [14]:
# Initialize trainer
trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer),
)



In [16]:
# Train the model
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | 0.2737 | 0.22035 | 0.9266 |
| 2 | 0.2785 | 0.245032 | 0.92592 |
| 3 | 0.2615 | 0.242862 | 0.93384 |


TrainOutput(global_step=9375, training_loss=0.2927575398763021, metrics={'train_runtime': 62423.8985, 'train_samples_per_second': 1.201, 'train_steps_per_second': 0.15, 'total_flos': 1.9784466432e+16, 'train_loss': 0.2927575398763021, 'epoch': 3.0})
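
The numbers in `TrainOutput` can be checked against the training configuration as a quick sanity test. The sketch assumes a single device and no gradient accumulation; neither is set explicitly in `TrainingArguments`, so this is an assumption about the run rather than a confirmed fact.

```python
import math

train_samples = 25_000
batch_size = 8              # per_device_train_batch_size
epochs = 3                  # num_train_epochs
train_runtime = 62423.8985  # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_samples * epochs / train_runtime
hours = train_runtime / 3600

print(total_steps)                   # 9375, matches global_step
print(round(samples_per_second, 3))  # 1.201, matches train_samples_per_second
print(round(hours, 1))               # 17.3
```

At roughly 17 hours for three epochs, the run is dominated by throughput of about 1.2 samples per second.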

In [17]:
# Save the PEFT model
peft_model.save_pretrained("gpt2-imdb-peft/best_model")

# Inference
In this section we define how to run inference with both the base model, GPT-2, and the fine-tuned model we trained.

In [18]:
from peft import AutoPeftModelForSequenceClassification

# Load the fine-tuned model
loaded_peft_model = AutoPeftModelForSequenceClassification.from_pretrained(
    "gpt2-imdb-peft/best_model"
)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
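# Inference function
def predict_sentiment(text):
    # Tokenize a single review; truncation keeps it within the model's context window
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Disable gradient tracking: no backward pass is needed at inference time
    with torch.no_grad():
        outputs = loaded_peft_model(**inputs)
    prediction = torch.nn.functional.softmax(outputs.logits, dim=1)
    # Index 0 is the negative class, index 1 the positive class (IMDB labels)
    return "Positive" if prediction[0][1] > prediction[0][0] else "Negative"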

In [21]:
# Test the model
test_text = "This movie was absolutely fantastic! I loved every minute of it."
print(f"Sentiment: {predict_sentiment(test_text)}")

Sentiment: Positive
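
Because softmax preserves the ordering of the logits, comparing the two probabilities in `predict_sentiment` is equivalent to comparing the raw logits; the softmax matters only when a confidence value is wanted. Below is a minimal, dependency-free sketch with made-up logits. The helper `softmax` is written here for illustration and is not the PyTorch implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [-1.2, 2.3]  # made-up (negative, positive) logits
probs = softmax(logits)
label = "Positive" if probs[1] > probs[0] else "Negative"
print(label, round(probs[1], 3))
```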


# Evaluating models

In [26]:
def evaluate_model(model, dataset, tokenizer):
    # Set up trainer for evaluation only
    eval_trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./eval_output",
            per_device_eval_batch_size=8,
            remove_unused_columns=True,
        ),
        eval_dataset=dataset,
        compute_metrics=compute_metrics,
        data_collator=DataCollatorWithPadding(tokenizer),
    )
    
    # Run evaluation
    eval_results = eval_trainer.evaluate()
    return eval_results

## Evaluating GPT2 base model

In [None]:
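# Evaluate base model. get_peft_model() injected the LoRA layers into base_model in place,
# so reload a fresh GPT-2 checkpoint to measure the true pre-fine-tuning baseline.
print("Evaluating base GPT-2 model...")
base_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_labels, pad_token_id=tokenizer.pad_token_id)
base_model_results = evaluate_model(base_model, tokenized_eval, tokenizer)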

Evaluating base GPT-2 model...


In [None]:
# Evaluate PEFT model
print("Evaluating PEFT fine-tuned model...")
peft_model_results = evaluate_model(loaded_peft_model, tokenized_eval, tokenizer)

In [None]:

# Compare results
print("\n=== Model Comparison ===")
print(f"Base GPT-2 Accuracy: {base_model_results['eval_accuracy']:.4f}")
print(f"PEFT Fine-tuned Accuracy: {peft_model_results['eval_accuracy']:.4f}")
print(f"Improvement: {(peft_model_results['eval_accuracy'] - base_model_results['eval_accuracy'])*100:.2f} percentage points")
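
Note that the difference of two accuracies multiplied by 100 is a gain in percentage points, not a relative improvement. The sketch below computes both with a hypothetical helper, `accuracy_gain`. The base accuracy of 0.50 is a placeholder assumption (roughly what an untrained binary head might score on a balanced set), not a measured result, and 0.93384 is taken from the training table purely for illustration.

```python
def accuracy_gain(base_acc, tuned_acc):
    """Return (gain in percentage points, relative gain in percent)."""
    points = (tuned_acc - base_acc) * 100
    relative = (tuned_acc - base_acc) / base_acc * 100
    return points, relative


# base_acc is a placeholder assumption, not a measured result.
points, relative = accuracy_gain(base_acc=0.50, tuned_acc=0.93384)
print(f"{points:.2f} percentage points, {relative:.2f}% relative")
```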

In [None]:
# Optional: Detailed comparison of specific examples
def compare_predictions(model1, model2, examples, tokenizer, num_samples=5):
    print("\n=== Sample Predictions Comparison ===")
    for i in range(min(num_samples, len(examples))):
        text = examples[i]["text"]
        true_label = "Positive" if examples[i]["label"] == 1 else "Negative"
        
        # Get predictions
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        
        with torch.no_grad():
            # Base model prediction
            base_outputs = model1(**inputs)
            base_pred = torch.nn.functional.softmax(base_outputs.logits, dim=1)
            base_sentiment = "Positive" if base_pred[0][1] > base_pred[0][0] else "Negative"
            
            # PEFT model prediction
            peft_outputs = model2(**inputs)
            peft_pred = torch.nn.functional.softmax(peft_outputs.logits, dim=1)
            peft_sentiment = "Positive" if peft_pred[0][1] > peft_pred[0][0] else "Negative"
        
        print(f"Example {i+1}:")
        print(f"Text: {text[:100]}...")
        print(f"True label: {true_label}")
        print(f"Base model prediction: {base_sentiment} (confidence: {max(base_pred[0]).item():.2f})")
        print(f"PEFT model prediction: {peft_sentiment} (confidence: {max(peft_pred[0]).item():.2f})")
        print("-" * 50)

# Get a few examples from the validation set
sample_examples = [eval_dataset[i] for i in range(5)]
compare_predictions(base_model, loaded_peft_model, sample_examples, tokenizer)