# Lesson 08: LoRA Fine-Tuning (SFT)

Supervised fine-tuning (SFT) starts from a **pretrained** language model and nudges it toward following instructions. Instead of training from scratch, we adapt an existing model using a dataset of instruction-response pairs.

**Why LoRA helps**: LoRA (Low-Rank Adaptation) freezes the base model and trains small, low-rank matrices injected into attention and MLP layers. Because only those matrices receive gradients, training is faster and uses less memory, and for many tasks the results come close to full fine-tuning.

In this notebook we will:
- Load a small instruction dataset.
- Format it into a prompt template.
- Attach LoRA adapters to GPT-2.
- Fine-tune with the Hugging Face `Trainer`.
- Run inference with sampling controls.
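
The low-rank idea described above can be sketched without any libraries. This is a minimal illustration, not PEFT's implementation: the helper names (`matvec`, `lora_forward`) and the tiny dimensions are made up for the example, and it assumes the usual LoRA setup where the adapter output is scaled by `alpha / r` and the `B` matrix starts at zero.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, alpha, r):
    """Compute W x + (alpha / r) * B (A x). Only A and B would be trained."""
    base = matvec(W, x)                  # frozen pretrained path
    low_rank = matvec(B, matvec(A, x))   # d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, low_rank)]


rng = random.Random(0)
d_in, d_out, r, alpha = 16, 16, 2, 4  # hypothetical toy sizes

W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen weight
A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
B = [[0.0] * r for _ in range(d_out)]                                # trainable, zero init

x = [rng.gauss(0, 1) for _ in range(d_in)]

# With B = 0 the adapter contributes nothing, so training starts from the base model.
y_base = matvec(W, x)
y_start = lora_forward(x, W, A, B, alpha, r)

# Pretend training has moved B away from zero: the output now differs from the base.
B_trained = [[0.5] * r for _ in range(d_out)]
y_after = lora_forward(x, W, A, B_trained, alpha, r)

full_params = d_in * d_out          # parameters in the frozen weight
lora_params = r * (d_in + d_out)    # parameters in A and B
print("Same output at start:", y_start == y_base)
print(f"Full weight: {full_params} params, LoRA adapter: {lora_params} params")
```

The saving grows with layer size: the adapter costs `r * (d_in + d_out)` parameters, while the full weight costs `d_in * d_out`.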


In [None]:
# Imports and basic setup
import math
import os
import random

import torch
from datasets import Dataset, load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    Trainer,
    TrainingArguments,
)

try:
    from peft import LoraConfig, TaskType, get_peft_model
except ImportError:
    print("peft is required. Install it with: pip install peft")
    raise

In [None]:
# Reproducibility and device selection
def set_seed(seed: int = 42) -> None:
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)


set_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

## 1) Load a small instruction dataset

We will try to load a lightweight instruction dataset from Hugging Face (Alpaca). If that fails (for example, no internet), we will fall back to a tiny synthetic dataset so the notebook still runs.

In [None]:
dataset_name = "tatsu-lab/alpaca"

try:
    raw = load_dataset(dataset_name)
    train_data = raw["train"]
    print(f"Loaded dataset: {dataset_name} ({len(train_data)} examples)")
except Exception as exc:
    print("Dataset load failed, using a tiny synthetic dataset instead.")
    print("Reason:", exc)
    synthetic = [
        {
            "instruction": "Explain what a variable is in Python.",
            "response": "A variable is a named reference to a value. You can assign a value using = and reuse the name later.",
        },
        {
            "instruction": "Give two tips for learning PyTorch.",
            "response": "Start with small tensor exercises and read the official tutorials. Practice writing simple training loops.",
        },
        {
            "instruction": "Write a short poem about rain.",
            "response": "Rain taps soft rhythms on the roof, a silver hush above the street.",
        },
        {
            "instruction": "Summarize the water cycle in one sentence.",
            "response": "Water evaporates, condenses into clouds, then falls as precipitation and collects again.",
        },
    ]
    train_data = Dataset.from_list(synthetic)


def format_example(example):
    instruction = example.get("instruction", "")
    input_text = example.get("input", "")
    response = example.get("output", example.get("response", ""))

    if input_text:
        instruction = instruction + "\n\n### Input:\n" + input_text

    text = f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
    return {"text": text}


subset_size = min(2000, len(train_data))
train_subset = train_data.select(range(subset_size))
formatted = train_subset.map(format_example, remove_columns=train_subset.column_names)

print("Using subset size:", len(formatted))
print("Sample prompt:\n", formatted[0]["text"][:500])

## 2) Tokenization

We use the GPT-2 tokenizer and turn each prompt into token IDs. For causal language modeling, the labels are a copy of the input IDs: the model shifts them by one position internally, so each position is trained to predict the next token.
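
To make the next-token objective concrete, here is a small sketch in plain Python. The token IDs, the `PAD` value, and the helper `next_token_pairs` are hypothetical. The only detail taken from the real stack is the label value `-100`, which PyTorch's cross-entropy loss ignores by default and which `DataCollatorForLanguageModeling(mlm=False)` writes into padded positions.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default
PAD = 0              # hypothetical pad token ID for this toy example


def next_token_pairs(input_ids, labels):
    """List (context, target) pairs: the label one step ahead of each position."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]
        if target != IGNORE_INDEX:  # padded targets contribute no loss
            pairs.append((input_ids[: position + 1], target))
    return pairs


input_ids = [11, 12, 13, 14, PAD, PAD]
# Labels are a copy of the inputs, with padding masked out.
labels = [tok if tok != PAD else IGNORE_INDEX for tok in input_ids]

for context, target in next_token_pairs(input_ids, labels):
    print(context, "->", target)
```

Each printed line is one training signal: given the tokens so far, predict the next one. Padded positions produce no pairs, which is why padding does not distort the loss.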

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token


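def tokenize_function(batch):
    # Truncate long prompts; padding is left to the data collator so each
    # batch is padded only to its own longest sequence.
    # Labels are not set here: with mlm=False the collator copies input_ids
    # into labels after padding and masks pad positions with -100. A ragged
    # "labels" column is not padded by the tokenizer and fails to batch.
    return tokenizer(
        batch["text"],
        truncation=True,
        max_length=256,
    )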


tokenized = formatted.map(tokenize_function, batched=True, remove_columns=["text"])
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

print("Tokenized example length:", len(tokenized[0]["input_ids"]))

## 3) Load a pretrained model (GPT-2)

We start from a pretrained GPT-2 model. This is the core idea of modern practice: we reuse a strong base model and adapt it.

In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.config.pad_token_id = tokenizer.pad_token_id
print("Base model loaded.")

## 4) Attach LoRA adapters

We inject LoRA adapters into key GPT-2 layers. For GPT-2, the attention and MLP projections (implemented as `Conv1D` modules rather than `nn.Linear`) are named:
- `c_attn` and `c_proj` in attention
- `c_fc` and `c_proj` in the MLP

LoRA will train small low-rank matrices for those layers while keeping the original weights frozen.
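
As a rough check on how small the adapters are, the count of LoRA parameters for this configuration can be worked out by hand. The sketch below assumes the standard `gpt2` shapes (hidden size 768, 12 transformer blocks, a fused query/key/value projection of width 3 × 768, and a 4× MLP expansion) and rank 8. It also assumes that the target name `c_proj` matches both the attention and the MLP output projection, since PEFT matches target names by suffix.

```python
HIDDEN = 768    # gpt2 hidden size
N_LAYERS = 12   # gpt2 transformer blocks
RANK = 8        # LoRA rank r

# (in_features, out_features) of each targeted projection in one block
projections = {
    "attn.c_attn": (HIDDEN, 3 * HIDDEN),  # fused query/key/value
    "attn.c_proj": (HIDDEN, HIDDEN),
    "mlp.c_fc": (HIDDEN, 4 * HIDDEN),
    "mlp.c_proj": (4 * HIDDEN, HIDDEN),
}


def lora_params(in_features, out_features, rank):
    """A is (rank x in_features) and B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_block = sum(lora_params(i, o, RANK) for i, o in projections.values())
total = per_block * N_LAYERS

print(f"LoRA parameters per block: {per_block:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the adapters add 98,304 parameters per block and 1,179,648 in total, which is about 1% of the roughly 124M parameters in `gpt2`.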

In [None]:
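lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,               # rank of the low-rank update
    lora_alpha=16,     # the update is scaled by lora_alpha / r
    lora_dropout=0.05,
    # "c_proj" matches the output projection in both attention and the MLP
    target_modules=["c_attn", "c_proj", "c_fc"],
    # GPT-2's Conv1D stores weights as (fan_in, fan_out)
    fan_in_fan_out=True,
)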

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
model.to(device)

## 5) Training

We use `Trainer` to handle the PyTorch training loop. It wraps:
- Forward pass (compute model outputs)
- Loss computation
- Backpropagation and optimizer step
- Logging of training loss

This keeps the code short and readable while still using real PyTorch under the hood.
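
Batch size and gradient accumulation together decide how many optimizer steps a run takes, which in turn decides how often the loss is logged. The arithmetic below is a back-of-the-envelope sketch using this notebook's settings (batch size 2, 8 accumulation steps, 1 epoch, logging every 20 steps). The helper `optimizer_steps` is hypothetical, and `Trainer`'s exact rounding may differ between versions.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, epochs=1, num_devices=1):
    """Estimate optimizer updates: batches per epoch, grouped by accumulation."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum))
    return steps_per_epoch * epochs


effective_batch = 2 * 8                 # examples contributing to each update
steps = optimizer_steps(2000, 2, 8)     # the 2,000-example subset
fallback_steps = optimizer_steps(4, 2, 8)  # the 4-example synthetic fallback

print("Effective batch size:", effective_batch)
print("Optimizer steps (2,000 examples):", steps)
print("Loss logs at logging_steps=20:", steps // 20)
print("Optimizer steps (synthetic fallback):", fallback_steps)
```

With the synthetic fallback there is only about one optimizer step, so no training loss is logged at `logging_steps=20` and the adapters barely change; lower `logging_steps` or raise `num_train_epochs` if you want to see the loss move in that case.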

In [None]:
training_args = TrainingArguments(
    output_dir="lora-gpt2-sft",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=20,
    save_strategy="no",
    report_to=[],
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=data_collator,
)

trainer.train()

## 6) Inference tests

We run a few prompts and adjust sampling controls:
- **temperature**: higher = more randomness
- **top_p**: nucleus sampling (probability mass cutoff)
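
The effect of both controls can be seen on a toy distribution. The sketch below is plain Python with made-up logits and hypothetical helper names; it illustrates the idea only and is not the exact filtering code used inside `model.generate`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for index in order:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break
    return {index: probs[index] / mass for index in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"temperature={temperature}:", [round(p, 3) for p in probs])

nucleus = top_p_filter(softmax_with_temperature(logits, 1.0), top_p=0.9)
print("Tokens kept at top_p=0.9:", sorted(nucleus))
```

Lower temperatures concentrate probability on the top token, while higher ones flatten the distribution. At `top_p=0.9` the least likely token in this example is dropped before sampling, and the remaining three are renormalized.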


In [None]:
def generate_text(prompt, temperature=0.7, top_p=0.9, max_new_tokens=80):
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # no_grad disables gradient tracking during inference
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )

    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


test_prompts = [
    "### Instruction:\nGive three tips for focusing while studying.\n\n### Response:\n",
    "### Instruction:\nExplain what backpropagation is in one sentence.\n\n### Response:\n",
    "### Instruction:\nWrite a friendly one-line greeting.\n\n### Response:\n",
]

for prompt in test_prompts:
    print("-" * 60)
    print(generate_text(prompt, temperature=0.8, top_p=0.9))

## 7) Scaling notes and practical knobs

Real-world SFT uses much larger base models and much more data. LoRA makes this feasible on modest hardware because only the adapter weights need gradients and optimizer state, while the base weights stay frozen.

**Toy knobs (this notebook)**
- Base model: `gpt2`
- LoRA rank `r`: 4-16
- Dataset size: 2k examples
- Epochs: 1

**Production-ish knobs**
- Base model: 7B+ parameters
- LoRA rank `r`: 8-64 (task-dependent)
- Dataset size: tens or hundreds of thousands of examples
- Careful evaluation and safety filtering

Larger models and datasets generally improve capability, but compute and memory costs grow with both. LoRA lets you trade quality against speed and memory by adjusting the rank and the set of target modules.
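
One way to see the memory side of that trade-off is to estimate the training state that scales with the number of trainable parameters. The sketch below is a rough estimate under stated assumptions: fp32 values, an Adam-style optimizer (the `Trainer` default is AdamW) holding two moment estimates per parameter, roughly 124M parameters for `gpt2`, and the rank-8 adapters on `c_attn`, `c_proj`, and `c_fc` used in this notebook. It ignores the model weights themselves and the activations, which can dominate in practice.

```python
def training_state_megabytes(trainable_params, bytes_per_value=4):
    """Gradients plus Adam's two moment estimates, all stored in fp32."""
    values_per_param = 3  # gradient, first moment, second moment
    return trainable_params * values_per_param * bytes_per_value / 1e6


GPT2_PARAMS = 124_439_808  # approximate parameter count of gpt2

# 12 blocks x rank 8 x (in + out) for c_attn, attention c_proj, c_fc, MLP c_proj
LORA_PARAMS = 12 * 8 * ((768 + 2304) + (768 + 768) + (768 + 3072) + (3072 + 768))

full_mb = training_state_megabytes(GPT2_PARAMS)
lora_mb = training_state_megabytes(LORA_PARAMS)

print(f"Full fine-tuning: {full_mb:,.0f} MB of gradients and optimizer state")
print(f"LoRA (r=8):       {lora_mb:,.1f} MB of gradients and optimizer state")
```

For `gpt2` this comes to roughly 1.5 GB versus about 14 MB. The gap widens with model size, which is a large part of why LoRA is attractive for 7B+ models.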