# Chapter 13: Fine-Tuning Your Model

> "I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times."
> — **Bruce Lee**, Martial Artist

---

## What You'll Learn

- When to fine-tune versus using better prompts
- How to prepare instruction/response training data
- Why loss masking focuses training on what matters
- How LoRA adapters cut trainable parameters per layer (48× fewer for a 768×768 layer at rank 8)
- Evaluating before/after model behavior

---

## Setup

First, let's install required packages and check GPU availability.

In [None]:
# Install required packages
!pip install -q torch transformers tqdm matplotlib

In [None]:
# ===== IMPORTS =====
import math
import json
import os
from dataclasses import dataclass
from functools import partial

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.optim import AdamW
from transformers import AutoTokenizer
from tqdm import tqdm

# Check GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("WARNING: No GPU detected. Training will be slow.")
    print("Go to Runtime > Change runtime type > GPU")

In [None]:
# ===== REPRODUCIBILITY =====
def set_seed(seed=42):
    """Set all seeds for reproducibility."""
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

## 1. Model Components from Chapters 10-12

Next, let's bring in the MiniGPT model we built in previous chapters.

In [None]:
# ===== MULTI-HEAD ATTENTION (from Chapter 10) =====

class MultiHeadAttention(nn.Module):
    """Efficient multi-head attention (batches all heads together)."""

    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_head = d_model // num_heads

        self.qkv_proj = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        batch, seq, d_model = x.shape

        qkv = self.qkv_proj(x)
        qkv = qkv.reshape(batch, seq, 3, self.num_heads, self.d_head)
        qkv = qkv.permute(2, 0, 3, 1, 4)
        Q, K, V = qkv[0], qkv[1], qkv[2]

        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)

        if mask is not None:
            if mask.dim() == 2:
                mask = mask.unsqueeze(0).unsqueeze(0)
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attn_weights = F.softmax(scores, dim=-1)
        attn_weights = self.dropout(attn_weights)

        attn_output = attn_weights @ V
        attn_output = attn_output.transpose(1, 2).reshape(batch, seq, d_model)

        return self.out_proj(attn_output), attn_weights

print("MultiHeadAttention defined!")

In [None]:
# ===== FEEDFORWARD NETWORK (from Chapter 10) =====

class FeedForward(nn.Module):
    """Position-wise feedforward network."""

    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.fc1(x)
        x = F.gelu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

print("FeedForward defined!")

In [None]:
# ===== TRANSFORMER BLOCK (from Chapter 10) =====

class TransformerBlock(nn.Module):
    """Complete Transformer block (pre-norm style like GPT-2)."""

    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, num_heads, dropout)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        attn_out, attn_weights = self.attn(self.ln1(x), mask)
        x = x + self.dropout(attn_out)
        ffn_out = self.ffn(self.ln2(x))
        x = x + self.dropout(ffn_out)
        return x, attn_weights

print("TransformerBlock defined!")

In [None]:
# ===== GPT CONFIG (from Chapter 11) =====

@dataclass
class GPTConfig:
    """Configuration for MiniGPT model."""
    vocab_size: int = 50257
    max_seq_len: int = 1024
    embed_dim: int = 768
    num_heads: int = 12
    num_layers: int = 12
    d_ff: int = 3072
    dropout: float = 0.1

    def __post_init__(self):
        assert self.embed_dim % self.num_heads == 0, \
            f"embed_dim ({self.embed_dim}) must be divisible by num_heads ({self.num_heads})"

print("GPTConfig defined!")

In [None]:
# ===== MINIGPT MODEL (from Chapter 11) =====

class MiniGPT(nn.Module):
    """A minimal GPT-style language model."""

    def __init__(self, config: GPTConfig):
        super().__init__()
        self.config = config

        # Embeddings
        self.token_embed = nn.Embedding(config.vocab_size, config.embed_dim)
        self.pos_embed = nn.Embedding(config.max_seq_len, config.embed_dim)
        self.dropout = nn.Dropout(config.dropout)

        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(
                d_model=config.embed_dim,
                num_heads=config.num_heads,
                d_ff=config.d_ff,
                dropout=config.dropout
            )
            for _ in range(config.num_layers)
        ])

        # Final layer norm and LM head
        self.ln_f = nn.LayerNorm(config.embed_dim)
        self.lm_head = nn.Linear(config.embed_dim, config.vocab_size, bias=False)

        # Weight tying
        self.lm_head.weight = self.token_embed.weight

        # Initialize weights
        self._init_weights()

    def _init_weights(self):
        nn.init.normal_(self.token_embed.weight, std=0.02)
        nn.init.normal_(self.pos_embed.weight, std=0.02)

    def forward(self, token_ids, return_attention=False):
        batch, seq = token_ids.shape
        device = token_ids.device

        tok_emb = self.token_embed(token_ids)
        positions = torch.arange(seq, device=device)
        pos_emb = self.pos_embed(positions)
        x = self.dropout(tok_emb + pos_emb)

        mask = torch.tril(torch.ones(seq, seq, device=device))

        attention_weights = []
        for block in self.blocks:
            x, attn = block(x, mask)
            if return_attention:
                attention_weights.append(attn)

        x = self.ln_f(x)
        logits = self.lm_head(x)

        if return_attention:
            return logits, attention_weights
        return logits

    def generate(self, input_ids, max_new_tokens=50, temperature=1.0, do_sample=True):
        """Generate text autoregressively."""
        self.eval()
        for _ in range(max_new_tokens):
            # Truncate to max_seq_len if needed
            idx_cond = input_ids[:, -self.config.max_seq_len:]
            
            with torch.no_grad():
                logits = self(idx_cond)
                logits = logits[:, -1, :] / temperature
            
            if do_sample:
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
            else:
                next_token = logits.argmax(dim=-1, keepdim=True)
            
            input_ids = torch.cat([input_ids, next_token], dim=1)
        
        return input_ids

print("MiniGPT class defined!")

## 2. Create Base Model

We'll create a smaller model configuration for faster training.

In [None]:
# Small config for fast training
config = GPTConfig(
    vocab_size=50257,
    max_seq_len=128,
    embed_dim=256,
    num_heads=4,
    num_layers=4,
    d_ff=1024,
    dropout=0.1
)

# Create model
base_model = MiniGPT(config).to(device)
print(f"Parameters: {sum(p.numel() for p in base_model.parameters()):,}")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
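
As a sanity check on the parameter count printed by the cell above, the total can be derived by hand from the layer shapes. The sketch below is plain Python arithmetic. The helper name `count_minigpt_params` is made up for illustration, and it assumes the MiniGPT layout defined earlier: bias-free attention projections, biased feedforward layers, and an LM head tied to the token embedding.

```python
def count_minigpt_params(vocab_size, max_seq_len, embed_dim, num_layers, d_ff):
    """Hand-count MiniGPT parameters (hypothetical helper mirroring the classes above)."""
    token_embed = vocab_size * embed_dim   # shared with lm_head via weight tying
    pos_embed = max_seq_len * embed_dim
    layer_norm = 2 * embed_dim             # weight + bias

    # qkv_proj + out_proj, both without bias
    attn = embed_dim * 3 * embed_dim + embed_dim * embed_dim
    # fc1 + fc2, both with bias
    ffn = (embed_dim * d_ff + d_ff) + (d_ff * embed_dim + embed_dim)
    block = 2 * layer_norm + attn + ffn

    # Embeddings + all blocks + final LayerNorm (ln_f)
    return token_embed + pos_embed + num_layers * block + layer_norm


total = count_minigpt_params(50257, 128, 256, 4, 1024)
print(f"{total:,}")
```

For the small config this comes to 16,054,016 parameters, roughly 80% of which sit in the token embedding table.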

## 3. Baseline Evaluation (BEFORE Fine-Tuning)

Let's see how the base model responds to our FAQ prompts.

In [None]:
@torch.no_grad()
def generate_response(model, prompt, tokenizer, max_new_tokens=50, temperature=0.8):
    """Generate a response to an instruction prompt."""
    model.eval()
    full_prompt = f"[INST] {prompt} [/INST]"
    input_ids = tokenizer.encode(full_prompt, return_tensors='pt').to(device)
    
    output_ids = model.generate(
        input_ids, 
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=True
    )
    
    response = tokenizer.decode(output_ids[0][len(input_ids[0]):])
    return response.strip()

# Test on FAQ prompts
test_prompts = [
    "What does TechStartup Inc do?",
    "How do I reset my password?",
    "What is SmartScheduler?"
]

print("BEFORE FINE-TUNING (random weights):")
print("="*60)
for prompt in test_prompts:
    response = generate_response(base_model, prompt, tokenizer)
    print(f"\nQ: {prompt}")
    print(f"A: {response[:100]}...")
print("\n(Random gibberish - the model knows nothing about TechStartup Inc!)")

## 4. Prepare Fine-Tuning Dataset

Create instruction/response pairs for the TechStartup Inc FAQ.

In [None]:
# ===== FAQ DATASET =====
# TechStartup Inc - a fictional AI productivity company

FAQ_DATA = [
    # Company basics
    {"instruction": "What does TechStartup Inc do?",
     "response": "TechStartup Inc builds AI-powered productivity tools for small businesses."},
    {"instruction": "When was TechStartup Inc founded?",
     "response": "TechStartup Inc was founded in 2020."},
    {"instruction": "Where is TechStartup Inc located?",
     "response": "TechStartup Inc is headquartered in Amsterdam, Netherlands."},
    {"instruction": "Who founded TechStartup Inc?",
     "response": "TechStartup Inc was founded by Maria Chen and David Okonkwo."},
    {"instruction": "How many employees does TechStartup Inc have?",
     "response": "TechStartup Inc has approximately 50 employees."},
    
    # Products
    {"instruction": "What is SmartScheduler?",
     "response": "SmartScheduler is our AI calendar assistant that automatically finds the best meeting times."},
    {"instruction": "How much does SmartScheduler cost?",
     "response": "SmartScheduler costs 10 euros per month for individuals, or 8 euros per user for teams."},
    {"instruction": "What is TeamSync?",
     "response": "TeamSync is our collaboration platform that uses AI to prioritize tasks and track projects."},
    {"instruction": "How much does TeamSync cost?",
     "response": "TeamSync costs 15 euros per user per month, with discounts for annual plans."},
    {"instruction": "What products does TechStartup Inc offer?",
     "response": "TechStartup Inc offers SmartScheduler for calendar management and TeamSync for team collaboration."},
    
    # Support
    {"instruction": "How do I reset my password?",
     "response": "Click 'Forgot Password' on the login page, enter your email, and follow the link we send you."},
    {"instruction": "Can I export my data?",
     "response": "Yes, go to Settings > Data > Export. You can download your data as CSV or JSON."},
    {"instruction": "How do I cancel my subscription?",
     "response": "Go to Settings > Billing > Cancel Subscription. Your access continues until the end of your billing period."},
    {"instruction": "How do I contact support?",
     "response": "Email support@techstartupinc.com or use the chat widget in the app. We respond within 24 hours."},
    {"instruction": "Is there a free trial?",
     "response": "Yes, all products include a 14-day free trial. No credit card required."},
    
    # Features
    {"instruction": "Does SmartScheduler integrate with Google Calendar?",
     "response": "Yes, SmartScheduler integrates with Google Calendar, Outlook, and Apple Calendar."},
    {"instruction": "Can I use TeamSync offline?",
     "response": "Yes, TeamSync has offline mode. Changes sync automatically when you reconnect."},
    {"instruction": "Is my data secure?",
     "response": "Yes, we use end-to-end encryption and are SOC 2 compliant. Your data is stored in EU data centers."},
    {"instruction": "Does TeamSync have a mobile app?",
     "response": "Yes, TeamSync has iOS and Android apps available in the app stores."},
    {"instruction": "Can I invite guests to meetings with SmartScheduler?",
     "response": "Yes, you can invite external guests. They receive a booking link and don't need an account."},
    
    # Billing
    {"instruction": "What payment methods do you accept?",
     "response": "We accept credit cards, PayPal, and bank transfers for annual plans."},
    {"instruction": "Can I get a refund?",
     "response": "Yes, we offer a 30-day money-back guarantee for all paid plans."},
    {"instruction": "Do you offer discounts for nonprofits?",
     "response": "Yes, registered nonprofits get 50% off all products. Contact sales@techstartupinc.com."},
    {"instruction": "Is there a student discount?",
     "response": "Yes, students with a valid .edu email get 40% off individual plans."},
    {"instruction": "Can I upgrade my plan anytime?",
     "response": "Yes, you can upgrade anytime. You'll be charged the prorated difference."},
    
    # More variations
    {"instruction": "Tell me about TechStartup Inc.",
     "response": "TechStartup Inc builds AI-powered productivity tools for small businesses."},
    {"instruction": "What's the price of SmartScheduler?",
     "response": "SmartScheduler costs 10 euros per month for individuals, or 8 euros per user for teams."},
    {"instruction": "I forgot my password.",
     "response": "Click 'Forgot Password' on the login page, enter your email, and follow the link we send you."},
    {"instruction": "How can I reach customer service?",
     "response": "Email support@techstartupinc.com or use the chat widget in the app. We respond within 24 hours."},
    {"instruction": "Where are you based?",
     "response": "TechStartup Inc is headquartered in Amsterdam, Netherlands."},
    
    # Even more variations for training
    {"instruction": "What services does TechStartup offer?",
     "response": "TechStartup Inc offers SmartScheduler for calendar management and TeamSync for team collaboration."},
    {"instruction": "How do I get started with SmartScheduler?",
     "response": "Sign up at techstartupinc.com/smartscheduler for a free 14-day trial. No credit card required."},
    {"instruction": "What makes TechStartup different?",
     "response": "We focus on AI-powered simplicity for small businesses, with affordable pricing and excellent support."},
    {"instruction": "Do you have an API?",
     "response": "Yes, both SmartScheduler and TeamSync have REST APIs. Documentation at docs.techstartupinc.com."},
    {"instruction": "Can I white-label your products?",
     "response": "Yes, we offer white-label solutions for enterprise customers. Contact sales@techstartupinc.com."},
    
    # More support variations
    {"instruction": "My account is locked. What do I do?",
     "response": "Wait 15 minutes for automatic unlock, or contact support@techstartupinc.com for immediate help."},
    {"instruction": "How do I change my email address?",
     "response": "Go to Settings > Account > Email. You'll need to verify the new email address."},
    {"instruction": "Can I have multiple users on one account?",
     "response": "Yes, team plans support multiple users. Each user gets their own login."},
    {"instruction": "What happens when my trial ends?",
     "response": "Your account becomes read-only. Subscribe to regain full access. No data is deleted."},
    {"instruction": "Do you support single sign-on (SSO)?",
     "response": "Yes, enterprise plans include SSO with SAML 2.0 and OAuth 2.0 support."},
    
    # Product feature details
    {"instruction": "How does SmartScheduler find meeting times?",
     "response": "It analyzes participants' calendars, time zones, and preferences to suggest optimal slots."},
    {"instruction": "Can TeamSync assign tasks automatically?",
     "response": "Yes, AI can suggest task assignments based on workload and skills. You approve before assignment."},
    {"instruction": "Does SmartScheduler handle time zones?",
     "response": "Yes, it automatically detects and converts time zones for all participants."},
    {"instruction": "Can I set recurring meetings in SmartScheduler?",
     "response": "Yes, you can create daily, weekly, or monthly recurring meetings with flexible patterns."},
    {"instruction": "Does TeamSync have Gantt charts?",
     "response": "Yes, TeamSync includes Gantt charts, Kanban boards, and calendar views."},
    
    # Additional company info
    {"instruction": "Is TechStartup Inc hiring?",
     "response": "Yes! Check our careers page at techstartupinc.com/careers for open positions."},
    {"instruction": "Does TechStartup have investors?",
     "response": "Yes, we're backed by several venture capital firms and are growing rapidly."},
    {"instruction": "What's TechStartup's mission?",
     "response": "To help small businesses save time with AI-powered tools that are simple and affordable."},
    {"instruction": "Is TechStartup profitable?",
     "response": "We're focused on sustainable growth and are on track for profitability."},
    
    # Duplicate-style variations for robustness
    {"instruction": "Password reset help",
     "response": "Click 'Forgot Password' on the login page, enter your email, and follow the link we send you."},
    {"instruction": "SmartScheduler pricing",
     "response": "SmartScheduler costs 10 euros per month for individuals, or 8 euros per user for teams."},
    {"instruction": "TeamSync pricing",
     "response": "TeamSync costs 15 euros per user per month, with discounts for annual plans."},
    {"instruction": "Contact info",
     "response": "Email support@techstartupinc.com or use the chat widget in the app. We respond within 24 hours."},
    {"instruction": "Free trial info",
     "response": "Yes, all products include a 14-day free trial. No credit card required."},
    
    # Additional support scenarios
    {"instruction": "How do I delete my account?",
     "response": "Go to Settings > Account > Delete Account. This action is permanent and cannot be undone."},
    {"instruction": "Can I pause my subscription?",
     "response": "Yes, you can pause for up to 3 months. Go to Settings > Billing > Pause Subscription."},
    {"instruction": "How do I add team members?",
     "response": "Go to Settings > Team > Invite Members. Enter their email addresses to send invitations."},
    {"instruction": "What's the difference between SmartScheduler and TeamSync?",
     "response": "SmartScheduler focuses on calendar and meeting management. TeamSync handles project tasks and collaboration."},
    {"instruction": "Do your products work together?",
     "response": "Yes! SmartScheduler and TeamSync integrate seamlessly. Meetings can become tasks and vice versa."},
    
    # Additional general questions
    {"instruction": "What languages does TechStartup support?",
     "response": "Our products are available in English, Dutch, German, French, and Spanish."},
    {"instruction": "Can I import data from other tools?",
     "response": "Yes, we support importing from Google Calendar, Asana, Trello, and many other tools."},
    {"instruction": "Is there training available?",
     "response": "Yes, we offer free webinars and video tutorials at learn.techstartupinc.com."},
    {"instruction": "Do you have a partner program?",
     "response": "Yes, agencies and consultants can join our partner program for commissions and co-marketing."},
    {"instruction": "What's new at TechStartup?",
     "response": "Check our blog at techstartupinc.com/blog for the latest product updates and company news."},
    {"instruction": "How do I report a bug?",
     "response": "Use the feedback button in the app or email bugs@techstartupinc.com with details."},
    {"instruction": "Can I request a feature?",
     "response": "Yes! Submit feature requests at feedback.techstartupinc.com. We review all suggestions."},
    {"instruction": "What's your uptime guarantee?",
     "response": "We guarantee 99.9% uptime. Check status.techstartupinc.com for real-time status."},
    {"instruction": "How often do you release updates?",
     "response": "We release updates weekly. Major features are announced on our blog."},
    {"instruction": "Can I use TechStartup products for personal use?",
     "response": "Absolutely! Our individual plans are perfect for personal productivity."},
]

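# Split into train and test (80/20).
# FAQ_DATA holds 69 examples, so a fixed slice at index 80 would leave the test set empty.
split_idx = int(0.8 * len(FAQ_DATA))
train_data = FAQ_DATA[:split_idx]
test_data = FAQ_DATA[split_idx:]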

print(f"Training examples: {len(train_data)}")
print(f"Test examples: {len(test_data)}")
print(f"\nSample training example:")
print(f"  Instruction: {train_data[0]['instruction']}")
print(f"  Response: {train_data[0]['response']}")
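
The split above keeps the list in its original order, so the test set is simply the last entries. One possible alternative, sketched here under assumptions, is a seeded shuffle followed by a check for instructions that land in both splits. The helper names `split_examples` and `shared_instructions` and the toy data are hypothetical and only stand in for `FAQ_DATA`.

```python
import random


def split_examples(examples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the examples with a fixed seed, then split by fraction."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def shared_instructions(train, test):
    """Return normalized instructions that appear in both splits (possible leakage)."""
    def normalize(example):
        return example["instruction"].strip().lower()

    return {normalize(ex) for ex in train} & {normalize(ex) for ex in test}


# Toy stand-in for FAQ_DATA (hypothetical entries)
toy_data = [
    {"instruction": f"Question {i}?", "response": f"Answer {i}."}
    for i in range(10)
]

train_split, test_split = split_examples(toy_data)
overlap = shared_instructions(train_split, test_split)

print(len(train_split), len(test_split))
print(overlap)
```

A shuffled split mixes the FAQ categories across both sets. Whether paraphrased questions with identical answers should be allowed on both sides depends on what the evaluation is meant to measure.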

In [None]:
# ===== DATASET CLASS =====

INST_START = "[INST]"
INST_END = "[/INST]"

class InstructionDataset(Dataset):
    """Dataset for instruction-response pairs with loss masking."""
    
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        example = self.data[idx]
        instruction = example['instruction']
        response = example['response']
        
        # Format: [INST] instruction [/INST] response
        prompt = f"{INST_START} {instruction} {INST_END} "
        full_text = prompt + response
        
        # Tokenize
        encoded = self.tokenizer(
            full_text,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        input_ids = encoded['input_ids'].squeeze(0)
        
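        # Find where response starts for loss masking.
        # Tokenize the prompt WITHOUT its trailing space: GPT-2's BPE attaches
        # a space to the word that follows it, so a trailing space would become
        # an extra token and the first response token would be masked too.
        prompt_encoded = self.tokenizer(
            prompt.rstrip(),
            max_length=self.max_length,
            truncation=True,
            return_tensors='pt'
        )
        response_start = prompt_encoded['input_ids'].shape[1]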
        
        # Create labels: -100 for instruction tokens (ignored in loss)
        labels = input_ids.clone()
        labels[:response_start] = -100
        # Also mask padding
        labels[labels == self.tokenizer.pad_token_id] = -100
        
        return {
            'input_ids': input_ids,
            'labels': labels
        }

# Create datasets
train_dataset = InstructionDataset(train_data, tokenizer, max_length=128)
test_dataset = InstructionDataset(test_data, tokenizer, max_length=128)

# Create dataloaders
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=8, shuffle=False)

# Check one example
sample = train_dataset[0]
print(f"Input IDs shape: {sample['input_ids'].shape}")
print(f"Labels shape: {sample['labels'].shape}")
print(f"\nNumber of masked tokens (instruction): {(sample['labels'] == -100).sum().item()}")
print(f"Number of trained tokens (response): {(sample['labels'] != -100).sum().item()}")
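
To see which positions actually contribute to the loss, it helps to line the tokens up against their labels after the one-position shift used in next-token prediction. The following is a toy illustration in plain Python, with words standing in for token ids. The helper `build_labels` is hypothetical and only imitates what `InstructionDataset` does.

```python
IGNORE = -100  # same sentinel value the dataset class uses


def build_labels(tokens, response_start, pad="<pad>"):
    """Copy tokens into labels, masking the prompt and the padding."""
    return [
        IGNORE if i < response_start or tok == pad else tok
        for i, tok in enumerate(tokens)
    ]


# Toy sequence: words stand in for token ids
tokens = ["[INST]", "Hi", "[/INST]", "Hello", "there", "<pad>", "<pad>"]
labels = build_labels(tokens, response_start=3)

# Next-token prediction: the model at position i is scored on token i + 1
for context, target in zip(tokens[:-1], labels[1:]):
    status = "ignored" if target == IGNORE else f"learn -> {target}"
    print(f"after {context!r:10} {status}")

trained = sum(target != IGNORE for target in labels[1:])
print("positions contributing to loss:", trained)
```

Only the two response words are scored here. Note that the position right after `[/INST]` is the one that teaches the model how to begin an answer, which is why `response_start` must not overshoot.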

## 5. Define LoRA Components

Build the LoRA adapter layer that adds trainable parameters without modifying base weights.

### Why LoRA Instead of Full Fine-Tuning?

Full fine-tuning has a problem: **catastrophic forgetting**. When you update all weights for a new task, the model can "forget" what it learned during pre-training. LoRA mitigates this by keeping base weights *frozen* and adding small trainable matrices *alongside* them, so removing the adapter restores the original model exactly.

Think of it like reading glasses: your eyes (base model) stay the same, but the glasses (LoRA) add a small correction for specific tasks.
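
The size of the saving follows from a simple count. As a rough sketch, take a weight matrix with $d_{\text{in}}$ inputs and $d_{\text{out}}$ outputs and an adapter of rank $r$ (symbols introduced here for illustration). The ratio of full fine-tuning parameters to LoRA parameters is then:

$$
\frac{\text{full}}{\text{LoRA}}
= \frac{d_{\text{in}}\, d_{\text{out}}}{r\,(d_{\text{in}} + d_{\text{out}})}
= \frac{d}{2r} \quad \text{when } d_{\text{in}} = d_{\text{out}} = d
$$

With $d = 256$ and $r = 8$ this gives $16\times$; with $d = 768$ (the default `embed_dim` in `GPTConfig`) and $r = 8$ it gives $48\times$.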

In [None]:
# ===== LORA LINEAR LAYER =====

class LoRALinear(nn.Module):
    """
    A linear layer with Low-Rank Adaptation (LoRA).
    
    Think of this like adding reading glasses to your eyes:
    - Your eyes (base layer) stay the same
    - The glasses (LoRA) add a small correction
    - Together they work better than eyes alone
    
    Math: output = base(x) + (x @ B @ A) * scale
    """
    
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        
        # The original layer (we'll freeze this)
        self.base = nn.Linear(in_features, out_features, bias=False)
        
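        # LoRA matrices
        # B: (in_features, rank): "down" projection (compress)
        # A: (rank, out_features): "up" projection (expand)
        # Note: the LoRA paper labels these the other way round (A down, B up);
        # only the names differ, the math is the same.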
        self.lora_B = nn.Parameter(torch.zeros(in_features, rank))
        self.lora_A = nn.Parameter(torch.zeros(rank, out_features))
        
        # Scaling factor
        self.scale = alpha / rank
        
        # Initialize B with Kaiming init (scales values based on layer size
        # to prevent exploding/vanishing gradients)
        nn.init.kaiming_uniform_(self.lora_B, a=math.sqrt(5))
        # Initialize A to zeros — so B @ A = 0 at start (LoRA is a no-op!)
        nn.init.zeros_(self.lora_A)
    
    def freeze_base(self):
        """Freeze the base layer so only LoRA is trained."""
        self.base.weight.requires_grad = False
    
    def forward(self, x):
        # x shape: (batch, sequence, in_features)
        
        # Original output from frozen weights
        base_output = self.base(x)  # (batch, seq, out_features)
        
        # LoRA path:
        # x @ B: (batch, seq, in_features) @ (in_features, rank)
        #      = (batch, seq, rank)  — compressed!
        # ... @ A: (batch, seq, rank) @ (rank, out_features)
        #        = (batch, seq, out_features)  — expanded back
        lora_output = (x @ self.lora_B @ self.lora_A) * self.scale
        
        return base_output + lora_output

print("LoRALinear defined!")

# Demonstrate parameter savings
in_features, out_features, rank = 256, 256, 8
full_params = in_features * out_features
lora_params = in_features * rank + rank * out_features

print(f"\nParameter comparison (256x256 layer):")
print(f"  Full fine-tuning: {full_params:,} parameters")
print(f"  LoRA (rank=8):    {lora_params:,} parameters")
print(f"  Savings:          {full_params/lora_params:.1f}x fewer!")
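
Because the LoRA path is just an extra matrix product added to the base output, a trained adapter can be folded back into the base weight so that inference needs no extra computation. The sketch below uses raw tensors rather than the `LoRALinear` class so that it runs on its own. The shapes and the random "trained" adapter values are assumptions chosen for illustration.

```python
import torch

torch.manual_seed(0)
in_features, out_features, rank, alpha = 16, 32, 4, 8
scale = alpha / rank

W = torch.randn(out_features, in_features)   # nn.Linear stores weight as (out, in)
lora_B = torch.randn(in_features, rank)      # down projection
lora_A = torch.zeros(rank, out_features)     # up projection, zero at init
x = torch.randn(2, 5, in_features)           # (batch, seq, in_features)

# At initialization the adapter contributes nothing
base_out = x @ W.T
init_out = base_out + (x @ lora_B @ lora_A) * scale
init_is_noop = torch.allclose(base_out, init_out)
print("no-op at init:", init_is_noop)

# Pretend training moved A away from zero, then fold the adapter into W
lora_A = torch.randn(rank, out_features) * 0.01
adapter_out = base_out + (x @ lora_B @ lora_A) * scale
W_merged = W + scale * (lora_B @ lora_A).T   # transpose to match the (out, in) layout
merged_out = x @ W_merged.T
merge_matches = torch.allclose(adapter_out, merged_out, atol=1e-4)
print("merged weight matches adapter path:", merge_matches)
```

The two paths should agree up to floating-point rounding. Keeping the adapter separate instead makes it possible to swap task-specific adapters over one shared base model.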

## 6. Apply LoRA to Model

Replace attention projections with LoRA versions and freeze the base model.

In [None]:
def add_lora_to_model(model, rank=8, alpha=16):
    """
    Add LoRA adapters to attention QKV projections.
    
    The original LoRA paper found that adapting the Query and Value
    projections works well for a fixed parameter budget -
    they control WHAT to attend to and WHAT information to extract.
    In MiniGPT, Q/K/V are computed by a single 'qkv_proj' layer,
    so LoRA learns corrections for all three together.
    
    This 'surgical' approach:
    1. Replaces QKV projections with LoRA versions
    2. Copies original weights
    3. Freezes everything except LoRA parameters
    """
    
    # Find and replace QKV projections in attention layers
    for name, module in model.named_modules():
        if isinstance(module, MultiHeadAttention):
            # Get dimensions
            in_features = module.qkv_proj.in_features
            out_features = module.qkv_proj.out_features
            
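            # Create LoRA version on the same device as the layer it replaces
            # (otherwise the LoRA matrices stay on CPU while the model is on GPU)
            lora_qkv = LoRALinear(in_features, out_features, rank, alpha)
            lora_qkv = lora_qkv.to(module.qkv_proj.weight.device)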
            
            # Copy original weights
            lora_qkv.base.weight.data = module.qkv_proj.weight.data.clone()
            
            # Freeze base
            lora_qkv.freeze_base()
            
            # Replace
            module.qkv_proj = lora_qkv
    
    # Freeze everything except LoRA parameters
    for name, param in model.named_parameters():
        if 'lora_' not in name:
            param.requires_grad = False
    
    return model

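# Create a fresh model that starts from the same weights as base_model,
# so the before/after comparison differs only by the LoRA adapters
model = MiniGPT(config).to(device)
model.load_state_dict(base_model.state_dict())
model = add_lora_to_model(model, rank=8, alpha=16)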

# Count parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())

print(f"Total parameters:     {total:,}")
print(f"Trainable (LoRA):     {trainable:,}")
print(f"Trainable percentage: {100*trainable/total:.2f}%")

# Show what's trainable
print("\nTrainable parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name}: {param.shape}")
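
The numbers printed above can be cross-checked with a few lines of arithmetic. This is a sketch under stated assumptions: the helper `lora_param_count` is hypothetical, and `base_total` is the hand-derived size of the small config (embeddings, four blocks, final LayerNorm), so treat it as an assumption if your config differs.

```python
def lora_param_count(num_layers, in_features, out_features, rank):
    """Trainable LoRA parameters when one adapter is attached per layer."""
    per_adapter = in_features * rank + rank * out_features
    return num_layers * per_adapter


# Small config from this chapter: 4 layers, qkv_proj maps 256 -> 768, rank 8
trainable = lora_param_count(num_layers=4, in_features=256, out_features=768, rank=8)

# Assumed base model size for the small config (hand-derived from layer shapes)
base_total = 16_054_016
total = base_total + trainable
share = 100 * trainable / total

print(f"Trainable: {trainable:,}")
print(f"Total:     {total:,}")
print(f"Share:     {share:.2f}%")
```

Under these assumptions the adapters add 32,768 trainable parameters, about 0.20% of the model.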

## 7. Fine-Tuning Loop

Train using the 5-step recipe with masked loss.

In [None]:
def compute_masked_loss(logits, labels):
    """
    Compute cross-entropy loss, ignoring positions where labels == -100.
    
    This is like grading only the answer portion of an exam,
    not the question that was copied from the prompt.
    """
    # Shift for next-token prediction
    # .contiguous() ensures memory is laid out sequentially for .view()
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    
    # Cross-entropy with ignore_index=-100
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100  # PyTorch ignores these positions automatically!
    )
    
    return loss


def finetune_epoch(model, dataloader, optimizer, device):
    """
    Fine-tune for one epoch using the 5-step recipe.
    """
    model.train()
    total_loss = 0
    
    progress = tqdm(dataloader, desc="Fine-tuning")
    for batch in progress:
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        
        # ===== THE 5-STEP RECIPE =====
        
        # Step 1: Zero gradients
        optimizer.zero_grad()
        
        # Step 2: Forward pass
        logits = model(input_ids)
        
        # Step 3: Compute MASKED loss
        loss = compute_masked_loss(logits, labels)
        
        # Step 4: Backward pass
        loss.backward()
        
        # Step 5: Update weights (only LoRA!)
        optimizer.step()
        
        total_loss += loss.item()
        progress.set_postfix(loss=f"{loss.item():.4f}")
    
    return total_loss / len(dataloader)

print("Training functions defined!")
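
To make the effect of `ignore_index=-100` concrete, the masked loss can be written out by hand. The following is a minimal sketch in plain Python, not how PyTorch implements it. The function `masked_cross_entropy` is a hypothetical stand-in, and the one-position shift is left out to keep the example short.

```python
import math

IGNORE = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE.

    logits: list of per-position score lists; labels: list of class indices.
    """
    losses = []
    for scores, label in zip(logits, labels):
        if label == IGNORE:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])  # -log softmax(scores)[label]
    return sum(losses) / len(losses)


# Three positions over a 4-token vocabulary; the first is a masked prompt token
logits = [
    [9.0, 0.0, 0.0, 0.0],  # confidently "wrong", but masked
    [0.0, 0.0, 0.0, 0.0],  # uniform prediction
    [0.0, 0.0, 0.0, 0.0],  # uniform prediction
]
labels = [IGNORE, 2, 1]

loss = masked_cross_entropy(logits, labels)
print(f"{loss:.4f}")  # uniform over 4 classes gives ln(4), about 1.3863
```

The average is taken over the unmasked positions only, so the masked first row has no influence on the result. If every label in a batch is masked there is nothing to average; this sketch would raise a `ZeroDivisionError`, and `F.cross_entropy` returns `nan` in that situation.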

In [None]:
# ===== TRAINING =====

# Hyperparameters
num_epochs = 3
learning_rate = 1e-4  # Higher LR for LoRA is common

# Optimizer (only trainable params)
optimizer = AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=learning_rate,
    weight_decay=0.01
)

# Training loop
train_losses = []

print("Starting fine-tuning...")
print(f"Training on {len(train_data)} examples for {num_epochs} epochs\n")

for epoch in range(num_epochs):
    print(f"\nEpoch {epoch + 1}/{num_epochs}")
    print("-" * 40)
    
    loss = finetune_epoch(model, train_loader, optimizer, device)
    train_losses.append(loss)
    
    print(f"Average Loss: {loss:.4f}")

print("\nFine-tuning complete!")

## 8. Plot Training Progress

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
plt.plot(range(1, len(train_losses) + 1), train_losses, 'b-o', linewidth=2, markersize=8)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Training Loss', fontsize=12)
plt.title('Fine-Tuning Loss Curve', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xticks(range(1, len(train_losses) + 1))
plt.show()

print(f"\nLoss decreased from {train_losses[0]:.4f} to {train_losses[-1]:.4f}")
print(f"Improvement: {(1 - train_losses[-1]/train_losses[0])*100:.1f}%")

## 9. Evaluation (AFTER Fine-Tuning)

Now for the satisfying part: checking how far the fine-tuned model has moved from the baseline.

In [None]:
def evaluate_model(model, test_examples, tokenizer, device):
    """Evaluate fine-tuned model on test examples."""
    model.eval()
    results = []
    
    for example in test_examples:
        instruction = example['instruction']
        expected = example['response']
        
        # Generate response (sampled at temperature 0.7, so outputs and
        # exact-match scores can vary between runs)
        generated = generate_response(model, instruction, tokenizer, max_new_tokens=50, temperature=0.7)
        
        # Exact match (case insensitive)
        is_exact = generated.lower().strip() == expected.lower().strip()
        
        # Word overlap
        gen_words = set(generated.lower().split())
        exp_words = set(expected.lower().split())
        overlap = len(gen_words & exp_words) / max(len(exp_words), 1)
        
        results.append({
            'instruction': instruction,
            'expected': expected,
            'generated': generated,
            'exact_match': is_exact,
            'word_overlap': overlap
        })
    
    # Summary (max(..., 1) avoids division by zero on an empty test set)
    accuracy = sum(r['exact_match'] for r in results) / max(len(results), 1)
    avg_overlap = sum(r['word_overlap'] for r in results) / max(len(results), 1)
    
    return {
        'exact_match_accuracy': accuracy,
        'average_word_overlap': avg_overlap,
        'detailed_results': results
    }

# Evaluate
print("Evaluating on test set...")
results = evaluate_model(model, test_data, tokenizer, device)

print(f"\n===== EVALUATION RESULTS =====")
print(f"Exact Match Accuracy: {results['exact_match_accuracy']*100:.1f}%")
print(f"Average Word Overlap: {results['average_word_overlap']*100:.1f}%")
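
To make the two metrics concrete, here is a minimal standalone sketch of the same exact-match and word-overlap calculations on a made-up pair of strings. The helper name `word_overlap` and the example strings are illustrative and not part of the notebook. Note that overlap is measured against the expected words only, so extra generated words are not penalized.

```python
def word_overlap(generated: str, expected: str) -> float:
    """Fraction of expected words that also appear in the generated text."""
    gen_words = set(generated.lower().split())
    exp_words = set(expected.lower().split())
    return len(gen_words & exp_words) / max(len(exp_words), 1)

# Hypothetical strings, for illustration only
expected = "Click Forgot Password on the login page"
generated = "click forgot password on the settings page please"

# Same case-insensitive exact-match rule as evaluate_model
exact = generated.lower().strip() == expected.lower().strip()
overlap = word_overlap(generated, expected)

print(f"exact_match={exact}, word_overlap={overlap:.2f}")
```

A response can score high on overlap while failing exact match, which is why the evaluation reports both numbers.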

In [None]:
# Show detailed results
print("\n===== DETAILED TEST RESULTS =====")
for r in results['detailed_results']:
    status = "✓" if r['exact_match'] else "○"
    print(f"\n{status} Q: {r['instruction']}")
    print(f"  Expected:  {r['expected']}")
    print(f"  Generated: {r['generated'][:100]}..." if len(r['generated']) > 100 else f"  Generated: {r['generated']}")
    print(f"  Overlap:   {r['word_overlap']*100:.0f}%")

In [None]:
# ===== BEFORE vs AFTER COMPARISON =====

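# Create fresh base model for comparison.
# This one is newly initialized; if your base model was loaded from a
# checkpoint, load the same checkpoint here so "BEFORE" is a fair baseline.
base_model_fresh = MiniGPT(config).to(device)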

print("\n" + "="*70)
print("BEFORE vs AFTER Fine-Tuning")
print("="*70)

comparison_prompts = [
    "What does TechStartup Inc do?",
    "How do I reset my password?",
    "What is SmartScheduler?"
]

for prompt in comparison_prompts:
    # Use the same temperature for both models so the comparison is fair
    before = generate_response(base_model_fresh, prompt, tokenizer, temperature=0.7)
    after = generate_response(model, prompt, tokenizer, temperature=0.7)
    
    print(f"\nQ: {prompt}")
    print(f"  BEFORE: {before[:80]}..." if len(before) > 80 else f"  BEFORE: {before}")
    print(f"  AFTER:  {after[:80]}..." if len(after) > 80 else f"  AFTER:  {after}")
    print("-"*70)

print("\nSame architecture. Same code. Fine-tuning makes all the difference!")

## 10. Save and Load LoRA Weights

In [None]:
def save_lora_weights(model, filepath):
    """Save only the LoRA parameters (tiny file!)."""
    lora_state_dict = {
        name: param for name, param in model.state_dict().items()
        if 'lora_' in name
    }
    torch.save(lora_state_dict, filepath)
    
    size_kb = os.path.getsize(filepath) / 1024
    print(f"LoRA weights saved: {filepath} ({size_kb:.1f} KB)")
    return size_kb

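def load_lora_weights(model, filepath):
    """Load LoRA weights into a model that already has LoRA layers."""
    lora_state_dict = torch.load(filepath, map_location=device)
    # strict=False: the file holds only LoRA tensors, so base weights are
    # reported as missing. Unexpected keys mean the LoRA layout does not match.
    result = model.load_state_dict(lora_state_dict, strict=False)
    if result.unexpected_keys:
        raise KeyError(f"Unexpected keys in LoRA file: {result.unexpected_keys}")
    print(f"LoRA weights loaded from: {filepath}")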

# Save LoRA weights
lora_size = save_lora_weights(model, 'techstartup_lora.pt')

# Compare to full model size
torch.save(model.state_dict(), 'full_model.pt')
full_size = os.path.getsize('full_model.pt') / (1024 * 1024)
print(f"Full model saved: full_model.pt ({full_size:.1f} MB)")

print(f"\nLoRA is {full_size*1024/lora_size:.0f}x smaller than the full model!")
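
The size gap comes down to parameter counts. As a rough sketch, assume each adapted layer is a square `d_model × d_model` linear layer and LoRA adds two small matrices of shapes `rank × d_model` and `d_model × rank`. The value `d_model = 256` below is a placeholder, not this chapter's configuration, and the helper names are made up for illustration. The whole-model ratio will differ, because the full checkpoint also stores embeddings and layers that carry no adapter.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix (bias ignored)."""
    return d_in * d_out

d_model = 256  # placeholder dimension, not taken from the chapter's config
rank = 8       # the rank used when adding LoRA in this chapter

dense = dense_param_count(d_model, d_model)
lora = lora_param_count(d_model, d_model, rank)

print(f"Frozen weight: {dense:,} params")
print(f"LoRA adapter:  {lora:,} params")
print(f"Per-layer ratio: {dense / lora:.0f}x")
```

Doubling the rank doubles the adapter size, which is worth keeping in mind when comparing ranks in Exercise 1.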

In [None]:
# Demonstrate loading
print("\n===== Testing Load/Save =====")

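# Create fresh model with the same LoRA layout used for training
new_model = MiniGPT(config).to(device)
new_model = add_lora_to_model(new_model, rank=8, alpha=16)

# The adapter file holds only LoRA tensors, so the new model also needs the
# same frozen base weights. Here we copy them from the trained model; in
# deployment you would load them from the base checkpoint instead.
base_state = {k: v for k, v in model.state_dict().items() if 'lora_' not in k}
new_model.load_state_dict(base_state, strict=False)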

# Test before loading
before_load = generate_response(new_model, "What does TechStartup Inc do?", tokenizer)
print(f"Before loading LoRA: {before_load[:60]}...")

# Load weights
load_lora_weights(new_model, 'techstartup_lora.pt')

# Test after loading
after_load = generate_response(new_model, "What does TechStartup Inc do?", tokenizer)
print(f"After loading LoRA:  {after_load}")

print("\nLoRA adapters successfully saved and loaded!")

## Summary

**What we built:**

1. **Instruction/response dataset** with loss masking
2. **LoRA adapters** that train <1% of parameters
3. **Fine-tuning loop** using the 5-step recipe
4. **Evaluation** showing dramatic before/after improvement
5. **Adapter saving/loading** for deployment

**Key insights:**

- Loss masking focuses training on what matters (responses, not instructions)
- LoRA achieves great results with 48× fewer trainable parameters
- Freezing base weights prevents **catastrophic forgetting** (losing pre-trained knowledge)
- Quality of data matters more than quantity for fine-tuning
- Adapters are tiny and can be swapped for different tasks
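
The first insight can be checked by hand. The sketch below is a plain-Python illustration, not the notebook's `compute_masked_loss`. It assumes instruction positions are marked with a label of `-100` (the default `ignore_index` of PyTorch's cross-entropy) and averages the loss over response positions only. The logits and labels are toy values.

```python
import math

IGNORE_INDEX = -100  # assumed marker for masked (instruction) positions

def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def masked_loss(all_logits, labels):
    """Mean NLL over positions whose label is not IGNORE_INDEX."""
    losses = [
        token_nll(lg, y)
        for lg, y in zip(all_logits, labels)
        if y != IGNORE_INDEX
    ]
    return sum(losses) / max(len(losses), 1)

# Toy sequence: 2 instruction positions (masked) + 2 response positions, vocab size 3
all_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]
labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 1]

loss = masked_loss(all_logits, labels)

# Changing predictions at masked positions leaves the loss untouched
perturbed = [[0.0, 0.0, 9.0], [9.0, 0.0, 0.0]] + all_logits[2:]
loss_perturbed = masked_loss(perturbed, labels)

print(f"masked loss: {loss:.4f}, after perturbing instruction logits: {loss_perturbed:.4f}")
```

Because the masked positions contribute nothing to the average, the gradient carries no signal about how well the model predicts the instruction text.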

**Next:** Chapter 14 explores prompt engineering: getting better outputs without changing model weights at all!

## Exercises

### Exercise 1: Different LoRA Ranks

Try rank=4 and rank=16. How does each one affect the training loss and evaluation results compared with rank=8?

In [None]:
# YOUR CODE HERE
# 1. Create model with rank=4
# 2. Train for 3 epochs
# 3. Compare loss and evaluation to rank=8

### Exercise 2: Create Your Own FAQ

Create a dataset about a topic you know (your school, a hobby, etc.).

In [None]:
# YOUR CODE HERE
# 1. Create 30+ instruction/response pairs
# 2. Fine-tune the model
# 3. Test with your own questions

### Exercise 3: Loss Masking Ablation

What happens if we DON'T mask the instruction tokens?

In [None]:
# YOUR CODE HERE
# 1. Modify the dataset to NOT mask instruction tokens (all labels = actual tokens)
# 2. Train the model
# 3. Compare results to masked version
# 4. What do you observe?