In [None]:
#@title 🎧 Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="17rFuCNZUUY1xHrMq1WTamV-JWh_IDZe8", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/04_00_intro.mp3"))


In [None]:
#@title 🎧 Code Walkthrough: Setup Code
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_01_setup_code.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

In [None]:
#@title 🎧 Listen: Why LoRA Matters
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_02_why_lora_matters.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


# LoRA Fine-Tuning from Scratch

*Part 4 of the Vizuara series on Inference & Scaling*
*Estimated time: 65 minutes*

## 1. Why Does This Matter?

You have a pretrained language model that can generate fluent text. But it was trained on generic internet data -- it does not know how to follow your specific instructions, write in your company's style, or answer domain-specific questions accurately.

The obvious solution is fine-tuning: continue training on task-specific data. The problem? A 7-billion-parameter model needs about 28 GB just to store its weights in FP32. Add gradients (another 28 GB) and Adam optimizer states (another 56 GB, since Adam keeps two FP32 moment buffers per parameter), and you need roughly **112 GB of GPU memory** before counting activations -- more than a single 80 GB A100 can offer.
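
The 112 GB figure is plain arithmetic, and a short sketch makes the breakdown explicit. The helper name `full_finetune_memory_gb` and its parameters are illustrative and not part of any library. It assumes 1 GB = 1e9 bytes, FP32 storage throughout, and two Adam moment buffers per parameter, and it ignores activation memory.

```python
def full_finetune_memory_gb(n_params, bytes_per_value=4, adam_states=2):
    """Rough memory estimate (GB) for full fine-tuning; activations are ignored."""
    gb = 1e9  # assumption: decimal gigabytes
    weights = n_params * bytes_per_value / gb
    gradients = n_params * bytes_per_value / gb
    optimizer = n_params * bytes_per_value * adam_states / gb
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


estimate = full_finetune_memory_gb(7e9)
for name, value in estimate.items():
    print(f"{name:>10}: {value:.0f} GB")
```

Passing `bytes_per_value=2` for the weights alone would model FP16 storage, which is why some estimates for the same model come out lower.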

**LoRA** (Low-Rank Adaptation) solves this beautifully. Instead of updating all 7 billion parameters, LoRA freezes the original weights and adds tiny trainable matrices that capture the task-specific changes. The key insight: fine-tuning weight updates tend to be approximately **low-rank**, meaning they can be compressed into much smaller matrices with little loss in quality.

In this notebook, we will:
1. Understand **why** weight updates are low-rank (with empirical evidence)
2. Implement LoRA layers from scratch
3. Inject LoRA into a pretrained model
4. Fine-tune on a new text corpus (next-character prediction) and compare against full fine-tuning
5. Analyze the parameter efficiency

In [None]:
#@title 🎧 Code Walkthrough: Initial Setup
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_03_initial_setup.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Setup -- run this cell first
!pip install -q torch matplotlib numpy

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import time
from collections import OrderedDict

%matplotlib inline

torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

In [None]:
#@title 🎧 Listen: Building Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_04_building_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 2. Building Intuition

Imagine you are an expert chef who has mastered French cuisine. Someone asks you to also cook Japanese food. You do not need to re-learn everything from scratch -- you already know knife skills, temperature control, sauce making, and plating. You just need to learn a few new techniques: how to prepare rice for sushi, how to make dashi, and how to slice sashimi.

The "updates" to your cooking ability are **low-dimensional**: a small number of new skills layered on top of your vast existing knowledge. LoRA applies this same idea to neural networks.

Mathematically, when you fine-tune a weight matrix $W \in \mathbb{R}^{d \times d}$ (where $d = 4096$), the change $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ has about 16.8 million entries. But if you compute the singular values of $\Delta W$, you typically find that most of the "energy" (the sum of squared singular values) is concentrated in just a handful of them. The effective rank of $\Delta W$ is often 4, 8, or 16 -- not 4096.

This means we can write $\Delta W \approx BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, with $r \ll d$. Instead of updating $d^2 = 16{,}777{,}216$ parameters, we update $2 \times d \times r = 2 \times 4096 \times 16 = 131{,}072$ parameters. That is a **128x reduction**.
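
To make the "energy in a few singular values" idea concrete, here is a minimal NumPy sketch. It does not measure a real fine-tuned model: the matrix `delta_w` is an assumed toy construction (a rank-8 product plus small noise), chosen only to show how the SVD exposes low-rank structure and how a factored form $BA$ is read off from it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 256, 8

# Synthetic "weight update": a rank-8 matrix plus small full-rank noise.
# Assumption: this toy matrix stands in for a real Delta W.
delta_w = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_w += 0.01 * rng.normal(size=(d, d))

# Singular values come back sorted from largest to smallest.
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
energy = np.cumsum(S**2) / np.sum(S**2)

# Best rank-r approximation: keep only the top r singular triplets.
r = 8
B = U[:, :r] * S[:r]   # shape (d, r)
A = Vt[:r, :]          # shape (r, d)
rel_error = np.linalg.norm(delta_w - B @ A) / np.linalg.norm(delta_w)

print(f"Energy in top {r} singular values: {energy[r - 1]:.4f}")
print(f"Relative error of rank-{r} approximation: {rel_error:.4f}")
print(f"Parameters: full = {d * d:,}, factored = {2 * d * r:,}")
```

Note that LoRA does not compute an SVD during training. It learns $B$ and $A$ directly by gradient descent; the SVD here only illustrates why a small $r$ can be enough.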

In [None]:
#@title 🎧 Listen: Mathematics
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_05_mathematics.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 3. The Mathematics

### The LoRA Decomposition

For a pretrained weight matrix $W_0 \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, LoRA replaces the fine-tuned weight with:

$$W = W_0 + \frac{\alpha}{r} \cdot BA$$

where:
- $B \in \mathbb{R}^{d_{\text{out}} \times r}$ is initialized to **zeros**
- $A \in \mathbb{R}^{r \times d_{\text{in}}}$ is initialized with a **random Gaussian**
- $r$ is the rank (typically 4, 8, or 16)
- $\alpha$ is a scaling hyperparameter (typically equal to $r$ or $2r$)

The forward pass becomes:

$$h = W_0 x + \frac{\alpha}{r} \cdot BAx$$

**Why this initialization?** $B = 0$ ensures that at the start of training, $\Delta W = BA = 0$, so the model behaves exactly like the pretrained model. Training gradually learns the update.
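
The effect of the zero initialization can be checked numerically. The following is a small NumPy sketch with arbitrary toy dimensions (all sizes and the `0.1` perturbation are assumptions chosen for illustration); it evaluates the forward-pass formula above before and after $B$ moves away from zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 5, 2, 4

W0 = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = 0.01 * rng.normal(size=(r, d_in))    # small random Gaussian
B = np.zeros((d_out, r))                 # zeros, so B @ A == 0 at the start
x = rng.normal(size=d_in)

# At initialization the LoRA path contributes nothing.
h_base = W0 @ x
h_lora = W0 @ x + (alpha / r) * (B @ (A @ x))
print("Same output at init:", np.allclose(h_base, h_lora))

# Once B is no longer zero, the LoRA path starts to change the output.
B_updated = B + 0.1
h_updated = W0 @ x + (alpha / r) * (B_updated @ (A @ x))
print("Same output after update:", np.array_equal(h_base, h_updated))
```

Only one of the two factors is set to zero. If both $A$ and $B$ started at zero, the gradient with respect to each would be proportional to the other and therefore also zero, so training could not move away from the starting point.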

### Parameter Count

For a single weight matrix of shape $(d, d)$ with rank $r$:

$$\text{LoRA params} = d \times r + r \times d = 2dr$$

$$\text{Compression ratio} = \frac{d^2}{2dr} = \frac{d}{2r}$$

For $d = 4096$ and $r = 16$: compression ratio = $4096 / 32 = 128\times$.
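
The same count can be tabulated for several ranks. This is a small sketch in plain Python; `lora_param_count` is an illustrative helper name, not a library function.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B (d_out x rank) plus A (rank x d_in)."""
    return d_out * rank + rank * d_in


d = 4096
for rank in (4, 8, 16, 32):
    lora = lora_param_count(d, d, rank)
    ratio = d * d / lora
    print(f"rank={rank:>2}: LoRA params = {lora:>9,}, compression = {ratio:.0f}x")
```

Doubling the rank doubles the adapter size and halves the compression ratio, so the ratio falls from 512x at rank 4 to 64x at rank 32 for this matrix size.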

### Where to Apply LoRA

In a transformer, each attention layer has four weight matrices: $W_Q, W_K, W_V, W_O$. The original LoRA paper found that, for a fixed parameter budget, adapting $W_Q$ and $W_V$ together gave better results than spending the whole budget on a single matrix at a higher rank. Later work such as QLoRA reported that adapting all linear layers, including the feed-forward ones, helps close the gap to full fine-tuning.
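
"A given parameter budget" means that rank and the number of adapted matrices trade off against each other. The sketch below only counts parameters for a few hypothetical configurations of one attention layer (the helper name and the chosen configurations are assumptions for illustration). It says nothing about which configuration is more accurate.

```python
def attention_lora_params(d_model, rank, n_targets):
    """LoRA parameters for one attention layer with n_targets adapted d x d matrices."""
    return n_targets * 2 * d_model * rank


d_model = 4096
configs = {
    "W_q only at rank 16": attention_lora_params(d_model, rank=16, n_targets=1),
    "W_q, W_v at rank 8": attention_lora_params(d_model, rank=8, n_targets=2),
    "W_q, W_k, W_v, W_o at rank 4": attention_lora_params(d_model, rank=4, n_targets=4),
}
for name, count in configs.items():
    print(f"{name:<30} -> {count:,} params per layer")
```

All three configurations cost the same number of trainable parameters, so the choice between them is about where the capacity is placed, not how much of it there is.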

In [None]:
#@title 🎧 Code Walkthrough: LoRA Layer Impl
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_06_lora_layer_impl.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 4. Let's Build It -- Component by Component

### 4.1 The LoRA Layer

In [None]:
class LoRALinear(nn.Module):
    """
    A linear layer with LoRA adaptation.

    The forward pass computes: y = W_0 @ x + (alpha/r) * B @ A @ x
    where W_0 is frozen and B, A are trainable.
    """

    def __init__(self, original_linear, rank=8, alpha=16):
        super().__init__()
        self.original = original_linear
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank

        d_out, d_in = original_linear.weight.shape

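        # LoRA matrices, created on the same device and dtype as the wrapped layer
        # so the adapter also works when the model already lives on the GPU
        factory = dict(device=original_linear.weight.device,
                       dtype=original_linear.weight.dtype)
        self.A = nn.Parameter(torch.randn(rank, d_in, **factory) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank, **factory))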

        # Freeze original weights
        self.original.weight.requires_grad = False
        if self.original.bias is not None:
            self.original.bias.requires_grad = False

    def forward(self, x):
        # Original path (frozen)
        base_output = self.original(x)

        # LoRA path (trainable)
        lora_output = F.linear(F.linear(x, self.A), self.B) * self.scaling

        return base_output + lora_output

    def merge(self):
        """Merge LoRA weights into original for efficient inference."""
        with torch.no_grad():
            self.original.weight += self.scaling * (self.B @ self.A)
        return self.original

    @property
    def lora_params(self):
        return self.A.numel() + self.B.numel()


# Test
original = nn.Linear(256, 256, bias=False)
lora_layer = LoRALinear(original, rank=8, alpha=16)

x = torch.randn(1, 10, 256)
y = lora_layer(x)
print(f"Input: {x.shape}")
print(f"Output: {y.shape}")
print(f"Original params: {original.weight.numel():,}")
print(f"LoRA params: {lora_layer.lora_params:,}")
print(f"Compression: {original.weight.numel() / lora_layer.lora_params:.1f}x")

In [None]:
#@title 🎧 Code Walkthrough: Injecting LoRA
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_07_injecting_lora.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### 4.2 Injecting LoRA into a Pretrained Model

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        B, T, _ = x.shape
        h = self.ln1(x)

        Q = self.W_q(h).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(h).view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(h).view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        mask = torch.triu(torch.ones(T, T, device=x.device, dtype=torch.bool), diagonal=1)
        scores.masked_fill_(mask, float('-inf'))
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, V)
        out = out.transpose(1, 2).contiguous().view(B, T, self.d_model)
        out = self.W_o(out)

        x = x + out
        x = x + self.ffn(self.ln2(x))
        return x


class SmallGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=4, n_layers=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.d_model = d_model

    def forward(self, x):
        B, T = x.shape
        h = self.tok_emb(x) + self.pos_emb(torch.arange(T, device=x.device))
        for layer in self.layers:
            h = layer(h)
        h = self.ln_f(h)
        return self.head(h)


def inject_lora(model, rank=8, alpha=16, target_modules=('W_q', 'W_v')):
    """
    Inject LoRA adapters into specified modules of the model.
    Freezes all original parameters and only makes LoRA params trainable.
    """
    # First, freeze everything
    for param in model.parameters():
        param.requires_grad = False

    lora_count = 0
    # Iterate over a snapshot, since submodules are replaced inside the loop
    for name, module in list(model.named_modules()):
        for target in target_modules:
            if hasattr(module, target):
                original = getattr(module, target)
                if isinstance(original, nn.Linear):
                    lora_layer = LoRALinear(original, rank=rank, alpha=alpha)
                    setattr(module, target, lora_layer)
                    lora_count += 1

    # Count parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"Injected LoRA into {lora_count} layers")
    print(f"Total params: {total_params:,}")
    print(f"Trainable params: {trainable_params:,}")
    print(f"Trainable fraction: {trainable_params/total_params*100:.2f}%")

    return model


# Create and inject
vocab_size = 1000
model = SmallGPT(vocab_size, d_model=256, n_heads=4, n_layers=4).to(device)
print("Before LoRA:")
print(f"  Total params: {sum(p.numel() for p in model.parameters()):,}")
print()

model = inject_lora(model, rank=8, alpha=16)

In [None]:
#@title 🎧 What to Look For: Visualization Params
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_08_visualization_params.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### Visualization Checkpoint: Parameter Comparison

In [None]:
# Visualize the parameter efficiency of LoRA

d_values = [256, 512, 1024, 2048, 4096]
ranks = [4, 8, 16, 32]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Left: Full vs LoRA params for different model dimensions
for r in ranks:
    lora_params = [2 * d * r for d in d_values]
    full_params = [d * d for d in d_values]
    ax1.plot(d_values, [l/f*100 for l, f in zip(lora_params, full_params)],
             'o-', label=f'rank={r}', linewidth=2, markersize=6)

ax1.set_xlabel('Model Dimension (d)')
ax1.set_ylabel('LoRA params as % of Full')
ax1.set_title('LoRA Parameter Efficiency')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_yscale('log')
ax1.set_ylim(0.01, 100)

# Right: Memory comparison for a 7B model
categories = ['Model\nWeights', 'Full FT\nGradients', 'Full FT\nOptimizer', 'LoRA\nParams', 'LoRA\nGradients', 'LoRA\nOptimizer']
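# 7B model, FP16 weights, FP32 gradients/optimizer (so the full fine-tuning total
# here is 98 GB, not the 112 GB all-FP32 figure quoted in Section 1)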
model_mem = 14  # 7B * 2 bytes (FP16)
full_grad = 28  # 7B * 4 bytes (FP32)
full_opt = 56   # 7B * 2 states * 4 bytes (FP32)
# LoRA: 0.1% of params = 7M params
lora_params_mem = 0.028  # 7M * 4 bytes
lora_grad = 0.028
lora_opt = 0.056

mem_full = [model_mem, full_grad, full_opt, 0, 0, 0]
mem_lora = [model_mem, 0, 0, lora_params_mem, lora_grad, lora_opt]

x = np.arange(len(categories))
width = 0.35
ax2.bar(x - width/2, mem_full, width, label='Full Fine-Tuning', color='#e74c3c', alpha=0.8)
ax2.bar(x + width/2, mem_lora, width, label='LoRA', color='#2ecc71', alpha=0.8)
ax2.set_xticks(x)
ax2.set_xticklabels(categories, fontsize=9)
ax2.set_ylabel('GPU Memory (GB)')
ax2.set_title('Memory: Full Fine-Tuning vs LoRA (7B Model)')
ax2.legend()

# Add total annotations
total_full = model_mem + full_grad + full_opt
total_lora = model_mem + lora_params_mem + lora_grad + lora_opt
ax2.text(1, full_grad + 1, f'Total: {total_full:.0f} GB', ha='center', fontweight='bold', color='#e74c3c')
ax2.text(4, model_mem + 1, f'Total: {total_lora:.1f} GB', ha='center', fontweight='bold', color='#2ecc71')

plt.tight_layout()
plt.show()

print(f"Full fine-tuning: {total_full:.0f} GB")
print(f"LoRA fine-tuning: {total_lora:.1f} GB")
print(f"Memory saving: {total_full/total_lora:.0f}x")

In [None]:
#@title 🎧 Before You Start: Todo1 Ranks
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_09_todo1_ranks.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 5. Your Turn

### TODO 1: Experiment with Different Ranks

Fine-tune the model with ranks 1, 4, 8, 16, and 32. Plot the final loss vs. rank to understand the rank-performance tradeoff.
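
Before running the experiment, it helps to know how many trainable parameters each rank implies, so that loss can also be read against adapter size. The sketch below assumes the `SmallGPT` configuration used earlier (`d_model=256`, 4 layers) and the default LoRA targets `W_q` and `W_v`. The helper name is illustrative.

```python
def smallgpt_lora_trainable(rank, d_model=256, n_layers=4, n_targets=2):
    """Trainable LoRA parameters when n_targets square matrices are wrapped per block."""
    per_matrix = 2 * d_model * rank  # A is (rank, d_model), B is (d_model, rank)
    return n_layers * n_targets * per_matrix


for rank in (1, 4, 8, 16, 32):
    print(f"rank={rank:>2}: {smallgpt_lora_trainable(rank):>7,} trainable params")
```

The count grows linearly with rank, so any improvement in loss from rank 16 to rank 32 has to justify a doubling of the adapter size.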

In [None]:
# TODO: Train the model with different LoRA ranks and compare
# final training loss.
#
# For each rank:
# 1. Create a fresh model
# 2. Inject LoRA with the given rank
# 3. Train for a fixed number of steps
# 4. Record the final loss
#
# YOUR CODE HERE
# ranks_to_test = [1, 4, 8, 16, 32]
# final_losses = []
#
# for rank in ranks_to_test:
#     # Create fresh model
#     test_model = SmallGPT(vocab_size, d_model=256, n_heads=4, n_layers=4).to(device)
#     test_model = inject_lora(test_model, rank=rank, alpha=rank*2)
#
#     # Create optimizer (only trainable params)
#     optimizer = torch.optim.Adam(
#         [p for p in test_model.parameters() if p.requires_grad],
#         lr=1e-3
#     )
#
#     # Train for N steps on synthetic data
#     ...
#
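#     trainable = sum(p.numel() for p in test_model.parameters() if p.requires_grad)
#     final_losses.append(loss.item())  # store a float, not a tensor
#     print(f"Rank {rank}: loss = {loss.item():.4f}, params = {trainable:,}")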
#
# plt.plot(ranks_to_test, final_losses, 'bo-')
# plt.xlabel('LoRA Rank')
# plt.ylabel('Final Loss')
# plt.title('Loss vs LoRA Rank')
# plt.grid(True, alpha=0.3)
# plt.show()

In [None]:
#@title 🎧 Before You Start: Todo2 Merge
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_10_todo2_merge.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### TODO 2: Implement LoRA Merging and Verify Equivalence

After training, merge the LoRA weights back into the original model and verify that the merged model produces identical outputs.
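
The identity behind merging can be checked on a single matrix before touching the model. This NumPy sketch uses random stand-ins for trained factors (all shapes and values are assumptions for illustration) and compares the two-path forward pass with a single multiplication by the merged weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4
scaling = alpha / r

W0 = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))     # stands in for a trained A
B = rng.normal(size=(d_out, r))    # stands in for a trained, nonzero B
x = rng.normal(size=(3, d_in))     # batch of 3 input rows

# Unmerged: base path plus the low-rank path, as in the LoRA forward pass.
y_unmerged = x @ W0.T + scaling * (x @ A.T) @ B.T

# Merged: fold the update into one matrix, then apply a single matmul.
W_merged = W0 + scaling * (B @ A)
y_merged = x @ W_merged.T

max_diff = np.abs(y_unmerged - y_merged).max()
print(f"Max difference: {max_diff:.2e}")
```

The two results differ only by floating-point rounding. In the model itself, the low-rank path must be dropped once the weights are merged; if the wrapper stays in place, the update is applied twice.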

In [None]:
# TODO: Merge LoRA weights and verify output equivalence.
#
# Steps:
# 1. Get output from the LoRA model on a test input
# 2. Merge LoRA weights into the original linear layers
# 3. Get output from the merged model on the same test input
# 4. Verify they are identical (or very close, within FP precision)
#
# YOUR CODE HERE
# test_input = torch.randint(0, vocab_size, (1, 32), device=device)
#
# # Get LoRA output
# model.eval()
# with torch.no_grad():
#     lora_output = model(test_input)
#
# # Merge LoRA into base weights
# for module in model.modules():
#     if isinstance(module, LoRALinear):
#         merged = module.merge()
#         # Replace the LoRA module with the merged linear
#
# # Get merged output
# with torch.no_grad():
#     merged_output = model(test_input)
#
# # Compare
# diff = (lora_output - merged_output).abs().max().item()
# print(f"Max difference: {diff:.2e}")
# print(f"Outputs are equivalent: {diff < 1e-5}")

In [None]:
#@title 🎧 Transition: Putting It Together Intro
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_11_putting_it_together_intro.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 6. Putting It All Together

Let us run a complete fine-tuning experiment: pretrain a model on one task, then use LoRA to adapt it to a different task.

In [None]:
#@title 🎧 Code Walkthrough: Pretrain Task
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_12_pretrain_task.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# Task 1: Pretrain on character-level next-token prediction
pretrain_text = ("the quick brown fox jumps over the lazy dog " * 50 +
                 "a stitch in time saves nine " * 50 +
                 "all that glitters is not gold " * 50)

chars = sorted(set(pretrain_text))
c2i = {c: i for i, c in enumerate(chars)}
i2c = {i: c for c, i in c2i.items()}
v = len(chars)

pretrain_data = torch.tensor([c2i[c] for c in pretrain_text], device=device)

# Pretrain
base_model = SmallGPT(v, d_model=128, n_heads=4, n_layers=3, max_len=128).to(device)
optimizer = torch.optim.Adam(base_model.parameters(), lr=3e-3)

base_model.train()
seq_len = 48
pretrain_losses = []

for epoch in range(100):
    total_loss, n = 0, 0
    for i in range(0, len(pretrain_data) - seq_len - 1, seq_len):
        x = pretrain_data[i:i+seq_len].unsqueeze(0)
        y = pretrain_data[i+1:i+seq_len+1].unsqueeze(0)
        logits = base_model(x)
        loss = F.cross_entropy(logits.view(-1, v), y.view(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        n += 1
    avg_loss = total_loss / n
    pretrain_losses.append(avg_loss)
    if (epoch + 1) % 25 == 0:
        print(f"Pretrain epoch {epoch+1}: loss = {avg_loss:.4f}")

# Save pretrained weights
# (loop variables renamed so they do not shadow the vocabulary size `v`)
pretrained_state = {name: tensor.clone() for name, tensor in base_model.state_dict().items()}
print(f"\nPretraining complete. Final loss: {pretrain_losses[-1]:.4f}")

In [None]:
#@title 🎧 Code Walkthrough: Finetune Task
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_13_finetune_task.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### Fine-tuning on a New Task

In [None]:
# Task 2: Fine-tune on different text (Shakespeare-style)
finetune_text = ("to be or not to be that is the question " * 50 +
                 "whether tis nobler in the mind to suffer " * 50 +
                 "the slings and arrows of outrageous fortune " * 50)

# Only use characters that exist in our vocabulary
finetune_text = ''.join(c for c in finetune_text if c in c2i)
finetune_data = torch.tensor([c2i[c] for c in finetune_text], device=device)

# Method 1: Full fine-tuning
full_ft_model = SmallGPT(v, d_model=128, n_heads=4, n_layers=3, max_len=128).to(device)
full_ft_model.load_state_dict(pretrained_state)
full_ft_optimizer = torch.optim.Adam(full_ft_model.parameters(), lr=1e-3)

# Method 2: LoRA fine-tuning
lora_model = SmallGPT(v, d_model=128, n_heads=4, n_layers=3, max_len=128).to(device)
lora_model.load_state_dict(pretrained_state)
lora_model = inject_lora(lora_model, rank=8, alpha=16)
lora_optimizer = torch.optim.Adam(
    [p for p in lora_model.parameters() if p.requires_grad], lr=1e-3
)

# Train both
n_epochs = 80
full_ft_losses = []
lora_losses = []

for epoch in range(n_epochs):
    # Full fine-tuning
    full_ft_model.train()
    ft_loss_sum, n = 0, 0
    for i in range(0, len(finetune_data) - seq_len - 1, seq_len):
        x = finetune_data[i:i+seq_len].unsqueeze(0)
        y = finetune_data[i+1:i+seq_len+1].unsqueeze(0)
        logits = full_ft_model(x)
        loss = F.cross_entropy(logits.view(-1, v), y.view(-1))
        full_ft_optimizer.zero_grad()
        loss.backward()
        full_ft_optimizer.step()
        ft_loss_sum += loss.item()
        n += 1
    full_ft_losses.append(ft_loss_sum / n)

    # LoRA fine-tuning
    lora_model.train()
    lora_loss_sum, n = 0, 0
    for i in range(0, len(finetune_data) - seq_len - 1, seq_len):
        x = finetune_data[i:i+seq_len].unsqueeze(0)
        y = finetune_data[i+1:i+seq_len+1].unsqueeze(0)
        logits = lora_model(x)
        loss = F.cross_entropy(logits.view(-1, v), y.view(-1))
        lora_optimizer.zero_grad()
        loss.backward()
        lora_optimizer.step()
        lora_loss_sum += loss.item()
        n += 1
    lora_losses.append(lora_loss_sum / n)

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}: Full FT loss = {full_ft_losses[-1]:.4f}, "
              f"LoRA loss = {lora_losses[-1]:.4f}")

print("\nTraining complete!")

In [None]:
#@title 🎧 What to Look For: Results Visualization
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_14_results_visualization.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 7. Training and Results

In [None]:
# Compare the two approaches
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss curves
ax1.plot(full_ft_losses, label='Full Fine-Tuning', linewidth=2, color='#e74c3c')
ax1.plot(lora_losses, label='LoRA (rank=8)', linewidth=2, color='#2ecc71')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Fine-Tuning Loss: Full vs LoRA')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Parameter comparison
full_params = sum(p.numel() for p in full_ft_model.parameters())
lora_trainable = sum(p.numel() for p in lora_model.parameters() if p.requires_grad)
lora_total = sum(p.numel() for p in lora_model.parameters())

categories = ['Full FT\nTrainable', 'LoRA\nTrainable', 'LoRA\nFrozen']
values = [full_params, lora_trainable, lora_total - lora_trainable]
colors = ['#e74c3c', '#2ecc71', '#95a5a6']

ax2.bar(categories, values, color=colors, alpha=0.8, edgecolor='white', linewidth=2)
ax2.set_ylabel('Number of Parameters')
ax2.set_title('Trainable Parameters Comparison')

for i, val in enumerate(values):
    ax2.text(i, val + max(values)*0.02, f'{val:,}', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nFull FT: {full_params:,} trainable params, final loss = {full_ft_losses[-1]:.4f}")
print(f"LoRA:    {lora_trainable:,} trainable params, final loss = {lora_losses[-1]:.4f}")
print(f"LoRA trains {lora_trainable/full_params*100:.2f}% of parameters")

In [None]:
#@title 🎧 Narration: Final Output Generation
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_15_final_output_generation.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 8. Final Output

In [None]:
# Generate text from both models to see the qualitative difference
full_ft_model.eval()
lora_model.eval()

prompt_text = "to be or "
prompt_ids = torch.tensor([[c2i[c] for c in prompt_text]], device=device)

def generate_simple(model, prompt, n_tokens=80):
    tokens = prompt.clone()
    model.eval()
    with torch.no_grad():
        for _ in range(n_tokens):
            logits = model(tokens)
            next_logits = logits[:, -1, :] / 0.8  # temperature
            probs = F.softmax(next_logits, dim=-1)
            next_tok = torch.multinomial(probs, 1)
            tokens = torch.cat([tokens, next_tok], dim=1)
    return ''.join([i2c[t.item()] for t in tokens[0]])

print("=" * 60)
print("Full Fine-Tuning:")
print(f"  {generate_simple(full_ft_model, prompt_ids)}")
print()
print("LoRA Fine-Tuning (rank=8):")
print(f"  {generate_simple(lora_model, prompt_ids)}")
print("=" * 60)

# Summary
print(f"\n--- Summary ---")
print(f"Full FT:  {full_params:>10,} trainable params | loss = {full_ft_losses[-1]:.4f}")
print(f"LoRA:     {lora_trainable:>10,} trainable params | loss = {lora_losses[-1]:.4f}")
print(f"Reduction: {full_params/lora_trainable:.0f}x fewer trainable parameters")

In [None]:
#@title 🎧 Narration: Reflection And Next Steps
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/04_16_reflection_and_next_steps.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 9. Reflection and Next Steps

### What We Learned

1. **Fine-tuning updates are low-rank**: When we fine-tune a large model, the weight changes $\Delta W$ can often be well-approximated by a product of two small matrices $BA$. This is not a theoretical guarantee -- it is an empirical observation reported across many tasks.

2. **LoRA implementation is simple**: Replace `nn.Linear` with a wrapper that adds $\frac{\alpha}{r} BAx$ to the output. Freeze the original weights. Train only $A$ and $B$.

3. **The parameter savings are dramatic**: For typical configurations (rank 8-16), LoRA trains roughly 0.1-1% of the parameters that full fine-tuning updates, often with comparable performance.

4. **Merging is free**: After training, the LoRA weights can be merged back into the original model ($W \leftarrow W + \frac{\alpha}{r} BA$), so the adapted model has the same inference cost as the original.

### Key Hyperparameters

| Hyperparameter | Typical Range | Effect |
|---------------|---------------|--------|
| Rank ($r$) | 4-32 | Higher = more capacity, more params |
| Alpha ($\alpha$) | $r$ to $2r$ | Scales the LoRA update magnitude |
| Target modules | Q, V (or all) | Which weight matrices to adapt |
| Learning rate | 1e-4 to 3e-4 | Usually higher than full FT |
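
These knobs are often kept together in one small configuration object. The dataclass below is a hypothetical container written for this notebook, not an API from any library; its defaults are simply values picked from the ranges in the table.

```python
from dataclasses import dataclass


@dataclass
class LoRAHyperparams:
    """Hypothetical container for the hyperparameters listed in the table."""
    rank: int = 8
    alpha: float = 16.0
    target_modules: tuple = ("W_q", "W_v")
    learning_rate: float = 2e-4

    @property
    def scaling(self):
        # The factor applied to B @ A in the forward pass.
        return self.alpha / self.rank


base = LoRAHyperparams()
wider = LoRAHyperparams(rank=32, alpha=64.0)
print(base.scaling, wider.scaling)
```

Keeping $\alpha = 2r$ holds the scaling factor fixed while the rank changes, which makes runs at different ranks easier to compare.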

### What is Next

With efficient fine-tuning covered, the final notebook addresses the most fundamental question: **how do we make models not just capable, but aligned with human values?** We will implement DPO alignment and explore scaling laws.