# Lab 3.1.1: LoRA Theory - Understanding Low-Rank Adaptation

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the mathematical foundation of LoRA (Low-Rank Adaptation)
- [ ] Implement a LoRA layer from scratch using NumPy and PyTorch
- [ ] Visualize how LoRA updates work during training
- [ ] Understand the relationship between LoRA and SVD
- [ ] Choose appropriate rank values for different tasks

---

## Prerequisites

- Completed: Module 1.4 (Mathematics for Deep Learning) - especially SVD concepts
- Completed: Module 1.5 (Neural Network Fundamentals)
- Knowledge of: Matrix operations, gradient descent, neural network layers

---

## Real-World Context

### The Problem: Fine-Tuning is Expensive!

Imagine you work at a legal tech company. You want GPT-4-level response quality on legal documents, but:
- Full fine-tuning of a 70B model requires **140GB+ of GPU memory** just to hold the weights in 16-bit precision
- Add gradients and optimizer states, and you need **400-500GB** or more, depending on the optimizer and precision
- Training takes weeks and costs thousands of dollars

**LoRA changes everything:** Instead of updating all 70 billion parameters, you freeze them and train a small set of added parameters, typically **0.1-1%** of the original count, through a clever mathematical trick. Combined with 4-bit quantization of the frozen base weights (the QLoRA approach), your 70B fine-tuning now fits in **45-55GB** on your DGX Spark!
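
To make the memory figures above concrete, here is a minimal sketch of the weights-only arithmetic. It assumes standard storage sizes per parameter and ignores gradients, optimizer states, and activations; the helper name `weight_memory_gb` is hypothetical.

```python
# Weights-only memory for a 70B-parameter model at several precisions.
# Assumption: 1 GB = 1e9 bytes; gradients, optimizer states, and
# activations are deliberately left out of this estimate.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Return the memory needed to store the weights alone, in GB."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


n_params = 70e9
for precision in BYTES_PER_PARAM:
    print(f"{precision:>8}: {weight_memory_gb(n_params, precision):6.1f} GB")
```

The 16-bit row corresponds to the 140GB figure quoted above. Storing the frozen weights at lower precision is what brings a 70B model within reach of a single machine.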

### Who Uses LoRA?
- **Stability AI** - For fine-tuning Stable Diffusion models
- **Microsoft** - Built into Azure AI services
- **Many other AI teams** - It has become one of the most widely used parameter-efficient fine-tuning approaches

---

## ELI5: What is LoRA?

> **Imagine you have a master chef (the base model) who knows how to cook thousands of dishes.** You want them to specialize in making YOUR family's recipes taste just right.
>
> **Option 1: Re-train the chef from scratch** (Full Fine-Tuning)  
> Send them back to culinary school for 4 years, teaching them everything again plus your recipes. Expensive, time-consuming, and they might forget some things!
>
> **Option 2: Give them a small recipe card** (LoRA)  
> Just give them a tiny note card with adjustments: "Add a bit more garlic", "Use grandma's secret spice". The chef keeps all their knowledge but makes small adjustments when cooking for YOU.
>
> **The magic:** The recipe card (LoRA adapter) is tiny - maybe 100 recipes instead of 10,000. But combined with the chef's existing skills, it produces exactly what you want!
>
> **In AI terms:** Instead of updating a 4096×4096 weight matrix (16 million parameters), LoRA adds two small matrices: 4096×16 and 16×4096 (only 131K parameters - **99% reduction!**). These small matrices encode the "adjustments" to the original weights.

---

## Part 1: The Math Behind LoRA

### The Core Insight

The key insight of LoRA is that **weight updates during fine-tuning have low intrinsic rank**.

In plain English: When you fine-tune a model, you don't need to change everything - you only need to make small, targeted adjustments. These adjustments can be represented efficiently using low-rank matrices.

### The Math

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, traditional fine-tuning updates it to:

$$W = W_0 + \Delta W$$

Where $\Delta W$ is also a $d \times k$ matrix (same size as original!).

**LoRA's trick:** Instead of storing the full $\Delta W$, decompose it into two smaller matrices:

$$\Delta W = BA$$

Where:
- $B \in \mathbb{R}^{d \times r}$ (tall and thin)
- $A \in \mathbb{R}^{r \times k}$ (short and wide)
- $r \ll \min(d, k)$ is the **rank** (typically 4-64)

So the adapted forward pass becomes:

$$h = W_0 x + BAx$$

In practice, the LoRA update is also multiplied by a scaling factor $\alpha / r$:

$$h = W_0 x + \frac{\alpha}{r} BAx$$

Where $\alpha$ is a scaling factor (often called `lora_alpha`).
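
The scaled forward pass can be checked numerically. The sketch below uses small, arbitrary dimensions (an assumption made for speed) and compares two routes to the same output: building the full $d \times k$ update first, versus passing the input through the rank-$r$ bottleneck without ever forming $BA$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 8.0   # small illustrative sizes (assumed)

W0 = rng.normal(size=(d, k))      # frozen pretrained weight
B = rng.normal(size=(d, r))       # trainable, d x r
A = rng.normal(size=(r, k))       # trainable, r x k
x = rng.normal(size=(k,))         # a single input vector

scaling = alpha / r

# Route 1: form the full d x k update, then apply it.
h_full = (W0 + scaling * (B @ A)) @ x

# Route 2: never form B @ A; project down to r dims, then back up.
h_lora = W0 @ x + scaling * (B @ (A @ x))

print("routes agree:", np.allclose(h_full, h_lora))
print("LoRA-path multiply-adds:", r * (d + k), "vs full update:", d * k)
```

Route 2 is how the adapter is evaluated during training. Route 1 is what merging the adapter into the base weight amounts to after training.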

### Parameter Savings

| Matrix | Dimensions | Parameters |
|--------|------------|------------|
| $\Delta W$ (full) | 4096 × 4096 | 16,777,216 |
| $B$ (LoRA) | 4096 × 16 | 65,536 |
| $A$ (LoRA) | 16 × 4096 | 65,536 |
| **Total LoRA** | - | **131,072** |
| **Savings** | - | **99.2%** |
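
The table's arithmetic generalizes to any layer shape and rank. The helper below is a small sketch (the name `lora_param_counts` is hypothetical) that reproduces the row values and shows how the savings change with rank.

```python
def lora_param_counts(d: int, k: int, r: int) -> dict:
    """Compare a full d x k update with a rank-r LoRA factorization."""
    full = d * k              # parameters in a dense update matrix
    lora = r * (d + k)        # B is d x r, A is r x k
    return {
        "full": full,
        "lora": lora,
        "savings_pct": 100.0 * (1.0 - lora / full),
    }


for rank in (4, 16, 64):
    counts = lora_param_counts(4096, 4096, rank)
    print(f"r={rank:>2}: {counts['lora']:>9,} LoRA params, "
          f"{counts['savings_pct']:.2f}% fewer than {counts['full']:,}")
```

Because the LoRA count grows linearly in $r$ while the full count does not depend on $r$, doubling the rank doubles the adapter size.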


## Part 2: Implementing LoRA from Scratch

Let's implement LoRA step by step, first with NumPy for clarity, then with PyTorch for GPU acceleration.

In [None]:
# Requirements: numpy, torch, matplotlib
# These are all included in the NGC PyTorch container

# Setup: Import required libraries
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from typing import Optional, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

### 2.1 NumPy Implementation (For Understanding)

Let's first build a simple LoRA layer using NumPy to really understand what's happening.

In [None]:
class LoRALayerNumPy:
    """
    A simple LoRA layer implementation in NumPy for educational purposes.
    
    This implements: h = W_0 @ x + (alpha/r) * B @ A @ x
    """
    
    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 1.0):
        """
        Initialize LoRA layer.
        
        Args:
            in_features: Input dimension (k)
            out_features: Output dimension (d)
            rank: LoRA rank (r) - lower = fewer parameters, higher = more expressiveness
            alpha: Scaling factor - controls how much LoRA affects the output
        """
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Original frozen weights (pretrained) - initialized with Xavier
        # In practice, these come from a pretrained model
        self.W0 = np.random.randn(out_features, in_features) * np.sqrt(2.0 / (in_features + out_features))
        
        # LoRA matrices
        # A is initialized with small random Gaussian values (std = 0.01)
        self.A = np.random.randn(rank, in_features) * 0.01
        # B is initialized to zeros - so initially LoRA has NO effect!
        self.B = np.zeros((out_features, rank))
        
        print(f"LoRA Layer Created:")
        print(f"  Original W0: {out_features} × {in_features} = {out_features * in_features:,} params")
        print(f"  LoRA A: {rank} × {in_features} = {rank * in_features:,} params")
        print(f"  LoRA B: {out_features} × {rank} = {out_features * rank:,} params")
        print(f"  Total LoRA params: {rank * in_features + out_features * rank:,}")
        print(f"  LoRA size relative to W0: {(rank * in_features + out_features * rank) / (out_features * in_features) * 100:.2f}%")
    
    def forward(self, x: np.ndarray) -> np.ndarray:
        """
        Forward pass: h = W_0 @ x + scaling * B @ A @ x
        
        Args:
            x: Input tensor of shape (batch_size, in_features)
        
        Returns:
            Output tensor of shape (batch_size, out_features)
        """
        # Original path (frozen)
        original_output = x @ self.W0.T
        
        # LoRA path (trainable)
        # First: x @ A.T gives (batch, rank)
        # Then: result @ B.T gives (batch, out_features)
        lora_output = (x @ self.A.T) @ self.B.T
        
        # Combined output
        return original_output + self.scaling * lora_output
    
    def get_merged_weights(self) -> np.ndarray:
        """
        Merge LoRA weights with original weights.
        After training, you can merge: W = W_0 + scaling * B @ A
        This gives you a regular layer with no inference overhead!
        """
        return self.W0 + self.scaling * (self.B @ self.A)


# Example: Create a LoRA layer for a typical transformer attention dimension
lora_layer = LoRALayerNumPy(in_features=4096, out_features=4096, rank=16, alpha=32)

### What's happening here?

1. **Original weights (`W0`)**: These are the frozen pretrained weights - we don't update them
2. **Matrix A**: Projects input to low-rank space (4096 → 16)
3. **Matrix B**: Projects back to output space (16 → 4096)
4. **Initialization trick**: B starts at zero, so initially LoRA has NO effect!

The key insight: We're not learning a full 4096×4096 update matrix. We're learning a **factorized** version that goes through a 16-dimensional bottleneck.
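
Why is only one of the two matrices zero-initialized? A minimal PyTorch sketch, using small arbitrary sizes as an assumption, shows what happens on the very first backward pass: the output matches the frozen path exactly, the gradient for `A` is zero because it flows through `B`, and the gradient for `B` is nonzero because `A` is nonzero.

```python
import torch

torch.manual_seed(0)
d, k, r = 32, 24, 4                     # small illustrative sizes (assumed)

W0 = torch.randn(d, k)                  # frozen: never requires grad
A = (0.01 * torch.randn(r, k)).requires_grad_()   # small random init
B = torch.zeros(d, r, requires_grad=True)         # zero init

x = torch.randn(8, k)
target = torch.randn(8, d)

out = x @ W0.T + (x @ A.T) @ B.T        # LoRA forward pass (scaling omitted)
loss = ((out - target) ** 2).mean()
loss.backward()

print("output equals frozen path:", torch.allclose(out, x @ W0.T))
print("grad norm wrt A:", A.grad.norm().item())   # zero while B is zero
print("grad norm wrt B:", B.grad.norm().item())   # nonzero because A is nonzero
```

If both matrices started at zero, both gradients would be zero and the adapter could not start learning. With this asymmetric initialization, `B` moves first and `A` begins receiving gradient once `B` is nonzero.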

In [None]:
# Let's verify the forward pass works
batch_size = 8
x = np.random.randn(batch_size, 4096)

output = lora_layer.forward(x)
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")

# Since B is initialized to zeros, LoRA should have no effect initially
original_only = x @ lora_layer.W0.T
print(f"\nDifference from original (should be ~0): {np.abs(output - original_only).max():.10f}")

### 2.2 PyTorch Implementation (Production Ready)

Now let's create a proper PyTorch implementation that can be used for actual training.

In [None]:
class LoRALayer(nn.Module):
    """
    Production-ready LoRA layer implementation in PyTorch.
    
    This wraps an existing nn.Linear layer and adds LoRA adapters.
    """
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 4,
        alpha: float = 1.0,
        dropout: float = 0.0
    ):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        self.merged = False  # True while B @ A is folded into the base weight
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # Freeze original weights
        self.original_layer.weight.requires_grad = False
        if self.original_layer.bias is not None:
            self.original_layer.bias.requires_grad = False
        
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Initialize A with Kaiming uniform
        nn.init.kaiming_uniform_(self.lora_A, a=np.sqrt(5))
        # B stays zero - important for training stability!
        
        # Optional dropout
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original path
        original_output = self.original_layer(x)
        
        # After merging, the base weight already includes the LoRA update,
        # so adding the LoRA path again would count it twice
        if self.merged:
            return original_output
        
        # LoRA path with dropout
        lora_output = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        
        return original_output + self.scaling * lora_output
    
    def merge_weights(self) -> None:
        """Merge LoRA weights into original layer for inference."""
        if self.merged:
            return
        with torch.no_grad():
            self.original_layer.weight += self.scaling * (self.lora_B @ self.lora_A)
        self.merged = True
    
    def unmerge_weights(self) -> None:
        """Unmerge LoRA weights (for continuing training)."""
        if not self.merged:
            return
        with torch.no_grad():
            self.original_layer.weight -= self.scaling * (self.lora_B @ self.lora_A)
        self.merged = False
    
    @property
    def trainable_params(self) -> int:
        return self.lora_A.numel() + self.lora_B.numel()
    
    @property
    def total_params(self) -> int:
        return self.original_layer.weight.numel() + self.trainable_params

In [None]:
# Create a sample linear layer (simulating a transformer attention projection)
original_linear = nn.Linear(4096, 4096, bias=False).to(device)

# Wrap it with LoRA
lora_linear = LoRALayer(original_linear, rank=16, alpha=32, dropout=0.05).to(device)

print(f"Original parameters: {original_linear.weight.numel():,}")
print(f"LoRA trainable parameters: {lora_linear.trainable_params:,}")
print(f"Compression: {lora_linear.trainable_params / original_linear.weight.numel() * 100:.2f}%")

# Test forward pass
x = torch.randn(8, 4096, device=device)
output = lora_linear(x)
print(f"\nOutput shape: {output.shape}")
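
Merging is worth verifying numerically. The sketch below is self-contained and does not use the class above; it assumes small layer sizes and nonzero adapter matrices (as if training had already happened), then checks that a single merged linear layer reproduces the base-plus-adapter output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
in_features, out_features, r, alpha = 64, 32, 4, 8.0   # assumed sizes
scaling = alpha / r

base = nn.Linear(in_features, out_features, bias=False)
lora_A = torch.randn(r, in_features) * 0.1
lora_B = torch.randn(out_features, r) * 0.1   # nonzero, as if after training

x = torch.randn(5, in_features)

with torch.no_grad():
    # Two-path output: frozen layer plus scaled low-rank correction
    y_adapter = base(x) + scaling * (x @ lora_A.T @ lora_B.T)

    # Single-path output: fold the correction into one weight matrix
    merged = nn.Linear(in_features, out_features, bias=False)
    merged.weight.copy_(base.weight + scaling * (lora_B @ lora_A))
    y_merged = merged(x)

print("merged layer matches adapter output:",
      torch.allclose(y_adapter, y_merged, atol=1e-4))
```

Once the weights are merged, the LoRA path must not be applied again, otherwise the update is counted twice.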

---

## Part 3: Visualizing LoRA Training Dynamics

Let's actually train a simple model with LoRA and visualize what's happening to the weights.

### Key PyTorch Functions Used in This Section

Before we proceed, here are the key PyTorch functions we'll use:

| Function | Description |
|----------|-------------|
| `F.scaled_dot_product_attention(q, k, v)` | PyTorch 2.0+ attention function. Computes $\text{softmax}(QK^T/\sqrt{d_k})V$ and can dispatch to fused kernels such as FlashAttention when the hardware and inputs allow it. |
| `torch.roll(x, shifts, dims)` | Circular shift of tensor elements along specified dimension. Used for creating synthetic training targets. |
| `nn.init.kaiming_uniform_(tensor)` | Initializes weights using He (Kaiming) uniform initialization, designed for ReLU-family activations. |
| `tensor.norm()` | Computes the Frobenius norm (default) of a tensor. |
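
Two of these functions are easier to trust after a quick check. The sketch below assumes PyTorch 2.0 or newer and small arbitrary tensor sizes. It compares `F.scaled_dot_product_attention` with the formula written out by hand, then shows what `torch.roll` does to a tiny tensor.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, heads, seq, head_dim = 2, 4, 6, 8   # assumed sizes
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

# Attention written out by hand: softmax(Q K^T / sqrt(d_k)) V
scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)
manual = torch.softmax(scores, dim=-1) @ v

fused = F.scaled_dot_product_attention(q, k, v)
print("fused matches manual:", torch.allclose(manual, fused, atol=1e-4))

# torch.roll: shift rows forward by one, wrapping the last row to the front
t = torch.arange(6).view(3, 2)          # rows: [0, 1], [2, 3], [4, 5]
rolled = torch.roll(t, shifts=1, dims=0)
print(rolled.tolist())                  # [[4, 5], [0, 1], [2, 3]]
```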

In [None]:
class SimpleTransformerBlock(nn.Module):
    """A simplified transformer block for demonstration."""
    
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        
        # Attention projections
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        
        # MLP
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        
        # Attention
        normed = self.norm1(x)
        q = self.q_proj(normed).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(normed).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(normed).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).contiguous().view(B, T, C)
        x = x + self.o_proj(attn)
        
        # MLP
        x = x + self.mlp(self.norm2(x))
        
        return x


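def add_lora_to_model(model: nn.Module, rank: int = 8, alpha: float = 16) -> nn.Module:
    """Freeze the base model and add LoRA adapters to its attention projections."""
    target_names = ('q_proj', 'k_proj', 'v_proj', 'o_proj')
    
    # Freeze every existing parameter (MLP and LayerNorm included),
    # so that only the LoRA matrices remain trainable
    for param in model.parameters():
        param.requires_grad = False
    
    # Snapshot the module list because children are replaced while walking it
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and name.split('.')[-1] in target_names:
            parent_name, _, child_name = name.rpartition('.')
            parent = model.get_submodule(parent_name) if parent_name else model
            
            # Create the adapter on the same device as the layer it wraps
            lora_layer = LoRALayer(module, rank=rank, alpha=alpha).to(module.weight.device)
            setattr(parent, child_name, lora_layer)
            
    return model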

In [None]:
# Create a simple model and add LoRA
model = SimpleTransformerBlock(d_model=256, n_heads=8).to(device)

# Count parameters before LoRA
total_params_before = sum(p.numel() for p in model.parameters())
print(f"Total parameters before LoRA: {total_params_before:,}")

# Add LoRA
model = add_lora_to_model(model, rank=8, alpha=16).to(device)  # keep the new adapter weights on the same device

# Count trainable vs frozen parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_params = sum(p.numel() for p in model.parameters() if not p.requires_grad)

print(f"\nAfter adding LoRA:")
print(f"  Trainable parameters: {trainable_params:,}")
print(f"  Frozen parameters: {frozen_params:,}")
print(f"  Trainable ratio: {trainable_params / (trainable_params + frozen_params) * 100:.2f}%")

In [None]:
# Let's train this model on a simple task and track LoRA weight changes

def generate_synthetic_data(n_samples: int, seq_len: int, d_model: int) -> Tuple[torch.Tensor, torch.Tensor]:
    """Generate synthetic sequence data for training."""
    # Input sequences
    x = torch.randn(n_samples, seq_len, d_model)
    # Target: shifted version with some transformation
    y = torch.roll(x, shifts=1, dims=1) * 0.9 + torch.randn_like(x) * 0.1
    return x, y


# Generate data
x_train, y_train = generate_synthetic_data(1000, 32, 256)
x_train, y_train = x_train.to(device), y_train.to(device)

# Only train LoRA parameters
lora_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-3)

# Track weights during training
weight_history = {'A_norm': [], 'B_norm': [], 'delta_W_norm': []}

# Get first LoRA layer for tracking
first_lora = model.q_proj

print("Training with LoRA...")
losses = []
n_epochs = 100
batch_size = 64

for epoch in range(n_epochs):
    model.train()
    epoch_losses = []
    
    # Track weight norms
    with torch.no_grad():
        weight_history['A_norm'].append(first_lora.lora_A.norm().item())
        weight_history['B_norm'].append(first_lora.lora_B.norm().item())
        delta_W = first_lora.lora_B @ first_lora.lora_A
        weight_history['delta_W_norm'].append(delta_W.norm().item())
    
    for i in range(0, len(x_train), batch_size):
        batch_x = x_train[i:i+batch_size]
        batch_y = y_train[i:i+batch_size]
        
        optimizer.zero_grad()
        output = model(batch_x)
        loss = F.mse_loss(output, batch_y)
        loss.backward()
        optimizer.step()
        
        epoch_losses.append(loss.item())
    
    avg_loss = np.mean(epoch_losses)
    losses.append(avg_loss)
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch+1}/{n_epochs}, Loss: {avg_loss:.4f}")

print("Training complete!")

In [None]:
# Visualize training dynamics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Loss curve
axes[0, 0].plot(losses, 'b-', linewidth=2)
axes[0, 0].set_xlabel('Epoch')
axes[0, 0].set_ylabel('Loss')
axes[0, 0].set_title('Training Loss (LoRA only)')
axes[0, 0].grid(True, alpha=0.3)

# LoRA weight norms
axes[0, 1].plot(weight_history['A_norm'], label='||A||', linewidth=2)
axes[0, 1].plot(weight_history['B_norm'], label='||B||', linewidth=2)
axes[0, 1].set_xlabel('Epoch')
axes[0, 1].set_ylabel('Frobenius Norm')
axes[0, 1].set_title('LoRA Matrix Norms During Training')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Delta W norm
axes[1, 0].plot(weight_history['delta_W_norm'], 'g-', linewidth=2)
axes[1, 0].set_xlabel('Epoch')
axes[1, 0].set_ylabel('||B @ A||')
axes[1, 0].set_title('Effective Weight Update Norm (ΔW = BA)')
axes[1, 0].grid(True, alpha=0.3)

# Visualize the learned ΔW matrix
with torch.no_grad():
    delta_W = (first_lora.lora_B @ first_lora.lora_A).cpu().numpy()

im = axes[1, 1].imshow(delta_W, cmap='RdBu', aspect='auto', vmin=-0.1, vmax=0.1)
axes[1, 1].set_title('Learned Weight Update (ΔW = BA)')
axes[1, 1].set_xlabel('Input dimension')
axes[1, 1].set_ylabel('Output dimension')
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.savefig('lora_training_dynamics.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)  # Release memory

print("\nKey Observations:")
print("1. B starts at zero and grows during training")
print("2. The effective ΔW = BA has low rank structure")
print("3. Weight updates are localized - not all entries change equally")

### What's Happening?

Look at the visualizations above:

1. **Matrix B grows from zero**: This is the initialization trick - starting from zero means LoRA has no effect initially, ensuring stable training

2. **ΔW has structure**: The learned weight update isn't random - it has patterns that reflect the task

3. **Low-rank constraint**: The update can only express patterns that fit through the rank-r bottleneck
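
The low-rank constraint can be observed directly. In the sketch below (NumPy, with small arbitrary sizes as an assumption), a product of a $d \times r$ and an $r \times k$ matrix has the shape of a full update but never more than $r$ nonzero singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4                # assumed sizes

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, k))
delta_W = B @ A                    # same shape as a dense d x k update

singular_values = np.linalg.svd(delta_W, compute_uv=False)
n_nonzero = int((singular_values > 1e-10).sum())

print("shape of delta_W:", delta_W.shape)
print("numerical rank:", np.linalg.matrix_rank(delta_W))
print("singular values above 1e-10:", n_nonzero)
```

However long training runs, the update stays inside this rank-$r$ family. Raising $r$ is the only way to let it express more independent directions.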

---

## Part 4: The Connection to SVD

LoRA is deeply connected to Singular Value Decomposition (SVD). Let's explore this connection.

### ELI5: SVD and LoRA

> **Imagine you have a big photograph.** SVD lets you compress it by finding the most important "ingredients" (singular values) that make up the image.
>
> If you keep only the top 10 ingredients instead of all 1000, you get a slightly blurry but recognizable image.
>
> **LoRA makes a similar bet:** The "changes" needed to adapt a model to a new task can be captured with just a few key ingredients, not millions of parameters.

### NumPy Linear Algebra Functions Used

| Function | Description |
|----------|-------------|
| `np.linalg.svd(A)` | Singular Value Decomposition: factors matrix A = U @ diag(S) @ V^T where U and V are orthogonal matrices and S contains singular values |
| `np.linalg.norm(A, 'fro')` | Computes Frobenius norm: $\sqrt{\sum_{i,j} |a_{ij}|^2}$, measures matrix "magnitude" |
| `np.diag(S)` | Creates diagonal matrix from vector S |

In [None]:
# Let's demonstrate: Full weight update vs low-rank approximation

# Simulate a "full" fine-tuning weight update
np.random.seed(42)
d = 512  # dimension

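# Create a simulated weight update matrix (a stand-in for what full fine-tuning would learn)
# Note: this i.i.d. Gaussian matrix is full rank, so it is a worst case for low-rank approximation.
# Real fine-tuning updates tend to be much closer to low-rank.
full_delta_W = np.random.randn(d, d) * 0.01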

# Perform SVD
U, S, Vt = np.linalg.svd(full_delta_W, full_matrices=False)

print(f"Original ΔW shape: {full_delta_W.shape}")
print(f"SVD shapes: U={U.shape}, S={S.shape}, V^T={Vt.shape}")

# Plot singular values
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(S, 'b-', linewidth=2)
plt.xlabel('Index')
plt.ylabel('Singular Value')
plt.title('Singular Values of ΔW')
plt.grid(True, alpha=0.3)

# Compare reconstruction error for different ranks
ranks = [4, 8, 16, 32, 64, 128, 256, 512]
errors = []

for r in ranks:
    # Low-rank approximation using top-r singular values
    approx = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]
    error = np.linalg.norm(full_delta_W - approx, 'fro') / np.linalg.norm(full_delta_W, 'fro')
    errors.append(error * 100)

plt.subplot(1, 2, 2)
plt.bar(range(len(ranks)), errors, tick_label=ranks)
plt.xlabel('Rank')
plt.ylabel('Reconstruction Error (%)')
plt.title('Error vs Rank (Lower = Better)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Parameter savings
print("\nParameter counts:")
full_params = d * d
for r in [8, 16, 32, 64]:
    # Use a distinct name so the earlier `lora_params` list of tensors is not overwritten
    n_lora_params = 2 * d * r  # B is d × r and A is r × d
    print(f"  Rank {r}: {n_lora_params:,} params ({n_lora_params/full_params*100:.2f}% of full), Error: {errors[ranks.index(r)]:.2f}%")

### Key Insight from SVD Analysis

The plot above shows that for a random matrix, you need many singular values to get good reconstruction. But in practice, **real fine-tuning updates are much more low-rank**!

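Empirical results reported for LoRA suggest that, for many tasks, the weight updates learned during LLM fine-tuning can be well approximated with ranks as low as 4-16, although the best rank depends on the task and the model. This is the fundamental insight that makes LoRA work.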
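
To contrast with the random matrix above, the sketch below builds a synthetic update that really is close to low-rank: an exactly rank-8 signal plus small dense noise. Both the construction and the sizes are assumptions chosen for illustration. It then truncates the SVD and splits the result into LoRA-shaped factors `B` and `A`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank, r = 128, 8, 8        # assumed sizes

# Synthetic update: rank-8 structure plus a little full-rank noise
signal = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
noise = 0.01 * rng.normal(size=(d, d))
delta_W = signal + noise

U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)

# Split the top-r singular values evenly between the two factors
B = U[:, :r] * np.sqrt(S[:r])              # d x r
A = np.sqrt(S[:r])[:, None] * Vt[:r, :]    # r x d

rel_error = np.linalg.norm(delta_W - B @ A, "fro") / np.linalg.norm(delta_W, "fro")
print(f"relative error at rank {r}: {rel_error:.4f}")
print(f"parameters: {B.size + A.size:,} instead of {delta_W.size:,}")
```

Note that LoRA does not compute an SVD during training; it learns `B` and `A` directly by gradient descent. The truncated SVD only shows the best a rank-$r$ factorization could do for a given update.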

---

## Part 5: Experimenting with Different Ranks

Let's systematically explore how rank affects performance and memory usage.

In [None]:
def train_with_rank(rank: int, n_epochs: int = 50) -> dict:
    """Train a model with specific LoRA rank and return metrics."""
    
    # Create fresh model
    model = SimpleTransformerBlock(d_model=256, n_heads=8).to(device)
    model = add_lora_to_model(model, rank=rank, alpha=rank * 2).to(device)  # alpha = 2*rank is common; keep adapters on the same device
    
    # Count params
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    # Training setup
    lora_params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(lora_params, lr=1e-3)
    
    # Train
    model.train()
    final_losses = []
    
    for epoch in range(n_epochs):
        for i in range(0, len(x_train), 64):
            batch_x = x_train[i:i+64]
            batch_y = y_train[i:i+64]
            
            optimizer.zero_grad()
            output = model(batch_x)
            loss = F.mse_loss(output, batch_y)
            loss.backward()
            optimizer.step()
        
        if epoch >= n_epochs - 10:  # Record the last-batch loss of each of the final 10 epochs
            final_losses.append(loss.item())
    
    # Cleanup
    del model, optimizer
    torch.cuda.empty_cache()
    
    return {
        'rank': rank,
        'trainable_params': trainable,
        'final_loss': np.mean(final_losses)
    }


# Test different ranks
ranks_to_test = [4, 8, 16, 32, 64, 128]
results = []

print("Testing different LoRA ranks...\n")
for rank in ranks_to_test:
    print(f"Training with rank={rank}...", end=' ')
    result = train_with_rank(rank)
    results.append(result)
    print(f"Loss: {result['final_loss']:.4f}, Params: {result['trainable_params']:,}")

In [None]:
# Visualize rank vs performance tradeoff
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

ranks = [r['rank'] for r in results]
params = [r['trainable_params'] for r in results]
losses = [r['final_loss'] for r in results]

# Parameters vs Rank
axes[0].bar(range(len(ranks)), params, tick_label=ranks, color='steelblue')
axes[0].set_xlabel('LoRA Rank')
axes[0].set_ylabel('Trainable Parameters')
axes[0].set_title('Parameters vs Rank')
axes[0].grid(True, alpha=0.3)

# Loss vs Rank
axes[1].bar(range(len(ranks)), losses, tick_label=ranks, color='coral')
axes[1].set_xlabel('LoRA Rank')
axes[1].set_ylabel('Final Loss')
axes[1].set_title('Performance vs Rank')
axes[1].grid(True, alpha=0.3)

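# Efficiency: Loss per 1K params
# Caveat: dividing by the parameter count favors larger ranks by construction,
# so read this panel together with the raw loss rather than on its own
efficiency = [l / (p / 1000) for l, p in zip(losses, params)]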
axes[2].bar(range(len(ranks)), efficiency, tick_label=ranks, color='seagreen')
axes[2].set_xlabel('LoRA Rank')
axes[2].set_ylabel('Loss per 1K params')
axes[2].set_title('Efficiency (Lower = Better)')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('lora_rank_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)  # Release memory

# Find sweet spot
best_efficiency_idx = np.argmin(efficiency)
print(f"\nSweet Spot Analysis:")
print(f"  Best efficiency at rank={ranks[best_efficiency_idx]}")
print(f"  Lowest loss at rank={ranks[np.argmin(losses)]}")
print(f"\nRule of thumb (not derived from this run): rank=16 is a common starting point for many tasks.")

---

## Part 6: LoRA Best Practices

Based on research and practical experience, here are the key guidelines for using LoRA effectively.

In [None]:
# Create a comprehensive LoRA configuration guide

lora_guidelines = """
╔══════════════════════════════════════════════════════════════════════════════╗
║                         LoRA CONFIGURATION GUIDE                              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ RANK (r) SELECTION                                                           ║
╠════════════════╦═════════════════════════════════════════════════════════════╣
║ r = 4-8        ║ Quick experiments, simple adaptation tasks                  ║
║ r = 16         ║ DEFAULT - Works well for most fine-tuning tasks             ║
║ r = 32-64      ║ Complex tasks requiring more capacity                       ║
║ r = 128+       ║ Approaching full fine-tuning, rarely needed                 ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ ALPHA (α) SELECTION                                                          ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ Common choices:                                                              ║
║   α = r        → Scaling factor = 1.0                                        ║
║   α = 2r       → Scaling factor = 2.0 (DEFAULT, more aggressive updates)     ║
║   α = 32       → Fixed value, independent of rank                            ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ TARGET MODULES                                                               ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ Attention only (minimal):  q_proj, v_proj                                    ║
║ Full attention (default):  q_proj, k_proj, v_proj, o_proj                    ║
║ With MLP (maximum):        + gate_proj, up_proj, down_proj                   ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ DROPOUT                                                                      ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ 0.0            → For large datasets                                          ║
║ 0.05 (default) → Standard regularization                                     ║
║ 0.1+           → For small datasets or if overfitting                        ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ DGX SPARK SPECIFIC                                                           ║
╠══════════════════════════════════════════════════════════════════════════════╣
║ 8B models:   r=16-32, full attention + MLP                                   ║
║ 70B models:  r=16, attention only (to fit in 128GB)                          ║
║ Always use: bfloat16 compute dtype                                           ║
╚══════════════════════════════════════════════════════════════════════════════╝
"""

print(lora_guidelines)
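
The guide's defaults can also be captured in code. The dataclass below is a hypothetical container written for this lab, not a class from any library. Its field names mirror the guide, and the `scaling` property reproduces the $\alpha / r$ factor from Part 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoRAConfigSketch:
    """Hypothetical settings container mirroring the guide above."""
    rank: int = 16
    alpha: float = 32.0
    dropout: float = 0.05
    target_modules: List[str] = field(
        default_factory=lambda: ["q_proj", "k_proj", "v_proj", "o_proj"]
    )

    def __post_init__(self) -> None:
        # Basic sanity checks on the values
        if self.rank <= 0:
            raise ValueError("rank must be positive")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")

    @property
    def scaling(self) -> float:
        """Factor applied to the LoRA path: alpha / rank."""
        return self.alpha / self.rank


default_cfg = LoRAConfigSketch()
minimal_cfg = LoRAConfigSketch(rank=8, alpha=16.0, target_modules=["q_proj", "v_proj"])
print(default_cfg.scaling, minimal_cfg.target_modules)
```

Keeping `alpha` at twice the rank, as both configurations do here, holds the scaling factor at 2.0 when the rank changes.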

---

## Try It Yourself: Exercises

### Exercise 1: Implement LoRA with Bias

The standard LoRA doesn't train biases. Modify the `LoRALayer` class to optionally train the bias term as well.

<details>
<summary>Hint</summary>

Add a `train_bias` parameter to `__init__`. If it is True and the wrapped layer has a bias (`self.original_layer.bias is not None`), set `self.original_layer.bias.requires_grad = True`.
</details>

In [None]:
# Exercise 1: Your code here
class LoRALayerWithBias(nn.Module):
    """
    TODO: Implement LoRA layer that optionally trains bias.
    
    Args:
        original_layer: The nn.Linear layer to adapt
        rank: Rank of the low-rank decomposition (r)
        alpha: Scaling factor for LoRA updates
        dropout: Dropout probability for LoRA path
        train_bias: If True, also make bias trainable
    
    Example:
        >>> linear = nn.Linear(512, 512)
        >>> lora = LoRALayerWithBias(linear, rank=8, train_bias=True)
        >>> output = lora(torch.randn(4, 512))
    """
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 4,
        alpha: float = 1.0,
        dropout: float = 0.0,
        train_bias: bool = False,
    ) -> None:
        super().__init__()
        # Your implementation here
        pass
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with LoRA adaptation."""
        # Your implementation here
        pass

### Exercise 2: Compare Different Target Modules

Compare training with LoRA on:
1. Only Q and V projections
2. All attention projections (Q, K, V, O)
3. Attention + MLP

<details>
<summary>Hint</summary>

Modify the `add_lora_to_model` function to accept a list of target module names.
</details>

In [None]:
# Exercise 2: Your code here
from typing import Dict, List, Optional

def add_lora_to_model_selective(
    model: nn.Module,
    rank: int = 8,
    alpha: float = 16,
    target_modules: Optional[List[str]] = None,
) -> nn.Module:
    """
    Add LoRA adapters to specific modules in a model.
    
    Args:
        model: The model to adapt
        rank: LoRA rank
        alpha: LoRA scaling factor
        target_modules: List of module name patterns to target.
                       e.g., ['q_proj', 'v_proj'] for attention only
                       e.g., ['q_proj', 'k_proj', 'v_proj', 'o_proj'] for full attention
                       If None, targets all attention projections.
    
    Returns:
        Modified model with LoRA layers
    
    Example:
        >>> model = SimpleTransformerBlock(d_model=256)
        >>> # Only Q and V projections
        >>> model_qv = add_lora_to_model_selective(model, target_modules=['q_proj', 'v_proj'])
        >>> # All attention projections
        >>> model_all = add_lora_to_model_selective(model, target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj'])
    """
    # Your implementation here
    # Hint: Modify add_lora_to_model to use target_modules parameter
    pass


def compare_target_modules(
    target_configs: List[List[str]],
    n_epochs: int = 50,
) -> Dict[str, Dict]:
    """
    Compare training with different target module configurations.
    
    Args:
        target_configs: List of target module lists to compare
                       e.g., [['q_proj', 'v_proj'], ['q_proj', 'k_proj', 'v_proj', 'o_proj']]
        n_epochs: Number of training epochs
    
    Returns:
        Dictionary mapping config name to results
    """
    # Your implementation here
    pass

### Exercise 3: Memory Analysis

Calculate and visualize the memory savings of LoRA for different model sizes.

<details>
<summary>Hint</summary>

For a model with `n` parameters, full fine-tuning with Adam needs `n * (weight + grad + optimizer_state)` bytes. In float32 that is `n * (4 + 4 + 8)` = `n * 16` bytes.
LoRA still stores the frozen base weights, but needs gradients and optimizer state only for the LoRA parameters.
</details>
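
As a quick check of the arithmetic in the hint, the snippet below works out the full fine-tuning figure for a 7B-parameter model in float32. It assumes Adam with two states per parameter, all stored at the same precision as the weights, and uses decimal gigabytes. The helper name `full_ft_bytes_per_param` is made up for illustration.

```python
GB = 1e9  # decimal gigabytes


def full_ft_bytes_per_param(bytes_per_weight: float = 4.0) -> float:
    """Weight + gradient + two Adam states, all at the same precision."""
    weight = bytes_per_weight
    grad = bytes_per_weight
    adam_states = 2 * bytes_per_weight  # first and second moment (m and v)
    return weight + grad + adam_states


n_params = 7e9  # a 7B-parameter model
per_param = full_ft_bytes_per_param(4.0)
total_gb = n_params * per_param / GB
print(f"{per_param:.0f} bytes/param -> {total_gb:.0f} GB")  # 16 bytes/param -> 112 GB
```

Activations and framework overhead are ignored here, so real usage will be higher than this estimate.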

In [None]:
# Exercise 3: Your code here
from typing import Dict

def calculate_memory_requirements(
    model_params_billions: float,
    lora_rank: int = 16,
    lora_target_ratio: float = 0.1,
    precision: str = "float32",
) -> Dict[str, float]:
    """
    Calculate memory requirements for full fine-tuning vs LoRA.
    
    Args:
        model_params_billions: Model size in billions of parameters
        lora_rank: LoRA rank (r parameter)
        lora_target_ratio: Fraction of the base model's parameters that belong to
                          layers with LoRA adapters (e.g., 0.1 means layers holding
                          10% of the parameters get LoRA)
        precision: One of "float32", "float16", "bfloat16", "int8", "int4"
    
    Returns:
        Dictionary with memory estimates in GB:
        {
            'full_ft_weights': float,      # Memory for model weights (full FT)
            'full_ft_gradients': float,    # Memory for gradients (full FT)
            'full_ft_optimizer': float,    # Memory for optimizer states (full FT)
            'full_ft_total': float,        # Total memory for full FT
            'lora_weights': float,         # Memory for model weights (LoRA)
            'lora_adapters': float,        # Memory for LoRA adapters
            'lora_gradients': float,       # Memory for gradients (LoRA only)
            'lora_optimizer': float,       # Memory for optimizer states (LoRA only)
            'lora_total': float,           # Total memory for LoRA
            'memory_savings_ratio': float, # LoRA total / Full FT total
        }
    
    Example:
        >>> mem = calculate_memory_requirements(7.0, lora_rank=16)
        >>> print(f"Full FT: {mem['full_ft_total']:.1f} GB")
        >>> print(f"LoRA: {mem['lora_total']:.1f} GB")
        >>> print(f"Savings: {(1 - mem['memory_savings_ratio'])*100:.1f}%")
    
    Hints:
        - float32: 4 bytes per param
        - float16/bfloat16: 2 bytes per param
        - int8: 1 byte per param
        - int4: 0.5 bytes per param
        - Adam optimizer stores 2 states per param (m and v)
        - For LoRA, gradients and optimizer states only for adapter params
    """
    # Your implementation here
    pass


# Test your implementation
# mem = calculate_memory_requirements(7.0, lora_rank=16)
# print(f"7B model memory requirements:")
# print(f"  Full Fine-tuning: {mem['full_ft_total']:.1f} GB")
# print(f"  LoRA (r=16): {mem['lora_total']:.1f} GB")
# print(f"  Memory savings: {(1 - mem['memory_savings_ratio'])*100:.1f}%")

---

## Common Mistakes

### Mistake 1: Forgetting to Freeze Base Weights

```python
# ❌ Wrong: Base weights still trainable
lora_layer = LoRALayer(original_layer, rank=16)
optimizer = torch.optim.Adam(model.parameters())  # Trains everything!

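# ✅ Right: Freeze the base weights, then train only the LoRA parameters
lora_layer = LoRALayer(original_layer, rank=16)
for p in original_layer.parameters():
    p.requires_grad = False  # no-op if LoRALayer already freezes them
lora_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(lora_params)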
```

**Why:** If you train base weights too, you lose the memory savings and might overwrite valuable pretrained knowledge.
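
A simple sanity check is to count trainable parameters before building the optimizer. The sketch below uses a bare `nn.Linear` with two stand-in adapter matrices rather than this notebook's `LoRALayer`. The shapes, the `0.01` init scale, and the learning rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

base = nn.Linear(512, 512)
rank = 16

# Freeze the pretrained weights so the optimizer never updates them.
for p in base.parameters():
    p.requires_grad = False

# Stand-in adapter matrices (A: rank x in_features, B: out_features x rank).
lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

all_params = list(base.parameters()) + [lora_A, lora_B]
trainable = [p for p in all_params if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in all_params)
print(f"trainable: {n_trainable:,} / {n_total:,} ({100 * n_trainable / n_total:.1f}%)")
```

If the trainable count equals the total, the base weights were not frozen.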

### Mistake 2: Wrong Alpha/Rank Ratio

```python
# ❌ Wrong: Alpha too low relative to rank
config = LoraConfig(r=64, lora_alpha=1)  # Scaling = 1/64 ≈ 0.016

# ✅ Right: Alpha proportional to rank
config = LoraConfig(r=64, lora_alpha=128)  # Scaling = 128/64 = 2.0
```

**Why:** The LoRA update is multiplied by `alpha / r`. If that factor is too small, LoRA barely affects the output; if it is too large, training can become unstable.
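
To see how the factor moves with rank and alpha, here is a small sketch that prints `alpha / r` for the two configurations above plus two others. The `(8, 16)` pair matches the defaults used in Exercise 2. `lora_scaling` is a hypothetical helper, not a PEFT function.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Scaling factor applied to BA in W = W0 + (alpha / r) * BA."""
    return alpha / rank


for rank, alpha in [(64, 1), (64, 128), (8, 16), (16, 16)]:
    scale = lora_scaling(alpha, rank)
    print(f"r={rank:>2}, alpha={alpha:>3} -> scaling={scale:.4f}")
```

Note that raising the rank while leaving alpha unchanged lowers the scaling, which is why the two are usually adjusted together.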

### Mistake 3: Initializing B Non-Zero

```python
# ❌ Wrong: B initialized randomly
self.lora_B = nn.Parameter(torch.randn(out_features, rank))

# ✅ Right: B initialized to zeros
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
```

**Why:** Zero initialization ensures the model starts identical to the base model. Random B would immediately change all outputs.
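
The effect is easy to confirm numerically. The NumPy sketch below compares the adapted output against the base output for a zero-initialized and a randomly initialized `B`. The dimensions, seed, and `alpha` are arbitrary choices, and `lora_forward` is a stand-in for the forward pass rather than this notebook's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 32, 16, 4, 8.0

W0 = rng.standard_normal((d_out, d_in))        # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # small random init for A
x = rng.standard_normal((5, d_in))             # batch of 5 inputs


def lora_forward(x, W0, A, B, alpha, rank):
    """y = x W0^T + (alpha / r) * x A^T B^T"""
    return x @ W0.T + (alpha / rank) * (x @ A.T @ B.T)


base_out = x @ W0.T
zero_B = np.zeros((d_out, rank))
rand_B = rng.standard_normal((d_out, rank))

diff_zero = np.abs(lora_forward(x, W0, A, zero_B, alpha, rank) - base_out).max()
diff_rand = np.abs(lora_forward(x, W0, A, rand_B, alpha, rank) - base_out).max()
print(f"max |delta| with zero B:   {diff_zero:.6f}")
print(f"max |delta| with random B: {diff_rand:.6f}")
```

With `B = 0` the product `BA` is exactly zero regardless of `A`, so the adapted layer reproduces the base layer at step 0.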

---

## Checkpoint

You've learned:
- ✅ The mathematical foundation of LoRA: $W = W_0 + \frac{\alpha}{r}BA$
- ✅ How to implement LoRA from scratch in NumPy and PyTorch
- ✅ The connection between LoRA and SVD
- ✅ How to choose rank, alpha, and target modules
- ✅ Best practices for LoRA configuration

---

## Challenge (Optional)

### Implement QLoRA from Scratch

QLoRA combines LoRA with 4-bit quantization. The key components are:

1. **NF4 Quantization**: Quantize base weights to 4-bit normal float
2. **Double Quantization**: Quantize the quantization constants too
3. **LoRA in BFloat16**: Keep LoRA weights in higher precision

Try implementing a simplified version of NF4 quantization:

```python
from typing import Tuple

import torch


def quantize_nf4(tensor: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
    """
    Quantize tensor to 4-bit normal float format.
    Returns quantized values and scale factors.
    """
    # Your implementation here
    pass
```
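
As a starting point, the NumPy sketch below shows the general shape of block-wise 4-bit quantization: scale each block by its absolute maximum, then snap every value to the nearest entry of a 16-level codebook. It deliberately uses a uniform codebook, which is an assumption made for simplicity and is not NF4. NF4 replaces it with levels derived from normal-distribution quantiles. The function names and the block size of 64 are illustrative.

```python
import numpy as np

# Uniform 16-level codebook in [-1, 1] (NOT the NF4 levels).
CODEBOOK = np.linspace(-1.0, 1.0, 16)


def quantize_blockwise(w: np.ndarray, block_size: int = 64):
    """Return (4-bit indices, per-block absmax scales) for a flat array.

    Assumes w.size is divisible by block_size.
    """
    blocks = w.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    normalized = blocks / scales  # every value now lies in [-1, 1]
    # Index of the nearest codebook entry for every value
    idx = np.abs(normalized[..., None] - CODEBOOK).argmin(axis=-1)
    return idx.astype(np.uint8), scales


def dequantize_blockwise(idx: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Map indices back to codebook values and undo the per-block scaling."""
    return (CODEBOOK[idx] * scales).reshape(-1)


rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
idx, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(idx, scales)
print("max abs error:", np.abs(w - w_hat).max())
```

Each index fits in 4 bits but is stored in a `uint8` here. Packing two indices per byte and quantizing `scales` (double quantization) are left as part of the challenge.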

---

## Further Reading

- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) - The original paper
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) - Extends LoRA with quantization
- [The PEFT Library](https://github.com/huggingface/peft) - Hugging Face's implementation
- [DoRA: Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/abs/2402.09353) - Recent improvement to LoRA

---

## Cleanup

In [None]:
# Clear GPU memory
import gc

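# Delete any remaining tensors (names that were never defined are skipped)
for name in ("x_train", "y_train", "model"):
    globals().pop(name, None)

# Collect garbage first, then release cached CUDA memory
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()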

print("Cleanup complete!")
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
    print(f"GPU memory cached: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

---

## Next Steps

Now that you understand LoRA theory, you're ready for:

**[Lab 3.1.2: DoRA Comparison](lab-3.1.2-dora-comparison.ipynb)** - Learn how DoRA improves on LoRA with weight decomposition, for a reported +3.7-point improvement!

In the next notebook, you'll compare LoRA vs DoRA and see how weight-decomposed adaptation leads to better fine-tuning results.