# Lab 1.4.4: SVD for LoRA Intuition

**Module:** 1.4 - Mathematics for Deep Learning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand Singular Value Decomposition (SVD) intuitively
- [ ] Perform SVD on weight matrices and reconstruct with varying ranks
- [ ] Visualize reconstruction error vs rank trade-off
- [ ] Connect SVD to LoRA's low-rank adaptation approach
- [ ] Calculate memory savings from low-rank approximations

---

## 📚 Prerequisites

- Completed: Labs 1.4.1-1.4.3
- Knowledge of: Matrix multiplication, basic linear algebra

---

## 🌍 Real-World Context

**Why does SVD matter for deep learning?**

- **LoRA fine-tuning:** Fine-tune a 70B model by training well under 1% of its parameters!
- **Model compression:** Reduce model size while preserving accuracy
- **Understanding representations:** SVD reveals what information matrices capture

**Real examples:**
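- LLaMA-70B has ~70 billion parameters, but a LoRA adapter for it can have on the order of 10 million trainable parameters, depending on the rank and which layers are adapted
- Stable Diffusion LoRAs are typically 10-100MB, versus roughly 4GB for a fully fine-tuned checkpoint
- Adapters for proprietary models such as GPT-4 plausibly use similar low-rank techniques, though this is not publicly documented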

---

## 🧒 ELI5: What is SVD?

> **Imagine you have a big recipe book with 1000 recipes...**
>
> Each recipe is a list of 100 ingredients with amounts. That's a lot of information!
>
> But wait... most recipes are combinations of a few **basic patterns**:
> - "Italian base" = tomatoes + olive oil + garlic + basil
> - "Asian base" = soy sauce + ginger + sesame oil
> - "French base" = butter + wine + herbs
>
> **SVD finds these patterns!** It discovers:
> 1. The fundamental "flavor patterns" (singular vectors)
> 2. How important each pattern is (singular values)
> 3. How to combine patterns to recreate any recipe
>
> **The magic:** You can approximate the entire cookbook with just 10-20 patterns,
> instead of memorizing 1000 recipes × 100 ingredients!
>
> **For neural networks:**
> - A weight matrix W (768×768) ≈ 590,000 numbers
> - SVD: W = U × Σ × V^T
> - Keep only top 16 patterns: ~25,000 numbers (96% smaller!)

---

In [None]:
import numpy as np
import torch
import matplotlib.pyplot as plt
from matplotlib import cm
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
torch.manual_seed(42)

print("🚀 SVD for LoRA Intuition Lab")
print("=" * 50)
print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")

---

## Part 1: Understanding SVD Mathematically

### The SVD Decomposition

Any matrix $W \in \mathbb{R}^{m \times n}$ can be decomposed as:

$$W = U \Sigma V^T$$

Where:
- $U \in \mathbb{R}^{m \times m}$ - Left singular vectors (orthonormal columns)
- $\Sigma \in \mathbb{R}^{m \times n}$ - Diagonal matrix of singular values
- $V^T \in \mathbb{R}^{n \times n}$ - Right singular vectors (orthonormal rows)
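
The shapes and orthonormality properties listed above can be checked numerically. The following is a minimal sketch on a small random matrix (the 5×3 size and the `Sigma` padding step are choices made for this illustration), using `np.linalg.svd` with `full_matrices=True` so that the returned factors match the definition exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
W = rng.standard_normal((m, n))

# Full SVD: U is (m, m), Vt is (n, n), S holds min(m, n) singular values
U, S, Vt = np.linalg.svd(W, full_matrices=True)

# Embed the singular values in an (m, n) matrix so the shapes line up
Sigma = np.zeros((m, n))
Sigma[:len(S), :len(S)] = np.diag(S)

print(U.shape, Sigma.shape, Vt.shape)      # (5, 5) (5, 3) (3, 3)
print(np.allclose(U.T @ U, np.eye(m)))     # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(n)))   # rows of V^T are orthonormal
print(np.allclose(U @ Sigma @ Vt, W))      # the product recovers W
```

NumPy returns the singular values as a 1D array, which is why the sketch has to place them on the diagonal of an $m \times n$ matrix before multiplying.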

### Low-Rank Approximation

Keep only the top $r$ singular values:

$$W \approx W_r = U_r \Sigma_r V_r^T$$

By the Eckart–Young theorem, this is the **best** rank-$r$ approximation in the Frobenius norm: no other rank-$r$ matrix is closer to $W$.
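
One way to make the "best approximation" claim concrete: the Frobenius error of the truncated SVD equals the norm of the discarded singular values, and any other rank-$r$ matrix does at least as badly. The sketch below checks both on a random matrix. The competitor (a projection of `W` onto a random $r$-dimensional subspace) is just one arbitrary rank-$r$ alternative chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((40, 30))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 5
W_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Truncation error equals the norm of the singular values that were dropped
err_svd = np.linalg.norm(W - W_r, ord="fro")
err_predicted = np.sqrt(np.sum(S[r:] ** 2))

# An arbitrary rank-r competitor: project W onto a random r-dimensional subspace
Q, _ = np.linalg.qr(rng.standard_normal((40, r)))
W_other = Q @ (Q.T @ W)
err_other = np.linalg.norm(W - W_other, ord="fro")

print(f"SVD truncation error:  {err_svd:.4f}")
print(f"Predicted from S[r:]:  {err_predicted:.4f}")
print(f"Random rank-{r} error:  {err_other:.4f}")
```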

### NumPy's SVD Function

We'll use `np.linalg.svd()` to compute SVD:

```python
import numpy as np

# Compute SVD of a matrix W
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# U:  Left singular vectors, shape (m, min(m,n))
# S:  Singular values (1D array, sorted descending), shape (min(m,n),)
# Vt: Right singular vectors (transposed!), shape (min(m,n), n)

# Reconstruct: W = U @ np.diag(S) @ Vt
```

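Note: `full_matrices=False` gives the "economy" (reduced) SVD: `U` and `Vt` are trimmed to `min(m, n)` columns and rows, which saves memory for non-square matrices without changing the reconstruction.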
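
The difference between the two modes is easiest to see on a tall matrix. This is a small sketch (the 1000×20 shape is an arbitrary choice for illustration) comparing the size of `U` in each mode:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 20))   # tall matrix: m >> n

U_full, S_full, Vt_full = np.linalg.svd(W, full_matrices=True)
U_econ, S_econ, Vt_econ = np.linalg.svd(W, full_matrices=False)

print(U_full.shape, U_econ.shape)    # (1000, 1000) (1000, 20)
print(U_full.size // U_econ.size)    # 50

# Both modes return the same singular values, and the economy
# factors are already enough to rebuild W
print(np.allclose(S_full, S_econ))
print(np.allclose(U_econ @ np.diag(S_econ) @ Vt_econ, W))
```

The extra 980 columns of `U_full` are multiplied by zeros in $\Sigma$, so dropping them loses nothing.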

In [None]:
# Let's start with a simple visual example

# Create a simple matrix (like pixel values of an image)
def create_test_matrix(size=64):
    """Create a matrix with clear structure (easier to compress)"""
    x = np.linspace(-3, 3, size)
    y = np.linspace(-3, 3, size)
    X, Y = np.meshgrid(x, y)
    
    # Combination of smooth patterns (low rank!)
    Z = np.sin(X) * np.cos(Y) + 0.5 * np.exp(-(X**2 + Y**2)/5)
    return Z

# Create matrix
W = create_test_matrix(64)
print(f"Original matrix shape: {W.shape}")
print(f"Total elements: {W.size:,}")

# Perform SVD
U, S, Vt = np.linalg.svd(W, full_matrices=False)

print(f"\nSVD components:")
print(f"  U shape: {U.shape}")
print(f"  S shape: {S.shape}")
print(f"  V^T shape: {Vt.shape}")

In [None]:
# Visualize the singular values
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot singular values
axes[0].bar(range(len(S)), S, color='steelblue', alpha=0.7)
axes[0].set_xlabel('Singular Value Index', fontsize=12)
axes[0].set_ylabel('Singular Value', fontsize=12)
axes[0].set_title('Singular Values (Importance of Each Component)', fontsize=14)
axes[0].set_xlim(-1, 30)
axes[0].grid(True, alpha=0.3)

# Plot cumulative energy
energy = (S ** 2) / (S ** 2).sum() * 100
cumulative_energy = np.cumsum(energy)

axes[1].plot(cumulative_energy, 'b-', linewidth=2)
axes[1].axhline(y=95, color='r', linestyle='--', label='95% energy')
axes[1].axhline(y=99, color='g', linestyle='--', label='99% energy')

# Find where we hit 95% and 99%
rank_95 = np.argmax(cumulative_energy >= 95) + 1
rank_99 = np.argmax(cumulative_energy >= 99) + 1

axes[1].axvline(x=rank_95-1, color='r', linestyle=':', alpha=0.7)
axes[1].axvline(x=rank_99-1, color='g', linestyle=':', alpha=0.7)

axes[1].set_xlabel('Number of Components (Rank)', fontsize=12)
axes[1].set_ylabel('Cumulative Energy (%)', fontsize=12)
axes[1].set_title('Cumulative Information Captured', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 30)

plt.tight_layout()
plt.show()

print(f"\n📊 Information concentration:")
print(f"  - {rank_95} components capture 95% of information")
print(f"  - {rank_99} components capture 99% of information")
print(f"  - Full rank: {len(S)} components")

### 🔍 What Just Happened?

- **Left plot:** Singular values decrease rapidly - most information is in the first few!
- **Right plot:** With just a few components, we capture most of the "energy" (information)

This is the key insight behind low-rank approximations!
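
The "energy" calculation can be wrapped in a small reusable helper. The function below is a sketch, not part of the lab's code: `rank_for_energy` and `S_demo` are names introduced here, and the singular values are made up so the result can be checked by hand.

```python
import numpy as np

def rank_for_energy(S, threshold=0.95):
    """Smallest rank whose squared singular values reach `threshold` of the total."""
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    # searchsorted gives the first index where energy >= threshold
    idx = int(np.searchsorted(energy, threshold))
    # Clamp in case rounding leaves the final cumulative value just below 1.0
    return min(idx + 1, len(S))

# Squared values are 100, 16, 4, 1, 0.25 (total 121.25)
S_demo = np.array([10.0, 4.0, 2.0, 1.0, 0.5])

print(rank_for_energy(S_demo, 0.95))   # 2  (116 / 121.25 = 95.7%)
print(rank_for_energy(S_demo, 0.99))   # 4  (121 / 121.25 = 99.8%)
```

Energy is defined on the squared singular values because $\|W\|_F^2 = \sum_i \sigma_i^2$, so the cumulative fraction is exactly the share of the squared Frobenius norm that a rank-$r$ truncation retains.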

---

## Part 2: Low-Rank Reconstruction

Let's see how well we can reconstruct the original matrix with different ranks.

In [None]:
def reconstruct_low_rank(U, S, Vt, rank):
    """Reconstruct matrix using only top 'rank' singular values"""
    return U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]

def relative_error(original, reconstructed):
    """Compute relative reconstruction error (Frobenius norm)"""
    return np.linalg.norm(original - reconstructed) / np.linalg.norm(original)

# Reconstruct with different ranks
ranks_to_try = [1, 2, 4, 8, 16, 32, 64]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

# Original
im = axes[0].imshow(W, cmap='viridis')
axes[0].set_title(f'Original\n({W.size:,} values)', fontsize=12)
axes[0].axis('off')

# Reconstructions
for i, rank in enumerate(ranks_to_try):
    W_approx = reconstruct_low_rank(U, S, Vt, rank)
    error = relative_error(W, W_approx) * 100
    storage = rank * (W.shape[0] + W.shape[1] + 1)  # U_r + S_r + V_r
    
    axes[i+1].imshow(W_approx, cmap='viridis')
    axes[i+1].set_title(f'Rank {rank}\nError: {error:.2f}%\nStorage: {storage:,}', fontsize=11)
    axes[i+1].axis('off')

plt.suptitle('Low-Rank Approximations of a Matrix', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

print("\n📊 Storage comparison:")
print(f"  Original: {W.size:,} values")
for rank in [4, 8, 16]:
    storage = rank * (W.shape[0] + W.shape[1] + 1)
    savings = (1 - storage/W.size) * 100
    error = relative_error(W, reconstruct_low_rank(U, S, Vt, rank)) * 100
    print(f"  Rank {rank:2d}: {storage:,} values ({savings:.1f}% smaller, {error:.2f}% error)")
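
The storage formula `rank * (m + n + 1)` also implies a break-even point: past a certain rank, the factors take more space than the dense matrix. The helper below is a sketch of that arithmetic (`break_even_rank` is a name introduced here for illustration):

```python
def break_even_rank(m, n):
    """Largest rank r for which r * (m + n + 1) is still below m * n."""
    r = (m * n) // (m + n + 1)
    # If the division is exact, rank r stores exactly m * n values: no saving
    if r * (m + n + 1) == m * n:
        r -= 1
    return r

for m, n in [(64, 64), (768, 768)]:
    r = break_even_rank(m, n)
    print(f"{m}x{n}: factors are smaller than the dense matrix up to rank {r}")
```

For the 64×64 example this gives rank 31, which is why the rank-32 and rank-64 reconstructions in the grid report more storage (4,128 and 8,256 values) than the original 4,096.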

---

## Part 3: SVD on Neural Network Weights

Now let's apply this to actual neural network weight matrices!

In [None]:
# Create a weight matrix similar to what you'd find in a transformer
# (e.g., query/key/value projection in attention)

d_model = 768  # Hidden dimension (like BERT-base)

# Simulate a trained weight matrix (not purely random - has structure)
# Assumption for this demo: trained weights behave like a low-rank signal plus noise
np.random.seed(42)

# Create a low-rank matrix plus noise (simulates trained weights)
true_rank = 64  # The "true" information content
A = np.random.randn(d_model, true_rank) / np.sqrt(true_rank)
B = np.random.randn(true_rank, d_model) / np.sqrt(true_rank)
noise = np.random.randn(d_model, d_model) * 0.01  # Small noise

W_neural = A @ B + noise

print(f"Simulated weight matrix: {W_neural.shape}")
print(f"Total parameters: {W_neural.size:,}")
print(f"Memory (float32): {W_neural.size * 4 / 1e6:.2f} MB")

In [None]:
# Perform SVD on the neural network weight
U_nn, S_nn, Vt_nn = np.linalg.svd(W_neural, full_matrices=False)

# Analyze singular value spectrum
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Singular values (log scale)
axes[0].semilogy(S_nn, 'b-', linewidth=2)
axes[0].axhline(y=S_nn[true_rank-1], color='r', linestyle='--', 
               label=f'Rank {true_rank} threshold')
axes[0].set_xlabel('Singular Value Index', fontsize=12)
axes[0].set_ylabel('Singular Value (log scale)', fontsize=12)
axes[0].set_title(f'Singular Values of {d_model}×{d_model} Weight Matrix', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(0, 200)

# Reconstruction error vs rank
ranks = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 768]
errors = []
for r in ranks:
    W_approx = reconstruct_low_rank(U_nn, S_nn, Vt_nn, r)
    errors.append(relative_error(W_neural, W_approx) * 100)

axes[1].semilogx(ranks, errors, 'bo-', linewidth=2, markersize=8)
axes[1].axvline(x=true_rank, color='r', linestyle='--', label=f'True rank ({true_rank})')
axes[1].set_xlabel('Rank (log scale)', fontsize=12)
axes[1].set_ylabel('Reconstruction Error (%)', fontsize=12)
axes[1].set_title('Error vs Rank Trade-off', fontsize=14)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Reconstruction quality:")
for r, e in zip(ranks, errors):
    if r <= 128:
        print(f"  Rank {r:3d}: {e:6.3f}% error")

---

## Part 4: Connecting to LoRA

### The LoRA Insight

LoRA (Low-Rank Adaptation) is based on this observation:

> The **change** in weights during fine-tuning is often low-rank!

Instead of updating the full weight matrix:
$$W_{new} = W_{pretrained} + \Delta W$$

LoRA parameterizes $\Delta W$ as a low-rank product:
$$W_{new} = W_{pretrained} + BA$$

Where:
- $B \in \mathbb{R}^{d \times r}$ (d=768, r=16 typically)
- $A \in \mathbb{R}^{r \times d}$
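
A quick numerical check of this parameterization: the product $BA$ is a full-size $d \times d$ matrix, but its rank can never exceed $r$. The sketch below uses random factors purely for illustration (in real LoRA training, $B$ starts at zero and both factors are learned).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 768, 16

B = rng.standard_normal((d, r))   # B in R^{d x r}
A = rng.standard_normal((r, d))   # A in R^{r x d}
delta_W = B @ A                   # full-size (d, d) update

print(delta_W.shape)                         # (768, 768)
print(np.linalg.matrix_rank(delta_W))        # 16 (never more than r)
print(B.size + A.size, "vs", delta_W.size)   # 24576 vs 589824
```

The update touches every entry of the weight matrix, yet it is described by only $2dr$ numbers instead of $d^2$.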

### 🧒 ELI5: LoRA

> **Instead of repainting your entire house (fine-tuning all weights)...**
>
> You just add a thin layer of new wallpaper in the rooms that need updating!
>
> - Full fine-tuning: 768×768 = 589,824 parameters to update
> - LoRA (rank 16): 768×16 + 16×768 = 24,576 parameters (about 96% fewer!)

In [None]:
class LoRALayer:
    """
    Simplified LoRA implementation for understanding.
    
    Instead of modifying W directly, we add a low-rank update:
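    output = x @ W + (x @ A @ B) * scaling

    Note: this class uses the row-vector convention (x on the left), so
    A has shape (d_in, rank) and B has shape (rank, d_out). That is the
    transpose of the Delta W = BA notation used in the text.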
    
    During training, W is frozen and only B, A are updated.
    """
    
    def __init__(self, d_in, d_out, rank=16, alpha=16):
        """
        Args:
            d_in: Input dimension
            d_out: Output dimension
            rank: Rank of the low-rank update
            alpha: Scaling factor for the update
        """
        self.d_in = d_in
        self.d_out = d_out
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        # Pretrained weight (frozen during fine-tuning)
        self.W = np.random.randn(d_in, d_out) * 0.02
        
        # LoRA matrices (trainable)
        # A is initialized with small random values
        # B is initialized to zero (so initial output is same as pretrained)
        self.A = np.random.randn(d_in, rank) * 0.01
        self.B = np.zeros((rank, d_out))
    
    def forward(self, x):
        """Forward pass: original + low-rank update"""
        # Original output
        out_pretrained = x @ self.W
        
        # LoRA update
        out_lora = (x @ self.A @ self.B) * self.scaling
        
        return out_pretrained + out_lora
    
    def get_merged_weight(self):
        """Merge LoRA into the weight (for inference)"""
        return self.W + (self.A @ self.B) * self.scaling
    
    def count_trainable_params(self):
        """Count trainable parameters (A and B only)"""
        return self.A.size + self.B.size
    
    def count_total_params(self):
        """Count all parameters"""
        return self.W.size + self.A.size + self.B.size

# Example usage
d = 768  # BERT-base hidden size
rank = 16  # Typical LoRA rank

lora = LoRALayer(d, d, rank=rank)

print("LoRA Layer Analysis")
print("=" * 50)
print(f"Input/Output dimension: {d}")
print(f"LoRA rank: {rank}")
print()
print(f"Pretrained W: {d}×{d} = {d*d:,} params (frozen)")
print(f"LoRA A: {d}×{rank} = {d*rank:,} params (trainable)")
print(f"LoRA B: {rank}×{d} = {rank*d:,} params (trainable)")
print()
print(f"Total trainable: {lora.count_trainable_params():,} params")
print(f"Percentage trainable: {lora.count_trainable_params() / lora.count_total_params() * 100:.2f}%")
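
The `get_merged_weight` method relies on the two-path forward pass and a single matmul against the merged weight giving the same output. Here is a standalone sketch of that equivalence, using the same shape convention as `LoRALayer` (A is `(d_in, rank)`, B is `(rank, d_out)`). The small dimensions and the nonzero `B` are assumptions made so the check is not trivial.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 32, 24, 4, 8
scaling = alpha / rank

W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((d_in, rank)) * 0.01    # LoRA factor
B = rng.standard_normal((rank, d_out)) * 0.01   # nonzero, as after some training
x = rng.standard_normal((5, d_in))              # batch of 5 inputs

# Two-path forward pass, as used during training
out_two_path = x @ W + (x @ A @ B) * scaling

# Single matmul against the merged weight, as used for inference
W_merged = W + (A @ B) * scaling
out_merged = x @ W_merged

print(np.allclose(out_two_path, out_merged))   # True
```

Because the merged weight has the same shape as `W`, a merged LoRA model adds no extra latency at inference time.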

In [None]:
# Visualize memory savings at different ranks

d_model = 768
ranks = [1, 2, 4, 8, 16, 32, 64, 128, 256]

full_params = d_model * d_model
lora_params = [2 * d_model * r for r in ranks]
savings = [(1 - lp/full_params) * 100 for lp in lora_params]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Parameters comparison
axes[0].bar(['Full'] + [f'r={r}' for r in ranks], 
           [full_params] + lora_params,
           color=['red'] + ['steelblue']*len(ranks),
           alpha=0.7)
axes[0].set_ylabel('Number of Trainable Parameters', fontsize=12)
axes[0].set_title('Full Fine-tuning vs LoRA Parameters', fontsize=14)
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3, axis='y')

# Savings percentage
axes[1].bar([f'r={r}' for r in ranks], savings, color='green', alpha=0.7)
axes[1].axhline(y=95, color='red', linestyle='--', label='95% savings')
axes[1].set_ylabel('Memory Savings (%)', fontsize=12)
axes[1].set_title('Memory Savings with LoRA', fontsize=14)
axes[1].tick_params(axis='x', rotation=45)
axes[1].legend()
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0, 100)

plt.tight_layout()
plt.show()

print("\n📊 Memory savings:")
for r, s in zip(ranks, savings):
    print(f"  Rank {r:3d}: {s:.1f}% savings ({2*d_model*r:,} vs {full_params:,} params)")

---

## Part 5: Real-World Example - Transformer Attention

Let's see how this applies to a real transformer layer.

In [None]:
import torch.nn as nn

class AttentionWithLoRA(nn.Module):
    """
    Multi-head attention with optional LoRA adapters.
    
    This shows how LoRA is typically applied in practice.
    """
    
    def __init__(self, d_model=768, n_heads=12, lora_rank=0, lora_alpha=16):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.lora_rank = lora_rank
        
        # Standard attention projections (frozen if using LoRA)
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        
        # LoRA adapters (only created if rank > 0)
        if lora_rank > 0:
            self.scaling = lora_alpha / lora_rank
            
            # LoRA for Q and V (common choice)
            self.lora_q_A = nn.Linear(d_model, lora_rank, bias=False)
            self.lora_q_B = nn.Linear(lora_rank, d_model, bias=False)
            
            self.lora_v_A = nn.Linear(d_model, lora_rank, bias=False)
            self.lora_v_B = nn.Linear(lora_rank, d_model, bias=False)
            
            # Initialize B to zero
            nn.init.zeros_(self.lora_q_B.weight)
            nn.init.zeros_(self.lora_v_B.weight)
            
            # Freeze original weights
            for p in [self.W_q, self.W_k, self.W_v, self.W_o]:
                p.weight.requires_grad = False
    
    def forward(self, x):
        """Forward pass with LoRA if enabled"""
        # Standard projections
        q = self.W_q(x)
        k = self.W_k(x)
        v = self.W_v(x)
        
        # Add LoRA updates
        if self.lora_rank > 0:
            q = q + self.lora_q_B(self.lora_q_A(x)) * self.scaling
            v = v + self.lora_v_B(self.lora_v_A(x)) * self.scaling
        
        # Simplified attention (just for illustration)
        # In practice, would reshape for multi-head, apply softmax, etc.
        return self.W_o(v)  # Simplified output
    
    def count_params(self):
        """Count trainable vs total parameters"""
        total = sum(p.numel() for p in self.parameters())
        trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
        return trainable, total

# Compare full fine-tuning vs LoRA
print("Attention Layer Parameter Comparison")
print("=" * 50)

# Full fine-tuning
full_attn = AttentionWithLoRA(lora_rank=0)
full_train, full_total = full_attn.count_params()
print(f"\nFull Fine-tuning:")
print(f"  Trainable: {full_train:,} ({full_train/full_total*100:.1f}%)")
print(f"  Total: {full_total:,}")

# LoRA with different ranks
for rank in [4, 8, 16, 32]:
    lora_attn = AttentionWithLoRA(lora_rank=rank)
    lora_train, lora_total = lora_attn.count_params()
    savings = (1 - lora_train/full_train) * 100
    print(f"\nLoRA (rank={rank}):")
    print(f"  Trainable: {lora_train:,} ({lora_train/lora_total*100:.1f}%)")
    print(f"  Savings vs full: {savings:.1f}%")
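
The counts reported by `count_params` can be cross-checked with plain arithmetic. The sketch below assumes the same setup as the class above: four bias-free `d_model × d_model` projections, with LoRA pairs on Q and V only. `attention_lora_counts` is a helper name introduced here.

```python
d_model = 768

def attention_lora_counts(d_model, rank):
    """Return (trainable, total) for four bias-free d x d projections with LoRA on Q and V."""
    frozen = 4 * d_model * d_model          # W_q, W_k, W_v, W_o
    trainable = 2 * (2 * d_model * rank)    # one (A, B) pair each for Q and V
    return trainable, frozen + trainable

for rank in [4, 8, 16, 32]:
    trainable, total = attention_lora_counts(d_model, rank)
    print(f"rank={rank:2d}: {trainable:,} trainable of {total:,} ({trainable / total:.2%})")
```

At rank 16 this gives 49,152 trainable parameters out of 2,408,448, so roughly 2% of the layer is updated.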

---

## Part 6: When Does Low-Rank Work?

Not all matrices can be well-approximated with low rank. Let's explore when this works.

In [None]:
# Compare different types of matrices

def analyze_matrix(W, name):
    """Analyze how compressible a matrix is"""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    
    # Compute energy captured at different ranks
    total_energy = (S ** 2).sum()
    
    # Find rank needed for 95% energy
    cumsum = np.cumsum(S ** 2) / total_energy
    rank_95 = np.argmax(cumsum >= 0.95) + 1
    rank_99 = np.argmax(cumsum >= 0.99) + 1
    
    return {
        'name': name,
        'shape': W.shape,
        'rank_95': rank_95,
        'rank_99': rank_99,
        'full_rank': len(S),
        'singular_values': S
    }

# Create different matrix types
size = 256

matrices = {
    'Low-rank (r=10)': np.random.randn(size, 10) @ np.random.randn(10, size) / 10,
    'Smooth (natural)': create_test_matrix(size),
    'Random (full rank)': np.random.randn(size, size) / np.sqrt(size),
    'Identity-like': np.eye(size) + np.random.randn(size, size) * 0.1,
}

# Analyze each
results = [analyze_matrix(W, name) for name, W in matrices.items()]

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Singular value decay
for res in results:
    S_normalized = res['singular_values'] / res['singular_values'][0]
    axes[0].semilogy(S_normalized[:100], linewidth=2, label=res['name'])

axes[0].set_xlabel('Singular Value Index', fontsize=12)
axes[0].set_ylabel('Normalized Singular Value', fontsize=12)
axes[0].set_title('Singular Value Decay Rate', fontsize=14)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Rank needed for 95% energy
names = [r['name'] for r in results]
rank_95 = [r['rank_95'] for r in results]
rank_99 = [r['rank_99'] for r in results]

x = np.arange(len(names))
width = 0.35

axes[1].bar(x - width/2, rank_95, width, label='95% energy', alpha=0.7)
axes[1].bar(x + width/2, rank_99, width, label='99% energy', alpha=0.7)
axes[1].axhline(y=size, color='red', linestyle='--', label=f'Full rank ({size})')
axes[1].set_xlabel('Matrix Type', fontsize=12)
axes[1].set_ylabel('Rank Needed', fontsize=12)
axes[1].set_title('Rank Needed to Capture Information', fontsize=14)
axes[1].set_xticks(x)
axes[1].set_xticklabels([n.split()[0] for n in names], rotation=15)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Compressibility Analysis:")
print(f"{'Matrix Type':<20} {'95% Rank':<12} {'99% Rank':<12} {'Full Rank':<10}")
print("-" * 55)
for res in results:
    print(f"{res['name']:<20} {res['rank_95']:<12} {res['rank_99']:<12} {res['full_rank']:<10}")

### 🔍 Key Insights

1. **Low-rank matrices:** Easily compressed (by design)
2. **Smooth/natural patterns:** Also compressible (patterns repeat)
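3. **Random and identity-like matrices:** NOT compressible (energy is spread across almost all components)
4. **Trained neural network weights:** Typically more structured than random matrices, but not necessarily low-rank themselves

**Why LoRA works:** The fine-tuning *update* $\Delta W$ tends to be low-rank, so a small $BA$ product can capture it even when the pretrained weight itself is not very compressible.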
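
It helps to separate two questions: how compressible the weights are, and how compressible the *change* in the weights is. The following sketch builds a deliberately incompressible base matrix and adds a rank-8 update. Both are synthetic stand-ins chosen for illustration, not real model weights, and `energy_rank_of` is a helper introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 8

# Dense, unstructured "pretrained" weight and a synthetic rank-r update
W_pre = rng.standard_normal((d, d)) / np.sqrt(d)
delta = rng.standard_normal((d, r)) @ rng.standard_normal((r, d)) * 0.01
W_new = W_pre + delta

def energy_rank_of(M, threshold=0.99):
    """Rank needed to capture `threshold` of the squared singular values of M."""
    S = np.linalg.svd(M, compute_uv=False)
    energy = np.cumsum(S ** 2) / np.sum(S ** 2)
    return min(int(np.searchsorted(energy, threshold)) + 1, len(S))

print("Rank for 99% energy, pretrained weight:", energy_rank_of(W_pre))
print("Rank for 99% energy, updated weight:   ", energy_rank_of(W_new))
print("Rank for 99% energy, update only:      ", energy_rank_of(W_new - W_pre))
```

Under these assumptions, both full weight matrices need most of their components, while the difference between them needs at most 8. LoRA exploits exactly this: it never compresses the pretrained weight, only the update.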

---

## ⚠️ Common Mistakes

### Mistake 1: Rank Too Low

```python
# ❌ Wrong: Rank 1 loses too much information
lora = LoRALayer(768, 768, rank=1)  # Very lossy!

# ✅ Right: Use rank 8-64 for most tasks
lora = LoRALayer(768, 768, rank=16)  # Good balance
```

### Mistake 2: Not Initializing B to Zero

```python
# ❌ Wrong: Random initialization changes pretrained behavior immediately
self.B = nn.Linear(rank, d_out, bias=False)

# ✅ Right: Initialize B to zero so initial output = pretrained
self.B = nn.Linear(rank, d_out, bias=False)
nn.init.zeros_(self.B.weight)
```
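
The effect of the initialization can be measured directly. This NumPy sketch (small made-up dimensions, same shape convention as `LoRALayer`) compares how far the layer output moves from the pretrained output before any training step:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 32, 4

W = rng.standard_normal((d_in, d_out)) * 0.02   # frozen pretrained weight
A = rng.standard_normal((d_in, rank)) * 0.01    # small random init for A
x = rng.standard_normal((5, d_in))
pretrained_out = x @ W

B_zero = np.zeros((rank, d_out))                # recommended init
B_rand = rng.standard_normal((rank, d_out))     # the mistake

drift_zero = np.abs(x @ W + x @ A @ B_zero - pretrained_out).max()
drift_rand = np.abs(x @ W + x @ A @ B_rand - pretrained_out).max()

print(f"max output change with zero B:   {drift_zero:.2e}")
print(f"max output change with random B: {drift_rand:.2e}")
```

With `B` at zero the adapter is an exact no-op at step 0, so training starts from the pretrained model's behavior rather than from a perturbed version of it.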

### Mistake 3: Applying LoRA to Every Projection by Default

```python
# ❌ Often unnecessary: LoRA on every attention projection (add_lora is a placeholder helper)
for layer in model.layers:
    layer.q_proj = add_lora(layer.q_proj)
    layer.k_proj = add_lora(layer.k_proj)
    layer.v_proj = add_lora(layer.v_proj)
    layer.o_proj = add_lora(layer.o_proj)

# ✅ Better starting point: Focus on Q and V (empirically effective in the LoRA paper)
for layer in model.layers:
    layer.q_proj = add_lora(layer.q_proj)
    layer.v_proj = add_lora(layer.v_proj)
```

---

## ✋ Try It Yourself

### Exercise: Find the Optimal Rank

Given a weight matrix, find the minimum rank needed to achieve less than 1% reconstruction error.

<details>
<summary>💡 Hint</summary>

1. Perform SVD on the matrix
2. Loop through ranks from 1 to full
3. Compute reconstruction error at each rank
4. Return first rank where error < 1%
</details>

In [None]:
def find_optimal_rank(W, target_error=0.01):
    """
    Find minimum rank needed to achieve target reconstruction error.
    
    Args:
        W: Input matrix
        target_error: Maximum acceptable relative error (default 1%)
    
    Returns:
        Optimal rank (int)
    """
    # TODO: Implement this
    # 1. Perform SVD on W
    # 2. Loop through ranks from 1 to len(S)
    # 3. For each rank, reconstruct and compute error using relative_error()
    # 4. Return first rank where error < target_error
    raise NotImplementedError("Implement the find_optimal_rank function")

# Test on our neural network weight (uncomment after implementing)
# optimal_r = find_optimal_rank(W_neural, target_error=0.01)
# print(f"Optimal rank for <1% error: {optimal_r}")
# print(f"Full rank: {W_neural.shape[0]}")
# print(f"Compression ratio: {W_neural.size / (2 * W_neural.shape[0] * optimal_r):.1f}x")

---

## 🎉 Checkpoint

You've learned:

- ✅ **SVD** decomposes matrices into singular values and vectors
- ✅ **Low-rank approximation** captures most information with fewer parameters
- ✅ **LoRA** uses this insight to fine-tune with 96%+ fewer parameters
- ✅ **Memory savings** scale with rank (lower rank = smaller adapters)
- ✅ Trained weights are usually **more compressible** than random matrices

**Key insight:** LoRA works because weight updates during fine-tuning are low-rank!

---

## 📖 Further Reading

- [LoRA Paper](https://arxiv.org/abs/2106.09685) - Original LoRA paper
- [QLoRA Paper](https://arxiv.org/abs/2305.14314) - Quantized LoRA
- [SVD Tutorial](https://www.youtube.com/watch?v=mBcLRGuAFUk) - Visual explanation
- [Hugging Face PEFT](https://huggingface.co/docs/peft) - LoRA implementation

---

## 🧹 Cleanup

In [None]:
import gc
if torch.cuda.is_available():
    torch.cuda.empty_cache()
gc.collect()

print("✅ Cleanup complete!")
print("\n➡️  Next: Lab 1.4.5 - Probability Distributions Lab")