# Lab 3.1.2: DoRA - Weight-Decomposed Low-Rank Adaptation

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how DoRA improves upon standard LoRA
- [ ] Implement DoRA's weight decomposition from scratch
- [ ] Compare LoRA vs DoRA on the same task
- [ ] Measure the LoRA vs DoRA gap on a synthetic task, and relate it to the +3.7 point commonsense-reasoning gain the DoRA paper reports for LLaMA-7B

---

## Prerequisites

- Completed: Lab 3.1.1 (LoRA Theory)
- Knowledge of: LoRA mathematics, PyTorch

---

## Real-World Context

### The Problem with Standard LoRA

LoRA works great, but researchers at NVIDIA discovered it has a subtle limitation: **it couples magnitude and direction updates together**.

When you update weights with LoRA ($W = W_0 + BA$), you're changing both:
- **How strongly** neurons fire (magnitude)
- **What patterns** they respond to (direction)

But what if the model already has the right "strength" and just needs to learn new "patterns"? LoRA can't separate these!
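
To see the coupling numerically, here is a minimal sketch. It assumes a random matrix as a stand-in for pretrained weights and a random low-rank product as a stand-in for a trained LoRA update; the sizes and scales are arbitrary choices for illustration, not values from a real model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_out, d_in, r = 64, 32, 4
W0 = torch.randn(d_out, d_in)          # stand-in for pretrained weights

# Hypothetical low-rank update standing in for a trained LoRA delta
B = torch.randn(d_out, r) * 0.3
A = torch.randn(r, d_in) * 0.3
W = W0 + B @ A

# Magnitude: per-column L2 norm. Direction: cosine between old and new columns.
magnitude_change = W.norm(dim=0) - W0.norm(dim=0)
direction_change = 1 - F.cosine_similarity(W0, W, dim=0)

print(f"mean |magnitude change|: {magnitude_change.abs().mean():.4f}")
print(f"mean direction change:   {direction_change.mean():.4f}")
```

A single additive update moves both quantities at once; there is no separate knob for either one.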

**DoRA (Weight-Decomposed LoRA)** fixes this by explicitly separating magnitude and direction. The DoRA paper reports **+3.7 points over LoRA on commonsense reasoning** benchmarks with LLaMA-7B!

---

## ELI5: What is DoRA?

> **Imagine you're learning to throw darts.** There are two things you need to get right:
>
> 1. **How hard you throw** (magnitude) - Too soft and it falls short, too hard and it goes past the board
> 2. **Which direction you aim** (direction) - Left, right, up, down
>
> **Standard LoRA** is like learning both at the same time with the same practice throws. It works, but it's not optimal.
>
> **DoRA** separates these: it keeps one "strength" dial per weight column (a single number each) and learns the "aim" (the direction pattern) with LoRA. Both are trained at the same time, but as separate parameters, which makes learning more efficient!
>
> **In AI terms:** DoRA decomposes the weight matrix into a magnitude vector and a direction matrix, then applies LoRA only to the direction. This gives the model more flexibility in how it adapts.

---

## Part 1: The Mathematics of DoRA

### Standard LoRA Recap

In LoRA, the adapted weight is:

$$W = W_0 + \Delta W = W_0 + BA$$

### DoRA's Key Insight

DoRA decomposes the pretrained weights into **magnitude** and **direction**:

$$W_0 = m \cdot \frac{V}{\|V\|_c}$$

Where:
- $m \in \mathbb{R}^{1 \times k}$ is the **magnitude vector** (one value per column)
- $V \in \mathbb{R}^{d \times k}$ is the **direction matrix**
- $\|V\|_c$ denotes the column-wise norm
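
The decomposition can be checked in a few lines. This is a minimal sketch, assuming a small random matrix in PyTorch's `(out_features, in_features)` layout in place of real pretrained weights, and taking $V = W_0$ so that $m$ is simply the vector of column norms.

```python
import torch

torch.manual_seed(0)

# Stand-in for a pretrained weight in PyTorch's (out_features, in_features) layout
W0 = torch.randn(6, 4)

# Column-wise L2 norm: one magnitude per column, shape (1, 4)
m = W0.norm(dim=0, keepdim=True)

# Take V = W0, so V / ||V||_c holds unit-norm columns
V = W0
direction = V / V.norm(dim=0, keepdim=True)

print(direction.norm(dim=0))              # column norms of the direction matrix
print(torch.allclose(m * direction, W0))  # does m * direction rebuild W0?
```

Note that `dim=0` reduces over the output dimension, which is what gives one value per column.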

### DoRA's Adaptation

DoRA then adapts both components:

$$W' = (m + \Delta m) \cdot \frac{V + \Delta V}{\|V + \Delta V\|_c}$$

Where:
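- $\Delta m$ is a **trainable magnitude adjustment** (very few parameters!). In code, $m + \Delta m$ is stored as a single trainable vector initialized to $\|W_0\|_c$
- $\Delta V = BA$ is the **LoRA update to direction**, with $V = W_0$ kept frozen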
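
As a functional sketch of this formula: the helper name `dora_weight` and the random values for `B` and `A` are assumptions for illustration, and the single tensor `m` plays the role of $m + \Delta m$.

```python
import torch

def dora_weight(W0, m, B, A, eps=1e-8):
    """W' = m * (W0 + BA) / ||W0 + BA||_c, using column-wise norms."""
    V_new = W0 + B @ A
    return m * V_new / (V_new.norm(dim=0, keepdim=True) + eps)

torch.manual_seed(0)
d_out, d_in, r = 8, 5, 2
W0 = torch.randn(d_out, d_in)
m = W0.norm(dim=0, keepdim=True)   # magnitude initialised from W0
B = torch.randn(d_out, r) * 0.5    # hypothetical trained values
A = torch.randn(r, d_in) * 0.5

W_new = dora_weight(W0, m, B, A)

# The direction moved, yet each column norm is still set by m alone
print(torch.allclose(W_new.norm(dim=0, keepdim=True), m, atol=1e-5))

# Doubling m rescales every column without touching its direction
W_scaled = dora_weight(W0, 2 * m, B, A)
print(torch.allclose(W_scaled, 2 * W_new, atol=1e-5))
```

Whatever the low-rank update does, it can only rotate columns; their lengths are controlled by `m`.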

### Why This Works Better

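1. **Decoupled learning**: Magnitude and direction get separate trainable parameters, so one can change without forcing a change in the other
2. **Closer to full fine-tuning**: The DoRA paper's analysis finds that full fine-tuning can pair large direction changes with small magnitude changes (and vice versa), while LoRA tends to change both proportionally; DoRA's update pattern looks more like full fine-tuning
3. **Better gradient flow**: Normalization rescales the direction gradient by $m / \|V + \Delta V\|_c$ and removes its component along the current direction, which the paper argues stabilizes optimization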
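
To make the third point concrete, here is a sketch of the gradients for a single column. It assumes the column is written as $w' = m\,v'/\|v'\|$ with scalar magnitude $m$ and unnormalized direction $v' = v + \Delta v$; the per-column symbols $w'$, $v'$ and the loss $\mathcal{L}$ are notation introduced here for the derivation.

$$\nabla_{v'}\mathcal{L} = \frac{m}{\|v'\|}\left(I - \frac{v'\,v'^{\top}}{\|v'\|^{2}}\right)\nabla_{w'}\mathcal{L}, \qquad \nabla_{m}\mathcal{L} = \frac{v'^{\top}\,\nabla_{w'}\mathcal{L}}{\|v'\|}$$

The projection term removes the part of the gradient that points along $v'$, so the direction update is orthogonal to the current direction, while the magnitude gradient is exactly the component along it. The two parameters therefore receive non-overlapping parts of the training signal.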

In [None]:
# Setup: Import required libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, List, Tuple, Dict
import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

---

## Part 2: Implementing DoRA from Scratch

Let's implement DoRA step by step to understand exactly how it works.

In [None]:
class DoRALayer(nn.Module):
    """
    DoRA: Weight-Decomposed Low-Rank Adaptation
    
    Implements the DoRA method from "DoRA: Weight-Decomposed Low-Rank Adaptation"
    (https://arxiv.org/abs/2402.09353)
    
    The key insight: Decompose W = m * (V / ||V||) where:
    - m is magnitude (trainable scalar per column)
    - V is direction (adapted with LoRA)
    """
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha
        self.scaling = alpha / rank
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # Freeze original weights
        self.original_layer.weight.requires_grad = False
        if self.original_layer.bias is not None:
            self.original_layer.bias.requires_grad = False
        
        # === DoRA-specific: Magnitude vector ===
        # Initialize from the column norms of W0
        with torch.no_grad():
            weight = original_layer.weight.data  # (out_features, in_features)
            # Compute column-wise L2 norm
            self.magnitude = nn.Parameter(
                weight.norm(dim=0, keepdim=True)  # Shape: (1, in_features)
            )
        
        # === LoRA matrices (for direction) ===
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        # Initialize A with Kaiming, B stays zero
        nn.init.kaiming_uniform_(self.lora_A, a=np.sqrt(5))
        
        # Dropout
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
        
        # Store original direction (normalized); not used in forward(), kept for later analysis
        with torch.no_grad():
            self.register_buffer(
                'original_direction',
                weight / (weight.norm(dim=0, keepdim=True) + 1e-8)
            )
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        DoRA forward pass:
        W' = m * normalize(V + BA)
        """
        # Get original weight
        weight = self.original_layer.weight
        
        # Compute LoRA delta for direction
        lora_delta = self.scaling * (self.lora_B @ self.lora_A)  # (out, in)
        
        # Update direction: V' = W0 + BA (before normalization)
        updated_weight = weight + lora_delta
        
        # Normalize to get direction
        direction_norm = updated_weight.norm(dim=0, keepdim=True) + 1e-8
        normalized_direction = updated_weight / direction_norm
        
        # Apply magnitude
        dora_weight = self.magnitude * normalized_direction
        
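        # Forward pass. Simplification: dropout here hits the whole input, whereas
        # LoRALayer below applies it only to the low-rank branch.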
        result = F.linear(self.dropout(x), dora_weight, self.original_layer.bias)
        
        return result
    
    @property
    def trainable_params(self) -> int:
        """LoRA params + magnitude vector."""
        return self.lora_A.numel() + self.lora_B.numel() + self.magnitude.numel()
    
    def get_magnitude_change(self) -> torch.Tensor:
        """Get how much magnitude has changed from original."""
        with torch.no_grad():
            original_mag = self.original_layer.weight.norm(dim=0)
            return (self.magnitude.squeeze() - original_mag).abs()


# Compare with standard LoRA
class LoRALayer(nn.Module):
    """Standard LoRA for comparison."""
    
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 8,
        alpha: float = 16.0,
        dropout: float = 0.0,
    ):
        super().__init__()
        
        self.original_layer = original_layer
        self.rank = rank
        self.scaling = alpha / rank
        
        in_features = original_layer.in_features
        out_features = original_layer.out_features
        
        # Freeze original
        self.original_layer.weight.requires_grad = False
        if self.original_layer.bias is not None:
            self.original_layer.bias.requires_grad = False
        
        # LoRA matrices
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=np.sqrt(5))
        
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        result = self.original_layer(x)
        lora_output = self.dropout(x) @ self.lora_A.T @ self.lora_B.T
        return result + self.scaling * lora_output
    
    @property
    def trainable_params(self) -> int:
        return self.lora_A.numel() + self.lora_B.numel()

In [None]:
# Let's test our DoRA implementation
print("=" * 60)
print("DoRA vs LoRA Parameter Comparison")
print("=" * 60)

# Create a sample layer (typical transformer projection)
original = nn.Linear(4096, 4096, bias=False)

# Create both adapters
lora = LoRALayer(nn.Linear(4096, 4096, bias=False), rank=16, alpha=32)
dora = DoRALayer(nn.Linear(4096, 4096, bias=False), rank=16, alpha=32)

print(f"\nOriginal layer parameters: {original.weight.numel():,}")
print(f"\nLoRA trainable parameters: {lora.trainable_params:,}")
print(f"  - lora_A: {lora.lora_A.numel():,}")
print(f"  - lora_B: {lora.lora_B.numel():,}")

print(f"\nDoRA trainable parameters: {dora.trainable_params:,}")
print(f"  - lora_A: {dora.lora_A.numel():,}")
print(f"  - lora_B: {dora.lora_B.numel():,}")
print(f"  - magnitude: {dora.magnitude.numel():,}")

print(f"\nDoRA overhead vs LoRA: +{dora.trainable_params - lora.trainable_params:,} params")
print(f"  ({(dora.trainable_params / lora.trainable_params - 1) * 100:.2f}% more)")

### What's Happening?

DoRA adds a **magnitude vector** with one value per input feature. For a 4096-dimensional layer:
- LoRA: 2 × 16 × 4096 = 131,072 parameters
- DoRA: 131,072 + 4,096 = 135,168 parameters (~3% more)

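In the DoRA paper, this small overhead comes with consistent accuracy gains over LoRA. Note that the relative overhead shrinks as the rank grows, because the magnitude vector stays the same size.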
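
The arithmetic above generalizes to any layer shape. A minimal sketch, with hypothetical helper names `lora_param_count` and `dora_param_count`, and following this lab's convention of one magnitude value per input feature:

```python
def lora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """A is (rank, in_features), B is (out_features, rank)."""
    return rank * in_features + out_features * rank

def dora_param_count(in_features: int, out_features: int, rank: int) -> int:
    """LoRA matrices plus one magnitude value per column (input feature)."""
    return lora_param_count(in_features, out_features, rank) + in_features

lora_n = lora_param_count(4096, 4096, rank=16)
dora_n = dora_param_count(4096, 4096, rank=16)
overhead_pct = 100 * (dora_n - lora_n) / lora_n
print(lora_n, dora_n, f"{overhead_pct:.3f}%")   # 131072 135168 3.125%
```

For a square layer the overhead works out to `1 / (2 * rank)`, so halving the rank doubles the relative cost of the magnitude vector.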

---

## Part 3: Visualizing the Difference

Let's visualize how LoRA and DoRA update weights differently.

### Key PyTorch Functions Used

| Function | Description |
|----------|-------------|
| `F.cosine_similarity(a, b, dim)` | Computes cosine similarity between tensors along specified dimension. Returns values in [-1, 1] where 1 = identical direction. |
| `F.scaled_dot_product_attention(q, k, v)` | PyTorch 2.0+ fused attention; dispatches to FlashAttention or memory-efficient kernels when the hardware and inputs support them. Used in Part 4. |
| `torch.nn.utils.clip_grad_norm_(params, max_norm)` | Clips gradient norms to prevent exploding gradients during training. Used in Part 4. |
| `module.register_buffer(name, tensor)` | Registers a tensor as a buffer (saved with model but not a parameter). Used for non-trainable state. |
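
Two of these are easy to misread, so here is a small sketch of them in isolation. The toy class `WithBuffer` and the example matrices are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# cosine_similarity with dim=0 compares matching columns of two matrices
a = torch.tensor([[1.0, 0.0], [0.0, 1.0]])
b = torch.tensor([[2.0, 0.0], [0.0, -3.0]])
col_sim = F.cosine_similarity(a, b, dim=0)
print(col_sim)  # first column: same direction, second column: opposite

class WithBuffer(nn.Module):
    """Toy module (hypothetical) holding one parameter and one buffer."""
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(2))
        self.register_buffer("reference", torch.zeros(2))

module = WithBuffer()
param_names = [name for name, _ in module.named_parameters()]
state_keys = list(module.state_dict().keys())
print(param_names)  # the buffer is not listed as a parameter
print(state_keys)   # but it is saved in the state dict
```

Because the buffer is not a parameter, an optimizer built from `module.parameters()` never updates it.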

In [None]:
def visualize_weight_decomposition():
    """
    Visualize how DoRA decomposes weights into magnitude and direction.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    # Create a small weight matrix for visualization
    d = 64
    W = torch.randn(d, d) * 0.5
    
    # Decompose into magnitude and direction
    magnitude = W.norm(dim=0, keepdim=True)  # (1, d)
    direction = W / (magnitude + 1e-8)  # (d, d) - unit vectors
    
    # Verify: magnitude * direction = W
    reconstructed = magnitude * direction
    reconstruction_error = (W - reconstructed).abs().max().item()
    print(f"Reconstruction error: {reconstruction_error:.2e}")
    
    # Row 1: Original decomposition
    im0 = axes[0, 0].imshow(W.numpy(), cmap='RdBu', aspect='auto')
    axes[0, 0].set_title('Original Weight W')
    axes[0, 0].set_xlabel('Input features')
    axes[0, 0].set_ylabel('Output features')
    plt.colorbar(im0, ax=axes[0, 0])
    
    axes[0, 1].bar(range(d), magnitude.squeeze().numpy(), color='steelblue')
    axes[0, 1].set_title('Magnitude (per column)')
    axes[0, 1].set_xlabel('Column index')
    axes[0, 1].set_ylabel('||W[:, i]||')
    
    im2 = axes[0, 2].imshow(direction.numpy(), cmap='RdBu', aspect='auto', vmin=-1, vmax=1)
    axes[0, 2].set_title('Direction (normalized)')
    axes[0, 2].set_xlabel('Input features')
    axes[0, 2].set_ylabel('Output features')
    plt.colorbar(im2, ax=axes[0, 2])
    
    # Row 2: Simulated updates
    # LoRA update (random low-rank)
    r = 8
    B = torch.randn(d, r) * 0.1
    A = torch.randn(r, d) * 0.1
    lora_delta = B @ A
    
    # LoRA changes both magnitude and direction together
    W_lora = W + lora_delta
    lora_mag = W_lora.norm(dim=0, keepdim=True)
    
    # DoRA: Only changes direction, magnitude is separate
    direction_updated = direction + lora_delta / (magnitude + 1e-8)
    direction_updated = direction_updated / (direction_updated.norm(dim=0, keepdim=True) + 1e-8)
    
    im3 = axes[1, 0].imshow(lora_delta.numpy(), cmap='RdBu', aspect='auto')
    axes[1, 0].set_title('LoRA Update (BA)')
    axes[1, 0].set_xlabel('Input features')
    axes[1, 0].set_ylabel('Output features')
    plt.colorbar(im3, ax=axes[1, 0])
    
    # Magnitude change comparison
    lora_mag_change = (lora_mag - magnitude).squeeze()
    axes[1, 1].bar(range(d), lora_mag_change.numpy(), color='coral', alpha=0.7, label='LoRA')
    axes[1, 1].axhline(y=0, color='black', linestyle='-', linewidth=0.5)
    axes[1, 1].set_title('Magnitude Change (LoRA)')
    axes[1, 1].set_xlabel('Column index')
    axes[1, 1].set_ylabel('Change in magnitude')
    
    # Direction change (cosine similarity)
    cos_sim_lora = F.cosine_similarity(
        direction, W_lora / (W_lora.norm(dim=0, keepdim=True) + 1e-8), dim=0
    )
    axes[1, 2].bar(range(d), (1 - cos_sim_lora).numpy(), color='purple')
    axes[1, 2].set_title('Direction Change (1 - cos_sim)')
    axes[1, 2].set_xlabel('Column index')
    axes[1, 2].set_ylabel('Direction divergence')
    
    plt.tight_layout()
    plt.savefig('dora_weight_decomposition.png', dpi=150, bbox_inches='tight')
    plt.show()
    plt.close(fig)
    
    print("\nKey Insight:")
    print("- LoRA changes BOTH magnitude and direction in a coupled way")
    print("- DoRA separates these, allowing independent optimization")

visualize_weight_decomposition()

---

## Part 4: Head-to-Head Comparison

Let's train both LoRA and DoRA on the same task and compare their performance.

In [None]:
class SimpleTransformerBlock(nn.Module):
    """A simplified transformer block for comparison testing."""
    
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        
        # Attention projections
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        
        # MLP
        self.mlp_up = nn.Linear(d_model, d_model * 4)
        self.mlp_down = nn.Linear(d_model * 4, d_model)
        
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        
        # Attention
        normed = self.norm1(x)
        q = self.q_proj(normed).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(normed).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(normed).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = attn.transpose(1, 2).contiguous().view(B, T, C)
        x = x + self.o_proj(attn)
        
        # MLP
        x = x + self.mlp_down(F.gelu(self.mlp_up(self.norm2(x))))
        
        return x


def add_adapters_to_model(model: nn.Module, adapter_type: str, rank: int = 8, alpha: float = 16) -> nn.Module:
    """Add LoRA or DoRA adapters to attention projections."""
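    import copy
    model = copy.deepcopy(model)

    # Freeze every base parameter (MLP and LayerNorm included) so that only
    # the adapter parameters created below are trained and counted
    for param in model.parameters():
        param.requires_grad = False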
    
    AdapterClass = DoRALayer if adapter_type == 'dora' else LoRALayer
    
    for name in ['q_proj', 'k_proj', 'v_proj', 'o_proj']:
        original_layer = getattr(model, name)
        adapter_layer = AdapterClass(original_layer, rank=rank, alpha=alpha)
        setattr(model, name, adapter_layer)
    
    return model

In [None]:
def train_and_evaluate(
    adapter_type: str,
    n_epochs: int = 100,
    rank: int = 8,
    verbose: bool = True
) -> Dict:
    """
    Train a model with specified adapter type and return metrics.
    """
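    # Seed before building the model so LoRA and DoRA start from identical base weights
    torch.manual_seed(0)

    # Create model with adapters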
    base_model = SimpleTransformerBlock(d_model=256, n_heads=8).to(device)
    model = add_adapters_to_model(base_model, adapter_type, rank=rank, alpha=rank*2)
    model = model.to(device)
    
    # Count trainable params
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    # Generate synthetic data
    # Task: Learn a specific transformation pattern
    torch.manual_seed(42)  # Same data for fair comparison
    n_samples, seq_len, d_model = 500, 32, 256
    
    # Create a target transformation (the "ground truth" we're trying to learn)
    target_transform = nn.Linear(d_model, d_model, bias=False).to(device)
    with torch.no_grad():
        target_transform.weight.normal_(0, 0.1)
    
    x = torch.randn(n_samples, seq_len, d_model, device=device)
    y = target_transform(x) + torch.randn_like(x) * 0.01  # Slight noise
    
    # Training setup
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=1e-3, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, n_epochs)
    
    # Training loop
    losses = []
    batch_size = 32
    
    for epoch in range(n_epochs):
        model.train()
        epoch_loss = 0
        n_batches = 0
        
        # Shuffle data
        perm = torch.randperm(n_samples)
        
        for i in range(0, n_samples, batch_size):
            idx = perm[i:i+batch_size]
            batch_x, batch_y = x[idx], y[idx]
            
            optimizer.zero_grad()
            output = model(batch_x)
            loss = F.mse_loss(output, batch_y)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(params, 1.0)
            optimizer.step()
            
            epoch_loss += loss.item()
            n_batches += 1
        
        scheduler.step()
        avg_loss = epoch_loss / n_batches
        losses.append(avg_loss)
        
        if verbose and (epoch + 1) % 25 == 0:
            print(f"  Epoch {epoch+1}/{n_epochs}: Loss = {avg_loss:.6f}")
    
    # Evaluation
    model.eval()
    with torch.no_grad():
        test_x = torch.randn(100, seq_len, d_model, device=device)
        test_y = target_transform(test_x)
        test_pred = model(test_x)
        test_loss = F.mse_loss(test_pred, test_y).item()
    
    # Cleanup
    del model, optimizer, x, y
    torch.cuda.empty_cache()
    
    return {
        'adapter_type': adapter_type,
        'trainable_params': trainable,
        'train_losses': losses,
        'final_train_loss': np.mean(losses[-10:]),
        'test_loss': test_loss,
    }


# Run comparison
print("="*60)
print("LoRA vs DoRA Training Comparison")
print("="*60)

print("\nTraining with LoRA...")
lora_results = train_and_evaluate('lora', n_epochs=100, rank=8)

print("\nTraining with DoRA...")
dora_results = train_and_evaluate('dora', n_epochs=100, rank=8)

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Training curves
axes[0].plot(lora_results['train_losses'], label='LoRA', linewidth=2, color='blue')
axes[0].plot(dora_results['train_losses'], label='DoRA', linewidth=2, color='red')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Training Loss')
axes[0].set_title('Training Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
axes[0].set_yscale('log')

# Final metrics comparison
methods = ['LoRA', 'DoRA']
train_losses = [lora_results['final_train_loss'], dora_results['final_train_loss']]
test_losses = [lora_results['test_loss'], dora_results['test_loss']]

x_pos = np.arange(len(methods))
width = 0.35

axes[1].bar(x_pos - width/2, train_losses, width, label='Train Loss', color='steelblue')
axes[1].bar(x_pos + width/2, test_losses, width, label='Test Loss', color='coral')
axes[1].set_xlabel('Method')
axes[1].set_ylabel('Loss')
axes[1].set_title('Final Loss Comparison')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(methods)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Parameter count
params = [lora_results['trainable_params'], dora_results['trainable_params']]
axes[2].bar(methods, params, color=['blue', 'red'])
axes[2].set_xlabel('Method')
axes[2].set_ylabel('Trainable Parameters')
axes[2].set_title('Parameter Count Comparison')
for i, v in enumerate(params):
    axes[2].text(i, v + 100, f'{v:,}', ha='center')

plt.tight_layout()
plt.savefig('lora_vs_dora_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)

# Print summary
print("\n" + "="*60)
print("SUMMARY")
print("="*60)
improvement = (lora_results['test_loss'] - dora_results['test_loss']) / lora_results['test_loss'] * 100
param_overhead = (dora_results['trainable_params'] - lora_results['trainable_params']) / lora_results['trainable_params'] * 100

print(f"\nLoRA:")
print(f"  Trainable params: {lora_results['trainable_params']:,}")
print(f"  Final train loss: {lora_results['final_train_loss']:.6f}")
print(f"  Test loss: {lora_results['test_loss']:.6f}")

print(f"\nDoRA:")
print(f"  Trainable params: {dora_results['trainable_params']:,}")
print(f"  Final train loss: {dora_results['final_train_loss']:.6f}")
print(f"  Test loss: {dora_results['test_loss']:.6f}")

print(f"\nDoRA Advantage:")
print(f"  Test loss improvement: {improvement:.1f}%")
print(f"  Parameter overhead: +{param_overhead:.1f}%")

---

## Part 5: Using DoRA with PEFT (Production)

In practice, you'll use the PEFT library, which has DoRA built in (in recent releases). It's just **one flag**!

In [None]:
# Production DoRA configuration with PEFT
# (This is what you'll use in real fine-tuning)

peft_config_example = """
from peft import LoraConfig, get_peft_model

# Standard LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    # use_dora=False  # Default: standard LoRA
)

# DoRA config - just add one flag!
dora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    use_dora=True  # Enable DoRA!
)

# Apply to model (base_model is a model you have already loaded, e.g. with transformers)
model = get_peft_model(base_model, dora_config)
model.print_trainable_parameters()
"""

print("Production DoRA Configuration:")
print("=" * 50)
print(peft_config_example)

---

## Part 6: When to Use DoRA vs LoRA

| Scenario | Recommendation | Why |
|----------|----------------|-----|
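| Quick experiments | LoRA | Simpler, and faster per training step |
| Production fine-tuning | **DoRA** | Better quality in the paper's benchmarks, small parameter overhead |
| Commonsense reasoning | **DoRA** | +3.7 points over LoRA reported on LLaMA-7B |
| Code generation | DoRA | May benefit from directional adaptation; verify on your own evaluation |
| Memory-critical | LoRA | Slightly fewer params, and no weight-norm computation at each step |
| 70B+ models | **DoRA** | Quality gains may be worth the small overhead; measure training cost first |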

---

## Try It Yourself: Exercises

### Exercise 1: Rank Comparison

Compare LoRA vs DoRA at different ranks (4, 8, 16, 32) and plot the improvement.

<details>
<summary>Hint</summary>

Use a loop over ranks and call `train_and_evaluate` for both methods at each rank.
</details>

In [None]:
# Exercise 1: Your code here
def compare_across_ranks(ranks: Tuple[int, ...] = (4, 8, 16, 32)) -> Dict:
    """
    Compare LoRA vs DoRA across different rank values.
    
    Args:
        ranks: List of rank values to test
    
    Returns:
        Dictionary with results for each method and rank
    """
    # Your implementation here
    pass

# results = compare_across_ranks([4, 8, 16])
# Plot the results...

### Exercise 2: Magnitude Analysis

Track how the magnitude vector changes during DoRA training. Does it change significantly?

<details>
<summary>Hint</summary>

Add tracking inside the training loop to record `dora_layer.magnitude` at each epoch.
</details>

In [None]:
# Exercise 2: Your code here
def track_magnitude_changes(n_epochs: int = 50) -> Dict:
    """
    Train DoRA and track magnitude vector changes.
    
    Returns:
        Dictionary with magnitude history and statistics
    """
    # Your implementation here
    pass

---

## Common Mistakes

### Mistake 1: Forgetting to Normalize

```python
# Wrong: Not normalizing after adding LoRA delta
updated_weight = weight + lora_delta
dora_weight = self.magnitude * updated_weight  # WRONG!

# Right: Normalize to get direction first
updated_weight = weight + lora_delta
direction = updated_weight / (updated_weight.norm(dim=0, keepdim=True) + 1e-8)
dora_weight = self.magnitude * direction  # Correct!
```

**Why:** DoRA explicitly separates magnitude and direction. Without normalization, you're not getting the benefits of decomposition.
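
A quick numeric check of what goes wrong. This sketch assumes a random stand-in for the pretrained weight and an untrained adapter (so the LoRA delta is zero):

```python
import torch

torch.manual_seed(0)
W0 = torch.randn(16, 8)                  # stand-in for pretrained weights
m = W0.norm(dim=0, keepdim=True)
delta = torch.zeros_like(W0)             # untrained adapter: B = 0, so BA = 0

updated = W0 + delta
wrong = m * updated                      # skips normalization
right = m * updated / (updated.norm(dim=0, keepdim=True) + 1e-8)

print((wrong - W0).abs().max())   # large: each column is scaled by its norm a second time
print((right - W0).abs().max())   # near zero: the layer starts out equal to W0
```

Without the normalization step, the column norms of the effective weight become `m ** 2` instead of `m`, so the layer is already wrong before any training happens.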

### Mistake 2: Wrong Magnitude Initialization

```python
# Wrong: Random magnitude
self.magnitude = nn.Parameter(torch.randn(1, in_features))

# Right: Initialize from original weight norms
self.magnitude = nn.Parameter(weight.norm(dim=0, keepdim=True))
```

**Why:** Starting with the original magnitudes ensures the model behaves identically before training begins.
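
The effect on the layer's output can be sketched directly. The example assumes a small `nn.Linear` as the base layer and an untrained adapter, so the direction is just the normalized original weight; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
layer = nn.Linear(8, 16, bias=False)
x = torch.randn(4, 8)

with torch.no_grad():
    W0 = layer.weight
    # B = 0 before training, so the direction is W0 with unit-norm columns
    direction = W0 / W0.norm(dim=0, keepdim=True)

    m_random = torch.randn(1, 8)                  # wrong initialization
    m_from_norms = W0.norm(dim=0, keepdim=True)   # right initialization

    out_base = layer(x)
    out_random = F.linear(x, m_random * direction)
    out_norms = F.linear(x, m_from_norms * direction)

print((out_random - out_base).abs().max())  # clearly nonzero
print((out_norms - out_base).abs().max())   # near zero
```

With a random magnitude the wrapped layer no longer reproduces the pretrained layer, so training starts from a damaged model rather than from the checkpoint.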

---

## Checkpoint

You've learned:
- ✅ How DoRA decomposes weights into magnitude and direction
- ✅ Why this separation improves learning (+3.7 points over LoRA reported on LLaMA-7B commonsense reasoning)
- ✅ How to implement DoRA from scratch
- ✅ How to enable DoRA in PEFT (just `use_dora=True`!)
- ✅ When to choose DoRA vs standard LoRA

---

## Challenge (Optional)

### Implement rsLoRA (Rank-Stabilized LoRA)

rsLoRA uses a different scaling: instead of $\alpha/r$, it uses $\alpha/\sqrt{r}$. This provides better stability at higher ranks.

1. Modify `DoRALayer` to support rsLoRA scaling
2. Compare DoRA with rsLoRA vs regular DoRA at rank=64
3. Measure training stability (gradient norms)
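
As a starting point, the two scaling rules can be compared on their own. This is only a sketch of the scaling factor, with hypothetical helper names; wiring it into `DoRALayer` is left as the challenge.

```python
import math

def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA scaling: alpha / r."""
    return alpha / rank

def rslora_scaling(alpha: float, rank: int) -> float:
    """Rank-stabilized scaling: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)

# With alpha fixed, the standard factor shrinks much faster as rank grows
for rank in (4, 16, 64):
    print(rank, lora_scaling(16.0, rank), rslora_scaling(16.0, rank))
```

At rank 64 with `alpha=16`, the standard factor is 0.25 while the rank-stabilized factor is 2.0, which is why the two variants can behave quite differently at high rank.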

---

## Further Reading

- [DoRA: Weight-Decomposed Low-Rank Adaptation](https://arxiv.org/abs/2402.09353) - The original DoRA paper
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) - Original LoRA paper
- [PEFT Documentation](https://huggingface.co/docs/peft) - Hugging Face PEFT library

---

## Cleanup

In [None]:
# Clear GPU memory
import gc

torch.cuda.empty_cache()
gc.collect()

print("Cleanup complete!")
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

---

## Next Steps

Now that you understand DoRA, continue to:

**[Lab 3.1.3: NEFTune Magic](lab-3.1.3-neftune-magic.ipynb)** - Learn how adding noise to embeddings boosted AlpacaEval scores by roughly 35 percentage points in the NEFTune paper!