# Lab 1.7.6: Documentation and Benchmarks

**Module:** 1.7 - Domain 1 Capstone: MicroGrad+  
**Time:** 1 hour  
**Difficulty:** ⭐⭐

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how to write good API documentation
- [ ] Compare MicroGrad+ performance against PyTorch
- [ ] Learn why PyTorch is faster and what optimizations it uses
- [ ] Complete the Domain 1 Capstone!

---

## Prerequisites

- Completed: Labs 1.7.1-1.7.5
- Knowledge of: Python documentation practices

In [None]:
import numpy as np
import time
import sys
from pathlib import Path

def _find_module_root():
    """Find the module root directory containing micrograd_plus."""
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / 'micrograd_plus' / '__init__.py').exists():
            return str(parent)
    # Fallback to parent directory
    return str(Path.cwd().parent)

sys.path.insert(0, _find_module_root())

from micrograd_plus import (
    Tensor, Linear, ReLU, Dropout,
    CrossEntropyLoss, MSELoss, Adam, SGD, Sequential
)
from micrograd_plus.utils import set_seed

set_seed(42)

---

## Part 1: API Documentation

Good documentation makes your code usable by others (and your future self!). Let's review the key components of our MicroGrad+ API.

### ELI5: Why Documentation Matters

Think of documentation like the instruction manual for a LEGO set:
- Without it, you can eventually figure things out, but it takes much longer
- With good instructions, anyone can build what you built
- Bad instructions are sometimes worse than no instructions!

In [None]:
# Explore the module documentation
import micrograd_plus

print("MicroGrad+ Module Documentation")
print("=" * 60)
print(micrograd_plus.__doc__)

In [None]:
# Check individual class documentation
print("Tensor Class Documentation")
print("=" * 60)
print(Tensor.__doc__)

In [None]:
# Example of good docstring format
print("Linear Layer Documentation")
print("=" * 60)
print(Linear.__doc__)

### Documentation Best Practices

1. **Module docstring**: Explain what the module does and provide a quick example
2. **Class docstring**: Describe purpose, args, attributes, and usage
3. **Method docstring**: Document args, returns, raises, and provide examples
4. **Type hints**: Use them for better IDE support and clarity

### The Anatomy of a Good Docstring

```python
def my_function(param1: int, param2: str = "default") -> bool:
    """Short one-line description of what the function does.
    
    Longer description if needed, explaining the behavior in more
    detail, edge cases, and any important notes.
    
    Args:
        param1: Description of first parameter.
        param2: Description of second parameter with default.
    
    Returns:
        Description of what is returned.
    
    Raises:
        ValueError: When param1 is negative.
    
    Example:
        >>> my_function(5, "hello")
        True
    """
    pass
```
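
One way to keep `Example:` sections honest is to execute them with Python's standard `doctest` module. The sketch below uses a hypothetical function, `is_long_enough`, which is not part of MicroGrad+. It runs the `>>>` examples from the docstring and reports how many failed.

```python
import doctest


def is_long_enough(length: int, minimum: int = 3) -> bool:
    """Return True if ``length`` is at least ``minimum``.

    Args:
        length: Measured length; must be non-negative.
        minimum: Smallest length that counts as long enough.

    Returns:
        True when ``length >= minimum``, otherwise False.

    Raises:
        ValueError: When length is negative.

    Example:
        >>> is_long_enough(5)
        True
        >>> is_long_enough(2)
        False
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    return length >= minimum


# Parse the >>> examples out of the docstring and execute them.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    is_long_enough.__doc__,
    {"is_long_enough": is_long_enough},  # names visible to the examples
    "is_long_enough",
    None,
    0,
)
runner = doctest.DocTestRunner(verbose=False)
doctest_results = runner.run(test)
print(doctest_results)
```

If an example's shown output ever drifts from what the code returns, `doctest_results.failed` becomes non-zero, so stale documentation is caught the same way a failing test is.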

In [None]:
# Let's explore what methods are available on our Tensor class
print("Available Tensor Methods and Attributes:")
print("=" * 60)

# Filter out private methods (those starting with _)
public_attrs = [attr for attr in dir(Tensor) if not attr.startswith('_')]
for attr in sorted(public_attrs):
    obj = getattr(Tensor, attr)
    if callable(obj):
        print(f"  {attr}()")
    else:
        print(f"  {attr} (property)")
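
The same introspection can be turned into a quick documentation check. The sketch below is a hypothetical helper, `missing_docstrings`, demonstrated on a made-up `Demo` class. It lists public callables that have no docstring.

```python
import inspect


def missing_docstrings(cls):
    """Return sorted names of public callables on ``cls`` that lack a docstring."""
    missing = []
    for name, member in inspect.getmembers(cls, callable):
        if name.startswith("_"):
            continue  # skip private and dunder members
        if not inspect.getdoc(member):
            missing.append(name)
    return sorted(missing)


class Demo:  # hypothetical class, used only for illustration
    """A tiny class with one documented and one undocumented method."""

    def documented(self):
        """Do something that is explained."""

    def undocumented(self):
        pass


undocumented_names = missing_docstrings(Demo)
print(undocumented_names)  # ['undocumented']
```

Calling `missing_docstrings(Tensor)` in this notebook should work the same way, assuming `Tensor` is an ordinary Python class.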

---

## Part 2: Benchmarking Against PyTorch

Let's compare MicroGrad+ performance against PyTorch. Spoiler: PyTorch will be much faster - and that's okay! The goal is to understand **why**.

### ELI5: Why Compare Ourselves to PyTorch?

Imagine you built a bicycle from scratch in your garage. It works! But you wouldn't use it to race against a professional racing bike. However, by understanding *why* the racing bike is faster (lighter materials, precision engineering, aerodynamics), you learn valuable engineering principles.

Similarly, the point of comparing MicroGrad+ to PyTorch is to learn about optimization, not to feel bad about our code!

In [None]:
# Check if PyTorch is available
try:
    import torch
    import torch.nn as nn
    PYTORCH_AVAILABLE = True
    print(f"PyTorch version: {torch.__version__}")
    print(f"CUDA available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU: {torch.cuda.get_device_name(0)}")
except ImportError:
    PYTORCH_AVAILABLE = False
    print("PyTorch not available. Skipping PyTorch benchmarks.")
    print("Install with: pip install torch")

In [None]:
def benchmark_micrograd(batch_size, in_features, hidden, out_features, num_iterations):
    """Benchmark MicroGrad+ forward and backward pass.
    
    Args:
        batch_size: Number of samples per batch
        in_features: Input dimension
        hidden: Hidden layer dimension
        out_features: Output dimension (number of classes)
        num_iterations: Number of iterations to benchmark
    
    Returns:
        Average time per iteration in seconds
    """
    set_seed(42)
    
    # Create model
    model = Sequential(
        Linear(in_features, hidden),
        ReLU(),
        Linear(hidden, hidden),
        ReLU(),
        Linear(hidden, out_features)
    )
    
    loss_fn = CrossEntropyLoss()
    optimizer = Adam(model.parameters(), lr=0.001)
    
    # Create data
    X = np.random.randn(batch_size, in_features).astype(np.float32)
    y = np.random.randint(0, out_features, batch_size)
    
    # Warmup (important for fair benchmarking!)
    for _ in range(5):
        x_tensor = Tensor(X, requires_grad=True)
        logits = model(x_tensor)
        loss = loss_fn(logits, Tensor(y))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    
    # Benchmark (perf_counter is a monotonic, high-resolution clock)
    start = time.perf_counter()
    for _ in range(num_iterations):
        x_tensor = Tensor(X, requires_grad=True)
        logits = model(x_tensor)
        loss = loss_fn(logits, Tensor(y))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    elapsed = time.perf_counter() - start
    
    return elapsed / num_iterations
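
The warmup-then-measure pattern in the benchmark above can be factored into a small reusable helper. This is a minimal sketch, not part of MicroGrad+. The name `time_callable` and its defaults are assumptions. It reports the median of several runs, which is less sensitive to one-off stalls than a mean.

```python
import statistics
import time


def time_callable(fn, warmup=3, repeats=10):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # untimed runs absorb one-time costs such as caches and lazy init
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


median_s = time_callable(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_s * 1e3:.3f} ms")
```

For GPU code, the callable would also need to synchronize the device before returning, as `benchmark_pytorch` does with `torch.cuda.synchronize()`.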

In [None]:
if PYTORCH_AVAILABLE:
    def benchmark_pytorch(batch_size, in_features, hidden, out_features, num_iterations, device='cpu'):
        """Benchmark PyTorch forward and backward pass.
        
        Args:
            batch_size: Number of samples per batch
            in_features: Input dimension
            hidden: Hidden layer dimension
            out_features: Output dimension (number of classes)
            num_iterations: Number of iterations to benchmark
            device: 'cpu' or 'cuda'
        
        Returns:
            Average time per iteration in seconds
        """
        torch.manual_seed(42)
        
        # Create model
        model = nn.Sequential(
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_features)
        ).to(device)
        
        loss_fn = nn.CrossEntropyLoss()
        optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
        
        # Create data
        X = torch.randn(batch_size, in_features, device=device)
        y = torch.randint(0, out_features, (batch_size,), device=device)
        
        # Warmup
        for _ in range(5):
            logits = model(X)
            loss = loss_fn(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        if device == 'cuda':
            torch.cuda.synchronize()
        
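        # Benchmark (perf_counter is a monotonic, high-resolution clock)
        start = time.perf_counter()
        for _ in range(num_iterations):
            logits = model(X)
            loss = loss_fn(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        
        # CUDA kernels run asynchronously, so wait for them before stopping the clock
        if device == 'cuda':
            torch.cuda.synchronize()
        
        elapsed = time.perf_counter() - start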
        
        return elapsed / num_iterations

In [None]:
# Run benchmarks
print("Benchmark: Forward + Backward + Optimizer Step")
print("=" * 60)

# Different configurations to test
configs = [
    (32, 784, 256, 10, 100),   # Small: MNIST-like
    (64, 784, 512, 10, 50),    # Medium
    (128, 1024, 512, 100, 20), # Large
]

results = []

for batch_size, in_features, hidden, out_features, iters in configs:
    print(f"\nConfig: batch={batch_size}, input={in_features}, hidden={hidden}, output={out_features}")
    
    # MicroGrad+
    mg_time = benchmark_micrograd(batch_size, in_features, hidden, out_features, iters)
    print(f"  MicroGrad+: {mg_time*1000:.2f} ms/iteration")
    
    if PYTORCH_AVAILABLE:
        # PyTorch CPU
        pt_cpu_time = benchmark_pytorch(batch_size, in_features, hidden, out_features, iters, 'cpu')
        print(f"  PyTorch CPU: {pt_cpu_time*1000:.2f} ms/iteration")
        
        ratio_cpu = mg_time / pt_cpu_time
        print(f"  Ratio (MicroGrad+ / PyTorch CPU): {ratio_cpu:.1f}x slower")
        
        # PyTorch GPU (if available)
        if torch.cuda.is_available():
            pt_gpu_time = benchmark_pytorch(batch_size, in_features, hidden, out_features, iters, 'cuda')
            print(f"  PyTorch GPU: {pt_gpu_time*1000:.2f} ms/iteration")
            ratio_gpu = mg_time / pt_gpu_time
            print(f"  Ratio (MicroGrad+ / PyTorch GPU): {ratio_gpu:.1f}x slower")
        
        results.append({
            'config': f'{batch_size}x{in_features}',
            'micrograd': mg_time * 1000,
            'pytorch_cpu': pt_cpu_time * 1000,
            'ratio': ratio_cpu
        })

In [None]:
# Visualize benchmark results
if PYTORCH_AVAILABLE and results:
    import matplotlib.pyplot as plt
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    
    # Use a separate name so the benchmark cell's `configs` list is not overwritten
    config_labels = [r['config'] for r in results]
    mg_times = [r['micrograd'] for r in results]
    pt_times = [r['pytorch_cpu'] for r in results]
    ratios = [r['ratio'] for r in results]
    
    # Time comparison
    x = np.arange(len(config_labels))
    width = 0.35
    
    axes[0].bar(x - width/2, mg_times, width, label='MicroGrad+', color='#3498db')
    axes[0].bar(x + width/2, pt_times, width, label='PyTorch CPU', color='#e74c3c')
    axes[0].set_ylabel('Time (ms)')
    axes[0].set_title('Execution Time Comparison')
    axes[0].set_xticks(x)
    axes[0].set_xticklabels(config_labels)
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Ratio
    colors = ['#2ecc71' if r < 20 else '#f39c12' if r < 50 else '#e74c3c' for r in ratios]
    axes[1].bar(config_labels, ratios, color=colors)
    axes[1].set_ylabel('Slowdown Factor (x times slower)')
    axes[1].set_title('MicroGrad+ vs PyTorch CPU')
    axes[1].axhline(y=1, color='gray', linestyle='--', alpha=0.5)
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
else:
    print("Skipping visualization (PyTorch not available)")

---

## Part 3: Why is PyTorch Faster?

On workloads like the ones above, PyTorch is often 10-100x faster than our implementation; the exact factor depends on model size and hardware. Here's why:

### 1. Compiled Operations (C++/CUDA)

| MicroGrad+ | PyTorch |
|------------|----------|
| Pure Python + NumPy | C++ kernels, CUDA for GPU |
| Interpreted at runtime | Pre-compiled, optimized |

**ELI5**: MicroGrad+ is like doing math by hand - correct but slow. PyTorch is like using a calculator - does the same math but much faster because it's purpose-built.
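
The cost of interpretation is easy to see even without PyTorch. The sketch below compares a dot product written as a Python loop with NumPy's compiled `np.dot`. The array size is an arbitrary choice and the timings will vary by machine.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(200_000)
b = rng.standard_normal(200_000)


def dot_python(x, y):
    """Dot product where every multiply-add passes through the interpreter."""
    total = 0.0
    for xi, yi in zip(x, y):
        total += xi * yi
    return total


start = time.perf_counter()
slow = dot_python(a, b)
loop_s = time.perf_counter() - start

start = time.perf_counter()
fast = float(np.dot(a, b))  # one call into compiled code
numpy_s = time.perf_counter() - start

print(f"Python loop: {loop_s * 1e3:.2f} ms")
print(f"np.dot:      {numpy_s * 1e3:.3f} ms")
print(f"Same answer: {abs(slow - fast) < 1e-6}")
```

MicroGrad+ already gets the compiled path for each individual NumPy call. What stays in Python is everything between those calls: graph construction, backward closures, and optimizer loops.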

### 2. Optimized Memory Layout

| MicroGrad+ | PyTorch |
|------------|----------|
| NumPy arrays (general-purpose strided arrays) | Strided tensors with layouts tuned for ML kernels (for example, channels-last) |
| Standard memory allocation | Caching allocator that reuses memory blocks, notably on GPU |

**ELI5**: NumPy is like a general-purpose toolbox. PyTorch tensors are like a specialized toolkit designed specifically for neural networks.
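
Memory layout is described by strides: how many bytes to step to reach the next element along each axis. The sketch below uses NumPy only and shows that a transpose is a view with swapped strides rather than a copy. A contiguous copy has to be requested explicitly.

```python
import numpy as np

a = np.zeros((4, 3), dtype=np.float32)
print(a.strides)                # (12, 4): next row is 12 bytes away, next column 4
print(a.flags["C_CONTIGUOUS"])  # True

t = a.T                         # a view: same buffer, swapped strides
print(t.strides)                # (4, 12)
print(t.flags["C_CONTIGUOUS"])  # False
print(np.shares_memory(a, t))   # True

packed = np.ascontiguousarray(t)  # an explicit copy in row-major order
print(packed.strides)             # (16, 4)
print(np.shares_memory(a, packed))  # False
```

Kernels generally run fastest when they read memory in the order it is stored, which is why frameworks care about which layout a tensor uses.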

### 3. Operator Fusion

| MicroGrad+ | PyTorch |
|------------|----------|
| Each operation creates a new array | Can fuse operations to reduce memory traffic (for example with `torch.compile`) |
| `a = x + y; b = a * z` | `b = fused_add_mul(x, y, z)` (illustrative name, not a real API) |

**ELI5**: MicroGrad+ writes down each step on a new piece of paper. PyTorch combines steps when possible, using less paper and time.
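
NumPy cannot fuse kernels, but it can illustrate the "less paper" half of the idea. The sketch below computes `(x + y) * z` twice: once the usual way, which allocates a temporary for `x + y`, and once by writing both steps into a single preallocated buffer. It still makes two passes over the data, so this is buffer reuse and not true fusion.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y, z = (rng.standard_normal(1_000_000) for _ in range(3))

# Unfused style: x + y allocates a temporary array, then * z allocates the result.
b_naive = (x + y) * z

# Buffer reuse: both steps write into one preallocated array.
buf = np.empty_like(x)
np.add(x, y, out=buf)
np.multiply(buf, z, out=buf)

print(np.array_equal(b_naive, buf))  # True
```

A fused kernel goes one step further and reads `x`, `y`, and `z` once, producing the output in a single pass without storing the intermediate sum at all.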

### 4. Parallelization

| MicroGrad+ | PyTorch |
|------------|----------|
| Single-threaded (mostly) | Multi-threaded CPU |
| NumPy's limited parallelism | Massively parallel GPU |

**ELI5**: MicroGrad+ is one person doing all the work. PyTorch has thousands of helpers (GPU cores) working simultaneously.

### 5. Autograd Optimization

| MicroGrad+ | PyTorch |
|------------|----------|
| Python closures for backward functions | Compiled autograd engine |
| Full computation graph kept in memory | Frees graph buffers after backward; optional gradient checkpointing |

**ELI5**: Our backward functions are like detailed step-by-step recipes. PyTorch's are like a chef who's memorized everything and doesn't need to read instructions.
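
To make "Python closures for backward functions" concrete, here is a deliberately tiny scalar autograd sketch. The `Value` class is hypothetical and much simpler than the MicroGrad+ `Tensor`. It only shows the pattern: each operation stores a closure that knows how to push gradients to its inputs.

```python
class Value:
    """A scalar that records how it was computed (illustration only)."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other and d(out)/d(other) = self
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Topologically order the graph, then run closures from output to inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


x, y, z = Value(2.0), Value(3.0), Value(4.0)
out = (x + y) * z
out.backward()
print(out.data, x.grad, y.grad, z.grad)  # 20.0 4.0 4.0 5.0
```

Every `_backward` call here is a Python function call with interpreter overhead, and that per-node cost is what a compiled autograd engine avoids.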

In [None]:
# Demonstrate the power of GPU parallelism
print("Why GPUs are so fast for neural networks:")
print("=" * 60)
print("""
Matrix multiplication example: (1000 x 1000) @ (1000 x 1000)

Operations needed: 1000 * 1000 * 1000 = 1 billion multiply-adds
(each multiply-add is counted as one op below)

CPU (8 cores, ~100 GFLOPS):
  - Can do ~100 billion ops/sec
  - Time: ~10 ms

GPU (DGX Spark with NVIDIA Blackwell GB10 Superchip):
  - FP8: ~209 TFLOPS
  - NVFP4: 1 PFLOP
  - Time: ~5 microseconds at peak FP8 throughput (a theoretical lower bound)

The GPU wins because:
1. 6,144 CUDA cores + 192 Tensor cores (5th generation)
2. 273 GB/s memory bandwidth with 128GB unified memory
3. Operations optimized for data parallelism
4. Native support for FP8/NVFP4 quantization
""")
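
The estimates in the cell above are plain division of work by throughput. The sketch below reproduces them using the same figures and the same convention of counting one multiply-add as one operation. These are peak numbers, so measured times will be higher.

```python
ops = 1000 * 1000 * 1000   # multiply-adds for a 1000x1000 @ 1000x1000 matmul
cpu_ops_per_s = 100e9      # ~100 GFLOPS, from the text above
gpu_ops_per_s = 209e12     # ~209 TFLOPS (FP8), from the text above

cpu_ms = ops / cpu_ops_per_s * 1e3
gpu_us = ops / gpu_ops_per_s * 1e6

print(f"CPU: {cpu_ms:.1f} ms")                     # CPU: 10.0 ms
print(f"GPU: {gpu_us:.1f} us")                     # GPU: 4.8 us
print(f"Ratio: {cpu_ms * 1e3 / gpu_us:.0f}x")      # Ratio: 2090x
```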

In [None]:
# Demonstrate actual GPU speedup if available
if PYTORCH_AVAILABLE and torch.cuda.is_available():
    print("Real GPU Benchmark: Large Matrix Multiplication")
    print("=" * 60)
    
    sizes = [256, 512, 1024, 2048, 4096]
    
    for size in sizes:
        # CPU
        a_cpu = torch.randn(size, size)
        b_cpu = torch.randn(size, size)
        torch.mm(a_cpu, b_cpu)  # warmup
        
        start = time.perf_counter()
        for _ in range(10):
            c_cpu = torch.mm(a_cpu, b_cpu)
        cpu_time = (time.perf_counter() - start) / 10 * 1000
        
        # GPU
        a_gpu = a_cpu.cuda()
        b_gpu = b_cpu.cuda()
        torch.mm(a_gpu, b_gpu)  # warmup: the first call pays one-time CUDA setup costs
        torch.cuda.synchronize()
        
        start = time.perf_counter()
        for _ in range(10):
            c_gpu = torch.mm(a_gpu, b_gpu)
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
        gpu_time = (time.perf_counter() - start) / 10 * 1000
        
        speedup = cpu_time / gpu_time
        print(f"  {size}x{size}: CPU={cpu_time:.2f}ms, GPU={gpu_time:.3f}ms, Speedup={speedup:.1f}x")
else:
    print("GPU not available - skipping GPU benchmark demo")

---

## Part 4: What You've Accomplished

Let's take a moment to appreciate what you've built in this capstone project!

In [None]:
# Count lines of code in MicroGrad+
def count_lines(filepath):
    """Count non-empty lines that are not '#' comments (docstring lines are counted)."""
    count = 0
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                stripped = line.strip()
                if stripped and not stripped.startswith('#'):
                    count += 1
    except (OSError, UnicodeDecodeError) as e:
        print(f"Could not read {filepath}: {e}")
    return count

# Find package directory
module_root = Path(_find_module_root())
package_dir = module_root / 'micrograd_plus'

total_lines = 0

print("MicroGrad+ Package Statistics")
print("=" * 60)

if package_dir.exists():
    for filepath in sorted(package_dir.glob('*.py')):
        lines = count_lines(filepath)
        total_lines += lines
        print(f"  {filepath.name:25s}: {lines:4d} lines")
    
    print(f"  {'-'*30}")
    print(f"  {'TOTAL':25s}: {total_lines:4d} lines")
else:
    print(f"Package directory not found at {package_dir}")

In [None]:
# Summary of what was built
print("""
CONGRATULATIONS! You've completed the Domain 1 Capstone!

You built MicroGrad+, featuring:

Tensor Class with Autograd
   - Automatic differentiation (reverse-mode)
   - Broadcasting support
   - 15+ differentiable operations
   - Gradient accumulation and zeroing

Neural Network Layers
   - Linear (fully connected)
   - ReLU, Sigmoid, Tanh, Softmax activations
   - Dropout for regularization
   - BatchNorm and LayerNorm for normalization
   - Embedding for discrete inputs
   - Flatten for reshaping

Loss Functions
   - MSE Loss for regression
   - Cross-Entropy Loss for classification
   - BCE Loss for binary classification
   - L1 and Huber Loss for robust regression
   - NLL Loss for log-probability inputs

Optimizers
   - SGD with momentum and weight decay
   - Adam (adaptive learning rates)
   - AdamW (decoupled weight decay)
   - RMSprop
   - Learning rate schedulers

Training Utilities
   - Sequential container
   - DataLoader for batching
   - Training and evaluation loops
   - Model save/load

Testing Suite
   - Unit tests for all components
   - Gradient checking against numerical gradients
   - Edge case validation

Real Application
   - Trained on MNIST with >95% accuracy!
   - Visualizations of training progress
""")

---

## Part 5: What's Next?

In **Domain 2: Deep Learning Frameworks**, you'll use PyTorch to:

### 1. Work with Real Hardware
- Use the DGX Spark with NVIDIA Blackwell GB10 Superchip
- Leverage 128GB unified memory with 273 GB/s bandwidth
- Expect speedups of 10-100x or more over MicroGrad+, helped by native FP8/NVFP4 support!

### 2. Build Advanced Architectures
- Convolutional Neural Networks (CNNs) for images
- Recurrent Neural Networks (RNNs) for sequences
- Transformers and Attention mechanisms

### 3. Handle Real-World Data
- Image classification (CIFAR-10, ImageNet)
- Natural Language Processing
- Transfer learning from pre-trained models

### 4. Scale to Larger Models
- Fine-tuning pre-trained models (up to 100-120B with QLoRA)
- Efficient training techniques (mixed precision, gradient checkpointing)
- NVFP4 inference for models up to ~200B parameters
- Model optimization and deployment

### Why Your MicroGrad+ Experience Matters

Everything you learned building MicroGrad+ directly applies:

| MicroGrad+ Concept | PyTorch Equivalent |
|-------------------|--------------------|
| `Tensor` | `torch.Tensor` |
| `requires_grad=True` | `requires_grad=True` |
| `loss.backward()` | `loss.backward()` |
| `optimizer.step()` | `optimizer.step()` |
| `Sequential` | `nn.Sequential` |
| `Linear`, `ReLU` | `nn.Linear`, `nn.ReLU` |

You understand the **internals** now. PyTorch will feel familiar, just much faster!

---

## Try It Yourself: Final Reflection

Before we wrap up, take a moment to reflect on your learning journey.

In [None]:
# Exercise: What was the most challenging concept?
# Fill in your reflection here:

most_challenging = """
# Your answer here
# Example: "Understanding how backward() propagates gradients through the graph"
"""

most_interesting = """
# Your answer here  
# Example: "Seeing how simple operations combine to train a neural network"
"""

next_steps = """
# Your answer here
# Example: "I want to try implementing a CNN from scratch"
"""

print("Your Domain 1 Capstone Reflection")
print("=" * 60)
print(f"\nMost challenging: {most_challenging.strip()}")
print(f"\nMost interesting: {most_interesting.strip()}")
print(f"\nNext steps: {next_steps.strip()}")

---

## Domain 1 Complete!

You now have:
- **Deep understanding** of how neural networks work internally
- **Your own working autograd engine** that you built from scratch
- **Solid foundation** in Python, NumPy, and software engineering
- **Practical experience** training real models on real data

These fundamentals will serve you well as you move to more advanced topics in Domain 2!

### Key Takeaways

1. **Autograd is just careful bookkeeping** - track operations, apply chain rule
2. **Neural networks are function compositions** - layers are just functions with learnable parameters
3. **Training is optimization** - minimize loss by following gradients
4. **Understanding beats memorization** - knowing *why* something works lets you debug and innovate
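
Takeaways 1 and 3 fit in a few lines of NumPy. The sketch below fits a line by gradient descent with hand-derived gradients, which is the bookkeeping an autograd engine automates. The data, learning rate, and step count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 0.5  # noiseless targets from a known line

w, b = 0.0, 0.0
lr = 0.3
for step in range(200):
    pred = w * x + b
    err = pred - y
    loss = np.mean(err ** 2)
    # Chain rule by hand: d(loss)/dw = mean(2 * err * x), d(loss)/db = mean(2 * err)
    grad_w = np.mean(2.0 * err * x)
    grad_b = np.mean(2.0 * err)
    # Follow the negative gradient to reduce the loss
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w={w:.4f}, b={b:.4f}, loss={loss:.2e}")
```

The loop recovers the slope and intercept used to generate the data. Swapping the two hand-written gradient lines for `loss.backward()` is the step MicroGrad+ takes care of.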

---

## Cleanup

In [None]:
import gc
gc.collect()
print("Domain 1 Capstone Complete!")