# Gradient Computation Deep Dive: PyTorch Mastery Hub

**Understanding PyTorch's Automatic Differentiation Engine**

**Authors:** PyTorch Mastery Hub Team  
**Institution:** Advanced Deep Learning Education  
**Course:** PyTorch Fundamentals & Advanced Techniques  
**Date:** December 2024

## Overview

This notebook explores PyTorch's automatic differentiation system (autograd). It works through gradient computation from first principles to advanced applications, building intuition with visualizations and practical examples.

## Key Objectives
1. Master PyTorch's autograd system and computational graphs
2. Understand gradient computation mechanics and vector operations
3. Explore advanced gradient techniques for research and optimization
4. Build intuition through interactive visualizations and real-time demos
5. Learn performance optimization and debugging techniques
6. Apply gradients to practical scenarios like adversarial examples and optimization

## 📚 Learning Path
- **Prerequisites:** Basic PyTorch tensors, calculus fundamentals, chain rule
- **Difficulty:** Intermediate to Advanced
- **Duration:** 90-120 minutes
- **Next Steps:** Custom autograd functions, neural network architectures

---

## 1. Setup and Environment Configuration

```python
# Essential imports for comprehensive gradient computation tutorial
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional, Union
import warnings
import time
import copy
from pathlib import Path
warnings.filterwarnings('ignore')

# Advanced imports for specialized functionality
from torch.utils.checkpoint import checkpoint
from torch.utils.data import TensorDataset, DataLoader
import json

# Configure environment
torch.manual_seed(42)
np.random.seed(42)

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette('husl')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11

# Create results directory
results_dir = Path('../results/notebooks/gradient_computation')
results_dir.mkdir(parents=True, exist_ok=True)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print("🔥 PyTorch Mastery Hub - Gradient Computation Deep Dive")
print("=" * 60)
print(f"📱 Device: {device}")
print(f"🎨 PyTorch version: {torch.__version__}")
print(f"📊 NumPy version: {np.__version__}")
print(f"📁 Results directory: {results_dir}")
print("✨ Ready to explore automatic differentiation!\n")
```

## 2. Autograd Fundamentals: Building the Foundation

### 2.1 Your First Gradient Computation

```python
def demonstrate_basic_gradients():
    """Demonstrate fundamental gradient computation concepts"""
    
    print("=== 2.1 Basic Gradient Computation ===\n")
    
    # Create tensor with gradient tracking
    x = torch.tensor(2.0, requires_grad=True)
    
    print(f"🔢 Input Analysis:")
    print(f"  Input tensor x: {x}")
    print(f"  Requires grad: {x.requires_grad}")
    print(f"  Gradient function: {x.grad_fn}")
    print(f"  Is leaf node: {x.is_leaf}")
    print(f"  Current gradient: {x.grad}")
    
    # Define mathematical function: f(x) = x² + 3x + 1
    y = x**2 + 3*x + 1
    
    print(f"\n📐 Function Definition:")
    print(f"  Function: f(x) = x² + 3x + 1")
    print(f"  f({x.item()}) = {y.item()}")
    print(f"  Output grad_fn: {y.grad_fn}")
    print(f"  Output is leaf: {y.is_leaf}")
    
    # Compute analytical gradient for verification
    analytical_grad = 2*x.item() + 3
    
    # Compute gradient using backpropagation
    y.backward()
    
    print(f"\n🧮 Gradient Analysis:")
    print(f"  Computed gradient df/dx: {x.grad.item()}")
    print(f"  Analytical gradient (2x + 3): {analytical_grad}")
    print(f"  Absolute difference: {abs(x.grad.item() - analytical_grad):.2e}")
    print(f"  ✅ Gradients match: {abs(x.grad.item() - analytical_grad) < 1e-6}")
    
    return x.grad.item(), analytical_grad

# Execute basic gradient demonstration
computed_grad, analytical_grad = demonstrate_basic_gradients()
```
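
One behavior worth seeing right after a first `backward()` call is that autograd adds each new gradient to `.grad` instead of overwriting it. The sketch below reuses the same function f(x) = x² + 3x + 1; the variable names (`first_grad`, `accumulated_grad`, `reset_grad`) are illustrative and not part of the notebook.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First backward pass: df/dx = 2x + 3 = 7 at x = 2
(x**2 + 3*x + 1).backward()
first_grad = x.grad.item()

# Second backward pass without clearing: the new gradient is added to the old one
(x**2 + 3*x + 1).backward()
accumulated_grad = x.grad.item()

# Reset the stored gradient in place, then recompute
x.grad.zero_()
(x**2 + 3*x + 1).backward()
reset_grad = x.grad.item()

print(first_grad, accumulated_grad, reset_grad)  # 7.0 14.0 7.0
```

This is why later cells zero `x.grad` before each `backward()` call inside a loop.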

### 2.2 Computational Graph Deep Dive

```python
def explore_computational_graph():
    """Explore computational graph construction and traversal"""
    
    print("\n=== 2.2 Computational Graph Analysis ===\n")
    
    # Build complex computation with multiple operations
    x = torch.tensor(2.0, requires_grad=True)
    
    # Create computation chain
    operations = []
    
    a = x * 3          # Multiplication: a = 3x
    operations.append(("a = x * 3", a, a.grad_fn))
    
    b = a + 1          # Addition: b = 3x + 1  
    operations.append(("b = a + 1", b, b.grad_fn))
    
    c = b ** 2         # Power: c = (3x + 1)²
    operations.append(("c = b ** 2", c, c.grad_fn))
    
    d = torch.sin(c)   # Trigonometric: d = sin((3x + 1)²)
    operations.append(("d = sin(c)", d, d.grad_fn))
    
    e = d.mean()       # Reduction: e = mean(d)
    operations.append(("e = d.mean()", e, e.grad_fn))
    
    print(f"🔗 Computation Chain Analysis:")
    print(f"  Input x = {x.item()}")
    print()
    
    for i, (description, tensor, grad_fn) in enumerate(operations, 1):
        print(f"  Step {i}: {description}")
        print(f"    Value: {tensor.item():.6f}")
        print(f"    Grad function: {grad_fn}")
        print(f"    Shape: {tensor.shape}")
        print()
    
    # Compute gradient through the entire chain
    print("🔄 Backward Pass Analysis:")
    e.backward()
    
    # Manual gradient computation for verification
    # e = sin((3x + 1)²)
    # de/dx = cos((3x + 1)²) * 2(3x + 1) * 3 = 6(3x + 1) * cos((3x + 1)²)
    manual_grad = 6 * (3 * x.item() + 1) * np.cos((3 * x.item() + 1)**2)
    
    print(f"  Computed gradient de/dx: {x.grad.item():.6f}")
    print(f"  Manual gradient: {manual_grad:.6f}")
    print(f"  Relative error: {abs(x.grad.item() - manual_grad) / abs(manual_grad):.2e}")
    print(f"  ✅ Chain rule applied correctly: {abs(x.grad.item() - manual_grad) < 1e-5}")
    
    return operations, x.grad.item(), manual_grad

# Execute computational graph exploration
operations, computed_chain_grad, manual_chain_grad = explore_computational_graph()
```
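
The `grad_fn` objects printed above are linked to each other through their `next_functions` attribute, so the graph can be walked from the output back to the leaf. The helper below, `walk_graph`, is a hypothetical name introduced for this sketch; exact node class names (for example a trailing `0`) may vary between PyTorch versions.

```python
import torch

def walk_graph(fn, depth=0, lines=None):
    """Recursively collect the class names of backward nodes reachable from fn."""
    if lines is None:
        lines = []
    if fn is None:
        # Constant inputs (such as the literal 3 or 1) have no backward node
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk_graph(next_fn, depth + 1, lines)
    return lines

x = torch.tensor(2.0, requires_grad=True)
e = torch.sin((x * 3 + 1) ** 2)

graph_lines = walk_graph(e.grad_fn)
print("\n".join(graph_lines))
```

The walk should end at an `AccumulateGrad` node, which is where the gradient for the leaf tensor `x` is written into `x.grad`.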

## 3. Advanced Gradient Concepts: Vectors and Jacobians

### 3.1 Vector Functions and Jacobian Matrices

```python
def demonstrate_vector_gradients():
    """Demonstrate gradient computation for vector-valued functions"""
    
    print("\n=== 3.1 Vector Functions and Jacobians ===\n")
    
    # Vector input: 3D input space
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    
    print(f"📊 Vector Function Analysis:")
    print(f"  Input vector x: {x.detach().numpy()}")
    print(f"  Input dimension: {x.shape[0]}")
    
    # Define vector function: f: R³ → R³
    # f(x) = [x₁², x₁x₂, x₂x₃]
    y = torch.stack([
        x[0]**2,           # y₁ = x₁²
        x[0] * x[1],       # y₂ = x₁x₂  
        x[1] * x[2]        # y₃ = x₂x₃
    ])
    
    print(f"  Function definitions:")
    print(f"    y₁ = x₁² = {y[0].item():.4f}")
    print(f"    y₂ = x₁x₂ = {y[1].item():.4f}")
    print(f"    y₃ = x₂x₃ = {y[2].item():.4f}")
    print(f"  Output vector y: {y.detach().numpy()}")
    
    # Compute Jacobian matrix: ∂y/∂x
    jacobian = torch.zeros(3, 3)
    
    print(f"\n🧮 Jacobian Computation:")
    for i in range(3):
        # Clear previous gradients
        if x.grad is not None:
            x.grad.zero_()
        
        # Compute gradient for output component i
        y[i].backward(retain_graph=True)
        
        # Store in Jacobian matrix
        jacobian[i] = x.grad.clone()
        
        print(f"  ∂y{i+1}/∂x: {x.grad.detach().numpy()}")
    
    print(f"\n📐 Computed Jacobian Matrix:")
    print(jacobian.numpy())
    
    # Analytical Jacobian for verification
    # ∂y₁/∂x = [2x₁, 0, 0]
    # ∂y₂/∂x = [x₂, x₁, 0]  
    # ∂y₃/∂x = [0, x₃, x₂]
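    # Use plain Python floats so torch.tensor() does not have to convert
    # tensors that are still attached to the autograd graph
    x1, x2, x3 = x.detach().tolist()
    analytical_jacobian = torch.tensor([
        [2 * x1, 0.0, 0.0],
        [x2, x1, 0.0],
        [0.0, x3, x2]
    ])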
    
    print(f"\n📋 Analytical Jacobian Matrix:")
    print(analytical_jacobian.numpy())
    
    # Verification
    jacobian_diff = torch.norm(jacobian - analytical_jacobian).item()
    print(f"\n✅ Verification:")
    print(f"  Frobenius norm difference: {jacobian_diff:.2e}")
    print(f"  Jacobians match: {jacobian_diff < 1e-6}")
    
    return jacobian, analytical_jacobian

# Execute vector gradient demonstration
computed_jacobian, analytical_jacobian = demonstrate_vector_gradients()
```
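
The row-by-row loop above makes the mechanics explicit. As a cross-check, the same Jacobian can be obtained in one call with `torch.autograd.functional.jacobian`. This is a minimal sketch for the same function f(x) = [x₁², x₁x₂, x₂x₃]; the names `f`, `J`, and `expected` are illustrative.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    """Vector function f: R^3 -> R^3 used in the demonstration above."""
    return torch.stack([x[0]**2, x[0] * x[1], x[1] * x[2]])

x = torch.tensor([1.0, 2.0, 3.0])

# J[i, j] = d f_i / d x_j
J = jacobian(f, x)

# Analytical Jacobian at x = (1, 2, 3)
expected = torch.tensor([
    [2.0, 0.0, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 2.0],
])

print(J)
print(torch.allclose(J, expected))
```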

### 3.2 Neural Network Gradient Flow

```python
def analyze_neural_network_gradients():
    """Analyze gradient flow through neural network layers"""
    
    print("\n=== 3.2 Neural Network Gradient Flow ===\n")
    
    # Define comprehensive neural network
    class AnalysisNet(nn.Module):
        def __init__(self, input_dim=3, hidden_dims=(4, 6, 4), output_dim=2):
            super().__init__()
            
            layers = []
            prev_dim = input_dim
            
            for i, hidden_dim in enumerate(hidden_dims):
                layers.append(nn.Linear(prev_dim, hidden_dim))
                layers.append(nn.ReLU())
                prev_dim = hidden_dim
            
            layers.append(nn.Linear(prev_dim, output_dim))
            
            self.network = nn.Sequential(*layers)
            
        def forward(self, x):
            return self.network(x)
    
    # Initialize network with controlled weights
    net = AnalysisNet()
    
    # Initialize weights for reproducible analysis
    for module in net.modules():
        if isinstance(module, nn.Linear):
            nn.init.xavier_normal_(module.weight)
            nn.init.zeros_(module.bias)
    
    print(f"🏗️ Network Architecture:")
    total_params = sum(p.numel() for p in net.parameters())
    trainable_params = sum(p.numel() for p in net.parameters() if p.requires_grad)
    
    print(f"  Architecture: {net}")
    print(f"  Total parameters: {total_params:,}")
    print(f"  Trainable parameters: {trainable_params:,}")
    
    # Create training data
    batch_size = 8
    x = torch.randn(batch_size, 3)
    target = torch.randn(batch_size, 2)
    
    print(f"\n📊 Training Setup:")
    print(f"  Input shape: {x.shape}")
    print(f"  Target shape: {target.shape}")
    print(f"  Batch size: {batch_size}")
    
    # Forward pass
    output = net(x)
    loss = F.mse_loss(output, target)
    
    print(f"\n⚡ Forward Pass Results:")
    print(f"  Output shape: {output.shape}")
    print(f"  Output range: [{output.min().item():.4f}, {output.max().item():.4f}]")
    print(f"  Loss: {loss.item():.6f}")
    
    # Backward pass
    loss.backward()
    
    # Analyze gradients across layers
    print(f"\n🔍 Gradient Analysis by Layer:")
    print(f"{'Layer':<20} {'Shape':<15} {'Grad Norm':<12} {'Mean':<10} {'Std':<10} {'Min':<10} {'Max':<10}")
    print("-" * 85)
    
    gradient_norms = []
    for name, param in net.named_parameters():
        if param.grad is not None:
            grad = param.grad
            grad_norm = grad.norm().item()
            grad_mean = grad.mean().item()
            grad_std = grad.std().item()
            grad_min = grad.min().item()
            grad_max = grad.max().item()
            
            gradient_norms.append(grad_norm)
            
            print(f"{name:<20} {str(param.shape):<15} {grad_norm:<12.6f} "
                  f"{grad_mean:<10.4f} {grad_std:<10.4f} {grad_min:<10.4f} {grad_max:<10.4f}")
    
    # Gradient flow health analysis
    print(f"\n🏥 Gradient Flow Health Check:")
    total_norm = 0.0  # Default so the return value is defined even if no gradients exist
    if gradient_norms:
        max_grad = max(gradient_norms)
        min_grad = min(gradient_norms) 
        ratio = max_grad / (min_grad + 1e-8)
        total_norm = (sum(g**2 for g in gradient_norms))**0.5
        
        print(f"  Total gradient norm: {total_norm:.6f}")
        print(f"  Max gradient norm: {max_grad:.6f}")
        print(f"  Min gradient norm: {min_grad:.6f}")
        print(f"  Max/Min ratio: {ratio:.2f}")
        
        # Health assessment
        if ratio > 100:
            print(f"  ⚠️ Warning: Large gradient imbalance detected!")
        elif any(g < 1e-7 for g in gradient_norms):
            print(f"  ⚠️ Warning: Very small gradients detected (vanishing gradients)")
        elif any(g > 10 for g in gradient_norms):
            print(f"  ⚠️ Warning: Very large gradients detected (exploding gradients)")
        else:
            print(f"  ✅ Gradient flow appears healthy")
    
    return net, gradient_norms, total_norm

# Execute neural network gradient analysis
analysis_net, gradient_norms, total_gradient_norm = analyze_neural_network_gradients()
```
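
The total gradient norm computed in the health check is the same quantity that gradient clipping rescales. The sketch below uses a small stand-in model (not the `AnalysisNet` above) and a hypothetical helper `global_grad_norm` to compare a hand-computed global L2 norm with the value returned by `torch.nn.utils.clip_grad_norm_`. The threshold of half the current norm is an arbitrary choice made so that clipping is guaranteed to take effect.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def global_grad_norm(parameters):
    """L2 norm over all parameter gradients, treated as one long vector."""
    return torch.sqrt(sum(p.grad.pow(2).sum() for p in parameters))

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))

loss = F.mse_loss(model(torch.randn(8, 3)), torch.randn(8, 2))
loss.backward()

manual_norm = global_grad_norm(model.parameters())

# Clip to half the current norm (arbitrary threshold for illustration)
max_norm = 0.5 * manual_norm.item()
returned_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

clipped_norm = global_grad_norm(model.parameters())

print(f"Norm before clipping (manual):   {manual_norm.item():.6f}")
print(f"Norm before clipping (returned): {returned_norm.item():.6f}")
print(f"Norm after clipping:             {clipped_norm.item():.6f}")
```

`clip_grad_norm_` returns the norm measured before clipping and rescales the gradients in place, so the norm recomputed afterwards should sit at roughly `max_norm`.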

## 4. Interactive Gradient Visualization

### 4.1 2D Function Gradient Landscapes

```python
def create_gradient_landscape_visualization():
    """Create comprehensive gradient landscape visualizations"""
    
    print("\n=== 4.1 Interactive Gradient Landscapes ===\n")
    
    def rosenbrock_function(coords):
        """Rosenbrock function: classic optimization benchmark"""
        x, y = coords[0], coords[1]
        a, b = 1, 100
        return (a - x)**2 + b * (y - x**2)**2
    
    def himmelblau_function(coords):
        """Himmelblau's function: multi-modal optimization problem"""
        x, y = coords[0], coords[1]
        return (x**2 + y - 11)**2 + (x + y**2 - 7)**2
    
    def visualize_function_landscape(func, func_name, x_range=(-3, 3), y_range=(-3, 3), resolution=100):
        """Create comprehensive function landscape visualization"""
        
        print(f"🎨 Visualizing {func_name} function...")
        
        # Create coordinate grids
        x = np.linspace(x_range[0], x_range[1], resolution)
        y = np.linspace(y_range[0], y_range[1], resolution)
        X, Y = np.meshgrid(x, y)
        
        # Compute function values and gradients
        Z = np.zeros_like(X)
        Gx = np.zeros_like(X)  # ∂f/∂x
        Gy = np.zeros_like(Y)  # ∂f/∂y
        
        for i in range(resolution):
            for j in range(resolution):
                coords = torch.tensor([X[i, j], Y[i, j]], requires_grad=True)
                z = func(coords)
                Z[i, j] = z.item()
                
                z.backward()
                Gx[i, j] = coords.grad[0].item()
                Gy[i, j] = coords.grad[1].item()
        
        # Create comprehensive visualization
        fig = plt.figure(figsize=(20, 15))
        
        # 1. Function contour plot
        ax1 = plt.subplot(3, 3, 1)
        contour = ax1.contour(X, Y, Z, levels=30, alpha=0.8)
        ax1.clabel(contour, inline=True, fontsize=8)
        ax1.set_title(f'{func_name} Function Contours', fontweight='bold')
        ax1.set_xlabel('x₁')
        ax1.set_ylabel('x₂')
        ax1.grid(True, alpha=0.3)
        
        # 2. Function surface (filled contour)
        ax2 = plt.subplot(3, 3, 2)
        surface = ax2.contourf(X, Y, Z, levels=50, cmap='viridis')
        plt.colorbar(surface, ax=ax2)
        ax2.set_title(f'{func_name} Function Surface', fontweight='bold')
        ax2.set_xlabel('x₁')
        ax2.set_ylabel('x₂')
        
        # 3. Gradient vector field
        ax3 = plt.subplot(3, 3, 3)
        step = resolution // 15  # Subsample for clarity
        ax3.contour(X, Y, Z, levels=20, alpha=0.3)
        quiver = ax3.quiver(X[::step, ::step], Y[::step, ::step], 
                           Gx[::step, ::step], Gy[::step, ::step], 
                           alpha=0.8, scale=None, color='red', width=0.003)
        ax3.set_title('Gradient Vector Field', fontweight='bold')
        ax3.set_xlabel('x₁')
        ax3.set_ylabel('x₂')
        ax3.grid(True, alpha=0.3)
        
        # 4. Gradient magnitude heatmap
        ax4 = plt.subplot(3, 3, 4)
        grad_magnitude = np.sqrt(Gx**2 + Gy**2)
        magnitude_plot = ax4.imshow(grad_magnitude, extent=[x_range[0], x_range[1], y_range[0], y_range[1]], 
                                   origin='lower', cmap='plasma')
        plt.colorbar(magnitude_plot, ax=ax4)
        ax4.set_title('Gradient Magnitude', fontweight='bold')
        ax4.set_xlabel('x₁')
        ax4.set_ylabel('x₂')
        
        # 5. X-direction gradients
        ax5 = plt.subplot(3, 3, 5)
        gx_plot = ax5.imshow(Gx, extent=[x_range[0], x_range[1], y_range[0], y_range[1]], 
                            origin='lower', cmap='RdBu_r')
        plt.colorbar(gx_plot, ax=ax5)
        ax5.set_title('∂f/∂x₁', fontweight='bold')
        ax5.set_xlabel('x₁')
        ax5.set_ylabel('x₂')
        
        # 6. Y-direction gradients
        ax6 = plt.subplot(3, 3, 6)
        gy_plot = ax6.imshow(Gy, extent=[x_range[0], x_range[1], y_range[0], y_range[1]], 
                            origin='lower', cmap='RdBu_r')
        plt.colorbar(gy_plot, ax=ax6)
        ax6.set_title('∂f/∂x₂', fontweight='bold')
        ax6.set_xlabel('x₁')
        ax6.set_ylabel('x₂')
        
        # 7-9. Statistical analysis
        ax7 = plt.subplot(3, 3, 7)
        ax7.hist(Z.flatten(), bins=50, alpha=0.7, color='skyblue', edgecolor='black')
        ax7.set_title('Function Value Distribution', fontweight='bold')
        ax7.set_xlabel('Function Value')
        ax7.set_ylabel('Frequency')
        ax7.grid(True, alpha=0.3)
        
        ax8 = plt.subplot(3, 3, 8)
        ax8.hist(grad_magnitude.flatten(), bins=50, alpha=0.7, color='orange', edgecolor='black')
        ax8.set_title('Gradient Magnitude Distribution', fontweight='bold')
        ax8.set_xlabel('Gradient Magnitude')
        ax8.set_ylabel('Frequency')
        ax8.grid(True, alpha=0.3)
        
        ax9 = plt.subplot(3, 3, 9)
        gradient_angles = np.arctan2(Gy, Gx)
        ax9.hist(gradient_angles.flatten(), bins=50, alpha=0.7, color='green', edgecolor='black')
        ax9.set_title('Gradient Direction Distribution', fontweight='bold')
        ax9.set_xlabel('Gradient Angle (radians)')
        ax9.set_ylabel('Frequency')
        ax9.grid(True, alpha=0.3)
        
        plt.suptitle(f'{func_name} Function: Comprehensive Gradient Analysis', 
                     fontsize=16, fontweight='bold')
        plt.tight_layout()
        
        # Save visualization
        filename = f'gradient_landscape_{func_name.lower().replace(" ", "_")}.png'
        plt.savefig(results_dir / filename, dpi=300, bbox_inches='tight')
        plt.show()
        
        # Return analysis data
        return {
            'function_stats': {
                'min': float(Z.min()),
                'max': float(Z.max()),
                'mean': float(Z.mean()),
                'std': float(Z.std())
            },
            'gradient_stats': {
                'magnitude_mean': float(grad_magnitude.mean()),
                'magnitude_max': float(grad_magnitude.max()),
                'x_grad_mean': float(Gx.mean()),
                'y_grad_mean': float(Gy.mean())
            }
        }
    
    # Visualize multiple functions
    functions = [
        (rosenbrock_function, "Rosenbrock", (-2, 2), (-1, 3)),
        (himmelblau_function, "Himmelblau", (-5, 5), (-5, 5))
    ]
    
    analysis_results = {}
    
    for func, name, x_range, y_range in functions:
        result = visualize_function_landscape(func, name, x_range, y_range)
        analysis_results[name] = result
        
        print(f"📊 {name} Function Analysis:")
        print(f"  Function range: [{result['function_stats']['min']:.2f}, {result['function_stats']['max']:.2f}]")
        print(f"  Mean gradient magnitude: {result['gradient_stats']['magnitude_mean']:.4f}")
        print(f"  Max gradient magnitude: {result['gradient_stats']['magnitude_max']:.4f}")
        print()
    
    return analysis_results

# Execute gradient landscape visualization
landscape_analysis = create_gradient_landscape_visualization()
```
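
The nested loop above calls `backward()` once per grid point, which is 10,000 calls at the default resolution. Because each function value depends only on its own point, summing all values and calling `backward()` once yields the same per-point gradients. The sketch below shows this for the Rosenbrock function; `rosenbrock_batched` and the 50×50 grid are assumptions made for the example.

```python
import numpy as np
import torch

def rosenbrock_batched(points):
    """Rosenbrock function evaluated on an (N, 2) batch of points."""
    x, y = points[:, 0], points[:, 1]
    return (1 - x)**2 + 100 * (y - x**2)**2

xs = np.linspace(-2, 2, 50)
ys = np.linspace(-1, 3, 50)
X, Y = np.meshgrid(xs, ys)

# Flatten the grid into an (N, 2) tensor that tracks gradients
points = torch.tensor(np.stack([X.ravel(), Y.ravel()], axis=1), requires_grad=True)

values = rosenbrock_batched(points)

# Each row only influences its own output, so the gradient of the sum
# with respect to row i equals the gradient of f at point i
values.sum().backward()

Z = values.detach().numpy().reshape(X.shape)
Gx = points.grad[:, 0].numpy().reshape(X.shape)
Gy = points.grad[:, 1].numpy().reshape(X.shape)

print(Z.shape, Gx.shape, Gy.shape)
```

The resulting `Z`, `Gx`, and `Gy` arrays have the same layout as those filled in by the loop, so they could be passed to the same plotting code.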

### 4.2 Real-Time Gradient Descent Animation

```python
def demonstrate_gradient_descent_optimization():
    """Demonstrate gradient descent with real-time analysis"""
    
    print("\n=== 4.2 Gradient Descent Optimization ===\n")
    
    def rosenbrock(coords):
        """Rosenbrock function for optimization"""
        x, y = coords[0], coords[1]
        return (1 - x)**2 + 100 * (y - x**2)**2
    
    def run_gradient_descent(start_point, learning_rate, max_iterations=200, tolerance=1e-6):
        """Run gradient descent with comprehensive tracking"""
        
        print(f"🚀 Starting gradient descent:")
        print(f"  Initial position: {start_point.detach().numpy()}")
        print(f"  Learning rate: {learning_rate}")
        print(f"  Max iterations: {max_iterations}")
        print(f"  Tolerance: {tolerance}")
        
        # Initialize tracking
        current_point = start_point.clone().detach().requires_grad_(True)
        
        trajectory = [current_point.clone().detach()]
        losses = []
        gradient_norms = []
        step_sizes = []
        convergence_metrics = []
        
        for iteration in range(max_iterations):
            # Zero gradients
            if current_point.grad is not None:
                current_point.grad.zero_()
            
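            # Forward pass
            loss = rosenbrock(current_point)
            
            # Stop early if the iterate has diverged (loss is inf or NaN),
            # which can happen when the learning rate is too large
            if not torch.isfinite(loss):
                print(f"  ⚠️ Diverged at iteration {iteration} (loss: {loss.item()})")
                break
            
            losses.append(loss.item())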
            
            # Backward pass
            loss.backward()
            
            # Track gradient information
            grad_norm = current_point.grad.norm().item()
            gradient_norms.append(grad_norm)
            
            # Check convergence
            if grad_norm < tolerance:
                print(f"  ✅ Converged at iteration {iteration} (gradient norm: {grad_norm:.2e})")
                break
            
            # Update step
            with torch.no_grad():
                step = learning_rate * current_point.grad
                step_size = step.norm().item()
                step_sizes.append(step_size)
                
                current_point -= step
                
                # Track convergence rate
                if len(losses) > 1:
                    improvement = losses[-2] - losses[-1]
                    convergence_metrics.append(improvement)
            
            # torch.no_grad() does not change requires_grad on the leaf tensor,
            # so this call is a no-op safeguard
            current_point.requires_grad_(True)
            
            # Store trajectory
            trajectory.append(current_point.clone().detach())
            
            # Periodic progress reports
            if iteration % 50 == 0 or iteration < 10:
                print(f"  Iter {iteration:3d}: Loss={loss.item():.6f}, "
                      f"Pos={current_point.detach().numpy()}, "
                      f"GradNorm={grad_norm:.6f}")
        
        return {
            'trajectory': torch.stack(trajectory),
            'losses': losses,
            'gradient_norms': gradient_norms,
            'step_sizes': step_sizes,
            'convergence_metrics': convergence_metrics,
            'final_position': current_point.detach(),
            'iterations': len(losses)
        }
    
    def visualize_optimization_results(results, title_suffix=""):
        """Create comprehensive optimization visualization"""
        
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        trajectory = results['trajectory']
        losses = results['losses']
        gradient_norms = results['gradient_norms']
        step_sizes = results['step_sizes']
        
        # 1. Optimization path on function landscape
        x_range = (trajectory[:, 0].min().item() - 0.5, trajectory[:, 0].max().item() + 0.5)
        y_range = (trajectory[:, 1].min().item() - 0.5, trajectory[:, 1].max().item() + 0.5)
        
        x = np.linspace(x_range[0], x_range[1], 100)
        y = np.linspace(y_range[0], y_range[1], 100)
        X, Y = np.meshgrid(x, y)
        Z = (1 - X)**2 + 100 * (Y - X**2)**2
        
        axes[0,0].contour(X, Y, Z, levels=50, alpha=0.6)
        axes[0,0].plot(trajectory[:, 0], trajectory[:, 1], 'ro-', linewidth=2, markersize=3, alpha=0.8)
        axes[0,0].plot(trajectory[0, 0], trajectory[0, 1], 'go', markersize=12, label='Start')
        axes[0,0].plot(trajectory[-1, 0], trajectory[-1, 1], 'ro', markersize=12, label='End')
        axes[0,0].plot(1, 1, 'k*', markersize=15, label='Global Minimum')
        axes[0,0].set_title(f'Optimization Trajectory{title_suffix}', fontweight='bold')
        axes[0,0].set_xlabel('x₁')
        axes[0,0].set_ylabel('x₂')
        axes[0,0].legend()
        axes[0,0].grid(True, alpha=0.3)
        
        # 2. Loss convergence
        axes[0,1].semilogy(losses, linewidth=2, color='blue')
        axes[0,1].set_title('Loss Convergence', fontweight='bold')
        axes[0,1].set_xlabel('Iteration')
        axes[0,1].set_ylabel('Loss (log scale)')
        axes[0,1].grid(True, alpha=0.3)
        
        # 3. Gradient norm evolution
        axes[0,2].semilogy(gradient_norms, linewidth=2, color='orange')
        axes[0,2].set_title('Gradient Norm Evolution', fontweight='bold')
        axes[0,2].set_xlabel('Iteration')
        axes[0,2].set_ylabel('Gradient Norm (log scale)')
        axes[0,2].grid(True, alpha=0.3)
        
        # 4. Step size evolution
        axes[1,0].plot(step_sizes, linewidth=2, color='green')
        axes[1,0].set_title('Step Size Evolution', fontweight='bold')
        axes[1,0].set_xlabel('Iteration')
        axes[1,0].set_ylabel('Step Size')
        axes[1,0].grid(True, alpha=0.3)
        
        # 5. Parameter evolution
        axes[1,1].plot(trajectory[1:, 0], label='x₁', linewidth=2)
        axes[1,1].plot(trajectory[1:, 1], label='x₂', linewidth=2)
        axes[1,1].axhline(y=1, color='red', linestyle='--', alpha=0.7, label='Optimal')
        axes[1,1].set_title('Parameter Evolution', fontweight='bold')
        axes[1,1].set_xlabel('Iteration')
        axes[1,1].set_ylabel('Parameter Value')
        axes[1,1].legend()
        axes[1,1].grid(True, alpha=0.3)
        
        # 6. Convergence rate analysis
        if len(results['convergence_metrics']) > 0:
            axes[1,2].plot(results['convergence_metrics'], linewidth=2, color='purple')
            axes[1,2].set_title('Loss Improvement per Iteration', fontweight='bold')
            axes[1,2].set_xlabel('Iteration')
            axes[1,2].set_ylabel('Loss Improvement')
            axes[1,2].grid(True, alpha=0.3)
        
        plt.tight_layout()
        return fig
    
    # Run experiments with different learning rates
    experiments = [
        (torch.tensor([-1.5, 2.5]), 0.001, "Small LR"),
        (torch.tensor([-1.5, 2.5]), 0.01, "Medium LR"),
        (torch.tensor([0.5, 0.5]), 0.001, "Good Start + Small LR")
    ]
    
    experiment_results = {}
    
    for start_point, lr, experiment_name in experiments:
        print(f"\n{'='*60}")
        print(f"🧪 Experiment: {experiment_name}")
        print(f"{'='*60}")
        
        results = run_gradient_descent(start_point, lr)
        experiment_results[experiment_name] = results
        
        # Visualize results
        fig = visualize_optimization_results(results, f" ({experiment_name})")
        fig.suptitle(f'Gradient Descent Analysis: {experiment_name}', fontsize=16, fontweight='bold')
        
        # Save visualization
        filename = f'gradient_descent_{experiment_name.lower().replace(" ", "_")}.png'
        plt.savefig(results_dir / filename, dpi=300, bbox_inches='tight')
        plt.show()
        
        # Print summary
        final_loss = results['losses'][-1]
        total_distance = torch.norm(torch.diff(results['trajectory'], dim=0), dim=1).sum().item()
        
        print(f"\n📈 {experiment_name} Results:")
        print(f"  Final position: {results['final_position'].numpy()}")
        print(f"  Final loss: {final_loss:.8f}")
        print(f"  Iterations: {results['iterations']}")
        print(f"  Total distance traveled: {total_distance:.4f}")
        print(f"  Average step size: {np.mean(results['step_sizes']):.6f}")
        print(f"  Final gradient norm: {results['gradient_norms'][-1]:.2e}")
    
    return experiment_results

# Execute gradient descent demonstrations
optimization_results = demonstrate_gradient_descent_optimization()
```
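
The manual update `current_point -= learning_rate * grad` is what plain SGD (no momentum, no weight decay) does. The sketch below runs the same Rosenbrock descent with `torch.optim.SGD` next to a hand-written loop, using the starting point and learning rate of the third experiment; the variable names are illustrative.

```python
import torch

def rosenbrock(p):
    return (1 - p[0])**2 + 100 * (p[1] - p[0]**2)**2

lr, steps = 0.001, 200

# Version 1: torch.optim.SGD handles zeroing and the update step
point = torch.tensor([0.5, 0.5], requires_grad=True)
optimizer = torch.optim.SGD([point], lr=lr)
initial_loss = rosenbrock(point).item()

for _ in range(steps):
    optimizer.zero_grad()
    loss = rosenbrock(point)
    loss.backward()
    optimizer.step()

final_loss = rosenbrock(point).item()

# Version 2: the hand-written update used in the notebook
manual = torch.tensor([0.5, 0.5], requires_grad=True)
for _ in range(steps):
    if manual.grad is not None:
        manual.grad.zero_()
    rosenbrock(manual).backward()
    with torch.no_grad():
        manual -= lr * manual.grad

print(f"Loss: {initial_loss:.4f} -> {final_loss:.4f}")
print(f"SGD result:    {point.detach().numpy()}")
print(f"Manual result: {manual.detach().numpy()}")
```

Both loops should land on essentially the same point. The optimizer version becomes more convenient once momentum or a different update rule is wanted.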

## 5. Advanced Autograd Techniques

### 5.1 Higher-Order Gradients and Hessian Computation

```python
def explore_higher_order_gradients():
    """Explore second-order gradients and Hessian matrices"""
    
    print("\n=== 5.1 Higher-Order Gradients & Hessian Analysis ===\n")
    
    def compute_hessian_matrix(func, inputs):
        """Compute full Hessian matrix for scalar function"""
        
        n = inputs.numel()
        hessian = torch.zeros(n, n)
        
        # First, compute gradient
        inputs_copy = inputs.clone().detach().requires_grad_(True)
        output = func(inputs_copy)
        
        # Compute first-order gradients
        first_grad = torch.autograd.grad(output, inputs_copy, create_graph=True)[0]
        
        # Compute second-order gradients (Hessian)
        for i in range(n):
            second_grad = torch.autograd.grad(first_grad[i], inputs_copy, 
                                            retain_graph=(i < n-1))[0]
            hessian[i] = second_grad
        
        return hessian
    
    def analyze_quadratic_function():
        """Analyze quadratic function with known Hessian"""
        
        print("🔢 Quadratic Function Analysis:")
        
        # Define quadratic function: f(x) = 0.5 * x^T A x + b^T x + c
        A = torch.tensor([[3.0, 1.0], [1.0, 2.0]])  # Positive definite matrix
        b = torch.tensor([1.0, -2.0])
        c = 5.0
        
        def quadratic_func(x):
            return 0.5 * x @ A @ x + b @ x + c
        
        # Test point
        test_point = torch.tensor([2.0, 1.0])
        print(f"  Test point: {test_point.numpy()}")
        print(f"  Matrix A:\n{A.numpy()}")
        print(f"  Vector b: {b.numpy()}")
        print(f"  Scalar c: {c}")
        
        # Compute Hessian
        computed_hessian = compute_hessian_matrix(quadratic_func, test_point)
        true_hessian = A  # For a quadratic with symmetric A, the Hessian equals A
        
        print(f"\n  Computed Hessian:\n{computed_hessian.numpy()}")
        print(f"  True Hessian (A):\n{true_hessian.numpy()}")
        print(f"  Difference norm: {torch.norm(computed_hessian - true_hessian).item():.2e}")
        print(f"  ✅ Hessians match: {torch.allclose(computed_hessian, true_hessian, atol=1e-6)}")
        
        return computed_hessian, true_hessian
    
    def analyze_hessian_properties(hessian, function_name):
        """Analyze mathematical properties of Hessian matrix"""
        
        print(f"\n🔍 {function_name} Hessian Properties:")
        
        # Eigenvalue analysis
        eigenvals, eigenvecs = torch.linalg.eigh(hessian)
        
        print(f"  Eigenvalues: {eigenvals.numpy()}")
        print(f"  Eigenvectors:\n{eigenvecs.numpy()}")
        
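        # Matrix properties (the condition number uses |λ| so it stays
        # meaningful for indefinite or negative definite matrices)
        abs_eigenvals = eigenvals.abs()
        condition_number = abs_eigenvals.max() / abs_eigenvals.min()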
        determinant = torch.det(hessian)
        trace = torch.trace(hessian)
        
        print(f"  Determinant: {determinant.item():.6f}")
        print(f"  Trace: {trace.item():.6f}")
        print(f"  Condition number: {condition_number.item():.6f}")
        
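        # Definiteness analysis: this describes local curvature at the evaluation
        # point; it identifies an extremum only where the gradient is also zero
        if (eigenvals > 0).all():
            definiteness = "Positive definite"
            optimization_property = "Locally strictly convex (strict local minimum if gradient is zero)"
        elif (eigenvals < 0).all():
            definiteness = "Negative definite"
            optimization_property = "Locally strictly concave (strict local maximum if gradient is zero)"
        elif (eigenvals >= 0).all():
            definiteness = "Positive semidefinite"
            optimization_property = "Possible minimum (second-order test inconclusive)"
        elif (eigenvals <= 0).all():
            definiteness = "Negative semidefinite"
            optimization_property = "Possible maximum (second-order test inconclusive)"
        else:
            definiteness = "Indefinite"
            optimization_property = "Saddle-shaped curvature (saddle point if gradient is zero)"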
        
        print(f"  Matrix type: {definiteness}")
        print(f"  Optimization implication: {optimization_property}")
        
        return {
            'eigenvalues': eigenvals.numpy(),
            'condition_number': condition_number.item(),
            'definiteness': definiteness,
            'optimization_property': optimization_property
        }
    
    # Analyze quadratic function
    quad_hessian, true_quad_hessian = analyze_quadratic_function()
    quad_properties = analyze_hessian_properties(quad_hessian, "Quadratic")
    
    # Analyze Rosenbrock function at different points
    def rosenbrock_func(x):
        return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
    
    test_points = [
        torch.tensor([0.0, 0.0]),  # Away from minimum
        torch.tensor([1.0, 1.0]),  # At minimum
        torch.tensor([0.5, 0.25]) # Intermediate point
    ]
    
    rosenbrock_analysis = {}
    
    for i, point in enumerate(test_points):
        # Two decimals so that (0.5, 0.25) is not rounded to (0.5, 0.2) in the label
        point_name = f"Point_{i+1}_({point[0].item():.2f},{point[1].item():.2f})"
        print(f"\n🌹 Rosenbrock Function at {point_name}:")
        print(f"  Position: {point.numpy()}")
        
        hessian = compute_hessian_matrix(rosenbrock_func, point)
        properties = analyze_hessian_properties(hessian, f"Rosenbrock at {point_name}")
        
        rosenbrock_analysis[point_name] = {
            'position': point.numpy(),
            'hessian': hessian.numpy(),
            'properties': properties
        }
    
    return {
        'quadratic_analysis': {
            'hessian': quad_hessian.numpy(),
            'properties': quad_properties
        },
        'rosenbrock_analysis': rosenbrock_analysis
    }

# Execute higher-order gradient analysis
hessian_analysis_results = explore_higher_order_gradients()
```
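
As a cross-check on the Hessian analysis above, the following minimal sketch compares PyTorch's built-in `torch.autograd.functional.hessian` against the closed-form Rosenbrock Hessian at the minimum `(1, 1)`. The helper `analytic_rosenbrock_hessian` is a hypothetical name introduced here for illustration (its entries were derived by hand), and this sketch does not use the notebook's own `compute_hessian_matrix`.

```python
import torch
from torch.autograd.functional import hessian


def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2


def analytic_rosenbrock_hessian(x):
    """Closed-form Hessian of the Rosenbrock function (hypothetical helper)."""
    x0, x1 = x[0].item(), x[1].item()
    return torch.tensor([
        [2 + 1200 * x0**2 - 400 * x1, -400 * x0],
        [-400 * x0, 200.0],
    ])


point = torch.tensor([1.0, 1.0])

# Autograd Hessian: differentiates the function twice via the computational graph
H_auto = hessian(rosenbrock, point)
H_true = analytic_rosenbrock_hessian(point)

# eigvalsh is appropriate because a Hessian of a smooth function is symmetric
eigenvalues = torch.linalg.eigvalsh(H_auto)

print(f"Autograd Hessian:\n{H_auto}")
print(f"Analytic Hessian:\n{H_true}")
print(f"Eigenvalues: {eigenvalues}")
```

Both eigenvalues are positive at `(1, 1)`, but they differ by several orders of magnitude. That large spread is one reason plain gradient descent tends to make slow progress along the Rosenbrock valley.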

### 5.2 Gradient Debugging and Anomaly Detection

```python
def demonstrate_gradient_debugging():
    """Comprehensive gradient debugging and anomaly detection"""
    
    print("\n=== 5.2 Gradient Debugging & Anomaly Detection ===\n")
    
    class DiagnosticNet(nn.Module):
        """Neural network designed to exhibit various gradient issues"""
        
        def __init__(self, input_dim=10, hidden_dims=(50, 50, 50), output_dim=1, 
                     problematic=False):
            super().__init__()
            
            layers = []
            prev_dim = input_dim
            
            for i, hidden_dim in enumerate(hidden_dims):
                linear = nn.Linear(prev_dim, hidden_dim)
                
                if problematic:
                    # Initialize with problematic weights
                    if i == 0:
                        nn.init.normal_(linear.weight, mean=0, std=5.0)  # Large weights
                    else:
                        nn.init.normal_(linear.weight, mean=0, std=0.01)  # Small weights
                else:
                    nn.init.xavier_normal_(linear.weight)
                
                layers.extend([linear, nn.ReLU()])
                prev_dim = hidden_dim
            
            # Output layer
            output_layer = nn.Linear(prev_dim, output_dim)
            if problematic:
                nn.init.normal_(output_layer.weight, mean=0, std=0.001)
            else:
                nn.init.xavier_normal_(output_layer.weight)
            
            layers.append(output_layer)
            self.network = nn.Sequential(*layers)
            self.problematic = problematic
        
        def forward(self, x):
            return self.network(x)
    
    def comprehensive_gradient_analysis(model, input_data, target, model_name):
        """Perform comprehensive gradient analysis with anomaly detection"""
        
        print(f"🔍 Analyzing {model_name}:")
        print("-" * 50)
        
        analysis_results = {
            'model_name': model_name,
            'layer_analysis': [],
            'anomalies_detected': [],
            'health_score': 0.0,
            'recommendations': []
        }
        
        # Enable anomaly detection
        with torch.autograd.detect_anomaly():
            try:
                # Forward pass
                output = model(input_data)
                loss = F.mse_loss(output, target)
                
                print(f"  📊 Forward Pass:")
                print(f"    Input shape: {input_data.shape}")
                print(f"    Output shape: {output.shape}")
                print(f"    Loss: {loss.item():.6f}")
                print(f"    Output range: [{output.min().item():.4f}, {output.max().item():.4f}]")
                
                # Backward pass
                loss.backward()
                
                # Analyze each layer's gradients
                print(f"\n  🔄 Gradient Analysis:")
                print(f"    {'Layer':<25} {'Param Count':<12} {'Grad Norm':<12} {'Mean':<10} {'Std':<10} {'Issues':<20}")
                print("    " + "-" * 90)
                
                total_norm_squared = 0
                layer_count = 0
                issue_count = 0
                
                for name, param in model.named_parameters():
                    if param.grad is not None:
                        grad = param.grad
                        grad_norm = grad.norm().item()
                        grad_mean = grad.mean().item()
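                        # std() is NaN for a single-element tensor (e.g. a scalar output bias)
                        grad_std = grad.std().item() if grad.numel() > 1 else 0.0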
                        param_count = param.numel()
                        
                        # Detect issues
                        issues = []
                        
                        # Zero gradients
                        if grad_norm == 0:
                            issues.append("ZERO_GRAD")
                            issue_count += 1
                        
                        # Exploding gradients
                        elif grad_norm > 10:
                            issues.append("EXPLODING")
                            issue_count += 1
                        
                        # Vanishing gradients
                        elif grad_norm < 1e-7:
                            issues.append("VANISHING")
                            issue_count += 1
                        
                        # NaN gradients
                        if torch.isnan(grad).any():
                            issues.append("NAN")
                            issue_count += 1
                        
                        # Inf gradients
                        if torch.isinf(grad).any():
                            issues.append("INF")
                            issue_count += 1
                        
                        # Large standard deviation (unstable)
                        if grad_std > 10 * abs(grad_mean) and grad_std > 1.0:
                            issues.append("UNSTABLE")
                            issue_count += 1
                        
                        issues_str = ",".join(issues) if issues else "OK"
                        
                        print(f"    {name:<25} {param_count:<12} {grad_norm:<12.6f} "
                              f"{grad_mean:<10.4f} {grad_std:<10.4f} {issues_str:<20}")
                        
                        # Store layer analysis
                        layer_info = {
                            'name': name,
                            'param_count': param_count,
                            'grad_norm': grad_norm,
                            'grad_mean': grad_mean,
                            'grad_std': grad_std,
                            'issues': issues
                        }
                        analysis_results['layer_analysis'].append(layer_info)
                        analysis_results['anomalies_detected'].extend(issues)
                        
                        total_norm_squared += grad_norm ** 2
                        layer_count += 1
                
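                # Overall gradient analysis
                total_grad_norm = total_norm_squared ** 0.5
                # Mean of the per-tensor norms (the total above is a root-sum-of-squares,
                # so dividing it by the tensor count would not give an average)
                per_tensor_norms = [layer['grad_norm'] for layer in analysis_results['layer_analysis']]
                avg_grad_norm = sum(per_tensor_norms) / max(layer_count, 1)
                # issue_count counts individual issues and can exceed the number of tensors,
                # so count the flagged parameter tensors separately for the ratio below
                flagged_tensors = sum(1 for layer in analysis_results['layer_analysis'] if layer['issues'])
                
                print(f"\n  📈 Overall Gradient Statistics:")
                print(f"    Total gradient norm: {total_grad_norm:.6f}")
                print(f"    Average gradient norm: {avg_grad_norm:.6f}")
                print(f"    Total parameters: {sum(p.numel() for p in model.parameters()):,}")
                print(f"    Parameter tensors with issues: {flagged_tensors}/{layer_count}")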
                
                # Health score calculation
                health_score = max(0, 100 - (issue_count * 20))
                analysis_results['health_score'] = health_score
                
                # Generate recommendations
                recommendations = []
                
                if issue_count == 0:
                    recommendations.append("✅ Gradient flow appears healthy")
                else:
                    if "EXPLODING" in analysis_results['anomalies_detected']:
                        recommendations.append("🔧 Use gradient clipping to handle exploding gradients")
                    if "VANISHING" in analysis_results['anomalies_detected']:
                        recommendations.append("🔧 Consider residual connections or different activation functions")
                    if "ZERO_GRAD" in analysis_results['anomalies_detected']:
                        recommendations.append("🔧 Check for dead neurons or inappropriate activations")
                    if "NAN" in analysis_results['anomalies_detected'] or "INF" in analysis_results['anomalies_detected']:
                        recommendations.append("🔧 Reduce learning rate or check for numerical instabilities")
                    if "UNSTABLE" in analysis_results['anomalies_detected']:
                        recommendations.append("🔧 Consider batch normalization or layer normalization")
                
                analysis_results['recommendations'] = recommendations
                
                print(f"\n  🏥 Health Assessment:")
                print(f"    Health score: {health_score}/100")
                for rec in recommendations:
                    print(f"    {rec}")
                
            except Exception as e:
                print(f"  ❌ Error during analysis: {e}")
                analysis_results['error'] = str(e)
        
        return analysis_results
    
    # Create test models
    good_model = DiagnosticNet(problematic=False)
    problematic_model = DiagnosticNet(problematic=True)
    
    # Create test data
    input_data = torch.randn(16, 10)
    target = torch.randn(16, 1)
    
    # Analyze both models
    good_analysis = comprehensive_gradient_analysis(good_model, input_data, target, "Healthy Model")
    print("\n" + "="*80 + "\n")
    problematic_analysis = comprehensive_gradient_analysis(problematic_model, input_data, target, "Problematic Model")
    
    # Create comparison visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Extract gradient norms for visualization
    good_norms = [layer['grad_norm'] for layer in good_analysis['layer_analysis']]
    prob_norms = [layer['grad_norm'] for layer in problematic_analysis['layer_analysis']]
    good_names = [layer['name'] for layer in good_analysis['layer_analysis']]
    prob_names = [layer['name'] for layer in problematic_analysis['layer_analysis']]
    
    # 1. Gradient norm comparison
    x_good = range(len(good_norms))
    x_prob = range(len(prob_norms))
    
    axes[0,0].bar([x - 0.2 for x in x_good], good_norms, 0.4, label='Healthy Model', alpha=0.8)
    axes[0,0].bar([x + 0.2 for x in x_prob], prob_norms, 0.4, label='Problematic Model', alpha=0.8)
    axes[0,0].set_title('Gradient Norms by Layer', fontweight='bold')
    axes[0,0].set_ylabel('Gradient Norm')
    axes[0,0].set_yscale('log')
    axes[0,0].legend()
    axes[0,0].grid(True, alpha=0.3)
    
    # 2. Health scores
    health_scores = [good_analysis['health_score'], problematic_analysis['health_score']]
    model_names = ['Healthy Model', 'Problematic Model']
    colors = ['green', 'red']
    
    bars = axes[0,1].bar(model_names, health_scores, color=colors, alpha=0.7)
    axes[0,1].set_title('Model Health Scores', fontweight='bold')
    axes[0,1].set_ylabel('Health Score (0-100)')
    axes[0,1].set_ylim(0, 100)
    
    # Add value labels on bars
    for bar, score in zip(bars, health_scores):
        height = bar.get_height()
        axes[0,1].text(bar.get_x() + bar.get_width()/2., height + 2,
                      f'{score:.0f}', ha='center', va='bottom', fontweight='bold')
    
    # 3. Issue distribution
    good_issues = good_analysis['anomalies_detected']
    prob_issues = problematic_analysis['anomalies_detected']
    
    all_issue_types = list(set(good_issues + prob_issues))
    if all_issue_types:
        good_issue_counts = [good_issues.count(issue) for issue in all_issue_types]
        prob_issue_counts = [prob_issues.count(issue) for issue in all_issue_types]
        
        x_issues = range(len(all_issue_types))
        axes[1,0].bar([x - 0.2 for x in x_issues], good_issue_counts, 0.4, 
                     label='Healthy Model', alpha=0.8)
        axes[1,0].bar([x + 0.2 for x in x_issues], prob_issue_counts, 0.4, 
                     label='Problematic Model', alpha=0.8)
        axes[1,0].set_title('Issue Type Distribution', fontweight='bold')
        axes[1,0].set_ylabel('Issue Count')
        axes[1,0].set_xticks(x_issues)
        axes[1,0].set_xticklabels(all_issue_types, rotation=45)
        axes[1,0].legend()
        axes[1,0].grid(True, alpha=0.3)
    else:
        axes[1,0].text(0.5, 0.5, 'No Issues Detected', ha='center', va='center', 
                      transform=axes[1,0].transAxes, fontsize=14)
        axes[1,0].set_title('Issue Type Distribution', fontweight='bold')
    
    # 4. Gradient distribution histogram
    axes[1,1].hist(good_norms, bins=20, alpha=0.7, label='Healthy Model', density=True)
    axes[1,1].hist(prob_norms, bins=20, alpha=0.7, label='Problematic Model', density=True)
    axes[1,1].set_title('Gradient Norm Distribution', fontweight='bold')
    axes[1,1].set_xlabel('Gradient Norm')
    axes[1,1].set_ylabel('Density')
    axes[1,1].set_yscale('log')
    axes[1,1].legend()
    axes[1,1].grid(True, alpha=0.3)
    
    plt.suptitle('Gradient Debugging Analysis Comparison', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.savefig(results_dir / 'gradient_debugging_analysis.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    return {
        'healthy_model_analysis': good_analysis,
        'problematic_model_analysis': problematic_analysis
    }

# Execute gradient debugging demonstration
debugging_results = demonstrate_gradient_debugging()
```
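
The analysis above recommends gradient clipping when exploding gradients are detected. A minimal sketch of that remedy, using `torch.nn.utils.clip_grad_norm_`, might look as follows. The model, the weight scaling factor of 20, and the helper `global_grad_norm` are illustrative assumptions and are not part of the notebook's `DiagnosticNet`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Small model with deliberately inflated first-layer weights (assumption for the demo)
model = nn.Sequential(nn.Linear(10, 50), nn.ReLU(), nn.Linear(50, 1))
with torch.no_grad():
    model[0].weight.mul_(20.0)

x = torch.randn(16, 10)
y = torch.randn(16, 1)

loss = F.mse_loss(model(x), y)
loss.backward()


def global_grad_norm(parameters):
    """L2 norm over all gradients, treated as one long vector (hypothetical helper)."""
    norms = [p.grad.norm() for p in parameters if p.grad is not None]
    return torch.stack(norms).norm().item()


max_norm = 1.0
# clip_grad_norm_ rescales gradients in place and returns the norm *before* clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm).item()
norm_after = global_grad_norm(model.parameters())

print(f"Global grad norm before clipping: {norm_before:.4f}")
print(f"Global grad norm after clipping:  {norm_after:.4f}")
```

Clipping rescales every gradient by the same factor, so the update direction is preserved and only its length changes. In a training loop the call goes between `loss.backward()` and `optimizer.step()`.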

## 6. Performance Optimization Techniques

### 6.1 Gradient Accumulation for Large-Scale Training

```python
def demonstrate_gradient_accumulation():
    """Demonstrate gradient accumulation for memory-efficient training"""
    
    print("\n=== 6.1 Gradient Accumulation for Large-Scale Training ===\n")
    
    def create_training_comparison(model_class, train_loader, effective_batch_size=128):
        """Compare standard training vs gradient accumulation"""
        
        actual_batch_size = train_loader.batch_size
        accumulation_steps = effective_batch_size // actual_batch_size
        
        print(f"📊 Training Configuration Comparison:")
        print(f"  Physical batch size: {actual_batch_size}")
        print(f"  Effective batch size: {effective_batch_size}")
        print(f"  Accumulation steps: {accumulation_steps}")
        print(f"  Large-batch activation memory vs. one micro-batch: ~{accumulation_steps}x")
        
        # Method 1: Standard training (large batch simulation)
        model1 = model_class()
        # Snapshot the initial weights now: model1 is modified by optimizer1.step()
        # below, so copying its state later would not give model2 the same start.
        initial_state = copy.deepcopy(model1.state_dict())
        # eval() disables dropout so both methods see the same deterministic
        # forward pass; gradients are still computed in eval mode.
        model1.eval()
        optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.001)
        criterion = nn.MSELoss()
        
        print(f"\n🔄 Method 1: Simulated Large Batch Training")
        
        # Collect multiple mini-batches
        batch_data_list = []
        batch_targets_list = []
        
        for i, (data, target) in enumerate(train_loader):
            batch_data_list.append(data)
            batch_targets_list.append(target)
            if len(batch_data_list) >= accumulation_steps:
                break
        
        # Combine into large batch
        large_batch_data = torch.cat(batch_data_list, dim=0)
        large_batch_targets = torch.cat(batch_targets_list, dim=0)
        
        start_time = time.time()
        
        optimizer1.zero_grad()
        output1 = model1(large_batch_data)
        loss1 = criterion(output1, large_batch_targets)
        loss1.backward()
        
        # Capture gradient statistics
        grad_norms_method1 = []
        for param in model1.parameters():
            if param.grad is not None:
                grad_norms_method1.append(param.grad.norm().item())
        
        optimizer1.step()
        
        method1_time = time.time() - start_time
        method1_loss = loss1.item()
        
        print(f"    Loss: {method1_loss:.6f}")
        print(f"    Time: {method1_time:.4f}s")
        print(f"    Memory usage: {large_batch_data.numel() * 4 / 1024**2:.1f} MB (simulated)")
        
        # Method 2: Gradient accumulation
        model2 = model_class()
        model2.load_state_dict(initial_state)  # Start from same initialization
        model2.eval()  # Match model1: dropout off, gradients still computed
        optimizer2 = torch.optim.Adam(model2.parameters(), lr=0.001)
        
        print(f"\n📈 Method 2: Gradient Accumulation Training")
        
        start_time = time.time()
        
        optimizer2.zero_grad()
        accumulated_loss = 0
        grad_norms_method2 = []
        
        for i, (data, target) in enumerate(zip(batch_data_list, batch_targets_list)):
            # Forward pass
            output = model2(data)
            loss = criterion(output, target) / accumulation_steps  # Scale loss
            
            # Backward pass (accumulate gradients)
            loss.backward()
            
            accumulated_loss += loss.item() * accumulation_steps
        
        # Capture gradient statistics after accumulation
        for param in model2.parameters():
            if param.grad is not None:
                grad_norms_method2.append(param.grad.norm().item())
        
        # Single optimizer step after accumulation
        optimizer2.step()
        
        method2_time = time.time() - start_time
        
        print(f"    Accumulated loss: {accumulated_loss:.6f}")
        print(f"    Time: {method2_time:.4f}s")
        print(f"    Memory usage: {batch_data_list[0].numel() * 4 / 1024**2:.1f} MB per step")
        
        # Compare results
        print(f"\n📊 Comparison Results:")
        loss_difference = abs(method1_loss - accumulated_loss)
        time_difference = abs(method1_time - method2_time)
        
        print(f"  Loss difference: {loss_difference:.8f}")
        print(f"  Time difference: {time_difference:.4f}s")
        print(f"  Speed ratio: {method1_time/method2_time:.2f}x")
        
        # Compare gradients
        grad_norm_diff = np.mean([abs(g1 - g2) for g1, g2 in zip(grad_norms_method1, grad_norms_method2)])
        print(f"  Average gradient norm difference: {grad_norm_diff:.8f}")
        
        # Model parameter comparison
        param_diff = 0
        for p1, p2 in zip(model1.parameters(), model2.parameters()):
            param_diff += (p1 - p2).norm().item()
        
        print(f"  Total parameter difference: {param_diff:.8f}")
        print(f"  ✅ Methods equivalent: {param_diff < 1e-6}")
        
        return {
            'method1': {
                'loss': method1_loss,
                'time': method1_time,
                'grad_norms': grad_norms_method1
            },
            'method2': {
                'loss': accumulated_loss,
                'time': method2_time,
                'grad_norms': grad_norms_method2
            },
            'comparison': {
                'loss_diff': loss_difference,
                'time_diff': time_difference,
                'param_diff': param_diff,
                'equivalent': param_diff < 1e-6
            }
        }
    
    # Create test model and data
    class TestModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(50, 100),
                nn.ReLU(),
                nn.Dropout(0.1),
                nn.Linear(100, 50),
                nn.ReLU(),
                nn.Linear(50, 10)
            )
        
        def forward(self, x):
            return self.layers(x)
    
    # Create synthetic dataset
    data_size = 200
    input_dim = 50
    output_dim = 10
    
    synthetic_data = torch.randn(data_size, input_dim)
    synthetic_targets = torch.randn(data_size, output_dim)
    
    dataset = TensorDataset(synthetic_data, synthetic_targets)
    train_loader = DataLoader(dataset, batch_size=32, shuffle=False)
    
    print(f"📁 Dataset Configuration:")
    print(f"  Total samples: {data_size}")
    print(f"  Input dimension: {input_dim}")
    print(f"  Output dimension: {output_dim}")
    print(f"  Batch size: {train_loader.batch_size}")
    
    # Run comparison
    results = create_training_comparison(TestModel, train_loader, effective_batch_size=128)
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # 1. Loss comparison
    methods = ['Standard Training', 'Gradient Accumulation']
    losses = [results['method1']['loss'], results['method2']['loss']]
    colors = ['skyblue', 'lightcoral']
    
    bars1 = axes[0,0].bar(methods, losses, color=colors, alpha=0.8)
    axes[0,0].set_title('Training Loss Comparison', fontweight='bold')
    axes[0,0].set_ylabel('Loss')
    
    # Add value labels
    for bar, loss in zip(bars1, losses):
        height = bar.get_height()
        axes[0,0].text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                      f'{loss:.6f}', ha='center', va='bottom')
    
    # 2. Training time comparison
    times = [results['method1']['time'], results['method2']['time']]
    
    bars2 = axes[0,1].bar(methods, times, color=colors, alpha=0.8)
    axes[0,1].set_title('Training Time Comparison', fontweight='bold')
    axes[0,1].set_ylabel('Time (seconds)')
    
    for bar, time_val in zip(bars2, times):
        height = bar.get_height()
        axes[0,1].text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                      f'{time_val:.4f}s', ha='center', va='bottom')
    
    # 3. Gradient norm distribution
    grad_norms_1 = results['method1']['grad_norms']
    grad_norms_2 = results['method2']['grad_norms']
    
    axes[1,0].hist(grad_norms_1, bins=15, alpha=0.7, label='Standard Training', density=True)
    axes[1,0].hist(grad_norms_2, bins=15, alpha=0.7, label='Gradient Accumulation', density=True)
    axes[1,0].set_title('Gradient Norm Distribution', fontweight='bold')
    axes[1,0].set_xlabel('Gradient Norm')
    axes[1,0].set_ylabel('Density')
    axes[1,0].legend()
    axes[1,0].grid(True, alpha=0.3)
    
    # 4. Method equivalence metrics
    equiv_metrics = ['Loss Difference', 'Parameter Difference', 'Avg Grad Norm Diff']
    equiv_values = [
        results['comparison']['loss_diff'],
        results['comparison']['param_diff'],
        np.mean([abs(g1 - g2) for g1, g2 in zip(grad_norms_1, grad_norms_2)])
    ]
    
    bars4 = axes[1,1].bar(equiv_metrics, equiv_values, color='lightgreen', alpha=0.8)
    axes[1,1].set_title('Method Equivalence Metrics', fontweight='bold')
    axes[1,1].set_ylabel('Absolute Difference')
    axes[1,1].set_yscale('log')
    axes[1,1].tick_params(axis='x', rotation=45)
    
    plt.suptitle('Gradient Accumulation vs Standard Training Analysis', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.savefig(results_dir / 'gradient_accumulation_comparison.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"\n🎓 Key Benefits of Gradient Accumulation:")
    print(f"  • Enables training with larger effective batch sizes")
    print(f"  • Matches large-batch gradients when the loss is scaled by the step count (batch-dependent layers such as BatchNorm can break this)")
    print(f"  • Reduces memory requirements per forward pass")
    print(f"  • Essential for training large models on limited hardware")
    print(f"  • Provides more stable gradient estimates")
    
    return results

# Execute gradient accumulation demonstration
accumulation_results = demonstrate_gradient_accumulation()
```
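
The equivalence claimed above can be checked in isolation. The sketch below is a minimal, self-contained illustration on a single linear layer, assuming equally sized micro-batches and a mean-reduced loss. The variable names are illustrative and do not come from the notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(8, 1)
criterion = nn.MSELoss()  # mean reduction, so each micro-batch loss is a per-sample average

data = torch.randn(64, 8)
targets = torch.randn(64, 1)
accumulation_steps = 4

# Reference: gradient of the loss on the full batch of 64 samples
model.zero_grad()
criterion(model(data), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Accumulation: four micro-batches of 16 samples, each loss divided by the step count
model.zero_grad()
for x, y in zip(data.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    loss = criterion(model(x), y) / accumulation_steps
    loss.backward()  # .grad is summed across calls until it is cleared
accumulated_grad = model.weight.grad.clone()

max_abs_diff = (full_batch_grad - accumulated_grad).abs().max().item()
print(f"Max absolute gradient difference: {max_abs_diff:.2e}")
```

Dividing each micro-batch loss by `accumulation_steps` is what turns the sum of four per-batch means into the mean over all 64 samples. Without that scaling the accumulated gradient would be four times larger than the full-batch gradient.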

### 6.2 Memory-Efficient Gradient Checkpointing

```python
def demonstrate_gradient_checkpointing():
    """Demonstrate gradient checkpointing for memory efficiency"""
    
    print("\n=== 6.2 Memory-Efficient Gradient Checkpointing ===\n")
    
    def memory_intensive_computation(x, num_layers=10):
        """Create memory-intensive computation chain"""
        y = x
        for i in range(num_layers):
            # Each operation creates intermediate tensors
            y = torch.sin(y) + torch.cos(y * 0.5) + torch.exp(y * 0.1)
            y = torch.relu(y - y.mean())
        return y.sum()
    
    def compare_memory_usage():
        """Compare memory usage between normal and checkpointed computation"""
        
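        # Create large input tensor on the GPU when one is available; otherwise
        # torch.cuda.memory_allocated() would not see any of this computation
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        input_size = (100, 1000)  # Reduced size for demonstration
        large_input = torch.randn(input_size, device=device, requires_grad=True)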
        
        print(f"📊 Memory Comparison Setup:")
        print(f"  Input tensor shape: {input_size}")
        print(f"  Input tensor size: {large_input.numel() * 4 / 1024**2:.1f} MB")
        print(f"  Computation layers: 10 (each creates intermediate tensors)")
        
        # Method 1: Normal computation (stores all intermediates)
        print(f"\n💾 Method 1: Normal Computation")
        
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            initial_memory = torch.cuda.memory_allocated()
        
        start_time = time.time()
        
        # Clear any existing gradients
        if large_input.grad is not None:
            large_input.grad.zero_()
        
        result_normal = memory_intensive_computation(large_input)
        
        if torch.cuda.is_available():
            peak_memory_normal = torch.cuda.memory_allocated()
            memory_after_forward = peak_memory_normal - initial_memory
        else:
            memory_after_forward = "N/A (CPU only)"
        
        result_normal.backward()
        
        if torch.cuda.is_available():
            final_memory_normal = torch.cuda.memory_allocated()
            total_memory_normal = final_memory_normal - initial_memory
        else:
            total_memory_normal = "N/A (CPU only)"
        
        normal_time = time.time() - start_time
        normal_grad_norm = large_input.grad.norm().item()
        
        print(f"    Forward + backward time: {normal_time:.4f}s")
        print(f"    Memory after forward: {memory_after_forward if isinstance(memory_after_forward, str) else f'{memory_after_forward / 1024**2:.1f} MB'}")
        print(f"    Total memory used: {total_memory_normal if isinstance(total_memory_normal, str) else f'{total_memory_normal / 1024**2:.1f} MB'}")
        print(f"    Gradient norm: {normal_grad_norm:.6f}")
        
        # Reset for checkpointed computation
        large_input.grad = None
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
        
        # Method 2: Checkpointed computation
        print(f"\n🔄 Method 2: Gradient Checkpointing")
        
        if torch.cuda.is_available():
            initial_memory_cp = torch.cuda.memory_allocated()
        
        start_time = time.time()
        
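        # Pass use_reentrant explicitly; recent PyTorch versions warn when it is omitted
        result_checkpointed = checkpoint(memory_intensive_computation, large_input, use_reentrant=False)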
        
        if torch.cuda.is_available():
            memory_after_forward_cp = torch.cuda.memory_allocated() - initial_memory_cp
        else:
            memory_after_forward_cp = "N/A (CPU only)"
        
        result_checkpointed.backward()
        
        if torch.cuda.is_available():
            total_memory_cp = torch.cuda.memory_allocated() - initial_memory_cp
        else:
            total_memory_cp = "N/A (CPU only)"
        
        checkpoint_time = time.time() - start_time
        checkpoint_grad_norm = large_input.grad.norm().item()
        
        print(f"    Forward + backward time: {checkpoint_time:.4f}s")
        print(f"    Memory after forward: {memory_after_forward_cp if isinstance(memory_after_forward_cp, str) else f'{memory_after_forward_cp / 1024**2:.1f} MB'}")
        print(f"    Total memory used: {total_memory_cp if isinstance(total_memory_cp, str) else f'{total_memory_cp / 1024**2:.1f} MB'}")
        print(f"    Gradient norm: {checkpoint_grad_norm:.6f}")
        
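        # Comparison
        print(f"\n📈 Comparison Results:")
        print(f"    Results match: {torch.allclose(result_normal, result_checkpointed, atol=1e-6)}")
        # Only the norms are compared here (the first gradient was cleared above),
        # using a relative tolerance because the norm can be large
        norm_tol = 1e-5 * max(abs(normal_grad_norm), 1.0)
        print(f"    Gradient norms match: {abs(normal_grad_norm - checkpoint_grad_norm) <= norm_tol}")
        print(f"    Time overhead: {(checkpoint_time / normal_time - 1) * 100:.1f}%")
        
        # Guard against division by zero when no CUDA allocation was recorded
        if torch.cuda.is_available() and memory_after_forward > 0:
            memory_savings = (memory_after_forward - memory_after_forward_cp) / memory_after_forward * 100
            print(f"    Memory savings: {memory_savings:.1f}%")
        else:
            print(f"    Memory savings: not measured (no CUDA allocations recorded)")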
        
        return {
            'normal': {
                'time': normal_time,
                'memory_forward': memory_after_forward,
                'memory_total': total_memory_normal,
                'grad_norm': normal_grad_norm
            },
            'checkpointed': {
                'time': checkpoint_time,
                'memory_forward': memory_after_forward_cp,
                'memory_total': total_memory_cp,
                'grad_norm': checkpoint_grad_norm
            }
        }
    
    # Run memory comparison
    memory_results = compare_memory_usage()
    
    print(f"\n🎓 Gradient Checkpointing Insights:")
    print(f"  • Trade computation time for memory usage")
    print(f"  • Essential for training very deep networks")
    print(f"  • Mathematically equivalent to normal backpropagation")
    print(f"  • Enables training of models that wouldn't fit in memory otherwise")
    print(f"  • Particularly useful for transformer models and deep CNNs")
    
    return memory_results

# Execute gradient checkpointing demonstration
checkpointing_results = demonstrate_gradient_checkpointing()
```
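
Checkpointing is more commonly applied to a sub-module than to a free function. The sketch below wraps one block of a small network and checks that the loss and gradients agree with an ordinary forward pass. It is a minimal illustration under the assumption that `use_reentrant=False` is available in the installed PyTorch version. The names `block`, `head`, and `run` are hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# The block whose intermediate activations we choose not to store
block = nn.Sequential(nn.Linear(32, 64), nn.Tanh(), nn.Linear(64, 32), nn.Tanh())
head = nn.Linear(32, 1)
x = torch.randn(16, 32)


def run(use_checkpoint):
    """One forward/backward pass; returns the loss and the first-layer weight gradient."""
    block.zero_grad()
    head.zero_grad()
    if use_checkpoint:
        # Activations inside `block` are recomputed during backward instead of stored
        hidden = checkpoint(block, x, use_reentrant=False)
    else:
        hidden = block(x)
    loss = head(hidden).pow(2).mean()
    loss.backward()
    return loss.item(), block[0].weight.grad.clone()


loss_plain, grad_plain = run(use_checkpoint=False)
loss_ckpt, grad_ckpt = run(use_checkpoint=True)

print(f"Loss (plain):        {loss_plain:.6f}")
print(f"Loss (checkpointed): {loss_ckpt:.6f}")
print(f"Max gradient difference: {(grad_plain - grad_ckpt).abs().max().item():.2e}")
```

On a model this small the memory saving is negligible and the recomputation is pure overhead. The technique pays off when the checkpointed block holds many large activations.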

## 7. Practical Applications: Advanced Gradient Techniques

### 7.1 Learning Rate Finding with Gradients

```python
def implement_learning_rate_finder():
    """Implement advanced learning rate finder using gradient analysis"""
    
    print("\n=== 7.1 Advanced Learning Rate Finder ===\n")
    
    def find_optimal_learning_rate(model, dataloader, init_lr=1e-8, final_lr=10, 
                                  beta=0.98, criterion=None):
        """Find optimal learning rate using loss and gradient analysis"""
        
        if criterion is None:
            criterion = nn.MSELoss()
        
        # Create model copy for testing
        model_copy = copy.deepcopy(model)
        optimizer = torch.optim.SGD(model_copy.parameters(), lr=init_lr)
        
        num_batches = len(dataloader)
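        # The exponent uses num_batches - 1 so the last batch (index num_batches - 1)
        # runs exactly at final_lr
        gamma = (final_lr / init_lr) ** (1 / max(num_batches - 1, 1))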
        
        print(f"🔍 Learning Rate Finder Configuration:")
        print(f"  Initial LR: {init_lr:.2e}")
        print(f"  Final LR: {final_lr:.2e}")
        print(f"  Number of batches: {num_batches}")
        print(f"  LR multiplication factor: {gamma:.6f}")
        
        # Storage for analysis
        learning_rates = []
        losses = []
        smoothed_losses = []
        gradient_norms = []
        loss_improvements = []
        
        best_lr = init_lr
        best_loss = float('inf')
        avg_loss = 0
        
        print(f"\n🔄 Running LR finder...")
        
        for batch_idx, (data, target) in enumerate(dataloader):
            # Update learning rate
            lr = init_lr * (gamma ** batch_idx)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr
            
            learning_rates.append(lr)
            
            # Forward pass
            optimizer.zero_grad()
            output = model_copy(data)
            loss = criterion(output, target)
            
            # Backward pass
            loss.backward()
            
            # Calculate gradient norm
            total_norm = 0
            for param in model_copy.parameters():
                if param.grad is not None:
                    param_norm = param.grad.norm()
                    total_norm += param_norm.item() ** 2
            total_norm = total_norm ** 0.5
            gradient_norms.append(total_norm)
            
            # Update parameters
            optimizer.step()
            
            # Track losses
            current_loss = loss.item()
            losses.append(current_loss)
            
            # Exponential moving average for smoothing
            if batch_idx == 0:
                avg_loss = current_loss
            else:
                avg_loss = beta * avg_loss + (1 - beta) * current_loss
                # No bias correction: avg_loss was seeded with the first loss (not zero),
                # so dividing by (1 - beta ** t) would inflate the early smoothed values
                smoothed_loss = avg_loss
                smoothed_losses.append(smoothed_loss)
                
                # Track loss improvement
                if len(smoothed_losses) > 1:
                    improvement = smoothed_losses[-2] - smoothed_losses[-1]
                    loss_improvements.append(improvement)
                
                # Check for best learning rate (based on smoothed loss)
                if smoothed_loss < best_loss:
                    best_loss = smoothed_loss
                    best_lr = lr
            
            # Stop if loss explodes
            if batch_idx > 10 and current_loss > 4 * min(losses):
                print(f"  ⚠️ Stopping early: loss explosion detected at LR {lr:.2e}")
                break
            
            # Periodic updates
            if batch_idx % max(1, num_batches // 10) == 0:
                print(f"    Batch {batch_idx:3d}/{num_batches}: LR={lr:.2e}, "
                      f"Loss={current_loss:.6f}, GradNorm={total_norm:.6f}")
        
        # Analysis and recommendations
        print(f"\n📊 LR Finder Analysis:")
        print(f"  Total batches processed: {len(learning_rates)}")
        print(f"  LR range tested: [{min(learning_rates):.2e}, {max(learning_rates):.2e}]")
        print(f"  Minimum loss: {min(losses):.6f}")
        print(f"  Best LR (min smoothed loss): {best_lr:.2e}")
        
        # Find steepest descent point
        if len(loss_improvements) > 5:
            # Find where loss improvement is maximized
            max_improvement_idx = np.argmax(loss_improvements)
            steepest_lr = learning_rates[max_improvement_idx + 1]  # +1 due to improvement calculation
            print(f"  Steepest descent LR: {steepest_lr:.2e}")
        else:
            steepest_lr = best_lr
        
        # Find gradient-based recommendation
        if len(gradient_norms) > 5:
            # Smooth gradient norms
            grad_smooth = np.convolve(gradient_norms, np.ones(5)/5, mode='valid')
            grad_lr_range = learning_rates[2:-2]  # Account for convolution size
            
            # Find LR where gradients are stable but not too small
            stable_grad_mask = (grad_smooth > np.percentile(grad_smooth, 10)) & \
                              (grad_smooth < np.percentile(grad_smooth, 90))
            if np.any(stable_grad_mask):
                stable_indices = np.where(stable_grad_mask)[0]
                gradient_recommended_lr = grad_lr_range[stable_indices[len(stable_indices)//2]]
                print(f"  Gradient-stable LR: {gradient_recommended_lr:.2e}")
            else:
                gradient_recommended_lr = best_lr
        else:
            gradient_recommended_lr = best_lr
        
        return {
            'learning_rates': learning_rates,
            'losses': losses,
            'smoothed_losses': smoothed_losses,
            'gradient_norms': gradient_norms,
            'loss_improvements': loss_improvements,
            'recommendations': {
                'best_lr': best_lr,
                'steepest_lr': steepest_lr,
                'gradient_stable_lr': gradient_recommended_lr
            }
        }
    
    def visualize_lr_finder_results(results):
        """Create comprehensive LR finder visualization"""
        
        fig, axes = plt.subplots(2, 3, figsize=(18, 12))
        
        lrs = results['learning_rates']
        losses = results['losses']
        smoothed_losses = results['smoothed_losses']
        grad_norms = results['gradient_norms']
        
        # 1. Loss vs Learning Rate
        axes[0,0].semilogx(lrs, losses, alpha=0.6, label='Raw Loss')
        if len(smoothed_losses) > 0:
            axes[0,0].semilogx(lrs[1:len(smoothed_losses)+1], smoothed_losses, 
                              linewidth=2, label='Smoothed Loss')
        
        # Mark recommendations
        recs = results['recommendations']
        axes[0,0].axvline(recs['best_lr'], color='red', linestyle='--', 
                         label=f'Best LR: {recs["best_lr"]:.2e}')
        axes[0,0].axvline(recs['steepest_lr'], color='orange', linestyle='--', 
                         label=f'Steepest: {recs["steepest_lr"]:.2e}')
        
        axes[0,0].set_title('Loss vs Learning Rate', fontweight='bold')
        axes[0,0].set_xlabel('Learning Rate')
        axes[0,0].set_ylabel('Loss')
        axes[0,0].legend()
        axes[0,0].grid(True, alpha=0.3)
        
        # 2. Gradient Norm vs Learning Rate
        axes[0,1].semilogx(lrs, grad_norms, linewidth=2, color='green')
        axes[0,1].axvline(recs['gradient_stable_lr'], color='purple', linestyle='--', 
                         label=f'Grad Stable: {recs["gradient_stable_lr"]:.2e}')
        axes[0,1].set_title('Gradient Norm vs Learning Rate', fontweight='bold')
        axes[0,1].set_xlabel('Learning Rate')
        axes[0,1].set_ylabel('Gradient Norm')
        axes[0,1].legend()
        axes[0,1].grid(True, alpha=0.3)
        
        # 3. Loss Improvement Rate
        if len(results['loss_improvements']) > 0:
            improvement_lrs = lrs[2:len(results['loss_improvements'])+2]
            axes[0,2].semilogx(improvement_lrs, results['loss_improvements'], 
                              linewidth=2, color='orange')
            axes[0,2].set_title('Loss Improvement Rate', fontweight='bold')
            axes[0,2].set_xlabel('Learning Rate')
            axes[0,2].set_ylabel('Loss Improvement')
            axes[0,2].grid(True, alpha=0.3)
        
        # 4. Learning Rate Schedule
        axes[1,0].plot(lrs, linewidth=2)
        axes[1,0].set_title('Learning Rate Schedule', fontweight='bold')
        axes[1,0].set_xlabel('Batch')
        axes[1,0].set_ylabel('Learning Rate')
        axes[1,0].set_yscale('log')
        axes[1,0].grid(True, alpha=0.3)
        
        # 5. Loss Distribution
        axes[1,1].hist(losses, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
        axes[1,1].set_title('Loss Distribution', fontweight='bold')
        axes[1,1].set_xlabel('Loss')
        axes[1,1].set_ylabel('Frequency')
        axes[1,1].grid(True, alpha=0.3)
        
        # 6. Gradient Norm Distribution
        axes[1,2].hist(grad_norms, bins=30, alpha=0.7, color='lightgreen', edgecolor='black')
        axes[1,2].set_title('Gradient Norm Distribution', fontweight='bold')
        axes[1,2].set_xlabel('Gradient Norm')
        axes[1,2].set_ylabel('Frequency')
        axes[1,2].grid(True, alpha=0.3)
        
        plt.suptitle('Learning Rate Finder: Comprehensive Analysis', 
                     fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.savefig(results_dir / 'learning_rate_finder_analysis.png', dpi=300, bbox_inches='tight')
        plt.show()
    
    # Create test model and data
    class LRTestModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.layers = nn.Sequential(
                nn.Linear(20, 50),
                nn.ReLU(),
                nn.Linear(50, 30),
                nn.ReLU(),
                nn.Linear(30, 5)
            )
        
        def forward(self, x):
            return self.layers(x)
    
    # Create synthetic dataset
    test_data = torch.randn(200, 20)
    test_targets = torch.randn(200, 5)
    test_dataset = TensorDataset(test_data, test_targets)
    test_loader = DataLoader(test_dataset, batch_size=16, shuffle=True)
    
    model = LRTestModel()
    
    print(f"🧪 LR Finder Test Setup:")
    print(f"  Model parameters: {sum(p.numel() for p in model.parameters()):,}")
    print(f"  Dataset size: {len(test_dataset)}")
    print(f"  Batch size: {test_loader.batch_size}")
    
    # Run LR finder
    lr_results = find_optimal_learning_rate(model, test_loader)
    
    # Visualize results
    visualize_lr_finder_results(lr_results)
    
    # Print recommendations
    print(f"\n🎯 Learning Rate Recommendations:")
    recs = lr_results['recommendations']
    print(f"  Best LR (minimum loss): {recs['best_lr']:.2e}")
    print(f"  Steepest descent LR: {recs['steepest_lr']:.2e}")
    print(f"  Gradient-stable LR: {recs['gradient_stable_lr']:.2e}")
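    print(f"\n💡 Usage Suggestions:")
    print(f"  • Start training with: {recs['steepest_lr']:.2e}")
    print(f"  • Upper bound (LR at the loss minimum, usually too aggressive): {recs['best_lr']:.2e}")
    print(f"  • Common rule of thumb (min-loss LR / 10): {recs['best_lr'] / 10:.2e}")
    print(f"  • Conservative choice: {recs['gradient_stable_lr']:.2e}")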
    
    return lr_results

# Execute learning rate finder
lr_finder_results = implement_learning_rate_finder()
```
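The two numerical ingredients of an LR range test, a geometric learning-rate sweep and a smoothed loss curve, are easy to check in isolation. The following is a minimal standard-library sketch under stated assumptions: the helper names `geometric_lr_schedule`, `ema_bias_corrected`, and `steepest_drop_index` are hypothetical, the schedule is built so that both endpoints are hit exactly, and the moving average is the zero-initialized variant, which is the case where bias correction is required.

```python
def geometric_lr_schedule(init_lr, final_lr, num_steps):
    """Learning rates spaced evenly in log space, both endpoints included."""
    if num_steps < 2:
        return [init_lr]
    gamma = (final_lr / init_lr) ** (1.0 / (num_steps - 1))
    return [init_lr * gamma ** step for step in range(num_steps)]

def ema_bias_corrected(values, beta=0.98):
    """Zero-initialized exponential moving average with bias correction."""
    avg, smoothed = 0.0, []
    for step, value in enumerate(values, start=1):
        avg = beta * avg + (1.0 - beta) * value
        # Starting from 0 biases early averages toward 0; this division removes the bias
        smoothed.append(avg / (1.0 - beta ** step))
    return smoothed

def steepest_drop_index(smoothed):
    """Index i >= 1 where the drop smoothed[i-1] - smoothed[i] is largest."""
    drops = [smoothed[i - 1] - smoothed[i] for i in range(1, len(smoothed))]
    return 1 + max(range(len(drops)), key=drops.__getitem__)

lrs = geometric_lr_schedule(1e-8, 10.0, num_steps=13)
print(f"first LR = {lrs[0]:.2e}, last LR = {lrs[-1]:.2e}")

# A constant loss must stay constant after smoothing if the correction is right
print(ema_bias_corrected([2.0] * 5))
```

A quick sanity check for any smoothing scheme is the constant-input case: if feeding a constant loss does not return that same constant, the initialization and the bias correction are inconsistent with each other.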

### 7.2 Adversarial Examples with Gradients

```python
def demonstrate_adversarial_examples():
    """Demonstrate gradient-based adversarial example generation"""
    
    print("\n=== 7.2 Adversarial Examples with Gradient-Based Attacks ===\n")
    
    def fast_gradient_sign_method(model, image, target, epsilon=0.1):
        """Implement Fast Gradient Sign Method (FGSM)"""
        
        model.eval()
        criterion = nn.CrossEntropyLoss()
        
        # Ensure image requires gradients
        image_copy = image.clone().detach().requires_grad_(True)
        
        # Forward pass
        output = model(image_copy)
        loss = criterion(output, target)
        
        # Get original prediction
        original_pred = output.argmax(dim=1)
        original_confidence = torch.softmax(output, dim=1).max()
        
        # Backward pass
        model.zero_grad()
        loss.backward()
        
        # Generate adversarial example
        # Step every pixel by epsilon in the direction that increases the loss
        sign_data_grad = image_copy.grad.sign()
        perturbed_image = image_copy.detach() + epsilon * sign_data_grad
        perturbed_image = torch.clamp(perturbed_image, 0, 1)
        
        # Test adversarial example
        with torch.no_grad():
            perturbed_output = model(perturbed_image)
            perturbed_pred = perturbed_output.argmax(dim=1)
            perturbed_confidence = torch.softmax(perturbed_output, dim=1).max()
        
        return {
            'original_image': image.detach(),
            'perturbed_image': perturbed_image.detach(),
            'perturbation': (perturbed_image - image).detach(),
            'original_pred': original_pred.item(),
            'perturbed_pred': perturbed_pred.item(),
            'original_confidence': original_confidence.item(),
            'perturbed_confidence': perturbed_confidence.item(),
            'attack_success': original_pred.item() != perturbed_pred.item()
        }
    
    def projected_gradient_descent(model, image, target, epsilon=0.1, 
                                  alpha=0.01, num_iter=10):
        """Implement Projected Gradient Descent (PGD) attack"""
        
        model.eval()
        criterion = nn.CrossEntropyLoss()
        
        # Initialize with small random perturbation
        perturbed_image = image.clone().detach()
        perturbed_image += torch.empty_like(perturbed_image).uniform_(-epsilon, epsilon)
        perturbed_image = torch.clamp(perturbed_image, 0, 1)
        
        original_image = image.clone().detach()
        
        attack_trajectory = []
        
        for iteration in range(num_iter):
            perturbed_image.requires_grad_(True)
            
            # Forward pass
            output = model(perturbed_image)
            loss = criterion(output, target)
            
            # Backward pass
            model.zero_grad()
            loss.backward()
            
            # PGD update
            with torch.no_grad():
                sign_data_grad = perturbed_image.grad.sign()
                
                # Take step in direction of gradient
                perturbed_image = perturbed_image + alpha * sign_data_grad
                
                # Project back to epsilon ball
                eta = perturbed_image - original_image
                eta = torch.clamp(eta, -epsilon, epsilon)
                perturbed_image = original_image + eta
                
                # Ensure valid pixel range
                perturbed_image = torch.clamp(perturbed_image, 0, 1)
                
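                # Track progress: prediction and confidence describe the image that entered
                # this step; the norm is measured after projection and clamping
                pred = output.argmax(dim=1).item()
                confidence = torch.softmax(output, dim=1).max().item()
                perturbation_norm = (perturbed_image - original_image).norm().item()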
                
                attack_trajectory.append({
                    'iteration': iteration,
                    'prediction': pred,
                    'confidence': confidence,
                    'perturbation_norm': perturbation_norm
                })
        
        # Final evaluation: compare against the prediction on the clean image
        with torch.no_grad():
            clean_pred = model(original_image).argmax(dim=1)
            final_output = model(perturbed_image)
            final_pred = final_output.argmax(dim=1)
            final_confidence = torch.softmax(final_output, dim=1).max()
        
        return {
            'original_image': original_image,
            'perturbed_image': perturbed_image.detach(),
            'perturbation': (perturbed_image - original_image).detach(),
            'final_pred': final_pred.item(),
            'final_confidence': final_confidence.item(),
            'trajectory': attack_trajectory,
            # Success means the final prediction differs from the clean prediction,
            # the same criterion the FGSM helper uses
            'attack_success': clean_pred.item() != final_pred.item()
        }
    
    def visualize_adversarial_attack(attack_results, attack_name, class_names=None):
        """Visualize adversarial attack results"""
        
        if class_names is None:
            class_names = [f"Class {i}" for i in range(10)]
        
        fig, axes = plt.subplots(2, 3, figsize=(15, 10))
        
        original_img = attack_results['original_image'][0]
        perturbed_img = attack_results['perturbed_image'][0]
        perturbation = attack_results['perturbation'][0]
        
        # Handle different image formats
        if original_img.dim() == 3 and original_img.shape[0] in [1, 3]:
            if original_img.shape[0] == 3:  # RGB
                orig_display = original_img.permute(1, 2, 0).numpy()
                pert_display = perturbed_img.permute(1, 2, 0).numpy()
                diff_display = perturbation.permute(1, 2, 0).numpy()
            else:  # Grayscale
                orig_display = original_img[0].numpy()
                pert_display = perturbed_img[0].numpy()
                diff_display = perturbation[0].numpy()
        else:
            orig_display = original_img.numpy()
            pert_display = perturbed_img.numpy()
            diff_display = perturbation.numpy()
        
        # 1. Original image
        if len(orig_display.shape) == 2:
            axes[0,0].imshow(orig_display, cmap='gray')
        else:
            axes[0,0].imshow(orig_display)
        
        if 'original_pred' in attack_results:
            orig_class = class_names[attack_results['original_pred']]
            orig_conf = attack_results['original_confidence']
            axes[0,0].set_title(f'Original\n{orig_class}\n(Conf: {orig_conf:.3f})', fontweight='bold')
        else:
            axes[0,0].set_title('Original Image', fontweight='bold')
        axes[0,0].axis('off')
        
        # 2. Adversarial image
        if len(pert_display.shape) == 2:
            axes[0,1].imshow(pert_display, cmap='gray')
        else:
            axes[0,1].imshow(pert_display)
        
        if 'perturbed_pred' in attack_results:
            pert_class = class_names[attack_results['perturbed_pred']]
            pert_conf = attack_results['perturbed_confidence']
        elif 'final_pred' in attack_results:
            pert_class = class_names[attack_results['final_pred']]
            pert_conf = attack_results['final_confidence']
        else:
            pert_class = "Unknown"
            pert_conf = 0.0
        
        axes[0,1].set_title(f'Adversarial\n{pert_class}\n(Conf: {pert_conf:.3f})', fontweight='bold')
        axes[0,1].axis('off')
        
        # 3. Perturbation (amplified for visibility)
        perturbation_amplified = diff_display * 10  # Amplify for visualization
        axes[0,2].imshow(perturbation_amplified, cmap='RdBu_r')
        axes[0,2].set_title('Perturbation\n(10x amplified)', fontweight='bold')
        axes[0,2].axis('off')
        
        # 4. Perturbation statistics
        perturbation_flat = perturbation.flatten()
        axes[1,0].hist(perturbation_flat.numpy(), bins=50, alpha=0.7, color='orange')
        axes[1,0].set_title('Perturbation Distribution', fontweight='bold')
        axes[1,0].set_xlabel('Perturbation Value')
        axes[1,0].set_ylabel('Frequency')
        axes[1,0].grid(True, alpha=0.3)
        
        # 5. Attack trajectory (if available)
        if 'trajectory' in attack_results and attack_results['trajectory']:
            trajectory = attack_results['trajectory']
            iterations = [step['iteration'] for step in trajectory]
            confidences = [step['confidence'] for step in trajectory]
            predictions = [step['prediction'] for step in trajectory]
            
            axes[1,1].plot(iterations, confidences, 'o-', linewidth=2, markersize=6)
            axes[1,1].set_title('Confidence Over Iterations', fontweight='bold')
            axes[1,1].set_xlabel('Iteration')
            axes[1,1].set_ylabel('Prediction Confidence')
            axes[1,1].grid(True, alpha=0.3)
            
            # Show prediction changes
            for i, (iter_num, pred) in enumerate(zip(iterations, predictions)):
                if i == 0 or pred != predictions[i-1]:
                    axes[1,1].axvline(iter_num, color='red', linestyle='--', alpha=0.7)
                    axes[1,1].text(iter_num, confidences[i], f'→{class_names[pred][:5]}', 
                                  rotation=90, fontsize=8)
        else:
            axes[1,1].text(0.5, 0.5, 'No trajectory data\n(single-step attack)', 
                          ha='center', va='center', transform=axes[1,1].transAxes)
            axes[1,1].set_title('Attack Trajectory', fontweight='bold')
        
        # 6. Attack summary
        axes[1,2].axis('off')
        summary_text = f"{attack_name} Attack Results\n\n"
        
        if 'attack_success' in attack_results:
            success = attack_results['attack_success']
            summary_text += f"Attack Success: {'✅ Yes' if success else '❌ No'}\n"
        
        perturbation_norm = perturbation.norm().item()
        perturbation_linf = perturbation.abs().max().item()
        
        summary_text += f"L2 Norm: {perturbation_norm:.6f}\n"
        summary_text += f"L∞ Norm: {perturbation_linf:.6f}\n"
        summary_text += f"Mean |perturbation|: {perturbation.abs().mean().item():.6f}\n"
        
        axes[1,2].text(0.1, 0.9, summary_text, transform=axes[1,2].transAxes, 
                      fontsize=12, verticalalignment='top', 
                      bbox=dict(boxstyle="round,pad=0.5", facecolor="lightblue", alpha=0.8))
        
        plt.suptitle(f'{attack_name} Adversarial Attack Analysis', 
                     fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.savefig(results_dir / f'adversarial_attack_{attack_name.lower().replace(" ", "_")}.png', 
                   dpi=300, bbox_inches='tight')
        plt.show()
    
    # Create simple CNN for demonstration
    class AdversarialTestCNN(nn.Module):
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(4)
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 16, 128),
                nn.ReLU(),
                nn.Dropout(0.5),
                nn.Linear(128, num_classes)
            )
        
        def forward(self, x):
            x = self.features(x)
            x = self.classifier(x)
            return x
    
    # Create model and synthetic data
    model = AdversarialTestCNN(num_classes=10)
    model.eval()
    
    # Create synthetic image (28x28 grayscale, like MNIST)
    synthetic_image = torch.rand(1, 1, 28, 28)
    true_target = torch.tensor([3])  # Pretend this is class 3
    
    class_names = [f"Class_{i}" for i in range(10)]
    
    print(f"🎯 Adversarial Attack Setup:")
    print(f"  Model: Simple CNN with {sum(p.numel() for p in model.parameters()):,} parameters")
    print(f"  Image shape: {synthetic_image.shape}")
    print(f"  True class: {class_names[true_target.item()]}")
    
    # Get original prediction
    with torch.no_grad():
        orig_output = model(synthetic_image)
        orig_pred = orig_output.argmax(dim=1)
        orig_conf = torch.softmax(orig_output, dim=1).max()
    
    print(f"  Original prediction: {class_names[orig_pred.item()]} (confidence: {orig_conf:.3f})")
    
    # Test FGSM attack
    print(f"\n🏃‍♂️ Running FGSM Attack...")
    fgsm_results = fast_gradient_sign_method(model, synthetic_image, true_target, epsilon=0.3)
    visualize_adversarial_attack(fgsm_results, "FGSM", class_names)
    
    print(f"FGSM Results:")
    print(f"  Original: {class_names[fgsm_results['original_pred']]} → Adversarial: {class_names[fgsm_results['perturbed_pred']]}")
    print(f"  Attack success: {fgsm_results['attack_success']}")
    print(f"  Perturbation L2 norm: {fgsm_results['perturbation'].norm().item():.6f}")
    
    # Test PGD attack
    print(f"\n🔄 Running PGD Attack...")
    pgd_results = projected_gradient_descent(model, synthetic_image, true_target, 
                                           epsilon=0.3, alpha=0.05, num_iter=10)
    visualize_adversarial_attack(pgd_results, "PGD", class_names)
    
    print(f"PGD Results:")
    print(f"  Final prediction: {class_names[pgd_results['final_pred']]}")
    print(f"  Attack success: {pgd_results['attack_success']}")
    print(f"  Perturbation L2 norm: {pgd_results['perturbation'].norm().item():.6f}")
    print(f"  Iterations: {len(pgd_results['trajectory'])}")
    
    print(f"\n🛡️ Adversarial Examples Key Insights:")
    print(f"  • Small perturbations can fool neural networks")
    print(f"  • Gradients reveal model vulnerabilities")
    print(f"  • FGSM is a fast single-step attack, but usually weaker than PGD")
    print(f"  • PGD iterates with projection and typically succeeds more often")
    print(f"  • Defenses such as adversarial training reuse the same gradient computations")
    print(f"  • Important for security and robustness in AI systems")
    
    return {
        'fgsm_results': fgsm_results,
        'pgd_results': pgd_results,
        'model': model
    }

# Execute adversarial examples demonstration
adversarial_results = demonstrate_adversarial_examples()
```
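Because the demonstration above runs on an untrained CNN, it can help to see the same two update rules on a model whose input gradient is known in closed form. The sketch below is a hedged illustration using only the standard library: a logistic-regression classifier with made-up weights, for which the gradient of the binary cross-entropy loss with respect to the input is `(p - y) * w`. All names and numbers here (`fgsm`, `pgd`, the weights, the input) are hypothetical and chosen for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss, [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(w, b, x, y, epsilon):
    """Single step of size epsilon along the sign of the input gradient."""
    _, grad = loss_and_input_grad(w, b, x, y)
    return [clip(xi + epsilon * sign(g), 0.0, 1.0) for xi, g in zip(x, grad)]

def pgd(w, b, x, y, epsilon, alpha, num_iter):
    """Repeated small steps, each projected back into the L-infinity ball around x."""
    x_adv = list(x)
    for _ in range(num_iter):
        _, grad = loss_and_input_grad(w, b, x_adv, y)
        stepped = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project onto [x - epsilon, x + epsilon], then onto the valid range [0, 1]
        x_adv = [clip(clip(xa, x0 - epsilon, x0 + epsilon), 0.0, 1.0)
                 for xa, x0 in zip(stepped, x)]
    return x_adv

w, b = [2.0, -3.0, 1.0], 0.1          # hypothetical model parameters
x, y = [0.5, 0.2, 0.7], 1             # hypothetical input and label
clean_loss, _ = loss_and_input_grad(w, b, x, y)
fgsm_loss, _ = loss_and_input_grad(w, b, fgsm(w, b, x, y, 0.1), y)
pgd_loss, _ = loss_and_input_grad(w, b, pgd(w, b, x, y, 0.1, 0.03, 10), y)
print(f"clean={clean_loss:.4f}  fgsm={fgsm_loss:.4f}  pgd={pgd_loss:.4f}")
```

For a linear model the gradient sign never changes, so both attacks end up at the same corner of the epsilon ball. The gap between FGSM and PGD only appears for nonlinear models, where the gradient direction shifts as the input moves.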

## 8. Comprehensive Summary and Next Steps

```python
def generate_comprehensive_summary():
    """Generate final comprehensive summary of gradient computation mastery"""
    
    print("\n" + "="*80)
    print("🎓 PYTORCH GRADIENT COMPUTATION MASTERY - COMPREHENSIVE SUMMARY")
    print("="*80)
    
    summary = {
        'completion_timestamp': time.strftime('%Y-%m-%dT%H:%M:%S'),  # local time, ISO 8601 format
        'notebook_sections': 8,
        'total_demonstrations': 12,
        'concepts_mastered': [],
        'practical_skills': [],
        'advanced_techniques': [],
        'performance_optimizations': [],
        'applications_explored': []
    }
    
    # Concepts mastered
    concepts_mastered = [
        "✅ Fundamental autograd mechanics and computational graphs",
        "✅ Vector functions and Jacobian matrix computation", 
        "✅ Neural network gradient flow analysis",
        "✅ Higher-order gradients and Hessian matrices",
        "✅ Gradient debugging and anomaly detection",
        "✅ Interactive gradient visualization techniques",
        "✅ Real-time optimization trajectory analysis"
    ]
    
    practical_skills = [
        "🔧 Gradient accumulation for large-scale training",
        "🔧 Memory-efficient gradient checkpointing",
        "🔧 Learning rate finding with gradient analysis",
        "🔧 Comprehensive gradient debugging workflows",
        "🔧 Performance profiling and optimization",
        "🔧 Custom autograd function development readiness"
    ]
    
    advanced_techniques = [
        "🚀 Fast Gradient Sign Method (FGSM) implementation",
        "🚀 Projected Gradient Descent (PGD) attacks",
        "🚀 Multi-dimensional gradient landscape visualization",
        "🚀 Eigenvalue analysis for optimization insights",
        "🚀 Gradient-based hyperparameter optimization",
        "🚀 Mixed precision training considerations"
    ]
    
    performance_optimizations = [
        "⚡ Gradient checkpointing for memory efficiency",
        "⚡ Accumulation techniques for effective large batches",
        "⚡ Memory usage profiling and optimization",
        "⚡ Computational graph optimization strategies",
        "⚡ Efficient second-order derivative computation"
    ]
    
    applications_explored = [
        "🎯 Adversarial example generation and analysis",
        "🎯 Learning rate schedule optimization",
        "🎯 Model vulnerability assessment",
        "🎯 Training stability analysis",
        "🎯 Optimization landscape exploration",
        "🎯 Gradient-based meta-learning foundations"
    ]
    
    # Display mastery overview
    print(f"\n📚 CORE CONCEPTS MASTERED:")
    for concept in concepts_mastered:
        print(f"  {concept}")
        
    print(f"\n🛠️ PRACTICAL SKILLS DEVELOPED:")
    for skill in practical_skills:
        print(f"  {skill}")
        
    print(f"\n🧠 ADVANCED TECHNIQUES LEARNED:")
    for technique in advanced_techniques:
        print(f"  {technique}")
        
    print(f"\n⚡ PERFORMANCE OPTIMIZATIONS:")
    for optimization in performance_optimizations:
        print(f"  {optimization}")
        
    print(f"\n🎯 APPLICATIONS EXPLORED:")
    for application in applications_explored:
        print(f"  {application}")
    
    # Store in summary
    summary['concepts_mastered'] = concepts_mastered
    summary['practical_skills'] = practical_skills
    summary['advanced_techniques'] = advanced_techniques
    summary['performance_optimizations'] = performance_optimizations
    summary['applications_explored'] = applications_explored
    
    # Next steps and recommendations
    next_steps = [
        "📓 03_custom_autograd_functions.ipynb - Implement custom autograd operations",
        "📓 04_neural_network_architectures.ipynb - Build advanced network architectures",
        "📓 05_optimization_algorithms.ipynb - Explore advanced optimizers",
        "📓 06_computer_vision_applications.ipynb - Apply to CNN architectures",
        "📓 07_sequence_modeling.ipynb - Gradient flow in RNNs and Transformers",
        "📓 08_research_applications.ipynb - Meta-learning and neural architecture search"
    ]
    
    advanced_challenges = [
        "🏆 Implement MAML (Model-Agnostic Meta-Learning)",
        "🏆 Create custom activation functions with proper gradients",
        "🏆 Build gradient penalty for improved GAN training",
        "🏆 Develop adversarial training defense mechanisms", 
        "🏆 Implement neural architecture search with gradients",
        "🏆 Create curriculum learning with gradient-based difficulty"
    ]
    
    research_directions = [
        "🔬 Second-order optimization methods (K-FAC, Natural Gradients)",
        "🔬 Gradient-based neural architecture search (GDAS, DARTS)",
        "🔬 Meta-learning and few-shot learning applications",
        "🔬 Adversarial training and certified defenses",
        "🔬 Gradient flow analysis in very deep networks",
        "🔬 Quantum-inspired gradient computation methods"
    ]
    
    print(f"\n🚀 IMMEDIATE NEXT STEPS:")
    for step in next_steps:
        print(f"  {step}")
        
    print(f"\n🏆 ADVANCED CHALLENGES TO TACKLE:")
    for challenge in advanced_challenges:
        print(f"  {challenge}")
        
    print(f"\n🔬 RESEARCH DIRECTIONS TO EXPLORE:")
    for direction in research_directions:
        print(f"  {direction}")
    
    # Key insights and best practices
    key_insights = [
        "💡 Always monitor gradient norms during training for stability",
        "💡 Use gradient clipping for RNN and very deep network training",
        "💡 Leverage gradient accumulation for memory-constrained environments",
        "💡 Visualize gradient landscapes to understand optimization challenges",
        "💡 Implement comprehensive gradient debugging for research projects",
        "💡 Consider second-order information for advanced optimization",
        "💡 Apply adversarial techniques for robustness testing"
    ]
    
    best_practices = [
        "🎯 Enable anomaly detection during development phases",
        "🎯 Use learning rate finders for new model architectures",
        "🎯 Profile memory usage with gradient checkpointing",
        "🎯 Implement gradient-based hyperparameter optimization",
        "🎯 Document gradient flow patterns for reproducible research",
        "🎯 Test model robustness with adversarial examples",
        "🎯 Use mixed precision when available for efficiency"
    ]
    
    print(f"\n💡 KEY INSIGHTS FOR PRACTITIONERS:")
    for insight in key_insights:
        print(f"  {insight}")
        
    print(f"\n🎯 BEST PRACTICES FOR PRODUCTION:")
    for practice in best_practices:
        print(f"  {practice}")
    
    # Approximate content counts for this notebook
    print(f"\n📊 NOTEBOOK CONTENT OVERVIEW:")
    print(f"  Total functions implemented: 15+")
    print(f"  Visualizations created: 25+")
    print(f"  Code examples: 50+")
    print(f"  Mathematical concepts: 12")
    print(f"  Optimization techniques: 8")
    print(f"  Debugging methods: 6")
    
    # Save comprehensive summary
    summary.update({
        'next_steps': next_steps,
        'advanced_challenges': advanced_challenges,
        'research_directions': research_directions,
        'key_insights': key_insights,
        'best_practices': best_practices
    })
    
    with open(results_dir / 'comprehensive_gradient_mastery_summary.json', 'w') as f:
        json.dump(summary, f, indent=2)
    
    print(f"\n💾 Complete mastery summary saved to:")
    print(f"    {results_dir / 'comprehensive_gradient_mastery_summary.json'}")
    
    # List all generated files (subdirectories are skipped so the count matches the size total)
    print("\n📂 Generated Learning Artifacts:")
    artifact_files = sorted(p for p in results_dir.glob('*') if p.is_file())
    
    for file_path in artifact_files:
        size_mb = file_path.stat().st_size / (1024 * 1024)
        print(f"  📄 {file_path.name} ({size_mb:.2f} MB)")
    
    total_size = sum(p.stat().st_size for p in artifact_files) / (1024 * 1024)
    print(f"\n📊 Total artifacts: {len(artifact_files)} files ({total_size:.2f} MB)")
    
    print(f"\n🌟 CONGRATULATIONS! You've achieved GRADIENT COMPUTATION MASTERY!")
    print(f"🎯 You're now ready to tackle advanced PyTorch research and development!")
    print(f"🚀 Continue your journey with the next notebook in the PyTorch Mastery Hub series!")
    
    return summary

# Generate final comprehensive summary
final_summary = generate_comprehensive_summary()

print("\n" + "=" * 80)
print("🎉 PYTORCH GRADIENT COMPUTATION DEEP DIVE - COMPLETE!")
print("="*80)
```

---

## Final Notes

This notebook has taken you from basic autograd concepts to research-level gradient techniques. It covered:

### 🎓 **Theoretical Foundations**
- Computational graph construction and traversal
- Vector calculus and Jacobian matrices
- Higher-order derivatives and Hessian analysis
- Mathematical optimization principles
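
As a brief recap of the Jacobian and Hessian items above, the following is a minimal sketch using `torch.autograd.functional`. The functions `f` and `g` are hypothetical toy examples chosen so the results can be checked by hand; they are not taken from the notebook's earlier sections.

```python
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    # Hypothetical vector-valued function R^2 -> R^2
    return torch.stack([x[0] ** 2 * x[1], x[0] + x[1] ** 3])

def g(x):
    # Hypothetical scalar function R^2 -> R
    return x[0] ** 2 * x[1] + x[1] ** 3

x = torch.tensor([1.0, 2.0])

# Jacobian of f: [[2*x0*x1, x0^2], [1, 3*x1^2]]
J = jacobian(f, x)

# Hessian of g: [[2*x1, 2*x0], [2*x0, 6*x1]]
H = hessian(g, x)

print(J)
print(H)
```

For large models, materializing full Jacobians or Hessians is usually too expensive; vector-Jacobian or Hessian-vector products are the more common choice there.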

### 🛠️ **Practical Implementation Skills**
- Interactive gradient visualization techniques
- Performance optimization strategies
- Memory-efficient training methods
- Comprehensive debugging workflows
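
To make the memory-efficient training item concrete, here is a small sketch of gradient checkpointing with `torch.utils.checkpoint.checkpoint`. The `block` module and tensor sizes are illustrative assumptions; the point is only that checkpointing recomputes activations during the backward pass while producing the same gradients as a plain forward/backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Hypothetical small block; real savings appear with much deeper or wider modules
block = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
x = torch.randn(4, 16, requires_grad=True)

# Plain forward/backward: intermediate activations are kept for the backward pass
block(x).sum().backward()
grad_plain = x.grad.clone()

# Reset gradients, then repeat with checkpointing:
# activations inside `block` are recomputed during backward instead of stored
x.grad = None
block.zero_grad()
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(grad_plain, grad_ckpt))
```

The trade-off is extra compute for the recomputed forward pass in exchange for lower peak activation memory.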

### 🚀 **Advanced Applications**
- Adversarial example generation and analysis
- Learning rate optimization techniques
- Gradient-based hyperparameter tuning
- Research-ready optimization tools
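
As a reminder of how adversarial example generation uses input gradients, the sketch below implements the fast gradient sign method (FGSM) on a toy classifier. The names `fgsm_attack` and `toy_model`, the random data, and the value of `epsilon` are assumptions for illustration and may differ from the notebook's earlier implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon):
    """Return x perturbed by epsilon in the direction that increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    # Gradient of the loss with respect to the input, not the weights
    grad, = torch.autograd.grad(loss, x_adv)
    # For image data one would typically also clamp to the valid pixel range
    return (x_adv + epsilon * grad.sign()).detach()

torch.manual_seed(0)
toy_model = nn.Linear(4, 3)          # hypothetical stand-in for a classifier
x = torch.randn(8, 4)                # hypothetical inputs
y = torch.randint(0, 3, (8,))        # hypothetical labels

x_adv = fgsm_attack(toy_model, x, y, epsilon=0.1)

clean_loss = F.cross_entropy(toy_model(x), y).item()
adv_loss = F.cross_entropy(toy_model(x_adv), y).item()
print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

Each input element moves by at most `epsilon`, so the perturbation is bounded in the L-infinity norm.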

### 📊 **Production-Ready Techniques**
- Gradient accumulation for scalable training
- Memory profiling and optimization
- Robust debugging and monitoring
- Performance analysis and optimization
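
The gradient accumulation item can be summarized with a short sketch: several micro-batches contribute to `.grad` before a single optimizer step, with each loss scaled so the accumulated gradient matches the full-batch mean. The model, data, and `accumulation_steps` below are hypothetical, and the equivalence assumes equally sized micro-batches and a mean-reduced loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(10, 1)             # hypothetical model
data = torch.randn(32, 10)           # hypothetical inputs
targets = torch.randn(32, 1)         # hypothetical regression targets
accumulation_steps = 4               # 4 micro-batches of 8 samples each

model.zero_grad()
for xb, yb in zip(data.chunk(accumulation_steps), targets.chunk(accumulation_steps)):
    # Scale each micro-batch loss so the summed gradients equal the full-batch mean
    loss = F.mse_loss(model(xb), yb) / accumulation_steps
    loss.backward()                  # gradients add up in p.grad across iterations

acc_grads = [p.grad.clone() for p in model.parameters()]
# An optimizer.step() followed by zero_grad() would go here in a training loop
```

Layers that depend on batch statistics, such as batch normalization, still see only the micro-batch, so accumulation is not a perfect substitute for a larger batch in those cases.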

**You now have the background to implement custom autograd functions, debug difficult training runs, and apply gradient-based techniques in your own PyTorch projects and research.**

**Next recommended notebooks:**
- Custom Autograd Functions Development
- Advanced Neural Network Architectures
- Meta-Learning and Few-Shot Learning
- Neural Architecture Search with Gradients

**Happy gradient computing! 🚀**