# 📘 PYTORCH PHASE 1 - FILE 2: BACKPROPAGATION & GRADIENT FLOW

**Core Concepts:** Gradient Issues, Initialization, Gradient Flow Analysis

**Objectives:**
- ✅ Understand the backpropagation mechanism
- ✅ Identify gradient issues (vanishing/exploding)
- ✅ Master initialization strategies
- ✅ Analyze gradient flow per layer
- ✅ Practical gradient monitoring

**Duration:** 2-3 weeks

---

## 📚 Table of Contents

### 1. BACKPROPAGATION BASICS
- 1.1 Chain Rule Review
- 1.2 Computational Graph

### 2. GRADIENT ISSUES
- 2.1 Vanishing Gradients
- 2.2 Exploding Gradients
- 2.3 Gradient Clipping

### 3. INITIALIZATION STRATEGIES
- 3.1 Why Initialization Matters
- 3.2 Zero Initialization (Why It Fails)
- 3.3 Xavier/Glorot Initialization
- 3.4 He Initialization

### 4. GRADIENT FLOW ANALYSIS
- 4.1 Monitor Gradient Norms Per Layer
- 4.2 Compare Different Network Architectures

---

In [None]:
# Import libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

print(f"✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

---

# 1. BACKPROPAGATION BASICS

## 1.1 Chain Rule Review

### Fundamental Rule

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial w}$$

### Deep Network Example

$$y = f_3(f_2(f_1(x)))$$

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial f_3}{\partial f_2} \cdot \frac{\partial f_2}{\partial f_1} \cdot \frac{\partial f_1}{\partial x}$$

### Key Insight

Gradient = **Product** of local gradients

→ If many local gradients are smaller than 1 in magnitude (or any single one is near zero) → Overall gradient vanishes
→ If many local gradients are larger than 1 in magnitude → Overall gradient explodes

In [None]:
# Simple backprop example

# Forward: y = w3 * (w2 * (w1 * x))
x = torch.tensor(2.0, requires_grad=True)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(0.5, requires_grad=True)
w3 = torch.tensor(0.5, requires_grad=True)

# Forward pass
z1 = w1 * x
z2 = w2 * z1
y = w3 * z2

print("Forward pass:")
print(f"  z1 = w1 * x = {z1.item():.4f}")
print(f"  z2 = w2 * z1 = {z2.item():.4f}")
print(f"  y = w3 * z2 = {y.item():.4f}")

# Backward pass
y.backward()

print("\nBackward pass (gradients):")
print(f"  dy/dw3 = {w3.grad.item():.4f}")
print(f"  dy/dw2 = {w2.grad.item():.4f}")
print(f"  dy/dw1 = {w1.grad.item():.4f}")
print(f"  dy/dx = {x.grad.item():.4f}")

# Manual calculation
print("\nManual verification:")
print(f"  dy/dw3 = z2 = {z2.item():.4f} ✓")
print(f"  dy/dw2 = w3 * z1 = {(w3 * z1).item():.4f} ✓")
print(f"  dy/dw1 = w3 * w2 * x = {(w3 * w2 * x).item():.4f} ✓")
print(f"  dy/dx = w3 * w2 * w1 = {(w3 * w2 * w1).item():.4f} ✓")

print("\n💡 Key: Gradient = Product of local derivatives")

## 1.2 Computational Graph

### Forward Pass
```
x → [Layer 1] → h1 → [Layer 2] → h2 → [Layer 3] → y → Loss
```

### Backward Pass
```
x ← [Layer 1] ← h1 ← [Layer 2] ← h2 ← [Layer 3] ← y ← ∇Loss
```

### PyTorch Autograd

- Automatically builds the computational graph during the forward pass
- Tracks operations on tensors with `requires_grad=True`
- `.backward()` computes gradients for every tracked leaf tensor
- `.grad` stores the gradients (they accumulate until reset)
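
The points above can be illustrated with a minimal sketch. The tensor names and values are illustrative only; the example inspects the node autograd records for a result and shows that `.grad` accumulates across `.backward()` calls until it is reset.

```python
import torch

# Leaf tensors with requires_grad=True are tracked by autograd.
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)  # plain input, not tracked

y = (w * x) ** 2  # y = (w*x)^2, so dy/dw = 2 * (w*x) * x = 24

# Each non-leaf result remembers the operation that produced it.
print(y.grad_fn)             # the backward node for the power operation
print(w.is_leaf, y.is_leaf)  # True False

y.backward()
first_grad = w.grad.item()

# A second backward pass ADDS to .grad instead of overwriting it.
y2 = (w * x) ** 2
y2.backward()
accumulated_grad = w.grad.item()

# Reset before the next step (optimizers do this via zero_grad()).
w.grad.zero_()
reset_grad = w.grad.item()

print(first_grad, accumulated_grad, reset_grad)
```

This accumulation is why a training loop normally clears gradients once per iteration.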

---

# 2. GRADIENT ISSUES

## 2.1 Vanishing Gradients

### Problem

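In deep networks (with $h_n = y$ the network output):
$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial h_n} \cdot \left(\prod_{i=2}^{n} \frac{\partial h_i}{\partial h_{i-1}}\right) \cdot \frac{\partial h_1}{\partial w_1}$$

If $\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| < 1$ for most layers → Product becomes very small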

### Causes

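1. **Sigmoid/Tanh activations**
   - Derivative max = 0.25 (sigmoid); the tanh derivative reaches 1 only at zero and falls off quickly as units saturate
   - With 10 layers: $0.25^{10} \approx 10^{-6}$

2. **Poor initialization**
   - Weights too small → signals and gradients shrink layer by layer
   - Weights too large → sigmoid/tanh saturate, so their derivatives are near zero

3. **Deep networks**
   - More layers → more multiplications → smaller gradient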

### Consequences
- Early layers don't learn
- Training extremely slow
- Network behaves like shallow network
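
The sigmoid bound quoted above can be checked numerically. The following is a small sketch, not part of the original notebook: it evaluates the sigmoid derivative on a grid with autograd and raises its maximum to the power of the depth. It deliberately ignores the weight matrices, which also enter the real product.

```python
import torch

# Evaluate the sigmoid derivative on a grid using autograd.
z = torch.linspace(-6.0, 6.0, steps=1201, requires_grad=True)
s = torch.sigmoid(z)
s.sum().backward()  # d(sum s)/dz_i = sigmoid'(z_i)

max_derivative = z.grad.max().item()  # attained around z = 0

# Upper bound on the activation part of the gradient after `depth` layers.
for depth in (5, 10, 20):
    bound = max_derivative ** depth
    print(f"depth={depth:2d}  bound on product of sigmoid derivatives: {bound:.2e}")
```

Even in the best case for sigmoid (every unit sitting exactly at zero), the activation factors alone shrink the gradient by roughly six orders of magnitude over 10 layers.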

In [None]:
# Demonstrate vanishing gradients

class DeepSigmoidNet(nn.Module):
    """Deep network with sigmoid (prone to vanishing gradients)"""
    def __init__(self, depth=10):
        super().__init__()
        layers = []
        for i in range(depth):
            layers.append(nn.Linear(100, 100))
            layers.append(nn.Sigmoid())
        layers.append(nn.Linear(100, 10))
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

# Create model and compute gradients
model = DeepSigmoidNet(depth=10)
x = torch.randn(32, 100)
y = torch.randint(0, 10, (32,))

# Forward + backward
output = model(x)
loss = F.cross_entropy(output, y)
loss.backward()

# Collect gradient norms per layer
layer_names = []
grad_norms = []

for name, param in model.named_parameters():
    if 'weight' in name and param.grad is not None:
        layer_names.append(name)
        grad_norms.append(param.grad.norm().item())

# Plot
plt.figure(figsize=(12, 5))
plt.bar(range(len(grad_norms)), grad_norms)
plt.xlabel('Layer Index (0 = first layer)', fontsize=12)
plt.ylabel('Gradient Norm', fontsize=12)
plt.title('Vanishing Gradients in Deep Sigmoid Network', fontsize=14, fontweight='bold')
plt.yscale('log')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("📊 Observations:")
print(f"   First layer gradient: {grad_norms[0]:.2e}")
print(f"   Last layer gradient: {grad_norms[-1]:.2e}")
print(f"   Ratio: {grad_norms[0]/grad_norms[-1]:.2e}")
print("\n⚠️  Early layers have MUCH smaller gradients!")
print("   → Vanishing gradient problem")

## 2.2 Exploding Gradients

### Problem

If $\left\|\frac{\partial h_i}{\partial h_{i-1}}\right\| > 1$ for most layers → Product becomes very large

### Causes

1. **Poor initialization**
   - Weights too large

2. **Recurrent networks**
   - Same weight matrix multiplied many times

3. **No normalization**
   - Activations grow unbounded

### Consequences
- Loss becomes NaN
- Weights become NaN
- Training diverges

### Solutions
- **Gradient clipping**: Limit gradient magnitude
- **Proper initialization**: He/Xavier
- **Batch normalization**: Normalize activations
- **Residual connections**: Skip connections
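
The recurrent case ("same weight matrix multiplied many times") can be sketched in isolation. This is a simplified linear model under stated assumptions: the helper `backprop_norms` is hypothetical, the weight matrix is a scaled orthogonal matrix so that every singular value equals `scale`, and activation derivatives are ignored.

```python
import torch

torch.manual_seed(0)

def backprop_norms(scale, steps=30, dim=50):
    """Norm of a gradient vector after repeated multiplication by W^T."""
    # Orthogonal matrix times `scale`: all singular values equal `scale`.
    q, _ = torch.linalg.qr(torch.randn(dim, dim))
    W = scale * q
    g = torch.ones(dim) / dim ** 0.5  # unit-norm starting gradient
    norms = []
    for _ in range(steps):
        g = W.T @ g  # one step of backpropagation through time
        norms.append(g.norm().item())
    return norms

shrinking = backprop_norms(scale=0.8)
growing = backprop_norms(scale=1.2)
print(f"scale 0.8 after 30 steps: {shrinking[-1]:.2e}")
print(f"scale 1.2 after 30 steps: {growing[-1]:.2e}")
```

In this idealized setting the norm is exactly `scale ** steps`, so a 20% deviation from 1 in either direction already changes the gradient by orders of magnitude over 30 steps.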

In [None]:
# Demonstrate exploding gradients

class PoorlyInitializedNet(nn.Module):
    """Network with large initialization (prone to exploding)"""
    def __init__(self, depth=10):
        super().__init__()
        layers = []
        for i in range(depth):
            linear = nn.Linear(100, 100)
            # BAD: Initialize with large weights
            nn.init.uniform_(linear.weight, -2, 2)
            layers.append(linear)
            layers.append(nn.ReLU())
        layers.append(nn.Linear(100, 10))
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

model = PoorlyInitializedNet(depth=5)
x = torch.randn(32, 100)
y = torch.randint(0, 10, (32,))

# Forward + backward
output = model(x)
loss = F.cross_entropy(output, y)
loss.backward()

# Collect gradient norms
grad_norms = []
for param in model.parameters():
    if param.grad is not None:
        grad_norms.append(param.grad.norm().item())

print("⚠️  Exploding Gradients:")
print(f"   Max gradient norm: {max(grad_norms):.2e}")
print(f"   Min gradient norm: {min(grad_norms):.2e}")
print(f"   Range: {max(grad_norms)/min(grad_norms):.2e}")

if max(grad_norms) > 1e3:
    print("\n❌ DANGER: Gradients exploding!")
    print("   Solution: Gradient clipping or better initialization")

## 2.3 Gradient Clipping

### Norm Clipping

$$\text{if } ||g|| > \text{threshold}: \quad g = \frac{g}{||g||} \times \text{threshold}$$
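
To connect the formula to code, here is a hand-written sketch of norm clipping. The helper `clip_by_global_norm`, the toy model, and the loss scale are assumptions made for illustration; in practice `torch.nn.utils.clip_grad_norm_` performs the same kind of rescaling and should be preferred.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def clip_by_global_norm(parameters, threshold):
    """Rescale all gradients in place so their joint L2 norm is <= threshold."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm  # g <- g / ||g|| * threshold
        for g in grads:
            g.mul_(scale)
    return total_norm.item()  # norm measured before clipping

# Toy model (hypothetical sizes) with a deliberately large loss scale.
model = nn.Linear(10, 1)
x, y = torch.randn(16, 10), torch.randn(16, 1)
loss = 1000.0 * ((model(x) - y) ** 2).mean()
loss.backward()

norm_before = clip_by_global_norm(list(model.parameters()), threshold=1.0)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters())).item()
print(f"before: {norm_before:.2f}, after: {norm_after:.4f}")
```

Note that the norm is computed jointly over all parameters, so the direction of the overall update is preserved and only its length changes.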

### Implementation

In [None]:
# Gradient clipping example

model = PoorlyInitializedNet(depth=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Training step WITH gradient clipping
output = model(x)
loss = F.cross_entropy(output, y)
loss.backward()

# Gradient norms BEFORE clipping
total_norm_before = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.norm(2)
        total_norm_before += param_norm.item() ** 2
total_norm_before = total_norm_before ** 0.5

# CLIP gradients
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Gradient norms AFTER clipping
total_norm_after = 0.0
for p in model.parameters():
    if p.grad is not None:
        param_norm = p.grad.norm(2)
        total_norm_after += param_norm.item() ** 2
total_norm_after = total_norm_after ** 0.5

optimizer.step()

print("✂️  Gradient Clipping:")
print(f"   Before clipping: {total_norm_before:.4f}")
print(f"   After clipping: {total_norm_after:.4f}")
print(f"   Threshold: {max_norm}")
print("\n✅ Gradients clipped to prevent explosion!")

---

# 3. INITIALIZATION STRATEGIES

## 3.1 Why Initialization Matters

### Goal

Keep the **variance** of activations and gradients roughly constant across layers

### Bad Initialization
- Activations too small → vanishing gradients
- Activations too large → exploding gradients
- All neurons compute the same thing → symmetry is never broken
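
To make the variance goal concrete, here is a short derivation sketch for a single linear layer. It rests on simplifying assumptions that the text above does not state: weights and inputs are zero-mean and mutually independent, and the activation function is ignored.

$$z_j = \sum_{i=1}^{n_{in}} w_{ji}\, x_i \quad\Rightarrow\quad \mathrm{Var}(z_j) = n_{in}\,\mathrm{Var}(w)\,\mathrm{Var}(x)$$

Stacking $L$ such layers of equal width multiplies the same factor $L$ times:

$$\mathrm{Var}(h_L) \approx \left(n_{in}\,\mathrm{Var}(w)\right)^{L}\,\mathrm{Var}(x)$$

Under these assumptions the signal keeps its scale only when $n_{in}\,\mathrm{Var}(w) \approx 1$. A factor below 1 shrinks it geometrically with depth, and a factor above 1 inflates it. The same argument applied to the backward pass gives the condition $n_{out}\,\mathrm{Var}(w) \approx 1$.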

## 3.2 Zero Initialization (WRONG)

```python
# ❌ BAD
nn.init.zeros_(layer.weight)
```

### Problems
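1. All neurons in a layer compute the same function
2. All neurons in a layer receive identical gradients (exactly zero for hidden-layer weights)
3. The neurons never become different from each other, so the network cannot learn useful features (symmetry is never broken)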
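
The symmetry problem can be observed directly. The sketch below uses a tiny two-layer network with hypothetical sizes: first every parameter is set to the same constant, then to zero, and the gradient of the hidden layer is inspected in both cases.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny two-layer network (hypothetical sizes), every parameter set to 0.5.
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for p in net.parameters():
    nn.init.constant_(p, 0.5)

x = torch.randn(8, 4)
target = torch.randn(8, 1)
loss = ((net(x) - target) ** 2).mean()
loss.backward()

hidden_grad = net[0].weight.grad  # shape (3, 4): one row per hidden unit
print(hidden_grad)

# Every hidden unit receives the same gradient row,
# so the units stay identical after any number of gradient steps.
rows_identical = bool(
    torch.allclose(hidden_grad[0], hidden_grad[1])
    and torch.allclose(hidden_grad[1], hidden_grad[2])
)
print("rows identical:", rows_identical)

# With all-zero parameters the hidden-layer gradient is itself zero.
for p in net.parameters():
    nn.init.zeros_(p)
net.zero_grad()
((net(x) - target) ** 2).mean().backward()
zero_init_grad_norm = net[0].weight.grad.norm().item()
print("zero-init hidden grad norm:", zero_init_grad_norm)
```

Random initialization avoids both cases because each hidden unit starts from a different point.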

## 3.3 Xavier/Glorot Initialization

### Formula

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$

Or uniform:
$$W \sim \mathcal{U}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \sqrt{\frac{6}{n_{in} + n_{out}}}\right)$$

### When to Use
- **Tanh activation**
- **Sigmoid activation**
- Linear layers

### Derivation
Assumes an activation function that is symmetric around zero and approximately linear near it (e.g. tanh)
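
As a complement to the formulas, the sketch below uses the normal variant, `nn.init.xavier_normal_`, together with the optional `gain` argument. The layer sizes are arbitrary, and using `nn.init.calculate_gain('tanh')` is PyTorch's suggested scaling for tanh rather than something the derivation above requires.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_in, n_out = 256, 128
layer = nn.Linear(n_in, n_out)

# Normal variant of Xavier: std = gain * sqrt(2 / (n_in + n_out)).
gain = nn.init.calculate_gain('tanh')  # PyTorch's suggested gain for tanh (5/3)
nn.init.xavier_normal_(layer.weight, gain=gain)

expected_std = gain * (2.0 / (n_in + n_out)) ** 0.5
empirical_std = layer.weight.std().item()
print(f"gain={gain:.4f}  expected std={expected_std:.4f}  empirical std={empirical_std:.4f}")
```

With `gain=1.0` (the default) the standard deviation reduces to the plain formula given above.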

In [None]:
# Xavier initialization

layer = nn.Linear(100, 100)

# Xavier/Glorot init
nn.init.xavier_uniform_(layer.weight)
print("✅ Xavier initialization")
print(f"   Mean: {layer.weight.mean().item():.6f}")
print(f"   Std: {layer.weight.std().item():.6f}")
print(f"   Theoretical std: {(2/(100+100))**0.5:.6f}")

## 3.4 He Initialization

### Formula

$$W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$

### When to Use
- **ReLU activation**
- **Leaky ReLU**
- Any activation that "kills" half the neurons

### Why Different?
ReLU zeroes out roughly half the activations → the weight variance is doubled to compensate
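
The factor of 2 can be checked empirically. This is a rough sketch with arbitrary sizes and a hypothetical helper `second_moment_after_relu`: it pushes unit-variance inputs through one linear layer plus ReLU and measures the mean squared activation for two weight variances. With $n_{in} = n_{out}$, the Xavier variance $2/(n_{in}+n_{out})$ equals $1/n_{in}$.

```python
import torch

torch.manual_seed(0)

n_in, n_samples = 512, 4096
x = torch.randn(n_samples, n_in)  # unit-variance inputs

def second_moment_after_relu(weight_var):
    """Mean squared activation after one linear + ReLU layer."""
    W = torch.randn(n_in, n_in) * weight_var ** 0.5
    return torch.relu(x @ W).pow(2).mean().item()

xavier_like = second_moment_after_relu(1.0 / n_in)  # Var(W) = 1/n_in
he = second_moment_after_relu(2.0 / n_in)           # Var(W) = 2/n_in
print(f"Var(W)=1/n_in -> E[a^2] = {xavier_like:.3f}")
print(f"Var(W)=2/n_in -> E[a^2] = {he:.3f}")
```

The first setting loses about half of the signal energy per layer, while the second keeps it close to 1, which is the quantity the next layer's pre-activation variance depends on.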

In [None]:
# Compare Xavier vs He initialization

def test_initialization(init_fn, activation, depth=10, n_samples=1000):
    """
    Test initialization by measuring activation variance across layers
    """
    layers = []
    for i in range(depth):
        linear = nn.Linear(100, 100, bias=False)
        init_fn(linear.weight)
        layers.append(linear)
        layers.append(activation())
    
    model = nn.Sequential(*layers)
    
    # Forward pass
    x = torch.randn(n_samples, 100)
    
    # Track activation variance per layer
    variances = []
    with torch.no_grad():
        for layer in model:
            x = layer(x)
            if isinstance(layer, nn.Linear):
                variances.append(x.var().item())
    
    return variances

# Test different combinations
configs = [
    ('Xavier + Tanh', nn.init.xavier_uniform_, nn.Tanh),
    ('Xavier + ReLU', nn.init.xavier_uniform_, nn.ReLU),
    ('He + ReLU', nn.init.kaiming_uniform_, nn.ReLU),
]

plt.figure(figsize=(14, 5))

for i, (name, init_fn, activation) in enumerate(configs):
    variances = test_initialization(init_fn, activation, depth=10)
    plt.plot(variances, marker='o', label=name, linewidth=2, markersize=6)

plt.axhline(y=1.0, color='red', linestyle='--', alpha=0.5, label='Ideal (var=1)')
plt.xlabel('Layer Index', fontsize=12)
plt.ylabel('Activation Variance', fontsize=12)
plt.title('Activation Variance Across Layers', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.tight_layout()
plt.show()

print("📊 Observations:")
print("   Xavier + Tanh: Variance stable (designed for tanh)")
print("   Xavier + ReLU: Variance decreases (not optimal for ReLU)")
print("   He + ReLU: Variance stable (designed for ReLU)")
print("\n💡 Key: Match initialization to activation function!")

---

# 4. GRADIENT FLOW ANALYSIS

## 4.1 Monitor Gradient Norms Per Layer
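
The core idea of per-layer monitoring fits in a few lines. The sketch below is a minimal, self-contained illustration: the helper `grad_norms_by_layer` and the small model are hypothetical and serve only to produce some gradients to inspect.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def grad_norms_by_layer(model):
    """Return {parameter name: gradient L2 norm} for all weight matrices."""
    return {
        name: p.grad.norm().item()
        for name, p in model.named_parameters()
        if p.grad is not None and p.dim() > 1  # skip biases
    }

# Small hypothetical model and batch, used only to produce some gradients.
model = nn.Sequential(
    nn.Linear(20, 32), nn.Sigmoid(),
    nn.Linear(32, 32), nn.Sigmoid(),
    nn.Linear(32, 5),
)
x = torch.randn(64, 20)
y = torch.randint(0, 5, (64,))

F.cross_entropy(model(x), y).backward()

norms = grad_norms_by_layer(model)
for name, value in norms.items():
    print(f"{name:12s} {value:.3e}")

# A very small first/last ratio is one possible warning sign of vanishing gradients.
values = list(norms.values())
print(f"first/last ratio: {values[0] / values[-1]:.3e}")
```

Calling such a helper right after `loss.backward()` and before `optimizer.step()` gives one snapshot per iteration, which is what a fuller monitoring class records over time.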

In [None]:
# Complete gradient monitoring system

class GradientMonitor:
    """
    Monitor gradient flow during training
    """
    def __init__(self, model):
        self.model = model
        self.gradient_norms = {}
        
    def record_gradients(self):
        """Record gradient norms for all layers"""
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                if name not in self.gradient_norms:
                    self.gradient_norms[name] = []
                
                norm = param.grad.norm().item()
                self.gradient_norms[name].append(norm)
    
    def plot_gradient_flow(self, iteration=None):
        """Plot gradient norms per layer"""
        if iteration is None:
            iteration = -1  # Last iteration
        
        layer_names = []
        grad_norms = []
        
        for name, norms in self.gradient_norms.items():
            if 'weight' in name:
                # Drop only the trailing ".weight" so each layer keeps a unique label
                layer_names.append(name.rsplit('.', 1)[0])
                grad_norms.append(norms[iteration])
        
        plt.figure(figsize=(12, 5))
        plt.bar(range(len(grad_norms)), grad_norms)
        plt.xlabel('Layer', fontsize=12)
        plt.ylabel('Gradient Norm', fontsize=12)
        plt.title(f'Gradient Flow (Iteration {iteration})', fontsize=14, fontweight='bold')
        plt.yscale('log')
        plt.xticks(range(len(layer_names)), layer_names, rotation=45)
        plt.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.show()
    
    def plot_gradient_history(self, layer_name=None):
        """Plot gradient history over time"""
        plt.figure(figsize=(12, 5))
        
        if layer_name:
            # Plot specific layer
            norms = self.gradient_norms[layer_name]
            plt.plot(norms, linewidth=2)
            plt.title(f'Gradient History: {layer_name}', fontsize=14, fontweight='bold')
        else:
            # Plot all layers
            for name, norms in self.gradient_norms.items():
                if 'weight' in name:
                    plt.plot(norms, alpha=0.7, label=name.rsplit('.', 1)[0])
            plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
            plt.title('Gradient History (All Layers)', fontsize=14, fontweight='bold')
        
        plt.xlabel('Iteration', fontsize=12)
        plt.ylabel('Gradient Norm', fontsize=12)
        plt.yscale('log')
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()

print("✅ GradientMonitor class defined!")
print("\n📊 Usage:")
print("   monitor = GradientMonitor(model)")
print("   # In training loop:")
print("   loss.backward()")
print("   monitor.record_gradients()")
print("   optimizer.step()")

## 4.2 Compare Different Network Architectures

In [None]:
# Experiment: Compare gradient flow in different architectures

class PlainDeepNet(nn.Module):
    """Plain deep network (prone to gradient issues)"""
    def __init__(self, depth=20):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers.append(nn.Linear(100, 100))
            layers.append(nn.ReLU())  # nonlinearity after every hidden layer
        layers.append(nn.Linear(100, 10))
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x)

class ResidualBlock(nn.Module):
    """Residual block with skip connection"""
    def __init__(self, dim):
        super().__init__()
        self.linear1 = nn.Linear(dim, dim)
        self.linear2 = nn.Linear(dim, dim)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        identity = x
        out = self.relu(self.linear1(x))
        out = self.linear2(out)
        out += identity  # Skip connection!
        out = self.relu(out)
        return out

class ResNet(nn.Module):
    """Deep residual network"""
    def __init__(self, depth=10):
        super().__init__()
        self.input_layer = nn.Linear(100, 100)
        self.blocks = nn.Sequential(*[ResidualBlock(100) for _ in range(depth)])
        self.output_layer = nn.Linear(100, 10)
    
    def forward(self, x):
        x = self.input_layer(x)
        x = self.blocks(x)
        x = self.output_layer(x)
        return x

# Compare gradient flow
models = {
    'Plain (20 layers)': PlainDeepNet(depth=20),
    'ResNet (10 blocks)': ResNet(depth=10)
}

x = torch.randn(32, 100)
y = torch.randint(0, 10, (32,))

plt.figure(figsize=(14, 5))

for idx, (name, model) in enumerate(models.items()):
    # Forward + backward
    output = model(x)
    loss = F.cross_entropy(output, y)
    loss.backward()
    
    # Collect gradients
    grad_norms = []
    for param in model.parameters():
        if param.grad is not None and len(param.shape) > 1:  # Only weight matrices
            grad_norms.append(param.grad.norm().item())
    
    # Plot
    plt.subplot(1, 2, idx+1)
    plt.bar(range(len(grad_norms)), grad_norms)
    plt.xlabel('Layer Index', fontsize=12)
    plt.ylabel('Gradient Norm', fontsize=12)
    plt.title(name, fontsize=13, fontweight='bold')
    plt.yscale('log')
    plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("📊 Observations:")
print("   Plain Network: Gradient vanishes in early layers")
print("   ResNet: Gradients flow better (thanks to skip connections)")
print("\n✅ Skip connections help gradient flow!")
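
Why skip connections help can be seen in a stripped-down scalar model. The sketch below is an illustration under strong assumptions: each "layer" is just multiplication by a small weight `w`, the helper `input_gradient` is hypothetical, and the ReLU applied after the addition in `ResidualBlock` is left out. A plain layer has local derivative `w`, whereas a residual layer `h + w*h` has local derivative `1 + w`.

```python
import torch

def input_gradient(depth, residual, w=0.1):
    """d(output)/d(input) for a chain of `depth` scalar layers."""
    x = torch.tensor(1.0, requires_grad=True)
    h = x
    for _ in range(depth):
        if residual:
            h = h + w * h  # skip connection: local derivative is 1 + w
        else:
            h = w * h      # plain layer: local derivative is w
    h.backward()
    return x.grad.item()

plain = input_gradient(depth=20, residual=False)  # 0.1 ** 20
skip = input_gradient(depth=20, residual=True)    # 1.1 ** 20
print(f"plain:    {plain:.3e}")
print(f"residual: {skip:.3e}")
```

The identity path keeps every factor of the product near 1, so the gradient survives 20 layers. The same mechanism can also let gradients grow when the residual branch is large, which is one reason initialization and normalization still matter.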

---

# 🎓 Summary of FILE 2: Backpropagation & Gradient Flow

## ✅ What You Learned

### 1. Backpropagation Basics
- **Chain rule**: Gradient = product of local derivatives
- **Computational graph**: Forward and backward pass
- **PyTorch autograd**: Automatic differentiation

### 2. Gradient Issues
- **Vanishing gradients**: Product of small derivatives
- **Exploding gradients**: Product of large derivatives
- **Causes**: Activation functions, initialization, depth
- **Solutions**: Proper init, normalization, skip connections

### 3. Initialization Strategies
- **Zero init**: NEVER use for weights (fails to break symmetry)
- **Xavier/Glorot**: For tanh/sigmoid
- **He initialization**: For ReLU
- **Match init to activation function**

### 4. Gradient Flow Analysis
- **Monitor gradient norms**: Essential for debugging
- **Layer-wise analysis**: Identify problematic layers
- **Skip connections**: Help gradient flow
- **Gradient clipping**: Prevent explosion

## 🚀 Key Takeaways

1. **Gradient = product** → Prone to vanishing/exploding
2. **Sigmoid/tanh** → Vanishing gradients when units saturate
3. **ReLU** → Better gradient flow
4. **Proper initialization** is crucial for deep networks
5. **He init for ReLU**, Xavier for tanh
6. **Skip connections** greatly improve gradient flow
7. **Monitor gradients** during training
8. **Gradient clipping** for RNNs and unstable training

## 📝 Next: FILE 3

- Regularization Techniques
- Dropout
- Weight Decay
- Label Smoothing

---

**Congratulations on completing FILE 2! 🎉**