# Week 5: Vanishing & Exploding Gradients - Interactive Notebook
**Module 2: Optimization and Regularization**
**Deep Neural Network Architectures (21CSE558T)**

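This notebook demonstrates vanishing and exploding gradients in deep networks, shows how to detect them, and compares the standard remedies: activation choice, weight initialization, batch normalization, and gradient clipping.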

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/your-repo/blob/main/week5_gradient_problems_colab.ipynb)

## Setup and Imports

In [None]:
# Install required packages (if running on Colab)
import sys
if 'google.colab' in sys.modules:
    print("Running on Google Colab")
    !pip install -q tensorflow matplotlib numpy scikit-learn seaborn

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print(f"TensorFlow version: {tf.__version__}")
print(f"GPU Available: {tf.config.list_physical_devices('GPU')}")

## Part 1: Understanding Gradient Flow
### 1.1 Visualizing the Chain Rule in Backpropagation

In [None]:
def visualize_chain_rule():
    """Visualize how gradients flow through layers via chain rule"""
    
    # Create figure
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
    
    # Simulate gradient flow through layers
    num_layers = 10
    
    # Sigmoid: the derivative never exceeds 0.25, so 0.25**depth is a best-case bound
    sigmoid_grads = [0.25 ** i for i in range(1, num_layers + 1)]
    
    # ReLU: the derivative is 0 or 1 per unit; the 0.9 per-layer factor is an
    # assumed illustrative average, not a measured value
    relu_grads = [0.9 ** i for i in range(1, num_layers + 1)]
    
    # Plot 1: Gradient magnitude through layers
    layers = list(range(1, num_layers + 1))
    ax1.plot(layers, sigmoid_grads, 'r-o', label='Sigmoid', linewidth=2, markersize=8)
    ax1.plot(layers, relu_grads, 'b-s', label='ReLU', linewidth=2, markersize=8)
    ax1.set_yscale('log')
    ax1.set_xlabel('Layer Depth', fontsize=12)
    ax1.set_ylabel('Gradient Magnitude (log scale)', fontsize=12)
    ax1.set_title('Gradient Flow Through Deep Networks', fontsize=14, fontweight='bold')
    ax1.legend(fontsize=11)
    ax1.grid(True, alpha=0.3)
    
    # Add annotations
    ax1.annotate('Vanishing!', xy=(8, sigmoid_grads[7]), xytext=(6, 1e-6),
                arrowprops=dict(arrowstyle='->', color='red', lw=2),
                fontsize=11, color='red', fontweight='bold')
    
    # Plot 2: Cumulative gradient decay
    sigmoid_cumulative = np.cumprod([0.25] * num_layers)
    relu_cumulative = np.cumprod([0.9] * num_layers)
    
    ax2.fill_between(layers, 0, sigmoid_cumulative, alpha=0.3, color='red', label='Sigmoid Area')
    ax2.fill_between(layers, 0, relu_cumulative, alpha=0.3, color='blue', label='ReLU Area')
    ax2.set_yscale('log')
    ax2.set_xlabel('Layer Depth', fontsize=12)
    ax2.set_ylabel('Cumulative Gradient Product (log scale)', fontsize=12)
    ax2.set_title('Cumulative Gradient Decay', fontsize=14, fontweight='bold')
    ax2.legend(fontsize=11)
    ax2.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print analysis
    print("\n📊 GRADIENT FLOW ANALYSIS:")
    print("=" * 50)
    print(f"After {num_layers} layers:")
    print(f"  Sigmoid gradient: {sigmoid_grads[-1]:.2e}")
    print(f"  ReLU gradient: {relu_grads[-1]:.4f}")
    print(f"  Ratio (ReLU/Sigmoid): {relu_grads[-1]/sigmoid_grads[-1]:.0f}x better")

visualize_chain_rule()
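
The 0.25 factor used in the plot comes from the sigmoid derivative, σ'(x) = σ(x)(1 − σ(x)), which peaks at x = 0. The following is a minimal sketch in plain Python that checks this bound and the best-case product over ten layers. The helper names `sigmoid` and `sigmoid_prime` are illustrative and not part of the notebook's classes.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """Derivative of the logistic function: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Scan a grid of inputs in [-10, 10]; the derivative peaks at x = 0.
grid = [i / 100.0 for i in range(-1000, 1001)]
peak_x = max(grid, key=sigmoid_prime)
peak_value = sigmoid_prime(peak_x)

# Best case for a 10-layer chain: every layer contributes the maximum 0.25.
best_case_product = peak_value ** 10

print(f"max sigmoid'(x) = {peak_value:.2f} at x = {peak_x}")
print(f"best-case product over 10 layers = {best_case_product:.2e}")
```

Even in this best case the product is below 1e-6, which is why the sigmoid curve in the plot drops so quickly. Real networks also multiply by weight terms, so the measured decay can differ.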

## Part 2: Vanishing Gradients Problem
### 2.1 Create and Analyze a Deep Network with Sigmoid Activations

In [None]:
class GradientAnalyzer:
    """Tool for analyzing gradient flow in neural networks"""
    
    def __init__(self):
        self.gradient_history = []
        self.loss_history = []
        self.model_counter = 0  # Add counter for unique model names
    
    def create_deep_network(self, depth=10, activation='sigmoid', width=64):
        """Create a deep neural network with unique layer names"""
        # Clear any existing models to avoid naming conflicts
        tf.keras.backend.clear_session()
        
        # Generate unique suffix for this model
        self.model_counter += 1
        suffix = f"_{activation}_{self.model_counter}"
        
        model = tf.keras.Sequential()
        
        # Input layer with unique name
        model.add(tf.keras.layers.Dense(width, activation=activation, 
                                       input_shape=(10,),
                                       name=f'input_layer{suffix}'))
        
        # Hidden layers with unique names
        for i in range(depth - 2):
            model.add(tf.keras.layers.Dense(width, activation=activation,
                                           name=f'hidden_{i+1}{suffix}'))
        
        # Output layer with unique name
        model.add(tf.keras.layers.Dense(1, activation='sigmoid',
                                       name=f'output_layer{suffix}'))
        
        return model
    
    def analyze_gradients(self, model, X, y):
        """Analyze gradient magnitudes for each layer"""
        with tf.GradientTape() as tape:
            y_pred = model(X, training=True)
            loss = tf.keras.losses.binary_crossentropy(y, y_pred)
            loss = tf.reduce_mean(loss)
        
        # Get gradients
        gradients = tape.gradient(loss, model.trainable_variables)
        
        # Calculate gradient statistics
        gradient_stats = []
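        for grad, weight in zip(gradients, model.trainable_variables):
            # Keras 3 stores "layer/kernel" in `.path`; older versions keep it in `.name`
            var_path = getattr(weight, 'path', weight.name)
            if 'kernel' in var_path:  # Only analyze weight gradients, not biases
                grad_norm = tf.norm(grad).numpy()
                grad_mean = tf.reduce_mean(tf.abs(grad)).numpy()
                grad_std = tf.math.reduce_std(grad).numpy()
                
                # The layer name is the path component just before "kernel"
                path_parts = var_path.split('/')
                layer_name = path_parts[-2] if len(path_parts) > 1 else path_parts[0]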
                gradient_stats.append({
                    'layer': layer_name,
                    'norm': grad_norm,
                    'mean': grad_mean,
                    'std': grad_std,
                    'shape': grad.shape
                })
        
        return gradient_stats, loss.numpy()

# Create analyzer
analyzer = GradientAnalyzer()

# Generate sample data: random inputs and random binary labels (0 or 1)
X_sample = tf.random.normal((100, 10))
y_sample = tf.cast(tf.random.uniform((100, 1)) > 0.5, tf.float32)

print("🔍 Analyzing gradient flow in deep networks...\n")

### 2.2 Visualize Vanishing Gradients

In [None]:
def visualize_vanishing_gradients():
    """Compare gradient flow in networks with different activations"""
    
    # Create networks
    sigmoid_model = analyzer.create_deep_network(depth=10, activation='sigmoid')
    tanh_model = analyzer.create_deep_network(depth=10, activation='tanh')
    relu_model = analyzer.create_deep_network(depth=10, activation='relu')
    
    # Analyze gradients
    sigmoid_stats, _ = analyzer.analyze_gradients(sigmoid_model, X_sample, y_sample)
    tanh_stats, _ = analyzer.analyze_gradients(tanh_model, X_sample, y_sample)
    relu_stats, _ = analyzer.analyze_gradients(relu_model, X_sample, y_sample)
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    
    # Plot 1: Gradient norms comparison
    ax = axes[0, 0]
    layer_indices = range(len(sigmoid_stats))
    
    sigmoid_norms = [s['norm'] for s in sigmoid_stats]
    tanh_norms = [s['norm'] for s in tanh_stats]
    relu_norms = [s['norm'] for s in relu_stats]
    
    ax.semilogy(layer_indices, sigmoid_norms, 'r-o', label='Sigmoid', linewidth=2, markersize=8)
    ax.semilogy(layer_indices, tanh_norms, 'g-^', label='Tanh', linewidth=2, markersize=8)
    ax.semilogy(layer_indices, relu_norms, 'b-s', label='ReLU', linewidth=2, markersize=8)
    
    ax.set_xlabel('Layer Index (Input → Output)', fontsize=12)
    ax.set_ylabel('Gradient Norm (log scale)', fontsize=12)
    ax.set_title('Gradient Norm by Layer', fontsize=14, fontweight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    
    # Plot 2: Gradient mean absolute values
    ax = axes[0, 1]
    sigmoid_means = [s['mean'] for s in sigmoid_stats]
    tanh_means = [s['mean'] for s in tanh_stats]
    relu_means = [s['mean'] for s in relu_stats]
    
    x = np.arange(len(sigmoid_stats))
    width = 0.25
    
    ax.bar(x - width, sigmoid_means, width, label='Sigmoid', color='red', alpha=0.7)
    ax.bar(x, tanh_means, width, label='Tanh', color='green', alpha=0.7)
    ax.bar(x + width, relu_means, width, label='ReLU', color='blue', alpha=0.7)
    
    ax.set_xlabel('Layer Index', fontsize=12)
    ax.set_ylabel('Mean Absolute Gradient', fontsize=12)
    ax.set_title('Mean Gradient Magnitude by Layer', fontsize=14, fontweight='bold')
    ax.legend(fontsize=11)
    ax.set_yscale('log')
    
    # Plot 3: Gradient distribution heatmap for the sigmoid network
    ax = axes[1, 0]
    
    # Collect gradient values for each layer with consistent shape
    gradient_samples = []
    sample_size = 100  # Fixed sample size for all layers
    
    for var in sigmoid_model.trainable_variables:
        if 'kernel' in var.name:
            with tf.GradientTape() as tape:
                y_pred = sigmoid_model(X_sample[:10], training=True)
                loss = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_sample[:10], y_pred))
            grads = tape.gradient(loss, var)
            
            # Flatten and take exactly sample_size values
            flat_grads = tf.reshape(grads, [-1]).numpy()
            
            # Ensure we have exactly sample_size values (pad with zeros if needed)
            if len(flat_grads) >= sample_size:
                sampled_grads = flat_grads[:sample_size]
            else:
                # Pad with zeros if we have fewer gradients
                sampled_grads = np.pad(flat_grads, (0, sample_size - len(flat_grads)), 'constant')
            
            gradient_samples.append(sampled_grads)
    
    # Create heatmap with consistent shape
    if gradient_samples:
        gradient_matrix = np.array(gradient_samples)
        im = ax.imshow(gradient_matrix, aspect='auto', cmap='coolwarm', 
                       vmin=-0.01, vmax=0.01)
        ax.set_xlabel('Gradient Values (sampled)', fontsize=12)
        ax.set_ylabel('Layer Index', fontsize=12)
        ax.set_title('Gradient Distribution - Sigmoid Network', fontsize=14, fontweight='bold')
        plt.colorbar(im, ax=ax)
    else:
        ax.text(0.5, 0.5, 'No gradient data available', ha='center', va='center')
        ax.set_title('Gradient Distribution - Sigmoid Network', fontsize=14, fontweight='bold')
    
    # Plot 4: Gradient ratio (last/first layer)
    ax = axes[1, 1]
    
    # Ensure we have gradients to calculate ratios
    if sigmoid_norms and tanh_norms and relu_norms:
        ratios = {
            'Sigmoid': sigmoid_norms[-1] / sigmoid_norms[0] if sigmoid_norms[0] != 0 else 0,
            'Tanh': tanh_norms[-1] / tanh_norms[0] if tanh_norms[0] != 0 else 0,
            'ReLU': relu_norms[-1] / relu_norms[0] if relu_norms[0] != 0 else 0
        }
        
        colors = ['red', 'green', 'blue']
        bars = ax.bar(ratios.keys(), ratios.values(), color=colors, alpha=0.7)
        ax.set_ylabel('Gradient Ratio (Last Layer / First Layer)', fontsize=12)
        ax.set_title('Gradient Decay Across Network', fontsize=14, fontweight='bold')
        ax.set_yscale('log')
        
        # Add value labels on bars
        for bar, (name, value) in zip(bars, ratios.items()):
            if value > 0:  # Only add label if value is positive
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height,
                        f'{value:.2e}', ha='center', va='bottom', fontsize=10)
    else:
        ax.text(0.5, 0.5, 'Insufficient data for ratios', ha='center', va='center')
        ax.set_title('Gradient Decay Across Network', fontsize=14, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Print detailed analysis
    print("\n📈 VANISHING GRADIENT ANALYSIS:")
    print("=" * 70)
    
    if sigmoid_norms and tanh_norms and relu_norms:
        # Scientific notation throughout: a vanished norm would print as 0.000000 in fixed-point
        print(f"{'Activation':<12} {'First Layer Norm':<20} {'Last Layer Norm':<20} {'Ratio':<15}")
        print("-" * 70)
        print(f"{'Sigmoid':<12} {sigmoid_norms[0]:<20.6e} {sigmoid_norms[-1]:<20.6e} {sigmoid_norms[-1]/sigmoid_norms[0] if sigmoid_norms[0] != 0 else 0:<15.2e}")
        print(f"{'Tanh':<12} {tanh_norms[0]:<20.6e} {tanh_norms[-1]:<20.6e} {tanh_norms[-1]/tanh_norms[0] if tanh_norms[0] != 0 else 0:<15.2e}")
        print(f"{'ReLU':<12} {relu_norms[0]:<20.6e} {relu_norms[-1]:<20.6e} {relu_norms[-1]/relu_norms[0] if relu_norms[0] != 0 else 0:<15.2e}")
    else:
        print("Insufficient gradient data for analysis")

visualize_vanishing_gradients()
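
The layer-wise decay measured above can also be reproduced by hand. The sketch below backpropagates through a chain of scalar sigmoid "layers" in plain Python. It is a simplified stand-in for the Keras models, under the assumptions that every weight equals 1.0 and the loss is the final activation; the function name `scalar_chain_gradients` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def scalar_chain_gradients(x, weights):
    """Backpropagate through a_l = sigmoid(w_l * a_{l-1}) and return dL/dw_l
    for every layer, where the loss L is simply the final activation."""
    # Forward pass: remember every activation.
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    # Backward pass: `upstream` holds dL/da for the current layer's output
    # as we walk from the output layer back to the input layer.
    grads = [0.0] * len(weights)
    upstream = 1.0
    for l in reversed(range(len(weights))):
        a_out = activations[l + 1]
        local = a_out * (1.0 - a_out)            # sigmoid'(z_l)
        grads[l] = upstream * local * activations[l]
        upstream = upstream * local * weights[l]
    return grads

grads = scalar_chain_gradients(x=0.5, weights=[1.0] * 10)
for layer, g in enumerate(grads, start=1):
    print(f"layer {layer:2d}: |dL/dw| = {abs(g):.3e}")
```

Each step toward the input multiplies the gradient by another sigmoid derivative (at most 0.25), so the first layer's gradient ends up orders of magnitude smaller than the last layer's.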

## Part 3: Exploding Gradients Problem
### 3.1 Detection and Visualization of Gradient Explosion

In [None]:
class GradientExplosionDetector:
    """Detect and visualize gradient explosion in neural networks"""
    
    def __init__(self, explosion_threshold=100.0):
        self.explosion_threshold = explosion_threshold
        self.gradient_history = []
        self.explosion_events = []
    
    def create_unstable_network(self):
        """Create a network prone to gradient explosion"""
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='linear',
                                kernel_initializer=tf.keras.initializers.RandomNormal(stddev=2.0),
                                input_shape=(10,)),
            tf.keras.layers.Dense(64, activation='relu',
                                kernel_initializer=tf.keras.initializers.RandomNormal(stddev=2.0)),
            tf.keras.layers.Dense(64, activation='relu',
                                kernel_initializer=tf.keras.initializers.RandomNormal(stddev=2.0)),
            tf.keras.layers.Dense(1)
        ])
        return model
    
    def detect_explosion(self, model, X, y, epochs=20):
        """Train model and detect gradient explosions"""
        optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)  # High learning rate
        
        self.gradient_history = []
        self.explosion_events = []
        loss_history = []
        
        for epoch in range(epochs):
            with tf.GradientTape() as tape:
                y_pred = model(X, training=True)
                loss = tf.reduce_mean(tf.square(y_pred - y))
            
            # Calculate gradients
            gradients = tape.gradient(loss, model.trainable_variables)
            
            # Sum of per-variable L2 norms: a simple magnitude proxy that upper-bounds
            # the true global norm (see tf.linalg.global_norm)
            total_norm = tf.reduce_sum([tf.norm(g) for g in gradients if g is not None]).numpy()
            
            # Store history
            self.gradient_history.append(total_norm)
            loss_history.append(loss.numpy())
            
            # Detect explosion
            if total_norm > self.explosion_threshold:
                self.explosion_events.append(epoch)
                print(f"⚠️ EXPLOSION DETECTED at epoch {epoch}: Gradient norm = {total_norm:.2f}")
            
            # Check for NaN
            if np.isnan(total_norm) or np.isinf(total_norm):
                print(f"💥 CRITICAL: NaN/Inf detected at epoch {epoch}!")
                break
            
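            # Apply gradients (with exploding gradients this step can push weights to NaN/Inf)
            try:
                optimizer.apply_gradients(zip(gradients, model.trainable_variables))
            except Exception as exc:  # Avoid a bare except, which would also swallow KeyboardInterrupt
                print(f"❌ Training failed at epoch {epoch}: {exc}")
                break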
        
        return self.gradient_history, loss_history
    
    def visualize_explosion(self):
        """Visualize gradient explosion detection results"""
        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        
        epochs = range(len(self.gradient_history))
        
        # Plot 1: Gradient norm over time
        ax = axes[0, 0]
        ax.plot(epochs, self.gradient_history, 'b-', linewidth=2, label='Gradient Norm')
        ax.axhline(y=self.explosion_threshold, color='r', linestyle='--', 
                  linewidth=2, label=f'Explosion Threshold ({self.explosion_threshold})')
        
        # Mark explosion events
        for event in self.explosion_events:
            ax.scatter(event, self.gradient_history[event], color='red', s=100, 
                      marker='x', linewidths=3, zorder=5)
        
        ax.set_xlabel('Epoch', fontsize=12)
        ax.set_ylabel('Total Gradient Norm', fontsize=12)
        ax.set_title('Gradient Explosion Detection', fontsize=14, fontweight='bold')
        ax.legend(fontsize=11)
        ax.grid(True, alpha=0.3)
        
        # Plot 2: Log scale gradient norm
        ax = axes[0, 1]
        ax.semilogy(epochs, self.gradient_history, 'g-', linewidth=2)
        ax.axhline(y=self.explosion_threshold, color='r', linestyle='--', linewidth=2)
        ax.set_xlabel('Epoch', fontsize=12)
        ax.set_ylabel('Total Gradient Norm (log scale)', fontsize=12)
        ax.set_title('Gradient Norm - Log Scale View', fontsize=14, fontweight='bold')
        ax.grid(True, alpha=0.3)
        
        # Plot 3: Gradient distribution histogram
        ax = axes[1, 0]
        ax.hist(self.gradient_history, bins=30, color='purple', alpha=0.7, edgecolor='black')
        ax.axvline(x=self.explosion_threshold, color='r', linestyle='--', linewidth=2)
        ax.set_xlabel('Gradient Norm', fontsize=12)
        ax.set_ylabel('Frequency', fontsize=12)
        ax.set_title('Distribution of Gradient Norms', fontsize=14, fontweight='bold')
        
        # Plot 4: Explosion event analysis
        ax = axes[1, 1]
        if self.explosion_events:
            explosion_magnitudes = [self.gradient_history[e] for e in self.explosion_events]
            ax.bar(range(len(self.explosion_events)), explosion_magnitudes, 
                  color='red', alpha=0.7)
            ax.set_xlabel('Explosion Event Index', fontsize=12)
            ax.set_ylabel('Gradient Magnitude', fontsize=12)
            ax.set_title(f'Explosion Events ({len(self.explosion_events)} detected)', 
                        fontsize=14, fontweight='bold')
        else:
            ax.text(0.5, 0.5, 'No explosions detected', 
                   ha='center', va='center', fontsize=14)
            ax.set_title('Explosion Events', fontsize=14, fontweight='bold')
        
        plt.tight_layout()
        plt.show()

# Demonstrate gradient explosion
print("🔥 Demonstrating Gradient Explosion...\n")

detector = GradientExplosionDetector(explosion_threshold=50.0)

# Create unstable network
unstable_model = detector.create_unstable_network()

# Generate data
X_train = tf.random.normal((100, 10))
y_train = tf.random.normal((100, 1))

# Detect explosions
grad_history, loss_history = detector.detect_explosion(unstable_model, X_train, y_train, epochs=20)

# Visualize results
detector.visualize_explosion()

# Print summary
print("\n📊 EXPLOSION DETECTION SUMMARY:")
print("=" * 50)
print(f"Total epochs: {len(grad_history)}")
print(f"Explosions detected: {len(detector.explosion_events)}")
if detector.explosion_events:
    print(f"First explosion at epoch: {detector.explosion_events[0]}")
    print(f"Max gradient norm: {max(grad_history):.2f}")
print(f"Mean gradient norm: {np.mean(grad_history):.2f}")
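
The "total gradient norm" tracked by the detector is the sum of each variable's L2 norm, which is not the same quantity as the global norm over all gradient entries. A small plain-Python sketch with made-up values shows the difference; the helper names are illustrative.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def sum_of_norms(tensors):
    """What the detector above tracks: add up each tensor's L2 norm."""
    return sum(l2_norm(t) for t in tensors)

def global_norm(tensors):
    """L2 norm of all gradient entries taken together."""
    return math.sqrt(sum(v * v for t in tensors for v in t))

# Two toy "gradient tensors", flattened to lists (values are made up).
toy_grads = [[3.0, 4.0], [12.0]]

print(sum_of_norms(toy_grads))  # 5 + 12
print(global_norm(toy_grads))   # sqrt(25 + 144)
```

The sum of norms is never smaller than the global norm, so a threshold tuned on one quantity does not transfer directly to the other.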

## Part 4: Gradient Clipping Solution
### 4.1 Implement and Visualize Gradient Clipping

In [None]:
def demonstrate_gradient_clipping():
    """Compare training with and without gradient clipping"""
    
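    # Create two unstable models, then copy the weights so both start from
    # exactly the same initialization regardless of how seeding behaves
    model_no_clip = detector.create_unstable_network()
    model_with_clip = detector.create_unstable_network()
    model_with_clip.set_weights(model_no_clip.get_weights())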
    
    # Training setup
    optimizer_no_clip = tf.keras.optimizers.SGD(learning_rate=0.1)
    optimizer_with_clip = tf.keras.optimizers.SGD(learning_rate=0.1, clipnorm=1.0)
    
    # Training data
    X = tf.random.normal((100, 10))
    y = tf.random.normal((100, 1))
    
    # Training history
    history_no_clip = {'gradients': [], 'loss': []}
    history_with_clip = {'gradients': [], 'loss': []}
    
    epochs = 30
    
    print("Training WITHOUT gradient clipping...")
    for epoch in range(epochs):
        # No clipping
        with tf.GradientTape() as tape:
            y_pred = model_no_clip(X, training=True)
            loss = tf.reduce_mean(tf.square(y_pred - y))
        
        grads = tape.gradient(loss, model_no_clip.trainable_variables)
        grad_norm = tf.reduce_sum([tf.norm(g) for g in grads if g is not None]).numpy()
        
        history_no_clip['gradients'].append(grad_norm)
        history_no_clip['loss'].append(loss.numpy())
        
        if not np.isnan(grad_norm) and not np.isinf(grad_norm):
            optimizer_no_clip.apply_gradients(zip(grads, model_no_clip.trainable_variables))
        else:
            print(f"  ❌ NaN/Inf at epoch {epoch}")
            break
    
    print("\nTraining WITH gradient clipping (max_norm=1.0)...")
    for epoch in range(epochs):
        # With clipping
        with tf.GradientTape() as tape:
            y_pred = model_with_clip(X, training=True)
            loss = tf.reduce_mean(tf.square(y_pred - y))
        
        grads = tape.gradient(loss, model_with_clip.trainable_variables)
        
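        # Clip each gradient tensor to norm 1.0 by hand so the clipped norms can be logged;
        # the optimizer's clipnorm=1.0 applies the same per-tensor rule, so it has no further effect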
        clipped_grads = []
        for g in grads:
            if g is not None:
                clipped_grads.append(tf.clip_by_norm(g, 1.0))
            else:
                clipped_grads.append(g)
        
        grad_norm = tf.reduce_sum([tf.norm(g) for g in clipped_grads if g is not None]).numpy()
        
        history_with_clip['gradients'].append(grad_norm)
        history_with_clip['loss'].append(loss.numpy())
        
        optimizer_with_clip.apply_gradients(zip(clipped_grads, model_with_clip.trainable_variables))
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    # Plot 1: Gradient norms comparison
    ax = axes[0, 0]
    ax.plot(history_no_clip['gradients'], 'r-', label='No Clipping', linewidth=2, alpha=0.7)
    ax.plot(history_with_clip['gradients'], 'b-', label='With Clipping', linewidth=2)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Gradient Norm', fontsize=12)
    ax.set_title('Gradient Norm Comparison', fontsize=14, fontweight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    
    # Plot 2: Loss comparison
    ax = axes[0, 1]
    ax.plot(history_no_clip['loss'], 'r-', label='No Clipping', linewidth=2, alpha=0.7)
    ax.plot(history_with_clip['loss'], 'b-', label='With Clipping', linewidth=2)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Loss', fontsize=12)
    ax.set_title('Loss Comparison', fontsize=14, fontweight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    
    # Plot 3: Gradient stability (variance)
    ax = axes[1, 0]
    window = 5
    no_clip_var = [np.var(history_no_clip['gradients'][max(0,i-window):i+1]) 
                   for i in range(len(history_no_clip['gradients']))]
    clip_var = [np.var(history_with_clip['gradients'][max(0,i-window):i+1]) 
                for i in range(len(history_with_clip['gradients']))]
    
    ax.plot(no_clip_var, 'r-', label='No Clipping', linewidth=2, alpha=0.7)
    ax.plot(clip_var, 'b-', label='With Clipping', linewidth=2)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Gradient Variance (5-epoch window)', fontsize=12)
    ax.set_title('Training Stability', fontsize=14, fontweight='bold')
    ax.legend(fontsize=11)
    ax.grid(True, alpha=0.3)
    
    # Plot 4: Approximate scale factor that clipping at 1.0 would apply to the unclipped norms
    ax = axes[1, 1]
    clipping_effect = [min(g, 1.0) / g if g > 0 else 1.0 
                      for g in history_no_clip['gradients']]
    ax.fill_between(range(len(clipping_effect)), 0, clipping_effect, 
                    alpha=0.5, color='green')
    ax.axhline(y=1.0, color='r', linestyle='--', linewidth=2)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Clipping Factor (1.0 = no clipping)', fontsize=12)
    ax.set_title('Gradient Clipping Effect', fontsize=14, fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Print analysis
    print("\n📊 GRADIENT CLIPPING ANALYSIS:")
    print("=" * 60)
    print(f"{'Metric':<30} {'No Clipping':<15} {'With Clipping':<15}")
    print("-" * 60)
    print(f"{'Max Gradient Norm':<30} {max(history_no_clip['gradients']):<15.2f} {max(history_with_clip['gradients']):<15.2f}")
    print(f"{'Mean Gradient Norm':<30} {np.mean(history_no_clip['gradients']):<15.2f} {np.mean(history_with_clip['gradients']):<15.2f}")
    print(f"{'Gradient Std Dev':<30} {np.std(history_no_clip['gradients']):<15.2f} {np.std(history_with_clip['gradients']):<15.2f}")
    print(f"{'Final Loss':<30} {history_no_clip['loss'][-1]:<15.4f} {history_with_clip['loss'][-1]:<15.4f}")

demonstrate_gradient_clipping()
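
The loop above clips each gradient tensor separately, which is what `tf.clip_by_norm` and the optimizer's `clipnorm` argument do. An alternative is to rescale all tensors by one shared factor, as `tf.clip_by_global_norm` and the `global_clipnorm` optimizer argument do. The plain-Python sketch below contrasts the two on made-up values; it illustrates the arithmetic only and is not the TensorFlow implementation.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def clip_by_norm(tensor, max_norm):
    """Rescale one tensor so its own L2 norm is at most max_norm."""
    norm = l2_norm(tensor)
    if norm <= max_norm:
        return list(tensor)
    return [v * max_norm / norm for v in tensor]

def clip_by_global_norm(tensors, max_norm):
    """Rescale all tensors by one shared factor so their joint norm is at most max_norm."""
    joint = math.sqrt(sum(v * v for t in tensors for v in t))
    scale = min(1.0, max_norm / joint) if joint > 0 else 1.0
    return [[v * scale for v in t] for t in tensors]

toy_grads = [[3.0, 4.0], [12.0]]  # made-up values; individual norms are 5 and 12

per_tensor = [clip_by_norm(t, 1.0) for t in toy_grads]
shared = clip_by_global_norm(toy_grads, 1.0)

print("per-tensor:", per_tensor)  # every tensor ends with norm 1
print("global:    ", shared)      # joint norm 1, relative sizes preserved
```

Per-tensor clipping changes the relative size of gradients across layers, while global clipping keeps the overall update direction and only shortens it. Which one works better is problem-dependent.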

## Part 5: Solutions Summary
### 5.1 Comprehensive Solutions for Gradient Problems

In [None]:
def create_solution_comparison():
    """Compare different solutions for gradient problems"""
    
    # Generate dataset
    X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                              n_redundant=5, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Define different model configurations
    models = {
        'Problematic (Sigmoid)': {
            'model': tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation='sigmoid', input_shape=(20,)),
                tf.keras.layers.Dense(64, activation='sigmoid'),
                tf.keras.layers.Dense(32, activation='sigmoid'),
                tf.keras.layers.Dense(1, activation='sigmoid')
            ]),
            'optimizer': tf.keras.optimizers.SGD(learning_rate=0.01)
        },
        'ReLU + He Init': {
            'model': tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation='relu', 
                                    kernel_initializer='he_normal', input_shape=(20,)),
                tf.keras.layers.Dense(64, activation='relu', 
                                    kernel_initializer='he_normal'),
                tf.keras.layers.Dense(32, activation='relu', 
                                    kernel_initializer='he_normal'),
                tf.keras.layers.Dense(1, activation='sigmoid')
            ]),
            'optimizer': tf.keras.optimizers.Adam(learning_rate=0.001)
        },
        'BatchNorm + Dropout': {
            'model': tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation='relu', input_shape=(20,)),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Dropout(0.3),
                tf.keras.layers.Dense(64, activation='relu'),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(32, activation='relu'),
                tf.keras.layers.Dense(1, activation='sigmoid')
            ]),
            'optimizer': tf.keras.optimizers.Adam(learning_rate=0.001)
        },
        'Complete Solution': {
            'model': tf.keras.Sequential([
                tf.keras.layers.Dense(128, activation='relu', 
                                    kernel_initializer='he_normal', input_shape=(20,)),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Dropout(0.3),
                tf.keras.layers.Dense(64, activation='relu', 
                                    kernel_initializer='he_normal'),
                tf.keras.layers.BatchNormalization(),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(32, activation='relu'),
                tf.keras.layers.Dense(1, activation='sigmoid')
            ]),
            'optimizer': tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
        }
    }
    
    # Train and evaluate each model
    results = {}
    histories = {}
    
    for name, config in models.items():
        print(f"\nTraining {name}...")
        model = config['model']
        model.compile(optimizer=config['optimizer'],
                     loss='binary_crossentropy',
                     metrics=['accuracy'])
        
        history = model.fit(X_train, y_train,
                          validation_split=0.2,
                          epochs=50,
                          batch_size=32,
                          verbose=0)
        
        test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
        
        results[name] = {
            'test_loss': test_loss,
            'test_accuracy': test_acc,
            'final_train_loss': history.history['loss'][-1],
            'final_val_loss': history.history['val_loss'][-1]
        }
        histories[name] = history.history
        
        print(f"  Test Accuracy: {test_acc:.4f}")
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    
    colors = ['red', 'blue', 'green', 'purple']
    
    # Plot training loss
    ax = axes[0, 0]
    for (name, history), color in zip(histories.items(), colors):
        ax.plot(history['loss'], label=name, color=color, linewidth=2, alpha=0.8)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Training Loss', fontsize=12)
    ax.set_title('Training Loss Comparison', fontsize=14, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    
    # Plot validation loss
    ax = axes[0, 1]
    for (name, history), color in zip(histories.items(), colors):
        ax.plot(history['val_loss'], label=name, color=color, linewidth=2, alpha=0.8)
    ax.set_xlabel('Epoch', fontsize=12)
    ax.set_ylabel('Validation Loss', fontsize=12)
    ax.set_title('Validation Loss Comparison', fontsize=14, fontweight='bold')
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.3)
    
    # Plot test accuracy comparison
    ax = axes[1, 0]
    names = list(results.keys())
    test_accs = [results[n]['test_accuracy'] for n in names]
    bars = ax.bar(range(len(names)), test_accs, color=colors, alpha=0.7)
    ax.set_xticks(range(len(names)))
    ax.set_xticklabels(names, rotation=45, ha='right')
    ax.set_ylabel('Test Accuracy', fontsize=12)
    ax.set_title('Final Test Accuracy', fontsize=14, fontweight='bold')
    
    # Add value labels
    for bar, acc in zip(bars, test_accs):
        ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
               f'{acc:.3f}', ha='center', va='bottom')
    
    # Plot convergence speed
    ax = axes[1, 1]
    for (name, history), color in zip(histories.items(), colors):
        # Find epoch where validation loss stabilizes
        val_losses = history['val_loss']
        convergence_epoch = None
        for i in range(10, len(val_losses)):
            if np.std(val_losses[i-5:i]) < 0.01:
                convergence_epoch = i
                break
        
        if convergence_epoch is not None:
            ax.bar(name, convergence_epoch, color=color, alpha=0.7)
    
    ax.set_ylabel('Epochs to Convergence', fontsize=12)
    ax.set_title('Convergence Speed', fontsize=14, fontweight='bold')
    # Rotate the existing tick labels without resetting them
    plt.setp(ax.get_xticklabels(), rotation=45, ha='right')
    
    plt.tight_layout()
    plt.show()
    
    # Print summary table
    print("\n" + "="*80)
    print("SOLUTION COMPARISON SUMMARY")
    print("="*80)
    print(f"{'Model':<25} {'Test Loss':<12} {'Test Acc':<12} {'Train-Val Gap':<15}")
    print("-"*80)
    
    for name, res in results.items():
        gap = res['final_val_loss'] - res['final_train_loss']
        print(f"{name:<25} {res['test_loss']:<12.4f} {res['test_accuracy']:<12.4f} {gap:<15.4f}")

create_solution_comparison()
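
To see why the comparison pairs ReLU with He initialization, it helps to track how the signal's mean square changes as it passes through a stack of ReLU layers. The sketch below is a plain-Python approximation, not the Keras initializer itself: it assumes weights drawn from an untruncated normal distribution, a width of 64, and a depth of 10, and the function name `relu_signal_ratio` is hypothetical. It compares the He scale, std = sqrt(2 / fan_in), with the std of 2.0 used in the unstable network from Part 3.

```python
import math
import random

def relu_signal_ratio(weight_std, width=64, depth=10, seed=0):
    """Push one random vector through `depth` ReLU layers whose weights are drawn
    from N(0, weight_std**2) and return output mean-square / input mean-square."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    input_ms = sum(v * v for v in x) / width
    for _ in range(depth):
        # Each output unit is ReLU(dot(w, x)) with a fresh random weight row.
        x = [max(0.0, sum(rng.gauss(0.0, weight_std) * v for v in x))
             for _ in range(width)]
    return (sum(v * v for v in x) / width) / input_ms

fan_in = 64
he_std = math.sqrt(2.0 / fan_in)  # He scale: std = sqrt(2 / fan_in)

he_ratio = relu_signal_ratio(he_std)
large_ratio = relu_signal_ratio(2.0)

print(f"He scale (std={he_std:.3f}): ratio = {he_ratio:.3g}")
print(f"std=2.0:                   ratio = {large_ratio:.3g}")
```

With the He scale the ratio stays within a few orders of magnitude of 1, whereas a std of 2.0 multiplies the signal by a large factor at every layer. The same growth applies to gradients on the backward pass, which is the explosion seen in Part 3.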

## Part 6: Interactive Exercise
### Build Your Own Solution

In [None]:
# TODO: Students implement their own solution
def student_solution():
    """
    Exercise: Create a neural network that addresses gradient problems
    
    Requirements:
    1. Use appropriate activation functions
    2. Implement proper weight initialization
    3. Add regularization (BatchNorm or Dropout)
    4. Use gradient clipping if needed
    5. Choose an appropriate optimizer
    """
    
    # Your implementation here
    model = tf.keras.Sequential([
        # TODO: Add layers with proper configuration
        tf.keras.layers.Dense(128, activation='relu', 
                            kernel_initializer='he_normal',
                            input_shape=(20,)),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.3),
        
        tf.keras.layers.Dense(64, activation='relu',
                            kernel_initializer='he_normal'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),
        
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    
    # TODO: Choose optimizer with appropriate settings
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)
    
    model.compile(optimizer=optimizer,
                 loss='binary_crossentropy',
                 metrics=['accuracy'])
    
    return model

# Test student solution
print("Testing student solution...")
student_model = student_solution()
print("\nModel architecture:")
student_model.summary()

print("\n✅ Solution created successfully!")
print("\nKey features of a good solution:")
print("- ReLU activation (mitigates vanishing gradients)")
print("- He initialization (proper weight scaling)")
print("- Batch normalization (stabilizes training)")
print("- Dropout (prevents overfitting)")
print("- Adam optimizer with gradient clipping")

## Summary and Key Takeaways

### 🎯 What We Detected:
1. **Vanishing Gradients**: Gradient norms approaching zero in deep sigmoid networks
2. **Exploding Gradients**: Gradient norms exceeding safe thresholds
3. **Training Instability**: High variance in gradient norms
4. **Convergence Issues**: Slow or failed convergence

### 📊 What We Visualized:
1. **Gradient Flow**: Layer-wise gradient magnitude progression
2. **Explosion Events**: Epochs where gradients exceeded thresholds
3. **Clipping Effects**: Comparison of clipped vs unclipped gradients
4. **Solution Effectiveness**: Performance comparison of different techniques

### ✅ Solutions Implemented:
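1. **ReLU Activation**: Derivative is 1 for active units, so gradients are not shrunk at every layer
2. **He/Xavier Initialization**: Scales initial weights to the layer width so signal variance stays stable
3. **Batch Normalization**: Normalizes layer inputs per batch, which stabilizes their distributions
4. **Gradient Clipping**: Caps gradient norms to prevent explosion
5. **Dropout**: Regularization against overfitting
6. **Adaptive Optimizers**: Adam, RMSprop (per-parameter step sizes)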
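
As a rough illustration of what batch normalization does to one unit's pre-activations, the sketch below normalizes a made-up batch to zero mean and unit variance in plain Python. It omits the learned scale and shift parameters and the moving averages that a real `BatchNormalization` layer maintains, and the function name `batch_normalize` is hypothetical.

```python
import math

def batch_normalize(column, eps=1e-5):
    """Normalize one feature across the batch to zero mean and unit variance.
    The learned scale (gamma) and shift (beta) of a real layer are omitted."""
    mean = sum(column) / len(column)
    var = sum((v - mean) ** 2 for v in column) / len(column)
    return [(v - mean) / math.sqrt(var + eps) for v in column]

# Made-up pre-activations for one unit across a batch of 5 examples.
raw = [250.0, 310.0, 275.0, 290.0, 400.0]
normalized = batch_normalize(raw)

print([round(v, 3) for v in normalized])
```

Keeping each layer's inputs in this normalized range means the next layer sees a similar scale throughout training, regardless of how large the raw values become.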

### 📝 Best Practices:
- Start with ReLU for hidden layers
- Use He initialization with ReLU
- Add BatchNorm after dense layers
- Apply gradient clipping when gradients explode (clip norm between 1.0 and 5.0)
- Monitor gradient norms during training
- Use adaptive optimizers (Adam) for faster convergence