# Module 11: Debugging Neural Networks

**Difficulty**: ⭐⭐⭐ (Advanced)

**Estimated Time**: 60-75 minutes

**Prerequisites**: 
- [Module 02: Backpropagation and Gradient Descent](02_backpropagation_and_gradient_descent.ipynb)
- [Module 05: Feed-Forward Neural Networks with Keras](05_feedforward_neural_networks_keras.ipynb)
- [Module 06: Optimizers](06_optimizers_sgd_adam_rmsprop.ipynb)
- [Module 07: Regularization Techniques](07_regularization_techniques.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Identify common neural network training problems and their symptoms
2. Understand and detect vanishing and exploding gradients
3. Implement gradient checking for debugging backpropagation
4. Use proper weight initialization to prevent training issues
5. Apply systematic debugging strategies to troubleshoot neural networks
6. Visualize network internals (weights, activations, gradients) for diagnosis

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Deep learning libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from tensorflow.keras.datasets import mnist, fashion_mnist

# For gradient checking
from scipy.optimize import approx_fprime

# For reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Plotting configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Common Neural Network Problems

### Problem Categories:

1. **Not Learning At All**
   - Loss stays constant or changes minimally
   - Accuracy stuck at random chance

2. **Learning Too Slowly**
   - Loss decreases but very gradually
   - Takes many epochs to converge

3. **Training Instability**
   - Loss oscillates wildly
   - NaN or Inf values appear
   - Model diverges instead of converging

4. **Overfitting**
   - Training accuracy high, validation accuracy low
   - Large gap between training and validation loss

5. **Underfitting**
   - Both training and validation accuracy are low
   - Model is too simple for the task

Let's simulate and diagnose each of these problems.
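
As a rough first pass, several of these categories can be told apart from the loss curves alone. The sketch below is a hypothetical helper (`diagnose_curves` and its thresholds are illustrative assumptions, not part of Keras) that triages a run from per-epoch training and validation losses. It covers only instability, lack of learning, and overfitting; underfitting and slow learning need a task-specific baseline to judge.

```python
import math

def diagnose_curves(train_loss, val_loss, min_improvement=0.05, gap_ratio=1.5):
    """Rough triage of a training run from per-epoch loss values.

    The thresholds are illustrative defaults, not established constants.
    """
    # Any NaN/Inf in the training loss means the run diverged
    if any(math.isnan(v) or math.isinf(v) for v in train_loss):
        return "unstable"
    # Relative drop in training loss from the first to the last epoch
    start = max(abs(train_loss[0]), 1e-12)
    improvement = (train_loss[0] - train_loss[-1]) / start
    if improvement < min_improvement:
        return "not learning"
    # Validation loss far above training loss suggests overfitting
    if val_loss[-1] > gap_ratio * train_loss[-1]:
        return "overfitting"
    return "ok"

print(diagnose_curves([2.30, 2.30, 2.29], [2.30, 2.30, 2.30]))
print(diagnose_curves([2.3, 0.9, float("nan")], [2.3, 1.0, 1.2]))
print(diagnose_curves([2.3, 0.6, 0.1], [2.3, 0.8, 0.9]))
```

A heuristic like this is only a starting point; the sections that follow look inside the network to find the actual cause.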

## 3. Load Data for Debugging Experiments

In [None]:
# Load Fashion-MNIST
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Normalize
X_train_full = X_train_full.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# Flatten images
X_train_full = X_train_full.reshape(-1, 784)
X_test = X_test.reshape(-1, 784)

# Create validation split
validation_split = int(0.9 * len(X_train_full))
X_train = X_train_full[:validation_split]
y_train = y_train_full[:validation_split]
X_val = X_train_full[validation_split:]
y_val = y_train_full[validation_split:]

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_val.shape}")
print(f"Test set: {X_test.shape}")

## 4. Problem 1: Model Not Learning (Dead Neurons)

**Symptoms**: Loss barely changes, accuracy near random guess

**Common Causes**:
- Learning rate too small
- Wrong activation function (e.g., saturating activations)
- Dead ReLU neurons (always output 0)
- Input not normalized
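
To see why a ReLU unit can die, consider a minimal NumPy sketch of a single dense layer. It assumes non-negative inputs (like normalized pixels) and sets every weight to -1, so each pre-activation is negative, ReLU outputs 0, and no gradient flows back through the unit. The array shapes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Inputs scaled to [0, 1), like normalized pixel values
X = rng.uniform(0.0, 1.0, size=(8, 784))

# Every weight set to -1 and every bias to 0 (a constant negative initializer)
W = np.full((784, 128), -1.0)
b = np.zeros(128)

z = X @ W + b                       # pre-activations: all negative
a = np.maximum(z, 0.0)              # ReLU output
relu_grad = (z > 0).astype(float)   # derivative of ReLU with respect to z

print("max pre-activation:", z.max())
print("fraction of zero outputs:", np.mean(a == 0))
print("fraction of units passing gradient:", relu_grad.mean())
```

Because the ReLU derivative is 0 for every unit and every sample, the weights feeding these units receive no update, whatever the learning rate.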

In [None]:
# Create a poorly initialized model that won't learn well
def create_dead_model():
    """Model with issues that prevent learning."""
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        layers.Dense(128, activation='relu', 
                    kernel_initializer=keras.initializers.Constant(-1.0)),  # Bad init!
        layers.Dense(64, activation='relu',
                    kernel_initializer=keras.initializers.Constant(-1.0)),
        layers.Dense(10, activation='softmax')
    ])
    
    # Very small learning rate
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=1e-7),  # Too small!
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Train the problematic model
dead_model = create_dead_model()
print("Training model with dead neurons...")
dead_history = dead_model.fit(
    X_train[:5000], y_train[:5000],
    validation_split=0.2,
    epochs=10,
    batch_size=32,
    verbose=0
)

print(f"Final training accuracy: {dead_history.history['accuracy'][-1]:.4f}")
print(f"Expected random guess: {1/10:.4f}")
print("Notice: Model barely learns!")

In [None]:
# Diagnose: Check activation outputs
def check_activation_statistics(model, X_sample):
    """
    Report the percentage of dead neurons in each hidden layer.
    A neuron counts as dead if it outputs 0 for every sample in X_sample.
    """
    layer_outputs = []
    
    # Create intermediate models to extract each hidden layer's output
    for layer in model.layers[:-1]:  # Exclude output layer
        intermediate_model = keras.Model(
            inputs=model.input,
            outputs=layer.output
        )
        output = intermediate_model.predict(X_sample, verbose=0)
        layer_outputs.append(output)
    
    # Analyze activations
    print("Activation Statistics:")
    print("=" * 50)
    for i, activations in enumerate(layer_outputs):
        # True for each neuron whose output is 0 across the whole sample
        dead_mask = np.all(activations == 0, axis=0)
        dead_percentage = dead_mask.mean() * 100
        
        print(f"Layer {i}: {dead_percentage:.1f}% dead neurons")
        if dead_percentage > 50:
            print(f"  ⚠️  WARNING: Over 50% neurons are dead!")

# Check the dead model
check_activation_statistics(dead_model, X_val[:100])

## 5. Problem 2: Vanishing Gradients

**Vanishing Gradients** occur when gradients become extremely small as they backpropagate through layers.

**Mathematical Intuition**:

During backpropagation, gradients are multiplied across layers:

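$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial w_1}$$

where $a_k$ is the activation of layer $k$ and $w_1$ are the first layer's weights. Each factor $\partial a_k / \partial a_{k-1}$ combines the activation derivative with the layer's weights. If these factors have magnitude below 1, the product shrinks exponentially with depth. The sigmoid derivative never exceeds 0.25, which is why deep sigmoid networks are especially prone to this.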
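
To make the shrinkage concrete, the sketch below multiplies the largest possible sigmoid derivative across a stack of layers. It ignores the weight matrices that also enter each factor (an assumption made purely for illustration), so it bounds only the activation part of the product, not the exact gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at z = 0
max_slope = sigmoid_derivative(0.0)
print(f"largest sigmoid slope: {max_slope}")

# Best-case activation factor after backpropagating through `depth` sigmoid layers
for depth in (1, 5, 10):
    print(f"depth={depth:2d}  factor <= {max_slope ** depth:.2e}")
```

Even in this best case, ten sigmoid layers scale the signal by less than one millionth, so early layers learn far more slowly than late ones.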

**Common Causes**:
- Deep networks with sigmoid/tanh activations
- Poor weight initialization
- Very deep networks without skip connections

In [None]:
# Create deep network with vanishing gradient problem
def create_vanishing_gradient_model():
    """Deep network with sigmoid activations (prone to vanishing gradients)."""
    model = keras.Sequential([
        keras.Input(shape=(784,))
    ])
    
    # Add many layers with sigmoid activation
    for _ in range(10):
        model.add(layers.Dense(50, activation='sigmoid'))  # Sigmoid saturates!
    
    model.add(layers.Dense(10, activation='softmax'))
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

vanishing_model = create_vanishing_gradient_model()
print("Deep network with sigmoid activations:")
print(f"Total layers: {len(vanishing_model.layers)}")

In [None]:
# Custom callback to monitor gradient norms
class GradientMonitor(keras.callbacks.Callback):
    """Monitor gradient magnitudes during training."""
    
    def __init__(self, X_sample, y_sample):
        super().__init__()
        self.X_sample = X_sample
        self.y_sample = y_sample
        self.gradient_norms = []
    
    def on_epoch_end(self, epoch, logs=None):
        # Loss on the monitoring sample (all models in this notebook
        # use sparse categorical cross-entropy)
        loss_fn = keras.losses.SparseCategoricalCrossentropy()
        with tf.GradientTape() as tape:
            predictions = self.model(self.X_sample, training=True)
            loss = loss_fn(self.y_sample, predictions)
        
        weights = self.model.trainable_weights
        gradients = tape.gradient(loss, weights)
        
        # Record one norm per layer: keep kernel (2-D) gradients, skip biases
        norms = []
        for weight, grad in zip(weights, gradients):
            if grad is not None and len(weight.shape) == 2:
                norms.append(tf.norm(grad).numpy())
        
        self.gradient_norms.append(norms)

# Train with gradient monitoring
gradient_monitor = GradientMonitor(X_train[:100], y_train[:100])

print("Training model with vanishing gradient problem...")
vanishing_history = vanishing_model.fit(
    X_train[:5000], y_train[:5000],
    validation_split=0.2,
    epochs=5,
    batch_size=32,
    callbacks=[gradient_monitor],
    verbose=0
)

print(f"Final training accuracy: {vanishing_history.history['accuracy'][-1]:.4f}")

In [None]:
# Visualize gradient norms across layers
final_gradients = gradient_monitor.gradient_norms[-1]

plt.figure(figsize=(12, 6))
plt.plot(range(len(final_gradients)), final_gradients, 'o-', linewidth=2, markersize=8)
plt.xlabel('Layer Index', fontsize=12)
plt.ylabel('Gradient Norm', fontsize=12)
plt.title('Gradient Norms Across Layers (Vanishing Gradient Problem)', 
          fontsize=14, fontweight='bold')
plt.yscale('log')  # Log scale to see vanishing effect
plt.grid(True, alpha=0.3)
plt.axhline(y=1e-5, color='red', linestyle='--', label='Very Small Gradient')
plt.legend()
plt.tight_layout()
plt.show()

print("\nGradient Analysis:")
print(f"First layer gradient norm: {final_gradients[0]:.2e}")
print(f"Last layer gradient norm: {final_gradients[-1]:.2e}")
if final_gradients[0] < 1e-5:
    print("⚠️  WARNING: Vanishing gradients detected in early layers!")

## 6. Problem 3: Exploding Gradients

**Exploding Gradients** occur when gradients become extremely large, causing weight updates to overshoot.

**Symptoms**:
- Loss becomes NaN or Inf
- Weights grow exponentially
- Model diverges instead of converging

**Common Causes**:
- Learning rate too high
- Poor weight initialization (weights too large)
- Very deep or recurrent architectures trained without normalization or gradient clipping

In [None]:
# Create model with exploding gradient problem
def create_exploding_gradient_model():
    """Model with large initialization and high learning rate."""
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        # Bad initialization: weights too large
        layers.Dense(128, activation='relu',
                    kernel_initializer=keras.initializers.RandomNormal(mean=0, stddev=10)),
        layers.Dense(64, activation='relu',
                    kernel_initializer=keras.initializers.RandomNormal(mean=0, stddev=10)),
        layers.Dense(10, activation='softmax')
    ])
    
    # Very high learning rate
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=10.0),  # Too high!
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

exploding_model = create_exploding_gradient_model()

# Custom callback to detect NaN
class NaNTerminator(keras.callbacks.Callback):
    """Stop training if NaN is detected."""
    
    def on_batch_end(self, batch, logs=None):
        loss = logs.get('loss')
        if loss is not None and (np.isnan(loss) or np.isinf(loss)):
            print(f"\n⚠️  NaN/Inf detected at batch {batch}! Stopping training.")
            self.model.stop_training = True

print("Training model with exploding gradient problem...")
try:
    exploding_history = exploding_model.fit(
        X_train[:1000], y_train[:1000],
        epochs=5,
        batch_size=32,
        callbacks=[NaNTerminator()],
        verbose=0
    )
except Exception as e:
    print(f"Training failed: {e}")

print("\nNotice: Model likely diverged or produced NaN values!")
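
A common remedy for exploding gradients is gradient clipping. The NumPy sketch below shows one variant, clipping by global norm: all gradients are rescaled together so that their combined L2 norm never exceeds a chosen limit. The helper name `clip_by_global_norm` and the example arrays are made up for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Leave small gradients untouched; shrink large ones proportionally
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

# Two made-up gradient arrays with a large combined norm
grads = [np.full((3, 3), 100.0), np.full(3, -50.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print(f"norm before: {norm_before:.2f}, after: {norm_after:.2f}")
```

Rescaling preserves the direction of the update and only limits its size. Keras optimizers offer clipping through constructor arguments such as `clipnorm`, `global_clipnorm`, and `clipvalue`.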

## 7. Solution: Proper Weight Initialization

**Weight Initialization Methods**:

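1. **Xavier/Glorot Initialization** (for sigmoid/tanh):
   $$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in} + n_{out}}}$$

2. **He Initialization** (for ReLU):
   $$W \sim \mathcal{N}\left(0, \sigma^2\right), \quad \sigma = \sqrt{\frac{2}{n_{in}}}$$

Here $n_{in}$ and $n_{out}$ are the numbers of input and output units of the layer, and $\sigma$ is the standard deviation. Both schemes keep the variance of activations and gradients roughly constant from layer to layer, which makes vanishing or exploding gradients much less likely at the start of training.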
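
The effect of the weight scale can be checked without training anything. The NumPy sketch below pushes random inputs through ten ReLU layers and tracks the root-mean-square (RMS) activation, once with He scaling (standard deviation $\sqrt{2/n_{in}}$) and once with a fixed standard deviation of 0.01. The layer width, depth, and helper name `forward_rms` are arbitrary choices for illustration.

```python
import numpy as np

def forward_rms(weight_std_fn, depth=10, width=256, batch=512, seed=0):
    """Push random inputs through `depth` ReLU layers; return the RMS activation per layer."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    rms = []
    for _ in range(depth):
        # Zero-mean normal weights with the requested standard deviation
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(a @ W, 0.0)
        rms.append(float(np.sqrt(np.mean(a ** 2))))
    return rms

he_rms = forward_rms(lambda n_in: np.sqrt(2.0 / n_in))  # He: std = sqrt(2 / n_in)
small_rms = forward_rms(lambda n_in: 0.01)              # fixed small std

print(f"He init    : layer 1 RMS = {he_rms[0]:.3f}, layer 10 RMS = {he_rms[-1]:.3f}")
print(f"std = 0.01 : layer 1 RMS = {small_rms[0]:.3e}, layer 10 RMS = {small_rms[-1]:.3e}")
```

With He scaling the activation magnitude stays near its starting level, while the small fixed scale shrinks it by roughly a constant factor per layer. The same argument applies to gradients flowing backward.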

In [None]:
# Compare different initializations
def create_model_with_init(init_method):
    """Create model with specified initialization."""
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        layers.Dense(128, activation='relu', kernel_initializer=init_method),
        layers.Dense(64, activation='relu', kernel_initializer=init_method),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Test different initializations
initializers = {
    'Random Normal': keras.initializers.RandomNormal(stddev=0.01),
    'Xavier/Glorot': keras.initializers.GlorotUniform(),
    'He': keras.initializers.HeNormal()  # Best for ReLU
}

init_results = {}

for name, init in initializers.items():
    print(f"\nTraining with {name} initialization...")
    model = create_model_with_init(init)
    
    history = model.fit(
        X_train[:5000], y_train[:5000],
        validation_split=0.2,
        epochs=10,
        batch_size=32,
        verbose=0
    )
    
    init_results[name] = history
    print(f"Final accuracy: {history.history['accuracy'][-1]:.4f}")

In [None]:
# Compare initialization methods
plt.figure(figsize=(12, 5))

for name, history in init_results.items():
    plt.plot(history.history['val_accuracy'], label=name, linewidth=2)

plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Validation Accuracy', fontsize=12)
plt.title('Impact of Weight Initialization on Training', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Insight: He initialization is the recommended default for ReLU activations.")

## 8. Gradient Checking

**Gradient Checking** verifies that your backpropagation implementation is correct by comparing analytical gradients with numerical gradients.

**Numerical Gradient** (slow, but a reliable reference when $\epsilon$ suits the floating-point precision):
$$\frac{\partial f}{\partial \theta} \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$$

**Use Case**: Debugging custom layers or loss functions.
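
Before applying the idea to a Keras model, it helps to see it on a function whose gradient is known in closed form. The sketch below uses a linear model with mean squared error in float64 NumPy; the function names and data are made up for illustration.

```python
import numpy as np

def loss(theta, X, y):
    """Mean squared error of a linear model."""
    return np.mean((X @ theta - y) ** 2)

def analytical_grad(theta, X, y):
    """Closed-form gradient of the loss above."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def numerical_grad(f, theta, eps=1e-6):
    """Central-difference estimate, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
theta = rng.standard_normal(5)

g_ana = analytical_grad(theta, X, y)
g_num = numerical_grad(lambda t: loss(t, X, y), theta)
rel_err = np.linalg.norm(g_ana - g_num) / (np.linalg.norm(g_ana) + np.linalg.norm(g_num))
print(f"relative error: {rel_err:.2e}")
```

In float64 the two gradients should agree to many digits. Introducing a deliberate mistake in `analytical_grad` (for example, dropping the factor 2) makes the relative error jump, which is exactly what the check is meant to catch.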

In [None]:
def gradient_check(model, X, y, epsilon=1e-2, num_checks=10):
    """
    Perform gradient checking on a model.
    Compares analytical gradients (from backprop) with numerical gradients.
    
    Keras computes in float32 by default (about 7 significant digits), so a
    perturbation as small as 1e-7 is lost to rounding. The default epsilon is
    therefore much larger than the values used for float64 checks.
    
    Args:
        model: Keras model trained with sparse categorical cross-entropy
        X: Input data
        y: Integer labels
        epsilon: Perturbation size for the central difference
        num_checks: Number of randomly chosen weights to compare
    
    Returns:
        Dictionary with gradient comparison results
    """
    loss_fn = keras.losses.SparseCategoricalCrossentropy()
    
    def compute_loss():
        # training=False for both estimates so they evaluate the same function
        return loss_fn(y, model(X, training=False))
    
    # Get analytical gradients
    with tf.GradientTape() as tape:
        loss = compute_loss()
    
    analytical_grads = tape.gradient(loss, model.trainable_weights)
    
    # Check first weight tensor only (for efficiency)
    weight = model.trainable_weights[0]
    original_weights = weight.numpy()
    analytical_grad = analytical_grads[0].numpy().flatten()
    
    # Pick a few random weights to perturb
    num_checks = min(num_checks, len(analytical_grad))
    indices = np.random.choice(len(analytical_grad), num_checks, replace=False)
    
    differences = []
    
    for idx in indices:
        weight_flat = original_weights.flatten()  # flatten() returns a copy
        original_value = weight_flat[idx]
        
        # Compute f(theta + epsilon)
        weight_flat[idx] = original_value + epsilon
        weight.assign(weight_flat.reshape(weight.shape))
        loss_plus = compute_loss().numpy()
        
        # Compute f(theta - epsilon)
        weight_flat[idx] = original_value - epsilon
        weight.assign(weight_flat.reshape(weight.shape))
        loss_minus = compute_loss().numpy()
        
        # Restore original weights
        weight.assign(original_weights)
        
        # Compute numerical gradient (central difference)
        numerical_grad = (loss_plus - loss_minus) / (2 * epsilon)
        
        # Compare using a relative difference
        analytical_val = analytical_grad[idx]
        difference = abs(numerical_grad - analytical_val)
        relative_diff = difference / (abs(numerical_grad) + abs(analytical_val) + 1e-8)
        
        differences.append(relative_diff)
    
    avg_difference = np.mean(differences)
    max_difference = np.max(differences)
    
    return {
        'average_difference': avg_difference,
        'max_difference': max_difference,
        'all_differences': differences
    }

# Test gradient checking
test_model = create_model_with_init('he_normal')
X_sample = X_train[:10]
y_sample = y_train[:10]

print("Performing gradient check...")
results = gradient_check(test_model, X_sample, y_sample)

print("\nGradient Check Results:")
print(f"Average relative difference: {results['average_difference']:.2e}")
print(f"Max relative difference: {results['max_difference']:.2e}")

# Thresholds are loose because the forward pass runs in float32
if results['average_difference'] < 1e-2:
    print("✅ Gradients agree within float32 precision")
elif results['average_difference'] < 1e-1:
    print("⚠️  Gradients roughly agree; float32 rounding may explain the gap")
else:
    print("❌ Gradient computation may have errors!")

## 9. Debugging Checklist

When your neural network isn't working, follow this systematic checklist:

### Step 1: Start Simple
- [ ] Try overfitting a single batch (training accuracy should approach 100%)
- [ ] If the model cannot overfit a single batch, suspect an implementation bug or a badly mis-set learning rate

### Step 2: Check Data
- [ ] Visualize input data
- [ ] Check for correct normalization
- [ ] Verify labels match inputs
- [ ] Check for class imbalance

### Step 3: Check Model Architecture
- [ ] Verify input/output shapes
- [ ] Check activation functions
- [ ] Ensure sufficient capacity

### Step 4: Check Initialization and Learning Rate
- [ ] Use proper initialization (He for ReLU)
- [ ] Start with standard learning rate (1e-3 for Adam)
- [ ] Use learning rate finder if needed

### Step 5: Monitor Training
- [ ] Plot loss curves
- [ ] Check gradient norms
- [ ] Monitor activation statistics
- [ ] Look for NaN/Inf values

### Step 6: Regularization (only if overfitting)
- [ ] Add dropout
- [ ] Add L2 regularization
- [ ] Use data augmentation
- [ ] Reduce model capacity
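
Several items in Step 2 can be automated. The sketch below is a hypothetical helper (`check_data` and its `imbalance_ratio` threshold are assumptions for illustration) that checks shapes, value range, label range, and class balance, assuming features are meant to lie in [0, 1] and labels are integers from 0 to `num_classes - 1`.

```python
import numpy as np

def check_data(X, y, num_classes, imbalance_ratio=10):
    """Return a list of warnings about a feature matrix X and integer labels y."""
    problems = []
    if len(X) != len(y):
        problems.append(f"X has {len(X)} rows but y has {len(y)} labels")
    if not np.all(np.isfinite(X)):
        problems.append("X contains NaN or Inf")
    if X.min() < 0.0 or X.max() > 1.0:
        problems.append(f"X range [{X.min():.2f}, {X.max():.2f}] is outside [0, 1]")
    if y.min() < 0 or y.max() >= num_classes:
        problems.append("labels fall outside 0..num_classes-1")
    else:
        # Class balance is only meaningful once the labels are valid
        counts = np.bincount(y, minlength=num_classes)
        if counts.min() == 0 or counts.max() > imbalance_ratio * counts.min():
            problems.append(f"class counts look imbalanced: {counts.tolist()}")
    return problems

rng = np.random.default_rng(0)
X_raw = rng.integers(0, 256, size=(200, 784)).astype("float32")  # unnormalized pixels
y = rng.integers(0, 10, size=200)

print(check_data(X_raw, y, num_classes=10))          # flags the [0, 255] range
print(check_data(X_raw / 255.0, y, num_classes=10))
```

Checks like these do not replace looking at a few samples by eye, but they catch forgotten normalization and label mix-ups cheaply.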

In [None]:
# Implement "overfit single batch" test
def overfit_single_batch_test(model, X_batch, y_batch, epochs=100):
    """
    Test if model can overfit a single batch.
    This verifies the model has learning capacity and backprop works.
    
    Returns:
        True if model successfully overfits, False otherwise
    """
    print("Testing if model can overfit single batch...")
    
    history = model.fit(
        X_batch, y_batch,
        epochs=epochs,
        batch_size=len(X_batch),
        verbose=0
    )
    
    final_acc = history.history['accuracy'][-1]
    
    print(f"Final accuracy on single batch: {final_acc:.4f}")
    
    if final_acc > 0.95:
        print("✅ Model can learn! Backprop and optimization are working.")
        return True
    else:
        print("❌ Model cannot overfit single batch. Check:")
        print("   - Model architecture (sufficient capacity?)")
        print("   - Learning rate (too small?)")
        print("   - Loss function (correct for task?)")
        print("   - Implementation bugs")
        return False

# Test on a good model
good_model = create_model_with_init('he_normal')
single_batch_X = X_train[:32]
single_batch_y = y_train[:32]

overfit_single_batch_test(good_model, single_batch_X, single_batch_y)

## 10. Visualization Tools for Debugging

In [None]:
def visualize_weights_and_activations(model, X_sample):
    """
    Visualize weight distributions and activation distributions.
    Helps identify dead neurons, saturation, etc.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 8))
    
    # Plot weight distributions for first 3 layers
    for i in range(min(3, len(model.layers))):
        layer = model.layers[i]
        if hasattr(layer, 'kernel'):
            weights = layer.kernel.numpy().flatten()
            
            axes[0, i].hist(weights, bins=50, alpha=0.7, edgecolor='black')
            axes[0, i].set_title(f'Layer {i} Weights', fontweight='bold')
            axes[0, i].set_xlabel('Weight Value')
            axes[0, i].set_ylabel('Frequency')
            axes[0, i].axvline(0, color='red', linestyle='--', alpha=0.5)
            axes[0, i].grid(True, alpha=0.3)
    
    # Plot activation distributions
    for i in range(min(3, len(model.layers))):
        layer = model.layers[i]
        intermediate_model = keras.Model(
            inputs=model.input,
            outputs=layer.output
        )
        activations = intermediate_model.predict(X_sample, verbose=0).flatten()
        
        axes[1, i].hist(activations, bins=50, alpha=0.7, edgecolor='black')
        axes[1, i].set_title(f'Layer {i} Activations', fontweight='bold')
        axes[1, i].set_xlabel('Activation Value')
        axes[1, i].set_ylabel('Frequency')
        axes[1, i].axvline(0, color='red', linestyle='--', alpha=0.5)
        axes[1, i].grid(True, alpha=0.3)
        
        # Share of zero activations (a high value hints at dead ReLU units)
        zero_percentage = (np.sum(activations == 0) / len(activations)) * 100
        axes[1, i].text(0.95, 0.95, f'{zero_percentage:.1f}% zeros',
                       transform=axes[1, i].transAxes,
                       verticalalignment='top',
                       horizontalalignment='right',
                       bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.suptitle('Weight and Activation Distributions', 
                 fontsize=14, fontweight='bold', y=1.00)
    plt.tight_layout()
    plt.show()

# Train a model and visualize
debug_model = create_model_with_init('he_normal')
debug_model.fit(X_train[:1000], y_train[:1000], epochs=5, batch_size=32, verbose=0)

visualize_weights_and_activations(debug_model, X_val[:100])

## 11. Exercise 1: Diagnose and Fix a Broken Model

**Task**: You're given a model that isn't learning. Diagnose the problem and fix it.

**Requirements**:
1. Train the broken model and observe the symptoms
2. Use debugging techniques to identify the root cause
3. Fix the model and verify it learns properly
4. Document what was wrong and how you fixed it

In [None]:
# Broken model for you to fix
def create_broken_model():
    """This model has multiple issues. Can you find and fix them all?"""
    model = keras.Sequential([
        keras.Input(shape=(784,)),
        layers.Dense(1000, activation='sigmoid'),
        layers.Dense(1000, activation='sigmoid'),
        layers.Dense(1000, activation='sigmoid'),
        layers.Dense(10, activation='softmax')
    ])
    
    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.00001),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# YOUR CODE HERE
# 1. Train the broken model and observe issues
# 2. Apply debugging techniques (gradient monitoring, activation checking, etc.)
# 3. Identify the problems
# 4. Create a fixed version
# 5. Compare performance

pass  # Replace with your implementation

## 12. Exercise 2: Implement Learning Rate Warmup

**Task**: Implement learning rate warmup to prevent early training instability.

**Concept**: Start with a very small learning rate and gradually increase it over the first few epochs.

**Requirements**:
1. Create a custom callback that implements LR warmup
2. Warmup should gradually increase LR from 0 to target LR over N epochs
3. After warmup, use constant or decaying LR
4. Compare training with and without warmup
5. Show when warmup helps most (large models, large batch sizes)
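
As a starting hint, the schedule itself can be written as a plain function before wiring it into a callback. The sketch below assumes a linear ramp updated once per epoch; `warmup_lr` and its defaults are hypothetical. It starts at `target_lr / warmup_epochs` rather than exactly 0, because a learning rate of 0 would waste the first epoch.

```python
def warmup_lr(epoch, target_lr=1e-3, warmup_epochs=5):
    """Linear warmup: ramp up to target_lr over warmup_epochs, then hold it constant."""
    if epoch < warmup_epochs:
        return target_lr * (epoch + 1) / warmup_epochs
    return target_lr

# Learning rate for the first eight epochs (epochs are 0-indexed, as in Keras callbacks)
schedule = [warmup_lr(epoch) for epoch in range(8)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

The exercise still asks for a custom callback; this only pins down the values that callback should set.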

In [None]:
# YOUR CODE HERE
# Hint: Create a custom callback that modifies learning rate
# class LRWarmup(keras.callbacks.Callback):
#     def on_epoch_begin(self, epoch, logs=None):
#         # Implement warmup logic here

pass  # Replace with your implementation

## 13. Exercise 3: Build a Comprehensive Training Monitor

**Task**: Create a training monitor that tracks multiple debugging metrics.

**Requirements**:
1. Monitor gradient norms for each layer
2. Track percentage of dead neurons (activation = 0)
3. Record weight statistics (mean, std, min, max)
4. Detect potential issues (vanishing/exploding gradients, dead neurons)
5. Create a visualization dashboard showing all metrics
6. Issue warnings when problems are detected

In [None]:
# YOUR CODE HERE
# Hint: Extend the GradientMonitor callback
# Add methods to track multiple metrics
# Create a comprehensive plotting function

pass  # Replace with your implementation

## 14. Summary

### Key Concepts Covered:

1. **Common Neural Network Problems**
   - Not learning: dead neurons, learning rate too small, unnormalized input
   - Learning too slowly: poor initialization, small LR
   - Training instability: exploding gradients, high LR
   - Overfitting/Underfitting: capacity issues

2. **Gradient Problems**
   - Vanishing gradients: common in deep networks with sigmoid
   - Exploding gradients: caused by large weights or high LR
   - Solutions: proper initialization, ReLU, gradient clipping

3. **Weight Initialization**
   - He initialization for ReLU: $W \sim \mathcal{N}(0, \sigma^2)$ with $\sigma = \sqrt{2/n_{in}}$
   - Xavier/Glorot for sigmoid/tanh
   - Critical for training stability

4. **Gradient Checking**
   - Numerical vs analytical gradients
   - Useful for debugging custom implementations
   - Expensive but valuable for validation

5. **Systematic Debugging**
   - Overfit single batch test
   - Check data, architecture, initialization
   - Monitor gradients and activations
   - Use visualization tools

### Debugging Best Practices:

- **Start simple**: Verify model can overfit small data
- **Monitor everything**: Gradients, weights, activations
- **Visualize**: Plots reveal patterns that raw numbers hide
- **Use proper defaults**: He init + ReLU + Adam
- **Test incrementally**: Add complexity gradually
- **Keep logs**: Track experiments systematically

### Quick Reference: Common Fixes

| Problem | Likely Cause | Solution |
|---------|-------------|----------|
| Not learning | LR too small, dead neurons | Increase LR, check activation |
| NaN loss | LR too large, exploding gradients | Reduce LR, gradient clipping |
| Slow convergence | Poor initialization, bad architecture | Use He init, add capacity |
| Vanishing gradients | Deep + sigmoid, bad init | Use ReLU, proper init |
| Dead neurons | Bad init, high LR | He init, reduce LR |

### What's Next?

- [Module 12: Model Interpretation and Visualization](12_model_interpretation_visualization.ipynb)
- Advanced debugging: TensorBoard, profiling, distributed training issues

### Additional Resources:

1. "Delving Deep into Rectifiers" (He et al., 2015) - He initialization paper
2. "Understanding the difficulty of training deep feedforward neural networks" (Glorot & Bengio, 2010)
3. Andrej Karpathy's "Recipe for Training Neural Networks": http://karpathy.github.io/2019/04/25/recipe/
4. TensorFlow Debugger documentation: https://www.tensorflow.org/guide/debugger