# Module 05: Feed-Forward Neural Networks with Keras

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 45-60 minutes

**Prerequisites**: 
- [Module 04: Introduction to TensorFlow and Keras](04_introduction_to_tensorflow_keras.ipynb)
- [Module 02: Backpropagation and Gradient Descent](02_backpropagation_and_gradient_descent.ipynb)
- [Module 01: Perceptrons and Activation Functions](01_perceptrons_and_activation_functions.ipynb)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Design deep neural network architectures with multiple hidden layers
2. Understand and apply principles for hidden layer design (depth vs width tradeoffs)
3. Build and train deep networks on realistic datasets (Fashion-MNIST and CIFAR-10)
4. Monitor training progress and detect overfitting using a held-out validation set
5. Choose a validation split that gives a reliable estimate of model generalization
6. Interpret training/validation curves to diagnose model performance issues

## 1. Setup and Imports

Let's import all the necessary libraries for building deep neural networks.

In [None]:
# Core libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import fashion_mnist, cifar10
from tensorflow.keras.utils import to_categorical

# Scikit-learn for evaluation
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Visualization settings
plt.style.use('seaborn-v0_8-darkgrid')  # 'seaborn-v0_8-*' style names require Matplotlib >= 3.6
sns.set_palette("husl")
%matplotlib inline

print(f"TensorFlow version: {tf.__version__}")
print(f"Keras version: {keras.__version__}")

## 2. Understanding Deep Neural Networks

### What Makes a Network "Deep"?

A **deep neural network** is characterized by having multiple hidden layers between the input and output layers. The term "deep" refers to the depth of the network (number of layers), not the width (number of neurons per layer).

### Architecture Components:

1. **Input Layer**: Receives raw features (e.g., pixel values)
2. **Hidden Layers**: Extract hierarchical features
   - Early layers: Low-level features (edges, textures)
   - Middle layers: Mid-level features (shapes, patterns)
   - Deep layers: High-level features (object parts)
3. **Output Layer**: Produces final predictions

### Mathematical Representation:

For a network with $L$ layers ($L-1$ hidden layers followed by a softmax output layer):

$$
\begin{align}
\mathbf{h}^{(1)} &= \sigma(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) \\
\mathbf{h}^{(2)} &= \sigma(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}) \\
&\vdots \\
\mathbf{h}^{(L-1)} &= \sigma(\mathbf{W}^{(L-1)} \mathbf{h}^{(L-2)} + \mathbf{b}^{(L-1)}) \\
\hat{\mathbf{y}} &= \text{softmax}(\mathbf{W}^{(L)} \mathbf{h}^{(L-1)} + \mathbf{b}^{(L)})
\end{align}
$$

Where:
- $\mathbf{x}$ is the input vector
- $\mathbf{h}^{(i)}$ is the activation vector of hidden layer $i$
- $\mathbf{W}^{(i)}$ and $\mathbf{b}^{(i)}$ are weights and biases for layer $i$
- $\sigma$ is the activation function (e.g., ReLU)
- $\hat{\mathbf{y}}$ is the predicted output
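
The equations above can be traced in a few lines of NumPy. This is a minimal sketch, not how Keras implements the forward pass: the layer sizes, the random weights, and the helper names `relu`, `softmax`, and `forward` are assumptions chosen for illustration, with ReLU standing in for $\sigma$.

```python
import numpy as np

def relu(z):
    """Element-wise ReLU, used here as the hidden activation sigma."""
    return np.maximum(0.0, z)

def softmax(z):
    """Softmax over a 1-D vector; subtracting the max avoids overflow."""
    shifted = z - np.max(z)
    exp = np.exp(shifted)
    return exp / exp.sum()

def forward(x, weights, biases):
    """Apply h = relu(W h + b) for every hidden layer, then a softmax output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return softmax(weights[-1] @ h + biases[-1])

# Hypothetical sizes: 784 inputs, two hidden layers of 64 units, 10 classes (L = 3)
layer_sizes = [784, 64, 64, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 0.05, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

x = rng.random(784)            # stand-in for one flattened 28x28 image
y_hat = forward(x, weights, biases)
print(y_hat.shape)             # one probability per class
print(y_hat.sum())             # probabilities sum to 1 (up to rounding)
```

Each $\mathbf{W}^{(i)}$ has shape (units in layer $i$, units in layer $i-1$), which is why the sketch multiplies `W @ h`. Keras stores Dense kernels transposed relative to this and processes whole batches at once.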

## 3. Depth vs Width: Design Principles

### The Depth vs Width Tradeoff

When designing neural networks, you must decide:
- **Depth**: How many layers?
- **Width**: How many neurons per layer?

### General Guidelines:

| Aspect | Deeper Networks (More Layers) | Wider Networks (More Neurons) |
|--------|-------------------------------|-------------------------------|
| **Representation Power** | Can learn hierarchical features | Can learn complex single-level patterns |
| **Parameters** | Often fewer parameters for comparable expressive power | Often more parameters needed for comparable expressive power |
| **Training** | Can be harder to train (vanishing gradients) | Generally easier to train |
| **Generalization** | Often better (hierarchical inductive bias) | May overfit without regularization |
| **Computation** | More sequential operations | More parallel operations |

### Rule of Thumb:
- Start with **2-3 hidden layers** of moderate width (64-256 neurons)
- For complex tasks (image classification), prefer **deeper networks**
- For simpler tasks (tabular data), wider shallow networks may suffice
- Use validation performance to guide architecture choices

In [None]:
# Helper function to visualize network architecture
def visualize_architecture(layer_sizes, title="Network Architecture"):
    """
    Visualize neural network architecture.
    
    Parameters:
    -----------
    layer_sizes : list
        Number of neurons in each layer [input, hidden1, hidden2, ..., output]
    title : str
        Plot title
    """
    fig, ax = plt.subplots(figsize=(12, 6))
    
    # Calculate positions
    n_layers = len(layer_sizes)
    x_positions = np.linspace(0, 10, n_layers)
    
    # Draw each layer
    for i, (x, size) in enumerate(zip(x_positions, layer_sizes)):
        # Calculate y positions for neurons in this layer
        y_positions = np.linspace(0, 10, size)
        
        # Draw neurons
        for y in y_positions:
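            # Use facecolor/edgecolor: passing color= together with ec= makes
            # Matplotlib warn and override the edge color
            circle = plt.Circle((x, y), 0.2, facecolor='skyblue', edgecolor='black', zorder=3)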
            ax.add_patch(circle)
        
        # Add layer label
        if i == 0:
            label = f"Input\n({size})"
        elif i == n_layers - 1:
            label = f"Output\n({size})"
        else:
            label = f"Hidden {i}\n({size})"
        ax.text(x, -1, label, ha='center', fontsize=10, fontweight='bold')
    
    ax.set_xlim(-1, 11)
    ax.set_ylim(-2, 11)
    ax.axis('off')
    ax.set_title(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Example: Compare deep vs wide networks
print("Deep Network (more layers, fewer neurons per layer):")
visualize_architecture([784, 64, 64, 64, 10], "Deep Network: 3 Hidden Layers")

print("\nWide Network (fewer layers, more neurons per layer):")
visualize_architecture([784, 256, 10], "Wide Network: 1 Hidden Layer")

## 4. Loading and Preprocessing Fashion-MNIST

**Fashion-MNIST** is a dataset of 70,000 grayscale images (28×28 pixels) of clothing items in 10 categories:
- T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot

It's a drop-in replacement for MNIST (same image size, same 60,000/10,000 train/test split, same number of classes), but more challenging and realistic.

In [None]:
# Load Fashion-MNIST dataset
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Class names for Fashion-MNIST
class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

print(f"Training data shape: {X_train_full.shape}")
print(f"Training labels shape: {y_train_full.shape}")
print(f"Test data shape: {X_test.shape}")
print(f"Test labels shape: {y_test.shape}")
print(f"\nNumber of classes: {len(class_names)}")
print(f"Pixel value range: [{X_train_full.min()}, {X_train_full.max()}]")

In [None]:
# Visualize sample images
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(X_train_full[i], cmap='gray')
    axes[i].set_title(f"{class_names[y_train_full[i]]}")
    axes[i].axis('off')

plt.suptitle("Fashion-MNIST Sample Images", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Preprocessing steps
# 1. Normalize pixel values to [0, 1] range
X_train_full = X_train_full.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# 2. Flatten images from 28x28 to 784-dimensional vectors
# This is needed for fully-connected (Dense) layers
X_train_full_flat = X_train_full.reshape(-1, 28 * 28)
X_test_flat = X_test.reshape(-1, 28 * 28)

# 3. Create validation set (last 10,000 samples)
X_train, X_valid = X_train_full_flat[:50000], X_train_full_flat[50000:]
y_train, y_valid = y_train_full[:50000], y_train_full[50000:]

print(f"Training set: {X_train.shape}")
print(f"Validation set: {X_valid.shape}")
print(f"Test set: {X_test_flat.shape}")
print(f"\nPixel values after normalization: [{X_train.min():.2f}, {X_train.max():.2f}]")

## 5. Building Deep Neural Networks

### Architecture Design Principles:

1. **Start Simple**: Begin with 2-3 hidden layers
2. **Layer Size Progression**: Common patterns:
   - Decreasing: 512 → 256 → 128 (funnel architecture)
   - Constant: 256 → 256 → 256 (uniform architecture)
   - Increasing: 128 → 256 → 512 (expansion architecture)
3. **Activation Functions**: Use ReLU for hidden layers, softmax for output
4. **Output Layer**: Must match number of classes (10 for Fashion-MNIST)
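
The three size progressions listed above can be generated programmatically. The sketch below uses a hypothetical helper, `layer_widths`, and assumes each step halves or doubles the width, which matches the examples in the list but is only one possible convention.

```python
def layer_widths(pattern, base, n_layers):
    """Return hidden-layer widths for a named progression (illustrative helper).

    pattern  : 'funnel', 'uniform', or 'expansion'
    base     : width of the first hidden layer
    n_layers : number of hidden layers
    """
    if pattern == "funnel":
        # Halve the width at every layer: 512 -> 256 -> 128
        return [base // (2 ** i) for i in range(n_layers)]
    if pattern == "uniform":
        # Same width everywhere: 256 -> 256 -> 256
        return [base] * n_layers
    if pattern == "expansion":
        # Double the width at every layer: 128 -> 256 -> 512
        return [base * (2 ** i) for i in range(n_layers)]
    raise ValueError(f"Unknown pattern: {pattern!r}")

print(layer_widths("funnel", 512, 3))     # [512, 256, 128]
print(layer_widths("uniform", 256, 3))    # [256, 256, 256]
print(layer_widths("expansion", 128, 3))  # [128, 256, 512]
```

Each width in the returned list would become the first argument of one `layers.Dense(width, activation='relu')` call, followed by the 10-unit softmax output layer.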

In [None]:
# Model 1: Simple Deep Network (3 hidden layers, decreasing size)
def create_simple_deep_model():
    """
    Create a simple deep neural network with 3 hidden layers.
    Uses decreasing layer sizes (funnel architecture).
    """
    model = models.Sequential([
        # Input layer - explicitly define input shape
        layers.Input(shape=(784,)),
        
        # Hidden layer 1: 256 neurons
        layers.Dense(256, activation='relu', name='hidden_1'),
        
        # Hidden layer 2: 128 neurons
        layers.Dense(128, activation='relu', name='hidden_2'),
        
        # Hidden layer 3: 64 neurons
        layers.Dense(64, activation='relu', name='hidden_3'),
        
        # Output layer: 10 classes (softmax for multi-class classification)
        layers.Dense(10, activation='softmax', name='output')
    ], name='simple_deep_model')
    
    return model

# Create and display model architecture
model_simple = create_simple_deep_model()
model_simple.summary()

# Calculate total parameters
total_params = model_simple.count_params()
print(f"\nTotal trainable parameters: {total_params:,}")

### Understanding Parameter Count

For each Dense layer, the number of parameters is:
$$\text{parameters} = (\text{input\_size} + 1) \times \text{output\_size}$$

The "+1" accounts for the bias term for each neuron.

Example for first hidden layer:
- Input: 784 features
- Output: 256 neurons
- Parameters: $(784 + 1) \times 256 = 200{,}960$
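
Applying the same formula to every layer reproduces the total that `count_params()` reports for the simple model (784 → 256 → 128 → 64 → 10). The helper name `dense_param_counts` is an assumption made for this sketch. It only covers stacks of Dense layers with biases.

```python
def dense_param_counts(layer_sizes):
    """Parameters per Dense layer: (input_size + 1) * output_size."""
    return [(n_in + 1) * n_out
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

# Layer sizes of the simple deep model, including input and output
simple_sizes = [784, 256, 128, 64, 10]
counts = dense_param_counts(simple_sizes)
total = sum(counts)

for (n_in, n_out), count in zip(zip(simple_sizes[:-1], simple_sizes[1:]), counts):
    print(f"{n_in:>4} -> {n_out:<4}: {count:>8,}")
print(f"Total       : {total:>8,}")   # 242,762
```

The first hidden layer accounts for 200,960 of the 242,762 parameters, so the input dimension dominates the size of this model.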

In [None]:
# Compile the model
model_simple.compile(
    optimizer='adam',  # Adam optimizer (adaptive learning rate)
    loss='sparse_categorical_crossentropy',  # For integer labels
    metrics=['accuracy']  # Track accuracy during training
)

print("Model compiled successfully!")
print("Optimizer: Adam")
print("Loss function: Sparse Categorical Crossentropy")
print("Metrics: Accuracy")

## 6. Training and Monitoring Progress

### Best Practices for Training:

1. **Use Validation Data**: Monitor performance on unseen data
2. **Track Metrics**: Loss and accuracy for both training and validation
3. **Watch for Overfitting**: Validation loss increasing while training loss decreases
4. **Be Patient**: Deep networks may take many epochs to converge
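
To make the tracked metrics concrete, the sketch below computes a sparse categorical cross-entropy loss and an accuracy by hand from a small batch of predicted probabilities and integer labels. The function names, the probabilities, and the clipping constant `eps` are assumptions for illustration. Keras applies its own clipping and averaging, so values may differ slightly.

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs, eps=1e-7):
    """Mean negative log-probability assigned to the correct class."""
    probs = np.clip(probs, eps, 1.0)                  # avoid log(0)
    picked = probs[np.arange(len(y_true)), y_true]    # probability of the true class
    return float(-np.mean(np.log(picked)))

def accuracy(y_true, probs):
    """Fraction of samples whose most probable class equals the label."""
    return float(np.mean(np.argmax(probs, axis=1) == y_true))

# Made-up softmax outputs for 4 samples and 3 classes
probs = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.40, 0.30],   # true class is 2, so this one is misclassified
    [0.25, 0.25, 0.50],
])
y_true = np.array([0, 1, 2, 2])   # integer labels, as in Fashion-MNIST

loss = sparse_categorical_crossentropy(y_true, probs)
acc = accuracy(y_true, probs)
print(f"loss = {loss:.4f}, accuracy = {acc:.2f}")
```

Loss and accuracy can move differently. Loss reacts to how confident each prediction is, while accuracy only changes when the predicted class flips, which is why both are worth tracking.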

In [None]:
# Train the model with validation monitoring
print("Training simple deep model...")
history_simple = model_simple.fit(
    X_train, y_train,
    epochs=20,
    batch_size=128,
    validation_data=(X_valid, y_valid),
    verbose=1  # Show progress bar
)

print("\nTraining completed!")

In [None]:
# Helper function to plot training history
def plot_training_history(history, title="Training History"):
    """
    Plot training and validation loss/accuracy curves.
    
    Parameters:
    -----------
    history : keras.callbacks.History
        Training history object returned by model.fit()
    title : str
        Main title for the plot
    """
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot loss
    ax1.plot(history.history['loss'], label='Training Loss', linewidth=2)
    ax1.plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
    ax1.set_xlabel('Epoch', fontsize=12)
    ax1.set_ylabel('Loss', fontsize=12)
    ax1.set_title('Model Loss', fontsize=12, fontweight='bold')
    ax1.legend()
    ax1.grid(True, alpha=0.3)
    
    # Plot accuracy
    ax2.plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
    ax2.plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
    ax2.set_xlabel('Epoch', fontsize=12)
    ax2.set_ylabel('Accuracy', fontsize=12)
    ax2.set_title('Model Accuracy', fontsize=12, fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    plt.suptitle(title, fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Plot training history
plot_training_history(history_simple, "Simple Deep Model Training History")

## 7. Detecting Overfitting

### Signs of Overfitting:

1. **Training accuracy >> Validation accuracy**: Large gap indicates overfitting
2. **Validation loss increases**: While training loss continues to decrease
3. **Early plateau**: Validation metrics stop improving while training metrics keep getting better
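
These signs can be read off the dictionary stored in `history.history`. The sketch below is one possible check, using a hypothetical helper `summarize_validation_trend` and made-up loss values. It finds the epoch with the lowest validation loss and reports whether training loss kept falling afterwards.

```python
def summarize_validation_trend(history_dict):
    """Locate the best validation epoch and check for divergence after it."""
    train_loss = history_dict["loss"]
    val_loss = history_dict["val_loss"]

    # Index of the lowest validation loss (first occurrence on ties)
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    epochs_since_best = len(val_loss) - 1 - best_idx
    val_loss_rise = val_loss[-1] - val_loss[best_idx]
    train_still_improving = train_loss[-1] < train_loss[best_idx]

    return {
        "best_epoch": best_idx + 1,              # 1-based, as Keras prints it
        "epochs_since_best": epochs_since_best,
        "val_loss_rise": val_loss_rise,
        # Diverging: validation loss rose after its best epoch while training loss fell
        "diverging": epochs_since_best > 0 and val_loss_rise > 0 and train_still_improving,
    }

# Made-up curves showing the classic overfitting pattern
example_history = {
    "loss":     [0.60, 0.45, 0.38, 0.33, 0.29, 0.26],
    "val_loss": [0.55, 0.46, 0.42, 0.41, 0.43, 0.45],
}
summary = summarize_validation_trend(example_history)
print(summary)
```

With a real run, `summarize_validation_trend(history_simple.history)` would apply the same check. A single noisy epoch can trigger the flag, so it is a prompt to inspect the curves rather than a verdict.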

### Common Causes:
- Model too complex for the dataset
- Insufficient training data
- Training for too many epochs
- Lack of regularization

### Solutions (covered in later modules):
- Dropout (Module 07)
- Batch Normalization (Module 07)
- L1/L2 Regularization (Module 07)
- Early Stopping (Module 07)
- Data Augmentation (Module 07)

In [None]:
# Function to analyze overfitting
def analyze_overfitting(history):
    """
    Analyze training history to detect overfitting.
    
    Parameters:
    -----------
    history : keras.callbacks.History
        Training history object
    """
    final_epoch = len(history.history['loss']) - 1
    
    train_loss = history.history['loss'][-1]
    val_loss = history.history['val_loss'][-1]
    train_acc = history.history['accuracy'][-1]
    val_acc = history.history['val_accuracy'][-1]
    
    print("=" * 60)
    print("OVERFITTING ANALYSIS")
    print("=" * 60)
    print(f"\nFinal Epoch: {final_epoch + 1}")
    print(f"\nTraining Loss:      {train_loss:.4f}")
    print(f"Validation Loss:    {val_loss:.4f}")
    print(f"Loss Gap:           {abs(val_loss - train_loss):.4f}")
    
    print(f"\nTraining Accuracy:  {train_acc:.4f} ({train_acc*100:.2f}%)")
    print(f"Validation Accuracy: {val_acc:.4f} ({val_acc*100:.2f}%)")
    print(f"Accuracy Gap:       {abs(train_acc - val_acc):.4f} ({abs(train_acc - val_acc)*100:.2f}%)")
    
    # Overfitting diagnosis
    print("\n" + "=" * 60)
    print("DIAGNOSIS:")
    print("=" * 60)
    
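    acc_gap = train_acc - val_acc  # signed: positive when training accuracy is higher
    
    if val_loss > train_loss * 1.1 and acc_gap > 0.05:
        print("⚠️  OVERFITTING DETECTED")
        print("   - Validation loss more than 10% above training loss")
        print("   - Training accuracy more than 5 points above validation accuracy")
        print("   - Consider: regularization, dropout, or early stopping")
    elif abs(acc_gap) < 0.02:
        print("✅ GOOD GENERALIZATION")
        print("   - Training and validation metrics are close")
        print("   - Model generalizes well to unseen data")
    elif acc_gap < 0:
        print("ℹ️  VALIDATION BETTER THAN TRAINING")
        print("   - No sign of overfitting at the final epoch")
        print("   - The model may be underfitting; consider training longer")
    else:
        print("⚠️  MILD OVERFITTING")
        print("   - Some gap between training and validation")
        print("   - Consider monitoring for more epochs")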
    
    print("=" * 60)

# Analyze the simple model
analyze_overfitting(history_simple)

## 8. Comparing Different Architectures

Let's compare three different architectures:
1. **Shallow Wide**: 1 hidden layer with many neurons
2. **Deep Narrow**: Many hidden layers with fewer neurons
3. **Balanced**: Moderate depth and width

In [None]:
# Architecture 1: Shallow Wide (1 hidden layer, 512 neurons)
model_shallow = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(512, activation='relu', name='hidden_wide'),
    layers.Dense(10, activation='softmax', name='output')
], name='shallow_wide')

# Architecture 2: Deep Narrow (5 hidden layers, 64 neurons each)
model_deep = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(64, activation='relu', name='hidden_1'),
    layers.Dense(64, activation='relu', name='hidden_2'),
    layers.Dense(64, activation='relu', name='hidden_3'),
    layers.Dense(64, activation='relu', name='hidden_4'),
    layers.Dense(64, activation='relu', name='hidden_5'),
    layers.Dense(10, activation='softmax', name='output')
], name='deep_narrow')

# Architecture 3: Balanced (3 hidden layers, 128 neurons each)
model_balanced = models.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(128, activation='relu', name='hidden_1'),
    layers.Dense(128, activation='relu', name='hidden_2'),
    layers.Dense(128, activation='relu', name='hidden_3'),
    layers.Dense(10, activation='softmax', name='output')
], name='balanced')

# Compare parameter counts
print("Architecture Comparison:")
print("=" * 60)
print(f"Shallow Wide:  {model_shallow.count_params():>10,} parameters")
print(f"Deep Narrow:   {model_deep.count_params():>10,} parameters")
print(f"Balanced:      {model_balanced.count_params():>10,} parameters")
print("=" * 60)

In [None]:
# Compile all models with same configuration
for model in [model_shallow, model_deep, model_balanced]:
    model.compile(
        optimizer='adam',
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )

# Train all models (fewer epochs for comparison)
histories = {}
epochs = 10

for name, model in [('Shallow Wide', model_shallow), 
                     ('Deep Narrow', model_deep),
                     ('Balanced', model_balanced)]:
    print(f"\nTraining {name} model...")
    history = model.fit(
        X_train, y_train,
        epochs=epochs,
        batch_size=128,
        validation_data=(X_valid, y_valid),
        verbose=0  # Suppress output for cleaner comparison
    )
    histories[name] = history
    
    # Print final metrics
    final_train_acc = history.history['accuracy'][-1]
    final_val_acc = history.history['val_accuracy'][-1]
    print(f"  Final Training Accuracy:   {final_train_acc:.4f}")
    print(f"  Final Validation Accuracy: {final_val_acc:.4f}")

print("\nAll models trained!")

In [None]:
# Compare validation accuracy across architectures
fig, ax = plt.subplots(figsize=(12, 6))

for name, history in histories.items():
    ax.plot(history.history['val_accuracy'], label=name, linewidth=2, marker='o')

ax.set_xlabel('Epoch', fontsize=12)
ax.set_ylabel('Validation Accuracy', fontsize=12)
ax.set_title('Architecture Comparison: Validation Accuracy', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Print summary
print("\nBest Validation Accuracy (across all epochs):")
for name, history in histories.items():
    best_val_acc = max(history.history['val_accuracy'])
    print(f"  {name:15s}: {best_val_acc:.4f} ({best_val_acc*100:.2f}%)")

## 9. Training on CIFAR-10 (Color Images)

**CIFAR-10** consists of 60,000 color images (32×32 pixels, RGB) in 10 classes:
- Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck

This is more challenging than Fashion-MNIST due to:
- Color images (3 channels)
- More complex visual patterns
- Greater intra-class variation
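
For a fully-connected network, the extra color channels mainly show up as a much larger input vector. The sketch below uses a random array as a stand-in for a CIFAR-10 batch (an assumption, so it runs without downloading the dataset) to show what scaling and flattening do to the shape and feature order.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 4 CIFAR-10 images: (samples, height, width, channels)
batch = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Scale to [0, 1] and flatten each image to one long feature vector
scaled = batch.astype("float32") / 255.0
flat = scaled.reshape(len(scaled), -1)

print(flat.shape)                     # (4, 3072)
print(32 * 32 * 3, "features per CIFAR-10 image vs", 28 * 28, "for Fashion-MNIST")

# Row-major flattening keeps the channels of a pixel next to each other:
# the first three features are the R, G, B values of the top-left pixel
print(flat[0, :3], scaled[0, 0, 0, :])
```

Flattening keeps every pixel value but drops the explicit 2D layout. A Dense layer therefore treats neighboring pixels no differently from distant ones.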

In [None]:
# Load CIFAR-10 dataset
# Note: We'll use a subset for faster training on CPU
(X_cifar_train, y_cifar_train), (X_cifar_test, y_cifar_test) = cifar10.load_data()

# CIFAR-10 class names
cifar_classes = [
    'Airplane', 'Automobile', 'Bird', 'Cat', 'Deer',
    'Dog', 'Frog', 'Horse', 'Ship', 'Truck'
]

# Use subset for faster training (10,000 training samples)
subset_size = 10000
X_cifar_train = X_cifar_train[:subset_size]
y_cifar_train = y_cifar_train[:subset_size]

print(f"CIFAR-10 Training subset: {X_cifar_train.shape}")
print(f"CIFAR-10 Test set: {X_cifar_test.shape}")
print(f"Image shape: {X_cifar_train[0].shape} (height, width, channels)")
print(f"Pixel value range: [{X_cifar_train.min()}, {X_cifar_train.max()}]")

In [None]:
# Visualize CIFAR-10 samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()

for i in range(10):
    axes[i].imshow(X_cifar_train[i])
    axes[i].set_title(f"{cifar_classes[y_cifar_train[i][0]]}")
    axes[i].axis('off')

plt.suptitle("CIFAR-10 Sample Images", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Preprocess CIFAR-10 data
# 1. Normalize to [0, 1]
X_cifar_train = X_cifar_train.astype('float32') / 255.0
X_cifar_test = X_cifar_test.astype('float32') / 255.0

# 2. Flatten images: 32x32x3 = 3072 features
X_cifar_train_flat = X_cifar_train.reshape(-1, 32 * 32 * 3)
X_cifar_test_flat = X_cifar_test.reshape(-1, 32 * 32 * 3)

# 3. Flatten labels: cifar10.load_data() returns them with shape (n, 1)
y_cifar_train = y_cifar_train.flatten()
y_cifar_test = y_cifar_test.flatten()

# 4. Create validation set
split_idx = int(0.8 * len(X_cifar_train_flat))
X_cifar_train_final = X_cifar_train_flat[:split_idx]
X_cifar_valid = X_cifar_train_flat[split_idx:]
y_cifar_train_final = y_cifar_train[:split_idx]
y_cifar_valid = y_cifar_train[split_idx:]

print(f"CIFAR-10 Training set: {X_cifar_train_final.shape}")
print(f"CIFAR-10 Validation set: {X_cifar_valid.shape}")
print(f"CIFAR-10 Test set: {X_cifar_test_flat.shape}")

In [None]:
# Build model for CIFAR-10
# A deeper, wider network than before: each image now has 3072 input features instead of 784
model_cifar = models.Sequential([
    layers.Input(shape=(3072,)),  # 32x32x3 = 3072
    layers.Dense(512, activation='relu', name='hidden_1'),
    layers.Dense(256, activation='relu', name='hidden_2'),
    layers.Dense(128, activation='relu', name='hidden_3'),
    layers.Dense(64, activation='relu', name='hidden_4'),
    layers.Dense(10, activation='softmax', name='output')
], name='cifar10_model')

model_cifar.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

model_cifar.summary()

In [None]:
# Train on CIFAR-10
print("Training on CIFAR-10...")
print("Note: This is more challenging than Fashion-MNIST!")

history_cifar = model_cifar.fit(
    X_cifar_train_final, y_cifar_train_final,
    epochs=15,
    batch_size=128,
    validation_data=(X_cifar_valid, y_cifar_valid),
    verbose=1
)

print("\nCIFAR-10 training completed!")

In [None]:
# Plot CIFAR-10 training history
plot_training_history(history_cifar, "CIFAR-10 Model Training History")

# Analyze overfitting
analyze_overfitting(history_cifar)

## 10. Model Evaluation and Testing

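After training, we evaluate on the **test set** - data the model never saw during training or validation.
As long as the test set played no part in choosing the architecture or hyperparameters, this gives an unbiased estimate of generalization performance.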
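
A single accuracy number hides which classes are confused with each other. The notebook imports scikit-learn's `confusion_matrix` for this. The sketch below is a NumPy-only version with a hypothetical helper `confusion_counts` and made-up labels, so each step is visible.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Matrix whose entry [i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)   # accumulate one count per (true, predicted) pair
    return cm

# Made-up labels and predictions for 7 samples and 3 classes
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_counts(y_true_demo, y_pred_demo, n_classes=3)
per_class_recall = cm.diagonal() / cm.sum(axis=1)   # correct / total, per true class
overall_accuracy = cm.trace() / cm.sum()

print(cm)
print("Per-class recall:", np.round(per_class_recall, 3))
print(f"Overall accuracy: {overall_accuracy:.3f}")
```

With the trained model, the same function could be called as `confusion_counts(y_test, y_pred_classes, 10)` once `y_pred_classes` has been computed from the test predictions.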

In [None]:
# Evaluate Fashion-MNIST model
test_loss, test_accuracy = model_simple.evaluate(X_test_flat, y_test, verbose=0)

print("Fashion-MNIST Test Results:")
print("=" * 60)
print(f"Test Loss:     {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print("=" * 60)

# Make predictions
y_pred = model_simple.predict(X_test_flat, verbose=0)
y_pred_classes = np.argmax(y_pred, axis=1)

# Visualize some predictions
fig, axes = plt.subplots(2, 5, figsize=(14, 6))
axes = axes.ravel()

for i in range(10):
    idx = np.random.randint(0, len(X_test))
    axes[i].imshow(X_test[idx], cmap='gray')
    
    true_label = class_names[y_test[idx]]
    pred_label = class_names[y_pred_classes[idx]]
    confidence = y_pred[idx][y_pred_classes[idx]]
    
    # Color code: green if correct, red if wrong
    color = 'green' if y_test[idx] == y_pred_classes[idx] else 'red'
    
    axes[i].set_title(f"True: {true_label}\nPred: {pred_label}\nConf: {confidence:.2f}",
                      fontsize=9, color=color)
    axes[i].axis('off')

plt.suptitle("Fashion-MNIST Predictions (Green=Correct, Red=Wrong)", 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 11. Summary

### Key Concepts Learned:

1. **Deep Neural Networks**: Multiple hidden layers extract hierarchical features

2. **Depth vs Width**:
   - Deeper networks: Better hierarchical representation, often fewer parameters for comparable expressive power
   - Wider networks: More parallel computation, may need regularization

3. **Architecture Design**:
   - Start simple (2-3 hidden layers)
   - Common patterns: Funnel (decreasing), Uniform, Expansion
   - Use ReLU for hidden layers, softmax for output

4. **Training Best Practices**:
   - Always use validation data
   - Monitor both loss and accuracy
   - Watch for overfitting (validation metrics diverging)

5. **Overfitting Detection**:
   - Training accuracy >> Validation accuracy
   - Validation loss increases over time
   - Solutions: Regularization (covered in Module 07)

6. **Dataset Complexity**:
   - Fashion-MNIST: Grayscale, simpler patterns
   - CIFAR-10: Color, more complex, calls for a higher-capacity (deeper and wider) network

### What's Next?

In the next modules, we'll learn:
- **Module 06**: Optimizers (SGD, Adam, RMSprop) - How to train efficiently
- **Module 07**: Regularization (Dropout, Batch Norm) - How to prevent overfitting
- **Module 08**: Loss Functions and Metrics - Choosing the right objectives
- **Module 09**: Hyperparameter Tuning - Finding optimal configurations

### Additional Resources:

- [Keras Sequential Model Guide](https://keras.io/guides/sequential_model/)
- [Fashion-MNIST Dataset](https://github.com/zalandoresearch/fashion-mnist)
- [CIFAR-10 Dataset](https://www.cs.toronto.edu/~kriz/cifar.html)
- [Deep Learning Book - Chapter 6](https://www.deeplearningbook.org/contents/mlp.html)

## 12. Exercises

Test your understanding with these exercises!

### Exercise 1: Design Your Own Architecture

**Task**: Create a neural network architecture with:
- 4 hidden layers
- Use an "expansion" pattern (increasing neuron count): 64 → 128 → 256 → 512
- Train on Fashion-MNIST for 10 epochs
- Compare its performance to the simple model

**Questions**:
1. How many parameters does your model have?
2. Does it perform better or worse than the simple model?
3. Does it show signs of overfitting?

```python
# Your code here
```

In [None]:
# Exercise 1 Solution
# Uncomment to reveal

# model_expansion = models.Sequential([
#     layers.Input(shape=(784,)),
#     layers.Dense(64, activation='relu', name='hidden_1'),
#     layers.Dense(128, activation='relu', name='hidden_2'),
#     layers.Dense(256, activation='relu', name='hidden_3'),
#     layers.Dense(512, activation='relu', name='hidden_4'),
#     layers.Dense(10, activation='softmax', name='output')
# ], name='expansion_model')
# 
# model_expansion.compile(
#     optimizer='adam',
#     loss='sparse_categorical_crossentropy',
#     metrics=['accuracy']
# )
# 
# print(f"Total parameters: {model_expansion.count_params():,}")
# 
# history_expansion = model_expansion.fit(
#     X_train, y_train,
#     epochs=10,
#     batch_size=128,
#     validation_data=(X_valid, y_valid),
#     verbose=1
# )
# 
# plot_training_history(history_expansion, "Expansion Model")
# analyze_overfitting(history_expansion)

### Exercise 2: Optimal Depth Experiment

**Task**: Experiment with different depths while keeping total parameters similar:
- Create 3 models: 2 layers (wide), 4 layers (medium), 6 layers (narrow)
- Adjust neuron counts to keep total parameters around 200,000
- Train each for 10 epochs and compare validation accuracy

**Questions**:
1. Which depth performs best?
2. How does training time differ?
3. Which shows the least overfitting?

```python
# Your code here
```

In [None]:
# Exercise 2 Solution
# Uncomment to reveal

# # Hint: Use count_params() to check parameter count
# # Adjust layer sizes to get similar counts
# 
# # Widths below were chosen so that each model has roughly 200,000 parameters
# model_2layers = models.Sequential([
#     layers.Input(shape=(784,)),
#     layers.Dense(200, activation='relu'),
#     layers.Dense(200, activation='relu'),
#     layers.Dense(10, activation='softmax')
# ])  # 199,210 parameters
# 
# model_4layers = models.Sequential([
#     layers.Input(shape=(784,)),
#     layers.Dense(160, activation='relu'),
#     layers.Dense(160, activation='relu'),
#     layers.Dense(160, activation='relu'),
#     layers.Dense(160, activation='relu'),
#     layers.Dense(10, activation='softmax')
# ])  # 204,490 parameters
# 
# model_6layers = models.Sequential([
#     layers.Input(shape=(784,)),
#     layers.Dense(136, activation='relu'),
#     layers.Dense(136, activation='relu'),
#     layers.Dense(136, activation='relu'),
#     layers.Dense(136, activation='relu'),
#     layers.Dense(136, activation='relu'),
#     layers.Dense(136, activation='relu'),
#     layers.Dense(10, activation='softmax')
# ])  # 201,290 parameters
# 
# for name, model in [('2 Layers', model_2layers),
#                     ('4 Layers', model_4layers),
#                     ('6 Layers', model_6layers)]:
#     print(f"{name}: {model.count_params():,} parameters")

### Exercise 3: CIFAR-10 Improvement Challenge

**Task**: Try to improve the CIFAR-10 model performance:
- Experiment with different architectures
- Try different batch sizes (32, 64, 256)
- Train for more epochs (20-30)
- Analyze what works and what doesn't

**Goal**: Achieve >50% validation accuracy

**Questions**:
1. What architecture worked best?
2. Did more epochs help or hurt?
3. How does batch size affect training?

```python
# Your code here
```

In [None]:
# Exercise 3 Solution
# Uncomment to reveal

# # Try a deeper architecture with more neurons
# model_improved = models.Sequential([
#     layers.Input(shape=(3072,)),
#     layers.Dense(1024, activation='relu'),
#     layers.Dense(512, activation='relu'),
#     layers.Dense(256, activation='relu'),
#     layers.Dense(128, activation='relu'),
#     layers.Dense(10, activation='softmax')
# ])
# 
# model_improved.compile(
#     optimizer='adam',
#     loss='sparse_categorical_crossentropy',
#     metrics=['accuracy']
# )
# 
# history_improved = model_improved.fit(
#     X_cifar_train_final, y_cifar_train_final,
#     epochs=25,
#     batch_size=64,  # Smaller batch size
#     validation_data=(X_cifar_valid, y_cifar_valid),
#     verbose=1
# )
# 
# plot_training_history(history_improved, "Improved CIFAR-10 Model")

### Exercise 4: Validation Strategy Comparison

**Task**: Compare different validation split ratios on Fashion-MNIST:
- 70% train / 30% validation
- 80% train / 20% validation  
- 90% train / 10% validation

Use the same architecture and training configuration for fair comparison.

**Questions**:
1. How does validation set size affect the reliability of validation metrics?
2. Which split gives the best final test accuracy?
3. What are the tradeoffs between more training data vs more validation data?

```python
# Your code here
```

In [None]:
# Exercise 4 Solution
# Uncomment to reveal

# results = {}
# 
# for val_ratio in [0.3, 0.2, 0.1]:
#     train_ratio = 1 - val_ratio
#     split_point = int(len(X_train_full_flat) * train_ratio)
#     
#     X_tr = X_train_full_flat[:split_point]
#     X_val = X_train_full_flat[split_point:]
#     y_tr = y_train_full[:split_point]
#     y_val = y_train_full[split_point:]
#     
#     model = create_simple_deep_model()
#     model.compile(optimizer='adam',
#                   loss='sparse_categorical_crossentropy',
#                   metrics=['accuracy'])
#     
#     history = model.fit(X_tr, y_tr,
#                        epochs=15,
#                        batch_size=128,
#                        validation_data=(X_val, y_val),
#                        verbose=0)
#     
#     test_acc = model.evaluate(X_test_flat, y_test, verbose=0)[1]
#     results[f"{round(train_ratio*100)}/{round(val_ratio*100)}"] = {
#         'val_acc': history.history['val_accuracy'][-1],
#         'test_acc': test_acc
#     }
# 
# for split, metrics in results.items():
#     print(f"Split {split}: Val={metrics['val_acc']:.4f}, Test={metrics['test_acc']:.4f}")

---

**Congratulations!** You've completed Module 05. You now understand how to:
- Design deep neural network architectures
- Balance depth vs width tradeoffs
- Train models on realistic datasets
- Monitor training progress
- Detect and diagnose overfitting

Continue to **Module 06: Optimizers (SGD, Adam, RMSprop)** to learn how to train deep networks more efficiently!