# Task 1: MNIST with Optimal Learning Rate & TensorBoard

## Step 1: MNIST Optimization

**Objective:** Reach >98% accuracy on the MNIST dataset.

### Learning Rate Finder

We implement a callback that multiplies the learning rate by a constant factor after every batch, so the rate grows exponentially over a single short training run. Plotting loss against learning rate (on a log scale) shows where the loss starts to climb. A good learning rate lies somewhat below that turning point, typically about 10x smaller.
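
The growth factor does not have to be guessed. As a minimal sketch, the factor needed to sweep a chosen range in one epoch, and the "10x smaller" rule, could be computed as follows. The helper names `growth_factor` and `suggest_learning_rate` are hypothetical, and treating the rate with the lowest loss as the turning point is an assumption rather than part of the original recipe.

```python
import math

def growth_factor(lr_min, lr_max, n_steps):
    """Per-batch multiplier that grows lr_min to lr_max in n_steps updates."""
    return (lr_max / lr_min) ** (1.0 / n_steps)

def suggest_learning_rate(rates, losses, divisor=10.0):
    """Assumed heuristic: treat the rate with the lowest loss as the turning
    point and return a rate `divisor` times smaller."""
    best = min(range(len(losses)), key=lambda i: losses[i])
    return rates[best] / divisor

# 55,000 training images with the Keras default batch size of 32
n_steps = math.ceil(55000 / 32)
factor = growth_factor(1e-3, 10.0, n_steps)
print(f"{n_steps} batches, factor = {factor:.4f}")
```

Sweeping from 1e-3 to 10 in one epoch of this size needs a factor of roughly 1.0054. A smaller factor covers a narrower range in the same number of batches.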

### TensorBoard Integration

TensorBoard provides real-time visualization of a training run, including:
- Loss curves
- Accuracy metrics
- Learning rate progression
- Model graph structure
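
To compare several runs side by side, each run is usually written to its own subdirectory under a common root. A minimal sketch, assuming a root directory named `my_logs` and a hypothetical helper `get_run_logdir` (Keras requires neither name):

```python
import os
import time

def get_run_logdir(root_logdir="my_logs"):
    """Return a timestamped subdirectory such as my_logs/run_2024_01_31-12_00_00."""
    run_id = time.strftime("run_%Y_%m_%d-%H_%M_%S")
    return os.path.join(root_logdir, run_id)

run_logdir = get_run_logdir()
# Pass the directory to the callback, e.g.
#   keras.callbacks.TensorBoard(run_logdir)
# then start the dashboard from a terminal with:
#   tensorboard --logdir=my_logs
```

Pointing TensorBoard at the root directory lets it list every run underneath it, so curves from different learning rates can be overlaid.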

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import matplotlib.pyplot as plt
import os

# 1. Load MNIST
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train_full = X_train_full / 255.0  # Scale to [0, 1]
X_test = X_test / 255.0

# Create validation set
X_train, X_valid = X_train_full[:-5000], X_train_full[-5000:]
y_train, y_valid = y_train_full[:-5000], y_train_full[-5000:]

# 2. Build Model
model = keras.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])

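# 3. Learning Rate Finder Setup
# Start at 1e-3 and multiply the LR by `factor` after every batch.
# With factor=1.005 and 1,719 batches (55,000 samples / batch size 32),
# the LR ends near 5 after one epoch.
class ExponentialLearningRate(keras.callbacks.Callback):
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.rates = []
        self.losses = []
    def on_batch_end(self, batch, logs=None):
        # Note: in recent Keras versions logs["loss"] is the running mean over
        # the epoch, so the plotted curve is smoothed.
        lr = keras.backend.get_value(self.model.optimizer.learning_rate)
        self.rates.append(lr)
        self.losses.append(logs["loss"])
        keras.backend.set_value(self.model.optimizer.learning_rate, lr * self.factor)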

model.compile(loss="sparse_categorical_crossentropy", optimizer=keras.optimizers.SGD(learning_rate=1e-3), metrics=["accuracy"])
expon_lr = ExponentialLearningRate(factor=1.005)

# Train for one epoch to find the rate
model.fit(X_train, y_train, epochs=1, callbacks=[expon_lr])

# 4. Plot LR vs Loss
plt.plot(expon_lr.rates, expon_lr.losses)
plt.gca().set_xscale('log')
plt.hlines(min(expon_lr.losses), min(expon_lr.rates), max(expon_lr.rates))
plt.axis([1e-3, 10, 0, 5])
plt.xlabel("Learning Rate")
plt.ylabel("Loss")
plt.show()

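# 5. Final Training with TensorBoard
# The LR sweep ends at a very high learning rate and likely leaves the weights
# in a bad state, so start from a freshly initialized copy of the model.
model = keras.models.clone_model(model)

# Use the best LR found (e.g., 3e-1)
root_logdir = os.path.join(os.curdir, "my_logs")
tensorboard_cb = keras.callbacks.TensorBoard(root_logdir)

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=3e-1),
              metrics=["accuracy"])

model.fit(X_train, y_train, epochs=20, validation_data=(X_valid, y_valid), callbacks=[tensorboard_cb])

# Check the >98% objective on the held-out test set
model.evaluate(X_test, y_test)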

# Task 2: The 100-Layer Challenge & Vanishing Gradients

## Step 2: Deep Architecture Analysis

**Objective:** Understand why modern architectures need specialized activation functions.

### Vanishing Gradients

In very deep networks, gradients shrink as they propagate backward through the layers, because backpropagation multiplies in one local derivative per layer and saturating activations make those factors small. By the time the gradient signal reaches the first layers it is close to zero, so those layers effectively stop learning.
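
A rough back-of-the-envelope check makes the effect concrete. The sketch below only bounds the product of sigmoid derivatives, one per layer, and deliberately ignores the weight matrices that also enter the real gradient, so it is a simplification rather than an exact gradient computation. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0, where it equals 0.25
max_derivative = sigmoid_derivative(0.0)

def activation_factor_bound(n_layers):
    """Upper bound on the product of n_layers sigmoid derivatives."""
    return max_derivative ** n_layers

for depth in (5, 20, 100):
    print(depth, activation_factor_bound(depth))
```

Even in the best case of 0.25 per layer, the bound for 100 layers is far below what float32 training can use, and saturated units (large |z|) contribute much smaller factors still.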

### Activation Function Comparison

| Activation | Characteristics | Vanishing Gradient Issue |
|------------|-----------------|--------------------------|
| **Sigmoid** | Saturates at 0 or 1 | Severe - gradients vanish due to saturation |
| **ReLU** | Positive values pass, negatives become 0 | Moderate - can cause "Dying ReLU" where neurons get stuck at 0 |
| **ELU/SELU** | Allows negative values, keeps mean activation near zero | Minimal - SELU provides "Self-Normalization" for deep networks when combined with LeCun normal initialization and standardized inputs |
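
The self-normalizing behaviour can be observed without training anything. The following NumPy sketch pushes standardized random inputs through 100 random dense layers and reports the statistics of the final activations. The helper `forward_stats`, the layer width of 100, and the batch of 500 samples are choices made for this illustration only.

```python
import numpy as np

# SELU constants from the self-normalizing networks formulation
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def selu(z):
    # expm1 is applied to min(z, 0) to avoid overflow on large positive inputs
    return SCALE * np.where(z > 0, z, ALPHA * np.expm1(np.minimum(z, 0.0)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_stats(activation, weight_std, n_layers=100, width=100, seed=42):
    """Push standardized inputs through n_layers random dense layers and
    return the mean and standard deviation of the final activations."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(500, width))
    for _ in range(n_layers):
        W = rng.normal(scale=weight_std, size=(width, width))
        a = activation(a @ W)
    return a.mean(), a.std()

lecun_std = np.sqrt(1.0 / 100)  # LeCun normal: variance 1 / fan_in
selu_mean, selu_std = forward_stats(selu, lecun_std)
sig_mean, sig_std = forward_stats(sigmoid, lecun_std)
print(f"SELU:    mean={selu_mean:.2f}, std={selu_std:.2f}")
print(f"Sigmoid: mean={sig_mean:.2f}, std={sig_std:.2f}")
```

With SELU the activations should stay roughly zero-centered with a standard deviation near 1, while sigmoid outputs cluster around 0.5 with a small spread.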

In [None]:
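# Function to build a super deep 100-layer model
def build_deep_model(activation):
    # Match the initializer to the activation: SELU only self-normalizes with
    # LeCun normal initialization, He suits ReLU/ELU, Glorot suits sigmoid.
    initializers = {"selu": "lecun_normal", "sigmoid": "glorot_uniform"}
    kernel_initializer = initializers.get(activation, "he_normal")
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=[28, 28]))
    # Add 100 hidden layers
    for _ in range(100):
        model.add(keras.layers.Dense(100, activation=activation,
                                     kernel_initializer=kernel_initializer))
    model.add(keras.layers.Dense(10, activation="softmax"))
    return model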

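# Practice: Run this for 'sigmoid' then 'selu'
# Note: Sigmoid will likely stay near 10% accuracy (random guessing) because gradients vanish across 100 layers.
# Note: SELU self-normalization also expects standardized inputs (mean 0, std 1 per feature).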
model_deep = build_deep_model("selu")
model_deep.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
# model_deep.fit(...)

# Task 3: CIFAR10, Batch Normalization, and Optimizers

## Step 3: CIFAR10 and Optimization

**Objective:** Train on the more complex CIFAR10 dataset (color images, 3 channels) and address training stability issues.

### He Initialization

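He initialization draws the weights of each layer with variance 2 / fan_in. It is designed for ReLU and its variants (such as ELU), so that the signal neither dies out nor explodes on the forward pass and gradients keep flowing through the whole network on the backward pass.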
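
A small NumPy experiment illustrates why the scale of the initial weights matters. It is only a sketch: it samples from a plain normal distribution (Keras' `he_normal` uses a truncated normal), and the helper `preactivation_std`, the width of 256, and the depth of 10 are arbitrary choices for the demonstration.

```python
import numpy as np

def preactivation_std(weight_std_fn, n_layers=10, width=256, seed=0):
    """Std of the last layer's pre-activations in a random ReLU network."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(1000, width))  # standardized inputs
    for _ in range(n_layers):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        z = a @ W                # pre-activations
        a = np.maximum(z, 0.0)   # ReLU
    return z.std()

# He scaling: std = sqrt(2 / fan_in)
he_std = preactivation_std(lambda fan_in: np.sqrt(2.0 / fan_in))
# A fixed small std, for comparison
small_std = preactivation_std(lambda fan_in: 0.01)
print(f"He init:    {he_std:.3f}")
print(f"std = 0.01: {small_std:.3e}")
```

With He scaling the pre-activations keep a stable magnitude from layer to layer, whereas the fixed small standard deviation shrinks the signal toward zero within a few layers.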

### Batch Normalization (BN)

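Zero-centers and normalizes the inputs to each layer over the current mini-batch, then rescales and shifts them with two learned parameters per input. This provides: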
- Ability to use much higher learning rates
- Reduced sensitivity to weight initialization
- Faster convergence and improved training stability
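
The training-time computation can be sketched in a few lines of NumPy. This is an illustration, not the Keras implementation: the function name `batch_norm_forward` is made up, `gamma` and `beta` stand for the learned scale and shift, and `eps=1e-3` mirrors the Keras default.

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-3):
    """Training-time batch normalization for inputs of shape (batch, features)."""
    mu = X.mean(axis=0)                     # per-feature batch mean
    var = X.var(axis=0)                     # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero-center and normalize
    return gamma * X_hat + beta             # learned scale and shift

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(256, 4))
out = batch_norm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))
```

At inference time the layer uses moving averages of the mean and variance collected during training instead of the statistics of the current batch.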

### Optimizer Comparison

| Optimizer | Description | Best Use Case |
|-----------|-------------|---------------|
| **Momentum** | Builds velocity like a ball rolling downhill | Standard SGD with faster convergence |
| **Adam/Nadam** | Combines momentum with adaptive learning rates per weight | The "go-to" optimizer for most deep learning tasks |
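
The "rolling ball" picture can be checked on a toy problem. The sketch below minimizes f(w) = w**2 with plain gradient descent and with momentum, using the same small learning rate. The loss, the starting point, and the function names are assumptions chosen for illustration.

```python
def gradient(w):
    """Gradient of the toy loss f(w) = w**2."""
    return 2.0 * w

def run_sgd(w, lr=0.01, n_steps=100):
    for _ in range(n_steps):
        w -= lr * gradient(w)
    return w

def run_momentum(w, lr=0.01, beta=0.9, n_steps=100):
    velocity = 0.0
    for _ in range(n_steps):
        # Keep a fraction of the previous velocity and add the new gradient step
        velocity = beta * velocity - lr * gradient(w)
        w += velocity
    return w

w_sgd = run_sgd(5.0)
w_momentum = run_momentum(5.0)
print(f"SGD: {w_sgd:.4f}  momentum: {w_momentum:.4f}")
```

After the same 100 steps the momentum run ends much closer to the minimum at 0, because consecutive gradients that point the same way accumulate in the velocity.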

In [None]:
# 1. Load CIFAR10
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.cifar10.load_data()
X_train_full = X_train_full / 255.0  # Scale to [0, 1]
X_test = X_test / 255.0              # Apply the same scaling to the test set

# 2. Build DNN with Batch Normalization
model_cifar = keras.Sequential()
model_cifar.add(keras.layers.Flatten(input_shape=[32, 32, 3]))

# Add 20 layers with Batch Normalization
for _ in range(20):
    model_cifar.add(keras.layers.Dense(100, kernel_initializer="he_normal")) # Layer
    model_cifar.add(keras.layers.BatchNormalization())                      # Normalization
    model_cifar.add(keras.layers.Activation("elu"))                        # Activation

model_cifar.add(keras.layers.Dense(10, activation="softmax"))

# 3. Train with Nadam
optimizer = keras.optimizers.Nadam(learning_rate=5e-4)
model_cifar.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

# Early Stopping to save time
early_stopping_cb = keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)

history_cifar = model_cifar.fit(X_train_full, y_train_full, epochs=50,
                                validation_split=0.1, callbacks=[early_stopping_cb])

## Comparison Discussion

### Convergence Speed
Batch Normalization typically enables the model to reach higher accuracy in fewer epochs, even though each epoch may take slightly longer to compute due to the additional normalization calculations.

### Optimizer Differences

| Optimizer | Characteristics | Tuning Required |
|-----------|-----------------|------------------|
| **SGD** | Slow convergence, can get stuck in local minima | High |
| **Momentum/NAG** | Faster than SGD, builds velocity, better at escaping local minima | Medium |
| **Adam/Nadam** | Most "forgiving", combines momentum with adaptive learning rates, usually converges fast | Low |
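
One reason Adam needs little tuning is that its step size is largely independent of the gradient's scale. The scalar sketch below follows the standard Adam update with the Keras default hyperparameters. The function `adam_step` is written for this illustration and is not a Keras API.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update for a scalar weight; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# First step on f(w) = w**2 and on the same loss scaled by 1000
w_small, _, _ = adam_step(5.0, grad=2.0 * 5.0, m=0.0, v=0.0, t=1)
w_large, _, _ = adam_step(5.0, grad=2000.0 * 5.0, m=0.0, v=0.0, t=1)
print(w_small, w_large)
```

Both calls move the weight by almost exactly `lr`, even though the gradients differ by a factor of 1000. Plain SGD would take a step 1000 times larger in the second case.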