# Chapter 11: Training Deep Neural Networks

## 1. Chapter Overview
**Goal:** Training a shallow neural network is easy. But training a deep network (with 10+ layers) introduces new challenges: gradients disappearing (vanishing) or growing too large (exploding), extremely slow training, and massive overfitting. This chapter introduces the techniques used to build state-of-the-art Deep Learning models.

**Key Concepts:**
* **Vanishing/Exploding Gradients:** Why deep networks fail to learn and how to fix it using proper Initialization (Glorot, He) and Activation Functions (ELU, SELU, ReLU).
* **Batch Normalization:** A technique to stabilize training by normalizing inputs at each layer.
* **Transfer Learning:** Reusing parts of a pretrained neural network for a new task.
* **Faster Optimizers:** Moving beyond SGD (Momentum, RMSProp, Adam, Nadam).
* **Learning Rate Scheduling:** Adjusting the learning speed during training.
* **Regularization:** Techniques like Dropout, Alpha Dropout, and Max-Norm to generalize better.

**Practical Skills:**
* Implementing **He Initialization** and **ELU** activation.
* Adding **BatchNormalization** layers to a Keras model.
* Using **Adam** optimizer with a Learning Rate Scheduler.
* Implementing **Dropout** to prevent overfitting.

In [None]:
# Setup
import sys
assert sys.version_info >= (3, 5)

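import sklearn
# Compare versions numerically; plain string comparison would rank "0.9" above "0.20"
assert tuple(map(int, sklearn.__version__.split(".")[:2])) >= (0, 20)

import tensorflow as tf
from tensorflow import keras
assert tuple(map(int, tf.__version__.split(".")[:2])) >= (2, 0)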

import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
tf.random.set_seed(42)

## 2. Theoretical Explanation

### 1. Vanishing and Exploding Gradients
During backpropagation, gradients propagate from the output layer to the input layer. In deep networks, these gradients often get smaller and smaller until they vanish (the lower layers stop learning), or they grow huge and the weights explode.
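* **Cause:** Glorot and Bengio traced the problem to the then-standard combination of the Sigmoid activation function and weights drawn from a normal distribution with mean 0 and standard deviation 1. With that setup, each layer's outputs have a larger variance than its inputs, so the activations saturate at 0 or 1 in the upper layers, where the Sigmoid's derivative is close to zero and almost no gradient flows back.
* **Solution 1 (Initialization):** Use **Xavier/Glorot Initialization** for Sigmoid/Tanh/Softmax (or no activation), **He Initialization** for ReLU and its variants, and **LeCun Initialization** for SELU. Each one scales the initial weight variance by the layer's fan-in and/or fan-out so that the signal variance stays roughly constant across layers.
* **Solution 2 (Activation):** Avoid Sigmoid in hidden layers. Use **ReLU** (fast, but neurons can "die" and output only 0), **Leaky ReLU** (a small negative slope keeps neurons from dying), **ELU** (smooth, with negative outputs for $z<0$), or **SELU** (self-normalizing in a plain stack of dense layers with LeCun initialization and standardized inputs).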
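
To make the initialization argument concrete, here is a minimal NumPy sketch that pushes random data through a stack of tanh layers and reports the standard deviation of the last layer's activations under two weight scales. The depth, layer width, and the helper name `final_activation_std` are illustrative assumptions, not part of the chapter's code.

```python
import numpy as np

def final_activation_std(weight_std_fn, n_layers=10, n_units=256, seed=42):
    """Forward random inputs through n_layers tanh layers; return the output std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((1000, n_units))  # inputs with unit variance
    for _ in range(n_layers):
        std = weight_std_fn(n_units, n_units)  # (fan_in, fan_out) -> weight std
        w = rng.standard_normal((n_units, n_units)) * std
        a = np.tanh(a @ w)
    return a.std()

# Naive: weights ~ N(0, 1). Pre-activations become very large, so tanh saturates.
naive_std = final_activation_std(lambda fan_in, fan_out: 1.0)

# Glorot normal: weight variance = 2 / (fan_in + fan_out).
glorot_std = final_activation_std(
    lambda fan_in, fan_out: np.sqrt(2.0 / (fan_in + fan_out)))

print(f"naive init:  final activation std = {naive_std:.3f}")
print(f"glorot init: final activation std = {glorot_std:.3f}")
```

With the naive scale, almost every unit ends up pinned near -1 or +1, where the local derivative $1 - \tanh^2(z)$ is close to zero. With Glorot scaling the activations stay in the range where tanh still responds to its input, which is what lets gradients flow.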

### 2. Batch Normalization (BN)
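Good initialization reduces vanishing/exploding gradients at the start of training, but it does not guarantee they won't come back as the distribution of each layer's inputs shifts during training. BN addresses this by adding an operation just before (or after) each hidden layer's activation function: it zero-centers and normalizes the inputs over the current mini-batch, then scales and shifts the result using two learned parameter vectors per layer (a scale $\gamma$ and an offset $\beta$). At test time it uses moving averages of the means and standard deviations collected during training instead of batch statistics.
* **Benefit:** Allows larger learning rates, usually reaches a given accuracy in fewer epochs, makes the model less sensitive to weight initialization, and acts as a mild regularizer. The cost is extra computation per epoch.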
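
The training-mode computation described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras implementation: the function name `batch_norm_train` is made up for this example, it handles only 2D batches, and it omits the moving averages used at inference. The `eps=1e-3` value mirrors the Keras default.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Batch Normalization forward pass in training mode for a 2D batch."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero-center and normalize
    return gamma * x_hat + beta              # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(256, 4))   # badly scaled inputs
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print("input  mean/std:", x.mean(axis=0).round(2), x.std(axis=0).round(2))
print("output mean/std:", out.mean(axis=0).round(2), out.std(axis=0).round(2))
```

With `gamma` at 1 and `beta` at 0, each output feature has a mean of about 0 and a standard deviation of about 1 regardless of the input scale. During training the network is free to move `gamma` and `beta` away from these values if a different scale or offset works better.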

### 3. Faster Optimizers
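Plain SGD takes small, regular steps and can be very slow on deep networks. Advanced optimizers speed up convergence:
* **Momentum Optimization:** Like a ball rolling down a hill, it accumulates past gradients into a velocity, so it keeps gaining speed along directions where the gradient is consistent.
* **RMSProp:** Adapts the learning rate for each parameter by dividing its gradient by the square root of a decaying average of its recent squared gradients.
* **Adam:** Combines Momentum and RMSProp. It is the default choice for many DL tasks.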
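
The three update rules can be written out on a toy problem. The sketch below minimizes $f(\theta) = \theta^2$ with plain Python scalars; the function names, the starting point, and the hyperparameter values are assumptions chosen for illustration (the decay rates match common defaults), and real optimizers apply the same arithmetic element-wise to every weight.

```python
import math

def grad(theta):
    """Gradient of the toy loss f(theta) = theta**2."""
    return 2.0 * theta

def momentum_step(theta, m, lr=0.1, beta=0.9):
    m = beta * m - lr * grad(theta)          # accumulate a velocity
    return theta + m, m

def rmsprop_step(theta, s, lr=0.1, rho=0.9, eps=1e-7):
    g = grad(theta)
    s = rho * s + (1 - rho) * g * g          # decaying average of squared gradients
    return theta - lr * g / math.sqrt(s + eps), s

def adam_step(theta, m, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-7):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum-like)
    s = beta2 * s + (1 - beta2) * g * g      # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    s_hat = s / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(s_hat) + eps), m, s

theta_m, m = 5.0, 0.0
theta_r, s_r = 5.0, 0.0
theta_a, m_a, s_a = 5.0, 0.0, 0.0
for t in range(1, 201):
    theta_m, m = momentum_step(theta_m, m)
    theta_r, s_r = rmsprop_step(theta_r, s_r)
    theta_a, m_a, s_a = adam_step(theta_a, m_a, s_a, t)

print(f"momentum: {theta_m:.4f}  rmsprop: {theta_r:.4f}  adam: {theta_a:.4f}")
```

All three end up close to the minimum at 0. Because RMSProp normalizes the gradient by its recent magnitude, with a fixed learning rate it tends to hover near the minimum rather than settle exactly on it, which is one reason these optimizers are often paired with a decaying learning rate.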

### 4. Regularization (Dropout)
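Deep nets have millions of parameters and easily overfit. **Dropout** is a simple yet powerful technique: at every training step, every neuron (including the input neurons, but excluding the output neurons) has a probability $p$ of being temporarily "dropped out", meaning it outputs 0 for that step. The dropout rate $p$ is typically set between 10% and 50%. Because a neuron cannot count on any particular neighbor being present, the network is pushed to learn robust, redundant features.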

## 3. Code Reproduction

### 3.1 Dealing with Vanishing Gradients
We will load Fashion MNIST again and create a deep network (20 layers) to demonstrate modern configuration: **He Initialization** + **ELU Activation**.

In [None]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))

# Adding 20 dense layers with He Normal init and ELU activation
for _ in range(20):
    model.add(keras.layers.Dense(100,
                                 activation="elu",
                                 kernel_initializer="he_normal"))

model.add(keras.layers.Dense(10, activation="softmax"))

# Note: Without He Init and ELU, a 20-layer network would struggle to train.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=5,
                    validation_data=(X_valid, y_valid))

### 3.2 Batch Normalization
To make the model even more stable, we add `BatchNormalization` layers. The authors of the BN paper argue it should be added *before* the activation function.

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(), # Batch Norm on input
    keras.layers.Dense(300, use_bias=False), # Bias is handled by BN, so use_bias=False
    keras.layers.BatchNormalization(), # Batch Norm before activation
    keras.layers.Activation("elu"),
    keras.layers.Dense(100, use_bias=False),
    keras.layers.BatchNormalization(),
    keras.layers.Activation("elu"),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

history_bn = model.fit(X_train, y_train, epochs=5,
                       validation_data=(X_valid, y_valid))

### 3.3 Faster Optimizers & Learning Rate Scheduling
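We will use the **Adam** optimizer, which usually converges faster than plain SGD, together with a **Learning Rate Scheduler** that reduces the learning rate exponentially as training progresses. The `LearningRateScheduler` callback sets the optimizer's learning rate at the start of each epoch, so the schedule's initial value (0.01) replaces Adam's default of 0.001.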

In [None]:
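# Schedule: start at 0.01 and divide the learning rate by 10 every 20 epochs
def exponential_decay(epoch):
    return 0.01 * 0.1**(epoch / 20)

# The callback applies the scheduled rate to the optimizer at the start of each epoch
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay)

# SELU hidden layers are paired with LeCun initialization; Adam is selected in compile()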
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(100, activation="selu", kernel_initializer="lecun_normal"),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

history_adam = model.fit(X_train, y_train, epochs=5,
                         validation_data=(X_valid, y_valid),
                         callbacks=[lr_scheduler])

### 3.4 Regularization: Dropout
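To prevent overfitting, we add a `Dropout` layer before each `Dense` layer. A rate of 0.2 means each input to the following layer is dropped with probability 0.2 at every training step, so about 20% are zeroed on average.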

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])

history_dropout = model.fit(X_train, y_train, epochs=5,
                            validation_data=(X_valid, y_valid))

## 4. Step-by-Step Explanation

### 1. Initialization and Activation
**Problem:** In a 20-layer network with Sigmoid activations and naive random initialization, the gradient signal fades before it reaches the lower layers, so those layers barely learn.
**Solution:**
* `kernel_initializer="he_normal"`: Draws the initial weights from a zero-mean distribution with variance $2 / \text{fan}_{\text{in}}$, where $\text{fan}_{\text{in}}$ is the layer's number of inputs. This keeps the signal variance roughly constant from layer to layer.
* `activation="elu"`: Unlike ReLU, ELU outputs negative values for $z<0$, which brings the mean activation closer to 0 and gives a nonzero gradient for $z<0$, so neurons do not die.
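
Both settings can be checked numerically. The following NumPy sketch defines ReLU and ELU by hand and compares their mean output on zero-mean inputs; the helper names (`relu`, `elu`, `he_normal_std`) are defined here for illustration and are not Keras functions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    """ELU: alpha * (exp(z) - 1) for z < 0, identity otherwise."""
    # np.minimum keeps expm1 from overflowing on the (unused) positive branch
    return np.where(z < 0, alpha * np.expm1(np.minimum(z, 0.0)), z)

def he_normal_std(fan_in):
    """Standard deviation targeted by He initialization: sqrt(2 / fan_in)."""
    return np.sqrt(2.0 / fan_in)

z = np.random.default_rng(42).standard_normal(100_000)  # zero-mean pre-activations
print(f"mean ReLU output: {relu(z).mean():.3f}")
print(f"mean ELU output:  {elu(z).mean():.3f}")
print(f"He std for a layer with 100 inputs: {he_normal_std(100):.4f}")
```

For standard normal inputs the ReLU mean works out analytically to about 0.40, while the ELU mean is about 0.16, so ELU outputs sit noticeably closer to zero.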

### 2. Batch Normalization Structure
The sequence `Dense` -> `BatchNormalization` -> `Activation` is the textbook way to apply BN.
* `use_bias=False`: The `Dense` layer normally adds a bias ($XW + b$). `BatchNormalization` subtracts the batch mean, which cancels any constant bias, and then adds its own learned offset (beta). The `Dense` bias is therefore redundant, so we tell the layer not to create one.
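
The redundancy is easy to verify numerically. In this NumPy sketch (a simplified normalization with the scale fixed at 1 and the offset at 0; the function name `batch_norm` and the array shapes are illustrative assumptions), adding a bias before normalization has no effect on the result:

```python
import numpy as np

def batch_norm(z, eps=1e-3):
    """Normalize each feature over the batch (scale 1, offset 0 for simplicity)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 8))   # batch of 64 instances, 8 features
W = rng.standard_normal((8, 3))    # dense layer weights
b = rng.standard_normal(3)         # dense layer bias

with_bias = batch_norm(X @ W + b)
without_bias = batch_norm(X @ W)
print(np.allclose(with_bias, without_bias))  # → True
```

Adding `b` shifts every instance in the batch by the same amount, so it shifts the batch mean by that amount too and leaves the variance untouched. The subtraction removes it entirely.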

### 3. Adam and Scheduling
* **Adam:** You rarely need to tune Adam's momentum decay parameters; the Keras defaults (`beta_1=0.9`, `beta_2=0.999`) usually work well. The learning rate is the hyperparameter most worth tuning.
* **Scheduler:** We defined a function `exponential_decay` that lowers the learning rate every epoch, by a factor of 10 every 20 epochs. This allows the model to make big steps at the start (to learn fast) and tiny steps at the end (to settle close to the minimum).
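
To see what the schedule produces, the sketch below evaluates the same formula at a few epochs. The `lr0` and `s` parameters are added here for illustration; the chapter's version hard-codes 0.01 and 20.

```python
def exponential_decay(epoch, lr0=0.01, s=20):
    """Learning rate that shrinks by a factor of 10 every s epochs."""
    return lr0 * 0.1 ** (epoch / s)

for epoch in (0, 4, 20, 40):
    print(f"epoch {epoch:2d}: lr = {exponential_decay(epoch):.6f}")
```

The rate is 0.01 at epoch 0, 0.001 at epoch 20, and 0.0001 at epoch 40. Over a 5-epoch run (epochs 0 to 4) it only falls to about 0.0063, so the effect of this particular schedule becomes visible mainly in longer training runs.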

### 4. Dropout Behavior
During **training**, the `Dropout` layer randomly sets each input to zero with probability 0.2 and scales the surviving inputs up by $1 / (1 - 0.2) = 1.25$, so the expected sum of the inputs is unchanged.
During **testing/prediction**, the `Dropout` layer does nothing: it passes all signals through unscaled.
Because a different random subset of neurons is active at every training step, the final network behaves somewhat like an averaging ensemble of many smaller networks.
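
A minimal NumPy sketch of the scheme used by the Keras `Dropout` layer, often called inverted dropout, in which the rescaling happens during training so that inference needs no adjustment. The function name and signature are assumptions for this example, not the Keras API.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: drop inputs with probability `rate`, rescale survivors."""
    if not training:
        return x                               # inference: identity, no scaling
    keep_mask = rng.random(x.shape) >= rate    # True where the input is kept
    return x * keep_mask / (1.0 - rate)        # scale up to preserve the expected value

rng = np.random.default_rng(42)
x = np.ones((1000, 100))
train_out = dropout(x, rate=0.2, training=True, rng=rng)
test_out = dropout(x, rate=0.2, training=False, rng=rng)

print(f"fraction zeroed in training: {(train_out == 0).mean():.3f}")
print(f"value of surviving inputs:   {train_out.max():.2f}")
print(f"mean in training: {train_out.mean():.3f}, mean at inference: {test_out.mean():.3f}")
```

Roughly 20% of the inputs are zeroed in training and the rest become 1.25, so the mean stays near 1 in both modes. That match between training and inference statistics is the point of the rescaling.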

## 5. Chapter Summary

* **Initialization:** Use **He Initialization** for ReLU-based networks.
* **Activation:** Prefer **ELU** or **SELU** over ReLU for deep networks. Avoid Sigmoid.
* **Normalization:** Use **Batch Normalization** to train deep networks significantly faster and more stably.
* **Optimization:** Use **Adam** or **Nadam** instead of standard SGD.
* **Regularization:** Use **Dropout** to fight overfitting in large networks.
* **Transfer Learning:** Reuse lower layers of trained networks when you have limited data for a similar task.