# MODEL PERFORMANCE IMPROVEMENT WITH BATCH NORMALIZATION & SELF-REGULARIZATION

**_Experimenting with techniques that address problems in training deep neural networks (DNNs), such as vanishing/exploding gradients and slow training._**

**List of Experiments:**

1. Build a vanilla DNN with layers using **_He_ initialization** and the **_Swish_ activation** function. Add 20 hidden layers, each with 100 neurons, and an output layer with 10 neurons and **_softmax activation_**. Train the network using **_Nadam_ optimization** and **early stopping** on the CIFAR10 dataset.

2. Add **batch normalization** and compare the learning curves to check whether the model **converges faster** than before. Also check whether the model's **prediction performance improved** and whether batch normalization **affected training speed**.

3. **[OPTIONAL]** Test whether layers with **_SELU activation_** can **self-regulate** a neural network.
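
For reference, the activations and the initialization scale named in the list above can be written in a few lines of NumPy. This is a minimal sketch, not the Keras implementation: the names `swish`, `selu`, and `he_normal_std` are made up here for illustration, and the SELU constants are the values commonly quoted for its standard definition.

```python
import numpy as np

# Constants commonly quoted for the standard SELU definition.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def swish(x):
    """Swish (also called SiLU): x * sigmoid(x)."""
    x = np.asarray(x, dtype=np.float64)
    return x / (1.0 + np.exp(-x))


def selu(x):
    """SELU: scale * x for x > 0, scale * alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=np.float64)
    # Clamp the exponent input at 0 so large positive x cannot overflow.
    negative_branch = SELU_ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return SELU_SCALE * np.where(x > 0, x, negative_branch)


def he_normal_std(fan_in):
    """Target standard deviation of He normal initialization: sqrt(2 / fan_in)."""
    return float(np.sqrt(2.0 / fan_in))
```

With 100 inputs per layer, as in the hidden layers of this notebook, `he_normal_std(100)` is about 0.141.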

## Import Packages

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
import time

import tensorflow as tf

import matplotlib.pyplot as plt

## Helpers

In [None]:
def build_model(
    input_shape: tuple,
    hidden_layers: list,
    output_layer,
    ):

    # Resets all the keras states
    tf.random.set_seed(42)
    tf.keras.backend.clear_session()

    model = tf.keras.Sequential()

    model.add(tf.keras.layers.Input(shape=input_shape))
    model.add(tf.keras.layers.Flatten())

    for hidden_layer in hidden_layers:
        model.add(hidden_layer)

    model.add(output_layer)

    return model

In [None]:
def train_model(model, x, y, optimizer, loss, metrics, batch_size=32, 
                epochs=1, callbacks=None, validation_data=None):

    model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
    
    history = model.fit(
        x, y, batch_size=batch_size, epochs=epochs, callbacks=callbacks, validation_data=validation_data)
    
    return history


## Data Ingestion & Preparation

In [2]:
# Loads the CIFAR10 dataset, which contains 50,000 32x32-pixel color training images and 10,000 test images,
# labeled over 10 categories. See more info at the CIFAR homepage https://www.cs.toronto.edu/~kriz/cifar.html.

(X_train_full, y_train_full), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()

In [None]:
print(f"Pixel value range in train set: [{X_train_full.min()} - {X_train_full.max()}]")
print(f"Pixel value range in test set: [{X_test.min()} - {X_test.max()}]")

In [3]:
# Splits a validation set of 5,000 images off the training set
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=5000, random_state=42, stratify=y_train_full)

In [None]:
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_val shape: {X_val.shape}, y_val shape: {y_val.shape}")

To watch the learning curves live across all experiments, run the following command from a terminal:

`tensorboard --logdir=<logs_path> --port=6006 --bind_all`

## Experiment #1
**_Using better layer initialization, activation and optimization with early stopping_**

### Modeling

In [None]:
# Creates a list of 20 hidden layers each with 100 units, "swish" activation and "he_normal" as kernel initializer

hidden_layers = []
for _ in range(20):
    hidden_layers.append(#CODE HERE
                         )

# Also creates a dense layer as output with 10 units with "softmax" activation
output_layer = #CODE HERE
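
One possible way to fill in the placeholders above is sketched below. It is only an illustration of the description in the comments, not the reference solution; the `example_` names are made up here so the sketch does not overwrite the notebook's own variables.

```python
import tensorflow as tf

# Assumption: one reading of the exercise text, using standard Keras Dense layers.
example_hidden_layers = [
    tf.keras.layers.Dense(100, activation="swish", kernel_initializer="he_normal")
    for _ in range(20)
]

# Output layer: one unit per CIFAR10 class, with softmax over the 10 classes.
example_output_layer = tf.keras.layers.Dense(10, activation="softmax")
```

For the `build_model` call that follows, the input shape can be read from the data as `X_train.shape[1:]`, which is `(32, 32, 3)` for CIFAR10 images.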

In [None]:
# Builds the model specifying the shape of the input as a tuple, hidden layers and output layer

model1 = build_model(
    input_shape=#CODE HERE,
    hidden_layers=#CODE HERE,
    output_layer=#CODE HERE
)

In [None]:
# [OPTIONAL] Prints the model summary
model1.summary()

### Training the Model

In [None]:
# Configures a list of callbacks 
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=20, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("./models/cifar-10/checkpoints/cifar-10.weights.keras", save_best_only=True),
    tf.keras.callbacks.TensorBoard(f"./models/cifar-10/logs/run-{time.strftime('%Y.%m.%d_%H.%M.%S')}")
]
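
The observations at the end of Experiment #2 ask for the average time per epoch. Keras prints this in its progress output, but a small callback can also record it. The sketch below is a hypothetical helper (`EpochTimer` is not part of Keras or of this notebook) built on the standard `tf.keras.callbacks.Callback` hooks.

```python
import time

import tensorflow as tf


class EpochTimer(tf.keras.callbacks.Callback):
    """Hypothetical helper: records the wall-clock duration of each epoch in seconds."""

    def on_train_begin(self, logs=None):
        # Start with a fresh list on every call to fit().
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.perf_counter()

    def on_epoch_end(self, epoch, logs=None):
        self.epoch_times.append(time.perf_counter() - self._start)
```

If an instance such as `timer = EpochTimer()` is appended to the `callbacks` list before training, `np.mean(timer.epoch_times)` gives the average epoch time afterwards.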

In [None]:
# Trains the model1 making a call to `train_model` passing model, train set with labels,
# `Nadam` optimizer with 5e-4 as learning_rate, "sparse_categorical_crossentropy" as loss,
# ["accuracy"] as metrics, 100 as epochs, callbacks and validation set with labels
history1 = train_model(#CODE HERE
    


    
                      )

### Evaluating Performance

In [None]:
# Plots the learning curves

fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, figsize=(15, 4))

ax1.plot(history1.history["loss"], "b-", label="Train loss")
ax1.plot(history1.history["val_loss"], "r-", label="Validation loss")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()
ax1.set_title("Train vs. Validation loss")

ax2.plot(history1.history["accuracy"], "b-", label="Train accuracy")
ax2.plot(history1.history["val_accuracy"], "r-", label="Validation accuracy")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Accuracy")
ax2.legend()
ax2.set_title("Train vs. Validation accuracy")

In [None]:
print(f"Lowest validation loss at epoch: {np.argmin(history1.history['val_loss']) + 1}")

print(f"Highest validation accuracy of {max(history1.history['val_accuracy'])*100:.2f}% "
      f"reached at epoch {np.argmax(history1.history['val_accuracy']) + 1}")


In [None]:
# Tests prediction performance on test set
model1.evaluate(X_test, y_test)

Record the results above so you can compare them with those of the next experiment.

## Experiment #2
**_Adding batch normalization for faster convergence and better model performance_**
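
As a reminder of what a batch normalization layer computes during training, here is a minimal NumPy sketch: it standardizes each feature over the batch, then scales and shifts the result. The function name `batch_norm_forward` and the random batch are made up for illustration. A real Keras layer also learns `gamma` and `beta` and keeps moving averages for use at inference time, which this sketch leaves out.

```python
import numpy as np


def batch_norm_forward(x, gamma, beta, eps=1e-3):
    """Training-time batch normalization for inputs of shape (batch, features)."""
    mean = x.mean(axis=0)                     # per-feature mean over the batch
    var = x.var(axis=0)                       # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardize each feature
    return gamma * x_hat + beta               # scale and shift


# Illustrative batch: 32 samples, 100 features, deliberately off-center and wide.
rng = np.random.default_rng(42)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 100))
normalized = batch_norm_forward(batch, gamma=np.ones(100), beta=np.zeros(100))
```

With `gamma` set to ones and `beta` to zeros, each feature of `normalized` has a mean of about 0 and a standard deviation of about 1, whatever the scale of the input.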

### Modeling

In [None]:
# Creates a list of 20 pairs of layers, each pair being a batch normalization layer followed by
# a dense layer with 100 units, "swish" activation and "he_normal" as kernel initializer

hidden_layers = []
for _ in range(20):
    hidden_layers.append(#CODE HERE)
    hidden_layers.append(#CODE HERE)

# Adds batch normalization just before the last dense output layer
hidden_layers.append(#CODE HERE)

# Creates a dense layer as output with 10 units with "softmax" activation
output_layer = #CODE HERE
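
The placeholders above could be filled in along the following lines. This sketch assumes the batch normalization layer comes before each dense layer, which is one reading of the comments; it is not the reference solution, and the `bn_` names are made up here to keep it separate from the notebook's variables.

```python
import tensorflow as tf

bn_hidden_layers = []
for _ in range(20):
    # Assumption: normalize the inputs of each dense layer (BN placed before Dense).
    bn_hidden_layers.append(tf.keras.layers.BatchNormalization())
    bn_hidden_layers.append(
        tf.keras.layers.Dense(100, activation="swish", kernel_initializer="he_normal")
    )

# One more batch normalization layer just before the output layer.
bn_hidden_layers.append(tf.keras.layers.BatchNormalization())

bn_output_layer = tf.keras.layers.Dense(10, activation="softmax")
```

This gives 41 entries in the list: 20 BN/dense pairs plus the final batch normalization layer.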

In [None]:
# Builds the model specifying the shape of the input as a tuple, hidden layers and output layer
model2 = build_model(
    input_shape=#CODE HERE,
    hidden_layers=#CODE HERE,
    output_layer=#CODE HERE
)

In [None]:
# [OPTIONAL] Prints the model summary
model2.summary()

### Training the Model

In [None]:
# Configures the callbacks
callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True),
    tf.keras.callbacks.ModelCheckpoint("./models/cifar-10/checkpoints/cifar-10.weights.keras", save_best_only=True),
    tf.keras.callbacks.TensorBoard(f"./models/cifar-10/logs/run-{time.strftime('%Y.%m.%d_%H.%M.%S')}")
]

In [None]:
# Trains the model2 making a call to `train_model` passing model2, train set with labels,
# `Nadam` optimizer with 5e-4 as learning_rate, "sparse_categorical_crossentropy" as loss,
# ["accuracy"] as metrics, 100 as epochs, callbacks and validation set with labels
history2 = train_model(#CODE HERE
    


    
                      )

### Evaluating Performance

In [None]:
# Plots the learning curves

fig, (ax1, ax2) = plt.subplots(1, 2, sharex=True, figsize=(15, 4))

ax1.plot(history1.history["loss"], "c--", label="Train loss (without BN)")
ax1.plot(history1.history["val_loss"], "y--", label="Validation loss (without BN)")

ax1.plot(history2.history["loss"], "b-", label="Train loss (with BN)")
ax1.plot(history2.history["val_loss"], "r-", label="Validation loss (with BN)")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()
ax1.set_title("Train vs. Validation loss")


ax2.plot(history1.history["accuracy"], "c--", label="Train accuracy (without BN)")
ax2.plot(history1.history["val_accuracy"], "y--", label="Validation accuracy (without BN)")

ax2.plot(history2.history["accuracy"], "b-", label="Train accuracy (with BN)")
ax2.plot(history2.history["val_accuracy"], "r-", label="Validation accuracy (with BN)")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Accuracy")
ax2.legend()
ax2.set_title("Train vs. Validation accuracy")

In [None]:
print(f"Lowest validation loss at epoch: {np.argmin(history2.history['val_loss']) + 1}")

print(f"Highest validation accuracy of {max(history2.history['val_accuracy'])*100:.2f}% "
      f"reached at epoch {np.argmax(history2.history['val_accuracy']) + 1}")


In [None]:
# Tests prediction performance on test set
model2.evaluate(X_test, y_test)

**Observations:**

- Did the model with BN layers converge faster? Describe what the learning curves show.

- Did the model with BN layers turn out to be the better model?

- Compare the average time each model took to complete an epoch.
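
One rough way to put a number on "faster convergence" is to find the first epoch at which each model's validation loss reaches a chosen threshold. The helper below is a hypothetical sketch (`first_epoch_below` is not part of the notebook), and the two loss lists are made-up values for illustration only, not results from these experiments.

```python
import numpy as np


def first_epoch_below(val_losses, threshold):
    """Return the 1-based epoch where val_losses first reaches threshold, or None."""
    losses = np.asarray(val_losses, dtype=float)
    hits = np.flatnonzero(losses <= threshold)
    return int(hits[0]) + 1 if hits.size else None


# Made-up numbers for illustration only; replace with real training histories.
without_bn = [2.1, 1.9, 1.8, 1.7, 1.65]
with_bn = [1.9, 1.7, 1.6, 1.55, 1.5]

epoch_without_bn = first_epoch_below(without_bn, threshold=1.7)
epoch_with_bn = first_epoch_below(with_bn, threshold=1.7)
```

With real runs, the same call can be made on `history1.history["val_loss"]` and `history2.history["val_loss"]`, using a threshold that both models eventually reach.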