# Advanced Programme in Deep Learning (Foundations and Applications)
## A Program by IISc and TalentSprint


## Learning Objectives

* Understand the different regularization methods used to avoid overfitting in neural networks

## Dataset


The original MNIST dataset contains handwritten digits. People in the AI/ML and data science community love this dataset and use it as a benchmark to validate their algorithms. In fact, MNIST is often the first dataset they try. As the popular saying goes, if an algorithm doesn't work on MNIST, it won't work at all. But even if an algorithm works on MNIST, it may still fail on other datasets.


According to the original [paper](https://arxiv.org/abs/1708.07747) describing Fashion-MNIST, it is a dataset recomposed from the product pictures on Zalando's website. Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format, and structure of training and testing splits.

There are some good reasons to look beyond the MNIST dataset:

* MNIST is too easy - Neural networks can easily achieve 99.7% accuracy on MNIST, and even classic ML algorithms can achieve 97%.

* MNIST is overused - Almost everyone who has experience with deep learning has come across MNIST at least once.

* MNIST cannot represent modern CV tasks.





### Description

The dataset chosen for this experiment is Fashion-MNIST. The dataset is made up of 28x28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. The training set has 60,000 images and the test set has 10,000 images.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255.

**Labels / Classes**

0 - T-shirt/top

1 - Trouser

2 - Pullover

3 - Dress

4 - Coat

5 - Sandal

6 - Shirt

7 - Sneaker

8 - Bag

9 - Ankle boot

### Import required packages

In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Dropout
from tensorflow.keras.optimizers import SGD, RMSprop, Adam
from tensorflow.keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

In [None]:
# Load and preprocess Fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
X_train, X_test = X_train / 255.0, X_test / 255.0

**X_train / 255.0:** This operation divides every pixel value in the training images (X_train) by 255.0. Since the division is element-wise, each pixel value is scaled down to a value between 0 and 1.

**X_test / 255.0:** Similarly, this operation divides every pixel value in the test images (X_test) by 255.0.

After this normalization step, the pixel values of the images are in the range `[0, 1]`, which is a common range for input data to neural networks. This can help improve the convergence of the optimization process and the overall training performance of the model.
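
The scaling step can be checked on a tiny array. The sketch below uses a made-up 2x2 stand-in for the real images (the `raw` array is an assumption for illustration, not part of the dataset) and shows that dividing `uint8` pixels by the float `255.0` yields floating-point values in `[0, 1]`:

```python
import numpy as np

# Hypothetical stand-in for raw image data: two 2x2 "images" of uint8 pixels
raw = np.array([[[0, 64], [128, 255]],
                [[255, 255], [0, 32]]], dtype=np.uint8)

# Dividing by a float promotes the integer array to floating point,
# so the scaled values are not truncated to 0 or 1
scaled = raw / 255.0

print(scaled.dtype)                # float64
print(scaled.min(), scaled.max())  # 0.0 1.0
```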

In [None]:
# Dataset class names
class_names = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'
]

# Visualize a few images with their class labels
plt.figure(figsize=(10, 10))
for i in range(25):
    plt.subplot(5, 5, i + 1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(X_train[i], cmap=plt.cm.binary)
    plt.xlabel(class_names[y_train[i]])
plt.show()

Intentionally creating an overfitting scenario is a useful way to demonstrate the effectiveness of regularization techniques in deep neural networks. Let's break down why an overfitting scenario is important when showcasing these techniques:

**1. Understanding Overfitting:**

Overfitting occurs when a machine learning model performs well on the training data but poorly on unseen test data. It's a result of the model learning to capture noise or specific details in the training data that don't generalize to other data. By creating an overfitting scenario, you can showcase how models can become excessively complex and tailored to the training data.

**2. Need for Regularization:**

Regularization techniques help prevent or mitigate overfitting by constraining the complexity of the model. These techniques encourage the model to generalize well to new data rather than memorizing the training data. Demonstrating regularization techniques in an overfitting scenario highlights their role in improving model generalization.

**3. Visualizing the Problem:**

Visualizing an overfitting scenario can provide a clear visual representation of the problem. When you see training loss decreasing while validation loss starts increasing, it's a sign of overfitting. Visualizing this scenario helps you understand why regularization is necessary.

**4. Comparing Techniques:**

By applying regularization techniques to an overfitting model, you can directly compare the impact of these techniques on model performance. You can observe how each technique modifies the training and validation loss curves, showing how they prevent or reduce overfitting.
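
The visual cue described in point 3 can also be checked programmatically. The following is a minimal sketch, assuming the loss curves are plain Python lists such as those stored in `history.history`; the helper name `overfitting_onset` and the example numbers are made up for illustration:

```python
def overfitting_onset(train_loss, val_loss):
    """Return the 1-based epoch with the lowest validation loss if the
    training loss kept falling afterwards while validation loss rose;
    return None when no such divergence is visible."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    if best == len(val_loss) - 1:
        return None  # validation loss was still improving at the end
    train_kept_falling = train_loss[-1] < train_loss[best]
    val_rose = val_loss[-1] > val_loss[best]
    return best + 1 if (train_kept_falling and val_rose) else None

# Made-up curves: training loss keeps dropping, validation loss turns upward
train = [1.20, 0.80, 0.55, 0.40, 0.28, 0.19]
val = [1.25, 0.90, 0.75, 0.72, 0.78, 0.86]
print(overfitting_onset(train, val))  # 4
```

This is only a rough heuristic. Real validation curves are noisy, so in practice the plot is usually inspected as well.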

In [None]:
# Create an overfitting scenario by using a small dataset
small_X_train, small_y_train = X_train[:1000], y_train[:1000]

In [None]:
# Build a deep neural network model (intentionally prone to overfitting)
model_overfit = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])

In [None]:
# summary of the architecture of the neural network model
model_overfit.summary()
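
As a rough cross-check of the parameter counts that `summary()` reports, the number of trainable parameters in this stack of fully connected layers can be worked out by hand. The sketch below assumes the layer sizes from the model above and the 1,000-image subset with the 20% validation split used when training it; the helper name `dense_params` is made up for illustration:

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit
    return n_inputs * n_units + n_units

# Flatten output size followed by the three Dense layer widths
layer_sizes = [28 * 28, 256, 128, 10]
total_params = sum(dense_params(a, b)
                   for a, b in zip(layer_sizes, layer_sizes[1:]))

# 1,000 images minus the 20% held out for validation
train_samples = 1000 - int(1000 * 0.2)

print(total_params)                           # 235146
print(round(total_params / train_samples))    # 294
```

With roughly 294 parameters per training image, the network has more than enough capacity to memorize the small training subset.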

In [None]:
model_overfit.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])

In [None]:
# Train the overfitting model
history = model_overfit.fit(small_X_train, small_y_train, epochs=50, validation_split=0.2)

In [None]:
# Access training and validation loss
training_loss = history.history['loss']
validation_loss = history.history['val_loss']

In [None]:
# Access training and validation accuracy (if applicable)
training_accuracy = history.history['accuracy']
validation_accuracy = history.history['val_accuracy']

# Print the final training and validation loss
final_training_loss = training_loss[-1]
final_validation_loss = validation_loss[-1]
print(f"Final Training Loss: {final_training_loss:.4f}")
print(f"Final Validation Loss: {final_validation_loss:.4f}")

# Print the final training and validation accuracy (if applicable)
if training_accuracy and validation_accuracy:
    final_training_accuracy = training_accuracy[-1]
    final_validation_accuracy = validation_accuracy[-1]
    print(f"Final Training Accuracy: {final_training_accuracy:.4f}")
    print(f"Final Validation Accuracy: {final_validation_accuracy:.4f}")

The **model_overfit.fit()** function trains the model_overfit neural network using the provided training data (small_X_train and small_y_train). The training process involves iterating through the training dataset for a specified number of epochs, with a subset of the training data reserved for validation. The history variable stores information about the training process, which can be used for further analysis and visualization.
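
To make the role of `validation_split` concrete, the sketch below mimics the bookkeeping in plain Python. It relies on two documented Keras behaviours: the validation samples are taken from the end of the provided data before any shuffling, and the default batch size is 32. The variable names are illustrative only.

```python
import math

n_samples, validation_split, batch_size = 1000, 0.2, 32

indices = list(range(n_samples))
split_at = int(n_samples * (1 - validation_split))

# The last 20% of the samples are held out; the rest are used for training
train_idx, val_idx = indices[:split_at], indices[split_at:]

# Number of weight updates per epoch
steps_per_epoch = math.ceil(len(train_idx) / batch_size)

print(len(train_idx), len(val_idx), steps_per_epoch)  # 800 200 25
```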

In [None]:
# Plot training and validation loss
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.title('Overfitting Scenario: Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

Now, let's introduce regularization techniques to overcome overfitting:

### 1. Optimizers:

Some popular optimizers used to speed up the training of large deep neural networks are momentum optimization, RMSProp, and Adam optimization. Refer [here](https://mlfromscratch.com/optimizers-explained/#/) for a detailed explanation.

#### Momentum Optimization

At each iteration, momentum optimization subtracts the local gradient (multiplied by the learning rate $η$) from the momentum vector $m$, and it updates the weights by simply adding this momentum vector, which accelerates training. The momentum hyperparameter $β$ is introduced to prevent the momentum from growing too large (it is set between 0 and 1, typically 0.9).
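
A minimal sketch of the momentum update, assuming a single scalar weight and the toy objective $f(θ) = θ^2$ (whose gradient is $2θ$). The function name `momentum_step` and the learning rate of 0.1 are chosen for illustration and do not come from Keras:

```python
def momentum_step(theta, m, grad, lr=0.1, beta=0.9):
    # Decay the previous momentum and subtract the scaled local gradient
    m = beta * m - lr * grad
    # Update the weight by adding the momentum vector
    theta = theta + m
    return theta, m

theta, m = 5.0, 0.0
for _ in range(300):
    theta, m = momentum_step(theta, m, grad=2 * theta)

print(abs(theta) < 1e-3)  # True
```

In Keras, the same idea is available through the `momentum` argument of the `SGD` optimizer.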


#### RMSProp

The RMSProp algorithm accumulates only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay when accumulating the squared gradients.

The decay rate $β$ is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so we may not need to tune it at all.
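
The update can be sketched for a single scalar weight as follows. This follows the common textbook formulation; the function name `rmsprop_step`, the learning rate, the smoothing term `eps`, and the constant gradient of 2.0 are assumptions for illustration, and library implementations may differ in small details such as where `eps` is added:

```python
import math

def rmsprop_step(theta, s, grad, lr=0.01, beta=0.9, eps=1e-7):
    # Exponentially decaying average of squared gradients
    s = beta * s + (1 - beta) * grad ** 2
    # Scale the step down where recent gradients have been large
    theta = theta - lr * grad / math.sqrt(s + eps)
    return theta, s

theta, s = 1.0, 0.0
for _ in range(100):
    theta, s = rmsprop_step(theta, s, grad=2.0)  # hypothetical constant gradient

# With a constant gradient g, the running average s approaches g**2
print(round(s, 3))
```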


#### Adam Optimization

Adam combines the ideas of momentum optimization and RMSProp: it keeps track of both an exponentially decaying average of past gradients and an exponentially decaying average of past squared gradients.

The momentum decay hyperparameter $β_1$ is typically initialized to 0.9, while the scaling decay hyperparameter $β_2$ is often initialized to 0.999.
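
The two running averages can be sketched for a single scalar weight as shown below. The sketch follows the standard Adam formulation, which also includes a bias-correction step not described above; the function name `adam_step` and the example gradients are made up for illustration:

```python
import math

def adam_step(theta, m, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    # Decaying average of past gradients (the momentum part)
    m = beta1 * m + (1 - beta1) * grad
    # Decaying average of past squared gradients (the RMSProp part)
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction: m and s start at 0, so early values are scaled up
    m_hat = m / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(s_hat) + eps)
    return theta, m, s

# The first step has a size close to lr regardless of the gradient's scale
small, _, _ = adam_step(1.0, 0.0, 0.0, grad=3.0, t=1)
large, _, _ = adam_step(1.0, 0.0, 0.0, grad=300.0, t=1)
print(round(1.0 - small, 6), round(1.0 - large, 6))
```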

### 2. Dropout:

Add dropout layers after each hidden layer:

Dropout is one of the most popular regularization techniques for deep neural networks. At each training stage, individual nodes are either dropped out of the net with probability $1-p$ or kept with probability $p$, so that a reduced network is left; incoming and outgoing edges to a dropped-out node are also removed.

![Image](https://i.ibb.co/HnfSTyX/M5-2.jpg)

$\text{Figure: Dropout Regularization}$

To implement dropout using Keras, we can use the `keras.layers.Dropout` layer. During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability. After training, it just passes the inputs to the next layer unchanged. Note that the layer's `rate` argument is the probability of dropping an input ($1-p$ in the notation above). The regularized model built later in this notebook places a Dropout layer with a rate of 0.5 after each hidden Dense layer.
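
The drop-and-rescale behaviour can be sketched in NumPy. This is an illustrative re-implementation under simple assumptions (a 1-D array of activations and a fixed random seed), not the code Keras itself runs; the function name `dropout` is made up:

```python
import numpy as np

def dropout(x, rate, training, rng):
    if not training:
        return x  # at inference time the inputs pass through unchanged
    keep_prob = 1.0 - rate
    # Keep each input with probability keep_prob, zero out the rest
    mask = rng.random(x.shape) < keep_prob
    # Rescale the survivors so the expected activation stays the same
    return x * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped = dropout(activations, rate=0.5, training=True, rng=rng)

print(np.unique(dropped))        # only 0.0 and 2.0 appear
print(round(dropped.mean(), 1))  # close to the original mean of 1.0
```

Dividing by the keep probability is what allows the layer to be a no-op after training: the expected input to the next layer is the same in both modes.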

### 3. L2 Regularization:

Apply L2 regularization to all hidden layers:

Deep neural networks may have millions of parameters. The network therefore has vast freedom and can fit a huge variety of complex datasets. However, this flexibility also makes it prone to overfitting the training set, so we need regularization.

Alongside dropout, $ℓ_1$ and $ℓ_2$ regularization are popular regularization techniques for neural networks.

We can use $ℓ_1$ and $ℓ_2$ regularization to constrain a neural network's connection weights (but typically not its biases). In Keras, this is done through a layer's `kernel_regularizer` argument; the regularized model below applies $ℓ_2$ regularization with a regularization factor of 0.001.
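
The effect of the $ℓ_2$ term on the loss can be sketched in NumPy. The penalty is the regularization factor times the sum of squared weights, which is how the Keras `l2` regularizer is defined. The weight matrix `W` and the data loss of 0.30 are made-up values for illustration:

```python
import numpy as np

def l2_penalty(weights, factor=0.001):
    # Regularization factor times the sum of squared connection weights
    return factor * np.sum(np.square(weights))

W = np.array([[0.5, -1.0],
              [2.0, 0.0]])   # hypothetical layer weights
data_loss = 0.30             # hypothetical cross-entropy value

total_loss = data_loss + l2_penalty(W)
print(f"{total_loss:.5f}")   # 0.30525
```

Because large weights add more to the penalty, the optimizer is pushed toward smaller weights, which limits how closely the network can fit noise in the training set.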

### 4. Early Stopping:

Add an early stopping callback to stop training if the validation loss doesn't improve for a set number of epochs (the `patience`):

In [None]:
from tensorflow.keras.layers import Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

In [None]:
# Build a model with regularization techniques
regularized_model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(256, activation='relu', kernel_regularizer=l2(0.001)),  # L2 regularization
    Dropout(0.5),  # Dropout layer
    Dense(128, activation='relu', kernel_regularizer=l2(0.001)),
    Dropout(0.5),
    Dense(10, activation='softmax')
])

In [None]:
# Compile the regularized model
regularized_model.compile(optimizer='adam',
                         loss='sparse_categorical_crossentropy',
                         metrics=['accuracy'])

# Define Early Stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
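
The patience logic behind this callback can be sketched in plain Python. This is a simplified illustration of the idea, not the Keras implementation; the function name `early_stopping_epoch` and the validation losses are made up:

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (epoch at which training stops, epoch with the best loss),
    both 1-based."""
    best_loss, best_epoch, wait = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses: the best value is at epoch 3
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.63, 0.66, 0.50]
print(early_stopping_epoch(val_losses))  # (8, 3)
```

In this example training stops at epoch 8, after five epochs without improvement, so the later drop to 0.50 is never reached. With `restore_best_weights=True`, the weights from the best epoch (here epoch 3) are the ones kept.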

In [None]:
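# Train the regularized model
# Note: unlike the overfitting run above, this uses the full training set,
# and the test set doubles as validation data for early stopping
history_regularized = regularized_model.fit(X_train, y_train, epochs=50,
                                            validation_data=(X_test, y_test),
                                            callbacks=[early_stopping])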

### Visualize the loss after applying regularization

In [None]:
# Access training and validation loss
training_loss = history_regularized.history['loss']
validation_loss = history_regularized.history['val_loss']

# Access training and validation accuracy (if applicable)
training_accuracy = history_regularized.history['accuracy']
validation_accuracy = history_regularized.history['val_accuracy']

# Print the final training and validation loss
final_training_loss = training_loss[-1]
final_validation_loss = validation_loss[-1]
print(f"Final Training Loss: {final_training_loss:.4f}")
print(f"Final Validation Loss: {final_validation_loss:.4f}")

# Print the final training and validation accuracy (if applicable)
if training_accuracy and validation_accuracy:
    final_training_accuracy = training_accuracy[-1]
    final_validation_accuracy = validation_accuracy[-1]
    print(f"Final Training Accuracy: {final_training_accuracy:.4f}")
    print(f"Final Validation Accuracy: {final_validation_accuracy:.4f}")

In [None]:
# Plot training and validation loss for the regularized model
plt.plot(history_regularized.history['loss'], label='Train Loss')
plt.plot(history_regularized.history['val_loss'], label='Validation Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.title('Regularized Model')
plt.show()

The regularization techniques shown above are meant to help you understand the basic idea of how to use these methods in overfitting scenarios. You can also tweak the hyperparameters to further reduce overfitting and improve performance.

