[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - Regularization

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 25/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0081DeepLearningRegularization.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
import pickle
from platform import python_version
import random
import time

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import HTML, Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply Seaborn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

D_CLASSES_CIFAR_10  = {0: 'Airplane', 1: 'Automobile', 2: 'Bird', 3: 'Cat', 4: 'Deer', 5: 'Dog', 6: 'Frog', 7: 'Horse', 8: 'Ship', 9: 'Truck'}
L_CLASSES_CIFAR_10  = ['Airplane', 'Automobile', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck']
T_IMG_SIZE_CIFAR_10 = (32, 32, 3)


In [None]:
# Download Auxiliary Modules for Google Colab
if runInGoogleColab:
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataManipulation.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataVisualization.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DeepLearningBlocks.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/NumericDiff.py

In [None]:
# Courses Packages

from DataVisualization import PlotLabelsHistogram, PlotMnistImages
from DeepLearningBlocks import DataSet, LinearLayer, ModelNN, NNMode, NNWeightInit, Optimizer, ReLULayer, SGD
from DeepLearningBlocks import CrossEntropyLoss, RunEpoch, ScoreAccLogits


In [None]:
# General Auxiliary Functions


## Neural Net Regularization

_Regularization_ is the set of techniques used to prevent the model from _overfitting_.   
It restricts the effective _Degrees of Freedom_ of the model so that it generalizes to new data.

There are many methods:

 - Data Based  
   Methods to extend the data, either by collecting more data or by _augmenting_ the given data.
 - Architecture Based  
   Define the model structure with elements and an architecture which fit the problem.  
   If the architecture fits the problem, the number of required parameters is reduced.  
   For instance, the use of convolution for temporally and spatially correlated data.
 - Optimization / Prior Based  
   Use optimization / prior based methods to regulate the process.  
   This includes regulating the magnitude of the parameters by adding norm based regularization on the parameters.  
   Additional options include different loss functions.
 - Scheduling Techniques  
   Stopping the training or changing its policy once overfitting is detected.
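
As an illustration of the _Data Based_ approach, the following is a minimal sketch of augmentation by a random horizontal flip. It is not part of the course modules: the function name `AugmentHorizontalFlip()` and its parameters are hypothetical, and it assumes each row of the input is a flattened `H x W x C` image.

```python
import numpy as np

def AugmentHorizontalFlip( mX: np.ndarray, tuImgSize: tuple, flipProb: float = 0.5, seedNum: int = 0 ) -> np.ndarray:
    # Hypothetical helper: Each row of `mX` is assumed to be a flattened H x W x C image.
    oRng  = np.random.default_rng(seedNum)
    tX    = np.reshape(mX, (mX.shape[0], *tuImgSize)).copy() #<! (N, H, W, C), copy to keep the input intact
    vFlip = oRng.random(mX.shape[0]) < flipProb              #<! Which samples to flip
    tX[vFlip] = tX[vFlip, :, ::-1, :]                        #<! Reverse the width axis

    return np.reshape(tX, (mX.shape[0], -1))

# Usage on a small synthetic batch (2 images of size 4 x 4 x 3)
mXDemo = np.arange(2 * 4 * 4 * 3, dtype = float).reshape(2, -1)
mXFlip = AugmentHorizontalFlip(mXDemo, (4, 4, 3), flipProb = 1.0)
print(mXFlip.shape)
```

Flipping every image twice recovers the original batch, which is a simple sanity check for such a transform.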

This notebook demonstrates the following regularization methods:

1. Early Stopping.
2. Weight Decay.
3. Dropout Layer.

* <font color='brown'>(**#**)</font> **Data** is the most effective regularizer.
* <font color='brown'>(**#**)</font> An overview is given in [Regularizing Your Neural Networks](https://rawgit.com/danielkunin/Deeplearning-Visualizations/master/regularization/index.html).


![Regularization Effect](https://i.imgur.com/wdBiBh8.png)

In [None]:
# Parameters

# Data
numSamplesTrain = 50_000
numSamplesTest  = 10_000

# Model
dropP = 0.5 #<! Dropout Layer: Used as the probability to keep an activation

# Training
batchSize   = 256
nEpochs     = 20

# Visualization
numImg = 3


## Generate / Load Data

Load the [CIFAR 10 Data Set](https://en.wikipedia.org/wiki/CIFAR-10).  
It is composed of 60,000 RGB images of size `32x32` with 10 classes uniformly spread.

* <font color='brown'>(**#**)</font> The data set is available at [OpenML - CIFAR 10](https://www.openml.org/search?type=data&sort=runs&id=40927&status=active).  

In [None]:
# Load Data

mX, vY = fetch_openml('CIFAR_10', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Reorder Data
# Data is C x H x W -> H x W x C for displaying
mX = np.reshape(mX, (mX.shape[0], *T_IMG_SIZE_CIFAR_10[::-1]))
mX = np.transpose(mX, (0, 2, 3, 1))
mX = np.reshape(mX, (mX.shape[0], -1))

In [None]:
# Pre Process Data

mX = mX / 255.0

* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.

### Plot the Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg, tuImgSize = T_IMG_SIZE_CIFAR_10)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY, lClass = L_CLASSES_CIFAR_10)
plt.show()

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True, stratify = vY)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

vMean = np.mean(mXTrain, axis = 0)
vStd  = np.std(mXTrain, axis = 0)

# Processing the test data based on the train data!
mXTrain -= vMean
mXTest  -= vMean
mXTrain /= vStd
mXTest  /= vStd

* <font color='red'>(**?**)</font> What should be done with `vMean` and `vStd` in production?

In [None]:
# Generate the Data Sets
# The DataSet assumes each column is a sample.

oDsTrain    = DataSet(mXTrain.T, vYTrain, batchSize) #<! Train Data Set
oDsVal      = DataSet(mXTest.T, vYTest, batchSize)   #<! Validation Data Set

print(f'The batch size: {batchSize}')
print(f'The training data set number of batches per Epoch: {len(oDsTrain)}')
print(f'The validation data set number of batches per Epoch: {len(oDsVal)}')

## Regularizers

This section implements 3 regularizers:

 * Early Stopping  
   Stop the training at the first signs of overfitting.  
   This implementation does not stop the training. It only saves the best model according to the validation score.
 * Weights Decay  
   Implementation of the weight decay in a manner which decouples it from the actual loss.
 * Dropout Layer  
   The _Dropout Layer_ avoids a single point of failure where a small number of features might become too significant.  
   It "forces" the model to learn to get the result in "many ways".

</br>

* <font color='brown'>(**#**)</font> Some practitioners, for various reasons, actually stop the training once overfitting is detected. This is the literal meaning of _Early Stopping_.
* <font color='brown'>(**#**)</font> In most cases _Cross Validation_ is infeasible for Deep Learning. 
* <font color='brown'>(**#**)</font> You may read on _Weight Decay_ in [Dive into Deep Learning - 3.7. Weight Decay](https://d2l.ai/chapter_linear-regression/weight-decay.html). 
* <font color='brown'>(**#**)</font> You may read on _Dropout_ in [Dive into Deep Learning - 5.6. Dropout](http://d2l.ai/chapter_multilayer-perceptrons/dropout.html). 
* <font color='brown'>(**#**)</font> You may read on _Dropout_ in [What Makes Dropout Effective](https://datascience.stackexchange.com/questions/37021). 


### Early Stopping

The early stopping is incorporated into the training loop.
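
For comparison, a rule which actually stops the training is commonly based on _patience_: stop once the validation score has not improved for a given number of consecutive epochs. The following is a minimal sketch under that assumption. The function `EarlyStopEpoch()`, the `patience` parameter and the demo scores are hypothetical and are not used by the training loop of this notebook.

```python
def EarlyStopEpoch( lValScore: list, patience: int = 3 ) -> int:
    # Hypothetical helper: Returns the (0 based) epoch index at which a patience based rule stops.
    # Assumes higher score is better.
    bestScore    = float('-inf')
    numBadEpochs = 0
    for ii, valScr in enumerate(lValScore):
        if valScr > bestScore:
            bestScore    = valScr
            numBadEpochs = 0 #<! Improvement, reset the counter
        else:
            numBadEpochs += 1
            if numBadEpochs >= patience:
                return ii #<! No improvement for `patience` epochs

    return len(lValScore) - 1 #<! The rule was never triggered

# Made up validation scores for illustration
lValScoreDemo = [0.30, 0.38, 0.42, 0.41, 0.43, 0.42, 0.42, 0.41, 0.40]
print(EarlyStopEpoch(lValScoreDemo, patience = 3)) #<! 7
```

The best score in the demo is at epoch index 4, while the rule stops at index 7. Hence such a rule is usually combined with a checkpoint of the best model.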

In [None]:
# Training Model Loop Function

def TrainModel( oModel: ModelNN, oDsTrain: DataSet, oDsVal: DataSet, oOpt: Optimizer, numEpoch: int, hL: Callable, hS: Callable ) -> Tuple[ModelNN, List, List, List, List]:

    lTrainLoss  = []
    lTrainScore = []
    lValLoss    = []
    lValScore   = []

    #!!!#
    bestScore = -math.inf #<! Assuming higher is better (Guarantees a checkpoint on the first epoch)
    #!!!#

    for ii in range(numEpoch):
        startTime           = time.time()
        trainLoss, trainScr = RunEpoch(oModel, oDsTrain, oOpt, hL, hS, opMode = NNMode.TRAIN) #<! Train
        valLoss,   valScr   = RunEpoch(oModel, oDsVal, oOpt, hL, hS, opMode = NNMode.INFERENCE)    #<! Score Validation
        epochTime           = time.time() - startTime

        # Aggregate Results
        lTrainLoss.append(trainLoss)
        lTrainScore.append(trainScr)
        lValLoss.append(valLoss)
        lValScore.append(valScr)
        
        # Display (Babysitting)
        print('Epoch '              f'{(ii + 1):4d} / ' f'{numEpoch}:', end = '')
        print(' | Train Loss: '     f'{trainLoss          :6.3f}', end = '')
        print(' | Val Loss: '       f'{valLoss            :6.3f}', end = '')
        print(' | Train Score: '    f'{trainScr           :6.3f}', end = '')
        print(' | Val Score: '      f'{valScr             :6.3f}', end = '')
        print(' | Epoch Time: '     f'{epochTime          :5.2f}', end = '')

        #!!!#
        # Save best model ("Early Stopping")
        if valScr > bestScore:
            bestScore = valScr
            print(' | <-- Checkpoint!', end = '')
            with open('BestModel.pkl', 'wb') as oFile:
                pickle.dump(oModel, oFile)
        print(' |')
        #!!!#
    
    #!!!#
    # Load best model ("Early Stopping")
    with open('BestModel.pkl', 'rb') as oFile:
        oModel = pickle.load(oFile)
    #!!!#

    return oModel, lTrainLoss, lTrainScore, lValLoss, lValScore


* <font color='red'>(**?**)</font> Why is the model saved to a file rather than kept as an in memory copy? Think about memory considerations and error tolerance.

In [None]:
# Define Model

oModel = ModelNN([
        LinearLayer(mX.shape[1], 200, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(200, 100, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(100, 50, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(50, 10,  initMethod = NNWeightInit.KAIMING),
        ])

In [None]:
# Define Optimizer

oOpt = Optimizer(SGD(μ = 2e-3, β = 0.9))

In [None]:
# Train a Model

oModel.Init()
oModel, lTrainLoss, lTrainScore, lValLoss, lValScore = TrainModel(oModel, oDsTrain, oDsVal, oOpt, nEpochs, CrossEntropyLoss, ScoreAccLogits)

In [None]:
# Plot Results
hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))
vHa = vHa.flat

hA = vHa[0]
hA.plot(lTrainLoss, lw = 2, label = 'Train Loss')
hA.plot(lValLoss, lw = 2, label = 'Validation Loss')
hA.grid()
hA.set_title('Cross Entropy Loss')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Loss')
hA.legend();


hA = vHa[1]
hA.plot(lTrainScore, lw = 2, label = 'Train Score')
hA.plot(lValScore, lw = 2, label = 'Validation Score')
hA.grid()
hA.set_title('Accuracy Score')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Score')
hA.legend();


* <font color='red'>(**?**)</font> Where would classic "Early Stopping" stop the training?
* <font color='brown'>(**#**)</font> Good regularization prevents the training and validation curves from diverging.

### Weights Decay

This section implements _Weights Decay_ with _SGD_:

1. $\boldsymbol{v}^{\left(t\right)}=\beta\boldsymbol{v}^{\left(t-1\right)}-\mu\nabla L\left(\boldsymbol{w}^{\left(t\right)}\right)$.
2. $\boldsymbol{w}^{\left(t+1\right)}=\boldsymbol{w}^{\left(t\right)}+\boldsymbol{v}^{\left(t\right)}-\lambda\boldsymbol{w}^{\left(t\right)}$.

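* <font color='brown'>(**#**)</font> For "Vanilla SGD" this update is mathematically equivalent to adding a squared ${L}_{2}$ penalty on the weights to the loss. For other methods (For instance, with momentum) the equivalence is conceptual only, not mathematical.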
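
The "Vanilla SGD" case can be verified numerically. With $\beta = 0$ the weight decay step is $\boldsymbol{w} - \mu \nabla L - \lambda \boldsymbol{w}$, which equals a gradient step on $L \left( \boldsymbol{w} \right) + \frac{\lambda}{2 \mu} {\left\| \boldsymbol{w} \right\|}_{2}^{2}$. The following is a minimal sketch where a random vector stands in for the gradient. The names `stepSize` and `weightDecay` are illustrative stand-ins for $\mu$ and $\lambda$.

```python
import numpy as np

oRng = np.random.default_rng(0)
vW   = oRng.standard_normal(5) #<! Current weights
vG   = oRng.standard_normal(5) #<! Stand in for the gradient of the loss at vW

stepSize    = 2e-3 #<! μ
weightDecay = 8e-4 #<! λ

# Decoupled weight decay step (Vanilla SGD, β = 0)
vW1 = vW - stepSize * vG - weightDecay * vW

# Gradient step on L(w) + (λ / (2μ)) * ||w||^2
vGReg = vG + (weightDecay / stepSize) * vW
vW2   = vW - stepSize * vGReg

print(np.allclose(vW1, vW2)) #<! True
```

With momentum the decay term is not accumulated in $\boldsymbol{v}$, so the two updates no longer coincide.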

In [None]:
# The SGDMW Class (SGD with Momentum and Weight Decay)

class SGDMW():
    def __init__( self, μ: float = 1e-3, β: float = 0.9, λ = 0.0 ) -> None:
        
        self.μ = μ
        self.β = β
        self.λ = λ #<! Weight Decay (L2 Squared)

    def Step( self, mW: np.ndarray, mDw: np.ndarray, dState: Optional[Dict] = None ) -> Tuple[np.ndarray, Dict]:
        
        dState        = {} if dState is None else dState     #<! Avoid a shared mutable default argument
        mV            = dState.get('mV', np.zeros(mW.shape)) #<! Default for 1st iteration
        mV            = self.β * mV - self.μ * mDw
        mW           += mV - (self.λ * mW)
        dState['mV']  = mV

        return mW, dState


In [None]:
# Define Optimizer

oOpt = Optimizer(SGDMW(μ = 2e-3, β = 0.9, λ = 8e-4))

In [None]:
# Train a Model

oModel.Init()
oModel, lTrainLoss, lTrainScore, lValLoss, lValScore = TrainModel(oModel, oDsTrain, oDsVal, oOpt, nEpochs, CrossEntropyLoss, ScoreAccLogits)

In [None]:
# Plot Results
hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))
vHa = vHa.flat

hA = vHa[0]
hA.plot(lTrainLoss, lw = 2, label = 'Train Loss')
hA.plot(lValLoss, lw = 2, label = 'Validation Loss')
hA.grid()
hA.set_title('Cross Entropy Loss')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Loss')
hA.legend();


hA = vHa[1]
hA.plot(lTrainScore, lw = 2, label = 'Train Score')
hA.plot(lValScore, lw = 2, label = 'Validation Score')
hA.grid()
hA.set_title('Accuracy Score')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Score')
hA.legend();


### Dropout Layer

#### Forward

$$\boldsymbol{z}=\frac{1}{p}\boldsymbol{x}\odot\boldsymbol{m}=\frac{1}{p}\text{Diag}\left(\boldsymbol{m}\right)\boldsymbol{x}$$

 * $\boldsymbol{m}$ is a mask (same size as $\boldsymbol{x}$) such that each $\boldsymbol{m}\left[i\right]\sim\text{Bernoulli}\left(p\right)$. Namely, $p$ is the probability to **keep** an element.
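
The scaling by $\frac{1}{p}$ keeps the expected value of the output equal to the input, which is why no scaling is needed at test time. The following is a minimal sketch which estimates the mean of $\boldsymbol{z}$ over many random masks. The vector and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

oRng = np.random.default_rng(0)
p    = 0.5                          #<! Probability to keep an element
vX   = np.array([1.0, -2.0, 3.0])

numTrials = 100_000
mMask = (oRng.random((numTrials, vX.size)) < p) / p #<! Scaled mask per trial (Row)
mZ    = vX * mMask                                  #<! Forward pass per trial

print(np.mean(mZ, axis = 0)) #<! Close to vX
```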

#### Backward

$$
\left\langle \nabla_{\boldsymbol{z}}L,\nabla_{\boldsymbol{x}}\boldsymbol{z}\left[\boldsymbol{h}\right]\right\rangle =\left\langle \nabla_{\boldsymbol{z}}L,\frac{1}{p}\text{Diag}\left(\boldsymbol{m}\right)\boldsymbol{h}\right\rangle =\left\langle \frac{1}{p}\text{Diag}\left(\boldsymbol{m}\right)\nabla_{\boldsymbol{z}}L,\boldsymbol{h}\right\rangle 
$$
$$
\implies\boxed{\nabla_{\boldsymbol{x}}L=\frac{1}{p}\text{Diag}\left(\boldsymbol{m}\right)\nabla_{\boldsymbol{z}}L=\frac{1}{p}\nabla_{\boldsymbol{z}}L\odot\boldsymbol{m}}
$$
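
The boxed result can be checked against a numerical gradient for a fixed mask. The following is a minimal sketch which assumes the loss $L = \frac{1}{2} {\left\| \boldsymbol{z} \right\|}_{2}^{2}$, chosen only because its gradient with respect to $\boldsymbol{z}$ is $\boldsymbol{z}$ itself. The function `Loss()` is hypothetical.

```python
import numpy as np

oRng  = np.random.default_rng(1)
p     = 0.5
vX    = oRng.standard_normal(6)
vMask = (oRng.random(6) < p) / p #<! Fixed scaled mask: m / p

def Loss( vX: np.ndarray ) -> float:
    # Hypothetical loss: L = 0.5 * ||z||^2 with z = (1 / p) * x ⊙ m
    vZ = vX * vMask
    return 0.5 * np.sum(vZ ** 2)

vDz = vX * vMask  #<! ∇_z L = z
vDx = vDz * vMask #<! The boxed formula: ∇_x L = (1 / p) * ∇_z L ⊙ m

# Central finite differences as a reference
epsVal = 1e-6
vDxNum = np.array([(Loss(vX + epsVal * vE) - Loss(vX - epsVal * vE)) / (2 * epsVal) for vE in np.eye(6)])

print(np.allclose(vDx, vDxNum, atol = 1e-6)) #<! True
```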

</br>

* `Forward` - For train time (With dropout).
* `Predict` - For test time (Without dropout).

</br>

* <font color='brown'>(**#**)</font> The original paper, [Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov - Dropout: A Simple Way to Prevent Neural Networks from Overfitting](https://jmlr.org/papers/v15/srivastava14a.html), suggested using the Dropout layer on the input features. See [Dropout on the Input Layer](https://datascience.stackexchange.com/questions/38507).

In [None]:
# The Dropout Layer Class

class DropoutLayer():
    def __init__( self, p: float = 0.5 ) -> None:
        
        self.p       = p
        self.mMask   = None
        self.dGrads  = {}
        self.dParams = {}

    # Train Time
    def Forward( self, mX: np.ndarray ) -> np.ndarray:
        
        self.mMask = (np.random.rand(*mX.shape) < self.p) / self.p
        mZ         = mX * self.mMask

        return mZ

    # Test Time
    def Predict( self, mX: np.ndarray ) -> np.ndarray:
        
        return mX
    
    def Backward( self, mDz: np.ndarray) -> np.ndarray:
        
        mDx   = mDz * self.mMask

        return mDx


* <font color='blue'>(**!**)</font> Go through the code of the `ModelNN` class (Forward vs. Predict).

In [None]:
# Define Model

oModel = ModelNN([
        LinearLayer(mX.shape[1], 200, initMethod = NNWeightInit.KAIMING), ReLULayer(), DropoutLayer(dropP),
        LinearLayer(200, 100, initMethod = NNWeightInit.KAIMING), ReLULayer(), DropoutLayer(dropP),
        LinearLayer(100, 50, initMethod = NNWeightInit.KAIMING), ReLULayer(), DropoutLayer(dropP),
        LinearLayer(50, 10,  initMethod = NNWeightInit.KAIMING),
        ])

In [None]:
# Define Optimizer

oOpt = Optimizer(SGDMW(μ = 2e-3, β = 0.9, λ = 5e-5))

In [None]:
# Train a Model

oModel.Init()
oModel, lTrainLoss, lTrainScore, lValLoss, lValScore = TrainModel(oModel, oDsTrain, oDsVal, oOpt, 2 * nEpochs, CrossEntropyLoss, ScoreAccLogits)

In [None]:
# Plot Results
hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))
vHa = vHa.flat

hA = vHa[0]
hA.plot(lTrainLoss, lw = 2, label = 'Train Loss')
hA.plot(lValLoss, lw = 2, label = 'Validation Loss')
hA.grid()
hA.set_title('Cross Entropy Loss')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Loss')
hA.legend();


hA = vHa[1]
hA.plot(lTrainScore, lw = 2, label = 'Train Score')
hA.plot(lValScore, lw = 2, label = 'Validation Score')
hA.grid()
hA.set_title('Accuracy Score')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Score')
hA.legend();


* <font color='green'>(**@**)</font> Optimize the hyper parameters of the optimizers.

## Analysis of the Learning Curves

 - Model with Capacity Shortage (Underfit)

<img src="https://machinelearningmastery.com/wp-content/uploads/2019/02/Example-of-Training-Learning-Curve-Showing-An-Underfit-Model-That-Does-Not-Have-Sufficient-Capacity.png" width = "600"/>

 - Shortage of Training Iterations (Underfit)

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/12/Example-of-Training-Learning-Curve-Showing-An-Underfit-Model-That-Requires-Further-Training.png" width = "600"/>

 - Model with Excess Capacity and / or Too Many Training Iterations (Overfit)

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/12/Example-of-Train-and-Validation-Learning-Curves-Showing-An-Overfit-Model.png" width = "600"/>

 - Good Fit

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/12/Example-of-Train-and-Validation-Learning-Curves-Showing-A-Good-Fit.png" width = "600"/>

 - Unrepresentative Train Dataset

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/12/Example-of-Train-and-Validation-Learning-Curves-Showing-a-Training-Dataset-the-May-be-too-Small-Relative-to-the-Validation-Dataset.png" width = "600"/>

 - Unrepresentative Validation Dataset (Small, Noisy)

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/12/Example-of-Train-and-Validation-Learning-Curves-Showing-a-Validation-Dataset-the-May-be-too-Small-Relative-to-the-Training-Dataset.png" width = "600"/>

 - Unrepresentative Validation Dataset (Easy)

<img src="https://machinelearningmastery.com/wp-content/uploads/2018/12/Example-of-Train-and-Validation-Learning-Curves-Showing-a-Validation-Dataset-that-is-Easier-to-Predict-than-the-Training-Dataset.png" width = "600"/>

Resource: [How to Use Learning Curves to Diagnose Machine Learning Model Performance](https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance)