[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - Weights Initialization

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 24/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0077DeepLearningWeightsInitialization.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
from platform import python_version
import random
import time

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

D_CLASSES_FASHION_MNIST = {0: 'T-Shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Boots'}
L_CLASSES_FASHION_MNIST = ['T-Shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boots']


In [None]:
# Courses Packages

from DataVisualization import PlotLabelsHistogram, PlotMnistImages
from DeepLearningBlocks import CrossEntropyLoss, LinearLayer, NNWeightInit, ReLULayer
from DeepLearningBlocks import ModelNN


In [None]:
# General Auxiliary Functions


## Neural Net Weights Initialization

Proper weights initialization was one of the earliest effective ways to avoid 2 main issues in training:

 * Exploding Gradients  
   The gradients of the net take large values, which makes the training phase unstable.  
   It is an indication of being far away from a proper local minimum.
 * Vanishing Gradients  
   Vanishing gradients reduce the ability of the net to learn.  
   Basically, it means some of the capabilities of the net are turned off, as the affected weights are barely updated.

Proper initialization tries to maximize the probability that the starting point of the net is close to a proper local minimum.  
The idea is to set the weights in a manner which keeps the variance of the data flowing through the net roughly constant from layer to layer in the first forward step.

This notebook shows a simple case where the initialization has an effect on the performance (Mainly speed of convergence) of the net.

* <font color='brown'>(**#**)</font> A rule of thumb states that a _proper local minimum_ lies at the bottom of a wide and deep valley (See [Effect of Depth and Width on Local Minima in Deep Learning](https://arxiv.org/abs/1811.08150)).
* <font color='brown'>(**#**)</font> An interactive analysis of weights initialization is given by [DeepLearning.AI - Initializing Neural Networks](https://www.deeplearning.ai/ai-notes/initialization/index.html).
* <font color='brown'>(**#**)</font> An interactive analysis of parameter optimization is given by [DeepLearning.AI - Parameter Optimization in Neural Networks](https://www.deeplearning.ai/ai-notes/optimization/index.html).
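
The effect of the weights scale on the variance in the first forward step can be illustrated with a minimal sketch. It is not part of the course packages: the function `ForwardStd()` and its parameters are hypothetical, the layers are square with no bias, and the weights are assumed to be zero mean Gaussian. It passes unit variance data through a stack of Linear + ReLU layers and records the standard deviation of the activations after each layer.

```python
import numpy as np

def ForwardStd( initStd: float, numLayers: int = 10, dataDim: int = 512, numSamples: int = 1000, seedNum: int = 0 ) -> list:
    """Returns the standard deviation of the activations after each Linear + ReLU layer (Illustration only)."""
    oRng = np.random.default_rng(seedNum)
    mA   = oRng.standard_normal((dataDim, numSamples)) #<! Unit variance input data
    lStd = []
    for _ in range(numLayers):
        mW = initStd * oRng.standard_normal((dataDim, dataDim)) #<! Zero mean Gaussian weights
        mA = np.maximum(mW @ mA, 0.0) #<! Linear layer (No bias) followed by a ReLU
        lStd.append(float(np.std(mA)))
    return lStd

dataDim = 512

lStdSmall   = ForwardStd(0.01, dataDim = dataDim)                   #<! Too small -> The signal vanishes
lStdLarge   = ForwardStd(0.2, dataDim = dataDim)                    #<! Too large -> The signal explodes
lStdKaiming = ForwardStd(np.sqrt(2.0 / dataDim), dataDim = dataDim) #<! Kaiming: std = sqrt(2 / fanIn)

print(f'Std after the last layer (Small)  : {lStdSmall[-1]:0.3e}')
print(f'Std after the last layer (Large)  : {lStdLarge[-1]:0.3e}')
print(f'Std after the last layer (Kaiming): {lStdKaiming[-1]:0.3e}')
```

With the scaling `sqrt(2 / fanIn)` the spread of the activations stays on the order of 1 across the layers, while the other 2 choices shrink or grow it geometrically with depth. The same scale factors multiply the gradients in the backward step, which is the link to vanishing and exploding gradients.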

In [None]:
# Parameters

# Data
numSamplesTrain = 60_000
numSamplesTest  = 10_000

# Model
hidLayerDim = 200

# Training
numIter = 300
µ       = 0.35 #<! Step Size \ Learning Rate

# Visualization
numImg = 3


## Generate / Load Data

Load the [Fashion MNIST Data Set](https://github.com/zalandoresearch/fashion-mnist).  

The _Fashion MNIST Data Set_ is considerably more challenging than the original MNIST, though it is still easily handled by Deep Learning models.

* <font color='brown'>(**#**)</font> The data set is available at [OpenML - Fashion MNIST](https://www.openml.org/search?type=data&id=40996).  
  Yet it is not separated into the original _test_ and _train_ sets.

In [None]:
# Load Data

mX, vY = fetch_openml('Fashion-MNIST', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

mX = mX / 255.0

* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.

### Plot the Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY, lClass = L_CLASSES_FASHION_MNIST)
plt.show()

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

* <font color='brown'>(**#**)</font> Deep Learning is _big data_ oriented. For a small model and small images, as in this notebook, all samples can be processed in a single _batch_.  
  Yet usually, for complex (Deep) nets and larger images, the concepts of _batch_ and _epoch_ are used.

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True, stratify = vY)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


## Train by Epochs

In Deep Learning the data is usually trained in batches.  
The motivations are:

 * Memory Limitations.
 * Speed.
 * Regularization (Avoid Overfit).

An _Epoch_ is a set of batches which together cover the whole data set.

This section implements a few auxiliary functions to support the modular training phase of a NN.

* <font color='brown'>(**#**)</font> If a batch is the size of the whole data set, each iteration is an _Epoch_.
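
The relation between the batch size and the number of iterations per epoch can be sketched as follows. The generator `GenBatchIdx()` is a hypothetical helper which is not part of the course packages. It assumes the samples are shuffled once per epoch and that the last batch is allowed to be smaller than the others.

```python
import numpy as np

def GenBatchIdx( numSamples: int, batchSize: int, oRng: np.random.Generator ):
    """Yields index vectors of shuffled mini batches which together cover a single epoch (Illustration only)."""
    vIdx = oRng.permutation(numSamples) #<! Shuffle once per epoch
    for ii in range(0, numSamples, batchSize):
        yield vIdx[ii:(ii + batchSize)] #<! The last batch might be smaller

numSamples = 60_000 #<! Size of the train set in this notebook
oRng       = np.random.default_rng(0)

dNumIter = {}
for batchSize in (128, 1_000, 60_000):
    dNumIter[batchSize] = sum(1 for _ in GenBatchIdx(numSamples, batchSize, oRng))
    print(f'Batch size: {batchSize:6d} | Iterations per epoch: {dNumIter[batchSize]}')
```

The number of iterations per epoch is `ceil(numSamples / batchSize)`. A batch the size of the whole data set yields a single iteration per epoch, which is the setting used by `TrainEpoch()` in this notebook.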


<!-- ![Number Iterations per Epoch for a Batch Size](https://i.imgur.com/XvK4QtL.png)
 
 
 * Credit to [Chandra Prakash Bathula - Demystifying Epoch in Machine Learning: Unleashing the Power of Iterative Learning](https://scribe.rip/979f4ae5a5b6). -->

![Number Iterations per Epoch for a Batch Size](https://i.imgur.com/HLoYAna.png)

In [None]:
# Epoch Training Auxiliary Functions

# Calculate Classification Accuracy from Logits (Vector prior to SoftMax)
def ScoreAccLogits( mScore: np.ndarray, vY: np.ndarray ) -> np.float64: #<! `np.float_` was removed in NumPy 2.0
    """
    Calculates the classification accuracy.  
    Input:
        mScore      - Matrix (numCls, batchSize) of the Logits Score.
        vY          - Vector (batchSize, ) of the reference classes: {0, 1, .., numCls - 1}.
    Output:
        valAcc      - Scalar of the accuracy in [0, 1] range.
    Remarks:
      - The Logits are assumed to be monotonic with regard to probabilities.  
        Namely, the class probability is a monotonic transformation of the Logit.  
        For instance, by a SoftMax.
      - Classes are in the range {0, 1, ..., numCls - 1}.
    """
    
    vYHat  = np.argmax(mScore, axis = 0) #<! Class prediction
    valAcc = np.mean(vYHat == vY)
    
    return valAcc


def TrainEpoch( oModel: ModelNN, mX: np.ndarray, vY: np.ndarray, learnRate: float, hL: Callable, hS: Callable ) -> Tuple[float, float]:
    """
    Applies a single Epoch training of a model.  
    Input:
        oModel      - ModelNN which supports `Forward()` and `Backward()` methods.
        mX          - Matrix (dataDim, numSamples) of the data input.
        vY          - Vector (numSamples, ) of the reference labels.
        learnRate   - Scalar of the learning rate in the range (0, inf).
        hL          - Callable for the Loss function.
        hS          - Callable for the Score function.
    Output:
        valLoss     - Scalar of the loss.
        valScore    - Scalar of the score.
    Remarks:
      - The `hL` function should accept the `vY` (Reference target) and `mZ` (Output of the NN).  
        It should return a Tuple of `valLoss` (Scalar of the loss) and `mDz` (Gradient by the loss).
      - The `hS` function should accept the `mZ` (Output of the NN) and `vY` (Reference target).  
        It should return a scalar `valScore` of the score.
    """
    # Forward
    mZ              = oModel.Forward(mX)
    valLoss, mDz    = hL(vY, mZ)

    # Backward
    oModel.Backward(mDz)

    # Gradient Descent (Update parameters)
    for oLayer in oModel.lLayers:
        for sParam in oLayer.dGrads:
            oLayer.dParams[sParam] -= learnRate * oLayer.dGrads[sParam]
              
    # Score
    valScore = hS(mZ, vY)
            
    return valLoss, valScore


def ScoreEpoch( oModel: ModelNN, mX: np.ndarray, vY: np.ndarray, hL: Callable, hS: Callable ) -> Tuple[float, float]:
    """
    Calculates the loss and the score of a model over an Epoch.  
    Input:
        oModel      - ModelNN which supports `Forward()` and `Backward()` methods.
        mX          - Matrix (dataDim, numSamples) of the data input.
        vY          - Vector (numSamples, ) of the reference labels.
        hL          - Callable for the Loss function.
        hS          - Callable for the Score function.
    Output:
        valLoss     - Scalar of the loss.
        valScore    - Scalar of the score.
    Remarks:
      - The `hL` function should accept the `vY` (Reference target) and `mZ` (Output of the NN).  
        It should return a Tuple of `valLoss` (Scalar of the loss) and `mDz` (Gradient by the loss).
      - The `hS` function should accept the `mZ` (Output of the NN) and `vY` (Reference target).  
        It should return a scalar `valScore` of the score.
      - The function does not optimize the model parameter.
    """
    
    # Forward
    mZ          = oModel.Forward(mX)
    valLoss, _  = hL(vY, mZ)
    # Score
    valScore    = hS(mZ, vY)
    
    return valLoss, valScore

In [None]:
# Training Model Function
def TrainModel( oModel: ModelNN, mXTrain: np.ndarray, vYTrain: np.ndarray, mXVal: np.ndarray, vYVal: np.ndarray, numEpoch: int, hL: Callable, hS: Callable, learnRate: float ) -> Tuple[List, List, List, List]:

    lTrainLoss  = []
    lTrainScore = []
    lValLoss    = []
    lValScore   = []

    for ii in range(numEpoch):
        startTime           = time.time()
        trainLoss, trainScr = TrainEpoch(oModel, mXTrain, vYTrain, learnRate, hL, hS) #<! Train
        valLoss,   valScr   = ScoreEpoch(oModel, mXVal, vYVal, hL, hS)                #<! Score Validation
        endTime             = time.time()

        # Aggregate Results
        lTrainLoss.append(trainLoss)
        lTrainScore.append(trainScr)
        lValLoss.append(valLoss)
        lValScore.append(valScr)
        
        # Display (Babysitting)
        print('Epoch '              f'{(ii + 1):4d} / ' f'{numEpoch}:', end = '')
        print(' | Train Loss: '     f'{trainLoss          :6.3f}', end = '')
        print(' | Val Loss: '       f'{valLoss            :6.3f}', end = '')
        print(' | Train Score: '    f'{trainScr           :6.3f}', end = '')
        print(' | Val Score: '      f'{valScr             :6.3f}', end = '')
        print(' | Epoch Time: '     f'{(endTime-startTime):6.3f} |')

    return lTrainLoss, lTrainScore, lValLoss, lValScore


## Normalization and Initialization Effects

This section compares 2 cases:

1. Data with no normalization and trivial initialization.
2. Data with normalization and _Kaiming_ initialization.

* <font color='brown'>(**#**)</font> The initialization of weights is based on the Kaiming Method as the activation layer is `ReLULayer()`.  
  See [Delving Deep into Rectifiers: Surpassing Human Level Performance on ImageNet Classification](https://arxiv.org/abs/1502.01852) and [Kaiming He Initialization](https://scribe.rip/a8d9ed0b5899).
* <font color='brown'>(**#**)</font> The _Kaiming_ initialization is also known as _He_ initialization as the name of the researcher is Kaiming He.  
* <font color='brown'>(**#**)</font> The SoftMax + CrossEntropy are both defined as part of the loss function and not the model.  
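
To illustrate what merging the SoftMax into the loss looks like, below is a minimal sketch of a loss function which follows the `hL(vY, mZ)` convention used in this notebook. It is an assumption about the structure, not the implementation of `CrossEntropyLoss` from `DeepLearningBlocks`. The name `SoftMaxCrossEntropy()` is hypothetical and the gradient is assumed to be averaged over the batch.

```python
import numpy as np

def SoftMaxCrossEntropy( vY: np.ndarray, mZ: np.ndarray ) -> tuple:
    """
    SoftMax + Cross Entropy loss (Illustration only).
    Input:
        vY      - Vector (batchSize, ) of the reference classes: {0, 1, .., numCls - 1}.
        mZ      - Matrix (numCls, batchSize) of the logits.
    Output:
        valLoss - Scalar of the mean loss over the batch.
        mDz     - Matrix (numCls, batchSize) of the gradient of the loss with regard to the logits.
    """
    numSamples = mZ.shape[1]
    vIdx       = np.arange(numSamples)

    mZ    = mZ - np.max(mZ, axis = 0, keepdims = True) #<! Shift for numerical stability (Does not change the SoftMax)
    mLogP = mZ - np.log(np.sum(np.exp(mZ), axis = 0, keepdims = True)) #<! Log SoftMax

    valLoss = -np.mean(mLogP[vY, vIdx]) #<! Mean negative log likelihood of the reference class

    mDz            = np.exp(mLogP) #<! SoftMax probabilities
    mDz[vY, vIdx] -= 1.0           #<! Gradient: (p - oneHot(y))
    mDz           /= numSamples    #<! Match the mean in the loss

    return valLoss, mDz

# Uniform logits -> Uniform probabilities -> The loss equals log(numCls)
valLoss, mDz = SoftMaxCrossEntropy(np.array([0, 1, 2, 3, 4]), np.zeros((10, 5)))
print(f'Loss: {valLoss:0.4f}') #<! Expected: log(10) ~ 2.3026
```

Working on the logits allows using the log SoftMax directly, which avoids computing the logarithm of very small probabilities, and gives the simple gradient `p - oneHot(y)`.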

In [None]:
# Train Parameters
nEpochs     = 10
learnRate   = 2e-1

### Naive Training

In [None]:
# Define Model

# Initialization matches previous notebooks
oModel = ModelNN([
        LinearLayer(784, 200, initMethod = NNWeightInit.CONST, initStd = (1.0 / mX.shape[1])), ReLULayer(),
        LinearLayer(200, 10,  initMethod = NNWeightInit.CONST, initStd = (1.0 / mX.shape[1])),
        ])

In [None]:
# Training

_, _, lValLossNaive, lValScoreNaive = TrainModel(oModel, mXTrain.T, vYTrain, mXTest.T, vYTest, nEpochs, CrossEntropyLoss, ScoreAccLogits, learnRate)

### Optimized Training

In [None]:
# Pre Process Data

vMean = np.mean(mXTrain, axis = 0)
vStd  = np.std(mXTrain, axis = 0)

# Processing the test data based on the train data!
mXTrain -= vMean
mXTest  -= vMean
mXTrain /= vStd
mXTest  /= vStd

In [None]:
# Define Model

# Initialization with Kaiming to match the ReLU layers
oModel = ModelNN([
        LinearLayer(784, 200, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(200, 10,  initMethod = NNWeightInit.KAIMING),
        ])

In [None]:
# Training

_, _, lValLossOpt, lValScoreOpt = TrainModel(oModel, mXTrain.T, vYTrain, mXTest.T, vYTest, nEpochs, CrossEntropyLoss, ScoreAccLogits, learnRate)

* <font color='red'>(**?**)</font> Why is the validation score higher than the train score on the 1st epoch?  
  Look at the `mZ` in the `TrainEpoch()` function.

In [None]:
# Plot Results

hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))

hA = vHa.flat[0]
hA.plot(lValLossNaive, lw = 2, label = 'Naive')
hA.plot(lValLossOpt, lw = 2, label = 'Optimized')
hA.grid()
hA.set_title('Cross Entropy Loss')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Loss')
hA.legend();

hA = vHa.flat[1]
hA.plot(lValScoreNaive, lw = 2, label = 'Naive')
hA.plot(lValScoreOpt, lw = 2, label = 'Optimized')
hA.grid()
hA.set_title('Accuracy Score')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Score')
hA.legend();

## Coding Exercise

Implement a function to count the number of parameters of a given model.
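
As a reference for the verification below, the expected count can be derived by hand from the architecture, independently of the internals of `ModelNN`. The sketch assumes each linear layer holds a weights matrix of size `dimOut x dimIn` and a bias vector of size `dimOut`. The list `lLayerDim` is a hypothetical helper which mirrors the 2 linear layers defined above.

```python
lLayerDim = [(784, 200), (200, 10)] #<! (dimIn, dimOut) of each linear layer

# Weights (dimOut * dimIn) + bias (dimOut) per layer
numParamsRef = sum(((dimIn * dimOut) + dimOut) for dimIn, dimOut in lLayerDim)

print(f'Expected number of parameters: {numParamsRef}') #<! 156_800 + 200 + 2_000 + 10 = 159_010
```

The `ReLULayer()` has no parameters, hence it does not contribute to the count.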

In [None]:
def CountModelParams( oModel: ModelNN ) -> int:
    """
    Calculates the number of parameters of a model.  
    Input:
        oModel      - ModelNN which supports `Forward()` and `Backward()` methods.
    Output:
        numParams   - Scalar of the number of parameters in the model.
    Remarks:
      - AA
    """

    #===========================Fill This===========================#
    ?????
    #===============================================================#
    
    return numParams

In [None]:
# Verify Implementation

numParams = 159010

assert (CountModelParams(oModel) == numParams), "Implementation is not verified"
print('Implementation is verified')