[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - Optimizers

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 24/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0079DeepLearningOptimizers.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
from platform import python_version
import random
import re
import time

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

D_CLASSES_FASHION_MNIST = {0: 'T-Shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Boots'}
L_CLASSES_FASHION_MNIST = ['T-Shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boots']


In [None]:
# Download Auxiliary Modules for Google Colab
if runInGoogleColab:
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataManipulation.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataVisualization.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DeepLearningBlocks.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/NumericDiff.py

In [None]:
# Courses Packages

from DataVisualization import PlotLabelsHistogram, PlotMnistImages
from DeepLearningBlocks import DataSet, LinearLayer, ModelNN, NNMode, NNWeightInit, Optimizer, ReLULayer
from DeepLearningBlocks import CrossEntropyLoss, ScoreAccLogits


* <font color='blue'>(**!**)</font> Go through the code of the `Optimizer` class.

In [None]:
# General Auxiliary Functions


## Neural Net Optimizers

The optimizer of a _Neural Network_ is the mechanism which updates the parameters of the model.  
The most common methods are based on 1st order information: _The Gradient_.  

The different flavors of optimizers try to tackle 2 main challenges:

1. The Loss Landscape  
   The loss landscape in most cases is highly chaotic.  
   Narrow corridors are very common, and vanilla SGD struggles with them.  
   More advanced methods are designed for those cases and for the ability to escape deep, narrow local minima.
2. The Estimation Variance  
   Since the (Batch) _Stochastic Gradient Descent_ uses an estimate of the gradient at each step, variance reduction methods are beneficial.  
   Specifically, by smoothing (Averaging) the gradients of the previous steps.

This notebook demonstrates the motivation for using such methods.

![](https://i.imgur.com/MfBxrIF.gif)
Based on [Enhancing Multi Layer Perceptron Performance: Demystifying Optimizers](https://scribe.rip/57255d5950e4).
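
The narrow corridor challenge can be reproduced on a toy problem. The following is a minimal sketch and not part of the course code: the quadratic objective, its conditioning, the step size and the helper names (`GradF()`, `RunDescent()`) are assumptions chosen for illustration. It compares plain gradient descent with a momentum (Heavy Ball) variant on a valley which is 100 times steeper along one axis than along the other.

```python
import numpy as np

# Hypothetical objective: f(w) = 0.5 * w^T A w, a narrow valley (Condition number of 100)
mA = np.diag([1.0, 100.0])

def GradF( vW: np.ndarray ) -> np.ndarray:
    return mA @ vW #<! Gradient of the quadratic

def RunDescent( vW0: np.ndarray, μ: float, β: float, numIter: int ) -> np.ndarray:
    # β = 0 is vanilla gradient descent, β > 0 adds momentum
    vW = vW0.copy()
    vV = np.zeros_like(vW) #<! Velocity
    for _ in range(numIter):
        vV = β * vV - μ * GradF(vW)
        vW = vW + vV
    return vW

vW0     = np.array([10.0, 1.0])
μ       = 0.018 #<! Just below the stability limit (2 / 100) imposed by the steep axis
numIter = 100

vWGd  = RunDescent(vW0, μ, 0.0, numIter)
vWMom = RunDescent(vW0, μ, 0.9, numIter)

print(f'Vanilla GD: distance from the minimum = {np.linalg.norm(vWGd):0.4f}')
print(f'Momentum  : distance from the minimum = {np.linalg.norm(vWMom):0.4f}')
```

With the same step size, the momentum variant ends much closer to the minimum: the step size is limited by the steep axis, while progress along the flat axis relies on the accumulated velocity.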

* <font color='brown'>(**#**)</font> The optimized flavors are highly useful in the context of classic optimization as well.
* <font color='brown'>(**#**)</font> [Parameter Optimization in Neural Networks](https://www.deeplearning.ai/ai-notes/optimization/index.html).
* <font color='brown'>(**#**)</font> [Setting the Learning Rate of Your Neural Network](https://www.jeremyjordan.me/nn-learning-rate/).
* <font color='brown'>(**#**)</font> There are approaches for optimizing Neural Nets without gradients.  
  See [Methods to Optimize Neural Network without Back Propagation](https://stats.stackexchange.com/questions/235862).  
  Though none has proved to be as efficient as the 1st order derivative based methods.
* <font color='brown'>(**#**)</font> There are methods to optimize NN using approximated 2nd order information.  
  Explicit Hessian based methods are infeasible unless the net is small.
* <font color='brown'>(**#**)</font> Several methods have been developed for the visualization of the _Loss Landscape_ of Deep Learning models.  
  See [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). [Code is available on GitHub](https://github.com/tomgoldstein/loss-landscape).
* <font color='brown'>(**#**)</font> Visualizations: [Exploring the Deep Learning Loss Landscape](https://losslandscape.com), [Loss Visualizer](http://www.telesens.co/loss-landscape-viz/viewer.html).
* <font color='brown'>(**#**)</font> [Gradient Descent Visualization](https://github.com/lilipads/gradient_descent_viz) - Application to visualize several _gradient descent_ methods.  
  See [Hacker News Discussion](https://news.ycombinator.com/item?id=40282923).
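
The variance reduction motivation can be sketched numerically as well. The setup below is an assumption made for illustration: a fixed "true" gradient observed with additive Gaussian noise, standing in for the batch to batch fluctuation. The smoothing factor `β = 0.9` is an arbitrary choice. It shows that an exponential moving average of the noisy gradients keeps the mean direction while reducing the variance.

```python
import numpy as np

oRng = np.random.default_rng(0)

# Hypothetical setup: each step observes the true gradient plus noise
vGTrue  = np.array([1.0, -2.0])
numIter = 2000
mGNoisy = vGTrue + oRng.normal(scale = 1.0, size = (numIter, 2))

# Exponential moving average of the gradient estimates
β        = 0.9
vM       = np.zeros(2)
mGSmooth = np.zeros_like(mGNoisy)
for ii in range(numIter):
    vM           = β * vM + (1 - β) * mGNoisy[ii]
    mGSmooth[ii] = vM

# Skip the first iterations where the average is still warming up from zero
numWarmUp  = 100
vVarRaw    = np.var(mGNoisy[numWarmUp:], axis = 0)
vVarSmooth = np.var(mGSmooth[numWarmUp:], axis = 0)
vMeanSmooth = np.mean(mGSmooth[numWarmUp:], axis = 0)

print(f'Variance of the raw estimates     : {vVarRaw}')
print(f'Variance of the smoothed estimates: {vVarSmooth}')
print(f'Mean of the smoothed estimates    : {vMeanSmooth}')
```

The price of the smoothing is a lag: the average reacts slowly when the true gradient changes, which is the trade off controlled by `β`.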

In [None]:
# Parameters

# Data
numSamplesTrain = 60_000
numSamplesTest  = 10_000

# Model

# Training
batchSize = 256

# Visualization
numImg = 3


## Generate / Load Data

Load the [Fashion MNIST Data Set](https://github.com/zalandoresearch/fashion-mnist).  

The _Fashion MNIST Data Set_ is considerably more challenging than the original MNIST, though it is still no match for Deep Learning models.

* <font color='brown'>(**#**)</font> The data set is available at [OpenML - Fashion MNIST](https://www.openml.org/search?type=data&id=40996).  
  Yet it is not separated into the original _test_ and _train_ sets.

In [None]:
# Load Data

mX, vY = fetch_openml('Fashion-MNIST', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

mX = mX / 255.0

* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.

### Plot the Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY, lClass = L_CLASSES_FASHION_MNIST)
plt.show()

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

* <font color='brown'>(**#**)</font> Deep Learning is _big data_ oriented, hence a data set of this size can easily be handled as a single _batch_.  
  Though usually, for complex (Deep) nets and larger images, the concepts of _batch_ and _epoch_ are used.
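
The concepts of _batch_ and _epoch_ can be sketched as follows. This is a hypothetical generator, `IterateBatches()`, and not the implementation of the course's `DataSet` class. It assumes each row is a sample and uses synthetic data. A single pass over all the batches is one _epoch_.

```python
import math
import numpy as np

def IterateBatches( mX: np.ndarray, vY: np.ndarray, batchSize: int, oRng: np.random.Generator ):
    # Yields shuffled mini batches, each sample appears exactly once per epoch
    numSamples = mX.shape[0]
    vIdx       = oRng.permutation(numSamples)
    for startIdx in range(0, numSamples, batchSize):
        vBatchIdx = vIdx[startIdx:(startIdx + batchSize)] #<! The last batch might be smaller
        yield mX[vBatchIdx], vY[vBatchIdx]

oRng      = np.random.default_rng(0)
mXDemo    = oRng.random((1000, 784)) #<! Synthetic features
vYDemo    = np.arange(1000)          #<! Synthetic labels (Unique, to track samples)
lBatch    = list(IterateBatches(mXDemo, vYDemo, 256, oRng)) #<! One epoch
vBatchLen = [len(vYB) for _, vYB in lBatch]

print(f'Number of batches per epoch: {len(lBatch)} (Expected: {math.ceil(1000 / 256)})')
print(f'Batch sizes: {vBatchLen}')
```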

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True, stratify = vY)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

vMean = np.mean(mXTrain, axis = 0)
vStd  = np.std(mXTrain, axis = 0)

# Processing the test data based on the train data!
mXTrain -= vMean
mXTest  -= vMean
mXTrain /= vStd
mXTest  /= vStd
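
Dividing by the per feature standard deviation yields `NaN` / `inf` values for any feature which is constant over the training set. Whether such pixels exist in this data set is not verified here. The following sketch, on synthetic data, shows one possible guard (The name `vStdSafe` and the demo arrays are hypothetical).

```python
import numpy as np

oRng = np.random.default_rng(0)

# Synthetic data where the last feature is constant (Zero variance)
mTrainDemo = oRng.random((100, 3))
mTestDemo  = oRng.random((20, 3))
mTrainDemo[:, 2] = 0.0
mTestDemo[:, 2]  = 0.0

# Statistics are estimated on the train data only
vMeanDemo = np.mean(mTrainDemo, axis = 0)
vStdDemo  = np.std(mTrainDemo, axis = 0)
vStdSafe  = np.where(vStdDemo > 0, vStdDemo, 1.0) #<! Constant features are left unscaled

mTrainNorm = (mTrainDemo - vMeanDemo) / vStdSafe
mTestNorm  = (mTestDemo - vMeanDemo) / vStdSafe

print(f'All values are finite: {np.all(np.isfinite(mTrainNorm)) and np.all(np.isfinite(mTestNorm))}')
```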

In [None]:
# Generate the Data Sets
# The DataSet assumes each column is a sample.

oDsTrain    = DataSet(mXTrain.T, vYTrain, batchSize) #<! Train Data Set
oDsVal      = DataSet(mXTest.T, vYTest, batchSize) #<! Validation Data Set

print(f'The batch size: {batchSize}')
print(f'The training data set number of batches per Epoch: {len(oDsTrain)}')
print(f'The validation data set number of batches per Epoch: {len(oDsVal)}')

## Stochastic Gradient Descent

This section implements the SGD update rule.

The most common update rules are:

 * SGD  
   The vanilla SGD optimizer.
 * SGD with Momentum  
   SGD with an added _Momentum_ term.
 * ADAM  
   An improvement of `RMSProp` (Which scales the gradient) by adding momentum.

The most important hyper parameter in that context is the _Step Size_ / _Learning Rate_.  
Deep Learning frameworks add the option to apply some scheduling logic to it.
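
As an illustration of such scheduling logic, the following sketch implements a cosine decay with an optional linear warm up. The function name `CosineStepSize()` and its parameters are hypothetical and are not part of the course's modules. Deep Learning frameworks provide their own schedulers with different interfaces.

```python
import math

def CosineStepSize( ii: int, numIter: int, μMax: float, μMin: float = 0.0, numWarmUp: int = 0 ) -> float:
    # ii is the zero based iteration index, numIter the total number of iterations
    if ii < numWarmUp:
        return μMax * (ii + 1) / numWarmUp #<! Linear warm up towards μMax
    # Normalized progress in [0, 1] over the decay phase
    t = (ii - numWarmUp) / max(1, numIter - numWarmUp - 1)
    return μMin + 0.5 * (μMax - μMin) * (1 + math.cos(math.pi * t))

numIter   = 100
lStepSize = [CosineStepSize(ii, numIter, μMax = 0.2, numWarmUp = 10) for ii in range(numIter)]

print(f'First step size: {lStepSize[0]:0.4f}')
print(f'Peak step size : {max(lStepSize):0.4f}')
print(f'Last step size : {lStepSize[-1]:0.4f}')
```

In a training loop the returned value would be assigned to the step size of the update rule before each iteration.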

</br>

* <font color='brown'>(**#**)</font> The _Nesterov Acceleration_ is considered superior to _Momentum_, though not significantly.
* <font color='brown'>(**#**)</font> Most improved methods add some hyper parameters. The popularity of the Adam method is mostly due to it being relatively robust to their values.
* <font color='brown'>(**#**)</font> For tuning the hyper parameters see: [Setting the Learning Rate of Your Neural Network](https://www.jeremyjordan.me/nn-learning-rate).
* <font color='brown'>(**#**)</font> [Andrej Karpathy half joked about `3e-4` being a magic learning rate for Adam](https://twitter.com/karpathy/status/801621764144971776).  
  Yet it is a good value (Magnitude wise) to start with in the grid search of the hyper parameter.
* <font color='brown'>(**#**)</font> Reviews of different Update Rules: [An Overview of Gradient Descent Optimization Algorithms](https://www.ruder.io/optimizing-gradient-descent), [An Updated Overview of Recent Gradient Descent Algorithms](https://johnchenresearch.github.io/demon).
* <font color='brown'>(**#**)</font> Visualization of different Update Rules: [Visualize Gradient Descent Optimization Algorithms in TensorFlow](https://github.com/j-w-yun/optimizer-visualization).

![Visualization of Optimization Algorithms](https://github.com/Jaewan-Yun/optimizer-visualization/raw/master/figures/movie7.gif)

### SGD

In [None]:
# The SGD Class

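class SGD():
    def __init__( self, μ: float = 1e-3 ) -> None:
        
        self.μ = μ #<! Step size (Learning rate)

    def Step( self, mW: np.ndarray, mDw: np.ndarray, dState: Optional[Dict] = None ) -> Tuple[np.ndarray, Dict]:
        
        if dState is None:
            dState = {} #<! Vanilla SGD keeps no state (Avoids a shared mutable default argument)
        
        mW  -= self.μ * mDw #<! In place update of the parameters

        return mW, dState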
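
The other update rules listed above can follow the same `Step()` interface. The following is a sketch under assumptions: the class names (`SgdMomentumSketch`, `AdamSketch`) and the keys stored in `dState` are hypothetical, and it assumes the caller keeps a separate `dState` dictionary per parameter and passes it back on every call. The Adam variant follows the commonly published form with bias correction.

```python
import numpy as np
from typing import Dict, Optional, Tuple

class SgdMomentumSketch():
    def __init__( self, μ: float = 1e-3, β: float = 0.9 ) -> None:

        self.μ = μ #<! Step size
        self.β = β #<! Momentum factor

    def Step( self, mW: np.ndarray, mDw: np.ndarray, dState: Optional[Dict] = None ) -> Tuple[np.ndarray, Dict]:

        dState = {} if dState is None else dState
        mV     = dState.get('mV', np.zeros_like(mW)) #<! Velocity (Zeros on the first call)
        mV     = self.β * mV - self.μ * mDw          #<! Decay the velocity, add the gradient step
        mW    += mV                                  #<! In place update of the parameters
        dState['mV'] = mV

        return mW, dState

class AdamSketch():
    def __init__( self, μ: float = 1e-3, β1: float = 0.9, β2: float = 0.999, ε: float = 1e-8 ) -> None:

        self.μ, self.β1, self.β2, self.ε = μ, β1, β2, ε

    def Step( self, mW: np.ndarray, mDw: np.ndarray, dState: Optional[Dict] = None ) -> Tuple[np.ndarray, Dict]:

        dState = {} if dState is None else dState
        mM = dState.get('mM', np.zeros_like(mW)) #<! 1st moment estimate (Momentum)
        mS = dState.get('mS', np.zeros_like(mW)) #<! 2nd moment estimate (Scaling)
        ii = dState.get('ii', 0) + 1             #<! Step counter for the bias correction

        mM = self.β1 * mM + (1 - self.β1) * mDw
        mS = self.β2 * mS + (1 - self.β2) * mDw * mDw

        mMHat = mM / (1 - self.β1 ** ii) #<! Bias corrected 1st moment
        mSHat = mS / (1 - self.β2 ** ii) #<! Bias corrected 2nd moment

        mW -= self.μ * mMHat / (np.sqrt(mSHat) + self.ε) #<! In place update of the parameters
        dState.update({'mM': mM, 'mS': mS, 'ii': ii})

        return mW, dState

# Usage: Minimize 0.5 * ||w||^2, where the gradient equals w
for oRule in (SgdMomentumSketch(μ = 0.1), AdamSketch(μ = 0.1)):
    mWDemo, dStateDemo = np.array([1.0, -2.0]), {}
    for _ in range(200):
        mWDemo, dStateDemo = oRule.Step(mWDemo, mWDemo.copy(), dStateDemo)
    print(f'{type(oRule).__name__}: final distance from the minimum = {np.linalg.norm(mWDemo):0.2e}')
```

Keeping the state outside the update rule object allows a single rule instance to serve all the parameters of the model, each with its own velocity / moment estimates.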

In [None]:
# Epoch Trainer with Optimizer

def RunEpoch( oModel: ModelNN, oDataSet: DataSet, oOpt: Optimizer, hL: Callable, hS: Callable, opMode: NNMode = NNMode.TRAIN ) -> Tuple[float, float]:
    """
    Runs a single Epoch (Train / Test) of a model.  
    Input:
        oModel      - ModelNN object which supports `Forward()` and `Backward()` methods.
        oDataSet    - DataSet object which supports iterating.
        oOpt        - Optimizer object which supports `Step()` method.
        hL          - Callable for the Loss function.
        hS          - Callable for the Score function.
        opMode      - An NNMode which sets the mode of operation (Train / Inference).
    Output:
        valLoss     - Scalar of the loss (Averaged over the samples of the epoch).
        valScore    - Scalar of the score (Averaged over the samples of the epoch).
    Remarks:
      - The `oDataSet` object returns a Tuple of (mX, vY) per batch.
      - The `hL` function should accept the `vY` (Reference target) and `mZ` (Output of the NN).  
        It should return a Tuple of `valLoss` (Scalar of the loss) and `mDz` (Gradient by the loss).
      - The `hS` function should accept the `mZ` (Output of the NN) and `vY` (Reference target).  
        It should return a scalar `valScore` of the score.
      - The parameters of the model are updated only when `opMode` is `NNMode.TRAIN`.
    """

    epochLoss   = 0.0
    epochScore  = 0.0
    numSamples  = 0
    for ii, (mX, vY) in enumerate(oDataSet):
        batchSize       = len(vY)
        # Forward
        mZ              = oModel.Forward(mX)
        valLoss, mDz    = hL(vY, mZ)
        
        if opMode == NNMode.TRAIN:
            # Backward
            oModel.Backward(mDz) #<! Backward
            oOpt.Step(oModel)  #<! Update parameters
        
        # Score
        valScore = hS(mZ, vY)

        # Normalize so each sample has the same weight
        epochLoss  += batchSize * valLoss
        epochScore += batchSize * valScore
        numSamples += batchSize
    
            
    return epochLoss / numSamples, epochScore / numSamples
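
The weighting by `batchSize` matters when the last batch is smaller than the others. A small worked example with made up per sample losses (The values and the batch sizes are assumptions for illustration):

```python
import numpy as np

vSampleLoss = np.arange(10, dtype = np.float64) #<! Made up loss per sample
lBatchLoss  = [vSampleLoss[0:4], vSampleLoss[4:8], vSampleLoss[8:10]] #<! Batches of size 4, 4, 2

# Each batch reports its mean loss, as a loss function typically does
vBatchMean = np.array([np.mean(vB) for vB in lBatchLoss])
vBatchSize = np.array([len(vB) for vB in lBatchLoss])

weightedMean = np.sum(vBatchSize * vBatchMean) / np.sum(vBatchSize) #<! Same logic as the epoch loss above
plainMean    = np.mean(vBatchMean)                                  #<! Over weights the small batch

print(f'Mean over samples            : {np.mean(vSampleLoss):0.4f}')
print(f'Batch size weighted mean     : {weightedMean:0.4f}')
print(f'Plain mean of the batch means: {plainMean:0.4f}')
```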

In [None]:
# Training Model Function
def TrainModel( oModel: ModelNN, oDsTrain: DataSet, oDsVal: DataSet, oOpt: Optimizer, numEpoch: int, hL: Callable, hS: Callable ) -> Tuple[List, List, List, List]:

    lTrainLoss  = []
    lTrainScore = []
    lValLoss    = []
    lValScore   = []

    for ii in range(numEpoch):
        startTime           = time.time()
        trainLoss, trainScr = RunEpoch(oModel, oDsTrain, oOpt, hL, hS, opMode = NNMode.TRAIN) #<! Train
        valLoss,   valScr   = RunEpoch(oModel, oDsVal, oOpt, hL, hS, opMode = NNMode.INFERENCE)    #<! Score Validation
        endTime             = time.time()

        # Aggregate Results
        lTrainLoss.append(trainLoss)
        lTrainScore.append(trainScr)
        lValLoss.append(valLoss)
        lValScore.append(valScr)
        
        # Display (Babysitting)
        print('Epoch '              f'{(ii + 1):4d} / ' f'{numEpoch}:', end = '')
        print(' | Train Loss: '     f'{trainLoss          :6.3f}', end = '')
        print(' | Val Loss: '       f'{valLoss            :6.3f}', end = '')
        print(' | Train Score: '    f'{trainScr           :6.3f}', end = '')
        print(' | Val Score: '      f'{valScr             :6.3f}', end = '')
        print(' | Epoch Time: '     f'{(endTime-startTime):6.3f} |')

    return lTrainLoss, lTrainScore, lValLoss, lValScore


In [None]:
# Define Model

oModel = ModelNN([
        LinearLayer(784, 350, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(350, 250, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(250, 150, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(150, 50, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(50, 10,  initMethod = NNWeightInit.KAIMING),
        ])

In [None]:
# Train Parameters
nEpochs     = 10
learnRate   = 2e-1

In [None]:
# Optimizer

oOpt = Optimizer(SGD(μ = learnRate))

In [None]:
# Train Model

_, _, lValLoss, lValScore = TrainModel(oModel, oDsTrain, oDsVal, oOpt, nEpochs, CrossEntropyLoss, ScoreAccLogits)