[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - Stochastic Gradient Descent (SGD)

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.001 | 22/05/2024 | Royi Avital | Added the task to add the `__getitem__()` method                   |
| 1.0.000 | 24/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0078DeepLearningSgd.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
from platform import python_version
import random
import re
import time

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

D_CLASSES_FASHION_MNIST = {0: 'T-Shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Boots'}
L_CLASSES_FASHION_MNIST = ['T-Shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Boots']


In [None]:
# Download Auxiliary Modules for Google Colab
if runInGoogleColab:
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataManipulation.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DataVisualization.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/DeepLearningBlocks.py
    !wget https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/AIProgram/2024_02/NumericDiff.py

In [None]:
# Courses Packages

from DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram, PlotMnistImages
from DeepLearningBlocks import CrossEntropyLoss, DataSet, LinearLayer, ModelNN, NNWeightInit, ReLULayer, ScoreEpoch, ScoreAccLogits, TrainEpoch


* <font color='blue'>(**!**)</font> Go through the code of the `DataSet` class.

In [None]:
# General Auxiliary Functions


## Neural Net Weights Optimization

Intuition about the minimization and the loss landscape:

 * Good / Proper Local Minimum  
   Though not rigorously proved, the common belief is that a wide and deep local minimum is almost as good as the global minimum.  
   The intuition is that, given millions of directions to move in, a point with no improvement in any direction is a rare case.  
   Though such a minimum is not unique, the different minima probably yield similar results.
 * The SGD as a Regularizer  
   Since the SGD uses a noisy estimate of the _gradient_, it may "escape" bad local minima (Narrow, not deep).  
   Those "sensitive" local minima can be thought of as "over fitting", as they usually don't generalize well.  
   Yet escaping wide and deep local minima is less likely.
 * Batch Size  
   The bigger the batch, the better the approximation of the gradient.  
   Yet, in practice the most limiting factors are the memory of the GPU and the speed of computation.  
   So it is selected to maximize the number of iterations within the envelope of the memory and speed.
 * Iterations  
   Each batch creates a "Forward" and a "Backward" step.  
   To even out the estimation over the samples, the batch in each iteration can be drawn randomly.
 * Optimization Methods  
   Since the SGD uses an approximated / estimated gradient, most acceleration methods can be viewed as variance reduction methods of the estimation.
 * Speed of Convergence  
   The SGD is also superior as it allows many more iterations within the same time budget.  
   Compare a single accurate iteration vs. 1000 approximated iterations. In practice the latter converges much faster.
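
The _Batch Size_ item above can be illustrated numerically. The following is a minimal sketch and not part of the course code: it uses a synthetic least squares problem (an assumption made for simplicity) and measures how far the mini batch gradient is from the full gradient for several batch sizes. The names `CalcGrad()` and `dErr` are introduced here for illustration only.

```python
import numpy as np

oRng = np.random.default_rng(0)

# Synthetic least squares problem (Assumption, for illustration only)
numSamples, numFeatures = 10_000, 5
mA = oRng.standard_normal((numSamples, numFeatures))
vB = mA @ oRng.standard_normal(numFeatures) + 0.1 * oRng.standard_normal(numSamples)
vW = np.zeros(numFeatures) #<! The point at which the gradient is evaluated

def CalcGrad( mA: np.ndarray, vB: np.ndarray, vW: np.ndarray ) -> np.ndarray:
    # Gradient of 0.5 * mean((mA @ vW - vB) ** 2) with respect to vW
    return mA.T @ (mA @ vW - vB) / mA.shape[0]

vGFull = CalcGrad(mA, vB, vW) #<! Gradient over all samples

dErr = {}
for batchSize in (10, 100, 1000):
    lErr = []
    for _ in range(200):
        vIdx = oRng.choice(numSamples, size = batchSize, replace = False) #<! Random batch
        vG   = CalcGrad(mA[vIdx], vB[vIdx], vW) #<! Gradient estimated from the batch
        lErr.append(np.linalg.norm(vG - vGFull))
    dErr[batchSize] = np.mean(lErr)
    print(f'Batch size: {batchSize:4d} | Mean gradient error: {dErr[batchSize]:0.4f}')
```

The mean error should shrink roughly as the inverse square root of the batch size, which is the sense in which a bigger batch gives a better approximation.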


![](https://i.imgur.com/niEt3Sl.png)
Based on [Loss Landscape Gallery](https://losslandscape.com/wp-content/uploads/2019/11/mode-connectivity-1.jpg).

* <font color='brown'>(**#**)</font> There are approaches for optimizing Neural Nets without gradients.  
  See [Methods to Optimize Neural Network without Back Propagation](https://stats.stackexchange.com/questions/235862).  
  Though none have proved to be as efficient as the 1st order derivative based methods.
* <font color='brown'>(**#**)</font> There are methods to optimize NN using approximated 2nd order information.  
  Explicit Hessian based methods are infeasible unless the net is small.
* <font color='brown'>(**#**)</font> Some visualization methods have been developed for the visualization of the _Loss Landscape_ of Deep Learning.  
  See [Visualizing the Loss Landscape of Neural Nets](https://arxiv.org/abs/1712.09913). [Code is available on GitHub](https://github.com/tomgoldstein/loss-landscape).
* <font color='brown'>(**#**)</font> Visualizations: [Exploring the Deep Learning Loss Landscape](https://losslandscape.com), [Loss Visualizer](http://www.telesens.co/loss-landscape-viz/viewer.html).
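
To get a feeling for why explicit Hessian based methods are infeasible, the sketch below counts the parameters of the net trained later in this notebook (`784 -> 200 -> 10`) and the memory a dense Hessian would require. It assumes each linear layer has a bias vector and that each element is stored as a 64 bit float. Both are assumptions made for the sake of the arithmetic.

```python
# (Input dimension, Output dimension) of each linear layer
lLayerDim = [(784, 200), (200, 10)]

# Weights + biases per layer (The bias is an assumption)
numParams     = sum((dimIn * dimOut) + dimOut for dimIn, dimOut in lLayerDim)
numHessianElm = numParams ** 2 #<! The Hessian is (numParams x numParams)
hessianSizeGb = (numHessianElm * 8) / 1e9 #<! 8 bytes per element (Assumed 64 bit float)

print(f'Number of parameters: {numParams}')
print(f'Number of elements in the Hessian: {numHessianElm:0.3e}')
print(f'Memory of a dense Hessian: {hessianSizeGb:0.1f} [GB]')
```

Even for this small net the dense Hessian takes on the order of hundreds of gigabytes, while the gradient takes about a megabyte.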

In [None]:
# Parameters

# Data
numSamplesTrain = 60_000
numSamplesTest  = 10_000

# Model

# Training

# Visualization
numImg = 3


## Generate / Load Data

Load the [Fashion MNIST Data Set](https://github.com/zalandoresearch/fashion-mnist).  

The _Fashion MNIST Data Set_ is considerably more challenging than the original MNIST, though it is still no match for Deep Learning models.

* <font color='brown'>(**#**)</font> The data set is available at [OpenML - Fashion MNIST](https://www.openml.org/search?type=data&id=40996).  
  Yet it is not separated into the original _test_ and _train_ sets.

In [None]:
# Load Data

mX, vY = fetch_openml('Fashion-MNIST', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

mX = mX / 255.0

* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.

### Plot the Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY, lClass = L_CLASSES_FASHION_MNIST)
plt.show()

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

* <font color='brown'>(**#**)</font> Deep Learning is _big data_ oriented, hence it can easily handle all samples in a single _batch_.  
  Though usually, for complex (Deep) nets and larger images the concept of _batch_ and _epoch_ is used.

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True, stratify = vY)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

vMean = np.mean(mXTrain, axis = 0)
vStd  = np.std(mXTrain, axis = 0)

# Processing the test data based on the train data!
mXTrain -= vMean
mXTest  -= vMean
mXTrain /= vStd
mXTest  /= vStd
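
The cell above divides by `vStd`, which is zero for any feature (pixel) that is constant over the training set. Whether such pixels exist in this split is not verified here. A hedged variant, shown on synthetic data, replaces (near) zero standard deviations by 1 so that constant features are mapped to zero instead of `NaN`. The name `vStdSafe` and the threshold are introduced for illustration only.

```python
import numpy as np

oRng = np.random.default_rng(0)

# Synthetic data: each row is a sample (As mXTrain / mXTest above)
mTrain = oRng.random((100, 4))
mTest  = oRng.random((20, 4))
mTrain[:, 2] = 0.5 #<! A constant feature -> Zero standard deviation
mTest[:, 2]  = 0.5

# Statistics are estimated on the train data only
vMean = np.mean(mTrain, axis = 0)
vStd  = np.std(mTrain, axis = 0)

# Guard against division by zero (The threshold is an arbitrary choice)
vStdSafe = np.where(vStd > 1e-12, vStd, 1.0)

# The test data is processed with the train statistics
mTrainN = (mTrain - vMean) / vStdSafe
mTestN  = (mTest - vMean) / vStdSafe

print(f'All train values are finite: {np.all(np.isfinite(mTrainN))}')
print(f'All test values are finite: {np.all(np.isfinite(mTestN))}')
```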

## Train by Batch Stochastic Gradient Descent

This section trains the model using batches, namely by _Batch Stochastic Gradient Descent_.  

* <font color='brown'>(**#**)</font> While the number of Epochs will be smaller than previous experiments, the number of gradient steps will be much higher while being faster.


![Gradient Descent Flavors](https://i.imgur.com/ygtK28K.png)

In [None]:
# Generate the Data Sets
# The DataSet assumes each column is a sample.

batchSize   = 256
oDsTrain    = DataSet(mXTrain.T, vYTrain, batchSize, dropLast = True) #<! Train Data Set
oDsVal      = DataSet(mXTest.T, vYTest, batchSize) #<! Validation Data Set

print(f'The batch size: {batchSize}')
print(f'The training data set number of batches per Epoch: {len(oDsTrain)}')
print(f'The validation data set number of batches per Epoch: {len(oDsVal)}')


* <font color='red'>(**?**)</font> Why is _batch_ partition used for the validation data set as well?

In [None]:
# Testing the Data Sets

for ii, (mXBatch, vYBatch) in enumerate(oDsTrain):
    print(f'The {(ii + 1): 3d} / {len(oDsTrain)} Batch', end = '')
    print(f' | The features shape: {mXBatch.shape}', end = '')
    print(f' | The target shape: {vYBatch.shape}')

* <font color='blue'>(**!**)</font> Compare results with the case `dropLast = False`.
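
The number of batches per epoch can be worked out by hand. The sketch below assumes that `dropLast = True` discards the last batch when it is smaller than `batchSize`, and that `dropLast = False` keeps it. This is an assumption about the `DataSet` class, so verify it against the code in `DeepLearningBlocks.py`.

```python
import math

numSamples = 60_000 #<! Number of train samples
batchSize  = 256

numBatchesDrop = numSamples // batchSize #<! The incomplete last batch is dropped (Assumed behavior)
numBatchesKeep = math.ceil(numSamples / batchSize) #<! The incomplete last batch is kept (Assumed behavior)
lastBatchSize  = numSamples - (numBatchesDrop * batchSize) #<! Samples left for the last batch

print(f'Number of batches, last dropped: {numBatchesDrop}')
print(f'Number of batches, last kept: {numBatchesKeep}')
print(f'Size of the incomplete last batch: {lastBatchSize}')
```

Under these assumptions, dropping the last batch skips 96 of the 60,000 samples in each epoch.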

In [None]:
# Training Model Function
def TrainModel( oModel: ModelNN, oDsTrain: DataSet, oDsVal: DataSet, numEpoch: int, hL: Callable, hS: Callable, learnRate: float ) -> Tuple[List, List, List, List]:

    lTrainLoss  = []
    lTrainScore = []
    lValLoss    = []
    lValScore   = []

    for ii in range(numEpoch):
        startTime           = time.time()
        trainLoss, trainScr = TrainEpoch(oModel, oDsTrain, learnRate, hL, hS) #<! Train
        valLoss,   valScr   = ScoreEpoch(oModel, oDsVal, hL, hS)                #<! Score Validation
        endTime             = time.time()

        # Aggregate Results
        lTrainLoss.append(trainLoss)
        lTrainScore.append(trainScr)
        lValLoss.append(valLoss)
        lValScore.append(valScr)
        
        # Display (Babysitting)
        print('Epoch '              f'{(ii + 1):4d} / ' f'{numEpoch}:', end = '')
        print(' | Train Loss: '     f'{trainLoss          :6.3f}', end = '')
        print(' | Val Loss: '       f'{valLoss            :6.3f}', end = '')
        print(' | Train Score: '    f'{trainScr           :6.3f}', end = '')
        print(' | Val Score: '      f'{valScr             :6.3f}', end = '')
        print(' | Epoch Time: '     f'{(endTime-startTime):6.3f} |')

    return lTrainLoss, lTrainScore, lValLoss, lValScore


* <font color='blue'>(**!**)</font> Go through the code of the `TrainEpoch()` function.
* <font color='blue'>(**!**)</font> Go through the code of the `ScoreEpoch()` function.
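
The actual implementation lives in `DeepLearningBlocks.py`. As a rough, self contained illustration of what a single epoch of mini batch SGD does (shuffle, then forward, backward and update per batch), the sketch below trains a linear soft max classifier on synthetic data. It is not the course's `TrainEpoch()`: the function `TrainEpochSketch()`, its arguments and the synthetic data are all hypothetical.

```python
import numpy as np

def SoftMax( mZ: np.ndarray ) -> np.ndarray:
    # Column wise soft max (Each column is a sample)
    mE = np.exp(mZ - np.max(mZ, axis = 0, keepdims = True))
    return mE / np.sum(mE, axis = 0, keepdims = True)

def TrainEpochSketch( mW, vB, mX, vY, batchSize, learnRate, oRng ) -> float:
    # Hypothetical sketch: updates mW, vB in place, returns the mean loss of the epoch
    numSamples = mX.shape[1]
    vIdx       = oRng.permutation(numSamples) #<! Shuffle the samples each epoch
    epochLoss  = 0.0
    for startIdx in range(0, numSamples, batchSize):
        vBatchIdx = vIdx[startIdx:(startIdx + batchSize)]
        mXB, vYB  = mX[:, vBatchIdx], vY[vBatchIdx]
        vCol      = np.arange(len(vBatchIdx))
        # Forward: Probabilities and cross entropy loss
        mP = SoftMax(mW @ mXB + vB[:, None])
        epochLoss += -np.sum(np.log(mP[vYB, vCol] + 1e-12))
        # Backward: Gradient of the mean batch loss with respect to the logits
        mG = mP.copy()
        mG[vYB, vCol] -= 1.0
        mG /= len(vBatchIdx)
        # Update: A single gradient step per batch
        mW -= learnRate * (mG @ mXB.T)
        vB -= learnRate * np.sum(mG, axis = 1)
    return epochLoss / numSamples

# Synthetic data: 3 classes, 2 features, each column is a sample
oRng = np.random.default_rng(0)
vY   = oRng.integers(0, 3, size = 600)
mC   = np.array([[0.0, 3.0, -3.0], [3.0, -3.0, -3.0]]) #<! Class centers (Columns)
mX   = mC[:, vY] + oRng.standard_normal((2, 600))

mW, vB = np.zeros((3, 2)), np.zeros(3)
lLoss  = [TrainEpochSketch(mW, vB, mX, vY, 32, 0.1, oRng) for _ in range(5)]
print('Mean loss per epoch: ' + ', '.join(f'{valLoss:0.3f}' for valLoss in lLoss))
```

Each call runs one pass over the data, so the number of gradient steps per epoch equals the number of batches.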

In [None]:
# Train Parameters
batchSize   = 256
nEpochs     = 10
learnRate   = 2e-1

### Train by a Single Batch

In this section the model is trained with the whole train data as a single batch.

In [None]:
# Data Objects

oDsTrain    = DataSet(mXTrain.T, vYTrain, mXTrain.shape[0]) #<! Train Data Set
oDsVal      = DataSet(mXTest.T, vYTest, mXTest.shape[0]) #<! Validation Data Set

In [None]:
# Train Model

oModel = ModelNN([
        LinearLayer(784, 200, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(200, 10,  initMethod = NNWeightInit.KAIMING),
        ])

_, _, lValLossNaive, lValScoreNaive = TrainModel(oModel, oDsTrain, oDsVal, nEpochs, CrossEntropyLoss, ScoreAccLogits, learnRate)

* <font color='red'>(**?**)</font> How many gradient steps were conducted?
* <font color='red'>(**?**)</font> Explain the train score of the first _Epoch_.

### Train by Mini Batches

This section trains the model by mini batches smaller than the data set.

In [None]:
# Data Objects

oDsTrain    = DataSet(mXTrain.T, vYTrain, batchSize) #<! Train Data Set
oDsVal      = DataSet(mXTest.T, vYTest, batchSize) #<! Validation Data Set

In [None]:
# Train Model

oModel = ModelNN([
        LinearLayer(784, 200, initMethod = NNWeightInit.KAIMING), ReLULayer(),
        LinearLayer(200, 10,  initMethod = NNWeightInit.KAIMING),
        ])

_, _, lValLossOpt, lValScoreOpt = TrainModel(oModel, oDsTrain, oDsVal, nEpochs, CrossEntropyLoss, ScoreAccLogits, learnRate)

* <font color='red'>(**?**)</font> How many gradient steps were conducted?
* <font color='red'>(**?**)</font> What happens if we run `_, _, lValLossOpt, lValScoreOpt = TrainModel(oModel, oDsTrain, oDsVal, nEpochs, CrossEntropyLoss, ScoreAccLogits, learnRate)` without redefining the model?

In [None]:
# Plot Results

hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))

hA = vHa.flat[0]
hA.plot(lValLossNaive, lw = 2, label = 'Single Batch')
hA.plot(lValLossOpt, lw = 2, label = 'Multiple Batches')
hA.grid()
hA.set_title('Cross Entropy Loss')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Loss')
hA.legend();

hA = vHa.flat[1]
hA.plot(lValScoreNaive, lw = 2, label = 'Single Batch')
hA.plot(lValScoreOpt, lw = 2, label = 'Multiple Batches')
hA.grid()
hA.set_title('Accuracy Score')
hA.set_xlabel('Epoch Index')
hA.set_ylabel('Score')
hA.legend();

* <font color='brown'>(**#**)</font> The _Single Batch_ case is basically the classic (Full batch) _Gradient Descent_ while the multiple batches case is the _Batch Stochastic Gradient Descent_.

### Display Test / Validation Samples

Display the estimated class of some samples from the test data.

In [None]:
# Process the Test
mS = oModel.Forward(mXTest.T)
vYHat = np.argmax(mS, axis = 0)

regExpPtrn = r'Index = (\d+)'

hF = PlotMnistImages((mXTest * vStd) + vMean, vYTest, numImg)
lHAx = hF.get_axes()
for hA in lHAx:
    titleStr = hA.get_title()
    regMatch = re.search(regExpPtrn, titleStr)
    imgIdx = int(regMatch.group(1))
    yHat = vYHat[imgIdx]
    titleStr += f'\nEstimated Label = {yHat}'
    hA.set_title(titleStr)

* <font color='red'>(**?**)</font> How many errors are expected in the images above?
* <font color='green'>(**@**)</font> Add the `__getitem__()` method to the `DataSet` class.  
  The method signature is `__getitem__( self, idx: int )` where it enables `oDs[idx]` to extract the `idx`-th sample of the dataset.  
  See [Creating a Custom Dataset](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files) for the concept of the method.
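
As a starting point for the task, the sketch below shows the mechanism on a hypothetical `TinyDataSet` class. It is not the course's `DataSet`: the attribute names (`mX`, `vY`, `numSamples`) are assumptions, and it only keeps the convention that each column of the data is a sample.

```python
import numpy as np
from typing import Tuple

class TinyDataSet:
    # Hypothetical minimal class, used only to illustrate `__getitem__()`
    def __init__( self, mX: np.ndarray, vY: np.ndarray ) -> None:
        self.mX         = mX #<! Each column is a sample
        self.vY         = vY
        self.numSamples = mX.shape[1]

    def __getitem__( self, idx: int ) -> Tuple[np.ndarray, int]:
        # Enables `oDs[idx]`: returns the features and the label of the idx-th sample
        if not (0 <= idx < self.numSamples):
            raise IndexError(f'The index {idx} is out of range for {self.numSamples} samples')
        return self.mX[:, idx], int(self.vY[idx])

mX  = np.arange(12.0).reshape(3, 4) #<! 3 features, 4 samples
vY  = np.array([0, 1, 2, 1])
oDs = TinyDataSet(mX, vY)

vX, valY = oDs[1] #<! Calls `oDs.__getitem__(1)`
print(f'The features of the sample: {vX}')
print(f'The label of the sample: {valY}')
```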