[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - Vanilla Neural Network

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 22/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0074DeepLearningVanillaNN.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider
from ipywidgets import interact

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Courses Packages

from DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram, PlotMnistImages


In [None]:
# General Auxiliary Functions


## Neural Network Classifier

This notebook builds a Neural Network for classification, based on the _ReLU_ activation and a single _Hidden Layer_.  
The model is trained with a simple _Gradient Descent_ loop with a constant _step size_.

* <font color='brown'>(**#**)</font> The Neural Net will be implemented using _NumPy_.

In [None]:
# Parameters

# Data
numSamplesTrain = 60_000
numSamplesTest  = 10_000

# Model
hidLayerDim = 200

# Training
numIter = 300
µ       = 0.35 #<! Step Size / Learning Rate

# Visualization
numImg = 3


## Generate / Load Data

This section loads the [MNIST Data set](https://en.wikipedia.org/wiki/MNIST_database) using [`fetch_openml()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html).

The data is split into 60,000 train samples and 10,000 test samples.

In [None]:
# Load Data

mX, vY = fetch_openml('mnist_784', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

mX = mX / 255.0


* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.

### Plot the Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY)
plt.show()

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

* <font color='brown'>(**#**)</font> Deep Learning is _big data_ oriented, hence it can easily handle all samples in a single _batch_.

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True, stratify = vY)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')

## Neural Network Classifier

This section builds a Neural Network with a single hidden layer.  

The network architecture is given by:

![Neural Network Classifier Architecture](https://github.com/FixelAlgorithmsTeam/FixelCourses/blob/master/DeepLearningMethods/01_DeepLearningFramework/OneHiddenLayerModel.png?raw=true)

* <font color='brown'>(**#**)</font> Deep Learning is the set of methods for training Neural Networks with many hidden layers, a setting which requires delicate handling.

### Math Building Blocks

\begin{align*}
\boldsymbol{x}\in\mathbb{R}^{784},\quad & \boldsymbol{W}_{1}\in\mathbb{R}^{d\times784},\quad\boldsymbol{W}_{2}\in\mathbb{R}^{10\times d}\\
\hat{\boldsymbol{y}}\in\mathbb{R}^{10},\quad & \boldsymbol{b}_{1}\in\mathbb{R}^{d},\qquad\boldsymbol{b}_{2}\in\mathbb{R}^{10}
\end{align*}

 * The hidden layer dimension is given by $d$.
 * The number of classes is 10.  


For simplicity, we denote:

$$\hat{\boldsymbol{y}}_{i}=\text{softmax}\left(\boldsymbol{z}_i\right)$$

Where

$$\boldsymbol{z}_i=\boldsymbol{W}_{2}\boldsymbol{a}_{i}+\boldsymbol{b}_{2},\qquad\boldsymbol{a}_{i}=\text{ReLU}\left(\boldsymbol{W}_{1}\boldsymbol{x}_{i}+\boldsymbol{b}_{1}\right)$$

The data is arranged as:

$$\boldsymbol{X}=\left[\begin{matrix}| &  & |\\
\boldsymbol{x}_{1} & \cdots & \boldsymbol{x}_{N}\\
| &  & |
\end{matrix}\right]\in\mathbb{R}^{784\times N},\qquad\hat{\boldsymbol{Y}}=\left[\begin{matrix}| &  & |\\
\hat{\boldsymbol{y}}_{1} & \cdots & \hat{\boldsymbol{y}}_{N}\\
| &  & |
\end{matrix}\right]\in\mathbb{R}^{10\times N}$$

* <font color='brown'>(**#**)</font> The default in data processing is having samples as rows.
* <font color='brown'>(**#**)</font> Pay attention that in this case the default of Linear Algebra is used, where each sample is a column.
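The two conventions above can be illustrated with a small sketch. The sizes and the names `mXRows`, `mXCols` and `mVRef` are hypothetical and used only for illustration. It shows that transposing the "samples as rows" array gives the column arrangement used here, and that the bias must be reshaped to a column so it is added to every sample.

```python
import numpy as np

rng = np.random.default_rng(0)

numSamples, numFeatures, hidDim = 5, 784, 4 #<! Hypothetical toy sizes

mXRows = rng.random((numSamples, numFeatures)) #<! Samples as rows (Data processing default)
mXCols = mXRows.T                              #<! Samples as columns (Linear Algebra default)

mW1 = rng.standard_normal((hidDim, numFeatures))
vB1 = rng.standard_normal(hidDim)

# The bias is reshaped to (hidDim, 1) so broadcasting adds it to each column (Sample)
mV = mW1 @ mXCols + vB1[:, None] #<! (hidDim, numSamples)

# Reference: The same calculation, one sample at a time
mVRef = np.stack([mW1 @ mXRows[ii] + vB1 for ii in range(numSamples)], axis = 1)

print(mV.shape)
print(np.allclose(mV, mVRef))
```

This is why the training loop passes `mXTrain.T` to the model rather than `mXTrain`.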

### Model Functions

This section builds the components for the model evaluation:

1. Activation: _ReLU_.
2. Linear Model.
3. SoftMax.

* <font color='brown'>(**#**)</font> In many cases the _SoftMax_ function is considered as part of the loss function.
* <font color='brown'>(**#**)</font> Since the _SoftMax_ function preserves the order of its inputs (it is monotonic), the argument which maximizes its output is the same as the argument which maximizes its input.
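The order preserving property can be checked numerically. The following is a minimal sketch, assuming random logits and a hand written `SoftMaxStable()` (a hypothetical helper, not part of the notebook) which subtracts the per column maximum before the exponent, a common way to avoid overflow.

```python
import numpy as np

def SoftMaxStable( mZ: np.ndarray ) -> np.ndarray:
    # Subtracting the per column maximum does not change the result,
    # yet it keeps the argument of `np.exp()` non positive (No overflow).
    mExp = np.exp(mZ - np.max(mZ, axis = 0, keepdims = True))
    return mExp / np.sum(mExp, axis = 0, keepdims = True)

rng   = np.random.default_rng(0)
mZ    = 5 * rng.standard_normal((10, 6)) #<! Hypothetical logits: 10 classes, 6 samples
mYHat = SoftMaxStable(mZ)

vClsLogits = np.argmax(mZ, axis = 0)    #<! Predicted class from the logits
vClsProb   = np.argmax(mYHat, axis = 0) #<! Predicted class from the probabilities

print(np.array_equal(vClsLogits, vClsProb))
```

Hence, for prediction only, the _SoftMax_ step may be skipped. It is needed when probabilities (or the loss) are required.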

In [None]:
# Model Functions

def ReLU( mX: np.ndarray ) -> np.ndarray:
    
    return np.maximum(mX, 0)

def SoftMax( mX: np.ndarray ) -> np.ndarray:
    
    # mExp = np.exp(mX)
    # return mExp / np.sum(mExp, axis = 0)
    
    # Better use tuned implementations
    return sp.special.softmax(mX, axis = 0)

def Model( mX: np.ndarray, mW1: np.ndarray, vB1: np.ndarray, mW2: np.ndarray, vB2: np.ndarray ) -> np.ndarray:
    
    mA    = ReLU(mW1 @ mX + vB1[:, None])
    mZ    = mW2 @ mA + vB2[:, None]
    mYHat = SoftMax(mZ) 
    
    return mYHat

### Loss Function

#### Cross Entropy Loss

$$\ell_{i}=\ell\left(\boldsymbol{y}_{i},\hat{\boldsymbol{y}}_{i}\right)=-\boldsymbol{y}_{i}^{T}\log\left(\hat{\boldsymbol{y}}_{i}\right)$$

* <font color='brown'>(**#**)</font> For a single data sample.

#### One Hot Encoding

$$y_{i}=2\implies\boldsymbol{y}_{i}=\left[\begin{matrix}0\\
1\\
0\\
\vdots\\
0
\end{matrix}\right]$$

Note that if (for example) $y_i = 3$ then:

$$\boldsymbol{y}_{i}^{T}\log\left(\hat{\boldsymbol{y}}_{i}\right)=\log\left(\hat{\boldsymbol{y}}_{i}\left[3\right]\right)=\log\left(\hat{\boldsymbol{y}}_{i}\left[y_{i}\right]\right)$$

#### Overall Loss

$$L=\frac{1}{N}\sum_{i=1}^{N}\ell\left(\boldsymbol{y}_{i},\hat{\boldsymbol{y}}_{i}\right)=-\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{y}_{i}^{T}\log\left(\hat{\boldsymbol{y}}_{i}\right)=-\frac{1}{N}\sum_{i=1}^{N}\log\left(\hat{\boldsymbol{y}}_{i}\left[y_{i}\right]\right)$$
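The last equality, replacing the inner product with the one hot vector by indexing, can be verified on toy data. This is a sketch under assumptions: the probabilities are random and normalized per column, and the names `mYOneHot`, `lossOneHot` and `lossIndex` are hypothetical. Note that the code uses zero based indexing, so the label `k` selects row `k`.

```python
import numpy as np

rng = np.random.default_rng(0)
numClass, numSamples = 10, 7 #<! Hypothetical toy sizes

vY     = rng.integers(0, numClass, size = numSamples) #<! Labels (Scalar per sample)
mYHat  = rng.random((numClass, numSamples)) + 0.1     #<! Strictly positive values
mYHat /= np.sum(mYHat, axis = 0, keepdims = True)     #<! Each column sums to 1

# One Hot encoding: A single 1 per column, at the row of the label
mYOneHot = np.zeros((numClass, numSamples))
mYOneHot[vY, np.arange(numSamples)] = 1

lossOneHot = -np.mean(np.sum(mYOneHot * np.log(mYHat), axis = 0)) #<! Inner product form
lossIndex  = -np.mean(np.log(mYHat[vY, np.arange(numSamples)]))   #<! Indexing form

print(np.isclose(lossOneHot, lossIndex))
```

The indexing form avoids building the one hot matrix, which is the form used by the loss function of this notebook.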


* <font color='brown'>(**#**)</font> The package [_NumPy ML_](https://github.com/ddbourgin/numpy-ml) offers implementations of loss functions and other ML related functions.  
  It also offers a calculation of _Gradient_ of some of the functions.

In [None]:
# Loss Functions

def CrossEntropyLoss( vY: np.ndarray, mYHat: np.ndarray ) -> float:
    # vY: Vector of Ground Truth (Scalar per sample)
    
    numSamples = len(vY)
    return -np.mean(np.log(mYHat[vY, range(numSamples)]))

### Gradient Function

The model function is given by

$$\hat{\boldsymbol{y}}_{i} = \text{softmax}\left(\boldsymbol{W}_{2}\text{ReLU}\left(\boldsymbol{W}_{1}\boldsymbol{x}_{i}+\boldsymbol{b}_{1}\right)+\boldsymbol{b}_{2}\right)$$

The loss function is given by

$$ -\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{y}_{i}^{T}\log\left(\hat{\boldsymbol{y}}_{i}\right)=-\frac{1}{N}\sum_{i=1}^{N}\log\left(\hat{\boldsymbol{y}}_{i}\left[y_{i}\right]\right) $$

The gradients of the loss function $L$ are:

$$\nabla_{\boldsymbol{b}_{2}}L=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}\right)$$

$$\nabla_{\boldsymbol{W}_{2}}L=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}\right)\boldsymbol{a}_{i}^{T}$$

$$\nabla_{\boldsymbol{b}_{1}}L=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{I}_{\boldsymbol{v}_{i}>0}\boldsymbol{W}_{2}^{T}\left(\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}\right)$$

$$\nabla_{\boldsymbol{W}_{1}}L=\frac{1}{N}\sum_{i=1}^{N}\boldsymbol{I}_{\boldsymbol{v}_{i}>0}\boldsymbol{W}_{2}^{T}\left(\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}\right)\boldsymbol{x}_{i}^{T}$$

where $\boldsymbol{v}_{i}=\boldsymbol{W}_{1}\boldsymbol{x}_{i}+\boldsymbol{b}_{1}$ and $\boldsymbol{I}_{\boldsymbol{v}_{i}>0}=\text{diag}\left(\mathbb{I}\left\{ \boldsymbol{v}_i>0\right\} \right)$



* <font color='brown'>(**#**)</font> Cross Entropy and SoftMax Loss Gradient: [Gradient of SoftMax Cross Entropy Loss](https://www.michaelpiseno.com/blog/2021/softmax-gradient), [Derivative of SoftMax Loss Function](https://math.stackexchange.com/questions/945871).
* <font color='brown'>(**#**)</font> Pay attention to the dependence of chained gradients.
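A common way to validate such analytic gradients is a finite differences check. The following is a minimal, self contained sketch on a tiny random problem. The helpers `Forward()`, `Loss()`, `AnalyticGrad()` and `NumericGrad()` are hypothetical names which re implement the formulas above, and the sizes are arbitrary assumptions.

```python
import numpy as np

def Forward( mX, mW1, vB1, mW2, vB2 ):
    mV   = mW1 @ mX + vB1[:, None]                            #<! (d, N)
    mA   = np.maximum(mV, 0)                                  #<! (d, N)
    mZ   = mW2 @ mA + vB2[:, None]                            #<! (c, N)
    mExp = np.exp(mZ - np.max(mZ, axis = 0, keepdims = True)) #<! Stable SoftMax
    return mV, mA, mExp / np.sum(mExp, axis = 0, keepdims = True)

def Loss( mX, vY, mW1, vB1, mW2, vB2 ):
    mYHat = Forward(mX, mW1, vB1, mW2, vB2)[2]
    return -np.mean(np.log(mYHat[vY, np.arange(len(vY))]))

def AnalyticGrad( mX, vY, mW1, vB1, mW2, vB2 ):
    N             = mX.shape[1]
    mV, mA, mYHat = Forward(mX, mW1, vB1, mW2, vB2)
    mD2                    = mYHat.copy()
    mD2[vY, np.arange(N)] -= 1             #<! (yHat - y)
    mD2                   /= N
    mD1 = (mW2.T @ mD2) * (mV > 0)         #<! Chain through the ReLU
    return mD1 @ mX.T, mD1.sum(axis = 1), mD2 @ mA.T, mD2.sum(axis = 1)

def NumericGrad( hLoss, mP, eps = 1e-6 ):
    # Central differences, perturbing one element of `mP` (In place) at a time
    mG = np.zeros_like(mP)
    for idx in np.ndindex(mP.shape):
        valOrig = mP[idx]
        mP[idx] = valOrig + eps
        lossPlus = hLoss()
        mP[idx] = valOrig - eps
        lossMinus = hLoss()
        mP[idx] = valOrig
        mG[idx] = (lossPlus - lossMinus) / (2 * eps)
    return mG

rng = np.random.default_rng(0)
dIn, dHidden, dOut, N = 6, 5, 3, 8 #<! Hypothetical toy sizes
mX      = rng.standard_normal((dIn, N))
vY      = rng.integers(0, dOut, size = N)
lParams = [rng.standard_normal((dHidden, dIn)), rng.standard_normal(dHidden),
           rng.standard_normal((dOut, dHidden)), rng.standard_normal(dOut)]

hLoss         = lambda: Loss(mX, vY, *lParams)
lGradAnalytic = AnalyticGrad(mX, vY, *lParams)
lGradNumeric  = [NumericGrad(hLoss, mP) for mP in lParams]
maxErr        = max(np.max(np.abs(mGA - mGN)) for mGA, mGN in zip(lGradAnalytic, lGradNumeric))

print(f'Maximum absolute deviation: {maxErr}')
```

A small deviation suggests the analytic derivation is consistent with the loss. The check is not reliable exactly at the kink of the _ReLU_, where the function is not differentiable.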

In [None]:
## Gradient Functions

def CalcGradients( mX: np.ndarray, vY: np.ndarray, mW1: np.ndarray, vB1: np.ndarray, mW2: np.ndarray, vB2: np.ndarray ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    
    N      = mX.shape[1]
    
    mV     = mW1 @ mX + vB1[:, None]          #<! (d, N)
    mA     = ReLU(mV)                         #<! (d, N)
    mZ     = mW2 @ mA + vB2[:, None]          #<! (10, N)
    mYHat  = SoftMax(mZ)                      #<! (10, N)
    
    mTemp               = mYHat               #<! (10, N)
    mTemp[vY,range(N)] -= 1
    mTemp              /= N
    
    dB2                 = mTemp.sum(axis = 1) #<! (10,)
    dW2                 = mTemp @ mA.T        #<! (10, d)
    
    mTemp               = mW2.T @ mTemp       #<! (d, N)
    mTemp[mV <= 0]      = 0                   #<! Keep only entries where v > 0 (ReLU derivative)
    
    dB1                 = mTemp.sum(axis = 1) #<! (d,)
    dW1                 = mTemp @ mX.T        #<! (d, 784)
    
    return dW1, dB1, dW2, dB2

## Model Training

The model training (Optimization) is done by vanilla _Gradient Descent_.  
Since the model is small and the data is relatively small, the batch size is the whole training set.

* <font color='brown'>(**#**)</font> Larger model / data set might require using _Stochastic Gradient Descent_.  
  In this case the actual gradient of the loss function over the whole data is _approximated_ by the gradient calculated over a sub sample (Batch).
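The batching idea can be sketched as a generator of shuffled index batches. This is an illustration only: `IterateMiniBatches()` and the sizes are hypothetical and are not used by the notebook, which trains on the full data set.

```python
import numpy as np

def IterateMiniBatches( numSamples: int, batchSize: int, rng: np.random.Generator ):
    # Shuffle the indices once per epoch, then yield consecutive slices.
    # The last batch might be smaller than `batchSize`.
    vIdx = rng.permutation(numSamples)
    for startIdx in range(0, numSamples, batchSize):
        yield vIdx[startIdx:(startIdx + batchSize)]

rng = np.random.default_rng(0)
numSamples, batchSize = 1000, 128 #<! Hypothetical sizes

lBatch = list(IterateMiniBatches(numSamples, batchSize, rng))

print(f'Number of batches: {len(lBatch)}')
print(f'Size of the last batch: {len(lBatch[-1])}')
```

In a training loop, each batch of indices `vIdxBatch` would select the columns of the data, for instance `CalcGradients(mXTrain[vIdxBatch].T, vYTrain[vIdxBatch], ...)`, followed by the same update step.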

### Initialize the Model

In order to initialize each _perceptron_ with a different value, a random initialization will be applied.

* <font color='brown'>(**#**)</font> Random initialization means the training phase is random. Set the seed for reproducibility.
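The effect of the seed can be seen in a short sketch. It uses the same legacy `np.random.seed()` / `np.random.randn()` interface as the notebook. The array names and sizes are hypothetical.

```python
import numpy as np

seedNum = 512

np.random.seed(seedNum)
mWA = np.random.randn(3, 4) #<! First draw after seeding

np.random.seed(seedNum)
mWB = np.random.randn(3, 4) #<! First draw after seeding again: Same values
mWC = np.random.randn(3, 4) #<! Second draw without seeding: Different values

print(np.array_equal(mWA, mWB))
print(np.array_equal(mWA, mWC))
```

Re running the initialization cell without setting the seed again continues the random stream, so the results differ between runs.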

In [None]:
# Model Initialization

def InitModel( dIn: int, dHidden: int, dOut: int ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    
    mW1 = np.random.randn(dHidden, dIn)  / dIn
    mW2 = np.random.randn(dOut, dHidden) / dHidden
    vB1 = np.zeros(dHidden)
    vB2 = np.zeros(dOut)
    
    return mW1, vB1, mW2, vB2

### Training Loop

In [None]:
# Training Loop

# Parameters
dIn                = mX.shape[1] #<! Number of features
dHidden            = hidLayerDim #<! Dimensions of the hidden layer
dOut               = len(np.unique(vY)) #<! Number of classes
mW1, vB1, mW2, vB2 = InitModel(dIn, dHidden, dOut)
    
# Display Results
hF, hA = plt.subplots(figsize = (12, 6))

# Gradient Descent
lLoss = [] #<! List of Loss values
for ii in range(numIter):

    # Update Weights
    dW1, dB1, dW2, dB2  = CalcGradients(mXTrain.T, vYTrain, mW1, vB1, mW2, vB2)
    mW1                -= µ * dW1
    vB1                -= µ * dB1
    mW2                -= µ * dW2
    vB2                -= µ * dB2

    # Check Loss
    mYHat = Model(mXTrain.T, mW1, vB1, mW2, vB2)
    valLoss  = CrossEntropyLoss(vYTrain, mYHat)
    lLoss.append(valLoss)
    
    # Display
    hA.cla()
    hA.set_title(f'Iteration: {(ii + 1): 04d} / {numIter}, Loss = {valLoss: 0.2f}')
    hA.plot(lLoss, 'b', marker = '.', ms = 5)
    hA.set_xlabel('Iteration Index')
    hA.set_ylabel('Loss Value')
    hA.grid()
    
    # fig.canvas.draw()
    plt.pause(1e-20)
    display(hF, clear = True) #<! "In Place"

* <font color='brown'>(**#**)</font> In practice, some metric on a small validation set is also presented.
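One such metric is the accuracy. Below is a minimal sketch of a hypothetical helper, `CalcAccuracy()`, evaluated on a tiny hand made example with 3 classes and 4 samples (columns). It is not part of the notebook.

```python
import numpy as np

def CalcAccuracy( vY: np.ndarray, mYHat: np.ndarray ) -> float:
    # vY:    Vector of Ground Truth labels (Scalar per sample)
    # mYHat: Matrix of class probabilities, a sample per column
    return float(np.mean(np.argmax(mYHat, axis = 0) == vY))

vY    = np.array([0, 2, 1, 1])
mYHat = np.array([[0.7, 0.1, 0.2, 0.5],
                  [0.2, 0.2, 0.7, 0.3],
                  [0.1, 0.7, 0.1, 0.2]])

valAcc = CalcAccuracy(vY, mYHat) #<! The last sample is misclassified

print(f'Accuracy: {valAcc:0.2%}')
```

In the training loop this could be evaluated each iteration on held out samples, for instance on a split taken from the training set, and plotted next to the loss.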

## Model Performance

This section analyzes the model performance on the train and test data.

In [None]:
# Apply Model on Data

mYHatTrain = Model(mXTrain.T, mW1, vB1, mW2, vB2)
mYHatTest  = Model(mXTest.T,  mW1, vB1, mW2, vB2)
vYHatTrain = np.argmax(mYHatTrain, axis = 0)
vYHatTest  = np.argmax(mYHatTest, axis = 0)

In [None]:
# Confusion Matrix

hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6))

hA, _ = PlotConfusionMatrix(vYTrain, vYHatTrain, hA = vHa[0])
hA.set_title(f'Train Data, Accuracy {np.mean(vYTrain == vYHatTrain): 0.2%}')

hA, _ = PlotConfusionMatrix(vYTest, vYHatTest, hA = vHa[1])
hA.set_title(f'Test Data, Accuracy {np.mean(vYTest == vYHatTest): 0.2%}')

* <font color='red'>(**?**)</font> How many parameters are in the model?
* <font color='red'>(**?**)</font> Is the problem _Convex_?
* <font color='green'>(**@**)</font> Add another hidden layer.