[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Introduction to Deep Learning - Multi Layer Perceptron Optimization with Gradient Descent by Auto Differentiation

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 05/12/2025 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0074DeepLearningVanillaNN.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Scientific Computing
from autograd import grad
import autograd.numpy as anp
import autograd.scipy as asp
from numba import njit, vectorize

# Machine Learning
from sklearn.metrics import r2_score

# Miscellaneous
import math
from platform import python_version
import random

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union
from numpy.typing import ArrayLike, NDArray

# Visualization
import matplotlib.pyplot as plt

# Jupyter
from IPython import get_ipython
from ipywidgets import IntSlider, Layout
from ipywidgets import interact

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

In [None]:
# Courses Packages

from DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram, PlotMnistImages

In [None]:
# General Auxiliary Functions


## Multi Layer Perceptron (MLP) Optimization

This notebook optimizes a [Multi Layer Perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP) model with a single _Hidden Layer_ for a _Regression_ task.  
The model is trained with a simple _Gradient Descent_ loop with an adaptive _step size_ (Backtracking).  
The gradient is calculated using [Automatic Differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) based on the [`AutoGrad`](https://github.com/HIPS/autograd) package.
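
To make the idea of Automatic Differentiation concrete, the following is a minimal sketch of forward mode differentiation with dual numbers in plain Python. It is an illustration only: the `Dual` class and the `Sin` and `Derivative` helpers are hypothetical names, and AutoGrad itself computes gradients by reverse mode differentiation rather than by this scheme.

```python
import math

class Dual:
    """A number `val + der * eps` with `eps ** 2 = 0` (hypothetical helper)."""

    def __init__(self, val: float, der: float = 0.0) -> None:
        self.val = val #<! Function value
        self.der = der #<! Derivative value carried along with it

    @staticmethod
    def _lift(other):
        # Treat plain numbers as constants (Zero derivative)
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.val + other.val, self.der + other.der) #<! Sum rule

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        return Dual(self.val * other.val, self.der * other.val + self.val * other.der) #<! Product rule

    __rmul__ = __mul__

def Sin(x: Dual) -> Dual:
    return Dual(math.sin(x.val), math.cos(x.val) * x.der) #<! Chain rule

def Derivative(hF, x: float) -> float:
    # Seed the input derivative with 1 and read the output derivative
    return hF(Dual(x, 1.0)).der

hF   = lambda x: 3 * x * x + Sin(x) #<! f(x) = 3x^2 + sin(x)
valD = Derivative(hF, 2.0)          #<! Should match f'(x) = 6x + cos(x) at x = 2

print(valD, 6 * 2.0 + math.cos(2.0))
```

The derivative is exact up to floating point rounding, since each elementary operation propagates its own derivative rule and no finite difference step is involved.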

In [None]:
# Parameters

# Data
numSamples = 151

# Model
hidLayerDim = 10_000

# Training
numIter    = 7_500 #<! Number of Iterations
µ          = 1e-3 #<! Step Size \ Learning Rate
α          = 0.5 #<! Backtracking Factor
minμ       = 1e-8 #<! Minimum Step Size
maxNumBack = 20 #<! Maximum Number of Backtracking Steps

## Generate / Load Data

The 1D _Time Series_ is a _Harmonic_ signal with a non linear phase function.  
It is composed of `numSamples` samples within the time interval $\left[ 0, 1 \right]$.

In [None]:
# Load Data

# Non Linear Phase
vX   = np.linspace(0, 1, numSamples)

# Alternative signals (Not used)
# vPhi = np.sin(1.25 * vX) + 0.7 * np.cos(2.50 * vX) + 0.3 * np.sin(3.75 * vX) + 0.3 * np.cos(5.00 * vX) + 0.65 * np.sin(6.25 * vX)
# vPhi = np.sin(1.25 * vX) + 0.7 * np.cos(2.50 * vX) + 0.3 * np.sin(3.75 * vX) + 0.3 * np.cos(5.00 * vX)
# vY   = np.cos(2 * np.pi * vPhi)

vY = np.sin(20 * vX) * np.sin(10 * np.abs(vX) ** (1.1)) + vX / 10 #<! Labels

print(f'The number of samples: {len(vY)}')

### Plot the Data

In [None]:
# Plot the Data

hF, hA = plt.subplots(figsize = (6, 6))
hA.plot(vX, vY, color = lMatPltLibclr[0], linewidth = 1.5, linestyle = '--', marker = 'o', markersize = 4)
hA.set_title('Data Samples')
hA.set_xlabel('Time')
hA.set_ylabel('Amplitude');

## Neural Network Regressor

This section builds a Neural Network with a single hidden layer.  

The network architecture is given by:

![](https://i.imgur.com/BnZatPW.png)
<!-- ![](https://i.postimg.cc/50qsSCkk/Diagrams-Multi-Layer-Perceptron-(MLP)-Regression.png) -->

* <font color='brown'>(**#**)</font> The Neural Net will be implemented using _NumPy_.
* <font color='brown'>(**#**)</font> Deep Learning is the set of methods for training Neural Networks with many hidden layers, a case which requires delicate handling.

### Math Building Blocks

\begin{align*}
x \in \mathbb{R}, \quad & \boldsymbol{W}_{1} \in \mathbb{R}^{d \times 1}, \quad \boldsymbol{W}_{2} \in \mathbb{R}^{1 \times d}\\
y \in \mathbb{R}, \quad & \boldsymbol{b}_{1} \in \mathbb{R}^{d}
\end{align*}

 * The hidden layer dimension is given by $d$.
 * The input and output are scalars.  

<br>

* <font color='brown'>(**#**)</font> The default in data processing is having samples as rows.
* <font color='brown'>(**#**)</font> Pay attention that in this case the default of Linear Algebra is used, where each sample is a column.
* <font color='red'>(**?**)</font> What is the Mathematical formulation of the model?

### Model Functions

This section builds the model as a NumPy function.  
The [activation function](https://en.wikipedia.org/wiki/Activation_function) is the [Rectified Linear Unit (ReLU)](https://en.wikipedia.org/wiki/Rectified_linear_unit).
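
As a small illustration of the ReLU, the sketch below evaluates it on a few values and shows that a single hidden unit of the form `w2 * ReLU(w1 * x + b1)` is zero up to the kink at `x = -b1 / w1` and linear after it. The values of `w1`, `b1` and `w2` are arbitrary and chosen only for the example. It uses plain NumPy, not the AutoGrad wrapper.

```python
import numpy as np

def ReLU( mX: np.ndarray ) -> np.ndarray:
    # Element wise max(x, 0)
    return np.maximum(mX, 0)

vIn  = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
vAct = ReLU(vIn) #<! Negative values are clipped to zero

# A single hidden unit (Arbitrary illustrative parameters)
w1, b1, w2 = 2.0, -1.0, 3.0
kinkX = -b1 / w1 #<! The unit becomes active for x > 0.5

vGrid = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
vOut  = w2 * ReLU(w1 * vGrid + b1) #<! Zero up to the kink, linear after it

print(vAct)
print(kinkX, vOut)
```

A model with `d` such units is a sum of `d` hinge functions, hence a piecewise linear function of the input.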

In [None]:
# Activation Function

def ReLU( mX: NDArray ) -> NDArray:
    # Using `anp` instead of `np` to ensure compatibility with autograd.
    
    return anp.maximum(mX, 0)

def Sigmoid( mX: NDArray ) -> NDArray:
    # Using `anp` instead of `np` to ensure compatibility with autograd.
    
    return (2 * asp.special.expit(mX)) - 1

def TanH( mX: NDArray ) -> NDArray:
    # Using `anp` instead of `np` to ensure compatibility with autograd.
    
    return anp.tanh(mX)

In [None]:
# Model Functions

def Model( vW: NDArray, vX: NDArray, hidLayerDim: int ) -> NDArray:
    """
    A single Hidden Layer MLP model.  
    A regression model from R to R.
    """

    # Unpack the weights
    mW1 = anp.reshape(vW[0:hidLayerDim], (hidLayerDim, 1))
    vB1 = vW[hidLayerDim:(hidLayerDim + hidLayerDim)]
    mW2 = anp.reshape(vW[(hidLayerDim + hidLayerDim):(hidLayerDim + hidLayerDim + hidLayerDim)], (1, hidLayerDim))

    vY = mW2 @ ReLU(mW1 @ vX[None, :] + vB1[:, None]) #<! Adapt to NumPy's broadcasting
    # vY = TanH(vY) #<! Apply Hyperbolic Tangent Activation, Taking advantage of the [-1, 1] range
    # vY = Sigmoid(vY) #<! Apply Scaled and Shifted Sigmoid Activation, Taking advantage of the [-1, 1] range
    vY = anp.squeeze(vY) #<! Remove single dimensional entries from the shape (Using `anp` for autograd compatibility)
    
    return vY

### Loss Function

The Loss function is given by the [MSE Loss](https://en.wikipedia.org/wiki/Mean_squared_error):

$$ \ell \left( \hat{\boldsymbol{y}}, \boldsymbol{y} \right) = \frac{1}{2 N} \sum_{i = 1}^{N} {\left( \hat{y}_{i} - {y}_{i} \right)}^{2} $$

Where $\hat{y}_{i}$ is the model output.

* <font color='brown'>(**#**)</font> The package [_NumPy ML_](https://github.com/ddbourgin/numpy-ml) offers implementations of loss functions and other ML related functions.  
  It also provides the _Gradient_ of some of the functions.
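
A worked example of the formula above, with made up numbers, may help to verify the $\frac{1}{2 N}$ scaling. The arrays are illustrative and are not the notebook's data.

```python
import numpy as np

vYRef = np.array([1.0, 2.0, 3.0]) #<! Ground truth (Illustrative values)
vYEst = np.array([1.0, 1.0, 5.0]) #<! Model output (Illustrative values)

# Residuals: [0, -1, 2] -> Squares: [0, 1, 4] -> Sum: 5 -> 5 / (2 * 3)
lossVal = 0.5 * np.mean((vYEst - vYRef) ** 2)

print(lossVal) #<! 5 / 6
```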

In [None]:
# Loss Functions

def MseLoss( vW: NDArray, vX: NDArray, vY: NDArray, hidLayerDim: int ) -> NDArray:
    # vY: Vector of Ground Truth (Scalar per sample)

    vYHat = Model(vW, vX, hidLayerDim)
    
    return 0.5 * anp.mean((vY - vYHat) ** 2)

### Gradient Function

This section builds the function which returns the Gradient of the loss function for a given:
 - Set of parameters of the model.
 - Input data.
 - Labels.

The gradient is calculated using _Auto Differentiation_.
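
To get a feeling for what such a gradient function returns, the sketch below derives the gradient of the MSE loss of a single hidden layer ReLU model by hand and compares it to a central finite difference approximation. It uses plain NumPy and a tiny model. The helpers `AnalyticGrad` and `NumericGrad`, the data and the dimensions are assumptions made for the illustration and are not part of the notebook.

```python
import numpy as np

def Unpack( vW, d ):
    # Same packing as the notebook's model: [W1, b1, W2]
    return vW[0:d].reshape(d, 1), vW[d:(2 * d)], vW[(2 * d):(3 * d)].reshape(1, d)

def MseLoss( vW, vX, vY, d ):
    mW1, vB1, mW2 = Unpack(vW, d)
    vYHat = np.squeeze(mW2 @ np.maximum(mW1 @ vX[None, :] + vB1[:, None], 0))
    return 0.5 * np.mean((vY - vYHat) ** 2)

def AnalyticGrad( vW, vX, vY, d ):
    mW1, vB1, mW2 = Unpack(vW, d)
    mZ   = mW1 @ vX[None, :] + vB1[:, None]      #<! Pre activation, (d, N)
    mH   = np.maximum(mZ, 0)                     #<! Hidden layer, (d, N)
    vR   = (np.squeeze(mW2 @ mH) - vY) / vX.size #<! d Loss / d yHat, (N, )
    mD   = (mW2.T * vR[None, :]) * (mZ > 0)      #<! d Loss / d Z, (d, N)
    vGW1 = mD @ vX                               #<! Sum over samples of D * x
    vGB1 = np.sum(mD, axis = 1)
    vGW2 = mH @ vR
    return np.concatenate([vGW1, vGB1, vGW2])

def NumericGrad( hF, vW, ε = 1e-6 ):
    # Central finite differences, one coordinate at a time
    vG = np.zeros_like(vW)
    for ii in range(vW.size):
        vE     = np.zeros_like(vW)
        vE[ii] = ε
        vG[ii] = (hF(vW + vE) - hF(vW - vE)) / (2 * ε)
    return vG

oRng = np.random.default_rng(0)
d    = 4
vXs  = np.linspace(0, 1, 7)
vYs  = np.sin(5 * vXs)
vWs  = oRng.standard_normal(3 * d)

vGA = AnalyticGrad(vWs, vXs, vYs, d)
vGN = NumericGrad(lambda vW: MseLoss(vW, vXs, vYs, d), vWs)

print(np.max(np.abs(vGA - vGN))) #<! Expected to be tiny
```

The finite difference check needs two loss evaluations per parameter, which is one reason it serves only as a verification tool, while the actual training relies on Automatic Differentiation.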

In [None]:
# Auxiliary Functions

hModelFun = lambda vW: Model(vW, vX, hidLayerDim)
hLossFun  = lambda vW: MseLoss(vW, vX, vY, hidLayerDim)

In [None]:
# Gradient Functions

hGradFun = grad(hLossFun)

## Model Optimization

The model optimization is done by _Gradient Descent_ with an _Adaptive Step Size_ (Backtracking):

 * The model is initialized with random values.
 * The step size is adjusted to ensure an improvement of the loss.

The optimization should yield optimized parameters of the model.
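
The backtracking scheme can be sketched on a toy problem where the behaviour is easy to verify. The example below minimizes an ill conditioned quadratic function. The function `BacktrackingGd`, its argument names and the quadratic objective are assumptions for the illustration, and the acceptance rule is a simple "loss does not increase" test rather than a sufficient decrease (Armijo) condition.

```python
import numpy as np

def BacktrackingGd( hF, hG, vW0, stepSize = 1.0, backFactor = 0.5, numIter = 200, maxNumBack = 20, minStepSize = 1e-8 ):
    vW    = np.array(vW0, dtype = float)
    lLoss = [hF(vW)]
    for _ in range(numIter):
        vG      = hG(vW) #<! Current gradient
        lossVal = hF(vW) #<! Current objective
        kk      = 0
        # Shrink the step size until the loss does not increase
        while (hF(vW - stepSize * vG) > lossVal) and (kk < maxNumBack) and (stepSize >= minStepSize):
            kk       += 1
            stepSize *= backFactor
        vW        = vW - stepSize * vG #<! Gradient step
        stepSize /= backFactor         #<! Try a larger step on the next iteration
        lLoss.append(hF(vW))
    return vW, lLoss

# Toy objective: f(w) = 0.5 * (w_1^2 + 10 * w_2^2)
vD = np.array([1.0, 10.0])
hF = lambda vW: 0.5 * np.sum(vD * vW ** 2)
hG = lambda vW: vD * vW

vWOpt, lLossToy = BacktrackingGd(hF, hG, [1.0, 1.0])

print(lLossToy[0], lLossToy[-1])
```

The initial step size of `1.0` is too large for this objective, yet the loss never increases since the step is halved until a non increasing point is found.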

* <font color='red'>(**?**)</font> What would happen if all values are initialized as $0$? Think about the initial values and gradients.

### Gradient Descent Loop

In [None]:
# Training Loop

# Parameters
vW = np.random.randn(hidLayerDim + hidLayerDim + hidLayerDim) #<! Total number of parameters

# Initialization
lLoss = [] #<! List of Loss values
lYHat = [] #<! List of Prediction values

lossVal = hLossFun(vW)
vYHat   = hModelFun(vW)

lLoss.append(lossVal)
lYHat.append(vYHat)

print(f'Iteration: {(0):<5} / {numIter}, Loss Value: {lossVal:0.4f}, Step Size: {µ:0.2e}')

# Gradient Descent
for ii in range(numIter):

    # Calculate the Gradient
    vG       = hGradFun(vW) #<! Current gradient
    lossVal  = hLossFun(vW) #<! Current objective 

    kk      = 0
    while((hLossFun(vW - µ * vG) > lossVal) and (kk < maxNumBack) and (µ >= minμ)):
        kk += 1
        µ  *= α
    
    vW -= µ * vG #<! Gradient Step
    µ  /= α #<! Increase Step Size

    # Calculate Loss
    lossVal = hLossFun(vW)
    vYHat   = hModelFun(vW)

    lLoss.append(lossVal)
    lYHat.append(vYHat)

    print(f'Iteration: {(ii + 1):<5} / {numIter}, Loss Value: {lossVal:0.4f}, Step Size: {µ:0.2e}')

## Model Performance

This section analyzes the model performance along the optimization path.

In [None]:
# Plot Function

def PlotOptPath( iterIdx: int, vX: NDArray, vY: NDArray, lLoss: List[float], lYHat: List[NDArray] ) -> None:
    """
    Display the Optimization Path.
    """

    vYHat = lYHat[iterIdx]

    hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (8, 6))
    vHa = vHa.flat

    # Plot the Prediction
    hA = vHa[0]
    hA.plot(vX, vY, color = lMatPltLibclr[0], linewidth = 1.5, linestyle = '--', marker = '.', markersize = 4, label = 'Ground Truth')
    hA.plot(vX, vYHat, color = lMatPltLibclr[1], linewidth = 2, linestyle = '-', alpha = 0.55, label = 'Prediction')
    hA.set_ylim((-1.1, 1.1))
    hA.set_xlabel('Time')
    hA.set_ylabel('Amplitude');
    hA.set_title(f'Signals, R2: {r2_score(vY, vYHat):0.2f}')
    hA.legend()

    lLogLoss = [math.log(lossVal + 1) for lossVal in lLoss]
    xLimBuffer = int(0.025 * len(lLogLoss))

    # Plot the Loss Function
    hA = vHa[1]
    hA.plot(range(iterIdx + 1), lLogLoss[:(iterIdx + 1)], color = lMatPltLibclr[2], linewidth = 2) #<! Include the loss of the displayed iteration
    hA.set_xlim((-xLimBuffer, len(lLoss) + xLimBuffer))
    hA.set_ylim((-1, 10))
    hA.set_xlabel('Iteration Index')
    hA.set_ylabel('Loss Value [Log(MSE + 1)]')
    hA.set_title('Loss Value vs. Iterations')

    hF.suptitle(f'Model Prediction vs. Ground Truth, Iteration: {iterIdx:<5}');

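* <font color='brown'>(**#**)</font> By the [Universal Approximation Theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem), a wide enough hidden layer can approximate the signal arbitrarily well. Hence, potentially, the training loss can be driven arbitrarily close to zero.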

In [None]:
# Interactive Plot

hPlotOptPath = lambda ii: PlotOptPath(ii, vX, vY, lLoss, lYHat)
idxSlider = IntSlider(value = 0, min = 0, max = len(lLoss) - 1, step = 1, description = 'Iteration', continuous_update = False, readout_format = 'd', layout = Layout(width = '30%'))
interact(hPlotOptPath, ii = idxSlider);

* <font color='red'>(**?**)</font> How many parameters are in the model?
* <font color='red'>(**?**)</font> Is the problem _Convex_?
* <font color='green'>(**@**)</font> Try different dimensions of the hidden layer.
* <font color='green'>(**@**)</font> Add another hidden layer.

## Gradient Descent on Non Convex Loss Landscape

This section shows a 2D _Non Convex_ loss function with multiple minimum points.
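
A quick way to see the non convexity is to test the midpoint inequality: a convex function satisfies $f \left( \frac{a + b}{2} \right) \leq \frac{f \left( a \right) + f \left( b \right)}{2}$ for any two points. The sketch below evaluates Himmelblau's function, the objective used in the cells that follow, at two of its minima and at their midpoint. The function name `Himmelblau` and the variable names are illustrative.

```python
def Himmelblau( x1: float, x2: float ) -> float:
    return (x1 ** 2 + x2 - 11) ** 2 + (x1 + x2 ** 2 - 7) ** 2

tuA   = (3.0, 2.0)             #<! A minimum point
tuB   = (-2.805118, 3.131312)  #<! Another minimum point (Rounded to 6 digits)
tuMid = ((tuA[0] + tuB[0]) / 2, (tuA[1] + tuB[1]) / 2)

valA   = Himmelblau(*tuA)
valB   = Himmelblau(*tuB)
valMid = Himmelblau(*tuMid)

# For a convex function the midpoint value can not exceed the average of the end points
isConvexityViolated = valMid > 0.5 * (valA + valB)

print(valA, valB, valMid, isConvexityViolated)
```

Both end points have a value of (almost) zero while the midpoint value is far above zero, so the function can not be convex.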

In [None]:
# Create the Grid of the Data

vG = np.linspace(-5, 5, 1_001) #<! Linear steps grid
mX1, mX2 = np.meshgrid(vG, vG)
mX = np.vstack([mX1.flatten(), mX2.flatten()])

In [None]:
# The Objective Function
# Both definitions share the name `ObjFunction`, the later one (Himmelblau) is the one in use.

def ObjFunction( vX: NDArray ) -> float:
    """
    Beale Function.
    Global Minimum at f(3, 0.5) = 0
    """

    valX1 = vX[0]
    valX2 = vX[1]

    valZ = (1.5 - valX1 + (valX1 * valX2)) ** 2 + (2.25 - valX1 + (valX1 * (valX2 ** 2))) ** 2 + (2.625 - valX1 + (valX1 * (valX2 ** 3))) ** 2

    return valZ

def ObjFunction( vX: NDArray ) -> float:
    """
    Himmelblau's Function.
    Global Minima at f(3, 2) = 0, f(-2.805118, 3.131312) = 0, f(-3.779310, -3.283186) = 0, f(3.584428, -1.848126) = 0
    """

    valX1 = vX[0]
    valX2 = vX[1]

    valZ = (valX1 ** 2 + valX2 - 11) ** 2 + (valX1 + valX2 ** 2 - 7) ** 2

    return valZ

In [None]:
# Vectorize the Function
hObjFunction = np.vectorize(ObjFunction, signature = '(2, n)->(n)')

In [None]:
# Plot the Function

vZ = hObjFunction(mX)
mZ = np.reshape(vZ, mX1.shape)

hF, hA = plt.subplots(figsize = (6, 6))
hA.grid(True, linestyle = '--', linewidth = .3)
hA.imshow(np.log1p(mZ), extent = [vG.min(), vG.max(), vG.min(), vG.max()], origin = 'lower', cmap = 'viridis', alpha = 0.95)
hA.contour(mX1, mX2, np.log1p(mZ), alpha = 0.75)
hA.set_xlabel('$x_1$')
hA.set_ylabel('$x_2$');

In [None]:
# Define the Gradient Function

hObjFunGrad = grad(ObjFunction)

In [None]:
# Create Optimization Path

numOptPath = 5
lOptPaths = []

mXInit = np.array([
    [0, 0],
    [-2.5, 2.5],
    [-2.5, -2.5],
    [2.5, -2.5],
    [2.5, 2.5],
])

numSteps = 5000
µ = 2e-5 #<! Step Size

for ii in range(numOptPath):
    # Allocate the Optimization Path
    mXk = np.zeros((numSteps + 1, 2))

    # Initialization
    mXk[0, :] = mXInit[ii, :]
    # mXk[0, :] = np.random.uniform(low = -3.0, high = 3.0, size = (2, ))

    # Gradient Descent
    for jj in range(numSteps):
        vGk = hObjFunGrad(mXk[jj, :]) #<! Current Gradient
        mXk[jj + 1, :] = mXk[jj, :] - µ * vGk #<! Gradient Step

    lOptPaths.append(mXk.copy())

In [None]:
# Plot the Optimization Path

hF, hA = plt.subplots(figsize = (6, 6))
hA.grid(True, linestyle = '--', linewidth = .3)
hA.imshow(np.log1p(mZ), extent = [vG.min(), vG.max(), vG.min(), vG.max()], origin = 'lower', cmap = 'viridis', alpha = 0.95)
hA.contour(mX1, mX2, np.log1p(mZ), alpha = 0.85, linewidths = 2)

# Clear axis labels and ticks
hA.set_xticks([])
hA.set_yticks([])

# Draw optimization path
for ii in range(numOptPath):
    mXk = lOptPaths[ii]
    hA.plot(mXk[:, 0], mXk[:, 1], color = lMatPltLibclr[ii % len(lMatPltLibclr)], linestyle = None, marker = 'o', markersize = 4, label = f'Path {ii + 1}')
hA.set_aspect('equal')
hA.legend(framealpha = 0.0)

# Save Figures for Slides
# hF.savefig('Figure.svg', transparent = True, bbox_inches = 'tight', pad_inches = 0.1)
# hF.savefig('Figure.png', transparent = True, bbox_inches = 'tight', pad_inches = 0.1)