[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - BackPropagation - Exercise

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.001 | 19/05/2024 | Royi Avital | Added code comments and typing                                     |
| 1.0.000 | 23/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0076DeepBackPropagation.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Courses Packages

from DataVisualization import PlotRegressionResults


In [None]:
# General Auxiliary Functions


## Back Propagation (BackPropagation)

[BackPropagation](https://en.wikipedia.org/wiki/Backpropagation) is the method which utilizes the [Chain Rule](https://en.wikipedia.org/wiki/Chain_rule) in order to calculate the gradient of a neural network.  
_BackPropagation_ is efficient under the assumption that the net is $f: \mathbb{R}^{d} \to \mathbb{R}^{c}$ where $c \ll d$.  

This notebook replicates the previous notebook with 4 differences:

 - The Application: _Regression_ instead of _Classification_.
 - The Data Set: Replacing the MNIST with [California Housing Dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).
 - The Loss Function: Replacing the CE + SoftMax with MSE.
 - The Activation Layer: Replacing the _ReLU_ with _LeakyReLU_.

* <font color='brown'>(**#**)</font> The objective is to create a simple NN which beats a vanilla linear regression model. The score is the ${R}^{2}$ score.
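
For reference, the ${R}^{2}$ score compares the squared error of the model with the squared error of a constant predictor (The mean of the targets). The following is a minimal NumPy sketch, assuming the standard definition ${R}^{2} = 1 - \frac{ {SS}_{res} }{ {SS}_{tot} }$. The function name `CalcR2Score` and the toy vectors are illustrative only and are not part of the notebook.

```python
import numpy as np

def CalcR2Score( vY: np.ndarray, vYHat: np.ndarray ) -> float:
    # R² = 1 - (Residual Sum of Squares) / (Total Sum of Squares)
    ssRes = np.sum(np.square(vY - vYHat))
    ssTot = np.sum(np.square(vY - np.mean(vY)))
    return 1.0 - (ssRes / ssTot)

vY     = np.array([1.0, 2.0, 3.0, 4.0]) #<! Toy targets (Hypothetical)
vYHat  = np.array([1.1, 1.9, 3.2, 3.8]) #<! Toy predictions (Hypothetical)
vYMean = np.full_like(vY, np.mean(vY))  #<! Constant (Mean) predictor

r2Model = CalcR2Score(vY, vYHat)  #<! Close to 1 for a good fit
r2Mean  = CalcR2Score(vY, vYMean) #<! The constant predictor scores 0
```

Under this definition a model which is worse than the constant predictor gets a negative score.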

In [None]:
# Parameters

# Data
numSamplesTrain = 15_000
numSamplesTest  = 5_640

# Model
α = 0.01 #<! LeakyReLU

# Training
numIter = 300
µ       = 0.35 #<! Step Size / Learning Rate

# Visualization
numImg = 3


## Generate / Load Data

This section loads the [California Housing Dataset](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html) using [`fetch_california_housing()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html).

The data is split into 15,000 train samples and 5,640 test samples.

In [None]:
# Load Data

mX, vY  = fetch_california_housing(return_X_y = True)

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')


In [None]:
# Pre Process Data

mX -= np.mean(mX, axis = 0)
mX /= np.std(mX, axis = 0)


* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.
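
As a hint for the question, the sketch below applies the same standardization to a hypothetical toy matrix whose 2 features have very different offsets and scales. It compares the condition number of $\boldsymbol{A}^{T} \boldsymbol{A}$ before and after the scaling. For a linear least squares model this number roughly governs how fast gradient descent converges for a well chosen step size. The extension to the NN of this notebook is an assumption, not a guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 features with very different offsets and scales
mA = rng.standard_normal((100, 2)) * np.array([1.0, 1_000.0]) + np.array([0.0, 5_000.0])

# Same steps as the pre processing cell: zero mean, unit standard deviation per feature
mAScaled = (mA - np.mean(mA, axis = 0)) / np.std(mA, axis = 0)

condRaw    = np.linalg.cond(mA.T @ mA)             #<! Very large: Badly conditioned
condScaled = np.linalg.cond(mAScaled.T @ mAScaled) #<! Close to 1: Well conditioned

print(f'Condition number (Raw)   : {condRaw:0.3e}')
print(f'Condition number (Scaled): {condScaled:0.3e}')
```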

### Plot the Data

In [None]:
# Plot the Data

dfData = pd.DataFrame(np.column_stack((mX, vY)))
dfData.columns = ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude', 'MedHouseVal[100K$]'] #<! https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

# Pair Plot
# sns.pairplot(data = dfData)

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

* <font color='brown'>(**#**)</font> Deep Learning is _big data_ oriented. This data set is small by comparison, hence all samples can easily be handled in a single _batch_.

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')


## Neural Network Building Blocks

This section implements a class per NN building block.  
Each class has 2 main methods:
1. `Forward()` - Pushes the input forward on the computational graph.
2. `Backward()` - Pushes the input gradient backward on the computational graph.  
   The _backward_ step must calculate the gradient with respect to each parameter (With reduction over the batch) and with respect to the input (Per sample).

* <font color='brown'>(**#**)</font> In practice each block supports the calculation over a _batch_.
* <font color='brown'>(**#**)</font> The implementation supports simple feed forward with no branching graph.
* <font color='brown'>(**#**)</font> The convention for the NumPy implementation is data as $d \times N$ where $d$ is the number of features and $N$ is the batch size.
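
The $d \times N$ convention can be illustrated with a short sketch. The dimensions and the random arrays are hypothetical. It shows the transpose needed when the data comes in the SciKit Learn layout ($N \times d$) and how the bias vector is broadcast over the batch.

```python
import numpy as np

numSamples, dimIn, dimOut = 5, 8, 3 #<! Toy dimensions (Hypothetical)

mXRows = np.random.rand(numSamples, dimIn) #<! SciKit Learn layout: (N, d)
mX     = mXRows.T                          #<! Notebook convention: (d, N)

mW = np.random.randn(dimOut, dimIn) #<! Shape: (dimOut, dimIn)
vB = np.random.randn(dimOut)        #<! Shape: (dimOut,)

# vB[:, None] has the shape (dimOut, 1), hence it is added to each column (Sample)
mZ = mW @ mX + vB[:, None] #<! Shape: (dimOut, N)

print(f'Input shape : {mX.shape}')
print(f'Output shape: {mZ.shape}')
```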

The model to implement is given by

![The Neural Network Computational Graph](https://i.imgur.com/9tx3oCz.png)

### Affine Layer

#### Parameters

$$ \boldsymbol{W} \in \mathbb{R}^{ {d}_{out} \times {d}_{in} }, \; \boldsymbol{b} \in \mathbb{R}^{{d}_{out}} $$

#### Forward

$$\boldsymbol{z}=\boldsymbol{W}\boldsymbol{x}+\boldsymbol{b}$$

#### Backward

$$\boxed{\nabla_{\boldsymbol{b}}L=\nabla_{\boldsymbol{z}}L}$$
  
$$\boxed{\nabla_{\boldsymbol{x}}L=\boldsymbol{W}^{T}\nabla_{\boldsymbol{z}}L}$$

$$\boxed{\nabla_{\boldsymbol{W}}L=\nabla_{\boldsymbol{z}}L\boldsymbol{x}^{T}}$$

* <font color='brown'>(**#**)</font> The above _Linear Layer_ is often called _Dense Layer_ or _Fully Connected_.
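
The backward formulas above can be sanity checked numerically. The sketch below is not part of the notebook's implementation: it assumes a toy scalar loss $L = \sum_{ij} {G}_{ij} {Z}_{ij}$ with a fixed random matrix `mG`, so that ${\nabla}_{\boldsymbol{z}} L$ is exactly `mG`. It compares the analytic ${\nabla}_{\boldsymbol{W}} L$ (Summed over the batch) with central finite differences. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dimIn, dimOut, numSamples = 4, 3, 5 #<! Toy dimensions (Hypothetical)

mW = rng.standard_normal((dimOut, dimIn))
vB = rng.standard_normal(dimOut)
mX = rng.standard_normal((dimIn, numSamples))
mG = rng.standard_normal((dimOut, numSamples)) #<! Plays the role of the gradient w.r.t. z

def CalcLoss( mW: np.ndarray, vB: np.ndarray, mX: np.ndarray ) -> float:
    # Toy scalar loss which is linear in z, hence its gradient w.r.t. z is mG
    mZ = mW @ mX + vB[:, None]
    return np.sum(mG * mZ)

# Analytic gradients (Summed over the batch)
mDw = mG @ mX.T            #<! Shape: (dimOut, dimIn)
vDb = np.sum(mG, axis = 1) #<! Shape: (dimOut,)
mDx = mW.T @ mG            #<! Shape: (dimIn, N)

# Central finite differences w.r.t. each element of W
valEps = 1e-6
mDwNum = np.zeros_like(mW)
for ii in range(dimOut):
    for jj in range(dimIn):
        mE         = np.zeros_like(mW)
        mE[ii, jj] = valEps
        mDwNum[ii, jj] = (CalcLoss(mW + mE, vB, mX) - CalcLoss(mW - mE, vB, mX)) / (2 * valEps)

maxErr = np.max(np.abs(mDw - mDwNum))
print(f'Maximum absolute error: {maxErr:0.3e}')
```

The same loop can be repeated over the elements of `vB` and `mX` to check the other 2 formulas.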

In [None]:
# Linear Layer

class LinearLayer():
    def __init__( self, dimIn: int, dimOut: int ) -> None:
        
        # Initialization
        mW = np.random.randn(dimOut, dimIn) / dimIn
        vB = np.zeros(dimOut)
        
        # Parameters
        self.mX      = None #<! Required for the backward pass
        self.dParams = {'mW': mW,   'vB': vB}
        self.dGrads  = {'mW': None, 'vB': None}
        
    def Forward( self, mX: np.ndarray ) -> np.ndarray:
        self.mX = mX #<! Required for the backward pass
        
        mW      = self.dParams['mW'] #<! Shape: (dimOut, dimIn)
        vB      = self.dParams['vB'] 
        mZ      = mW @ mX + vB[:, None]
        
        return mZ
    
    def Backward( self: Self, mDz: np.ndarray ) -> np.ndarray:
        # Supports a batch of inputs by summing the gradients over each input.
        # Summing instead of averaging to support the case the loss is scaled by N.
        mW  = self.dParams['mW']
        mX  = self.mX
        
        vDb = np.sum(mDz, axis = 1) #<! Explicit Sum
        mDw = mDz @ mX.T #<! Implicit Sum
        mDx = mW.T @ mDz
        
        self.dGrads['vB'] = vDb
        self.dGrads['mW'] = mDw
                
        return mDx

* <font color='blue'>(**!**)</font> Fill the shapes of the arrays in the code (As comments).
* <font color='red'>(**?**)</font> Why can't `self.mX` be initialized with concrete dimensions at initialization? Think about batches.

### Leaky ReLU (`LeakyReLU`) Layer

#### Parameters

None.

#### Forward

$$ \boldsymbol{z} = \text{LeakyReLU} \left( \boldsymbol{x} \right) = \begin{cases} x & x \geq 0 \\ \alpha x & x < 0 \end{cases}, \; \alpha \ll 1 $$

#### Backward

$$\boxed{ {\nabla} _{\boldsymbol{x}} L = \text{ {\color{red}???} } }$$

* <font color='brown'>(**#**)</font> For element wise vector functions the Jacobian is a diagonal matrix built from the element wise derivative. Hence the gradient of a composition is this diagonal matrix multiplying the input gradient, which is equivalent to an element wise product.
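
The remark above can be illustrated without giving away the exercise. The sketch below deliberately uses a different element wise function, $\tanh \left( \cdot \right)$, and hypothetical random vectors. It shows that multiplying by the diagonal Jacobian is the same as an element wise product.

```python
import numpy as np

rng = np.random.default_rng(0)
vX  = rng.standard_normal(5) #<! Input of the element wise function
vDz = rng.standard_normal(5) #<! Incoming gradient (Gradient w.r.t. the output)

vDf = 1.0 - np.square(np.tanh(vX)) #<! Element wise derivative of tanh()

vDxDiag = np.diag(vDf) @ vDz #<! Explicit diagonal Jacobian times the incoming gradient
vDxElem = vDf * vDz          #<! Equivalent element wise (Hadamard) product

print(np.allclose(vDxDiag, vDxElem))
```

The element wise form avoids building a $d \times d$ matrix, which matters once the input is a batch.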

In [None]:
# LeakyReLU Layer

class LeakyReLULayer():
    def __init__( self, α: float = 0.01 ) -> None:
        
        #===========================Fill This===========================#
        ?????
        #===============================================================#
    
    def Forward( self: Self, mX: np.ndarray ) -> np.ndarray:

        #===========================Fill This===========================#
        ?????
        #===============================================================#
        
        return mZ
    
    def Backward( self: Self, mDz: np.ndarray ) -> np.ndarray:
        
        #===========================Fill This===========================#
        ?????
        #===============================================================#
                
        return mDx

* <font color='blue'>(**!**)</font> Fill the shapes of the arrays in the code (As comments).

### MSE Function

The [_Mean Squared Error_](https://en.wikipedia.org/wiki/Mean_squared_error) (MSE):


$$ \ell\left( \boldsymbol{y}_{i}, \hat{\boldsymbol{y}}_{i} \right) = \frac{1}{2} {\left\| \hat{\boldsymbol{y}}_{i} - \boldsymbol{y}_{i} \right\|}_{2}^{2} $$

The Gradient

$$ \boxed{ {\nabla}_{\hat{\boldsymbol{y}}} \ell = \text{ {\color{red} ???} } } $$

The loss over a batch

$$L=\frac{1}{N}\sum_{i=1}^{N}\ell\left(\boldsymbol{y}_{i},\hat{\boldsymbol{y}}_{i}\right)$$
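
Once the gradient is derived, it can be validated against a numerical approximation. The helper below is a hypothetical utility, not part of the notebook. It approximates the gradient of any scalar function of a 1D array by central finite differences. It is verified here on a function with a known gradient rather than on the MSE itself.

```python
import numpy as np
from typing import Callable

def CalcNumericGrad( hF: Callable[[np.ndarray], float], vX: np.ndarray, valEps: float = 1e-6 ) -> np.ndarray:
    # Central finite differences, one coordinate at a time (Assumes a 1D array)
    vG = np.zeros_like(vX, dtype = np.float64)
    for ii in range(vX.size):
        vE     = np.zeros_like(vX, dtype = np.float64)
        vE[ii] = valEps
        vG[ii] = (hF(vX + vE) - hF(vX - vE)) / (2 * valEps)
    return vG

# Sanity check on a function with a known gradient: f(x) = sum(sin(x)), its gradient is cos(x)
vX    = np.linspace(-1.0, 1.0, 5)
vGNum = CalcNumericGrad(lambda vV: np.sum(np.sin(vV)), vX)
vGRef = np.cos(vX)

print(f'Maximum absolute error: {np.max(np.abs(vGNum - vGRef)):0.3e}')
```

With a completed `MseLoss()`, a call such as `CalcNumericGrad(lambda vZ: MseLoss(vY, vZ)[0], vZ)` should match the returned gradient, including the $\frac{1}{N}$ factor of the batch loss.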




In [None]:
# MSE Loss

def MseLoss( vY: np.ndarray, vZ: np.ndarray ) -> Tuple[np.float64, np.ndarray]:
    '''
    Returns both the loss and the gradient w.r.t the input (vZ).
    The function uses the mean loss (Normalized by N). 
    Hence gradients calculation should sum the gradients over the batch.
    '''
    
    #===========================Fill This===========================#
    ?????
    #===============================================================#
    
    return valLoss, vDz

### Model Class

The model class should be composable to allow an arbitrary _Feed Forward_ model.

In [None]:
# NN Model
class ModelNN():
    def __init__( self, lLayers: List ) -> None:
        
        self.lLayers = lLayers
        
    def Forward( self: Self, mX: np.ndarray ) -> np.ndarray:
        
        for oLayer in self.lLayers:
            mX = oLayer.Forward(mX)
        return mX
    
    def Backward( self: Self, mDz: np.ndarray ) -> None:
        
        for oLayer in reversed(self.lLayers):
            mDz = oLayer.Backward(mDz)

## Model Training

The model training (Optimization) is by a vanilla _Gradient Descent_.  
Since the model is small and the data is relatively small, the batch size is the whole training set.

* <font color='brown'>(**#**)</font> Larger model / data set might require using _Stochastic Gradient Descent_.  
  In this case the actual gradient of the loss function over the whole data is _approximated_ by the gradient calculated over a sub sample (Batch).
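
The approximation can be sketched on a toy problem. The example assumes a linear least squares loss, $\frac{1}{2 N} {\left\| \boldsymbol{X} \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2}$, instead of the NN of this notebook, and all names and sizes are hypothetical. Each batch gradient is a noisy estimate of the full gradient. For equal size batches which partition the data, their average equals the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
numSamples, dimIn, batchSize = 1_000, 3, 100 #<! Toy sizes (Hypothetical)

mX = rng.standard_normal((numSamples, dimIn)) #<! Layout: (N, d)
vY = rng.standard_normal(numSamples)
vW = rng.standard_normal(dimIn) #<! Current parameters

def CalcGrad( mX: np.ndarray, vY: np.ndarray, vW: np.ndarray ) -> np.ndarray:
    # Gradient of (1 / 2N) * ||X w - y||² w.r.t. w
    return mX.T @ (mX @ vW - vY) / mX.shape[0]

vGFull = CalcGrad(mX, vY, vW) #<! Gradient over the whole data

# Shuffle once, then split the indices into equal size batches
vIdx     = rng.permutation(numSamples)
lBatches = np.split(vIdx, numSamples // batchSize)
mGBatch  = np.stack([CalcGrad(mX[vBatchIdx], vY[vBatchIdx], vW) for vBatchIdx in lBatches])

vGMean = np.mean(mGBatch, axis = 0) #<! Average of the batch gradients

print(f'Error of a single batch   : {np.linalg.norm(mGBatch[0] - vGFull):0.3e}')
print(f'Error of the batch average: {np.linalg.norm(vGMean - vGFull):0.3e}')
```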

### Training Function

In [None]:
# Training Model Function
def TrainModel( oModel: ModelNN, mX: np.ndarray, vY: np.ndarray, numIter: int, learningRate: float ) -> None:
    
    # Display Results
    hF, hA = plt.subplots(figsize = (12, 6))

    vLoss = np.full(numIter, np.nan)
    for ii in range(numIter):
        # Forward Pass
        mZ        = oModel.Forward(mX)
        # Loss
        valLoss, mDz = MseLoss(vY, mZ)
        vLoss[ii]    = valLoss

        # Backward Pass
        oModel.Backward(mDz)

        # Gradient Descent (Update parameters)
        for oLayer in oModel.lLayers:
            for sParam in oLayer.dGrads:
                oLayer.dParams[sParam] -= learningRate * oLayer.dGrads[sParam]

        # Display Results
        hA.cla()
        hA.set_title(f'Iteration: {(ii + 1): 04d} / {numIter}, Loss = {valLoss: 0.2f}')
        hA.plot(vLoss, 'b', marker = '.', ms = 5)
        hA.set_xlabel('Iteration Index')
        hA.set_ylabel('Loss Value')
        hA.grid()

        plt.pause(1e-20)
        display(hF, clear = True) #<! "In Place"

## Model Performance

This section analyzes the model performance on the train and test data.

In [None]:
# Define the Model

oModel = ModelNN([
    LinearLayer(8,   200), LeakyReLULayer(α),
    LinearLayer(200, 250), LeakyReLULayer(α),
    LinearLayer(250, 1),
])

numIter = 600
µ = 7.5e-6
TrainModel(oModel, mXTrain.T, vYTrain, numIter, µ) #<! Works in place on the model

In [None]:
# Base Line Results

oLinReg     = LinearRegression()
oLinReg     = oLinReg.fit(mXTrain, vYTrain)
vYHatTrain  = oLinReg.predict(mXTrain)
vYHatTest   = oLinReg.predict(mXTest)
print(f'Linear Regression MSE (Train) = {np.mean(np.square(vYHatTrain - vYTrain))}')
print(f'Linear Regression R²  (Train) = {r2_score(vYTrain, vYHatTrain)}')
print(f'Linear Regression MSE (Test) = {np.mean(np.square(vYHatTest - vYTest))}')
print(f'Linear Regression R²  (Test) = {r2_score(vYTest, vYHatTest)}')


In [None]:
# Apply Model on Data

vYHatTrain = np.squeeze(oModel.Forward(mXTrain.T))
vYHatTest  = np.squeeze(oModel.Forward(mXTest.T))

In [None]:
# Results Analysis

# Plot Regression Results

hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6))

hA = PlotRegressionResults(vYTrain, vYHatTrain, hA = vHa[0])
hA.set_title(f'Train Data, R2 = {r2_score(vYTrain, vYHatTrain.flat): 0.2f}')

hA = PlotRegressionResults(vYTest, vYHatTest, hA = vHa[1])
hA.set_title(f'Test Data, R2 = {r2_score(vYTest, vYHatTest.flat): 0.2f}')

* <font color='blue'>(**!**)</font> Tune hyper parameters (Number of iterations, learning rate, $\alpha$, Model) to beat the baseline model.