[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io/)

# AI Program

## Machine Learning - Deep Learning - BackPropagation

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.001 | 19/05/2024 | Royi Avital | Added code comments and typing                                     |
| 1.0.000 | 22/04/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0075DeepBackPropagation.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Miscellaneous
import math
import os
from platform import python_version
import random
import timeit

# Typing
from typing import Callable, Dict, List, Optional, Self, Set, Tuple, Union

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image
from IPython.display import display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout, SelectionSlider

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())


In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Courses Packages

from DataVisualization import PlotConfusionMatrix, PlotLabelsHistogram, PlotMnistImages


In [None]:
# General Auxiliary Functions


## Back Propagation (BackPropagation)

[BackPropagation](https://en.wikipedia.org/wiki/Backpropagation) is a method which utilizes the [Chain Rule](https://en.wikipedia.org/wiki/Chain_rule) in order to calculate the gradient of a neural network.  
_BackPropagation_ is efficient under the assumption that the net is $f: \mathbb{R}^{d} \to \mathbb{R}^{c}$ where $c \ll d$.  


* <font color='brown'>(**#**)</font> The assumption holds as the gradients of the net are calculated with regard to the loss function which has a scalar output.
* <font color='brown'>(**#**)</font> _BackPropagation_ is also called _Reverse Mode Differentiation_.  
  There is also a _Forward Mode Differentiation_ which is more efficient for the case $c \gg d$.  
  The forward mode is useful in physical models where the functions are vector valued.
* <font color='brown'>(**#**)</font> The optimal calculation of the gradient of a composition of functions is equivalent to the [Matrix Chain Ordering Problem](https://en.wikipedia.org/wiki/Matrix_chain_multiplication).  
  It might require a _Mixed Mode Differentiation_.
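The link between the differentiation mode and the order of the matrix products can be illustrated by a minimal sketch. It assumes a hypothetical chain of 3 functions whose Jacobians are replaced by random matrices (`mJ1`, `mJ2`, `mJ3` and `MatMulCost()` are illustrative names, not part of the notebook). With a scalar output, multiplying from the output side (Reverse mode) requires far fewer multiplications than multiplying from the input side (Forward mode), while both yield the same Jacobian.

```python
import numpy as np

# Hypothetical chain f = f3(f2(f1(x))) with random stand-ins for the Jacobians
d, h, c = 500, 300, 1 #<! Input dimension, hidden dimension, output dimension (Scalar loss)
oRng = np.random.default_rng(0)
mJ1  = oRng.standard_normal((h, d)) #<! Jacobian of f1: R^d -> R^h
mJ2  = oRng.standard_normal((h, h)) #<! Jacobian of f2: R^h -> R^h
mJ3  = oRng.standard_normal((c, h)) #<! Jacobian of f3: R^h -> R^c

def MatMulCost( numRows: int, numInner: int, numCols: int ) -> int:
    # Number of scalar multiplications of (numRows x numInner) @ (numInner x numCols)
    return numRows * numInner * numCols

# Reverse mode: Start from the output side, (J3 @ J2) @ J1
mJRev   = (mJ3 @ mJ2) @ mJ1
costRev = MatMulCost(c, h, h) + MatMulCost(c, h, d)

# Forward mode: Start from the input side, J3 @ (J2 @ J1)
mJFwd   = mJ3 @ (mJ2 @ mJ1)
costFwd = MatMulCost(h, h, d) + MatMulCost(c, h, d)

print(f'Reverse mode multiplications: {costRev}')
print(f'Forward mode multiplications: {costFwd}')
print(f'The Jacobians match: {np.allclose(mJRev, mJFwd)}')
```

Swapping the roles of `d` and `c` (Few inputs, many outputs) flips the conclusion in favor of the forward mode.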


This notebook demonstrates creating _Deep Learning_ atoms with built in support for _BackPropagation_.  
Using the _atoms_, a computational graph is built and processed both _forward_ and _backward_.  
Those atoms allow building a composable and scalable model.

* <font color='brown'>(**#**)</font> The model is simplified by supporting _Feed Forward Networks_ only. In practice, more complex computational graphs are supported by Deep Learning frameworks.

In [None]:
# Parameters

# Data
numSamplesTrain = 60_000
numSamplesTest  = 10_000

# Model
hidLayerDim = 200

# Training
numIter = 300
µ       = 0.35 #<! Step Size \ Learning Rate

# Visualization
numImg = 3


## Generate / Load Data

This section loads the [MNIST Data set](https://en.wikipedia.org/wiki/MNIST_database) using [`fetch_openml()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html).

The data is split into 60,000 train samples and 10,000 test samples.

In [None]:
# Load Data

mX, vY = fetch_openml('mnist_784', version = 1, return_X_y = True, as_frame = False, parser = 'auto')
vY = vY.astype(np.int_) #<! The labels are strings, convert to integer

print(f'The features data shape: {mX.shape}')
print(f'The labels data shape: {vY.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')


In [None]:
# Pre Process Data

mX = mX / 255.0


* <font color='red'>(**?**)</font> Does the scaling affect the training phase? Think about the _Learning Rate_.

### Plot the Data

In [None]:
# Plot the Data

hF = PlotMnistImages(mX, vY, numImg)

In [None]:
# Histogram of Labels

hA = PlotLabelsHistogram(vY)
plt.show()

### Train & Test Split

The data is split into _Train_ and _Test_ data sets.  

* <font color='brown'>(**#**)</font> Deep Learning is _big data_ oriented, hence it can easily handle all samples in a single _batch_.

In [None]:
# Train Test Split

numClass = len(np.unique(vY))
mXTrain, mXTest, vYTrain, vYTest = train_test_split(mX, vY, test_size = numSamplesTest, train_size = numSamplesTrain, shuffle = True, stratify = vY)

print(f'The training features data shape: {mXTrain.shape}')
print(f'The training labels data shape: {vYTrain.shape}')
print(f'The test features data shape: {mXTest.shape}')
print(f'The test labels data shape: {vYTest.shape}')
print(f'The unique values of the labels: {np.unique(vY)}')

## Neural Network Building Blocks

This section implements a class per NN building block.  
Each class has 2 main methods:
1. `Forward()` - Pushes the input forward on the computational graph.
2. `Backward()` - Pushes the input gradient backward on the computational graph.  
   The _backward_ step must calculate the gradient with respect to each parameter (With reduction over the batch) and per input.

* <font color='brown'>(**#**)</font> In practice each block supports the calculation over a _batch_.
* <font color='brown'>(**#**)</font> The implementation supports simple feed forward with no branching graph.
* <font color='brown'>(**#**)</font> The convention for the NumPy implementation is data as $d \times N$ where $d$ is the number of features and $N$ is the batch size.

The model to implement is given by

![The Neural Network Computational Graph](https://i.imgur.com/SsZfWqz.png)

The `CE` block stands for _SoftMax + Cross Entropy Loss_.

### Affine Layer

#### Parameters

$$ \boldsymbol{W} \in \mathbb{R}^{ {d}_{out} \times {d}_{in} }, \; \boldsymbol{b} \in \mathbb{R}^{{d}_{out}} $$

#### Forward

$$\boldsymbol{z}=\boldsymbol{W}\boldsymbol{x}+\boldsymbol{b}$$

#### Backward

$$\boxed{\nabla_{\boldsymbol{b}}L=\nabla_{\boldsymbol{z}}L}$$
  
$$\boxed{\nabla_{\boldsymbol{x}}L=\boldsymbol{W}^{T}\nabla_{\boldsymbol{z}}L}$$

$$\boxed{\nabla_{\boldsymbol{W}}L=\nabla_{\boldsymbol{z}}L\boldsymbol{x}^{T}}$$

* <font color='brown'>(**#**)</font> The above _Linear Layer_ is often called _Dense Layer_ or _Fully Connected_.

In [None]:
# Linear Layer

class LinearLayer():
    def __init__( self, dimIn: int, dimOut: int ) -> None:
        
        # Initialization
        mW = np.random.randn(dimOut, dimIn) / dimIn
        vB = np.zeros(dimOut)
        
        # Parameters
        self.mX      = None #<! Required for the backward pass
        self.dParams = {'mW': mW,   'vB': vB}
        self.dGrads  = {'mW': None, 'vB': None}
        
    def Forward( self: Self, mX: np.ndarray ) -> np.ndarray:
        self.mX = mX #<! Required for the backward pass
        
        mW      = self.dParams['mW'] #<! Shape: (dimOut, dimIn)
        vB      = self.dParams['vB'] 
        mZ      = mW @ mX + vB[:, None]
        
        return mZ
    
    def Backward( self: Self, mDz: np.ndarray ) -> np.ndarray:
        # Supports a batch of inputs by summing the gradients over the inputs.
        # Summing instead of averaging as the loss is assumed to be already scaled by 1 / N.
        mW  = self.dParams['mW']
        mX  = self.mX
        
        vDb = np.sum(mDz, axis = 1) #<! Explicit Sum
        mDw = mDz @ mX.T #<! Implicit Sum
        mDx = mW.T @ mDz #<! Each column on its own
        
        self.dGrads['vB'] = vDb
        self.dGrads['mW'] = mDw
                
        return mDx

* <font color='blue'>(**!**)</font> Fill the shapes of the arrays in the code (As comments).
* <font color='red'>(**?**)</font> Why can't `self.mX` be initialized with concrete dimensions at initialization? Think about batches.
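A common way to validate the backward formulas is a numerical gradient check. The following is a minimal sketch which does not use the `LinearLayer` class: it assumes a hypothetical scalar loss `ToyLoss()` (Chosen so that $\nabla_{\boldsymbol{z}} L$ is a known matrix `mG`) and a helper `NumericGrad()` based on central finite differences. Both names are illustrative.

```python
import numpy as np

oRng = np.random.default_rng(0)
dimIn, dimOut, numSamples = 4, 3, 5

mW = oRng.standard_normal((dimOut, dimIn))
vB = oRng.standard_normal(dimOut)
mX = oRng.standard_normal((dimIn, numSamples))  #<! Data as d x N
mG = oRng.standard_normal((dimOut, numSamples)) #<! Fixed weights of the toy loss

def ToyLoss( ) -> float:
    # Hypothetical scalar loss: L = sum(G * Z), hence dL / dZ = G
    mZ = mW @ mX + vB[:, None]
    return np.sum(mG * mZ)

def NumericGrad( hF, mA: np.ndarray, eps: float = 1e-6 ) -> np.ndarray:
    # Central finite differences, perturbs `mA` in place one element at a time
    mGrad = np.zeros_like(mA)
    for tuIdx in np.ndindex(mA.shape):
        origVal      = mA[tuIdx]
        mA[tuIdx]    = origVal + eps
        valPlus      = hF()
        mA[tuIdx]    = origVal - eps
        valMinus     = hF()
        mA[tuIdx]    = origVal #<! Restore the value
        mGrad[tuIdx] = (valPlus - valMinus) / (2 * eps)
    return mGrad

# Analytic gradients (The boxed formulas, summed over the batch)
mDz = mG
vDb = np.sum(mDz, axis = 1)
mDw = mDz @ mX.T
mDx = mW.T @ mDz

# Numeric gradients
vDbNum = NumericGrad(ToyLoss, vB)
mDwNum = NumericGrad(ToyLoss, mW)
mDxNum = NumericGrad(ToyLoss, mX)

print(f'Gradient of b matches: {np.allclose(vDb, vDbNum, atol = 1e-5)}')
print(f'Gradient of W matches: {np.allclose(mDw, mDwNum, atol = 1e-5)}')
print(f'Gradient of x matches: {np.allclose(mDx, mDxNum, atol = 1e-5)}')
```

The same check can be applied to any layer by comparing the output of its `Backward()` method to the numeric gradient.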

### ReLU Layer

#### Parameters

None.

#### Forward

$$\boldsymbol{z}=\text{ReLU}\left(\boldsymbol{x}\right)=\max\left\{ \boldsymbol{x},0\right\} $$

#### Backward

$$\boxed{\nabla_{\boldsymbol{x}}L=\text{Diag}\left(\mathbb{I}_{\boldsymbol{x}>0}\right)\nabla_{\boldsymbol{z}}L = \mathbb{I}_{\boldsymbol{x}>0} \odot \nabla_{\boldsymbol{z}}L}$$

In [None]:
# ReLU Layer

class ReLULayer():
    def __init__( self ) -> None:
        
        self.mX = None #<! Required for the backward pass
        self.dGrads = {}
    
    def Forward( self: Self, mX: np.ndarray ) -> np.ndarray:
        self.mX = mX                 #<! Store for Backward
        mZ      = np.maximum(mX, 0)
        
        return mZ
    
    def Backward( self: Self, mDz: np.ndarray ) -> np.ndarray:
        mX    = self.mX
        # mMask = (mX > 0)
        # mDx   = mDz * mMask

        mDx = np.where(mX > 0.0, mDz, 0.0)
                
        return mDx

* <font color='blue'>(**!**)</font> Fill the shapes of the arrays in the code (As comments).
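The equivalence between the diagonal matrix form and the element wise form of the ReLU gradient can be verified on a small vector. This is a minimal sketch with arbitrary illustrative values. It also compares to the `np.where()` form used in the `Backward()` method above.

```python
import numpy as np

vX  = np.array([-1.5, 0.0, 2.0, 0.3])  #<! Input to the ReLU
vDz = np.array([0.7, -0.2, 1.1, -0.4]) #<! Upstream gradient

vMask = (vX > 0).astype(vX.dtype) #<! Indicator of the active units

vDxDiag  = np.diag(vMask) @ vDz          #<! Diagonal Jacobian form
vDxMask  = vMask * vDz                   #<! Element wise (Hadamard) form
vDxWhere = np.where(vX > 0.0, vDz, 0.0)  #<! The form used in `ReLULayer`

print(f'Diagonal form matches the mask form: {np.allclose(vDxDiag, vDxMask)}')
print(f'Mask form matches the `where` form : {np.allclose(vDxMask, vDxWhere)}')
```

The element wise forms avoid building a $d \times d$ matrix, which matters once the input is a batch of large vectors.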

### Cross Entropy + SoftMax Loss Function

Due to numerical and computational benefits the _SoftMax_ layer is merged with the Cross Entropy Loss.  
This is done as the _SoftMax_ layer applies the $\exp \left( \cdot \right)$ function while _Cross Entropy_ applies $\log \left( \cdot \right)$.


$$\ell\left(\boldsymbol{y}_{i},\hat{\boldsymbol{y}}_{i}\right)=-\boldsymbol{y}_{i}^{T}\log\left(\hat{\boldsymbol{y}}_{i}\right)$$
where
$$\hat{\boldsymbol{y}}=\text{softmax}\left(\boldsymbol{z}\right)=\frac{\exp\left(\boldsymbol{z}\right)}{\boldsymbol{1}^{T}\exp\left(\boldsymbol{z}\right)}$$

Gradient:  
$$\boxed{\nabla_{\boldsymbol{z}}\ell=\hat{\boldsymbol{y}}_{i}-\boldsymbol{y}_{i}}$$

Loss over a batch

$$L=\frac{1}{N}\sum_{i=1}^{N}\ell\left(\boldsymbol{y}_{i},\hat{\boldsymbol{y}}_{i}\right)$$

* <font color='brown'>(**#**)</font> Since the loss function is the end point of the graph which ends the forward pass and starts the backward pass, both can be calculated at once.
* <font color='brown'>(**#**)</font> The above matches the [`CrossEntropyLoss` of PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html).
* <font color='red'>(**?**)</font> Assume the calculation was not merged. What if, due to finite numeric accuracy, the target index in the estimated probabilities after the _SoftMax_ is zeroed?  
  You may read at [`BCELoss` in PyTorch](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html).
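The numerical benefit of merging the _SoftMax_ with the $\log \left( \cdot \right)$ can be seen with extreme logits. The following is a minimal sketch using `softmax()` and `log_softmax()` of `scipy.special`. The values of `vZ` and `clsIdx` are arbitrary and chosen to trigger an underflow.

```python
import numpy as np
from scipy.special import log_softmax, softmax

vZ     = np.array([1000.0, 0.0, -1000.0]) #<! Extreme logits
clsIdx = 2                                #<! Index of the correct class

# Separate steps: SoftMax and then log
vYHat = softmax(vZ) #<! The probability of the correct class underflows to 0
with np.errstate(divide = 'ignore'):
    lossSeparate = -np.log(vYHat[clsIdx])

# Merged: log(softmax(z)) is calculated directly in the log domain
lossMerged = -log_softmax(vZ)[clsIdx]

print(f'Loss of the separate calculation: {lossSeparate}')
print(f'Loss of the merged calculation  : {lossMerged}')
```

The separate calculation yields an infinite loss while the merged one stays finite.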



#### Gradient Derivation

Since $\hat{\boldsymbol{y}} = S \left( \boldsymbol{z} \right)$ where $S \left( \cdot \right)$ is the _SoftMax_ function then:

$$\ell\left(\boldsymbol{y},\hat{\boldsymbol{y}}\right)=-\boldsymbol{y}^{T}\log\left(\hat{\boldsymbol{y}}\right) = - \log \left(\hat{y}_{c}\right)$$

Where $\hat{y}_{c}$ is the estimated probability of the correct class.

By the chain rule:

$$ \frac{\partial \ell}{\partial \boldsymbol{z}} = \frac{\partial \ell}{\partial \hat{y}_{c}} \frac{\partial \hat{y}_{c}}{\partial \boldsymbol{z}}$$

By the derivative of the $\log$: $\frac{\partial \ell}{\partial \hat{y}_{c}} = \frac{-1}{\hat{y}_{c}}$

Then for $\frac{\partial \hat{y}_{c}}{\partial {z}_{i}}$ there are 2 cases. The case $i = c$:

$$
\begin{align*}
    \frac{\partial \hat{y}_{c}}{\partial {z}_{i}} &= \frac{\partial}{\partial {z}_{i}} \frac{e^{{z}_{c}}}{\sum_{j}e^{{z}_{j}}} \\
    &= \frac{e^{{z}_{c}}\sum_{j}e^{{z}_{j}} - e^{{z}_{c}}e^{{z}_{c}}}{(\sum_{j}e^{{z}_{j}})^{2}} \\
    &= \frac{e^{{z}_{c}}}{\sum_{j}e^{{z}_{j}}}\frac{\sum_{j}e^{{z}_{j}} - e^{{z}_{c}}}{\sum_{j}e^{{z}_{j}}} \\
    &= \hat{y}_{c}(1 - \hat{y}_{c})
\end{align*}
$$

The case $i \neq c$:

$$
\begin{align*}
    \frac{\partial \hat{y}_{c}}{\partial {z}_{i}} &= \frac{\partial}{\partial {z}_{i}} \frac{e^{{z}_{c}}}{\sum_{j}e^{{z}_{j}}} \\
    &= \frac{-e^{{z}_{i}}e^{{z}_{c}}}{(\sum_{j}e^{{z}_{j}})^{2}} \\
    &= -\hat{y}_{i} \hat{y}_{c}
\end{align*}
$$

Which yields:

$$
\begin{align*}
    \frac{\partial \ell}{\partial \mathbf{z}} &= \frac{\partial \ell}{\partial \hat{y}_{c}}\frac{\partial \hat{y}_{c}}{\partial \mathbf{z}} \\
    &= \frac{-1}{\hat{y}_{c}}
    \begin{bmatrix} -\hat{y}_{1}\hat{y}_{c} & -\hat{y}_{2}\hat{y}_{c} & ... & \hat{y}_{c}(1 - \hat{y}_{c}) & ... & -\hat{y}_{k}\hat{y}_{c} \end{bmatrix}^{T} \\
    &= \begin{bmatrix} \hat{y}_{1} & \hat{y}_{2} & ... & (\hat{y}_{c} - 1) & ... & \hat{y}_{k} \end{bmatrix}^{T} \\
    & = \hat{\boldsymbol{y}} - \boldsymbol{y}
\end{align*}
$$
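The result $\hat{\boldsymbol{y}} - \boldsymbol{y}$ can be verified numerically. The following is a minimal sketch which compares the analytic gradient to central finite differences on random logits. The names `CeLoss()`, `clsIdx` and `eps` are illustrative.

```python
import numpy as np
from scipy.special import log_softmax, softmax

oRng     = np.random.default_rng(0)
numClass = 5
clsIdx   = 3 #<! Index of the correct class

vZ         = oRng.standard_normal(numClass) #<! Logits
vY         = np.zeros(numClass)
vY[clsIdx] = 1.0 #<! One hot encoding of the label

def CeLoss( vZ: np.ndarray ) -> float:
    # Cross Entropy of the SoftMax output for the correct class
    return -log_softmax(vZ)[clsIdx]

# Analytic gradient from the derivation
vGradAnalytic = softmax(vZ) - vY

# Numeric gradient by central finite differences
eps          = 1e-6
vGradNumeric = np.zeros(numClass)
for ii in range(numClass):
    vE     = np.zeros(numClass)
    vE[ii] = eps
    vGradNumeric[ii] = (CeLoss(vZ + vE) - CeLoss(vZ - vE)) / (2 * eps)

print(f'Analytic gradient: {vGradAnalytic}')
print(f'Numeric gradient : {vGradNumeric}')
```

Since the SoftMax output sums to 1, the elements of the gradient sum to 0.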

### Cross Entropy Loss vs. MSE for Probabilistic Predictions

The Logistic Regression is based on the [Cross Entropy Loss](https://en.wikipedia.org/wiki/Cross-entropy) which measures the dissimilarity between distributions.  
In the context of classification it measures the distance between 2 _discrete_ distributions.

Consider the true probabilities and 2 estimations for data with 6 categories:

$$ \boldsymbol{y} = {\left[ 0, 1, 0, 0, 0, 0 \right]}^{T}, \; \hat{\boldsymbol{y}}_{1} = {\left[ 0.16, 0.2, 0.16, 0.16, 0.16, 0.16 \right]}^{T}, \; \hat{\boldsymbol{y}}_{2} = {\left[ 0.5, 0.4, 0.1, 0.0, 0.0, 0.0 \right]}^{T} $$

One could use the [Mean Squared Error](https://en.wikipedia.org/wiki/Mean_squared_error) to measure the distance between the vectors (Called [Brier Score](https://en.wikipedia.org/wiki/Brier_score) in this context) as an alternative to the CE which will yield:

$$ MSE \left( \boldsymbol{y}, \hat{\boldsymbol{y}}_{1} \right) = 0.128, \; MSE \left( \boldsymbol{y}, \hat{\boldsymbol{y}}_{2} \right) = 0.103 $$

Yet, in $\hat{\boldsymbol{y}}_{2}$ which has a lower error the most probable class is not the correct one while in $\hat{\boldsymbol{y}}_{1}$ it is.  
The CE in contrast only "cares" about the error in the index of the _correct_ class and minimizes that.  
Another advantage of the CE is being the [_Maximum Likelihood Estimator_](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) which ensures some useful properties.

Yet there are some empirical advantages to the MSE loss in this context.  
Some analysis is presented by [Evaluation of Neural Architectures Trained with Square Loss vs Cross Entropy in Classification Tasks](https://arxiv.org/abs/2006.07322).  
Hence the MSE is a legitimate choice as well.

See:

 * [Cross Entropy Loss vs. MSE for Multi Class Classification](https://stats.stackexchange.com/questions/573944).
 * [Disadvantages of Using a Regression Loss Function in Multi Class Classification](https://stats.stackexchange.com/questions/568238).
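The numbers of the comparison above can be reproduced with a few lines. This is a minimal sketch where `Mse()` and `CrossEntropy()` are illustrative helpers. The Cross Entropy sums only over the entries where the true probability is positive in order to avoid evaluating $0 \cdot \log \left( 0 \right)$.

```python
import numpy as np

vY     = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
vYHat1 = np.array([0.16, 0.2, 0.16, 0.16, 0.16, 0.16])
vYHat2 = np.array([0.5, 0.4, 0.1, 0.0, 0.0, 0.0])

def Mse( vY: np.ndarray, vYHat: np.ndarray ) -> float:
    return np.mean(np.square(vY - vYHat))

def CrossEntropy( vY: np.ndarray, vYHat: np.ndarray ) -> float:
    # Only the entries where vY > 0 contribute
    vIdx = vY > 0
    return -np.sum(vY[vIdx] * np.log(vYHat[vIdx]))

mse1, mse2 = Mse(vY, vYHat1), Mse(vY, vYHat2)
ce1, ce2   = CrossEntropy(vY, vYHat1), CrossEntropy(vY, vYHat2)

print(f'MSE: {mse1:0.3f}, {mse2:0.3f}')
print(f'CE : {ce1:0.3f}, {ce2:0.3f}')
print(f'Most probable class: {np.argmax(vYHat1)}, {np.argmax(vYHat2)} (Correct class: {np.argmax(vY)})')
```

The CE values depend only on the probability assigned to the correct class, $0.2$ and $0.4$ respectively.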

In [None]:
# Cross Entropy Loss

def CrossEntropyLoss( vY: np.ndarray, mZ: np.ndarray ) -> Tuple[np.float64, np.ndarray]: #<! `np.float_` was removed in NumPy 2.0
    '''
    Returns both the loss and the gradient w.r.t the input (mZ).
    Assumes the input is logits (Before applying probability like transformation).
    The function is equivalent of SoftMax + Cross Entropy.
    The function uses the mean loss (Normalized by N). 
    Hence gradients calculation should sum the gradients over the batch.
    '''
    N      = len(vY)
    # mHatY  = np.exp(mZ)
    # mHatY /= np.sum(mHatY, axis = 0)
    mYHat   = sp.special.softmax(mZ, axis = 0)
    valLoss = -np.mean(np.log(mYHat[vY, range(N)]))
    
    mDz                = mYHat
    mDz[vY, range(N)] -= 1 #<! Subtracts the one hot encoding of vY (vY holds the class indices)
    mDz               /= N #<! Now all needed is to sum gradients
    
    return valLoss, mDz

### Model Class

The model class should be composable to allow an arbitrary _Feed Forward_ model.

In [None]:
# NN Model
class ModelNN():
    def __init__( self, lLayers: List ) -> None:
        
        self.lLayers = lLayers
        
    def Forward( self: Self, mX: np.ndarray ) -> np.ndarray:
        
        for oLayer in self.lLayers:
            mX = oLayer.Forward(mX)
        return mX
    
    def Backward( self: Self, mDz: np.ndarray ) -> None:
        
        for oLayer in reversed(self.lLayers):
            mDz = oLayer.Backward(mDz)

In [None]:
# Model Example

oModel = ModelNN([
    LinearLayer(784, 200), ReLULayer(),
    LinearLayer(200, 10),
    ])

oModel.lLayers

## Model Training

The model training (Optimization) is by a vanilla _Gradient Descent_.  
Since the model is small and the data is relatively small, the batch size is the whole training set.

* <font color='brown'>(**#**)</font> Larger model / data set might require using _Stochastic Gradient Descent_.  
  In this case the actual gradient of the loss function over the whole data is _approximated_ by the gradient calculated over a sub sample (Batch).

### Training Function

In [None]:
def TrainModel( oModel: ModelNN, mX: np.ndarray, vY: np.ndarray, numIter: int, learningRate: float ) -> None:
    
    # Display Results
    hF, hA = plt.subplots(figsize = (12, 6))

    vLoss = np.full(numIter, np.nan)
    for ii in range(numIter):
        # Forward Pass
        mZ        = oModel.Forward(mX)
        # Loss
        valLoss, mDz = CrossEntropyLoss(vY, mZ)
        vLoss[ii]    = valLoss

        # Backward Pass
        oModel.Backward(mDz)

        # Gradient Descent (Update parameters)
        for oLayer in oModel.lLayers:
            for sParam in oLayer.dGrads:
                oLayer.dParams[sParam] -= learningRate * oLayer.dGrads[sParam]

        # Display Results
        hA.cla()
        hA.set_title(f'Iteration: {(ii + 1): 04d} / {numIter}, Loss = {valLoss: 0.2f}')
        hA.plot(vLoss, 'b', marker = '.', ms = 5)
        hA.set_xlabel('Iteration Index')
        hA.set_ylabel('Loss Value')
        hA.grid()

        plt.pause(1e-20)
        display(hF, clear = True) #<! "In Place"

## Model Performance

This section analyzes the model performance on the train and test data.

### 1 Hidden Layer Model

In [None]:
# Define the Model

oModel = ModelNN([
    LinearLayer(784, 200), ReLULayer(),
    LinearLayer(200, 10),
])

TrainModel(oModel, mXTrain.T, vYTrain, numIter, µ) #<! Works in place on the model

* <font color='brown'>(**#**)</font> Mathematically, the model is equivalent to the one in the previous notebook.  
  Namely, given the same data, number of iterations and learning rate the result will be the same.
* <font color='red'>(**?**)</font> Which one is more efficient computationally? Explain.

In [None]:
# Apply Model on Data

mYHatTrain = oModel.Forward(mXTrain.T)
mYHatTest  = oModel.Forward(mXTest.T)
vYHatTrain = np.argmax(mYHatTrain, axis = 0)
vYHatTest  = np.argmax(mYHatTest, axis = 0)

* <font color='green'>(**@**)</font> Make the model work with `oModel(mXTrain.T)`. You may want to read about the `__call__()` method.

In [None]:
# Confusion Matrix

hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6))

hA, _ = PlotConfusionMatrix(vYTrain, vYHatTrain, hA = vHa[0])
hA.set_title(f'Train Data, Accuracy {np.mean(vYTrain == vYHatTrain): 0.2%}')

hA, _ = PlotConfusionMatrix(vYTest, vYHatTest, hA = vHa[1])
hA.set_title(f'Test Data, Accuracy {np.mean(vYTest == vYHatTest): 0.2%}')

### 2 Hidden Layers Model

In [None]:
# Define the Model

oModel = ModelNN([
    LinearLayer(784, 200), ReLULayer(),
    LinearLayer(200, 50), ReLULayer(),
    LinearLayer(50, 10),
])

TrainModel(oModel, mXTrain.T, vYTrain, 2 * numIter, 1.5 * µ) #<! Works in place on the model

* <font color='red'>(**?**)</font> How, policy wise, should the capricious behavior of the loss be handled?  
  Keep in mind that one cannot know in advance when a sudden jump will happen.

In [None]:
# Apply Model on Data

mYHatTrain = oModel.Forward(mXTrain.T)
mYHatTest  = oModel.Forward(mXTest.T)
vYHatTrain = np.argmax(mYHatTrain, axis = 0)
vYHatTest  = np.argmax(mYHatTest, axis = 0)

In [None]:
# Confusion Matrix

hF, vHa = plt.subplots(nrows = 1, ncols = 2, figsize = (14, 6))

hA, _ = PlotConfusionMatrix(vYTrain, vYHatTrain, hA = vHa[0])
hA.set_title(f'Train Data, Accuracy {np.mean(vYTrain == vYHatTrain): 0.2%}')

hA, _ = PlotConfusionMatrix(vYTest, vYHatTest, hA = vHa[1])
hA.set_title(f'Test Data, Accuracy {np.mean(vYTest == vYHatTest): 0.2%}')

* <font color='brown'>(**#**)</font> The run time of this simple case is the motivation for using GPUs and demonstrates the difference they made.
* <font color='green'>(**@**)</font> You may try to replicate the above on GPU using [_CuPy_](https://github.com/cupy/cupy). Make sure to use `Float32`.