[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Machine Learning Methods

## Neural Networks - Auto Differentiation  

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 0.1.000 | 11/03/2023 | Royi Avital | First version                                                      |
|         |            |             |                                                                    |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/MachineLearningMethods/2023_01/0046AnomalyDetectionIsolationForest.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
import autograd.numpy as anp
from autograd import grad
from autograd import elementwise_grad as egrad

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
from matplotlib.colors import LogNorm, Normalize, PowerNorm
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

In [None]:
# Configuration
#%matplotlib inline

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #>! Apply SeaBorn theme

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Fixel Algorithms Packages


## Auto Differentiation

Deep Learning / Neural Network frameworks usually have the following 4 main ingredients:

 - Data Loaders  
   Loading data in an optimized manner for the training work.
 - Layers  
   The building blocks (Math functions) of the DL related operations.
 - Optimizers and Schedulers  
   Algorithms for the optimization (Usually 1st derivative based) and schedulers to adjust the learning rate during training.
 - Auto Differentiation  
   Used for the back propagation calculation.

Most frameworks also pack a layer which optimizes the operations for the hardware.  
Currently, it is mostly done to target GPUs (Mostly by NVIDIA).

In this notebook we'll use a pretty modern auto differentiation framework called [`AutoGrad`](https://github.com/HIPS/autograd) for a simple example.  
We'll try 

* <font color='brown'>(**#**)</font> This notebook doesn't cover the ideas behind _auto differentiation_. One might watch [What is Automatic Differentiation](https://www.youtube.com/watch?v=wG_nF1awSSY).
* <font color='brown'>(**#**)</font> There are a few approaches to auto differentiation. One of them is based on Dual Numbers.  
  See A Hands On Introduction to Automatic Differentiation: [Part I](https://mostafa-samir.github.io/auto-diff-pt1), [Part II](https://mostafa-samir.github.io/auto-diff-pt2).
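
To give a feel for the Dual Numbers approach, here is a minimal sketch of forward mode differentiation in plain Python. It is not how `AutoGrad` works internally. The names `Dual`, `DualSqrt()`, `PseudoHuberDual()` and `Derivative()` are hypothetical helpers introduced only for this illustration. Each number carries its value and its derivative, and every arithmetic operation propagates both.

```python
import math
from typing import Callable

class Dual:
    # Dual number `val + der * ε` with `ε² = 0`; `der` carries the derivative (Hypothetical helper)

    def __init__( self, val: float, der: float = 0.0 ) -> None:
        self.val = val
        self.der = der

    def _lift( self, other ):
        # Treat plain numbers as constants (Zero derivative)
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__( self, other ):
        other = self._lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__( self, other ):
        other = self._lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __rsub__( self, other ):
        return self._lift(other) - self

    def __mul__( self, other ):
        other = self._lift(other)
        # Product rule
        return Dual(self.val * other.val, (self.der * other.val) + (self.val * other.der))

    __rmul__ = __mul__

    def __truediv__( self, other ):
        other = self._lift(other)
        # Quotient rule
        return Dual(self.val / other.val, ((self.der * other.val) - (self.val * other.der)) / (other.val * other.val))

def DualSqrt( x: Dual ) -> Dual:
    # Chain rule for the square root
    sqrtVal = math.sqrt(x.val)
    return Dual(sqrtVal, x.der / (2.0 * sqrtVal))

def PseudoHuberDual( x: Dual, δ: float = 1.0 ) -> Dual:
    # Same formula as the notebook's `PseudoHuber()`, evaluated on dual numbers
    xs = x / δ
    return (δ * δ) * (DualSqrt(1.0 + (xs * xs)) - 1.0)

def Derivative( hF: Callable, x: float ) -> float:
    # Seed the derivative with 1 and read it back from the result
    return hF(Dual(x, 1.0)).der

valX = 0.7
δ = 0.75
autoDer = Derivative(lambda x: PseudoHuberDual(x, δ), valX)
refDer = valX / math.sqrt(1.0 + (valX / δ) ** 2) #<! Closed form derivative
print(autoDer, refDer)
```

The two printed values should agree up to floating point rounding, since the dual arithmetic applies the same chain rule one would apply by hand.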

In [None]:
# Parameters

# Data
numCoeff    = 3
numSamples  = 50
noiseStd    = 0.15

# Model
λ  = 0.1 #<! Regularization
mD = -np.eye(numSamples - 1, numSamples, k = 0) + np.eye(numSamples - 1, numSamples, k = 1) #<! Finite Differences Matrix
δ  = 1 #<! Huber Loss

# Visualization
numGrdiPts = 1000


In [None]:
# Auxiliary Functions

def PlotRegressionData( vX: np.ndarray, vY: np.ndarray, hA:plt.Axes = None, figSize: Tuple[int, int] = FIG_SIZE_DEF, elmSize: int = ELM_SIZE_DEF, elmColor = None, elmAlpha: float = 1.0, dataLabel: str = '_', axisTitle: str = None ) -> plt.Axes:

    if hA is None:
        hF, hA = plt.subplots(figsize = figSize)
    else:
        hF = hA.get_figure()
    
    hA.scatter(vX, vY, s = elmSize, c = elmColor, edgecolor = 'k', alpha = elmAlpha, label = dataLabel)
    hA.set_xlabel('$x$')
    hA.set_ylabel('$y$')
    if axisTitle is not None:
        hA.set_title(axisTitle)
    hA.legend()
    
    return hA

def SampleSum1Vec( numElements: int ) -> np.ndarray:
    # Non negative, sum of 1 vector
    vS = np.random.rand(numElements)
    vS = vS / np.sum(vS)

    return vS

def GenMeanMatrix( numSamples: int, numCoeff: int = 5 ) -> np.ndarray:
    # `numCoeff` - Must be odd!
    # In practice, for a large number of samples relative to the number of coefficients, it is better to use a sparse matrix.
    # This filter is not a convolution as it always averages `numCoeff` values.
    # Hence at the borders it differs from a standard convolution.

    if ((numCoeff % 2) == 0):
        # The number is even
        raise ValueError(f'The parameter `numCoeff` must be an odd positive integer. The input is {numCoeff}.')
    
    kernelRadius = int(numCoeff / 2)

    coeffVal = 1 / numCoeff
    mH = np.zeros(shape = (numSamples, numSamples))

    # mH[:kernelRadius, :numCoeff] = coeffVal
    # mH[-kernelRadius:, -numCoeff:] = coeffVal

    mH[:kernelRadius, :numCoeff] = SampleSum1Vec(numCoeff)
    mH[-kernelRadius:, -numCoeff:] = SampleSum1Vec(numCoeff)

    for ii in range(kernelRadius, numSamples - kernelRadius):
        # mH[ii, (ii - kernelRadius):(ii + kernelRadius + 1)] = coeffVal
        mH[ii, (ii - kernelRadius):(ii + kernelRadius + 1)] = SampleSum1Vec(numCoeff)

    return mH

def PseudoHuber( vX: np.ndarray, δ: float = 1 ):
    # Smooth approximation of the Abs

    return δ * δ * (anp.sqrt(1 + anp.square(vX / δ)) - 1)

def PseudoHuberGrad( vX: np.ndarray, δ: float = 1 ):

    return vX / np.sqrt(1 + np.square(vX / δ))

def ApproxL1Norm( vX: np.ndarray, δ: float = 1 ):

    return anp.sum(PseudoHuber(vX, δ = δ))



## Generate / Load Data

The data is a piece wise constant function which is filtered by a random mean filter.  
The filtering is applied by the known filter model matrix `mH`.

* <font color='brown'>(**#**)</font> While the filter is not applied by a convolution, it is a linear model.
* <font color='brown'>(**#**)</font> We use a random weighted mean (Instead of a uniform one) in order to improve the condition number of the filter matrix.
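
The effect of the random weights on the condition number can be sketched as follows. `BuildMeanMatrix()` is a hypothetical compact variant of `GenMeanMatrix()` written for this illustration only (It draws separate weights per row). The sizes and the seed are assumptions. With uniform weights the border rows coincide, so the matrix is singular. Random weights which sum to 1 break the tie.

```python
import numpy as np

def BuildMeanMatrix( numSamples: int, numCoeff: int, oRng = None ) -> np.ndarray:
    # Hypothetical compact variant of `GenMeanMatrix()`
    # `oRng = None`: Uniform weights. Otherwise: Random non negative weights which sum to 1.
    kernelRadius = numCoeff // 2
    mH = np.zeros(shape = (numSamples, numSamples))
    for ii in range(numSamples):
        # Clamp the window so each row always holds `numCoeff` weights
        firstIdx = min(max(ii - kernelRadius, 0), numSamples - numCoeff)
        vS = np.ones(numCoeff) if oRng is None else oRng.random(numCoeff)
        mH[ii, firstIdx:(firstIdx + numCoeff)] = vS / np.sum(vS)

    return mH

numSamples = 50
numCoeff   = 3

mU = BuildMeanMatrix(numSamples, numCoeff) #<! Uniform mean
mR = BuildMeanMatrix(numSamples, numCoeff, oRng = np.random.default_rng(512)) #<! Random mean

print(f'Uniform: first 2 rows identical: {np.array_equal(mU[0], mU[1])}')
print(f'Uniform: rank = {np.linalg.matrix_rank(mU)} (Out of {numSamples})')
print(f'Random : condition number = {np.linalg.cond(mR)}')
```

Both matrices have rows which sum to 1, so both preserve constant signals. Only the uniform one loses information at the borders.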


In [None]:
# Loading / Generating Data

mH = GenMeanMatrix(numSamples, numCoeff = numCoeff) #<! Model Matrix
# mH = np.eye(numSamples)
vX = np.arange(numSamples)
vY1 = np.piecewise(vX, [vX < numSamples, vX < 40, vX < 30, vX < 20, vX < 10], [0, 1, 2, 3, 4])
vY2 = mH @ vY1
vY = vY2 + (noiseStd * np.random.randn(numSamples))

### Plot the Data

In [None]:
# Plot the Data
hF, hA = plt.subplots(figsize = (8, 8))

hA = PlotRegressionData(vX, vY1, elmColor = 'C0', dataLabel = 'RAW Data', hA = hA)
hA = PlotRegressionData(vX, vY2, elmColor = 'C1', elmAlpha = 0.8, dataLabel = 'Filtered Data', hA = hA)
hA = PlotRegressionData(vX, vY, elmColor = 'C2', elmAlpha = 0.6, dataLabel = 'Noisy Data (Measurement)', hA = hA)

plt.show()

## The Model and Optimization Process

We'll use the Total Variation model which is a good fit for Piece Wise Constant models:


$$ \arg \min_{\boldsymbol{w}} \frac{1}{2} {\left\| H \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| D \boldsymbol{w} \right\|}_{1} $$

Where ${\left\| D \boldsymbol{w} \right\|}_{1} = \sum_{i = 1}^{N - 1} \left| {w}_{i + 1} - {w}_{i} \right|$ is the Total Variation of the samples.
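
As a small numeric check of this definition, the sketch below builds the finite differences matrix the same way `mD` is built in the parameters cell and compares ${\left\| D \boldsymbol{w} \right\|}_{1}$ with a direct sum of absolute differences. The signal `vW` is an illustrative assumption.

```python
import numpy as np

numSamples = 6

# Forward finite differences matrix, built as the notebook's `mD`
mD = -np.eye(numSamples - 1, numSamples, k = 0) + np.eye(numSamples - 1, numSamples, k = 1)

vW = np.array([4.0, 4.0, 3.0, 3.0, 3.0, 0.0]) #<! Illustrative piece wise constant signal

tvByMatrix = np.sum(np.abs(mD @ vW))        #<! L1 norm of D w
tvByDiff   = np.sum(np.abs(np.diff(vW)))    #<! Sum of |w_{i + 1} - w_{i}|

print(tvByMatrix, tvByDiff)
```

Only the jumps contribute to the sum, which is why the Total Variation favors signals with few jumps.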

Since the ${L}_{1}$ norm is not smooth, we'll approximate it using the Pseudo Huber Loss:

$$ \arg \min_{\boldsymbol{w}} \frac{1}{2} {\left\| H \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {L}_{\delta} \left( D \boldsymbol{w} \right) $$

Where ${L}_{\delta} \left( D \boldsymbol{w} \right) = \sum_{i = 1}^{N - 1} \operatorname{PH}_{\delta} \left( {w}_{i + 1} - {w}_{i} \right)$ with $\operatorname{PH}_{\delta} \left( x \right) = {\delta}^{2} \left( \sqrt{1 + {\left(\frac{x}{\delta}\right)}^{2}} - 1 \right)$.
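
For reference, differentiating the smooth objective term by term under the definitions above gives the following, where $\operatorname{PH}_{\delta}'$ is applied element wise:

$$ \operatorname{PH}_{\delta}' \left( x \right) = \frac{x}{\sqrt{1 + {\left( \frac{x}{\delta} \right)}^{2}}} $$

$$ \nabla_{\boldsymbol{w}} \left( \frac{1}{2} {\left\| H \boldsymbol{w} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {L}_{\delta} \left( D \boldsymbol{w} \right) \right) = {H}^{T} \left( H \boldsymbol{w} - \boldsymbol{y} \right) + \lambda {D}^{T} \operatorname{PH}_{\delta}' \left( D \boldsymbol{w} \right) $$

Note that $\operatorname{PH}_{\delta} \left( x \right) \approx \frac{1}{2} {x}^{2}$ for $\left| x \right| \ll \delta$ and $\operatorname{PH}_{\delta} \left( x \right) \approx \delta \left| x \right| - {\delta}^{2}$ for $\left| x \right| \gg \delta$. Hence for $\delta \neq 1$ the approximation behaves like a scaled absolute value for large arguments.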

We'll use a vanilla gradient descent where the gradient will be calculated by the _auto differentiation_ framework.
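
The overall process can be sketched on a tiny problem using NumPy only. This is a sketch under stated assumptions and not the notebook's implementation: the helpers `Objective()`, `ObjectiveGrad()` and `NumericGrad()`, the identity model matrix, the data and all parameter values are illustrative. The gradient is verified against central finite differences, which plays the role the _auto differentiation_ gradient plays in this notebook.

```python
import numpy as np

def Objective( vW, mH, vY, mD, λ, δ ):
    # 0.5 * ||H w - y||² + λ * Σ PH_δ((D w)_i)
    vR = (mH @ vW) - vY
    vD = mD @ vW
    return (0.5 * (vR @ vR)) + (λ * np.sum(δ * δ * (np.sqrt(1 + np.square(vD / δ)) - 1)))

def ObjectiveGrad( vW, mH, vY, mD, λ, δ ):
    # Hᵀ (H w - y) + λ * Dᵀ PH'_δ(D w), with PH'_δ applied element wise
    vD = mD @ vW
    return (mH.T @ ((mH @ vW) - vY)) + (λ * (mD.T @ (vD / np.sqrt(1 + np.square(vD / δ)))))

def NumericGrad( hF, vW, ε = 1e-6 ):
    # Central finite differences, one coordinate at a time
    vG = np.zeros_like(vW)
    for ii in range(vW.size):
        vE = np.zeros_like(vW)
        vE[ii] = ε
        vG[ii] = (hF(vW + vE) - hF(vW - vE)) / (2 * ε)
    return vG

oRng       = np.random.default_rng(0)
numSamples = 8
mH = np.eye(numSamples) #<! Illustrative model matrix (Identity, pure denoising)
mD = -np.eye(numSamples - 1, numSamples, k = 0) + np.eye(numSamples - 1, numSamples, k = 1)
vY = np.repeat([1.0, 0.0], numSamples // 2) + (0.1 * oRng.standard_normal(numSamples))
λ, δ = 0.5, 0.1

hF = lambda vW: Objective(vW, mH, vY, mD, λ, δ)
hG = lambda vW: ObjectiveGrad(vW, mH, vY, mD, λ, δ)

# Gradient verification at a random point
vW0    = oRng.standard_normal(numSamples)
maxDev = np.max(np.abs(hG(vW0) - NumericGrad(hF, vW0)))

# Vanilla gradient descent with a fixed step size, initialized by the measurements
stepSize = 0.1
vW = vY.copy()
for _ in range(2000):
    vW = vW - (stepSize * hG(vW))

print(f'Max deviation between the analytic and numeric gradient: {maxDev}')
print(f'Objective at the measurements: {hF(vY)}, after optimization: {hF(vW)}')
```

The step size is chosen below the reciprocal of a bound on the gradient's Lipschitz constant (At most $1 + 4 \lambda$ for this toy setup), so the objective value does not increase between iterations.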

In [None]:
# Smooth Absolute Value Function
# Show the different approximations of the absolute value function
vXX = np.linspace(-5, 5, numGrdiPts)

hF, hA = plt.subplots(figsize = (8, 8))

δ = 0.75

hA.plot(vXX, np.abs(vXX), label = 'Abs')
hA.plot(vXX, PseudoHuber(vXX, δ = δ), label = 'Pseudo Huber')
# hA.plot(vXX, sp.special.pseudo_huber(δ, vXX), label = 'Pseudo Huber')
hA.plot(vXX, sp.special.huber(δ, vXX), label = 'Huber')

hA.legend()

plt.show()

In [None]:
# Define the Loss Function and Loss Function Gradient
def LossFun(mH: np.ndarray, vW: np.ndarray, vY: np.ndarray, λ: float, mD: np.ndarray, δ: float):
    vT = anp.dot(mH, vW) - vY
    l2Loss = anp.dot(vT, vT)
    vL = anp.dot(mD, vW)
    huberLoss = ApproxL1Norm(vL, δ = δ)

    return (0.5 * l2Loss) + (λ * huberLoss)

def LossFunGrad( mH: np.ndarray, vW: np.ndarray, vY: np.ndarray, λ: float, mD: np.ndarray, δ: float ):
    # Gradient with respect to `vW`

    vL2LossGrad = mH.T @ ((mH @ vW) - vY)
    vD = mD @ vW
    vRegGrad = mD.T @ PseudoHuberGrad(vD, δ = δ) #<! `PseudoHuberGrad()` is vectorized (Element wise)

    return vL2LossGrad + (λ * vRegGrad)

In [None]:
# Define the Functions
# Define the functions with respect to `vW`
λ = 0.005
hLossFun        = lambda vW: LossFun(mH, vW, vY, λ, mD, δ)
hLossFunGrad    = grad(hLossFun)
hLossFunGradAna = lambda vW: LossFunGrad(mH, vW, vY, λ, mD, δ)


In [None]:
# Verify the Auto Differentiation

vW = np.random.randn(numSamples)

maxDev = np.max(np.abs(hLossFunGrad(vW) - hLossFunGradAna(vW)))

print(f'The maximum deviation between the Auto Diff and analytic gradient: {maxDev}')

In [None]:
# Gradient Descent

numIterations = 30_000
stepSize = 0.0015

mW = np.zeros(shape = (numSamples, numIterations)) #<! Estimation
mW[:, 0] = vY

mG = np.zeros(shape = (numSamples, numIterations)) #<! Gradient
mG[:, 0] = hLossFunGrad(mW[:, 0]) #<! Gradient (Not the loss value) at the initial point

vL = np.zeros(numIterations) #<! Loss Function
vL[0] = hLossFun(mW[:, 0])

for ii in range(1, numIterations):
    vG = hLossFunGrad(mW[:, ii - 1])
    # vG = hLossFunGradAna(mW[:, ii - 1])
    mW[:, ii] = mW[:, ii - 1] - (stepSize * vG)
    mG[:, ii] = vG
    vL[ii] = hLossFun(mW[:, ii])

In [None]:
hF, hAs = plt.subplots(nrows = 1, ncols = 3, figsize = (24, 8))
hAs = hAs.flat

hAs[0].scatter(vX, mW[:, -1], label = 'Prediction')
hAs[0].scatter(vX, vY, label = 'Measurements')
hAs[0].set_xlabel('Sample Index')
hAs[0].set_ylabel('Sample Value')
hAs[0].legend()

hAs[1].plot(range(numIterations), np.linalg.norm(mG, axis = 0))
hAs[1].set_xlabel('Iteration Index')
hAs[1].set_ylabel('Gradient Norm')

hAs[2].plot(range(numIterations), vL, label = 'By Optimization')
hAs[2].plot(range(numIterations), hLossFun(vY1) * np.ones(numIterations), label = 'Ground Truth')
hAs[2].set_xlabel('Iteration Index')
hAs[2].set_ylabel('Loss Fun')
hAs[2].legend()

plt.show()

In [None]:
# Stability of mH (Condition Number)
vSingularValues = sp.linalg.svdvals(mH)

vSingularValues = np.abs(vSingularValues)

np.max(vSingularValues) / np.min(vSingularValues) #<! Matches np.linalg.cond(mH)

