[![Fixel Algorithms](https://i.imgur.com/AqKHVZ0.png)](https://fixelalgorithms.gitlab.io)

# AI Program

## Introduction to Optimization - Objective Function

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.1.000 | 20/11/2024 | Royi Avital | Added a section on the _Chain Rule_ for composition of functions   |
| 1.0.000 | 13/09/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0008ObjectiveFunction.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
import autograd.numpy as anp
import autograd.scipy as asp
from autograd import grad
from autograd import elementwise_grad as egrad

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
%matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Course Packages


In [None]:
# Auxiliary Functions

def SigmoidFun( vX: np.ndarray ) -> np.ndarray:
    # Implements the Sigmoid (Scaled and shifted) function.
    # Uses AutoGrad for auto differentiation.
    
    return (2 * asp.special.expit(vX)) - 1

def LogisiticRegresion( vX: np.ndarray, mA: np.ndarray, vY: np.ndarray ) -> float:
    # Implements the Logistic Regression objective function

    vR = SigmoidFun(mA @ vX) - vY
    
    return 0.5 * anp.sum(anp.square(vR))

In [None]:
# Parameters

numRows = 10
numCols = 5
ε = 1e-6

## Objective Functions

In this section we'll derive the gradient of various objective functions.
The analytic solution will be verified by _Auto Diff_ calculation.

* <font color='brown'>(**#**)</font> The notebook uses the `NumericDiff.py` file for the actual calculations.

### Logistic Regression Objective Function

Logistic Regression is used in the context of classification.  

* <font color='brown'>(**#**)</font> It is named regression since the model regresses a continuous parameter (the class probability), which is then used for classification.
* <font color='brown'>(**#**)</font> This section analyzes the problem with the _Squared_ ${L}^{2}$ Loss function. In the context of _classification_ it is usually used with the _Cross Entropy Loss_ function.

The objective function is given by:

$$ f \left( \boldsymbol{x} \right) = \frac{1}{2} {\left\| \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\|}_{2}^{2} $$

Where $\sigma(x) = 2 \frac{1}{1 + {e}^{-x}} - 1$ is a scaled and shifted version of the [Sigmoid Function](https://en.wikipedia.org/wiki/Sigmoid_function).  
See the `SigmoidFun()` function for a reference implementation with _auto differentiation_ support.

* <font color='red'>(**?**)</font> Is the problem _convex_?  
* <font color='brown'>(**#**)</font> In practice such a function requires a numerically stable implementation. Use professionally made implementations if available.   
* <font color='brown'>(**#**)</font> See [`scipy.special.expit()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html) for $\frac{ 1 }{ 1 + \exp \left( -x \right) }$.
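
As a small illustration of the numerical stability remark, the scaled and shifted Sigmoid equals $\tanh \left( \frac{x}{2} \right)$, which can be evaluated without calling `exp()` directly. The following is a minimal sketch using NumPy only; the function names `ScaledSigmoid()` and `ScaledSigmoidTanh()` are hypothetical helpers and not part of the notebook.

```python
import numpy as np

def ScaledSigmoid( vX: np.ndarray ) -> np.ndarray:
    # Direct evaluation of 2 / (1 + exp(-x)) - 1 (Hypothetical helper).
    return (2.0 / (1.0 + np.exp(-vX))) - 1.0

def ScaledSigmoidTanh( vX: np.ndarray ) -> np.ndarray:
    # Equivalent form: 2 / (1 + exp(-x)) - 1 = tanh(x / 2) (Hypothetical helper).
    return np.tanh(0.5 * vX)

vT = np.linspace(-5.0, 5.0, 11) #<! Moderate values, no overflow in `exp()`
maxAbsErr = np.max(np.abs(ScaledSigmoid(vT) - ScaledSigmoidTanh(vT)))
print(f'Maximum absolute deviation: {maxAbsErr:0.3e}')

vL = np.array([-1000.0, 0.0, 1000.0]) #<! Large magnitude values
vS = ScaledSigmoidTanh(vL) #<! Saturates towards -1 and 1 without an explicit `exp()`
print(vS)
```

The `tanh()` form is one option among several; `scipy.special.expit()` referenced above is another.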

Since:

$$ \sigma \left( x \right) = 2 \frac{ 1 }{ 1 + \exp \left( -x \right) } - 1 = 2 \frac{ \exp \left( x \right) }{ 1 + \exp \left( x \right) } - 1 $$

The derivative is given by:

$$ \frac{\mathrm{d} \sigma \left( x \right) }{\mathrm{d} x} = 2 \frac{ \exp \left( x \right)}{\left( 1 + \exp \left( x \right) \right)^{2}} = 2 \left( \frac{ 1 }{ 1 + \exp \left( -x \right) } \right) \left( 1 - \frac{ 1 }{ 1 + \exp \left( -x \right) } \right) $$

* <font color='brown'>(**#**)</font> For derivation of the last step, see https://math.stackexchange.com/questions/78575.
* <font color='brown'>(**#**)</font> For information about the objective function in the context of classification see [Stanley Chan - Purdue University - ECE595 / STAT598: Machine Learning I Lecture 14 Logistic Regression](https://engineering.purdue.edu/ChanGroup/ECE595/files/Lecture14_logistic.pdf).
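
The derivative formula above can also be sanity checked by central finite differences, independently of _Auto Diff_. The following is a minimal sketch using NumPy only; the helper names and the step size `stepSize` are assumptions chosen for illustration.

```python
import numpy as np

def Expit( vX: np.ndarray ) -> np.ndarray:
    # The classic Sigmoid, 1 / (1 + exp(-x)) (Hypothetical helper).
    return 1.0 / (1.0 + np.exp(-vX))

def ScaledSigmoid( vX: np.ndarray ) -> np.ndarray:
    # The scaled and shifted Sigmoid as defined in the text.
    return (2.0 * Expit(vX)) - 1.0

def DerivScaledSigmoid( vX: np.ndarray ) -> np.ndarray:
    # The analytic derivative: 2 * s(x) * (1 - s(x)) where s is the classic Sigmoid.
    vS = Expit(vX)
    return 2.0 * vS * (1.0 - vS)

vT       = np.linspace(-3.0, 3.0, 13) #<! Evaluation points
stepSize = 1e-5 #<! Finite differences step size (Assumption)

# Central finite differences approximation of the derivative
vNumDeriv = (ScaledSigmoid(vT + stepSize) - ScaledSigmoid(vT - stepSize)) / (2.0 * stepSize)
maxAbsErr = np.max(np.abs(vNumDeriv - DerivScaledSigmoid(vT)))
print(f'Maximum absolute deviation: {maxAbsErr:0.3e}')
```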

In [None]:
# The Derivative Function

#===========================Fill This===========================#
# 1. Implement a function which, for a given vector, calculates the 
#    derivative per element.
# !! The output should be the derivative of the Sigmoid function as defined in the notebook (Scaled, Shifted).
# !! You may find `sp.special.expit()` useful.

def GradSigmoidFun(vX: np.ndarray) -> np.ndarray:
    """
    Calculates the element wise derivative of the Sigmoid function.  
    The derivative of `SigmoidFun()` is calculated element wise on the input vector.
    Input:
        vX          - Vector (numElements, ) of the values to calculate the derivative at.
    Output:
        []          - Vector (numElements, ) of the derivatives.
    """

    vExpit = sp.special.expit(vX) #<! Calculate the **classic** sigmoid.
    
    return 2 * vExpit * (1 - vExpit)
#===============================================================#

In [None]:
# Verify the Implementation
# This section verifies the analytic solution using AutoGrad.

vX = np.random.rand(numCols)

assert (np.linalg.norm(GradSigmoidFun(vX) - egrad(SigmoidFun)(vX), np.inf) < ε), "Implementation is not verified"
print(f'Implementation is verified')

### Chain Rule for Vector Functions

Given $f \left( \boldsymbol{x} \right) : \mathbb{R}^{n} \to \mathbb{R}^{m}$ as a composition $f \left( \boldsymbol{x} \right) = g \left( h \left( \boldsymbol{x} \right) \right)$ where:

 - $h \left( \boldsymbol{x} \right) : \mathbb{R}^{n} \to \mathbb{R}^{k}$.
 - $g \left( \boldsymbol{x} \right) : \mathbb{R}^{k} \to \mathbb{R}^{m}$.

By the _Chain Rule_, the directional derivative of $f$ is given by:

$$ \nabla f \left( \boldsymbol{x} \right) \left[ \boldsymbol{h} \right] = {J}_{g} \left( h \left( \boldsymbol{x} \right) \right) {J}_{h} \left( \boldsymbol{x} \right) \boldsymbol{h} $$

Where ${J}_{g} \left( \boldsymbol{x} \right) = {\nabla}^{T} g \left( \boldsymbol{x} \right), \; {J}_{h} \left( \boldsymbol{x} \right) = {\nabla}^{T} h \left( \boldsymbol{x} \right)$ are the _Jacobians_ of the functions. Note that ${J}_{g}$ is evaluated at $h \left( \boldsymbol{x} \right)$ and that the direction $\boldsymbol{h}$ (bold) is distinct from the inner function $h$.

* <font color='brown'>(**#**)</font> Jacobians are the _Derivatives_ of vector to vector functions.
* <font color='brown'>(**#**)</font> Gradients and Derivatives are linked by the Adjoint operator (Transpose).
* <font color='brown'>(**#**)</font> For $f : \mathbb{R}^{n} \to \mathbb{R}^{m}$ the _Directional Derivative_ (A vector) is given by $\nabla f \left( \boldsymbol{x} \right) \left[ \boldsymbol{h} \right] = {J}_{f} \left( \boldsymbol{x} \right) \boldsymbol{h}$.
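
The _Chain Rule_ above can be illustrated numerically. The following is a minimal sketch, assuming a linear inner function and an element wise `tanh()` outer function (both chosen only for illustration, with hypothetical names), which compares the product of Jacobians to a finite differences approximation of the directional derivative.

```python
import numpy as np

oRng   = np.random.default_rng(0)
numIn  = 4 #<! Dimension of the input (n)
numMid = 3 #<! Dimension of the intermediate vector (k), equals m for an element wise outer function
mA     = oRng.standard_normal((numMid, numIn))

def InnerFun( vX: np.ndarray ) -> np.ndarray:
    # Inner function (Assumption): linear map, its Jacobian is mA.
    return mA @ vX

def OuterFun( vW: np.ndarray ) -> np.ndarray:
    # Outer function (Assumption): element wise tanh, its Jacobian is Diag(1 - tanh(w)^2).
    return np.tanh(vW)

def CompFun( vX: np.ndarray ) -> np.ndarray:
    # The composition of the outer function on the inner function.
    return OuterFun(InnerFun(vX))

vX = oRng.standard_normal(numIn) #<! Point of evaluation
vH = oRng.standard_normal(numIn) #<! Direction

mJg = np.diag(1.0 - np.square(np.tanh(InnerFun(vX)))) #<! Jacobian of the outer function at InnerFun(vX)
mJh = mA                                              #<! Jacobian of the inner function at vX
vDirDeriv = mJg @ (mJh @ vH)                          #<! Directional derivative by the Chain Rule

stepSize = 1e-6 #<! Finite differences step size (Assumption)
vNumDirDeriv = (CompFun(vX + stepSize * vH) - CompFun(vX - stepSize * vH)) / (2.0 * stepSize)
maxAbsErr = np.max(np.abs(vDirDeriv - vNumDirDeriv))
print(f'Maximum absolute deviation: {maxAbsErr:0.3e}')
```

The Jacobian of the outer function is evaluated at the output of the inner function, not at the original point.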

### Question 001

1. Derive the gradient of the Logistic Regression function.
2. Implement the gradient as a function.

The implementation will be verified using `AutoGrad`.

**Remark**: The derivation is relatively hard and goes a bit beyond the slides.

### Solution 001

 - The objective function is given by: $f \left( \boldsymbol{x} \right) = \frac{1}{2} {\left\| \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\|}_{2}^{2} = \frac{1}{2} \left\langle \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y}, \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle$.
 - By _Product Rule_ and symmetry: $\nabla f \left( \boldsymbol{x} \right) \left[ \boldsymbol{h} \right] = \left\langle \nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right) \left[ \boldsymbol{h} \right], \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle = \left\langle \nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right], \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle$.
 - By _Chain Rule_: $\nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right] = \left\langle \nabla \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] , \nabla \boldsymbol{A} \boldsymbol{x} \left[ \boldsymbol{h} \right] \right\rangle $:
   - As linear function $\nabla \boldsymbol{A} \boldsymbol{x} \left[ \boldsymbol{h} \right] = \boldsymbol{A} \boldsymbol{h}$ which implies $\nabla \left( \boldsymbol{A} \boldsymbol{x} \right) = \boldsymbol{A}$.
   - As element wise function $\nabla \sigma \left[ \boldsymbol{w} \right] \left[ \boldsymbol{h} \right] = {\sigma}^{'} \left[ \boldsymbol{w} \right] \circ \boldsymbol{h}$ which implies $\nabla \sigma \left[ \boldsymbol{w} \right] = \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{w} \right] \right)$.
   - Hence, by _Chain Rule_ for composition of vector functions: $\nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right] = {J}_{\sigma} \left( \boldsymbol{A} \boldsymbol{x} \right) J \left( \boldsymbol{A} \boldsymbol{x} \right) \boldsymbol{h} = \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \boldsymbol{A} \boldsymbol{h}$.
 - Hence $\nabla f \left( \boldsymbol{x} \right) \left[ \boldsymbol{h} \right] = \left\langle \nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right], \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle = \left\langle \left( \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \boldsymbol{A} \right) \boldsymbol{h}, \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle = \left\langle \boldsymbol{A}^{T} \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right), \boldsymbol{h} \right\rangle$.
 - Which yields: $\nabla f \left( \boldsymbol{x} \right) = \boldsymbol{A}^{T} \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right)$.
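
The diagonal matrix in the result does not have to be built explicitly, since $\operatorname{Diag} \left( \boldsymbol{d} \right) \boldsymbol{r} = \boldsymbol{d} \circ \boldsymbol{r}$. The following is a minimal sketch with random stand-in vectors (`vD` and `vR` are placeholders for ${\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right]$ and $\sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y}$, not the actual values) showing both forms agree.

```python
import numpy as np

oRng    = np.random.default_rng(0)
numRows = 10
numCols = 5

mA = oRng.standard_normal((numRows, numCols))
vD = oRng.standard_normal(numRows) #<! Stand-in for the element wise derivative (Assumption)
vR = oRng.standard_normal(numRows) #<! Stand-in for the residual (Assumption)

vGradDiag = mA.T @ np.diag(vD) @ vR #<! Explicit diagonal matrix: (numRows x numRows) memory
vGradElem = mA.T @ (vD * vR)        #<! Element wise product: no diagonal matrix is formed

maxAbsErr = np.max(np.abs(vGradDiag - vGradElem))
print(f'Maximum absolute deviation: {maxAbsErr:0.3e}')
```

The element wise form avoids allocating a `numRows` by `numRows` matrix, which matters as the number of samples grows.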

---

### Solution 001 (Alternative)

This solution matches the slides, though it abuses the _Inner Product_ notation in the _Chain Rule_.

 - The objective function is given by: $f \left( \boldsymbol{x} \right) = \frac{1}{2} {\left\| \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\|}_{2}^{2} = \frac{1}{2} \left\langle \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y}, \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle$.
 - By _Product Rule_ and symmetry: $\nabla f \left( \boldsymbol{x} \right) \left[ \boldsymbol{h} \right] = \left\langle \nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right) \left[ \boldsymbol{h} \right], \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle = \left\langle \nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right], \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle$.
 - By _Chain Rule_: $\nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right] = \left\langle \nabla \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] , \nabla \boldsymbol{A} \boldsymbol{x} \left[ \boldsymbol{h} \right] \right\rangle $:
   - As linear function $\nabla \boldsymbol{A} \boldsymbol{x} \left[ \boldsymbol{h} \right] = \boldsymbol{A} \boldsymbol{h}$.
   - As element wise function $\nabla \sigma \left[ \boldsymbol{w} \right] \left[ \boldsymbol{h} \right] = {\sigma}^{'} \left[ \boldsymbol{w} \right] \circ \boldsymbol{h}$ which implies $\nabla \sigma \left[ \boldsymbol{w} \right] = \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{w} \right] \right)$.
   - Hence $\nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right] = \left\langle \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right), \boldsymbol{A} \boldsymbol{h} \right\rangle = \left( \boldsymbol{A}^{T} \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \right)^{T} \boldsymbol{h}$.
 - Hence $\nabla f \left( \boldsymbol{x} \right) \left[ \boldsymbol{h} \right] = \left\langle \nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right], \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle = \left\langle {\left( \boldsymbol{A}^{T} \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \right)}^{T} \boldsymbol{h}, \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\rangle = \left\langle \boldsymbol{A}^{T} \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right), \boldsymbol{h} \right\rangle$.
 - Which yields: $\nabla f \left( \boldsymbol{x} \right) = \boldsymbol{A}^{T} \operatorname{Diag} \left( {\sigma}^{'} \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right)$.

<br/>

 * <font color='brown'>(**#**)</font> Pay attention that $\nabla \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] \right) \left[ \boldsymbol{h} \right]$ is a vector, not a scalar. Hence the _Inner Product_ notation is abused.  
The reasoning is still correct: the _Inner Product_ is used to denote the application of a linear operator.

---

In [None]:
# The Gradient Function of the Logistic Regression Objective Function

#===========================Fill This===========================#
# 1. Implement the derived gradient function.
# !! Try avoiding the diagonal matrix for efficiency.

def GradLogisiticRegresion( vX: np.ndarray, mA: np.ndarray, vY: np.ndarray ) -> np.ndarray:
    """
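    Calculates the gradient of the Logistic Regression function.  
    Input:
        vX          - Vector (numCols, ) of the values to calculate the gradient at.
        mA          - Matrix (numRows, numCols) of the model.
        vY          - Vector (numRows, ) of the reference values.
    Output:
        []          - Vector (numCols, ) of the gradient.
    """
    
    vZ = mA @ vX #<! Calculate once, used by both terms
    
    # Broadcasting scales the columns of `mA.T`, equivalent to `mA.T @ Diag(GradSigmoidFun(vZ))`
    return (mA.T * GradSigmoidFun(vZ)) @ (SigmoidFun(vZ) - vY)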
#===============================================================#

In [None]:
# Verify the Implementation
# This section verifies the analytic solution using AutoGrad.

vX = np.random.rand(numCols)
mA = np.random.rand(numRows, numCols)
vY = np.random.rand(numRows)
hLogisiticRegresion = lambda vX: LogisiticRegresion(vX, mA, vY)

assert (np.linalg.norm(GradLogisiticRegresion(vX, mA, vY) - grad(hLogisiticRegresion)(vX), np.inf) < ε), "Implementation is not verified"
print(f'Implementation is verified')

### Frobenius Norm Objective

The objective function $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ is given by

$$ f \left( \boldsymbol{X} \right) = {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F} $$

### Question 002

1. Derive the gradient of the objective function.
2. Implement the objective function.  
   The implementation should be `AutoGrad` compatible.
3. Implement a function to calculate its gradient.

### Solution 002

The function can be rewritten as:

$$ f \left( \boldsymbol{X} \right) = {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F} = \sqrt{{\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F}^{2}} = \sqrt{ \left\langle \boldsymbol{X} - \boldsymbol{Y}, \boldsymbol{X} - \boldsymbol{Y} \right\rangle } $$

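By the _Chain Rule_ for the square root, for $\boldsymbol{X} \neq \boldsymbol{Y}$: $\nabla f \left( \boldsymbol{X} \right) = \frac{1}{2} \frac{1}{ {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F} } \nabla \left( {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F}^{2} \right)$, where $\nabla \left( {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F}^{2} \right) = 2 \left( \boldsymbol{X} - \boldsymbol{Y} \right)$.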

This yields:

$$ \nabla f \left( \boldsymbol{X} \right) = \frac{ 1 }{ {\left\| \boldsymbol{X} - \boldsymbol{Y} \right\|}_{F} } \left( \boldsymbol{X} - \boldsymbol{Y} \right) $$ 
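
The result can be checked entry by entry using central finite differences. The following is a minimal sketch using NumPy only; the matrix sizes, the step size and the helper name `ObjFun()` are assumptions for illustration. It also shows that the gradient has a unit Frobenius norm whenever $\boldsymbol{X} \neq \boldsymbol{Y}$.

```python
import numpy as np

oRng = np.random.default_rng(0)
mX   = oRng.standard_normal((4, 3))
mY   = oRng.standard_normal((4, 3))

def ObjFun( mX: np.ndarray ) -> float:
    # The objective function f(X) = ||X - Y||_F (Hypothetical helper).
    return np.linalg.norm(mX - mY, 'fro')

# Analytic gradient: (X - Y) / ||X - Y||_F
mGrad = (mX - mY) / np.linalg.norm(mX - mY, 'fro')

# Numeric gradient by central finite differences, one entry at a time
stepSize = 1e-6 #<! Finite differences step size (Assumption)
mNumGrad = np.zeros_like(mX)
for ii in range(mX.shape[0]):
    for jj in range(mX.shape[1]):
        mE         = np.zeros_like(mX)
        mE[ii, jj] = stepSize #<! Perturbation of a single entry
        mNumGrad[ii, jj] = (ObjFun(mX + mE) - ObjFun(mX - mE)) / (2.0 * stepSize)

maxAbsErr = np.max(np.abs(mGrad - mNumGrad))
gradNorm  = np.linalg.norm(mGrad, 'fro')
print(f'Maximum absolute deviation: {maxAbsErr:0.3e}')
print(f'The Frobenius norm of the gradient: {gradNorm:0.6f}')
```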


---

In [None]:
# Implement the Frobenius Norm Objective Function

#===========================Fill This===========================#
# 1. Implement the objective function f(X) = ||X - Y||_F.
# Make the implementation `AutoGrad` compatible.

def FrobNorm( mX: np.ndarray, mY: np.ndarray ) -> float:
    """
    Calculates the objective function f(X) = ||X - Y||_F.  
    Input:
        mX          - Matrix (numRows, numCols).
        mY          - Matrix (numRows, numCols).
    Output:
        []          - Value of the function.
    """
    
    return anp.linalg.norm(mX - mY, 'fro')
#===============================================================#

In [None]:
# Implement the Gradient of the Frobenius Norm Objective Function

#===========================Fill This===========================#
# 1. Implement the gradient of f(X) = ||X - Y||_F.
# Make the implementation `AutoGrad` compatible.

def GradFrobNorm( mX: np.ndarray, mY: np.ndarray ) -> np.ndarray:
    """
    Calculates the gradient of f(X) = ||X - Y||_F.  
    Input:
        mX          - Matrix (numRows, numCols), to calculate the gradient at.
        mY          - Matrix (numRows, numCols).
    Output:
        []          - Matrix (numRows, numCols) of the gradient.
    """
    
    mB = mX - mY #<! Buffer
    return mB / np.linalg.norm(mB, 'fro')
#===============================================================#

In [None]:
# Verify the Implementation
# This section verifies the analytic solution using AutoGrad.

mX = np.random.rand(numRows, numCols)
mY = np.random.rand(numRows, numCols)
hFrobNorm = lambda mX: FrobNorm(mX, mY)

assert (np.max(np.abs(GradFrobNorm(mX, mY) - grad(hFrobNorm)(mX))) < ε), "Implementation is not verified"
print(f'Implementation is verified')