[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Optimization Methods

## Convex Optimization - Smooth Optimization - Logistic Regression

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        | Content / Changes                                                  |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 28/09/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0008ObjectiveFunction.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
import autograd.numpy as anp
import autograd.scipy as asp
from autograd import grad
from autograd import elementwise_grad as egrad


# Miscellaneous
import gdown
import os
import math
from platform import python_version
import random
import zipfile

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
from matplotlib.colors import LogNorm, Normalize, PowerNorm
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter
from IPython import get_ipython
from IPython.display import Image, display
from ipywidgets import Dropdown, FloatSlider, interact, IntSlider, Layout

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [None]:
# Configuration
%matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# sns.set_theme() #<! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [None]:
# Course Packages

from AuxFun import StepSizeMode
from DataVisualization import Plot2DLinearClassifier, PlotBinaryClassData
from NumericDiff import DiffMode



In [None]:
# Auxiliary Functions

def SigmoidFun( vX: np.ndarray ) -> np.ndarray:
    # Implements the Sigmoid (scaled and shifted) function.
    # Uses AutoGrad for auto differentiation.
    
    return (2 * asp.special.expit(vX)) - 1


In [None]:
# Parameters

# Data
zipFileId       = '1SIN8Er2k2gYJe2k5Mer2DrLwZZK_wykc'
dataFileName    = 'LogRegData.mat'

# Numerical Differentiation
diffMode    = DiffMode.CENTRAL
ε           = 1e-6

# Solver
stepSizeMode    = StepSizeMode.ADAPTIVE
μ               = 0.01
numIterations   = 100
α               = 0.5
maxNumBack      = 20
minμ            = 1e-7

# Visualization
numGridPts  = 501
tuAxLim     = [-2, 2] #<! Boundaries for Xlim / Ylim

## Logistic Regression

The _Logistic Regression_ is a model for the probability of a binary event.  
It is used mainly as a _Statistical Classifier_ (Either in its binary form, [Logistic Regression](https://en.wikipedia.org/wiki/Logistic_regression), or [Multinomial Logistic Regression](https://en.wikipedia.org/wiki/Multinomial_logistic_regression)).

This notebook:
 - Optimizes the loss function to find the optimal weights.
 - Implements the _adaptive step size_ (Backtracking) policy for _Gradient Descent_.
 - Uses _Numeric differentiation_ / _Auto differentiation_ for the optimizer.

### Logistic Regression Objective Function

Logistic Regression is used in the context of classification.  

* <font color='brown'>(**#**)</font> It is named _regression_ since it regresses a parameter of the model, the probability of the class, rather than the label itself.
* <font color='brown'>(**#**)</font> This section analyzes the problem with the _Squared_ ${L}^{2}$ Loss function. In the context of _classification_ it is usually used with the _Cross Entropy Loss_ function.
* <font color='brown'>(**#**)</font> The _Squared_ ${L}^{2}$ Loss function in the context of classification is called [_Brier Score_](https://en.wikipedia.org/wiki/Brier_score).
* <font color='brown'>(**#**)</font> For an analysis of the different score functions, see [The Effect of Using the MSE Score (Brier Score) for Logistic Regression](https://stats.stackexchange.com/questions/326350).

The objective function is given by:

$$ f \left( \boldsymbol{x} \right) = \frac{1}{2} {\left\| \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right\|}_{2}^{2} $$

Where $\sigma(x) = 2 \frac{1}{1 + {e}^{-x}} - 1$ is a scaled and shifted version of the [Sigmoid Function](https://en.wikipedia.org/wiki/Sigmoid_function).  
See the `SigmoidFun()` function for a reference implementation with _auto differentiation_ support.

* <font color='red'>(**?**)</font> Is the problem _convex_?  
* <font color='brown'>(**#**)</font> In practice such a function requires a numerically stable implementation. Use professionally made implementations if available.   
* <font color='brown'>(**#**)</font> See [`scipy.special.expit()`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.expit.html) for $\frac{ 1 }{ 1 + \exp \left( -x \right) }$.
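
As a minimal sketch of a numerically stable evaluation: the scaled and shifted sigmoid satisfies $2 \frac{1}{1 + \exp \left( -x \right)} - 1 = \tanh \left( \frac{x}{2} \right)$, so `np.tanh()` can evaluate it without forming large exponentials. The name `ScaledSigmoidStable()` and the test values are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def ScaledSigmoidStable( vX: np.ndarray ) -> np.ndarray:
    # Hypothetical helper: 2 / (1 + exp(-x)) - 1 == tanh(x / 2)
    return np.tanh(0.5 * vX)

vX = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])

# The direct formula evaluates exp(1000) for x = -1000, which overflows to `inf`.
# The overflow warning is silenced here only to keep the comparison quiet.
with np.errstate(over = 'ignore'):
    vNaive = (2.0 / (1.0 + np.exp(-vX))) - 1.0

vStable = ScaledSigmoidStable(vX)

print(np.max(np.abs(vNaive - vStable))) #<! Largest deviation between the two forms
```

The `tanh` form is not differentiable by AutoGrad unless `autograd.numpy` is used in place of `numpy`; the notebook's `SigmoidFun()` remains the reference implementation.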

Since:

$$ \sigma \left( x \right) = 2 \frac{ 1 }{ 1 + \exp \left( -x \right) } - 1 = 2 \frac{ \exp \left( x \right) }{ 1 + \exp \left( x \right) } - 1 $$

The derivative is given by:

$$ \frac{\mathrm{d} \sigma \left( x \right) }{\mathrm{d} x} = 2 \frac{ \exp \left( x \right)}{\left( 1 + \exp \left( x \right) \right)^{2}} = 2 \left( \frac{ 1 }{ 1 + \exp \left( -x \right) } \right) \left( 1 - \frac{ 1 }{ 1 + \exp \left( -x \right) } \right) $$

* <font color='brown'>(**#**)</font> For derivation of the last step, see https://math.stackexchange.com/questions/78575.
* <font color='brown'>(**#**)</font> For information about the objective function in the context of classification see [Stanley Chan - Purdue University - ECE595 / STAT598: Machine Learning I Lecture 14 Logistic Regression](https://engineering.purdue.edu/ChanGroup/ECE595/files/Lecture14_logistic.pdf).
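
The derivative identity above can be checked numerically. The following is a small sketch using a central difference; the helper names and the step `diffStep` are illustrative assumptions.

```python
import numpy as np

def Sigmoid( vX: np.ndarray ) -> np.ndarray:
    # Standard sigmoid: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-vX))

def ScaledSigmoid( vX: np.ndarray ) -> np.ndarray:
    # Scaled and shifted sigmoid: 2 * sigmoid(x) - 1
    return (2.0 * Sigmoid(vX)) - 1.0

def ScaledSigmoidDeriv( vX: np.ndarray ) -> np.ndarray:
    # Closed form derivative: 2 * sigmoid(x) * (1 - sigmoid(x))
    vS = Sigmoid(vX)
    return 2.0 * vS * (1.0 - vS)

vX       = np.linspace(-5.0, 5.0, 11)
diffStep = 1e-6 #<! Assumed finite difference step

# Central difference approximation of the derivative
vNumDeriv = (ScaledSigmoid(vX + diffStep) - ScaledSigmoid(vX - diffStep)) / (2.0 * diffStep)
maxErr    = np.max(np.abs(vNumDeriv - ScaledSigmoidDeriv(vX)))

print(f'Max absolute error: {maxErr}')
```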

## Generate Data


The data for this notebook is a pre-generated classic classification data set known as _moon data_.

In [None]:
# Generate / Load the Data

fileNameExt = gdown.download(id = zipFileId)

with zipfile.ZipFile(fileNameExt, 'r') as oZipFile:
    oZipFile.extract(dataFileName, '.')


In [None]:
# Parse MATLAB Data
dMatFile = sp.io.loadmat(dataFileName)

# 2D Data
mX = dMatFile['mX'] #<! (numSamples x 2)
vY = np.ravel(dMatFile['vY']) #<! Labels (numSamples,)

mX[:, 0] -= 0.5
vY[vY == 0] = -1
vY = vY.astype(np.float64)

numSamples = np.size(mX, 0)

In [None]:
# Display Data

hA = PlotBinaryClassData(mX, vY, axisTitle = 'Binary Classification Data')
hA.set_xlim(tuAxLim)
hA.set_ylim(tuAxLim)
hA.set_xlabel('$x_1$')
hA.set_ylabel('$x_2$');

## Logistic Regression Model

### Data Pre Processing

The data model is an **affine** function of the coordinates.  
Hence a constant term must be added:

$$ \boldsymbol{X} = \begin{bmatrix} -1 & {X}_{1, 1} & {X}_{1, 2} \\ -1 & {X}_{2, 1} & {X}_{2, 2} \\ \vdots & \vdots & \vdots \\ -1 & {X}_{m, 1} & {X}_{m, 2} \end{bmatrix} $$
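
With weights $\boldsymbol{w} = {\left[ {w}_{0}, {w}_{1}, {w}_{2} \right]}^{T}$, each row of $\boldsymbol{X}$ yields $-{w}_{0} + {w}_{1} {x}_{1} + {w}_{2} {x}_{2}$, so ${w}_{0}$ acts as a threshold. A minimal sketch of building such a matrix on toy data (the array `mToy` is an illustrative assumption, not the notebook's data):

```python
import numpy as np

mToy   = np.array([[0.5, 1.0], [-1.5, 2.0], [0.0, -0.5]]) #<! Toy samples (3 x 2)
numToy = mToy.shape[0]

# Option 1: Concatenate a column of -1 along the columns axis
mA = np.concatenate((-np.ones(shape = (numToy, 1)), mToy), axis = 1)

# Option 2: `np.column_stack()` accepts the 1D vector directly
mB = np.column_stack((-np.ones(numToy), mToy))

print(mA)
```

Both options give the same `(numToy x 3)` matrix; the choice is a matter of style.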

In [None]:
### Data Pre Processing

#===========================Fill This===========================#
# 1. Create the variable `mXX` which adds a constant column of -1 to `mX`.
# !! You may find `np.concatenate()` / `np.column_stack()` useful.

mXX = np.concatenate((-np.ones(shape = (numSamples, 1)), mX), axis = 1)
#===============================================================#

* <font color='red'>(**?**)</font> Does the value of the constant column make any difference?

## Optimizing the Objective Function

This section optimizes the objective function using _Gradient Descent_:
 - The step size is adaptive.
 - The gradient function is calculated by numeric / auto differentiation.

### Objective Function

This section implements the objective function.

In [None]:
# Objective Function

#===========================Fill This===========================#
# 1. Implement the objective function. 
#    Given a vector of parameters `vW` it returns the objective.
# 2. The implementation should be using a Lambda Function.
# !! You may use `SigmoidFun()` from above.

hObjFun = lambda vW: 0.5 * anp.sum(anp.square( SigmoidFun(mXX @ vW) - vY ))
#===============================================================#

### Gradient Function

This section implements the gradient function by either _Numeric differentiation_ or _Auto differentiation_.

In [None]:
# Gradient Function

#===========================Fill This===========================#
# 1. Implement the gradient function. 
#    Given a vector of parameters `vW` it returns the gradient at `vW`.
# 2. The implementation should be using a Lambda Function.
# !! You may use Auto Grad or the numeric differentiation in `NumericDiff.py`.

hGradFun = lambda vW: grad(hObjFun)(vW)
#===============================================================#
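
As a cross check of the gradient function, the chain rule applied to the objective gives the closed form $\nabla f \left( \boldsymbol{x} \right) = \boldsymbol{A}^{T} \left( {\sigma}' \left[ \boldsymbol{A} \boldsymbol{x} \right] \odot \left( \sigma \left[ \boldsymbol{A} \boldsymbol{x} \right] - \boldsymbol{y} \right) \right)$. The sketch below compares it to a central difference on random toy data. It is not the notebook's method, and the names `ObjFun()`, `GradFun()` and the toy data are assumptions made for illustration.

```python
import numpy as np

def ScaledSigmoid( vZ: np.ndarray ) -> np.ndarray:
    # 2 / (1 + exp(-z)) - 1 == tanh(z / 2)
    return np.tanh(0.5 * vZ)

def ObjFun( vW: np.ndarray, mA: np.ndarray, vY: np.ndarray ) -> float:
    # 0.5 * || sigma(A w) - y ||_2^2
    return 0.5 * np.sum(np.square(ScaledSigmoid(mA @ vW) - vY))

def GradFun( vW: np.ndarray, mA: np.ndarray, vY: np.ndarray ) -> np.ndarray:
    # Closed form gradient: A^T (sigma'(A w) * (sigma(A w) - y))
    vS = ScaledSigmoid(mA @ vW)
    vD = 0.5 * (1.0 - np.square(vS)) #<! sigma'(z) = (1 - sigma(z)^2) / 2
    return mA.T @ (vD * (vS - vY))

# Toy data (Assumption, not the notebook's data)
oRng = np.random.default_rng(0)
mA   = np.column_stack((-np.ones(20), oRng.standard_normal((20, 2))))
vY   = np.where(oRng.standard_normal(20) > 0.0, 1.0, -1.0)
vW   = oRng.standard_normal(3)

# Central difference, one component at a time
diffStep = 1e-6
vGNum    = np.zeros(3)
for jj in range(3):
    vP        = np.zeros(3)
    vP[jj]    = diffStep
    vGNum[jj] = (ObjFun(vW + vP, mA, vY) - ObjFun(vW - vP, mA, vY)) / (2.0 * diffStep)

maxErr = np.max(np.abs(vGNum - GradFun(vW, mA, vY)))
print(f'Max absolute error: {maxErr}')
```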

### Gradient Descent with Adaptive Step Size

This section implements the _Adaptive Step Size_ for the _Gradient Descent_:

 - Analyze the 1D function: $h \left( \mu \right) = f \left( \boldsymbol{x} - \mu \nabla f \left( \boldsymbol{x} \right) \right)$.
 - Find $\mu$ small enough such that: $h \left( \mu \right) \leq h \left( 0 \right)$.

* <font color='brown'>(**#**)</font> See [Wikipedia - Backtracking Line Search](https://en.wikipedia.org/wiki/Backtracking_line_search).   
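
The rule above can be sketched on a toy quadratic before applying it to the logistic objective. This is an illustration under assumptions: the matrix `mQ`, the starting point and the names `stepSize` (for $\mu$) and `backFactor` (for the shrinking factor) are not part of the notebook.

```python
import numpy as np

mQ = np.diag([1.0, 50.0]) #<! Toy quadratic: f(x) = 0.5 * x^T Q x
hF = lambda vX: 0.5 * (vX @ mQ @ vX) #<! Objective
hG = lambda vX: mQ @ vX #<! Gradient

vX         = np.array([1.0, 1.0])
stepSize   = 1.0 #<! Deliberately too large
backFactor = 0.5
maxNumBack = 20
lObjVal    = [hF(vX)]

for ii in range(50):
    vG = hG(vX)
    kk = 0
    # Shrink the step while h(stepSize) > h(0)
    while (hF(vX - stepSize * vG) > lObjVal[-1]) and (kk < maxNumBack):
        stepSize *= backFactor
        kk       += 1
    vX        = vX - stepSize * vG
    stepSize /= backFactor #<! Try a larger step on the next iteration
    lObjVal.append(hF(vX))

print(f'Initial objective: {lObjVal[0]}, Final objective: {lObjVal[-1]}')
```

Growing the step after each accepted iteration lets the method recover from an overly conservative step. The simple decrease test used here is weaker than the Armijo condition described in the linked article.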


In [None]:
# Gradient Descent with Adaptive Step Size

#===========================Fill This===========================#
# 1. Implement Gradient Descent with an adaptive step size (Backtracking).
#    Given the initialization in the first column of `mX` it fills the rest of the columns with the iterations.
# 2. Shrink the step size `μ` by the factor `α` while the objective does not decrease.
# !! Limit the backtracking by `maxNumBack` and `minμ`.

def GradientDescent( mX: np.ndarray, hGradFun: Callable, hObjFun: Callable, /, *, μ: float = 1e-3, α: float = 0.5, maxNumBack: int = 20, minμ: float = 1e-7 ) -> np.ndarray:
    """
    Input:
      - mX                -   2D Matrix.
                              The first column is the initialization.
                              Structure: Matrix (dataDim * numIterations).
                              Type: 'Single' / 'Double'.
                              Range: (-inf, inf).
      - hGradFun          -   The Gradient Function.
                              A function to calculate the gradient.
                              Its input is the vector `vX`, the location 
                              at which the gradient is evaluated.
                              Structure: NA.
                              Type: Callable.
                              Range: NA.
      - hObjFun           -   The Objective Function.
                              A function to calculate the objective value.
                              Its input is the vector `vX`, the location 
                              at which the objective is evaluated.
                              Structure: NA.
                              Type: Callable.
                              Range: NA.
      - μ                 -   The Step Size.
                              The initial descent step size.
                              Structure: Scalar.
                              Type: 'Single' / 'Double'.
                              Range: (0, inf).
      - α                 -   The Backtracking Factor.
                              The factor which multiplies the step size
                              on each backtracking step.
                              Structure: Scalar.
                              Type: 'Single' / 'Double'.
                              Range: (0, 1).
      - maxNumBack        -   The Maximum Number of Backtracking Steps.
                              The limit on backtracking steps per iteration.
                              Structure: Scalar.
                              Type: 'Int'.
                              Range: {1, 2, ...}.
      - minμ              -   The Minimum Step Size.
                              Backtracking stops once the step size drops below it.
                              Structure: Scalar.
                              Type: 'Single' / 'Double'.
                              Range: (0, inf).
    Output:
      - mX                -   2D Matrix.
                              All iterations results.
                              Structure: Matrix (dataDim * numIterations).
                              Type: 'Single' / 'Double'.
                              Range: (-inf, inf).
    """

    dataDim       = np.size(mX, 0)
    numIterations = np.size(mX, 1)

    for ii in range(1, numIterations):
        vG      = hGradFun(mX[:, ii - 1]) #<! Current gradient
        objVal  = hObjFun(mX[:, ii - 1]) #<! Current objective 
        kk      = 0
        while((hObjFun(mX[:, ii - 1] - μ * vG) > objVal) and (kk < maxNumBack) and (μ >= minμ)):
            kk += 1
            μ  *= α
        
        mX[:, ii] = mX[:, ii - 1] - μ * vG
        μ /= α

    return mX

#===============================================================#

In [None]:
# Solve by Gradient Descent

# Define Data
mW      = np.zeros(shape = (3, numIterations))
vObjVal = np.empty(numIterations)
vE      = np.empty(numIterations) #<! Error rate

# Optimization
mW = GradientDescent(mW, hGradFun, hObjFun, μ = μ, α = α, maxNumBack = maxNumBack, minμ = minμ)

In [None]:
# Validation of Solution

for ii in range(numIterations):
    vObjVal[ii] = hObjFun(mW[:, ii]) / (2 * numSamples) #<! Scaling for classification
    vYEst       = np.sign(mXX @ mW[:, ii])
    vE[ii]      = np.mean(vYEst != vY) #<! Mean Error

* <font color='red'>(**?**)</font> Should the error rate match the objective function?
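
The scaling by `2 * numSamples` in the validation loop follows from the range of the model: with labels in $\left\{ -1, 1 \right\}$ and $\sigma \left( \cdot \right) \in \left( -1, 1 \right)$, each residual lies in $\left( -2, 2 \right)$, so the objective is below $2 N$ for $N$ samples. A small sketch on toy scores (the vectors `vZ` and `vYToy` are illustrative assumptions) shows both quantities on the common $\left[ 0, 1 \right]$ scale:

```python
import numpy as np

vYToy  = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0]) #<! Toy labels
vZ     = np.array([3.0, -0.5, -2.0, 0.1, 4.0, -6.0]) #<! Toy scores, stand in for `mXX @ vW`
numToy = np.size(vYToy)

vS        = np.tanh(0.5 * vZ) #<! Scaled sigmoid: 2 / (1 + exp(-z)) - 1
objVal    = 0.5 * np.sum(np.square(vS - vYToy)) #<! Squared L2 objective
objScaled = objVal / (2 * numToy) #<! Scaled into [0, 1)
errRate   = np.mean(np.sign(vZ) != vYToy) #<! Fraction of misclassified samples

print(f'Scaled objective: {objScaled}, Error rate: {errRate}')
```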

In [None]:
# Plotting Function

# Grid of the data support
vV       = np.linspace(-2, 2, numGridPts)
mX1, mX2 = np.meshgrid(vV, vV)

def PlotLinClassTrain( itrIdx: int, mX: np.ndarray, mW: np.ndarray, vY: np.ndarray, vE: np.ndarray, vL: np.ndarray, mX1: np.ndarray, mX2: np.ndarray ):

    hF, _ = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 6))

    hA1, hA2 = hF.axes[0], hF.axes[1]

    # hA1.cla()
    # hA2.cla()
    
    Plot2DLinearClassifier(mX, vY, mW[:, itrIdx], mX1, mX2, hA1) #<! Assumes the model is [-1, x_1, x_2] * [w1; w2; w3]

    K = np.size(mW, 1) #<! Number of iterations

    vEE = vE[:itrIdx]
    vLL = vL[:itrIdx]

    hA2.plot(vEE, color = 'k', lw = 2, label = r'$J \left( w \right)$')
    hA2.plot(vLL, color = 'm', lw = 2, label = r'$\tilde{J} \left( w \right)$')
    hA2.set_title('Objective Function')
    hA2.set_xlabel('Iteration Index')
    hA2.set_ylabel('Value')
    hA2.set_xlim((0, K - 1))
    hA2.set_ylim((0, 1))
    hA2.grid()
    hA2.legend()
        
    # hF.canvas.draw()
    plt.show()

In [None]:
# Display the Optimization Path

hPlotLinClassTrain = lambda itrIdx: PlotLinClassTrain(itrIdx, mX, mW, vY, vE, vObjVal, mX1, mX2)
kSlider = IntSlider(min = 0, max = numIterations - 1, step = 1, value = 0, description = 'Iteration', layout = Layout(width = '30%'))
interact(hPlotLinClassTrain, itrIdx = kSlider);
