[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io/)

# Optimization Methods

## Convex Optimization - Non Smooth Optimization - Proximal Gradient Method

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 30/09/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0012LinearFitL1.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning

# Optimization
import cvxpy as cp

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

 ```python
 valToFill = ???
 ```

 - Multi Line to Fill (At least one)

 ```python
 # You need to start writing
 ????
 ```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

???
#===============================================================#
```

In [2]:
# Configuration
# %matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [3]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [4]:
# Course Packages

from AuxFun import ProxGradientDescent


In [5]:
# Auxiliary Functions


In [6]:
# Parameters

# Data
csvUrl = r'https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/DataSets/mtcars.csv'

# Solution Path
λ  = 0.5 #<! Verification
vλ = np.linspace(0, 7, 100)

# Solver
μ               = 0.00025
numIterations   = 100_000

# Verification
ε = 1e-6 #<! Error threshold

## Least Squares with ${L}_{1}$ Norm Regularization

The ${L}_{1}$ regularized Least Squares (${L}_{1}$ Regularized LS) is given by:

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{1} $$

Where $\lambda {\left\| \boldsymbol{x} \right\|}_{1}$ is the regularization term and $\lambda \geq 0$ sets the regularization level.

* <font color='brown'>(**#**)</font> The ${L}_{1}$ norm is non smooth.
* <font color='brown'>(**#**)</font> The _median_ of a set of numbers (As a vector) can be defined as $\arg \min_{\alpha} {\left\| \boldsymbol{x} - \alpha \boldsymbol{1} \right\|}_{1}$.
* <font color='brown'>(**#**)</font> The motivation for the ${L}_{1}$ norm can be explained as:
  - [Promoting Sparsity](https://en.wikipedia.org/wiki/Structured_sparsity_regularization).
  - [Laplace Distribution](https://en.wikipedia.org/wiki/Laplace_distribution) Prior.
* <font color='brown'>(**#**)</font> The function $g \left( \boldsymbol{x} \right) = \lambda {\left\| \boldsymbol{x} \right\|}_{1}$ has an efficient _Prox Operator_.
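
The relation between the ${L}_{1}$ norm and the _median_ noted above can be checked numerically. The following is a minimal sketch, assuming a hypothetical sample `vSample` with an outlier (not the notebook's data) and a brute force grid search over the candidates in `vAlpha`:

```python
import numpy as np

# Hypothetical sample with an outlier (Not from the notebook's data)
vSample = np.array([1.0, 2.0, 3.5, 7.0, 40.0])

# Evaluate the L1 cost || x - alpha * 1 ||_1 on a dense grid of candidates
vAlpha = np.linspace(0.0, 45.0, 4501) #<! Grid step of 0.01
vCost  = np.sum(np.abs(vSample[None, :] - vAlpha[:, None]), axis = 1)

alphaL1 = vAlpha[np.argmin(vCost)] #<! Grid minimizer of the L1 cost
alphaL2 = np.mean(vSample)         #<! Minimizer of the squared L2 cost

print(f'L1 minimizer: {alphaL1:0.2f}, Median: {np.median(vSample):0.2f}, Mean: {alphaL2:0.2f}')
```

The grid minimizer of the ${L}_{1}$ cost should match the median up to the grid resolution, while the mean is pulled towards the outlier.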

This notebook covers the solution of the above problem using the Proximal Gradient Descent method.

### The Least Absolute Shrinkage and Selection Operator (LASSO) Model

The [LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics)) model is a well known model in _Statistics_ and _Machine Learning_.  
It uses the ${L}_{1}$ Regularized LS model to perform feature selection for an ML model.  


## Generate Data


The data in this notebook is a very well known data set in ML.  
It is based on the fuel consumption (MPG) of cars along with some technical information about each model.  

The [CSV File](https://raw.githubusercontent.com/FixelAlgorithmsTeam/FixelCourses/master/DataSets/mtcars.csv) includes the features:

 - `mpg` - Miles per Gallon
 - `cyl` - # of cylinders
 - `disp` - displacement, in cubic inches
 - `hp` - horsepower
 - `drat` - driveshaft ratio
 - `wt` - weight
 - `qsec` - 1/4 mile time; a measure of acceleration
 - `vs` - 'V' or straight - engine shape
 - `am` - transmission; auto or manual
 - `gear` - # of gears
 - `carb` - # of carburetors

* <font color='brown'>(**#**)</font> The metadata is from [GitHub Gist - `mtcars.csv`](https://gist.github.com/seankross/a412dfbd88b3db70b74b).

The features are normalized to have zero mean and unit variance.

In [None]:
# Generate / Load the Data

dfData = pd.read_csv(csvUrl) #<! Data Frame

mA  = dfData.iloc[:, 2:].to_numpy() #<! NumPy array (`model` and `mpg` removed)
mA -= np.mean(mA, axis = 0)[None, :] #<! Normalize
mA /= np.std(mA, axis = 0)[None, :] #<! Normalize
vY  = dfData['mpg'].to_numpy() #<! Target (`mpg`)

lFeatureName = dfData.columns[2:]

numSamples  = len(vY)
numFeatures = np.size(mA, 1)
numλ        = len(vλ)

mX  = np.zeros(shape = (numFeatures, numIterations)) #<! Weights per iteration
mXλ = np.zeros(shape = (numFeatures, numλ)) #<! Weights per λ

dfData #<! Display the DF


## Analysis

This section defines the problem and solves it using the _Proximal Gradient Method_ (PGM).

### Objective Function

The objective function (LASSO):

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{1} $$

Since the model is linear and the features are normalized, one can think of the absolute value of the weight of the $i$ -th feature as a measure of its significance.  
This is the concept behind using the LASSO model for feature selection.

In [8]:
# Objective Function

#===========================Fill This===========================#
# 1. Implement the objective function. 
#    Given a vector of `vX` and a scalar `λ` it returns the objective.
# 2. The implementation should be using a Lambda Function.
# !! You may use `np.square()` and / or `np.linalg.norm()`.
# !! Pay attention to the variable of the labels.

hObjFun = lambda vX, λ: ???
#===============================================================#

### Proximal Operator

The Proximal Operator of a function $g \left( \cdot \right)$ is given by:

$$ \operatorname{prox}_{\lambda g \left( \cdot \right)} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda g \left( \boldsymbol{x} \right) $$

* <font color='brown'>(**#**)</font> The Proximal Operator can be thought of as a generalization of a projection operator.
* <font color='brown'>(**#**)</font> The Proximal Operator can be used to generalize the _Gradient_.
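
The definition above can be evaluated numerically for any scalar convex function. The following is a minimal sketch, assuming a hypothetical helper `ProxNumeric()` built on `scipy.optimize.minimize_scalar()` and the smooth example $g \left( x \right) = {x}^{2}$ (chosen so the closed form $\frac{y}{1 + 2 \lambda}$ follows directly from the optimality condition $x - y + 2 \lambda x = 0$):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def ProxNumeric(hG, valY, paramLambda):
    # Hypothetical helper: Evaluates the prox of a scalar function by direct minimization
    hCost = lambda valX: 0.5 * (valX - valY) ** 2 + paramLambda * hG(valX)
    return minimize_scalar(hCost).x

hG          = lambda valX: valX ** 2 #<! Smooth example, g(x) = x^2
paramLambda = 0.5

vY       = np.linspace(-3, 3, 13)
vProxNum = np.array([ProxNumeric(hG, valY, paramLambda) for valY in vY])
vProxRef = vY / (1 + 2 * paramLambda) #<! Closed form from the optimality condition

print(f'Max absolute deviation: {np.max(np.abs(vProxNum - vProxRef))}')
```

A numeric evaluation like this is slow. It is only meant as a sanity check for a closed form derivation.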


#### Question 001

Derive the Prox of the ${L}_{1}$ norm:

$$ \operatorname{prox}_{\lambda {\left\| \cdot \right\|}_{1}} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{1} $$

Hints:

 - Note that the problem is separable.
 - Per element, first solve assuming the solution is positive, then assuming it is negative.

#### Solution 001

<font color='red'>??? Fill the answer here ???</font>

---

In [9]:
# The Proximal Function

#===========================Fill This===========================#
# 1. Implement the prox operator function of the `λ || ||_1` function. 
#    Given a vector `vY` and `λ` it returns the proximal at `vY`.
# 2. The implementation should be using a Lambda Function.
# !! You may assume `λ` > 0.

hProxFun = lambda vY, λ: ???
#===============================================================#

In [None]:
# Validation 
# The proximal operator is the solution of a convex problem.
# The operator can be verified by DCP.

# Model Data
vYY = np.linspace(-5, 5, 101)
vXX = cp.Variable(len(vYY))

# Model Problem
cpObjFun = cp.Minimize(0.5 * cp.sum_squares(vXX - vYY) + λ * cp.norm(vXX, 1)) #<! Objective Function
lConst   = [] #<! Constraints
oCvxPrb  = cp.Problem(cpObjFun, lConst) #<! Problem

oCvxPrb.solve(solver = cp.SCS)

assert (oCvxPrb.status == 'optimal'), 'The problem is not solved.'
print('Problem is solved.')

assertCond = np.linalg.norm(vXX.value - hProxFun(vYY, λ), np.inf) <= (ε * max(np.linalg.norm(vXX.value), ε))
assert assertCond, f'The prox calculation deviation exceeds the threshold {ε}'

print('The prox implementation is verified')



In [None]:
# Display the Operator

hF, hA = plt.subplots(figsize = (10, 6))
hA.plot(vYY, vYY, lw = 2, label = 'Input')
hA.plot(vYY, hProxFun(vYY, λ), lw = 2, label = 'Soft Threshold')
hA.set_title(r'The Prox Operator of ${L}_{1}$' + f' Norm, λ = {λ: 0.2f}')
hA.set_xlabel('Input Value')
hA.set_ylabel('Output Value')

hA.legend();

## Proximal Gradient Method

For the composite model:

$$ F \left( \boldsymbol{x} \right) = f \left( \boldsymbol{x} \right) + \lambda g \left( \boldsymbol{x} \right) $$

Where $f \left( \boldsymbol{x} \right)$ is smooth and convex and $g \left( \boldsymbol{x} \right)$ is convex with a given prox operator.

The method iteration is given by:

$$ \boldsymbol{x}^{k + 1} = \operatorname{prox}_{ {\mu}_{k} \lambda g \left( \cdot \right) } \left( \boldsymbol{x}^{k} - {\mu}_{k} \nabla f \left( \boldsymbol{x}^{k} \right) \right) $$

Where ${\mu}_{k}$ is the step size.

* <font color='red'>(**?**)</font> For which $g$ the above becomes a _Projected Gradient_ descent?
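
The iteration can be sketched in a few lines. The following is a minimal sketch, assuming a hypothetical function `ProxGradSketch()` (not the course's `ProxGradientDescent` class), random synthetic data and the smooth regularizer $g \left( \boldsymbol{x} \right) = \frac{1}{2} {\left\| \boldsymbol{x} \right\|}_{2}^{2}$, whose prox is $\frac{\boldsymbol{y}}{1 + t}$. This choice is only for illustration, as it allows comparing the result to a closed form solution:

```python
import numpy as np

def ProxGradSketch(vX0, hGradFun, hProxFun, stepSize, paramLambda, numIter):
    # Hypothetical helper (Not the course's `ProxGradientDescent` class)
    vX = vX0.copy()
    for _ in range(numIter):
        vG = hGradFun(vX)                                         #<! Gradient of the smooth term
        vX = hProxFun(vX - stepSize * vG, stepSize * paramLambda) #<! Prox with the scaled parameter
    return vX

oRng        = np.random.default_rng(0)
mA          = oRng.standard_normal((20, 5)) #<! Synthetic model matrix
vB          = oRng.standard_normal(20)      #<! Synthetic measurements
paramLambda = 0.7

hGradFun = lambda vX: mA.T @ (mA @ vX - vB)      #<! Gradient of 0.5 * || A x - b ||_2^2
hProxFun = lambda vY, paramT: vY / (1.0 + paramT) #<! Prox of t * 0.5 * || x ||_2^2
stepSize = 1.0 / (np.linalg.norm(mA, 2) ** 2)    #<! Constant step size

vXPgm = ProxGradSketch(np.zeros(5), hGradFun, hProxFun, stepSize, paramLambda, 2000)
vXRef = np.linalg.solve(mA.T @ mA + paramLambda * np.eye(5), mA.T @ vB) #<! Closed form solution

print(f'Deviation from the closed form solution: {np.linalg.norm(vXPgm - vXRef)}')
```

Replacing `hProxFun` with the prox of a different $g$ changes the problem being solved while the loop itself stays the same.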

In [12]:
# Gradient Function

#===========================Fill This===========================#
# 1. Implement the gradient function of (1/2) * || A x - y ||_2^2. 
#    Given a vector `vX` it returns the gradient at `vX`.
# 2. The implementation should be using a Lambda Function.
# !! You may pre calculate terms for efficient code.

hGradFun = ???
#===============================================================#


In [13]:
# Proximal Gradient Method (PGM)

oProxGrad = ProxGradientDescent(np.zeros(numFeatures), hGradFun, μ, λ, hProxFun = hProxFun)
lX = oProxGrad.ApplyIterations(numIterations)


 * <font color='brown'>(**#**)</font> The size of $\mu$ which guarantees convergence depends on the smoothness of $f$, namely the Lipschitz constant $L$ of its gradient. A constant step size $\mu \leq \frac{1}{L}$ guarantees convergence.  
   For cases where $f$ is the Linear Least Squares problem the constant is given by ${\left\| \boldsymbol{A} \right\|}_{2}^{2}$, namely the square of the largest singular value of $\boldsymbol{A}$.
 * <font color='brown'>(**#**)</font> One could implement adaptive step size in a similar manner to Gradient Descent with a different decision rule.
 * <font color='blue'>(**!**)</font> Go through the implementation of `ProxGradientDescent`.
 * <font color='blue'>(**!**)</font> Edit the code to use $\mu$ set by the Lipschitz constant of the gradient of $f$.
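
The Lipschitz constant of the Least Squares gradient can be checked numerically. The following is a minimal sketch, assuming a hypothetical random matrix `mA` (not the notebook's data). It compares the squared spectral norm to the largest eigenvalue of $\boldsymbol{A}^{T} \boldsymbol{A}$ and empirically tests the Lipschitz inequality on random pairs of points:

```python
import numpy as np

oRng = np.random.default_rng(1)
mA   = oRng.standard_normal((30, 6)) #<! Hypothetical model matrix
vB   = oRng.standard_normal(30)

hGradFun = lambda vX: mA.T @ (mA @ vX - vB) #<! Gradient of 0.5 * || A x - b ||_2^2

lipConst = np.linalg.norm(mA, 2) ** 2            #<! Squared largest singular value
maxEig   = np.max(np.linalg.eigvalsh(mA.T @ mA)) #<! Largest eigenvalue of A^T A
stepSize = 1.0 / lipConst                        #<! Step size based on the constant

# Empirical check of || grad f(x) - grad f(z) || <= L * || x - z || on random pairs
maxRatio = 0.0
for _ in range(1000):
    vX = oRng.standard_normal(6)
    vZ = oRng.standard_normal(6)
    valRatio = np.linalg.norm(hGradFun(vX) - hGradFun(vZ)) / np.linalg.norm(vX - vZ)
    maxRatio = max(maxRatio, valRatio)

print(f'L = {lipConst:0.4f}, Max eigenvalue = {maxEig:0.4f}, Max observed ratio = {maxRatio:0.4f}')
```

The observed ratio never exceeds the constant, and the two ways of calculating it agree up to numerical precision.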

### The DCP Solution

This section solves the problem using a DCP solver.

In [None]:
# DCP Solution

#===========================Fill This===========================#
# 1. Formulate the problem in CVXPY.  
#    Use `vXRef` for the optimal argument.
# !! You may find `cp.sum_squares()` and `cp.norm()` useful.

# Model Data
vXRef = ???

# Model Problem
cpObjFun = ??? #<! Objective Function
cpConst  = ??? #<! Constraints
oCvxPrb  = ??? #<! Problem

oCvxPrb.solve(solver = cp.SCS)
#===============================================================#

vXRef = vXRef.value

assert (oCvxPrb.status == 'optimal'), 'The problem is not solved.'
print('Problem is solved.')

assertCond = abs(hObjFun(vXRef, λ) - hObjFun(lX[-1], λ)) <= (ε * max(abs(hObjFun(vXRef, λ)), ε))
assert assertCond, f'The optimization calculation deviation {abs(hObjFun(vXRef, λ) - hObjFun(lX[-1], λ))} exceeds the threshold {ε}'

print('The implementation is verified')

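 * <font color='brown'>(**#**)</font> The _Proximal Gradient Method_ has the same convergence rate as the _Gradient Descent_: $O \left( \frac{1}{k} \right)$ in the objective value for a convex $f$ with a Lipschitz continuous gradient and a constant step size.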
 * <font color='brown'>(**#**)</font> In practice, the convergence rate will depend on the value of $\lambda$.

In [18]:
# Solution Analysis

objValRef   = hObjFun(vXRef, λ)
vObjVal     = np.empty(numIterations)
vArgErr     = np.empty(numIterations)

for ii in range(numIterations):
    vObjVal[ii] = hObjFun(lX[ii], λ)
    vArgErr[ii] = np.linalg.norm(lX[ii] - vXRef)

vObjVal = 20 * np.log10(np.maximum(np.abs(vObjVal - objValRef), np.sqrt(np.spacing(1.0))) / max(np.abs(objValRef), np.sqrt(np.spacing(1.0))))
vArgErr = 20 * np.log10(np.maximum(np.abs(vArgErr), np.sqrt(np.spacing(1.0))) / max(np.linalg.norm(vXRef), np.sqrt(np.spacing(1.0))))

In [None]:
# Display Results

hF, hA = plt.subplots(figsize = (12, 6))
hA.plot(range(numIterations), vObjVal, lw = 2, label = 'Objective Function')
hA.plot(range(numIterations), vArgErr, lw = 2, label = 'Argument Error')
hA.set_xlabel('Iteration Index')
hA.set_ylabel('Relative Error [dB]')
hA.set_title('Proximal Gradient Method Convergence')

hA.legend();

* <font color='red'>(**?**)</font> Is the `Argument Error` guaranteed to converge to the same value as the DCP solution?  
  Think about the type of convexity of the problem.
* <font color='red'>(**?**)</font> Change the value of $\lambda$. What's the effect on convergence?

## Feature Selection by LASSO

This section analyzes the LASSO Path: The solution as a function of $\lambda$.  
It is used for feature selection: Features whose weights do not vanish for high values of $\lambda$ are considered more significant.

In [20]:
# Calculate the LASSO Path

for ii, valλ in enumerate(vλ):
    oProxGrad = ProxGradientDescent(np.zeros(numFeatures), hGradFun, μ, numSamples * valλ, hProxFun = hProxFun, useAccel = True)
    oProxGrad.ApplyIterations(numIterations // 10, logArg = False)
    mXλ[:, ii] = np.fabs(oProxGrad.vX) #<! Significance is in absolute value


In [None]:
# Display the Lasso Path

hF, hA = plt.subplots(figsize = (10, 8))
for ii in range(numFeatures):
    hA.plot(vλ, mXλ[ii, :], lw = 2, label = lFeatureName[ii])

hA.set_title('Feature Significance to Estimate MPG')
hA.set_xlabel('λ')
hA.set_ylabel('Significance (Feature Absolute Weight)')

hA.legend();

* <font color='brown'>(**#**)</font> There are specialized algorithms for the LASSO Path (See [Least Angle Regression (LARS)](https://en.wikipedia.org/wiki/Least-angle_regression)).
* <font color='brown'>(**#**)</font> Even with the solver above, adding a stopping rule and a warm start (initializing each $\lambda$ with the solution of the previous one) would make the calculation much faster.
* <font color='red'>(**?**)</font> Explain the solution at $\lambda \to \infty$ and $\lambda \to 0$.
* <font color='red'>(**?**)</font> Explain the intuition about the feature significance.
* <font color='red'>(**?**)</font> Why is the significance not monotonic?