[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Optimization Methods

## Convex Optimization - Non Smooth Optimization - Proximal Gradient Method

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 03/10/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0012LinearFitL1.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning
from sklearn.datasets import load_diabetes

# Optimization

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [None]:
# Configuration
# %matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512 # Try: 23, 12, 20
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #>! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [None]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2

In [None]:
# Course Packages

from AuxFun import ProxGradientDescent
from AuxFun import ProjectL1Ball

In [None]:
# Auxiliary Functions


In [None]:
# Parameters

# Data
numGridPts  = 25
polyDeg     = 5 #<! Polynomial Degree
numFeatures = 3
noiseStd    = 0.085

# Solution Path
λ  = 0.0075 #<! Verification
vλ = np.linspace(0, 5, 1000)

# Solver
μ               = 0.0075
numIterations   = 75_000

# # Verification
ε = 1e-6 #<! Error threshold

## Least Squares with ${L}_{0}$ **Pseudo** Norm Regularization

The ${L}_{0}$ regularized Least Squares (${L}_{0}$ Regularized LS) is given by:

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{0} $$

Where $\lambda {\left\| \boldsymbol{x} \right\|}_{0}$ is the regularization term and $\lambda \geq 0$ sets the regularization level.

* <font color='brown'>(**#**)</font> The ${L}_{0}$ is **not a norm**.
* <font color='brown'>(**#**)</font> The ${L}_{0}$ counts the non zero elements: ${\left\| \boldsymbol{x} \right\|}_{0} = \sum_{i} {I}_{ \neq 0 } \left( {x}_{i} \right)$.
* <font color='brown'>(**#**)</font> The _mode_ of a set of numbers (As a vector) can be defined as $\arg \min_{\alpha} {\left\| \boldsymbol{x} - \alpha \boldsymbol{1} \right\|}_{0}$.
* <font color='brown'>(**#**)</font> The motivation for the ${L}_{0}$ norm can be explained as:
  - Sparsity of the weights solution.
  - [Bernoulli Distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) Prior for the existence of the weight.  
    For exact derivation see [Sparsifying Parametric Models with L0 Regularization](https://arxiv.org/abs/2409.03489).
* <font color='brown'>(**#**)</font> The function $g \left( \boldsymbol{x} \right) = \lambda {\left\| \boldsymbol{x} \right\|}_{0}$ has an efficient _Prox Operator_.
* <font color='brown'>(**#**)</font> There are some approximations of the ${L}_{0}$ norm. See [Convex Optimization with ${L}_{0}$ Pseudo Norm](https://math.stackexchange.com/questions/1862775).
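
The notes above on the counting definition, the lack of the norm property and the _mode_ can be illustrated with a minimal sketch. The helper `L0PseudoNorm()` and the sample vectors are hypothetical and are not part of the course code:

```python
import numpy as np

def L0PseudoNorm(vX: np.ndarray) -> int:
    # Counts the non zero elements of the vector (Hypothetical helper)
    return int(np.count_nonzero(vX))

vX = np.array([0.0, -1.5, 0.0, 2.0, 0.25])

l0Val       = L0PseudoNorm(vX)        #<! Number of non zero elements
l0ValScaled = L0PseudoNorm(10.0 * vX) #<! Scaling does not change the count

# A norm must obey ||a x|| = |a| ||x||, the count does not
isHomogeneous = (l0ValScaled == 10.0 * l0Val)

# The mode as the minimizer of || z - α 1 ||_0
# Only values which appear in the vector can zero out an element
vZ      = np.array([3.0, 1.0, 3.0, 2.0, 3.0])
vα      = np.unique(vZ) #<! Candidates
vCost   = np.array([L0PseudoNorm(vZ - α) for α in vα])
modeVal = vα[np.argmin(vCost)]

print(f'L0 value: {l0Val}, L0 value of scaled vector: {l0ValScaled}')
print(f'Homogeneous: {isHomogeneous}, Mode: {modeVal}')
```

The same count is returned by `np.linalg.norm(vX, ord = 0)`, which is the form used in the objective function of this notebook.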

The motivation for regularization is:

 - Include prior knowledge into the model.
 - Avoid overfitting.
 - Make underdetermined systems solvable.

This notebook covers the solution of the above problem using the Proximal Gradient Descent method.


## Generate Data


Building a sparse model of the parameters to optimize by.  
The feature space is a polynomial where the data is generated by a sub set of coefficients.

In [None]:
# Generate / Load the Data

vA = np.sort(np.random.rand(numGridPts)) #<! Grid (Random samples, Sorted in ascending manner)
mA = np.power(vA[:, None], range(polyDeg + 1)) #<! Model Matrix

vXRef           = np.zeros(polyDeg + 1)
vFeatIdx        = np.random.choice(polyDeg + 1, numFeatures, replace = False) #<! Active features index
vXRef[vFeatIdx] = np.random.randn(numFeatures); #<! Active features

vN = noiseStd * np.random.randn(numGridPts) #<! Noise Samples
vS = mA @ vXRef
vY = vS + vN

## Analysis

This section defines the problem and solves it using the _Proximal Gradient Method_ (PGM).

### Objective Function

The objective function:

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{0} $$

Since the model "punishes" the number of non zero elements, regardless of their magnitude, it promotes sparse solutions.

In [None]:
# Objective Function

#===========================Fill This===========================#
# 1. Implement the objective function. 
#    Given a vector of `vX` and a scalar `λ` it returns the objective.
# 2. The implementation should be using a Lambda Function.
# !! You may use `np.square()` and / or `np.linalg.norm()`.
# !! Pay attention to the variable of the labels.

hObjFun = lambda vX, λ: 0.5 * np.square(np.linalg.norm(mA @ vX - vY)) + λ * np.linalg.norm(vX, ord = 0)
#===============================================================#

### Proximal Operator

The Proximal Operator of a function $g \left( \cdot \right)$ is given by:

$$ \operatorname{prox}_{\lambda g \left( \cdot \right)} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda g \left( \boldsymbol{x} \right) $$

* <font color='brown'>(**#**)</font> The Proximal Operator can be thought of as a generalization of a projection operator.
* <font color='brown'>(**#**)</font> The Proximal Operator can be used to generalize the _Gradient_.


#### Question 001

The Prox of the ${L}_{0}$ norm:

$$ \operatorname{prox}_{\lambda {\left\| \cdot \right\|}_{0}} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{0} $$

Hints:

 - Pay attention that the problem is separable.
 - Per element, first solve with the assumption the solution is ${x}_{i} = 0$ and then for ${x}_{i} = {y}_{i}$.  
   Explain why those are the 2 options.

* <font color='red'>(**?**)</font> Is the problem convex?

#### Solution 001

The Optimization Problem given by the Prox Operator:

$$ \operatorname{prox}_{\lambda {\left\| \cdot \right\|}_{0}} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \left\{ \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{0} \right\} $$

This problem is separable with respect to both $ \boldsymbol{x} $ and $ \boldsymbol{y} $ hence one could solve the following problem:

$$ \arg \min_{ {x}_{i} } \left\{ \frac{1}{2} {\left( {x}_{i} - {y}_{i} \right)}^{2} + \lambda {I}_{\neq 0} \left( {x}_{i} \right) \right\} $$

Working on cases:

 - Assuming ${x}_{i} = {y}_{i}$ the loss is given by $\lambda$.  
 - Assuming ${x}_{i} = 0$ the loss is given by $\frac{1}{2} {y}_{i}^{2}$.  

Any other choice, ${x}_{i} \neq 0$ and ${x}_{i} \neq {y}_{i}$, pays the same penalty $\lambda$ plus a strictly positive quadratic term. Hence its cost is larger than the cost of ${x}_{i} = {y}_{i}$ and it can not be optimal.

In summary:

$$ {x}_{i} = \operatorname{prox}_{\lambda {\left\| \cdot \right\|}_{0}} \left( \boldsymbol{y} \right)_{i} = \begin{cases}
{y}_{i} & \text{ if } \frac{1}{2} {y}_{i}^{2} > \lambda \\ 
0 & \text{ if } \frac{1}{2} {y}_{i}^{2} \leq \lambda 
\end{cases} $$ 



 * <font color='brown'>(**#**)</font> The solution is called the _Hard Thresholding Operator_.
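
The closed form above can be checked numerically, element by element, against a brute force search. This is a minimal sketch under stated assumptions: the names `HardThreshold()` and `ProxObjective()`, the value of `λ` and the sample vector are hypothetical and chosen only for illustration. The condition $\frac{1}{2} {y}_{i}^{2} > \lambda$ is written in its equivalent form $\left| {y}_{i} \right| > \sqrt{2 \lambda}$:

```python
import numpy as np

def HardThreshold(vY: np.ndarray, λ: float) -> np.ndarray:
    # Keeps the elements with |y_i| > sqrt(2 λ), zeros the others
    return np.where(np.abs(vY) > np.sqrt(2.0 * λ), vY, 0.0)

def ProxObjective(vX: np.ndarray, valY: float, λ: float) -> np.ndarray:
    # The per element objective evaluated on a vector of candidates
    return 0.5 * np.square(vX - valY) + λ * (vX != 0.0)

λ  = 0.5 #<! The threshold is sqrt(2 λ) = 1
vY = np.array([-2.0, -0.5, 0.0, 0.75, 1.5])
vX = HardThreshold(vY, λ)

# Brute force per element: A dense grid with the 2 analytic candidates added
vGrid      = np.linspace(-3.0, 3.0, 601)
vBruteCost = np.empty_like(vY)
vProxCost  = np.empty_like(vY)
for ii, valY in enumerate(vY):
    vCand          = np.concatenate((vGrid, [0.0, valY]))
    vBruteCost[ii] = np.min(ProxObjective(vCand, valY, λ))
    vProxCost[ii]  = ProxObjective(np.array([vX[ii]]), valY, λ)[0]

print(f'Hard Threshold output: {vX}')
print(f'Costs match: {np.allclose(vBruteCost, vProxCost)}')
```

The comparison is done on the objective values and not on the arguments since at $\frac{1}{2} {y}_{i}^{2} = \lambda$ both ${x}_{i} = 0$ and ${x}_{i} = {y}_{i}$ attain the same cost.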

---

In [None]:
# The Proximal Function

#===========================Fill This===========================#
# 1. Implement the prox operator function of the `λ || ||_0` function. 
#    Given a vector `vY` and `λ` it returns the proximal at `vY`.
# 2. The implementation should be using a Lambda Function.
# !! You may assume `λ` > 0.
# !! You may find `np.where()` useful.

hProxFun = lambda vY, λ: np.where(0.5 * np.square(vY) > λ, vY, 0)
# hProxFun = lambda vY, λ: (np.abs(vY) > np.sqrt(2 * λ)) * vY
#===============================================================#

* <font color='red'>(**?**)</font> Could the function be validated with DCP?

In [None]:
# Validation 
# Using SciPy with a gradient free method global optimizer.

# Model Data
vYY = np.linspace(-0.5, 0.5, 41)
hMinFun = lambda vX: 0.5 * np.square(np.linalg.norm(vX - vYY)) + λ * np.linalg.norm(vX, 0) #<! Objective function

# Model Problem
sOptRes = sp.optimize.direct(hMinFun, sp.optimize.Bounds(-3 * np.ones_like(vYY), 3 * np.ones_like(vYY)), maxfun = 10_000 * len(vYY), maxiter = 50_000)
vXX = sOptRes.x

assert (sOptRes.success), 'The problem is not solved.'
print('Problem is solved.')

In [None]:
# Display the Operator

hF, hA = plt.subplots(figsize = (10, 6))
hA.plot(vYY, vYY, lw = 2, label = 'Input')
hA.plot(vYY, hProxFun(vYY, λ), lw = 2, label = f'Hard Threshold Prox, Objective Value: {hMinFun(hProxFun(vYY, λ))}')
hA.plot(vYY, vXX, lw = 2, label = f'Global Optimization SciPy, Objective Value: {hMinFun(vXX)}')
hA.set_title(r'The Prox Operator of ${L}_{0}$' + f' Norm, λ = {λ:0.4f}')
hA.set_xlabel('Input Value')
hA.set_ylabel('Output Value')

hA.legend();

* <font color='red'>(**?**)</font> How come the Prox method achieves a better objective value than the _Global Optimizer_?

## Proximal Gradient Method

For the composite model of:

$$ F \left( \boldsymbol{x} \right) = f \left( \boldsymbol{x} \right) + \lambda g \left( \boldsymbol{x} \right) $$

Where $f \left( \boldsymbol{x} \right)$ is smooth and convex and $g \left( \boldsymbol{x} \right)$ is convex with a given prox operator.

The method iteration is given by:

$$ \boldsymbol{x}^{k + 1} = \operatorname{prox}_{ {\mu}_{k} \lambda g \left( \cdot \right) } \left( \boldsymbol{x}^{k} - {\mu}_{k} \nabla f \left( \boldsymbol{x}^{k} \right) \right) $$

Where ${\mu}_{k}$ is the step size.

* <font color='red'>(**?**)</font> For which $g$ does the above become a _Projected Gradient Descent_?
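
The iteration above can be sketched as a plain loop: a gradient step on $f$ with step size $\mu$ followed by the prox with the parameter $\mu \lambda$. This is a minimal sketch with a constant step size and without acceleration. It is not the implementation of `ProxGradientDescent` in `AuxFun`. The function name `ProximalGradientMethod()` and the toy problem (identity model matrix) are assumptions made for illustration:

```python
import numpy as np

def ProximalGradientMethod(vX0, hGradFun, hProxFun, μ, λ, numIter):
    # hGradFun(vX)    -> The gradient of the smooth term at `vX`
    # hProxFun(vY, λ) -> The prox of `λ g()` at `vY`
    vX = np.copy(vX0)
    for _ in range(numIter):
        vG = hGradFun(vX)               #<! Gradient of the smooth term
        vX = hProxFun(vX - μ * vG, μ * λ) #<! Gradient step followed by the prox
    return vX

# Toy problem: f(x) = 0.5 * || x - y ||_2^2 (Model matrix is the identity)
vY = np.array([2.0, 0.1, -1.5])
λ  = 0.5
μ  = 1.0 #<! The Lipschitz constant of the gradient is 1

hGradFun = lambda vX: vX - vY
hProxFun = lambda vZ, λ: np.where(0.5 * np.square(vZ) > λ, vZ, 0.0) #<! Hard Threshold

vXEst = ProximalGradientMethod(np.zeros(3), hGradFun, hProxFun, μ, λ, 10)
print(vXEst)
```

For this toy problem a single iteration already reaches the hard threshold of `vY`, and further iterations leave it unchanged. For a general model matrix with the non convex ${L}_{0}$ term the iterations may settle at a point which is not the global minimizer.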

In [None]:
# Gradient Function

#===========================Fill This===========================#
# 1. Implement the gradient of the function (1/2) * || A x - y ||_2^2. 
#    Given a vector `vX` it returns the gradient at `vX`.
# 2. The implementation should be using a Lambda Function.
# !! You may pre calculate terms for efficient code.

mAA = mA.T @ mA
vAy = mA.T @ vY

hGradFun = lambda vX: mAA @ vX - vAy
#===============================================================#

In [None]:
# Proximal Gradient Method (PGM)

oProxGrad = ProxGradientDescent(np.zeros(polyDeg + 1), hGradFun, μ, λ, hProxFun = hProxFun, useAccel = True)
lX = oProxGrad.ApplyIterations(numIterations)

 * <font color='brown'>(**#**)</font> The size of $\mu$ which guarantees convergence depends on the smoothness of $f$, namely the Lipschitz constant of its gradient.  
   For cases where $f$ is the Linear Least Squares problem the constant is given by ${\left\| \boldsymbol{A} \right\|}_{2}^{2}$, namely the square of the largest singular value of $\boldsymbol{A}$.
 * <font color='brown'>(**#**)</font> One could implement adaptive step size in a similar manner to Gradient Descent with a different decision rule.
 * <font color='blue'>(**!**)</font> Go through the implementation of `ProxGradientDescent`.
 * <font color='blue'>(**!**)</font> Edit the code to use $\mu$ set by the Lipschitz constant of the gradient of $f$.
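
The constant described in the notes above can be computed directly from the model matrix. This is a minimal sketch which assumes a random stand in for `mA` and the common choice of $\mu = \frac{1}{L}$. The variable names `lipConst` and `maxEig` are hypothetical:

```python
import numpy as np

oRng = np.random.default_rng(0)
mA   = oRng.standard_normal((25, 6)) #<! Stand in for the model matrix of the notebook

# The spectral norm (`ord = 2`) is the largest singular value of the matrix
lipConst = np.square(np.linalg.norm(mA, 2)) #<! Lipschitz constant of the gradient
μ        = 1.0 / lipConst                   #<! Step size (Assumed choice: 1 / L)

# Sanity check: The constant equals the largest eigenvalue of A^T A
maxEig = np.linalg.eigvalsh(mA.T @ mA)[-1] #<! Eigenvalues are sorted in ascending order

print(f'Lipschitz constant: {lipConst:0.4f}, Step size: {μ:0.6f}')
```

Calculating the spectral norm requires an SVD of the matrix. For large problems an upper bound, such as the squared Frobenius norm, gives a smaller yet valid step size.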

In [None]:
# Solution Analysis

objValRef   = hObjFun(vXRef, λ)
vObjVal     = np.empty(numIterations)
vArgErr     = np.empty(numIterations)

for ii in range(numIterations):
    vObjVal[ii] = hObjFun(lX[ii], λ)
    vArgErr[ii] = np.linalg.norm(lX[ii] - vXRef)

vObjVal = 20 * np.log10(np.maximum(np.abs(vObjVal - objValRef), np.sqrt(np.spacing(1.0))) / max(np.abs(objValRef), np.sqrt(np.spacing(1.0))))
vArgErr = 20 * np.log10(np.maximum(np.abs(vArgErr), np.sqrt(np.spacing(1.0))) / max(np.linalg.norm(vXRef), np.sqrt(np.spacing(1.0))))

In [None]:
# Display Results

hF, hA = plt.subplots(figsize = (12, 6))
hA.plot(range(numIterations), vObjVal, lw = 2, label = 'Objective Function')
hA.plot(range(numIterations), vArgErr, lw = 2, label = 'Argument Error')
hA.set_xlabel('Iteration Index')
hA.set_ylabel('Relative Error [dB]')
hA.set_title('Proximal Gradient Method Convergence')

hA.legend();

## Polynomial Regression

This section shows the difference between the Least Squares solution and the ${L}_{0}$ regularized Least Squares solution.

In [None]:
# Least Squares Solution

#===========================Fill This===========================#
# 1. Calculate the LS solution of the problem.
#    The polynomial model is `mA * vXLS ≈ vY`.
# !! You may find `sp.linalg.lstsq()` useful.

vXLS, *_ = sp.linalg.lstsq(mA, vY)
#===============================================================#

In [None]:
# Display the Results

hF, hA = plt.subplots(figsize = (10, 8))
hA.plot(vA, vS, lw = 2, label = 'Model Data')
hA.plot(vA, vY, ls = 'None', marker = '*', markersize = 5, label = 'Data Samples')
hA.plot(vA, mA @ vXLS, lw = 2, label = 'Least Squares')
hA.plot(vA, mA @ lX[-1], lw = 2, label = 'L0 Regularized Least Squares')
hA.set_xlim((0, 1))
hA.set_title('Polynomial Model - Estimation from Data Samples')
hA.set_xlabel('x')
hA.set_ylabel('y')

hA.legend();

* <font color='brown'>(**#**)</font> The regularization limits the _number of active_ coefficients.
* <font color='red'>(**?**)</font> Is the result better? Should it be better?
* <font color='red'>(**?**)</font> Is it robust to the outlier at the top left?

## Feature Importance

This section shows the ability to select features using the _Sparsity_ property of the solution.  
The dataset is the [Diabetes Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html).


In [None]:
# Load Data

dfX, dsY = load_diabetes(return_X_y = True, as_frame = True)