[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Optimization Methods

## Convex Optimization - Non Smooth Optimization - Proximal Gradient Method

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 01/10/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0012LinearFitL1.ipynb)

In [None]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning

# Optimization
import cvxpy as cp

# Miscellaneous
import os
import math
from platform import python_version
import random

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
valToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [2]:
# Configuration
# %matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [3]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [4]:
# Course Packages

from AuxFun import ProxGradientDescent
from AuxFun import ProjectL1Ball


In [5]:
# Auxiliary Functions


In [6]:
# Parameters

# Data
numGridPts  = 25
polyDeg     = 5 #<! Polynomial Degree
numFeatures = 3
noiseStd    = 0.085

# Solution Path
λ  = 0.7 #<! Verification

# Solver
μ               = 0.0435
numIterations   = 50_000

# # Verification
ε = 1e-6 #<! Error threshold

## Least Squares with ${L}_{\infty}$ Norm Regularization

The ${L}_{\infty}$ regularized Least Squares (${L}_{\infty}$ Regularized LS) is given by:

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{\infty} $$

Where $\lambda {\left\| \boldsymbol{x} \right\|}_{\infty}$ is the regularization term and $\lambda \geq 0$ sets the regularization level.

* <font color='brown'>(**#**)</font> The ${L}_{\infty}$ norm is non smooth.
* <font color='brown'>(**#**)</font> The motivation for the ${L}_{\infty}$ norm can be explained as:
  - Limit the Range of Weights.
  - [Bounded Uniform Distribution](https://en.wikipedia.org/wiki/Continuous_uniform_distribution) Prior.  
    For exact derivation see [The Prior Behind The ${L}_{\infty}$ Norm](https://stats.stackexchange.com/questions/438429).
* <font color='brown'>(**#**)</font> The function $g \left( \boldsymbol{x} \right) = \lambda {\left\| \boldsymbol{x} \right\|}_{\infty}$ has an efficient _Prox Operator_.

The motivation for regularization is:

 - Include prior knowledge into the model.
 - Avoid overfitting.
 - Make underdetermined systems solvable.

This notebook covers the solution of the above problem using the Proximal Gradient Descent method.


## Generate Data


The data is generated from a sparse reference vector of parameters.  
The feature space is polynomial and only a subset of its coefficients is active.

In [7]:
# Generate / Load the Data

vA = np.sort(np.random.rand(numGridPts)) #<! Grid (Random samples, Sorted in ascending manner)
mA = np.power(vA[:, None], range(polyDeg + 1)) #<! Model Matrix

vXRef           = np.zeros(polyDeg + 1)
vFeatIdx        = np.random.choice(polyDeg + 1, numFeatures, replace = False) #<! Active features index
vXRef[vFeatIdx] = np.random.randn(numFeatures); #<! Active features

vN = noiseStd * np.random.randn(numGridPts) #<! Noise Samples
vS = mA @ vXRef
vY = vS + vN


## Analysis

This section defines the problem and solves it using the _Proximal Gradient Method_ (PGM).

### Objective Function

The objective function:

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{\infty} $$

Since the regularization "punishes" only the element with the largest magnitude, it pushes the values into a symmetric bounded range.
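The claim can be illustrated with a small numeric sketch. The vectors below are hypothetical toy values (not the notebook's data), and `hPenalty` is a name introduced only for this illustration.

```python
import numpy as np

λ = 0.7 #<! Same value as the notebook's parameter, repeated to keep the sketch self contained

# Regularization term only: λ * || x ||_∞
hPenalty = lambda vX: λ * np.linalg.norm(vX, np.inf)

vX1 = np.array([2.0, 0.5, -1.0]) #<! Hypothetical reference vector
vX2 = np.array([2.0, 1.5, -1.0]) #<! A non extreme element is changed
vX3 = np.array([3.0, 0.5, -1.0]) #<! The extreme element is changed

print(f'Reference          : {hPenalty(vX1):0.2f}')
print(f'Non extreme changed: {hPenalty(vX2):0.2f}')
print(f'Extreme changed    : {hPenalty(vX3):0.2f}')
```

Only the change of the largest magnitude element alters the penalty, which is why the solution tends to have several elements sharing the same maximal magnitude.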

In [8]:
# Objective Function

#===========================Fill This===========================#
# 1. Implement the objective function. 
#    Given a vector of `vX` and a scalar `λ` it returns the objective.
# 2. The implementation should be using a Lambda Function.
# !! You may use `np.square()` and / or `np.linalg.norm()`.
# !! Pay attention to the variable of the labels.

hObjFun = lambda vX, λ: ???
#===============================================================#

### Proximal Operator

The Proximal Operator of a function $g \left( \cdot \right)$ is given by:

$$ \operatorname{prox}_{\lambda g \left( \cdot \right)} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda g \left( \boldsymbol{x} \right) $$

* <font color='brown'>(**#**)</font> The Proximal Operator can be thought of as a generalization of a projection operator.
* <font color='brown'>(**#**)</font> The Proximal Operator can be used to generalize the _Gradient_.
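The relation to projection can be sketched with a simple case: when $g$ is the indicator function of a convex set, the prox reduces to the orthogonal projection onto the set. The example below assumes the set is the box ${\left[ -1, 1 \right]}^{n}$, where the projection is a clip. The names `ProxBoxIndicator` and `ProxObj` are hypothetical and used only for this illustration.

```python
import numpy as np

def ProxBoxIndicator(vY: np.ndarray, λ: float) -> np.ndarray:
    # Prox of the indicator of the box [-1, 1]^n.
    # `λ` has no effect since scaling an indicator function does not change it.
    return np.clip(vY, -1.0, 1.0)

def ProxObj(vX: np.ndarray, vY: np.ndarray) -> float:
    # The prox objective for feasible points (The indicator equals 0 on the box)
    return 0.5 * np.sum(np.square(vX - vY))

rng = np.random.default_rng(0)
vY  = np.array([-2.5, -0.3, 0.8, 1.7]) #<! Hypothetical input
vP  = ProxBoxIndicator(vY, 0.7)

# Sanity check: No random feasible point attains a lower objective value
mX       = rng.uniform(-1.0, 1.0, size = (1000, vY.size))
bestRand = min(ProxObj(vX, vY) for vX in mX)

print(f'Prox output: {vP}')
print(f'Objective at prox: {ProxObj(vP, vY):0.4f}, Best random feasible point: {bestRand:0.4f}')
```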


#### The Operator Derivation

The Prox of the ${L}_{\infty}$ norm:

$$ \operatorname{prox}_{\lambda {\left\| \cdot \right\|}_{\infty}} \left( \boldsymbol{y} \right) = \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{\infty} $$

Its solution requires working with _Moreau Decomposition_ from Convex Analysis.  
The Prox is given by:

$$ \operatorname{prox}_{ \lambda {\left\| \cdot \right\|}_{\infty}} \left( \boldsymbol{y} \right) = \boldsymbol{y} - \lambda \operatorname{Proj}_{ \left\{ {\left\| \cdot \right\|}_{1} \leq 1 \right\} } \left( \frac{\boldsymbol{y}}{\lambda} \right) $$

The projection onto the ${L}_{1}$ Ball does not have a closed form solution.  
It can be solved using DCP or the given function `ProjectL1Ball()`.

* <font color='brown'>(**#**)</font> The full derivation is given at [The Proximal Operator of the ${L}_{\infty}$ (Infinity Norm)](https://math.stackexchange.com/questions/527872).
* <font color='brown'>(**#**)</font> The derivation of the projection onto the ${L}_{1}$ Ball is given at [Orthogonal Projection onto the ${L}_{1}$ Unit Ball](https://math.stackexchange.com/questions/2327504) (An alternative derivation: [Josh Nguyen - Projection onto the L1 Norm Ball](https://joshnguyen.net/posts/l1-proj)).
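The Moreau Decomposition formula above can be sketched in a self contained manner. The sketch below assumes a sorting based projection onto the ${L}_{1}$ ball, which is one common approach and not necessarily the one used by the course's `ProjectL1Ball()`. The names `ProjectL1BallSketch` and `ProxLInfSketch` are hypothetical.

```python
import numpy as np

def ProjectL1BallSketch(vY: np.ndarray, ballRadius: float = 1.0) -> np.ndarray:
    # Sorting based projection onto { x : || x ||_1 <= ballRadius }
    vAbs = np.abs(vY)
    if np.sum(vAbs) <= ballRadius:
        return vY.copy() #<! Already inside the ball
    vU      = np.sort(vAbs)[::-1] #<! Magnitudes in descending order
    vCumSum = np.cumsum(vU)
    vIdx    = np.arange(1, vY.size + 1)
    # Last index where the candidate threshold keeps the element active
    rho = np.nonzero(vU * vIdx > (vCumSum - ballRadius))[0][-1]
    θ   = (vCumSum[rho] - ballRadius) / (rho + 1) #<! Soft threshold level
    return np.sign(vY) * np.maximum(vAbs - θ, 0.0)

def ProxLInfSketch(vY: np.ndarray, λ: float) -> np.ndarray:
    # Moreau Decomposition: prox_{λ ||.||_∞}(y) = y - λ * Proj_{||.||_1 <= 1}(y / λ)
    # Assumes λ > 0
    return vY - λ * ProjectL1BallSketch(vY / λ)

vY = np.array([3.0, 1.0]) #<! Hypothetical input
print(ProxLInfSketch(vY, 1.0)) #<! The largest element is pulled from 3 to 2
print(ProxLInfSketch(np.array([0.3, -0.2]), 1.0)) #<! || y ||_1 <= λ maps to the zero vector
```

For `vY = [3, 1]` and `λ = 1` the result can be checked by hand: minimizing $\frac{1}{2} {\left( t - 3 \right)}^{2} + t$ over the largest element gives $t = 2$ while the second element is untouched.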

In [9]:
# The Proximal Function

#===========================Fill This===========================#
# 1. Implement the prox operator function of the `λ || ||_∞` function. 
#    Given a vector `vY` and `λ` it returns the proximal at `vY`.
# 2. The implementation should be using a Lambda Function.
# !! You may assume `λ` > 0.
# !! You may use `ProjectL1Ball()`.

hProxFun = lambda vY, λ: ???
#===============================================================#

In [None]:
# Validation 
# The proximal operator is the solution of a convex problem.
# The operator can be verified by DCP.

# Model Data
vYY = np.linspace(-2, 2, 31)
vXX = cp.Variable(len(vYY))

# Model Problem
cpObjFun = cp.Minimize(0.5 * cp.sum_squares(vXX - vYY) + λ * cp.norm(vXX, 'inf')) #<! Objective Function
lConst   = [] #<! Constraints
oCvxPrb  = cp.Problem(cpObjFun, lConst) #<! Problem

oCvxPrb.solve(solver = cp.SCS)

assert (oCvxPrb.status == 'optimal'), 'The problem is not solved.'
print('Problem is solved.')

assertCond = np.linalg.norm(vXX.value - hProxFun(vYY, λ), np.inf) <= (ε * max(np.linalg.norm(vXX.value), ε))
assert assertCond, f'The prox calculation deviation exceeds the threshold {ε}'

print('The prox implementation is verified')



In [None]:
# Display the Operator

hF, hA = plt.subplots(figsize = (10, 6))
hA.plot(vYY, vYY, lw = 2, label = 'Input')
hA.plot(vYY, hProxFun(vYY, λ), lw = 2, label = 'Prox Output')
hA.set_title(r'The Prox Operator of ${L}_{\infty}$' + f' Norm, λ = {λ: 0.2f}')
hA.set_xlabel('Input Value')
hA.set_ylabel('Output Value')

hA.legend();

* <font color='blue'>(**!**)</font> Revise the number of grid points. See the effect on the output.
* <font color='brown'>(**#**)</font> The operator depends on the number of elements and not only on the $\lambda$ parameter.  
  The intuition comes from the optimization problem: $\arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{x} - \boldsymbol{y} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{x} \right\|}_{\infty}$.  
  The fidelity term sums over all elements while the regularization term depends on a single value, the largest magnitude.  
  Hence adding elements (for instance 50 more) changes the first term while the second may stay unchanged.

## Proximal Gradient Method

For the composite model of:

$$ F \left( \boldsymbol{x} \right) = f \left( \boldsymbol{x} \right) + \lambda g \left( \boldsymbol{x} \right) $$

Where $f \left( \boldsymbol{x} \right)$ is smooth and convex and $g \left( \boldsymbol{x} \right)$ is convex with a given prox operator.

The method iteration is given by:

$$ \boldsymbol{x}^{k + 1} = \operatorname{prox}_{ {\mu}_{k} \lambda g \left( \cdot \right) } \left( \boldsymbol{x}^{k} - {\mu}_{k} \nabla f \left( \boldsymbol{x}^{k} \right) \right) $$

Where ${\mu}_{k}$ is the step size.

* <font color='red'>(**?**)</font> For which $g$ the above becomes a _Projected Gradient_ descent?
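The iteration can be sketched as a short loop. This is a minimal sketch and not the course's `ProxGradientDescent` class. To keep it short and self contained, $g$ is swapped for the ${L}_{1}$ norm, whose prox is the well known Soft Threshold operator. The data and the names `ProxGradSketch` and `SoftThreshold` are hypothetical.

```python
import numpy as np

def SoftThreshold(vY: np.ndarray, τ: float) -> np.ndarray:
    # Prox of τ * || . ||_1 (Stand in for the prox of `g` in this sketch)
    return np.sign(vY) * np.maximum(np.abs(vY) - τ, 0.0)

def ProxGradSketch(vX0, hGradFun, hProxFun, μ, λ, numIter):
    # Fixed step size iteration: x <- prox_{μ λ g}(x - μ ∇f(x))
    vX = vX0.copy()
    for _ in range(numIter):
        vX = hProxFun(vX - μ * hGradFun(vX), μ * λ)
    return vX

# Hypothetical toy problem
rng = np.random.default_rng(0)
mA  = rng.standard_normal((20, 5))
vB  = rng.standard_normal(20)
λ   = 0.5
μ   = 1.0 / (np.linalg.norm(mA, 2) ** 2) #<! Step size: 1 / (Lipschitz constant of ∇f)

hGradFun = lambda vX: mA.T @ (mA @ vX - vB) #<! Gradient of 0.5 * || A x - b ||_2^2
hObjFun  = lambda vX: 0.5 * np.sum(np.square(mA @ vX - vB)) + λ * np.sum(np.abs(vX))

vX = ProxGradSketch(np.zeros(5), hGradFun, SoftThreshold, μ, λ, 2000)

# At a minimizer the iteration is a fixed point
fixedPtErr = np.linalg.norm(vX - SoftThreshold(vX - μ * hGradFun(vX), μ * λ))
print(f'Objective: {hObjFun(vX):0.4f}, Fixed point residual: {fixedPtErr:0.2e}')
```

Passing the prox as a function handle keeps the loop independent of $g$, so the same loop would apply to the ${L}_{\infty}$ prox once it is implemented.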

In [12]:
# Gradient Function

#===========================Fill This===========================#
# 1. Implement the gradient of the function (1/2) * || A x - y ||_2^2. 
#    Given a vector `vX` it returns the gradient at `vX`.
# 2. The implementation should be using a Lambda Function.
# !! You may pre calculate terms for efficient code.

hGradFun = ???
#===============================================================#


In [13]:
# Proximal Gradient Method (PGM)

oProxGrad = ProxGradientDescent(np.zeros(polyDeg + 1), hGradFun, μ, λ, hProxFun = hProxFun)
lX = oProxGrad.ApplyIterations(numIterations)


 * <font color='brown'>(**#**)</font> The size of $\mu$ which guarantees convergence depends on the smoothness of $f$, namely the Lipschitz constant of its gradient.  
   When $f$ is the Linear Least Squares objective, the constant is given by ${\left\| \boldsymbol{A} \right\|}_{2}^{2}$, the square of the largest singular value of $\boldsymbol{A}$.
 * <font color='brown'>(**#**)</font> One could implement adaptive step size in a similar manner to Gradient Descent with a different decision rule.
 * <font color='blue'>(**!**)</font> Go through the implementation of `ProxGradientDescent`.
 * <font color='blue'>(**!**)</font> Edit the code to use $\mu$ set by the Lipschitz constant of $\nabla f$.
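A possible way to compute such a step size is sketched below. It assumes the common choice $\mu = \frac{1}{L}$ where $L = {\left\| \boldsymbol{A} \right\|}_{2}^{2}$. The matrix is a hypothetical stand in for the notebook's `mA` and the names `lipConst` and `maxEig` are introduced for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
mA  = rng.standard_normal((25, 6)) #<! Hypothetical stand in for the model matrix

lipConst = np.linalg.norm(mA, 2) ** 2 #<! Square of the largest singular value
μ        = 1.0 / lipConst #<! Step size set by the Lipschitz constant

# Equivalent calculation: The largest eigenvalue of A^T A
maxEig = np.linalg.eigvalsh(mA.T @ mA)[-1]

print(f'Lipschitz constant: {lipConst:0.4f}, Step size: {μ:0.4f}')
```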

### The DCP Solution

This section solves the problem using a DCP solver.

In [None]:
# DCP Solution

#===========================Fill This===========================#
# 1. Formulate the problem in CVXPY.  
#    Use `vXRef` for the optimal argument.
# !! You may find `cp.max()` useful.

# Model Data
vXRef = ??? #<! Variable

# Model Problem
cpObjFun = ??? #<! Objective Function
cpConst  = ??? #<! Constraints
oCvxPrb  = ??? #<! Problem

oCvxPrb.solve(solver = cp.SCS)
#===============================================================#

vXRef = vXRef.value

assert (oCvxPrb.status == 'optimal'), 'The problem is not solved.'
print('Problem is solved.')

assertCond = abs(hObjFun(vXRef, λ) - hObjFun(lX[-1], λ)) <= (ε * max(abs(hObjFun(vXRef, λ)), ε))
assert assertCond, f'The optimization calculation deviation {abs(hObjFun(vXRef, λ) - hObjFun(lX[-1], λ))} exceeds the threshold {ε}'

print('The implementation is verified')

 * <font color='brown'>(**#**)</font> The _Proximal Gradient Method_ has the same convergence rate as _Gradient Descent_: $O \left( 1 / k \right)$ in the objective value for a convex $f$ with a Lipschitz continuous gradient.
 * <font color='brown'>(**#**)</font> In practice, the convergence rate will depend on the value of $\lambda$.

In [15]:
# Solution Analysis

objValRef   = hObjFun(vXRef, λ)
vObjVal     = np.empty(numIterations)
vArgErr     = np.empty(numIterations)

for ii in range(numIterations):
    vObjVal[ii] = hObjFun(lX[ii], λ)
    vArgErr[ii] = np.linalg.norm(lX[ii] - vXRef)

vObjVal = 20 * np.log10(np.maximum(np.abs(vObjVal - objValRef), np.sqrt(np.spacing(1.0))) / max(np.abs(objValRef), np.sqrt(np.spacing(1.0))))
vArgErr = 20 * np.log10(np.maximum(np.abs(vArgErr), np.sqrt(np.spacing(1.0))) / max(np.linalg.norm(vXRef), np.sqrt(np.spacing(1.0))))

In [None]:
# Display Results

hF, hA = plt.subplots(figsize = (12, 6))
hA.plot(range(numIterations), vObjVal, lw = 2, label = 'Objective Function')
hA.plot(range(numIterations), vArgErr, lw = 2, label = 'Argument Error')
hA.set_xlabel('Iteration Index')
hA.set_ylabel('Relative Error [dB]')
hA.set_title('Proximal Gradient Method Convergence')

hA.legend();

* <font color='red'>(**?**)</font> What will the convergence look like for a higher $\lambda$?

## Polynomial Regression

This section shows the difference between the Least Squares solution and the ${L}_{\infty}$ Regularized Least Squares solution.

In [17]:
# Least Squares Solution

#===========================Fill This===========================#
# 1. Calculate the LS solution of the problem.
#    The polynomial model is `mA * vXLS ≈ vY`.
# !! You may find `sp.linalg.lstsq()` useful.

vXLS, *_ = ???
#===============================================================#



In [None]:
# Display the Results

hF, hA = plt.subplots(figsize = (10, 8))
hA.plot(vA, vS, lw = 2, label = 'Model Data')
hA.plot(vA, vY, ls = 'None', marker = '*', markersize = 5, label = 'Data Samples')
hA.plot(vA, mA @ vXLS, lw = 2, label = 'Least Squares')
hA.plot(vA, mA @ lX[-1], lw = 2, label = 'L∞ Regularized Least Squares')

hA.set_title('Polynomial Model - Estimation from Data Samples')
hA.set_xlabel('x')
hA.set_ylabel('y')

hA.legend();

* <font color='brown'>(**#**)</font> The regularization limits the Dynamic Range of the coefficients.
* <font color='red'>(**?**)</font> Is the result better? Should it be better?