[![Fixel Algorithms](https://fixelalgorithms.co/images/CCExt.png)](https://fixelalgorithms.gitlab.io)

# Scientific Programming Methods

## Convex Optimization - Algorithms & Solvers - Accelerated Gradient Descent / Sub Gradient Method

> Notebook by:
> - Royi Avital RoyiAvital@fixelalgorithms.com

## Revision History

| Version | Date       | User        |Content / Changes                                                   |
|---------|------------|-------------|--------------------------------------------------------------------|
| 1.0.000 | 03/10/2024 | Royi Avital | First version                                                      |

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/FixelAlgorithmsTeam/FixelCourses/blob/master/AIProgram/2024_02/0012LinearFitL1.ipynb)

In [1]:
# Import Packages

# General Tools
import numpy as np
import scipy as sp
import pandas as pd

# Machine Learning

# Optimization
import cvxpy as cp

# Miscellaneous
import os
import math
from platform import python_version
import random
import time

# Typing
from typing import Callable, List, Tuple, Union

# Visualization
import matplotlib.pyplot as plt

# Jupyter
from IPython import get_ipython

## Notations

* <font color='red'>(**?**)</font> Question to answer interactively.
* <font color='blue'>(**!**)</font> Simple task to add code for the notebook.
* <font color='green'>(**@**)</font> Optional / Extra self practice.
* <font color='brown'>(**#**)</font> Note / Useful resource / Food for thought.

Code Notations:

```python
someVar    = 2; #<! Notation for a variable
vVector    = np.random.rand(4) #<! Notation for 1D array
mMatrix    = np.random.rand(4, 3) #<! Notation for 2D array
tTensor    = np.random.rand(4, 3, 2, 3) #<! Notation for nD array (Tensor)
tuTuple    = (1, 2, 3) #<! Notation for a tuple
lList      = [1, 2, 3] #<! Notation for a list
dDict      = {1: 3, 2: 2, 3: 1} #<! Notation for a dictionary
oObj       = MyClass() #<! Notation for an object
dfData     = pd.DataFrame() #<! Notation for a data frame
dsData     = pd.Series() #<! Notation for a series
hObj       = plt.Axes() #<! Notation for an object / handler / function handler
```

### Code Exercise

 - Single line fill

```python
vallToFill = ???
```

 - Multi Line to Fill (At least one)

```python
# You need to start writing
?????
```

 - Section to Fill

```python
#===========================Fill This===========================#
# 1. Explanation about what to do.
# !! Remarks to follow / take under consideration.
mX = ???

?????
#===============================================================#
```

In [2]:
# Configuration
# %matplotlib inline

# warnings.filterwarnings("ignore")

seedNum = 512
np.random.seed(seedNum)
random.seed(seedNum)

# Matplotlib default color palette
lMatPltLibclr = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd', '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# sns.set_theme() #<! Apply SeaBorn theme
# sns.set_palette("tab10")

runInGoogleColab = 'google.colab' in str(get_ipython())

In [3]:
# Constants

FIG_SIZE_DEF    = (8, 8)
ELM_SIZE_DEF    = 50
CLASS_COLOR     = ('b', 'r')
EDGE_COLOR      = 'k'
MARKER_SIZE_DEF = 10
LINE_WIDTH_DEF  = 2


In [4]:
# Course Packages

from AuxFun import ProxGradientDescent
from AuxFun import DisplayCompaisonSummary, DisplayRunSummary, MakeSignal


In [5]:
# Auxiliary Functions

def SubGradient( mX: np.ndarray, hGradFun: Callable[[np.ndarray], np.ndarray], /, *, hMuK: Callable[[int], float] = lambda kk: 1 / kk ) -> np.ndarray:
    """
    SubGradient: Performs subgradient optimization on the given sequence of points.

    This function iteratively updates a sequence of points `mX` by following the subgradient
    provided by `hGradFun()`, with a step size determined by the `hMuK()` function. It implements
    a basic subgradient method, typically used for optimizing non differentiable functions.

    Parameters:
    ----------
    mX : np.ndarray
        A 2D array of shape `(numIter, dataDim)`, where each row represents a point in the optimization process.
        The initial point is provided in `mX[0]`, and subsequent points are updated in place.

    hGradFun : Callable
        A function that computes the subgradient at a given point `hGradFun(vX)`. It should take a single argument (a point
        in the same space as the rows of `mX`) and return the corresponding subgradient vector.

    hMuK : Callable, optional
        A function that computes the step size at iteration `kk`. By default, it uses `lambda kk: 1 / kk`,
        which gives a diminishing step size. It should take the iteration number `kk` as an input and return
        a scalar value representing the step size.

    Returns:
    -------
    mX : np.ndarray
        The updated sequence of points after performing subgradient descent. The final point can be found
        in `mX[-1]`.
    """

    numIter = np.size(mX, 0)
    
    for kk in range(1, numIter):
        vG      = hGradFun(mX[kk - 1])
        mX[kk]  = mX[kk - 1] - (hMuK(kk) * vG)
    
    return mX


In [6]:
# Parameters

# Data
numSamples  = 200
noiseStd    = 0.25

λ = 0.5

# Solver
μ               = 0.0025
numIterations   = 10_000

# # Verification
ε = 1e-6 #<! Error threshold

## FISTA for TV Denoising

The _Total Variation Denoising_ is a sparsity based model which promotes a _piece wise constant_ signal.  
Its general form is given as:

$$ \arg \min_{\boldsymbol{x}} \frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2} + \lambda {\left\| \boldsymbol{D} \boldsymbol{x} \right\|}_{1} $$

Where $\lambda {\left\| \boldsymbol{D} \boldsymbol{x} \right\|}_{1}$ is the regularization term and $\lambda \geq 0$ sets the regularization level.  
The matrix $\boldsymbol{D}$ represents the [Finite Difference operator](https://en.wikipedia.org/wiki/Finite_difference).  
In this case the regularization term promotes sparsity of the 1st derivative, which implies a piece wise constant signal.  

* <font color='brown'>(**#**)</font> There are variations of the [Finite Difference Coefficients](https://en.wikipedia.org/wiki/Finite_difference_coefficient).
* <font color='brown'>(**#**)</font> The regularization is a non smooth function. It requires a _non smooth solver_.
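
To make the sparsity claim concrete, the following is a minimal sketch on hypothetical toy signals (not the notebook's data). It builds a dense forward difference matrix and compares ${\left\| \boldsymbol{D} \boldsymbol{x} \right\|}_{1}$ of a piece wise constant signal with its perturbed version.

```python
import numpy as np

# Hypothetical toy signals (Not the notebook's data)
vS = np.array([0.0, 0.0, 0.0, 2.0, 2.0, 2.0, -1.0, -1.0])   #<! Piece wise constant
vN = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]) #<! Deterministic "noise"
vY = vS + vN

numSamples = len(vS)

# Forward finite difference operator: (D x)[i] = x[i + 1] - x[i]
mD = np.zeros((numSamples - 1, numSamples))
for ii in range(numSamples - 1):
    mD[ii, ii]     = -1.0
    mD[ii, ii + 1] = 1.0

tvClean = np.linalg.norm(mD @ vS, 1) #<! Only the 2 jumps contribute
tvNoisy = np.linalg.norm(mD @ vY, 1) #<! Every pair of neighbors contributes

print(f'Non zeros in D x (Clean): {np.count_nonzero(mD @ vS)}')
print(f'TV (Clean): {tvClean:0.2f}, TV (Noisy): {tvNoisy:0.2f}')
```

The clean signal has a sparse derivative (Non zero only at the jumps), while the perturbed signal has a dense one with a larger $\ell_1$ norm.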

This notebook covers:
 - The solution of the above problem using the Accelerated Sub Gradient Method.  
 - Comparing the convergence of the accelerated and non accelerated method.

### Acceleration Methods

Most acceleration methods use _memory_ (Previous iterates) in order to improve the step direction.  
This improvement yields a faster convergence rate with a small computational burden.

This notebook uses the formulation called _FISTA_, coined in the paper [A Fast Iterative Shrinkage Thresholding Algorithm for Linear Inverse Problems](https://epubs.siam.org/doi/10.1137/080716542).  
Its formulation for the _Sub Gradient_ method is given for $k \in \left\{ 1, 2, \ldots, K \right\}$:

$$
\begin{aligned}
\boldsymbol{v} & = \boldsymbol{x}_{k} + \frac{k - 1}{k + 2} \left( \boldsymbol{x}_{k} - \boldsymbol{x}_{k - 1} \right) \\
\boldsymbol{x}_{k + 1} & = \boldsymbol{v} - {\mu}_{k} \partial f \left( \boldsymbol{v} \right)
\end{aligned}
$$

It can be thought of as a _look ahead_, where the gradient is calculated at a point farther along the direction of the previous step.

![](https://i.imgur.com/2coe2Uy.png)  
Image from [Andre Wibisono - Accelerated Gradient Descent](http://awibisono.github.io/2016/06/20/accelerated-gradient-descent.html)
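
As a small numeric illustration of the look ahead step, the sketch below evaluates the weight $\frac{k - 1}{k + 2}$ and the extrapolated point $\boldsymbol{v}$ for a hypothetical pair of consecutive iterates. The function name `MomentumWeight` and the values are assumptions made for the illustration.

```python
import numpy as np

def MomentumWeight( kk: int ) -> float:
    # Weight of the extrapolation at iteration `kk` (Starts at 0, grows towards 1)
    return (kk - 1) / (kk + 2)

# Hypothetical pair of consecutive iterates
vXPrev = np.array([0.0, 0.0]) #<! x_{k - 1}
vXCurr = np.array([1.0, 2.0]) #<! x_{k}

kk = 4
vV = vXCurr + MomentumWeight(kk) * (vXCurr - vXPrev) #<! Look ahead point

print([round(MomentumWeight(ii), 3) for ii in (1, 2, 4, 10, 100)])
print(vV)
```

At $k = 1$ the weight is zero, so the first iteration is a plain (Sub) gradient step. As $k$ grows the weight approaches 1, hence later iterations extrapolate almost a full previous step.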

* <font color='brown'>(**#**)</font> The _Acceleration Framework_ can be utilized in Gradient Descent, Sub Gradient Descent and Proximal Gradient Method.
* <font color='brown'>(**#**)</font> The formulation used is a simplified formulation of the original paper.
* <font color='brown'>(**#**)</font> Some suggest using the accelerated method for $\left \lfloor \frac{2 K}{3} \right \rfloor$ iterations and then use regular non accelerated method.  
See [Another Look at the Fast Iterative Shrinkage / Thresholding Algorithm (FISTA)](https://arxiv.org/abs/1608.03861).
* <font color='brown'>(**#**)</font> Some explain the acceleration effect by analysis of a linear differential system.  
See [A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights](https://arxiv.org/abs/1503.01243).  
Further ideas are given in [StackExchange Mathematics - Intuition Behind Accelerated First Order Methods](https://math.stackexchange.com/questions/904691).
* <font color='brown'>(**#**)</font> The solution path of the FISTA is not monotonic and known to be oscillatory. There are monotonic variants.  
See [Improving Fast Iterative Shrinkage Thresholding Algorithm: Faster, Smarter and Greedier](https://arxiv.org/abs/1811.01430).
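* <font color='brown'>(**#**)</font> Theoretically, FISTA (In its gradient / proximal gradient form) achieves the rate $\mathcal{O} \left( 1 / {k}^{2} \right)$ in objective value, which is the optimal rate available to _First Order_ methods on smooth convex problems.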
* <font color='brown'>(**#**)</font> [A tweet by Ben Grimmer (On Twitter)](https://twitter.com/prof_grimmer) showed an alternative policy: [Periodic Long Steps](https://twitter.com/prof_grimmer/status/1679846891171766272).  
See [Provably Faster Gradient Descent via Long Steps](https://arxiv.org/abs/2307.06324) and [Accelerated Gradient Descent via Long Steps](https://arxiv.org/abs/2309.09961).
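
Since the solution path is not monotonic, the last iterate is not necessarily the best one. The following is a simple bookkeeping sketch (An assumption for illustration, not one of the monotonic variants referenced above) which tracks the best objective value along a hypothetical path.

```python
import numpy as np

# Hypothetical objective values along a non monotonic solution path
vObjVal = np.array([10.0, 6.0, 7.5, 4.0, 4.5, 3.0, 3.2])

vBestVal = np.minimum.accumulate(vObjVal) #<! Best objective value up to each iteration
bestIdx  = int(np.argmin(vObjVal))        #<! Iteration of the best objective value

print(vBestVal)
print(bestIdx)
```

With the iterates stored per row, as done in this notebook, the same index can be used to pick the best row instead of the last one.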


## Generate Data


The data is a Piece Wise Constant signal to match the model of the problem.

* <font color='brown'>(**#**)</font> The function `MakeSignal` is based on code by [Ivan W. Selesnick - Total Variation Denoising](https://eeweb.engineering.nyu.edu/iselesni/lecture_notes/TVDmm).
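
The implementation of `MakeSignal` lives in `AuxFun` and is not shown here. As a stand in, the sketch below generates a piece wise constant signal with a hypothetical helper, `MakeBlocksSketch`. Its name, its arguments, the jump locations and the levels are all assumptions for illustration.

```python
import numpy as np

def MakeBlocksSketch( numSamples: int, vJumpIdx: np.ndarray, vLevel: np.ndarray ) -> np.ndarray:
    # Hypothetical helper: `vLevel[ii]` is the value of the ii-th segment,
    # `vJumpIdx` holds the (Sorted) indices where a new segment starts.
    vSegIdx = np.searchsorted(vJumpIdx, np.arange(numSamples), side = 'right')
    return vLevel[vSegIdx]

oRng = np.random.default_rng(512)

vS = MakeBlocksSketch(200, np.array([50, 120, 160]), np.array([0.0, 2.0, -1.0, 1.5]))
vY = vS + (0.25 * oRng.standard_normal(200)) #<! Noisy samples

print(np.flatnonzero(np.diff(vS)) + 1) #<! Recovers the jump indices
```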

In [7]:
# Generate / Load the Data

# Model Data
vS = MakeSignal('Blocks', numSamples)
vY = vS + (noiseStd * np.random.randn(numSamples))

mD = sp.sparse.spdiags([-np.ones(numSamples), np.ones(numSamples)], [0, 1], numSamples - 1, numSamples) #<! Different than MATLAB in the length required

mX = np.zeros(shape = (numIterations, numSamples)) #<! Initialization by zeros

dSolverData = {}


In [None]:
# Display Data 

hF, hA = plt.subplots(figsize = (10, 6))
hA.plot(range(numSamples), vS, lw = 2, label = 'Signal Model')
hA.plot(range(numSamples), vY, ls = 'None', marker = '*', ms = 5, label = 'Signal Samples')
hA.set_title('Signal Model and Signal Samples (Noisy)')
hA.set_xlabel('Sample Index')
hA.set_ylabel('Sample Value')

hA.legend();

* <font color='red'>(**?**)</font> What would the gradient of the signal model look like?
* <font color='red'>(**?**)</font> What would the gradient of the signal samples look like?

## Total Variation Denoising

This section defines the problem and solves it using the _Accelerated Sub Gradient_ method.

### Objective Function

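The objective function of the TV Denoising (For denoising, $\boldsymbol{A} = \boldsymbol{I}$ and $\boldsymbol{b}$ is the vector of noisy samples):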

$$ \arg \min_{\boldsymbol{x}} \underbrace{\frac{1}{2} {\left\| \boldsymbol{A} \boldsymbol{x} - \boldsymbol{b} \right\|}_{2}^{2}}_{\text{Fidelity}} + \underbrace{\lambda {\left\| \boldsymbol{D} \boldsymbol{x} \right\|}_{1}}_{\text{Regularization}} $$

The format of a fidelity term and a regularization term is common in [_Inverse Problems_](https://en.wikipedia.org/wiki/Inverse_problem), which are among the most challenging types of problems in Engineering.  
The regularization models a desired property of the output. In our case it promotes a sparse derivative, which implies a _piece wise constant_ signal.
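
The two terms can be evaluated separately to see the trade off. The following is a minimal sketch on hypothetical toy data, assuming $\boldsymbol{A} = \boldsymbol{I}$. The names `hObjFunSketch` and `paramLambda` are illustrative and it is only one possible way to write the objective.

```python
import numpy as np

# Hypothetical toy data (The notebook uses its own `vY`, `mD` and `λ`)
vY          = np.array([0.0, 0.1, 2.0, 1.9])
mD          = np.diff(np.eye(4), axis = 0) #<! Dense finite difference matrix, shape (3, 4)
paramLambda = 0.5

# Fidelity term + regularization term, assuming A = I and b = vY
hObjFunSketch = lambda vX: 0.5 * np.sum(np.square(vX - vY)) + paramLambda * np.linalg.norm(mD @ vX, 1)

print(hObjFunSketch(vY))                      #<! Zero fidelity, only the regularization
print(hObjFunSketch(np.full(4, np.mean(vY)))) #<! Zero regularization, only the fidelity
```

The samples themselves pay only the regularization term, while a constant signal pays only the fidelity term. The minimizer balances the two.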

* <font color='brown'>(**#**)</font> For simplicity the explicit sparse matrix `mD` is used above. In many cases it is better to apply $\boldsymbol{D}$ as an operator, without building the matrix.  
In this case it can be applied by a convolution.
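
A minimal sketch of the operator form, assuming the forward difference convention $\left( \boldsymbol{D} \boldsymbol{x} \right)_{i} = {x}_{i + 1} - {x}_{i}$. The function names `ApplyD` and `ApplyDT` are hypothetical. The explicit matrix is built only to verify the operators.

```python
import numpy as np

def ApplyD( vX: np.ndarray ) -> np.ndarray:
    # D x: Forward difference as a `valid` convolution with the kernel [1, -1]
    return np.convolve(vX, [1.0, -1.0], mode = 'valid')

def ApplyDT( vZ: np.ndarray ) -> np.ndarray:
    # D^T z: The adjoint is a `full` convolution with the flipped kernel [-1, 1]
    return np.convolve(vZ, [-1.0, 1.0], mode = 'full')

numSamples = 6
mD   = np.diff(np.eye(numSamples), axis = 0) #<! Explicit matrix, only for verification
oRng = np.random.default_rng(0)
vX   = oRng.standard_normal(numSamples)
vZ   = oRng.standard_normal(numSamples - 1)

print(np.allclose(ApplyD(vX), mD @ vX))
print(np.allclose(ApplyDT(vZ), mD.T @ vZ))
```

For this specific kernel `np.diff(vX)` gives the same result as `ApplyD(vX)`. The convolution form generalizes to other finite difference kernels.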

In [9]:
# Objective Function

#===========================Fill This===========================#
# 1. Implement the objective function. 
#    Given a vector of `vX` it returns the objective.
# 2. The implementation should be using a Lambda Function.
# !! You may use `np.square()` and / or `np.linalg.norm()`.

hObjFun = lambda vX: ???
#===============================================================#

* <font color='red'>(**?**)</font> What would the least squares solution (With no regularization, $\lambda = 0$) look like?

## Analysis

This section solves the problem in 3 ways:

 - DCP Solver: As the problem is _convex_ and relatively small it can be solved by a DCP solver.
 - Sub Gradient: Iterative method using the _sub gradient_ of the regularization term.
 - Accelerated Sub Gradient: Iterative method using the _sub gradient_ of the regularization term with acceleration.


### DCP Solver

Solving the problem using a DCP Solver.

In [None]:
# DCP Solution
# The Total Variation Denoising model.
# Solved using `CVXPY`.

startTime = time.time()

solverString = 'CVXPY'

#===========================Fill This===========================#
# 1. Create the auxiliary variable `vX`.
# 2. Define the objective function.
# 3. Define the constraints.
# 4. Solve the problem using `CVXPY`.
# !! You may use list operations to define constraints.

vX   = ??? #<! Objective Variable

cpObjFun = ??? #<! Objective Function
cpConst  = ??? #<! Constraints
oCvxPrb  = ??? #<! Problem
#===============================================================#

oCvxPrb.solve(solver = cp.SCS)

assert (oCvxPrb.status == 'optimal'), 'The problem is not solved.'
print('Problem is solved.')

vX = vX.value

runTime = time.time() - startTime

In [None]:
# Storing Results

DisplayRunSummary(solverString, hObjFun, vX, runTime, oCvxPrb.status)

dSolverData[solverString] = {'vX': vX, 'objVal': hObjFun(vX)}


### Sub Gradient Solver

This section implements the Sub Gradient Solver.
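
For the denoising objective a valid sub gradient is $\left( \boldsymbol{x} - \boldsymbol{y} \right) + \lambda \boldsymbol{D}^{T} \operatorname{sign} \left( \boldsymbol{D} \boldsymbol{x} \right)$. The sketch below is one possible way to write it, on hypothetical toy data with illustrative names. It verifies the sub gradient inequality $f \left( \boldsymbol{z} \right) \geq f \left( \boldsymbol{x} \right) + \boldsymbol{g}^{T} \left( \boldsymbol{z} - \boldsymbol{x} \right)$ on random points.

```python
import numpy as np

# Hypothetical toy data (Not the notebook's signal)
vY          = np.array([0.1, -0.1, 0.0, 2.1, 1.9, 2.0])
numSamples  = len(vY)
mD          = np.diff(np.eye(numSamples), axis = 0)
paramLambda = 0.5

hObjFunSketch  = lambda vX: 0.5 * np.sum(np.square(vX - vY)) + paramLambda * np.linalg.norm(mD @ vX, 1)
# (x - y) + λ D^T sign(D x), where `np.sign(0) = 0` is a valid choice within [-1, 1]
hGradFunSketch = lambda vX: (vX - vY) + paramLambda * (mD.T @ np.sign(mD @ vX))

# Sub gradient inequality: f(z) >= f(x) + g^T (z - x) for any z
oRng = np.random.default_rng(0)
vX   = oRng.standard_normal(numSamples)
vG   = hGradFunSketch(vX)
lGap = [hObjFunSketch(vZ) - hObjFunSketch(vX) - vG @ (vZ - vX) for vZ in oRng.standard_normal((100, numSamples))]

print(min(lGap) >= -1e-12)
```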

In [12]:
# Gradient Function

#===========================Fill This===========================#
# 1. Implement the (sub) gradient function of (1/2) * || x - y ||_2^2 + λ * || D x ||_1. 
#    Given a vector `vX` it returns a (sub) gradient at `vX`.
# 2. The implementation should be using a Lambda Function.
# !! You may pre calculate terms for efficient code.

hGradFun = ???
#===============================================================#

In [13]:
# Sub Gradient Solution
# The Total Variation Denoising model.
# Solved using Sub Gradient Method.

hMuK = lambda kk: μ #<! Try using the 1/L 
# hMuK = lambda kk: 1 / math.sqrt(kk) #<! Classic Sub Gradient

startTime = time.time()

solverString = 'Sub Gradient'

mX = SubGradient(mX, hGradFun, hMuK = hMuK)

runTime = time.time() - startTime



In [None]:
# Storing Results

vX = np.copy(mX[-1])

DisplayRunSummary(solverString, hObjFun, vX, runTime)

dSolverData[solverString] = {'vX': vX, 'objVal': hObjFun(vX), 'mX': np.copy(mX)}


### Accelerated Sub Gradient Solver

This section implements the Accelerated Sub Gradient Solver.

> [!TIP]  
> For $k \in \left\{ 2, 3, \ldots, K \right\}$:
>    1. $\boldsymbol{v}^{k} = \boldsymbol{x}^{k - 1} + \frac{k - 1}{k + 2} \left( \boldsymbol{x}^{k - 1} - \boldsymbol{x}^{k - 2} \right)$
>    2. $\boldsymbol{x}^{k} = \boldsymbol{v}^{k} - {\mu}_{k} \partial f \left( \boldsymbol{v}^{k} \right)$

Where ${\mu}_{k} = \frac{1}{k}$ or $\frac{1}{\sqrt{k}}$.
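
The iteration in the tip can be sketched as follows. This is one possible reading of the two steps, with a hypothetical function name (`AccelSubGradSketch`), hypothetical toy data and a constant step size. It is not necessarily the intended solution of the exercise.

```python
import numpy as np
from typing import Callable

def AccelSubGradSketch( mX: np.ndarray, hGradFun: Callable[[np.ndarray], np.ndarray], hMuK: Callable[[int], float] ) -> np.ndarray:
    # Hypothetical name. `mX[0]` holds the initial point, rows are filled in place.
    numIter = np.size(mX, 0)

    mX[1] = mX[0] - (hMuK(1) * hGradFun(mX[0])) #<! Plain step (No previous iterate yet)

    for kk in range(2, numIter):
        vV     = mX[kk - 1] + ((kk - 1) / (kk + 2)) * (mX[kk - 1] - mX[kk - 2]) #<! Look ahead
        mX[kk] = vV - (hMuK(kk) * hGradFun(vV)) #<! Step from the look ahead point

    return mX

# Toy TV denoising problem (Hypothetical data)
vY          = np.array([0.1, -0.1, 0.0, 2.1, 1.9, 2.0])
mD          = np.diff(np.eye(len(vY)), axis = 0)
paramLambda = 0.5

hObjFun  = lambda vX: 0.5 * np.sum(np.square(vX - vY)) + paramLambda * np.linalg.norm(mD @ vX, 1)
hGradFun = lambda vX: (vX - vY) + paramLambda * (mD.T @ np.sign(mD @ vX))

mX = np.zeros((2000, len(vY)))
mX = AccelSubGradSketch(mX, hGradFun, lambda kk: 0.01)

print(hObjFun(mX[0]), hObjFun(mX[-1]))
```

The first row is kept as the initial point. Each later row uses the two previous rows, which is why the full path is stored in `mX`.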

In [15]:
# Accelerated Sub Gradient Function

#===========================Fill This===========================#
# 1. Implement the accelerated sub gradient solver. 
#    Given a matrix `mX` with shape `(numIter, dataDim)` it applies the method.
# 2. An input parameter is `hGradFun` which is callable `hGradFun(vX)`.  
#    It calculates the (Sub) gradient at `vX`.
# 3. The initial value is given by `mX[0, :]`.
# !! Do not overwrite `mX[0, :]`.
# !! The function input should match `SubGradient()`.

def SubGradientAccel( mX: np.ndarray, hGradFun: Callable[[np.ndarray], np.ndarray], /, *, hMuK: Callable[[int], float] = lambda kk: 1 / kk ) -> np.ndarray:
    """
    SubGradientAccel: Performs accelerated subgradient optimization on the given sequence of points.  
    It is using a _Nesterov_ like momentum method (Also known as FISTA).

    This function iteratively updates a sequence of points `mX` by following the subgradient
    provided by `hGradFun()`, with a step size determined by the `hMuK()` function. It implements
    an accelerated subgradient method, typically used for optimizing non differentiable functions.

    Parameters:
    ----------
    mX : np.ndarray
        A 2D array of shape `(numIter, dataDim)`, where each row represents a point in the optimization process.
        The initial point is provided in `mX[0]`, and subsequent points are updated in place.

    hGradFun : Callable
        A function that computes the subgradient at a given point `hGradFun(vX)`. It should take a single argument (a point
        in the same space as the rows of `mX`) and return the corresponding subgradient vector.

    hMuK : Callable, optional
        A function that computes the step size at iteration `kk`. By default, it uses `lambda kk: 1 / kk`,
        which gives a diminishing step size. It should take the iteration number `kk` as an input and return
        a scalar value representing the step size.

    Returns:
    -------
    mX : np.ndarray
        The updated sequence of points after performing subgradient descent. The final point can be found
        in `mX[-1]`.
    
    Notes:
    ------
    - The algorithm follows a two phase approach: a standard subgradient update for the first iteration and 
      a momentum based update for all subsequent iterations.
    - The momentum term uses a specific Nesterov like weighting factor `(kk - 1) / (kk + 2)` to propagate 
      the difference between consecutive iterates.
    """

    numIter = np.size(mX, 0)
    
    # First iteration
    vV    = np.copy(mX[0])
    vG    = hGradFun(mX[0])
    mX[1] = vV - (hMuK(1) * vG)
    
    # Steady state
    for kk in range(2, numIter):
        vV     = ???
        vG     = ???
        mX[kk] = ???
    
    return mX

#===============================================================#

In [16]:
# Accelerated Sub Gradient Solution
# The Total Variation Denoising model.
# Solved using Accelerated Sub Gradient Method.

startTime = time.time()

solverString = 'Accelerated Sub Gradient'

mX = SubGradientAccel(mX, hGradFun, hMuK = hMuK)

runTime = time.time() - startTime



In [None]:
# Storing Results

vX = np.copy(mX[-1])

DisplayRunSummary(solverString, hObjFun, vX, runTime)

dSolverData[solverString] = {'vX': vX, 'objVal': hObjFun(vX), 'mX': np.copy(mX)}


### Display Results

In [None]:
# Display Results

hF = DisplayCompaisonSummary(dSolverData, hObjFun)

* <font color='blue'>(**!**)</font> Replace `hMuK` with the step size of the Sub Gradient Method.  
This is the difference between "practice" and theory.
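
To compare the policies, the sketch below tabulates a constant step against the diminishing steps $\frac{1}{k}$ and $\frac{1}{\sqrt{k}}$. The dictionary `dStepSize` and the name `stepSizeConst` are illustrative. The constant matches the value of `μ` in the parameters cell.

```python
import math

stepSizeConst = 0.0025 #<! Same value as `μ` in the parameters cell

dStepSize = {
    'Constant'    : lambda kk: stepSizeConst,
    '1 / k'       : lambda kk: 1 / kk,
    '1 / sqrt(k)' : lambda kk: 1 / math.sqrt(kk),
}

for policyName, hMuKSketch in dStepSize.items():
    print(policyName, [round(hMuKSketch(kk), 4) for kk in (1, 10, 100, 10_000)])
```

The $\frac{1}{k}$ policy drops below the constant step after $k = 400$, while $\frac{1}{\sqrt{k}}$ stays above it for all of the 10,000 iterations (They meet only at $k = 160{,}000$). Any of the entries can be passed as `hMuK`.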

In [None]:
# Display Data 

hF, hA = plt.subplots(figsize = (10, 6))
hA.plot(range(numSamples), vS, lw = 2, label = 'Signal Model')
hA.plot(range(numSamples), vY, ls = 'None', marker = '*', ms = 5, label = 'Signal Samples')
hA.plot(range(numSamples), mX[-1], lw = 2, label = 'Denoised Signal')
hA.set_title('Signal Model, Signal Samples (Noisy) and Denoised Signal')
hA.set_xlabel('Sample Index')
hA.set_ylabel('Sample Value')

hA.legend();

* <font color='red'>(**?**)</font> What would the result for $\lambda \to \infty$ look like?