$\textbf{GOAL}$: implement a correct ISTA algorithm that iterates until a given tolerance is reached, with the prox supplied and the Lipschitz constant $L$ either supplied or not.

*The Lipschitz constant $L$ of $\nabla f$ is known for LASSO, but not for $\sqrt{\text{LASSO}}$, and certainly not for an arbitrary $f$. So we will need **backtracking**: start from a "large" $L$, try dividing it by 2 at each iteration, and accept $L/2$ if the loss still decreases.*   

*The prox will always be supplied in this project, since we assume that $g$ is proximable.*   

TODO: check this assumption and the definition of "proximable" (it most likely means that the prox of $g$ has an explicit form).

In [2]:
using LinearAlgebra, Printf, Statistics, Random

# ISTA

$\textbf{Definition}$
Let $\gamma>0$ and let $g : \mathbb{R}^n \to \mathbb{R}$ be a convex function, with $n>0$. The proximal operator of $g$ with parameter $\gamma$ is the map
$$
\mathrm{prox}_{\gamma, g} : v\in \mathbb{R}^n \mapsto \arg\min_{x\in\mathbb R^{n}}\;\Bigl\{\,g(x)+\tfrac1{2\gamma}\|x-v\|_2^2\Bigr\}
$$


The Iterative Soft-Thresholding Algorithm (ISTA) is a proximal gradient method: each iteration performs a gradient descent step on the smooth loss, followed by a proximal step (soft-thresholding when $g$ is the $\ell_1$ norm). 

$\textbf{Definition}$
Consider the problem of minimizing $f+g$, where $f$ is convex and differentiable with an $L$-Lipschitz gradient, and $g$ is convex. We approximate a minimizer $w^*$ by the sequence $(w_k)_{k\geq0}$, defined by an initial point $w_0$ and, for a step size $\gamma>0$, the update rule
    $$
    w_{k+1} \;=\; \mathrm{prox}_{\gamma, g}\!\Big(w_k - \gamma\, \nabla f(w_k) \Big)
    $$
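As a minimal sketch of this update rule, a single iteration could be written as below. The names `ista_step`, `∇f_quad` and `prox_zero` are hypothetical and only serve the illustration; `prox_g(v, γ)` is assumed to return $\mathrm{prox}_{\gamma, g}(v)$.

```julia
# One proximal-gradient (ISTA) iteration: w⁺ = prox_{γ,g}(w - γ ∇f(w)).
# `∇f` and `prox_g` are user-supplied callables (hypothetical names).
ista_step(w, ∇f, prox_g, γ) = prox_g(w .- γ .* ∇f(w), γ)

# Toy check with f(w) = ½‖w - c‖² (so L = 1) and g ≡ 0, whose prox is the identity
c = [1.0, -2.0, 3.0]
∇f_quad(w) = w .- c
prox_zero(v, γ) = v

# With γ = 1/L = 1, one step from the origin lands on the minimizer c
w1 = ista_step(zeros(3), ∇f_quad, prox_zero, 1.0)
```

In this toy case the step size $\gamma = 1/L$ reaches the minimizer in one iteration because $f$ is an isotropic quadratic; in general many iterations are needed.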

# Basic LASSO 

Let $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^{n}$

$$
\min_{b \in \mathbb{R}^{p}}
\frac12 \|y - X b\|_{2}^{2}
+ \lambda \|b\|_{1}
\tag{LASSO}
$$


So in this first case, we have an optimization problem of the form $f+g$, with $f : b \mapsto \tfrac12\|y-Xb\|_2^2$ the least-squares loss and $g = \lambda\|\cdot\|_1$ the scaled $\ell_1$-norm.

We can take $L = \|X^TX\|$ because 
$$
\|\nabla f(b) - \nabla f(a)\|
= \|-X^{T}(y-X b) + X^{T}(y-X a)\|
= \|X^{T}X\,(b - a)\|
\le \|X^{T}X\|\,\|b - a\|$$

and we use the $\ell_2$-norm, since it allows the proximal operator to be explicit.  
The induced operator norm is then the largest eigenvalue of $X^TX$, i.e. the square of the largest singular value of $X$:

$$\|X^TX\|_2 = \|X\|_2^2$$
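The identity between the operator norm and the largest eigenvalue can be checked numerically. The following is only a sanity-check sketch on a small random matrix; `X_demo` and the other variable names are hypothetical.

```julia
using LinearAlgebra, Random

Random.seed!(0)
X_demo = randn(20, 5)                            # hypothetical small design matrix

L_opnorm = opnorm(X_demo)^2                      # ‖X‖₂², squared largest singular value
L_eig = eigmax(Symmetric(X_demo' * X_demo))      # largest eigenvalue of XᵀX

# Lipschitz inequality for ∇f(b) = -Xᵀ(y - Xb) on a random pair (a, b);
# y cancels in the difference of the two gradients
a, b = randn(5), randn(5)
lhs = norm(X_demo' * X_demo * (b - a))
rhs = L_opnorm * norm(b - a)
```

Here `L_opnorm` and `L_eig` should agree up to floating-point error, and `lhs` should not exceed `rhs`.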

**Proximal**
With the generic notation $A$, $b$ in place of $X$, $y$, let
$$f : w \mapsto \tfrac12\|Aw-b\|_2^2 \quad \text{and} \quad g : w \mapsto \lambda\|w\|_1$$
For $k\geq0$, let $u_k = w_k-\gamma A^\top(Aw_k-b)$. 
Then, $\forall i\in \llbracket 1,n \rrbracket$,
$$
w_{k+1}^{(i)}
=\mathrm{sign}\bigl(u_k^{(i)}\bigr)\,\max\bigl(|u_k^{(i)}|-\gamma\lambda,0\bigr)
$$


*Proof*  

Since $g$ is separable, the $\mathrm{prox}$ acts coordinate-wise; hence, for $i \in \llbracket 1,n \rrbracket$,
\begin{equation}
w_{k+1}^{(i)}
=\mathrm{prox}_{\gamma,\lambda|\cdot|}\bigl(u_k^{(i)}\bigr) = 
\arg\min_{x\in\mathbb{R}} \Bigl\{ \lambda|x|+\tfrac{1}{2\gamma}\Bigl(x-u_k^{(i)}\Bigr)^2\Bigr\} \tag{R1}
\end{equation}

By the first-order optimality condition, the minimizer $x^*$ of (R1) satisfies $$0 \in \partial \Bigl( x\mapsto  \lambda|x|+\tfrac{1}{2\gamma}\Bigl(x-u_k^{(i)}\Bigr)^ 2\Bigr)(x^*) = \lambda\, \partial|\cdot|(x^*) + \frac{1}{\gamma}(x^*-u_k^{(i)})$$

+  $x^{*}>0 \implies 0 = \lambda\cdot1 + \tfrac{1}{\gamma}(x^{*}-u_k^{(i)})
   \;\Longleftrightarrow\;
   x^{*}=u_k^{(i)}-\gamma\lambda$, which requires $u_k^{(i)}>\gamma\lambda$
   
+ $x^{*}<0 \implies 0 = \lambda\cdot(-1) + \tfrac{1}{\gamma}(x^{*}-u_k^{(i)})
   \;\Longleftrightarrow\;
   x^{*}=u_k^{(i)}+\gamma\lambda$, which requires $u_k^{(i)}<-\gamma\lambda$

+ $x^{*}=0 \implies 
   0\in \lambda[-1,1] + \tfrac{1}{\gamma}(0-u_k^{(i)})
   \;\Longrightarrow\;
   u_k^{(i)}\in[-\gamma\lambda,\;\gamma\lambda]
   $

Combining these three cases gives,

$$
w_{k+1}^{(i)}
=x^{*}
=\begin{cases}
u_k^{(i)}-\gamma\lambda, & u_k^{(i)}>\gamma\lambda\\
0, & |u_k^{(i)}|\le\gamma\lambda\\
u_k^{(i)}+\gamma\lambda, & u_k^{(i)}<-\gamma\lambda
\end{cases}
=\mathrm{sign}\bigl(u_k^{(i)}\bigr)\,\max\bigl\{|u_k^{(i)}|-\gamma\lambda,\;0\bigr\}
$$
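The closed form can be cross-checked against a brute-force minimization of (R1) on a fine grid. This is only a sanity-check sketch: `soft_threshold`, `prox_bruteforce` and the test values are hypothetical choices made for the illustration.

```julia
# Soft-thresholding with threshold τ = γλ, as derived above (hypothetical helper name)
soft_threshold(u, τ) = sign(u) * max(abs(u) - τ, zero(u))

# Brute-force minimizer of x ↦ λ|x| + (x - u)² / (2γ) over a fine grid
function prox_bruteforce(u, γ, λ; grid = -5:1e-4:5)
    obj(x) = λ * abs(x) + (x - u)^2 / (2γ)
    return grid[argmin(obj.(grid))]
end

γ_demo, λ_demo = 0.5, 0.8                  # threshold γλ = 0.4
u_values = (-2.0, -0.3, 0.0, 0.25, 1.7)    # covers the three cases of the proof

# Largest discrepancy between the closed form and the grid search
max_err = maximum(abs(soft_threshold(u, γ_demo * λ_demo) -
                      prox_bruteforce(u, γ_demo, λ_demo)) for u in u_values)
```

The discrepancy `max_err` is limited by the grid spacing, so it should be of the order of $10^{-4}$ or smaller.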

In [3]:
prox(x, τ) = sign(x) * max(abs(x) - τ, zero(x))

prox (generic function with 1 method)

### Simple (primal)

We run the algorithm until the difference between the values of the cost function $f+g$ at two consecutive iterations falls below $\epsilon$.

In [None]:
function ista_basic_lasso(X, y, λ, prox;
        max_iter = 100_000, tol = 1e-9, print_freq = 1000, verbose = true
    )
    p = size(X, 2)
    L = opnorm(X)^2 # Lipschitz constant of ∇½‖y-Xβ‖²
    step = 1/L # any step in ]0, 2/L[ ensures convergence
    β = zeros(eltype(X), p)
    β_next = similar(β)
    r = copy(y) # residual
    # Initial cost value
    cost_prev = Inf
    
    for k in 1:max_iter
        # Gradient step
        grad = -(X' * r) # ∇½‖y-Xβ‖²
        @. β_next = prox(β - step * grad, λ*step)
        mul!(r, X, β_next)
        @. r = y - r
        cost_current = 0.5*dot(r,r) + λ*sum(abs, β_next)
    
        if abs(cost_current - cost_prev) < tol
            β .= β_next
            if verbose
                @printf("[ISTA] END iter %5d  cost=%.3e  diff=%.3e\n", k, cost_current, abs(cost_current - cost_prev))
            end
            break
        end
        
        if verbose && (k == 1 || k % print_freq == 0)
            @printf("[ISTA]  iter %5d  cost=%.3e  diff=%.3e\n", k, cost_current, abs(cost_current - cost_prev))
        end
        cost_prev = cost_current
        β .= β_next
        
    end
    
    return β
end

ista_basic_lasso (generic function with 1 method)

### GAP

For my project (screen cleaning), I run ISTA until the solution is $\epsilon$-optimal for the primal problem (here, $f+g$). This can be certified with the duality gap, after defining the dual problem
$$
  (\mathcal{D}_\lambda) : 
  \max_{u\in\mathbb{R}^m}\; -\tfrac12\|u\|_{2}^{2} + \langle u,y\rangle  
  \quad\text{s.t.}\quad \|A^Tu\|_\infty\le\lambda$$

We stop the algorithm when $\mathcal{G}(x_k,u_k) := P(x_k)-D(u_k) \leq \epsilon$, where $P$ and $D$ are the primal and dual objectives and $u_k$ is the dual point built from the primal one: the residual $y-Ax_k$, rescaled by $\min\bigl(1,\ \lambda/\|A^T(y-Ax_k)\|_\infty\bigr)$ so that it is dual-feasible.
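As a small sketch of this stopping criterion, the gap can be computed by a standalone helper. The name `lasso_gap` and the toy data are hypothetical; the dual point is the residual rescaled to satisfy $\|A^Tu\|_\infty\le\lambda$, and the dual value is written as $\langle u,y\rangle-\tfrac12\|u\|_2^2$.

```julia
using LinearAlgebra

# Duality gap of the LASSO at a primal point x (hypothetical helper name)
function lasso_gap(A, y, x, λ)
    r = y - A * x                                # residual
    primal = 0.5 * dot(r, r) + λ * norm(x, 1)    # P(x)
    scale = min(1.0, λ / norm(A' * r, Inf))      # rescaling so that ‖Aᵀu‖∞ ≤ λ
    u = scale .* r                               # dual-feasible point
    dual = dot(u, y) - 0.5 * dot(u, u)           # D(u)
    return primal - dual
end

# Toy problem with A = I, whose solution is the soft-thresholding of y
A_toy, y_gap, λ_gap = Matrix(1.0I, 2, 2), [3.0, 0.5], 1.0
gap_at_solution = lasso_gap(A_toy, y_gap, [2.0, 0.0], λ_gap)
gap_at_zero = lasso_gap(A_toy, y_gap, zeros(2), λ_gap)
```

On this toy problem the gap vanishes at the solution $(2, 0)$ and is strictly positive at the origin, which is what makes it usable as a certificate of $\epsilon$-optimality.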

In [None]:
function ista_basic_lasso_gap(X, y, λ, prox;
        max_iter = 100_000, tol = 1e-9, print_freq = 1000, verbose = true
    )
    p = size(X, 2)
    L = opnorm(X)^2 # Lipschitz constant of ∇½‖y-Xβ‖²
    step = 1/L # any step in ]0, 2/L[ ensures convergence
    β = zeros(eltype(X), p)
    r = copy(y) # residual

    for k in 1:max_iter
        grad  = -(X' * r) # ∇½‖y-Xβ‖²
        @. β  = prox(β - step * grad, λ*step)
        mul!(r, X, β) # r = Xβ
        @. r = y - r
        primal = 0.5*dot(r,r) + λ*sum(abs,β)
        θ = r .* min(one(eltype(X)), λ / maximum(abs.(X' * r)))  # dual projection
        dual = 0.5*dot(y,y) - 0.5*dot(y .- θ, y .- θ)
        gap = max(primal - dual, 0.0) # >= 0

        if verbose && (k == 1 || k % print_freq == 0)
            @printf("[ISTA]  iter %5d  gap=%.3e\n", k, gap)
        end
        
        # Stop as soon as the gap is below tol, whether or not verbose is set
        if gap < tol
            if verbose
                @printf("[ISTA] END iter %5d cost=%.3e gap=%.3e\n", k, primal, gap)
            end
            break # or return
        end
    end
    return β
end


ista_basic_lasso_gap (generic function with 1 method)

## Test

In [61]:
n, p = 100, 50
Random.seed!(42)
X = randn(n, p)
sigma = 0.1
y = X * randn(p) + sigma * randn(n)
λ = 0.1

0.1

In [82]:
beta_1 = ista_basic_lasso(X, y, λ, prox, tol=1e-9)
beta_2 = ista_basic_lasso_gap(X, y, λ, prox, tol=1e-9)
;

[ISTA]  iter     1  cost=7.895e+02  diff=Inf
[ISTA] END iter   259  cost=4.280e+00  diff=9.353e-10
[ISTA]  iter     1  gap=7.876e+02
[ISTA] END iter   672 cost=4.280e+00 gap=9.652e-10


In [83]:
maximum(abs.(beta_1 - beta_2))

1.4914334963922471e-5

In [84]:
X

100×50 Matrix{Float64}:
 -0.363357      1.70818    -0.601298    …  -0.963881  -0.263208   -0.608203
  0.251737     -1.03105     0.0701926      -0.866058   1.76491    -1.15529
 -0.314988     -0.144414   -1.78484         0.415071  -0.401666   -0.306233
 -0.311252     -1.62692    -0.166235       -0.830509   0.395066    0.217733
  0.816307     -0.0621749  -1.74545        -0.61771    0.0689813   0.989967
  0.476738      0.214392   -0.980311    …  -1.98395   -0.560384    0.141943
 -0.859555      1.52944     1.16886         0.84472    0.756063    0.100051
 -1.46929       0.274373    0.920525        1.83407   -1.22233     0.0582548
 -2.11433       1.72908     0.460454        0.966417  -0.664859    0.0794762
  0.0437817    -0.878513    0.253296        0.913405   1.16119    -0.0985575
  ⋮                                     ⋱                         
 -0.000939762   0.566839   -0.675231       -0.714065   3.09126     0.924723
 -1.50078       0.717872    1.15522        -0.12098    0.0609324  -2.19

# General function *(f, g, L, prox given)*

The previous functions use the primal and dual objectives, which depend on $f$ and $g$. So, building on the first algorithm, we give an ISTA function that takes the prox and the Lipschitz constant $L$ as arguments.

In [85]:
function ista_L(x0, f, g, ∇f, L, prox;
        max_iter = 100_000, tol = 1e-9,print_freq = 1000, verbose = true
    )
    x = copy(x0)
    x_next = similar(x)
    
    step = 1/L
    cost_prev = f(x) + g(x)
    for k in 1:max_iter
        grad = ∇f(x)
        @. x_next = prox(x-step*grad, step)
        
        cost_current = f(x_next) + g(x_next)
        if abs(cost_current - cost_prev) <tol
            x .= x_next
            if verbose
                @printf("[ISTA] END iter %5d  cost=%.3e  diff=%.3e\n", k, cost_current, abs(cost_current - cost_prev))
            end
            break
        end
        if verbose && (k == 1 || k % print_freq == 0)
            @printf("[ISTA] iter %5d  cost=%.3e  diff=%.3e\n", k, cost_current, abs(cost_current - cost_prev))
        end
    
        cost_prev = cost_current
        x .= x_next
    end
    
    return x
end

ista_L (generic function with 1 method)

## Test

In [86]:
n, p = 100, 50
Random.seed!(42)
X = randn(n, p)
sigma = 0.1
y = X * randn(p) + sigma * randn(n)
λ = 0.1

0.1

### LASSO

We should get the same results as before. This is the case: the difference is zero.

In [94]:
f(b) = 0.5*norm(y-X*b,2)^2
∇f(b) = -X'*(y - X*b)
g(x) = λ*sum(abs, x) # L1 norm
prox_L(x, step) = sign(x) * max(abs(x) - λ*step, zero(x))

beta0 = zeros(p)# initial conditions

beta_L = ista_L(beta0, f, g, ∇f, opnorm(X)^2, prox_L, tol=1e-9)

println("Difference between beta_L and beta_1: ", maximum(abs.(beta_L.-beta_1))) # beta_1 was defined earlier and has to be the same as beta_L

[ISTA] iter     1  cost=7.895e+02  diff=1.733e+03
[ISTA] END iter   259  cost=4.280e+00  diff=9.353e-10
Difference between beta_L and beta_1: 0.0


### Others

TODO: find an interesting problem with a known (analytic) solution to compare against.

# General function BackTrack *(f, g, prox given)*

We have to guess the Lipschitz constant $L$, starting from a first "large" (??) value.  

Then at each iteration, TODO : read FISTA pdf 
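Pending that reading, here is a minimal sketch of one common backtracking rule, given only as a possible starting point. It is an assumption, not the scheme described in this notebook: it starts from a small guess `L0` and multiplies $L$ by `η` until the quadratic upper bound $f(x^+) \le f(x) + \langle\nabla f(x), x^+ - x\rangle + \tfrac{L}{2}\|x^+ - x\|^2$ holds, rather than halving a large $L$. All names (`ista_backtrack_sketch`, `L0`, `η`, the `_toy` helpers) are hypothetical.

```julia
using LinearAlgebra

# ISTA with backtracking on L (sketch; all names are hypothetical).
# `prox(v, step)` is a scalar function applied coordinate-wise.
function ista_backtrack_sketch(x0, f, g, ∇f, prox;
        L0 = 1.0, η = 2.0, max_iter = 10_000, tol = 1e-9)
    x, L = copy(x0), L0
    cost_prev = f(x) + g(x)
    for k in 1:max_iter
        fx, grad = f(x), ∇f(x)
        x_next = prox.(x .- grad ./ L, 1 / L)
        # Increase L until f(x_next) ≤ f(x) + ⟨∇f(x), Δ⟩ + (L/2)‖Δ‖²
        while f(x_next) > fx + dot(grad, x_next .- x) + (L / 2) * sum(abs2, x_next .- x)
            L *= η
            x_next = prox.(x .- grad ./ L, 1 / L)
        end
        x = x_next
        cost = f(x) + g(x)
        abs(cost - cost_prev) < tol && break
        cost_prev = cost
    end
    return x, L
end

# Toy separable LASSO (D diagonal) whose exact solution is [2.0, 0.25]
D, y_toy, λ_toy = Diagonal([1.0, 2.0]), [3.0, 1.0], 1.0
f_toy(b) = 0.5 * sum(abs2, y_toy .- D * b)
∇f_toy(b) = -(D' * (y_toy .- D * b))
g_toy(b) = λ_toy * sum(abs, b)
prox_toy(v, step) = sign(v) * max(abs(v) - λ_toy * step, zero(v))

x_bt, L_bt = ista_backtrack_sketch(zeros(2), f_toy, g_toy, ∇f_toy, prox_toy)
```

On this toy problem the true Lipschitz constant is $4$, so the estimate `L_bt` should stay below `η` times that value, and `x_bt` should be close to the analytic solution.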