## Implementing a simple gradient descent algorithm in ``Julia``

In this blog post, we discuss how to implement a simple gradient descent scheme in ``Julia``. To do this, we will use the Julia package ``ProximalOperators``, which is an excellent package for computing proximal operators and gradients of common convex functions. I highly recommend the package for anyone interested in operator splitting algorithms. You can find more information about the package at: [https://github.com/kul-forbes/ProximalOperators.jl](https://github.com/kul-forbes/ProximalOperators.jl).

Before we implement the gradient descent method, we first record some necessary background.

### 0. Background

Given a differentiable convex function $f$, our goal is to solve the following optimization problem: 

\begin{eqnarray*}
\begin{array}{ll}
\textrm{minimize} & f(x)\\
\textrm{subject to} & x\in\mathbf{R}^{n},
\end{array}
\end{eqnarray*}

where $x$ is the decision variable. To solve the problem above, we consider the gradient descent algorithm, which implements the following iteration scheme:

\begin{eqnarray} \label{SGD}
x_{n+1} & = & x_{n}-\gamma_{n}{\nabla f(x_{n})},\qquad (1)
\end{eqnarray}

where ${\nabla f(x_{n})}$ denotes the gradient of $f$ evaluated at the iterate $x_{n}$, and $n$ is our iteration counter. As our step size rule, we pick a sequence that is square-summable but not summable, i.e., $\sum_{n} \gamma_{n}^{2} < \infty$ and $\sum_{n} \gamma_{n} = \infty$; for example, $\gamma_{n}=1/n$ will do the job.
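
To see what "square-summable but not summable" means for $\gamma_{n}=1/n$, the following minimal sketch (plain Julia, no packages) compares partial sums of $\gamma_{n}$ and $\gamma_{n}^{2}$. The helper name `partial_sums` is made up for this illustration. The first sum keeps growing, roughly like $\log N$, while the second stays below $\pi^{2}/6$.

```julia
# Partial sums of γ_n = 1/n and γ_n^2 = 1/n^2 up to N terms.
# `partial_sums` is a hypothetical helper used only for this illustration.
function partial_sums(N::Integer)
    s1 = 0.0  # running sum of γ_n   (grows without bound as N grows)
    s2 = 0.0  # running sum of γ_n^2 (stays bounded)
    for n in 1:N
        γ = 1 / n
        s1 += γ
        s2 += γ^2
    end
    return s1, s2
end

for N in (10, 1_000, 100_000)
    s1, s2 = partial_sums(N)
    println("N = $N | sum of γ_n = $s1 | sum of γ_n^2 = $s2")
end
```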
 
We will go through the following steps:
1. Load the packages
2. Create the types
3. Write the functions

### 1. Load the packages
Let us load the necessary packages that we are going to use.

In [1]:
## Load the packages to be used
# -----------------------------
# comment out the next two lines if you already have ProximalOperators installed
using Pkg
Pkg.add("ProximalOperators")
using ProximalOperators, LinearAlgebra

### 2. Create the types

Next, we define a few Julia types that we need in order to write an optimization solver in `Julia`.

#### 2.1. ``GD_problem``

This type contains information about the problem instance: the function $f$ that we are trying to minimize, an initial point $x_0$, and the initial step size $\gamma_0$.

In [2]:
struct GD_problem{F <: ProximableFunction,A <: AbstractVecOrMat{<:Real}, R <: Real}
    
    # problem structure, contains information regarding the problem
    
    f::F # the objective function
    x0::A # the initial point
    γ::R # the stepsize
    
end

**Usage of ``GD_problem``.** For example, suppose the user wishes to solve a simple least-squares problem using gradient descent. They can then create a problem instance as follows. The list of functions available for this purpose can be found in the documentation of ``ProximalOperators``: [https://kul-forbes.github.io/ProximalOperators.jl/latest](https://kul-forbes.github.io/ProximalOperators.jl/latest).

In [20]:
# create a problem instance
# ------------------------

A = [0.0992104966854672 -0.6824105286448215 2.460409825989712 -0.004053803803441462 0.5715605356881739; 
    0.1742749398573746 2.5174383202611947 0.651567460427589 -0.5408271816480797 -0.01223347123544491; 
    1.0442129438578551 0.7352767582403392 -0.241436834090505 0.4638234650202825 0.23166711887400535; 
    -0.16776414819267987 0.08274565791466311 -0.6142796136746458 0.9830335769912758 0.3796261892272989; 
    0.3990439508203404 -0.47277912665344407 1.126410516294192 0.5625008522977397 -0.6990783609684885; 
    -1.7266177238831322 0.9984702196895198 -0.8542927758529749 1.5919321188500941 1.3677645133446954]

# or generate a random A by running the following
# A = randn(6,5)

b = [-0.3965327124603881, -1.4820396287913964, -0.467961891048195, -1.555879871070049, 0.5014062010336909, 0.25719126443155527]

# or, generate a random b by running 
# b = randn(6)

m, n = size(A)

# randomized initial point:

x0 = randn(n)

f = LeastSquares(A, b)

γ = 1.0

# create GD_problem

problem = GD_problem(f, x0, γ)

GD_problem{ProximalOperators.LeastSquaresDirect{Float64,Float64,Array{Float64,2},Array{Float64,1},Cholesky{Float64,Array{Float64,2}}},Array{Float64,1},Float64}(description : Least squares penalty
domain      : n/a
expression  : n/a
parameters  : n/a, [1.1297778521319135, -1.5412078943916476, 0.22898702828884018, 0.30103847585081944, -1.0572035198699317], 1.0)
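
For reference, assuming that `LeastSquares(A, b)` with its default parameters represents $f(x)=\tfrac{1}{2}\|Ax-b\|^{2}$ (this is my reading of the ProximalOperators documentation; please check it against your installed version), the gradient is $\nabla f(x)=A^{\top}(Ax-b)$. The sketch below is package-free and only checks this formula against central finite differences on small made-up data. The names `ls_obj`, `ls_grad`, and `fd_grad` are illustrative and are not part of ProximalOperators.

```julia
using LinearAlgebra

# Hand-written least-squares objective and gradient (illustrative names).
ls_obj(A, b, x) = 0.5 * norm(A * x - b)^2
ls_grad(A, b, x) = A' * (A * x - b)

# Central finite-difference approximation of the gradient of a scalar function g at x.
function fd_grad(g, x; h = 1e-6)
    ∇ = similar(x)
    for i in eachindex(x)
        e = zeros(length(x))
        e[i] = h
        ∇[i] = (g(x + e) - g(x - e)) / (2h)
    end
    return ∇
end

# small made-up data, named so that it does not overwrite A, b, x0 from the cell above
A_small = [1.0 2.0; 3.0 4.0; 5.0 6.0]
b_small = [1.0, 0.0, -1.0]
x_small = [0.5, -0.25]

analytic = ls_grad(A_small, b_small, x_small)
numeric = fd_grad(x -> ls_obj(A_small, b_small, x), x_small)
println("max abs difference = ", norm(analytic - numeric, Inf))
```

If the two gradients agree up to finite-difference error, the closed-form expression can be used to reason about what `gradient(f, x)` returns for this choice of $f$.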

#### 2.2. ``GD_setting``

This type contains the parameters required to run our algorithm, such as

* the initial step size $\gamma$,
* the maximum number of iterations $\textrm{maxit}$,
* the tolerance $\textrm{tol}$ (i.e., if $\| \nabla{f(x)} \| \leq \textrm{tol}$, we take that $x$ to be an optimal solution and terminate our algorithm; the solver below measures this with the infinity norm),
* a boolean variable $\textrm{verbose}$ that controls whether to print information about the iterates, and
* the variable $\textrm{freq}$ that controls how frequently such information is printed.

The user may specify the values of these parameters. If they do not specify anything, a default set of values should be used. We can achieve this by creating a simple constructor function for ``GD_setting``.

In [6]:
struct GD_setting
    
    # user settings to solve the problem using Gradient Descent
    
    γ::Float64 # the step size (GD_solver below takes its initial step size from GD_problem, not from this field)
    maxit::Int64 # maximum number of iterations
    tol::Float64 # tolerance, i.e., if ||∇f(x)|| ≤ tol, we take x to be an optimal solution
    verbose::Bool # whether to print information about the iterates
    freq::Int64 # how often to print information about the iterates

    # constructor for the structure, so if user does not specify any particular values, 
    # then we create a GD_setting object with default values
    function GD_setting(; γ = 1, maxit = 1000, tol = 1e-8, verbose = false, freq = 10)
        new(γ, maxit, tol, verbose, freq)
    end
    
end

**Usage of ``GD_setting``.** For the previously described least squares problem, we create the following setting instance.

In [7]:
setting = GD_setting(verbose = true, tol = 1e-2, maxit = 1000, freq = 100)

GD_setting(1.0, 1000, 0.01, true, 100)
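
As a side note on the design choice, the same "defaults unless overridden" behavior can also be sketched with `Base.@kwdef`, which generates a keyword constructor from the field defaults. The type name `GD_setting_kw` is hypothetical and is not used anywhere else in this post; the explicit inner constructor above works just as well.

```julia
# Alternative sketch: let Base.@kwdef generate the keyword constructor.
# `GD_setting_kw` is a hypothetical name used only for this illustration.
Base.@kwdef struct GD_setting_kw
    γ::Float64 = 1.0       # the step size
    maxit::Int = 1000      # maximum number of iterations
    tol::Float64 = 1e-8    # tolerance on the gradient norm
    verbose::Bool = false  # whether to print information about the iterates
    freq::Int = 10         # how often to print information about the iterates
end

# fields that are not passed as keywords fall back to their defaults
setting_kw = GD_setting_kw(verbose = true, tol = 1e-2)
println(setting_kw)
```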

#### 2.3. ``GD_state``
Now we define the type named ``GD_state`` that describes the state of our algorithm at iteration number $n$. The state consists of

* the current iterate $x_n$,
* the gradient of $f$ at the current iterate: ${\nabla{f}(x_n)}$,
* the stepsize at iteration $n$: $\gamma_n$, and
* iteration number: $n$.

In [8]:
mutable struct GD_state{T <: AbstractVecOrMat{<: Real}, I <: Integer, R <: Real} # contains information regarding one iteration of the sequence
    
    x::T # iterate x_n
    ∇f_x::T # the gradient ∇f(x_n)
    γ::R # stepsize
    n::I # iteration counter
    
end

Also, once the user has provided the problem information by creating a ``GD_problem`` instance, as we did earlier for the least-squares problem, we need a method to construct the initial value of the type `GD_state`. We create the initial state from the problem instance by writing a constructor function.

In [9]:
function GD_state(problem::GD_problem)
    
    # a constructor for the struct GD_state: it takes the problem data and creates a state containing
    # the initial iterate, its gradient, etc., so that we can start our gradient descent scheme
    
    # unpack information from problem, which is of type GD_problem
    x0 = copy(problem.x0) # to be safe
    f = problem.f
    γ = problem.γ
    ∇f_x, f_x = gradient(f, x0)
    n = 1
    
    return GD_state(x0, ∇f_x, γ, n)
    
end

GD_state

### 3. Write the functions 

Now that we are done defining the types, we can focus on writing the functions that will implement our gradient descent scheme.

#### 3.1. ``GD_iteration!``
First, we need a function that will take the problem information and the state of our algorithm at iteration number $n$, and then compute the next state for iteration number $n+1$ according to (1). 

In [10]:
function GD_iteration!(problem::GD_problem, state::GD_state)
    
    # this is the main iteration function: it takes the problem information and the previous state,
    # and creates the new state using the gradient descent algorithm
    
    # unpack the current state information
    x_n = state.x
    ∇f_x_n = state.∇f_x
    γ_n = state.γ
    n = state.n
    
    # compute the next state
    x_n_plus_1 = x_n - γ_n*∇f_x_n
    
    # now load the computed values in the state
    state.x = x_n_plus_1
    state.∇f_x, f_x = gradient(problem.f, x_n_plus_1) # note that f_x is not used anywhere
    # gradient(f, x) is a function in the ProximalOperators package; see its documentation
    # if more information is required
    state.γ = 1/(n+1) # step size rule γ_n = 1/n, used in the next iteration
    state.n = n+1
    
    # done computing, return the new state
    return state
    
end

GD_iteration! (generic function with 1 method)

#### 3.2. ``GD_solver``
Now we are in a position to write the main solver function named ``GD_solver`` that will be used by the end user. Internally, this function will take the problem information and the problem setting, and then it will

* create the initial state,
* keep updating the state using the ``GD_iteration!`` function until we reach the termination criterion or the maximum number of iterations,
* print the state of the algorithm at the specified frequency if ``verbose`` is ``true``, and 
* return the final state.

In [11]:
## The solver function

function GD_solver(problem::GD_problem, setting::GD_setting)
    
    # this is the function that the end user will call to solve a particular problem;
    # internally it uses the previously defined types and functions to run the gradient descent scheme
    
    # create the initial state
    state = GD_state(problem)
    
    ## time to run the loop
    while (state.n < setting.maxit) && (norm(state.∇f_x, Inf) > setting.tol)
        # compute a new state
        state = GD_iteration!(problem, state)
        # print information every freq iterations if verbose = true
        if setting.verbose && mod(state.n, setting.freq) == 0
            @info "iteration = $(state.n) | obj val = $(problem.f(state.x)) | gradient norm = $(norm(state.∇f_x, Inf))"
        end
    end
    
    # print information regarding the final state
    
    @info "final iteration = $(state.n) | final obj val = $(problem.f(state.x)) | final gradient norm = $(norm(state.∇f_x, Inf))"
    return state
    
end

GD_solver (generic function with 1 method)
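
To make the control flow easier to follow in isolation, here is a package-free sketch of the same loop for the least-squares objective $f(x)=\tfrac{1}{2}\|Ax-b\|^{2}$, using the closed-form gradient $A^{\top}(Ax-b)$ instead of `gradient` from ProximalOperators. The function name `gd_least_squares`, its keyword defaults, and the small data below are assumptions made for this illustration. It is not a replacement for ``GD_solver``.

```julia
using LinearAlgebra

# Package-free sketch of iteration (1) for f(x) = 0.5*||Ax - b||^2, with the same
# stopping rule as GD_solver. `gd_least_squares` is a hypothetical name.
function gd_least_squares(A, b, x0; γ0 = 1.0, maxit = 10_000, tol = 1e-4)
    x = copy(x0)
    ∇f = A' * (A * x - b)   # closed-form gradient of the least-squares objective
    γ = γ0                  # step size used in the first iteration
    n = 1                   # iteration counter
    while n < maxit && norm(∇f, Inf) > tol
        x = x - γ * ∇f          # gradient step, as in (1)
        ∇f = A' * (A * x - b)   # gradient at the new iterate
        γ = 1 / (n + 1)         # step size rule γ_n = 1/n for the next iteration
        n += 1
    end
    return x, ∇f, n
end

# small made-up problem, named so that it does not overwrite A and b from above
A_demo = [1.0 0.0; 0.0 2.0; 1.0 1.0]
b_demo = [1.0, 2.0, 3.0]
x_demo, ∇f_demo, n_demo = gd_least_squares(A_demo, b_demo, zeros(2))
println("stopped at n = $n_demo with gradient norm $(norm(∇f_demo, Inf))")
```

As in ``GD_solver``, the loop stops either when the infinity norm of the gradient drops to `tol` or when the counter reaches `maxit`, whichever happens first.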

**Usage of ``GD_solver``.** For the previously created ``problem`` and ``setting``, we run our ``GD_solver`` function as follows.


In [21]:
final_state_GD = GD_solver(problem, setting)

┌ Info: final iteration = 15 | final obj val = 1.3426102641150683 | final gradient norm = 0.008021018664603696
└ @ Main In[11]:23


GD_state{Array{Float64,1},Int64,Float64}([-0.5482448838438723, -0.42909865018108256, -0.0701068264712182, -0.07118625446430132, -0.5163734825032458], [0.0076930667784437246, -0.007012653935494384, 0.008021018664603696, -0.00661057576689994, 0.0011087604577533772], 0.06666666666666667, 15)

In [23]:
println("objective value found by our gradient descent $(f(final_state_GD.x))")

println("real objective value $(f(pinv(A)*b)) ")

objective value found by our gradient descent 1.3426102641150683
real objective value 1.3425784868644 


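So we do a decent job of finding a good solution: the objective value reached by gradient descent differs from the reference value by about $3 \times 10^{-5}$.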
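
One way to sanity-check a reference solution like the one above is the first-order optimality condition. For $f(x)=\tfrac{1}{2}\|Ax-b\|^{2}$ (assuming this is the objective that `LeastSquares(A, b)` encodes), a minimizer satisfies the normal equations $A^{\top}(Ax-b)=0$. The minimal sketch below uses a small made-up matrix with full column rank and compares `pinv(A)*b` with the backslash solve `A \ b`. The variable names are illustrative.

```julia
using LinearAlgebra

A_chk = [1.0 0.0; 0.0 2.0; 1.0 1.0]   # made-up data with full column rank
b_chk = [1.0, 2.0, 3.0]

x_pinv = pinv(A_chk) * b_chk   # pseudoinverse solution, as in the cell above
x_bs = A_chk \ b_chk           # least-squares solve via backslash

println("difference between the two solutions = ", norm(x_pinv - x_bs))
println("normal-equation residual = ", norm(A_chk' * (A_chk * x_pinv - b_chk), Inf))
```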