## Gradient methods

#### Problem:
$$
f(\vec{x}) \rightarrow \min,\\
f: \Omega \rightarrow \mathbb{R}, \\
\Omega \subset \mathbb{R}^n, f(\vec{x}) \mbox{ is convex}, \\
f(\vec{x}) \mbox{ is differentiable on } \Omega, \\
\vec{x}_* \in \Omega, f_{min} = f(\vec{x}_*)
$$

<em>**Definition**</em>.

A sequence $\{\vec{x}_k\}$ is called **relaxational** if $\forall k \in \mathbb{N}:  f(\vec{x}_k) < f(\vec{x}_{k-1})$.

If $\{\vec{x}_k\}$ is bounded, then by the Bolzano–Weierstrass theorem it has a subsequence $\{\vec{x}_l\}$ that converges to some $\vec{x}_* \in \mathbb{R}^n$.
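As a small illustration, the definition can be checked directly on a list of function values. The helper name `is_relaxational` is hypothetical and is not part of the notebook's modules; this is a sketch, not a definitive utility.

```python
def is_relaxational(values):
    """Return True if every value is strictly smaller than the previous one."""
    return all(curr < prev for prev, curr in zip(values, values[1:]))


# f(x) = x^2 evaluated along x_k = 1 / 2^k decreases strictly.
decreasing = [(0.5 ** k) ** 2 for k in range(6)]
# A plateau breaks the strict inequality required by the definition.
plateau = [4.0, 1.0, 1.0, 0.5]

print(is_relaxational(decreasing))
print(is_relaxational(plateau))
```

Note the strict inequality: a sequence that stalls for even one step is not relaxational in this sense.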

We construct the relaxational sequence by the rule:
$$
\vec{x}_k = \vec{x}_{k-1} + \beta_k\vec{u}_k
$$
where $\vec{u}_{k}$ is a unit vector that defines the direction of descent and $\beta_k \geq 0$ is the length of the descent step.

<em>**Lemma**</em>.

Let $f(\vec{x})$ be differentiable on $\Omega \subset \mathbb{R}^n$ and suppose $\exists L > 0$ such that $\forall \vec{x}, \vec{y} \in \Omega$:
$$
||\nabla f(\vec{x}) - \nabla f(\vec{y})|| \leq  L ||\vec{x} - \vec{y}|| 
$$
Then:
$$
f(\vec{x}) - f(\vec{y}) \geq (\nabla f(\vec{x}), \vec{x} - \vec{y}) - \frac{L}{2}||\vec{x}-\vec{y}||^2
$$
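The lemma can be sanity-checked numerically. The sketch below assumes a concrete test function that is not taken from the notebook, $f(x, y) = x^2 + 5y^2$, whose gradient is Lipschitz with $L = 10$; the helper name `lemma_gap` is hypothetical.

```python
import random

L_CONST = 10.0  # Lipschitz constant of the gradient: largest eigenvalue of diag(2, 10)


def f(x):
    return 0.5 * (2.0 * x[0] ** 2 + 10.0 * x[1] ** 2)


def grad(x):
    return [2.0 * x[0], 10.0 * x[1]]


def lemma_gap(x, y):
    """Left-hand side minus right-hand side of the lemma; should be >= 0."""
    d = [x[0] - y[0], x[1] - y[1]]
    g = grad(x)
    inner = g[0] * d[0] + g[1] * d[1]
    sq_norm = d[0] ** 2 + d[1] ** 2
    return (f(x) - f(y)) - (inner - 0.5 * L_CONST * sq_norm)


random.seed(0)
points = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(200)]
gaps = [lemma_gap(x, y) for x, y in zip(points[::2], points[1::2])]

# A small negative tolerance absorbs floating-point rounding.
print(min(gaps) >= -1e-9)
```

For this quadratic the gap equals $4(x_1 - y_1)^2$ exactly, so the inequality is tight along the second coordinate.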
<em>**Definition**</em>.

$\vec{w}(\vec{x}) = - \nabla f(\vec{x})$ is called the **antigradient**.

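If we take $\vec{u}_k = \frac{\vec{w}_k}{||\vec{w}_k||}$ with $\vec{w}_k = -\nabla f(\vec{x}_k)$ and $\vec{x}_{k+1} = \vec{x}_k + \beta_k\vec{u}_k$, the lemma gives:

$$
f(\vec{x}_{k}) - f(\vec{x}_{k+1}) \geq \left(\nabla f(\vec{x}_k), \vec{x}_k - \vec{x}_k - \beta_k \frac{\vec{w}_k}{||\vec{w}_k||}\right) - \frac{L}{2} \left|\left| \vec{x}_k - \vec{x}_k - \beta_k \frac{\vec{w}_k}{||\vec{w}_k||} \right|\right|^2 = \beta_k||\nabla f(\vec{x}_k)|| - \beta_k^2 \frac{L}{2} 
$$
The right-hand side is positive whenever $\nabla f(\vec{x}_k) \neq 0$ and $0 < \beta_k < \frac{2}{L}||\nabla f(\vec{x}_k)||$, so for such steps the sequence is relaxational. For a convex $f$ the gradient vanishes only at a minimum point, so the descent can stall only at $\vec{x}_*$.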
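A worked step on the step length, using only the lemma with $\vec{x} = \vec{x}_k$, $\vec{y} = \vec{x}_k + \beta\vec{u}_k$ and $\vec{u}_k = \frac{\vec{w}_k}{||\vec{w}_k||}$: the lemma's lower bound on the decrease, viewed as a function $g(\beta)$, is a concave parabola, and its maximum suggests a natural step length (the name $g$ is introduced here only for illustration):
$$
g(\beta) = \beta||\nabla f(\vec{x}_k)|| - \frac{L}{2}\beta^2, \\
g'(\beta) = ||\nabla f(\vec{x}_k)|| - L\beta = 0 \Rightarrow \beta_* = \frac{||\nabla f(\vec{x}_k)||}{L}, \\
g(\beta_*) = \frac{||\nabla f(\vec{x}_k)||^2}{2L}, \quad g(\beta) > 0 \mbox{ for } 0 < \beta < \frac{2||\nabla f(\vec{x}_k)||}{L}
$$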

All methods in which $\vec{u}_k = \frac{\vec{w}_k}{||\vec{w}_k||}$ are called ***gradient methods***; they differ in how the step $\beta_k > 0$ is chosen.
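To make the role of the step rule concrete, here is a minimal sketch of a gradient method with a pluggable choice of $\beta_k$. The function names, the test function $f(x, y) = x^2 + 5y^2$ and the rule $\beta_k = ||\vec{w}_k|| / L$ with $L = 10$ are assumptions made for illustration; none of them come from the notebook's modules.

```python
import math


def gradient_method(grad, x0, step_rule, eps=1e-8, max_iter=10_000):
    """Iterate x_{k+1} = x_k + beta_k * u_k with u_k = w_k / ||w_k||."""
    x = list(x0)
    for k in range(max_iter):
        w = [-g for g in grad(x)]                  # antigradient at x_k
        w_norm = math.sqrt(sum(c * c for c in w))
        if w_norm <= eps:                          # stop once the gradient is small
            return x, k
        beta = step_rule(w_norm)                   # step length beta_k > 0
        x = [xi + beta * wi / w_norm for xi, wi in zip(x, w)]
    return x, max_iter


def grad_f(x):
    # Gradient of f(x, y) = x^2 + 5 y^2; its Lipschitz constant is L = 10.
    return [2.0 * x[0], 10.0 * x[1]]


def lipschitz_step(w_norm, lipschitz=10.0):
    # beta_k = ||w_k|| / L, the step that maximizes the lemma's lower bound.
    return w_norm / lipschitz


x_min, iterations = gradient_method(grad_f, [4.0, 3.0], lipschitz_step)
print(x_min, iterations)
```

Swapping `lipschitz_step` for another callable (for example, one that runs a line search) changes the method without touching the loop.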




In [1]:
import matplotlib as mplib
import math as m
import numpy as np
from numpy.linalg import norm
from functools import reduce
import matplotlib.pyplot as plt
from onedim_optimize import fibbonaci_method, middle_point_method, upgraded_newton, qubic_approx
from scipy.optimize import approx_fprime, minimize
import matplotlib.animation as pltanimation
from animations import Animate3D

from test_functions import *
from scipy.misc import derivative
%matplotlib notebook


def toOneParamFunc(f, x, w):
    return lambda p: f(x + p*w) 

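def argmin(f, a, b, eps, onedim_opti):
    # Minimize the one-dimensional function f on [a, b] with the given solver.
    # Returns the minimizer and the number of function evaluations it used.
#     fig, ax = plt.subplots()
#     ax.plot(np.linspace(a, b, 1000), [f(y) for y in np.linspace(a, b, 1000)])
    x, f_ev = onedim_opti(f, a, b, eps)
#     ax.scatter(x, f(x))
    return x, f_ev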

def approx_gradient(f, eps):
    return lambda x: approx_fprime(x, f, eps)

def optimization_result(title, fmin, xmin, K, f_ev, j_ev, h_ev=None, res=None):
    # Print only the status line unless the run finished successfully.
    if res != 'succes':
        print(f"{title}\nOptimization {res}")
        return
    # The Hessian counter is reported only by methods that provide it.
    hessian_line = f"number of hessian evaluations: {h_ev}" if h_ev is not None else ''
    print(f"""
{title}
Optimization {res}
x minimum: {xmin},
f minimum: {fmin},
number of iterations: {K},
number of function evaluations: {f_ev},
number of gradient evaluations: {j_ev},
{hessian_line}
""")

In [2]:
test_sqrt1 = [
    danilov,
    danilov_gradient,
    np.array([-2, 2]),
    0.001,
    'Square root func 1 test. Starting point (-2, 2)' 
]

test_sqrt2 = [
    danilov,
    danilov_gradient,
    np.array([4, 3]),
    0.001,
    'Square root func 1 test. Starting point (4, 3)' 
]

test_rosen1 = [
    rosenbrok,
    rosen_gradient,
    np.array([-2, -1]),
    1e-4,
    'Rosenbrock1 test. Starting point (-2, -1)'
]

test_rosen2 = [
    rosenbrok,
    rosen_gradient,
    np.array([-3, 4]),
    1e-4,
    'Rosenbrock2 test. Starting point (-3, 4)'
]

test_rosen3 = [
    rosenbrok,
    rosen_gradient,
    np.array([3, 3]),
    1e-4,
    'Rosenbrock3 test. Starting point (3, 3)'
]


test_himmel1 = [
    himmelblau,
    himmel_gradient,
    np.array([0, -4]),
    1e-4,
    'Himmelblau1 test. Starting point (0, -4)'
]

test_himmel2 = [
    himmelblau,
    himmel_gradient,
    np.array([10, 21]),
    1e-4,
    'Himmelblau2 test. Starting point (10, 21)'
]

test_himmel3 = [
    himmelblau,
    himmel_gradient,
    np.array([-5, 17]),
    1e-4,
    'Himmelblau3 test. Starting point (-5, 17)'
]



# test_rastrigin = [
#     rastrigin,
#     approx_gradient(rastrigin, np.float64(1e-8)),
#     np.array([2, 1]),
#     1e-4
# ]

# test_ackley = [
#     ackley,
#     approx_gradient(ackley, np.float64(1e-9)),
#     np.array([1, 1]),
#     1e-4
# ]

# test_sphere = [
#     sphere,
#     approx_gradient(sphere, np.float64(1e-9)),
#     np.array([-3, 3]),
#     1e-5,
#     [[-3, 3], [0, 10]]
# ]

# test_beale = [
#     beale,
#     approx_gradient(beale, np.float64(1e-9)),
#     np.array([3, 1.5]),
#     1e-3,
#     [[-0.01, 800], [2.9, 1.6]]
# ]

# test_goldstein = [
#     goldstein_price,np.array([2, 1]),
#     approx_gradient(goldstein_price, np.float64(1e-9)),
#     np.array([-1.3, 1]),
#     1e-5,
#     [[-1.5, 1], [0, 50000]]
# ]

# test_booth = [
#     booth,
#     approx_gradient(booth, np.float64(1e-8)),
#     np.array([5, 3]),
#     1e-5,
#     [[0, 8], [0, 700]]
# ]

# test_bukin = [
#     bukin,
#     approx_gradient(bukin, np.float64(1e-8)),
#     np.array([-10.5, 1.5]),
#     1e-5
# ]

# test_himmel = [
#     himmelblau,
#     approx_gradient(himmelblau, np.float64(1e-8)),
#     np.array([0, -4]),
#     1e-5,
#     [[-4, 4], [-0.1, 280]]
# ]

# test_egg = [
#     eggholder,
#     approx_gradient(eggholder, np.float64(1e-8)),
#     np.array([353, -200]),
#     1e-7
# ]

# test_cross = [
#     cross,
#     approx_gradient(cross, np.float64(1e-8)),
#     np.array([2, -2]),
#     1e-4
# ]

### Fastest (steepest) descent method

We construct the relaxational sequence using the rule:
$$
\vec{x}_{k+1} = \vec{x}_k + \lambda_k\vec{w}_k
$$

where $\lambda_k$ is found from
$$
\lambda_k = argmin\{\psi_k(\lambda)\} \\
\psi_k(\lambda) = f(\vec{x}_k + \lambda\vec{w}_k)
$$

Finding the minimum of $\psi_k(\lambda)$ is itself a one-dimensional minimization problem that has to be solved numerically on every iteration. In return, it is guaranteed that $\{||\vec{w}_k||\}$ converges to 0.

So we pick a small $\epsilon > 0$ in advance and continue the procedure while $||\vec{w}_k|| > \epsilon$; at the iteration $N$ where this condition fails we take $\vec{x}_* \approx \vec{x}_N$.
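The procedure just described can be sketched without the notebook's helper modules. Everything below is an illustrative assumption: the golden-section line search stands in for whichever one-dimensional solver is used, the search interval `[0, lam_max]` is chosen by hand, and the test function $f(x, y) = x^2 + 5y^2$ is not one of the notebook's test functions.

```python
import math


def golden_section(phi, a, b, tol=1e-10):
    """Minimize a unimodal function phi on [a, b] by golden-section search."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):          # minimum lies in [a, d]
            b, d = d, c
            c = b - r * (b - a)
        else:                        # minimum lies in [c, b]
            a, c = c, d
            d = a + r * (b - a)
    return 0.5 * (a + b)


def steepest_descent(f, grad, x0, eps=1e-6, lam_max=1.0, max_iter=1000):
    """Repeat x <- x + lambda * w, with lambda minimizing f along w."""
    x = list(x0)
    for k in range(max_iter):
        w = [-g for g in grad(x)]                          # antigradient
        if math.sqrt(sum(c * c for c in w)) <= eps:        # stopping rule
            return x, k
        psi = lambda lam: f([xi + lam * wi for xi, wi in zip(x, w)])
        lam = golden_section(psi, 0.0, lam_max)
        x = [xi + lam * wi for xi, wi in zip(x, w)]
    return x, max_iter


def f(x):
    return x[0] ** 2 + 5.0 * x[1] ** 2


def grad_f(x):
    return [2.0 * x[0], 10.0 * x[1]]


x_min, iterations = steepest_descent(f, grad_f, [4.0, 3.0])
print(x_min, iterations)
```

For this quadratic the exact step always lies between $1/10$ and $1/2$, which is why `lam_max=1.0` is a safe bracket here; other functions need their own interval.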

In [3]:
def fastest_descent(f, gr, x, epsilon, title, onedim_opt):
    # Counters and the animation are created before the try block so that
    # the failure branch can always report them.
    f_ev, j_ev, k = 0, 0, 0
    anim = Animate3D(f, x, title)
    anim.add(x)
    try:
        while True:
            w = -gr(x)                      # antigradient at the current point
            j_ev += 1
            phi = toOneParamFunc(f, x, w)   # psi_k(lambda) = f(x + lambda*w)
            l, i = argmin(phi, 0, 40, np.divide(epsilon, 1e5), onedim_opt)
            f_ev += i
            k += 1
            x = x + l*w
            anim.add(x)
            # The stopping test uses the antigradient from which the last
            # step was taken, so at least one step is always performed.
            if norm(w) <= epsilon:
                break
        return f(x), x, k, f_ev, j_ev, anim, 'succes'
    except Exception:
        return f(x), x, k, f_ev, j_ev, anim, 'fail'

In [4]:
fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_sqrt1, onedim_opt=fibbonaci_method)
optimization_result(test_sqrt1[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=5000).save('examples/Sqrt/Sqrt1-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_sqrt2, onedim_opt=fibbonaci_method)
optimization_result(test_sqrt2[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=5000).save('examples/Sqrt/Sqrt2-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_rosen1, onedim_opt=upgraded_newton)
optimization_result(test_rosen1[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=8000).save('examples/Rosenbrock/Rosenbrock1-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_rosen2, onedim_opt=upgraded_newton)
optimization_result(test_rosen2[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=10000).save('examples/Rosenbrock/Rosenbrock2-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_rosen3, onedim_opt=upgraded_newton)
optimization_result(test_rosen3[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=8000).save('examples/Rosenbrock/Rosenbrock3-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_himmel1, onedim_opt=upgraded_newton)
optimization_result(test_himmel1[4], fmin, xmin, K, f_ev, j_ev, res=res)
a = anim.get_animation(duration=8000).save('examples/Himmelblau/Himmel1-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_himmel2, onedim_opt=upgraded_newton)
optimization_result(test_himmel2[4], fmin, xmin, K, f_ev, j_ev, res=res)
a = anim.get_animation(duration=8000).save('examples/Himmelblau/Himmel2-Fastest-Desc.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = fastest_descent(*test_himmel3, onedim_opt=upgraded_newton)
optimization_result(test_himmel3[4], fmin, xmin, K, f_ev, j_ev, res=res)
a = anim.get_animation(duration=8000).save('examples/Himmelblau/Himmel3-Fastest-Desc.gif')


Square root func 1 test. Starting point (-2, 2)
Optimization succes
x minimum: [-0.30152546 -0.60303916],
f minimum: 3.3166247909062534,
number of iterations: 7,
number of function evaluations: 378,
number of gradient evaluations: 7,



Square root func 1 test. Starting point (4, 3)
Optimization succes
x minimum: [-0.30150214 -0.60303502],
f minimum: 3.316624790723409,
number of iterations: 7,
number of function evaluations: 378,
number of gradient evaluations: 7,



Rosenbrock1 test. Starting point (-2, -1)
Optimization succes
x minimum: [0.99997534 0.99995397],
f minimum: 1.6978904661568623e-09,
number of iterations: 4724,
number of function evaluations: 94816,
number of gradient evaluations: 4724,



Rosenbrock2 test. Starting point (-3, 4)
Optimization succes
x minimum: [0.99998557 0.99997414],
f minimum: 1.1133000599057326e-09,
number of iterations: 4138,
number of function evaluations: 76092,
number of gradient evaluations: 4138,



Rosenbrock3 test. Starting point (3, 3)
Optimi

### Conjugate gradient method

#### Problem 

$$
f(\vec{x}) = \frac{1}{2}(Q\vec{x}, \vec{x}) + (\vec{c}, \vec{x}) \rightarrow min
$$

$Q$ is a symmetric positive definite $n \times n$ matrix, $\vec{c} \in \mathbb{R}^n$ is a constant vector.

This function has a single minimum point $\vec{x}_* = -Q^{-1}\vec{c}$.

To find the inverse matrix $Q^{-1}$ we can use
$$
Q^{-1} = \sum^n_{i=1}\frac{p^i(p^i)^T}{(Qp^i, p^i)}
$$
where $p^i \in \mathbb{R}^n$, $i = \overline{1, n}$, are vectors conjugate with respect to the matrix $Q$, i.e. $(Qp^i, p^j) = 0$ for $i \neq j$.

But constructing a system of conjugate vectors in advance is a rather complex problem.

So we go another way and construct the system of conjugate vectors step by step, one vector per iteration.
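The formula for $Q^{-1}$ can be verified on a small case. The sketch below assumes a concrete $2 \times 2$ matrix and builds two conjugate vectors by a Gram-Schmidt step in the inner product $(Qu, v)$; the matrix and all helper names are illustrative choices, not taken from the notebook.

```python
def mat_vec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(q * x for q, x in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


Q = [[4.0, 1.0], [1.0, 3.0]]   # symmetric positive definite

# Build two Q-conjugate vectors: start from the coordinate vectors and
# remove from e2 its component along p1 in the inner product (Qu, v).
p1 = [1.0, 0.0]
e2 = [0.0, 1.0]
coef = dot(mat_vec(Q, p1), e2) / dot(mat_vec(Q, p1), p1)
p2 = [e - coef * p for e, p in zip(e2, p1)]


def outer_term(p):
    """One summand p p^T / (Qp, p) of the formula."""
    scale = dot(mat_vec(Q, p), p)
    return [[pi * pj / scale for pj in p] for pi in p]


terms = [outer_term(p1), outer_term(p2)]
Q_inv = [[terms[0][i][j] + terms[1][i][j] for j in range(2)] for i in range(2)]
print(Q_inv)
```

The exact inverse here is $\frac{1}{11}\begin{pmatrix} 3 & -1 \\ -1 & 4 \end{pmatrix}$, which the sum reproduces up to rounding.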

$\vec{x}_0$ is the starting point; the antigradient at this point is $\vec{w}_1 = -Q\vec{x}_0 - \vec{c}$, and we choose $\vec{p}_1 = \vec{w}_1$.

Using $\vec{x}_k = \vec{x}_{k-1} + \lambda_k\vec{p}_k$

We can find that 
$$\lambda_1 = \frac{|\vec{w}_1|^2}{(Q\vec{w}_1, \vec{w}_1)} = \frac{|\vec{p}_1|^2}{(Q\vec{p}_1, \vec{p}_1)}$$
(from minimization of quadratic function)
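The step length follows from a one-line computation, written out here as a worked step. It uses $\nabla f(\vec{x}) = Q\vec{x} + \vec{c}$, which holds for a symmetric $Q$, and the notation $\psi_1(\lambda)$ for $f$ restricted to the line through $\vec{x}_0$ along $\vec{p}_1$:
$$
\psi_1(\lambda) = f(\vec{x}_0 + \lambda\vec{p}_1), \\
\psi_1'(\lambda) = (Q(\vec{x}_0 + \lambda\vec{p}_1) + \vec{c}, \vec{p}_1) = -(\vec{w}_1, \vec{p}_1) + \lambda(Q\vec{p}_1, \vec{p}_1) = 0 \\
\Rightarrow \lambda_1 = \frac{(\vec{w}_1, \vec{p}_1)}{(Q\vec{p}_1, \vec{p}_1)} = \frac{|\vec{w}_1|^2}{(Q\vec{w}_1, \vec{w}_1)} \mbox{ for } \vec{p}_1 = \vec{w}_1
$$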

And so $x_1 = x_0 + \lambda_1\vec{p}_1$

On the second iteration ($k = 2$) we evaluate the antigradient $\vec{w}_2 = -Q\vec{x}_1 - \vec{c}$.

Let's look for the next direction in the form
$$\vec{p}_2 = \gamma_1\vec{p}_1 + \vec{w}_2$$

If we take the scalar product of this equation with $Q\vec{p}_1 \neq 0$ and require that $\vec{p}_1, \vec{p}_2$ be conjugate (orthogonal with respect to the matrix $Q$, i.e. $(Q\vec{p}_1, \vec{p}_2) = 0$), we can find $\gamma_1$:
$$\gamma_1 = -\frac{(Q\vec{p}_1, \vec{w}_2)}{(Q\vec{p}_1, \vec{p}_1)}$$
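Written out, the scalar product of $\vec{p}_2 = \gamma_1\vec{p}_1 + \vec{w}_2$ with $Q\vec{p}_1$ and the conjugacy requirement give:
$$
(Q\vec{p}_1, \vec{p}_2) = \gamma_1(Q\vec{p}_1, \vec{p}_1) + (Q\vec{p}_1, \vec{w}_2) = 0 \\
\Rightarrow \gamma_1(Q\vec{p}_1, \vec{p}_1) = -(Q\vec{p}_1, \vec{w}_2)
$$
The division is allowed because $(Q\vec{p}_1, \vec{p}_1) > 0$ for a positive definite $Q$ and $\vec{p}_1 \neq 0$.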

Continuing to construct this system of conjugate vectors, on every iteration $k$ we have the system of equations:
$$
\begin{cases}
    \vec{p}_{k+1} = \gamma_k\vec{p}_k + \vec{w}_{k+1} \\
    \gamma_k = - \frac{(Q\vec{p}_k, \vec{w}_{k+1})}{(Q\vec{p}_k, \vec{p}_k)} \\
    \vec{w}_{k+1} = \vec{w}_k - \lambda_kQ\vec{p}_k \\
    (Q\vec{p}_{k+1}, \vec{p}_i) = 0 \\
    (\vec{w}_{k+1}, \vec{w}_i) = 0, i = \overline{1, k} \\
\end{cases} \\
\mbox{also } \\
\lambda_k = \frac{(\vec{w}_k, \vec{p}_k)}{(Q\vec{p}_k, \vec{p}_k)},\\
\vec{x}_k = \vec{x}_{k-1} + \lambda_k\vec{p}_k
$$
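The recurrences for $\vec{p}_k$, $\gamma_k$, $\vec{w}_{k+1}$ and $\lambda_k$ translate almost line by line into code. The sketch below is a plain-Python illustration for a quadratic function; the function name and the $2 \times 2$ example data are assumptions, not part of the notebook.

```python
def mat_vec(Q, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(q * x for q, x in zip(row, v)) for row in Q]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient_quadratic(Q, c, x0):
    """Minimize 0.5*(Qx, x) + (c, x) for symmetric positive definite Q."""
    n = len(x0)
    x = list(x0)
    w = [-qx - ci for qx, ci in zip(mat_vec(Q, x), c)]    # w_1 = -Q x_0 - c
    p = list(w)                                           # p_1 = w_1
    for _ in range(n):                                    # at most n steps
        if dot(w, w) == 0.0:                              # already at the minimum
            break
        Qp = mat_vec(Q, p)
        lam = dot(w, p) / dot(Qp, p)                      # lambda_k
        x = [xi + lam * pi for xi, pi in zip(x, p)]       # x_k = x_{k-1} + lambda_k p_k
        w_next = [wi - lam * qi for wi, qi in zip(w, Qp)] # w_{k+1} = w_k - lambda_k Q p_k
        gamma = -dot(Qp, w_next) / dot(Qp, p)             # gamma_k
        p = [gamma * pi + wi for pi, wi in zip(p, w_next)]
        w = w_next
    return x


Q = [[4.0, 1.0], [1.0, 3.0]]
c = [-1.0, -2.0]
x_star = conjugate_gradient_quadratic(Q, c, [2.0, 1.0])
print(x_star)
```

For this data $-Q^{-1}\vec{c} = (1/11, 7/11)$, and the loop reaches it in $n = 2$ steps up to rounding.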

In $n$ steps we can find all the conjugate vectors $\vec{p}_k$ and reach the minimum $\vec{x}_* = -Q^{-1}\vec{c}$.

To use this method in our problems (optimization of non-quadratic functions), we need to remove the matrix $Q$ from the system of equations.

We can do this if on every iteration we find $\lambda_k$ by a one-dimensional minimization:
$$
\psi_k(\lambda) = f(\vec{x}_{k-1} + \lambda\vec{p}_k)
$$

The construction of the conjugate directions $\vec{p}_{k+1} = \gamma_k\vec{p}_k + \vec{w}_{k+1}$ relies on the assumption that $(\vec{w}_{k+1}, \vec{w}_i) = 0$ for $i = \overline{1, k}$.

Using this we can show that:
$$
\begin{cases}
    (Q\vec{p}_k, \vec{w}_{k+1}) = - \frac{1}{\lambda_k}|\vec{w}_{k+1}|^2 \\
    (Q\vec{p}_k, \vec{p}_{k}) = \frac{1}{\lambda_k}(\vec{w}_k, \vec{p}_k)
\end{cases} \\
\mbox{so from our system of equations we can evaluate $\gamma_k$ using one of these formulas: } \\
\gamma_k = \frac{|\vec{w}_{k+1}|^2}{|\vec{w}_k|^2} \mbox{ (Fletcher-Reeves)} \\
\gamma_k = \frac{(\vec{w}_{k+1} - \vec{w}_k, \vec{w}_{k+1})}{|\vec{w}_k|^2} \mbox{ (Polak-Ribiere)} \\
\mbox{also, if the function is twice differentiable, we can use the Hessian instead of the matrix Q:} \\
\gamma_k = - \frac{(H(\vec{x}_k)\vec{p}_k, \vec{w}_{k+1})}{(H(\vec{x}_k)\vec{p}_k, \vec{p}_k)} \\
$$
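The two $Q$-free formulas for $\gamma_k$ differ only by the term $(\vec{w}_k, \vec{w}_{k+1})$, so they coincide exactly when consecutive antigradients are orthogonal. A small sketch with hypothetical helper names and hand-picked vectors shows this:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gamma_ratio_of_norms(w_next, w):
    """gamma_k = |w_{k+1}|^2 / |w_k|^2."""
    return dot(w_next, w_next) / dot(w, w)


def gamma_with_difference(w_next, w):
    """gamma_k = (w_{k+1} - w_k, w_{k+1}) / |w_k|^2."""
    diff = [a - b for a, b in zip(w_next, w)]
    return dot(diff, w_next) / dot(w, w)


# (w_{k+1}, w_k) = 0: both formulas agree.
w_next_orth, w_orth = [0.0, 2.0], [1.0, 0.0]
# (w_{k+1}, w_k) != 0, as happens for non-quadratic f: they differ.
w_next_skew, w_skew = [1.0, 1.0], [1.0, 0.0]

print(gamma_ratio_of_norms(w_next_orth, w_orth), gamma_with_difference(w_next_orth, w_orth))
print(gamma_ratio_of_norms(w_next_skew, w_skew), gamma_with_difference(w_next_skew, w_skew))
```

On non-quadratic functions the orthogonality holds only approximately, which is why the two variants can behave differently in practice.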

This method is called the ***conjugate gradient method***.

Since $\gamma_k$ changes on every iteration and $\psi_k(\lambda)$ is minimized only approximately, errors inevitably accumulate. To limit them we do **restarts** (set $\gamma_k = 0$); it is common to restart every $n$ iterations, where $n$ is the dimension of the problem. Also, for non-quadratic functions the procedure in general does not finish in $n$ steps, so we choose $\epsilon$ and iterate through $\{\vec{x}_k\}$ until $|\vec{w}_{k+1}| < \epsilon$, and then take $\vec{x}_k \approx \vec{x}_*$.
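A restart rule can be added to the iteration in a few lines. The sketch below is a self-contained illustration under several assumptions: it uses the formula with the difference $\vec{w}_{k+1} - \vec{w}_k$ for $\gamma_k$, a hand-written bracketing plus golden-section line search, and a convex test function $f(x, y) = (x-1)^4 + (x-1)^2 + 2(y+2)^2$ that is not among the notebook's test functions. All names are hypothetical.

```python
import math


def line_minimum(psi, step=1.0, iterations=100):
    """Bracket a minimum of psi on [0, b] by doubling b, then golden-section search."""
    b = step
    while b < 1e6 and psi(2.0 * b) < psi(b):   # expand while psi still decreases
        b *= 2.0
    a, b = 0.0, 2.0 * b
    r = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - r * (b - a), a + r * (b - a)
    for _ in range(iterations):
        if psi(c) < psi(d):
            b, d = d, c
            c = b - r * (b - a)
        else:
            a, c = c, d
            d = a + r * (b - a)
    return 0.5 * (a + b)


def nonlinear_cg(f, grad, x0, eps=1e-6, max_iter=500):
    n = len(x0)
    x = list(x0)
    w = [-g for g in grad(x)]
    p = list(w)
    for k in range(1, max_iter + 1):
        if math.sqrt(sum(c * c for c in w)) < eps:
            return x, k - 1
        psi = lambda lam: f([xi + lam * pi for xi, pi in zip(x, p)])
        lam = line_minimum(psi)
        x = [xi + lam * pi for xi, pi in zip(x, p)]
        w_next = [-g for g in grad(x)]
        if k % n == 0:
            gamma = 0.0   # restart: the next direction is the plain antigradient
        else:
            gamma = sum((a - b) * a for a, b in zip(w_next, w)) / sum(b * b for b in w)
        p = [gamma * pi + wi for pi, wi in zip(p, w_next)]
        w = w_next
    return x, max_iter


def f(x):
    return (x[0] - 1.0) ** 4 + (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2


def grad_f(x):
    return [4.0 * (x[0] - 1.0) ** 3 + 2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)]


x_min, iterations = nonlinear_cg(f, grad_f, [3.0, 1.0])
print(x_min, iterations)
```

With $n = 2$ the restart fires on every second iteration, so this run alternates between a conjugate step and a plain antigradient step.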



In [5]:
def conjugate_gradient(f, gr, x, epsilon, title, onedim_opt):
    # Counters and the animation are created before the try block so that
    # the failure branch can always report them.
    f_ev, j_ev, k = 0, 0, 0
    anim = Animate3D(f, x, title)
    try:
        w = -gr(x)   # antigradient w_1 at the starting point
        p = w        # first direction: p_1 = w_1
        j_ev += 1
        phi = toOneParamFunc(f, x, p)
        l, i = argmin(phi, 0, 500, np.divide(epsilon, 1e3), onedim_opt)
        f_ev += i
        x = x + l*p
        anim.add(x)
        j_ev += 1
        k = 1
        while norm(w) > epsilon:
            w_2 = -gr(x)
            # gamma_k = (w_{k+1} - w_k, w_{k+1}) / |w_k|^2
            gamma = np.divide(np.dot(w_2 - w, w_2), np.power(norm(w), 2))
            p = gamma*p + w_2   # next direction p_{k+1}
            phi = toOneParamFunc(f, x, p)
            l, i = argmin(phi, 0, 500, np.divide(epsilon, 1e3), onedim_opt)
            x = x + l*p
            anim.add(x)
            w = w_2
            j_ev += 1
            f_ev += i
            k += 1
        return f(x), x, k+1, f_ev, j_ev, anim, 'succes'
    except Exception:
        return f(x), x, k+1, f_ev, j_ev, anim, 'fail'
        

In [6]:
fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_sqrt1, onedim_opt=fibbonaci_method)
optimization_result(test_sqrt1[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=5000).save('examples/Sqrt/Sqrt1-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_sqrt2, onedim_opt=fibbonaci_method)
optimization_result(test_sqrt2[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=5000).save('examples/Sqrt/Sqrt2-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_rosen1, onedim_opt=upgraded_newton)
optimization_result(test_rosen1[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=10000).save('examples/Rosenbrock/Rosenbrock1-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_rosen2, onedim_opt=upgraded_newton)
optimization_result(test_rosen2[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=10000).save('examples/Rosenbrock/Rosenbrock2-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_rosen3, onedim_opt=upgraded_newton)
optimization_result(test_rosen3[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=10000).save('examples/Rosenbrock/Rosenbrock3-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_himmel1, onedim_opt=upgraded_newton)
optimization_result(test_himmel1[4], fmin, xmin, K, f_ev, j_ev, res=res)
# a = anim.get_animation(duration=8000).save('examples/Himmelblau/Himmel1-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_himmel2, onedim_opt=upgraded_newton)
optimization_result(test_himmel2[4], fmin, xmin, K, f_ev, j_ev, res=res)
a = anim.get_animation(duration=8000).save('examples/Himmelblau/Himmel2-Conj-grad.gif')

fmin, xmin, K, f_ev, j_ev, anim, res = conjugate_gradient(*test_himmel3, onedim_opt=upgraded_newton)
optimization_result(test_himmel3[4], fmin, xmin, K, f_ev, j_ev, res=res)
a = anim.get_animation(duration=8000).save('examples/Himmelblau/Himmel3-Conj-grad.gif')


Square root func 1 test. Starting point (-2, 2)
Optimization succes
x minimum: [-0.30152061 -0.60305597],
f minimum: 3.316624791738934,
number of iterations: 7,
number of function evaluations: 264,
number of gradient evaluations: 7,



Square root func 1 test. Starting point (4, 3)
Optimization succes
x minimum: [-0.30151114 -0.60302208],
f minimum: 3.3166247903558785,
number of iterations: 7,
number of function evaluations: 264,
number of gradient evaluations: 7,



Rosenbrock1 test. Starting point (-2, -1)
Optimization succes
x minimum: [1. 1.],
f minimum: 1.0463865198826742e-25,
number of iterations: 18,
number of function evaluations: 325,
number of gradient evaluations: 18,



Rosenbrock2 test. Starting point (-3, 4)
Optimization succes
x minimum: [0.99999947 0.99999895],
f minimum: 2.8090178513397114e-13,
number of iterations: 29,
number of function evaluations: 645,
number of gradient evaluations: 29,



Rosenbrock3 test. Starting point (3, 3)
Optimization succes
x minimum: [1.