## Gradient methods

#### Problem:
$$
f(\vec{x}) \rightarrow \min,\\
f: \Omega \rightarrow \mathbb{R}, \\
\Omega \subset \mathbb{R}^n, f(\vec{x}) \mbox{ is convex}, \\
f(\vec{x}) \mbox{ is differentiable on } \Omega, \\
\vec{x}_* \in \Omega, f_{min} = f(\vec{x}_*)
$$

<em>**Definition**</em>.

A sequence $\{\vec{x}_k\}$ is called **relaxational** if $\forall k \in \mathbb{N}:  f(\vec{x}_k) < f(\vec{x}_{k-1})$ 

If a relaxational sequence $\{\vec{x}_k\}$ is bounded, then by the Bolzano–Weierstrass theorem it has a subsequence that converges to some point $\vec{x}_* \in \mathbb{R}^n$ 
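
The definition can be checked directly on a list of function values $f(\vec{x}_0), f(\vec{x}_1), \dots$ A minimal sketch; the helper name `is_relaxational` is hypothetical and not part of this notebook:

```python
def is_relaxational(values):
    """Return True if every value is strictly smaller than the previous one."""
    return all(curr < prev for prev, curr in zip(values, values[1:]))


decreasing = [13.0, 3.03, 0.72, 0.05]   # strictly decreasing values
with_repeat = [13.0, 3.03, 3.03, 0.05]  # one step does not decrease f
print(is_relaxational(decreasing))
print(is_relaxational(with_repeat))
```

The strict inequality matters: a sequence that stalls for even one step is not relaxational under this definition.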

Let's choose our relaxational sequence by this equation:
$$
\vec{x}_k = \vec{x}_{k-1} + \beta_k\vec{u}_k
$$
where $\vec{u}_{k}$ is a unit vector that defines the direction of descent and $\beta_k \geq 0$ is the length of the descent step

<em>**Lemma**</em>.

Let $f(\vec{x})$ be differentiable on $\Omega \subset \mathbb{R}^n$ and let there exist $L > 0$ such that $\forall \vec{x}, \vec{y} \in \Omega$:
$$
||\nabla f(\vec{x}) - \nabla f(\vec{y})|| \leq  L ||\vec{x} - \vec{y}|| 
$$
Then:
$$
f(\vec{x}) - f(\vec{y}) \geq (\nabla f(\vec{x}), \vec{x} - \vec{y}) - \frac{L}{2}||\vec{x}-\vec{y}||^2
$$
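
The inequality can be spot-checked numerically. The sketch below assumes a quadratic test function $f(\vec{x}) = \frac{1}{2}(Q\vec{x}, \vec{x})$ with an example matrix `Q` chosen here for illustration; for such a function the gradient is $Q\vec{x}$ and $L$ can be taken as the largest eigenvalue of $Q$.

```python
import numpy as np

# Assumed example: f(x) = 1/2 (Qx, x) with a symmetric positive definite Q.
Q = np.array([[12.0, -4.0], [-4.0, 6.0]])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = np.linalg.eigvalsh(Q).max()  # Lipschitz constant of the gradient


def lemma_gap(x, y):
    """Left-hand side minus right-hand side of the lemma; should be >= 0."""
    d = x - y
    return (f(x) - f(y)) - (grad(x) @ d - 0.5 * L * d @ d)


rng = np.random.default_rng(0)
gaps = [lemma_gap(rng.normal(size=2), rng.normal(size=2)) for _ in range(1000)]
print(min(gaps) >= -1e-9)  # small tolerance for floating-point rounding
```

This is only a sanity check on random points, not a proof.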
<em>**Definition**</em>.

$\vec{w}(\vec{x}) = - \nabla f(\vec{x})$ is called **antigradient**

If we take $\vec{u}_k = \frac{\vec{w}_k}{||\vec{w}_k||}$, then applying the lemma with $\vec{x} = \vec{x}_k$ and $\vec{y} = \vec{x}_{k+1} = \vec{x}_k + \beta_k\vec{u}_k$ gives: 

$$
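f(\vec{x}_k) - f(\vec{x}_{k+1}) \geq \left(\nabla f(\vec{x}_k), \vec{x}_k - \vec{x}_k - \beta_k \frac{\vec{w}_k}{||\vec{w}_k||}\right) - \frac{L}{2} \left|\left| \vec{x}_k - \vec{x}_k - \beta_k \frac{\vec{w}_k}{||\vec{w}_k||} \right|\right|^2 = \beta_k||\nabla f(\vec{x}_k)|| - \beta_k^2 \frac{L}{2} 
$$
The right-hand side is positive whenever $\nabla f(\vec{x}_k) \neq 0$ and $0 < \beta_k < \frac{2}{L}||\nabla f(\vec{x}_k)||$, so with such steps the sequence is relaxational. For a convex function the gradient vanishes only at a minimum point, so the descent can continue until a minimum is reached.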
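
As a worked illustration, the lower bound on the decrease, $\beta_k||\nabla f(\vec{x}_k)|| - \beta_k^2\frac{L}{2}$, is largest at $\beta_k = ||\nabla f(\vec{x}_k)|| / L$. The sketch below compares the actual decrease with this bound for a few step lengths; the quadratic function and the starting point are assumptions made for the example.

```python
import numpy as np

Q = np.array([[12.0, -4.0], [-4.0, 6.0]])  # assumed example matrix
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = np.linalg.eigvalsh(Q).max()

x = np.array([-2.0, 1.0])
g = grad(x)
g_norm = np.linalg.norm(g)
u = -g / g_norm  # unit vector along the antigradient


def guaranteed_decrease(beta):
    """Lower bound on f(x_k) - f(x_{k+1}) given by the lemma."""
    return beta * g_norm - 0.5 * L * beta**2


rows = []
for beta in (0.5 * g_norm / L, g_norm / L, 1.5 * g_norm / L):
    actual = f(x) - f(x + beta * u)
    rows.append((beta, actual, guaranteed_decrease(beta)))
    print(f"beta={beta:.3f}  actual={actual:.3f}  bound={rows[-1][2]:.3f}")
```

All three step lengths lie inside $(0, \frac{2}{L}||\nabla f(\vec{x}_k)||)$, so the bound stays positive for each of them.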

All methods in which $\vec{u}_k = \frac{\vec{w}_k}{||\vec{w}_k||}$ are called ***gradient methods***; they differ in the way $\beta_k > 0$ is chosen




In [1]:
import matplotlib as mplib
import math as m
import numpy as np
from numpy.linalg import norm
from functools import reduce

from onedim_optimize import quadratic_approx, newton_method, fibbonaci_method
from scipy.optimize import approx_fprime

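from scipy.misc import derivative  # note: removed in SciPy 1.12; needs an older SciPy

def der(f, x, n):
    # n-th order numerical derivative of a scalar function f at x
    return derivative(f, x, dx=1e-6, n=n)


def toOneParamFunc(f, x, w):
    # Restrict f to the ray x + p*w, giving a function of the scalar step p
    return lambda p: f(x + p*w) 

def argmin(f, a, b, eps):
    # 1-D minimization of f on [a, b]; returns the minimizer and the iteration count
    xmin, k = fibbonaci_method(f, a, b, eps)
    return xmin, k

def approx_gradient(f, eps):
    # Forward-difference approximation of the gradient of f with step eps
    return lambda x: approx_fprime(x, f, eps)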
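
The module `onedim_optimize` is local and its source is not shown here. For readers who want to run the cells, here is a hypothetical stand-in built on golden-section search, which is a related but different method from the Fibonacci method. It assumes only the interface visible in `argmin`: it is called as `(f, a, b, eps)` and returns `(xmin, k)`.

```python
import math


def golden_section(f, a, b, eps):
    """Shrink [a, b] around a minimum of a unimodal f until b - a <= eps.

    Returns (xmin, k): the midpoint of the final interval and the
    number of iterations performed.
    """
    inv_phi = (math.sqrt(5) - 1) / 2  # ~0.618
    x1 = b - inv_phi * (b - a)
    x2 = a + inv_phi * (b - a)
    f1, f2 = f(x1), f(x2)
    k = 0
    while b - a > eps:
        if f1 < f2:  # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - inv_phi * (b - a)
            f1 = f(x1)
        else:        # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + inv_phi * (b - a)
            f2 = f(x2)
        k += 1
    return (a + b) / 2, k


xmin, k = golden_section(lambda t: (t - 0.3) ** 2, 0.0, 1.0, 1e-6)
print(xmin, k)
```

Like the Fibonacci method, this needs one new function evaluation per iteration and assumes the function is unimodal on the interval.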

### Test functions

#### Rosenbrock banana function:
$$
f(x_1, x_2, ..., x_N) = \sum^{N/2}_{i=1}[100(x^2_{2i-1} - x_{2i})^2 + (x_{2i-1} - 1)^2]
$$

In [50]:
def rosenbrock(vec):
    # Sum over consecutive pairs (x_{2i-1}, x_{2i}); vec must have even length
    terms = (
        100*(vec[i]**2 - vec[i+1])**2 + (vec[i] - 1)**2
        for i in range(0, len(vec) - 1, 2)
    )
    return sum(terms)
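
The formula above can also be written in vectorized form, which makes it easy to check known values. A sketch; the name `rosenbrock_np` is hypothetical and it assumes an input of even length:

```python
import numpy as np


def rosenbrock_np(vec):
    """Rosenbrock function over pairs (x_{2i-1}, x_{2i}); needs an even length."""
    x = np.asarray(vec, dtype=float)
    odd, even = x[0::2], x[1::2]  # x_1, x_3, ... and x_2, x_4, ...
    return float(np.sum(100.0 * (odd**2 - even) ** 2 + (odd - 1.0) ** 2))


print(rosenbrock_np([1, 1, 1, 1]))  # 0.0, the global minimum
print(rosenbrock_np([-2, 1]))       # 909.0 = 100*(4 - 1)**2 + (-3)**2
```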

### Fastest descent method

We construct a relaxational sequence using the rule:
$$
\vec{x}_{k+1} = \vec{x}_k + \lambda_k\vec{w}_k
$$

Where $\lambda_k$ is found from
$$
\lambda_k = \arg\min_{\lambda}\{\psi_k(\lambda)\} \\
\psi_k(\lambda) = f(\vec{x}_k + \lambda\vec{w}_k)
$$

Finding the minimum of $\psi_k(\lambda)$ is itself a one-dimensional minimization problem and can be fairly expensive. However, it is guaranteed that $\{|\vec{w}_k|\}$ converges to 0.

So at the start we pick some small $\epsilon$ and continue the procedure while $|\vec{w}_k| > \epsilon$; at the iteration $N$ where this condition fails we take $x_* \approx x_N$
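
For a quadratic function $f(\vec{x}) = \frac{1}{2}(Q\vec{x}, \vec{x}) + (\vec{c}, \vec{x})$ the one-dimensional minimization has a closed form, $\lambda_k = \frac{|\vec{w}_k|^2}{(Q\vec{w}_k, \vec{w}_k)}$, so the whole procedure fits in a few lines. The sketch below applies it to the quadratic `f1` from the test cell, rewritten with an explicit `Q` and `c`; the function name and the `max_iter` safeguard are assumptions of this example, not part of the notebook code.

```python
import numpy as np

# f1 from the test cell, written as 1/2 (Qx, x) + (c, x) + 22.
Q = np.array([[12.0, -4.0], [-4.0, 6.0]])
c = 4 * np.sqrt(5) * np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x + c @ x + 22


def steepest_descent_quadratic(x, epsilon, max_iter=1000):
    """Steepest descent with the exact step for a quadratic function."""
    k = 0
    w = -(Q @ x + c)  # antigradient
    while np.linalg.norm(w) > epsilon and k < max_iter:
        lam = (w @ w) / (w @ Q @ w)  # exact minimizer of psi_k(lambda)
        x = x + lam * w
        w = -(Q @ x + c)
        k += 1
    return x, k


x_min, k = steepest_descent_quadratic(np.array([-2.0, 1.0]), 1e-6)
print(x_min, f(x_min), k)
```

For this function the exact minimum is $\vec{x}_* = (-\sqrt{5}, -2\sqrt{5})$ with $f_{min} = -28$, which gives a reference value for the general-purpose routine.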

In [3]:
def fastest_descent(f, gr, x, epsilon):
    """Descend along the antigradient, choosing each step by 1-D minimization."""
    n = 0  # total iterations spent in the 1-D minimizations
    k = 0  # number of descent steps
    w = -gr(x)
    while norm(w) > epsilon:
        phi = toOneParamFunc(f, x, w)
        l, i = argmin(phi, 0, 1, epsilon)
        n += i
        k += 1
        print(x, f(x), l, norm(w))
        x = x + l*w
        w = -gr(x)  # antigradient at the new point, used by the stopping test
    return f(x), x, k, n

In [4]:
f1 = lambda x: 6*x[0]**2 - 4*x[0]*x[1] + 3*x[1]**2 + 4*m.sqrt(5)*(x[0] + 2*x[1]) + 22
f2 = lambda x: (x[0]**2 - x[1])**2 + (x[0] - 1)**2
danilov = lambda x: x[0] + 2*x[1] + 4*m.sqrt(1 + x[0]**2 + x[1]**2)

# rosenbrock_test = [
#     rosenbrock,
#     approx_gradient(rosenbrock, 1e-8),
#     np.array([-2, 1]),
#     0.01,
# ]

test1 = [
    f1,
    approx_gradient(f1, 1e-8),
    np.array([-2, 1]),
    0.01,
]
test2 = [
    f2,
    approx_gradient(f2, 1e-8),
    np.array([-1, -2]),
    0.001,
]

test_danilov = [
    danilov,
    approx_gradient(danilov, 1e-8),
    np.array([50, 41]),
    0.01,
]

fmin, xmin, K, N = fastest_descent(*test_danilov)
print(f"""
x minimum: {xmin},
f minimum: {fmin},
number of iterations: {K}
number of one-dimension minimization iterations: {N}
""")

[50 41] 390.67353942759587 0.8611055555555555 6.109475309776838
[46.47575064 37.09400868] 358.5531835649015 0.8611055555555555 6.101374413275004
[42.92293512 33.22344605] 326.5214139811794 0.8611055555555555 6.091919866988924
[39.33848929 29.39329994] 294.5932166774629 0.8611055555555555 6.080754565866896
[35.71870287 25.60983081] 262.7878077458083 0.8611055555555555 6.06730697439827
[32.05906284 21.88110532] 231.13083688529028 0.8611055555555555 6.050751828636447
[28.35396219 18.21779663] 199.6575078183351 0.8611055555555555 6.029801821920918
[24.59630344 14.63451087] 168.41816371048645 0.8611055555555555 6.002299784785708
[20.77691124 11.1521538 ] 137.48889438681735 0.881936111111111 5.964386744828725
[16.78947904  7.7213852 ] 106.25996272482165 0.881936111111111 5.906467775345529
[12.70717279  4.48568094] 75.72941052317098 0.8611055555555555 5.8092011935232675
[8.60698079 1.62005967] 47.10721056038919 0.8611055555555555 5.616566236099833
[ 4.38274676 -0.73518071] 21.1327958142819 0.

### Conjugate gradient method

#### Problem 

$$
f(\vec{x}) = \frac{1}{2}(Q\vec{x}, \vec{x}) + (\vec{c}, \vec{x}) \rightarrow \min
$$

$Q$ is a symmetric positive definite $n \times n$ matrix and $\vec{c} \in \mathbb{R}^n$ is a constant vector

This function has a single minimum point $x_* = -Q^{-1}\vec{c}$
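
A quick numerical check of this formula, as a sketch. `Q` and `c` are the example values corresponding to the quadratic `f1` used in the test cells; the system is solved with `np.linalg.solve` instead of forming $Q^{-1}$ explicitly.

```python
import numpy as np

Q = np.array([[12.0, -4.0], [-4.0, 6.0]])  # symmetric positive definite
c = 4 * np.sqrt(5) * np.array([1.0, 2.0])

x_star = -np.linalg.solve(Q, c)  # x_* = -Q^{-1} c
print(x_star)
print(Q @ x_star + c)            # gradient at x_*, should be close to zero
```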

To find the inverse matrix $Q^{-1}$ we can use
$$
Q^{-1} = \sum^n_{i=1}\frac{p^i(p^i)^T}{(Qp^i, p^i)}
$$
where $p^i \in \mathbb{R}^n$, $i = \overline{1, n}$, are vectors conjugate with respect to the matrix $Q$, i.e. $(Qp^i, p^j) = 0$ for $i \neq j$
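
The formula can be verified on a small example. The sketch below builds a conjugate system by Gram-Schmidt orthogonalization of the standard basis in the inner product $(Q\cdot, \cdot)$; this is one possible construction chosen for illustration, and the helper name `conjugate_basis` is hypothetical.

```python
import numpy as np


def conjugate_basis(Q):
    """Q-orthogonalize the standard basis (Gram-Schmidt in the (Q., .) inner product)."""
    ps = []
    for e in np.eye(Q.shape[0]):
        p = e.copy()
        for q in ps:
            p -= (q @ Q @ e) / (q @ Q @ q) * q  # remove the Q-component along q
        ps.append(p)
    return ps


Q = np.array([[12.0, -4.0], [-4.0, 6.0]])
ps = conjugate_basis(Q)
Q_inv = sum(np.outer(p, p) / (p @ Q @ p) for p in ps)
print(np.allclose(Q_inv, np.linalg.inv(Q)))
```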

But constructing a system of conjugate vectors in advance is itself a fairly complex problem.

So we take another route and construct the conjugate vectors one per iteration

$\vec{x}_0$ is the starting point; the antigradient at this point is $\vec{w}_1 = -Q\vec{x}_0 - \vec{c}$, and we choose $\vec{p}_1 = \vec{w}_1$

Using $\vec{x}_k = \vec{x}_{k-1} + \lambda_k\vec{p}_k$

We can find that 
$$\lambda_1 = \frac{|\vec{w}_1|^2}{(Q\vec{w}_1, \vec{w}_1)} = \frac{|\vec{p}_1|^2}{(Q\vec{p}_1, \vec{p}_1)}$$
(from minimization of quadratic function)

And so $x_1 = x_0 + \lambda_1\vec{p}_1$

On second iteration (k = 2) we evaluate antigradient $\vec{w}_2 = -Q\vec{x_1} - c$

Let's assume, that
$$\vec{p}_2 = \gamma_1\vec{p}_1 + \vec{w}_2$$

Taking the scalar product of this equation with $Q\vec{p}_1 \neq 0$ and requiring that $\vec{p}_1, \vec{p}_2$ be conjugate (orthogonal with respect to the matrix $Q$, i.e. $(Q\vec{p}_1, \vec{p}_2) = 0$), we can find $\gamma_1$
$$\gamma_1 = -\frac{(Q\vec{p}_1, \vec{w}_2)}{(Q\vec{p}_1, \vec{p}_1)}$$

Continuing to construct this system of conjugate vectors, on every iteration $k$ we have the system of equations:
$$
\begin{cases}
    \vec{p}_{k+1} = \gamma_k\vec{p}_k + \vec{w}_{k+1} \\
    \gamma_k = - \frac{(Q\vec{p}_k, \vec{w}_{k+1})}{(Q\vec{p}_k, \vec{p}_k)} \\
    \vec{w}_{k+1} = \vec{w}_k - \lambda_kQ\vec{p}_k \\
    (Q\vec{p}_{k+1}, \vec{p}_i) = 0, i = \overline{1, k} \\
    (\vec{w}_{k+1}, \vec{w}_i) = 0, i = \overline{1, k} \\
\end{cases} \\
\mbox{also } \\
\lambda_k = \frac{(\vec{w}_k, \vec{p}_k)}{(Q\vec{p}_k, \vec{p}_k)},\\
\vec{x}_k = \vec{x}_{k-1} + \lambda_k\vec{p}_k
$$

In $n$ steps we find all the conjugate vectors $\vec{p}_k$ and reach the minimum $x_* = -Q^{-1}\vec{c}$
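
The recurrences for the quadratic case can be sketched as follows. The function name and the example `Q`, `c` are assumptions of this illustration; the update of the antigradient uses $\vec{w}_{k+1} = \vec{w}_k - \lambda_k Q\vec{p}_k$, which follows from $\vec{w} = -Q\vec{x} - \vec{c}$ and the step $\vec{x}_k = \vec{x}_{k-1} + \lambda_k\vec{p}_k$.

```python
import numpy as np


def conjugate_gradient_quadratic(Q, c, x0):
    """Minimize 1/2 (Qx, x) + (c, x) in at most n steps (up to rounding)."""
    x = np.asarray(x0, dtype=float)
    w = -(Q @ x + c)  # antigradient w_1
    p = w.copy()      # first direction p_1 = w_1
    for _ in range(len(x)):
        if np.linalg.norm(w) < 1e-14:
            break  # already at the minimum
        Qp = Q @ p
        lam = (w @ p) / (Qp @ p)           # lambda_k
        x = x + lam * p                    # x_k = x_{k-1} + lambda_k p_k
        w_next = w - lam * Qp              # w_{k+1} = w_k - lambda_k Q p_k
        gamma = -(Qp @ w_next) / (Qp @ p)  # gamma_k
        p = gamma * p + w_next             # p_{k+1}
        w = w_next
    return x


Q = np.array([[12.0, -4.0], [-4.0, 6.0]])
c = 4 * np.sqrt(5) * np.array([1.0, 2.0])
x_cg = conjugate_gradient_quadratic(Q, c, [-2.0, 1.0])
print(x_cg, -np.linalg.solve(Q, c))
```

For this two-dimensional example the loop runs two iterations, and the result agrees with the direct solution up to rounding error.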

To use this method in our problems (non-quadratic function optimization), we need to remove the matrix $Q$ from the system of equations

We can do this if on every iteration we find $\lambda_k$ by minimizing:
$$
\psi_k(\lambda) = f(\vec{x}_{k-1} + \lambda\vec{p}_k)
$$

The construction of the conjugate directions $\vec{p}_{k+1} = \gamma_k\vec{p}_k + \vec{w}_{k+1}$ relies on the assumption that $(\vec{w}_{k+1}, \vec{w}_i) = 0$

Using this we can show that:
$$
\begin{cases}
    (Q\vec{p}_k, \vec{w}_{k+1}) = - \frac{1}{\lambda_k}|\vec{w}_{k+1}|^2 \\
    (Q\vec{p}_k, \vec{p}_{k}) = \frac{1}{\lambda_k}(\vec{w}_k, \vec{p}_k)
\end{cases} \\
\mbox{so from our system of equations we can evaluate $\gamma_k$ using one of these formulas: } \\
\gamma_k = \frac{|\vec{w}_{k+1}|^2}{|\vec{w}_k|^2} \quad \mbox{(Fletcher-Reeves)} \\
\gamma_k = \frac{(\vec{w}_{k+1} - \vec{w}_k, \vec{w}_{k+1})}{|\vec{w}_k|^2} \quad \mbox{(Polak-Ribiere)} \\
\mbox{also, if the function is twice differentiable, we can use the Hessian instead of the matrix $Q$:} \\
\gamma_k = - \frac{(H(\vec{x}_k)\vec{p}_k, \vec{w}_{k+1})}{(H(\vec{x}_k)\vec{p}_k, \vec{p}_k)} \\
$$
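
On a quadratic function all of these expressions for $\gamma_k$ give the same value, which can be checked on the first step. A sketch with an assumed $3 \times 3$ positive definite example (`Q` and `c` here are illustrative, not from the notebook):

```python
import numpy as np

Q = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])  # assumed symmetric positive definite example
c = np.array([1.0, -2.0, 0.5])

x = np.zeros(3)
w = -(Q @ x + c)             # w_1
p = w.copy()                 # p_1 = w_1
lam = (w @ p) / (p @ Q @ p)  # lambda_1
w_next = w - lam * (Q @ p)   # w_2

gamma_q = -((Q @ p) @ w_next) / ((Q @ p) @ p)  # definition with Q
gamma_fr = (w_next @ w_next) / (w @ w)         # first Q-free formula
gamma_pr = ((w_next - w) @ w_next) / (w @ w)   # second Q-free formula
print(gamma_q, gamma_fr, gamma_pr)
```

For non-quadratic functions the formulas are no longer equivalent, which is why the choice between them matters in practice.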

This method is called the ***conjugate gradient method***

Since each $\gamma_k$ is computed approximately and $\psi_k(\lambda)$ has to be minimized numerically, errors inevitably accumulate. To limit them we do **restarts** (set $\gamma_k = 0$). It is common to restart every $n$ iterations, where $n$ is the dimension of the problem. Also, for non-quadratic functions the optimization in general does not finish in $n$ steps, so we choose $\epsilon$ and iterate through $\{\vec{x}_k\}$ until $|\vec{w}_{k+1}| < \epsilon$, and then $x_k \approx x_*$ 



In [12]:
def conjugate_gradient(f, gr, x, epsilon):
    w = -gr(x)  # antigradient at the starting point
    p = w       # first search direction
    phi = toOneParamFunc(f, x, p)
    l, i = argmin(phi, 0, 1, epsilon)
    n = i       # total iterations spent in the 1-D minimizations
    print(x, f(x), l, norm(w))
    x = x + l*p
    k = 1
    while norm(p) > epsilon:
        w_2 = -gr(x)
        # Restarts are disabled here; to restart every len(x) steps,
        # use gamma = 0 whenever k % len(x) == 0.
        gamma = np.dot(w_2 - w, w_2) / norm(w)**2
        p = gamma*p + w_2
        phi = toOneParamFunc(f, x, p)
        l, i = argmin(phi, 0, 1, epsilon)
        n += i
        print(x, f(x), l, norm(w_2))
        x = x + l*p
        w = w_2
        k += 1
    return f(x), x, k, n
        

In [16]:
f1 = lambda x: 6*x[0]**2 - 4*x[0]*x[1] + 3*x[1]**2 + 4*m.sqrt(5)*(x[0] + 2*x[1]) + 22
f2 = lambda x: (x[0]**2 - x[1])**2 + (x[0] - 1)**2
danilov = lambda x: x[0] + 2*x[1] + 4*m.sqrt(1 + x[0]**2 + x[1]**2)

test1 = [
    f1,
    approx_gradient(f1, 1e-8),
    np.array([-2, 1]),
    0.01,
]
test2 = [
    f2,
    approx_gradient(f2, 1e-8),
    np.array([-1, -2]),
    0.001,
]

test_danilov = [
    danilov,
    approx_gradient(danilov, 1e-8),
    np.array([50, 41]),
    0.01,
]

fmin, xmin, K, N = conjugate_gradient(*test2)
print(f"""
x minimum: {xmin},
f minimum: {fmin},
number of iterations: {K}
number of one-dimension minimization iterations: {N}
""")

[-1 -2] 13 0.08641208515967441 17.08800732430959
[ 0.38259335 -1.48152749] 3.0312661795548737 0.3926112085159674 3.4898519294762607
[ 0.16352927 -0.10041696] 0.7158526356196627 0.491546737633062 1.6099780373832036
[0.86026478 0.55864271] 0.05243652710301816 0.2886663243581716 0.5005165110397883
[0.87224152 0.7688523 ] 0.016386983559418717 0.3876018159048215 0.28404902059240583
[0.9939935  0.97026447] 0.0003514459836750528 0.1477771195992486 0.06851875075807531
[0.99792458 0.9963406 ] 4.544682442950059e-06 0.24420792736380711 0.006172668084462344
[1.0000079  1.00004817] 1.1105122826242846e-09 0.12773956167814654 0.0001308176732386466

x minimum: [1.00000128 1.00000229],
f minimum: 1.7281815932334995e-12,
number of iterations: 8
number of one-dimension minimization iterations: 128

