# Lab 5 Practice

This notebook does not need to be submitted. Problems similar to problem 1 will be featured in the midterm exam.

In [None]:
import numpy as np

In [None]:
# for plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d # plotting surfaces
from matplotlib.colors import LogNorm # for later use, display colormap in log scale
%matplotlib inline

In [None]:
# helper you can use to plot the descent path (x_vals, y_vals) on top of the contours of f
def plot_gradient_descent(x_vals,y_vals,f):
    X = np.linspace(-4.5,4.5,300)
    Y = np.linspace(-4.5,4.5,300)
    X, Y = np.meshgrid(X,Y)
    Z = f(X, Y)

    # these 4 lines of code plot the contours
    fig, ax = plt.subplots(figsize=(12, 8))
    CS = ax.contour(X, Y, Z, levels=np.logspace(0, 5, 35), cmap='jet', norm = LogNorm())
    plt.axis('tight')
    ax.clabel(CS, inline=True, fontsize=8)
    
    num_steps = len(x_vals)
    delta_n = max(num_steps//10, 1)  # at least 1, so range() never gets a zero step
    for i in range(0,num_steps-delta_n,delta_n):
        # plt.scatter(x_vals[i], y_vals[i])
        plt.arrow(x_vals[i], y_vals[i], (x_vals[i+delta_n] - x_vals[i]), 
              (y_vals[i+delta_n] - y_vals[i]), 
              head_width=0.3, head_length=0.2, linewidth = 1.5, color='red')
    plt.show()

# Gradient descent of Beale function
Reference: Lecture 12 notebook

In this lab we continue to explore the gradient descent method on the Beale function, which is one of the [benchmarks for testing optimization algorithms](https://en.wikipedia.org/wiki/Test_functions_for_optimization):
$$\displaystyle f(x,y)=\left(1.5-x+xy\right)^{2}+\left(2.25-x+xy^{2}\right)^{2} 
+\left(2.625-x+xy^{3}\right)^{2}$$
The global minimum of this function is attained at $(3,0.5)$. However, the function has lots of traps: saddle points, local minima, and a very flat plateau near the global minimum, where the gradient nearly vanishes.

In [None]:
f  = lambda x, y: (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2
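As a quick sanity check of the definition, the sketch below evaluates the Beale function at the stated minimum $(3, 0.5)$ and on a coarse grid. The names `f_min`, `f_start`, and the grid size are illustrative choices, not part of the lab.

```python
import numpy as np

# Beale function, same definition as in the cell above
f = lambda x, y: (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2

f_min = f(3, 0.5)    # value at the stated global minimum
f_start = f(1, 2)    # value at a sample point away from the minimum

# f also accepts arrays, which is what the contour plot relies on
X, Y = np.meshgrid(np.linspace(-4.5, 4.5, 50), np.linspace(-4.5, 4.5, 50))
Z = f(X, Y)

print(f_min, f_start, Z.min())
```

Since the function is a sum of squares, it can never be negative, so a value of zero at $(3, 0.5)$ confirms that this point is a global minimizer.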

## Vanilla gradient descent

> Choose initial guess $(x_0,y_0)$ and step size (learning rate) $\eta$<br><br>
>    For $k=0,1,2, \cdots, M$<br><br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    $(x_{k+1},y_{k+1}) =  (x_k,y_k) - \eta\nabla f(x_k,y_k) $

A self-contained implementation of the algorithm is as follows, where we use the *central difference* to approximate the partial derivatives:

$$\frac{\partial f}{\partial x} \approx \frac{f(x + h, y) - f(x - h,y )}{2h}$$
and
$$\frac{\partial f}{\partial y} \approx \frac{f(x, y+h) - f(x,y-h)}{2h}.$$

In [None]:
def grad_descent(f, x0 = (0,0), eta=1e-2, h=1e-6, num_steps=200):
    '''
    Gradient descent algorithm using the numerical gradient.
    f: function to be minimized
    x0: initial guess, array-like (tuple, list, array)
    eta: step size
    h: numerical gradient's h
    num_steps: total number of iterations
    '''
    x, y = x0[0], x0[1]
    
    numpartialx = lambda x, y: 0.5*(f(x+h, y)-f(x-h, y))/h
    numpartialy = lambda x, y: 0.5*(f(x, y+h)-f(x, y-h))/h

    x_vals = np.zeros(num_steps)
    y_vals = np.zeros(num_steps)
    f_vals = np.zeros(num_steps)

    for i in range(num_steps):
        dx, dy = numpartialx(x,y), numpartialy(x,y)
        x -= eta*dx
        y -= eta*dy
        x_vals[i], y_vals[i], f_vals[i] = x, y, f(x,y)

    return x_vals, y_vals, f_vals
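To get a feeling for the role of `h` in the central difference, here is a minimal sketch on a one-dimensional function whose derivative is known exactly. The test function $\sin x$, the point $x = 1$, and the helper name `central_diff` are assumptions made for illustration only.

```python
import numpy as np

def central_diff(g, x, h):
    # central difference approximation of g'(x)
    return 0.5 * (g(x + h) - g(x - h)) / h

x = 1.0
exact = np.cos(x)  # exact derivative of sin at x

errors = []
for h in (1e-1, 1e-2, 1e-3):
    err = abs(central_diff(np.sin, x, h) - exact)
    errors.append(err)
    print(f"h = {h:.0e}, error = {err:.2e}")
```

The truncation error of the central difference is of order $h^2$, so shrinking `h` by a factor of 10 should shrink the error by roughly a factor of 100 in this range.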

# Problem 1: Nesterov's accelerated gradient (NAG)
Implement the following routine by modifying the gradient descent above:
> Choose $(x_0,y_0)$, $\eta$, $\gamma$, and let $\mathbf{v}_{-1} = (0,0)$ <br><br>
>    For $k=0,1,2, \cdots, M$<br><br>
>    &nbsp;&nbsp;&nbsp;&nbsp;   $\mathbf{v}_k = \gamma \mathbf{v}_{k-1} 
+ \eta \nabla f\big( (x_k,y_k) - \gamma \mathbf{v}_{k-1} \big)$<br><br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    $(x_{k+1},y_{k+1}) =  (x_k,y_k) - \mathbf{v}_k  $

$\mathbf{v}_k$ can be viewed as an accelerated gradient. Notice that the gradient, instead of being evaluated at $(x_k,y_k)$, is evaluated at $(x_k,y_k) - \gamma \mathbf{v}_{k-1}$.

This is called Nesterov's accelerated gradient: it incorporates "momentum" and tries to extrapolate the gradient at the next iteration step.
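To see the look-ahead step in action without giving away the exercise, here is a minimal sketch on a one-dimensional toy problem. Everything in it is an assumption for illustration: the objective $g(x) = x^2$, its exact derivative, and the helper names `toy_grad`, `toy_gd`, and `toy_nag`. It is not the Beale implementation asked for below.

```python
def toy_grad(x):
    # exact derivative of the toy objective g(x) = x**2 (illustrative choice)
    return 2.0 * x

def toy_gd(x0, eta=1e-2, num_steps=100):
    # vanilla gradient descent on the toy objective
    x = x0
    for _ in range(num_steps):
        x -= eta * toy_grad(x)
    return x

def toy_nag(x0, eta=1e-2, gamma=0.9, num_steps=100):
    # NAG on the toy objective; v starts at 0, playing the role of v_{-1}
    x, v = x0, 0.0
    for _ in range(num_steps):
        lookahead = x - gamma * v                  # point where the gradient is evaluated
        v = gamma * v + eta * toy_grad(lookahead)  # accelerated gradient v_k
        x -= v                                     # x_{k+1} = x_k - v_k
    return x

x_gd = toy_gd(5.0)
x_nag = toy_nag(5.0)
print(abs(x_gd), abs(x_nag))
```

With the same step size and number of steps, the NAG iterate should end much closer to the minimizer $x = 0$ than the vanilla one on this toy problem.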

## Questions

* Implement the NAG using the template above (vanilla gradient descent). Nesterov suggested in his original paper that a good candidate value for $\gamma$ is 0.9. Please use this as the default value for $\gamma$ in the input arguments.

* After you define the function in the following cell, try $(x_0,y_0) = (1,2)$, step size $\eta = 10^{-3}$, $\gamma = 0.8$, and a total of $1000$ steps for NAG on the Beale function.

* Try changing $h = 10^{-6}$ to $h=10^{-1}$ in the numerical partial derivatives. Do you still get the same result?

* Try $(x_0,y_0) = (-2,-2)$, step size $\eta = 10^{-3}$, $\gamma = 0.7$, and a total of $1000$ steps for the Beale function. Please keep $h = 10^{-6}$ in the numerical partial derivatives, which is the default value of that argument. Do you get the same result?

* Try the vanilla gradient descent (without the accelerated momentum correction) with the same parameter settings as in the first experiment ($(x_0,y_0) = (1,2)$). You should observe that the gradient descent does not end up anywhere near the minimum point $(3,0.5)$. Moreover, the vanilla gradient descent is very sensitive to the step size for the Beale function. With NAG, however, we should end up reasonably near the minimum even with a moderately large step size (try $\eta = 5\cdot 10^{-3}$).

HINT: instead of only updating the gradient at each iteration, you might want to keep track of the gradient at each iteration (or at least store the gradient from the previous iteration), and treat the $0$-th iteration differently from the others.

Reference: [An overview of gradient descent optimization algorithms](http://ruder.io/optimizing-gradient-descent/index.html#nesterovacceleratedgradient)

# Problem 2 (optional read)
Reading other people's implementations and the manuals of commonly used packages is a data scientist's bread and butter.

* Read the [Newton method for optimizing a function](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization), which uses second-derivative information (the Hessian).

* Read the [`scipy.optimize.minimize` manual](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html#scipy.optimize.minimize) in `SciPy`, and write a short script using this function to minimize the Beale function with initial guess $(4,4)$ and `method` set to Newton-CG.
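To make the idea of the Newton method concrete, here is a minimal one-dimensional sketch, where the Hessian reduces to the second derivative and the update is $x_{k+1} = x_k - g'(x_k)/g''(x_k)$. The toy objective $g(x) = e^x - 2x$ (with minimizer $x = \ln 2$) and the names `dg`, `d2g`, and `newton_1d` are assumptions chosen for illustration.

```python
import numpy as np

def dg(x):
    # first derivative of the toy objective g(x) = exp(x) - 2x
    return np.exp(x) - 2.0

def d2g(x):
    # second derivative of the toy objective (the 1D "Hessian")
    return np.exp(x)

def newton_1d(x0, num_steps=10):
    # Newton iteration for minimization: x <- x - g'(x) / g''(x)
    x = x0
    for _ in range(num_steps):
        x -= dg(x) / d2g(x)
    return x

x_star = newton_1d(0.0)
print(x_star, np.log(2.0))
```

Unlike gradient descent, there is no step size to tune here: the second derivative rescales the step automatically, at the cost of having to compute it.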

Note that in [Newton-CG](https://docs.scipy.org/doc/scipy/reference/optimize.minimize-newtoncg.html#optimize-minimize-newtoncg)'s manual, the parameter `jac` is required; this is the gradient function. In the reference's example, the Rosenbrock function is used:
```python
from scipy.optimize import minimize, rosen, rosen_der
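x0 = [1.3, 0.7, 0.8, 1.9, 1.2]  # initial guess
res = minimize(rosen, x0, method='Newton-CG', jac=rosen_der)
print(res.x)  # approximate minimizer found by Newton-CG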
```

Check the help of `rosen` and `rosen_der` in the [SciPy optimize tutorial](https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html), then mimic the example: re-define the Beale function in `rosen`'s format and its exact gradient in `rosen_der`'s format.
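The `rosen` format means the function takes a single array argument instead of separate `x` and `y`, and the gradient returns an array of the same length. A minimal sketch of this format on a simple quadratic bowl is given here; the function $(p_0 - 1)^2 + 2(p_1 + 2)^2$ and the names `bowl` and `bowl_der` are hypothetical and only stand in for the Beale function you are asked to write.

```python
import numpy as np
from scipy.optimize import minimize

def bowl(p):
    # hypothetical objective in rosen's format: one array argument p = (p0, p1)
    return (p[0] - 1.0)**2 + 2.0*(p[1] + 2.0)**2

def bowl_der(p):
    # exact gradient in rosen_der's format: returns an array with the same length as p
    return np.array([2.0*(p[0] - 1.0), 4.0*(p[1] + 2.0)])

res = minimize(bowl, x0=[4.0, 4.0], method='Newton-CG', jac=bowl_der)
print(res.x)
```

The same pattern should carry over to the Beale function: unpack `p` into `x` and `y` inside the function, and return the two exact partial derivatives as one array.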



