# Lab 5 Assignment

### Please type your name here:
### Do not forget to change the filename with your name appended

In this lab, please fill the indicated cells with your code and explanations, ***run*** everything (select `Cell` in the menu, and click `Run All`), save the notebook with your name appended to the filename (for example, `Lab-05-assignment-caos.ipynb`), and upload it to Canvas.
<br><br>
This lab assignment contains only 1 problem. 
<br><br>
Read the problem carefully and answer it as best you can. You may copy code from the Lecture 14 notebook. A better approach, though, is to paste that code in as comments and then type it out yourself using auto-completion.
<br><br>
To learn how to use a function, instead of asking others (TA, friend, your neighbor), you can put the cursor inside an empty pair of parentheses and press `Shift + Tab` (hold the Shift key, press Tab) to read the help in the pop-up window, for example:

In [None]:
import numpy as np

In [None]:
# you might find the following useful
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import axes3d # plotting surfaces
from matplotlib.colors import LogNorm # for later use, display colormap in log scale
%matplotlib inline

In [None]:
# function you can use to plot the convergence in contour using x_vals, y_vals
def plot_gradient_descent(x_vals,y_vals,f):
    X = np.linspace(-4.5,4.5,300)
    Y = np.linspace(-4.5,4.5,300)
    X, Y = np.meshgrid(X,Y)
    Z = f(X, Y)

    # these 4 lines of code plot the contour
    fig, ax = plt.subplots(figsize=(12, 8))
    CS = ax.contour(X, Y, Z, levels=np.logspace(0, 5, 35), cmap='jet', norm = LogNorm())
    plt.axis('tight')
    ax.clabel(CS, inline=True, fontsize=8)
    
    num_steps = len(x_vals)
    delta_n = max(1, num_steps//20)  # avoid a zero range step when num_steps < 20
    for i in range(0,num_steps-delta_n,delta_n):
        # plt.scatter(x_vals[i], y_vals[i])
        plt.arrow(x_vals[i], y_vals[i], (x_vals[i+delta_n] - x_vals[i]), 
              (y_vals[i+delta_n] - y_vals[i]), 
              head_width=0.3, head_length=0.2, linewidth = 1.5, color='red')

    plt.show()

# Gradient descent of Beale function
Reference: Lecture 12 notebook, Lab 5 practice.

In this lab we will continue to explore the gradient descent method for the Beale function, which is one of the [benchmarks for testing optimization algorithms](https://en.wikipedia.org/wiki/Test_functions_for_optimization):
$$\displaystyle f(x,y)=\left(1.5-x+xy\right)^{2}+\left(2.25-x+xy^{2}\right)^{2} 
+\left(2.625-x+xy^{3}\right)^{2}$$
We know that the global minimum of this function is attained at $(3,0.5)$. But it has lots of traps (saddle points, local minima, and a very flat gradient near the global minimum).

In [None]:
f  = lambda x, y: (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2
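
As a quick sanity check on the formula, the sketch below re-defines the Beale function under the illustrative name `beale` (an assumption, chosen so that it does not overwrite `f`) and evaluates it at the global minimum $(3,0.5)$ and at the point $(1,2)$, which is used as a starting guess later in this lab.

```python
# Beale function, re-defined here so the snippet runs on its own
# (`beale` is an illustrative name; the notebook itself uses `f`)
beale = lambda x, y: ((1.5 - x + x*y)**2
                      + (2.25 - x + x*y**2)**2
                      + (2.625 - x + x*y**3)**2)

f_min = beale(3.0, 0.5)    # all three squared terms vanish at (3, 0.5)
f_start = beale(1.0, 2.0)  # equals 2.5**2 + 5.25**2 + 9.625**2

print("f(3, 0.5) =", f_min)
print("f(1, 2)   =", f_start)
```

Since $f$ is a sum of squares, a value of zero at $(3,0.5)$ confirms that no other point can do better.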

## Vanilla gradient descent

> Choose initial guess $(x_0,y_0)$ and step size (learning rate) $\eta$<br><br>
>    For $k=0,1,2, \cdots, M$<br><br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    $(x_{k+1},y_{k+1}) =  (x_k,y_k) - \eta\nabla f(x_k,y_k) $

A self-contained implementation of the algorithm is as follows, where we use the *central difference* to approximate partial derivatives:

$$\frac{\partial f}{\partial x} \approx \frac{f(x + h, y) - f(x - h,y )}{2h}$$
and
$$\frac{\partial f}{\partial y} \approx \frac{f(x, y+h) - f(x,y-h)}{2h}.$$

In [None]:
def grad_descent(f, x0 = (0,0), eta=1e-2, h=1e-6, num_steps=200):
    '''
    Gradient descent algorithm using the numerical gradient.
    f: function to be minimized
    x0: initial guess, array-like (tuple, list, array)
    eta: step size
    h: numerical gradient's h
    num_steps: total number of iterations
    '''
    x, y = x0[0], x0[1]
    
    numpartialx = lambda x, y: 0.5*(f(x+h, y)-f(x-h, y))/h
    numpartialy = lambda x, y: 0.5*(f(x, y+h)-f(x, y-h))/h

    x_vals = np.zeros(num_steps)
    y_vals = np.zeros(num_steps)
    f_vals = np.zeros(num_steps)

    for i in range(num_steps):
        dx, dy = numpartialx(x,y), numpartialy(x,y)
        x -= eta*dx
        y -= eta*dy
        x_vals[i], y_vals[i], f_vals[i] = x, y, f(x,y)

    return x_vals, y_vals, f_vals
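
One way to gain confidence in the central-difference formulas is to compare them with partial derivatives worked out by hand. The sketch below does this for the Beale function at $(1,2)$. The names `beale`, `beale_grad`, and `central_diff_grad` are illustrative assumptions and not part of the lab code.

```python
import numpy as np

# Beale function (illustrative name; the notebook itself uses `f`)
def beale(x, y):
    return (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2

# hand-derived partial derivatives via the chain rule on each squared term
def beale_grad(x, y):
    t1 = 1.5 - x + x*y
    t2 = 2.25 - x + x*y**2
    t3 = 2.625 - x + x*y**3
    dfdx = 2*t1*(y - 1) + 2*t2*(y**2 - 1) + 2*t3*(y**3 - 1)
    dfdy = 2*t1*x + 2*t2*(2*x*y) + 2*t3*(3*x*y**2)
    return np.array([dfdx, dfdy])

# same central-difference formulas as in grad_descent
def central_diff_grad(f, x, y, h=1e-6):
    return np.array([0.5*(f(x + h, y) - f(x - h, y))/h,
                     0.5*(f(x, y + h) - f(x, y - h))/h])

exact = beale_grad(1.0, 2.0)
approx = central_diff_grad(beale, 1.0, 2.0)
print("analytic gradient: ", exact)
print("numerical gradient:", approx)
print("max abs difference:", np.max(np.abs(exact - approx)))
```

The two gradients should agree to several digits; a large discrepancy would usually point to a sign or indexing mistake in one of the formulas.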

# Problem 1: Adaptive step-size (learning rate)
Another popular gradient descent algorithm in the machine learning community is [Root Mean Square Prop or RMSprop](http://ruder.io/optimizing-gradient-descent/index.html#rmsprop), which changes the step size adaptively according to the magnitude of the gradient. This algorithm was invented by Geoff Hinton. Despite its vast popularity in the machine learning community, it is unpublished; it was first proposed in his Coursera lecture notes.

RMSprop was proposed to solve two problems in the vanilla gradient descent (GD):
* Vanishing gradient: in Lab 5 practice we observed that even after 2000 steps, and even starting somewhere like $(1,2)$ reasonably near the true global minimum point $(3,0.5)$, the vanilla GD stalls or "plateaus" on the plateau-like region near $(3,0.5)$ (the contour looks like a plateau). This is because the gradient is too small to drive the GD forward.
* Step size (learning rate) has to be small for convergence: in Lab 5 practice, we also observed that, if starting from somewhere like $(4,4)$, even a reasonably small step size like $10^{-4}$ makes GD blow up fairly fast.
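
Both issues can be read off from the length of a single vanilla GD step, $\eta\,|\nabla f|$. The sketch below compares that length at $(4,4)$ with $\eta = 10^{-4}$ against the length near the minimum with $\eta = 10^{-2}$. The helper names `beale` and `num_grad` are illustrative assumptions, and the point $(2.8, 0.45)$ is an arbitrary choice close to $(3,0.5)$.

```python
import numpy as np

# Beale function (illustrative name; the notebook itself uses `f`)
def beale(x, y):
    return (1.5 - x + x*y)**2 + (2.25 - x + x*y**2)**2 + (2.625 - x + x*y**3)**2

# central-difference gradient, as in grad_descent
def num_grad(f, x, y, h=1e-6):
    return np.array([0.5*(f(x + h, y) - f(x - h, y))/h,
                     0.5*(f(x, y + h) - f(x, y - h))/h])

# far from the minimum: the gradient is so large that even a small eta
# produces a step much longer than the distance to the minimum
step_far = 1e-4 * np.linalg.norm(num_grad(beale, 4.0, 4.0))

# near the minimum: the gradient is small, so each step covers only a
# small fraction of the remaining distance
step_near = 1e-2 * np.linalg.norm(num_grad(beale, 2.8, 0.45))
dist_near = np.hypot(3.0 - 2.8, 0.5 - 0.45)

print("step length at (4, 4) with eta=1e-4:     ", step_far)
print("step length at (2.8, 0.45) with eta=1e-2:", step_near)
print("distance from (2.8, 0.45) to (3, 0.5):   ", dist_near)
```

A fixed $\eta$ therefore cannot suit both regions at once, which is the motivation for rescaling the step by the gradient magnitude.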

The RMSprop algorithm is as follows:
> Choose $(x_0,y_0)$, $\eta$, $\gamma$, $\epsilon$, and let $s_{-1} = 1$ <br><br>
>    For $k=0,1,2, \cdots, M$<br><br>
>    &nbsp;&nbsp;&nbsp;&nbsp;  $s_{k} = \gamma s_{k-1} + (1 - \gamma)\, \left|\nabla f(x_k,y_k)\right|^2$<br><br>
>    &nbsp;&nbsp;&nbsp;&nbsp;    $\displaystyle(x_{k+1},y_{k+1}) =  (x_k,y_k) -  \frac{\eta} {\sqrt{s_{k}+ \epsilon}} \nabla f(x_k,y_k)$  

where $\left|\nabla f(x_k,y_k)\right|$ denotes the magnitude of the gradient vector.

Normally the parameters are chosen as: $\gamma = 0.9$, $\epsilon = 10^{-3}$.
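
The pseudocode above can be sketched in Python as follows. This is only one possible reading of it, written under the hypothetical name `rmsprop_sketch` with an assumed parameter order, and it is demonstrated on a simple quadratic bowl (the assumed toy function `bowl`) rather than on the Beale function. Whether it meets the tolerance asked for in the exercise is left for you to check.

```python
import numpy as np

def rmsprop_sketch(f, x0=(0, 0), eta=1e-2, gamma=0.9, eps=1e-3,
                   h=1e-6, num_steps=1000):
    '''
    One possible reading of the RMSprop pseudocode (illustrative only).
    f: function to be minimized
    x0: initial guess, array-like
    eta: base step size
    gamma: decay rate of the running average s
    eps: small constant added under the square root
    h: numerical gradient's h
    num_steps: total number of iterations
    '''
    x, y = float(x0[0]), float(x0[1])
    s = 1.0  # s_{-1} = 1

    x_vals = np.zeros(num_steps)
    y_vals = np.zeros(num_steps)
    f_vals = np.zeros(num_steps)

    for k in range(num_steps):
        # central-difference partial derivatives at the current point
        dx = 0.5*(f(x + h, y) - f(x - h, y))/h
        dy = 0.5*(f(x, y + h) - f(x, y - h))/h
        # running average of the squared gradient magnitude
        s = gamma*s + (1 - gamma)*(dx**2 + dy**2)
        # gradient step rescaled by 1/sqrt(s + eps)
        scale = eta/np.sqrt(s + eps)
        x -= scale*dx
        y -= scale*dy
        x_vals[k], y_vals[k], f_vals[k] = x, y, f(x, y)

    return x_vals, y_vals, f_vals

# toy problem (an assumption for illustration, not the Beale function):
# a quadratic bowl whose minimum is at (3, 0.5)
bowl = lambda x, y: (x - 3)**2 + (y - 0.5)**2
xs, ys, fs = rmsprop_sketch(bowl, x0=(1, 2))
print("final point:", (xs[-1], ys[-1]), "final value:", fs[-1])
```

Note that while $s_k$ tracks $|\nabla f|^2$, the step length stays close to $\eta$ regardless of how large or small the gradient is; once $s_k$ drops below $\epsilon$, the $\epsilon$ term caps the rescaling factor at $\eta/\sqrt{\epsilon}$.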

*  Using $h = 10^{-6}$ in the numerical partial derivatives as in the `grad_descent` function, implement RMSprop as a function. Try $(x_0,y_0) = (1,2)$, step size $\eta = 10^{-2}$, $\gamma = 0.9$, $\epsilon = 10^{-3}$, and a total of $1000$ steps for RMSprop on the Beale function. A correctly implemented algorithm is expected to converge reasonably close to the global minimum $(3.0, 0.5)$ (the distance should be less than $10^{-2}$).

In [None]:
def rmsprop(f, x0=(0,0), eta=1e-2, gamma=0.9, eps=1e-3, h=1e-6, num_steps=1000):
    # your code here to replace pass
    pass

In [None]:
num_steps = 1000
x_vals, y_vals, f_vals = rmsprop(f, x0 = (1,2), eta=1e-2, h=1e-6, eps = 1e-3, num_steps = num_steps)
print("The value of f(x,y): ", f(x_vals[-1],y_vals[-1]), "after", 
      num_steps, "iterations at point", (x_vals[-1],y_vals[-1]))
# let's see what the f(x,y) values were    

plt.title("f(x,y) over gradient descent steps")
plt.plot(range(num_steps), f_vals)
plt.show()

In [None]:
plot_gradient_descent(x_vals,y_vals,f)