# Physics 494/594
## Gradient Descent Improvements

In [None]:
# %load ./include/header.py
import numpy as np
import matplotlib.pyplot as plt
import sys
from tqdm import trange,tqdm
sys.path.append('./include')
import ml4s
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.style.use('./include/notebook.mplstyle')
np.set_printoptions(linewidth=120)
ml4s.set_css_style('./include/bootstrap.css')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

## Last Time

### [Notebook Link: 11_Gradient_Descent.ipynb](./11_Gradient_Descent.ipynb)
- Gradient Descent: Derived a general framework for optimizing functions of many parameters

## Today
- Improvements and adaptive methods (step size variation)


We learned how a general convex quadratic function:

\begin{equation}
f(\mathbf{w}) = \frac{1}{2} \mathbf{w}^{\top} \mathsf{A}\, \mathbf{w}
\end{equation}

where $\mathsf{A} \in \mathbb{R}^{M \times M}$ is a positive semi-definite matrix, can be minimized in an iterative fashion by making steps *downhill* via **gradient descent**:

\begin{equation}
\mathbf{w}_{i+1} \leftarrow \mathbf{w}_i - \eta \nabla_w f(\mathbf{w}_i).
\end{equation}

Today we will investigate some simple improvements to the gradient descent algorithm for the case $M=2$ that we can easily visualize.

In [None]:
def f(w,A):
    return (1/2) * w.T @ A @ w 

We can think of this function as a quadratic bowl whose curvature is specified by the matrix $\mathsf{A}$.

Its minimum value $f(\mathbf{w}^*)=0$ is always attained at $\mathbf{w}^* = (0, 0)^{\top}$.
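
As a quick numerical check of this claim, the sketch below evaluates the analytic gradient $\nabla_w f = \frac{1}{2}(\mathsf{A} + \mathsf{A}^\top)\mathbf{w}$, which reduces to $\mathsf{A}\mathbf{w}$ for symmetric $\mathsf{A}$, and compares it with central finite differences. The matrix `A_demo` and the helper names `f_quad` and `grad_quad` are assumptions made for illustration; they are not part of `ml4s`.

```python
import numpy as np

def f_quad(w, A):
    """Quadratic bowl f(w) = (1/2) w^T A w."""
    return 0.5 * w @ A @ w

def grad_quad(w, A):
    """Analytic gradient; equals A @ w when A is symmetric."""
    return 0.5 * (A + A.T) @ w

# hypothetical symmetric positive-definite matrix (illustration only)
A_demo = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

w_star = np.zeros(2)
w_test = np.array([2.5, -4.0])

# central finite differences as an independent check of the analytic gradient
h = 1e-6
fd = np.array([(f_quad(w_test + h*e, A_demo) - f_quad(w_test - h*e, A_demo)) / (2*h)
               for e in np.eye(2)])

print('gradient at origin :', grad_quad(w_star, A_demo))
print('analytic gradient  :', grad_quad(w_test, A_demo))
print('finite differences :', fd)
```

The gradient vanishes at the origin, and since $\mathsf{A}$ is positive semi-definite, $f(\mathbf{w}) \ge 0$ everywhere, so the origin is a minimum.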

### Our plotting functions for visualization of the process.

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib.colors import LogNorm

def plot_function(grid_1d, func, contours=50, log_contours=False, exact=[0,0]):
    '''Make a contour plot over the region described by grid_1d for function func.'''
    
    # make the 2D grid
    X,Y = np.meshgrid(grid_1d, grid_1d, indexing='xy')
    Z = np.zeros_like(X)
    
    # numpy bonus exercise: can you think of a way to vectorize the following for-loop?
    for i in range(len(X)):
        for j in range(len(X.T)):
            Z[i, j] = func(np.array((X[i, j], Y[i, j])))  # compute function values
    
    fig = plt.figure(figsize=plt.figaspect(0.5))
    ax = fig.add_subplot(1, 2, 1)
    
    if not log_contours:
        ax.contour(X, Y, Z, contours, cmap='Spectral_r')
    else:
        ax.contour(X, Y, Z, levels=np.logspace(0, 5, 35), norm=LogNorm(), cmap='Spectral_r')
        
    ax.plot(*exact, '*', color='black')

    ax.set_xlabel(r'$w_0$')
    ax.set_ylabel(r'$w_1$')
    ax.set_aspect('equal')
    
    ax3d = fig.add_subplot(1, 2, 2, projection='3d')
    
    if log_contours:
        Z = np.log(Z)
        label = r'$\ln f(\mathbf{w})$'
    else:
        label = r'$f(\mathbf{w})$'
        
    surf = ax3d.plot_surface(X,Y,Z, rstride=1, cstride=1, cmap='Spectral_r', 
                       linewidth=0, antialiased=True, rasterized=True)
    
    ax3d.plot([exact[0]], [exact[1]], [func(np.array(exact))], marker='*', ms=6, linestyle='-', color='k',lw=1, zorder=100)

         
    ax3d.set_xlabel(r'$w_0$',labelpad=8)
    ax3d.set_ylabel(r'$w_1$',labelpad=8)
    ax3d.set_zlabel(label,labelpad=8);
    
    return fig,ax,ax3d

In [None]:
A = ml4s.random_psd_matrix([2,2])
fig,ax,ax3d = plot_function(np.linspace(-5,5,100),lambda x: f(x,A))
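
One possible answer to the bonus exercise in `plot_function`, sketched only for the quadratic form (a general `func` would have to accept batched input): stack the grid into an array of shape `(N, N, 2)` and contract it with the matrix using `np.einsum`. The names `quadratic_on_grid` and `A_demo` are hypothetical.

```python
import numpy as np

def quadratic_on_grid(grid_1d, A):
    """Evaluate f(w) = (1/2) w^T A w on a 2D grid without Python loops."""
    X, Y = np.meshgrid(grid_1d, grid_1d, indexing='xy')
    W = np.stack([X, Y], axis=-1)                     # shape (N, N, 2)
    Z = 0.5 * np.einsum('ijk,kl,ijl->ij', W, A, W)    # w^T A w at every grid point
    return X, Y, Z

# hypothetical symmetric positive-definite matrix (illustration only)
A_demo = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

grid = np.linspace(-5, 5, 11)
X, Y, Z = quadratic_on_grid(grid, A_demo)

# reference values from the explicit double loop
Z_loop = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        w = np.array((X[i, j], Y[i, j]))
        Z_loop[i, j] = 0.5 * w @ A_demo @ w

print(np.allclose(Z, Z_loop))
```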

### Autodiff for derivatives

In [None]:
import jax.numpy as jnp # jax has its own accelerated version of numpy
from jax import grad

df_dw = grad(f,argnums=0)

## Performing Gradient Descent

Now that we know how to take gradients using `jax` we are ready to code up our algorithm.

\begin{equation}
\mathbf{w}_{i+1} \leftarrow \mathbf{w}_i  - \eta \nabla_w f(\mathbf{w}_i) \ .
\end{equation}

In [None]:
from IPython import display
A = ml4s.random_psd_matrix([2,2], seed=0)
fig, ax, ax3d = plot_function(np.linspace(-5, 5, 100), lambda x: f(x, A))

# hyperparameters
η = 0.5
w = np.array([2.5,-4.0])
num_iter = 20

ax.plot(*w, marker='.', color='k', ms=15)  

for i in range(num_iter):

    # we keep a copy of the previous version for plotting
    w_old = np.copy(w)
    
    # perform the GD update
    w += -η*df_dw(w, A)
    
    # plot
    ax.plot([w_old[0], w[0]], [w_old[1], w[1]], marker='.', linestyle='-', color='k',lw=1) 
    ax3d.plot([w_old[0], w[0]], [w_old[1], w[1]], [f(w_old,A),f(w,A)], marker='.', linestyle='-', color='k',lw=1, zorder=100)

    ax.set_title(f'$i={i}, w=[{w[0]:.2f},{w[1]:.2f}]$' + '\n' + f'$f(w) = {f(w,A):.6f}$', fontsize=14);
    display.display(fig)
    display.clear_output(wait=True)
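
For the quadratic bowl, the choice of step size can be made precise. Each update is $\mathbf{w}_{i+1} = (\mathsf{1} - \eta \mathsf{A})\mathbf{w}_i$, so for symmetric positive-definite $\mathsf{A}$ the iteration converges only if $|1 - \eta\lambda| < 1$ for every eigenvalue $\lambda$, i.e. $0 < \eta < 2/\lambda_{\max}$. The sketch below illustrates this with a hypothetical matrix `A_demo` (not the random matrix used above) and a hypothetical helper `run_gd`, running GD just below and just above the critical step size.

```python
import numpy as np

# hypothetical symmetric positive-definite matrix (illustration only)
A_demo = np.array([[2.0, 0.5],
                   [0.5, 1.0]])

λ_max = np.linalg.eigvalsh(A_demo).max()
η_crit = 2.0 / λ_max          # largest stable step size for this bowl

def run_gd(η, num_iter=200):
    """Plain gradient descent on f(w) = (1/2) w^T A_demo w; returns the final f."""
    w = np.array([2.5, -4.0])
    for _ in range(num_iter):
        w = w - η * (A_demo @ w)   # the gradient of the quadratic is A w
    return 0.5 * w @ A_demo @ w

f_start = 0.5 * np.array([2.5, -4.0]) @ A_demo @ np.array([2.5, -4.0])
f_small = run_gd(0.9 * η_crit)    # slightly below the bound: converges
f_large = run_gd(1.1 * η_crit)    # slightly above the bound: diverges

print(f'η_crit = {η_crit:.4f}')
print(f'f after GD with 0.9 η_crit: {f_small:.3e}')
print(f'f after GD with 1.1 η_crit: {f_large:.3e}')
```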

## Gradient Descent with Momentum

One problem with the GD algorithm is that it retains no **memory** of where it came from, which can cause trouble in a rough and/or shallow energy landscape. In physics, this is equivalent to a completely overdamped ball rolling down a hill, i.e. it has no kinetic energy (momentum) to climb out of local minima.

Retaining some memory also helps prevent large swings due to local curvature, and it will become even more important when our cost functions are high dimensional and we compute the gradient over only a subset of the data (*stochastic gradient descent*).  We modify our algorithm above to include a *memory* or *momentum* term:

\begin{align}
\mathbf{v}_i &\leftarrow \gamma \mathbf{v}_{i-1} + \eta \nabla_w f(\mathbf{w}_i) \\
\mathbf{w}_{i+1} &\leftarrow \mathbf{w}_i - \mathbf{v}_{i} 
\end{align}

where we have introduced a new momentum **hyperparameter** $0 \le \gamma < 1$.  For $\gamma = 0$ we recover ordinary gradient descent, while increasing $\gamma$ increases the information retained about previous steps.  The gradient from $k$ steps ago enters $\mathbf{v}_i$ with weight $\gamma^k$, so the effective memory is about $1/(1-\gamma)$ iterations.  In practice, we often use $\gamma \approx 0.9$, which gives us a memory of approximately 10 iterations.  In the literature, you will see this method called *gradient descent with classical momentum* or CM.
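
The memory estimate can be checked directly. Unrolling the velocity update gives $\mathbf{v}_i = \eta \sum_{k \ge 0} \gamma^k \nabla_w f(\mathbf{w}_{i-k})$, so the weights sum to $1/(1-\gamma)$. In the sketch below the gradient is held fixed at a hypothetical constant value `g` (an assumption made only to isolate the effect of $\gamma$), in which case $\mathbf{v}$ should approach $\eta\, \mathbf{g}/(1-\gamma)$.

```python
import numpy as np

η, γ = 0.5, 0.9
g = np.array([1.0, -2.0])     # hypothetical constant gradient (illustration only)

# repeatedly apply the velocity update with a fixed gradient
v = np.zeros(2)
for i in range(200):
    v = γ*v + η*g

v_limit = η*g/(1-γ)           # geometric-series limit of the velocity

# weight given to the gradient from k steps ago
k = np.arange(200)
weights = γ**k
effective_memory = weights.sum()

print('v after 200 steps :', v)
print('η g / (1-γ)       :', v_limit)
print('sum of weights    :', effective_memory)
```

With a constant gradient, momentum therefore amplifies the effective step by a factor $1/(1-\gamma)$, which is worth keeping in mind when choosing $\eta$.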

Let's see how it works.

In [None]:
w = np.random.uniform(low=-5,high=5,size=2)
η = 0.5
γ = 0.9
print(f'f(w) = {f(w,A):.3f}')
v = np.zeros(2)

In [None]:
v = γ*v + η*df_dw(w, A)
w -= v
print(f'f(w) = {f(w,A):.3f}')

In [None]:
fig, ax, ax3d = plot_function(np.linspace(-5, 5, 100), lambda x: f(x, A))

# hyperparameters
η = 0.5
γ = 0.75
num_iter = 20

w = np.array([2.5,-4.0])
ax.plot(*w, marker='.', color='k', ms=15)  
v = np.zeros(2)

for i in range(num_iter):
    
    # keep a copy for plotting
    w_old = np.copy(w)
    
    # perform the CM update
    v = γ*v + η*df_dw(w, A)
    w -= v
    
    # plot
    ax.plot([w_old[0], w[0]], [w_old[1], w[1]], marker='.', linestyle='-', color='k',lw=1) 
    ax3d.plot([w_old[0], w[0]], [w_old[1], w[1]], [f(w_old,A),f(w,A)], marker='.', linestyle='-', color='k',lw=1, zorder=100)

    ax.set_title(f'$i={i}, w=[{w[0]:.2f},{w[1]:.2f}]$' + '\n' + f'$f(w) = {f(w,A):.6f}$', fontsize=14);
    display.display(fig)
    display.clear_output(wait=True)

<div class="span alert alert-warning">
    <strong>Note:</strong> due to the momentum term, this method is not strictly <em>downhill</em> anymore!
</div>

### Other Methods

There is a zoo of different numerical optimization algorithms.  Check out: https://ruder.io/optimizing-gradient-descent/

A commonly used variant of CM is the **Nesterov Accelerated Gradient** which is a simple modification of CM that computes the gradient not at $\mathbf{w}_i$  but at the position that momentum would carry it to at the next time step:

\begin{align}
\mathbf{v}_i &\leftarrow \gamma \mathbf{v}_{i-1} + \eta \nabla_w f(\mathbf{w}_i - \gamma \mathbf{v}_{i-1}) \\
\mathbf{w}_{i+1} &\leftarrow \mathbf{w}_i - \mathbf{v}_{i} 
\end{align}

In [None]:
fig, ax, ax3d = plot_function(np.linspace(-5, 5, 100), lambda x: f(x, A))

# hyperparameters
η = 0.5
γ = 0.9
num_iter = 20

w = np.array([2.5,-4.0])
#w = np.random.uniform(low=-5,high=5,size=2)

ax.plot(*w, marker='.', color='k', ms=15)  
v = np.zeros(2)

for i in range(num_iter):
    
    # keep a copy for plotting
    w_old = np.copy(w)
    
    # perform the NAG update
    v = γ*v + η*df_dw(w-γ*v, A)
    w -= v
    
    # plot
    ax.plot([w_old[0], w[0]], [w_old[1], w[1]], marker='.', linestyle='-', color='k',lw=1) 
    ax3d.plot([w_old[0], w[0]], [w_old[1], w[1]], [f(w_old,A),f(w,A)], marker='.', linestyle='-', color='k',lw=1, zorder=100)

    ax.set_title(f'$i={i}, w=[{w[0]:.2f},{w[1]:.2f}]$' + '\n' + f'$f(w) = {f(w,A):.6f}$', fontsize=14);
    display.display(fig)
    display.clear_output(wait=True)
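
To compare the three update rules side by side without the plotting overhead, here is a minimal sketch on a hypothetical ill-conditioned bowl `A_demo` (chosen for illustration, not the random matrix used above), using the analytic gradient $\mathsf{A}\mathbf{w}$ in place of autodiff. The helper `minimize` and the chosen hyperparameters are assumptions.

```python
import numpy as np

# hypothetical ill-conditioned quadratic bowl (illustration only)
A_demo = np.diag([0.1, 10.0])
w_start = np.array([2.5, -4.0])

def f_quad(w):
    return 0.5 * w @ A_demo @ w

def grad_quad(w):
    return A_demo @ w          # analytic gradient for symmetric A_demo

def minimize(method, η=0.1, γ=0.9, num_iter=100):
    """Run GD, CM or NAG from w_start and return the final value of f."""
    w = w_start.copy()
    v = np.zeros(2)
    for _ in range(num_iter):
        if method == 'GD':
            v = η*grad_quad(w)
        elif method == 'CM':
            v = γ*v + η*grad_quad(w)
        elif method == 'NAG':
            v = γ*v + η*grad_quad(w - γ*v)   # gradient at the look-ahead point
        else:
            raise ValueError(f'unknown method: {method}')
        w = w - v
    return f_quad(w)

results = {method: minimize(method) for method in ('GD', 'CM', 'NAG')}
for method, value in results.items():
    print(f'{method:>3s}: f(w) = {value:.3e}')
```

On this particular bowl and with these hyperparameters, both momentum variants end at a lower $f$ than plain GD after the same number of steps, because GD is held back by the shallow direction. This is not guaranteed in general: for a well-conditioned bowl, a large $\gamma$ can make convergence slower.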

### Adaptive Methods

One of the most important things in practice to improve optimization performance is to change (adapt) the learning rate as a function of time (our index $i$).  This can be done by hand (using a learning schedule) or via algorithms such as **ADAM**, which keeps running averages of the first and second moments of the gradient and uses these to adapt the learning rate for each parameter.  You should watch [Andrew Ng's video on the subject](https://www.youtube.com/watch?v=JXQT_vxqwIs).  I quote the final update scheme here.

\begin{align}
\mathbf{g}_i &= \nabla_w f(\mathbf{w}_i) \\
\mathbf{m}_i &= \beta_1 \mathbf{m}_{i-1} + (1-\beta_1) \mathbf{g}_i \\
\mathbf{v}_i &= \beta_2 \mathbf{v}_{i-1} +(1-\beta_2)\mathbf{g}_i^2  \\
\hat{\mathbf{m}}_i &= \frac{\mathbf{m}_i}{1-(\beta_1)^i} \\
\hat{\mathbf{v}}_i &= \frac{\mathbf{v}_i}{1-(\beta_2)^i}  \\
\mathbf{w}_{i+1} &=\mathbf{w}_i - \eta \frac{\hat{\mathbf{m}}_i}{\sqrt{\hat{\mathbf{v}}_i} +\epsilon}, \nonumber 
\end{align}

where $\beta_1$ and $\beta_2$ set the memory lifetime of the first and second moment.  We typically take:

\begin{align}
\beta_1 &= 0.9 \\
\beta_2 &= 0.999 \\
\eta  &= 10^{-3} \\
\epsilon &= 10^{-8} \ . 
\end{align}

However, for simple functions like our quadratic bowl, a larger starting learning rate such as $\eta = 10^{-1}$ often works better.
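
One consequence of the bias correction is worth seeing explicitly. At $i=1$, with $\mathbf{m}_0 = \mathbf{v}_0 = 0$, we get $\hat{\mathbf{m}}_1 = \mathbf{g}_1$ and $\hat{\mathbf{v}}_1 = \mathbf{g}_1^2$, so the first update is approximately $\eta\, \mathrm{sign}(\mathbf{g}_1)$ in every component, regardless of how large or small the gradient is. The sketch below checks this for a hypothetical gradient `g` whose components differ by two orders of magnitude; the variable names `m_hat` and `v_hat` stand in for $\hat{\mathbf{m}}$ and $\hat{\mathbf{v}}$.

```python
import numpy as np

β1, β2, ϵ, η = 0.9, 0.999, 1.0e-8, 1.0e-1
g = np.array([3.0, -0.02])     # hypothetical first gradient (illustration only)

# first ADAM iteration (i = 1), starting from m = v = 0
i = 1
m = (1-β1)*g
v = (1-β2)*g*g

# bias-corrected moments
m_hat = m/(1-β1**i)
v_hat = v/(1-β2**i)

step = η*m_hat/(np.sqrt(v_hat) + ϵ)

print('bias-corrected m :', m_hat)
print('first ADAM step  :', step)
print('η sign(g)        :', η*np.sign(g))
```

This is why $\eta$ in ADAM sets a typical step length per parameter rather than a multiplier on the raw gradient.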

In [None]:
β1 = 0.9
β2 = 0.999
ϵ = 1.0E-8
η = 1.0E-1
γ = 0.9

num_iter = 1000

w = np.random.uniform(low=-5,high=5,size=2)

w_traj = np.zeros([num_iter,2])
w_traj[0,:] = w

m = np.zeros(2)
v = np.zeros(2)

for i in range(1,num_iter):
    
    g = np.array(df_dw(w,A))
    m = β1*m + (1-β1)*g
    v = β2*v + (1-β2)*g*g
    
    m̂ = m/(1-β1**i)
    v̂ = v/(1-β2**i)

    w = w - η*np.divide(m̂,np.sqrt(v̂) + ϵ)
    w_traj[i,:] = w

Because we are taking many more steps, we plot the full trajectory only after the optimization has finished (updating the figure at every step would be too slow).

In [None]:
def plot_trajectory(fig,ax,ax3d,w_traj,func,log_contours=False):
    '''Plot the trajectory of a minimization.'''
    
    num_iter = w_traj.shape[0]
    f_traj = np.array([func(w_traj[i,:]) for i in range(num_iter)])
    
    ax.plot(w_traj[0,0],w_traj[0,1], 'o', color='k', ms=6)    
    ax.plot(w_traj[:,0],w_traj[:,1], '.', color='k', ms=1)  
    
    if log_contours:
        f_traj = np.log(f_traj)
        
    ax3d.plot([w_traj[0,0]], [w_traj[0,1]], [f_traj[0]], marker='o', ms=6, linestyle='-', color='k',lw=1, zorder=100)
    ax3d.plot(w_traj[:,0], w_traj[:,1], f_traj, marker='.', ms=1, linestyle='-', color='k',lw=1, zorder=100)
    
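    # use the trajectory passed in, rather than global variables, for the title
    w_final = w_traj[-1,:]
    ax.set_title(f'$i={num_iter-1}, w=[{w_final[0]:.2f},{w_final[1]:.2f}]$' + '\n' + f'$f(w) = {func(w_final):.6f}$', fontsize=14);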
    
    return fig,ax,ax3d

In [None]:
fig, ax, ax3d = plot_function(np.linspace(-5, 5, 100), lambda x: f(x, A))
fig, ax, ax3d = plot_trajectory(fig,ax,ax3d,w_traj,lambda x: f(x, A))

There is a nice comparison of methods due to [Alec Radford](https://twitter.com/alecrad).

<img src="https://ruder.io/content/images/2016/09/contours_evaluation_optimizers.gif">