*Credit*: These examples have been adapted from the [SciPy Tutorial](https://docs.scipy.org/doc/scipy-0.18.1/reference/tutorial/optimize.html).

In [0]:
import numpy as np
import scipy.optimize

# Unconstrained Optimization

The `scipy.optimize` package provides several commonly used optimization algorithms. A detailed listing can be obtained with:

In [0]:
help(scipy.optimize)

For Machine Learning, we are mainly interested in unconstrained minimization of multivariate scalar functions (typically where gradient information is available). In addition to several algorithms for unconstrained minimization of multivariate scalar functions (e.g. BFGS, Nelder-Mead simplex, Newton Conjugate Gradient, etc.) the module also contains:
- Global (brute-force) optimization routines
- Least-squares minimization (which we saw before in the Solving Systems of Linear Equations notebook)
- Scalar univariate function minimizers and root finders; and
- Multivariate equation system solvers using a variety of algorithms
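
As a small illustration of the scalar routines in this list, the sketch below calls `minimize_scalar` and `brentq` from `scipy.optimize`. The function `f` and the bracket `[0, 2]` are made up for this example and do not appear elsewhere in the notebook.

```python
from scipy.optimize import brentq, minimize_scalar


def f(x):
    """Hypothetical example: a parabola with its minimum at x = 2."""
    return (x - 2.0) ** 2 - 1.0


# Univariate minimization (no starting point or gradient required).
res = minimize_scalar(f)
print(res.x, res.fun)

# Root finding on a bracket where f changes sign: f(0) = 3 and f(2) = -1.
root = brentq(f, 0.0, 2.0)
print(root)
```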

# Unconstrained minimization of multivariate scalar functions (`minimize`)

The `minimize` function provides a common interface to unconstrained and constrained minimization algorithms for multivariate scalar functions. To demonstrate the minimization function, let's consider the problem of minimizing the [Rosenbrock function](https://en.wikipedia.org/wiki/Rosenbrock_function) of $N$ variables:
$$ f\left(\mathbf{x}\right)=\sum_{i=1}^{N-1}100\left(x_{i}-x_{i-1}^{2}\right)^{2}+\left(1-x_{i-1}\right)^{2}.$$

The minimum value of this function is 0, which is achieved when $x_i=1$ for all $i$.

Note that the Rosenbrock function and its derivatives are included in `scipy.optimize`. The implementations below serve as examples of how to define an objective function and its Jacobian.
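
As a quick sanity check of the claim about the minimum, the built-in versions can be evaluated directly. This is only a sketch: it uses `scipy.optimize.rosen` and `scipy.optimize.rosen_der`, and the choice of $N=5$ is arbitrary.

```python
import numpy as np
import scipy.optimize

# At x = (1, ..., 1) both the function value and the gradient should vanish.
x_min = np.ones(5)
print(scipy.optimize.rosen(x_min))
print(scipy.optimize.rosen_der(x_min))

# Away from the minimum the function value is strictly positive.
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
print(scipy.optimize.rosen(x0))
```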

## Nelder-Mead Simplex algorithm (`method='Nelder-Mead'`)

In the example below, the `minimize` routine is used with the *Nelder-Mead* simplex algorithm (selected through the `method` parameter):

In [0]:
from scipy.optimize import minimize

In [0]:
def rosen(x):
    """The Rosenbrock function"""
    return sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0)

In [0]:
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='Nelder-Mead',
               options={'xatol': 1e-8, 'disp': True})
print(res.x)

The simplex method is a simple way to minimize a fairly well-behaved function. It only requires function evaluations and is a good choice for simple minimization problems. However, because it does not use any gradient evaluations, it may take longer to find the minimum compared to a gradient-based method.
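
To see how much work the simplex method actually did, the object returned by `minimize` can be inspected. The sketch below repeats the run above without `disp` and prints a few fields of the result. It assumes a SciPy version that accepts the `xatol` option for Nelder-Mead.

```python
import numpy as np
from scipy.optimize import minimize


def rosen(x):
    """The Rosenbrock function (same definition as above)."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2.0) ** 2.0 + (1 - x[:-1]) ** 2.0)


x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='Nelder-Mead', options={'xatol': 1e-8})

print(res.success)  # whether the optimizer reports convergence
print(res.message)  # human-readable termination reason
print(res.fun)      # objective value at res.x
print(res.nit)      # number of iterations
print(res.nfev)     # number of objective function evaluations
```

Because Nelder-Mead never evaluates a gradient, `nfev` is the whole cost of the run, which makes it a convenient number to compare against gradient-based methods.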

## Gradient descent (by hand)

Before we move on to one of the `scipy.optimize` gradient-based methods, let's implement basic gradient descent by hand. Normally you shouldn't really do this yourself. Instead, you should rely on a properly debugged and optimized implementation of an optimizer. But this is a learning exercise!

To demonstrate gradient descent, the Rosenbrock function is used again. The gradient of the Rosenbrock function is the vector:

$$ \begin{eqnarray*} \frac{\partial f}{\partial x_{j}} & = & \sum_{i=1}^{N-1}200\left(x_{i}-x_{i-1}^{2}\right)\left(\delta_{i,j}-2x_{i-1}\delta_{i-1,j}\right)-2\left(1-x_{i-1}\right)\delta_{i-1,j}\\  & = & 200\left(x_{j}-x_{j-1}^{2}\right)-400x_{j}\left(x_{j+1}-x_{j}^{2}\right)-2\left(1-x_{j}\right).\end{eqnarray*}$$

This expression is valid for the interior derivatives ($1 \le j \le N-2$). Special cases are:

$$ \begin{eqnarray*} \frac{\partial f}{\partial x_{0}} & = & -400x_{0}\left(x_{1}-x_{0}^{2}\right)-2\left(1-x_{0}\right),\\ \frac{\partial f}{\partial x_{N-1}} & = & 200\left(x_{N-1}-x_{N-2}^{2}\right).\end{eqnarray*} $$

A function which computes this gradient is:

In [0]:
# note the special handling of the exterior derivatives
def rosen_der(x):
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm-xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1-xm)
    der[0] = -400*x[0]*(x[1]-x[0]**2) - 2*(1-x[0])
    der[-1] = 200*(x[-1]-x[-2]**2)
    return der
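
A hand-written gradient is easy to get wrong, so it is worth comparing it against a numerical estimate before using it. The sketch below does this with `scipy.optimize.check_grad`, which returns the 2-norm of the difference between the supplied gradient and a finite-difference approximation. The functions are copied from the cells above so that the block runs on its own, and the comparison against SciPy's built-in `rosen_der` is an extra check added here.

```python
import numpy as np
import scipy.optimize


def rosen(x):
    """The Rosenbrock function."""
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2.0) ** 2.0 + (1 - x[:-1]) ** 2.0)


def rosen_der(x):
    """Gradient of the Rosenbrock function (copied from above)."""
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200 * (xm - xm_m1 ** 2) - 400 * (xm_p1 - xm ** 2) * xm - 2 * (1 - xm)
    der[0] = -400 * x[0] * (x[1] - x[0] ** 2) - 2 * (1 - x[0])
    der[-1] = 200 * (x[-1] - x[-2] ** 2)
    return der


x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Absolute discrepancy between the analytic and finite-difference gradients.
err = scipy.optimize.check_grad(rosen, rosen_der, x0)
# Judge it relative to the size of the gradient itself.
rel_err = err / np.linalg.norm(rosen_der(x0))
print(err, rel_err)

# Cross-check against SciPy's own implementation.
print(np.allclose(rosen_der(x0), scipy.optimize.rosen_der(x0)))
```

A relative error that is many orders of magnitude below 1 suggests the formula is implemented correctly. A value close to 1 usually points to a sign or indexing mistake.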

In [0]:
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
num_iter = 1000  # number of gradient descent updates
step_size = 1e-4
x = x0.copy()
for i in range(num_iter):
    x -= step_size * rosen_der(x)
    if (i + 1) % 10 == 0:
        print(f'Iteration {i + 1}, feval {rosen(x)}')
print(x)

We see that basic gradient descent doesn't seem to converge any quicker than the Nelder-Mead method. But there are still a couple of tricks up our sleeve.

### Exercise

Here we used a constant step size. Implement a version of gradient descent which decays the step size at each iteration. Experiment with different schedules. You can find examples of a few different kinds of schedule in [Stanford's cs231n course notes](http://cs231n.github.io/neural-networks-3/#annealing-the-learning-rate).
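
As a starting point, two of the schedules discussed in those notes can be written as follows. The symbols are chosen here for illustration: $\alpha_0$ is the initial step size, $t$ the iteration number, and $k$ a decay-rate hyperparameter that you will need to tune.

$$ \alpha_t = \alpha_0 e^{-kt} \quad \text{(exponential decay)}, \qquad \alpha_t = \frac{\alpha_0}{1+kt} \quad \text{($1/t$ decay)}. $$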

### Exercise

Try implementing gradient descent with momentum, as described in $\S$ 7.1.2 of [Mathematics for Machine Learning](https://mml-book.github.io/). Does this improve convergence?

### Exercise

We ran our optimizer for a fixed number of iterations, whereas SciPy's optimizers test for convergence. Can you write a test for convergence and terminate gradient descent automatically?

## Broyden-Fletcher-Goldfarb-Shanno algorithm (`method='BFGS'`)

Let's turn now to a more "industrial-strength" gradient-based optimizer. In contrast to Nelder-Mead, this routine uses the gradient of the objective function, which should enable it to converge more quickly to a solution. If the gradient is not given by the user, then it is estimated using finite differences. The Broyden-Fletcher-Goldfarb-Shanno (BFGS) method typically requires fewer function calls than the simplex algorithm, even when the gradient must be estimated.
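
The claim about function calls can be checked empirically. The sketch below runs Nelder-Mead, BFGS with a numerically estimated gradient (no `jac` argument), and BFGS with an analytic gradient from the same starting point, then prints the number of objective evaluations for each. To keep the block self-contained it uses SciPy's built-in `rosen` and `rosen_der`. The exact counts depend on the SciPy version and the tolerances, so treat them as indicative only.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Derivative-free simplex search.
res_nm = minimize(rosen, x0, method='Nelder-Mead', options={'xatol': 1e-8})
# BFGS with the gradient estimated numerically (no jac given).
res_fd = minimize(rosen, x0, method='BFGS')
# BFGS with the analytic gradient.
res_an = minimize(rosen, x0, method='BFGS', jac=rosen_der)

runs = [('Nelder-Mead', res_nm),
        ('BFGS (estimated gradient)', res_fd),
        ('BFGS (analytic gradient)', res_an)]
for name, res in runs:
    print(f'{name:28s} nfev={res.nfev:4d}  f(x)={res.fun:.3e}')
```

When the gradient is estimated, every gradient evaluation costs additional objective evaluations, which is why supplying `jac` reduces `nfev` for BFGS.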



We will reuse our definition of `rosen_der` from above. The gradient is passed to `minimize` through the `jac` parameter:

In [0]:
res = minimize(rosen, x0, method='BFGS', jac=rosen_der,
               options={'disp': True})
print(res.x)

Machine learning libraries (e.g. TensorFlow, PyTorch, etc.) provide a similar interface. When they offer automatic differentiation, you do not need to write the derivative function yourself. You only need to provide the "forward" computational graph and an objective function.