Credit: This is adapted from Roger Grosse and Jimmy Ba's [tutorial from CSC421 at the University of Toronto](https://nbviewer.jupyter.org/url/www.cs.toronto.edu/~rgrosse/courses/csc421_2019/tutorials/tut2/autograd_tutorial.ipynb).

# Automatic Differentiation

We are going to use a lightweight automatic differentiation library called Autograd, written by Dougal Maclaurin, David Duvenaud, Matt Johnson, and Jamie Townsend. The popular machine learning libraries TensorFlow and PyTorch also have autodiff capabilities, but they have much larger and more complex codebases. Autograd focuses only on autodiff, so its implementation is comparatively simple, and its interface barely deviates from the standard NumPy API. That makes it well suited to learning about autodiff.

In [0]:
import autograd.numpy as np  # Autograd thinly wraps NumPy
from autograd import grad

We start by defining a function as usual, just using Python and NumPy:

In [0]:
def tanh(x):
    y = np.exp(-2.0 * x)  # tanh(x) = (1 - e^{-2x}) / (1 + e^{-2x})
    return (1.0 - y) / (1.0 + y)

Now we employ `autograd.grad` to create a function that computes the gradient of any function we hand it, in this case `tanh`:

In [0]:
grad_tanh = grad(tanh)

# Evaluate the gradient at x = 1.0
print(grad_tanh(1.0))

# Compare to numeric gradient computed using finite differences
print((tanh(1.0001) - tanh(0.9999)) / 0.0002)
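
The finite-difference comparison above can be wrapped in a small reusable helper. The following is a minimal sketch using only the standard library; `numeric_grad` and its `eps` parameter are names introduced here for illustration and are not part of Autograd.

```python
import math

def numeric_grad(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx at a scalar x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

# math.tanh is used so that this sketch needs only the standard library.
estimate = numeric_grad(math.tanh, 1.0)
exact = 1.0 - math.tanh(1.0) ** 2  # known derivative: 1 - tanh(x)^2
print(estimate, exact)
```

The central difference has an error that shrinks with the square of `eps`, which is why it is usually preferred over a one-sided difference for gradient checks.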

## Autograd vs. manual gradients vs. staged computation

In the next example, we will see how a complicated computation can be written as the composition of simpler functions. This provides a scalable strategy for computing gradients using the chain rule.

Say we wish to write a function to compute the gradient of the sigmoid:

$$ \sigma(x) = \frac{1}{1 + \exp(-x)}. $$

We can write $\sigma(x)$ as a composition of several elementary functions: $\sigma(x) = s(c(b(a(x))))$, where:

$$
\begin{align}
a(x) & = -x\\
b(a) & = \exp(a)\\
c(b) & = 1 + b\\
s(c) & = \frac{1}{c}
\end{align}
$$

Here, we have *staged* the computation so that it contains several intermediate variables, each of which is a basic expression whose local gradient is easy to compute.

The computation graph for this expression is shown below.

![computation_graph](https://drive.google.com/uc?id=1asbb2T0o9n4VXKRFuSYoWYU-u7br-vCB)

The input to this function is $x$, and the output is represented by node $s$. We aim to compute the gradient of $s$ with respect to $x$, so we can use the chain rule as follows:

$$
\frac{ds}{dx} = \frac{ds}{dc}\frac{dc}{db}\frac{db}{da}\frac{da}{dx}.
$$
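
Substituting the local derivative of each stage makes the product concrete. As a worked step, using only the definitions of $a$, $b$, $c$, and $s$ given earlier:

$$
\begin{align}
\frac{ds}{dc} & = -\frac{1}{c^2}, \quad \frac{dc}{db} = 1, \quad \frac{db}{da} = \exp(a), \quad \frac{da}{dx} = -1\\
\frac{ds}{dx} & = \left(-\frac{1}{c^2}\right) \cdot 1 \cdot \exp(a) \cdot (-1) = \frac{\exp(-x)}{(1 + \exp(-x))^2} = \sigma(x)(1 - \sigma(x))
\end{align}
$$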

First, let's see what it would look like if we implemented the chain rule by hand, i.e. no autodiff.

In [0]:
def grad_sigmoid_manual(x):
    r"""Implements the gradient of the logistic sigmoid function
    $\sigma(x) = 1 / (1 + e^{-x})$ using staged computation
    """
    # Forward pass, keeping track of intermediate values for use in the 
    # backward pass
    a = -x         # -x in denominator
    b = np.exp(a)  # e^{-x} in denominator
    c = 1 + b      # 1 + e^{-x} in denominator
    s = 1.0 / c    # Final result, 1.0 / (1 + e^{-x})
    
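    # Backward pass: multiply local derivatives from the output back to x
    dsdc = -1.0 / (c**2)     # ds/dc for s = 1 / c
    dsdb = dsdc * 1          # dc/db = 1
    dsda = dsdb * np.exp(a)  # db/da = exp(a)
    dsdx = dsda * (-1)       # da/dx = -1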
    
    return dsdx

Now let's see the autodiff experience. First we write the forward function. Then we call `autograd.grad`. That's it!

In [0]:
def sigmoid(x):
    y = 1.0 / (1.0 + np.exp(-x))
    return y

grad_sigmoid_automatic = grad(sigmoid)

Let's compare the two implementations.

In [0]:
print(grad_sigmoid_automatic(2.0))
print(grad_sigmoid_manual(2.0))

We see that automatic differentiation matches the hand-written chain rule (up to floating-point rounding), and the code is a lot cleaner.

## Gradients of data structures: `flatten` and `unflatten`

Autograd allows you to compute gradients for many different data structures. It provides a lot of flexibility in the types of data structures you can use to store the parameters of a model. This flexibility is achieved through the `autograd.misc.flatten` function, which converts any nested combination of lists, tuples, arrays, or dicts into a 1-dimensional NumPy array.

The idea is that once we know how to compute gradients with respect to vectors, we can convert many other data structures into vectors and compute their gradients the same way.

In [0]:
import autograd.numpy.random as npr
from autograd.misc import flatten

Let's flatten a list of tuples:

In [0]:
params = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0, 3.0)]
flat_params, unflatten_func = flatten(params)
print('Flattened: {}'.format(flat_params))
print('Unflattened: {}'.format(unflatten_func(flat_params)))

Let's flatten a list of matrices (of different sizes):

In [0]:
params = [npr.randn(3, 3), npr.randn(4, 4), npr.randn(3, 3)]
flat_params, unflatten_func = flatten(params)
print('Flattened: {}'.format(flat_params))
print('Unflattened: {}'.format(unflatten_func(flat_params)))

We may want to represent model parameters in a dictionary, for example:

In [0]:
params = {'weights': [1.0, 2.0, 3.0, 4.0], 'biases': [1.0, 2.0]}
flat_params, unflatten_func = flatten(params)
print('Flattened: {}'.format(flat_params))
print('Unflattened: {}'.format(unflatten_func(flat_params)))

Or even a dictionary of dictionaries, say, to represent the weights and biases of a neural network:

In [0]:
params = {
    'layer1': {
        'weights': [1.0, 2.0, 3.0, 4.0],
        'biases': [1.0, 2.0]
    },
    'layer2': {
        'weights': [5.0, 6.0, 7.0, 8.0],
        'biases': [6.0, 7.0]
    }
}
flat_params, unflatten_func = flatten(params)
print('Flattened: {}'.format(flat_params))
print('Unflattened: {}'.format(unflatten_func(flat_params)))

## Gradient functions

Autograd provides several different functions that compute gradients, each with a different signature:

*   `grad(fun, argnum=0)` — Returns a function which computes the gradient of `fun` with respect to positional argument number `argnum`. The returned function takes the same arguments as `fun`, but returns the gradient instead. The function `fun` should be scalar-valued. The gradient has the same type as the argument.
*   `grad_named(fun, argname)` — Takes gradients with respect to a named argument.
*   `multigrad(fun, argnums=[0])` — Takes gradients with respect to multiple arguments simultaneously.
*   `multigrad_dict(fun)` — Takes gradients with respect to all arguments simultaneously, and returns a dict mapping `argname` to `gradval`.

## Modularity: implementing custom gradients

Autograd makes it possible to define custom gradients for your own functions. There are several reasons why you might want to do this, including:

1. **Speed**: you may know a faster way to compute the gradient for a specific function.
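2. **Numerical stability**: a hand-written gradient can avoid overflow or loss of precision in the automatically derived one.
3. **External library calls**: your code may depend on functions that Autograd cannot trace.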

The `@primitive` decorator wraps a function so that its gradient can be specified manually and its invocation can be recorded in a computational graph.

In [0]:
from autograd.extend import primitive, defvjp

First we write our function and decorate it with `@primitive`. This tells Autograd not to look inside the function, but to treat it as a black box whose gradient can be specified later.

In [0]:
@primitive
def logsumexp(x):
    """Numerically stable log(sum(exp(x)))"""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

Next, we write a function that specifies the gradient with a closure (a function defined within another function, with access to the parent function's variables).

In [0]:
def make_grad_logsumexp(ans, x):
    # If you want to be able to take higher-order derivatives, then all the
    # code inside this function must be itself differentiable by autograd.
    def gradient_product(g):
        return np.full(x.shape, g) * np.exp(x - np.full(x.shape, ans))
    return gradient_product
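
The expression inside `gradient_product` follows from differentiating log-sum-exp directly. As a short derivation, writing $\text{ans}$ for the function's output and $g$ for the incoming gradient:

$$
\begin{align}
\text{ans} & = \log \sum_j \exp(x_j)\\
\frac{\partial\, \text{ans}}{\partial x_i} & = \frac{\exp(x_i)}{\sum_j \exp(x_j)} = \exp(x_i - \text{ans})\\
g \cdot \frac{\partial\, \text{ans}}{\partial x_i} & = g \exp(x_i - \text{ans})
\end{align}
$$

The second line is the softmax of $x$, so before scaling by $g$ the gradient entries are positive and sum to one.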

Then we tell Autograd that `logsumexp` has a gradient-making function.

In [0]:
defvjp(logsumexp, make_grad_logsumexp)

That's it! Now we can use `logsumexp` inside a larger function that we wish to differentiate.

In [0]:
# Square the inputs, reduce them with the custom primitive, and return a scalar.
def example_func(y):
    z = y**2
    lse = logsumexp(z)
    return np.sum(lse)

grad_of_example = grad(example_func)
print("Gradient: ", grad_of_example(npr.randn(10)))

Autograd provides its own utility for checking gradients numerically. The check raises an error if the automatic and numerical gradients disagree.

In [0]:
from autograd.test_util import check_grads
check_grads(example_func, modes=['rev'], order=2)(npr.randn(10))
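
For intuition about what such a check compares, the sketch below tests the closed-form gradient of log-sum-exp against central finite differences in plain NumPy. This is not how `check_grads` is implemented; `analytic_grad`, `numeric_grad`, and the test point `x` are names and values introduced here for illustration.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) in plain NumPy."""
    max_x = np.max(x)
    return max_x + np.log(np.sum(np.exp(x - max_x)))

def analytic_grad(x):
    """Closed-form gradient of logsumexp: exp(x - logsumexp(x))."""
    return np.exp(x - logsumexp(x))

def numeric_grad(f, x, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return g

x = np.array([0.5, -1.0, 2.0, 0.0])
max_abs_diff = np.max(np.abs(analytic_grad(x) - numeric_grad(logsumexp, x)))
print(max_abs_diff)
```

A small `max_abs_diff` indicates that the analytic gradient agrees with the numerical estimate at this point; a single test point does not prove correctness everywhere.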