In [1]:
from utils import *
%matplotlib inline

# Automatic differentiation with `autograd`

<center><img src="support/autograd.gif" width=300></center>

We train models to get better and better as a function of experience. Usually, getting better means minimizing a loss function. To achieve this goal, we iteratively compute the gradient of the loss with respect to the weights and then update the weights accordingly. Although each gradient follows mechanically from the chain rule, working it out by hand for a complex model is tedious and error-prone.

Before diving into model training, let's go through how MXNet’s `autograd` package speeds up this work by calculating derivatives automatically.

## Basic usage

Let's first import the `autograd` package.

In [1]:
import mxnet as mx

from mxnet import nd
from mxnet import autograd

As a toy example, let’s say that we are interested in differentiating a function $f(x) = 2 x^2$ with respect to parameter $x$. We can start by assigning an initial value of $x$.

In [3]:
x = nd.array([[1, 2], [3, 4]])
x


[[ 1.  2.]
 [ 3.  4.]]
<NDArray 2x2 @cpu(0)>

Once we compute the gradient of $f(x)$ with respect to $x$, we’ll need a place to store it. In MXNet, we can tell an NDArray that we plan to store a gradient by invoking its `attach_grad` method.

In [4]:
x.attach_grad()

Now we’re going to define the function $y=f(x)$.

So that MXNet records how $y$ is computed and can compute gradients later, we need to put the computation of $y$ inside an `autograd.record()` scope.

In [5]:
def f(x):
    return 2 * x**2

In [6]:
with autograd.record():
    y = f(x)

In [7]:
x, y

(
 [[ 1.  2.]
  [ 3.  4.]]
 <NDArray 2x2 @cpu(0)>, 
 [[  2.   8.]
  [ 18.  32.]]
 <NDArray 2x2 @cpu(0)>)

Let’s invoke back propagation (backprop) by calling `y.backward()`. When $y$ has more than one entry, `y.backward()` is equivalent to `y.sum().backward()`: the entries of $y$ are summed into a scalar, and the gradient of that scalar with respect to $x$ is computed.
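To see why summing first gives a gradient with the same shape as $x$, here is a short derivation for the toy function. The symbol $s$ for the summed output is introduced here only for illustration:

$$
s = \sum_{i,j} y_{ij} = \sum_{i,j} 2\,x_{ij}^2,
\qquad
\frac{\partial s}{\partial x_{ij}} = 4\,x_{ij}
$$

Each entry of $x$ appears in exactly one term of the sum, so the gradient of $s$ is simply $4x$ applied elementwise.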

In [8]:
y.backward()

Now, let’s see if this is the expected output. Note that $y=2x^2$ and $\frac{dy}{dx} = 4x$, which should be `[[4, 8],[12, 16]]`. Let's check the automatically computed results:

In [9]:
x, x.grad

(
 [[ 1.  2.]
  [ 3.  4.]]
 <NDArray 2x2 @cpu(0)>, 
 [[  4.   8.]
  [ 12.  16.]]
 <NDArray 2x2 @cpu(0)>)
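As an independent sanity check, the same derivative can be estimated numerically without MXNet. The sketch below is plain Python, and the helper name `central_difference` is an assumption made for illustration. It compares a symmetric finite-difference estimate against the analytic value $4x$ for the four entries used above.

```python
def f(x):
    # Same toy function as above, applied to a plain Python float.
    return 2 * x ** 2


def central_difference(func, x, eps=1e-5):
    # Hypothetical helper: symmetric finite-difference estimate of d func / dx.
    return (func(x + eps) - func(x - eps)) / (2 * eps)


xs = [1.0, 2.0, 3.0, 4.0]
numeric = [central_difference(f, v) for v in xs]
analytic = [4 * v for v in xs]

for v, n, a in zip(xs, numeric, analytic):
    print(f"x={v}: numeric={n:.6f}, analytic={a:.6f}")
```

The two columns should agree up to floating-point rounding, which matches the values stored in `x.grad`.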

## Using Python control flows

<center><img src="support/branching.gif" width=600></center>

Sometimes we want to write dynamic programs whose execution depends on values computed at run time. MXNet records the execution trace of the branches and loop iterations that actually run, and computes the gradient along that trace.
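To make the idea of recording an execution trace concrete, here is a toy scalar version in plain Python. It is only a sketch of the general technique (reverse-mode differentiation over a recorded graph), not MXNet's implementation. The names `Var` and `double_until` are hypothetical.

```python
class Var:
    """Toy scalar that remembers which operations produced it."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (input Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        # d(a*b)/da = b and d(a*b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__

    def backward(self):
        # Order the recorded nodes so that inputs come before outputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


def double_until(x, limit=1000.0):
    # Ordinary Python control flow; only the multiplications that run are recorded.
    x = x * 2
    while abs(x.value) < limit:
        x = x * 2
    return x


a = Var(3.0)
(2 * a * a).backward()   # f(a) = 2 a^2, so df/da = 4 a

z = Var(0.1)
double_until(z).backward()
print(a.grad, z.grad)
```

Because the loop runs a data-dependent number of times, the recorded graph differs from input to input, and the gradient is always taken along the path that was actually executed.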

Consider the following function `f`: it doubles its input until the input's norm reaches 1000, then selects one element depending on the sign of the sum of the elements.

$Y=f(X)$
- Take a vector `X` of two random numbers in [-1, 1]
- Multiply `X` by `2` until its norm reaches `1000`
- If the sum of `X` is non-negative, return the 1st element
- Otherwise, return the 2nd element

In [2]:
def f(x):
    x = x * 2
    while x.norm().asscalar() < 1000:
        x = x * 2
    # Pick the 1st element if the sum is non-negative
    if x.sum().asscalar() >= 0:
        y = x[0]
    # Otherwise pick the 2nd element
    else:
        y = x[1]
    return y

We record the trace and feed in a random value:

In [11]:
x = nd.random.uniform(-1, 1, shape=2)
x


[ 0.09762704  0.18568921]
<NDArray 2 @cpu(0)>

In [12]:
x.attach_grad()
with autograd.record():
    y = f(x)
y.backward()

We know that `y` is a linear function of `x`, and that `y` is chosen from the elements of `x`. The gradient with respect to `x` is therefore either `[y/x[0], 0]` or `[0, y/x[1]]`, depending on which element of `x` we picked:

$y = k \cdot x[0]$ or $y = k \cdot x[1]$,

hence $\frac{dy}{dx} = \begin{bmatrix} k \\ 0 \end{bmatrix}$ or $\begin{bmatrix} 0 \\ k \end{bmatrix}$ respectively,

with $k = 2^n$, where $n$ is the number of times $x$ was multiplied by 2.

Let's look at the results:

In [13]:
x


[ 0.09762704  0.18568921]
<NDArray 2 @cpu(0)>

In [14]:
x.grad


[ 8192.     0.]
<NDArray 2 @cpu(0)>
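The value `8192` can be cross-checked by replaying the same control flow in plain Python and counting the doublings. This is a sketch that mirrors `f` without MXNet. The function name `f_plain` is hypothetical, and `x0` copies the sample printed above.

```python
import math


def f_plain(x):
    # Plain-Python mirror of f for a two-element list (no MXNet involved).
    # Returns the selected element, its index, and the number of doublings n.
    x = [v * 2 for v in x]
    n = 1
    while math.hypot(*x) < 1000:
        x = [v * 2 for v in x]
        n += 1
    index = 0 if sum(x) >= 0 else 1
    return x[index], index, n


x0 = [0.09762704, 0.18568921]  # the sample printed above
y, index, n = f_plain(x0)
k = 2 ** n

# The gradient is k at the selected position and 0 elsewhere.
expected_grad = [k if i == index else 0 for i in range(2)]
print(n, k, expected_grad)
```

For this sample the input is doubled 13 times and the first element is selected, so $k = 2^{13} = 8192$, in agreement with `x.grad`.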