Imagine you want to test out a new machine learning model for your data. This usually means coming up with some loss function to capture how well your model fits the data and optimizing that loss with respect to the model parameters. If there are many model parameters (neural nets can have millions) then you need gradients. You then have two options: derive and code them up yourself, or implement your model using the syntactic and semantic constraints of a system like Theano.
We want to provide a third way: just write down the loss function using a standard numerical library like numpy, and autograd will give you its gradient.
How to use autograd
grad function takes in a function, and gives you a function that computes its derivative.
Your function must have a scalar-valued output (i.e. a float).
This covers the common case when you want to use gradients to optimize something.
Autograd works on ordinary Python and Numpy code containing all the usual control structures, including
if statements, and closures. Here's a simple example of using an open-ended loop to compute the sine function:
import autograd.numpy as np # Thinly-wrapped version of Numpy from autograd import grad def taylor_sine(x): # Taylor approximation to sine function ans = currterm = x i = 0 while np.abs(currterm) > 0.001: currterm = -currterm * x**2 / ((2 * i + 3) * (2 * i + 2)) ans = ans + currterm i += 1 return ans grad_sine = grad(taylor_sine) print "Gradient of sin(pi) is", grad_sine(np.pi)
Complete example: logistic regression
A common use case for automatic differentiation is to train a probabilistic model. Here we present a very simple (but complete) example of specifying and training a logistic regression model for binary classification:
import autograd.numpy as np from autograd import grad def sigmoid(x): return 0.5*(np.tanh(x) + 1) def logistic_predictions(weights, inputs): # Outputs probability of a label being true according to logistic model. return sigmoid(np.dot(inputs, weights)) def training_loss(weights): # Training loss is the negative log-likelihood of the training labels. preds = logistic_predictions(weights, inputs) label_probabilities = preds * targets + (1 - preds) * (1 - targets) return -np.sum(np.log(label_probabilities)) # Build a toy dataset. inputs = np.array([[0.52, 1.12, 0.77], [0.88, -1.08, 0.15], [0.52, 0.06, -1.30], [0.74, -2.49, 1.39]]) targets = np.array([True, True, False, True]) # Define a function that returns gradients of training loss using autograd. training_gradient_fun = grad(training_loss) # Optimize weights using gradient descent. weights = np.array([0.0, 0.0, 0.0]) print "Initial loss:", training_loss(weights) for i in xrange(100): weights -= training_gradient_fun(weights) * 0.01 print "Trained loss:", training_loss(weights)
Python syntax is pretty good for specifying probabilistic models. The biggest win is that it becomes a lot easier to modify a model and rapidly iterate.
For more complex examples, see our examples directory, which includes:
- a simple neural net
- a convolutional neural net
- a recurrent neural net
- a long short-term memory (LSTM)
- backpropagating through a fluid simulation
What's going on under the hood?
To compute the gradient, autograd first has to record every transformation that was applied to the input as it was turned into the output of your function.
To do this, autograd wraps functions (using class
primitive) so that when they're called, they add themselves to a list of operations performed.
primitive class also allows us to specify the gradient of these primitive functions.
To flag the variables we're taking the gradient with respect to, we wrap them using the
You should never have to think about the
Node class, but you might notice it when printing out debugging info.
After the function is evaluated, autograd has a list of all operations that were performed and which nodes they depended on. This is the computational graph of the function evaluation. To compute the derivative, we simply apply the rules of differentiation to each node in the graph.
Reverse mode differentiation
Given a function made up of several nested function calls, there are several ways to compute its derivative.
For example, given L(x) = F(G(H(x))), the chain rule says that its gradient is dL/dx = dF/dG * dG/dH * dH/dx. If we evaluate this product from right-to-left: (dF/dG * (dG/dH * dH/dx)), the same order as the computations themselves were performed, this is called forward-mode differentiation. If we evaluate this product from left-to-right: (dF/dG * dG/dH) * dH/dx)), the reverse order as the computations themselves were performed, this is called reverse-mode differentiation.
Compared to finite differences or forward-mode, reverse-mode differentiation is by far the more practical method for differentiating functions that take in a large vector and output a single number. In the machine learning community, reverse-mode differentiation is known as 'backpropagation', since the gradients propogate backwards through the function. It's particularly nice since you don't need to instantiate Jacobians, and Jacobian-vector products can often be computed efficiently as well. Because autograd supports higher derivatives as well, Hessian-vector products (a form of second-derivative) are also available and efficient to compute.
How can you support ifs, while loops and recursion?
Some autodiff packages (such as Theano or Kayak) work by having you specify a graph of the computation that your function performs, including all the control flow (such as if and for loops), and then turn that graph into another one that computes gradients.
This has some benefits (such as allowing compile-time optimizations), but it requires you to express control flow in a limited mini-language that those packages know how to handle. (For example, the
scan() operation in Theano.)
In contrast, autograd doesn't have to know about any ifs, branches, loops or recursion that were used to decide which operations were called. To compute the gradient of a particular input, one only needs to know which continuous transforms were applied to that particular input, not which other transforms might have been applied. Since autograd keeps track of the relevant operations on each function call separately, it's not a problem that all the Python control flow operations are invisible to autograd. In fact, it greatly simplifies the implementation.
What can autograd differentiate?
The main constraint is that any function that operates on a node is marked as
primitive, and has its gradient implemented.
This is taken care of for most functions in the Numpy library, and it's easy to write your own gradients.
The input can be a scalar, complex number, vector, tuple, a tuple of vectors, a tuple of tuples, etc.
When using the
grad function, the output must be a scalar, but the functions
jacobian allow gradients of vectors.
Supported and unsupported parts of numpy/scipy
Numpy has a lot of features. We've done our best to support most of them. so far, we've implemented gradients for:
- most of the mathematical operations
- most of the array and matrix manipulation routines
- some linear algebra functions
- most of the fast fourier transform routines
- full support for complex numbers
- N-dimensional convolutions
- Some scipy routines, including
Some things remain to be implemented. For example, we support indexing (
x = A[i, j, :]) but not assignment (
A[i,j] = x) in arrays that are being differentiated with respect to.
Assignment is hard to support because it requires keeping copies of the overwritten data, but we plan to support this in the future.
Similarly, we don't support the syntax
A.dot(B); use the equivalent
np.dot(A, B) instead. The reason we don't support the first way is that subclassing
ndarray raises a host of issues. As another consequence of not subclassing
ndarray, some subclass checks can break, like
isinstance(x, np.ndarray) can return
In-place modification of arrays not being differentiated with respect to (for example,
A[i] = x or
A += B) won't raise an error, but be careful. It's easy to accidentally change something without autograd knowing about it. This can be a problem because autograd keeps references
to variables used in the forward pass if they will be needed on the reverse pass. Making copies would
be too slow.
Lists and dicts can be used freely - like control flow, autograd doesn't even need to know about them.
The exception is passing in a list to a primitive function, such as
This requires special care, since the list contents need to be examined for nodes.
So far, we do support passing lists to
autograd.np.concatenate, but in other
cases, make sure you explicitly cast lists to arrays using
TL;DR: Do use
- Most of numpy's functions
- Most numpy.ndarray methods
- Some scipy functions
- Indexing and slicing of arrays
x = A[3, :, 2:4]
- Explicit array creation from lists
A = np.array([x, y])
- Assignment to arrays
A[0,0] = x
- Implicit casting of lists to arrays
A = np.sum([x, y]), use
A = np.sum(np.array([x, y]))instead.
- In-place operations (such as
a += b, use
a = a + binstead)
- Some isinstance checks like
Luckily, it's easy to check gradients numerically if you're worried that something's wrong.
Extend Autograd by defining your own primitives
What if autograd doesn't support a function you need to take the gradient of? This can happen if your code depends on external library calls or C code. It can sometimes even be a good idea to provide the gradient of a pure Python function for speed or numerical stability.
For example, let's add the gradient of a numerically stable version of
This function is included in
scipy.misc and already supported, but let's make our own version.
Next, we define our function using standard Python, using
@primitive as a decorator:
import autograd.numpy as np from autograd.core import primitive @primitive def logsumexp(x): """Numerically stable log(sum(exp(x)))""" max_x = np.max(x) return max_x + np.log(np.sum(np.exp(x - max_x)))
@primitive tells autograd not to look inside the function, but instead
to treat it as a black box whose gradient can be specified later.
Functions with this decorator can contain anything that Python knows
how to execute, including calls to other languages.
Next, we write a function that specifies the gradient of the primitive
def logsumexp_vjp(g, ans, vs, gvs, x): return np.full(x.shape, g) * np.exp(x - np.full(x.shape, ans))
logsumexp_vjp computes a vector-Jacobian product (VJP):
g by the Jacobian of
g will be the gradient of the final objective with respect to
(the output of
The calculation can depend on both the input (
and the output (
ans) of the original function.
If you want to be able to take higher-order derivatives, then the
code inside the VJP function must be itself differentiable by autograd.
The final step is to tell autograd about
logsumexp's vector-Jacobian product function:
Now we can use logsumexp() anywhere, including inside of a larger function that we want to differentiate:
from autograd import grad def example_func(y): z = y**2 lse = logsumexp(z) return np.sum(lse) grad_of_example = grad(example_func) print "Gradient: ", grad_of_example(np.array([1.5, 6.7, 1e-10])
This example can be found as a Python script here.
Autograd supports complex arrays and scalars using a convention described as follows.
Consider a complex-to-complex function,
expressed in terms of real-to-real components,
def f(z): x, y = real(z), imag(z) return u(x, y) + v(x, y) * 1j
def grad_f(z): x, y = real(z), imag(z) return grad(u, 0)(x, y) - i * grad(u, 1)(x, y)
(The second argument of
grad specifies which argument we're differentiating with respect to.)
So we throw out v, the imaginary part of f, entirely.
Our convention covers three important cases:
fis holomorphic, we get the usual complex derivative (since
grad(u, 0) == grad(v, 1)and
grad(u, 1) == - grad(v, 0)).
fis a real-valued loss function of a complex parameter,
x, we get a result that we can use in a gradient-based optimizer, by taking steps in the direction of the complex conjugate of
fis a real-to-real function that happens to use complex primitives internally, some of which must necessarily be non-holomorphic (maybe you use FFTs to implement convolutions for example) then we get the same result that a purely real implementation would have given.
Our convention doesn't handle the case where
f is a non-holomorphic function
and you're interested in all of du/dx, du/dy, dv/dx and dv/dy.
But then the answer would have to contain four real values
and there would be no way to express it as a single complex number.
We define primitive vector-Jacobian products of complex functions like this
def f_vjp(g, z): z_x, z_y = real(z), imag(z) g_x, g_y = real(g), imag(g) return ( g_x * grad(u, 0)(x, y) - i * g_x * grad(u, 1)(x, y) - g_y * grad(v, 0)(x, y) + i * g_y * grad(v, 1)(x, y))
For holomorphic primitives, this is just the regular complex derivative multiplied by
so most simple math primitives don't need to be changed from their real implementations.
For non-holomorphic primitives, it preserves all four real partial derivatives as if we
were treating complex numbers as real 2-tuples
(though it throws a couple of negative signs in there).
Chapter 4 of Dougal's PhD thesis
goes into a bit more detail about how we define the primitive vector-Jacobian products.
Autograd is still under active development. We plan to support:
- GPU operations
- In-place array operations and assignment to arrays
Autograd was written by Dougal Maclaurin, David Duvenaud, and Matthew Johnson and we're actively developing it. Please feel free to submit any bugs or feature requests. We'd also love to hear about your experiences with autograd in general. Drop us an email!