In [None]:
from __future__ import print_function
%matplotlib inline

Attribution: 

   * Most material is adapted from the [PyTorch Deep Learning 60 min Blitz](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)


# Autograd: automatic differentiation

Central to all neural networks in PyTorch is the ``autograd`` package. Let’s briefly look at it first, and then move on to training our first neural network.


The ``autograd`` package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different. This is very different from automatic differentiation in [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/).
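To make "define-by-run" concrete, here is a minimal sketch in which ordinary Python control flow decides what the graph looks like on each call. The helper `repeated_doubling` and its `n_doublings` parameter are hypothetical names introduced only for this illustration.

```python
import torch
from torch.autograd import Variable


def repeated_doubling(x, n_doublings):
    # Hypothetical helper: the loop count decides how many multiply
    # nodes end up in the graph for this particular call.
    y = x
    for _ in range(n_doublings):
        y = y * 2
    return y.sum()


grads = {}
for n in (1, 3):
    x = Variable(torch.ones(2), requires_grad=True)
    repeated_doubling(x, n).backward()
    grads[n] = x.grad.data.tolist()

# Each call built a different graph, so the gradient is 2**n per element.
print(grads)
```

The same function yields a different backward pass on each call, because the graph is recorded while the code runs rather than declared ahead of time.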

Let us see this in simpler terms with some examples.

## Variable

``autograd.Variable`` is the central class of the package. It wraps a Tensor and supports nearly all operations defined on it. Once you finish your computation, you can call ``.backward()`` and have all the gradients computed automatically.

You can access the raw tensor through the ``.data`` attribute, while the gradient w.r.t. this variable is accumulated into ``.grad``.
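The word "accumulated" matters: each call to ``.backward()`` adds to ``.grad`` rather than overwriting it. The following is a small sketch of that behaviour; the variable names `first` and `second` are only illustrative.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2), requires_grad=True)

# First backward pass: d(sum(3 * x))/dx is 3 for every element.
(x * 3).sum().backward()
first = x.grad.data.clone()

# Second backward pass: the new gradient is added to the stored one.
(x * 3).sum().backward()
second = x.grad.data.clone()

# Reset the accumulated gradient in place before reusing x.
x.grad.data.zero_()

print(first, second, x.grad)
```

This is why training loops usually zero the gradients before each new backward pass.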

![Variable](images/8/Variable.png)

One more class is very important to the autograd implementation: ``Function``.

``Variable`` and ``Function`` are interconnected and build up an acyclic graph that encodes a complete history of the computation. Each variable has a ``.grad_fn`` attribute that references the ``Function`` that created the ``Variable`` (except for Variables created by the user, whose ``grad_fn`` is ``None``).
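One way to see this graph is to start at a result's ``grad_fn`` and follow its ``next_functions`` links back toward the inputs. The sketch below does that with a hypothetical helper called `walk`; the exact node class names it prints depend on the PyTorch version, so treat the output as illustrative.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2
out = (y * y * 3).mean()


def walk(fn, depth=0):
    # Hypothetical helper: collect (depth, node class name) pairs by
    # following next_functions from the output back toward the inputs.
    if fn is None:  # inputs that do not require gradient show up as None
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes += walk(next_fn, depth + 1)
    return nodes


nodes = walk(out.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The user-created `x` has no ``grad_fn`` of its own; in the walk it appears as a gradient-accumulating node at the end of each branch.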

If you want to compute the derivatives, you can call ``.backward()`` on a ``Variable``. 


The `backward` function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value. 

If the Tensor you are calling `backward()` on is a scalar (for example a loss at the end of the graph, with nothing downstream), you do not need to pass any arguments to ``backward()``. Otherwise, you need to pass a ``gradient`` argument: a tensor with the same shape as the one you are calling ``backward()`` on.
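The difference between the two cases can be sketched as follows. This assumes a reasonably recent PyTorch, where calling ``backward()`` without an argument on a result with more than one element raises a ``RuntimeError``; the flag `needs_argument` is just an illustrative name.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(3), requires_grad=True)
y = x * 2  # three elements, so not a scalar

try:
    y.backward()  # no gradient supplied for a non-scalar result
    needs_argument = False
except RuntimeError:
    needs_argument = True

# Supplying a gradient of matching shape makes the call valid.
y.backward(torch.ones(3))
print(needs_argument, x.grad)
```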

In [None]:
import torch
from torch.autograd import Variable

Create a variable:



In [None]:
x = Variable(torch.ones(2, 2), requires_grad=True)
print(x)

Do an operation on the variable:



In [None]:
y = x + 2
print(y)

``y`` was created as a result of an operation, so it has a ``grad_fn``.
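A quick check of this, as a self-contained sketch (it rebuilds `x` and `y` so it can be run on its own):

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
y = x + 2

print(x.grad_fn)        # None, because x was created by the user
print(y.grad_fn)        # a Function object recording the addition
print(y.requires_grad)  # True, inherited from x
```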



Do more operations on y:



In [None]:
z = y * y * 3
out = z.mean()

print(z, out)

## Gradients
Let's perform backprop starting with `out` using ``out.backward()``. Note this is equivalent to doing ``out.backward(torch.Tensor([1.0]))``.


In [None]:
out.backward()

Print the gradients d(out)/dx:




In [None]:
print(x.grad)

You should get a matrix filled with ``4.5``. Let’s call the ``out``
*Variable* “$o$”.
We have that $o = \frac{1}{4}\sum_i z_i$,
$z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$.
Since each $z_i$ depends only on $x_i$, the chain rule gives
$\frac{\partial o}{\partial x_i} = \frac{1}{4}\cdot 6(x_i+2) = \frac{3}{2}(x_i+2)$, hence
$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
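The derivation can be checked numerically by comparing the gradient autograd produces with the closed-form expression $\frac{3}{2}(x_i+2)$. This is a small sketch; `analytic` and `max_error` are illustrative names.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.ones(2, 2), requires_grad=True)
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Closed-form gradient from the derivation: (3/2)(x_i + 2).
analytic = 1.5 * (x.data + 2)

# Largest absolute difference between autograd and the formula.
max_error = float((x.grad.data - analytic).abs().max())
print(x.grad, max_error)
```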



You can do many crazy things with autograd!



In [None]:
x = torch.randn(3)
x = Variable(x, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

In [None]:
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)

print(x.grad)

For a good explanation of the argument to `backward()`, you can read this post: https://stackoverflow.com/a/43461153

Most of the time, you call `backward()` on a scalar (such as a loss) with no argument, which is equivalent to passing a gradient of 1. Only when `y` is non-scalar, for example an intermediate variable that receives gradients from further up the computation, do you pass an explicit gradient tensor, as is demonstrated here.
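One way to read the argument to `backward()` is as a set of weights on the elements of `y`: passing a tensor `v` should give the same gradient as reducing `y` to the scalar `(y * v).sum()` and calling `backward()` on that. The sketch below compares the two routes on a fixed input instead of a random one; the names `v`, `x1`, `x2` and `max_diff` are illustrative.

```python
import torch
from torch.autograd import Variable

v = torch.FloatTensor([0.1, 1.0, 0.0001])

# Route 1: pass v directly to backward() on the non-scalar result.
x1 = Variable(torch.ones(3), requires_grad=True)
y1 = x1 * 4
y1.backward(v)

# Route 2: weight the result by v, reduce to a scalar, then backprop.
x2 = Variable(torch.ones(3), requires_grad=True)
y2 = x2 * 4
(y2 * Variable(v)).sum().backward()

max_diff = float((x1.grad.data - x2.grad.data).abs().max())
print(x1.grad, x2.grad, max_diff)
```

Both routes give `4 * v` here, since each element of the result is four times the matching input element.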

**For further reading:**

Documentation of ``Variable`` and ``Function`` is at
http://pytorch.org/docs/autograd



### Exercises

1. You may have noticed we passed `requires_grad=True` when creating variables. This flag allows for fine-grained exclusion of subgraphs from gradient computation. If any single input to an operation requires gradient, its output will also require gradient. Conversely, the output won't require gradient only if none of the inputs do. Backward computation is never performed in subgraphs where no variable requires gradients.
  - Why is this important?
  - Build a graph of variables where some require gradient and others don't
  
  