# Autograd: Automatic Differentiation

The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

`torch.Tensor` is the central class of the package. If you set its `.requires_grad` attribute to `True`, autograd starts tracking all operations on it. When you finish your computation, you can call `.backward()` to have all the gradients computed automatically. The gradient for this tensor is accumulated into its `.grad` attribute.
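The word "accumulated" matters in practice: repeated backward passes add to `.grad` rather than overwriting it. The following is a minimal sketch of that behavior; the tensor `w` and the toy expression `(w * 3).sum()` are illustrative choices, not part of the original tutorial.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3 * w))/dw is 3 for every element.
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added
# to the value already stored in w.grad.
(w * 3).sum().backward()
second = w.grad.clone()

# Reset the stored gradient in-place before the next pass.
w.grad.zero_()

print(first)   # tensor([3., 3.])
print(second)  # tensor([6., 6.])
print(w.grad)  # tensor([0., 0.])
```

This is why training loops typically clear gradients once per iteration.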

To stop a tensor from tracking history, you can call `.detach()`, which detaches it from the computation history and prevents future computation from being tracked.

To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`. This is particularly helpful when evaluating a model, because the model may have trainable parameters with `requires_grad=True` for which we don’t need the gradients.
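As a rough illustration of the evaluation case, the sketch below runs the same forward pass with and without `torch.no_grad()`. The single `torch.nn.Linear` layer is only a hypothetical stand-in for a trained model, and the names `model`, `inputs`, `tracked` and `untracked` are assumptions made for this example.

```python
import torch

# Hypothetical stand-in for a trained model: one linear layer whose
# parameters have requires_grad=True by default.
model = torch.nn.Linear(3, 1)
inputs = torch.randn(4, 3)

# Normal forward pass: the output is tracked because the parameters
# require gradients.
tracked = model(inputs)

# Evaluation-style forward pass: nothing is recorded inside the block.
with torch.no_grad():
    untracked = model(inputs)

print(tracked.requires_grad, tracked.grad_fn is not None)  # True True
print(untracked.requires_grad, untracked.grad_fn is None)  # False True
```

The two outputs hold the same values; only the bookkeeping differs.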

One more class is very important to the autograd implementation: `Function`.

`Tensor` and `Function` are interconnected and build up an acyclic graph that encodes a complete history of computation. Each tensor has a `.grad_fn` attribute that references the `Function` that created the `Tensor` (except for Tensors created by the user, whose `grad_fn` is `None`).
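One way to look at this graph is to follow the `next_functions` attribute of each `grad_fn` node back toward the inputs. The sketch below does that for a small expression; the helper `walk` is a hypothetical name introduced here, and the exact node names that get printed can vary between PyTorch versions.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # created by the user: no grad_fn
y = x + 2
out = (y * y * 3).mean()


def walk(fn, depth=0):
    """Return (depth, node name) pairs for fn and everything it depends on."""
    if fn is None:  # edge to an input that does not require gradients
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes.extend(walk(next_fn, depth + 1))
    return nodes


nodes = walk(out.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)

print(x.grad_fn)  # None
```

The walk starts at the node that produced `out` and ends at `AccumulateGrad` nodes, which are where gradients are written into the `.grad` of user-created tensors such as `x`.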

If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (i.e. it holds a single element), you don’t need to specify any arguments to `backward()`. If it has more elements, you need to specify a `gradient` argument that is a tensor of matching shape.
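The scalar versus non-scalar distinction can be sketched as follows. The tensor values and the choice of `torch.ones_like(y)` as the `gradient` argument are assumptions made for illustration; any tensor with the same shape as `y` would be accepted.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # non-scalar output with three elements

# Without a gradient argument, backward() refuses to run on a non-scalar.
try:
    y.backward()
    raised = False
except RuntimeError:
    raised = True

# Passing a tensor of matching shape works. Using ones_like(y) weights
# every element of y equally, which matches y.sum().backward().
y.backward(gradient=torch.ones_like(y))

print(raised)  # True
print(x.grad)  # tensor([2., 2., 2.])
```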

In [2]:
import torch

In [3]:
# Create a tensor and set requires_grad=True to track computation with it:
x = torch.ones(2, 2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


In [4]:
# Do a tensor operation:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [5]:
# y was created as a result of an operation, so it has a grad_fn:
print(y.grad_fn)

<AddBackward0 object at 0x7f86cc835590>


In [6]:
# Do more operations on y:
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


In [7]:
# .requires_grad_( ... ) changes an existing Tensor’s requires_grad flag in-place,
# like all methods whose names end with _ (they modify the Tensor in-place).
# A newly created Tensor has requires_grad=False unless you specify otherwise.

a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print('.requires_grad? ',a.requires_grad)
a.requires_grad_(True)
print('.requires_grad? ',a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

.requires_grad?  False
.requires_grad?  True
<SumBackward0 object at 0x7f86e8498ed0>


### Gradients
Let’s backprop now. Because `out` contains a single scalar, `out.backward()` is equivalent to `out.backward(torch.tensor(1.))`. Then we print gradients: d(out)/dx

We have $out=\frac{1}{4}\sum_{i}z_{i}$ with $z_{i}=3(x_{i}+2)^2$, so $z_{i}\big|_{x_{i}=1}=27$. Therefore, $\frac{∂out}{∂x_{i}}=\frac{3}{2}(x_{i}+2)$, hence $\frac{∂out}{∂x_{i}}\Big|_{x_{i}=1}=\frac{9}{2}=4.5$.

These expressions follow from the operations performed in the earlier cells: `y = x + 2`, `z = y * y * 3`, and `out = z.mean()`.

In [13]:
print(out)

tensor(27., grad_fn=<MeanBackward0>)


In [9]:
out.backward()
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])
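The value 4.5 is the special case $x_{i}=1$ of the general derivative $\frac{3}{2}(x_{i}+2)$. As a quick sanity check under that assumption, the sketch below repeats the same computation on random inputs and compares the autograd result with the closed-form expression; the name `analytic` is introduced only for this example.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)

# Same computation as above, written in one line: out = mean(3 * (x + 2)^2).
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Closed-form derivative of out with respect to each element of x.
analytic = 1.5 * (x.detach() + 2)

print(torch.allclose(x.grad, analytic))  # True
```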


Mathematically, if you have a vector-valued function $\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:

$$J=\begin{pmatrix}
\frac{∂y_{1}}{∂x_{1}} & \cdots & \frac{∂y_{1}}{∂x_{n}}\\
\vdots & \ddots & \vdots \\
\frac{∂y_{m}}{∂x_{1}} & \cdots & \frac{∂y_{m}}{∂x_{n}}\\
\end{pmatrix}$$

Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian products. That is, given any vector $v=(v_{1}\;v_{2}\;\cdots\;v_{m})^T$, it computes the product $v^T\cdot J$. If $v$ happens to be the gradient of a scalar function $l=g(\vec{y})$, that is, $v=(\frac{∂l}{∂y_{1}}\;\cdots\;\frac{∂l}{∂y_{m}})^T$, then by the chain rule, the vector-Jacobian product is the gradient of $l$ with respect to $\vec{x}$:

$$J^T\cdot v=
\begin{pmatrix}
\frac{∂y_{1}}{∂x_{1}} & \cdots & \frac{∂y_{m}}{∂x_{1}}\\
\vdots & \ddots & \vdots \\
\frac{∂y_{1}}{∂x_{n}} & \cdots & \frac{∂y_{m}}{∂x_{n}}\\
\end{pmatrix}
\cdot
\begin{pmatrix}
\frac{∂l}{∂y_{1}} \\ \vdots \\ \frac{∂l}{∂y_{m}}
\end{pmatrix}
=
\begin{pmatrix}
\frac{∂l}{∂x_{1}} \\ \vdots \\ \frac{∂l}{∂x_{n}}
\end{pmatrix}$$

(Note that $v^T⋅J$ gives a row vector which can be treated as a column vector by taking $J^T⋅v$)

This property of the vector-Jacobian product makes it very convenient to feed external gradients into a model that has non-scalar output.
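To make the relationship concrete, the sketch below builds the full Jacobian of a small function with `torch.autograd.functional.jacobian` and checks that `backward(v)` returns the same numbers as the explicit product of `v` with that Jacobian. The function `f`, the input values and the vector `v` are made up for this example.

```python
import torch


def f(x):
    """Toy function from R^3 to R^2: (x0 * x1, x2 ** 2)."""
    return torch.stack([x[0] * x[1], x[2] ** 2])


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.5, 2.0])

# Full Jacobian with shape (m, n) = (2, 3); row i holds dy_i/dx_j.
J = torch.autograd.functional.jacobian(f, x.detach())

# Vector-Jacobian product computed by backward, without materializing J.
y = f(x)
y.backward(v)

print(J)       # tensor([[2., 1., 0.], [0., 0., 6.]])
print(v @ J)   # tensor([ 1.0000,  0.5000, 12.0000])
print(x.grad)  # same values as v @ J
```

For large models the Jacobian is rarely built explicitly; the comparison here only shows what `backward(v)` computes.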

In [15]:
# Now let’s take a look at an example of vector-Jacobian product:
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.data.norm() < 1000:
    y = y * 2
print(y)
print(y.data.norm())

tensor([-1621.3694,   216.2311,  -566.0414], grad_fn=<MulBackward0>)
tensor(1730.8950)


In [16]:
# In this case y is no longer a scalar. backward() does not compute the full Jacobian,
# but if we just want the vector-Jacobian product, we simply pass the vector to backward as an argument:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)
print(x.grad)

tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
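The printed gradient is `v` multiplied by a power of two. Since `y` is just `x` doubled some number of times, the Jacobian is a diagonal matrix with that power of two on the diagonal, so the vector-Jacobian product reduces to a scaling of `v`. The sketch below checks this with a fixed input instead of a random one; the variable `scale` and the input values are assumptions added for illustration.

```python
import torch

x = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)

# Double y until its norm reaches 1000, tracking the total factor applied.
y = x * 2
scale = 2.0
while y.detach().norm() < 1000:
    y = y * 2
    scale *= 2

v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

# y = scale * x, so J = scale * I and the product v^T J equals scale * v.
print(torch.allclose(x.grad, v * scale))  # True
```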


In the case of backprop, the quantity of interest is the matrix of derivatives of the scalar loss function $ξ$ with respect to the weight matrix $W$, which has the same shape as $W$:

$$J_{W}=\frac{∂ξ}{∂W}=\begin{pmatrix} 
\frac{∂\,ξ}{∂w_{11}} & ... & \frac{∂\,ξ}{∂w_{1j}}\\
\vdots & \ddots & \vdots \\
\frac{∂\,ξ}{∂w_{i1}} & ... & \frac{∂\,ξ}{∂w_{ij}}\\
\end{pmatrix}$$
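As a small sketch of this case, the example below uses a hypothetical linear model with a squared-error loss and checks that `W.grad` has the same shape as `W` and matches the closed-form derivative. The names `inp`, `target`, `residual` and the particular loss are assumptions for illustration, not the tutorial's own model.

```python
import torch

torch.manual_seed(0)
W = torch.randn(2, 3, requires_grad=True)  # weight matrix
inp = torch.randn(3)                       # one input vector
target = torch.randn(2)                    # desired output

# Scalar loss: squared error of the linear map W @ inp.
loss = ((W @ inp - target) ** 2).sum()
loss.backward()

# Closed form for this loss: d(loss)/dW[i, j] = 2 * residual[i] * inp[j].
residual = (W @ inp - target).detach()
analytic = 2 * residual[:, None] * inp[None, :]

print(W.grad.shape == W.shape)                     # True
print(torch.allclose(W.grad, analytic, atol=1e-6))  # True
```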

In [18]:
# You can also stop autograd from tracking history on Tensors with requires_grad=True either by
# (1) wrapping the code block in `with torch.no_grad():`
print('(1) ','x.requires_grad? ', x.requires_grad)
print('(x**2).requires_grad? ', (x ** 2).requires_grad)
with torch.no_grad():
    print('with torch.no_grad(): (x**2).requires_grad? ', (x ** 2).requires_grad)

# or (2) using .detach() to get a new Tensor with the same content that does not require gradients:
print('\n(2) ','x.requires_grad? ', x.requires_grad)
y = x.detach()
print('y=x.detach(), y.requires_grad? ', y.requires_grad)
print('x=y in all their elements?', x.eq(y).all())

(1)  x.requires_grad?  True
(x**2).requires_grad?  True
with torch.no_grad(): (x**2).requires_grad?  False

(2)  x.requires_grad?  True
y=x.detach(), y.requires_grad?  False
x=y in all their elements? tensor(True)
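One detail worth keeping in mind is that the tensor returned by `.detach()` shares its underlying storage with the original, so "same content" also means in-place changes to one are visible in the other. The sketch below illustrates this with made-up values; `same_storage` is a name introduced for the example.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.detach()

# The detached tensor points at the same memory as x.
same_storage = y.data_ptr() == x.data_ptr()

# An in-place write through y therefore shows up in x as well.
y[0] = 5.0

print(same_storage)  # True
print(x)             # tensor([5., 1., 1.], requires_grad=True)
```

If an independent copy is needed, `x.detach().clone()` gives a tensor with its own storage.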
