## Autograd in PyTorch

In [1]:
import torch

### Tensors and Gradients

PyTorch tensors have a `requires_grad` attribute. When it is set to `True`, Autograd tracks every operation performed on the tensor. To compute the gradients of these operations, call `.backward()` on the result. The gradients are then accumulated into the `.grad` attribute of each tracked tensor.
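
Because gradients are accumulated rather than overwritten, repeated calls to `.backward()` add to whatever is already stored in `.grad`. The following is a minimal sketch of that behavior; the tensor `w` and the names `first` and `second` are illustrative and not part of the notebook above.

```python
import torch

# Leaf tensor whose gradient we want to track (illustrative values)
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w ** 2).sum().backward()
first = w.grad.clone()
print(first)   # tensor([2., 4., 6.])

# Second backward pass on a fresh graph: the new gradient is added to .grad
(w ** 2).sum().backward()
second = w.grad.clone()
print(second)  # tensor([ 4.,  8., 12.])

# Reset the accumulated gradient in place before the next pass
w.grad.zero_()
print(w.grad)  # tensor([0., 0., 0.])
```

This is why training loops usually clear the gradients before each backward pass.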

To stop a tensor from tracking its history, call `.detach()`, which returns a new tensor that is detached from the computation graph. Another option is to temporarily disable tracking by wrapping code in a `with torch.no_grad():` block. This is useful when evaluating a model whose parameters are trainable but whose gradients are not needed at evaluation time.
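
As a small sketch of both options, the snippet below checks `requires_grad` on the results. The tensor names `p`, `p_detached`, `q`, and `r` are placeholders chosen for this example.

```python
import torch

p = torch.ones(3, requires_grad=True)

# .detach() returns a new tensor that shares data with p but is cut off from the graph
p_detached = p.detach()
print(p_detached.requires_grad)  # False

# Inside torch.no_grad(), the results of operations are not tracked
with torch.no_grad():
    q = p * 2
print(q.requires_grad)  # False

# Outside the block, tracking resumes
r = p * 2
print(r.requires_grad)  # True
```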

### PyTorch Functions and Gradients

The PyTorch `Function` class is used to track gradient operations on tensors. Under the hood, PyTorch builds a directed acyclic computation graph in which tensors are connected by the `Function` objects that produced them. During the backward pass, gradients flow back along the edges of this graph and are accumulated in the `.grad` attribute of the leaf tensors that require gradients. To keep track of this history, each tensor has a `.grad_fn` attribute that references the `Function` which created it.

> **Note:** User-created tensors have `.grad_fn is None`.

To compute derivatives, call `.backward()`. If the tensor is non-scalar, then a `gradient` argument needs to be specified as a tensor of matching shape.
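
The sketch below illustrates the non-scalar case, assuming a simple elementwise operation. The names `t`, `s`, and `raised` are made up for this example.

```python
import torch

t = torch.ones(2, 2, requires_grad=True)
s = t * 3  # non-scalar result of shape (2, 2)

# Without a `gradient` argument, backward() on a non-scalar tensor raises a RuntimeError
try:
    s.backward()
    raised = False
except RuntimeError:
    raised = True
print(raised)  # True

# Passing a tensor of ones with the same shape is equivalent to s.sum().backward()
s.backward(gradient=torch.ones_like(s))
print(t.grad)
```

Each entry of `t.grad` is 3 here, since every element of `s` depends on exactly one element of `t` with slope 3.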

In [2]:
# create a tensor and track computations on it
x = torch.ones(2,2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


In [3]:
# do a tensor operation
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


In [4]:
# y was created as the result of an operation (function), so it has a `.grad_fn`
print(y.grad_fn)

<AddBackward0 object at 0x7fb464947f28>


In [5]:
# do more operations
z = y * y * 3
out = z.mean()
print(z)
print(out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>)
tensor(27., grad_fn=<MeanBackward1>)


In [6]:
# Note that `.requires_grad_()` is an in-place tensor operation
a = torch.randn(2,2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x7fb46495f320>


#### Gradients

In [7]:
# earlier we created a tensor called out - let's backprop its gradients
out.backward()

In [8]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])
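
The value 4.5 can be checked by hand. With $y_i = x_i + 2$, $z_i = 3 y_i^2$, and `out` the mean of the four entries of $z$, a short derivation gives:

\begin{align}
out & = \frac{1}{4} \sum_{i} 3 (x_i + 2)^2 \\
\dfrac{\partial\, out}{\partial x_i} & = \frac{1}{4} \cdot 6 (x_i + 2) = \frac{3}{2} (x_i + 2) \\
\left. \dfrac{\partial\, out}{\partial x_i} \right|_{x_i = 1} & = \frac{3}{2} \cdot 3 = 4.5
\end{align}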


### Vector-Jacobian Products

`torch.autograd` doesn't calculate symbolic gradients. Rather, it calculates the value of the gradient evaluated at a specific point.

If $\vec{y} = f(\vec{x})$ with $\vec{x} \in \mathbb{R}^n$ and $\vec{y} \in \mathbb{R}^m$, then the derivative of $\vec{y}$ w.r.t. $\vec{x}$ is the Jacobian matrix:

\begin{align}
J = \begin{bmatrix}
    \dfrac{\partial y_1}{\partial x_1} & \cdots & \dfrac{\partial y_1}{\partial x_n} \\
    \vdots & \ddots & \vdots \\
    \dfrac{\partial y_m}{\partial x_1} & \cdots & \dfrac{\partial y_m}{\partial x_n}
    \end{bmatrix}
\end{align}

Then for any vector $\vec{v} = (v_1, \cdots, v_m)^T$, `torch.autograd` calculates the **vector-Jacobian** product:

\begin{align}
dv = v^T \cdot J
\end{align}

In [9]:
# for a concrete example of this operation:
x = torch.randn(3, requires_grad=True)
y = x * 2  # f(x) = 2x

v = torch.tensor([0.1, 1.0, 0.001], dtype=torch.float)
y.backward(v)

print(x.grad)

tensor([0.2000, 2.0000, 0.0020])


This makes sense: since $y = f(x) = 2x$ is applied elementwise, each output depends only on its own input, so the Jacobian is diagonal:

\begin{align}
J = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix}
\end{align}

therefore:
\begin{align}
v^T \cdot J & = \begin{bmatrix} 0.1 & 1.0 & 0.001 \end{bmatrix} \cdot \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 2 \end{bmatrix} \\
            & = \begin{bmatrix} 0.2 & 2.0 & 0.002 \end{bmatrix}
\end{align}
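
One way to sanity-check this result is to build the full Jacobian and multiply it by `v` explicitly. The sketch below assumes the same operation as the cell above (`y = x * 2`), wrapped in a helper `f` that is introduced only for this example, and uses `torch.autograd.functional.jacobian` to materialize the matrix.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Same operation as in the cell above: f(x) = 2x
    return x * 2

x = torch.randn(3, requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.001])

# Full 3x3 Jacobian of f evaluated at x; for f(x) = 2x this is 2 * identity
J = jacobian(f, x)

# Vector-Jacobian product computed explicitly
explicit = v @ J

# Same product computed by autograd without materializing J
y = f(x)
y.backward(v)

print(torch.allclose(explicit, x.grad))  # True
```

Materializing the Jacobian is only practical for small inputs; `.backward(v)` computes the product directly, which is why Autograd works with vector-Jacobian products.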