# Autograd

The `autograd` package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

## Tensor

`torch.Tensor`: central class of the package

 - If you set its attribute `.requires_grad` to `True`, it starts to track all operations on it. When you finish your computation you can call `.backward()` and have all the gradients computed automatically. The gradient for this tensor will be accumulated into its `.grad` attribute.

 - To stop a tensor from tracking history, you can call `.detach()` to detach it from the computation history, and to prevent future computation from being tracked.

 - To prevent tracking history (and using memory), you can also wrap the code block in `with torch.no_grad():`. (This can be particularly helpful when evaluating a model because the model may have trainable parameters with `requires_grad=True`, but for which we don’t need the gradients.)
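The word "accumulated" in the first point above matters in practice: each call to `.backward()` adds to `.grad` rather than overwriting it. The following is a minimal sketch of that behaviour. The tensor `w` and the names `first` and `second` are made up for this illustration.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3 * w))/dw is 3 for every element.
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added to .grad.
(w * 3).sum().backward()
second = w.grad.clone()

print(first)   # tensor([3., 3.])
print(second)  # tensor([6., 6.])

# Reset in place before the next pass if accumulation is not wanted.
w.grad.zero_()
print(w.grad)  # tensor([0., 0.])
```

This is why training loops usually zero the gradients before each backward pass.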


 `Function`: another important class for autograd implementation

 `Tensor` and `Function` are interconnected and build up an acyclic graph that encodes the complete history of computation. **Each tensor has a `.grad_fn` attribute that references the `Function` that created the `Tensor`** (except for Tensors created by the user, whose `grad_fn` is `None`).
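As a quick check of the `grad_fn` rule, the sketch below compares a tensor created directly by the user with one produced by an operation. The names `a` and `b` are illustrative only. The `.is_leaf` attribute is not mentioned above; it is used here as an extra way to tell the two kinds of tensor apart.

```python
import torch

a = torch.ones(2, requires_grad=True)  # created by the user
b = a * 2                              # created by an operation

print(a.grad_fn)              # None
print(a.is_leaf)              # True
print(b.grad_fn is not None)  # True
print(b.is_leaf)              # False
```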

 If you want to compute the derivatives, you can call `.backward()` on a `Tensor`. 

- If the `Tensor` is a scalar (i.e. it holds a single element), you don’t need to pass any arguments to `backward()`.
- If it has more elements, you need to specify a `gradient` argument that is a tensor of matching shape.
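The two cases above can be sketched as follows. Calling `backward()` with no arguments on a non-scalar tensor raises a `RuntimeError`, while passing a `gradient` of matching shape works. The tensors and the `raised` flag are hypothetical names used only for this illustration.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x * 2  # non-scalar output with three elements

raised = False
try:
    y.backward()  # no `gradient` argument for a non-scalar tensor
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Passing a tensor of matching shape works.
y.backward(gradient=torch.ones_like(y))
print(x.grad)  # tensor([2., 2., 2.])
```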

In [0]:
import torch

### `grad_fn`

In [2]:
x = torch.ones(2, 2, requires_grad=True)
x

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)

In [3]:
y = x + 2
y

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)

`y` was created as a result of an operation, so it has a `grad_fn`.

In [4]:
y.grad_fn

<AddBackward0 at 0x7fb801ace6a0>

In [8]:
z = y * y * 3
out = z.mean()
print(z, '\n', out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) 
 tensor(27., grad_fn=<MeanBackward0>)


### `requires_grad`

`.requires_grad_( ... )` changes an existing Tensor’s `requires_grad` flag in-place. A Tensor’s `requires_grad` flag defaults to `False` if it is not given at creation; the argument of `.requires_grad_()` itself defaults to `True`.

In [0]:
a = torch.randn(2, 2)

In [12]:
a = ((a * 3) / (a - 1))
a.requires_grad

False

In [13]:
a.requires_grad_(True)
a.requires_grad

True

In [14]:
b = (a * a).sum()
b.grad_fn

<SumBackward0 at 0x7fb7b45c79b0>

In [15]:
b.requires_grad

True
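To make the defaults of `requires_grad_` explicit, here is a small sketch on a freshly created tensor. It reuses the name `a` purely for illustration.

```python
import torch

a = torch.randn(2, 2)
print(a.requires_grad)   # False: the default for a newly created tensor

a.requires_grad_()       # no argument: the flag is set to True
print(a.requires_grad)   # True

a.requires_grad_(False)  # switch tracking off again, in place
print(a.requires_grad)   # False
```

Switching the flag off this way is only allowed on tensors created by the user, not on results of tracked operations.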

## Gradients

In [16]:
out

tensor(27., grad_fn=<MeanBackward0>)

Because `out` contains a single scalar, `out.backward()` is equivalent to `out.backward(torch.tensor(1.))`.

In [0]:
out.backward()

In [20]:
# gradients d(out)/dx
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


Let's call the `out` Tensor $o$.

We have 
$$
o=\frac{1}{4} \sum_{i} z_{i} \\
z_i = 3(x_i + 2)^2
$$

Therefore:
$$
\frac{\partial o}{\partial z_i} = \frac{1}{4} \\
\frac{\partial z_i}{\partial x_i} = 6(x_i + 2)
$$

According to **Chain Rule**:
$$
\frac{\partial o}{\partial x_i} = \frac{\partial o}{\partial z_i} \frac{\partial z_i}{\partial x_i} = \frac{1}{4} \cdot 6(x_i + 2) \overset{x_i = 1}{=} 4.5
$$
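The derivation can be checked numerically. The sketch below rebuilds the same computation and compares the gradient from autograd with the closed form $\frac{1}{4} \cdot 6(x_i + 2)$. The variable `expected` is introduced only for this comparison.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
z = 3 * (x + 2) ** 2
o = z.mean()
o.backward()

# Closed form from the chain rule: (1/4) * 6 * (x_i + 2)
expected = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, expected))  # True
```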

### Vector-Jacobian product

Mathematically, if you have a vector valued function $\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix:

$$
J=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)
$$

Generally speaking, `torch.autograd` is an engine for computing vector-Jacobian products. That is, given any vector

$$v=\left(\begin{array}{llll}v_{1} & v_{2} & \cdots & v_{m}\end{array}\right)^{T}$$

, compute the product $v^T \cdot J$. If $v$ happens to be the gradient of a scalar function $l = g(\vec{y})$, that is,

$$
v=\left(\begin{array}{lll}
\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}
\end{array}\right)^{T}
$$

, then by the chain rule, the vector-Jacobian product would be the gradient of $l$ with respect to $\vec{x}$:

$$
J^{T} \cdot v=\left(\begin{array}{ccc}
\frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
\end{array}\right)\left(\begin{array}{c}
\frac{\partial l}{\partial y_{1}} \\
\vdots \\
\frac{\partial l}{\partial y_{m}}
\end{array}\right)=\left(\begin{array}{c}
\frac{\partial l}{\partial x_{1}} \\
\vdots \\
\frac{\partial l}{\partial x_{n}}
\end{array}\right)
$$

> Note that $v^T \cdot J$ gives a row vector which can be treated as a column vector by taking $J^T \cdot v$

This characteristic of the vector-Jacobian product makes it very convenient to feed external gradients into a model that has non-scalar output.


In [21]:
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.detach().norm() < 1000:
    y = y * 2

print(y)

tensor([-1389.8010,  1174.6008,   706.2604], grad_fn=<MulBackward0>)


In this case `y` is no longer a scalar. `backward()` does not compute the full Jacobian, but if we just want the vector-Jacobian product, we can simply pass the vector to `backward` as an argument:

In [22]:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
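To see that `backward(v)` really yields $J^{T} \cdot v$, the result can be compared against an explicitly computed Jacobian. This is a minimal sketch, assuming a PyTorch version that provides `torch.autograd.functional.jacobian`. The function `f` is a hypothetical fixed-factor stand-in for the repeated doubling above, chosen so the result is deterministic.

```python
import torch
from torch.autograd.functional import jacobian


def f(x):
    # Hypothetical stand-in for the doubling loop: a fixed factor of 8.
    return x * 8


x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])

# Vector-Jacobian product via backward.
y = f(x)
y.backward(v)

# Full Jacobian (3 x 3), computed separately for comparison.
J = jacobian(f, x.detach())

print(x.grad)
print(torch.allclose(x.grad, J.T @ v))  # True
```

Because `f` is elementwise, `J` is diagonal here and the product reduces to scaling `v` by the factor.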


## Stop autograd

You can stop autograd from tracking history on Tensors with `requires_grad=True`:

- Either by wrapping the code block in `with torch.no_grad():`

In [25]:
print(x.requires_grad)
print((x ** 2).requires_grad)

True
True


In [26]:
with torch.no_grad():
  print((x ** 2).requires_grad)

False


- Or by using `.detach()` to get a new Tensor with the same content but that does not require gradients:

In [27]:
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())

True
False
tensor(True)
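One detail worth knowing about `.detach()` is that the new Tensor shares memory with the original, so in-place changes to one are visible in the other. The sketch below illustrates this. The names `c` and the values written are illustrative only, and `.clone()` is used here as one way to get an independent copy.

```python
import torch

x = torch.ones(3, requires_grad=True)
y = x.detach()

# Same underlying memory, no copy was made.
print(y.data_ptr() == x.data_ptr())  # True

# An in-place change through the detached tensor is visible in x.
y[0] = 5.0
print(x)  # tensor([5., 1., 1.], requires_grad=True)

# Cloning after detaching gives an independent copy.
c = x.detach().clone()
c[1] = 7.0
print(x[1].item())  # 1.0
```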
