- The autograd package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.

- Each torch tensor has a `.requires_grad` attribute that can be set to `True` or `False`. When it is `True`, autograd tracks the operations performed on the tensor.
<br> When you finish your computation, you can call `.backward()` and __have all the gradients computed automatically__.

- The gradient for a tensor is accumulated (summed across `.backward()` calls) into its `.grad` attribute.

- To stop a tensor from tracking history, call `.detach()`. It returns a new tensor that is detached from the computation history, so future computation on it is not tracked.

- During __inference__, you can also wrap the code block in `with torch.no_grad():` to prevent history tracking (and save memory). This is particularly helpful when evaluating a model, because the model may have trainable parameters with `requires_grad=True` for which __we don’t need the gradients__ (we only need the forward pass).
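
The accumulation into `.grad` mentioned above is easy to miss, so here is a minimal sketch of it. The tensor name `w` and the toy function `3 * w` are made up for illustration and are not part of the notebook.

```python
import torch

# Leaf tensor that autograd should track (hypothetical example values).
w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3 * w))/dw = 3 for each element.
(3 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top.
(3 * w).sum().backward()
second = w.grad.clone()

# Reset the accumulated gradient in place before the next pass.
w.grad.zero_()

print(first)   # tensor([3., 3.])
print(second)  # tensor([6., 6.])
print(w.grad)  # tensor([0., 0.])
```

This is why re-running a cell that calls `.backward()` changes the printed `.grad` unless the gradient is zeroed first.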

In [1]:
import torch

In [2]:
x = torch.ones(2, 2, requires_grad=True)
print(x)  # note that the output shows requires_grad=True

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


- `Tensor` and `Function` are interconnected and build up an acyclic graph that encodes the complete history of computation. Each tensor has a `.grad_fn` attribute that references the `Function` that created it (__except for Tensors created by the user__, whose `grad_fn` is `None`).

In [3]:
print(x.grad_fn)

None


In [4]:
y = x + 3.
y

tensor([[4., 4.],
        [4., 4.]], grad_fn=<AddBackward>)

`y` was created as the result of an operation, so it has a `grad_fn` that references the `Function` which created it.

In [5]:
print(y.grad_fn)

<AddBackward object at 0x7f22c74f2668>
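
To make the graph structure a bit more concrete, the sketch below walks backwards from an output through `grad_fn`. It relies on the `is_leaf` and `next_functions` introspection attributes; the exact class names printed (for example `AddBackward0`) vary between PyTorch versions, so no names are asserted here.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # created by the user: a leaf tensor
y = x + 3.                                # created by an operation
out = (y * y * 3).mean()

print(x.is_leaf, x.grad_fn)  # True None
print(y.is_leaf)             # False

# Walk back from `out`, following the first parent node at each step.
node = out.grad_fn
chain = []
while node is not None:
    chain.append(type(node).__name__)
    parents = node.next_functions  # tuple of (Function or None, index) pairs
    node = parents[0][0] if parents else None

print(chain)  # names of the backward nodes, from `out` towards `x`
```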


In [6]:
z = y * y * 3

out = z.mean()

# Note the automatic attribute `.grad_fn`
print(z)
print(out)

tensor([[48., 48.],
        [48., 48.]], grad_fn=<MulBackward>)
tensor(48., grad_fn=<MeanBackward1>)


In [7]:
a = torch.randn(2, 2) # by default, requires_grad is False
a = ((a * 3) / (a - 1))
print(a.requires_grad)

a.requires_grad_(True)
print(a.requires_grad)

b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x7f21d6aa8a58>


- To compute the derivatives, call `.backward()` on a `Tensor`. If the `Tensor` is a scalar (i.e. it holds a single element), you don’t need to pass any arguments to `backward()`. If it has more elements, __you need to specify a `gradient` argument that is a tensor of matching shape.__

#### Let’s backprop now, since `out` contains a single scalar (the mean from above)
- `out.backward()` is equivalent to `out.backward(torch.tensor(1.))`.
In general, the `gradient` argument of `.backward()` must have the same shape as the tensor it is called on; for a scalar such as `out` it can be omitted.

__Therefore, if your output value is a scalar such as `loss`, you can call `loss.backward()` directly.__

In [9]:
out.backward()

In [11]:
print(x.grad) # gradients d(out)/dx

tensor([[6., 6.],
        [6., 6.]])
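
As a sanity check on the value 6, the gradient can be worked out by hand from the definitions above ($y = x + 3$, $z = 3y^2$, `out` is the mean of the four entries of $z$):

$$
out = \frac{1}{4}\sum_{i} z_i, \qquad z_i = 3\,(x_i + 3)^2
$$

$$
\frac{\partial\, out}{\partial x_i} = \frac{1}{4}\cdot 6\,(x_i + 3) = \frac{3}{2}\,(x_i + 3), \qquad \left.\frac{\partial\, out}{\partial x_i}\right|_{x_i = 1} = \frac{3}{2}\cdot 4 = 6
$$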


In [25]:
x = torch.randn(3, requires_grad=True)

y = x * 2

In [26]:
while y.data.norm() < 1000:
    y = y * 2  # keep doubling y until its norm reaches 1000, then exit the loop

print(y)

tensor([ 621.1643, 1568.4286,  840.2787], grad_fn=<MulBackward>)


In [27]:
y.data.norm()

tensor(1884.6437)

Define the weights to pass as the `gradient` argument of `.backward()`:

In [28]:
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)

Since `y` is not a scalar, the `gradient` argument of `.backward()` must be a tensor of the same shape as `y`. A common choice is a tensor of ones, `[1.0, 1.0, 1.0]`, which weights every element of `y` equally.

For intuition, refer to the 2nd answer to this Stack Overflow question:
https://stackoverflow.com/questions/43451125/pytorch-what-are-the-gradient-arguments
The other answers are also somewhat useful.

Basically, if there are multiple losses and we want to weight them by importance, we specify the weights in the `gradients` tensor. Autograd then computes the gradient of the weighted sum of the elements of `y`.

Generally speaking, I don’t think this will be used a lot!

In [33]:
y.backward(gradient=gradients)
print(x.grad)

tensor([ 1024.0000, 10240.0000,     1.0240])
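
To see what the `gradient` argument does, here is a small sketch comparing two routes that should give the same result: passing a weight vector to `.backward()`, and building the weighted sum by hand. The names `v`, `grad_via_argument` and `grad_via_weighted_sum` and the input values are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([0.1, 1.0, 0.0001])  # hypothetical weights, same shape as y

# Route 1: pass v as the `gradient` argument of a non-scalar output.
y = x * 2
y.backward(gradient=v)
grad_via_argument = x.grad.clone()

# Route 2: build the weighted sum explicitly and backprop the scalar.
x.grad.zero_()  # clear the accumulated gradient from route 1
(x * 2 * v).sum().backward()
grad_via_weighted_sum = x.grad.clone()

print(grad_via_argument)
print(torch.allclose(grad_via_argument, grad_via_weighted_sum))  # True
print(torch.allclose(grad_via_argument, 2 * v))                  # True
```

Since `y = 2 * x`, each element of the gradient is simply twice the corresponding weight.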


In [34]:
print(x.requires_grad)
print((x ** 2).requires_grad)

True
True


If you want `autograd` to stop tracking history on tensors, wrap the code block in `with torch.no_grad():`

In [36]:
with torch.no_grad():
    print((x ** 2).requires_grad)

False
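
The same idea applied to model evaluation might look like the sketch below. The `nn.Linear` layer is only a hypothetical stand-in for a trained model, and the input is random.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)      # hypothetical stand-in for a trained model
inputs = torch.randn(4, 3)   # a batch of 4 random examples

tracked = model(inputs)      # forward pass with history recorded

with torch.no_grad():
    untracked = model(inputs)  # same forward pass, no history recorded

print(tracked.requires_grad)       # True
print(untracked.requires_grad)     # False
print(untracked.grad_fn)           # None
print(model.weight.requires_grad)  # True: the parameters themselves are unchanged
```

The outputs have the same values either way; only the bookkeeping for backprop is skipped.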


In [38]:
x

tensor([0.3033, 0.7658, 0.4103], requires_grad=True)

In [40]:
c = x.detach()
x.detach() # returns a new tensor with same values, but no gradient computation on it

tensor([0.3033, 0.7658, 0.4103])

`tf.stop_gradient()` in TensorFlow treats the tensor as a constant. Passing `x.detach()` to an `nn` layer prevents gradients from flowing back to `x`, so I believe the behavior is the same.

In [39]:
print(x.grad)

tensor([ 1024.0000, 10240.0000,     1.0240])


In [42]:
print(c.grad)  # c is detached from the graph, so no gradient is computed with respect to it

None
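
A small sketch of the "detached tensor acts as a constant" behavior: when one factor of a product is detached, gradients flow only through the other factor. The names `full` and `partial` and the input values are made up for illustration.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Both factors tracked: d(sum(x * x))/dx = 2 * x.
(x * x).sum().backward()
full = x.grad.clone()

# One factor detached: it acts as a constant, so the gradient is just its value.
x.grad.zero_()  # clear the accumulated gradient first
(x.detach() * x).sum().backward()
partial = x.grad.clone()

print(full)     # tensor([2., 4., 6.])
print(partial)  # tensor([1., 2., 3.])
```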
