# Chapter 5: Automatic Differentiation

Calculating gradients by hand is tedious and error-prone. Automatic differentiation (autograd) in PyTorch automates this:
- As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on the others.
- To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule.
- The algorithm for applying the chain rule in this way is called backpropagation.
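
The points above can be sketched on a tiny scalar example. The names `a`, `b` and `c` are illustrative and not part of the chapter; the sketch only assumes that PyTorch is installed.

```python
import torch

# A leaf tensor that autograd should track
a = torch.tensor(3.0, requires_grad=True)

# Forward pass: each operation is recorded in the computational graph
b = a * a        # b = a^2
c = 2 * b + 1    # c = 2a^2 + 1

# Non-leaf results remember the operation that produced them
print(b.grad_fn is not None, c.grad_fn is not None)

# Backward pass: the chain rule gives dc/da = dc/db * db/da = 2 * 2a
c.backward()
print(a.grad)
```

With `a = 3`, the chain rule gives `2 * 2 * 3 = 12`, which is the value stored in `a.grad`.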

In [1]:
import torch

## 1. A Simple Function

Assume we are interested in differentiating the function $y = 2x^{T}x$ with respect to the vector $x$.

Before calculating the gradient, we need a place to store it. We want to avoid allocating new memory every time we take a derivative because, in deep learning, we compute derivatives very often and could otherwise run out of memory.

In [17]:
x = torch.arange(4.0, requires_grad=True)  # track operations on x in a computation graph
print(x)
x.grad  # None until backward() has been called

tensor([0., 1., 2., 3.], requires_grad=True)


In [18]:
# Compute y = 2 * x^T x (a scalar)
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

The gradient of $y = 2x^{T}x$ with respect to $x$ is given by $y' = 4x$.
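
As a quick check of this formula, here is a short derivation, writing $x_i$ for the components of $x$ (notation introduced here for illustration):

$$
y = 2x^{T}x = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4x_i
$$

Stacking these partial derivatives into a vector gives $4x$, which for $x = [0, 1, 2, 3]$ is $[0, 4, 8, 12]$.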

In [19]:
# Calling backward() computes the gradient and stores it in x.grad
y.backward()
x.grad  # gives you y'

tensor([ 0.,  4.,  8., 12.])

In [20]:
x.grad == 4 * x

tensor([True, True, True, True])

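PyTorch accumulates gradients: each call to ``backward`` adds to the values already stored in ``x.grad``. We therefore have to reset the gradient before we calculate the gradient of another function.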
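
To see why the reset matters, the following sketch calls ``backward`` twice without resetting, then once more after resetting. The variable names `first`, `accumulated` and `fresh` are illustrative only.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First function: y = 2 * x^T x, gradient 4x
(2 * torch.dot(x, x)).backward()
first = x.grad.clone()

# Second function without a reset: the new gradient (all ones)
# is added on top of the old one, giving 4x + 1
x.sum().backward()
accumulated = x.grad.clone()

# Reset in place, then repeat: now only the gradient of x.sum() remains
x.grad.zero_()
x.sum().backward()
fresh = x.grad.clone()

print(first, accumulated, fresh)
```

Only `fresh` holds the gradient of `x.sum()` on its own; `accumulated` mixes the gradients of both functions.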

In [21]:
x

tensor([0., 1., 2., 3.], requires_grad=True)

In [22]:
x.grad.zero_()
y = x.sum()
y

tensor(6., grad_fn=<SumBackward0>)

In [23]:
y.backward()
x.grad

tensor([1., 1., 1., 1.])

## 2. Backward for Non-Scalar Variables

- Jacobian matrix: when $y$ is a vector, its derivative w.r.t. a vector $x$ is a matrix that contains the partial derivative of each component of $y$ w.r.t. each component of $x$.
- Often we do not need the full Jacobian. Instead, we want to sum up the gradients of each component of $y$ w.r.t. the full vector $x$. For example, in batch gradient descent we want to accumulate the gradients computed individually for each sample.
- Invoking ``backward`` on a non-scalar raises an error unless we tell PyTorch how to reduce the object to a scalar, by passing a ``gradient`` tensor of the same shape as $y$.
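
A minimal sketch of the last point, assuming the same `x` as in the earlier cells. It first shows that calling ``backward`` on a vector fails, then that passing a vector of ones gives the same result as reducing with `sum()` first. The names `raised`, `grad_from_vector` and `grad_from_sum` are illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# backward() on a non-scalar without a gradient argument raises RuntimeError
raised = False
try:
    (x * x).backward()
except RuntimeError:
    raised = True

# Option 1: pass a vector of ones with the same shape as y
y = x * x
y.backward(gradient=torch.ones_like(y))
grad_from_vector = x.grad.clone()

# Option 2: reduce y to a scalar first, then call backward()
x.grad.zero_()
y = x * x
y.sum().backward()
grad_from_sum = x.grad.clone()

print(raised, grad_from_vector, grad_from_sum)
```

Both options compute the gradient of $\sum_i y_i$ w.r.t. $x$, which for $y = x * x$ is $2x$.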

In [24]:
x.grad.zero_()
y = x * x
# A gradient of ones sums the gradients of all components of y
y.backward(gradient=torch.ones(len(y)))
x.grad

tensor([0., 2., 4., 6.])

## 3. Detaching Computation

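Given $z = x * y$ and $y = x * x$, if we want to focus on the direct influence of $x$ on $z$ rather than the influence conveyed via $y$, we need to detach the computation. We create a new variable $u$ that has the same value as $y$ but is treated as a constant by autograd, and compute $z = x * u$. The gradient of the sum of $z$ w.r.t. $x$ is then $u$, rather than $3x * x$.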

In [27]:
x.grad.zero_()
y = x * x
u = y.detach()  # same values as y, but cut off from the computation graph
z = u * x       # autograd treats u as a constant here

In [30]:
x

tensor([0., 1., 2., 3.], requires_grad=True)

In [31]:
y

tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>)

In [32]:
u

tensor([0., 1., 4., 9.])

In [29]:
z

tensor([ 0.,  1.,  8., 27.], grad_fn=<MulBackward0>)

In [34]:
z.sum().backward()
x.grad

tensor([0., 1., 4., 9.])
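
As a self-contained check of this section, the sketch below confirms that the gradient through the detached path equals `u`, and that detaching does not destroy the graph of `y` itself, so we can still backpropagate through `y`. The names `grad_detached` and `grad_y` are illustrative.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()   # u shares y's values but has no history
z = u * x

# Gradient of sum(z) w.r.t. x, with u treated as a constant: equals u
z.sum().backward()
grad_detached = x.grad.clone()

# y still has its own graph, so its gradient w.r.t. x is 2x
x.grad.zero_()
y.sum().backward()
grad_y = x.grad.clone()

print(u.requires_grad, grad_detached, grad_y)
```

Without the detach step, `z` would be `x * x * x` and the gradient of its sum would be `3 * x * x` instead of `u`.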