Calculating derivatives is essential to training deep networks, but it becomes tedious and error-prone when done by hand.
These problems grow as models become more complex. Fortunately, modern deep learning frameworks take over this work with _automatic differentiation_ (autograd).

As data passes through each successive function, the framework builds a computational graph that tracks how each value depends on the others.
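As a small illustration of that graph, the sketch below inspects the `grad_fn` attribute that PyTorch attaches to tensors produced by tracked operations. The tensor values are arbitrary and chosen only for this example.

```python
import torch

# A leaf tensor is created directly by the user, so it has no grad_fn.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Each operation on a tracked tensor records the function that produced it.
y = x * 3
z = y.sum()

print(x.is_leaf, x.grad_fn)  # leaf tensor: grad_fn is None
print(y.grad_fn)             # node recorded for the multiplication
print(z.grad_fn)             # node recorded for the sum

# next_functions links a node to the nodes that produced its inputs,
# which is how the graph is walked backwards from z towards x.
parent = z.grad_fn.next_functions[0][0]
print(type(parent).__name__)
```

Following `next_functions` from the output back to the leaves is, in essence, the path the backward pass takes.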

Automatic differentiation then applies the chain rule _backwards_ through this graph. This algorithm is called _backpropagation_.

While backpropagation has become the default method for computing gradients, it is not the only option.
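One simple alternative, shown here only for comparison, is numerical differentiation by central finite differences. The helper `numerical_grad` and the step size `eps` are assumptions made for this sketch and are not part of PyTorch. The approach needs two function evaluations per input component, which is one reason it does not scale to large models.

```python
import torch

def f(x):
    return 2 * torch.dot(x, x)

def numerical_grad(func, x, eps=1e-4):
    """Central finite-difference estimate of the gradient of a scalar function.

    Hypothetical helper for illustration; x is assumed to be a 1-D tensor.
    """
    grad = torch.zeros_like(x)
    for i in range(x.numel()):
        step = torch.zeros_like(x)
        step[i] = eps
        grad[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

# float64 keeps the rounding error of the finite differences small.
x = torch.arange(4.0, dtype=torch.float64)
approx = numerical_grad(f, x)

# Reference gradient from backpropagation.
x_ad = x.clone().requires_grad_(True)
f(x_ad).backward()

print(approx)
print(x_ad.grad)
print(torch.allclose(approx, x_ad.grad, atol=1e-6))
```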

In [1]:
import torch
import numpy as np

## A simple function

We differentiate the function $y = 2\textbf{x}^{\top}\textbf{x}$ with respect to the column vector $\textbf{x}$.

First, assign an initial value to $\textbf{x}$.

In [2]:
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

Before computing the gradient of $y$ with respect to $\textbf{x}$, we need a place to store it. Storing it on the tensor itself avoids allocating new memory every time we take a derivative. The `requires_grad_` method tells PyTorch to track every operation on $\textbf{x}$ so that gradients can be computed later.

The gradient of a scalar-valued function with respect to a vector $\textbf{x}$ is vector-valued with the same shape as $\textbf{x}$. 

In [3]:
# alternative is x = torch.arange(4.0, requires_grad=True)
x.requires_grad_(True)
x.grad # None by default

Compute the function of $x$ and assign it to $y$.

In [4]:
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

Take the gradient of $y$ with respect to $\textbf{x}$ using the `backward` method.

In [5]:
y.backward()

Access the gradient using the `grad` attribute.

In [6]:
x.grad

tensor([ 0.,  4.,  8., 12.])

We know that the gradient of $y = 2\textbf{x}^{\top}\textbf{x}$ with respect to $\textbf{x}$ is $4\textbf{x}$.
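For reference, this follows from writing the dot product component-wise and differentiating each term:

$$
y = 2\textbf{x}^{\top}\textbf{x} = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4x_i
\quad\Longrightarrow\quad
\nabla_{\textbf{x}}\, y = 4\textbf{x}
$$

With $\textbf{x} = (0, 1, 2, 3)^{\top}$ this gives $(0, 4, 8, 12)^{\top}$, which matches the values stored in `x.grad`.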

Verify that the automatically computed gradient matches the expected result.

In [7]:
x.grad == 4 * x

tensor([True, True, True, True])

PyTorch does not automatically reset the gradient buffer when a new gradient is computed. Instead, the new gradient is added to the one already stored.
This behavior is convenient when optimizing the sum of multiple objective functions.
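The accumulation can be observed directly. The following minimal sketch runs two backward passes on a fresh tensor without resetting the buffer in between:

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: the gradient of 2 * x.x is 4x.
(2 * torch.dot(x, x)).backward()
first = x.grad.clone()

# Second backward pass without zeroing: the gradient of x.sum() (all ones)
# is added to the values already stored in x.grad.
x.sum().backward()
accumulated = x.grad.clone()

print(first)        # tensor([ 0.,  4.,  8., 12.])
print(accumulated)  # tensor([ 1.,  5.,  9., 13.])
```

The second result is the sum of both gradients, $4\textbf{x} + \textbf{1}$, not the gradient of the second function alone.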

When computing the gradient of a different function of $\textbf{x}$, first call the `grad.zero_` method to reset the gradient buffer.

In [8]:
x.grad.zero_() # resets the gradient
y = x.sum() # compute the function of x
y.backward() # take the gradient of y wrt x
x.grad # access the gradient

tensor([1., 1., 1., 1.])

Since $y = \sum_i x_i$, each term contributes only $x_i$, so its derivative is $\displaystyle \frac{dy}{dx_i} = 1$.

## Backward for non-scalar variables

When $y$ is a vector, the most natural representation of $\displaystyle \frac{dy}{dx}$ is a matrix called the _Jacobian_.

A Jacobian contains the partial derivatives of each component of $y$ with respect to each component of $x$. For higher-order $y$ and $x$, the result of differentiation is an even higher-order tensor.

$$
J =
\begin{pmatrix}
\frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{pmatrix}
$$

In practice we rarely need the full Jacobian. More often we want to sum up the gradients of each component of $\textbf{y}$ with respect to the full vector $\textbf{x}$, yielding a vector of the same shape as $\textbf{x}$.

When a vector $\textbf{v}$ is passed as the `gradient` argument of `y.backward(gradient)`, `x.grad` holds not $J$ itself but the vector-Jacobian product $\textbf{v}^{\top} J$.
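To make the vector-Jacobian product concrete, the sketch below builds the full Jacobian with `torch.autograd.functional.jacobian` and compares $\textbf{v}^{\top} J$ with what `backward` stores in `x.grad`. The function `square` and the values in `v` are arbitrary choices for illustration.

```python
import torch
from torch.autograd.functional import jacobian

def square(x):
    return x * x

x = torch.arange(4.0, requires_grad=True)
v = torch.tensor([1.0, 0.5, 2.0, 1.0])  # arbitrary weights for the example

# Full Jacobian of the elementwise square: a 4x4 diagonal matrix diag(2x).
J = jacobian(square, x.detach())

# backward with gradient=v accumulates v^T J into x.grad
# without ever materializing J.
y = square(x)
y.backward(gradient=v)

print(J)
print(x.grad)  # tensor([0., 1., 8., 6.])
print(torch.allclose(x.grad, v @ J))
```

Building $J$ explicitly costs memory proportional to the product of the output and input sizes, which is why `backward` only ever computes the product.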

For instance, we often have a vector holding the value of the loss function calculated separately for each example in a _batch_ of training examples. In that case, we want to sum up the gradients computed individually for each example.
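A minimal sketch of this batch case follows. The data `X`, the `targets`, and the linear model with weights `w` are hypothetical and serve only to produce a per-example loss vector.

```python
import torch

# Hypothetical toy batch: 3 examples with 2 features each.
X = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
targets = torch.tensor([1.0, 2.0, 3.0])
w = torch.zeros(2, requires_grad=True)

# One squared-error loss value per example, shape (3,).
per_example_loss = (X @ w - targets) ** 2

# Reducing with sum() and calling backward gives the summed gradient.
per_example_loss.sum().backward()
grad_sum = w.grad.clone()

# The same result, obtained by accumulating one example at a time.
w.grad.zero_()
for i in range(X.shape[0]):
    loss_i = (X[i] @ w - targets[i]) ** 2
    loss_i.backward()
grad_accumulated = w.grad.clone()

print(grad_sum)          # tensor([-44., -56.])
print(grad_accumulated)  # tensor([-44., -56.])
```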

Deep learning frameworks vary in how they interpret gradients of non-scalar tensors.

In PyTorch, calling the `backward` method on a non-scalar produces an error unless we tell it how to reduce the object to a scalar. So, we provide some vector $\textbf{v}$ such that `backward` will compute $\textbf{v}^{\top} \partial_{\textbf{x}}\textbf{y}$ instead of $\partial_{\textbf{x}}\textbf{y}$. Yang Zhang's [Medium post](https://zhang-yang.medium.com/the-gradient-argument-in-pytorchs-backward-function-explained-by-examples-68f266950c29) explains this for each case.

In [9]:
x.grad.zero_() # resets the gradient
y = x * x # computes the function of x
v = torch.ones(len(y))
y.backward(gradient=v)  # Faster: y.sum().backward()
x.grad

tensor([0., 2., 4., 6.])
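The error mentioned above can be reproduced by leaving out the `gradient` argument. The sketch below catches it and then applies the usual fix of reducing to a scalar first.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x  # non-scalar output

try:
    y.backward()  # no gradient argument on a non-scalar tensor
    raised = False
except RuntimeError as err:
    raised = True
    print("RuntimeError:", err)

# Reducing to a scalar first makes the call valid.
y = x * x
y.sum().backward()
print(x.grad)  # tensor([0., 2., 4., 6.])
```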

## Detaching computations

Given the following equations:
- $z = x * y$
- $y = x * x$

Manually substituting $y$ into $z$ gives:
$$
z = x * x^2 = x^3
$$

Computing the derivative of $z$ with respect to $x$ gives:
$$
\frac{dz}{dx} = 3x^2
$$
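As a baseline before any detaching, the following sketch confirms that autograd reproduces $3x^2$ when both factors stay in the graph:

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
z = x * y  # z = x^3, with y still attached to the graph

z.sum().backward()
print(x.grad)  # tensor([ 0.,  3., 12., 27.])
print(torch.equal(x.grad, 3 * x.detach() ** 2))
```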


Now we introduce a variable $u = y$ that is detached from the computational graph. That is, $u$ holds the same value as $y$, but PyTorch no longer tracks how it was computed from $x$. Its history is removed, so no gradient flows through $u$ back to $x$.
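The effect of detaching can be checked on the tensor attributes themselves. This short sketch compares `y` with its detached copy:

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()

print(y.requires_grad, y.grad_fn is not None)  # True True
print(u.requires_grad, u.grad_fn)              # False None
print(torch.equal(u, y))                       # True: same values
```

The detached tensor carries the same values but has no `grad_fn`, so the backward pass treats it as a constant.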


Since $u$ is now treated as a constant, the function $z$ becomes:
$$
z = x * u
$$

Differentiating $z$ with respect to $x$ now gives:
$$
\frac{dz}{dx} = u = x^2
$$

In [10]:
x.grad.zero_() # resets the gradient
y = x * x
u = y.detach() # detaches y
z = u * x # computes the function of x

z.sum().backward()  # takes the gradient of z wrt x
x.grad

tensor([0., 1., 4., 9.])

In [11]:
x.grad == 2 * x

tensor([ True, False,  True, False])
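The comparison with $2x$ above matches only where $x^2$ happens to equal $2x$, namely at $x = 0$ and $x = 2$. A more direct check, sketched below on a fresh tensor, compares the gradient with $u$ itself and with the full derivative $3x^2$:

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()  # treated as a constant in the backward pass
z = u * x

z.sum().backward()
print(x.grad == u)                    # tensor([True, True, True, True])
print(x.grad == 3 * x.detach() ** 2)  # tensor([ True, False, False, False])
```

The gradient equals $u = x^2$ everywhere and agrees with $3x^2$ only at $x = 0$.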

Detaching $y$ cuts its ancestors out of the graph that leads to $z$, but the graph that produced $y$ itself is still intact. This allows us to compute $\nabla_x y$ afterwards.

In [12]:
x.grad.zero_() # resets the gradient
y.sum().backward() # takes the gradient of y wrt x
x.grad

tensor([0., 2., 4., 6.])

In [13]:
x.grad == 2 * x

tensor([True, True, True, True])

## Gradients and Python control flow

A benefit of automatic differentiation is that it can compute gradients through complex Python control flow such as conditionals, loops, and function calls.

In [14]:
def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

In [15]:
a = torch.randn(size=(), requires_grad=True)
d = f(a)

In [16]:
d.backward()

In [17]:
a.grad

tensor(8192.)

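$f$ is a piecewise linear function of its input: for any given $a$, $f(a) = k \cdot a$ for some constant scale $k$ that depends on which branch and how many loop iterations were taken. Dividing $f(a)$ by $a$ therefore recovers $k$, which is exactly the gradient at that point. This is why `a.grad` must match $\frac{f(a)}{a}$.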

In [18]:
a.grad == d / a

tensor(True)
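To see how the scale depends on the input, the sketch below repeats the check with two fixed inputs instead of a random one. The helper `slope` is hypothetical and introduced only for this example; `f` is repeated so the block runs on its own.

```python
import torch

def f(a):
    b = a * 2
    while b.norm() < 1000:
        b = b * 2
    if b.sum() > 0:
        c = b
    else:
        c = 100 * b
    return c

def slope(value):
    """Return (autograd gradient, f(a) / a) at a fixed scalar input."""
    a = torch.tensor(value, requires_grad=True)
    d = f(a)
    d.backward()
    return a.grad.item(), (d / a).item()

print(slope(3.0))   # (512.0, 512.0): 3 is doubled 9 times, positive branch
print(slope(-3.0))  # (51200.0, 51200.0): same doubling, then the factor 100
```

Each input selects its own path through the loop and the conditional, and autograd differentiates whichever path was actually executed.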

Dynamic control flow is commonplace in deep learning. For instance, in text processing, the computational graph depends on the length of the input. In such cases the gradient cannot be written down in advance as a single fixed expression, so automatic differentiation becomes essential.
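As a toy illustration of a graph whose size depends on the input length, the sketch below applies a simple recurrence $h_t = w \cdot h_{t-1} + s_t$ to sequences of different lengths. The function `final_state` and the sequences are assumptions made for this example and do not come from a real text model.

```python
import torch

def final_state(w, seq):
    """Apply h = w * h + s once per element; graph depth equals len(seq)."""
    h = torch.zeros(())
    for s in seq:
        h = w * h + s
    return h

w = torch.tensor(0.5, requires_grad=True)

# Length 2: h = w + 2, so dh/dw = 1.
final_state(w, [1.0, 2.0]).backward()
grad_len2 = w.grad.item()

# Length 3: h = w^2 + 2w + 3, so dh/dw = 2w + 2 = 3 at w = 0.5.
w.grad.zero_()
final_state(w, [1.0, 2.0, 3.0]).backward()
grad_len3 = w.grad.item()

print(grad_len2, grad_len3)  # 1.0 3.0
```

The expression for the gradient changes with the sequence length, yet the same code handles both cases because the graph is rebuilt on every forward pass.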

## Discussion

Remember these basics:
1. Attach gradients to those variables with respect to which we desire derivatives.
2. Record the computation of the target value.
3. Execute the backpropagation function.
4. Access the resulting gradient.

In [19]:
x = torch.arange(4.0)
x.requires_grad_(True) # step 1
y = 2 * torch.dot(x, x) # step 2
y.backward() # step 3
x.grad # step 4

tensor([ 0.,  4.,  8., 12.])

## Exercises