# Automatic Differentiation

In [1]:
import torch

## A Simple Example

As a toy example, say that we are interested
in differentiating the function
$y = 2\mathbf{x}^{\top}\mathbf{x}$
with respect to the column vector $\mathbf{x}$.
To start, let us create the variable `x` and assign it an initial value.

In [2]:
x = torch.arange(4.0, requires_grad=True)

Note the `requires_grad=True` argument when creating `x`: it tells the framework
that we will need to allocate space for the gradient of `x` later.
Until a gradient has been computed, `x.grad` is `None`.

In [3]:
print(x.grad) # Is set to None until backward is called

None


In [4]:
y = 2 * torch.dot(x, x)
y

tensor(28., grad_fn=<MulBackward0>)

Since `x` is a tensor of length 4,
the `dot` operator will perform an inner product of `x` and `x`,
yielding the scalar output that we assign to `y`.
Next, we can automatically calculate the gradient of `y`
with respect to each component of `x`
by calling `y`'s `backward` function.

In [5]:
y.backward()

If we recheck the value of `x.grad`, we will find that `None` has been replaced by the newly calculated gradient.

In [6]:
x.grad

tensor([ 0.,  4.,  8., 12.])

The gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to $\mathbf{x}$ should be $4\mathbf{x}$. Let us quickly verify that our desired gradient was calculated correctly. If the two tensors are indeed the same, then the equality between them holds at every position.

In [7]:
x.grad == 4 * x

tensor([True, True, True, True])
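
For reference, the expected gradient can be derived by hand by writing the inner product as a sum over components (here $x_i$ denotes the $i$-th component of $\mathbf{x}$, a notation introduced only for this derivation):

$$
y = 2\mathbf{x}^{\top}\mathbf{x} = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4x_i
\quad\Longrightarrow\quad
\nabla_{\mathbf{x}}\, y = 4\mathbf{x}.
$$

For $\mathbf{x} = [0, 1, 2, 3]^{\top}$ this gives $y = 28$ and $\nabla_{\mathbf{x}}\, y = [0, 4, 8, 12]^{\top}$, matching the values printed above.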

If we subsequently compute the gradient of another variable whose value was calculated as a function of `x`, we need to clear the previous values in `x.grad` first, because PyTorch accumulates gradients by default: each `backward` call adds its result to whatever `x.grad` already holds.

In [8]:
x.grad.zero_() # Clean the gradient 
y = x.sum()
y.backward()
x.grad

tensor([1., 1., 1., 1.])
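
To see why the reset matters, the following minimal sketch calls `backward` twice on the same kind of expression without zeroing in between. The names `first` and `accumulated` are introduced here only for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: x.grad holds the gradient of x.sum(), a vector of ones.
x.sum().backward()
first = x.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
x.sum().backward()
accumulated = x.grad.clone()

print(first)        # tensor([1., 1., 1., 1.])
print(accumulated)  # tensor([2., 2., 2., 2.])
```

Calling `x.grad.zero_()` between the two passes would leave `x.grad` at ones after the second call.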

## Backward for Non-Scalar Variables

PyTorch can implicitly create the gradient only for scalar outputs. Calling `backward` without arguments on a non-scalar tensor raises an error:

In [9]:
x.grad.zero_() # Clean the gradient
y = x * x
y.backward()
x.grad

RuntimeError: grad can be implicitly created only for scalar outputs

However, you can pass a head gradient (the `gradient` argument) to call `backward` on a non-scalar output:

In [10]:
x.grad.zero_() # Clean the gradient
y = x * x
y.backward(gradient=torch.ones_like(x))
x.grad, y

(tensor([0., 2., 4., 6.]), tensor([0., 1., 4., 9.], grad_fn=<MulBackward0>))

With a head gradient of all ones, this is the same as summing the elements of `y` and calling `backward` on the result, without passing a head gradient:

In [11]:
x.grad.zero_() # Clean the gradient
y = sum(x * x)
y.backward()
x.grad, y

(tensor([0., 2., 4., 6.]), tensor(14., grad_fn=<AddBackward0>))
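
More generally, the `gradient` argument acts as a weight vector: `y.backward(gradient=v)` leaves the same result in `x.grad` as differentiating the scalar `(v * y).sum()`. The sketch below checks this with a non-uniform `v`; the name `v` and its values are chosen here purely for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
v = torch.tensor([1.0, 2.0, 3.0, 4.0])  # illustrative weights, not from the text

# Route 1: pass v as the head gradient.
y = x * x
y.backward(gradient=v)
grad_head = x.grad.clone()

# Route 2: reduce to a scalar with the same weights, then call backward().
x.grad.zero_()
y = x * x
(v * y).sum().backward()
grad_weighted_sum = x.grad.clone()

print(grad_head)          # v * 2x
print(grad_weighted_sum)  # same values
```

Both routes give `v * 2 * x`, which reduces to the all-ones case above when `v = torch.ones_like(x)`.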

## Detaching Computation

Sometimes, we wish to move some calculations
outside of the recorded computational graph.
For example, say that `y` was calculated as a function of `x`,
and that subsequently `z` was calculated as a function of both `y` and `x`.
Now, imagine that we wanted to calculate
the gradient of `z` with respect to `x`,
but wanted for some reason to treat `y` as a constant,
and only take into account the role
that `x` played after `y` was calculated.

Here, we can call `u = y.detach()` to return a new variable `u`
that has the same value as `y` but discards any information
about how `y` was computed in the computational graph.
In other words, the gradient will not flow backwards through `u` to `x`,
so `u` is treated as a constant in any `backward` call.
Thus, the following `backward` function computes
the partial derivative of `z = u * x` with respect to `x` while treating `u` as a constant,
instead of the partial derivative of `z = x * x * x` with respect to `x`.

In [12]:
x.grad.zero_() # Clean the gradient
y = x * x
u = y.detach() # Treat u as a constant
z = u * x

z.sum().backward() # dz/dx = u (We sum because u is not scalar)
x.grad == u

tensor([True, True, True, True])
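
To make the effect of `detach` concrete, the following sketch computes the gradient of `x * x * x` twice, once through the full graph and once with the `x * x` factor detached. The names `grad_full` and `grad_detached` are introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Full graph: z = x^3, so dz/dx = 3 * x^2.
z_full = (x * x) * x
z_full.sum().backward()
grad_full = x.grad.clone()

# Detached factor: u is a constant, so dz/dx = u = x^2.
x.grad.zero_()
u = (x * x).detach()
z_detached = u * x
z_detached.sum().backward()
grad_detached = x.grad.clone()

print(u.requires_grad)  # False: u carries no graph history
print(grad_full)        # 3 * x^2
print(grad_detached)    # x^2
```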

Since the computation of `y` was recorded,
we can subsequently call `backward` on `y` (summed first, because `y` is not a scalar) to get the derivative of `y = x * x` with respect to `x`, which is `2 * x`.

In [13]:
x.grad.zero_() # Clean the gradient
y.sum().backward() # Compute dy/dx = 2x (We sum because y is not scalar value)
x.grad == 2 * x

tensor([True, True, True, True])

## Computing the Gradient of Python Control Flow

One benefit of using automatic differentiation
is that even if building the computational graph of a function
required passing through a maze of Python control flow
(e.g., conditionals, loops, and arbitrary function calls),
we can still calculate the gradient of the resulting variable.
In the following snippet, note that
the number of iterations of the `while` loop
and the evaluation of the `if` statement
both depend on the value of the input `a`.

In [14]:
def f(a):
    b = a * 2
    while b.norm().item() < 1000:
        b = b * 2
    if b.sum().item() > 0:
        c = b
    else:
        c = 100 * b
    return c

Again, to compute gradients, we just need to record the calculation and then call the `backward` function.

In [15]:
a = torch.randn(size=(1, ), requires_grad=True)
d = f(a) # f applies a chain of multiplications to a, so d = k * a
d.backward() # computes dd/da = k, which should equal d / a

We can now analyze the `f` function defined above.
Note that it is piecewise linear in its input `a`.
In other words, for any `a` there exists some constant scalar `k`
such that `f(a) = k * a`, where the value of `k` depends on the input `a`.
Consequently `d / a` allows us to verify that the gradient is correct.

In [16]:
a.grad == d/a

tensor([True])
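
Because `a` is random, the value of `k` changes from run to run. The sketch below repeats the check with two fixed inputs so that the dependence of `k` on `a` is visible; `slope_at` is a hypothetical helper written for this illustration, and the inputs `3.0` and `-3.0` are arbitrary choices.

```python
import torch

def f(a):
    b = a * 2
    while b.norm().item() < 1000:
        b = b * 2
    if b.sum().item() > 0:
        c = b
    else:
        c = 100 * b
    return c

def slope_at(value):
    """Hypothetical helper: return df/da evaluated at a = value."""
    a = torch.tensor([value], requires_grad=True)
    f(a).backward()
    return a.grad.item()

k_pos = slope_at(3.0)   # loop doubles 3 up to 1536 = 512 * 3, `if` branch taken
k_neg = slope_at(-3.0)  # same doubling, then the `else` branch multiplies by 100

print(k_pos, k_neg)  # 512.0 51200.0
```

The two inputs have the same magnitude, so the loop runs the same number of times; only the sign changes which branch is taken, and with it the slope.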

## Training Mode and Prediction Mode

When we get to complicated deep learning models,
we will encounter some algorithms where the model
behaves differently during training and
when we subsequently use it to make predictions.

In PyTorch, we can call `model.train()` or `model.eval()` to switch between
training mode and prediction mode, which we'll cover in later sections of
the book.
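
As a small preview, the sketch below uses `torch.nn.Dropout` as an example of a layer whose behavior depends on the mode. Dropout is not introduced in this section, so the choice of layer is an assumption made here for illustration only.

```python
import torch
from torch import nn

layer = nn.Dropout(p=0.5)
inputs = torch.ones(8)

layer.train()
train_out = layer(inputs)  # random: each entry is either 0 or 1 / (1 - p) = 2

layer.eval()
eval_out = layer(inputs)   # dropout is disabled: output equals input

print(train_out)
print(eval_out)
```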

## Summary

* Deep learning frameworks can automate the calculation of derivatives. To use it, we first attach gradients to those variables with respect to which we desire partial derivatives. We then record the computation of our target value, execute its `backward` function, and access the resulting gradient via our variable's `grad` attribute.
* We can detach gradients to control the part of the computation that will be used in the `backward` function.
* The running modes include training mode and prediction mode.

## Exercises

1. Why is the second derivative much more expensive to compute than the first derivative?

The second derivative is much more expensive because it is built on top of the first derivative: we first need to build the computational graph of the first derivative and then differentiate through that graph, which is generally larger than the graph of the original function.
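
A minimal sketch of this two-stage process uses `torch.autograd.grad` with `create_graph=True`, which records the first backward pass so that it can itself be differentiated. The function `x ** 3` and the point `x = 3` are arbitrary choices made for illustration.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 3

# create_graph=True records the backward pass itself so it can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)

# Differentiate the first derivative to obtain the second derivative.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x)

print(dy_dx.item(), d2y_dx2.item())  # analytically 3x^2 = 27 and 6x = 18
```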

2. After running `y.backward()`, immediately run it again and see what happens.

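Calling `backward` a second time on the same graph typically raises a `RuntimeError`, because PyTorch frees the values saved in the graph after the first backward pass. To call it again, we need to pass `retain_graph=True` to the first call; the second call then computes the same derivative.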

In [17]:
x = torch.arange(4.0, requires_grad=True)
y = x * x
y.sum().backward(retain_graph=True)
print(x.grad)
x.grad.zero_()
y.sum().backward()
print(x.grad)

tensor([0., 2., 4., 6.])
tensor([0., 2., 4., 6.])
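
For comparison, the following sketch shows what happens without `retain_graph=True`. The flag `second_call_failed` is a name introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = (x * x).sum()

y.backward()  # first call succeeds and frees the saved tensors of the graph

try:
    y.backward()  # second call needs the saved tensors that were just freed
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print(type(err).__name__)

print(second_call_failed)
```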


3. In the control flow example where we calculate the derivative of `d` with respect to `a`, what would happen if we changed the variable `a` to a random vector or matrix? At this point, the result of the calculation `f(a)` is no longer a scalar. What happens to the result? How do we analyze this?

The result is the constant `k` replicated in a tensor with the same shape as `a`. Because `f(a) = k * a`, the elementwise derivative is $\frac{df}{da} = k = \frac{f(a)}{a}$. Since `d` is no longer a scalar, we call `backward` on `d.sum()`.

In [18]:
a = torch.randn(size=(2,3), requires_grad=True)
d = f(a) # d = k * a
d.sum().backward() # d is not a scalar, so we backpropagate from its sum
                    # dd/da = k, hence d / a = k

In [19]:
a, d, a.grad, d/a

(tensor([[-0.0519, -0.5885,  0.0626],
         [ 0.1675,  1.7026,  1.4531]], requires_grad=True),
 tensor([[ -26.5704, -301.3363,   32.0743],
         [  85.7696,  871.7515,  743.9756]], grad_fn=<MulBackward0>),
 tensor([[512., 512., 512.],
         [512., 512., 512.]]),
 tensor([[512., 512., 512.],
         [512., 512., 512.]], grad_fn=<DivBackward0>))

In [20]:
a.grad == d/a

tensor([[True, True, True],
        [True, True, True]])

4. Redesign an example of finding the gradient of the control flow. Run and analyze the result.

In [21]:
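def f(a):
    # Start from a itself so that d is always defined, even when a <= 100
    d = a / 10 if a.item() > 100 else a
    while d.item() > 1:
        d = d / 10 # out-of-place, so it is also safe when d is the leaf a
    return d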

In [22]:
a = torch.tensor(200, dtype=torch.float, requires_grad=True)
d = f(a) # d = k * a, so d / a = k
d.backward()

In [23]:
d, a, a.grad, d/a

(tensor(0.2000, grad_fn=<DivBackward0>),
 tensor(200., requires_grad=True),
 tensor(0.0010),
 tensor(0.0010, grad_fn=<DivBackward0>))

In [24]:
a.grad == d/a

tensor(False)
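
The `False` above does not indicate a wrong gradient. Both values print as `0.0010`, and mathematically `a.grad` and `d / a` are the same number, but they are obtained through different sequences of floating-point operations, so they can differ in the last bits. A tolerance-based comparison such as `torch.isclose` is a more robust check. The sketch below reproduces the same three divisions by 10 directly, without the control flow; the names `exact_match` and `close_match` are introduced here for illustration.

```python
import torch

a = torch.tensor(200.0, requires_grad=True)
d = a / 10 / 10 / 10  # the same three divisions that f performs for a = 200
d.backward()

exact_match = (a.grad == d / a).item()             # may be False due to rounding
close_match = torch.isclose(a.grad, d / a).item()  # compares within a tolerance

print(exact_match, close_match)
```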

5. Let $f(x) = \sin(x)$. Plot $f(x)$ and $\frac{df(x)}{dx}$, where the latter is computed without exploiting that $f'(x) = \cos(x)$.

In [25]:
f = lambda x: torch.sin(x)
df = lambda x: torch.cos(x)
x = torch.linspace(-5, 5, steps = 200, requires_grad=True)
sin = f(x)
sin.sum().backward()
cos = df(x)
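
Before plotting, the two curves can also be compared numerically. The following check is a small sketch that recomputes the gradient and measures the largest pointwise difference from `torch.cos`; the name `max_abs_error` is introduced here for illustration.

```python
import torch

x = torch.linspace(-5, 5, steps=200, requires_grad=True)

# Autograd derivative of sin(x), obtained without using cos(x) explicitly.
torch.sin(x).sum().backward()

# Largest pointwise gap between the autograd result and the analytic derivative.
max_abs_error = (x.grad - torch.cos(x.detach())).abs().max().item()

print(max_abs_error)
```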

In [26]:
import matplotlib.pyplot as plt
%matplotlib widget

fig = plt.figure(figsize=(8, 5))
plt.subplot(3, 1, 1)
plt.plot(x.detach(), sin.detach())
plt.title("sin(x)")

plt.subplot(3, 1, 2)
plt.plot(x.detach(), cos.detach())
plt.title("cos(x)")

plt.subplot(3, 1, 3)
plt.plot(x.detach(), x.grad)
plt.title("cos(x) autograd with pytorch of sin(x)")

plt.tight_layout()
plt.show()
