<div>
<img src="https://discuss.pytorch.org/uploads/default/original/2X/3/35226d9fbc661ced1c5d17e374638389178c3176.png" width="400" style="margin: 50px auto; display: block; position: relative; left: -30px;" />
</div>

<!--NAVIGATION-->
# < [Basics](1-Basics.ipynb) | Autograd | [Optimization](3-Optimization.ipynb) >

### Automatic differentiation

Automatic differentiation (autodiff) is a key feature of PyTorch.
PyTorch can differentiate the outcome of a computation built from differentiable operations with respect to its inputs, so you don't need to derive the gradients yourself. This allows you to express and optimize complex models without worrying about differentiating them correctly.

We will start with PyTorch's `.backward()` method, which does everything automatically for you. We then discuss a little bit of the math behind autodiff and have a quick look under the hood to see how PyTorch does its magic. Finally, we cover a few advanced topics.

### Table of Contents

#### 1. [Usage in PyTorch](#Usage-in-PyTorch)
#### 2. [Differentiation fundamentals](#Differentiation-fundamentals)
#### 3. [Advanced topics](#Advanced-topics)

In [None]:
import torch
torch.__version__

---
# Usage in PyTorch

Let's start with a function that is easy to differentiate: $f(x) = 3 x^2 + 4$. It is easy to see that the derivative $\frac{d}{dx}f(x)$ equals $6 \cdot x$. PyTorch can compute this for us:

In [None]:
# Initialize x with some value
x = torch.tensor(2.0, requires_grad=True)
print("x = {}".format(x))

# Execute f(x)
y = 3 * x**2  + 4
print("y = {}".format(y))

# Compute the gradient of y with respect to all variables that have 'requires_grad' turned on
y.backward()

# Checking the result
computed_gradient = x.grad
print("PyTorch computed the gradient {}".format(computed_gradient))
print("We would expect it to be {}".format(6 * x))
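
The same call works when the result depends on several tensors. As a small sketch (the names `a`, `b`, `c` and `z` are made up for illustration), consider $z = a \cdot b + c$, where only `a` and `b` have `requires_grad` turned on:

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(1.0)  # no requires_grad: treated as a constant

z = a * b + c
z.backward()

print("dz/da = {}".format(a.grad))    # equals the value of b
print("dz/db = {}".format(b.grad))    # equals the value of a
print("c.grad = {}".format(c.grad))   # None: no gradient was requested for c
```

Only the tensors that asked for a gradient get one; `c.grad` stays `None`.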

---

# Differentiation fundamentals

While this was easy enough to do by ourselves, differentiating more complex expressions takes time and is prone to human error. Computers are much better at doing this correctly :)

### Chain rule

To differentiate an expression like $3x^2 + 4$ in a very methodical way, you can interpret it as a sequence of basic operations, something like

$$y = f(x) = \text{plus4}(\text{times3}(\text{square}(x))).$$

The gradient of $y$ with respect to $x$ can now be computed systematically using the chain rule:

$$f'(x) = \text{plus4}'(\cdots) \, \text{times3}'(\cdots) \, \text{square}'(x).$$
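
For our example, each factor is easy to write down. Writing $u = \text{square}(x) = x^2$ and $v = \text{times3}(u) = 3u$ (the names $u$ and $v$ are introduced here only for illustration), the individual derivatives are

$$\text{square}'(x) = 2x, \qquad \text{times3}'(u) = 3, \qquad \text{plus4}'(v) = 1,$$

and multiplying them together recovers the derivative we found by hand:

$$f'(x) = 1 \cdot 3 \cdot 2x = 6x.$$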

### Computation graph

When you indicate with `requires_grad` that you will need a gradient of some output w.r.t. some input `x`, PyTorch will track any computations that are based on `x`. This ‘history’ is called the **computation graph**. For our simple polynomial, it would look like this:

![forward pass](figures/computation-graph-1.svg)

You can explore how PyTorch keeps track of history by inspecting the `tensor.grad_fn` attribute:


In [None]:
print(y.grad_fn)
print(y.grad_fn.next_functions[0][0])
print(y.grad_fn.next_functions[0][0].next_functions[0][0])

Each value has a `grad_fn` corresponding to the operation that produced the value. 
Each operation's `grad_fn` points to its inputs through `next_functions`.
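For each input, `next_functions` contains a tuple of the input's `grad_fn` and an index that identifies the relevant output when the producing operation had multiple outputs. Inputs that don't need a gradient, such as constants, show up with `None` in place of a `grad_fn`.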

In [None]:
# In our example, the final `add` operation has two inputs:
# - The first is the output of `multiplication`.
# - The second is a constant `4` for which we don't require a gradient.
y.grad_fn.next_functions
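
Following `next_functions` by hand gets tedious for larger graphs. Here is a minimal sketch of a recursive walk over the graph; `walk_graph` is a hypothetical helper written for this tutorial, not part of PyTorch, and the exact node names it prints may differ between PyTorch versions.

```python
import torch

def walk_graph(fn, depth=0):
    """Collect (depth, node name) pairs for the graph rooted at `fn`."""
    if fn is None:  # inputs without gradient tracking, e.g. constants
        return []
    rows = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        rows.extend(walk_graph(next_fn, depth + 1))
    return rows

x = torch.tensor(2.0, requires_grad=True)
y = 3 * x**2 + 4

graph_rows = walk_graph(y.grad_fn)
for depth, name in graph_rows:
    print("  " * depth + name)
```

The walk ends at a node that accumulates the gradient into `x.grad`, which is how the leaf `x` appears in the graph.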

### Back-propagation

After you have computed the output $y=3x^2+4$, you can call `y.backward()` to compute the gradients of $y$ with respect to all inputs that have `requires_grad=True`. You can see how the chain rule is naturally executed backwards:

![backward pass](figures/computation-graph-2.svg)
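
To make the backward pass concrete, the sketch below splits the polynomial into its three steps and asks for the gradient of `y` with respect to each intermediate value using `torch.autograd.grad`. The names `a` and `b` for the intermediate results are assumptions made for this example.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
a = x ** 2   # square
b = 3 * a    # times3
y = b + 4    # plus4

# Gradients of y with respect to each step, from the output back to the input
dy_db, dy_da, dy_dx = torch.autograd.grad(y, (b, a, x))

print("dy/db = {}".format(dy_db))  # derivative of plus4
print("dy/da = {}".format(dy_da))  # plus4' * times3'
print("dy/dx = {}".format(dy_dx))  # plus4' * times3' * square'
```

Each gradient is the previous one multiplied by the local derivative of the next operation, which is exactly the chain rule applied from the output back to the input.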

### Clearing old gradients

Let's revisit the previous basic example, but call `.backward()` multiple times:

In [None]:
# Initialize x with some value
x = torch.tensor(2.0, requires_grad=True)
y = 3 * x**2  + 4

y.backward()
print("PyTorch computed the gradient {}".format(x.grad))

# Pitfall: x.grad still holds the first gradient, so this backward() adds to it
y = 3 * x**2  + 4
y.backward()
print("PyTorch computed the gradient {}".format(x.grad))

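You can see that the second time, the computed gradient is twice as large as it should be. This is because `.backward()` __accumulates__ gradients: it adds the new gradient to whatever is already stored in `x.grad`. If you want fresh values, you need to reset `x.grad` (set it to `None` or zero it) before you call `.backward()`.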

In [None]:
# Initialize x with some value
x = torch.tensor(2.0, requires_grad=True)
y = 3 * x**2  + 4

y.backward()
print("PyTorch computed the gradient {}".format(x.grad))

x.grad = None  # FIX: Delete the previously computed gradient
y = 3 * x**2  + 4
y.backward()
print("PyTorch computed the gradient {}".format(x.grad))
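
As an alternative to assigning `None`, you can zero the stored gradient in place with `x.grad.zero_()`. The loop below is a small sketch of that pattern; the list `grads` is only there to record what each iteration computed.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

grads = []
for step in range(3):
    y = 3 * x**2 + 4
    y.backward()
    grads.append(x.grad.item())
    x.grad.zero_()  # reset in place so the next backward() starts from zero

print("Gradients per step: {}".format(grads))
```

Because the gradient is reset at the end of every iteration, each step reports the same value instead of a growing sum.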

---

# Advanced topics

### Skipping history tracking with `torch.no_grad()`

*Advanced*

After you have trained a model, you often just want to use it without computing gradients.
Building a computation graph for every operation would be wasteful if you don't need it.
Therefore, you can skip history tracking by wrapping your code in a `with torch.no_grad():` context.

In [None]:
x = torch.randn(3, requires_grad=True)
print("x.requires_grad : ", x.requires_grad)

y = (x ** 2)
print("y.requires_grad : ", y.requires_grad)

with torch.no_grad():
    y = (x ** 2)
    print("y.requires_grad : ", y.requires_grad)
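
`torch.no_grad()` can also be used as a decorator, which is handy for inference-only functions. In the sketch below, `predict` is a hypothetical stand-in for a trained model; it also shows that a result computed without history cannot be back-propagated.

```python
import torch

@torch.no_grad()
def predict(weights, inputs):
    # Hypothetical "model": a plain dot product
    return (weights * inputs).sum()

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

out = predict(w, x)
print("out.requires_grad : ", out.requires_grad)
print("out.grad_fn : ", out.grad_fn)

try:
    out.backward()  # there is no graph to walk back through
except RuntimeError as err:
    print("backward() failed:", err)
```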

### Leaves vs Nodes

*Advanced*

PyTorch's autograd mechanism distinguishes between two types of tensors:
- __node variables__ are the result of a PyTorch operation
- __leaf variables__ are directly created by a user

You can use the `.is_leaf` property to tell the two types apart.

In [None]:
A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
B = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True) + 2  # B is the result of an operation (+)
C = 5 * A  # C is the result of an operation (*)
print("A.is_leaf :", A.is_leaf)
print("B.is_leaf :", B.is_leaf)
print("C.is_leaf :", C.is_leaf)
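
One subtlety: tensors that do not require a gradient are treated as leaves by convention, even when they are the result of an operation, because no history is recorded for them. A short sketch (the names `D`, `E` and `F` are chosen only for this example):

```python
import torch

D = torch.ones(2)                           # created by the user, requires_grad is False
E = D + 1                                   # result of an operation, but nothing is tracked
F = torch.ones(2, requires_grad=True) * 2   # result of a tracked operation

print("D.is_leaf :", D.is_leaf)
print("E.is_leaf :", E.is_leaf)
print("F.is_leaf :", F.is_leaf)
```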

### Dropping history with `.detach()`

*Advanced*

Some tensors are computed from others, but you may want to consider them as __new leaf variables__, i.e. treat them as constants without computation history. For that, you can use the `.detach()` method.

In [None]:
A = torch.rand(1,2, requires_grad=True)
B = A.mean()

print("B : ", B)
print("B.grad_fn :", B.grad_fn)
print("B.is_leaf :", B.is_leaf)

C = B.detach()
print("\n-- C = B.detach() -- \n")

print("C : ", C)
print("C.grad_fn :", C.grad_fn)
print("C.is_leaf :", C.is_leaf)
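
Detaching also changes the gradients you get, since no gradient flows back through a detached tensor. The sketch below compares $x \cdot x$ with a version where one factor is detached and therefore treated as a constant; the variable names are assumptions made for this example.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# Both factors are tracked: d(x * x)/dx = 2x
y_full = x * x
y_full.backward()
full_grad = x.grad.item()

x.grad = None  # clear the accumulated gradient

# The second factor is detached, so it acts as the constant 3
y_detached = x * x.detach()
y_detached.backward()
detached_grad = x.grad.item()

print("Gradient with full history : {}".format(full_grad))
print("Gradient with one factor detached : {}".format(detached_grad))
```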

### Differentiating w.r.t. intermediate values: `.retain_grad()`

*Advanced*

During the backward pass, autograd computes the gradient of the output with respect to every intermediate variable. However, by default, only the gradients of variables that were **created by the user** (leaves) and have the __`requires_grad` property set to True__ are saved.

Indeed, most of the time when training a model you only need the gradient of a loss w.r.t. your model parameters.

In [None]:
A = torch.tensor([[1., 2.], [3., 4.]])  # floating-point values, as needed for gradients
A.requires_grad_()

B = 5 * (A + 3)
C = B.mean()

print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)

In [None]:
A = torch.tensor([[1., 2.], [3., 4.]])  # floating-point values, as needed for gradients
A.requires_grad_()

B = 5 * (A + 3)
B.retain_grad()  # <----- This line lets us access the gradient w.r.t. B after the backward pass
C = B.mean()


print("A.grad :", A.grad)
print("B.grad :", B.grad)
C.backward()
print("\n-- Backward --\n")
print("A.grad :", A.grad)
print("B.grad :", B.grad)
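
You can check the retained values by hand. Since $C$ is the mean of the four entries of $B$, each entry of $\partial C / \partial B$ is $1/4$, and since $B = 5(A + 3)$, each entry of $\partial C / \partial A$ is $5/4$. The sketch below compares these with what PyTorch stores; `expected_A` and `expected_B` are names introduced for this check.

```python
import torch

A = torch.tensor([[1., 2.], [3., 4.]], requires_grad=True)
B = 5 * (A + 3)
B.retain_grad()  # keep the gradient of the intermediate value B
C = B.mean()
C.backward()

expected_B = torch.full((2, 2), 0.25)  # dC/dB = 1/4 for every entry
expected_A = torch.full((2, 2), 1.25)  # dC/dA = 5 * 1/4 for every entry

print("B.grad matches :", torch.allclose(B.grad, expected_B))
print("A.grad matches :", torch.allclose(A.grad, expected_A))
```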
