### Why Computational Graphs?

Backpropagation comes up in almost any discussion of neural networks because it is the algorithm used to train deep neural networks.

Backpropagation calculates the derivatives (gradients) needed to keep optimizing the parameters of the model, which is what allows the model to learn the task at hand.

Many deep learning frameworks today, such as PyTorch, perform backpropagation out of the box using [**automatic differentiation**](https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html).

To better understand how this is done, it's important to talk about **computational graphs**, which define the flow of computations carried out throughout the network. Along the way, we will use `torch.autograd` to demonstrate in code how this works.

The example used in this section is not a neural network of any sort. Computational graphs contain **nodes**, which can represent an input (tensor, matrix, vector, scalar) or an **operation** whose result can be the input to another node. The nodes are connected by **edges**, which represent function arguments; they are pointers to nodes. Note that computational graphs are directed and acyclic.

We can evaluate the expression by setting our input variables as follows: $a=2$ and $b=1$. This will allow us to compute nodes up through the graph as shown in the computational graph above.  
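
As a rough sketch of what evaluating the graph means, the snippet below stores the expression as a small dictionary of nodes and walks it in dependency order. The dictionary layout and the `evaluate` helper are illustrative assumptions and not part of PyTorch; the operations are the same ones defined in code later in this section ($c = a + b$, $d = b + 1$, $e = c * d$).

```python
import operator

# Hypothetical plain-Python encoding of the graph: each non-input node
# maps to (operation, list of arguments). Constants are plain numbers.
graph = {
    "c": (operator.add, ["a", "b"]),
    "d": (operator.add, ["b", 1.0]),
    "e": (operator.mul, ["c", "d"]),
}


def evaluate(graph, inputs):
    """Evaluate every node, assuming `graph` lists nodes in dependency order."""
    values = dict(inputs)
    for name, (op, args) in graph.items():
        # Look up named arguments; pass numeric constants through unchanged.
        resolved = [values[arg] if isinstance(arg, str) else arg for arg in args]
        values[name] = op(*resolved)
    return values


values = evaluate(graph, {"a": 2.0, "b": 1.0})
print(values["c"], values["d"], values["e"])  # 3.0 2.0 6.0
```

Because the graph is directed and acyclic, such a dependency order always exists, which is why a single pass over the nodes is enough here.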

Rather than doing this by hand, we can use the automatic differentiation engine provided by PyTorch.

Let's import PyTorch first:

In [None]:
import torch

Define the inputs like this:

In [None]:
a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

Note that we used `requires_grad=True` to tell the autograd engine that every operation on these tensors should be tracked.
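
To see what tracking means in practice, here is a small sketch that compares a tracked tensor with an untracked one. The tensor `k` and the names `tracked` and `untracked` are introduced only for illustration.

```python
import torch

a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
k = torch.tensor([1.])  # illustrative constant; requires_grad defaults to False

print(a.requires_grad, a.is_leaf)  # True True
print(k.requires_grad, k.is_leaf)  # False True

tracked = a + b    # depends on tracked inputs, so it is tracked too
untracked = k + 1  # no tracked inputs, so autograd records nothing

print(tracked.requires_grad, tracked.grad_fn is not None)  # True True
print(untracked.requires_grad, untracked.grad_fn is None)  # False True
```

A result only gets a `grad_fn` when at least one of its inputs requires gradients, which is how the graph ends up containing just the operations that matter for differentiation.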

These are the operations in code:

In [None]:
c = a + b
d = b + 1
e = c * d

# keep gradients for non-leaf nodes so they can be inspected later
c.retain_grad()
d.retain_grad()
e.retain_grad()

Note that we used `.retain_grad()` to allow gradients to be stored for non-leaf nodes, as we are interested in inspecting those as well.
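
The following sketch illustrates the difference `.retain_grad()` makes. It rebuilds the same expression but retains the gradient only for `d`, then runs a backward pass; the `warnings` handling is there because reading `.grad` on a non-leaf tensor that did not retain it emits a warning.

```python
import warnings

import torch

a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)

c = a + b  # non-leaf, gradient NOT retained
d = b + 1  # non-leaf, gradient retained below
d.retain_grad()
e = c * d
e.backward()

with warnings.catch_warnings():
    # Silence the warning raised when reading .grad on a non-leaf tensor
    # whose gradient was not retained.
    warnings.simplefilter("ignore")
    c_grad = c.grad

print(c_grad)  # None
print(d.grad)  # tensor([3.])
```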

Now that we have our computational graph, we can check the result when evaluating the expression:

In [None]:
print(e)

tensor([6.], grad_fn=<MulBackward0>)


The output is a tensor with the value `6.`, which matches the result shown here:

![](https://colah.github.io/posts/2015-08-Backprop/img/tree-eval.png)
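
The `grad_fn=<MulBackward0>` in the printed output hints at the graph that autograd recorded. As a small sketch, the backward graph can be walked from `e` through the `grad_fn` and `next_functions` attributes; the variable names `root` and `parents` are chosen here for illustration.

```python
import torch

a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
c = a + b
d = b + 1
e = c * d

# The node that produced e in the recorded graph
root = e.grad_fn
print(root.name())  # MulBackward0

# Each entry of next_functions is a (node, input_index) pair pointing at
# the nodes that produced the arguments of this operation.
parents = [fn.name() for fn, _ in root.next_functions]
print(parents)  # ['AddBackward0', 'AddBackward0']
```

The two `AddBackward0` entries correspond to `c` and `d`, the two arguments of the multiplication.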

### Derivatives on Computational Graphs

Using the concept of computational graphs, we are now interested in evaluating the **partial derivatives** on the edges of the graph. This will help in gathering the gradients of the graph. Remember that gradients are what we use to train the neural network, and those calculations can be taken care of by the automatic differentiation engine.

The intuition is this: if $a$ directly affects $c$, we want to know how it affects it. In other words, if we change $a$ a little, how much does $c$ change? This is referred to as the partial derivative of $c$ with respect to $a$.
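
This "change $a$ a little" idea can be checked numerically with a finite difference. The sketch below uses plain Python floats; the helper `c_of` and the step size `h` are illustrative choices.

```python
def c_of(a, b):
    """The node c from the graph: c = a + b."""
    return a + b


a, b = 2.0, 1.0
h = 1e-6  # small, arbitrarily chosen step

# Nudge a by h, hold b fixed, and measure how much c moves per unit of a.
dc_da = (c_of(a + h, b) - c_of(a, b)) / h
print(round(dc_da, 4))  # 1.0
```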

You can work this out by hand, but the easy way to do it with PyTorch is to call `.backward()` on $e$ and let the engine figure out the values. Calling `.backward()` signals the autograd engine to calculate the gradients and store them in the respective tensors’ `.grad` attribute.

In [None]:
e.backward()

After the backward pass, we can read the gradient of $e$ with respect to $a$ from `a.grad`:

In [None]:
print(a.grad)

tensor([2.])


It is important to understand the intuition behind this:

>Let’s consider how $e$ is affected by $a$. If we change $a$ at a speed of 1, $c$ also changes at a speed of $1$. In turn, $c$ changing at a speed of $1$ causes $e$ to change at a speed of $2$. So $e$ changes at a rate of $1*2$ with respect to $a$.


In other words, by hand this would be:

$$
\frac{\partial e}{\partial a}=\frac{\partial e}{\partial c} \frac{\partial c}{\partial a} = 2 * 1
$$
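
To spell out where the two factors come from, each one is the local derivative of a single operation, using $c = a + b$, $d = b + 1$ and $e = c * d$ with $a=2$ and $b=1$:

$$
\frac{\partial c}{\partial a}=\frac{\partial (a+b)}{\partial a}=1, \qquad \frac{\partial e}{\partial c}=\frac{\partial (c * d)}{\partial c}=d=b+1=2
$$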

Since $a$ is not directly connected to $e$, we use the multivariate chain rule: sum over all paths from one node to the other in the computational graph, multiplying the derivatives on each edge of a path together.

![](https://colah.github.io/posts/2015-08-Backprop/img/tree-eval-derivs.png)

In [None]:
print(b.grad)

tensor([5.])


If you work it out by hand, you are basically doing the following:

$$
\frac{\partial e}{\partial b}=1 * 2+1 * 3
$$

It indicates how $b$ affects $e$ through $c$ and $d$. We are essentially summing over paths in the computational graph.
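
The sum over paths can be reproduced step by step and compared with what autograd reports. In this sketch the local derivatives are written out by hand; names such as `path_via_c` and `by_hand` are illustrative.

```python
import torch

a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
c = a + b
d = b + 1
e = c * d
e.backward()

# Local derivatives on each edge, worked out by hand
dc_db = 1.0       # c = a + b
dd_db = 1.0       # d = b + 1
de_dc = d.item()  # e = c * d, so de/dc = d = 2
de_dd = c.item()  # e = c * d, so de/dd = c = 3

# Multiply along each path from b to e, then add the paths together
path_via_c = dc_db * de_dc
path_via_d = dd_db * de_dd
by_hand = path_via_c + path_via_d

print(by_hand, b.grad.item())  # 5.0 5.0
```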

Here are all the gradients collected, including non-leaf nodes:

In [None]:
print(a.grad, b.grad, c.grad, d.grad, e.grad)

tensor([2.]) tensor([5.]) tensor([2.]) tensor([3.]) tensor([1.])
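
As a cross-check, the same values can be requested directly with `torch.autograd.grad`, which returns the gradients instead of storing them in `.grad`. This is a sketch of an alternative route, not the approach used in the rest of this section.

```python
import torch

a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([1.], requires_grad=True)
c = a + b
d = b + 1
e = c * d

# Ask for de/da, de/db, de/dc and de/dd in one call
grads = torch.autograd.grad(e, [a, b, c, d])
print([g.item() for g in grads])  # [2.0, 5.0, 2.0, 3.0]

# Nothing was accumulated on the tensors themselves
print(a.grad)  # None
```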


You can use the computational graph above to verify that everything is correct. This is the power of computational graphs and how they are used by automatic differentiation engines. It's also a very useful concept to understand when developing neural network architectures and checking their correctness.
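
To make the verification concrete, here is a minimal hand-written backward pass over the same graph in plain Python. It is only a sketch of the idea behind reverse-mode differentiation, not how PyTorch implements it, and the `grad_*` names are illustrative.

```python
# Forward pass: evaluate the graph from inputs to output
a, b = 2.0, 1.0
c = a + b
d = b + 1
e = c * d

# Backward pass: start from de/de = 1 and move towards the inputs,
# multiplying by the local derivative on each edge.
grad_e = 1.0
grad_c = grad_e * d                   # de/dc = d
grad_d = grad_e * c                   # de/dd = c
grad_a = grad_c * 1.0                 # dc/da = 1
grad_b = grad_c * 1.0 + grad_d * 1.0  # b feeds both c and d, so the paths add

print(grad_a, grad_b, grad_c, grad_d, grad_e)  # 2.0 5.0 2.0 3.0 1.0
```

Each gradient is computed once and reused by the nodes below it, which is what lets a single backward pass produce the derivative of the output with respect to every node.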