# Autograd

In search of an optimal solution for a given problem, we need to tweak the parameters of the neural network model. Knowing how to tweak a parameter, and whether to increase or decrease its value, translates mathematically into finding the partial derivatives of our model. This operation is the central part of backpropagation in neural networks.

PyTorch has the **autograd** feature to help with these calculations, which makes it possible to build flexible and fast machine learning projects. Autograd computes the many partial derivatives (also called *gradients*) of our complex model.

The power of autograd comes from the fact that it traces your computation dynamically *at runtime*. This means that if your model has decision branches, or loops whose lengths are unknown until the model actually runs, autograd will still trace these calculations correctly and calculate the correct gradients to drive the optimization of the model's parameters.
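As a minimal sketch of this dynamic tracing: the toy function `dynamic_model` and its threshold of 10 are made up for illustration and are not part of any library. The loop runs a data-dependent number of times, and autograd still produces the gradient for the path that was actually taken.

```python
import torch

def dynamic_model(x):
    # Hypothetical toy "model": keep doubling until the norm exceeds 10.
    # The number of iterations depends on the runtime value of x.
    y = x
    n_doublings = 0
    while y.norm() < 10:
        y = y * 2
        n_doublings += 1
    # A data-dependent branch: the final operation depends on the values.
    if y.sum() > 0:
        out = y.sum()
    else:
        out = -y.sum()
    return out, n_doublings

x = torch.tensor([0.5, 1.0, 1.5], requires_grad=True)
out, n_doublings = dynamic_model(x)
out.backward()

print(n_doublings)  # number of times the loop body actually ran
print(x.grad)       # each entry equals 2 ** n_doublings for this input
```

For this particular input the loop runs three times, so every gradient entry is 8.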

## A mathematical view on autograd

Let's take a high-level mathematical helicopter view of our machine learning model and discuss it here as a *function*, with inputs and outputs.

The inputs are defined by an *n*-dimensional vector $\vec x$, with elements $x_{i}$. You can see them as the input features. We can then express the model, *M*, as a vector-valued function of the input; in the general case it produces a vector output: $\vec y = \vec M(\vec x)$.

Since we'll be using autograd in the context of training, our output of interest will be the model's loss. 

The *loss function* $L(\vec y) = L(\vec M(\vec x))$ is typically a **scalar** function of the model's output. This function expresses how far off our model's prediction was from the *true* output for a particular input.

The goal of *training* a model is thus to minimize the loss by adjusting the parameters of our model. For a perfect model, this means tweaking the parameters in such a way that the loss function evaluates to 0. With a zero loss we have a perfect model that gives our *true* outputs.

In reality our data, both input and output, will contain some noise, so we will not always get *true* outputs. Our model is also always an approximation of reality, not reality itself, and rounding errors in our calculations have a further impact. The result is a model that does not give exactly the *true* outputs.

To guide the tuning of the parameters towards a lower loss, we need a mathematical tool that points us in the right direction, meaning down the slope. That tool is the first derivative of the loss, which equals 0 at a minimum: $\frac{\partial L}{\partial x} = 0$. When the first derivative reaches 0, we know the parameters have been adjusted to a stationary point, which we hope is a minimum. Depending on the loss function, this minimum may be global or only local (e.g. if the loss function is convex and differentiable everywhere, a zero gradient means we have reached the global minimum).

Mind that the loss is a function of the model's output and is not *directly* derived from the input. Of course the output is an *indirect* function of the input, through the model that is applied to the input.

So: $\frac{\partial L}{\partial x}$ = $\frac{\partial {L({\vec y})}}{\partial x}$. By the chain rule of differential calculus, we have $\frac{\partial {L({\vec y})}}{\partial x}$ = $\frac{\partial L}{\partial y}\frac{\partial y}{\partial x}$ = $\frac{\partial L}{\partial y}\frac{\partial M(x)}{\partial x}$.
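The chain rule above can be checked numerically. The sketch below uses a toy model $M(x) = \sin(x)$ and a toy loss $L(y) = \sum_i y_i^2$, both chosen only for illustration, and compares the gradient from autograd with the hand-derived product $\frac{\partial L}{\partial y}\frac{\partial y}{\partial x}$.

```python
import torch

x = torch.tensor([0.3, 1.2, -0.7], requires_grad=True)

# Toy "model" M and toy loss L, chosen only for illustration.
y = torch.sin(x)          # y = M(x)
loss = (y ** 2).sum()     # L(y) = sum_i y_i^2

loss.backward()           # autograd applies the chain rule for us

# Manual chain rule: dL/dx = dL/dy * dy/dx = 2*y * cos(x)
manual = (2 * torch.sin(x) * torch.cos(x)).detach()

print(x.grad)
print(manual)             # same values as x.grad
```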

In this calculation, the $\frac{\partial M(x)}{\partial x}$ term is where things get complex. In a neural network we have a linear combination of inputs and weights, on top of which we apply non-linear activation functions. In more complex neural networks we can also apply other transformations, such as convolutions and max-pooling.

The partial derivatives of the model's outputs with respect to its inputs, $\frac{\partial M(x)}{\partial x}$, can again be expanded using the chain-rule on the components of the network. This will result in needing to calculate all the local partial derivatives over every multiplied learning weight, every activation function, and every other mathematical transformation in the model. 

The final result will be a sum of the products of the local gradient of *every possible path* through the computational graph that ends with the variable whose gradient we are trying to measure.
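To make the "sum over every path" statement concrete, here is a small sketch with a made-up graph in which `x` reaches the output along two paths. The gradient reported by autograd equals the sum of the per-path products of local gradients.

```python
import torch

x = torch.tensor(1.5, requires_grad=True)

a = 2 * x        # path 1: x -> a -> out
b = 3 * x        # path 2: x -> b -> out
out = a * b      # out = 6 * x**2

out.backward()

# Sum over both paths of the product of local gradients:
#   d(out)/da * da/dx + d(out)/db * db/dx = b*2 + a*3
path_sum = b.detach() * 2 + a.detach() * 3

print(x.grad)     # tensor(18.)
print(path_sum)   # tensor(18.)
```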

It is of course the gradients with respect to the parameters (weights) of our model that we are interested in: they allow us to update the parameters, because they tell us *what direction to change each weight* so that the loss function gets closer to zero.

For a neural net, the number of layers and weights can grow into the hundreds of millions as models get deeper and therefore more complex. We need a local derivative for each parameter, and each parameter corresponds to a separate path through the model's computational graph, so the number of derivatives to calculate tends to go up exponentially with the depth of the network.

It is here that autograd will help out in tracking the history of every computation.

## An autograd example

Let's start with a simple example.

In [2]:
import torch
import numpy as np
import matplotlib.pyplot as plt 

## **requires_grad** argument

This argument tells PyTorch that it will need to calculate gradients for this tensor later in your optimization steps, i.e. this is a variable in your model that you want to optimize.

The autograd package provides automatic differentiation for all operations on tensors. Setting `requires_grad=True` on a tensor makes autograd record every operation performed on it, so that the derivatives can be computed later.
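A short sketch of how the flag behaves (the tensor names `a` and `w` are arbitrary): `requires_grad` defaults to `False`, and the result of an operation is tracked as soon as at least one of its inputs is tracked.

```python
import torch

a = torch.ones(3)                      # requires_grad defaults to False
w = torch.ones(3, requires_grad=True)  # tracked by autograd

print(a.requires_grad)        # False
print(w.requires_grad)        # True
print((a * 2).requires_grad)  # False: no tracked input
print((a * w).requires_grad)  # True: one tracked input is enough
print(w.is_leaf, (a * w).is_leaf)  # True False
```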

In [3]:
x = torch.randn(4, requires_grad=True)  # leaf tensor created by the user
y = x + 2

As **y** was created as a result of an operation (here an addition), it will get a `grad_fn` attribute, indicating what operation was performed on the input tensor.

In [4]:
print('x:',x) # created by the user

x: tensor([-0.7020,  0.7491,  0.3764, -0.3933], requires_grad=True)


Look at the last attribute of the **x** PyTorch tensor. `requires_grad=True` tells us that autograd tracks this tensor. Because **x** was created directly by the user rather than by an operation, it is a **leaf** node of our calculation graph.

In [5]:
print(x.grad_fn)

None


So `grad_fn` tells autograd which operation it has to differentiate when we execute the backpropagation step and compute gradients.

For `x.grad_fn` it returned `None`: **x** was not produced by any operation, so there is nothing to backtrack through.

In [6]:
print('y:',y)

y: tensor([1.2980, 2.7491, 2.3764, 1.6067], grad_fn=<AddBackward0>)


Look again at the last attribute of the **y** pytorch-tensor. This `grad_fn` attribute indicates that this is a **non-leaf** node of our calculation graph and it indicates which operation needs to be performed when we have to backtrack when performing the backward step.

The type of operation is stored in this `grad_fn` attribute.

In [7]:
print(y.grad_fn)

<AddBackward0 object at 0x0000014CD084EEE0>


`y.grad_fn` reports `AddBackward0`: in the backward step autograd will need to differentiate the addition that produced **y**.

Let's do some more operations on y

In [8]:
z = y * y * 3
print('z:',z) # the result of a multiplication of y with itself (squaring) and multiply by 3 
print(z.grad_fn) 

z: tensor([ 5.0546, 22.6726, 16.9414,  7.7444], grad_fn=<MulBackward0>)
<MulBackward0 object at 0x0000014CD084ECD0>


In [9]:
# plt.plot(x.detach(), z.detach())
#       # detach() is explained below; for now it is enough to know that it returns a tensor
#       # without gradient tracking, which matplotlib can convert to a NumPy array.

In [10]:
t = z.sum() # here t is a TENSOR of DIMENSION 0
print('t:',t)
print(t.grad_fn)

t: tensor(52.4129, grad_fn=<SumBackward0>)
<SumBackward0 object at 0x0000014CD084ED90>


So far no gradients have been calculated, but autograd knows how to calculate them from the `grad_fn` attributes recorded on each tensor along the way.

When we finish our model computation (this is also called the forward pass) we can call `.backward()` and have all the gradients computed automatically.

Let's start off easy. When you call `.backward()` on a tensor with no arguments, it expects the calling tensor to contain only a **single** element, as is the case when computing a loss function, where we want to optimize one value (i.e. the loss).

So let's take the next step and ask the system to calculate all the gradients, by means of `.backward()`.

In [11]:
t.backward()

The gradient is accumulated into the `.grad` attribute of the *original* tensor (here **x**). It is the partial derivative of the output **t** with respect to the tensor **x**, which is a leaf tensor.

In [12]:
print('x.grad:\n',x.grad) # dt/dx

x.grad:
 tensor([ 7.7882, 16.4946, 14.2582,  9.6402])
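These values can be checked by hand. Since $t = \sum_i 3(x_i + 2)^2$, the derivative is $\frac{\partial t}{\partial x_i} = 6(x_i + 2)$; for the first element above, $6 \cdot (-0.7020 + 2) \approx 7.788$. The sketch below repeats the computation on a fresh random tensor and compares autograd's result with this formula.

```python
import torch

x = torch.randn(4, requires_grad=True)
t = (3 * (x + 2) ** 2).sum()   # same computation as above, written in one line
t.backward()

# Analytically, dt/dx_i = 6 * (x_i + 2)
expected = 6 * (x.detach() + 2)
print(torch.allclose(x.grad, expected))  # True
```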


For the intermediate, non-leaf tensors in our calculation, `.grad` is not populated and holds `None` (you also get a warning if you access it). If you do want the gradient of a non-leaf tensor, call `.retain_grad()` on it before the backward pass. The warning you get when accessing the gradient of an intermediate variable says the same; uncomment the lines below to see it.

In [13]:
# print('y.grad:\n',y.grad) # dt/dy
# print('z.grad:\n',z.grad) # dt/dz
# print('t.grad:\n',t.grad) # dt/dt
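A minimal sketch of `.retain_grad()` on the same computation: only `y` is asked to keep its gradient, so `y.grad` is populated after the backward pass while `z.grad` stays `None`.

```python
import torch

x = torch.randn(4, requires_grad=True)
y = x + 2
y.retain_grad()        # ask autograd to keep the gradient of this non-leaf tensor
z = y * y * 3
t = z.sum()
t.backward()

print(y.is_leaf)       # False
print(y.grad)          # dt/dy = 6 * y
print(z.grad)          # None (with a warning): retain_grad() was not called on z
```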

The `grad_fn` object offers more functionality.

Each `grad_fn` stored with our tensors lets you walk the computation all the way back to its inputs through its `next_functions` property.

We can see below that drilling down on this property on t shows us the gradient functions for all the prior tensors. Note that `x.grad_fn` is reported as None, indicating that this was an input to the function with no history of its own.

In [14]:
print(' t:')
print(t.grad_fn)
print(t.grad_fn.next_functions)
print(t.grad_fn.next_functions[0][0].next_functions)
print(t.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(t.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)
print('\n z:')
print(z.grad_fn)
print('\n y:')
print(y.grad_fn)
print('\n x:')
print(x.grad_fn)

 t:
<SumBackward0 object at 0x0000014CD08671F0>
((<MulBackward0 object at 0x0000014CD0867370>, 0),)
((<MulBackward0 object at 0x0000014CD08671F0>, 0), (None, 0))
((<AddBackward0 object at 0x0000014CD0867280>, 0), (<AddBackward0 object at 0x0000014CD0867280>, 0))
((<AccumulateGrad object at 0x0000014CD08671F0>, 0), (None, 0))

 z:
<MulBackward0 object at 0x0000014CD08671F0>

 y:
<AddBackward0 object at 0x0000014CD08671F0>

 x:
None
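Chaining `next_functions[0][0]` by hand gets unwieldy quickly. As a sketch, the helper `walk_graph` below (a hypothetical name, not part of PyTorch) walks the same structure recursively and returns one indented line per backward node.

```python
import torch

def walk_graph(fn, depth=0):
    """Return the names of all backward nodes reachable from fn, indented by depth."""
    if fn is None:
        return []  # an input that does not require gradients (e.g. a constant)
    lines = ['  ' * depth + type(fn).__name__]
    for next_fn, _ in fn.next_functions:
        lines.extend(walk_graph(next_fn, depth + 1))
    return lines

x = torch.randn(4, requires_grad=True)
y = x + 2
t = (y * y * 3).sum()

print('\n'.join(walk_graph(t.grad_fn)))
```

The `AccumulateGrad` nodes at the deepest level are where the gradient is written into the `.grad` attribute of the leaf tensor `x`. Because `y` is used twice in `y * y`, its `AddBackward0` node shows up twice in the walk.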


### Exercise: Create a function that calculates $\exp(2 x^2 \sin(x))$ over the range [-2, 2], with steps=25

Use intermediate variables for each mathematical operation.


In [15]:
# Your code here

tensor([-2.0000, -1.8333, -1.6667, -1.5000, -1.3333, -1.1667, -1.0000, -0.8333,
        -0.6667, -0.5000, -0.3333, -0.1667,  0.0000,  0.1667,  0.3333,  0.5000,
         0.6667,  0.8333,  1.0000,  1.1667,  1.3333,  1.5000,  1.6667,  1.8333,
         2.0000], requires_grad=True)


### Exercise: Call the appropriate function on the variable to calculate the gradients.

x.grad:
 tensor([2.7343e-03, 8.0899e-03, 2.4208e-02, 7.0817e-02, 1.9001e-01, 4.3878e-01,
        8.2628e-01, 1.2166e+00, 1.3549e+00, 1.0997e+00, 6.0093e-01, 1.6387e-01,
        0.0000e+00, 1.6691e-01, 6.9499e-01, 1.7762e+00, 4.0675e+00, 9.5081e+00,
        2.3928e+01, 6.5504e+01, 1.9073e+02, 5.6104e+02, 1.5392e+03, 3.5214e+03,
        5.6924e+03])


________________________________

## Autograd: code and function

`requires_grad=True`    ->	Tensor will be tracked for gradient computation


`.grad` ->	Stores the computed gradient


`.backward()`   ->	Triggers automatic differentiation


`torch.no_grad()`   ->	Temporarily disables autograd (useful for inference)


`.detach()` ->	Creates a tensor that shares data but doesn't track gradients

_______________

## Turning Autograd On and Off

There are situations where you will need fine-grained control over whether autograd is enabled. There are three ways to do this, depending on the situation:

- Setting the `requires_grad` flag
- Using `with torch.no_grad()`
- Using `.detach()`

### Using `requires_grad=True`
The simplest is to change the `requires_grad` flag on a tensor directly.

Let's try this on an input tensor `a` and see what happens to 2 other tensors `b1` and  `b2`, where one has the gradients and the other not.

In [18]:
a = torch.ones(2, 3, requires_grad=True)
print(a)

b1 = 2 * a
print(b1)

a.requires_grad = False
b2 = 2 * a
print(b2)

tensor([[1., 1., 1.],
        [1., 1., 1.]], requires_grad=True)
tensor([[2., 2., 2.],
        [2., 2., 2.]], grad_fn=<MulBackward0>)
tensor([[2., 2., 2.],
        [2., 2., 2.]])


In the cell above, we see that `b1` has a `grad_fn` (i.e., a traced computation history), which is what we expect, since it was derived from a tensor, `a`, that had autograd turned on. When we turn off autograd explicitly with `a.requires_grad = False`, computation history is no longer tracked, as we see when we compute `b2`.

### Using `torch.no_grad()`
If you only need autograd turned off **temporarily**, a better way is to use the `torch.no_grad()` context manager:

In [19]:
a = torch.ones(2, 3, requires_grad=True) * 2
b = torch.ones(2, 3, requires_grad=True) * 3

c1 = a + b
print('c1:\n', c1)

with torch.no_grad():
    c2 = a + b
    print('c2:\n', c2)

c3 = a * b
print('c3:\n', c3)


c1:
 tensor([[5., 5., 5.],
        [5., 5., 5.]], grad_fn=<AddBackward0>)
c2:
 tensor([[5., 5., 5.],
        [5., 5., 5.]])
c3:
 tensor([[6., 6., 6.],
        [6., 6., 6.]], grad_fn=<MulBackward0>)


`torch.no_grad()` can also be used as a function or method decorator:

In [20]:
def add_tensors1(x, y):
    return x + y

@torch.no_grad()
def add_tensors2(x, y):
    return x + y


a = torch.ones(2, 3, requires_grad=True) * 2
b = torch.ones(2, 3, requires_grad=True) * 3

c1 = add_tensors1(a, b)
print('c1:\n', c1)

c2 = add_tensors2(a, b)
print('c2:\n', c2)


c1:
 tensor([[5., 5., 5.],
        [5., 5., 5.]], grad_fn=<AddBackward0>)
c2:
 tensor([[5., 5., 5.],
        [5., 5., 5.]])


There's a corresponding context manager, `torch.enable_grad()`, for turning autograd on when it isn't already. It may also be used as a decorator.
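A short sketch of `torch.enable_grad()` switching tracking back on inside a `torch.no_grad()` block (the tensor names are arbitrary):

```python
import torch

a = torch.ones(2, 3, requires_grad=True)

with torch.no_grad():
    b = a * 2                  # not tracked
    with torch.enable_grad():
        c = a * 2              # tracked again inside the inner block

print(b.requires_grad)  # False
print(c.requires_grad)  # True
```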

### Using `.detach()`
Finally, you may have a tensor that requires gradient tracking, but you want a version of it that does not. For this we have the `Tensor` object's `detach()` method. It returns a new tensor that shares the same data but is *detached* from the computation history:

In [21]:
x = torch.rand(5, requires_grad=True)
y = x.detach()

print(x)
print(y)

tensor([0.0737, 0.8558, 0.6912, 0.8008, 0.4990], requires_grad=True)
tensor([0.0737, 0.8558, 0.6912, 0.8008, 0.4990])


We did this above when we wanted to graph some of our tensors. `matplotlib` expects a NumPy array as input, and the implicit conversion from a PyTorch tensor to a NumPy array is not enabled for tensors with `requires_grad=True`. Detaching the tensor first is therefore necessary whenever you want to display your results with `matplotlib`.
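The following sketch shows the conversion failing on a tracked tensor and succeeding after `detach()`. It also shows that the detached tensor shares its storage with the original rather than copying it.

```python
import torch

x = torch.rand(5, requires_grad=True)

try:
    x.numpy()                  # not allowed on a tensor that requires grad
except RuntimeError as err:
    print('RuntimeError:', err)

arr = x.detach().numpy()       # works: detached tensor -> NumPy array
print(type(arr), arr.shape)

# detach() shares the underlying storage with x rather than copying it
print(x.detach().data_ptr() == x.data_ptr())  # True
```

Because the storage is shared, modifying the detached tensor in place also changes the original; use `.detach().clone()` when an independent copy is needed.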

### Autograd and In-place Operations

In every example in this notebook so far, we've used variables to capture the intermediate values of a computation. Autograd needs these intermediate values to perform gradient computations. 

**For this reason, you must be careful about using in-place operations when using autograd.**

Doing so can destroy information you need to compute derivatives in the `backward()` call. PyTorch will even stop you if you attempt an in-place operation on a leaf variable that requires autograd, as shown below.



In [22]:
# Note: The following lines of code, when uncommented, throw a runtime error. This is expected.
# a = torch.linspace(0., 2. * np.pi, steps=25, requires_grad=True)  # np.pi instead of np.math.pi, which newer NumPy versions no longer provide
# torch.sin_(a)
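A variant of the cell above that catches the error instead of crashing, followed by the out-of-place alternative. This is only a sketch; the flag `error_raised` is introduced here for illustration.

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)

error_raised = False
try:
    torch.sin_(a)              # in-place sine on a leaf that requires grad
except RuntimeError as err:
    error_raised = True
    print('RuntimeError:', err)

# The out-of-place version leaves `a` intact and is tracked normally.
b = torch.sin(a)
b.sum().backward()
print(torch.allclose(a.grad, torch.cos(a.detach())))  # True
```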

## Autograd Profiler (Nice to know)

Autograd tracks every step of your computation in detail. Such a computation history, combined with timing information, would make a handy profiler - and autograd has that feature baked in. Here's a quick example usage:

In [23]:
device = torch.device('cpu')
run_on_gpu = False
if torch.cuda.is_available():
    device = torch.device('cuda')
    run_on_gpu = True
    
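# create the tensors on the selected device, so the GPU is actually used when available
x = torch.randn(2, 3, requires_grad=True, device=device)
y = torch.rand(2, 3, requires_grad=True, device=device)
z = torch.ones(2, 3, requires_grad=True, device=device)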

with torch.autograd.profiler.profile(use_cuda=run_on_gpu) as prf:
    for _ in range(1000):
        z = (z / x) * y
        
print(prf.key_averages().table(sort_by='self_cpu_time_total'))

-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
    aten::mul        53.32%       4.000ms        53.32%       4.000ms       4.000us      11.039ms        50.72%      11.039ms      11.039us          1000  
    aten::div        46.68%       3.502ms        46.68%       3.502ms       3.502us      10.726ms        49.28%      10.726ms      10.726us          1000  
-------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 7.502ms
Self CUDA time total: 21.765ms



The profiler can also label individual sub-blocks of code, break out the data by input tensor shape, and export data as a Chrome tracing tools file. For full details of the API, see the [documentation](https://pytorch.org/docs/stable/autograd.html#profiler).
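As a sketch of labelling a sub-block, the example below wraps the loop body in `torch.autograd.profiler.record_function`. The label `div_then_mul` is an arbitrary name chosen here, and the example runs on the CPU only to keep it self-contained.

```python
import torch

x = torch.rand(2, 3, requires_grad=True) + 1   # shifted away from 0 to avoid division blow-ups
y = torch.rand(2, 3, requires_grad=True)
z = torch.ones(2, 3, requires_grad=True)

with torch.autograd.profiler.profile() as prf:
    for _ in range(100):
        # Everything inside this block is reported under the label below.
        with torch.autograd.profiler.record_function('div_then_mul'):
            z = (z / x) * y

table = prf.key_averages().table(sort_by='self_cpu_time_total', row_limit=10)
print(table)
```

The labelled block appears as its own row in the table, next to the `aten::div` and `aten::mul` operators it contains.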

## Advanced Topic: More Autograd Detail and the High-Level API (Nice to know)

If you have a function with an n-dimensional input and m-dimensional output, $\vec{y}=f(\vec{x})$, the complete gradient is a matrix of the derivative of every output with respect to every input, called the *Jacobian:*

\begin{align}J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\end{align}

If you have a second function, $l=g\left(\vec{y}\right)$ that takes m-dimensional input (that is, the same dimensionality as the output above), and returns a scalar output, you can express its gradients with respect to $\vec{y}$ as a column vector, $v=\left(\begin{array}{ccc}\frac{\partial l}{\partial y_{1}} & \cdots & \frac{\partial l}{\partial y_{m}}\end{array}\right)^{T}$ - which is really just a one-column Jacobian.

More concretely, imagine the first function as your PyTorch model (with potentially many inputs and many outputs) and the second function as a loss function (with the model's output as input, and the loss value as the scalar output).

If we multiply the first function's Jacobian by the gradient of the second function, and apply the chain rule, we get:

$J^{T}\cdot v=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{1}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{1}}{\partial x_{n}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)\left(\begin{array}{c}
   \frac{\partial l}{\partial y_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial y_{m}}
   \end{array}\right)=\left(\begin{array}{c}
   \frac{\partial l}{\partial x_{1}}\\
   \vdots\\
   \frac{\partial l}{\partial x_{n}}
   \end{array}\right)$

The resulting column vector is the *gradient of the second function with respect to the inputs of the first* - or in the case of our model and loss function, the gradient of the loss with respect to the model inputs.

**`torch.autograd` is an engine for computing these products.** This is how we accumulate the gradients over the learning weights during the backward pass.

For this reason, the `backward()` call can *also* take an optional vector input. This vector represents a set of gradients over the tensor, which are multiplied by the Jacobian of the autograd-traced tensor that precedes it. Let's try a specific example with a small vector:

In [24]:
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.data.norm() < 1000: # calculate the L2 norm of the data (torch.sqrt(torch.sum(torch.pow(y, 2))) == y.data.norm())
    y = y * 2

print(y)
# y.backward() # Uncommented, this would result in an error

tensor([-1189.2156,   989.5829,   100.2688], grad_fn=<MulBackward0>)


If we tried to call `y.backward()` now, by uncommenting the line above, we'd get a runtime error and a message that gradients can only be *implicitly* computed for scalar outputs. For a multi-dimensional output, autograd expects us to provide gradients for those outputs (three in this case) that it can multiply into the Jacobian:

In [25]:
v = torch.tensor([1.0, 1.0, 0.0001], dtype=torch.float) # stand-in for gradients
y.backward(v)

print(x.grad)

tensor([5.1200e+02, 5.1200e+02, 5.1200e-02])


(Note that the output gradients are all related to powers of two - which we'd expect from a repeated doubling operation.)
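To connect this back to the formula $J^{T}\cdot v$, the sketch below builds the full Jacobian of a small toy function `f` (made up for illustration) with `torch.autograd.functional.jacobian` and checks that `backward(v)` produces the same vector as the explicit matrix product.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Toy vector-valued function R^3 -> R^3, chosen only for illustration.
    return torch.stack([x[0] * x[1], x[1] ** 2, torch.sin(x[2])])

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
v = torch.tensor([1.0, 0.1, 0.01])

y = f(x)
y.backward(v)                 # vector-Jacobian product computed by autograd

J = jacobian(f, x.detach())   # full 3x3 Jacobian, J[i, j] = dy_i / dx_j
print(x.grad)
print(J.T @ v)                # same values as x.grad
```

In practice autograd never materializes `J`; it only computes the product, which is far cheaper for large models.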

In [26]:
# Another example

dx = 0.1
x = torch.arange(1, 1.8, dx, requires_grad=True)
f = x**2 + torch.sqrt(x) # the gradient of this is 2*x + 0.5/sqrt(x)

f.backward(torch.ones(f.shape))
print(x.grad)

tensor([2.5000, 2.6767, 2.8564, 3.0385, 3.2226, 3.4082, 3.5953, 3.7835])
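The printed values agree with the analytic derivative $2x + \frac{0.5}{\sqrt{x}}$; at $x = 1$ this gives $2 + 0.5 = 2.5$. A short sketch of the same check in code:

```python
import torch

x = torch.arange(1, 1.8, 0.1, requires_grad=True)
f = x**2 + torch.sqrt(x)
f.backward(torch.ones_like(f))   # a vector of ones: same result as f.sum().backward()

analytic = 2 * x.detach() + 0.5 / torch.sqrt(x.detach())
print(torch.allclose(x.grad, analytic))  # True
```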


Also have a look at https://stackoverflow.com/questions/57248777/backward-function-in-pytorch

### The High-Level API (nice to know)

There is an API on autograd that gives you direct access to important differential matrix and vector operations. In particular, it allows you to calculate the Jacobian and the *Hessian* matrices of a particular function for particular inputs. (The Hessian is like the Jacobian, but expresses all partial *second* derivatives.) It also provides methods for taking vector products with these matrices.

Let's take the Jacobian of a simple function, evaluated for two single-element inputs:

In [27]:
def exp_adder(x, y):
    return 2 * x.exp() + 3 * y

inputs = (torch.rand(1), torch.rand(1)) # arguments for the function
print(inputs)
torch.autograd.functional.jacobian(exp_adder, inputs)

(tensor([0.4438]), tensor([0.8725]))


(tensor([[3.1172]]), tensor([[3.]]))

If you look closely, the first output should equal $2e^x$ (since the derivative of $e^x$ is $e^x$), and the second value should be 3.
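For the output above, $2e^{0.4438} \approx 3.117$, which matches the first entry. The sketch below performs the same comparison programmatically on fresh random inputs.

```python
import torch
from torch.autograd.functional import jacobian

def exp_adder(x, y):
    return 2 * x.exp() + 3 * y

x, y = torch.rand(1), torch.rand(1)
jac_x, jac_y = jacobian(exp_adder, (x, y))   # one Jacobian per input

print(torch.allclose(jac_x, (2 * x.exp()).reshape(1, 1)))  # True
print(torch.allclose(jac_y, torch.tensor([[3.0]])))        # True
```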

You can, of course, do this with higher-order tensors, e.g. for two one-dimensional vectors:

In [28]:
inputs = (torch.rand(3), torch.rand(3)) # arguments for the function
print(inputs)
torch.autograd.functional.jacobian(exp_adder, inputs)

(tensor([0.1464, 0.2125, 0.7914]), tensor([0.9378, 0.4660, 0.1729]))


(tensor([[2.3152, 0.0000, 0.0000],
         [0.0000, 2.4735, 0.0000],
         [0.0000, 0.0000, 4.4129]]),
 tensor([[3., 0., 0.],
         [0., 3., 0.],
         [0., 0., 3.]]))

The `torch.autograd.functional.hessian()` method works identically (assuming your function is twice differentiable), but returns a matrix of all second derivatives.

In [29]:
inputs = (torch.rand(1), torch.rand(1)) # arguments for the function
print(inputs)
torch.autograd.functional.hessian(exp_adder, inputs)

(tensor([0.8534]), tensor([0.0284]))


((tensor([[4.6952]]), tensor([[0.]])), (tensor([[0.]]), tensor([[0.]])))
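The only non-zero entry is the second derivative with respect to `x`, which is again $2e^x$ (here $2e^{0.8534} \approx 4.695$); every term involving `y` vanishes because the function is linear in `y`. A sketch of the same check on fresh random inputs:

```python
import torch
from torch.autograd.functional import hessian

def exp_adder(x, y):
    return 2 * x.exp() + 3 * y

x, y = torch.rand(1), torch.rand(1)
hess = hessian(exp_adder, (x, y))   # nested tuple: hess[i][j] = d^2 f / (d input_i d input_j)

print(torch.allclose(hess[0][0], (2 * x.exp()).reshape(1, 1)))  # True
print(hess[0][1], hess[1][0], hess[1][1])                       # all zeros
```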

There is also a function to directly compute the vector-Jacobian product (vjp), if you provide the vector:

In [30]:
def do_some_doubling(x):
    y = x * 2
    while y.data.norm() < 1000:
        y = y * 2
    return y

inputs = torch.randn(3)
my_gradients = torch.tensor([0.1, 1.0, 0.0001])
torch.autograd.functional.vjp(do_some_doubling, inputs, v=my_gradients)

(tensor([1672.8020, -118.8002,  228.2352]),
 tensor([1.0240e+02, 1.0240e+03, 1.0240e-01]))

The `torch.autograd.functional.jvp()` method performs the same matrix multiplication as `vjp()` with the operands reversed. The `vhp()` and `hvp()` methods do the same for a vector-Hessian product.
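The difference between the two products can be seen by comparing both against the explicit Jacobian. The sketch below uses a toy function `f` that is made up for illustration: `vjp` returns $J^{T} v$ and `jvp` returns $J v$.

```python
import torch
from torch.autograd.functional import jacobian, jvp, vjp

def f(x):
    # Toy vector-valued function R^3 -> R^3, chosen only for illustration.
    return torch.stack([x[0] * x[1], x[1] ** 2, torch.sin(x[2])])

x = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([1.0, 0.1, 0.01])

J = jacobian(f, x)
_, vjp_result = vjp(f, x, v=v)   # v^T J, equivalently J^T v
_, jvp_result = jvp(f, x, v=v)   # J v

print(torch.allclose(vjp_result, J.T @ v))  # True
print(torch.allclose(jvp_result, J @ v))    # True
```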

For more information, including performance notes, see the [docs for the functional API](https://pytorch.org/docs/stable/autograd.html#functional-higher-level-api).

## When exactly do you need, and not need, autograd?

Answer: 