* All modern deep learning frameworks offer *automatic differentiation* (autograd). As we pass data through each successive function, the framework builds a *computational graph* that tracks how each value depends on the others.
* To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule at each step.
* The computational algorithm for applying the chain rule is called *backpropagation*.
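
* As a minimal sketch of the points above, the following records a two-step computation and inspects the graph that PyTorch builds for it. The function $y = \sin(x^2)$ and the input value 3.0 are illustrative assumptions, not taken from the text.

```python
import torch

# Illustrative input (an assumption for this sketch)
x = torch.tensor(3.0, requires_grad=True)

# Forward pass: each operation adds a node to the computational graph
u = x * x          # u = x^2
y = torch.sin(u)   # y = sin(u)

# Each result remembers the operation that produced it
print(y.grad_fn)                 # node for the sin operation
print(y.grad_fn.next_functions)  # links back to the node that produced u

# Backward pass: walk the graph from y back to x, applying the chain rule
y.backward()

# Chain rule by hand: dy/dx = cos(x^2) * 2x
manual = torch.cos(x.detach() ** 2) * 2 * x.detach()
print(x.grad, manual)
```

* The two printed values should agree, since `backward()` applies the same chain rule numerically.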

In [1]:
import torch

## 1.1 A Simple Function

* Let's assume that we want to differentiate the function $y = 2x^\top x$ with respect to the column vector $x$.

In [19]:
x = torch.arange(4.0)
x

tensor([0., 1., 2., 3.])

* Before calculating the gradient of $y$ with respect to $x$, we need a place to store it.
* In general, we want to avoid allocating new memory every time we take a derivative, because deep learning requires computing derivatives with respect to the same parameters a great many times, and we might risk running out of memory.

In [20]:
x.requires_grad_(True)
x.grad  # the gradient is None by default

In [21]:
##calculating our function of x and assigning
##results to y
y = 2*torch.dot(x,x)
y

tensor(28., grad_fn=<MulBackward0>)

In [22]:
##taking gradient of y with respect to x by calling
## its backward method
y.backward()
x.grad

tensor([ 0.,  4.,  8., 12.])

In [23]:
##verifying the automatic gradient calculation and the expected result are
## identical
x.grad == 4*x

tensor([True, True, True, True])
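
* The expected result $4x$ can be derived by hand by writing the dot product out component by component:

$$
y = 2x^\top x = 2\sum_{i} x_i^2
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x_i} = 4x_i
\quad\Longrightarrow\quad
\nabla_x y = 4x
$$

* For $x = [0, 1, 2, 3]$ this gives $[0, 4, 8, 12]$, which matches the value stored in `x.grad`.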

In [24]:
x

tensor([0., 1., 2., 3.], requires_grad=True)

* Note that PyTorch does not automatically reset the gradient buffer when we record a new gradient.
* Instead, the new gradient is added to the already-stored gradient.
* This behavior comes in handy when we want to optimize the sum of multiple objective functions.
* To reset the gradient buffer, we can call `x.grad.zero_()` as follows:

In [25]:
x.grad.zero_() #reset gradients
y = x.sum()
y.backward()
x.grad


tensor([1., 1., 1., 1.])
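
* To see the accumulation itself, the sketch below repeats the two backward passes without calling `zero_()` in between. The variable names `first` and `accumulated` are introduced here only for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# First backward pass: gradient of 2 * x^T x is 4x
(2 * torch.dot(x, x)).backward()
first = x.grad.clone()

# Second backward pass WITHOUT resetting: the gradient of x.sum() (all ones)
# is added on top of the values already stored in x.grad
x.sum().backward()
accumulated = x.grad.clone()

print(first)        # 4x
print(accumulated)  # 4x + 1
```

* Forgetting the reset therefore silently mixes gradients from different computations, which is why training loops usually zero the gradients once per step.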

In [11]:
##extended example
#creating our tensors and enabling gradient tracking
x = torch.tensor(2.0,requires_grad=True)
y = torch.tensor(3.0,requires_grad=True)

## defining the function z (forward pass)
z = x**3 + 4*y

#calculating the gradients(backward pass)
z.backward()

#access the results
print(f"Gradient of x: {x.grad}")
print(f"Gradient of y: {y.grad}")

Gradient of x: 12.0
Gradient of y: 4.0


* You may wonder why we need to call `z.backward()` instead of simply reading `x.grad` and `y.grad`.
* This is because when you define `z = x**3 + 4*y`, PyTorch does not solve the calculus problem symbolically.
* Instead, it records only the *local* relationships: it knows that `z` was created by adding two things, `x**3` and `4*y`, and how each of those was produced.
* Before `z.backward()`: `x.grad` is `None`. The memory for the gradient has not even been allocated yet.
* After `z.backward()`: PyTorch starts at `z`, looks at its history, finds `x` and `y`, calculates the numerical values of the derivatives, and "deposits" those values into the `.grad` buckets of `x` and `y`.
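
* The before-and-after behaviour described above can be checked directly. This sketch reuses the same `x`, `y`, and `z`; the name `before` is introduced only to hold the state prior to the backward pass.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

# Forward pass: only the graph is recorded, no derivative is computed yet
z = x**3 + 4*y

before = (x.grad, y.grad)  # nothing has been deposited yet
print(before)
print(z.grad_fn)           # the node for the final addition

# Backward pass: derivatives are evaluated numerically and stored in .grad
z.backward()
print(x.grad)  # dz/dx = 3 * x^2 = 12
print(y.grad)  # dz/dy = 4
```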

## 1.2 Backward for Non-Scalar Variables


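* When $y$ is a vector, the most natural representation of the derivative of $y$ with respect to a vector $x$ is a matrix called the *Jacobian*, which contains the partial derivatives of each component of $y$ with respect to each component of $x$.
* Calling `backward()` on a non-scalar tensor does not build this matrix. PyTorch instead asks for a `gradient` vector $v$ of the same shape as $y$ and computes the product $v^\top J$. Passing a vector of ones sums the gradients of all components of $y$, which is equivalent to calling `y.sum().backward()`.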


In [27]:
# Re-create the vector x, since the extended example above rebound x to a scalar
x = torch.arange(4.0, requires_grad=True)
y = x*x
y.backward(gradient=torch.ones(len(y)))
x.grad

tensor([0., 2., 4., 6.])
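
* As a sketch of how this result relates to the Jacobian, the code below builds the full matrix with `torch.autograd.functional.jacobian` and compares it with what `backward` stores. The names `J` and `v` are introduced here for illustration.

```python
import torch
from torch.autograd.functional import jacobian

x = torch.arange(4.0, requires_grad=True)

# Full Jacobian of y = x * x: a 4x4 matrix with 2x on the diagonal
J = jacobian(lambda t: t * t, x)
print(J)

# backward(gradient=v) accumulates the vector-Jacobian product v^T J in x.grad
v = torch.ones(4)
y = x * x
y.backward(gradient=v)

print(x.grad)  # same values as v @ J
print(v @ J)
```

* Building the full Jacobian costs far more memory than the vector-Jacobian product, which is one reason `backward` works with a `gradient` vector instead.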

## 1.3 Detaching Computation

* Suppose we wish to move some calculations outside of the recorded computational graph.
* For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient.
* Therefore we need to `detach` the respective computational graph from the final result.
* Suppose we have `z = x*y` and `y = x*x`, but we want to focus on the direct influence of `x` on `z` rather than the influence conveyed via `y`.
* In this case we can create a variable `u` that takes the same value as `y` but whose *provenance* (how it was created) has been wiped out.
* So `u` has no ancestors in the graph, and gradients do not flow through `u` to `x`.
* Taking the gradient of `z = x*u` therefore yields `u`, not `3*x*x` as you might have expected from `z = x*x*x`.

In [28]:
x.grad.zero_()
y = x*x
u = y.detach()
z = u*x

z.sum().backward()
x.grad == u

tensor([True, True, True, True])

In [29]:
##detaching y did not destroy its own graph, so we can still
##calculate the gradient of y with respect to x
x.grad.zero_()
y.sum().backward()
x.grad == 2*x

tensor([True, True, True, True])
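
* To make the effect of `detach` concrete, this sketch computes the gradient of the same product twice, once through the detached copy `u` and once through `y` itself. The names `grad_detached` and `grad_full` are introduced here for illustration.

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()  # same values as y, but cut off from the graph

# The detached copy has no history and does not track gradients
print(u.requires_grad, u.grad_fn)

# Through u: x only influences z directly, so dz/dx = u = x^2
(u * x).sum().backward()
grad_detached = x.grad.clone()

# Through y: x also influences z via y, so dz/dx = 3 * x^2
x.grad.zero_()
(y * x).sum().backward()
grad_full = x.grad.clone()

print(grad_detached)
print(grad_full)
```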

## 1.4 Gradients and Python Control Flow


* Programming offers us a lot more freedom in how we compute results than a single closed-form function does.
* We can make them depend on auxiliary variables or condition choices on intermediate results.
* One benefit of automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Python control flow (conditionals, loops, and function calls), we can still calculate the gradient of the resulting variable.

In [30]:
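def f(a):
  b = a * 2
  # Caution: if |b| <= 1 here, repeated squaring never pushes the norm
  # past 1000 and this loop does not terminate
  while b.norm() < 1000:
    b = b ** 2
  # The branch taken depends on the value of b, not only on its shape
  if b.sum() > 0:
    c = b
  else:
    c = 100 * b
  return c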


In [31]:
##calling the function by passing in a random value, as input
a = torch.randn(size=(),requires_grad=True)
d = f(a)
d.backward()

* Since the input is random, we do not know in advance how many loop iterations will run or which branch will be taken, and therefore what form the computational graph will take.


In [33]:
a.grad == d / a

tensor(False)

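
* The comparison returns `False` because this `f` is not piecewise linear in `a`. Each pass through the loop squares `b`, so for a fixed number $n$ of iterations the output is proportional to $(2a)^{2^n}$, and its gradient is $2^n \cdot d / a$ rather than $d / a$. The sketch below checks this relation. `f_counting` is a hypothetical variant of `f` that also returns the iteration count, and the fixed input 1.5 is an assumption chosen so that the loop terminates.

```python
import torch

def f_counting(a):
    """Hypothetical variant of f that also reports the number of squarings."""
    b = a * 2
    n = 0
    while b.norm() < 1000:
        b = b ** 2
        n += 1
    c = b if b.sum() > 0 else 100 * b
    return c, n

# Fixed input (an assumption): |2a| > 1, so repeated squaring grows past 1000
a = torch.tensor(1.5, requires_grad=True)
d, n = f_counting(a)
d.backward()

# For n squarings, d is proportional to (2a)^(2^n), so the gradient is 2^n * d / a
expected = (2 ** n) * d.detach() / a.detach()
print(n, a.grad, expected)
```

* The relation `a.grad == d / a` would hold only if the loop body kept `f` linear in `a`, for example by doubling `b` instead of squaring it.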