# SeokHyeon Chin, 2016320179

In machine learning, we train models, updating them successively so that they get better and better as they see more and more data. Usually, getting better means minimizing a loss function.
With neural networks, we typically choose loss functions that are differentiable with respect to our parameters.
The `autograd` package expedites this work by automatically calculating derivatives.

In [1]:
import torch
from torch.autograd import Variable

## A Simple Example

As a first example, we differentiate the mapping $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to the column vector $\mathbf{x}$.

In [2]:
x = Variable(torch.arange(4, dtype=torch.float32).reshape((4, 1)), requires_grad=True)
print(x) # a 4x1 column vector of float32 values

# Discussion: what happens if requires_grad is omitted or set to False?

tensor([[0.],
        [1.],
        [2.],
        [3.]], requires_grad=True)


Once we compute the gradient of `y` with respect to `x`, we need a place to store it. Passing ``requires_grad=True`` tells autograd to track the operations applied to the tensor; after `backward()` is called, the gradient is stored in `x.grad`.

Now we compute `y`, and PyTorch builds the computation graph on the fly. Autograd is a reverse-mode automatic differentiation system. Conceptually, as you execute operations, autograd records every operation that created the data in a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. By tracing this graph from roots to leaves, it computes the gradients automatically using the chain rule.
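
To make the "record a graph, then walk it from roots to leaves" idea concrete, here is a minimal pure-Python sketch of reverse-mode differentiation on scalars. The `Value` class and its attributes are hypothetical helpers written for this illustration; they are not part of PyTorch and only mimic the idea, not the actual implementation.

```python
class Value:
    """A scalar that remembers how it was computed (hypothetical helper, not a PyTorch class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # inputs of the operation that produced this value
        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order the graph so that every node comes after the nodes it was computed from
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the root back to the leaves, applying the chain rule at each step
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad


# y = 2 * (x0^2 + x1^2 + x2^2 + x3^2), the scalar form of y = 2 x^T x
xs = [Value(v) for v in [0.0, 1.0, 2.0, 3.0]]
total = xs[0] * xs[0]
for x_i in xs[1:]:
    total = total + x_i * x_i
y = Value(2.0) * total
y.backward()

grads = [x_i.grad for x_i in xs]
print(y.data, grads)
```

The leaves (`xs`) end up holding the gradient, just as `x.grad` does in the PyTorch cells that follow.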

In [3]:
Variable # Since PyTorch 0.4, Variable is deprecated: it simply returns a Tensor, and any tensor with requires_grad=True is recorded by autograd

torch.autograd.variable.Variable

In [4]:
y = 2*torch.mm(x.t(),x) # y has shape (1, 1), i.e. a single element
print(y)

tensor([[28.]], grad_fn=<MulBackward0>)


In [5]:
y.backward() # computes the gradient of y with respect to x automatically

Each tensor has a `grad_fn` attribute that references the `torch.autograd.Function` used to compute its backward step. For leaf tensors created directly by the user, such as `x`, it is `None`:

In [6]:
print("x.grad:", x.grad) # y_gradient = 4x, which makes (0,4,8,12)
print("x.grad_fn:", x.grad_fn)
print("y.grad_fn:", y.grad_fn)

x.grad: tensor([[ 0.],
        [ 4.],
        [ 8.],
        [12.]])
x.grad_fn: None
y.grad_fn: <MulBackward0 object at 0x00000250B45C9488>


The gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$ with respect to $\mathbf{x}$ should be $4\mathbf{x}$. Now let's verify that the gradient produced is correct.
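
As a quick check on the expected value, the product can be written out in components. This is a short derivation from the definition of $y$ above; the index $i$ is introduced here only for illustration.

$$y = 2\mathbf{x}^{\top}\mathbf{x} = 2\sum_{i} x_i^2 \quad\Longrightarrow\quad \frac{\partial y}{\partial x_i} = 4x_i \quad\Longrightarrow\quad \nabla_{\mathbf{x}}\, y = 4\mathbf{x}.$$

For $\mathbf{x} = (0, 1, 2, 3)^{\top}$ this gives $(0, 4, 8, 12)^{\top}$, which matches the printed `x.grad`.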

In [7]:
print((x.grad - 4*x).norm().item() == 0)
print(x.grad) # the gradient of y with respect to x is stored in x.grad

True
tensor([[ 0.],
        [ 4.],
        [ 8.],
        [12.]])


In [8]:
print((x.grad - 4*x)) # Discussion: What does 'grad_fn=<SubBackward0>', 'grad_fn=<NormBackward0>' mean?

tensor([[0.],
        [0.],
        [0.],
        [0.]], grad_fn=<SubBackward0>)


In [9]:
print((x.grad - 4*x).norm())

tensor(0., grad_fn=<NormBackward0>)


In [10]:
print(y)

tensor([[28.]], grad_fn=<MulBackward0>)


In [11]:
x = Variable(torch.arange(4, dtype=torch.float32).reshape((4, 1)))
y = 2*torch.mm(x.t(),x)
print(y)
# y.backward() would raise:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

tensor([[28.]])


In [12]:
x = Variable(torch.arange(4, dtype=torch.float32).reshape((4, 1)), requires_grad=False)
y = 2*torch.mm(x.t(),x)
print(y)
# y.backward() would raise:
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

tensor([[28.]])


If `requires_grad=True` is omitted, or the keyword is explicitly set to `False`, autograd does not track the tensor. Calling `backward` on a result computed only from such tensors raises the `RuntimeError` shown in the comments above.
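
The failure mode and one way out of it can be reproduced in a few lines. This is a small sketch rather than part of the original notebook: it uses a plain tensor instead of `Variable`, and the variable `error_message` is introduced here only to capture the exception text.

```python
import torch

# requires_grad defaults to False, so autograd records nothing for this tensor
x = torch.arange(4, dtype=torch.float32).reshape(4, 1)
y = 2 * torch.mm(x.t(), x)

try:
    y.backward()
    error_message = None
except RuntimeError as err:
    error_message = str(err)
print(error_message)

# Tracking can be switched on afterwards for a leaf tensor,
# but y has to be recomputed so that the new graph is recorded
x.requires_grad_(True)
y = 2 * torch.mm(x.t(), x)
y.backward()
print(x.grad.flatten().tolist())
```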

## Training Mode and Evaluation Mode

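A model (any `torch.nn.Module`) switches to evaluation mode when `model.eval()` is called and back to training mode when `model.train()` is called. The mode changes the behavior of layers such as dropout and batch normalization; it does not by itself turn gradient tracking on or off.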
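
A small sketch of the two modes, assuming a toy model made of a linear layer followed by dropout (the model, the input `x`, and the names `out1`, `out2`, `out3` are made up for this illustration):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

print(model.training)            # modules start in training mode

model.eval()                     # dropout now acts as the identity
out1, out2 = model(x), model(x)
print(torch.equal(out1, out2))   # no random masking in evaluation mode
print(out1.requires_grad)        # eval() does not switch off autograd

with torch.no_grad():            # this context is what disables gradient tracking
    out3 = model(x)
print(out3.requires_grad)

model.train()                    # back to training mode
print(model.training)
```

The point of the sketch is that `eval()` and `torch.no_grad()` are independent switches: the first changes layer behavior, the second stops graph recording.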

## Computing the Gradient of Python Control Flow

Even if the computational graph of the function contains Python's control flow, we can still find the gradient of a variable. Consider the following program:

In [13]:
def f(a):
    b = a * 2
    while b.norm().item() < 1000:
        b = b * 2
    if b.sum().item() > 0:
        c = b
    else:
        c = 100 * b
    return c

Note that the number of iterations of the while loop and the branch taken by the conditional statement (if/else) depend on the value of `a`. PyTorch needs no explicit recording step: once `a.requires_grad` is `True`, the operations that actually run are recorded as `f` executes, and calling `backward` on the result computes the gradient.

In [14]:
a = torch.randn(size=(1,)) # create random variable
print(a)
a.requires_grad=True # to use Autograd
d = f(a)
d.backward()

tensor([0.1136])


In [15]:
print("a.grad:", a.grad)
print("a.grad_fn:", a.grad_fn)
print("d.grad_fn:", d.grad_fn)

a.grad: tensor([16384.])
a.grad_fn: None
d.grad_fn: <MulBackward0 object at 0x00000250B45EB788>


In [16]:
print(a.grad == (d / a))

tensor([True])
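
Why does `a.grad == d / a` hold? For a fixed input, the loop runs a fixed number of times and one branch is taken, so near that input `f` is just multiplication by a constant, and the derivative is that constant. The sketch below checks this with plain floats and a central finite difference; `f_scalar` is a hypothetical scalar rewrite of `f`, the step `h` is an arbitrary choice, and the input is the value printed above.

```python
def f_scalar(a):
    """Scalar version of f using plain floats (hypothetical helper for this check)."""
    b = a * 2
    while abs(b) < 1000:
        b = b * 2
    return b if b > 0 else 100 * b


a0 = 0.1136                      # the random value printed by the notebook
slope = f_scalar(a0) / a0        # f is linear around a0, so this is the derivative

# Central finite difference as an independent numerical estimate
h = 1e-4
finite_diff = (f_scalar(a0 + h) - f_scalar(a0 - h)) / (2 * h)

print(slope)                     # 16384.0
print(abs(finite_diff - slope) < 1e-3)
```

The slope equals the `a.grad` value reported by autograd. As with the original `f`, an input of exactly zero would never leave the while loop, so the sketch assumes a nonzero input.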


## Head gradients and the chain rule

Sometimes when we call the backward method, e.g. `y.backward()`, where
`y` is a function of `x` we are just interested in the derivative of
`y` with respect to `x`. Mathematicians write this as
$\frac{dy(x)}{dx}$. At other times, we may be interested in the
gradient of `z` with respect to `x`, where `z` is a function of `y`,
which in turn, is a function of `x`. That is, we are interested in
$\frac{d}{dx} z(y(x))$. Recall that by the chain rule

$$\frac{d}{dx} z(y(x)) = \frac{dz(y)}{dy} \frac{dy(x)}{dx}.$$

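So, when ``y`` is part of a larger function ``z`` and we want ``x.grad`` to store $\frac{dz}{dx}$, we can pass in the *head gradient* $\frac{dz}{dy}$ as an input to ``backward()``. For a scalar output the argument can be omitted, which is equivalent to a head gradient of 1. For a non-scalar output PyTorch requires the head gradient to be passed explicitly; ``torch.ones_like(y)`` gives the same result as summing the output before calling ``backward()``.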

In [17]:
x = Variable(torch.tensor([[0.],[1.],[2.],[3.]]), requires_grad=True)
y = x * 2
z = y * x

head_gradient = torch.tensor([[10], [1.], [.1], [.01]]) # Discussion: meaning of this?
z.backward(head_gradient)
print(x.grad)

tensor([[0.0000],
        [4.0000],
        [0.8000],
        [0.1200]])


In [18]:
x = Variable(torch.tensor([[0.],[1.],[2.],[3.]]), requires_grad=True)
y = x * 2
z = y * x

head_gradient = torch.ones_like(y) # all-ones head gradient, equivalent to z.sum().backward()
z.backward(head_gradient)
print(x.grad)

tensor([[ 0.],
        [ 4.],
        [ 8.],
        [12.]])
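
Both results can be reproduced by hand as a vector-Jacobian product. The sketch below uses plain Python lists and relies on the fact that here $z_i = 2x_i^2$, so the Jacobian is diagonal; the names `J`, `vjp`, `x_vals`, and `v` are introduced only for this illustration.

```python
x_vals = [0.0, 1.0, 2.0, 3.0]
v = [10.0, 1.0, 0.1, 0.01]       # the head gradient used in the first cell above

# z_i = 2 * x_i * x_i, so J[i][j] = dz_i/dx_j is 4 * x_i on the diagonal and 0 elsewhere
n = len(x_vals)
J = [[4.0 * x_vals[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def vjp(jacobian, head):
    """Return J^T head, i.e. sum_i J[i][j] * head[i] for every column j."""
    size = len(head)
    return [sum(jacobian[i][j] * head[i] for i in range(size)) for j in range(size)]


weighted = vjp(J, v)             # matches x.grad from z.backward(head_gradient)
plain = vjp(J, [1.0] * n)        # all-ones head gradient

print([round(g, 4) for g in weighted])   # [0.0, 4.0, 0.8, 0.12]
print(plain)                             # [0.0, 4.0, 8.0, 12.0]
```

Each entry of the head gradient simply rescales the contribution of the corresponding output element before the contributions are summed.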


## Summary
* PyTorch provides an `autograd` package to automate differentiation.
* PyTorch's `autograd` package can differentiate general imperative programs, including ones with Python control flow.
* The running modes of PyTorch include the training mode and the evaluation mode.

## Discussion
* To call `.backward()` on a result, at least one of the tensors it was computed from must have `requires_grad=True`. Otherwise, a `RuntimeError` occurs.
* What do `grad_fn=<SubBackward0>` and `grad_fn=<NormBackward0>` mean? Once a tensor has `requires_grad=True`, PyTorch tracks every operation applied to it and records, in the `grad_fn` of each result, the backward function of the operation that produced it (here a subtraction and a norm).
* `head_gradient = torch.tensor([[10], [1.], [.1], [.01]])`: the reason for this usage is described below.
* The Jacobian matrix $J$ represents the gradient of $f(X)$ with respect to $X$.
  Suppose $X = [x_1, x_2, \ldots, x_n]$ is a gradient-enabled PyTorch tensor (let this be the weights of some machine learning model).
  $X$ undergoes some operations to form a vector $Y = f(X) = [y_1, y_2, \ldots, y_m]$.
  $Y$ is then used to calculate a scalar loss $l$. Suppose a vector $v$ happens to be the gradient of the scalar loss $l$ with respect to the vector $Y$:
  $$v = \left[\frac{\partial l}{\partial y_1}, \frac{\partial l}{\partial y_2}, \ldots, \frac{\partial l}{\partial y_m}\right]^{\top}$$
  The vector $v$ is called the `grad_tensor` and is passed to `backward()` as an argument (the `gradient` parameter of `Tensor.backward`).
  To get the gradient of the loss $l$ with respect to the weights $X$, the transposed Jacobian matrix is multiplied by the vector $v$:
  $$\frac{\partial l}{\partial X} = J^{\top} v, \qquad J_{ij} = \frac{\partial y_i}{\partial x_j}$$
  Because autograd computes this product instead of requiring a scalar output, external gradients can be fed in easily even for non-scalar outputs.
* Source: https://towardsdatascience.com/pytorch-autograd-understanding-the-heart-of-pytorchs-magic-2686cd94ec95