# Autograd package in PyTorch

The goal of this notebook is to play around with this package to get a better understanding of how PyTorch does automatic differentiation.

In [1]:
import numpy as np
import torch
from torch.autograd import Variable, grad

### Variables

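To track operations for back propagation and compute gradients, we wrap our tensors as variables with `requires_grad=True`. (Since PyTorch 0.4, `Variable` is deprecated: a plain tensor created with `requires_grad=True` behaves the same way.)

Here, we create a simple scalar value $x=4$.  Even though it is a single number, we still need to make it a tensor.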

In [2]:
x = Variable(torch.FloatTensor([4]), requires_grad=True)
print(x)
print(x.grad)

tensor([4.], requires_grad=True)
None


### Single variable polynomial

## $z = x^{3} + 2x^{2} + 8x$

We will use a lambda function for this polynomial.

In [3]:
func_1 = lambda x: torch.pow(x, 3) + 2*torch.pow(x, 2) + 8*x

### First Derivative

Let's manually do the math to get the first derivative and evaluate it for $x=4$.

### $\frac{dz}{dx} = 3x^{2} + 4x + 8$

### $\frac{dz}{dx}\Bigr|_{\substack{x=4}} = 3 \cdot 4^{2} + 4 \cdot 4 + 8 = 72$
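
Before turning to PyTorch, the hand-derived value can be sanity-checked numerically. The following is a minimal sketch in plain Python using a central finite difference; the helper name `central_difference` and the step size `h` are choices made for this illustration, not part of the notebook.

```python
# Central finite-difference check of dz/dx at x = 4 (plain Python, no PyTorch).

def func(x):
    """The polynomial z = x^3 + 2x^2 + 8x."""
    return x**3 + 2 * x**2 + 8 * x

def central_difference(f, x, h=1e-3):
    """Approximate f'(x) as (f(x + h) - f(x - h)) / (2h). Hypothetical helper."""
    return (f(x + h) - f(x - h)) / (2 * h)

approx = central_difference(func, 4.0)   # numerical estimate
exact = 3 * 4.0**2 + 4 * 4.0 + 8         # hand-derived 3x^2 + 4x + 8 at x = 4
print(approx, exact)
```

The two numbers should agree to several decimal places; the small difference is the truncation error of the finite-difference formula.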

### Now let's see if PyTorch gets the same results

Generate a new tensor $z$ that is the result of our function.

We then call `backward()` on this result tensor to run back propagation.

This sets the `grad` attribute of the input variable $x$ to the first derivative evaluated at $x=4$.

In [4]:
z = func_1(x)
z.backward()
print(x.grad)

tensor([72.])


## Computation Graph

Let's write our polynomial as composite functions to show how it would look as a computational graph.

We will also add the derivative of each intermediate node $w_i$ with respect to $x$, written $\dot{w_i}$.

## $z = x^{3} + 2x^{2} + 8x$

$w_1 = x \quad \rightarrow \quad \dot{w_1} = 1$

$w_2 = {w_1}^{3} \quad \rightarrow \quad \dot{w_2} = 3{w_1}^2$

$w_3 = {w_1}^{2} \quad \rightarrow \quad \dot{w_3} = 2{w_1}$

$w_4 = 2{w_3} \quad \rightarrow \quad \dot{w_4} = 2\dot{w_3} \quad$ *(chain rule)*

$w_5 = 8{w_1} \quad \rightarrow \quad \dot{w_5} = 8$

$w_6 = w_2 + w_4 + w_5 \quad \rightarrow \quad \dot{w_6} = \dot{w_2} + \dot{w_4} + \dot{w_5}$


Now let's find the numerical derivative for each node using $x=4$:

$\dot{w_2} = 3x^2 = 48$

$\dot{w_3} = 2x = 8$

$\dot{w_4} = 2\dot{w_3} = 16$

$\dot{w_5} = 8$

$\dot{w_6} = \dot{w_2} + \dot{w_4} + \dot{w_5} = 48 + 16 + 8 = 72$
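
The node-by-node evaluation above can be reproduced in a few lines of plain Python. This is only a sketch of the bookkeeping, not how PyTorch is implemented: each node carries a pair (value, derivative with respect to $x$), and the $8x$ term is taken as $w_5 = 8w_1$.

```python
# Walk the computation graph for z = x^3 + 2x^2 + 8x at x = 4,
# carrying each node's value and its derivative with respect to x.
x = 4.0

w1, dw1 = x, 1.0                           # w1 = x
w2, dw2 = w1**3, 3 * w1**2 * dw1           # w2 = w1^3
w3, dw3 = w1**2, 2 * w1 * dw1              # w3 = w1^2
w4, dw4 = 2 * w3, 2 * dw3                  # w4 = 2 * w3 (chain rule)
w5, dw5 = 8 * w1, 8 * dw1                  # w5 = 8 * w1
w6, dw6 = w2 + w4 + w5, dw2 + dw4 + dw5    # w6 = w2 + w4 + w5

print(w6, dw6)  # 128.0 72.0
```

Every derivative here is a number, never a formula, which is the point of the exercise: only the local derivative rule of each operation is needed.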

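#### This is how the gradient is built up from the computation graph.

PyTorch does not do symbolic calculus; it has a set of known derivative rules for most math functions.

During the forward pass, it records each operation and saves the values needed to evaluate those rules, so every gradient is obtained as a numeric value rather than a formula.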




##### When backward is run, it is effectively doing a matrix multiplication

Conceptually, the gradient is a matrix (the "Jacobian").  This is not a symbolic Jacobian, but an array of the numeric values of the derivatives, like the ones we computed above with our computation graph.  It is a matrix because there can be more than one input variable and more than one output value. You can think of the outputs as a vector, which makes our function a vector-valued function.

To get the values, you pass in a vector, and this vector is multiplied by the Jacobian matrix (a vector-Jacobian product).

This looks odd since you create a vector of ones.  A vector of ones weights every output equally, so for functions like ours, where each output depends only on its own input, the gradients come back unchanged.  You can weight the vector to scale the output, but since we want to see the raw gradients, we use a vector of ones.

*Note: If the output is a single value (scalar), we can use the default parameter of the backward function and not build the vector of ones.*

In [5]:
# reset the grad and find z again
x.grad.data.zero_()
z = func_1(x) 

vector_of_ones = torch.ones(x.shape[0])
z.backward(vector_of_ones)
print(x.grad)

tensor([72.])
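
To see what the vector passed to `backward()` actually does, here is a small sketch with a non-uniform vector. The function $z = x^2$ and the weight values are made up for this illustration. Since each output depends only on its own input, the result is the elementwise gradient $2x_i$ scaled by the matching weight.

```python
import torch

# Three inputs and an elementwise function, so dz_i/dx_i = 2 * x_i.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
z = x ** 2

# Hypothetical weights: instead of ones, scale each output's contribution.
weights = torch.tensor([1.0, 0.5, 2.0])
z.backward(weights)

# Each entry is weights[i] * 2 * x[i].
print(x.grad)  # tensor([ 2.,  2., 12.])
```

With a vector of ones in place of `weights`, the same call would return the unscaled values $2x_i$.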


##### The computational graph is released automatically, so running backward() twice raises an error

In [6]:
try:
    z.backward()
except RuntimeError as err:
    print(err)

Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.


##### If we tell PyTorch to retain the computation graph, we can run backward() multiple times

But all this does is accumulate (sum) the gradients into `x.grad` on every call.

72

72+72=144

72+72+72=216

In [7]:
# reset the grad and find z again
x.grad.data.zero_()
z = func_1(x)

z.backward(retain_graph = True)
print(x.grad)
z.backward(retain_graph = True)
print(x.grad)
z.backward()
print(x.grad)

tensor([72.])
tensor([144.])
tensor([216.])
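
If accumulation is not wanted, one alternative is the `grad` function from `torch.autograd` (imported at the top of the notebook but not used so far). The sketch below assumes a fresh 0-dimensional tensor rather than the notebook's `x`; `torch.autograd.grad` returns the gradient as a tuple and leaves the `.grad` attribute untouched.

```python
import torch

# A fresh scalar input for this sketch (not the notebook's x).
x = torch.tensor(4.0, requires_grad=True)
z = x**3 + 2 * x**2 + 8 * x

# grad() returns a tuple with one gradient per input.
(dz_dx,) = torch.autograd.grad(z, x, retain_graph=True)
(dz_dx_again,) = torch.autograd.grad(z, x)

# Both calls give the same value, and nothing was accumulated on x.
print(dz_dx, dz_dx_again, x.grad)
```

Because nothing is written to `x.grad`, there is no need to zero it between calls.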


## Let's try a different function that is not a polynomial

### $z = \sin(x)$

$\frac{dz}{dx} = \cos(x)$

$\frac{dz}{dx}\Bigr|_{\substack{x=4}} = \cos(4) \approx -0.6536$

In [8]:
# reset the grad (z is rebuilt below with the new function)
x.grad.data.zero_()

func_2 = lambda x: torch.sin(x)
z = func_2(x)

z.backward()
print(x.grad)

tensor([-0.6536])


### Multiple values, but still with a single variable

We will use the same function as above, $z = \sin(x)$.

But now instead of passing in just one value for x, we will pass in an array of 10 values.

In [9]:
X = Variable(2*np.pi * torch.rand(([10])), requires_grad=True)
X

tensor([5.6419, 1.1369, 4.7996, 0.8205, 0.4615, 0.4596, 2.5004, 1.2695, 2.5845,
        0.4513], requires_grad=True)

##### We are now required to manually build and pass in the vector of ones since the output is no longer a scalar

In [10]:
Z = func_2(X)
print(f'Z:\n {Z}')
vector_of_ones = torch.ones(X.shape)
Z.backward(vector_of_ones)
print(f'\nPyTorch Gradient:\n {X.grad.data.numpy()}')
# since we know Z' = cos(X), we can verify the results from autograd
print(f'\nValidate with cosine:\n {torch.cos(X).data.numpy()}')

Z:
 tensor([-0.5982,  0.9073, -0.9962,  0.7315,  0.4453,  0.4436,  0.5981,  0.9549,
         0.5287,  0.4362], grad_fn=<SinBackward>)

PyTorch Gradient:
 [ 0.80134296  0.420412    0.08707973  0.6818824   0.8953653   0.896241
 -0.80138785  0.29677498 -0.84881455  0.89987385]

Validate with cosine:
 [ 0.80134296  0.420412    0.08707973  0.6818824   0.8953653   0.896241
 -0.80138785  0.29677498 -0.84881455  0.89987385]
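
An equivalent way to get the same gradients, sketched below, is to reduce the output to a scalar with `sum()` and call `backward()` with no argument. Summing the outputs is the same as weighting each of them by one. The input values here are arbitrary and chosen only for the illustration.

```python
import torch

# Arbitrary sample inputs for this sketch.
X = torch.tensor([0.0, 0.5, 1.0, 2.0, 3.0], requires_grad=True)

# Option 1: reduce to a scalar, then use the default backward().
Z = torch.sin(X)
Z.sum().backward()
grad_from_sum = X.grad.clone()

# Option 2: keep the vector output and pass a vector of ones.
X.grad.zero_()
Z = torch.sin(X)                      # rebuild the graph, the first one was freed
Z.backward(torch.ones_like(Z))
grad_from_ones = X.grad.clone()

print(torch.allclose(grad_from_sum, grad_from_ones))  # True
```

Both options give $\cos(X)$ here because each output depends only on its own input.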


## 2 scalar variables

We already defined $x=4$.

Now let's define $y=2$.

In [11]:
# reset the grad on x
x.grad.data.zero_()

y = Variable(torch.FloatTensor([2]), requires_grad=True)
print(y)
print(y.grad)

tensor([2.], requires_grad=True)
None


## Polynomial with 2 variables

### $z = x^{3} + y^{2} + ax + b$

In [12]:
func_3 = lambda x, y, a, b: torch.pow(x, 3) + torch.pow(y, 2) + a*x + b
z = func_3(x, y, 1, 6)

$\frac{\partial z}{\partial x} = 3x^{2} + a$

$\frac{\partial z}{\partial x}\Bigr|_{\substack{x=4, a=1}} = 3 \cdot 4^{2} + 1 = 49$

$\frac{\partial z}{\partial y} = 2y$

$\frac{\partial z}{\partial y}\Bigr|_{\substack{y=2}} = 2 \cdot 2 = 4$

In [13]:
z.backward()
print(x.grad)
print(y.grad)

tensor([49.])
tensor([4.])


## Sine function with 2 scalar variables

### $z = \sin(x) + \sin(y)$

$\frac{\partial z}{\partial x}\Bigr|_{\substack{x=4}} = \cos(4) \approx -0.6536$

$\frac{\partial z}{\partial y}\Bigr|_{\substack{y=2}} = \cos(2) \approx -0.4161$

In [14]:
# reset the grad on x and y
x.grad.data.zero_()
y.grad.data.zero_()

func_4 = lambda x, y: torch.sin(x) + torch.sin(y)
z = func_4(x, y)
z.backward()
print(x.grad)
print(y.grad)

tensor([-0.6536])
tensor([-0.4161])


## Now let's use 2 variables, each with multiple values

We will build a matrix (a 2-D tensor) with 10 rows and 2 columns.

The first column will be the X values and the second column will be the Y values

In [15]:
XY = Variable(2*np.pi * torch.rand(([10, 2])), requires_grad=True)
XY

tensor([[5.9066, 1.1849],
        [6.0019, 1.2083],
        [3.8134, 4.7397],
        [2.1871, 2.9670],
        [0.4878, 4.3015],
        [2.5796, 3.2423],
        [1.0124, 5.7254],
        [4.4697, 4.1561],
        [5.1514, 6.0768],
        [0.3598, 1.5586]], requires_grad=True)

In [16]:
func_5 = lambda XY: torch.sin(XY[:,0]) + torch.sin(XY[:,1])
Z = func_5(XY)
Z

tensor([ 0.5587,  0.6574, -1.6221,  0.9898, -0.4481,  0.4324,  0.3188, -1.8199,
        -1.1101,  1.3520], grad_fn=<AddBackward0>)

### The gradient will also have 2 columns

The first column is the X gradients and the second column is the Y gradients.

In [17]:
vector_of_ones = torch.ones(XY.shape[0]) # Z has one value per row, so the vector only needs the row count (shape[0])
Z.backward(vector_of_ones)

print(f'\nPyTorch Gradients:\n {XY.grad.data.numpy()}')
# since we know each partial derivative is a cosine, we can verify the results from autograd
print(f'\nValidate with cosine:\n {torch.cos(XY).data.numpy()}')


PyTorch Gradients:
 [[ 0.9299176   0.37643185]
 [ 0.9606921   0.3545953 ]
 [-0.78267014  0.02728887]
 [-0.578051   -0.9847898 ]
 [ 0.88338107 -0.39942288]
 [-0.8461944  -0.9949382 ]
 [ 0.5298115   0.8484107 ]
 [-0.24032214 -0.5280292 ]
 [ 0.42501867  0.9787835 ]
 [ 0.9359573   0.01222623]]

Validate with cosine:
 [[ 0.9299176   0.37643185]
 [ 0.9606921   0.3545953 ]
 [-0.78267014  0.02728887]
 [-0.578051   -0.9847898 ]
 [ 0.88338107 -0.39942288]
 [-0.8461944  -0.9949382 ]
 [ 0.5298115   0.8484107 ]
 [-0.24032214 -0.5280292 ]
 [ 0.42501867  0.9787835 ]
 [ 0.9359573   0.01222623]]


## Activation Functions

### $z = \tanh(x)$


### $\frac{dz}{dx} = 1 - \tanh^2(x)$

In [18]:
X = Variable(2 * torch.rand(([10])) - 1, requires_grad=True)
X

tensor([-0.1015,  0.4655,  0.7771, -0.6037, -0.8693,  0.2803, -0.9955, -0.4249,
         0.3446, -0.9768], requires_grad=True)

In [19]:
func_6 = lambda X: torch.tanh(X)
Z = func_6(X)
Z

tensor([-0.1012,  0.4345,  0.6510, -0.5397, -0.7010,  0.2732, -0.7597, -0.4010,
         0.3316, -0.7517], grad_fn=<TanhBackward>)

In [20]:
vector_of_ones = torch.ones(X.shape) 
Z.backward(vector_of_ones)

print(f'\nPyTorch Gradients:\n {X.grad.data.numpy()}')
# since we know Z' = 1 - tanh^2(X), we can verify the results from autograd
print(f'\nValidate with 1-tanh^2:\n {(1 - torch.pow(torch.tanh(X), 2)).data.numpy()}')


PyTorch Gradients:
 [0.9897621  0.8111834  0.57615125 0.7087302  0.5085927  0.9253647
 0.42288297 0.8391874  0.8900712  0.43500286]

Validate with 1-tanh^2:
 [0.9897621  0.8111834  0.57615125 0.7087302  0.5085927  0.9253647
 0.42288297 0.8391874  0.8900712  0.43500286]
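
The closed form $1 - \tanh^2(x)$ used for validation can itself be checked without PyTorch. The sketch below compares it with a central finite difference at a few sample points in $(-1, 1)$; the points, the step size, and the helper names are assumptions made for this illustration.

```python
import math

def tanh_grad(x):
    """Closed-form derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

def central_difference(f, x, h=1e-5):
    """Approximate f'(x) as (f(x + h) - f(x - h)) / (2h). Hypothetical helper."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hypothetical sample inputs in the same range as the notebook's X.
points = [-0.9, -0.3, 0.0, 0.4, 0.8]

# Largest gap between the closed form and the numerical estimate.
max_error = max(
    abs(tanh_grad(p) - central_difference(math.tanh, p)) for p in points
)
print(max_error)
```

A very small `max_error` suggests the closed form and the numerical estimate agree, which is the same conclusion the autograd comparison reaches.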
