# 2 Variables and gradients 
### 2.1 Variables
    - A variable wraps a Tensor
    - Variables allow a Tensor to accumulate gradients

In [1]:
import torch
from torch.autograd import Variable

In [3]:
a = Variable(torch.ones(2, 2), requires_grad=True)
a

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [62]:
# Not a variable
torch.ones(2, 2)


 1  1
 1  1
[torch.FloatTensor of size 2x2]
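
A note for readers on newer releases: PyTorch 0.4 merged `Variable` into `Tensor`, so a plain tensor can track gradients by itself. The following is a minimal sketch, assuming PyTorch 0.4 or later; the printed representation on such versions differs from the outputs shown in this notebook.

```python
import torch

# Assumes PyTorch >= 0.4, where a plain tensor can track gradients itself.
a = torch.ones(2, 2, requires_grad=True)   # gradient-tracking tensor
t = torch.ones(2, 2)                       # plain tensor, no gradient tracking

print(a.requires_grad)
print(t.requires_grad)
```

The `Variable(...)` calls used in the rest of this section express the same idea with the older wrapper.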

In [63]:
# Behaves similarly to Torch Tensors
b = Variable(torch.ones(2, 2), requires_grad=True)
print(a + b)
print(torch.add(a, b))

Variable containing:
 2  2
 2  2
[torch.FloatTensor of size 2x2]

Variable containing:
 2  2
 2  2
[torch.FloatTensor of size 2x2]



In [64]:
print(a * b)
print(torch.mul(a, b))

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]



### 2.2 Gradients

#### So what exactly is requires_grad?
    - Setting requires_grad=True allows gradients to be calculated with respect to (w.r.t.) the variable
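
To make the effect of the flag concrete, here is a small sketch that compares a tracked tensor with an untracked one. It assumes PyTorch 0.4 or later, and the names `w`, `c` and `out` are made up for this illustration.

```python
import torch

# Assumes PyTorch >= 0.4 (tensors carry requires_grad directly).
w = torch.ones(2, requires_grad=True)   # tracked
c = torch.ones(2)                       # not tracked (requires_grad defaults to False)

out = torch.sum(w * c)   # scalar that depends on both w and c
out.backward()

print(w.grad)  # gradient of out w.r.t. w
print(c.grad)  # None: no gradient is computed for an untracked tensor
```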

$y_i = 5(x_i + 1)^2$

In [111]:
x = Variable(torch.ones(2), requires_grad=True)
x

Variable containing:
 1
 1
[torch.FloatTensor of size 2]

$ \therefore \hspace{0.3cm} y_i \mid_{x_i=1} = 5(1 + 1)^2 = 5(2)^2 = 5(4) = 20$

In [112]:
y = 5 * (x + 1) ** 2
y

Variable containing:
 20
 20
[torch.FloatTensor of size 2]

**`backward` should be called only on a scalar (i.e. a 1-element tensor) or with a gradient w.r.t. the variable**
    - So to get our gradient, we want to reduce y = Variable(torch.Tensor([20, 20])) to a single value: a scalar.
    - To do this we've decided to get the mean.
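
The two options named in the bold note can be compared side by side. This is a sketch under the assumption of PyTorch 0.4 or later: the first route reduces `y` with a mean, the second calls `backward` on `y` directly and passes the per-element weights that the mean would apply. The names `grad_via_mean` and `grad_via_argument` are ours.

```python
import torch

# Assumes PyTorch >= 0.4.
x = torch.ones(2, requires_grad=True)
y = 5 * (x + 1) ** 2           # non-scalar: one value per element of x

# Option 1: reduce to a scalar first, then call backward().
o = y.mean()
o.backward(retain_graph=True)  # keep the graph so y can be used again below
grad_via_mean = x.grad.clone()

# Option 2: call backward() on y directly, passing the weights that the
# mean would have applied to each element (1/2 each for two elements).
x.grad.zero_()                 # clear the accumulated gradient first
y.backward(torch.full_like(y, 0.5))
grad_via_argument = x.grad.clone()

print(grad_via_mean)
print(grad_via_argument)
```

Both routes should leave the same values in `x.grad`, which is why reducing to a mean is a convenient default.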

$\therefore \hspace{0.3cm} o = \frac{1}{2}\Sigma_{i}y_i = \frac{\Sigma_i y_i}{2}$

In [113]:
# In code:
o = (1.0/2) * torch.sum(y)
o

Variable containing:
 20
[torch.FloatTensor of size 1]

Recap $y$ equation: $y_i = 5(x_i + 1)^2$  
  
Recap $o$ equation: $o = \frac{1}{2}\Sigma_{i}y_i = \frac{\Sigma_i y_i}{2}$  

So $o$ is the mean of a hypothesis $y$ that depends on $x$:

$\hspace{1cm}$ Substitute $y$ into $o$ equation:  

$\hspace{2cm} o = \frac{1}{2}\Sigma_{i}5(x_i + 1)^2$

$\hspace{2cm} \therefore \hspace{0.3cm} \frac{\partial o}{\partial x_i} = \frac{1}{2}[10(x_i + 1)]$

$\hspace{2cm} \therefore \hspace{0.3cm} \frac{\partial o}{\partial x_i} \mid_{x_i = 1} = \frac{1}{2}[10(1 + 1)] = \frac{10(2)}{2} = 10$
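
The derivation can be sanity-checked without any autograd machinery. The sketch below uses plain Python and a central finite difference; the helper `numerical_gradient` and the step size `eps` are assumptions made for this illustration, not part of the notebook.

```python
def o(x):
    """Mean of y_i = 5 * (x_i + 1)^2 over the elements of x."""
    return sum(5 * (xi + 1) ** 2 for xi in x) / len(x)


def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of df/dx_i for each i."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        up[i] += eps
        down = list(x)
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


x = [1.0, 1.0]
analytic = [0.5 * 10 * (xi + 1) for xi in x]   # the formula derived above
numeric = numerical_gradient(o, x)

print(analytic)
print(numeric)
```

The two lists should agree up to floating-point error, which supports the hand-derived formula before we ask autograd for the same quantity.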

In [114]:
o.backward(retain_graph=True)

In [115]:
# o.backward() computed do/dx and stored it in x.grad
x.grad

Variable containing:
 10
 10
[torch.FloatTensor of size 2]
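
Because gradients accumulate, calling `backward` a second time adds to `x.grad` rather than overwriting it. Here is a minimal sketch of that behaviour, assuming PyTorch 0.4 or later; the names `first` and `second` are ours.

```python
import torch

# Assumes PyTorch >= 0.4.
x = torch.ones(2, requires_grad=True)
o = (5 * (x + 1) ** 2).mean()

o.backward(retain_graph=True)   # first pass: x.grad holds do/dx
first = x.grad.clone()

o.backward(retain_graph=True)   # second pass: the new gradient is added to x.grad
second = x.grad.clone()

x.grad.zero_()                  # reset before the next, unrelated backward pass

print(first)
print(second)
print(x.grad)
```

In practice this is why the gradient is usually zeroed between independent backward passes.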

### Summary
- Variable
    - Wraps around a tensor so that gradients can be accumulated.
- Gradients
    - The computational graph is recorded as operations are applied to variables; calling `backward()` walks that graph in reverse.
    - From the computational graph, we can use the chain rule to find the derivative of our hypothesis with respect to any of our parameters.
    - To apply backpropagation, we need to reduce our hypothesis matrix into just a scalar. (One way is through calculating a mean.)
    - We can then access the gradient with respect to any parameter of our choosing through **`var.grad`**, which holds the gradient of our **mean(hypothesis)** with respect to **`var`**.
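
To see what the chain rule does on this section's example, the forward and backward passes can be written out by hand. This is a plain-Python sketch of the idea, not how PyTorch implements it; the intermediate names `u` and `v` are introduced only for illustration.

```python
# Forward pass: record each intermediate value, as a computational graph would.
x = [1.0, 1.0]
u = [xi + 1 for xi in x]        # u_i = x_i + 1
v = [ui ** 2 for ui in u]       # v_i = u_i^2
y = [5 * vi for vi in v]        # y_i = 5 * v_i
o = sum(y) / len(y)             # o = mean(y)

# Backward pass: multiply local derivatives from the output back to x.
do_dy = [1.0 / len(y)] * len(y)                    # d(mean)/dy_i = 1/n
do_dv = [g * 5 for g in do_dy]                     # dy_i/dv_i = 5
do_du = [g * 2 * ui for g, ui in zip(do_dv, u)]    # dv_i/du_i = 2 * u_i
do_dx = [g * 1 for g in do_du]                     # du_i/dx_i = 1

print(o, do_dx)
```

Each backward line multiplies the gradient that arrives from the output by one local derivative, and this chain of multiplications is what `backward()` automates.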