# Introduction to Pytorch

<br/>

1. Pytorch Tensor Library
2. Computation Graphs and Autograd
3. Pytorch Modules
4. The `torch.optim` package

## Pytorch Tensor Library `torch.Tensor`

Deep learning frameworks use an object called a **Tensor** to represent structured data. A tensor generalizes a matrix to an arbitrary number of dimensions, so it can represent high-dimensional data. Tensors have a property called **Rank**: the number of indices required to address any single element of the tensor.

<center> <img src="img/tensor.jpg" width="400"/></center>

Pytorch's tensor API closely follows the NumPy API.

### Creating Tensors
Tensors can be created from Python lists, as well as in several other ways.

In [5]:
import torch

t = torch.tensor([3, 7, 1])
print(t)

t = torch.tensor([[ 2, 5 ],[ 3, 9 ]])
print(t)

tensor([3, 7, 1])
tensor([[2, 5],
        [3, 9]])
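
A few of the "other ways" can be sketched as follows. The constructors used here (`torch.zeros`, `torch.arange`, `torch.rand`) are standard Pytorch calls; the variable names are only illustrative.

```python
import torch

zeros = torch.zeros((2, 3))       # 2x3 tensor filled with 0.0 (float32 by default)
steps = torch.arange(0, 10, 2)    # evenly spaced integers: 0, 2, 4, 6, 8
noise = torch.rand((2, 2))        # uniform random values in [0, 1)
floats = torch.tensor([3, 7, 1], dtype=torch.float32)  # list with an explicit dtype

print(zeros.shape, steps, noise.shape, floats.dtype)
```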


### Indexing Tensors

Indexing a tensor with a single integer produces a tensor whose rank is one lower.

In [35]:
t = torch.tensor([[ 2, 5 ],[ 3, 9 ]])  # rank 2 tensor

print(t[0])  #  returns a rank 1 tensor

print(t[0][0]) # returns a rank 0 tensor. Equivalently t[0,0]

tensor([2, 5])
tensor(2)
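
The drop in rank can be checked directly. This small sketch uses `dim()` to read the rank and `item()` to turn a rank 0 tensor into a plain Python number.

```python
import torch

t = torch.tensor([[2, 5], [3, 9]])

print(t.dim())          # 2: two indices address one element
print(t[0].dim())       # 1: one index was consumed
print(t[0, 0].dim())    # 0: a single value, no indices left
print(t[0, 0].item())   # 2, as a plain Python int
```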


### Slicing Tensors

Like numpy arrays, tensors can be sliced along axes.

In [17]:
t = torch.tensor([[1,1,1],[2,2,2],[3,3,3]])
print(t)
print()

print(t[0,:])
print(t[1,:])
print()

print(t[:,0])
print()

print(t[0:2, 1:])

tensor([[1, 1, 1],
        [2, 2, 2],
        [3, 3, 3]])

tensor([1, 1, 1])
tensor([2, 2, 2])

tensor([1, 2, 3])

tensor([[1, 1],
        [2, 2]])


### Tensor Shape and Reshaping

The shape of a tensor tells you the size along each of its dimensions. The length of the shape is the tensor's rank. A tensor can be reshaped to a different shape, and even a different rank, as long as the total number of elements stays the same.

In [33]:
t = torch.tensor([[1,1,1],[2,2,2],[3,3,3],[4,4,4]])
print(t.shape)

r = t.view(12)
print(r)

r = t.view(2,-1)  # -1 tells view to infer that size from the total number of elements
print(r)

torch.Size([4, 3])
tensor([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
tensor([[1, 1, 1, 2, 2, 2],
        [3, 3, 3, 4, 4, 4]])
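
One detail worth knowing: `view` does not copy data, so the reshaped tensor and the original share memory. The sketch below illustrates this, and also shows `reshape`, which falls back to copying when the memory layout does not allow a view (here, after a transpose).

```python
import torch

t = torch.tensor([[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]])

v = t.view(2, 6)     # a view: same underlying data, new shape
v[0, 0] = 100        # writing through the view also changes t
print(t[0, 0])

tt = t.t()           # transpose: same data, but no longer contiguous in memory
r = tt.reshape(12)   # reshape copies the data when a view is not possible
print(tt.is_contiguous(), r.shape)
```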


### Operations on Tensors

There are a large number of math operations and other miscellaneous operations available to tensors. Normal math operations (+, \*, etc.) **<span style="color:red">Always Operate Elementwise</span>**.

In [36]:
x = torch.tensor([1,1])
y = torch.tensor([2,2])
z = x + y
print(z)

tensor([3, 3])


**Note**: This is not the case for operations that explicitly define how elements are combined, e.g. matrix multiplication or concatenation.

In [38]:
z = torch.cat([x,y])  #concatenate along the first axis
print(z)

tensor([1, 1, 2, 2])
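
To make the contrast concrete, here is a small sketch comparing elementwise multiplication (`*`) with matrix multiplication (`@`) on the same pair of matrices. The values are arbitrary.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[10., 20.], [30., 40.]])

elementwise = a * b   # multiplies matching entries
matmul = a @ b        # rows of a times columns of b (same as torch.matmul(a, b))

print(elementwise)
print(matmul)
```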


### Tensor Broadcasting (__Advanced Topic__)

In [54]:
torch.ones((3,2)) * 2

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])

But doesn't this break our rule that tensors operate elementwise?


Rank 2 * Rank 0

### What is actually happening in our example?

In [55]:
torch.ones((3,2)) * 2

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])

In [57]:
torch.ones((3,2)) * torch.tensor([[2.,2],[2,2],[2,2]])

tensor([[2., 2.],
        [2., 2.],
        [2., 2.]])

Broadcasting replicates the scalar enough times to have the same number of elements as the first tensor.
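
The replication can be made visible with `expand`, which presents a tensor as if it had been repeated along new or size 1 dimensions without copying any data. This is only a sketch of the idea; the broadcasting machinery does this for you.

```python
import torch

ones = torch.ones((3, 2))
scalar = torch.tensor(2.0)         # rank 0

expanded = scalar.expand(3, 2)     # looks like a 3x2 tensor of 2s
print(expanded)
print(torch.equal(ones * scalar, ones * expanded))
```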

### Tensor Broadcasting (Advanced Topic)

Broadcasting is a method by which tensors of different size or rank can operate together.
Two tensors are “broadcastable” if the following rules hold:

 * Each tensor has at least one dimension.

 * When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.

In [48]:
x = torch.zeros((2,3,2))
y = torch.ones((   3,2))

print(x+y)

x = torch.zeros((2,3,2))
y = torch.ones(( 1,3,2))
print(x+y)

x = torch.zeros((2,3,2))
y = torch.ones((     2))
print(x+y)

tensor([[[1., 1.],
         [1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.],
         [1., 1.]]])
tensor([[[1., 1.],
         [1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.],
         [1., 1.]]])
tensor([[[1., 1.],
         [1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.],
         [1., 1.]]])


In [49]:
x = torch.zeros((2,3,2))
y = torch.ones((   3,3))
print(x+y)

RuntimeError: The size of tensor a (2) must match the size of tensor b (3) at non-singleton dimension 2
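
The trailing-dimension rule can be written out as a few lines of Python. `is_broadcastable` is a hypothetical helper written for this illustration, not a Pytorch function; it only applies the second rule to two shape tuples.

```python
import torch

def is_broadcastable(shape_a, shape_b):
    """Hypothetical helper: check the trailing-dimension rule for two shapes."""
    # Walk both shapes from the last dimension backwards.
    for size_a, size_b in zip(reversed(shape_a), reversed(shape_b)):
        if size_a != size_b and size_a != 1 and size_b != 1:
            return False
    # Dimensions that exist in only one shape never cause a conflict.
    return True

print(is_broadcastable((2, 3, 2), (3, 2)))   # the first example above
print(is_broadcastable((2, 3, 2), (3, 3)))   # the failing example above

# Size 1 dimensions stretch on both sides: (3, 1) + (1, 4) gives (3, 4)
result = torch.zeros((3, 1)) + torch.ones((1, 4))
print(result.shape)
```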

## Computation Graphs and Autograd

Automatic differentiation is the core of every deep learning library (called **Autograd** in Pytorch). It calculates the derivatives (gradients) of a function by repeatedly applying the chain rule to the operations that produced it.

Training every neural network consists of two phases:
   1. **Forward pass** - calculate output of the network starting at the inputs
   2. **Backward pass** - backpropagation to compute all of the gradients starting with the output
     
All math operations in the **forward pass** are represented by a **Computation Graph**

### Representing Computation as a Graph

In [None]:
b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d

<center> <img src="img/computation_graph_forward.png" width="400"> </center>

[Example from here](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)

#### The gradients we need to find are:
<br/>

$$
\large
\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \frac{\partial L}{\partial w_3}, \frac{\partial L}{\partial w_4}
$$

In other words, how does the network output ($L$, or the loss) change as we change each of our parameters ($w_i$ or the weights).
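
For this small graph the chain rule can be applied by hand, which is a useful check on what Autograd computes. Using $L = 10 - d$, $d = w_3 b + w_4 c$, $b = w_1 a$ and $c = w_2 a$:

$$
\large
\begin{aligned}
\frac{\partial L}{\partial w_3} &= \frac{\partial L}{\partial d}\frac{\partial d}{\partial w_3} = -b = -w_1 a \\
\frac{\partial L}{\partial w_4} &= \frac{\partial L}{\partial d}\frac{\partial d}{\partial w_4} = -c = -w_2 a \\
\frac{\partial L}{\partial w_1} &= \frac{\partial L}{\partial d}\frac{\partial d}{\partial b}\frac{\partial b}{\partial w_1} = -w_3 a \\
\frac{\partial L}{\partial w_2} &= \frac{\partial L}{\partial d}\frac{\partial d}{\partial c}\frac{\partial c}{\partial w_2} = -w_4 a
\end{aligned}
$$

With $a$ and every $w_i$ set to 1, each of these gradients equals $-1$.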

<br/>

#### Our job when writing a pytorch model is to define the computation graph so that the **Autograd** library can find these gradients automatically.
---

## Pytorch Autograd - `torch.autograd`

In Pytorch, each tensor keeps track of its gradient and of the operation that produced it, provided it was created with `requires_grad=True` or was computed from such a tensor.

In [62]:
y = torch.ones((2,2), requires_grad=True)
y.requires_grad

True

In [120]:
a = torch.tensor([1.], requires_grad=True)
w1 = torch.tensor([1.], requires_grad=True)
w2 = torch.tensor([1.], requires_grad=True)
w3 = torch.tensor([1.], requires_grad=True)
w4 = torch.tensor([1.], requires_grad=True)

b = w1 * a
c = w2 * a
d = w3 * b + w4 * c
L = 10 - d

In [123]:
b, d

(tensor([1.], grad_fn=<MulBackward0>), tensor([2.], grad_fn=<AddBackward0>))

**Note**: The backward functions are stored in the tensors themselves. There is no "central" representation of the computation graph.

## Pytorch Functions - `torch.autograd.Function`

Every differentiable operation is represented by a `Function`. `Function` has two important methods:

  * `forward` - Simply computes the output of the function given the inputs
  * `backward` - Takes gradients from later parts of the computation graph and computes the gradients relative to the inputs. It then sends these gradients to earlier parts of the computation graph and invokes their `backward` functions.

In [122]:
print(L)
L.backward()
print(w4.grad)

tensor([8.], grad_fn=<RsubBackward1>)
tensor([-1.])
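
To see the two methods in action, here is a minimal sketch of a user-defined `Function`. `Square` is a made-up example (Pytorch already provides squaring); it only shows how `forward` saves what `backward` will need, and how `backward` multiplies the incoming gradient by the local derivative.

```python
import torch

class Square(torch.autograd.Function):
    """Illustrative custom operation: y = x**2."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)      # keep the input for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: incoming gradient times dy/dx = 2x
        return grad_output * 2 * x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)                   # custom Functions are called through .apply
y.sum().backward()
print(x.grad)
```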


## Pytorch vs. Tensorflow

In Tensorflow the computation graph is defined at initialization and the graph is run by simply passing all the inputs through it. 

In Pytorch the computation graph is built up, operation by operation, as the forward pass runs, and each result tensor holds a reference to the function that produced it (`grad_fn`). After a call to `backward`, the intermediate values saved in the graph are freed by default, and the next forward pass builds a fresh graph. This means the computation graph can change at run time.
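
A small sketch of what "can change at run time" means in practice: ordinary Python control flow decides how many operations end up in the graph on each forward pass. `repeated_double` is a hypothetical function used only for this illustration.

```python
import torch

def repeated_double(x, n_steps):
    # The loop length decides how many multiply nodes the graph contains.
    for _ in range(n_steps):
        x = x * 2
    return x

w = torch.tensor(1.0, requires_grad=True)

repeated_double(w, 3).backward()   # graph with 3 multiplications
grad_3 = w.grad.item()             # derivative of 8*w
w.grad.zero_()                     # reset the stored gradient before reusing w

repeated_double(w, 5).backward()   # a different graph, built on the fly
grad_5 = w.grad.item()             # derivative of 32*w

print(grad_3, grad_5)
```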

###         **Tensorflow** - Static Computation Graph

###         **Pytorch** - Dynamic Computation Graph

<br/>


###### * Tensorflow 2.0 uses eager execution which gives it some dynamic graph capabilities

### Now we know about tensors, computation graphs and how to calculate gradients. Let's move on to how we will build our graphs.




<br/>
<br/>

---

# Pytorch Modules - `torch.nn.Module`

In Pytorch we create computation graphs with Modules. There are many prebuilt ones but the `Module` class allows us to create our own as well.

Neural networks tend to have their computation graph organized into layers. This is the same as how we stacked layers of perceptrons previously. A layer of perceptrons, which is a linear combination of its inputs, is implemented in a module conveniently called `Linear`.

In [134]:
lin = torch.nn.Linear(in_features=3, out_features=5)

x = torch.tensor([1.,2.,3.])
lin(x)

tensor([-0.8849,  0.4273, -2.3330,  1.4144,  1.0879], grad_fn=<AddBackward0>)
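
Under the hood, `Linear` stores a weight matrix and a bias vector and computes $xW^T + b$. The sketch below reproduces the layer's output by hand; the seed is only there to make the run repeatable.

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(in_features=3, out_features=5)
x = torch.tensor([1., 2., 3.])

# weight has shape (out_features, in_features), bias has shape (out_features,)
manual = x @ lin.weight.T + lin.bias

print(lin.weight.shape, lin.bias.shape)
print(torch.allclose(lin(x), manual, atol=1e-6))
```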

### Layers can be stacked together

The output of one layer can be fed into the input of another. This is how we create the stacked structure of a typical neural network.

In [135]:
x = torch.tensor([1.,2.,3.])

lin1 = torch.nn.Linear(in_features=3, out_features=5)
lin2 = torch.nn.Linear(in_features=5, out_features=4)

hidden = lin1(x)
output = lin2(hidden)
output

tensor([-0.6024, -0.5019,  1.1678,  0.9438], grad_fn=<AddBackward0>)
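
The same stacking can be expressed with `torch.nn.Sequential`, which chains modules so that each output feeds the next input. This is a sketch of an alternative spelling, not a change to the example above.

```python
import torch

x = torch.tensor([1., 2., 3.])
lin1 = torch.nn.Linear(in_features=3, out_features=5)
lin2 = torch.nn.Linear(in_features=5, out_features=4)

stacked = torch.nn.Sequential(lin1, lin2)   # applies lin1, then lin2

print(stacked(x).shape)
print(torch.allclose(stacked(x), lin2(lin1(x))))
```

Note that two `Linear` layers with nothing in between still amount to a single linear map, which is why activations are placed between layers in practice.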

## Using `Module` to Build a Network

Modules can contain other modules. This is how we build up more complicated graphs and keep them in easy-to-use packages. To make a module you only need to override the `__init__()` and `forward()` methods. Let's also throw in some activations.

In [171]:
class MyCoolNet(torch.nn.Module):
    def __init__(self, in_size, out_size):
        super(MyCoolNet, self).__init__()
        
        self.lin1 = torch.nn.Linear(in_size, 10)
        self.lin2 = torch.nn.Linear(10, 15)
        self.lin3 = torch.nn.Linear(15, 10)
        self.lin4 = torch.nn.Linear(10, out_size)
        self.relu = torch.nn.ReLU()
        self.sig = torch.nn.Sigmoid()
        
    def forward(self, x):
        hidden = self.relu(self.lin1(x))
        hidden = self.relu(self.lin2(hidden))
        hidden = self.relu(self.lin3(hidden))
        output = self.sig(self.lin4(hidden))
        
        return output
        

### Now we can use our network just like we used the `Linear` layer earlier.

In [172]:
model = MyCoolNet(in_size=5, out_size=5)

x = torch.tensor([1.,2,3,4,5])
model(x)

tensor([0.4050, 0.4399, 0.5060, 0.5242, 0.4559], grad_fn=<SigmoidBackward>)
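
A module also keeps a registry of every parameter in its submodules, which becomes important once an optimizer needs to know what to update. The sketch below uses a smaller, hypothetical `TinyNet` so that the parameter count is easy to verify by hand, and also passes a batch of inputs through it.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical two-layer network used only for this illustration."""

    def __init__(self):
        super().__init__()
        self.lin1 = torch.nn.Linear(5, 10)
        self.lin2 = torch.nn.Linear(10, 2)

    def forward(self, x):
        return self.lin2(torch.relu(self.lin1(x)))

net = TinyNet()

# Parameters of submodules are registered automatically under dotted names.
names = [name for name, _ in net.named_parameters()]
n_params = sum(p.numel() for p in net.parameters())
print(names)
print(n_params)              # 5*10 + 10 + 10*2 + 2

batch = torch.ones((4, 5))   # 4 samples with 5 features each
print(net(batch).shape)      # Linear acts on the last dimension
```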

### We can now build computation graphs, send tensors through them, and calculate their gradients. All that's left is to use the gradients to adjust the weights.

<br/>

---

# The `torch.optim` Package

We have explored tensors, computation graphs and the ability to calculate their gradients. Now we need to adjust the weights of the network based on their gradients. This is done with the `torch.optim` package.

This package contains all of the optimizers that are available for use in Pytorch, as well as the framework to make new ones.

We will explore in detail an optimizer called **Stochastic Gradient Descent** (**SGD**) and then briefly describe more modern optimizers that are modifications of **SGD**.

## Stochastic Gradient Descent

Remember from last time, we said that our update rule for the parameters of our graph (weights) was:

<br/>
$$
\large
w_{t+1} = w_t - \alpha \frac{\partial L}{\partial w}
$$

<br/>

In words, for each parameter in the graph, the new value $w_{t+1}$ is the previous value $w_t$ minus the learning rate $\alpha$ times the gradient of $L$ with respect to that parameter.
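
As a sanity check, the update rule can be applied by hand and compared with `torch.optim.SGD` without momentum. The toy loss $L = w^2$ and the value of $\alpha$ are assumptions made for this sketch.

```python
import torch

alpha = 0.1
w_manual = torch.tensor([1.0], requires_grad=True)
w_optim = torch.tensor([1.0], requires_grad=True)
optimizer = torch.optim.SGD([w_optim], lr=alpha)   # no momentum: plain SGD

# Manual step: w <- w - alpha * dL/dw for L = w**2
(w_manual ** 2).sum().backward()
with torch.no_grad():              # the update itself must not be tracked
    w_manual -= alpha * w_manual.grad

# The same step through the optimizer
(w_optim ** 2).sum().backward()
optimizer.step()

print(w_manual, w_optim)           # both move from 1.0 to 1.0 - 0.1 * 2.0
```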


<br/>

This is the basic form of **Stochastic Gradient Descent**. It is stochastic because, in general, $\partial L / \partial w$ is the average gradient over a random subset of inputs from our dataset. We will focus on a particular variation of **SGD** called **SGD with Momentum**.


## Momentum

To make **SGD** learn faster and be less likely to get stuck in local minima we can use a moving average. In this case we will use an exponentially weighted moving average.

<br/>
$$
\large
V_t = \beta V_{t-1} + (1-\beta) \frac{\partial L}{\partial w}
$$

<br/>

$V_t$ is called the momentum. We can now write our parameter update rule as:

<br/>
$$
\Large
w_{t+1} = w_t - \alpha V_t
$$

<br/>
$\alpha$ is called the learning rate. Momentum has the effect of 'building up speed' to carry the optimizer through areas of low gradient, where plain **SGD** would slow down more than desired.
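
The two momentum equations translate almost line for line into code. This is a plain-Python sketch on the toy loss $L = w^2$; the function name and the starting values are illustrative assumptions.

```python
def momentum_descent(w, alpha, beta, n_steps):
    """Minimise L(w) = w**2 with the moving-average momentum rule above."""
    v = 0.0
    history = []
    for _ in range(n_steps):
        grad = 2.0 * w                       # dL/dw for L = w**2
        v = beta * v + (1.0 - beta) * grad   # V_t
        w = w - alpha * v                    # w_{t+1}
        history.append(w)
    return history

history = momentum_descent(w=1.0, alpha=0.1, beta=0.9, n_steps=3)
print(history)
```

Note that `torch.optim.SGD` accumulates momentum without the $(1-\beta)$ factor by default (its `dampening` argument defaults to 0), so its numbers will not match this sketch exactly.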



## Optimizers in Pytorch

To use **SGD** in Pytorch we just need to instantiate its class. 

In [173]:
model = MyCoolNet(in_size=5, out_size=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  
# If we give no argument for momentum then standard SGD is used

Notice that to initialize the optimizer we must pass it the parameters of our model. This is how the optimizer knows which weights to adjust at each step.

We can now use this optimizer to minimize a loss function on our model.

## Minimizing a Loss Function

Let's create a simple loss function and take ten optimizer steps, each adjusting the weights in the direction that reduces the loss.

In [179]:
model = MyCoolNet(in_size=5, out_size=5)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.8)

x = torch.tensor([1.,2,3,4,5])

for i in range(10):
    out = model(x)
    loss = torch.sum((0-out)**2)  #loss decreases as out->0

    loss.backward()   # Calculate gradients
    optimizer.step()  # Step all weights against their gradients (downhill)
    optimizer.zero_grad()
    print(f"Loss {i}: {loss.item()}")

Loss 0: 1.3889371156692505
Loss 1: 1.3281832933425903
Loss 2: 1.2424967288970947
Loss 3: 1.1397792100906372
Loss 4: 1.0178158283233643
Loss 5: 0.8763992190361023
Loss 6: 0.7107219099998474
Loss 7: 0.5144205093383789
Loss 8: 0.2940930128097534
Loss 9: 0.10847821831703186
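
The call to `optimizer.zero_grad()` in the loop matters because `backward` adds to any gradient already stored on a tensor rather than overwriting it. A minimal sketch of that accumulation, with made-up values:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(3 * w).backward()
first = w.grad.item()      # gradient of 3*w

(3 * w).backward()         # no reset: the new gradient is added to the old one
second = w.grad.item()

w.grad.zero_()             # reset by hand; optimizer.zero_grad() clears the
third = w.grad.item()      # gradients of every parameter it manages

print(first, second, third)
```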


## Modern Optimizers

There are many variations on **SGD** that incorporate some sort of momentum. While these optimizers still show state-of-the-art performance on some tasks, they need more careful tuning (for example of the learning rate) to get there. Newer adaptive methods address many of these problems.

<center><img src="https://media.giphy.com/media/SJVFO3IcVC0M0/giphy.gif"/></center>

## **Adam** 
### A modification of SGD that uses the first and second moments of the gradient to adjust the learning rate. Adam calculates these moments and adjusts the learning rate for each parameter. Building a new network? Start with Adam `torch.optim.Adam()`.
  
$$
\Large
w_{t+1} = w_t - \eta \frac{\hat{m}_t}{ \sqrt{\hat{v}_t} + \epsilon }
$$

<br/>

$$
\large
m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t
$$


$$
\large
v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2
$$
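
In these equations $g_t$ is the gradient at step $t$, $\eta$ is the learning rate, and the hats denote bias-corrected moments, $\hat{m}_t = m_t / (1-\beta_1^t)$ and $\hat{v}_t = v_t / (1-\beta_2^t)$, as in the original Adam formulation. The plain-Python sketch below applies them to a single scalar parameter on the toy loss $L = w^2$; the default values mirror the defaults of `torch.optim.Adam`, and the function name is illustrative.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * grad         # first moment  m_t
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment v_t
    m_hat = m / (1 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * w                             # gradient of L = w**2
    w, m, v = adam_step(w, grad, m, v, t)
    print(t, w)
```

Because the update divides by $\sqrt{\hat{v}_t}$, each early step moves the parameter by roughly the learning rate regardless of how large the gradient is.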


### **LookAhead** 
#### Uses a combination of two optimizers, a fast one and a slow one. The fast one explores the loss ahead of the slow one and, after a number of iterations, the slow weights are moved to a linear interpolation between the two. The fast optimizer can be any other optimizer (SGD, Adam, etc.).
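
The idea fits in a few lines of plain Python. This is a hand-written sketch on the toy loss $L = w^2$ with plain gradient descent as the fast optimizer; the names `k`, `fast_lr` and `slow_step` are assumptions chosen for this illustration.

```python
def lookahead_descent(w, n_outer, k=5, fast_lr=0.1, slow_step=0.5):
    """Lookahead on L(w) = w**2 using gradient descent as the fast optimizer."""
    slow = w
    for _outer in range(n_outer):
        fast = slow                               # fast weights start at the slow weights
        for _inner in range(k):                   # k exploratory fast steps
            fast = fast - fast_lr * 2.0 * fast    # gradient of w**2 is 2w
        slow = slow + slow_step * (fast - slow)   # interpolate toward the fast weights
    return slow

print(lookahead_descent(w=1.0, n_outer=1))
```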

<center> <img src="img/lookahead.png" width="500"/> </center>

# Summary

We have taken a look at the basic components of Pytorch and how they can be combined to form computation graphs. We then looked at how to optimize our parameters with gradient descent and got a preview of modern optimizers.

<center> <img src="https://media.giphy.com/media/1PNcBiyhzL0Na/giphy.gif"/> </center>

## Next time...  Hands on with Pytorch!