![](http://pytorch.org/tutorials/_images/pytorch-logo-flat.png)

# What is PyTorch

It's a Python-based scientific computing package targeted at two audiences:
- A replacement for numpy that can use the power of GPUs
- A deep learning research platform that provides maximum flexibility and speed


# Tensors

Tensors are similar to numpy's ndarrays, with the addition that Tensors can also be used on a GPU to accelerate computing.

In [1]:
from __future__ import print_function
import torch

Construct a 5x3 matrix, uninitialized

In [2]:
x = torch.Tensor(5, 3)
print(x)


1.00000e-11 *
  0.0000  0.0000 -0.0000
  0.0000 -0.0000  0.0000
  0.0000  0.0000  0.0000
  0.0000 -3.3373  0.0000
 -3.2683  0.0000 -0.0000
[torch.FloatTensor of size 5x3]



Construct a randomly initialized matrix

In [3]:
x = torch.rand(5, 3)
print(x)


 0.6567  0.4359  0.7633
 0.5934  0.6095  0.3843
 0.0266  0.9580  0.0040
 0.9574  0.2764  0.7318
 0.5458  0.3304  0.1916
[torch.FloatTensor of size 5x3]



Get its size

In [4]:
x.size()

torch.Size([5, 3])

## Operations
There are multiple syntaxes for operations. The cells below show several equivalent ways to add two tensors.

In [5]:
y = torch.rand(5,3)
y


 0.8202  0.2407  0.2008
 0.7726  0.2276  0.1646
 0.2548  0.6728  0.4171
 0.5427  0.1835  0.1407
 0.8626  0.1977  0.0360
[torch.FloatTensor of size 5x3]

In [6]:
# syntax 1
x+y


 1.4768  0.6766  0.9641
 1.3659  0.8370  0.5488
 0.2814  1.6308  0.4211
 1.5000  0.4600  0.8725
 1.4084  0.5281  0.2276
[torch.FloatTensor of size 5x3]

In [7]:
# syntax 2
torch.add(x, y)


 1.4768  0.6766  0.9641
 1.3659  0.8370  0.5488
 0.2814  1.6308  0.4211
 1.5000  0.4600  0.8725
 1.4084  0.5281  0.2276
[torch.FloatTensor of size 5x3]

In [8]:
# syntax 3
result = torch.Tensor(5, 3)
torch.add(x, y, out=result)
print(result)


 1.4768  0.6766  0.9641
 1.3659  0.8370  0.5488
 0.2814  1.6308  0.4211
 1.5000  0.4600  0.8725
 1.4084  0.5281  0.2276
[torch.FloatTensor of size 5x3]



In [9]:
# Addition: in-place
y.add_(x)
print(y)


 1.4768  0.6766  0.9641
 1.3659  0.8370  0.5488
 0.2814  1.6308  0.4211
 1.5000  0.4600  0.8725
 1.4084  0.5281  0.2276
[torch.FloatTensor of size 5x3]



<p style="color:blue;">Any operation that mutates a tensor in-place is post-fixed with an underscore (_). For example, x.copy_(y) and x.t_() will change x.</p>

You can use standard numpy-like indexing with all bells and whistles!

In [10]:
print(x[:,0:2])
print(x[:,1])


 0.6567  0.4359
 0.5934  0.6095
 0.0266  0.9580
 0.9574  0.2764
 0.5458  0.3304
[torch.FloatTensor of size 5x2]


 0.4359
 0.6095
 0.9580
 0.2764
 0.3304
[torch.FloatTensor of size 5]



## Numpy Bridge
Converting a torch Tensor to a numpy array and vice versa is a breeze.

The torch Tensor and numpy array will share their underlying memory locations, and changing one will change the other.

### Converting torch Tensor to numpy Array

In [11]:
a = torch.ones(5)
a


 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]

In [12]:
b = a.numpy()
b

array([ 1.,  1.,  1.,  1.,  1.], dtype=float32)

In [13]:
# See how the numpy array changes in value as well.
a.add_(1)
print(a)
print(b)


 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]

[ 2.  2.  2.  2.  2.]


### Converting numpy Array to torch Tensor


In [14]:
import numpy as np

In [15]:
a = np.ones(5)
a

array([ 1.,  1.,  1.,  1.,  1.])

In [16]:
b = torch.from_numpy(a)
b


 1
 1
 1
 1
 1
[torch.DoubleTensor of size 5]

In [17]:
np.add(a, 1, out=a)
print(a)
print(b)

[ 2.  2.  2.  2.  2.]

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]



<p style="color:blue;"> All the Tensors on the CPU except a CharTensor support converting to NumPy and back.</p>

# Autograd: automatic differentiation

Central to all neural networks in PyTorch is the **```autograd```** package.

The **```autograd```** package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that our backprop is defined by how our code is run, and that every single iteration can be different.


##  Variable
 

**```autograd.Variable```** is the central class of the package. It wraps a Tensor and supports nearly all of the operations defined on it. Once we finish our computation, we can call ```.backward()``` and have all the gradients computed automatically.

We can access the raw tensor through ```.data``` attribute, while the gradient with respect to this variable is accumulated into ```.grad```.

![Variable](http://pytorch.org/tutorials/_images/Variable.png)

There's one more class that is very important for the autograd implementation: ```Function```.

Both ```Variable``` and ```Function``` are interconnected and build up an acyclic graph that encodes a complete history of computation. Each Variable has a ```.creator``` attribute that references the ```Function``` that created it (except for Variables created by the user, whose ```creator``` is ```None```).

If we want to compute the derivatives, we can call ```.backward()``` on a ```Variable```. If the ```Variable``` is a scalar (i.e. it holds a single element), we don't need to specify any arguments to ```backward()```. However, if it has more elements, we need to specify a ```grad_output``` argument that is a tensor of matching shape.
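To make the graph idea concrete, here is a minimal sketch of reverse-mode differentiation in plain Python. The `Scalar` class is a hypothetical toy written for this illustration, not PyTorch's `Variable`: each node only remembers its parents and the local derivative towards each of them, and `backward()` walks that recorded history in reverse.

```python
class Scalar:
    """Toy stand-in for a Variable holding one number (not PyTorch's class)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        # Each entry is (parent_node, d(self)/d(parent)); empty for user-created nodes.
        self.parents = parents

    def __add__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data + other.data, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        other = other if isinstance(other, Scalar) else Scalar(other)
        return Scalar(self.data * other.data,
                      ((self, other.data), (other, self.data)))

    def backward(self):
        # Post-order traversal: every node is listed after all of its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = 1.0  # d(out)/d(out)
        # Reverse order: a node's gradient is complete before it is passed on.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad


# Four inputs equal to 1, z = 3 * (x + 2)^2, out = mean(z)
xs = [Scalar(1.0) for _ in range(4)]
zs = []
for x in xs:
    y = x + 2
    zs.append(y * y * 3)
out = (zs[0] + zs[1] + zs[2] + zs[3]) * 0.25
out.backward()

print(out.data)               # 27.0
print([x.grad for x in xs])   # [4.5, 4.5, 4.5, 4.5]
```

Because `out` is a single number, `backward()` needs no argument here and simply starts from a gradient of 1.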

## Create a Variable

In [18]:
from torch.autograd import Variable

In [19]:
x = Variable(torch.ones(2, 2), requires_grad=True)
x

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

## Do an Operation on a Variable

In [20]:
y = x + 2
y

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]

y was created as a result of an operation, so it has a creator.

In [21]:
y.creator

<torch.autograd._functions.basic_ops.AddConstant at 0x7fb66071b308>

In [22]:
# Do more operations on y
z = y * y * 3
z

Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]

In [23]:
out = z.mean()
out

Variable containing:
 27
[torch.FloatTensor of size 1]

## Gradients
Let's backprop now. ```out.backward()``` is equivalent to doing ```out.backward(torch.Tensor([1.0]))```.

In [24]:
out.backward()

In [25]:
x

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]

In [26]:
print(x.grad)

Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]



Let's call the ```out``` Variable $o$. We have that

$$o = \frac{1}{4}\sum _i z_i$$

$$z_i = 3(y_i)^2 = 3(x_i+2)^2$$

$$z_i |_{x_i=1}=27$$

Therefore, $\frac{\partial o }{\partial x_i} = \frac{3}{2}(x_i + 2)$,

hence $$ \frac{\partial o }{\partial x_i} \mid _{x_i=1}=\frac{9}{2}=4.5$$
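As a sanity check on the derivation, we can compare the analytic gradient with a finite-difference estimate in plain Python. This is only a sketch: the helper names `o` and `numerical_grad` are made up for this example and no PyTorch is involved.

```python
def o(x):
    """out = mean(3 * (x_i + 2)^2) over the entries of x."""
    return sum(3.0 * (xi + 2.0) ** 2 for xi in x) / len(x)


def numerical_grad(f, x, i, eps=1e-6):
    """Central-difference estimate of df/dx_i at the point x."""
    up = list(x)
    down = list(x)
    up[i] += eps
    down[i] -= eps
    return (f(up) - f(down)) / (2.0 * eps)


x = [1.0, 1.0, 1.0, 1.0]                    # the 2x2 tensor of ones, flattened
analytic = [1.5 * (xi + 2.0) for xi in x]   # (3/2)(x_i + 2) from the derivation
numeric = [numerical_grad(o, x, i) for i in range(len(x))]

print(o(x))       # 27.0
print(analytic)   # [4.5, 4.5, 4.5, 4.5]
print(numeric)    # each entry is approximately 4.5
```

Both agree with the `4.5` entries that `x.grad` printed.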

## Many Crazy things with autograd!


In [27]:
x = torch.randn(3)
x


 0.5593
-0.9787
-0.1997
[torch.FloatTensor of size 3]

In [28]:
x = Variable(x, requires_grad=True)
x

Variable containing:
 0.5593
-0.9787
-0.1997
[torch.FloatTensor of size 3]

In [29]:
y = x * 2
y

Variable containing:
 1.1186
-1.9573
-0.3994
[torch.FloatTensor of size 3]

In [30]:
print(y*2)

Variable containing:
 2.2372
-3.9146
-0.7987
[torch.FloatTensor of size 3]



In [31]:
while y.data.norm() < 1000:
    y = y * 2
    

In [32]:
print(y)

Variable containing:
  572.7346
-1002.1454
 -204.4757
[torch.FloatTensor of size 3]



In [33]:
gradients = torch.FloatTensor([0.1, 1.0, 0.0001])
y.backward(gradients)

print(x.grad)

Variable containing:
  102.4000
 1024.0000
    0.1024
[torch.FloatTensor of size 3]
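Where do these numbers come from? Every step multiplies by 2, so the final `y` is just `x` times a power of two, and the gradient passed to `backward` gets scaled elementwise by that same factor. The plain-Python sketch below replays the loop on the rounded values printed above to recover the factor (the variable names `scale` and `x_grad` are ours, chosen for illustration).

```python
import math

x = [0.5593, -0.9787, -0.1997]   # values printed above (rounded)


def norm(v):
    """Euclidean norm, the default of y.data.norm()."""
    return math.sqrt(sum(vi * vi for vi in v))


scale = 2.0                       # y = x * 2
while norm([scale * xi for xi in x]) < 1000:
    scale *= 2.0                  # y = y * 2

gradients = [0.1, 1.0, 0.0001]
# dy_i/dx_i = scale and dy_i/dx_j = 0 for i != j,
# so backward(gradients) yields gradients * scale.
x_grad = [g * scale for g in gradients]

print(scale)    # 1024.0
print(x_grad)   # approximately [102.4, 1024.0, 0.1024]
```

The number of doublings depends on the random starting `x`, so a different draw can give a different power of two.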



# Neural Networks

Neural networks can be constructed using the ```torch.nn``` package.
An ```nn.Module``` contains layers, and a method ```forward(input)``` that returns the ```output```.

![](http://pytorch.org/tutorials/_images/mnist.png)

The above example is a simple feed-forward network. It takes the input, feeds it through several layers one after the other, and then finally gives the output.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:


**```weight = weight - learning_rate * gradient```**

## Define the network

In [34]:
import torch.nn as nn
import torch.nn.functional as F

In [35]:
class Net(nn.Module):
    
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear( 16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
        
    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x
    
    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
            
        return num_features

In [36]:
net = Net()
print(net)

Net (
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear (400 -> 120)
  (fc2): Linear (120 -> 84)
  (fc3): Linear (84 -> 10)
)
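The `400` in `fc1` is not arbitrary: it is `16 * 5 * 5`, the size of the feature map that reaches the first linear layer. The sketch below traces the spatial size by hand, assuming a 32x32 input (the size used later in this section), convolutions with stride 1 and no padding, and pooling whose stride equals its window. The helpers `conv_out` and `pool_out` are made up for this example.

```python
def conv_out(size, kernel):
    """Output width of a convolution with stride 1 and no padding."""
    return size - kernel + 1


def pool_out(size, window):
    """Output width of max pooling whose stride equals its window."""
    return size // window


size = 32
size = pool_out(conv_out(size, 5), 2)   # conv1 then pooling: 32 -> 28 -> 14
size = pool_out(conv_out(size, 5), 2)   # conv2 then pooling: 14 -> 10 -> 5

flat_features = 16 * size * size        # 16 channels come out of conv2
print(size, flat_features)              # 5 400
```

This is the same number that `num_flat_features` computes at run time from `x.size()[1:]`.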


In [37]:
# The learnable parameters of a model are returned by net.parameters()
params = list(net.parameters())
print(len(params))

10
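Why 10? Each of the five layers contributes a weight tensor and a bias tensor. A quick plain-Python count, with the shapes written out by hand from the layer definitions above (the `shapes` dictionary is just for this illustration):

```python
# (weight shape, bias shape) for each layer of Net
shapes = {
    "conv1": ((6, 1, 5, 5), (6,)),
    "conv2": ((16, 6, 5, 5), (16,)),
    "fc1": ((120, 400), (120,)),
    "fc2": ((84, 120), (84,)),
    "fc3": ((10, 84), (10,)),
}


def numel(shape):
    """Number of elements in a tensor of the given shape."""
    count = 1
    for s in shape:
        count *= s
    return count


num_tensors = sum(len(pair) for pair in shapes.values())
num_weights = sum(numel(shape) for pair in shapes.values() for shape in pair)

print(num_tensors)   # 10
print(num_weights)   # 61706
```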


In [38]:
print(params[0])

Parameter containing:
(0 ,0 ,.,.) = 
 -0.0605 -0.1816  0.1407  0.0594 -0.1860
 -0.1002 -0.0117 -0.1758  0.1666  0.1814
  0.0593  0.0887  0.0529 -0.1395 -0.1210
 -0.1229 -0.1658 -0.0270 -0.0538 -0.0611
 -0.0418 -0.0550  0.1182  0.1332  0.0774

(1 ,0 ,.,.) = 
 -0.0293 -0.1883 -0.1464  0.0267  0.0999
 -0.0900  0.0963 -0.0704 -0.1120 -0.0146
  0.0556 -0.1904 -0.1599  0.0296  0.0967
  0.1023 -0.0296  0.0584  0.0744 -0.1666
  0.0867 -0.1514  0.0774 -0.0108 -0.1040

(2 ,0 ,.,.) = 
  0.1234 -0.1291  0.1795  0.0518  0.1523
  0.1703 -0.1489 -0.1645 -0.0835  0.0503
 -0.0880  0.1814  0.1730  0.1775  0.1042
  0.0127  0.0426  0.1772 -0.1851  0.0821
  0.0260 -0.0925  0.0300 -0.0081  0.1880

(3 ,0 ,.,.) = 
  0.1224  0.0530 -0.0566  0.1859  0.1536
  0.0063 -0.1601  0.1565 -0.0758 -0.1363
  0.0597  0.0210  0.0658 -0.0396 -0.1442
  0.0865 -0.1174 -0.0925 -0.1286 -0.0331
 -0.1325  0.1843 -0.0025 -0.0286  0.0954

(4 ,0 ,.,.) = 
  0.0550  0.0836 -0.0661 -0.0379  0.1297
 -0.0696  0.0909  0.1273 -0.1143  0.04

In [39]:
params[0].size() # Conv1's weights

torch.Size([6, 1, 5, 5])

The input to the forward is an ```autograd.Variable```, and so is the output.

In [40]:
input = Variable(torch.randn(1, 1, 32, 32))
out = net(input)
print(input)

Variable containing:
(0 ,0 ,.,.) = 
  0.1428 -0.5461 -1.1254  ...  -1.3203  0.8102  0.2453
  0.6096  1.1269 -0.8461  ...   1.0527  0.1020 -1.7761
  0.3085  0.0338 -1.1594  ...   0.1656  1.1475  1.8065
           ...             ⋱             ...          
  0.6506  1.5443  2.4143  ...   0.2695  1.3630  1.7906
  0.3943  1.1546  0.3025  ...   1.4561 -1.0245 -0.4550
  2.6471 -0.0005  1.5843  ...  -0.0015  0.3269 -1.1769
[torch.FloatTensor of size 1x1x32x32]



In [41]:
print(out)

Variable containing:
-0.0821  0.0263  0.0497 -0.1758  0.1649  0.1227  0.1006  0.0651 -0.0411 -0.0163
[torch.FloatTensor of size 1x10]



Zero the gradient buffers of all parameters and backprop with random gradients:

In [43]:
net.zero_grad()
out.backward(torch.randn(1, 10))

In [44]:
print(out)

Variable containing:
-0.0821  0.0263  0.0497 -0.1758  0.1649  0.1227  0.1006  0.0651 -0.0411 -0.0163
[torch.FloatTensor of size 1x10]



## Loss Function

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.

There are several different [**loss functions**](http://pytorch.org/docs/nn.html#loss-functions):
- [```nn.L1Loss(size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.L1Loss)
- [```nn.MSELoss(size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.MSELoss)
- [```nn.CrossEntropyLoss(weight=None, size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.CrossEntropyLoss)
- [```nn.NLLLoss(weight=None, size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.NLLLoss)
- [```nn.NLLLoss2d(weight=None, size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.NLLLoss2d)
- [```nn.KLDivLoss(weight=None, size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.KLDivLoss)
- [```nn.BCELoss(weight=None, size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.BCELoss)
- [```nn.MarginRankingLoss(margin=0, size_average=True)```](http://pytorch.org/docs/nn.html#torch.nn.MarginRankingLoss)
- Et cetera


A simple loss is ```nn.MSELoss```, which computes the mean-squared error between the input and the target.

In [45]:
print(input)

Variable containing:
(0 ,0 ,.,.) = 
  0.1428 -0.5461 -1.1254  ...  -1.3203  0.8102  0.2453
  0.6096  1.1269 -0.8461  ...   1.0527  0.1020 -1.7761
  0.3085  0.0338 -1.1594  ...   0.1656  1.1475  1.8065
           ...             ⋱             ...          
  0.6506  1.5443  2.4143  ...   0.2695  1.3630  1.7906
  0.3943  1.1546  0.3025  ...   1.4561 -1.0245 -0.4550
  2.6471 -0.0005  1.5843  ...  -0.0015  0.3269 -1.1769
[torch.FloatTensor of size 1x1x32x32]



In [46]:
output=net(input)
output

Variable containing:
-0.0821  0.0263  0.0497 -0.1758  0.1649  0.1227  0.1006  0.0651 -0.0411 -0.0163
[torch.FloatTensor of size 1x10]

In [50]:
target = Variable(torch.arange(0, 10))  # a dummy target, for example
target

Variable containing:
 0
 1
 2
 3
 4
 5
 6
 7
 8
 9
[torch.FloatTensor of size 10]

In [51]:
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)

Variable containing:
 28.2188
[torch.FloatTensor of size 1]
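The printed loss can be reproduced by hand. Here is a plain-Python sketch of the mean-squared error, using the rounded `output` values shown above and the dummy target `0..9` (so the result matches only to about three decimals):

```python
output = [-0.0821, 0.0263, 0.0497, -0.1758, 0.1649,
          0.1227, 0.1006, 0.0651, -0.0411, -0.0163]
target = [float(i) for i in range(10)]

# Mean of the squared differences over all elements
squared_errors = [(o_i - t_i) ** 2 for o_i, t_i in zip(output, target)]
loss_by_hand = sum(squared_errors) / len(squared_errors)

print(round(loss_by_hand, 3))   # 28.219
```

The untrained outputs are all close to zero, so the loss is dominated by the larger targets.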



So, when we call ```loss.backward()```, the loss is differentiated with respect to every Variable in the graph that requires gradients, and each of those Variables has the result accumulated into its ```.grad``` Variable.

In [53]:
print(loss.creator)  # MSELoss
print(loss.creator.previous_functions[0][0])  # Linear
print(loss.creator.previous_functions[0][0].previous_functions[0][0])  # ReLU

<torch.nn._functions.thnn.auto.MSELoss object at 0x7fb6607574d8>
<torch.nn._functions.linear.Linear object at 0x7fb660757050>
<torch.nn._functions.thnn.auto.Threshold object at 0x7fb66072eed0>


## Backpropagation
To backpropagate the error, all we have to do is call ```loss.backward()```. You need to clear the existing gradients first, though; otherwise the new gradients will be accumulated into the existing ones.

In [54]:
net.zero_grad()  # zeroes the gradient buffers of all parameters
print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)


conv1.bias.grad before backward
Variable containing:
 0
 0
 0
 0
 0
 0
[torch.FloatTensor of size 6]



In [55]:
loss.backward()
print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad after backward
Variable containing:
 0.0308
-0.2030
-0.0105
-0.0009
-0.0872
-0.0341
[torch.FloatTensor of size 6]
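To see why the buffers are zeroed first, here is a toy imitation of gradient accumulation in plain Python. `ToyParam` and `backward` are hypothetical stand-ins written for this sketch, not PyTorch classes. The only behavior being modeled is that each backward pass adds to `.grad` instead of overwriting it.

```python
class ToyParam:
    """A single weight with a gradient buffer (illustrative only)."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def zero_grad(self):
        self.grad = 0.0


def backward(param, x, target):
    """Add d/dw of (w * x - target)^2 to the gradient buffer."""
    param.grad += 2.0 * (param.value * x - target) * x


w = ToyParam(1.0)

backward(w, x=2.0, target=6.0)
first = w.grad                  # 2 * (2 - 6) * 2 = -16

backward(w, x=2.0, target=6.0)
accumulated = w.grad            # no reset in between: -16 + -16 = -32

w.zero_grad()
backward(w, x=2.0, target=6.0)
after_reset = w.grad            # back to -16

print(first, accumulated, after_reset)   # -16.0 -32.0 -16.0
```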



## Update the weights
The simplest update rule used in practice is Stochastic Gradient Descent (SGD):

**```weight = weight - learning_rate * gradient```**
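Here is that rule at work on the smallest possible "network": a single weight fitting `target = 2 * x`. This is a plain-Python sketch with a hand-derived gradient. The data, the learning rate and the epoch count are arbitrary choices made for this illustration.

```python
# (input, target) pairs generated by target = 2 * input
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weight = 0.0
learning_rate = 0.05

for epoch in range(200):
    for x, target in data:
        output = weight * x                        # forward pass
        gradient = 2.0 * (output - target) * x     # d/dw of (output - target)^2
        weight = weight - learning_rate * gradient # the SGD update

print(round(weight, 6))   # 2.0
```

Note the minus sign: stepping against the gradient is what makes the loss go down.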

PyTorch has ```torch.optim```, which implements various update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. Using it is very simple:

In [56]:
import torch.optim as optim

# Create an optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

for _ in range(1000):
    
    # in training loop:
    optimizer.zero_grad()   # zero the gradient buffers
    output = net(input)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()   # Does the update
