# Pytorch Tutorial

Pytorch is a Python framework for machine learning. It provides:

- GPU-accelerated computations
- automatic differentiation
- modules for neural networks

This tutorial teaches the fundamentals of operating on pytorch tensors and networks. Some of the material was already covered in recitation 0 and is reviewed quickly here; most of the tutorial covers new or more advanced topics.

For a worked example of how to build and train a pytorch network, see `pytorch-example.ipynb`.

For additional tutorials, see http://pytorch.org/tutorials/

In [1]:
import torch
import numpy as np
import torch.nn as nn

## Tensors (review)

Tensors are the fundamental object for array data. The most common types you will use are `IntTensor` and `FloatTensor`.

In [2]:
# Create uninitialized tensor
x = torch.FloatTensor(2,3)
print(x)
# Initialize to zeros
x.zero_()
print(x)

tensor([[4.3911e-05, 1.0081e-08, 2.0990e-07],
        [2.1085e-07, 4.0523e-11, 7.1450e+31]])
tensor([[0., 0., 0.],
        [0., 0., 0.]])


In [3]:
# Create from numpy array (seed for repeatability)
np.random.seed(123)
np_array = np.random.random((2,3))
print(torch.FloatTensor(np_array))
print(torch.from_numpy(np_array))

tensor([[0.6965, 0.2861, 0.2269],
        [0.5513, 0.7195, 0.4231]])
tensor([[0.6965, 0.2861, 0.2269],
        [0.5513, 0.7195, 0.4231]], dtype=torch.float64)
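
The two conversions above differ in more than the printed dtype. The following is a small sketch (the names `a`, `shared` and `copied` are chosen only for illustration): `torch.from_numpy` keeps the numpy dtype and shares memory with the array, while building a float32 tensor converts the values and so works on a copy.

```python
import numpy as np
import torch

a = np.zeros((2, 3))                            # float64 numpy array
shared = torch.from_numpy(a)                    # keeps float64, shares memory with `a`
copied = torch.tensor(a, dtype=torch.float32)   # converts to float32, always copies

a[0, 0] = 1.0                                   # modify the numpy array in place
print(shared[0, 0].item())                      # 1.0: the change is visible through `shared`
print(copied[0, 0].item())                      # 0.0: the copy is unaffected
```

Sharing memory avoids a copy, but it also means that in-place changes on either side show up on the other.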


In [4]:
# Create random tensor (seed for repeatability)
torch.manual_seed(123)
x=torch.randn(2,3)
print(x)
# export to numpy array
x_np = x.numpy()
print(x_np)

tensor([[-0.1115,  0.1204, -0.3696],
        [-0.2404, -1.1969,  0.2093]])
[[-0.11146712  0.12036294 -0.3696345 ]
 [-0.24041797 -1.1969243   0.20926936]]


In [5]:
# special tensors (see documentation)
print(torch.eye(3))
print(torch.ones(2,3))
print(torch.zeros(2,3))
print(torch.arange(0,3))

tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([0, 1, 2])


All tensors have a `size` and `type`

In [6]:
x=torch.FloatTensor(3,4)
print(x.size())
print(x.type())

torch.Size([3, 4])
torch.FloatTensor


## Math, Linear Algebra, and Indexing (review)

Pytorch math and linear algebra are similar to numpy. Operators are overloaded so you can use standard math operators (`+`,`-`, etc.) and expect a tensor as a result. See pytorch documentation for a complete list of available functions.

In [7]:
x = torch.arange(0.,5.)
print(torch.sum(x))
print(torch.sum(torch.exp(x)))
print(torch.mean(x))

tensor(10.)
tensor(85.7910)
tensor(2.)


Pytorch indexing is similar to numpy indexing. See pytorch documentation for details.

In [8]:
x = torch.rand(3,2)
print(x)
print(x[1,:])

tensor([[0.0756, 0.1966],
        [0.3164, 0.4017],
        [0.1186, 0.8274]])
tensor([0.3164, 0.4017])


## CPU and GPU

Tensors can be copied between CPU and GPU. It is important that everything involved in a calculation is on the same device. 

This portion of the tutorial may not work for you if you do not have a GPU available.

In [9]:
# create a tensor
x = torch.rand(3,2)
# copy to GPU
y = x.cuda()
# copy back to CPU
z = y.cpu()
# get CPU tensor as numpy array
# cannot get GPU tensor as numpy array directly
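try:
    y.numpy()
except (TypeError, RuntimeError) as e:  # the exception type depends on the pytorch version
    print('{}: {}'.format(type(e).__name__, e))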

TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

Operations between GPU and CPU tensors will fail. Operations require all arguments to be on the same device.

In [None]:
x = torch.rand(3,5)  # CPU tensor
y = torch.rand(5,4).cuda()  # GPU tensor
try:
    torch.mm(x,y)  # Operation between CPU and GPU fails
except (RuntimeError, TypeError) as e:  # the exception type depends on the pytorch version
    print(e)

Typical code should include `if` statements or utilize helper functions so it can operate with or without the GPU.

In [10]:
# Put tensor on CUDA if available
x = torch.rand(3,2)
if torch.cuda.is_available():
    x = x.cuda()
    print(x, x.dtype)
    
# Do some calculations
y = x ** 2 
print(y)

# Copy to CPU if on GPU
if y.is_cuda:
    y = y.cpu()
    print(y, y.dtype)

tensor([[0.2745, 0.6584],
        [0.2775, 0.8573],
        [0.8993, 0.0390]], device='cuda:0') torch.float32
tensor([[0.0753, 0.4335],
        [0.0770, 0.7350],
        [0.8088, 0.0015]], device='cuda:0')
tensor([[0.0753, 0.4335],
        [0.0770, 0.7350],
        [0.8088, 0.0015]]) torch.float32
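
An alternative to scattering `if` statements through the code is to pick a device once and pass it around. The sketch below assumes the `torch.device` object and the `device=` and `.to()` arguments available since pytorch 0.4; the variable names are illustrative.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(3, 2, device=device)  # created directly on the chosen device
y = x ** 2                           # computed on the same device as x
y_cpu = y.to("cpu")                  # returns y itself if it is already on the CPU
print(y_cpu.device)
```

The same script then runs unchanged with or without a GPU.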


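A convenient method is `new`, which creates a new tensor of the same type and on the same device as another tensor. Its contents are uninitialized, as the outputs below show. Use it whenever a new tensor must live on the same device as an existing one.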

In [11]:
x1 = torch.rand(3,2)
x2 = x1.new(1,2)  # create cpu tensor
print(x2)
x1 = torch.rand(3,2).cuda()
x2 = x1.new(1,2)  # create cuda tensor
print(x2)

tensor([[0., 0.]])
tensor([[0.0753, 0.4335]], device='cuda:0')
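
Because `new` leaves the memory uninitialized, it can be safer to use a variant that also fills in values. This is a minimal sketch using `new_zeros`, `new_ones` and `torch.zeros_like`, which follow the dtype and device of an existing tensor; the float64 dtype is chosen here only to make the inheritance visible.

```python
import torch

x1 = torch.rand(3, 2, dtype=torch.float64)

z = x1.new_zeros(1, 2)             # zeros with the dtype and device of x1
o = x1.new_ones(1, 2)              # ones with the dtype and device of x1
same_shape = torch.zeros_like(x1)  # zeros with the shape, dtype and device of x1

print(z.dtype, z.device)
print(same_shape.shape)
```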


Calculations executed on the GPU can be many times faster than numpy. However, numpy is still optimized for the CPU and many times faster than python `for` loops. Numpy calculations may be faster than GPU calculations for small arrays due to the cost of interfacing with the GPU.

In [12]:
from timeit import timeit
# Create random data
x = torch.rand(1000,64)
y = torch.rand(64,32)
number = 10000  # number of iterations

def matmul():
    z = torch.mm(x, y)  # matrix multiplication (mm)
    if z.is_cuda:
        torch.cuda.synchronize()  # CUDA kernels run asynchronously; wait for the result

# Time CPU
print('CPU: {}ms'.format(timeit(matmul, number=number)*1000))
# Time GPU
x, y = x.cuda(), y.cuda()
print('GPU: {}ms'.format(timeit(matmul, number=number)*1000))

## Differentiation

Tensors provide automatic differentiation.

As you might know, previous versions of Pytorch used Variables, which were wrappers around tensors for differentiation. Starting with pytorch 0.4.0, this wrapping is done internally in the Tensor class and you can, and should, differentiate Tensors directly. However, you may still come across references to Variables, e.g. in error messages.

What you need to remember:

- Tensors you are differentiating with respect to must have `requires_grad=True` (and a floating point dtype)
- Call `.backward()` on scalar tensors you are differentiating
- To differentiate a vector, sum it first

In [13]:
# Create differentiable tensor (floating point, with requires_grad=True)
x = torch.arange(0., 4., requires_grad=True)
print(x.dtype)
# Calculate y=sum(x**2)
y = x**2
# Calculate gradient (dy/dx=2x)
y.sum().backward()
# Print values
print(x)
print(y)
print(x.grad)

torch.float32
tensor([0., 1., 2., 3.], requires_grad=True)
tensor([0., 1., 4., 9.], grad_fn=<PowBackward0>)
tensor([0., 2., 4., 6.])

Differentiation accumulates gradients. This is sometimes what you want and sometimes not. **Make sure to zero gradients between batches if performing gradient descent or you will get strange results!**

In [14]:
# Create a differentiable tensor
x = torch.arange(0., 4., requires_grad=True)
# Differentiate
torch.sum(x**2).backward()
print(x.grad)
# Differentiate again (accumulates gradient)
torch.sum(x**2).backward()
print(x.grad)
# Zero gradient before differentiating
x.grad.zero_()
torch.sum(x**2).backward()
print(x.grad)

tensor([0., 2., 4., 6.])
tensor([ 0.,  4.,  8., 12.])
tensor([0., 2., 4., 6.])

Note that a Tensor that requires gradients cannot be exported to numpy directly:

In [15]:
x = torch.arange(0., 4., requires_grad=True)
x.numpy() # raises a RuntimeError because x requires grad

The reason is that pytorch records the graph of all computations in order to perform differentiation. To be part of this graph, the raw data is wrapped internally by the Tensor class (like what was formerly a Variable). You can detach a tensor from the graph using the **.detach()** method, which returns a tensor with the same data but with requires_grad set to False.

In [16]:
x = torch.arange(0., 4., requires_grad=True)
y=x**2
z=y**2
z.detach().numpy()

array([ 0.,  1., 16., 81.], dtype=float32)

Another reason to use this method is that building the graph can use a lot of memory. If you have a differentiable tensor that you don't need to differentiate, consider detaching it from the graph.
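
When a whole block of computation does not need gradients (for instance evaluation code), the `torch.no_grad()` context manager avoids building the graph in the first place. A minimal sketch comparing it with `detach()`:

```python
import torch

x = torch.arange(0., 4., requires_grad=True)

# Inside no_grad, results are not attached to the graph.
with torch.no_grad():
    y = x ** 2
print(y.requires_grad)   # False

# detach() gives a graph-free view of a tensor that was computed with tracking on.
z = (x ** 2).detach()
print(z.requires_grad)   # False
```

Both `y` and `z` hold the same values; the difference is that with `no_grad` the graph is never built, while `detach()` cuts a tensor loose from a graph that already exists.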

## Neural Network Modules

Pytorch provides a framework for developing neural network modules. They take care of many things, the main one being wrapping and tracking a list of parameters for you.
You have several ways of building and using a network, offering different tradeoffs between freedom and simplicity.

`torch.nn` provides basic single-layer modules, such as `Linear` (a fully connected layer) and activation layers.

In [17]:
x = torch.arange(0., 32.)  # float input, to match the float weights of the layer
net = torch.nn.Linear(32,10)
y = net(x)
print(y)

All nn.Module objects are reusable as components of bigger networks! That is how you build custom nets. The simplest way is to use the nn.Sequential class.

You can also create your own class that inherits from nn.Module. The forward method should specify what happens in the forward pass given an input. This lets you specify behaviors more complicated than just applying layers one after another, if necessary.

In [18]:
# create a simple sequential network (`nn.Module` object) from layers (other `nn.Module` objects).
# Here a MLP with 2 layers and sigmoid activation.
net = torch.nn.Sequential(
    torch.nn.Linear(32,128),
    torch.nn.Sigmoid(),
    torch.nn.Linear(128,10))

In [19]:
# create a more customizable network module (equivalent here)
class MyNetwork(torch.nn.Module):
    # you can use the layer sizes as initialization arguments if you want to
    def __init__(self,input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = torch.nn.Linear(input_size,hidden_size)
        self.layer2 = torch.nn.Sigmoid()
        self.layer3 = torch.nn.Linear(hidden_size,output_size)

    def forward(self, input_val):
        h = input_val
        h = self.layer1(h)
        h = self.layer2(h)
        h = self.layer3(h)
        return h

net = MyNetwork(32,128,10)
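
Either kind of network is called like a function on a batch of inputs, which runs its `forward` method. A minimal sketch, assuming a batch of 5 random inputs (the batch size and the name `batch` are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(32, 128),
    nn.Sigmoid(),
    nn.Linear(128, 10))

batch = torch.randn(5, 32)   # 5 examples with 32 features each
out = net(batch)             # runs the forward pass
print(out.shape)             # torch.Size([5, 10])
```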

The network tracks parameters, and you can access them through the **parameters()** method, which returns a python generator.

In [20]:
for param in net.parameters():
    print(param)

Parameter containing:
tensor([[ 0.0154,  0.0785, -0.1184,  ...,  0.1257,  0.0365,  0.0124],
        [-0.0234, -0.0679,  0.1513,  ..., -0.1436, -0.0187,  0.1284],
        [-0.1456,  0.0865,  0.1176,  ...,  0.1445,  0.0372, -0.0647],
        ...,
        [-0.1480, -0.1639,  0.1487,  ...,  0.0847, -0.0920,  0.1166],
        [-0.1258, -0.1310,  0.0416,  ...,  0.0714, -0.0716,  0.1599],
        [ 0.1268,  0.0177, -0.1755,  ..., -0.1349,  0.0671, -0.0431]],
       requires_grad=True)
Parameter containing:
tensor([ 0.0408, -0.0123, -0.1635,  0.0330,  0.1358, -0.1656, -0.0614, -0.0220,
        -0.0578,  0.0027,  0.0237,  0.0159,  0.1424,  0.1321, -0.1763,  0.0188,
         0.0380, -0.0411, -0.1350, -0.1282, -0.1592, -0.0935,  0.1409,  0.1473,
         0.0075,  0.0541, -0.0385,  0.0188,  0.1385, -0.1126, -0.0476,  0.1354,
        -0.0513, -0.0271, -0.0602,  0.1677,  0.0115, -0.0710, -0.0480,  0.0504,
         0.0364,  0.1576, -0.0507,  0.1083, -0.1162, -0.0565, -0.0122,  0.0131,
         0.0323

Parameters are of type Parameter, which is basically a wrapper for a tensor. How does pytorch retrieve your network's parameters? They are simply all the attributes of type Parameter in your network. Moreover, if an attribute is of type nn.Module, its own parameters are added to your network's parameters! This is why, when you define a network by combining basic components such as nn.Linear, you should never have to define parameters explicitly.
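
To see which parameters a network has collected from its submodules, `named_parameters()` can be used alongside `parameters()`. A short sketch on the sequential MLP from above:

```python
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(32, 128),
    nn.Sigmoid(),
    nn.Linear(128, 10))

# named_parameters() yields (name, Parameter) pairs collected from all submodules.
names = []
for name, param in net.named_parameters():
    names.append(name)
    print(name, tuple(param.shape))

n_params = sum(p.numel() for p in net.parameters())
print(n_params)  # 32*128 + 128 + 128*10 + 10 = 5514
```

The Sigmoid layer contributes nothing because it has no parameters.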

However, if no default pytorch module does what you need, you can define parameters explicitly (this should be rare). For the record, let's build an MLP like the previous one with hand-written parameters (here with a ReLU activation instead of a sigmoid).

In [21]:
class MyNetworkWithParams(nn.Module):
    def __init__(self,input_size, hidden_size, output_size):
        super(MyNetworkWithParams,self).__init__()
        self.layer1_weights = nn.Parameter(torch.randn(input_size,hidden_size))
        self.layer1_bias = nn.Parameter(torch.randn(hidden_size))
        self.layer2_weights = nn.Parameter(torch.randn(hidden_size,output_size))
        self.layer2_bias = nn.Parameter(torch.randn(output_size))
        
    def forward(self,x):
        h1 = torch.matmul(x,self.layer1_weights) + self.layer1_bias
        h1_act = torch.clamp(h1, min=0) # ReLU (stays on the same device as h1)
        output = torch.matmul(h1_act,self.layer2_weights) + self.layer2_bias
        return output

net = MyNetworkWithParams(32,128,10)

Parameters are meant to hold all the network weights that will be optimized during training. If you need a tensor in your computational graph that should remain constant, just define it as a regular tensor (registering it with `register_buffer` makes it follow the module across devices).

## Training

In [22]:
net = MyNetwork(32,128,10)

The `torch.nn` package also provides loss functions, such as cross-entropy.

In [23]:
x = torch.tensor([np.arange(32), np.zeros(32),np.ones(32)]).float()
y = torch.tensor([0,3,9])
criterion = nn.CrossEntropyLoss()

output = net(x)
loss = criterion(output,y)
print(loss)

tensor(2.1876, grad_fn=<NllLossBackward>)


nn.CrossEntropyLoss does both the softmax and the actual cross-entropy: given $output$ of size $(n,d)$ and $y$ of size $n$ with values in $0,1,...,d-1$, it computes (with the default mean reduction) $-\frac{1}{n}\sum_{i=0}^{n-1}\log(s[i,y[i]])$ where $s[i,j] = \frac{e^{output[i,j]}}{\sum_{j'=0}^{d-1}e^{output[i,j']}}$

You can also compose nn.LogSoftmax and nn.NLLLoss to get the same result. Note that all these use the log-softmax rather than the softmax, for stability in the computations.

In [24]:
# equivalent
criterion2 = nn.NLLLoss()
sf = nn.LogSoftmax(dim=1)  # normalize over the class dimension
output = net(x)
loss = criterion2(sf(output),y)
loss

tensor(2.1876, grad_fn=<NllLossBackward>)
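
The loss can also be checked by hand. With the default `reduction='mean'`, `nn.CrossEntropyLoss` is the negative average of the log-softmax score of the correct class. The sketch below uses random scores as a stand-in for `net(x)`; the seed and the names `manual` and `builtin` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
output = torch.randn(3, 10)      # stand-in for net(x): n=3 examples, d=10 classes
y = torch.tensor([0, 3, 9])

log_s = F.log_softmax(output, dim=1)          # log s[i, j]
manual = -log_s[torch.arange(3), y].mean()    # -(1/n) * sum_i log s[i, y[i]]
builtin = nn.CrossEntropyLoss()(output, y)
print(torch.allclose(manual, builtin))        # True
```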

Now, to perform the backward pass, just execute **loss.backward()**! It computes gradients for all differentiable tensors in the graph, which in particular includes all the network parameters.

In [25]:
loss.backward()

# Check that the parameters now have gradients
for param in net.parameters():
    print(param.grad)

tensor([[-5.9255e-03, -5.9252e-03, -5.9249e-03,  ..., -5.9165e-03,
         -5.9162e-03, -5.9159e-03],
        [ 2.0501e-03,  2.0442e-03,  2.0384e-03,  ...,  1.8805e-03,
          1.8746e-03,  1.8688e-03],
        [ 2.3427e-03,  2.3427e-03,  2.3427e-03,  ...,  2.3427e-03,
          2.3427e-03,  2.3427e-03],
        ...,
        [ 7.0074e-04,  3.5227e-04,  3.8067e-06,  ..., -9.4048e-03,
         -9.7533e-03, -1.0102e-02],
        [ 2.5619e-03,  3.3242e-03,  4.0865e-03,  ...,  2.4669e-02,
          2.5431e-02,  2.6193e-02],
        [-3.4918e-03, -3.4963e-03, -3.5009e-03,  ..., -3.6227e-03,
         -3.6272e-03, -3.6317e-03]])
tensor([-2.1000e-03, -2.4149e-03, -4.7471e-03,  1.8195e-03, -3.0208e-03,
        -1.5004e-03,  1.7470e-03,  4.4897e-03, -4.6868e-03, -4.3105e-03,
         9.4013e-04, -8.7316e-04, -2.6598e-03,  1.8245e-03,  6.0240e-04,
        -3.3299e-03, -7.6377e-03,  9.3953e-03,  7.7175e-03,  6.4383e-03,
        -1.9766e-03,  4.1844e-03,  1.1237e-02, -7.3580e-03, -6.6197e-03,
   

In [26]:
# if I forward prop and backward prop again, gradients accumulate :
output = net(x)
loss = criterion(output,y)
loss.backward()
for param in net.parameters():
    print(param.grad)

# you can remove this behavior by reinitializing the gradients in your network parameters :
net.zero_grad()
output = net(x)
loss = criterion(output,y)
loss.backward()
for param in net.parameters():
    print(param.grad)

tensor([[-1.1851e-02, -1.1850e-02, -1.1850e-02,  ..., -1.1833e-02,
         -1.1832e-02, -1.1832e-02],
        [ 4.1001e-03,  4.0884e-03,  4.0767e-03,  ...,  3.7609e-03,
          3.7492e-03,  3.7375e-03],
        [ 4.6854e-03,  4.6854e-03,  4.6854e-03,  ...,  4.6854e-03,
          4.6854e-03,  4.6854e-03],
        ...,
        [ 1.4015e-03,  7.0455e-04,  7.6134e-06,  ..., -1.8810e-02,
         -1.9507e-02, -2.0204e-02],
        [ 5.1237e-03,  6.6483e-03,  8.1729e-03,  ...,  4.9337e-02,
          5.0862e-02,  5.2386e-02],
        [-6.9837e-03, -6.9927e-03, -7.0017e-03,  ..., -7.2454e-03,
         -7.2544e-03, -7.2635e-03]])
tensor([-4.2001e-03, -4.8299e-03, -9.4943e-03,  3.6389e-03, -6.0416e-03,
        -3.0008e-03,  3.4940e-03,  8.9794e-03, -9.3735e-03, -8.6211e-03,
         1.8803e-03, -1.7463e-03, -5.3195e-03,  3.6490e-03,  1.2048e-03,
        -6.6598e-03, -1.5275e-02,  1.8791e-02,  1.5435e-02,  1.2877e-02,
        -3.9532e-03,  8.3688e-03,  2.2474e-02, -1.4716e-02, -1.3239e-02,
   

We did backpropagation, but still didn't perform gradient descent. Let's define an optimizer on the network parameters.

In [27]:
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

print("Parameters before gradient descent :")
for param in net.parameters():
    print(param)

optimizer.step()

print("Parameters after gradient descent :")
for param in net.parameters():
    print(param)

Parameters before gradient descent :
Parameter containing:
tensor([[ 0.1683,  0.1271, -0.0570,  ...,  0.0509,  0.0029,  0.1396],
        [ 0.0897, -0.0470,  0.1538,  ...,  0.0113,  0.1658, -0.1457],
        [-0.1631, -0.0063, -0.1529,  ..., -0.1409, -0.1373,  0.0114],
        ...,
        [ 0.0177, -0.0034,  0.0404,  ...,  0.1315, -0.1568, -0.0630],
        [ 0.1281, -0.1212, -0.1050,  ...,  0.0757, -0.1225,  0.0942],
        [-0.1489,  0.0763, -0.0951,  ...,  0.0506, -0.0818,  0.1764]],
       requires_grad=True)
Parameter containing:
tensor([ 0.1180,  0.0759,  0.0698, -0.1238,  0.1715, -0.1681, -0.0615,  0.0139,
         0.0282,  0.0104, -0.0972, -0.0423,  0.0740,  0.0780,  0.0622,  0.0414,
         0.0018,  0.0042, -0.1281,  0.0354, -0.0453,  0.0674, -0.1139,  0.0806,
         0.0038, -0.0815,  0.0847,  0.0365, -0.0064, -0.0384, -0.1690,  0.0996,
         0.0952,  0.0341, -0.0457, -0.0309,  0.0196,  0.0274, -0.1702, -0.0458,
         0.0352, -0.1366,  0.1607, -0.1137, -0.0695,  0.04
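
For plain SGD without momentum or weight decay, `optimizer.step()` is expected to apply `p <- p - lr * p.grad` to each parameter. The following sketch checks this on a toy layer (the layer sizes and the squared-output loss are arbitrary choices for illustration, not part of the example above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)
optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)

loss = layer(torch.ones(1, 4)).pow(2).sum()
loss.backward()

before = layer.weight.detach().clone()   # copy of the weights before the step
grad = layer.weight.grad.clone()         # gradient used by the step
optimizer.step()

expected = before - 0.01 * grad
print(torch.allclose(layer.weight.detach(), expected))   # True
```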

In [28]:
# In a training loop, we should perform many GD iterations.
n_iter = 1000
for i in range(n_iter):
    optimizer.zero_grad() # equivalent to net.zero_grad()
    output = net(x)
    loss = criterion(output,y)
    loss.backward()
    optimizer.step()
    print(loss)

tensor(2.0752, grad_fn=<NllLossBackward>)
tensor(1.9747, grad_fn=<NllLossBackward>)
tensor(1.8850, grad_fn=<NllLossBackward>)
tensor(1.8045, grad_fn=<NllLossBackward>)
tensor(1.7319, grad_fn=<NllLossBackward>)
tensor(1.6661, grad_fn=<NllLossBackward>)
tensor(1.6066, grad_fn=<NllLossBackward>)
tensor(1.5526, grad_fn=<NllLossBackward>)
tensor(1.5034, grad_fn=<NllLossBackward>)
tensor(1.4583, grad_fn=<NllLossBackward>)
tensor(1.4168, grad_fn=<NllLossBackward>)
tensor(1.3782, grad_fn=<NllLossBackward>)
tensor(1.3424, grad_fn=<NllLossBackward>)
tensor(1.3088, grad_fn=<NllLossBackward>)
tensor(1.2773, grad_fn=<NllLossBackward>)
tensor(1.2477, grad_fn=<NllLossBackward>)
tensor(1.2197, grad_fn=<NllLossBackward>)
tensor(1.1934, grad_fn=<NllLossBackward>)
tensor(1.1685, grad_fn=<NllLossBackward>)
tensor(1.1450, grad_fn=<NllLossBackward>)
tensor(1.1228, grad_fn=<NllLossBackward>)
tensor(1.1017, grad_fn=<NllLossBackward>)
tensor(1.0818, grad_fn=<NllLossBackward>)
tensor(1.0628, grad_fn=<NllLossBackward>)
...

tensor(0.3745, grad_fn=<NllLossBackward>)
tensor(0.3733, grad_fn=<NllLossBackward>)
tensor(0.3721, grad_fn=<NllLossBackward>)
tensor(0.3708, grad_fn=<NllLossBackward>)
tensor(0.3696, grad_fn=<NllLossBackward>)
tensor(0.3684, grad_fn=<NllLossBackward>)
tensor(0.3672, grad_fn=<NllLossBackward>)
tensor(0.3660, grad_fn=<NllLossBackward>)
tensor(0.3648, grad_fn=<NllLossBackward>)
tensor(0.3636, grad_fn=<NllLossBackward>)
tensor(0.3625, grad_fn=<NllLossBackward>)
tensor(0.3613, grad_fn=<NllLossBackward>)
tensor(0.3601, grad_fn=<NllLossBackward>)
tensor(0.3590, grad_fn=<NllLossBackward>)
tensor(0.3578, grad_fn=<NllLossBackward>)
tensor(0.3567, grad_fn=<NllLossBackward>)
tensor(0.3555, grad_fn=<NllLossBackward>)
tensor(0.3544, grad_fn=<NllLossBackward>)
tensor(0.3533, grad_fn=<NllLossBackward>)
tensor(0.3522, grad_fn=<NllLossBackward>)
tensor(0.3510, grad_fn=<NllLossBackward>)
tensor(0.3499, grad_fn=<NllLossBackward>)
tensor(0.3488, grad_fn=<NllLossBackward>)
tensor(0.3477, grad_fn=<NllLossBackward>)
...

tensor(0.2130, grad_fn=<NllLossBackward>)
tensor(0.2125, grad_fn=<NllLossBackward>)
tensor(0.2119, grad_fn=<NllLossBackward>)
tensor(0.2114, grad_fn=<NllLossBackward>)
tensor(0.2108, grad_fn=<NllLossBackward>)
tensor(0.2103, grad_fn=<NllLossBackward>)
tensor(0.2097, grad_fn=<NllLossBackward>)
tensor(0.2092, grad_fn=<NllLossBackward>)
tensor(0.2086, grad_fn=<NllLossBackward>)
tensor(0.2081, grad_fn=<NllLossBackward>)
tensor(0.2075, grad_fn=<NllLossBackward>)
tensor(0.2070, grad_fn=<NllLossBackward>)
tensor(0.2065, grad_fn=<NllLossBackward>)
tensor(0.2059, grad_fn=<NllLossBackward>)
tensor(0.2054, grad_fn=<NllLossBackward>)
tensor(0.2049, grad_fn=<NllLossBackward>)
tensor(0.2044, grad_fn=<NllLossBackward>)
tensor(0.2038, grad_fn=<NllLossBackward>)
tensor(0.2033, grad_fn=<NllLossBackward>)
tensor(0.2028, grad_fn=<NllLossBackward>)
tensor(0.2023, grad_fn=<NllLossBackward>)
tensor(0.2017, grad_fn=<NllLossBackward>)
tensor(0.2012, grad_fn=<NllLossBackward>)
tensor(0.2007, grad_fn=<NllLossBackward>)
...

tensor(0.1340, grad_fn=<NllLossBackward>)
tensor(0.1337, grad_fn=<NllLossBackward>)
tensor(0.1334, grad_fn=<NllLossBackward>)
tensor(0.1331, grad_fn=<NllLossBackward>)
tensor(0.1329, grad_fn=<NllLossBackward>)
tensor(0.1326, grad_fn=<NllLossBackward>)
tensor(0.1323, grad_fn=<NllLossBackward>)
tensor(0.1320, grad_fn=<NllLossBackward>)
tensor(0.1317, grad_fn=<NllLossBackward>)
tensor(0.1315, grad_fn=<NllLossBackward>)
tensor(0.1312, grad_fn=<NllLossBackward>)
tensor(0.1309, grad_fn=<NllLossBackward>)
tensor(0.1306, grad_fn=<NllLossBackward>)
tensor(0.1303, grad_fn=<NllLossBackward>)
tensor(0.1301, grad_fn=<NllLossBackward>)
tensor(0.1298, grad_fn=<NllLossBackward>)
tensor(0.1295, grad_fn=<NllLossBackward>)
tensor(0.1293, grad_fn=<NllLossBackward>)
tensor(0.1290, grad_fn=<NllLossBackward>)
tensor(0.1287, grad_fn=<NllLossBackward>)
tensor(0.1284, grad_fn=<NllLossBackward>)
tensor(0.1282, grad_fn=<NllLossBackward>)
tensor(0.1279, grad_fn=<NllLossBackward>)
tensor(0.1276, grad_fn=<NllLossBackward>)
...

tensor(0.0921, grad_fn=<NllLossBackward>)
tensor(0.0920, grad_fn=<NllLossBackward>)
tensor(0.0918, grad_fn=<NllLossBackward>)
tensor(0.0916, grad_fn=<NllLossBackward>)
tensor(0.0915, grad_fn=<NllLossBackward>)
tensor(0.0913, grad_fn=<NllLossBackward>)
tensor(0.0912, grad_fn=<NllLossBackward>)
tensor(0.0910, grad_fn=<NllLossBackward>)
tensor(0.0909, grad_fn=<NllLossBackward>)
tensor(0.0907, grad_fn=<NllLossBackward>)
tensor(0.0906, grad_fn=<NllLossBackward>)
tensor(0.0904, grad_fn=<NllLossBackward>)
tensor(0.0902, grad_fn=<NllLossBackward>)
tensor(0.0901, grad_fn=<NllLossBackward>)
tensor(0.0899, grad_fn=<NllLossBackward>)
tensor(0.0898, grad_fn=<NllLossBackward>)
tensor(0.0896, grad_fn=<NllLossBackward>)
tensor(0.0895, grad_fn=<NllLossBackward>)
tensor(0.0893, grad_fn=<NllLossBackward>)
tensor(0.0892, grad_fn=<NllLossBackward>)
tensor(0.0890, grad_fn=<NllLossBackward>)
tensor(0.0889, grad_fn=<NllLossBackward>)
tensor(0.0887, grad_fn=<NllLossBackward>)
tensor(0.0886, grad_fn=<NllLossBackward>)
...

tensor(0.0680, grad_fn=<NllLossBackward>)
tensor(0.0679, grad_fn=<NllLossBackward>)
tensor(0.0678, grad_fn=<NllLossBackward>)
tensor(0.0677, grad_fn=<NllLossBackward>)
tensor(0.0676, grad_fn=<NllLossBackward>)
tensor(0.0675, grad_fn=<NllLossBackward>)
tensor(0.0674, grad_fn=<NllLossBackward>)
tensor(0.0673, grad_fn=<NllLossBackward>)
tensor(0.0672, grad_fn=<NllLossBackward>)
tensor(0.0671, grad_fn=<NllLossBackward>)
tensor(0.0670, grad_fn=<NllLossBackward>)
tensor(0.0669, grad_fn=<NllLossBackward>)
tensor(0.0668, grad_fn=<NllLossBackward>)
tensor(0.0667, grad_fn=<NllLossBackward>)
tensor(0.0667, grad_fn=<NllLossBackward>)
tensor(0.0666, grad_fn=<NllLossBackward>)
tensor(0.0665, grad_fn=<NllLossBackward>)
tensor(0.0664, grad_fn=<NllLossBackward>)
tensor(0.0663, grad_fn=<NllLossBackward>)
tensor(0.0662, grad_fn=<NllLossBackward>)
tensor(0.0661, grad_fn=<NllLossBackward>)
tensor(0.0660, grad_fn=<NllLossBackward>)
tensor(0.0659, grad_fn=<NllLossBackward>)
tensor(0.0658, grad_fn=<NllLossBackward>)
...

In [29]:
output = net(x)
print(output)
print(y)

tensor([[ 7.9902, -1.7800, -2.0774,  0.5787, -1.9603, -1.7135, -1.6187, -1.8039,
         -1.8273,  4.1082],
        [ 0.2567, -1.4705, -1.5120,  5.7903, -1.5307, -1.6327, -1.5679, -1.5251,
         -1.6856,  3.2233],
        [ 1.8633, -1.4923, -1.5976,  3.0309, -1.6929, -1.5278, -1.5327, -1.6354,
         -1.6566,  5.6728]], grad_fn=<AddmmBackward>)
tensor([0, 3, 9])
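
To turn these scores into class predictions, take the index of the largest score in each row. The sketch below copies the printed scores above into a tensor so that it is self-contained; in the notebook one would apply `argmax` to `output` directly.

```python
import torch

# Scores copied from the printed output above (one row per example).
output = torch.tensor([
    [7.9902, -1.7800, -2.0774, 0.5787, -1.9603, -1.7135, -1.6187, -1.8039, -1.8273, 4.1082],
    [0.2567, -1.4705, -1.5120, 5.7903, -1.5307, -1.6327, -1.5679, -1.5251, -1.6856, 3.2233],
    [1.8633, -1.4923, -1.5976, 3.0309, -1.6929, -1.5278, -1.5327, -1.6354, -1.6566, 5.6728]])
y = torch.tensor([0, 3, 9])

pred = output.argmax(dim=1)                   # index of the largest score in each row
accuracy = (pred == y).float().mean().item()  # fraction of correct predictions
print(pred)       # tensor([0, 3, 9])
print(accuracy)   # 1.0
```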


Now you know how to train a network! For a complete training example, see `pytorch-example.ipynb`.

## Saving and Loading

In [30]:
# get dictionary of keys to weights using `state_dict`
net = torch.nn.Sequential(
    torch.nn.Linear(28*28,256),
    torch.nn.Sigmoid(),
    torch.nn.Linear(256,10))
print(net.state_dict().keys())

odict_keys(['0.weight', '0.bias', '2.weight', '2.bias'])


In [31]:
# save a dictionary
torch.save(net.state_dict(),'test.t7')
# load a dictionary
net.load_state_dict(torch.load('test.t7'))

<All keys matched successfully>
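
A saved `state_dict` holds only the weights, so loading requires building a network with the same architecture first. The sketch below round-trips the weights through a temporary file; the helper `make_net` and the file name `weights.pt` are hypothetical names used for illustration, and `map_location` is included on the assumption that the checkpoint may have been saved on a GPU.

```python
import os
import tempfile

import torch
import torch.nn as nn

def make_net():
    # Hypothetical helper: rebuilds the same architecture for saving and loading.
    return nn.Sequential(nn.Linear(28 * 28, 256), nn.Sigmoid(), nn.Linear(256, 10))

net = make_net()
restored = make_net()   # freshly initialized, so its weights differ from net's

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.pt")   # file name is arbitrary
    torch.save(net.state_dict(), path)
    # map_location="cpu" lets a checkpoint saved on a GPU load on a CPU-only machine
    state = torch.load(path, map_location="cpu")

restored.load_state_dict(state)
print(torch.equal(net[0].weight, restored[0].weight))   # True
```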

## Common issues to look out for

### Type mismatch

In [32]:
net = nn.Linear(4,2)
x = torch.tensor([1,2,3,4])
y = net(x)
print(y)

RuntimeError: Expected object of scalar type Long but got scalar type Float for argument #2 'mat2'

In [33]:
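# Fix 1: cast the existing tensor to float
x = x.float()
# Fix 2: create the tensor from float values in the first place
x = torch.tensor([1.,2.,3.,4.])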

In [34]:
x = 2* torch.ones(2,2)
y = 3* torch.ones(2,2)
print(x * y)
print(x.matmul(y))

tensor([[6., 6.],
        [6., 6.]])
tensor([[12., 12.],
        [12., 12.]])


In [35]:
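x = torch.ones(4,5)
y = torch.arange(5.)  # shape (5,): broadcast along the rows
print(x+y)
y = torch.arange(4.).view(-1,1)  # shape (4,1): broadcast along the columns
print(x+y)
y = torch.arange(4.)  # shape (4,): does not match the last dimension of x
print(x+y) # exception

tensor([[1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.],
        [1., 2., 3., 4., 5.]])
tensor([[1., 1., 1., 1., 1.],
        [2., 2., 2., 2., 2.],
        [3., 3., 3., 3., 3.],
        [4., 4., 4., 4., 4.]])
RuntimeError: The size of tensor a (5) must match the size of tensor b (4) at non-singleton dimension 1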

In [36]:
x = torch.tensor([[1,2,3],[4,5,6]])
print(x)
print(x.t())
print(x.view(3,2))

tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 4],
        [2, 5],
        [3, 6]])
tensor([[1, 2],
        [3, 4],
        [5, 6]])


In [37]:
net = nn.Sequential(nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,2048),nn.ReLU(),
                   nn.Linear(2048,120))
x = torch.ones(256,2048)
y = torch.zeros(256).long()
net.cuda()
x.cuda()  # pitfall: returns a GPU copy that is discarded; x stays on the CPU (fix: x = x.cuda())
crit=nn.CrossEntropyLoss()
out = net(x)
loss = crit(out,y)
loss.backward()

RuntimeError: Expected object of backend CUDA but got backend CPU for argument #4 'mat1'
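
The error above comes from an asymmetry: `net.cuda()` moves the module's parameters in place, while `x.cuda()` returns a new tensor and leaves `x` untouched. The same convention can be observed without a GPU by using `.to()` with a dtype. This is a sketch of the behavior, not of the GPU case itself.

```python
import torch
import torch.nn as nn

net = nn.Linear(4, 2)
x = torch.ones(1, 4)

net.to(torch.float64)      # modules are converted in place
x.to(torch.float64)        # tensors are not: the result is discarded here
print(net.weight.dtype)    # torch.float64
print(x.dtype)             # torch.float32

x = x.to(torch.float64)    # reassign to keep the converted tensor
out = net(x)               # now the dtypes match
```

For tensors, always assign the result of `.cuda()`, `.cpu()` or `.to()` back to a variable.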

In [38]:
class MyNet(nn.Module):
    def __init__(self,n_hidden_layers):
        super(MyNet,self).__init__()
        self.n_hidden_layers=n_hidden_layers
        self.final_layer = nn.Linear(128,10)
        self.act = nn.ReLU()
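        # Pitfall: modules stored in a plain Python list are not registered,
        # so net.parameters() and net.cuda() will not see these layers.
        self.hidden = []
        for i in range(n_hidden_layers):
            self.hidden.append(nn.Linear(128,128))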
    
            
    def forward(self,x):
        h = x
        for i in range(self.n_hidden_layers):
            h = self.hidden[i](h)
            h = self.act(h)
        out = self.final_layer(h)
        return out

In [39]:
class MyNet(nn.Module):
    def __init__(self,n_hidden_layers):
        super(MyNet,self).__init__()
        self.n_hidden_layers=n_hidden_layers
        self.final_layer = nn.Linear(128,10)
        self.act = nn.ReLU()
        self.hidden = []
        for i in range(n_hidden_layers):
            self.hidden.append(nn.Linear(128,128))
        # Fix: wrap the list in nn.ModuleList so the layers are registered as submodules
        self.hidden = nn.ModuleList(self.hidden)
            
    def forward(self,x):
        h = x
        for i in range(self.n_hidden_layers):
            h = self.hidden[i](h)
            h = self.act(h)
        out = self.final_layer(h)
        return out
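
The difference between the two versions shows up when counting parameters. The sketch below uses two stripped-down classes (`WithList` and `WithModuleList` are hypothetical names, and the `forward` methods are omitted because only registration is being checked):

```python
import torch.nn as nn

class WithList(nn.Module):
    def __init__(self, n):
        super().__init__()
        # plain Python list: the layers are not registered as submodules
        self.hidden = [nn.Linear(128, 128) for _ in range(n)]

class WithModuleList(nn.Module):
    def __init__(self, n):
        super().__init__()
        # nn.ModuleList: each layer is registered as a submodule
        self.hidden = nn.ModuleList([nn.Linear(128, 128) for _ in range(n)])

n_plain = len(list(WithList(3).parameters()))
n_registered = len(list(WithModuleList(3).parameters()))
print(n_plain)        # 0
print(n_registered)   # 6 (a weight and a bias for each of the 3 layers)
```

An optimizer built from `parameters()` would therefore never update the layers kept in the plain list.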