# Learning PyTorch with Examples

This tutorial introduces the fundamental concepts of PyTorch through self-contained examples.

At its core, PyTorch provides two main features:
- An n-dimensional Tensor, similar to a numpy array, that can also run on GPUs.
- Automatic differentiation for building and training neural networks.
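
As a quick illustration of both features, here is a minimal sketch (the variable names `a`, `x`, and `loss` are chosen for illustration only). It creates a small Tensor and then asks autograd for the gradient of a sum of squares, which should equal `2 * x`.

```python
import torch

# Feature 1: an n-dimensional Tensor, created much like a numpy array.
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(a.shape)

# Feature 2: automatic differentiation.
# For loss = sum(x^2), the gradient with respect to x is 2 * x.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum()
loss.backward()
print(x.grad)
```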

## Tensors

### Warm-up: numpy

In [1]:
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

lr = 1e-6
for t in range(500):
    # Forward pass
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    
    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    
    # Backprop to compute gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)  # derivative of the squared-error loss w.r.t. y_pred
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    
    # Update weights
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

0 28612268.069293723
1 24329791.402298294
2 23349080.73918183
3 22425203.60995101
4 19900338.50380317
5 15672932.242917165
6 10966099.582792219
7 7033742.05234842
8 4353001.436725734
9 2735936.50850668
10 1810916.582864869
11 1282601.9540117942
12 969467.8190234841
13 772321.3543269017
14 638641.7318941832
15 541562.6372106137
16 466912.24917197315
17 407079.12646977033
18 357679.58401681925
19 316142.626768025
20 280722.17191148567
21 250255.13442937713
22 223844.38731136144
23 200803.69442386585
24 180625.4351810698
25 162869.57800816302
26 147197.1013473553
27 133318.7366378346
28 121003.21255938319
29 110021.0134336981
30 100213.34443670136
31 91436.32814021475
32 83555.13425711519
33 76471.78379979578
34 70091.21027736647
35 64331.29271359629
36 59123.344061712975
37 54405.47924422243
38 50125.419017404005
39 46231.579469530254
40 42690.25896875652
41 39469.43152328124
42 36532.364961245185
43 33848.42669487665
44 31393.037560725516
45 29143.619460917136
46 27080.694383980797
47 2

### PyTorch: Tensors

Unlike numpy arrays, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a PyTorch Tensor on a GPU, you specify the target device when creating it, or move an existing Tensor there with its <b>to</b> method.
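
In current PyTorch releases, device placement is usually expressed with a `torch.device` object. The sketch below assumes nothing about your hardware: it falls back to the CPU when no CUDA device is available, so it should run either way. The tensor names are illustrative.

```python
import torch

# Pick a GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a Tensor directly on the chosen device...
a = torch.randn(3, 4, device=device)

# ...or move an existing CPU Tensor there with .to().
b = torch.ones(4, 2).to(device)

# Both operands must live on the same device for the product to work.
c = a.mm(b)
print(c.shape, c.device)
```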

Here we use PyTorch Tensors to fit a two-layer network to random data. As in the numpy example above, we need to manually implement the forward and backward passes through the network:

In [2]:
import torch

In [3]:
dtype = torch.float
device = torch.device('cpu')
# device = torch.device("cuda:0") # Uncomment this to run on GPU

In [4]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

In [5]:
# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

In [6]:
# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

In [7]:
lr = 1e-6
for t in range(500):
    # Forward pass
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of the loss w.r.t. w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)
    
    # Update weights using gradient descent
    w1 -= lr * grad_w1
    w2 -= lr * grad_w2

99 616.2571411132812
199 6.43768310546875
299 0.1556270718574524
399 0.0047845314256846905
499 0.0003574952424969524


## Autograd

### PyTorch: Tensors and autograd

In the above examples, we had to manually implement both the forward and backward passes of our neural network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it quickly becomes unwieldy and error-prone for large, complex networks.

Thankfully, we can use <b>automatic differentiation</b> to automate the computation of backward passes in neural networks. The <b>autograd</b> package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network will define a <b>computational graph</b>; nodes in the graph will be Tensors, and edges will be functions that produce output Tensors from input Tensors. Backpropagating through this graph then allows you to easily compute gradients.
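
To make the graph idea concrete, here is a small sketch on three-element Tensors (the values are arbitrary and chosen so the result can be checked by hand). The forward pass multiplies, applies a ReLU via `clamp`, and sums; `backward()` then fills in `.grad` for every leaf Tensor created with `requires_grad=True`.

```python
import torch

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, 0.5, 0.5], requires_grad=True)

# Forward pass: each operation records a node in the computational graph.
h = (x * w).clamp(min=0)   # ReLU; x * w is [0.5, -1.0, 1.5]
loss = h.sum()

# Results of operations remember the function that produced them.
print(loss.grad_fn is not None)

# Backward pass: gradients flow only through the entries where x * w > 0.
loss.backward()
print(x.grad)  # equals w where the ReLU is active: 0.5, 0, 0.5
print(w.grad)  # equals x where the ReLU is active: 1, 0, 3
```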

In [8]:
import torch

In [9]:
dtype = torch.float
device = torch.device('cpu')

In [10]:
# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# requires_grad defaults to False, which indicates that we do not need to
# compute gradients with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [11]:
lr = 1e-6
for t in range(500):
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print('the loss of epoch {}:'.format(t), loss.item())
    
    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()
    
    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

the loss of epoch 99: 155.062744140625
the loss of epoch 199: 0.3023970127105713
the loss of epoch 299: 0.0012614316074177623
the loss of epoch 399: 6.449854845413938e-05
the loss of epoch 499: 1.8706752598518506e-05
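
The loop above zeroes the gradients after every update because `backward()` adds to `.grad` rather than overwriting it. The following sketch, on a hypothetical two-element weight `w`, shows the accumulation and the in-place update inside `torch.no_grad()`.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2 * w = [2, 4].
(w ** 2).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added on top.
(w ** 2).sum().backward()
second = w.grad.clone()    # [4, 8]

# Reset the accumulated gradient, then update w without tracking the update.
w.grad.zero_()
with torch.no_grad():
    w -= 0.1 * first       # w becomes [0.8, 1.6]

print(first, second, w)
```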


### PyTorch: Defining new autograd functions

In PyTorch we can easily define our own autograd operator by defining a subclass of <b>torch.autograd.Function</b> and implementing the <b>forward</b> and <b>backward</b> functions as static methods. We can then use our new autograd operator by calling its <b>apply</b> method like a function, passing Tensors containing input data; the class itself is never instantiated.
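
Before the ReLU example, here is a smaller sketch of the same pattern. `Square` is a hypothetical Function invented for this illustration; its backward pass implements the derivative of `x * x`, and `torch.autograd.gradcheck` compares that hand-written gradient against finite differences (double precision is used because gradcheck is sensitive to rounding).

```python
import torch

class Square(torch.autograd.Function):
    """Hypothetical example Function computing inp * inp."""

    @staticmethod
    def forward(ctx, inp):
        # Save the input Tensor; backward needs it to compute 2 * inp.
        ctx.save_for_backward(inp)
        return inp * inp

    @staticmethod
    def backward(ctx, grad_output):
        (inp,) = ctx.saved_tensors
        # Chain rule: upstream gradient times d(inp^2)/d(inp).
        return grad_output * 2 * inp

# Functions are invoked through .apply, not by instantiating the class.
x = torch.tensor([1.0, 2.0, 3.0], dtype=torch.double, requires_grad=True)
Square.apply(x).sum().backward()
print(x.grad)

# Numerically verify the backward pass on random double-precision input.
x_check = torch.randn(4, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Square.apply, (x_check,))
print(ok)
```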

In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network:

In [12]:
import torch

In [13]:
class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    
    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. Tensors needed in the
        backward pass are cached with the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)
    
    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

In [14]:
dtype = torch.float
device = torch.device('cpu')

N, D_in, H, D_out = 64, 1000, 1000, 10

x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

In [15]:
lr = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. 
    # We alias this as 'relu'.
    relu = MyReLU.apply
    
    # Forward pass
    y_pred = relu(x.mm(w1)).mm(w2)
    
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print('loss of epoch {}'.format(t), loss.item())
    
    # Use autograd to compute the backward pass
    loss.backward()
    
    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= lr * w1.grad
        w2 -= lr * w2.grad
        
        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

loss of epoch 99 nan
loss of epoch 199 nan
loss of epoch 299 nan
loss of epoch 399 nan
loss of epoch 499 nan

The loss is nan in this run, which means training diverged. The most likely cause is the hidden size: this example sets H = 1000, ten times the value used in the other examples, while keeping the same learning rate of 1e-6. Setting H back to 100 or lowering the learning rate are the first things to try.


## nn Module

### PyTorch: nn

The <b>nn</b> package defines a set of <b>Modules</b>, which are roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The <b>nn</b> package also defines a set of useful loss functions that are commonly used when training neural networks.
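
The sketch below inspects a single `torch.nn.Linear` Module to show the internal state mentioned above; the sizes (4 inputs, 3 outputs) are arbitrary. It lists the learnable parameters and checks that `MSELoss` with `reduction='sum'` matches a hand-computed sum of squared errors.

```python
import torch

layer = torch.nn.Linear(4, 3)

# A Linear Module holds a weight of shape (out, in) and a bias of shape (out,).
for name, p in layer.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)

# Calling the Module maps a batch of inputs to a batch of outputs.
x = torch.randn(2, 4)
out = layer(x)

# With reduction='sum' the loss is the plain sum of squared differences.
target = torch.zeros(2, 3)
loss = torch.nn.MSELoss(reduction='sum')(out, target)
manual = (out - target).pow(2).sum()
print(loss.item(), manual.item())
```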

In this example we use the <b>nn</b> package to implement our two-layer network:

In [16]:
import torch

In [17]:
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

# The nn package also contains definitions of popular loss functions; in this
# case we use Mean Squared Error (MSE). With reduction='sum' the squared errors
# are summed rather than averaged, matching the loss in the earlier examples.
loss_fn = torch.nn.MSELoss(reduction='sum')

lr = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)
    
    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
        
    # Zero the gradients before running the backward pass.
    model.zero_grad()
    
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()
    
    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= lr * param.grad

99 2.4603281021118164
199 0.041589848697185516
299 0.0012026326730847359
399 4.257793625583872e-05
499 1.6535601616851636e-06


### PyTorch: optim

The <b>optim</b> package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of commonly used optimization algorithms.
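
Every optimizer in the package follows the same three-call pattern: `zero_grad()`, `backward()`, `step()`. As a minimal sketch, the toy problem below (a single hypothetical parameter `w` and the objective `(w - 2)^2`, neither of which comes from the examples in this tutorial) uses plain SGD; swapping in another optimizer class would leave the loop unchanged.

```python
import torch

# One scalar parameter; the minimum of (w - 2)^2 is at w = 2.
w = torch.tensor([5.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = ((w - 2.0) ** 2).sum()
    loss.backward()                  # populate w.grad
    optimizer.step()                 # apply the update rule to w

print(w.item())  # should be very close to 2.0
```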

In this example we will use the <b>nn</b> package to define our model as before, but we will optimize the model using the Adam algorithm provided by the <b>optim</b> package:

In [18]:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out)
)

loss_fn = torch.nn.MSELoss(reduction='sum')

In [19]:
lr = 1e-4

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for t in range(500):
    y_pred = model(x)
    
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())
        
    optimizer.zero_grad()
    
    loss.backward()
    
    optimizer.step()

99 54.34068298339844
199 0.9342947006225586
299 0.0066838338971138
399 4.871887358603999e-05
499 4.200252590180753e-07


### PyTorch: Custom nn Modules

Sometimes you will want to specify models that are more complex than a sequence of existing Modules; for these cases you can define your own Modules by subclassing <b>nn.Module</b> and defining a <b>forward</b> which receives input Tensors and produces output Tensors using other modules or other autograd operations on Tensors.
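
One case that a plain sequence of Modules cannot express is a skip connection, where the input is added back to the output. The sketch below is a hypothetical `SkipNet` written for illustration only; it also shows that parameters of submodules assigned as attributes are registered automatically.

```python
import torch

class SkipNet(torch.nn.Module):
    """Hypothetical example: a two-layer block with a skip connection."""

    def __init__(self, dim, hidden):
        super().__init__()
        self.linear1 = torch.nn.Linear(dim, hidden)
        self.linear2 = torch.nn.Linear(hidden, dim)

    def forward(self, x):
        h = self.linear1(x).clamp(min=0)
        # Add the input back to the transformed output.
        return x + self.linear2(h)

model = SkipNet(dim=8, hidden=16)
out = model(torch.randn(5, 8))

# Both Linear layers were registered: (8*16 + 16) + (16*8 + 8) parameters.
n_params = sum(p.numel() for p in model.parameters())
print(tuple(out.shape), n_params)
```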

In this example we implement our two-layer network as a custom Module subclass:

In [20]:
import torch

In [31]:
class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and 
        assign them as member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we
        must return a Tensor of output data. We can use Modules defined 
        in the constructor as well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

In [32]:
N, D_in, H, D_out = 64, 1000, 100, 10

In [33]:
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [34]:
model = TwoLayerNet(D_in, H, D_out)

In [35]:
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

In [36]:
for t in range(500):
    # Forward pass
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print('loss of epoch {}:'.format(t), loss.item())
        
    # Zero gradients, perform backprop, and update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

loss of epoch 99: 2.380481243133545
loss of epoch 199: 0.0460839606821537
loss of epoch 299: 0.0017332186689600348
loss of epoch 399: 9.43930572248064e-05
loss of epoch 499: 6.068229595257435e-06


### PyTorch: Control Flow + Weight Sharing

As an example of dynamic graphs and weight sharing, we implement a very strange model: a fully-connected ReLU network that on each forward pass chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use normal Python flow control to implement the loop, and we can implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
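
A stripped-down sketch of the two ideas, separate from the full model: one `Linear` layer (named `shared` here for illustration) is applied a variable number of times inside an ordinary Python loop. The parameter count does not depend on the depth, and a single backward pass accumulates the gradient contributions from every reuse into the same weight Tensor.

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(4, 4)

def forward(x, depth):
    # Ordinary Python control flow decides how often the layer is reused.
    h = x
    for _ in range(depth):
        h = shared(h).clamp(min=0)
    return h

x = torch.randn(2, 4)
out = forward(x, depth=3)

# Still only one weight and one bias, however many times the layer ran.
params = list(shared.parameters())
print(len(params))

# Gradients from all three applications are summed into shared.weight.grad.
out.sum().backward()
print(tuple(shared.weight.grad.shape))
```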

We can easily implement this model as a Module subclass:

In [37]:
import random
import torch

In [44]:
class DynamicNet(torch.nn.Module):
    
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)
        
    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement over
        Lua Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        
        y_pred = self.output_linear(h_relu)
        return y_pred

In [39]:
N, D_in, H, D_out = 64, 1000, 100, 10

In [40]:
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

In [46]:
model = DynamicNet(D_in, H, D_out)

In [42]:
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

In [47]:
for t in range(500):
    # Forward pass
    y_pred = model(x)
    
    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print('loss of epoch {}'.format(t), loss.item())
        
    # Zero gradients, perform backprop, and update weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

loss of epoch 99 665.8496704101562
loss of epoch 199 639.0000610351562
loss of epoch 299 634.5841064453125
loss of epoch 399 639.0000610351562
loss of epoch 499 639.0000610351562
