# What is torch.nn really?

## MNIST data setup

We will use <b>pathlib</b> for dealing with paths (part of the Python 3 standard library), and will download the dataset using <b>requests</b>. We will only import modules when we use them, so you can see exactly what’s being used at each point.

In [1]:
from pathlib import Path
import requests

DATA_PATH = Path('data')
PATH = DATA_PATH / 'mnist'

PATH.mkdir(parents=True, exist_ok=True)

URL = 'http://deeplearning.net/data/mnist/'
FILENAME = 'mnist.pkl.gz'

if not (PATH / FILENAME).exists():
    content = requests.get(URL + FILENAME).content
    with (PATH / FILENAME).open('wb') as f:
        f.write(content)

This dataset is in NumPy array format, and has been stored using pickle, a Python-specific format for serializing data.

In [2]:
import pickle
import gzip

with gzip.open((PATH / FILENAME).as_posix(), 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
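
To make the storage format concrete, here is a minimal sketch of the same gzip-plus-pickle round trip on a tiny stand-in object. The file name `toy.pkl.gz` and the nested tuple are made up for illustration; only the `gzip` and `pickle` calls mirror the cell above. (The `encoding='latin-1'` argument used above is typically needed when a pickle was written by Python 2, which we assume is the case for this file.)

```python
import gzip
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for the ((x_train, y_train), (x_valid, y_valid), test) layout.
toy = (([0.0, 0.5], [3, 7]), ([1.0], [2]), None)

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / 'toy.pkl.gz'

    # Serialize with pickle and compress with gzip in one step.
    with gzip.open(toy_path, 'wb') as f:
        pickle.dump(toy, f)

    # Read it back the same way the MNIST file is read.
    with gzip.open(toy_path, 'rb') as f:
        ((x_a, y_a), (x_b, y_b), _) = pickle.load(f)

print(x_a, y_a, x_b, y_b)
```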

In [3]:
import torch

x_train, y_train, x_valid, y_valid = map(torch.tensor,
                  (x_train, y_train, x_valid, y_valid))

In [4]:
n, c = x_train.shape

## Using torch.nn.functional

The first and easiest step is to make our code shorter by replacing our hand-written activation and loss functions with those from <b>torch.nn.functional</b> (which is generally imported into the namespace <b>F</b> by convention). This module contains all the functions in the <b>torch.nn</b> library (whereas other parts of the library contain classes). As well as a wide range of loss and activation functions, you’ll also find here some convenient functions for creating neural nets, such as pooling functions. (There are also functions for doing convolutions, linear layers, etc, but as we’ll see, these are usually better handled using other parts of the library.)

In [5]:
import torch.nn.functional as F

In [6]:
loss_fn = F.cross_entropy
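
As a sanity check on what <b>F.cross_entropy</b> computes, the sketch below compares it with a hand-written log-softmax followed by a negative log-likelihood. The helper names `log_softmax_manual` and `nll_manual` and the random inputs are illustrative assumptions, not part of the library.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)            # 4 samples, 10 classes (made-up data)
targets = torch.tensor([1, 0, 9, 3])   # one class index per sample

def log_softmax_manual(x):
    # log(softmax(x)) = x - log(sum(exp(x))) along the class dimension
    return x - x.exp().sum(-1, keepdim=True).log()

def nll_manual(log_probs, target):
    # Pick the log-probability of the correct class and average over the batch
    return -log_probs[range(target.shape[0]), target].mean()

manual = nll_manual(log_softmax_manual(logits), targets)
builtin = F.cross_entropy(logits, targets)
print(manual.item(), builtin.item())
```

The two values should agree up to floating-point error, which is why the model itself returns raw scores without applying a softmax.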

In [7]:
def model(xb):
    return xb @ weights + bias

In [8]:
import math

weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()
bias = torch.zeros(10, requires_grad=True)

In [9]:
bs = 64 # batch size

xb = x_train[0:bs] # a mini-batch from x
preds = model(xb) # predictions
print(preds[0], preds.shape)

yb = y_train[0:bs]
print(loss_fn(preds, yb))

tensor([-0.5063, -0.3695, -0.2609,  0.6300, -0.0068, -0.1993, -0.1746,  0.4627,
        -0.2131,  0.3878], grad_fn=<SelectBackward>) torch.Size([64, 10])
tensor(2.3633, grad_fn=<NllLossBackward>)


## Refactor using nn.Module

Next up, we’ll use <b>nn.Module</b> and <b>nn.Parameter</b>, for a clearer and more concise training loop. We subclass <b>nn.Module</b> (which itself is a class and able to keep track of state). In this case, we want to create a class that holds our weights, bias, and method for the forward step. <b>nn.Module</b> has a number of attributes and methods (such as <b>.parameters()</b> and <b>.zero_grad()</b>) which we will be using.

In [10]:
from torch import nn

class Mnist_Logistic(nn.Module):
    
    def __init__(self):
        super(Mnist_Logistic, self).__init__()
        self.weights = nn.Parameter(torch.randn(784, 10) / math.sqrt(784))
        self.bias = nn.Parameter(torch.zeros(10))
        
    def forward(self, xb):
        return xb @ self.weights + self.bias

Since we’re now using an object instead of just using a function, we first have to instantiate our model:

In [11]:
model = Mnist_Logistic()

Now we can calculate the loss in the same way as before. Note that <b>nn.Module</b> objects are used as if they are functions (i.e., they are callable), but behind the scenes PyTorch will call our <b>forward</b> method automatically.

In [12]:
loss_fn(model(xb), yb)

tensor(2.4407, grad_fn=<NllLossBackward>)
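
The following sketch illustrates the two points above on a smaller module: calling the object runs <b>forward</b>, and every <b>nn.Parameter</b> assigned as an attribute is picked up by <b>.parameters()</b>. The class name `TinyLogistic` and its sizes are hypothetical.

```python
import math
import torch
from torch import nn

class TinyLogistic(nn.Module):  # hypothetical, smaller cousin of Mnist_Logistic
    def __init__(self, n_in, n_out):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(n_in, n_out) / math.sqrt(n_in))
        self.bias = nn.Parameter(torch.zeros(n_out))

    def forward(self, xb):
        return xb @ self.weights + self.bias

tiny = TinyLogistic(4, 3)
xb_demo = torch.randn(2, 4)

out_call = tiny(xb_demo)             # calling the module ...
out_forward = tiny.forward(xb_demo)  # ... gives the same result as forward()

# Parameters are registered automatically, in the order they were assigned.
param_names = [name for name, _ in tiny.named_parameters()]
print(param_names, out_call.shape)
```

In practice we call the module rather than <b>forward</b> directly, since the call also runs any registered hooks.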

Now we can take advantage of <b>model.parameters()</b> and <b>model.zero_grad()</b> to make training loops more concise and less prone to the error of forgetting some of our parameters, particularly if we had a more complicated model.

We’ll wrap our little training loop in a <b>fit</b> function so we can run it again later.

In [13]:
epochs = 2
lr = 0.01

def fit():
    for epoch in range(epochs):
        for i in range((n-1) // bs + 1):
            start_i = i * bs
            end_i = start_i + bs
            
            xb = x_train[start_i : end_i]
            yb = y_train[start_i : end_i]
            
            pred = model(xb)
            loss = loss_fn(pred, yb)
            
            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad * lr
                model.zero_grad()

In [14]:
fit()
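
The loop bound `(n-1) // bs + 1` in <b>fit</b> is a ceiling division: it counts the full batches plus one final, possibly shorter, batch. Here is a small arithmetic sketch, assuming 50,000 training rows (the value of `n_demo` is an assumption about this dataset, not something the cells above print).

```python
n_demo = 50_000   # assumed number of training rows
bs_demo = 64      # same batch size as in the tutorial

# Ceiling division without floats: number of mini-batches per epoch.
n_batches = (n_demo - 1) // bs_demo + 1

# Slicing past the end of a tensor is clamped, so the last batch is just shorter.
batch_lengths = [min(bs_demo, n_demo - i * bs_demo) for i in range(n_batches)]

print(n_batches, batch_lengths[-1])
```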

## Refactor using nn.Linear

We continue to refactor our code. Instead of manually defining and initializing <b>self.weights</b> and <b>self.bias</b>, and calculating <b>xb @ self.weights + self.bias</b>, we will instead use the PyTorch class <b>nn.Linear</b> for a linear layer, which does all that for us. PyTorch has many types of predefined layers that can greatly simplify our code, and often make it faster too.

In [15]:
class Mnist_Logistic(nn.Module):
    def __init__(self):
        super(Mnist_Logistic, self).__init__()
        self.lin = nn.Linear(784, 10)
        
    def forward(self, xb):
        return self.lin(xb)
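
To see what <b>nn.Linear</b> does for us, this sketch reproduces its output by hand. Note that the layer stores its weight with shape (out_features, in_features), so the manual version needs a transpose. The random input is made up for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
lin = nn.Linear(784, 10)
x_demo = torch.randn(5, 784)   # made-up mini-batch of 5 flattened images

# nn.Linear computes x @ W^T + b, with W of shape (10, 784).
manual_out = x_demo @ lin.weight.t() + lin.bias
layer_out = lin(x_demo)

print(lin.weight.shape, lin.bias.shape)
```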

In [16]:
model = Mnist_Logistic()

In [17]:
loss_fn(model(xb), yb)

tensor(2.3317, grad_fn=<NllLossBackward>)

In [18]:
fit()

In [19]:
loss_fn(model(xb), yb)

tensor(0.5348, grad_fn=<NllLossBackward>)

## Refactor using optim

PyTorch also has a package with various optimization algorithms, <b>torch.optim</b>. We can use the <b>step</b> method of our optimizer to take an optimization step, instead of manually updating each parameter, and its <b>zero_grad</b> method to reset the gradients before the next minibatch.

In [20]:
def get_model():
    model = Mnist_Logistic()
    return model, torch.optim.SGD(model.parameters(), lr=lr)

In [21]:
model, opt = get_model()

In [22]:
loss_fn(model(xb), yb)

tensor(2.3345, grad_fn=<NllLossBackward>)

In [23]:
for epoch in range(epochs):
    for i in range((n-1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        
        xb = x_train[start_i : end_i]
        yb = y_train[start_i : end_i]
        
        pred = model(xb)
        loss = loss_fn(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()
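
For plain SGD (no momentum or weight decay), `opt.step()` performs the same update as the manual `p -= p.grad * lr` loop used earlier. The sketch below checks this on a toy parameter; the quadratic loss and the names `w_manual` and `w_optim` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
lr_demo = 0.1

w_manual = torch.randn(3, requires_grad=True)
w_start = w_manual.detach().clone()
w_optim = w_start.clone().requires_grad_()
opt_demo = torch.optim.SGD([w_optim], lr=lr_demo)

def toy_loss(w):
    return (w ** 2).sum()   # gradient is 2 * w

# Manual update, as in the first version of fit().
toy_loss(w_manual).backward()
with torch.no_grad():
    w_manual -= w_manual.grad * lr_demo
    w_manual.grad.zero_()

# The same update through the optimizer.
toy_loss(w_optim).backward()
opt_demo.step()
opt_demo.zero_grad()

print(w_manual, w_optim)
```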

In [24]:
loss_fn(model(xb), yb)

tensor(0.9175, grad_fn=<NllLossBackward>)

## Refactor using Dataset

PyTorch has an abstract Dataset class. A Dataset can be anything that has a __len__ function (called by Python’s standard len function) and a __getitem__ function as a way of indexing into it.
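
As a minimal sketch of that interface, here is a hand-written Dataset that needs nothing beyond `__len__` and `__getitem__`. The class `SquaresDataset` is hypothetical and only serves to show the two methods.

```python
import torch
from torch.utils.data import Dataset

class SquaresDataset(Dataset):  # hypothetical example dataset
    """Item i is the pair (i, i**2), returned as float tensors."""

    def __init__(self, n_items):
        self.n_items = n_items

    def __len__(self):
        # Called by Python's built-in len()
        return self.n_items

    def __getitem__(self, idx):
        # Called by the indexing syntax ds[idx]
        if not 0 <= idx < self.n_items:
            raise IndexError(idx)
        x = torch.tensor(float(idx))
        return x, x ** 2

squares = SquaresDataset(5)
x3, y3 = squares[3]
print(len(squares), x3.item(), y3.item())
```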

PyTorch’s <b>TensorDataset</b> is a Dataset wrapping tensors. By defining a length and way of indexing, this also gives us a way to iterate, index, and slice along the first dimension of a tensor. This will make it easier to access both the independent and dependent variables in the same line as we train.

In [25]:
from torch.utils.data import TensorDataset

Both <b>x_train</b> and <b>y_train</b> can be combined in a single <b>TensorDataset</b>, which will be easier to iterate over and slice.

In [26]:
train_ds = TensorDataset(x_train, y_train)

In [27]:
xb, yb = train_ds[i*bs : i*bs+bs]
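
The sketch below shows the indexing and slicing behaviour on a small made-up <b>TensorDataset</b>: a single index returns one (x, y) pair, and a slice returns the matching rows of both tensors at once. The tensors `xs` and `ys` are illustrative.

```python
import torch
from torch.utils.data import TensorDataset

xs = torch.arange(12, dtype=torch.float32).view(6, 2)   # 6 samples, 2 features
ys = torch.arange(6)                                     # 6 labels
toy_ds = TensorDataset(xs, ys)

x0, y0 = toy_ds[0]          # one sample and its label
xb_s, yb_s = toy_ds[2:5]    # a mini-batch of samples 2, 3 and 4

print(len(toy_ds), x0, y0, xb_s.shape, yb_s)
```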

In [28]:
model, opt = get_model()

for epoch in range(epochs):
    for i in range((n-1) // bs + 1):
        start_i = i * bs
        end_i = start_i + bs
        
        xb, yb = train_ds[start_i : end_i]
        
        pred = model(xb)
        loss = loss_fn(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()

In [29]:
loss_fn(model(xb), yb)

tensor(0.9160, grad_fn=<NllLossBackward>)

## Refactor using DataLoader

PyTorch’s <b>DataLoader</b> is responsible for managing batches. You can create a <b>DataLoader</b> from any Dataset. <b>DataLoader</b> makes it easier to iterate over batches. Rather than having to use <b>train_ds[i*bs : i*bs+bs]</b>, the <b>DataLoader</b> gives us each minibatch automatically.

In [30]:
from torch.utils.data import DataLoader

In [33]:
train_dataset = TensorDataset(x_train, y_train)
train_dataloader = DataLoader(train_dataset, batch_size=bs)

In [35]:
for xb, yb in train_dataloader:
    pred = model(xb)
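
Here is a small sketch of the batching behaviour on made-up data: with 10 samples and a batch size of 4, the <b>DataLoader</b> yields two full batches and one shorter final batch. The names `toy_dl` and `batch_sizes` are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_x = torch.arange(10, dtype=torch.float32).view(10, 1)
toy_y = torch.arange(10)
toy_dl = DataLoader(TensorDataset(toy_x, toy_y), batch_size=4)

# Each iteration yields one (inputs, targets) mini-batch.
batch_sizes = [len(xb) for xb, yb in toy_dl]
print(len(toy_dl), batch_sizes)
```

Passing `drop_last=True` to the <b>DataLoader</b> would discard the shorter final batch instead.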

In [36]:
model, opt = get_model()

for epoch in range(epochs):
    for xb, yb in train_dataloader:
        pred = model(xb)
        loss = loss_fn(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()

In [37]:
loss_fn(model(xb), yb)

tensor(0.9464, grad_fn=<NllLossBackward>)

## Add validation

So far, we were just trying to get a reasonable training loop set up for use on our training data. In reality, you should always also have a <b>validation set</b>, in order to identify whether you are overfitting.

Shuffling the training data is important to prevent correlation between batches and overfitting. On the other hand, the validation loss will be identical whether we shuffle the validation set or not. Since shuffling takes extra time, it makes no sense to shuffle the validation data.

We’ll use a batch size for the validation set that is twice as large as that for the training set. This is because the validation set does not need backpropagation and thus takes less memory (it doesn’t need to store the gradients). We take advantage of this to use a larger batch size and compute the loss more quickly.

In [38]:
train_dataset = TensorDataset(x_train, y_train)
train_dataloader = DataLoader(train_dataset, batch_size=bs, shuffle=True)

valid_dataset = TensorDataset(x_valid, y_valid)
valid_dataloader = DataLoader(valid_dataset, batch_size=bs * 2)
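
The memory argument above rests on the fact that, inside `torch.no_grad()`, PyTorch does not record the computation graph needed for backpropagation. A minimal sketch, using a hypothetical small layer `demo_layer`:

```python
import torch
from torch import nn

demo_layer = nn.Linear(4, 2)
x_demo = torch.randn(3, 4)

out_tracked = demo_layer(x_demo)        # graph is recorded for backward()

with torch.no_grad():
    out_untracked = demo_layer(x_demo)  # same values, no graph kept

print(out_tracked.requires_grad, out_untracked.requires_grad)
```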

We will calculate and print the validation loss at the end of each epoch.

(Note that we always call <b>model.train()</b> before training, and <b>model.eval()</b> before inference, because these are used by layers such as <b>nn.BatchNorm2d</b> and <b>nn.Dropout</b> to ensure appropriate behaviour for these different phases.)

In [40]:
model, opt = get_model()

for epoch in range(epochs):
    model.train()
    for xb, yb in train_dataloader:
        pred = model(xb)
        loss = loss_fn(pred, yb)
        
        loss.backward()
        opt.step()
        opt.zero_grad()
    
    model.eval()
    with torch.no_grad():
        valid_loss = sum(loss_fn(model(xb), yb) for xb, yb in valid_dataloader)
        
    print('epoch {}'.format(epoch), 'loss {}'.format(valid_loss / len(valid_dataloader)))

epoch 0 loss 0.6322571039199829
epoch 1 loss 0.48819637298583984
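
Our logistic model has no layers that behave differently between phases, so the effect of <b>model.train()</b> and <b>model.eval()</b> is not visible above. The sketch below shows it on a standalone <b>nn.Dropout</b> layer (the input of ones is made up): in training mode some entries are zeroed and the rest rescaled, while in evaluation mode the layer passes its input through unchanged.

```python
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
ones = torch.ones(1000)

drop.train()                 # training mode: random zeroing, survivors scaled by 1 / (1 - p)
out_train_mode = drop(ones)

drop.eval()                  # evaluation mode: dropout is a no-op
out_eval_mode = drop(ones)

print(out_train_mode.unique(), out_eval_mode.unique())
```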


## Create fit() and get_data()

Since we calculate the loss in a similar way for both the training set and the validation set, let’s make that into its own function, <b>loss_batch</b>, which computes the loss for one batch. If an optimizer is passed in, it also runs the backward pass and updates the parameters.

In [41]:
def loss_batch(model, loss_fn, xb, yb, opt=None):
    
    loss = loss_fn(model(xb), yb)
    
    if opt is not None:
        loss.backward()
        opt.step()
        opt.zero_grad()
        
    return loss.item(), len(xb)

<b>fit</b> runs the necessary operations to train our model and computes the validation loss at the end of each epoch.

In [42]:
import numpy as np

In [43]:
def fit(epochs, model, loss_fn, opt, train_dataloader, valid_dataloader):
    
    for epoch in range(epochs):
        model.train()
        for xb, yb in train_dataloader:
            loss_batch(model, loss_fn, xb, yb, opt)
        
        model.eval()
        with torch.no_grad():
            losses, nums = zip(*[loss_batch(model, loss_fn, xb, yb) for xb, yb in valid_dataloader])
        
        val_loss = np.sum(np.multiply(losses, nums)) / np.sum(nums)
        
        print('epoch {}'.format(epoch), ', loss {}'.format(val_loss))
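
The `val_loss` line in <b>fit</b> weights each batch loss by its batch size, which matters when the last batch is shorter than the others. A small sketch with invented numbers (the `losses_demo` and `nums_demo` values are purely illustrative):

```python
import numpy as np

losses_demo = [0.5, 0.7, 1.0]   # made-up mean loss of each batch
nums_demo = [128, 128, 16]      # made-up batch sizes; the last batch is short

# Per-sample average over the whole validation set, as computed in fit().
weighted = np.sum(np.multiply(losses_demo, nums_demo)) / np.sum(nums_demo)

# Naive average of the batch means, which over-counts the short batch.
unweighted = np.mean(losses_demo)

print(weighted, unweighted)
```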
            

<b>get_data</b> returns dataloaders for the training and validation sets.

In [44]:
def get_data(train_dataset, valid_dataset, bs):
    return (
        DataLoader(train_dataset, batch_size=bs, shuffle=True),
        DataLoader(valid_dataset, batch_size=bs * 2)
    )

Now, our whole process of obtaining the data loaders and fitting the model can be run in 3 lines of code:

In [45]:
train_dataloader, valid_dataloader = get_data(train_dataset, valid_dataset, bs)

model, opt = get_model()

fit(epochs, model, loss_fn, opt, train_dataloader, valid_dataloader)

epoch 0 , loss 0.6293734114646912
epoch 1 , loss 0.4886402978420258


You can use these basic 3 lines of code to train a wide variety of models. Let’s see if we can use them to train a convolutional neural network (CNN)!

## Switch to CNN

We will use PyTorch’s predefined <b>Conv2d</b> class as our convolutional layer. We define a CNN with 3 convolutional layers. Each convolution is followed by a <b>ReLU</b>. At the end, we perform an average pooling. (Note that <b>view</b> is PyTorch’s version of NumPy’s <b>reshape</b>.)

In [46]:
class Mnist_CNN(nn.Module):
    
    def __init__(self):
        super(Mnist_CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
        self.conv3 = nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1)
        
    def forward(self, xb):
        xb = xb.view(-1, 1, 28, 28)
        xb = F.relu(self.conv1(xb))
        xb = F.relu(self.conv2(xb))
        xb = F.relu(self.conv3(xb))
        xb = F.avg_pool2d(xb, 4)
        return xb.view(-1, xb.size(1))
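
To follow how the tensor shape changes through this network, the sketch below pushes a dummy batch of zeros through layers with the same settings and records the shape after each step. Each stride-2 convolution roughly halves the spatial size (28, 14, 7, 4), and the final pooling with a 4x4 window reduces it to 1x1. The variable names are illustrative.

```python
import torch
from torch import nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1)
conv2 = nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1)
conv3 = nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1)

xb_demo = torch.zeros(2, 784)   # dummy batch of 2 flattened images
shapes = []

with torch.no_grad():
    out = xb_demo.view(-1, 1, 28, 28)
    shapes.append(tuple(out.shape))
    for conv in (conv1, conv2, conv3):
        out = F.relu(conv(out))
        shapes.append(tuple(out.shape))
    out = F.avg_pool2d(out, 4)
    shapes.append(tuple(out.shape))
    out = out.view(-1, out.size(1))   # drop the 1x1 spatial dimensions
    shapes.append(tuple(out.shape))

for s in shapes:
    print(s)
```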

In [47]:
model = Mnist_CNN()

opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

In [48]:
fit(epochs, model, loss_fn, opt, train_dataloader, valid_dataloader)

epoch 0 , loss 0.3586157655954361
epoch 1 , loss 0.2642602647304535


## nn.Sequential

<b>torch.nn</b> has another handy class we can use to simplify our code: <b>Sequential</b>. A <b>Sequential</b> object runs each of the modules contained within it, in a sequential manner. This is a simpler way of writing our neural network.

To take advantage of this, we need to be able to easily define a <b>custom layer</b> from a given function. For instance, PyTorch doesn’t have a view layer, and we need to create one for our network. <b>Lambda</b> will create a layer that we can then use when defining a network with <b>Sequential</b>.

In [59]:
class Lambda(nn.Module):
    
    def __init__(self, func):
        super().__init__()
        self.func = func
        
    def forward(self, x):
        return self.func(x)

In [60]:
def preprocessing(x):
    return x.view(-1, 1, 28, 28)

The model created with <b>Sequential</b> is simply:

In [65]:
model = nn.Sequential(
    Lambda(preprocessing),
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 10, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AvgPool2d(4),
    Lambda(lambda x: x.view(x.size(0), -1)),
)
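
The sketch below runs a shortened variant of this idea end to end on random input, to show that a <b>Lambda</b> layer adds no parameters of its own and that <b>Sequential</b> simply chains the modules. The network `small_net` is a hypothetical cut-down model, not the one trained in this tutorial.

```python
import torch
from torch import nn

class Lambda(nn.Module):
    """Wraps a plain function so it can be used as a layer."""

    def __init__(self, func):
        super().__init__()
        self.func = func

    def forward(self, x):
        return self.func(x)

small_net = nn.Sequential(
    Lambda(lambda x: x.view(-1, 1, 28, 28)),             # flat vector -> image
    nn.Conv2d(1, 4, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                             # pool down to 1x1
    Lambda(lambda x: x.view(x.size(0), -1)),             # drop spatial dims
)

out_demo = small_net(torch.randn(8, 784))

# Only the convolution contributes parameters: 4 * 1 * 3 * 3 weights + 4 biases.
n_params = sum(p.numel() for p in small_net.parameters())
print(out_demo.shape, n_params)
```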

In [66]:
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

In [67]:
fit(epochs, model, loss_fn, opt, train_dataloader, valid_dataloader)

epoch 0 , loss 0.3027653733849525
epoch 1 , loss 0.22425802769064904


## Closing thoughts

We now have a general data pipeline and training loop which you can use for training many types of models using PyTorch.

Let’s summarize what we’ve seen:
- <b>torch.nn</b>:
    - <b>Module</b>: creates a callable which behaves like a function, but can also contain state (such as neural net layer weights). It knows what <b>Parameter</b>(s) it contains and can zero all their gradients, loop through them for weight updates, etc.
    - <b>Parameter</b>: a wrapper for a tensor that tells a <b>Module</b> that it has weights that need updating during backprop. Only tensors with the requires_grad attribute set are updated.
    - <b>functional</b>: a module (usually imported into the <b>F</b> namespace by convention) which contains activation functions, loss functions, etc, as well as non-stateful versions of layers such as convolutional and linear layers.
- <b>torch.optim</b>: Contains optimizers such as <b>SGD</b>, which update the weights of each <b>Parameter</b> when <b>step</b> is called, using the gradients computed during the backward pass.
- <b>Dataset</b>: An abstract interface of objects with a __len__ and a __getitem__, including classes provided with PyTorch such as <b>TensorDataset</b>.
- <b>DataLoader</b>: Takes any <b>Dataset</b> and creates an iterator which returns batches of data.
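
As a compact recap, the sketch below wires these pieces together on a small synthetic classification problem rather than MNIST. The data, sizes, and learning rate are all made up for illustration; it is one possible arrangement of the components above, not a recommended configuration.

```python
import torch
from torch import nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic data: the label depends only on the sign of the first feature.
x_syn = torch.randn(256, 20)
y_syn = (x_syn[:, 0] > 0).long()

syn_dl = DataLoader(TensorDataset(x_syn, y_syn), batch_size=32, shuffle=True)
syn_model = nn.Linear(20, 2)                             # Module holding Parameters
syn_opt = torch.optim.SGD(syn_model.parameters(), lr=0.1)

def full_loss():
    # Loss over the whole synthetic set, without tracking gradients.
    syn_model.eval()
    with torch.no_grad():
        return F.cross_entropy(syn_model(x_syn), y_syn).item()

loss_before = full_loss()

for epoch in range(5):
    syn_model.train()
    for xb, yb in syn_dl:
        loss = F.cross_entropy(syn_model(xb), yb)
        loss.backward()
        syn_opt.step()
        syn_opt.zero_grad()

loss_after = full_loss()
print(loss_before, loss_after)
```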