Let's walk through a real scenario on a toy dataset.

In [2]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

If you have a GPU, definitely use it.  
There is one thing you should know about GPU training:  
your data and your model must be on the same device before training can start.  
You can always move your data or model with .to(device),
or pass the device as an argument when you create a tensor.

In [3]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
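
As a minimal sketch of what "same device" means, the snippet below moves a tensor and a layer to the configured device and then runs a forward pass. The names x, y and layer are illustrative and not part of the notebook.

```python
import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Option 1: create on the CPU, then move with .to(device)
x = torch.randn(4, 3).to(device)
layer = nn.Linear(3, 2).to(device)

# Option 2: pass device= as an argument at creation time
y = torch.zeros(4, 3, device=device)

# Inputs and parameters live on the same device, so the forward pass works
out = layer(x + y)
print(out.device.type == device.type, tuple(out.shape))
```

If the input stayed on the CPU while the layer sat on the GPU, the forward pass would raise a RuntimeError about mismatched devices.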

Let's set our hyperparameters, which are among the most important choices for training.

In [14]:
# Hyper-parameters 
input_size = 28 * 28    # 784
num_classes = 10
hidden_size = 500
num_epochs = 10
batch_size = 128
learning_rate = 0.001

Let's load our datasets and wrap them in data loaders.

In [15]:
# MNIST dataset (images and labels)
train_dataset = torchvision.datasets.MNIST(root='data', 
                                           train=True, 
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='data', 
                                          train=False, 
                                          transform=transforms.ToTensor())
                                          

# Data loader (input pipeline)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

In [16]:
print(len(train_loader), len(test_loader))

469 79
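
These two numbers follow directly from the dataset sizes and the batch size. A small sketch of the arithmetic, assuming the default DataLoader behavior of keeping the last, smaller batch (drop_last=False):

```python
import math

num_train, num_test, batch_size = 60000, 10000, 128

# With drop_last=False the number of batches is ceil(dataset_size / batch_size)
train_batches = math.ceil(num_train / batch_size)
test_batches = math.ceil(num_test / batch_size)

# The final training batch holds whatever is left over
last_train_batch = num_train - (train_batches - 1) * batch_size

print(train_batches, test_batches, last_train_batch)  # 469 79 96
```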


We know how to use pretrained torch models, but what if we want to create a custom model?  
P.S. Designing a network architecture from scratch is very hard if your knowledge is limited, so customizing a known architecture is a good place to start.

In [17]:
# Fully connected neural network with one hidden layer
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(input_size, hidden_size) 
        self.fc2 = nn.Linear(hidden_size, num_classes)  
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

Sometimes you create a lot of layers. Giving each one a unique name gets tedious, so we use nn.Sequential in those cases.

In [18]:
# The same fully connected network, with its layers grouped in nn.Sequential
class NeuralNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(NeuralNet, self).__init__()
        self.fc = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Linear(hidden_size, num_classes)
        )
        
    def forward(self, x):
        out = self.fc(x)
        return out
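
A quick sanity check for a model like this is to count its parameters and push a dummy batch through it. The sketch below rebuilds the same layer stack as a bare nn.Sequential (the name net is illustrative) with the hyperparameters set earlier.

```python
import torch
import torch.nn as nn

# Same layer stack as the Sequential version above
net = nn.Sequential(nn.Linear(28 * 28, 500), nn.ReLU(), nn.Linear(500, 10))

# Weights plus biases of both Linear layers: (784*500 + 500) + (500*10 + 10)
num_params = sum(p.numel() for p in net.parameters())

# A dummy batch of 4 flattened images gives one score per class
logits = net(torch.randn(4, 28 * 28))
print(num_params, tuple(logits.shape))  # 397510 (4, 10)
```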

In [19]:
# Create the model and move it to the configured device (a new model lives on the CPU by default)
model = NeuralNet(input_size, hidden_size, num_classes).to(device)

In [20]:
# We define our loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

Now the important part: we need to write a training loop.  
Training loops come in many variations, but every one follows the same simple logic:

1. Get data and move it to the appropriate device
2. Make a forward pass
3. Calculate loss
4. Zero the gradients
5. Calculate gradients
6. Update your network's parameters with one optimizer step
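
The six steps can be sketched on a tiny stand-in problem before applying them to MNIST. Everything here (the 8-feature random batch, the single Linear layer, the SGD optimizer) is a hypothetical setup chosen only to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tiny stand-in model and one random batch (hypothetical sizes, not MNIST)
model = nn.Linear(8, 3).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(16, 8).to(device)            # 1. get data, move it
targets = torch.randint(0, 3, (16,)).to(device)

losses = []
for step in range(20):
    outputs = model(inputs)                       # 2. forward pass
    loss = criterion(outputs, targets)            # 3. calculate loss
    optimizer.zero_grad()                         # 4. zero the gradients
    loss.backward()                               # 5. calculate gradients
    optimizer.step()                              # 6. update parameters
    losses.append(loss.item())

print(f'first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}')
```

Because the same batch is reused at every step, the loss should shrink steadily; with real data the curve is noisier.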

In [21]:
total_step = len(train_loader)
total_images = len(train_dataset)
print(f"Number of images: {total_images}, Number of batches: {total_step}")

Number of images: 60000, Number of batches: 469


In [22]:
# Train the model

# In each epoch we iterate over the whole dataset
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        # Images have shape [batch_size, channels, height, width]
        # Labels have shape [batch_size]
        # Flatten each image to a 784-vector and move both tensors to the configured device
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        # Forward pass through the model
        outputs = model(images)
        # Calculate your loss
        loss = criterion(outputs, labels)
        
        # Backward pass and optimization
        # PyTorch accumulates gradients across backward() calls, so unless you
        # want accumulation, zero them before computing new ones
        optimizer.zero_grad()
        loss.backward()
        # Make a step with your optimizer
        optimizer.step()
        
        # To monitor the training process we print or log some useful values 
        if (i+1) % 200 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{total_step}], Loss: {loss.item():.4f}')


Epoch [1/10], Step [200/469], Loss: 0.2074
Epoch [1/10], Step [400/469], Loss: 0.1409
Epoch [2/10], Step [200/469], Loss: 0.1043
Epoch [2/10], Step [400/469], Loss: 0.1115
Epoch [3/10], Step [200/469], Loss: 0.0740
Epoch [3/10], Step [400/469], Loss: 0.1678
Epoch [4/10], Step [200/469], Loss: 0.0417
Epoch [4/10], Step [400/469], Loss: 0.0333
Epoch [5/10], Step [200/469], Loss: 0.0418
Epoch [5/10], Step [400/469], Loss: 0.0331
Epoch [6/10], Step [200/469], Loss: 0.0141
Epoch [6/10], Step [400/469], Loss: 0.0065
Epoch [7/10], Step [200/469], Loss: 0.0288
Epoch [7/10], Step [400/469], Loss: 0.0088
Epoch [8/10], Step [200/469], Loss: 0.0199
Epoch [8/10], Step [400/469], Loss: 0.0073
Epoch [9/10], Step [200/469], Loss: 0.0074
Epoch [9/10], Step [400/469], Loss: 0.0358
Epoch [10/10], Step [200/469], Loss: 0.0043
Epoch [10/10], Step [400/469], Loss: 0.0061


In [23]:
# Test the model
# In the test phase we don't need gradients, so we run inside torch.no_grad()
# The no_grad context saves you from freezing parameters by hand
# Within it, no operation records gradient information
# Without no_grad, evaluation would use more memory and time
with torch.no_grad():
    model.eval()
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.reshape(-1, 28*28).to(device)
        labels = labels.to(device)
        # model.eval() (called above) is not a substitute for no_grad:
        # it switches the model to evaluation mode but does not disable gradients
        # Be careful: some layers (e.g. BatchNorm, Dropout) behave differently in training and evaluation mode
        # Autograd normally records the graph during the forward pass; no_grad skips that
        
        outputs = model(images)
        
        # Get predictions and calculate your accuracy
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Accuracy of the network on the 10000 test images: {} %'.format(100 * correct / total))


Accuracy of the network on the 10000 test images: 98.03 %
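
Since model.eval() and torch.no_grad() are easy to confuse, here is a small sketch of the difference. The two-layer net with Dropout is a made-up example, picked only because Dropout behaves differently in the two modes.

```python
import torch
import torch.nn as nn

# Dropout is used here only because it behaves differently in the two modes
net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

net.eval()                       # changes layer behavior, not gradient tracking
out_eval = net(x)
print(net.training, out_eval.requires_grad)      # False True

with torch.no_grad():            # disables gradient tracking, not layer behavior
    out_no_grad = net(x)
print(out_no_grad.requires_grad)                 # False

net.train()                      # switch back before training again
print(net.training)                              # True
```

For evaluation you usually want both: eval() for correct layer behavior and no_grad() to save memory and time.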


Awesome! In less than a minute you solved a digit recognition problem with high accuracy.  
Let's save our model.

In [24]:
# Save the model checkpoint
torch.save(model.state_dict(), 'model.ckpt')
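
A state_dict holds only the parameters, so loading it back means rebuilding the same architecture first. The sketch below shows that round trip with a stand-in model and a temporary file; build_model and the temporary path are illustrative, and in the notebook you would construct NeuralNet(input_size, hidden_size, num_classes) and load 'model.ckpt' instead.

```python
import os
import tempfile
import torch
import torch.nn as nn

def build_model():
    # Hypothetical helper: must match the architecture that produced the state_dict
    return nn.Sequential(nn.Linear(784, 500), nn.ReLU(), nn.Linear(500, 10))

trained = build_model()
path = os.path.join(tempfile.mkdtemp(), 'model.ckpt')
torch.save(trained.state_dict(), path)

# Rebuild the architecture first, then load the saved weights into it
restored = build_model()
restored.load_state_dict(torch.load(path, map_location='cpu'))
restored.eval()

# Both models should now produce identical outputs
x = torch.randn(2, 784)
with torch.no_grad():
    same = torch.equal(trained(x), restored(x))
print(same)
```

Passing map_location='cpu' lets a checkpoint saved on a GPU machine load on a machine without one.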