# Deep Learning tutorial

This tutorial covers the basics of training a deep neural network. The process can be broadly divided into four parts:

1) Creating a data loader - processing the data into a required format

2) Building the neural network

3) Training the network

4) Evaluating the performance

Let's work on a classification problem. The goal is to predict the land-use class of the scene shown in an aerial/satellite image.

We will use PyTorch in this tutorial.

First we need to import all the necessary libraries and functions.

In [0]:
from __future__ import print_function, division
import os
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from skimage import io
import numpy as np
from torch.utils.data import DataLoader
from torchvision import transforms

We can use a GPU if one is available. The computations and the training process are much faster on a GPU.

In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

We will be working on the UC Merced Land Use dataset. It is a 21-class land use image dataset with classes such as agriculture, buildings, roadways, etc. More information about the dataset can be found here: http://weegee.vision.ucmerced.edu/datasets/landuse.html

Download the dataset and unzip the downloaded file.

In [0]:
!wget http://weegee.vision.ucmerced.edu/datasets/UCMerced_LandUse.zip

!unzip UCMerced_LandUse.zip

We need to store the folder location of the dataset images. A custom .csv file containing all the class/label names has been uploaded; we read it to get the list of classes.

In [0]:
image_dir = './UCMerced_LandUse/Images/'

csv_file = './classes.csv'
classes = pd.read_csv(csv_file, header=None)
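
The contents of classes.csv are not shown here. As a rough sketch, assuming the file simply lists one class folder name per row, an equivalent list could be derived from the subdirectories of the image folder. The helper name `list_class_names` is hypothetical and not part of the original notebook.

```python
import os

def list_class_names(image_dir):
    """Return the sorted names of the subdirectories of image_dir.

    Assumption: every subdirectory corresponds to exactly one class.
    """
    return sorted(
        name for name in os.listdir(image_dir)
        if os.path.isdir(os.path.join(image_dir, name))
    )
```

Calling `list_class_names(image_dir)` after unzipping would then give a plain Python list that can be indexed by class number.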

We have downloaded the dataset and now we need to create the DataLoaders which are used in training and testing the network.

We need to transform the raw data of the dataset into a format suitable for training the network. The torchvision package provides many image transformations, which can be combined using Compose.

It is good to resize the images to a standard size, since some of the images in the dataset might be of different sizes. This ensures uniformity. Randomly flipping the images about the horizontal and vertical axes is a good data augmentation technique: it exposes the network to more orientations of the same scene, which tends to improve generalization. The resize and flip transformations operate on the PIL Image format. So, the input image is first converted to a PIL Image, then the resize and flip transformations are applied, and finally the result is converted to a Torch Tensor. It is good practice to normalize the data before training a model, as it generally speeds up learning and leads to faster convergence.

More information and other transforms can be found here: https://pytorch.org/docs/stable/torchvision/transforms.html

In [0]:
transforms_train = transforms.Compose([ transforms.ToPILImage(), transforms.Resize((256,256)),
    transforms.RandomHorizontalFlip(), transforms.RandomVerticalFlip(), 
    transforms.ToTensor(), transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)) ])

transforms_test = transforms.Compose([ transforms.ToPILImage(), transforms.Resize((256,256)),
    transforms.ToTensor(), transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)) ])
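
To make the effect of Normalize concrete: for each channel c it computes (x - mean[c]) / std[c]. The values used above are the commonly used ImageNet channel statistics. Below is a minimal NumPy sketch of the same arithmetic on a made-up 3-channel image in channels-first layout, with values in [0, 1] as ToTensor produces for 8-bit input. The function name `normalize` is only for illustration.

```python
import numpy as np

# Per-channel statistics, reshaped so they broadcast over (3, H, W)
mean = np.array([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = np.array([0.229, 0.224, 0.225]).reshape(3, 1, 1)

def normalize(img):
    """img: float array of shape (3, H, W) with values in [0, 1]."""
    return (img - mean) / std

img = np.full((3, 2, 2), 0.5)   # a flat grey toy image
out = normalize(img)
print(out[:, 0, 0])             # one value per channel
```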

For DataLoaders, we need to have a list of samples where each sample is an image tensor and its label.

Every class has 100 images. Here, we take the first 80 of each class as training samples and the last 20 as test samples. It is better to shuffle the samples before splitting them. It is also good practice to do k-fold cross validation. For more information on k-fold cross validation: https://machinelearningmastery.com/k-fold-cross-validation/
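
As a sketch of the shuffling and k-fold idea, the image indices 0..99 of one class can be permuted and cut into k folds, with each fold serving once as the test set. The function name `kfold_indices` and the seed are assumptions made for illustration; the notebook itself uses the fixed 80/20 split.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) index arrays for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)      # shuffle before splitting
    folds = np.array_split(perm, k)        # k roughly equal parts
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# 100 images per class and 5 folds gives 80 training / 20 test indices per split
splits = list(kfold_indices(100, k=5))
```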

Here, ***scikit-image*** is used for reading the images. ***Scikit-image*** is an image processing library for Python, and ***io*** is its module for reading and writing images in various formats. Each image is loaded using the ***io.imread*** function, transformed to a tensor, and added to the data list along with its label.

In [0]:
train_data = []
test_data = []

for i in range(len(classes)):
    # Images 00-79 of each class form the training set (files are named <class>00.tif, ...)
    for j in range(80):
        pth = os.path.join(image_dir, classes.iloc[i,0], classes.iloc[i,0] + str(j).zfill(2) + '.tif')
        img = io.imread(pth)
        img_tensor = transforms_train(img)
        train_data.append([img_tensor, i])

    # Images 80-99 of each class form the test set
    for j in range(20):
        pth = os.path.join(image_dir, classes.iloc[i,0], classes.iloc[i,0] + str(j+80).zfill(2) + '.tif')
        img = io.imread(pth)
        img_tensor = transforms_test(img)
        test_data.append([img_tensor, i])
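
One consequence of the loop above is that the random flips are drawn once, when the list is built, so each training image keeps the same orientation in every epoch. A possible alternative, sketched here under the assumption that the raw images fit in memory, is a `Dataset` subclass that applies the transform each time a sample is read. The class name `LazyImageDataset` and the stand-in data are hypothetical.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyImageDataset(Dataset):
    """Holds raw images and applies `transform` every time a sample is read."""

    def __init__(self, images, labels, transform=None):
        self.images = images        # list of H x W x 3 uint8 arrays
        self.labels = labels        # list of integer class indices
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img = self.images[idx]
        if self.transform is not None:
            img = self.transform(img)   # random transforms are re-drawn on each read
        return img, self.labels[idx]

# Stand-in data: two random "images"; a real run would hold the io.imread results.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(2)]
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float() / 255.0
dataset = LazyImageDataset(images, [0, 1], transform=to_tensor)
```

In the notebook, `transforms_train` could be passed as `transform`, and the resulting dataset handed to `DataLoader` in place of the list.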

We need to pass the required batch size to create batches of samples. In this case, the training and testing dataloaders will have batches of 16 samples each. The general practice is to shuffle the training data so that every batch contains a mix of examples from different classes, which generally helps training.

In [0]:
TrainDataLoader = DataLoader(train_data, batch_size=16, shuffle=True)
TestDataLoader = DataLoader(test_data, batch_size=16, shuffle=False)
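
To see what a DataLoader yields, it can help to pull a single batch and look at the shapes. The sketch below uses fake all-zero samples instead of the real dataset so that it runs on its own; the names `fake_data` and `loader` are made up for this example.

```python
import torch
from torch.utils.data import DataLoader

# Stand-in for train_data: 40 fake samples of shape (3, 256, 256) with labels 0..20
fake_data = [[torch.zeros(3, 256, 256), i % 21] for i in range(40)]
loader = DataLoader(fake_data, batch_size=16, shuffle=True)

images, labels = next(iter(loader))
print(images.shape)        # (batch, channels, height, width)
print(labels.shape)        # one integer label per sample
num_batches = len(loader)  # the last batch may be smaller than 16
```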

Now that the data loaders are ready, let's build the neural network from scratch (we could also load a pre-built or pre-trained model).

We have to define the layers and the forward pass. Since we are working with images that have considerably complex features, we use convolutional neural networks (CNNs). In a CNN, small matrices called convolution filters slide over the image. At each position, the filter is multiplied element-wise with the pixels under it and the products are summed to produce one new pixel; the new pixels form a new image/feature map.

*nn.Conv2d* applies a 2D convolution over an input signal composed of several input planes. It takes the number of input channels, number of output channels, and the kernel size as the required parameters. It takes other optional parameters like padding and stride. Since we are working with RGB images, the initial number of input channels will be 3. After the first convolution layer, the number of input channels for every convolution layer will be the number of output channels for the previous layer.

The input shape of a Conv2d layer is (N, C_in, H_in, W_in) and the output shape is (N, C_out, H_out, W_out) where N is batch size, C is the number of channels, H is the height of the image, and W is the width of the image.
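
The relation between H_in and H_out (and likewise for the width) follows the formula given in the PyTorch documentation for Conv2d and MaxPool2d. A small helper, named `conv_output_size` here purely for illustration, makes it easy to check the sizes used later in this tutorial:

```python
def conv_output_size(size_in, kernel_size, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

same_size = conv_output_size(256, kernel_size=3, padding=1)   # 3x3 conv, padding 1
halved = conv_output_size(256, kernel_size=2, stride=2)       # 2x2 pooling, stride 2
print(same_size, halved)
```

A 3x3 convolution with padding 1 keeps the spatial size, while a 2x2 pooling with stride 2 halves it.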

More about convolution layers: https://pytorch.org/docs/stable/nn.html#convolution-layers

After every convolution layer, the general practice is to add a batch normalization layer to normalize the features. The input shape and the output shape are the same for a batch normalization layer.

To introduce non-linearity, we use a non-linear activation function such as ReLU after every convolution. ReLU is the most common activation function. It applies the rectified linear unit function element-wise: ReLU(x) = max(0,x).
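
As a quick illustration of the formula, here is ReLU written with NumPy on a handful of made-up values (the network itself uses *nn.ReLU*):

```python
import numpy as np

def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = relu(x)   # negative entries become 0, positive entries pass through
```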

Different types of activation functions: https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity

*nn.MaxPool2d* applies 2D max pooling, a downsampling technique that calculates the maximum value in each patch of each feature map/channel. The input shape is (N, C, H, W) and the output shape is (N, C, H_out, W_out). Pooling techniques are used to decrease the dimensionality of the data.
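
A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single feature map, assuming even height and width. The helper name `max_pool_2x2` and the toy values are illustrative only.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on one (H, W) feature map; H and W must be even."""
    h, w = fmap.shape
    # Split into non-overlapping 2x2 blocks, then take the maximum of each block
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[1, 2, 5, 6],
                 [3, 4, 7, 8],
                 [9, 8, 1, 0],
                 [7, 6, 3, 2]])
pooled = max_pool_2x2(fmap)   # shape (2, 2): one value per 2x2 block
```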

Different types of pooling functions: https://pytorch.org/docs/stable/nn.html#pooling-layers

The network architecture is defined in the *arch* list. A number signifies the number of output channels of that particular convolutional layer and 'M' signifies a max-pooling layer.

*nn.Sequential()* is a sequential container: modules are added to it in the order they are passed to the constructor.

After a set of convolutional layers, we need fully-connected layer(s) to classify the features. *nn.Linear()* is used for the fully-connected layers; it applies a linear transformation to the data.

The forward pass is defined in the *forward()* method. The images are passed into convolutional layers and the output for every sample is a 3D tensor. The 3D array/tensor is flattened to get a linear vector i.e. converting the features into a single dimension and then passed through the fully-connected layers to get the final output. The forward method determines the flow of data within the network.

In [0]:
arch = [16, 16, 'M', 64, 64, 'M', 128, 128, 'M', 256, 'M', 256, 'M', 256, 'M', 256, 'M']

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.features = self._make_layers(arch)
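        # Seven 2x2 poolings reduce a 256x256 input to 2x2 with 256 channels: 256*2*2 = 1024
        self.fc1 = nn.Linear(1024,1024)
        # 21 outputs, one per class; note there is no activation between fc1 and fc2
        self.fc2 = nn.Linear(1024,21)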

    def forward(self, I):
        out = self.features(I)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.fc2(out)
        return out

    def _make_layers(self, arch):
        layers = []
        in_channels = 3
        for x in arch:
            if x == 'M':
                layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
            else:
                layers += [nn.Conv2d(in_channels, x, kernel_size=3, padding=1),
                           nn.BatchNorm2d(x),
                           nn.ReLU(inplace=True)]
                in_channels = x
        return nn.Sequential(*layers)
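
The input size of *fc1* (1024) is not arbitrary: it is the number of values left per image after the convolutional part. The sketch below traces the shape through the *arch* list, assuming 256x256 inputs as produced by the resize transform. The function name `trace_shape` is hypothetical.

```python
arch = [16, 16, 'M', 64, 64, 'M', 128, 128, 'M', 256, 'M', 256, 'M', 256, 'M', 256, 'M']

def trace_shape(arch, in_channels=3, size=256):
    """Return (channels, height, width) after the convolutional part."""
    channels = in_channels
    for x in arch:
        if x == 'M':
            size //= 2      # 2x2 max pooling with stride 2 halves H and W
        else:
            channels = x    # 3x3 conv with padding 1 keeps H and W
    return channels, size, size

c, h, w = trace_shape(arch)
flat_features = c * h * w   # length of the flattened vector fed to fc1
print(c, h, w, flat_features)
```

If the input size or the number of pooling layers changes, the first argument of *fc1* has to change accordingly.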

A network object is created and loaded onto the GPU device, if available, for faster computations.

In [0]:
net = Net()
net.to(device)

Now, let's define all the required parameters for training the network and evaluating the performance.

We need a loss function to calculate the loss values; it measures the difference between the predicted and the expected values. PyTorch provides many loss functions: https://pytorch.org/docs/stable/nn.html#loss-functions. Different loss functions work well with different tasks. Cross entropy loss is the most commonly used loss function for classification problems.

Updating the network weights/parameters is the most important part of training a network. The *torch.optim* package provides different types of optimizers. The most common optimizers are Stochastic Gradient Descent (SGD) and Adam. The general practice is to start with a relatively large learning rate (of about 0.01) and gradually decrease it. So, a scheduler is used to decrease the learning rate by a factor (gamma) after every certain number of epochs (step size). The loss function, optimizer, learning rate, and weight decay are hyperparameters which need to be tuned to achieve better accuracies. Different sets of values work well for different networks and tasks.
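
For a single sample, cross entropy is the negative log of the softmax probability assigned to the correct class. The NumPy sketch below shows that computation on made-up scores for 21 classes; *nn.CrossEntropyLoss* takes the raw network outputs in the same way and, by default, averages over the batch. The function name `cross_entropy` is just for this example.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross entropy of one sample: -log(softmax(logits)[label])."""
    shifted = logits - np.max(logits)                       # for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))   # log-softmax
    return -log_probs[label]

# Equal scores for all 21 classes: the loss equals log(21)
uniform = cross_entropy(np.zeros(21), label=3)
# A high score on the correct class gives a loss close to 0
confident = cross_entropy(np.array([10.0] + [0.0] * 20), label=0)
print(uniform, confident)
```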

In [0]:
num_epochs = 50

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.8, weight_decay=0.02)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
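
With these settings the learning rate is halved every 10 epochs. Assuming *scheduler.step()* is called once at the end of each epoch, the rate in effect during a given (0-based) epoch can be written in closed form. The helper `step_lr` below is a plain-Python illustration, not a PyTorch function.

```python
def step_lr(epoch, base_lr=0.01, step_size=10, gamma=0.5):
    """Learning rate in effect during `epoch` (0-based) under a step schedule."""
    return base_lr * gamma ** (epoch // step_size)

# Learning rate during the first epoch, around the first decay, and in the last epoch
schedule = [step_lr(e) for e in (0, 9, 10, 20, 49)]
print(schedule)
```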

*.train()* sets the model to training mode. A batch of images and labels is passed through the network, outputs are obtained, loss values are calculated and backpropagated, and finally the weights are updated. The running loss is accumulated and its average is displayed after every few batches. The loss values should decrease as the network gets trained.

*.eval()* sets the model to inference mode. Training and test accuracies are calculated. Test accuracy is our primary interest and the actual evaluation metric. Training accuracy can reach 100%; when test accuracy stays well below it, the model is overfitting. As the network gets trained, test accuracy converges, but it can also drop after a certain point. So, to keep track, we record the best epoch and the best test accuracy after every epoch and save the best model up to that point. After every epoch, we call *scheduler.step()* to update the learning rate.

In [0]:
best_test_acc = 0
best_epoch = -1

PATH = './trained_model.pth'

for epoch in range(num_epochs):

    net.train()

    running_loss = 0.0
    for batch_no, (images, labels) in enumerate(TrainDataLoader):
        optimizer.zero_grad()
        images = images.to(device)
        labels = labels.to(device)
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if batch_no % 25 == 24:
            print('[%d, %5d] loss: %.3f' %(epoch + 1, batch_no + 1, running_loss / 25))
            running_loss = 0.0

    net.eval()

    correct = 0
    total = 0
    with torch.no_grad():
        for data in TrainDataLoader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = net(images)
            _,predicted = torch.max(outputs.data,1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    train_acc = np.around(correct/total*100, decimals=2)
    print('TRAIN ACCURACY:', train_acc)

    correct = 0
    total = 0
    with torch.no_grad():
        for data in TestDataLoader:
            images, labels = data[0].to(device), data[1].to(device)
            outputs = net(images)
            _,predicted = torch.max(outputs.data,1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    test_acc = np.around(correct/total*100, decimals=2)
    print('TEST ACCURACY:', test_acc)

    # Keep the weights from the epoch with the best test accuracy so far
    if test_acc > best_test_acc:
        best_test_acc = test_acc
        best_epoch = epoch+1
        torch.save(net.state_dict(), PATH)

    scheduler.step()
    print()

print('Best test accuracy:', best_test_acc)
print('Best epoch:', best_epoch)
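
The saved file contains only the weights (the state dict), so to reuse the best model a network with the same architecture has to be created first and the weights loaded into it. The sketch below shows the round trip with a tiny stand-in model and a temporary file so that it runs on its own; in this tutorial the model would be `Net()` and the file would be `PATH`.

```python
import os
import tempfile
import torch
import torch.nn as nn

# Tiny stand-in model; in the tutorial this would be Net()
model = nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), 'demo_model.pth')
torch.save(model.state_dict(), path)

restored = nn.Linear(4, 2)                                   # same architecture
restored.load_state_dict(torch.load(path, map_location='cpu'))
restored.eval()                                              # inference mode

x = torch.ones(1, 4)
with torch.no_grad():
    same = torch.equal(model(x), restored(x))                # identical outputs
print(same)
```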