In [None]:
# Adapted from https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html

# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

Getting Started
---------------

If you are running this in Google Colab, here are a few points to get started.
You may want to start by saving a copy of this notebook to your Drive, which allows you to edit it.

![copy to drive](img/drivecopy.png)

Furthermore, this notebook requires a GPU to run properly. You can change your runtime to use a GPU in the top-right corner.

![change runtime](img/changeruntime.png)

Once clicked, a window opens in which you can specify whether you want to use a GPU-enabled node.

![select gpu](img/runtimetype.png)

You can run CLI commands using the `!` prefix. Let's run `nvidia-smi` to figure out what GPU we have access to.

In [None]:
!nvidia-smi

You should see one GPU listed (e.g. Tesla T4). If you don't see a GPU listed, or the command was not found, please swap your runtime to a GPU-enabled one.

Improving deep learning with profiling
=====================

For this tutorial, we will train a ResNet model on the CIFAR10 dataset.
We will record utilisation data using the built-in PyTorch profiling tools and use this data to change the model training process.

This tutorial consists of three parts:

- A: Training an image classifier
- B: Profiling our model training
- C: Change our model training using gained insights


The CIFAR10 dataset has the classes:
'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse',
'ship', 'truck'. The images in CIFAR-10 are of size 3x32x32, i.e.
3-channel color images of 32x32 pixels in size.

![cifar10](https://pytorch.org/tutorials/_static/img/cifar10.png)
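
To get a feel for these numbers, the arithmetic below works out how much memory the raw training images occupy. It is only a back-of-the-envelope sketch and assumes the standard CIFAR10 split of 50,000 training images; the variable names are illustrative.

```python
# Back-of-the-envelope size of the CIFAR10 training images.
# Assumption: the standard split of 50,000 training images.
channels, height, width = 3, 32, 32
num_train_images = 50_000

values_per_image = channels * height * width        # values in one 3x32x32 image
bytes_uint8 = num_train_images * values_per_image   # stored as 8-bit pixels
bytes_float32 = bytes_uint8 * 4                     # after conversion to float32 tensors

print(f"values per image: {values_per_image}")
print(f"uint8 training set:   {bytes_uint8 / 2**20:.1f} MiB")
print(f"float32 training set: {bytes_float32 / 2**20:.1f} MiB")
```

Even as float32 the whole training set stays well below a gigabyte, so the dataset itself is unlikely to be what fills up GPU memory later on.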

A: Training an image classifier
----------------------------

1. Load dataset, ResNet and optimizer
========================================

This loads in our dataset and ResNet34 model, which are by modern standards a relatively small dataset and convolutional network, respectively.


In [None]:
import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.optim as optim

classes = ('plane', 'car', 'bird', 'cat',
        'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                    download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=128,
                                        shuffle=False, num_workers=2)

def define_network(batch_size=8, num_workers=1):
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                            download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              shuffle=True, num_workers=num_workers)

    net = torchvision.models.resnet34()
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

    # Use the first CUDA device if one is available, otherwise fall back to the CPU.
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    # Assuming that we are on a CUDA machine, this should print a CUDA device:
    print(device)
    net.to(device)

    return net, trainloader, criterion, optimizer, device

net, trainloader, criterion, optimizer, device = define_network()

2. Train the network
====================

This loops over the training data and trains the network.


In [None]:
def train(net, trainloader, criterion, optimizer, device, epoch_count=1, stop_early=False):
    data_length = len(trainloader)
    for epoch in range(epoch_count):  # loop over the dataset multiple times
        running_loss = 0.0

        for i, data in enumerate(trainloader, 0):

            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data[0].to(device), data[1].to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            # print statistics
            running_loss += loss.item()

            if i % 20 == 19:    # print every 20 mini-batches
                print(f'[Epoch {epoch + 1}, {i + 1:5d}/{data_length}] loss: {running_loss / 20:.3f}')
                running_loss = 0.0

            if stop_early and i*trainloader.batch_size > 500: # stop early to limit the size of our profiling
                break

    print('Finished Training')

train(net, trainloader, criterion, optimizer, device, epoch_count=1)

3. Test the network on the test data
====================================

To verify that the network has learnt something, let us evaluate the model on the test data.

In [None]:
correct = 0
total = 0
# put the network in evaluation mode so BatchNorm uses its running statistics
net.eval()
# since we're not training, we don't need to calculate the gradients for our outputs
with torch.no_grad():
    for data in testloader:
        inputs, labels = data[0].to(device), data[1].to(device)
        # calculate outputs by running images through the network
        outputs = net(inputs)
        # the class with the highest energy is what we choose as prediction
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the network on the 10000 test images: {100 * correct // total} %')

If training went well, this should be way better than chance, which is 10% accuracy
(randomly picking a class out of 10 classes). In that case the network has
learnt something.
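
A single accuracy figure can hide classes that the network handles poorly. As a sketch, the hypothetical helper `per_class_accuracy` below breaks the accuracy down per class; the toy tensors stand in for the `predicted` and `labels` values produced in the test loop.

```python
import torch

def per_class_accuracy(predicted, labels, num_classes):
    """Return a tensor with the fraction of correct predictions for each class."""
    # count correct predictions per true class
    correct = torch.bincount(labels[predicted == labels], minlength=num_classes)
    # count how often each class occurs in the labels
    total = torch.bincount(labels, minlength=num_classes)
    # clamp avoids division by zero for classes that never occur in `labels`
    return correct.float() / total.clamp(min=1).float()

# Toy data with 3 classes.
labels = torch.tensor([0, 0, 1, 1, 2, 2])
predicted = torch.tensor([0, 1, 1, 1, 0, 2])
acc = per_class_accuracy(predicted, labels, num_classes=3)
print(acc)
```

In the test loop, the per-batch `predicted` and `labels` tensors could be concatenated first (for instance with `torch.cat`) and then passed to the helper with `num_classes=10`.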


B: Profiling our model training
----------------------------

Our model has trained, but even for such a simple model this took a long time. Let us collect information to figure out why training took so long.
Increasing the efficiency of training does not only save resources; it also allows for rapid debugging and for training the model for more epochs in the same amount of time.
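
A coarse wall-clock measurement is a useful first data point, since it already tells us how long a step or an epoch takes. The sketch below uses a hypothetical `timed` context manager built on `time.perf_counter`. One caveat: CUDA kernels are launched asynchronously, so when timing GPU work the clock should only be read after the GPU has finished. The `sync` parameter is an assumption of this sketch, not a PyTorch API; on a CUDA machine one would pass `torch.cuda.synchronize` for it.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results, sync=None):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    if sync is not None:
        sync()  # make sure earlier queued work is finished before starting the clock
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()  # wait for work launched inside the block
        results[label] = time.perf_counter() - start

timings = {}
with timed("step", timings):
    time.sleep(0.05)  # stand-in for one training step
print(f"step took {timings['step']:.3f} s")
```

Wrapped around the `train(...)` call, this gives a baseline number to compare against after each change to the training setup.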

There are multiple profilers available and for our basic analysis most will suffice. In order to comply with the limitations of Google Colab we resort to the built-in PyTorch profiler.
We can wrap the train loop in a profiler context to profile model training.

In [None]:
net, trainloader, criterion, optimizer, device = define_network()
with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        record_shapes=False,
        profile_memory=True,
        with_stack=False,
    ) as prof:
    train(net, trainloader, criterion, optimizer, device, epoch_count=1, stop_early=True)

Notably, training our model like this should be *significantly* slower than our previous training loop. Profiling is not cheap; we are essentially logging all kinds of information, which introduces substantial overhead!
We have enabled *stop_early* to prevent this step from taking too much time; part of one epoch is more than enough.
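
Besides stopping the loop early, the profiler itself can be told to record only a few steps. The sketch below uses `torch.profiler.schedule` to skip one step, warm up for one, and record three. It runs on the CPU with a tiny stand-in model (`torch.nn.Linear`), so the model, the step count, and the `collect` callback are illustrative assumptions rather than this tutorial's setup.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(64, 10)  # tiny stand-in for the real network
inputs = torch.randn(32, 64)

summaries = []

def collect(p):
    # called by the profiler each time a recording window finishes
    summaries.append(p.key_averages().table(sort_by="cpu_time_total", row_limit=5))

with profile(
        activities=[ProfilerActivity.CPU],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        on_trace_ready=collect,
    ) as prof:
    for step in range(6):
        model(inputs).sum().backward()
        prof.step()  # tell the profiler that one training step has finished

print(summaries[0])
```

In a real training loop, `prof.step()` would go at the end of each mini-batch iteration, and `ProfilerActivity.CUDA` would be added to `activities` on a GPU machine.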

We have collected a lot of information. You are welcome to browse through the available stats yourself; here are a few examples:

In [None]:
print("CPU time spent on operations")
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))

In [None]:
print("GPU/CUDA time spent on operations")
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

In [None]:
print("CPU memory (RAM) allocations")
print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

These tables show which operations consume more resources than they should, but a table only has so much expressive power.
One way to dig deeper is to delve into the execution trace of our model training. This gives us a timeline-based view that allows us to find bottlenecks effectively.
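
Both the tables and the timeline become easier to read when regions of the training step carry their own labels. As a sketch, `torch.profiler.record_function` can wrap the forward and backward pass. The labels `forward` and `backward` and the small stand-in model are assumptions for illustration, and the example runs on the CPU only.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model = torch.nn.Linear(64, 10)  # tiny stand-in for the real network
inputs = torch.randn(32, 64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        with record_function("forward"):
            loss = model(inputs).sum()
        with record_function("backward"):
            loss.backward()

# The custom labels show up as rows next to the built-in operator names.
event_names = [event.key for event in prof.key_averages()]
print([name for name in event_names if name in ("forward", "backward")])
```

The same idea could be applied inside `train` to separate data loading, the forward pass, the backward pass, and the optimizer step.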

We can dump a trace using the PyTorch profiler. Google Chrome can open trace files natively at [about:tracing](about:tracing); for other browsers you can use Perfetto instead: [https://ui.perfetto.dev/#!/viewer](https://ui.perfetto.dev/#!/viewer).

Files that you save in Colab are found in the file tab to the left. You may have to press the refresh arrow in the file browser. You can download files using the right mouse button.

In [None]:
prof.export_chrome_trace("trace.json")
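
The exported file is plain JSON, so it can also be inspected programmatically. The sketch below assumes the Chrome trace event layout: a top-level `traceEvents` list in which complete events carry `"ph": "X"`, a `name`, and a duration `dur` in microseconds. To stay self-contained it summarises a small hand-written trace with made-up events rather than the real `trace.json`; the helper `total_duration_by_name` is hypothetical.

```python
import json
from collections import defaultdict

def total_duration_by_name(trace):
    """Sum the duration (in microseconds) of complete events per event name."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X":  # "X" marks a complete event with a duration
            totals[event["name"]] += event.get("dur", 0)
    # longest total duration first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hand-written stand-in for the contents of a trace file.
example = json.loads("""
{"traceEvents": [
  {"ph": "X", "name": "aten::conv2d", "ts": 0,   "dur": 120},
  {"ph": "X", "name": "aten::relu",   "ts": 130, "dur": 10},
  {"ph": "X", "name": "aten::conv2d", "ts": 150, "dur": 80},
  {"ph": "M", "name": "process_name"}
]}
""")
summary = total_duration_by_name(example)
print(summary)
```

For the real file, `example` would be replaced by the result of `json.load` on an opened `trace.json`.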

Alternatively, we can dive into how our program uses the GPU memory using a PyTorch CUDA memory snapshot. For this we do want to train the model for at least a full epoch in order to get a good picture of our memory consumption. We can record a GPU memory snapshot by initialising it before we start training.

In [None]:
torch.cuda.memory._record_memory_history(max_entries=100000)

net, trainloader, criterion, optimizer, device = define_network()

train(net, trainloader, criterion, optimizer, device, epoch_count=1, stop_early=False)

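torch.cuda.memory._dump_snapshot("cuda_snapshot.pickle")

# stop recording memory history once the snapshot has been written
torch.cuda.memory._record_memory_history(enabled=None)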

Go to [https://pytorch.org/memory_viz](https://pytorch.org/memory_viz) to analyze the memory snapshot `.pickle` file.
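
For a quick number without a full snapshot, PyTorch also keeps peak allocation counters. The sketch below wraps a callable in a hypothetical helper, `peak_gpu_memory_mib`, and reports the peak allocated GPU memory in MiB. It returns `None` when no CUDA device is present, and the `workload` function is only a stand-in for a training run.

```python
import torch

def peak_gpu_memory_mib(fn):
    """Run fn() and return the peak GPU memory allocated while it ran, in MiB."""
    if not torch.cuda.is_available():
        return None  # nothing to measure without a CUDA device
    # the peak counter restarts from whatever is currently allocated
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()  # wait for queued GPU work to finish
    return torch.cuda.max_memory_allocated() / 2**20

def workload():
    # stand-in workload: a matrix multiplication on the GPU if there is one
    device = "cuda" if torch.cuda.is_available() else "cpu"
    a = torch.randn(1024, 1024, device=device)
    return a @ a

peak = peak_gpu_memory_mib(workload)
print(peak)
```

Comparing this number with the total memory reported by `nvidia-smi` gives a first indication of how much headroom is left.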

C: Change our model training using gained insights
----------------------------

Using these sources of information, we can drastically speed up the execution of our model training. The training can be tweaked in multiple ways, and I wholeheartedly recommend that you experiment and observe the effects in the profiling output. Remember that the effectiveness heavily depends on what hardware you are running on; some things may work extremely well while others might not benefit training at all.

As an example, I would like to draw your attention to two aspects:

Batch Size
----------

Is the GPU memory saturated or is there still much to work with? Additionally, is the GPU working constantly at full power, or is it idling for a decent chunk of time?
GPUs are efficient for deep learning training due to their ability to run massively parallel operations efficiently, and increasing the degree of parallelism in our model training can be highly beneficial to training speed.
Right now the `batch size` is set to `8`, which means that the model trains on 8 images at the same time. Try increasing this to `64` or even `128`. What is the observed effect on training speed and profiling?
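
One effect of the batch size can be worked out before running anything: the number of optimizer steps per epoch. The sketch assumes the 50,000-image CIFAR10 training set and a data loader that keeps the last, smaller batch, which is the `DataLoader` default.

```python
import math

num_train_images = 50_000  # assumption: standard CIFAR10 training split

# number of mini-batches (and therefore optimizer steps) in one epoch
steps_per_epoch = {
    batch_size: math.ceil(num_train_images / batch_size)
    for batch_size in (8, 64, 128)
}
for batch_size, steps in steps_per_epoch.items():
    print(f"batch_size={batch_size:>3}: {steps} steps per epoch")
```

Fewer, larger steps mean that fixed per-step costs are paid less often. To try it with the code above, call `define_network(batch_size=64)` and rerun the profiling cell.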

Dataloading Worker Count
----------

Inspect the trace plot. Are we spending all of our time training on our data, or do we need to wait for our data? With a basic data loader, the data is read from disk and processed whenever the GPU is done processing the previous data. We can fetch this data while the GPU is working to prevent this stalling. This can be done by raising `num_workers` above `1`, e.g. to `16`. What effect do you observe on the training speed and the trace?
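
How many workers help depends on the number of CPU cores available, and PyTorch warns when `num_workers` exceeds what the machine can reasonably provide. As a sketch, the hypothetical helper `make_loader` below defaults the worker count to `os.cpu_count()`. It also sets `pin_memory` and `persistent_workers`, which are additional `DataLoader` options not used elsewhere in this tutorial. A synthetic dataset stands in for CIFAR10 so the example runs without a download.

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(dataset, batch_size=64, num_workers=None):
    """Build a DataLoader whose worker count defaults to the number of CPU cores."""
    if num_workers is None:
        num_workers = os.cpu_count() or 1
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),  # page-locked memory speeds up host-to-GPU copies
        persistent_workers=num_workers > 0,    # keep workers alive between epochs
    )

# Synthetic stand-in for CIFAR10: 256 random "images" with random labels.
fake_data = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))
loader = make_loader(fake_data, batch_size=64, num_workers=0)  # 0 = load in the main process
images, labels = next(iter(loader))
print(images.shape, labels.shape)
```

Calling `make_loader(trainset)` without `num_workers` would use one worker per CPU core, which is a reasonable starting point to compare against `1` and `16` in the trace.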
