HHU Deep Learning, WS2023/24, 13.10.2023

Lecture: Prof. Dr. Markus Kollmann

Exercises: Felix Michels, Nikolas Adaloglou

# PyTorch Introduction

by Tim Kaiser

---

## Installation

Take a look at https://pytorch.org/get-started/locally/ to install PyTorch. Set your preferences in the "Start Locally" section and run the command.

I recommend using Python 3.11 and PyTorch >= 2.0. You don't have to install CUDA separately! The PyTorch installation installs the cudatoolkit package.

## What is PyTorch?

PyTorch is a deep learning library with three main features: the tensor data structure, which is similar to
the NumPy ndarray; automatic differentiation for training neural networks; and the ability to run code on a GPU, which can speed up computations significantly.

We will now take a look at part of the [PyTorch Tutorial](https://github.com/tuelwer/pytorch-tutorial) by Tobias Uelwer to get familiar with fundamentals like tensors, automatic differentiation and function optimization in PyTorch.

Use this as an introduction, as well as a reference for later exercises.

---

## PyTorch Basics

Author: Tobias Uelwer

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import torch
print(torch.__version__)

1.13.1


In [2]:
# We create a new PyTorch tensor from a NumPy array
torch.from_numpy(np.ones((2, 2)))
# We can instantly access its values, unlike what we are used to from TensorFlow

tensor([[1., 1.],
        [1., 1.]], dtype=torch.float64)
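
As a small aside, `torch.from_numpy` does not copy the data: the tensor and the array share memory, and the dtype (`float64` here) is inherited from NumPy. The following is a minimal sketch of that behavior; the variable names are only for illustration.

```python
import numpy as np
import torch

arr = np.ones((2, 2))            # NumPy defaults to float64
t = torch.from_numpy(arr)        # shares memory with `arr`, no copy

arr[0, 0] = 5.0                  # modifying the array ...
shared_value = t[0, 0].item()    # ... is visible through the tensor

t32 = t.float()                  # converting to float32 creates a copy
arr[0, 0] = 7.0
copied_value = t32[0, 0].item()  # the copy still holds the old value

print(t.dtype, t32.dtype, shared_value, copied_value)
```

Converting with `.float()` is usually what we want before feeding data into a network, since layers use `float32` parameters by default.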

In [3]:
# Generate a random tensor
A = torch.randn(4, 5)
A

tensor([[-0.9912,  0.7258, -0.1109, -0.9342,  1.2320],
        [ 2.5265, -1.3166,  1.6025, -0.2199,  1.1588],
        [ 0.8623, -0.5271,  0.1942,  1.4443, -2.5663],
        [-1.5697,  0.0790, -0.4856,  1.2007,  0.4859]])

In [4]:
# Indexing
A[0, 0]

tensor(-0.9912)

In [5]:
# More indexing
A[:, 2]

tensor([-0.1109,  1.6025,  0.1942, -0.4856])

In [6]:
# Get the shape of A
A.shape

torch.Size([4, 5])

In [7]:
# Traces and matrix transpose
torch.trace(A.T @ A)

tensor(30.1028)

In [8]:
# Random standard normal samples
B = torch.randn(1, 5)
B

tensor([[-0.2899,  0.1590, -1.5151,  0.7761,  1.1797]])

In [9]:
# Broadcasting
A * B

tensor([[ 0.2874,  0.1154,  0.1680, -0.7251,  1.4534],
        [-0.7325, -0.2093, -2.4280, -0.1707,  1.3670],
        [-0.2500, -0.0838, -0.2942,  1.1210, -3.0274],
        [ 0.4551,  0.0126,  0.7358,  0.9319,  0.5733]])
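
Broadcasting works here because `B` has size 1 in the first dimension, so it is virtually repeated along that dimension. A small sketch, with freshly sampled tensors, that compares the implicit result to an explicit `expand` and shows a shape combination that cannot be broadcast:

```python
import torch

A = torch.randn(4, 5)
B = torch.randn(1, 5)

# B has size 1 in dim 0, so it is (virtually) repeated along that dimension.
broadcast_product = A * B
explicit_product = A * B.expand(4, 5)   # same result with an explicit expand

# Shapes that differ in a dimension where neither size is 1 cannot be broadcast.
try:
    A * torch.randn(3, 5)
    broadcast_failed = False
except RuntimeError:
    broadcast_failed = True

print(broadcast_product.shape, broadcast_failed)
```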

In [10]:
# The function item() returns a simple Python scalar
type(torch.sum(A**2)), type(torch.sum(A**2).item())

(torch.Tensor, float)

In [11]:
# Reshaping
A.view(-1, 2, 10)

tensor([[[-0.9912,  0.7258, -0.1109, -0.9342,  1.2320,  2.5265, -1.3166,
           1.6025, -0.2199,  1.1588],
         [ 0.8623, -0.5271,  0.1942,  1.4443, -2.5663, -1.5697,  0.0790,
          -0.4856,  1.2007,  0.4859]]])
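
One caveat worth knowing: `view` never copies data, so it only works when the requested shape is compatible with the tensor's memory layout. The sketch below (using a small `arange` tensor instead of the random `A`) shows `view` failing on a transposed tensor, where `reshape` works because it falls back to a copy.

```python
import torch

A = torch.arange(20.0).view(4, 5)
At = A.T                         # transposed view, not contiguous in memory

try:
    At.view(-1)                  # view() needs a compatible memory layout
    view_worked = True
except RuntimeError:
    view_worked = False

flat = At.reshape(-1)            # reshape() copies when a view is impossible
same_memory = A.view(2, 10).data_ptr() == A.data_ptr()  # view() never copies
print(view_worked, flat.shape, same_memory)
```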

In [12]:
# squeeze() removes all dimensions of size 1 ('empty' dimensions)
Ones = torch.ones((2, 1, 9, 1))
Ones.shape, Ones.squeeze().shape

(torch.Size([2, 1, 9, 1]), torch.Size([2, 9]))

In [13]:
# Add dimensions
torch.ones(9).unsqueeze(0).shape, torch.ones(9).unsqueeze(1).shape

(torch.Size([1, 9]), torch.Size([9, 1]))

In [14]:
# Add new dimensions (this also works with numpy arrays...)
A[None, :, None, :, None].shape

torch.Size([1, 4, 1, 5, 1])

In [15]:
# Elementwise power
A**2

tensor([[9.8251e-01, 5.2685e-01, 1.2296e-02, 8.7276e-01, 1.5179e+00],
        [6.3832e+00, 1.7333e+00, 2.5679e+00, 4.8351e-02, 1.3428e+00],
        [7.4354e-01, 2.7785e-01, 3.7713e-02, 2.0861e+00, 6.5858e+00],
        [2.4639e+00, 6.2341e-03, 2.3583e-01, 1.4417e+00, 2.3614e-01]])

In [16]:
# Logarithm and absolute value
torch.log(torch.abs(A))

tensor([[-0.0088, -0.3204, -2.1992, -0.0680,  0.2087],
        [ 0.9268,  0.2750,  0.4716, -1.5146,  0.1474],
        [-0.1482, -0.6403, -1.6389,  0.3676,  0.9425],
        [ 0.4509, -2.5389, -0.7223,  0.1829, -0.7217]])

In [17]:
# All close
torch.allclose(torch.log(torch.exp(A)), A)

True

In [18]:
# Exact elementwise comparison can fail because of floating-point rounding
A.exp().log() == A

tensor([[ True,  True,  True,  True,  True],
        [ True,  True,  True,  True,  True],
        [False,  True, False,  True,  True],
        [ True, False,  True,  True, False]])
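
The difference between the last two cells is the tolerance: `torch.allclose` accepts small deviations (controlled by `rtol` and `atol`), while `==` demands bitwise-equal values. A minimal sketch with the classic `0.1 + 0.2` example in `float64`:

```python
import torch

a = torch.tensor([0.1], dtype=torch.float64) + torch.tensor([0.2], dtype=torch.float64)
b = torch.tensor([0.3], dtype=torch.float64)

exactly_equal = bool((a == b).all())          # rounding makes 0.1 + 0.2 != 0.3
close_default = torch.allclose(a, b)          # |a - b| <= atol + rtol * |b|
close_strict = torch.allclose(a, b, rtol=0.0, atol=0.0)  # zero tolerance = exact test
print(exactly_equal, close_default, close_strict)
```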

In [19]:
cuda = torch.cuda.is_available()
print(f"Using GPU: {cuda}")

if cuda:
    A = A.cuda()

print(A)

Using GPU: False
tensor([[-0.9912,  0.7258, -0.1109, -0.9342,  1.2320],
        [ 2.5265, -1.3166,  1.6025, -0.2199,  1.1588],
        [ 0.8623, -0.5271,  0.1942,  1.4443, -2.5663],
        [-1.5697,  0.0790, -0.4856,  1.2007,  0.4859]])
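
An alternative to the `if cuda:` pattern is to pick a `torch.device` once and pass it around. This is only a sketch of that common idiom; it runs on the CPU when no GPU is available.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

A = torch.randn(4, 5).to(device)        # move an existing tensor
B = torch.ones(4, 5, device=device)     # or create it on the device directly

C = A + B                                # operands must live on the same device
C_cpu = C.cpu()                          # move back, e.g. before calling .numpy()
print(device, C.device, C_cpu.device)
```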


### Automatic Differentiation

In [20]:
# If we want to calculate gradients we need to specify this
# by setting requires_grad=True for each tensor
# X = torch.ones(2, 2, requires_grad=True)
X = torch.tensor([[1,1],[2,2]], dtype=torch.float32, requires_grad=True)
X

tensor([[1., 1.],
        [2., 2.]], requires_grad=True)

In [21]:
# We build a simple computational graph 
y = torch.sum(torch.exp(X))
y

tensor(20.2147, grad_fn=<SumBackward0>)

In [22]:
# In PyTorch, gradients are accumulated each time backward() is called.
# Keep that in mind!
y.backward(retain_graph=True) # don't free the graph buffers after backprop
X.grad

tensor([[2.7183, 2.7183],
        [7.3891, 7.3891]])

In [23]:
# To suppress this behavior we have to reset the gradients first
X.grad.zero_()
y.backward(retain_graph=True) # don't free the graph buffers after backprop
X.grad

tensor([[2.7183, 2.7183],
        [7.3891, 7.3891]])
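
To make the accumulation visible, the following sketch calls `backward()` twice without resetting and compares the result to the analytic gradient $\partial y / \partial X = \exp(X)$. It rebuilds the graph for every call, so `retain_graph=True` is not needed here.

```python
import torch

X = torch.tensor([[1., 1.], [2., 2.]], requires_grad=True)

torch.sum(torch.exp(X)).backward()
first = X.grad.clone()           # d/dX sum(exp(X)) = exp(X)

torch.sum(torch.exp(X)).backward()
accumulated = X.grad.clone()     # second call was added on top: 2 * exp(X)

X.grad.zero_()                   # reset in place
torch.sum(torch.exp(X)).backward()
after_reset = X.grad.clone()

print(first, accumulated, after_reset, sep="\n")
```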

### Optimizing Functions

In [24]:
# Define variable
x = torch.tensor([2., 2.], dtype=torch.float32, requires_grad=True)

We want to minimize the function $f(x_1, x_2) = 100(x_2-x_1^2)^2 + (1-x_1)^2$.

The function has a minimum at $(1, 1)$.

In [25]:
# Define objective
f = lambda x: 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2

In [26]:
# Instantiate optimizer
optimizer = torch.optim.Adam([x])  # Adam, an adaptive variant of gradient descent

In [27]:
i = 0
while True:
    f_val = f(x)
    
    optimizer.zero_grad()
    f_val.backward()
    optimizer.step()
    
    i += 1
    
    if torch.norm(x.grad) < 0.001:
        print(f"Converged after {i} iterations.")
        print(f"Minimum: {str(x.detach().numpy())}")
        break
    if i > 100000:
        print("Maximum number of iterations reached.")
        break

Converged after 15729 iterations.
Minimum: [1.0009778 1.0019593]
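
For comparison, here is a minimal sketch of what a plain gradient descent update looks like when written by hand. It uses a simple quadratic objective instead of the function above, and the learning rate and number of steps are arbitrary choices for this illustration.

```python
import torch

x = torch.tensor([2., 2.], requires_grad=True)
target = torch.tensor([1., 1.])
lr = 0.1                                   # step size (assumed value)

for _ in range(100):
    loss = torch.sum((x - target) ** 2)    # simple convex objective
    loss.backward()
    with torch.no_grad():                  # the update itself must not be tracked
        x -= lr * x.grad
    x.grad.zero_()                         # otherwise gradients would accumulate

print(x.detach().numpy())
```

The same update is what `torch.optim.SGD([x], lr=0.1)` performs in `optimizer.step()`; Adam additionally rescales each step using running estimates of the gradient moments.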


---

## Implementing Neural Networks

After familiarizing ourselves with the basics of Tensors and optimization in PyTorch, we want to build and train a 
Convolutional Network.

There are three ways to implement a Neural Network in PyTorch:
 
1. Barebones PyTorch: work directly with the lowest-level PyTorch Tensors. 
2. PyTorch Module API: use `nn.Module` to define arbitrary neural network architectures. 
3. PyTorch Sequential API: use `nn.Sequential` to define a linear feed-forward network very conveniently. 

Here is a table of comparison:

| API             | Flexibility | Convenience |
|-----------------|-------------|-------------|
| Barebones       | High        | Low         |
| `nn.Module`     | High        | Medium      |
| `nn.Sequential` | Low         | High        |


In [28]:
import torch.nn as nn
import torch.optim as optim
import torch.utils.data
import torchvision
import torchvision.transforms as transforms

### Barebones PyTorch

We will start with the barebones implementation of a two-layer fully connected network. This helps explore the overall
code structure, the autograd engine and PyTorch's training conventions. 

First, we will define the network by defining its forward pass. Then, we have to initialize the weights for the entire
model and finally define a training loop to train it. 

We test our forward pass by running it on a batch of tensors filled with zeros. This is to make sure that we don't get
any errors and the network produces an output of the right shape. 

In [None]:
import torch.nn.functional as F  # useful stateless functions

def two_layer_fc(x, params):
    n = x.shape[0]
    # first we flatten the image
    x = x.view(n, -1)  # shape: [batch_size, C x H x W]
    
    w1, w2 = params
    x = F.relu(x @ w1)
    x = x @ w2
    return x
    
def test_two_layer_fc():
    hidden_layer_size = 42
    x = torch.zeros((64, 50), dtype=torch.float32)  # minibatch size 64, feature dimension 50
    w1 = torch.zeros((50, hidden_layer_size), dtype=torch.float32)
    w2 = torch.zeros((hidden_layer_size, 10), dtype=torch.float32)
    scores = two_layer_fc(x, [w1, w2])
    print(scores.size())  # you should see [64, 10]

test_two_layer_fc()

For the barebones approach, we need to manually initialize the network's weights. In order to make the parameters 
trainable, we have to set `requires_grad` to True. However, any operation on such a tensor (even moving it
to the GPU) returns a new, non-leaf tensor whose `.grad` is not populated during backpropagation, so make sure that setting `requires_grad` is the last thing you do when initializing
weights. 
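
A minimal sketch of why the order matters, using `.double()` as a CPU stand-in for `.cuda()`: an operation on a tensor that requires gradients returns a new non-leaf tensor, and the gradient is stored on the original leaf only.

```python
import torch

w = torch.zeros(3, 2, requires_grad=True)   # leaf tensor created by the user
w_moved = w.double()                          # any operation (like .cuda()) yields a new tensor

print(w.is_leaf, w_moved.is_leaf)             # True False

w_moved.sum().backward()
grad_on_leaf = w.grad is not None             # gradient ends up on the original leaf

# Recommended order: transform first, then mark as trainable.
w_ok = torch.zeros(3, 2).double().requires_grad_()
w_ok.sum().backward()
grad_on_w_ok = w_ok.grad is not None
```

If we kept only `w_moved` and tried to update it with its own `.grad`, there would be nothing to update with, which is why `requires_grad_()` comes last.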

In [None]:
input_size = 28 * 28
hidden_size = 256
num_classes = 10

w1 = torch.zeros((input_size, hidden_size), dtype=torch.float32)
nn.init.kaiming_normal_(w1)
w2 = torch.zeros((hidden_size, num_classes), dtype=torch.float32)
nn.init.kaiming_normal_(w2)

if cuda:
    w1 = w1.cuda()
    w2 = w2.cuda()
    
w1.requires_grad_()
w2.requires_grad_()
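
With weights initialized like this, a barebones training loop could look as follows. This is only a sketch under stated assumptions: it trains on random synthetic data instead of a real dataset, uses smaller layer sizes, scales the weights by hand, and applies plain gradient descent with an arbitrary learning rate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Synthetic stand-in for a dataset: 64 samples, 20 features, 3 classes (all assumed).
x = torch.randn(64, 20)
y = torch.randint(0, 3, (64,))

w1 = (torch.randn(20, 32) * (2 / 20) ** 0.5).requires_grad_()
w2 = (torch.randn(32, 3) * (2 / 32) ** 0.5).requires_grad_()
lr = 0.1

losses = []
for step in range(50):
    scores = F.relu(x @ w1) @ w2            # forward pass of the two-layer network
    loss = F.cross_entropy(scores, y)
    loss.backward()                         # fills w1.grad and w2.grad
    with torch.no_grad():                   # plain gradient-descent update
        for w in (w1, w2):
            w -= lr * w.grad
            w.grad.zero_()
    losses.append(loss.item())

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```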

### PyTorch Module API

The barebones approach quickly becomes inconvenient for larger networks, because we have to track the network's parameters
by hand. The PyTorch `nn.Module` API relieves us of this work by tracking all learnable parameters. All we have to do
is define the layers and the forward pass of the network. PyTorch also takes care of the weight initialization for us.
If we want to use a specific initialization, we can apply it in the `__init__` function. 

Here is an example of the same network as above, but using the `nn.Module` API.

In [None]:
class TwoLayerFC(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)
        # nn.init.kaiming_normal_(self.fc1.weight)
        # nn.init.kaiming_normal_(self.fc2.weight)
    
    def forward(self, x):
        n = x.shape[0]
        x = x.view(n, -1)
        scores = self.fc2(F.relu(self.fc1(x)))
        return scores

def test_TwoLayerFC():
    input_size = 50
    x = torch.zeros((64, input_size), dtype=torch.float32)  # minibatch size 64, feature dimension 50
    model = TwoLayerFC(input_size, 42, 10)
    scores = model(x)
    print(scores.size())  # you should see [64, 10]
    
test_TwoLayerFC()
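
To see the parameter tracking in action, we can ask a module for its registered parameters. The sketch below defines a small stand-alone module (the name `TinyFC` is made up for this illustration) and lists what `named_parameters()` returns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFC(nn.Module):                      # hypothetical name, for illustration
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x.flatten(start_dim=1))))

model = TinyFC(50, 42, 10)

# Every nn.Linear registers a weight and a bias with the parent module.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
num_params = sum(p.numel() for p in model.parameters())
print(shapes, num_params)
```

This is the same collection that an optimizer receives when we pass it `model.parameters()`.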

### PyTorch Sequential API

In case we want even more convenience in defining a model than with the `nn.Module` API, we can turn to the
`nn.Sequential` API. It combines the steps of defining the layers and defining their connectivity in the forward pass. 
The trade-off is that we are limited to feed-forward architectures and cannot define more complex network structures. 
Everything about initialization, training and optimization remains the same as for the modular approach. 

In [None]:
# We need to wrap `flatten` function in a module in order to stack it
# in nn.Sequential
class Flatten(nn.Module):
    def forward(self, x):
        n = x.shape[0]
        return x.view(n, -1)

hidden_layer_size = 42

model = nn.Sequential(
    Flatten(),
    nn.Linear(50, hidden_layer_size),
    nn.ReLU(),
    nn.Linear(hidden_layer_size, 10))

def test_model(model):
    input_size = 50
    x = torch.zeros((64, input_size), dtype=torch.float32)  # minibatch size 64, feature dimension 50
    scores = model(x)
    print(scores.size())  # you should see [64, 10]
    
test_model(model)
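
Recent PyTorch versions also ship a built-in `nn.Flatten` module, so the custom wrapper is not strictly necessary. A sketch of the same model using it, assuming inputs with 50 values per sample:

```python
import torch
import torch.nn as nn

model_builtin = nn.Sequential(
    nn.Flatten(),            # flattens all dimensions except the batch dimension
    nn.Linear(50, 42),
    nn.ReLU(),
    nn.Linear(42, 10))

x = torch.zeros((64, 5, 10))      # any input with 50 values per sample works
scores = model_builtin(x)
print(scores.shape)
```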


## MNIST Classifier

Now we want to build and train a <b>Convolutional Network</b> on the [MNIST](https://en.wikipedia.org/wiki/MNIST_database)
dataset. We will be using a three-layer architecture with dilation, batch normalization and ReLU activations, and one fully
connected classification layer at the end. To see what convolutions with different parameter settings look like, have
a look at [this](https://github.com/vdumoulin/conv_arithmetic/blob/master/README.md) visualization. For more information
on dilation, have a look at [this](https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5)
article.

We can make use of the `nn.Sequential` API to group blocks of conv-batchnorm-relu layers together. The last convolution
has `kernel_size=1` (known as a 1x1 convolution). It reduces the number of channels (here from 32 to 1) by taking a weighted
sum of all input channels at every pixel, while leaving the spatial size unchanged. 
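
The shapes can be checked with a quick sketch. For a convolution, the output size is $\lfloor (n + 2p - d(k-1) - 1)/s \rfloor + 1$, so with $n=28$, $k=3$, $d=2$, $p=2$, $s=1$ we get 28 again. The batch size and channel counts below are assumptions chosen to match the layers described here.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 32, 28, 28)                    # batch of 8 feature maps (assumed sizes)

dilated = nn.Conv2d(32, 32, kernel_size=3, padding=2, dilation=2)
pointwise = nn.Conv2d(32, 1, kernel_size=1)       # 1x1 convolution

dilated_shape = tuple(dilated(x).shape)           # padding=2 offsets dilation=2
y = pointwise(x)                                  # channels: 32 -> 1

# A 1x1 convolution is a per-pixel weighted sum over the input channels.
w = pointwise.weight.view(1, 32, 1, 1)
manual = (x * w).sum(dim=1, keepdim=True) + pointwise.bias.view(1, 1, 1, 1)
print(dilated_shape, tuple(y.shape), torch.allclose(y, manual, atol=1e-4))
```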

Batch normalization is a way of improving the overall performance and stability of a network by normalizing inputs 
between layers. For a detailed explanation, have a look at [this](https://towardsdatascience.com/batch-normalization-in-neural-networks-1ac91516821c)
article. The `momentum` parameter controls how strongly each new batch updates the running mean and variance, which are used for normalization in evaluation mode. 
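
A small sketch of how `momentum` enters the running statistics. In PyTorch the update is `running = (1 - momentum) * running + momentum * batch_stat`; the input sizes and the shift of 5 are arbitrary choices for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3, momentum=0.1)       # running_mean starts at 0
x = torch.randn(16, 3, 8, 8) + 5.0         # inputs with mean around 5

bn.train()
out_train = bn(x)                           # normalizes with the batch statistics
batch_mean = x.mean(dim=(0, 2, 3))

# PyTorch update rule: running = (1 - momentum) * running + momentum * batch_stat
expected_running_mean = 0.9 * torch.zeros(3) + 0.1 * batch_mean

bn.eval()
out_eval = bn(x)                            # now uses the running statistics instead
print(bn.running_mean, expected_running_mean)
```

After a single batch the running mean has moved only 10% of the way towards the batch mean, so the evaluation-mode output is not yet centered around zero. This is one reason why calling `eval()` before testing matters.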

In [None]:
class ConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(16, momentum=0.1),
            nn.ReLU())
        self.conv2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=2, dilation=2),
            nn.BatchNorm2d(32, momentum=0.1),
            nn.ReLU())
        self.conv3 = nn.Sequential(
            nn.Conv2d(32, 1, kernel_size=1, stride=1, padding=0),
            nn.BatchNorm2d(1, momentum=0.1),
            nn.ReLU())
        self.fc = nn.Linear(28*28, num_classes)
        
    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = x.reshape(x.size(0), -1)
        x = self.fc(x)
        return x

For this example we are using the PyTorch `DataLoader`. It is a convenient solution for batch training, and `torchvision.datasets` provides easy
access to popular datasets, such as the MNIST dataset. After loading the dataset with parameters like `train` and an optional
transformation (normalization in this case), we wrap it in a `DataLoader` with the desired `batch_size` and use the loader as an iterable in
the training loop. 

We split the MNIST dataset into a portion for training and one for testing.

In [None]:
# Load dataset
tr = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]) # Normalization for MNIST

training_data = torchvision.datasets.MNIST(root='./data', train=True,
                                           download=True, transform=tr)
training_loader = torch.utils.data.DataLoader(training_data, batch_size=128,
                                              shuffle=True, num_workers=1)

test_data = torchvision.datasets.MNIST(root='./data', train=False,
                                       download=True, transform=tr)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=128,
                                          shuffle=True, num_workers=1)
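
To get a feeling for what a `DataLoader` yields without downloading anything, here is a sketch that wraps random tensors in a `TensorDataset`. The data is synthetic and only mimics the shape of MNIST images.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for MNIST: 300 grayscale "images" of size 28x28 (assumed data).
images = torch.randn(300, 1, 28, 28)
labels = torch.randint(0, 10, (300,))

dataset = TensorDataset(images, labels)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

batch_shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(len(dataset), len(loader), batch_shapes)
```

Note that the last batch is smaller (44 samples) because 300 is not a multiple of 128; passing `drop_last=True` would discard it.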

In [None]:
# The function parameters() is implemented in nn.Module
net = ConvNet()
if cuda:
    net = net.cuda()

In [None]:
cross_entropy = nn.CrossEntropyLoss() # instantiate loss 
optimizer = optim.Adam(net.parameters(), lr=1e-3, weight_decay=5e-4) # instantiate optimizer

In [None]:
epochs = 3
history = []

for i in range(0, epochs):
    for j,(inputs, labels) in enumerate(training_loader):
        if cuda:
            inputs = inputs.cuda()
            labels = labels.cuda()
        
        # forward pass
        outputs = net(inputs)
        
        # training loss
        loss = cross_entropy(outputs, labels)
        
        # record the loss of this batch
        history.append(loss.item())
        
        # zero the parameter gradients
        optimizer.zero_grad()
        
        # backward pass
        loss.backward()
        optimizer.step()

        if (j + 1) % 100 == 0:
            print(f"epoch: {i+1:2} batch: {j+1:4} loss: {history[-1]:3.4}")

In [None]:
plt.plot(history);
plt.title("Training Loss Plot")
plt.xlabel("Training Steps")
plt.ylabel("Cross-Entropy Loss")
plt.show()

In [None]:
# Set model to evaluation mode 
# (important for batchnorm/dropout)
net.eval()

correct = 0

for inputs, labels in test_loader:
    with torch.no_grad():
        if cuda:
            inputs = inputs.cuda()
            labels = labels.cuda()
        
        outputs = net(inputs)
        predicted_class = outputs.max(dim=1)[1]
        
        correct += (predicted_class == labels).float().sum().item() 
        
accuracy = correct / len(test_data)

print("Test Accuracy: ", accuracy)
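
The accuracy computation can be isolated into a small helper. The function name `accuracy` and the toy logits are made up for this sketch; `argmax(dim=1)` returns the same indices as `max(dim=1)[1]` used above.

```python
import torch

def accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    predicted = logits.argmax(dim=1)        # same indices as logits.max(dim=1)[1]
    return (predicted == labels).float().mean().item()

logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.2],
                       [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 2, 0, 0])

acc = accuracy(logits, labels)              # 3 of 4 predictions are correct
print(acc)
```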

In [None]:
# Save model to disk

# torch.save(net.state_dict(), "net")

In [None]:
# Load model

# net = ConvNet()
# net.load_state_dict(torch.load("net"))
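
A runnable sketch of the save/load round trip, using a tiny `nn.Linear` model and an in-memory buffer instead of a file on disk (both are stand-ins for the network and the path used above):

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

buffer = io.BytesIO()                       # stands in for a file path such as "net"
torch.save(model.state_dict(), buffer)      # store only the parameters, not the class
buffer.seek(0)

restored = nn.Linear(4, 2)                  # the architecture is re-created in code
restored.load_state_dict(torch.load(buffer))
restored.eval()                             # switch to evaluation mode before inference

x = torch.randn(3, 4)
same_output = torch.equal(model(x), restored(x))
print(same_output)
```

Saving the `state_dict` rather than the whole model object keeps the file independent of how the class is defined or imported.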

## Further Reading and Exercises

For more reading on PyTorch, there is [this](https://github.com/jcjohnson/pytorch-examples) comprehensive PyTorch 
Introduction by Justin Johnson.
Also, feel free to check out the rest of Tobias Uelwer's [PyTorch Tutorial](https://github.com/tuelwer/pytorch-tutorial).

[This](https://github.com/pytorch/examples) is a collection of architectures and ML models implemented in PyTorch. Check
it out if you want to take a look at some finished PyTorch example code.  