# Pytorch

Scikit-learn is a popular machine learning library, but other popular libraries in this domain should not be ignored. While scikit-learn focuses on traditional machine learning algorithms and is characterized by its uniform syntax for preprocessing, model selection, model training and model evaluation, *pytorch* plays a more important role in the field of deep learning, especially for models which are built as neural network architectures.

Pytorch provides tools for building and training complex neural network models, particularly those that benefit from GPU acceleration. Thus, it is suitable for applications in, e.g., computer vision, natural language processing, and reinforcement learning. In comparison to scikit-learn, it can be considered as a low-level library that provides more control and flexibility over model building and training. At the same time, it requires more code to set up and train models compared to scikit-learn, but this allows for greater customization.

Its core blocks are:

* Tensors: Fundamental data structure, similar to NumPy arrays but with GPU support.
* Autograd: Automatic differentiation for building and training neural networks.
* NN Module: High-level neural network API for constructing deep learning models.
* Optim: Optimization algorithms (e.g., SGD, Adam) for training models.
* Dynamic Computational Graphs: Graphs are built on-the-fly, allowing for flexible model design.

Using the sub-modules of the NN module, neural networks can be defined manually in highly customized ways. Such complex models are usually trained by numerical optimization based on gradient information, which in turn requires differentiation. Automatic differentiation combined with the availability of different optimization algorithms is one of the main reasons for the popularity of pytorch. Furthermore, pytorch is open-source and has a large and usually helpful community.

Note that *tensorflow* is a comparable alternative to pytorch and offers more or less the same functionality. Because models are defined in a more pythonic way with pytorch and because it is widely used in research, we are going to stick to pytorch. So let us learn about the key concepts of pytorch.

## Tensors

Tensors are very similar to numpy arrays. However, they can run on the GPU and are optimized for automatic differentiation. Tensors can be created from data or from a numpy array (many other ways to create tensors exist).

In [1]:
import torch
import numpy as np


x = [[1, 2], [3, 4]]
x_tensor = torch.tensor(x)
x_array = np.array(x)
x_tensor_from_np = torch.from_numpy(x_array)

x_tensor == x_tensor_from_np

tensor([[True, True],
        [True, True]])
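
A few of the other ways to create tensors can be sketched as follows. The example also illustrates one practical difference between the two constructors used above: `torch.tensor` copies its input, while `torch.from_numpy` shares memory with the numpy array. All variable names are chosen for illustration only.

```python
import numpy as np
import torch

# Factory functions create tensors of a given shape directly
zeros = torch.zeros(2, 3)            # 2x3 tensor filled with 0.0
steps = torch.arange(0, 6)           # integers 0, 1, ..., 5
uniform = torch.rand(2, 2)           # uniform random numbers in [0, 1)

# torch.tensor copies the data, torch.from_numpy shares memory with the array
source = np.array([1, 2, 3])
copied = torch.tensor(source)
shared = torch.from_numpy(source)

source[0] = 100                      # modify the numpy array in place
print(copied)                        # unaffected by the change
print(shared)                        # reflects the change
```

If independent data is needed, `torch.tensor` (or calling `clone` on the shared tensor) is the safer choice.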

Attributes of tensors can be accessed as for numpy arrays, e.g., the *shape* or *dtype* attribute. One difference is the *device* attribute, which tells us whether the tensor lives on the GPU or the CPU. By default, it is the CPU. If a GPU is available, the tensor can be transferred to it with the *to* method. Depending on the architecture, the device name usually is "cuda" (e.g., for NVIDIA GPUs) or "mps" for MacBooks with M chips.

In [2]:
print(f"Device at start: {x_tensor.device}")

if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

x_tensor = x_tensor.to(device)

print(f"Device after shift to GPU: {x_tensor.device}")
print(f"The shape of the tensor is: {x_tensor.shape}")
print(f"The data type of the tensor is: {x_tensor.dtype}")

Device at start: cpu
Device after shift to GPU: mps:0
The shape of the tensor is: torch.Size([2, 2])
The data type of the tensor is: torch.int64
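
When a model is involved, its parameters and its inputs have to live on the same device. The following is a minimal sketch under that assumption, reusing the device selection from above. The layer and the variable names are chosen for illustration. Note that `to` returns a new tensor for tensors, which is why the result is assigned, and that a tensor must be moved back to the CPU before it can be converted to a numpy array.

```python
import torch
from torch import nn

# Same selection logic as above: prefer an accelerator, fall back to the CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

layer = nn.Linear(in_features=2, out_features=3).to(device)   # moves the parameters
inputs = torch.tensor([[1., 2.], [3., 4.]]).to(device)        # moves the data

outputs = layer(inputs)
print(outputs.device)

# Move the result back to the CPU before converting it to numpy
outputs_np = outputs.detach().cpu().numpy()
print(outputs_np.shape)
```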


Tensors come along with a large number of operations which are mostly similar to operations which we know from numpy arrays. An overview can be found [here](https://pytorch.org/docs/stable/torch.html).
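
A small sketch of some of these operations, with the numpy counterparts named in the comments. The tensors `a` and `b` are arbitrary illustration values.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[0., 1.], [1., 0.]])

elementwise = a * b                          # element-wise product, as in numpy
matrix_product = a @ b                       # matrix multiplication, as in numpy
column_sums = a.sum(dim=0)                   # `dim` plays the role of numpy's `axis`
reshaped = a.reshape(4)                      # flatten to a vector of length 4
broadcast = a + torch.tensor([10., 20.])     # the vector is broadcast over the rows

print(elementwise)
print(matrix_product)
print(column_sums)
print(reshaped)
print(broadcast)
```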

## The neural network module

Pytorch includes a neural network submodule which comes with pre-defined building blocks for neural networks which can be found [here](https://pytorch.org/docs/stable/nn.html#linear-layers). For instance, the *Linear* class defines an affine transformation of the form:

$$
f \left( \mathbf{x} \right) = \mathbf{x}^T W + b
$$

where $\mathbf{x}$ is the input, given either by observable feature variables or by hidden neurons from a previous layer, and $W, b$ are parameters. An instance of the *Linear* class initializes the parameters randomly. Note that pytorch stores the weight with shape (out_features, in_features), which is why it is transposed in the manual calculation. See the equivalence of the class forward method and the manual processing below.

In [3]:
from torch import nn

x_tensor = torch.tensor(x)
x_tensor = x_tensor.to(torch.float32)
linear = nn.Linear(in_features=2, out_features=3)
linear(x_tensor)

tensor([[1.4774, 1.3887, 0.5240],
        [2.7257, 3.2935, 1.6014]], grad_fn=<AddmmBackward0>)

In [4]:
x_tensor @ linear.weight.transpose(0, 1) + linear.bias

tensor([[1.4774, 1.3887, 0.5240],
        [2.7257, 3.2935, 1.6014]], grad_fn=<AddBackward0>)

In [5]:
linear.weight

Parameter containing:
tensor([[0.0110, 0.6131],
        [0.3060, 0.6464],
        [0.4842, 0.0545]], requires_grad=True)

In [6]:
linear.bias

Parameter containing:
tensor([ 0.2402, -0.2102, -0.0691], requires_grad=True)

Neural networks consist of layers which are themselves compositions of different functions. For instance, a fully connected layer composes the affine transformation from above with an activation function $g$. Activation functions can be chosen as desired and an overview of pytorch implementations can be found [here](https://pytorch.org/docs/stable/nn.html#non-linear-activations-weighted-sum-nonlinearity). Using $l = 1, \dots, L$ to index the layers, a neural network can be seen as a nested composition of functions:

$$
g^{(L)} \left( f^{(L)} \left( g^{(L-1)} \left( f^{(L-1)} \left( ... g^{(2)} \left( f^{(2)} \left( g^{(1)} \left( f^{(1)} \left( \mathbf{x} \right) \right) \right) \right)  \right) \right) \right) \right) 
$$

This structure can easily be defined using pytorch's nn.Module class. The layers are defined within the *\_\_init\_\_* method of the class and a *forward* method defines how input is processed through the network. *nn.Sequential* is a useful class which creates a container that applies all operations of the network in order. First, let us take a look at a simple example which defines a feedforward neural network for a regression task with one hidden layer and a ReLU activation function. The input dimensionality (number of input features) and the number of hidden neurons must be set at initialization.

In [7]:
import torch
from torch import nn

class RegressionNetwork(nn.Module):

    def __init__(self, input_dimension, hidden_dimension):
        super(RegressionNetwork, self).__init__()

        self.input_dimension = input_dimension
        self.hidden_dimension = hidden_dimension

        self.layer_stack = nn.Sequential(
            nn.Linear(in_features=self.input_dimension, out_features=self.hidden_dimension),
            nn.ReLU(),
            nn.Linear(in_features=self.hidden_dimension, out_features=1)
        )

    def forward(self, x):
        output = self.layer_stack(x)
        return output
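
To connect the container with the nested composition written out above, the following sketch builds the same three operations by hand and checks that `nn.Sequential` applies them in order. The dimensions (8 inputs, 10 hidden neurons) and the random batch are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

f1 = nn.Linear(in_features=8, out_features=10)   # affine map of layer 1
g1 = nn.ReLU()                                   # activation of layer 1
f2 = nn.Linear(in_features=10, out_features=1)   # affine map of the output layer

stack = nn.Sequential(f1, g1, f2)

x_batch = torch.randn(5, 8)                      # 5 observations with 8 features
manual = f2(g1(f1(x_batch)))                     # explicit composition
stacked = stack(x_batch)                         # same operations via the container

print(torch.allclose(manual, stacked))

# Number of trainable parameters: 8*10 + 10 (layer 1) + 10*1 + 1 (output layer)
n_params = sum(p.numel() for p in stack.parameters())
print(n_params)
```

The output has one column per output neuron, so a batch of 5 observations yields a tensor of shape (5, 1).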

To keep it a little more general, we further implement another forward network class which can be used for regression and classification tasks.

In [8]:
class NeuralNetwork(nn.Module):
    def __init__(self, input_dimension, hidden_dimension, output_dimension, output_activation = "identity"):
        # Initialize nn.Module first so that sub-modules can be registered afterwards
        super(NeuralNetwork, self).__init__()

        # Validate the arguments before any layer is built
        if output_activation not in ("identity", "binary", "multi"):
            raise ValueError("Make sure that output activation is one of: identity, binary, multi")

        if (output_activation in ("identity", "binary")) and (output_dimension > 1):
            raise ValueError("If output activation is identity or binary, the output_dimension argument must be set to 1.")

        self.input_dimension = input_dimension
        self.hidden_dimension = hidden_dimension
        self.output_dimension = output_dimension
        self.output_activation = output_activation

        # Output function: identity for regression, sigmoid or softmax for classification
        if self.output_activation == "identity":
            self.output_function = nn.Identity()
        elif self.output_activation == "binary":
            self.output_function = nn.Sigmoid()
        elif self.output_activation == "multi":
            self.output_function = nn.Softmax(dim = 1)

        self.layer_stack = nn.Sequential(
            nn.Linear(in_features=self.input_dimension, out_features=self.hidden_dimension),
            nn.ReLU(),
            nn.Linear(in_features=self.hidden_dimension, out_features=self.output_dimension),
            self.output_function
        )

    def forward(self, x):
        output = self.layer_stack(x)
        return output
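
The effect of the output functions used for classification can be sketched in isolation. The raw network outputs below are random numbers standing in for the result of the last linear layer (an assumption made for illustration). The sigmoid maps a single output into the interval (0, 1), while the softmax turns each row into a vector of probabilities which sum to one.

```python
import torch
from torch import nn

torch.manual_seed(0)
raw_outputs = torch.randn(4, 3)                     # 4 observations, 3 raw outputs each

probs_multi = nn.Softmax(dim=1)(raw_outputs)        # normalize across the 3 classes per row
probs_binary = nn.Sigmoid()(raw_outputs[:, :1])     # one probability per observation

print(probs_multi.sum(dim=1))                       # each row sums to one
print(probs_binary.shape)
```

With `dim=1` the normalization runs over the columns of each row, which is the appropriate choice for inputs of shape (batch size, number of classes).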

## The importance of gradient calculation and automatic differentiation

Neural networks are usually trained with numerical optimization routines which use gradient information. A network is calibrated by adjusting its parameters. This is done by utilizing feature realizations to generate predictions under the current parameter setting and evaluating how well these predictions agree with the actual target realizations. This evaluation is done by a loss function, which is lower the better the prediction agrees with the target observation. Let $F$ denote the neural network and $\Theta$ its parameters, then a prediction $\hat{y}$ for input $\boldsymbol{x}$ is generated by 

$$
\hat{y} = F_{\Theta}\left( \boldsymbol{x} \right)
$$

The loss function receives the true observation and the predicted value. Its value can only be changed by adjusting the parameters of the network. 

$$
L\left( \Theta \right) = \sum_i l\left(y_i, \hat{y}_i\right)
$$

The gradient $\nabla_L$ of $L\left( \Theta \right)$ contains all partial derivatives of $L$ with respect to every parameter in the network. For every partial derivative, a positive value indicates that the loss increases if the parameter is raised by a small amount, while a negative value indicates that the loss decreases. As a decrease of the loss function is desired, one decreases the parameter value if the partial derivative is positive and increases its value if the partial derivative is negative. This rule can be summarized by:

$$
\Theta \leftarrow \Theta - \eta \nabla_L
$$

where $\eta$ is the learning rate which controls the size of the parameter change. Note that the loss is a sum of individual loss values for each observation. Determining the gradient of this sum is straightforward, as the derivative of a sum is the sum of the derivatives. This makes $\nabla_L$ a measure of how a parameter change (positive or negative) alters predictive quality on average over these observations. Consequently, for a sufficiently small learning rate, the parameter update improves predictions on average, but not necessarily all of them. Furthermore, the gradient update is usually done by using only a subset of the data at each iteration. This is called mini-batch gradient descent and balances the advantages and disadvantages of full (batch) gradient descent (using all data points for an update) and stochastic gradient descent (using only a single data point for an update).
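
That the gradient of the summed loss equals the sum of the per-observation gradients can be checked numerically. The sketch below assumes a one-parameter model $\hat{y} = \theta x$ with squared error loss and three made-up observations. It relies on pytorch's automatic differentiation through the *backward* method and the *grad* attribute.

```python
import torch

theta = torch.tensor([0.5], requires_grad=True)
x = torch.tensor([1., 2., 3.])
y = torch.tensor([2., 4., 6.])

# Gradient of the summed loss over all observations
total_loss = ((y - theta * x) ** 2).sum()
total_loss.backward()
grad_of_sum = theta.grad.clone()
theta.grad.zero_()                      # reset, gradients accumulate otherwise

# Sum of the gradients of the individual losses
sum_of_grads = torch.zeros(1)
for x_i, y_i in zip(x, y):
    loss_i = (y_i - theta * x_i) ** 2
    loss_i.backward()
    sum_of_grads += theta.grad
    theta.grad.zero_()

print(grad_of_sum, sum_of_grads)
```

Both values agree with the analytical result $-2 \sum_i x_i (y_i - \theta x_i) = -42$ for these numbers.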

Overall, it is important to determine derivatives quickly and accurately for arbitrary functions and their compositions. Pytorch records the operations applied to a tensor during the forward calculation and uses them to determine derivatives automatically. This can be seen in a simple example below where we determine the derivative $\partial f / \partial x$ of the function $f(x)=x^2$. If we need the derivative, the *requires_grad* argument must be set to True for the tensor. The *backward* method determines the gradient automatically and the value of the gradient is stored in the *grad* attribute.

In [9]:
x = torch.tensor([2.], requires_grad=True)
y = x**2
y.backward()
grad = x.grad
print(f"The derivative of the function with respect to x at a value of: {x.detach().numpy()} is equal to {x.grad.numpy()}")

The derivative of the function with respect to x at a value of: [2.] is equal to [4.]
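
The same mechanism handles compositions of functions through the chain rule. As a sketch, take the composed function $f(x) = \sin(x^2)$ (chosen here purely for illustration), whose derivative is $2x\cos(x^2)$, and compare the automatic result with the analytical one.

```python
import torch

x = torch.tensor([1.5], requires_grad=True)
y = torch.sin(x ** 2)          # composition of the square and the sine function
y.backward()                   # chain rule applied automatically

# Analytical derivative for comparison: 2 * x * cos(x^2)
analytic = 2 * x.detach() * torch.cos(x.detach() ** 2)

print(x.grad, analytic)
print(torch.allclose(x.grad, analytic))
```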


To demonstrate how we can use the gradient information, let us solve a simple example. The model is given by $f_{\theta}(x) = \theta x$, thus it only depends on $\theta$. For a data point $(x, y) = (2, 3)$, we want to find the $\theta$ which minimizes $L(\theta) = \left(y - \theta x \right)^2$. We start with an arbitrary value of $\theta$ and repeat the gradient descent rule:

* determine $ \nabla_L = \frac{\partial L}{\partial \theta}$
* update the current value of theta with: $\theta \leftarrow \theta - \eta \nabla_L$

In [10]:
import torch

eta = 0.1
theta = torch.randn((1,), requires_grad=True)
x = torch.tensor([2.])
y = torch.tensor([3.])

for _ in range(10):
    # Forward pass
    y_hat = theta * x
    loss = (y - y_hat)**2

    # Print current values
    print(f"Current value of theta: {theta.detach().numpy()}")
    print(f"Loss value: {loss.detach().numpy()}")

    # Backward pass
    loss.backward()
    grad = theta.grad

    # Print gradient
    print(f"Gradient value: {grad.detach().numpy()}")

    # Update the parameter inside no_grad so that the update itself is not tracked by autograd
    with torch.no_grad():
        theta -= eta * grad

    # Zero gradients (empty gradient to avoid accumulation over iterations)
    theta.grad.zero_()

Current value of theta: [0.09648111]
Loss value: [7.8794613]
Gradient value: [-11.228151]
Current value of theta: [1.2192962]
Loss value: [0.31517845]
Gradient value: [-2.2456303]
Current value of theta: [1.4438592]
Loss value: [0.01260715]
Gradient value: [-0.44912624]
Current value of theta: [1.4887718]
Loss value: [0.00050429]
Gradient value: [-0.08982563]
Current value of theta: [1.4977543]
Loss value: [2.0172038e-05]
Gradient value: [-0.01796532]
Current value of theta: [1.4995508]
Loss value: [8.0705286e-07]
Gradient value: [-0.00359344]
Current value of theta: [1.4999101]
Loss value: [3.2316393e-08]
Gradient value: [-0.00071907]
Current value of theta: [1.499982]
Loss value: [1.2960868e-09]
Gradient value: [-0.000144]
Current value of theta: [1.4999964]
Loss value: [5.1159077e-11]
Gradient value: [-2.861023e-05]
Current value of theta: [1.4999993]
Loss value: [2.046363e-12]
Gradient value: [-5.722046e-06]
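
The printed values can be checked by hand. The following short derivation uses only the quantities defined above ($x = 2$, $y = 3$, $\eta = 0.1$) and writes $\theta^{*} = y / x = 1.5$ for the minimizer, a symbol introduced here for illustration. The partial derivative of the loss is

$$
\frac{\partial L}{\partial \theta} = -2x\left(y - \theta x\right)
$$

so that one update changes the distance to the minimizer as follows:

$$
\theta_{\text{new}} - \theta^{*} = \theta - \theta^{*} + 2 \eta x \left(y - \theta x\right) = \left(1 - 2 \eta x^2\right)\left(\theta - \theta^{*}\right) = 0.2 \left(\theta - \theta^{*}\right)
$$

For the first iteration, the gradient is $-4 \cdot (3 - 2 \cdot 0.0965) \approx -11.23$, and the distance to $1.5$ shrinks from about $1.4035$ to about $0.2807$, in line with the output. Since the distance shrinks by a factor of $0.2$ per step, the loss shrinks by a factor of $0.2^2 = 0.04$ per step, which also matches the printed loss values.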


To see how such procedures are usually handled with pytorch, we reproduce the example below with the usage of a linear layer, a pre-defined loss function and an instance of the stochastic gradient descent optimizer that takes care of parameter adjustments by gradient descent.

In [11]:
from torch.optim import SGD

eta = 0.1
f_theta = nn.Linear(in_features=1, out_features=1, bias = False)

x = torch.tensor([2.])
y = torch.tensor([3.])

optimizer = SGD(f_theta.parameters(), lr = eta)
loss_fun = nn.MSELoss()

for _ in range(10):
    y_hat = f_theta(x)
    loss = loss_fun(y_hat, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Theta after gradient update: {f_theta.weight.detach().numpy().flatten()[0]}")

Theta after gradient update: 1.2191189527511597
Theta after gradient update: 1.4438238143920898
Theta after gradient update: 1.488764762878418
Theta after gradient update: 1.4977529048919678
Theta after gradient update: 1.4995505809783936
Theta after gradient update: 1.4999101161956787
Theta after gradient update: 1.4999819993972778
Theta after gradient update: 1.4999964237213135
Theta after gradient update: 1.4999992847442627
Theta after gradient update: 1.4999998807907104
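
What the optimizer does in a single step can be made explicit. The sketch below assumes plain SGD without momentum or weight decay (the defaults) and checks that one call of *step* reproduces the manual rule $\theta \leftarrow \theta - \eta \nabla_L$. The variable names are chosen for illustration.

```python
import torch
from torch import nn
from torch.optim import SGD

torch.manual_seed(0)
eta = 0.1
layer = nn.Linear(in_features=1, out_features=1, bias=False)
optimizer = SGD(layer.parameters(), lr=eta)

x = torch.tensor([2.])
y = torch.tensor([3.])

theta_before = layer.weight.detach().clone()     # parameter value before the update
loss = nn.MSELoss()(layer(x), y)
loss.backward()
grad = layer.weight.grad.detach().clone()        # gradient used by the optimizer

optimizer.step()                                 # update performed by the optimizer
expected = theta_before - eta * grad             # manual gradient descent rule

print(torch.allclose(layer.weight.detach(), expected))
```

Other optimizers such as Adam use the same *step* and *zero_grad* interface but apply a different update rule.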


## Data processing with pytorch

Usually, data is split at least into training and test data. For a neural network, parameters are usually updated after receiving gradient information for a batch of the overall training data set. This can be handled by combining the Dataset and DataLoader classes of pytorch. Ready-made datasets are available through pytorch's domain libraries (e.g., torchvision), and the Dataset class can be subclassed to define your own dataset. A self-defined dataset must include a method for initialization (*\_\_init__*), a method to determine the size of the data set (*\_\_len__*) and a method to retrieve an observation at a given index (*\_\_getitem__*). Below we first define a custom dataset class which works for any numeric pandas dataframe whose last column holds the target. We retrieve the California Housing data set from sklearn, split it, standardize the features and pass the result to our dataset class.

In [12]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
from torch.utils.data import Dataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

class CaliforniaDataset(Dataset):
    def __init__(self, df):
        self.df = df
    
    def __len__(self):
        return self.df.shape[0]
    
    def __getitem__(self, idx):
        row = torch.from_numpy(self.df.iloc[idx, :].values)
        features, target = row[:-1], row[-1:]
        features = features.type(torch.float32)
        target = target.type(torch.float32)
        return features, target

cf_housing = fetch_california_housing()
cf_df = pd.DataFrame(cf_housing.data, columns = cf_housing.feature_names)
cf_df.loc[:, cf_housing.target_names] = cf_housing.target

training_data_df, test_data_df = train_test_split(cf_df, train_size=0.7, shuffle=True, random_state=42)
scaler = StandardScaler()
X_train_df, y_train_df = training_data_df.drop(["MedHouseVal"], axis = 1), training_data_df.loc[:, ["MedHouseVal"]]
X_test_df, y_test_df = test_data_df.drop(["MedHouseVal"], axis = 1), test_data_df.loc[:, ["MedHouseVal"]]
X_train_df_s = pd.DataFrame(scaler.fit_transform(X_train_df), index = training_data_df.index, columns = training_data_df.columns[:-1])
X_test_df_s = pd.DataFrame(scaler.transform(X_test_df), index = test_data_df.index, columns = test_data_df.columns[:-1])
train_df_s = pd.concat((X_train_df_s, y_train_df), axis = 1)
test_df_s = pd.concat((X_test_df_s, y_test_df), axis = 1)

training_data, test_data = CaliforniaDataset(train_df_s), CaliforniaDataset(test_df_s)
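
The three required methods can also be seen in a stripped-down sketch that does not depend on pandas. The class name `ArrayDataset` and the toy values are hypothetical and serve only as an illustration. The data is converted to tensors once at initialization, so that retrieving an item is a plain index operation.

```python
import torch
from torch.utils.data import Dataset

class ArrayDataset(Dataset):
    """Hypothetical minimal dataset wrapping a feature matrix and a target vector."""

    def __init__(self, features, targets):
        # Convert once; targets are stored as a column so each item has shape (1,)
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.targets = torch.as_tensor(targets, dtype=torch.float32).reshape(-1, 1)

    def __len__(self):
        # Number of observations
        return self.features.shape[0]

    def __getitem__(self, idx):
        # One observation: feature vector and target
        return self.features[idx], self.targets[idx]

toy_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
toy_targets = [1.0, 2.0, 3.0]
toy_dataset = ArrayDataset(toy_features, toy_targets)

print(len(toy_dataset))
first_features, first_target = toy_dataset[0]
print(first_features.shape, first_target.shape)
```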

Once the pytorch Dataset is created, it can be handled by the DataLoader. Usually, we set the batch size and whether the data is shuffled when creating the batches. The example demonstrates how to create DataLoader instances for training and test data. Training data comes in non-shuffled batches of size 64, while the test data is loaded as a single batch so that it can be evaluated at once later.

In [13]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(training_data, batch_size = 64, shuffle=False)
test_dataloader = DataLoader(test_data, batch_size = test_data_df.shape[0], shuffle=False)

for i, (train_features, train_labels) in enumerate(train_dataloader):
    print(f"Batch {i+1}")
    print(f"Feature batch size: {train_features.size()}")
    print(f"Target batch size: {train_labels.size()}")
    if i > 2:
        break

Batch 1
Feature batch size: torch.Size([64, 8])
Target batch size: torch.Size([64, 1])
Batch 2
Feature batch size: torch.Size([64, 8])
Target batch size: torch.Size([64, 1])
Batch 3
Feature batch size: torch.Size([64, 8])
Target batch size: torch.Size([64, 1])
Batch 4
Feature batch size: torch.Size([64, 8])
Target batch size: torch.Size([64, 1])
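
Two details of the DataLoader are worth a small sketch: the last batch is smaller when the data set size is not a multiple of the batch size, and shuffling can be made reproducible with a seeded generator. The example uses a made-up data set of 100 observations wrapped in pytorch's `TensorDataset`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

features = torch.arange(100, dtype=torch.float32).reshape(100, 1)
targets = 2 * features
dataset = TensorDataset(features, targets)

# 100 observations with batch size 64 give one full and one smaller batch
loader = DataLoader(dataset, batch_size=64, shuffle=False)
batch_sizes = [batch_features.shape[0] for batch_features, _ in loader]
print(len(loader), batch_sizes)

# drop_last=True discards the incomplete final batch
loader_dropped = DataLoader(dataset, batch_size=64, shuffle=False, drop_last=True)
print(len(loader_dropped))

# Shuffling with a seeded generator reorders the observations reproducibly
generator = torch.Generator().manual_seed(0)
shuffled_loader = DataLoader(dataset, batch_size=64, shuffle=True, generator=generator)
seen = torch.cat([batch_features for batch_features, _ in shuffled_loader]).flatten()
print(seen[:5])
```

Shuffling the training data in every epoch is a common choice for stochastic optimization, while the order is irrelevant for evaluating the test data.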


## Training a neural network

Now, let us put together all previous building blocks to train a neural network for the California Housing data set using pytorch. Below, we initialize the regression network with ten hidden neurons, a stochastic gradient descent optimizer and the mean squared error loss function. Over 20 epochs, we repeatedly retrieve batches of size 64 from the training data, determine the gradients and use this information to update the parameters. After all batches have been used, one training epoch is finished and we print the average loss over all batches. Next, within the *no_grad* context manager, we evaluate the loss for the full test data sample. The context manager disables gradient tracking, so no computational graph is built for the test predictions, and since no optimizer step is taken there, the test data does not influence the parameters.

In [14]:
from torch.optim import SGD
import torch
from torch import nn

regression_network = RegressionNetwork(input_dimension=8, hidden_dimension=10)
optimizer = SGD(regression_network.parameters(), lr = 0.01)
loss_fun = nn.MSELoss()

epochs = 20
for epoch in range(epochs):
    epoch_loss, num_batches = 0.0, 0
    for features, targets in train_dataloader:
        outputs = regression_network(features)
        loss = loss_fun(outputs, targets)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        epoch_loss += loss.item()
        num_batches += 1
    print(f"The average training batch loss for the epoch {epoch+1} is: {epoch_loss/num_batches:.4f}")

    with torch.no_grad():
        test_features, test_targets = next(iter(test_dataloader))
        test_outputs = regression_network(test_features)
        test_loss = loss_fun(test_outputs, test_targets)
        print(f"The test loss after epoch {epoch + 1} is: {test_loss.item():.4f}")

The average training batch loss for the epoch 1 is: 1.4277
The test loss after epoch 1 is: 0.6044


The average training batch loss for the epoch 2 is: 0.5843
The test loss after epoch 2 is: 0.5370


The average training batch loss for the epoch 3 is: 0.5227
The test loss after epoch 3 is: 0.4944


The average training batch loss for the epoch 4 is: 0.4896
The test loss after epoch 4 is: 0.4734


The average training batch loss for the epoch 5 is: 0.4732
The test loss after epoch 5 is: 0.4611


The average training batch loss for the epoch 6 is: 0.4636
The test loss after epoch 6 is: 0.4532


The average training batch loss for the epoch 7 is: 0.4566
The test loss after epoch 7 is: 0.4467


The average training batch loss for the epoch 8 is: 0.4504
The test loss after epoch 8 is: 0.4406


The average training batch loss for the epoch 9 is: 0.4447
The test loss after epoch 9 is: 0.4346


The average training batch loss for the epoch 10 is: 0.4398
The test loss after epoch 10 is: 0.4293


The average training batch loss for the epoch 11 is: 0.4351
The test loss after epoch 11 is: 0.4245


The average training batch loss for the epoch 12 is: 0.4301
The test loss after epoch 12 is: 0.4202


The average training batch loss for the epoch 13 is: 0.4259
The test loss after epoch 13 is: 0.4162


The average training batch loss for the epoch 14 is: 0.4218
The test loss after epoch 14 is: 0.4123


The average training batch loss for the epoch 15 is: 0.4179
The test loss after epoch 15 is: 0.4088


The average training batch loss for the epoch 16 is: 0.4143
The test loss after epoch 16 is: 0.4052


The average training batch loss for the epoch 17 is: 0.4109
The test loss after epoch 17 is: 0.4020


The average training batch loss for the epoch 18 is: 0.4078
The test loss after epoch 18 is: 0.3993


The average training batch loss for the epoch 19 is: 0.4052
The test loss after epoch 19 is: 0.3970


The average training batch loss for the epoch 20 is: 0.4029
The test loss after epoch 20 is: 0.3950
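
If the test data does not fit into a single batch, the evaluation can be done batch-wise instead. The following is a minimal sketch of such an evaluation with a hypothetical helper `evaluate`, shown on made-up data and a small stand-in network. It weights each batch loss by its number of observations, so that the result equals the loss over the full data set even if the last batch is smaller.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, dataloader, loss_fun):
    """Hypothetical helper: average loss per observation over all batches."""
    model.eval()                         # matters for layers such as dropout or batch norm
    total_loss, num_samples = 0.0, 0
    with torch.no_grad():                # no gradient tracking during evaluation
        for features, targets in dataloader:
            batch_loss = loss_fun(model(features), targets)
            total_loss += batch_loss.item() * features.shape[0]
            num_samples += features.shape[0]
    model.train()                        # assumes the model was in training mode before
    return total_loss / num_samples

torch.manual_seed(0)
toy_features = torch.randn(100, 8)
toy_targets = torch.randn(100, 1)
toy_loader = DataLoader(TensorDataset(toy_features, toy_targets), batch_size=32)
toy_model = nn.Sequential(nn.Linear(8, 10), nn.ReLU(), nn.Linear(10, 1))

batched_loss = evaluate(toy_model, toy_loader, nn.MSELoss())
with torch.no_grad():
    full_loss = nn.MSELoss()(toy_model(toy_features), toy_targets).item()

print(abs(batched_loss - full_loss) < 1e-5)
```

For a network consisting only of linear layers and ReLU activations, switching between `eval` and `train` has no effect on the output. It is included because it becomes relevant as soon as layers with different training and evaluation behavior are added.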
