# Introduction to Machine Learning

This is the Jupyter Notebook that contains code for the workshop "Introduction to Machine Learning" as part of the SECAI CeTI Summerschool 2023.

The notebook is intended to run in different environments: on Binder, on Google Colab, locally, and on high-performance computers with or without GPU support.

If you downloaded only this notebook and want to run it locally, please run the following code cell. It takes several minutes to download and install all required Python packages and to download the dataset. If you cloned the entire GitHub repository, it is recommended to first create a virtual environment and install all packages with the script that is part of the repository. In that case, you do not need to execute the first code cell and can start directly with the second code cell.

In [None]:
!git clone https://github.com/TUD-STKS/SECAI-Summer-School.git
%cd SECAI-Summer-School
!pip install --upgrade .[notebook]
!python -m medmnist download

## Some Python hints

Python is a high-level programming language. A broad variety of packages provides functionality that you can use in your own projects.

The following list gives an overview of some (not all) important Python packages:
- [numpy](https://numpy.org/): The fundamental package for scientific computing with Python
- [scipy](https://scipy.org/): Fundamental algorithms for scientific computing in Python
- [pandas](https://pandas.pydata.org/): Fast, powerful, flexible and easy to use open source data analysis and manipulation tool
- [scikit-learn](https://scikit-learn.org/): Machine Learning in Python
- [PyTorch](https://pytorch.org/): Machine learning framework based on the Torch library, originally developed by Meta AI
- [tensorflow](https://www.tensorflow.org/): Free and open-source software library for machine learning and artificial intelligence, developed by Google Brain.
- [seaborn](https://seaborn.pydata.org/): Statistical data visualization

You can import packages in several ways:

```python
import torch  # Import the entire package
import numpy as np  # Import the entire package and use an alias name
from sklearn.metrics import accuracy_score  # Import one part of a package
from sklearn.linear_model import RidgeClassifier, SGDClassifier  # Import multiple parts of a package
```

A Python function is a block of code that only runs when it is called. It receives data via arguments, which are bound to the parameters of the function. A function can return data as a result.

The following code block shows how to define a function with two arguments. The first argument ``a`` is mandatory, i.e., it needs to be passed to the function. The second argument ``b`` is a default parameter. If no value is passed during the function call, the default value is used.

The function itself adds ``a`` and ``b`` and returns the result.

Ideally, every function comes with a documentation comment (a docstring) that describes what the function does, which parameters it accepts, and which values it returns.

```python
def add_constant(a, b=1.1):
    """
    Add a constant to a number.

    Parameters
    ----------
    a : number
        The number to which a constant is added.
    b : number, default = 1.1
        The constant to be added with a default value.

    Returns
    -------
    c : number
        The sum of ``a`` and ``b``
    """
    c = a + b
    return c
```

The function can be called in different ways:
- Without keywords, the values are assigned by position: the first value is assigned to the first parameter, the second value to the second parameter, and so on.
- The result of the function call can be assigned to a new variable.

```python
result = add_constant(10)
print(result)
```

Alternatively, the argument that a value is assigned to can be used as a keyword. This is often easier to read.

```python
result = add_constant(a=10)
print(result)
```

The result of the function call can directly be used as the input to a new function, such as the ``print()`` function.

```python
print(add_constant(a=10))
print(add_constant(a=10, b=1))
```

Functions are a fundamental concept in Python. Hence, it is worth taking some time to get familiar with them.

In [None]:
from joblib import dump, load
from sklearn.metrics import accuracy_score

from tqdm import tqdm
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils as utils
import torchvision.transforms as transforms

from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV, PredefinedSplit
from sklearn.linear_model import SGDClassifier
from scipy.stats import loguniform
from secai.torch_models import LinearRegression, MultiLayerPerceptron, \
    ConvolutionalNeuralNetwork, LSTMModel, EarlyStopping

import medmnist
from medmnist import INFO

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In this code cell, we define a theme for the visualizations and check whether a GPU is available. If so, we use it.

Within the workshop, we use Google Colab, which has GPU support. Thus, the device that is printed under the code cell should not be ``device(type='cpu')``. Please ensure that this is the case; otherwise, talk to the presenter.

In [None]:
sns.set_theme(context="notebook")

DEVICE = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
DEVICE

## Getting started with a new Machine Learning task

We work with the 2D dataset called "PathMNIST".

The following cell gives us some first information about this dataset. We are dealing with an image dataset containing RGB image patches from hematoxylin & eosin stained histological images, obtained in different clinical centers.

Since we are dealing with RGB image patches, we have three different channels.

In total, there are nine different classes. Hence, we have a multi-class dataset, and each image is assigned to exactly one class.

The training and validation sets (NCT-CRC-HE-100K) together contain 100,000 patches, and the test set (CRC-VAL-HE-7K) contains 7,180 image patches from a different clinical center.

In [None]:
DATASET_INFO = INFO['pathmnist']
TASK = DATASET_INFO['task']
LABELS = DATASET_INFO['label']

N_CHANNELS = DATASET_INFO['n_channels']
N_CLASSES = len(DATASET_INFO['label'])

DATASET_INFO

## Loading the MedMNIST data

Since we want to dive deeper into the dataset, we apply only minimal preprocessing. We convert the RGB patches to a single grayscale channel (a weighted sum of the RGB channels), because this is easier to analyze and visualize. Furthermore, we return the data as `torch.Tensor`, so that we can easily analyze it further.
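
As a rough illustration of what a grayscale conversion does, the following sketch reproduces a weighted channel sum in NumPy. The weights are the ITU-R 601-2 luma coefficients used by PIL for its `"L"` mode; the helper name `rgb_to_gray` and the random image are assumptions made for this example only and are not part of the workshop code.

```python
import numpy as np


def rgb_to_gray(image):
    """Convert an (H, W, 3) RGB image with values in [0, 1] to (H, W)."""
    # ITU-R 601-2 luma weights; they sum to 1, so the value range is kept.
    weights = np.array([0.299, 0.587, 0.114])
    return image @ weights


rng = np.random.default_rng(seed=0)
rgb_patch = rng.random((28, 28, 3))  # hypothetical stand-in for one patch
gray_patch = rgb_to_gray(rgb_patch)
print(gray_patch.shape)
```

Because green receives the largest weight, the result generally differs slightly from the plain mean of the three channels.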

Note that we instantiate three different datasets for training, validation, and testing. This is something that we always need to keep in mind: always split training and test data, and make sure that no test data is used for training or parameter optimization.

The constant ``N_PIXELS`` is predefined and simplifies the feature extraction and model definition later. It is the size of the flattened input vector for the linear and for the MLP models.
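
The flattening step can be sketched in NumPy as follows. The batch of four images is a hypothetical placeholder; only the shapes matter here.

```python
import numpy as np

N_PIXELS = 28 * 28

# Hypothetical mini-batch of four single-channel images: (batch, channel, H, W)
batch = np.arange(4 * 1 * 28 * 28, dtype=float).reshape(4, 1, 28, 28)

# Flatten every image into one row vector of length N_PIXELS
flat = batch.reshape(-1, N_PIXELS)
print(flat.shape)

# The flattening is reversible as long as the image shape is known
restored = flat.reshape(-1, 1, 28, 28)
```

The `-1` lets NumPy infer the number of samples, so the same line works for any batch size.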

In [None]:
N_PIXELS = 28*28

# preprocessing
data_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.ToTensor(),
])

# load the data
DataClass = getattr(medmnist, DATASET_INFO['python_class'])
train_dataset = DataClass(split='train', transform=data_transform, as_rgb=True)
validation_dataset = DataClass(split='val', transform=data_transform,
                               as_rgb=True)
test_dataset = DataClass(split='test', transform=data_transform, as_rgb=True)

In [None]:
training_input = []
training_target = []
for data in tqdm(utils.data.DataLoader(dataset=train_dataset, shuffle=True)):
    training_input.append(data[0].numpy().reshape(-1, N_PIXELS))
    training_target.append(data[1].numpy().flatten())

training_df = pd.DataFrame(np.vstack(training_input),
                           columns=[f"Pixel {k+1}" for k in range(N_PIXELS)])
training_df["Target"] = [LABELS[str(d)] for d in np.hstack(training_target)]
training_df["Numeric target"] = np.hstack(training_target)

In [None]:
validation_input = []
validation_target = []
for data in tqdm(utils.data.DataLoader(dataset=validation_dataset,
                                       shuffle=True)):
    validation_input.append(data[0].numpy().reshape(-1, N_PIXELS))
    validation_target.append(data[1].numpy().flatten())

validation_df = pd.DataFrame(np.vstack(validation_input),
                             columns=[f"Pixel {k+1}" for k in range(N_PIXELS)])
validation_df["Target"] = [
    LABELS[str(d)] for d in np.hstack(validation_target)]
validation_df["Numeric target"] = np.hstack(validation_target)

In [None]:
test_input = []
test_target = []
for data in tqdm(utils.data.DataLoader(dataset=test_dataset, shuffle=True)):
    test_input.append(data[0].numpy().reshape(-1, N_PIXELS))
    test_target.append(data[1].numpy().flatten())

test_df = pd.DataFrame(np.vstack(test_input),
                       columns=[f"Pixel {k+1}" for k in range(N_PIXELS)])
test_df["Target"] = [LABELS[str(d)] for d in np.hstack(test_target)]
test_df["Numeric target"] = np.hstack(test_target)

In [None]:
training_df

In [None]:
validation_df

In [None]:
test_df

## Visualization

Visualization is always a crucial part of getting started with a new dataset. Even when only looking at a few samples, we get a better idea of what is contained in the dataset.

Here, we observe several interesting things:

- The pixel values (grayscale values computed from the RGB channels) seem to be normalized, as we do not deal with integer values.
- The value ranges are different. In most of the images, the values seem to lie between 0.3 and 0.8, but not always.
- The histograms and the boxplots indicate that the pixel values overall are distributed reasonably.
- All pixels seem to carry information, as the distributions do not indicate that any pixel has a very small standard deviation.

All in all, this suggests that the pre-processing is simple in case of this dataset.
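
One way to back up the last observation with numbers is to compute the standard deviation of every pixel and to look for values close to zero. The sketch below uses synthetic data with one deliberately constant pixel; the array `pixels` and the threshold are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical data: 1000 samples with 5 pixel features each
pixels = rng.uniform(low=0.3, high=0.8, size=(1000, 5))
pixels[:, 2] = 0.5  # make one pixel constant on purpose

pixel_std = pixels.std(axis=0)
# The threshold is an arbitrary choice for this example
uninformative = np.flatnonzero(pixel_std < 1e-3)
print(uninformative)
```

On the real data, the same check could be applied to the pixel columns of `training_df`.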

In [None]:
fig, axs = plt.subplots(3, 3, sharex="all", sharey="all")
for k in range(9):
    sns.heatmap(
        data=training_df.loc[
            k, [f"Pixel {k+1}" for k in range(N_PIXELS)]
        ].values.astype(float).reshape(28, 28).T, ax=axs.flatten()[k], 
        square=True, xticklabels=False, yticklabels=False)
plt.tight_layout()

In [None]:
fig, axs = plt.subplots(2, 1, sharex="all")
sns.histplot(data=training_df.loc[:, [f"Pixel {k+1}" for k in range(0, 5)]],
             ax=axs[0])
sns.boxplot(data=training_df.loc[:, [f"Pixel {k+1}" for k in range(0, 5)]],
            ax=axs[1], orient="h")
plt.tight_layout()

## Dimensionality reduction

We will not dive too deep into this topic in the workshop today. Nevertheless, it is important to keep dimensionality reduction in mind. Therefore, we have a quick look at Principal Component Analysis (PCA). It essentially rotates the coordinate system such that the new axes (i.e., the principal components) are ordered by the amount of variance they explain, so that the very first components explain most of the variance contained in the dataset.

The red horizontal line marks 95% of the variance. The point at which the curve crosses this line shows how many principal components are required to explain at least 95% of the variance contained in the dataset. Hence, it is theoretically sufficient to use approximately 280 principal components.
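
The number of components needed for a given share of the variance can also be computed directly instead of reading it off the plot. The following sketch does this with a plain NumPy singular value decomposition on synthetic data; the data and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical data: 500 samples, 20 features with decreasing variance
scales = np.linspace(5.0, 0.1, num=20)
data = rng.normal(size=(500, 20)) * scales

# PCA via singular value decomposition of the centered data
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components whose cumulative ratio reaches 95 %
cumulative = np.cumsum(explained_variance_ratio)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print(n_components)
```

With a fitted scikit-learn `PCA` object, the same two last steps can be applied to its `explained_variance_ratio_` attribute. Alternatively, `PCA(n_components=0.95)` selects the number of components for the requested share of variance automatically.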

In [None]:
pca = PCA().fit(training_df.loc[:, [f"Pixel {k+1}" for k in range(N_PIXELS)]])

fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, 785), y=np.cumsum(pca.explained_variance_ratio_),
             ax=axs)
axs.axhline(y=0.95, c="r")
axs.set_xlabel("Number of principal components")
axs.set_ylabel("Accumulated explained variance")
axs.set_xlim((0, N_PIXELS+5))
axs.set_ylim((0.55, 1.01))
plt.tight_layout()

## Image classification - from linear classifiers to neural networks

It is time to train our first classification model. In general, image classification means assigning an image to a class. There exists a vast variety of classification methods that we could use. We explore a few of them.

Now, let us use a linear classifier as provided by scikit-learn. With ``loss="log_loss"``, the ``SGDClassifier`` minimizes the cross-entropy loss between the target and the predicted output using stochastic gradient descent. Before we train the model, we update our preprocessing pipeline: we shift the data by ``0.5`` and scale it by ``0.5``, which maps the value range from ``[0, 1]`` to ``[-1, +1]``. Afterwards, we prepare the datasets again as before.
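
The effect of a normalization with a mean of ``0.5`` and a standard deviation of ``0.5`` can be checked with a few numbers. The helper `normalize` below is a hypothetical stand-in that only mirrors the arithmetic `(x - mean) / std`.

```python
import numpy as np


def normalize(values, mean=0.5, std=0.5):
    """Shift and scale values in the same way as a mean/std normalization."""
    return (values - mean) / std


pixel_values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(normalize(pixel_values))
```

Note that this maps the value range to ``[-1, +1]``; the data is centered exactly around zero only if its actual mean is ``0.5``.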

In [None]:
# preprocessing
data_transform.transforms.append(transforms.Normalize(mean=[.5], std=[.5]))

# load the data
DataClass = getattr(medmnist, DATASET_INFO['python_class'])
train_dataset = DataClass(split='train', transform=data_transform, as_rgb=True)
validation_dataset = DataClass(split='val', transform=data_transform,
                               as_rgb=True)
test_dataset = DataClass(split='test', transform=data_transform, as_rgb=True)

training_input = []
training_target = []
for data in utils.data.DataLoader(dataset=train_dataset, shuffle=True):
    training_input.append(data[0].numpy().reshape(-1, N_PIXELS))
    training_target.append(data[1].numpy().flatten())

training_df = pd.DataFrame(np.vstack(training_input),
                           columns=[f"Pixel {k+1}" for k in range(N_PIXELS)])
training_df["Target"] = [LABELS[str(d)] for d in np.hstack(training_target)]
training_df["Numeric target"] = np.hstack(training_target)

validation_input = []
validation_target = []
for data in utils.data.DataLoader(dataset=validation_dataset, shuffle=True):
    validation_input.append(data[0].numpy().reshape(-1, N_PIXELS))
    validation_target.append(data[1].numpy().flatten())

validation_df = pd.DataFrame(np.vstack(validation_input),
                             columns=[f"Pixel {k+1}" for k in range(N_PIXELS)])
validation_df["Target"] = [
    LABELS[str(d)] for d in np.hstack(validation_target)]
validation_df["Numeric target"] = np.hstack(validation_target)

test_input = []
test_target = []
for data in utils.data.DataLoader(dataset=test_dataset, shuffle=True):
    test_input.append(data[0].numpy().reshape(-1, N_PIXELS))
    test_target.append(data[1].numpy().flatten())

test_df = pd.DataFrame(np.vstack(test_input),
                       columns=[f"Pixel {k+1}" for k in range(N_PIXELS)])
test_df["Target"] = [LABELS[str(d)] for d in np.hstack(test_target)]
test_df["Numeric target"] = np.hstack(test_target)

scikit-learn supports cross-validation with pre-defined splits, which is what this dataset provides. We therefore concatenate the training and validation datasets and mark the training subset with ``-1`` and the validation subset with ``0``.
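
To see what such a fold vector does, the following toy sketch builds a ``PredefinedSplit`` for five samples. The sizes are hypothetical; entries of ``-1`` are never used for validation, and any non-negative value is treated as the index of a validation fold.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Hypothetical toy sizes: three training and two validation samples
n_training, n_validation = 3, 2
# -1: never used for validation; 0: member of validation fold 0
test_fold = np.array([-1] * n_training + [0] * n_validation)

cv = PredefinedSplit(test_fold=test_fold)
splits = list(cv.split())
for train_index, validation_index in splits:
    print(train_index, validation_index)
```

There is exactly one split, so a hyper-parameter search that uses this object trains once on the training part and scores once on the validation part per candidate.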

In [None]:
cv_training_df = pd.concat((training_df, validation_df))
X_train = cv_training_df.loc[:,
          [f"Pixel {k+1}" for k in range(N_PIXELS)]].to_numpy()
y_train = cv_training_df.loc[:, "Numeric target"].to_numpy()
# -1: training only, 0: index of the single validation fold
test_fold = [-1] * len(training_df) + [0] * len(validation_df)

X_test = test_df.loc[:, [f"Pixel {k+1}" for k in range(N_PIXELS)]].to_numpy()
y_test = test_df.loc[:, "Numeric target"].to_numpy()

cv = PredefinedSplit(test_fold=test_fold)

### Train the very first linear model

scikit-learn makes it very simple to train machine learning models. With only one line of code, we can already train our very first model. One baseline model was already trained in preparation for this workshop, and we load this model. Now, take some time to explore different parameters of the model and observe how these parameters change the performance.

In [None]:
try:
    clf = load("results/sklearn_linear_model_baseline.joblib")
except FileNotFoundError:
    clf = SGDClassifier(loss="log_loss",
                        early_stopping=True).fit(X=X_train, y=y_train)
    dump(clf, "results/sklearn_linear_model_baseline.joblib")

In [None]:
clf.score(X=X_train, y=y_train)

In [None]:
clf.score(X=X_test, y=y_test)

### Hyperparameter optimization

One important aspect is a proper hyper-parameter optimization. Linear models do not have that many hyper-parameters. However, the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) of the ``SGDClassifier`` makes clear that there are still many ways to change the basic model. One hyper-parameter that all regularized linear models have in common is the regularization strength $\alpha$, which penalizes large weights.

scikit-learn offers great model selection tools that allow for a simple hyper-parameter optimization. The code cell below demonstrates this for the regularization penalty ``alpha``. Since this takes a long time, we will not optimize a new model but use a pre-trained model that was prepared for this workshop.

The regularization parameter has a significant impact on the performance.
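
Because useful values of a regularization parameter typically span several orders of magnitude, it is usually sampled on a logarithmic scale. The sketch below shows the idea with plain NumPy: the exponent is drawn uniformly, so every decade is equally likely. The range and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
low, high = 1e-5, 1e1

# Sample the exponent uniformly, then map back to the original scale
exponents = rng.uniform(np.log10(low), np.log10(high), size=10_000)
alphas = 10.0 ** exponents

# Roughly one sixth of the samples should fall into each of the six decades
counts, _ = np.histogram(exponents, bins=np.arange(-5, 2))
print(counts / len(alphas))
```

A uniform distribution on the original scale would instead place about 90 % of the samples between 1 and 10 and almost none below 0.01.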

In [None]:
try:
    clf = load("results/sklearn_linear_model.joblib")
except FileNotFoundError:
    clf = RandomizedSearchCV(
        estimator=SGDClassifier(loss="log_loss"), n_iter=50, n_jobs=-1, 
        cv=cv, verbose=10, param_distributions={
            "alpha": loguniform(a=1e-5, b=1e1)}).fit(X=X_train, y=y_train)
    dump(clf, "results/sklearn_linear_model.joblib")

In [None]:
fig, axs = plt.subplots()
sns.lineplot(data=pd.DataFrame(clf.cv_results_), x="param_alpha",
             y="mean_test_score", ax=axs)
axs.set_xscale("log")
axs.set_xlim((1e-5, 1e1))
axs.set_xlabel("Ridge parameter alpha")
axs.set_ylabel("Validation accuracy")
plt.tight_layout()

In [None]:
clf.score(X=X_train, y=y_train)

In [None]:
clf.score(X=X_test, y=y_test)

### From scikit-learn to PyTorch

We can also implement the linear model in PyTorch. However, this requires more code. The advantage is that the same code can be used for almost any model that we implement in PyTorch. Thus, we now step through the different elements that are required to train and evaluate PyTorch models.

At first, we define the ``BATCH_SIZE``, a crucial parameter that should be tuned during a hyper-parameter optimization. You can tune it later when we train deep learning models. It has a significant impact on both the classification performance and the required training time.

The next important element is the ``DataLoader`` from PyTorch. This is an important object that handles loading and randomizing the data in batches for us.
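
What such a loader does conceptually can be sketched in a few lines of NumPy: shuffle the sample indices once per epoch and hand out consecutive slices. The function `iterate_minibatches` is a hypothetical helper for illustration and not part of PyTorch.

```python
import numpy as np


def iterate_minibatches(inputs, targets, batch_size, rng):
    """Yield shuffled mini-batches that cover the dataset exactly once."""
    order = rng.permutation(len(inputs))
    for start in range(0, len(inputs), batch_size):
        index = order[start:start + batch_size]
        yield inputs[index], targets[index]


rng = np.random.default_rng(seed=0)
inputs = np.arange(10, dtype=float).reshape(10, 1)  # ten hypothetical samples
targets = np.arange(10)

batches = list(iterate_minibatches(inputs, targets, batch_size=4, rng=rng))
print([len(batch_targets) for _, batch_targets in batches])
```

The last batch is smaller whenever the dataset size is not a multiple of the batch size. The real ``DataLoader`` additionally handles parallel loading and memory pinning.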

In [None]:
BATCH_SIZE = 256
PATIENCE = 5
NUM_EPOCHS = 200

train_loader = utils.data.DataLoader(dataset=train_dataset,
                                     batch_size=BATCH_SIZE, shuffle=True,
                                     num_workers=2, pin_memory=True)
validation_loader = utils.data.DataLoader(dataset=validation_dataset,
                                          batch_size=BATCH_SIZE, shuffle=True,
                                          num_workers=2, pin_memory=True)
test_loader = utils.data.DataLoader(dataset=test_dataset,
                                    batch_size=BATCH_SIZE, shuffle=True,
                                    num_workers=2, pin_memory=True)

The function to train models looks complicated at first glance. However, it is actually not that difficult to understand.

Essentially, it trains the model for at most ``n_epochs`` epochs. In every epoch,
- the model is trained in minibatches on the entire training dataset,
- the model is validated on the entire validation set,
- the early stopping criterion is checked.
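
The early stopping criterion itself can be sketched independently of PyTorch. The class `SimpleEarlyStopping` below is a hypothetical, simplified stand-in written for this illustration; it is not the `EarlyStopping` class imported above and it does not store checkpoints.

```python
class SimpleEarlyStopping:
    """Stop when the monitored score has not decreased for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_score = float("inf")
        self.best_epoch = 0
        self.counter = 0
        self.early_stop = False

    def __call__(self, score, epoch):
        if score < self.best_score:
            # improvement: remember it and reset the counter
            self.best_score = score
            self.best_epoch = epoch
            self.counter = 0
        else:
            self.counter += 1
            self.early_stop = self.counter >= self.patience


# Hypothetical validation losses of six epochs
validation_losses = [0.9, 0.7, 0.6, 0.65, 0.61, 0.62]
stopper = SimpleEarlyStopping(patience=2)
for epoch, loss in enumerate(validation_losses, start=1):
    stopper(loss, epoch)
    if stopper.early_stop:
        break
print(stopper.best_epoch, epoch)
```

Since lower scores count as better in this sketch, a quantity that should be maximized, such as the accuracy, would be passed with a negative sign.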

In [None]:
def train_model(torch_model, torch_optimizer, patience, n_epochs, path,
                training_dataloader, validation_dataloader):
    """
    Train a PyTorch torch_model with early stopping.

    Parameters
    ----------
    torch_model : PyTorch torch_model
        The torch_model to be optimized.
    torch_optimizer : PyTorch optimizer
        The optimizer to minimize the bce_loss and update the weights after 
        each iteration.
    patience : int
        Number of consecutive epochs without an improvement of the validation
        accuracy after which the training is stopped and the best torch_model
        so far is returned.
    n_epochs : int
        Maximum number of training epochs.
    path : Path
        Location where to store intermediate models.
    training_dataloader : DataLoader
        The training input_data loader.
    validation_dataloader : DataLoader
        The validation input_data loader for early stopping.

    Returns
    -------
    torch_model : PyTorch torch_model
        The optimized torch_model.
    optimizer : PyTorch optimizer
        The optimizer that is used.
    current_epoch : int
        The current_epoch at which the training has stopped.
    bce_loss : float
        The final validation bce_loss.
    avg_training_losses : list[float]
        The training losses after each current_epoch.
    avg_validation_losses : list[float]
        The validation losses after each current_epoch.
    """
    loss_function = nn.CrossEntropyLoss()
    # to track the training bce_loss as the torch_model trains
    training_losses = []
    # to track the validation bce_loss as the torch_model trains
    validation_losses = []
    # to track the average training bce_loss per current_epoch as the
    # torch_model trains
    avg_training_losses = []
    # to track the average validation bce_loss per current_epoch as the
    # torch_model trains
    avg_validation_losses = []
    # initialize the early_stopping object
    early_stopping = EarlyStopping(patience=patience, verbose=True, path=path)
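    for current_epoch in range(1, n_epochs + 1):
        # reset the batch losses so that the averages refer to this epoch only
        training_losses = []
        validation_losses = []
        #########################
        # train the torch_model #
        #########################
        torch_model.train()  # prep torch_model for training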
        for batch, (input_data, target) in enumerate(training_dataloader, 1):
            input_data = input_data.to(DEVICE)
            target = target.to(DEVICE)
            # clear the gradients of all optimized variables
            torch_optimizer.zero_grad()
            # forward pass: compute predicted outputs
            output = torch_model(input_data)
            # calculate the bce_loss
            target = target.squeeze().long()
            bce_loss = loss_function(output, target)
            # backward pass: gradient of the bce_loss with respect to
            # parameters
            bce_loss.backward()
            # perform a single optimization step (parameter update)
            torch_optimizer.step()
            # record training bce_loss
            training_losses.append(bce_loss.item())
        ######################
        # validate the torch_model #
        ######################
        validation_outputs = []
        validation_targets = []
        torch_model.eval()  # prep torch_model for evaluation
        with torch.no_grad():  # no gradients are needed for validation
            for input_data, target in validation_dataloader:
                input_data = input_data.to(DEVICE)
                target = target.to(DEVICE)
                # forward pass: compute predicted outputs
                output = torch_model(input_data)
                # calculate the bce_loss
                target = target.squeeze().long()
                bce_loss = loss_function(output, target)
                # record validation bce_loss
                validation_losses.append(bce_loss.item())
                validation_targets.append(
                    target.cpu().detach().numpy().flatten())
                validation_outputs.append(
                    output.cpu().detach().numpy().argmax(axis=1))
        # print training/validation statistics
        # calculate average bce_loss over an current_epoch
        training_loss = np.average(training_losses)
        validation_loss = np.average(validation_losses)
        avg_training_losses.append(training_loss)
        avg_validation_losses.append(validation_loss)
        epoch_len = len(str(n_epochs))
        print_msg = (f'[{current_epoch:>{epoch_len}}/{n_epochs:>{epoch_len}}] '
                     f'train_loss: {training_loss:.5f} '
                     f'valid_loss: {validation_loss:.5f}')
        print(print_msg)
        # early_stopping monitors the negative validation accuracy: if it has
        # decreased (i.e., the accuracy has improved), it will make a
        # checkpoint of the current torch_model

        early_stopping(-accuracy_score(np.hstack(validation_targets),
                                       np.hstack(validation_outputs)),
                       torch_model, torch_optimizer, current_epoch)
        if early_stopping.early_stop:
            print("Early stopping")
            break
    return (torch_model, torch_optimizer, early_stopping.epoch, bce_loss,
            avg_training_losses, avg_validation_losses)

The function to test models is easier to understand. Essentially, it evaluates the model on the entire training dataset, and on the entire test set.
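
The evaluation returns one array of targets and one array of predictions per batch. The following sketch, with made-up numbers, shows how such lists are combined and how the accuracy follows from them.

```python
import numpy as np

# Hypothetical per-batch targets and predictions, as collected in a loop
batch_targets = [np.array([0, 1, 2]), np.array([2, 1])]
batch_predictions = [np.array([0, 1, 1]), np.array([2, 0])]

y_true = np.hstack(batch_targets)
y_pred = np.hstack(batch_predictions)

# Accuracy is the fraction of samples whose prediction matches the target
accuracy = np.mean(y_true == y_pred)
print(accuracy)
```

Here, three of the five predictions are correct, which gives an accuracy of 0.6.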

In [None]:
def test_model(torch_model, training_dataloader, test_dataloader):
    """
    Evaluate a PyTorch torch_model on the training and on the test set.

    Parameters
    ----------
    torch_model : PyTorch torch_model
        The torch_model to be evaluated.
    training_dataloader : DataLoader
        The training input_data loader.
    test_dataloader : DataLoader
        The test input_data loader.

    Returns
    -------
    training_targets : list[int]
        The training targets.
    training_outputs : list[int]
        The predictions on the training set.
    test_targets : list[int]
        The test targets.
    test_outputs : list[int]
        The predictions on the test set.
    """
    # to collect the targets and predictions on the training dataset
    training_outputs = []
    training_targets = []
    # to collect the targets and predictions on the test dataset
    test_outputs = []
    test_targets = []
    torch_model.eval()  # prep torch_model for inference
    with torch.no_grad():
        for batch, (input_data, target) in enumerate(training_dataloader, 1):
            input_data = input_data.to(DEVICE)
            target = target.to(DEVICE)
            # forward pass: compute predicted outputs
            output = torch_model(input_data)
            # Prepare the target output
            target = target.squeeze().long()
            target = target.float().resize_(len(target), 1)
            training_targets.append(target.cpu().detach().numpy().flatten())
            training_outputs.append(
                output.cpu().detach().numpy().argmax(axis=1))
        for batch, (input_data, target) in enumerate(test_dataloader, 1):
            input_data = input_data.to(DEVICE)
            target = target.to(DEVICE)
            # forward pass: compute predicted outputs
            output = torch_model(input_data)
            # Prepare the target output
            target = target.squeeze().long()
            target = target.float().resize_(len(target), 1)
            test_targets.append(target.cpu().detach().numpy().flatten())
            test_outputs.append(output.cpu().detach().numpy().argmax(axis=1))
    return training_targets, training_outputs, test_targets, test_outputs

In [None]:
model = LinearRegression(in_features=N_PIXELS, num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-2)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(model, optimizer, PATIENCE,
                                            NUM_EPOCHS,
                                            "results/torch_linear_model.pt",
                                            train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
# map_location allows loading a checkpoint that was stored on another device
checkpoint = torch.load("results/torch_linear_model.pt",
                        map_location=DEVICE)

model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

In [None]:
model = MultiLayerPerceptron(hidden_layer_sizes=(N_PIXELS, ),
                             num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-2)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(model, optimizer, PATIENCE,
                                            NUM_EPOCHS,
                                            "results/torch_naive_mlp_model.pt",
                                            train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
# map_location allows loading a checkpoint that was stored on another device
checkpoint = torch.load("results/torch_naive_mlp_model.pt",
                        map_location=DEVICE)

model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

## From linear models to the Multilayer Perceptron

The Multilayer Perceptron (MLP) is a simple feed-forward neural network in which the input is passed through one or more hidden layers. Each weight matrix of a hidden layer is followed by a nonlinear activation function.

Compared to the linear model, there are more free parameters to be trained. A huge advantage of these models is that they can separate the classes non-linearly.
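
A forward pass through such a network can be sketched in NumPy. The layer sizes, the ReLU activation and the helper names are assumptions for this illustration; they do not necessarily match the `MultiLayerPerceptron` class used in this notebook.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: negative values are set to zero."""
    return np.maximum(x, 0.0)


def mlp_forward(x, weights, biases):
    """Forward pass: ReLU after every hidden layer, linear output layer."""
    for weight, bias in zip(weights[:-1], biases[:-1]):
        x = relu(x @ weight + bias)
    return x @ weights[-1] + biases[-1]


rng = np.random.default_rng(seed=0)
layer_sizes = [784, 128, 64, 9]  # hypothetical layer sizes for this sketch
weights = [rng.normal(scale=0.01, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

logits = mlp_forward(rng.random((5, 784)), weights, biases)
n_parameters = sum(w.size + b.size for w, b in zip(weights, biases))
print(logits.shape, n_parameters)
```

This small network already has 109,321 trainable parameters, whereas a linear model that maps 784 inputs directly to 9 classes has 7,065.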

In [None]:
model = MultiLayerPerceptron(hidden_layer_sizes=(N_PIXELS, 128, 64, ),
                             num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-2)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(model, optimizer, PATIENCE,
                                            NUM_EPOCHS,
                                            "results/torch_deep_mlp_model.pt",
                                            train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
# map_location allows loading a checkpoint that was stored on another device
checkpoint = torch.load("results/torch_deep_mlp_model.pt",
                        map_location=DEVICE)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

## From MLPs to Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a special kind of feed-forward neural network that consists of several building blocks:
- Convolutional layers: Convolve the input with learned kernels to produce feature maps
- Pooling layers: Reduce the dimensionality by combining neighbouring elements of the feature maps
- Fully connected layer: Traditional MLP that connects the final feature maps to the (classification) outputs
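
The first two building blocks can be sketched in plain Python on a tiny image. This is an illustrative sketch only: the helper names, the 5x5 image and the 2x2 edge kernel are made up for this example, whereas a real CNN learns its kernels during training.

```python
def convolve2d_valid(image, kernel):
    """Slide a square kernel over the image without padding (stride 1).

    Like most deep learning libraries, this computes a cross-correlation,
    i.e. the kernel is not flipped.
    """
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_cols)]
            for r in range(out_rows)]


def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Hypothetical 5 x 5 image with a bright vertical stripe
image = [[0, 0, 1, 1, 0] for _ in range(5)]
# Hand-made kernel that responds to dark-to-bright vertical edges
kernel = [[-1, 1],
          [-1, 1]]

feature_map = convolve2d_valid(image, kernel)  # 4 x 4 feature map
pooled = max_pool2d(feature_map)               # 2 x 2 after pooling
print(feature_map)
print(pooled)
```

The feature map is largest where the stripe begins, and pooling keeps that response while halving each spatial dimension.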

In this experiment, we use the following architecture:
- Convolutional layer with 16 feature maps, a kernel size of 5, and a ReLU nonlinearity
- Maximum pooling with a kernel size of 2
- Convolutional layer with 32 feature maps, a kernel size of 5, and a ReLU nonlinearity
- Maximum pooling with a kernel size of 2
- Fully connected layer that maps from the feature maps to the classes, and a ReLU nonlinearity
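
To see how many inputs the fully connected layer receives, the spatial size can be traced through the layers listed above. The sketch below assumes 28x28 input images, a stride of 1 in the convolutions and non-overlapping pooling. The padding used by `ConvolutionalNeuralNetwork` is not stated here, so both the unpadded case and a padding of 2 are shown as assumptions; the helper names are hypothetical.

```python
def conv_output_size(size, kernel_size, padding=0, stride=1):
    """Side length of a feature map after a square convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_output_size(size, kernel_size):
    """Side length after non-overlapping pooling (stride = kernel size)."""
    return size // kernel_size


def trace_sizes(input_size=28, padding=0):
    """Follow one image side through conv(5), pool(2), conv(5), pool(2)."""
    sizes = [input_size]
    for _ in range(2):
        sizes.append(conv_output_size(sizes[-1], kernel_size=5,
                                      padding=padding))
        sizes.append(pool_output_size(sizes[-1], kernel_size=2))
    return sizes


sizes_no_padding = trace_sizes(padding=0)    # assumption: no padding
sizes_with_padding = trace_sizes(padding=2)  # assumption: padding of 2

# 32 feature maps leave the second pooling layer
flat_no_padding = 32 * sizes_no_padding[-1] ** 2
flat_with_padding = 32 * sizes_with_padding[-1] ** 2
print(sizes_no_padding, flat_no_padding)
print(sizes_with_padding, flat_with_padding)
```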

In [None]:
model = ConvolutionalNeuralNetwork(in_channels=1, num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-1)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(model, optimizer, PATIENCE,
                                            NUM_EPOCHS,
                                            "results/torch_cnn_model.pt",
                                            train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
checkpoint = torch.load("results/torch_cnn_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

## From MLPs to Recurrent Neural Networks with Long Short-Term Memory

The Long Short-Term Memory (LSTM) network is a gated Recurrent Neural Network (RNN) whose units consist of several building blocks:
- Cell: Remembers values over time
- Input gate: Decides which new information is stored in the current state
- Output gate: Controls which part of the current state is output
- Forget gate: Decides which information from the previous state is discarded
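
The interplay of the cell and the three gates can be sketched for a single time step with scalar input and scalar state. This follows the standard LSTM update equations; the function names and the hand-picked parameters are illustrative assumptions and are not taken from the workshop's `LSTMModel`.

```python
import math


def sigmoid(value):
    """Logistic function, squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of an LSTM cell with scalar input and scalar state."""
    input_gate = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])
    forget_gate = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])
    output_gate = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])
    candidate = math.tanh(p["w_g"] * x + p["u_g"] * h_prev + p["b_g"])
    c = forget_gate * c_prev + input_gate * candidate  # update the cell
    h = output_gate * math.tanh(c)                     # exposed output
    return h, c


# Hand-picked (untrained) parameters: all weights 1, biases set per gate.
params = {f"{kind}_{gate}": 1.0 for kind in "wu" for gate in "ifog"}
# Closed input gate and open forget gate: the cell should keep its value.
params.update({"b_i": -10.0, "b_f": 10.0, "b_o": 0.0, "b_g": 0.0})

h, c = 0.0, 0.5
for x in [0.2, 0.9, 0.1, 0.7]:
    h, c = lstm_step(x, h, c, params)
c_retained = c

# Closing the forget gate as well erases the stored value in one step.
forgetful = dict(params, b_f=-10.0)
_, c_forgotten = lstm_step(0.2, 0.0, 0.5, forgetful)
print(c_retained, c_forgotten)
```

With the forget gate open and the input gate closed, the cell state stays close to its initial value of 0.5 over the whole sequence; with both gates closed it drops to almost zero.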

Take some time to explore different hyperparameters, such as the hidden size, the number of layers, bidirectionality and dropout, which are set in the cells below.
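
When comparing configurations, it helps to know how the model size changes. The sketch below counts the recurrent parameters under the assumption that `LSTMModel` wraps a standard `torch.nn.LSTM`, which stores an input-to-hidden matrix, a hidden-to-hidden matrix and two bias vectors for each of the four gate blocks. The classification layer on top is not included, and the function name is hypothetical.

```python
def lstm_parameter_count(input_size, hidden_size, num_layers, bidirectional):
    """Count the recurrent parameters of a stacked LSTM.

    Assumes the torch.nn.LSTM layout: per layer and direction
    4 * hidden_size * (layer_input + hidden_size + 2) parameters.
    """
    num_directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        per_direction = 4 * hidden_size * (layer_input + hidden_size + 2)
        total += num_directions * per_direction
        # Deeper layers read the concatenated outputs of all directions
        layer_input = hidden_size * num_directions
    return total


# (num_layers, bidirectional) for the four settings used in this section
configurations = {
    "1 layer, unidirectional": (1, False),
    "2 layers, unidirectional": (2, False),
    "1 layer, bidirectional": (1, True),
    "2 layers, bidirectional": (2, True),
}
counts = {name: lstm_parameter_count(28, 100, layers, bidirectional)
          for name, (layers, bidirectional) in configurations.items()}
for name, count in counts.items():
    print(f"{name}: {count} parameters")
```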

In [None]:
model = LSTMModel(input_size=28, hidden_size=100, num_layers=1,
                  bidirectional=False, dropout=0., num_classes=N_CLASSES)
model = model.to(DEVICE)

optimizer = optim.SGD(model.parameters(), lr=1e-1)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(model, optimizer, PATIENCE,
                                            NUM_EPOCHS,
                                            "results/torch_1L_lstm_model.pt",
                                            train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
checkpoint = torch.load("results/torch_1L_lstm_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

In [None]:
model = LSTMModel(input_size=28, hidden_size=100, num_layers=2,
                  bidirectional=False, dropout=0., num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-1)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(model, optimizer, PATIENCE,
                                            NUM_EPOCHS,
                                            "results/torch_2L_lstm_model.pt",
                                            train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
checkpoint = torch.load("results/torch_2L_lstm_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

In [None]:
model = LSTMModel(input_size=28, hidden_size=100, num_layers=1,
                  bidirectional=True, dropout=0., num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-1)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(
        model, optimizer, PATIENCE, NUM_EPOCHS,
        "results/torch_1L_bi_lstm_model.pt", train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
checkpoint = torch.load("results/torch_1L_bi_lstm_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))

In [None]:
model = LSTMModel(input_size=28, hidden_size=100, num_layers=2,
                  bidirectional=True, dropout=0., num_classes=N_CLASSES)
model = model.to(DEVICE)
optimizer = optim.SGD(model.parameters(), lr=1e-1)

In [None]:
model, optimizer, epoch, loss, average_training_losses, \
    average_validation_losses = train_model(
        model, optimizer, PATIENCE, NUM_EPOCHS,
        "results/torch_2L_bi_lstm_model.pt", train_loader, validation_loader)

In [None]:
fig, axs = plt.subplots()
sns.lineplot(x=np.arange(1, len(average_training_losses) + 1),
             y=average_training_losses, ax=axs, label="Training loss")
sns.lineplot(x=np.arange(1, len(average_validation_losses) + 1),
             y=average_validation_losses, ax=axs, label="Validation loss")
axs.set_xlim((0, round(len(average_training_losses) / 5) * 5))
axs.set_xlabel("Number of epochs")
axs.set_ylabel("Loss")
plt.tight_layout()

In [None]:
checkpoint = torch.load("results/torch_2L_bi_lstm_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
model = model.to(DEVICE)
model.eval()

y_train_true, y_train_pred, y_test_true, y_test_pred = test_model(
    torch_model=model, training_dataloader=train_loader,
    test_dataloader=test_loader)

In [None]:
accuracy_score(np.hstack(y_train_true), np.hstack(y_train_pred))

In [None]:
accuracy_score(np.hstack(y_test_true), np.hstack(y_test_pred))