# Business Analytics and Artificial Intelligence
Summer semester 2024

Prof. Dr. Jürgen Bock

### Learning goals
* You are able to explain the difference between model parameters and hyperparameters and to name some hyperparameters.
* You are able to formulate the main steps to prepare data sets and to apply several methods to get an overview of a data set.
* You can explain the approach for encoding the output of multi-class classifier networks, you can draft the principle of suitable activation functions, and you are able to interpret the output of such activation functions.
* You are able to apply artificial neural networks for multi-class classification and to evaluate the trained models according to suitable metrics.
* You are able to tell the main principle of convolutional neural networks for image classification.

## Parameters of Neural Networks

We distinguish between two kinds of parameters:
- model parameters
- hyperparameters

**Model parameters** are those parameters that represent the actual (learned) model. In the case of neural networks these are the **weights**. If, for instance, the neural network was trained to recognize cats in images, the model parameters (weights) characterize the model of cat images.

**Hyperparameters** are those parameters that characterize the structure of the model and of the training procedure. All configuration levers that are set by the developer, rather than learned from the data, are hyperparameters. Among them are:
- Structure of the neural network
  - Number of layers
  - Number of neurons per layer
  - Kind of interconnection (fully connected, convolutional, etc.)
- Number of training epochs
- Batch size in batch training
- Choice of the *loss function*
- Choice of the optimizer
- Learning rate
- Encoding of the results in the output vector
- ...

Choosing the hyperparameters is one of the most difficult tasks during the training of neural networks. Often the effects of certain hyperparameters cannot be determined systematically. In addition, hyperparameters can interact with each other.
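To make the distinction more tangible, the following sketch counts the model parameters of a fully connected network whose layer sizes are given as hyperparameters. The layer sizes and the helper name `count_parameters` are illustrative assumptions. Besides the weights, the sketch also counts one bias value per neuron, as fully connected layers usually have them.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes is a hyperparameter, e.g. [3072, 200, 50, 5] means
    3072 inputs, two hidden layers (200 and 50 neurons) and 5 outputs.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection
        total += n_out         # one bias per neuron
    return total


# Hypothetical architecture choices (hyperparameters)
small = [3072, 200, 50, 5]
wide = [3072, 1000, 50, 5]

print("Model parameters (small):", count_parameters(small))
print("Model parameters (wide): ", count_parameters(wide))
```

Changing a single hyperparameter (here the width of the first hidden layer) changes the number of values that have to be learned considerably.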

## Artificial Neural Networks for Image Recognition

We will have a closer look at some hyperparameters when we look at a prominent application of neural networks: image recognition.

More precisely, this is the classification of the content of the image.

### Image Data

Image data are more complex than the synthetic data that we used so far, or than the simple classification data sets, e.g. *iris*.

In classical image recognition approaches, a certain set of image features is detected and extracted in a preprocessing step. (Edges, etc.)

In neural networks, however, every pixel of the image is considered a feature. Hence, the input vector of the neural network can be quite large.

For example, given an image of size 32 x 32 pixels in 3 color channels, we need an input vector of the size:

In [None]:
32*32*3

Processing input vectors of this size has become possible through recent advances in the area of neural networks:
- Availability of large data sets
- Availability of high-performance computing power
- Novel / improved algorithms

These advances allow for the definition and training of large neural networks with many layers and a large number of neurons (and thus also large input vectors). Training and evaluation of models with such many-layered architectures is called *Deep Learning*.

#### Preparation of Image Data with PyTorch / Torchvision

The `torchvision` library contains several useful packages and modules for loading, organizing and manipulating image data.

The package `datasets` in ``torchvision`` contains modules for loading freely available and frequently used image data sets (e.g. for benchmarking).

In [None]:
from torchvision import datasets

The package `transforms` contains classes for transformation of image data (e.g. converting to tensors, re-scaling, normalizing, cropping, ...). Transformations can on the one hand be used to convert images into a format that is useable by the neural network. On the other hand, they can be used to augment the training data, e.g., by using different image crops or rotations, mirrored or distorted images, using different saturation, contrast, brightness, etc.

In [None]:
from torchvision import transforms

The transformations can be passed directly to the data set representation. They are applied by the data set object directly.

Firstly, however, the transformation object must be configured and instantiated. The class `Compose` accepts a list of `transform` objects during instantiation, that will be applied sequentially.

In [None]:
transformations = transforms.Compose([
    transforms.ToTensor()
])
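The principle of applying transformations one after another can be sketched in a few lines of plain Python. This is a simplified illustration and not the implementation in `torchvision`. The class `ComposeSketch` and the two helper functions are hypothetical names. The first step mimics the rescaling of 8-bit pixel values to the range [0, 1], which `ToTensor` also performs for such images.

```python
class ComposeSketch:
    """Apply a list of callables one after another (simplified illustration)."""

    def __init__(self, steps):
        self.steps = steps

    def __call__(self, x):
        for step in self.steps:
            x = step(x)
        return x


def scale_to_unit(pixels):
    # Map integer pixel values 0..255 to floats in [0, 1]
    return [p / 255 for p in pixels]


def center(pixels):
    # Shift and scale from [0, 1] to [-1, 1]
    return [(p - 0.5) / 0.5 for p in pixels]


pipeline = ComposeSketch([scale_to_unit, center])
print(pipeline([0, 255]))
```

The order of the list matters: `center` assumes that its input has already been scaled to [0, 1].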

`torchvision` can download known data sets directly and place them in a provided directory.

In [None]:
root_dir_cifar100 = "c:/data/cifar100/"

If the data are already present at the provided location, the download is skipped.

Depending on the data set different options are available, e.g., if the data set contains test and training data sets.

In [None]:
dataset_cifar100_train = datasets.CIFAR100( 
    root=root_dir_cifar100,
    train=True,
    download=True,
    transform=transformations)

In [None]:
dataset_cifar100_test = datasets.CIFAR100( 
    root=root_dir_cifar100,
    train=False,
    download=True,
    transform=transformations)

For the sake of readability, we use shorter variable names:

In [None]:
data_train = dataset_cifar100_train
data_test = dataset_cifar100_test

#### Inspection of the Data

It is always useful to manually inspect the data first. Doing so, we can check whether the data are available in the right format, whether they can be read and processed further, and, ideally, how certain hyperparameters must be set.

Is it a training or test data set?

In [None]:
print('data_train is a training data set:', data_train.train)
print('data_test is a training data set:', data_test.train)

In which shape is the data available?

In [None]:
print('Shape: ', data_train.data.shape)

That means: 50000 data samples of size 32 x 32 x 3 (i.e. 32 x 32 pixels in 3 color channels)

How many and which classes are available?

In [None]:
print("Number of classes: ", len(data_train.classes))

for i in range(0, len(data_train.classes)):
    print("{:2d}  {}".format(i, data_train.classes[i]))

How many data samples are there?

In [None]:
print("Number of samples in training set:", len(data_train.data))
print("Number of samples in test set: ", len(data_test.data))
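Another quick check is the distribution of the samples over the classes. The following sketch uses a small, made-up target list as a stand-in for `data_train.targets`; the variable names are assumptions. For a benchmark data set like CIFAR-100 we would expect every class to occur equally often.

```python
from collections import Counter

# Hypothetical stand-ins for data_train.classes and data_train.targets
toy_classes = ["apple", "mushroom", "orange"]
toy_targets = [0, 2, 1, 0, 0, 2, 1, 0]

counts = Counter(toy_targets)
for class_idx in sorted(counts):
    print("{:10s} {:3d}".format(toy_classes[class_idx], counts[class_idx]))

# A data set is balanced if all classes occur equally often
is_balanced = len(set(counts.values())) == 1
print("Balanced:", is_balanced)
```

An unbalanced data set influences both the training and the interpretation of metrics such as accuracy.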

#### Filtering and Loading the Data

Access to the data is done, as we know already, via the `DataLoader` that is provided in the *PyTorch* package `torch.utils.data`.

In [None]:
from torch.utils.data import DataLoader

In our example, we do not want to consider the complete data set, but only a fraction that contains certain target classes. To this end, we can configure a so-called `Sampler` that extracts only a certain subset of the data set. The sampler must be told the indices of the data samples to be considered. Hence, we have to identify the indices of these data samples for the required target classes.

We choose the selected classes based on their names and determine the class indices:

In [None]:
class_selection = ['apple', 'pear', 'orange', 'mushroom', 'sweet_pepper']
class_selection_idx = [i for i in range(len(data_train.classes)) if data_train.classes[i] in class_selection]

In [None]:
print("Indices of the classes", class_selection, ":", class_selection_idx)
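Note that a list built this way is ordered by class index, not by the order of the names in the selection. The following sketch shows this with a short, made-up class list (a stand-in for `data_train.classes`, assumed to be sorted alphabetically). This matters whenever positions in the index list are later used as new class labels.

```python
# Hypothetical, alphabetically sorted class list
toy_classes = ["apple", "bear", "mushroom", "orange", "pear", "sweet_pepper"]
toy_selection = ["apple", "pear", "orange", "mushroom", "sweet_pepper"]

# Same construction as above: iterate over the class indices in ascending order
toy_selection_idx = [i for i in range(len(toy_classes))
                     if toy_classes[i] in toy_selection]

# Class names in the order of the index list
names_in_index_order = [toy_classes[i] for i in toy_selection_idx]

print("Selection order:", toy_selection)
print("Index order:    ", names_in_index_order)
```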

We need only the indices of the data samples that are labeled with selected classes (for which the target is one of the identified class indices).

Let us inspect the whole target vector first:

In [None]:
print(data_train.targets)

We need the positions in the target vector at which the target is one of the selected classes:

In [None]:
target_idx = [i for (i,t) in enumerate(data_train.targets) if t in class_selection_idx]

These are the indices:

In [None]:
print(target_idx)

As a verification, here are the target values at these indices:

In [None]:
print([data_train.targets[i] for i in target_idx])

Now we can create the `Sampler` using the index list. The sampler classes can be found in the package `torch.utils.data.sampler`. We use the `SubsetRandomSampler` based on the indices that characterize the subset.

In [None]:
from torch.utils.data.sampler import SubsetRandomSampler

In [None]:
data_sampler = SubsetRandomSampler(target_idx)

The actual `DataLoader` can now be created for the `dataset` providing the `batch_size` and the `sampler`.

In [None]:
data_loader = DataLoader(dataset=data_train, batch_size=50, sampler=data_sampler)
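The effect of the sampler can be checked on a small synthetic data set without downloading any images. The sketch below uses `TensorDataset` with made-up features and labels; all variable names and values are assumptions for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Synthetic stand-in for an image data set: 20 samples, labels 0..3
toy_features = torch.arange(20, dtype=torch.float32).reshape(20, 1)
toy_labels = torch.tensor([0, 1, 2, 3] * 5)
toy_dataset = TensorDataset(toy_features, toy_labels)

# Keep only samples whose label is 1 or 3
wanted = [1, 3]
subset_idx = [i for i, t in enumerate(toy_labels.tolist()) if t in wanted]

toy_loader = DataLoader(toy_dataset, batch_size=4,
                        sampler=SubsetRandomSampler(subset_idx))

seen_labels = []
for _, batch_labels in toy_loader:
    seen_labels.extend(batch_labels.tolist())

print("Samples served:", len(seen_labels))
print("Labels served: ", sorted(set(seen_labels)))
```

The loader serves each selected sample once per pass, in random order, and never touches the other samples.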

Using the ``DataLoader`` we can now for the first time iterate over the images and have a look at them. Since we are using our configured ``SubsetRandomSampler`` we see only images of the selected classes.

In the provided ``dataview`` module (that should be placed in the same directory as this notebook) there is an auxiliary function to print images.

In [None]:
import dataview

We show the images batch-wise, as they are served by the ``DataLoader``.

In [None]:
for (input, _) in data_loader:
    dataview.view_images(input, 10)

### The Neural Network

#### Input and Output Vector

At first we define the length of the input vector: (32x32 pixel in 3 color channels)

In [None]:
n_input = 32 * 32 * 3

So far we have used neural networks for binary classification only. There, the output layer consists of exactly one neuron, which (based on the *sigmoid* or *threshold* activation function) produces output values (close to) 0 or (close to) 1, which correspond to the two classes.

In this example we address the problem of multi-class classification, i.e., there are more than 2 classes.

We need a method to represent the different classes by the output of the neural network.

There are two typical approaches to encode classes:
- Label Encoding
- One-Hot Encoding

As for the **label encoding** every class is represented by a unique class index. In our case this would be, e.g.

| Class        | Label |
|--------------|-------|
| Apple        | 0     |
| Pear         | 1     |
| Orange       | 2     |
| Mushroom     | 3     |
| Sweet_Pepper | 4     |

An advantage is the compact representation of the class. A disadvantage is that the incremental numbering suggests a sequence that is not present. Especially, if the class label is used as a numerical input later, the number value can be misinterpreted.

As for the **one-hot encoding**, a vector of length $n$ is used to represent $n$ classes. In this vector the index is representing the corresponding class. In an ideal classification result, all vector elements would be 0 apart from the one whose index represents the detected class, which would be 1. In our example this would be

| Class        | Apple | Pear | Orange | Mushroom | Sweet_Pepper |
|--------------|-------|------|--------|----------|--------------|
| Apple        | 1     | 0    | 0      | 0        | 0            |
| Pear         | 0     | 1    | 0      | 0        | 0            |
| Orange       | 0     | 0    | 1      | 0        | 0            |
| Mushroom     | 0     | 0    | 0      | 1        | 0            |
| Sweet_Pepper | 0     | 0    | 0      | 0        | 1            |
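Both encodings can be written down in a few lines of plain Python. This is a minimal sketch; the helper name `one_hot` and the lower-case class names are chosen for illustration.

```python
class_names = ["apple", "pear", "orange", "mushroom", "sweet_pepper"]

# Label encoding: each class name is mapped to its position in the list
label_of = {name: idx for idx, name in enumerate(class_names)}


def one_hot(label, n_classes):
    """Return a list of length n_classes with a single 1 at position label."""
    vector = [0] * n_classes
    vector[label] = 1
    return vector


for name in class_names:
    print("{:13s} {}  {}".format(name, label_of[name],
                                 one_hot(label_of[name], len(class_names))))
```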

Using the one-hot encoding for representing the result vector in multi-class classification tasks with neural networks has a major advantage compared to the label encoding: The neural network does not compute crisp class assignments with one vector element being 1 and all others being equal to 0. Instead, the output neurons are activated to a different extent. The more a neuron is activated, the higher the probability of the respective class. (This is a result representation that cannot be realised using label encoding.)

For multi-class classification it is required to design a neural network such that its *output layer* has as many neurons as there are target classes. In order to convert the activations into a probability distribution the *Softmax* activation function can be used. It converts a vector $\vec{a}$ of activations into an equally sized vector of probabilities $\vec{y}$, such that for each vector element $y_i$ we have $y_i \in [0, 1]$ and $\sum_i y_i = 1$.

<img src="softmax.png" width="400">
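In its usual definition, the *Softmax* function computes $y_i = \frac{e^{a_i}}{\sum_j e^{a_j}}$. The following plain Python sketch implements this formula for a list of numbers. The function name `softmax_plain` and the example activations are assumptions for illustration.

```python
import math


def softmax_plain(activations):
    """Softmax of a list of floats: y_i = exp(a_i) / sum_j exp(a_j)."""
    # Subtracting the maximum does not change the result,
    # but keeps the exponentials in a numerically safe range
    shift = max(activations)
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


example_activations = [2.0, 1.0, 0.1, -1.0, 0.5]  # hypothetical values
example_probabilities = softmax_plain(example_activations)
print("Probabilities:", [round(p, 3) for p in example_probabilities])
print("Sum:          ", sum(example_probabilities))
```

The largest activation receives the largest probability, and negative activations still lead to (small) positive probabilities.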

In [None]:
import torch
import torch.nn as nn

In [None]:
a = torch.randn(1, 5)
print('Activations a:                  ', a)

In [None]:
softmax = nn.Softmax(dim=1)
print('Probabilities after softmax(a): ', softmax(a))
print('Sum of the probabilities:       ', torch.sum(softmax(a)))

Another frequently used activation function is *LogSoftmax*. It computes the logarithm of the results of the *Softmax* function.

The *LogSoftmax* function delivers values in the range $(-\infty, 0]$. This can be explained by the logarithm function:

In [None]:
import matplotlib.pyplot as plt
import math
%matplotlib inline
xdata = torch.linspace(0, 1.5, steps=100)  # recent PyTorch versions require the number of steps

In [None]:
plt.plot(xdata, torch.log(xdata))
plt.grid(True)
plt.axvline(1, color="red", linestyle="--")
plt.axvline(0, color="red", linestyle="--")
plt.show()

This emphasizes the differences: high probabilities (close to 1) are mapped to values close to 0, while low probabilities (close to 0) are pushed towards $-\infty$. Besides this, the *LogSoftmax* function has several numerical and computational advantages over *Softmax*.
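One commonly cited numerical advantage can be illustrated in plain Python: computing the logarithm of the *Softmax* literally can overflow for large activations, whereas the combined computation can be rearranged so that no large exponentials occur. This is a sketch of the idea and makes no claim about how PyTorch implements it internally; the function names and values are assumptions.

```python
import math


def log_softmax_naive(activations):
    """Literal implementation: logarithm of the softmax."""
    exps = [math.exp(a) for a in activations]
    total = sum(exps)
    return [math.log(e / total) for e in exps]


def log_softmax_stable(activations):
    """Rearranged form: a_i - max - log(sum_j exp(a_j - max))."""
    shift = max(activations)
    log_total = math.log(sum(math.exp(a - shift) for a in activations))
    return [a - shift - log_total for a in activations]


large = [1000.0, 1001.0, 1002.0]  # hypothetical, deliberately large activations

try:
    print(log_softmax_naive(large))
except OverflowError as err:
    print("Naive version failed:", err)

print("Stable version:", log_softmax_stable(large))
```

For moderate activations both versions give the same result up to rounding.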

In [None]:
logsoftmax = nn.LogSoftmax(dim=1)
print('Activations a:                           ', a)
print('Probabilities according to softmax(a):   ', softmax(a))
print('Activations according to  logsoftmax(a): ', logsoftmax(a))

In [None]:
plt.bar(torch.arange(len(a.flatten())) -0.3, a.flatten(), 0.2, label="Activation")
plt.bar(torch.arange(len(a.flatten())) -0.1, softmax(a).flatten(), 0.2, label="Softmax")
plt.bar(torch.arange(len(a.flatten())) +0.1, logsoftmax(a).flatten(), 0.2, label="LogSoftmax")
plt.title("Comparison of activation functions.")
plt.legend()
plt.grid()
plt.show()

For both the *Softmax* and the *LogSoftmax* activation, the length of the output vector corresponds to the number of target classes. In our example this is:

In [None]:
n_output = len(class_selection)

Thus we have:

In [None]:
print(n_input)
print(n_output)

#### Network Structure

First we define a classical *multi-layer perceptron* as we know it:

In [None]:
class MLP(nn.Module):
    
    def __init__(self):
        super(MLP, self).__init__()
        self.fc1 = nn.Linear(n_input, 200)
        self.fc2 = nn.Linear(200, 50)
        self.fc3 = nn.Linear(50, n_output)
        
    def forward(self, input):
        x = input.view(-1, n_input)
        x = torch.sigmoid(self.fc1(x))
        x = torch.sigmoid(self.fc2(x))
        x = self.fc3(x)
        return x

Instantiation of the model:

In [None]:
model = MLP()

#### Optimizer and  *loss function*

We use the performant optimization algorithm *Adam* for adjusting the weights (model parameters) with an appropriate *learning rate*.

In [None]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters(), lr=0.001)
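What "appropriate" means for the learning rate can be illustrated with a toy problem. The sketch below deliberately uses plain gradient descent instead of *Adam* and minimizes the simple function $f(w) = w^2$; the function name, the start value and the learning rates are assumptions for illustration.

```python
def gradient_descent(lr, w_start=1.0, n_steps=50):
    """Minimize f(w) = w**2 with plain gradient descent."""
    w = w_start
    for _ in range(n_steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # step against the gradient
    return w


for lr in [0.001, 0.1, 1.1]:
    print("lr = {:5.3f}  ->  w after 50 steps: {:.6g}".format(lr, gradient_descent(lr)))
```

With a very small learning rate the minimum at $w = 0$ is approached only slowly, with a suitable one quickly, and with a too large one the values move away from the minimum.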

As a *loss function* for multi-class classification, the *CrossEntropyLoss* is suitable, analogous to the *BinaryCrossEntropy* for binary classification.

In [None]:
loss_fn = nn.CrossEntropyLoss()

The ``nn.CrossEntropyLoss`` function has an important property:

It expects the prediction of the neural network and the target class as arguments in order to compute the loss.

To this end, the result of the neural network has to be passed in as an unnormalized vector that consists of the scores for the individual classes. The normalization according to the *LogSoftmax* activation function is part of the *CrossEntropyLoss* function and can be left out in the *forward pass* (in the training phase).

The expected target class is provided as a scalar value that corresponds to the class index. The class index is a value in the range $\{0, \dots, \mathit{numberClasses} - 1\}$.

In our special case we have to convert the class indexes

In [None]:
print(class_selection_idx)

into

In [None]:
print(list(range(len(class_selection_idx))))

**Example:** Let the output vector of a neural network be

In [None]:
out = torch.tensor([[3.254, 0.252, 0.542, 6.233, 1.042]])

The target class index is

In [None]:
t = torch.tensor([1])

The *CrossEntropyLoss* can be calculated as:

In [None]:
print(loss_fn(out, t))

In the training loop this has to be done batch-wise:

In [None]:
out = torch.tensor([[3.253, 6.124, 0.346, 0.446, 1.153],
                    [0.421, 5.255, 1.155, 0.421, 9.532],
                    [0.221, 0.564, 1.435, 2.351, 0.532]])
t = torch.tensor([1, 4, 3])
print(loss_fn(out, t))
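To see what happens inside the loss function, the computation can be retraced in plain Python: the loss of a single sample is the negative *LogSoftmax* value of the target class, and for a batch the per-sample losses are averaged (this corresponds to the default setting of `nn.CrossEntropyLoss`). The function names are assumptions; the numbers are the ones used above.

```python
import math


def cross_entropy(scores, target):
    """Negative log-softmax of the target class for one sample."""
    shift = max(scores)
    log_total = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_total - scores[target]


def cross_entropy_batch(batch_scores, targets):
    """Average of the per-sample losses."""
    losses = [cross_entropy(s, t) for s, t in zip(batch_scores, targets)]
    return sum(losses) / len(losses)


single_scores = [3.254, 0.252, 0.542, 6.233, 1.042]
print("Single sample, target 1:", cross_entropy(single_scores, 1))
print("Single sample, target 3:", cross_entropy(single_scores, 3))

batch_scores = [[3.253, 6.124, 0.346, 0.446, 1.153],
                [0.421, 5.255, 1.155, 0.421, 9.532],
                [0.221, 0.564, 1.435, 2.351, 0.532]]
print("Batch:", cross_entropy_batch(batch_scores, [1, 4, 3]))
```

The loss is large if the target class has a low score (target 1 in the single-sample case) and close to 0 if the target class has by far the highest score (target 3).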

#### Training Loop

In order to visualize training state during the iteration over the training loop, we need some auxiliary objects:

In [None]:
from IPython import display
from statistics import mean
loss_history = []
loss_ep = []
plt.figure(figsize = (12,8));

The training loop iterates over the different epochs and within each epoch over the batches provided by the ``DataLoader``.

In [None]:
n_epochs = 50

According to the above mentioned specialty of the *CrossEntropyLoss* function, the target argument has to be prepared in each iteration.

In [None]:
for epoch in range(n_epochs) :
    for b, batch in enumerate(data_loader) :
        optimizer.zero_grad()
        input, target = batch
        output = model(input)
        # The target returned by the DataLoader is a tensor with the original class labels
        # For the CrossEntropyLoss function we need to map this to a 1D tensor with each element 
        # a class index in [0, ..., number_of_classes-1]
        t = torch.LongTensor(len(target))   # len(target) corresponds to the batch size
        for i, e in enumerate(target):
            t[i] = torch.tensor(class_selection_idx.index(e.item()))
        loss = loss_fn(output, t)
        loss.backward()
        optimizer.step()
        loss_ep.append(loss.item())   
        
    ## For visualization purposes:
    loss_history.append(mean(loss_ep))
    loss_ep = []
    display.clear_output(wait=True)
    plt.plot(loss_history)
    display.display(plt.gcf())
    # print() returns None, so it is called directly instead of being passed to display()
    print("Epoch {:2}, loss: {}".format(epoch, loss_history[-1]))
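The element-wise mapping of the targets inside the loop is easy to read, but it could also be done for a whole batch at once with a lookup tensor. The following sketch shows this alternative. The original class indices, the number of 100 classes and the example batch are assumed values for illustration.

```python
import torch

# Assumed values: original class indices of the selected classes
selected_class_idx = [0, 51, 53, 57, 83]

# Lookup table: position = original class index, value = new index (-1 if not selected)
lookup = torch.full((100,), -1, dtype=torch.long)
lookup[torch.tensor(selected_class_idx)] = torch.arange(len(selected_class_idx))

# Hypothetical batch of original targets, as a DataLoader would deliver them
toy_target = torch.tensor([57, 0, 83, 51, 53, 0])
toy_t = lookup[toy_target]

print(toy_t)
```

The lookup table has to be created only once before the training loop; inside the loop a single indexing operation per batch is then sufficient.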

#### Evaluation of the Model

In order to evaluate the model, we use the test dataset.

This dataset is prepared the same way as the training data set.

Extraction of the data samples with selected classes:

In [None]:
print(class_selection)

In [None]:
test_target_idx = [i for (i,t) in enumerate(data_test.targets) if t in class_selection_idx]

We need a corresponding ``SubsetRandomSampler`` ...

In [None]:
data_sampler_test = SubsetRandomSampler(test_target_idx)

... and a ``DataLoader``.

In [None]:
data_loader_test = DataLoader(dataset=data_test, batch_size=10, sampler=data_sampler_test)

For evaluation functionality we use the classical metrics from ``scikit-learn``.

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

We create two lists:

- ``y_test`` to list the target classes from the test data set (expected classification)
- ``y_pred`` to list the prediction by the neural network

In [None]:
y_test = []
y_pred = []

During training we omitted the *LogSoftmax* activation because of the *CrossEntropyLoss* function. If we want to know the probabilities of the calculated class memberships we have to apply the *Softmax* function to the output of the model.

Since *Softmax* does not change the order (i.e. the ranking) of the class scores, we can also determine the class with the highest score directly from the raw output. This is done using the ``argmax`` function.

In [None]:
l = torch.randn(1,5)
print(l)
print(l.argmax())
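That the winner class is the same before and after *Softmax* can be checked with a few lines of plain Python. The scores and the helper `argmax_plain` are assumptions for illustration.

```python
import math

raw_scores = [0.3, -1.2, 2.5, 0.9, 2.4]  # hypothetical raw network outputs

shifted_exps = [math.exp(s - max(raw_scores)) for s in raw_scores]
class_probabilities = [e / sum(shifted_exps) for e in shifted_exps]


def argmax_plain(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


print("Winner from raw scores:   ", argmax_plain(raw_scores))
print("Winner from probabilities:", argmax_plain(class_probabilities))
```

The reason is that the exponential function is strictly increasing and all values are divided by the same positive sum.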

We store the winner classes of the model and the target classes in the ``y_pred`` and ``y_test`` lists that we initialized before.

In [None]:
for batch_test in data_loader_test:
    input, target = batch_test

    for t in target:
        y_test.append(class_selection_idx.index(t.item()))

    prediction = model(input)
    
    for y in prediction:
        y_pred.append(y.argmax().item())

These two lists can now be used to compute the *Confusion Matrix* and the classification report.

In [None]:
confusion = confusion_matrix(y_test, y_pred)
print('Confusion Matrix:\n', confusion)
print('\n\nClassification Report:')
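# The labels 0..4 follow the order of class_selection_idx, so the names are taken in that order
print(classification_report(y_test, y_pred, target_names=[data_test.classes[i] for i in class_selection_idx]))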
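The values in the classification report can be retraced from the confusion matrix. The following sketch uses a made-up confusion matrix for three classes, with rows for the true classes and columns for the predicted classes (the convention of scikit-learn). The function names are assumptions.

```python
# Hypothetical confusion matrix (rows: true class, columns: predicted class)
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
]


def precision(cm, k):
    """Share of the samples predicted as class k that really belong to class k."""
    return cm[k][k] / sum(row[k] for row in cm)


def recall(cm, k):
    """Share of the samples of class k that were recognized as class k."""
    return cm[k][k] / sum(cm[k])


def accuracy(cm):
    """Share of all samples that were classified correctly."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


for k in range(len(toy_confusion)):
    print("class {}: precision {:.2f}, recall {:.2f}".format(
        k, precision(toy_confusion, k), recall(toy_confusion, k)))
print("accuracy: {:.2f}".format(accuracy(toy_confusion)))
```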

### Outlook

#### Convolutional Neural Networks

In the neural networks considered so far, the image has to be converted into a long one-dimensional vector.

Typically, features in an image can be found in an area of neighboring pixels (in two dimensions). This fact is used by *Convolutional Neural Networks* (CNNs). They take the image in its original dimensionality as an input. In a series of *feature detecting layers*, neighboring pixel groups are convolved and compressed (*pooling*). After that, several *fully connected* layers are typically used. These compute the classification based on the features detected before.

Here is an example of how a Convolutional Neural Network can be defined in PyTorch.

In [None]:
class ConvNet(nn.Module):
    
    def __init__(self):
        super(ConvNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 12, kernel_size=6, stride=2, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(12, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.AdaptiveMaxPool2d((6, 6))
        )
        self.classifier = nn.Sequential(
            nn.Linear(64*6*6, 1024),
            nn.ReLU(),
            nn.Linear(1024, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_output)
        )
        
    def forward(self, input):
        x = self.features(input)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x

In [None]:
model = ConvNet()
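The layer sizes in `ConvNet` have to fit together. The following plain Python sketch traces the width and height of the feature maps through the feature detecting layers, using the usual formula for the output size of convolution and pooling layers (without dilation). The helper name `conv_output_size` is an assumption.

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output width/height of a convolution or pooling layer (dilation 1)."""
    return (n + 2 * padding - kernel_size) // stride + 1


size = 32  # input image: 32 x 32 pixels
size = conv_output_size(size, kernel_size=6, stride=2, padding=2)  # first Conv2d
print("after first convolution: ", size)
size = conv_output_size(size, kernel_size=2, stride=2)             # MaxPool2d
print("after max pooling:       ", size)
size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # second Conv2d
print("after second convolution:", size)
size = 6  # AdaptiveMaxPool2d((6, 6)) fixes the output size directly
print("flattened feature vector:", 64 * size * size)
```

The length of the flattened feature vector has to match the number of inputs of the first fully connected layer, here `64*6*6`.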

After changing the model, the optimizer must be reinitialized in order to make the new model parameters known to it:

In [None]:
optimizer = optim.Adam(model.parameters(), lr=0.001)

For an in-depth explanation of this principle, please use the literature and the PyTorch documentation.