# Business Analytics and Artificial Intelligence
Summer semester 2024

Prof. Dr. Jürgen Bock

## Foundations of Artificial Neural Networks

Artificial Neural Networks are inspired by natural neural networks, as they can be found in intelligent species with a nervous system. The fact that living beings with neural networks are able to demonstrate intelligent behavior motivates the assumption that some aspects of Artificial Intelligence can be achieved by Artificial Neural Networks.

### Learning Goals
* You are able to explain the principle of artificial neurons and the importance and requirements of activation functions.
* You can name different network architectures and calculate the *forward pass* for a given feed-forward neural network.
* You can explain the basic principle of learning in neural networks and draft the basic learning algorithm, the meaning of the individual phases, as well as the role of batch learning.
* You can name different *loss function*s and know their typical application areas.
* You are able to demonstrate the basic workflow for machine learning with neural networks in Python using *PyTorch*.
* You are able to interpret and explain data set preparation, configuration of neural networks, as well as the learning algorithm including all necessary components in *PyTorch* based on some given Python code.

### Neurons

A neuron (nerve cell) is the basic building block of a (natural) neural network. A neuron is a cell and consists of a cell body with a nucleus. Via the *dendrites* (branches reaching away from the cell body), signals are passed to the neuron from other neurons via synapses. If the strength and number of incoming signals exceed a certain threshold, an action potential is triggered that is passed on via the *axon* and transmitted to other neurons via the axon terminals.

<img src="neuron_en.png" width="800">

A neuron can be modelled artificially and can be represented as a mathematical function. Such an artificial neuron has a set of inputs, a weight for each input, an activation function and an output.

Consider the neuron $j$. The activation $a_j$, i.e. the output of the neuron, is the result of applying an activation function $g$ to the weighted sum of the neuron's inputs $a_i$. A weight $w_{i,j}$ determines how strongly input $a_i$ contributes to neuron $j$.

<img src="artificial_neuron.png" width="600">

The neuron has a special input $a_0 = 1$ that is called *Bias*. Since the bias is always 1, its contribution to the neuron's activation is only determined by the weight $w_{0,j}$. The bias guarantees that the neuron always has a learnable component that is independent of the inputs.

Mathematically an artificial neuron can be represented as follows:

$$ a_j = g(\sum_{i=0}^n a_i w_{i,j}) $$

where $a_0 = 1$.

The inputs are called $a$ since in the general case they are themselves activations of previous neurons. In the special case that the neuron is an "input neuron", these activations are input variables of the neural network, i.e., $a_i = x_i$.
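The formula can be tried out directly in plain Python. The following is a minimal sketch, not part of the original notebook: the helper name `neuron_activation` and the example weights are assumptions chosen for illustration, and the sigmoid function (introduced in the next subsection) is used as one possible choice for $g$.

```python
import math

def sigmoid(x):
    """Logistic function, used here as one possible activation function g."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_activation(inputs, weights, g=sigmoid):
    """Compute a_j = g(sum_i a_i * w_ij), with the bias input a_0 = 1 prepended.

    weights[0] is the bias weight w_0j; weights[1:] belong to the inputs.
    """
    activations = [1.0] + list(inputs)           # a_0 = 1 is the bias input
    weighted_sum = sum(a * w for a, w in zip(activations, weights))
    return g(weighted_sum)

# Hypothetical example values: two inputs and a bias weight of -1.0
a_j = neuron_activation(inputs=[0.5, 2.0], weights=[-1.0, 2.0, 0.0])
print(a_j)  # weighted sum is -1 + 1 + 0 = 0, so the sigmoid yields 0.5
```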

#### Activation functions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

xdata = np.arange(-5, 5, 0.01)

Different functions can serve as activation function $g$.

$$
threshold(x) = \left\{
\begin{array}{ll}
    1 & \mbox{if } x \geq 0 \\
    0 & \mbox{else}
\end{array} \right.
$$

In [None]:
def threshold(x):
    if x >= 0:
        return 1
    else:
        return 0

In [None]:
g = threshold
plt.plot(xdata, np.vectorize(g)(xdata))
plt.title("threshold function")
plt.show()

$$sigmoid(x) = \frac{1}{1 + e^{-x}}$$

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [None]:
g = sigmoid
plt.plot(xdata, np.vectorize(g)(xdata))
plt.title("sigmoid function")
plt.show()

The advantage of the *sigmoid* function is that it is continuously differentiable.

Both activation functions are *nonlinear*. This is what enables neurons and neural networks to approximate nonlinear functions.
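Differentiability matters because learning, as described later, relies on gradients. The derivative of the sigmoid has the closed form $sigmoid'(x) = sigmoid(x)\,(1 - sigmoid(x))$, whereas the threshold function has derivative 0 everywhere except at $x = 0$, where it is not differentiable. The sketch below (the helper names are assumptions for illustration) compares the closed form with a finite-difference approximation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Closed-form derivative: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numerical_derivative(f, x, h=1e-6):
    """Central finite difference, used as an independent check."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid_derivative(x), numerical_derivative(sigmoid, x))
```

Both columns should agree to several decimal places; at $x = 0$ the derivative takes its maximum value of 0.25.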

### Network architectures

1. Feed-forward neural networks
 * Neurons are connected in one direction only, i.e., there are no backward connections or loops. The network thus is an *acyclic directed graph*.
 * The network represents a function that maps an input vector to an output vector.
 * The network does not have any internal state (apart from the weights, which do not depend on the inputs - we don't talk about learning yet!)
2. Recurrent neural networks
 * Neurons are connected in such a way that their outputs are used as their own inputs.
 * The network thus depends on the activations of neurons from previous inputs.
 * The network can thus represent internal states (short-term memory).

We focus on feed-forward networks in this course.

Neural networks are typically arranged in layers. In a feed-forward network the activation is propagated from an input vector (*input layer*) layer by layer, until the activation reaches the last layer (*output layer*). The activation of the last layer is the output (or the result of the computation) of the neural network.

In a single-layer neural network there is only one layer of neurons, that maps the inputs directly to the outputs. In this case, every weight is responsible for a single output only. (This makes learning of the weights quite easy.)

In a multi-layer neural network (also called *multi-layer-perceptron* (MLP)), there are one or more layers between inputs and outputs. These layers are called *hidden layers*. In this case, weights in the first layers are responsible for multiple outputs.

The following example describes a three-layered neural network with one *input layer*, one *hidden layer* and one *output layer*.

<img src="simple_mlp.png" width="600">

The *input layer* corresponds to the input vector $\vec{x} = (x_1, x_2, x_3)$. The *hidden layer* consists of the neurons 4 and 5. The *output layer* consists of the single neuron 6, since the neural network computes a single output value. (In case of more than one output neuron, the neural network computes an output vector $\vec{y} = (y_1, \ldots, y_m)$. If we considered the above network without the output layer (neuron 6), it would be a single-layer network with an output vector $\vec{y} = (a_4, a_5)$.)

If we consider the activation function $g$ being the same for each neuron, the computation of $y$ in the network above would be as follows:

\begin{eqnarray}
y & = & g(w_{0,6} + w_{4,6}a_4 + w_{5,6}a_5) \\
  & = & g(w_{0,6} + w_{4,6}g(w_{0,4} + w_{1,4}x_1 + w_{2,4}x_2 + w_{3,4}x_3) + w_{5,6}g(w_{0,5} + w_{1,5}x_1 + w_{2,5}x_2 + w_{3,5}x_3))
\end{eqnarray}
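The computation above can be written down as a small function. This is a sketch under assumptions: the weight values are invented purely for illustration, the sigmoid serves as $g$, and the function name `forward_pass` is not part of the original notebook.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_pass(x, w_hidden, w_output, g=sigmoid):
    """Forward pass for one hidden layer and a single output neuron.

    w_hidden: one weight list per hidden neuron, bias weight first.
    w_output: weight list of the output neuron, bias weight first.
    """
    hidden = [g(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))) for w in w_hidden]
    return g(w_output[0] + sum(wi * ai for wi, ai in zip(w_output[1:], hidden)))

# Hypothetical weights, chosen only for illustration
w_hidden = [
    [0.0, 1.0, -1.0, 0.5],   # neuron 4: w_04, w_14, w_24, w_34
    [0.1, -0.5, 0.5, 1.0],   # neuron 5: w_05, w_15, w_25, w_35
]
w_output = [-0.5, 1.0, 1.0]  # neuron 6: w_06, w_46, w_56

y = forward_pass([1.0, 2.0, 3.0], w_hidden, w_output)
print(y)
```

The list comprehension computes $a_4$ and $a_5$; the return statement corresponds to the first line of the equation.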

#### Example: Calculation of simple Boolean operators

Let's try to compute simple Boolean functions with neural networks: AND, OR, XOR

Here are the corresponding truth tables:

| AND | 0 | 1 |   | OR | 0 | 1 |   | XOR | 0 | 1 |
|-----|---|---|---|----|---|---|---|-----|---|---|
| **0** | 0 | 0 |   | **0** | 0 | 1 |   | **0** | 0 | 1 |
| **1** | 0 | 1 |   | **1** | 1 | 1 |   | **1** | 1 | 0 |

Some Boolean functions can be computed with a single neuron, e.g., AND and OR:

In [None]:
x1 = 0
x2 = 0

#w01, w11, w21 = -1.5, 1, 1  # AND
w01, w11, w21 = -0.5, 1, 1  # OR

y = threshold(w01 + w11*x1 + w21*x2)

print("{}, {} -> {}".format(x1, x2, y))
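Instead of changing `x1` and `x2` by hand, all four input combinations can be checked in a loop. The following sketch reuses the AND and OR weights from the cell above; the helper `single_neuron` is an assumption introduced for illustration.

```python
def threshold(x):
    return 1 if x >= 0 else 0

def single_neuron(w0, w1, w2):
    """Return the Boolean function computed by one threshold neuron."""
    return lambda a, b: threshold(w0 + w1 * a + w2 * b)

neuron_and = single_neuron(-1.5, 1, 1)   # weights from the commented-out AND line
neuron_or = single_neuron(-0.5, 1, 1)    # weights from the OR line

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", neuron_and(a, b), "OR:", neuron_or(a, b))
```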

Computing the XOR function requires a multi-layered neural network, since the classes are not linearly separable.

In [None]:
x1 = 1
x2 = 0

w03, w13, w23 = -1, 1, 0
w04, w14, w24 = -2, 1, 1
w05, w15, w25 = -1, 0, 1

w06, w36, w46, w56 = -1, 1, -2, 1

y = threshold(w06 + 
              w36*threshold(w03 + w13*x1 + w23*x2) + 
              w46*threshold(w04 + w14*x1 + w24*x2) + 
              w56*threshold(w05 + w15*x1 + w25*x2))

print("{} XOR {} = {}".format(x1, x2, y))
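To confirm that these weights really compute XOR for every input combination, the network can be wrapped in a function and evaluated on the full truth table. This is a sketch; the function name `xor_network` is an assumption, while the weights are the ones from the cell above.

```python
def threshold(x):
    return 1 if x >= 0 else 0

def xor_network(a, b):
    """Two-layer threshold network using the weights from the cell above."""
    a3 = threshold(-1 + 1 * a + 0 * b)   # hidden neuron 3: active if a = 1
    a4 = threshold(-2 + 1 * a + 1 * b)   # hidden neuron 4: active if a = b = 1
    a5 = threshold(-1 + 0 * a + 1 * b)   # hidden neuron 5: active if b = 1
    return threshold(-1 + 1 * a3 - 2 * a4 + 1 * a5)   # output neuron 6

for a in (0, 1):
    for b in (0, 1):
        print("{} XOR {} = {}".format(a, b, xor_network(a, b)))
```

Hidden neuron 4 acts as an AND detector whose negative output weight cancels the contributions of neurons 3 and 5 when both inputs are 1.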

<img src="mlp_xor.png" width="600">

### Learning in multi-layered neural networks

Training a neural network means to determine the weights in a way that the network computes the right (expected) output vector $\vec{y}$ for a given input vector $\vec{x}$. The network hence "learns" a function $\vec{h}_W(\vec{x})$, that is parameterized with weights $w \in W$.

According to the procedure of *supervised learning* training data is used, for which the result vector is known. A neural network can thereby be trained as a classifier or as a regressor. The difference is merely the interpretation of the output vector.

#### Loss Function

Adjustment of the weights begins by determining the error (*loss*) of the neural network. The error is computed using a so-called *loss function* and describes the magnitude of the difference between the result of the network's computation and the expected result. The *loss function* is generally defined as $E(\vec{h}_W(\vec{x}), \vec{t})$, i.e., as a function that calculates the error based on the result of the neural network for input $\vec{x}$ and the expected target vector $\vec{t}$.

Depending on the kind of problem to be solved, there are different *loss functions* that are suitable, e.g.
- *Mean-Squared-Error (MSE)* for regression problems,
- *Cross Entropy Loss* for multi-class classification problems,
- *Binary Cross Entropy Loss* for binary classification problems,

or any other, possibly self-defined function that best describes the deviation in the use case at hand.
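For a single sample, the three loss functions named above can be sketched in plain Python as follows. The function names and example values are assumptions for illustration; they are not the *PyTorch* implementations.

```python
import math

def mse(outputs, targets):
    """Mean squared error over the components of one output vector."""
    return sum((y - t) ** 2 for y, t in zip(outputs, targets)) / len(outputs)

def binary_cross_entropy(y, t):
    """y is a predicted probability in (0, 1), t is the class label 0 or 1."""
    return -(t * math.log(y) + (1 - t) * math.log(1 - y))

def cross_entropy(probabilities, target_index):
    """probabilities is a predicted class distribution, target_index the true class."""
    return -math.log(probabilities[target_index])

print(mse([2.5, 0.0], [3.0, 1.0]))         # (0.25 + 1.0) / 2 = 0.625
print(binary_cross_entropy(0.9, 1))        # -log(0.9), about 0.105
print(cross_entropy([0.7, 0.2, 0.1], 0))   # -log(0.7), about 0.357
```

Note that `nn.CrossEntropyLoss` in *PyTorch* expects unnormalized scores (logits) rather than probabilities, so the sketch mirrors the mathematical definition, not that class's interface.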

#### Backpropagation

The goal when training neural networks is to determine the weights such that the *loss* is minimal. In order to determine the weights, the gradient of the *loss function* with respect to the individual weights is calculated. This requires the calculation of the partial derivatives $\frac{\partial E}{\partial w_{i,j}}$ of the *loss function* $E$ for each weight $w_{i,j}$. Since the neurons are arranged sequentially, this leads to a repeated use of the chain rule. Because of the high degree of interconnection in neural networks, a naive computation would evaluate the same derivatives several times, which is redundant.

The backpropagation algorithm is an optimized approach to compute the gradient by using dynamic programming. The algorithm propagates the *loss* backwards through the layers of the neural network (from the output layer to the input layer). Thereby, the gradient is computed layer by layer and redundant calculations are avoided.
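The idea can be illustrated on the smallest possible multi-layer case: a chain of one input, one hidden neuron and one output neuron. The sketch below rests on assumptions chosen for illustration (sigmoid activations, squared error $E = (y - t)^2$, invented weights). The intermediate term `delta2` is computed once at the output and reused for the hidden neuron, and the result is checked against finite differences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss(params, x, t):
    """Forward pass through the chain x -> hidden -> output, squared error."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    y = sigmoid(w2 * a1 + b2)
    return (y - t) ** 2

def backprop(params, x, t):
    """Gradient of the loss with respect to (w1, b1, w2, b2) via the chain rule."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)                   # forward pass, intermediates kept
    y = sigmoid(w2 * a1 + b2)
    delta2 = 2.0 * (y - t) * y * (1.0 - y)      # dE/dz2 at the output neuron
    delta1 = delta2 * w2 * a1 * (1.0 - a1)      # dE/dz1, reuses delta2
    return [delta1 * x, delta1, delta2 * a1, delta2]

def numerical_gradient(params, x, t, h=1e-6):
    """Central finite differences, used as an independent check."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grads.append((loss(up, x, t) - loss(down, x, t)) / (2.0 * h))
    return grads

params = [0.5, -0.3, 0.8, 0.1]   # hypothetical weights w1, b1, w2, b2
print(backprop(params, x=1.5, t=1.0))
print(numerical_gradient(params, x=1.5, t=1.0))
```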

For illustration purposes consider the following plot. (This is not a realistic example from a neural network, but merely illustrates the relation between weights and *loss*.)

Assume there are two weights in a neural network $w_1$ and $w_2$. For a given (and "fixed") input vector $\vec{x} = (x_1, x_2)$ the *loss* computes as $loss = E(\vec{h}_{\{w_1,w_2\}}(\vec{x}), t)$ based on $w_1$ and $w_2$. (Here, for the sake of illustration, this is an arbitrary function.)

In [None]:
w1data, w2data = np.meshgrid(np.arange(-1, 1, 0.1), np.arange(-1, 1, 0.1))
z = 3 + ((np.sin(w1data*4) * np.cosh((w2data*3)-0.5))/5)
  
plt.figure(figsize =(15, 10)) 
axes = plt.axes(projection ='3d') 
axes.set_xlabel('$w_1$', fontsize=16, labelpad=15)
axes.set_ylabel('$w_2$', fontsize=16, labelpad=15)
axes.set_zlabel('loss', fontsize=16, labelpad=15)
axes.plot_surface(w1data, w2data, z, cmap=plt.get_cmap('hot'))
plt.show()

We are looking for values of the weights for which the *loss* is minimal. Note that this illustration is based on a specific training example (fixed $x_1$ and $x_2$). The difficulty is to determine the best weight configuration for **all** training samples.

#### Adjusting the weights

Once the gradient is calculated, the direction in which each weight needs to be adjusted in order to reduce the *loss* is known. Note that this direction describes an adjustment towards a local improvement. Thus, there is a risk of ending up in a local minimum, in case the global minimum is behind a "hill" in the other direction, or in the same direction beyond the local minimum.

In order to mitigate this problem, sophisticated optimization algorithms are used. An important factor is the *learning rate*. It determines the magnitude by which weights are changed, i.e., the step size. A smaller value allows the minimum to be located quite accurately, but makes it hard to escape a local minimum. A larger value allows for jumping over local minima, but also tends to overshoot the global minimum repeatedly, so that training may end in a suboptimal solution.
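The effect of the step size can be seen on a deliberately simple toy loss $E(w) = w^2$ with gradient $2w$. This is an assumption made for illustration: the function has a single minimum at $w = 0$, so it shows only slow convergence and overshooting, not the escape from local minima.

```python
def gradient_descent(lr, w=1.0, steps=20):
    """Minimize the toy loss E(w) = w**2, whose gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2.0 * w    # step against the gradient, scaled by the learning rate
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With `lr=0.01` the weight is still far from the minimum after 20 steps, with `lr=0.1` it is close to 0, and with `lr=1.1` every step overshoots the minimum so that the weight moves further away from it.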

#### Training algorithm

Training a neural network means to determine the weights in a way, such that the error (*loss*) is minimal for all training samples.

The training takes place in *epochs*, where one epoch denotes one iteration over all training samples, consisting of *forward pass*, *backward pass*, and adjustment of the weights.
- **forward pass**: Calculation of the output vector of the neural network for a given input vector (one training sample)
- **backward pass**: Backpropagation of the loss to determine the gradient in the weight space.

Since the adjustment of the weights happens in small steps, several epochs are run through during training.

The generic **training algorithm** can be denoted as follows:

---
1. Initialization of the weights $w \in W$ with random values
2. Iteration over the epochs
  - Iteration over the training samples consisting of a feature vector $\vec{x}$ and target vector $\vec{t}$
    1. *forward pass*: Calculation of the output vector of the neural network $\vec{y} = \vec{h}_W(\vec{x})$
    2. Calculation of the loss: $loss = E(\vec{y}, \vec{t})$
    3. *backward pass*: Calculation of the gradient in the weight space based on $loss$, for the individual weights $\frac{\partial E(\vec{y}, \vec{t})}{\partial w},\quad \forall w \in W$
    4. Adjustment of the weights using an optimization algorithm based on the gradient
---
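The four steps can be sketched without any library for a single sigmoid neuron that learns the Boolean OR function. Everything specific in this sketch is an assumption for illustration: the binary cross entropy loss, plain gradient descent as the optimizer, the learning rate and the number of epochs. For this combination of sigmoid and binary cross entropy, the derivative of the loss with respect to the weighted sum simplifies to $y - t$.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Training samples for Boolean OR: (feature vector x, target t)
samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(3)]   # step 1: w_0 (bias), w_1, w_2
learning_rate = 0.5

for epoch in range(500):                                   # step 2: iterate over the epochs
    for x, t in samples:                                   # iterate over the training samples
        # 2.1 forward pass
        y = sigmoid(weights[0] + weights[1] * x[0] + weights[2] * x[1])
        # 2.2 loss (binary cross entropy), computed here only for illustration
        loss = -(t * math.log(y) + (1 - t) * math.log(1 - y))
        # 2.3 backward pass: dE/dz = y - t, times the input belonging to each weight
        delta = y - t
        gradient = [delta, delta * x[0], delta * x[1]]
        # 2.4 weight adjustment: plain gradient descent step
        weights = [w - learning_rate * g for w, g in zip(weights, gradient)]

predictions = [round(sigmoid(weights[0] + weights[1] * x[0] + weights[2] * x[1]))
               for x, _ in samples]
print(predictions)
```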

The iteration over the training samples is typically done in batches. This means that all training samples of one *batch* (also *mini-batch*) are passed through the network (*forward pass*) before the *loss* is computed for all samples within this batch and the gradient calculation and weight adjustment are done. The adjustment of the weights, hence, is influenced equally by all training samples within one *batch*. (Ideally, and provided the available compute resources, the calculation of all training samples within one *batch* is done in parallel.)

The main advantages of *batch*-based training are:

- Speed, since parallel computation is possible
- Better generalization of the neural network (i.e., prevention of overfitting), since training samples are not used  individually but groupwise when it comes to calculating the weights

Too large batch sizes, however, can lead to a situation where the network does not converge to an optimal configuration at all. Choosing the optimal batch size is thus an important task in the context of hyperparameter tuning.

For *batch*-based training the previously denoted algorithm changes in a way, such that the inner loop (iteration over the training samples) does not iterate over single samples, but over *batches* of samples. The computation of the *forward pass* and of the *loss function* has to be done batch-wise.
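How the inner loop changes can be sketched with two small helpers. Both names, `iterate_batches` and `mean_gradient`, are assumptions for illustration: the first splits the (optionally shuffled) samples into batches, the second averages per-sample gradients so that every sample in a batch influences the weight adjustment equally.

```python
import random

def iterate_batches(samples, batch_size, shuffle=True):
    """Yield consecutive mini-batches; the last one may be smaller."""
    indices = list(range(len(samples)))
    if shuffle:
        random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

def mean_gradient(per_sample_gradients):
    """Average the per-sample gradients of one batch component-wise."""
    n = len(per_sample_gradients)
    return [sum(components) / n for components in zip(*per_sample_gradients)]

batches = list(iterate_batches(list(range(10)), batch_size=4))
print([len(b) for b in batches])                  # [4, 4, 2]
print(mean_gradient([[1.0, 2.0], [3.0, 4.0]]))    # [2.0, 3.0]
```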

## Artificial Neural Networks with PyTorch

*PyTorch* simplifies working with neural networks in multiple ways. Many of the structural and algorithmic details of neural networks are encapsulated by the *PyTorch* API, so a user can focus on the configuration of the hyperparameters.

In [None]:
import torch

In the following we will create and train simple neural networks with *PyTorch*. We therefore consider the learning task "classification".

### Data

The *scikit-learn* library provides several functions to create artificial sample data that can be used to test classifiers and regressors.

In [None]:
from sklearn import datasets

Generating a data set can be configured in a variety of ways, in order to create specific challenges for the classifier. For the sake of simplicity we create a data set with two features per data sample and the classification into two linearly separable classes.

In [None]:
data_ls = datasets.make_classification(
    n_samples=10000,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    class_sep=2,
    flip_y=0,
    weights=[0.1,0.9],
    random_state=7 )

The data set consists of data vectors $X$ and the target vector $t$.

In [None]:
X, t = data_ls

In [None]:
print('Features X:')
print(X)
print('\nTarget t:')
print(t)

Scatter plot of the data set:

In [None]:
plt.figure(figsize = (12, 8))
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.scatter(X[:,0], X[:,1], c=t, s=3)
plt.show()

In order to use the data set in *PyTorch*, we need to convert it into a *PyTorch*-compatible format. In this case this is a generic ``TensorDataset``.

In [None]:
from torch.utils.data import TensorDataset

In [None]:
dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(t))

Accessing the data is done via a `DataLoader` that is provided in the *PyTorch* module `torch.utils.data`.

In [None]:
from torch.utils.data import DataLoader

The `DataLoader` serves data from the data set for further processing. To this end, the data can be retrieved batch-wise. `batch_size` determines the size of these batches. The parameter `shuffle=True` makes the ``DataLoader`` deliver data samples in random order.

In [None]:
data_loader = DataLoader(dataset=dataset, batch_size=10, shuffle=True)

Let's inspect the *batches* as they are delivered by the ``DataLoader``.

In [None]:
for batch in data_loader:
    input, target = batch
    print(input)
    print(target)
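Printing 1000 batches is hard to read. The batching behaviour is easier to see on a tiny data set, as in the following sketch. The toy tensors and the `toy_` names are assumptions chosen so that the variables used in this notebook are not overwritten.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data: 7 samples with 2 features each, and 7 class labels
toy_X = torch.arange(14, dtype=torch.float32).reshape(7, 2)
toy_t = torch.tensor([0, 1, 0, 1, 0, 1, 0])

toy_loader = DataLoader(TensorDataset(toy_X, toy_t), batch_size=3, shuffle=True)

batch_shapes = [(inputs.shape, targets.shape) for inputs, targets in toy_loader]
print(len(toy_loader))   # number of batches per epoch
print(batch_shapes)      # the last batch holds the one remaining sample
```

With 7 samples and `batch_size=3` the loader yields batches of size 3, 3 and 1. If the incomplete last batch is not wanted, `DataLoader` accepts `drop_last=True`.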

### One Neuron

Since the classes in this first example are linearly separable, a classifier can be realized by a single neuron.

Since *PyTorch* is made for specifying multi-layered neural networks, the definition of a single neuron is perhaps not as straightforward as one might expect. Specifically, we consider the single neuron as a network with a single layer that maps two input variables to one output variable. (A single-layered network with exactly one output value is always a single neuron.)

In *PyTorch* neural networks are implemented as Python classes that inherit from the class ``nn.Module`` that is defined in *PyTorch*. To this end, a method ``__init__`` is defined as class constructor, which defines the individual layers. Next to the constructor a method called ``forward`` is defined, which implements the *forward pass*. That is, it passes the input vector to the first layer and passes the result of each layer on to the next one via the activation functions, which are also specified here.

In this simple case we are using a single layer of type ``nn.Linear``, i.e., a *fully connected* layer, that we were solely considering so far. As activation function we are using the *sigmoid* function.

In [None]:
from torch import nn

In [None]:
class Neuron(nn.Module):
    def __init__(self):
        super(Neuron, self).__init__()   # Call the super class (nn.Module)
        self.neuron = nn.Linear(2, 1)    # Definition of the single layer, stored under the name "neuron"
                                         #   "self" refers to the current instance of the class, so
                                         #   the layer "neuron" belongs to this instance
        
    def forward(self, x):                # "self" is passed implicitly,
                                         # so only the input vector x has to be provided
        x = self.neuron(x)               # Passing the input vector through the first layer ...
        x = torch.sigmoid(x)             # ... and through the activation function
        return x

Now, the class can be instantiated:

In [None]:
model = Neuron()

The structure of the neural network can be printed:

In [None]:
print(model)

Also the weights can be shown. These are denoted as ``parameters`` in the class ``nn.Module``.

In [None]:
parameters = list(model.parameters())
print('Initial weights of the first layer:\n', parameters[0])
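The cell above prints only the first parameter tensor, which holds the input weights. The bias weight is stored separately. The sketch below lists both by name for a stand-alone `nn.Linear(2, 1)` layer; the variable name `layer` is an assumption so that `model` is left untouched.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)   # same layer type as in the Neuron class above

for name, parameter in layer.named_parameters():
    print(name, tuple(parameter.shape))
# weight has shape (1, 2): one row per output, one column per input
# bias has shape (1,): it plays the role of w_0 in the neuron formula
```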

### Training of the model

In order to train the neural network, we follow the generic training algorithm as presented above. To this end, we iterate in an outer loop over a given number of epochs.

In [None]:
num_epochs = 20

In order to calculate the error, we need a *loss function*. For binary classification problems, like in this example, the *Binary Cross Entropy Loss* function is suitable. As many other *loss functions* it is provided by *PyTorch* and is implemented in the class ``nn.BCELoss``.

In [None]:
loss_fn = nn.BCELoss()

Moreover, we need an optimizer that adjusts the weights in the neural network based on the gradient. A standard optimization algorithm is *Stochastic Gradient Descent*, which is implemented in the *PyTorch* module ``optim`` as ``optim.SGD``. It needs to know the weights that it has to optimize, as well as a *learning rate* (`lr`).

In [None]:
from torch import optim

In [None]:
optimizer = optim.SGD(model.parameters(), lr=0.001)

For each epoch, we iterate batch-wise over the training samples.

In order to show the progress visually we need a module from ``IPython``. Also, we save the progress of the *loss* in a list.

In [None]:
from IPython import display
loss_history = []
plt.figure(figsize = (12,8));

The *batches* are provided by the ``DataLoader``. The provided *batch* object can be split into Input and Target (i.e., the class label).

Before each new calculation of the gradient it must be reset, otherwise there will be an unwanted summation.

The input vector will now be propagated through the neural network in the *forward pass*. From the calculated output vector and the expected target vector the *loss function* computes the error. Based on that error (*loss*) the *backward pass* is carried out in order to calculate the gradient. *PyTorch* stores the gradient and its components directly inside of the data structures (the ``parameters``) of the neural network, so the optimizer can operate directly on them.

In [None]:
for epoch in range(num_epochs):
    for batch in data_loader :
        optimizer.zero_grad()
        input, target = batch
        output = model(input.float())
        loss = loss_fn(output, torch.unsqueeze(target.float(), 1))
        loss.backward()
        optimizer.step()
    
    ## For visualization purposes:
    loss_history.append(loss.item())
    plt.plot(loss_history)
    display.clear_output(wait=True)
    display.display(plt.gcf())
    print("Epoch {:2}, loss: {}".format(epoch, loss.item()))
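Why the gradient has to be reset at the start of every iteration can be seen on a single scalar parameter. The following sketch is an illustration with an assumed toy function $w^2$; the manual `w.grad.zero_()` stands in for what `optimizer.zero_grad()` achieves for all parameters of a model.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * w).backward()            # d(w^2)/dw = 2w = 4
first = w.grad.item()

(w * w).backward()            # without a reset, the new gradient is added
accumulated = w.grad.item()

w.grad.zero_()                # manual reset of the stored gradient
(w * w).backward()
after_reset = w.grad.item()

print(first, accumulated, after_reset)   # 4.0 8.0 4.0
```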

An interesting visualization is the *decision boundary*. A function for plotting it is available in the provided module ``dataview``.

In [None]:
import dataview

The *decision boundary* shows the border line between the classes, which the neural network has learned. For a single neuron, this is always a straight line (linear function).

In [None]:
dataview.plot_decision_boundary2d(model, X, t)
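That the boundary of a single sigmoid neuron is a straight line follows from the neuron formula: the output equals 0.5 exactly where $w_0 + w_1 x_1 + w_2 x_2 = 0$, which can be solved for $x_2$ whenever $w_2 \neq 0$. The sketch below uses invented weights, not the ones learned above, and the helper name `boundary_x2` is an assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def boundary_x2(x1, w0, w1, w2):
    """Solve w0 + w1*x1 + w2*x2 = 0 for x2 (requires w2 != 0)."""
    return -(w0 + w1 * x1) / w2

# Hypothetical weights of a trained neuron (bias w0, input weights w1 and w2)
w0, w1, w2 = -1.0, 2.0, 4.0

boundary_points = [(u, boundary_x2(u, w0, w1, w2)) for u in (-1.0, 0.0, 1.0)]
for u, v in boundary_points:
    print(u, v, sigmoid(w0 + w1 * u + w2 * v))   # the output is 0.5 on the boundary
```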

### Multi-layered networks

In an analogous way as we defined and trained the single neuron using *PyTorch*, we can represent and train multi-layered and also rather complex network structures.

#### Data

Also in this case we are using a synthetic data set generated by *scikit-learn*. Again we consider a data set with two features and two target classes.

In [None]:
data_moons = datasets.make_moons(
    n_samples = 10000,
    noise = 0.2 )

We split the data into feature vectors and target vector.

In [None]:
X, t = data_moons

We can plot the data the same way.

In [None]:
plt.figure(figsize = (12, 8))
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.scatter(X[:,0], X[:,1], c=t, s=1)
plt.show()

Here it becomes obvious that the classes in this data set are not linearly separable.

We need a corresponding ``TensorDataset`` and a ``DataLoader``.

In [None]:
dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(t))
data_loader = DataLoader(dataset=dataset, batch_size=10, shuffle=True)

#### Neural Network

We define a neural network with three fully connected layers. (A Multi-Layer-Perceptron, MLP):

In [None]:
from torch.nn import functional as F

class MLP( nn.Module ):
    def __init__( self ):
        super( MLP, self ).__init__()
        self.fc1 = nn.Linear( 2, 10 )        
        self.fc2 = nn.Linear( 10, 5 )
        self.fc3 = nn.Linear( 5, 1 )

        
    def forward( self, x ):
        x = F.relu( self.fc1( x ) )
        x = F.relu( self.fc2( x ) )
        x = torch.sigmoid( self.fc3( x ) )
        return x

In [None]:
model = MLP()
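The two hidden layers above use the ReLU (rectified linear unit) activation via `F.relu`, which is defined as $relu(x) = \max(0, x)$. Like the threshold and sigmoid functions it is nonlinear. A plain-Python sketch for scalar inputs (the function name `relu` is chosen here for illustration):

```python
def relu(x):
    """Rectified linear unit: passes positive values, clips negative ones to 0."""
    return max(0.0, x)

print([relu(x) for x in (-2.0, -0.5, 0.0, 1.5)])   # [0.0, 0.0, 0.0, 1.5]
```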

Optimizer and *loss function* can be used as in the previous example. However, we need to instantiate the optimizer with the new instance of ``model``.

In [None]:
optimizer = optim.SGD(model.parameters(), lr=0.001)

The training loop remains unchanged. (For demonstration purposes we define the training loop again here, although normally we would consider defining a function that implements the training loop.)

First, however, we need to reset the visualization helper objects:

In [None]:
loss_history = []
plt.figure(figsize = (12,8));

Here, we need a few more epochs.

In [None]:
num_epochs = 30

Then the training loop:

In [None]:
for epoch in range(num_epochs):
    for batch in data_loader :
        optimizer.zero_grad()
        input, target = batch
        output = model(input.float())
        loss = loss_fn(output, torch.unsqueeze(target.float(), 1))
        loss.backward()
        optimizer.step()
    
    ## For visualization purposes:
    loss_history.append(loss.item())
    plt.plot(loss_history)
    display.clear_output(wait=True)
    display.display(plt.gcf())
    print("Epoch {:2}, loss: {}".format(epoch, loss.item()))

The *decision boundary* looks as follows:

In [None]:
dataview.plot_decision_boundary2d(model, X, t, showData=True)
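As remarked earlier, the training loop would normally be wrapped in a function instead of being repeated. One possible shape of such a function is sketched below, without the plotting code. The function name `train`, its signature and the toy data in the usage part are assumptions for illustration, not part of the original notebook.

```python
import torch
from torch import nn, optim
from torch.utils.data import TensorDataset, DataLoader

def train(model, data_loader, loss_fn, optimizer, num_epochs):
    """Run the training loop and return the last batch loss of every epoch."""
    loss_history = []
    for epoch in range(num_epochs):
        for inputs, targets in data_loader:
            optimizer.zero_grad()                                   # reset stored gradients
            outputs = model(inputs.float())                         # forward pass
            loss = loss_fn(outputs, targets.float().unsqueeze(1))   # loss for this batch
            loss.backward()                                         # backward pass
            optimizer.step()                                        # weight adjustment
        loss_history.append(loss.item())
    return loss_history

# Usage on hypothetical toy data (not the data sets used above)
torch.manual_seed(0)
demo_X = torch.randn(40, 2)
demo_t = (demo_X[:, 0] + demo_X[:, 1] > 0).long()
demo_loader = DataLoader(TensorDataset(demo_X, demo_t), batch_size=10, shuffle=True)
demo_model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
history = train(demo_model, demo_loader, nn.BCELoss(),
                optim.SGD(demo_model.parameters(), lr=0.1), num_epochs=5)
print(history)
```

Returning the loss history keeps the function independent of any particular visualization; the list can be plotted afterwards with `plt.plot(history)`.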