<a href="https://colab.research.google.com/github/MatzeLopi/KIT-2400024/blob/main/DLNN_SS24_Praktikum1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Praktikum 1 - Simple Neural Network with NumPy and PyTorch

Note: the praktikums are for your own practice. They will **not be graded**!

You have around one week to work on it. Then we will go over the solutions together in the praktikum time slots!

Remember to make a copy of this notebook to your own Colab. Changes made directly here will not be stored!

In this praktikum, we will build a perceptron model and a Multi-Layer Perceptron (MLP) model **from scratch** with NumPy. This is intended to help you understand in detail how neural networks work internally.

Next, we will build an MLP with [**PyTorch**](https://pytorch.org/get-started/locally/), a widely used framework for building deep learning models.

Our first challenge is solving the **XOR task** that you've seen in the lecture, before we move to a slightly more complex problem, namely the **Iris dataset**.

**Notice**: Whenever you see an ellipsis `...` or a TODO comment, you are supposed to insert code or a text answer.

## Exercise 1: XOR Task with Single Perceptron (from scratch)

XOR (exclusive OR) is a logic function that gives 1 as an output when the number of true inputs is odd, otherwise it outputs a 0. Our goal is to model this function using neurons. We'll start with a single neuron.
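As a quick illustration of the definition above, the XOR truth table for two binary inputs can be reproduced with NumPy's `np.logical_xor`. The variable names are chosen for this sketch only.

```python
import numpy as np

# All four combinations of two binary inputs
xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# XOR is 1 exactly when the number of true inputs is odd
xor_outputs = np.logical_xor(xor_inputs[:, 0], xor_inputs[:, 1]).astype(int)

for (a, b), out in zip(xor_inputs, xor_outputs):
    print(f"{a} XOR {b} = {out}")
```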

<center><img src="https://www.xplore-dna.net/pluginfile.php/286/mod_page/content/21/Tabelle%20-%20XOR.png" width="250"/></center>

Let's start with importing some necessary dependencies that we will need throughout the notebook.

In [2]:
import numpy as np

In the first part of this exercise you'll build a perceptron, a single neuron, that takes two binary input values and returns a binary output value.

<center><img src="https://i.stack.imgur.com/eBSki.jpg" width="280" /></center>

A perceptron can be seen as a single neuron that maps an input $\textbf{x}$ to an output $o$ using weights $\textbf{w}$ and a bias $b$, where $\cdot$ denotes the dot product.

$o = \textbf{w}\cdot \textbf{x}+b$
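To make the formula concrete, here is a small worked example with made-up numbers (the weights, input, and bias are assumptions for illustration, not values from the exercise).

```python
import numpy as np

# Hypothetical example values, not taken from the exercise
w = np.array([0.5, -0.25])   # weights
x = np.array([1.0, 1.0])     # input
b = 0.1                      # bias

# o = w . x + b = 0.5 * 1 + (-0.25) * 1 + 0.1
o = np.dot(w, x) + b
print(o)
```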

#### Perceptron Update Rule


The Perceptron Update Rule is a procedure specific to training a single-perceptron model, which we can apply to binary classification problems. This process [has been proven to converge](https://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf) if the data is linearly separable and the learning rate is small enough.


Let's use it here to have a first baseline.

For classification problems, $o>0$ is interpreted as class 1, and $o<0$ is interpreted as class 0.

For updating the associated weights, we can use the following update rule:

$w_i = w_i + \Delta w_i$

where

$\Delta w_i = \eta(t-o)x_i$

- $t$ is the target
- $o$ is the output
- $\eta$ is the learning rate (a small constant)
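A single update step can be traced by hand. The sketch below uses assumed values (one sample, a zero-initialized neuron, and a learning rate of 1) and only illustrates the arithmetic of the rule; it is not the exercise solution.

```python
import numpy as np

eta = 1.0                     # learning rate (hypothetical value)
w = np.array([0.0, 0.0])      # current weights
x = np.array([1.0, 0.0])      # one input sample
t = 1.0                       # target for this sample
o = 0.0                       # output for this sample (assumed for the example)

delta_w = eta * (t - o) * x   # weight change given by the rule above
w = w + delta_w
print(w)
```

Only the weight belonging to the active input changes, because the change is proportional to $x_i$.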

### Implementation of a Perceptron

In [3]:
class perceptron_implementation():
    def __init__(self):
        # TODO:
        # Initialize weights
        # For perceptrons, it's possible to initialize all weights with 0

        self.neuron_weights = np.zeros(2)
        self.bias = 0

    def forward_pass(self, x:np.array):
        # Implement  o = x * w + b
        output = sum(self.neuron_weights * x) + self.bias

        return output

    def perceptron_update_rule(self, target: np.ndarray, prediction: np.ndarray, x: np.ndarray, learning_rate: float = 1.0):
        # Perform perceptron update rule that is defined above
        # use self.neuron_weights
        # TODO
        
        weight_delta = learning_rate * (target - prediction) * x

        new_weights = ...

        self.neuron_weights = new_weights

    def train(self, input_data, targets):
        """
        input_data: Multi-dimensional array that contains all inputs
        """
        # TODO
        for x, y in zip(input_data, targets):
            ...
        # END TODO

    def inference(self, input_data):
        # TODO
        outputs = []
        for x in input_data:
            ...

        return np.array(outputs)

### Training

In [4]:
perceptron = perceptron_implementation()

# TODO
input_data = ...
targets = ...

# train the corresponding single neuron
...


### Inference

In [5]:
# TODO
# Test the trained model

predictions = ...
print(predictions)



### Evaluation

For evaluation, we will need to consider appropriate metrics. For classification tasks, **accuracy** is one of the most common metrics.

It is defined as:

$\textrm{Accuracy}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}(y_i=\hat{y}_i)$

where $y$ is an array of our target values, $\hat{y}$ is an array of our predictions, and $\mathbb{1}(\cdot)$ equals 1 if the condition holds and 0 otherwise.

To compute accuracy from probability outputs, we need a threshold that converts the predicted probabilities into binary `(0,1)` predictions. We will set this threshold to `0.5`. Our perceptron does not need this, since it already outputs binary values. However, we will reuse the `accuracy` function later on, so it should treat the predictions as probabilities.
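The thresholding step can be illustrated with a few made-up probabilities. In this sketch a value strictly greater than `0.5` counts as class 1; whether the boundary itself belongs to class 1 is an assumption here.

```python
import numpy as np

probs = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical predicted probabilities
y_true = np.array([1, 0, 0, 0])          # hypothetical targets

y_hat = (probs > 0.5).astype(int)        # threshold at 0.5 gives [1, 0, 1, 0]
acc = np.mean(y_hat == y_true)           # fraction of matching entries
print(acc)
```

Three of the four thresholded predictions match their targets, so the accuracy is 0.75.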

In [6]:
def accuracy(predictions: np.ndarray, targets: np.ndarray, threshold=0.5) -> float:
    # TODO
    # Implement the accuracy metric
    ...
    # END TODO
    return accuracy_value

In [7]:
# TODO
# Call accuracy function and provide necessary inputs to calculate accuracy
accuracy_value = ...
print(accuracy_value)



You will see that it is not possible to reach 100% accuracy, since XOR is not a linearly separable problem.

## Exercise 2: XOR Task with MLP (from scratch)

As mentioned in the lecture, unlike a single perceptron, a Multi-Layer Perceptron (MLP) can deal with problems that are not linearly separable, such as XOR.

Now we will implement an MLP with 3 hidden layers and a hidden dimension of 3. We will also add an activation function to introduce nonlinearity in our hidden layers.

<img src="https://i.imgur.com/IUQ05Ol.png">

### Initializing Weights

Xavier initialization is commonly used to initialize the weights of a network. It draws each weight from a uniform distribution bounded by $\pm\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}}$, where $n_i$ is the number of incoming network connections and $n_{i+1}$ is the number of outgoing network connections.

In [8]:
def xavier_initialization(input_size, output_size) -> np.ndarray:
    """ Returns a numpy array of initialized weights """
    bound = np.sqrt(6) / np.sqrt(input_size + output_size)
    weights = np.random.uniform(-bound, bound, size=(input_size, output_size))
    return weights
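As a sanity check of the bound, the sketch below recomputes it for an assumed layer with 2 inputs and 3 outputs and verifies that sampled weights stay inside it. It uses NumPy's `default_rng` with a fixed seed purely to make the example reproducible.

```python
import numpy as np

n_in, n_out = 2, 3                                   # e.g. 2 inputs feeding 3 hidden units
bound = np.sqrt(6) / np.sqrt(n_in + n_out)           # sqrt(6 / 5)

rng = np.random.default_rng(0)                       # seeded generator (assumption for reproducibility)
weights = rng.uniform(-bound, bound, size=(n_in, n_out))

print(round(bound, 4))
print(weights.shape, np.all(np.abs(weights) <= bound))
```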


### Feed-Forward Layer


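A feed-forward layer applies a linear transformation to the input $x$ using a weight matrix $\textbf{W}$ and a bias vector $b$:

$z = x\textbf{W}^T+b$

Here $\textbf{W}$ has shape (output size, input size). Note that `xavier_initialization` above returns an array of shape `(input_size, output_size)`, which already corresponds to $\textbf{W}^T$.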

Derivatives:
$$
\dfrac{dz}{dw_i} = x_i
$$

$$
\dfrac{dz}{db} = 1
$$

$$
\dfrac{dz}{dx_i} = w_i
$$
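The three derivatives above can be checked numerically. The following sketch assumes the weights are stored with shape `(input_size, output_size)`, as returned by `xavier_initialization`, and uses a made-up upstream gradient `g`; it compares one analytic weight gradient with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 2))      # one sample with 2 features
W = rng.normal(size=(2, 3))      # stored as (input_size, output_size)
b = np.zeros((1, 3))
g = rng.normal(size=(1, 3))      # hypothetical upstream gradient dL/dz

def scalar_loss(x, W, b):
    # A scalar whose gradient with respect to z equals g
    return np.sum((x @ W + b) * g)

# Analytic gradients via the chain rule
d_W = x.T @ g                    # uses dz/dw = x
d_b = g                          # uses dz/db = 1
d_x = g @ W.T                    # uses dz/dx = w

# Finite-difference estimate for a single weight entry
eps = 1e-6
W_shift = W.copy()
W_shift[0, 1] += eps
numeric = (scalar_loss(x, W_shift, b) - scalar_loss(x, W, b)) / eps
print(np.isclose(numeric, d_W[0, 1], atol=1e-4))
```

The gradient with respect to the input has the same shape as the input, which is what allows it to be passed on to the previous layer.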

In [9]:
class FeedForwardLayer():
    def __init__(self, input_size, output_size):
        """
        Args:
            input_size (int): Input shape of the layer
            output_size (int): Output of the layer
        """
        # initialize weights with Xavier initialization and biases with zeros
        self.weights = xavier_initialization(input_size, output_size)
        self.biases = np.zeros((1, output_size))

    def forward(self, x):
        """
        Forward pass

        Args:
            x (np.ndarray): input to the layer
        """
        self.x = x

        # Calculate the output
        output = ...

        return output

    def backward(self, d_values, learning_rate):
        """
        Backpropagation

        Args:
            d_values (np.ndarray): Gradient of the loss with respect to the layer output
            learning_rate (float): Learning rate for gradient descent
        """

        # Calculate the gradients with respect to the weights and the biases
        d_weights = ...
        d_biases = ...

        # Calculate the gradient with respect to the input
        d_inputs = ...

        # Update the weights and biases using the learning rate and their derivatives
        self.weights = ...
        self.biases = ...

        return d_inputs

**Question**: Why do we need to calculate `d_weights`, `d_biases` and `d_inputs`?

**Answer**: ...

### Adding Nonlinearity

For nonlinearity, you should implement the Rectified Linear Unit (ReLU) and apply it between the hidden layers.

$$ y = \max(0, x) $$

ReLU is piecewise linear: it is the zero function for negative inputs and the identity for positive inputs. It has no learnable parameters, and both the function and its derivative are cheap to compute, which keeps training simple yet effective.



<center><figure><img src="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png" width="450"/><figcaption>Graph of the ReLU activation function. <a href="https://machinelearningmastery.com/wp-content/uploads/2018/10/Line-Plot-of-Rectified-Linear-Activation-for-Negative-and-Positive-Inputs.png">Image source</a></figcaption></figure></center>


Derivative of ReLU:

$\dfrac{dy}{dx} = 1$ if $x \geq 0$

$\dfrac{dy}{dx} = 0$ if $x < 0$
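For a handful of sample values, the function and the derivative convention stated above can be evaluated directly with NumPy. This is only an illustration on a fixed array, not the class you are asked to implement.

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.5])

y = np.maximum(0, x)               # ReLU applied elementwise
dy_dx = (x >= 0).astype(float)     # 1 where x >= 0, otherwise 0
print(y)
print(dy_dx)
```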

In [10]:
class ReluActivationFunction():

    def forward(self, x):
        # TODO
        self.x = x
        output = ...

        return output

    def backward(self, d_values):
        # calculate the gradients with help of the derivative
        d_inputs = ...
        return d_inputs



### Backpropagation

The perceptron algorithm cannot be generalized to MLPs, which is why we will now use **backpropagation**.

<center><img src="https://i.imgur.com/LgBzpYD.png" width="400" /></center>

### Loss Function: Binary Cross Entropy

Backpropagation requires us to have a **loss function**.

$$ L = - (y \times \ln(o)+(1-y) \times \ln(1-o)) $$

[Derivative](https://www.google.com/search?q=cross+entropy+loss+derivative&sca_esv=6915796dc894fc83&sca_upv=1&rlz=1C5CHFA_enVN752VN752&udm=2&biw=1309&bih=708&sxsrf=ACQVn09fs99X4SFZJk0xmct6PWrepRzpxQ%3A1713181875984&ei=sxQdZuGxO8eE9u8P9oiI6AI&ved=0ahUKEwih15zpk8SFAxVHgv0HHXYEAi0Q4dUDCBA&uact=5&oq=cross+entropy+loss+derivative&gs_lp=Egxnd3Mtd2l6LXNlcnAiHWNyb3NzIGVudHJvcHkgbG9zcyBkZXJpdmF0aXZlMgQQIxgnMgUQABiABDIHEAAYgAQYGEjkBVDRA1jRA3ACeACQAQCYATCgATCqAQExuAEDyAEA-AEBmAIBoAIzmAMAiAYBkgcBMaAHsAM&sclient=gws-wiz-serp#vhid=fKdGq3KS8we6mM&vssid=mosaic):

$$
\dfrac{dL}{do} = \dfrac{-y}{o} + \dfrac{1-y}{1-o}
$$
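The derivative can be verified with a central finite difference. The helper names `bce` and `bce_grad` and the sample values are assumptions made for this sketch. Note that $o$ must stay strictly between 0 and 1, otherwise the logarithm is undefined.

```python
import numpy as np

def bce(o, y):
    # Binary cross entropy for a single prediction o and target y
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def bce_grad(o, y):
    # Analytic derivative dL/do from the formula above
    return -y / o + (1 - y) / (1 - o)

o, y = 0.8, 1.0                      # hypothetical prediction and target
eps = 1e-6
numeric = (bce(o + eps, y) - bce(o - eps, y)) / (2 * eps)
print(bce_grad(o, y), numeric)
```

For a target of 1, the gradient reduces to $-1/o$, which is $-1.25$ for $o = 0.8$.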

In [11]:
class BinaryCrossEntropy():

    def forward(self, output, target):

        # TODO
        # implement Binary Cross-Entropy loss function for output, target

        loss = ...

        # END TODO
        return loss

    def backward(self, output, target):
        # Calculate the gradient with respect to the output
        return ...

### Sigmoid Activation Function

For a binary classification problem, we can use the sigmoid activation function in the output layer, which outputs values between 0 and 1. So, for a positive case (class 1), we can interpret $p_1 = \sigma(o)$ as the probability of that class, while $p_0 = 1 - p_1$ can be seen as the probability of the negative case (class 0).

**Sigmoid function**:
$$
\sigma(x) = \dfrac{1}{1 + e^{-x}}
$$

[Derivative](https://hausetutorials.netlify.app/posts/2019-12-01-neural-networks-deriving-the-sigmoid-derivative/#:~:text=The%20derivative%20of%20the%20sigmoid%20function%20%CF%83(x)%20is%20the,1%E2%88%92%CF%83(x).):
$$
\dfrac{d\sigma}{dx} = \sigma(x)(1-\sigma(x))
$$
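The identity for the derivative can be confirmed numerically. The helper name `sigmoid_fn` and the sample inputs below are chosen for this sketch only.

```python
import numpy as np

def sigmoid_fn(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
s = sigmoid_fn(x)
analytic = s * (1 - s)               # derivative from the formula above

eps = 1e-6
numeric = (sigmoid_fn(x + eps) - sigmoid_fn(x - eps)) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-6))
```

At $x = 0$ the sigmoid equals 0.5, so the derivative takes its maximum value of 0.25 there.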



In [12]:
def sigmoid(x):
    return ...

class SigmoidActivationFunction():

    def forward(self, x):
        # TODO
        # implement Sigmoid function for the input_data
        self.x = x

        output = ...

        return output

    def backward(self, d_values):
        # calculate the gradients with help of the derivative
        return ...

### Implementation

Now let's put together the components you have implemented so far into our MLP:

In [13]:
class MLP_implementation():
    def __init__(self,
        input_size,
        output_size,
        hidden_layers,
        hidden_layers_size,
        hidden_activation_func,
        output_activation_function,
        loss_function,
    ):
        self.hidden_layers = hidden_layers
        self.hidden_layers_size = hidden_layers_size
        self.hidden_activation_func = hidden_activation_func
        self.loss_function = loss_function
        self.output_activation_function = output_activation_function
        self.layers = []

        # Initialize hidden layers
        for i in range(hidden_layers):
            if i == 0:
                layer = ...
            else:
                layer = ...
            self.layers.append(layer)

        # Initialize output layer
        self.output_layer = ...

    def forward_pass(self, x):
        ...

        return output

    def backward_pass(self, d_values, learning_rate):
        # Backpropagate through output layer
        d_values = self.output_activation_function.backward(d_values)
        ...

        # Backpropagate through hidden layers
        for layer in reversed(self.layers):
            ...


    def train(self, input_data, targets, learning_rate=1, epochs=1):
        for epoch in range(epochs):
            random_order = np.random.permutation(len(input_data))
            for i in random_order:
                # Forward pass
                output = ...

                # Calculate loss
                loss = ...

                # Backward pass
                ...

    def inference(self, input_data):
        output = []
        for i in range(len(input_data)):
            ...
        return np.array(output)


### MLP Initialization

In [14]:
# Initialize MLP
xor_mlp = MLP_implementation(
    ...
)




### Training

In [None]:
# TODO
input_data = ...
targets = ...

xor_mlp.train(input_data, targets, learning_rate=0.05, epochs=2500)

### Evaluation

In [None]:
# Test and evaluate your new model as in the previous task
# TODO
predictions = ...
accuracy_value = ...
print(accuracy_value)

You should now be able to reach 100% accuracy on the XOR task with the MLP!

If you are interested, try this [demo](https://lecture-demo.iar.kit.edu/neural-network-demo/) to see how MLPs find decision boundaries.

## Exercise 3: XOR Task with MLP (using PyTorch)

Everything could have been much easier!

The exercises so far are only meant to help you understand the internal details of training a neural network. In practice, we do not have to implement the forward and backward passes of common functions by hand. PyTorch can take care of all of this!

Look up the PyTorch documentation, and fill in the following blocks of code to build the same MLP with PyTorch:

### Defining the model

In [None]:
import torch
import torch.nn as nn

xor_mlp_pytorch = nn.Sequential(
    ...
)

### Initializing weights

In [None]:
# Init weights
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

# Apply the initialization to the model
xor_mlp_pytorch.apply(init_weights)

### Loss Function: Binary Cross Entropy

In [None]:
loss_fn = ...

### Optimizer: Stochastic gradient descent

In [None]:
optimizer = ...

### Training

Below we provide you with a simple training loop.

For the first two epochs, print out the gradients and the values of some weights of the network.

**Question**:
- Explain what happens after each step in the training loop.
- Why do we need `optimizer.zero_grad()` here? When should we NOT use it?

**Answer**: ...
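As a hint for the second question: PyTorch adds newly computed gradients to whatever is already stored in `.grad`. The toy sketch below uses a two-element tensor rather than the XOR model, and resets the gradient by hand with `w.grad.zero_()`, which has an effect comparable to clearing gradients through the optimizer.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d(sum(3 * w))/dw is 3 for each entry
(w * 3).sum().backward()
first = w.grad.clone()

# Second backward pass without a reset: the new gradient is added to the old one
(w * 3).sum().backward()
accumulated = w.grad.clone()

# Reset by hand, then run backward again
w.grad.zero_()
(w * 3).sum().backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
```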


In [None]:
# Define our data
input_data_tensor = torch.tensor([[0,0], [0,1], [1,0], [1,1]], dtype=torch.float)
targets_tensor = torch.unsqueeze(
    torch.tensor([0,1,1,0], dtype=torch.float), 1
)

# Training loop
epochs = 2500
for epoch in range(epochs):

    optimizer.zero_grad()
    if epoch < 2:
        print(...)

    output = xor_mlp_pytorch(input_data_tensor)
    if epoch < 2:
        print(...)


    loss = loss_fn(output, targets_tensor)
    if epoch < 2:
        print(...)


    loss.backward()
    if epoch < 2:
        print(...)

    optimizer.step()
    if epoch < 2:
        print(...)


Follow the loss in the backward direction, using its `.grad_fn` attribute to see the computation graph:

In [None]:
print(...)
print(...)
print(...)


### Evaluation

In [None]:
predictions = ...
accuracy_value = ...
print(accuracy_value)


## Exercise 4: Iris Dataset 🌷 task with MLP (using PyTorch)

Iris is a genus of hundreds of species of flowering plants with showy flowers. The Iris data set consists of 150 samples from three species of Iris which are hard to distinguish (Iris setosa, Iris virginica and Iris versicolor). There are four features from each sample: the length and the width of the sepals and petals, in centimeters. Based on these features, the goal is to predict which species of Iris the sample belongs to.


For this exercise, you need to enable GPUs for this notebook:

- Navigate to "**Edit**" → "**Notebook Settings**"
- Select GPU from the **Hardware Accelerator** drop-down
- You might need to rerun the notebook

Next, we'll check if we can connect to the GPU with PyTorch:

In [None]:
if torch.cuda.is_available():
    device = torch.cuda.current_device()
    print('Current device:', torch.cuda.get_device_name(device))
else:
    print('Failed to find GPU. Will use CPU.')
    device = 'cpu'

###  Loading Dataset

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import torch

iris = load_iris()
X, y = iris.data, iris.target
num_classes = 3



Process the data.

**Question**: Is there anything we need to do with the default target? Why?

**Answer**: ...
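One common way to prepare integer class labels for a multi-class network is one-hot encoding. The sketch below shows the idea with NumPy on made-up labels; the names `labels` and `n_classes` are assumptions for this example, and other approaches may fit your chosen loss function better.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])        # hypothetical integer class labels
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels]
print(one_hot)
```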

In [None]:
# Split the dataset into training and test dataset
# TODO
X_train, X_test, y_train, y_test = ...

# Process the data
# TODO
y_train_one_hot = ...
y_test_one_hot = ...



### Architecture

We will again use an MLP for this task (with PyTorch).

Initialize a model with **4 hidden layers** and a **hidden layer size of 768**.

**Question**: Is there anything else we should change when building the MLP to fit this task?

**Hint**: it is no longer a binary classification problem

**Answer:** ...

In [None]:
# Defining the model
xor_mlp_pytorch = nn.Sequential(
    ...
)

# Apply the initialization to the model
xor_mlp_pytorch.apply(init_weights)

# Defining loss function: Cross Entropy Loss
loss_fn = ...

# Defining optimizer: Stochastic gradient descent
optimizer = ...


### Training

As you have learnt from the lecture, we can speed up the training process by **batching** and using **GPUs**. Modify the following code to use batching and GPUs.

You can also run the code before and after you make changes to see the speed-up gained from batching and using the GPU.

**Hint**: You can make use of PyTorch's `DataLoader`.

**Question**: Report the execution time with and without GPU and batching.

**Answer**: ...
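To show how `DataLoader` splits data into mini-batches, here is a minimal sketch on random stand-in data (10 samples, 4 features, 3 classes). The tensors and the batch size of 4 are assumptions for illustration; the loop only moves each batch to the selected device and records its size, without training anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in data, not the Iris dataset
features = torch.randn(10, 4)
labels = torch.randint(0, 3, (10,))

loader = DataLoader(TensorDataset(features, labels), batch_size=4, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_sizes = []
for batch_x, batch_y in loader:
    # Move each mini-batch to the selected device before a forward pass
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)
    batch_sizes.append(batch_x.shape[0])

print(batch_sizes)
```

With 10 samples and a batch size of 4, the last batch is smaller than the others because incomplete batches are kept by default.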

In [None]:
# Training loop
epochs = 500
for epoch in range(epochs):
    optimizer.zero_grad()
    output = xor_mlp_pytorch(X_train)
    loss = loss_fn(output, y_train_one_hot)
    loss.backward()
    optimizer.step()




### Evaluation

Show the overall accuracy of our model on the test dataset. Use the existing `accuracy` function that you implemented earlier.

In [None]:
# TODO
predictions = ...
accuracy_value = ...
print(accuracy_value)

Print the confusion matrix using `sklearn.metrics.confusion_matrix`.
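As a reminder of how to read the result, the sketch below calls `confusion_matrix` on made-up labels for three classes. Rows correspond to the true class and columns to the predicted class, so off-diagonal entries are misclassifications.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, not model output
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

In this toy example, one sample of class 1 was predicted as class 2, which shows up in row 1, column 2.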

In [None]:
# TODO
...

**Question**: Looking at the confusion matrix, what can you conclude from it?

**Answer**: ...