# Assignment 2

### MLP and $k$-distance

_Submission deadline: **14.12.2025**_

---

#### Submission Information

Upload your solution via the VC course. Please upload **one zip archive** per group. This must contain:

- Your solution as a **Notebook** (a `.ipynb` file)
- At least the file `MLP.py` for task 02.1.1
- A folder **images** with all your images (keep the image sizes relatively small)

Your zip file should be named according to the following scheme:

```
assignment_<assignment number>_solution_<group number>.zip
```

In this assignment you can achieve a total of **55** points. These points translate into **2.5 bonus points** for the exam as follows:

| **Points in Assignment** | **Bonus Points for Exam** |
| :-: | :-: |
| 52 | 2.5 |
| 44 | 2.0 |
| 36 | 1.5 |
| 28 | 1.0 |
| 20 | 0.5 |

<div class='alert alert-block alert-danger'>

##### **Important Notes**

1. **This assignment will be graded. You can earn bonus points for the exam.**
2. **If it is obvious to us that a task was copied from another source and no independent work was performed, we will not award any bonus points. Formulate all answers in your own words!**
3. **If LLMs (such as ChatGPT or Copilot) were used to create your submission, please indicate this in the relevant places. Also observe the [AI-Policy](https://cogsys.uni-bamberg.de/teaching/ki-richtlinie.html).**

</div>

---

### 1 | MLP

_For a total of **37** points_

### **(02.1.1)** Backpropagation

_For **31** points_

In this task, you will develop your own Python package that implements a Multilayer Perceptron (MLP). You will later use this package in the notebook to classify the Iris dataset. The goal of the task is to understand and implement the basic computation steps of a neural network with at least one hidden layer yourself. A simple perceptron is already available in the file `perceptron.py`. First, familiarize yourself with the implementation in this file.

The focus in the following is on implementing the backpropagation algorithm. For this, you should program two methods in the file `MLP.py`:

- `fit`: This method should perform the actual training process over multiple epochs.
- `backprop`: This method should calculate gradients for all weights and biases.

The signatures as well as additional methods are already contained in `MLP.py`. Familiarize yourself with these first, as you will need them for your implementation. A special feature is the one-hot encoding of class labels, which is done by the `__one_hot` method. One-hot encoding represents each category numerically as a vector that contains a 1 at the position of the class and 0 everywhere else. This is necessary because the network outputs its final activation as a vector.
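
As a minimal illustration of the idea, one-hot encoding could look like the sketch below. The free function `one_hot` and its parameters are hypothetical and only serve as an example; they are not the `__one_hot` method from `MLP.py`, whose signature may differ.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (n_samples, n_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, n_classes))
    # Row i gets a 1 in the column given by the class label of sample i.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([0, 2, 1], n_classes=3)
print(encoded)
```

Taking `np.argmax(encoded, axis=1)` recovers the original integer labels, which is also a common way to turn an output activation vector back into a class prediction.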

To enable a fast but still algorithm-oriented implementation, the weights are stored twice: once in the individual perceptrons of the layers, and once as a collected weight matrix that enables fast calculations. For your weight adjustments to take effect, update the weights of the individual perceptrons (`Perceptron.update_weights()`) according to the backpropagation algorithm, and then use the method `MLP.update_matrices()` to transfer the new weights into the matrices before you calculate another forward pass.
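
The general structure of backpropagation with gradient descent can be sketched in plain NumPy as shown below. This is a standalone sketch under several assumptions, not the expected solution: it does not use the `Perceptron` or `MLP` classes of the package, all function names (`forward_pass`, `backprop_sketch`, `fit_sketch`) are hypothetical, and it assumes one hidden layer, sigmoid activations, and a squared-error loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X, W1, b1, W2, b2):
    """Return hidden and output activations for a batch X of shape (n, n_in)."""
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def mean_squared_loss(X, T, W1, b1, W2, b2):
    """Loss 1/(2n) * sum((output - T)^2) over all samples and output units."""
    _, output = forward_pass(X, W1, b1, W2, b2)
    return 0.5 * np.sum((output - T) ** 2) / X.shape[0]

def backprop_sketch(X, T, W1, b1, W2, b2):
    """Gradients of mean_squared_loss with respect to W1, b1, W2, b2."""
    n = X.shape[0]
    hidden, output = forward_pass(X, W1, b1, W2, b2)
    # Error term of the output layer: loss derivative times sigmoid derivative.
    delta_out = (output - T) * output * (1.0 - output)
    # Propagate the error back through W2 to the hidden layer.
    delta_hid = (delta_out @ W2.T) * hidden * (1.0 - hidden)
    return (X.T @ delta_hid / n, delta_hid.sum(axis=0) / n,
            hidden.T @ delta_out / n, delta_out.sum(axis=0) / n)

def fit_sketch(X, T, n_hidden=4, lr=0.5, epochs=2000, seed=0):
    """Full-batch gradient descent; returns parameters and the loss history."""
    rng = np.random.default_rng(seed)
    params = [rng.normal(scale=0.5, size=(X.shape[1], n_hidden)),
              np.zeros(n_hidden),
              rng.normal(scale=0.5, size=(n_hidden, T.shape[1])),
              np.zeros(T.shape[1])]
    history = []
    for _ in range(epochs):
        # Record the loss at the start of the epoch, then update.
        history.append(mean_squared_loss(X, T, *params))
        grads = backprop_sketch(X, T, *params)
        # Step against the gradient for every weight matrix and bias vector.
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, history

# Toy data (hypothetical): logical OR with one-hot targets for two classes.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T_toy = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
params, history = fit_sketch(X_toy, T_toy)
print(f"loss before: {history[0]:.4f}, after: {history[-1]:.4f}")
```

The sketch keeps all weights in matrices only. In the package, the same updates additionally have to be written to the individual perceptrons and synchronized with the matrices, as described above.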


### **(02.1.2)** Test Implementation with Iris Dataset

_For **6** points_

Once you have completed the previous task, the `KogSysMLP` Python package can be installed using `pip`. To do this, change to the directory `KogSysMLP/` and execute the following command:

```bash
pip install -e .
```

In the next code cell, the Python package just installed is loaded. The following libraries are also imported:

- `sklearn.datasets.load_iris`: Loads the Iris dataset.
- `sklearn.model_selection.train_test_split`: Splits datasets into training and test data.
- `sklearn.preprocessing.StandardScaler`: Scales the features of the dataset.

**_Nothing needs to be changed in the next code cell._**

In [None]:
# Library imports

import KogSysMLP

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Now we will test with the Iris dataset. For this purpose, it is loaded in the following cell.

**_Nothing needs to be changed in the next code cell._**

In [None]:
iris = load_iris()
X = iris.data
y = iris.target

First scale the dataset with the `StandardScaler` and then split it into training and test data using `train_test_split`. Use 20% of the data as the test set and set `random_state=42` so that the split remains reproducible.
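
To illustrate what the scaling step does, the sketch below standardizes a small hypothetical array by hand. With its default settings, `StandardScaler` computes the same per-feature z-score, using the population standard deviation (`ddof=0`). The array `X_demo` is made up for this example and is not part of the Iris data.

```python
import numpy as np

# Hypothetical data: two features on very different scales.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])

mean = X_demo.mean(axis=0)
std = X_demo.std(axis=0)  # population standard deviation (ddof=0)
X_scaled_demo = (X_demo - mean) / std

print(X_scaled_demo.mean(axis=0))  # approximately 0 for every feature
print(X_scaled_demo.std(axis=0))   # approximately 1 for every feature
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the weighted sums in the network merely because of its unit.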

In [None]:
# 2 Points

Now initialize and train the MLP you implemented. Experiment with the parameters yourself and try to maximize the accuracy, which you will compute in the subsequent code cell. A low learning rate (e.g., between 0.01 and 0.1) and several thousand epochs are often a sensible starting point.

In [None]:
# 3 Points

After training, calculate the accuracy of your MLP on the test data. Use the model's predictions and compare them with the actual labels.
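
Accuracy is the fraction of samples whose predicted label equals the true label. The sketch below shows this on made-up arrays; the helper `accuracy` and the demo data are hypothetical. It also assumes that a model may return one activation vector per sample, in which case `np.argmax` turns the vectors into class labels first. Whether this step is needed depends on what your model's prediction method returns.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

# Hypothetical output activations for 5 samples and 3 classes.
scores_demo = np.array([[0.9, 0.1, 0.0],
                        [0.2, 0.7, 0.1],
                        [0.1, 0.6, 0.3],
                        [0.0, 0.2, 0.8],
                        [0.3, 0.5, 0.2]])
y_pred_demo = np.argmax(scores_demo, axis=1)  # index of the largest activation
y_true_demo = np.array([0, 1, 2, 2, 1])

acc = accuracy(y_true_demo, y_pred_demo)
print(acc)  # 4 of 5 predictions are correct
```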

In [None]:
# 1 Point

---
### 2 | Local Outlier Factor

_For a total of **18** points_

To detect outliers in clustered data, a $k$-distance approach can be implemented based on the $k$-Nearest-Neighbors algorithm. The basic principle of this so-called Local Outlier Factor is described [in the Wikipedia article](https://en.wikipedia.org/wiki/Local_Outlier_Factor). Familiarize yourself with this approach!

The goal of this task is to implement the $k$-distance algorithm.

At the end of the task there is a test block with which you can test your implementation.

### **(02.2.1)** Determining the Distance

_For **6** points_

The distances between the individual points should be calculated as Euclidean distances. For two points $x_i$ and $x_j$ with $n$ features, where $a_r(x)$ denotes the value of the $r$-th feature of $x$, the Euclidean distance is defined as
$$d(x_i, x_j) \equiv \sqrt{\sum_{r=1}^{n}\left(a_r(x_i) - a_r(x_j)\right)^2}.$$
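
As a small worked example with made-up values, take $x_i = (1, 2)$ and $x_j = (4, 6)$, so $n = 2$:
$$d(x_i, x_j) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = 5.$$
Since $d(x_i, x_j) = d(x_j, x_i)$ and $d(x_i, x_i) = 0$, a matrix of pairwise distances is symmetric and has zeros on its diagonal.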

To calculate this value, implement the function `euclidean_distance()`. The function `pairwise_distances()` should then calculate the pairwise Euclidean distances of a point set and return them as a matrix.

In [None]:
import numpy as np

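def euclidean_distance(x1, x2):
    # TODO: return the Euclidean distance between the points x1 and x2
    raise NotImplementedError

def pairwise_distances(X):
    # TODO: return the matrix of pairwise Euclidean distances of the points in X
    raise NotImplementedError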

### **(02.2.2)** $k$-Distance Implementation

_For **12** points_

Now implement the `k_distance` class. Build on the fundamentals of the $k$-nearest-neighbor algorithm (as learned in the lecture and exercise) and adapt it with the formulas explained in the Wikipedia article ([Local Outlier Factor](https://en.wikipedia.org/wiki/Local_Outlier_Factor)).

The class should contain the methods `fit()` and `predict()`. Additional helper functions (for example, to calculate the values of the Local Outlier Factor) can make the class more manageable.
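
For orientation, the central quantities from the linked Wikipedia article can be summarized as follows. The notation follows that article and may differ from the lecture: $N_k(A)$ is the set of the $k$ nearest neighbors of a point $A$, and $k\text{-distance}(A)$ is the distance from $A$ to its $k$-th nearest neighbor.
$$\text{reach-dist}_k(A, B) = \max\{k\text{-distance}(B),\; d(A, B)\}$$
$$\text{lrd}_k(A) = \left(\frac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}{|N_k(A)|}\right)^{-1}$$
$$\text{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \text{lrd}_k(B)}{|N_k(A)| \cdot \text{lrd}_k(A)}$$
A value of $\text{LOF}_k(A)$ close to 1 means that $A$ lies in a region of similar density to its neighbors, while values clearly above 1 indicate a lower density than the neighbors and therefore a potential outlier.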

In [None]:
class k_distance:
    def __init__(self, k):
        self.k = k
        self.x_train = None

    def fit(self, X):
        """Stores training data.

        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
            Training samples to be used for LOF computation.
        """
        raise NotImplementedError
    
    def predict(self):
        """Compute LOF scores for the stored training data.
        Expects that fit() has been called before.

        Returns
        -------
        lof : ndarray, shape (n_samples,)
            Local Outlier Factor score for each sample.
        neigh_idx : ndarray, shape (n_samples, k)
            Indices of the k nearest neighbors for each sample.
        """
        raise NotImplementedError

#### Test on Example Data

Now apply your implemented algorithm! For this purpose, the following code cell creates a training dataset `X` that consists of two clusters and several outliers. Your `k_distance` class should detect these outliers. The model is then fitted to the training data and the LOF scores are computed. If your algorithm is implemented correctly, you will see the indices of the outliers in the output as well as in the subsequent plot.


In [None]:
rng = np.random.default_rng(42)
cluster1 = rng.normal(loc=[0, 0], scale=0.5, size=(100, 2))
cluster2 = rng.normal(loc=[5, 5], scale=0.5, size=(100, 2))
outliers = np.array([[8, 0], [0, 8], [10, 10], [-5, -5], [6, -6]])

X = np.vstack([cluster1, cluster2, outliers])

In [None]:
# initialize model
model = k_distance(k=10)
# fit model
model.fit(X)
# predict LOF scores
lof_scores_arr, neighbors = model.predict()

# print top 10 anomalies
order = np.argsort(lof_scores_arr)[::-1]
print("Top anomalies (index, LOF score):")
for idx in order[:10]:
    print(idx, float(lof_scores_arr[idx]))

In [None]:
import matplotlib.pyplot as plt

# plot data and highlight anomalies
plt.scatter(X[:,0], X[:,1], s=12, label='data')
plt.scatter(X[order[:10],0], X[order[:10],1], c='red', s=40, label='anomalies')
plt.legend()
plt.show()