<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 3rd exercise: <font color="#C70039">Do DBSCAN clustering for anomaly detection</font>
* Course: AML
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook: Marvin Reuter
* Matriculation number: 11139466
* Date:   10.11.2024

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/DBSCAN-Illustration.svg/400px-DBSCAN-Illustration.svg.png" style="float: center;" width="450">

---------------------------------
**GENERAL NOTE 1**: 
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick). The markdown cells and the code comments also explain how things work together as a whole. 

**GENERAL NOTE 2**: 
* Please write all source code comments in English only. 
* Please describe your observations in English, too.
* This applies to all exercises throughout this course.  

---------------------

### <font color="ce33ff">DESCRIPTION</font>:
This notebook lets you use the DBSCAN clustering algorithm for anomaly detection.

-------------------------------------------------------------------------------------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points. 
If a task is more challenging and consists of several steps, this is indicated as well. 
Make sure you have worked through the task list and documented what you did. 
Use markdown cells for this.<br> 
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date. 
    * set the date too and remove mine.
3. read the entire notebook carefully 
    * add comments wherever you feel it is necessary for better understanding
    * run the notebook for the first time. 
4. take the three data sets from exercise 1 and cluster them
5. read the following <a href="https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan">article</a> for help with estimating eps and minPts
    * https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan
6. describe your findings and interpret the results
-----------------------------------------------------------------------------------

In [1]:
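from sklearn.cluster import DBSCAN
import numpy as np

# Fix the seed so the random data (and the outlier count below) is reproducible
np.random.seed(1)
# 50,000 two-dimensional points drawn from a normal distribution (mean 20, std 20)
random_data = np.random.randn(50000,2)  * 20 + 20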

The code below prints 94, the number of points that DBSCAN classifies as noise. scikit-learn assigns noise points the label -1. A downside of this method is that it becomes less reliable as the number of dimensions grows, because distances between points become less discriminative. You also have to choose suitable values for eps and minPts yourself, which can be challenging.

In [2]:
# hyperparameters
minPts = 2
eps = 3

outlier_detection = DBSCAN(min_samples = minPts, eps = eps)

clusters = outlier_detection.fit_predict(random_data)

list(clusters).count(-1)

94
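
The label array returned by `fit_predict` can also be used to separate noise points from clustered points. The following is a minimal sketch on a made-up toy data set (the points, `eps` and `min_samples` below are assumptions for illustration and not part of the exercise data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one isolated point
points = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)

noise_mask = labels == -1                # True for points labeled as noise
n_noise = int(noise_mask.sum())          # number of noise points
n_clusters = len(set(labels) - {-1})     # clusters are labeled 0, 1, ...

print(f"clusters: {n_clusters}, noise points: {n_noise}")
print("noise coordinates:", points[noise_mask])
```

The boolean mask gives direct access to the rows that were flagged, which is often more useful for interpretation than the count alone.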

## TASK

In [3]:
# Import all datasets
import pandas as pd

spotify = pd.read_csv("./data/kaggle/Most Streamed Spotify Songs 2024.csv", thousands=',', encoding='ISO-8859-1')
height = pd.read_csv("./data/kaggle/SOCR-HeightWeight.csv")
birthrate = pd.read_csv("./data/kaggle/europe.csv")

In [82]:
streams = spotify['Spotify Streams'].dropna()
heights = height['Height(Inches)'].dropna()
birthrates = birthrate['birth_rate'].dropna()
# Reshape the data to a 2D array
streams_reshaped = streams.values.reshape(-1,1)
heights_reshaped = heights.values.reshape(-1,1)
birthrates_reshaped = birthrates.values.reshape(-1,1)

In [130]:
std_dev = np.std(heights_reshaped)
print(std_dev)
print(heights_reshaped.size)

# hyperparameters
minPts = 2000
eps = std_dev  # one standard deviation, found by trial and error; may need tuning

outlier_detection = DBSCAN(min_samples=minPts, eps=eps)
clusters = outlier_detection.fit_predict(heights_reshaped)

num_outliers = list(clusters).count(-1)
print(f"Number of outliers: {num_outliers}")

1.9016407372498432
25000
Number of outliers: 14


In [129]:
std_dev = np.std(streams_reshaped)
print(std_dev)
print(streams_reshaped.size)

# hyperparameters
minPts = 200
eps = std_dev  # one standard deviation, found by trial and error; may need tuning

outlier_detection = DBSCAN(min_samples=minPts, eps=eps)
clusters = outlier_detection.fit_predict(streams_reshaped)

num_outliers = list(clusters).count(-1)
print(f"Number of outliers: {num_outliers}")

538383901.5027055
4487
Number of outliers: 22


In [128]:
std_dev = np.std(birthrates_reshaped)
print(std_dev)
print(birthrates_reshaped.size)

# hyperparameters
minPts = 10
eps = std_dev  # one standard deviation, found by trial and error; may need tuning

outlier_detection = DBSCAN(min_samples=minPts, eps=eps)
clusters = outlier_detection.fit_predict(birthrates_reshaped)

num_outliers = list(clusters).count(-1)
print(f"Number of outliers: {num_outliers}")

1.0943845161257548
36
Number of outliers: 2


## Findings & results

After testing several values, I settled on the ones shown above because they gave the results closest to what I expected: 14 outliers among 25,000 heights, 22 among 4,487 stream counts, and 2 among 36 birth rates. Finding a suitable eps was difficult. Using three times the standard deviation as eps produced overly large neighborhoods, so almost all data points ended up in a single cluster and no outliers were identified. I therefore used one standard deviation as eps, which gave plausible outlier counts, although this choice is a heuristic rather than a principled one. A more systematic way to determine eps is the k-distance graph: sort the distance of every point to its k-th nearest neighbor and look for the characteristic 'knee' in the curve, which suggests an appropriate eps.
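
The k-distance idea mentioned above could be sketched as follows. This is only an illustration under assumptions: the helper `k_distances`, the value of `k`, and the synthetic `sample` array are hypothetical stand-ins and were not used for the results in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(data, k):
    """Return the sorted distance of every point to its k-th nearest neighbor."""
    # k + 1 neighbors are requested because each point is returned
    # as its own nearest neighbor (distance 0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
    distances, _ = nn.kneighbors(data)
    return np.sort(distances[:, -1])

# Hypothetical stand-in for one of the reshaped (n_samples, 1) columns
rng = np.random.default_rng(0)
sample = rng.normal(loc=68.0, scale=1.9, size=(1000, 1))

k = 10  # hypothetical choice, usually tied to the intended minPts
kdist = k_distances(sample, k)

# Smallest, median and largest k-distance as a quick summary of the curve
print(kdist[[0, len(kdist) // 2, -1]])
```

Plotting `kdist` against its index (for example with matplotlib) gives the k-distance graph; the distance at which the curve bends sharply upwards is a candidate for eps.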

In conclusion, it is hard to find good values for eps and minPts by guessing and trial and error alone. Without ground-truth labels it is also hard to judge whether a result is good or not. 
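
One way to make the trial and error a little more systematic is to sweep eps over a range and watch how the number of outliers reacts. The sketch below does this on synthetic data; the data, the `min_samples` value and the eps factors are assumptions chosen for illustration, not the settings used above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 1D data set, shaped (n_samples, 1) like the reshaped columns
rng = np.random.default_rng(1)
values = rng.normal(loc=0.0, scale=1.0, size=(2000, 1))
std_dev = values.std()

min_samples = 50  # hypothetical choice
results = {}
for factor in (0.1, 0.25, 0.5, 1.0, 3.0):
    # eps is expressed as a multiple of the standard deviation
    labels = DBSCAN(eps=factor * std_dev, min_samples=min_samples).fit_predict(values)
    results[factor] = int((labels == -1).sum())
    print(f"eps = {factor:>4} * std -> outliers: {results[factor]}")
```

With minPts fixed, the number of noise points can only stay the same or shrink as eps grows, so a sweep like this shows the range in which the outlier count is stable and where it collapses towards zero.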