<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 3rd exercise: <font color="#C70039">Do DBSCAN clustering for anomaly detection</font>
* Course: AML
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Date:   10.11.2023
* Student: Ali Ünal

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/af/DBSCAN-Illustration.svg/400px-DBSCAN-Illustration.svg.png" style="float: center;" width="450">

---------------------------------
**GENERAL NOTE 1**:
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick), and the markdown cells as well as the comments explain how things work together as a whole.

**GENERAL NOTE 2**:
* When commenting source code, please use English only.
* When describing an observation, please use English, too.
* This applies to all exercises throughout this course.  

---------------------

### <font color="ce33ff">DESCRIPTION</font>:
This notebook lets you use the DBSCAN clustering algorithm for anomaly detection.

-------------------------------------------------------------------------------------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points.
If a task is more challenging and consists of several steps, this is indicated as well.
Make sure you have worked through the task list and commented on what you did.
This should be done by using markdown.<br>
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:
1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date.
    * set the date too and remove mine.
3. read the entire notebook carefully
    * add comments wherever you feel it is necessary for better understanding
    * run the notebook for the first time.
4. take the three data sets from exercise 1 and cluster them
5. read the following <a href="https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan">article</a> for getting help estimating eps and minPts
    * https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan
6. describe your findings and interpret the results
-----------------------------------------------------------------------------------

In [1]:
from sklearn.cluster import DBSCAN
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from scipy.stats import shapiro, lognorm, kstest
import statsmodels.api as sm
import seaborn as sns

In [2]:
# Load the normalized data for the first two datasets (both columns come from the same CSV file)
# reshape(-1, 1) turns each column into the 2-D array of shape (n_samples, 1) that scikit-learn expects
heightWeight = pd.read_csv("SOCR-HeightWeight.csv")
df_firstDataset = heightWeight['Height(Inches)'].values.reshape(-1, 1)
df_secondDataset = heightWeight['Weight(Pounds)'].values.reshape(-1, 1)

# Load the data for the last dataset, which is not normalized
df_thirdDataset = pd.read_csv("incomeUS.csv")['Age'].values.reshape(-1, 1)

The In [3] cell below returns 62, the number of noise points found in the first dataset. scikit-learn labels noise points as -1. A drawback of this method is that its distance-based density estimate becomes less reliable as the number of dimensions grows. It also depends on well-chosen hyperparameters, and estimating a suitable value for eps in particular can be challenging.

In [3]:
# hyperparameters
# after testing, these parameters gave results similar to exercise 1, where 69 outliers were found
# finding eps was hard; we used trial and error as well as trying to find the knee, as described in the Stack Exchange post linked in the notebook
# the attempts at finding the knee were deleted afterwards
minPts = 3
eps = 0.0179

outlier_detection = DBSCAN(min_samples = minPts, eps = eps)

clusters = outlier_detection.fit_predict(df_firstDataset)

list(clusters).count(-1)

62
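The knee heuristic mentioned in the comments of the cell above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the deleted attempt from this notebook: the helper name `k_distances`, the random data, and the choice `k=3` are all assumptions made for the example. The idea is to sort every point's distance to its k-th nearest neighbour and look for the point where the curve bends sharply upwards; a value near that bend is a candidate for eps.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k):
    """Return the sorted distances from each point to its k-th nearest neighbour."""
    # Ask for k + 1 neighbours: when the training data is queried against itself,
    # column 0 is each point's zero distance to itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, k])


# Synthetic 1-D stand-in for a real column (assumption): a dense bulk plus three far-away values
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, 500), [8.0, 9.5, -7.0]]).reshape(-1, 1)

kd = k_distances(X, k=3)

# The curve stays flat for the dense bulk and rises steeply for the isolated points
print("median k-distance:", kd[len(kd) // 2])
print("largest k-distance:", kd[-1])
```

Plotting `kd` with `plt.plot(kd)` would show the bend visually. Because scikit-learn's `min_samples` counts the point itself, `k = minPts - 1` is one plausible choice when matching the curve to a given minPts.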

In [4]:
# hyperparameters
# after testing, these parameters gave results similar to exercise 1, where 51 outliers were found
minPts = 3
eps = 0.079

outlier_detection = DBSCAN(min_samples = minPts, eps = eps)

clusters = outlier_detection.fit_predict(df_secondDataset)

list(clusters).count(-1)

79
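To interpret a result beyond the raw noise count, it can help to also look at how many clusters were formed and which values were flagged. The following is a small sketch on toy data; the helper `summarize_dbscan` and the toy values are hypothetical and only illustrate how the -1 label can be used to pull out the noise points.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def summarize_dbscan(X, eps, min_samples):
    """Fit DBSCAN and return (number of clusters, number of noise points, noise values)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    noise_mask = labels == -1  # scikit-learn marks noise points with the label -1
    # Every distinct label except -1 is one cluster
    n_clusters = len(set(labels)) - (1 if noise_mask.any() else 0)
    return n_clusters, int(noise_mask.sum()), X[noise_mask].ravel()


# Toy data (assumption): two tight groups plus one isolated value
X = np.array([1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0]).reshape(-1, 1)

n_clusters, n_noise, noise_values = summarize_dbscan(X, eps=0.5, min_samples=3)
print(n_clusters, n_noise, noise_values)
```

Applied to one of the notebook's datasets, the returned noise values could be compared directly with the outliers found in exercise 1 rather than comparing counts only.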

In [5]:
# hyperparameters
# even when we changed eps, the result didn't change; only two outliers were ever found, which is strange since 129 outliers were found in exercise 1
# maybe this is because the data set isn't normalized
# only changing minPts had an effect on the results
# the goal was to land in the same ballpark as the number of outliers in exercise 1, so minPts = 25 was chosen in the end
minPts = 25
eps = 0.01

outlier_detection = DBSCAN(min_samples = minPts, eps = eps)

clusters = outlier_detection.fit_predict(df_thirdDataset)

list(clusters).count(-1)

123
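One possible explanation for why eps had no effect here, assuming the `Age` column contains whole numbers only (this was not verified in the notebook): distinct values are then at least 1 apart, so any eps below 1 only connects identical values. In that case a point is noise exactly when its value occurs fewer than minPts times, and only minPts changes the result. The sketch below uses made-up integer ages and a hypothetical helper `count_noise` to illustrate this.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up whole-number "ages" (assumption): 20 and 21 are frequent, 90 and 95 are rare
ages = np.array([20] * 30 + [21] * 30 + [90] * 2 + [95], dtype=float).reshape(-1, 1)


def count_noise(X, eps, min_samples):
    """Return the number of points DBSCAN labels as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return int(np.sum(labels == -1))


# For every eps below 1 only identical values are neighbours, so the count does not change
counts_small_eps = [count_noise(ages, eps, min_samples=25) for eps in (0.01, 0.1, 0.9)]

# Equivalent frequency rule: a value is noise if it occurs fewer than min_samples times
values, freq = np.unique(ages, return_counts=True)
rare_total = int(freq[freq < 25].sum())

print(counts_small_eps, rare_total)
```

If this assumption holds for the real data, DBSCAN with eps below 1 acts as a frequency threshold on age values rather than as a density estimate, which would make the comparison with exercise 1 depend on minPts alone.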