<img src="https://www.th-koeln.de/img/logo.svg" style="float:right;" width="200">

# 1st exercise: <font color="#C70039">Work with standard deviations for anomaly detection</font>
* Course: AML
* Lecturer: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Author of notebook: <a href="https://www.gernotheisenberg.de/">Gernot Heisenberg</a>
* Date:   24.10.2023
* Student: Ali Ünal

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/1200px-Standard_deviation_diagram.svg.png" style="float: center;" width="450">

---------------------------------
**GENERAL NOTE 1**:
Please make sure you read the entire notebook, since it contains a lot of information on your tasks (e.g. regarding the setting of certain parameters or a specific computational trick). The markdown cells and the code comments also explain how things work together as a whole.

**GENERAL NOTE 2**:
* When commenting source code, please use English only.
* When describing an observation, please use English, too.
* This applies to all exercises throughout this course.  

---------------------

### <font color="ce33ff">DESCRIPTION</font>:
This notebook introduces standard deviations as a common technique for detecting anomalies when the data is normally distributed.

-------------------------------------------------------------------------------------------------------------

### <font color="FFC300">TASKS</font>:
The tasks that you need to work on within this notebook are always indicated below as bullet points.
If a task is more challenging and consists of several steps, this is indicated as well.
Make sure you have worked through the task list and commented on what you did.
This should be done by using markdown.<br>
<font color=red>Make sure you don't forget to specify your name and your matriculation number in the notebook.</font>

**YOUR TASKS in this exercise are as follows**:

1. import the notebook to Google Colab or use your local machine.
2. make sure you specified your name and your matriculation number in the header below my name and date.
    * set the date too and remove mine.

3. read the entire notebook carefully
    * add comments wherever you feel it is necessary for better understanding
    * run the notebook for the first time.
    * understand the output

4. go and find three different data sets on the web
    * kaggle.com might be a good source (they also offer an API for data download)
    * make sure two of the three data sets are normally distributed
    * download one data set that is not normally distributed

5. visualize the data

6. compute the anomalies
7. visualize the anomalies
8. does the 0.3% rule apply?
9. what are the differences between the normally distributed and the non-normally distributed data sets with respect to the outlier detection?
10. which statement can be made and which cannot?
-----------------------------------------------------------------------------------

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
from scipy.stats import kstest, lognorm, shapiro
import statsmodels.api as sm

# Define Datasets

In [3]:
# Load the two data sets that are expected to be normally distributed (height and weight)
df_heightWeight = pd.read_csv("SOCR-HeightWeight.csv")
df_firstDataset = df_heightWeight['Height(Inches)']
df_secondDataset = df_heightWeight['Weight(Pounds)']

# Load the data set that is not normally distributed
df_thirdDataset = pd.read_csv("incomeUS.csv")['Age']

# Test For Normality

In [None]:
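# Method 1: Plot a histogram of the first data set (height)
plt.hist(df_firstDataset, bins=20, color='skyblue', edgecolor='black')

# Method 2: Create a Q-Q plot of the first data set
# line='s' scales the reference line by the sample mean and standard deviation,
# so unstandardized data can be compared against a normal distribution
fig = sm.qqplot(df_firstDataset, line='s')
plt.show()

# Method 3: Perform a Shapiro-Wilk test for normality (disabled here)
# If the p-value is less than .05, we reject the null hypothesis of the Shapiro-Wilk test,
# i.e. we have sufficient evidence that the sample data does not come from a normal distribution.
# shapiro(df_firstDataset)

# Method 4: Perform a Kolmogorov-Smirnov test
# 'norm' alone means the standard normal N(0, 1), so pass the sample mean and standard deviation
kstest(df_firstDataset, 'norm', args=(df_firstDataset.mean(), df_firstDataset.std()))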

In [None]:
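# Histogram of the second data set (weight)
plt.hist(df_secondDataset, bins=20, color='skyblue', edgecolor='black')

# Q-Q plot; line='s' scales the reference line by the sample mean and standard deviation
fig = sm.qqplot(df_secondDataset, line='s')
plt.show()

# Kolmogorov-Smirnov test against a normal distribution with the sample mean and standard deviation
kstest(df_secondDataset, 'norm', args=(df_secondDataset.mean(), df_secondDataset.std()))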

In [None]:
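# Histogram of the third data set (age)
plt.hist(df_thirdDataset, bins=20, color='skyblue', edgecolor='black')

# Q-Q plot of the third data set; line='s' scales the reference line by the sample mean and standard deviation
fig = sm.qqplot(df_thirdDataset, line='s')
plt.show()

# Kolmogorov-Smirnov test against a normal distribution with the sample mean and standard deviation
kstest(df_thirdDataset, 'norm', args=(df_thirdDataset.mean(), df_thirdDataset.std()))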
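
A note on the Kolmogorov-Smirnov test used above: in SciPy, `kstest(x, 'norm')` without further arguments compares the sample against the standard normal distribution N(0, 1). Data on another scale, such as heights in inches, is therefore rejected regardless of its shape. The following minimal sketch illustrates this with synthetic data (the values 68 and 2 are assumptions chosen for illustration, not taken from the data sets). Note that p-values are only approximate when the mean and standard deviation are estimated from the same sample.

```python
import numpy as np
from scipy.stats import kstest

# Synthetic, normally distributed sample (assumption: mean 68, standard deviation 2)
rng = np.random.default_rng(42)
heights = rng.normal(loc=68.0, scale=2.0, size=5000)

# Comparison against the standard normal N(0, 1): rejected because of the scale mismatch
raw = kstest(heights, 'norm')

# Comparison against a normal distribution with the sample mean and standard deviation
fitted = kstest(heights, 'norm', args=(heights.mean(), heights.std(ddof=1)))

print("standard normal:", raw.statistic, raw.pvalue)
print("fitted normal:  ", fitted.statistic, fitted.pvalue)
```

The first test statistic is close to 1, while the second is small, although both calls use exactly the same sample.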

# Find Anomalies

In [None]:
# Function to detect outliers in one-dimensional data sets.
def find_anomalies(random_data):
    # define a list to accumulate anomalies
    anomalies = []

    # Set upper and lower limits to 3 standard deviations from the mean
    random_data_std = np.std(random_data)
    random_data_mean = np.mean(random_data)
    anomaly_cut_off = random_data_std * 3

    lower_limit  = random_data_mean - anomaly_cut_off
    upper_limit = random_data_mean + anomaly_cut_off

    print("lower limit=", round(lower_limit,8))
    print("upper limit=", round(upper_limit,8))

    # Generate outliers list
    for outlier in random_data:
        if outlier > upper_limit or outlier < lower_limit:
            anomalies.append(outlier)

    return anomalies
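
The loop above can also be written as a boolean mask, which is usually faster on large arrays. The following is a minimal sketch of that alternative, not a replacement for the function above; the name `find_anomalies_vectorized`, the parameter `n_std`, and the synthetic sample are assumptions made for illustration.

```python
import numpy as np

def find_anomalies_vectorized(values, n_std=3.0):
    """Return the values lying more than n_std standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), NumPy's default
    # True for every value outside [mean - n_std * std, mean + n_std * std]
    mask = np.abs(values - mean) > n_std * std
    return values[mask]

# Synthetic example (assumption): 1000 normal samples plus two injected extremes
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(loc=68.0, scale=2.0, size=1000), [40.0, 95.0]])
outliers = find_anomalies_vectorized(sample)
print(outliers)
```

The two injected extremes, 40 and 95, are among the returned values because they lie far outside the three-standard-deviation band of the sample.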

In [None]:
# scale by 20 and shift by 20 to get samples with a standard deviation of 20 and a mean of 20
# randn generates samples from the normal distribution (important - see below)
data = np.random.randn(50000)  * 20 + 20

In [None]:
anomalies_firstDataset = find_anomalies(df_firstDataset)
anomalies_secondDataset = find_anomalies(df_secondDataset)
anomalies_thirdDataset = find_anomalies(df_thirdDataset)

## Result
These anomalies lie outside the range of three standard deviations below and above the mean.
Thus, statistically speaking, they belong to less than 0.3% of the entire data set!
Of course, this conclusion only holds if the data is normally distributed!
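
The 0.3% figure follows from the normal cumulative distribution function Φ: the probability mass outside mean ± 3σ is 2·(1 − Φ(3)). The following minimal sketch computes this value with the Python standard library. As an illustrative assumption that is not part of the data sets used here, it also shows the same rule applied to an exponential distribution, where a clearly larger share falls outside the limits.

```python
import math
from statistics import NormalDist

# Two-sided tail mass of a normal distribution beyond mean +/- 3 standard deviations
normal_tail = 2 * (1 - NormalDist().cdf(3))

# Illustrative non-normal case (assumption): exponential distribution with rate 1.
# Its mean and standard deviation are both 1, so the upper limit is 4 and the
# lower limit (-2) can never be undercut. The tail mass is P(X > 4) = exp(-4).
exponential_tail = math.exp(-4)

print(f"normal:      {normal_tail:.4%}")
print(f"exponential: {exponential_tail:.4%}")
```

For the normal distribution the result is about 0.27%, which is where the rounded 0.3% comes from. For the exponential example it is about 1.83%.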


In [None]:
print("Normally distributed data, Dataset 1: " + str(anomalies_firstDataset))
print(f"Percentage of anomalies: {len(anomalies_firstDataset) / len(df_firstDataset) * 100:.5f}%; "
      f"Count of anomalies: {len(anomalies_firstDataset)}\n")

print("Normally distributed data, Dataset 2: " + str(anomalies_secondDataset))
print(f"Percentage of anomalies: {len(anomalies_secondDataset) / len(df_secondDataset) * 100:.5f}%; "
      f"Count of anomalies: {len(anomalies_secondDataset)}\n")

print("Non-normally distributed data, Dataset 3: " + str(anomalies_thirdDataset))
print(f"Percentage of anomalies: {len(anomalies_thirdDataset) / len(df_thirdDataset) * 100:.5f}%; "
      f"Count of anomalies: {len(anomalies_thirdDataset)}\n")

In [None]:
# Answers to tasks 8-10
# The non-normally distributed data set has slightly more anomalies than the normally distributed data sets.
# Both normally distributed data sets fulfill the 0.3% rule, but the non-normally distributed data set is slightly above it.
# Using the histogram for visualizing outliers is only sensible when the data set is normally distributed.