## Outlier Detection

In [28]:
%load_ext autoreload
%autoreload 2

from os import path
import pandas as pd
from sklearn.ensemble import IsolationForest

datasetname = path.join('..', 'dataset', 'cyclists.csv')
cyclists = pd.read_csv(datasetname)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


**Isolation Forest algorithm**: Isolation Forest is an unsupervised learning algorithm that belongs to the family of tree ensembles. It relies on the observation that anomalies are few and different, which makes them easier to separate from the rest of the data. Each tree is built by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature, splitting the data into two parts, and repeating the process until the points are isolated. The number of splits needed to isolate a data point is its path length. Outliers tend to be isolated after only a few splits, so a short path length, averaged over all trees, indicates an anomaly. The anomaly score is derived from this average path length, and data points whose score crosses a certain threshold are classified as outliers (`-1` in the `anomaly` column).
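To make the isolation idea concrete, here is a minimal pure-Python sketch. It is not the implementation behind `detect_outliers_isolation_forest`, nor scikit-learn's `IsolationForest` (which also subsamples the data and normalizes the score). The helper names `isolation_depth` and `average_depth` and the toy (weight, height) values are assumptions made for illustration only.

```python
import random


def isolation_depth(point, data, rng):
    """Number of random splits needed until `point` is alone in its partition."""
    subset = list(data)
    depth = 0
    while len(subset) > 1:
        # Features on which the remaining points still differ.
        splittable = [
            f for f in range(len(point))
            if min(row[f] for row in subset) < max(row[f] for row in subset)
        ]
        if not splittable:  # only exact duplicates of `point` are left
            break
        feature = rng.choice(splittable)
        values = [row[feature] for row in subset]
        split = rng.uniform(min(values), max(values))
        # Keep only the side of the split that contains `point`.
        if point[feature] < split:
            subset = [row for row in subset if row[feature] < split]
        else:
            subset = [row for row in subset if row[feature] >= split]
        depth += 1
    return depth


def average_depth(point, data, n_trials=200, seed=0):
    """Average isolation depth of `point` over several random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trials))
    return total / n_trials


# Toy data (hypothetical values): a dense cluster plus one distant point.
rng = random.Random(42)
cluster = [(rng.gauss(68, 3), rng.gauss(180, 4)) for _ in range(200)]
typical = (68.0, 180.0)
outlier = (95.0, 200.0)
data = cluster + [typical, outlier]

typical_depth = average_depth(typical, data)
outlier_depth = average_depth(outlier, data)
print(f"typical point: {typical_depth:.1f} splits on average")
print(f"distant point: {outlier_depth:.1f} splits on average")
```

The distant point should need noticeably fewer splits than the point in the middle of the cluster, which is the signal an isolation forest turns into an anomaly score.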

In [29]:
from utils import detect_outliers_isolation_forest
columns_to_analyze = ['weight', 'height']
cyclists_with_outliers = detect_outliers_isolation_forest(cyclists, columns_to_analyze)

outliers_iso_forest = cyclists_with_outliers[cyclists_with_outliers['anomaly'] == -1]
print(f"Number of outliers detected with Isolation Forest: {len(outliers_iso_forest)}")
outliers_iso_forest


Number of outliers detected with Isolation Forest: 154


Unnamed: 0,_url,name,birth_year,weight,height,nationality,anomaly
22,bernard-bourreau,Bernard Bourreau,1951.0,63.0,164.0,France,-1
66,jeffry-romero,Jeffry Romero,1989.0,55.0,176.0,Colombia,-1
97,einer-rubio,Einer Rubio,1998.0,56.0,164.0,Colombia,-1
141,florian-guillou,Florian Guillou,1982.0,71.0,195.0,France,-1
232,david-gaudu,David Gaudu,1996.0,53.0,172.0,France,-1
...,...,...,...,...,...,...,...
5882,laurenz-rex,Laurenz Rex,1999.0,82.0,193.0,Belgium,-1
5913,magnus-backstedt,Magnus Bäckstedt,1975.0,94.0,194.0,Sweden,-1
5920,ondrej-sosenka,Ondřej Sosenka,1975.0,82.0,200.0,Czech Republic,-1
5948,ivan-ramiro-sosa,Iván Ramiro Sosa,1997.0,52.0,168.0,Colombia,-1


**Thresholding with z-score**: The z-score of a value is the number of standard deviations it lies from the mean, z = (x - mean) / std. A value is classified as an outlier when the absolute value of its z-score exceeds a chosen threshold (3 is a common choice), that is, when it lies more than that many standard deviations above or below the mean. Here the method is applied to a single column, `weight`.
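As a rough illustration of z-score thresholding, the sketch below uses only the standard library. The function name `zscore_outliers`, the default threshold of 3 and the toy weights are assumptions for this example. The actual `detect_outliers_zscore` helper in `utils` may differ.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return one boolean per value: True when |z-score| exceeds `threshold`."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:  # all values identical: nothing can be an outlier
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]


# Toy weights in kg (hypothetical values), with one unusually heavy entry.
weights = [66.0, 67.0, 68.0, 69.0, 70.0] * 4 + [95.0]
flags = zscore_outliers(weights)
flagged = [w for w, is_outlier in zip(weights, flags) if is_outlier]
print(flagged)
```

Note that a single extreme value inflates the standard deviation itself, so on very small samples even a clear outlier may stay below the threshold.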

In [30]:
from utils import detect_outliers_zscore

column_to_analyze = 'weight'
cyclists_with_outliers = detect_outliers_zscore(cyclists, column_to_analyze)

# Find and display outliers
outliers_zscore = cyclists_with_outliers[cyclists_with_outliers['anomaly_zscore'] == True]
print(f"Number of outliers detected with Z-score: {len(outliers_zscore)}")
outliers_zscore



Number of outliers detected with Z-score: 14


Unnamed: 0,_url,name,birth_year,weight,height,nationality,anomaly,anomaly_zscore
292,jens-mouris,Jens Mouris,1980.0,91.0,198.0,Netherlands,-1,True
348,jonas-rickaert,Jonas Rickaert,1994.0,88.0,187.0,Belgium,-1,True
679,conor-dunne,Conor Dunne,1992.0,88.0,204.0,Ireland,-1,True
857,jon-ander-insausti,Jon Ander Insausti,1992.0,89.0,187.0,Spain,-1,True
1375,jose-humberto-rujano,José Humberto Rujano,1982.0,48.0,162.0,Venezuela,-1,True
1487,michael-kolar,Michael Kolář,1992.0,90.0,185.0,Slovakia,-1,True
3007,max-walscheid,Max Walscheid,1993.0,90.0,199.0,Germany,-1,True
4308,linas-balciunas,Linas Balčiūnas,1978.0,90.0,196.0,Lithuania,-1,True
4554,gerrit-solleveld,Gerrit Solleveld,1961.0,93.0,183.0,Netherlands,-1,True
4690,alex-rasmussen,Alex Rasmussen,1984.0,88.0,186.0,Denmark,-1,True


**Local Outlier Factor**: The Local Outlier Factor (LOF) algorithm is an unsupervised learning algorithm that belongs to the family of density-based anomaly detection algorithms. For each data point it estimates the local density from the distances to its k nearest neighbors, then compares that density with the local densities of those neighbors. The local outlier factor is the ratio of the average local density of the neighbors to the local density of the point itself: a value close to 1 means the point is about as dense as its neighborhood, while a value well above 1 means it lies in a sparser region. Data points whose local outlier factor exceeds a certain threshold are classified as outliers (`-1` in the `anomaly_lof` column).
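The density comparison behind LOF can be sketched in a few lines of pure Python. This is a simplified illustration, not the `detect_outliers_lof` helper or scikit-learn's `LocalOutlierFactor`: it uses exactly k neighbors (ties are broken by index), assumes no duplicate points, and the name `lof_scores` and the toy grid are assumptions.

```python
import math


def lof_scores(points, k=3):
    """Simplified Local Outlier Factor for a small list of coordinate tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbors of each point
    k_distance = []  # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbors divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data (hypothetical): a 3x3 grid of points plus one point far away.
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = lof_scores(points)
for point, score in zip(points, scores):
    print(point, round(score, 2))
```

The grid points should score close to 1, while the distant point scores far higher because its neighbors sit in a much denser region than it does.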

In [31]:
from utils import detect_outliers_lof
columns_to_analyze = ['weight', 'height']
cyclists_with_outliers = detect_outliers_lof(cyclists, columns_to_analyze)
outliers_lof = cyclists_with_outliers[cyclists_with_outliers['anomaly_lof'] == -1]
print(f"Number of outliers detected with LOF: {len(outliers_lof)}")
outliers_lof



Number of outliers detected with LOF: 148


Unnamed: 0,_url,name,birth_year,weight,height,nationality,anomaly,anomaly_zscore,anomaly_lof
36,andreas-dietziker,Andreas Dietziker,1982.0,67.0,179.0,Switzerland,1,,-1
156,hugo-page,Hugo Page,2001.0,71.0,185.0,France,1,,-1
234,bjarne-riis,Bjarne Riis,1964.0,71.0,184.0,Denmark,1,,-1
246,manuel-cardoso,Manuel Antonio Leal Cardoso,1983.0,70.0,179.0,Portugal,1,,-1
294,dion-smith,Dion Smith,1993.0,67.0,179.0,New Zealand,1,,-1
...,...,...,...,...,...,...,...,...,...
5941,jean-charles-senac,Jean-Charles Senac,1985.0,63.0,176.0,France,1,,-1
5990,lawson-craddock,Lawson Craddock,1992.0,69.0,178.0,United States,1,,-1
6011,carlos-torrent-tarres,Carlos Torrent,1974.0,71.0,180.0,Spain,1,,-1
6104,jeremy-roy,Jérémy Roy,1983.0,70.0,186.0,France,1,,-1


**Detect outliers with IQR**: The interquartile range (IQR) is a measure of statistical dispersion, that is, of how spread out the values in a data set are. It is the difference between the third quartile and the first quartile (IQR = Q3 - Q1). To identify outliers, a lower limit Q1 - k * IQR and an upper limit Q3 + k * IQR are defined for some factor k (1.5 is a common choice). The data points that fall outside these limits are classified as outliers.
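The IQR rule can be illustrated with the standard library alone. This is a sketch under assumptions: the function name `iqr_bounds`, the default factor `k=1.5`, the choice of the `inclusive` quantile method and the toy weights are all illustrative, and the `detect_outliers_iqr` helper in `utils` may compute its quartiles differently.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) limits Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy weights in kg (hypothetical values), with one light and one heavy entry.
weights = [66.0, 67.0, 68.0, 69.0, 70.0] * 4 + [48.0, 95.0]
lower, upper = iqr_bounds(weights)
outliers = [w for w in weights if w < lower or w > upper]
print(f"limits: [{lower}, {upper}], outliers: {outliers}")
```

Because the quartiles barely move when a few extreme values are added, these limits are less affected by the outliers themselves than a mean and standard deviation would be.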

In [32]:
from utils import detect_outliers_iqr
column = 'weight'
cyclists_with_outliers = detect_outliers_iqr(cyclists, column)
cyclists_with_outliers

Unnamed: 0,_url,name,birth_year,weight,height,nationality,anomaly,anomaly_zscore,anomaly_lof
292,jens-mouris,Jens Mouris,1980.0,91.0,198.0,Netherlands,-1,True,1
348,jonas-rickaert,Jonas Rickaert,1994.0,88.0,187.0,Belgium,-1,True,1
679,conor-dunne,Conor Dunne,1992.0,88.0,204.0,Ireland,-1,True,1
857,jon-ander-insausti,Jon Ander Insausti,1992.0,89.0,187.0,Spain,-1,True,1
1375,jose-humberto-rujano,José Humberto Rujano,1982.0,48.0,162.0,Venezuela,-1,True,1
1487,michael-kolar,Michael Kolář,1992.0,90.0,185.0,Slovakia,-1,True,1
2720,mulu-kinfe-hailemichael,Mulu Kinfe Hailemichael,1999.0,50.0,158.0,Ethiopia,-1,,1
3007,max-walscheid,Max Walscheid,1993.0,90.0,199.0,Germany,-1,True,1
3902,blaz-jarc,Blaž Jarc,1988.0,87.0,196.0,Slovenia,-1,,1
4110,eros-poli,Eros Poli,1963.0,87.0,194.0,Italy,-1,,1
