# Benchmark of various outlier detection models

### The models are evaluated by ROC, Precision @ n and execution time on 17 benchmark datasets. All datasets are split (60% for training and 40% for testing). The full result by averaging 10 indepent trials can be found [here](https://pyod.readthedocs.io/en/latest/benchmark.html).

**[PyOD](https://github.com/yzhao062/pyod)** is a comprehensive **Python toolkit** to **identify outlying objects** in 
multivariate data with both unsupervised and supervised approaches.
The model covered in this example includes:

  1. Linear Models for Outlier Detection:
     1. **PCA: Principal Component Analysis** use the sum of
       weighted projected distances to the eigenvector hyperplane 
       as the outlier outlier scores)
     2. **MCD: Minimum Covariance Determinant** (use the mahalanobis distances 
       as the outlier scores)
     3. **OCSVM: One-Class Support Vector Machines**
     
  2. Proximity-Based Outlier Detection Models: (Using the proximity to detect the outliers)
     1. **LOF: Local Outlier Factor**
     2. **CBLOF: Clustering-Based Local Outlier Factor**
     3. **kNN: k Nearest Neighbors** (use the distance to the kth nearest 
     neighbor as the outlier score)
     4. **HBOS: Histogram-based Outlier Score**
     
  3. Probabilistic Models for Outlier Detection:
     1. **ABOD: Angle-Based Outlier Detection**
  
  4. Outlier Ensembles and Combination Frameworks
     1. **Isolation Forest**
     2. **Feature Bagging**

     
Corresponding file could be found at /examples/compare_all_models.py

In [18]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score,precision_score
from sklearn.model_selection import train_test_split
import sys


import warnings
warnings.filterwarnings("ignore")
import os
os.chdir("C:/Users/Abhishek/Desktop/Lets Upgrade Case Studies/Anamoly_detec_data-20200815T064134Z-001/Anamoly_detec_data")

## 1.1 Import Pyod and the methods

- The one we saw under : Benchmark of Various outlier detection models 
  - 1.***Linear Models*** : PCA, MCD, OCSVM
  - 2.***Proximity Based*** :LOF, CBLOF, HBOS
  - 3.***Probability Model*** : ABOD***
  - 4.***Ensemble and combination Framework*** : IForest, FeatureBagging

In [19]:
from pyod.models.pca import PCA
from pyod.models.mcd import MCD
from pyod.models.ocsvm import OCSVM
from pyod.models.lof import LOF
from pyod.models.cblof import CBLOF
from pyod.models.knn import KNN
from pyod.models.hbos import HBOS
from pyod.models.abod import ABOD
from pyod.models.iforest import IForest
from pyod.models.feature_bagging import FeatureBagging

## 1.2 Import Metrics methods

- To check/evaluate the performance of our moedls

In [20]:
from scipy.io import loadmat #Because Our input data is in mat format
from pyod.utils.utility import standardizer
from pyod.utils.utility import precision_n_scores

# Loading mat file (Matlab File)
- Mat File Stores Data in Dictionary Form

In [21]:
# List with names of all "mat" files, so we can load all in one go 
mat_file_list=['arrhythmia.mat','cardio.mat','glass.mat','ionosphere.mat','letter.mat','lympho.mat','mnist.mat','musk.mat','optdigits.mat','pendigits.mat','pima.mat','satellite.mat','satimage-2.mat','shuttle.mat','vertebral.mat','vowels.mat','wbc.mat']

mat_file_list

['arrhythmia.mat',
 'cardio.mat',
 'glass.mat',
 'ionosphere.mat',
 'letter.mat',
 'lympho.mat',
 'mnist.mat',
 'musk.mat',
 'optdigits.mat',
 'pendigits.mat',
 'pima.mat',
 'satellite.mat',
 'satimage-2.mat',
 'shuttle.mat',
 'vertebral.mat',
 'vowels.mat',
 'wbc.mat']

# How to load Mat File 

In [22]:
# REading one file just to get the Look and Feel idea.
data=loadmat("wbc.mat")
data

{'__header__': b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2015-05-31 08:23:19 UTC',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[0.31042643, 0.15725397, 0.30177597, ..., 0.44261168, 0.27833629,
         0.11511216],
        [0.2886554 , 0.20290835, 0.28912998, ..., 0.25027491, 0.31914055,
         0.17571822],
        [0.11940934, 0.0923233 , 0.11436666, ..., 0.21398625, 0.17445299,
         0.14882592],
        ...,
        [0.72360263, 0.33682787, 0.7532997 , ..., 1.        , 0.49083383,
         0.28105733],
        [0.52103744, 0.0226581 , 0.54598853, ..., 0.91202749, 0.59846245,
         0.41886396],
        [0.32367836, 0.49983091, 0.33542948, ..., 0.52268041, 0.41119653,
         0.41492851]]),
 'y': array([[0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
 

In [23]:
len(data)

5

In [24]:
data.keys()

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])

In [25]:
data.values()

dict_values([b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2015-05-31 08:23:19 UTC', '1.0', [], array([[0.31042643, 0.15725397, 0.30177597, ..., 0.44261168, 0.27833629,
        0.11511216],
       [0.2886554 , 0.20290835, 0.28912998, ..., 0.25027491, 0.31914055,
        0.17571822],
       [0.11940934, 0.0923233 , 0.11436666, ..., 0.21398625, 0.17445299,
        0.14882592],
       ...,
       [0.72360263, 0.33682787, 0.7532997 , ..., 1.        , 0.49083383,
        0.28105733],
       [0.52103744, 0.0226581 , 0.54598853, ..., 0.91202749, 0.59846245,
        0.41886396],
       [0.32367836, 0.49983091, 0.33542948, ..., 0.52268041, 0.41119653,
        0.41492851]]), array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [0.],

In [26]:
print("Shape of X : ",data["X"].shape)
print("Shape of y : ",data["y"].shape)

Shape of X :  (378, 30)
Shape of y :  (378, 1)


---

#### df_Columns : This is the Sskeleton of our output Table.
- 'Data', '#Samples', '# Dimensions', 'Outlier Perc' will be same

- 1.ROC table will have These columns, where All PYPOD models will have their ROC Value
- 2.Precision table will have These columns, where All PYPOD models will have their Precision Value
- 3.Time table will have These columns, where All PYPOD models will have their Time Taken

In [27]:
df_columns = ['Data', '#Samples', '# Dimensions', 'Outlier Perc',
              'ABOD', 'CBLOF', 'FB', 'HBOS', 'IForest', 'KNN', 'LOF', 'MCD',
              'OCSVM', 'PCA']

# Just Initialize the DataFrames for ROC,Precision and Time.

### ROC Performance evulotion table

In [28]:
roc_dataframe = pd.DataFrame(columns=df_columns)
roc_dataframe

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA


### precision_n_scores - Performance evulotion table

In [29]:
pre_dataframe = pd.DataFrame(columns=df_columns)
pre_dataframe

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA


### Time dataframe

In [30]:
time_dataframe = pd.DataFrame(columns=df_columns)
time_dataframe

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA


---

# Exploraing All Mat files

# Steps
- 1.Read the file i,e Dataset from the List of datasets (mat files)
- 2.Split in X and Y
- 3.Calculate Outlier fraction and Percentage
- 4.Calculate and append "Data, #Sample, #Dimensions, Outlier Perc" for the 3 lists (roc,per,time) 
- 5.Split and Standardize the data
- 6.Define dictionary "Classifiers" with 1.Model name as keys and 2.Model Parameters as Values
- 7.For Each model in classifier : Fit,Predict and calculate roc,precision and Time values
- 8.For Each dataset Append (roc,per,time) lists i.e of all model. Then use these Lists to make out final DataFrame with all datasets and scores of All Models

In [35]:
from time import time
random_state = np.random.RandomState(42)

for mat_file in mat_file_list:
    print("*"*70)
    print("\n ..Currently Processing...",mat_file)
    mat=loadmat(mat_file)
    
    X=mat["X"]
    y=mat["y"].ravel() #to create a contiguous flattened array. (To got it in 1 row, Rn it is in 1 column)
    
    # Here Just for Demo of Outliers, we will assum 0's in y are normal and 1's in y are Outliers
    outlier_frac= np.count_nonzero(y)/len(y) # We will have occurance of 1 / length of y
    outlier_perc=np.round(outlier_frac*100,4)
    
    # Build containers for Results : 1.mat_file[:-4] will give name of File with ".mat" Skipped,2.Samples,3.Dimensions,4.Outlier Perc
    #- These values are Same for Each DataFrame.
    roc_list=[mat_file[:-4],X.shape[0],X.shape[1],outlier_perc]
    pre_list=[mat_file[:-4],X.shape[0],X.shape[1],outlier_perc]
    time_list=[mat_file[:-4],X.shape[0],X.shape[1],outlier_perc]
    
    #Splitting the Data  60% Train and 40% test
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.40,random_state = random_state)
    
    #Standardize the Data, ONly X and our Y is only 0's and 1's
    X_train_stan =standardizer(X_train)
    X_test_stan  =standardizer(X_test)
    
    #Create a dictionary "Classifiers", it will have all the Algorithems we will use
    # Key = Name of Algo, Value= Calling tha Algo along with HyperParameter values Specified
    
    #We are Detecting Outliers base on : outlier_frac
    classifiers = {'Angle-based Outlier Detector (ABOD)': ABOD(
       contamination=outlier_frac),
       'Cluster-based Local Outlier Factor': CBLOF(
           contamination=outlier_frac, check_estimator=False,
           random_state=random_state),
       'Feature Bagging': FeatureBagging(contamination=outlier_frac,
                                         random_state=random_state),
       'Histogram-base Outlier Detection (HBOS)': HBOS(
           contamination=outlier_frac),
       'Isolation Forest': IForest(contamination=outlier_frac,
                                   random_state=random_state),
       'K Nearest Neighbors (KNN)': KNN(contamination=outlier_frac),
       'Local Outlier Factor (LOF)': LOF(
           contamination=outlier_frac),
       'Minimum Covariance Determinant (MCD)': MCD(
           contamination=outlier_frac, random_state=random_state),
       'One-class SVM (OCSVM)': OCSVM(contamination=outlier_frac),
       'Principal Component Analysis (PCA)': PCA(
           contamination=outlier_frac, random_state=random_state),}
    
    for clf_name,clf in classifiers.items():
        t0 = time()
        clf.fit(X_train_stan)  # Fit the model i.e clf from classifier
        test_scores = clf.decision_function(X_test_stan)  # Predict using model (decision_Function() is used for Anomaly Detecction)
        t1 = time()
        duration = round(t1 - t0, ndigits=4)
        time_list.append(duration) # Append All Time Values in time_list
        
        roc = round(roc_auc_score(y_test, test_scores), ndigits=4)      #ROC of the Model
        prn = round(precision_n_scores(y_test, test_scores), ndigits=4) #Precision of the model
        
        print('{clf_name} ROC:{roc}, precision @ rank n:{prn}, '
              'execution time: {duration}s'.format(clf_name=clf_name, roc=roc, prn=prn, duration=duration))
        
        roc_list.append(roc) # Append All roc Values in ROC_List
        pre_list.append(prn) # Append all Precision values on prn_list
    
    temp_df = pd.DataFrame(time_list).transpose() #Time taken byt all Algo's for each dataset.
    temp_df.columns = df_columns
    time_dataframe = pd.concat([time_dataframe, temp_df], axis=0) #Append Each Row. Row=Dataset

    temp_df = pd.DataFrame(roc_list).transpose()
    temp_df.columns = df_columns
    roc_dataframe = pd.concat([roc_dataframe, temp_df], axis=0)

    temp_df = pd.DataFrame(pre_list).transpose()
    temp_df.columns = df_columns
    pre_dataframe = pd.concat([pre_dataframe, temp_df], axis=0)

**********************************************************************

 ..Currently Processing... arrhythmia.mat
Angle-based Outlier Detector (ABOD) ROC:0.7934, precision @ rank n:0.3929, execution time: 0.2493s
Cluster-based Local Outlier Factor ROC:0.793, precision @ rank n:0.4643, execution time: 0.1636s
Feature Bagging ROC:0.8072, precision @ rank n:0.4643, execution time: 0.8699s
Histogram-base Outlier Detection (HBOS) ROC:0.8532, precision @ rank n:0.5714, execution time: 0.1474s
Isolation Forest ROC:0.8576, precision @ rank n:0.5, execution time: 0.7672s
K Nearest Neighbors (KNN) ROC:0.8133, precision @ rank n:0.5, execution time: 0.1037s
Local Outlier Factor (LOF) ROC:0.8072, precision @ rank n:0.5, execution time: 0.1087s
Minimum Covariance Determinant (MCD) ROC:0.8189, precision @ rank n:0.4286, execution time: 0.9185s
One-class SVM (OCSVM) ROC:0.8301, precision @ rank n:0.5, execution time: 0.0519s
Principal Component Analysis (PCA) ROC:0.8268, precision @ rank n:0.5, execu

In [32]:
pre_dataframe

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA
0,arrhythmia,452,274,14.6018,0.3929,0.4643,0.4643,0.5714,0.5,0.5,0.5,0.4286,0.5,0.5
0,cardio,1831,21,9.6122,0.2188,0.5,0.1406,0.4531,0.4375,0.2969,0.1406,0.4062,0.5156,0.6406
0,glass,214,9,4.2056,0.25,0.25,0.25,0.25,0.25,0.25,0.25,0.0,0.25,0.25
0,ionosphere,351,33,35.8974,0.7857,0.7857,0.7321,0.4286,0.6429,0.8393,0.75,0.875,0.7143,0.6071
0,letter,1600,32,6.25,0.25,0.175,0.375,0.075,0.05,0.3,0.325,0.1,0.125,0.075
0,lympho,148,18,4.0541,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6,0.6
0,mnist,7603,100,9.2069,0.3733,0.3938,0.3596,0.1199,0.3014,0.4281,0.339,0.4007,0.3699,0.3596
0,musk,3062,166,3.1679,0.0667,1.0,0.2,0.9667,0.9333,0.2667,0.2667,1.0,1.0,1.0
0,optdigits,5216,64,2.8758,0.0161,0.0,0.0323,0.2742,0.0323,0.0,0.0323,0.0,0.0,0.0
0,pendigits,6870,16,2.2707,0.0308,0.3077,0.0462,0.2615,0.2462,0.0615,0.0462,0.0615,0.2769,0.3231


In [33]:
roc_dataframe

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA
0,arrhythmia,452,274,14.6018,0.7934,0.793,0.8072,0.8532,0.8576,0.8133,0.8072,0.8189,0.8301,0.8268
0,cardio,1831,21,9.6122,0.5891,0.7808,0.5023,0.8359,0.9253,0.7131,0.4844,0.8582,0.9383,0.9572
0,glass,214,9,4.2056,0.6585,0.811,0.7591,0.6006,0.628,0.7652,0.7866,0.7256,0.5274,0.4787
0,ionosphere,351,33,35.8974,0.8971,0.8834,0.8809,0.5872,0.8401,0.9179,0.879,0.9412,0.8229,0.7782
0,letter,1600,32,6.25,0.8397,0.7509,0.8753,0.5568,0.5895,0.8518,0.854,0.7573,0.5801,0.4865
0,lympho,148,18,4.0541,0.9491,0.9855,0.9709,0.9782,0.9818,0.9636,0.9745,0.8909,0.9782,0.9745
0,mnist,7603,100,9.2069,0.7792,0.8395,0.7282,0.5777,0.7789,0.8414,0.7095,0.863,0.8361,0.8347
0,musk,3062,166,3.1679,0.2306,1.0,0.5399,1.0,0.9997,0.7761,0.551,1.0,1.0,1.0
0,optdigits,5216,64,2.8758,0.3833,0.7986,0.46,0.8868,0.5482,0.3762,0.4595,0.3489,0.5234,0.5252
0,pendigits,6870,16,2.2707,0.7018,0.9563,0.4774,0.9373,0.9425,0.7618,0.4887,0.8237,0.9176,0.9294


In [34]:
time_dataframe

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA
0,arrhythmia,452,274,14.6018,0.1995,0.1446,0.7565,0.0698,0.5296,0.1197,0.0804,0.7408,0.0625,0.0937
0,cardio,1831,21,9.6122,0.5848,0.2643,1.1928,0.009,0.6423,0.2074,0.1476,0.7231,0.1057,0.007
0,glass,214,9,4.2056,0.0489,0.0788,0.0539,0.003,0.362,0.009,0.006,0.0349,0.003,0.002
0,ionosphere,351,33,35.8974,0.1247,0.0878,0.0738,0.008,0.3762,0.018,0.008,0.0658,0.007,0.004
0,letter,1600,32,6.25,0.4398,0.1326,0.9582,0.0,0.5013,0.1546,0.1137,1.4269,0.1067,0.009
0,lympho,148,18,4.0541,0.0429,0.0878,0.0539,0.006,0.4478,0.009,0.004,0.0409,0.002,0.002
0,mnist,7603,100,9.2069,8.7609,0.7749,65.1948,0.0758,2.217,8.323,8.0835,3.9075,5.0726,0.1695
0,musk,3062,166,3.1679,2.4312,0.3142,15.0053,0.0588,1.5689,2.1003,1.8511,17.3117,1.6077,0.1646
0,optdigits,5216,64,2.8758,3.8969,0.5605,21.4316,0.0509,1.6705,2.4834,2.2113,1.7491,1.869,0.0748
0,pendigits,6870,16,2.2707,2.0099,0.4575,6.614,0.009,0.8527,0.747,0.7221,3.1327,1.2704,0.012


# To read this Data (ROC_df): example :
1. Data-WBC : our Outlier Perc is 505556, PCA says 88% this is correct,MCD says 89% this is correct and so on

- Whichever Algorithm gives the Best Results can be Finalised for future use, it will detect outliers corretly(most of the time).