## Different outlier models

1. Linear models for outlier detection:

    A. **PCA: Principal Component Analysis** uses the sum of weighted projected distances to the eigenvector hyperplanes as the outlier score.
    
    B. **MCD: Minimum Covariance Determinant** uses the Mahalanobis distance as the outlier score.
    
    C. **OCSVM: One-Class Support Vector Machines**
    

2. Proximity-based outlier detection models:

    A. **LOF: Local Outlier Factor**
    
    B. **CBLOF: Clustering-Based Local Outlier Factor**
    
    C. **kNN: k Nearest Neighbors**
    
    D. **HBOS: Histogram-Based Outlier Score**
    
    
3. Probabilistic models for outlier detection:

    A. **ABOD: Angle Based Outlier Detection**
    
    
4. Outlier Ensembles and combination frameworks

    A. **Isolation forest**
    
    B. **Feature bagging**
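
To make the Mahalanobis distance mentioned for MCD concrete, here is a minimal NumPy sketch. It is a simplification, not PyOD's implementation: it uses the plain sample mean and covariance, whereas MCD estimates both robustly from a subset of the data. The function name `mahalanobis_scores` and the toy data are assumptions made for illustration.

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance of every row of X to the sample mean.

    Assumption: plain (non-robust) mean and covariance, unlike MCD.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    inv_cov = np.linalg.pinv(cov)          # pseudo-inverse tolerates singular covariances
    diff = X - mean
    # Quadratic form diff_i^T * inv_cov * diff_i for each row i
    squared = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(np.maximum(squared, 0.0))

# Toy data: 200 standard-normal points with one planted outlier in row 0
rng = np.random.RandomState(42)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]

scores = mahalanobis_scores(X)
print("most anomalous row:", int(np.argmax(scores)))
```

The larger the distance, the further a point lies from the bulk of the data once feature scales and correlations are taken into account.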

### Importing Python packages

In [36]:
import os
import sys
import numpy as np
import pandas as pd 
from time import time

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from scipy.io import loadmat  #loads MATLAB .mat files

### Importing pyod and the methods

In [5]:
from pyod.models.pca import PCA
from pyod.models.mcd import MCD
from pyod.models.ocsvm import OCSVM
from pyod.models.lof import LOF
from pyod.models.cblof import CBLOF
from pyod.models.knn import KNN
from pyod.models.hbos import HBOS
from pyod.models.abod import ABOD
from pyod.models.iforest import IForest
from pyod.models.feature_bagging import FeatureBagging

### Importing Performance metric packages

In [7]:
from pyod.utils.utility import standardizer
from pyod.utils.utility import precision_n_scores
from sklearn.metrics import roc_auc_score

## Basic operations on .mat files

In [10]:
data = loadmat('Dataset/cardio.mat')
data

{'__header__': b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2014-12-18 10:48:09 UTC',
 '__version__': '1.0',
 '__globals__': [],
 'X': array([[ 0.00491231,  0.69319077, -0.20364049, ...,  0.23149795,
         -0.28978574, -0.49329397],
        [ 0.11072935, -0.07990259, -0.20364049, ...,  0.09356344,
         -0.25638541, -0.49329397],
        [ 0.21654639, -0.27244466, -0.20364049, ...,  0.02459619,
         -0.25638541,  1.14001753],
        ...,
        [-0.41835583, -0.91998844, -0.16463485, ..., -1.49268341,
          0.24461959, -0.49329397],
        [-0.41835583, -0.91998844, -0.15093411, ..., -1.42371616,
          0.14441859, -0.49329397],
        [-0.41835583, -0.91998844, -0.20364049, ..., -1.28578165,
          3.58465295, -0.49329397]]),
 'y': array([[0.],
        [0.],
        [0.],
        ...,
        [1.],
        [1.],
        [1.]])}

In [11]:
len(data)

5

`loadmat` returns the contents of the .mat file as a Python dictionary.

In [12]:
data.keys()

dict_keys(['__header__', '__version__', '__globals__', 'X', 'y'])
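
The keys wrapped in double underscores hold file metadata, while `X` and `y` hold the actual arrays. A small sketch of separating the two is shown below. The dictionary `mat_like` is a hand-written stand-in for the `loadmat` result (an assumption, so the snippet runs without the dataset file), and `variables` is an illustrative name.

```python
# Stand-in for the dictionary returned by loadmat (assumed shape of the result)
mat_like = {
    "__header__": b"MATLAB 5.0 MAT-file",
    "__version__": "1.0",
    "__globals__": [],
    "X": [[0.0, 1.0], [2.0, 3.0]],
    "y": [[0.0], [1.0]],
}

# Keep only the stored variables, dropping the double-underscore metadata keys
variables = {key: value for key, value in mat_like.items()
             if not key.startswith("__")}

print(sorted(variables))
```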

In [15]:
data.values()

dict_values([b'MATLAB 5.0 MAT-file, written by Octave 3.8.0, 2014-12-18 10:48:09 UTC', '1.0', [], array([[ 0.00491231,  0.69319077, -0.20364049, ...,  0.23149795,
        -0.28978574, -0.49329397],
       [ 0.11072935, -0.07990259, -0.20364049, ...,  0.09356344,
        -0.25638541, -0.49329397],
       [ 0.21654639, -0.27244466, -0.20364049, ...,  0.02459619,
        -0.25638541,  1.14001753],
       ...,
       [-0.41835583, -0.91998844, -0.16463485, ..., -1.49268341,
         0.24461959, -0.49329397],
       [-0.41835583, -0.91998844, -0.15093411, ..., -1.42371616,
         0.14441859, -0.49329397],
       [-0.41835583, -0.91998844, -0.20364049, ..., -1.28578165,
         3.58465295, -0.49329397]]), array([[0.],
       [0.],
       [0.],
       ...,
       [1.],
       [1.],
       [1.]])])

In [16]:
type(data['X'])

numpy.ndarray

In [17]:
data['X'].shape

(1831, 21)

In [18]:
data['y'].shape

(1831, 1)
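
Because `y` is stored as a column of shape `(1831, 1)`, it is flattened with `ravel()` before the outlier fraction is computed. The sketch below reproduces that step on a synthetic label vector. The count of 176 outliers is an assumption, chosen because it is consistent with the 9.6122% reported for cardio in the results tables; the real labels come from the .mat file.

```python
import numpy as np

# Synthetic labels with the same shape as cardio's y (assumed: 176 outliers of 1831)
y_2d = np.zeros((1831, 1))
y_2d[-176:] = 1.0

y = y_2d.ravel()                                   # (1831, 1) -> (1831,)
outliers_fraction = np.count_nonzero(y) / len(y)   # share of samples labeled 1
outliers_percentage = round(outliers_fraction * 100, ndigits=4)

print(y.shape, outliers_percentage)
```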

### Define the data files and result columns

In [23]:
mat_file_list = ['arrhythmia.mat',
                 'cardio.mat',
                 'glass.mat',
                 'ionosphere.mat',
                 'letter.mat',
                 'lympho.mat',
                 'mnist.mat',
                 'musk.mat',
                 'optdigits.mat',
                 'pendigits.mat',
                 'pima.mat',
                 'satellite.mat',
                 'satimage-2.mat',
                 'shuttle.mat',
                 'vertebral.mat',
                 'vowels.mat',
                 'wbc.mat']

In [35]:
df_columns = ['Data','#Samples','# Dimensions','Outlier Perc',
               'ABOD','CBLOF','FB','HBOS','IForest','KNN','LOF','MCD',
                'OCSVM','PCA']

## ROC Performance evaluation table

In [41]:
roc_df = pd.DataFrame(columns=df_columns)
roc_df

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA


## Precision_n_scores - Performance evaluation table

In [42]:
prn_df = pd.DataFrame(columns = df_columns)
prn_df

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA


## Time dataframe

In [43]:
time_df = pd.DataFrame(columns=df_columns)
time_df

Unnamed: 0,Data,#Samples,# Dimensions,Outlier Perc,ABOD,CBLOF,FB,HBOS,IForest,KNN,LOF,MCD,OCSVM,PCA


## Exploring all .mat files

In [44]:
# This is the main phase of the anomaly detection


random_state = np.random.RandomState(42)

#main for loop to loop through all the mat files one by one
for mat_file in mat_file_list:
    print("\n... Processing", mat_file,'...')
    mat = loadmat(os.path.join('Dataset',mat_file))
    
    X = mat['X']  #Storing the X value
    y = mat['y'].ravel() #ravel is to convert 2d -> 1d
    
    outliers_fraction = np.count_nonzero(y)/len(y) #fraction of samples labeled as outliers
    outliers_percentage = round(outliers_fraction * 100, ndigits=4) #the same value expressed as a percentage
    
    
    # CONSTRUCT CONTAINERS FOR SAVING RESULTS
    
    
    roc_list = [mat_file[:-4], X.shape[0], X.shape[1], outliers_percentage]
    prn_list = [mat_file[:-4], X.shape[0], X.shape[1], outliers_percentage]
    time_list = [mat_file[:-4], X.shape[0], X.shape[1], outliers_percentage]
    
    #60% DATA FOR TRAINING AND 40% FOR TESTING
    X_train , X_test , y_train , y_test = train_test_split(X , y , test_size = 0.4,
                                                           random_state = random_state)
    
    
    #STANDARDISING THE DATA FOR PROCESSING
    X_train_norm , X_test_norm = standardizer(X_train, X_test)
    
    #CREATING THE CLASSIFIERS DICTIONARY WHICH CONTAINS ALL THE ALGORITHMS
    
    classifiers = {'Angle-based Outlier Detector (ABOD)':ABOD(contamination=outliers_fraction),
                   'Cluster-based Local Outlier Factor': CBLOF(contamination = outliers_fraction,
                                                               check_estimator=False,random_state=random_state),
                    'Feature Bagging': FeatureBagging(contamination=outliers_fraction,random_state=random_state),
                    'Histogram_base Outlier Detection (HBOS)': HBOS(contamination=outliers_fraction),
                    'Isolation Forest': IForest(contamination=outliers_fraction,random_state=random_state),
                    'K Nearest Neighbor (KNN)': KNN(contamination = outliers_fraction),
                    'Local Outlier Factor (LOF)': LOF(contamination=outliers_fraction),
                    'Minimum Covariance Determinant (MCD)': MCD(contamination=outliers_fraction, random_state=random_state),
                    'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
                    'Principal Component Analysis (PCA)': PCA(contamination=outliers_fraction,random_state=random_state)                  
                  }
    
    
    
    #Fit each detector in turn, then record its ROC AUC, precision @ rank n, and execution time
    for clf_name, clf in classifiers.items():
        t0 = time() #start time, before fitting
        clf.fit(X_train_norm) #fit the detector on the training split (no labels are used)
        test_scores = clf.decision_function(X_test_norm)  #outlier scores for the test split (higher = more anomalous)
        t1 = time() #end time, after fitting and scoring
        duration = round(t1 - t0 , ndigits=4) #total time to fit and score, in seconds
        time_list.append(duration)
        
        roc = round(roc_auc_score(y_test , test_scores), ndigits=4) #ROC AUC
        prn = round(precision_n_scores(y_test, test_scores), ndigits=4) #precision @ rank n
        
        print(f"{clf_name} ROC: {roc}, Precision @ rank n: {prn}, Execution time: {duration}s")
        
        #Append the ROC AUC and precision scores to their respective lists
        roc_list.append(roc)
        prn_list.append(prn)
        
    
    #Build a one-row DataFrame from time_list, set the column names, and concatenate it onto time_df
    temp_df = pd.DataFrame(time_list).transpose()
    temp_df.columns = df_columns  
    time_df = pd.concat([time_df, temp_df],axis=0)
    
    #Build a one-row DataFrame from roc_list, set the column names, and concatenate it onto roc_df
    temp_df = pd.DataFrame(roc_list).transpose()
    temp_df.columns = df_columns
    roc_df = pd.concat([roc_df,temp_df],axis=0)
    
    #Build a one-row DataFrame from prn_list, set the column names, and concatenate it onto prn_df
    temp_df = pd.DataFrame(prn_list).transpose()
    temp_df.columns = df_columns
    prn_df = pd.concat([prn_df,temp_df], axis=0)


... Processing arrhythmia.mat ...
Angle-based Outlier Detector (ABOD) ROC: 0.7687, Precision @ rank n: 0.3571, Execution time: 0.24s
Cluster-based Local Outlier Factor ROC: 0.7684, Precision @ rank n: 0.4643, Execution time: 0.136s
Feature Bagging ROC: 0.7799, Precision @ rank n: 0.5, Execution time: 0.6799s
Histogram_base Outlier Detection (HBOS) ROC: 0.8511, Precision @ rank n: 0.5714, Execution time: 0.112s
Isolation Forest ROC: 0.8478, Precision @ rank n: 0.5357, Execution time: 0.8481s
K Nearest Neighbor (KNN) ROC: 0.782, Precision @ rank n: 0.5, Execution time: 0.16s
Local Outlier Factor (LOF) ROC: 0.7787, Precision @ rank n: 0.4643, Execution time: 0.104s
Minimum Covariance Determinant (MCD) ROC: 0.8228, Precision @ rank n: 0.4286, Execution time: 3.0079s
One-class SVM (OCSVM) ROC: 0.7986, Precision @ rank n: 0.5, Execution time: 0.04s
Principal Component Analysis (PCA) ROC: 0.7997, Precision @ rank n: 0.5, Execution time: 0.104s

... Processing cardio.mat ...
Angle-based Outli

Minimum Covariance Determinant (MCD) ROC: 0.3486, Precision @ rank n: 0.0, Execution time: 5.2712s
One-class SVM (OCSVM) ROC: 0.4972, Precision @ rank n: 0.0, Execution time: 1.3997s
Principal Component Analysis (PCA) ROC: 0.504, Precision @ rank n: 0.0, Execution time: 0.064s

... Processing pendigits.mat ...
Angle-based Outlier Detector (ABOD) ROC: 0.7008, Precision @ rank n: 0.0308, Execution time: 2.6162s
Cluster-based Local Outlier Factor ROC: 0.9609, Precision @ rank n: 0.3077, Execution time: 0.3599s
Feature Bagging ROC: 0.4687, Precision @ rank n: 0.0462, Execution time: 6.4693s
Histogram_base Outlier Detection (HBOS) ROC: 0.9294, Precision @ rank n: 0.2615, Execution time: 0.016s
Isolation Forest ROC: 0.9482, Precision @ rank n: 0.2615, Execution time: 0.9679s
K Nearest Neighbor (KNN) ROC: 0.7602, Precision @ rank n: 0.0462, Execution time: 0.9283s
Local Outlier Factor (LOF) ROC: 0.481, Precision @ rank n: 0.0462, Execution time: 0.7283s
Minimum Covariance Determinant (MCD) RO
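
The loop standardizes the data with `standardizer(X_train, X_test)` before fitting. The sketch below shows the idea under the assumption that this is a z-score whose mean and standard deviation come from the training split only and are then applied to both splits. The helper name `standardize` and the random data are illustrative, not PyOD's code.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both splits using statistics of the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0            # avoid division by zero for constant features
    return (X_train - mean) / std, (X_test - mean) / std

# Illustrative data, split 60/40 like the notebook
rng = np.random.RandomState(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(60, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(40, 3))

X_train_norm, X_test_norm = standardize(X_train, X_test)
print(X_train_norm.mean(axis=0).round(6), X_train_norm.std(axis=0).round(6))
```

Reusing the training statistics for the test split keeps information about the test data out of the preprocessing step.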

In [45]:
#ROC AUC for every dataset and algorithm
roc_df

|  | Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arrhythmia | 452 | 274 | 14.6018 | 0.7687 | 0.7684 | 0.7799 | 0.8511 | 0.8478 | 0.782 | 0.7787 | 0.8228 | 0.7986 | 0.7997 |
| 0 | cardio | 1831 | 21 | 9.6122 | 0.5763 | 0.8221 | 0.4879 | 0.8453 | 0.9316 | 0.6959 | 0.4715 | 0.8778 | 0.9507 | 0.9638 |
| 0 | glass | 214 | 9 | 4.2056 | 0.7104 | 0.8506 | 0.7043 | 0.6524 | 0.7195 | 0.7805 | 0.7774 | 0.7165 | 0.6189 | 0.622 |
| 0 | ionosphere | 351 | 33 | 35.8974 | 0.9004 | 0.8952 | 0.8933 | 0.5195 | 0.8294 | 0.9134 | 0.8989 | 0.9399 | 0.8372 | 0.7971 |
| 0 | letter | 1600 | 32 | 6.25 | 0.8465 | 0.7423 | 0.866 | 0.5728 | 0.5836 | 0.845 | 0.8409 | 0.7499 | 0.5744 | 0.48 |
| 0 | lympho | 148 | 18 | 4.0541 | 0.9382 | 0.9709 | 0.9673 | 0.9964 | 0.9855 | 0.9636 | 0.9636 | 0.9164 | 0.9636 | 0.9818 |
| 0 | mnist | 7603 | 100 | 9.2069 | 0.7813 | 0.8447 | 0.7259 | 0.5675 | 0.7813 | 0.8409 | 0.7085 | 0.863 | 0.8417 | 0.8396 |
| 0 | musk | 3062 | 166 | 3.1679 | 0.0809 | 1.0 | 0.5228 | 0.9999 | 0.9992 | 0.7348 | 0.5323 | 1.0 | 1.0 | 1.0 |
| 0 | optdigits | 5216 | 64 | 2.8758 | 0.4428 | 0.7852 | 0.4641 | 0.8822 | 0.5442 | 0.3824 | 0.4584 | 0.3486 | 0.4972 | 0.504 |
| 0 | pendigits | 6870 | 16 | 2.2707 | 0.7008 | 0.9609 | 0.4687 | 0.9294 | 0.9482 | 0.7602 | 0.481 | 0.8271 | 0.93 | 0.9332 |
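
The ROC AUC values above can be read as the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier. The sketch below computes that quantity directly by pairwise comparison. It is an illustration of the metric, not the `roc_auc_score` implementation; the function name `pairwise_auc` and the toy labels and scores are assumptions.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the share of (outlier, inlier) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]       # scores of true outliers
    neg = scores[~y_true]      # scores of true inliers
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: two outliers (label 1) among five samples
y_toy = [0, 0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8, 0.3]

auc = pairwise_auc(y_toy, s_toy)
print(round(auc, 4))
```

A value near 0.5 means the scores rank outliers no better than chance, and a value below 0.5 (for example ABOD on musk) means outliers tend to be ranked below inliers.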


In [46]:
#precision @ rank n for every dataset and algorithm
prn_df

|  | Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arrhythmia | 452 | 274 | 14.6018 | 0.3571 | 0.4643 | 0.5 | 0.5714 | 0.5357 | 0.5 | 0.4643 | 0.4286 | 0.5 | 0.5 |
| 0 | cardio | 1831 | 21 | 9.6122 | 0.1875 | 0.4844 | 0.1406 | 0.4688 | 0.4531 | 0.2812 | 0.125 | 0.3906 | 0.5938 | 0.6875 |
| 0 | glass | 214 | 9 | 4.2056 | 0.25 | 0.25 | 0.25 | 0.0 | 0.25 | 0.25 | 0.25 | 0.0 | 0.25 | 0.25 |
| 0 | ionosphere | 351 | 33 | 35.8974 | 0.8214 | 0.8036 | 0.75 | 0.3393 | 0.6607 | 0.8393 | 0.75 | 0.8571 | 0.7143 | 0.5893 |
| 0 | letter | 1600 | 32 | 6.25 | 0.275 | 0.175 | 0.4 | 0.125 | 0.05 | 0.3 | 0.325 | 0.075 | 0.1 | 0.05 |
| 0 | lympho | 148 | 18 | 4.0541 | 0.4 | 0.6 | 0.6 | 0.8 | 0.6 | 0.6 | 0.6 | 0.6 | 0.6 | 0.8 |
| 0 | mnist | 7603 | 100 | 9.2069 | 0.3562 | 0.4007 | 0.3664 | 0.1199 | 0.3116 | 0.4144 | 0.339 | 0.3973 | 0.3801 | 0.3767 |
| 0 | musk | 3062 | 166 | 3.1679 | 0.0333 | 1.0 | 0.1667 | 0.9667 | 0.9 | 0.2333 | 0.1333 | 0.9667 | 1.0 | 1.0 |
| 0 | optdigits | 5216 | 64 | 2.8758 | 0.0161 | 0.0 | 0.0484 | 0.2581 | 0.0161 | 0.0 | 0.0484 | 0.0 | 0.0 | 0.0 |
| 0 | pendigits | 6870 | 16 | 2.2707 | 0.0308 | 0.3077 | 0.0462 | 0.2615 | 0.2615 | 0.0462 | 0.0462 | 0.0615 | 0.2923 | 0.3385 |
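
Precision @ rank n asks: if n is the number of true outliers, how many of the n highest-scoring samples really are outliers? The sketch below is one way to compute it. It is not PyOD's `precision_n_scores` and may differ from it when scores are tied; the function name `precision_at_n` and the toy data are assumptions.

```python
import numpy as np

def precision_at_n(y_true, scores):
    """Fraction of true outliers among the n top-scored samples, n = number of true outliers."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    n = int(y_true.sum())
    top_n = np.argsort(scores)[::-1][:n]   # indices of the n highest scores
    return y_true[top_n].sum() / n

# Toy example: two true outliers, only one of them lands in the top two scores
y_toy = [0, 0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8, 0.3]

p_at_n = precision_at_n(y_toy, s_toy)
print(p_at_n)
```

Unlike ROC AUC, this metric only looks at the very top of the ranking, which is why a detector can have a respectable ROC AUC and still score 0.0 here.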


In [47]:
#Execution time in seconds (fitting plus scoring) for every dataset and algorithm
time_df

|  | Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arrhythmia | 452 | 274 | 14.6018 | 0.24 | 0.136 | 0.6799 | 0.112 | 0.8481 | 0.16 | 0.104 | 3.0079 | 0.04 | 0.104 |
| 0 | cardio | 1831 | 21 | 9.6122 | 0.6997 | 0.176 | 1.1037 | 0.008 | 0.8159 | 0.3305 | 0.1576 | 2.0047 | 0.104 | 0.0 |
| 0 | glass | 214 | 9 | 4.2056 | 0.088 | 0.1114 | 0.0922 | 0.0 | 0.652 | 0.024 | 0.008 | 0.064 | 0.0 | 0.008 |
| 0 | ionosphere | 351 | 33 | 35.8974 | 0.136 | 0.072 | 0.144 | 0.024 | 0.7999 | 0.04 | 0.016 | 0.4959 | 0.008 | 0.008 |
| 0 | letter | 1600 | 32 | 6.25 | 1.2803 | 0.352 | 1.4128 | 0.04 | 0.9806 | 0.232 | 0.12 | 5.2578 | 0.096 | 0.016 |
| 0 | lympho | 148 | 18 | 4.0541 | 0.048 | 0.08 | 0.048 | 0.008 | 0.4479 | 0.008 | 0.0 | 0.088 | 0.0 | 0.0 |
| 0 | mnist | 7603 | 100 | 9.2069 | 10.1885 | 0.9039 | 62.6579 | 0.088 | 2.6961 | 8.4113 | 7.573 | 10.5019 | 4.3366 | 0.192 |
| 0 | musk | 3062 | 166 | 3.1679 | 3.1397 | 0.336 | 15.3191 | 0.104 | 1.8597 | 2.1363 | 1.8082 | 40.9331 | 1.0879 | 0.2166 |
| 0 | optdigits | 5216 | 64 | 2.8758 | 3.6999 | 0.4159 | 17.9564 | 0.048 | 1.5523 | 2.3761 | 2.0646 | 5.2712 | 1.3997 | 0.064 |
| 0 | pendigits | 6870 | 16 | 2.2707 | 2.6162 | 0.3599 | 6.4693 | 0.016 | 0.9679 | 0.9283 | 0.7283 | 5.1927 | 1.1399 | 0.016 |
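
Several timings above are exactly 0.0, which more plausibly reflects the resolution of the clock than a truly free computation. The sketch below shows one possible alternative: `time.perf_counter`, the standard library clock intended for measuring short durations, with fitting and scoring timed separately. `ToyDetector` and `timed` are hypothetical stand-ins written for this example; they are not part of PyOD.

```python
from time import perf_counter

class ToyDetector:
    """Hypothetical stand-in exposing the fit / decision_function pair used in the loop."""

    def fit(self, X):
        self.center_ = sum(X) / len(X)
        return self

    def decision_function(self, X):
        # Score = distance from the training mean
        return [abs(x - self.center_) for x in X]

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = perf_counter()
    result = fn(*args)
    return result, perf_counter() - start

train = [1.0, 1.2, 0.9, 1.1]
test = [1.0, 5.0]

clf = ToyDetector()
_, fit_seconds = timed(clf.fit, train)
scores, score_seconds = timed(clf.decision_function, test)
print(f"fit: {fit_seconds:.6f}s, score: {score_seconds:.6f}s")
```

Timing the two phases separately also makes it clear whether a slow detector is expensive to train, expensive to apply, or both.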


## Inference

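Ten PyOD detectors were fitted on each dataset, and their ROC AUC, precision @ rank n, and execution time were collected into the three DataFrames above. No single detector wins everywhere: PCA has the highest ROC AUC on cardio (0.9638) but the lowest on letter (0.48), while HBOS leads on optdigits (0.8822). These tables can therefore be used to pick the best algorithm for a given dataset.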
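
Picking the best algorithm per dataset can be done programmatically. The sketch below does so for three rows copied from the ROC table above. The names `detectors`, `roc_rows`, and `best_by_roc` are illustrative, and only ROC AUC is considered, so precision and execution time would still need to be weighed separately.

```python
# Column order of the ten detectors, as in df_columns
detectors = ["ABOD", "CBLOF", "FB", "HBOS", "IForest",
             "KNN", "LOF", "MCD", "OCSVM", "PCA"]

# ROC AUC rows copied from the ROC table for three of the datasets
roc_rows = {
    "cardio":    [0.5763, 0.8221, 0.4879, 0.8453, 0.9316,
                  0.6959, 0.4715, 0.8778, 0.9507, 0.9638],
    "letter":    [0.8465, 0.7423, 0.866, 0.5728, 0.5836,
                  0.845, 0.8409, 0.7499, 0.5744, 0.48],
    "optdigits": [0.4428, 0.7852, 0.4641, 0.8822, 0.5442,
                  0.3824, 0.4584, 0.3486, 0.4972, 0.504],
}

# For each dataset, pick the detector with the highest ROC AUC
best_by_roc = {}
for dataset, values in roc_rows.items():
    scores = dict(zip(detectors, values))
    best_by_roc[dataset] = max(scores, key=scores.get)

print(best_by_roc)
```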