<h1 style="text-align:center;vertical-align:middle">Outlier Analysis and Anomaly Detection</h1>

## Contents
- [1 abalone dataset](#1)
    - [Data processing](#1.1)
    - [Model building](#1.2)
    - [Result analysis](#1.3)
- [2 spambase dataset](#2)
    - [Data processing](#2.1)
    - [Model building](#2.2)
    - [Result analysis](#2.3)

<h2 id="1">1 abalone dataset</h2>
<h3 id="1.1">Data processing</h3>

In [2]:
import pandas as pd
import numpy as np
import fm
from time import time
from sklearn.model_selection import train_test_split
from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF
from pyod.models.feature_bagging import FeatureBagging
from pyod.models.hbos import HBOS
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.mcd import MCD
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA
from pyod.utils.utility import standardizer
from pyod.utils.utility import precision_n_scores
from sklearn.metrics import roc_auc_score

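First, inspect the files in the abalone dataset. Each file contains several kinds of columns: the `ground.truth` label indicates whether a data point is an anomaly, and `original.label` gives the point's class in the original data. From inspection, each benchmark appears to be built by randomly sampling points from some classes as outliers and then adding points from the other classes as normal points.

Therefore, `ground.truth` is used as the label. The attribution columns `point.id`, `motherset`, `origin`, and `original.label` are dropped, and the remaining columns serve as input features for training the outlier detection models.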

In [None]:
for i in range(start,len(file_lst)):
    file = file_lst[i]
    print('processing file '+ file[-8:-4])
    print('----------')
    df = pd.read_csv(file)
    x = df.drop(['ground.truth','point.id','motherset','origin','original.label'],axis = 1).values
    y = df['ground.truth'].values
    y = [0 if i == 'nominal' else 1 for i in y]

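The dataset is split into training and test sets at a 3:2 ratio (`test_size=0.4`). Because the models and algorithms implemented in the outlier detection package `PyOD` are mainly unsupervised, the labels are not used for fitting; the test set and its labels are used only to evaluate the fitted models.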

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4,random_state=random_state)
x_train_norm, x_test_norm = standardizer(x_train, x_test)
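
The standardization step can be illustrated with a small dependency-free sketch. It assumes that `standardizer` applies z-score scaling with the mean and standard deviation taken from the training set only; the helper names `fit_scaler` and `transform` and the toy data are hypothetical and are not part of `PyOD`.

```python
from statistics import mean, pstdev

def fit_scaler(rows):
    """Return per-column (mean, std) computed from the training rows only."""
    columns = list(zip(*rows))
    # Fall back to 1.0 so a constant column does not cause division by zero.
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def transform(rows, params):
    """Apply z-score scaling with previously fitted parameters."""
    return [[(value - mu) / sigma for value, (mu, sigma) in zip(row, params)]
            for row in rows]

# Toy data (illustrative only): two features on very different scales.
x_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
x_test = [[2.0, 40.0]]

params = fit_scaler(x_train)               # statistics come from x_train only
x_train_norm = transform(x_train, params)
x_test_norm = transform(x_test, params)    # test rows reuse the training statistics
print(x_test_norm)
```

Reusing the training statistics for the test rows keeps information about the test set out of the fitted models.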

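<h3 id="1.2">Model building</h3>

The following outlier detection models from the `PyOD` toolkit are used:

- Linear models for outlier detection:
    - PCA: principal component analysis (uses the sum of weighted projected distances to the eigenvector hyperplanes as the outlier score)
    - MCD: minimum covariance determinant (uses the Mahalanobis distance as the outlier score)
    - OCSVM: one-class support vector machine
- Proximity-based outlier detection models:
    - LOF: local outlier factor
    - CBLOF: clustering-based local outlier factor
    - kNN: k nearest neighbors (uses the distance to the k-th nearest neighbor as the outlier score)
    - HBOS: histogram-based outlier score
- Probabilistic models for outlier detection:
    - ABOD: angle-based outlier detection
- Ensemble and combination frameworks:
    - Isolation Forest
    - Feature Bagging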
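
As an illustration of how one of the listed models scores points, the sketch below computes the kNN outlier score described above, the distance to the k-th nearest neighbor. It is a simplified stand-in rather than `PyOD`'s implementation; the function name `kth_neighbor_distance`, the toy points, and `k = 2` are assumptions chosen for the example.

```python
from math import dist

def kth_neighbor_distance(train, query, k):
    """Distance from `query` to its k-th nearest point in `train`."""
    distances = sorted(dist(query, point) for point in train)
    return distances[k - 1]

# Toy training set: the four corners of a unit square.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
inlier = (0.5, 0.5)     # sits in the middle of the training points
outlier = (5.0, 5.0)    # far away from all of them

k = 2
inlier_score = kth_neighbor_distance(train, inlier, k)
outlier_score = kth_neighbor_distance(train, outlier, k)
print(inlier_score, outlier_score)   # the larger score marks the more outlying point
```

The query points here are not part of the training set, so no self-match needs to be excluded.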

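First, the contamination (the proportion of outliers) of the current dataset is computed from the `ground.truth` labels and capped at 0.5; the models are then created through the interfaces provided by the toolkit.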

In [None]:
outliers_fraction = min(np.count_nonzero(y) / len(y),0.5)
outliers_percentage = round(outliers_fraction * 100, ndigits=4)

classifiers = {'Angle-based Outlier Detector (ABOD)': ABOD(
            contamination=outliers_fraction),
            'Cluster-based Local Outlier Factor': CBLOF(
                contamination=outliers_fraction, check_estimator=False,
                random_state=random_state),
            'Feature Bagging': FeatureBagging(contamination=outliers_fraction,
                                              random_state=random_state),
            'Histogram-base Outlier Detection (HBOS)': HBOS(
                contamination=outliers_fraction),
            'Isolation Forest': IForest(contamination=outliers_fraction,
                                        random_state=random_state),
            'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
            'Local Outlier Factor (LOF)': LOF(
                contamination=outliers_fraction),
            'Minimum Covariance Determinant (MCD)': MCD(
                contamination=outliers_fraction, random_state=random_state),
            'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
            'Principal Component Analysis (PCA)': PCA(
                contamination=outliers_fraction, random_state=random_state),
        }
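
The contamination value passed to every model can be checked by hand on a small label list. The sketch below mirrors the label mapping and the `min(..., 0.5)` cap used above; the helper name `contamination` and the label string `'anomaly'` are assumptions for illustration (the source only states that every label other than `'nominal'` maps to 1).

```python
def contamination(labels, cap=0.5):
    """Fraction of labels equal to 1, capped at `cap`."""
    return min(sum(labels) / len(labels), cap)

# Hypothetical raw labels; 'anomaly' is an assumed name for the non-nominal class.
raw = ['nominal', 'anomaly', 'nominal', 'nominal', 'anomaly']
y = [0 if label == 'nominal' else 1 for label in raw]

fraction = contamination(y)                    # 2 of the 5 labels are outliers
percentage = round(fraction * 100, ndigits=4)
print(fraction, percentage)

# A label list with more than half outliers is capped at 0.5.
capped = contamination([1, 1, 1, 0])
```

The cap matters because, as far as I know, `PyOD` detectors only accept a `contamination` value in the interval (0, 0.5].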

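Each model is fitted on the training data, and its execution time, precision @ rank n (prn), and ROC AUC (roc) are recorded for later analysis.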

In [None]:
for clf_name, clf in classifiers.items():
            try:
                t0 = time()
                clf.fit(x_train_norm)
                test_scores = clf.decision_function(x_test_norm)
                t1 = time()
                duration = round(t1 - t0, ndigits=4)
                roc = round(roc_auc_score(y_test, test_scores), ndigits=4)
                prn = round(precision_n_scores(y_test, test_scores), ndigits=4)
            except Exception as e:
                roc = 0
                prn = 0
                duration = 0
            print('{clf_name} ROC:{roc}, precision @ rank n:{prn}, '
                      'execution time: {duration}s'.format(
                    clf_name=clf_name, roc=roc, prn=prn, duration=duration))
            time_list.append(duration)
            roc_list.append(roc)
            prn_list.append(prn)
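
To make the two recorded metrics concrete, the following dependency-free sketch computes them on a tiny example. It is an approximation for illustration: `roc_auc` uses the pairwise-ranking definition of the area under the ROC curve, and `precision_at_n` takes the top n scores with n equal to the number of true outliers. The library functions `roc_auc_score` and `precision_n_scores` may handle ties and thresholds differently, and the function names and data below are hypothetical.

```python
def roc_auc(y_true, scores):
    """Probability that a random outlier outranks a random inlier (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_n(y_true, scores):
    """Share of true outliers among the n highest scores, n = number of true outliers."""
    n = sum(y_true)
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:n]) / n

# Toy test set: 1 marks an outlier, higher score means "more anomalous".
y_test = [0, 0, 1, 0, 1, 0]
test_scores = [0.1, 0.4, 0.9, 0.2, 0.3, 0.8]

roc = roc_auc(y_test, test_scores)          # 6 of 8 outlier/inlier pairs ranked correctly
prn = precision_at_n(y_test, test_scores)   # top 2 scores contain 1 true outlier
print(roc, prn)
```

A roc of 0.5 corresponds to random ranking, so values below 0.5 in the result tables mean a model ranks inliers above outliers more often than not.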

The complete code is as follows:

In [84]:
# Read all files in the directory
file_lst = fm.get_filelist('../../data/abalone/benchmarks/',[])
file_lst.sort()

In [85]:
df_columns = ['Data', '#Samples', '# Dimensions', 'Outlier Perc',
          'ABOD', 'CBLOF', 'FB', 'HBOS', 'IForest', 'KNN', 'LOF', 'MCD',
          'OCSVM', 'PCA']
roc_df = pd.DataFrame(columns=df_columns)
prn_df = pd.DataFrame(columns=df_columns)
time_df = pd.DataFrame(columns=df_columns)
random_state = np.random.RandomState(42)

In [125]:
def detect_file(file_lst,start,roc_df,prn_df,time_df,random_state):
    for i in range(start,len(file_lst)):
        file = file_lst[i]
        print('processing file '+ file[-8:-4])
        print('----------')
        df = pd.read_csv(file)
        x = df.drop(['ground.truth','point.id','motherset','origin','original.label'],axis = 1).values
        y = df['ground.truth'].values
        y = [0 if i == 'nominal' else 1 for i in y]

        outliers_fraction = min(np.count_nonzero(y) / len(y),0.5)
        outliers_percentage = round(outliers_fraction * 100, ndigits=4)

        roc_list = [file[-8:-4], x.shape[0], x.shape[1], outliers_percentage]
        prn_list = [file[-8:-4], x.shape[0], x.shape[1], outliers_percentage]
        time_list = [file[-8:-4], x.shape[0], x.shape[1], outliers_percentage]

        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4,random_state=random_state)
        x_train_norm, x_test_norm = standardizer(x_train, x_test)



        classifiers = {'Angle-based Outlier Detector (ABOD)': ABOD(
            contamination=outliers_fraction),
            'Cluster-based Local Outlier Factor': CBLOF(
                contamination=outliers_fraction, check_estimator=False,
                random_state=random_state),
            'Feature Bagging': FeatureBagging(contamination=outliers_fraction,
                                              random_state=random_state),
            'Histogram-base Outlier Detection (HBOS)': HBOS(
                contamination=outliers_fraction),
            'Isolation Forest': IForest(contamination=outliers_fraction,
                                        random_state=random_state),
            'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
            'Local Outlier Factor (LOF)': LOF(
                contamination=outliers_fraction),
            'Minimum Covariance Determinant (MCD)': MCD(
                contamination=outliers_fraction, random_state=random_state),
            'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
            'Principal Component Analysis (PCA)': PCA(
                contamination=outliers_fraction, random_state=random_state),
        }

        for clf_name, clf in classifiers.items():
            try:
                t0 = time()
                clf.fit(x_train_norm)
                test_scores = clf.decision_function(x_test_norm)
                t1 = time()
                duration = round(t1 - t0, ndigits=4)
                roc = round(roc_auc_score(y_test, test_scores), ndigits=4)
                prn = round(precision_n_scores(y_test, test_scores), ndigits=4)
            except Exception as e:
                roc = 0
                prn = 0
                duration = 0
            print('{clf_name} ROC:{roc}, precision @ rank n:{prn}, '
                      'execution time: {duration}s'.format(
                    clf_name=clf_name, roc=roc, prn=prn, duration=duration))
            time_list.append(duration)
            roc_list.append(roc)
            prn_list.append(prn)

        temp_df = pd.DataFrame(time_list).transpose()
        temp_df.columns = df_columns
        time_df = pd.concat([time_df, temp_df], axis=0)

        temp_df = pd.DataFrame(roc_list).transpose()
        temp_df.columns = df_columns
        roc_df = pd.concat([roc_df, temp_df], axis=0)

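        temp_df = pd.DataFrame(prn_list).transpose()
        temp_df.columns = df_columns
        prn_df = pd.concat([prn_df, temp_df], axis=0)

    # Return the accumulated tables: pd.concat rebinds the local names,
    # so the caller's DataFrames are not updated in place.
    return roc_df, prn_df, time_df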

In [97]:
# Save the result files for later analysis
time_df.to_csv("abalone-time.csv",index = False)
roc_df.to_csv("abalone-roc.csv",index = False)
prn_df.to_csv("abalone-prn.csv",index = False)

<h3 id="1.3">Result analysis</h3>

In [6]:
time_1 = pd.read_csv("abalone-time.csv",index_col = "Data")
roc_1 = pd.read_csv("abalone-roc.csv",index_col = "Data")
prn_1 = pd.read_csv("abalone-prn.csv",index_col = "Data")

**Execution time**
  

In [20]:
time_1.head()

| Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1888 | 9 | 48.3581 | 0.3604 | 0.112 | 0.2147 | 0.0041 | 0.2934 | 0.0585 | 0.024 | 0.9736 | 0.1037 | 0.0024 |
| 2 | 1888 | 9 | 50.0 | 0.3754 | 0.1149 | 0.2082 | 0.0033 | 0.2888 | 0.0601 | 0.026 | 0.9395 | 0.0962 | 0.0027 |
| 3 | 1888 | 9 | 50.0 | 0.3478 | 0.1294 | 0.1966 | 0.0033 | 0.2902 | 0.0573 | 0.0241 | 0.7979 | 0.0897 | 0.0024 |
| 4 | 1888 | 9 | 47.1928 | 0.3303 | 0.1057 | 0.2101 | 0.0034 | 0.2983 | 0.0607 | 0.026 | 1.0447 | 0.0886 | 0.0024 |
| 5 | 1888 | 9 | 49.3644 | 0.3392 | 0.1273 | 0.198 | 0.0033 | 0.2809 | 0.0581 | 0.0251 | 0.8198 | 0.104 | 0.0022 |


First, consider the average execution time of the models: `PCA` has the lowest average time and `MCD` the highest.

In [18]:
ans = time_1.loc[:,"ABOD":"PCA"].apply(lambda x: x.mean())
ans.sort_values()

PCA        0.003870
HBOS       0.005087
LOF        0.056318
KNN        0.084102
OCSVM      0.090292
CBLOF      0.116211
IForest    0.258679
ABOD       0.316479
FB         0.454158
MCD        1.021404
dtype: float64

 **ROC/AUC**
  

In [21]:
roc_1.head()

| Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1888 | 9 | 48.3581 | 0.7144 | 0.5919 | 0.5696 | 0.4839 | 0.4096 | 0.7449 | 0.5617 | 0.7942 | 0.4871 | 0.4556 |
| 2 | 1888 | 9 | 50.0 | 0.7427 | 0.5308 | 0.5321 | 0.4566 | 0.3959 | 0.7416 | 0.5299 | 0.7536 | 0.4536 | 0.4412 |
| 3 | 1888 | 9 | 50.0 | 0.7414 | 0.6171 | 0.6115 | 0.356 | 0.3996 | 0.7673 | 0.6114 | 0.7724 | 0.4896 | 0.4259 |
| 4 | 1888 | 9 | 47.1928 | 0.7388 | 0.6124 | 0.5685 | 0.49 | 0.4539 | 0.7699 | 0.5807 | 0.8433 | 0.4773 | 0.4703 |
| 5 | 1888 | 9 | 49.3644 | 0.7366 | 0.5853 | 0.5662 | 0.4495 | 0.3473 | 0.7606 | 0.5531 | 0.7625 | 0.4519 | 0.4241 |


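A higher roc indicates a better model. According to the results below, the average roc of the algorithms falls roughly between 0.7 and 0.8. `MCD`, which has the longest execution time, also has the highest roc at 0.818319. The results also show that linear models do not necessarily perform worse than more complex ones. In a small number of cases, every model reaches a roc of 1, as shown below.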

In [24]:
ans = roc_1.loc[:,"ABOD":"PCA"].apply(lambda x: x.mean())
ans.sort_values(ascending=False)

MCD        0.818319
KNN        0.807786
FB         0.779019
ABOD       0.778553
LOF        0.767699
OCSVM      0.767049
HBOS       0.748110
CBLOF      0.738439
PCA        0.719367
IForest    0.716047
dtype: float64

In [27]:
ans = roc_1.loc[:,"ABOD":"PCA"].apply(lambda x: x.max())
ans.sort_values(ascending=False)

PCA        1.0
OCSVM      1.0
MCD        1.0
LOF        1.0
KNN        1.0
IForest    1.0
HBOS       1.0
FB         1.0
CBLOF      1.0
ABOD       1.0
dtype: float64

**PRN**

In [28]:
prn_1.head()

| Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1888 | 9 | 48.3581 | 0.6438 | 0.562 | 0.5462 | 0.4776 | 0.4222 | 0.6755 | 0.5462 | 0.7124 | 0.496 | 0.4697 |
| 2 | 1888 | 9 | 50.0 | 0.664 | 0.5079 | 0.5265 | 0.4921 | 0.4101 | 0.6587 | 0.5317 | 0.6772 | 0.455 | 0.4603 |
| 3 | 1888 | 9 | 50.0 | 0.6812 | 0.5784 | 0.6041 | 0.4072 | 0.4473 | 0.6864 | 0.6093 | 0.7044 | 0.4884 | 0.4447 |
| 4 | 1888 | 9 | 47.1928 | 0.6453 | 0.5503 | 0.5168 | 0.4777 | 0.4302 | 0.6732 | 0.5251 | 0.743 | 0.4525 | 0.4497 |
| 5 | 1888 | 9 | 49.3644 | 0.6477 | 0.5501 | 0.5393 | 0.4399 | 0.374 | 0.6694 | 0.523 | 0.6775 | 0.4309 | 0.4255 |


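prn is the precision among the n highest-scored points (precision @ rank n); a higher value means more accurate detection. According to the results below, the average prn of the algorithms lies roughly between 0.3 and 0.4, which is low. `KNN` has the highest average prn at 0.416733. Simple models are again not necessarily less accurate than complex ones. In a small number of cases, every model achieves a fully correct detection, as shown below.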

In [32]:
ans = prn_1.loc[:,"ABOD":"PCA"].apply(lambda x: x.mean())
ans.sort_values(ascending=False)

KNN        0.416733
MCD        0.377072
OCSVM      0.374527
FB         0.362075
CBLOF      0.362065
LOF        0.354011
ABOD       0.335495
PCA        0.310596
HBOS       0.309602
IForest    0.290866
dtype: float64

In [34]:
ans = prn_1.loc[:,"ABOD":"PCA"].apply(lambda x: x.max())
ans.sort_values(ascending=False)

PCA        1.0
OCSVM      1.0
MCD        1.0
LOF        1.0
KNN        1.0
IForest    1.0
HBOS       1.0
FB         1.0
CBLOF      1.0
ABOD       1.0
dtype: float64

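<h2 id="2">2 spambase dataset</h2>
<h3 id="2.1">Data processing</h3>
Same as in the previous section.
<h3 id="2.2">Model building</h3>
Similar to the previous section. The complete code is as follows: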

In [None]:
random_state = np.random.RandomState(42)
df_columns = ['Data', '#Samples', '# Dimensions', 'Outlier Perc',
              'ABOD', 'CBLOF', 'FB', 'HBOS', 'IForest', 'KNN', 'LOF', 'MCD',
              'OCSVM', 'PCA']
roc_df = pd.DataFrame(columns=df_columns)
prn_df = pd.DataFrame(columns=df_columns)
time_df = pd.DataFrame(columns=df_columns)
file_lst = fm.get_filelist('../../data/spambase/benchmarks/',[])
file_lst.sort()
for file in file_lst:
    print('processing file '+ file[-8:-4])
    print('----------')
    df = pd.read_csv(file)
    x = df.drop(['ground.truth','point.id','motherset','origin','original.label'],axis = 1).values
    y = df['ground.truth'].values
    y = [0 if i == 'nominal' else 1 for i in y]

    outliers_fraction = min(np.count_nonzero(y) / len(y),0.5)
    outliers_percentage = round(outliers_fraction * 100, ndigits=4)
    
    roc_list = [file[-8:-4], x.shape[0], x.shape[1], outliers_percentage]
    prn_list = [file[-8:-4], x.shape[0], x.shape[1], outliers_percentage]
    time_list = [file[-8:-4], x.shape[0], x.shape[1], outliers_percentage]
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4,random_state=random_state)
    x_train_norm, x_test_norm = standardizer(x_train, x_test)
    
    classifiers = {'Angle-based Outlier Detector (ABOD)': ABOD(
        contamination=outliers_fraction),
        'Cluster-based Local Outlier Factor': CBLOF(
            contamination=outliers_fraction, check_estimator=False,
            random_state=random_state),
        'Feature Bagging': FeatureBagging(contamination=outliers_fraction,
                                          random_state=random_state),
        'Histogram-base Outlier Detection (HBOS)': HBOS(
            contamination=outliers_fraction),
        'Isolation Forest': IForest(contamination=outliers_fraction,
                                    random_state=random_state),
        'K Nearest Neighbors (KNN)': KNN(contamination=outliers_fraction),
        'Local Outlier Factor (LOF)': LOF(
            contamination=outliers_fraction),
        'Minimum Covariance Determinant (MCD)': MCD(
            contamination=outliers_fraction, random_state=random_state),
        'One-class SVM (OCSVM)': OCSVM(contamination=outliers_fraction),
        'Principal Component Analysis (PCA)': PCA(
            contamination=outliers_fraction, random_state=random_state),
    }
    
    for clf_name, clf in classifiers.items():
        try:
            t0 = time()
            clf.fit(x_train_norm)
            test_scores = clf.decision_function(x_test_norm)
            t1 = time()
            duration = round(t1 - t0, ndigits=4)
            roc = round(roc_auc_score(y_test, test_scores), ndigits=4)
            prn = round(precision_n_scores(y_test, test_scores), ndigits=4)
        except Exception as e:
            roc = 0
            prn = 0
            duration = 0
        print('{clf_name} ROC:{roc}, precision @ rank n:{prn}, '
                  'execution time: {duration}s'.format(
                clf_name=clf_name, roc=roc, prn=prn, duration=duration))
        time_list.append(duration)
        roc_list.append(roc)
        prn_list.append(prn)

    temp_df = pd.DataFrame(time_list).transpose()
    temp_df.columns = df_columns
    time_df = pd.concat([time_df, temp_df], axis=0)

    temp_df = pd.DataFrame(roc_list).transpose()
    temp_df.columns = df_columns
    roc_df = pd.concat([roc_df, temp_df], axis=0)

    temp_df = pd.DataFrame(prn_list).transpose()
    temp_df.columns = df_columns
    prn_df = pd.concat([prn_df, temp_df], axis=0)
time_df.to_csv("spambase-time.csv",index = False)
roc_df.to_csv("spambase-roc.csv",index = False)
prn_df.to_csv("spambase-prn.csv",index = False)
print("All file saved")

<h3 id="2.3">Result analysis</h3>

In [35]:
time_2 = pd.read_csv("spambase-time.csv",index_col = "Data")
roc_2 = pd.read_csv("spambase-roc.csv",index_col = "Data")
prn_2 = pd.read_csv("spambase-prn.csv",index_col = "Data")

**Execution time**
  

In [36]:
time_2.head()

| Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2511 | 58 | 41.0593 | 0.0 | 1.3066 | 2.8639 | 0.9236 | 0.3353 | 0.406 | 0.3844 | 1.4961 | 0.3488 | 0.0141 |
| 2 | 2511 | 58 | 39.7849 | 0.0 | 0.1994 | 2.7086 | 0.0159 | 0.3188 | 0.3953 | 0.3867 | 1.2631 | 0.3351 | 0.0129 |
| 3 | 2511 | 58 | 39.1079 | 0.0 | 0.138 | 2.9073 | 0.0167 | 0.3312 | 0.3949 | 0.3791 | 1.5047 | 0.3367 | 0.0128 |
| 4 | 2511 | 58 | 39.6256 | 0.0 | 0.1494 | 2.8392 | 0.0173 | 0.3313 | 0.3784 | 0.3898 | 1.4111 | 0.3478 | 0.0121 |
| 5 | 2511 | 58 | 39.6256 | 0.0 | 0.1469 | 2.6739 | 0.0166 | 0.3522 | 0.4064 | 0.3606 | 1.4056 | 0.3451 | 0.0128 |


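First, consider the average execution time of the models: `HBOS` has the lowest average time and `FB` the highest. As on the previous dataset, `MCD` is also slow, and the average times on this dataset are clearly longer than on the previous one.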

In [37]:
ans = time_2.loc[:,"ABOD":"PCA"].apply(lambda x: x.mean())
ans.sort_values()

HBOS       0.033642
PCA        0.042355
CBLOF      0.245409
IForest    0.361053
OCSVM      0.515216
LOF        0.661631
KNN        0.686198
ABOD       0.826201
MCD        3.015449
FB         5.057830
dtype: float64

 **ROC/AUC**
  

In [38]:
roc_2.head()

| Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2511 | 58 | 41.0593 | 0.0 | 0.5881 | 0.3256 | 0.7159 | 0.6634 | 0.6022 | 0.3882 | 0.4575 | 0.5747 | 0.5894 |
| 2 | 2511 | 58 | 39.7849 | 0.0 | 0.596 | 0.3369 | 0.7131 | 0.6261 | 0.6093 | 0.3814 | 0.4904 | 0.5633 | 0.5806 |
| 3 | 2511 | 58 | 39.1079 | 0.0 | 0.6007 | 0.3611 | 0.7054 | 0.6732 | 0.6105 | 0.4086 | 0.436 | 0.571 | 0.5834 |
| 4 | 2511 | 58 | 39.6256 | 0.0 | 0.6291 | 0.3663 | 0.7068 | 0.686 | 0.6025 | 0.4275 | 0.4886 | 0.5844 | 0.5942 |
| 5 | 2511 | 58 | 39.6256 | 0.0 | 0.5398 | 0.351 | 0.7244 | 0.6632 | 0.6009 | 0.4063 | 0.4172 | 0.5369 | 0.5515 |


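A higher roc indicates a better model. According to the results below, the average roc of the algorithms mostly lies between 0.5 and 0.7. `IForest`, which performed worst on the previous dataset, has the highest roc here at 0.675383. In a small number of cases most models reach a roc of 1; `IForest` and `HBOS` peak slightly lower, at 0.998 and 0.995, as shown below.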

In [39]:
ans = roc_2.loc[:,"ABOD":"PCA"].apply(lambda x: x.mean())
ans.sort_values(ascending=False)

IForest    0.675383
MCD        0.667410
KNN        0.662475
HBOS       0.654777
OCSVM      0.643906
LOF        0.642503
PCA        0.641658
CBLOF      0.639057
FB         0.628230
ABOD       0.493942
dtype: float64

In [40]:
ans = roc_2.loc[:,"ABOD":"PCA"].apply(lambda x: x.max())
ans.sort_values(ascending=False)

PCA        1.000
OCSVM      1.000
MCD        1.000
LOF        1.000
KNN        1.000
FB         1.000
CBLOF      1.000
ABOD       1.000
IForest    0.998
HBOS       0.995
dtype: float64

**PRN**

In [41]:
prn_2.head()

| Data | #Samples | # Dimensions | Outlier Perc | ABOD | CBLOF | FB | HBOS | IForest | KNN | LOF | MCD | OCSVM | PCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2511 | 58 | 41.0593 | 0.0 | 0.46 | 0.2615 | 0.6005 | 0.5254 | 0.4746 | 0.3075 | 0.339 | 0.4636 | 0.46 |
| 2 | 2511 | 58 | 39.7849 | 0.0 | 0.447 | 0.25 | 0.5859 | 0.4747 | 0.4596 | 0.2753 | 0.3434 | 0.4419 | 0.4495 |
| 3 | 2511 | 58 | 39.1079 | 0.0 | 0.4219 | 0.25 | 0.5911 | 0.5182 | 0.4427 | 0.2786 | 0.2812 | 0.4167 | 0.4375 |
| 4 | 2511 | 58 | 39.6256 | 0.0 | 0.5036 | 0.3126 | 0.5943 | 0.568 | 0.5036 | 0.358 | 0.3675 | 0.4821 | 0.4821 |
| 5 | 2511 | 58 | 39.6256 | 0.0 | 0.399 | 0.243 | 0.5882 | 0.5294 | 0.4399 | 0.289 | 0.2737 | 0.4153 | 0.399 |


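prn is the precision among the n highest-scored points (precision @ rank n); a higher value means more accurate detection. According to the results below, the average prn of the algorithms lies between roughly 0.2 and 0.27, which is low and lower than on the previous dataset. `IForest` has the highest average prn, followed by `KNN`. Not every model achieves a fully correct detection, as shown below. Although `IForest` never achieves one (its maximum is 0.9474), it is the highest and most stable on average.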

In [42]:
ans = prn_2.loc[:,"ABOD":"PCA"].apply(lambda x: x.mean())
ans.sort_values(ascending=False)

IForest    0.269408
KNN        0.259158
PCA        0.248830
CBLOF      0.245386
OCSVM      0.244198
MCD        0.243359
HBOS       0.243084
LOF        0.214420
FB         0.209865
ABOD       0.207996
dtype: float64

In [43]:
ans = prn_2.loc[:,"ABOD":"PCA"].apply(lambda x: x.max())
ans.sort_values(ascending=False)

PCA        1.0000
OCSVM      1.0000
MCD        1.0000
LOF        1.0000
KNN        1.0000
FB         1.0000
CBLOF      1.0000
ABOD       1.0000
IForest    0.9474
HBOS       0.9474
dtype: float64

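Taken across both datasets, `KNN` is a simple algorithm but gives relatively stable and reliable results. Dataset 1 (abalone) has low dimensionality, which may explain the very poor performance of Isolation Forest there. Dataset 2 (spambase) has higher dimensionality, where Isolation Forest shows its strengths, with prn and roc both high relative to the other algorithms. In addition, over the course of the experiments OCSVM failed to fit in some cases; in those cases prn and roc were both set to 0.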
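
Because failed fits are recorded as 0, they also enter the column means reported above. The sketch below shows the size of that effect on made-up numbers. The values in `runs` are hypothetical and are not taken from the experiments, and averaging over successful fits only is one possible alternative, not what the notebook does.

```python
from statistics import mean

# Hypothetical per-benchmark roc values for one detector; None marks a failed fit.
runs = [0.71, 0.69, None, 0.73, None]

# What the notebook does: a failed fit contributes 0 to the average.
with_placeholder = mean([0.0 if r is None else r for r in runs])

# Alternative: average over successful fits and report failures separately.
successful_only = mean([r for r in runs if r is not None])
failure_count = sum(r is None for r in runs)

print(with_placeholder, successful_only, failure_count)
```

With two failures out of five runs, the placeholder average is well below the average over the successful fits, so a low mean can reflect failed fits rather than poor ranking quality.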