This program runs the Isolation Forest method over the KDDCUP99 dataset to determine if data instances within the set are normal or anomalous. 

In [1]:
from sklearn.metrics import average_precision_score
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_kddcup99
import numpy as np

In [2]:
kddcup99_data = fetch_kddcup99(subset='http')
X = kddcup99_data['data']
y = kddcup99_data['target']
y[y == b'normal.'] = 1
y[y!= 1] = -1

In [3]:
isof = IsolationForest(max_samples=100, random_state=42)
isof.fit(X)
anomaly_predictions = isof.predict(X)
print(anomaly_predictions[0:100])

[ 1  1  1  1  1  1  1 -1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1 -1  1  1  1 -1  1 -1  1
 -1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1
 -1 -1  1  1]


In [4]:
def param_sweep(n):
    isof = IsolationForest(max_samples=n, random_state=42)
    isof.fit(X)
    anomaly_predictions = isof.predict(X)

    
    import pandas as pd
    df_confusion = pd.crosstab(y, anomaly_predictions)
    TP = df_confusion.iloc[0,0]  # True Positives
    FP = df_confusion.iloc[0,1]  # False Positives
    FN = df_confusion.iloc[1,0]  # False Negatives
    TN = df_confusion.iloc[1,1]  # True Negatives

    precision_score = TP/(TP+FP)
    recall_score = TP/(TP+FN)
    f1_score = 2*((precision_score*recall_score)/(precision_score+recall_score))

    true_positive_rate = TP/(TP + FN)
    false_positive_rate = FP/(FP+TN)
    
    print('\nPrecision Score = ', precision_score)
    print('\nRecall Score = ', recall_score)
    print('\nF1 Score = ', f1_score)
    print('\nTrue Positive Rate = ', true_positive_rate)
    print('\nFalse Positive Rate = ', false_positive_rate)

In [5]:
param_sweep(4)


Precision Score =  0.001358080579447714

Recall Score =  0.017857142857142856

F1 Score =  0.0025241901556583932

True Positive Rate =  0.017857142857142856

False Positive Rate =  0.037672694980958724


With the max_samples parameter set so low, the algorithm performed extremely poorly, detecting almost no anamolies and incorrectly labeling others. 

In [6]:
param_sweep(5)


Precision Score =  0.996831145314622

Recall Score =  0.1656884875846501

F1 Score =  0.28414736434608684

True Positive Rate =  0.1656884875846501

False Positive Rate =  0.00015406624848684935


By increasing this parameter by 1, the precision increased significantly and was able to correctly identify anomalies. However, the algorithm did not even get close to detecting all anomalies present. 

In [7]:
param_sweep(50)


Precision Score =  1.0

Recall Score =  0.16815102382583544

F1 Score =  0.28789261045223513

True Positive Rate =  0.16815102382583544

False Positive Rate =  0.0


By further increasing the max samples the algorithm was able to detect only anomalies, though many anomalies were still missed. We will continue to increase this parameter to determine what the optimal value for max samples is. 

In [8]:
param_sweep(100)


Precision Score =  1.0

Recall Score =  0.21542812560951824

F1 Score =  0.3544892882933483

True Positive Rate =  0.21542812560951824

False Positive Rate =  0.0


In [9]:
param_sweep(150)


Precision Score =  1.0

Recall Score =  0.24720232766338407

F1 Score =  0.39641094661283083

True Positive Rate =  0.24720232766338407

False Positive Rate =  0.0


In [10]:
param_sweep(500)


Precision Score =  1.0

Recall Score =  0.33188100961538464

F1 Score =  0.4983643542019177

True Positive Rate =  0.33188100961538464

False Positive Rate =  0.0


In [11]:
param_sweep(1000)


Precision Score =  1.0

Recall Score =  0.45772896808951513

F1 Score =  0.6280028429282161

True Positive Rate =  0.45772896808951513

False Positive Rate =  0.0


In [12]:
param_sweep(1250)


Precision Score =  1.0

Recall Score =  0.444466800804829

F1 Score =  0.6154060454102243

True Positive Rate =  0.444466800804829

False Positive Rate =  0.0


In [13]:
param_sweep(1500)


Precision Score =  1.0

Recall Score =  0.4931904442956017

F1 Score =  0.6605861244019139

True Positive Rate =  0.4931904442956017

False Positive Rate =  0.0


In [14]:
param_sweep(1700)


Precision Score =  1.0

Recall Score =  0.5081665516448125

F1 Score =  0.6738865161683953

True Positive Rate =  0.5081665516448125

False Positive Rate =  0.0


In [15]:
param_sweep(2000)


Precision Score =  1.0

Recall Score =  0.5189100305379375

F1 Score =  0.6832663161150634

True Positive Rate =  0.5189100305379375

False Positive Rate =  0.0


In [16]:
param_sweep(2400)


Precision Score =  0.05341783612494341

Recall Score =  0.058823529411764705

F1 Score =  0.05599051008303677

True Positive Rate =  0.058823529411764705

False Positive Rate =  0.036865953207919744
