This program runs the Isolation Forest method over the http subset of the KDDCUP99 dataset to determine if data instances within the set are normal or anomalous. 

In [None]:
from sklearn.metrics import average_precision_score
from sklearn.ensemble import IsolationForest
from sklearn.datasets import fetch_kddcup99
import numpy as np

In [None]:
kddcup99_data = fetch_kddcup99(subset='http')
X = kddcup99_data['data']
y = kddcup99_data['target']
# Encode labels the way IsolationForest.predict does: 1 = normal, -1 = anomaly.
y = np.where(y == b'normal.', 1, -1)

In [None]:
isof = IsolationForest(max_samples=100, random_state=42)
isof.fit(X)
anomaly_predictions = isof.predict(X)
print(anomaly_predictions[0:100])
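The hard -1/1 predictions above depend on the forest's internal threshold. As a complement, the imported `average_precision_score` could be applied to the continuous anomaly scores instead. The sketch below is only an illustration under stated assumptions: it uses a small synthetic dataset (`X_demo`, `y_demo`, both hypothetical stand-ins) rather than the KDDCUP99 http subset, so that it runs without downloading anything.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score

# Synthetic stand-in data (an assumption, not the KDDCUP99 http subset):
# 500 normal points near the origin plus 20 anomalies far away from it.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0, 1, size=(500, 3)),
                    rng.normal(8, 1, size=(20, 3))])
y_demo = np.array([1] * 500 + [-1] * 20)  # 1 = normal, -1 = anomaly

forest = IsolationForest(max_samples=100, random_state=42).fit(X_demo)

# score_samples is higher for more normal points, so negate it to get a
# score that grows as a point looks more anomalous.
anomaly_scores = -forest.score_samples(X_demo)

# Treat anomalies as the positive class when computing average precision.
ap = average_precision_score(y_demo == -1, anomaly_scores)
print('Average precision =', ap)
```

On the real data, the same two calls would take `X` and `y` in place of the synthetic arrays, with `y` cast to integers first.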

In [None]:
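import pandas as pd

def param_sweep(n):
    # Fit an Isolation Forest with max_samples=n and predict on the same data.
    isof = IsolationForest(max_samples=n, random_state=42)
    isof.fit(X)
    anomaly_predictions = isof.predict(X)

    # Rows are true labels, columns are predictions, both ordered [-1, 1].
    # reindex keeps the table 2x2 even if one class is never predicted.
    df_confusion = pd.crosstab(y.astype(int), anomaly_predictions).reindex(
        index=[-1, 1], columns=[-1, 1], fill_value=0)

    # Anomalies (-1) are treated as the positive class.
    TP = df_confusion.iloc[0,0]  # True Positives: anomalies predicted anomalous
    FN = df_confusion.iloc[0,1]  # False Negatives: anomalies predicted normal
    FP = df_confusion.iloc[1,0]  # False Positives: normal points predicted anomalous
    TN = df_confusion.iloc[1,1]  # True Negatives: normal points predicted normal

    precision_score = TP/(TP+FP)
    recall_score = TP/(TP+FN)
    f1_score = 2*((precision_score*recall_score)/(precision_score+recall_score))

    # The true positive rate is the same quantity as recall.
    true_positive_rate = TP/(TP + FN)
    false_positive_rate = FP/(FP+TN)

    print('\nPrecision Score = ', precision_score)
    print('\nRecall Score = ', recall_score)
    print('\nF1 Score = ', f1_score)
    print('\nTrue Positive Rate = ', true_positive_rate)
    print('\nFalse Positive Rate = ', false_positive_rate)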
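A small hand-checkable example can confirm which cell of `pd.crosstab(true, predicted)` holds which count. The labels below are made up for illustration, and the sketch assumes anomalies (-1) are the positive class.

```python
import numpy as np
import pandas as pd

# Hypothetical labels: 1 = normal, -1 = anomaly.
y_true = np.array([-1, -1, -1, 1, 1, 1, 1, 1])
y_pred = np.array([-1, -1, 1, -1, 1, 1, 1, 1])

# Rows are true labels, columns are predictions, both ordered [-1, 1].
table = pd.crosstab(y_true, y_pred).reindex(
    index=[-1, 1], columns=[-1, 1], fill_value=0)

tp = table.loc[-1, -1]  # anomalies predicted anomalous
fn = table.loc[-1, 1]   # anomalies predicted normal
fp = table.loc[1, -1]   # normal points predicted anomalous
tn = table.loc[1, 1]    # normal points predicted normal

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(table)
print('precision =', precision, 'recall =', recall)
```

Counting by hand gives two true positives, one false negative, one false positive and four true negatives, so precision and recall are both 2/3.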

In [None]:
param_sweep(4)

With max_samples set this low, the algorithm performed extremely poorly: it detected almost no anomalies and mislabeled other instances.
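One possible contributor, which this notebook does not investigate, is tree depth. In the original Isolation Forest formulation each tree is grown only to a height limit of ceil(log2(sub-sample size)), and as far as we know scikit-learn derives its depth limit from max_samples in the same way. The sketch below just evaluates that formula for the values tried here; `height_limit` is a hypothetical helper, not a library function.

```python
import math

def height_limit(max_samples):
    # ceil(log2(max_samples)), treating anything below 2 samples as 2.
    return math.ceil(math.log2(max(max_samples, 2)))

for n in [4, 5, 50, 100, 2400]:
    print('max_samples =', n, '-> height limit =', height_limit(n))
```

Under this formula, a sub-sample of 4 allows trees only two levels deep, which leaves little room to separate anomalies from normal points by path length.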

In [None]:
param_sweep(5)

Increasing max_samples by 1 raised the precision significantly, and the algorithm was able to correctly identify anomalies. However, it did not come close to detecting all of the anomalies present.

In [None]:
param_sweep(50)

With max_samples increased further, the algorithm flagged only true anomalies, though it still missed many of them. We will keep increasing this parameter to look for the optimal value of max_samples.

In [None]:
param_sweep(100)

In [None]:
param_sweep(150)

In [None]:
param_sweep(500)

In [None]:
param_sweep(1000)

In [None]:
param_sweep(1250)

In [None]:
param_sweep(1500)

In [None]:
param_sweep(1700)

In [None]:
param_sweep(2000)

In [None]:
param_sweep(2400)
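The individual calls above could also be collected into one loop that returns a table, which makes the values easier to compare side by side. This is a minimal sketch, not the notebook's own method: `sweep_max_samples` is a hypothetical helper, it uses scikit-learn's metric functions with anomalies (-1) as the positive class, and it is demonstrated on synthetic stand-in data rather than the KDDCUP99 http subset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score, f1_score

def sweep_max_samples(X, y, sizes, random_state=42):
    """Fit one forest per max_samples value and tabulate anomaly-class metrics."""
    y = np.asarray(y).astype(int)
    rows = []
    for n in sizes:
        forest = IsolationForest(max_samples=n, random_state=random_state)
        preds = forest.fit_predict(X)
        rows.append({
            'max_samples': n,
            # pos_label=-1 scores the anomaly class; zero_division=0 avoids
            # warnings when no point is flagged as anomalous.
            'precision': precision_score(y, preds, pos_label=-1, zero_division=0),
            'recall': recall_score(y, preds, pos_label=-1, zero_division=0),
            'f1': f1_score(y, preds, pos_label=-1, zero_division=0),
        })
    return pd.DataFrame(rows)

# Synthetic stand-in data (an assumption): 500 normal points, 20 anomalies.
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(0, 1, size=(500, 3)),
                    rng.normal(8, 1, size=(20, 3))])
y_demo = np.array([1] * 500 + [-1] * 20)

results = sweep_max_samples(X_demo, y_demo, sizes=[4, 50, 100])
print(results)
```

On the real data, the call would be `sweep_max_samples(X, y, sizes=[4, 5, 50, 100, 150, 500, 1000, 1250, 1500, 1700, 2000, 2400])`, matching the values tried in the cells above.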