This notebook produces three plots (precision, recall, and F1 score) for the Local Outlier Factor (LOF) algorithm on the http subset of the KDD Cup 99 dataset. The x-axis of each plot is k, the number of nearest neighbors, and the y-axis is the score for the corresponding metric.

In [1]:
import numpy as np
from sklearn.datasets import fetch_kddcup99

# Load the http subset of KDD Cup 99.
kdd99_data = fetch_kddcup99(subset='http')
X = kdd99_data['data']
y = kdd99_data['target']

# Encode labels the way LOF reports them: 1 = normal, -1 = anomaly.
y = np.where(y == b'normal.', 1, -1)

In [2]:
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import f1_score, precision_score, recall_score

# One entry is appended to each list per evaluated value of k.
recall_scores = []
precision_scores = []
f1_scores = []

def calculations(n):
    """Fit LOF with n neighbors and record recall, precision, and F1 for the anomaly class."""
    lof = LocalOutlierFactor(n_neighbors=n)
    # fit_predict returns 1 for inliers and -1 for outliers, matching the encoding of y.
    anomaly_predictions = lof.fit_predict(X)

    # Anomalies (-1) are the positive class for all three metrics.
    r = recall_score(y, anomaly_predictions, pos_label=-1)
    p = precision_score(y, anomaly_predictions, pos_label=-1)
    f = f1_score(y, anomaly_predictions, pos_label=-1)

    recall_scores.append(r)
    precision_scores.append(p)
    f1_scores.append(f)
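
Because `pos_label = -1` makes the anomaly class the positive class, it may help to see the three metrics computed by hand. The following is a minimal sketch in plain Python, not the scikit-learn implementation; the helper name `anomaly_metrics` and the toy labels are hypothetical, and returning 0.0 when a denominator is zero is a simplifying assumption.

```python
def anomaly_metrics(y_true, y_pred, pos_label=-1):
    """Return (precision, recall, f1), treating pos_label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    # True positives: anomalies that were flagged as anomalies.
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    # False positives: normal points that were flagged (false alarms).
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    # False negatives: anomalies that were missed.
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels: 1 = normal, -1 = anomaly.
toy_true = [1, 1, 1, -1, -1, 1, -1, 1]
toy_pred = [1, -1, 1, -1, 1, 1, 1, 1]
toy_precision, toy_recall, toy_f1 = anomaly_metrics(toy_true, toy_pred)
print(toy_precision, toy_recall, toy_f1)
```

In this toy case one of the two flagged points is a real anomaly (precision 1/2) and one of the three real anomalies is flagged (recall 1/3).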

In [None]:
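ilist = []

# Small neighborhoods first: k = 1, 2, 3, 4.
i = 1
while i < 5:
    calculations(i)
    ilist.append(i)
    i = i + 1

# Then double k from 5 up to 160: k = 5, 10, 20, 40, 80, 160.
i = 5
while i <= 160:
    calculations(i)
    ilist.append(i)
    i = i * 2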
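
The grid of k values visited by the two loops can also be built in one place, which makes it easy to inspect before running the expensive LOF fits. This is a small sketch only; the helper `k_grid` and its parameter names are hypothetical and not part of the notebook.

```python
def k_grid(linear_stop=5, start=5, limit=160, factor=2):
    """Return k = 1 .. linear_stop - 1, then start, start * factor, ... up to limit."""
    ks = list(range(1, linear_stop))
    k = start
    while k <= limit:
        ks.append(k)
        k *= factor
    return ks

print(k_grid())  # [1, 2, 3, 4, 5, 10, 20, 40, 80, 160]
```

With such a helper, the evaluation could be written as a single `for k in k_grid(): calculations(k)` loop.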


In [None]:
# Values of k that were evaluated
print('Values of K: ', ilist)

In [None]:
print('Recall Scores: ', recall_scores)
print('\n\nPrecision Scores: ', precision_scores)
print('\n\nF1 Scores: ', f1_scores)

In [None]:
import matplotlib.pyplot as plt
plt.plot(ilist, recall_scores)
plt.xlabel('Number of Nearest Neighbors')
plt.ylabel('Recall Score')
plt.title('LOF Recall Plot')
plt.show()

In [None]:
import matplotlib.pyplot as plt
plt.plot(ilist, precision_scores)
plt.xlabel('Number of Nearest Neighbors')
plt.ylabel('Precision Score')
plt.title('LOF Precision Plot')
plt.show()

In [None]:
import matplotlib.pyplot as plt
plt.plot(ilist, f1_scores)
plt.xlabel('Number of Nearest Neighbors')
plt.ylabel('F1 Score')
plt.title('LOF F1 Plot')
plt.show()

Taken together, the three plots suggest that the best value of k is about 40. Increasing k further would detect more of the anomalies, but at the cost of a higher false alarm rate. The F1 score (the harmonic mean of precision and recall) also seems to level off and then decrease after k = 40.
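
Reading the best k off the plots can be cross-checked by taking the k with the highest F1 score. The sketch below assumes that F1 alone is the selection criterion, which is a simplification of the visual comparison described above; the helper name `best_k` is hypothetical.

```python
def best_k(k_values, scores):
    """Return (k, score) for the highest score; on ties, the earliest k in the list wins."""
    if len(k_values) != len(scores):
        raise ValueError("k_values and scores must have the same length")
    if not scores:
        raise ValueError("scores must not be empty")
    best_index = max(range(len(scores)), key=lambda idx: scores[idx])
    return k_values[best_index], scores[best_index]

# In this notebook the call would be: best_k(ilist, f1_scores)
```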

In [None]:
lof = LocalOutlierFactor(n_neighbors=40)
anomaly_predictions = lof.fit_predict(X)

r = recall_score(y, anomaly_predictions, pos_label=-1)
p = precision_score(y, anomaly_predictions, pos_label=-1)
f = f1_score(y, anomaly_predictions, pos_label=-1)

print('Value of K: 40', '\nRecall Score: ', r, '\nPrecision Score: ', p, '\nF1 Score: ', f)

Even with k tuned, the results show that LOF is not an effective algorithm for this dataset (or at least for its http subset). Recall is only 4.8%, meaning that only 4.8% of the anomalies were detected, and precision is only 12.4%, meaning that only 12.4% of the points flagged as anomalies were actual anomalies. The resulting F1 score is exceptionally low, at about 7%.
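
The figure of about 7% is consistent with the F1 score, the harmonic mean of the two percentages above. A quick check, using the rounded values quoted in the text (so the result is approximate):

```python
# Rounded values quoted in the text; the harmonic mean is symmetric,
# so it does not matter which of the two is precision and which is recall.
quoted_recall = 0.048
quoted_precision = 0.124

quoted_f1 = 2 * quoted_precision * quoted_recall / (quoted_precision + quoted_recall)
print(round(quoted_f1, 3))  # 0.069
```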