This program applies the Local Outlier Factor (LOF) method to the KDDCUP99 dataset to determine whether each data instance is normal or anomalous.

In [1]:
from sklearn.datasets import fetch_kddcup99

In [2]:
kddcup99_data = fetch_kddcup99(subset='http')

In [3]:
X = kddcup99_data['data']
X.shape

(58725, 3)

In [4]:
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(n_neighbors=3)

The n_neighbors parameter in the code above sets how many nearest neighbors LOF uses when estimating the local density around each data instance. Here it is set to 3.
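
For reference, the role of n_neighbors can be read off the standard definition of LOF, sketched below with k standing for n_neighbors. The symbols d (the distance between two points), N_k(p) (the k nearest neighbors of p) and lrd_k (local reachability density) are notation introduced here for illustration; they are not scikit-learn names.

```latex
\begin{aligned}
\operatorname{reach\text{-}dist}_k(p, o) &= \max\{\, k\text{-distance}(o),\; d(p, o) \,\} \\
\operatorname{lrd}_k(p) &= \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \operatorname{reach\text{-}dist}_k(p, o) \right)^{-1} \\
\operatorname{LOF}_k(p) &= \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\operatorname{lrd}_k(o)}{\operatorname{lrd}_k(p)}
\end{aligned}
```

A value near 1 means p is about as dense as its neighbors, while a value well above 1 means p sits in a sparser region than its neighbors.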

In [5]:
anomaly_predictions = lof.fit_predict(X)
print(anomaly_predictions[0:100])

[ 1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1 -1  1  1 -1  1  1  1  1  1
  1  1  1  1 -1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1
  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1
  1 -1  1  1]


Here the prediction for each data point is returned as either 1 or -1. Points labeled 1 are considered normal, while points labeled -1 are considered anomalous. The full array of 58,725 predictions is too large to display here, so only the first 100 are shown. Of these 100 points, 12 are labeled as anomalies.
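
Counting labels by eye does not scale to the whole array, so a small helper can do it instead. The sketch below is only an illustration: summarize_labels and first_labels are hypothetical names, and the sample holds the first 24 labels printed above rather than the full prediction array.

```python
def summarize_labels(labels):
    """Return (number of anomalies, fraction of anomalies) for +1/-1 labels."""
    labels = list(labels)
    n_anomalies = sum(1 for label in labels if label == -1)
    return n_anomalies, n_anomalies / len(labels)


# First 24 labels from the output above (n_neighbors = 3).
first_labels = [1] * 12 + [-1, 1, 1, -1, 1, 1, -1] + [1] * 5

n_anomalies, fraction = summarize_labels(first_labels)
print(n_anomalies, fraction)  # prints: 3 0.125
```

The helper accepts any iterable of labels, so it should also work when passed anomaly_predictions directly.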

In [6]:
lof = LocalOutlierFactor(n_neighbors=15)

Increasing the n_neighbors parameter makes LOF compare each point against a larger neighborhood. We would expect fewer points to be labeled as anomalies with a larger number of nearest neighbors.

In [7]:
anomaly_predictions = lof.fit_predict(X)
print(anomaly_predictions[0:100])

[ 1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1]


As expected, fewer anomalies are found within the first 100 data points: only 1 is now labeled as an anomaly. From here we can look at the negative_outlier_factor_ attribute, which holds the negated LOF of each data instance, to see how anomalous each point is. Points with a lower LOF are less anomalous than those with a higher LOF. Because the stored values are negated, the more negative a value is, the more anomalous the point.

In [8]:
print(lof.negative_outlier_factor_[0:100])

[-0.99198021 -0.97924315 -1.02248748 -0.99978689 -0.9953473  -0.9953473
 -1.01464236 -0.95650471 -2.17149096 -1.01110182 -0.98361208 -1.02151687
 -1.00010223 -1.09080608 -0.97007533 -1.09737055 -0.97094389 -1.0249061
 -1.22139993 -0.99946346 -0.99068709 -1.15963642 -0.97007533 -1.07676281
 -1.07541388 -1.03336647 -1.12832352 -1.04208377 -1.05695764 -1.06412748
 -1.0255507  -1.132639   -1.00201484 -1.01947777 -0.98537706 -1.00398002
 -0.99676279 -1.07874037 -1.00988398 -0.99429884 -0.99429884 -1.03415569
 -1.04195639 -1.02956859 -1.01436433 -1.05737536 -1.43355147 -0.95822263
 -1.04437026 -1.00451598 -1.0830859  -1.01953638 -1.26388732 -1.15040383
 -1.07583971 -1.01673097 -1.12955755 -1.02662833 -0.97829437 -0.99057346
 -1.03160055 -1.11378782 -1.00073398 -1.04664331 -1.01179738 -1.00440836
 -1.05785946 -0.99105508 -1.10101385 -1.00408763 -1.02223517 -1.04760285
 -0.98083369 -1.0457116  -0.99358395 -0.96072066 -1.00002422 -0.97482854
 -0.99735028 -0.97993074 -1.08638335 -1.02134508 -1.0
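
To read these values as ordinary LOF scores, we can flip the sign back and rank the points. The sketch below uses only the first 12 values printed above. rank_by_lof is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# First 12 values of negative_outlier_factor_ from the output above (n_neighbors = 15).
negative_factors = [
    -0.99198021, -0.97924315, -1.02248748, -0.99978689, -0.9953473, -0.9953473,
    -1.01464236, -0.95650471, -2.17149096, -1.01110182, -0.98361208, -1.02151687,
]


def rank_by_lof(negative_factors):
    """Return (index, LOF) pairs sorted from most to least anomalous."""
    lof_scores = [-value for value in negative_factors]  # undo the sign flip
    return sorted(enumerate(lof_scores), key=lambda pair: pair[1], reverse=True)


ranking = rank_by_lof(negative_factors)
most_anomalous_index, highest_lof = ranking[0]
print(most_anomalous_index, highest_lof)  # prints: 8 2.17149096
```

Index 8 is the same point that was labeled -1 in the predictions for n_neighbors = 15.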

The offset_ attribute will tell us the threshold used to label the anomalous points. 

In [9]:
lof.offset_

-1.5

The output tells us that a point is considered an anomaly if its local outlier factor is greater than 1.5, that is, if its negative_outlier_factor_ value is below -1.5. Using the same number of nearest neighbors, we can obtain a different threshold by setting the contamination parameter.
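
The thresholding rule can be sketched in a few lines. As far as we understand scikit-learn's behavior, fit_predict labels a point -1 when its negative outlier factor is strictly below offset_. labels_from_scores is a hypothetical helper that mimics that comparison, and scores holds the first 12 values printed earlier.

```python
def labels_from_scores(negative_factors, offset):
    """Label a point -1 when its negative outlier factor falls below the offset."""
    return [-1 if score < offset else 1 for score in negative_factors]


# First 12 values of negative_outlier_factor_ for n_neighbors = 15.
scores = [
    -0.99198021, -0.97924315, -1.02248748, -0.99978689, -0.9953473, -0.9953473,
    -1.01464236, -0.95650471, -2.17149096, -1.01110182, -0.98361208, -1.02151687,
]

labels = labels_from_scores(scores, offset=-1.5)
print(labels)  # prints: [1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1]
```

These labels match the first 12 predictions printed for n_neighbors = 15.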

In [10]:
lof = LocalOutlierFactor(n_neighbors=15, contamination=0.2)
anomaly_predictions = lof.fit_predict(X)
print(anomaly_predictions[0:100])

[ 1  1  1  1  1  1  1  1 -1  1  1  1  1 -1  1 -1  1  1 -1  1  1 -1  1  1
  1  1 -1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1
  1  1  1  1 -1 -1  1  1 -1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1 -1  1 -1  1
  1 -1  1  1]


Although the n_neighbors parameter remained the same, the number of anomalies increased. This is because setting contamination=0.2 labels as anomalous the 20% of the data with the largest local outlier factor values.

In [11]:
lof.offset_

-1.0906645632342706

In absolute terms, the offset_ value for these predictions is lower than the one returned before setting the contamination: the LOF threshold dropped from 1.5 to about 1.09. Intuitively this makes sense, because a lower LOF threshold means more points exceed it and are labeled anomalous. Thus, we have obtained our expected results.
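
The link between contamination and the threshold can be sketched as a percentile computation. To our understanding, when contamination is a float scikit-learn sets offset_ to the corresponding percentile of negative_outlier_factor_. The helpers percentile and offset_for_contamination below are hypothetical stand-ins written for this illustration. The sample is only the first 12 scores printed earlier, so the resulting offset differs from the one computed on the full dataset.

```python
import math


def percentile(values, q):
    """Percentile with linear interpolation between closest ranks (q in [0, 100])."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def offset_for_contamination(negative_factors, contamination):
    """Threshold below which roughly `contamination` of the scores fall."""
    return percentile(negative_factors, 100.0 * contamination)


# First 12 values of negative_outlier_factor_ for n_neighbors = 15.
scores = [
    -0.99198021, -0.97924315, -1.02248748, -0.99978689, -0.9953473, -0.9953473,
    -1.01464236, -0.95650471, -2.17149096, -1.01110182, -0.98361208, -1.02151687,
]

offset = offset_for_contamination(scores, contamination=0.2)
flagged = [index for index, score in enumerate(scores) if score < offset]
print(flagged)  # prints: [2, 8, 11]
```

With only 12 points the flagged share is 3 of 12 rather than exactly 20%. On a large dataset the share should land much closer to the requested contamination.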

Now we will conduct a parameter sweep by setting the number of nearest neighbors to various values. The purpose of the parameter sweep is to see how the results change with n_neighbors and to help choose a suitable value for it.

In [12]:
def param_sweep(n):
    """Fit LOF with n nearest neighbors and print results for the first 100 points."""
    print('\n\nNumber of Nearest Neighbors:', n, '\n')
    lof = LocalOutlierFactor(n_neighbors=n)
    anomaly_predictions = lof.fit_predict(X)
    print('Anomaly Predictions:\n', anomaly_predictions[0:100])
    print('\nNegative Outlier Factor:\n', lof.negative_outlier_factor_[0:100])
    print('\nThreshold:\n', lof.offset_)
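
Printing 100 values per setting makes the runs hard to compare, so a more compact alternative is to record one number per value of n_neighbors. The sketch below is an assumption-laden illustration rather than part of the original analysis: sweep_anomaly_counts is a hypothetical helper, and X_demo is small synthetic data standing in for KDDCUP99 so that the example runs without a download.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def sweep_anomaly_counts(X, neighbor_values):
    """Fit LOF once per n_neighbors value and count the points labeled -1."""
    counts = {}
    for n in neighbor_values:
        lof = LocalOutlierFactor(n_neighbors=n)
        labels = lof.fit_predict(X)
        counts[n] = int(np.sum(labels == -1))
    return counts


# Synthetic stand-in data (an assumption, not KDDCUP99):
# a dense cluster plus a few widely scattered points.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(size=(200, 3)),
    rng.uniform(-8.0, 8.0, size=(5, 3)),
])

counts = sweep_anomaly_counts(X_demo, [5, 10, 20])
for n, count in counts.items():
    print(n, count)
```

Passing the real X in place of X_demo would give the anomaly count over the full dataset for each setting.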

In [13]:
# Sweep n_neighbors over 1 through 5.
for i in range(1, 6):
    param_sweep(i)



Number of Nearest Neighbors: 1 

Anomaly Predictions:
 [ 1 -1  1  1  1  1  1 -1 -1  1 -1 -1 -1  1  1  1  1 -1  1  1  1  1  1 -1
  1 -1 -1 -1  1 -1  1  1  1 -1  1 -1  1  1 -1  1  1  1 -1  1  1  1  1  1
  1  1  1 -1  1  1 -1 -1  1  1  1  1  1 -1  1  1 -1 -1 -1  1 -1 -1  1  1
  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1
  1 -1  1  1]

Negative Outlier Factor:
 [-1.00000000e+00 -4.10593894e+07 -1.00000000e+00 -1.00000000e+00
 -1.00000000e+00 -1.00000000e+00 -1.00000000e+00 -4.55765382e+00
 -3.81710851e+00 -1.00000000e+00 -3.19949091e+07 -8.27458495e+07
 -2.59033959e+07 -1.00000000e+00 -1.00000000e+00 -1.00000000e+00
 -1.00000000e+00 -1.53767414e+00 -1.00000000e+00 -1.00000000e+00
 -1.45471037e+00 -1.00000000e+00 -1.00000000e+00 -8.22965255e+07
 -1.00000000e+00 -1.66860726e+00 -7.38826110e+07 -8.13415271e+07
 -1.21929646e+00 -2.08336852e+00 -1.01303619e+00 -1.36557660e+00
 -1.00000000e+00 -7.04890147e+07 -1.00000000e+00 -3.01003459e+00
 -1.00000000e+00 -1.00000

In [14]:
# Sweep n_neighbors over 10, 20, 40, 80 and 160 (doubling each step).
for i in (10, 20, 40, 80, 160):
    param_sweep(i)



Number of Nearest Neighbors: 10 

Anomaly Predictions:
 [ 1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1
  1  1  1  1]

Negative Outlier Factor:
 [-1.03042195 -0.95301252 -0.97544097 -1.03115702 -0.96641432 -0.96641432
 -1.04721119 -1.09647807 -2.03291304 -1.120309   -0.97623517 -1.11381799
 -1.04191297 -1.0186755  -1.01148745 -1.27996605 -0.99329819 -1.01972432
 -0.99501101 -0.97567026 -0.99481425 -1.01336375 -1.01148745 -1.15644704
 -1.02017126 -1.00161641 -1.08734919 -1.01222914 -1.16717104 -1.05860853
 -0.99384116 -1.09050616 -1.01311068 -1.03051465 -1.09290609 -0.98403843
 -1.0165453  -1.03246917 -1.04940301 -0.98495094 -0.98495094 -1.01511611
 -1.08010221 -1.04709527 -1.01253645 -1.18519046 -1.3495674  -1.03839658
 -1.0678441  -1.08654205 