# Implementation of K-Means Clustering on NSL-KDD Dataset
Using Method Described in *"K-Means Clustering Approach to Analyze
NSL-KDD Intrusion Detection Dataset"* found [here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf).

This notebook uses Weka through the python-weka-wrapper package to cluster the [NSL-KDD dataset](http://www.unb.ca/cic/research/datasets/nsl.html) and analyze the results.

## Step 1 - Start Java Virtual Machine

In [1]:
import os
cwd = os.getcwd()

import weka.core.jvm as jvm
jvm.start(max_heap_size="2g")

DEBUG:weka.core.jvm:Adding bundled jars
DEBUG:weka.core.jvm:Classpath=['C:\\Users\\Nick\\Miniconda3\\lib\\site-packages\\javabridge\\jars\\rhino-1.7R4.jar', 'C:\\Users\\Nick\\Miniconda3\\lib\\site-packages\\javabridge\\jars\\runnablequeue.jar', 'C:\\Users\\Nick\\Miniconda3\\lib\\site-packages\\javabridge\\jars\\cpython.jar', 'C:\\Users\\Nick\\Miniconda3\\lib\\site-packages\\weka\\lib\\python-weka-wrapper.jar', 'C:\\Users\\Nick\\Miniconda3\\lib\\site-packages\\weka\\lib\\weka.jar']
DEBUG:weka.core.jvm:MaxHeapSize=2g
DEBUG:javabridge.jutil:Creating JVM object
DEBUG:javabridge.jutil:Signalling caller


## Step 2 - Load dataset
We use the training dataset provided by NSL-KDD in ARFF (`.arff`) format, and we tell Weka that the class attribute is the last column of the table.

In [2]:
from weka.core.converters import Loader
loader = Loader(classname="weka.core.converters.ArffLoader")
# Build the path with os.path.join so backslashes are not read as escape sequences
data = loader.load_file(os.path.join(cwd, "data", "KDDTrain+.arff"))
data.class_is_last()
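
For readers who have not seen ARFF before, the following is a minimal sketch of the layout that the loader reads and of what "the class is in the last column" means. The sample text, the attribute names in it, and the helper `parse_arff` are made up for illustration; they are not taken from `KDDTrain+.arff`, and the parser only handles simple, dense, comma-separated files.

```python
# Hypothetical miniature ARFF file; NOT the real KDDTrain+.arff contents.
SAMPLE_ARFF = """\
@relation toy_connections
@attribute duration numeric
@attribute protocol_type {tcp,udp,icmp}
@attribute class {normal,anomaly}
@data
0,tcp,normal
2,udp,anomaly
"""


def parse_arff(text):
    """Return (attribute_names, rows) for a simple, dense ARFF string."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


names, rows = parse_arff(SAMPLE_ARFF)
class_index = len(names) - 1  # "class is last": index of the final attribute
labels = [row[class_index] for row in rows]
print(names[class_index], labels)
```

In the notebook itself, Weka's `ArffLoader` does this parsing, and `data.class_is_last()` records the class index on the loaded dataset.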

## Step 3 - Filter and Normalize the Data
We do not want the class to be visible to the clustering algorithm, so we remove the last attribute. We also normalize all numeric values to the range [0, 1].

In [3]:
from weka.filters import Filter
remove = Filter(classname="weka.filters.unsupervised.attribute.Remove", options=["-R", "last"])
remove.inputformat(data)
data_no_class = remove.filter(data)

norm = Filter(classname="weka.filters.unsupervised.attribute.Normalize")
norm.inputformat(data_no_class)
norm_data = norm.filter(data_no_class)
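
As a rough illustration of what scaling to [0, 1] means, here is a minimal min-max sketch in plain Python. It is not Weka's `Normalize` filter; the function name `min_max_normalize`, the sample values, and the choice to map a constant column to 0 are assumptions made for this example.

```python
def min_max_normalize(column):
    """Scale a list of numbers linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero (assumed convention)
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


durations = [0, 5, 10, 20]  # made-up values for illustration
scaled = min_max_normalize(durations)
print(scaled)
```

Each attribute is scaled independently, so attributes with large raw ranges no longer dominate the distance computation used by the clusterer.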

## Step 4 - Build Cluster and Evaluate
We will use the Simple K-Means clustering algorithm with 4 clusters.
* First, we train it on the filtered, normalized data.
* Then, we evaluate it on the original dataset using Weka's classes-to-clusters evaluation.
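
To make the algorithm concrete, here is a minimal sketch of the standard k-means (Lloyd's) iteration in plain Python. It is not Weka's `SimpleKMeans`: it handles numeric attributes only, and the function name `kmeans`, the toy points, and the seed are assumptions chosen for illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    assignment = None
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:  # no point changed cluster: converged
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment


# Toy data: two well-separated groups along the diagonal.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (0.8, 0.8), (0.9, 0.9), (1.0, 1.0)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids, so only the grouping is meaningful, not the label values.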

In [4]:
from weka.clusterers import Clusterer
clusterer = Clusterer(classname="weka.clusterers.SimpleKMeans", options=["-N", "4"])
clusterer.build_clusterer(norm_data)

from weka.clusterers import ClusterEvaluation
evaluator = ClusterEvaluation()
evaluator.set_model(clusterer)
evaluator.test_model(data)
print(evaluator.cluster_results)


kMeans

Number of iterations: 6
Within cluster sum of squared errors: 161122.68303550128

Initial starting points (random):

Cluster 0: 0,udp,domain_u,SF,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.136986,0.293542,0,0,0,0,1,0,0.01,1,0.996078,1,0.01,0,0,0,0,0,0
Cluster 1: 0,tcp,ftp_data,SF,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0.003914,0.003914,0,0,0,0,1,0,0,0.717647,0.32549,0.45,0.02,0.45,0,0,0,0,0
Cluster 2: 0,tcp,private,S0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.542074,0.019569,1,1,0,0,0.04,0.06,0,1,0.035294,0.04,0.06,0,0,1,1,0,0
Cluster 3: 0,tcp,http,SF,0,0.000001,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0.009785,0.039139,0,0,0,0,1,0,0.25,0.156863,1,1,0,0.03,0.04,0,0,0,0

Missing values globally replaced with mean/mode

Final cluster centroids:
                                         Cluster#
Attribute                     Full Data         0         1         2         3
                             (125973.0) (25779.0) (15024.0) (34973.0) (50197.0)
duration                         0.0067    0.0066
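
The idea behind classes-to-clusters evaluation can be sketched as follows: cluster without the class, then compare cluster membership against the known labels. This simplified version maps each cluster to its majority class and reports the fraction of mismatches. Weka's own implementation may assign classes to clusters differently, and the function name `classes_to_clusters` and the toy data are assumptions for illustration, not results from the NSL-KDD run.

```python
from collections import Counter, defaultdict


def classes_to_clusters(cluster_ids, labels):
    """Map each cluster to its majority class and return (mapping, error rate)."""
    by_cluster = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        by_cluster[cid][label] += 1
    # Majority vote: the most frequent label in each cluster.
    mapping = {cid: counts.most_common(1)[0][0] for cid, counts in by_cluster.items()}
    errors = sum(1 for cid, label in zip(cluster_ids, labels) if mapping[cid] != label)
    return mapping, errors / len(labels)


# Made-up toy assignments, not output from the clusterer above.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["normal", "normal", "anomaly", "anomaly", "anomaly", "normal"]
mapping, error_rate = classes_to_clusters(cluster_ids, labels)
print(mapping, round(error_rate, 3))
```

Here two of the six instances disagree with their cluster's majority class, so the error rate is one third.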

## Step 5 - Analyze Results, Evaluate using other methods, Clean Up

In [5]:
jvm.stop()