# Implementation of K-Means Clustering on NSL-KDD Dataset
This notebook follows the method described in *"K-Means Clustering Approach to Analyze
NSL-KDD Intrusion Detection Dataset"*, available [here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf).

It uses the scikit-learn k-means implementation to cluster the [NSL-KDD dataset](http://www.unb.ca/cic/research/datasets/nsl.html) and analyze the results.

In [1]:
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import minmax_scale

## Data Loading
Define a data loading function. Categorical columns are converted into binary (one-hot) columns, so the function also returns the resulting column order, which is needed later to give the test set the same layout.

In [2]:
def load_data(file_path, cols=None):
    COL_NAMES = ["duration", "protocol_type", "service", "flag", "src_bytes",
                "dst_bytes", "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
                "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
                "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
                "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
                "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
                "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
                "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
                "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
                "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "labels"]

    data = pd.read_csv(file_path, names=COL_NAMES, index_col=False)

    NOM_IND = [1, 2, 3]
    BIN_IND = [6, 11, 13, 14, 20, 21]
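    # Find the numerical columns for normalization. Note that range(40) leaves out
    # column 40 (dst_host_srv_rerror_rate), a rate that should already lie in [0, 1].
    NUM_IND = sorted(set(range(40)).difference(NOM_IND).difference(BIN_IND))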

    # Scale all numerical data to [0-1]
    data.iloc[:, NUM_IND] = minmax_scale(data.iloc[:, NUM_IND])
    labels = data['labels']
    # Binary labeling
    labels = labels.apply(lambda x: x if x =='normal' else 'anomaly')
    del data['labels']
    data = pd.get_dummies(data)
    if cols is None:
        cols = data.columns
    else:
        # Align to the given columns: add missing ones as 0, drop unseen ones
        # (DataFrame.append was removed in pandas 2.0, so reindex is used instead)
        data = data.reindex(columns=cols, fill_value=0)
    return [data, labels, cols]
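
The reason for saving the column order can be shown on a toy example. The sketch below uses two made-up frames (the names `train`, `test`, `train_enc` and `test_enc` are illustrative, not part of the notebook) in which the test set contains a category the training set never saw and lacks one that it did. It assumes `DataFrame.reindex` is an acceptable way to align the columns.

```python
import pandas as pd

# Toy frames standing in for the train and test sets (values are made up).
train = pd.DataFrame({"duration": [0.0, 0.5], "protocol_type": ["tcp", "udp"]})
test = pd.DataFrame({"duration": [0.2, 0.9], "protocol_type": ["tcp", "icmp"]})

train_enc = pd.get_dummies(train)
cols = train_enc.columns  # column order learned from the training set

# Align the test set: training columns missing from the test set are added
# as 0, and columns unseen during training are dropped.
test_enc = pd.get_dummies(test).reindex(columns=cols, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames have identical columns in identical order, which is what a fitted k-means model expects when predicting on new data.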

## K-Means Evaluation
Next, define an evaluation function. It uses the model to assign each row to a cluster, labels each cluster "normal" or "anomaly" according to its majority class, and reports the fraction of rows that match their cluster's label as the accuracy.

In [3]:
def evaluate_kmeans(data, labels, clf=None):
    if clf is None:
        clf = KMeans(n_clusters=4,init='random').fit(data)
    preds = clf.predict(data)
    ans = pd.DataFrame({'label':labels.values, 'kmean':preds})
    ans = ans.groupby(['kmean', 'label']).size()
    print(ans)

    # Count the majority label in each cluster as correct; taking the maximum per
    # cluster also works when a cluster contains only one of the two labels
    correct = ans.groupby(level='kmean').max().sum()
    print("Total accuracy: {0:.1%}".format(correct / ans.sum()))
    return clf
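
The majority-vote scoring can be tried on a small example. The sketch below uses made-up cluster assignments and labels (all values are illustrative) and takes the per-cluster maximum with `groupby(level=...)`. Cluster 2 deliberately contains only anomalies, a case in which pairing every second row of the grouped counts would misalign.

```python
import pandas as pd

# Made-up cluster assignments and labels; cluster 2 contains only anomalies.
preds = [0, 0, 0, 1, 1, 2, 2]
labels = ["normal", "normal", "anomaly", "anomaly", "normal", "anomaly", "anomaly"]

counts = (
    pd.DataFrame({"kmean": preds, "label": labels})
    .groupby(["kmean", "label"])
    .size()
)
print(counts)

# Largest label count in each cluster, summed over all clusters.
correct = counts.groupby(level="kmean").max().sum()
accuracy = correct / counts.sum()
print("Total accuracy: {0:.1%}".format(accuracy))
```

Here 5 of the 7 rows agree with the majority label of their cluster.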

## Loading the training data
The training data is loaded from file. Categorical columns are converted into binary columns and numerical columns are scaled to the range [0, 1].

In [4]:
train_data, train_labels, cols = load_data('data/KDDTrain+.csv')
train_data.head()

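| | duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | flag_REJ | flag_RSTO | flag_RSTOS0 | flag_RSTR | flag_S0 | flag_S1 | flag_S2 | flag_S3 | flag_SF | flag_SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3.558064e-07 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.0 | 1.057999e-07 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.0 | 1.681203e-07 | 6.223962e-06 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0.0 | 1.442067e-07 | 3.20626e-07 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |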
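
The scaled `src_bytes` values in these rows are on the order of 1e-07. That is what min-max scaling produces when a column contains a few very large values, although whether this is the cause here is an assumption. The sketch below uses made-up byte counts to show that `minmax_scale` applies (x - min) / (max - min) and how a single outlier compresses the remaining values toward 0.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Made-up byte counts with one very large outlier.
src_bytes = np.array([0.0, 491.0, 146.0, 1.0e9])

scaled = minmax_scale(src_bytes)

# The same result computed directly from the min-max formula.
manual = (src_bytes - src_bytes.min()) / (src_bytes.max() - src_bytes.min())

print(scaled)
print(np.allclose(scaled, manual))
```

Because k-means relies on Euclidean distance, a column compressed this way contributes very little to the clustering compared with the 0/1 indicator columns.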


## Training
Train the model and check its accuracy on the training set:

In [5]:
clf = evaluate_kmeans(train_data, train_labels)

kmean  label  
0      anomaly     1960
       normal     48134
1      anomaly     9685
       normal     16087
2      anomaly    12138
       normal      2959
3      anomaly    34847
       normal       163
dtype: int64
Total accuracy: 88.3%
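
Note that `evaluate_kmeans` builds the model with `init='random'` and no `random_state`, so the cluster numbering and the counts above may differ from run to run. A minimal sketch of making a run repeatable is shown below, assuming a fixed seed is acceptable. The synthetic matrix `X` and the helper `fit_labels` are illustrative stand-ins, not part of the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Synthetic stand-in for the feature matrix: two well-separated blobs.
X = np.vstack([rng.normal(0.0, 0.1, (50, 3)), rng.normal(1.0, 0.1, (50, 3))])


def fit_labels(seed):
    """Fit k-means with random initialisation and a fixed seed."""
    model = KMeans(n_clusters=2, init="random", n_init=10, random_state=seed)
    return model.fit_predict(X)


# Two fits with the same seed produce identical cluster assignments.
same = np.array_equal(fit_labels(0), fit_labels(0))
print(same)
```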


## Test Set
---
Load the test data set. Passing the training columns to `load_data` maps the test set onto the same columns, in the same order, as the training set.

In [6]:
test_data, test_labels, cols = load_data('data/KDDTest+.csv', cols)
test_data.head()

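| | duration | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | num_failed_logins | logged_in | num_compromised | ... | flag_REJ | flag_RSTO | flag_RSTOS0 | flag_RSTR | flag_S0 | flag_S1 | flag_S2 | flag_S3 | flag_SF | flag_SH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3.5e-05 | 0.0002066513 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0.0 | 3.183413e-07 | 0.0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1.7e-05 | 0.0 | 1.1e-05 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |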


Now cluster the test data with the model fitted on the training set:

In [7]:
evaluate_kmeans(test_data, test_labels, clf)

kmean  label  
0      anomaly    3024
       normal     7357
1      anomaly    2741
       normal     2265
2      anomaly    4809
       normal       82
3      anomaly    2259
       normal        6
dtype: int64
Total accuracy: 76.1%


KMeans(algorithm='auto', copy_x=True, init='random', max_iter=300,
    n_clusters=4, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
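
One detail of the evaluation is worth spelling out. `evaluate_kmeans` picks each cluster's majority label from whichever labels it is given, so on the test set cluster 1 is scored as "anomaly" (2741 vs 2265), while in the training set it was majority "normal" (16087 vs 9685). The sketch below is an alternative scoring, not the notebook's method: it fixes each cluster's label from the training counts and applies that mapping to the test counts. All numbers are copied from the two printed tables, and the variable names are illustrative.

```python
# Counts copied from the printed training and test tables.
train_counts = {
    0: {"anomaly": 1960, "normal": 48134},
    1: {"anomaly": 9685, "normal": 16087},
    2: {"anomaly": 12138, "normal": 2959},
    3: {"anomaly": 34847, "normal": 163},
}
test_counts = {
    0: {"anomaly": 3024, "normal": 7357},
    1: {"anomaly": 2741, "normal": 2265},
    2: {"anomaly": 4809, "normal": 82},
    3: {"anomaly": 2259, "normal": 6},
}

# Label each cluster by its majority class in the training set.
cluster_label = {k: max(v, key=v.get) for k, v in train_counts.items()}

# Score the test set against the training-derived cluster labels.
correct = sum(test_counts[k][cluster_label[k]] for k in test_counts)
total = sum(sum(v.values()) for v in test_counts.values())
print("Test accuracy with training-derived labels: {0:.1%}".format(correct / total))
```

With these counts the training-derived mapping scores 16690 of 22543 test rows as correct (74.0%), compared with the 76.1% reported when the majority labels are re-derived from the test labels.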

## Observations