# Implementation of K-Means Clustering on NSL-KDD Dataset
Using Method Described in *"K-Means Clustering Approach to Analyze
NSL-KDD Intrusion Detection Dataset"* found [here](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.413.589&rep=rep1&type=pdf).

Uses scikit-learn's `MLPClassifier` (a multilayer perceptron) to classify the [NSL-KDD dataset](http://www.unb.ca/cic/research/datasets/nsl.html) as normal or anomalous traffic and reports the resulting accuracy.

In [1]:
import pandas as pd
from sklearn.preprocessing import minmax_scale
from sklearn.neural_network import MLPClassifier

## Data Preprocessing
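Define a data loading function. Nominal variables are converted to numerical values using pandas category codes, the nominal and numerical features are scaled to [0, 1], and the labels are reduced to `normal` or `anomaly`.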

In [2]:
def load_data(file_path, cols=None):
    COL_NAMES = ["duration", "protocol_type", "service", "flag", "src_bytes",
                 "dst_bytes", "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
                 "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
                 "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
                 "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
                 "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
                 "diff_srv_rate", "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
                 "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
                 "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
                 "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "labels"]

    data = pd.read_csv(file_path, names=COL_NAMES, index_col=False)
    NOM_IND = [1, 2, 3]
    BIN_IND = [6, 11, 13, 14, 20, 21]
    # Numerical columns: every feature column (0-40) that is neither nominal nor binary
    NUM_IND = sorted(set(range(41)).difference(NOM_IND).difference(BIN_IND))
    # Convert nominal columns to integer category codes.
    # Assigning by column name replaces the column, so the new dtype is kept.
    for num in NOM_IND:
        col = data.columns[num]
        data[col] = data[col].astype('category').cat.codes
    # Scale the nominal codes and the numerical columns to [0, 1]
    nom_cols = data.columns[NOM_IND]
    num_cols = data.columns[NUM_IND]
    data[nom_cols] = minmax_scale(data[nom_cols])
    data[num_cols] = minmax_scale(data[num_cols])
    labels = None
    if 'labels' in data.columns:
        labels = data['labels']
        # Binary labeling
        labels = labels.apply(lambda x: x if x =='normal' else 'anomaly')
        del data['labels']
    return [data, labels]
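
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of what category coding and min-max scaling compute. The helper names `category_codes` and `min_max` and the toy protocol lists are hypothetical, for illustration only; they are not part of pandas or scikit-learn. The sketch also shows one thing to watch for: because `load_data` derives the codes from whichever file it is given, the same value can receive a different code in the training and test sets if the two files do not contain the same set of values.

```python
def category_codes(values, categories=None):
    """Map each value to the index of its category.

    By default the categories are the sorted unique values,
    which is an assumption made for this sketch.
    """
    if categories is None:
        categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    # Values missing from the category list are coded as -1
    return [index.get(v, -1) for v in values], categories


def min_max(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy data (hypothetical): the test set never contains "icmp"
train_protocols = ["tcp", "udp", "icmp", "tcp"]
test_protocols = ["udp", "tcp", "udp"]

train_codes, cats = category_codes(train_protocols)
# Codes derived from the test values alone
separate_codes, _ = category_codes(test_protocols)
# Codes derived by reusing the training categories
shared_codes, _ = category_codes(test_protocols, cats)

print(train_codes)           # [1, 2, 0, 1]
print(separate_codes)        # [1, 0, 1]  ("udp" is 1 here but 2 in training)
print(shared_codes)          # [2, 1, 2]  (consistent with training)
print(min_max(train_codes))  # [0.5, 1.0, 0.0, 0.5]
```

Reusing the training categories (and, likewise, the training minimum and maximum) when transforming the test set is one way to keep the two encodings consistent. Whether it changes the results reported here has not been checked.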

## Neural Network Training and Testing
Define two functions: `train_clf` fits the model and prints its accuracy on the training set, and `test_clf` prints the accuracy on the test set and returns the predictions.

In [3]:
def train_clf(train_data, train_labels):
    clf = MLPClassifier(hidden_layer_sizes=(20,), alpha=.7,
                        beta_1=.8, beta_2=.8)
    clf.fit(train_data, train_labels)
    train_preds = clf.predict(train_data)
    train_acc = sum(train_preds == train_labels)/len(train_preds)
    print("Accuracy on training set: {0:1%}".format(train_acc))
    return clf

In [4]:
def test_clf(test_data, test_labels, clf):
    test_preds = clf.predict(test_data)
    test_acc = sum(test_preds == test_labels)/len(test_preds)
    print("Accuracy on test set: {0:1%}".format(test_acc))
    return test_preds
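
The accuracy printed by both functions is the fraction of predictions that match the true labels. As a small illustration of that arithmetic, the pure-Python sketch below computes the same quantity on a made-up list of labels, and additionally counts each (true label, predicted label) pair, which shows where the errors fall. The helper names `accuracy` and `confusion_counts` and the toy data are hypothetical and are not taken from the notebook or from scikit-learn.

```python
from collections import Counter


def accuracy(preds, labels):
    """Fraction of predictions equal to the true labels."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def confusion_counts(preds, labels):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(labels, preds))


# Toy data (hypothetical), using the same binary labels as load_data
labels = ["normal", "anomaly", "anomaly", "normal", "anomaly"]
preds = ["normal", "anomaly", "normal", "normal", "normal"]

acc = accuracy(preds, labels)
counts = confusion_counts(preds, labels)

# Same format string as train_clf and test_clf
print("Accuracy: {0:1%}".format(acc))   # Accuracy: 60.000000%
print(counts[("anomaly", "normal")])    # 2 anomalies predicted as normal
print(counts[("normal", "anomaly")])    # 0 normal records flagged as anomalies
```

In this toy case every error is an anomaly predicted as normal, something a single accuracy figure does not reveal.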

## Training the Classifier

In [5]:
train_data, train_labels = load_data('data/KDDTrain+.csv')
clf = train_clf(train_data, train_labels)

Accuracy on training set: 96.750097%


## Testing

In [6]:
test_data, test_labels = load_data('data/KDDTest+.csv')
test_preds = test_clf(test_data, test_labels, clf)

Accuracy on test set: 78.068580%
