# Mahalanobis outlier detection on the KDD Cup '99 dataset

The outlier detector needs to detect computer network intrusions using TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labeled either as normal or as an attack.

There are 4 types of attacks in the dataset:

- DOS: denial-of-service, e.g. SYN flood;
- R2L: unauthorized access from a remote machine, e.g. password guessing;
- U2R: unauthorized access to local superuser (root) privileges;
- probing: surveillance and other probing, e.g. port scanning.

The dataset contains about 5 million connection records.

There are 3 types of features:

- basic features of individual connections, e.g. duration of the connection;
- content features within a connection, e.g. number of failed login attempts;
- traffic features within a 2-second window, e.g. number of connections to the same host as the current connection.

In [None]:
import sys
sys.path.append('..')
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OrdinalEncoder

from odcd.od.mahalanobis import Mahalanobis
from odcd.datasets import fetch_kdd
from odcd.utils.data import create_outlier_batch
from odcd.utils.visualize import plot_outlier_scores

## Load dataset

We keep only 18 of the 41 features, all of them continuous.

In [None]:
kddcup = fetch_kdd(percent10=True)  # only load 10% of the dataset
print(kddcup.data.shape, kddcup.target.shape)

Assume that the model is trained on *normal* instances of the dataset (no outliers) and that the data is standardized. We sample a batch of normal instances and compute its per-feature mean and standard deviation:

In [None]:
normal_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=100000, perc_outlier=0)
data, target = normal_batch.data.astype('float'), normal_batch.target
print(data.shape, target.shape)
print('{}% outliers'.format(100 * target.mean()))

In [None]:
mean, stdev = data.mean(axis=0), data.std(axis=0)

Generate a batch of data with 10% outliers:

In [None]:
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=100, perc_outlier=10)
data, target = outlier_batch.data.astype('float'), outlier_batch.target
print(data.shape, target.shape)
print('{}% outliers'.format(100 * target.mean()))

Apply standardization:

In [None]:
data = (data - mean) / stdev
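
The standardization above reuses the mean and standard deviation of the normal batch, so the batch under test is scaled with statistics that its own outliers did not influence. The following is a minimal sketch on synthetic data using NumPy only. The names `X_ref`, `X_new`, and the `eps` guard against constant features are assumptions made for illustration and are not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" reference data: 1000 instances, 3 features.
# The last feature is constant to show the zero-variance case.
X_ref = np.c_[rng.normal(5.0, 2.0, size=(1000, 2)), np.ones(1000)]

# Statistics are computed on the reference data only.
ref_mean, ref_std = X_ref.mean(axis=0), X_ref.std(axis=0)

# Hypothetical guard: replace a (near-)zero standard deviation by 1
# so that constant features do not cause a division by zero.
eps = 1e-8
safe_std = np.where(ref_std < eps, 1.0, ref_std)

# A new batch is scaled with the reference statistics, not its own.
X_new = np.c_[rng.normal(5.0, 2.0, size=(100, 2)), np.ones(100)]
X_ref_scaled = (X_ref - ref_mean) / safe_std
X_new_scaled = (X_new - ref_mean) / safe_std
```

Without such a guard, a feature that is constant in the normal batch would produce `inf` or `nan` values after the division.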

## Initialize and run outlier detector

Set parameters:

In [None]:
threshold = 6  # scores above threshold are classified as outliers
n_components = 2  # nb of components used in PCA
std_clip = 3  # values used to compute mean and cov are clipped at "std_clip" standard deviations
start_clip = 20  # start clipping values after "start_clip" instances
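
To make the effect of `std_clip` concrete, the sketch below clips each feature at a fixed number of standard deviations before estimating a covariance matrix. It is one plausible reading of the parameter, written with NumPy on synthetic data. Clipping around the batch mean and doing it in a single pass are assumptions; the detector updates its statistics incrementally and may differ in detail:

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 standard-normal instances with 2 features, plus one extreme instance.
x = rng.normal(0.0, 1.0, size=(500, 2))
x[0] = [50.0, -50.0]

std_clip_demo = 3

# Feature-wise clipping bounds at mean +/- std_clip_demo standard deviations.
mu, sigma = x.mean(axis=0), x.std(axis=0)
lower, upper = mu - std_clip_demo * sigma, mu + std_clip_demo * sigma
x_clipped = np.clip(x, lower, upper)

# The extreme instance inflates the raw covariance far more than the clipped one.
cov_raw = np.cov(x, rowvar=False)
cov_clipped = np.cov(x_clipped, rowvar=False)
```

Clipping limits how much a single extreme instance can distort the running mean and covariance, which keeps later outliers from looking normal.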

Initialize, predict outliers and get outlier scores:

In [None]:
mh = Mahalanobis(threshold, 
                 n_components=n_components, 
                 std_clip=std_clip, 
                 start_clip=start_clip)
preds = mh.predict(data)
scores = mh.score(data)
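
For intuition about the scores, here is a sketch of the textbook Mahalanobis distance computed in one pass with NumPy on synthetic data. It is not the detector's implementation: the detector works online and first projects the data with PCA, and whether it returns the distance or its square is not stated in this notebook. The helper `mahalanobis_scores` and all `*_demo` names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D reference data.
cov_true = np.array([[1.0, 0.8], [0.8, 1.0]])
X_ref = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=2000)

# Estimate the mean and the inverse covariance from the reference data.
mu = X_ref.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X_ref, rowvar=False))

def mahalanobis_scores(X, mu, cov_inv):
    """Return the Mahalanobis distance of each row of X to mu."""
    diff = X - mu
    # Row-wise quadratic form diff_i^T @ cov_inv @ diff_i.
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))

X_test = np.array([[0.0, 0.0],    # at the centre of the data
                   [3.0, 3.0],    # along the correlation direction
                   [3.0, -3.0]])  # against the correlation direction
scores_demo = mahalanobis_scores(X_test, mu, cov_inv)

# Scores above the threshold are flagged as outliers (1), others as normal (0).
threshold_demo = 6
preds_demo = (scores_demo > threshold_demo).astype(int)
```

The last two test points are equally far from the centre in Euclidean terms, but only the one that contradicts the correlation structure exceeds the threshold.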

## Display results

Confusion matrix:

In [None]:
labels = outlier_batch.target_names
cm = confusion_matrix(target, preds)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()
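
The heatmap can be complemented with summary metrics read off the confusion matrix. The sketch below builds a 2x2 matrix with NumPy from made-up labels, using the same layout as `confusion_matrix` (rows are true classes, columns are predicted classes), and derives precision, recall and F1 for the outlier class. The arrays `y_true` and `y_pred` are illustrative and are not results from this notebook:

```python
import numpy as np

# Made-up ground truth and predictions (1 = outlier, 0 = normal).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0, 1, 0])

# Count each (true, predicted) pair; rows = true class, columns = predicted class.
cm_demo = np.zeros((2, 2), dtype=int)
np.add.at(cm_demo, (y_true, y_pred), 1)

tn, fp, fn, tp = cm_demo.ravel()
precision = tp / (tp + fp)  # fraction of flagged instances that are real outliers
recall = tp / (tp + fn)     # fraction of real outliers that were flagged
f1 = 2 * precision * recall / (precision + recall)
```

With only 10% outliers in a batch, accuracy alone is a weak indicator, so precision and recall on the outlier class are usually more informative.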

Plot scores vs. the outlier threshold:

In [None]:
plot_outlier_scores(scores, target, labels, threshold)

## Include categorical variables

In [None]:
#cat_cols = ['protocol_type', 'service', 'flag', 'land', 
#            'logged_in', 'is_host_login', 'is_guest_login']
cat_cols = ['protocol_type', 'service', 'flag']
num_cols = ['srv_count', 'serror_rate', 'srv_serror_rate',
            'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 
            'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 
            'dst_host_srv_count', 'dst_host_same_srv_rate', 
            'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate',
            'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 
            'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 
            'dst_host_srv_rerror_rate']
cols = cat_cols + num_cols

In [None]:
kddcup = fetch_kdd(keep_cols=cols, percent10=True)
print(kddcup.data.shape, kddcup.target.shape)

Create a dictionary that maps the column index of each categorical variable to its number of categories in the dataset. This dictionary is used later in the `fit` step of the outlier detector.

In [None]:
cat_vars_ord = {}
n_categories = len(cat_cols)
for i in range(n_categories):
    cat_vars_ord[i] = len(np.unique(kddcup.data[:, i]))
print(cat_vars_ord)
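
As a small worked example of the dictionary's shape, the sketch below applies the same counting to a toy array. The array `toy` and the list `toy_cat_cols` are made up for illustration:

```python
import numpy as np

# Toy data: the first two columns are categorical, the last one is numerical.
toy = np.array([['tcp', 'http', 0.1],
                ['udp', 'dns', 0.5],
                ['tcp', 'http', 0.3],
                ['icmp', 'http', 0.9]], dtype=object)
toy_cat_cols = ['protocol_type', 'service']

# Map each categorical column index to its number of distinct categories.
toy_cat_vars = {i: len(np.unique(toy[:, i])) for i in range(len(toy_cat_cols))}
print(toy_cat_vars)  # {0: 3, 1: 2}
```

The keys are column indices rather than column names, so the categorical columns must sit at the same positions in the array that is later passed to `fit`.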

Fit an ordinal encoder on the categorical data:

In [None]:
enc = OrdinalEncoder()
enc.fit(kddcup.data[:, :len(cat_cols)])  # the categorical columns come first in `cols`

Use the data to infer the numerical distances between the categories of each categorical variable. This is for illustrative purposes only, since the data used here already contains the outliers:

In [None]:
X_num = (kddcup.data[:, len(cat_cols):] - mean) / stdev  # standardize numerical features
X_ord = enc.transform(kddcup.data[:, :len(cat_cols)])  # apply ordinal encoding to categorical features
X_fit = np.c_[X_ord, X_num].astype(np.float32, copy=False)  # combine numerical and categorical features
print(X_fit.shape)
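
Ordinal encoding replaces each category with an integer index. With its default settings, scikit-learn's `OrdinalEncoder` determines the categories from the data and orders them by sorting. The sketch below mimics that mapping for a single column with NumPy only; the toy column `protocol` is an assumption for illustration:

```python
import numpy as np

# One toy categorical column.
protocol = np.array(['tcp', 'udp', 'tcp', 'icmp'])

# np.unique returns the sorted categories and, with return_inverse=True,
# the index of each original value within those sorted categories.
categories, codes = np.unique(protocol, return_inverse=True)

# categories are icmp, tcp, udp, so the codes are 1, 2, 1, 0.
decoded = categories[codes]  # maps the codes back to the original strings
```

The integer codes carry no meaningful distance by themselves, which is why the `fit` step later replaces them with numerical values inferred from the data.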

Generate a batch of data with 10% outliers:

In [None]:
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=100, perc_outlier=10)
data, target = outlier_batch.data, outlier_batch.target
print(data.shape, target.shape)
print('{}% outliers'.format(100 * target.mean()))

Preprocess the outlier batch:

In [None]:
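X_num = (data[:, len(cat_cols):] - mean) / stdev  # standardize numerical features
X_ord = enc.transform(data[:, :len(cat_cols)])  # apply ordinal encoding to categorical features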
X = np.c_[X_ord, X_num].astype(np.float32, copy=False)
print(X.shape)

## Initialize and fit outlier detector

In [None]:
mh = Mahalanobis(threshold,
                 n_components=n_components, 
                 std_clip=std_clip, 
                 start_clip=start_clip)

Set `fit` parameters:

In [None]:
d_type = 'abdm'  # pairwise distance type, 'abdm' infers context from other variables
ohe = False  # True if one-hot encoding (OHE) is used
disc_perc = [25, 50, 75]  # percentiles used to bin numerical values; used in 'abdm' calculations
standardize_cat_vars = True  # standardize numerical values of categorical variables
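
The `disc_perc` setting lists the percentiles used to bin numerical values. The sketch below shows only that binning step with NumPy on synthetic data. How the detector then uses the bins inside the 'abdm' distance is not shown here, and the `*_demo` names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=1000)  # one synthetic numerical feature

disc_perc_demo = [25, 50, 75]

# Bin edges are the requested percentiles of the feature.
bin_edges = np.percentile(values, disc_perc_demo)

# Assign each value to one of len(disc_perc_demo) + 1 bins (0, 1, 2 or 3).
bins = np.digitize(values, bin_edges)
counts = np.bincount(bins, minlength=len(disc_perc_demo) + 1)
```

Quartile edges give four bins of roughly equal size, so each bin has enough instances to estimate how the categorical variables are distributed within it.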

Apply `fit` method to find numerical values for categorical variables:

In [None]:
mh.fit(X_fit,
       cat_vars=cat_vars_ord,
       ohe=ohe,
       d_type=d_type,
       disc_perc=disc_perc,
       standardize_cat_vars=standardize_cat_vars,
       feature_range=(-1e10, 1e10)
      )

## Run outlier detector

In [None]:
preds = mh.predict(X)
scores = mh.score(X)

## Display results

Confusion matrix:

In [None]:
cm = confusion_matrix(target, preds)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

Plot scores vs. the outlier threshold:

In [None]:
plot_outlier_scores(scores, target, labels, threshold)