# Overview

This project is about detecting cyber attacks on a wifi network and how I developed a classifier with relatively high recall (0.83). In this dataset, wifi activity is classified as normal, flooding, injection, or impersonation.
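
For readers less familiar with the metric, recall for a class is the share of that class's true instances that the model finds. The sketch below computes it per class on invented toy labels. How the 0.83 quoted above was averaged is not stated here, so the unweighted (macro) average shown is an assumption, and the helper name `recall_per_class` is illustrative.

```python
from collections import Counter

def recall_per_class(y_true, y_pred):
    """Return {label: recall}, where recall = correct predictions / actual instances."""
    actual = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / actual[label] for label in actual}

# Toy labels, invented for illustration only (not results from this project)
y_true = ["normal", "normal", "flooding", "flooding", "injection", "impersonation"]
y_pred = ["normal", "normal", "flooding", "normal", "injection", "normal"]

recalls = recall_per_class(y_true, y_pred)

# Assumption: unweighted (macro) average across the four classes
macro_recall = sum(recalls.values()) / len(recalls)
print(recalls)
print(macro_recall)
```

With classes as imbalanced as the ones in this dataset, per-class recall is more informative than accuracy, because a model that always predicts normal would still be right most of the time.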

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import HTML, display
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")
from sklearn.pipeline import Pipeline

In [2]:
from classification_model.processing.data_management import load_dataset

In [3]:
train_orig = load_dataset(file_name='AWID-CLS-R-Trn.csv')
test_orig = load_dataset(file_name='AWID-CLS-R-Tst.csv')

## Data

With almost 1.8 million rows and 155 columns, even the reduced data set is too large to inspect row by row. Exploring the meaning of the columns is more feasible, but I will leave out most of that exploration.

The data for this project is sourced from the AWID project (http://icsdweb.aegean.gr/awid/index.html). If you would like to use this data, please go to their website and ask for permission. The data comes in 4 different data sets, split along two dimensions. There is a larger data set (F) and a reduced version (R). For each size, there is one version that generalizes wifi activity into the four classes mentioned earlier (CLS) and one that differentiates further between types of cyber attack (ATK). For this project I focus on the reduced data set with the more generalized classes (the AWID-CLS-R files loaded above).

In [4]:
train_orig.shape

(1795574, 155)

In [5]:
test_orig.shape

(575642, 155)

In [7]:
100*train_orig['class'].value_counts(normalize=True)

normal           90.956374
injection         3.641120
impersonation     2.702311
flooding          2.700195
Name: class, dtype: float64

In [8]:
100*test_orig['class'].value_counts(normalize=True)

normal           92.207309
impersonation     3.488105
injection         2.897982
flooding          1.406603
Name: class, dtype: float64
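
Both splits are dominated by normal traffic, and the attack classes do not appear in the same proportions in train and test. A small sketch, using the percentages printed above, that puts the two distributions side by side (the names `train_pct`, `test_pct`, and `comparison` are illustrative, not part of the project code):

```python
import pandas as pd

# Class shares (percent) copied from the two outputs above
train_pct = pd.Series({"normal": 90.956374, "injection": 3.641120,
                       "impersonation": 2.702311, "flooding": 2.700195},
                      name="train_pct")
test_pct = pd.Series({"normal": 92.207309, "impersonation": 3.488105,
                      "injection": 2.897982, "flooding": 1.406603},
                     name="test_pct")

# concat aligns the two Series on their class labels
comparison = pd.concat([train_pct, test_pct], axis=1)
comparison["diff"] = comparison["test_pct"] - comparison["train_pct"]
print(comparison.round(2))
```

On the real data, the same table could be built directly from the two `value_counts(normalize=True)` results instead of hard-coded numbers.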

## Missing Data
The plot below shows that many of the features consist almost entirely of missing data.

In [None]:
missing_data = pd.DataFrame()
missing_data['feature'] = train_orig.columns

missing = list(100*train_orig.isnull().mean())
missing_data['missing'] = missing

In [None]:
missing_data

In [None]:
plt.figure(figsize=(16, 64))
sns.barplot(data=missing_data, x = 'missing', y = 'feature' );
plt.title('Missing Values');
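
With 155 features, the bar plot is long, so it can help to list only the columns above some missing-value threshold. This is a minimal sketch on a toy frame. The helper `mostly_missing` and the 95% default are assumptions chosen for illustration, not something used elsewhere in this project.

```python
import numpy as np
import pandas as pd

def mostly_missing(df, threshold=95.0):
    """Return the percent of missing values for columns at or above `threshold`."""
    pct = 100 * df.isnull().mean()
    return pct[pct >= threshold].sort_values(ascending=False)

# Toy frame standing in for train_orig; column names are made up
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, np.nan],
    "c": [np.nan, 1.0, np.nan, np.nan],
})
print(mostly_missing(toy, threshold=75.0))
```

Calling `mostly_missing(train_orig)` would give the same kind of listing for the real training data.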

The column names are difficult to interpret unless you are used to looking at wifi packet information. If you are interested, see https://www.wireshark.org/ for more information on these columns.

## Data Prep
A lot of data preparation is about to take place behind the scenes. For a complete description, you can check out my code under the processing section to see what is happening under the hood of these functions. I will do my best to summarize the main changes here.

Inside the prepare_data function, I replace all missing values in categorical variables with the label 'missing' and adjust the time feature so that it is easier to measure. As you will see later, this data measures wifi traffic for an hour in the training set and approximately 20 minutes in the test set. I also create an integer feature that simply counts the seconds that have passed, to use for aggregation. Why 1 second? Good question. I haven't yet done any analysis to determine whether this is the best unit of time for aggregation, but it seemed like a good starting point.
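
The steps described above can be sketched roughly as follows. This is not the actual prepare_data implementation. The function name `prepare_data_sketch`, the derived column names `elapsed` and `second`, and the toy values are all assumptions made for illustration. Only the `frame.time_epoch` column name is taken from the data set.

```python
import numpy as np
import pandas as pd

def prepare_data_sketch(df, time_col="frame.time_epoch"):
    """Hypothetical stand-in for prepare_data; the real function may differ."""
    out = df.copy()

    # Label missing values in non-numeric (categorical) columns as 'missing'
    cat_cols = [c for c in out.columns if not pd.api.types.is_numeric_dtype(out[c])]
    out[cat_cols] = out[cat_cols].fillna("missing")

    # Shift time so the capture starts at 0, then count whole seconds elapsed
    out["elapsed"] = out[time_col] - out[time_col].min()
    out["second"] = np.floor(out["elapsed"]).astype(int)
    return out

# Toy frames with made-up values
toy = pd.DataFrame({
    "frame.time_epoch": [1000.2, 1000.9, 1001.5, 1003.1],
    "wlan.sa": ["aa:bb", None, "cc:dd", None],
})
prepared = prepare_data_sketch(toy)

# Example aggregation: number of frames observed in each one-second bucket
frames_per_second = prepared.groupby("second").size()
print(prepared)
print(frames_per_second)
```

Changing the aggregation unit would only require dividing `elapsed` by the bucket width before taking the floor, which makes it easy to test alternatives to 1 second later.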

In [None]:
from tf_ann_model.processing.data_management import prepare_data

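# ignore_index avoids duplicate row labels, which can break the label assignment below
tt = pd.concat([train_orig, test_orig], ignore_index=True)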

X_train, y_train = prepare_data(train_orig, train_data=True)
X_test, y_test = prepare_data(test_orig, train_data=False)
X_train_test, y_train_test = prepare_data(tt, train_data=False)

In [None]:
train = X_train.copy()
train['class'] = train_orig['class']

test = X_test.copy()
test['class'] = test_orig['class']

train_test = X_train_test.copy()
train_test['class'] = tt['class']

## Data Target Distribution

As you can see in the plots that follow, there is a time dependency in the data for the target values flooding, injection, and impersonation.

In [None]:
def h(content):
    # render an HTML string in the notebook output
    display(HTML(content))

def timehist(df, target, tcol, col, title=None, clipping=9999999999999999):
    # plot `col` against `tcol` as one point series per traffic class
    for label in ['normal', 'flooding', 'injection', 'impersonation']:
        subset = df[df[target] == label]
        subset.set_index(tcol)[col].clip(0, clipping).plot(style='.', figsize=(15, 5), label=label)
    plt.title(title)
    plt.legend(loc='upper right')
    plt.show()
  

In [None]:
timehist(df=train, target='class', tcol='frame.time_epoch', col='frame.time_delta', title='Training Data: Distribution of Traffic time deltas over time')

In [None]:
timehist(df=test, target='class', tcol='frame.time_epoch', col='frame.time_delta', title='Test Data: Distribution of Traffic time deltas over time')

In [None]:
timehist(df=train_test, target='class', tcol='frame.time_epoch', col='frame.time_delta', title='Train and Test Data: Distribution of Traffic time deltas over time')
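
The time dependency visible in the plots can also be summarized numerically by looking at when each class first and last appears. The sketch below does this on a toy frame. The helper `class_time_windows` and the toy timestamps are illustrative assumptions, not output from the AWID data.

```python
import pandas as pd

def class_time_windows(df, target="class", tcol="frame.time_epoch"):
    """Summarize when each class first and last appears, plus its frame count."""
    summary = df.groupby(target)[tcol].agg(
        first_seen="min", last_seen="max", frames="count"
    )
    summary["duration_s"] = summary["last_seen"] - summary["first_seen"]
    return summary.sort_values("first_seen")

# Toy data: attacks occupy short windows inside a longer normal capture
toy = pd.DataFrame({
    "frame.time_epoch": [0.0, 10.0, 20.0, 21.0, 22.0, 50.0, 60.0],
    "class": ["normal", "normal", "flooding", "flooding",
              "normal", "injection", "normal"],
})
windows = class_time_windows(toy)
print(windows)
```

If attack classes occupy narrow, non-overlapping windows like this in the real data, time-based features could leak the label, which is worth keeping in mind when validating a model.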

In [None]:
from classification_model.processing.data_management import partition_features
NUMERIC, CATEG = partition_features(X_train)
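
The partition_features source is not shown in this notebook. One plausible way to produce such a split is by column dtype, sketched below. The function `partition_features_sketch` is hypothetical and the real function may use different rules. The toy values are made up.

```python
import pandas as pd

def partition_features_sketch(df):
    """Hypothetical dtype-based split; the project's partition_features may differ."""
    # Note: is_numeric_dtype also treats boolean columns as numeric
    numeric = [c for c in df.columns if pd.api.types.is_numeric_dtype(df[c])]
    categorical = [c for c in df.columns if c not in numeric]
    return numeric, categorical

# Toy frame with one numeric and one string-valued column
toy = pd.DataFrame({
    "frame.len": [100, 200, 150],
    "wlan.fc.type": ["0x00", "0x02", "missing"],
})
numeric_cols, categ_cols = partition_features_sketch(toy)
print(numeric_cols, categ_cols)
```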

In [None]:
def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # histogram
    plt.subplot(1, 3, 1)
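    # distplot is deprecated in recent seaborn; histplot with a KDE is the closest replacement
    sns.histplot(df[variable], bins=30, kde=True, stat='density')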
    plt.title('Histogram')

    # Q-Q plot
    plt.subplot(1, 3, 2)
    stats.probplot(df[variable], dist="norm", plot=plt)
    plt.ylabel('Variable quantiles')

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

In [None]:
#### Numeric variables ####
for feat in NUMERIC:
    print( 'Feature:',feat)
    print('')
    print(X_train[feat].describe())
    print(' ')
    diagnostic_plots(X_train, feat)

In [None]:
for feat in CATEG:
    print( 'Feature:',feat)
    print('')
    print('Number of unique values:')
    print(X_train[feat].nunique())
    print('')
    print('Value distribution:')
    print((X_train[feat].value_counts().head(20)))
    print('')
    print('')