## Intrusion detection system - Naive Bayes Classifier

This notebook uses the Gaussian Naive Bayes classifier to detect network attacks in a simulated network-capture dataset.

In [1]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.metrics import plot_confusion_matrix
from sklearn.naive_bayes import GaussianNB
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import MinMaxScaler
import sys
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

## Data Cleanup

This cell removes the repeated header rows that appear inside the source CSV files.  
The cleaned files are then written to the "cleaned_files/" folder.

In [None]:
import csv
import os

source_files = [
    "Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-21-02-2018_TrafficForML_CICFlowMeter.csv",
    "Wednesday-28-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-01-03-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-15-02-2018_TrafficForML_CICFlowMeter.csv",
    "Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-02-03-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-16-02-2018_TrafficForML_CICFlowMeter.csv",
    "Friday-23-02-2018_TrafficForML_CICFlowMeter.csv",
]

# Make sure the output folder exists before writing to it
os.makedirs("cleaned_files", exist_ok=True)

for file in source_files:
    output_filepath = "cleaned_files/" + file
    with open(file, "r", newline="") as inputfile, open(output_filepath, "w", newline="") as outputfile:
        csv_in = csv.reader(inputfile)
        csv_out = csv.writer(outputfile)
        # Write the header once, then skip any later row that repeats it
        title = next(csv_in)
        csv_out.writerow(title)
        for row in csv_in:
            if row != title:
                csv_out.writerow(row)
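
As a quick sanity check on the cleanup step, the sketch below counts how many rows in a CSV still repeat the header row. The helper name count_repeated_headers and the tiny sample file are illustrative assumptions, not part of the original notebook.

```python
import csv
import os
import tempfile


def count_repeated_headers(path):
    """Count rows after the first that are identical to the header row."""
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return sum(1 for row in reader if row == header)


# Hypothetical sample file with one repeated header row in the middle
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="") as f:
        f.write("Dst Port,Label\n80,Benign\nDst Port,Label\n443,Benign\n")
    repeated = count_repeated_headers(sample)

print("repeated header rows:", repeated)
```

Running the same helper on a file in "cleaned_files/" should report zero if the cleanup worked as intended.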

## Create dataframes from CSVs

In [3]:
filenames = ['Wednesday-14-02-2018_TrafficForML_CICFlowMeter.csv',
             'Friday-16-02-2018_TrafficForML_CICFlowMeter.csv',
             'Friday-02-03-2018_TrafficForML_CICFlowMeter.csv']

# DataFrame.append was removed in pandas 2.0, so read each file and concatenate once
frames = [pd.read_csv("cleaned_files/" + file) for file in filenames]
flow_df = pd.concat(frames, ignore_index=True)

feature_cols = ['Flow Duration', 'Tot Fwd Pkts',
            'Tot Bwd Pkts', 'TotLen Fwd Pkts', 'TotLen Bwd Pkts', 'Fwd Pkt Len Max',
            'Fwd Pkt Len Min', 'Fwd Pkt Len Mean', 'Fwd Pkt Len Std',
            'Bwd Pkt Len Max', 'Bwd Pkt Len Min', 'Bwd Pkt Len Mean',
            'Bwd Pkt Len Std', 'Flow IAT Mean',
            'Flow IAT Std', 'Flow IAT Max', 'Flow IAT Min', 'Fwd IAT Tot',
            'Fwd IAT Mean', 'Fwd IAT Std', 'Fwd IAT Max', 'Fwd IAT Min',
            'Bwd IAT Tot', 'Bwd IAT Mean', 'Bwd IAT Std', 'Bwd IAT Max',
            'Bwd IAT Min', 'Fwd PSH Flags', 'Bwd PSH Flags', 'Fwd URG Flags',
            'Bwd URG Flags', 'Fwd Header Len', 'Bwd Header Len', 'Fwd Pkts/s',
            'Bwd Pkts/s', 'Pkt Len Min', 'Pkt Len Max', 'Pkt Len Mean',
            'Pkt Len Std', 'Pkt Len Var', 'FIN Flag Cnt', 'SYN Flag Cnt',
            'RST Flag Cnt', 'PSH Flag Cnt', 'ACK Flag Cnt', 'URG Flag Cnt',
            'CWE Flag Count', 'ECE Flag Cnt', 'Down/Up Ratio', 'Pkt Size Avg',
            'Fwd Seg Size Avg', 'Bwd Seg Size Avg', 'Fwd Byts/b Avg',
            'Fwd Pkts/b Avg', 'Fwd Blk Rate Avg', 'Bwd Byts/b Avg',
            'Bwd Pkts/b Avg', 'Bwd Blk Rate Avg', 'Subflow Fwd Pkts',
            'Subflow Fwd Byts', 'Subflow Bwd Pkts', 'Subflow Bwd Byts',
            'Init Fwd Win Byts', 'Init Bwd Win Byts', 'Fwd Act Data Pkts',
            'Fwd Seg Size Min', 'Active Mean', 'Active Std', 'Active Max',
            'Active Min', 'Idle Mean', 'Idle Std', 'Idle Max', 'Idle Min',
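            'Dst Port', 'Protocol']

# Columns left out of the feature set (listed here for reference only)
excluded_cols = ['Flow Byts/s', 'Flow Pkts/s', 'Timestamp']

# Feature matrix and target labels
X = flow_df[feature_cols]
y = flow_df['Label']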

## Feature selection and normalization

SelectKBest scores each feature with the ANOVA F-test (f_classif) and keeps the 10 highest-scoring features out of the 76 listed in feature_cols.
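
To make the ANOVA score concrete, here is a minimal sketch of the textbook one-way ANOVA F-statistic for a single feature: the between-class mean square divided by the within-class mean square. The function name anova_f, the "Attack" label, and the toy values are assumptions for illustration; this is not scikit-learn's implementation.

```python
import numpy as np


def anova_f(feature, labels):
    """One-way ANOVA F-statistic for one feature grouped by class label."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = feature.mean()
    groups = [feature[labels == c] for c in classes]
    # Between-class sum of squares: how far each class mean is from the grand mean
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-class sum of squares: spread of samples around their own class mean
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    msb = ssb / (len(classes) - 1)
    msw = ssw / (len(feature) - len(classes))
    return msb / msw


# Toy data (hypothetical): one feature separates the classes, the other does not
labels = np.array(["Benign"] * 4 + ["Attack"] * 4)
separating = np.array([1.0, 2.0, 1.0, 2.0, 9.0, 10.0, 9.0, 10.0])
uninformative = np.array([1.0, 9.0, 2.0, 10.0, 1.0, 9.0, 2.0, 10.0])

f_sep = anova_f(separating, labels)
f_uninf = anova_f(uninformative, labels)
print(f_sep, f_uninf)
```

A feature whose class means differ strongly relative to its within-class spread gets a large score and is more likely to be kept; a feature with identical class means scores zero.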

The min-max scaler rescales every feature to the range 0 to 1. This step was originally added to improve K-NN performance, but it improves the NB performance as well.

In [4]:
# Keep the 10 features with the highest ANOVA F-scores
feature_selector = SelectKBest(f_classif, k=10)
X = feature_selector.fit_transform(X, y)

# Rescale each selected feature to the [0, 1] range
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)

flow_df_norm = pd.DataFrame(X_norm)
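
For reference, the min-max rescaling used above can be sketched by hand as (x - min) / (max - min) per column. The function name min_max_scale and the toy matrix are assumptions for illustration, and mapping constant columns to 0 is a choice made in this sketch.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column of X to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    # Avoid division by zero: a constant column is mapped to all zeros
    span[span == 0] = 1.0
    return (X - col_min) / span


# Toy data (hypothetical): the second column is constant
toy = [[0.0, 10.0],
       [5.0, 10.0],
       [10.0, 10.0]]
scaled = min_max_scale(toy)
print(scaled)
```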


## Test Train Split 

In [5]:
'''
Test Train Split
'''
X_train, X_test, y_train, y_test = train_test_split(flow_df_norm, y, test_size=0.2, random_state=42, shuffle=True)

## SMOTE Oversampling

This optional cell balances the training set so that every label has the same number of samples, using the SMOTE oversampling technique.

In [6]:
'''
Oversampler for dealing with imbalanced sets
'''
balance_data = False

    
if balance_data:
    oversampler = SMOTE()
    X_train, y_train = oversampler.fit_resample(X_train, y_train)
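
The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed at a random position on the segment between a minority sample and one of its minority-class neighbors. The sketch below is a simplified illustration that uses only the single nearest neighbor. It is not imbalanced-learn's implementation, and the name smote_like_sample and the toy points are assumptions.

```python
import numpy as np


def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and neighbor."""
    gap = rng.uniform(0.0, 1.0)  # random position along the segment
    return x + gap * (neighbor - x)


rng = np.random.default_rng(0)

# Toy minority-class samples (hypothetical, already scaled to [0, 1])
minority = np.array([[0.10, 0.20],
                     [0.30, 0.40],
                     [0.20, 0.10]])

x = minority[0]
others = minority[1:]
# Pick the nearest other minority sample by Euclidean distance
neighbor = others[np.argmin(np.linalg.norm(others - x, axis=1))]
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Because the synthetic points are interpolated from training samples, oversampling is applied only to X_train and y_train, as in the cell above, so the test set keeps its original class distribution.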
        

## Fit train data into model

In [None]:
%%time
'''
Naive Bayes Classifier
'''
gnb = GaussianNB(priors=None, var_smoothing=1e-09)

gnb.fit(X_train, y_train)

## Plot confusion matrix

The model does a decent job of detecting that an attack is happening, but it tends to confuse the different attack types with one another.

There is also a large number of false positives (benign flows flagged as attacks), which lowers the precision considerably.

In [None]:
'''
Confusion Matrix
'''
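# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead
from sklearn.metrics import ConfusionMatrixDisplay

plt.rcParams['figure.figsize'] = [12, 8]
plt.rcParams['figure.dpi'] = 100
ConfusionMatrixDisplay.from_estimator(gnb, X_test, y_test)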

## Print model metrics

recall 0.9793341077792436  
precision 0.7055214027761624  
accuracy 0.8266663487749247  

In [None]:
'''
Model Metrics
'''
y_pred = gnb.predict(X_test)

# Collapse the multi-class labels to binary: 0 = Benign, 1 = any attack
pred = pd.DataFrame(
    [0 if d == 'Benign' else 1 for d in y_pred], columns=["obs"])
test = pd.DataFrame(
    [0 if d == 'Benign' else 1 for d in y_test], columns=["obs"])

print("recall {}".format(recall_score(test, pred)))
print("precision {}".format(precision_score(test, pred)))
print("accuracy {}".format(accuracy_score(test, pred)))

print("count events {}".format(test["obs"].value_counts().to_dict()))
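
To show how high recall and lower precision can occur together, here is a small sketch that computes the three metrics directly from true-positive, false-positive, and false-negative counts. The helper binary_metrics and the toy labels are assumptions for illustration and are not taken from the dataset.

```python
def binary_metrics(y_true, y_pred):
    """Compute recall, precision and accuracy for 0/1 labels (1 = attack)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as attack
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign passed
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / len(pairs),
    }


# Toy labels (hypothetical): every attack is caught, but two benign flows are flagged
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

In this toy case no attack is missed, so recall is perfect, while the two false positives pull precision down to 4/6, the same qualitative pattern as in the results reported above.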