# Python notebook for case-study run on the CICIDS security dataset, by Will Bridges

# Code set-up: Imports and packages

The software versions used are:
- The Python3 version used for this work is: Python 3.8.x
- The scikit-learn version used is: scikit-learn 0.24.0
- The seaborn version used is: 0.11.1
- The pandas version used is: 1.1.5 (the recently released 1.2.0 should also work)

Before running the notebook, install the required packages with pip from a terminal:
- pip install pandas
- pip install scikit-learn
- pip install scikit-plot
- pip install seaborn

### Imports

In [1]:
%matplotlib inline
import os # For accessing Python Modules in the System Path (for accessing the Statistical Measures modules).
import pandas as pd # For DataFrames, Series, and reading csv data in.
import seaborn as sns # Graphing, built on top of Matplotlib for ease of use and nicer diagrams.
import matplotlib.pyplot as plt # Matplotlib for graphing data visually. Seaborn is more likely to be used.
import numpy as np # For manipulating arrays and changing data into correct formats for certain libraries
import sklearn # For Machine Learning algorithms
from sklearn.decomposition import PCA # For PCA dimensionality reduction technique

### Useful environment variables

In [2]:
# 'Reduced dimensions' variable for altering the number of PCA principal components. Can be altered for needs.
dimensions_num_for_PCA = 3

# Max number of permutations to run. Can be altered for needs.
permutation_num = 10

### Importing the dataset into Pandas.DataFrame and showing the top 5 entries via 'df.head()'

In [3]:
Friday_Morning_Data = pd.read_csv('Friday-WorkingHours-Morning.pcap_ISCX.csv')
df = Friday_Morning_Data.copy()
df.head()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,3268,112740690,32,16,6448,1152,403,0,201.5,204.724205,...,32,359.4286,11.99802,380,343,16100000.0,498804.8,16400000,15400000,BENIGN
1,389,112740560,32,16,6448,5056,403,0,201.5,204.724205,...,32,320.2857,15.74499,330,285,16100000.0,498793.7,16400000,15400000,BENIGN
2,0,113757377,545,0,0,0,0,0,0.0,0.0,...,0,9361829.0,7324646.0,18900000,19,12200000.0,6935824.0,20800000,5504997,BENIGN
3,5355,100126,22,0,616,0,28,28,28.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,0,54760,4,0,0,0,0,0,0.0,0.0,...,0,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


Because Excel was used to create the CSV, the column names contain whitespace padding, inconsistent capitalisation, and similar issues, which make it difficult to select columns by name. The code below removes these issues.

Code Reference: https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd

In [4]:
# regex=False treats '(' and ')' as literal characters rather than regular-expression syntax.
df.columns = (df.columns.str.strip().str.lower()
              .str.replace(' ', '_', regex=False)
              .str.replace('(', '', regex=False)
              .str.replace(')', '', regex=False))
df.head()


Unnamed: 0,destination_port,flow_duration,total_fwd_packets,total_backward_packets,total_length_of_fwd_packets,total_length_of_bwd_packets,fwd_packet_length_max,fwd_packet_length_min,fwd_packet_length_mean,fwd_packet_length_std,...,min_seg_size_forward,active_mean,active_std,active_max,active_min,idle_mean,idle_std,idle_max,idle_min,label
0,3268,112740690,32,16,6448,1152,403,0,201.5,204.724205,...,32,359.4286,11.99802,380,343,16100000.0,498804.8,16400000,15400000,BENIGN
1,389,112740560,32,16,6448,5056,403,0,201.5,204.724205,...,32,320.2857,15.74499,330,285,16100000.0,498793.7,16400000,15400000,BENIGN
2,0,113757377,545,0,0,0,0,0,0.0,0.0,...,0,9361829.0,7324646.0,18900000,19,12200000.0,6935824.0,20800000,5504997,BENIGN
3,5355,100126,22,0,616,0,28,28,28.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,0,54760,4,0,0,0,0,0,0.0,0.0,...,0,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


### Data Preparation: PCA Dimension reduction (Hughes' Phenomenon)

PCA reduces the dimensionality (search space) of the dataset as far as possible while retaining as much of the information as possible. For example, in favourable cases it can reduce the number of dimensions by more than half while still retaining around 99% of the original data's variance. It does this by projecting the attributes onto n 'principal components': the directions that capture the most spread (variance) in the data.

PCA "is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance."

##### *Key note:* 
"PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter whiten=True makes it possible to project the data onto the singular space while scaling each component to unit variance. This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm." (https://scikit-learn.org/stable/modules/decomposition.html#pca)

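Because PCA does not scale the features, attributes with large numeric ranges (such as flow durations) can dominate the principal components. The following is a minimal sketch of that effect, run on synthetic data rather than the CICIDS data. It compares the explained variance ratio before and after standardising with scikit-learn's `StandardScaler`. The array `X` and its two features are hypothetical, and whether scaling suits this case study is left as an assumption to check.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not CICIDS): two independent
# features on very different numeric scales.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1e6, size=500),  # large-scale feature
    rng.normal(0.0, 1.0, size=500),  # small-scale feature
])

# PCA on the raw data: the large-scale feature dominates the first component.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardising each feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

On the raw data nearly all of the variance is attributed to the first component, purely because of the units involved. After standardising, the two independent features contribute roughly equally.
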
First, the label column has to be removed, as it should not take part in the PCA process. It can be concatenated back onto the PCA-transformed dataframe afterwards.

In [5]:
# Axis=1 means columns. Axis=0 means rows. inplace=False means that the original 'df' isn't altered.
df_no_labels = df.drop('label', axis=1, inplace=False)
df_no_labels

Unnamed: 0,destination_port,flow_duration,total_fwd_packets,total_backward_packets,total_length_of_fwd_packets,total_length_of_bwd_packets,fwd_packet_length_max,fwd_packet_length_min,fwd_packet_length_mean,fwd_packet_length_std,...,act_data_pkt_fwd,min_seg_size_forward,active_mean,active_std,active_max,active_min,idle_mean,idle_std,idle_max,idle_min
0,3268,112740690,32,16,6448,1152,403,0,201.5,204.724205,...,15,32,3.594286e+02,1.199802e+01,380,343,16100000.0,4.988048e+05,16400000,15400000
1,389,112740560,32,16,6448,5056,403,0,201.5,204.724205,...,15,32,3.202857e+02,1.574499e+01,330,285,16100000.0,4.987937e+05,16400000,15400000
2,0,113757377,545,0,0,0,0,0,0.0,0.000000,...,0,0,9.361829e+06,7.324646e+06,18900000,19,12200000.0,6.935824e+06,20800000,5504997
3,5355,100126,22,0,616,0,28,28,28.0,0.000000,...,21,32,0.000000e+00,0.000000e+00,0,0,0.0,0.000000e+00,0,0
4,0,54760,4,0,0,0,0,0,0.0,0.000000,...,0,0,0.000000e+00,0.000000e+00,0,0,0.0,0.000000e+00,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191028,53,61452,4,2,180,354,45,45,45.0,0.000000,...,3,20,0.000000e+00,0.000000e+00,0,0,0.0,0.000000e+00,0,0
191029,53,171,2,2,80,272,40,40,40.0,0.000000,...,1,32,0.000000e+00,0.000000e+00,0,0,0.0,0.000000e+00,0,0
191030,53,222,2,2,90,354,45,45,45.0,0.000000,...,1,32,0.000000e+00,0.000000e+00,0,0,0.0,0.000000e+00,0,0
191031,123,16842,1,1,48,48,48,48,48.0,0.000000,...,0,20,0.000000e+00,0.000000e+00,0,0,0.0,0.000000e+00,0,0


### Looking at the original data types

In [6]:
df_no_labels.dtypes

destination_port                 int64
flow_duration                    int64
total_fwd_packets                int64
total_backward_packets           int64
total_length_of_fwd_packets      int64
                                ...   
active_min                       int64
idle_mean                      float64
idle_std                       float64
idle_max                         int64
idle_min                         int64
Length: 78, dtype: object

### Fixing issues with ScikitLearn's PCA transform on this dataset

Without cleaning the dataset, the PCA transform raised this error: "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')". It isn't obvious which attribute and/or data point causes this, as the input dataset is supposed to be fully clean, with no NaN or erroneous values, and there are too many attributes to search through manually. A quick solution found on Stack Overflow resolves the error (see the 'clean_dataset(df)' function).

Code reference: https://stackoverflow.com/a/46581125

In [7]:
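def clean_dataset(df):
    """Drop rows containing NaN or +/-inf and cast the remaining values to float64."""
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df = df.dropna()  # Returns a new frame, so the caller's DataFrame is not mutated.
    # Flag rows containing NaN or infinities; axis=1 checks across the columns of each row.
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep].astype(np.float64)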
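
To see which attributes trigger the error rather than searching by hand, the NaN and infinite values can be counted per column. This is a minimal sketch: the helper `count_bad_values` and the toy frame (including its column names) are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def count_bad_values(frame):
    """Return NaN and +/-inf counts for each numeric column that has any."""
    numeric = frame.select_dtypes(include=[np.number])
    values = numeric.to_numpy(dtype=np.float64)
    report = pd.DataFrame(
        {
            "nan_count": np.isnan(values).sum(axis=0),
            "inf_count": np.isinf(values).sum(axis=0),
        },
        index=numeric.columns,
    )
    # Keep only the columns that actually contain a problem value.
    return report[(report["nan_count"] > 0) | (report["inf_count"] > 0)]

# Hypothetical toy frame; the column names are illustrative only.
toy = pd.DataFrame({
    "flow_bytes_s": [1.0, np.inf, np.nan, 4.0],
    "flow_packets_s": [2.0, 3.0, -np.inf, 5.0],
    "flow_duration": [10, 20, 30, 40],
})
bad = count_bad_values(toy)
print(bad)
```

In the notebook, the same helper could be called on `df_no_labels` before `clean_dataset` is applied, assuming all of its columns are numeric.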

The cleaning removed some rows, which indicates that those rows did contain NaN or infinite values.

In [8]:
df_no_labels = clean_dataset(df_no_labels)
df_no_labels

Unnamed: 0,destination_port,flow_duration,total_fwd_packets,total_backward_packets,total_length_of_fwd_packets,total_length_of_bwd_packets,fwd_packet_length_max,fwd_packet_length_min,fwd_packet_length_mean,fwd_packet_length_std,...,act_data_pkt_fwd,min_seg_size_forward,active_mean,active_std,active_max,active_min,idle_mean,idle_std,idle_max,idle_min
0,3268.0,112740690.0,32.0,16.0,6448.0,1152.0,403.0,0.0,201.5,204.724205,...,15.0,32.0,3.594286e+02,1.199802e+01,380.0,343.0,16100000.0,4.988048e+05,16400000.0,15400000.0
1,389.0,112740560.0,32.0,16.0,6448.0,5056.0,403.0,0.0,201.5,204.724205,...,15.0,32.0,3.202857e+02,1.574499e+01,330.0,285.0,16100000.0,4.987937e+05,16400000.0,15400000.0
2,0.0,113757377.0,545.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,9.361829e+06,7.324646e+06,18900000.0,19.0,12200000.0,6.935824e+06,20800000.0,5504997.0
3,5355.0,100126.0,22.0,0.0,616.0,0.0,28.0,28.0,28.0,0.000000,...,21.0,32.0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0
4,0.0,54760.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,...,0.0,0.0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
191028,53.0,61452.0,4.0,2.0,180.0,354.0,45.0,45.0,45.0,0.000000,...,3.0,20.0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0
191029,53.0,171.0,2.0,2.0,80.0,272.0,40.0,40.0,40.0,0.000000,...,1.0,32.0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0
191030,53.0,222.0,2.0,2.0,90.0,354.0,45.0,45.0,45.0,0.000000,...,1.0,32.0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0
191031,123.0,16842.0,1.0,1.0,48.0,48.0,48.0,48.0,48.0,0.000000,...,0.0,20.0,0.000000e+00,0.000000e+00,0.0,0.0,0.0,0.000000e+00,0.0,0.0


Inspecting the datatypes again, all have been converted to float64 to be compatible.

In [9]:
df_no_labels.dtypes

destination_port               float64
flow_duration                  float64
total_fwd_packets              float64
total_backward_packets         float64
total_length_of_fwd_packets    float64
                                ...   
active_min                     float64
idle_mean                      float64
idle_std                       float64
idle_max                       float64
idle_min                       float64
Length: 78, dtype: object

### Now fitting and transforming the data with PCA

References: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html and https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

In [10]:
# Create the PCA model; fit and transform are then called to compute the principal components.
pca = PCA(n_components=dimensions_num_for_PCA)

# Reset the index after rows were dropped by the cleaning (ref: https://stackoverflow.com/questions/31323499/sklearn-error-valueerror-input-contains-nan-infinity-or-a-value-too-large-for)
# Note: without drop=True, reset_index() keeps the old index as a new 'index' column,
# which PCA then treats as a feature. Pass drop=True to exclude it.
df_no_labels = df_no_labels.reset_index()

principalComponents = pca.fit(df_no_labels).transform(df_no_labels)
principalComponents

array([[ 1.65552413e+08, -6.79233346e+07,  1.47434143e+06],
       [ 1.65552354e+08, -6.79233154e+07,  1.47437554e+06],
       [ 1.07955363e+08, -4.23169736e+07,  1.48368380e+05],
       ...,
       [-2.21706531e+07,  9.55245555e+05, -8.42512341e+04],
       [-2.21311568e+07,  9.27912760e+05, -2.07811090e+06],
       [-2.21584869e+07,  9.42255383e+05, -9.85995340e+05]])
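
The label column removed earlier can be re-attached once the components are computed. The sketch below uses a hypothetical toy frame, not the CICIDS data. It assumes the cleaned feature frame keeps its original index (that is, the index is not reset), so that labels can be matched to the rows that survived cleaning. The names `toy`, `pca_df`, `pc1` and `pc2` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Hypothetical toy data standing in for the CICIDS frame.
toy = pd.DataFrame({
    "f1": [1.0, 2.0, np.inf, 4.0, 5.0],
    "f2": [2.0, 1.0, 3.0, 5.0, 4.0],
    "f3": [0.5, 0.1, 0.9, 0.3, 0.7],
    "label": ["BENIGN", "BENIGN", "Bot", "Bot", "BENIGN"],
})

features = toy.drop("label", axis=1)
# Keep only rows whose features are all finite; the original index is preserved.
finite_rows = np.isfinite(features.to_numpy()).all(axis=1)
features = features[finite_rows]

components = PCA(n_components=2).fit_transform(features)

# Build the PCA frame on the surviving index so labels align row by row.
pca_df = pd.DataFrame(components, columns=["pc1", "pc2"], index=features.index)
pca_df["label"] = toy.loc[features.index, "label"]
print(pca_df)
```

Aligning on the index rather than on row position avoids attaching labels to the wrong rows when cleaning has dropped some of them.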