# Python notebook for case-study run on the CICIDS security dataset, by Will Bridges

# Code set-up: Imports and packages

The software versions used are:
- Python 3.8.x
- scikit-learn 0.24.0
- seaborn 0.11.1
- pandas 1.1.5 (1.2.0 was released recently and should also work)

Before running the notebook, install the required packages with pip from a terminal:
- pip install pandas
- pip install scikit-learn
- pip install scikit-plot
- pip install seaborn
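
To confirm that the installed packages match the versions listed above, a small check like the following can be run first. This is a minimal sketch, not part of the original notebook: the `EXPECTED_VERSIONS` dictionary and the `installed_version` helper are names introduced here for illustration, and it relies on `importlib.metadata`, which is in the standard library from Python 3.8.

```python
from importlib import metadata  # standard library from Python 3.8 onwards

# Distribution names as passed to pip; versions are the ones listed above.
EXPECTED_VERSIONS = {
    "pandas": "1.1.5",
    "scikit-learn": "0.24.0",
    "seaborn": "0.11.1",
}


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


for name, expected in EXPECTED_VERSIONS.items():
    print(f"{name}: installed={installed_version(name)}, notebook used={expected}")
```

A result of `None` for any entry means the corresponding `pip install` command has not been run in the active environment.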

### Imports

In [1]:
%matplotlib inline
import os # For accessing Python modules on the system path (for accessing the Statistical Measures modules).
import pandas as pd # For DataFrames, Series, and reading CSV data in.
import seaborn as sns # Graphing, built on top of Matplotlib for ease of use and nicer diagrams.
import matplotlib.pyplot as plt # Matplotlib for graphing data visually. Seaborn is more likely to be used.
import numpy as np # For manipulating arrays and changing data into the correct formats for certain libraries.
import sklearn # For machine learning algorithms.
from sklearn.decomposition import PCA # For the PCA dimensionality reduction technique.

### Useful configuration variables

In [2]:
# Number of principal components to keep when applying PCA. Adjust as needed.
dimensions_num_for_PCA = 3

# Maximum number of permutations to run. Adjust as needed.
permutation_num = 10

### Importing the dataset into a pandas DataFrame and showing the first 5 rows via 'df.head()'

In [3]:
Friday_Morning_Data = pd.read_csv('Friday-WorkingHours-Morning.pcap_ISCX.csv')
df = Friday_Morning_Data.copy()
df.head()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,3268,112740690,32,16,6448,1152,403,0,201.5,204.724205,...,32,359.4286,11.99802,380,343,16100000.0,498804.8,16400000,15400000,BENIGN
1,389,112740560,32,16,6448,5056,403,0,201.5,204.724205,...,32,320.2857,15.74499,330,285,16100000.0,498793.7,16400000,15400000,BENIGN
2,0,113757377,545,0,0,0,0,0,0.0,0.0,...,0,9361829.0,7324646.0,18900000,19,12200000.0,6935824.0,20800000,5504997,BENIGN
3,5355,100126,22,0,616,0,28,28,28.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,0,54760,4,0,0,0,0,0,0.0,0.0,...,0,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


Because Excel was used to create the CSV, the column names contain whitespace padding, inconsistent capitalisation, and similar issues, which make it difficult to select columns by name. The code below normalises the names: it strips the padding, lower-cases everything, replaces spaces with underscores, and removes brackets.

Code Reference: https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd

In [4]:
# regex=False treats the brackets as literal characters rather than regex syntax.
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '', regex=False).str.replace(')', '', regex=False)
df.head()

Unnamed: 0,destination_port,flow_duration,total_fwd_packets,total_backward_packets,total_length_of_fwd_packets,total_length_of_bwd_packets,fwd_packet_length_max,fwd_packet_length_min,fwd_packet_length_mean,fwd_packet_length_std,...,min_seg_size_forward,active_mean,active_std,active_max,active_min,idle_mean,idle_std,idle_max,idle_min,label
0,3268,112740690,32,16,6448,1152,403,0,201.5,204.724205,...,32,359.4286,11.99802,380,343,16100000.0,498804.8,16400000,15400000,BENIGN
1,389,112740560,32,16,6448,5056,403,0,201.5,204.724205,...,32,320.2857,15.74499,330,285,16100000.0,498793.7,16400000,15400000,BENIGN
2,0,113757377,545,0,0,0,0,0,0.0,0.0,...,0,9361829.0,7324646.0,18900000,19,12200000.0,6935824.0,20800000,5504997,BENIGN
3,5355,100126,22,0,616,0,28,28,28.0,0.0,...,32,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
4,0,54760,4,0,0,0,0,0,0.0,0.0,...,0,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
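
The effect of the column clean-up can be seen in isolation with plain Python strings. The sketch below applies the same sequence of steps (strip, lower-case, replace spaces, remove brackets) to a few example names. The `clean_column_name` helper and the sample names are hypothetical and are not taken from the dataset.

```python
def clean_column_name(name):
    """Strip padding, lower-case, replace spaces with underscores, drop brackets."""
    cleaned = name.strip().lower().replace(" ", "_")
    return cleaned.replace("(", "").replace(")", "")


# Hypothetical raw headings showing padding, mixed case and brackets.
raw_names = [" Destination Port", "Fwd Packet Length Max ", "Example (Units)"]
cleaned_names = [clean_column_name(n) for n in raw_names]

print(cleaned_names)
# ['destination_port', 'fwd_packet_length_max', 'example_units']
```

The same function could also be passed to pandas directly, for example `df.rename(columns=clean_column_name)`, which returns a new DataFrame with the cleaned names.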


### Data Preparation: PCA Dimension reduction (Hughes' Phenomenon)

PCA reduces the dimensionality (search space) of the dataset while trying to retain as much information as possible. It does this by extracting the directions of greatest spread (variance) across the attributes into n 'principal components'. Depending on how strongly the attributes are correlated, it can reduce the dimensionality by more than half while still retaining around 99% of the original data's variance. This matters because of Hughes' phenomenon: with a fixed number of training samples, a classifier's predictive power tends to degrade once the number of dimensions grows past a certain point.

PCA "is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance."

##### *Key note:* 
"PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter whiten=True makes it possible to project the data onto the singular space while scaling each component to unit variance. This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm." (https://scikit-learn.org/stable/modules/decomposition.html#pca)
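
The sketch below shows how the amount of variance retained by a fixed number of components can be inspected via `explained_variance_ratio_`. It is a minimal illustration under stated assumptions, not the notebook's own pipeline: the data is synthetic rather than CICIDS, and standardising with `StandardScaler` before PCA is one common choice made here because PCA centres but does not scale each feature.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not CICIDS): 200 rows, 6 features
# driven by 3 latent factors, with very different per-feature scales.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 6))
feature_scales = np.array([1.0, 10.0, 100.0, 1.0, 10.0, 1000.0])
X = (latent @ mixing + 0.05 * rng.normal(size=(200, 6))) * feature_scales

# PCA centres but does not scale, so standardise first to stop the
# largest-scale feature from dominating the components.
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)  # same value as dimensions_num_for_PCA
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # rows x retained components
print(pca.explained_variance_ratio_)           # variance share per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative share retained
```

The last value of the cumulative sum is the fraction of total variance kept by the chosen number of components, which is the figure to check when deciding whether `n_components` is large enough.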

In [5]:
# axis=1 drops a column; axis=0 would drop a row. inplace=False (the default) returns a new DataFrame and leaves 'df' unaltered.
df_no_labels = df.drop('label', axis=1, inplace=False)
df_no_labels.head()

Unnamed: 0,destination_port,flow_duration,total_fwd_packets,total_backward_packets,total_length_of_fwd_packets,total_length_of_bwd_packets,fwd_packet_length_max,fwd_packet_length_min,fwd_packet_length_mean,fwd_packet_length_std,...,act_data_pkt_fwd,min_seg_size_forward,active_mean,active_std,active_max,active_min,idle_mean,idle_std,idle_max,idle_min
0,3268,112740690,32,16,6448,1152,403,0,201.5,204.724205,...,15,32,359.4286,11.99802,380,343,16100000.0,498804.8,16400000,15400000
1,389,112740560,32,16,6448,5056,403,0,201.5,204.724205,...,15,32,320.2857,15.74499,330,285,16100000.0,498793.7,16400000,15400000
2,0,113757377,545,0,0,0,0,0,0.0,0.0,...,0,0,9361829.0,7324646.0,18900000,19,12200000.0,6935824.0,20800000,5504997
3,5355,100126,22,0,616,0,28,28,28.0,0.0,...,21,32,0.0,0.0,0,0,0.0,0.0,0,0
4,0,54760,4,0,0,0,0,0,0.0,0.0,...,0,0,0.0,0.0,0,0,0.0,0.0,0,0
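
The behaviour of `drop` with `axis=1` can be checked on a toy frame. The sketch below uses a hypothetical three-row DataFrame built from the first two feature columns shown above. Keeping the labels in a separate `labels` Series is an assumption made for illustration, not a step the notebook performs here.

```python
import pandas as pd

# Hypothetical toy frame; values copied from the first three rows above.
toy = pd.DataFrame({
    "destination_port": [3268, 389, 0],
    "flow_duration": [112740690, 112740560, 113757377],
    "label": ["BENIGN", "BENIGN", "BENIGN"],
})

features = toy.drop("label", axis=1)  # returns a new frame; toy is unchanged
labels = toy["label"]                 # kept separately for later use

print(list(features.columns))  # ['destination_port', 'flow_duration']
print("label" in toy.columns)  # True
```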
