# Python notebook for case-study run on the CICIDS security dataset, by Will Bridges

# Code set-up: Imports, Packages, Environment variables, and Methods

The software versions used are:
- The Python3 version used for this work is: Python 3.8.x
- The scikit-learn version used is: scikit-learn 0.24.0
- The seaborn version used is: 0.11.1
- The Pandas version used is: 1.1.5 (although 1.2.0 was released recently, this should also work)

Before running, please run these commands via pip, in the terminal:
- pip install pandas
- pip install scikit-learn
- pip install scikit-plot
- pip install seaborn

### Imports

In [1]:
%matplotlib inline
import os, sys # For accessing Python Modules in the System Path (for accessing the Statistical Measures modules)
# See: https://stackoverflow.com/a/39311677
nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

# Importing local modules (statistical distance measures)
from CVM_Distance import CVM_Dist

import pandas as pd # For DataFrames, Series, and reading csv data in.
import seaborn as sns # Graphing, built on top of Matplotlib for ease of use and nicer diagrams.
import matplotlib.pyplot as plt # MatPlotLib for graphing data visually. Seaborn more likely to be used.
import numpy as np # For manipulating arrays and changing data into correct formats for certain libraries
import sklearn # For Machine Learning algorithms
import scikitplot # Confusion matrix plotting
from sklearn.decomposition import PCA # For PCA dimensionality reduction technique
from sklearn.preprocessing import StandardScaler # For scaling to unit scale, before PCA application
from sklearn.preprocessing import LabelBinarizer # For converting categorical data into numeric, for modeling stage
from sklearn.model_selection import StratifiedKFold # For optimal train_test splitting, for model input data
from sklearn.neighbors import KNeighborsClassifier # K-Nearest Neighbors ML classifier (default n. of neighbors = 5)
from scikitplot.metrics import plot_confusion_matrix # For plotting confusion matrices
from sklearn.metrics import accuracy_score # For getting the accuracy of a model's predictions
from sklearn.metrics import classification_report # Various metrics for model performance

SyntaxError: 'return' outside function (CVM_Distance.py, line 31)

In [None]:
print(sys.path)

### Methods

The clean_dataset() method removes rows containing NaN or infinite values from the original dataset, which were causing errors in the PCA transform step.

Code reference: https://stackoverflow.com/a/46581125 (with a minor change = removed the conversion to float64 type)

In [None]:
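def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    # Drop rows with missing values (modifies the passed-in DataFrame in place).
    df.dropna(inplace=True)
    # Keep only rows with no NaN or +/-inf in any column.
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)
    return df[indices_to_keep]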
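
As a quick sanity check, the cleaning step can be exercised on a toy frame. This is only a minimal sketch: `clean_rows` and the column names are hypothetical stand-ins that mirror the idea of `clean_dataset()`, not a replacement for it.

```python
import numpy as np
import pandas as pd

def clean_rows(df):
    """Drop rows containing NaN or +/-inf (same idea as clean_dataset above)."""
    df = df.dropna()
    bad = df.isin([np.inf, -np.inf]).any(axis=1)
    return df[~bad]

# Hypothetical column names and values, for illustration only.
toy = pd.DataFrame({
    "flow_bytes_s": [1.0, np.inf, 3.0, np.nan],
    "flow_duration": [10, 20, 30, 40],
})
cleaned = clean_rows(toy)
print(cleaned)
```

Rows 1 (infinite value) and 3 (NaN) are removed, which is why the notebook later resets the index.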

The get_PCA_feature_names() method generates feature names for the number of PCA components passed in as a param. It returns a list of names to use as the principal component column headings of a Pandas DataFrame.

In [None]:
def get_PCA_feature_names(num_of_pca_components):
    # e.g. 2 -> ["Principal component 1", "Principal component 2"]
    return [f"Principal component {i+1}" for i in range(num_of_pca_components)]

The train_model_predict() method trains an input model, using StratifiedKFold for train/test splitting, and uses the trained model to predict the test data. It prints a classification report containing various useful prediction metrics and plots a confusion matrix for the model's predictions. Finally, it returns the accuracy of the model's predictions.

- 1) The for loop ('for train_index, test_index in skf.split(X, y):') iterates over the index arrays that the StratifiedKFold model (**skf**) produces, which select the data rows required for each split. As written, the model is fitted after the loop has finished, so only the final fold's split is used for training and testing.

- 2) The 'X_train, X_test = X.iloc[train_index], X.iloc[test_index]' uses the skf indexes to find the index location (iloc) of each index, so it can extract the correct rows for the train_test split.

- 3) The 'reshaped_y_train = np.asarray(y_train).reshape(-1, 1)' converts the labels (y_train and y_test) from a Pandas Series into a 2D column array of shape (n, 1).

- 4) The 'model.fit(X_train, reshaped_y_train.ravel())' uses the input model and fits it (trains the model) on the training data. The '.ravel()' method just reshapes the label array again (flattens it) to match the input structure required by the sklearn method.

- 5) The 'pred_y = model.predict(X_test)' uses the, now trained, model to attempt to predict the test data (X_test is passed in, and it predicts the label, pred_y).

- 6) The 'score = classification_report(reshaped_y_test, pred_y)' calculates prediction metrics based upon the model's predictions. More info in the docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

The rest is self-explanatory. A confusion matrix is plotted and output after the method runs. The accuracy of the model is returned back to the caller, as well as other data required for the statistical distance measure methods.

In [None]:
# See documentation above to understand what each step does, and why.
def train_model_predict(model, model_name, X, y, skf):
    for train_index, test_index in skf.split(X, y): # 1)
        X_train, X_test = X.iloc[train_index], X.iloc[test_index] # 2)
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        reshaped_y_train = np.asarray(y_train).reshape(-1, 1) # 3)
        reshaped_y_test = np.asarray(y_test).reshape(-1, 1)
        
    # Note: this runs after the loop, so only the final fold's split is used to fit and test.
    model.fit(X_train, reshaped_y_train.ravel()) # 4)
    pred_y = model.predict(X_test) # 5)
    score = classification_report(reshaped_y_test, pred_y) # 6)
    print('Classification report: \n', score, '\n')
    plot_confusion_matrix(reshaped_y_test, pred_y, title='Confusion Matrix for {}'.format(model_name))
        
    return accuracy_score(reshaped_y_test, pred_y), X_train, X_test, y_train, pred_y
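
Because the fit happens after the loop, the function above evaluates a single split (the last fold). If a score for every fold is wanted, one option is scikit-learn's `cross_val_score`, which re-fits the model on each training split. The sketch below is not the notebook's method: it uses synthetic data from `make_classification` as a stand-in for the PCA-transformed features, and the `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: this is not the CICIDS dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=1)

skf_demo = StratifiedKFold(n_splits=5, shuffle=False)
knn_demo = KNeighborsClassifier(n_neighbors=5)

# One accuracy value per fold: the model is re-fitted on each training split.
fold_scores = cross_val_score(knn_demo, X_demo, y_demo, cv=skf_demo, scoring="accuracy")
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

Comparing the per-fold values is what allows a badly performing fold to be spotted, as discussed in the K-Fold section later in the notebook.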

### Useful environment variables

In [None]:
# 'Reduced dimensions' variable for altering the number of PCA principal components. Can be altered for needs.
dimensions_num_for_PCA = 30

# Max number of permutations to run. Can be altered for needs.
permutation_num = 10

# 10 folds is usually the heuristic to follow for larger datasets of around this size.
num_of_splits_for_skf = 10

# Seed value to pass into models so that repeated runs result in the same output
seed_val = 1

### Importing the dataset into Pandas.DataFrame and showing the top 5 entries via 'df.head()'

In [None]:
Friday_Morning_Data = pd.read_csv('Friday-WorkingHours-Morning.pcap_ISCX.csv')
df = Friday_Morning_Data.copy()
df.head()

### Fixing column name issues

Because the CSV was created with Excel, the column headings contain whitespace padding, inconsistent capitalisation, and similar issues, which make it difficult to select columns by name. The code below removes these issues.

Code Reference: https://medium.com/@chaimgluck1/working-with-pandas-fixing-messy-column-names-42a54a6659cd

In [None]:
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)  # regex=False: treat '(' as a literal character
    .str.replace(')', '', regex=False)
)
df.head()
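
The same chain of string operations can be tried on a plain list to see what it does to each heading. This is a minimal sketch using only built-in string methods; `normalise_column_name` and the raw headings are hypothetical examples, not taken from the dataset.

```python
def normalise_column_name(name):
    """Strip padding, lower-case, replace spaces with underscores, drop parentheses."""
    return (
        name.strip()
        .lower()
        .replace(" ", "_")
        .replace("(", "")
        .replace(")", "")
    )

# Hypothetical raw headings, padded and capitalised in the way described above.
raw_names = [" Flow Duration", "Total Fwd Packets ", " Label", "Rate (Bytes)"]
clean_names = [normalise_column_name(n) for n in raw_names]
print(clean_names)
```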

### Looking at the original data types

In [None]:
df.dtypes

### Fixing issues with ScikitLearn's PCA transform on this dataset

Without cleaning the dataset, the PCA transform was throwing this error: 
- "sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')". 

It isn't obvious which attribute or data point causes this, as the input dataset is supposed to be fully clean, with no NaN or erroneous values, and there are too many attributes to check manually. A quick solution found on Stack Overflow was therefore used (see the 'clean_dataset(df)' method).

Some rows have been removed by the cleaning, indicating that some rows did have issues/ errors within them.

In [None]:
df_cleaned = df.copy()
df_cleaned = clean_dataset(df_cleaned) # see methods at top of notebook
df_cleaned

Resetting indexes since rows have been dropped.

In [None]:
# drop=True discards the old index instead of adding it back as an 'index' column
df_cleaned = df_cleaned.reset_index(drop=True)
df_cleaned

### Considerations before PCA can be used correctly (before Data Preparation feature selection via PCA)
Looking at this resource and many others (https://towardsdatascience.com/pca-is-not-feature-selection-3344fb764ae6), it can be seen that PCA can, quite easily, be used incorrectly without proper consideration and/ or understanding.

From the resource:
- "A common mistake new data scientists make is to apply PCA to non-continuous variables. While it is technically possible to use PCA on discrete variables, or categorical variables that have been one hot encoded variables, you should not. Simply put, if your variables don’t belong on a coordinate plane, then do not apply PCA to them"

Thus, PCA should **only** be applied to the numeric features, which **must** be scaled to unit scale first.

### What features should be included in PCA, and why?

Looking at the list of feature names in the dataset (shown below) and applying domain knowledge, one can see that all features other than the label should be numeric. They are all currently of numeric type (either float or int). Consequently, PCA **can be** applied to all of them after scaling to unit scale.

In [None]:
df.columns.tolist()
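
Rather than relying on inspection alone, the numeric/non-numeric split can be checked programmatically with `DataFrame.select_dtypes`. The sketch below uses a toy frame with hypothetical column names and values; on the real data the same two lines could be run against `df_cleaned`.

```python
import pandas as pd

# Toy frame with hypothetical columns; 'label' is the only non-numeric column.
toy = pd.DataFrame({
    "flow_duration": [10, 20, 30],
    "flow_bytes_s": [1.5, 2.5, 3.5],
    "label": ["BENIGN", "Bot", "BENIGN"],
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = [c for c in toy.columns if c not in numeric_cols]
print("Numeric:", numeric_cols)
print("Non-numeric:", non_numeric_cols)
```

If anything other than the label shows up as non-numeric, it should be handled before PCA is applied.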

### Data Preparation: PCA Dimension reduction and scaling (Hughes' Phenomenon)

PCA reduces the dimensionality (search space) of the dataset while trying to retain as much of its information as possible. For some datasets it can more than halve the number of dimensions while still retaining around 99% of the original variance. It does this by extracting the directions of greatest spread (variance) across the attributes into n 'principal components'.

More formally: PCA is used to decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance.

##### *Key note:* 
"**PCA centers but does not scale the input data** for each feature before applying the SVD. The optional parameter whiten=True makes it possible to project the data onto the singular space while scaling each component to unit variance. This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this is for example the case for **Support Vector Machines with the RBF kernel** and the **K-Means clustering algorithm**." (https://scikit-learn.org/stable/modules/decomposition.html#pca)

PCA still works without standardizing the features to unit scale, **but transforming to unit scale should still be done** to prevent high-variance features from having an overbearing effect on lower-variance features (via something like StandardScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

This is **particularly important with this dataset**, as some features have very wide ranges of values and others do not (e.g. the 'idle_std' values range from the order of 1e+06 all the way down to zero).
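
The effect of scale on PCA can be seen on synthetic data. In this sketch (an illustration with made-up values, not the CICIDS features), two independent features differ in scale by a factor of a million: without scaling, the first component is dominated by the wide feature, whereas after `StandardScaler` the variance is shared roughly evenly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two independent synthetic features on very different scales (illustrative only).
wide = rng.normal(0.0, 1e6, size=500)
narrow = rng.normal(0.0, 1.0, size=500)
X_demo = np.column_stack([wide, narrow])

# Explained variance ratios without and with scaling to unit variance.
ratio_raw = PCA(n_components=2).fit(X_demo).explained_variance_ratio_
ratio_scaled = PCA(n_components=2).fit(
    StandardScaler().fit_transform(X_demo)
).explained_variance_ratio_

print("Unscaled:", ratio_raw)
print("Scaled:  ", ratio_scaled)
```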

In [None]:
# Saving the label attribute before dropping it.
df_labels = df_cleaned['label']
# Shows all the possible labels/ classes a model can predict.
# Need to alter these to numeric 0, 1, etc... for model comprehension (e.g. pd.get_dummies()).
df_labels.unique()

The label column has to be removed, as it should not be involved in the PCA process. It can be concatenated back onto the PCA-transformed dataframe afterwards.

In [None]:
# Axis=1 means columns. Axis=0 means rows. inplace=False means that the original 'df' isn't altered.
df_no_labels = df_cleaned.drop('label', axis=1, inplace=False)
# Getting feature names for the StandardScaler process
df_features = df_no_labels.columns.tolist()
# Printing out Dataframe with no label column, to show successful dropping
df_no_labels

### Using StandardScaler to transform features into unit scale, ready for PCA

Code references: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60 & https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

In [None]:
# x = df_no_labels.loc[:, df_features].values
df_scaled = StandardScaler().fit_transform(df_no_labels)
# Converting back to dataframe
df_scaled = pd.DataFrame(data = df_scaled, columns = df_features)
df_scaled

### Plotting principal component variance

A scree plot displays the variance explained by each principal component within the analysis.

**The plot below shows that using the first 30 PCA components actually describes most/ all of the variation (information) within the original data. This is a huge dimension reduction from the initial 78 features, down to just 30.**

Thus, looking at the Environment Variables (at the top of the notebook), the 'dimensions_num_for_PCA' variable will be set to **30** based upon this evidence.

(Code reference: https://medium.com/district-data-labs/principal-component-analysis-with-python-4962cd026465)

In [None]:
pca_test = PCA().fit(df_scaled)
plt.plot(np.cumsum(pca_test.explained_variance_ratio_))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
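
Reading the number of components off the plot by eye can be complemented with a small calculation. The helper below is a hypothetical sketch (the name `components_for_variance` and the 0.99 default are assumptions) that returns the smallest number of leading components whose cumulative explained variance reaches a chosen threshold.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.99):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    if cumulative[-1] < threshold:
        return len(cumulative)  # threshold never reached: keep every component
    # argmax returns the first index at which the condition is True.
    return int(np.argmax(cumulative >= threshold)) + 1

# Hypothetical ratios, purely for illustration.
demo_ratios = np.array([0.5, 0.3, 0.15, 0.05])
n_needed = components_for_variance(demo_ratios, threshold=0.9)
print(n_needed)
```

With the fitted object from the cell above, the call would be `components_for_variance(pca_test.explained_variance_ratio_)`.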

### Now fitting and transforming the data with PCA

Thus, the chosen number of principal components is taken from the environment variable and is now used to produce the corresponding multi-dimensional principal component array. This will be converted back to a Pandas dataframe afterwards.

References: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html and https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

In [None]:
pca = PCA(n_components=dimensions_num_for_PCA)
principal_components = pca.fit_transform(df_scaled) # fit and transform in a single step
principal_components

Getting the Principal Component feature names, dynamically, for the optimal number of components (passed in as a param).

In [None]:
# See Methods at the top of the notebook
principal_component_headings = get_PCA_feature_names(dimensions_num_for_PCA)

Turning the Principal Components back into a Pandas Dataframe, ready for concatenating back with the **label** feature.

In [None]:
df_pc = pd.DataFrame(data = principal_components, columns = principal_component_headings)
df_pc

Joining/concatenating the label feature back onto the PCA-transformed dataset. The label still needs to be transformed into binary data, because the model cannot work with string data directly, but string data can be transformed into numeric data, which the model can understand and use.

In [None]:
df_final = pd.concat([df_pc, df_labels], axis = 1)
# Scroll to the RHS end of dataframe to see attached label feature
df_final

### Transforming the label feature's categorical data into numeric data (via LabelBinarizer)

Again, a model can't understand e.g. 'yes' and 'no' strings but, these can be mapped to a 1 for yes and a 0 for no.

The **sklearn.preprocessing.LabelBinarizer** can be used to convert the column data into binary numbers, which will then be correctly interpreted.

1. Fit the List- this tells the LabelBinarizer what values exist, and how to map them. 

2. Call transform, passing a List, and this will return the encoded List.

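**(Note: if the label column has more than 2 unique labels, LabelBinarizer returns one column per class, which cannot be assigned to a single 'label' column; pandas.get_dummies is needed instead)**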

(Code reference: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html)

In [None]:
lb = LabelBinarizer()
df_final['label'] = lb.fit_transform(df_final['label']).ravel() # flatten the (n, 1) output to 1D
df_final

Showing the transformation. **Again, note that if the label isn't binary, pd.get_dummies is required.**

In [None]:
print("Before LabelBinarizer: ", df_labels.unique())
print("After LabelBinarizer: ", df_final['label'].unique())
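
The difference between the binary and multi-class cases is easiest to see from the output shapes. The label values in this sketch are hypothetical and chosen only for illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical label values, for illustration only.
binary_labels = ["BENIGN", "Bot", "BENIGN", "Bot"]
multi_labels = ["BENIGN", "Bot", "DDoS", "BENIGN"]

binary_encoded = LabelBinarizer().fit_transform(binary_labels)
multi_encoded = LabelBinarizer().fit_transform(multi_labels)

print(binary_encoded.shape)  # a single column when there are two classes
print(multi_encoded.shape)   # one column per class when there are three
```

With two classes the result fits into a single 'label' column; with three or more it becomes a one-hot matrix, which is why a different approach is needed in that case.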

#### The data is now fully cleaned and transformed, ready for pre-modeling test_train data splitting

## K-Fold Cross Validation and Stratified splitting
K-Fold is a technique which splits data into K folds (splits). A model is trained K times and, for each training iteration, K-Fold selects a different fold to use for testing; the remaining K - 1 folds become the training data. Typically, a suitable K value can be derived from the size of your dataset (number of rows). Ideally, each fold should be statistically representative of the population. If a fold is too small, it won't be representative; if the folds are too large (i.e. K is small), you lose the benefits of doing K-Fold.

You can use Stratified splitting with K-Fold, which ensures balance between some criteria (balances out the classes) e.g. equal portion of label classes in each fold.

Class Imbalance is a significant issue in the ML/ Data Mining domain. It leads to incorrect results e.g. if one fold had all of 1 label (accidentally), then it would produce terrible predictive results as it wouldn't know what the other label class data point would look like. You can only work with the data you have, so this has to be dealt with.

Benefits of K-Fold:
- Use more of the data towards making a successful model.
- Obtain K models to evaluate, which can improve the confidence that you have selected an appropriate model algorithm and cleaned/prepared the data correctly. With a normal split and 1 model, one doesn't know if it's good or not, as it could be heavily biased by that particular split. Evaluating multiple models makes the performance estimate less dependent on any single split.
- Looking at the accuracy results from each of the k-Folds, you can identify data issues e.g. a certain fold performs really badly. Could this suggest that more cleaning is required? Maybe the data preparation was performed incorrectly?
- If all folds return similar accuracies, one can be more confident that a deployed model will perform similarly to how one expects.

Issues with K-Fold:
- Creating K separate models requires more computation.
- If you haven't got much data, you might not get many folds. Fewer folds means K-Fold loses its benefits.
- If K is very large, each fold is small, and it is harder to ensure that each fold is statistically representative.
- Choosing the best of K models introduces bias. Real world data could perform better under a more general, lower performing model.

Code reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
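
The class-balancing effect of stratification can be checked directly by counting the classes in each test fold. This sketch uses toy labels (8 of one class, 4 of the other) rather than the notebook's data, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 8 of class 0 and 4 of class 1 (illustrative only).
y_demo = np.array([0] * 8 + [1] * 4)
X_demo = np.zeros((len(y_demo), 1))  # feature values are irrelevant to the split itself

skf_demo = StratifiedKFold(n_splits=4, shuffle=False)
# Count how many samples of each class land in every test fold.
fold_class_counts = [
    np.bincount(y_demo[test_idx], minlength=2).tolist()
    for _, test_idx in skf_demo.split(X_demo, y_demo)
]
print(fold_class_counts)
```

Each test fold keeps the same 2:1 class ratio as the full label set, which a plain, unstratified split does not guarantee.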

In [None]:
# Separating the label so that the answers aren't provided to the model, in training.
X = df_final.drop(['label'], axis = 1)
y = df_final['label']
y

Initialising the StratifiedKFold model (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)

In [None]:
skf = StratifiedKFold(n_splits=num_of_splits_for_skf, shuffle=False)
skf

Now, splitting the data into train and test data, using the optimal splitting techniques of K-Fold and Stratified Splitting.

In [None]:
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    reshaped_y_train = np.asarray(y_train).reshape(-1, 1)
    reshaped_y_test = np.asarray(y_test).reshape(-1, 1)
    
# X_train, X_test, y_train and y_test now hold the final fold's split only.
print( 'X_train length: ', len(X_train) ) # To check if splits worked
print( 'y_train length: ', len(y_train) )
print( 'X_test length: ', len(X_test) )
print( 'y_test length: ', len(y_test) )

## Modeling stage
Data is now fully transformed and ready for ML model training and predictions

### K-Nearest neighbor ML classifier
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are assumed to be near to each other.

The most important factor in training a KNN model is the **number of neighbors hyperparameter**. You want to choose the K value that reduces the number of errors, while maintaining the algorithm’s ability to accurately make predictions when it’s given data it hasn’t seen before. Here are some important considerations:

- As you decrease the value of K to **1**, predictions become less stable e.g. imagine K=1 and you have a query point surrounded by several red 'dots' and one green 'dot', but the green dot is the single nearest neighbor. Reasonably, you would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.

- Inversely, as you increase the value of K, predictions become more stable due to majority voting/averaging, and thus more likely to be accurate (up to a certain critical point). Beyond that point, the vote is diluted by points that are not really close to the query (the model starts to **underfit**), and you would begin to see an increasing number of errors. It is at this point you'd know that you have pushed the value of K too far.

- In cases where you are taking a majority vote (e.g. picking the mode in a classification problem) amongst labels/ classes, you usually make K an odd number to have a tiebreaker.

scikit-learn's default value of K (the `n_neighbors` parameter) is 5.

References: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html & https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
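
The red/green example from the considerations above can be reproduced with a few lines of plain Python. This is a toy majority-vote sketch with made-up points, not scikit-learn's implementation; `knn_predict` is a hypothetical helper.

```python
from collections import Counter
import math

def knn_predict(points, query, k):
    """Majority vote among the k points closest to query (Euclidean distance)."""
    nearest = sorted(
        points,
        key=lambda p: math.hypot(p[0][0] - query[0], p[0][1] - query[1]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# One green point is the single nearest neighbor; four red points surround the query.
points = [
    ((0.1, 0.0), "green"),
    ((0.5, 0.0), "red"),
    ((0.0, 0.5), "red"),
    ((-0.5, 0.0), "red"),
    ((0.0, -0.5), "red"),
]
query = (0.0, 0.0)

for k in (1, 3, 5):
    print(k, knn_predict(points, query, k))
```

With K=1 the single green neighbor decides the prediction, while K=3 and K=5 let the surrounding red points outvote it.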

In [None]:
knn_model = KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

### Training the model and predicting test data results (confusion matrix)
The selected Machine Learning classifier/ model/ models can now be trained on training data (from the StratifiedKFold splitting). Once the model is trained, it can be used to predict the test data's labels (based upon what it has seen before).

The performance of the model can be seen below in the Classification Report, Confusion Matrix, and the model's predictive accuracy result.

(see **methods** section, at the top of the notebook, for the train_model_predict() method code. Note that it can take a few minutes to run due to the vast amount of data used, and the training time required for the model to learn).

In [None]:
# Unpacking the method return values. Last 4 are needed for statistical distance measure methods.
accuracy, X_train, X_test, y_train, pred_y = train_model_predict(knn_model, "K-Nearest Neighbor", X, y, skf)
print("Model accuracy= ", accuracy*100, "%")

## The SafeML statistical distance measures

In [None]:
# Extracting the number of classes and labels from the label feature
class_num = len(df_final['label'].unique())
labels = df_final['label'].unique()
print("Number of classes: ", class_num)
print("Labels: ", labels)

In [None]:
X_train

Test using only the first label (i.e. labels[0]): selecting the training rows whose true label is labels[0].

In [None]:
# x1 = X_test[np.where(np.asarray(y_train).reshape(-1, 1).ravel() == labels[0])]
# X_train_L = X_train.iloc[np.where(y_train[:,1] == 1)]
X_train_L = X_train.loc[y_train == labels[0]]
X_train_L

The same selection for the test set, using the predicted labels (pred_y).

In [None]:
X_test_L = X_test.loc[pred_y == labels[0]]
X_test_L

Still to be ported: `Results(jj, 2, kk) = Cramer_Von_Mises(XTrain_L,XTest_L);`



In [None]:
# cvm_distance = Cramer_Von_Mises
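
The local `CVM_Distance` module is not shown in this notebook, so the following is only a hedged sketch of what a Cramér-von Mises style distance between two samples could look like. It uses the standard two-sample form based on empirical CDFs, applied to a single feature; the function name `cvm_distance_sketch` and the single-feature usage are assumptions, and the module's actual `CVM_Dist` may differ.

```python
import numpy as np

def cvm_distance_sketch(sample_a, sample_b):
    """Two-sample Cramer-von Mises style statistic for 1D samples (illustrative)."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    pooled = np.concatenate([a, b])
    # Empirical CDF of each sample, evaluated at every pooled point.
    ecdf_a = np.searchsorted(a, pooled, side="right") / len(a)
    ecdf_b = np.searchsorted(b, pooled, side="right") / len(b)
    n, m = len(a), len(b)
    # Sum of squared ECDF differences, weighted by the sample sizes.
    return (n * m) / (n + m) ** 2 * np.sum((ecdf_a - ecdf_b) ** 2)

# Made-up samples: identical distributions give 0, shifted ones give a positive value.
same = cvm_distance_sketch([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
shifted = cvm_distance_sketch([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
print(same, shifted)
```

Under these assumptions, one column of `X_train_L` and the matching column of `X_test_L` could be passed in as the two samples.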