# Performance Analysis

We decided to use Hamming distance to analyse our models. The Hamming distance between two binary arrays $x$ and $y$ of length $n$ is the number of positions at which the corresponding symbols differ:

$$H(x,y) = \sum\limits_{i=1}^n \mathbb{1}(x_i \neq y_i) $$

We chose this metric because it can be applied uniformly across the models and gives direct insight into each model's accuracy: dividing the Hamming distance by the number of test points gives the error rate, and the accuracy is one minus that. Some of the models predicted the attack type, whereas others predicted normal vs non-normal behaviour, so we needed a standardised test that works for both types of model.
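As a minimal sketch of the definition above, the snippet below counts the mismatches between two made-up binary arrays (the values are illustrative and not taken from our data). Note that `scipy.spatial.distance.hamming` returns the proportion of mismatching positions rather than the count, so it has to be multiplied by the array length to recover $H(x,y)$.

```python
import numpy as np
from scipy.spatial import distance

# Toy binary arrays (illustrative values only)
x = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y = np.array([1, 1, 1, 0, 0, 0, 1, 1])

# Hamming distance as a count: number of positions where x and y differ
hamming_count = int(np.sum(x != y))

# scipy returns the normalised distance (fraction of mismatches)
hamming_fraction = distance.hamming(x, y)

# Accuracy is the fraction of positions that agree
accuracy = 1 - hamming_fraction

print(hamming_count, hamming_fraction, accuracy)
```

Here the arrays differ in three of eight positions, so the count is 3 and the accuracy is $1 - 3/8$.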

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy as sp
from scipy.spatial import distance

## Matt

In [2]:
pred_matt = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/test_predictions_matt.csv')
test_labels = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/test_labels.csv')

pred_matt = np.array(pred_matt['0'])
test_labels = np.array(test_labels['label'])

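# scipy's hamming() returns the fraction of mismatches, not the count
hamm_dist_matt = distance.hamming(pred_matt,test_labels)
accuracy_matt = 1 - hamm_dist_matt
# round before casting so floating-point error cannot truncate the count
hamm_dist_matt = int(round(hamm_dist_matt * len(test_labels)))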

pd.DataFrame([[hamm_dist_matt, "{:.6%}".format(accuracy_matt)]], columns = ['Hamming Distance', 'Accuracy'], index = ['Matt'])

| | Hamming Distance | Accuracy |
|---|---|---|
| Matt | 3 | 99.993928% |


## Alex - KNN (K Nearest Neighbours)

In [3]:
''' R code to get the Hamming distance for Alex's model's result'''
## library(class)
## library(caret)

# pr1 <- knn(KTT_train,KTT_test,cl=KTT_target_category,k=1, use.all=FALSE)
## [...]
## [...]
# pr37 <- knn(KTT_train,KTT_test,cl=KTT_target_category,k=1, use.all=FALSE)

# h<- vector(length=37)
# h[1]<-hamming.distance(as.vector(pr1), kt$Behaviour[1:1333])
## [...]
## [...]
# h[37]<-hamming.distance(as.vector(pr37), kt$Behaviour[46790:48122])

# m <- sum(h) ## The total Hamming Distance

# h <- -h
# h<- h+1333
# h<- h/1333
# summary(h) # mean = 98.2% accuracy. ## The accuracy

pred_alex_df = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/KNN-Performance.csv')

hamm_dist_alex = pred_alex_df.iat[0,0]
accuracy_alex = pred_alex_df.iat[0,1]

pred_alex_df

| | Hamming Distance | Accuracy |
|---|---|---|
| 0 | 886 | 98.20% |


## Luke

In [4]:
pred_luke = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/Y_pred_Luke.csv')
test_labels = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/Y_test_Luke.csv')

pred_luke = np.array(pred_luke['0'])
test_labels = np.array(test_labels['label'])

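# scipy's hamming() returns the fraction of mismatches, not the count
hamm_dist_luke = distance.hamming(pred_luke,test_labels)
accuracy_luke = 1 - hamm_dist_luke
# round before casting so floating-point error cannot truncate the count
hamm_dist_luke = int(round(hamm_dist_luke * len(test_labels)))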

pd.DataFrame([[hamm_dist_luke, "{:.6%}".format(accuracy_luke)]], columns = ['Hamming Distance', 'Accuracy'], index = ['Luke'])

| | Hamming Distance | Accuracy |
|---|---|---|
| Luke | 1391 | 97.184325% |


## Gabriel

I have two trained models: one using basic logistic regression and the other tuned with `GridSearchCV` and cross-validation. Although the grid-search version took almost 80 times as long to train as the basic model, the increase in accuracy was small: with this many data points, the Hamming distances of the two models differ by only a few misclassifications.
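The sketch below illustrates why the grid version is so much slower: a grid search fits the model once per candidate setting per cross-validation fold, and then once more on the full training set. It uses synthetic data from `make_classification` and a hypothetical parameter grid, so it is not the code or the settings behind the results in this section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data standing in for the real features (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Basic model: default hyperparameters, fitted once
basic = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Grid model: 3 candidates x 3 folds = 9 fits, plus one final refit
param_grid = {"C": [0.1, 1.0, 10.0]}  # hypothetical grid
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
grid.fit(X_train, y_train)

# Compare the two by counting test-set mismatches (Hamming distance)
basic_errors = int((basic.predict(X_test) != y_test).sum())
grid_errors = int((grid.predict(X_test) != y_test).sum())
print(basic_errors, grid_errors, grid.best_params_)
```

With a larger grid or more folds the number of fits grows multiplicatively, which is consistent with a large training-time gap for a small change in accuracy.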

In [5]:
pred_gabe_grid = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/grid_y_pred_gabe.csv')
pred_gabe_basic = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/reg_y_pred_gabe.csv')
test_labels = pd.read_csv('https://github.com/Galeforse/DST-Assessment-01/raw/main/Data/y_test_gabe.csv')

pred_gabe_grid = np.array(pred_gabe_grid['0'])
pred_gabe_basic = np.array(pred_gabe_basic['0'])
test_labels = np.array(test_labels['0'])

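# scipy's hamming() returns the fraction of mismatches, not the count
hamm_dist_gabe = distance.hamming(pred_gabe_grid,test_labels)
accuracy_gabe = 1 - hamm_dist_gabe
# round before casting so floating-point error cannot truncate the count
hamm_dist_gabe = int(round(hamm_dist_gabe * len(test_labels)))

# same calculation for the basic (non-grid) model
hamm_dist_gabe2 = distance.hamming(pred_gabe_basic,test_labels)
accuracy_gabe2 = 1 - hamm_dist_gabe2
hamm_dist_gabe2 = int(round(hamm_dist_gabe2 * len(test_labels)))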

pd.DataFrame([[hamm_dist_gabe, "{:.6%}".format(accuracy_gabe)],[hamm_dist_gabe2, "{:.6%}".format(accuracy_gabe2)]], columns = ['Hamming Distance', 'Accuracy'], index = ['Gabe Grid','Gabe Basic'])

| | Hamming Distance | Accuracy |
|---|---|---|
| Gabe Grid | 73 | 99.852233% |
| Gabe Basic | 76 | 99.846160% |


## Comparison

In [6]:
pd.DataFrame([[hamm_dist_matt, "{:.6%}".format(accuracy_matt)],[hamm_dist_alex, accuracy_alex],[hamm_dist_luke, "{:.6%}".format(accuracy_luke)],[hamm_dist_gabe, "{:.6%}".format(accuracy_gabe)]], columns = ['Hamming Distance', 'Accuracy'], index = ['Matt', 'Alex', 'Luke', 'Gabriel'])

| | Hamming Distance | Accuracy |
|---|---|---|
| Matt | 3 | 99.993928% |
| Alex | 886 | 98.20% |
| Luke | 1391 | 97.184325% |
| Gabriel | 73 | 99.852233% |


## Conclusion

This report reflects the results we obtained from running our models. Matt ran a random forest model which had an accuracy of $99.99\%$, Alex ran a k-NN model with $k=1$ which had an accuracy of $98.20\%$, Luke ran a Bayesian model which had an accuracy of $97.18\%$, and Gabriel ran a logistic regression model tuned with `GridSearchCV` which obtained an accuracy of $99.85\%$. This gives the following ranking:
1. Matt
2. Gabriel
3. Alex
4. Luke
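As a sketch, the ranking above can be reproduced by sorting the reported results. We sort on accuracy rather than on the raw Hamming distance, since the models were evaluated against different test label files and accuracy is already normalised by the test set size. The numbers are copied from the comparison table; the extra `Accuracy (%)` column is introduced here only for sorting.

```python
import pandas as pd

# Results as reported in the comparison table
results = pd.DataFrame(
    {
        "Hamming Distance": [3, 886, 1391, 73],
        "Accuracy": ["99.993928%", "98.20%", "97.184325%", "99.852233%"],
    },
    index=["Matt", "Alex", "Luke", "Gabriel"],
)

# Convert the percentage strings to floats so they sort numerically
results["Accuracy (%)"] = results["Accuracy"].str.rstrip("%").astype(float)

# Highest accuracy first
ranking = results.sort_values("Accuracy (%)", ascending=False)
print(list(ranking.index))
```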

All of the models give very accurate results, with accuracies above $97\%$. The random forest is the most accurate model for this data. This may be because a random forest can handle all of the features we have, rather than having to work on a reduced feature space as the k-NN model did.

The most difficult process for all of our models was the data processing to allow the models to run on the data. This involved:
- re-labelling the data to 0's and 1's to allow us to model normal vs non-normal data since most of the models are binary
- normalising data 
- creating dummy variables to allow us to run models on categorical features.
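The three steps above can be sketched with pandas as follows. The column names and values are hypothetical stand-ins for our data, and min-max scaling is only one possible choice of normalisation, not necessarily the one each of us used.

```python
import pandas as pd

# Toy records with hypothetical column names and values
df = pd.DataFrame(
    {
        "duration": [0.0, 5.0, 10.0, 2.0],
        "protocol": ["tcp", "udp", "tcp", "icmp"],
        "label": ["normal", "attack_a", "normal", "attack_b"],
    }
)

# 1. Re-label: 0 for normal traffic, 1 for any attack type
df["label"] = (df["label"] != "normal").astype(int)

# 2. Normalise a numeric feature to [0, 1] (min-max scaling)
col = df["duration"]
df["duration"] = (col - col.min()) / (col.max() - col.min())

# 3. Dummy variables: one indicator column per category
df = pd.get_dummies(df, columns=["protocol"])

print(df)
```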

The models were then trained on the training data, allowing them to learn the differences between the attacks and the normal data, and then tested on the 10% test split we created. Some of us conducted further tests on external test data, such as the extra KDD99 test data, and on non-binary data.
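A 10% hold-out split of this kind can be sketched with NumPy as below. The data is randomly generated placeholder data and the seed is arbitrary; this shows the idea rather than the exact splitting code each of us used.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

# Placeholder data: 100 rows, 4 features, binary labels
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Shuffle the row indices, then hold out the last 10% as the test set
indices = rng.permutation(len(X))
n_test = len(X) // 10
test_idx, train_idx = indices[-n_test:], indices[:-n_test]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the rows are ordered, for example by attack type, since otherwise the test set would not be representative.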