# Modelling Intrusion Detection: Analysis of a Feature Selection Mechanism

## Method Description

### Step 1: Data preprocessing:
All features are made numerical using one-hot encoding. The features are then scaled so that features with large values do not weigh too heavily in the results.
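
The notebook cells that follow do not contain an explicit encoding step, so the snippet below is only a minimal sketch of what Step 1 describes. The toy frame and its column names (`protocol`, `duration`) are hypothetical and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame: one categorical column and one numeric column.
toy = pd.DataFrame({
    "protocol": ["tcp", "udp", "tcp", "icmp"],
    "duration": [10.0, 2000.0, 35.0, 400000.0],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["protocol"], dtype=float)

# Standardise every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(encoded)

print(encoded.columns.tolist())
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

After scaling, the large `duration` values no longer dominate the one-hot columns, which is the purpose stated in Step 1.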

### Step 2: Feature Selection:
Redundant and irrelevant data are eliminated by selecting a subset of relevant features that fully represents the given problem.
Univariate feature selection with the ANOVA F-test analyzes each feature individually to determine the strength of the relationship between the feature and the labels. The SelectPercentile method (sklearn.feature_selection) then keeps the features whose scores fall in the highest percentile.
Once this subset is found, Recursive Feature Elimination (RFE) is applied.

### Step 4: Build the model:
A decision tree model is built.

### Step 5: Prediction & Evaluation (validation):
The test data is used to make predictions with the model.
Multiple scores are considered, such as accuracy, recall, F-measure and the confusion matrix.
A 10-fold cross-validation is performed.
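
As a rough illustration of the 10-fold cross-validation named in Step 5, the sketch below uses `cross_val_score` from scikit-learn on synthetic data. The data set (`X_demo`, `y_demo`) and the classifier instance are assumptions made for the example, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a scaled feature matrix and its labels.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

clf_demo = DecisionTreeClassifier(random_state=0)

# cv=10 splits the data into 10 folds; each fold is used once for validation.
scores = cross_val_score(clf_demo, X_demo, y_demo, cv=10, scoring="accuracy")

print("Accuracy: %0.5f (+/- %0.5f)" % (scores.mean(), scores.std() * 2))
```

Other scorers such as `"precision"`, `"recall"` or `"f1"` can be passed through the `scoring` parameter in the same way.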

## Version Check

In [28]:
import pandas as pd
import numpy as np
import sys
import sklearn
print(pd.__version__)
print(np.__version__)
print(sys.version)
print(sklearn.__version__)

1.2.5
1.17.4
3.8.5 (default, May 27 2021, 13:30:53) 
[GCC 9.3.0]
0.24.2


## Creating Datasets

In [29]:
# load every part and concatenate them
files = ["Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv", "Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv",
         "Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv",
         "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv", "Friday-WorkingHours-Morning.pcap_ISCX.csv",
         "Tuesday-WorkingHours.pcap_ISCX.csv", "Wednesday-workingHours.pcap_ISCX.csv"]

#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f) for f in files])

# drop entries with NaN values
combined_csv.dropna(inplace=True)

# drop entries with values that are not below the max float (e.g. inf)
for col in combined_csv.columns:
    if col == " Label":
        print("Ignored column Label")
    else:
        combined_csv = combined_csv.loc[combined_csv[col] < np.finfo(np.float64).max]

# shuffle data
combined_csv = combined_csv.sample(frac=1).reset_index(drop=True)

# slice into train and test data sets (ratio = 75% / 25%)
chunk = int(combined_csv.shape[0] * 75 / 100)
train_set = combined_csv.iloc[:chunk, :]
test_set = combined_csv.iloc[chunk:, :]

# exporting data sets
train_set.to_csv("train_IDS2017.csv", index=False)
test_set.to_csv("test_IDS2017.csv", index=False)

Ignored column Label


## Load the Dataset

In [30]:
# train_IDS2017 & test_IDS2017 are the datafiles
df = pd.read_csv("train_IDS2017.csv")
df_test = pd.read_csv("test_IDS2017.csv")

# shape, this gives the dimensions of the dataset
print('Dimensions of the Training set:',df.shape)
print('Dimensions of the Test set:',df_test.shape)

Dimensions of the Training set: (1723796, 79)
Dimensions of the Test set: (574599, 79)


## Sample view of the training dataset

In [31]:
# first five rows
df.head(5)

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min,Label
0,39344,5322068,1,6,6,36,6,6,6.0,0.0,...,20,14199.0,0.0,14199,14199,5307869.0,0.0,5307869,5307869,BENIGN
1,80,81193164,7,7,372,11595,348,0,53.142857,130.05054,...,20,9298.0,0.0,9298,9298,81000000.0,0.0,81000000,81000000,DoS Hulk
2,80,5212710,5,7,256,6945,256,0,51.2,114.48668,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN
3,80,62812,3,5,351,11595,351,0,117.0,202.649945,...,32,0.0,0.0,0,0,0.0,0.0,0,0,DoS Hulk
4,123,16517,1,1,48,48,48,48,48.0,0.0,...,20,0.0,0.0,0,0,0.0,0.0,0,0,BENIGN


## Statistical Summary

In [32]:
df.describe()

Unnamed: 0,Destination Port,Flow Duration,Total Fwd Packets,Total Backward Packets,Total Length of Fwd Packets,Total Length of Bwd Packets,Fwd Packet Length Max,Fwd Packet Length Min,Fwd Packet Length Mean,Fwd Packet Length Std,...,act_data_pkt_fwd,min_seg_size_forward,Active Mean,Active Std,Active Max,Active Min,Idle Mean,Idle Std,Idle Max,Idle Min
count,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,...,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0,1723796.0
mean,7470.606,15836140.0,9.351395,10.4281,546.9494,16508.05,211.7098,18.33571,59.99579,71.66455,...,5.092346,-2776.945,84675.09,40632.66,154864.6,61699.28,9461003.0,573608.9,9892166.0,9017300.0
std,17428.27,34644140.0,732.1452,975.3259,6015.465,2232356.0,765.3291,64.42765,201.8646,303.87,...,588.44,1226724.0,662965.9,394324.3,1025116.0,593654.3,25362250.0,4998360.0,26143490.0,25096030.0
min,0.0,-13.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,-536870700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,53.0,149.0,1.0,1.0,6.0,2.0,6.0,0.0,6.0,0.0,...,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,80.0,31374.0,2.0,2.0,60.0,120.0,36.0,2.0,33.0,0.0,...,1.0,24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,443.0,4679360.0,5.0,4.0,188.0,568.0,82.0,35.0,49.42857,23.2766,...,2.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,65533.0,120000000.0,207964.0,284602.0,2321478.0,627000000.0,24820.0,2325.0,5940.857,7049.469,...,198636.0,138.0,110000000.0,74200000.0,110000000.0,110000000.0,120000000.0,76900000.0,120000000.0,120000000.0


## Label Distribution of Training and Test set

In [33]:
print('Label distribution Training set:')
print(df[' Label'].value_counts())
print()
print('Label distribution Test set:')
print(df_test[' Label'].value_counts())

Label distribution Training set:
BENIGN                        1306151
DoS Hulk                       172736
PortScan                       119174
DDoS                            96060
DoS GoldenEye                    7760
FTP-Patator                      5967
SSH-Patator                      4382
DoS slowloris                    4378
DoS Slowhttptest                 4064
Bot                              1460
Web Attack � Brute Force         1115
Web Attack � XSS                  501
Infiltration                       25
Web Attack � Sql Injection         14
Heartbleed                          9
Name:  Label, dtype: int64

Label distribution Test set:
BENIGN                        435688
DoS Hulk                       57388
PortScan                       39630
DDoS                           31965
DoS GoldenEye                   2533
FTP-Patator                     1968
SSH-Patator                     1515
DoS Slowhttptest                1435
DoS slowloris                   1418
Bot    

## Removing 700k BENIGN from training set

In [35]:
# shuffle entries
df = df.sample(frac=1).reset_index(drop=True)

# remove the first 700k BENIGN entries (in the shuffled order)
benign_idx = df.index[df[' Label'] == "BENIGN"][:700000]
df.drop(benign_idx, inplace=True)

# shuffle entries
df = df.sample(frac=1).reset_index(drop=True)

In [36]:
print('Label distribution Training set:')
print(df[' Label'].value_counts())

Label distribution Training set:
BENIGN                        606151
DoS Hulk                      172736
PortScan                      119174
DDoS                           96060
DoS GoldenEye                   7760
FTP-Patator                     5967
SSH-Patator                     4382
DoS slowloris                   4378
DoS Slowhttptest                4064
Bot                             1460
Web Attack � Brute Force        1115
Web Attack � XSS                 501
Infiltration                      25
Web Attack � Sql Injection        14
Heartbleed                         9
Name:  Label, dtype: int64


# Step 1: Data preprocessing:

# Split Dataset into 4 datasets for every attack category
## Rename every attack label: 0=normal, 1=DoS, 2=Probe, 3=Web and 4=Infil.
## Replace labels column with new labels column
## Make new datasets


In [37]:
# take label column
labeldf=df[' Label']
labeldf_test=df_test[' Label']
# change the label column
newlabeldf=labeldf.replace({ 'BENIGN' : 0,
                            'DDoS' : 1, "DoS Hulk" : 1, "DoS GoldenEye" : 1, "DoS slowloris" : 1,
                            "DoS Slowhttptest" : 1, "Bot" : 1,
                            'PortScan' : 2, 'Web Attack � Brute Force': 3,
                            'Web Attack � XSS': 3, 'Web Attack � Sql Injection': 3,
                            'Infiltration': 4, "FTP-Patator" : 4, "SSH-Patator" : 4, "Heartbleed" : 4})
newlabeldf_test=labeldf_test.replace({ 'BENIGN' : 0,
                            'DDoS' : 1, "DoS Hulk" : 1, "DoS GoldenEye" : 1, "DoS slowloris" : 1,
                            "DoS Slowhttptest" : 1, "Bot" : 1,
                            'PortScan' : 2, 'Web Attack � Brute Force': 3,
                            'Web Attack � XSS': 3, 'Web Attack � Sql Injection': 3,
                            'Infiltration': 4, "FTP-Patator" : 4, "SSH-Patator" : 4, "Heartbleed" : 4})
# put the new label column back
df[' Label'] = newlabeldf
df_test[' Label'] = newlabeldf_test

# Print the labels to check that every old label was assigned a category
print('Label distribution Training set:')
print(df[' Label'].value_counts())
print()
print('Label distribution Test set:')
print(df_test[' Label'].value_counts())

Label distribution Training set:
0    606151
1    286458
2    119174
4     10383
3      1630
Name:  Label, dtype: int64

Label distribution Test set:
0    435688
1     95235
2     39630
4      3496
3       550
Name:  Label, dtype: int64


In [38]:
to_drop_DoS = [2,3,4]
to_drop_Probe = [1,3,4]
to_drop_Web = [1,2,4]
to_drop_Infil = [1,2,3]
DoS_df=df[~df[' Label'].isin(to_drop_DoS)];
Probe_df=df[~df[' Label'].isin(to_drop_Probe)];
Web_df=df[~df[' Label'].isin(to_drop_Web)];
Infil_df=df[~df[' Label'].isin(to_drop_Infil)];

#test
DoS_df_test=df_test[~df_test[' Label'].isin(to_drop_DoS)];
Probe_df_test=df_test[~df_test[' Label'].isin(to_drop_Probe)];
Web_df_test=df_test[~df_test[' Label'].isin(to_drop_Web)];
Infil_df_test=df_test[~df_test[' Label'].isin(to_drop_Infil)];
print('Train:')
print('Dimensions of DoS:' ,DoS_df.shape)
print('Dimensions of Probe:' ,Probe_df.shape)
print('Dimensions of Web:' ,Web_df.shape)
print('Dimensions of Infil:' ,Infil_df.shape)
print('Test:')
print('Dimensions of DoS:' ,DoS_df_test.shape)
print('Dimensions of Probe:' ,Probe_df_test.shape)
print('Dimensions of Web:' ,Web_df_test.shape)
print('Dimensions of Infil:' ,Infil_df_test.shape)

Train:
Dimensions of DoS: (892609, 79)
Dimensions of Probe: (725325, 79)
Dimensions of Web: (607781, 79)
Dimensions of Infil: (616534, 79)
Test:
Dimensions of DoS: (530923, 79)
Dimensions of Probe: (475318, 79)
Dimensions of Web: (436238, 79)
Dimensions of Infil: (439184, 79)


# Step 2: Feature Scaling:

In [39]:
# Split dataframes into X & Y
# assign X as a dataframe of features and Y as a series of outcome variables
X_DoS = DoS_df.drop(' Label', axis=1)
Y_DoS = DoS_df[' Label']
X_Probe = Probe_df.drop(' Label', axis=1)
Y_Probe = Probe_df[' Label']
X_Web = Web_df.drop(' Label', axis=1)
Y_Web = Web_df[' Label']
X_Infil = Infil_df.drop(' Label', axis=1)
Y_Infil = Infil_df[' Label']
# test set
X_DoS_test = DoS_df_test.drop(' Label', axis=1)
Y_DoS_test = DoS_df_test[' Label']
X_Probe_test = Probe_df_test.drop(' Label', axis=1)
Y_Probe_test = Probe_df_test[' Label']
X_Web_test = Web_df_test.drop(' Label', axis=1)
Y_Web_test = Web_df_test[' Label']
X_Infil_test = Infil_df_test.drop(' Label', axis=1)
Y_Infil_test = Infil_df_test[' Label']

### Save a list of feature names for later use (it is the same for every attack category). The column names are lost once the dataframes are scaled into NumPy arrays.

In [40]:
colNames=list(X_DoS)
colNames_test=list(X_DoS_test)

## Use StandardScaler() to scale the dataframes

In [41]:
from sklearn import preprocessing
scaler1 = preprocessing.StandardScaler().fit(X_DoS)
X_DoS=scaler1.transform(X_DoS) 
scaler2 = preprocessing.StandardScaler().fit(X_Probe)
X_Probe=scaler2.transform(X_Probe) 
scaler3 = preprocessing.StandardScaler().fit(X_Web)
X_Web=scaler3.transform(X_Web) 
scaler4 = preprocessing.StandardScaler().fit(X_Infil)
X_Infil=scaler4.transform(X_Infil) 
# test data (each scaler below is fit on the test subset itself)
scaler5 = preprocessing.StandardScaler().fit(X_DoS_test)
X_DoS_test=scaler5.transform(X_DoS_test) 
scaler6 = preprocessing.StandardScaler().fit(X_Probe_test)
X_Probe_test=scaler6.transform(X_Probe_test) 
scaler7 = preprocessing.StandardScaler().fit(X_Web_test)
X_Web_test=scaler7.transform(X_Web_test) 
scaler8 = preprocessing.StandardScaler().fit(X_Infil_test)
X_Infil_test=scaler8.transform(X_Infil_test) 
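
The cell above fits `scaler5` to `scaler8` on the test subsets themselves, so training and test data are not guaranteed to share one scale. A common alternative, sketched here on synthetic data, is to fit the scaler on the training data only and reuse it for the test data. All names ending in `_demo` are hypothetical, and the shift between the two sets is an assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: the test features are shifted relative to the training features.
X_train_demo = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
X_test_demo = rng.normal(loc=2.0, scale=1.0, size=(300, 3))

# Fit on the training data only, then apply the same transform to both sets.
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_shared = scaler_demo.transform(X_test_demo)

# Refitting on the test data re-centres it and hides the shift.
X_test_refit = StandardScaler().fit_transform(X_test_demo)

print(X_test_shared.mean(axis=0).round(2))  # the shift of roughly 2 is preserved
print(X_test_refit.mean(axis=0).round(2))   # the shift is removed
```

With a shared scaler, a value means the same thing in both sets, which matters for a decision tree whose split thresholds were learned on the scaled training data.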

### Check that the Standard Deviation is 1

In [42]:
print(X_DoS.std(axis=0))

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1.]
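
The zeros in this output correspond to columns whose value never changes in the DoS training subset: `StandardScaler` leaves a zero-variance column at a standard deviation of 0 rather than 1. Such columns carry no information for the classifier. One possible way to drop them, which the notebook itself does not do, is `VarianceThreshold`; the small array below is made up for the illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Hypothetical matrix whose middle column is constant.
X_const_demo = np.array([
    [1.0, 0.0, 3.0],
    [2.0, 0.0, 1.0],
    [3.0, 0.0, 2.0],
])

# threshold=0.0 removes only the features with zero variance.
vt = VarianceThreshold(threshold=0.0)
X_reduced = vt.fit_transform(X_const_demo)

print(vt.get_support())   # boolean mask of the kept columns
print(X_reduced.shape)
```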


In [43]:
X_Probe.std(axis=0);
X_Web.std(axis=0);
X_Infil.std(axis=0);

# Step 3: Feature Selection:

# 1. Univariate Feature Selection using ANOVA F-test

In [44]:
# Univariate feature selection with the ANOVA F-test, using the SelectPercentile method, then RFE
# Scikit-learn exposes feature selection routines as objects that implement the transform method
# SelectPercentile: removes all but a user-specified highest-scoring percentage of features
# f_classif: ANOVA F-value between label and feature for classification tasks
from sklearn.feature_selection import SelectPercentile, f_classif
np.seterr(divide='ignore', invalid='ignore');
selector=SelectPercentile(f_classif, percentile=10)
X_newDoS = selector.fit_transform(X_DoS,Y_DoS)
X_newDoS.shape



(892609, 8)
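
The shape above shows that `percentile=10` keeps 8 of the 78 features. A small sketch on synthetic data reproduces this count: with 78 distinct scores, the features scoring strictly above the 90th percentile of the scores are the top 8. The data set and the names `X_demo`, `y_demo` and `selector_demo` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in with the same number of features as the notebook (78).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=78, n_informative=10, random_state=0
)

selector_demo = SelectPercentile(f_classif, percentile=10)
X_top = selector_demo.fit_transform(X_demo, y_demo)

# Indices of the kept features, ordered from highest to lowest F-score.
kept = np.flatnonzero(selector_demo.get_support())
kept_by_score = kept[np.argsort(selector_demo.scores_[kept])[::-1]]

print(X_top.shape)
print(kept_by_score)
```

The fitted selector also exposes `scores_` and `pvalues_`, which can be inspected to see how far apart the selected and rejected features are.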

### Get the features that were selected: DoS

In [45]:
true=selector.get_support()
newcolindex_DoS=[i for i, x in enumerate(true) if x]
newcolname_DoS=list( colNames[i] for i in newcolindex_DoS )
newcolname_DoS

['Bwd Packet Length Max',
 ' Bwd Packet Length Mean',
 ' Bwd Packet Length Std',
 ' Max Packet Length',
 ' Packet Length Mean',
 ' Packet Length Std',
 ' Average Packet Size',
 ' Avg Bwd Segment Size']

In [46]:
X_newProbe = selector.fit_transform(X_Probe,Y_Probe)
X_newProbe.shape



(725325, 8)

### Get the features that were selected: Probe

In [47]:
true=selector.get_support()
newcolindex_Probe=[i for i, x in enumerate(true) if x]
newcolname_Probe=list( colNames[i] for i in newcolindex_Probe )
newcolname_Probe

[' Bwd Packet Length Min',
 ' Bwd Packet Length Mean',
 ' Min Packet Length',
 ' Packet Length Mean',
 ' PSH Flag Count',
 ' ACK Flag Count',
 ' Average Packet Size',
 'Init_Win_bytes_forward']

In [48]:
X_newWeb = selector.fit_transform(X_Web,Y_Web)
X_newWeb.shape



(607781, 8)

### Get the features that were selected: Web

In [49]:
true=selector.get_support()
newcolindex_Web=[i for i, x in enumerate(true) if x]
newcolname_Web=list( colNames[i] for i in newcolindex_Web)
newcolname_Web

[' Destination Port',
 ' Bwd Packet Length Min',
 ' Min Packet Length',
 ' PSH Flag Count',
 ' Down/Up Ratio',
 ' Average Packet Size',
 'Init_Win_bytes_forward',
 ' Init_Win_bytes_backward']

In [50]:
X_newInfil = selector.fit_transform(X_Infil,Y_Infil)
X_newInfil.shape



(616534, 8)

### Get the features that were selected: Infil

In [51]:
true=selector.get_support()
newcolindex_Infil=[i for i, x in enumerate(true) if x]
newcolname_Infil=list( colNames[i] for i in newcolindex_Infil)
newcolname_Infil

[' Bwd Packet Length Min',
 'Fwd PSH Flags',
 ' Min Packet Length',
 ' SYN Flag Count',
 ' PSH Flag Count',
 ' ACK Flag Count',
 ' Average Packet Size',
 'Init_Win_bytes_forward']

# Summary of features selected by Univariate Feature Selection

In [52]:
print('Features selected for DoS:',newcolname_DoS)
print()
print('Features selected for Probe:',newcolname_Probe)
print()
print('Features selected for Web:',newcolname_Web)
print()
print('Features selected for Infil:',newcolname_Infil)

Features selected for DoS: ['Bwd Packet Length Max', ' Bwd Packet Length Mean', ' Bwd Packet Length Std', ' Max Packet Length', ' Packet Length Mean', ' Packet Length Std', ' Average Packet Size', ' Avg Bwd Segment Size']

Features selected for Probe: [' Bwd Packet Length Min', ' Bwd Packet Length Mean', ' Min Packet Length', ' Packet Length Mean', ' PSH Flag Count', ' ACK Flag Count', ' Average Packet Size', 'Init_Win_bytes_forward']

Features selected for Web: [' Destination Port', ' Bwd Packet Length Min', ' Min Packet Length', ' PSH Flag Count', ' Down/Up Ratio', ' Average Packet Size', 'Init_Win_bytes_forward', ' Init_Win_bytes_backward']

Features selected for Infil: [' Bwd Packet Length Min', 'Fwd PSH Flags', ' Min Packet Length', ' SYN Flag Count', ' PSH Flag Count', ' ACK Flag Count', ' Average Packet Size', 'Init_Win_bytes_forward']


## The authors state that "After obtaining the adequate number of features during the univariate selection process, a recursive feature elimination (RFE) was operated with the number of features passed as parameter to identify the features selected". This can be read in two ways. Under the first reading, RFE is applied only to the features already chosen by univariate selection, in order to obtain their ranking. This use of RFE is redundant for selection, because the selected features can be obtained in another way (as done in this project), and one cannot then say that the features were selected by RFE. Under the second reading, only the number of features (13) is carried over from univariate feature selection, and RFE is then run to find the best 13 features. In that case one can say that RFE was used for feature selection. However, the authors obtained a different number of features for every attack category: 12 for DoS, 15 for Probe, 13 for Web and 11 for Infil. It is therefore not clear which mechanism was used for feature selection.

## To proceed with the data mining, the second option is considered, as it uses RFE for selection. From now on, the number of features for every attack category is 13.

# 2. Recursive Feature Elimination for feature ranking (Option 1: rank the previously selected features)

In [53]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
# Create a decision tree classifier. By convention, clf means 'classifier'
clf = DecisionTreeClassifier(random_state=0)

#rank all features, i.e continue the elimination until the last one
rfe = RFE(clf, n_features_to_select=1)
rfe.fit(X_newDoS, Y_DoS)
print ("DoS Features sorted by their rank:")
print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_DoS)))

DoS Features sorted by their rank:
[(1, ' Bwd Packet Length Std'), (2, ' Average Packet Size'), (3, ' Avg Bwd Segment Size'), (4, ' Max Packet Length'), (5, ' Packet Length Std'), (6, ' Bwd Packet Length Mean'), (7, ' Packet Length Mean'), (8, 'Bwd Packet Length Max')]


In [54]:
rfe.fit(X_newProbe, Y_Probe)
print ("Probe Features sorted by their rank:")
print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_Probe)))

Probe Features sorted by their rank:
[(1, ' Bwd Packet Length Mean'), (2, ' Packet Length Mean'), (3, ' PSH Flag Count'), (4, ' Average Packet Size'), (5, ' Min Packet Length'), (6, 'Init_Win_bytes_forward'), (7, ' ACK Flag Count'), (8, ' Bwd Packet Length Min')]


In [55]:
rfe.fit(X_newWeb, Y_Web)
 
print ("Web Features sorted by their rank:")
print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_Web)))

Web Features sorted by their rank:
[(1, ' Init_Win_bytes_backward'), (2, ' Average Packet Size'), (3, ' Destination Port'), (4, 'Init_Win_bytes_forward'), (5, ' Down/Up Ratio'), (6, ' Min Packet Length'), (7, ' PSH Flag Count'), (8, ' Bwd Packet Length Min')]


In [56]:
rfe.fit(X_newInfil, Y_Infil)
 
print ("Infil Features sorted by their rank:")
print (sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), newcolname_Infil)))

Infil Features sorted by their rank:
[(1, ' Average Packet Size'), (2, 'Init_Win_bytes_forward'), (3, ' ACK Flag Count'), (4, ' Min Packet Length'), (5, ' Bwd Packet Length Min'), (6, ' PSH Flag Count'), (7, 'Fwd PSH Flags'), (8, ' SYN Flag Count')]


# 2. Recursive Feature Elimination, select 13 of the 78 features for each category (Option 2: let RFE pick the 13 best of all 78 features)

In [57]:
from sklearn.feature_selection import RFE
clf = DecisionTreeClassifier(random_state=0)
rfe = RFE(estimator=clf, n_features_to_select=13, step=1)
rfe.fit(X_DoS, Y_DoS)
X_rfeDoS=rfe.transform(X_DoS)
true=rfe.support_
rfecolindex_DoS=[i for i, x in enumerate(true) if x]
rfecolname_DoS=list(colNames[i] for i in rfecolindex_DoS)

In [58]:
rfe.fit(X_Probe, Y_Probe)
X_rfeProbe=rfe.transform(X_Probe)
true=rfe.support_
rfecolindex_Probe=[i for i, x in enumerate(true) if x]
rfecolname_Probe=list(colNames[i] for i in rfecolindex_Probe)

In [59]:
rfe.fit(X_Web, Y_Web)
X_rfeWeb=rfe.transform(X_Web)
true=rfe.support_
rfecolindex_Web=[i for i, x in enumerate(true) if x]
rfecolname_Web=list(colNames[i] for i in rfecolindex_Web)

In [60]:
rfe.fit(X_Infil, Y_Infil)
X_rfeInfil=rfe.transform(X_Infil)
true=rfe.support_
rfecolindex_Infil=[i for i, x in enumerate(true) if x]
rfecolname_Infil=list(colNames[i] for i in rfecolindex_Infil)

# Summary of features selected by RFE

In [61]:
print('Features selected for DoS:',rfecolname_DoS)
print()
print('Features selected for Probe:',rfecolname_Probe)
print()
print('Features selected for Web:',rfecolname_Web)
print()
print('Features selected for Infil:',rfecolname_Infil)

Features selected for DoS: [' Destination Port', ' Total Length of Bwd Packets', ' Bwd Packet Length Std', ' Flow IAT Min', ' Fwd IAT Max', ' Fwd Header Length', ' Packet Length Mean', 'FIN Flag Count', ' Subflow Bwd Packets', 'Init_Win_bytes_forward', ' Init_Win_bytes_backward', ' Active Std', ' Idle Max']

Features selected for Probe: [' Destination Port', ' Total Fwd Packets', 'Total Length of Fwd Packets', 'Flow Bytes/s', ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Max', ' Flow IAT Min', ' Bwd Packets/s', ' PSH Flag Count', ' Fwd Header Length.1', 'Init_Win_bytes_forward', ' Idle Min']

Features selected for Web: [' Destination Port', ' Flow Duration', 'Total Length of Fwd Packets', ' Fwd Packet Length Max', ' Fwd IAT Min', ' Bwd IAT Max', ' Bwd IAT Min', ' Bwd Packets/s', ' Avg Fwd Segment Size', ' Fwd Header Length.1', 'Init_Win_bytes_forward', ' Init_Win_bytes_backward', ' min_seg_size_forward']

Features selected for Infil: [' Destination Port', 'Total Length of Fwd Packets

In [62]:
print(X_rfeDoS.shape)
print(X_rfeProbe.shape)
print(X_rfeWeb.shape)
print(X_rfeInfil.shape)

(892609, 13)
(725325, 13)
(607781, 13)
(616534, 13)


# Step 4: Build the model:
### A classifier is trained on all features and on the reduced feature set, for later comparison.
#### Each classifier model is stored in its own clf_* variable.

In [63]:
# all features
clf_DoS=DecisionTreeClassifier(random_state=0)
clf_Probe=DecisionTreeClassifier(random_state=0)
clf_Web=DecisionTreeClassifier(random_state=0)
clf_Infil=DecisionTreeClassifier(random_state=0)
clf_DoS.fit(X_DoS, Y_DoS)
clf_Probe.fit(X_Probe, Y_Probe)
clf_Web.fit(X_Web, Y_Web)
clf_Infil.fit(X_Infil, Y_Infil)

DecisionTreeClassifier(random_state=0)

In [64]:
# selected features
clf_rfeDoS=DecisionTreeClassifier(random_state=0)
clf_rfeProbe=DecisionTreeClassifier(random_state=0)
clf_rfeWeb=DecisionTreeClassifier(random_state=0)
clf_rfeInfil=DecisionTreeClassifier(random_state=0)
clf_rfeDoS.fit(X_rfeDoS, Y_DoS)
clf_rfeProbe.fit(X_rfeProbe, Y_Probe)
clf_rfeWeb.fit(X_rfeWeb, Y_Web)
clf_rfeInfil.fit(X_rfeInfil, Y_Infil)

DecisionTreeClassifier(random_state=0)

# Step 5: Prediction & Evaluation (validation):

# Using all Features for each category

# Confusion Matrices
## DoS

In [65]:
# Apply the classifier we trained to the test data (which it has never seen before)
clf_DoS.predict(X_DoS_test)

array([0, 0, 0, ..., 1, 0, 1])

In [66]:
# View the predicted probabilities of the first 10 observations
clf_DoS.predict_proba(X_DoS_test)[0:10]

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [0., 1.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.],
       [1., 0.]])

In [67]:
Y_DoS_pred=clf_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,1
0,405079,30609
1,65910,29325


## Probe

In [68]:
Y_Probe_pred=clf_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,2
0,433505,2183
2,39620,10


## Web

In [69]:
Y_Web_pred=clf_Web.predict(X_Web_test)
# Create confusion matrix
pd.crosstab(Y_Web_test, Y_Web_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,3
0,435681,7
3,550,0


## Infil

In [70]:
Y_Infil_pred=clf_Infil.predict(X_Infil_test)
# Create confusion matrix
pd.crosstab(Y_Infil_test, Y_Infil_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,4
0,435577,111
4,3490,6


# Accuracy, Precision, Recall, F-measure

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

accuracy = (TP + TN) / (TP + TN + FP + FN)

precision = TP / (FP + TP)

recall = TP / (TP + FN)

f = (2 * precision * recall) / (precision + recall)
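
As a worked example of these formulas, the sketch below plugs in the counts from the DoS confusion matrix shown earlier (all features), treating label 1 as the positive class. The variable names are chosen for the illustration only.

```python
# Counts taken from the DoS confusion matrix (all features).
# Rows are actual labels, columns are predicted labels.
tn, fp = 405079, 30609   # actual 0: predicted 0, predicted 1
fn, tp = 65910, 29325    # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (fp + tp)
recall = tp / (tp + fn)
f_measure = (2 * precision * recall) / (precision + recall)

print("Accuracy: %0.5f" % accuracy)
print("Precision: %0.5f" % precision)
print("Recall: %0.5f" % recall)
print("F-measure: %0.5f" % f_measure)
```

These hand-computed values can be compared with the scores that `sklearn.metrics` reports for DoS in the next cell.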

## DoS

In [71]:
from sklearn import metrics
accuracy = metrics.accuracy_score(Y_DoS_test,Y_DoS_pred)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_DoS_test,Y_DoS_pred)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_DoS_test,Y_DoS_pred)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_DoS_test,Y_DoS_pred)
print("F-measure: %0.5f" % f)

Accuracy: 0.81821
Precision: 0.48929
Recall: 0.30792
F-measure: 0.37797


## Probe

In [72]:
accuracy = metrics.accuracy_score(Y_Probe_test/2,Y_Probe_pred/2)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_Probe_test/2,Y_Probe_pred/2)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_Probe_test/2,Y_Probe_pred/2)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_Probe_test/2,Y_Probe_pred/2)
print("F-measure: %0.5f" % f)

Accuracy: 0.91205
Precision: 0.00456
Recall: 0.00025
F-measure: 0.00048
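
The cell above divides the labels by 2 so that the Probe class becomes 1, the default positive label of the scikit-learn metrics. An equivalent option, sketched here on made-up label arrays, is to leave the labels untouched and pass `pos_label`. The arrays `y_true_demo` and `y_pred_demo` are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels using 0 for benign and 2 for Probe.
y_true_demo = np.array([0, 0, 2, 2, 2, 0])
y_pred_demo = np.array([0, 2, 2, 0, 2, 0])

# Approach used in the notebook: rescale the labels {0, 2} to {0, 1}.
p_scaled = metrics.precision_score(y_true_demo / 2, y_pred_demo / 2)
r_scaled = metrics.recall_score(y_true_demo / 2, y_pred_demo / 2)

# Alternative: keep the labels and name the positive class explicitly.
p_label = metrics.precision_score(y_true_demo, y_pred_demo, pos_label=2)
r_label = metrics.recall_score(y_true_demo, y_pred_demo, pos_label=2)

print(p_scaled, p_label)
print(r_scaled, r_label)
```

The same parameter works for the Web (`pos_label=3`) and Infil (`pos_label=4`) categories and for `f1_score`.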


## Web

In [73]:
accuracy = metrics.accuracy_score(Y_Web_test/3,Y_Web_pred/3)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_Web_test/3,Y_Web_pred/3)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_Web_test/3,Y_Web_pred/3)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_Web_test/3,Y_Web_pred/3)
print("F-measure: %0.5f" % f)

Accuracy: 0.99872
Precision: 0.00000
Recall: 0.00000
F-measure: 0.00000
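
An accuracy of 0.99872 next to a recall of 0 reflects the class imbalance of the Web test subset rather than a useful detector. The sketch below makes this concrete with the counts from the Web confusion matrix (all features), compared with a hypothetical baseline that labels every flow as benign.

```python
# Counts taken from the Web confusion matrix (all features).
tn, fp = 435681, 7   # actual 0: predicted 0, predicted 3
fn, tp = 550, 0      # actual 3: predicted 0, predicted 3

total = tn + fp + fn + tp

# Accuracy of the trained decision tree.
model_accuracy = (tp + tn) / total

# Hypothetical baseline: predict 0 for every row, so every benign row is correct.
baseline_accuracy = (tn + fp) / total

# Recall on the attack class for the trained model.
model_recall = tp / (tp + fn)

print("Model accuracy:    %0.5f" % model_accuracy)
print("Baseline accuracy: %0.5f" % baseline_accuracy)
print("Model recall:      %0.5f" % model_recall)
```

Because the always-benign baseline is at least as accurate as the model here, precision, recall and F-measure are the more informative scores for this category.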


## Infil

In [74]:
accuracy = metrics.accuracy_score(Y_Infil_test/4,Y_Infil_pred/4)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_Infil_test/4,Y_Infil_pred/4)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_Infil_test/4,Y_Infil_pred/4)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_Infil_test/4,Y_Infil_pred/4)
print("F-measure: %0.5f" % f)

Accuracy: 0.99180
Precision: 0.05128
Recall: 0.00172
F-measure: 0.00332


# Using 13 Features for each category

# Confusion Matrices
## DoS

In [75]:
# reduce test dataset to 13 features, use only features described in rfecolname_DoS etc.
X_DoS_test2=X_DoS_test[:,rfecolindex_DoS]
X_Probe_test2=X_Probe_test[:,rfecolindex_Probe]
X_Web_test2=X_Web_test[:,rfecolindex_Web]
X_Infil_test2=X_Infil_test[:,rfecolindex_Infil]
X_Infil_test2.shape

(439184, 13)

In [76]:
Y_DoS_pred2=clf_rfeDoS.predict(X_DoS_test2)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred2, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,1
0,405894,29794
1,66640,28595


## Probe

In [77]:
Y_Probe_pred2=clf_rfeProbe.predict(X_Probe_test2)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred2, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,2
0,331255,104433
2,39549,81


## Web

In [78]:
Y_Web_pred2=clf_rfeWeb.predict(X_Web_test2)
# Create confusion matrix
pd.crosstab(Y_Web_test, Y_Web_pred2, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,3
0,435681,7
3,550,0


## Infil

In [79]:
Y_Infil_pred2=clf_rfeInfil.predict(X_Infil_test2)
# Create confusion matrix
pd.crosstab(Y_Infil_test, Y_Infil_pred2, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Actual attacks \ Predicted attacks,0,4
0,435567,121
4,3451,45


# Accuracy, Precision, Recall, F-measure

TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative

accuracy = (TP + TN) / (TP + TN + FP + FN)

precision = TP / (FP + TP)

recall = TP / (TP + FN)

f = (2 * precision * recall) / (precision + recall)

## DoS

In [80]:
from sklearn import metrics
accuracy = metrics.accuracy_score(Y_DoS_test,Y_DoS_pred2)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_DoS_test,Y_DoS_pred2)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_DoS_test,Y_DoS_pred2)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_DoS_test,Y_DoS_pred2)
print("F-measure: %0.5f" % f)

Accuracy: 0.81837
Precision: 0.48973
Recall: 0.30026
F-measure: 0.37227


## Probe

In [81]:
accuracy = metrics.accuracy_score(Y_Probe_test/2,Y_Probe_pred2/2)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_Probe_test/2,Y_Probe_pred2/2)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_Probe_test/2,Y_Probe_pred2/2)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_Probe_test/2,Y_Probe_pred2/2)
print("F-measure: %0.5f" % f)

Accuracy: 0.69708
Precision: 0.00078
Recall: 0.00204
F-measure: 0.00112


## Web

In [82]:
accuracy = metrics.accuracy_score(Y_Web_test/3,Y_Web_pred2/3)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_Web_test/3,Y_Web_pred2/3)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_Web_test/3,Y_Web_pred2/3)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_Web_test/3,Y_Web_pred2/3)
print("F-measure: %0.5f" % f)

Accuracy: 0.99872
Precision: 0.00000
Recall: 0.00000
F-measure: 0.00000


## Infil

In [83]:
accuracy = metrics.accuracy_score(Y_Infil_test/4,Y_Infil_pred2/4)
print("Accuracy: %0.5f" % accuracy)
precision = metrics.precision_score(Y_Infil_test/4,Y_Infil_pred2/4)
print("Precision: %0.5f" % precision)
recall = metrics.recall_score(Y_Infil_test/4,Y_Infil_pred2/4)
print("Recall: %0.5f" % recall)
f = metrics.f1_score(Y_Infil_test/4,Y_Infil_pred2/4)
print("F-measure: %0.5f" % f)

Accuracy: 0.99187
Precision: 0.27108
Recall: 0.01287
F-measure: 0.02458
