#### <font color='blue' face='Cambria'><h1 align="center">LAB: Study on Decision Tree and KNN Algorithm</h1></font>


  Presented By:
* Anas Hdiri
* Eya Ben Ahmed
* Hedi Thameur
* Moatez Tej
* Nada Manai
* Slim Zouari


<p style="font-family: Cambria; font-size:1.75em;color:red; font-style:bold" >
I. BUSINESS UNDERSTANDING</p>


**Introduction to Network Security**

In an era dominated by digital transactions and communications, maintaining robust network security is pivotal for any organization. Effective network defense mechanisms are vital for protecting sensitive data and ensuring uninterrupted business operations. This research delves into enhancing network intrusion detection using sophisticated data analysis techniques.

**Intrusion Detection Techniques:**
- **Pattern-Based Recognition:** This approach uses established patterns or 'signatures' of known threats for detection.
- **Behavioral Analysis Detection:** This method employs advanced data analytics, particularly machine learning, to identify abnormal network behavior that could signify a security breach.

**Categories of Network Threats:**
- **Denial-of-Service (DoS) Attacks:** These attacks aim to incapacitate a network, denying service to legitimate users.
- **Unauthorized Access (U2R and R2L):** These attacks involve gaining unauthorized access or privileges within the network, either by escalating a standard user's access (U2R) or by gaining remote entry (R2L).
- **Network Reconnaissance (Probe):** This type of attack involves scanning a network to identify vulnerabilities that can be exploited in future attacks.

**Business Goal:**
The primary objective of this study is to significantly enhance the detection and classification capabilities of network intrusion detection systems. By applying the Decision Tree and K-Nearest Neighbors (KNN) algorithms to the NSL-KDD dataset, we aim to provide a more nuanced and comprehensive understanding of network traffic patterns. This will enable more effective identification of both known and novel cyber threats, ultimately safeguarding the digital assets and operations of organizations.


<p style="font-family: Cambria; font-size:1.75em;color:red; font-style:bold" >
II. DATA UNDERSTANDING</p>

Central to our study is the NSL-KDD dataset, a renowned benchmark in the field of network intrusion detection. This dataset is an improved version of the KDD'99 dataset and is specifically curated to overcome some of the inherent problems of its predecessor, making it more suitable for evaluating Intrusion Detection Systems (IDS).

**Dataset Composition:**
- **Data Features:** The NSL-KDD dataset comprises a variety of features that simulate different types of network traffic, both normal and malicious. It includes basic features of individual TCP connections, content features within the connections, and traffic features computed based on a two-second time window.
- **Diverse Attack Types:** The dataset encompasses a broad range of attack categories, such as DoS, U2R, R2L, and Probe, mirroring the multifaceted nature of network threats. This diversity enables the development of models that are well-equipped to recognize and respond to a wide array of intrusion scenarios.

**Data Preprocessing:**
- **Handling Categorical Variables:** Several features in the dataset are categorical. We use encoding techniques such as Label Encoding and One-Hot Encoding to convert these categorical variables into a format that can be efficiently processed by machine learning algorithms.
- **Feature Selection:** Recognizing the critical importance of feature relevance, we employ univariate selection based on the ANOVA F-test, alongside Recursive Feature Elimination (RFE), to identify and retain the most significant features for each attack category, with the aim of improving the predictive accuracy of our models.
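
To make the two encodings concrete, here is a minimal sketch on a toy column. The frame `toy` and its four rows are made up for illustration; only the category names mirror the `protocol_type` feature.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only; the values mimic protocol_type.
toy = pd.DataFrame({"protocol_type": ["tcp", "udp", "icmp", "tcp"]})

# Label encoding: each category becomes one integer (classes are sorted alphabetically).
label_codes = LabelEncoder().fit_transform(toy["protocol_type"])

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(toy["protocol_type"], prefix="protocol_type", dtype=int)

print(label_codes.tolist())
print(one_hot)
```

Label encoding imposes an arbitrary order on the categories, which is why the one-hot form is usually preferred for nominal features such as `service` or `flag`.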

**Data Analysis Implications:**
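- **Reduced Redundancy:** The NSL-KDD dataset removes the redundant records of KDD'99, so classifiers are less biased toward the most frequent records. The classes themselves remain heavily imbalanced, however: in the training set, `normal` and `neptune` account for 67,343 and 41,214 of the 125,973 records, while attacks such as `perl` and `spy` appear only 3 and 2 times. This imbalance has to be kept in mind when interpreting accuracy figures.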
- **Real-World Applicability:** The composition and structure of the dataset make it a widely used proxy for real-world network environments, supporting the development of intrusion detection models intended for practical cybersecurity settings.



<p style="font-family: Cambria; font-size:1.75em;color:red; font-style:bold" >
III. DATA EXPLORATION</p>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
from sklearn import preprocessing
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings("ignore") 

In [2]:
col_names = ["duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes", "land",
            "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in", "num_compromised",
            "root_shell", "su_attempted", "num_root", "num_file_creations", "num_shells", "num_access_files",
            "num_outbound_cmds", "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
            "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
            "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "attack", "last_flag"]

In [3]:
#Import the train dataset
dataset_train=pd.read_csv('KDDTrain+.txt',header=None, names = col_names)
dataset_train = dataset_train.iloc[:,:-1]

In [4]:
#Import the test dataset
dataset_test=pd.read_csv('KDDTest+.txt',header=None, names = col_names)
dataset_test = dataset_test.iloc[:,:-1]

In [5]:
# the dimensions of the dataset
print('Dimensions of the Training set:',dataset_train.shape)
print('Dimensions of the Test set:',dataset_test.shape)

Dimensions of the Training set: (125973, 42)
Dimensions of the Test set: (22544, 42)


In [6]:
dataset_train.head()

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack
0,0,tcp,ftp_data,SF,491,0,0,0,0,0,...,25,0.17,0.03,0.17,0.0,0.0,0.0,0.05,0.0,normal
1,0,udp,other,SF,146,0,0,0,0,0,...,1,0.0,0.6,0.88,0.0,0.0,0.0,0.0,0.0,normal
2,0,tcp,private,S0,0,0,0,0,0,0,...,26,0.1,0.05,0.0,0.0,1.0,1.0,0.0,0.0,neptune
3,0,tcp,http,SF,232,8153,0,0,0,0,...,255,1.0,0.0,0.03,0.04,0.03,0.01,0.0,0.01,normal
4,0,tcp,http,SF,199,420,0,0,0,0,...,255,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,normal


In [7]:
dataset_test.head()

,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack
0,0,tcp,private,REJ,0,0,0,0,0,0,...,10,0.04,0.06,0.0,0.0,0.0,0.0,1.0,1.0,neptune
1,0,tcp,private,REJ,0,0,0,0,0,0,...,1,0.0,0.06,0.0,0.0,0.0,0.0,1.0,1.0,neptune
2,2,tcp,ftp_data,SF,12983,0,0,0,0,0,...,86,0.61,0.04,0.61,0.02,0.0,0.0,0.0,0.0,normal
3,0,icmp,eco_i,SF,20,0,0,0,0,0,...,57,1.0,0.0,1.0,0.28,0.0,0.0,0.0,0.0,saint
4,1,tcp,telnet,RSTO,0,15,0,0,0,0,...,86,0.31,0.17,0.03,0.02,0.0,0.0,0.83,0.71,mscan


In [8]:
# Statistical overview of the numerical features in the training dataset
dataset_train.describe()

,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
count,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,...,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0,125973.0
mean,287.14465,45566.74,19779.11,0.000198,0.022687,0.000111,0.204409,0.001222,0.395736,0.27925,...,182.148945,115.653005,0.521242,0.082951,0.148379,0.032542,0.284452,0.278485,0.118832,0.12024
std,2604.51531,5870331.0,4021269.0,0.014086,0.25353,0.014366,2.149968,0.045239,0.48901,23.942042,...,99.206213,110.702741,0.448949,0.188922,0.308997,0.112564,0.444784,0.445669,0.306557,0.319459
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,82.0,10.0,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,44.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,63.0,0.51,0.02,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,276.0,516.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,255.0,255.0,1.0,0.07,0.06,0.02,1.0,1.0,0.0,0.0
max,42908.0,1379964000.0,1309937000.0,1.0,3.0,3.0,77.0,5.0,1.0,7479.0,...,255.0,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [9]:
# Statistical overview of the numerical features in the test dataset
dataset_test.describe()

,duration,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
count,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,...,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0,22544.0
mean,218.859076,10395.45,2056.019,0.000311,0.008428,0.00071,0.105394,0.021647,0.442202,0.119899,...,193.869411,140.750532,0.608722,0.09054,0.132261,0.019638,0.097814,0.099426,0.233385,0.226683
std,1407.176612,472786.4,21219.3,0.017619,0.142599,0.036473,0.928428,0.150328,0.496659,7.269597,...,94.035663,111.783972,0.435688,0.220717,0.306268,0.085394,0.273139,0.281866,0.387229,0.400875
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,121.0,15.0,0.07,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,54.0,46.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,255.0,168.0,0.92,0.01,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,287.0,601.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,255.0,255.0,1.0,0.06,0.03,0.01,0.0,0.0,0.36,0.17
max,57715.0,62825650.0,1345927.0,1.0,3.0,3.0,101.0,4.0,1.0,796.0,...,255.0,255.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [10]:
#Observe the Label Distribution of Training set
print('Label distribution Training set:')
print(dataset_train['attack'].value_counts())



Label distribution Training set:
attack
normal             67343
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: count, dtype: int64
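
The counts above show a strongly skewed label distribution. A minimal sketch that quantifies the skew, using only the counts copied from this output (the 100-record threshold for "rare" is an arbitrary choice for illustration):

```python
# Counts copied from the training label distribution printed above.
train_counts = {
    "normal": 67343, "neptune": 41214, "satan": 3633, "ipsweep": 3599,
    "portsweep": 2931, "smurf": 2646, "nmap": 1493, "back": 956,
    "teardrop": 892, "warezclient": 890, "pod": 201, "guess_passwd": 53,
    "buffer_overflow": 30, "warezmaster": 20, "land": 18, "imap": 11,
    "rootkit": 10, "loadmodule": 9, "ftp_write": 8, "multihop": 7,
    "phf": 4, "perl": 3, "spy": 2,
}

total = sum(train_counts.values())

# Share of each label in percent.
shares = {label: 100 * n / total for label, n in train_counts.items()}

# Labels with fewer than 100 training records (threshold chosen for illustration).
rare = [label for label, n in train_counts.items() if n < 100]

print(total)
print(round(shares["normal"], 2), round(shares["neptune"], 2))
print(rare)
```

With so few examples for some attacks, accuracy alone says little about how well the rare classes are detected.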


In [11]:
#Observe the Label Distribution of Test set
print('Label distribution Test set:')
print(dataset_test['attack'].value_counts())

Label distribution Test set:
attack
normal             9711
neptune            4657
guess_passwd       1231
mscan               996
warezmaster         944
apache2             737
satan               735
processtable        685
smurf               665
back                359
snmpguess           331
saint               319
mailbomb            293
snmpgetattack       178
portsweep           157
ipsweep             141
httptunnel          133
nmap                 73
pod                  41
buffer_overflow      20
multihop             18
named                17
ps                   15
sendmail             14
rootkit              13
xterm                13
teardrop             12
xlock                 9
land                  7
xsnoop                4
ftp_write             3
worm                  2
loadmodule            2
perl                  2
sqlattack             2
udpstorm              2
phf                   2
imap                  1
Name: count, dtype: int64


In [12]:
#Data pre-processing
#1)Handling missing data

In [13]:
# Count the missing values per feature and their percentage in the training dataset.

missing_values = dataset_train.isnull().sum()
missing_percentage = missing_values*100/len(dataset_train)
missing_df = pd.DataFrame({'missing_sum': missing_values,'missing_percentage': missing_percentage})
missing_df


,missing_sum,missing_percentage
duration,0,0.0
protocol_type,0,0.0
service,0,0.0
flag,0,0.0
src_bytes,0,0.0
dst_bytes,0,0.0
land,0,0.0
wrong_fragment,0,0.0
urgent,0,0.0
hot,0,0.0


In [14]:
# Count the missing values per feature and their percentage in the test dataset.

missing_values = dataset_test.isnull().sum()
missing_percentage = missing_values*100/len(dataset_test)
missing_df = pd.DataFrame({'missing_sum': missing_values,'missing_percentage': missing_percentage})
missing_df

,missing_sum,missing_percentage
duration,0,0.0
protocol_type,0,0.0
service,0,0.0
flag,0,0.0
src_bytes,0,0.0
dst_bytes,0,0.0
land,0,0.0
wrong_fragment,0,0.0
urgent,0,0.0
hot,0,0.0


In [15]:
# explore categorical features in the training set
print('Training set:')
for col_name in dataset_train.columns:
    if dataset_train[col_name].dtypes == 'object' :
        unique_cat = len(dataset_train[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} categories".format(col_name=col_name, unique_cat=unique_cat))

Training set:
Feature 'protocol_type' has 3 categories
Feature 'service' has 70 categories
Feature 'flag' has 11 categories
Feature 'attack' has 23 categories


In [16]:
# explore categorical features in the test set
print('Test set:')
for col_name in dataset_test.columns:
    if dataset_test[col_name].dtypes == 'object' :
        unique_cat = len(dataset_test[col_name].unique())
        print("Feature '{col_name}' has {unique_cat} categories".format(col_name=col_name, unique_cat=unique_cat))

Test set:
Feature 'protocol_type' has 3 categories
Feature 'service' has 64 categories
Feature 'flag' has 11 categories
Feature 'attack' has 38 categories


In [17]:
#We need 3+70+11=84 dummy variables to represent all categories of these features (protocol_type, service, flag).
#For the feature 'service', the test set has 6 fewer categories than the training set; they must be added to the test set as zero-filled columns.


In [18]:
#Transform the categorical variables into numerical values
categorical_columns=['protocol_type', 'service', 'flag']

dtrain_categorical_values = dataset_train[categorical_columns]
dtest_categorical_values = dataset_test[categorical_columns]

dtrain_categorical_values.head()

,protocol_type,service,flag
0,tcp,ftp_data,SF
1,udp,other,SF
2,tcp,private,S0
3,tcp,http,SF
4,tcp,http,SF


In [19]:
#Prepare a list of column names for one-hot encoding based on the unique values for each categorical column

# protocol type
unique_protocol=sorted(dataset_train.protocol_type.unique())
string1 = 'Protocol_type_'
unique_protocol2=[string1 + x for x in unique_protocol]
# service
unique_service=sorted(dataset_train.service.unique())
string2 = 'service_'
unique_service2=[string2 + x for x in unique_service]
# flag
unique_flag=sorted(dataset_train.flag.unique())
string3 = 'flag_'
unique_flag2=[string3 + x for x in unique_flag]

dumcols=unique_protocol2 + unique_service2 + unique_flag2
print(dumcols)



['Protocol_type_icmp', 'Protocol_type_tcp', 'Protocol_type_udp', 'service_IRC', 'service_X11', 'service_Z39_50', 'service_aol', 'service_auth', 'service_bgp', 'service_courier', 'service_csnet_ns', 'service_ctf', 'service_daytime', 'service_discard', 'service_domain', 'service_domain_u', 'service_echo', 'service_eco_i', 'service_ecr_i', 'service_efs', 'service_exec', 'service_finger', 'service_ftp', 'service_ftp_data', 'service_gopher', 'service_harvest', 'service_hostnames', 'service_http', 'service_http_2784', 'service_http_443', 'service_http_8001', 'service_imap4', 'service_iso_tsap', 'service_klogin', 'service_kshell', 'service_ldap', 'service_link', 'service_login', 'service_mtp', 'service_name', 'service_netbios_dgm', 'service_netbios_ns', 'service_netbios_ssn', 'service_netstat', 'service_nnsp', 'service_nntp', 'service_ntp_u', 'service_other', 'service_pm_dump', 'service_pop_2', 'service_pop_3', 'service_printer', 'service_private', 'service_red_i', 'service_remote_job', 'serv

In [20]:
#Do the same process for the test dataset
unique_service_test=sorted(dataset_test.service.unique())
unique_service2_test=[string2 + x for x in unique_service_test]
testdumcols=unique_protocol2 + unique_service2_test + unique_flag2

In [21]:
#Transform categorical features into numbers using LabelEncoder
dtrain_categorical_values_enc=dtrain_categorical_values.apply(LabelEncoder().fit_transform)
#Result
print(dtrain_categorical_values_enc.head())

# For the test dataset
dtest_categorical_values_enc=dtest_categorical_values.apply(LabelEncoder().fit_transform)

   protocol_type  service  flag
0              1       20     9
1              2       44     9
2              1       49     5
3              1       24     9
4              1       24     9


In [22]:
#One-Hot-Encoding
enc = OneHotEncoder()
dtrain_categorical_values_encenc = enc.fit_transform(dtrain_categorical_values_enc)
dtrain_cat_data = pd.DataFrame(dtrain_categorical_values_encenc.toarray(),columns=dumcols)
# test set
dtest_categorical_values_encenc = enc.fit_transform(dtest_categorical_values_enc)
dtest_cat_data = pd.DataFrame(dtest_categorical_values_encenc.toarray(),columns=testdumcols)

dtrain_cat_data.head()

,Protocol_type_icmp,Protocol_type_tcp,Protocol_type_udp,service_IRC,service_X11,service_Z39_50,service_aol,service_auth,service_bgp,service_courier,...,flag_REJ,flag_RSTO,flag_RSTOS0,flag_RSTR,flag_S0,flag_S1,flag_S2,flag_S3,flag_SF,flag_SH
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [23]:
#Add the 6 categories of the feature 'service' that appear only in the training set to the test dataset

trainservice=dataset_train['service'].tolist()
testservice= dataset_test['service'].tolist()
difference=list(set(trainservice) - set(testservice))
string = 'service_'
difference=[string + x for x in difference]
difference

['service_urh_i',
 'service_http_8001',
 'service_red_i',
 'service_aol',
 'service_harvest',
 'service_http_2784']

In [24]:
#For each column name in the 'difference' list, add a new column with that name to dtest_cat_data and fill it with 0

for col in difference:
    dtest_cat_data[col] = 0

dtest_cat_data.shape

(22544, 84)
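
Appending the missing columns at the end gives the test frame the right set of columns, but not necessarily in the training order. The later steps pass the features through `StandardScaler`, which returns plain arrays, so columns are matched by position and the order matters. The following is a minimal sketch, on toy frames with hypothetical values (`train_cat` and `test_cat` stand in for `dtrain_cat_data` and `dtest_cat_data`), of aligning the test columns with `DataFrame.reindex`:

```python
import pandas as pd

# Toy stand-ins for the one-hot encoded frames (hypothetical values).
train_cat = pd.DataFrame({"service_aol": [1, 0], "service_ftp": [0, 1], "flag_SF": [1, 1]})
test_cat = pd.DataFrame({"service_ftp": [1, 0], "flag_SF": [0, 1]})

# Appending a missing column puts it last, so the order differs from training.
appended = test_cat.copy()
appended["service_aol"] = 0

# Reindexing against the training columns restores the training order
# and zero-fills any column that is absent from the test frame.
aligned = test_cat.reindex(columns=train_cat.columns, fill_value=0)

print(appended.columns.tolist())
print(aligned.columns.tolist())
```

Whether the results reported later change after such an alignment would have to be checked by rerunning the notebook.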

In [25]:
#Join the original dataset (train and test) with the one-hot encoded categorical data(dtrain_cat_data,dtest_cat_data)
#then dropping the original categorical columns ('flag','protocol_type','service')

newdf=dataset_train.join(dtrain_cat_data)
newdf.drop(columns=['flag', 'protocol_type', 'service'], inplace=True)

# Test dataset
newdf_test=dataset_test.join(dtest_cat_data)
newdf_test.drop(columns=['flag', 'protocol_type', 'service'], inplace=True)
print(newdf.shape)
print(newdf_test.shape)
for column in newdf.columns:
    print(column)




(125973, 123)
(22544, 123)
duration
src_bytes
dst_bytes
land
wrong_fragment
urgent
hot
num_failed_logins
logged_in
num_compromised
root_shell
su_attempted
num_root
num_file_creations
num_shells
num_access_files
num_outbound_cmds
is_host_login
is_guest_login
count
srv_count
serror_rate
srv_serror_rate
rerror_rate
srv_rerror_rate
same_srv_rate
diff_srv_rate
srv_diff_host_rate
dst_host_count
dst_host_srv_count
dst_host_same_srv_rate
dst_host_diff_srv_rate
dst_host_same_src_port_rate
dst_host_srv_diff_host_rate
dst_host_serror_rate
dst_host_srv_serror_rate
dst_host_rerror_rate
dst_host_srv_rerror_rate
attack
Protocol_type_icmp
Protocol_type_tcp
Protocol_type_udp
service_IRC
service_X11
service_Z39_50
service_aol
service_auth
service_bgp
service_courier
service_csnet_ns
service_ctf
service_daytime
service_discard
service_domain
service_domain_u
service_echo
service_eco_i
service_ecr_i
service_efs
service_exec
service_finger
service_ftp
service_ftp_data
service_gopher
service_harvest
service

In [26]:
#Split Dataset into 4 datasets for every attack category
#Rename every attack label: 0=normal, 1=DoS, 2=Probe, 3=R2L and 4=U2R.
#Replace labels column with new labels column
#Make the 4 new datasets



In [27]:
#Replace the 'attack' column with numerical values corresponding to the attack categories

labeldf = newdf['attack']
labeldf_test = newdf_test['attack']
# Map every attack name to its category code (0=normal, 1=DoS, 2=Probe, 3=R2L, 4=U2R)
attack_categories = { 'normal' : 0, 'neptune' : 1 ,'back': 1, 'land': 1, 'pod': 1, 'smurf': 1, 'teardrop': 1,'mailbomb': 1, 'apache2': 1, 'processtable': 1, 'udpstorm': 1, 'worm': 1,
                           'ipsweep' : 2,'nmap' : 2,'portsweep' : 2,'satan' : 2,'mscan' : 2,'saint' : 2
                           ,'ftp_write': 3,'guess_passwd': 3,'imap': 3,'multihop': 3,'phf': 3,'spy': 3,'warezclient': 3,'warezmaster': 3,'sendmail': 3,'named': 3,'snmpgetattack': 3,'snmpguess': 3,'xlock': 3,'xsnoop': 3,'httptunnel': 3,
                           'buffer_overflow': 4,'loadmodule': 4,'perl': 4,'rootkit': 4,'ps': 4,'sqlattack': 4,'xterm': 4}
# change the attack column (same mapping for train and test)
newlabeldf=labeldf.replace(attack_categories)
newlabeldf_test=labeldf_test.replace(attack_categories)
# put the new attack column back
newdf['attack'] = newlabeldf
newdf_test['attack'] = newlabeldf_test
print(newdf['attack'].head())


0    0
1    0
2    1
3    0
4    0
Name: attack, dtype: int64
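
The name-to-category mapping can also be written per category and inverted, which keeps each attack family in one place. This is a sketch under that assumption; the names `categories` and `attack_to_code` are illustrative, while the attack lists are the ones used in the `replace` call above.

```python
import pandas as pd

# Category code -> attack names, as listed in the replace() call above.
categories = {
    0: ["normal"],
    1: ["neptune", "back", "land", "pod", "smurf", "teardrop", "mailbomb",
        "apache2", "processtable", "udpstorm", "worm"],
    2: ["ipsweep", "nmap", "portsweep", "satan", "mscan", "saint"],
    3: ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
        "warezclient", "warezmaster", "sendmail", "named", "snmpgetattack",
        "snmpguess", "xlock", "xsnoop", "httptunnel"],
    4: ["buffer_overflow", "loadmodule", "perl", "rootkit", "ps",
        "sqlattack", "xterm"],
}

# Invert to attack name -> category code.
attack_to_code = {name: code for code, names in categories.items() for name in names}

# Small example series with one attack from each category.
labels = pd.Series(["normal", "neptune", "saint", "guess_passwd", "rootkit"])
codes = labels.map(attack_to_code)
print(codes.tolist())
```

Unlike `replace`, `Series.map` turns any label missing from the dictionary into `NaN`, which makes an unmapped attack name easy to spot.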


In [28]:
#Split the data into four subsets based on the values in the 'attack' column

#Specify the attack types that we want to exclude from each dataset
to_drop_DoS = [2,3,4] 
to_drop_Probe = [1,3,4]
to_drop_R2L = [1,2,4]
to_drop_U2R = [1,2,3]

#Train dataset
DoS_df=newdf[~newdf['attack'].isin(to_drop_DoS)] #Example: Subset of data where the attack type is not in the set [2,3,4]
Probe_df=newdf[~newdf['attack'].isin(to_drop_Probe)]
R2L_df=newdf[~newdf['attack'].isin(to_drop_R2L)]
U2R_df=newdf[~newdf['attack'].isin(to_drop_U2R)]

#Test dataset
DoS_df_test=newdf_test[~newdf_test['attack'].isin(to_drop_DoS)]
Probe_df_test=newdf_test[~newdf_test['attack'].isin(to_drop_Probe)]
R2L_df_test=newdf_test[~newdf_test['attack'].isin(to_drop_R2L)]
U2R_df_test=newdf_test[~newdf_test['attack'].isin(to_drop_U2R)]
print('Train set:')
print('Dimensions of DoS:' ,DoS_df.shape)
print('Dimensions of Probe:' ,Probe_df.shape)
print('Dimensions of R2L:' ,R2L_df.shape)
print('Dimensions of U2R:' ,U2R_df.shape)
print('Test set:')
print('Dimensions of DoS:' ,DoS_df_test.shape)
print('Dimensions of Probe:' ,Probe_df_test.shape)
print('Dimensions of R2L:' ,R2L_df_test.shape)
print('Dimensions of U2R:' ,U2R_df_test.shape)



Train set:
Dimensions of DoS: (113270, 123)
Dimensions of Probe: (78999, 123)
Dimensions of R2L: (68338, 123)
Dimensions of U2R: (67395, 123)
Test set:
Dimensions of DoS: (17171, 123)
Dimensions of Probe: (12132, 123)
Dimensions of R2L: (12596, 123)
Dimensions of U2R: (9778, 123)


In [29]:
#Prepare the training and test sets by separating features 'X' and labels 'Y' for each attack category
#(DoS, Probe, R2L, U2R): drop the 'attack' column from the feature sets and assign it to the label sets.

X_DoS = DoS_df.drop('attack', axis=1)
Y_DoS = DoS_df.attack

X_Probe = Probe_df.drop('attack', axis=1)
Y_Probe = Probe_df.attack

X_R2L = R2L_df.drop('attack', axis=1)
Y_R2L = R2L_df.attack

X_U2R = U2R_df.drop('attack', axis=1)
Y_U2R = U2R_df.attack

# Test sets
X_DoS_test = DoS_df_test.drop('attack', axis=1)
Y_DoS_test = DoS_df_test.attack

X_Probe_test = Probe_df_test.drop('attack', axis=1)
Y_Probe_test = Probe_df_test.attack

X_R2L_test = R2L_df_test.drop('attack', axis=1)
Y_R2L_test = R2L_df_test.attack

X_U2R_test = U2R_df_test.drop('attack', axis=1)
Y_U2R_test = U2R_df_test.attack


colNames=list(X_DoS)
colNames_test=list(X_DoS_test)

In [30]:
#Standardize the feature sets for each attack category for both training and test sets

#train set
scaler1 = preprocessing.StandardScaler().fit(X_DoS)
X_DoS=scaler1.transform(X_DoS) 
scaler2 = preprocessing.StandardScaler().fit(X_Probe)
X_Probe=scaler2.transform(X_Probe) 
scaler3 = preprocessing.StandardScaler().fit(X_R2L)
X_R2L=scaler3.transform(X_R2L) 
scaler4 = preprocessing.StandardScaler().fit(X_U2R)
X_U2R=scaler4.transform(X_U2R) 

# test set
scaler5 = preprocessing.StandardScaler().fit(X_DoS_test)
X_DoS_test=scaler5.transform(X_DoS_test) 
scaler6 = preprocessing.StandardScaler().fit(X_Probe_test)
X_Probe_test=scaler6.transform(X_Probe_test) 
scaler7 = preprocessing.StandardScaler().fit(X_R2L_test)
X_R2L_test=scaler7.transform(X_R2L_test) 
scaler8 = preprocessing.StandardScaler().fit(X_U2R_test)
X_U2R_test=scaler8.transform(X_U2R_test) 
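
A common convention is to estimate the scaling parameters on the training data only and reuse them for the test data, so that both sets are expressed in the same units. The sketch below illustrates that convention on synthetic data; the arrays `X_train` and `X_test` are illustrative and are not the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: the test distribution is deliberately shifted.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=12.0, scale=2.0, size=(50, 3))

# Fit the scaler on the training data only ...
scaler = StandardScaler().fit(X_train)

# ... and apply the same mean and scale to both sets.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0).round(3))
print(X_test_std.mean(axis=0).round(3))
```

With a shared scaler, the shift between the two sets stays visible in the standardized test features. Fitting a second scaler on the test set would re-centre the test features and remove that shift.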

In [31]:
#Check that the Standard Deviation is equal to 1 for the X_DoS dataset as an example
print(X_DoS.std(axis=0))

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
 1. 1.]
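
The zeros in this output belong to features that are constant within the DoS subset. The 17th entry, for instance, lines up with `num_outbound_cmds` in the column order printed earlier. A minimal sketch, on a toy array, of how `StandardScaler` treats a constant column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First column is constant, second column varies.
X = np.array([[0.0, 1.0],
              [0.0, 3.0],
              [0.0, 5.0]])

X_std = StandardScaler().fit_transform(X)

# A constant column is centred to zero and keeps a standard deviation of 0.
print(X_std.std(axis=0))
```

Such columns carry no information for a classifier and could be dropped before feature selection.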


In [32]:
#Feature Selection
#ANOVA F-test

In [33]:
from sklearn.feature_selection import SelectPercentile, f_classif
np.seterr(divide='ignore', invalid='ignore');
selector=SelectPercentile(f_classif, percentile=10)



In [34]:
#Get the features selected for DoS
X_anovaDoS = selector.fit_transform(X_DoS,Y_DoS)
X_anovaDoS.shape
true=selector.get_support()
newcoli_DoS=[i for i, x in enumerate(true) if x]
newname_DoS=list( colNames[i] for i in newcoli_DoS )
newname_DoS

['logged_in',
 'count',
 'serror_rate',
 'srv_serror_rate',
 'same_srv_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_srv_rate',
 'dst_host_serror_rate',
 'dst_host_srv_serror_rate',
 'service_http',
 'flag_S0',
 'flag_SF']
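
For reference, a self-contained sketch of the same selection pattern on synthetic data. The data come from `make_classification` and the names `f0`...`f19` are placeholders. It also shows that a fitted selector can be reused with `transform` on another matrix that has the same columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic binary classification problem with 20 features.
X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           n_redundant=0, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]

# Keep the top 10% of features ranked by the ANOVA F-statistic.
selector = SelectPercentile(f_classif, percentile=10).fit(X, y)
mask = selector.get_support()
selected = [name for name, keep in zip(names, mask) if keep]

# The fitted selector applies the same column choice to any matrix with these columns.
X_selected = selector.transform(X)
print(selected, X_selected.shape)
```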

In [35]:
#Get the features selected for R2L
X_anovaR2L = selector.fit_transform(X_R2L,Y_R2L)
X_anovaR2L.shape
true=selector.get_support()
newcoli_R2L=[i for i, x in enumerate(true) if x]
newname_R2L=list( colNames[i] for i in newcoli_R2L)
newname_R2L

['src_bytes',
 'dst_bytes',
 'hot',
 'num_failed_logins',
 'is_guest_login',
 'dst_host_srv_count',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'service_ftp',
 'service_ftp_data',
 'service_http',
 'service_imap4',
 'flag_RSTO']

In [36]:
#Get the features selected for U2R
X_anovaU2R = selector.fit_transform(X_U2R,Y_U2R)
X_anovaU2R.shape
true=selector.get_support()
newcoli_U2R=[i for i, x in enumerate(true) if x]
newname_U2R=list( colNames[i] for i in newcoli_U2R)
newname_U2R

['urgent',
 'hot',
 'root_shell',
 'num_file_creations',
 'num_shells',
 'srv_diff_host_rate',
 'dst_host_count',
 'dst_host_srv_count',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'service_ftp_data',
 'service_http',
 'service_telnet']

In [37]:
#Get the features selected for Probe
X_anovaProbe = selector.fit_transform(X_Probe,Y_Probe)
X_anovaProbe.shape
true=selector.get_support()
newcoli_Probe=[i for i, x in enumerate(true) if x]
newname_Probe=list( colNames[i] for i in newcoli_Probe )
newname_Probe

['logged_in',
 'rerror_rate',
 'srv_rerror_rate',
 'dst_host_srv_count',
 'dst_host_diff_srv_rate',
 'dst_host_same_src_port_rate',
 'dst_host_srv_diff_host_rate',
 'dst_host_rerror_rate',
 'dst_host_srv_rerror_rate',
 'Protocol_type_icmp',
 'service_eco_i',
 'service_private',
 'flag_SF']

In [38]:
#Classifiers

In [39]:
#DecisionTreeClassifier on all features
dtc_DoS=DecisionTreeClassifier(random_state=0)
dtc_R2L=DecisionTreeClassifier(random_state=0)
dtc_U2R=DecisionTreeClassifier(random_state=0)
dtc_Probe=DecisionTreeClassifier(random_state=0)
dtc_DoS.fit(X_DoS, Y_DoS)
dtc_R2L.fit(X_R2L, Y_R2L)
dtc_U2R.fit(X_U2R, Y_U2R)
dtc_Probe.fit(X_Probe, Y_Probe)



In [40]:
#DecisionTreeClassifier on selected features
dtcs_DoS=DecisionTreeClassifier(random_state=0)
dtcs_R2L=DecisionTreeClassifier(random_state=0)
dtcs_Probe=DecisionTreeClassifier(random_state=0)
dtcs_U2R=DecisionTreeClassifier(random_state=0)

dtcs_DoS.fit(X_anovaDoS, Y_DoS)
dtcs_R2L.fit(X_anovaR2L, Y_R2L)
dtcs_Probe.fit(X_anovaProbe, Y_Probe)
dtcs_U2R.fit(X_anovaU2R, Y_U2R)

In [41]:
#Results and discussions

In [42]:
#DoS
Y_DoS_pred=dtc_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks,0,1
Actual attacks,,
0,9499,212
1,2830,4630


In [43]:
#R2L
Y_R2L_pred=dtc_R2L.predict(X_R2L_test)
# Create confusion matrix
pd.crosstab(Y_R2L_test, Y_R2L_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks,0,3
Actual attacks,,
0,9707,4
3,2573,312


In [44]:
#Probe
Y_Probe_pred=dtc_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks,0,2
Actual attacks,,
0,2337,7374
2,212,2209


In [45]:
#U2R
Y_U2R_pred=dtc_U2R.predict(X_U2R_test)
# Create confusion matrix
pd.crosstab(Y_U2R_test, Y_U2R_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks,0,4
Actual attacks,,
0,9703,8
4,60,7


In [46]:
#Accuracy, Precision, Recall, F1-measure

In [47]:
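#DoS
# Note: cross_val_score clones the estimator and refits it on folds of the data it receives,
# so the scores in this section come from 10-fold cross-validation within the test set,
# not from the tree fitted on the training set.
from sklearn.model_selection import cross_val_score
from sklearn import metrics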


accuracy = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
print("Accuracy for Dos: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision')
print("Precision for Dos: %0.5f " % (precision.mean()*100))
recall = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall')
print("Recall for Dos: %0.5f " % (recall.mean()*100))
f = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1')
print("F-measure for Dos: %0.5f " % (f.mean()*100))





Accuracy for Dos: 99.63892 
Precision for Dos: 99.50520 
Recall for Dos: 99.66488 
F-measure for Dos: 99.58478 


In [48]:
#U2R
accuracy = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
print("Accuracy for U2R: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
print("Precision for U2R: %0.5f " % (precision.mean()*100))
recall = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
print("Recall for U2R: %0.5f " % (recall.mean()*100))
f = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
print("F-measure for U2R : %0.5f " % (f.mean()*100))



Accuracy for U2R: 99.65231 
Precision for U2R: 86.29532 
Recall for U2R: 90.95815 
F-measure for U2R : 88.21025 
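
The `_macro` scorers used here average the per-class scores with equal weight for each class, which matters for U2R because attacks are a tiny fraction of the rows. As a hedged illustration (not part of the original notebook), the sketch below applies macro averaging by hand to the Decision Tree U2R crosstab shown earlier; the helper names are hypothetical.

```python
# Counts copied from the Decision Tree U2R crosstab
# (outer key = actual class, inner key = predicted class)
confusion = {
    0: {0: 9703, 4: 8},   # actual normal traffic
    4: {0: 60, 4: 7},     # actual U2R attacks
}
labels = sorted(confusion)

def per_class_recall(label):
    # Correct predictions for this class divided by all rows of this class
    return confusion[label][label] / sum(confusion[label].values())

def per_class_precision(label):
    # Correct predictions for this class divided by all predictions of it
    predicted_total = sum(confusion[actual][label] for actual in labels)
    return confusion[label][label] / predicted_total

# Macro average: unweighted mean over classes
macro_recall = sum(per_class_recall(l) for l in labels) / len(labels)
macro_precision = sum(per_class_precision(l) for l in labels) / len(labels)
accuracy = (sum(confusion[l][l] for l in labels)
            / sum(sum(row.values()) for row in confusion.values()))

print("Accuracy (held-out): %0.5f" % (accuracy * 100))
print("Macro precision (held-out): %0.5f" % (macro_precision * 100))
print("Macro recall (held-out): %0.5f" % (macro_recall * 100))
```

Because the rare class counts as much as the majority class in a macro average, accuracy can stay above 99% while macro recall is much lower.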


In [49]:
#R2L
accuracy = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
print("Accuracy for R2L: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
print("Precision for R2L: %0.5f " % (precision.mean()*100))
recall = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
print("Recall for R2L: %0.5f " % (recall.mean()*100))
f = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
print("F-measure for R2L: %0.5f " % (f.mean()*100))

Accuracy for R2L: 97.91995 
Precision for R2L: 97.15088 
Recall for R2L: 96.95751 
F-measure for R2L: 97.05067 


In [50]:
#Probe
accuracy = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')
print("Accuracy for Probe: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')
print("Precision for Probe: %0.5f " % (precision.mean()*100))
recall = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')
print("Recall for Probe: %0.5f " % (recall.mean()*100))
f = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')
print("F-measure for Probe: %0.5f " % (f.mean()*100))

Accuracy for Probe: 99.57134 
Precision for Probe: 99.39152 
Recall for Probe: 99.26705 
F-measure for Probe: 99.32861 


In [51]:
#KNN

In [52]:
#DoS
from sklearn.neighbors import KNeighborsClassifier
knn_DoS=KNeighborsClassifier()
knn_DoS.fit(X_DoS, Y_DoS)



In [53]:
#U2R
knn_U2R=KNeighborsClassifier()
knn_U2R.fit(X_U2R, Y_U2R)


In [54]:
#R2L
knn_R2L=KNeighborsClassifier()
knn_R2L.fit(X_R2L, Y_R2L)

In [55]:
#Probe
knn_Probe=KNeighborsClassifier()
knn_Probe.fit(X_Probe, Y_Probe)

In [56]:
#DoS
Y_DoS_pred=knn_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks     0     1
Actual attacks
0                  9422   289
1                  1573  5887


In [57]:
#U2R
Y_U2R_pred=knn_U2R.predict(X_U2R_test)
# Create confusion matrix
pd.crosstab(Y_U2R_test, Y_U2R_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks     0     4
Actual attacks
0                  9711     0
4                    65     2


In [58]:
#R2L
Y_R2L_pred=knn_R2L.predict(X_R2L_test)
# Create confusion matrix
pd.crosstab(Y_R2L_test, Y_R2L_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks     0     3
Actual attacks
0                  9706     5
3                  2883     2


In [59]:
#Probe
Y_Probe_pred=knn_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

Predicted attacks     0     2
Actual attacks
0                  9437   274
2                  1272  1149


In [60]:
#Accuracy, Precision, Recall, F1-measure(KNN)

In [61]:
#DoS
from sklearn.model_selection import cross_val_score
from sklearn import metrics

# As with the Decision Tree, cross_val_score refits a clone of knn_DoS on folds of the test set.
accuracy = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
print("Accuracy for DoS: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision')
print("Precision for DoS: %0.5f " % (precision.mean()*100))
recall = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall')
print("Recall for DoS: %0.5f " % (recall.mean()*100))
f = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1')
print("F-measure for DoS: %0.5f " % (f.mean()*100))

Accuracy for DoS: 99.71463 
Precision for DoS: 99.67843 
Recall for DoS: 99.66488 
F-measure for DoS: 99.67158 


In [62]:
#U2R
accuracy = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
print("Accuracy for U2R: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
print("Precision for U2R: %0.5f " % (precision.mean()*100))
recall = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
print("Recall for U2R: %0.5f " % (recall.mean()*100))
f = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
print("F-measure for U2R : %0.5f " % (f.mean()*100))

Accuracy for U2R: 99.70341 
Precision for U2R: 93.14325 
Recall for U2R: 85.07271 
F-measure for U2R : 87.83136 


In [63]:
#R2L
accuracy = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
print("Accuracy for R2L: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
print("Precision for R2L: %0.5f " % (precision.mean()*100))
recall = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
print("Recall for R2L: %0.5f " % (recall.mean()*100))
f = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
print("F-measure for R2L: %0.5f " % (f.mean()*100))

Accuracy for R2L: 96.72911 
Precision for R2L: 95.31232 
Recall for R2L: 95.45446 
F-measure for R2L: 95.37623 


In [64]:
#Probe
accuracy = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')
print("Accuracy for Probe: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')
print("Precision for Probe: %0.5f " % (precision.mean()*100))
recall = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')
print("Recall for Probe: %0.5f " % (recall.mean()*100))
f = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')
print("F-measure for Probe: %0.5f " % (f.mean()*100))

Accuracy for Probe: 99.07681 
Precision for Probe: 98.60594 
Recall for Probe: 98.50846 
F-measure for Probe: 98.55258 


In [65]:
#Comparison of average accuracy, precision, recall and f-score of the two algorithms: Decision tree and KNN 

In [None]:
# Accuracy
accuracy_Dos = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
accuracy_U2R = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
accuracy_R2L = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
accuracy_Probe = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')

# Precision
precision_Dos = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision_macro')
precision_U2R = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
precision_R2L = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
precision_Probe = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')

# Recall
recall_Dos = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall_macro')
recall_U2R = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
recall_R2L = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
recall_Probe = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')

# F1-score
f_dos = cross_val_score(dtc_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1_macro')
f_U2R = cross_val_score(dtc_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
f_R2L = cross_val_score(dtc_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
f_Probe = cross_val_score(dtc_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')

# Calculate averages
dtc_accuracy = np.mean([accuracy_Dos.mean(), accuracy_U2R.mean(), accuracy_R2L.mean(), accuracy_Probe.mean()])*100
dtc_precision = np.mean([precision_Dos.mean(), precision_U2R.mean(), precision_R2L.mean(), precision_Probe.mean()])*100
dtc_recall = np.mean([recall_Dos.mean(), recall_U2R.mean(), recall_R2L.mean(), recall_Probe.mean()])*100
dtc_f_score = np.mean([f_dos.mean(), f_U2R.mean(), f_R2L.mean(), f_Probe.mean()])*100
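
The cell above calls `cross_val_score` once per metric, which refits the model for every metric. One possible alternative, sketched here under assumptions, is scikit-learn's `cross_validate`, which accepts a list of scorers and fits each fold only once. The data below is a synthetic stand-in generated with `make_classification`, not the NSL-KDD features, and the names `X_demo`, `y_demo` and `demo_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for one attack category (NOT the NSL-KDD data)
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.8, 0.2], random_state=0)

scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
results = cross_validate(DecisionTreeClassifier(random_state=0),
                         X_demo, y_demo, cv=10, scoring=scoring)

# cross_validate returns one array of fold scores per metric under 'test_<name>'
demo_scores = {name: results['test_' + name].mean() * 100 for name in scoring}
for name, value in demo_scores.items():
    print("%s: %0.5f" % (name, value))
```

The same pattern could be looped over the four attack categories and both classifiers to build the averages used in the comparison.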



In [None]:
# Accuracy
accuracy_knn_DoS = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
accuracy_knn_U2R = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
accuracy_knn_R2L = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
accuracy_knn_Probe = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')

# Precision
precision_knn_Dos = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision_macro')
precision_knn_U2R = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
precision_knn_R2L = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
precision_knn_Probe = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')

# Recall
recall_knn_Dos = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall_macro')
recall_knn_U2R = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
recall_knn_R2L = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
recall_knn_Probe = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')

# F1-score
f_knn_Dos = cross_val_score(knn_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1_macro')
f_knn_U2R = cross_val_score(knn_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
f_knn_R2L = cross_val_score(knn_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
f_knn_Probe = cross_val_score(knn_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')

# Calculate averages
knn_accuracy = np.mean([accuracy_knn_DoS.mean(), accuracy_knn_U2R.mean(), accuracy_knn_R2L.mean(), accuracy_knn_Probe.mean()])*100
knn_precision = np.mean([precision_knn_Dos.mean(), precision_knn_U2R.mean(), precision_knn_R2L.mean(), precision_knn_Probe.mean()])*100
knn_recall = np.mean([recall_knn_Dos.mean(), recall_knn_U2R.mean(), recall_knn_R2L.mean(), recall_knn_Probe.mean()])*100
knn_f_score = np.mean([f_knn_Dos.mean(), f_knn_U2R.mean(), f_knn_R2L.mean(), f_knn_Probe.mean()])*100



In [None]:
# The averaged scores (dtc_* and knn_*) are computed in the two previous cells.

# Labels and positions (named metric_names so the sklearn `metrics` import is not shadowed)
metric_names = ['Accuracy', 'Precision', 'Recall', 'F-score']
dtc_values = [dtc_accuracy, dtc_precision, dtc_recall, dtc_f_score]
knn_values = [knn_accuracy, knn_precision, knn_recall, knn_f_score]
bar_width = 0.35
index = np.arange(len(metric_names))

# Grouped bar chart: one pair of bars per metric
fig, ax = plt.subplots()
bar1 = ax.bar(index, dtc_values, bar_width, label='Decision Tree', color='blue')
bar2 = ax.bar(index + bar_width, knn_values, bar_width, label='KNN', color='red')
ax.set_ylim(92, 100)  # truncated y-axis: only the 92-100% range is shown

ax.set_xlabel('Metrics')
ax.set_ylabel('Scores (%)')
ax.set_title('Comparison of Decision Tree and KNN')
ax.set_xticks(index + bar_width / 2)
ax.set_xticklabels(metric_names)
ax.legend()

plt.show()


In [None]:
#SVM


In [None]:
from sklearn.svm import SVC
#DoS
SVM_DoS=SVC(kernel='linear', C=1.0, random_state=0)
SVM_DoS.fit(X_DoS, Y_DoS)


In [None]:
#U2R
SVM_U2R=SVC(kernel='linear', C=1.0, random_state=0)
SVM_U2R.fit(X_U2R, Y_U2R)


In [None]:
#R2L
SVM_R2L=SVC(kernel='linear', C=1.0, random_state=0)
SVM_R2L.fit(X_R2L, Y_R2L)


In [None]:
#Probe
SVM_Probe=SVC(kernel='linear', C=1.0, random_state=0)
SVM_Probe.fit(X_Probe, Y_Probe)


In [None]:
#Confusion Matrix (SVM)

In [None]:
#DoS
Y_DoS_pred=SVM_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#U2R
Y_U2R_pred=SVM_U2R.predict(X_U2R_test)
# Create confusion matrix
pd.crosstab(Y_U2R_test, Y_U2R_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#R2L
Y_R2L_pred=SVM_R2L.predict(X_R2L_test)
# Create confusion matrix
pd.crosstab(Y_R2L_test, Y_R2L_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#Probe
Y_Probe_pred=SVM_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#Accuracy, Precision, Recall, F1-measure(SVM)

In [None]:
#DoS
accuracy = cross_val_score(SVM_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
print("Accuracy for DoS: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(SVM_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision')
print("Precision for DoS: %0.5f " % (precision.mean()*100))
recall = cross_val_score(SVM_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall')
print("Recall for DoS: %0.5f " % (recall.mean()*100))
f = cross_val_score(SVM_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1')
print("F-measure for DoS: %0.5f " % (f.mean()*100))

In [None]:
#U2R
accuracy = cross_val_score(SVM_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
print("Accuracy for U2R: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(SVM_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
print("Precision for U2R: %0.5f " % (precision.mean()*100))
recall = cross_val_score(SVM_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
print("Recall for U2R: %0.5f " % (recall.mean()*100))
f = cross_val_score(SVM_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
print("F-measure for U2R : %0.5f " % (f.mean()*100))

In [None]:
#R2L
accuracy = cross_val_score(SVM_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
print("Accuracy for R2L: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(SVM_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
print("Precision for R2L: %0.5f " % (precision.mean()*100))
recall = cross_val_score(SVM_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
print("Recall for R2L: %0.5f " % (recall.mean()*100))
f = cross_val_score(SVM_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
print("F-measure for R2L: %0.5f " % (f.mean()*100))

In [None]:
#Probe
accuracy = cross_val_score(SVM_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')
print("Accuracy for Probe: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(SVM_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')
print("Precision for Probe: %0.5f " % (precision.mean()*100))
recall = cross_val_score(SVM_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')
print("Recall for Probe: %0.5f " % (recall.mean()*100))
f = cross_val_score(SVM_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')
print("F-measure for Probe: %0.5f " % (f.mean()*100))

In [None]:
#Logistic Regression

In [None]:
#DoS
from sklearn.linear_model import LogisticRegression
lr_DoS=LogisticRegression()
lr_DoS.fit(X_DoS, Y_DoS)


In [None]:
#U2R
lr_U2R=LogisticRegression()
lr_U2R.fit(X_U2R, Y_U2R)


In [None]:
#R2L
lr_R2L=LogisticRegression()
lr_R2L.fit(X_R2L, Y_R2L)

In [None]:
#Probe
lr_Probe=LogisticRegression()
lr_Probe.fit(X_Probe, Y_Probe)

In [None]:
#Confusion Matrix
#DoS
Y_DoS_pred=lr_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#U2R
Y_U2R_pred=lr_U2R.predict(X_U2R_test)
# Create confusion matrix
pd.crosstab(Y_U2R_test, Y_U2R_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#R2L
Y_R2L_pred=lr_R2L.predict(X_R2L_test)
# Create confusion matrix
pd.crosstab(Y_R2L_test, Y_R2L_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#Probe
Y_Probe_pred=lr_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#Accuracy, Precision, Recall, F1-measure(Logistic Regression)

In [None]:
#DoS
accuracy = cross_val_score(lr_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
print("Accuracy for DoS: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(lr_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision')
print("Precision for DoS: %0.5f " % (precision.mean()*100))
recall = cross_val_score(lr_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall')
print("Recall for DoS: %0.5f " % (recall.mean()*100))
f = cross_val_score(lr_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1')
print("F-measure for DoS: %0.5f " % (f.mean()*100))


In [None]:
#U2R
accuracy = cross_val_score(lr_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
print("Accuracy for U2R: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(lr_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
print("Precision for U2R: %0.5f " % (precision.mean()*100))
recall = cross_val_score(lr_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
print("Recall for U2R: %0.5f " % (recall.mean()*100))
f = cross_val_score(lr_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
print("F-measure for U2R : %0.5f " % (f.mean()*100))

In [None]:
#R2L
accuracy = cross_val_score(lr_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
print("Accuracy for R2L: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(lr_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
print("Precision for R2L: %0.5f " % (precision.mean()*100))
recall = cross_val_score(lr_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
print("Recall for R2L: %0.5f " % (recall.mean()*100))
f = cross_val_score(lr_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
print("F-measure for R2L: %0.5f " % (f.mean()*100))

In [None]:
#Probe
accuracy = cross_val_score(lr_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')
print("Accuracy for Probe: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(lr_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')
print("Precision for Probe: %0.5f " % (precision.mean()*100))
recall = cross_val_score(lr_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')
print("Recall for Probe: %0.5f " % (recall.mean()*100))
f = cross_val_score(lr_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')
print("F-measure for Probe: %0.5f " % (f.mean()*100))

In [None]:
#Ensemble Learning models

In [None]:
#Random Forest

In [None]:
#Dos
from sklearn.ensemble import RandomForestClassifier
rf_DoS=RandomForestClassifier(random_state = 0)
rf_DoS.fit(X_DoS, Y_DoS)



In [None]:
#U2R
rf_U2R=RandomForestClassifier(random_state = 0)
rf_U2R.fit(X_U2R, Y_U2R)


In [None]:
#R2L
rf_R2L=RandomForestClassifier(random_state = 0)
rf_R2L.fit(X_R2L, Y_R2L)


In [None]:
#Probe
rf_Probe=RandomForestClassifier(random_state = 0)
rf_Probe.fit(X_Probe, Y_Probe)


In [None]:
#Confusion Matrix

In [None]:
#DoS
Y_DoS_pred=rf_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#U2R
Y_U2R_pred=rf_U2R.predict(X_U2R_test)
# Create confusion matrix
pd.crosstab(Y_U2R_test, Y_U2R_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#R2L
Y_R2L_pred=rf_R2L.predict(X_R2L_test)
# Create confusion matrix
pd.crosstab(Y_R2L_test, Y_R2L_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#Probe
Y_Probe_pred=rf_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#Accuracy, Precision, Recall, F1-measure(Random Forest)

In [None]:
#DoS
accuracy = cross_val_score(rf_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
print("Accuracy for DoS: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(rf_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision')
print("Precision for DoS: %0.5f " % (precision.mean()*100))
recall = cross_val_score(rf_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall')
print("Recall for DoS: %0.5f " % (recall.mean()*100))
f = cross_val_score(rf_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1')
print("F-measure for DoS: %0.5f " % (f.mean()*100))

In [None]:
#U2R
accuracy = cross_val_score(rf_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
print("Accuracy for U2R: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(rf_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
print("Precision for U2R: %0.5f " % (precision.mean()*100))
recall = cross_val_score(rf_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
print("Recall for U2R: %0.5f " % (recall.mean()*100))
f = cross_val_score(rf_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
print("F-measure for U2R : %0.5f " % (f.mean()*100))

In [None]:
#R2L
accuracy = cross_val_score(rf_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
print("Accuracy for R2L: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(rf_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
print("Precision for R2L: %0.5f " % (precision.mean()*100))
recall = cross_val_score(rf_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
print("Recall for R2L: %0.5f " % (recall.mean()*100))
f = cross_val_score(rf_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
print("F-measure for R2L: %0.5f " % (f.mean()*100))

In [None]:
#Probe
accuracy = cross_val_score(rf_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')
print("Accuracy for Probe: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(rf_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')
print("Precision for Probe: %0.5f " % (precision.mean()*100))
recall = cross_val_score(rf_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')
print("Recall for Probe: %0.5f " % (recall.mean()*100))
f = cross_val_score(rf_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')
print("F-measure for Probe: %0.5f " % (f.mean()*100))

In [None]:
#AdaBoost

In [None]:
#DoS
from sklearn.ensemble import AdaBoostClassifier
ab_DoS=AdaBoostClassifier()
ab_DoS.fit(X_DoS, Y_DoS)


In [None]:
#U2R
ab_U2R=AdaBoostClassifier()
ab_U2R.fit(X_U2R, Y_U2R)

In [None]:
#R2L
ab_R2L=AdaBoostClassifier()
ab_R2L.fit(X_R2L, Y_R2L)

In [None]:
#Probe
ab_Probe=AdaBoostClassifier()
ab_Probe.fit(X_Probe, Y_Probe)

In [None]:
#Confusion Matrix
#DoS
Y_DoS_pred=ab_DoS.predict(X_DoS_test)
# Create confusion matrix
pd.crosstab(Y_DoS_test, Y_DoS_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#U2R
Y_U2R_pred=ab_U2R.predict(X_U2R_test)
# Create confusion matrix
pd.crosstab(Y_U2R_test, Y_U2R_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#R2L
Y_R2L_pred=ab_R2L.predict(X_R2L_test)
# Create confusion matrix
pd.crosstab(Y_R2L_test, Y_R2L_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])

In [None]:
#Probe
Y_Probe_pred=ab_Probe.predict(X_Probe_test)
# Create confusion matrix
pd.crosstab(Y_Probe_test, Y_Probe_pred, rownames=['Actual attacks'], colnames=['Predicted attacks'])


In [None]:
#Accuracy, Precision, Recall, F1-measure(AdaBoost)

In [None]:
#DoS
accuracy = cross_val_score(ab_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='accuracy')
print("Accuracy for DoS: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(ab_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='precision')
print("Precision for DoS: %0.5f " % (precision.mean()*100))
recall = cross_val_score(ab_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='recall')
print("Recall for DoS: %0.5f " % (recall.mean()*100))
f = cross_val_score(ab_DoS, X_DoS_test, Y_DoS_test, cv=10, scoring='f1')
print("F-measure for DoS: %0.5f " % (f.mean()*100))


In [None]:
#U2R
accuracy = cross_val_score(ab_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='accuracy')
print("Accuracy for U2R: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(ab_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='precision_macro')
print("Precision for U2R: %0.5f " % (precision.mean()*100))
recall = cross_val_score(ab_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='recall_macro')
print("Recall for U2R: %0.5f " % (recall.mean()*100))
f = cross_val_score(ab_U2R, X_U2R_test, Y_U2R_test, cv=10, scoring='f1_macro')
print("F-measure for U2R : %0.5f " % (f.mean()*100))

In [None]:
#R2L
accuracy = cross_val_score(ab_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='accuracy')
print("Accuracy for R2L: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(ab_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='precision_macro')
print("Precision for R2L: %0.5f " % (precision.mean()*100))
recall = cross_val_score(ab_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='recall_macro')
print("Recall for R2L: %0.5f " % (recall.mean()*100))
f = cross_val_score(ab_R2L, X_R2L_test, Y_R2L_test, cv=10, scoring='f1_macro')
print("F-measure for R2L: %0.5f " % (f.mean()*100))

In [None]:
#Probe
accuracy = cross_val_score(ab_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='accuracy')
print("Accuracy for Probe: %0.5f " % (accuracy.mean()*100))
precision = cross_val_score(ab_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='precision_macro')
print("Precision for Probe: %0.5f " % (precision.mean()*100))
recall = cross_val_score(ab_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='recall_macro')
print("Recall for Probe: %0.5f " % (recall.mean()*100))
f = cross_val_score(ab_Probe, X_Probe_test, Y_Probe_test, cv=10, scoring='f1_macro')
print("F-measure for Probe: %0.5f " % (f.mean()*100))