# ** Intrusion Detection System using kNN and ID3 on KDD cup data**

## *Data Description*

### **Attributes:**

duration: continuous.
protocol_type: symbolic.
service: symbolic.
flag: symbolic.
src_bytes: continuous.
dst_bytes: continuous.
land: symbolic.
wrong_fragment: continuous.
urgent: continuous.
hot: continuous.
num_failed_logins: continuous.
logged_in: symbolic.
num_compromised: continuous.
root_shell: continuous.
su_attempted: continuous.
num_root: continuous.
num_file_creations: continuous.
num_shells: continuous.
num_access_files: continuous.
num_outbound_cmds: continuous.
is_host_login: symbolic.
is_guest_login: symbolic.
count: continuous.
srv_count: continuous.
serror_rate: continuous.
srv_serror_rate: continuous.
rerror_rate: continuous.
srv_rerror_rate: continuous.
same_srv_rate: continuous.
diff_srv_rate: continuous.
srv_diff_host_rate: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
dst_host_same_srv_rate: continuous.
dst_host_diff_srv_rate: continuous.
dst_host_same_src_port_rate: continuous.
dst_host_srv_diff_host_rate: continuous.
dst_host_serror_rate: continuous.
dst_host_srv_serror_rate: continuous.
dst_host_rerror_rate: continuous.
dst_host_srv_rerror_rate: continuous.

### **Classes:**

back,buffer_overflow,ftp_write,guess_passwd,imap,ipsweep,land,loadmodule,multihop,neptune,nmap,normal,perl,phf,pod,portsweep,rootkit,satan,smurf,spy,teardrop,warezclient,warezmaster.

## *Preprocessing*

### **Identifing and Separating the target classes**

In [None]:
train_X=open("train_X","w");
train_y=open("train_y","w");
test_X=open("test_X","w");
test_y=open("test_y","w")

X1=open("C:\\Users\\ASUS\\Documents\\Project\\kddcup.data_10_percent_corrected","r")
X2=open("C:\\Users\\ASUS\\Documents\\Project\\corrected","r")
lines1 = [line.rstrip('\n\r') for line in X1]
lines2 = [line.rstrip('\n\r') for line in X2]


for line in lines1:
    line=str(line)
    k=line.split(',')
    i=k[41]
    del k[41]
    for j in range(41):
        train_X.write(k[j])
        if(j!=40):
            train_X.write(',')
    train_X.write("\n")
    if i=="back." or i=="land." or i=="neptune." or i=="pod." or i=="smurf." or i=="teardrop." :
        train_y.write("2")
    elif i=="buffer_overflow." or i=="loadmodule." or i=="perl." or i=="rootkit.":
        train_y.write("3")
    elif i=="ftp_write." or "guess_passwd." or i=="imap." or i=="multihop." or i=="phf." or i=="spy." or i=="warezclient." or i=="warezmaster.":
        train_y.write("4")
    elif i=="ipsweep." or i=="nmap." or i=="portsweep." or i=="satan." :
        train_y.write("1")
    train_y.write("\n")
    

    
    
for line in lines2:
    line=str(line)
    k=line.split(',')
    i=k[41]
    del k[41]
    for j in range(41):
        test_X.write(k[j])
        if(j!=40):
            test_X.write(',')
    test_X.write("\n")
    if i=="back." or i=="land." or i=="neptune." or i=="pod." or i=="smurf." or i=="teardrop." :
        test_y.write("2")
    elif i=="buffer_overflow." or i=="loadmodule." or i=="perl." or i=="rootkit.":
        test_y.write("3")
    elif i=="ftp_write." or "guess_passwd." or i=="imap." or i=="multihop." or i=="phf." or i=="spy." or i=="warezclient." or i=="warezmaster.":
        test_y.write("4")
    elif i=="ipsweep." or i=="nmap." or i=="portsweep." or i=="satan." :
        test_y.write("1")
    test_y.write("\n")
    


### **Encoding for symbolic(string) attributes**

In [2]:
from sklearn import preprocessing
import numpy as np
import pandas as pd

train_X=pd.read_csv("train_X",header=None)
x=open("train_y")
test_X=pd.read_csv("test_X",header=None)
y=open("test_y")
train_y = [line.rstrip('\n\r') for line in x]
test_y = [line.rstrip('\n\r') for line in y]


B=np.empty((494021, 41))
C=np.empty((311029, 41))
for i in range(41):
    le = preprocessing.LabelEncoder()
    A=train_X.iloc[:,i]
    le.fit(A)
    At=le.transform(A)
    B[:,i]=At


for i in range(41):
    le = preprocessing.LabelEncoder()
    A=test_X.iloc[:,i]
    le.fit(A)
    At=le.transform(A)
    C[:,i]=At 

    

In [4]:
print(B.shape)
print(C.shape)
print(len(train_y))
print(len(test_y))

(494021, 41)
(311029, 41)
494021
311029


# **kNN Implementation**

In [2]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, zero_one_loss,accuracy_score,classification_report
import pandas as pd

learner = KNeighborsClassifier(1, n_jobs=-1)
learner.fit(B, train_y)
pred_y = learner.predict(C)


### CONFUSION MATRIX FOR kNN

In [3]:
results_knn= confusion_matrix(test_y, pred_y)
print(results_knn)

[[222787      7    504]
 [     0     10     29]
 [   498     65  87129]]


### ZERO-ONE CLASSIFICATION LOSS FOR kNN

In [4]:
error_knn = zero_one_loss(test_y, pred_y)
print(error_knn)

0.00354629311093


### ACCURACY SCORE FOR kNN

In [5]:
accuracy_knn=accuracy_score(test_y,pred_y)
print(accuracy_knn)

0.996453706889


### REPORT MATRIX FOR kNN

In [6]:
report_knn=classification_report(test_y,pred_y)
print(report_knn)

             precision    recall  f1-score   support

          2       1.00      1.00      1.00    223298
          3       0.12      0.26      0.17        39
          4       0.99      0.99      0.99     87692

avg / total       1.00      1.00      1.00    311029



# ID3 Implementation

In [5]:
from sklearn import tree
from sklearn.metrics import confusion_matrix, zero_one_loss,accuracy_score,classification_report

clf = tree.DecisionTreeClassifier()
clf = clf.fit(B, train_y)
pred_y = clf.predict(C)


### CONFUSION MATRIX FOR ID3

In [6]:
results_id3 = confusion_matrix(test_y, pred_y)
print(results_id3)

[[222171      0   1127]
 [     0      2     37]
 [   538     34  87120]]


### ZERO-ONE CLASSIFICATION LOSS FOR ID3

In [7]:
error_id3 = zero_one_loss(test_y, pred_y)
print(error_id3)

0.00558147311022


### ACCURACY SCORE FOR ID3

In [8]:
accuracy_id3=accuracy_score(test_y,pred_y)
print(accuracy_id3)

0.99441852689


### REPORT MATRIX FOR ID3

In [9]:
report_id3=classification_report(test_y,pred_y)
print(report_id3)

             precision    recall  f1-score   support

          2       1.00      0.99      1.00    223298
          3       0.06      0.05      0.05        39
          4       0.99      0.99      0.99     87692

avg / total       0.99      0.99      0.99    311029

