# Python project: Detecting botnet activity using machine learning
BENAICH Ayyoub

Botnets have caused serious security risks and financial damage over the
years, yet existing network forensic techniques cannot identify and track
the sophisticated methods that current botnets use. This is because
commercial tools mainly depend on signature-based approaches, which cannot
discover new forms of botnet. Several studies in the literature have used
Machine Learning (ML) techniques to train and validate models for detecting
such attacks, but they still produce high false alarm rates, and
investigating the tracks of botnets remains a challenge.
The dataset can be downloaded from:
https://research.unsw.edu.au/projects/unsw-nb15-dataset

In [1]:
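# import the libraries used in this project
import pandas as pd
import sklearn
import sklearn.model_selection  # explicit import for sklearn.model_selection.train_test_split, used later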
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [2]:
df = pd.read_csv("C:/Users/acer/Downloads/UNSW_NB15_training-set.csv")  # load the dataset
df.head()

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 45 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   proto              82332 non-null  object 
 3   service            82332 non-null  object 
 4   state              82332 non-null  object 
 5   spkts              82332 non-null  int64  
 6   dpkts              82332 non-null  int64  
 7   sbytes             82332 non-null  int64  
 8   dbytes             82332 non-null  int64  
 9   rate               82332 non-null  float64
 10  sttl               82332 non-null  int64  
 11  dttl               82332 non-null  int64  
 12  sload              82332 non-null  float64
 13  dload              82332 non-null  float64
 14  sloss              82332 non-null  int64  
 15  dloss              82332 non-null  int64  
 16  sinpkt             823

In [4]:
# drop the unneeded column (attack_cat); the binary label column is used as the target
df.drop("attack_cat", axis=1, inplace=True) 

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 44 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   proto              82332 non-null  object 
 3   service            82332 non-null  object 
 4   state              82332 non-null  object 
 5   spkts              82332 non-null  int64  
 6   dpkts              82332 non-null  int64  
 7   sbytes             82332 non-null  int64  
 8   dbytes             82332 non-null  int64  
 9   rate               82332 non-null  float64
 10  sttl               82332 non-null  int64  
 11  dttl               82332 non-null  int64  
 12  sload              82332 non-null  float64
 13  dload              82332 non-null  float64
 14  sloss              82332 non-null  int64  
 15  dloss              82332 non-null  int64  
 16  sinpkt             823

In [8]:
# shuffle the rows so that botnet and non-botnet records are mixed
df=df.sample(frac=1).reset_index(drop=True)
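
The shuffle above produces a different row order on every run, so the later train/test split and the reported accuracies will vary between runs. A minimal sketch of a repeatable shuffle, on a toy frame; the seed `random_state=42` is an arbitrary assumption and is not part of the original notebook:

```python
import pandas as pd

# Toy stand-in for the UNSW-NB15 frame (values are illustrative only).
toy = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "label": [0, 0, 0, 1, 1, 1]})

# Same call as in the notebook, plus a fixed seed so the order is repeatable.
shuffled_a = toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Both shuffles yield the same row order because they share the seed.
print(shuffled_a.equals(shuffled_b))
```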

In [9]:
df.head()

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label
0,9903,1.810406,tcp,-,FIN,20,30,1094,19628,27.065753,...,1,1,1,0,0,0,1,1,0,1
1,48932,4.299962,tcp,smtp,FIN,66,28,62031,2094,21.628098,...,1,1,1,0,0,0,2,1,0,1
2,80611,1.659709,tcp,http,FIN,12,8,910,1324,11.44779,...,1,1,2,0,0,1,1,2,0,0
3,9148,0.616215,tcp,http,FIN,10,8,878,1420,27.587774,...,1,1,1,0,0,1,1,1,0,1
4,35351,0.116886,tcp,-,FIN,84,86,4862,79882,1445.853257,...,1,1,2,0,0,0,2,4,0,0


In [12]:
df_copy = df.copy()
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 44 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   proto              82332 non-null  object 
 3   service            82332 non-null  object 
 4   state              82332 non-null  object 
 5   spkts              82332 non-null  int64  
 6   dpkts              82332 non-null  int64  
 7   sbytes             82332 non-null  int64  
 8   dbytes             82332 non-null  int64  
 9   rate               82332 non-null  float64
 10  sttl               82332 non-null  int64  
 11  dttl               82332 non-null  int64  
 12  sload              82332 non-null  float64
 13  dload              82332 non-null  float64
 14  sloss              82332 non-null  int64  
 15  dloss              82332 non-null  int64  
 16  sinpkt             823

In [14]:
# separate the categorical and numerical attributes
cat_col=["proto","service","state"]
cat_attributes=df_copy[cat_col]
num_attributes = df_copy.drop(cat_col,axis=1)
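
The categorical column names are hard-coded above. As a hedged alternative, the non-numeric columns could be detected from their dtypes; the toy frame below is an assumption used only to keep the sketch self-contained:

```python
import pandas as pd

# Toy frame with the same mix of text and numeric columns as the dataset.
toy = pd.DataFrame({
    "proto": ["tcp", "udp"],
    "service": ["-", "http"],
    "state": ["FIN", "INT"],
    "spkts": [20, 2],
    "dur": [1.810406, 0.000011],
})

# Every column that is not numeric is treated as categorical.
cat_col = toy.select_dtypes(exclude="number").columns.tolist()
cat_attributes = toy[cat_col].copy()
num_attributes = toy.drop(cat_col, axis=1)

print(cat_col)
```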

In [15]:
cat_attributes.head()

Unnamed: 0,proto,service,state
0,tcp,-,FIN
1,tcp,smtp,FIN
2,tcp,http,FIN
3,tcp,http,FIN
4,tcp,-,FIN


In [16]:
num_attributes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 41 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   spkts              82332 non-null  int64  
 3   dpkts              82332 non-null  int64  
 4   sbytes             82332 non-null  int64  
 5   dbytes             82332 non-null  int64  
 6   rate               82332 non-null  float64
 7   sttl               82332 non-null  int64  
 8   dttl               82332 non-null  int64  
 9   sload              82332 non-null  float64
 10  dload              82332 non-null  float64
 11  sloss              82332 non-null  int64  
 12  dloss              82332 non-null  int64  
 13  sinpkt             82332 non-null  float64
 14  dinpkt             82332 non-null  float64
 15  sjit               82332 non-null  float64
 16  djit               823

In [17]:
# convert the categorical attributes to numeric codes
encoder = LabelEncoder()
cat_attributes = cat_attributes.copy()  # explicit copy, avoids pandas' SettingWithCopyWarning
for category in cat_col:
    cat_attributes[category + "_coded"] = encoder.fit_transform(cat_attributes[category])


In [18]:
cat_attributes

Unnamed: 0,proto,service,state,proto_coded,service_coded,state_coded
0,tcp,-,FIN,111,0,3
1,tcp,smtp,FIN,111,9,3
2,tcp,http,FIN,111,5,3
3,tcp,http,FIN,111,5,3
4,tcp,-,FIN,111,0,3
...,...,...,...,...,...,...
82327,tcp,smtp,FIN,111,9,3
82328,tcp,-,FIN,111,0,3
82329,tcp,-,FIN,111,0,3
82330,tcp,http,FIN,111,5,3
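
`LabelEncoder` assigns each category an arbitrary integer (`tcp` becomes 111 in the output above), which linear and distance-based models may interpret as an ordering. One possible alternative, sketched here on toy data, is one-hot encoding through the `ColumnTransformer` that the notebook imports but does not use. The toy frame and variable names are assumptions for illustration, and this is not the encoding used for the results reported in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame: three categorical columns and one numeric column.
toy = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", "udp"],
    "service": ["-", "http", "smtp", "-"],
    "state": ["FIN", "INT", "FIN", "INT"],
    "sttl": [62, 254, 62, 31],
})
cat_col = ["proto", "service", "state"]

# One-hot encode the categorical columns; pass numeric columns through unchanged.
transformer = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), cat_col)],
    remainder="passthrough",
)
encoded = transformer.fit_transform(toy)

# One output column per distinct category value, plus the numeric column.
print(encoded.shape)
```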


In [19]:
cat_attributes=cat_attributes.drop(["proto","service","state"],axis=1)


In [20]:

cat_attributes

Unnamed: 0,proto_coded,service_coded,state_coded
0,111,0,3
1,111,9,3
2,111,5,3
3,111,5,3
4,111,0,3
...,...,...,...
82327,111,9,3
82328,111,0,3
82329,111,0,3
82330,111,5,3


In [21]:
dataset = pd.concat([num_attributes,cat_attributes],axis=1)


In [22]:
dataset

Unnamed: 0,id,dur,spkts,dpkts,sbytes,dbytes,rate,sttl,dttl,sload,...,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,label,proto_coded,service_coded,state_coded
0,9903,1.810406,20,30,1094,19628,2.706575e+01,62,252,4.595654e+03,...,0,0,0,1,1,0,1,111,0,3
1,48932,4.299962,66,28,62031,2094,2.162810e+01,62,252,1.136605e+05,...,0,0,0,2,1,0,1,111,9,3
2,80611,1.659709,12,8,910,1324,1.144779e+01,62,252,4.024802e+03,...,0,0,1,1,2,0,0,111,5,3
3,9148,0.616215,10,8,878,1420,2.758777e+01,62,252,1.026914e+04,...,0,0,1,1,1,0,1,111,5,3
4,35351,0.116886,84,86,4862,79882,1.445853e+03,31,29,3.288674e+05,...,0,0,0,2,4,0,0,111,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82327,51947,2.261247,14,14,776,948,1.194032e+01,254,252,2.550805e+03,...,0,0,0,3,1,0,1,111,9,3
82328,33798,0.639092,40,34,2918,17186,1.142246e+02,31,29,3.562554e+04,...,0,0,0,5,4,0,0,111,0,3
82329,11599,0.914804,8,8,364,664,1.639695e+01,62,252,2.789669e+03,...,0,0,0,1,2,0,1,111,0,3
82330,3598,0.279502,10,8,796,930,6.082246e+01,62,252,2.052221e+04,...,0,0,1,1,1,0,1,111,5,3


In [23]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 44 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   spkts              82332 non-null  int64  
 3   dpkts              82332 non-null  int64  
 4   sbytes             82332 non-null  int64  
 5   dbytes             82332 non-null  int64  
 6   rate               82332 non-null  float64
 7   sttl               82332 non-null  int64  
 8   dttl               82332 non-null  int64  
 9   sload              82332 non-null  float64
 10  dload              82332 non-null  float64
 11  sloss              82332 non-null  int64  
 12  dloss              82332 non-null  int64  
 13  sinpkt             82332 non-null  float64
 14  dinpkt             82332 non-null  float64
 15  sjit               82332 non-null  float64
 16  djit               823

In [24]:
# correlation of each variable with the label
correlation=dataset.corr()
correlation["label"].sort_values(ascending=False)

label                1.000000
sttl                 0.504159
state_coded          0.459040
ct_dst_sport_ltm     0.393668
ct_src_dport_ltm     0.341513
rate                 0.328629
ct_state_ttl         0.318517
ct_srv_dst           0.292931
ct_srv_src           0.290195
ct_dst_src_ltm       0.279989
ct_src_ltm           0.276494
ct_dst_ltm           0.257995
service_coded        0.143634
sload                0.124548
sbytes               0.020641
sloss                0.006360
dur                 -0.001145
proto_coded         -0.003497
is_ftp_login        -0.016206
response_body_len   -0.016414
ct_ftp_cmd          -0.017138
trans_depth         -0.025804
djit                -0.027131
sjit                -0.027397
spkts               -0.027731
dbytes              -0.032632
dinpkt              -0.037585
dloss               -0.044399
smean               -0.061146
dpkts               -0.061515
ct_flw_http_mthd    -0.075028
dttl                -0.098591
is_sm_ips_ports     -0.117407
ackdat    
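
The notebook does not state how the features used in the next step were picked from this ranking. A minimal sketch of one way to do it, keeping the columns whose absolute correlation with `label` reaches a cut-off; the toy data and the cut-off of 0.1 are assumptions:

```python
import pandas as pd

# Toy data: sttl and swin track the label, dur does not.
toy = pd.DataFrame({
    "sttl":  [62, 254, 62, 254, 31, 254],
    "dur":   [0.5, 0.4, 0.1, 0.2, 0.3, 0.3],
    "swin":  [255, 0, 255, 0, 255, 0],
    "label": [0, 1, 0, 1, 0, 1],
})

# Correlation of every feature with the label (the label itself is excluded).
corr_with_label = toy.corr()["label"].drop("label")

threshold = 0.1  # hypothetical cut-off, not taken from the notebook
selected = corr_with_label[corr_with_label.abs() >= threshold].index.tolist()

print(selected)
```

Using the absolute value keeps strongly negative correlations as well as strongly positive ones.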

In [25]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82332 entries, 0 to 82331
Data columns (total 44 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 82332 non-null  int64  
 1   dur                82332 non-null  float64
 2   spkts              82332 non-null  int64  
 3   dpkts              82332 non-null  int64  
 4   sbytes             82332 non-null  int64  
 5   dbytes             82332 non-null  int64  
 6   rate               82332 non-null  float64
 7   sttl               82332 non-null  int64  
 8   dttl               82332 non-null  int64  
 9   sload              82332 non-null  float64
 10  dload              82332 non-null  float64
 11  sloss              82332 non-null  int64  
 12  dloss              82332 non-null  int64  
 13  sinpkt             82332 non-null  float64
 14  dinpkt             82332 non-null  float64
 15  sjit               82332 non-null  float64
 16  djit               823

In [27]:
# keep the features chosen from the correlation ranking; label is the target
selected_features = [
    "sttl", "state_coded", "ct_dst_sport_ltm", "ct_src_dport_ltm", "rate",
    "ct_state_ttl", "ct_srv_dst", "ct_srv_src", "ct_dst_src_ltm", "ct_src_ltm",
    "ct_dst_ltm", "service_coded", "sload", "is_sm_ips_ports", "ackdat",
    "sinpkt", "tcprtt", "synack", "dmean", "dload", "stcpb", "dtcpb",
    "dwin", "swin",
]
dataset_X = dataset[selected_features]
dataset_Y = dataset[["label"]]

In [28]:
dataset_X


Unnamed: 0,sttl,state_coded,ct_dst_sport_ltm,ct_src_dport_ltm,rate,ct_state_ttl,ct_srv_dst,ct_srv_src,ct_dst_src_ltm,ct_src_ltm,...,ackdat,sinpkt,tcprtt,synack,dmean,dload,stcpb,dtcpb,dwin,swin
0,62,3,1,1,2.706575e+01,1,1,1,1,1,...,0.075025,95.284523,0.143635,0.068610,654,8.384418e+04,1976232986,2379723759,255,255
1,62,3,1,1,2.162810e+01,1,1,1,1,2,...,0.090015,66.151109,0.161188,0.071173,75,3.758173e+03,1096967575,3369873993,255,255
2,62,3,1,1,1.144779e+01,1,2,2,2,1,...,0.056801,150.882631,0.153605,0.096804,166,5.586521e+03,2633385448,1286563478,255,255
3,62,3,1,1,2.758777e+01,1,1,1,1,1,...,0.024033,68.468333,0.093009,0.068976,178,1.613722e+04,3645849180,1292509503,255,255
4,31,3,1,1,1.445853e+03,0,4,8,2,2,...,0.000130,1.403916,0.000627,0.000497,929,5.403830e+06,2679777980,535721252,255,255
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82327,254,3,1,1,1.194032e+01,1,1,1,2,3,...,0.053294,173.942082,0.081645,0.028351,68,3.116865e+03,3744908533,2121155324,255,255
82328,31,3,1,1,1.142246e+02,0,4,10,3,5,...,0.000140,16.386974,0.000708,0.000568,505,2.088087e+05,2910846188,761519431,255,255
82329,62,3,1,1,1.639695e+01,1,2,1,1,1,...,0.182784,128.581281,0.329822,0.147038,83,5.080870e+03,1435491712,2249386120,255,255
82330,62,3,1,1,6.082246e+01,1,1,1,1,1,...,0.012571,31.055778,0.030662,0.018091,116,2.329858e+04,1586511101,228371412,255,255


In [23]:
dataset_Y

Unnamed: 0,label
0,0
1,0
2,1
3,0
4,0
...,...
82327,1
82328,1
82329,1
82330,1


In [30]:
X_train,X_test,Y_train,Y_test = sklearn.model_selection.train_test_split(dataset_X,dataset_Y,test_size=0.2)
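
The split above is random and unseeded, so each run evaluates the models on a different test set. A hedged sketch of a repeatable split that also preserves the class ratio in both parts; `stratify` and `random_state=0` are assumptions added for illustration, and the toy data stands in for `dataset_X` and `dataset_Y`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and a balanced binary target.
toy_X = pd.DataFrame({"sttl": range(20), "rate": range(20, 40)})
toy_y = pd.Series([0] * 10 + [1] * 10, name="label")

# stratify keeps the 0/1 ratio identical in the train and test parts;
# random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```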

In [31]:
# using logistic regression
logRegression = LogisticRegression()
logRegression.fit(X_train, Y_train.values.ravel())  # pass y as a 1-D array to avoid the shape warning
predictions = logRegression.predict(X_test)


In [32]:
# the accuracy is relatively low: logistic regression is a simple linear model, and the features were not scaled
print("Logistic Regression Accuracy : ", metrics.accuracy_score(Y_test,predictions))

Logistic Regression Accuracy :  0.7495597255116293
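
The selected features span very different ranges (`stcpb` and `dtcpb` reach the billions, while `ackdat` stays below 1), which can hinder both logistic regression and K-NN. A minimal sketch, on synthetic data, of standardising the features in a pipeline before fitting. The data, the `max_iter=1000` setting and the variable names are assumptions; no claim is made about the accuracy this would give on UNSW-NB15.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem; one column is inflated to mimic a field like stcpb.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] *= 1e9

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training part only, then applied to the test part.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_train, y_train)

accuracy = metrics.accuracy_score(y_test, scaled_logreg.predict(X_test))
print("Scaled Logistic Regression Accuracy : ", accuracy)
```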


In [33]:
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train.values.ravel())  # pass y as a 1-D array to avoid the shape warning
knn_predictions = knn.predict(X_test)


In [34]:
# accuracy of the K-NN model
print("K-Nearest Neighbours Accuracy : ", metrics.accuracy_score(Y_test,knn_predictions))

K-Nearest Neighbours Accuracy :  0.8061577700856258
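
A single 80/20 split gives one accuracy figure that depends on which rows land in the test set. As a hedged illustration, k-fold cross-validation reports one score per fold; the synthetic data and the choice of 5 folds are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for dataset_X / dataset_Y.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Five accuracy scores, one per held-out fold.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5, scoring="accuracy")

print("mean accuracy:", scores.mean(), "std:", scores.std())
```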


In [35]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, Y_train.values.ravel())  # pass y as a 1-D array to avoid the shape warning
rf_predictions = rf.predict(X_test)


In [36]:
# accuracy of the random forest model
print("Random Forest Accuracy : ", metrics.accuracy_score(Y_test,rf_predictions))

Random Forest Accuracy :  0.9712758851035405
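
A fitted random forest also exposes `feature_importances_`, which can be compared with the correlation ranking computed earlier. A minimal sketch on synthetic data; the feature names `f0` to `f4` and the seed are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with five features, two of which are informative.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3", "f4"]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised so that they sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```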


In [37]:
dt = DecisionTreeClassifier()
dt.fit(X_train,Y_train)
dt_predictions = dt.predict(X_test)

In [38]:
# accuracy of the decision tree model
print("Decision Tree accuracy : ", metrics.accuracy_score(Y_test,dt_predictions))

Decision Tree accuracy :  0.9564583712880306
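
Accuracy alone does not show the false alarm rate that the introduction identifies as the main weakness of ML-based detection. A minimal sketch of deriving it from a confusion matrix; the labels and predictions below are made-up toy values, not results from this notebook:

```python
from sklearn import metrics

# Toy ground truth and predictions (1 = attack, 0 = normal).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 1]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

false_alarm_rate = fp / (fp + tn)  # share of normal traffic flagged as attack
detection_rate = tp / (tp + fn)    # share of attacks that were detected

print("False alarm rate : ", false_alarm_rate)
print("Detection rate : ", detection_rate)
```

The same two lines could be applied to `Y_test` and any of the prediction arrays above, for example `rf_predictions`.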
