# Introduction and Purpose
We want to build a model that accurately predicts whether a transmission is benign or contains a malicious interruption.

I trained several machine learning algorithms and compared their accuracy to pick the best-performing model.

Algorithms Used:

- Logistic Regression
- KNN Classifier
- Random Forest Classifier
- Multi-layer Perceptron (MLP) Classifier

# Citation
Reference to the article where the dataset was first described and used:
Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, D. Breitenbacher, A. Shabtai, and Y. Elovici, "N-BaIoT: Network-based Detection of IoT Botnet Attacks Using Deep Autoencoders," IEEE Pervasive Computing, Special Issue - Securing the IoT (July/Sep 2018).

In [30]:
#Importing all required libraries
import pandas as pd 
import numpy as np
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn import neighbors, metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score,roc_curve,auc,classification_report,confusion_matrix

In [45]:
#Loading Data
BenignData = pd.read_csv("benign_traffic.csv")
#BenignData.head(5)
JunkData = pd.read_csv("junk.csv")
#JunkData.head(5)
ComboData = pd.read_csv("combo.csv")
#ComboData.head(5)
ScanData = pd.read_csv("scan.csv")
#ScanData.head(5)
TcpData = pd.read_csv("tcp.csv")
#TcpData.head(5)
UdpData = pd.read_csv("udp.csv")
#UdpData.head(5)

In [47]:
#Adding a binary label column: 0 = benign, 1 = threat
BenignData["output"] = 0
JunkData["output"] = 1
ComboData["output"] = 1
ScanData["output"] = 1
TcpData["output"] = 1
UdpData["output"] = 1

In [48]:
#Combining benign data with malicious data
MixedData = pd.concat([BenignData,JunkData,ComboData,ScanData,TcpData,UdpData], axis=0)
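
Each file is read separately, so every frame passed to `pd.concat` carries its own 0-based index and the combined frame repeats index labels unless the index is reset. The sketch below shows one way to reset it and to check the class balance after combining. The toy frames and the names `benign`, `attack`, `mixed`, and `class_counts` are assumptions for illustration, not the N-BaIoT data.

```python
import pandas as pd

# Toy stand-ins for the per-file frames (values are made up for illustration).
benign = pd.DataFrame({"MI_dir_L5_weight": [1.0, 1.1, 0.9], "output": 0})
attack = pd.DataFrame({"MI_dir_L5_weight": [80.0, 95.0], "output": 1})

# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each source frame's own row labels.
mixed = pd.concat([benign, attack], axis=0, ignore_index=True)

# Count rows per class to see how balanced the combined data is.
class_counts = mixed["output"].value_counts().sort_index()
print(class_counts)
```

Checking the counts early is useful here because the threat class is much larger than the benign class, which affects how accuracy should be read later.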

In [58]:
#We standardize the data to bring all the variables onto a comparable scale.
#Here we use the z-score: subtract each column's mean and divide by its standard deviation.
#Note: this is applied to every column of MixedData, including "output",
#so the standardized label is also a column of StandardizedData_array.
StandardizedData=(MixedData-MixedData.mean())/(MixedData.std())
StandardizedData_array=np.array(StandardizedData)
display(StandardizedData.head())

Unnamed: 0,MI_dir_L5_weight,MI_dir_L5_mean,MI_dir_L5_variance,MI_dir_L3_weight,MI_dir_L3_mean,MI_dir_L3_variance,MI_dir_L1_weight,MI_dir_L1_mean,MI_dir_L1_variance,MI_dir_L0.1_weight,...,HpHp_L0.1_covariance,HpHp_L0.1_pcc,HpHp_L0.01_weight,HpHp_L0.01_mean,HpHp_L0.01_std,HpHp_L0.01_magnitude,HpHp_L0.01_radius,HpHp_L0.01_covariance,HpHp_L0.01_pcc,output
0,-0.602464,-0.455451,-0.087571,-0.598277,-0.509059,-0.09529,-0.586145,-0.585205,-0.117537,-0.570299,...,-0.016013,0.029565,-0.136628,-0.310015,-0.131054,-0.328842,-0.026194,-0.016724,0.128298,-2.844653
1,-0.602464,0.171191,-0.087571,-0.598277,0.202352,-0.09529,-0.586145,0.237563,-0.117537,-0.570299,...,-0.016013,0.029565,-0.136628,0.104014,-0.131054,-0.048022,-0.026194,-0.016724,0.128298,-2.844653
2,-0.602464,1.424447,-0.087568,-0.598262,1.622858,-0.095079,-0.58579,1.709798,-0.103737,-0.570014,...,-0.016013,0.029565,-0.136628,0.93207,-0.131054,0.513617,-0.026194,-0.016724,0.128298,-2.844653
3,-0.602464,-0.455451,-0.087571,-0.598277,-0.509059,-0.09529,-0.586145,-0.585205,-0.117537,-0.57016,...,-0.016013,0.029565,0.096192,-0.310015,-0.131054,-0.328842,-0.026194,-0.016724,0.128298,-2.844653
4,-0.602464,32.756584,-0.087571,-0.598277,37.195755,-0.09529,-0.586145,43.021504,-0.117537,-0.570299,...,-0.016013,0.029565,0.699982,13.762481,15.092712,9.21599,2.016339,-0.016724,0.128298,-2.844653
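
The displayed frame still has `output` as its last column, so the array built from it carries the label as a feature, which on its own can produce perfect test scores. The sketch below shows one way to keep the label out of the feature matrix before applying the same z-score. The toy frame and the names `toy`, `feature_cols`, `X`, and `y` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for MixedData (values are made up for illustration).
toy = pd.DataFrame({
    "MI_dir_L5_weight": [1.0, 1.2, 80.0, 95.0],
    "MI_dir_L5_mean": [60.0, 62.0, 74.0, 70.0],
    "output": [0, 0, 1, 1],
})

# Separate the label from the features before any scaling.
feature_cols = [c for c in toy.columns if c != "output"]
features = toy[feature_cols]

# z-score each feature column: (x - column mean) / column std
standardized = (features - features.mean()) / features.std()

X = standardized.to_numpy()      # feature matrix, label excluded
y = toy["output"].to_numpy()     # unscaled 0/1 labels
print(X.shape, y.shape)
```

With this layout the models only see the traffic statistics, so their scores reflect how well those statistics separate benign from malicious records.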


In [16]:
#Extracting the unscaled 0/1 labels as a 1-D array
output=np.array(MixedData.output).flatten()

In [17]:
#Splitting the concatenated data into training and testing sets (70-30)
#Benign and malicious records are split together so both classes appear in each set
X_train, X_test, y_train, y_test = train_test_split(StandardizedData_array, output, test_size = 0.3, random_state = 1)
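
A common variation on this split is to stratify on the label and to fit the scaling on the training portion only, so that test-set statistics do not influence the features. The sketch below illustrates this on synthetic data and is not the notebook's own procedure. It swaps the manual z-score for scikit-learn's `StandardScaler`, which divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` defaults to `ddof=1`. The array sizes and the 20:180 class ratio are assumptions chosen to roughly mimic the imbalance in the data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and imbalanced labels (assumed, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 5))
y = np.array([0] * 20 + [1] * 180)

# stratify=y keeps the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Learn the mean and standard deviation from the training split only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.shape, X_test_s.shape)
```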

# Models Used

In [24]:
LR =  LogisticRegression()
RF =  RandomForestClassifier(n_estimators=15)
KNN =  KNeighborsClassifier()
MLP = MLPClassifier(hidden_layer_sizes=(1024,1024,))

In [25]:
#Function to train and predict various models
def train_predict(model, X_train, y_train, X_test, y_test): 
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test,y_pred)
    report = classification_report(y_test,y_pred)
    return accuracy,report
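
Accuracy and the classification report can be complemented by the confusion matrix and the ROC AUC, which are often more informative when one class dominates. The sketch below is a hypothetical helper named `evaluate`, not part of the notebook, run on synthetic data from `make_classification`. It assumes the model exposes `predict_proba`, which all four scikit-learn classifiers used here do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return (confusion matrix, ROC AUC) on the test set."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Probability of the positive class (label 1) for the ROC curve.
    scores = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    return confusion_matrix(y_test, y_pred), auc(fpr, tpr)


# Synthetic, imbalanced two-class data (assumed, for illustration only).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.1, 0.9], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

cm, roc_auc = evaluate(
    LogisticRegression(max_iter=1000), X_train, y_train, X_test, y_test
)
print(cm)
print(f"ROC AUC: {roc_auc:.3f}")
```

Rows of the confusion matrix are true classes and columns are predicted classes, so the off-diagonal entries show directly how many benign records were flagged and how many threats were missed.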

In [26]:
#Suppressing FutureWarning messages raised by the libraries
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [63]:
LR_acc,LR_report = train_predict(LR,X_train,y_train,X_test,y_test)
print("The accuracy score of Logistic Regression is %f"%LR_acc)
print("Classification report :-")
print(LR_report)

The accuracy score of Logistic Regression is 1.000000
Classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11737
           1       1.00      1.00      1.00     94913

    accuracy                           1.00    106650
   macro avg       1.00      1.00      1.00    106650
weighted avg       1.00      1.00      1.00    106650



In [28]:
RF_acc,RF_report = train_predict(RF,X_train,y_train,X_test,y_test)
print("The accuracy score of Random Forest Classifier is %f"%RF_acc)
print("Classification report :-")
print(RF_report)

The accuracy score of Random Forest Classifier is 1.000000
Classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11737
           1       1.00      1.00      1.00     94913

    accuracy                           1.00    106650
   macro avg       1.00      1.00      1.00    106650
weighted avg       1.00      1.00      1.00    106650



In [64]:
MLP_acc,MLP_report = train_predict(MLP,X_train,y_train,X_test,y_test)
print("The accuracy score of Multilayer Perceptron is %f"%MLP_acc)
print("Classification report :-")
print(MLP_report)

The accuracy score of Multilayer Perceptron is 0.999991
Classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11737
           1       1.00      1.00      1.00     94913

    accuracy                           1.00    106650
   macro avg       1.00      1.00      1.00    106650
weighted avg       1.00      1.00      1.00    106650



In [65]:
KNN_acc,KNN_report = train_predict(KNN,X_train,y_train,X_test,y_test)
print("The accuracy score of K-nearest neighbour is %f"%KNN_acc)
print("Classification report :-")
print(KNN_report)

The accuracy score of K-nearest neighbour is 0.999916
Classification report :-
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11737
           1       1.00      1.00      1.00     94913

    accuracy                           1.00    106650
   macro avg       1.00      1.00      1.00    106650
weighted avg       1.00      1.00      1.00    106650
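
To put the four scores in context, the short calculation below uses only the numbers printed in the reports: the class supports (11737 benign, 94913 threat) and the reported accuracies. It computes the accuracy of always predicting the majority class and converts each reported accuracy into an approximate number of misclassified test records. The counts are approximate because the printed accuracies are rounded to six decimals; the variable names are assumptions for illustration.

```python
# Test-set class supports taken from the classification reports.
support = {0: 11737, 1: 94913}
n_test = sum(support.values())

# Accuracy obtained by always predicting the majority class (1 = threat).
baseline_accuracy = max(support.values()) / n_test

# Reported accuracies, converted to approximate misclassification counts.
reported = {
    "Logistic Regression": 1.000000,
    "Random Forest": 1.000000,
    "Multilayer Perceptron": 0.999991,
    "K-nearest neighbour": 0.999916,
}
errors = {name: round((1 - acc) * n_test) for name, acc in reported.items()}

print(f"Test records: {n_test}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
for name, n_err in errors.items():
    print(f"{name}: about {n_err} misclassified")
```

Because nearly 89% of the test records are threats, accuracy alone starts from a high floor, and the per-class precision and recall in the reports are the more telling figures.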

