# Supervised ML methods for anomaly detection in IoT to enhance network security
## Part 3 - DATA TRAINING

The IoT-23 dataset is a collection of network traffic captured from Internet of Things (IoT) devices. It includes 20 malware captures executed on IoT devices and 3 hotspot captures of benign IoT device traffic. The 3 hotspot captures were excluded during data cleaning because they were not considered relevant to the specific analysis being performed.

In this notebook, we load the processed dataset file and use it to train several classification models.

> **INPUT:** the prepared dataset CSV file, as cleaned and processed in the previous phases. <br>
> **OUTPUT:** a comparison of the prediction accuracy and performance of multiple machine learning classification algorithms.

***

In [1]:
# Import necessary libraries and modules
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, confusion_matrix, recall_score, accuracy_score, f1_score
from statistics import mean
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from joblib import dump
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import time
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

In [2]:
# Set display options
pd.set_option('display.max_columns', None)
pd.set_option("display.float", "{:.2f}".format)

In [3]:
# Read the dataset
data_df = pd.read_csv('../CSV-data/processed/iot23_processed.csv', index_col=0)

In [4]:
# Check dataset shape
data_df.shape

(1444706, 50)

In [5]:
# Check dataset head
data_df.head()

Unnamed: 0,id.orig_h,id.orig_p,id.resp_h,id.resp_p,duration,orig_bytes,resp_bytes,missed_bytes,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,label,proto_icmp,proto_tcp,proto_udp,service_dhcp,service_dns,service_http,service_irc,service_ssh,service_ssl,conn_state_OTH,conn_state_REJ,conn_state_RSTO,conn_state_RSTOS0,conn_state_RSTR,conn_state_RSTRH,conn_state_S0,conn_state_S1,conn_state_S2,conn_state_S3,conn_state_SF,conn_state_SH,conn_state_SHR,history_C,history_D,history_Dd,history_Other,history_S,history_ShADadfF,history_ShADafF,history_ShADafr,history_ShAdDaFf,history_ShAdDaFr,history_ShAdDaf,history_ShAdDafF,history_ShAdDaft,history_ShAdfDr,history_Sr
0,0.86,0.61,0.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.86,0.88,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.86,0.91,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.86,0.91,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.86,0.55,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
# Drop the one-hot encoded history_* columns
data_df.drop(columns=["history_C","history_D","history_Dd","history_Other","history_S","history_ShADadfF","history_ShADafF","history_ShADafr","history_ShAdDaFf","history_ShAdDaFr","history_ShAdDaf","history_ShAdDafF","history_ShAdDaft","history_ShAdfDr","history_Sr"], inplace=True)
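
The dropped columns all share the `history_` prefix, so the same step can be written without listing each name. A minimal sketch on a toy frame (the frame and its values are made up for illustration; only the column-name pattern comes from the notebook):

```python
import pandas as pd

# Toy frame reusing a few of the notebook's column names; values are illustrative only.
toy_df = pd.DataFrame({
    "duration": [0.0, 0.1],
    "label": [1.0, 0.0],
    "history_C": [0.0, 1.0],
    "history_S": [1.0, 0.0],
})

# Collect every column whose name starts with "history_" and drop them in one call.
history_cols = [col for col in toy_df.columns if col.startswith("history_")]
toy_df = toy_df.drop(columns=history_cols)

print(list(toy_df.columns))
```

Selecting by prefix keeps the cell short and still works if the set of `history_*` categories changes after re-running the preprocessing.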

In [7]:
# Check dataset head
data_df.head()

Unnamed: 0,id.orig_h,id.orig_p,id.resp_h,id.resp_p,duration,orig_bytes,resp_bytes,missed_bytes,orig_pkts,orig_ip_bytes,resp_pkts,resp_ip_bytes,label,proto_icmp,proto_tcp,proto_udp,service_dhcp,service_dns,service_http,service_irc,service_ssh,service_ssl,conn_state_OTH,conn_state_REJ,conn_state_RSTO,conn_state_RSTOS0,conn_state_RSTR,conn_state_RSTRH,conn_state_S0,conn_state_S1,conn_state_S2,conn_state_S3,conn_state_SF,conn_state_SH,conn_state_SHR
0,0.86,0.61,0.18,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.86,0.88,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.86,0.91,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.86,0.91,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.86,0.55,0.75,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Split data into independent and dependent variables
data_X = data_df.drop("label", axis=1)
data_y = data_df["label"]
X_train, X_test, y_train, y_test = train_test_split(data_X, data_y, test_size=0.2, random_state=100)
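
With imbalanced labels, a plain random split can shift the class ratio slightly between the two partitions. `train_test_split` accepts a `stratify` argument that preserves the ratio. A small sketch on synthetic labels (the 90/10 ratio is an assumption chosen for illustration, not the IoT-23 ratio):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 900 ones and 100 zeros (illustrative only).
y_demo = np.array([1] * 900 + [0] * 100)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100, stratify=y_demo
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```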
    

In [9]:
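# Standardize the features with StandardScaler (fitted on the training split only)
# Note: the cross-validation loop below re-splits data_X directly, so these scaled arrays are not reused there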
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
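
Scaling can also be applied inside cross-validation by wrapping the scaler and the estimator in a pipeline, so that the scaler is refitted on each training fold. The sketch below uses a small synthetic dataset as a stand-in for the real features; `X_demo`, `y_demo`, and `scaled_model` are illustrative names, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Small synthetic stand-in for the real feature matrix (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# The pipeline fits the scaler on each training fold only, so no statistics
# from the held-out fold leak into the scaling step.
scaled_model = make_pipeline(StandardScaler(), LinearSVC(dual=False))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_f1 = cross_val_score(scaled_model, X_demo, y_demo, cv=cv, scoring="f1")

print("mean F1 over folds:", fold_f1.mean())
```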

In [10]:
# Initialize classification models
classifiers = [
    # Gaussian Naive Bayes, which models each feature as normally distributed within each class.
    ("Naive Bayes", GaussianNB()),

    # The distance-based KNN classifier with a default n_neighbors=5.
    ("K-Nearest Neighbors", KNeighborsClassifier()),

    # We use the Decision Tree with its default parameters, including the "Gini Impurity" to measure the quality of splits and ccp_alpha=0 (no pruning is performed). 
    ("Decision Tree", DecisionTreeClassifier()),
    
    # The Random Forest model with its default of 100 base estimators (n_estimators=100).
    ("Random Forest", RandomForestClassifier()), 

    # The linear Support Vector Classifier; dual=False is preferred when n_samples > n_features.
    ("Support Vector Classifier", LinearSVC(dual=False)),
  
    # The ANN classifier.
    ("Artificial Neural Network (ANN)", MLPClassifier(max_iter=500)),
   
    # The AdaBoost classifier.
    ("AdaBoost", AdaBoostClassifier(algorithm='SAMME')),   
    
    # The gradient-boosted tree ensemble from XGBoost, with its default parameters.
    ("XGBoost", xgb.XGBClassifier()),
]

In [11]:
# Initialize the cross-validator with 5 splits and sample shuffling activated
skf_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
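
Stratification means every fold keeps the same class proportions as the full dataset. A minimal sketch on synthetic labels (the 80/20 ratio is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels: 80 positives and 20 negatives (illustrative only).
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Share of positive labels in the held-out part of each fold.
fold_ratios = [y_demo[test_idx].mean() for _, test_idx in skf.split(X_demo, y_demo)]
print(fold_ratios)
```

Each held-out fold contains 16 positives and 4 negatives, matching the overall 80/20 ratio.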

In [12]:
print("Model Training Started!")
# Initialize the results summary
classification_results = pd.DataFrame(index=[c[0] for c in classifiers], columns=["Accuracy", "TN", "FP", "FN", "TP", "Recall", "Precision", "F1", "Time(in sec)"])

# Iterate over the estimators
for est_name, est_object in classifiers:
    start = time.time()
    print(f"### [{est_name}]: Processing ...")
    
    # Initialize the results for each classifier
    accuracy_scores = []
    confusion_matrices = []
    recall_scores = []
    precision_scores = []
    f1_scores = []
    
    # Initialize best model object to be saved
    models_path = "../applied-ML-methods/"
    best_model = None
    best_f1 = -1

    # Iterate over the obtained folds
    for train_index, test_index in skf_cv.split(data_X, data_y):

        # Get train and test samples from the cross-validation model
        X_train, X_test = data_X.iloc[train_index], data_X.iloc[test_index]
        y_train, y_test = data_y.iloc[train_index], data_y.iloc[test_index]
        
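        # Train the model
        est_object.fit(X_train.values, y_train.values)
        # Timestamp after fitting; only the last fold's value is kept, so the reported time
        # spans all five fits plus the prediction and scoring of the earlier folds
        end = time.time()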
        
        # Predict the test samples
        y_pred = est_object.predict(X_test.values)
        
        # Calculate and register accuracy metrics
        accuracy_scores.append(accuracy_score(y_test, y_pred))
        confusion_matrices.append(confusion_matrix(y_test, y_pred))
        recall_scores.append(recall_score(y_test, y_pred))
        precision_scores.append(precision_score(y_test, y_pred))
        est_f1_score = f1_score(y_test, y_pred)
        f1_scores.append(est_f1_score)
        
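        # Compare with best performing model.
        # est_object is refitted in place on every fold, so a copy is stored;
        # a plain reference would always end up holding the last fold's fit.
        if best_f1 < est_f1_score:
            from copy import deepcopy  # local import keeps this cell self-contained
            best_model = deepcopy(est_object)
            best_f1 = est_f1_score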
    
    print('time cost: ', end - start, 'seconds')   
    # Summarize the results for all folds for each classifier
    tn, fp, fn, tp = sum(confusion_matrices).ravel()
    classification_results.loc[est_name] = [mean(accuracy_scores),tn,fp,fn,tp,mean(recall_scores),mean(precision_scores),mean(f1_scores),(end-start)]
    
    # Save the best performing model
    if best_model is not None:
        model_name = est_name.replace(' ', '_').replace('-', '_').lower()
        # joblib writes a binary pickle, so use a matching extension rather than .csv
        model_file = model_name + ".joblib"
        dump(best_model, models_path + model_file)
    
print("Model Training Finished!")

Model Training Started!
### [Naive Bayes]: Processing ...
time cost:  6.735166788101196 seconds
### [K-Nearest Neighbors]: Processing ...
time cost:  1152.9316101074219 seconds
### [Decision Tree]: Processing ...
time cost:  30.42833185195923 seconds
### [Random Forest]: Processing ...
time cost:  386.7485637664795 seconds
### [Support Vector Classifier]: Processing ...
time cost:  55.1986563205719 seconds
### [Artificial Neural Network (ANN)]: Processing ...
time cost:  6438.087689876556 seconds
### [AdaBoost]: Processing ...
time cost:  187.82765364646912 seconds
### [XGBoost]: Processing ...
time cost:  31.387317419052124 seconds
Model Training Finished!
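
Scikit-learn estimators are refitted in place: calling `fit` again on the same object overwrites what it learned before. When the model from the best fold has to be kept, each fold needs its own object. The sketch below shows one way to do that with `sklearn.base.clone` on synthetic data; the names `template`, `fold_model`, and `fitted_models` are illustrative.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
template = DecisionTreeClassifier(random_state=0)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

best_model, best_f1 = None, -1.0
fitted_models = []
for train_idx, test_idx in cv.split(X_demo, y_demo):
    # clone() returns an unfitted copy with the same hyperparameters,
    # so every fold trains an independent model object.
    fold_model = clone(template).fit(X_demo[train_idx], y_demo[train_idx])
    fold_f1 = f1_score(y_demo[test_idx], fold_model.predict(X_demo[test_idx]))
    fitted_models.append(fold_model)
    if fold_f1 > best_f1:
        best_model, best_f1 = fold_model, fold_f1

print("best fold F1:", best_f1)
```

The template itself is never fitted, and `best_model` stays tied to the fold that produced `best_f1`.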


In [13]:
# Check the results
classification_results

| Model | Accuracy | TN | FP | FN | TP | Recall | Precision | F1 | Time(in sec) |
|---|---|---|---|---|---|---|---|---|---|
| Naive Bayes | 0.90 | 60298 | 137536 | 2315 | 1244557 | 1.00 | 0.90 | 0.95 | 6.74 |
| K-Nearest Neighbors | 0.93 | 133087 | 64747 | 40791 | 1206081 | 0.97 | 0.95 | 0.96 | 1152.93 |
| Decision Tree | 0.97 | 172902 | 24932 | 24869 | 1222003 | 0.98 | 0.98 | 0.98 | 30.43 |
| Random Forest | 0.97 | 148954 | 48880 | 1112 | 1245760 | 1.00 | 0.96 | 0.98 | 386.75 |
| Support Vector Classifier | 0.90 | 60041 | 137793 | 809 | 1246063 | 1.00 | 0.90 | 0.95 | 55.20 |
| Artificial Neural Network (ANN) | 0.92 | 87757 | 110077 | 5011 | 1241861 | 1.00 | 0.92 | 0.96 | 6438.09 |
| AdaBoost | 0.91 | 82280 | 115554 | 11101 | 1235771 | 0.99 | 0.92 | 0.95 | 187.83 |
| XGBoost | 1.00 | 197718 | 116 | 25 | 1246847 | 1.00 | 1.00 | 1.00 | 31.39 |
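
The rate metrics can be re-derived from the pooled confusion-matrix counts, which also shows how the class imbalance affects them. The sketch below uses the Naive Bayes counts from the results above. Specificity and the majority-class baseline are extra quantities computed here for context; the notebook does not report them. The table averages the metrics per fold, so pooled values may differ slightly in later decimals.

```python
# Pooled confusion-matrix counts for Naive Bayes, copied from the results above.
tn, fp, fn, tp = 60298, 137536, 2315, 1244557
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)            # share of negative samples correctly identified
majority_baseline = (tp + fn) / total   # accuracy of always predicting the positive class

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
print(f"specificity={specificity:.3f} majority_baseline={majority_baseline:.3f}")
```

Because roughly 86% of the samples belong to the positive class, an accuracy of 0.90 is only a few points above what a constant prediction would reach, and the specificity makes the weakness on the negative class visible.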


### RESULT ANALYSIS

All models reach an F1 score of 0.95 or higher, but the labels are heavily imbalanced (1,246,872 positive against 197,834 negative samples), so accuracy and F1 alone flatter the weaker models. XGBoost scores round to 1.00 on every metric, with only 116 false positives and 25 false negatives. The Decision Tree and Random Forest follow at 0.97 accuracy. The Naive Bayes and Support Vector Classifier models have the lowest accuracy (0.90), and each misclassifies more than 137,000 negative samples, about 70% of the negative class, as positive.

*Models evaluation:*

- Naive Bayes: This model has an accuracy of 0.90 and an F1 score of 0.95. It’s the fastest model to train with a time of 6.74 seconds. However, it has a relatively high number of False Positives (FP=137536).
- K-Nearest Neighbors (KNN): This model has an accuracy of 0.93 and an F1 score of 0.96. However, it’s the second slowest model to train with a time of 1152.93 seconds.
- Decision Tree: This model has an accuracy of 0.97 and an F1 score of 0.98. It’s relatively fast to train with a time of 30.43 seconds.
- Random Forest: This model also has an accuracy of 0.97 and an F1 score of 0.98. However, it’s slower to train than the Decision Tree with a time of 386.75 seconds.
- Support Vector Classifier (SVC): This model has an accuracy of 0.90 and an F1 score of 0.95. It’s faster to train than KNN but slower than Naive Bayes with a time of 55.20 seconds.
- Artificial Neural Network (ANN): This model has an accuracy of 0.92 and an F1 score of 0.96. However, it’s the slowest model to train with a time of 6438.09 seconds.
- AdaBoost: This model has an accuracy of 0.91 and an F1 score of 0.95. It’s faster to train than KNN and ANN with a time of 187.83 seconds.
- XGBoost: This model has the highest accuracy of 1.00 and an F1 score of 1.00. It’s also relatively fast to train with a time of 31.39 seconds.

*Overall observations:*

- XGBoost is the best performing model, with accuracy, recall, precision, and F1 scores that all round to 1.00. It also has a relatively short training time of 31.39 seconds.
- The Artificial Neural Network (ANN) has a decent performance with an accuracy of 0.92, but it took the longest time to train (6438.09 seconds).
- Both the Decision Tree and Random Forest models have high performance metrics (0.97 accuracy), with the Decision Tree being faster to train.
- The Naive Bayes and Support Vector Classifier models have the same accuracy of 0.90, but the Naive Bayes model is much faster to train.
- The K-Nearest Neighbors model has a good accuracy of 0.93, but it took a significant amount of time to train (1152.93 seconds).
- The AdaBoost model reaches 0.91 accuracy in 187.83 seconds, which makes it both slower and less accurate than the Decision Tree.
- If you need a model that trains quickly, XGBoost is a good choice here: it is the most accurate model, and only Naive Bayes and the Decision Tree finish sooner. The ANN would need parameter tuning to justify its much longer training time.
- For a balance between accuracy and training time, XGBoost is the best choice among these base models. However, these are only initial results obtained with largely default parameters, and the performance of the models may improve after tuning.
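
Parameter tuning, mentioned above as a next step, could look roughly like the following sketch. It runs a small grid search for a Decision Tree on synthetic imbalanced data; the parameter grid, the dataset, and the choice of estimator are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real features (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.2, 0.8], random_state=0
)

# Hypothetical search space: 3 depths x 2 leaf sizes = 6 candidate settings.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

On the full dataset of about 1.4 million rows, every candidate setting costs a complete cross-validation run, so a randomized search or a subsample may be more practical for the slower models.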