# Building a network-based intrusion detection system for consumer electronics: From raw data to deployment

---
This tutorial was presented at the **IEEE 44th International Conference on Consumer Electronics**, held in Dubai, UAE, on Feb 3-5, 2026.

Prepared and presented by **[Prof. Mohammed M. Alani](https://www.rit.edu/directory/moacad-mohammed-m-al-ani)**, Rochester Institute of Technology - Dubai, UAE.

Abstract of the tutorial can be found [here](https://icce.org/2026/tutorials/).

The complete GitHub repo can be found [here](https://github.com/Mo-Alani/icce26-ids-tutorial).

---

## Some recommended prerequisites
Here are some recommendations before you dive in.

1. Install Python using [Miniconda](https://www.anaconda.com/docs/getting-started/miniconda/main), and create a virtual environment with the latest version of Python (3.13.11 at the time of publishing this).

2. My recommendation is to use Linux, as it makes installing the needed tools easy. Install your choice of editor. Mine is VSCodium, with the Python and Jupyter plugins. Start your favorite music that keeps you in the coding mood.

3. Install Wireshark and tshark for feature extraction (or Zeek, if you prefer).

4. Install the following Python libraries:
    * pandas
    * numpy
    * scikit-learn
    * xgboost
    * imbalanced-learn (imported as `imblearn`)
    * tqdm
    * matplotlib
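
Before moving on, it can help to confirm that the libraries above are importable in your environment. The snippet below is a minimal sketch of such a check; the helper name `find_missing` is made up for illustration. Note that scikit-learn is imported as `sklearn`, and `imblearn` is provided by the imbalanced-learn package.

```python
from importlib.util import find_spec

# Import names of the libraries listed above
# (scikit-learn is imported as "sklearn"; "imblearn" comes from imbalanced-learn).
REQUIRED = ["pandas", "numpy", "sklearn", "xgboost", "imblearn", "tqdm", "matplotlib"]


def find_missing(modules):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


missing = find_missing(REQUIRED)
print("Missing libraries:", missing if missing else "none")
```

Anything reported as missing can then be installed into the same virtual environment before running the cells below.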

*PS:* You'll see that I'm re-importing libraries in different cells of this notebook. The reason is that I wanted to make each cell independent of the previous ones so you can copy-and-paste them into individual files, if Jupyter notebooks are not your preferred way of coding.

## The dataset and feature extraction
The dataset to be used in this tutorial can be downloaded from IEEE DataPort:

<https://dx.doi.org/10.21227/q70p-q449>

The complete dataset includes two parts: the raw pcap files and an Excel sheet that contains the truth table for labelling the dataset.
The features can be extracted at the packet level or at the network-flow level. The choice is yours.

1. **Network flow-based** feature extraction is quite common in intrusion detection systems, and can be done using tools such as [Zeek](https://zeek.org/), or using [this](https://github.com/ahlashkari/CICFlowMeter) great tool by Arash Lashkari.

2. **Packet-based** feature extraction can be more effective for some types of short attacks (such as port scanning), and can be performed with tools like [tshark](https://www.wireshark.org/docs/man-pages/tshark.html). 

Our earlier work that combines both approaches can be found [here](https://doi.org/10.1109/TII.2022.3192035).

Our choice for this tutorial is packet-based features. The features can be extracted using the tshark command:

`tshark -r YOUR_PCAP_FILE.pcap -T fields -e FIELD_NAME -E separator=, -E quote=d`

Replace the file name with the pcap file of your choice, and add as many fields (using `-e FIELD_NAME`) as you see suitable for your experiment. The field names are the same ones used in Wireshark display filters. You can find a complete list [here](https://www.wireshark.org/docs/dfref/). Due to lack of time, we won't be able to go into more detail here.
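
When many fields are involved, it can be easier to assemble the command in Python than to type it by hand. The sketch below builds the argument list for the command above; the helper name `build_tshark_command` and the four example fields are assumptions chosen for illustration, not the feature list used in this tutorial. It also adds `-E header=y` so that the first output row carries the field names.

```python
def build_tshark_command(pcap_path, fields):
    """Build the tshark argument list for CSV-style field extraction."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields"]
    for field in fields:
        cmd += ["-e", field]            # one -e option per field
    cmd += ["-E", "header=y",           # first row holds the field names
            "-E", "separator=,",        # comma-separated output
            "-E", "quote=d"]            # wrap values in double quotes
    return cmd


# Example field selection (an assumption, not the tutorial's feature list)
fields = ["frame.len", "ip.proto", "tcp.srcport", "tcp.dstport"]
cmd = build_tshark_command("YOUR_PCAP_FILE.pcap", fields)
print(" ".join(cmd))

# To actually run it (requires tshark on the PATH and a real pcap file):
# import subprocess
# with open("features.csv", "w") as out:
#     subprocess.run(cmd, stdout=out, check=True)
```

Passing the arguments as a list avoids shell-quoting problems with file names that contain spaces.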

For this experiment, we combined a few pcap files that are focused on botnet traffic using Wireshark, then passed them through `tshark` to extract the features in csv format. Then, we labelled the traffic based on the truth table file, with 0 being a benign packet, and 1 being a malicious packet. We took care of the tons of empty fields as explained in the tutorial session. Finally, we created the file named `dataset1.csv`. Make sure to unzip the file before the next step.
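
The labelling and empty-field handling were covered live in the session, so the snippet below is only a rough sketch of what such a step could look like, not the procedure used to build `dataset1.csv`. It assumes (hypothetically) that the truth table has been reduced to a set of malicious source IP addresses, that the extracted csv has an `ip.src` column, and that replacing empty fields with 0 is acceptable. The helper name `label_and_fill` is made up for illustration.

```python
import pandas as pd


def label_and_fill(packets, malicious_ips, ip_column="ip.src"):
    """Append a 0/1 label as the last column and replace empty fields with 0."""
    labelled = packets.copy()
    # 1 = malicious, 0 = benign (assumed rule: match on the source IP address)
    labelled["label"] = labelled[ip_column].isin(malicious_ips).astype(int)
    # Fields that a packet does not carry come out of tshark empty (NaN in pandas)
    return labelled.fillna(0)


# Tiny made-up sample standing in for the tshark output
packets = pd.DataFrame({
    "ip.src": ["10.0.0.5", "10.0.0.9", "10.0.0.5"],
    "frame.len": [60, 1514, 74],
    "tcp.dstport": [23.0, None, 80.0],   # None mimics an empty tshark field
})
result = label_and_fill(packets, {"10.0.0.9"})
print(result)
```

Keeping the label as the last column matches what the cells below expect when they split features from labels.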


In [None]:
# Now, let's generate some information about the dataset
from pandas import read_csv
from collections import Counter

url = "dataset1.csv"
print('Dataset file:', url)

dataset = read_csv(url, low_memory=False)

names = list(dataset.columns)

array = dataset.values
f = array.shape[1] - 1  # index of the last column, which holds the label
print('Number of instances:',array.shape[0])
print('Number of features:',f)
print('Labels:', set(array[:,f]))
print('Each label count:')
print(Counter(array[:,f]))

You can see that the dataset has over 1.5 million data points (instances), with 23 features in each instance.

The noticeable issue is that the dataset has more than 1.2 million class 0 (benign) samples and about 200k class 1 (malicious) samples. This makes the dataset significantly imbalanced. We take care of that next, using random undersampling. With `sampling_strategy=0.33`, the code below randomly drops class 0 samples until the ratio of class 1 to class 0 is 0.33, so class 1 ends up at roughly 25% of the samples and class 0 at roughly 75%. The result is stored in a file named dataset2.csv.

In [None]:
from pandas import read_csv
import numpy as np
from pandas import DataFrame
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter 
from tqdm import tqdm
dataset = read_csv("dataset1.csv")
names = list(dataset.columns)

array = dataset.values
X = array[:,0:dataset.shape[1]-1]
Y = array[:,dataset.shape[1]-1]

class_names = set(Y)
print('Before balancing:')
print(Counter(Y))

# This line was added because I received an error "Unknown label type: %r" when imblearn was handling
# the balancing
Y = Y.astype('int')

undersample = RandomUnderSampler(sampling_strategy=0.33)
a, b = undersample.fit_resample(X, Y)

print('After balancing:')
print(Counter(b))
array = np.c_[a,b]

#saving the csv file
dataset1 = DataFrame(array, columns = names)
dataset1.to_csv('dataset2.csv',index=False)
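
As a quick sanity check on the ratio used above: imbalanced-learn documents a float `sampling_strategy` as the number of minority samples divided by the number of majority samples after resampling. The arithmetic can be sketched as follows. The helper name `undersampled_counts` is made up, and the class counts are the round numbers quoted in the text, not the exact counts in `dataset1.csv`.

```python
def undersampled_counts(n_minority, n_majority, ratio):
    """Class counts after random undersampling to minority/majority == ratio."""
    # The minority class is untouched; the majority class is cut down to fit the ratio
    # (and can never grow beyond what is available).
    n_majority_kept = min(n_majority, int(n_minority / ratio))
    return n_minority, n_majority_kept


# Round numbers taken from the class counts quoted above (illustrative only)
n_min, n_maj = undersampled_counts(200_000, 1_200_000, 0.33)
total = n_min + n_maj
print(f"class 1: {n_min} ({n_min / total:.1%}), class 0: {n_maj} ({n_maj / total:.1%})")
```

So a ratio of 0.33 leaves about one malicious packet for every three benign ones. If an exact one-third share of malicious packets is wanted instead, the ratio would be 0.5.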

Now that we have a rebalanced and preprocessed dataset, we move on to building our ML classifier pipeline.

## Classifiers' training and testing

The code shown next will create a pipeline of five different types of classifiers: Random Forest, Logistic Regression, Decision Tree, Gaussian Naive Bayes, and Extreme Gradient Boost (XGB). The dataset will be randomly split into 75% training and 25% testing samples with stratification.

For each classifier, we will calculate the accuracy, precision, recall, and F<sub>1</sub> score. We will store all of the results in a file named `All-Algorithms-Results.csv`. We will also generate Confusion Matrix Plots for each of the classifiers to make an easy comparison.
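
To make these four metrics concrete, here is a small worked sketch that derives them from the four cells of a binary confusion matrix. It uses macro averaging (the plain mean of the per-class values), which is what the `macro avg` entries of scikit-learn's `classification_report` contain. The function name `macro_metrics` and the example counts are made up for illustration, and the sketch assumes every class has at least one true and one predicted sample, so no division by zero occurs.

```python
def macro_metrics(tn, fp, fn, tp):
    """Accuracy plus macro-averaged precision, recall and F1 for a binary problem."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # (precision, recall) per class: class 1 treats "malicious" as the positive class,
    # class 0 treats "benign" as the positive class.
    per_class = [
        (tp / (tp + fp), tp / (tp + fn)),   # class 1
        (tn / (tn + fn), tn / (tn + fp)),   # class 0
    ]
    f1_scores = [2 * p * r / (p + r) for p, r in per_class]
    precision = sum(p for p, _ in per_class) / 2
    recall = sum(r for _, r in per_class) / 2
    return accuracy, precision, recall, sum(f1_scores) / 2


# Made-up counts: 60 benign packets (10 misflagged), 40 malicious packets (5 missed)
acc, prec, rec, f1 = macro_metrics(tn=50, fp=10, fn=5, tp=35)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because macro averaging weights both classes equally, it does not let the larger benign class dominate the score.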

In [None]:
# Load libraries
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import warnings
import xgboost
import numpy as np
from pandas import DataFrame

warnings.filterwarnings("ignore")

# Load dataset
url = "dataset2.csv"
dataset = read_csv(url)

names = list(dataset.columns)

# Split-out training and testing datasets
array = dataset.values
X = array[:,0:dataset.shape[1]-1]
Y = array[:,dataset.shape[1]-1]
Y = Y.astype(int)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=1,stratify=Y)
class_names = sorted(set(Y))  # sorted list, so the confusion matrix labels keep a fixed order

# Preparing the output to be saved in a csv file for later reference
tops = ['RF', 'LR','DT', 'GNB', 'XGB']
allscores = np.empty([4,5], dtype=float)
met = ['Accuracy', 'Precision', 'Recall', 'F1 Score']

# Random Forest
modelname='Random Forest'
model1 = RandomForestClassifier()
model1.fit(X_train, Y_train) # training
predictions = model1.predict(X_test) # testing
report = classification_report(Y_test, predictions, digits=4,output_dict=True)
allscores[0,0] = report['accuracy']
allscores[1,0] = report['macro avg']['precision']
allscores[2,0] = report['macro avg']['recall']
allscores[3,0] = report['macro avg']['f1-score']
print(modelname,'\n', classification_report(Y_test, predictions, digits=4))

# Drawing the confusion matrix plot
disp = ConfusionMatrixDisplay.from_estimator(model1, X_test, Y_test,display_labels=class_names,cmap=plt.cm.Blues,normalize='true',values_format='.4f')
plt.gcf().set_dpi(200)
disp.ax_.set_title(modelname)

# Logistic Regression
modelname='Logistic Regression'
model2 = LogisticRegression(solver='liblinear')
model2.fit(X_train, Y_train) # training
predictions = model2.predict(X_test) # testing
report = classification_report(Y_test, predictions, digits=4,output_dict=True)
allscores[0,1] = report['accuracy']
allscores[1,1] = report['macro avg']['precision']
allscores[2,1] = report['macro avg']['recall']
allscores[3,1] = report['macro avg']['f1-score']
print(modelname,'\n', classification_report(Y_test, predictions, digits=4))

# Drawing the confusion matrix plot
disp = ConfusionMatrixDisplay.from_estimator(model2, X_test, Y_test,display_labels=class_names,cmap=plt.cm.Blues,normalize='true',values_format='.4f')
plt.gcf().set_dpi(200)
disp.ax_.set_title(modelname)

# Decision Tree
modelname='Decision Tree'
model3 = DecisionTreeClassifier()
model3.fit(X_train, Y_train) # training
predictions = model3.predict(X_test) # testing
report = classification_report(Y_test, predictions, digits=4,output_dict=True)
allscores[0,2] = report['accuracy']
allscores[1,2] = report['macro avg']['precision']
allscores[2,2] = report['macro avg']['recall']
allscores[3,2] = report['macro avg']['f1-score']
print(modelname,'\n', classification_report(Y_test, predictions, digits=4))

# Drawing the confusion matrix plot
disp = ConfusionMatrixDisplay.from_estimator(model3, X_test, Y_test,display_labels=class_names,cmap=plt.cm.Blues,normalize='true',values_format='.4f')
plt.gcf().set_dpi(200)
disp.ax_.set_title(modelname)

# GaussianNB
modelname='GaussianNB'
model4 = GaussianNB()
model4.fit(X_train, Y_train) # training
predictions = model4.predict(X_test) # testing
report = classification_report(Y_test, predictions, digits=4,output_dict=True)
allscores[0,3] = report['accuracy']
allscores[1,3] = report['macro avg']['precision']
allscores[2,3] = report['macro avg']['recall']
allscores[3,3] = report['macro avg']['f1-score']
print(modelname,'\n', classification_report(Y_test, predictions, digits=4))

# Drawing the confusion matrix plot
disp = ConfusionMatrixDisplay.from_estimator(model4, X_test, Y_test,display_labels=class_names,cmap=plt.cm.Blues,normalize='true',values_format='.4f')
plt.gcf().set_dpi(200)
disp.ax_.set_title(modelname)

# XGB
modelname='XGB'
# device="cuda" needs an NVIDIA GPU and a CUDA-enabled XGBoost build; use device="cpu" otherwise
model5 = xgboost.XGBClassifier(tree_method="hist", device="cuda")
model5.fit(X_train, Y_train) # training
predictions = model5.predict(X_test) # testing
report = classification_report(Y_test, predictions, digits=4,output_dict=True)
allscores[0,4] = report['accuracy']
allscores[1,4] = report['macro avg']['precision']
allscores[2,4] = report['macro avg']['recall']
allscores[3,4] = report['macro avg']['f1-score']
print(modelname,'\n', classification_report(Y_test, predictions, digits=4))

# Drawing the confusion matrix plot
disp = ConfusionMatrixDisplay.from_estimator(model5, X_test, Y_test,display_labels=class_names,cmap=plt.cm.Blues,normalize='true',values_format='.4f')
plt.gcf().set_dpi(200)
disp.ax_.set_title(modelname)
plt.show()

# Storing the results in a csv file
dataset1 = DataFrame(allscores, columns=tops)
dataset1.insert(0,'Metric',met)
dataset1 = dataset1.transpose()
dataset1.to_csv('All-Algorithms-Results.csv',encoding='UTF_8')
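
The five blocks above repeat the same train, test, and report steps. If you prefer less repetition, one option is to keep the models in a dictionary and loop over them. The snippet below is a sketch of that idea, not a replacement for the cell above: it uses a small synthetic dataset from `make_classification` as a stand-in for `dataset2.csv` and only two of the five classifiers, so that it runs quickly on its own.

```python
from pandas import DataFrame
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for dataset2.csv (roughly 75% class 0, 25% class 1)
X, Y = make_classification(n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=1)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=1, stratify=Y)

# Short name -> classifier; add the remaining models here in the same way
models = {
    "DT": DecisionTreeClassifier(random_state=1),
    "GNB": GaussianNB(),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, Y_train)                                   # training
    report = classification_report(Y_test, model.predict(X_test), output_dict=True)
    rows[name] = {
        "Accuracy": report["accuracy"],
        "Precision": report["macro avg"]["precision"],
        "Recall": report["macro avg"]["recall"],
        "F1 Score": report["macro avg"]["f1-score"],
    }

# One row per classifier, one column per metric
results = DataFrame.from_dict(rows, orient="index")
print(results.round(4))
```

Calling `results.to_csv(...)` would then write one row per classifier with named metric columns.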

Now that you have your results, it's time to work on a few more things (on your own), such as cross-validation, validation using another dataset, explainability, hyperparameter optimization, deep neural networks, and many other things.

You can find more details about the experiments and the results obtained in my paper **BotStop: Packet-based efficient and explainable IoT botnet detection using machine learning**. You can find it [here](https://doi.org/10.1016/j.comcom.2022.06.039).

Thank you for sticking around!

Sometimes, I have some interesting things to say. 

You can find my blog, with some interesting how-tos and tools, at this link: <https://www.mohammedalani.com>

My YouTube channel: <https://www.youtube.com/@DrMMAlani>

And we can connect on LinkedIn: <https://ae.linkedin.com/in/prof-mohammed-m-alani>

<img src="website.png" alt="Link to MohammedAlani.com">