In [6]:
import Models as models
import Model_Utils as model_utils

In [27]:
train_features, train_labels, test_features, test_labels = model_utils.preprocess_csv("Datasets/CDoBT.csv")

## Model Tests

This section tests several machine learning models on our dataset. We evaluate the performance of the following models:

1. **Multinomial Naive Bayes:**
   A probabilistic model that is commonly used for text classification tasks. It assumes that features are conditionally independent given the class label.

2. **Complement Naive Bayes:**
   An extension of the Multinomial Naive Bayes model that is designed to address the issue of imbalanced class distributions.

3. **Gaussian Naive Bayes:**
   A variant of Naive Bayes that assumes that features follow a Gaussian (normal) distribution within each class.

4. **Random Forest:**
   An ensemble learning method that combines multiple decision trees to improve predictive accuracy and control overfitting.

5. **Decision Tree:**
   A simple model that uses a tree-like structure to make decisions based on feature values.

6. **K-Nearest Neighbours (KNN):**
   An instance-based learning algorithm that makes predictions by finding the majority class among the k nearest neighbours of a given data point.

7. **C-Support Vector Classification (SVC):**
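   A Support Vector Machine (SVM) classifier that seeks a hyperplane maximizing the margin between classes. The formulation is binary at its core; multi-class problems are handled by combining several binary classifiers (for example, one-vs-one).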
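
The comparison described above can be sketched with scikit-learn. This is a minimal illustration rather than the contents of the `models` module, which is not shown in this notebook: it assumes scikit-learn estimators with default hyperparameters, uses synthetic stand-in data instead of `Datasets/CDoBT.csv`, and the names `classifiers` and `accuracies` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB, GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): three classes, ten numeric features.
X, y = make_classification(n_samples=600, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# MultinomialNB and ComplementNB reject negative feature values,
# so every feature is scaled into [0, 1] using the training split only.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = np.clip(scaler.transform(X_test), 0.0, 1.0)

# One default-configured estimator per model in the list above.
classifiers = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Complement Naive Bayes": ComplementNB(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "SVC": SVC(),
}

# Fit each model on the same split and record its test accuracy.
accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X_train_s, y_train)
    accuracies[name] = clf.score(X_test_s, y_test)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

Fitting every model on the same split keeps the accuracies comparable; the numbers printed here come from synthetic data and say nothing about the results reported below.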

In [8]:
# Multinomial Naive Bayes
model_multinomial_bayes = models.create_multinomial_naive_bayes(features=train_features, labels=train_labels)
models.evaluate_model(model=model_multinomial_bayes, test_features=test_features, test_labels=test_labels)

Test accuracy: 0.10255964607362933


In [9]:
# Complement Naive Bayes
model_complement_bayes = models.create_complement_naive_bayes(features=train_features, labels=train_labels)
models.evaluate_model(model=model_complement_bayes, test_features=test_features, test_labels=test_labels)

Test accuracy: 0.08883867909622373


  logged = np.log(comp_count / comp_count.sum(axis=1, keepdims=True))


In [10]:
# Gaussian Naive Bayes
model_gaussian_bayes = models.create_gaussian_naive_bayes(features=train_features, labels=train_labels)
models.evaluate_model(model=model_gaussian_bayes, test_features=test_features, test_labels=test_labels)

Test accuracy: 0.4600695212513825


In [11]:
# Random Forest
model_random_forest = models.create_random_forest(features=train_features, labels=train_labels)
models.evaluate_model(model=model_random_forest, test_features=test_features, test_labels=test_labels, scaling_and_processing=True)

Test accuracy: 0.5932437983883709


In [12]:
# Decision Tree
model_decision_tree = models.create_decision_tree(features=train_features, labels=train_labels)
models.evaluate_model(model=model_decision_tree, test_features=test_features, test_labels=test_labels, scaling_and_processing=True)

Test accuracy: 0.5995860325485859


In [13]:
# K-Nearest Neighbours
model_knn = models.create_knn(features=train_features, labels=train_labels)
models.evaluate_model(model=model_knn, test_features=test_features, test_labels=test_labels)

Test accuracy: 0.627021646389635


In [14]:
# C-Support Vector Classification
model_svm = models.create_svc(features=train_features, labels=train_labels)
models.evaluate_model(model=model_svm, test_features=test_features, test_labels=test_labels)

## Exploring Neural Networks

In this section I employed a multi-layered feedforward neural network. The best network so far has the following architecture:

### Model Architecture

- **Input Layer:** The model begins with an input layer whose `input_shape` parameter is set from `scaled_features.shape[1]`, the number of features in the dataset. This input feeds the first dense layer, which consists of 256 neurons.

- **Activation Function (ReLU):** The Rectified Linear Unit (ReLU) activation function is applied to the outputs of this first dense layer. It introduces non-linearity into the model, allowing it to learn complex relationships within the data.

- **Batch Normalization:** Following the input layer, a batch normalization layer is added. Batch normalization helps in stabilizing and accelerating the training process by normalizing the activations of the previous layer. It ensures that the input to each neuron has a consistent mean and standard deviation.

- **Dropout Layer:** To prevent overfitting and improve the model's generalization, a dropout layer is incorporated after batch normalization. With a dropout rate of 0.5, it randomly drops 50% of the neurons' outputs during training, encouraging the network to learn robust features.

- **Hidden Layer:** The model proceeds with a hidden layer consisting of 128 neurons. Similar to the input layer, these neurons are activated using the ReLU function, introducing non-linearity and enhancing the network's capacity to capture intricate patterns in the data.

- **Batch Normalization:** As with the previous layers, batch normalization is applied to maintain stable activations.

- **Dropout Layer:** Another dropout layer with a rate of 0.3 is inserted after batch normalization. This further mitigates overfitting and promotes model robustness.

- **Hidden Layer:** The next hidden layer comprises 64 neurons activated by the ReLU function, providing an additional layer of abstraction and feature extraction.

- **Batch Normalization:** Batch normalization is once again applied to maintain consistent activations.

- **Output Layer:** The final layer of the model consists of neurons equal to the number of target classes or `num_classes`. It utilizes the softmax activation function, which is well-suited for multi-class classification tasks. The softmax function assigns probabilities to each class, allowing the model to make predictions.
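
The layer stack described above might look roughly like the following in Keras. This is a sketch under assumptions, not the code behind `models.create_neural_network_mk_*`, which is not shown here: the function name `build_sketch_model`, the Adam optimizer, and the sparse categorical cross-entropy loss are illustrative choices, and the feature and class counts in the usage line are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers


def build_sketch_model(num_features: int, num_classes: int) -> tf.keras.Model:
    """Build the dense stack described above (hypothetical helper)."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),      # one input per feature
        layers.Dense(256, activation="relu"),       # first dense layer
        layers.BatchNormalization(),
        layers.Dropout(0.5),                        # drops 50% of outputs in training
        layers.Dense(128, activation="relu"),       # hidden layer
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),        # hidden layer
        layers.BatchNormalization(),
        layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])
    # Optimizer and loss are assumptions; the loss expects integer class labels.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


# Placeholder sizes; in the notebook these would come from the dataset.
model = build_sketch_model(num_features=10, num_classes=5)
model.summary()
```

Dropout is only active during training, so evaluation and prediction use the full network.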

In [15]:
# Neural Network Mark 1 
model_neural_network_mk_1 = models.create_neural_network_mk_1(train_features=train_features, train_labels=train_labels)
models.evaluate_neural_network(trained_model=model_neural_network_mk_1, test_features=test_features, test_labels=test_labels)

Epoch 1/10


  return ops.EagerTensor(value, ctx.device_name, dtype)


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: nan
Test Accuracy: 0.017573077231645584


In [16]:
# Neural Network Mark 2
model_neural_network_mk_2 = models.create_neural_network_mk_2(train_features=train_features, train_labels=train_labels)
models.evaluate_neural_network(trained_model=model_neural_network_mk_2, test_features=test_features, test_labels=test_labels)

Epoch 1/10


  return ops.EagerTensor(value, ctx.device_name, dtype)


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: nan
Test Accuracy: 0.017573077231645584


In [17]:
# Neural Network Mark 3
model_neural_network_mk_3 = models.create_neural_network_mk_3(train_features=train_features, train_labels=train_labels)
models.evaluate_neural_network(trained_model=model_neural_network_mk_3, test_features=test_features, test_labels=test_labels, normalise=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 0.5807669758796692
Test Accuracy: 0.6260104179382324


In [18]:
# Neural Network Mark 4
model_neural_network_mk_4 = models.create_neural_network_mk_4(train_features=train_features, train_labels=train_labels, num_epochs=15)
models.evaluate_neural_network(trained_model=model_neural_network_mk_4, test_features=test_features, test_labels=test_labels, normalise=True, encoding=True)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test Loss: 0.5787022709846497
Test Accuracy: 0.6260040998458862


### Initial Tests

In the next phase of experimentation, the "Length" column was excluded. Packet length can vary significantly across different types of network traffic, so the hypothesis was that removing this feature would reduce noise in the dataset and improve the accuracy of the IoT device identification model.

In [19]:
# Reprocess the dataset but drop the "Length" column
train_features, train_labels, test_features, test_labels = model_utils.preprocess_csv("Datasets/CDoBT.csv", drop_length=True)

In [20]:
# K-Nearest Neighbours
model_knn = models.create_knn(features=train_features, labels=train_labels)
models.evaluate_model(model=model_knn, test_features=test_features, test_labels=test_labels)

Test accuracy: 0.6196808342550166


In [21]:
# Neural Network Mark 3
model_neural_network_mk_3 = models.create_neural_network_mk_3(train_features=train_features, train_labels=train_labels)
models.evaluate_neural_network(trained_model=model_neural_network_mk_3, test_features=test_features, test_labels=test_labels, normalise=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 0.5867325663566589
Test Accuracy: 0.6190646290779114


In [22]:
# Neural Network Mark 4
model_neural_network_mk_4 = models.create_neural_network_mk_4(train_features=train_features, train_labels=train_labels, num_epochs=15)
models.evaluate_neural_network(trained_model=model_neural_network_mk_4, test_features=test_features, test_labels=test_labels, normalise=True, encoding=True)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test Loss: 0.5859881639480591
Test Accuracy: 0.6191088557243347


### Subsequent Tests

Building on the initial tests, subsequent experiments applied the same feature selection strategy to the "Window Size" column. The "Window Size" parameter, relevant in TCP communication, similarly varies considerably across network traffic. Its removal was tested to further refine the dataset and assess the impact on model performance.

By comparing model accuracy before and after excluding each of these features, the tests aim to identify the most informative attributes for IoT device identification. The results guide further refinement of the model.
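
The before-and-after comparison can be sketched as a small ablation loop. This is an illustration under assumptions, not the behaviour of `model_utils.preprocess_csv`: the data is synthetic, the column names other than "Length" and "Window Size" are invented, and `accuracy_without` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); only the first two names mirror the notebook.
feature_names = ["Length", "Window Size", "feature_a", "feature_b",
                 "feature_c", "feature_d"]
X, y = make_classification(n_samples=500, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)


def accuracy_without(dropped: set) -> float:
    """Train KNN without the named columns and return its test accuracy."""
    keep = [i for i, name in enumerate(feature_names) if name not in dropped]
    # A fixed random_state gives every run the same train/test rows,
    # so only the feature set differs between runs.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, keep], y, test_size=0.25, random_state=0, stratify=y)
    clf = make_pipeline(StandardScaler(), KNeighborsClassifier())
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


results = {
    "all features": accuracy_without(set()),
    'without "Length"': accuracy_without({"Length"}),
    'without "Window Size"': accuracy_without({"Window Size"}),
}
for label, acc in results.items():
    print(f"{label}: {acc:.3f}")
```

Holding the split fixed matters because accuracy differences of well under a percentage point can otherwise come from the split rather than from the removed feature.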

In [23]:
# Reprocess the dataset but drop the "Window Size" column
train_features, train_labels, test_features, test_labels = model_utils.preprocess_csv("Datasets/CDoBT.csv", drop_window_size=True)

In [24]:
# K-Nearest Neighbours
model_knn = models.create_knn(features=train_features, labels=train_labels)
models.evaluate_model(model=model_knn, test_features=test_features, test_labels=test_labels)

Test accuracy: 0.62330225944067


In [25]:
# Neural Network Mark 3
model_neural_network_mk_3 = models.create_neural_network_mk_3(train_features=train_features, train_labels=train_labels)
models.evaluate_neural_network(trained_model=model_neural_network_mk_3, test_features=test_features, test_labels=test_labels, normalise=True)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Loss: 0.5999453663825989
Test Accuracy: 0.6206035614013672


In [26]:
# Neural Network Mark 4
model_neural_network_mk_4 = models.create_neural_network_mk_4(train_features=train_features, train_labels=train_labels, num_epochs=15)
models.evaluate_neural_network(trained_model=model_neural_network_mk_4, test_features=test_features, test_labels=test_labels, normalise=True, encoding=True)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15
Test Loss: 0.6000670194625854
Test Accuracy: 0.6206446290016174
