<a href="https://colab.research.google.com/github/othmbela/gotham-network-packet-labeller/blob/main/notebooks/Deep%20Neural%20Network%20--%20Centralised%20Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Gotham Dataset 2025: Centralised learning

Welcome to this tutorial on using the Smart Cities Network Security Dataset for training a deep learning model. This notebook provides a step-by-step guide to loading the dataset, preprocessing it, and training a deep neural network (DNN) using PyTorch.

This dataset is designed to aid researchers in developing intrusion detection systems (IDS) for smart city environments, focusing on real-world IoT traffic characteristics.

**Objectives**:
1. Load the dataset.
2. Define a deep learning model using PyTorch.
3. Train and evaluate the model.
4. Save the trained model for future use.

---

## Setup & Installation

Now let's really begin with this tutorial!

To start working, very little is required once you have activated your Python environment (e.g. via `conda`, `virtualenv`, `pyenv`, etc). If you are running this code on Colab, there is really nothing to do except to install ???. The steps below have been verified to run in Colab.

In [None]:
# if you want to install libraries, use
# !pip install package_name

Now, import the required libraries:

In [None]:
import numpy as np
import pandas as pd
import glob
import os
import re

import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data.dataset import Dataset

from tqdm import tqdm

## Loading the Dataset

The dataset contains labeled network traffic data collected from smart city IoT devices.

The dataset can be loaded from Google Drive. Import `drive` from `google.colab` and call `drive.mount()` to make your Drive available under `/content/drive`. The variable `DATA_DIR` stores the location of the folder that holds the dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Change this line to point to the folder where the data is stored
DATA_DIR = '/content/drive/MyDrive/GothamDataset2025/processed'

### Convert Dataset into PyTorch Tensor Format

Now, we convert the dataset into a PyTorch-compatible format by creating a custom Dataset class.

In [None]:
class GothamDataset(Dataset):

    def __init__(self, features_file, target_file, transform=None, target_transform=None):
        """
        Args:
            features_file (string): Path to the gzip-compressed pickle file with features.
            target_file (string): Path to the gzip-compressed pickle file with labels.
            transform (callable, optional): Optional transform to be applied on features.
            target_transform (callable, optional): Optional transform to be applied on labels.
        """
        self.features = pd.read_pickle(features_file, compression="gzip")
        self.labels = pd.read_pickle(target_file, compression="gzip")
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        feature = self.features.iloc[idx, :]
        label = self.labels.iloc[idx, 0]
        if self.transform:
            feature = self.transform(feature.values, dtype=torch.float32)
        if self.target_transform:
            label = self.target_transform(label, dtype=torch.int64)
        return feature, label

### Use DataLoader for Efficient Batch Processing

Rather than feeding the entire dataset to the model at once, we use PyTorch’s `DataLoader` to serve it in mini-batches during training.

In [None]:
def get_dataset(data_path: str):
    """Load training, validation and test set."""

    train_datasets, val_datasets, test_datasets = [], [], []

    iot_devices = []
    for path in glob.glob(f"{data_path}/*_train_features.pkl"):
        match = re.search(r"([^/]+)_train_features\.pkl$", path)
        if match:
            iot_devices.append(match.group(1))

    # Get the datasets
    for iot_device in iot_devices:
        train_datasets.append(GothamDataset(
            features_file=f"{data_path}/{iot_device}_train_features.pkl",
            target_file=f"{data_path}/{iot_device}_train_labels.pkl",
            transform=torch.tensor,
            target_transform=torch.tensor
        ))
        val_datasets.append(GothamDataset(
            features_file=f"{data_path}/{iot_device}_val_features.pkl",
            target_file=f"{data_path}/{iot_device}_val_labels.pkl",
            transform=torch.tensor,
            target_transform=torch.tensor
        ))
        test_datasets.append(GothamDataset(
            features_file=f"{data_path}/{iot_device}_test_features.pkl",
            target_file=f"{data_path}/{iot_device}_test_labels.pkl",
            transform=torch.tensor,
            target_transform=torch.tensor
        ))

    train_data = torch.utils.data.ConcatDataset(train_datasets)
    val_data = torch.utils.data.ConcatDataset(val_datasets)
    test_data = torch.utils.data.ConcatDataset(test_datasets)

    return train_data, val_data, test_data
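
The device discovery step in `get_dataset` relies on the file naming pattern `<device>_train_features.pkl`. The sketch below shows how the regular expression extracts the device name. The file paths are hypothetical and only illustrate the pattern.

```python
import re

# Hypothetical file paths, used only to illustrate the naming pattern.
paths = [
    "/data/processed/camera_train_features.pkl",
    "/data/processed/smart_meter_train_features.pkl",
    "/data/processed/camera_train_labels.pkl",  # not a features file, so it is skipped
]

# Same pattern as in get_dataset: capture everything after the last "/"
# and before the "_train_features.pkl" suffix.
pattern = re.compile(r"([^/]+)_train_features\.pkl$")

devices = []
for path in paths:
    match = pattern.search(path)
    if match:
        devices.append(match.group(1))

print(devices)
```

Because `[^/]+` cannot cross a path separator, device names that contain underscores are still captured whole.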

In [None]:
# Get the datasets
train_data, _, test_data = get_dataset(data_path=DATA_DIR)

# How many instances have we got?
print('# instances in training set: ', len(train_data))
print('# instances in testing set: ', len(test_data))

batch_size = 128

# Create the dataloaders for training and testing
train_loader = torch.utils.data.DataLoader(dataset=train_data, batch_size=batch_size, shuffle=True)
test_loader  = torch.utils.data.DataLoader(dataset=test_data, batch_size=batch_size, shuffle=False)
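
To see how `ConcatDataset` and `DataLoader` work together, here is a minimal sketch on synthetic data. The two small tensors stand in for per-device datasets and are assumptions made for illustration, not part of the Gotham data.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Two synthetic "device" datasets with 70 features each (random data, illustration only).
device_a = TensorDataset(torch.randn(5, 70), torch.zeros(5, dtype=torch.int64))
device_b = TensorDataset(torch.randn(7, 70), torch.ones(7, dtype=torch.int64))

# ConcatDataset chains the datasets so they can be indexed as one.
combined = ConcatDataset([device_a, device_b])

# The loader yields mini-batches; the last batch holds the remaining samples.
loader = DataLoader(combined, batch_size=5, shuffle=False)

batch_shapes = [tuple(inputs.shape) for inputs, _ in loader]
print(len(combined), len(loader), batch_shapes)
```

With 12 samples and a batch size of 5, the loader produces three batches, the last of which contains only two samples.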

## Defining the Deep Neural Network

This tutorial is not about novel architectural designs, so we define a simple deep learning model using PyTorch’s `nn.Module`. The model consists of three fully connected hidden layers with ReLU activations, followed by a linear output layer.

In [None]:
class DNN(nn.Module):

    def __init__(self, num_features, hidden1_size, hidden2_size, hidden3_size, num_classes):
        super(DNN, self).__init__()
        self.fc1 = nn.Linear(num_features, hidden1_size)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(hidden1_size, hidden2_size)
        self.relu2 = nn.ReLU()
        self.fc3 = nn.Linear(hidden2_size, hidden3_size)
        self.relu3 = nn.ReLU()
        self.fc4 = nn.Linear(hidden3_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.fc2(out)
        out = self.relu2(out)
        out = self.fc3(out)
        out = self.relu3(out)
        out = self.fc4(out)
        return out
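
As a quick sanity check, the same layer stack can be sketched with `nn.Sequential` and run on a random batch. This is an illustrative equivalent rather than the class used in the rest of the notebook. The layer sizes are the ones passed to `DNN` later on, and the batch size of 4 is arbitrary.

```python
import torch
import torch.nn as nn

# Same layer stack as the DNN class, written as a Sequential container.
sequential_dnn = nn.Sequential(
    nn.Linear(70, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 6),  # one output per class, no softmax
)

# A random batch of 4 samples with 70 features each.
batch = torch.randn(4, 70)
logits = sequential_dnn(batch)
print(logits.shape)
```

The model returns raw logits, one per class. `nn.CrossEntropyLoss` applies the log-softmax internally, so no softmax layer is needed at the end of the network.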

As with the dataset, you can inspect the model in various ways. We can, for instance, print its layers and count the number of model parameters.

In [None]:
# Defining some input variables
n_features = 70
n_classes = 6

# Creating a DNN
model = DNN(num_features=n_features,
            hidden1_size=128,
            hidden2_size=128,
            hidden3_size=64,
            num_classes=n_classes,
            )

In [None]:
print(model)

In [None]:
num_parameters = sum(value.numel() for value in model.state_dict().values())
print(f"{num_parameters = }")
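
The parameter count can also be checked by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. A short sketch using the layer sizes from this notebook:

```python
# Layer widths: input features, three hidden layers, output classes.
layer_sizes = [70, 128, 128, 64, 6]

# Weights plus biases for each consecutive pair of layers.
per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)

print(per_layer, total)  # [9088, 16512, 8256, 390] 34246
```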

## The Training Loop

A minimal training loop in PyTorch can be constructed with two functions:
* `train()`, which trains the model given a dataloader.
* `test()`, which evaluates the performance of the model on held-out data, e.g., a test set.

Let's construct these functions!

### - `train()` Function

In [None]:
def train(
    model: torch.nn.Module,
    criterion: torch.nn.Module,
    optimizer: torch.optim.Optimizer,
    train_loader: torch.utils.data.DataLoader,
    num_epochs: int,
    device: torch.device,
):
    """Train the network.

    Parameters
    ----------
    model: torch.nn.Module
        Neural network model used in this example.

    criterion: torch.nn.Module
        Loss function used to compare predictions with labels.

    optimizer: torch.optim.Optimizer
        Optimizer that updates the model parameters.

    train_loader: torch.utils.data.DataLoader
        DataLoader used in training.

    num_epochs: int
        Number of passes over the training set.

    device: torch.device
        Device on which the network is trained.
    """

    model.to(device)
    model.train()

    for epoch in range(1, num_epochs+1):
        print(f"Epoch {epoch}/{num_epochs}:")
        for inputs, labels in tqdm(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # Passing the batch down the model
            outputs = model(inputs)

            # forward + backward + optimize
            loss = criterion(outputs, labels)
            loss.backward()

            # performs the gradient update
            optimizer.step()

    print("Finished Training")
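
To make the order of operations inside the loop concrete, here is a single optimisation step on synthetic data. This is a sketch under assumptions: the one-layer model and the random batch are stand-ins chosen for brevity, not the network or data used in this notebook.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in model and synthetic batch (8 samples, 70 features, 6 classes).
model = nn.Linear(70, 6)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

inputs = torch.randn(8, 70)
labels = torch.randint(0, 6, (8,))

weights_before = model.weight.detach().clone()

optimizer.zero_grad()                     # 1. clear gradients from the previous step
loss = criterion(model(inputs), labels)   # 2. forward pass and loss
loss.backward()                           # 3. backpropagate to compute gradients
optimizer.step()                          # 4. update the parameters

weights_changed = not torch.equal(weights_before, model.weight.detach())
print(loss.item(), weights_changed)
```

Calling `zero_grad()` before each backward pass matters because PyTorch accumulates gradients by default.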

### - `test()` Function

In [None]:
def test(
    model: torch.nn.Module,
    criterion: torch.nn.Module,
    test_loader: torch.utils.data.DataLoader,
    device: torch.device,
):
    """Validate the network.

    Parameters
    ----------
    model: torch.nn.Module
        Neural network model used in this example.

    criterion: torch.nn.Module
        Loss function used to compare predictions with labels.

    test_loader: torch.utils.data.DataLoader
        DataLoader used in testing.

    device: torch.device
        Device on which the network is evaluated.

    Returns
    -------
        Dictionary with the average loss, the accuracy, and the predicted
        and true labels for every test sample.

    """

    model.eval()

    test_loss = 0.0
    test_steps = 0
    test_total = 0
    test_correct = 0

    test_output_pred = []
    test_output_true = []

    with torch.no_grad():
        for (inputs, labels) in tqdm(test_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)

            loss = criterion(outputs, labels)
            test_loss += loss.cpu().item()
            test_steps += 1

            _, predicted = torch.max(outputs.data, 1)
            test_total += labels.size(0)
            test_correct += (predicted == labels).sum().item()

            test_output_pred += outputs.argmax(1).cpu().tolist()
            test_output_true += labels.tolist()

    # Collect the metrics and raw predictions in a single dictionary
    history = {
        'loss': test_loss / test_steps,
        'accuracy': test_correct / test_total,
        'output_pred': test_output_pred,
        'output_true': test_output_true,
    }

    print(f"Test loss: {history['loss']}, Test accuracy: {history['accuracy']}")

    return history

Let's run this for 5 epochs. In a centralised setup like this one, accuracy is expected to get close to 99%.

In [None]:
# Discover device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Create the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Create the criterion
criterion = nn.CrossEntropyLoss()

# Train for the specified number of epochs
num_epochs = 5
train(model, criterion, optimizer, train_loader, num_epochs, device)

# Training is completed, then evaluate model on the test set
history = test(model, criterion, test_loader, device)

### Model Evaluation

The classification report provides detailed metrics for each class, including precision, recall, and F1-score.

In [None]:
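from sklearn.metrics import classification_report
labels = [f"class_{i}" for i in range(n_classes)]  # placeholder names (assumption): the real class names are not given here
print("Classification Report", end="\n\n")
print(classification_report(history['output_true'], history['output_pred'], labels=list(range(n_classes)), target_names=labels))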
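
The per-class numbers in the report follow directly from counts of true positives, false positives and false negatives. The sketch below works through them on a small hand-made example. The label lists and the helper `per_class_metrics` are hypothetical and serve only to illustrate the formulas.

```python
# Hand-made labels for three classes (illustration only).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]


def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
    fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
    fn = sum(t == cls and p != cls for t, p in pairs)  # cls samples that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


for cls in sorted(set(y_true)):
    print(cls, per_class_metrics(y_true, y_pred, cls))
```

For class 0, the model is right whenever it predicts that class (precision 1.0) but finds only one of the two class-0 samples (recall 0.5).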

A confusion matrix helps us understand the types of misclassifications made by the model.

In [None]:
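# `plot_confusion_matrix` is not defined in this notebook; scikit-learn's
# ConfusionMatrixDisplay is assumed here as a stand-in.
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_true=history['output_true'],
                                        y_pred=history['output_pred'])
plt.show()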
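
A confusion matrix is simply a table of counts in which row `i`, column `j` holds the number of samples of true class `i` that were predicted as class `j`. A minimal sketch on hand-made labels follows. The helper name `confusion_matrix_counts` is hypothetical.

```python
def confusion_matrix_counts(y_true, y_pred, num_classes):
    """Build a num_classes x num_classes table of (true, predicted) counts."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true class, column = predicted class
    return matrix


# Hand-made labels for three classes (illustration only).
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

matrix = confusion_matrix_counts(y_true, y_pred, num_classes=3)
for row in matrix:
    print(row)
```

Correct predictions lie on the diagonal. Off-diagonal entries show which classes are mistaken for one another.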

This analysis helps identify areas where the model may need improvement, such as handling class imbalances or misclassifications.

### Save Model

Saving the trained model allows it to be used later for real-world deployment.

In [None]:
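path = '../../checkpoints/deep_neural_network.pt'
# Make sure the target folder exists before saving
os.makedirs(os.path.dirname(path), exist_ok=True)
torch.save({
            'epoch': num_epochs,
            'model_state_dict': model.state_dict(),
            }, path)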
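
To reuse the checkpoint later, the saved dictionary can be loaded back and its weights copied into a freshly created model of the same architecture. The sketch below assumes a one-layer stand-in model and a temporary file so that it runs on its own. In practice you would rebuild the `DNN` with the same sizes and point to the real checkpoint path.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for the trained network (assumption, for a self-contained example).
model = nn.Linear(70, 6)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pt")

    # Save in the same dictionary layout as above.
    torch.save({"epoch": 5, "model_state_dict": model.state_dict()}, path)

    # Load onto the CPU regardless of where the model was trained.
    checkpoint = torch.load(path, map_location="cpu")

# Recreate the architecture, then restore the saved weights.
restored = nn.Linear(70, 6)
restored.load_state_dict(checkpoint["model_state_dict"])
restored.eval()  # switch to inference mode

same_weights = torch.equal(model.weight, restored.weight)
print(checkpoint["epoch"], same_weights)
```

Saving the `state_dict` rather than the whole model object keeps the checkpoint independent of the class definition's file location, at the cost of having to rebuild the architecture before loading.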