# Categorical Model Comparison

**only_attacks.ipynb**

In this notebook we train a model on attack data only and export it for use in the **categorical_model_comparison.ipynb** notebook.

We will use a CNN-GAN model described in this study: <https://www.jait.us/articles/2024/JAIT-V15N7-886.pdf>

Our dataset is CIC-IDS-2017.

In [12]:
notebook = "only_attacks"

In [1]:
import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

## Step 1: Load the data

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("./data/concat.csv")

# Trim whitespace from column names
df.columns = df.columns.str.strip()

## Step 2: Preprocess the data

### missing values

In [3]:
df.isna().sum().sort_values(ascending=False)

Flow Bytes/s                   1358
Flow Duration                     0
Destination Port                  0
Total Backward Packets            0
Total Length of Fwd Packets       0
                               ... 
Idle Mean                         0
Idle Std                          0
Idle Max                          0
Idle Min                          0
Label                             0
Length: 79, dtype: int64

In [4]:
df.dropna(subset=["Flow Bytes/s"], inplace=True)

### infinite values

In [5]:
df = df.replace([np.inf, -np.inf], np.nan).dropna()

In [6]:
df.isna().sum().sort_values(ascending=False)

Destination Port               0
Flow Duration                  0
Total Fwd Packets              0
Total Backward Packets         0
Total Length of Fwd Packets    0
                              ..
Idle Mean                      0
Idle Std                       0
Idle Max                       0
Idle Min                       0
Label                          0
Length: 79, dtype: int64

## Step 3: Prepare the data for training

### scaling numerical features

In [7]:
from sklearn.preprocessing import MinMaxScaler

numerical_columns = df.select_dtypes(include="number").columns
scaler = MinMaxScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
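
For reference, `MinMaxScaler` rescales each column to [0, 1] with `(x - min) / (max - min)`. The cell above fits the scaler on the whole frame before any train/test split. A common alternative is to take the minimum and maximum from the training rows only, so that test statistics do not influence the transform. The sketch below illustrates that variant with plain NumPy on a made-up array; the helpers `fit_min_max` and `apply_min_max` are hypothetical names and not part of this notebook or of scikit-learn.

```python
import numpy as np

def fit_min_max(train):
    """Return per-column minimum and range computed on the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Constant columns would divide by zero; use a range of 1 so they map to 0.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Rescale data with statistics that were fitted elsewhere."""
    return (data - col_min) / col_range

# Toy data (made up for illustration): 3 training rows, 1 test row, 2 features.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[2.5, 40.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled)
print(test_scaled)  # test values may fall outside [0, 1]
```

With this variant a test value larger than anything seen in training maps above 1, which is expected and usually harmless.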

### map labels to multi-class

**important**: Here we drop all the benign rows, so we are left with only the attacks.

In [8]:
df = df[df["Label"] != "BENIGN"].copy()  # copy so that "Label" can be re-assigned later without a SettingWithCopyWarning

df["Label"].value_counts()

Label
DoS Hulk                      230124
PortScan                      158804
DDoS                          128025
DoS GoldenEye                  10293
FTP-Patator                     7935
SSH-Patator                     5897
DoS slowloris                   5796
DoS Slowhttptest                5499
Bot                             1956
Web Attack � Brute Force        1507
Web Attack � XSS                 652
Infiltration                      36
Web Attack � Sql Injection        21
Heartbleed                        11
Name: count, dtype: int64

In [9]:
attack_mapping = {
	"DoS Hulk": 0,
	"PortScan": 1,
	"DDoS": 2,
	"DoS GoldenEye": 3,
	"FTP-Patator": 4,
	"SSH-Patator": 5,
	"DoS slowloris": 6,
	"DoS Slowhttptest": 7,
	"Bot": 8,
	"Web Attack � Brute Force": 9,
	"Web Attack � XSS": 10,
	"Infiltration": 11,
	"Web Attack � Sql Injection": 12,
	"Heartbleed": 13
}

df["Label"] = df["Label"].map(attack_mapping)
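
Because the model will predict integer class ids, it can be handy to translate them back into attack names. A minimal sketch, using a three-entry subset of the mapping for brevity; `attack_mapping_subset`, `inverse_mapping` and `decode` are illustrative names that do not appear elsewhere in this notebook.

```python
# Subset of the notebook's mapping, shortened for illustration.
attack_mapping_subset = {"DoS Hulk": 0, "PortScan": 1, "DDoS": 2}

# Invert the dictionary: class id -> attack name.
inverse_mapping = {idx: name for name, idx in attack_mapping_subset.items()}

def decode(class_ids):
    """Translate a sequence of integer class ids back into label strings."""
    return [inverse_mapping[i] for i in class_ids]

print(decode([2, 0, 1]))
```

The same inversion should work on the full `attack_mapping`, since its values are unique.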

In [10]:
df["Label"].value_counts()

Label
0     230124
1     158804
2     128025
3      10293
4       7935
5       5897
6       5796
7       5499
8       1956
9       1507
10       652
11        36
12        21
13        11
Name: count, dtype: int64

### data splitting

In [18]:
df = df.drop_duplicates()

In [19]:
X = df.drop(columns=["Label"])
y = df["Label"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

In [20]:
# Check that no row index appears in both the training and the test set

train_hashes = set(X_train.index)
test_hashes = set(X_test.index)

assert train_hashes.isdisjoint(test_hashes), "Overlap detected between training and test sets!"

In [21]:
overlap = set(map(tuple, X_train.values)) & set(map(tuple, X_test.values))
print(f"Number of duplicate samples: {len(overlap)}")
assert len(overlap) == 0, "Data leakage detected!"

Number of duplicate samples: 0


### data sampling

In [22]:
# Undersample classes 0-9 down to 1,000 training rows (200 for the test set)
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy={
	0: 1000,
	1: 1000,
	2: 1000, 
	3: 1000,
	4: 1000,
	5: 1000,
	6: 1000,
	7: 1000,
	8: 1000,
	9: 1000,
}, random_state=28)

rus_testset = RandomUnderSampler(sampling_strategy={
	0: 200,
	1: 200,
	2: 200, 
	3: 200,
	4: 200,
	5: 200,
	6: 200,
	7: 200,
	8: 200,
	9: 200,
}, random_state=28)

X_train_balanced, y_train_balanced = rus.fit_resample(X_train, y_train)
X_test_balanced, y_test_balanced = rus_testset.fit_resample(X_test, y_test)

In [23]:
# Oversample classes 10-13 up to 1,000 training rows (200 for the test set)
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy={
	10: 1000,
	11: 1000,
	12: 1000,
	13: 1000,
}, random_state=28)

ros_testset = RandomOverSampler(sampling_strategy={
	10: 200,
	11: 200,
	12: 200,
	13: 200
}, random_state=28)

X_train_balanced, y_train_balanced = ros.fit_resample(X_train_balanced, y_train_balanced)
X_test_balanced, y_test_balanced = ros_testset.fit_resample(X_test_balanced, y_test_balanced)

In [24]:
# Check class distribution after random under- and oversampling
from collections import Counter

print(f"Class distribution before resampling: {Counter(y_train)}")
print(f"Class distribution after resampling: {Counter(y_train_balanced)}")

Class distribution before resampling: Counter({0: 138276, 2: 102411, 1: 72555, 3: 8229, 4: 4745, 6: 4308, 7: 4182, 5: 2575, 8: 1558, 9: 1176, 10: 522, 11: 29, 12: 17, 13: 9})
Class distribution after resampling: Counter({0: 1000, 1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 1000, 6: 1000, 7: 1000, 8: 1000, 9: 1000, 10: 1000, 11: 1000, 12: 1000, 13: 1000})
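
The two samplers used above differ mainly in how row indices are drawn. With default settings, `RandomUnderSampler` picks a subset of a class's rows without replacement, while `RandomOverSampler` keeps the original rows and adds repeats drawn with replacement, so minority rows are duplicated rather than synthesized. The sketch below mimics that per-class behaviour with NumPy on a toy label vector; `resample_class` is a hypothetical helper and not the imbalanced-learn implementation.

```python
import numpy as np

def resample_class(indices, target, rng):
    """Return `target` row indices for one class.

    Shrinks by drawing without replacement; grows by keeping every
    original row and adding extra rows drawn with replacement.
    """
    indices = np.asarray(indices)
    if target <= len(indices):
        return rng.choice(indices, size=target, replace=False)
    extra = rng.choice(indices, size=target - len(indices), replace=True)
    return np.concatenate([indices, extra])

rng = np.random.default_rng(28)

# Toy labels (made up): class 0 is large, class 1 is rare.
y = np.array([0] * 50 + [1] * 3)
target_per_class = 10

balanced_idx = np.concatenate([
    resample_class(np.flatnonzero(y == cls), target_per_class, rng)
    for cls in np.unique(y)
])
y_balanced = y[balanced_idx]

print(np.bincount(y_balanced))  # both classes now have `target_per_class` rows
```

One consequence for this notebook: a class with only 9 training rows that is oversampled to 1,000 consists entirely of exact copies of those 9 rows.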


## Step 4: Train the model

### 1. CNN Feature Extractor

In [25]:
import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Define CNN Feature Extractor
class CNNFeatureExtractor(nn.Module):
    def __init__(self, input_size, num_filters=32):
        super(CNNFeatureExtractor, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=num_filters, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear((input_size // 2) * num_filters, 64)
    
    def forward(self, x):
        x = x.unsqueeze(1)  # Add channel dimension
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.flatten(x)
        return self.fc(x)
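
The input width of `self.fc` follows from the layer shapes: a convolution with `kernel_size=3`, `stride=1`, `padding=1` preserves the sequence length, and the pooling layer halves it (rounding down). The sketch below redoes that arithmetic with the standard output-length formula `floor((L + 2p - k) / s) + 1`. The function names are illustrative, and the value 78 assumes the 79 columns reported earlier minus `Label`.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0):
    """Output length of a 1-D convolution or pooling layer (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

def flattened_features(input_size, num_filters=32):
    """Size of the flattened tensor that feeds the fully connected layer."""
    after_conv = conv1d_out_len(input_size, kernel_size=3, stride=1, padding=1)
    after_pool = conv1d_out_len(after_conv, kernel_size=2, stride=2)
    return after_pool * num_filters

# 78 features -> 78 positions after conv -> 39 after pooling, times 32 filters
print(flattened_features(78))
```

For these layer settings the result agrees with the expression `(input_size // 2) * num_filters` used in the class above, for odd input sizes as well.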

### 2. Generator-Discriminator

In [26]:
# Define Generator
class Generator(nn.Module):
    def __init__(self, noise_dim, output_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(noise_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim),
            nn.Tanh()
        )
    
    def forward(self, x):
        return self.model(x)


# Define Discriminator
class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.model(x)

### 3. Define Hybrid Model

In [27]:
# Define Hybrid Model
class HybridCNNGAN(nn.Module):
    def __init__(self, input_size, output_size, noise_dim=32):
        super(HybridCNNGAN, self).__init__()
        self.feature_extractor = CNNFeatureExtractor(input_size)
        self.classifier = nn.Linear(64, output_size)
        self.generator = Generator(noise_dim, input_size)
        self.discriminator = Discriminator(input_size)
    
    def forward(self, x):
        features = self.feature_extractor(x)
        return self.classifier(features)

In [28]:
# Initialize model
input_size = X_train_balanced.shape[1]
output_size = len(attack_mapping)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = HybridCNNGAN(input_size, output_size).to(device)

print("-"*30)
print("Model Summary")
print("-"*30)
print(model)
print("-"*30)
print("Device:", device)
print("-"*30)

------------------------------
Model Summary
------------------------------
HybridCNNGAN(
  (feature_extractor): CNNFeatureExtractor(
    (conv1): Conv1d(1, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (relu): ReLU()
    (pool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (fc): Linear(in_features=1248, out_features=64, bias=True)
  )
  (classifier): Linear(in_features=64, out_features=14, bias=True)
  (generator): Generator(
    (model): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=78, bias=True)
      (3): Tanh()
    )
  )
  (discriminator): Discriminator(
    (model): Sequential(
      (0): Linear(in_features=78, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=1, bias=True)
      (3): Sigmoid()
    )
  )
)
------------------------------
Device: cuda
----
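
The printed summary also lets us count parameters by hand: a `Linear(in, out)` layer has `in * out + out` parameters and a `Conv1d(c_in, c_out, k)` layer has `c_in * c_out * k + c_out`. The sketch below does this for the layer sizes shown above and separates the classification path (the only modules called in `forward`) from the generator and discriminator. The helper names are illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def conv1d_params(c_in, c_out, kernel_size):
    """Weights plus biases of a 1-D convolution."""
    return c_in * c_out * kernel_size + c_out

# Layer sizes copied from the model summary above.
classification_path = (
    conv1d_params(1, 32, 3)      # feature_extractor.conv1
    + linear_params(1248, 64)    # feature_extractor.fc
    + linear_params(64, 14)      # classifier
)
generator = linear_params(32, 128) + linear_params(128, 78)
discriminator = linear_params(78, 128) + linear_params(128, 1)

print("classification path:", classification_path)
print("generator + discriminator:", generator + discriminator)
```

Since `forward` only calls `feature_extractor` and `classifier`, a training loop that does nothing but `model(x)` leaves the generator and discriminator weights without gradients, even if they are passed to the optimizer.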

### 4. Train the model

In [29]:
# Early stopping setup (patience is measured on the training loss; no validation split is used)
early_stopping_patience = 50
best_loss = float("inf")
epochs_without_improvement = 0

# Training Setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
train_dataset = TensorDataset(torch.tensor(X_train_balanced.values, dtype=torch.float32).to(device),
                              torch.tensor(y_train_balanced.values, dtype=torch.long).to(device))
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Training Loop
num_epochs = 1000
for epoch in range(num_epochs):
	model.train()
	total_loss, correct, total = 0, 0, 0
	for i, (data, labels) in enumerate(train_loader):
		optimizer.zero_grad()
		outputs = model(data)
		loss = criterion(outputs, labels)
		
		loss.backward()
		optimizer.step()
		
		total_loss += loss.item()
		_, predicted = torch.max(outputs.data, 1)
		total += labels.size(0)
		
		correct += (predicted == labels).sum().item()
		progress = (i + 1) / len(train_loader) * 100
		
		# print(f'\rEpoch [{epoch+1}/{num_epochs}] - Progress: {progress:.1f}%', end='')

	epoch_loss = total_loss / len(train_loader)
	epoch_accuracy = correct / total

	# print(f' - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}')

	# Early Stopping Condition
	if epoch_loss < best_loss:
		best_loss = epoch_loss
		epochs_without_improvement = 0
		print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}")
	else:
		epochs_without_improvement += 1
		print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}", end="\r")
	
	if epochs_without_improvement >= early_stopping_patience:
		print(f"Early stopping triggered at epoch {epoch+1} due to no improvement.")
		break

Epoch [1/1000] - Loss: 1.5666 - Accuracy: 0.5243
Epoch [2/1000] - Loss: 0.7968 - Accuracy: 0.7255
Epoch [3/1000] - Loss: 0.6422 - Accuracy: 0.7686
Epoch [4/1000] - Loss: 0.5598 - Accuracy: 0.7976
Epoch [5/1000] - Loss: 0.4955 - Accuracy: 0.8185
Epoch [6/1000] - Loss: 0.4466 - Accuracy: 0.8450
Epoch [7/1000] - Loss: 0.4118 - Accuracy: 0.8595
Epoch [8/1000] - Loss: 0.3742 - Accuracy: 0.8726
Epoch [9/1000] - Loss: 0.3539 - Accuracy: 0.8794
Epoch [10/1000] - Loss: 0.3371 - Accuracy: 0.8807
Epoch [11/1000] - Loss: 0.3210 - Accuracy: 0.8823
Epoch [12/1000] - Loss: 0.3087 - Accuracy: 0.8858
Epoch [13/1000] - Loss: 0.2982 - Accuracy: 0.8861
Epoch [14/1000] - Loss: 0.2944 - Accuracy: 0.8875
Epoch [15/1000] - Loss: 0.2827 - Accuracy: 0.8882
Epoch [16/1000] - Loss: 0.2778 - Accuracy: 0.8886
Epoch [17/1000] - Loss: 0.2664 - Accuracy: 0.8946
Epoch [18/1000] - Loss: 0.2576 - Accuracy: 0.8976
Epoch [19/1000] - Loss: 0.2556 - Accuracy: 0.8949
Epoch [20/1000] - Loss: 0.2486 - Accuracy: 0.8968
Epoch [21
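
The loop above stops once the training loss has not improved for `early_stopping_patience` epochs. The same patience logic is often applied to a held-out validation loss instead, which reacts to overfitting rather than to optimization plateaus. A minimal, framework-free sketch follows; the `EarlyStopper` class and the loss values are hypothetical and only illustrate the bookkeeping.

```python
class EarlyStopper:
    """Track the best loss seen so far and signal when patience runs out."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, loss):
        """Record one epoch's loss; return True if training should stop."""
        if loss < self.best_loss:
            self.best_loss = loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Made-up validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.55, 0.40]

stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)
```

In a real loop, `loss` would be the mean loss over a validation loader evaluated under `torch.no_grad()`, and the best `state_dict` could be kept alongside `best_loss`.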

In [30]:
torch.save(model.state_dict(), f"./models/{notebook}.pth")

-------------------

## Evaluation

In [31]:
model.eval()

X_test_tensor = torch.tensor(X_test_balanced.values, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test_balanced.values, dtype=torch.long).to(device)

with torch.no_grad():
	outputs = model(X_test_tensor)
	_, predicted = torch.max(outputs.data, 1)

print(f"Accuracy: {accuracy_score(y_test_tensor.cpu(), predicted.cpu()):.4f}")
print(f"F1 Score: {f1_score(y_test_tensor.cpu(), predicted.cpu(), average='weighted'):.4f}")

print("\nClassification Report:\n")
print(classification_report(y_test_tensor.cpu(), predicted.cpu(), target_names=list(attack_mapping)))

Accuracy: 0.9414
F1 Score: 0.9391

Classification Report:

                            precision    recall  f1-score   support

                  DoS Hulk       1.00      0.99      0.99       200
                  PortScan       1.00      1.00      1.00       200
                      DDoS       0.99      0.99      0.99       200
             DoS GoldenEye       0.99      0.98      0.99       200
               FTP-Patator       0.99      0.99      0.99       200
               SSH-Patator       0.99      0.99      0.99       200
             DoS slowloris       0.99      0.99      0.99       200
          DoS Slowhttptest       0.88      0.99      0.93       200
                       Bot       1.00      1.00      1.00       200
  Web Attack � Brute Force       0.82      0.48      0.61       200
          Web Attack � XSS       0.65      0.88      0.74       200
              Infiltration       1.00      0.88      0.93       200
Web Attack � Sql Injection       0.94      1.00      0.9