# Categorical Model Comparison

**focus_on_WebAttack.ipynb**

In this notebook we train a model that focuses on Web Attack data and export it for use in the **categorical_model_comparison.ipynb** notebook.

We use the CNN-GAN model described in this study: <https://www.jait.us/articles/2024/JAIT-V15N7-886.pdf>

Our dataset is CIC-IDS-2017.

In [1]:
notebook = "focus_on_WebAttack"

In [2]:
import warnings
from sklearn.exceptions import UndefinedMetricWarning
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

## Step 1: Load the data

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("./data/concat.csv")

# Trim whitespace from column names
df.columns = df.columns.str.strip()

## Step 2: Preprocess the data

### Missing values

In [4]:
df.isna().sum().sort_values(ascending=False)

Flow Bytes/s                   1358
Flow Duration                     0
Destination Port                  0
Total Backward Packets            0
Total Length of Fwd Packets       0
                               ... 
Idle Mean                         0
Idle Std                          0
Idle Max                          0
Idle Min                          0
Label                             0
Length: 79, dtype: int64

In [5]:
df.dropna(subset=["Flow Bytes/s"], inplace=True)

### Infinite values

In [6]:
df = df.replace([np.inf, -np.inf], np.nan).dropna()

In [7]:
df.isna().sum().sort_values(ascending=False)

Destination Port               0
Flow Duration                  0
Total Fwd Packets              0
Total Backward Packets         0
Total Length of Fwd Packets    0
                              ..
Idle Mean                      0
Idle Std                       0
Idle Max                       0
Idle Min                       0
Label                          0
Length: 79, dtype: int64
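
The two cleaning steps above can be reproduced on a toy frame. This is a minimal sketch: the values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one NaN and two infinite values
toy = pd.DataFrame({
    "Flow Bytes/s": [10.0, np.nan, np.inf, 5.0],
    "Flow Packets/s": [1.0, 2.0, 3.0, -np.inf],
})

# Same pattern as in the notebook: turn +/-inf into NaN, then drop those rows
clean = toy.replace([np.inf, -np.inf], np.nan).dropna()

print(len(toy), "->", len(clean))
```

Only rows whose values are all finite survive, which is why the second `isna()` check reports zero for every column.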

## Step 3: Prepare the data for training

### Scaling numerical features

In [8]:
from sklearn.preprocessing import MinMaxScaler

numerical_columns = df.select_dtypes(include="number").columns
scaler = MinMaxScaler()
df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
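
The scaler above is fitted on the full dataset (benign rows included) before the train/test split, so test-set minima and maxima influence the scaling. A common alternative is to fit on the training split only and reuse the fitted scaler for the test split. The sketch below shows that pattern on synthetic arrays rather than the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(28)
X_train_toy = rng.uniform(0, 100, size=(80, 3))  # synthetic stand-in for training features
X_test_toy = rng.uniform(0, 120, size=(20, 3))   # synthetic test features with a wider range

toy_scaler = MinMaxScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)  # min/max learned from training rows only
X_test_scaled = toy_scaler.transform(X_test_toy)        # same min/max reused; may leave [0, 1]

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.min(), X_test_scaled.max())
```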

### Map labels to multi-class

**Important**: here we drop all benign rows, so only attack traffic remains.

In [9]:
# .copy() avoids pandas' SettingWithCopyWarning when "Label" is reassigned below
df = df[df["Label"] != "BENIGN"].copy()

# Keep these labels as is: Web Attack � Brute Force, Web Attack � XSS, Web Attack � Sql Injection
# Combine these labels into "Others"

df["Label"] = df["Label"].replace(["DoS Hulk", "PortScan", "DDoS", "DoS GoldenEye", "FTP-Patator", "SSH-Patator", "DoS slowloris", "DoS Slowhttptest", "Bot", "Infiltration", "Heartbleed"], "Others")
df["Label"].value_counts()

Label
Others                        554376
Web Attack � Brute Force        1507
Web Attack � XSS                 652
Web Attack � Sql Injection        21
Name: count, dtype: int64

In [10]:
attack_mapping = {
	"Others": 0,
	"Web Attack � Brute Force": 1,
	"Web Attack � XSS": 2,
	"Web Attack � Sql Injection": 3,
}

df["Label"] = df["Label"].map(attack_mapping)

In [11]:
df["Label"].value_counts()

Label
0    554376
1      1507
2       652
3        21
Name: count, dtype: int64
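
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so a quick check after mapping can catch label spellings that the dictionary misses. A minimal sketch with made-up labels:

```python
import pandas as pd

toy_mapping = {"Others": 0, "Attack A": 1}  # hypothetical labels
toy_labels = pd.Series(["Others", "Attack A", "Attack B"])

mapped = toy_labels.map(toy_mapping)  # "Attack B" has no key, so it becomes NaN
unmapped = toy_labels[mapped.isna()].unique().tolist()

print(unmapped)
```

In the notebook, the `value_counts()` output above plays the same role: the four counts add up to the number of attack rows, so no label was lost.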

### Data splitting

In [12]:
X = df.drop(columns=["Label"])
y = df["Label"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=28)
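
With only 21 `Sql Injection` rows, an unstratified split leaves the number of rare-class rows in each split to chance. `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both splits. The sketch below uses synthetic labels, not the notebook's data, and shows one option rather than what the notebook does.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels: 980 of class 0, 20 of class 3
y_toy = np.array([0] * 980 + [3] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=28, stratify=y_toy
)

print(Counter(y_tr.tolist()), Counter(y_te.tolist()))
```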

### Data sampling

In [13]:
# Undersample classes 0 and 1 to 1,000 training rows (200 rows in the test set)
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(sampling_strategy={
	0: 1000,
	1: 1000,
}, random_state=28)

rus_testset = RandomUnderSampler(sampling_strategy={
	0: 200,
	1: 200,
}, random_state=28)

X_train_balanced, y_train_balanced = rus.fit_resample(X_train, y_train)
X_test_balanced, y_test_balanced = rus_testset.fit_resample(X_test, y_test)

In [14]:
# Oversample classes 2 and 3 to 1,000 training rows (200 rows in the test set)
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(sampling_strategy={
	2: 1000,
	3: 1000,
}, random_state=28)

ros_testset = RandomOverSampler(sampling_strategy={
	2: 200,
	3: 200,
}, random_state=28)

X_train_balanced, y_train_balanced = ros.fit_resample(X_train_balanced, y_train_balanced)
X_test_balanced, y_test_balanced = ros_testset.fit_resample(X_test_balanced, y_test_balanced)

In [15]:
# Check class distribution after resampling
from collections import Counter

print(f"Class distribution before resampling: {Counter(y_train)}")
print(f"Class distribution after resampling: {Counter(y_train_balanced)}")

Class distribution before resampling: Counter({0: 443475, 1: 1223, 2: 528, 3: 18})
Class distribution after resampling: Counter({0: 1000, 1: 1000, 2: 1000, 3: 1000})
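
Random over-sampling only duplicates existing rows, so class 3 still contains just 18 distinct training samples after balancing. The NumPy sketch below mimics the combined under- and over-sampling on synthetic labels to make this visible. It is a stand-in for illustration, not the imbalanced-learn implementation, and `resample_indices` is a hypothetical helper.

```python
from collections import Counter

import numpy as np

def resample_indices(y, target, rng):
    """Return row indices so that every class ends up with `target` rows."""
    picked = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        if len(idx) >= target:
            # Under-sample: draw without replacement
            picked.append(rng.choice(idx, size=target, replace=False))
        else:
            # Over-sample: keep every original row, then add random duplicates
            extra = rng.choice(idx, size=target - len(idx), replace=True)
            picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)

rng = np.random.default_rng(28)
y_toy = np.array([0] * 5000 + [3] * 18)  # synthetic labels, counts chosen for illustration
idx = resample_indices(y_toy, target=1000, rng=rng)

print(Counter(y_toy[idx].tolist()))          # both classes now have 1000 rows
print(len(np.unique(idx[y_toy[idx] == 3])))  # class 3 still has only 18 distinct source rows
```

The same applies to the resampled test set, where the few class 3 rows are repeated to reach 200, so per-class test scores for that class rest on very few distinct examples.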


## Step 4: Train the model

### 1. CNN Feature Extractor

In [16]:
import torch
import torch.nn as nn
import torch.optim as optim

from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, f1_score, classification_report

# Define CNN Feature Extractor
class CNNFeatureExtractor(nn.Module):
    def __init__(self, input_size, num_filters=32):
        super(CNNFeatureExtractor, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=num_filters, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)
        self.flatten = nn.Flatten()
        self.fc = nn.Linear((input_size // 2) * num_filters, 64)
    
    def forward(self, x):
        x = x.unsqueeze(1)  # Add channel dimension
        x = self.conv1(x)
        x = self.relu(x)
        x = self.pool(x)
        x = self.flatten(x)
        return self.fc(x)
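
The `(input_size // 2) * num_filters` input width of `fc` follows from the standard output-length formula for 1-D convolution and pooling with dilation 1: `floor((L_in + 2*padding - kernel) / stride) + 1`. A small pure-Python check, assuming the 78 input features reported in the model summary (`out_len` is a hypothetical helper):

```python
def out_len(length, kernel, stride, padding=0):
    """Output length of a 1-D conv or pooling layer with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1

input_size, num_filters = 78, 32  # 78 features, as reported in the model summary

after_conv = out_len(input_size, kernel=3, stride=1, padding=1)  # padding=1 keeps the length
after_pool = out_len(after_conv, kernel=2, stride=2)             # pooling halves it
fc_in = after_pool * num_filters

print(after_conv, after_pool, fc_in)  # 78 39 1248
```

The result, 1248, matches the `in_features` of the `fc` layer printed in the model summary.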

### 2. Generator-Discriminator

In [17]:
# Define Generator
class Generator(nn.Module):
    def __init__(self, noise_dim, output_dim):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(noise_dim, 128),
            nn.ReLU(),
            nn.Linear(128, output_dim),
            nn.Tanh()
        )
    
    def forward(self, x):
        return self.model(x)


# Define Discriminator
class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.model(x)

### 3. Define Hybrid Model

In [18]:
# Define Hybrid Model
class HybridCNNGAN(nn.Module):
    def __init__(self, input_size, output_size, noise_dim=32):
        super(HybridCNNGAN, self).__init__()
        self.feature_extractor = CNNFeatureExtractor(input_size)
        self.classifier = nn.Linear(64, output_size)
        self.generator = Generator(noise_dim, input_size)
        self.discriminator = Discriminator(input_size)
    
    def forward(self, x):
        features = self.feature_extractor(x)
        return self.classifier(features)

In [19]:
# Initialize model
input_size = X_train_balanced.shape[1]
output_size = len(attack_mapping)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = HybridCNNGAN(input_size, output_size).to(device)

print("-"*30)
print("Model Summary")
print("-"*30)
print(model)
print("-"*30)
print("Device:", device)
print("-"*30)

------------------------------
Model Summary
------------------------------
HybridCNNGAN(
  (feature_extractor): CNNFeatureExtractor(
    (conv1): Conv1d(1, 32, kernel_size=(3,), stride=(1,), padding=(1,))
    (relu): ReLU()
    (pool): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (flatten): Flatten(start_dim=1, end_dim=-1)
    (fc): Linear(in_features=1248, out_features=64, bias=True)
  )
  (classifier): Linear(in_features=64, out_features=4, bias=True)
  (generator): Generator(
    (model): Sequential(
      (0): Linear(in_features=32, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=78, bias=True)
      (3): Tanh()
    )
  )
  (discriminator): Discriminator(
    (model): Sequential(
      (0): Linear(in_features=78, out_features=128, bias=True)
      (1): ReLU()
      (2): Linear(in_features=128, out_features=1, bias=True)
      (3): Sigmoid()
    )
  )
)
------------------------------
Device: cuda
------------------------------

### 4. Train the model

In [20]:
# Early stopping setup
early_stopping_patience = 10000  # equals num_epochs, so early stopping cannot trigger in this run
best_loss = float("inf")
epochs_without_improvement = 0

# Training Setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
train_dataset = TensorDataset(torch.tensor(X_train_balanced.values, dtype=torch.float32).to(device),
                              torch.tensor(y_train_balanced.values, dtype=torch.long).to(device))
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Training Loop
num_epochs = 10000
for epoch in range(num_epochs):
	model.train()
	total_loss, correct, total = 0, 0, 0
	for i, (data, labels) in enumerate(train_loader):
		optimizer.zero_grad()
		outputs = model(data)
		loss = criterion(outputs, labels)
		
		loss.backward()
		optimizer.step()
		
		total_loss += loss.item()
		_, predicted = torch.max(outputs.data, 1)
		total += labels.size(0)
		
		correct += (predicted == labels).sum().item()
		progress = (i + 1) / len(train_loader) * 100
		
		# print(f'\rEpoch [{epoch+1}/{num_epochs}] - Progress: {progress:.1f}%', end='')

	epoch_loss = total_loss / len(train_loader)
	epoch_accuracy = correct / total

	# print(f' - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}')

	# Early Stopping Condition
	if epoch_loss < best_loss:
		best_loss = epoch_loss
		epochs_without_improvement = 0
		print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}")
	else:
		epochs_without_improvement += 1
		print(f"Epoch [{epoch+1}/{num_epochs}] - Loss: {epoch_loss:.4f} - Accuracy: {epoch_accuracy:.4f}", end="\r")
	
	if epochs_without_improvement >= early_stopping_patience:
		print(f"Early stopping triggered at epoch {epoch+1} due to no improvement.")
		break

Epoch [1/10000] - Loss: 1.1408 - Accuracy: 0.4627
Epoch [2/10000] - Loss: 0.8050 - Accuracy: 0.6555
Epoch [3/10000] - Loss: 0.6228 - Accuracy: 0.6993
Epoch [4/10000] - Loss: 0.5528 - Accuracy: 0.7268
Epoch [5/10000] - Loss: 0.5209 - Accuracy: 0.7342
Epoch [6/10000] - Loss: 0.5058 - Accuracy: 0.7310
Epoch [7/10000] - Loss: 0.4979 - Accuracy: 0.7415
Epoch [8/10000] - Loss: 0.4889 - Accuracy: 0.7372
Epoch [9/10000] - Loss: 0.4816 - Accuracy: 0.7395
Epoch [11/10000] - Loss: 0.4802 - Accuracy: 0.7332
Epoch [12/10000] - Loss: 0.4729 - Accuracy: 0.7370
Epoch [13/10000] - Loss: 0.4723 - Accuracy: 0.7462
Epoch [15/10000] - Loss: 0.4650 - Accuracy: 0.7468
Epoch [17/10000] - Loss: 0.4647 - Accuracy: 0.7380
Epoch [18/10000] - Loss: 0.4625 - Accuracy: 0.7352
Epoch [20/10000] - Loss: 0.4603 - Accuracy: 0.7470
Epoch [21/10000] - Loss: 0.4561 - Accuracy: 0.7515
Epoch [22/10000] - Loss: 0.4531 - Accuracy: 0.7520
Epoch [25/10000] - Loss: 0.4475 - Accuracy: 0.7485
Epoch [26/10000] - Loss: 0.4471 - Accura
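
The patience logic in the loop can be isolated from the model. The sketch below runs it over a made-up list of epoch losses with a small patience value, so the stopping branch actually fires. Like the notebook, it monitors a single loss series (the notebook uses the training loss). `epochs_run` is a hypothetical helper name.

```python
def epochs_run(losses, patience):
    """Return how many epochs run before early stopping triggers."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return len(losses)

# Made-up losses: no new best after the third epoch
toy_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print(epochs_run(toy_losses, patience=3))  # 6
```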

In [21]:
torch.save(model.state_dict(), f"./models/{notebook}.pth")

-------------------

## Evaluation

In [22]:
model.eval()

X_test_tensor = torch.tensor(X_test_balanced.values, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test_balanced.values, dtype=torch.long).to(device)

with torch.no_grad():
	outputs = model(X_test_tensor)
	_, predicted = torch.max(outputs.data, 1)

print(f"Accuracy: {accuracy_score(y_test_tensor.cpu(), predicted.cpu()):.4f}")
print(f"F1 Score: {f1_score(y_test_tensor.cpu(), predicted.cpu(), average='weighted'):.4f}")

print("\nClassification Report:\n")
print(classification_report(y_test_tensor.cpu(), predicted.cpu(), target_names=list(attack_mapping)))

Accuracy: 0.7887
F1 Score: 0.7762

Classification Report:

                            precision    recall  f1-score   support

                    Others       0.99      0.99      0.99       200
  Web Attack � Brute Force       0.64      0.37      0.47       200
          Web Attack � XSS       0.58      0.79      0.67       200
Web Attack � Sql Injection       0.95      1.00      0.98       200

                  accuracy                           0.79       800
                 macro avg       0.79      0.79      0.78       800
              weighted avg       0.79      0.79      0.78       800
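
Because every class has the same support (200) in the balanced test set, the macro and weighted averages in the report coincide. A small check using the rounded per-class F1 values from the report above (label names shortened for readability):

```python
f1 = {"Others": 0.99, "Brute Force": 0.47, "XSS": 0.67, "Sql Injection": 0.98}
support = {name: 200 for name in f1}  # balanced test set

macro = sum(f1.values()) / len(f1)
weighted = sum(f1[c] * support[c] for c in f1) / sum(support.values())

print(f"{macro:.2f} {weighted:.2f}")
```

With unequal supports the two averages would diverge, and the weighted value would be dominated by the largest class.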

