# Dataset preprocessing

### 1. Overview

This notebook loads and preprocesses the CICIDS2017 dataset once and produces a clean, reproducible train/validation/test split. Both the autoencoder-based IDS and the transformer-based IDS reuse this split, so the two models are compared on identical data.

### 2. Imports

In [None]:
import os
import numpy as np
import pandas as pd
import joblib

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

### 3. Dataset Loading and Merging

In [16]:
DATASET_DIR = "dataset"

csv_files = [
    os.path.join(DATASET_DIR, f)
    for f in os.listdir(DATASET_DIR)
    if f.endswith(".csv")
]

print(f"Found {len(csv_files)} CSV files")

dfs = []
for f in csv_files:
    print(f"Loading {f}")
    df_part = pd.read_csv(f)
    dfs.append(df_part)

df = pd.concat(dfs, ignore_index=True)
print("Merged dataset shape:", df.shape)
print("Column types distribution:\n", df.dtypes.value_counts())

Found 8 CSV files
Loading dataset/Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
Loading dataset/Monday-WorkingHours.pcap_ISCX.csv
Loading dataset/Friday-WorkingHours-Morning.pcap_ISCX.csv
Loading dataset/Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
Loading dataset/Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
Loading dataset/Tuesday-WorkingHours.pcap_ISCX.csv
Loading dataset/Wednesday-workingHours.pcap_ISCX.csv
Loading dataset/Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
Merged dataset shape: (2830743, 79)
Column types distribution:
 int64      54
float64    24
object      1
Name: count, dtype: int64
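
`os.listdir` returns entries in arbitrary order, so the row order of the merged frame can vary between machines, and a seeded split would then select different rows. A minimal sketch of one way to fix the file order, assuming a hypothetical helper `list_csv_files` that is not part of the original notebook:

```python
import os
import tempfile


def list_csv_files(dataset_dir):
    """Return CSV paths in dataset_dir sorted by name (hypothetical helper)."""
    return sorted(
        os.path.join(dataset_dir, name)
        for name in os.listdir(dataset_dir)
        if name.endswith(".csv")
    )


# Demonstration on a throwaway directory with empty dummy files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    demo_files = [os.path.basename(p) for p in list_csv_files(tmp)]

print(demo_files)
```

With a sorted file list, the concatenated row order no longer depends on the filesystem.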


### 4. Column Cleanup

Strip leading and trailing whitespace from the column names, then drop the non-feature identifier columns.

In [17]:
df.columns = df.columns.str.strip()

DROP_COLS = [
    "Flow ID",
    "Source IP", "Source Port",
    "Destination IP", "Destination Port",
    "Timestamp"
]

df = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
print("Remaining columns:", len(df.columns))

Remaining columns: 78
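
The effect of the cleanup can be illustrated on a toy frame. The column names and values below are made up for the illustration; only the two operations (`str.strip` on the headers, then a guarded `drop`) mirror the cell above.

```python
import pandas as pd

# Toy frame with padded headers; values are illustrative only.
toy = pd.DataFrame({
    " Destination Port": [80, 443],
    " Flow Duration": [10, 20],
    " Label": ["BENIGN", "DDoS"],
})

# Without stripping, "Destination Port" would not match " Destination Port".
toy.columns = toy.columns.str.strip()

drop_cols = ["Flow ID", "Destination Port", "Timestamp"]
toy = toy.drop(columns=[c for c in drop_cols if c in toy.columns])

remaining = list(toy.columns)
print(remaining)
```

Stripping has to come first: the membership test `c in toy.columns` compares exact strings, so a padded header would silently survive the drop.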


### 5. Handle NaN and Infinite Values

CICIDS2017 contains undefined flow statistics (for example, rates that involve a division by zero), which appear as NaN or infinite values. Rows containing them are removed, following common practice in DL-based IDS studies.

In [18]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)

bad_rows = df.isna().any(axis=1).sum()
print(f"Rows with NaN or Inf: {bad_rows}")

df.dropna(inplace=True)
print("Cleaned dataset shape:", df.shape)

# Check for non-numeric features (X is not defined until later, so inspect df without the label)
assert df.drop(columns=["Label"]).select_dtypes(include=["object"]).empty, "Non-numeric feature detected"

Rows with NaN or Inf: 2867
Cleaned dataset shape: (2827876, 78)
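
As a small illustration of how such values arise and are removed, the sketch below builds a toy frame in which a rate is computed over a zero duration. The column names are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy flows; column names are hypothetical.
toy = pd.DataFrame({
    "bytes": [100.0, 250.0, 0.0],
    "duration": [2.0, 0.0, 0.0],
})

# A zero duration yields inf (250 / 0) or NaN (0 / 0).
toy["bytes_per_s"] = toy["bytes"] / toy["duration"]

# Same two steps as the cell above: map inf to NaN, then drop affected rows.
toy = toy.replace([np.inf, -np.inf], np.nan)
n_bad = int(toy.isna().any(axis=1).sum())
clean = toy.dropna()

print(n_bad, clean.shape)
```

Mapping infinities to NaN first lets a single `dropna` call handle both kinds of undefined value.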


### 6. Label Processing

Convert labels to binary form: BENIGN = 0, any attack class = 1.

In [19]:
df["Label"] = df["Label"].apply(lambda x: 0 if x == "BENIGN" else 1)
print(df["Label"].value_counts())

Label
0    2271320
1     556556
Name: count, dtype: int64
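
On a frame with millions of rows, a vectorized comparison may be faster than `apply` with a Python lambda. The sketch below, on a handful of illustrative label strings, checks that both forms give the same result; it is an alternative, not the notebook's method.

```python
import pandas as pd

# Illustrative label values only.
labels = pd.Series(["BENIGN", "DDoS", "PortScan", "BENIGN"], name="Label")

# Row-wise mapping, as in the cell above.
via_apply = labels.apply(lambda x: 0 if x == "BENIGN" else 1)

# Vectorized equivalent: a boolean mask cast to integers.
via_compare = (labels != "BENIGN").astype("int64")

print(via_apply.tolist(), via_compare.tolist())
```

Both forms rely on an exact string match, so any label that is not literally `"BENIGN"` is counted as an attack.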


### 7. Feature–Label Separation

In [20]:
X = df.drop(columns=["Label"])
y = df["Label"]

### 8. Train / Validation / Test Split

A fixed, stratified 70/15/15 split (`random_state=42`) is used so that both models are trained and evaluated on identical data.

In [21]:
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.3,
    stratify=y,
    random_state=42
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,
    stratify=y_temp,
    random_state=42
)

print("Train shape:", X_train.shape)
print("Validation shape:", X_val.shape)
print("Test shape:", X_test.shape)

Train shape: (1979513, 77)
Validation shape: (424181, 77)
Test shape: (424182, 77)
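
To see what the two-stage stratified split does, the following sketch repeats it on synthetic data with an 80/20 class balance and reports the size and attack ratio of each part. The data is random and stands in for the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 20% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 800 + [1] * 200)

# Stage 1: 70% train, 30% held out.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
# Stage 2: split the held-out 30% evenly into validation and test.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, len(part), round(float(part.mean()), 3))
```

Because both stages pass `stratify`, each part keeps roughly the class ratio of the full set, which matters when attacks are the minority class.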


### 9. Feature Scaling

`StandardScaler` scales each feature to zero mean and unit variance.  
The scaler is fitted only on the training data and then applied unchanged to the validation and test splits, so no statistics from held-out data leak into training.

In [23]:
scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
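
A quick sanity check of the fit-on-train-only pattern, sketched on synthetic data: the scaled training set has mean 0 and standard deviation 1 per feature, while the validation set comes out only approximately standardized because it reuses the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real splits.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(500, 3))
X_val = rng.normal(loc=5.0, scale=2.0, size=(100, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from train only
X_val_scaled = scaler.transform(X_val)          # reuses the train statistics

train_mean = X_train_scaled.mean(axis=0)
train_std = X_train_scaled.std(axis=0)
val_mean = X_val_scaled.mean(axis=0)

print(np.round(train_mean, 6), np.round(train_std, 6))
print(np.round(val_mean, 3))
```

A validation mean that is close to, but not exactly, zero is expected and indicates that the scaler was not refitted on held-out data.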

### 10. Save Processed Artifacts

The scaled feature splits, the labels, and the fitted scaler are saved to `processed/` for reuse by subsequent notebooks.

In [None]:
os.makedirs("processed", exist_ok=True)

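# Keep the original feature names as CSV headers (scaling returns plain arrays)
pd.DataFrame(X_train_scaled, columns=X_train.columns).to_csv("processed/X_train.csv", index=False)
pd.DataFrame(X_val_scaled, columns=X_val.columns).to_csv("processed/X_val.csv", index=False)
pd.DataFrame(X_test_scaled, columns=X_test.columns).to_csv("processed/X_test.csv", index=False)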

y_train.to_csv("processed/y_train.csv", index=False)
y_val.to_csv("processed/y_val.csv", index=False)
y_test.to_csv("processed/y_test.csv", index=False)

joblib.dump(scaler, "processed/scaler.pkl")

['processed/scaler.pkl']
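
A consuming notebook could read these artifacts back roughly as follows. This is a sketch under assumptions: it writes tiny synthetic files to a temporary directory in place of `processed/`, and assumes the label series is named `Label`, as it is after the separation step.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

with tempfile.TemporaryDirectory() as out_dir:
    # Stand-in for the save cell, using tiny synthetic data.
    X_small = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
    y_small = pd.Series([0, 1, 0], name="Label")
    scaler = StandardScaler().fit(X_small)
    pd.DataFrame(scaler.transform(X_small)).to_csv(
        os.path.join(out_dir, "X_train.csv"), index=False
    )
    y_small.to_csv(os.path.join(out_dir, "y_train.csv"), index=False)
    joblib.dump(scaler, os.path.join(out_dir, "scaler.pkl"))

    # What a downstream notebook might do to load the artifacts.
    X_loaded = pd.read_csv(os.path.join(out_dir, "X_train.csv")).to_numpy()
    y_loaded = pd.read_csv(os.path.join(out_dir, "y_train.csv"))["Label"].to_numpy()
    scaler_loaded = joblib.load(os.path.join(out_dir, "scaler.pkl"))

print(X_loaded.shape, y_loaded.tolist(), scaler_loaded.mean_)
```

The loaded scaler is only needed when new raw flows must be transformed; the saved feature files are already scaled.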

### 11. Summary Statistics

In [25]:
summary = {
    "total_samples": len(df),
    "train_samples": len(y_train),
    "val_samples": len(y_val),
    "test_samples": len(y_test),
    "benign_ratio": (y == 0).mean(),
    "attack_ratio": (y == 1).mean()
}

pd.Series(summary)

total_samples    2.827876e+06
train_samples    1.979513e+06
val_samples      4.241810e+05
test_samples     4.241820e+05
benign_ratio     8.031894e-01
attack_ratio     1.968106e-01
dtype: float64

### 12. Output

The processed/ directory now contains all cleaned, scaled, and split data required for both IDS implementations. Subsequent notebooks must not perform additional preprocessing steps.