# Network Intrusion Detection on DataSense: CIC IIoT 2025 (1-Second Windows)

This notebook builds a **modular and share-safe** ML pipeline for the **CIC IIoT 2025** (DataSense) dataset.  
It supports **1–10 s** window CSVs, synced sensor+network features, and both **binary** and **multiclass** tasks.

## ⚙️ Pipeline Overview
1. Load and combine the 1-second attack & benign CSVs, then **shuffle** the rows.
2. Do **light cleaning** and prepare features.
3. Use the **ANOVA F-test** to rank features and select the most informative ones.
4. Use **Stratified K-Fold** cross-validation for robust evaluation.
5. Train baseline models:
   - Logistic Regression  
   - SVM (RBF)  
   - Random Forest  
6. Train additional models:
   - K-Means clustering (unsupervised ⇒ mapped to labels)  
   - K-Nearest Neighbors (KNN)  
   - LightGBM (if installed)  
   - XGBoost (if installed)
7. Compare models using:
   - Accuracy  
   - **Macro F1-score** (primary)  
   - Full classification report  
   - Confusion matrix for the best model
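
K-Means produces cluster IDs rather than class names, so step 6 needs a cluster-to-label mapping. One common approach is a majority vote per cluster, sketched below on toy data. The helper name `map_clusters_to_labels` and the two-blob dataset are illustrative assumptions, not necessarily what the later cells of this notebook use.

```python
import numpy as np
from sklearn.cluster import KMeans

def map_clusters_to_labels(cluster_ids, y_true):
    """Hypothetical helper: give each cluster its most frequent true label."""
    mapping = {}
    for cid in np.unique(cluster_ids):
        labels, counts = np.unique(y_true[cluster_ids == cid], return_counts=True)
        mapping[cid] = labels[np.argmax(counts)]
    return mapping

# Toy data: two well-separated blobs standing in for "benign" and "attack".
rng = np.random.default_rng(42)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])
y_toy = np.array(["benign"] * 50 + ["attack"] * 50)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_toy)
mapping = map_clusters_to_labels(km.labels_, y_toy)

# Translate cluster IDs into class names and score the result.
y_pred_toy = np.array([mapping[c] for c in km.labels_])
kmeans_toy_acc = float((y_pred_toy == y_toy).mean())
print("Cluster-to-label mapping:", mapping)
print("Toy accuracy:", kmeans_toy_acc)
```

Inside cross-validation, the mapping should be learned from the training fold only and then applied to the validation fold; otherwise the validation labels leak into the predictions.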


**Dependencies:** `python-dotenv`, `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `joblib`, `imbalanced-learn`; optional: `lightgbm`, `xgboost`

### Setup & Imports

In [1]:
# 0) Imports & configuration
import os
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from dotenv import load_dotenv

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix,
)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

warnings.filterwarnings("ignore")

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

### 1. Load & Combine 1-Second CSVs (Shuffled)

In [2]:
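load_dotenv("/home/jovyan/Notebooks/.env")
base_path = os.getenv("DATA_PATH")
# Fail early with a clear message instead of a TypeError in os.path.join
if base_path is None:
    raise RuntimeError("DATA_PATH is not set; define it in the .env file.")
# Sanity check
print("Base data path:", base_path)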

attack_file = "attack_samples_1sec.csv"
benign_file = "benign_samples_1sec.csv"

attack_path = os.path.join(base_path, "attack_data", attack_file)
benign_path = os.path.join(base_path, "benign_data", benign_file)

# Sanity check
print("Attack CSV:", attack_path)
print("Benign CSV:", benign_path)

# Load the full CSVs; low_memory=False avoids mixed-dtype inference across chunks
df_attack = pd.read_csv(attack_path, low_memory=False)
df_benign = pd.read_csv(benign_path, low_memory=False)

# Check if loaded (print shape)
print("Attack shape:", df_attack.shape)
print("Benign shape:", df_benign.shape)


Base data path: /home/jovyan/Notebooks/DATA
Attack CSV: /home/jovyan/Notebooks/DATA/attack_data/attack_samples_1sec.csv
Benign CSV: /home/jovyan/Notebooks/DATA/benign_data/benign_samples_1sec.csv
Attack shape: (90391, 94)
Benign shape: (136800, 94)


In [3]:
# Add source file column to identify the origin of each sample
df_attack["source_file"] = "attack"
df_benign["source_file"] = "benign"

# Combine datasets
df = pd.concat([df_attack, df_benign], ignore_index=True)

# Shuffle the combined dataset (reproducible via RANDOM_SEED)
df = df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

# Check combined shape
print("Combined shape:", df.shape)

# Preview the first few rows
print(df.head())

Combined shape: (227191, 95)
         device_name         device_mac  \
0         plug-flame  d4:a6:51:20:91:f7   
1  ultrasonic-sensor  08:b6:1f:82:ee:c4   
2   plug-all-sensors  d4:a6:51:82:98:a8   
3             router  28:87:ba:bd:c6:6c   
4   vibration-sensor  08:b6:1f:82:27:d0   

                                         label_full  label1  label2  \
0                             benign_whole-network3  benign  benign   
1                             benign_whole-network3  benign  benign   
2  attack_recon_host-disc-udp-ping_plug-all-sensors  attack   recon   
3            attack_mitm_ip-spoofing_router--switch  attack    mitm   
4              attack_recon_port-scan_whole-network  attack   recon   

               label3                    label4  \
0              benign                    benign   
1              benign                    benign   
2  host-disc-udp-ping  recon_host-disc-udp-ping   
3         ip-spoofing          mitm_ip-spoofing   
4           port-scan         

### 2. Clean & Prepare Features

**Goals:**
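- Choose the target label column: we use `label2` (attack category).
- Drop raw identifier / list-like columns (IPs, ports, MACs, protocols).
- Drop every remaining text column, which also removes the other label columns (`label_full`, `label1`, `label3`, `label4`) and `source_file` so they cannot leak the target.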
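
The preview above shows several columns that encode the answer (`label_full`, `label1`, `label3`, `label4`), and `source_file` does as well. The next cell removes them implicitly because they are text columns. A more explicit guard could look like the sketch below; the toy frame and its numeric values are made up for illustration.

```python
import pandas as pd

TARGET_COL = "label2"  # same target as in this notebook

# Toy frame mimicking the label columns of the combined dataset.
toy = pd.DataFrame({
    "label_full": ["benign_whole-network3", "attack_mitm_ip-spoofing_router--switch"],
    "label1": ["benign", "attack"],
    "label2": ["benign", "mitm"],
    "label3": ["benign", "ip-spoofing"],
    "label4": ["benign", "mitm_ip-spoofing"],
    "source_file": ["benign", "attack"],
    "network_packets_all_count": [12, 340],  # illustrative values only
})

# Columns that reveal the class and must never be used as features.
leaky_cols = [c for c in toy.columns if c.startswith("label") or c == "source_file"]

y_toy = toy[TARGET_COL]
X_toy = toy.drop(columns=leaky_cols)
print("Remaining feature columns:", list(X_toy.columns))
```

Naming the leaky columns explicitly keeps the pipeline safe even if a label column is ever stored with a non-text dtype.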

In [4]:
TARGET = "label2" # Target column name

# checking it exists in the dataframe
if TARGET not in df.columns:
    raise ValueError(f"Target column '{TARGET}' not found in the dataframe.")

# Drop raw identifiers / list-like text columns, which are not useful for modeling
drop_cols = [
    "device_mac",
    "network_ips_all","network_ips_dst","network_ips_src",
    "network_macs_all","network_macs_dst","network_macs_src",
    "network_ports_all","network_ports_dst","network_ports_src",
    "network_protocols_all","network_protocols_dst","network_protocols_src",
]
drop_cols = [c for c in drop_cols if c in df.columns]
df.drop(columns=drop_cols, inplace=True, errors="ignore")

# Set up features and target
X = df.drop(columns=[TARGET])
y = df[TARGET].astype(str)  # Ensure target is string type for classification

# Drop remaining text (object) columns, including the other label columns and source_file
obj_cols = X.select_dtypes(include=["object"]).columns.tolist()
X = X.drop(columns=obj_cols)

# Check shapes
print("Feature matrix shape:", X.shape)
print("Target distribution:")
print(y.value_counts(normalize=True))



Feature matrix shape: (227191, 71)
Target distribution:
label2
benign        0.602137
recon         0.148104
dos           0.081077
ddos          0.079475
mitm          0.035486
malware       0.033192
web           0.012307
bruteforce    0.008222
Name: proportion, dtype: float64
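
With `benign` at about 60% of the rows and `bruteforce` below 1%, accuracy alone can hide poor performance on rare attack classes, which is why macro F1 is the primary metric here. The toy example below (made-up labels, not DataSense results) shows a classifier that always predicts the majority class.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: 9 benign, 1 bruteforce (proportions are illustrative only).
y_true_toy = ["benign"] * 9 + ["bruteforce"]
# A degenerate classifier that always predicts the majority class.
y_pred_toy = ["benign"] * 10

acc_toy = accuracy_score(y_true_toy, y_pred_toy)
macro_f1_toy = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)
print(f"accuracy={acc_toy:.3f}  macro_f1={macro_f1_toy:.3f}")
# accuracy=0.900  macro_f1=0.474
```

Macro F1 averages the per-class F1 scores with equal weight, so the missed `bruteforce` class pulls the score down even though 90% of the predictions are correct.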


### 3. ANOVA (F-test) Feature Ranking

- Uses `SelectKBest(f_classif)` to compute a one-way ANOVA F-score for each feature.
- A higher F-score means the feature's mean differs more between classes, relative to its variance within each class.
- This is a simple, fast, univariate filter: it scores each feature on its own, so it cannot detect feature interactions and does not penalize redundant features.
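
To make the F-score concrete, here is a small sketch on synthetic data (an assumption for illustration, not DataSense features): one feature whose mean shifts with the class and one feature that is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
y_toy = np.repeat([0, 1, 2], 100)  # three classes, 100 samples each

# Feature 0: class means differ (informative). Feature 1: pure noise.
informative = rng.normal(loc=y_toy * 3.0, scale=1.0)
noise = rng.normal(loc=0.0, scale=1.0, size=y_toy.size)
X_toy = np.column_stack([informative, noise])

# f_classif returns one F-statistic and one p-value per feature.
f_vals, p_vals = f_classif(X_toy, y_toy)
print("F-scores:", f_vals)
print("p-values:", p_vals)
```

The informative feature should receive an F-score orders of magnitude larger than the noise feature, which is the ordering that `SelectKBest` relies on.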

In [5]:
# Double-check imports
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

# Impute missing values with median 
imp = SimpleImputer(strategy="median")
X_imputed = pd.DataFrame(imp.fit_transform(X), columns=X.columns)

# Compute F-scores
selector = SelectKBest(score_func=f_classif, k="all")
selector.fit(X_imputed, y)

f_scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)

# Number of top-ranked features to keep
TOP_K = 30
print(f"\nTop {TOP_K} features by ANOVA F-score:")
display(f_scores.head(TOP_K))

selected_features = f_scores.head(TOP_K).index.tolist()
print("Selected features for modeling:", selected_features)

# Final modeling matrix
X_selected = X_imputed[selected_features].copy()
y_selected = y.copy()
print("\nFinal feature matrix shape:", X_selected.shape)




Top 30 features by ANOVA F-score:


network_mss_max                      30107.385961
network_mss_avg                      29926.800948
network_mss_min                      29334.814214
network_packets_all_count            25808.666533
network_packets_dst_count            25549.298304
network_ips_all_count                23450.157106
network_ips_dst_count                23446.285168
network_ports_all_count              15324.252625
network_macs_all_count               14823.585131
network_macs_dst_count               13972.036543
network_macs_src_count               13972.036543
network_ips_src_count                13421.363712
network_ports_src_count              13395.086059
network_protocols_dst_count          12156.897017
network_protocols_src_count          10944.116919
network_protocols_all_count           9474.356954
network_header-length_min             9243.443779
network_header-length_max             9243.104091
network_header-length_avg             9241.169235
network_fragmentation-score           8281.453361


Selected features for modeling: ['network_mss_max', 'network_mss_avg', 'network_mss_min', 'network_packets_all_count', 'network_packets_dst_count', 'network_ips_all_count', 'network_ips_dst_count', 'network_ports_all_count', 'network_macs_all_count', 'network_macs_dst_count', 'network_macs_src_count', 'network_ips_src_count', 'network_ports_src_count', 'network_protocols_dst_count', 'network_protocols_src_count', 'network_protocols_all_count', 'network_header-length_min', 'network_header-length_max', 'network_header-length_avg', 'network_fragmentation-score', 'network_fragmented-packets', 'network_packet-size_avg', 'network_payload-length_avg', 'network_ip-length_avg', 'network_ip-flags_avg', 'network_packet-size_max', 'network_ip-length_max', 'network_payload-length_max', 'network_packet-size_std_deviation', 'network_ip-length_std_deviation']

Final feature matrix shape: (227191, 30)
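
Note that the imputer and the F-score ranking above are fitted on all rows, so the rows later used for validation also influence which features are selected. One way to avoid this is to place imputation and selection inside a scikit-learn `Pipeline`, so that they are refitted on the training part of each fold. The sketch below uses synthetic data and a logistic regression as stand-ins; it is an alternative design, not the method used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption: not DataSense data).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=5, n_classes=3, random_state=42
)
X_toy[::25, 0] = np.nan  # inject a few missing values

# Every step is fitted on the training portion of each fold only.
leak_free = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(leak_free, X_toy, y_toy, cv=cv, scoring="f1_macro")
print("Macro F1 per fold:", np.round(fold_scores, 3))
```

The trade-off is that the selected feature set can differ from fold to fold, so a single global ranking like the one above remains useful for inspection and reporting.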


### 4. Stratified K-Fold Cross-Validation

We use **10-fold stratified cross-validation** (9:1 ratio per fold) so that:
- each model is trained on ~90% of the data and validated on ~10%,
- every sample appears in a validation fold exactly once,
- class proportions are preserved in each fold (StratifiedKFold).
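
The three properties above can be checked on a small toy label vector (an illustrative assumption: 90 majority and 10 minority samples, 5 folds instead of 10):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: 90 majority vs 10 minority samples.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for splitting

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
minority_per_fold = []
seen = []
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # Count minority samples in this validation fold.
    minority_per_fold.append(int(y_toy[val_idx].sum()))
    seen.extend(val_idx.tolist())

print("Minority samples per validation fold:", minority_per_fold)
print("Every sample validated exactly once:", sorted(seen) == list(range(len(y_toy))))
```

Each validation fold holds 20 samples with 2 from the minority class, matching the 10% minority share of the full toy set.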

In [7]:
# Double check imports
from sklearn.model_selection import StratifiedKFold

# Set up Stratified K-Fold
N_SPLITS = 10  # each fold trains on ~90% of the rows and validates on ~10%
skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_SEED)

print(f"\nPerforming Stratified {N_SPLITS}-Fold Cross-Validation")


Performing Stratified 10-Fold Cross-Validation


### Helper Functions

We reuse these two helpers for all models:

- `cross_val_oof_predict`:
  - runs K-fold training,
  - stores predictions for each sample from the fold where it was *not* used for training.
- `eval_report`:
  - prints Accuracy & **Macro-F1**,
  - shows a full classification report,
  - returns a dict with the accuracy, the macro F1, and the confusion matrix as a DataFrame.


In [None]:
def cross_val_oof_predict(estimator, X, y, skf):
    '''
    Perform out-of-fold predictions using cross-validation.

    Parameters:
    - estimator: The machine learning model to train (refitted on every fold).
    - X: Feature matrix (DataFrame).
    - y: Target vector (Series).
    - skf: StratifiedKFold cross-validator.

    Returns:
    - oof_preds: Out-of-fold predictions, one per sample, with the same dtype as y.
    '''
    # Match y's dtype so string class labels can be stored (a float array would reject them)
    oof_preds = np.empty(len(y), dtype=y.to_numpy().dtype)
    n_splits = skf.get_n_splits()  # read from the splitter instead of the global N_SPLITS
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train = y.iloc[train_idx]
        estimator.fit(X_train, y_train)
        oof_preds[test_idx] = estimator.predict(X_test)
        print(f"    Completed Fold {fold + 1}/{n_splits}")

    return oof_preds

def eval_report(y_true, y_pred, label="Model"):
    '''
    Print evaluation metrics for the model and return them.

    Parameters:
    - y_true: True labels (Series).
    - y_pred: Predicted labels.
    - label: Model label for reporting.

    Returns:
    - dict with "accuracy", "macro_f1", and "cm" (confusion matrix as a DataFrame).
    '''
    acc = accuracy_score(y_true, y_pred)
    macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

    print(f"\n=== Evaluation Report for {label} ===")
    print(f"Accuracy: {acc:.4f}")
    print(f"Macro F1: {macro_f1:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred, digits=4, zero_division=0))
    # Rows are true classes, columns are predicted classes
    classes = sorted(y_true.unique())
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    cm_df = pd.DataFrame(cm, index=classes, columns=classes)
    return {"accuracy": acc, "macro_f1": macro_f1, "cm": cm_df}
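
As a cross-check for the out-of-fold helper, scikit-learn's built-in `cross_val_predict` returns the same kind of result: one prediction per sample, made by a model that did not see that sample during training. The sketch below runs it on synthetic data with string labels; the dataset, the class names assigned to it, and the small random forest are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X_arr, y_arr = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_toy = pd.DataFrame(X_arr)
# String labels, as in the notebook's `label2` column (names here are placeholders).
y_toy = pd.Series(y_arr).map({0: "benign", 1: "dos", 2: "recon"})

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One out-of-fold prediction per sample; string labels are preserved.
oof_toy = cross_val_predict(model, X_toy, y_toy, cv=cv)
oof_macro_f1 = f1_score(y_toy, oof_toy, average="macro")
print("OOF predictions:", len(oof_toy), "| macro F1:", round(oof_macro_f1, 4))
```

The custom helper is still useful when per-fold progress messages are wanted, while `cross_val_predict` adds conveniences such as `n_jobs` for parallel folds.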