# Network Intrusion Detection on DataSense: CIC IIoT 2025 (1-Second Windows)

This notebook builds a **modular and share-safe** ML pipeline for the **CIC IIoT 2025** (DataSense) dataset.  
It supports **1–10 s** window CSVs, synced sensor+network features, and both **binary** and **multiclass** tasks.

## ⚙️ Pipeline Overview
1. Load and combine the 1-second attack & benign CSVs, then **shuffle** the rows.
2. Do **light cleaning** and prepare features.
3. Use the **ANOVA F-test** (`f_classif`) to rank features and select the most informative ones.
4. Use **Stratified K-Fold** cross-validation for robust evaluation.
5. Train baseline models:
   - Logistic Regression  
   - SVM (RBF)  
   - Random Forest  
6. Train additional models:
   - K-Means clustering (unsupervised ⇒ mapped to labels)  
   - K-Nearest Neighbors (KNN)  
   - LightGBM (if installed)  
   - XGBoost (if installed)
7. Compare models using:
   - Accuracy  
   - **Macro F1-score** (primary)  
   - Full classification report  
   - Confusion matrix for the best model
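
Steps 3, 4, and 7 can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the notebook's actual configuration: the `make_classification` data, `k=10`, five folds, median imputation, and Logistic Regression as the estimator are all assumptions chosen for the example. Placing the imputer, scaler, and `SelectKBest` inside a `Pipeline` means the feature ranking is refit on each training fold, so the held-out fold does not influence which features are selected.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix (assumption for illustration).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8,
    n_classes=3, random_state=42,
)

# Preprocessing and ANOVA-based selection live inside the pipeline,
# so they are fit on the training portion of each fold only.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
acc_scores, f1_scores = [], []
for train_idx, test_idx in skf.split(X, y):
    pipe.fit(X[train_idx], y[train_idx])
    y_pred = pipe.predict(X[test_idx])
    acc_scores.append(accuracy_score(y[test_idx], y_pred))
    f1_scores.append(f1_score(y[test_idx], y_pred, average="macro"))

print(f"Accuracy: {np.mean(acc_scores):.3f} +/- {np.std(acc_scores):.3f}")
print(f"Macro F1: {np.mean(f1_scores):.3f} +/- {np.std(f1_scores):.3f}")
```

Macro F1 averages the per-class F1 scores with equal weight, which is why it is a reasonable primary metric when attack categories are unevenly represented.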


**Dependencies:** `python-dotenv`, `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `joblib`, `imbalanced-learn`; optional: `lightgbm`, `xgboost`
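
Because LightGBM and XGBoost are only used when installed, the notebook needs some guard around them. One possible guard, sketched below, probes for each package with `importlib.util.find_spec` before any import is attempted. The `OPTIONAL_PACKAGES` tuple and the `available` dictionary are hypothetical names introduced for this example, not names from the notebook.

```python
import importlib.util

# Hypothetical names for illustration; not taken from the original notebook.
OPTIONAL_PACKAGES = ("lightgbm", "xgboost")

# find_spec returns None when a top-level package cannot be located.
available = {
    name: importlib.util.find_spec(name) is not None
    for name in OPTIONAL_PACKAGES
}

for name, found in available.items():
    status = "available" if found else "not installed, model will be skipped"
    print(f"{name}: {status}")
```

A later cell could then check `available["lightgbm"]` before importing the library and adding the model to the comparison.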

### Setup & Imports

In [4]:
# 0) Imports & configuration
import os
import warnings
from collections import Counter

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from dotenv import load_dotenv

from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_recall_fscore_support,
    classification_report,
    confusion_matrix,
)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

warnings.filterwarnings("ignore")

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

### 1. Load & Combine 1-Second CSVs (Shuffled)

In [5]:
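load_dotenv("/home/jovyan/Notebooks/.env")
base_path = os.getenv("DATA_PATH")
# Fail early with a clear message if the variable is missing
if base_path is None:
    raise EnvironmentError("DATA_PATH is not set; define it in the .env file.")
# Sanity check
print("Base data path:", base_path)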

attack_file = "attack_samples_1sec.csv"
benign_file = "benign_samples_1sec.csv"

attack_path = os.path.join(base_path, "attack_data", attack_file)
benign_path = os.path.join(base_path, "benign_data", benign_file)

# Sanity check
print("Attack CSV:", attack_path)
print("Benign CSV:", benign_path)

# Load the full CSVs (low_memory=False avoids chunked parsing and mixed-dtype inference)
df_attack = pd.read_csv(attack_path, low_memory=False)
df_benign = pd.read_csv(benign_path, low_memory=False)

# Check if loaded (print shape)
print("Attack shape:", df_attack.shape)
print("Benign shape:", df_benign.shape)


Base data path: /home/jovyan/Notebooks/DATA
Attack CSV: /home/jovyan/Notebooks/DATA/attack_data/attack_samples_1sec.csv
Benign CSV: /home/jovyan/Notebooks/DATA/benign_data/benign_samples_1sec.csv
Attack shape: (90391, 94)
Benign shape: (136800, 94)


In [11]:
# Tag each sample with its origin file (this mirrors the attack/benign label, so exclude it from the features)
df_attack["source_file"] = "attack"
df_benign["source_file"] = "benign"

# Combine datasets
df = pd.concat([df_attack, df_benign], ignore_index=True)

# Shuffle the combined dataset
df = df.sample(frac=1, random_state=RANDOM_SEED).reset_index(drop=True)

# Check combined shape
print("Combined shape:", df.shape)

# Preview the first few rows
print(df.head())

Combined shape: (227191, 95)
         device_name         device_mac  \
0         plug-flame  d4:a6:51:20:91:f7   
1  ultrasonic-sensor  08:b6:1f:82:ee:c4   
2   plug-all-sensors  d4:a6:51:82:98:a8   
3             router  28:87:ba:bd:c6:6c   
4   vibration-sensor  08:b6:1f:82:27:d0   

                                         label_full  label1  label2  \
0                             benign_whole-network3  benign  benign   
1                             benign_whole-network3  benign  benign   
2  attack_recon_host-disc-udp-ping_plug-all-sensors  attack   recon   
3            attack_mitm_ip-spoofing_router--switch  attack    mitm   
4              attack_recon_port-scan_whole-network  attack   recon   

               label3                    label4  \
0              benign                    benign   
1              benign                    benign   
2  host-disc-udp-ping  recon_host-disc-udp-ping   
3         ip-spoofing          mitm_ip-spoofing   
4           port-scan         

### 2. Clean & Prepare Features

**Goals:**
- Choose the target label column: we’ll use `label2` (attack category).
- Drop raw identifier / list-like columns (IPs, MACs, ports, protocols).
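
Before dropping anything, it can help to confirm which columns are actually non-numeric and how the target classes are distributed. The sketch below does this on a toy frame. The four rows are made up, and `packet_count` is a hypothetical numeric feature; only `device_mac`, `network_ips_all`, and `label2` are column names taken from the real data.

```python
import pandas as pd

# Toy frame mimicking a few columns of the real data (all values are made up).
toy = pd.DataFrame({
    "device_mac": [
        "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02",
        "aa:aa:aa:aa:aa:03", "aa:aa:aa:aa:aa:04",
    ],
    "network_ips_all": [
        "['10.0.0.1', '10.0.0.2']", "['10.0.0.3']",
        "['10.0.0.1']", "['10.0.0.4', '10.0.0.5']",
    ],
    "packet_count": [12, 40, 7, 3],  # hypothetical numeric feature
    "label2": ["benign", "recon", "benign", "mitm"],
})

target = "label2"

# Non-numeric columns other than the target are candidates for dropping or encoding.
non_numeric = (
    toy.drop(columns=[target])
    .select_dtypes(exclude="number")
    .columns.tolist()
)
print("Non-numeric feature columns:", non_numeric)

# Class balance of the target, as fractions of all rows.
class_share = toy[target].value_counts(normalize=True)
print(class_share)
```

On the real dataframe, the same two checks would show whether any identifier-like text columns remain after the drop and how skewed the `label2` categories are.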

In [None]:
TARGET = "label2" # Target column name

# Check that the target exists in the dataframe
if TARGET not in df.columns:
    raise ValueError(f"Target column '{TARGET}' not found in the dataframe.")

# Drop raw identifiers / list-like text columns, which are not useful for modeling
drop_cols = [
    "device_mac",
    "network_ips_all","network_ips_dst","network_ips_src",
    "network_macs_all","network_macs_dst","network_macs_src",
    "network_ports_all","network_ports_dst","network_ports_src",
    "network_protocols_all","network_protocols_dst","network_protocols_src",
]
# Drop only the listed columns that are actually present
drop_cols = [c for c in drop_cols if c in df.columns]
df.drop(columns=drop_cols, inplace=True)

#

