# Network Intrusion Detection on DataSense: CIC IIoT 2025

This notebook builds a **modular and share-safe** ML pipeline for the **CIC IIoT 2025** (DataSense) dataset.  
It supports **1–10 s** window CSVs, synced sensor+network features, and both **binary** and **multiclass** tasks.

## ⚙️ Pipeline Overview
1. **Setup & Imports** – load dependencies and environment variables  
2. **Data Loading** – read the 1-second attack + benign CSVs  
3. **Data Cleaning & Feature Prep** – drop IDs, encode categoricals, fill missing values  
4. **Train/Test Split & Scaling** – stratified sampling and normalization  
5. **Baseline Model (Logistic Regression)**  
6. **Tree-Based Models (Random Forest & LightGBM)**  
7. **Evaluation Summary & Feature Importance**  


**Dependencies:** `python-dotenv`, `pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `joblib`, `lightgbm`, `imbalanced-learn` (optional)

### 1. Setup & Imports

In [28]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import warnings
warnings.filterwarnings("ignore")

load_dotenv("/home/jovyan/Notebooks/.env")
base_path = os.getenv("DATA_PATH")
# Sanity check
print("Base data path:", base_path)

Base data path: /home/jovyan/Notebooks/DATA
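
`os.getenv` returns `None` when the variable is missing, and the later `os.path.join` calls would then fail with a less obvious error. A minimal sketch of an early check follows; `resolve_data_path` is a hypothetical helper, not part of this notebook.

```python
import os
from pathlib import Path


def resolve_data_path(var_name="DATA_PATH"):
    """Return the directory named by an environment variable, or fail early.

    Hypothetical helper: the notebook itself only calls os.getenv("DATA_PATH").
    """
    value = os.getenv(var_name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"{var_name} is not set; check the .env file")
    path = Path(value)
    if not path.is_dir():
        raise FileNotFoundError(f"{var_name} points to a missing directory: {path}")
    return path
```

Calling `resolve_data_path()` right after `load_dotenv(...)` would surface a misconfigured `.env` before any CSV is read.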


### 2. Data Loading (1-second CSVs only)

In [29]:
attack_file = "attack_samples_1sec.csv"
benign_file = "benign_samples_1sec.csv"

attack_path = os.path.join(base_path, "attack_data", attack_file)
benign_path = os.path.join(base_path, "benign_data", benign_file)

# Sanity check
print(attack_path)
print(benign_path)

# Load the full CSVs; low_memory=False makes pandas infer dtypes from whole columns
df_attack = pd.read_csv(attack_path, low_memory=False)
df_benign = pd.read_csv(benign_path, low_memory=False)

# Check if loaded (print shape)
print("Attack shape:", df_attack.shape)
print("Benign shape:", df_benign.shape)


/home/jovyan/Notebooks/DATA/attack_data/attack_samples_1sec.csv
/home/jovyan/Notebooks/DATA/benign_data/benign_samples_1sec.csv
Attack shape: (90391, 94)
Benign shape: (136800, 94)
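
The cell above hard-codes the 1-second files, while the introduction mentions 1–10 s windows. The sketch below shows one way the paths could be parameterized. It assumes the other window sizes follow the same `<kind>_samples_<N>sec.csv` naming as the 1-second files, which this notebook does not confirm; `window_paths` is a hypothetical helper.

```python
import os


def window_paths(base_path, window_sec=1):
    """Build (attack_path, benign_path) for a window size in seconds.

    Assumption: files for other window sizes are named like the
    1-second files, e.g. attack_samples_5sec.csv.
    """
    if not 1 <= window_sec <= 10:
        raise ValueError("window_sec must be between 1 and 10")
    attack = os.path.join(
        base_path, "attack_data", f"attack_samples_{window_sec}sec.csv"
    )
    benign = os.path.join(
        base_path, "benign_data", f"benign_samples_{window_sec}sec.csv"
    )
    return attack, benign
```

With `window_sec=1` this reproduces the two paths printed above; any other value should be checked against the actual file listing before use.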


#### Exploring the Data

In [12]:
# Checking attack data
df_attack.head()

Unnamed: 0,device_name,device_mac,label_full,label1,label2,label3,label4,timestamp,timestamp_start,timestamp_end,...,network_time-delta_min,network_time-delta_std_deviation,network_ttl_avg,network_ttl_max,network_ttl_min,network_ttl_std_deviation,network_window-size_avg,network_window-size_max,network_window-size_min,network_window-size_std_deviation
0,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:10.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:10.709000Z,2025-01-23T15:31:11.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:11.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:11.709000Z,2025-01-23T15:31:12.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:12.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:12.709000Z,2025-01-23T15:31:13.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:13.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:13.709000Z,2025-01-23T15:31:14.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:14.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:14.709000Z,2025-01-23T15:31:15.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
# Checking benign data
df_benign.head()

Unnamed: 0,device_name,device_mac,label_full,label1,label2,label3,label4,timestamp,timestamp_start,timestamp_end,...,network_time-delta_min,network_time-delta_std_deviation,network_ttl_avg,network_ttl_max,network_ttl_min,network_ttl_std_deviation,network_window-size_avg,network_window-size_max,network_window-size_min,network_window-size_std_deviation
0,router,28:87:ba:bd:c6:6c,benign_whole-network3,benign,benign,benign,benign,2025-09-09T14:09:40.400000Z_2025-09-09T14:09:4...,2025-09-09T14:09:40.400000Z,2025-09-09T14:09:41.400000Z,...,0.0,0.006059,62.8,64.0,61.0,1.469694,1870.5,3081.0,660.0,1210.5
1,router,28:87:ba:bd:c6:6c,benign_whole-network3,benign,benign,benign,benign,2025-09-09T14:09:41.400000Z_2025-09-09T14:09:4...,2025-09-09T14:09:41.400000Z,2025-09-09T14:09:42.400000Z,...,0.0,0.016469,62.5,64.0,61.0,1.5,1870.5,3081.0,660.0,1210.5
2,router,28:87:ba:bd:c6:6c,benign_whole-network3,benign,benign,benign,benign,2025-09-09T14:09:42.400000Z_2025-09-09T14:09:4...,2025-09-09T14:09:42.400000Z,2025-09-09T14:09:43.400000Z,...,0.0,0.034312,61.571429,64.0,53.0,3.736199,2441.285714,4736.0,135.0,1813.237335
3,router,28:87:ba:bd:c6:6c,benign_whole-network3,benign,benign,benign,benign,2025-09-09T14:09:43.400000Z_2025-09-09T14:09:4...,2025-09-09T14:09:43.400000Z,2025-09-09T14:09:44.400000Z,...,0.0,0.01279,62.5,64.0,61.0,1.5,1870.5,3081.0,660.0,1210.5
4,router,28:87:ba:bd:c6:6c,benign_whole-network3,benign,benign,benign,benign,2025-09-09T14:09:44.400000Z_2025-09-09T14:09:4...,2025-09-09T14:09:44.400000Z,2025-09-09T14:09:45.400000Z,...,0.0,0.017764,62.8,64.0,61.0,1.469694,2112.6,3081.0,660.0,1186.042933


#### Combining the Data

In [26]:
# Combine datasets for initial exploration
df = pd.concat([df_attack, df_benign], ignore_index=True)

# Check combined shape against original datasets
print("Attack:", df_attack.shape, "Benign:", df_benign.shape, "Combined:", df.shape)

df.head()

Attack: (90391, 94) Benign: (136800, 94) Combined: (227191, 94)


Unnamed: 0,device_name,device_mac,label_full,label1,label2,label3,label4,timestamp,timestamp_start,timestamp_end,...,network_time-delta_min,network_time-delta_std_deviation,network_ttl_avg,network_ttl_max,network_ttl_min,network_ttl_std_deviation,network_window-size_avg,network_window-size_max,network_window-size_min,network_window-size_std_deviation
0,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:10.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:10.709000Z,2025-01-23T15:31:11.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:11.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:11.709000Z,2025-01-23T15:31:12.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:12.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:12.709000Z,2025-01-23T15:31:13.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:13.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:13.709000Z,2025-01-23T15:31:14.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,edge1,dc:a6:32:dc:27:d4,attack_ddos_syn-flood-port-80_edge1,attack,ddos,syn-flood-port-80,ddos_syn-flood-port-80,2025-01-23T15:31:14.709000Z_2025-01-23T15:31:1...,2025-01-23T15:31:14.709000Z,2025-01-23T15:31:15.709000Z,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
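
The shapes above (90,391 attack rows and 136,800 benign rows) suggest roughly a 40/60 split, assuming every row of `df_attack` carries `label1 == "attack"`. Since the later steps rely on stratified sampling, it can help to check label balance directly. A minimal sketch on a toy frame; `label_summary` is a hypothetical helper, and the toy rows are illustrative, not taken from the dataset.

```python
import pandas as pd


def label_summary(frame, column):
    """Return per-class row counts and their share of all rows."""
    counts = frame[column].value_counts()
    return pd.DataFrame({"count": counts, "share": counts / len(frame)})


# Toy data that only mimics the label1/label2 columns seen in the preview.
toy = pd.DataFrame(
    {
        "label1": ["attack", "attack", "benign"],
        "label2": ["ddos", "ddos", "benign"],
    }
)
summary = label_summary(toy, "label1")
```

On the real data, `label_summary(df, "label1")` would cover the binary task and `label_summary(df, "label2")` the coarser multiclass labels.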


In [24]:
# Preview combined data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227191 entries, 0 to 227190
Data columns (total 94 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   device_name                           227191 non-null  object 
 1   device_mac                            227191 non-null  object 
 2   label_full                            227191 non-null  object 
 3   label1                                227191 non-null  object 
 4   label2                                227191 non-null  object 
 5   label3                                227191 non-null  object 
 6   label4                                227191 non-null  object 
 7   timestamp                             227191 non-null  object 
 8   timestamp_start                       227191 non-null  object 
 9   timestamp_end                         227191 non-null  object 
 10  log_data-ranges_avg                   227191 non-null  float64
 11  

### Dataset Notes
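Each row represents a 1-second window of IIoT traffic, described by 94 columns: device metadata, label columns, timestamps, and aggregated features (for example the `log_*` and `network_*` columns).  
Identifier-like columns (MAC addresses, IP and port lists) are not suitable for ML training, since a model can memorize them rather than learn traffic behavior.  
We’ll remove those and keep the numeric columns plus low-cardinality categorical columns.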
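
The selection rule described above can be sketched as follows. The drop list is taken from the columns visible in the previews and is an assumption about what counts as an identifier or label here; the `max_categories` threshold and the helper name `select_feature_columns` are also hypothetical.

```python
import pandas as pd

# Assumption: metadata, label, and timestamp columns seen in df.head() above.
# Whether device_name should stay as a feature is a separate design decision.
ID_AND_LABEL_COLS = [
    "device_name", "device_mac", "label_full",
    "label1", "label2", "label3", "label4",
    "timestamp", "timestamp_start", "timestamp_end",
]


def select_feature_columns(frame, drop_cols=ID_AND_LABEL_COLS, max_categories=20):
    """Split candidate features into numeric and low-cardinality categorical."""
    # errors="ignore" lets the helper run even if some listed columns are absent.
    candidates = frame.drop(columns=drop_cols, errors="ignore")
    numeric = candidates.select_dtypes(include="number").columns.tolist()
    categorical = [
        col
        for col in candidates.select_dtypes(exclude="number").columns
        if candidates[col].nunique(dropna=True) <= max_categories
    ]
    return numeric, categorical
```

Non-numeric columns with more distinct values than `max_categories` (such as IP or port lists) end up in neither list, which matches the intent of dropping high-cardinality identifiers. The label columns are excluded from the features but are still needed as targets.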

### 3. Data Cleaning & Feature Preparation