# Phase 2 – Generalisation on the full IoT_Modbus dataset

In Phase 1 I worked with the `Train_Test_IoT_Modbus` subset as a *development playground*:  
I explored the four Modbus counters, defined both binary and multi-class targets, and performed
model selection with cross-validation and a validation split. That phase gave me two clear winners:

- Decision Tree and Random Forest for **binary detection** (normal vs attack),
- Decision Tree and Random Forest for **multi-class classification** (normal + 5 attack types),

all trained on **four Modbus function-code counters** and evaluated with macro-F1, weighted-F1,
and per-class false negative / false positive rates.

Phase 2 is deliberately different in scope. Instead of tuning again, I now want to **stress-test**
those “best” models on a more realistic, larger and more imbalanced dataset: the full
`IoT_Modbus` telemetry.

The main questions are:

1. *How much do the models degrade (or not) when I move from a balanced, curated subset to the
   full IoT_Modbus distribution?*
2. *Do the weaknesses already observed in Phase 1 (especially for rare classes such as
   `scanning`) persist when the dataset becomes more skewed and closer to an IIoT scenario?*

So, in this Phase I will:

1. **Load and briefly inspect** the full `IoT_Modbus` processed dataset, focusing on class
   distributions and imbalance.
2. **Reuse exactly the same feature set** as in Phase 1:  
   `["FC1_Read_Input_Register", "FC2_Read_Discrete_Value", "FC3_Read_Holding_Register", "FC4_Read_Coil"]`,  
   and define the **binary label** (`label`: normal vs attack) and the **multi-class label**
   (`type`: normal, backdoor, injection, password, scanning, xss) with the same encoding
   strategy as in Phase 1.
3. Perform a **single 80/20 train–test split**, stratified by the multi-class labels in order
   to preserve the original imbalance in both splits.
4. Apply the **same standardisation step** (fit `StandardScaler` on the training split and
   transform the test split), even though tree-based models do not strictly require scaling.
   This keeps the preprocessing aligned with Phase 1 and allows fair comparison with other
   model families if needed.
5. **Instantiate the final models with the fixed hyperparameters** chosen in Phase 1, without
   any further tuning.
6. **Fit the models** on the full training split.
7. **Evaluate them**:
   - score each model once on the held-out test set using the same metrics as before,
   - compare the new results against the Phase 1 scores.

In [9]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, confusion_matrix, f1_score

# Load full IoT_Modbus dataset
data_path = r"F:\GEPID\3Semestre\Cybersicurezza\TON DataSet\Processed_IoT_dataset\Processed_IoT_dataset\IoT_Modbus.csv"

df_full = pd.read_csv(data_path)

In [6]:
# Helpers for evaluation used in Phase 1


def evaluate_on_test(model, X_test, y_test, class_names, labels, model_name, task_name):
    """
    Evaluate a trained model on the held-out test set and print:
    - classification report,
    - macro-F1 and weighted-F1,
    - per-class False Negative Rate (FNR) and False Positive Rate (FPR).
    """
    print(f"\n===== Final TEST performance for {model_name} ({task_name}) =====")
    
    # 1. Predictions
    y_pred = model.predict(X_test)

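    # 2. Standard classification report
    # Passing labels keeps target_names aligned even if a class is absent from y_test / y_pred
    print(classification_report(y_test, y_pred, labels=labels, target_names=class_names, digits=4))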

    # 3. Confusion matrix and global F1 scores
    cm = confusion_matrix(y_test, y_pred, labels=labels)
    macro_f1 = f1_score(y_test, y_pred, average="macro")
    weighted_f1 = f1_score(y_test, y_pred, average="weighted")

    print(f"Macro-F1 (test):    {macro_f1:.4f}")
    print(f"Weighted-F1 (test): {weighted_f1:.4f}")

    # 4. Per-class FNR and FPR
    fnr_per_class = {}
    fpr_per_class = {}

    # cm[i, j] = samples with true class i predicted as class j
    for idx, cls in enumerate(class_names):
        tp = cm[idx, idx]
        fn = cm[idx, :].sum() - tp
        fp = cm[:, idx].sum() - tp
        tn = cm.sum() - (tp + fn + fp)

        fnr = fn / (tp + fn) if (tp + fn) > 0 else 0.0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0

        fnr_per_class[cls] = fnr
        fpr_per_class[cls] = fpr

    print("\nFalse Negative Rate (missed attacks) per class:")
    for cls, fnr in fnr_per_class.items():
        print(f"  {cls}: {fnr:.4f}")

    print("\nFalse Positive Rate (false alarms) per class:")
    for cls, fpr in fpr_per_class.items():
        print(f"  {cls}: {fpr:.4f}")

    # 5. Return a dict in case we want to store the metrics
    return {
        "cm": cm,
        "macro_f1": macro_f1,
        "weighted_f1": weighted_f1,
        "fnr": fnr_per_class,
        "fpr": fpr_per_class,
    }
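
The per-class FNR/FPR arithmetic in `evaluate_on_test` can be checked by hand on a small example. The sketch below uses a made-up 3-class confusion matrix (not taken from the dataset) and a vectorised NumPy form of the same one-vs-rest computation; `per_class_rates` is a hypothetical helper name, not something used in Phase 1.

```python
import numpy as np


def per_class_rates(cm):
    """One-vs-rest FNR and FPR per class; cm[i, j] = true class i predicted as j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp          # rest of the row: missed samples of the class
    fp = cm.sum(axis=0) - tp          # rest of the column: wrongly assigned to the class
    tn = cm.sum() - (tp + fn + fp)
    # Same guard as the loop version: return 0.0 when the denominator is empty
    fnr = np.divide(fn, tp + fn, out=np.zeros_like(fn), where=(tp + fn) > 0)
    fpr = np.divide(fp, fp + tn, out=np.zeros_like(fp), where=(fp + tn) > 0)
    return fnr, fpr


# Toy confusion matrix (illustrative values only)
toy_cm = [[8, 2, 0],
          [1, 7, 2],
          [0, 1, 9]]

fnr, fpr = per_class_rates(toy_cm)
print("FNR:", fnr.round(3))
print("FPR:", fpr.round(3))
```

For the first class, 2 of its 10 true samples are misclassified (FNR = 0.2), and 1 of the 20 samples belonging to the other classes is assigned to it (FPR = 0.05).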


In [7]:
print("Shape:", df_full.shape)
print("\nColumns:")
print(df_full.columns)

print("\nData types:")
print(df_full.dtypes)

print("\nMissing values per column:")
print(df_full.isna().sum())

Shape: (287194, 8)

Columns:
Index(['date', 'time', 'FC1_Read_Input_Register', 'FC2_Read_Discrete_Value',
       'FC3_Read_Holding_Register', 'FC4_Read_Coil', 'label', 'type'],
      dtype='object')

Data types:
date                         object
time                         object
FC1_Read_Input_Register       int64
FC2_Read_Discrete_Value       int64
FC3_Read_Holding_Register     int64
FC4_Read_Coil                 int64
label                         int64
type                         object
dtype: object

Missing values per column:
date                         0
time                         0
FC1_Read_Input_Register      0
FC2_Read_Discrete_Value      0
FC3_Read_Holding_Register    0
FC4_Read_Coil                0
label                        0
type                         0
dtype: int64


In [8]:
print("\nMulticlass distribution (type):")
type_counts = df_full["type"].value_counts()
print(type_counts)
print("\nMulticlass relative frequencies:")
print((type_counts / len(df_full)).round(4))

print("\nBinary distribution (label: 0 = normal, 1 = attack):")
label_counts = df_full["label"].value_counts()
print(label_counts)
print("\nBinary relative frequencies:")
print((label_counts / len(df_full)).round(4))


Multiclass distribution (type):
type
normal       222855
backdoor      40011
password      18115
injection      5186
scanning        529
xss             498
Name: count, dtype: int64

Multiclass relative frequencies:
type
normal       0.7760
backdoor     0.1393
password     0.0631
injection    0.0181
scanning     0.0018
xss          0.0017
Name: count, dtype: float64

Binary distribution (label: 0 = normal, 1 = attack):
label
0    222855
1     64339
Name: count, dtype: int64

Binary relative frequencies:
label
0    0.776
1    0.224
Name: count, dtype: float64


### Dataset overview – full IoT_Modbus

The full `IoT_Modbus` processed dataset contains **287,194 records** and **8 columns**:
a timestamp (`date`, `time`), four Modbus function-code counters, and two label columns
(`label` and `type`). There are **no missing values**, so I can reuse the dataset as is
without any imputation or cleaning.

From the **multi-class perspective** (`type`), the distribution is clearly skewed:

- `normal` traffic represents about **77.6%** of all records.
- `backdoor` and `password` attacks account for roughly **13.9%** and **6.3%**.
- `injection` attacks are rarer (**1.8%**),
- while `scanning` and `xss` are **extremely rare**, each below **0.2%** of the dataset.

The **binary label** (`label`, 0 = normal, 1 = attack) collapses all attacks into a single
class and results in a less extreme imbalance: **77.6% normal** vs **22.4% attack**.
Compared with the more balanced `Train_Test_IoT_Modbus` subset used in Phase 1, this full
dataset is both **much larger** and **much more imbalanced**, especially for the rare
attack types (`scanning`, `xss`). This makes it a good stress test for the models selected
in Phase 1.
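
To put a number on the skew, one option is the ratio between the majority class and each other class. The snippet below is a small sketch that hard-codes the `type` counts printed above, so it runs without the CSV; the name `imbalance_ratio` is my own.

```python
# Class counts copied from the `type` value_counts() output above
type_counts = {
    "normal": 222855,
    "backdoor": 40011,
    "password": 18115,
    "injection": 5186,
    "scanning": 529,
    "xss": 498,
}

total = sum(type_counts.values())
majority = max(type_counts.values())

# Ratio of the majority class (normal) to each class
imbalance_ratio = {cls: majority / n for cls, n in type_counts.items()}

for cls, n in type_counts.items():
    print(f"{cls:<10} share={n / total:.4f}  normal-to-class ratio={imbalance_ratio[cls]:.1f}")
```

With these counts there are roughly 447 `normal` records for every `xss` record, which is the kind of ratio the rare classes have to cope with in the test split as well.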


## Step 2: Define labels and reuse features

In [14]:
# 2.1 Define feature matrix X (same four Modbus counters as in Phase 1)
feature_cols = [
    "FC1_Read_Input_Register",
    "FC2_Read_Discrete_Value",
    "FC3_Read_Holding_Register",
    "FC4_Read_Coil",
]

X = df_full[feature_cols].copy()

# 2.2 Define binary labels: 0 = normal, 1 = attack
y_binary = df_full["label"].astype(int).to_numpy()

# 2.3 Define multiclass labels from 'type'
type_encoder = LabelEncoder()
y_multi = type_encoder.fit_transform(df_full["type"])
class_names = type_encoder.classes_          # e.g. ['backdoor', 'injection', ...]
labels_multi = np.arange(len(class_names))

# For consistency with Phase 1
class_names_binary = np.array(["normal", "attack"])
labels_binary = np.array([0, 1])

print("Feature matrix shape:", X.shape)
print("Binary y shape:", y_binary.shape)
print("Multiclass y shape:", y_multi.shape)
print("\nMulticlass classes:", class_names)

Feature matrix shape: (287194, 4)
Binary y shape: (287194,)
Multiclass y shape: (287194,)

Multiclass classes: ['backdoor' 'injection' 'normal' 'password' 'scanning' 'xss']


## Step 3: Train/test split and stratification

In [None]:
# 3. Train/test split (80/20), stratified by multiclass labels

X_train, X_test, y_multi_train, y_multi_test, y_binary_train, y_binary_test = train_test_split(
    X,
    y_multi,
    y_binary,
    test_size=0.2,
    random_state=42,
    stratify=y_multi,   # preserve imbalance pattern across splits
)

print("Train shape:", X_train.shape)
print("Test shape :", X_test.shape)

# Check class distributions in the split
print("\nTrain multiclass distribution:")
print(pd.Series(y_multi_train).value_counts().sort_index())

print("\nTest multiclass distribution:")
print(pd.Series(y_multi_test).value_counts().sort_index())


Train shape: (229755, 4)
Test shape : (57439, 4)

Train multiclass distribution:
0     32009
1      4149
2    178284
3     14492
4       423
5       398
Name: count, dtype: int64

Test multiclass distribution:
0     8002
1     1037
2    44571
3     3623
4      106
5      100
Name: count, dtype: int64
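
A quick way to confirm that stratification preserved the class shares is to compare the train and test proportions directly. This sketch reuses the counts printed above (hard-coded so that it runs on its own); the class order follows `type_encoder.classes_`.

```python
class_names = ["backdoor", "injection", "normal", "password", "scanning", "xss"]
train_counts = [32009, 4149, 178284, 14492, 423, 398]
test_counts = [8002, 1037, 44571, 3623, 106, 100]

n_train, n_test = sum(train_counts), sum(test_counts)

max_gap = 0.0
for name, tr, te in zip(class_names, train_counts, test_counts):
    p_train, p_test = tr / n_train, te / n_test
    max_gap = max(max_gap, abs(p_train - p_test))
    print(f"{name:<10} train={p_train:.4%}  test={p_test:.4%}")

print(f"Largest train/test gap in class share: {max_gap:.2e}")
```

The gaps come only from rounding class counts to whole records, so both splits carry essentially the same imbalance as the full dataset.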


## Step 4: Standardisation

In [16]:
# 4. Standardisation (fit on train, apply to test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# For consistency with Phase 1 naming
Xtr = X_train_scaled
Xte = X_test_scaled
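
The plan notes that tree-based models do not strictly require scaling. A minimal sketch of why, on synthetic integer counters (hypothetical data, not IoT_Modbus): standardisation is a monotone per-feature transform, so a tree trained on scaled inputs should reach the same partitions as one trained on raw inputs. The labelling rule and sample sizes below are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic integer "counters" in [0, 49]; an assumption made for this demo only
X_train_demo = rng.integers(0, 50, size=(4000, 4)).astype(float)
X_test_demo = rng.integers(0, 50, size=(1000, 4)).astype(float)
y_train_demo = (
    X_train_demo[:, 0] + X_train_demo[:, 1] > X_train_demo[:, 2] + 20
).astype(int)

demo_scaler = StandardScaler().fit(X_train_demo)

# Same hyperparameters and seed; only the input scale differs
tree_raw = DecisionTreeClassifier(random_state=42).fit(X_train_demo, y_train_demo)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(
    demo_scaler.transform(X_train_demo), y_train_demo
)

pred_raw = tree_raw.predict(X_test_demo)
pred_scaled = tree_scaled.predict(demo_scaler.transform(X_test_demo))

agreement = float(np.mean(pred_raw == pred_scaled))
print(f"Prediction agreement, raw vs scaled inputs: {agreement:.4f}")
```

If the two trees agree on (almost) every test point, the scaler here serves pipeline consistency rather than model quality, which is the reason given for keeping it in Step 4.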

## Step 5: Final models

In [18]:
# 5. Instantiate Phase 1 best models for IoT_Modbus full dataset

# --- Binary models (normal vs attack) ---

rf_bin_iot = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    min_samples_leaf=1,
    n_jobs=-1,
    random_state=42,
)

In [19]:
dt_bin_iot = DecisionTreeClassifier(
    max_depth=None,
    min_samples_leaf=1,
    min_samples_split=2,
    class_weight="balanced",
    random_state=42,
)

In [20]:
# --- Multiclass models (normal + 5 attack types) ---

rf_multi_iot = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,
    min_samples_leaf=1,
    max_features=2,
    class_weight="balanced",
    n_jobs=-1,
    random_state=42,
)

In [21]:
dt_multi_iot = DecisionTreeClassifier(
    max_depth=None,
    min_samples_leaf=1,
    min_samples_split=2,
    class_weight=None,
    random_state=42,
)
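
Two of the models above use `class_weight="balanced"`. scikit-learn documents this mode as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the multiclass training counts from Step 3 (hard-coded) to show the weights involved. It only applies directly to `rf_multi_iot`, since `dt_bin_iot` is fitted on the binary labels.

```python
class_names = ["backdoor", "injection", "normal", "password", "scanning", "xss"]
train_counts = [32009, 4149, 178284, 14492, 423, 398]

n_samples = sum(train_counts)
n_classes = len(train_counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    name: n_samples / (n_classes * count)
    for name, count in zip(class_names, train_counts)
}

for name, weight in balanced_weights.items():
    print(f"{name:<10} weight={weight:8.4f}")
```

Under this formula a single `xss` record weighs about 96 times the average, while a `normal` record weighs about 0.21, so each class contributes the same total weight to the impurity computation.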

## Step 6: Fitting the Models

In [22]:
# 6. Fit models on the IoT_Modbus train split

# Binary
rf_bin_iot.fit(Xtr, y_binary_train)

In [23]:
dt_bin_iot.fit(Xtr, y_binary_train)

In [24]:
# Multiclass
rf_multi_iot.fit(Xtr, y_multi_train)

In [25]:
dt_multi_iot.fit(Xtr, y_multi_train)

## Step 7: Final evaluation

In [27]:
# 7. Final evaluation on the IoT_Modbus test split

# --- Binary ---
res_rf_bin_iot = evaluate_on_test(
    model=rf_bin_iot,
    X_test=Xte,
    y_test=y_binary_test,
    class_names=class_names_binary,
    labels=labels_binary,
    model_name="Random Forest",
    task_name="binary (IoT_Modbus full)",
)

res_dt_bin_iot = evaluate_on_test(
    model=dt_bin_iot,
    X_test=Xte,
    y_test=y_binary_test,
    class_names=class_names_binary,
    labels=labels_binary,
    model_name="Decision Tree",
    task_name="binary (IoT_Modbus full)",
)


===== Final TEST performance for Random Forest (binary (IoT_Modbus full)) =====
              precision    recall  f1-score   support

      normal     0.9654    0.9948    0.9799     44571
      attack     0.9798    0.8766    0.9253     12868

    accuracy                         0.9683     57439
   macro avg     0.9726    0.9357    0.9526     57439
weighted avg     0.9687    0.9683    0.9677     57439

Macro-F1 (test):    0.9526
Weighted-F1 (test): 0.9677

False Negative Rate (missed attacks) per class:
  normal: 0.0052
  attack: 0.1234

False Positive Rate (false alarms) per class:
  normal: 0.1234
  attack: 0.0052

===== Final TEST performance for Decision Tree (binary (IoT_Modbus full)) =====
              precision    recall  f1-score   support

      normal     0.9733    0.9635    0.9684     44571
      attack     0.8779    0.9084    0.8929     12868

    accuracy                         0.9512     57439
   macro avg     0.9256    0.9359    0.9306     57439
weighted avg     0.95

In [29]:
# --- Multiclass ---
res_rf_multi_iot = evaluate_on_test(
    model=rf_multi_iot,
    X_test=Xte,
    y_test=y_multi_test,
    class_names=class_names,
    labels=labels_multi,
    model_name="Random Forest",
    task_name="multiclass (IoT_Modbus full)",
)


===== Final TEST performance for Random Forest (multiclass (IoT_Modbus full)) =====
              precision    recall  f1-score   support

    backdoor     0.9712    0.9601    0.9656      8002
   injection     0.9939    0.9460    0.9694      1037
      normal     0.9683    0.9941    0.9810     44571
    password     0.9931    0.7176    0.8332      3623
    scanning     0.8409    0.6981    0.7629       106
         xss     0.9873    0.7800    0.8715       100

    accuracy                         0.9701     57439
   macro avg     0.9591    0.8493    0.8973     57439
weighted avg     0.9706    0.9701    0.9688     57439

Macro-F1 (test):    0.8973
Weighted-F1 (test): 0.9688

False Negative Rate (missed attacks) per class:
  backdoor: 0.0399
  injection: 0.0540
  normal: 0.0059
  password: 0.2824
  scanning: 0.3019
  xss: 0.2200

False Positive Rate (false alarms) per class:
  backdoor: 0.0046
  injection: 0.0001
  normal: 0.1126
  password: 0.0003
  scanning: 0.0002
  xss: 0.0000


In [30]:
res_dt_multi_iot = evaluate_on_test(
    model=dt_multi_iot,
    X_test=Xte,
    y_test=y_multi_test,
    class_names=class_names,
    labels=labels_multi,
    model_name="Decision Tree",
    task_name="multiclass (IoT_Modbus full)",
)


===== Final TEST performance for Decision Tree (multiclass (IoT_Modbus full)) =====
              precision    recall  f1-score   support

    backdoor     0.8761    0.9626    0.9174      8002
   injection     0.8605    0.9460    0.9012      1037
      normal     0.9739    0.9621    0.9680     44571
    password     0.8185    0.7419    0.7783      3623
    scanning     0.7500    0.6792    0.7129       106
         xss     0.7677    0.7600    0.7638       100

    accuracy                         0.9471     57439
   macro avg     0.8411    0.8420    0.8403     57439
weighted avg     0.9477    0.9471    0.9469     57439

Macro-F1 (test):    0.8403
Weighted-F1 (test): 0.9469

False Negative Rate (missed attacks) per class:
  backdoor: 0.0374
  injection: 0.0540
  normal: 0.0379
  password: 0.2581
  scanning: 0.3208
  xss: 0.2400

False Positive Rate (false alarms) per class:
  backdoor: 0.0220
  injection: 0.0028
  normal: 0.0891
  password: 0.0111
  scanning: 0.0004
  xss: 0.0004


## Phase 2 – Results on full IoT_Modbus and comparison with Phase 1

Phase 2 was designed as a stress test: instead of working on the balanced `Train_Test_IoT_Modbus`
subset (~31k records), the same tree-based models from Phase 1 were applied to the full
`IoT_Modbus` dataset (~287k records), which is much more imbalanced and closer to a realistic
IIoT scenario.

Below I compare the Phase 2 results (full IoT_Modbus, 80/20 train–test) with the Phase 1
results (balanced subset, 80/20 train–test), focusing on the main models:

- DT and RF for the **multiclass** task,
- DT and RF for the **binary** task.

---

### 7.1 Multiclass – Random Forest vs Decision Tree

#### Decision Tree (multiclass)

- **Phase 1 (subset)**:  
  - accuracy ≈ 0.966  
  - macro-F1 ≈ 0.918  
  - weighted-F1 ≈ 0.965  
  - `scanning` and `xss` already weaker, but still with reasonably high F1.

- **Phase 2 (full IoT_Modbus)**:  
  - accuracy ≈ 0.947  
  - macro-F1 ≈ 0.840  
  - weighted-F1 ≈ 0.947

Per-class behaviour on the full dataset:

- `normal`, `backdoor`, `injection` remain strong, with F1 between ~0.90 and ~0.97 and FNRs below ~5–6%.
- `password` degrades considerably: F1 ≈ 0.78 and FNR ≈ 0.26, i.e. about one quarter of password attacks are missed.
- `scanning` and `xss` remain problematic:
  - `scanning`: F1 ≈ 0.71, FNR ≈ 0.32  
  - `xss`: F1 ≈ 0.76, FNR ≈ 0.24  

The Decision Tree is still a useful interpretable baseline, but on the full, highly imbalanced
dataset its macro-F1 drops markedly relative to Phase 1 (≈ 0.918 → ≈ 0.840), while accuracy and
weighted-F1 fall by less than two points. The imbalance amplifies its weaknesses on minority
classes, and macro-F1 is the metric that reflects this deterioration.

#### Random Forest (multiclass)

- **Phase 1 (subset)**:  
  - accuracy ≈ 0.964  
  - macro-F1 ≈ 0.931  
  - weighted-F1 ≈ 0.963  
  - `scanning` and `xss` were the hardest classes, but still with F1 around 0.77 and 0.96 respectively.

- **Phase 2 (full IoT_Modbus)**:  
  - accuracy ≈ 0.970  
  - macro-F1 ≈ 0.897  
  - weighted-F1 ≈ 0.969

Per-class performance on the full dataset:

- `normal`, `backdoor`, `injection` keep very high F1 scores (~0.96–0.98) with FNR between ~0.6% and ~5%.
- `password` shows a sharp drop in recall (FNR ≈ 0.28), leading to F1 ≈ 0.83: many password attacks are now missed despite a precision of ~0.99.
- `scanning` and `xss` remain the weakest categories:
  - `scanning`: F1 ≈ 0.76, FNR ≈ 0.30 (recall slightly better than in Phase 1, but the miss rate is still high)  
  - `xss`: F1 ≈ 0.87, FNR ≈ 0.22  

So:

- Compared to Phase 1, **multiclass macro-F1 for RF drops from ~0.93 to ~0.90**, mainly because
  the more realistic class imbalance exposes vulnerabilities in `password`, `scanning`, and `xss`.
- **Weighted-F1 actually improves a bit** (0.963 → 0.969), because the metric is dominated by the large mass of correctly classified `normal` and `backdoor` traffic, which together make up about 92% of the test split.

From a risk perspective, the forest remains clearly superior to the tree in the multiclass
setting, but both models show that some attack types become harder to detect when the dataset
matches the real imbalance of an IIoT environment. In particular, the high FNR for `password`
and `scanning` suggests that relying on Modbus counters alone is risky for those categories.
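
Rates alone hide how many records are actually affected. As a rough sketch, the per-class FNR and support from the Random Forest multiclass report above can be turned into approximate counts of misclassified attack records. The inputs are the rounded values from the printed output, so the counts are approximate, and they include attacks assigned to another attack type, not only those labelled `normal`.

```python
# (FNR, test support) per attack class, copied from the RF multiclass output above
rf_multi = {
    "backdoor": (0.0399, 8002),
    "injection": (0.0540, 1037),
    "password": (0.2824, 3623),
    "scanning": (0.3019, 106),
    "xss": (0.2200, 100),
}

# Approximate number of misclassified records per class
missed = {cls: round(fnr * support) for cls, (fnr, support) in rf_multi.items()}
total_missed = sum(missed.values())

for cls, n in sorted(missed.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cls:<10} ~{n:>5} misclassified ({n / total_missed:.1%} of misclassified attack records)")
```

With these rounded inputs, `password` accounts for roughly 1,023 of about 1,452 misclassified attack records: the rare classes have the worst rates, but `password` contributes most of the volume.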

---

### 7.2 Binary – Random Forest vs Decision Tree

#### Decision Tree (binary)

- **Phase 1 (subset)**:  
  - accuracy ≈ 0.982  
  - macro-F1 ≈ 0.982  
  - FNR(normal) ≈ 0.008, FNR(attack) ≈ 0.027

- **Phase 2 (full IoT_Modbus)**:  
  - accuracy ≈ 0.951  
  - macro-F1 ≈ 0.931  
  - FNR(normal) ≈ 0.037, FNR(attack) ≈ 0.092

As expected, the binary tree’s performance decreases noticeably on the full dataset:
it now misses about **9% of attack records** and flags around **3–4% of normal records** as
attacks on the test split. The model is still usable as a simple detector, but it no longer
achieves the “almost perfect” behaviour observed on the balanced subset.

#### Random Forest (binary)

- **Phase 1 (subset)**:  
  - accuracy ≈ 0.985  
  - macro-F1 ≈ 0.985  
  - FNR(normal) ≈ 0.008, FNR(attack) ≈ 0.022

- **Phase 2 (full IoT_Modbus)**:  
  - accuracy ≈ 0.968  
  - macro-F1 ≈ 0.953  
  - FNR(normal) ≈ 0.005, FNR(attack) ≈ 0.123

Here the degradation is particularly clear on the attack side:

- The forest still classifies **normal traffic extremely well** (FNR(normal) ≈ 0.005),
  which explains the high accuracy and weighted-F1 (~0.968).
- However, it now **misses about 12% of attack records** (FNR(attack) ≈ 0.123), compared
  with only about 2% (FNR ≈ 0.022) in Phase 1.

In other words: aggregated over all attack types, the model has become more conservative.
It protects normal traffic very well but allows a non-trivial number of attacks to slip through,
especially those belonging to the more problematic multiclass categories (`password`,
`scanning`, `xss`).

---

### 7.3 What Phase 2 tells us about robustness and risk

Putting everything together:

1. **Tree-based models remain strong overall**, even under severe imbalance:
   - Multiclass RF still achieves macro-F1 ≈ 0.90 and weighted-F1 ≈ 0.97.
   - Binary RF still reaches macro-F1 ≈ 0.95 and accuracy ≈ 0.97 on the full dataset.

2. **However, the apparent strength hides critical residual risks**:
   - Multiclass FNR for `password`, `scanning` and `xss` remains high (around 0.22–0.32 across
     the two models), meaning that a significant fraction of those attacks is still missed.
   - In the binary aggregation, FNR(attack) for RF increases from ~0.02 in Phase 1 to ~0.12 in
     Phase 2 (and from ~0.03 to ~0.09 for DT), due to the more realistic, skewed distribution of
     attack types.
   - I believe that the models would not have been prepared for zero-day attacks either, which
     reinforces Marasco’s point: a static, batch-trained IDS is not the most resilient approach
     to ML-based intrusion detection. This is one more reason to explore Continual Learning in
     the future.

3. **Macro-F1 vs weighted-F1 becomes an important governance signal**:
   - Weighted-F1 and accuracy look great, but they are dominated by `normal` and by the frequent
     attack classes, so they can be misleading from a risk-management point of view.
   - Relying on them alone may downplay the operational relevance of rare but high-impact
     threats such as `scanning` or `xss` attacks, which can silently map the network, exploit
     vulnerable web interfaces, and pave the way for more disruptive intrusions.
   - Macro-F1 weights every class equally and therefore makes these weaknesses visible.
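
The gap between the two averages can be reproduced from the per-class F1 scores and supports in the Random Forest multiclass report. This is a minimal sketch using the rounded values from the printed output, so the results match the reported scores only up to rounding.

```python
# (F1, test support) per class, copied from the RF multiclass report above
rf_report = {
    "backdoor": (0.9656, 8002),
    "injection": (0.9694, 1037),
    "normal": (0.9810, 44571),
    "password": (0.8332, 3623),
    "scanning": (0.7629, 106),
    "xss": (0.8715, 100),
}

f1_scores = [f1 for f1, _ in rf_report.values()]
supports = [n for _, n in rf_report.values()]

# Macro-F1: plain mean, every class counts equally
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted-F1: mean weighted by class support, dominated by `normal`
weighted_f1 = sum(f1 * n for f1, n in zip(f1_scores, supports)) / sum(supports)

print(f"Macro-F1:    {macro_f1:.4f}")
print(f"Weighted-F1: {weighted_f1:.4f}")
```

`scanning` and `xss` together contribute a third of the macro average but only about 0.4% of the weighted one, which is why the weighted score barely reacts to their low F1.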

> With only four Modbus counters, we can deploy relatively simple ML-based detectors that perform
> very well on normal vs attack discrimination and reasonably well on some attack categories.
> However, the residual false negatives for specific attacks (notably password, scanning and xss)
> show that this approach must be complemented by additional data sources (network flows,
> host logs, process variables) and by other organisational and technical controls in the IIoT
> security stack.

Phase 2 thus confirms that the Phase 1 models are **robust but limited**: they scale gracefully
to a more realistic dataset, yet they also expose the structural blind spots of Modbus-only IDS
designs in an industrial environment.