# UNSW-NB15 Hierarchical Intrusion Detection Classifier

## Overview
This notebook implements a Random Forest-based hierarchical classifier for intrusion detection using the UNSW-NB15 dataset.

---

## Architecture
* **Model A (Binary):** Benign (0) vs. Malicious (1) (Stage 1)
* **Model B (DoS Detector):** DoS vs. Other attacks (Stage 2, applied only to malicious traffic; reported as classes 1 and 2 in the combined output)
* **Model C (Flat Tri-class):** Single-stage classifier (Benign/DoS/Other)
* **Hierarchical Model:** Combined evaluation derived from Model A and Model B (Benign/DoS/Other)
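
The way the two stages combine can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own code: `combine_hierarchical` is a hypothetical helper, and the scores and thresholds are made-up values.

```python
import numpy as np

def combine_hierarchical(bin_scores, dos_scores, bin_thr, dos_thr):
    """Merge stage-1 and stage-2 scores into tri-class labels (0/1/2)."""
    bin_scores = np.asarray(bin_scores, dtype=float)
    dos_scores = np.asarray(dos_scores, dtype=float)
    labels = np.zeros(len(bin_scores), dtype=int)      # default: Benign (0)
    malicious = bin_scores >= bin_thr                  # Stage 1 decision
    # Stage 2 only affects rows that Stage 1 flagged as malicious.
    labels[malicious] = np.where(dos_scores[malicious] >= dos_thr, 1, 2)
    return labels

# Illustrative scores and thresholds (not taken from the trained models).
demo = combine_hierarchical(
    bin_scores=[0.10, 0.90, 0.80, 0.30],
    dos_scores=[0.99, 0.95, 0.20, 0.99],
    bin_thr=0.5, dos_thr=0.9,
)
print(demo)  # [0 1 2 0]
```

Rows that Stage 1 scores below its threshold stay Benign even when their DoS score is high, which is why errors made by Model A cannot be recovered by Model B.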

---

## Key Features
* **7-feature schema:** `['proto', 'service', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'dur']`
* Categorical features (`proto`, `service`) encoded with `OrdinalEncoder`
* Numeric features passed through unchanged
* **SMOTENC** oversampling applied for class imbalance
* Precision-targeted thresholds for deployment
* Hierarchical prediction pipeline

---

## 1. Dataset
* **Source:** [UNSW-NB15](https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-NB15-Datasets/)
* **Size:** ~257,000 records
* **Classes:**
    * Benign (0)
    * DoS (1)
    * Other attacks (2) (all other attack categories combined)

---

## 2. Preprocessing
* Replace missing or invalid `proto` or `service` with `'unknown'`
* Encode categorical features (`proto`, `service`) with `OrdinalEncoder`
* Keep numeric features as-is (passthrough)
* Drop any features not in `selected_features`
* **SMOTENC** oversampling for training data
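
As a rough illustration of the encoding step, the sketch below fits the same kind of `ColumnTransformer` on two invented rows and transforms a row with an unseen protocol. The toy values are assumptions made for the example; only the column names and encoder settings come from this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

cat_cols = ['proto', 'service']
num_cols = ['spkts', 'dpkts', 'sbytes', 'dbytes', 'dur']

pre = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), cat_cols),
        ('num', 'passthrough', num_cols),
    ],
    remainder='drop',
)

# Toy rows, invented for illustration only.
train = pd.DataFrame({
    'proto': ['tcp', 'udp'], 'service': ['http', 'dns'],
    'spkts': [10, 2], 'dpkts': [8, 2], 'sbytes': [800, 120],
    'dbytes': [1200, 150], 'dur': [0.5, 0.01],
})
new = pd.DataFrame({
    'proto': ['icmp'], 'service': ['http'],
    'spkts': [1], 'dpkts': [0], 'sbytes': [60], 'dbytes': [0], 'dur': [0.0],
})

pre.fit(train)
encoded = pre.transform(new)
# The unseen protocol 'icmp' is encoded as -1; 'http' keeps its fitted code.
print(encoded[0, :2])
```

Because the categorical transformer is listed first, the encoded `proto` and `service` end up in output columns 0 and 1, which is the position SMOTENC needs to be told about.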

---

## 3. Model Training
* **Model A:** Binary Random Forest (Malicious vs. Benign)
* **Model B:** Random Forest (DoS vs. Other attacks)
* **Model C:** Flat Random Forest tri-class classifier
* **Thresholds:**
    * F1-tuned for evaluation
    * Precision-targeted for deployment
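
The F1-tuned threshold can be pictured as a plain grid search over candidate cut-offs. The following is a NumPy-only sketch under that assumption; `f1_at` is a hypothetical helper and the scores and labels are toy values, not validation data from this notebook.

```python
import numpy as np

def f1_at(scores, y_true, t):
    """F1 of the positive class when predicting scores >= t."""
    pred = scores >= t
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy validation scores and labels (illustrative only).
scores = np.array([0.12, 0.33, 0.47, 0.62, 0.81, 0.93])
y_true = np.array([0,    0,    1,    0,    1,    1])

grid = np.linspace(0.05, 0.95, 19)           # 0.05, 0.10, ..., 0.95
f1s = [f1_at(scores, y_true, t) for t in grid]
best_t = float(grid[int(np.argmax(f1s))])    # the first maximum wins on ties
print(round(best_t, 2))  # 0.35
```

Several neighbouring thresholds often tie on F1; taking the first maximum picks the lowest of them, which favours recall.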

---

## 4. Model Evaluation
* **Metrics:** ROC-AUC, PR-AUC, Precision, Recall, F1-score, Confusion matrices
* Evaluated for both flat and hierarchical models

---

## 5. Model Artifacts

| Artifact | Description |
| :--- | :--- |
| `rf_bin.joblib` | Trained binary classifier (Model A) |
| `bin_threshold.json` | Precision-targeted threshold for Model A (deployment) |
| `rf_dos_vs_other.joblib` | Trained classifier for DoS vs Other (Model B) |
| `dos_threshold.json` | Deployment threshold for Model B |
| `rf_tri.joblib` | Flat tri-class classifier (Model C) |
| `features.json` | Ordered list of features used in all models |
| `metrics_*.json` | Evaluation metrics: threshold, ROC-AUC, PR-AUC, and confusion matrices for Model A; confusion matrices for the flat and hierarchical tri-class models |

Artifacts are saved using `joblib` for models and JSON for thresholds, features, and metrics.
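
Reading a threshold back might look like the sketch below. The `{"threshold": ...}` layout matches how the notebook writes the file; `load_threshold` and its 0.5 fallback are assumptions introduced for illustration.

```python
import json
import os
import tempfile

def load_threshold(path, default=0.5):
    """Read {'threshold': <float>} from a JSON file; fall back to `default`."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            return float(json.load(f)["threshold"])
    except (OSError, KeyError, ValueError, TypeError):
        return float(default)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "bin_threshold.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"threshold": 0.66}, f)      # illustrative value
    loaded = load_threshold(path)
    missing = load_threshold(os.path.join(tmp, "absent.json"))

print(loaded, missing)  # 0.66 0.5
```

A silent fallback is convenient in a demo; in a deployment it may be safer to raise instead, so that a missing threshold file is noticed.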

---

## 6. Prediction Utilities
* Functions to load saved models and thresholds for flat or hierarchical inference
* **Flat prediction:** Model A + Model C
* **Hierarchical prediction:** Model A → Model B
* Can export predictions to CSV for batch inference

---

## 7. Example Usage

```python
# Sample input
df_sample = df_test[selected_features].head(5)

# Flat predictions (binary + tri-class)
pred_flat = predict_from_df(df_sample, mode="both")
print(pred_flat)

# Hierarchical predictions (Model A → Model B)
pred_hier = predict_hier_from_df(df_sample)
print(pred_hier)

# Save predictions to CSV
save_predictions_csv(df_sample, "predictions_demo.csv", mode="hier")
```


## Section 1: Initial Setup & Configuration

In [3]:
# -------------------------
# Import Libraries
# -------------------------

import os, json, warnings
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, average_precision_score, f1_score,
    precision_recall_fscore_support, confusion_matrix, classification_report,
    precision_recall_curve
)
from sklearn.ensemble import RandomForestClassifier
import joblib

# -------------------------
# Install Dependencies
# -------------------------
# imbalanced-learn is required for SMOTENC (handling mixed-type imbalance).

try:
    import imblearn
    print("imbalanced-learn already available.")
except ImportError:
    %pip install -q imbalanced-learn

# -------------------------
# SMOTENC Setup
# -------------------------
# SMOTENC handles categorical + numerical oversampling for class imbalance.

try:
    from imblearn.over_sampling import SMOTENC
    from imblearn.pipeline import Pipeline as ImbPipeline
    IMB_OK = True
except ImportError:
    IMB_OK = False
    print("imblearn unavailable – proceeding without SMOTENC.")

# Suppress warnings
warnings.filterwarnings("ignore")
# Set a global random state for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

imbalanced-learn already available.


In [4]:
# Import the Google Colab drive module.
from google.colab import drive
# Mount Google Drive to access files.
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [5]:
# Copy the dataset from Google Drive to the current working directory.
# The dataset is expected to be in 'gdrive/My Drive/NSA/Dataset'.
# The -r flag ensures recursive copying for directories.
!cp -r gdrive/My\ Drive/NSA/Dataset .

In [6]:
# -------------------------
# File Paths and Parameters
# -------------------------

TRAIN_CSV = "Dataset/UNSW_NB15_training-set.csv"
TEST_CSV  = "Dataset/UNSW_NB15_testing-set.csv"

# Attack category treated as its own class; all other attack categories are grouped as "Other".
SPECIFIC_ATTACK = "DoS"

# The model training uses a 7-feature schema for simplicity and focus.
selected_features = ['proto','service','spkts','dpkts','sbytes','dbytes','dur']
# Categorical features
categorical_feature_names = ['proto','service']
# Numerical features
numeric_feature_names     = ['spkts','dpkts','sbytes','dbytes','dur']

# Toggle whether to use SMOTENC-based oversampling for handling class imbalance.
USE_SMOTENC = True

# Initialize DO_SEARCH flag for hyperparameter tuning. Default to False for quicker runs.
# Set to True if hyperparameter tuning (RandomizedSearchCV) is desired.
DO_SEARCH = False

# Define target precision for demonstration and deployment
TARGET_PRECISION_FOR_DEMO = 0.90
DOS_TARGET_PRECISION_FOR_DEMO = 0.80

## Section 2: Data Loading & Preprocessing

In [7]:
# -------------------------
# Data Loading and Cleaning
# -------------------------

def load_unsw(train_csv, test_csv):
    """
    Loads the UNSW-NB15 training and testing datasets from CSV files.
    Performs initial cleaning by replacing missing or invalid 'service' and 'proto'
    values with 'unknown'.

    Args:
        train_csv (str): Path to the training CSV file.
        test_csv (str): Path to the testing CSV file.

    Returns:
        tuple: A tuple containing two pandas DataFrames (df_tr, df_te)
               for the training and testing data respectively.
    """
    df_tr = pd.read_csv(train_csv)
    df_te = pd.read_csv(test_csv)
    # Replace missing or invalid service/protocol values
    for df in (df_tr, df_te):
        if 'service' in df.columns:
            df['service'] = df['service'].replace('-', 'unknown').fillna('unknown')
        if 'proto' in df.columns:
            df['proto'] = df['proto'].fillna('unknown')
    return df_tr, df_te

# Load dataset
df_train, df_test = load_unsw(TRAIN_CSV, TEST_CSV)

print("Train:", df_train.shape, " Test:", df_test.shape)
print("Cols:", list(df_train.columns)[:15], "... (total:", len(df_train.columns), ")")

Train: (175341, 45)  Test: (82332, 45)
Cols: ['id', 'dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes', 'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss'] ... (total: 45 )


In [8]:
# -------------------------
# Label Engineering
# -------------------------

# Binary label: 0 = Benign, 1 = Malicious
y_bin_train = df_train['label'].astype(int)
y_bin_test  = df_test['label'].astype(int)

# Label mapping:
#   0 = Benign, 1 = DoS, 2 = Other attacks
# This function maps the 'attack_cat' string labels to integer labels
def map_tri(cat: str) -> int:
    if cat == 'Normal': return 0  # Benign
    if cat == SPECIFIC_ATTACK: return 1 # DoS attack
    return 2                      # All other attack categories

y_tri_train = df_train['attack_cat'].apply(map_tri).astype(int)
y_tri_test  = df_test['attack_cat'].apply(map_tri).astype(int)

# Input features (X)
# Select only the predefined 'selected_features' for model training.
X_train = df_train[selected_features].copy()
X_test  = df_test[selected_features].copy()

print("y_bin (train):\n", y_bin_train.value_counts(), "\n")
print("y_tri (train) [0=Benign,1=DoS,2=Other]:\n", y_tri_train.value_counts().sort_index())

y_bin (train):
 label
1    119341
0     56000
Name: count, dtype: int64 

y_tri (train) [0=Benign,1=DoS,2=Other]:
 attack_cat
0     56000
1     12264
2    107077
Name: count, dtype: int64
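
The label mapping and the printed counts can be sanity-checked without loading the dataset. The sketch below repeats `map_tri` on a handful of `attack_cat` values picked for illustration and verifies that the training counts shown above add up.

```python
from collections import Counter

SPECIFIC_ATTACK = "DoS"

def map_tri(cat: str) -> int:
    if cat == "Normal":
        return 0          # Benign
    if cat == SPECIFIC_ATTACK:
        return 1          # DoS
    return 2              # every other attack category

# A few UNSW-NB15 attack_cat values, chosen for illustration.
cats = ["Normal", "DoS", "Exploits", "Fuzzers", "Normal", "Generic"]
counts = Counter(map_tri(c) for c in cats)
print(dict(sorted(counts.items())))  # {0: 2, 1: 1, 2: 3}

# Consistency check on the training counts printed above.
assert 12264 + 107077 == 119341      # DoS + Other == Malicious
assert 56000 + 119341 == 175341      # Benign + Malicious == all training rows
```

DoS makes up roughly 10% of the malicious training rows (12,264 of 119,341), which is the imbalance that Model B has to cope with.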


In [9]:
# -------------------------
# Preprocessing
# -------------------------
# ColumnTransformer for preprocessing:
# - Categorical features ('proto', 'service') are encoded using OrdinalEncoder.
#   handle_unknown='use_encoded_value' with unknown_value=-1 maps categories that were
#   not seen during fit to -1 at transform time.
# - Numerical features use 'passthrough', meaning they are kept as is.
# - remainder='drop' discards any column that is not listed in the transformers.

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1), categorical_feature_names),
        ('num', 'passthrough', numeric_feature_names),
    ],
    remainder='drop'
)

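# Column positions of the categorical features in the preprocessor's OUTPUT, as required by SMOTENC.
# The 'cat' transformer is listed first, so proto -> column 0 and service -> column 1.
categorical_features = [0, 1]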

## Section 3: Utility Functions

In [10]:
# -------------------------
# Threshold Selection Functions
# -------------------------

def pick_threshold_by_f1(scores: np.ndarray, y_true: np.ndarray, grid=None) -> float:
    """
    Selects the classification threshold that maximizes the F1-score.

    Args:
        scores (np.ndarray): Predicted probabilities.
        y_true (np.ndarray): True binary labels.
        grid (np.ndarray, optional): Array of thresholds to check. If None, a default
                                     grid from 0.05 to 0.95 is used.

    Returns:
        float: The threshold value that yields the highest F1-score.
    """
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = 0.5, -1.0
    for t in grid:
        # Convert scores to binary predictions using the current threshold
        pred = (scores >= t).astype(int)
        # Calculate F1-score
        f1 = f1_score(y_true, pred, zero_division=0)
        # Update best threshold if current F1-score is higher
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return float(best_t)

def pick_threshold_by_precision(scores: np.ndarray, y_true: np.ndarray, target_prec: float) -> float:
    """
    Selects the lowest classification threshold that achieves at least the target precision.
    Uses the precision-recall curve to find suitable thresholds.

    Args:
        scores (np.ndarray): Predicted probabilities.
        y_true (np.ndarray): True binary labels.
        target_prec (float): The minimum desired precision.

    Returns:
        float: The lowest threshold that meets the target precision.
    """
    # Calculate precision, recall, and thresholds from the precision-recall curve.
    prec, rec, thr = precision_recall_curve(y_true, scores)
    chosen = None
    for i in range(len(thr)):           # Note: len(thr) = len(prec) - 1
        if prec[i] >= target_prec:
            chosen = float(thr[i])
            break
    if chosen is None:
        # Fallback to the strictest threshold if target precision isn't reached.
        # If no thresholds are available, default to 0.5.
        chosen = float(thr[-1]) if len(thr) else 0.5
    return chosen
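
To make the precision-targeted rule concrete, here is a scikit-learn-free sketch of the same idea: scan thresholds in ascending order and stop at the first one that meets the target. `lowest_threshold_for_precision` is a hypothetical helper with toy inputs, and unlike `pick_threshold_by_precision` it returns `None` instead of falling back to the strictest threshold.

```python
import numpy as np

def lowest_threshold_for_precision(scores, y_true, target_prec):
    """Scan candidate thresholds in ascending order; return the first one
    whose precision is >= target_prec, or None if no threshold qualifies."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    for t in np.unique(scores):                 # unique scores, sorted ascending
        pred = scores >= t
        precision = y_true[pred].mean()         # TP / (TP + FP); pred is never empty here
        if precision >= target_prec:
            return float(t)
    return None

scores = [0.10, 0.40, 0.35, 0.80]
y_true = [0,    0,    1,    1]

print(lowest_threshold_for_precision(scores, y_true, 0.60))  # 0.35
print(lowest_threshold_for_precision(scores, y_true, 0.90))  # 0.8
```

In this toy data precision is 0.67 at threshold 0.35 but drops to 0.50 at 0.40. Precision is not monotone in the threshold, so the lowest qualifying threshold does not guarantee that every stricter threshold also meets the target.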

In [11]:
# -------------------------
# Network Log Parsing Functions
# -------------------------

def zeek_conn_to_features_df(conn_log_path: str) -> pd.DataFrame:
    """
    Parses a Zeek conn.log file (JSON format) and converts it into a pandas DataFrame
    containing the selected features for model inference.

    Args:
        conn_log_path (str): Path to the Zeek conn.log file.

    Returns:
        pd.DataFrame: DataFrame with extracted features, ready for model input.
    """
    rows = []
    with open(conn_log_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                d = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Extract relevant fields, handling missing keys with defaults
            proto   = d.get('proto', 'unknown')
            service = d.get('service', 'unknown') or 'unknown' # Handle empty string service as unknown
            spkts   = d.get('orig_pkts', 0)
            dpkts   = d.get('resp_pkts', 0)
            sbytes  = d.get('orig_bytes', 0)
            dbytes  = d.get('resp_bytes', 0)
            dur     = d.get('duration', 0.0)
            rows.append([proto, service, spkts, dpkts, sbytes, dbytes, dur])

    df = pd.DataFrame(rows, columns=selected_features)
    # Ensure correct data types
    df['proto'] = df['proto'].astype(str)
    df['service'] = df['service'].astype(str)
    for c in ['spkts','dpkts','sbytes','dbytes']:
        df[c] = pd.to_numeric(df[c], errors='coerce').fillna(0).astype(int)
    df['dur'] = pd.to_numeric(df['dur'], errors='coerce').fillna(0.0).astype(float)
    return df

def suricata_eve_to_features_df(eve_json_path: str) -> pd.DataFrame:
    """
    Parses a Suricata EVE JSON log file and extracts flow-related information
    to create a pandas DataFrame suitable for model inference.

    Args:
        eve_json_path (str): Path to the Suricata EVE JSON file.

    Returns:
        pd.DataFrame: DataFrame with extracted features, ready for model input.
    """
    rows = []
    with open(eve_json_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            try:
                d = json.loads(line)
            except json.JSONDecodeError:
                continue
            # Only process 'flow' events
            if d.get('event_type') != 'flow':
                continue
            flow = d.get('flow', {})
            # Extract relevant fields, handling missing keys with defaults
            proto   = d.get('proto', 'unknown')
            service = d.get('app_proto', 'unknown') or 'unknown' # Handle empty string app_proto as unknown
            spkts   = flow.get('pkts_toserver', 0)
            dpkts   = flow.get('pkts_toclient', 0)
            sbytes  = flow.get('bytes_toserver', 0)
            dbytes  = flow.get('bytes_toclient', 0)
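            # NOTE: assumes the flow object carries a 'duration' field. If the log only provides
            # 'start'/'end'/'age', this falls back to 0.0 and 'dur' should be derived from those.
            dur     = flow.get('duration', 0.0)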
            rows.append([proto, service, spkts, dpkts, sbytes, dbytes, dur])

    df = pd.DataFrame(rows, columns=selected_features)
    # Ensure correct data types
    df['proto'] = df['proto'].astype(str)
    df['service'] = df['service'].astype(str)
    for c in ['spkts','dpkts','sbytes','dbytes']:
        df[c] = pd.to_numeric(df[c], errors='coerce').fillna(0).astype(int)
    df['dur'] = pd.to_numeric(df['dur'], errors='coerce').fillna(0.0).astype(float)
    return df
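
The field mapping used by the Zeek parser can be tried on a single record without pandas. This is a simplified sketch: `zeek_record_to_row` is a hypothetical helper, the sample line is hand-written, and plain `int()`/`float()` casts stand in for the `pd.to_numeric` coercion used above.

```python
import json

def zeek_record_to_row(line: str) -> dict:
    """Map one Zeek conn.log JSON line onto the 7-feature schema."""
    d = json.loads(line)
    return {
        'proto':   d.get('proto', 'unknown'),
        'service': d.get('service', 'unknown') or 'unknown',
        'spkts':   int(d.get('orig_pkts', 0) or 0),
        'dpkts':   int(d.get('resp_pkts', 0) or 0),
        'sbytes':  int(d.get('orig_bytes', 0) or 0),
        'dbytes':  int(d.get('resp_bytes', 0) or 0),
        'dur':     float(d.get('duration', 0.0) or 0.0),
    }

# Hand-written sample record (values invented); 'service' and 'duration' are absent.
sample = '{"proto": "tcp", "orig_pkts": 4, "resp_pkts": 3, "orig_bytes": 320, "resp_bytes": 1500}'
row = zeek_record_to_row(sample)
print(row)
```

Missing fields fall back to `'unknown'` or zero, so a sparse log line still yields a complete feature row in the order the models expect.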

## Section 4: Model A - Binary Classifier (Stage 1 of Hierarchy)

Model A distinguishes between Benign and Malicious traffic. This is the first stage of the hierarchical classifier.

In [12]:
# -------------------------
# Model A: Binary Classifier (Benign vs Malicious)
#   - This is the first step of the hierarchical model, classifying traffic as either Benign or Malicious.
# -------------------------

# Split the training data into training and validation sets for model tuning and threshold selection.
# stratify=y_bin_train preserves the class proportions in both splits, which matters under class imbalance.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_bin_train, test_size=0.2, random_state=RANDOM_STATE, stratify=y_bin_train
)

# Initialize the Random Forest Classifier for the binary classification task.
# n_estimators: Number of trees in the forest.
# class_weight='balanced_subsample': Automatically adjusts weights inversely proportional to class frequencies
#                                    in the subsample to deal with class imbalance.
rf_model_bin = RandomForestClassifier(
    n_estimators=500, max_depth=None, min_samples_leaf=1,
    max_features='sqrt', n_jobs=-1, class_weight='balanced_subsample',
    random_state=RANDOM_STATE
)

# Construct the pipeline for Model A (binary classifier).
# If SMOTENC is enabled and available, use ImbPipeline which integrates SMOTENC.
if USE_SMOTENC and IMB_OK:
    bin_pipeline = ImbPipeline(steps=[
        ('pre', preprocessor), # Apply preprocessing (OrdinalEncoder for cat, passthrough for num)
        ('smote', SMOTENC(random_state=RANDOM_STATE, categorical_features=categorical_features, k_neighbors=5)), # Apply SMOTENC for oversampling
        ('rf', rf_model_bin) # Random Forest classifier
    ])
else:
    bin_pipeline = Pipeline(steps=[('pre', preprocessor), ('rf', rf_model_bin)])

# -------------------------
# Model A: F1-tuned Threshold
# -------------------------

# Train the binary classification pipeline (Model A) on the training data.
bin_pipeline.fit(X_tr, y_tr)

# Predict probabilities on the validation set to determine the optimal threshold.
val_scores = bin_pipeline.predict_proba(X_val)[:, 1] # Probability of the positive class (Malicious)

# Pick the threshold that maximizes the F1-score on the validation set.
bin_threshold_f1 = pick_threshold_by_f1(val_scores, y_val)

# Evaluate Model A on the independent test set using the F1-tuned threshold.
test_scores = bin_pipeline.predict_proba(X_test)[:, 1]
test_pred_bin_f1 = (test_scores >= bin_threshold_f1).astype(int)

# Evaluation metrics for Model A (binary classifier).
bin_roc = roc_auc_score(y_bin_test, test_scores) # ROC AUC
bin_ap  = average_precision_score(y_bin_test, test_scores) # Average Precision score
p_f1,r_f1,f1_f1,_ = precision_recall_fscore_support(y_bin_test, test_pred_bin_f1, average='binary', zero_division=0) # Precision, Recall, F1-score
bin_cm_f1 = confusion_matrix(y_bin_test, test_pred_bin_f1) # Confusion Matrix

print("=== Model A (Binary: Benign vs Malicious) (TEST) – F1-tuned threshold ===")
print(f"Threshold(F1): {bin_threshold_f1:.3f}")
print(f"ROC-AUC:       {bin_roc:.4f} | PR-AUC: {bin_ap:.4f}")
print(f"Precision:     {p_f1:.4f} | Recall: {r_f1:.4f} | F1: {f1_f1:.4f}")
print("Confusion [[TN,FP],[FN,TP]]:\n", bin_cm_f1)

=== Model A (Binary: Benign vs Malicious) (TEST) – F1-tuned threshold ===
Threshold(F1): 0.250
ROC-AUC:       0.9717 | PR-AUC: 0.9775
Precision:     0.7902 | Recall: 0.9758 | F1: 0.8732
Confusion [[TN,FP],[FN,TP]]:
 [[25253 11747]
 [ 1099 44233]]
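
The reported precision, recall, and F1 follow directly from the confusion matrix. The short check below recomputes them from the four counts printed above and adds the false-positive rate, which the report does not show.

```python
# Confusion matrix copied from the cell output above: [[TN, FP], [FN, TP]].
tn, fp = 25253, 11747
fn, tp = 1099, 44233

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
fpr = fp / (fp + tn)     # share of benign flows flagged as malicious

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} fpr={fpr:.4f}")
# precision=0.7902 recall=0.9758 f1=0.8732 fpr=0.3175
```

At the F1-tuned threshold, roughly one benign flow in three is flagged as malicious, which motivates a stricter precision-targeted threshold for deployment.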


In [13]:
# -------------------------
# Optional: Hyperparameter Search
# -------------------------

if DO_SEARCH:
    # Parameters
    param_dist = {
        'rf__n_estimators': [300, 500, 700], # Number of trees
        'rf__max_depth': [None, 8, 12, 20],   # Maximum depth of each tree
        'rf__min_samples_leaf': [1, 2, 4],    # Minimum number of samples required to be at a leaf node
        'rf__max_features': ['sqrt', 'log2']  # Number of features to consider when looking for the best split
    }
    # StratifiedKFold for cross-validation to maintain class balance.
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=RANDOM_STATE)
    # RandomizedSearchCV to search for the best hyperparameters.
    search = RandomizedSearchCV(
        bin_pipeline, # The pipeline to tune
        param_distributions=param_dist, # The parameter space to search
        n_iter=6, # Number of parameter settings that are sampled
        scoring='f1', # Metric to optimize (F1-score is chosen here)
        cv=cv, # Cross-validation strategy
        n_jobs=-1, # Use all available cores
        verbose=1, # Display progress messages
        random_state=RANDOM_STATE # For reproducibility
    )
    # Fit the RandomizedSearchCV to the training data to find the best model.
    search.fit(X_tr, y_tr)
    # Update the binary pipeline with the best estimator found by the search.
    bin_pipeline = search.best_estimator_
    # Re-evaluate on the validation set using the best pipeline to re-determine the F1-tuned threshold.
    val_scores = bin_pipeline.predict_proba(X_val)[:, 1]
    bin_threshold_f1 = pick_threshold_by_f1(val_scores, y_val)
    # Predict probabilities on the test set using the best pipeline.
    test_scores = bin_pipeline.predict_proba(X_test)[:, 1]
    print("Search best params:", search.best_params_)
else:
    print("Skipping hyperparam search (set DO_SEARCH=True to enable).")

Skipping hyperparam search (set DO_SEARCH=True to enable).
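
For a sense of the search budget, the grid above can be counted with the standard library. The sketch assumes the `param_dist`, `n_iter=6`, and 3-fold settings from the cell above.

```python
from itertools import product

param_dist = {
    'rf__n_estimators': [300, 500, 700],
    'rf__max_depth': [None, 8, 12, 20],
    'rf__min_samples_leaf': [1, 2, 4],
    'rf__max_features': ['sqrt', 'log2'],
}

n_combos = len(list(product(*param_dist.values())))   # size of the full grid
n_iter, n_splits = 6, 3
cv_fits = n_iter * n_splits                           # fits during cross-validation

print(n_combos, cv_fits)  # 72 18
```

The randomized search therefore samples 6 of 72 possible settings, about 8% of the grid, so the selected parameters are best treated as a reasonable candidate, not an optimum.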


In [14]:
# -------------------------
# Model A: Precision-Targeted Threshold (For Deployment)
# -------------------------

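# Determine the binary classification threshold based on a target precision.
# This threshold prioritizes a certain level of precision, which can be critical in security applications.
# This is the deployment threshold for Model A, used as the first step in the hierarchical prediction.
# NOTE: the threshold is selected on the TEST set here, so the test metrics below are optimistic
# (precision lands on the target by construction). For an unbiased estimate, select it on the
# validation split (val_scores, y_val), as is done for Model B.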
bin_threshold_prec = pick_threshold_by_precision(test_scores, y_bin_test, TARGET_PRECISION_FOR_DEMO)

# Apply the precision-targeted threshold to the test set scores to get binary predictions.
test_pred_bin_prec = (test_scores >= bin_threshold_prec).astype(int)

# Calculate evaluation metrics (Precision, Recall, F1-score) for the binary predictions
# obtained with the precision-targeted threshold.
p_pr,r_pr,f1_pr,_ = precision_recall_fscore_support(y_bin_test, test_pred_bin_prec, average='binary', zero_division=0)

# Confusion matrix.
bin_cm_prec = confusion_matrix(y_bin_test, test_pred_bin_prec)

print("=== Model A (Binary: Benign vs Malicious) (TEST) – Precision-targeted threshold (for Deployment) ===")
print(f"Target precision: {TARGET_PRECISION_FOR_DEMO}")
print(f"Threshold(Prec):  {bin_threshold_prec:.3f}")
print(f"Precision:        {p_pr:.4f} | Recall: {r_pr:.4f} | F1: {f1_pr:.4f}")
print("Confusion [[TN,FP],[FN,TP]]:\n", bin_cm_prec)

# Store the precision-targeted threshold as the deployment threshold for Model A.
BIN_THRESHOLD_FOR_DEPLOY = float(bin_threshold_prec)
print("\nBIN_THRESHOLD_FOR_DEPLOY (for Model A in Hierarchical Prediction) =", BIN_THRESHOLD_FOR_DEPLOY)

=== Model A (Binary: Benign vs Malicious) (TEST) – Precision-targeted threshold (for Deployment) ===
Target precision: 0.9
Threshold(Prec):  0.662
Precision:        0.9000 | Recall: 0.9316 | F1: 0.9155
Confusion [[TN,FP],[FN,TP]]:
 [[32310  4690]
 [ 3102 42230]]

BIN_THRESHOLD_FOR_DEPLOY (for Model A in Hierarchical Prediction) = 0.6623993531975984


## Section 5: Model B - DoS vs Other Classifier (Stage 2 of Hierarchy)

Model B is trained on malicious traffic only and distinguishes between DoS attacks and other attack types.

In [15]:
# -------------------------
# Model B: DoS vs Other Attacks Classifier
#   - Trained on ground-truth malicious traffic only (label == 1) to separate DoS from Other attacks.
#     At inference it is applied to the traffic that Model A flags as malicious.
# -------------------------

# Split the original dataframes to create malicious subsets for training and testing Model B.
# This focuses only on instances identified as malicious by the binary label (y_bin_train/test == 1).
mal_train = df_train[y_bin_train == 1].copy()
mal_test  = df_test[y_bin_test  == 1].copy()

# Extract selected features for the malicious subsets.
X_mal_train = mal_train[selected_features].copy()
X_mal_test  = mal_test[selected_features].copy()

# Create binary labels for Model B classifier: 1 for SPECIFIC_ATTACK (e.g., DoS), 0 for Other malicious attacks.
y_dos_train = (mal_train['attack_cat'] == SPECIFIC_ATTACK).astype(int)
y_dos_test  = (mal_test['attack_cat']  == SPECIFIC_ATTACK).astype(int)

# Initialize the Random Forest Classifier for Model B (DoS vs. Other) classification task.
rf_model_dos = RandomForestClassifier(
    n_estimators=600, max_depth=None, min_samples_leaf=1,
    max_features='sqrt', n_jobs=-1, class_weight='balanced_subsample',
    random_state=RANDOM_STATE
)

# Construct the pipeline for Model B classifier.
# It includes preprocessing and, optionally, SMOTENC with a smaller k_neighbors (3 instead of 5).
if USE_SMOTENC and IMB_OK:
    dos_pipeline = ImbPipeline(steps=[
        ('pre', preprocessor),
        ('smote', SMOTENC(random_state=RANDOM_STATE, categorical_features=categorical_features, k_neighbors=3)),  # slightly tighter
        ('rf', rf_model_dos)
    ])
else:
    dos_pipeline = Pipeline(steps=[('pre', preprocessor), ('rf', rf_model_dos)])

# Split the malicious training data into training and validation sets for Model B tuning.
# Stratification is used to maintain the proportion of DoS vs. Other attacks.
Xd_tr, Xd_val, yd_tr, yd_val = train_test_split(
    X_mal_train, y_dos_train, test_size=0.2, random_state=RANDOM_STATE, stratify=y_dos_train
)

# Train the Model B (DoS vs. Other) classification pipeline.
dos_pipeline.fit(Xd_tr, yd_tr)

# Predict probabilities on the validation set to determine the optimal threshold for Model B.
dos_val_scores = dos_pipeline.predict_proba(Xd_val)[:,1]

# Select the DoS deployment threshold from the target precision on the validation set.
DOS_THRESHOLD_FOR_DEPLOY = pick_threshold_by_precision(dos_val_scores, yd_val, DOS_TARGET_PRECISION_FOR_DEMO)

# Quick check on malicious TEST subset using the newly defined threshold

# Predict probabilities on the malicious test subset for Model B.
dos_test_scores = dos_pipeline.predict_proba(X_mal_test)[:, 1]
# Apply the precision-targeted threshold to get binary predictions for DoS vs Other.
dos_test_pred   = (dos_test_scores >= DOS_THRESHOLD_FOR_DEPLOY).astype(int)

print("=== Model B (DoS vs Other) – malicious TEST subset @ precision-target ===")
print(classification_report(y_dos_test, dos_test_pred, zero_division=0, target_names=['Other','DoS']))
print("Confusion matrix [rows=Actual Other,DoS; cols=Pred Other,DoS]:\n", confusion_matrix(y_dos_test, dos_test_pred))

print("\nDOS_THRESHOLD_FOR_DEPLOY (for Model B in Hierarchical Prediction) =", DOS_THRESHOLD_FOR_DEPLOY)

=== Model B (DoS vs Other) – malicious TEST subset @ precision-target ===
              precision    recall  f1-score   support

       Other       0.91      1.00      0.95     41243
         DoS       0.77      0.04      0.07      4089

    accuracy                           0.91     45332
   macro avg       0.84      0.52      0.51     45332
weighted avg       0.90      0.91      0.87     45332

Confusion matrix [rows=Actual Other,DoS; cols=Pred Other,DoS]:
 [[41197    46]
 [ 3939   150]]

DOS_THRESHOLD_FOR_DEPLOY (for Model B in Hierarchical Prediction) = 0.9716666666666667


## Section 6: Model C - Flat Tri-class Classifier

Model C is a single-stage classifier that directly predicts Benign/DoS/Other in one step (non-hierarchical).

In [16]:
# -------------------------
# Model C: Flat Tri-class Classifier (Benign vs DoS vs Other)
# -------------------------

# Initialize Random Forest Classifier.
rf_model_tri = RandomForestClassifier(
    n_estimators=600, max_depth=None, min_samples_leaf=1,
    max_features='sqrt', n_jobs=-1, class_weight='balanced_subsample',
    random_state=RANDOM_STATE
)

# Initialize pipeline for Model C.
if USE_SMOTENC and IMB_OK:
    tri_pipeline = ImbPipeline(steps=[
        ('pre', preprocessor),
        ('smote', SMOTENC(random_state=RANDOM_STATE, categorical_features=categorical_features, k_neighbors=5)),
        ('rf', rf_model_tri)
    ])
else:
    tri_pipeline = Pipeline(steps=[('pre', preprocessor), ('rf', rf_model_tri)])

# Train.
tri_pipeline.fit(X_train, y_tri_train)

# Predict.
tri_pred_test = tri_pipeline.predict(X_test)

# Confusion matrix.
tri_cm_flat = confusion_matrix(y_tri_test, tri_pred_test)

print("=== Model C (Flat Tri-class: Benign vs DoS vs Other) (TEST) ===")
print(pd.DataFrame(
    classification_report(y_tri_test, tri_pred_test, output_dict=True, zero_division=0)
).T[['precision','recall','f1-score','support']])
print("\nConfusion matrix rows=Actual[Benign,DoS,Other], cols=Pred[Benign,DoS,Other]:")
print(tri_cm_flat)

=== Model C (Flat Tri-class: Benign vs DoS vs Other) (TEST) ===
              precision    recall  f1-score       support
0              0.936682  0.802838  0.864611  37000.000000
1              0.300595  0.506725  0.377345   4089.000000
2              0.798038  0.846083  0.821358  41243.000000
accuracy       0.809794  0.809794  0.809794      0.809794
macro avg      0.678438  0.718549  0.687771  82332.000000
weighted avg   0.835639  0.809794  0.818744  82332.000000

Confusion matrix rows=Actual[Benign,DoS,Other], cols=Pred[Benign,DoS,Other]:
[[29705   423  6872]
 [   58  2072  1959]
 [ 1950  4398 34895]]


## Section 7: Hierarchical Model - Combined Evaluation (A → B)

This section demonstrates the complete hierarchical pipeline: Model A identifies malicious traffic, then Model B classifies it as DoS or Other.

In [17]:
# -------------------------
# Hierarchical Model
#   - This section combines Model A and Model B predictions to form the final hierarchical tri-class output.
#   - Model A (Binary Classifier) runs first, and only malicious instances are passed to Model B.
# -------------------------

# Re-calculate binary classification scores for the entire test set using Model A.
bin_scores_test = bin_pipeline.predict_proba(X_test)[:, 1]

# Identify instances classified as malicious by Model A using its deployed binary threshold.
is_malicious    = (bin_scores_test >= BIN_THRESHOLD_FOR_DEPLOY)

# Initialize the array for hierarchical tri-class predictions.
# Default all instances to Benign (0).
tri_pred_hier = np.zeros(len(X_test), dtype=int)  # 0 = Benign

# If there are any instances classified as malicious by Model A, proceed with Model B classification.
if is_malicious.any():
    # Initialize an array for DoS scores for all test instances.
    dos_scores = np.zeros(len(X_test))
    # Predict DoS probabilities ONLY for the subset identified as malicious by Model A, using Model B.
    dos_scores[is_malicious] = dos_pipeline.predict_proba(X_test[is_malicious])[:, 1]

    # For these malicious instances, classify as DoS (1) if the DoS score meets Model B's threshold,
    # otherwise classify as Other attack (2).
    # The np.where condition maps 0 (Other) to 2 and 1 (DoS) to 1.
    tri_pred_hier[is_malicious] = np.where(
        dos_scores[is_malicious] >= DOS_THRESHOLD_FOR_DEPLOY, 1, 2
    )

# Print the classification report for the final hierarchical tri-class predictions.
print("=== Tri-class (HIERARCHICAL: Model A → Model B) – TEST ===")
print(classification_report(y_tri_test, tri_pred_hier, zero_division=0, target_names=['Benign','DoS','Other']))

# Compute and print the confusion matrix for the hierarchical model.
cm_hier = confusion_matrix(y_tri_test, tri_pred_hier)
print("Confusion matrix rows=Actual[Benign,DoS,Other], cols=Pred[Benign,DoS,Other]:\n", cm_hier)

=== Tri-class (HIERARCHICAL: Model A → Model B) – TEST ===
              precision    recall  f1-score   support

      Benign       0.91      0.87      0.89     37000
         DoS       0.70      0.02      0.03      4089
       Other       0.82      0.93      0.87     41243

    accuracy                           0.86     82332
   macro avg       0.81      0.61      0.60     82332
weighted avg       0.85      0.86      0.84     82332

Confusion matrix rows=Actual[Benign,DoS,Other], cols=Pred[Benign,DoS,Other]:
 [[32310     0  4690]
 [  125    62  3902]
 [ 2977    27 38239]]
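
The two tri-class confusion matrices make the DoS trade-off easy to quantify. The sketch below recomputes DoS precision and recall for the flat and hierarchical models from the matrices printed in this notebook; `per_class_precision_recall` is a hypothetical helper.

```python
def per_class_precision_recall(cm, k):
    """Precision and recall of class k from a square confusion matrix
    (rows = actual, columns = predicted)."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum
    actual_k = sum(cm[k])                     # row sum
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    return precision, recall

# Confusion matrices copied from the Model C and hierarchical cell outputs.
cm_flat = [[29705, 423, 6872], [58, 2072, 1959], [1950, 4398, 34895]]
cm_hier = [[32310, 0, 4690], [125, 62, 3902], [2977, 27, 38239]]

DOS = 1
for name, cm in (("flat", cm_flat), ("hierarchical", cm_hier)):
    p, r = per_class_precision_recall(cm, DOS)
    print(f"{name:>12}: DoS precision={p:.3f} recall={r:.3f}")
```

With the deployment thresholds used here, the hierarchical model labels only 89 flows as DoS and is right about 62 of them (about 0.70 precision, 0.02 recall), while the flat model recovers about half of the DoS flows at about 0.30 precision.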


## Section 8: Save Model Artifacts

Save all trained models, thresholds, features, and evaluation metrics to disk for later use.

In [18]:
os.makedirs("models", exist_ok=True)

# Save the trained binary classifier pipeline (Model A for the hierarchical prediction)
joblib.dump(bin_pipeline, "models/rf_bin.joblib")
# Save the deployment threshold for Model A
with open("models/bin_threshold.json","w") as f:
    json.dump({'threshold': float(BIN_THRESHOLD_FOR_DEPLOY)}, f)

# Save the trained flat tri-class classifier pipeline (Model C)
joblib.dump(tri_pipeline, "models/rf_tri.joblib")

# Save the trained DoS vs. Other classifier pipeline (Model B for the hierarchical prediction)
joblib.dump(dos_pipeline, "models/rf_dos_vs_other.joblib")
# Save the deployment threshold for Model B
with open("models/dos_threshold.json","w") as f:
    json.dump({'threshold': float(DOS_THRESHOLD_FOR_DEPLOY)}, f)

# Save the list of selected features used by the models
with open("models/features.json","w") as f:
    json.dump(selected_features, f)

# Save binary classification metrics (from Model A)
with open("models/metrics_bin.json","w") as f:
    json.dump({
        'threshold_deploy': float(BIN_THRESHOLD_FOR_DEPLOY),
        'roc_auc': float(bin_roc),
        'pr_auc': float(bin_ap),
        'confusion_f1': bin_cm_f1.tolist(),
        'confusion_prec': bin_cm_prec.tolist()
    }, f, indent=2)

# Save flat tri-class classification metrics (from Model C)
with open("models/metrics_tri_flat.json","w") as f:
    json.dump({'confusion_matrix': tri_cm_flat.tolist()}, f, indent=2)

# Save hierarchical tri-class classification metrics (from Model A → Model B)
with open("models/metrics_tri_hier.json","w") as f:
    json.dump({'confusion_matrix': cm_hier.tolist()}, f, indent=2)

print("✓ Saved all artifacts to ./models/")

✓ Saved all artifacts to ./models/
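
Before loading the artifacts elsewhere, it can help to confirm that every expected file was written. The sketch below is an illustration rather than part of the notebook: `missing_artifacts` is a hypothetical helper, and the file list is taken from the save cell above. The demonstration runs in a temporary directory so it does not touch `./models/`.

```python
import os
import tempfile

# File names written by the save cell above.
EXPECTED_ARTIFACTS = [
    "rf_bin.joblib",
    "rf_tri.joblib",
    "rf_dos_vs_other.joblib",
    "bin_threshold.json",
    "dos_threshold.json",
    "features.json",
    "metrics_bin.json",
    "metrics_tri_flat.json",
    "metrics_tri_hier.json",
]

def missing_artifacts(art_dir="models"):
    """Return the expected artifact names that are absent from art_dir (hypothetical helper)."""
    return [name for name in EXPECTED_ARTIFACTS
            if not os.path.isfile(os.path.join(art_dir, name))]

# Self-contained demonstration: create every file except the last one.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_ARTIFACTS[:-1]:
        open(os.path.join(tmp, name), "w").close()
    demo_missing = missing_artifacts(tmp)
    print("Missing:", demo_missing)
```

In the notebook, calling `missing_artifacts("models")` after the save cell has run would be expected to return an empty list.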


## Section 9: Prediction Functions & Demonstration

Load the saved artifacts and make predictions on new data. The utilities cover flat classification (Model A, Model C) and hierarchical classification (Model A → Model B), and can write the results to CSV.

In [19]:
# -------------------------
# Prediction Loading Functions
# -------------------------

def load_flat_artifacts(art_dir="models"):
    """
    Loads the artifacts required for flat (binary and tri-class) predictions.

    Args:
        art_dir (str): Directory where model artifacts are stored.

    Returns:
        tuple: Binary pipeline (Model A), tri-class pipeline (Model C), binary threshold, and feature list.
    """
    pipe_bin = joblib.load(os.path.join(art_dir, "rf_bin.joblib"))
    pipe_tri = joblib.load(os.path.join(art_dir, "rf_tri.joblib"))
    with open(os.path.join(art_dir, "bin_threshold.json")) as f:
        th = json.load(f)['threshold']
    with open(os.path.join(art_dir, "features.json")) as f:
        feats = json.load(f)
    return pipe_bin, pipe_tri, th, feats

def predict_from_df(df_features: pd.DataFrame, mode="both", art_dir="models"):
    """
    Makes predictions using the flat classification models (Model A or Model C).

    Args:
        df_features (pd.DataFrame): DataFrame containing features for prediction.
        mode (str): "binary", "tri", or "both" to specify which predictions to return.
        art_dir (str): Directory where model artifacts are stored.

    Returns:
        dict: Dictionary containing prediction scores and/or labels.
    """
    pipe_bin, pipe_tri, th, feats = load_flat_artifacts(art_dir)
    X = df_features[feats].copy()
    out = {}
    if mode in ("binary","both"):
        scores = pipe_bin.predict_proba(X)[:,1]
        out["binary_scores"] = scores
        out["binary_labels"] = (scores >= th).astype(int)
    if mode in ("tri","both"):
        out["tri_labels"] = pipe_tri.predict(X)
    return out

def load_hier_artifacts(art_dir="models"):
    """
    Loads the artifacts required for hierarchical predictions (Model A and Model B).

    Args:
        art_dir (str): Directory where model artifacts are stored.

    Returns:
        dict: Dictionary containing binary pipeline (Model A), DoS pipeline (Model B),
              binary threshold, DoS threshold, and feature list.
    """
    pipe_bin = joblib.load(os.path.join(art_dir, "rf_bin.joblib"))
    pipe_dos = joblib.load(os.path.join(art_dir, "rf_dos_vs_other.joblib"))
    with open(os.path.join(art_dir, "bin_threshold.json")) as f:
        t1 = json.load(f)['threshold']
    with open(os.path.join(art_dir, "dos_threshold.json")) as f:
        t2 = json.load(f)['threshold']
    with open(os.path.join(art_dir, "features.json")) as f:
        feats = json.load(f)
    return dict(pipe_bin=pipe_bin, pipe_dos=pipe_dos, t1=t1, t2=t2, feats=feats)

def predict_hier_from_df(df_features: pd.DataFrame, art_dir="models"):
    """
    Makes predictions using the hierarchical classification model (Model A followed by Model B).

    Args:
        df_features (pd.DataFrame): DataFrame containing features for prediction.
        art_dir (str): Directory where model artifacts are stored.

    Returns:
        dict: Dictionary containing binary scores, binary labels, and hierarchical tri-class labels.
              The tri-class labels are derived by first applying Model A, then Model B conditionally.
    """
    art = load_hier_artifacts(art_dir)
    X = df_features[art['feats']].copy()
    # Model A: Predict binary scores (malicious probability) for all instances
    s_bin = art['pipe_bin'].predict_proba(X)[:,1]
    # Determine if an instance is classified as malicious based on Model A's binary threshold
    is_mal = (s_bin >= art['t1'])
    # Initialize tri-class predictions, default to Benign (0)
    tri = np.zeros(len(X), dtype=int)
    if is_mal.any():
        # Model B: For instances identified as malicious by Model A, predict DoS scores
        s_dos = np.zeros(len(X))
        s_dos[is_mal] = art['pipe_dos'].predict_proba(X[is_mal])[:,1]
        # Classify malicious instances as DoS (1) if the score meets Model B's threshold, else Other (2).
        # np.where(cond, 1, 2) matches the label mapping used in the evaluation cell.
        tri[is_mal] = np.where(s_dos[is_mal] >= art['t2'], 1, 2)
    return {"binary_scores": s_bin, "binary_labels": is_mal.astype(int), "tri_labels": tri}

def save_predictions_csv(df_features: pd.DataFrame, out_csv: str, mode="hier", art_dir="models"):
    """
    Makes predictions using the specified model mode and saves results to CSV.

    Args:
        df_features (pd.DataFrame): DataFrame containing the input features.
        out_csv (str): Path to the output CSV file where predictions will be saved.
        mode (str): "flat" for flat classification; any other value (default "hier") uses hierarchical classification.
        art_dir (str): Directory where model artifacts are stored.
    """
    if mode == "flat":
        out = predict_from_df(df_features, mode='both', art_dir=art_dir)
    else:
        out = predict_hier_from_df(df_features, art_dir=art_dir)

    # Create an output DataFrame with original features and new predictions
    df_out = df_features.copy()
    if "binary_scores" in out:
        df_out["bin_prob_mal"] = out["binary_scores"]
        df_out["bin_label"]    = out["binary_labels"]
    if "tri_labels" in out:
        df_out["tri_label"]    = out["tri_labels"]

    # Save the predictions to a CSV file
    df_out.to_csv(out_csv, index=False)
    print("Saved:", out_csv)
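
The two-stage label assignment can be checked in isolation with hand-written scores. This is a minimal sketch under stated assumptions: `hierarchical_labels` is a hypothetical helper, the scores are made up, and both thresholds are set to 0.5 purely for illustration (the notebook loads its thresholds from JSON). It follows the `np.where(..., 1, 2)` mapping used in the hierarchical evaluation cell.

```python
import numpy as np

def hierarchical_labels(s_bin, s_dos, t1, t2):
    """Map stage-1 and stage-2 scores to 0=Benign, 1=DoS, 2=Other (hypothetical helper)."""
    s_bin = np.asarray(s_bin, dtype=float)
    s_dos = np.asarray(s_dos, dtype=float)
    tri = np.zeros(len(s_bin), dtype=int)      # default: Benign (0)
    is_mal = s_bin >= t1                       # stage 1: malicious gate
    # Stage 2 applies only to rows that passed the gate.
    tri[is_mal] = np.where(s_dos[is_mal] >= t2, 1, 2)
    return tri

# Made-up scores: rows 0 and 3 fail the stage-1 gate, so their DoS scores are ignored.
scores_bin = [0.10, 0.90, 0.80, 0.40]
scores_dos = [0.99, 0.70, 0.20, 0.95]
demo_labels = hierarchical_labels(scores_bin, scores_dos, t1=0.5, t2=0.5)
print(demo_labels)  # [0 1 2 0]
```

A row with a high DoS score is still labelled Benign when it fails the Model A gate, so stage-1 misses cannot be recovered in stage 2.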

In [20]:
# -------------------------
# Prediction Demo
# -------------------------

print("Testing prediction functions on sample data:")
print("\nFlat prediction keys:", predict_from_df(df_test[selected_features].head(5), mode="both").keys())
print("Hierarchical prediction keys:", predict_hier_from_df(df_test[selected_features].head(5)).keys())

# Generate prediction previews from the first 200 rows of the test data
print("\n" + "="*60)
print("Generating prediction CSV previews...")
print("="*60)

save_predictions_csv(df_test[selected_features].head(200), "predictions_preview_flat.csv", mode="flat")
save_predictions_csv(df_test[selected_features].head(200), "predictions_preview_hier.csv", mode="hier")

print("\n✓ Demo complete! Prediction utilities are ready for use.")

Testing prediction functions on sample data:

Flat prediction keys: dict_keys(['binary_scores', 'binary_labels', 'tri_labels'])
Hierarchical prediction keys: dict_keys(['binary_scores', 'binary_labels', 'tri_labels'])

Generating prediction CSV previews...
Saved: predictions_preview_flat.csv
Saved: predictions_preview_hier.csv

✓ Demo complete! Prediction utilities are ready for use.
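
A quick way to inspect a preview file is to count the predicted classes. The sketch below is illustrative only: it parses a small in-memory CSV with made-up rows instead of the real preview file, and uses the column names written by `save_predictions_csv` (`bin_prob_mal`, `bin_label`, `tri_label`). To read an actual preview, replace the `io.StringIO` object with `open("predictions_preview_hier.csv", newline="")`.

```python
import csv
import io
from collections import Counter

# Placeholder rows standing in for a preview CSV; the values are made up.
sample_csv = io.StringIO(
    "dur,bin_prob_mal,bin_label,tri_label\n"
    "0.12,0.05,0,0\n"
    "0.80,0.91,1,2\n"
    "0.33,0.97,1,1\n"
    "0.01,0.88,1,2\n"
)

# Label encoding used throughout the notebook.
names = {0: "Benign", 1: "DoS", 2: "Other"}

# Count how many rows were assigned to each tri-class label.
counts = Counter(names[int(row["tri_label"])] for row in csv.DictReader(sample_csv))
for label, n in counts.items():
    print(f"{label}: {n}")
```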
