
# ThreatScope Training Demo

This notebook shows how to go from labeled network flow data to a trained intrusion detection model.

It covers:
1. Loading a prepared dataset of flow features and labels  
2. Quick exploratory checks  
3. Train/test split  
4. Model training (XGBoost)  
5. Evaluation (precision, recall, F1, ROC AUC, confusion matrix)  
6. Saving `model.joblib`, `feature_order.json`, and `metrics.json` for runtime use

This notebook is part of the ThreatScope repo.


In [None]:

import pandas as pd
import numpy as np
import json
import joblib
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)
from xgboost import XGBClassifier
import matplotlib.pyplot as plt

FEATURE_ORDER = [
    "flow_duration_ms",
    "pkt_rate",
    "avg_pkt_size",
    "std_pkt_size",
    "bytes_total",
    "syn_count",
    "fin_count",
    "psh_count",
    "entropy_dst_port",
]

ARTIFACT_DIR = Path("model_artifacts")
ARTIFACT_DIR.mkdir(exist_ok=True)



## 1. Load dataset

Assumptions:
- You have already converted raw PCAP traffic into per-flow rows.
- Each row has numeric features matching `FEATURE_ORDER`.
- There is a `label` column holding the class name of each flow, for example `benign`, `dos`, or `scan`.

Edit the path below to point at your CSV.
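
As an illustration of the layout assumed above, the sketch below parses a tiny in-memory CSV with the standard library and checks that every feature column is present and numeric. The two sample rows are invented for the example and do not come from a real capture, and `check_rows` is a hypothetical helper, not part of ThreatScope.

```python
import csv
import io

FEATURE_ORDER = [
    "flow_duration_ms", "pkt_rate", "avg_pkt_size", "std_pkt_size",
    "bytes_total", "syn_count", "fin_count", "psh_count", "entropy_dst_port",
]

# Hypothetical sample: the values are made up purely to show the expected shape.
SAMPLE_CSV = """flow_duration_ms,pkt_rate,avg_pkt_size,std_pkt_size,bytes_total,syn_count,fin_count,psh_count,entropy_dst_port,label
120.5,33.2,512.0,40.1,61440,1,1,4,0.0,benign
3.1,900.0,60.0,0.0,5400,90,0,0,6.3,scan
"""


def check_rows(text, feature_order):
    """Return (rows, problems): the parsed rows and a list of readable issues."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [
        f"missing column: {name}"
        for name in list(feature_order) + ["label"]
        if name not in header
    ]
    present = [name for name in feature_order if name in header]
    rows = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        rows.append(row)
        for name in present:
            try:
                float(row[name])
            except (TypeError, ValueError):
                problems.append(f"line {line_no}: {name} is not numeric: {row[name]!r}")
    return rows, problems


rows, problems = check_rows(SAMPLE_CSV, FEATURE_ORDER)
print(len(rows), "rows,", len(problems), "problems")
```

Running a check like this before `astype(float)` gives a per-line message instead of a single conversion error.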


In [None]:

DATA_CSV = "datasets/flows_labeled.csv"  # change this to your file

df = pd.read_csv(DATA_CSV)
print("Rows:", len(df))
print("Columns:", df.columns.tolist())

# sanity check required columns
missing = [c for c in FEATURE_ORDER if c not in df.columns]
if missing:
    raise ValueError(f"Dataset missing required features: {missing}")
if 'label' not in df.columns:
    raise ValueError("Dataset missing 'label' column")

df.head()



## 2. Class distribution

Intrusion datasets are usually imbalanced: benign flows tend to far outnumber attack flows such as DoS or port scans.

We inspect the label counts first, because the imbalance affects both how we split the data and how we read the metrics later.
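
To make "imbalanced" concrete, it helps to look at each class's share of the data and at the ratio between the largest and smallest class. This is a minimal sketch using only the standard library; the label list is invented and `imbalance_summary` is a hypothetical helper.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return per-class counts, per-class shares, and the largest/smallest ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {cls: n / total for cls, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return counts, shares, ratio


# Invented example labels, not taken from any dataset.
labels = ["benign"] * 90 + ["dos"] * 8 + ["scan"] * 2
counts, shares, ratio = imbalance_summary(labels)

for cls, n in counts.most_common():
    print(f"{cls:<8} {n:>4}  {shares[cls]:.1%}")
print("largest / smallest class:", ratio)
```

With pandas, `df['label'].value_counts(normalize=True)` gives the same shares directly.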


In [None]:

label_counts = df['label'].value_counts()
print(label_counts)

# bar plot of class distribution using matplotlib (no seaborn)
plt.figure()
label_counts.plot(kind='bar')
plt.title('Label distribution')
plt.xlabel('Class')
plt.ylabel('Count')
plt.show()



## 3. Train/test split

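We stratify on the label so that every class keeps roughly the same proportion in train and test, and rare attack classes are represented in both. Note that `train_test_split` raises an error when a class has only one row, so drop or merge such classes before splitting.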
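
What stratification does can be sketched without scikit-learn: group row indices by label, shuffle each group, and move about the same fraction of every group into the test set. This is an illustrative simplification, not how `train_test_split` is implemented; `stratified_indices` is a hypothetical helper and the labels are invented.

```python
import random
from collections import defaultdict


def stratified_indices(labels, test_size=0.2, seed=42):
    """Split row indices so each class contributes about test_size of its rows."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Aim for at least one test row per class, but never take the last row
        # away from training (a single-row class stays entirely in train).
        n_test = max(1, round(len(indices) * test_size))
        n_test = min(n_test, len(indices) - 1)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Invented example labels, not taken from any dataset.
labels = ["benign"] * 80 + ["dos"] * 15 + ["scan"] * 5
train_idx, test_idx = stratified_indices(labels)

print(len(train_idx), "train rows,", len(test_idx), "test rows")
print("classes in test:", sorted({labels[i] for i in test_idx}))
```

A purely random 80/20 split of the same data could easily leave the five `scan` rows out of the test set, which is the failure mode stratification avoids.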


In [None]:

X = df[FEATURE_ORDER].astype(float)
y = df['label'].astype(str)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape



## 4. Train XGBoost model

We use XGBoost's multiclass `multi:softprob` objective. Unlike `multi:softmax`, which returns only the predicted class, it returns one probability per class for every flow.
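
One caveat, stated as an assumption about your installed version: recent XGBoost releases expect `XGBClassifier.fit` to receive integer class labels `0..K-1` rather than strings such as `benign`. If fitting on string labels raises an error, the labels can be mapped to integers before training and mapped back afterwards. The sketch below shows the idea in plain Python; `encode_labels` and `decode_labels` are hypothetical helpers, and scikit-learn's `LabelEncoder` provides the same sorted-class mapping.

```python
def encode_labels(labels):
    """Map class names to integers 0..K-1 in sorted order."""
    classes = sorted(set(labels))
    to_index = {name: i for i, name in enumerate(classes)}
    return [to_index[name] for name in labels], classes


def decode_labels(indices, classes):
    """Map integer predictions back to class names."""
    return [classes[i] for i in indices]


# Invented example labels.
y = ["dos", "benign", "scan", "benign", "dos"]
y_encoded, classes = encode_labels(y)

print(classes)
print(y_encoded)
print(decode_labels(y_encoded, classes) == y)
```

If you adopt this, keep `classes` next to the model so that whatever consumes the predictions can translate indices back to names.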


In [None]:

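model = XGBClassifier(
    n_estimators=200,        # number of boosting rounds
    max_depth=6,             # maximum depth of each tree
    learning_rate=0.1,       # shrinkage applied to each round
    subsample=0.8,           # fraction of rows sampled per tree
    colsample_bytree=0.8,    # fraction of features sampled per tree
    reg_lambda=1.0,          # L2 regularization on leaf weights
    n_jobs=-1,               # use all available cores
    random_state=42,         # fixed seed so the row/feature sampling is repeatable
    objective='multi:softprob'
)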

model.fit(X_train, y_train)



## 5. Evaluation

We compute:
- macro-averaged precision, recall, and F1
- macro-averaged ROC AUC (one-vs-rest)
- the confusion matrix
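
Macro averaging computes each metric per class and then takes the unweighted mean, so a rare attack class counts as much as `benign`. The pure-Python sketch below spells out that definition on invented labels; `macro_prf` is a hypothetical helper meant to illustrate the arithmetic, not to replace `precision_recall_fscore_support`.

```python
def macro_prf(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1 (0.0 when undefined)."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Invented example: 4 benign, 2 dos, 2 scan flows with two mistakes.
y_true = ["benign", "benign", "benign", "benign", "dos", "dos", "scan", "scan"]
y_pred = ["benign", "benign", "benign", "dos", "dos", "dos", "scan", "benign"]

p_macro, r_macro, f_macro = macro_prf(y_true, y_pred)
print(f"precision={p_macro:.3f} recall={r_macro:.3f} f1={f_macro:.3f}")
```

Here the single missed `scan` flow halves that class's recall and pulls the macro recall down noticeably, whereas plain accuracy would barely move on a benign-heavy dataset.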


In [None]:

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)

# classification report
report = classification_report(y_test, y_pred, zero_division=0)
print(report)

# macro metrics
precision_macro, recall_macro, f1_macro, _ = precision_recall_fscore_support(
    y_test, y_pred, average='macro', zero_division=0
)

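try:
    auc_macro = roc_auc_score(y_test, y_proba, multi_class='ovr', average='macro')
except ValueError as exc:
    # e.g. a class is missing from y_test, so one-vs-rest AUC is undefined
    print("ROC AUC not computed:", exc)
    auc_macro = None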

print({
    'precision_macro': precision_macro,
    'recall_macro': recall_macro,
    'f1_macro': f1_macro,
    'roc_auc_macro': auc_macro
})



### Confusion matrix

Rows are true classes. Columns are predicted classes.
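
With that orientation, each diagonal entry divided by its row sum is the recall of that class. The sketch below builds a small matrix by hand to show this; the labels are invented and `confusion_counts` is a hypothetical helper that only illustrates the same row/column convention as `confusion_matrix`.

```python
def confusion_counts(y_true, y_pred, classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    index = {cls: i for i, cls in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Invented example labels.
classes = ["benign", "dos", "scan"]
y_true = ["benign", "benign", "benign", "benign", "dos", "dos", "scan", "scan"]
y_pred = ["benign", "benign", "benign", "dos", "dos", "dos", "scan", "benign"]

cm_example = confusion_counts(y_true, y_pred, classes)
for cls, row in zip(classes, cm_example):
    print(f"{cls:<8}", row)

# Recall per class: diagonal entry over the row total.
recall_by_class = {
    cls: (row[i] / sum(row) if sum(row) else 0.0)
    for i, (cls, row) in enumerate(zip(classes, cm_example))
}
print(recall_by_class)
```

Reading down a column instead gives precision: the diagonal entry over the column total.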


In [None]:

cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print("Classes order:", model.classes_)
print(cm)

plt.figure()
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion matrix')
plt.colorbar()
tick_marks = np.arange(len(model.classes_))
plt.xticks(tick_marks, model.classes_, rotation=45, ha='right')
plt.yticks(tick_marks, model.classes_)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.tight_layout()
plt.show()



## 6. Save artifacts for runtime inference

The runtime (`ThreatClassifier`) needs:
- `model.joblib`
- `feature_order.json`
- `metrics.json` (optional, but useful for README badges)

These files are written to `model_artifacts/` and loaded by the FastAPI service.
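
How the runtime consumes these files is not shown in this notebook, so the following is only a sketch under assumptions. It writes a shortened `feature_order.json` to a temporary directory and uses it to turn a dict of flow features into a row in the trained column order, which is the reason the order is saved at all. `to_feature_row` is a hypothetical helper, not the actual `ThreatClassifier` API, and the flow values are invented.

```python
import json
import tempfile
from pathlib import Path


def to_feature_row(flow, feature_order):
    """Arrange a flow's features in the column order the model was trained on."""
    missing = [name for name in feature_order if name not in flow]
    if missing:
        raise ValueError(f"flow is missing features: {missing}")
    return [float(flow[name]) for name in feature_order]


with tempfile.TemporaryDirectory() as tmp:
    # Shortened, hypothetical feature list just for this example.
    path = Path(tmp) / "feature_order.json"
    path.write_text(json.dumps(["pkt_rate", "bytes_total", "syn_count"]), encoding="utf-8")
    feature_order = json.loads(path.read_text(encoding="utf-8"))

# The dict's key order differs from the saved order on purpose.
flow = {"syn_count": 90, "bytes_total": 5400, "pkt_rate": 900.0}
row = to_feature_row(flow, feature_order)
print(row)
```

The model itself would typically be restored with `joblib.load` on `model.joblib`, after which a row like this can be passed to `predict_proba`.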


In [None]:

metrics = {
    'precision_macro': float(precision_macro),
    'recall_macro': float(recall_macro),
    'f1_macro': float(f1_macro),
    'roc_auc_macro': float(auc_macro) if auc_macro is not None else None,
    # plain Python strings so the class list is always JSON-serializable
    'classes': [str(c) for c in model.classes_]
}

joblib.dump(model, ARTIFACT_DIR / 'model.joblib')

with open(ARTIFACT_DIR / 'feature_order.json', 'w', encoding='utf-8') as f:
    json.dump(FEATURE_ORDER, f, indent=2)

with open(ARTIFACT_DIR / 'metrics.json', 'w', encoding='utf-8') as f:
    json.dump(metrics, f, indent=2)

print("Artifacts saved in:", ARTIFACT_DIR.resolve())
