# **Hybrid SSL + Drift Adaptation**
## **Final Model Training Pipeline**

This notebook implements the final solution for the SOC Alert Triage project. It addresses **concept drift** (Friday traffic differs from Mon-Thu traffic) by training on a "Hybrid Seed" that mixes labeled Mon-Thu data with a small labeled slice of Friday data.

### **The Hybrid Strategy:**
1.  **Labeled Seed (The Teacher):** Composed of **30% Mon-Thu data** AND **15% Friday data**. This "Drift Bridge" teaches the model the new attack patterns immediately.
2.  **Unlabeled Pool (The Amplifier):** **70% of Mon-Thu data**. We use Semi-Supervised Learning (Tri-Training) to leverage this massive volume of data to stabilize the decision boundaries.
3.  **Test Set (The Exam):** **85% of Friday data**. This data is completely hidden during training and is used solely to evaluate final performance.

### **Goal:**
Achieve **>99% Recall** on the Friday test set while maintaining a high Auto-Close rate.
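
To make the two goal metrics concrete, here is a minimal sketch on toy data. The helper name `triage_metrics` and the toy arrays are illustrative assumptions, not part of the project code. Recall is the share of true attacks that get escalated, and the auto-close rate is the share of all alerts scoring below the threshold.

```python
import numpy as np

def triage_metrics(y_true, scores, threshold):
    """Return (recall, auto_close_rate) for a given score threshold.

    Hypothetical helper for illustration only.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    escalated = scores >= threshold      # alerts sent to an analyst
    attacks = y_true == 1
    # Recall: fraction of true attacks that are escalated
    recall = escalated[attacks].mean() if attacks.any() else 0.0
    # Auto-close rate: fraction of all alerts scoring below the threshold
    auto_close_rate = (~escalated).mean()
    return float(recall), float(auto_close_rate)

# Toy data (assumption): four benign alerts, two attacks
y_toy = [0, 0, 0, 0, 1, 1]
s_toy = [0.05, 0.10, 0.20, 0.60, 0.70, 0.90]
recall, auto_close = triage_metrics(y_toy, s_toy, threshold=0.5)
print(f"Recall: {recall:.2%} | Auto-Close: {auto_close:.2%}")
```

Lowering the threshold can only keep or raise recall, but it also lowers the auto-close rate, which is the trade-off the pipeline tunes.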

In [1]:
import os
import joblib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings

from sklearn.model_selection import train_test_split
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (
    roc_auc_score, recall_score, precision_score, 
    confusion_matrix, precision_recall_curve
)

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Try importing LightGBM (Faster/Better), fall back to RandomForest
try:
    from lightgbm import LGBMClassifier
    HAS_LGBM = True
    print("✅ LightGBM detected. Using Gradient Boosting.")
except ImportError:
    from sklearn.ensemble import RandomForestClassifier
    HAS_LGBM = False
    print("⚠️ LightGBM not found. Falling back to Random Forest.")

# =============================================================================
# CONFIGURATION
# =============================================================================
ROOT_DIR = r"C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL"
PROCESSED_DIR = os.path.join(ROOT_DIR, "processed")
RESULTS_DIR = os.path.join(ROOT_DIR, "results")
MODEL_DIR = os.path.join(ROOT_DIR, "models")

os.makedirs(RESULTS_DIR, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

RANDOM_STATE = 42
MIN_CONF_FLOOR = 0.05   # Minimum |p - 0.5| required from both teachers to consider a pseudo-label
MIN_TOP_K = 1000        # Max pseudo-labels each student model receives per round

print(f"Configuration Loaded. Reading from: {PROCESSED_DIR}")

✅ LightGBM detected. Using Gradient Boosting.
Configuration Loaded. Reading from: C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL\processed


### **1. Class Definitions**
Here we define the **Rank-Based Tri-Training** class. This custom SSL algorithm uses three classifiers that "teach" each other. If two models agree on a label with high confidence, they teach the third model.
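
The "agree, then rank by confidence" step can be sketched in isolation as follows. This is a simplified illustration under stated assumptions: the function name `select_pseudo_labels` and the toy probabilities are hypothetical, and only one teacher pair is shown.

```python
import numpy as np

def select_pseudo_labels(p_a, p_b, top_k, min_conf=0.05):
    """Pick up to top_k pool indices where two teachers agree, ranked by confidence.

    Hypothetical helper; p_a and p_b are positive-class probabilities.
    """
    p_a, p_b = np.asarray(p_a), np.asarray(p_b)
    pred_a = (p_a >= 0.5).astype(int)
    pred_b = (p_b >= 0.5).astype(int)
    agree = np.where(pred_a == pred_b)[0]
    # Confidence = distance from 0.5, taken from the less confident teacher
    conf = np.minimum(np.abs(p_a[agree] - 0.5), np.abs(p_b[agree] - 0.5))
    keep = conf >= min_conf              # drop near-boundary agreements
    agree, conf = agree[keep], conf[keep]
    order = np.argsort(-conf)[:top_k]    # most confident first
    idx = agree[order]
    return idx, pred_a[idx]

# Toy probabilities (assumption) from two teachers over five pool samples
p1 = [0.95, 0.52, 0.05, 0.80, 0.45]
p2 = [0.90, 0.30, 0.02, 0.60, 0.48]
idx, labels = select_pseudo_labels(p1, p2, top_k=2)
print(idx, labels)
```

Sample 1 is skipped because the teachers disagree, and sample 4 is skipped because both teachers sit too close to 0.5. The two most confident agreements are passed to the student.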

In [2]:
def base_builder(seed=42):
    """Factory function to build base classifiers."""
    if HAS_LGBM:
        return LGBMClassifier(
            n_estimators=200, 
            learning_rate=0.05, 
            num_leaves=31,
            class_weight='balanced', 
            random_state=seed, 
            n_jobs=-1, 
            verbose=-1
        )
    else:
        return RandomForestClassifier(
            n_estimators=100, 
            class_weight='balanced', 
            random_state=seed, 
            n_jobs=-1
        )

class RankTriTraining:
    """Rank-Based Tri-Training with Probability Calibration"""
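    def __init__(self, base_builder, n_models=3, max_iter=5, top_k=1000, 
                 min_conf=0.05, random_state=42):
        # The teacher/student pairing in fit() is hard-coded for exactly three models
        if n_models != 3:
            raise ValueError("RankTriTraining currently supports n_models=3 only.")
        self.base_builder = base_builder
        self.n_models = n_models
        self.max_iter = max_iter
        self.top_k = top_k
        self.min_conf_floor = min_conf
        self.random_state = random_state
        self.models = [] 
        self.train_data = [] 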

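    def fit(self, X_labeled, y_labeled, X_unlabeled, X_cal, y_cal):
        """Train on the labeled seed, then grow each model's training set with pseudo-labels.

        X_labeled and y_labeled must be NumPy arrays (they are indexed positionally).
        X_cal and y_cal are a held-out labeled set used only for isotonic calibration.
        """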
        rng = np.random.RandomState(self.random_state)
        self.models = []
        self.train_data = []

        # --- Initialization Phase ---
        print(f"[Init] Training {self.n_models} base models on Labeled Seed...")
        for i in range(self.n_models):
            # Bootstrap sample for diversity
            idx = rng.choice(len(X_labeled), size=int(0.8 * len(X_labeled)), replace=True)
            X_boot = X_labeled[idx]
            y_boot = y_labeled[idx]
            
            base = self.base_builder(seed=self.random_state + i)
            base.fit(X_boot, y_boot)
            
            # Calibrate probabilities using Isotonic Regression
            cal = CalibratedClassifierCV(base, cv='prefit', method='isotonic')
            cal.fit(X_cal, y_cal)
            self.models.append(cal)
            self.train_data.append([X_boot, y_boot])

        # --- Tri-Training Loop ---
        X_u = np.array(X_unlabeled)
        mask = np.ones(len(X_u), dtype=bool) # Track available unlabeled data
        
        for iteration in range(self.max_iter):
            print(f"--- Iteration {iteration + 1}/{self.max_iter} ---")
            pool_idx = np.where(mask)[0]
            if len(pool_idx) == 0: break
            
            X_pool = X_u[pool_idx]
            # Get predictions from all models
            probs = [m.predict_proba(X_pool)[:, 1] for m in self.models]
            preds = [(p >= 0.5).astype(int) for p in probs]

            candidates = {0: [], 1: [], 2: []}
            pairs = [(0, 1, 2), (0, 2, 1), (1, 2, 0)] # (Teacher1, Teacher2, Student)
            
            for i, j, k in pairs:
                # Where do Teachers (i, j) agree?
                agree_idx = np.where(preds[i] == preds[j])[0]
                if len(agree_idx) == 0: continue
                
                # Calculate Confidence (Distance from 0.5)
                conf_i = np.abs(probs[i][agree_idx] - 0.5)
                conf_j = np.abs(probs[j][agree_idx] - 0.5)
                min_conf = np.minimum(conf_i, conf_j)
                
                # Filter by Floor & Rank by Confidence
                valid_mask = min_conf >= self.min_conf_floor
                valid_idx = agree_idx[valid_mask]
                valid_conf = min_conf[valid_mask]
                
                sorted_idx = np.argsort(-valid_conf)[:self.top_k]
                top_indices = valid_idx[sorted_idx]
                
                # Store Pseudo-Labels for Student k
                for loc_idx in top_indices:
                    g_idx = pool_idx[loc_idx]
                    lbl = preds[i][loc_idx]
                    candidates[k].append((g_idx, lbl))

            # Update Models with New Pseudo-Labels
            added = 0
            to_remove = set()
            for k in range(self.n_models):
                if not candidates[k]: continue
                unique = {x[0]: x[1] for x in candidates[k]}
                if not unique: continue
                
                new_X = X_u[list(unique.keys())]
                new_y = np.array(list(unique.values()))
                
                self.train_data[k][0] = np.vstack([self.train_data[k][0], new_X])
                self.train_data[k][1] = np.concatenate([self.train_data[k][1], new_y])
                
                # Retrain & Recalibrate
                base = self.base_builder(seed=self.random_state + iteration)
                base.fit(self.train_data[k][0], self.train_data[k][1])
                cal = CalibratedClassifierCV(base, cv='prefit', method='isotonic')
                cal.fit(X_cal, y_cal)
                self.models[k] = cal
                
                added += len(new_X)
                to_remove.update(unique.keys())
            
            mask[list(to_remove)] = False
            print(f"  Added {added} pseudo-labels. Remaining Pool: {mask.sum()}")
            if added == 0: break

    def predict_proba(self, X):
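        # Ensemble average of the positive-class probability.
        # Note: returns a 1-D array of P(class=1), not scikit-learn's (n_samples, 2) matrix.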
        probs = np.array([m.predict_proba(X)[:, 1] for m in self.models])
        return np.mean(probs, axis=0)

### **2. Data Loading & Hybrid Splitting**
This is the most critical step. We construct the training sets to include the "Drift Bridge."

**The Splits:**
* **Mon-Thu Data:** 30% goes to Labeled Seed, 70% goes to Unlabeled Pool.
* **Friday Data:** 15% goes to Labeled Seed (The Bridge), 85% goes to Hidden Test.
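
The split proportions above can be checked on synthetic data before running the real cell. The sketch below assumes made-up arrays (1,000 Mon-Thu rows, 200 Friday rows, 10% attacks); all variable names and sizes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Synthetic stand-ins (assumption): 4 random features, 10% attack rate
X_mt, y_mt = rng.rand(1000, 4), np.repeat([0, 1], [900, 100])
X_fr, y_fr = rng.rand(200, 4), np.repeat([0, 1], [180, 20])

# Mon-Thu: 30% labeled seed, 70% unlabeled pool (pool labels are discarded)
X_mt_lbl, X_mt_pool, y_mt_lbl, _ = train_test_split(
    X_mt, y_mt, test_size=0.70, stratify=y_mt, random_state=0
)
# Friday: 15% drift bridge, 85% hidden test
X_fr_lbl, X_fr_test, y_fr_lbl, y_fr_test = train_test_split(
    X_fr, y_fr, test_size=0.85, stratify=y_fr, random_state=0
)

# Hybrid seed = Mon-Thu seed + Friday bridge
X_seed = np.vstack([X_mt_lbl, X_fr_lbl])
y_seed = np.concatenate([y_mt_lbl, y_fr_lbl])
print(len(X_seed), len(X_mt_pool), len(X_fr_test))
```

Stratifying on the label keeps the attack ratio roughly equal across the seed, the pool, and the test set, which matters when attacks are a small minority.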

In [3]:
print("="*80)
print(" FINAL: HYBRID SSL + DRIFT ADAPTATION")
print("="*80)

# 1. Load Data
train_df = pd.read_parquet(os.path.join(PROCESSED_DIR, "train_mon_thu.parquet"))
test_df = pd.read_parquet(os.path.join(PROCESSED_DIR, "test_friday.parquet"))

drop_cols = {'label', 'label_str', 'timestamp', 'source_file', 'src_ip', 'dst_ip'}
feature_cols = [c for c in train_df.columns if c not in drop_cols]

X_mon_thu = train_df[feature_cols].values
y_mon_thu = train_df['label'].values
X_fri = test_df[feature_cols].values
y_fri = test_df['label'].values

# 2. CREATE THE HYBRID SPLITS (The Magic Step)
print("\n[2] Constructing Hybrid Datasets...")

# A. Split Mon-Thu: 30% Seed, 70% Unlabeled Pool
X_mt_lbl, X_mt_unlab, y_mt_lbl, y_mt_unlab = train_test_split(
    X_mon_thu, y_mon_thu, test_size=0.70, stratify=y_mon_thu, random_state=RANDOM_STATE
)

# B. Split Friday: 15% Seed (Drift Teach), 85% Test (Hidden)
X_fri_lbl, X_fri_test, y_fri_lbl, y_fri_test = train_test_split(
    X_fri, y_fri, test_size=0.85, stratify=y_fri, random_state=RANDOM_STATE
)

# C. Combine Seeds (Mon-Thu + Fri Slice)
X_seed = np.vstack([X_mt_lbl, X_fri_lbl])
y_seed = np.concatenate([y_mt_lbl, y_fri_lbl])

# D. Create Calibration Set (10% of the Seed)
# We hold out a small chunk of the seed to calibrate probabilities (important for Tri-Training)
X_train_seed, X_cal, y_train_seed, y_cal = train_test_split(
    X_seed, y_seed, test_size=0.10, stratify=y_seed, random_state=RANDOM_STATE
)

print(f"  Labeled Seed (Total):   {len(X_seed):,}")
print(f"     ├─ From Mon-Thu:     {len(X_mt_lbl):,}")
print(f"     └─ From Friday:      {len(X_fri_lbl):,} (The Drift Bridge)")
print(f"  Unlabeled Pool:         {len(X_mt_unlab):,} (Mon-Thu Only)")
print(f"  Test Set (Friday):      {len(X_fri_test):,}")

 FINAL: HYBRID SSL + DRIFT ADAPTATION

[2] Constructing Hybrid Datasets...
  Labeled Seed (Total):   743,735
     ├─ From Mon-Thu:     638,249
     └─ From Friday:      105,486 (The Drift Bridge)
  Unlabeled Pool:         1,489,249 (Mon-Thu Only)
  Test Set (Friday):      597,759


### **3. Training & Evaluation**
We now run the Tri-Training loop. After training, we compute a **Safety Threshold** on the training seed: the highest score cut-off that still yields at least 99.9% Recall there. We then apply that threshold unchanged to the unseen Test Set.
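
As a rough illustration of how such a threshold can be derived from a precision-recall curve, consider the toy sketch below. The helper name `threshold_for_recall`, the fallback value, and the toy scores are assumptions for demonstration; the actual cell works on the full training seed.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, scores, target_recall=0.999, fallback=0.001):
    """Highest score threshold whose recall is still >= target_recall.

    Hypothetical helper for illustration only.
    """
    _, rec, thresh = precision_recall_curve(y_true, scores)
    # rec has one more entry than thresh (a final 0.0); drop it to align the two
    ok = np.where(rec[:-1] >= target_recall)[0]
    # Recall never increases as the threshold rises, so the last hit is the highest
    return float(thresh[ok[-1]]) if len(ok) > 0 else fallback

# Toy data (assumption): the lowest-scoring attack sits at 0.35
y_toy = np.array([0, 0, 1, 0, 1, 1])
s_toy = np.array([0.1, 0.2, 0.35, 0.4, 0.8, 0.9])
t = threshold_for_recall(y_toy, s_toy, target_recall=1.0)
print(f"Threshold: {t}")
```

With a target of 100% recall, the threshold lands on the score of the weakest true attack. Any higher cut-off would miss it.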

In [5]:
# 3. Train Hybrid Model
print("\n[3] Running Tri-Training on Hybrid Data...")
tri_model = RankTriTraining(
    base_builder, n_models=3, max_iter=5, top_k=MIN_TOP_K, min_conf=MIN_CONF_FLOOR
)
tri_model.fit(X_train_seed, y_train_seed, X_mt_unlab, X_cal, y_cal)

# 4. Evaluate
print("\n[4] Evaluating on Hidden Friday Test Set...")
probs = tri_model.predict_proba(X_fri_test)

# 5. Drift-Adaptive Threshold
# We compute threshold on the Training Seed (which now includes Friday examples)
probs_seed = tri_model.predict_proba(X_train_seed)
prec, rec, thresh = precision_recall_curve(y_train_seed, probs_seed)

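# Pick the highest threshold whose Recall on the Training Seed is still >= 99.9%
# (recall never increases along `thresh`, so the last qualifying index is the highest threshold)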
target_idx = np.where(rec >= 0.999)[0]
safety_thresh = thresh[target_idx[-1]] if len(target_idx) > 0 else 0.001

print(f"  Safety Threshold: {safety_thresh:.6f}")

# =============================================================================
# 6. COMPREHENSIVE METRICS & REPORTING
# =============================================================================
from sklearn.metrics import f1_score  # the other metrics are already imported in the first cell

# 1. Apply Safety Threshold
preds = (probs >= safety_thresh).astype(int)

# 2. Confusion Matrix Elements
tn, fp, fn, tp = confusion_matrix(y_fri_test, preds).ravel()

# 3. Performance Metrics
final_rec = recall_score(y_fri_test, preds)      # Ability to catch attacks
final_prec = precision_score(y_fri_test, preds)  # Trustworthiness of alerts
final_f1 = f1_score(y_fri_test, preds)           # Harmonic mean (Balance)
final_roc = roc_auc_score(y_fri_test, probs)     # Model discrimination power

# 4. Error Rates
final_fnr = fn / (fn + tp) if (fn + tp) > 0 else 0.0  # Miss rate (Critical Risk)
final_fpr = fp / (fp + tn) if (fp + tn) > 0 else 0.0  # False Alarm Rate (Noise)

# 5. Operational Metrics (SOC ROI)
# Auto-Close: Alerts below threshold are considered "Benign" and auto-closed
auto_close_mask = probs < safety_thresh
auto_close_rate = np.mean(auto_close_mask)

# Analyst Workload: Alerts above threshold sent for human review
analyst_workload_ratio = 1.0 - auto_close_rate

# --- PRINT PROFESSIONAL REPORT ---
print("\n" + "="*80)
print("FINAL RESULTS (HYBRID APPROACH)")
print("="*80)
print(f"Safety Threshold Used: {safety_thresh:.6f}")
print("-" * 80)
print(f"{'Metric':<25} | {'Score':<10} | {'Impact on SOC':<30}")
print("-" * 80)
print(f"{'Recall (Attack)':<25} | {final_rec:.2%}     | Catching Attacks (Target >99%)")
print(f"{'Precision':<25} | {final_prec:.2%}     | Reducing False Alarms")
print(f"{'F1-Score':<25} | {final_f1:.4f}     | Overall Model Quality")
print(f"{'ROC-AUC':<25} | {final_roc:.4f}     | Threshold Independence")
print("-" * 80)
print(f"{'False Negative Rate':<25} | {final_fnr:.2%}     | Missed Attacks (Critical Risk)")
print(f"{'False Positive Rate':<25} | {final_fpr:.2%}     | Alert Credibility")
print("-" * 80)
print(f"{'Auto-Close Rate':<25} | {auto_close_rate:.2%}     | Automation Benefit")
print(f"{'Analyst Workload':<25} | {analyst_workload_ratio:.2%}     | Alert Fatigue Ratio (Remaining)")
print("-" * 80)
print(f"Confusion Matrix:   TN={tn} | FP={fp} | FN={fn} | TP={tp}")

# Save Model
joblib.dump(tri_model, os.path.join(MODEL_DIR, "final_hybrid_model.joblib"))
print(f"\nModel saved to: {os.path.join(MODEL_DIR, 'final_hybrid_model.joblib')}")


[3] Running Tri-Training on Hybrid Data...
[Init] Training 3 base models on Labeled Seed...
--- Iteration 1/5 ---
  Added 3000 pseudo-labels. Remaining Pool: 1488193
--- Iteration 2/5 ---
  Added 3000 pseudo-labels. Remaining Pool: 1487147
--- Iteration 3/5 ---
  Added 3000 pseudo-labels. Remaining Pool: 1486071
--- Iteration 4/5 ---
  Added 3000 pseudo-labels. Remaining Pool: 1485005
--- Iteration 5/5 ---
  Added 3000 pseudo-labels. Remaining Pool: 1483879

[4] Evaluating on Hidden Friday Test Set...
  Safety Threshold: 0.384259

FINAL RESULTS (HYBRID APPROACH)
Safety Threshold Used: 0.384259
--------------------------------------------------------------------------------
Metric                    | Score      | Impact on SOC                 
--------------------------------------------------------------------------------
Recall (Attack)           | 99.81%     | Catching Attacks (Target >99%)
Precision                 | 99.83%     | Reducing False Alarms
F1-Score                  | 0