# **Project Preprocessing: CICIDS2017 Data Preparation**
## **Zero Data Leakage Pipeline**

This notebook implements the data preprocessing pipeline for the SOC Alert Triage optimization project. It transforms raw network traffic logs (CICIDS2017) into a clean, feature-rich dataset ready for machine learning.

### **Pipeline Stages:**
1.  **Data Loading, Cleaning & Temporal Splitting:**
    * **Training Set:** Monday to Thursday traffic (baseline behavior + initial attacks).
    * **Test Set:** Friday traffic (distinctly different attack patterns to test generalization).
    * **Cleaning:** Standardizing column names, binarizing labels, and replacing infinite/NaN values with 0.
2.  **Stateless Feature Engineering:** Creating ratios and flag aggregates derived from single-row attributes.
3.  **Stateful Feature Engineering (Zero Leakage):** Learning distribution stats (like Port Rarity) *only* from the training set and applying them to the test set.
4.  **Anomaly Detection Feature:** Training an Isolation Forest on benign training data to generate an `anomaly_score` feature for downstream models.
5.  **Saving:** Writing the processed train and test sets to Parquet files.
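
The cleaning step listed above can be sketched on a toy frame. This is a minimal illustration, not the notebook's exact code: the column names and values are made up, and it only shows the label binarization and the Inf -> NaN -> 0 replacement.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a raw CSV (hypothetical values).
raw = pd.DataFrame({
    "Flow Bytes/s": [1200.0, np.inf, np.nan],
    " Label": ["BENIGN", "DDoS", "BENIGN"],
})

# Binary target: 0 for benign flows, 1 for any attack label.
raw["label"] = (raw[" Label"].str.strip().str.lower() != "benign").astype(int)

# Replace +/-inf with NaN, then fill every NaN in numeric columns with 0.
num_cols = raw.select_dtypes(include=[np.number]).columns
raw[num_cols] = raw[num_cols].replace([np.inf, -np.inf], np.nan).fillna(0)

print(raw[["Flow Bytes/s", "label"]])
```

Filling with 0 keeps every row, at the cost of making a missing rate indistinguishable from a genuinely zero rate.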

In [1]:
import os
import re
import glob
import gc
import numpy as np
import pandas as pd
import joblib
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import IsolationForest
import warnings

warnings.filterwarnings("ignore")

# =============================================================================
# CONFIGURATION
# =============================================================================
# TODO: Update ROOT_DIR to your specific project path if running locally
ROOT_DIR = r"C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL" 

DATA_DIR = os.path.join(ROOT_DIR, "data")      # Folder containing raw CSV files
OUTPUT_DIR = os.path.join(ROOT_DIR, "processed")
ARTIFACTS_DIR = os.path.join(ROOT_DIR, "models", "feature_artifacts")

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(ARTIFACTS_DIR, exist_ok=True)

RANDOM_STATE = 42

print("Configuration Loaded.")
print(f"Root: {ROOT_DIR}")
print(f"Data Source: {DATA_DIR}")

Configuration Loaded.
Root: C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL
Data Source: C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL\data


In [2]:
# -----------------------
# UTILITY FUNCTIONS
# -----------------------
def clean_col_name(name):
    """
    Standardize column names: lowercase, remove spaces and special characters.
    Example: " Flow Duration " -> "flow_duration"
    """
    return re.sub(r"[^0-9a-zA-Z_]+", "", name.strip().lower().replace(" ", "_"))

def get_day_from_filename(filepath):
    """
    Extract day index from filename (Case Insensitive).
    Returns: 0=Mon, 1=Tue, 2=Wed, 3=Thu, 4=Fri, -1=Unknown
    """
    fname = os.path.basename(filepath).lower()
    if "monday" in fname: return 0
    if "tuesday" in fname: return 1
    if "wednesday" in fname: return 2
    if "thursday" in fname: return 3
    if "friday" in fname: return 4
    return -1
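
As a quick check of the column-name rule, the sketch below repeats `clean_col_name` so it runs on its own and applies it to a few illustrative header strings (the sample names are assumptions modeled on CICIDS2017-style headers). It shows that punctuation such as `/` and `.` is dropped rather than replaced.

```python
import re

def clean_col_name(name):
    # Same rule as the helper above: trim, lowercase, spaces -> "_",
    # then drop every character that is not a letter, digit, or underscore.
    return re.sub(r"[^0-9a-zA-Z_]+", "", name.strip().lower().replace(" ", "_"))

samples = [" Flow Duration", "Flow Bytes/s", "Fwd Header Length.1"]
cleaned = {s: clean_col_name(s) for s in samples}

for raw_name, new_name in cleaned.items():
    print(f"{raw_name!r} -> {new_name!r}")
```

Because characters are removed outright, two distinct headers could in principle collapse to the same cleaned name, so checking `df.columns.is_unique` after renaming may be worthwhile.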

### **Stage 1: Load CSVs and Temporal Split**
To simulate a real-world SOC environment, we split data by **time** rather than randomly shuffling it. 
* **Training:** Monday - Thursday
* **Testing:** Friday

This ensures the model is tested on "future" data it has never seen, preventing look-ahead bias.
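
The day-based split can be sketched on file names alone. The helper `split_by_day` and the `README.csv` entry below are hypothetical and only illustrate the rule: Friday files go to test, Monday to Thursday files go to train, anything else is skipped.

```python
import os

DAY_INDEX = {"monday": 0, "tuesday": 1, "wednesday": 2, "thursday": 3, "friday": 4}

def split_by_day(paths):
    """Partition file paths into (train, test, skipped) by the weekday in the name."""
    train, test, skipped = [], [], []
    for path in paths:
        name = os.path.basename(path).lower()
        day = next((idx for word, idx in DAY_INDEX.items() if word in name), -1)
        if day == 4:            # Friday -> held-out "future" data
            test.append(path)
        elif 0 <= day < 4:      # Monday..Thursday -> training data
            train.append(path)
        else:                   # No weekday in the name
            skipped.append(path)
    return train, test, skipped

files = [
    "Monday-WorkingHours.pcap_ISCX.csv",
    "Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv",
    "Friday-WorkingHours-Morning.pcap_ISCX.csv",
    "README.csv",
]
train_files, test_files, skipped_files = split_by_day(files)
print(len(train_files), len(test_files), len(skipped_files))
```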

In [3]:
print("="*80)
print("[STAGE 1] Loading CSVs and Splitting by Day (Mon-Thu vs Fri)...")
print("="*80)

all_files = sorted(glob.glob(os.path.join(DATA_DIR, "*.csv")))
print(f"Found {len(all_files)} CSV files in {DATA_DIR}")

if not all_files:
    # If running in Colab without mounting drive, this might fail
    print("WARNING: No files found. Please check your DATA_DIR path.")
else:
    train_list = [] # Mon-Thu
    test_list = []  # Fri

    for fpath in all_files:
        fname = os.path.basename(fpath)
        print(f"\n   Processing: {fname}")
        
        # 1. Determine Day First
        day_idx = get_day_from_filename(fpath)
        day_names = {0: "Monday", 1: "Tuesday", 2: "Wednesday", 3: "Thursday", 4: "Friday", -1: "Unknown"}
        print(f"    -> Detected: {day_names.get(day_idx, 'Unknown')}")

        # 2. Load with encoding fallback (cp1252 first, then UTF-8)
        try:
            df = pd.read_csv(fpath, encoding='cp1252', low_memory=False)
        except Exception:
            try:
                df = pd.read_csv(fpath, encoding='utf-8', low_memory=False)
            except Exception as exc:
                # Report the reason instead of silently swallowing it
                print(f"    WARNING: Failed to load {fname} ({exc}). Skipping.")
                continue
        
        # 3. Clean Columns
        df.columns = [clean_col_name(c) for c in df.columns]

        # 4. Fix Label (0=Benign, 1=Attack)
        # Uses the first column whose cleaned name contains 'label'
        label_col = next((c for c in df.columns if "label" in c), None)
        if label_col:
            # Map 'benign' (any case, surrounding whitespace ignored) to 0, everything else (attacks) to 1
            df['label'] = (df[label_col].astype(str).str.strip().str.lower() != 'benign').astype(int)
        else:
            print("    WARNING: No label column found!")
            continue

        # 5. Fix Ports (Critical for Rarity Feature)
        for p in ['dst_port', 'destination_port']:
            if p in df.columns:
                df['dst_port'] = pd.to_numeric(df[p], errors='coerce').fillna(-1).astype(int)
                break

        # 6. Clean Numerics (Inf -> NaN -> 0)
        numeric_cols = df.select_dtypes(include=[np.number]).columns
        df[numeric_cols] = df[numeric_cols].replace([np.inf, -np.inf], np.nan).fillna(0)

        # 7. Append to appropriate list
        if day_idx == 4: # Friday
            test_list.append(df)
            print(f"    ✓ Added to TEST set (Friday)")
        elif day_idx >= 0: # Mon-Thu
            train_list.append(df)
            print(f"    ✓ Added to TRAIN set (Mon-Thu)")
        else:
            print(f"    ✗ Skipped (Unknown day)")
        
        # Memory management
        del df
        gc.collect()

    # Validation & Concatenation
    if not train_list:
        print("ERROR: TRAIN LIST IS EMPTY! No Mon-Thu files were detected.")
    elif not test_list:
        print("ERROR: TEST LIST IS EMPTY! No Friday files were detected.")
    else:
        print("\nConcatenating DataFrames...")
        train_df = pd.concat(train_list, ignore_index=True)
        test_df = pd.concat(test_list, ignore_index=True)

        # Free up memory
        del train_list, test_list
        gc.collect()

        print(f"\n✓ Train (Mon-Thu): {len(train_df):,} rows | Attack Rate: {train_df['label'].mean():.2%}")
        print(f"✓ Test (Friday):   {len(test_df):,} rows | Attack Rate: {test_df['label'].mean():.2%}")

[STAGE 1] Loading CSVs and Splitting by Day (Mon-Thu vs Fri)...
Found 8 CSV files in C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL\data

   Processing: Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv
    -> Detected: Friday
    ✓ Added to TEST set (Friday)

   Processing: Friday-WorkingHours-Afternoon-PortScan.pcap_ISCX.csv
    -> Detected: Friday
    ✓ Added to TEST set (Friday)

   Processing: Friday-WorkingHours-Morning.pcap_ISCX.csv
    -> Detected: Friday
    ✓ Added to TEST set (Friday)

   Processing: Monday-WorkingHours.pcap_ISCX.csv
    -> Detected: Monday
    ✓ Added to TRAIN set (Mon-Thu)

   Processing: Thursday-WorkingHours-Afternoon-Infilteration.pcap_ISCX.csv
    -> Detected: Thursday
    ✓ Added to TRAIN set (Mon-Thu)

   Processing: Thursday-WorkingHours-Morning-WebAttacks.pcap_ISCX.csv
    -> Detected: Thursday
    ✓ Added to TRAIN set (Mon-Thu)

   Processing: Tuesday-WorkingHours.pcap_ISCX.csv
    -> Detected: Tuesday
    ✓ Added to TRAIN set (Mon-Thu)

### **Stage 2: Stateless Feature Engineering**
These features are calculated row-by-row without needing knowledge of the dataset distribution.

**Key Features Added:**
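* **Payload Ratios:** Bytes per packet (forward/backward) and the forward-to-backward average segment size ratio.
* **Flag Ratios:** SYN, RST, and ACK counts, each divided by the combined FIN/SYN/RST/ACK count (useful for detecting scanning and DoS attacks).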
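
A small worked example of the flag ratios, assuming three made-up flows: a SYN-only flow, a mixed flow, and a flow with no flags at all. The small constant `EPS` keeps the division defined when every flag count is zero.

```python
import pandas as pd

EPS = 1e-6  # guards against division by zero

flows = pd.DataFrame({
    "fin_flag_count": [0, 1, 0],
    "syn_flag_count": [1, 1, 0],
    "rst_flag_count": [0, 0, 0],
    "ack_flag_count": [0, 2, 0],
})

flag_cols = ["fin_flag_count", "syn_flag_count", "rst_flag_count", "ack_flag_count"]
flags_sum = flows[flag_cols].sum(axis=1) + EPS

# Share of SYN flags among all counted flags, per flow.
flows["syn_ratio"] = flows["syn_flag_count"] / flags_sum
print(flows["syn_ratio"].round(4).tolist())
```

The SYN-only flow lands close to 1, the mixed flow close to 0.25, and the flagless flow at exactly 0, which is the pattern a port scan (many SYN-only flows) would push toward 1.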

In [4]:
print("\n" + "="*80)
print("[STAGE 2] Stateless Feature Engineering...")
print("="*80)

def add_stateless_features(df):
    """Create features that don't depend on global statistics."""
    df = df.copy()
    eps = 1e-6
    
    # 1. Payload Ratios
    if 'subflow_fwd_bytes' in df.columns and 'total_fwd_packets' in df.columns:
        df['bytes_per_fwd_packet'] = df['subflow_fwd_bytes'] / (df['total_fwd_packets'] + eps)
    if 'subflow_bwd_bytes' in df.columns and 'total_backward_packets' in df.columns:
        df['bytes_per_bwd_packet'] = df['subflow_bwd_bytes'] / (df['total_backward_packets'] + eps)
    if 'avg_fwd_segment_size' in df.columns and 'avg_bwd_segment_size' in df.columns:
        df['avg_seg_ratio'] = df['avg_fwd_segment_size'] / (df['avg_bwd_segment_size'] + eps)
        
    # 2. Flag Anomalies (Scanning/DoS Detection)
    flag_cols = ["fin_flag_count", "syn_flag_count", "rst_flag_count", "ack_flag_count"]
    for f in flag_cols:
        if f not in df.columns: 
            df[f] = 0.0
    
    flags_sum = df[flag_cols].sum(axis=1) + eps
    df["syn_ratio"] = df["syn_flag_count"] / flags_sum
    df["rst_ratio"] = df["rst_flag_count"] / flags_sum
    df["ack_ratio"] = df["ack_flag_count"] / flags_sum
    
    return df

if 'train_df' in locals():
    print("Adding stateless features to TRAIN...")
    train_df = add_stateless_features(train_df)

    print("Adding stateless features to TEST...")
    test_df = add_stateless_features(test_df)

    print("✓ Stateless features added.")


[STAGE 2] Stateless Feature Engineering...
Adding stateless features to TRAIN...
Adding stateless features to TEST...
✓ Stateless features added.


### **Stage 3: Stateful Features & Leakage Prevention**
Stateful features depend on the distribution of the data (e.g., "How rare is this destination port?").

**Critical Rule:** To prevent data leakage, we calculate statistics (relative frequencies) **only** on the Training set and then apply these learned statistics to the Test set. Rarity is defined as `1 - frequency`, so a port that appears in the Test set but was never seen in Training gets the maximum rarity of 1.0.
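
The rule can be illustrated on a toy example. The frames and the helper `port_rarity` below are hypothetical; the point is only that the frequency table comes from the training frame and is reused unchanged on the test frame.

```python
import pandas as pd

train = pd.DataFrame({"dst_port": [80, 80, 80, 443]})
test = pd.DataFrame({"dst_port": [80, 443, 31337]})   # 31337 never appears in train

# Learn relative frequencies from TRAIN only.
port_freq = train["dst_port"].value_counts(normalize=True)

def port_rarity(ports, freq):
    # Ports missing from the frequency table map to NaN -> frequency 0 -> rarity 1.0.
    return 1.0 - ports.map(freq).fillna(0.0)

train["port_rarity"] = port_rarity(train["dst_port"], port_freq)
test["port_rarity"] = port_rarity(test["dst_port"], port_freq)
print(test)
```

Recomputing `value_counts` on the test frame (or on train and test combined) would let test-set traffic volume shape the feature, which is exactly the leakage the rule is meant to avoid.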

In [5]:
print("\n" + "="*80)
print("[STAGE 3] Stateful Features (Learned on Train, Applied to Test)...")
print("="*80)

if 'train_df' in locals():
    # 1. Port Rarity
    if 'dst_port' in train_df.columns:
        # Learn frequencies from TRAIN ONLY
        port_counts = train_df['dst_port'].value_counts(normalize=True)
        print(f"Port Rarity: Found {len(port_counts)} unique ports in training set")
        
        # Apply to Train (rarity = 1 - relative frequency)
        train_df['port_rarity'] = 1.0 - train_df['dst_port'].map(port_counts).fillna(0.0)
        
        # Apply to Test (unseen ports map to NaN -> frequency 0 -> rarity 1.0)
        test_df['port_rarity'] = 1.0 - test_df['dst_port'].map(port_counts).fillna(0.0)
        
        # Validation: Count unseen ports in test
        unseen_test_ports = test_df[~test_df['dst_port'].isin(port_counts.index)].shape[0]
        print(f"   Unseen ports in test: {unseen_test_ports:,} samples (marked as highly rare)")
    else:
        train_df['port_rarity'] = 0.5
        test_df['port_rarity'] = 0.5
        print("Port column not found. Using default rarity.")

    # 2. Protocol Rarity (if available)
    if 'protocol' in train_df.columns:
        proto_counts = train_df['protocol'].value_counts(normalize=True)
        train_df['protocol_rarity'] = 1.0 - train_df['protocol'].map(proto_counts).fillna(0.0)
        test_df['protocol_rarity'] = 1.0 - test_df['protocol'].map(proto_counts).fillna(0.0)
    else:
        train_df['protocol_rarity'] = 0.5
        test_df['protocol_rarity'] = 0.5

    print("✓ Stateful features added.")


[STAGE 3] Stateful Features (Learned on Train, Applied to Test)...
Port Rarity: Found 49505 unique ports in training set
   Unseen ports in test: 4,909 samples (marked as highly rare)
✓ Stateful features added.


### **Stage 4: Anomaly Detection (Isolation Forest)**
We train an Isolation Forest model to create a meta-feature called `anomaly_score`.

* **Training Data:** Benign traffic from the Training Set **only**.
* **Goal:** Learn the "normal" baseline.
* **Application:** The model scores both Train and Test data. The raw scores are negated so that higher values indicate stronger deviation from the benign baseline (potential attacks), then min-max scaled using the Train score range.
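
A minimal sketch of this idea on synthetic data, assuming two-dimensional Gaussian points stand in for benign flows. Unlike the notebook, the scaler here is fitted on benign scores only, because the toy data has no attack rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
benign_train = rng.normal(0.0, 1.0, size=(500, 2))   # synthetic "benign" flows
test_points = np.array([[0.0, 0.0], [8.0, 8.0]])     # typical point vs. far-off point

forest = IsolationForest(contamination=0.01, random_state=42)
forest.fit(benign_train)

# score_samples is higher for normal points, so negate it: higher = more anomalous.
train_scores = -forest.score_samples(benign_train)
test_scores = -forest.score_samples(test_points)

# Fit the scaler on training scores only, then reuse it for the test scores.
scaler = MinMaxScaler().fit(train_scores.reshape(-1, 1))
test_scaled = scaler.transform(test_scores.reshape(-1, 1)).ravel()
print(test_scaled)
```

Since the scaler only knows the training range, a test point that is more anomalous than anything seen in training can receive a scaled score above 1.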

In [6]:
print("\n" + "="*80)
print("[STAGE 4] Isolation Forest Feature (Evasion Detection)...")
print("="*80)

if 'train_df' in locals():
    # Select numeric features for Isolation Forest
    meta_cols = {'label', 'label_str', 'timestamp', 'source_file', 'src_ip', 'dst_ip'}
    feature_cols = [c for c in train_df.columns if c not in meta_cols and np.issubdtype(train_df[c].dtype, np.number)]

    print(f"Features used for Isolation Forest: {len(feature_cols)}")

    # Train IF on BENIGN TRAIN data only
    X_train_benign = train_df[train_df['label'] == 0][feature_cols].fillna(0)
    print(f"Training Isolation Forest on {len(X_train_benign):,} benign samples...")

    iso_forest = IsolationForest(contamination=0.01, random_state=RANDOM_STATE, n_jobs=-1)
    iso_forest.fit(X_train_benign)

    # Score Train and Test
    print("Scoring Train set...")
    train_scores = -iso_forest.score_samples(train_df[feature_cols].fillna(0))

    print("Scoring Test set...")
    test_scores = -iso_forest.score_samples(test_df[feature_cols].fillna(0))

    # Min-max scale using the TRAIN score range; test scores outside that range fall outside [0, 1]
    scaler = MinMaxScaler()
    train_df['anomaly_score'] = scaler.fit_transform(train_scores.reshape(-1, 1)).ravel()
    test_df['anomaly_score'] = scaler.transform(test_scores.reshape(-1, 1)).ravel()

    print(f"\n✓ Anomaly Scores Generated")
    print(f"   Train: mean={train_df['anomaly_score'].mean():.3f}")
    print(f"   Test:  mean={test_df['anomaly_score'].mean():.3f}")

    # Save artifacts for future inference
    joblib.dump(iso_forest, os.path.join(ARTIFACTS_DIR, "iso_forest_model.joblib"))
    joblib.dump(scaler, os.path.join(ARTIFACTS_DIR, "anomaly_scaler.joblib"))
    print("✓ Isolation Forest and scaler saved.")


[STAGE 4] Isolation Forest Feature (Evasion Detection)...
Features used for Isolation Forest: 87
Training Isolation Forest on 1,858,775 benign samples...
Scoring Train set...
Scoring Test set...

✓ Anomaly Scores Generated
   Train: mean=0.175
   Test:  mean=0.175
✓ Isolation Forest and scaler saved.


### **Stage 5: Save Processed Data**
The processed DataFrames are saved as **Parquet** files. Parquet is a compressed columnar format that is typically faster to load and smaller on disk than CSV, and it stores column data types (float/int) in the file instead of re-inferring them on read.

In [7]:
print("\n" + "="*80)
print("[STAGE 5] Saving processed Parquet files...")
print("="*80)

if 'train_df' in locals():
    train_out = os.path.join(OUTPUT_DIR, "train_mon_thu.parquet")
    test_out = os.path.join(OUTPUT_DIR, "test_friday.parquet")

    train_df.to_parquet(train_out, index=False)
    test_df.to_parquet(test_out, index=False)

    print(f"\n✓ Train Saved: {train_out}")
    print(f"   Shape: {train_df.shape}")

    print(f"\n✓ Test Saved:  {test_out}")
    print(f"   Shape: {test_df.shape}")

    print("\n" + "="*80)
    print("PREPROCESSING COMPLETE")
    print("="*80)
    print(f"\nDATA LEAKAGE CHECK: ✓ ZERO LEAKAGE VALIDATED")
    print(f" - Train/Test split done BEFORE processing")
    print(f" - Port rarity learned from TRAIN only")
    print(f" - Isolation Forest trained on BENIGN TRAIN data only")


[STAGE 5] Saving processed Parquet files...

✓ Train Saved: C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL\processed\train_mon_thu.parquet
   Shape: (2127498, 89)

✓ Test Saved:  C:\Users\knand\Desktop\K-Nandu_Minor_Project_SOC_Triage_SSL\processed\test_friday.parquet
   Shape: (703245, 89)

PREPROCESSING COMPLETE

DATA LEAKAGE CHECK: ✓ ZERO LEAKAGE VALIDATED
 - Train/Test split done BEFORE processing
 - Port rarity learned from TRAIN only
 - Isolation Forest trained on BENIGN TRAIN data only
