# Phase 1.2 — Global Statistical Calibration  `[v5.1 — 351M-row IDS Ocean]`

**Input:** `data/unified/ocean_v51/` (7 Hive-partitioned dirs, 4,195 files, 351,317,489 rows)  
**Outputs:** `artifacts/scalers/global_port_map.json` + `artifacts/preprocessors_v51.pkl`

| Step | Name | What it does |
|------|------|--------------|
| **2** | Global Port Scan | Stream all 4,195 parquet files → Counter → `global_port_map.json` |
| **3** | Stratified Reservoir | 100% capture for small partitions (<1M rows); up to 1M rows sampled from each massive partition (start/mid/end regions) |
| **4** | Multi-Block Fitting | RobustScaler (Block 1 + Block 6) · QuantileTransformer (bytes/pkts) · PowerTransformer (port rarity) |
| **5** | Artifact Sealing | `preprocessors_v51.pkl` · Top-10 rarest ports · Reservoir class distribution table |

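Step 3's start/mid/end sampling can be pictured on row indices alone. The sketch below is a minimal illustration, assuming three equal-width slices per file; the helper name `start_mid_end_indices` is hypothetical and not part of the notebook. It also shows that a file shorter than three chunks yields overlapping slices, so some rows are taken more than once.

```python
def start_mid_end_indices(n_rows, per_chunk):
    """Return the row indices taken from the start, middle, and end of a file.

    Hypothetical helper for illustration: each slice holds at most per_chunk rows.
    """
    if n_rows <= 0 or per_chunk <= 0:
        return []
    width = min(per_chunk, n_rows)
    mid_start = max(0, (n_rows - width) // 2)
    start = list(range(0, width))
    middle = list(range(mid_start, mid_start + width))
    end = list(range(max(0, n_rows - width), n_rows))
    return start + middle + end


# A 10-row file with 2-row chunks: three disjoint slices.
large_file = start_mid_end_indices(10, 2)
# A 4-row file with 2-row chunks: the slices overlap, so rows 1 and 2 repeat.
small_file = start_mid_end_indices(4, 2)

print(large_file)   # [0, 1, 4, 5, 8, 9]
print(small_file)   # [0, 1, 1, 2, 2, 3]
```

Whether repeated rows matter depends on how many files in a partition are shorter than three chunks; the sketch only makes the effect visible.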
In [1]:
import sys, os, json, time, pickle, warnings, math
from pathlib import Path
from collections import Counter

import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')

import pyarrow as pa
import pyarrow.parquet as pq

from sklearn.preprocessing import RobustScaler, QuantileTransformer, PowerTransformer

try:
    from tqdm import tqdm
except ImportError:
    def tqdm(it, **kw):
        return it

SCHEMA_VERSION = 'v5.1'

print(f"Python      : {sys.version.split()[0]}")
print(f"pandas      : {pd.__version__}")
print(f"numpy       : {np.__version__}")
print(f"pyarrow     : {pa.__version__}")
print(f"Schema ver  : {SCHEMA_VERSION}")
print("Imports OK.")

Python      : 3.13.9
pandas      : 2.2.3
numpy       : 2.1.3
pyarrow     : 23.0.0
Schema ver  : v5.1
Imports OK.


In [2]:
# ── Paths ─────────────────────────────────────────────────────────────────────
NOTEBOOK_DIR       = Path.cwd()
MAIN_DIR           = NOTEBOOK_DIR.parent
OCEAN_V51_DIR      = MAIN_DIR / 'data' / 'unified' / 'ocean_v51'
ARTIFACTS_DIR      = MAIN_DIR / 'artifacts'
SCALERS_DIR        = ARTIFACTS_DIR / 'scalers'
SCALERS_DIR.mkdir(parents=True, exist_ok=True)

PORT_MAP_PATH      = SCALERS_DIR / 'global_port_map.json'
PREPROCESSORS_PATH = ARTIFACTS_DIR / 'preprocessors_v51.pkl'

# ── Sampling Config ────────────────────────────────────────────────────────────
SMALL_PARTITION_THRESH = 1_000_000   # partitions with fewer rows → 100% capture
MASSIVE_PARTITION_CAP  = 1_000_000   # rows to sample from each massive partition
QT_N_QUANTILES         = 2_000

# ── Column Groups ──────────────────────────────────────────────────────────────
BLOCK1_COLS = [
    'univ_duration', 'univ_bytes_in', 'univ_bytes_out',
    'univ_pkts_in',  'univ_pkts_out',
]
BLOCK6_COLS = [
    'mom_mean', 'mom_stddev', 'mom_sum', 'mom_min', 'mom_max',
    'mom_rate', 'mom_srate', 'mom_drate',
    'mom_TnBPSrcIP', 'mom_TnBPDstIP',
    'mom_TnP_PSrcIP', 'mom_TnP_PDstIP',
    'mom_TnP_PerProto', 'mom_TnP_Per_Dport',
]
QT_BYTE_PKT_COLS = ['univ_bytes_in', 'univ_bytes_out', 'univ_pkts_in', 'univ_pkts_out']
PORT_COLS        = ['raw_sport', 'raw_dport']
META_COLS        = ['ubt_archetype', 'dataset_source', 'univ_specific_attack']
SAMPLE_COLS      = BLOCK1_COLS + BLOCK6_COLS + PORT_COLS + META_COLS

UBT_ARCHETYPES = [
    'NORMAL', 'SCAN', 'DOS_DDOS', 'BOTNET_C2',
    'EXPLOIT', 'BRUTE_FORCE', 'THEFT_EXFIL', 'ANOMALY',
]

# ── Verify Ocean & Collect Partition Inventory ─────────────────────────────────
assert OCEAN_V51_DIR.exists(), f"Ocean dir not found: {OCEAN_V51_DIR}"

part_dirs        = sorted([d for d in OCEAN_V51_DIR.iterdir() if d.is_dir()])
all_parquet_files = []

print(f"\n{'Partition':<20} {'#Files':>7}  {'Parquet files'}")
print("─" * 60)
for part_dir in part_dirs:
    files = sorted(part_dir.glob('*.parquet'))
    all_parquet_files.extend(files)
    print(f"  {part_dir.name:<18} {len(files):>7}")

print("─" * 60)
print(f"  {'TOTAL':<18} {len(all_parquet_files):>7}")
print(f"\nOcean root   : {OCEAN_V51_DIR}")
print(f"Scalers dir  : {SCALERS_DIR}")
print(f"Port map     : {PORT_MAP_PATH}")
print(f"Preprocessors: {PREPROCESSORS_PATH}")
print("Config OK.")


Partition             #Files  Parquet files
────────────────────────────────────────────────────────────
  ubt_archetype=BOTNET_C2    1041
  ubt_archetype=BRUTE_FORCE       8
  ubt_archetype=DOS_DDOS     537
  ubt_archetype=EXPLOIT      15
  ubt_archetype=NORMAL    1408
  ubt_archetype=SCAN    1182
  ubt_archetype=THEFT_EXFIL       4
────────────────────────────────────────────────────────────
  TOTAL                 4195

Ocean root   : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\data\unified\ocean_v51
Scalers dir  : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts\scalers
Port map     : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts\scalers\global_port_map.json
Preprocessors: c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts\preprocessors_

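The partition names in the inventory follow the Hive `key=value` directory convention, and Step 3 later recovers the archetype label from the directory name. A minimal sketch of that parsing, assuming a single `key=value` pair per directory level; `parse_hive_partition` is a hypothetical helper name, and the `(None, name)` fallback is an illustrative choice rather than the notebook's behavior:

```python
from pathlib import PurePath


def parse_hive_partition(dir_path):
    """Split a Hive-style directory name 'key=value' into (key, value).

    Returns (None, name) when the name contains no '=' (illustrative fallback).
    """
    name = PurePath(dir_path).name
    key, sep, value = name.partition("=")
    if not sep:
        return None, name
    return key, value


example = parse_hive_partition("data/unified/ocean_v51/ubt_archetype=BOTNET_C2")
print(example)   # ('ubt_archetype', 'BOTNET_C2')
```

Using `str.partition` splits on the first `=` only, so a value that itself contains `=` stays intact.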
In [3]:
# ══════════════════════════════════════════════════════════════════════════════
# STEP 2 — Global Port Scan
#   Stream all 4,195 parquet files, reading ONLY raw_sport + raw_dport.
#   Save Counter results → global_port_map.json (cache-aware).
# ══════════════════════════════════════════════════════════════════════════════

def run_global_port_scan(parquet_files, force_rerun=False):
    """Stream every parquet file and count raw_sport / raw_dport occurrences."""
    if PORT_MAP_PATH.exists() and not force_rerun:
        print(f"[CACHE HIT] Loading existing port map from:\n  {PORT_MAP_PATH}")
        with open(PORT_MAP_PATH, 'r') as f:
            data = json.load(f)
        print(f"  total_rows : {data['total_rows']:,}")
        print(f"  unique sport: {len(data['sport']):,}   unique dport: {len(data['dport']):,}")
        return data

    print(f"[PORT SCAN] Streaming {len(parquet_files):,} parquet files …")
    sport_ctr  = Counter()
    dport_ctr  = Counter()
    total_rows = 0
    n          = len(parquet_files)
    milestone  = max(1, n // 10)
    t0         = time.time()

    for i, fp in enumerate(parquet_files):
        try:
            pf  = pq.ParquetFile(fp)
            tbl = pf.read(columns=['raw_sport', 'raw_dport'])
            df  = tbl.to_pandas()
            sport_ctr.update(df['raw_sport'].dropna().astype(int).astype(str).tolist())
            dport_ctr.update(df['raw_dport'].dropna().astype(int).astype(str).tolist())
            total_rows += len(df)
        except Exception as e:
            print(f"  [WARN] {fp.name}: {e}")
            continue

        if (i + 1) % milestone == 0 or (i + 1) == n:
            elapsed = time.time() - t0
            pct     = (i + 1) / n * 100
            eta     = elapsed / (i + 1) * (n - i - 1)
            print(f"  {pct:5.1f}%  ({i+1:,}/{n:,})  rows={total_rows:,}  "
                  f"elapsed={elapsed:.0f}s  ETA={eta:.0f}s")

    port_map = {
        'sport'      : dict(sport_ctr),
        'dport'      : dict(dport_ctr),
        'total_rows' : total_rows,
    }
    with open(PORT_MAP_PATH, 'w') as f:
        json.dump(port_map, f, separators=(',', ':'))
    print(f"\n[SAVED] {PORT_MAP_PATH}")
    print(f"  total_rows : {total_rows:,}")
    print(f"  unique sport: {len(sport_ctr):,}   unique dport: {len(dport_ctr):,}")
    return port_map


# ── Run ────────────────────────────────────────────────────────────────────────
port_map       = run_global_port_scan(all_parquet_files)
TOTAL_ROWS_OCEAN = port_map['total_rows']

sport_counts_raw = {k: int(v) for k, v in port_map['sport'].items()}
dport_counts_raw = {k: int(v) for k, v in port_map['dport'].items()}

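# NOTE: "rarity" here is the port's relative frequency (count / total rows),
# so LOWER values mean RARER ports.
sport_rarity = {k: v / TOTAL_ROWS_OCEAN for k, v in sport_counts_raw.items()}
dport_rarity = {k: v / TOTAL_ROWS_OCEAN for k, v in dport_counts_raw.items()}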

# ── Top-10 most common ─────────────────────────────────────────────────────────
print(f"\n── Top-10 Most Common SOURCE Ports ─────────────────────────────")
for port, cnt in sorted(sport_counts_raw.items(), key=lambda x: -x[1])[:10]:
    print(f"  port {port:>6}  count={cnt:>12,}  freq={sport_rarity[port]:.6f}")

print(f"\n── Top-10 Most Common DEST Ports ───────────────────────────────")
for port, cnt in sorted(dport_counts_raw.items(), key=lambda x: -x[1])[:10]:
    print(f"  port {port:>6}  count={cnt:>12,}  freq={dport_rarity[port]:.6f}")

[PORT SCAN] Streaming 4,195 parquet files …
   10.0%  (419/4,195)  rows=26,047,012  elapsed=7s  ETA=67s
   20.0%  (838/4,195)  rows=46,259,083  elapsed=14s  ETA=56s
   30.0%  (1,257/4,195)  rows=83,245,228  elapsed=25s  ETA=58s
   40.0%  (1,676/4,195)  rows=99,306,431  elapsed=31s  ETA=46s
   49.9%  (2,095/4,195)  rows=100,991,639  elapsed=32s  ETA=32s
   59.9%  (2,514/4,195)  rows=109,540,368  elapsed=35s  ETA=24s
   69.9%  (2,933/4,195)  rows=130,131,836  elapsed=41s  ETA=18s
   79.9%  (3,352/4,195)  rows=181,309,256  elapsed=57s  ETA=14s
   89.9%  (3,771/4,195)  rows=276,819,460  elapsed=91s  ETA=10s
   99.9%  (4,190/4,195)  rows=351,226,310  elapsed=115s  ETA=0s
  100.0%  (4,195/4,195)  rows=351,317,489  elapsed=115s  ETA=0s

[SAVED] c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts\scalers\global_port_map.json
  total_rows : 351,317,489
  unique sport: 65,537   unique dport: 65,537

── Top-10 Most Common SOURCE Ports ──

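The port scan reduces to counting string keys and dividing by the total row count. The sketch below replays that on made-up data; `relative_frequencies` is a hypothetical helper, and the toy port list is invented purely for illustration. Ports are kept as strings because JSON object keys must be strings.

```python
from collections import Counter


def relative_frequencies(counter, total_rows):
    """Map each port to count / total_rows (lower value = rarer port)."""
    return {port: count / total_rows for port, count in counter.items()}


toy_ports = ["443", "443", "443", "53", "53", "6667"]   # made-up sample
counts = Counter(toy_ports)
freq = relative_frequencies(counts, total_rows=len(toy_ports))

# Sort ascending by frequency to list the rarest ports first.
rarest_first = sorted(freq, key=freq.get)
# A port never seen in the scan falls back to 0.0 on lookup.
unseen = freq.get("31337", 0.0)

print(rarest_first)   # ['6667', '53', '443']
print(unseen)         # 0.0
```

The `0.0` default mirrors how Step 4 treats ports that are absent from the map: they are masked out before fitting.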
In [5]:
# ══════════════════════════════════════════════════════════════════════════════
# STEP 3 — Stratified Reservoir Sampling
#   Small partitions (< 1M rows)  → 100% capture
#   Massive partitions (≥ 1M rows) → up to 1,000,000 rows via start/mid/end
#   (fewer when files are too small to fill their per-file budget)
# ══════════════════════════════════════════════════════════════════════════════

def _get_avail_cols(fp, wanted):
    """Return the intersection of wanted columns with those actually in the file."""
    schema = pq.read_schema(fp)
    avail  = set(schema.names)
    return [c for c in wanted if c in avail]


def _read_parquet_safe(fp, columns=None):
    """Read a parquet file, silently dropping any requested cols that don't exist."""
    try:
        cols = _get_avail_cols(fp, columns) if columns else None
        tbl  = pq.ParquetFile(fp).read(columns=cols)
        return tbl.to_pandas()
    except Exception as e:
        print(f"  [WARN] read failed {fp.name}: {e}")
        return pd.DataFrame()


def _sub_sample_file(df, per_chunk):
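    """
    Take up to per_chunk rows from each of the start, middle, and end of df.
    No drop_duplicates is applied: raw row slices are kept to preserve the
    statistical distribution.
    Caveat: when len(df) < 3 * per_chunk the three slices overlap, so the
    overlapping rows appear more than once in the result.
    """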
    n = len(df)
    if n == 0 or per_chunk <= 0:
        return df.iloc[:0]
    pc  = min(per_chunk, n)
    mid = max(0, (n - pc) // 2)
    parts = [
        df.iloc[:pc],
        df.iloc[mid: mid + pc],
        df.iloc[max(0, n - pc):],
    ]
    return pd.concat(parts, ignore_index=True)


def _fast_row_count(fp):
    """Get row count using parquet footer metadata (no data read)."""
    try:
        return pq.read_metadata(fp).num_rows
    except Exception:
        return 0


def sample_partition(part_dir, target=MASSIVE_PARTITION_CAP, columns=None):
    """
    Build a stratified sample from one partition directory.
    Injects ubt_archetype from directory name (Hive partition key not stored in files).
    Returns a DataFrame with numeric columns forced to float32.
    """
    # Extract archetype from Hive-style dir name: `ubt_archetype=BOTNET_C2`
    dir_name   = part_dir.name
    archetype  = dir_name.split('=', 1)[1] if '=' in dir_name else dir_name

    # Build column list — exclude ubt_archetype since it's a partition key, not in files
    base_cols  = columns or SAMPLE_COLS
    file_cols  = [c for c in base_cols if c != 'ubt_archetype']

    files  = sorted(part_dir.glob('*.parquet'))
    label  = archetype

    if not files:
        print(f"  [{label}] No parquet files found — skipping.")
        return pd.DataFrame()

    # Fast total row count via footer metadata
    file_rows  = [(fp, _fast_row_count(fp)) for fp in files]
    total_rows = sum(r for _, r in file_rows)

    # ── Small partition: 100% ──────────────────────────────────────────────────
    if total_rows < SMALL_PARTITION_THRESH:
        print(f"  [{label}] SMALL  total={total_rows:,}  → reading 100%")
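        chunks = [_read_parquet_safe(fp, file_cols) for fp, _ in file_rows]
        chunks = [c for c in chunks if len(c)]
        # pd.concat raises ValueError on an empty list, so fall back to an empty frame
        df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()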

    # ── Massive partition: start / middle / end regions ───────────────────────
    else:
        budget_per_region = target // 3
        print(f"  [{label}] MASSIVE total={total_rows:,}  → sampling {target:,} rows "
              f"({budget_per_region:,}/region×3)")

        n_files = len(file_rows)
        r1_end  = n_files // 3
        r2_end  = 2 * n_files // 3

        regions = [
            ('start', file_rows[:r1_end]),
            ('mid',   file_rows[r1_end:r2_end]),
            ('end',   file_rows[r2_end:]),
        ]

        sampled_regions = []
        for region_name, region_files in regions:
            if not region_files:
                continue
            # Budget per file within this region
            per_file_rows = max(1, budget_per_region // max(1, len(region_files)))
            per_chunk     = max(1, per_file_rows // 3)

            region_chunks  = []
            region_count   = 0
            for fp, _ in region_files:
                if region_count >= budget_per_region:
                    break
                chunk = _read_parquet_safe(fp, file_cols)
                if len(chunk) == 0:
                    continue
                sub       = _sub_sample_file(chunk, per_chunk)
                remaining = budget_per_region - region_count
                sub       = sub.iloc[:remaining]
                region_chunks.append(sub)
                region_count += len(sub)

            if region_chunks:
                sampled_regions.append(pd.concat(region_chunks, ignore_index=True))

        df = pd.concat(sampled_regions, ignore_index=True) if sampled_regions else pd.DataFrame()

    # ── Inject Hive partition key ──────────────────────────────────────────────
    if len(df):
        df['ubt_archetype'] = archetype

    # ── Force float32 on all numeric columns ──────────────────────────────────
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].astype(np.float32)

    print(f"    → sampled  {len(df):,} rows")
    return df


# ── Build Reservoir ────────────────────────────────────────────────────────────
print("=" * 65)
print("STRATIFIED RESERVOIR SAMPLING")
print("=" * 65)
t_res = time.time()

reservoir_parts = []
for part_dir in part_dirs:
    reservoir_parts.append(sample_partition(part_dir))

reservoir = pd.concat([r for r in reservoir_parts if len(r)], ignore_index=True)
print(f"\nReservoir build time : {time.time() - t_res:.1f}s")
print(f"Reservoir total rows : {len(reservoir):,}")
print(f"Reservoir columns    : {len(reservoir.columns)}")

# ── Class Distribution Table ───────────────────────────────────────────────────
print(f"\n── Reservoir Class Distribution ────────────────────────────────────")
print(f"  {'Archetype':<18} {'Rows':>10}   {'%':>6}")
print("  " + "─" * 38)
if 'ubt_archetype' in reservoir.columns:
    vc = reservoir['ubt_archetype'].value_counts()
    for arch in UBT_ARCHETYPES:
        cnt = vc.get(arch, 0)
        pct = cnt / len(reservoir) * 100 if len(reservoir) else 0
        print(f"  {arch:<18} {cnt:>10,}   {pct:>6.2f}%")
    print("  " + "─" * 38)
    print(f"  {'TOTAL':<18} {len(reservoir):>10,}   100.00%")
else:
    print("  [WARN] 'ubt_archetype' column not found in reservoir.")

STRATIFIED RESERVOIR SAMPLING
  [BOTNET_C2] MASSIVE total=61,556,313  → sampling 1,000,000 rows (333,333/region×3)
    → sampled  796,509 rows
  [BRUTE_FORCE] MASSIVE total=1,718,568  → sampling 1,000,000 rows (333,333/region×3)
    → sampled  999,996 rows
  [DOS_DDOS] MASSIVE total=32,665,331  → sampling 1,000,000 rows (333,333/region×3)
    → sampled  581,004 rows
  [EXPLOIT] MASSIVE total=2,635,460  → sampling 1,000,000 rows (333,333/region×3)
    → sampled  882,216 rows
  [NORMAL] MASSIVE total=31,657,548  → sampling 1,000,000 rows (333,333/region×3)
    → sampled  650,604 rows
  [SCAN] MASSIVE total=221,084,172  → sampling 1,000,000 rows (333,333/region×3)
    → sampled  999,492 rows
  [THEFT_EXFIL] SMALL  total=97  → reading 100%
    → sampled  97 rows

Reservoir build time : 39.8s
Reservoir total rows : 4,909,918
Reservoir columns    : 24

── Reservoir Class Distribution ────────────────────────────────────
  Archetype                Rows        %
  ─────────────────────────────

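The sampled counts fall short of 1,000,000 partly because of integer division in the region, file, and chunk budgets. The sketch below replays only that arithmetic, assuming every file holds at least three chunks of rows (files that are smaller reduce the total further). `planned_rows` is a hypothetical helper, not a function from the notebook.

```python
def planned_rows(n_files, target=1_000_000):
    """Upper bound on sampled rows implied by the region/file/chunk budgets.

    Assumes each file is large enough to supply three full chunks.
    """
    budget_per_region = target // 3
    r1_end = n_files // 3
    r2_end = 2 * n_files // 3
    region_sizes = [r1_end, r2_end - r1_end, n_files - r2_end]

    total = 0
    for files_in_region in region_sizes:
        if files_in_region == 0:
            continue
        per_file_rows = max(1, budget_per_region // files_in_region)
        per_chunk = max(1, per_file_rows // 3)
        # Each file contributes three chunks; the region is capped at its budget.
        total += min(budget_per_region, files_in_region * 3 * per_chunk)
    return total


print(planned_rows(8))   # 999996
```

For the 8-file BRUTE_FORCE partition this gives 999,996, which matches the logged figure. The larger gaps on partitions such as DOS_DDOS or NORMAL are consistent with many files being smaller than their per-file budget, although the log alone does not confirm that.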
In [6]:
# ══════════════════════════════════════════════════════════════════════════════
# STEP 4 — Multi-Block Fitting
#   A. Block 1 & Block 6  → RobustScaler(quantile_range=(5, 95))  after log1p
#   B. Bytes / Pkts cols  → QuantileTransformer(n_quantiles=2000, output_distribution='uniform')
#   C. Port rarity values → PowerTransformer(yeo-johnson)
# ══════════════════════════════════════════════════════════════════════════════

def _to_float64_valid(reservoir, col, sentinel=-1.0, clip_min=None):
    """
    Return a clean 1-D float64 array for `col`:
      • Drop NaN
      • Drop sentinel rows (default -1)
      • Optionally drop values below clip_min (rows are filtered out, not clipped)
    """
    if col not in reservoir.columns:
        return np.array([], dtype=np.float64)
    s = reservoir[col].astype(np.float64)
    s = s.dropna()
    s = s[s != sentinel]
    if clip_min is not None:
        s = s[s >= clip_min]
    return s.values


print("=" * 65)
print("STEP 4 — MULTI-BLOCK FITTING")
print("=" * 65)

# ── A1. Block 1 — RobustScaler(5,95) after log1p ──────────────────────────────
print("\n[A1] Block 1 — RobustScaler(quantile_range=(5,95)) + log1p")
block1_scalers = {}
for col in BLOCK1_COLS:
    vals = _to_float64_valid(reservoir, col, sentinel=-1.0, clip_min=0.0)
    if len(vals) < 10:
        print(f"  [SKIP] {col} — too few valid rows ({len(vals)})")
        continue
    X   = np.log1p(vals).reshape(-1, 1)
    rs  = RobustScaler(quantile_range=(5, 95))
    rs.fit(X)
    block1_scalers[col] = rs
    print(f"  {col:<28}  n={len(vals):>9,}  center={rs.center_[0]:.4f}  scale={rs.scale_[0]:.4f}")

# ── A2. Block 6 — RobustScaler(5,95) after log1p + shift ─────────────────────
print("\n[A2] Block 6 — RobustScaler(quantile_range=(5,95)) + log1p+shift")
block6_scalers = {}
for col in BLOCK6_COLS:
    vals = _to_float64_valid(reservoir, col, sentinel=-1.0)
    if len(vals) < 10:
        print(f"  [SKIP] {col} — too few valid rows ({len(vals)})")
        continue
    shift = float(max(0.0, -vals.min()) + 1e-6)   # guarantees vals + shift >= 1e-6 (log1p itself only needs > -1)
    X     = np.log1p(vals + shift).reshape(-1, 1)
    rs    = RobustScaler(quantile_range=(5, 95))
    rs.fit(X)
    block6_scalers[col] = {'scaler': rs, 'shift': shift}
    print(f"  {col:<28}  n={len(vals):>9,}  shift={shift:.4e}  center={rs.center_[0]:.4f}")

# ── B. QuantileTransformer on bytes / pkts ─────────────────────────────────────
print(f"\n[B] QuantileTransformer(n_quantiles={QT_N_QUANTILES}, output='uniform') on bytes/pkts")
qt_byte_pkt = {}
for col in QT_BYTE_PKT_COLS:
    vals = _to_float64_valid(reservoir, col, sentinel=-1.0, clip_min=0.0)
    if len(vals) < QT_N_QUANTILES:
        print(f"  [SKIP] {col} — only {len(vals)} valid rows (need ≥ {QT_N_QUANTILES})")
        continue
    qt = QuantileTransformer(n_quantiles=QT_N_QUANTILES, output_distribution='uniform',
                             subsample=int(2e6), random_state=42)
    qt.fit(vals.reshape(-1, 1))
    qt_byte_pkt[col] = qt
    print(f"  {col:<28}  n={len(vals):>9,}  quantiles={qt.n_quantiles_}")

# ── C. Port Rarity → PowerTransformer (yeo-johnson) ──────────────────────────
print("\n[C] Port Rarity → PowerTransformer(yeo-johnson, standardize=True)")

def _fit_port_rarity_pt(reservoir, port_col, rarity_dict):
    """Map reservoir port values through rarity_dict, then fit PowerTransformer."""
    if port_col not in reservoir.columns:
        print(f"  [SKIP] {port_col} — column missing")
        return None
    ports    = reservoir[port_col].dropna().astype(int).astype(str)
    rarities = ports.map(lambda p: rarity_dict.get(p, 0.0)).values.reshape(-1, 1)
    rarities = rarities.astype(np.float64)
    # keep only non-zero (zero means port unseen in full scan → masked)
    rarities = rarities[rarities[:, 0] > 0].reshape(-1, 1)
    if len(rarities) < 10:
        print(f"  [SKIP] {port_col} — too few non-zero rarity values ({len(rarities)})")
        return None
    pt = PowerTransformer(method='yeo-johnson', standardize=True)
    pt.fit(rarities)
    print(f"  {port_col:<12}  n={len(rarities):>9,}  lambda={pt.lambdas_[0]:.4f}")
    return pt

pt_sport_rarity = _fit_port_rarity_pt(reservoir, 'raw_sport', sport_rarity)
pt_dport_rarity = _fit_port_rarity_pt(reservoir, 'raw_dport', dport_rarity)

print("\n[DONE] All scalers fitted.")
print(f"  block1_scalers  : {len(block1_scalers)} cols")
print(f"  block6_scalers  : {len(block6_scalers)} cols")
print(f"  qt_byte_pkt     : {len(qt_byte_pkt)} cols")
print(f"  pt_sport_rarity : {'OK' if pt_sport_rarity else 'SKIPPED'}")
print(f"  pt_dport_rarity : {'OK' if pt_dport_rarity else 'SKIPPED'}")

STEP 4 — MULTI-BLOCK FITTING

[A1] Block 1 — RobustScaler(quantile_range=(5,95)) + log1p
  univ_duration                 n=4,909,918  center=0.0000  scale=0.7947
  univ_bytes_in                 n=4,909,918  center=0.0000  scale=6.7226
  univ_bytes_out                n=4,909,918  center=0.0000  scale=7.9714
  univ_pkts_in                  n=4,909,918  center=1.0986  scale=1.3863
  univ_pkts_out                 n=4,909,918  center=0.0000  scale=2.0794

[A2] Block 6 — RobustScaler(quantile_range=(5,95)) + log1p+shift
  mom_mean                      n=   29,533  shift=1.0000e-06  center=1.3058
  mom_stddev                    n=   29,533  shift=1.0000e-06  center=0.5955
  mom_sum                       n=   29,533  shift=1.0000e-06  center=2.2491
  mom_min                       n=   29,533  shift=1.0000e-06  center=0.0001
  mom_max                       n=   29,533  shift=1.0000e-06  center=1.5966
  mom_rate                      n=   29,533  shift=1.0000e-06  center=0.3429
  mom_srate       

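The Block 1 fit amounts to three operations: apply `log1p`, subtract the median, and divide by the 5th-to-95th percentile spread. The stdlib-only sketch below reproduces that arithmetic on made-up data. It assumes linear interpolation between order statistics, which may differ in detail from what `RobustScaler` does internally; `fit_log_robust` and `transform` are hypothetical helper names.

```python
import math
import statistics


def fit_log_robust(values, q_low=5, q_high=95):
    """Return (center, scale) of log1p(values): median and q_high - q_low spread."""
    logged = [math.log1p(v) for v in values]
    # 99 cut points; cuts[i - 1] is the i-th percentile (linear interpolation).
    cuts = statistics.quantiles(logged, n=100, method="inclusive")
    center = statistics.median(logged)
    scale = cuts[q_high - 1] - cuts[q_low - 1]
    return center, scale


def transform(value, center, scale):
    """Apply the fitted scaling to one raw value."""
    return (math.log1p(value) - center) / scale


toy = list(range(101))            # made-up values 0..100
center, scale = fit_log_robust(toy)

print(round(center, 4))           # log1p(50)
print(round(scale, 4))            # log1p(95) - log1p(5) = ln(16)
print(transform(50, center, scale))
```

As a reading aid for the log: `log1p(2)` is about 1.0986, so the `univ_pkts_in` center is consistent with a median of 2 packets, and a center of 0.0000 is consistent with a median of 0.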
In [7]:
# ══════════════════════════════════════════════════════════════════════════════
# STEP 5 — Artifact Sealing
#   Bundle all preprocessors into preprocessors_v51.pkl (protocol=4)
#   Print: Top-10 rarest ports + Reservoir class distribution + sizes
# ══════════════════════════════════════════════════════════════════════════════

preprocessors_v51 = {
    'schema_version'   : SCHEMA_VERSION,
    'total_rows_ocean' : TOTAL_ROWS_OCEAN,
    'reservoir_rows'   : len(reservoir),
    'qt_n_quantiles'   : QT_N_QUANTILES,
    # scalers
    'block1_scalers'   : block1_scalers,
    'block6_scalers'   : block6_scalers,
    'qt_byte_pkt'      : qt_byte_pkt,
    # port rarity
    'sport_rarity_map' : sport_rarity,
    'dport_rarity_map' : dport_rarity,
    'pt_sport_rarity'  : pt_sport_rarity,
    'pt_dport_rarity'  : pt_dport_rarity,
    # column metadata
    'block1_cols'      : BLOCK1_COLS,
    'block6_cols'      : BLOCK6_COLS,
    'qt_byte_pkt_cols' : QT_BYTE_PKT_COLS,
    'port_cols'        : PORT_COLS,
}

with open(PREPROCESSORS_PATH, 'wb') as f:
    pickle.dump(preprocessors_v51, f, protocol=4)

pkl_size_mb = os.path.getsize(PREPROCESSORS_PATH) / 1e6
port_size_kb = os.path.getsize(PORT_MAP_PATH) / 1e3 if PORT_MAP_PATH.exists() else 0

print("=" * 65)
print("ARTIFACT SEALING COMPLETE")
print("=" * 65)
print(f"\n  preprocessors_v51.pkl   → {PREPROCESSORS_PATH}")
print(f"                            {pkl_size_mb:.2f} MB")
print(f"  global_port_map.json    → {PORT_MAP_PATH}")
print(f"                            {port_size_kb:.1f} KB")

# ── Top-10 Rarest SOURCE Ports ─────────────────────────────────────────────────
print(f"\n── Top-10 Rarest SOURCE Ports (lowest freq/total_rows) ──────────────")
print(f"  {'Port':>6}  {'Count':>12}  {'Freq':>12}")
print("  " + "─" * 36)
rarest_sport = sorted(sport_rarity.items(), key=lambda x: x[1])[:10]
for port, freq in rarest_sport:
    cnt = sport_counts_raw.get(port, 0)
    print(f"  {port:>6}  {cnt:>12,}  {freq:>12.2e}")

# ── Top-10 Rarest DEST Ports ───────────────────────────────────────────────────
print(f"\n── Top-10 Rarest DEST Ports (lowest freq/total_rows) ────────────────")
print(f"  {'Port':>6}  {'Count':>12}  {'Freq':>12}")
print("  " + "─" * 36)
rarest_dport = sorted(dport_rarity.items(), key=lambda x: x[1])[:10]
for port, freq in rarest_dport:
    cnt = dport_counts_raw.get(port, 0)
    print(f"  {port:>6}  {cnt:>12,}  {freq:>12.2e}")

# ── Final Reservoir Class Distribution ────────────────────────────────────────
print(f"\n── Final Reservoir Class Distribution ───────────────────────────────")
print(f"  {'Archetype':<18} {'Rows':>10}   {'%':>6}   {'Top dataset_source'}")
print("  " + "─" * 70)
if 'ubt_archetype' in reservoir.columns:
    vc = reservoir['ubt_archetype'].value_counts()
    for arch in UBT_ARCHETYPES:
        cnt = vc.get(arch, 0)
        pct = cnt / len(reservoir) * 100 if len(reservoir) else 0
        if 'dataset_source' in reservoir.columns and cnt > 0:
            top_src = (reservoir[reservoir['ubt_archetype'] == arch]['dataset_source']
                       .value_counts().index[0])
        else:
            top_src = 'N/A'
        print(f"  {arch:<18} {cnt:>10,}   {pct:>6.2f}%   {top_src}")
    print("  " + "─" * 70)
    print(f"  {'TOTAL':<18} {len(reservoir):>10,}   100.00%")

# ── Bundle Contents Summary ────────────────────────────────────────────────────
print(f"\n── Bundle Contents ───────────────────────────────────────────────────")
print(f"  Keys in preprocessors_v51:")
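for k, v in preprocessors_v51.items():
    if isinstance(v, dict):
        print(f"    {k:<22} → dict({len(v)} entries)")
    elif isinstance(v, (str, int, float)):
        # scalars: show the value itself (the old hasattr check was always True)
        print(f"    {k:<22} → {v}")
    elif isinstance(v, (list, tuple)):
        print(f"    {k:<22} → {type(v).__name__}({len(v)} items)")
    else:
        # fitted transformers, or None when a fit was skipped
        print(f"    {k:<22} → {type(v).__name__}")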

print(f"\n{'='*65}")
print(f"Phase 1.2 — Global Statistical Calibration COMPLETE")
print(f"  Ocean rows scanned : {TOTAL_ROWS_OCEAN:,}")
print(f"  Reservoir rows     : {len(reservoir):,}")
print(f"  Artifact size      : {pkl_size_mb:.2f} MB")
print(f"{'='*65}")

ARTIFACT SEALING COMPLETE

  preprocessors_v51.pkl   → c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts\preprocessors_v51.pkl
                            2.34 MB
  global_port_map.json    → c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\artifacts\scalers\global_port_map.json
                            1523.2 KB

── Top-10 Rarest SOURCE Ports (lowest freq/total_rows) ──────────────
    Port         Count          Freq
  ────────────────────────────────────
    6667           204      5.81e-07
   20000           224      6.38e-07
   10624           226      6.43e-07
   14523           230      6.55e-07
   24638           230      6.55e-07
   17209           230      6.55e-07
    6666           231      6.58e-07
   11219           232      6.60e-07
    7614           232      6.60e-07
   19386           233      6.63e-07

── Top-10 Rarest DEST Ports (lowest freq/total_row