# Phase 2.2 — Vector Store Ingestion  `[v5.1 — 102,505-row Knowledge Base]`

**Objective:** Ingest the distilled Knowledge Vector Base into a persistent ChromaDB vector store  
optimised for **sub-10 ms** cosine-similarity retrieval at inference time.

**Input:**
- `data/vectors/X_knowledge_vectors_v51.npy`: float32, shape *(102505, 114)*
- `data/vectors/y_knowledge_metadata_v51.parquet`: columns `ubt_archetype` · `univ_specific_attack` · `dataset_source`

**Output:** `chromadb_store_v51/` — persistent HNSW collection `ids_knowledge_base_v51`

---

## Technical Strategy

| Step | Action | Rationale |
|------|--------|----------|
| **1. L2 Normalisation** | Unit-normalise every 114-dim vector | Removes per-vector magnitude, so rows containing extreme values (raw Global Max = 169.3) are compared by direction only. This is a row-wise rescale; it does not rebalance individual features within a vector |
| **2. Cosine HNSW** | `hnsw:space = 'cosine'` | On unit vectors, cosine similarity equals the dot product (cosine distance = 1 − dot product), so ranking reflects *behavioural direction* rather than raw magnitude |
| **3. Batch = 5,000** | `collection.add()` in 5K-row batches | Keeps the raw vector payload per batch at ~2.3 MB (5000 × 114 × 4 bytes for float32) and bounds peak RAM, avoiding the OOM crashes from Phase 2.1 |
| **4. HNSW index** | Hierarchical Navigable Small World graph | Approximately O(log N) ANN search; measured warm latency is about 1 ms on 102K vectors at 114 dims, well under the 10 ms target |
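
The equivalence used in steps 1 and 2 can be checked numerically. The following is a minimal sketch on random stand-in data (not the project's vectors); `l2_normalise` and `cosine_similarity` are illustrative helper names, not functions from this notebook.

```python
import numpy as np

def l2_normalise(X: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms = np.where(norms == 0, 1.0, norms)
    return X / norms

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity on raw (un-normalised) vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 114)) * 50.0   # hypothetical data with large magnitudes
U = l2_normalise(X)

cos_raw  = cosine_similarity(X[0], X[1])  # cosine on the raw vectors
dot_unit = float(U[0] @ U[1])             # plain dot product on unit vectors
cos_dist = 1.0 - dot_unit                 # cosine distance, in [0, 2]
```

`cos_raw` and `dot_unit` agree to floating-point precision, which is why a cosine index over unit vectors ranks neighbours exactly as a dot-product index would.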

In [1]:
# ── Cell 1: Imports & Dependency Gate ─────────────────────────────────────────
import sys, os, time, warnings
from pathlib import Path

import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')

# ChromaDB — install if missing
try:
    import chromadb
    from chromadb import PersistentClient
    print(f'chromadb : {chromadb.__version__}')
except ImportError:
    print('chromadb not found — installing …')
    import subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', 'chromadb', '-q'])
    import chromadb
    from chromadb import PersistentClient
    print(f'chromadb : {chromadb.__version__}  (just installed)')

SCHEMA_VERSION = 'v5.1'
TOTAL_DIMS     = 114

print(f'Python   : {sys.version.split()[0]}')
print(f'numpy    : {np.__version__}')
print(f'pandas   : {pd.__version__}')
print(f'Schema   : {SCHEMA_VERSION}  ({TOTAL_DIMS}-dim)')
print('Imports OK.')

chromadb : 1.4.1
Python   : 3.13.9
numpy    : 2.1.3
pandas   : 2.2.3
Schema   : v5.1  (114-dim)
Imports OK.


In [2]:
# ── Cell 2: Paths ──────────────────────────────────────────────────────────────
NOTEBOOK_DIR  = Path.cwd()
MAIN_DIR      = NOTEBOOK_DIR.parent
DATA_DIR      = MAIN_DIR / 'data'
VECTORS_DIR   = DATA_DIR / 'vectors'
CHROMA_DIR    = MAIN_DIR / 'chromadb_store_v51'

VECTORS_PATH  = VECTORS_DIR / 'X_knowledge_vectors_v51.npy'
META_PATH     = VECTORS_DIR / 'y_knowledge_metadata_v51.parquet'

CHROMA_DIR.mkdir(parents=True, exist_ok=True)

COLLECTION_NAME = 'ids_knowledge_base_v51'
INGEST_BATCH    = 5_000   # rows per collection.add() call

print(f'Vectors  : {VECTORS_PATH}')
print(f'Metadata : {META_PATH}')
print(f'ChromaDB : {CHROMA_DIR}')
print(f'Collection : {COLLECTION_NAME}')
print(f'Batch size : {INGEST_BATCH:,}')
assert VECTORS_PATH.exists(), f'MISSING: {VECTORS_PATH}'
assert META_PATH.exists(),    f'MISSING: {META_PATH}'
print('Paths OK.')

Vectors  : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\data\vectors\X_knowledge_vectors_v51.npy
Metadata : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\data\vectors\y_knowledge_metadata_v51.parquet
ChromaDB : c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\chromadb_store_v51
Collection : ids_knowledge_base_v51
Batch size : 5,000
Paths OK.


In [3]:
# ── Cell 3: Data Loading & L2 Normalisation ────────────────────────────────────
#
# WHY L2 NORMALISE?
#   Phase 2.1 produced vectors with Global Max = 169.3 (bytes / packet features),
#   so raw vector lengths vary widely from row to row.  L2 normalisation projects
#   every vector onto the unit sphere, so similarity depends only on *direction*
#   (i.e. behavioural pattern), not on magnitude.  On unit vectors, cosine
#   similarity == dot product and cosine distance == 1 - dot product.
#   Note: this is a per-row rescale.  Cosine ranking is already invariant to
#   vector length, and large features still weigh more within a vector's
#   direction; balancing features would need per-feature scaling upstream.
# ─────────────────────────────────────────────────────────────────────────────────

print('Loading knowledge base …')
X_raw  = np.load(str(VECTORS_PATH))          # (102505, 114) float32
y_meta = pd.read_parquet(str(META_PATH))      # (102505, 3)

print(f'  X_raw  : {X_raw.shape}  dtype={X_raw.dtype}')
print(f'  y_meta : {y_meta.shape}  cols={list(y_meta.columns)}')
print(f'  Global Min / Max (raw) : {X_raw.min():.4f} / {X_raw.max():.4f}')

print('\nApplying L2 normalisation …')
_norms = np.linalg.norm(X_raw, axis=1, keepdims=True)   # (102505, 1)
_norms = np.where(_norms == 0, 1.0, _norms)              # guard zero-norm vectors
X_norm = (X_raw / _norms).astype(np.float32)             # unit sphere

# Sanity checks
_magnitudes       = np.linalg.norm(X_norm, axis=1)
_max_deviation    = np.abs(_magnitudes - 1.0).max()
_zero_norm_count  = int(np.sum(np.linalg.norm(X_raw, axis=1) == 0))

print(f'  X_norm : {X_norm.shape}  dtype={X_norm.dtype}')
print(f'  Post-norm range        : [{X_norm.min():.4f}, {X_norm.max():.4f}]')
print(f'  Max deviation from |v|=1  : {_max_deviation:.2e}  (should be < 1e-6)')
print(f'  Zero-norm vectors (raw)   : {_zero_norm_count}')
print(f'  Unique archetypes  : {y_meta["ubt_archetype"].nunique()}')
print(f'  Unique attacks     : {y_meta["univ_specific_attack"].nunique()}')

assert _max_deviation < 1e-6, f'L2 normalisation failed, max deviation: {_max_deviation}'
assert len(X_norm) == len(y_meta), 'Vector / metadata alignment error'
print('\nL2 normalisation verified. ✅')

Loading knowledge base …
  X_raw  : (102505, 114)  dtype=float32
  y_meta : (102505, 3)  cols=['ubt_archetype', 'univ_specific_attack', 'dataset_source']
  Global Min / Max (raw) : -2.8248 / 169.3271

Applying L2 normalisation …
  X_norm : (102505, 114)  dtype=float32
  Post-norm range        : [-0.6364, 0.9993]
  Max deviation from |v|=1  : 1.79e-07  (should be < 1e-6)
  Zero-norm vectors (raw)   : 0
  Unique archetypes  : 7
  Unique attacks     : 33

L2 normalisation verified. ✅
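
The zero-norm guard in Cell 3 did not trigger on this data (0 zero-norm rows), so its effect is easy to overlook. Here is a small sketch on a hypothetical two-row array showing what it prevents.

```python
import numpy as np

# Hypothetical toy input: the second row is all zeros.
X = np.array([[3.0, 4.0], [0.0, 0.0]], dtype=np.float32)

norms = np.linalg.norm(X, axis=1, keepdims=True)

# Without the guard, the zero row becomes 0/0 = NaN.
with np.errstate(invalid='ignore', divide='ignore'):
    unguarded = X / norms

# With the guard, the zero row is divided by 1.0 and stays all-zero.
safe_norms = np.where(norms == 0, 1.0, norms)
guarded = (X / safe_norms).astype(np.float32)
```

A guarded zero row keeps norm 0 rather than 1, so the unit-norm assertion in Cell 3 would still flag it. That seems like the desirable outcome, since an all-zero vector has no direction to compare.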


In [4]:
# ── Cell 4: ChromaDB Initialisation ───────────────────────────────────────────
#
# HNSW space = 'cosine':
#   After L2 normalisation every vector has magnitude 1, so cosine similarity
#   equals the dot product.  Each distance evaluation during graph traversal
#   costs O(dim), which keeps queries cheap on our 114-dim schema.
# ─────────────────────────────────────────────────────────────────────────────────

print(f'Initialising ChromaDB at: {CHROMA_DIR}')
client = PersistentClient(path=str(CHROMA_DIR))

# Delete existing collection to allow clean re-ingest
existing = [c.name for c in client.list_collections()]
if COLLECTION_NAME in existing:
    client.delete_collection(COLLECTION_NAME)
    print(f'  Deleted existing collection: {COLLECTION_NAME}')

collection = client.create_collection(
    name     = COLLECTION_NAME,
    metadata = {
        'hnsw:space'          : 'cosine',   # dot-product ANN after L2 norm
        'hnsw:construction_ef': 200,         # higher = better recall during build
        'hnsw:M'              : 32,          # edges/node — balances RAM vs recall
        'hnsw:search_ef'      : 100,         # candidates examined per query
        'description'         : 'RAG-IDS Knowledge Base v5.1 — 102505 medoids, 114-dim, L2-normalised',
        'schema_version'      : SCHEMA_VERSION,
        'total_dims'          : str(TOTAL_DIMS),
    }
)

print(f'  Collection created : {collection.name}')
print(f'  HNSW space         : cosine')
print(f'  HNSW M             : 32   (graph edges per node)')
print(f'  HNSW ef_construct  : 200  (build-time beam width)')
print(f'  HNSW search_ef     : 100  (query-time beam width)')
print('ChromaDB initialised. ✅')

Initialising ChromaDB at: c:\Users\suhas\OneDrive\Desktop\Capstone\RAG-IDS-Knowledge-Augmented-IoT-Threat-Detection\main_folder\chromadb_store_v51
  Collection created : ids_knowledge_base_v51
  HNSW space         : cosine
  HNSW M             : 32   (graph edges per node)
  HNSW ef_construct  : 200  (build-time beam width)
  HNSW search_ef     : 100  (query-time beam width)
ChromaDB initialised. ✅
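
HNSW is approximate, so `M`, `construction_ef` and `search_ef` trade recall against speed. One way to judge a setting is to compare the index's answers with an exact brute-force search. The sketch below uses random stand-in data; `exact_top_k` and `recall_at_k` are hypothetical helper names, and nothing here has been run against the real collection.

```python
import numpy as np

def exact_top_k(X_unit: np.ndarray, q_unit: np.ndarray, k: int) -> np.ndarray:
    """Row indices of the k smallest cosine distances to q (brute force)."""
    dists = 1.0 - X_unit @ q_unit             # valid because rows are unit length
    top = np.argpartition(dists, k - 1)[:k]   # unordered k best
    return top[np.argsort(dists[top])]        # order them by distance

def recall_at_k(approx_idx, exact_idx) -> float:
    """Fraction of the exact neighbours that the approximate search returned."""
    exact = {int(i) for i in exact_idx}
    found = {int(i) for i in approx_idx}
    return len(exact & found) / len(exact)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 114)).astype(np.float32)   # stand-in for X_norm
X /= np.linalg.norm(X, axis=1, keepdims=True)

truth = exact_top_k(X, X[7], k=10)   # row 7 is its own nearest neighbour
```

With the real store, the approximate indices could be taken from `collection.query(...)['ids'][0]` by stripping the `kb_` prefix used in Cell 5, then passed to `recall_at_k` alongside `truth`.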


In [5]:
# ── Cell 5: Resource-Safe Batch Ingestion ──────────────────────────────────────
#
# Each batch of 5,000 × 114 float32 = 2.28 MB of raw vector data.
# ChromaDB serialises + indexes each batch before requesting the next,
# so peak RAM per batch is bounded — avoids the OOM pattern from Phase 2.1.
# Total ingestion memory footprint is dominated by the HNSW graph (~200 MB)
# not by the batch buffer.
# ─────────────────────────────────────────────────────────────────────────────────

try:
    from tqdm.auto import tqdm
except ImportError:
    def tqdm(it, **kw): return it

N_TOTAL  = len(X_norm)
n_batches = (N_TOTAL + INGEST_BATCH - 1) // INGEST_BATCH

print(f'Ingesting {N_TOTAL:,} vectors in {n_batches} batches of {INGEST_BATCH:,} …')
print(f'  Per-batch RAM (vectors only) : {INGEST_BATCH * TOTAL_DIMS * 4 / 1e6:.2f} MB')

t_ingest = time.time()
rows_added = 0

for batch_idx in tqdm(range(n_batches), desc='Ingesting', unit='batch'):
    lo = batch_idx * INGEST_BATCH
    hi = min(lo + INGEST_BATCH, N_TOTAL)

    batch_vecs  = X_norm[lo:hi].tolist()          # ChromaDB expects list[list[float]]
    batch_meta  = y_meta.iloc[lo:hi]

    batch_ids = [f'kb_{i}' for i in range(lo, hi)]

    batch_documents = [
        f"{row['ubt_archetype']}|{row['univ_specific_attack']}|{row['dataset_source']}"
        for _, row in batch_meta.iterrows()
    ]

    batch_metadatas = [
        {
            'ubt_archetype'        : str(row['ubt_archetype']),
            'univ_specific_attack' : str(row['univ_specific_attack']),
            'dataset_source'       : str(row['dataset_source']),
            'kb_index'             : int(lo + j),
        }
        for j, (_, row) in enumerate(batch_meta.iterrows())
    ]

    collection.add(
        ids        = batch_ids,
        embeddings = batch_vecs,
        documents  = batch_documents,
        metadatas  = batch_metadatas,
    )
    rows_added += (hi - lo)

t_elapsed = time.time() - t_ingest
final_count = collection.count()

print(f'\nIngestion complete.')
print(f'  Rows added           : {rows_added:,}')
print(f'  Collection count     : {final_count:,}')
print(f'  Total time           : {t_elapsed:.1f}s  ({t_elapsed/60:.1f} min)')
print(f'  Throughput           : {rows_added/t_elapsed:,.0f} rows/s')
assert final_count == N_TOTAL, f'COUNT MISMATCH: {final_count} != {N_TOTAL}'
print('Count verified. ✅')

Ingesting 102,505 vectors in 21 batches of 5,000 …
  Per-batch RAM (vectors only) : 2.28 MB


Ingestion complete.
  Rows added           : 102,505
  Collection count     : 102,505
  Total time           : 26.5s  (0.4 min)
  Throughput           : 3,870 rows/s
Count verified. ✅
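
The batch arithmetic from Cell 5 can be reproduced in isolation. This sketch assumes 64-bit CPython for the size of a Python float; `batch_bounds` is an illustrative name and the list-size figure is a rough estimate that ignores list-object overhead.

```python
import sys

def batch_bounds(n_total: int, batch_size: int):
    """Yield (lo, hi) half-open row ranges that together cover 0..n_total."""
    for lo in range(0, n_total, batch_size):
        yield lo, min(lo + batch_size, n_total)

bounds = list(batch_bounds(102_505, 5_000))   # 20 full batches + 1 partial

# Raw float32 payload of one full batch: 5000 rows x 114 dims x 4 bytes.
raw_mb = 5_000 * 114 * 4 / 1e6

# After .tolist() every value is a separate Python float object plus an
# 8-byte list slot, so the in-flight batch is several times larger.
list_mb = 5_000 * 114 * (sys.getsizeof(0.0) + 8) / 1e6
```

The 2.28 MB figure printed by the notebook is therefore a lower bound on what a batch occupies while it is being handed to `collection.add()`; either way it stays small compared with the index itself.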


In [6]:
# ── Cell 6: Latency Benchmarking & Fidelity Check ─────────────────────────────
#
# Simulates a real-time IDS detection query:
#   1. Pick a random probe vector from the knowledge base (warm path)
#   2. Query collection.query() — HNSW ANN search — for top-10 nearest neighbours
#   3. Measure wall-clock latency (cold + warm runs)
#   4. Inspect top-3 returned metadata for semantic fidelity
# ─────────────────────────────────────────────────────────────────────────────────

SEP   = '=' * 65
SEP2  = '-' * 65
N_TOP = 10

rng   = np.random.default_rng(seed=42)
probe_idx   = int(rng.integers(0, len(X_norm)))
probe_vec   = X_norm[probe_idx].tolist()          # already L2-normalised
probe_meta  = y_meta.iloc[probe_idx]

print(SEP)
print('LATENCY BENCHMARK & FIDELITY CHECK')
print(SEP)
print(f'\nProbe vector index : {probe_idx:,}')
print(f'  Archetype  : {probe_meta["ubt_archetype"]}')
print(f'  Attack     : {probe_meta["univ_specific_attack"]}')
print(f'  Source     : {probe_meta["dataset_source"]}')

# ── Cold run (first query; includes any one-off initialisation and cache warm-up) ──
t0_cold = time.perf_counter()
_res_cold = collection.query(
    query_embeddings=[probe_vec],
    n_results=N_TOP,
    include=['metadatas', 'distances', 'documents'],
)
latency_cold_ms = (time.perf_counter() - t0_cold) * 1000

# ── Warm runs (index already initialised, caches warm) ────────────────────────
N_WARM = 20
warm_probes = rng.integers(0, len(X_norm), size=N_WARM)
warm_latencies = []

for wi in warm_probes:
    _wv = X_norm[int(wi)].tolist()
    t0w = time.perf_counter()
    collection.query(
        query_embeddings=[_wv],
        n_results=N_TOP,
        include=['metadatas', 'distances'],
    )
    warm_latencies.append((time.perf_counter() - t0w) * 1000)

warm_latencies = np.array(warm_latencies)

print(f'\n  Cold query latency   : {latency_cold_ms:.2f} ms')
print(f'  Warm query latency   : {warm_latencies.mean():.2f} ms (mean over {N_WARM} queries)')
print(f'    p50                : {np.percentile(warm_latencies, 50):.2f} ms')
print(f'    p95                : {np.percentile(warm_latencies, 95):.2f} ms')
print(f'    p99                : {np.percentile(warm_latencies, 99):.2f} ms')
print(f'    min / max          : {warm_latencies.min():.2f} / {warm_latencies.max():.2f} ms')

TARGET_MS = 10.0
meets_target = warm_latencies.mean() < TARGET_MS
print(f'\n  Target (<{TARGET_MS:.0f} ms warm)  : {"✅ PASS" if meets_target else "⚠️  MISS"}')

# ── Fidelity Check — top-3 results ────────────────────────────────────────────
top_metas     = _res_cold['metadatas'][0][:3]
top_distances = _res_cold['distances'][0][:3]

print(f'\n{SEP2}')
print(f'FIDELITY CHECK — Top-3 Nearest Neighbours')
print(f'{SEP2}')
print(f'  Probe    →  [{probe_meta["ubt_archetype"]}] {probe_meta["univ_specific_attack"]}  (src: {probe_meta["dataset_source"]})')
print()

for rank, (m, d) in enumerate(zip(top_metas, top_distances), 1):
    archetype = m.get('ubt_archetype', 'N/A')
    attack    = m.get('univ_specific_attack', 'N/A')
    source    = m.get('dataset_source', 'N/A')
    match_arch = '✅' if archetype == probe_meta['ubt_archetype'] else '❌'
    print(f'  Rank {rank} {match_arch}  dist={d:.6f}')
    print(f'          ubt_archetype        : {archetype}')
    print(f'          univ_specific_attack : {attack}')
    print(f'          dataset_source       : {source}')
    print()

# Archetype precision@3
correct_arch = sum(
    1 for m in top_metas if m.get('ubt_archetype') == probe_meta['ubt_archetype']
)
precision_at_3 = correct_arch / 3

print(f'{SEP2}')
print(f'  Archetype Precision@3 : {correct_arch}/3  ({precision_at_3*100:.1f}%)')
print(f'{SEP}')
print('PHASE 2.2 — COMPLETE')
print(f'{SEP}')
print(f'  Collection : {collection.name}')
print(f'  Vectors    : {final_count:,}  ×  {TOTAL_DIMS}-dim  (L2-normalised, cosine HNSW)')
print(f'  Store path : {CHROMA_DIR}')
print(f'  Warm p50   : {np.percentile(warm_latencies, 50):.2f} ms')
print(f'  Real-time ready: {"YES" if meets_target else "REVIEW NEEDED"}')

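=================================================================
LATENCY BENCHMARK & FIDELITY CHECK
=================================================================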

Probe vector index : 9,148
  Archetype  : EXPLOIT
  Attack     : xss
  Source     : toniot

  Cold query latency   : 55.10 ms
  Warm query latency   : 0.94 ms (mean over 20 queries)
    p50                : 0.91 ms
    p95                : 1.21 ms
    p99                : 1.42 ms
    min / max          : 0.77 / 1.47 ms

  Target (<10 ms warm)  : ✅ PASS

-----------------------------------------------------------------
FIDELITY CHECK — Top-3 Nearest Neighbours
-----------------------------------------------------------------
  Probe    →  [EXPLOIT] xss  (src: toniot)

  Rank 1 ✅  dist=0.000000
          ubt_archetype        : EXPLOIT
          univ_specific_attack : xss
          dataset_source       : toniot

  Rank 2 ✅  dist=0.000002
          ubt_archetype        : EXPLOIT
          univ_specific_attack : xss
          dataset_source       : toniot

  Rank 3 ✅  dist=0.000005
          ubt_archetype        : EXPLOIT
          univ_specific_attack : 
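
Because the probe is drawn from the knowledge base, the rank-1 hit at `dist=0.000000` is most likely the probe itself, which inflates Precision@3. A possible variant drops the self-match before scoring. The sketch below is illustrative only: the function name, the neighbour ids other than `kb_9148`, and the `DOS` label are made up for the example.

```python
def precision_at_k_excluding_self(result_ids, result_metas, probe_id, probe_label,
                                  k=3, key='ubt_archetype'):
    """Label precision over the first k neighbours that are not the probe itself."""
    neighbours = [m for i, m in zip(result_ids, result_metas) if i != probe_id][:k]
    if not neighbours:
        return 0.0
    hits = sum(1 for m in neighbours if m.get(key) == probe_label)
    return hits / len(neighbours)

# Hypothetical query result for the probe stored under id 'kb_9148'.
ids = ['kb_9148', 'kb_9150', 'kb_12', 'kb_77']
metas = [
    {'ubt_archetype': 'EXPLOIT'},   # the probe itself, ignored
    {'ubt_archetype': 'EXPLOIT'},
    {'ubt_archetype': 'EXPLOIT'},
    {'ubt_archetype': 'DOS'},       # made-up mismatching label
]
score = precision_at_k_excluding_self(ids, metas, 'kb_9148', 'EXPLOIT', k=3)
```

Against the real collection, `result_ids` and `result_metas` would be `res['ids'][0]` and `res['metadatas'][0]` from a query issued with `n_results=k + 1`, so that k neighbours remain once the probe is removed.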

In [7]:
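# Rarity Validation: can the system still retrieve one of the 97 THEFT_EXFIL rows?
# The probe is itself a stored vector, so this confirms the rare class is reachable
# through HNSW (expected distance ~0); it does not test unseen traffic.
# np.flatnonzero gives a positional row number, which is safe for any DataFrame index.
rare_probe = int(np.flatnonzero((y_meta['ubt_archetype'] == 'THEFT_EXFIL').to_numpy())[0])
res_rare = collection.query(query_embeddings=[X_norm[rare_probe].tolist()], n_results=1)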

print(f"Rare Attack Test: {res_rare['metadatas'][0][0]['ubt_archetype']}")
print(f"Distance: {res_rare['distances'][0][0]:.6f}")

Rare Attack Test: THEFT_EXFIL
Distance: 0.000000
