# 🔢 Phase-2.3: Vector Generation
## Quantum-RAG Knowledge Fusion for Adaptive IoT Intrusion Detection

---

### 📋 Phase-2.3 Objective

**Convert encoded features into fixed-length vectors suitable for ChromaDB ingestion.**

Phase-2.3 Goals:
1. ✅ Load encoded features from Phase-2.2
2. ✅ Load metadata labels (attack types)
3. ✅ Create fixed-length feature vectors (99-dimensional)
4. ✅ Validate vector integrity and dimensionality
5. ✅ Generate vector IDs and metadata
6. ✅ Save vectors in ChromaDB-compatible format

### 🎯 Vector Requirements

| Requirement | Specification |
|-------------|---------------|
| **Dimensions** | 99 (from Phase-2.2 encoding) |
| **Data Type** | float32 (half the footprint of float64) |
| **Metadata** | `label` (binary: 0 = Normal, 1 = Attack), `type` (attack name) |
| **Format** | NumPy arrays + Parquet for metadata |
| **Validation** | No NaN, no Inf, all finite values |
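
The float32 requirement can be checked with simple arithmetic. The sketch below uses only the row and column counts reported in this notebook; the helper name `footprint_gib` is made up for illustration. It suggests that the "GB" figures printed later are binary gigabytes (1024³ bytes).

```python
# Figures taken from this notebook: 22,339,021 rows x 99 features.
N_VECTORS = 22_339_021
N_DIMS = 99


def footprint_gib(n_rows: int, n_cols: int, bytes_per_value: int) -> float:
    """Size of a dense n_rows x n_cols array in GiB (1 GiB = 1024**3 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1024**3


float32_gib = footprint_gib(N_VECTORS, N_DIMS, 4)  # float32 uses 4 bytes per value
float64_gib = footprint_gib(N_VECTORS, N_DIMS, 8)  # float64 uses 8 bytes per value

print(f"float32: {float32_gib:.2f} GiB")  # 8.24
print(f"float64: {float64_gib:.2f} GiB")  # 16.48
```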

### 📊 Key Principles

- **Chunked Processing**: Handle 22.3M vectors efficiently
- **Memory Management**: Process in batches to avoid RAM exhaustion
- **Type Safety**: Ensure float32 throughout pipeline
- **Metadata Integrity**: Preserve attack labels for retrieval

---

## 📦 Import Required Libraries

In [16]:
# Core data manipulation
import pandas as pd
import numpy as np
import json
import gc

# File handling
import os
from pathlib import Path

# Progress tracking
from tqdm import tqdm

# Display utilities
from IPython.display import display, HTML, Markdown
import warnings
warnings.filterwarnings('ignore')

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Define paths
PHASE_2_DIR = "../artifacts/phase_2"

print("✅ Libraries imported successfully!")
print(f"📁 Phase-2 artifacts: {PHASE_2_DIR}")

✅ Libraries imported successfully!
📁 Phase-2 artifacts: ../artifacts/phase_2


---

## 🔒 SECTION 1: Load Encoded Features & Metadata

### Objectives:
1. Load encoded features from Phase-2.2
2. Load metadata labels (attack types)
3. Validate data alignment
4. Check dimensionality

In [17]:
# Load Phase-2.2 summary
with open(f"{PHASE_2_DIR}/phase2_2_summary.json", 'r', encoding='utf-8') as f:
    phase2_2_summary = json.load(f)

print("📊 Phase-2.2 Summary:")
print(f"  • Input Rows: {phase2_2_summary['input_shape'][0]:,}")
print(f"  • Encoded Features: {phase2_2_summary['output_shape'][1]}")
print(f"  • Feature Expansion: {phase2_2_summary['feature_expansion']['expansion_factor']}x")
print(f"  • All Numeric: {phase2_2_summary['validation']['all_numeric']}")

📊 Phase-2.2 Summary:
  • Input Rows: 22,339,021
  • Encoded Features: 99
  • Feature Expansion: 3.0x
  • All Numeric: True


In [18]:
# Load encoded features (Parquet is much faster than CSV)
print("📥 Loading encoded features from Phase-2.2...\n")

encoded_features = pd.read_parquet(f"{PHASE_2_DIR}/encoded_features.parquet")

print(f"✅ Loaded:")
print(f"  • Rows: {len(encoded_features):,}")
print(f"  • Columns: {len(encoded_features.columns)}")
print(f"  • Memory: {encoded_features.memory_usage(deep=True).sum() / (1024**3):.2f} GB")
print(f"  • Data Types: {encoded_features.dtypes.value_counts().to_dict()}")

📥 Loading encoded features from Phase-2.2...

✅ Loaded:
  • Rows: 22,339,021
  • Columns: 99
  • Memory: 2.85 GB
  • Data Types: {dtype('uint8'): 83, dtype('float32'): 12, dtype('int16'): 2, dtype('int8'): 2}
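
The 2.85 GB figure follows from the dtype mix in the output above. The short calculation below reproduces it from the per-dtype column counts and also shows the per-row cost once every column is stored as float32, which is where the later 8.24 GB comes from. The variable names are illustrative only.

```python
# Column counts per dtype, copied from the "Data Types" line of the load output.
dtype_counts = {"uint8": 83, "float32": 12, "int16": 2, "int8": 2}
bytes_per_value = {"uint8": 1, "float32": 4, "int16": 2, "int8": 1}
n_rows = 22_339_021

# Bytes per row with the mixed dtypes produced by Phase-2.2.
row_bytes = sum(count * bytes_per_value[name] for name, count in dtype_counts.items())
loaded_gib = n_rows * row_bytes / 1024**3

# Bytes per row once all 99 columns are float32.
float32_row_bytes = sum(dtype_counts.values()) * 4

print(row_bytes)                                # 137
print(round(loaded_gib, 2))                     # 2.85
print(float32_row_bytes)                        # 396
print(round(float32_row_bytes / row_bytes, 2))  # 2.89
```

In other words, a uniform float32 layout costs roughly 2.9 times the memory of the compact encoding, which is the price of a single homogeneous array.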


In [19]:
# Load metadata labels
print("\n📥 Loading metadata labels...\n")

metadata_df = pd.read_csv(f"{PHASE_2_DIR}/metadata_labels.csv.gz", compression='gzip')

print(f"✅ Loaded:")
print(f"  • Rows: {len(metadata_df):,}")
print(f"  • Columns: {list(metadata_df.columns)}")

# Check attack type distribution (type column has actual attack names)
print(f"\n📊 Attack Type Distribution:")
type_dist = metadata_df['type'].value_counts()
for attack_type, count in type_dist.items():
    print(f"  • {attack_type}: {count:,} ({count / len(metadata_df) * 100:.1f}%)")

print(f"\n📊 Binary Label Distribution:")
print(f"  • Normal (0): {(metadata_df['label'] == 0).sum():,}")
print(f"  • Attack (1): {(metadata_df['label'] == 1).sum():,}")


📥 Loading metadata labels...

✅ Loaded:
  • Rows: 22,339,021
  • Columns: ['label', 'type']

📊 Attack Type Distribution:
  • scanning: 7,140,161 (32.0%)
  • ddos: 6,165,008 (27.6%)
  • dos: 3,375,328 (15.1%)
  • xss: 2,108,944 (9.4%)
  • password: 1,718,568 (7.7%)
  • normal: 796,380 (3.6%)
  • backdoor: 508,116 (2.3%)
  • injection: 452,659 (2.0%)
  • ransomware: 72,805 (0.3%)
  • mitm: 1,052 (0.0%)

📊 Binary Label Distribution:
  • Normal (0): 796,380
  • Attack (1): 21,542,641


In [20]:
# Validate data alignment
print("🔍 Validating data alignment...\n")

if len(encoded_features) != len(metadata_df):
    raise ValueError(f"Row count mismatch! Encoded: {len(encoded_features):,}, Metadata: {len(metadata_df):,}")

print("✅ Data alignment verified!")
print(f"  • Both datasets: {len(encoded_features):,} rows")
print(f"  • Vector dimensions: {encoded_features.shape[1]}")

🔍 Validating data alignment...

✅ Data alignment verified!
  • Both datasets: 22,339,021 rows
  • Vector dimensions: 99


---

## 🔢 SECTION 2: Vector Generation

### Objectives:
1. Convert DataFrame rows to NumPy vectors
2. Validate vector integrity (no NaN, no Inf)
3. Ensure float32 data type
4. Generate unique vector IDs

In [21]:
# Convert to NumPy array (vectors)
print("🔢 Converting encoded features to vectors...\n")

# Cast every column to float32 during conversion. A single to_numpy call
# avoids rewriting the DataFrame one column at a time.
vectors = encoded_features.to_numpy(dtype=np.float32)

print(f"✅ Vectors created:")
print(f"  • Shape: {vectors.shape}")
print(f"  • Data Type: {vectors.dtype}")
print(f"  • Memory: {vectors.nbytes / (1024**3):.2f} GB")

🔢 Converting encoded features to vectors...

✅ Vectors created:
  • Shape: (22339021, 99)
  • Data Type: float32
  • Memory: 8.24 GB
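
Casting the integer columns (uint8, int8, int16) to float32 should lose no information, because float32 represents every integer up to 2²⁴ in magnitude exactly. The sketch below is one conservative way to confirm this from the dtypes alone. The helper `cast_is_lossless` and the demo column names are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# float32 has a 24-bit significand, so integers with |x| <= 2**24 are exact.
FLOAT32_EXACT_INT_LIMIT = 2**24


def cast_is_lossless(df: pd.DataFrame) -> bool:
    """True if every integer column's dtype range fits float32 exactly.

    Conservative: it inspects the dtype range, not the stored values.
    """
    for col in df.columns:
        dtype = df[col].dtype
        if np.issubdtype(dtype, np.integer):
            info = np.iinfo(dtype)
            if max(abs(int(info.min)), abs(int(info.max))) > FLOAT32_EXACT_INT_LIMIT:
                return False
    return True


# Small stand-in with the same dtype families as the encoded features.
demo = pd.DataFrame({
    "flag": np.array([0, 1, 255], dtype=np.uint8),
    "delta": np.array([-32768, 0, 32767], dtype=np.int16),
    "scaled": np.array([0.5, -1.0, 2.25], dtype=np.float32),
})

print(cast_is_lossless(demo))  # True
vectors_demo = demo.to_numpy(dtype=np.float32)
print(vectors_demo.dtype, vectors_demo.shape)
```

A wider integer column such as int64 would fail this check and would need a value-level inspection before casting.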


In [22]:
# Validate vector integrity
print("\n🔍 Validating vector integrity...\n")

# Check for NaN
nan_count = np.isnan(vectors).sum()
if nan_count > 0:
    print(f"⚠️  NaN values detected: {nan_count:,}")
else:
    print("✅ No NaN values detected")

# Check for Inf
inf_count = np.isinf(vectors).sum()
if inf_count > 0:
    print(f"⚠️  Inf values detected: {inf_count:,}")
else:
    print("✅ No Inf values detected")

# Check if all finite
all_finite = np.isfinite(vectors).all()
if all_finite:
    print("✅ All values are finite")
else:
    print("⚠️  Some values are not finite")

# Vector statistics
print(f"\n📊 Vector Statistics:")
print(f"  • Min value: {vectors.min():.6f}")
print(f"  • Max value: {vectors.max():.6f}")
print(f"  • Mean: {vectors.mean():.6f}")
print(f"  • Std: {vectors.std():.6f}")


🔍 Validating vector integrity...

✅ No NaN values detected
✅ No Inf values detected
✅ All values are finite

📊 Vector Statistics:
  • Min value: -32768.000000
  • Max value: 541129.562500
  • Mean: 0.723042
  • Std: 58.053623
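
Each `np.isnan` or `np.isinf` call on the full array allocates a boolean mask with one byte per element, which is about 2 GB here. A possible alternative, sketched below under the assumption that a block-wise pass is acceptable, scans the rows in chunks and accumulates the sum in float64. The function name `scan_vectors` and the chunk size are hypothetical.

```python
import numpy as np


def scan_vectors(arr: np.ndarray, chunk_rows: int = 1_000_000):
    """Return (nan_count, inf_count, mean), scanning arr one row block at a time."""
    nan_count = 0
    inf_count = 0
    total = 0.0
    for start in range(0, arr.shape[0], chunk_rows):
        block = arr[start:start + chunk_rows]
        nan_count += int(np.isnan(block).sum())
        inf_count += int(np.isinf(block).sum())
        # Accumulate in float64 to limit rounding error over billions of values.
        total += float(block.sum(dtype=np.float64))
    return nan_count, inf_count, total / arr.size


clean = np.arange(6, dtype=np.float32).reshape(3, 2)
dirty = clean.copy()
dirty[0, 1] = np.nan
dirty[2, 0] = np.inf

clean_report = scan_vectors(clean, chunk_rows=2)
dirty_report = scan_vectors(dirty, chunk_rows=2)
print(clean_report)      # (0, 0, 2.5)
print(dirty_report[:2])  # (1, 1)
```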


In [23]:
# Generate unique vector IDs
print("\n🆔 Generating unique vector IDs...\n")

# Use sequential IDs for 22M+ vectors
vector_ids = [f"vec_{i:08d}" for i in range(len(vectors))]

print(f"✅ Generated {len(vector_ids):,} vector IDs")
print(f"  • Format: vec_00000000 to vec_{len(vector_ids)-1:08d}")
print(f"  • Example IDs: {vector_ids[:3]}")


🆔 Generating unique vector IDs...

✅ Generated 22,339,021 vector IDs
  • Format: vec_00000000 to vec_22339020
  • Example IDs: ['vec_00000000', 'vec_00000001', 'vec_00000002']
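
Because the IDs are just zero-padded row indices, they can be regenerated for any row range and mapped back to a row without storing all 22.3M strings. The helpers below are a small illustration of that property. Their names are made up here, and the 8-digit width is taken from the format string above.

```python
ID_WIDTH = 8  # matches the :08d format used above; covers up to 99,999,999 rows


def make_id(index: int) -> str:
    """Zero-padded ID for a row index, e.g. 42 -> 'vec_00000042'."""
    return f"vec_{index:0{ID_WIDTH}d}"


def index_from_id(vector_id: str) -> int:
    """Recover the row index from an ID produced by make_id."""
    return int(vector_id.split("_", 1)[1])


def batch_ids(start: int, stop: int) -> list:
    """IDs for rows start..stop-1, generated on demand."""
    return [make_id(i) for i in range(start, stop)]


ids = batch_ids(9, 12)
print(ids)                            # ['vec_00000009', 'vec_00000010', 'vec_00000011']
print(index_from_id("vec_22339020"))  # 22339020
print(sorted(ids) == ids)             # True: padding keeps string order equal to row order
```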


---

## 📋 SECTION 3: Metadata Preparation

### Objectives:
1. Create metadata dictionary for each vector
2. Include attack label and type
3. Prepare ChromaDB-compatible metadata format

In [24]:
# Prepare metadata for ChromaDB
print("📋 Preparing metadata for ChromaDB...\n")

# Add vector IDs to metadata
metadata_df['vector_id'] = vector_ids

# The 'type' column already contains attack type strings
# (from Phase-2.1: 'normal', 'scanning', 'ddos', 'dos', 'xss', 'password',
#  'backdoor', 'injection', 'ransomware', 'mitm')
print(f"✅ Metadata prepared:")
print(f"  • Rows: {len(metadata_df):,}")
print(f"  • Columns: {list(metadata_df.columns)}")

print(f"\n📊 Attack Type Distribution:")
print(metadata_df['type'].value_counts())

print(f"\n📊 Binary Label Distribution:")
print(f"  • Normal (0): {(metadata_df['label'] == 0).sum():,}")
print(f"  • Attack (1): {(metadata_df['label'] == 1).sum():,}")

📋 Preparing metadata for ChromaDB...

✅ Metadata prepared:
  • Rows: 22,339,021
  • Columns: ['label', 'type', 'vector_id']

📊 Attack Type Distribution:
type
scanning      7140161
ddos          6165008
dos           3375328
xss           2108944
password      1718568
normal         796380
backdoor       508116
injection      452659
ransomware      72805
mitm             1052
Name: count, dtype: int64

📊 Binary Label Distribution:
  • Normal (0): 796,380
  • Attack (1): 21,542,641


In [25]:
# Preview sample vectors with metadata
print("\n🔍 Sample Vectors with Metadata:")
print("="*80)

sample_indices = [0, 1, len(vectors)//2, len(vectors)-2, len(vectors)-1]
for idx in sample_indices:
    print(f"\nVector ID: {vector_ids[idx]}")
    print(f"  Binary Label: {metadata_df.iloc[idx]['label']} ({'Normal' if metadata_df.iloc[idx]['label'] == 0 else 'Attack'})")
    print(f"  Attack Type: {metadata_df.iloc[idx]['type']}")
    print(f"  Vector shape: {vectors[idx].shape}")
    print(f"  First 5 values: {vectors[idx][:5]}")


🔍 Sample Vectors with Metadata:

Vector ID: vec_00000000
  Binary Label: 0 (Normal)
  Attack Type: normal
  Vector shape: (99,)
  First 5 values: [-1.8826956e+00  3.3064134e+00  4.6609456e+05  0.0000000e+00
 -1.0000000e+00]

Vector ID: vec_00000001
  Binary Label: 0 (Normal)
  Attack Type: normal
  Vector shape: (99,)
  First 5 values: [ 0.46473706  0.64512044 -0.00116886  0.         -1.        ]

Vector ID: vec_11169510
  Binary Label: 1 (Attack)
  Attack Type: scanning
  Vector shape: (99,)
  First 5 values: [ 2.2780557e-01  1.6872513e+00 -1.1688597e-03  0.0000000e+00
 -1.0000000e+00]

Vector ID: vec_22339019
  Binary Label: 1 (Attack)
  Attack Type: dos
  Vector shape: (99,)
  First 5 values: [-1.8223245e+00 -2.4848045e-01 -6.5965345e-04  0.0000000e+00
 -1.0000000e+00]

Vector ID: vec_22339020
  Binary Label: 1 (Attack)
  Attack Type: dos
  Vector shape: (99,)
  First 5 values: [-1.8223245e+00 -2.4848045e-01 -3.8769108e-04  0.0000000e+00
 -1.0000000e+00]
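
Section 3 lists a per-vector metadata dictionary as an objective, while the cells above keep the metadata in a DataFrame. One way to bridge the two at ingestion time is to build dictionaries for a slice of rows only when needed. The sketch below assumes that ChromaDB metadata values should be plain Python strings and numbers, so it converts NumPy scalars explicitly. The function name `metadata_records` is hypothetical.

```python
import pandas as pd


def metadata_records(meta: pd.DataFrame, start: int, stop: int) -> list:
    """Build one {'label', 'type'} dict per row for rows start..stop-1."""
    rows = meta.iloc[start:stop]
    return [
        # int()/str() turn NumPy scalars into plain Python values.
        {"label": int(label), "type": str(attack_type)}
        for label, attack_type in zip(rows["label"], rows["type"])
    ]


# Tiny stand-in for metadata_df with the same two columns.
demo_meta = pd.DataFrame({
    "label": [0, 1, 1],
    "type": ["normal", "scanning", "dos"],
})

records = metadata_records(demo_meta, 0, 2)
print(records)  # [{'label': 0, 'type': 'normal'}, {'label': 1, 'type': 'scanning'}]
```

Building the dictionaries per batch keeps peak memory low compared with materializing 22.3M dictionaries up front.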


---

## 💾 SECTION 4: Save Vectors & Metadata

### Objectives:
1. Save vectors as NumPy binary (.npy)
2. Save metadata as Parquet
3. Generate summary statistics
4. Validate saved files

In [26]:
# Save vectors as NumPy binary
print("💾 Saving vectors...\n")

vectors_path = f"{PHASE_2_DIR}/feature_vectors.npy"
np.save(vectors_path, vectors)

print(f"✅ Saved: {vectors_path}")
print(f"  • Vectors: {vectors.shape[0]:,}")
print(f"  • Dimensions: {vectors.shape[1]}")
print(f"  • Size: {os.path.getsize(vectors_path) / (1024**3):.2f} GB")

💾 Saving vectors...

✅ Saved: ../artifacts/phase_2/feature_vectors.npy
  • Vectors: 22,339,021
  • Dimensions: 99
  • Size: 8.24 GB


In [27]:
# Save metadata as Parquet
print("\n💾 Saving metadata...\n")

metadata_path = f"{PHASE_2_DIR}/vector_metadata.parquet"
metadata_df.to_parquet(
    metadata_path,
    index=False,
    compression='snappy',
    engine='pyarrow'
)

print(f"✅ Saved: {metadata_path}")
print(f"  • Rows: {len(metadata_df):,}")
print(f"  • Columns: {list(metadata_df.columns)}")
print(f"  • Size: {os.path.getsize(metadata_path) / (1024**2):.1f} MB")


💾 Saving metadata...

✅ Saved: ../artifacts/phase_2/vector_metadata.parquet
  • Rows: 22,339,021
  • Columns: ['label', 'type', 'vector_id']
  • Size: 108.9 MB


In [28]:
# Generate Phase-2.3 summary
summary = {
    "phase": "Phase-2.3 (Vector Generation)",
    "timestamp": pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S"),
    "input_files": {
        "encoded_features": "encoded_features.parquet",
        "metadata_labels": "metadata_labels.csv.gz"
    },
    "output_files": {
        "vectors": "feature_vectors.npy",
        "metadata": "vector_metadata.parquet"
    },
    "vectors": {
        "total_count": int(vectors.shape[0]),
        "dimensions": int(vectors.shape[1]),
        "data_type": str(vectors.dtype),
        "memory_gb": round(vectors.nbytes / (1024**3), 2)
    },
    "statistics": {
        "min_value": float(vectors.min()),
        "max_value": float(vectors.max()),
        "mean_value": float(vectors.mean()),
        "std_value": float(vectors.std())
    },
    "validation": {
        "nan_count": int(nan_count),
        "inf_count": int(inf_count),
        "all_finite": bool(all_finite)
    },
    "metadata": {
        "total_rows": int(len(metadata_df)),
        "normal_count": int((metadata_df['label'] == 0).sum()),
        "attack_count": int((metadata_df['label'] == 1).sum()),
        "unique_attack_types": int(metadata_df['type'].nunique()),
        "attack_type_distribution": metadata_df['type'].value_counts().to_dict()
    }
}

summary_path = f"{PHASE_2_DIR}/phase2_3_summary.json"
with open(summary_path, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2)

print("\n📊 Phase-2.3 Summary:")
print(json.dumps(summary, indent=2))
print(f"\n💾 Saved: {summary_path}")


📊 Phase-2.3 Summary:
{
  "phase": "Phase-2.3 (Vector Generation)",
  "timestamp": "2026-02-04 22:03:49",
  "input_files": {
    "encoded_features": "encoded_features.parquet",
    "metadata_labels": "metadata_labels.csv.gz"
  },
  "output_files": {
    "vectors": "feature_vectors.npy",
    "metadata": "vector_metadata.parquet"
  },
  "vectors": {
    "total_count": 22339021,
    "dimensions": 99,
    "data_type": "float32",
    "memory_gb": 8.24
  },
  "statistics": {
    "min_value": -32768.0,
    "max_value": 541129.5625,
    "mean_value": 0.7230424880981445,
    "std_value": 58.05362319946289
  },
  "validation": {
    "nan_count": 0,
    "inf_count": 0,
    "all_finite": true
  },
  "metadata": {
    "total_rows": 22339021,
    "normal_count": 796380,
    "attack_count": 21542641,
    "unique_attack_types": 10,
    "attack_type_distribution": {
      "scanning": 7140161,
      "ddos": 6165008,
      "dos": 3375328,
      "xss": 2108944,
      "password": 1718568,
      "normal"

In [29]:
# Validate saved files
print("\n🔍 Validating saved files...\n")

# Load and validate vectors
loaded_vectors = np.load(vectors_path)
print(f"✅ Vectors loaded successfully: {loaded_vectors.shape}")

# Load and validate metadata
loaded_metadata = pd.read_parquet(metadata_path)
print(f"✅ Metadata loaded successfully: {loaded_metadata.shape}")

# Verify integrity
vectors_match = np.array_equal(vectors, loaded_vectors)
metadata_match = metadata_df.equals(loaded_metadata)

if vectors_match and metadata_match:
    print("\n✅ All saved files validated successfully!")
else:
    print("\n⚠️  Validation failed!")
    print(f"  • Vectors match: {vectors_match}")
    print(f"  • Metadata match: {metadata_match}")


🔍 Validating saved files...

✅ Vectors loaded successfully: (22339021, 99)
✅ Metadata loaded successfully: (22339021, 3)

✅ All saved files validated successfully!
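
Reloading the file with a plain `np.load` holds a second 8.24 GB copy in memory next to `vectors`. If RAM is tight, a memory-mapped comparison is one option. The sketch below uses `np.load(..., mmap_mode="r")` and compares block by block. The function name `files_match` and the chunk size are assumptions for illustration.

```python
import os
import tempfile

import numpy as np


def files_match(reference: np.ndarray, saved_path: str, chunk_rows: int = 1_000_000) -> bool:
    """Compare an in-memory array with a saved .npy file, one row block at a time."""
    # mmap_mode="r" maps the file read-only instead of loading it fully.
    saved = np.load(saved_path, mmap_mode="r")
    if saved.shape != reference.shape or saved.dtype != reference.dtype:
        return False
    for start in range(0, reference.shape[0], chunk_rows):
        stop = start + chunk_rows
        if not np.array_equal(reference[start:stop], saved[start:stop]):
            return False
    return True


demo = np.arange(12, dtype=np.float32).reshape(6, 2)
changed = demo.copy()
changed[5, 1] = -1.0

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.npy")
    np.save(path, demo)
    same = files_match(demo, path, chunk_rows=4)
    different = files_match(changed, path, chunk_rows=4)

print(same, different)  # True False
```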


---

## 🎉 Phase-2.3 Complete!

### ✅ Deliverables
1. ✅ Generated 22.3M fixed-length vectors (99 dimensions)
2. ✅ Validated vector integrity (no NaN, no Inf)
3. ✅ Prepared metadata with attack labels
4. ✅ Saved vectors as NumPy binary (.npy)
5. ✅ Saved metadata as Parquet
6. ✅ Generated summary statistics

### 📂 Output Files
- `artifacts/phase_2/feature_vectors.npy` (~8.24 GB, 22.3M × 99 float32)
- `artifacts/phase_2/vector_metadata.parquet` (~109 MB, vector IDs + labels)
- `artifacts/phase_2/phase2_3_summary.json`

### 🚀 Next: Phase-2.4 (ChromaDB Ingestion)
Ready to:
- Initialize ChromaDB collection
- Batch insert 22.3M vectors with metadata
- Configure similarity search parameters
- Validate ingestion completeness
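
To give a sense of scale for the batch insert, the sketch below computes the row ranges a batched loop would walk. The batch size of 5,000 is a placeholder, not a ChromaDB requirement, and `batch_bounds` is a hypothetical helper. In Phase-2.4, each range would presumably slice the vectors, IDs, and metadata before a call such as the collection's `add(ids=..., embeddings=..., metadatas=...)`.

```python
import math


def batch_bounds(n_rows: int, batch_size: int):
    """Yield (start, stop) row ranges that cover 0..n_rows-1 without gaps."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


N_VECTORS = 22_339_021  # from this notebook
BATCH_SIZE = 5_000      # hypothetical; tune to the client's limits

n_batches = math.ceil(N_VECTORS / BATCH_SIZE)
bounds = list(batch_bounds(N_VECTORS, BATCH_SIZE))

print(n_batches)   # 4468
print(bounds[0])   # (0, 5000)
print(bounds[-1])  # (22335000, 22339021)
```

Summing `stop - start` over all ranges gives back the total row count, which is a cheap completeness check to run after ingestion.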

---

**Status**: ✅ **READY FOR PHASE-2.4**