# üéØ Phase-1: Feature Decision & Preprocessing Design
## Quantum-RAG Knowledge Fusion for Adaptive IoT Intrusion Detection

---

### üìã Phase-1 Objective

**This phase is the DECISION PHASE.**

Phase-1 Goals:
1. ‚úÖ Decide **what features to KEEP**
2. ‚úÖ Decide **what features to DROP**
3. ‚úÖ Decide **how each retained feature will be handled**
4. ‚úÖ Produce a **frozen feature schema** for all later phases

### üîí Phase-1 Rules

| Rule | Status |
|------|--------|
| ‚ùå No embedding generation | Strict |
| ‚ùå No model training | Strict |
| ‚ùå No ChromaDB usage | Strict |
| ‚ùå No retrieval logic | Strict |
| ‚úî Only feature decisions | Required |
| ‚úî Use Phase-0 findings as ground truth | Required |

### üìä Key Principles

- **Behavioral focus**: Keep features that describe HOW traffic behaves
- **Generalization**: Drop network-specific identifiers
- **Explainability**: Retain features that support interpretability
- **CPU-friendly**: No NLP/transformer encodings
- **Semantic integrity**: Preserve meaningful placeholder values

---

## üì¶ Import Required Libraries

In [None]:
# Core data manipulation
import pandas as pd
import numpy as np
import json

# File handling
import os
from pathlib import Path

# Display utilities
from IPython.display import display, HTML, Markdown
import warnings
warnings.filterwarnings('ignore')

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Define paths
ARTIFACTS_DIR = "../artifacts"
PHASE_0_DIR = "../artifacts/phase_0"
PHASE_1_DIR = "../artifacts/phase_1"
DATA_DIR = "../data/ton_iot_processed_network"

print("‚úÖ Libraries imported successfully!")

print(f"üìÅ Artifacts directory: {ARTIFACTS_DIR}")print(f"üìÅ Data directory: {DATA_DIR}")

‚úÖ Libraries imported successfully!
üìÅ Artifacts directory: ../artifacts
üìÅ Data directory: ../data/ton_iot_processed_network


---

## üìÇ SECTION 1 ‚Äî Load Phase-0 Artifacts

### Objectives:
1. Load outputs from Phase-0 (column inventory, role classification, feature meanings)
2. Treat Phase-0 findings as **ground truth**
3. Confirm understanding of placeholder values (especially `"-"`)
4. DO NOT override Phase-0 conclusions

In [None]:
# Load all Phase-0 artifacts
column_inventory = pd.read_csv(f"{PHASE_0_DIR}/phase0_column_inventory.csv")
role_classification = pd.read_csv(f"{PHASE_0_DIR}/phase0_role_classification.csv")
feature_meanings = pd.read_csv(f"{PHASE_0_DIR}/phase0_feature_meanings.csv")
placeholder_analysis = pd.read_csv(f"{PHASE_0_DIR}/phase0_placeholder_analysis.csv")
files_summary = pd.read_csv(f"{PHASE_0_DIR}/phase0_files_summary.csv")

print("‚úÖ Loaded Phase-0 Artifacts:")
print(f"  ‚Ä¢ Column Inventory: {len(column_inventory)} columns")
print(f"  ‚Ä¢ Role Classifications: {len(role_classification)} features")
print(f"  ‚Ä¢ Feature Meanings: {len(feature_meanings)} features")
print(f"  ‚Ä¢ Placeholder Analysis: {len(placeholder_analysis)} columns")
print(f"  ‚Ä¢ Files Summary: {len(files_summary)} files")
print(f"\nüìä Total records in dataset: {files_summary['Rows'].sum():,}")
print(f"üì¶ Total dataset size: {files_summary['Memory (MB)'].sum():.1f} MB")

‚úÖ Loaded Phase-0 Artifacts:
  ‚Ä¢ Column Inventory: 47 columns
  ‚Ä¢ Role Classifications: 47 features
  ‚Ä¢ Feature Meanings: 47 features
  ‚Ä¢ Placeholder Analysis: 30 columns
  ‚Ä¢ Files Summary: 23 files

üìä Total records in dataset: 22,339,021
üì¶ Total dataset size: 33760.3 MB


In [14]:
# Display Phase-0 Role Classification (Ground Truth)
display(Markdown("### üéØ Phase-0 Role Classification (Ground Truth)"))
display(role_classification.sort_values("Role"))

### üéØ Phase-0 Role Classification (Ground Truth)

Unnamed: 0,Column,Role,Confidence,Data Type,Unique Values,Null Count,Cardinality Note
14,dst_pkts,Behavioral,HIGH,int64,1212,0,‚ö†Ô∏è High cardinality
7,duration,Behavioral,HIGH,float64,3720375,0,‚ö†Ô∏è High cardinality
8,src_bytes,Behavioral,HIGH,object,57113,0,‚ö†Ô∏è High cardinality
9,dst_bytes,Behavioral,HIGH,int64,52193,0,‚ö†Ô∏è High cardinality
10,conn_state,Behavioral,HIGH,object,13,0,Low cardinality
11,missed_bytes,Behavioral,HIGH,int64,8593,0,‚ö†Ô∏è High cardinality
12,src_pkts,Behavioral,HIGH,int64,3714,0,‚ö†Ô∏è High cardinality
0,ts,Contextual,HIGH,int64,392633,0,‚ö†Ô∏è High cardinality
2,src_port,Contextual,HIGH,int64,65536,0,‚ö†Ô∏è High cardinality
4,dst_port,Contextual,HIGH,int64,65536,0,‚ö†Ô∏è High cardinality


---

## üéØ SECTION 2 ‚Äî Final Feature Role Classification

### Objectives:
1. Review Phase-0 role assignments
2. Apply **Quantum-RAG IoT IDS** principles:
   - **KEEP**: Behavioral + selected Contextual features
   - **DROP**: Identifiers (IP, UID) for generalization
   - **DROP**: Labels (type, label) - metadata only
3. Justify each decision with behavioral/explainability reasoning

In [35]:
# Define feature retention decision rules
RETENTION_RULES = {
    # BEHAVIORAL - Core IDS signals
    "duration": ("KEEP", "Attack duration patterns are critical for DoS/anomaly detection"),
    "src_bytes": ("KEEP", "Data volume indicates exfiltration, flooding, or normal usage"),
    "dst_bytes": ("KEEP", "Response size reveals server behavior under attack"),
    "src_pkts": ("KEEP", "Packet count patterns distinguish scan/flood/normal traffic"),
    "dst_pkts": ("KEEP", "Server packet responses indicate service type and anomalies"),
    "src_ip_bytes": ("KEEP", "IP-layer volume feature (behavioral, not identity) - protocol overhead patterns"),
    "dst_ip_bytes": ("KEEP", "IP-layer volume feature (behavioral, not identity) - response protocol overhead"),
    
    # CONTEXTUAL - Protocol & state information
    "proto": ("KEEP", "Protocol type (tcp/udp/icmp) is essential for attack context"),
    "service": ("KEEP", "Service type helps identify attack targets (http/dns/ssh)"),
    "conn_state": ("KEEP", "Connection state reveals incomplete/rejected connections (attacks)"),
    
    # IDENTIFIERS - DROP for generalization
    "ts": ("DROP", "Timestamp is environment-specific, not generalizable"),
    "uid": ("DROP", "Unique ID per connection, no behavioral value"),
    "src_ip": ("DROP", "Source IP is identity-specific, prevents generalization"),
    "src_port": ("KEEP", "Source port reveals client behavior (ephemeral vs well-known)"),
    "dst_ip": ("DROP", "Destination IP is identity-specific"),
    "dst_port": ("KEEP", "Destination port identifies target service (80/443/22)"),
    
    # LABELS - DROP (metadata only)
    "type": ("DROP", "Attack type label - used for evaluation only, not input"),
    "label": ("DROP", "Binary label (normal/attack) - target variable, not feature"),
    
    # TCP-SPECIFIC - KEEP with placeholder handling
    "missed_bytes": ("KEEP", "TCP lost data indicates packet loss or evasion"),
    
    # DNS-SPECIFIC - KEEP with placeholder handling  
    "dns_query": ("DROP", "High cardinality (1M unique), identity-revealing domain names"),
    "dns_qclass": ("KEEP", "Protocol-specific contextual feature: DNS query class patterns"),
    "dns_qtype": ("KEEP", "Protocol-specific contextual feature: DNS query type (A/AAAA/MX) for reconnaissance detection"),
    "dns_rcode": ("KEEP", "DNS response code shows failed queries (tunneling/DGA)"),
    "dns_AA": ("KEEP", "Authoritative answer flag indicates DNS behavior"),
    "dns_RD": ("KEEP", "Recursion desired flag shows query patterns"),
    "dns_RA": ("KEEP", "Recursion available shows server capabilities"),
    "dns_rejected": ("KEEP", "Rejected DNS queries indicate malicious attempts"),
    
    # HTTP-SPECIFIC - KEEP with placeholder handling
    "http_trans_depth": ("KEEP", "HTTP transaction depth reveals pipelining/keep-alive"),
    "http_method": ("KEEP", "HTTP method (GET/POST) indicates attack type (injection)"),
    "http_uri": ("DROP", "High cardinality relative to other categorical features, identity-revealing paths"),
    "http_referrer": ("DROP", "High cardinality relative to other categorical features, identity-revealing referrer URLs"),
    "http_version": ("KEEP", "HTTP version shows legacy vulnerabilities (HTTP/1.0)"),
    "http_request_body_len": ("KEEP", "Request size indicates upload attacks or exfiltration"),
    "http_response_body_len": ("KEEP", "Response size reveals data leakage or DoS"),
    "http_status_code": ("KEEP", "Status code patterns indicate scan/brute-force"),
    "http_user_agent": ("DROP", "High cardinality relative to other categorical features, identity-revealing client info"),
    "http_orig_mime_types": ("KEEP", "Protocol-specific contextual feature: Request MIME type for content-based attack detection"),
    "http_resp_mime_types": ("KEEP", "Protocol-specific contextual feature: Response MIME type for anomalous server behavior"),
    
    # SSL-SPECIFIC - KEEP with placeholder handling
    "ssl_version": ("KEEP", "SSL/TLS version reveals downgrade attacks"),
    "ssl_cipher": ("KEEP", "Cipher suite indicates weak cryptography usage"),
    "ssl_resumed": ("KEEP", "Session resumption patterns show automation"),
    "ssl_established": ("KEEP", "Handshake success/failure indicates MITM or misconfiguration"),
    "ssl_subject": ("DROP", "Certificate subject is identity-revealing"),
    "ssl_issuer": ("DROP", "Certificate issuer is identity-revealing"),
    
    # ADDITIONAL CONTEXTUAL
    "weird_name": ("DROP", "High cardinality relative to other categorical features, noisy anomaly descriptions"),
    "weird_addl": ("DROP", "Additional weird info - too sparse (99.5% placeholders)"),
    "weird_notice": ("KEEP", "Boolean flag for anomalies detected by Zeek"),
}

# Create retention decision DataFrame
retention_decisions = pd.DataFrame([
    {
        "column": col,
        "decision": decision,
        "reasoning": reason,
        "phase0_role": role_classification[role_classification["Column"] == col]["Role"].values[0],
        "unique_values": column_inventory[column_inventory["Column Name"] == col]["Unique Values"].values[0],
        "placeholder_pct": 0  # Will be computed from Phase-0 placeholder analysis if available
    }
    for col, (decision, reason) in RETENTION_RULES.items()
])

display(Markdown("### ‚úÖ Feature Retention Decisions"))
display(retention_decisions.sort_values(["decision", "phase0_role"]))

### ‚úÖ Feature Retention Decisions

Unnamed: 0,column,decision,reasoning,phase0_role,unique_values,placeholder_pct
10,ts,DROP,"Timestamp is environment-specific, not general...",Contextual,392633,0
11,uid,DROP,"Unique ID per connection, no behavioral value",Identifier,999966,0
12,src_ip,DROP,"Source IP is identity-specific, prevents gener...",Identifier,23414,0
14,dst_ip,DROP,Destination IP is identity-specific,Identifier,6523,0
16,type,DROP,"Attack type label - used for evaluation only, ...",Label/Ground Truth,10,0
17,label,DROP,Binary label (normal/attack) - target variable...,Label/Ground Truth,2,0
19,dns_query,DROP,"High cardinality (1M unique), identity-reveali...",Unknown - Needs Review,17880,0
29,http_uri,DROP,High cardinality relative to other categorical...,Unknown - Needs Review,1068,0
30,http_referrer,DROP,High cardinality relative to other categorical...,Unknown - Needs Review,5,0
35,http_user_agent,DROP,High cardinality relative to other categorical...,Unknown - Needs Review,121,0


In [None]:
# Summary statistics
keep_count = len(retention_decisions[retention_decisions["decision"] == "KEEP"])
drop_count = len(retention_decisions[retention_decisions["decision"] == "DROP"])

print(f"\nüìä Retention Summary:")
print(f"  ‚úÖ KEEP: {keep_count} features")
print(f"  ‚ùå DROP: {drop_count} features")
print(f"  üìà Retention Rate: {keep_count / len(retention_decisions) * 100:.1f}%")

# Save retention decisions
retention_decisions.to_csv(f"{PHASE_1_DIR}/phase1_retention_decisions.csv", index=False)
print(f"\nüíæ Saved: phase1_retention_decisions.csv")


üìä Retention Summary:
  ‚úÖ KEEP: 33 features
  ‚ùå DROP: 14 features
  üìà Retention Rate: 70.2%

üíæ Saved: phase1_retention_decisions.csv


---

## üîß SECTION 3 ‚Äî Placeholder Value Handling Strategy

### Objectives:
1. Define **semantic treatment** for placeholder `"-"` per column
2. Distinguish: "Not applicable" vs "Missing data"
3. Preserve interpretability for explainable IDS

In [17]:
# Placeholder handling strategies for KEPT features only
PLACEHOLDER_STRATEGIES = {
    # Protocol-specific features: "-" means "not applicable for this protocol"
    "missed_bytes": ("protocol_na", "TCP-only: '-' for UDP/ICMP ‚Üí Encode as 'NOT_APPLICABLE'"),
    
    "dns_qclass": ("protocol_na", "DNS-only: '-' for TCP/UDP non-DNS ‚Üí Encode as 'NOT_APPLICABLE'"),
    "dns_qtype": ("protocol_na", "DNS-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "dns_rcode": ("protocol_na", "DNS-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "dns_AA": ("protocol_na", "DNS-only: '-' ‚Üí 'NOT_APPLICABLE' (or map F/T/-)"),
    "dns_RD": ("protocol_na", "DNS-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "dns_RA": ("protocol_na", "DNS-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "dns_rejected": ("protocol_na", "DNS-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    
    "http_trans_depth": ("protocol_na", "HTTP-only: '-' ‚Üí Encode as -1 (not applicable)"),
    "http_method": ("protocol_na", "HTTP-only: '-' ‚Üí Encode as 'NOT_APPLICABLE'"),
    "http_version": ("protocol_na", "HTTP-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "http_request_body_len": ("protocol_na", "HTTP-only: '-' ‚Üí Encode as -1"),
    "http_response_body_len": ("protocol_na", "HTTP-only: '-' ‚Üí Encode as -1"),
    "http_status_code": ("protocol_na", "HTTP-only: '-' ‚Üí 'NOT_APPLICABLE' or 0"),
    "http_orig_mime_types": ("protocol_na", "HTTP-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "http_resp_mime_types": ("protocol_na", "HTTP-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    
    "ssl_version": ("protocol_na", "SSL-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "ssl_cipher": ("protocol_na", "SSL-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "ssl_resumed": ("protocol_na", "SSL-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    "ssl_established": ("protocol_na", "SSL-only: '-' ‚Üí 'NOT_APPLICABLE'"),
    
    # Service field: "-" means service not identified
    "service": ("unknown_service", "'-' means Zeek couldn't identify service ‚Üí Encode as 'UNKNOWN'"),
    
    # Weird notice: "-" means no anomaly detected
    "weird_notice": ("boolean_false", "'-' means no weird event ‚Üí Encode as False/0"),
    
    # Features with no placeholders
    "duration": ("none", "No placeholders (numerical, always present)"),
    "src_bytes": ("none", "No placeholders"),
    "dst_bytes": ("none", "No placeholders"),
    "src_pkts": ("none", "No placeholders"),
    "dst_pkts": ("none", "No placeholders"),
    "src_ip_bytes": ("none", "No placeholders"),
    "dst_ip_bytes": ("none", "No placeholders"),
    "proto": ("none", "No placeholders (tcp/udp/icmp always present)"),
    "conn_state": ("none", "No placeholders (connection state always recorded)"),
    "src_port": ("none", "No placeholders"),
    "dst_port": ("none", "No placeholders"),
}

# Create placeholder strategy DataFrame
placeholder_strategies_df = pd.DataFrame([
    {
        "column": col,
        "strategy": strategy,
        "description": desc,
        "placeholder_pct": 0  # Will be extracted from Phase-0 analysis if available
    }
    for col, (strategy, desc) in PLACEHOLDER_STRATEGIES.items()
])

display(Markdown("### üîß Placeholder Handling Strategies (KEEP Features Only)"))
display(placeholder_strategies_df.sort_values("strategy"))

### üîß Placeholder Handling Strategies (KEEP Features Only)

Unnamed: 0,column,strategy,description,placeholder_pct
21,weird_notice,boolean_false,'-' means no weird event ‚Üí Encode as False/0,0
32,dst_port,none,No placeholders,0
30,conn_state,none,No placeholders (connection state always recor...,0
29,proto,none,No placeholders (tcp/udp/icmp always present),0
28,dst_ip_bytes,none,No placeholders,0
27,src_ip_bytes,none,No placeholders,0
26,dst_pkts,none,No placeholders,0
25,src_pkts,none,No placeholders,0
24,dst_bytes,none,No placeholders,0
23,src_bytes,none,No placeholders,0


In [None]:
# Save placeholder strategies
placeholder_strategies_df.to_csv(f"{PHASE_1_DIR}/phase1_placeholder_strategies.csv", index=False)
print("üíæ Saved: phase1_placeholder_strategies.csv")

üíæ Saved: phase1_placeholder_strategies.csv


---

## üî§ SECTION 4 ‚Äî Categorical Encoding Strategy

### Objectives:
1. Define encoding method for each categorical KEEP feature
2. Preserve interpretability for RAG retrieval & quantum reasoning
3. Balance: **One-hot** (low cardinality) vs **Ordinal** (natural order) vs **Frequency** (high cardinality)

In [19]:
# Categorical encoding strategies
CATEGORICAL_ENCODING = {
    # Low cardinality ‚Üí One-hot encoding (preserves distinctness)
    "proto": ("one_hot", "3 values (tcp/udp/icmp) ‚Üí 3 binary features"),
    "conn_state": ("one_hot", "14 states ‚Üí Preserves connection patterns distinctly"),
    "service": ("one_hot", "11 services + UNKNOWN ‚Üí One-hot for explainability"),
    
    # DNS-specific
    "dns_qclass": ("one_hot", "~4 values + NOT_APPLICABLE ‚Üí One-hot"),
    "dns_qtype": ("one_hot", "~30 values + NOT_APPLICABLE ‚Üí One-hot"),
    "dns_rcode": ("one_hot", "~16 values + NOT_APPLICABLE ‚Üí One-hot"),
    "dns_AA": ("one_hot", "3 values (T/F/NOT_APPLICABLE) ‚Üí One-hot"),
    "dns_RD": ("one_hot", "3 values ‚Üí One-hot"),
    "dns_RA": ("one_hot", "3 values ‚Üí One-hot"),
    "dns_rejected": ("one_hot", "3 values ‚Üí One-hot"),
    
    # HTTP-specific
    "http_method": ("one_hot", "~10 methods + NOT_APPLICABLE ‚Üí One-hot"),
    "http_version": ("ordinal", "Natural order: HTTP/0.9 < HTTP/1.0 < HTTP/1.1 < HTTP/2.0 (+ NOT_APPLICABLE=-1)"),
    "http_status_code": ("ordinal", "Natural order: 100s < 200s < 300s < 400s < 500s (+ NOT_APPLICABLE=0)"),
    "http_orig_mime_types": ("one_hot", "~30 MIME types + NOT_APPLICABLE ‚Üí One-hot"),
    "http_resp_mime_types": ("one_hot", "~30 MIME types + NOT_APPLICABLE ‚Üí One-hot"),
    
    # SSL-specific
    "ssl_version": ("ordinal", "Natural order: SSLv2 < SSLv3 < TLS1.0 < TLS1.1 < TLS1.2 < TLS1.3 (+ NOT_APPLICABLE=-1)"),
    "ssl_cipher": ("one_hot", "~50 cipher suites + NOT_APPLICABLE ‚Üí One-hot for crypto patterns"),
    "ssl_resumed": ("one_hot", "3 values (T/F/NOT_APPLICABLE) ‚Üí One-hot"),
    "ssl_established": ("one_hot", "3 values (T/F/NOT_APPLICABLE) ‚Üí One-hot"),
    
    # TCP missed bytes (numerical after placeholder handling)
    "missed_bytes": ("numerical_ordinal", "Numerical, but encode '-' as -1 first"),
    
    # Weird notice (boolean)
    "weird_notice": ("binary", "Boolean: True/False (no encoding needed if already 1/0)"),
}

# Create encoding strategy DataFrame
encoding_strategies_df = pd.DataFrame([
    {
        "column": col,
        "encoding_method": method,
        "rationale": reason,
        "unique_values": column_inventory[column_inventory["Column Name"] == col]["Unique Values"].values[0]
    }
    for col, (method, reason) in CATEGORICAL_ENCODING.items()
])

display(Markdown("### üî§ Categorical Encoding Strategies"))
display(encoding_strategies_df.sort_values("encoding_method"))

### üî§ Categorical Encoding Strategies

Unnamed: 0,column,encoding_method,rationale,unique_values
20,weird_notice,binary,Boolean: True/False (no encoding needed if alr...,2
19,missed_bytes,numerical_ordinal,"Numerical, but encode '-' as -1 first",8593
18,ssl_established,one_hot,3 values (T/F/NOT_APPLICABLE) ‚Üí One-hot,3
17,ssl_resumed,one_hot,3 values (T/F/NOT_APPLICABLE) ‚Üí One-hot,3
16,ssl_cipher,one_hot,~50 cipher suites + NOT_APPLICABLE ‚Üí One-hot f...,21
14,http_resp_mime_types,one_hot,~30 MIME types + NOT_APPLICABLE ‚Üí One-hot,10
13,http_orig_mime_types,one_hot,~30 MIME types + NOT_APPLICABLE ‚Üí One-hot,4
9,dns_rejected,one_hot,3 values ‚Üí One-hot,3
0,proto,one_hot,3 values (tcp/udp/icmp) ‚Üí 3 binary features,3
7,dns_RD,one_hot,3 values ‚Üí One-hot,3


In [None]:
# Save encoding strategies
encoding_strategies_df.to_csv(f"{PHASE_1_DIR}/phase1_encoding_strategies.csv", index=False)
print("üíæ Saved: phase1_encoding_strategies.csv")

üíæ Saved: phase1_encoding_strategies.csv


---

## üìê SECTION 5 ‚Äî Numerical Feature Treatment

### Objectives:
1. Define scaling strategy for numerical KEEP features
2. Handle zeros, outliers, skewness
3. Preserve interpretability for quantum-inspired reasoning

In [21]:
# Numerical feature treatment strategies
NUMERICAL_TREATMENT = {
    # Core behavioral features
    "duration": ("robust_scale", "High skew (17.8), outliers expected in DoS attacks ‚Üí RobustScaler"),
    "src_bytes": ("log_scale", "High skew (50.1), wide range ‚Üí Log transform + StandardScaler"),
    "dst_bytes": ("log_scale", "High skew (217.0), extreme values ‚Üí Log transform + StandardScaler"),
    "src_pkts": ("log_scale", "High skew (16.5), packet floods ‚Üí Log transform + StandardScaler"),
    "dst_pkts": ("log_scale", "High skew (47.3), wide range ‚Üí Log transform + StandardScaler"),
    "src_ip_bytes": ("log_scale", "High skew (37.3) ‚Üí Log transform + StandardScaler"),
    "dst_ip_bytes": ("log_scale", "High skew (185.8) ‚Üí Log transform + StandardScaler"),
    
    # Port numbers
    "src_port": ("standard_scale", "Moderate range (0-65535), uniform distribution ‚Üí StandardScaler"),
    "dst_port": ("standard_scale", "Well-known ports vs ephemeral ‚Üí StandardScaler"),
    
    # HTTP body lengths (after placeholder handling: -1 for NOT_APPLICABLE)
    "http_request_body_len": ("log_scale_with_na", "After encoding '-' as -1 ‚Üí Apply log(x+2) (shifts -1‚Üí0, 0‚Üí1)"),
    "http_response_body_len": ("log_scale_with_na", "After encoding '-' as -1 ‚Üí Apply log(x+2)"),
    
    # HTTP transaction depth (after encoding '-' as -1)
    "http_trans_depth": ("standard_scale_with_na", "After encoding '-' as -1 ‚Üí StandardScaler"),
}

# Create numerical treatment DataFrame
numerical_treatment_df = pd.DataFrame([
    {
        "column": col,
        "treatment": treatment,
        "rationale": reason
    }
    for col, (treatment, reason) in NUMERICAL_TREATMENT.items()
])

display(Markdown("### üìê Numerical Feature Treatment"))
display(numerical_treatment_df)

### üìê Numerical Feature Treatment

Unnamed: 0,column,treatment,rationale
0,duration,robust_scale,"High skew (17.8), outliers expected in DoS att..."
1,src_bytes,log_scale,"High skew (50.1), wide range ‚Üí Log transform +..."
2,dst_bytes,log_scale,"High skew (217.0), extreme values ‚Üí Log transf..."
3,src_pkts,log_scale,"High skew (16.5), packet floods ‚Üí Log transfor..."
4,dst_pkts,log_scale,"High skew (47.3), wide range ‚Üí Log transform +..."
5,src_ip_bytes,log_scale,High skew (37.3) ‚Üí Log transform + StandardScaler
6,dst_ip_bytes,log_scale,High skew (185.8) ‚Üí Log transform + StandardSc...
7,src_port,standard_scale,"Moderate range (0-65535), uniform distribution..."
8,dst_port,standard_scale,Well-known ports vs ephemeral ‚Üí StandardScaler
9,http_request_body_len,log_scale_with_na,After encoding '-' as -1 ‚Üí Apply log(x+2) (shi...


In [None]:
# Save numerical treatment
numerical_treatment_df.to_csv(f"{PHASE_1_DIR}/phase1_numerical_treatment.csv", index=False)
print("üíæ Saved: phase1_numerical_treatment.csv")

üíæ Saved: phase1_numerical_treatment.csv


---

## üìã SECTION 6 ‚Äî Final Retained Feature Set

### Objectives:
1. List all KEEP features grouped by role
2. Confirm feature count for frozen schema
3. Provide quick reference for Phase-2 preprocessing

In [37]:
# Get all KEEP features
keep_features = retention_decisions[retention_decisions["decision"] == "KEEP"]["column"].tolist()

# Group by role
feature_groups = {
    "Behavioral (Core IDS Signals)": [
        "duration", "src_bytes", "dst_bytes", "src_pkts", "dst_pkts", 
        "src_ip_bytes", "dst_ip_bytes"
    ],
    "Contextual (Protocol & State)": [
        "proto", "service", "conn_state", "src_port", "dst_port"
    ],
    "TCP-Specific Features": [
        "missed_bytes"
    ],
    "DNS-Specific Features": [
        "dns_qclass", "dns_qtype", "dns_rcode", "dns_AA", "dns_RD", "dns_RA", "dns_rejected"
    ],
    "HTTP-Specific Features": [
        "http_trans_depth", "http_method", "http_version", "http_request_body_len",
        "http_response_body_len", "http_status_code", "http_orig_mime_types", "http_resp_mime_types"
    ],
    "SSL-Specific Features": [
        "ssl_version", "ssl_cipher", "ssl_resumed", "ssl_established"
    ],
    "Anomaly Detection": [
        "weird_notice"
    ]
}

print("=" * 80)
print("FINAL RETAINED FEATURE SET")
print("=" * 80)
for group, features in feature_groups.items():
    print(f"\n{group} ({len(features)} features):")
    for feat in features:
        print(f"  ‚Ä¢ {feat}")

print(f"\n" + "=" * 80)
print(f"TOTAL RETAINED FEATURES: {len(keep_features)}")
print("=" * 80)

FINAL RETAINED FEATURE SET

Behavioral (Core IDS Signals) (7 features):
  ‚Ä¢ duration
  ‚Ä¢ src_bytes
  ‚Ä¢ dst_bytes
  ‚Ä¢ src_pkts
  ‚Ä¢ dst_pkts
  ‚Ä¢ src_ip_bytes
  ‚Ä¢ dst_ip_bytes

Contextual (Protocol & State) (5 features):
  ‚Ä¢ proto
  ‚Ä¢ service
  ‚Ä¢ conn_state
  ‚Ä¢ src_port
  ‚Ä¢ dst_port

TCP-Specific Features (1 features):
  ‚Ä¢ missed_bytes

DNS-Specific Features (7 features):
  ‚Ä¢ dns_qclass
  ‚Ä¢ dns_qtype
  ‚Ä¢ dns_rcode
  ‚Ä¢ dns_AA
  ‚Ä¢ dns_RD
  ‚Ä¢ dns_RA
  ‚Ä¢ dns_rejected

HTTP-Specific Features (8 features):
  ‚Ä¢ http_trans_depth
  ‚Ä¢ http_method
  ‚Ä¢ http_version
  ‚Ä¢ http_request_body_len
  ‚Ä¢ http_response_body_len
  ‚Ä¢ http_status_code
  ‚Ä¢ http_orig_mime_types
  ‚Ä¢ http_resp_mime_types

SSL-Specific Features (4 features):
  ‚Ä¢ ssl_version
  ‚Ä¢ ssl_cipher
  ‚Ä¢ ssl_resumed
  ‚Ä¢ ssl_established

Anomaly Detection (1 features):
  ‚Ä¢ weird_notice

TOTAL RETAINED FEATURES: 33


---

## üîí SECTION 7 ‚Äî Frozen Schema Definition (JSON Export)

### Objectives:
1. Create machine-readable **frozen_schema.json**
2. Lock all decisions: KEEP/DROP, encoding, scaling
3. This schema is **immutable** for all future phases

In [None]:
# Build frozen schema
frozen_schema = {
    "schema_version": "1.0",
    "created_date": pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S"),
    "dataset": "TON-IoT Processed Network",
    "total_features": len(keep_features),
    "dropped_features": len(retention_decisions[retention_decisions["decision"] == "DROP"]),
    
    "features": {}
}

# Add each retained feature with full preprocessing spec
for feature in keep_features:
    feature_spec = {
        "decision": "KEEP",
        "phase0_role": role_classification[role_classification["Column"] == feature]["Role"].values[0],
        "data_type": column_inventory[column_inventory["Column Name"] == feature]["Data Type"].values[0],
    }
    
    # Add placeholder handling if applicable
    if feature in PLACEHOLDER_STRATEGIES:
        strategy, desc = PLACEHOLDER_STRATEGIES[feature]
        feature_spec["placeholder_strategy"] = {
            "type": strategy,
            "description": desc
        }
    
    # Add encoding strategy if categorical
    if feature in CATEGORICAL_ENCODING:
        method, reason = CATEGORICAL_ENCODING[feature]
        feature_spec["encoding"] = {
            "method": method,
            "rationale": reason
        }
    
    # Add numerical treatment if numerical
    if feature in NUMERICAL_TREATMENT:
        treatment, reason = NUMERICAL_TREATMENT[feature]
        feature_spec["scaling"] = {
            "method": treatment,
            "rationale": reason
        }
    
    frozen_schema["features"][feature] = feature_spec

# Save frozen schema
frozen_schema_path = f"{PHASE_1_DIR}/frozen_schema.json"
with open(frozen_schema_path, 'w', encoding='utf-8') as f:
    json.dump(frozen_schema, f, indent=2)

print("üîí Frozen Schema Created!")
print(f"üíæ Saved: {frozen_schema_path}")
print(f"\nüìä Schema Summary:")
print(f"  ‚Ä¢ Retained Features: {frozen_schema['total_features']}")
print(f"  ‚Ä¢ Dropped Features: {frozen_schema['dropped_features']}")
print(f"  ‚Ä¢ Schema Version: {frozen_schema['schema_version']}")
print(f"  ‚Ä¢ Created: {frozen_schema['created_date']}")

üîí Frozen Schema Created!
üíæ Saved: ../artifacts/frozen_schema.json

üìä Schema Summary:
  ‚Ä¢ Retained Features: 33
  ‚Ä¢ Dropped Features: 14
  ‚Ä¢ Schema Version: 1.0
  ‚Ä¢ Created: 2026-01-31 14:41:11


In [30]:
# Display sample from frozen schema
display(Markdown("### üîç Sample from Frozen Schema"))
sample_features = ["duration", "proto", "http_method", "ssl_version"]
for feat in sample_features:
    if feat in frozen_schema["features"]:
        print(f"\n{feat}:")
        print(json.dumps(frozen_schema["features"][feat], indent=2))

### üîç Sample from Frozen Schema


duration:
{
  "decision": "KEEP",
  "phase0_role": "Behavioral",
  "data_type": "float64",
  "placeholder_strategy": {
    "type": "none",
    "description": "No placeholders (numerical, always present)"
  },
  "scaling": {
    "method": "robust_scale",
    "rationale": "High skew (17.8), outliers expected in DoS attacks \u2192 RobustScaler"
  }
}

proto:
{
  "decision": "KEEP",
  "phase0_role": "Contextual",
  "data_type": "object",
  "placeholder_strategy": {
    "type": "none",
    "description": "No placeholders (tcp/udp/icmp always present)"
  },
  "encoding": {
    "method": "one_hot",
    "rationale": "3 values (tcp/udp/icmp) \u2192 3 binary features"
  }
}

http_method:
{
  "decision": "KEEP",
  "phase0_role": "Unknown - Needs Review",
  "data_type": "object",
  "placeholder_strategy": {
    "type": "protocol_na",
    "description": "HTTP-only: '-' \u2192 Encode as 'NOT_APPLICABLE'"
  },
  "encoding": {
    "method": "one_hot",
    "rationale": "~10 methods + NOT_APPLICABLE \u

---

## üìù SECTION 8 ‚Äî Phase-1 Decision Summary Report

### Objectives:
1. Generate human-readable summary of all decisions
2. Justify KEEP/DROP choices with explainability focus
3. Provide context for Phase-2 implementation

In [None]:
def generate_phase1_summary():
    """Generate comprehensive Phase-1 summary report"""
    
    report = f"""# Phase-1 Feature Decision & Preprocessing Design Summary

## üéØ Executive Summary

**Project**: Quantum-RAG IoT IDS - Knowledge-Augmented Threat Detection  
**Phase**: Phase-1 (Feature Decision & Preprocessing Design)  
**Date**: {pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")}  
**Dataset**: TON-IoT Processed Network (22,339,021 records)

---

## üìä Decision Overview

### Feature Retention Statistics
- **Original Features**: {len(retention_decisions)} columns
- **KEEP**: {len(retention_decisions[retention_decisions['decision'] == 'KEEP'])} features ({len(retention_decisions[retention_decisions['decision'] == 'KEEP']) / len(retention_decisions) * 100:.1f}%)
- **DROP**: {len(retention_decisions[retention_decisions['decision'] == 'DROP'])} features ({len(retention_decisions[retention_decisions['decision'] == 'DROP']) / len(retention_decisions) * 100:.1f}%)

---

## ‚ùå Features Dropped (with Justification)

"""
    
    # Add dropped features with full details
    dropped = retention_decisions[retention_decisions["decision"] == "DROP"].sort_values("phase0_role")
    for _, row in dropped.iterrows():
        report += f"### {row['column']}\n"
        report += f"- **Phase-0 Role**: {row['phase0_role']}\n"
        report += f"- **Unique Values**: {row['unique_values']:,}\n"
        report += f"- **Decision**: DROP\n"
        report += f"- **Reasoning**: {row['reasoning']}\n\n"
    
    report += f"""

---

## ‚úÖ Features Retained (KEEP) - Detailed Breakdown

"""
    
    # Add KEEP features with full details
    kept = retention_decisions[retention_decisions["decision"] == "KEEP"].sort_values("column")
    for _, row in kept.iterrows():
        report += f"### {row['column']}\n"
        report += f"- **Phase-0 Role**: {row['phase0_role']}\n"
        report += f"- **Unique Values**: {row['unique_values']:,}\n"
        report += f"- **Decision**: KEEP\n"
        report += f"- **Reasoning**: {row['reasoning']}\n"
        
        # Add placeholder strategy if exists
        if row['column'] in PLACEHOLDER_STRATEGIES:
            strategy, desc = PLACEHOLDER_STRATEGIES[row['column']]
            report += f"- **Placeholder Handling**: {strategy}\n"
            report += f"  - {desc}\n"
        
        # Add encoding strategy if exists
        if row['column'] in CATEGORICAL_ENCODING:
            method, reason = CATEGORICAL_ENCODING[row['column']]
            report += f"- **Encoding Method**: {method}\n"
            report += f"  - {reason}\n"
        
        # Add numerical treatment if exists
        if row['column'] in NUMERICAL_TREATMENT:
            treatment, reason = NUMERICAL_TREATMENT[row['column']]
            report += f"- **Scaling Method**: {treatment}\n"
            report += f"  - {reason}\n"
        
        report += "\n"
    
    report += f"""

---

## üìã Feature Groups Summary

"""
    for group, features in feature_groups.items():
        report += f"### {group} ({len(features)} features)\n"
        for feat in features:
            reason = retention_decisions[retention_decisions['column'] == feat]['reasoning'].values[0]
            report += f"- **{feat}**: {reason}\n"
        report += "\n"
    
    report += f"""

---

## üîß Preprocessing Strategy Summary

### Placeholder Handling (`"-"` values)
- **Protocol-Specific Features**: `-` means "not applicable for this protocol" ‚Üí Encode as `NOT_APPLICABLE` or `-1`
- **Service Field**: `-` means "unknown service" ‚Üí Encode as `UNKNOWN`
- **Weird Notice**: `-` means "no anomaly" ‚Üí Encode as `False` (0)

### Complete Placeholder Strategy Table

| Feature | Strategy | Description |
|---------|----------|-------------|
"""
    
    for col, (strategy, desc) in PLACEHOLDER_STRATEGIES.items():
        report += f"| {col} | {strategy} | {desc} |\n"
    
    report += f"""

### Categorical Encoding Strategy

**Summary:**
- **One-Hot Encoding**: {len(encoding_strategies_df[encoding_strategies_df['encoding_method'] == 'one_hot'])} features (low cardinality: proto, conn_state, service, etc.)
- **Ordinal Encoding**: {len(encoding_strategies_df[encoding_strategies_df['encoding_method'] == 'ordinal'])} features (natural order: HTTP version, SSL version, status codes)
- **Binary**: 1 feature (weird_notice: True/False)

### Complete Encoding Strategy Table

| Feature | Encoding Method | Rationale | Unique Values |
|---------|----------------|-----------|---------------|
"""
    
    for _, row in encoding_strategies_df.iterrows():
        report += f"| {row['column']} | {row['encoding_method']} | {row['rationale']} | {row['unique_values']} |\n"
    
    report += f"""

### Numerical Scaling Strategy

**Summary:**
- **Log Transform + StandardScaler**: {len([k for k, v in NUMERICAL_TREATMENT.items() if 'log_scale' in v[0]])} features (high skew: bytes, packets)
- **RobustScaler**: 1 feature (duration: outliers expected in DoS attacks)
- **StandardScaler**: {len([k for k, v in NUMERICAL_TREATMENT.items() if v[0] == 'standard_scale'])} features (ports: uniform distribution)

### Complete Numerical Treatment Table

| Feature | Treatment Method | Rationale |
|---------|-----------------|-----------|
"""
    
    for col, (treatment, reason) in NUMERICAL_TREATMENT.items():
        report += f"| {col} | {treatment} | {reason} |\n"
    
    report += f"""

---

## üîí Frozen Schema Export

**File**: `frozen_schema.json`  
**Purpose**: Immutable preprocessing specification for all future phases  
**Schema Version**: 1.0
**Total Retained Features**: {len(keep_features)}
**Dropped Features**: {len(retention_decisions[retention_decisions['decision'] == 'DROP'])}

**Contents**:
- Feature retention decisions (KEEP/DROP)
- Placeholder handling strategies
- Categorical encoding methods
- Numerical scaling techniques
- Phase-0 role classifications
- Data type information

---

## üìä Why Each Feature Category Exists

| Category | Purpose | Importance for IDS |
|----------|---------|-------------------|
| **Behavioral (7)** | Core attack patterns: rates, volumes, duration, packet counts | Primary indicators of DoS, flooding, exfiltration, and scan attacks. These features describe *how* traffic behaves regardless of *who* generates it. |
| **Contextual (5)** | Protocol, service, connection state, port information | Essential context for similarity matching in RAG retrieval. Enables protocol-aware anomaly detection and attack surface identification. |
| **DNS-specific (7)** | Query patterns, response codes, recursion flags, rejection signals | DNS-based attacks (tunneling, DGA, reconnaissance) require specialized features. Protocol-specific behavioral indicators. |
| **HTTP-specific (8)** | Methods, versions, status codes, body lengths, MIME types, transaction depth | Web application layer attacks (injection, traversal, exfiltration) manifest in HTTP-specific patterns. Critical for API/web service protection. |
| **SSL-specific (4)** | TLS versions, cipher suites, session resumption, handshake status | Cryptographic attacks (downgrade, weak ciphers, MITM) require SSL-layer visibility. Detects outdated/vulnerable configurations. |
| **TCP-specific (1)** | Missed bytes indicator | TCP retransmission/loss patterns indicate network stress, DoS, or evasion attempts. Protocol-level reliability signal. |
| **Zeek Anomaly Flags (1)** | Expert-defined weird event indicator | Leverages Zeek's domain expertise in network anomalies. Complements behavioral features with human-expert signals. |

**Rationale**: Each category provides *unique attack visibility* that cannot be inferred from other categories. This multi-layer approach ensures:
- **RAG retrieval** can find similar attacks across different protocol layers
- **Quantum-inspired reasoning** can fuse evidence from multiple attack surfaces
- **Explainability** traces detections to specific protocol behaviors
- **Generalization** focuses on *what attackers do*, not *who they are*

---

## üéì Key Design Principles

### 1. Generalization Over Identity
- **Dropped** all identity-revealing features: `src_ip`, `dst_ip`, `uid`, `ts`
- **Dropped** high-cardinality identifiers: `dns_query`, `http_uri`, `http_user_agent`, `http_referrer`, `ssl_subject`, `ssl_issuer`
- **Rationale**: IDS must detect attacks based on *behavioral patterns*, not memorized IPs/domains

### 2. Explainability for RAG Retrieval
- **One-hot encoding** preferred over frequency/target encoding
- **Preserved** semantic meanings of categorical values (e.g., `conn_state=SF` vs `S0`)
- **Named encodings** (e.g., `NOT_APPLICABLE`) instead of arbitrary numbers
- **Rationale**: Quantum-inspired reasoning requires interpretable features for knowledge retrieval

### 3. Protocol-Aware Placeholder Handling
- **TCP features** (missed_bytes): `-` for UDP/ICMP ‚Üí `NOT_APPLICABLE`
- **DNS features**: `-` for non-DNS traffic ‚Üí `NOT_APPLICABLE`
- **HTTP features**: `-` for non-HTTP ‚Üí `-1` (numerical) or `NOT_APPLICABLE` (categorical)
- **SSL features**: `-` for non-SSL ‚Üí `NOT_APPLICABLE`
- **Rationale**: Preserves protocol semantics without treating placeholders as missing data

### 4. Robustness to Skew & Outliers
- **Log transforms** for highly skewed features (skew > 10)
- **RobustScaler** for attack-sensitive features (duration in DoS attacks)
- **Zero handling**: Log(x+1) prevents log(0) errors while preserving zero information
- **Rationale**: Attacks often manifest as outliers; preprocessing must not suppress them

---

## üìÇ Exported Artifacts

1. **`phase1_retention_decisions.csv`**: Full list of KEEP/DROP decisions with reasoning
2. **`phase1_placeholder_strategies.csv`**: Per-column placeholder handling rules
3. **`phase1_encoding_strategies.csv`**: Categorical encoding methods
4. **`phase1_numerical_treatment.csv`**: Numerical scaling strategies
5. **`frozen_schema.json`**: Machine-readable preprocessing specification
6. **`Phase_1_Decision_Summary_Report.md`**: This comprehensive human-readable report

---

## ‚úÖ Validation Checklist

- [x] All {len(retention_decisions)} original features classified as KEEP or DROP
- [x] Behavioral features (7) retained for IDS core signals
- [x] Identity features (6) dropped for generalization
- [x] Label features (2) excluded from input features
- [x] Protocol-specific placeholders handled semantically
- [x] Categorical features assigned encoding methods
- [x] Numerical features assigned scaling methods
- [x] Frozen schema exported for Phase-2 implementation

---

## üöÄ Next Phase: Phase-2 (Preprocessing Implementation)

**Objective**: Implement the frozen schema to transform raw TON-IoT data into ML-ready format  

**Tasks**:
1. Load `frozen_schema.json`
2. Apply placeholder handling transformations
3. Execute categorical encoding (one-hot, ordinal)
4. Apply numerical scaling (log, robust, standard)
5. Validate transformed schema matches frozen spec
6. Export processed dataset for Phase-3 (Model Training)

---

## üìä Complete Feature List by Decision

### KEEP Features ({len(keep_features)} total):
"""
    
    for feat in sorted(keep_features):
        report += f"- {feat}\n"
    
    report += f"""

### DROP Features ({len(retention_decisions[retention_decisions['decision'] == 'DROP'])} total):
"""
    
    drop_features = retention_decisions[retention_decisions["decision"] == "DROP"]["column"].tolist()
    for feat in sorted(drop_features):
        report += f"- {feat}\n"
    
    report += """

---

**End of Phase-1 Summary Report**
"""
    
    return report

# Generate and save report
summary_report = generate_phase1_summary()
report_path = f"{PHASE_1_DIR}/Phase_1_Decision_Summary_Report.md"
with open(report_path, 'w', encoding='utf-8') as f:
    f.write(summary_report)

print("‚úÖ Phase-1 Summary Report Generated!")
print(f"üíæ Saved: {report_path}")
print(f"\nüìÑ Report Length: {len(summary_report):,} characters")

# Display first section
display(Markdown("### üìù Report Preview (First 2000 chars)"))
display(Markdown(summary_report[:2000] + "\n\n...(truncated)..."))

‚úÖ Phase-1 Summary Report Generated!
üíæ Saved: ../artifacts/Phase_1_Decision_Summary_Report.md

üìÑ Report Length: 29,521 characters


### üìù Report Preview (First 2000 chars)

# Phase-1 Feature Decision & Preprocessing Design Summary

## üéØ Executive Summary

**Project**: Quantum-RAG IoT IDS - Knowledge-Augmented Threat Detection  
**Phase**: Phase-1 (Feature Decision & Preprocessing Design)  
**Date**: 2026-01-31 14:41:18  
**Dataset**: TON-IoT Processed Network (22,339,021 records)

---

## üìä Decision Overview

### Feature Retention Statistics
- **Original Features**: 47 columns
- **KEEP**: 33 features (70.2%)
- **DROP**: 14 features (29.8%)

---

## ‚ùå Features Dropped (with Justification)

### ts
- **Phase-0 Role**: Contextual
- **Unique Values**: 392,633
- **Decision**: DROP
- **Reasoning**: Timestamp is environment-specific, not generalizable

### uid
- **Phase-0 Role**: Identifier
- **Unique Values**: 999,966
- **Decision**: DROP
- **Reasoning**: Unique ID per connection, no behavioral value

### src_ip
- **Phase-0 Role**: Identifier
- **Unique Values**: 23,414
- **Decision**: DROP
- **Reasoning**: Source IP is identity-specific, prevents generalization

### dst_ip
- **Phase-0 Role**: Identifier
- **Unique Values**: 6,523
- **Decision**: DROP
- **Reasoning**: Destination IP is identity-specific

### type
- **Phase-0 Role**: Label/Ground Truth
- **Unique Values**: 10
- **Decision**: DROP
- **Reasoning**: Attack type label - used for evaluation only, not input

### label
- **Phase-0 Role**: Label/Ground Truth
- **Unique Values**: 2
- **Decision**: DROP
- **Reasoning**: Binary label (normal/attack) - target variable, not feature

### dns_query
- **Phase-0 Role**: Unknown - Needs Review
- **Unique Values**: 17,880
- **Decision**: DROP
- **Reasoning**: High cardinality (1M unique), identity-revealing domain names

### http_uri
- **Phase-0 Role**: Unknown - Needs Review
- **Unique Values**: 1,068
- **Decision**: DROP
- **Reasoning**: High cardinality relative to other categorical features, identity-revealing paths

### http_referrer
- **Phase-0 Role**: Unknown - Needs Review
- **Unique Values**: 5
- **Decision**: DROP
- **Reasoning**: High

...(truncated)...

---

## üéâ Phase-1 Complete!

### ‚úÖ Deliverables
1. ‚úÖ Feature retention decisions (KEEP: 32, DROP: 15)
2. ‚úÖ Placeholder handling strategies (protocol-aware semantic treatment)
3. ‚úÖ Categorical encoding specifications (one-hot, ordinal, binary)
4. ‚úÖ Numerical scaling strategies (log, robust, standard)
5. ‚úÖ **Frozen schema JSON** (immutable preprocessing spec)
6. ‚úÖ **Phase-1 summary report** (human-readable justifications)

### üîí Frozen Schema Status
- **Schema Version**: 1.0
- **Total Retained Features**: 32
- **File**: `artifacts/frozen_schema.json`
- **Status**: **LOCKED** for all future phases

### üöÄ Ready for Phase-2
Phase-2 will implement this frozen schema to transform the TON-IoT dataset into ML-ready format.

**Next Steps**:
1. Load `frozen_schema.json`
2. Apply transformations to all 23 CSV files
3. Validate transformed data against schema
4. Export processed dataset for model training