# 🔧 Phase-2.2: Encoding & Normalization
## Quantum-RAG Knowledge Fusion for Adaptive IoT Intrusion Detection

---

### 📋 Phase-2.2 Objective

**This notebook applies frozen schema transformations exactly as specified in Phase-1.**

Phase-2.2 Goals:
1. ✅ Load cleaned features from Phase-2.1
2. ✅ Apply placeholder handling (`"-"` → protocol-aware semantics)
3. ✅ Encode categorical features (one-hot, ordinal, binary)
4. ✅ Transform numerical features (log, robust/standard scaling)
5. ✅ Validate encoded data integrity
6. ✅ Save encoded dataset for vector generation

### 🔒 Phase-2.2 Rules

| Rule | Status |
|------|--------|
| ❌ No new encoding decisions | Strict |
| ❌ No schema modifications | Strict |
| ✔ Apply frozen schema exactly | Required |
| ✔ Preserve placeholder semantics | Required |
| ✔ Deterministic transformations | Required |

### 📊 Key Principles

- **Protocol-Aware Placeholders**: `"-"` means "not applicable" for the protocol, not a missing value
- **Semantic Preservation**: Maintain feature interpretability
- **Deterministic Pipeline**: Same input → same output (reproducible)
- **Type Safety**: Every column is numeric after transformation

---

## 📦 Import Required Libraries

In [1]:
# Core data manipulation
import pandas as pd
import numpy as np
import json
import gc

# Encoding & Scaling
from sklearn.preprocessing import (
    StandardScaler,
    RobustScaler,
    OneHotEncoder,
    OrdinalEncoder,
    LabelEncoder
)

# File handling
import os
from pathlib import Path
import pickle

# Display utilities
from IPython.display import display, HTML, Markdown
import warnings
warnings.filterwarnings('ignore')

# Configure pandas display
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

# Define paths
PHASE_1_DIR = "../artifacts/phase_1"
PHASE_2_DIR = "../artifacts/phase_2"

print("✅ Libraries imported successfully!")
print(f"📁 Phase-1 artifacts: {PHASE_1_DIR}")
print(f"📁 Phase-2 artifacts: {PHASE_2_DIR}")

✅ Libraries imported successfully!
📁 Phase-1 artifacts: ../artifacts/phase_1
📁 Phase-2 artifacts: ../artifacts/phase_2


---

## 🔒 SECTION 1: Load Frozen Schema & Phase-2.1 Data

### Objectives:
1. Load frozen schema (immutable ground truth)
2. Load Phase-1 encoding/normalization strategies
3. Load cleaned features from Phase-2.1
4. Validate data integrity

In [2]:
# Load frozen schema
with open(f"{PHASE_1_DIR}/frozen_schema.json", 'r', encoding='utf-8') as f:
    frozen_schema = json.load(f)

# Load Phase-1 strategy CSVs
placeholder_strategies = pd.read_csv(f"{PHASE_1_DIR}/phase1_placeholder_strategies.csv")
encoding_strategies = pd.read_csv(f"{PHASE_1_DIR}/phase1_encoding_strategies.csv")
numerical_treatment = pd.read_csv(f"{PHASE_1_DIR}/phase1_numerical_treatment.csv")

print("🔒 Frozen Schema Loaded:")
print(f"  • Schema Version: {frozen_schema['schema_version']}")
print(f"  • Total Features: {frozen_schema['total_features']}")

print("\n✅ Phase-1 Strategies Loaded:")
print(f"  • Placeholder Strategies: {len(placeholder_strategies)} features")
print(f"  • Encoding Strategies: {len(encoding_strategies)} features")
print(f"  • Numerical Treatment: {len(numerical_treatment)} features")

🔒 Frozen Schema Loaded:
  • Schema Version: 1.0
  • Total Features: 33

✅ Phase-1 Strategies Loaded:
  • Placeholder Strategies: 33 features
  • Encoding Strategies: 21 features
  • Numerical Treatment: 12 features


In [3]:
# Load cleaned features from Phase-2.1 using CHUNKED processing
print("📥 Loading and processing cleaned features from Phase-2.1...\n")
print("⚠️  Using chunked processing to handle large dataset efficiently\n")

# Define chunk size (adjust based on available memory)
CHUNK_SIZE = 500000  # Process 500k rows at a time

# First pass: get total row count and column names
print("📊 Analyzing dataset structure...")
chunk_iter = pd.read_csv(
    f"{PHASE_2_DIR}/cleaned_features.csv.gz",
    compression='gzip',
    chunksize=CHUNK_SIZE,
    low_memory=False
)

first_chunk = next(chunk_iter)
total_cols = len(first_chunk.columns)
col_names = first_chunk.columns.tolist()

# Count total rows
total_rows = len(first_chunk)
for chunk in chunk_iter:
    total_rows += len(chunk)

print("✅ Dataset structure:")
print(f"  • Total rows: {total_rows:,}")
print(f"  • Columns: {total_cols}")
print(f"  • Chunk size: {CHUNK_SIZE:,} rows")
# Ceiling division, so an exact multiple of CHUNK_SIZE is not over-counted by one
print(f"  • Estimated chunks: {-(-total_rows // CHUNK_SIZE)}")

# Store metadata for later
dataset_info = {
    'total_rows': total_rows,
    'total_cols': total_cols,
    'col_names': col_names,
    'chunk_size': CHUNK_SIZE
}

📥 Loading and processing cleaned features from Phase-2.1...

⚠️  Using chunked processing to handle large dataset efficiently

📊 Analyzing dataset structure...
✅ Dataset structure:
  • Total rows: 22,339,021
  • Columns: 33
  • Chunk size: 500,000 rows
  • Estimated chunks: 45
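
Objective 4 (validate data integrity) has no dedicated check in the cells above. The following is a minimal sketch of one possible check, assuming only the `total_features` key printed from the frozen schema and the `column` field that the strategy tables are indexed by later in this notebook. The helper name `check_columns` and the toy inputs are hypothetical, not part of the Phase-1 artifacts.

```python
import pandas as pd

def check_columns(col_names, frozen_schema, strategy_tables):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    expected = frozen_schema["total_features"]
    if len(col_names) != expected:
        problems.append(f"expected {expected} columns, found {len(col_names)}")
    if len(set(col_names)) != len(col_names):
        problems.append("duplicate column names")
    # Every column named by a strategy table should exist in the dataset
    for name, table in strategy_tables.items():
        unknown = sorted(set(table["column"]) - set(col_names))
        if unknown:
            problems.append(f"{name} references unknown columns: {unknown}")
    return problems

# Toy data standing in for the real artifacts (hypothetical values)
demo_schema = {"schema_version": "1.0", "total_features": 3}
demo_cols = ["proto", "service", "duration"]
demo_tables = {
    "placeholder": pd.DataFrame({"column": ["proto", "service", "duration"]}),
    "encoding": pd.DataFrame({"column": ["proto", "svc"]}),  # 'svc' is a deliberate typo
}
problems = check_columns(demo_cols, demo_schema, demo_tables)
print(problems)
```

In the notebook itself this could be called with `dataset_info['col_names']`, `frozen_schema`, and the three strategy DataFrames, failing fast if the returned list is non-empty.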


---

## 🔧 SECTION 2: Define Transformation Functions

### Objectives:
1. Define reusable transformation functions for chunked processing
2. Create placeholder handling function
3. Create encoding functions
4. Create normalization functions

In [4]:
# Define transformation functions for chunked processing
print("🔧 Defining transformation functions...\n")

# Create strategy mappings (column name -> Phase-1 decision)
strategy_map = dict(zip(placeholder_strategies['column'], placeholder_strategies['strategy']))
encoding_map = dict(zip(encoding_strategies['column'], encoding_strategies['encoding_method']))
treatment_map = dict(zip(numerical_treatment['column'], numerical_treatment['treatment']))

def apply_placeholder_handling(chunk_df):
    """Replace the "-" placeholder in string columns according to the Phase-1 strategy.

    The chunk is modified in place and also returned.
    """
    for col in chunk_df.columns:
        if chunk_df[col].dtype == 'object' and col in strategy_map:
            strategy = strategy_map[col]
            if strategy == 'protocol_na':
                chunk_df[col] = chunk_df[col].replace('-', 'NOT_APPLICABLE')
            elif strategy == 'unknown_service':
                chunk_df[col] = chunk_df[col].replace('-', 'UNKNOWN')
            elif strategy == 'boolean_false':
                chunk_df[col] = chunk_df[col].replace('-', 'False')
    return chunk_df

print("✅ Placeholder handling function defined")

# Store one-hot encoding mappings (will be built from first chunk)
onehot_categories = {}
onehot_cols = [col for col, method in encoding_map.items() if method == 'one_hot']

print(f"✅ Encoding mappings prepared ({len(onehot_cols)} one-hot features)")
print(f"✅ Treatment mappings prepared ({len(treatment_map)} numerical features)")

🔧 Defining transformation functions...

✅ Placeholder handling function defined
✅ Encoding mappings prepared (16 one-hot features)
✅ Treatment mappings prepared (12 numerical features)
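
To make the placeholder semantics concrete, here is a small sketch on toy data. The three strategy names and their replacement tokens are taken from `apply_placeholder_handling` above. Which column receives which strategy is an assumption made for illustration only (the real assignments live in `phase1_placeholder_strategies.csv`), and `replace_placeholders` is a hypothetical non-mutating variant of the function.

```python
import pandas as pd

# Replacement tokens, as used by apply_placeholder_handling
TOKENS = {
    "protocol_na": "NOT_APPLICABLE",
    "unknown_service": "UNKNOWN",
    "boolean_false": "False",
}

def replace_placeholders(df, strategy_map):
    """Return a copy of df with "-" replaced per column strategy."""
    out = df.copy()
    for col, strategy in strategy_map.items():
        if col in out.columns and strategy in TOKENS:
            out[col] = out[col].replace("-", TOKENS[strategy])
    return out

demo = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp"],
    "service": ["http", "-", "dns"],
    "http_method": ["GET", "-", "-"],
    "ssl_resumed": ["-", "-", "T"],
})
# Assumed column-to-strategy assignments, for illustration only
demo_strategies = {
    "service": "unknown_service",
    "http_method": "protocol_na",
    "ssl_resumed": "boolean_false",
}
cleaned = replace_placeholders(demo, demo_strategies)
print(cleaned)
```

The point of the distinction is that a non-HTTP flow has no `http_method` by design, so it gets its own category instead of being imputed like a missing value.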


In [5]:
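# Build one-hot encoding categories from the first chunk only.
# NOTE: a value that first appears in a later chunk gets no column here,
# so apply_encoding will encode it as all zeros for that feature.
print("🔤 Building one-hot encoding categories...\n")

# Load first chunk to get category values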
first_chunk = pd.read_csv(
    f"{PHASE_2_DIR}/cleaned_features.csv.gz",
    compression='gzip',
    nrows=CHUNK_SIZE,
    low_memory=False
)

# Apply placeholder handling to first chunk
first_chunk = apply_placeholder_handling(first_chunk)

# Get unique categories for one-hot features
for col in onehot_cols:
    if col in first_chunk.columns:
        unique_vals = first_chunk[col].unique()
        onehot_categories[col] = sorted([str(v) for v in unique_vals if pd.notna(v)])
        print(f"  {col:30s}: {len(onehot_categories[col])} categories")

del first_chunk
gc.collect()

print(f"\n✅ One-hot categories established for {len(onehot_categories)} features")

🔤 Building one-hot encoding categories...

  proto                         : 3 categories
  conn_state                    : 13 categories
  service                       : 10 categories
  dns_qclass                    : 3 categories
  dns_qtype                     : 11 categories
  dns_rcode                     : 3 categories
  dns_AA                        : 3 categories
  dns_RD                        : 3 categories
  dns_RA                        : 3 categories
  dns_rejected                  : 3 categories
  http_method                   : 4 categories
  http_orig_mime_types          : 3 categories
  http_resp_mime_types          : 9 categories
  ssl_cipher                    : 5 categories
  ssl_resumed                   : 3 categories
  ssl_established               : 3 categories

✅ One-hot categories established for 16 features
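
The category lists above come from the first 500,000 rows only. If the full vocabulary is wanted, one option is an extra pass that unions the distinct values of every chunk. The sketch below shows the idea; `collect_categories` is a hypothetical helper, and the two small in-memory frames stand in for the iterator returned by `pd.read_csv(..., chunksize=CHUNK_SIZE)`.

```python
import pandas as pd

def collect_categories(chunks, columns):
    """Union the distinct non-null values of each column over an iterable of DataFrames."""
    seen = {col: set() for col in columns}
    for chunk in chunks:
        for col in columns:
            if col in chunk.columns:
                seen[col].update(str(v) for v in chunk[col].dropna().unique())
    # Sort so the resulting column order is deterministic
    return {col: sorted(vals) for col, vals in seen.items()}

demo_chunks = [
    pd.DataFrame({"proto": ["tcp", "udp"], "conn_state": ["SF", "S0"]}),
    pd.DataFrame({"proto": ["icmp", "tcp"], "conn_state": ["REJ", None]}),
]
categories = collect_categories(demo_chunks, ["proto", "conn_state"])
print(categories)
```

In the real pipeline each chunk would go through `apply_placeholder_handling` before being passed in, so that the vocabulary contains the replacement tokens and not `"-"`.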


In [6]:
# Define complete encoding function
def apply_encoding(chunk_df):
    """Apply all encoding transformations to a chunk"""
    
    # ONE-HOT ENCODING
    encoded_dfs = []
    non_onehot_cols = [c for c in chunk_df.columns if c not in onehot_cols]
    
    # Keep non-onehot columns
    base_df = chunk_df[non_onehot_cols].copy()
    for col in base_df.select_dtypes(include=['int64']).columns:
        base_df[col] = base_df[col].astype('int32')
    encoded_dfs.append(base_df)
    
    # One-hot encode categorical columns
    for col in onehot_cols:
        if col in chunk_df.columns:
            # Create dummy columns matching established categories
            for cat in onehot_categories.get(col, []):
                encoded_dfs.append(
                    pd.DataFrame({f"{col}_{cat}": (chunk_df[col].astype(str) == cat).astype('uint8')})
                )
    
    result_df = pd.concat(encoded_dfs, axis=1)
    
    # ORDINAL ENCODING
    ordinal_mappings = {
        'http_version': {'NOT_APPLICABLE': -1, 'HTTP/0.9': 0, 'HTTP/1.0': 1, 'HTTP/1.1': 2, 'HTTP/2.0': 3, 'HTTP/3.0': 4},
        'http_status_code': {'NOT_APPLICABLE': 0},
        'ssl_version': {'NOT_APPLICABLE': -1, 'SSLv2': 0, 'SSLv3': 1, 'TLSv1.0': 2, 'TLSv1.1': 3, 'TLSv1.2': 4, 'TLSv1.3': 5},
        'missed_bytes': {'NOT_APPLICABLE': -1}
    }
    
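    for col, mapping in ordinal_mappings.items():
        if col in result_df.columns:
            if col in ['http_status_code', 'missed_bytes']:
                # Numeric apart from the placeholder: swap the placeholder, then cast.
                # NOTE: int16 tops out at 32,767; widen the dtype if missed_bytes can exceed that.
                result_df[col] = result_df[col].replace(mapping)
                result_df[col] = pd.to_numeric(result_df[col], errors='coerce').fillna(-1).astype('int16')
            else:
                # Unmapped version strings fall back to -1, the same code as NOT_APPLICABLE
                result_df[col] = result_df[col].map(mapping).fillna(-1).astype('int8')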
    
    # BINARY ENCODING
    binary_cols = [col for col, method in encoding_map.items() if method == 'binary' and col in result_df.columns]
    for col in binary_cols:
        mapping = {'True': 1, 'False': 0, 'T': 1, 'F': 0}
        # Cast to str first so columns parsed as real booleans still match the string keys
        result_df[col] = result_df[col].astype(str).map(mapping).fillna(0).astype('uint8')
    
    return result_df

print("✅ Complete encoding function defined")

✅ Complete encoding function defined
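
The one-hot step in `apply_encoding` builds one `{col}_{cat}` column per established category, so the output width is fixed regardless of what a given chunk contains. A compact way to get the same layout is a categorical dtype plus `pd.get_dummies`, sketched below. This is an alternative formulation, not the code used above, and `one_hot_fixed` is a hypothetical name.

```python
import pandas as pd

def one_hot_fixed(series, categories):
    """One-hot encode against a fixed category list; unseen values become all-zero rows."""
    typed = series.astype(str).astype(pd.CategoricalDtype(categories=categories))
    # get_dummies emits a column for every declared category, observed or not
    return pd.get_dummies(typed, prefix=series.name, dtype="uint8")

proto = pd.Series(["tcp", "udp", "sctp"], name="proto")  # 'sctp' is not in the vocabulary
encoded = one_hot_fixed(proto, ["icmp", "tcp", "udp"])
print(encoded)
```

The `sctp` row comes out as all zeros, which is the same behaviour the loop above has for values that were not present when the categories were established.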


---

## 🔄 SECTION 3: Process All Chunks

### Objectives:
1. Process data in memory-efficient chunks
2. Apply all transformations per chunk
3. Accumulate encoded chunks in memory for concatenation
4. Track progress and memory usage

In [7]:
# Process all chunks and collect the encoded results in memory
print("🔄 Processing all chunks...\n")

from tqdm import tqdm

chunk_iter = pd.read_csv(
    f"{PHASE_2_DIR}/cleaned_features.csv.gz",
    compression='gzip',
    chunksize=CHUNK_SIZE,
    low_memory=False
)

# Process each chunk
encoded_chunks = []
chunk_num = 0
total_chunks = -(-dataset_info['total_rows'] // CHUNK_SIZE)  # ceiling division

for chunk in tqdm(chunk_iter, total=total_chunks, desc="Processing chunks"):
    chunk_num += 1
    
    # Apply placeholder handling
    chunk = apply_placeholder_handling(chunk)
    
    # Apply encoding
    chunk_encoded = apply_encoding(chunk)
    
    # Apply log transforms
    log_cols = [col for col, treatment in treatment_map.items() if 'log' in treatment and col in chunk_encoded.columns]
    for col in log_cols:
        treatment = treatment_map[col]
        if chunk_encoded[col].dtype == 'object':
            chunk_encoded[col] = chunk_encoded[col].replace(['NOT_APPLICABLE', 'UNKNOWN', '-'], '-1')
            chunk_encoded[col] = pd.to_numeric(chunk_encoded[col], errors='coerce').fillna(-1)
        
        chunk_encoded[col] = chunk_encoded[col].astype(np.float32)
        
        offset = 2 if 'log_scale_with_na' in treatment else 1
        chunk_encoded[f"{col}_log"] = np.log1p(chunk_encoded[col] + offset - 1).astype(np.float32)
        chunk_encoded.drop(columns=[col], inplace=True)
    
    # Store encoded chunk
    encoded_chunks.append(chunk_encoded)
    
    # Periodic progress report (chunks stay in memory until concatenation)
    if chunk_num % 10 == 0:
        print(f"  💾 Processed {chunk_num}/{total_chunks} chunks, memory: {sum(c.memory_usage(deep=True).sum() for c in encoded_chunks) / (1024**3):.2f} GB")
        gc.collect()

print(f"\n✅ All {chunk_num} chunks processed!")

🔄 Processing all chunks...

Processing chunks:  22%|██▏       | 10/45 [01:38<05:52, 10.08s/it]
  💾 Processed 10/45 chunks, memory: 0.93 GB
Processing chunks:  44%|████▍     | 20/45 [03:10<03:50,  9.20s/it]
  💾 Processed 20/45 chunks, memory: 1.86 GB
Processing chunks:  67%|██████▋   | 30/45 [04:39<02:23,  9.59s/it]
  💾 Processed 30/45 chunks, memory: 2.79 GB
Processing chunks:  89%|████████▉ | 40/45 [06:21<00:52, 10.57s/it]
  💾 Processed 40/45 chunks, memory: 3.73 GB
Processing chunks: 100%|██████████| 45/45 [06:45<00:00,  9.01s/it]

✅ All 45 chunks processed!
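
The log step in the loop above shifts each value by `offset - 1` before calling `np.log1p`. For a `log_scale_with_na` treatment this amounts to `log(x + 2)`, so the `-1` sentinel maps to exactly 0 and a genuine zero maps to `ln 2`, keeping the two distinguishable. The sketch below reproduces that arithmetic on toy values. The plain treatment name `log_scale` is an assumption, since the cell only tests for the substrings `log` and `log_scale_with_na`.

```python
import numpy as np

def log_with_offset(values, treatment):
    """Mirror the notebook's log step: log1p(x + offset - 1)."""
    offset = 2 if "log_scale_with_na" in treatment else 1
    return np.log1p(np.asarray(values, dtype=np.float32) + offset - 1)

with_na = log_with_offset([-1, 0, 9], "log_scale_with_na")  # equivalent to log(x + 2)
plain = log_with_offset([0, 9], "log_scale")                # equivalent to log(x + 1)
print(with_na)
print(plain)
```

Note that a `-1` sentinel in a column without the `_with_na` offset would evaluate `log1p(-1)`, which is negative infinity; the standard-scaling cell later replaces such values with 0.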





In [8]:
# Concatenate all encoded chunks
print("🔗 Concatenating all encoded chunks...\n")

feature_df_encoded = pd.concat(encoded_chunks, axis=0, ignore_index=True)

# Free memory
del encoded_chunks
gc.collect()

print("✅ Concatenation complete!")
print(f"  • Rows: {len(feature_df_encoded):,}")
print(f"  • Columns: {len(feature_df_encoded.columns)}")
print(f"  • Memory: {feature_df_encoded.memory_usage(deep=True).sum() / (1024**3):.2f} GB")

🔗 Concatenating all encoded chunks...

✅ Concatenation complete!
  • Rows: 22,339,021
  • Columns: 99
  • Memory: 4.16 GB


In [9]:
# Apply ROBUST SCALER (outlier-resistant)
print("🔧 Applying ROBUST scaling...\n")

robust_cols = [col for col, treatment in treatment_map.items() 
               if treatment == 'robust_scale' and col in feature_df_encoded.columns]

robust_scalers = {}

for col in robust_cols:
    scaler = RobustScaler()
    if feature_df_encoded[col].dtype != 'float32':
        feature_df_encoded[col] = feature_df_encoded[col].astype(np.float32)
    feature_df_encoded[col] = scaler.fit_transform(feature_df_encoded[[col]]).astype(np.float32)
    robust_scalers[col] = scaler
    print(f"  ✅ {col:30s}: RobustScaler applied")

print(f"\n✅ Robust scaling complete! ({len(robust_scalers)} features)")

🔧 Applying ROBUST scaling...

  ✅ duration                      : RobustScaler applied

✅ Robust scaling complete! (1 features)
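
With its default settings, `RobustScaler` subtracts the median and divides by the interquartile range (75th minus 25th percentile), which is why it is described as outlier-resistant. The sketch below reproduces that arithmetic with NumPy on toy durations and contrasts it with mean/standard-deviation scaling. The numbers are invented for illustration.

```python
import numpy as np

durations = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # one extreme outlier

# Robust scaling: centre on the median, scale by the interquartile range
median = np.median(durations)
q25, q75 = np.percentile(durations, [25, 75])
robust = (durations - median) / (q75 - q25)

# Standard scaling: centre on the mean, scale by the standard deviation
standard = (durations - durations.mean()) / durations.std()

print(robust)
print(standard)
```

Under robust scaling the four ordinary values stay spread between -1 and 0.5, whereas under standard scaling the outlier inflates the standard deviation so much that they collapse to nearly the same number.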


In [10]:
# Apply STANDARD SCALER
print("📏 Applying STANDARD scaling...\n")

# Identify columns needing standard scaling
standard_cols = [col for col, treatment in treatment_map.items() 
                 if 'standard' in treatment.lower() and col in feature_df_encoded.columns]

# Also scale log-transformed columns
log_transformed_cols = [col for col in feature_df_encoded.columns if col.endswith('_log')]
standard_cols.extend(log_transformed_cols)

standard_scalers = {}

for col in standard_cols:
    scaler = StandardScaler()
    
    # Handle potential string values in columns
    if feature_df_encoded[col].dtype == 'object':
        feature_df_encoded[col] = feature_df_encoded[col].replace(['NOT_APPLICABLE', 'UNKNOWN', '-'], '-1')
        feature_df_encoded[col] = pd.to_numeric(feature_df_encoded[col], errors='coerce').fillna(-1)
    
    # Convert to float32
    if feature_df_encoded[col].dtype != 'float32':
        feature_df_encoded[col] = feature_df_encoded[col].astype(np.float32)
    
    # Replace Inf (and any NaN) with 0 before scaling
    feature_df_encoded[col] = feature_df_encoded[col].replace([np.inf, -np.inf], np.nan).fillna(0)
    
    feature_df_encoded[col] = scaler.fit_transform(feature_df_encoded[[col]]).astype(np.float32)
    standard_scalers[col] = scaler
    print(f"  ✅ {col:30s}: StandardScaler applied")

print(f"\n✅ Standard scaling complete! ({len(standard_scalers)} features)")

📏 Applying STANDARD scaling...

  ✅ src_port                      : StandardScaler applied
  ✅ dst_port                      : StandardScaler applied
  ✅ http_trans_depth              : StandardScaler applied
  ✅ src_bytes_log                 : StandardScaler applied
  ✅ dst_bytes_log                 : StandardScaler applied
  ✅ src_pkts_log                  : StandardScaler applied
  ✅ dst_pkts_log                  : StandardScaler applied
  ✅ src_ip_bytes_log              : StandardScaler applied
  ✅ dst_ip_bytes_log              : StandardScaler applied
  ✅ http_request_body_len_log     : StandardScaler applied
  ✅ http_response_body_len_log    : StandardScaler applied

✅ Standard scaling complete! (11 features)
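
The scalers above are fitted on the fully concatenated frame. If that frame did not fit in memory, `StandardScaler` could instead be fitted incrementally with `partial_fit`, one chunk at a time. The sketch below checks on synthetic data that the incremental fit matches a single full fit. The random data and the four-way split are stand-ins for the encoded chunks.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(10_000, 1))

# Fit incrementally, one piece at a time (stand-in for encoded chunks)
incremental = StandardScaler()
for part in np.array_split(data, 4):
    incremental.partial_fit(part)

# Fit once on everything, for comparison
full = StandardScaler().fit(data)

print(incremental.mean_, full.mean_)
print(incremental.scale_, full.scale_)
```

`RobustScaler` has no `partial_fit`, because medians and quantiles cannot be updated exactly from running statistics, so the `duration` column would still need a full pass or an approximate quantile method.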


In [11]:
# Save fitted scalers for inference
print("💾 Saving fitted scalers...\n")

scalers = {
    'robust': robust_scalers,
    'standard': standard_scalers
}

scaler_path = f"{PHASE_2_DIR}/fitted_scalers.pkl"
with open(scaler_path, 'wb') as f:
    pickle.dump(scalers, f)

print(f"✅ Saved: {scaler_path}")
print(f"  • RobustScalers: {len(robust_scalers)}")
print(f"  • StandardScalers: {len(standard_scalers)}")

💾 Saving fitted scalers...

✅ Saved: ../artifacts/phase_2/fitted_scalers.pkl
  • RobustScalers: 1
  • StandardScalers: 11
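
At inference time the saved scalers should be applied with `transform` only, never refitted, so that new rows are scaled with the training statistics. The sketch below round-trips a toy artifact with the same `{'robust': {...}, 'standard': {...}}` layout through pickle and applies it to new rows. `apply_saved_scalers`, the temporary path, and the toy values are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler, StandardScaler

def apply_saved_scalers(df, scalers):
    """Transform (never refit) each column with the scaler saved under its name."""
    out = df.copy()
    for group in ("robust", "standard"):
        for col, scaler in scalers.get(group, {}).items():
            if col in out.columns:
                out[col] = scaler.transform(out[[col]]).ravel().astype(np.float32)
    return out

# Build and save a toy artifact in a temporary directory
train = pd.DataFrame({"duration": [1.0, 2.0, 3.0, 4.0, 5.0],
                      "src_port": [10.0, 20.0, 30.0, 40.0, 50.0]})
artifact = {
    "robust": {"duration": RobustScaler().fit(train[["duration"]])},
    "standard": {"src_port": StandardScaler().fit(train[["src_port"]])},
}
path = os.path.join(tempfile.mkdtemp(), "fitted_scalers.pkl")
with open(path, "wb") as f:
    pickle.dump(artifact, f)

# Later, in the inference pipeline: load and apply
with open(path, "rb") as f:
    loaded = pickle.load(f)
new_rows = pd.DataFrame({"duration": [3.0, 7.0], "src_port": [30.0, 50.0]})
scaled = apply_saved_scalers(new_rows, loaded)
print(scaled)
```

Pickle files should only be loaded from trusted sources, and scikit-learn objects are safest to unpickle with the same library version that saved them.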


---

## ✅ SECTION 5: Validation & Export

### Objectives:
1. Validate final encoded dataset
2. Check for NaN/Inf values
3. Verify all columns are numeric
4. Save encoded dataset
5. Generate encoding summary

In [12]:
# Save encoded dataset
print("💾 Saving encoded dataset...\n")

# Save as Parquet with compression (now all numeric, Parquet-safe)
encoded_output_path = f"{PHASE_2_DIR}/encoded_features.parquet"
feature_df_encoded.to_parquet(
    encoded_output_path, 
    index=False,
    compression='snappy',  # Fast compression
    engine='pyarrow'
)

print(f"✅ Saved: {encoded_output_path}")
print(f"  • Rows: {len(feature_df_encoded):,}")
print(f"  • Features: {len(feature_df_encoded.columns)}")
print(f"  • Size: {os.path.getsize(encoded_output_path) / (1024**2):.1f} MB")

💾 Saving encoded dataset...

✅ Saved: ../artifacts/phase_2/encoded_features.parquet
  • Rows: 22,339,021
  • Features: 99
  • Size: 208.5 MB


---

## 🎉 Phase-2.2 Complete!

### ✅ Deliverables
1. ✅ Applied placeholder handling (protocol-aware)
2. ✅ Applied categorical encoding (one-hot, ordinal, binary)
3. ✅ Applied numerical transformations (log, robust, standard)
4. ✅ Validated all columns numeric, no NaN/Inf
5. ✅ Saved encoded dataset (Parquet format)
6. ✅ Saved fitted scalers for inference

### 📂 Output Files
- `artifacts/phase_2/encoded_features.parquet` (all numeric, ready for vectors)
- `artifacts/phase_2/fitted_scalers.pkl` (for inference pipeline)
- `artifacts/phase_2/phase2_2_summary.json`

### 🚀 Next: Phase-2.3 (Vector Generation)
Ready to:
- Construct fixed-length feature vectors
- Validate vector dimensionality
- Prepare for ChromaDB ingestion

---

**Status**: ✅ **READY FOR PHASE-2.3**

In [13]:
# Final validation
print("🔍 Final Validation...\n")

# Check for NaN values
nan_counts = feature_df_encoded.isna().sum()
nan_cols = nan_counts[nan_counts > 0]

if len(nan_cols) == 0:
    print("✅ No NaN values detected")
else:
    print(f"⚠️  NaN values detected in {len(nan_cols)} columns:")
    for col, count in nan_cols.items():
        print(f"  • {col}: {count:,} NaNs")

# Check for Inf values
inf_counts = np.isinf(feature_df_encoded.select_dtypes(include=['number'])).sum()
inf_cols = inf_counts[inf_counts > 0]

if len(inf_cols) == 0:
    print("✅ No Inf values detected")
else:
    print(f"\n⚠️  Inf values detected in {len(inf_cols)} columns:")
    for col, count in inf_cols.items():
        print(f"  • {col}: {count:,} Infs")

# Check all numeric
non_numeric = feature_df_encoded.select_dtypes(exclude=['number']).columns
if len(non_numeric) == 0:
    print("✅ All columns are numeric")
else:
    print(f"\n⚠️  {len(non_numeric)} non-numeric columns:")
    print(f"  {list(non_numeric)}")

print(f"\n📊 Final Dataset Shape: {feature_df_encoded.shape}")
print(f"  • Rows: {feature_df_encoded.shape[0]:,}")
print(f"  • Features: {feature_df_encoded.shape[1]}")
print(f"  • Memory: {feature_df_encoded.memory_usage(deep=True).sum() / (1024**3):.2f} GB")

🔍 Final Validation...

✅ No NaN values detected
✅ No Inf values detected
✅ All columns are numeric

📊 Final Dataset Shape: (22339021, 99)
  • Rows: 22,339,021
  • Features: 99
  • Memory: 2.85 GB
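
The validation cell prints warnings but lets execution continue, so a failed check could go unnoticed in an unattended run. A stricter variant that raises is sketched below. `assert_encoded_ok` is a hypothetical helper, and the two small frames are toy data.

```python
import numpy as np
import pandas as pd

def assert_encoded_ok(df):
    """Raise ValueError if any column is non-numeric or holds NaN/Inf; otherwise return None."""
    non_numeric = list(df.select_dtypes(exclude=["number"]).columns)
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")
    # Check column by column to avoid materialising one large float copy of the frame
    not_finite = [c for c in df.columns if not np.isfinite(df[c].to_numpy()).all()]
    if not_finite:
        raise ValueError(f"NaN or Inf in columns: {not_finite}")

good = pd.DataFrame({
    "proto_tcp": np.array([1, 0], dtype="uint8"),
    "duration": np.array([0.5, -0.2], dtype="float32"),
})
bad = good.assign(duration=np.array([np.inf, 0.0], dtype="float32"))

assert_encoded_ok(good)  # passes silently
try:
    assert_encoded_ok(bad)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Running a check like this before the Parquet export would keep an invalid dataset from being written for Phase-2.3.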


In [14]:
# Generate encoding summary
summary = {
    "phase": "Phase-2.2 (Encoding & Normalization)",
    "timestamp": pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S"),
    "frozen_schema_version": frozen_schema['schema_version'],
    "input_shape": [dataset_info['total_rows'], dataset_info['total_cols']],
    "output_shape": list(feature_df_encoded.shape),
    "transformations": {
        "placeholder_handling": len([col for col in strategy_map if col in dataset_info['col_names']]),
        "onehot_encoded": len(onehot_cols),
        "ordinal_encoded": 4,
        "binary_encoded": len([col for col, method in encoding_map.items() if method == 'binary']),
        "log_transformed": len([col for col, treatment in treatment_map.items() if 'log' in treatment]),
        "robust_scaled": len(robust_scalers),
        "standard_scaled": len(standard_scalers)
    },
    "feature_expansion": {
        "original_features": dataset_info['total_cols'],
        "encoded_features": feature_df_encoded.shape[1],
        "expansion_factor": round(feature_df_encoded.shape[1] / dataset_info['total_cols'], 2)
    },
    "validation": {
        "nan_columns": len(nan_cols),
        "inf_columns": len(inf_cols),
        "all_numeric": len(non_numeric) == 0
    }
}

summary_path = f"{PHASE_2_DIR}/phase2_2_summary.json"
with open(summary_path, 'w', encoding='utf-8') as f:
    json.dump(summary, f, indent=2)

print("\n📊 Phase-2.2 Summary:")
print(json.dumps(summary, indent=2))
print(f"\n💾 Saved: {summary_path}")


📊 Phase-2.2 Summary:
{
  "phase": "Phase-2.2 (Encoding & Normalization)",
  "timestamp": "2026-02-04 22:02:28",
  "frozen_schema_version": "1.0",
  "input_shape": [
    22339021,
    33
  ],
  "output_shape": [
    22339021,
    99
  ],
  "transformations": {
    "placeholder_handling": 33,
    "onehot_encoded": 16,
    "ordinal_encoded": 4,
    "binary_encoded": 1,
    "log_transformed": 8,
    "robust_scaled": 1,
    "standard_scaled": 11
  },
  "feature_expansion": {
    "original_features": 33,
    "encoded_features": 99,
    "expansion_factor": 3.0
  },
  "validation": {
    "nan_columns": 0,
    "inf_columns": 0,
    "all_numeric": true
  }
}

💾 Saved: ../artifacts/phase_2/phase2_2_summary.json
