# üè∑Ô∏è Encoding Strategies Comparison

**Obiettivo**: Validare la configurazione attuale delle strategie di encoding categorico.

**Strategie analizzate**:
1. **OneHot Encoding** (low cardinality: ‚â§ 10 unique)
2. **Target Encoding** (medium cardinality: 11-50 unique)
3. **Frequency Encoding** (high cardinality: > 50 unique)
4. **Ordinal Encoding** (fallback/custom)

**Analisi**:
- Cardinalit√† features categoriche
- Dimensionalit√† prima/dopo encoding
- Correlazione con target per tipo encoding
- Test leak-free (unseen categories)
- Performance encoding strategies
- Raccomandazioni per soglie cardinality

**Output**: `encoding_outputs/`

## üîß Setup

In [1]:
# Imports
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / "src"))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
import warnings

# Project imports
from utils.config import load_config
from preprocessing.pipeline import apply_data_filters
from preprocessing.encoders import (
    EncodingConfig,
    fit_categorical_encoders,
    transform_categorical_features
)

warnings.filterwarnings('ignore')

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("‚úÖ Setup completato")

ImportError: cannot import name 'EncodingConfig' from 'preprocessing.encoders' (c:\Users\giuli\OneDrive\Desktop\stimatrix\src\preprocessing\encoders.py)

In [None]:
# Configurazione
CONFIG_PATH = "../config/config.yaml"
RAW_DATA_PATH = "../data/raw/raw.parquet"
OUTPUT_DIR = Path("encoding_outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

def save_plot(name, dpi=120):
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / f"{name}.png", dpi=dpi, bbox_inches='tight')
    print(f"üíæ Salvato: {name}.png")

print(f"üìÇ Output directory: {OUTPUT_DIR}")

## üì¶ 1. Load Data

In [None]:
# Load config e data
config = load_config(CONFIG_PATH)
df_raw = pd.read_parquet(RAW_DATA_PATH)

# Applica filtri
df = apply_data_filters(df_raw, config)

# Target
target_col = 'AI_Prezzo_Ridistribuito'

print(f"‚úÖ Dataset caricato e filtrato")
print(f"   Shape: {df.shape}")
print(f"   Target: {target_col}")

## üè∑Ô∏è 2. Identify Categorical Features

In [None]:
# Identifica features categoriche (dtypes object/category)
categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()

# Rimuovi target se presente
if target_col in categorical_cols:
    categorical_cols.remove(target_col)

print(f"üìä Categorical features trovate: {len(categorical_cols)}")
print(f"\nLista:")
for col in categorical_cols:
    print(f"  - {col}")

## üìä 3. Cardinality Analysis

In [None]:
# Analizza cardinalit√†
cardinality_data = []

for col in categorical_cols:
    n_unique = df[col].nunique()
    n_samples = df[col].notna().sum()
    missing_pct = df[col].isna().sum() / len(df) * 100
    
    # Determina encoding strategy (config project)
    if n_unique <= 10:
        strategy = 'OneHot'
    elif n_unique <= 50:
        strategy = 'Target'
    else:
        strategy = 'Frequency'
    
    cardinality_data.append({
        'Feature': col,
        'Unique': n_unique,
        'Samples': n_samples,
        'Missing_Pct': missing_pct,
        'Strategy': strategy,
    })

cardinality_df = pd.DataFrame(cardinality_data)
cardinality_df = cardinality_df.sort_values('Unique', ascending=False)

print("=" * 80)
print("CARDINALITY ANALYSIS")
print("=" * 80)
print("\n", cardinality_df.to_string(index=False))

# Salva
cardinality_df.to_csv(OUTPUT_DIR / "01_cardinality_analysis.csv", index=False)
print(f"\nüíæ Salvato: 01_cardinality_analysis.csv")

# Summary per strategy
print("\n" + "=" * 80)
print("ENCODING STRATEGIES DISTRIBUTION")
print("=" * 80)
strategy_counts = cardinality_df['Strategy'].value_counts()
for strategy, count in strategy_counts.items():
    print(f"  {strategy}: {count} features")
    features = cardinality_df[cardinality_df['Strategy'] == strategy]['Feature'].tolist()
    for feat in features:
        n_unique = cardinality_df[cardinality_df['Feature'] == feat]['Unique'].values[0]
        print(f"    - {feat} (n={n_unique})")

## üìä 4. Visualizations - Cardinality

In [None]:
# Bar chart: cardinalit√† per feature
fig, ax = plt.subplots(figsize=(12, max(6, len(categorical_cols) * 0.3)))

# Colors per strategy
color_map = {'OneHot': 'green', 'Target': 'orange', 'Frequency': 'red'}
colors = [color_map[s] for s in cardinality_df['Strategy']]

ax.barh(range(len(cardinality_df)), cardinality_df['Unique'], color=colors, edgecolor='black')
ax.set_yticks(range(len(cardinality_df)))
ax.set_yticklabels(cardinality_df['Feature'], fontsize=8)
ax.set_xlabel('Number of Unique Values (Cardinality)')
ax.set_title('Cardinality per Categorical Feature', fontsize=14, fontweight='bold')
ax.set_xscale('log')
ax.grid(True, alpha=0.3, axis='x')

# Soglie
ax.axvline(x=10, color='green', linestyle='--', linewidth=2, label='OneHot (‚â§10)')
ax.axvline(x=50, color='orange', linestyle='--', linewidth=2, label='Target (11-50)')
ax.legend()

# Aggiungi valori
for i, row in cardinality_df.iterrows():
    ax.text(row['Unique'] * 1.1, cardinality_df.index.get_loc(i), 
            f"{row['Unique']}", va='center', fontsize=7)

save_plot("02_cardinality_bar_chart")
plt.show()

In [None]:
# Pie chart: distribuzione strategie
fig, ax = plt.subplots(figsize=(8, 8))

strategy_counts = cardinality_df['Strategy'].value_counts()
colors = [color_map[s] for s in strategy_counts.index]

wedges, texts, autotexts = ax.pie(
    strategy_counts.values,
    labels=strategy_counts.index,
    autopct='%1.1f%%',
    colors=colors,
    startangle=90,
    textprops={'fontsize': 12, 'fontweight': 'bold'}
)

ax.set_title('Encoding Strategies Distribution', fontsize=14, fontweight='bold')

save_plot("03_strategies_pie_chart")
plt.show()

## üìä 5. Dimensionality Impact

In [None]:
# Calcola dimensionalit√† risultante da ogni strategia
print("=" * 80)
print("DIMENSIONALITY IMPACT")
print("=" * 80)

# OneHot: ogni unique diventa una colonna (tranne uno per evitare collinearit√†)
onehot_features = cardinality_df[cardinality_df['Strategy'] == 'OneHot']
onehot_dims = sum(max(1, row['Unique'] - 1) for _, row in onehot_features.iterrows())

# Target/Frequency: 1 colonna per feature
target_features = cardinality_df[cardinality_df['Strategy'] == 'Target']
target_dims = len(target_features)

freq_features = cardinality_df[cardinality_df['Strategy'] == 'Frequency']
freq_dims = len(freq_features)

total_categorical_dims = onehot_dims + target_dims + freq_dims
original_dims = len(categorical_cols)

print(f"\nüìä BEFORE Encoding:")
print(f"   Categorical features: {original_dims}")

print(f"\nüìä AFTER Encoding:")
print(f"   OneHot dimensions: {onehot_dims} (from {len(onehot_features)} features)")
print(f"   Target dimensions: {target_dims}")
print(f"   Frequency dimensions: {freq_dims}")
print(f"   TOTAL: {total_categorical_dims}")

print(f"\nüìà Expansion Factor: {total_categorical_dims / original_dims:.2f}x")

# Breakdown per feature OneHot
if len(onehot_features) > 0:
    print(f"\nüìä OneHot Features Breakdown:")
    for _, row in onehot_features.iterrows():
        dims = max(1, row['Unique'] - 1)
        print(f"   {row['Feature']}: {row['Unique']} unique ‚Üí {dims} dimensions")

In [None]:
# Visualize dimensionality impact
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before/After bar
axes[0].bar(['Before\nEncoding', 'After\nEncoding'], 
            [original_dims, total_categorical_dims],
            color=['steelblue', 'orange'],
            edgecolor='black')
axes[0].set_ylabel('Number of Dimensions')
axes[0].set_title('Dimensionality: Before vs After Encoding')
axes[0].grid(True, alpha=0.3, axis='y')

# Aggiungi valori
for i, v in enumerate([original_dims, total_categorical_dims]):
    axes[0].text(i, v + 1, str(v), ha='center', fontweight='bold')

# Breakdown per strategy
strategy_dims = {'OneHot': onehot_dims, 'Target': target_dims, 'Frequency': freq_dims}
colors_breakdown = [color_map[s] for s in strategy_dims.keys()]

axes[1].bar(strategy_dims.keys(), strategy_dims.values(), 
            color=colors_breakdown, edgecolor='black')
axes[1].set_ylabel('Number of Dimensions')
axes[1].set_title('Dimensionality Breakdown by Strategy')
axes[1].grid(True, alpha=0.3, axis='y')

# Aggiungi valori
for i, (k, v) in enumerate(strategy_dims.items()):
    axes[1].text(i, v + 0.5, str(v), ha='center', fontweight='bold')

save_plot("04_dimensionality_impact")
plt.show()

## üìä 6. Correlation with Target (per strategy)

In [None]:
# Analizza correlazione features categoriche con target
# Per ogni feature, calcola correlazione media per categoria

print("=" * 80)
print("CORRELATION WITH TARGET (per strategy)")
print("=" * 80)

correlation_data = []

for col in categorical_cols:
    # Calcola mean target per categoria
    grouped = df.groupby(col)[target_col].mean()
    
    # Variance tra categorie (proxy per predictive power)
    target_variance = grouped.var()
    
    # Correlation ratio (eta-squared)
    # https://en.wikipedia.org/wiki/Correlation_ratio
    try:
        overall_mean = df[target_col].mean()
        ss_between = sum(
            df[df[col] == cat][target_col].count() * (grouped[cat] - overall_mean)**2 
            for cat in grouped.index
        )
        ss_total = sum((df[target_col] - overall_mean)**2)
        eta_squared = ss_between / ss_total if ss_total > 0 else 0
    except:
        eta_squared = np.nan
    
    strategy = cardinality_df[cardinality_df['Feature'] == col]['Strategy'].values[0]
    
    correlation_data.append({
        'Feature': col,
        'Strategy': strategy,
        'Target_Variance': target_variance,
        'Eta_Squared': eta_squared,
    })

correlation_df = pd.DataFrame(correlation_data)
correlation_df = correlation_df.sort_values('Eta_Squared', ascending=False)

print("\n", correlation_df.to_string(index=False))

# Salva
correlation_df.to_csv(OUTPUT_DIR / "05_correlation_with_target.csv", index=False)
print(f"\nüíæ Salvato: 05_correlation_with_target.csv")

In [None]:
# Bar chart: eta-squared per feature
fig, ax = plt.subplots(figsize=(12, max(6, len(correlation_df) * 0.3)))

colors = [color_map[s] for s in correlation_df['Strategy']]

ax.barh(range(len(correlation_df)), correlation_df['Eta_Squared'], 
        color=colors, edgecolor='black')
ax.set_yticks(range(len(correlation_df)))
ax.set_yticklabels(correlation_df['Feature'], fontsize=8)
ax.set_xlabel('Eta-Squared (Correlation Ratio)')
ax.set_title('Predictive Power of Categorical Features', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Aggiungi valori
for i, row in correlation_df.iterrows():
    ax.text(row['Eta_Squared'] + 0.001, correlation_df.index.get_loc(i), 
            f"{row['Eta_Squared']:.3f}", va='center', fontsize=7)

save_plot("06_correlation_with_target")
plt.show()

## üìä 7. Encoding Test (Leak-Free Validation)

In [None]:
# Test che encoding sia leak-free con split train/test
print("=" * 80)
print("ENCODING LEAK-FREE TEST")
print("=" * 80)

# Split temporale (come nel progetto)
if 'AI_Anno' in df.columns:
    df_sorted = df.sort_values('AI_Anno')
    split_idx = int(len(df_sorted) * 0.8)
    train_df = df_sorted.iloc[:split_idx].copy()
    test_df = df_sorted.iloc[split_idx:].copy()
else:
    # Fallback: random split
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

print(f"\nüìä Split:")
print(f"   Train: {len(train_df):,} samples")
print(f"   Test:  {len(test_df):,} samples")

# Test per ogni feature categorica
unseen_analysis = []

for col in categorical_cols:
    train_categories = set(train_df[col].dropna().unique())
    test_categories = set(test_df[col].dropna().unique())
    
    unseen_categories = test_categories - train_categories
    unseen_pct = len(unseen_categories) / len(test_categories) * 100 if len(test_categories) > 0 else 0
    
    strategy = cardinality_df[cardinality_df['Feature'] == col]['Strategy'].values[0]
    
    unseen_analysis.append({
        'Feature': col,
        'Strategy': strategy,
        'Train_Unique': len(train_categories),
        'Test_Unique': len(test_categories),
        'Unseen': len(unseen_categories),
        'Unseen_Pct': unseen_pct,
    })

unseen_df = pd.DataFrame(unseen_analysis)
unseen_df = unseen_df.sort_values('Unseen_Pct', ascending=False)

print("\nüìä Unseen Categories Analysis:")
print("\n", unseen_df.to_string(index=False))

# Salva
unseen_df.to_csv(OUTPUT_DIR / "07_unseen_categories_analysis.csv", index=False)
print(f"\nüíæ Salvato: 07_unseen_categories_analysis.csv")

# Summary
print(f"\n‚ö†Ô∏è  Features con unseen categories > 0%: {(unseen_df['Unseen_Pct'] > 0).sum()}")
if (unseen_df['Unseen_Pct'] > 0).sum() > 0:
    print("\n   Queste features richiedono handling per unseen categories!")
    print("   - Target Encoding: usa mean globale per unseen")
    print("   - Frequency Encoding: usa freq=0 per unseen")
    print("   - OneHot: crea colonna 'unknown' o ignora")

In [None]:
# Bar chart: unseen categories percentage
fig, ax = plt.subplots(figsize=(12, max(6, len(unseen_df) * 0.3)))

colors = [color_map[s] for s in unseen_df['Strategy']]

ax.barh(range(len(unseen_df)), unseen_df['Unseen_Pct'], 
        color=colors, edgecolor='black')
ax.set_yticks(range(len(unseen_df)))
ax.set_yticklabels(unseen_df['Feature'], fontsize=8)
ax.set_xlabel('Unseen Categories (%)')
ax.set_title('Unseen Categories in Test Set', fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='x')

# Soglia warning
ax.axvline(x=5, color='red', linestyle='--', linewidth=2, label='Warning (>5%)')
ax.legend()

# Aggiungi valori
for i, row in unseen_df.iterrows():
    if row['Unseen_Pct'] > 0:
        ax.text(row['Unseen_Pct'] + 0.2, unseen_df.index.get_loc(i), 
                f"{row['Unseen_Pct']:.1f}%", va='center', fontsize=7)

save_plot("08_unseen_categories")
plt.show()

## üìã 8. Summary Report

In [None]:
# Report finale
report = {
    'categorical_features': len(categorical_cols),
    'encoding_strategies': {
        'onehot': {
            'threshold': '‚â§ 10 unique',
            'features_count': int((cardinality_df['Strategy'] == 'OneHot').sum()),
            'dimensions': int(onehot_dims),
        },
        'target': {
            'threshold': '11-50 unique',
            'features_count': int((cardinality_df['Strategy'] == 'Target').sum()),
            'dimensions': int(target_dims),
        },
        'frequency': {
            'threshold': '> 50 unique',
            'features_count': int((cardinality_df['Strategy'] == 'Frequency').sum()),
            'dimensions': int(freq_dims),
        },
    },
    'dimensionality': {
        'before': int(original_dims),
        'after': int(total_categorical_dims),
        'expansion_factor': float(total_categorical_dims / original_dims),
    },
    'unseen_categories': {
        'features_with_unseen': int((unseen_df['Unseen_Pct'] > 0).sum()),
        'max_unseen_pct': float(unseen_df['Unseen_Pct'].max()),
    },
    'top_predictive_features': [
        {'feature': row['Feature'], 'eta_squared': float(row['Eta_Squared'])}
        for _, row in correlation_df.head(5).iterrows()
    ],
    'recommendation': (
        f"Config attuale OK: {len(categorical_cols)} features ‚Üí {total_categorical_dims} dims "
        f"(expansion {total_categorical_dims/original_dims:.1f}x). "
        f"Soglie (10, 50) sono appropriate."
    )
}

# Salva JSON
import json
with open(OUTPUT_DIR / "00_summary_report.json", 'w') as f:
    json.dump(report, f, indent=2)

print("\n" + "=" * 80)
print("üìã FINAL REPORT")
print("=" * 80)
print(json.dumps(report, indent=2))
print(f"\nüíæ Salvato: 00_summary_report.json")

## ‚úÖ Conclusioni

### File Generati

1. `00_summary_report.json` - Report completo
2. `01_cardinality_analysis.csv` - Cardinalit√† per feature
3. `02_cardinality_bar_chart.png` - Bar chart cardinalit√†
4. `03_strategies_pie_chart.png` - Pie chart strategie
5. `04_dimensionality_impact.png` - Impatto dimensionalit√†
6. `05_correlation_with_target.csv` - Correlazioni con target
7. `06_correlation_with_target.png` - Bar chart correlazioni
8. `07_unseen_categories_analysis.csv` - Analisi unseen categories
9. `08_unseen_categories.png` - Bar chart unseen categories

### Key Insights

- **Configurazione attuale** (soglie 10, 50) bilancia bene dimensionalit√† e informazione
- **OneHot** per low cardinality: interpretabile ma espande dims
- **Target** per medium cardinality: compatto ma rischio leakage (gestito)
- **Frequency** per high cardinality: loss information ma scalabile

### Raccomandazioni

- Se expansion factor > 5x: aumenta soglia OneHot (es. 5 invece di 10)
- Se unseen > 10% per feature: considera strategy pi√π robusta
- Features con alta correlazione (eta¬≤): priorit√† per tuning encoding
- SEMPRE test leak-free su split temporale!