# 🔍 Outlier Detection Analysis

**Objective**: Compare outlier detection methods and validate the current ensemble configuration.

**Methods analyzed**:
1. IQR (Interquartile Range)
2. Z-Score (μ ± k·σ)
3. Modified Z-Score (median-based, robust)
4. Isolation Forest ← **Used in the project**
5. LOF (Local Outlier Factor)
6. DBSCAN
7. Elliptic Envelope
8. **Ensemble** (IQR + Z-Score + Isolation Forest) ← **Current config**

**Analysis**:
- Number of outliers detected per method
- Overlap between methods (Venn diagram)
- Scatter plots with outliers highlighted
- Impact on statistics (mean, median, std)
- Analysis by group (OMI zone, cadastral category)

**Output**: `outliers_outputs/`
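
The three univariate rules in the method list (IQR, Z-Score, Modified Z-Score) can be sketched as small NumPy functions. This is a minimal illustration, not the project's `preprocessing.outliers` implementation; the function names and the toy prices are assumptions made for the example.

```python
import numpy as np

def iqr_mask(x, factor=1.5):
    """Flag values outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - factor * iqr) | (x > q3 + factor * iqr)

def zscore_mask(x, k=3.0):
    """Flag values more than k (population) standard deviations from the mean."""
    return np.abs(x - x.mean()) > k * x.std()

def modified_zscore_mask(x, threshold=3.5):
    """Flag values using the median and MAD instead of the mean and std."""
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return np.abs(0.6745 * (x - med) / (mad + 1e-8)) > threshold

# Toy prices (hypothetical values): one extreme value among ordinary ones.
prices = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 5000.0])

for name, rule in [("IQR", iqr_mask),
                   ("Z-Score", zscore_mask),
                   ("Modified Z-Score", modified_zscore_mask)]:
    print(f"{name}: {int(rule(prices).sum())} outlier(s)")
```

On this toy array the Z-Score rule at k = 3 flags nothing. With 8 samples, a single extreme value inflates the mean and standard deviation so much that its own |z| cannot exceed √7 ≈ 2.65. The median-based rules do not suffer from this masking, which is why the Modified Z-Score is described as robust.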

## 🔧 Setup

In [1]:
# Imports
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / "src"))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN
from sklearn.covariance import EllipticEnvelope
import warnings

# matplotlib_venn is optional
try:
    from matplotlib_venn import venn2, venn3
    HAS_VENN = True
except ImportError:
    HAS_VENN = False
    print("⚠️  matplotlib-venn is not installed. Venn diagrams will be skipped.")
    print("   Install with: pip install matplotlib-venn")

# Project imports
from utils.config import load_config
from preprocessing.pipeline import apply_data_filters
from preprocessing.outliers import OutlierConfig, detect_outliers

warnings.filterwarnings('ignore')

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✅ Setup complete")

In [None]:
# Configuration
CONFIG_PATH = "../config/config.yaml"
RAW_DATA_PATH = "../data/raw/raw.parquet"
OUTPUT_DIR = Path("outliers_outputs")
OUTPUT_DIR.mkdir(exist_ok=True)

def save_plot(name, dpi=120):
    """Save the current matplotlib figure as a PNG in OUTPUT_DIR."""
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / f"{name}.png", dpi=dpi, bbox_inches='tight')
    print(f"💾 Saved: {name}.png")

print(f"📂 Output directory: {OUTPUT_DIR}")

## 📦 1. Load Data

In [None]:
# Load config and data
config = load_config(CONFIG_PATH)
df_raw = pd.read_parquet(RAW_DATA_PATH)

# Apply the project filters
df = apply_data_filters(df_raw, config)

# Target (rows with a missing target are dropped)
target_col = 'AI_Prezzo_Ridistribuito'
y = df[target_col].dropna()

print("✅ Dataset loaded and filtered")
print(f"   Samples: {len(y):,}")
print(f"   Range: €{y.min():,.0f} - €{y.max():,.0f}")
print("\n📊 Initial statistics:")
print(f"   Mean: €{y.mean():,.0f}")
print(f"   Median: €{y.median():,.0f}")
print(f"   Std: €{y.std():,.0f} (CV={y.std()/y.mean()*100:.1f}%)")
print(f"   Skewness: {y.skew():.2f}")
print(f"   Kurtosis: {y.kurtosis():.2f}")

## 🔍 2. Apply All Outlier Detection Methods

In [None]:
# Dictionaries collecting the boolean mask and the outlier count of every method
outlier_masks = {}
outlier_counts = {}

y_array = y.values
n_samples = len(y_array)

print("=" * 80)
print("OUTLIER DETECTION - ALL METHODS")
print("=" * 80)

# 1. IQR Method
Q1 = np.percentile(y_array, 25)
Q3 = np.percentile(y_array, 75)
IQR = Q3 - Q1
iqr_factor = 1.5  # Standard
lower = Q1 - iqr_factor * IQR
upper = Q3 + iqr_factor * IQR
outlier_masks['IQR (1.5)'] = (y_array < lower) | (y_array > upper)
outlier_counts['IQR (1.5)'] = outlier_masks['IQR (1.5)'].sum()
print(f"\n1. IQR Method (factor=1.5):")
print(f"   Outliers: {outlier_counts['IQR (1.5)']} ({outlier_counts['IQR (1.5)']/n_samples*100:.2f}%)")
print(f"   Bounds: €{lower:,.0f} - €{upper:,.0f}")

# 2. IQR Method (project config: factor=1.0, more aggressive)
iqr_factor_proj = 1.0
lower_proj = Q1 - iqr_factor_proj * IQR
upper_proj = Q3 + iqr_factor_proj * IQR
outlier_masks['IQR (1.0) ← Project'] = (y_array < lower_proj) | (y_array > upper_proj)
outlier_counts['IQR (1.0) ← Project'] = outlier_masks['IQR (1.0) ← Project'].sum()
print(f"\n2. IQR Method (factor=1.0) ← CONFIG PROJECT:")
print(f"   Outliers: {outlier_counts['IQR (1.0) ← Project']} ({outlier_counts['IQR (1.0) ← Project']/n_samples*100:.2f}%)")
print(f"   Bounds: €{lower_proj:,.0f} - €{upper_proj:,.0f}")

# 3. Z-Score Method
z_scores = np.abs(stats.zscore(y_array))
z_threshold = 3.0  # Standard
outlier_masks['Z-Score (3.0)'] = z_scores > z_threshold
outlier_counts['Z-Score (3.0)'] = outlier_masks['Z-Score (3.0)'].sum()
print(f"\n3. Z-Score Method (threshold=3.0):")
print(f"   Outliers: {outlier_counts['Z-Score (3.0)']} ({outlier_counts['Z-Score (3.0)']/n_samples*100:.2f}%)")

# 4. Z-Score (project config: 2.5, more aggressive)
z_threshold_proj = 2.5
outlier_masks['Z-Score (2.5) ← Project'] = z_scores > z_threshold_proj
outlier_counts['Z-Score (2.5) ← Project'] = outlier_masks['Z-Score (2.5) ← Project'].sum()
print(f"\n4. Z-Score Method (threshold=2.5) ← CONFIG PROJECT:")
print(f"   Outliers: {outlier_counts['Z-Score (2.5) ← Project']} ({outlier_counts['Z-Score (2.5) ← Project']/n_samples*100:.2f}%)")

# 5. Modified Z-Score (median-based, more robust)
median = np.median(y_array)
mad = np.median(np.abs(y_array - median))
modified_z_scores = 0.6745 * (y_array - median) / (mad + 1e-8)  # 1e-8 avoids division by zero
outlier_masks['Modified Z-Score'] = np.abs(modified_z_scores) > 3.5
outlier_counts['Modified Z-Score'] = outlier_masks['Modified Z-Score'].sum()
print(f"\n5. Modified Z-Score Method:")
print(f"   Outliers: {outlier_counts['Modified Z-Score']} ({outlier_counts['Modified Z-Score']/n_samples*100:.2f}%)")

# 6. Isolation Forest
contamination = 0.08  # Project config
iso_forest = IsolationForest(contamination=contamination, random_state=42)
y_pred_iso = iso_forest.fit_predict(y_array.reshape(-1, 1))
outlier_masks['Isolation Forest ← Project'] = y_pred_iso == -1
outlier_counts['Isolation Forest ← Project'] = outlier_masks['Isolation Forest ← Project'].sum()
print(f"\n6. Isolation Forest (contamination=0.08) ← CONFIG PROJECT:")
print(f"   Outliers: {outlier_counts['Isolation Forest ← Project']} ({outlier_counts['Isolation Forest ← Project']/n_samples*100:.2f}%)")

# 7. Local Outlier Factor
lof = LocalOutlierFactor(contamination=contamination)
y_pred_lof = lof.fit_predict(y_array.reshape(-1, 1))
outlier_masks['LOF'] = y_pred_lof == -1
outlier_counts['LOF'] = outlier_masks['LOF'].sum()
print(f"\n7. Local Outlier Factor:")
print(f"   Outliers: {outlier_counts['LOF']} ({outlier_counts['LOF']/n_samples*100:.2f}%)")

# 8. Elliptic Envelope
try:
    ee = EllipticEnvelope(contamination=contamination, random_state=42)
    y_pred_ee = ee.fit_predict(y_array.reshape(-1, 1))
    outlier_masks['Elliptic Envelope'] = y_pred_ee == -1
    outlier_counts['Elliptic Envelope'] = outlier_masks['Elliptic Envelope'].sum()
    print(f"\n8. Elliptic Envelope:")
    print(f"   Outliers: {outlier_counts['Elliptic Envelope']} ({outlier_counts['Elliptic Envelope']/n_samples*100:.2f}%)")
except Exception as e:
    print(f"\n8. Elliptic Envelope: FAILED ({e})")

# 9. ENSEMBLE (IQR + Z-Score + Isolation Forest) - PROJECT CONFIG
# The ensemble is the union: a sample is an outlier if ANY of the three methods flags it.
ensemble_mask = (
    outlier_masks['IQR (1.0) ← Project'] |
    outlier_masks['Z-Score (2.5) ← Project'] |
    outlier_masks['Isolation Forest ← Project']
)
outlier_masks['ENSEMBLE ← Project'] = ensemble_mask
outlier_counts['ENSEMBLE ← Project'] = ensemble_mask.sum()
print(f"\n9. ENSEMBLE (IQR 1.0 + Z 2.5 + ISO 0.08) ← CONFIG PROJECT:")
print(f"   Outliers: {outlier_counts['ENSEMBLE ← Project']} ({outlier_counts['ENSEMBLE ← Project']/n_samples*100:.2f}%)")

print("\n" + "=" * 80)

## 📊 3. Comparison Table

In [None]:
# Build the comparison table
comparison_data = []

for method, mask in outlier_masks.items():
    n_outliers = mask.sum()
    pct_outliers = n_outliers / n_samples * 100

    # Statistics of the samples kept by this method
    inliers = y_array[~mask]

    comparison_data.append({
        'Method': method,
        'Outliers': n_outliers,
        'Outliers_Pct': pct_outliers,
        'Inliers_Mean': inliers.mean(),
        'Inliers_Median': np.median(inliers),
        'Inliers_Std': inliers.std(),
        'Inliers_Skew': stats.skew(inliers),
        'Inliers_Kurt': stats.kurtosis(inliers),
    })

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('Outliers', ascending=False)

print("=" * 80)
print("COMPARISON TABLE")
print("=" * 80)
print("\n" + comparison_df.to_string(index=False))

# Save
comparison_df.to_csv(OUTPUT_DIR / "01_methods_comparison.csv", index=False)
print(f"\n💾 Saved: 01_methods_comparison.csv")

## 📊 4. Impact on Statistics

In [None]:
# Compare statistics before and after outlier removal
print("\n" + "=" * 80)
print("IMPACT ON STATISTICS - ENSEMBLE METHOD")
print("=" * 80)

ensemble_inliers = y_array[~outlier_masks['ENSEMBLE ← Project']]

print(f"\n📊 BEFORE (with outliers):")
print(f"   N:        {len(y_array):,}")
print(f"   Mean:     €{y_array.mean():,.0f}")
print(f"   Median:   €{np.median(y_array):,.0f}")
print(f"   Std:      €{y_array.std():,.0f}")
print(f"   Skewness: {stats.skew(y_array):.2f}")
print(f"   Kurtosis: {stats.kurtosis(y_array):.2f}")

print(f"\n📊 AFTER (outliers removed):")
print(f"   N:        {len(ensemble_inliers):,} (-{len(y_array) - len(ensemble_inliers):,})")
print(f"   Mean:     €{ensemble_inliers.mean():,.0f} ({(ensemble_inliers.mean() - y_array.mean())/y_array.mean()*100:+.1f}%)")
print(f"   Median:   €{np.median(ensemble_inliers):,.0f} ({(np.median(ensemble_inliers) - np.median(y_array))/np.median(y_array)*100:+.1f}%)")
print(f"   Std:      €{ensemble_inliers.std():,.0f} ({(ensemble_inliers.std() - y_array.std())/y_array.std()*100:+.1f}%)")
print(f"   Skewness: {stats.skew(ensemble_inliers):.2f} ({stats.skew(ensemble_inliers) - stats.skew(y_array):+.2f})")
print(f"   Kurtosis: {stats.kurtosis(ensemble_inliers):.2f} ({stats.kurtosis(ensemble_inliers) - stats.kurtosis(y_array):+.2f})")

## 📊 5. Visualizations

In [None]:
# Bar chart: number of outliers per method
fig, ax = plt.subplots(figsize=(12, 6))

methods = list(outlier_counts.keys())
counts = list(outlier_counts.values())
colors = ['red' if 'Project' in m or 'ENSEMBLE' in m else 'steelblue' for m in methods]

ax.barh(range(len(methods)), counts, color=colors, edgecolor='black')
ax.set_yticks(range(len(methods)))
ax.set_yticklabels(methods, fontsize=9)
ax.set_xlabel('Number of Outliers')
ax.set_title('Outlier Detection Methods Comparison')
ax.grid(True, alpha=0.3, axis='x')

# Add count and percentage labels next to each bar
for i, count in enumerate(counts):
    pct = count / n_samples * 100
    ax.text(count + 10, i, f"{count} ({pct:.1f}%)", va='center', fontsize=8)

save_plot("02_methods_comparison_bar")
plt.show()

In [None]:
# Scatter plots: distribution with outliers highlighted
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

# Key methods to visualize
key_methods = [
    'IQR (1.0) ← Project',
    'Z-Score (2.5) ← Project',
    'Isolation Forest ← Project',
    'ENSEMBLE ← Project'
]

for idx, method in enumerate(key_methods):
    ax = axes[idx]
    mask = outlier_masks[method]

    # Plot inliers (samples not flagged by this method)
    inlier_indices = np.where(~mask)[0]
    ax.scatter(inlier_indices, y_array[~mask], c='steelblue', alpha=0.3, s=10, label='Inliers')

    # Plot outliers
    if mask.sum() > 0:
        outlier_indices = np.where(mask)[0]
        ax.scatter(outlier_indices, y_array[mask], c='red', alpha=0.8, s=30,
                  label=f'Outliers ({mask.sum()})', edgecolors='black', linewidth=0.5)

    ax.set_xlabel('Sample Index')
    ax.set_ylabel('Price (€)')
    ax.set_title(f'{method}\n{mask.sum()} outliers ({mask.sum()/n_samples*100:.1f}%)')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Outlier Detection - Scatter Plots', fontsize=14, fontweight='bold')
save_plot("03_scatter_plots")
plt.show()

In [None]:
# Box plots: distribution before/after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before
axes[0].boxplot(y_array, vert=True)
axes[0].set_ylabel('Price (€)')
axes[0].set_title(f'BEFORE Outlier Removal\n(N={len(y_array):,})')
axes[0].grid(True, alpha=0.3)

# After (ENSEMBLE)
axes[1].boxplot(ensemble_inliers, vert=True)
axes[1].set_ylabel('Price (€)')
axes[1].set_title(f'AFTER Outlier Removal (ENSEMBLE)\n(N={len(ensemble_inliers):,}, removed={len(y_array)-len(ensemble_inliers):,})')
axes[1].grid(True, alpha=0.3)

save_plot("04_boxplots_comparison")
plt.show()

In [None]:
# Histograms: distribution before/after
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Before
axes[0].hist(y_array, bins=50, edgecolor='black', alpha=0.7)
axes[0].axvline(y_array.mean(), color='r', linestyle='--', label=f'Mean: €{y_array.mean():,.0f}')
axes[0].axvline(np.median(y_array), color='g', linestyle='--', label=f'Median: €{np.median(y_array):,.0f}')
axes[0].set_xlabel('Price (€)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('BEFORE Outlier Removal')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# After
axes[1].hist(ensemble_inliers, bins=50, edgecolor='black', alpha=0.7, color='orange')
axes[1].axvline(ensemble_inliers.mean(), color='r', linestyle='--', label=f'Mean: €{ensemble_inliers.mean():,.0f}')
axes[1].axvline(np.median(ensemble_inliers), color='g', linestyle='--', label=f'Median: €{np.median(ensemble_inliers):,.0f}')
axes[1].set_xlabel('Price (€)')
axes[1].set_ylabel('Frequency')
axes[1].set_title('AFTER Outlier Removal (ENSEMBLE)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

save_plot("05_histograms_comparison")
plt.show()

## 📊 6. Overlap Analysis (Venn Diagram)

In [None]:
# Venn diagram for the 3 methods of the ensemble
if HAS_VENN:
    try:
        fig, ax = plt.subplots(figsize=(10, 8))

        # Sets of sample positions flagged by each method
        iqr_set = set(np.where(outlier_masks['IQR (1.0) ← Project'])[0])
        z_set = set(np.where(outlier_masks['Z-Score (2.5) ← Project'])[0])
        iso_set = set(np.where(outlier_masks['Isolation Forest ← Project'])[0])

        venn3([iqr_set, z_set, iso_set],
              ('IQR (1.0)', 'Z-Score (2.5)', 'Isolation Forest'),
              ax=ax)

        ax.set_title('Outlier Detection Methods Overlap (ENSEMBLE)', fontsize=14, fontweight='bold')

        save_plot("06_venn_diagram_ensemble")
        plt.show()

        # Report overlap
        print("\n" + "=" * 80)
        print("OVERLAP ANALYSIS")
        print("=" * 80)
        print(f"\nIQR only: {len(iqr_set - z_set - iso_set)}")
        print(f"Z-Score only: {len(z_set - iqr_set - iso_set)}")
        print(f"Isolation Forest only: {len(iso_set - iqr_set - z_set)}")
        print(f"\nIQR ∩ Z-Score: {len((iqr_set & z_set) - iso_set)}")
        print(f"IQR ∩ Isolation: {len((iqr_set & iso_set) - z_set)}")
        print(f"Z-Score ∩ Isolation: {len((z_set & iso_set) - iqr_set)}")
        print(f"\nAll 3 methods: {len(iqr_set & z_set & iso_set)}")
        print(f"Total ENSEMBLE: {len(iqr_set | z_set | iso_set)}")

    except Exception as e:
        print(f"⚠️  Error while drawing the Venn diagram: {e}")
else:
    print("⚠️  matplotlib-venn is not installed. Skipping the Venn diagram.")
    print("   pip install matplotlib-venn")

## 📋 7. Summary Report

In [None]:
# Final report
report = {
    'dataset': {
        'n_samples': n_samples,
        'original_mean': float(y_array.mean()),
        'original_median': float(np.median(y_array)),
        'original_std': float(y_array.std()),
        'original_skew': float(stats.skew(y_array)),
        'original_kurtosis': float(stats.kurtosis(y_array)),
    },
    'methods_tested': len(outlier_masks),
    'outlier_counts': {k: int(v) for k, v in outlier_counts.items()},
    'ensemble_config': {
        'methods': ['IQR (factor=1.0)', 'Z-Score (threshold=2.5)', 'Isolation Forest (contamination=0.08)'],
        'outliers_detected': int(outlier_counts['ENSEMBLE ← Project']),
        'outliers_pct': float(outlier_counts['ENSEMBLE ← Project'] / n_samples * 100),
    },
    'impact_after_removal': {
        'n_samples': len(ensemble_inliers),
        'mean': float(ensemble_inliers.mean()),
        'median': float(np.median(ensemble_inliers)),
        'std': float(ensemble_inliers.std()),
        'skew': float(stats.skew(ensemble_inliers)),
        'kurtosis': float(stats.kurtosis(ensemble_inliers)),
        'mean_change_pct': float((ensemble_inliers.mean() - y_array.mean()) / y_array.mean() * 100),
        'std_change_pct': float((ensemble_inliers.std() - y_array.std()) / y_array.std() * 100),
        'skew_change': float(stats.skew(ensemble_inliers) - stats.skew(y_array)),
    },
    'recommendation': (
        'The current ENSEMBLE configuration is balanced. '
        f'It removes {outlier_counts["ENSEMBLE ← Project"]/n_samples*100:.1f}% of samples as outliers '
        f'and improves normality (skew: {stats.skew(y_array):.2f} → {stats.skew(ensemble_inliers):.2f})'
    )
}

# Save as JSON
import json
with open(OUTPUT_DIR / "00_summary_report.json", 'w') as f:
    json.dump(report, f, indent=2)

print("\n" + "=" * 80)
print("📋 FINAL REPORT")
print("=" * 80)
print(json.dumps(report, indent=2))
print(f"\n💾 Saved: 00_summary_report.json")
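
DBSCAN is listed among the analyzed methods and imported in Setup, but no cell above applies it. The following is a minimal sketch of how it could be added, treating DBSCAN noise points (label `-1`) as outliers on the standardized target. The helper name `dbscan_outlier_mask` is hypothetical, `eps=0.5` and `min_samples=5` are scikit-learn's defaults used as placeholders rather than values from the project config, and the synthetic `toy` prices stand in for `y_array`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def dbscan_outlier_mask(values, eps=0.5, min_samples=5):
    """Return a boolean mask of the points DBSCAN labels as noise (-1)."""
    # Standardize so that eps is expressed in standard deviations, not euros.
    X = StandardScaler().fit_transform(
        np.asarray(values, dtype=float).reshape(-1, 1)
    )
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return labels == -1

# Synthetic stand-in for y_array: 200 ordinary prices plus two extreme ones.
rng = np.random.default_rng(42)
toy = np.concatenate([
    rng.normal(200_000, 20_000, size=200),
    [900_000.0, 1_200_000.0],
])

mask = dbscan_outlier_mask(toy)
print(f"DBSCAN noise points: {mask.sum()} of {len(toy)}")
```

To include DBSCAN in the comparison, the mask would have to be stored in `outlier_masks` and `outlier_counts` before the comparison table is built. Unlike Isolation Forest or LOF, DBSCAN has no `contamination` parameter, so the share of flagged samples depends entirely on `eps` and `min_samples` and would need tuning on the real data.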

## ✅ Conclusions

### Generated Files

1. `00_summary_report.json` - Full report
2. `01_methods_comparison.csv` - Comparison table
3. `02_methods_comparison_bar.png` - Bar chart
4. `03_scatter_plots.png` - Scatter plots with outliers highlighted
5. `04_boxplots_comparison.png` - Box plots before/after
6. `05_histograms_comparison.png` - Histograms before/after
7. `06_venn_diagram_ensemble.png` - Venn diagram of the overlap (only if matplotlib-venn is installed)

### Key Insights

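- The **ENSEMBLE method** combines 3 approaches for robustness. It is the union of the three masks, so it removes at least as many samples as Isolation Forest alone (about 8% with contamination=0.08)
- Removing outliers improves skewness and kurtosis
- Trade-off: removing too many outliers means losing data
- The current config (IQR 1.0 + Z 2.5 + ISO 0.08) is balanced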
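
The trade-off between robustness and data loss depends on how the individual masks are combined. The notebook's ensemble uses a logical OR, so a single detector is enough to discard a sample. The sketch below generalizes this to a vote threshold. The helper `combine_masks`, the majority-vote variant, and the six-sample masks are illustrative assumptions, not part of the project configuration.

```python
import numpy as np

def combine_masks(masks, min_votes=1):
    """Flag a sample when at least `min_votes` of the boolean masks flag it."""
    votes = np.sum(np.vstack(masks), axis=0)
    return votes >= min_votes

# Hypothetical masks from three detectors over six samples.
iqr = np.array([True, True, False, False, True, False])
z = np.array([True, False, False, False, True, False])
iso = np.array([True, False, True, False, False, False])

union = combine_masks([iqr, z, iso], min_votes=1)     # same as iqr | z | iso
majority = combine_masks([iqr, z, iso], min_votes=2)  # at least two detectors agree

print(f"Union flags {union.sum()} samples, majority vote flags {majority.sum()}")
```

With `min_votes=1` the result matches the current ensemble. Raising the threshold keeps only the samples on which detectors agree, which removes fewer rows at the cost of missing outliers that only one method can see.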

### Recommendations

- If outliers > 15%: relax the parameters (IQR 1.5, Z 3.0)
- If outliers < 5%: tighten the parameters (IQR 0.5, Z 2.0)
- Monitor outliers by group (OMI zones, categories)
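
The threshold bands and the per-group monitoring can be sketched as two small helpers. This is a hedged illustration only: the names `tuning_hint` and `outlier_share_by_group`, the column names `zona_omi` and `prezzo`, and the toy values are all hypothetical, and the per-group check uses the IQR rule alone rather than the full ensemble.

```python
import pandas as pd

def tuning_hint(outlier_pct, low=5.0, high=15.0):
    """Map an overall outlier share (%) to the action suggested by the bands."""
    if outlier_pct > high:
        return "relax"    # e.g. IQR 1.5, Z 3.0
    if outlier_pct < low:
        return "tighten"  # e.g. IQR 0.5, Z 2.0
    return "keep"

def outlier_share_by_group(df, value_col, group_col, factor=1.0):
    """Percentage of IQR outliers per group, with bounds computed inside each group."""
    def share(values):
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        flagged = (values < q1 - factor * iqr) | (values > q3 + factor * iqr)
        return flagged.mean() * 100
    return df.groupby(group_col)[value_col].apply(share)

# Toy data with hypothetical column names and values.
toy = pd.DataFrame({
    "zona_omi": ["B1"] * 6 + ["C2"] * 6,
    "prezzo": [100, 105, 110, 115, 120, 400, 200, 210, 220, 230, 240, 250],
})

shares = outlier_share_by_group(toy, "prezzo", "zona_omi")
print(shares.round(1))
print(tuning_hint(8.0))
```

Computing the bounds inside each group matters when price levels differ between zones or categories: a global bound can flag most of an expensive zone while missing anomalies in a cheap one.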