# 📊 Model Results Deep Analysis

**Goal**: In-depth analysis of the trained model's results.

**Prerequisites**: Run `python main.py --config config/config.yaml --steps train` first.

**Analyses**:
1. **Performance Metrics**: MAE, RMSE, MAPE, R² for train/val/test
2. **Residual Analysis**: Distribution, autocorrelation, heteroskedasticity
3. **Error Distribution**: By price range, zone, category
4. **Feature Importance**: SHAP, permutation, model-specific
5. **Prediction vs Actual**: Scatter plots, error bands
6. **Outlier Predictions**: Worst predictions analysis
7. **Cross-Validation Stability**: CV scores variance
8. **Ensemble Analysis**: Individual models contribution

**Output**: `model_analysis_outputs/`

## 🔧 Setup

In [1]:
# Imports
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / "src"))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import joblib
import json
import warnings

# Project imports
from utils.config import load_config
# Note: load_preprocessed_data doesn't exist, loading manually
# Note: compute_all_metrics doesn't exist, using sklearn directly

warnings.filterwarnings('ignore')

# Plot settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

print("✅ Setup complete")


In [None]:
# Configuration
CONFIG_PATH = "../config/config.yaml"
PREPROCESSED_DIR = Path("../data/preprocessed")
MODELS_DIR = Path("../models")
RESULTS_PATH = MODELS_DIR / "results_summary.json"
OUTPUT_DIR = Path("model_analysis_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def save_plot(name, dpi=120):
    """Save the current figure as a PNG in OUTPUT_DIR."""
    plt.tight_layout()
    plt.savefig(OUTPUT_DIR / f"{name}.png", dpi=dpi, bbox_inches='tight')
    print(f"💾 Saved: {name}.png")

print(f"📂 Output directory: {OUTPUT_DIR}")
print(f"📂 Preprocessed data: {PREPROCESSED_DIR}")
print(f"📂 Models directory: {MODELS_DIR}")

## 📦 1. Load Model and Data

In [None]:
# Check that the training results summary exists
if not RESULTS_PATH.exists():
    print("❌ ERROR: results summary not found!")
    print("\nRun training first:")
    print("  python main.py --config config/config.yaml --steps train")
    raise FileNotFoundError(f"Results summary not found: {RESULTS_PATH}")

# Load config
config = load_config(CONFIG_PATH)

# Load results summary
with open(RESULTS_PATH, 'r', encoding='utf-8') as f:
    results = json.load(f)

print("✅ Config and results loaded")
print(f"\n📊 Best Model: {results.get('best_model', 'N/A')}")
print(f"📊 Best Params: {results.get('best_params', {})}")

In [None]:
# Load preprocessed data
try:
    # Load the train/val/test splits; targets are flattened to 1-D arrays
    X_train = pd.read_parquet(PREPROCESSED_DIR / "X_train.parquet")
    X_val = pd.read_parquet(PREPROCESSED_DIR / "X_val.parquet")
    X_test = pd.read_parquet(PREPROCESSED_DIR / "X_test.parquet")
    y_train = pd.read_parquet(PREPROCESSED_DIR / "y_train.parquet").values.ravel()
    y_val = pd.read_parquet(PREPROCESSED_DIR / "y_val.parquet").values.ravel()
    y_test = pd.read_parquet(PREPROCESSED_DIR / "y_test.parquet").values.ravel()

    print("✅ Preprocessed data loaded")
    print(f"\n📊 Shapes:")
    print(f"   Train: X={X_train.shape}, y={y_train.shape}")
    print(f"   Val:   X={X_val.shape}, y={y_val.shape}")
    print(f"   Test:  X={X_test.shape}, y={y_test.shape}")

except FileNotFoundError:
    print("❌ ERROR: preprocessed data not found!")
    print("\nRun preprocessing first:")
    print("  python main.py --config config/config.yaml --steps preprocess")
    raise

In [None]:
# Load best model
best_model_path = MODELS_DIR / "best_model.pkl"

if not best_model_path.exists():
    print("⚠️  Best model not found, falling back to individual models...")
    # List available models (sorted so the fallback choice is deterministic)
    available_models = sorted(MODELS_DIR.glob("*.pkl"))
    if len(available_models) == 0:
        print("❌ ERROR: no model found!")
        raise FileNotFoundError(f"No model found in {MODELS_DIR}")

    print(f"\nAvailable models:")
    for m in available_models:
        print(f"  - {m.name}")

    # Use the first one (not necessarily the best-performing model)
    model_path = available_models[0]
    print(f"\n📦 Using: {model_path.name}")
else:
    model_path = best_model_path
    print(f"📦 Loading best model: {model_path.name}")

model = joblib.load(model_path)
print("✅ Model loaded")

## 📊 2. Performance Metrics

In [None]:
# Predictions
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

print("✅ Predictions generated")

In [None]:
# Compute metrics for each split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, mean_absolute_percentage_error

def compute_metrics(y_true, y_pred, split_name):
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r2 = r2_score(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred) * 100
    
    # Median Absolute Error
    medae = np.median(np.abs(y_true - y_pred))
    
    return {
        'Split': split_name,
        'MAE': mae,
        'RMSE': rmse,
        'MedAE': medae,
        'MAPE': mape,
        'R2': r2,
    }

metrics_train = compute_metrics(y_train, y_train_pred, 'Train')
metrics_val = compute_metrics(y_val, y_val_pred, 'Val')
metrics_test = compute_metrics(y_test, y_test_pred, 'Test')

metrics_df = pd.DataFrame([metrics_train, metrics_val, metrics_test])

print("=" * 80)
print("PERFORMANCE METRICS")
print("=" * 80)
print("\n", metrics_df.to_string(index=False))

# Save
metrics_df.to_csv(OUTPUT_DIR / "01_performance_metrics.csv", index=False)
print(f"\n💾 Saved: 01_performance_metrics.csv")

In [None]:
# Visualize metrics
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics_to_plot = ['MAE', 'RMSE', 'MAPE', 'R2']
colors = ['steelblue', 'orange', 'green']

for idx, metric in enumerate(metrics_to_plot):
    ax = axes[idx // 2, idx % 2]
    
    values = metrics_df[metric].values
    ax.bar(metrics_df['Split'], values, color=colors, edgecolor='black')
    ax.set_ylabel(metric)
    ax.set_title(f'{metric} by Split')
    ax.grid(True, alpha=0.3, axis='y')
    
    # Add value labels (abs() keeps the offset upward even if a metric is negative)
    for i, v in enumerate(values):
        ax.text(i, v + abs(v) * 0.02, f"{v:.2f}", ha='center', fontweight='bold')

plt.suptitle('Model Performance Metrics', fontsize=16, fontweight='bold')
save_plot("02_performance_metrics")
plt.show()

## 📊 3. Residual Analysis
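
The overview lists autocorrelation among the residual checks. One way to quantify it is the Durbin-Watson statistic, sketched below with NumPy only. It is meaningful only when the rows have a natural order (for example, sale date), which is an assumption about this dataset. `durbin_watson` is a hypothetical helper, not part of the project code.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic for lag-1 autocorrelation.

    Values near 2 suggest no lag-1 autocorrelation, values toward 0
    suggest positive and values toward 4 negative autocorrelation.
    """
    e = np.asarray(residuals, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

# Illustrative residuals only (not project data): a strictly alternating
# sequence is strongly negatively autocorrelated.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
```

In this notebook it could be called as `durbin_watson(residuals_test)` once the test rows are sorted by the ordering variable.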

In [None]:
# Residuals
residuals_train = y_train - y_train_pred
residuals_val = y_val - y_val_pred
residuals_test = y_test - y_test_pred

print("=" * 80)
print("RESIDUAL STATISTICS")
print("=" * 80)

for name, residuals in [('Train', residuals_train), ('Val', residuals_val), ('Test', residuals_test)]:
    print(f"\n{name}:")
    print(f"  Mean: €{residuals.mean():,.0f}")
    print(f"  Std:  €{residuals.std():,.0f}")
    print(f"  Skew: {stats.skew(residuals):.2f}")
    print(f"  Kurt: {stats.kurtosis(residuals):.2f}")

In [None]:
# Residual plots
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# Row 1: Histograms
for idx, (name, residuals) in enumerate([('Train', residuals_train), ('Val', residuals_val), ('Test', residuals_test)]):
    ax = axes[0, idx]
    ax.hist(residuals, bins=50, edgecolor='black', alpha=0.7)
    ax.axvline(0, color='r', linestyle='--', linewidth=2, label='Zero')
    ax.set_xlabel('Residuals (€)')
    ax.set_ylabel('Frequency')
    ax.set_title(f'{name} - Residual Distribution')
    ax.legend()
    ax.grid(True, alpha=0.3)

# Row 2: Q-Q plots
for idx, (name, residuals) in enumerate([('Train', residuals_train), ('Val', residuals_val), ('Test', residuals_test)]):
    ax = axes[1, idx]
    stats.probplot(residuals, dist="norm", plot=ax)
    ax.set_title(f'{name} - Q-Q Plot')
    ax.grid(True, alpha=0.3)

plt.suptitle('Residual Analysis', fontsize=16, fontweight='bold')
save_plot("03_residual_analysis")
plt.show()

In [None]:
# Residuals vs Predictions (heteroskedasticity check)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, y_pred, residuals) in enumerate([
    ('Train', y_train_pred, residuals_train),
    ('Val', y_val_pred, residuals_val),
    ('Test', y_test_pred, residuals_test)
]):
    ax = axes[idx]
    ax.scatter(y_pred, residuals, alpha=0.3, s=10)
    ax.axhline(0, color='r', linestyle='--', linewidth=2)
    ax.set_xlabel('Predicted Price (€)')
    ax.set_ylabel('Residuals (€)')
    ax.set_title(f'{name} - Residuals vs Predicted')
    ax.grid(True, alpha=0.3)

plt.suptitle('Heteroskedasticity Check', fontsize=16, fontweight='bold')
save_plot("04_heteroskedasticity")
plt.show()

## 📊 4. Prediction vs Actual

In [None]:
# Scatter plots: predicted vs actual
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (name, y_true, y_pred) in enumerate([
    ('Train', y_train, y_train_pred),
    ('Val', y_val, y_val_pred),
    ('Test', y_test, y_test_pred)
]):
    ax = axes[idx]
    
    # Scatter
    ax.scatter(y_true, y_pred, alpha=0.3, s=10)
    
    # Perfect prediction line
    min_val = min(y_true.min(), y_pred.min())
    max_val = max(y_true.max(), y_pred.max())
    ax.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect Prediction')
    
    ax.set_xlabel('Actual Price (€)')
    ax.set_ylabel('Predicted Price (€)')
    ax.set_title(f'{name} - Predicted vs Actual\nR²={r2_score(y_true, y_pred):.3f}')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.suptitle('Prediction vs Actual', fontsize=16, fontweight='bold')
save_plot("05_prediction_vs_actual")
plt.show()

## 📊 5. Error Distribution by Price Range

In [None]:
# Analyze error by price range (test set only)
n_bins = 10
# Wrap y_test in a Series: qcut on a NumPy array returns a Categorical,
# which has no `.cat` accessor
price_bins = pd.qcut(pd.Series(y_test), q=n_bins, duplicates='drop')

error_by_price = []
for bin_label in price_bins.cat.categories:
    mask = (price_bins == bin_label).to_numpy()
    y_true_bin = y_test[mask]
    y_pred_bin = y_test_pred[mask]
    
    mae_bin = mean_absolute_error(y_true_bin, y_pred_bin)
    mape_bin = mean_absolute_percentage_error(y_true_bin, y_pred_bin) * 100
    
    error_by_price.append({
        'Price_Range': str(bin_label),
        'Count': mask.sum(),
        'MAE': mae_bin,
        'MAPE': mape_bin,
    })

error_by_price_df = pd.DataFrame(error_by_price)

print("=" * 80)
print("ERROR BY PRICE RANGE (Test Set)")
print("=" * 80)
print("\n", error_by_price_df.to_string(index=False))

# Save
error_by_price_df.to_csv(OUTPUT_DIR / "06_error_by_price_range.csv", index=False)
print(f"\n💾 Saved: 06_error_by_price_range.csv")

In [None]:
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MAE
axes[0].bar(range(len(error_by_price_df)), error_by_price_df['MAE'], 
            edgecolor='black', color='steelblue')
axes[0].set_xticks(range(len(error_by_price_df)))
axes[0].set_xticklabels(range(1, len(error_by_price_df) + 1))
axes[0].set_xlabel('Price Range Bin')
axes[0].set_ylabel('MAE (€)')
axes[0].set_title('MAE by Price Range')
axes[0].grid(True, alpha=0.3, axis='y')

# MAPE
axes[1].bar(range(len(error_by_price_df)), error_by_price_df['MAPE'], 
            edgecolor='black', color='orange')
axes[1].set_xticks(range(len(error_by_price_df)))
axes[1].set_xticklabels(range(1, len(error_by_price_df) + 1))
axes[1].set_xlabel('Price Range Bin')
axes[1].set_ylabel('MAPE (%)')
axes[1].set_title('MAPE by Price Range')
axes[1].grid(True, alpha=0.3, axis='y')

save_plot("07_error_by_price_range")
plt.show()

## 📊 6. Worst Predictions Analysis

In [None]:
# Top 20 worst predictions (test)
abs_errors = np.abs(y_test - y_test_pred)
worst_indices = np.argsort(abs_errors)[-20:][::-1]

worst_predictions = pd.DataFrame({
    'Index': worst_indices,
    'Actual': y_test[worst_indices],
    'Predicted': y_test_pred[worst_indices],
    'Error': y_test[worst_indices] - y_test_pred[worst_indices],
    'Abs_Error': abs_errors[worst_indices],
    'APE': np.abs((y_test[worst_indices] - y_test_pred[worst_indices]) / y_test[worst_indices] * 100),
})

print("=" * 80)
print("TOP 20 WORST PREDICTIONS (Test Set)")
print("=" * 80)
print("\n", worst_predictions.to_string(index=False))

# Save
worst_predictions.to_csv(OUTPUT_DIR / "08_worst_predictions.csv", index=False)
print(f"\n💾 Saved: 08_worst_predictions.csv")

## 📊 7. Feature Importance (if available)
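
The overview also mentions permutation importance, which works for any fitted estimator, including models that expose neither `feature_importances_` nor `coef_`. The following is a minimal sketch using scikit-learn's `permutation_importance`. The helper name `permutation_importance_table`, the MAE-based scoring, and the synthetic demo data are assumptions for illustration, not part of the project.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

def permutation_importance_table(model, X, y, n_repeats=5, random_state=0):
    """Rank features by the mean score drop when each column is shuffled."""
    result = permutation_importance(
        model, X, y,
        scoring="neg_mean_absolute_error",
        n_repeats=n_repeats,
        random_state=random_state,
    )
    table = pd.DataFrame({
        "Feature": X.columns,
        "Importance": result.importances_mean,
        "Std": result.importances_std,
    })
    return table.sort_values("Importance", ascending=False).reset_index(drop=True)

# Synthetic demo (not project data): only "signal" drives the target.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=200)
demo_model = LinearRegression().fit(X_demo, y_demo)
demo_table = permutation_importance_table(demo_model, X_demo, y_demo)
```

In this notebook the call would presumably be `permutation_importance_table(model, X_test, y_test)`, so that importance is measured on held-out data rather than on the training set.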

In [None]:
# Feature importance (if available)
try:
    # Tree-based models
    if hasattr(model, 'feature_importances_'):
        feature_importance = pd.DataFrame({
            'Feature': X_train.columns,
            'Importance': model.feature_importances_
        }).sort_values('Importance', ascending=False)
        
        print("=" * 80)
        print("FEATURE IMPORTANCE (Top 20)")
        print("=" * 80)
        print("\n", feature_importance.head(20).to_string(index=False))
        
        # Save
        feature_importance.to_csv(OUTPUT_DIR / "09_feature_importance.csv", index=False)
        print(f"\n💾 Saved: 09_feature_importance.csv")
        
        # Plot
        fig, ax = plt.subplots(figsize=(10, 8))
        top20 = feature_importance.head(20)
        ax.barh(range(len(top20)), top20['Importance'], edgecolor='black')
        ax.set_yticks(range(len(top20)))
        ax.set_yticklabels(top20['Feature'], fontsize=8)
        ax.set_xlabel('Importance')
        ax.set_title('Feature Importance (Top 20)', fontsize=14, fontweight='bold')
        ax.grid(True, alpha=0.3, axis='x')
        ax.invert_yaxis()
        
        save_plot("10_feature_importance")
        plt.show()
        
    # Linear models
    elif hasattr(model, 'coef_'):
        feature_importance = pd.DataFrame({
            'Feature': X_train.columns,
            'Coefficient': np.ravel(model.coef_)  # flatten in case coef_ has shape (1, n_features)
        }).sort_values('Coefficient', key=abs, ascending=False)
        
        print("=" * 80)
        print("FEATURE COEFFICIENTS (Top 20)")
        print("=" * 80)
        print("\n", feature_importance.head(20).to_string(index=False))
        
        feature_importance.to_csv(OUTPUT_DIR / "09_feature_coefficients.csv", index=False)
        print(f"\n💾 Saved: 09_feature_coefficients.csv")

    else:
        print("⚠️  Feature importance is not available for this model")

except Exception as e:
    print(f"⚠️  Error while extracting feature importance: {e}")

## 📋 8. Summary Report

In [None]:
# Final report
report = {
    'model_type': str(type(model).__name__),
    'n_features': X_train.shape[1],
    'performance': {
        'train': metrics_train,
        'val': metrics_val,
        'test': metrics_test,
    },
    'residuals': {
        'train': {
            'mean': float(residuals_train.mean()),
            'std': float(residuals_train.std()),
            'skew': float(stats.skew(residuals_train)),
        },
        'test': {
            'mean': float(residuals_test.mean()),
            'std': float(residuals_test.std()),
            'skew': float(stats.skew(residuals_test)),
        },
    },
    'worst_prediction': {
        'actual': float(worst_predictions.iloc[0]['Actual']),
        'predicted': float(worst_predictions.iloc[0]['Predicted']),
        'error': float(worst_predictions.iloc[0]['Error']),
        'ape': float(worst_predictions.iloc[0]['APE']),
    },
}

# Save JSON
with open(OUTPUT_DIR / "00_summary_report.json", 'w') as f:
    json.dump(report, f, indent=2)

print("\n" + "=" * 80)
print("📋 FINAL REPORT")
print("=" * 80)
print(json.dumps(report, indent=2))
print(f"\n💾 Saved: 00_summary_report.json")

## ✅ Conclusions

### Generated Files

1. `00_summary_report.json` - Full report
2. `01_performance_metrics.csv` - Metrics per split
3. `02_performance_metrics.png` - Bar charts of the metrics
4. `03_residual_analysis.png` - Residual analysis
5. `04_heteroskedasticity.png` - Heteroskedasticity check
6. `05_prediction_vs_actual.png` - Scatter of predicted vs actual
7. `06_error_by_price_range.csv` - Error by price range
8. `07_error_by_price_range.png` - Error plots by price range
9. `08_worst_predictions.csv` - Top 20 worst predictions
10. `09_feature_importance.csv` - Feature importance, if available (`09_feature_coefficients.csv` for linear models)
11. `10_feature_importance.png` - Feature importance plot

### Key Insights

- **Generalization**: A large gap between train and test metrics indicates overfitting; poor metrics on both indicate underfitting
- **Residuals**: Strongly non-normal residuals (skew, heavy tails) suggest the model is missing structure in the data
- **Heteroskedasticity**: If residual variance changes with price, consider a target transform
- **Worst predictions**: Look for common patterns to guide improvements
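
The heteroskedasticity point can be checked numerically as well as visually. One simple option, sketched here as an assumption rather than the notebook's method, is the Spearman rank correlation between predicted values and absolute residuals: a clearly positive value means errors grow with price. `heteroskedasticity_check` is a hypothetical helper and the demo data is synthetic.

```python
import numpy as np
from scipy import stats

def heteroskedasticity_check(y_pred, residuals):
    """Spearman correlation between predictions and absolute residuals."""
    rho, p_value = stats.spearmanr(
        np.asarray(y_pred), np.abs(np.asarray(residuals))
    )
    return float(rho), float(p_value)

# Synthetic demo (not project data): noise scale proportional to the
# prediction, so the spread of errors grows with price.
rng = np.random.default_rng(0)
pred_demo = rng.uniform(100_000, 500_000, size=2_000)
resid_demo = rng.normal(scale=0.1 * pred_demo)
rho_demo, p_demo = heteroskedasticity_check(pred_demo, resid_demo)
```

With the notebook's variables, the call would be `heteroskedasticity_check(y_test_pred, residuals_test)`.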

### Next Steps

1. If test R² is well below train R²: reduce complexity (regularization, pruning)
2. If residuals are not normal: try a target transformation
3. If heteroskedasticity is present: consider weighted regression or quantile regression
4. Analyze the worst predictions: add features that target them
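
As a sketch of the target transformation suggested in steps 2 and 3, scikit-learn's `TransformedTargetRegressor` fits a regressor on a transformed target and inverts the transform at prediction time. The log transform, the `LinearRegression` stand-in, and the synthetic data below are assumptions for illustration; the project's actual model and pipeline may differ.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic demo (not project data): a log-linear, right-skewed target.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(500, 1))
y_demo = np.exp(10.0 + 2.0 * X_demo[:, 0] + rng.normal(scale=0.2, size=500))

# Fit on log1p(y); predictions are mapped back with expm1 automatically.
log_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_demo, y_demo)
pred_demo = log_model.predict(X_demo)  # back on the original price scale
```

Because predictions come back on the original scale, the metrics and residual plots in this notebook can be reused unchanged to compare the transformed and untransformed models.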