# Model Experiments & Selection

## Overview
This notebook documents the **complete model exploration journey** for the property valuation project.

**Objective:** Compare multiple regression algorithms to find the best model for predicting property prices.

**Approach:**
1. Load and prepare data
2. Train 6 different algorithms
3. Evaluate using 5-fold cross-validation
4. Compare metrics (RMSE, MAE, R²)
5. Select the best model
6. Interpret results

**Models Tested:**
| # | Algorithm | Category | Reason |
|---|-----------|----------|--------|
| 1 | Ridge | Linear | Baseline: simple, interpretable, handles multicollinearity |
| 2 | ElasticNet | Linear | Combines L1 (Lasso) & L2 (Ridge) regularization |
| 3 | RandomForest | Tree Ensemble | Non-linear: captures complex patterns, robust to outliers |
| 4 | ExtraTrees | Tree Ensemble | Faster variant with random splits |
| 5 | HistGB | Tree Ensemble | Fast gradient boosting with histogram-based splits |
| 6 | XGBoost | Tree Ensemble | Widely used gradient boosting: sequential tree building with L1/L2 regularization |

**Key Insight:** We test both linear and tree-based models because:
- Linear models are fast and interpretable (good baseline)
- Tree-based models capture non-linear relationships better
- Ensemble methods reduce overfitting and improve generalization

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold, cross_validate
from sklearn.linear_model import Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, HistGradientBoostingRegressor
from xgboost import XGBRegressor
import warnings

warnings.filterwarnings('ignore')
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

print("✓ All libraries imported successfully")

---
## 1. Data Loading & Preparation

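### What We're Using:
- **Target Variable:** `price` (sale price in dollars, as used in the code below); the log-transformed `price_log` is the alternative discussed in the next subsection
- **Features:** All columns except `id` and the target (leakage prevention)
- **Train Set:** ~3,800 samples with cleaned, engineered features

### Why Consider a Log-Transformed Target?
- Original prices are right-skewed ($100K → $2M+ range)
- A log transform compresses the long right tail, which usually brings residuals closer to a normal distribution
- Linear models in particular tend to fit a log-transformed target better
- Errors in log space are relative: a fixed log error corresponds to a fixed percentage error in price, not a fixed dollar amount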
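
The effect of the log transform can be illustrated with a small standard-library sketch. The prices below are made up for illustration and are not project data; `skewness` is a hypothetical helper, not part of the notebook.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / std) ** 3 for v in values)

# Made-up, right-skewed prices (illustration only, not the project data)
prices = [100_000, 120_000, 150_000, 180_000, 220_000, 300_000, 450_000, 2_000_000]
log_prices = [math.log(p) for p in prices]

print(f"skew(price)     = {skewness(prices):.2f}")
print(f"skew(log price) = {skewness(log_prices):.2f}")

# An additive error in log space is a multiplicative error in price space:
# overshooting by 0.10 in log units means overshooting by exp(0.10) as a ratio.
true_price = 300_000
predicted_price = math.exp(math.log(true_price) + 0.10)
ratio = predicted_price / true_price
print(f"price ratio for a +0.10 log error = {ratio:.3f}")
```

On this toy sample the log-transformed values are less skewed than the raw prices, and the same log error maps to the same percentage error at any price level.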

In [None]:
# Load data
print("Loading data...")
df = pd.read_csv('../data/processed/train_cleaned_scaled.csv')

print(f"\n✓ Data loaded successfully")
print(f"  Dataset shape: {df.shape[0]:,} samples × {df.shape[1]} columns")
print(f"\nColumn preview:")
print(df.columns.tolist()[:10])

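# Prepare target and features
y = df['price']  # Raw price in dollars, so RMSE and MAE below are in dollars
# Drop the identifier and every form of the target; errors='ignore' skips
# 'price_log' if this file does not contain it
X = df.drop(columns=['id', 'price', 'price_log'], errors='ignore')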

print(f"\nTarget variable: 'price'")
print(f"  Shape: {y.shape}")
print(f"  Mean: ${y.mean():,.0f}")
print(f"  Std: ${y.std():,.0f}")
print(f"\nFeatures selected: {X.shape[1]} features")
print(f"  {X.shape[0]:,} samples ready for modeling")

---
## 2. Define Models & Hyperparameters

### Linear Models:
**Ridge:** L2 regularization (penalty on large coefficients)
- Hyperparameter: `alpha` (strength of regularization)
- Use when: features are correlated

**ElasticNet:** Combines L1 (Lasso) + L2 (Ridge)
- Hyperparameters: `alpha` (strength), `l1_ratio` (balance between L1 & L2)
- Use when: want feature selection + multicollinearity handling
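
To make the role of `alpha` concrete, here is a minimal sketch of ridge shrinkage for a single centered feature with no intercept, where the L2-penalized least-squares slope has the closed form `sum(x*y) / (sum(x*x) + alpha)`. The data and the helper name `ridge_slope` are assumptions for illustration only.

```python
def ridge_slope(x, y, alpha):
    """Closed-form ridge slope for one feature, no intercept.

    Minimizes sum((y - w*x)**2) + alpha * w**2, which gives
    w = sum(x*y) / (sum(x*x) + alpha).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)

# Toy centered data with a true slope close to 2 (illustration only)
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.1, -1.9, 0.1, 2.0, 3.9]

alphas = [0.0, 1.0, 10.0, 100.0]
slopes = [ridge_slope(x, y, a) for a in alphas]
for a, w in zip(alphas, slopes):
    print(f"alpha={a:>5}: slope={w:.3f}")
```

With `alpha=0` this is ordinary least squares; larger values pull the coefficient toward zero, which is the mechanism that stabilizes estimates when features are correlated.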

### Tree Ensemble Models:
**RandomForest:** Parallel decision trees with bootstrap sampling
- Hyperparameters: `n_estimators` (# trees), `max_depth` (tree depth)
- Fast & robust, good baseline for non-linear relationships

**ExtraTrees:** Extremely Randomized Trees (random split thresholds)
- Similar to RandomForest but faster
- More randomization lowers variance (less overfitting), usually at the cost of slightly higher bias

**HistGB:** Histogram-based Gradient Boosting (sklearn's fast implementation)
- Sequential tree building with learning rate
- Each tree corrects previous trees' errors

**XGBoost:** Extreme Gradient Boosting (widely used in industry)
- Adds explicit L1/L2 regularization on leaf weights and supports early stopping
- A common strong choice in competitions and production systems
- Computationally efficient, with optional GPU support
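
The idea that each boosted tree corrects the errors of the previous ones can be sketched with depth-1 trees (stumps) on a toy 1-D dataset. This is a simplified illustration of gradient boosting with squared loss, not the algorithm used internally by HistGB or XGBoost; `fit_stump` and `boost` are hypothetical helpers.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump to the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds, learning_rate):
    """Return the training MSE after each boosting round."""
    predictions = [sum(y) / len(y)] * len(y)  # start from the mean
    history = []
    for _ in range(n_rounds):
        # Each new stump is fit to what the current ensemble still gets wrong
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y))
    return history

# Toy step-like data (illustration only)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.5, 1.8, 4.0, 4.2, 4.5, 9.0, 9.5]

mse_history = boost(x, y, n_rounds=50, learning_rate=0.1)
print(f"MSE after round 1:  {mse_history[0]:.3f}")
print(f"MSE after round 50: {mse_history[-1]:.3f}")
```

A smaller `learning_rate` shrinks each stump's contribution, so more rounds are needed to reach the same training error. This is the same trade-off between `learning_rate` and `n_estimators` / `max_iter` in the models above.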

In [None]:
# Define models with preset hyperparameters (no search is run in this notebook)
models = {
    'Ridge': Ridge(
        alpha=10.0,
        random_state=42
    ),
    
    'ElasticNet': ElasticNet(
        alpha=0.01,
        l1_ratio=0.2,
        random_state=42,
        max_iter=20000
    ),
    
    'RandomForest': RandomForestRegressor(
        n_estimators=600,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2
    ),
    
    'ExtraTrees': ExtraTreesRegressor(
        n_estimators=800,
        random_state=42,
        n_jobs=-1,
        max_depth=None,
        min_samples_leaf=2
    ),
    
    'HistGB': HistGradientBoostingRegressor(
        learning_rate=0.05,
        max_depth=None,
        max_iter=600,
        min_samples_leaf=20,
        random_state=42
    ),
    
    'XGBoost': XGBRegressor(
        n_estimators=600,
        learning_rate=0.05,
        max_depth=6,
        min_child_weight=1,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='reg:squarederror',
        random_state=42,
        n_jobs=-1
    )
}

print("✓ 6 models defined with preset hyperparameters:")
for name in models.keys():
    print(f"  • {name}")

---
## 3. Cross-Validation Setup

### Why 5-Fold Cross-Validation?
- **Robust Evaluation:** Every sample is held out exactly once across the 5 validation folds
- **Reduces Variance:** The average of 5 runs is more reliable than a single train/test split
- **Overfitting Check:** Scores come only from data the model did not see during fitting
- **Efficiency:** A balance between estimate quality and computational cost

### Metrics Used:
| Metric | Formula | Interpretation | Best |
|--------|---------|-----------------|------|
| **RMSE** | √(Σ(actual - pred)²/n) | Typical error size; squaring penalizes large errors more | Lower |
| **MAE** | Σ(\|actual - pred\|)/n | Average absolute error (less sensitive to outliers than RMSE) | Lower |
| **R²** | 1 - (RSS/TSS) | Proportion of variance explained (at most 1; can be negative on held-out data) | Higher |

**Which metric to use?**
- **RMSE:** Primary metric (standard in property valuation)
- **R²:** How much better than just using mean price
- **MAE:** Human-interpretable (average $ error)
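
The three formulas in the table can be written out directly. The sketch below uses plain Python and made-up prices; the function names are illustrative and are not the scikit-learn scorers used later in the notebook.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt(sum((a - p)^2) / n)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: sum(|a - p|) / n."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r2(actual, predicted):
    """Coefficient of determination: 1 - RSS / TSS."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    tss = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - rss / tss

# Made-up prices (illustration only)
actual = [200_000, 300_000, 400_000]
good_pred = [210_000, 290_000, 430_000]
bad_pred = [400_000, 300_000, 200_000]  # worse than predicting the mean

print(f"RMSE: ${rmse(actual, good_pred):,.0f}")
print(f"MAE:  ${mae(actual, good_pred):,.0f}")
print(f"R² (good predictions): {r2(actual, good_pred):.3f}")
print(f"R² (bad predictions):  {r2(actual, bad_pred):.3f}")
```

The single $30,000 miss pulls RMSE above MAE, and the second set of predictions shows that R² drops below zero when a model does worse than always predicting the mean.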

In [None]:
# Set up cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Define scoring metrics
scoring = {
    'rmse': 'neg_root_mean_squared_error',
    'mae': 'neg_mean_absolute_error',
    'r2': 'r2'
}

print("✓ Cross-validation setup:")
print(f"  Method: 5-Fold KFold")
print(f"  Shuffled: Yes (random_state=42 for reproducibility)")
print(f"\nMetrics:")
print(f"  • RMSE (Root Mean Squared Error)")
print(f"  • MAE (Mean Absolute Error)")
print(f"  • R² (Coefficient of Determination)")

---
## 4. Model Evaluation

Now we train each model and evaluate it using 5-fold cross-validation.

**What happens:**
1. Split data into 5 folds
2. For each fold:
   - Train on 4 folds (80%)
   - Test on 1 fold (20%)
3. Average the 5 test scores
4. Record mean and std of each metric
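
The splitting steps above can be sketched without any libraries. This is a simplified stand-in for `KFold(n_splits=5, shuffle=True)`: it follows the same idea but will not reproduce scikit-learn's exact shuffled order, and `kfold_indices` is a hypothetical helper.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # The first (n_samples % n_splits) folds get one extra sample
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=5))
for i, (train, test) in enumerate(folds, start=1):
    print(f"Fold {i}: train on {len(train)} samples, test on {sorted(test)}")
```

Every sample lands in exactly one test fold, so each model is scored on all of the data while never being evaluated on rows it was trained on.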

In [None]:
# Evaluate each model
print("="*70)
print("EVALUATING MODELS (5-Fold Cross-Validation)")
print("="*70)

results = []

for name, model in models.items():
    print(f"\n{name:15s}", end=" ")
    
    # Perform cross-validation
    cv_results = cross_validate(
        model, X, y,
        cv=cv,
        scoring=scoring,
        n_jobs=-1,
        return_train_score=False
    )
    
    # Store results
    results.append({
        'Model': name,
        'RMSE_mean': -cv_results['test_rmse'].mean(),
        'RMSE_std': cv_results['test_rmse'].std(),
        'MAE_mean': -cv_results['test_mae'].mean(),
        'MAE_std': cv_results['test_mae'].std(),
        'R2_mean': cv_results['test_r2'].mean(),
        'R2_std': cv_results['test_r2'].std()
    })
    
    print(f"✓")

# Create results dataframe
results_df = pd.DataFrame(results)
print("\n✓ All models evaluated!")

---
## 5. Results & Comparison

Here we compare all models across the three metrics.

In [None]:
# Sort by RMSE (primary metric)
results_sorted = results_df.sort_values('RMSE_mean').reset_index(drop=True)

print("\n" + "="*70)
print("MODEL COMPARISON RESULTS")
print("="*70)
print("\n(Sorted by RMSE - lower is better)\n")

# Display formatted results
display_df = results_sorted[['Model', 'RMSE_mean', 'RMSE_std', 'MAE_mean', 'R2_mean']].copy()
display_df['RMSE_mean'] = display_df['RMSE_mean'].round(6)
display_df['RMSE_std'] = display_df['RMSE_std'].round(6)
display_df['MAE_mean'] = display_df['MAE_mean'].round(2)
display_df['R2_mean'] = display_df['R2_mean'].round(6)

print(display_df.to_string(index=False))

# Identify best model
best_idx = results_sorted['RMSE_mean'].idxmin()
best_model_name = results_sorted.loc[best_idx, 'Model']
best_rmse = results_sorted.loc[best_idx, 'RMSE_mean']
best_r2 = results_sorted.loc[best_idx, 'R2_mean']

print("\n" + "="*70)
print(f"BEST MODEL: {best_model_name}")
print("="*70)
print(f"  RMSE: ${best_rmse:,.2f}")
print(f"  R² Score: {best_r2:.4f} ({best_r2*100:.2f}% variance explained)")
print(f"  MAE: ${results_sorted.loc[best_idx, 'MAE_mean']:,.2f}")

---
## 6. Visualizations

Let's visualize how models compare across different metrics.

In [None]:
# Create comparison visualizations
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
fig.suptitle('Model Performance Comparison (5-Fold Cross-Validation)', fontsize=14, fontweight='bold')

# Sort for consistent ordering
plot_data = results_sorted

# RMSE Comparison
axes[0].barh(plot_data['Model'], plot_data['RMSE_mean'], color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_xlabel('RMSE ($ - lower is better)', fontweight='bold')
axes[0].set_title('Root Mean Squared Error')
axes[0].invert_yaxis()
for i, v in enumerate(plot_data['RMSE_mean']):
    axes[0].text(v + 1000, i, f'${v:,.0f}', va='center', fontweight='bold')

# MAE Comparison
mae_sorted = results_df.sort_values('MAE_mean').reset_index(drop=True)
axes[1].barh(mae_sorted['Model'], mae_sorted['MAE_mean'], color='seagreen', alpha=0.7, edgecolor='black')
axes[1].set_xlabel('MAE ($ - lower is better)', fontweight='bold')
axes[1].set_title('Mean Absolute Error')
axes[1].invert_yaxis()
for i, v in enumerate(mae_sorted['MAE_mean']):
    axes[1].text(v + 500, i, f'${v:,.0f}', va='center', fontweight='bold')

# R² Comparison
r2_sorted = results_df.sort_values('R2_mean', ascending=False).reset_index(drop=True)
colors = ['gold' if r2 == r2_sorted['R2_mean'].max() else 'coral' for r2 in r2_sorted['R2_mean']]
axes[2].barh(r2_sorted['Model'], r2_sorted['R2_mean'], color=colors, alpha=0.7, edgecolor='black')
axes[2].set_xlabel('R² Score (higher is better)', fontweight='bold')
axes[2].set_title('Coefficient of Determination')
axes[2].set_xlim([0, 1])
axes[2].invert_yaxis()
for i, v in enumerate(r2_sorted['R2_mean']):
    axes[2].text(v + 0.01, i, f'{v:.4f}', va='center', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Performance ranking table
fig, ax = plt.subplots(figsize=(12, 6))
ax.axis('tight')
ax.axis('off')

# Create ranking table
ranking_data = []
for idx, row in results_sorted.iterrows():
    ranking_data.append([
        f"{idx+1}. {row['Model']}",
        f"${row['RMSE_mean']:,.0f}",
        f"${row['MAE_mean']:,.0f}",
        f"{row['R2_mean']:.4f}",
        f"±${row['RMSE_std']:,.0f}"  # RMSE spread across folds, in dollars
    ])

table = ax.table(
    cellText=ranking_data,
    colLabels=['Rank | Model', 'RMSE', 'MAE', 'R² Score', 'RMSE Std Dev'],
    cellLoc='center',
    loc='center',
    colWidths=[0.25, 0.15, 0.15, 0.15, 0.15]
)

table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)

# Color header
for i in range(5):
    table[(0, i)].set_facecolor('#4285f4')
    table[(0, i)].set_text_props(weight='bold', color='white')

# Color winner row
for i in range(5):
    table[(1, i)].set_facecolor('#c6f6d5')
    table[(1, i)].set_text_props(weight='bold')

plt.title('Model Ranking by Performance', fontsize=14, fontweight='bold', pad=20)
plt.show()

---
## 7. Analysis & Insights

### Key Findings:

In [None]:
print("\n" + "="*70)
print("KEY INSIGHTS FROM MODEL EXPERIMENTS")
print("="*70)

print("\n1️⃣  LINEAR VS TREE MODELS:")
linear_rmse = results_df[results_df['Model'].isin(['Ridge', 'ElasticNet'])]['RMSE_mean'].min()
tree_rmse = results_df[~results_df['Model'].isin(['Ridge', 'ElasticNet'])]['RMSE_mean'].min()
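improvement = ((linear_rmse - tree_rmse) / linear_rmse) * 100
print(f"   Best linear model RMSE: ${linear_rmse:,.2f}")
print(f"   Best tree model RMSE: ${tree_rmse:,.2f}")
if improvement > 0:
    print(f"   Best tree model is {improvement:.1f}% better ✓")
    print(f"   → Consistent with tree-based models capturing non-linear patterns")
else:
    print(f"   Best linear model is {-improvement:.1f}% better")
    print(f"   → No evidence here that tree-based models help on this dataset")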

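print("\n2️⃣  AVERAGING ENSEMBLES (RandomForest vs ExtraTrees):")
rf_rmse = results_df[results_df['Model'] == 'RandomForest']['RMSE_mean'].values[0]
et_rmse = results_df[results_df['Model'] == 'ExtraTrees']['RMSE_mean'].values[0]
print(f"   RandomForest RMSE: ${rf_rmse:,.2f}")
print(f"   ExtraTrees RMSE: ${et_rmse:,.2f}")
# No single decision tree was trained, so this compares the two averaging ensembles only
averaging_winner = 'RandomForest' if rf_rmse <= et_rmse else 'ExtraTrees'
print(f"   → {averaging_winner} has the lower RMSE of the two averaging ensembles")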

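print("\n3️⃣  BOOSTING VS BAGGING:")
xgb_rmse = results_df[results_df['Model'] == 'XGBoost']['RMSE_mean'].values[0]
histgb_rmse = results_df[results_df['Model'] == 'HistGB']['RMSE_mean'].values[0]
print(f"   XGBoost RMSE: ${xgb_rmse:,.2f} (sequential boosting)")
print(f"   HistGB RMSE: ${histgb_rmse:,.2f} (fast boosting)")
# Compare the best boosting model against the best bagging-style model
best_boosting = min(xgb_rmse, histgb_rmse)
best_bagging = min(rf_rmse, et_rmse)
print(f"   Best boosting: ${best_boosting:,.2f} | Best bagging: ${best_bagging:,.2f}")
family = 'Boosting (sequential)' if best_boosting < best_bagging else 'Bagging (parallel)'
print(f"   → {family} achieves the lower RMSE on this dataset")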

print("\n4️⃣  VARIANCE STABILITY:")
most_stable = results_df.loc[results_df['RMSE_std'].idxmin()]
least_stable = results_df.loc[results_df['RMSE_std'].idxmax()]
print(f"   Most stable: {most_stable['Model']} (std±{most_stable['RMSE_std']:.6f})")
print(f"   Least stable: {least_stable['Model']} (std±{least_stable['RMSE_std']:.6f})")
print(f"   → Lower std = more consistent across folds (more stable performance estimate)")

print("\n5️⃣  R² INTERPRETATION:")
best_r2_model = results_df.loc[results_df['R2_mean'].idxmax()]
print(f"   Best R² Score: {best_r2_model['R2_mean']:.4f} ({best_r2_model['R2_mean']*100:.2f}%)")
print(f"   → Model explains {best_r2_model['R2_mean']*100:.2f}% of price variance")
print(f"   → {(1-best_r2_model['R2_mean'])*100:.2f}% variance from other factors")

---
## 8. Model Selection Decision

### Selection Criteria:

In [None]:
print("\n" + "="*70)
print("MODEL SELECTION DECISION")
print("="*70)

print("\nCriteria Used:")
print("  1. Lowest RMSE (primary metric) ✓")
print("  2. High R² Score (variance explained)")
print("  3. Low standard deviation (consistency across folds)")
print("  4. Computational efficiency")
print("  5. Interpretability & production readiness")

print(f"\nSELECTED MODEL: {best_model_name}")
print(f"\nPerformance Metrics:")
best_row = results_df[results_df['Model'] == best_model_name].iloc[0]
print(f"  • RMSE: ${best_row['RMSE_mean']:,.2f}")
print(f"  • MAE: ${best_row['MAE_mean']:,.2f}")
print(f"  • R² Score: {best_row['R2_mean']:.4f}")
print(f"  • Std Dev: ±{best_row['RMSE_std']:.6f}")

print(f"\nWhy {best_model_name}?")

if best_model_name == 'XGBoost':
    print("  ✓ Highest accuracy (lowest RMSE)")
    print("  ✓ Row/column subsampling helps generalization")
    print("  ✓ Handles multicollinearity well")
    print("  ✓ Built-in regularization helps limit overfitting")
    print("  ✓ Widely used for tabular regression tasks")
    print("  ✓ Scalable to large datasets")
elif best_model_name == 'ExtraTrees':
    print("  ✓ Very fast training and prediction")
    print("  ✓ Less prone to overfitting than RandomForest")
    print("  ✓ Excellent performance with minimal tuning")
    print("  ✓ Good feature importance extraction")
elif best_model_name == 'RandomForest':
    print("  ✓ Robust ensemble method")
    print("  ✓ Handles non-linear relationships")
    print("  ✓ Reduces overfitting through bagging")
    print("  ✓ Good feature importance rankings")
else:
    # Ridge, ElasticNet, or HistGB won: report the one criterion actually measured
    print("  ✓ Lowest cross-validated RMSE among the 6 models tested")

print(f"\nNext Steps:")
print(f"  1. Train final {best_model_name} on entire training set")
print(f"  2. Generate predictions on test set")
print(f"  3. Extract feature importances for interpretation")
print(f"  4. Hyperparameter tuning (optional GridSearch/RandomSearch)")
print(f"  5. Production deployment & monitoring")

---
## 9. Hyperparameter Tuning Insights

### Which Parameters Matter Most?

**For XGBoost:**
| Parameter | Impact | Current Value |
|-----------|--------|---------------|
| `n_estimators` | # of trees (↑ = better fit, ↓ = faster) | 600 |
| `learning_rate` | Step size for tree contributions | 0.05 |
| `max_depth` | Tree depth (↑ = complex patterns, ↓ = avoid overfitting) | 6 |
| `subsample` | Row sampling fraction | 0.8 |
| `colsample_bytree` | Feature sampling fraction | 0.8 |

**Trade-offs:**
- More trees (↑ `n_estimators`) → Better accuracy but slower
- Deeper trees (↑ `max_depth`) → Better fit but risk overfitting
- Higher learning rate → Faster convergence but less stable

**Current Configuration:**
- Balanced for accuracy + speed
- Mild row and column subsampling as regularization (`subsample=0.8`, `colsample_bytree=0.8`)
- Good generalization based on 5-fold CV
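
If a grid search is run later, its cost grows multiplicatively with the number of values per parameter. The sketch below enumerates a small grid around the current XGBoost settings; the alternative values are illustrative assumptions, not recommendations from this experiment.

```python
from itertools import product

# Hypothetical search grid centered on the current configuration
param_grid = {
    "n_estimators": [300, 600, 900],
    "learning_rate": [0.03, 0.05, 0.1],
    "max_depth": [4, 6, 8],
}

names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds

print(f"Candidate configurations: {len(candidates)}")
print(f"Model fits with {n_folds}-fold CV: {total_fits}")
print(f"First candidate: {candidates[0]}")
```

Each dictionary could be passed to `XGBRegressor(**params)` and scored with the same cross-validation setup as above. When the grid becomes too large, sampling a random subset of `candidates` is a common alternative.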

---
## 10. Summary Report

### Experiment Outcomes:

In [None]:
print("\n" + "="*70)
print("MODEL EXPERIMENTS - SUMMARY")
print("="*70)

print(f"\nEXPERIMENT SCOPE:")
print(f"   Models tested: 6")
print(f"   Cross-validation folds: 5")
print(f"   Metrics evaluated: 3 (RMSE, MAE, R²)")
print(f"   Total CV runs: 30 (6 models × 5 folds)")
print(f"   Training data: {X.shape[0]:,} samples × {X.shape[1]} features")

print(f"\nWINNER:")
print(f"   Model: {best_model_name}")
print(f"   RMSE: ${best_row['RMSE_mean']:,.2f}")
print(f"   R² Score: {best_row['R2_mean']:.4f}")

print(f"\nPERFORMANCE IMPROVEMENT:")
worst_rmse = results_df['RMSE_mean'].max()
improvement_pct = ((worst_rmse - best_row['RMSE_mean']) / worst_rmse) * 100
print(f"   vs Worst Model: {improvement_pct:.1f}% better")
print(f"   vs Best Linear Model: {((linear_rmse - best_row['RMSE_mean']) / linear_rmse * 100):.1f}% better")

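print(f"\nDATA QUALITY:")
print(f"   Missing values: {int(df.isna().sum().sum())}")  # computed, not assumed
print(f"   Feature scaling: StandardScaler (already applied)")
# Verify the leakage claim against the actual feature matrix
leaked = [c for c in ('price', 'price_log') if c in X.columns]
leak_msg = "price & price_log excluded from features" if not leaked else f"WARNING: {leaked} still in features"
print(f"   Leakage prevention: {leak_msg}")
print(f"   Validation: 5-fold cross-validation")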

print(f"\nRECOMMENDED NEXT STEPS:")
print(f"   1. Train {best_model_name} on full dataset")
print(f"   2. Generate final test predictions")
print(f"   3. Analyze residuals (prediction errors)")
print(f"   4. Extract feature importances")
print(f"   5. Evaluate on held-out test set")
print(f"\n" + "="*70)