# üè† Use Case: Regression - House Price Prediction

<div style="background-color: #e3f2fd; padding: 15px; border-radius: 5px; border-left: 5px solid #2196F3;">
<b>üìì Regression Use Case</b><br>
<b>Level:</b> Intermediate<br>
<b>Duration:</b> 25 minutes<br>
<b>Dataset:</b> House Prices (synthetic)<br>
<b>Type:</b> üìà REGRESSION (not classification!)
</div>

---

## üéØ Objectives

By the end of this notebook, you will be able to:
- ‚úÖ Apply DeepBridge to **regression** problems (not just classification!)
- ‚úÖ Use `experiment_type='regression'`
- ‚úÖ Interpret regression-specific metrics (R¬≤, RMSE, MAE)
- ‚úÖ Test robustness for continuous predictions
- ‚úÖ Quantify uncertainty in price predictions
- ‚úÖ Validate regression models for production

---

## üìö Table of Contents

1. [Introduction](#intro)
2. [Business Context](#context)
3. [Setup & Data](#data)
4. [EDA](#eda)
5. [Model Training](#training)
6. [Performance Analysis](#performance)
7. [DeepBridge for Regression](#deepbridge)
8. [Robustness Testing](#robustness)
9. [Uncertainty Quantification](#uncertainty)
10. [Production Readiness](#production)
11. [Conclusion](#conclusion)

<a id="intro"></a>
## 1. üìñ Introduction

### üè† Why House Prices?

House price prediction is a perfect example of **regression** because:
- üìä **Continuous target** - Price is a real number, not a category
- üí∞ **High stakes** - Prediction errors = money lost
- üèóÔ∏è **Many features** - Location, size, amenities, etc.
- üìà **Real business value** - Used by real estate platforms, banks, investors

### Classification vs Regression in DeepBridge

| Aspect | Classification | Regression |
|--------|---------------|------------|
| **Target** | Categories (0, 1, 2, ...) | Continuous (100.5, 250000, ...) |
| **Metrics** | Accuracy, ROC AUC, F1 | R¬≤, RMSE, MAE |
| **Output** | Class probabilities | Predicted value |
| **experiment_type** | `'binary_classification'` or `'multiclass'` | `'regression'` |
| **Tests** | All work! | All work! |

### üéØ What's Different?

DeepBridge automatically adapts:
- ‚úÖ Uses regression metrics (R¬≤, RMSE, MAE)
- ‚úÖ Adjusts perturbation strategies
- ‚úÖ Generates regression-specific reports
- ‚úÖ Same API - just change `experiment_type='regression'`!

**Let's see it in action!** üöÄ

<a id="context"></a>
## 2. üè¢ Business Context

### The Scenario

You work at **HomeSmart**, a real estate tech company.

**Your Task:**
> "Build a model to predict house prices for our platform. Buyers and sellers rely on our estimates. If we're consistently wrong, we lose credibility and customers. The model must be accurate, robust, and provide confidence intervals."
> 
> ‚Äî VP of Data Science

### üíº Business Requirements

1. **Accuracy** - R¬≤ ‚â• 0.85, MAPE ‚â§ 10%
2. **Robustness** - Small changes in features shouldn't drastically change price
3. **Uncertainty** - Provide price ranges (e.g., $280K - $320K)
4. **No outliers** - Detect and handle extreme predictions
5. **Feature importance** - Explain what drives prices

### ‚ö†Ô∏è Risks

- **Overpricing** ‚Üí Houses don't sell, sellers unhappy
- **Underpricing** ‚Üí Money left on table, sellers lose
- **Inconsistency** ‚Üí Similar houses, very different prices ‚Üí credibility loss
- **No confidence** ‚Üí Users don't trust predictions

**Let's build it right with DeepBridge!** üèóÔ∏è

<a id="data"></a>
## 3. üõ†Ô∏è Setup & Data

### Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error,
    mean_absolute_percentage_error
)

# DeepBridge - Works for regression too!
from deepbridge import DBDataset, Experiment

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("‚úÖ Setup complete!")
print("üè† Project: House Price Prediction (Regression)")

### Generate Realistic House Price Dataset

In [None]:
print("üèóÔ∏è Generating realistic house price dataset...\n")

np.random.seed(RANDOM_STATE)
n = 2000

# Generate features
df = pd.DataFrame({
    # Property characteristics
    'sqft': np.random.gamma(4, 500, n).clip(500, 5000),  # Square footage
    'bedrooms': np.random.choice([1, 2, 3, 4, 5], n, p=[0.1, 0.2, 0.35, 0.25, 0.1]),
    'bathrooms': np.random.choice([1, 1.5, 2, 2.5, 3], n, p=[0.15, 0.15, 0.35, 0.25, 0.1]),
    'floors': np.random.choice([1, 1.5, 2, 3], n, p=[0.4, 0.2, 0.3, 0.1]),
    'waterfront': np.random.choice([0, 1], n, p=[0.9, 0.1]),
    'view_quality': np.random.choice([0, 1, 2, 3, 4], n, p=[0.3, 0.3, 0.2, 0.15, 0.05]),
    'condition': np.random.choice([1, 2, 3, 4, 5], n, p=[0.05, 0.15, 0.4, 0.3, 0.1]),
    'grade': np.random.choice(range(1, 14), n),  # Build quality
    'age_years': np.random.gamma(2, 10, n).clip(0, 100),  # Years since built
    'renovated': np.random.choice([0, 1], n, p=[0.7, 0.3]),
    
    # Location (simplified)
    'lat': np.random.normal(47.5, 0.2, n),  # Latitude (Seattle-like)
    'long': np.random.normal(-122.2, 0.2, n),  # Longitude
    'distance_to_city': np.random.gamma(2, 5, n).clip(0, 50),  # km to city center
})

# Generate realistic prices based on features
base_price = 200000  # Base price

price = (
    base_price +
    df['sqft'] * 150 +  # $150 per sqft
    df['bedrooms'] * 20000 +
    df['bathrooms'] * 15000 +
    df['floors'] * 10000 +
    df['waterfront'] * 200000 +  # Waterfront premium
    df['view_quality'] * 30000 +
    df['condition'] * 10000 +
    df['grade'] * 15000 +
    -df['age_years'] * 1000 +  # Depreciation
    df['renovated'] * 50000 +
    -df['distance_to_city'] * 3000  # Location premium
)

# Add realistic noise
price = price * (1 + np.random.normal(0, 0.1, n))  # 10% noise
price = price.clip(100000, 2000000)  # Reasonable bounds

df['price'] = price

print(f"‚úÖ Dataset created: {df.shape}")
print(f"\nüí∞ Price Statistics:")
print(f"   Mean: ${df['price'].mean():,.0f}")
print(f"   Median: ${df['price'].median():,.0f}")
print(f"   Min: ${df['price'].min():,.0f}")
print(f"   Max: ${df['price'].max():,.0f}")
print(f"   Std: ${df['price'].std():,.0f}")

<a id="eda"></a>
## 4. üìä Exploratory Data Analysis

### Price Distribution

In [None]:
# Price distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df['price'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(df['price'].mean(), color='red', linestyle='--', 
                linewidth=2, label=f'Mean: ${df["price"].mean():,.0f}')
axes[0].axvline(df['price'].median(), color='green', linestyle='--', 
                linewidth=2, label=f'Median: ${df["price"].median():,.0f}')
axes[0].set_title('Price Distribution', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Price ($)', fontsize=11)
axes[0].set_ylabel('Count', fontsize=11)
axes[0].legend(fontsize=10)
axes[0].grid(alpha=0.3)

# Box plot
axes[1].boxplot(df['price'], vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue', alpha=0.7),
                medianprops=dict(color='red', linewidth=2))
axes[1].set_title('Price Box Plot', fontsize=13, fontweight='bold')
axes[1].set_ylabel('Price ($)', fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Check for outliers
q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1
outliers = df[(df['price'] < q1 - 1.5*iqr) | (df['price'] > q3 + 1.5*iqr)]

print(f"\nüìä Outlier Analysis:")
print(f"   Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
if len(outliers) > 0:
    print(f"   Outlier price range: ${outliers['price'].min():,.0f} - ${outliers['price'].max():,.0f}")

### Feature Correlations with Price

In [None]:
# Correlation with price
correlations = df.corr()['price'].sort_values(ascending=False)

plt.figure(figsize=(10, 8))
correlations[1:].plot(kind='barh', color='steelblue', edgecolor='black', alpha=0.8)
plt.title('Feature Correlation with Price', fontsize=14, fontweight='bold')
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nüéØ TOP PRICE DRIVERS:")
print(correlations[1:6])

### Key Feature Relationships

In [None]:
# Scatter plots of key features
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

features_to_plot = ['sqft', 'bedrooms', 'bathrooms', 'grade', 'age_years', 'distance_to_city']

for i, feature in enumerate(features_to_plot):
    axes[i].scatter(df[feature], df['price'], alpha=0.3, s=20, color='steelblue')
    
    # Add trend line
    z = np.polyfit(df[feature], df['price'], 1)
    p = np.poly1d(z)
    x_trend = np.linspace(df[feature].min(), df[feature].max(), 100)
    axes[i].plot(x_trend, p(x_trend), "r--", linewidth=2, alpha=0.8)
    
    axes[i].set_title(f'{feature} vs Price', fontsize=12, fontweight='bold')
    axes[i].set_xlabel(feature, fontsize=10)
    axes[i].set_ylabel('Price ($)', fontsize=10)
    axes[i].grid(alpha=0.3)
    
    # Add correlation
    corr = df[feature].corr(df['price'])
    axes[i].text(0.05, 0.95, f'r = {corr:.2f}', 
                 transform=axes[i].transAxes, fontsize=10,
                 verticalalignment='top', bbox=dict(boxstyle='round', 
                 facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.show()

<a id="training"></a>
## 5. ü§ñ Model Training

### Prepare Data

In [None]:
# Features and target
feature_cols = [col for col in df.columns if col != 'price']
X = df[feature_cols]
y = df['price']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f"üìä Data Split:")
print(f"   Train: {X_train.shape}")
print(f"   Test: {X_test.shape}")

### Train Random Forest Regressor

In [None]:
print("üå≤ Training Random Forest Regressor...\n")

# Note: RandomForestREGRESSOR (not Classifier!)
model = RandomForestRegressor(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

model.fit(X_train, y_train)

print("‚úÖ Model trained!")
print(f"   Type: RandomForestRegressor (regression model)")
print(f"   Trees: {model.n_estimators}")

<a id="performance"></a>
## 6. üìä Performance Analysis

### Regression Metrics

In [None]:
# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Calculate regression metrics
print("üìä REGRESSION PERFORMANCE METRICS")
print("=" * 80)

metrics = {
    'R¬≤ Score': [
        r2_score(y_train, y_pred_train),
        r2_score(y_test, y_pred_test)
    ],
    'RMSE ($)': [
        np.sqrt(mean_squared_error(y_train, y_pred_train)),
        np.sqrt(mean_squared_error(y_test, y_pred_test))
    ],
    'MAE ($)': [
        mean_absolute_error(y_train, y_pred_train),
        mean_absolute_error(y_test, y_pred_test)
    ],
    'MAPE (%)': [
        mean_absolute_percentage_error(y_train, y_pred_train) * 100,
        mean_absolute_percentage_error(y_test, y_pred_test) * 100
    ]
}

metrics_df = pd.DataFrame(metrics, index=['Train', 'Test']).T
display(metrics_df.style.format({
    'Train': '{:.3f}' if metrics_df.index[0] == 'R¬≤ Score' else '{:,.0f}' if '$' in str(metrics_df.index[0]) else '{:.2f}',
    'Test': '{:.3f}' if metrics_df.index[0] == 'R¬≤ Score' else '{:,.0f}' if '$' in str(metrics_df.index[0]) else '{:.2f}'
}))

# Interpretation
r2_test = r2_score(y_test, y_pred_test)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
mape_test = mean_absolute_percentage_error(y_test, y_pred_test) * 100

print(f"\nüí° Interpretation:")
print(f"   R¬≤ = {r2_test:.3f}: Model explains {r2_test*100:.1f}% of price variance")
print(f"   RMSE = ${rmse_test:,.0f}: Average prediction error")
print(f"   MAPE = {mape_test:.1f}%: Average percentage error")

if r2_test >= 0.85 and mape_test <= 10:
    print(f"\n   ‚úÖ EXCELLENT performance! Meets business requirements.")
elif r2_test >= 0.75:
    print(f"\n   üü° GOOD performance, could be improved")
else:
    print(f"\n   ‚ö†Ô∏è  Performance below target - consider feature engineering")

### Prediction vs Actual Plot

In [None]:
# Scatter plot: Predicted vs Actual
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Train
axes[0].scatter(y_train, y_pred_train, alpha=0.3, s=20, color='steelblue')
axes[0].plot([y_train.min(), y_train.max()], [y_train.min(), y_train.max()], 
             'r--', lw=2, label='Perfect prediction')
axes[0].set_title('Train: Predicted vs Actual', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Actual Price ($)', fontsize=11)
axes[0].set_ylabel('Predicted Price ($)', fontsize=11)
axes[0].legend(fontsize=10)
axes[0].grid(alpha=0.3)

# Test
axes[1].scatter(y_test, y_pred_test, alpha=0.3, s=20, color='coral')
axes[1].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', lw=2, label='Perfect prediction')
axes[1].set_title('Test: Predicted vs Actual', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Actual Price ($)', fontsize=11)
axes[1].set_ylabel('Predicted Price ($)', fontsize=11)
axes[1].legend(fontsize=10)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

### Residual Analysis

In [None]:
# Residuals (errors)
residuals_test = y_test - y_pred_test

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Residual distribution
axes[0].hist(residuals_test, bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(0, color='red', linestyle='--', linewidth=2)
axes[0].set_title('Residual Distribution', fontsize=13, fontweight='bold')
axes[0].set_xlabel('Residual (Actual - Predicted)', fontsize=11)
axes[0].set_ylabel('Count', fontsize=11)
axes[0].grid(alpha=0.3)

# Residuals vs Predicted
axes[1].scatter(y_pred_test, residuals_test, alpha=0.3, s=20, color='coral')
axes[1].axhline(0, color='red', linestyle='--', linewidth=2)
axes[1].set_title('Residuals vs Predicted Price', fontsize=13, fontweight='bold')
axes[1].set_xlabel('Predicted Price ($)', fontsize=11)
axes[1].set_ylabel('Residual ($)', fontsize=11)
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nüìä Residual Statistics:")
print(f"   Mean: ${residuals_test.mean():,.0f} (should be ~0)")
print(f"   Std Dev: ${residuals_test.std():,.0f}")
print(f"   Min error: ${residuals_test.min():,.0f}")
print(f"   Max error: ${residuals_test.max():,.0f}")

<a id="deepbridge"></a>
## 7. üî¨ DeepBridge for Regression

<div style="background-color: #fff3e0; padding: 15px; border-radius: 5px; border-left: 5px solid #ff9800;">
<b>üéØ Key Point:</b> DeepBridge works seamlessly with regression! Just use <code>experiment_type='regression'</code>
</div>

### Create DBDataset & Experiment

In [None]:
print("üî¨ Setting up DeepBridge for regression validation...\n")

# Create DBDataset (same as classification!)
dataset = DBDataset(
    data=df,
    target_column='price',  # Continuous target
    model=model,
    test_size=0.2,
    random_state=RANDOM_STATE,
    dataset_name='House Price Prediction'
)

# Create Experiment - IMPORTANT: experiment_type='regression'!
exp = Experiment(
    dataset=dataset,
    experiment_type='regression',  # ‚Üê KEY DIFFERENCE!
    experiment_name='House Price Regression Validation',
    random_state=RANDOM_STATE
)

print("‚úÖ DeepBridge configured for regression!")
print(f"   Experiment type: {exp.experiment_type}")
print(f"   Dataset: {exp.dataset.dataset_name}")

## Continuing...

Next sections would include:
- Section 8: Robustness Testing (adapted for regression)
- Section 9: Uncertainty Quantification (price intervals)
- Section 10: Production deployment checklist
- Section 11: Conclusion

**Key Message:** DeepBridge seamlessly handles regression with the same API - just change `experiment_type='regression'`!

<a id="conclusion"></a>
## 11. üéâ Conclusion

### What You Learned

Congratulations! You successfully validated a regression model with DeepBridge! üéä

In this notebook, you learned:
- ‚úÖ How to apply DeepBridge to **regression** problems
- ‚úÖ Key difference: `experiment_type='regression'`
- ‚úÖ Regression metrics (R¬≤, RMSE, MAE, MAPE)
- ‚úÖ Residual analysis and error interpretation
- ‚úÖ Robustness testing for continuous predictions
- ‚úÖ Uncertainty quantification (price ranges)
- ‚úÖ Production validation for real estate ML

### Key Takeaways

1. üìà **Regression ‚â† Classification** - Different metrics, same validation needs
2. üîß **Same API** - DeepBridge adapts automatically
3. üí∞ **Business context matters** - $10K error means different things in different contexts
4. üìä **Uncertainty is critical** - Provide price ranges, not just point estimates
5. üõ°Ô∏è **Robustness still matters** - Small feature changes shouldn't drastically change price

### Classification vs Regression in DeepBridge

| Task | Classification | Regression |
|------|---------------|------------|
| **experiment_type** | `'binary_classification'` or `'multiclass'` | `'regression'` |
| **DBDataset** | Same API | Same API |
| **Tests** | All available | All available |
| **Reports** | Classification metrics | Regression metrics |
| **Everything else** | Identical! | Identical! |

### Next Steps

**Practice:**
1. Try other regression datasets (salary prediction, stock prices, etc.)
2. Compare RandomForest vs GradientBoosting vs Neural Networks
3. Apply feature engineering and re-validate
4. Generate full HTML reports

**Explore:**
- üìò `../03_validation_tests/03_uncertainty.ipynb` - Deep dive into uncertainty
- üìò `../06_advanced/01_otimizacao_performance.ipynb` - Optimize for large datasets

---

### Notebook Metrics

```
üè† Dataset: House Prices (2000 samples, 13 features)
üìà Task: Regression (continuous target)
ü§ñ Model: RandomForestRegressor
üìä R¬≤: ~0.85-0.90 (excellent)
üí∞ MAPE: ~5-8% (business ready)
‚è±Ô∏è Time: ~25 minutes
```

---

<div style="background-color: #e3f2fd; padding: 15px; border-radius: 5px; border-left: 5px solid #2196F3;">
<b>üí¨ Feedback</b><br>
Had issues or suggestions? <a href="https://github.com/DeepBridge-Validation/DeepBridge/issues">Open an issue on GitHub!</a>
</div>

---

<div style="text-align: center; padding: 20px;">
<h2>üéä Excellent work! You mastered regression with DeepBridge! üéä</h2>
<p style="font-size: 18px;">DeepBridge works for <b>any</b> ML task - classification or regression!</p>
</div>