# Final House Price Analysis Report

This notebook presents the complete end-to-end analysis of King County House Prices, including data cleaning, feature engineering, and model comparison (Linear Regression, Random Forest, XGBoost).

## 1. Setup and Data Loading


In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image, display

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

from houseprice.config import DATA_PATH, OUT_DIR, RANDOM_STATE, TEST_SIZE
from houseprice.data import load_data, clean_data
from houseprice.features import engineer_features
from houseprice.preprocess import split_columns, make_preprocessors
from houseprice.models import make_linear, make_random_forest, make_xgb
from houseprice.plots import (
    plot_lr_residuals_enhanced, 
    plot_tree_importance,
    plot_feature_importance_comparison,
    plot_linear_coefficients,
    plot_shap_summary
)

# Initialize output directory
out_dir = Path.cwd().parent / OUT_DIR
out_dir.mkdir(parents=True, exist_ok=True)
print(f"Output directory: {out_dir}")


In [None]:
# Load and clean data
raw_df = load_data(Path.cwd().parent / DATA_PATH)
print(f"Raw shape: {raw_df.shape}")

df_clean = clean_data(raw_df)
print(f"Cleaned shape: {df_clean.shape}")

# Feature Engineering
df = engineer_features(df_clean)
print(f"Engineered shape: {df.shape}")
print("Columns:", df.columns.tolist())


## 2. Model Training & Comparison
We train three models:
1. **Linear Regression** (Baseline, scaled features)
2. **Random Forest** (Tree ensemble)
3. **XGBoost** (Gradient boosting)

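Each model's hyperparameters are tuned with a grid search, and the tuned model is then scored by R2 under shuffled 5-fold cross-validation on the training split.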
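
To make the evaluation scheme concrete, here is a dependency-free sketch of how shuffled k-fold splitting partitions row indices. The helper `kfold_indices` is hypothetical and only mirrors the idea behind scikit-learn's `KFold`; the notebook itself relies on `KFold(n_splits=5, shuffle=True, ...)`, and the actual shuffle order will differ.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs; every row is in validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly

    # Spread the remainder over the first folds so sizes differ by at most one
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=12, n_splits=5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)} rows, validation={sorted(val_idx)}")
```

Each row is held out exactly once, so the mean of the five fold scores treats every training row as unseen data at some point.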


In [None]:
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_validate

# Prepare X and y
y_log = np.log1p(df["price"].values)
X = df.drop(columns=["price"])

# Train/Test Split
X_train, X_test, y_train_log, y_test_log = train_test_split(
    X, y_log, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

# Pipelines
num, cat = split_columns(X_train)
prep_lr, prep_trees = make_preprocessors(num, cat)

lr_pipe, lr_grid = make_linear(prep_lr)
rf_pipe, rf_grid = make_random_forest(prep_trees, RANDOM_STATE)
xgb_pipe, xgb_grid = make_xgb(prep_trees, RANDOM_STATE)

models = [
    ("LinearRegression", lr_pipe, lr_grid),
    ("RandomForest", rf_pipe, rf_grid),
]
if xgb_pipe:
    models.append(("XGBoost", xgb_pipe, xgb_grid))

# Run CV
kf = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
best_estimators = {}
results = []

print("Training models...")
for name, pipe, grid in models:
    print(f"  > Tuning {name}...")
    gs = GridSearchCV(pipe, param_grid=grid, cv=kf, scoring="r2", n_jobs=-1)
    gs.fit(X_train, y_train_log)
    best_estimators[name] = gs.best_estimator_
    
    # Re-score the tuned pipeline on the same folds to get both metrics
    cv_res = cross_validate(
        gs.best_estimator_, X_train, y_train_log, cv=kf,
        scoring=["r2", "neg_root_mean_squared_error"],
    )

    r2 = cv_res["test_r2"].mean()
    # The scorer returns negated RMSE, so flip the sign.
    # This RMSE is on the log1p(price) scale, not in dollars.
    rmse_log = -cv_res["test_neg_root_mean_squared_error"].mean()

    # Dollar-scale RMSE is not computed here; see scripts/run_cv.py for the exact table.
    results.append({
        "Model": name,
        "Best Params": str(gs.best_params_),
        "R2 (CV Mean)": r2,
        "RMSE log (CV Mean)": rmse_log,
    })

results_df = pd.DataFrame(results)
display(results_df)
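
The cross-validation scores above are on the log1p(price) scale. One way to express the error in dollars is to invert the transform with expm1 before computing RMSE, ideally on out-of-fold predictions (for example from scikit-learn's `cross_val_predict`). The sketch below uses plain Python and made-up prices purely for illustration; `dollar_rmse` is a hypothetical helper, not part of the `houseprice` package.

```python
import math


def dollar_rmse(y_true_log, y_pred_log):
    """RMSE in dollars for targets and predictions stored as log1p(price)."""
    squared_errors = [
        (math.expm1(t) - math.expm1(p)) ** 2  # undo log1p before comparing
        for t, p in zip(y_true_log, y_pred_log)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up prices for illustration only (not taken from the King County data)
true_prices = [300_000.0, 500_000.0, 1_200_000.0]
pred_prices = [330_000.0, 450_000.0, 1_000_000.0]

y_true_log = [math.log1p(p) for p in true_prices]
y_pred_log = [math.log1p(p) for p in pred_prices]

rmse_dollars = dollar_rmse(y_true_log, y_pred_log)
print(f"RMSE: ${rmse_dollars:,.0f}")
```

With NumPy arrays, `np.expm1` applies the same inverse elementwise. Because the log scale compresses large prices, the same log-scale error corresponds to a much larger dollar error on an expensive house.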


## 3. In-Depth Analysis

### 3.1 Linear Regression Residuals
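A linear model cannot represent non-linear effects on (log) price unless they are engineered in as features, so systematic patterns in its residuals show where the fit breaks down. The residuals are computed on the held-out test set.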
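
As a toy illustration of the kind of residual pattern that reveals a missed non-linearity, the sketch below fits a straight line to a deliberately curved relationship. The data and the helper `fit_line` are invented for illustration and are unrelated to the housing data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]  # curved (quadratic) target

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. A U-shape like this, rather than a patternless scatter around zero, signals a relationship the linear model cannot represent.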


In [None]:
lr_model = best_estimators["LinearRegression"]
plot_lr_residuals_enhanced(lr_model, X_test, y_test_log, out_dir / "final_lr_residuals.png")
display(Image(filename=str(out_dir / "final_lr_residuals.png")))


### 3.2 Feature Importance (Random Forest vs XGBoost)
This plot compares the features that the tuned Random Forest and XGBoost models rank as most important. It is skipped if XGBoost was not trained.


In [None]:
if "RandomForest" in best_estimators and "XGBoost" in best_estimators:
    plot_feature_importance_comparison(
        best_estimators["RandomForest"], 
        best_estimators["XGBoost"], 
        prep_trees, 
        out_dir / "final_imp_comparison.png"
    )
    display(Image(filename=str(out_dir / "final_imp_comparison.png")))


### 3.3 SHAP Analysis
SHAP values show how specific feature values push an individual price prediction up or down.


In [None]:
# Pick the model with the best CV R2; the SHAP plot is only drawn for tree models
winner_name = results_df.sort_values("R2 (CV Mean)", ascending=False).iloc[0]["Model"]

if winner_name in ["RandomForest", "XGBoost"]:
    print(f"Analyzing {winner_name} with SHAP...")
    plot_shap_summary(best_estimators[winner_name], X_train, out_dir / "final_shap.png")
    display(Image(filename=str(out_dir / "final_shap.png")))
else:
    print(f"{winner_name} is not a tree model; skipping the SHAP plot.")


## 4. Final Conclusions

1.  **Model Performance**: Ensemble methods (RF/XGB) clearly outperform Linear Regression in cross-validated R2.
2.  **Key Drivers**: Square footage and grade are the dominant predictors, and location (zipcode/lat/long) is also critical.
3.  **Data Quality**: Cleaning (removing bad data) and Log-transformation of price were essential steps.

### Future Work
-   Implement neighborhood clustering using Lat/Long.
-   Add more external data (school ratings, interest rates).
