# King County House Price Analysis

This notebook performs comprehensive exploratory data analysis (EDA) and model comparison for predicting house prices in King County, Washington.

## Objectives
1. Understand the distribution and characteristics of house prices
2. Identify key features that influence pricing
3. Compare Linear Regression, Random Forest, and XGBoost models
4. Analyze model performance and feature importance

## Setup and Imports

In [None]:
import sys
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import Image, display

# Add src to path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

from houseprice.config import DATA_PATH, OUT_DIR, EDA_PLOTS, RANDOM_STATE, N_FOLDS, TEST_SIZE
from houseprice.data import load_data
from houseprice.features import engineer_features
from houseprice.eda import (
    plot_price_distribution,
    plot_log_price_distribution,
    plot_correlation_heatmap,
    plot_sqft_vs_price,
    plot_geographic_distribution
)

print("✓ All imports successful")
print(f"Data path: {DATA_PATH}")
print(f"Output directory: {OUT_DIR}")

## 1. Data Loading and Initial Inspection

In [None]:
# Load the dataset
df = load_data(Path.cwd().parent / DATA_PATH)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nBasic statistics:\n{df.describe()}")

## 2. Exploratory Data Analysis

### 2.1 Price Distribution

Understanding the distribution of house prices is crucial. We expect to see a right-skewed distribution, which is typical for real estate data.

In [None]:
# Generate price distribution plot
out_dir = Path.cwd().parent / OUT_DIR
out_dir.mkdir(parents=True, exist_ok=True)

plot_price_distribution(df, out_dir / EDA_PLOTS["price_dist"])
display(Image(filename=str(out_dir / EDA_PLOTS["price_dist"])))

**Interpretation**: The price distribution is right-skewed: a long tail of expensive homes pulls the mean above the median, so most houses are priced below the mean. Log-transforming the target reduces this skew, which tends to help linear models.
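
The effect of the transformation can be checked numerically rather than by eye. The sketch below uses synthetic log-normal values as a stand-in for the `price` column (the real data is not used here, and the parameters are illustrative assumptions) and compares sample skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price column: log-normal draws are right-skewed,
# similar in shape to typical sale prices (illustrative parameters only).
rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5_000), name="price")

raw_skew = prices.skew()            # sample skewness of the raw values
log_skew = np.log1p(prices).skew()  # sample skewness after log(1 + x)

print(f"mean={prices.mean():,.0f}  median={prices.median():,.0f}")
print(f"skew(raw)={raw_skew:.2f}  skew(log)={log_skew:.2f}")
```

A skewness near zero after the transform is what the log-price histogram in the next section should show visually.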

### 2.2 Log-Transformed Price Distribution

In [None]:
plot_log_price_distribution(df, out_dir / EDA_PLOTS["price_log_dist"])
display(Image(filename=str(out_dir / EDA_PLOTS["price_log_dist"])))

**Interpretation**: After log transformation, the distribution becomes more symmetric and closer to normal, which is beneficial for linear regression models.

### 2.3 Feature Correlations

In [None]:
plot_correlation_heatmap(df, out_dir / EDA_PLOTS["correlation"], top_n=15)
display(Image(filename=str(out_dir / EDA_PLOTS["correlation"])))

**Interpretation**: The correlation heatmap reveals which features have the strongest linear relationships with price. Features like sqft_living, grade, and bathrooms typically show high correlation.
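
To read the same information as a ranked list instead of a heatmap, the correlations with `price` can be sorted by absolute value. This is a minimal sketch on a synthetic frame; the column `noise_col` and all coefficients are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on sqft_living and grade, not on noise_col.
rng = np.random.default_rng(0)
n = 1_000
sqft_living = rng.normal(2000, 600, n)
grade = rng.integers(5, 12, n).astype(float)
noise_col = rng.normal(0, 1, n)
price = 200 * sqft_living + 30_000 * grade + rng.normal(0, 100_000, n)

toy = pd.DataFrame(
    {"price": price, "sqft_living": sqft_living, "grade": grade, "noise_col": noise_col}
)

# Pearson correlation of every numeric column with price, strongest first.
corr = toy.select_dtypes(include="number").corr()["price"].drop("price")
corr_with_price = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(corr_with_price.round(3))
```

Pearson correlation only captures linear association, so a low value here does not rule out a non-linear effect.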

### 2.4 Living Area vs Price

In [None]:
plot_sqft_vs_price(df, out_dir / EDA_PLOTS["sqft_vs_price"])
display(Image(filename=str(out_dir / EDA_PLOTS["sqft_vs_price"])))

**Interpretation**: The scatter plot shows a positive relationship between living area and price. The color gradient (grade) indicates that higher-grade homes command premium prices even at similar square footage.

### 2.5 Geographic Distribution

In [None]:
plot_geographic_distribution(df, out_dir / EDA_PLOTS["geographic"])
display(Image(filename=str(out_dir / EDA_PLOTS["geographic"])))

**Interpretation**: Geographic clustering reveals that location significantly impacts price. Certain areas (likely waterfront or urban centers) show consistently higher prices.

## 3. Feature Engineering

We apply feature engineering to create additional predictive features from the raw data.

In [None]:
# Apply feature engineering
df_engineered = engineer_features(df)

print("New features created:")
new_cols = sorted(set(df_engineered.columns) - set(df.columns))  # sorted for stable output
for col in new_cols:
    print(f"  - {col}")

print(f"\nEngineered dataset shape: {df_engineered.shape}")
print("\nSample of new features:")
display(df_engineered[['sale_year', 'sale_month', 'house_age', 'was_renovated']].head())
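
The implementation of `engineer_features` is not shown in this notebook. As a rough illustration only, the four features displayed above could be derived as follows. The function `sketch_engineer_features` is hypothetical, and the input columns `date`, `yr_built`, and `yr_renovated` (with `0` meaning "never renovated") are assumptions about the raw data.

```python
import pandas as pd

def sketch_engineer_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for engineer_features; column names are assumed."""
    out = raw.copy()
    # Assumes sale dates are stored as strings like "20141013T000000".
    sale_date = pd.to_datetime(out["date"], format="%Y%m%dT%H%M%S")
    out["sale_year"] = sale_date.dt.year
    out["sale_month"] = sale_date.dt.month
    # Age of the house at the time of sale.
    out["house_age"] = out["sale_year"] - out["yr_built"]
    # Assumes yr_renovated == 0 encodes "never renovated".
    out["was_renovated"] = (out["yr_renovated"] > 0).astype(int)
    return out

toy = pd.DataFrame({
    "date": ["20141013T000000", "20150225T000000"],
    "yr_built": [1955, 1987],
    "yr_renovated": [0, 2003],
})
features = sketch_engineer_features(toy)
print(features[["sale_year", "sale_month", "house_age", "was_renovated"]])
```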

## 4. Model Training and Cross-Validation

Now we train three models and compare their performance using 5-fold cross-validation.

**Note**: This cell runs the full CV pipeline which may take several minutes.

In [None]:
# Run the CV pipeline
import subprocess

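# Use the notebook's own interpreter so the script runs in the same environment.
result = subprocess.run(
    [sys.executable, str(Path.cwd().parent / "scripts" / "run_cv.py"),
     "--data", str(Path.cwd().parent / DATA_PATH),
     "--out", str(out_dir)],
    capture_output=True,
    text=True
)

print(result.stdout)
if result.returncode != 0:
    print(f"run_cv.py failed with exit code {result.returncode}")
if result.stderr:
    # stderr may also carry warnings and progress output, not only errors
    print("stderr:", result.stderr)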
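
The script is not reproduced in this notebook, so the following is only a sketch of what a k-fold comparison of this kind can look like with scikit-learn. It uses synthetic data with a deliberate non-linear term and compares two of the three model families; the data, hyperparameters, and the `summary` structure are assumptions for illustration, not the contents of `run_cv.py`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression problem with a quadratic effect a linear model cannot fit.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(600, 3))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.1, 600)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=42),
}

summary = {}
for name, model in models.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring=["r2", "neg_root_mean_squared_error"]
    )
    summary[name] = {
        "r2_mean": scores["test_r2"].mean(),
        "r2_sd": scores["test_r2"].std(),
        # scikit-learn reports negated RMSE so that higher is always better
        "rmse_mean": -scores["test_neg_root_mean_squared_error"].mean(),
    }
    print(name, {k: round(v, 3) for k, v in summary[name].items()})
```

Sharing one `KFold` object across models keeps the folds identical, so differences in the scores reflect the models rather than the splits.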

## 5. Model Comparison Results

### 5.1 Comparison Table

In [None]:
# Display the comparison table
table_path = out_dir / "comparison_table.txt"
if table_path.exists():
    print(table_path.read_text())
else:
    print("Comparison table not found. Run the CV pipeline first.")

In [None]:
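# Load and display CSV results (skipped if the CV pipeline has not been run)
results_path = out_dir / "model_cv_results.csv"
if results_path.exists():
    display(pd.read_csv(results_path))
else:
    print("CV results not found. Run the CV pipeline first.")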

**Interpretation**: 
- **R² Score**: Measures the proportion of variance explained by the model (higher is better)
- **R² SD**: Standard deviation of R² across folds, an indicator of model stability (lower is better)
- **RMSE**: Root Mean Squared Error in dollars (lower is better)

Tree-based models (Random Forest and XGBoost) typically outperform Linear Regression due to their ability to capture non-linear relationships.
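
For reference, the three quantities in the table can be computed by hand. The sketch below uses made-up prices, predictions, and per-fold scores purely to show the arithmetic; none of the numbers come from this notebook's results.

```python
import numpy as np

# Illustrative sale prices and predictions in dollars (made-up values).
y_true = np.array([200_000, 300_000, 400_000, 500_000], dtype=float)
y_pred = np.array([210_000, 290_000, 420_000, 480_000], dtype=float)

ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical per-fold R² values, summarised the way a CV table would be.
fold_r2 = np.array([0.86, 0.88, 0.87, 0.85, 0.89])
r2_mean, r2_sd = fold_r2.mean(), fold_r2.std(ddof=1)

print(f"R2={r2:.3f}  RMSE=${rmse:,.0f}")
print(f"R2 mean={r2_mean:.3f}  SD={r2_sd:.3f}")
```

If a model is fitted on log-transformed prices, its predictions need to be transformed back before RMSE can be read in dollars.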

### 5.2 Linear Regression Residual Analysis

In [None]:
lr_residuals_path = out_dir / "lr_residuals.png"
if lr_residuals_path.exists():
    display(Image(filename=str(lr_residuals_path)))
else:
    print("Residual plot not found.")

**Interpretation**: 
- Systematic patterns in the residuals (for example, a curve or funnel shape when plotted against predicted price) indicate that Linear Regression fails to capture non-linear relationships
- Under-prediction of expensive homes suggests that a single linear relationship does not fit the upper end of the price range
- The residual distribution shows whether errors are approximately normally distributed
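
One way to make "systematic patterns" concrete is to average residuals within bins of the predicted value: for a well-specified model the bin means hover around zero, while a curved relationship fitted with a straight line leaves a clear pattern. The sketch below demonstrates this on synthetic data with an exponential trend; the data and the decile binning are illustrative assumptions, not the method behind `lr_residuals.png`.

```python
import numpy as np

# Synthetic convex relationship that a straight line cannot follow.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 2_000)
y = np.exp(2.0 * x) + rng.normal(0.0, 0.1, x.size)

# Ordinary least-squares line fitted with numpy.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted  # actual minus predicted

# Mean residual within each decile of the predicted value.
order = np.argsort(predicted)
decile_means = np.array([chunk.mean() for chunk in np.array_split(residuals[order], 10)])
print(np.round(decile_means, 2))
```

Positive means in the top decile correspond to under-prediction of the highest values, and negative means in the middle deciles to over-prediction there.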

### 5.3 Feature Importance Analysis

In [None]:
importance_path = out_dir / "tree_feature_importance.png"
if importance_path.exists():
    display(Image(filename=str(importance_path)))
else:
    print("Feature importance plot not found.")

**Interpretation**:
- **Quantity features** (sqft_living, sqft_lot): Physical size of the property
- **Quality features** (grade, condition): Build quality and maintenance
- **Location features** (lat, long, zipcode): Geographic factors
- **Age features** (house_age, yr_built, was_renovated): Property age and renovation status

The relative importance reveals which factors drive pricing decisions in the King County market.
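
As a point of reference for how such a ranking is obtained, scikit-learn tree ensembles expose impurity-based importances through `feature_importances_`. The sketch below fits a Random Forest on synthetic data; the column `random_noise` and all coefficients are invented, and this is not necessarily how `tree_feature_importance.png` was produced.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: price is driven mostly by sqft_living, then grade.
rng = np.random.default_rng(7)
n = 800
X = pd.DataFrame({
    "sqft_living": rng.normal(2000, 600, n),
    "grade": rng.integers(5, 12, n),
    "random_noise": rng.normal(0, 1, n),  # unrelated to the target by construction
})
y = 200 * X["sqft_living"] + 30_000 * X["grade"] + rng.normal(0, 50_000, n)

model = RandomForestRegressor(n_estimators=100, random_state=7).fit(X, y)

# Impurity-based importances are normalised to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can overstate high-cardinality or correlated features, so `sklearn.inspection.permutation_importance` on held-out data is a useful cross-check.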

## 6. Conclusions

### Key Findings:

1. **Data Characteristics**:
   - House prices are right-skewed, which motivates a log transformation of the target, particularly for linear models
   - Strong correlations exist between physical features (sqft) and price
   - Geographic location plays a significant role in pricing

2. **Model Performance**:
   - Tree-based models (RF and XGBoost) significantly outperform Linear Regression
   - XGBoost typically achieves the best performance with lowest RMSE
   - Linear Regression shows systematic residual patterns indicating model inadequacy

3. **Feature Importance**:
   - Living area (sqft_living) is consistently the most important feature
   - Quality indicators (grade) and location features are critical
   - Engineered features (house_age, was_renovated) provide additional predictive power

### Recommendations:

- **For Production**: Use XGBoost or Random Forest for best predictive accuracy
- **For Interpretability**: Linear Regression provides clear coefficient interpretation but at the cost of accuracy
- **Feature Engineering**: Continue exploring interaction terms and geographic clustering
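- **Future Work**: Consider stacking or blending the individual models (Random Forest and XGBoost are themselves tree ensembles, so the gain would come from combining different model families)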