# 🏠 House Price Prediction (Starter Notebook)

This notebook contains a beginner-friendly end-to-end pipeline for a house price regression task using the California Housing dataset from scikit-learn.

## 1. Setup

If you run this locally, install the required packages first:

```bash
pip install numpy pandas matplotlib scikit-learn jupyter
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# display settings
pd.set_option('display.max_columns', 50)

In [None]:
data = fetch_california_housing(as_frame=True)
df = data.frame

# Only a cell's last expression is rendered, so print the preview explicitly
print(df.head())

# Quick info
print('Shape:', df.shape)
print('\nFeatures:\n', data.feature_names)


In [None]:
df.describe().T

In [None]:
X = df.drop(columns=['MedHouseVal'])
y = df['MedHouseVal']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

pred_lr = lr.predict(X_test)
mae_lr = mean_absolute_error(y_test, pred_lr)
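# `squared=False` was deprecated in scikit-learn 1.4 and later removed;
# taking the square root of the MSE works on old and new versions alike
rmse_lr = np.sqrt(mean_squared_error(y_test, pred_lr))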

print('Linear Regression MAE:', round(mae_lr, 3))
print('Linear Regression RMSE:', round(rmse_lr, 3))

In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)

pred_rf = rf.predict(X_test)
mae_rf = mean_absolute_error(y_test, pred_rf)
# Square root of the MSE (the `squared=False` argument is gone in newer scikit-learn)
rmse_rf = np.sqrt(mean_squared_error(y_test, pred_rf))

print('Random Forest MAE:', round(mae_rf, 3))
print('Random Forest RMSE:', round(rmse_rf, 3))

In [None]:
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
# Print explicitly: a bare expression in the middle of a cell is not displayed
print(importances.head(10))

# Compare predictions vs actual (first 10 rows)
comparison = pd.DataFrame({'actual': y_test.values, 'pred_lr': pred_lr, 'pred_rf': pred_rf})
comparison.head(10)

## Next steps (suggested)
- Perform feature engineering (create new features, log transforms)
- Tune hyperparameters (GridSearchCV / RandomizedSearchCV)
- Try scaling and regularization (Ridge, Lasso)
- Build cross-validation pipelines and persist the best model with `joblib` or `pickle`
- Add visualizations (residuals, prediction error scatterplots)
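
As a starting point for the scaling, regularization, and tuning suggestions above, here is a minimal sketch that wraps `StandardScaler` and `Ridge` in a `Pipeline` and tunes `alpha` with `GridSearchCV`. The helper name `build_ridge_search`, the alpha grid, and the synthetic `make_regression` data are assumptions made for illustration so the block runs without downloading anything; in this notebook you would fit on `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_ridge_search(alphas=(0.1, 1.0, 10.0), cv=5):
    """Hypothetical helper: scaling + Ridge inside a cross-validated grid search."""
    pipe = Pipeline([
        # The scaler is refit on each training fold, so no statistics
        # leak from the validation fold into preprocessing.
        ("scale", StandardScaler()),
        ("ridge", Ridge()),
    ])
    return GridSearchCV(
        pipe,
        param_grid={"ridge__alpha": list(alphas)},  # illustrative grid, not tuned
        scoring="neg_mean_absolute_error",          # scikit-learn maximizes scores, hence "neg"
        cv=cv,
    )


# Synthetic stand-in data (assumption) so the sketch runs offline.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

search = build_ridge_search()
search.fit(X_tr, y_tr)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Cross-validated MAE:", round(-search.best_score_, 3))
```

Because `GridSearchCV` refits the best pipeline on the full training data by default, `search.best_estimator_` can be used for prediction directly, and it could then be saved with `joblib.dump(search.best_estimator_, "ridge_pipeline.joblib")` (the file name is only an example).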

---

*When you're ready, upload this notebook to GitHub in the `house_price_prediction/` folder.*