# Coding Exercise: Melbourne Housing Price Prediction

**Dataset:** `./datasets/housing/melb_data.csv` (Melbourne housing)

## Your Task
You will build a regression model to predict house prices using the workflow from the demo notebook:

1. Load and inspect the data
2. Define features + label (target)
3. Split into train/test
4. Build a preprocessing + model **Pipeline**
5. Evaluate with RMSE
6. Use cross-validation without data leakage
7. (Optional) Tune with GridSearchCV

### Rules
- Fill in the code where you see `# TODO:`.
- Do **not** fit preprocessing on the test set.
- For cross-validation and grid search, use a **single Pipeline(preprocess + model)** to avoid leakage.

## 1) Imports
Run this cell to import the libraries you will need.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

## 2) Load the data
Load `melb_data.csv` into a DataFrame named `df`. Then show the first 5 rows and the column names.

In [None]:
# TODO: load the Melbourne housing dataset
# Path: ./datasets/housing/melb_data.csv
df = ...

# TODO: display the first 5 rows, then the column names
df.head()

### Quick inspection
1. How many rows and columns are there?
2. Which columns have missing values?
3. Which column looks like the target (price)?

In [None]:
# TODO: inspect the dataset
df.shape

In [None]:
# TODO: show missing values per column (sorted)
...
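
As a rough illustration of the inspection questions above, here is a minimal sketch on a tiny made-up DataFrame. The frame `toy`, its values, and every column name other than `Price` are assumptions for demonstration only, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; values and most column names are hypothetical.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [80.0, np.nan, 150.0, np.nan],
    "CouncilArea": ["Yarra", None, "Moreland", "Yarra"],
    "Price": [850000.0, 1200000.0, 1500000.0, 990000.0],
})

# Number of rows and columns.
n_rows, n_cols = toy.shape

# Count missing values per column, keep only columns with gaps, largest first.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(n_rows, n_cols)
print(missing)
```

The same calls should work unchanged on `df` once it is loaded.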

## 3) Define label (y) and features (X)
We will predict the column `Price`. Create:
- `y` = `df['Price']`
- `X` = all other columns (drop `Price`)

In [None]:
# TODO: define X and y
y = ...
X = ...

X.head()

## 4) Train/test split
Split into train and test sets:
- 80% train, 20% test
- set `random_state=42`

Create variables: `X_train`, `X_test`, `y_train`, `y_test`.

In [None]:
# TODO: train/test split
X_train, X_test, y_train, y_test = ...
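
The label/feature separation and the 80/20 split described in sections 3 and 4 might look like the sketch below. It runs on a made-up frame; the `toy_` names, the `Type` column, and all values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset with a Price column as the label.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5, 1, 2, 4, 3, 2],
    "Type": ["h", "u", "h", "t", "h", "u", "u", "h", "t", "h"],
    "Price": [850, 600, 1500, 990, 2100, 400, 650, 1700, 900, 800],
})

# Label is the Price column; features are everything else.
toy_y = toy["Price"]
toy_X = toy.drop(columns=["Price"])

# 80% train / 20% test, with a fixed seed so the split is reproducible.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)
```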

## 5) Build preprocessing (ColumnTransformer)
We will:
- **Numerical columns**: impute missing values (median) + scale (StandardScaler)
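- **Categorical columns**: impute missing values (most_frequent) + one-hot encode (set `handle_unknown="ignore"` so that categories not seen during fitting do not raise an error at prediction time)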

### Task
1. Identify categorical columns as the columns of `X_train` with dtype `object`.
2. Identify numerical columns as the remaining columns of `X_train`.
3. Build `preprocess = ColumnTransformer(...)` using two pipelines (numeric and categorical).

In [None]:
# TODO: identify categorical and numerical columns
categorical_cols = ...
numerical_cols = ...

categorical_cols, numerical_cols

In [None]:
# TODO: create numeric and categorical preprocessing pipelines
numeric_transformer = ...
categorical_transformer = ...

# TODO: build the full ColumnTransformer
preprocess = ...

preprocess
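
One possible shape for the two preprocessing pipelines and the `ColumnTransformer` is sketched below on a small made-up frame. The `toy_` names, the column names, and the step names (`"impute"`, `"scale"`, `"onehot"`, `"num"`, `"cat"`) are assumptions chosen for illustration; `handle_unknown="ignore"` is a suggested setting, not a requirement of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical features with gaps in both numeric and categorical columns.
toy_X = pd.DataFrame({
    "Rooms": [2.0, 3.0, np.nan, 4.0],
    "Landsize": [200.0, np.nan, 450.0, 600.0],
    "Type": pd.Series(["h", "u", np.nan, "h"], dtype=object),
})

# Categorical = object dtype; numerical = everything else.
toy_cat_cols = [c for c in toy_X.columns if toy_X[c].dtype == object]
toy_num_cols = [c for c in toy_X.columns if c not in toy_cat_cols]

# Numeric branch: median imputation, then standardization.
toy_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: most-frequent imputation, then one-hot encoding.
toy_categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

toy_preprocess = ColumnTransformer([
    ("num", toy_numeric, toy_num_cols),
    ("cat", toy_categorical, toy_cat_cols),
])

# Fitting here is only to inspect the output shape on the toy data.
toy_out = toy_preprocess.fit_transform(toy_X)
print(toy_out.shape)
```

In the exercise itself, `preprocess` would normally be left unfitted and placed inside the model pipeline, so that it is only ever fitted on training data.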

## 6) Baseline model (Linear Regression) in a Pipeline
Create a pipeline called `lin_model` that includes:
- `preprocess`
- `LinearRegression()`

Then:
1. Fit on the training set
2. Predict on the test set
3. Compute RMSE on the test set

In [None]:
# TODO: create the baseline pipeline (preprocess + LinearRegression)
lin_model = ...

# TODO: fit on the training set
...

# TODO: predict on the test set
...

# TODO: compute test RMSE
...
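
The fit / predict / RMSE sequence can be sketched as follows on synthetic data. Everything prefixed with `toy_`, the generated columns, and the step names `"preprocess"` and `"model"` are assumptions for the sake of a self-contained example. RMSE is computed as the square root of `mean_squared_error`, which works across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: price depends linearly on room count and property type.
rng = np.random.default_rng(0)
n = 200
rooms = rng.integers(1, 6, size=n).astype(float)
kind = rng.choice(["h", "u", "t"], size=n)
toy_y = 200 * rooms + np.where(kind == "h", 300, 0) + rng.normal(0, 20, size=n)
toy_X = pd.DataFrame({"Rooms": rooms, "Type": pd.Series(kind.tolist(), dtype=object)})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Simplified preprocessing (no imputation needed for this synthetic data).
toy_preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
])
toy_lin = Pipeline([("preprocess", toy_preprocess), ("model", LinearRegression())])

# Fit on the training split only, then score on the held-out split.
toy_lin.fit(toy_X_train, toy_y_train)
toy_pred = toy_lin.predict(toy_X_test)
toy_rmse = np.sqrt(mean_squared_error(toy_y_test, toy_pred))
print(toy_rmse)
```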

## 7) Cross-validation (no leakage)
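Evaluate your **pipeline** using 5-fold cross-validation RMSE on the training set (`X_train`, `y_train`), so the test set stays untouched until the final evaluation.

Reminder: because `lin_model` combines preprocessing and model in one Pipeline, `cross_val_score` refits the preprocessing on the training portion of each fold, so no statistics from the validation portion leak into training.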

In [None]:
# TODO: cross-validation RMSE (no leakage)
# Hint: use cross_val_score(..., scoring="neg_mean_squared_error", cv=5)
# The scores are negative MSE values, so RMSE = np.sqrt(-scores)
lin_scores = ...
lin_rmse_scores = ...
print(...)
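
To illustrate the conversion from negative MSE to RMSE, here is a small sketch on synthetic data from `make_regression`. The data, the `toy_` names, and the simplified scaler-only preprocessing are assumptions; the point is only the scoring pattern.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the training set.
toy_X, toy_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Preprocessing lives inside the pipeline, so it is refit within each fold.
toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# scikit-learn maximizes scores, so MSE is reported with a negative sign.
toy_scores = cross_val_score(
    toy_pipe, toy_X, toy_y, scoring="neg_mean_squared_error", cv=5
)

# Flip the sign, then take the square root to get one RMSE per fold.
toy_rmse_scores = np.sqrt(-toy_scores)
print(toy_rmse_scores.mean(), toy_rmse_scores.std())
```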

## 8) Try two non-linear models
Create two more pipelines:
- `tree_model` = DecisionTreeRegressor
- `forest_model` = RandomForestRegressor

### Requirements
- Both must include the same `preprocess` step.
- Set `random_state=42` for reproducibility.
- Evaluate both using 5-fold CV RMSE (same method as before).

In [None]:
# TODO: Decision Tree pipeline + CV RMSE
# Hint: DecisionTreeRegressor(random_state=42)

tree_model = ...
tree_scores = ...
tree_rmse_scores = ...
print(...)

In [None]:
# TODO: Random Forest pipeline + CV RMSE
forest_model = ...
forest_scores = ...
forest_rmse_scores = ...
print(...)
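
When several models share the same preprocessing and the same evaluation method, a loop can keep the comparison consistent. The sketch below assumes synthetic data, a median imputer as a stand-in for `preprocess`, and a reduced `n_estimators=20` purely to keep the example fast; none of these come from the exercise.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

toy_X, toy_y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

# Candidate estimators, each seeded for reproducibility.
toy_candidates = {
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=20, random_state=42),
}

toy_results = {}
for name, estimator in toy_candidates.items():
    # Same preprocessing step in front of every candidate model.
    pipe = Pipeline([
        ("preprocess", SimpleImputer(strategy="median")),
        ("model", estimator),
    ])
    scores = cross_val_score(
        pipe, toy_X, toy_y, scoring="neg_mean_squared_error", cv=5
    )
    # Mean of the per-fold RMSE values.
    toy_results[name] = np.sqrt(-scores).mean()

for name, rmse in toy_results.items():
    print(name, rmse)
```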

## 9) (Optional) Hyperparameter tuning with GridSearchCV
Tune the Random Forest **pipeline** using GridSearchCV.

### Task
1. Import `GridSearchCV`
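2. Use a `param_grid` with parameters prefixed by `model__` (this assumes the estimator step in your Pipeline is named `model`; the prefix must match the step name you chose)
   - Example: `model__n_estimators`, `model__max_features`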
3. Use `cv=3` to keep runtime reasonable
4. Fit on `X_train`, `y_train`
5. Print the best params and best RMSE (convert from negative MSE)

In [None]:
# TODO: GridSearchCV over the Random Forest pipeline
# Hint: parameters should be prefixed with model__ (e.g., model__n_estimators)

from sklearn.model_selection import GridSearchCV

param_grid = ...

grid_search = ...

grid_search.fit(X_train, y_train)
print(...)
print(...)
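
The `<step name>__<parameter>` convention can be sketched as follows. The synthetic data, the `toy_` names, the grid values, and the step names `"preprocess"` and `"model"` are assumptions for illustration; a real grid would be chosen for the housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_X, toy_y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

toy_pipe = Pipeline([
    ("preprocess", StandardScaler()),
    ("model", RandomForestRegressor(random_state=42)),
])

# Keys are "<step name>__<parameter>"; the step is named "model" above.
toy_param_grid = {
    "model__n_estimators": [10, 30],
    "model__max_features": [2, 4],
}

toy_search = GridSearchCV(
    toy_pipe, toy_param_grid, scoring="neg_mean_squared_error", cv=3
)
toy_search.fit(toy_X, toy_y)

# best_score_ is the mean negative MSE across folds; this takes the
# square root of the mean MSE (not the mean of per-fold RMSE values).
toy_best_rmse = np.sqrt(-toy_search.best_score_)
print(toy_search.best_params_, toy_best_rmse)
```

With the default `refit=True`, `toy_search.best_estimator_` is already refit on all the data passed to `fit`.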

## 10) Final evaluation on the test set
Choose your final model:
- If you did grid search: use `grid_search.best_estimator_`
- Otherwise: use `forest_model` (or your best model)

Then compute RMSE on `X_test` / `y_test`.

In [None]:
# TODO: pick final model
final_model = ...
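# TODO: fit on full training data
# (grid_search.best_estimator_ is already refit on the full training set
# when refit=True, the default; refitting it here is harmless)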
final_model.fit(...)
# TODO: evaluate on test set
test_pred = ...
test_rmse = ...
test_rmse