# UK House Price Estimator - Model Training and Evaluation


## Objective
Train a regression model to estimate house prices and evaluate its performance.

## Input
- `HousePricesRecords_clean.csv`: Cleaned dataset with numeric and categorical features prepared.

## Output
- Trained regression model
- Model evaluation metrics (MAE, RMSE, R²)
- Price prediction examples

## Key Tasks
- Encode categorical variables
- Split the data into training and test sets
- Train a regression model (a Random Forest regressor, tuned with randomized search)
- Evaluate model accuracy using test data


## Imports, Load & Split Data

In [16]:

import json
import os

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

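# Load the cleaned dataset and separate the target from the features.
df = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")
y  = df["Price"]
# errors="ignore" skips any listed column that is absent from the file.
X  = df.drop(columns=["Price", "Date of Transfer"], errors="ignore")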

## Features

In [17]:
numeric_features = [
    "Year", "Month",
    "RegionMedianPrice", "RegionSaleCount",
    "CountyMedianPrice", "CountySaleCount"
]

categorical_features = [
    "Old/New", "Duration",
    "Town/City", "County", "PPDCategory Type",
    "Property_D", "Property_F", "Property_S", "Property_T",
    "Region"
]

## Filter Feature Lists to Available Columns

In [18]:

numeric_features     = [c for c in numeric_features     if c in X.columns]
categorical_features = [c for c in categorical_features if c in X.columns]

## Build Preprocessing & Modeling Pipeline

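Define a scikit-learn pipeline that standardises the numeric features, one-hot encodes the categorical features, and passes the transformed data to a Random Forest regressor. Categories not seen during fitting are encoded as all zeros (`handle_unknown="ignore"`). The pipeline is only defined here; it is fitted later, during the hyperparameter search.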



In [19]:
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
])
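To make the preprocessing step concrete, here is a minimal sketch on toy data (the column values and the `toy_` names are made up for illustration and are not from the project dataset). It shows the output columns the `ColumnTransformer` produces and what happens to a category that was not seen during fitting. `sparse_threshold=0` is set only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only.
toy_train = pd.DataFrame({
    "Year": [2019, 2020, 2021, 2022],
    "Region": ["London", "Wales", "London", "Scotland"],
})
toy_new = pd.DataFrame({"Year": [2023], "Region": ["Yorkshire"]})  # unseen region

toy_preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Year"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
    ],
    sparse_threshold=0,  # force a dense array so it is easy to inspect
)

train_matrix = toy_preprocessor.fit_transform(toy_train)
new_matrix = toy_preprocessor.transform(toy_new)

# One scaled numeric column plus one indicator column per region seen in training.
feature_names = list(toy_preprocessor.get_feature_names_out())
print(feature_names)

# The unseen region gets zeros in every indicator column instead of raising an error.
print(new_matrix)
```

The same behaviour applies inside the full pipeline: a town or county that appears only in the test set does not break prediction, but it also contributes no categorical signal.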

## Train-Test Split

Hold out 20% of the rows as a test set so that model performance is evaluated on data not seen during training or tuning. A fixed `random_state` makes the split reproducible.

In [20]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,  
    shuffle=True
)
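As a quick illustration of the split settings, the sketch below applies the same arguments to a toy array (the `X_toy` and `y_toy` names are hypothetical). It checks the 80/20 proportions and that repeating the call with the same `random_state` gives an identical split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows with two features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

split_a = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42, shuffle=True)
split_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42, shuffle=True)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))

# Same seed, same data: the two splits match element for element.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```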

## Define Hyperparameter Grid

Specify the range of hyperparameters to explore for the `RandomForestRegressor` during hyperparameter tuning. These parameters control model complexity, tree depth, and randomness:

- **`n_estimators`**: Number of trees in the forest.
- **`max_depth`**: Maximum depth of each tree; shallower trees are less prone to overfitting.
- **`min_samples_split`**: Minimum number of samples required to split an internal node.
- **`min_samples_leaf`**: Minimum number of samples required at a leaf node.
- **`max_features`**: Number of features considered at each split: `"sqrt"`, `"log2"`, or `None` to use all features.
- **`bootstrap`**: Whether each tree is trained on a bootstrap sample (`True`) or on the full training set (`False`).



In [21]:
param_grid = {
    "regressor__n_estimators":      [100, 200, 300],
    "regressor__max_depth":         [5, 10, 20],
    "regressor__min_samples_split": [2, 5, 10],
    "regressor__min_samples_leaf":  [1, 2, 4],
    "regressor__max_features":      ["sqrt", "log2", None],
    "regressor__bootstrap":         [True, False],
}
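To get a sense of how large this search space is, the sketch below counts the combinations with scikit-learn's `ParameterGrid`. The dictionary is repeated under the hypothetical name `grid_copy` so the snippet runs on its own.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the grid above, repeated so this snippet is self-contained.
grid_copy = {
    "regressor__n_estimators":      [100, 200, 300],
    "regressor__max_depth":         [5, 10, 20],
    "regressor__min_samples_split": [2, 5, 10],
    "regressor__min_samples_leaf":  [1, 2, 4],
    "regressor__max_features":      ["sqrt", "log2", None],
    "regressor__bootstrap":         [True, False],
}

# 3 * 3 * 3 * 3 * 3 * 2 distinct combinations.
n_combinations = len(ParameterGrid(grid_copy))
print(n_combinations)  # 486
```

An exhaustive grid search with 3-fold cross-validation would therefore need 1,458 fits, which is the usual motivation for sampling a subset instead.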

## Hyperparameter Tuning with RandomizedSearchCV

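Run a randomized search that samples 50 of the 486 combinations in the grid. Each candidate is scored by negative mean absolute error using 3-fold cross-validation on the training set, and the best-scoring pipeline, refitted on the full training set, is kept as `best_model`. Because only a sample is tried, the result is a good candidate rather than a guaranteed optimum.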

In [22]:
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_grid,
    n_iter=50,
    cv=3,
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

best_model = random_search.best_estimator_
print("🔍 Best params (random):", random_search.best_params_)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
🔍 Best params (random): {'regressor__n_estimators': 100, 'regressor__min_samples_split': 10, 'regressor__min_samples_leaf': 4, 'regressor__max_features': None, 'regressor__max_depth': 5, 'regressor__bootstrap': True}
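Beyond `best_params_`, the fitted search object also records the cross-validated score of every candidate. The sketch below shows one way to inspect them, using a small synthetic regression problem and a reduced search space so that it runs quickly; the data, the `search` object, and the parameter values here are illustrative assumptions, not the notebook's own.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the real training set.
X_syn, y_syn = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5, None]},
    n_iter=4,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=42,
)
search.fit(X_syn, y_syn)

# The scorer is negated MAE (higher is better), so flip the sign to read it as an error.
results = pd.DataFrame(search.cv_results_)
results["cv_mae"] = -results["mean_test_score"]
summary = results.sort_values("rank_test_score")[["params", "cv_mae", "std_test_score"]]
print(summary.to_string(index=False))

best_cv_mae = -search.best_score_
print(f"Best cross-validated MAE: {best_cv_mae:.2f}")
```

Comparing the cross-validated MAE of the winning candidate with the test-set MAE is a simple check on whether the tuned model generalises as expected.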


## Verify All Features Are Numeric

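List the columns of the raw feature matrix `X` that do not have a numeric dtype. These columns do not need manual encoding before fitting: those named in `categorical_features` are one-hot encoded inside the pipeline, and any column that appears in neither feature list (here `Property_O`) is dropped by the `ColumnTransformer`, whose `remainder` parameter defaults to `"drop"`.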


In [23]:
non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
print("Still non-numeric:", non_numeric)

Still non-numeric: ['Town/City', 'County', 'PPDCategory Type', 'Property_D', 'Property_F', 'Property_O', 'Property_S', 'Property_T', 'Region']
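The `Property_*` columns appear in this list even though they look like dummy variables. One plausible explanation, which is an assumption since the dtypes are not shown, is that they are stored as booleans: `select_dtypes(exclude=[np.number])` treats `bool` as non-numeric. A minimal sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame; the boolean dtype of Property_D is an assumption about the real data.
toy = pd.DataFrame({
    "Year": [2020, 2021],
    "Property_D": [True, False],
    "County": ["KENT", "ESSEX"],
})

non_numeric_toy = toy.select_dtypes(exclude=[np.number]).columns.tolist()
print(toy.dtypes)
print(non_numeric_toy)  # ['Property_D', 'County']
```

Running `X.dtypes` on the real feature matrix would confirm or rule out this explanation.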


## List Available Training Features

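Display all column names in `X_train` as a check on the raw inputs to the pipeline. This list is taken before preprocessing, so it includes columns the model never sees: only those named in `numeric_features` and `categorical_features` pass through the `ColumnTransformer`. In particular `LogPrice`, which by its name appears to be derived from the target, is present in `X_train` but is not in either feature list; adding it to one would leak the target into the model.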


In [24]:

print("Available columns:", X_train.columns.tolist())

Available columns: ['Old/New', 'Duration', 'Town/City', 'County', 'PPDCategory Type', 'Year', 'Month', 'Property_D', 'Property_F', 'Property_O', 'Property_S', 'Property_T', 'Region', 'RegionMedianPrice', 'RegionSaleCount', 'CountyMedianPrice', 'CountySaleCount', 'LogPrice']
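To see which columns actually reach the regressor, one option is to ask a fitted `ColumnTransformer` for its output feature names. The sketch below does this on a toy frame (hypothetical values) that includes a `LogPrice` column not named in any transformer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame: LogPrice is present but deliberately not listed in any transformer.
toy = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "Region": ["London", "Wales", "London"],
    "LogPrice": [12.1, 11.8, 12.4],
})

ct = ColumnTransformer([
    ("num", StandardScaler(), ["Year"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
])  # remainder defaults to "drop", so unlisted columns are discarded
ct.fit(toy)

used = list(ct.get_feature_names_out())
print(used)  # ['num__Year', 'cat__Region_London', 'cat__Region_Wales']
```

On the fitted pipeline from this notebook, the equivalent call would be `best_model.named_steps["preprocessor"].get_feature_names_out()`.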


## Evaluate Model Performance on Test Set

Use the trained pipeline (`best_model`) to predict prices for the held-out test set and compute key evaluation metrics:

- **Mean Absolute Error (MAE):** Average absolute difference between predicted and actual prices.  
- **Root Mean Squared Error (RMSE):** Square root of the average squared prediction errors, penalizing larger errors.  
- **R² Score:** Proportion of variance in the target explained by the model.

In [25]:

y_pred = best_model.predict(X_test)

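mae  = mean_absolute_error(y_test, y_pred)
# The `squared` argument of mean_squared_error is deprecated in newer
# scikit-learn releases, so take the square root explicitly instead.
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2   = r2_score(y_test, y_pred)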

print(f"Test MAE:  £{mae:,.0f}")
print(f"Test RMSE: £{rmse:,.0f}")
print(f"Test R²:   {r2:.2f}")

Test MAE:  £69,667
Test RMSE: £118,120
Test R²:   0.53
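For reference, the three metrics can be computed by hand from their definitions. The sketch below uses four made-up prices (not taken from the dataset) and checks the manual results against scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted prices in pounds.
y_true = np.array([200_000.0, 350_000.0, 150_000.0, 500_000.0])
y_hat  = np.array([210_000.0, 330_000.0, 180_000.0, 450_000.0])

errors = y_true - y_hat

# MAE: mean of the absolute errors.
mae_manual = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error.
rmse_manual = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MAE:  £{mae_manual:,.0f}")
print(f"RMSE: £{rmse_manual:,.0f}")
print(f"R²:   {r2_manual:.3f}")

# Cross-check against scikit-learn.
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```

RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest, which is consistent with the test results above (RMSE of about £118k against MAE of about £70k).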


## Save Evaluation Metrics

Persist key model performance metrics to a JSON file for record-keeping and deployment dashboards.



In [26]:
os.makedirs("../outputs/models", exist_ok=True)
metrics = {"mae": mae, "rmse": rmse, "r2": r2}
with open("../outputs/models/metrics.json", "w") as f:
    json.dump(metrics, f)

print("Saved metrics to outputs/models/metrics.json")

Saved metrics to outputs/models/metrics.json
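A consumer such as a dashboard can read the file back with the standard `json` module. The sketch below round-trips a metrics dictionary through a temporary directory so that it does not touch the project's `outputs` folder; the values are the rounded figures printed above, and the `example_metrics` name is hypothetical.

```python
import json
import os
import tempfile

# Rounded values from the evaluation output above, used purely as sample data.
example_metrics = {"mae": 69667.0, "rmse": 118120.0, "r2": 0.53}

with tempfile.TemporaryDirectory() as tmp_dir:
    metrics_path = os.path.join(tmp_dir, "metrics.json")

    # Write with indentation so the file is easy to read by eye.
    with open(metrics_path, "w") as f:
        json.dump(example_metrics, f, indent=2)

    # Read it back as a plain dictionary.
    with open(metrics_path) as f:
        loaded = json.load(f)

print(f"MAE £{loaded['mae']:,.0f} | RMSE £{loaded['rmse']:,.0f} | R² {loaded['r2']:.2f}")
```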


## Serialize Trained Model Pipeline

Save the finalized machine learning pipeline so it can be loaded for inference or deployment without retraining.

In [27]:
os.makedirs("../outputs/models", exist_ok=True)
joblib.dump(best_model, "../outputs/models/house_price_pipeline.pkl")
print("✅ Pipeline saved to ../outputs/models/house_price_pipeline.pkl")

✅ Pipeline saved to ../outputs/models/house_price_pipeline.pkl
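Loading the saved pipeline for inference could look like the sketch below. It trains a tiny stand-in pipeline on made-up rows, saves it to a temporary directory, reloads it with `joblib.load`, and predicts a price for a new row. The data, column subset, and `toy_` names are assumptions for illustration; with the real file, the input frame would need the same feature columns used in training.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows standing in for the real dataset.
toy_X = pd.DataFrame({
    "Year": [2019, 2020, 2021, 2022],
    "Region": ["London", "Wales", "London", "Wales"],
})
toy_y = pd.Series([400_000, 180_000, 420_000, 190_000])

toy_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["Year"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=10, random_state=42)),
])
toy_pipeline.fit(toy_X, toy_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, model_path)
    reloaded = joblib.load(model_path)  # no retraining needed

# Raw, untransformed input: the pipeline applies scaling and encoding itself.
new_rows = pd.DataFrame({"Year": [2023], "Region": ["London"]})
original_pred = toy_pipeline.predict(new_rows)
reloaded_pred = reloaded.predict(new_rows)
print(f"Predicted price: £{reloaded_pred[0]:,.0f}")
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, so recording the scikit-learn version alongside the model file is a sensible precaution.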
