# UK House Price Estimator - Model Training and Evaluation


## Objective
Train a regression model to estimate house prices and evaluate its performance.

## Input
- `HousePricesRecords_clean.csv`: Cleaned dataset with numeric and categorical features prepared.

## Output
- Trained regression model
- Model evaluation metrics (e.g., MAE, RMSE)
- Price prediction examples

## Key Tasks
- Encode categorical variables
- Split the data into training and test sets
- Train a regression model (e.g., Linear Regression or Random Forest)
- Evaluate model accuracy using test data


## Imports, Load & Split Data

In [None]:

import json
import os

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

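# Load the cleaned dataset
df = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")

# Target: sale price
y = df["Price"]

# Features: everything except the target and the transfer date
# (errors="ignore" skips any listed column that is not present)
X = df.drop(columns=["Price", "Date of Transfer"], errors="ignore")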

## Features

In [None]:
numeric_features = [
    "Year", "Month",
    "RegionMedianPrice", "RegionSaleCount",
    "CountyMedianPrice", "CountySaleCount"
]

categorical_features = [
    "Old/New", "Duration",
    "Town/City", "County", "PPDCategory Type",
    "Property_D", "Property_F", "Property_S", "Property_T",
    "Region"
]

## Filter Feature Lists to Available Columns

In [None]:

numeric_features     = [c for c in numeric_features     if c in X.columns]
categorical_features = [c for c in categorical_features if c in X.columns]
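The filter above silently drops any configured feature that is absent from `X`. As a minimal sketch, the check below also reports which names were dropped, which can help catch a misspelt column. The helper name `split_present_missing` and the toy frame are hypothetical and are not part of the notebook.

```python
import pandas as pd


def split_present_missing(wanted, columns):
    """Return (present, missing) lists, preserving the order of `wanted`."""
    available = set(columns)
    present = [c for c in wanted if c in available]
    missing = [c for c in wanted if c not in available]
    return present, missing


# Toy frame standing in for X (illustrative columns only).
toy_X = pd.DataFrame({"Year": [2020, 2021], "County": ["KENT", "ESSEX"]})

present, missing = split_present_missing(["Year", "Month", "County"], toy_X.columns)
print("present:", present)
print("missing:", missing)
```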

## Build Preprocessing & Modeling Pipeline

Create a scikit-learn pipeline that standardises numeric features, one-hot encodes categorical features, and passes the transformed data to a Random Forest regressor. The pipeline is only defined here; no fitting happens in this cell.



In [None]:
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
])
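The `handle_unknown="ignore"` setting matters when the test set contains a category that never appeared in training, for example a town with a single sale. The sketch below uses a toy frame (hypothetical data, with `toy_preprocessor` as an illustrative name) to show that such a row is encoded as all zeros in the one-hot block instead of raising an error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: "DEVON" is not seen during fitting.
train = pd.DataFrame({"Year": [2019, 2020, 2021],
                      "County": ["KENT", "ESSEX", "KENT"]})
new = pd.DataFrame({"Year": [2020], "County": ["DEVON"]})

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Year"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["County"]),
])
toy_preprocessor.fit(train)

encoded = toy_preprocessor.transform(new)
# The transformer may return a sparse matrix; convert to a dense array for display.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

# Column 0 is the scaled year; the remaining columns are the one-hot block.
print(encoded)
```

With the encoder's default `handle_unknown="error"`, the same `transform` call would raise a `ValueError`.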

## Train-Test Split

Hold out 20% of the rows as a test set so that model performance can be measured on data the model has not seen during training.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,  # fixed seed for a reproducible split
    shuffle=True
)
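Once the split exists, the model can be fitted on the training portion and scored on the held-out portion with the metrics listed under Output. The following is a minimal sketch, not the notebook's own evaluation code: `evaluate_regressor` is a hypothetical helper, and the synthetic data stands in for the real features so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_regressor(model, X_train, X_test, y_train, y_test):
    """Fit on the training split and return MAE, RMSE and R2 on the test split."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return {
        "MAE": float(mean_absolute_error(y_test, preds)),
        # RMSE is the square root of the mean squared error.
        "RMSE": float(np.sqrt(mean_squared_error(y_test, preds))),
        "R2": float(r2_score(y_test, preds)),
    }


# Synthetic stand-in data (assumption: not the house price dataset).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# train_test_split returns X_train, X_test, y_train, y_test in that order.
splits = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
metrics = evaluate_regressor(
    RandomForestRegressor(n_estimators=50, random_state=42), *splits
)
print(metrics)
```

In the notebook itself, the same helper could be called as `evaluate_regressor(pipeline, X_train, X_test, y_train, y_test)`, since a scikit-learn `Pipeline` exposes the same `fit` and `predict` methods.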