# 📊 Housing price model

**Author: Tiebe Goossens**

## 1️⃣ Loading and Preparing the Training Dataset

In this notebook, we train several machine-learning models to predict housing 
prices based on the UK Price Paid dataset.

We start by loading the cleaned and prepared dataset produced in the previous
“Final Prep” notebook. Because the full dataset contains over 22 million rows,
we use a 10% training subset that remains large enough to capture price
patterns while still allowing local training.


In [1]:
!pip install scikit-learn lightgbm pycaret






In [2]:
import pandas as pd
import numpy as np

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from lightgbm import LGBMRegressor
import math

from pycaret.regression import setup, compare_models, tune_model, finalize_model, save_model, predict_model

In [3]:
# Load your 10% subset dataset (as provided)
df = pd.read_csv("../../Data/housing_prices/price_paid_records_prepared_subset_10.csv")

# Convert date column to datetime and sort in time order
df["date_of_transfer"] = pd.to_datetime(df["date_of_transfer"])
df = df.sort_values("date_of_transfer").reset_index(drop=True)

df.head()

|  | transaction_unique_identifier | price | date_of_transfer | property_type | oldnew | duration | towncity | district | county | ppdcategory_type | record_status__monthly_file_only |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {891F7B57-80C4-4BAA-83A8-BAE12D824A6E} | 35000 | 1995-01-01 | D | N | F | STOURBRIDGE | DUDLEY | WEST MIDLANDS | A | A |
| 1 | {677AA6E5-A626-461A-BE2B-7B0580842D46} | 300000 | 1995-01-01 | F | Y | L | LONDON | MERTON | GREATER LONDON | A | A |
| 2 | {9AE2B97B-69D0-4A5E-A4A9-3B496E56C6EB} | 125000 | 1995-01-01 | D | N | F | MACCLESFIELD | MACCLESFIELD | CHESHIRE | A | A |
| 3 | {82C4FDD1-D6EB-4264-9A23-3640B8EF5973} | 17000 | 1995-01-01 | T | N | F | HULL | KINGSTON UPON HULL | HUMBERSIDE | A | A |
| 4 | {7000CF90-DE77-4022-BFBE-99C0265A984D} | 52000 | 1995-01-01 | S | N | L | WARRINGTON | WARRINGTON | WARRINGTON | A | A |


In [4]:
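def make_features(df, lags=(1, 10, 100), rolls=(5, 20)):
    # Tuples avoid the mutable-default-argument pitfall.
    # Lags and rolling windows are counted in rows (previous transactions in date order), not days.
    df2 = df.copy()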
    
    # Time-based features
    df2["year"] = df2["date_of_transfer"].dt.year
    df2["month"] = df2["date_of_transfer"].dt.month
    df2["dayofweek"] = df2["date_of_transfer"].dt.dayofweek
    df2["quarter"] = df2["date_of_transfer"].dt.quarter
    df2["dayofmonth"] = df2["date_of_transfer"].dt.day
    
    # Lag features on price
    for l in lags:
        df2[f"price_lag_{l}"] = df2["price"].shift(l)
    
    # Rolling stats on price (shifted by 1 to avoid leakage)
    for r in rolls:
        df2[f"price_roll_mean_{r}"] = df2["price"].shift(1).rolling(r).mean()
        df2[f"price_roll_std_{r}"] = df2["price"].shift(1).rolling(r).std()
    
    return df2

df_feat = make_features(df)

# Drop rows that don't have full lag/rolling data
df_feat = df_feat.dropna().reset_index(drop=True)

df_feat.head()

|  | transaction_unique_identifier | price | date_of_transfer | property_type | oldnew | duration | towncity | district | county | ppdcategory_type | ... | dayofweek | quarter | dayofmonth | price_lag_1 | price_lag_10 | price_lag_100 | price_roll_mean_5 | price_roll_std_5 | price_roll_mean_20 | price_roll_std_20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | {C888AFBD-1EE6-4C4A-8ADF-23325B8B5EB0} | 42500 | 1995-01-03 | T | Y | F | BRIDLINGTON | EAST YORKSHIRE | HUMBERSIDE | A | ... | 1 | 1 | 3 | 38000.0 | 67400.0 | 35000.0 | 68600.0 | 20097.885461 | 51195.0 | 21065.623656 |
| 1 | {D1418EA2-7D82-482D-A44C-4143D75A4790} | 86000 | 1995-01-03 | D | N | F | LEEDS | LEEDS | WEST YORKSHIRE | A | ... | 1 | 1 | 3 | 42500.0 | 32000.0 | 300000.0 | 58900.0 | 18198.214198 | 51370.0 | 20973.31889 |
| 2 | {9502E087-DC88-4DAF-8D71-F794F9587E49} | 50000 | 1995-01-03 | T | N | F | BRAMPTON | TYNEDALE | NORTHUMBERLAND | A | ... | 1 | 1 | 3 | 86000.0 | 60000.0 | 125000.0 | 61600.0 | 21434.201641 | 51570.0 | 21297.346017 |
| 3 | {85B55554-E0B6-4CD8-BD01-48C3DC7C1041} | 35000 | 1995-01-03 | T | N | F | SHERBORNE | SOUTH SOMERSET | SOMERSET | A | ... | 1 | 1 | 3 | 50000.0 | 29500.0 | 17000.0 | 59200.0 | 22041.438247 | 51770.0 | 21261.036762 |
| 4 | {299C3702-AC93-4B64-8638-A53745A051E5} | 39500 | 1995-01-03 | S | N | L | MANCHESTER | BURY | GREATER MANCHESTER | A | ... | 1 | 1 | 3 | 35000.0 | 59000.0 | 52000.0 | 50300.0 | 20741.263221 | 51845.0 | 21195.741255 |
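
The `shift(1)` before `rolling(...)` in `make_features` is what keeps the current row's price out of its own rolling statistics. The sketch below illustrates this on a made-up five-value series (the numbers are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy price series, already in time order (illustrative values only).
prices = pd.Series([100, 200, 300, 400, 500], name="price")

# Rolling mean over the current and previous row: the window includes the target itself.
leaky = prices.rolling(2).mean()

# Shift first, then roll: the window only covers rows strictly before the current one.
safe = prices.shift(1).rolling(2).mean()

demo = pd.DataFrame({"price": prices, "leaky_mean_2": leaky, "safe_mean_2": safe})
print(demo)
```

In the `safe_mean_2` column, the value at row 2 is the mean of rows 0 and 1 only, whereas `leaky_mean_2` at row 2 already contains the price at row 2.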


## 2️⃣ Time-Based Train/Test Split

We must avoid random shuffling when training models on time-series data.
Instead, we split the dataset chronologically:

- First 80% → Training  
- Last 20% → Testing  

This ensures the model is evaluated on *future* data it has never seen, which 
better reflects real-world prediction scenarios.
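
If more than one chronological evaluation window is wanted, scikit-learn's `TimeSeriesSplit` applies the same idea repeatedly. This is a minimal sketch on a toy array standing in for rows sorted by `date_of_transfer`; it is not part of the notebook's pipeline:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in for a feature matrix whose rows are already sorted by date.
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Every training index precedes every test index, so no future rows leak in.
    print(fold, train_idx.tolist(), test_idx.tolist())
```

Each fold trains on an expanding prefix of the data and tests on the block that immediately follows it.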


In [5]:
# 80% train, 20% test, chronologically
n = len(df_feat)
train_idx = int(n * 0.8)

train = df_feat.iloc[:train_idx]
test  = df_feat.iloc[train_idx:]

len(train), len(test)

(1799068, 449767)

## 3️⃣ Feature Engineering (Numeric-Only Baseline Model)

For our first quick model, we intentionally limit ourselves to **numeric features**
such as:

- Year, quarter, month, day of month, weekday  
- Lagged price values  
- Rolling averages of price  
- Rolling standard deviations  

This model is **not expected to perform well**, but the assignment requires 
a “quick first model” to start testing deployment workflows.

Categorical variables such as county, district, property_type, etc. are ignored
in this first baseline model because scikit-learn models cannot directly consume
string values without encoding, and the goal here is simplicity and speed.


In [6]:
# Use only numeric columns as features
numeric_cols = df_feat.select_dtypes(include=[np.number]).columns.tolist()

# Remove the target from the feature list
features = [c for c in numeric_cols if c != "price"]
target = "price"

X_train = train[features]
y_train = train[target]

X_test = test[features]
y_test = test[target]

X_train.dtypes.head()

year          int32
month         int32
dayofweek     int32
quarter       int32
dayofmonth    int32
dtype: object

## 4️⃣ Baseline Model: Extra Trees Regressor

The **Extra Trees Regressor** serves as our first quick model. This model:

- Trains very quickly  
- Requires minimal feature engineering  
- Produces a ready-to-deploy model early in the project  
- Acts as a baseline to compare better models against  

We expect the MAE and RMSE to be large because the model does not yet include
location or property-type information, which are critical predictors for price.
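
To judge how "large" an MAE is, it can help to compare against a model that ignores the features entirely. The sketch below uses scikit-learn's `DummyRegressor` with made-up prices; the arrays `y_train_toy` and `y_test_toy` are hypothetical and not taken from the Price Paid data:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only.
y_train_toy = np.array([50_000, 60_000, 75_000, 90_000, 400_000])
y_test_toy = np.array([80_000, 120_000, 95_000])

# DummyRegressor ignores the features; it always predicts the training median.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

naive = DummyRegressor(strategy="median").fit(X_train_toy, y_train_toy)
naive_mae = mean_absolute_error(y_test_toy, naive.predict(X_test_toy))
print(f"Naive median MAE: {naive_mae:.2f}")
```

Running the same comparison with the notebook's `y_train` and `y_test` would show how much the numeric features add over a constant prediction.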


In [7]:
etr = ExtraTreesRegressor(
    n_estimators=200,   # you can increase later
    max_depth=None,
    random_state=42,
    n_jobs=-1
)

etr.fit(X_train, y_train)

etr_pred = etr.predict(X_test)

etr_mae = mean_absolute_error(y_test, etr_pred)
etr_rmse = math.sqrt(mean_squared_error(y_test, etr_pred))

print("EXTRA TREES RESULTS")
print("MAE :", etr_mae)
print("RMSE:", etr_rmse)

EXTRA TREES RESULTS
MAE : 181839.26543826028
RMSE: 769742.2685869752


## 5️⃣ Second Baseline Model: LightGBM

LightGBM is a gradient-boosting ensemble well-suited for large datasets.

LightGBM often outperforms Extra Trees on tabular data, but here it still uses
only the numeric features. Any improvement should therefore be modest, and the
model will remain limited compared to models that incorporate categorical data.


In [8]:
lgb = LGBMRegressor(
    n_estimators=600,
    learning_rate=0.05,
    max_depth=-1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

lgb.fit(X_train, y_train)

lgb_pred = lgb.predict(X_test)

lgb_mae = mean_absolute_error(y_test, lgb_pred)
lgb_rmse = math.sqrt(mean_squared_error(y_test, lgb_pred))

print("LIGHTGBM RESULTS")
print("MAE :", lgb_mae)
print("RMSE:", lgb_rmse)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.020499 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1861
[LightGBM] [Info] Number of data points in the train set: 1799068, number of used features: 12
[LightGBM] [Info] Start training from score 151188.554962
LIGHTGBM RESULTS
MAE : 166081.34364775303
RMSE: 768518.8580883646


## 6️⃣ Model Performance Interpretation

The baseline models deliver the following metrics:

- **MAE** (Mean Absolute Error): the average absolute difference between predicted and actual price  
- **RMSE** (Root Mean Squared Error): the square root of the mean squared error, which penalizes large mistakes more heavily  
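
A small worked example makes the difference between the two metrics concrete. The prices below are made up; one large miss on an expensive property is enough to push RMSE far above MAE:

```python
import math

# Illustrative values only: three small errors and one large miss.
actual = [100_000, 150_000, 200_000, 1_000_000]
predicted = [110_000, 140_000, 210_000, 600_000]

errors = [p - a for p, a in zip(predicted, actual)]

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# RMSE: square root of the mean of the squared errors.
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

print(f"MAE : {mae:.2f}")
print(f"RMSE: {rmse:.2f}")
```

The single £400k error contributes almost all of the squared error, so RMSE ends up close to twice the MAE.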

Because the baseline models ignore location and property characteristics, the
errors are large (≈ £166k–£182k MAE), which is expected. The RMSE (≈ £769k) is
several times the MAE, which suggests that a small number of very expensive
sales dominate the squared error.

These models fulfill the assignment requirement of a “quick first model”
that can be deployed while we continue improving accuracy.


In [9]:
# Final comparison
print("\n--- FINAL COMPARISON ---")
print(f"Extra Trees RMSE:  {etr_rmse:.2f}")
print(f"LightGBM RMSE:     {lgb_rmse:.2f}")
print(f"Extra Trees MAE:   {etr_mae:.2f}")
print(f"LightGBM MAE:      {lgb_mae:.2f}")


--- FINAL COMPARISON ---
Extra Trees RMSE:  769742.27
LightGBM RMSE:     768518.86
Extra Trees MAE:   181839.27
LightGBM MAE:      166081.34


## 7️⃣ Improving the Model With Categorical Features

The baseline models are intentionally simple.  
To significantly improve accuracy, we will incorporate key features:

- **County / district / town**
- **Property type** (Detached, Semi, Terraced, Flat)
- **Old/New** indicator
- **Freehold vs Leasehold duration**

These are essential determinants of housing price.

However, scikit-learn tree models cannot process raw string values.  
At this stage, we have two options:

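### Option A: One-Hot Encode the Categorical Features  
This creates binary columns (e.g. county_LONDON, type_D, etc.).  
It works with tree-based scikit-learn models, although high-cardinality columns
such as towncity and district produce a very wide feature matrix.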
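
As a minimal sketch of the one-hot route, the example below wires a `OneHotEncoder` into a scikit-learn `Pipeline` in front of an `ExtraTreesRegressor`. The `toy` frame is a hypothetical six-row stand-in that reuses the notebook's column names; the values are illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data using the notebook's column names.
toy = pd.DataFrame({
    "property_type": ["D", "F", "T", "S", "D", "F"],
    "county": ["WEST MIDLANDS", "GREATER LONDON", "CHESHIRE",
               "HUMBERSIDE", "GREATER LONDON", "CHESHIRE"],
    "year": [1995, 1995, 1996, 1996, 1997, 1997],
    "price": [35_000, 300_000, 125_000, 17_000, 250_000, 60_000],
})
cat_cols = ["property_type", "county"]

# One-hot encode the categoricals; pass numeric columns through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",
)
model = Pipeline([
    ("encode", encoder),
    ("trees", ExtraTreesRegressor(n_estimators=50, random_state=42)),
])

X_toy = toy.drop(columns="price")
model.fit(X_toy, toy["price"])

encoded = model.named_steps["encode"].transform(X_toy)
print(encoded.shape)
```

Keeping the encoder inside the pipeline means the same category-to-column mapping is reused at prediction time, which matters once the model is deployed.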

### Option B: Use PyCaret AutoML  
PyCaret can:
- encode categoricals automatically  
- compare many models  
- tune the best model  

Because it can use the categorical features, it should produce a
**much better model** than the numeric-only baselines.

We will now proceed with **PyCaret AutoML** to build our improved model.


In [10]:
# If df isn't in memory anymore, reload it (same path as in cell 3):
# df = pd.read_csv("../../Data/housing_prices/price_paid_records_prepared_subset_10.csv")

# Make sure date is datetime
df["date_of_transfer"] = pd.to_datetime(df["date_of_transfer"])

# Build a PyCaret-friendly dataframe:
# Keep price (target), date, and important categoricals; drop admin / ID columns.
py_df = df[[
    "price",
    "date_of_transfer",
    "property_type",
    "oldnew",
    "duration",
    "towncity",
    "district",
    "county",
    "ppdcategory_type"
]].copy()

# Add simple time features as numeric columns
py_df["year"] = py_df["date_of_transfer"].dt.year
py_df["month"] = py_df["date_of_transfer"].dt.month
py_df["dayofweek"] = py_df["date_of_transfer"].dt.dayofweek

# Optional: drop the raw datetime if you don't want PyCaret to treat it as a feature
py_df = py_df.drop(columns=["date_of_transfer"])

py_df.head()

|  | price | property_type | oldnew | duration | towncity | district | county | ppdcategory_type | year | month | dayofweek |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35000 | D | N | F | STOURBRIDGE | DUDLEY | WEST MIDLANDS | A | 1995 | 1 | 6 |
| 1 | 300000 | F | Y | L | LONDON | MERTON | GREATER LONDON | A | 1995 | 1 | 6 |
| 2 | 125000 | D | N | F | MACCLESFIELD | MACCLESFIELD | CHESHIRE | A | 1995 | 1 | 6 |
| 3 | 17000 | T | N | F | HULL | KINGSTON UPON HULL | HUMBERSIDE | A | 1995 | 1 | 6 |
| 4 | 52000 | S | N | L | WARRINGTON | WARRINGTON | WARRINGTON | A | 1995 | 1 | 6 |


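## 8️⃣ PyCaret AutoML: Comparing Models

PyCaret's `compare_models()` function trains and evaluates many algorithms.
Depending on the installed libraries and PyCaret's settings, the candidates can include:

- LightGBM  
- CatBoost  
- Random Forest  
- Extra Trees  
- AdaBoost  
- Ridge / Lasso  
- ElasticNet  
- KNN  
- Neural networks  
- XGBoost  

PyCaret automatically handles:
- train/test splitting (a random split by default, unlike the chronological split used for the baselines)  
- missing-value imputation  
- categorical encoding  
- cross-validated scoring of every candidate with its default hyperparameters  

Normalization and outlier removal are also available, but only when requested
in `setup()` with `normalize=True` or `remove_outliers=True`; they are not used here.

The result is a ranked list of models, helping us identify the best candidates.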


In [11]:
# For speed, run AutoML on a 30% random sample of py_df.
# If your machine can handle it, pass the full py_df to setup() instead.
py_data = py_df.sample(frac=0.3, random_state=42)

reg_setup = setup(
    data=py_data,
    target="price",
    session_id=42,
    train_size=0.8
)

# Get the top 3 models from AutoML
top3_models = compare_models(n_select=3)
top3_models

|  | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | price |
| 2 | Target type | Regression |
| 3 | Original data shape | (674680, 11) |
| 4 | Transformed data shape | (674680, 17) |
| 5 | Transformed train set shape | (539744, 17) |
| 6 | Transformed test set shape | (134936, 17) |
| 7 | Numeric features | 3 |
| 8 | Categorical features | 7 |
| 9 | Preprocess | True |


|  | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| gbr | Gradient Boosting Regressor | 63098.2272 | 90578883802.3112 | 291470.6444 | 0.285 | 0.4645 | 1.0534 | 10.865 |
| lightgbm | Light Gradient Boosting Machine | 61734.8325 | 91146212884.9589 | 293209.355 | 0.2749 | 0.4415 | 0.9521 | 1.202 |
| lasso | Lasso Regression | 80455.3393 | 101451355238.6187 | 310096.1011 | 0.1901 | 0.7482 | 1.4268 | 15.013 |
| ridge | Ridge Regression | 80456.1468 | 101451430756.7349 | 310096.1382 | 0.1901 | 0.748 | 1.4268 | 0.768 |
| lar | Least Angle Regression | 80456.3019 | 101451464285.271 | 310096.4678 | 0.1901 | 0.7481 | 1.4269 | 0.625 |
| llar | Lasso Least Angle Regression | 80455.4092 | 101451386133.6251 | 310096.1609 | 0.1901 | 0.7482 | 1.4268 | 0.616 |
| br | Bayesian Ridge | 80455.9978 | 101451417568.881 | 310095.8584 | 0.1901 | 0.748 | 1.4268 | 0.716 |
| lr | Linear Regression | 80456.3019 | 101451464285.2871 | 310096.4678 | 0.1901 | 0.7481 | 1.4269 | 1.916 |
| xgboost | Extreme Gradient Boosting | 61044.8689 | 103117827506.4536 | 313433.6804 | 0.1607 | 0.4283 | 0.8982 | 1.199 |
| en | Elastic Net | 80453.052 | 106203558640.0424 | 317368.0713 | 0.1515 | 0.6805 | 1.4488 | 1.189 |


[GradientBoostingRegressor(random_state=42),
 LGBMRegressor(n_jobs=-1, random_state=42),
 Lasso(random_state=42)]

## 9️⃣ Tuning the Best PyCaret Model

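Using `tune_model()`, we search for better hyperparameters for the top algorithm
found by AutoML. Tuning can improve MAE and RMSE, but it is not guaranteed to:
in this run, the tuned candidate did not beat the original Gradient Boosting
Regressor, so PyCaret returned the original model.

The resulting model is then finalized and saved for later deployment.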
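
The tuning log below ("Fitting 10 folds for each of 10 candidates") is consistent with a cross-validated random search over hyperparameters. As a rough illustration of that idea, and not of PyCaret's actual search space, the sketch below runs scikit-learn's `RandomizedSearchCV` on synthetic data; the parameter grid and the data are assumptions made for the example:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, small enough to run in a few seconds.
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Hypothetical search space; PyCaret defines its own internally.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                           # number of sampled candidates
    cv=3,                               # folds per candidate
    scoring="neg_mean_absolute_error",  # higher (closer to zero) is better
    random_state=42,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print("Best CV MAE:", -search.best_score_)
```

Because only a handful of candidates are sampled, a random search can easily fail to beat the defaults, which is what happened in this notebook's run.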


In [12]:
# Top3 is a list: [best_model, second_best, third_best]
best_model_pycaret = top3_models[0]

# Tune hyperparameters of the best model
tuned_best_model = tune_model(best_model_pycaret)

# Finalize the model (refits it on the full dataset, including the hold-out split)
final_best_model = finalize_model(tuned_best_model)

# Save for later deployment
save_model(final_best_model, "pycaret_best_housing_model")

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 61440.6405 | 106095690602.3072 | 325723.3344 | 0.1661 | 0.4369 | 0.478 |
| 1 | 62647.013 | 176821181198.9794 | 420501.1073 | 0.2187 | 0.4346 | 0.5164 |
| 2 | 61700.9794 | 61244240137.2176 | 247475.7365 | 0.3116 | 0.436 | 0.4959 |
| 3 | 60413.2355 | 59720761607.2109 | 244378.3166 | 0.3891 | 0.4324 | 0.4846 |
| 4 | 61979.3557 | 72900208881.6259 | 270000.3868 | 0.3316 | 0.4363 | 0.4755 |
| 5 | 59802.3642 | 74501662227.5963 | 272949.9262 | 0.134 | 0.4418 | 0.8055 |
| 6 | 60274.0135 | 48293618904.5897 | 219758.0918 | 0.2967 | 0.4465 | 0.643 |
| 7 | 59600.1879 | 52406394788.8398 | 228924.4303 | 0.4573 | 0.4361 | 4.7893 |
| 8 | 63456.2406 | 187327493528.8616 | 432813.4627 | 0.1676 | 0.4388 | 0.4648 |
| 9 | 61364.2221 | 73867244444.4153 | 271785.2911 | 0.2542 | 0.4378 | 0.5381 |


Fitting 10 folds for each of 10 candidates, totalling 100 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).
Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(include=['year', 'month', 'dayofweek'],
                                     transformer=SimpleImputer())),
                 ('categorical_imputer',
                  TransformerWrapper(include=['property_type', 'oldnew',
                                              'duration', 'towncity', 'district',
                                              'county', 'ppdcategory_type'],
                                     transformer=SimpleImputer(strategy='most_frequent'))),
                 ('ordi...
                  TransformerWrapper(include=['property_type', 'duration'],
                                     transformer=OneHotEncoder(cols=['property_type',
                                                                     'duration'],
                                                               handle_missing='return_nan',
                                                

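## 🔟 Model Comparison and Final Selection

At this point, we have three models:

- Extra Trees baseline  
- LightGBM baseline  
- PyCaret best model (Gradient Boosting Regressor)  

We compare all models side-by-side using MAE/RMSE.
The PyCaret model performs best here, largely because it uses the
categorical features. Note that the comparison is indicative rather than exact:
the baselines are scored on the chronologically last 20% of the data, while
PyCaret uses a random hold-out from a 30% sample.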

The selected final model is the one that:

- Has the lowest MAE and RMSE  
- Generalizes best on the test set  
- Uses the richest feature representation  


In [13]:
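# Evaluate the final PyCaret model on its hold-out test set (inside PyCaret).
# Note: finalize_model() already refit on this hold-out, so these metrics are optimistic;
# predict_model(tuned_best_model) would give an unbiased estimate.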
pycaret_test_results = predict_model(final_best_model)
pycaret_test_results.head()

|  | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Gradient Boosting Regressor | 63057.8672 | 95134052765.632 | 308438.0858 | 0.3432 | 0.4628 | 1.2842 |


|  | property_type | oldnew | duration | towncity | district | county | ppdcategory_type | year | month | dayofweek | price | prediction_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1774831 | T | N | F | HEMEL HEMPSTEAD | DACORUM | HERTFORDSHIRE | A | 2012 | 3 | 4 | 249500 | 241650.153425 |
| 1249865 | T | N | F | STEVENAGE | STEVENAGE | HERTFORDSHIRE | A | 2006 | 2 | 4 | 185000 | 165526.551441 |
| 1093910 | T | N | F | LONDON | MERTON | GREATER LONDON | A | 2004 | 8 | 2 | 375000 | 319796.06324 |
| 462808 | D | N | F | BARNSLEY | BARNSLEY | SOUTH YORKSHIRE | A | 1999 | 8 | 4 | 79950 | 87230.241066 |
| 974482 | D | N | F | UCKFIELD | WEALDEN | EAST SUSSEX | A | 2003 | 10 | 3 | 555000 | 311721.997338 |


In [16]:
# MAE / RMSE for the PyCaret model on its internal test predictions
py_mae = mean_absolute_error(pycaret_test_results["price"], pycaret_test_results["prediction_label"])
py_rmse = np.sqrt(mean_squared_error(pycaret_test_results["price"], pycaret_test_results["prediction_label"]))

print("PYCARET BEST MODEL RESULTS")
print("MAE :", py_mae)
print("RMSE:", py_rmse)


PYCARET BEST MODEL RESULTS
MAE : 63057.86724100634
RMSE: 308438.0857897287


In [17]:
# Summary comparison table: ExtraTrees vs LightGBM vs PyCaret best
results_summary = pd.DataFrame({
    "Model": ["Extra Trees", "LightGBM", "PyCaret Best"],
    "MAE":   [etr_mae,       lgb_mae,    py_mae],
    "RMSE":  [etr_rmse,      lgb_rmse,   py_rmse]
})

results_summary

|  | Model | MAE | RMSE |
|---|---|---|---|
| 0 | Extra Trees | 181839.265438 | 769742.268587 |
| 1 | LightGBM | 166081.343648 | 768518.858088 |
| 2 | PyCaret Best | 63057.867241 | 308438.08579 |
