# Housing Price Model Comparison with PyCaret
Author: Tijl Cleynhens

This notebook performs a full model comparison workflow using PyCaret's regression module. It includes feature preparation, dataset splitting, configuration of the PyCaret environment, automated model benchmarking, and final evaluation on a held-out test set. The goal is to identify the most effective regression model for predicting housing prices based on the cleaned UK housing dataset.

# 1. Load cleaned data
-Import the required libraries for data handling and model evaluation.

-Load the cleaned housing dataset.

-Verify that the target column (price) is present in the DataFrame.

-Print the dataset shape to confirm successful loading and understand its size.

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


cleaned_path = "cleaned_uk_housing2.csv"
df = pd.read_csv(cleaned_path)

target_col = "price"
if target_col not in df.columns:
    raise ValueError("Column 'price' not found in the cleaned dataset.")

print("✅ Cleaned UK housing data loaded:", df.shape)


✅ Cleaned UK housing data loaded: (21158869, 14)


# 2. Feature engineering + filter (2016–2017 only, newest 100k rows)
-Select only the relevant feature columns from the dataset, including the existing year and month columns.

-Cast the new-build flag to an object type so it can be handled correctly during encoding.

-Keep only transactions from 2016 and 2017, then sort them from oldest to newest.

-Keep only the newest 100,000 rows to maintain a manageable dataset size without losing recent patterns.

In [None]:

# choose features you actually want the model to see
selected_features = [
    "district",
    "town",
    "county",
    "month",
    "year",
    "property_type",
    "tenure",
    "new_build_flag",
    "date_numeric",
]

cols_to_use = [c for c in selected_features if c in df.columns] + [target_col]
df_small = df[cols_to_use].copy()

# enforce categorical type for the flag
if "new_build_flag" in df_small.columns:
    df_small["new_build_flag"] = df_small["new_build_flag"].astype("object")

# keep only 2016–2017
df_small = df_small[df_small["year"].isin([2016, 2017])].copy()
print(f"Rows after year filter (2016–2017): {df_small.shape}")

# sort oldest → newest
df_small = df_small.sort_values(by=["year", "month"], ascending=True)

# only keep the newest 100k
MAX_ROWS = 100_000
if len(df_small) > MAX_ROWS:
    df_small = df_small.tail(MAX_ROWS).reset_index(drop=True)
    print(f"Using newest {MAX_ROWS:,} rows from 2016–2017: {df_small.shape}")
else:
    df_small = df_small.reset_index(drop=True)
    print("Using all rows (fewer than 100k available):", df_small.shape)




Rows after year filter (2016–2017): (1170866, 10)
Using newest 100,000 rows from 2016–2017: (100000, 10)
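
The sort-then-tail selection can be illustrated on a toy frame. This is a minimal sketch with made-up rows (the `toy` DataFrame and its values are hypothetical, not taken from the housing data); it only shows that sorting by `year` and `month` and then calling `tail` keeps the most recent rows.

```python
import pandas as pd

# Hypothetical toy data: six sales spread over 2016-2017 (illustrative values only)
toy = pd.DataFrame({
    "year":  [2017, 2016, 2017, 2016, 2017, 2016],
    "month": [3, 11, 12, 1, 7, 6],
    "price": [210_000, 150_000, 305_000, 120_000, 180_000, 99_000],
})

# Sort oldest -> newest, then keep only the newest N rows
N = 3
newest = (
    toy.sort_values(by=["year", "month"], ascending=True)
       .tail(N)
       .reset_index(drop=True)
)

print(newest)
```

Since the sort key stops at the month, rows from the same month are not ordered by day, so the cut-off can fall partway through a month.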


# 3. Train / Val / Test split (≈ 71 / 9 / 20)
-Separate the target column from the feature set.

-Split the data into training+validation and test sets using an 80/20 ratio.

-Further divide the training+validation portion so the final split becomes approximately 71% training, 9% validation, and 20% testing.

-Print the shapes of each subset to verify that the split was applied correctly.

In [4]:

X = df_small.drop(columns=[target_col])
y = df_small[target_col]

# 80% train+val, 20% test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Within trainval, we still want an explicit "validation" chunk if needed
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.1111, random_state=42
)
# 0.1111 of 0.8 ≈ 0.0889, so ~71 / 9 / 20 overall

print("\n=== SPLIT SUMMARY ===")
print("Train shape:", X_train.shape)
print("Val shape:  ", X_val.shape)
print("Test shape: ", X_test.shape)


=== SPLIT SUMMARY ===
Train shape: (71112, 9)
Val shape:   (8888, 9)
Test shape:  (20000, 9)
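
The overall proportions follow from multiplying the two split ratios. A short arithmetic check, using only the `test_size` values from the cell above (the variable names are illustrative):

```python
# Split ratios used in the two train_test_split calls above
test_frac = 0.2        # first split: share of all rows held out for testing
val_within = 0.1111    # second split: share of the remaining train+val rows

val_frac = (1 - test_frac) * val_within          # share of all rows used for validation
train_frac = (1 - test_frac) * (1 - val_within)  # share of all rows used for training

print(f"train={train_frac:.4f}, val={val_frac:.4f}, test={test_frac:.4f}")
```

These fractions agree with the printed shapes (71,112 / 8,888 / 20,000 of 100,000 rows). An 80 / 10 / 10 split would instead need `test_size=0.1` in the first call and `test_size=1/9` in the second.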


# 4. Prep dataframe for PyCaret (Train+Val Only)
-Combine the training and validation feature sets with the target column into a single DataFrame, as required by PyCaret.

-Remove any irrelevant or ID-like columns that could interfere with modeling, ignoring them if they aren't present.

-Apply the same column cleanup to the test feature set to maintain consistent structure.

-Print the resulting shape to confirm that the PyCaret-ready dataset was prepared correctly.

In [None]:
# PyCaret expects a single DataFrame with target included
trainval_df = pd.concat([X_trainval, y_trainval], axis=1)

# optional: drop pure IDs / garbage if they sneak in
cols_to_drop = [
    "transaction_unique_identifier",
    "record_status___monthly_file_only",
    "ppdcategory_type",
    "price_is_outlier_iqr", "date_of_transfer",
]
trainval_df = trainval_df.drop(columns=cols_to_drop, errors="ignore")

# same cleanup for test
X_test_mod = X_test.drop(columns=cols_to_drop, errors="ignore")

print("\nTrain+Val for PyCaret:", trainval_df.shape)



Train+Val for PyCaret: (80000, 10)


# 5. PyCaret setup & model comparison
-Import the necessary PyCaret regression functions for automated model comparison and evaluation.

-Explicitly define which columns should be treated as categorical and which as numeric, based on their presence in the training data.

-Initialize the PyCaret regression environment with the cleaned train+validation dataset, specifying the target, train size, cross-validation folds, and target transformation.

-Provide a curated list of regression models to compare using compare_models, sorted by RMSE.

-Identify and display the best-performing baseline model selected by PyCaret.

In [6]:

from pycaret.regression import (
    setup,
    compare_models,
    tune_model,
    finalize_model,
    predict_model,
    save_model,
)

# define which columns are categorical / numeric explicitly
categorical_features = []
for col in ["district", "town", "county", "property_type", "tenure", "new_build_flag"]:
    if col in trainval_df.columns:
        categorical_features.append(col)

numeric_features = []
for col in ["month", "year", "date_numeric"]:
    if col in trainval_df.columns:
        numeric_features.append(col)

print("\nCategorical features for PyCaret:", categorical_features)
print("Numeric features for PyCaret:", numeric_features)
print("trainval_df exists?", 'trainval_df' in globals())
print("target_col exists?", 'target_col' in globals())

# setup PyCaret
reg_setup = setup(
    data=trainval_df,
    target=target_col,
    session_id=42,
    train_size=0.9,
    fold=3,
    fold_shuffle=True,
    categorical_features=categorical_features,
    numeric_features=numeric_features,
    transform_target=True,
    transform_target_method="yeo-johnson",
    verbose=True
)

# list of 17 models to compare
models_to_compare = [
    "lr",         # Linear Regression
    "lasso",
    "ridge",
    "en",         # Elastic Net
    "lar",
    "llar",
    "omp",
    "br",         # Bayesian Ridge
    "huber",
    "dt",
    "rf",         # Random Forest
    "et",         # Extra Trees
    "ada",        # AdaBoost
    "gbr",        # Gradient Boosting
    "xgboost",
    "lightgbm",
    "catboost",
]

print("\n🔍 Comparing models with PyCaret...")
best_base = compare_models(include=models_to_compare, sort="RMSE")
print("\nBest base model from compare_models():")
print(best_base)



Categorical features for PyCaret: ['district', 'town', 'county', 'property_type', 'tenure', 'new_build_flag']
Numeric features for PyCaret: ['month', 'year', 'date_numeric']
trainval_df exists? True
target_col exists? True


| | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | price |
| 2 | Target type | Regression |
| 3 | Original data shape | (80000, 10) |
| 4 | Transformed data shape | (80000, 14) |
| 5 | Transformed train set shape | (72000, 14) |
| 6 | Transformed test set shape | (8000, 14) |
| 7 | Numeric features | 3 |
| 8 | Categorical features | 6 |
| 9 | Preprocess | True |



🔍 Comparing models with PyCaret...


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| catboost | CatBoost Regressor | 43376.2328 | 3455925713.4678 | 58782.1534 | 0.6023 | 0.4248 | 0.6853 | 2.4333 |
| xgboost | Extreme Gradient Boosting | 43450.7956 | 3478196188.5336 | 58970.7437 | 0.5997 | 0.427 | 0.7072 | 0.66 |
| lightgbm | Light Gradient Boosting Machine | 44112.1766 | 3522626370.3191 | 59345.7775 | 0.5946 | 0.4284 | 0.6752 | 0.3467 |
| gbr | Gradient Boosting Regressor | 45633.1464 | 3703218592.9105 | 60849.4219 | 0.5738 | 0.4397 | 0.6864 | 1.2833 |
| rf | Random Forest Regressor | 46910.1854 | 4020606249.2113 | 63405.0345 | 0.5373 | 0.4449 | 0.698 | 1.5 |
| lasso | Lasso Regression | 49476.4072 | 4274600848.9945 | 65376.1923 | 0.5081 | 0.4706 | 0.8226 | 1.1433 |
| llar | Lasso Least Angle Regression | 49476.4073 | 4274600871.4501 | 65376.1925 | 0.5081 | 0.4706 | 0.8226 | 1.0233 |
| br | Bayesian Ridge | 49460.9936 | 4274679533.8442 | 65376.7159 | 0.5081 | 0.4703 | 0.8158 | 0.17 |
| ridge | Ridge Regression | 49460.7631 | 4274714339.401 | 65376.979 | 0.508 | 0.4702 | 0.8156 | 1.1067 |
| lr | Linear Regression | 49460.6885 | 4274727128.0559 | 65377.0756 | 0.508 | 0.4702 | 0.8155 | 1.4567 |



Best base model from compare_models():
<catboost.core.CatBoostRegressor object at 0x000001FFC3874110>
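
Setting `transform_target=True` with `transform_target_method="yeo-johnson"` asks PyCaret to fit models on a power-transformed target and map predictions back to the original scale. The sketch below illustrates the same kind of transform with scikit-learn's `PowerTransformer`; that this mirrors PyCaret's internals is an assumption, and the `prices_k` values (in thousands of pounds) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed prices in thousands of pounds (illustrative values only)
prices_k = np.array(
    [60, 95, 120, 150, 185, 240, 310, 450, 900], dtype=float
).reshape(-1, 1)

# Yeo-Johnson power transform; standardize=True also rescales to zero mean / unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)
prices_t = pt.fit_transform(prices_k)

# Values on the transformed scale are mapped back with the inverse transform
prices_back = pt.inverse_transform(prices_t)

print("fitted lambda:", pt.lambdas_[0])
print("round trip ok:", np.allclose(prices_k, prices_back))
```

The transform is monotonic, so the ordering of prices is preserved; only the spacing between them changes, which reduces the pull of the most expensive properties on the fit.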


# 6. Tune the best model

In [7]:
print("\n🎯 Tuning the best model...")
best_tuned = tune_model(best_base, optimize="RMSE", n_iter=50)
print("\nTuned model:")
print(best_tuned)

# finalize (fit on all train+val data)
final_model = finalize_model(best_tuned)




🎯 Tuning the best model...


| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 42998.249 | 3364730511.3217 | 58006.2972 | 0.6122 | 0.4316 | 0.6703 |
| 1 | 44023.5545 | 3591306277.5727 | 59927.5085 | 0.5887 | 0.416 | 0.7079 |
| 2 | 43251.6764 | 3447191943.2942 | 58712.792 | 0.6018 | 0.4249 | 0.6641 |
| Mean | 43424.4933 | 3467742910.7295 | 58882.1992 | 0.6009 | 0.4241 | 0.6808 |
| Std | 436.0521 | 93633687.5273 | 793.4261 | 0.0096 | 0.0064 | 0.0194 |


Fitting 3 folds for each of 50 candidates, totalling 150 fits
Original model was better than the tuned model, hence it will be returned. NOTE: The display metrics are for the tuned model (not the original one).

Tuned model:
<catboost.core.CatBoostRegressor object at 0x000001FE9C4E9790>


# 7. Evaluate on held-out test set
-Rebuild the test DataFrame in the same format expected by PyCaret by combining the test features with the target column.

-Use the finalized PyCaret model to generate predictions on the held-out test set.

-Extract true and predicted values and compute MAE, RMSE, and R² to quantify model performance.

-Calculate the average actual house price and express MAE as a percentage of this value to make the error size easier to interpret.

In [8]:
# build test df in PyCaret format
test_df = X_test_mod.copy()
test_df[target_col] = y_test.values

print("\n📊 Predicting on TEST set with final PyCaret model...")
test_with_preds = predict_model(final_model, data=test_df)

# PyCaret adds 'prediction_label' by default
y_true = test_with_preds[target_col].values
y_pred = test_with_preds["prediction_label"].values

test_mae = mean_absolute_error(y_true, y_pred)
test_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
test_r2 = r2_score(y_true, y_pred)
avg_price = y_true.mean()
mae_percent = (test_mae / avg_price) * 100

print("\n===== PYCARET TEST METRICS (BEST MODEL) =====")
print(f"MAE:  £{test_mae:,.2f}")
print(f"RMSE: £{test_rmse:,.2f}")
print(f"R²:   {test_r2:.4f}")
print(f"\nAverage actual house price: £{avg_price:,.2f}")
print(f"MAE ≈ {mae_percent:.2f}% of average price.")




📊 Predicting on TEST set with final PyCaret model...


| | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | CatBoost Regressor | 42864.328 | 3350162942.4109 | 57880.5921 | 0.6213 | 0.4164 | 0.572 |



===== PYCARET TEST METRICS (BEST MODEL) =====
MAE:  £42,864.33
RMSE: £57,880.59
R²:   0.6213

Average actual house price: £197,721.93
MAE ≈ 21.68% of average price.
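
MAE as a percentage of the mean price (21.68%) is a different quantity from the MAPE column in the PyCaret table (0.572). The former divides the average error by the average price; the latter averages each row's relative error, so errors on cheap properties weigh more heavily. A small sketch with made-up numbers (both arrays are hypothetical) shows how the two can diverge:

```python
import numpy as np

# Hypothetical true and predicted prices (illustrative values only)
y_true = np.array([50_000.0, 100_000.0, 200_000.0, 450_000.0])
y_pred = np.array([90_000.0, 130_000.0, 190_000.0, 400_000.0])

abs_err = np.abs(y_true - y_pred)

# MAE relative to the mean price: one average error divided by one average price
mae_pct_of_mean = abs_err.mean() / y_true.mean() * 100

# MAPE: average of per-row relative errors, so cheap properties dominate
mape_pct = (abs_err / y_true).mean() * 100

print(f"MAE as % of mean price: {mae_pct_of_mean:.2f}%")
print(f"MAPE:                   {mape_pct:.2f}%")
```

In this toy case the £40,000 miss on the £50,000 property alone contributes an 80% relative error, which pulls MAPE well above the MAE-based figure.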


# 8. Save the best model

In [None]:
model_name = "uk_housing_price_pycaret_best"
save_model(final_model, model_name)
print(f"\n✅ Saved best PyCaret model to '{model_name}.pkl' (plus metadata).")
print("=== DONE ===")

Transformation Pipeline and Model Successfully Saved

✅ Saved best PyCaret model to 'uk_housing_price_pycaret_best.pkl' (plus metadata).
=== DONE ===
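
To reuse the saved pipeline later, it can be loaded back with PyCaret's `load_model` and passed to `predict_model`. A minimal sketch, assuming PyCaret 3.x is installed and the `.pkl` file sits in the working directory; the helper name `load_and_predict` and the `new_rows` argument are hypothetical:

```python
import pandas as pd


def load_and_predict(
    new_rows: pd.DataFrame,
    model_name: str = "uk_housing_price_pycaret_best",
) -> pd.DataFrame:
    """Load the saved PyCaret pipeline and score new rows (hypothetical helper)."""
    # Imported lazily so this helper can be defined without PyCaret installed
    from pycaret.regression import load_model, predict_model

    # load_model takes the file name without the ".pkl" extension
    pipeline = load_model(model_name)

    # Returns the input rows with an added 'prediction_label' column
    return predict_model(pipeline, data=new_rows)
```

The `new_rows` frame would need the same feature columns used during training (`district`, `town`, `county`, `month`, `year`, `property_type`, `tenure`, `new_build_flag`, `date_numeric`).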
