## 05 - Quick Baseline Model (LightGBM)

**Project:** UK Housing Price Paid Records  

**Purpose:** To establish a performance benchmark (baseline model) for the price prediction task using the fully cleaned dataset. This script utilizes LightGBM, a highly efficient and scalable gradient boosting framework, due to its speed and native handling of categorical features. The model is trained on the log-transformed price to account for the skewed target distribution.  

**Key Steps:**
- Load the `price_paid_model_ready.parquet` file.  
- Log-transform the target variable (`price`).  
- Split data into 80% training and 20% testing sets.  
- Train a LightGBM Regressor using early stopping.  
- Evaluate performance using RMSE and R² on both the log-transformed and original price scale.  

**Team Member(s):** Tymo Verhaegen  

**Input File:** `../data/housing/processed/price_paid_model_ready.parquet`  

**Date Last Run:** 06/11/2025  
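
The evaluation step listed under Key Steps can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the helper names `rmse`, `r2`, and `evaluate_both_scales` are hypothetical, the toy prices are invented for the example, and only NumPy is assumed. Predictions made on the log scale are mapped back with `np.exp` before the price-scale metrics are computed.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def evaluate_both_scales(y_true_log, y_pred_log):
    """Hypothetical helper: metrics on the log scale and the price scale."""
    y_true_log = np.asarray(y_true_log, dtype=float)
    y_pred_log = np.asarray(y_pred_log, dtype=float)
    # Invert the log transform to get back to prices
    y_true_price = np.exp(y_true_log)
    y_pred_price = np.exp(y_pred_log)
    return {
        "rmse_log": rmse(y_true_log, y_pred_log),
        "r2_log": r2(y_true_log, y_pred_log),
        "rmse_price": rmse(y_true_price, y_pred_price),
        "r2_price": r2(y_true_price, y_pred_price),
    }

# Toy prices, purely illustrative (not taken from the dataset)
prices = np.array([100_000.0, 200_000.0, 400_000.0])
y_log = np.log(prices)

# Pretend every prediction is 0.1 too high on the log scale
metrics = evaluate_both_scales(y_log, y_log + 0.1)
for name, value in metrics.items():
    print(f"{name}: {value:,.4f}")
```

A constant offset on the log scale corresponds to a constant percentage error in price, so the price-scale RMSE is dominated by the most expensive properties. Reporting both scales keeps that effect visible.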


In [None]:
# --- Quick Baseline Model (Linear & Random Forest) ---
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import os

print("--- Starting Quick Baseline Model (Linear + Random Forest) ---")

# --- 1. Load Cleaned Data ---
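try:
    # When run as a script, resolve the path relative to this file
    script_dir = os.path.dirname(os.path.abspath(__file__))
    parquet_file_path = os.path.join(script_dir, '.', '..', '..', 'data', 'housing', 'processed', 'price_paid_model_ready.parquet')
except NameError:
    # __file__ is not defined inside a notebook, so fall back to a path
    # relative to the current working directory
    parquet_file_path = './../../data/housing/processed/price_paid_model_ready.parquet'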

try:
    print(f"1. Loading cleaned data from: {parquet_file_path}")
    df = pd.read_parquet(parquet_file_path)
    print(f"   -> Loaded successfully. Total records: {len(df):,}")
except FileNotFoundError:
    print(f"   -> ERROR: File not found at {parquet_file_path}")
    # Re-raise rather than calling exit(), so execution stops with a clear traceback
    raise


# --- 2. Prepare Features and Target ---
print("\n2. Preparing features and target...")

y = np.log(df['price'])
features = ['sale_year', 'property_type', 'old/new', 'duration', 'town/city', 'district', 'county']
X = df[features].copy()

# Identify categorical and numerical columns
cat_cols = X.select_dtypes(include=['object', 'category']).columns.tolist()
num_cols = [c for c in X.columns if c not in cat_cols]

print(f"   -> Found {len(cat_cols)} categorical columns: {cat_cols}")
print(f"   -> Total features: {len(features)}")


# --- 3. Train/Test Split ---
print("\n3. Splitting data into training/testing sets (80/20)...")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"   -> Training samples: {len(X_train):,}")
print(f"   -> Testing samples: {len(X_test):,}")


# --- 4. Encode Categorical Variables ---
print("\n4. Encoding categorical variables (OrdinalEncoder)...")

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train[cat_cols] = encoder.fit_transform(X_train[cat_cols])
X_test[cat_cols] = encoder.transform(X_test[cat_cols])

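# Replace unknown (-1) with 0 to avoid issues with models like Linear Regression.
# Note: this merges unseen categories with whichever category was encoded as 0.
X_train[cat_cols] = X_train[cat_cols].replace(-1, 0)
X_test[cat_cols] = X_test[cat_cols].replace(-1, 0)

# OrdinalEncoder passes missing values through as NaN by default, and
# LinearRegression rejects NaN, so fill any remaining gaps with 0.
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)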

print(f"   -> Encoding complete. Final feature shape: {X_train.shape}")


# --- 5. Train Models ---

# Model 1: Linear Regression
print("\n5A. Training Linear Regression baseline...")
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
y_pred_lin = lin_reg.predict(X_test)

# Model 2: Random Forest
print("\n5B. Training Random Forest baseline...")
rf_reg = RandomForestRegressor(
    n_estimators=200,
    random_state=42,
    n_jobs=-1
)
rf_reg.fit(X_train, y_train)
y_pred_rf = rf_reg.predict(X_test)


# --- 6. Evaluate Models ---
print("\n6. Evaluating models...")

def evaluate(y_true, y_pred, model_name):
    rmse_log = np.sqrt(mean_squared_error(y_true, y_pred))
    r2_log = r2_score(y_true, y_pred)
    
    # Convert predictions back to price scale
    y_true_price = np.exp(y_true)

--- Starting 05 Quick Baseline Model (scikit-learn) ---
1. Loading cleaned data from: ./../../data/housing/processed/price_paid_model_ready.parquet
   -> Load successful. Total records: 22,489,256

2. Preparing features and target...
   -> Encoded categorical columns: ['property_type', 'old/new', 'duration', 'town/city', 'district', 'county']
   -> Final feature shape: (22489256, 7)

3. Data Split (80/20):
   -> Training samples: 17,991,404
   -> Testing samples: 4,497,852

4A. Training Linear Regression baseline...


ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
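
The traceback above shows that at least one NaN reached `LinearRegression.fit`. The printed step labels in this output ("4A", "scikit-learn") differ from those in the code cell ("5A", "Linear + Random Forest"), so the error may come from a run without the `fillna(0)` lines. One way to locate and handle the gaps before fitting is sketched below. It is an assumption-laden illustration rather than the notebook's method: `report_missing` and `fill_missing` are hypothetical helpers, the toy frames are invented, and filling with the training median or a `"Missing"` label is only one of several reasonable imputation choices.

```python
import numpy as np
import pandas as pd

def report_missing(frame):
    """Return per-column NaN counts, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]

def fill_missing(train, test):
    """Hypothetical helper: impute using statistics from the training set only.

    Numeric columns get the training median; all other columns are cast to
    object and get an explicit "Missing" label.
    """
    train, test = train.copy(), test.copy()
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill_value = train[col].median()
            train[col] = train[col].fillna(fill_value)
            test[col] = test[col].fillna(fill_value)
        else:
            # Cast to object so a new label can be added even to category columns
            train[col] = train[col].astype(object).fillna("Missing")
            test[col] = test[col].astype(object).fillna("Missing")
    return train, test

# Toy frames, purely illustrative (column names borrowed from the feature list)
toy_train = pd.DataFrame({
    "sale_year": [2001.0, np.nan, 2005.0],
    "county": ["KENT", None, "ESSEX"],
})
toy_test = pd.DataFrame({"sale_year": [np.nan], "county": [None]})

missing_before = report_missing(toy_train)
filled_train, filled_test = fill_missing(toy_train, toy_test)

print(missing_before)
print(filled_train)
print(filled_test)
```

Running `report_missing(X_train)` just before `lin_reg.fit(...)` would show which features still contain NaN. Imputing before encoding also gives missing values their own category instead of merging them with the category encoded as 0.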