# Notebook 03: Building the Model (XGBoost)

**Author:** Natan Wojtowicz

## Goal
We build a machine learning model to predict house prices.
Because we are going "all out", we use **XGBoost** (Extreme Gradient Boosting). This is a powerful model that builds trees sequentially, with each new tree fitted to the errors left by the trees before it.
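
To make "each new tree learns from the errors of the previous ones" concrete, here is a minimal sketch of boosting for squared error, using one-split decision stumps in plain NumPy. It only illustrates the idea and is not XGBoost's actual algorithm, which adds regularization, second-order gradients and histogram-based split finding. The helper names `fit_stump`, `predict_stump` and `boost` and the toy data are assumptions made for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:           # candidate split points
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, eta=0.1):
    """Sequentially add stumps, each one fitted to the current residuals."""
    prediction = np.full(len(y), y.mean())        # start from the global mean
    history = [np.sqrt(np.mean((y - prediction) ** 2))]
    for _ in range(n_rounds):
        residual = y - prediction                 # errors of the ensemble so far
        stump = fit_stump(x, residual)
        prediction = prediction + eta * predict_stump(stump, x)  # shrunken update
        history.append(np.sqrt(np.mean((y - prediction) ** 2)))
    return prediction, history

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) * 100 + rng.normal(0, 5, size=200)  # toy target, not housing data

prediction, rmse_history = boost(x, y)
print(f"Training RMSE before boosting: {rmse_history[0]:.2f}, after: {rmse_history[-1]:.2f}")
```

The `eta` argument plays the same role as the `eta` (learning rate) parameter used for XGBoost later in this notebook: it shrinks each tree's contribution so that no single tree dominates.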

## Strategy
1.  **Load data:** We read the optimized Parquet file (from step 2).
2.  **Feature engineering:** We extract more information from the date and location.
3.  **Target encoding:** An advanced technique that turns locations (county/district) into numbers without creating thousands of new columns.
4.  **Time-based split:** We train on the past (1995-2015) and test on the future (2016-2017).
5.  **Training:** Train the XGBoost regressor.
6.  **Evaluation:** How well does it work? (RMSE, R2 and feature importance.)

In [24]:
import os

# Remove a possibly invalid MPLBACKEND that can break imports (without importing matplotlib)
os.environ.pop('MPLBACKEND', None)
print('MPLBACKEND cleared for this kernel (if it existed)')


MPLBACKEND cleared for this kernel (if it existed)


In [26]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score
import joblib  # For saving the model
import os

# Settings
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("Libraries loaded. XGBoost version:", xgb.__version__)

Libraries loaded. XGBoost version: 3.1.2


In [27]:
processed_path = os.path.join("data", "uk_housing", "processed", "housing_clean.parquet")

print("Loading data...")
df = pd.read_parquet(processed_path)

# Sort by date for the time-based split
df = df.sort_values("date_of_transfer")

print(f"Dataset loaded: {df.shape[0]} rows.")
display(df.head())

Loading data...
Dataset loaded: 22484600 rows.


| | transaction_id | price | date_of_transfer | property_type | old_new | duration | town_city | district | county | ppd_category | record_status |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 638262 | {631C8BBC-CA12-45A0-A735-E3E3429E55F7} | 47000 | 1995-01-01 | T | N | F | ABERYSTWYTH | CEREDIGION | CEREDIGION | A | A |
| 648419 | {1F6B59A3-72E9-4E4A-9EB7-E073108E5F0E} | 29500 | 1995-01-01 | S | Y | L | RAYLEIGH | ROCHFORD | ESSEX | A | A |
| 44562 | {6E9699B4-99BD-4898-818F-46315E73D9F0} | 28500 | 1995-01-01 | T | N | F | BISHOP AUCKLAND | WEAR VALLEY | DURHAM | A | A |
| 226179 | {E32B0A70-13E1-4F34-8E38-4D7950245171} | 50000 | 1995-01-01 | F | N | L | BRISTOL | BRISTOL | AVON | A | A |
| 698648 | {20F4D602-7F13-4A88-BC53-F9880B076170} | 160000 | 1995-01-01 | D | N | F | WANTAGE | VALE OF WHITE HORSE | OXFORDSHIRE | A | A |


### Feature Engineering
Raw data is rarely enough. We create new features:
1.  **Splitting the date:** Models do not handle "1995-01-01" well, but they do understand "Year: 1995" and "Month: 1".
2.  **Location encoding:** We cannot one-hot encode 'District' (that would be 300+ columns) or 'Town' (thousands of columns). That would blow up our memory.
    * *Solution:* **Target encoding**. We replace the name of the district with the *average house price* in that district (computed on the training set).
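
As a small illustration of target encoding, the sketch below encodes a toy `district` column using means computed on the training rows only, and falls back to the global training mean for a district that never appears in training. The data and the helper name `target_encode` are made up for this example; the notebook's own implementation on the real data appears in the train/test split cell.

```python
import pandas as pd

def target_encode(train, test, col, target="price"):
    """Replace categories in `col` by the mean target observed in the training rows."""
    means = train.groupby(col)[target].mean()     # one number per category
    global_mean = train[target].mean()            # fallback for unseen categories
    train_encoded = train[col].map(means)
    test_encoded = test[col].map(means).fillna(global_mean)
    return train_encoded, test_encoded

train = pd.DataFrame({
    "district": ["CAMDEN", "CAMDEN", "LEEDS", "LEEDS"],
    "price":    [500_000, 700_000, 150_000, 250_000],
})
test = pd.DataFrame({"district": ["LEEDS", "CAMDEN", "YORK"]})  # YORK is unseen in train

train_encoded, test_encoded = target_encode(train, test, "district")
print(test_encoded.tolist())  # [200000.0, 600000.0, 400000.0]
```

However many distinct districts there are, the result is a single numeric column, whereas one-hot encoding would add one column per district.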

In [28]:
# 1. Date features
df['year'] = df['date_of_transfer'].dt.year
df['month'] = df['date_of_transfer'].dt.month

# 2. Convert categoricals to integer codes (for XGBoost internally)
# We keep the code -> label mappings so the codes can be translated back later if needed
cat_columns = ['property_type', 'old_new', 'duration', 'ppd_category', 'record_status']
category_mappings = {}

for col in cat_columns:
    as_category = df[col].astype('category')
    category_mappings[col] = dict(enumerate(as_category.cat.categories))
    df[col] = as_category.cat.codes

print("Basic features added.")

Basic features added.


### Train / Test Split
**Note:** We do NOT use a random split. House prices are time-dependent.
With a random split, the model could learn the price of a house in 2010 by looking at a house from 2011. That is cheating ("data leakage").

* **Train:** 1995 through 2015
* **Test:** 2016 through 2017 (the most recent data)

In [29]:
# Split point
split_year = 2016

train_df = df[df['year'] < split_year].copy()
test_df = df[df['year'] >= split_year].copy()

print(f"Train set size: {len(train_df)}")
print(f"Test set size:  {len(test_df)}")

# Diagnostics: show dtypes and a few example values
print('Columns sample:', df.columns.tolist()[:40])
print('\nDtypes count:\n', df.dtypes.value_counts())
for c in ['county', 'district', 'county_encoded', 'district_encoded']:
    if c in train_df.columns:
        print(c, 'train dtype:', train_df[c].dtype, 'unique sample:', train_df[c].dropna().unique()[:5])
    if c in test_df.columns:
        print(c, 'test dtype:', test_df[c].dtype, 'unique sample:', test_df[c].dropna().unique()[:5])

# --- TARGET ENCODING (the 'pro' way to handle locations) ---

# Apply target encoding (robust: drop via reassignment and safe assignment)
def calculate_target_encoding(train, test, col_name, target_col='price'):
    # Compute the mean price per category on TRAIN only
    means = train.groupby(col_name)[target_col].mean()

    # Global mean (fallback for categories in test that were never seen in train)
    global_mean = train[target_col].mean()

    enc_col = col_name + '_encoded'

    # Remove existing encoded columns via reassignment (avoids inplace/view issues)
    train = train.drop(columns=[enc_col], errors='ignore')
    test = test.drop(columns=[enc_col], errors='ignore')

    # Build new Series and force float dtype (XGBoost expects numeric features)
    # Cast source column to object/str before mapping to avoid returning a Categorical Series
    train_enc = train[col_name].astype(object).map(means).fillna(global_mean).astype(float)
    test_enc  = test[col_name].astype(object).map(means).fillna(global_mean).astype(float)

    # Assign via .loc (explicit and safe)
    train.loc[:, enc_col] = train_enc
    test.loc[:, enc_col]  = test_enc

    return train, test, means

# Apply to county and district (town is too specific, district is a good level)
print("Applying target encoding to locations...")
train_df, test_df, county_means = calculate_target_encoding(train_df, test_df, 'county')
train_df, test_df, district_means = calculate_target_encoding(train_df, test_df, 'district')

# Drop the original text columns and the date (the model cannot use them directly)
drop_cols = ['date_of_transfer', 'town_city', 'district', 'county', 'transaction_id']
X_train = train_df.drop(columns=['price'] + drop_cols)
y_train = train_df['price']

X_test = test_df.drop(columns=['price'] + drop_cols)
y_test = test_df['price']

print("Data ready for training.")
display(X_train.head())

Train set size: 21078720
Test set size:  1405880
Columns sample: ['transaction_id', 'price', 'date_of_transfer', 'property_type', 'old_new', 'duration', 'town_city', 'district', 'county', 'ppd_category', 'record_status', 'year', 'month']

Dtypes count:
 int8              5
int32             2
object            1
int64             1
datetime64[ns]    1
category          1
category          1
category          1
Name: count, dtype: int64
county train dtype: category unique sample: ['CEREDIGION', 'ESSEX', 'DURHAM', 'AVON', 'OXFORDSHIRE']
Categories (127, object): ['AVON', 'BATH AND NORTH EAST SOMERSET', 'BEDFORD', 'BEDFORDSHIRE', ..., 'WORCESTERSHIRE', 'WREKIN', 'WREXHAM', 'YORK']
county test dtype: category unique sample: ['FLINTSHIRE', 'GREATER LONDON', 'HEREFORDSHIRE', 'SUFFOLK', 'SOUTH YORKSHIRE']
Categories (127, object): ['AVON', 'BATH AND NORTH EAST SOMERSET', 'BEDFORD', 'BEDFORDSHIRE', ..., 'WORCESTERSHIRE', 'WREKIN', 'WREXHAM', 'YORK']
district train dtype: category uniqu


Data ready for training.


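| | property_type | old_new | duration | ppd_category | record_status | year | month | county_encoded | district_encoded |
|---|---|---|---|---|---|---|---|---|---|
| 638262 | 4 | 0 | 0 | 0 | 0 | 1995 | 1 | 131203.07 | 131261.7 |
| 648419 | 3 | 1 | 1 | 0 | 0 | 1995 | 1 | 179886.76 | 179319.49 |
| 44562 | 4 | 0 | 0 | 0 | 0 | 1995 | 1 | 83293.03 | 79542.38 |
| 226179 | 1 | 0 | 1 | 0 | 0 | 1995 | 1 | 64468.44 | 56744.0 |
| 698648 | 0 | 0 | 0 | 0 | 0 | 1995 | 1 | 223737.76 | 227517.17 |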


In [30]:
print("Start training XGBoost model... (Grab a coffee, this takes a while)")

# Use low-level xgboost.train with DMatrix (compatible with xgboost 3.x)
params = {
    'objective': 'reg:squarederror',
    'eta': 0.05,  # learning_rate
    'max_depth': 8,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'tree_method': 'hist',
    'seed': 42
}

dtrain = xgb.DMatrix(X_train, label=y_train)
dtest  = xgb.DMatrix(X_test,  label=y_test)

evals = [(dtrain, 'train'), (dtest, 'eval')]

# Train with early stopping
# Note: early stopping monitors the last entry in evals, which here is the test set
booster = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=500,
    evals=evals,
    early_stopping_rounds=50,
    verbose_eval=50,
)

print(f"Training complete! Best round: {booster.best_iteration}")

Start training XGBoost model... (Grab a coffee, this takes a while)
[0]	train-rmse:191786.34919	eval-rmse:399382.15982
[50]	train-rmse:139819.87755	eval-rmse:322795.06705
[100]	train-rmse:135829.79473	eval-rmse:317106.41698
[150]	train-rmse:135051.09303	eval-rmse:316162.62920
[200]	train-rmse:134631.98405	eval-rmse:315531.87047
[250]	train-rmse:134313.38606	eval-rmse:315145.59211
[300]	train-rmse:134052.27176	eval-rmse:314952.62725
[350]	train-rmse:133845.54470	eval-rmse:314825.29878
[400]	train-rmse:133641.78195	eval-rmse:314782.79004
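
One thing to keep in mind with this setup: early stopping watches the last entry in `evals`, which here is the 2016-2017 test set, so the number of boosting rounds is tuned on the same data that is later used for the final evaluation. A common alternative, sketched below as an assumption and not as what this notebook does, is to carve a validation period out of the end of the training years and stop on that instead. The helper `chronological_split` and the cut-off year 2014 are hypothetical choices, and the toy frame only stands in for the housing data.

```python
import pandas as pd

def chronological_split(frame, val_start_year, test_start_year, year_col="year"):
    """Split into train / validation / test by year, without shuffling."""
    train = frame[frame[year_col] < val_start_year]
    val = frame[(frame[year_col] >= val_start_year) & (frame[year_col] < test_start_year)]
    test = frame[frame[year_col] >= test_start_year]
    return train, val, test

# Toy frame standing in for the housing data
toy = pd.DataFrame({
    "year":  [1995, 2003, 2010, 2014, 2015, 2016, 2017],
    "price": [47000, 90000, 160000, 180000, 190000, 210000, 220000],
})

train_part, val_part, test_part = chronological_split(
    toy, val_start_year=2014, test_start_year=2016
)
print(len(train_part), len(val_part), len(test_part))  # 3 2 2
```

With such a split, the validation part would get its own `DMatrix` and be placed last in `evals`, leaving the test set untouched until the final evaluation. The target encoding means would then also be computed on the training part only.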

In [31]:
# Make predictions (use booster.predict on a DMatrix)
predictions = booster.predict(xgb.DMatrix(X_test))

# Compute metrics
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)

print("--- Model results on test set (2016-2017) ---")
print(f"Root Mean Squared Error (RMSE): £{rmse:,.2f}")
print(f"R2 Score: {r2:.4f}")
print(f"This means our model can explain about {r2*100:.2f}% of the variance in price.")

--- Model results on test set (2016-2017) ---
Root Mean Squared Error (RMSE): £314,795.85
R2 Score: 0.3290
This means our model can explain about 32.90% of the variance in price.
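
For readers who want to see what these two numbers measure, the sketch below computes RMSE and R2 directly from their definitions on a handful of made-up prices. The values are illustrative only and unrelated to the notebook's results.

```python
import numpy as np

actual = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])
predicted = np.array([120_000.0, 180_000.0, 330_000.0, 370_000.0])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))              # typical error size, in pounds

ss_res = np.sum(errors ** 2)                      # variation the model fails to explain
ss_tot = np.sum((actual - actual.mean()) ** 2)    # total variation around the mean price
r2 = 1 - ss_res / ss_tot

print(f"RMSE: £{rmse:,.2f}")
print(f"R2:   {r2:.4f}")
```

RMSE is expressed in the same unit as the price. Because the errors are squared before averaging, a small number of very large misses can dominate it.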


In [34]:
# Get feature importance from the booster
# 'weight' counts how often a feature is used to split; 'gain' and 'cover' are other options
importance_dict = booster.get_score(importance_type='weight')

# Map keys back to column names if they are in f0..fN format
if importance_dict:
    keys = list(importance_dict.keys())
    if all(k.startswith('f') and k[1:].isdigit() for k in keys):
        mapped = { X_train.columns[int(k[1:])]: v for k, v in importance_dict.items() }
    else:
        mapped = importance_dict
else:
    mapped = {}

feature_importance = pd.DataFrame({
    'Feature': list(mapped.keys()),
    'Importance': list(mapped.values())
}).sort_values(by='Importance', ascending=False)

# Plot feature importance with Plotly (no matplotlib/seaborn)
fi = feature_importance.sort_values('Importance')  # ascending for horizontal bar
fig = px.bar(fi, x='Importance', y='Feature', orientation='h', color='Importance',
             labels={'Importance': 'Importance', 'Feature': 'Feature'})
fig.update_layout(title='Which features determine the house price most? (XGBoost Feature Importance)', height=600)
fig.show(renderer="browser")


In [36]:
# Plot a random sample of 2000 points to keep the chart readable (Plotly)
# The test set is sorted by date, so the first 2000 rows would only show the earliest transactions
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(y_test), size=min(2000, len(y_test)), replace=False)
sample_df = pd.DataFrame({'actual': y_test.values[sample_idx], 'predicted': predictions[sample_idx]})
fig = px.scatter(sample_df, x='actual', y='predicted', opacity=0.6,
                 labels={'actual': 'Actual price (£)', 'predicted': 'Predicted price (£)'},
                 title='Actual vs predicted price (sample)')
# Add y=x reference line
m = max(sample_df['actual'].max(), sample_df['predicted'].max())
fig.add_shape(type='line', x0=0, y0=0, x1=m, y1=m, line=dict(color='red', dash='dash'))
fig.update_layout(height=600)
fig.show(renderer="browser")


In [38]:
model_dir = os.path.join("models", "uk_housing")
os.makedirs(model_dir, exist_ok=True)

# Save the model
model_path = os.path.join(model_dir, "xgboost_housing_v1.json")
booster.save_model(model_path)
print(f"Model saved to: {model_path}")

# Save the encoding mappings (needed for deployment!)
# When predicting live, we need to know what the average price of 'London' was in our training data.
import pickle
mappings = {
    'county_means': county_means,
    'district_means': district_means
}
mapping_path = os.path.join(model_dir, "encoding_mappings.pkl")
with open(mapping_path, 'wb') as f:
    pickle.dump(mappings, f)

print(f"Encoding mappings saved to: {mapping_path}")

Model saved to: models\uk_housing\xgboost_housing_v1.json
Encoding mappings saved to: models\uk_housing\encoding_mappings.pkl
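
At prediction time, the saved mappings have to be applied to incoming records in the same way as during training. The sketch below shows one way this could look. The helper `encode_location`, its explicit `fallback` argument and all numbers are assumptions for illustration. Note that the pickle written above stores only the per-category means, so the global training mean used as the fallback for unseen locations would need to be saved alongside them.

```python
import pandas as pd

def encode_location(value, means, fallback):
    """Look up the training-set mean price for a location, or use the fallback if unseen."""
    encoded = means.get(value)            # Series.get returns None for a missing key
    if encoded is None or pd.isna(encoded):
        return float(fallback)
    return float(encoded)

# Stand-in for the `district_means` Series loaded from encoding_mappings.pkl (made-up values)
district_means = pd.Series({"BRISTOL": 150000.0, "ROCHFORD": 250000.0})
global_mean = 200000.0                    # hypothetical fallback; not stored by the notebook

known = encode_location("BRISTOL", district_means, global_mean)
unseen = encode_location("ATLANTIS", district_means, global_mean)
print(known, unseen)  # 150000.0 200000.0
```

The `pd.isna` check covers categories that exist in the mapping but had no training rows, for which the stored mean is NaN.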
