# 05 - Model Training and Evaluation

In this notebook, we train a machine learning model to predict house prices using the cleaned dataset.

## Objectives

- Load the cleaned, feature-engineered dataset.
- One-hot-encode categorical columns (drop_first=True avoids the dummy-variable trap).
- Split the data into training and test sets and fit a baseline Linear Regression model.
- Persist the trained model **and** its column order for reproducible inference.

## Imports

In [7]:
import os
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score, mean_absolute_error

## Load cleaned data

y → target variable (what we want to predict).

X → feature matrix (every column except the target and columns we no longer need). 'Date of Transfer' is dropped because it is a timestamp string that was already decomposed into Year / Month during cleaning.

In [8]:
df = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")

y = df["Price"]
X = df.drop(columns=["Price", "Date of Transfer"])

## One-hot-encode all categorical columns

get_dummies() turns string (object) and categorical columns into 0/1 indicator columns; numeric columns pass through unchanged.

drop_first=True removes the first dummy in each group to keep the features linearly independent (important for linear models that fit an intercept).

In [9]:
X = pd.get_dummies(X, drop_first=True)
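
To make the effect of drop_first concrete, here is a minimal sketch on a toy frame. The column name PropertyType and its values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy data: names and values are hypothetical, chosen only to show the mechanics.
toy = pd.DataFrame({
    "PropertyType": ["Detached", "Flat", "Semi", "Detached"],
    "Year": [2019, 2020, 2021, 2022],
})

# One indicator column per category.
full = pd.get_dummies(toy)

# Same encoding, but the first category (in sorted order) is dropped,
# so "Detached" becomes the implicit baseline.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With all indicators kept, the dummy columns of one group always sum to 1, which duplicates the intercept. Dropping one category removes that redundancy, and each remaining coefficient is read relative to the dropped baseline.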

## Train / Test split

80% of the rows are used for training; 20% are held out for unbiased evaluation.

In [10]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,  # fixed seed so the split is reproducible
    shuffle=True,
)
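
As a small illustration of what test_size and random_state do, the sketch below splits a synthetic 10-row array twice with the same seed. The arrays X_demo and y_demo are placeholders, not the house-price data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with identical arguments.
split_a = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42, shuffle=True)
split_b = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42, shuffle=True)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# With a fixed random_state, both calls hold out exactly the same rows.
same_test_rows = np.array_equal(split_a[3], split_b[3])
print(same_test_rows)
```

Fixing the seed matters here because the saved model and any reported metrics are only comparable across runs if the held-out rows stay the same.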

In [11]:
import numpy as np

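# Sanity check after X = pd.get_dummies(...).
# Note: bool dummy columns are not np.number, so they are listed here
# even though scikit-learn can fit on them.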
non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
print("Still non-numeric:", non_numeric)


Still non-numeric: ['Property_D', 'Property_F', 'Property_S', 'Property_T', 'Town/City_ADDLESTONE', 'Town/City_ALDERSHOT', 'Town/City_ALTON', 'Town/City_ALTRINCHAM', 'Town/City_AMERSHAM', 'Town/City_AMMANFORD', 'Town/City_ANDOVER', 'Town/City_ARUNDEL', 'Town/City_ASHFORD', 'Town/City_ASHTON-UNDER-LYNE', 'Town/City_ATHERSTONE', 'Town/City_ATTLEBOROUGH', 'Town/City_AYLESBURY', 'Town/City_AYLESFORD', 'Town/City_BAGSHOT', 'Town/City_BALDOCK', 'Town/City_BANBURY', 'Town/City_BARKING', 'Town/City_BARNET', 'Town/City_BARNSLEY', 'Town/City_BARNSTAPLE', 'Town/City_BARRY', 'Town/City_BASILDON', 'Town/City_BASINGSTOKE', 'Town/City_BATH', 'Town/City_BEDFORD', 'Town/City_BEDLINGTON', 'Town/City_BEDWORTH', 'Town/City_BELPER', 'Town/City_BENFLEET', 'Town/City_BERKHAMSTED', 'Town/City_BEVERLEY', 'Town/City_BEXHILL-ON-SEA', 'Town/City_BEXLEYHEATH', 'Town/City_BICESTER', 'Town/City_BIDEFORD', 'Town/City_BILLERICAY', 'Town/City_BILLINGHAM', 'Town/City_BILSTON', 'Town/City_BIRKENHEAD', 'Town/City_BIRMINGH
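
The columns listed above are the indicator columns themselves: in pandas 2.x, get_dummies() returns bool columns by default, and bool is not a subtype of np.number. If integer indicators are preferred, one option (a sketch, not what this notebook ran) is the dtype argument of get_dummies(). The toy column Town is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real feature matrix.
toy = pd.DataFrame({"Town": ["BATH", "BARRY", "BATH"], "Year": [2019, 2020, 2021]})

# dtype=bool is passed explicitly to mirror the pandas 2.x default.
as_bool = pd.get_dummies(toy, drop_first=True, dtype=bool)

# Integer indicators instead of booleans.
as_int = pd.get_dummies(toy, drop_first=True, dtype=int)

# The same check as in the cell above, applied to both encodings.
non_numeric_bool = as_bool.select_dtypes(exclude=[np.number]).columns.tolist()
non_numeric_int = as_int.select_dtypes(exclude=[np.number]).columns.tolist()

print("bool dummies flagged:", non_numeric_bool)
print("int dummies flagged:", non_numeric_int)
```

Either encoding carries the same information. The integer form simply makes the check come back empty, so a non-empty list then points to a real problem such as a leftover string column.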

## Fit baseline model

In [12]:
model = LinearRegression()
model.fit(X_train, y_train)
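
The metrics imported at the top, r2_score and mean_absolute_error, can be applied to held-out predictions as sketched below. The example uses synthetic data so that it runs on its own. The names X_demo, y_demo and demo_model are placeholders, and the printed numbers say nothing about the house-price model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

demo_model = LinearRegression().fit(X_tr, y_tr)

# Score on the held-out rows only, never on the training rows.
y_pred = demo_model.predict(X_te)
r2 = r2_score(y_te, y_pred)
mae = mean_absolute_error(y_te, y_pred)

print(f"R2:  {r2:.3f}")
print(f"MAE: {mae:.3f}")
```

In this notebook the same two calls would take y_test and model.predict(X_test). MAE is reported in the units of the target, so for Price it reads as a typical error in currency terms.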

## Persist model + column order

In [13]:
os.makedirs("../outputs/models", exist_ok=True)

joblib.dump(model, "../outputs/models/house_price_model.pkl")  # fitted estimator
joblib.dump(X_train.columns.tolist(), "../outputs/models/model_columns.pkl")  # training column order

print("✅  Model and column list saved to ../outputs/models/")

✅  Model and column list saved to ../outputs/models/
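
The saved column list matters at inference time because a new record encoded on its own will usually produce fewer dummy columns than the training matrix. A minimal sketch of the alignment step follows. The list model_columns and the frame new_rows are hypothetical stand-ins. In practice the list would come from joblib.load("../outputs/models/model_columns.pkl").

```python
import pandas as pd

# Hypothetical stand-in for the contents of model_columns.pkl.
model_columns = ["Year", "Month", "Town/City_BATH", "Town/City_BEDFORD"]

# Hypothetical new record in the raw (pre-encoding) feature schema.
new_rows = pd.DataFrame({"Year": [2023], "Month": [6], "Town/City": ["BATH"]})

# Encode without drop_first: with a single row, drop_first would remove
# the only category present.
encoded = pd.get_dummies(new_rows)

# Align to the training layout: missing dummies become 0, columns the model
# never saw (including the baseline category) are discarded, order is fixed.
aligned = encoded.reindex(columns=model_columns, fill_value=0)

print(aligned)
```

The aligned frame can then be passed to the loaded model's predict(). A category that was never seen in training ends up as all zeros, which the model treats the same as the dropped baseline category.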


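## Conclusion

This notebook fits a baseline Linear Regression model on the one-hot-encoded features and saves it together with its column order. The model should be judged on the held-out test set using the metrics imported above, R² and mean absolute error (MAE). An R² close to 1 means the model explains most of the variance in price; MAE gives the typical prediction error in the same units as Price.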