# 05 - Model Training and Evaluation

In this notebook, we train a machine learning model to predict house prices using the cleaned dataset.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib
import os

## Load Cleaned Data

In [2]:
df_chunk = pd.read_csv('../outputs/datasets/collection/HousePricesRecords_clean.csv')
df_chunk.head()

| Unnamed: 0 | Price | Date of Transfer | Old/New | Duration | Town/City | County | PPDCategory Type | Year | Month | Property_D | Property_F | Property_S | Property_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25000 | 1995-08-18 | 0 | 1 | OLDHAM | GREATER MANCHESTER | A | 1995 | 8 | False | False | False | True |
| 1 | 42500 | 1995-08-09 | 0 | 1 | GRAYS | THURROCK | A | 1995 | 8 | False | False | True | False |
| 2 | 45000 | 1995-06-30 | 0 | 1 | HIGHBRIDGE | SOMERSET | A | 1995 | 6 | False | False | False | True |
| 3 | 43150 | 1995-11-24 | 0 | 1 | BEDFORD | BEDFORDSHIRE | A | 1995 | 11 | False | False | False | True |
| 4 | 18899 | 1995-06-23 | 0 | 1 | WAKEFIELD | WEST YORKSHIRE | A | 1995 | 6 | False | False | True | False |


## Define Features and Target Variable
We use numerical and encoded features to predict the house **Price**.

In [3]:
features = ['Old/New', 'Duration', 'Year', 'Month', 'Property_D', 'Property_F', 'Property_S', 'Property_T']
X = df_chunk[features]
y = df_chunk['Price']

## Train-Test Split

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Train Linear Regression Model

In [5]:
model = LinearRegression()
model.fit(X_train, y_train)

## Make Predictions

In [6]:
y_pred = model.predict(X_test)

## Evaluate Model

In [7]:
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# RMSE is the square root of MSE (avoids the squared=False argument, deprecated in newer scikit-learn)
rmse = mse ** 0.5
r2 = r2_score(y_test, y_pred)

print(f"MAE: {mae:.2f}")
print(f"MSE: {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R² Score: {r2:.4f}")

MAE: 29481.31
MSE: 2038163819.19
RMSE: 45146.03
R² Score: 0.2104
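As a rough illustration of what the error metrics above measure, here is a minimal sketch that computes MAE, MSE, and RMSE by hand in plain Python. The values in `y_true` and `y_hat` are made up for the example and are not taken from the dataset.

```python
import math

# Toy values, invented for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
y_hat = [110.0, 190.0, 330.0, 360.0]

# Per-sample errors: actual minus predicted.
errors = [t - p for t, p in zip(y_true, y_hat)]

# MAE: mean of the absolute errors.
mae_manual = sum(abs(e) for e in errors) / len(errors)

# MSE: mean of the squared errors (penalises large misses more heavily).
mse_manual = sum(e ** 2 for e in errors) / len(errors)

# RMSE: square root of MSE, back in the same units as the target.
rmse_manual = math.sqrt(mse_manual)

print(f"MAE: {mae_manual:.2f}")
print(f"MSE: {mse_manual:.2f}")
print(f"RMSE: {rmse_manual:.2f}")
```

The same relationship holds for the notebook's output: the reported RMSE is the square root of the reported MSE, so it can be read directly in the units of **Price**.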


## Save Model

We save the trained model together with its feature order using joblib, which creates the file `house_price_model.pkl`.

In [9]:
os.makedirs('../outputs/models', exist_ok=True)

# Save the trained model together with the column order it expects at prediction time
feature_order = X.columns.tolist()
joblib.dump((model, feature_order), '../outputs/models/house_price_model.pkl')

print("✅ Model saved to ../outputs/models/house_price_model.pkl")



✅ Model saved to ../outputs/models/house_price_model.pkl
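Because the file stores a `(model, feature_order)` tuple rather than the model alone, loading it back needs to unpack both parts. The sketch below shows one way this might look. It trains on a tiny made-up dataset and writes to a temporary directory so that it is self-contained; `X_toy`, `y_toy`, and `new_house` are hypothetical names and values, not part of the notebook.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy stand-in for the notebook's training data (values are invented).
features = ['Old/New', 'Duration', 'Year', 'Month',
            'Property_D', 'Property_F', 'Property_S', 'Property_T']
X_toy = pd.DataFrame(
    [[0, 1, 1995, 8, False, False, False, True],
     [0, 1, 1995, 8, False, False, True, False],
     [1, 0, 1996, 6, True, False, False, False],
     [0, 1, 1997, 11, False, True, False, False]],
    columns=features,
)
y_toy = pd.Series([25000, 42500, 90000, 38000])

model = LinearRegression().fit(X_toy, y_toy)

# Save in the same (model, feature_order) layout as the notebook,
# but to a temporary directory instead of ../outputs/models.
model_path = os.path.join(tempfile.mkdtemp(), 'house_price_model.pkl')
joblib.dump((model, X_toy.columns.tolist()), model_path)

# The file holds a tuple, so unpack both parts when loading.
loaded_model, feature_order = joblib.load(model_path)

# Hypothetical new record, with columns deliberately in a different order.
new_house = pd.DataFrame([{
    'Year': 1996, 'Month': 3, 'Old/New': 0, 'Duration': 1,
    'Property_T': True, 'Property_S': False,
    'Property_F': False, 'Property_D': False,
}])

# Reorder the columns to match training before predicting.
prediction = loaded_model.predict(new_house[feature_order])
print(f"Predicted price: {prediction[0]:.2f}")
```

Storing the feature order next to the model means the caller does not have to remember the column order used in training.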


## Conclusion
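The model is evaluated with MAE, MSE, RMSE, and R². An R² close to 1 would mean the model explains most of the variance in the data. The R² of 0.2104 obtained here means this linear model explains only about 21% of the variance in **Price** on the test set, with a mean absolute error of about 29,481.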
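To make the R² interpretation concrete, the following sketch computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values are assumptions for illustration only. Always predicting the mean of the targets scores 0, and a perfect prediction scores 1.

```python
def r_squared(y_true, y_hat):
    """R² = 1 - SS_res / SS_tot, computed in plain Python."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)            # total variation
    return 1 - ss_res / ss_tot


# Toy values, invented for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]

# Baseline: always predict the mean of the targets.
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)

r2_baseline = r_squared(y_true, mean_baseline)
r2_perfect = r_squared(y_true, y_true)
r2_close = r_squared(y_true, [110.0, 190.0, 330.0, 360.0])

print(f"Mean baseline R²: {r2_baseline:.3f}")
print(f"Perfect prediction R²: {r2_perfect:.3f}")
print(f"Close prediction R²: {r2_close:.3f}")
```

Read this way, an R² of about 0.21 sits much nearer to the mean baseline than to a perfect fit.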