# 05 – Model Training & Evaluation  
**CRISP-DM Phase 4: Modeling** (and Phase 5: Evaluation)  
This notebook fits a regression model, evaluates its performance, and serialises artifacts for deployment.

### Objectives
* Scale numeric features and one-hot encode categorical features inside a scikit-learn `Pipeline`.  
* Split data 80 / 20 (with `random_state=42`).  
* Tune a `RandomForestRegressor` with `RandomizedSearchCV` (50 candidates, 3-fold CV).  
* Evaluate on the test set: MAE and R².  
* Serialize the tuned pipeline for deployment:  
  - `house_price_pipeline.pkl`  

### Inputs
* `outputs/datasets/collection/HousePricesRecords_clean.csv`  

### Outputs
* `outputs/models/house_price_pipeline.pkl`  

### Additional Comments  
#### Business Requirements Addressed  
* **BR3**: Produces the trained model for the Sale Price Prediction tab.  

#### Additional Notes  
* Later: swap in a gradient-boosted regressor such as XGBoost to boost performance.

### Import Required Libraries for Modeling & Evaluation  
This cell brings in the modules needed to load the data, build the preprocessing and modeling pipeline, tune it, and compute performance metrics:

- **`os`** for file-system operations (ensuring output folders exist).  
- **`joblib`** to serialise the tuned pipeline to disk.  
- **`pandas as pd`** for tabular data manipulation (loading the CSV, creating DataFrames).  
- **`numpy as np`** for the numeric dtype check on the feature matrix.  
- **`train_test_split`** and **`RandomizedSearchCV`** from **`sklearn.model_selection`** for splitting the data and tuning hyperparameters.  
- **`Pipeline`**, **`ColumnTransformer`**, **`StandardScaler`** and **`OneHotEncoder`** for preprocessing.  
- **`RandomForestRegressor`** as the regression model.  
- **`r2_score`** and **`mean_absolute_error`** for evaluating model performance.

`LinearRegression` is also imported but is not used by the pipeline below.


In [8]:
import os
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV      

### Load cleaned data

y  → target variable (what we want to predict)

X  → feature matrix 

In [9]:
df = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")
y  = df["Price"]
X  = df.drop(columns=["Price", "Date of Transfer"], errors="ignore")

#### Define numeric & categorical feature lists

Numeric features (continuous variables to be scaled with a `StandardScaler`)  

Categorical features (discrete variables to be one-hot-encoded with a `OneHotEncoder`)

In [10]:
numeric_features = [
    "Year", "Month",
    "RegionMedianPrice", "RegionSaleCount",
    "CountyMedianPrice", "CountySaleCount"
]

categorical_features = [
    "Old/New", "Duration",
    "Town/City", "County", "PPDCategory Type",
    "Property_D", "Property_F", "Property_S", "Property_T",
    "Region"
]

#### Filter Feature Lists to Available Columns

This code ensures that both your numeric and categorical feature lists only include columns that actually exist in the current DataFrame `X`.  
This prevents errors in your preprocessing pipeline if any expected column is missing.

In [11]:
numeric_features     = [c for c in numeric_features     if c in X.columns]
categorical_features = [c for c in categorical_features if c in X.columns]
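
The filter above drops missing names silently. A minimal sketch of a variant that also reports what was removed; the helper name `keep_available` and the toy column names are hypothetical and not part of the notebook:

```python
def keep_available(expected, available):
    """Split `expected` into names present in `available` and names that are missing."""
    available = set(available)
    kept = [name for name in expected if name in available]
    missing = [name for name in expected if name not in available]
    return kept, missing


# Toy example: "CountySaleCount" is expected but absent from the data.
toy_columns = ["Year", "Month", "Region"]
kept, missing = keep_available(["Year", "Month", "CountySaleCount"], toy_columns)
print("kept:", kept)
print("missing:", missing)
```

Printing the `missing` list makes it visible when an upstream cleaning step has renamed or removed a column the model was meant to use.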

#### Build Preprocessing & Modeling Pipeline  
This block creates a scikit-learn `ColumnTransformer` named `preprocessor` that:

- Scales all numeric features using `StandardScaler` (zero mean, unit variance).  
- One-hot encodes all categorical features (with unseen categories ignored).

It then defines a `Pipeline` that sequentially:

- Applies the `preprocessor` to prepare the data.  
- Fits a `RandomForestRegressor` (named “regressor”) on the transformed features.



In [12]:
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
])
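
Two behaviours of a `ColumnTransformer` configured this way are easy to miss: columns not named in any transformer are dropped (the default `remainder="drop"`), and categories unseen during fitting encode to all zeros rather than raising an error. A small sketch on made-up data; the frames, the `toy_preprocessor` name, and the `sparse_threshold=0` setting (used here only to force a dense array) are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames; column values are illustrative only.
train = pd.DataFrame({
    "Year": [2019, 2020, 2021],
    "Region": ["North", "South", "North"],
    "LogPrice": [11.9, 12.1, 12.3],  # not listed below, so it is dropped
})
new = pd.DataFrame({"Year": [2020], "Region": ["West"], "LogPrice": [12.0]})

toy_preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["Year"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
    ],
    sparse_threshold=0,  # assumption: always return a dense array
)

train_matrix = toy_preprocessor.fit_transform(train)
new_matrix = toy_preprocessor.transform(new)

print(train_matrix.shape)  # one scaled column plus two one-hot columns
print(new_matrix)          # unseen "West" leaves both one-hot columns at zero
```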

#### Split Data into Training and Test Sets  
This cell uses scikit-learn’s `train_test_split` to randomly split our feature matrix `X` and target vector `y` into:

- **Training set** (`X_train`, `y_train`) comprising 80 % of the data, used to fit the model.  
- **Test set** (`X_test`, `y_test`) comprising 20 % of the data, reserved for evaluating performance on unseen examples.  


In [13]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,  
    shuffle=True
)
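
As a quick illustration of what `test_size=0.20` and a fixed `random_state` give you, the sketch below splits ten stand-in records twice. The `rows` list is hypothetical; the notebook splits `X` and `y` instead:

```python
from sklearn.model_selection import train_test_split

rows = list(range(10))  # stand-in for ten records

train_a, test_a = train_test_split(rows, test_size=0.20, random_state=42, shuffle=True)
train_b, test_b = train_test_split(rows, test_size=0.20, random_state=42, shuffle=True)

print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # same seed, same split
```

Fixing the seed is what makes the reported test metrics reproducible between runs.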

#### Hyperparameter Search Space

This cell defines a `param_grid` dictionary listing the settings to try for the Random Forest regressor, such as the number of trees, tree depth, and how splits are made. The `regressor__` prefix on each key routes the setting to the pipeline step named `"regressor"`.



In [14]:
param_grid = {
    "regressor__n_estimators":      [100, 200, 300],
    "regressor__max_depth":         [5, 10, 20],
    "regressor__min_samples_split": [2, 5, 10],
    "regressor__min_samples_leaf":  [1, 2, 4],
    "regressor__max_features":      ["sqrt", "log2", None],
    "regressor__bootstrap":         [True, False],
}
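
For a sense of scale, the number of distinct combinations in this grid can be counted directly. The snippet copies the option lists from the cell above so that it runs on its own; `grid` and `n_combinations` are illustrative names:

```python
from math import prod

# Same option lists as param_grid above, repeated so the snippet is self-contained.
grid = {
    "regressor__n_estimators":      [100, 200, 300],
    "regressor__max_depth":         [5, 10, 20],
    "regressor__min_samples_split": [2, 5, 10],
    "regressor__min_samples_leaf":  [1, 2, 4],
    "regressor__max_features":      ["sqrt", "log2", None],
    "regressor__bootstrap":         [True, False],
}

# The grid size is the product of the number of options per parameter.
n_combinations = prod(len(options) for options in grid.values())
print(n_combinations)  # 486
```

An exhaustive search with 3-fold CV would need 1,458 fits, which is the motivation for sampling a subset instead.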

#### RandomizedSearchCV
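Samples 50 random hyperparameter settings from the 486 combinations in the full grid, scores each with 3-fold cross-validation on the training set using negative MAE, and refits the best one on the whole training set.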

In [15]:
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_grid,
    n_iter=50,
    cv=3,
    scoring="neg_mean_absolute_error",
    n_jobs=-1,
    random_state=42,
    verbose=1
)

In [16]:
random_search.fit(X_train, y_train)

best_model = random_search.best_estimator_
print("🔍 Best params (random):", random_search.best_params_)

Fitting 3 folds for each of 50 candidates, totalling 150 fits
🔍 Best params (random): {'regressor__n_estimators': 100, 'regressor__min_samples_split': 10, 'regressor__min_samples_leaf': 4, 'regressor__max_features': None, 'regressor__max_depth': 5, 'regressor__bootstrap': True}
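
The cell above prints only the winning parameters. A sketch of how the cross-validated error could also be read off a fitted search, shown on synthetic data so that it runs independently; the dataset, the reduced search space, and the names `search` and `cv_mae` are assumptions for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem standing in for the housing data.
X_demo, y_demo = make_regression(n_samples=60, n_features=4, noise=5.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [5, 10], "max_depth": [2, 3]},
    n_iter=3,
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# The scorer is negated MAE, so flip the sign to get an error in target units.
cv_mae = -search.best_score_
print("Candidates tried:", len(search.cv_results_["params"]))
print("Best CV MAE:", cv_mae)
```

In the notebook the equivalent would be `-random_search.best_score_`, which can then be compared with the test-set MAE to see how far the two diverge.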


#### Check for Any Remaining Non-Numeric Features  
This cell lists the columns of the raw feature matrix `X` that are not numeric, using `select_dtypes(exclude=[np.number])`. Because encoding happens inside the pipeline's `OneHotEncoder` rather than on `X` itself, a non-empty list is expected here. What matters is that every column listed is either named in `categorical_features` (and therefore encoded) or deliberately left out (and therefore dropped by the `ColumnTransformer`).


In [17]:
non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
print("Still non-numeric:", non_numeric)


Still non-numeric: ['Town/City', 'County', 'PPDCategory Type', 'Property_D', 'Property_F', 'Property_O', 'Property_S', 'Property_T', 'Region']


#### List Available Training Features  
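This cell prints all column names in `X_train` so you can check them against the feature lists. Columns that appear here but in neither list, such as `Property_O` and `LogPrice` in the output below, never reach the model: `ColumnTransformer` drops unlisted columns by default (`remainder="drop"`). That matters for `LogPrice`, which by its name is a transform of the target and would leak it if it were ever added to a feature list.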

In [18]:
print("Available columns:", X_train.columns.tolist())

Available columns: ['Old/New', 'Duration', 'Town/City', 'County', 'PPDCategory Type', 'Year', 'Month', 'Property_D', 'Property_F', 'Property_O', 'Property_S', 'Property_T', 'Region', 'RegionMedianPrice', 'RegionSaleCount', 'CountyMedianPrice', 'CountySaleCount', 'LogPrice']
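
The comparison between the printed columns and the two feature lists can be done mechanically. The sketch below copies the column names from the output above and the lists from the earlier cell so that it runs on its own; `available_columns`, `numeric`, `categorical` and `unused` are illustrative names:

```python
# Column names copied from the printed output above.
available_columns = [
    "Old/New", "Duration", "Town/City", "County", "PPDCategory Type",
    "Year", "Month", "Property_D", "Property_F", "Property_O",
    "Property_S", "Property_T", "Region", "RegionMedianPrice",
    "RegionSaleCount", "CountyMedianPrice", "CountySaleCount", "LogPrice",
]

# Feature lists copied from the cell that defines them.
numeric = [
    "Year", "Month",
    "RegionMedianPrice", "RegionSaleCount",
    "CountyMedianPrice", "CountySaleCount",
]
categorical = [
    "Old/New", "Duration",
    "Town/City", "County", "PPDCategory Type",
    "Property_D", "Property_F", "Property_S", "Property_T",
    "Region",
]

# Columns present in the data but named in neither list.
listed = set(numeric) | set(categorical)
unused = [name for name in available_columns if name not in listed]
print(unused)  # ['Property_O', 'LogPrice']
```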


#### Evaluate Model Performance on Test Set  
This cell uses the tuned `best_model` to predict sale prices for the held-out `X_test` data, then reports the Mean Absolute Error (MAE) and R² score to show how accurately the model generalises to new, unseen properties.

In [19]:
y_pred = best_model.predict(X_test)
print("📊 Test MAE:", mean_absolute_error(y_test, y_pred))
print("📈 Test R² :", r2_score(y_test, y_pred))



📊 Test MAE: 69666.93602448335
📈 Test R² : 0.52982827718276
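
RMSE penalises large errors more heavily than MAE and is often reported alongside it. A sketch of how it could be computed, using made-up prices rather than the notebook's predictions; taking the square root of `mean_squared_error` is one way to do this that does not depend on the scikit-learn version:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only.
y_true_demo = np.array([200_000, 150_000, 320_000])
y_pred_demo = np.array([210_000, 140_000, 300_000])

mae = mean_absolute_error(y_true_demo, y_pred_demo)
rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print("MAE :", mae)
print("RMSE:", rmse)
```

In the notebook, the same two lines would take `y_test` and `y_pred` in place of the demo arrays. RMSE is always at least as large as MAE, and a wide gap between them points to a few very large errors.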


#### Save the Trained Pipeline  
This cell ensures the `outputs/models` folder exists, serialises the tuned `best_model` pipeline to `house_price_pipeline.pkl` using `joblib.dump`, and prints a confirmation so you can load it later for live predictions.



In [20]:
os.makedirs("../outputs/models", exist_ok=True)
joblib.dump(best_model, "../outputs/models/house_price_pipeline.pkl")
print("✅ Pipeline saved to ../outputs/models/house_price_pipeline.pkl")

✅ Pipeline saved to ../outputs/models/house_price_pipeline.pkl
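
Loading the saved file later is a single `joblib.load` call, after which the object behaves like the fitted pipeline. The round trip is sketched below with a tiny stand-in pipeline written to a temporary directory; the toy data, the `LinearRegression` stand-in, and the file name `toy_pipeline.pkl` are assumptions for illustration:

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in pipeline fitted on an exactly linear toy relationship.
toy_X = pd.DataFrame({"Year": [2018, 2019, 2020, 2021]})
toy_y = [100.0, 110.0, 120.0, 130.0]
toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("regressor", LinearRegression()),
]).fit(toy_X, toy_y)

# Save and reload, as a deployment script would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    joblib.dump(toy_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline expects a DataFrame with the training column names.
prediction = restored.predict(pd.DataFrame({"Year": [2022]}))
print(prediction)
```

Because preprocessing is stored inside the pipeline, the caller passes raw columns and does not need to repeat the scaling or encoding steps. Pickled scikit-learn objects are best loaded with the same library version that saved them.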
