# 05 – Model Training & Evaluation  
**CRISP-DM Phase 4: Modeling** (and Phase 5: Evaluation)  
This notebook fits a regression model, evaluates its performance, and serialises artifacts for deployment.

### Objectives
* One-hot encode features to match model expectations.  
* Split data 80 / 20 (with `random_state=42`).  
* Fit a baseline `LinearRegression` model.  
* Evaluate on test set: R², MAE, RMSE.  
* Serialize artifacts for deployment:  
  - `house_price_model.pkl`  
  - `model_columns.pkl`  
  - *(optional)* `model_metrics.json`  

### Inputs
* `outputs/datasets/collection/HousePricesRecords_clean.csv`  

### Outputs
* `outputs/models/house_price_model.pkl`  
* `outputs/models/model_columns.pkl`  
* *(optional)* `outputs/models/model_metrics.json`  

### Additional Comments  
#### Business Requirements Addressed  
* **BR3**: Produces the trained model for the Sale Price Prediction tab.  

#### Additional Notes  
* Later: upgrade to a pipeline with XGBoost + hyperparameter tuning to boost performance. 

### Import Required Libraries for Modeling & Evaluation  
This cell brings in the modules we'll need to load the data, split the dataset, fit our regression algorithm, compute performance metrics, and save the trained model:

- **`os`** for file-system operations (ensuring output folders exist, constructing paths).  
- **`joblib`** to serialize the trained model and its column list to `house_price_model.pkl` and `model_columns.pkl`.  
- **`pandas as pd`** for tabular data manipulation (loading CSV, creating DataFrames).  
- **`numpy as np`** for numeric dtype checks and array operations.  
- **`train_test_split`** and **`LinearRegression`** from **`sklearn`** for splitting data and fitting the baseline regression model.  
- **`r2_score`** and **`mean_absolute_error`** for evaluating model performance.


In [None]:
import os
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
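
As a quick illustration of the metric functions imported above, the sketch below computes R², MAE, and RMSE on a handful of made-up prices. The arrays `y_true` and `y_pred` are hypothetical values, not results from this dataset, and `mean_squared_error` is an extra import that the cell above does not include; RMSE is taken here as the square root of the MSE.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted sale prices, for illustration only
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 275_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0, 280_000.0])

r2 = r2_score(y_true, y_pred)                        # share of variance explained
mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE = sqrt(MSE)

print(f"R2={r2:.3f}  MAE={mae:,.0f}  RMSE={rmse:,.0f}")
```

MAE and RMSE are both in the units of the target (pounds here); RMSE penalises large errors more heavily, so a gap between the two hints at a few large misses.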

### Load cleaned data

- `y` → target variable (what we want to predict).
- `X` → feature matrix (everything except the target and unneeded columns).

`'Date of Transfer'` is dropped because it is a timestamp string that we already decomposed into Year / Month during cleaning.

In [2]:
df = pd.read_csv("../outputs/datasets/collection/HousePricesRecords_clean.csv")

y = df["Price"]
X = df.drop(columns=["Price", "Date of Transfer"])

#### One-Hot Encode All Categorical Features  
This cell converts any remaining non-numeric (categorical) columns in `X` into binary indicator (0/1) columns using pandas’ `get_dummies`.  
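- `drop_first=True` drops the first category of each feature, which then acts as the baseline level. Without this, the indicator columns for a feature always sum to 1 and are perfectly collinear with the intercept (the "dummy variable trap"), which makes the coefficients of a linear model such as `LinearRegression` ambiguous.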


In [3]:
X = pd.get_dummies(X, drop_first=True)
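
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical and are not taken from the house-price data.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column
toy = pd.DataFrame({
    "Property": ["D", "F", "S", "D"],
    "Year": [2015, 2016, 2015, 2017],
})

# Categories are sorted (D, F, S); the first one (D) is dropped
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
# ['Year', 'Property_F', 'Property_S']
```

A row with both `Property_F` and `Property_S` equal to 0 (or `False`) represents the dropped baseline category `D`.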

#### Split Data into Training and Test Sets  
This cell uses scikit-learn’s `train_test_split` to randomly split our feature matrix `X` and target vector `y` into:

- **Training set** (`X_train`, `y_train`) comprising 80 % of the data, used to fit the model.  
- **Test set** (`X_test`, `y_test`) comprising 20 % of the data, reserved for evaluating performance on unseen examples.  


In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.20,
    random_state=42,  
    shuffle=True
)
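
The following sketch, on hypothetical toy data, shows what the 80 / 20 split and a fixed `random_state` give you: the expected set sizes, and the same rows in the test set on every run.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows
X_toy = pd.DataFrame({"feature": np.arange(10)})
y_toy = pd.Series(np.arange(10) * 2)

# Two independent calls with the same seed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42
)

print(len(X_tr), len(X_te))              # 8 training rows, 2 test rows
print(X_te.index.equals(X_te2.index))    # same rows selected both times
```

Features and target stay aligned because both are split with the same shuffled row indices.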

#### Check for Any Remaining Non-Numeric Features  
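This cell checks whether our feature matrix `X` contains only numeric columns after one-hot encoding. It uses `select_dtypes(exclude=[np.number])` to list any columns whose dtype is not numeric. Note that pandas 2.0 and later return the indicator columns from `get_dummies` with `bool` dtype, which `np.number` does not cover, so those columns appear in the list below even though scikit-learn accepts boolean features. Only `object` (string) columns in the list would need to be encoded or dropped before fitting the model.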


In [None]:
non_numeric = X.select_dtypes(exclude=[np.number]).columns.tolist()
print("Still non-numeric:", non_numeric)


Still non-numeric: ['Property_D', 'Property_F', 'Property_S', 'Property_T', 'Town/City_ADDLESTONE', 'Town/City_ALDERSHOT', 'Town/City_ALTON', 'Town/City_ALTRINCHAM', 'Town/City_AMERSHAM', 'Town/City_AMMANFORD', 'Town/City_ANDOVER', 'Town/City_ARUNDEL', 'Town/City_ASHFORD', 'Town/City_ASHTON-UNDER-LYNE', 'Town/City_ATHERSTONE', 'Town/City_ATTLEBOROUGH', 'Town/City_AYLESBURY', 'Town/City_AYLESFORD', 'Town/City_BAGSHOT', 'Town/City_BALDOCK', 'Town/City_BANBURY', 'Town/City_BARKING', 'Town/City_BARNET', 'Town/City_BARNSLEY', 'Town/City_BARNSTAPLE', 'Town/City_BARRY', 'Town/City_BASILDON', 'Town/City_BASINGSTOKE', 'Town/City_BATH', 'Town/City_BEDFORD', 'Town/City_BEDLINGTON', 'Town/City_BEDWORTH', 'Town/City_BELPER', 'Town/City_BENFLEET', 'Town/City_BERKHAMSTED', 'Town/City_BEVERLEY', 'Town/City_BEXHILL-ON-SEA', 'Town/City_BEXLEYHEATH', 'Town/City_BICESTER', 'Town/City_BIDEFORD', 'Town/City_BILLERICAY', 'Town/City_BILLINGHAM', 'Town/City_BILSTON', 'Town/City_BIRKENHEAD', 'Town/City_BIRMINGH
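
If an empty list is the desired signal, two options are sketched below on hypothetical toy data: request integer indicator columns via the `dtype` argument of `get_dummies`, or add `"bool"` to the excluded dtypes in the check. This is an illustration under those assumptions, not a change to the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"Property": ["D", "F", "S"], "Year": [2015, 2016, 2017]})

# Option 1: ask get_dummies for integer 0/1 columns
as_int = pd.get_dummies(toy, drop_first=True, dtype=int)
leftover_int = as_int.select_dtypes(exclude=[np.number]).columns.tolist()

# Option 2: keep the default dtype and treat bool as acceptable in the check
default = pd.get_dummies(toy, drop_first=True)
leftover_default = default.select_dtypes(
    exclude=[np.number, "bool"]
).columns.tolist()

print("Option 1 leftovers:", leftover_int)
print("Option 2 leftovers:", leftover_default)
```

With either option, any name that still appears points at a column that genuinely needs encoding or dropping.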

#### Initialize & Train Linear Regression Model  
This cell creates a `LinearRegression` estimator and fits it to the training data, allowing the algorithm to learn the best‐fit coefficients that minimise the sum of squared errors between predicted and actual sale prices.


In [6]:
model = LinearRegression()
model.fit(X_train, y_train)
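
To illustrate what "minimising the sum of squared errors" means, the sketch below fits `LinearRegression` on synthetic data and compares its coefficients with a direct least-squares solve from `np.linalg.lstsq`. The data, the true coefficients (3, -2, intercept 5), and all variable names are made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x0 - 2*x1 + 5 + small noise (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0 + rng.normal(scale=0.1, size=50)

model_toy = LinearRegression().fit(X_toy, y_toy)

# Same problem solved directly: prepend a column of ones for the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy])
beta, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print("sklearn :", model_toy.intercept_, model_toy.coef_)
print("lstsq   :", beta[0], beta[1:])
```

Both approaches solve the same ordinary least-squares problem, so the intercept and coefficients should agree up to floating-point error.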

#### Save Trained Model & Feature Columns  
This cell prepares the deployment artifacts by:

1. Creating the `outputs/models/` directory if it doesn’t already exist.  
2. Serializing the trained `LinearRegression` model to `house_price_model.pkl` using `joblib.dump`.  
3. Saving the list of feature column names (in the exact order used during training) to `model_columns.pkl`.  
4. Printing a confirmation message once both files are written.

These files are then picked up by the Streamlit app for making live predictions.


In [7]:
os.makedirs("../outputs/models", exist_ok=True)

joblib.dump(model,
            "../outputs/models/house_price_model.pkl")          
joblib.dump(X_train.columns.tolist(),
            "../outputs/models/model_columns.pkl")              

print("✅  Model and column list saved to ../outputs/models/")

✅  Model and column list saved to ../outputs/models/
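
The saved column list matters because a consumer has to rebuild a feature frame with the same columns in the same order. The sketch below shows one way this might look, using a toy model and a temporary directory. The file names, the toy data, and the use of `reindex` are assumptions for illustration and do not describe the actual Streamlit app code.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training frame, already one-hot encoded
train = pd.DataFrame({
    "Year": [2015, 2016, 2017, 2018],
    "Property_F": [0, 1, 0, 1],
    "Property_S": [1, 0, 0, 0],
})
prices = [100.0, 150.0, 120.0, 170.0]
toy_model = LinearRegression().fit(train, prices)

# Save model and column list (temporary directory, illustrative file names)
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "toy_model.pkl")
columns_path = os.path.join(tmp_dir, "toy_columns.pkl")
joblib.dump(toy_model, model_path)
joblib.dump(train.columns.tolist(), columns_path)

# Consumer side: load both artifacts
loaded_model = joblib.load(model_path)
loaded_columns = joblib.load(columns_path)

# Encode a new raw row, then align it to the training columns;
# indicator columns missing from the new row are filled with 0
new_row = pd.DataFrame({"Property": ["F"], "Year": [2019]})
new_X = pd.get_dummies(new_row, dtype=int).reindex(
    columns=loaded_columns, fill_value=0
)

prediction = loaded_model.predict(new_X)
print(new_X.columns.tolist(), prediction)
```

Aligning with `reindex` also drops any indicator column that was not seen during training, so unseen categories fall back to the baseline level rather than raising an error.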
