# 06 Final Pipeline and Deployment Preparation

## 1. Objectives

This notebook consolidates the final predictive pipeline using the selected model. The goals are:

- Load training artefacts and model components
- Reconstruct the preprocessing and prediction pipeline
- Validate the pipeline on test data
- Prepare outputs for deployment and dashboards
- Serialize final components for production use

## Change Working Directory
- The notebooks are expected to live in a subfolder, so the working directory must be switched to the project root before any relative paths are used.
- The cell below changes to the project root only if the notebook is not already running there.
- If the project root does not exist, the cell raises a `FileNotFoundError`.
- When the directory is changed, the new working directory is printed so it can be verified.

In [8]:
# Smart Working Directory Setup
import os
project_root = '/workspaces/heritage_housing'
if os.getcwd() != project_root:
    try:
        os.chdir(project_root)
        print(f"[INFO] Changed working directory to project root: {os.getcwd()}")
    except FileNotFoundError:
        raise FileNotFoundError(f"[ERROR] Project root '{project_root}' not found!")
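
The hard-coded `project_root` above only works in one environment. As a hedged alternative, the root could be located by walking up from the current folder until a marker folder is found. This is a minimal sketch: the helper name `find_project_root` and the choice of `data` as the marker are assumptions for illustration, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Walk upwards from `start` until a folder containing `marker` is found.

    `marker` is an assumed folder name that exists only in the project root.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")
```

In a notebook this could be used as `os.chdir(find_project_root(Path.cwd()))`, which keeps the cell independent of where the repository is checked out.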

### Requirements (Import Libraries + Verify + Load Artifacts)

In [16]:
# Import Libraries

import pandas as pd
import numpy as np
import os
import json
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer  # required by the preprocessing pipelines
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

import datetime  # For any timestamping/logging

# Verify Dependencies

required_dependencies = {
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2"
}

installed_dependencies = {}
for lib, expected_version in required_dependencies.items():
    try:
        lib_version = __import__(lib).__version__
        installed_dependencies[lib] = lib_version
        if lib_version != expected_version:
            print(f"{lib} version mismatch: Expected {expected_version}, found {lib_version}")
        else:
            print(f"{lib} is correctly installed (version {lib_version})")
    except ImportError:
        print(f"{lib} is not installed!")

print("\nInstalled Dependencies:")
print(json.dumps(installed_dependencies, indent=4))

# Define artifact paths
artifacts_paths = {
    "best_rf_model": "outputs/models/best_random_forest.pkl",
    "best_dt_model": "outputs/models/best_decision_tree.pkl",
    "best_gbr_model": "outputs/models/best_gradient_boosting.pkl",
    "best_ridge_model": "outputs/models/best_ridge.pkl",
    "best_svr_model": "outputs/models/best_svr.pkl",
    "evaluation_metrics": "outputs/metrics/consolidated_model_performance.csv",
    "cv_results": "outputs/metrics/cross_validation_results.csv",
    "test_set_results": "outputs/metrics/test_set_results.csv",
    "feature_importance_rf": "outputs/ft_importance/random_forest_importance.csv",
    "feature_importance_xgb": "outputs/ft_importance/xgboost_importance.csv",
    "shap_rf": "outputs/shap_values/shap_summary_random_forest.csv",
    "shap_xgb": "outputs/shap_values/shap_summary_xgboost.csv",
    "X_train": "data/processed/final/X_train.csv",
    "X_test": "data/processed/final/X_test.csv",
    "y_train": "data/processed/final/y_train.csv",
    "y_test": "data/processed/final/y_test.csv",
}

# Load models
models = {}
for model_key in ["best_rf_model", "best_dt_model", "best_gbr_model", "best_ridge_model", "best_svr_model"]:
    try:
        models[model_key] = joblib.load(artifacts_paths[model_key])
        print(f"{model_key} loaded.")
    except FileNotFoundError as e:
        print(f"Error loading {model_key}: {e}")

# Load evaluation metrics
evaluation_metrics = pd.read_csv(artifacts_paths["evaluation_metrics"])
cv_results = pd.read_csv(artifacts_paths["cv_results"])
test_set_results = pd.read_csv(artifacts_paths["test_set_results"])

# Load feature importance
feature_importance_rf = pd.read_csv(artifacts_paths["feature_importance_rf"])
feature_importance_xgb = pd.read_csv(artifacts_paths["feature_importance_xgb"])

# Load SHAP values
shap_rf = pd.read_csv(artifacts_paths["shap_rf"])
shap_xgb = pd.read_csv(artifacts_paths["shap_xgb"])

# Load train/test data
X_train = pd.read_csv(artifacts_paths["X_train"])
X_test = pd.read_csv(artifacts_paths["X_test"])
y_train = pd.read_csv(artifacts_paths["y_train"]).values.ravel()
y_test = pd.read_csv(artifacts_paths["y_test"]).values.ravel()

print("\nLoaded all final outputs and datasets successfully.")

# Preview key artefacts
print("\nEvaluation Metrics:")
display(evaluation_metrics.head())

print("\nCross Validation Results:")
display(cv_results.head())

print("\nTest Set Results:")
display(test_set_results.head())

print("\nFeature Importance (Random Forest):")
display(feature_importance_rf.head())

print("\nFeature Importance (XGBoost):")
display(feature_importance_xgb.head())

print("\nSHAP Summary (Random Forest):")
display(shap_rf.head())

print("\nSHAP Summary (XGBoost):")
display(shap_xgb.head())

pandas version mismatch: Expected 1.4.2, found 2.1.1
numpy version mismatch: Expected 1.24.4, found 1.26.1
matplotlib version mismatch: Expected 3.4.3, found 3.8.0
seaborn version mismatch: Expected 0.11.2, found 0.13.2
joblib is correctly installed (version 1.4.2)

Installed Dependencies:
{
    "pandas": "2.1.1",
    "numpy": "1.26.1",
    "matplotlib": "3.8.0",
    "seaborn": "0.13.2",
    "joblib": "1.4.2"
}
best_rf_model loaded.
best_dt_model loaded.
best_gbr_model loaded.
best_ridge_model loaded.
best_svr_model loaded.

Loaded all final outputs and datasets successfully.

Evaluation Metrics:


Unnamed: 0,Model,R2,MAE,RMSE
0,Random Forest,0.8766,0.1018,0.1517
1,Decision Tree,0.8058,0.1373,0.1902
2,Gradient Boosting,0.8774,0.0998,0.1513
3,Ridge Regression,0.8686,0.1081,0.1565
4,Support Vector Regressor,0.7908,0.1189,0.1975



Cross Validation Results:


Unnamed: 0,Model,CV R²,CV MAE,CV RMSE
0,Gradient Boosting,0.8756,0.0951,0.1376
1,Random Forest,0.8581,0.1001,0.1473
2,Ridge Regression,0.8176,0.1038,0.1702
3,SVR,0.7974,0.1173,0.1759
4,Decision Tree,0.7156,0.147,0.2071



Test Set Results:


Unnamed: 0,Model,Test R²,Test MAE,Test RMSE
0,Gradient Boosting,0.8774,0.0998,0.1513
1,Random Forest,0.8766,0.1018,0.1517
2,Ridge Regression,0.8686,0.1081,0.1565
3,SVR,0.7908,0.1189,0.1975
4,Decision Tree,0.8058,0.1373,0.1903



Feature Importance (Random Forest):


Unnamed: 0,Feature,Importance
0,overallqual,0.565988
1,grlivarea,0.129338
2,totalbsmtsf,0.05439
3,garagearea,0.052452
4,bsmtfinsf1,0.029615



Feature Importance (XGBoost):


Unnamed: 0,Feature,Importance
0,overallqual,0.52583
1,grlivarea,0.14483
2,garagearea,0.058651
3,totalbsmtsf,0.053226
4,yearbuilt,0.038472



SHAP Summary (Random Forest):


Unnamed: 0,Feature,Mean SHAP Value
0,overallqual,0.200133
1,grlivarea,0.084338
2,totalbsmtsf,0.032506
3,garagearea,0.027361
4,bsmtfinsf1,0.024495



SHAP Summary (XGBoost):


Unnamed: 0,Feature,Mean SHAP Value
0,overallqual,0.153576
1,grlivarea,0.096429
2,totalbsmtsf,0.035504
3,yearbuilt,0.032242
4,bsmtfinsf1,0.030743
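
The loading cell above catches `FileNotFoundError` only for the model files; a missing CSV stops the cell at the first failed `read_csv`. A pre-flight check that lists every missing artefact at once may be easier to debug. The sketch below is illustrative only: `find_missing_artifacts` is a hypothetical helper, and the two example paths are copied from `artifacts_paths`.

```python
from pathlib import Path


def find_missing_artifacts(paths: dict, root: str = ".") -> dict:
    """Return the entries of `paths` whose files do not exist under `root`."""
    base = Path(root)
    return {key: rel for key, rel in paths.items() if not (base / rel).is_file()}


# Example with two of the paths used in this notebook.
example_paths = {
    "X_test": "data/processed/final/X_test.csv",
    "y_test": "data/processed/final/y_test.csv",
}
for key, rel in find_missing_artifacts(example_paths).items():
    print(f"[WARN] missing artefact {key}: {rel}")
```

Passing the full `artifacts_paths` dictionary before any loading would report all missing files in one pass instead of one error per run.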


## Pipeline Design

### Preprocessing Pipeline

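- Uses `ColumnTransformer` with separate numerical and categorical pipelines.
- Columns are selected by name prefix (`num__`, `cat__`), for example `num__OverallQual`. This assumes the processed data kept those prefixes; if the column names are unprefixed, both column lists below will be empty.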

In [17]:
# Define columns (optional if already processed)
numerical_cols = [col for col in X_train.columns if col.startswith("num__")]
categorical_cols = [col for col in X_train.columns if col.startswith("cat__")]

# Pipelines
numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# Full preprocessing pipeline
preprocessor = ColumnTransformer([
    ("num", numerical_pipeline, numerical_cols),
    ("cat", categorical_pipeline, categorical_cols)
])


**Model Integration (Final Pipeline Creation)**

Combine the preprocessor and the selected model into a single pipeline.

In [18]:
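# Model Integration (build pipeline for deployment)

# Final pipeline with preprocessing and model.
# Note: the loaded model is already fitted, but the preprocessor is not,
# so final_pipeline must be fitted (which also refits the model) before
# it can be used for prediction.
final_pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", models["best_rf_model"])
])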
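
A `Pipeline` can only predict once each of its steps has been fitted. As a minimal sketch on toy data (hypothetical column names and values, with a freshly trained `RandomForestRegressor` standing in for the loaded model), fitting and predicting with a preprocessor-plus-model pipeline might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical columns; not the project's dataset.
toy_X = pd.DataFrame({
    "num__area": [50.0, 80.0, np.nan, 120.0, 65.0, 95.0],
    "cat__zone": ["a", "b", "a", "b", "b", "a"],
})
toy_y = [11.0, 11.6, 11.4, 12.1, 11.3, 11.8]

toy_preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]), ["num__area"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["cat__zone"]),
])

toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])

# Fitting the pipeline fits the preprocessor and the model together.
toy_pipeline.fit(toy_X, toy_y)

# Raw input goes in; category "c" was never seen and is encoded as all zeros.
new_row = pd.DataFrame({"num__area": [70.0], "cat__zone": ["c"]})
toy_pred = toy_pipeline.predict(new_row)
print(toy_pred)
```

The design point is that the preprocessing learned during `fit` is replayed on new rows at `predict` time, so the deployed object needs raw (unscaled, unencoded) inputs rather than the already processed `X_test`.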


**Prediction Pipeline**

Run predictions using either the trained model directly (when the data is already preprocessed, as in the cell below) or the wrapped pipeline.

In [20]:
# Prediction Pipeline

# If data is already preprocessed:
y_pred = models["best_rf_model"].predict(X_test)

# Display-Ready Dataset for Dashboard

# Create comparison DataFrame
results_df = pd.DataFrame({
    "Actual_LogSalePrice": y_test,
    "Predicted_LogSalePrice": y_pred
})

# Back-transform from log scale to prices (expm1 assumes the target was log1p-transformed)
results_df["Actual_Price"] = np.expm1(results_df["Actual_LogSalePrice"])
results_df["Predicted_Price"] = np.expm1(results_df["Predicted_LogSalePrice"])

# Optionally include key features
key_features = ["num__OverallQual", "num__GrLivArea", "num__GarageArea"]
for feat in key_features:
    if feat in X_test.columns:
        results_df[feat] = X_test[feat]

# Display
display(results_df.head())


Unnamed: 0,Actual_LogSalePrice,Predicted_LogSalePrice,Actual_Price,Predicted_Price
0,11.947956,11.851151,154500.0,140244.641193
1,12.691584,12.734056,325000.0,339100.996458
2,11.652696,11.610133,115000.0,110207.890421
3,11.976666,11.929078,159000.0,151610.756621
4,12.661917,12.626878,315500.0,304636.510952
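
To validate the predictions rather than only inspect them, R², MAE and RMSE can be recomputed on both the log scale and the price scale. The sketch below uses plain NumPy and a hypothetical helper, `regression_report`; the five rows displayed above serve purely as sample input, so the printed numbers are not the full test-set metrics.

```python
import numpy as np


def regression_report(y_true, y_pred) -> dict:
    """Return R2, MAE and RMSE for two equal-length sequences.

    Assumes y_true is not constant (otherwise R2 is undefined).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": float(np.mean(np.abs(residuals))),
        "RMSE": float(np.sqrt(np.mean(residuals ** 2))),
    }


# The five log-scale rows shown in the preview, used only as sample input.
actual_log = [11.947956, 12.691584, 11.652696, 11.976666, 12.661917]
predicted_log = [11.851151, 12.734056, 11.610133, 11.929078, 12.626878]

print("log scale:  ", regression_report(actual_log, predicted_log))
print("price scale:", regression_report(np.expm1(actual_log), np.expm1(predicted_log)))
```

In the notebook itself, `regression_report(y_test, y_pred)` would give the log-scale figures to compare against the stored test-set results; the price-scale errors are usually the ones a dashboard audience can interpret.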


**Save Display-Ready Dataset for Dashboard**

In [22]:
# Save Display-Ready Dataset for Dashboard

# Create predictions folder if it doesn't exist
predictions_path = "outputs/predictions"
os.makedirs(predictions_path, exist_ok=True)

# Save the results
results_df.to_csv(f"{predictions_path}/rf_predictions_dashboard.csv", index=False)

print("Dashboard-ready prediction file saved to:", f"{predictions_path}/rf_predictions_dashboard.csv")


Dashboard-ready prediction file saved to: outputs/predictions/rf_predictions_dashboard.csv
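
The objectives also mention serializing the final components for production use. One possible approach, sketched here under assumptions, is to dump the object with `joblib` and write a small JSON manifest beside it recording when and with which `joblib` version it was saved. The helper name `save_with_manifest` and the manifest fields are hypothetical.

```python
import datetime
import json
import os

import joblib


def save_with_manifest(obj, model_path, extra=None):
    """Dump `obj` with joblib and write a JSON manifest next to it.

    Returns the path of the manifest file.
    """
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    joblib.dump(obj, model_path)

    manifest = {
        "artifact": os.path.basename(model_path),
        "saved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "joblib_version": joblib.__version__,
    }
    manifest.update(extra or {})  # caller-supplied notes, e.g. the model name

    manifest_path = os.path.splitext(model_path)[0] + "_manifest.json"
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh, indent=4)
    return manifest_path
```

A call such as `save_with_manifest(models["best_rf_model"], "outputs/models/final_model.pkl", {"model": "Random Forest"})` would then store the model together with its metadata; the target file name here is only an example. Recording library versions matters because pickled estimators are not guaranteed to load across different library versions.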
