# **Notebook 6: Final Pipeline and Deployment Preparation**

## Objectives

The objective of this notebook is to consolidate the final machine learning pipeline and prepare the trained model(s) for deployment. This includes serializing the pipeline, saving outputs, and creating essential documentation for integration into the deployment environment.

## Inputs

* **Trained Models and Hyperparameters**
  * Best-performing models from the model training and evaluation notebook, including their hyperparameters.
* **Processed Dataset**
  * Cleaned and feature-engineered datasets ready for input into the pipeline.
* **Evaluation Metrics and Feature Importances**
  * Outputs from the model training notebook to inform pipeline structure and deployment requirements.

## Outputs

* **Serialized Final Pipeline**
  * The complete pipeline, including preprocessing and the best-performing model, saved for deployment.
* **Deployment-Ready Artifacts**
  * Files required for model integration, such as serialized objects and configuration files.
* **Documentation**
  * Summary of the pipeline, deployment steps, and integration instructions.

## Additional Comments

* This notebook serves as the final step before integrating the pipeline into the deployment environment.
* The pipeline will include preprocessing steps, feature selection, and the chosen model for prediction.
* Key deployment considerations, such as scalability and maintainability, will be addressed.


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from the current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5'
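
Re-running the `os.chdir()` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path shown above). `ensure_project_root` is a hypothetical helper, not part of the original notebook:

```python
import os
import tempfile
from pathlib import Path


def ensure_project_root(notebook_dir_name="jupyter_notebooks"):
    """Move to the parent folder only if we are still inside the notebooks folder."""
    cwd = Path.cwd()
    if cwd.name == notebook_dir_name:
        os.chdir(cwd.parent)
    return Path.cwd()


# Demonstration in a throwaway directory so the real project is untouched
original_cwd = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    notebooks = Path(tmp) / "jupyter_notebooks"
    notebooks.mkdir()
    os.chdir(notebooks)
    first = ensure_project_root()   # moves up one level
    second = ensure_project_root()  # already at the root, so nothing changes
    expected = Path(tmp).resolve()
    os.chdir(original_cwd)          # restore before the temp folder is removed
```

Because the function checks the folder name first, calling it repeatedly is safe.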

---

## Requirements

### Import Libraries

To begin, we will import all the necessary libraries required for data processing, model loading, evaluation, and output generation.

In [4]:
import pandas as pd
import numpy as np 
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
import datetime

### Verify Dependencies

To ensure a smooth workflow, we will verify that all required dependencies are installed and compatible with the current environment. The cell below checks the installed packages and their versions.

In [5]:
# List of required libraries and their versions
required_dependencies = {
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2"
}

# Check installed dependencies
installed_dependencies = {}
for lib, version in required_dependencies.items():
    try:
        lib_version = __import__(lib).__version__
        installed_dependencies[lib] = lib_version
        if lib_version != version:
            print(f"{lib} version mismatch: Expected {version}, found {lib_version}")
        else:
            print(f"{lib} is correctly installed (version {version})")
    except ImportError:
        print(f"{lib} is not installed!")

# Display summary of dependencies
print("\nInstalled Dependencies")
print(json.dumps(installed_dependencies, indent=4))

pandas is correctly installed (version 1.4.2)
numpy is correctly installed (version 1.24.4)
matplotlib is correctly installed (version 3.4.3)
seaborn is correctly installed (version 0.11.2)
joblib is correctly installed (version 1.4.2)

Installed Dependencies
{
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2"
}
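
The check above relies on each package exposing a `__version__` attribute under its import name. As an alternative sketch, the standard-library `importlib.metadata` module can look up the installed version by distribution name without importing the package. `check_versions` and `fake_get_version` are hypothetical names used only for this illustration, and the fake lookup exists so the example does not depend on what is installed locally:

```python
from importlib import metadata


def check_versions(required, get_version=metadata.version):
    """Return a status string per package: 'ok', 'missing', or a mismatch note."""
    report = {}
    for name, expected in required.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if installed == expected else f"mismatch (found {installed})"
    return report


def fake_get_version(name):
    """Stand-in for metadata.version with made-up installed versions."""
    versions = {"pandas": "1.4.2", "numpy": "1.26.0"}
    if name not in versions:
        raise metadata.PackageNotFoundError(name)
    return versions[name]


demo_report = check_versions(
    {"pandas": "1.4.2", "numpy": "1.24.4", "seaborn": "0.11.2"},
    get_version=fake_get_version,
)
print(demo_report)
```

In the notebook itself, `check_versions(required_dependencies)` would use the real lookup.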


### Load Saved Artifacts

In this step, we load the serialized models, feature importance data, evaluation metrics, and other outputs generated in the earlier notebooks. These artifacts are essential for building the final pipeline and preparing for deployment.

**Artifacts to Load:**
1. Trained Models:
   - Random Forest
   - XGBoost
2. Evaluation Metrics:
   - Performance metrics for each model.
3. Feature Importance Data:
   - Insights into features contributing to model predictions.
4. Testing Data:
   - Processed test dataset with features and target.

These artifacts will ensure continuity between the modeling and deployment stages.
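
Before loading, it can help to confirm in one pass that every expected file exists, so a missing artifact is reported up front rather than surfacing later as an undefined variable. A minimal sketch, where `find_missing_artifacts` is a hypothetical helper and the demonstration uses a temporary folder rather than the real `outputs/` directory:

```python
import tempfile
from pathlib import Path


def find_missing_artifacts(paths, root="."):
    """Return the keys whose files do not exist under the given root folder."""
    root = Path(root)
    return [name for name, rel_path in paths.items() if not (root / rel_path).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    models_dir = Path(tmp) / "outputs" / "models"
    models_dir.mkdir(parents=True)
    (models_dir / "random_forest_model.pkl").touch()  # only one artifact is present

    demo_paths = {
        "random_forest_model": "outputs/models/random_forest_model.pkl",
        "xgboost_model": "outputs/models/xgboost_model.pkl",
    }
    missing = find_missing_artifacts(demo_paths, root=tmp)

print(missing)
```

If the returned list is not empty, the notebook could stop with a clear message instead of continuing.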

In [6]:
# Define paths for saved artifacts
artifacts_paths = {
    "random_forest_model": "outputs/models/random_forest_model.pkl",
    "xgboost_model": "outputs/models/xgboost_model.pkl",
    "evaluation_metrics": "outputs/metrics/evaluation_metrics.csv",
    "feature_importance_rf": "outputs/feature_importance/random_forest_feature_importance.csv",
    "feature_importance_xgb": "outputs/feature_importance/xgboost_feature_importance.csv",
    "test_features": "outputs/datasets/processed/final/x_test_final.csv",
    "test_target": "outputs/datasets/processed/final/y_test_final.csv",
}

# Load Models
try:
    rf_model = joblib.load(artifacts_paths["random_forest_model"])
    xgb_model = joblib.load(artifacts_paths["xgboost_model"])
    print("Models loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading models: {e}")

# Load evaluation metrics
try:
    evaluation_metrics = pd.read_csv(artifacts_paths["evaluation_metrics"])
    print("Evaluation metrics loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading evaluation metrics: {e}")

# Load feature importance data
try:
    feature_importance_rf = pd.read_csv(artifacts_paths["feature_importance_rf"])
    feature_importance_xgb = pd.read_csv(artifacts_paths["feature_importance_xgb"])
    print("Feature Importance data loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading feature importance data: {e}")

# Load test features and target
try:
    test_features = pd.read_csv(artifacts_paths["test_features"])
    test_target = pd.read_csv(artifacts_paths["test_target"])
    print("Test features and target loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading test data: {e}")

# Display loaded artifacts 
print("Loaded Artifacts:")
print("\nEvaluation Metrics:")
print(evaluation_metrics.head())
print("\nFeature Importance (Random Forest):")
print(feature_importance_rf.head())
print("\nFeature Importance (XGBoost):")
print(feature_importance_xgb.head())
print("\nTest Features (First 5 Rows):")
print(test_features.head())
print("\nTest Target (First 5 Rows):")
print(test_target.head())

Models loaded successfully.
Evaluation metrics loaded successfully.
Feature Importance data loaded successfully.
Test features and target loaded successfully.
Loaded Artifacts:

Evaluation Metrics:
               Model  R2 Score       MAE       MSE
0      Random Forest  0.813480  0.122825  0.034807
1      Decision Tree  0.747387  0.154373  0.047141
2                KNN  0.760565  0.143803  0.044682
3  Gradient Boosting  0.771934  0.135791  0.042560
4            XGBoost  0.816074  0.122992  0.034323

Feature Importance (Random Forest):
             Feature  Importance
0   num__OverallQual    0.546662
1     num__GrLivArea    0.129170
2    num__GarageArea    0.052238
3      num__1stFlrSF    0.046122
4  num__OverallScore    0.041399

Feature Importance (XGBoost):
             Feature  Importance
0   num__OverallQual    0.783737
1  num__OverallScore    0.060585
2     num__GrLivArea    0.045171
3     num__YearBuilt    0.016420
4      num__1stFlrSF    0.015661

Test Features (First 5 Rows):
 

---

## Pipeline Design

### Preprocessing Pipeline

**Objectives**

The preprocessing pipeline ensures that the test data is appropriately prepared before being passed to the models for predictions. This involves verifying scaling, transformations, and alignment with the training dataset.

In [7]:
# Display a summary of test features
print("Test Features (First 5 Rows):")
print(test_features.head())

print("\nTest Target (First 5 Rows):")
print(test_target.head())

Test Features (First 5 Rows):
   num__LotFrontage  num__LotArea  num__OpenPorchSF  num__MasVnrArea  \
0          0.144140     -0.158460         -1.096169        -0.827815   
1          1.204764      0.612540          0.517257         1.413568   
2         -0.556568     -0.029579         -1.096169        -0.827815   
3         -0.911425     -1.225280          0.389147        -0.827815   
4          0.900684      0.717202         -1.096169         0.793095   

   num__BsmtFinSF1  num__GrLivArea  num__1stFlrSF  num__YearBuilt  \
0         0.755219       -0.922794      -0.126358        0.227176   
1         0.902910        1.808434       0.944129       -0.783836   
2        -1.416429       -1.038836      -0.246639        1.401254   
3         0.585846        0.425488      -0.321073        0.748988   
4         0.899659        0.343995       1.186707       -1.207808   

   num__YearRemodAdd  num__BedroomAbvGr  ...  num__BsmtUnfSF  num__GarageArea  \
0          -0.873470          -2.157869  

**Verification**

The following steps were conducted to validate the preprocessing:

1. **Feature Scaling:** The test features were checked, and it was confirmed that scaling had already been applied during earlier preprocessing steps. The values exhibit a standardized format (mean near 0 and consistent range).
2. **Target Transformation:** The target variable (`LogSalePrice`) was confirmed to retain its log-transformed format as expected.
3. **Data Readiness:** Both features and target data are aligned with the requirements of the trained models.

**Conclusion**

No additional preprocessing steps are required. The test data is ready to be used directly in the pipeline for model integration and predictions.
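
The scaling check described above can also be done numerically rather than by eye. A minimal sketch, assuming numeric columns carry the `num__` prefix as in the test features. `summarize_scaling` is a hypothetical helper and the data below is synthetic:

```python
import numpy as np
import pandas as pd


def summarize_scaling(features, prefix="num__"):
    """Report the mean and population standard deviation of each numeric column."""
    numeric_cols = [col for col in features.columns if col.startswith(prefix)]
    return pd.DataFrame({
        "mean": features[numeric_cols].mean(),
        "std": features[numeric_cols].std(ddof=0),
    })


# Synthetic example: one standardized numeric column and one binary flag
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=200)
demo = pd.DataFrame({
    "num__a": (raw - raw.mean()) / raw.std(),
    "cat__b": [0, 1] * 100,
})
summary = summarize_scaling(demo)
print(summary)
```

On a real test set scaled with statistics from the training data, the means and standard deviations should be close to 0 and 1 but not exactly equal to them.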

### Model Integration

This section demonstrates the integration of trained models to generate predictions for the testing dataset. The primary objective is to validate the models by making predictions on unseen data and prepare the results for further evaluation.

**Process:**
1. **Integration Function:** A custom function `integrate_model` will be implemented to streamline the process of applying models to the testing dataset. This function:
   - Accepts a trained model, features, and a model name.
   - Generates predictions using the provided model.
   - Returns a DataFrame containing the predictions alongside the model name for traceability.
2. **Predictions for Each Model:**
   - Predictions will be generated using the two selected models:
     - Random Forest
     - XGBoost
3. **Combined Predictions:**
   - The predictions from both models will be consolidated into a single DataFrame for comparative analysis.
4. **Validation Against Actual Target Values:**
   - The Mean Absolute Error (MAE) metric will be calculated for each model by comparing their predictions against the actual test target values (`LogSalePrice`).

In [8]:
# Define a function for model integration
def integrate_model(model, features, model_name):
    """
    Integrates a model to make predictions on given features.

    Parameters:
        model: Trained model to use for predictions.
        features: DataFrame of features to predict on.
        model_name: Name of the model for logging and clarity.
    
    Returns:
        DataFrame with predictions and corresponding model name.
    """
    predictions = model.predict(features)
    results = pd.DataFrame({
        "Model": [model_name] * len(predictions),
        "Predicted LogSalePrice": predictions
    })
    return results

# Integrate Random Forest Model
rf_predictions = integrate_model(rf_model, test_features, "Random Forest")
print("Random Forest Model Predictions:")
print(rf_predictions.head())

# Integrate XGBoost Model
xgb_predictions = integrate_model(xgb_model, test_features, "XGBoost")
print("XGBoost Model Predictions:")
print(xgb_predictions.head())

# Combine predictions into a single DataFrame for comparison
combined_predictions = pd.concat([rf_predictions, xgb_predictions], axis=0)
print("\nCombined Model Predictions:")
print(combined_predictions.head())

Random Forest Model Predictions:
           Model  Predicted LogSalePrice
0  Random Forest               11.865206
1  Random Forest               12.713614
2  Random Forest               11.585296
3  Random Forest               11.945785
4  Random Forest               12.637299
XGBoost Model Predictions:
     Model  Predicted LogSalePrice
0  XGBoost               11.885277
1  XGBoost               12.795120
2  XGBoost               11.634674
3  XGBoost               11.983330
4  XGBoost               12.727750

Combined Model Predictions:
           Model  Predicted LogSalePrice
0  Random Forest               11.865206
1  Random Forest               12.713614
2  Random Forest               11.585296
3  Random Forest               11.945785
4  Random Forest               12.637299
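
Because the combined DataFrame stacks the Random Forest rows above the XGBoost rows, `head()` shows only Random Forest predictions. For a row-by-row comparison, a side-by-side layout may be easier to read. A small sketch using the first five predictions printed above:

```python
import pandas as pd

# First five predictions of each model, copied from the output above
rf_head = [11.865206, 12.713614, 11.585296, 11.945785, 12.637299]
xgb_head = [11.885277, 12.795120, 11.634674, 11.983330, 12.727750]

# One column per model, one row per property
side_by_side = pd.DataFrame({"Random Forest": rf_head, "XGBoost": xgb_head})
side_by_side["Abs Difference"] = (
    side_by_side["Random Forest"] - side_by_side["XGBoost"]
).abs()
print(side_by_side)
```

With the full prediction arrays, the same layout would be built from `rf_predictions["Predicted LogSalePrice"]` and `xgb_predictions["Predicted LogSalePrice"]`.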


In [9]:
# Validate Predictions Against Actuals
rf_mae = mean_absolute_error(test_target, rf_predictions["Predicted LogSalePrice"])
xgb_mae = mean_absolute_error(test_target, xgb_predictions["Predicted LogSalePrice"])

print(f"Random Forest MAE: {rf_mae}")
print(f"XGBoost MAE: {xgb_mae}")

Random Forest MAE: 0.10429243792856001
XGBoost MAE: 0.10712006451855471


**Observations**

- The **Random Forest** model achieved a slightly lower MAE (0.1043) than **XGBoost** (0.1071), a difference of about 0.003 in log units.
- Both models provided consistent predictions, in line with their performance during earlier model testing and tuning.
- The results confirm that both models are effective on unseen data; the gap between them is small.
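
These MAE values are in log units, so they are not directly readable as prices. As a rough reading, an MAE of about 0.104 in log units corresponds to a typical multiplicative error of around exp(0.104) ≈ 1.11, or roughly 11%. The sketch below reports the error on both scales, assuming `LogSalePrice` is the natural logarithm of the sale price (an assumption, since the transformation is defined in an earlier notebook). `mae_log_and_price` is a hypothetical helper and the values are illustrative only:

```python
import numpy as np


def mae_log_and_price(y_true_log, y_pred_log):
    """Return MAE on the log scale and on the back-transformed price scale."""
    # ravel() flattens a single-column DataFrame or 2-D array to 1-D
    y_true_log = np.asarray(y_true_log, dtype=float).ravel()
    y_pred_log = np.asarray(y_pred_log, dtype=float).ravel()
    mae_log = float(np.mean(np.abs(y_true_log - y_pred_log)))
    mae_price = float(np.mean(np.abs(np.exp(y_true_log) - np.exp(y_pred_log))))
    return mae_log, mae_price


# Illustrative values, not taken from the notebook's test set
y_true = np.log([100_000.0, 200_000.0])
y_pred = np.log([110_000.0, 180_000.0])
mae_log, mae_price = mae_log_and_price(y_true, y_pred)
print(f"MAE (log units): {mae_log:.4f}")
print(f"MAE (price units): {mae_price:,.0f}")
```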

### Prediction Pipeline

This section will focus on loading, preprocessing, and generating predictions for the inherited properties dataset.

**Step 1: Load and Preview the Dataset**

We start by loading the inherited properties dataset. This dataset contains raw data, and we will need to preprocess it to align with the format of the training and testing datasets used in model training.

In [10]:
# Define the path to the Inherited properties dataset
inherited_properties_path = "outputs/datasets/raw/inherited_houses.csv"

# Load the dataset
try:
    inherited_properties = pd.read_csv(inherited_properties_path)
    print("Inherited Properties Dataset Loaded Successfully.")
    print(inherited_properties.head())
except FileNotFoundError as e:
    print(f"Error loading dataset: {e}")

# Preview Columns and data types
print("\nInherited Properties Dataset Info:")
print(inherited_properties.info())
print("\nInherited Properties Dataset Preview:")
print(inherited_properties.head())

Inherited Properties Dataset Loaded Successfully.
   1stFlrSF  2ndFlrSF  BedroomAbvGr BsmtExposure  BsmtFinSF1 BsmtFinType1  \
0       896         0             2           No       468.0          Rec   
1      1329         0             3           No       923.0          ALQ   
2       928       701             3           No       791.0          GLQ   
3       926       678             3           No       602.0          GLQ   

   BsmtUnfSF  EnclosedPorch  GarageArea GarageFinish  ...  LotArea  \
0      270.0              0       730.0          Unf  ...    11622   
1      406.0              0       312.0          Unf  ...    14267   
2      137.0              0       482.0          Fin  ...    13830   
3      324.0              0       470.0          Fin  ...     9978   

   LotFrontage MasVnrArea  OpenPorchSF  OverallCond  OverallQual  TotalBsmtSF  \
0         80.0        0.0            0            6            5        882.0   
1         81.0      108.0           36            6

The inherited properties dataset contains raw features and requires preprocessing to match the format of the datasets used for training. This includes feature scaling, feature engineering, and encoding to ensure compatibility with the trained models.

In [13]:
def preprocess_inherited_properties(raw_data):
    """
    Preprocess the inherited properties dataset to align with the training data format.

    Parameters:
        raw_data: Raw DataFrame of inherited properties.

    Returns:
        Processed DataFrame with the same structure as the training features.
    """
    current_year = datetime.datetime.now().year 

    processed_data = raw_data.copy()

    # Drop columns with high missing percentages
    processed_data.drop(columns=["EnclosedPorch", "WoodDeckSF"], inplace=True)

    # One-hot encoding for categorical variables
    processed_data = pd.get_dummies(processed_data, columns=["GarageFinish", "BsmtFinType1"], drop_first=True)

    # Create new features
    processed_data["num__Age"] = current_year - processed_data["YearBuilt"]
    processed_data["num__LivingLotRatio"] = processed_data["LotArea"] / processed_data["GrLivArea"].replace(0, 1)
    processed_data["num__FinishedBsmtRatio"] = processed_data["BsmtFinSF1"] / processed_data["TotalBsmtSF"].replace(0, 1)
    processed_data["num__OverallScore"] = processed_data["OverallQual"] + processed_data["OverallCond"]
    processed_data["cat__HasPorch"] = (processed_data["OpenPorchSF"].astype(float) > 0).astype(float)

    # Drop extra features
    extra_features = {'BsmtFinType1_GLQ', 'KitchenQual', 'GarageFinish_Unf', 'BsmtFinType1_Rec', 'BsmtExposure', 'TotalBsmtSF'}
    processed_data.drop(columns=extra_features.intersection(processed_data.columns), inplace=True)

    # Rename columns to match the test features
    processed_data.rename(
        columns={
            "1stFlrSF": "num__1stFlrSF",
            "2ndFlrSF": "num__2ndFlrSF",
            "BedroomAbvGr": "num__BedroomAbvGr",
            "BsmtFinSF1": "num__BsmtFinSF1",
            "BsmtUnfSF": "num__BsmtUnfSF",
            "GarageArea": "num__GarageArea",
            "GarageYrBlt": "num__GarageYrBlt",
            "GrLivArea": "num__GrLivArea",
            "LotArea": "num__LotArea",
            "LotFrontage": "num__LotFrontage",
            "MasVnrArea": "num__MasVnrArea",
            "OpenPorchSF": "num__OpenPorchSF",
            "OverallCond": "num__OverallCond",
            "OverallQual": "num__OverallQual",
            "YearBuilt": "num__YearBuilt",
            "YearRemodAdd": "num__YearRemodAdd",
        },
        inplace=True,
    )

    # Debugging Outputs
    print("Processed Data Columns:")
    print(processed_data.columns)

    print("\nTest Features Columns:")
    print(test_features.columns)

    # Align features with training data
    missing_features = set(test_features.columns) - set(processed_data.columns)
    for col in missing_features:
        processed_data[col] = 0
    
    extra_features = set(processed_data.columns) - set(test_features.columns)
    processed_data.drop(columns=extra_features, inplace=True)

    # Reorder columns to match training data
    processed_data = processed_data[test_features.columns]
    assert list(processed_data.columns) == list(test_features.columns), "Column alignment mismatch!"

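    # Select the numeric columns (prefixed with "num__") for scaling
    numeric_features = [col for col in processed_data.columns if col.startswith("num__")]

    # Standardize numerical columns.
    # NOTE: fit_transform() fits a new scaler on the inherited rows themselves,
    # so it uses the mean/std of these rows rather than those learned from the training data.
    scaler = StandardScaler()
    processed_data[numeric_features] = scaler.fit_transform(processed_data[numeric_features])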

    return processed_data

# Apply the preprocessing function
try:
    inherited_properties_processed = preprocess_inherited_properties(inherited_properties)
    print("\nInherited Properties Dataset Processed Successfully.")
    print(inherited_properties_processed.head())
except Exception as e:
    print(f"Error processing inherited properties dataset: {e}")


Processed Data Columns:
Index(['num__1stFlrSF', 'num__2ndFlrSF', 'num__BedroomAbvGr',
       'num__BsmtFinSF1', 'num__BsmtUnfSF', 'num__GarageArea',
       'num__GarageYrBlt', 'num__GrLivArea', 'num__LotArea',
       'num__LotFrontage', 'num__MasVnrArea', 'num__OpenPorchSF',
       'num__OverallCond', 'num__OverallQual', 'num__YearBuilt',
       'num__YearRemodAdd', 'num__Age', 'num__LivingLotRatio',
       'num__FinishedBsmtRatio', 'num__OverallScore', 'cat__HasPorch'],
      dtype='object')

Test Features Columns:
Index(['num__LotFrontage', 'num__LotArea', 'num__OpenPorchSF',
       'num__MasVnrArea', 'num__BsmtFinSF1', 'num__GrLivArea', 'num__1stFlrSF',
       'num__YearBuilt', 'num__YearRemodAdd', 'num__BedroomAbvGr',
       'num__2ndFlrSF', 'num__BsmtUnfSF', 'num__GarageArea',
       'num__GarageYrBlt', 'num__OverallCond', 'num__OverallQual', 'num__Age',
       'num__LivingLotRatio', 'num__FinishedBsmtRatio', 'num__OverallScore',
       'cat__HasPorch'],
      dtype='object')

Inh
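
One point worth checking in the preprocessing function is the call to `StandardScaler().fit_transform()` on the inherited rows. Fitting a scaler on the new data centres it on its own mean, which generally differs from the scale the models were trained on. The sketch below contrasts the two approaches on made-up numbers. Whether a fitted scaler was saved by an earlier notebook is not shown here, so the training frame is a stand-in:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for training and inherited data
train = pd.DataFrame({"num__GrLivArea": [1000.0, 1500.0, 2000.0]})
new = pd.DataFrame({"num__GrLivArea": [1500.0, 2500.0]})

# Approach 1: fit on the training data, then only transform the new rows
train_scaler = StandardScaler().fit(train)
new_scaled = pd.DataFrame(
    train_scaler.transform(new), columns=new.columns, index=new.index
)

# Approach 2: refit on the new rows (what fit_transform on new data does)
refit_scaled = StandardScaler().fit_transform(new)

print(new_scaled["num__GrLivArea"].tolist())  # relative to the training mean/std
print(refit_scaled.ravel().tolist())          # always centred on the new rows
```

With the training scaler, a 1500 sq ft house maps to 0 because it equals the training mean. After refitting on the two new rows, the same house maps to -1.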

---

## Validation

### Test Pipeline

### Output Verification

---

## Serialization

### Save Pipeline
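
A minimal sketch of how a pipeline could be serialized with `joblib` and verified after reloading. The pipeline below is trained on synthetic data and the file name `final_pipeline.pkl` is a placeholder. Combining the scaler and the model in one `Pipeline` is one possible design, not necessarily the structure used for the models loaded earlier:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data standing in for the real features and target
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipeline.fit(X, y)

# Save to a temporary folder, reload, and confirm the predictions are unchanged
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "final_pipeline.pkl"  # placeholder file name
    joblib.dump(pipeline, str(path))
    restored = joblib.load(str(path))

same_predictions = bool(np.allclose(pipeline.predict(X), restored.predict(X)))
print(f"Round trip preserved predictions: {same_predictions}")
```

Saving preprocessing and model together means the deployment code needs to load only one object and can pass it unscaled features.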

### Save Other Artifacts

---

## Deployment Preparation

### Folder Structure

### Inference Example

### Environment File
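
One way to produce an environment file is to generate pinned requirement lines from the versions verified earlier in this notebook. A sketch, where `format_requirements` is a hypothetical helper. The dictionary repeats the versions from the dependency check and does not include scikit-learn or XGBoost, whose versions are not listed in this notebook:

```python
def format_requirements(dependencies):
    """Format a {package: version} mapping as pinned requirements.txt lines."""
    lines = [f"{name}=={version}" for name, version in dependencies.items()]
    return "\n".join(lines) + "\n"


required_dependencies = {
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2",
}

requirements_text = format_requirements(required_dependencies)
print(requirements_text)

# To write the file (the file name is an assumption):
# from pathlib import Path
# Path("requirements.txt").write_text(requirements_text)
```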

---

## Documentation

### Detailed Steps

### Usage Notes

---

## Future Maintenance

### Recommendations