# **Notebook 6: Final Pipeline and Deployment Preparation**

## Objectives

The objective of this notebook is to consolidate the final machine learning pipeline and prepare the trained model(s) for deployment. This includes serializing the pipeline, saving outputs, and creating essential documentation for integration into the deployment environment.

## Inputs

* **Trained Models and Hyperparameters**
  * Best-performing models from the model training and evaluation notebook, including their hyperparameters.
* **Processed Dataset**
  * Cleaned and feature-engineered datasets ready for input into the pipeline.
* **Evaluation Metrics and Feature Importances**
  * Outputs from the model training notebook to inform pipeline structure and deployment requirements.

## Outputs

* **Serialized Final Pipeline**
  * The complete pipeline, including preprocessing and the best-performing model, saved for deployment.
* **Deployment-Ready Artifacts**
  * Files required for model integration, such as serialized objects and configuration files.
* **Documentation**
  * Summary of the pipeline, deployment steps, and integration instructions.

## Additional Comments

* This notebook serves as the final step before integrating the pipeline into the deployment environment.
* The pipeline will include preprocessing steps, feature selection, and the chosen model for prediction.
* Key deployment considerations, such as scalability and maintainability, will be addressed.


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running this notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Predictive-Analytics-PP5'

---

## Requirements

### Import Libraries

To begin, we import the libraries required for data processing, model loading, evaluation, and output generation.

In [4]:
import pandas as pd
import numpy as np 
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import joblib
import matplotlib.pyplot as plt
import seaborn as sns
import os
import json

### Verify Dependencies

To ensure a smooth workflow, we will verify that all required dependencies are installed and compatible with the current environment. The cell below checks which packages are installed and compares their versions against the expected ones.

In [7]:
# List of required libraries and their versions
required_dependencies = {
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2"
}

# Check installed dependencies
installed_dependencies = {}
for lib, version in required_dependencies.items():
    try:
        lib_version = __import__(lib).__version__
        installed_dependencies[lib] = lib_version
        if lib_version != version:
            print(f"{lib} version mismatch: Expected {version}, found {lib_version}")
        else:
            print(f"{lib} is correctly installed (version {version})")
    except ImportError:
        print(f"{lib} is not installed!")

# Display summary of dependencies
print("\nInstalled Dependencies")
print(json.dumps(installed_dependencies, indent=4))

pandas is correctly installed (version 1.4.2)
numpy is correctly installed (version 1.24.4)
matplotlib is correctly installed (version 3.4.3)
seaborn is correctly installed (version 0.11.2)
joblib is correctly installed (version 1.4.2)

Installed Dependencies
{
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2"
}
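
The check above imports each package and reads its `__version__` attribute, which not every package defines. As an alternative, here is a minimal sketch that reads installed package metadata through the standard library's `importlib.metadata` instead. The helper name `check_dependencies` and the placeholder package name are assumptions made for illustration; they are not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def check_dependencies(required):
    """Compare pinned versions against installed distribution metadata.

    Returns a dict mapping each package name to "ok", "mismatch", or "missing".
    """
    status = {}
    for name, expected in required.items():
        try:
            installed = version(name)  # reads package metadata, no import needed
        except PackageNotFoundError:
            status[name] = "missing"
            continue
        status[name] = "ok" if installed == expected else "mismatch"
    return status


# Hypothetical package name, used only to exercise the "missing" branch.
example_status = check_dependencies({"not-a-real-package-xyz": "0.0.0"})
print(example_status)
```

Passing the `required_dependencies` dictionary from the cell above would return one status per pinned package. Note that `importlib.metadata` expects distribution names, which can differ from import names (for example `scikit-learn` versus `sklearn`).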


### Load Saved Artifacts

In this step, we will load the serialized models, feature importance data, evaluation metrics, and any other outputs generated in the earlier notebooks. These artifacts are essential for building the final pipeline and preparing for deployment.

**Artifacts to Load:**
1. Trained Models:
   - Random Forest
   - XGBoost
2. Evaluation Metrics:
   - Performance metrics for each model.
3. Feature Importance Data:
   - Insights into features contributing to model predictions.
4. Testing Data:
   - Processed test dataset with features and target.

These artifacts will ensure continuity between the modeling and deployment stages.
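
Before loading anything, it can help to confirm that every expected file is actually present, so that a missing artifact is reported once and up front rather than surfacing later as an undefined variable. The following is a minimal sketch of such a check; the helper name `find_missing_artifacts` and the demo path are assumptions for illustration only.

```python
from pathlib import Path


def find_missing_artifacts(paths):
    """Return the keys in `paths` whose files do not exist on disk."""
    return [name for name, path in paths.items() if not Path(path).is_file()]


# Hypothetical path used only for illustration; in the notebook the
# dictionary of artifact paths would be passed instead.
example_paths = {"demo_model": "outputs/models/does_not_exist_demo.pkl"}
missing = find_missing_artifacts(example_paths)
if missing:
    print(f"Missing artifacts: {missing}")
```

In the loading cell, the same function could be called on `artifacts_paths`, stopping early if the returned list is not empty.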

In [6]:
# Define paths for saved artifacts
artifacts_paths = {
    "random_forest_model": "outputs/models/random_forest_model.pkl",
    "xgboost_model": "outputs/models/xgboost_model.pkl",
    "evaluation_metrics": "outputs/metrics/evaluation_metrics.csv",
    "feature_importance_rf": "outputs/feature_importance/random_forest_feature_importance.csv",
    "feature_importance_xgb": "outputs/feature_importance/xgboost_feature_importance.csv",
    "test_data": "outputs/datasets/processed/with_target/test_with_target.csv",
}

# Load Models
try:
    rf_model = joblib.load(artifacts_paths["random_forest_model"])
    xgb_model = joblib.load(artifacts_paths["xgboost_model"])
    print("Models loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading models: {e}")

# Load evaluation metrics
try:
    evaluation_metrics = pd.read_csv(artifacts_paths["evaluation_metrics"])
    print("Evaluation metrics loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading evaluation metrics: {e}")

# Load feature importance data
try:
    feature_importance_rf = pd.read_csv(artifacts_paths["feature_importance_rf"])
    feature_importance_xgb = pd.read_csv(artifacts_paths["feature_importance_xgb"])
    print("Feature Importance data loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading feature importance data: {e}")

# Load test data
try:
    test_data = pd.read_csv(artifacts_paths["test_data"])
    print("Test data loaded successfully.")
except FileNotFoundError as e:
    print(f"Error loading test data: {e}")

# Display loaded artifacts 
print("Loaded Artifacts:")
print("\nEvaluation Metrics:")
print(evaluation_metrics.head())
print("\nFeature Importance (Random Forest):")
print(feature_importance_rf.head())
print("\nFeature Importance (XGBoost):")
print(feature_importance_xgb.head())
print("\nTest Data:")
print(test_data.head())

Models loaded successfully.
Evaluation metrics loaded successfully.
Feature Importance data loaded successfully.
Test data loaded successfully.
Loaded Artifacts:

Evaluation Metrics:
               Model  R2 Score       MAE       MSE
0      Random Forest  0.813480  0.122825  0.034807
1      Decision Tree  0.747387  0.154373  0.047141
2                KNN  0.760565  0.143803  0.044682
3  Gradient Boosting  0.771934  0.135791  0.042560
4            XGBoost  0.816074  0.122992  0.034323

Feature Importance (Random Forest):
             Feature  Importance
0   num__OverallQual    0.546651
1     num__GrLivArea    0.129142
2    num__GarageArea    0.051851
3      num__1stFlrSF    0.046355
4  num__OverallScore    0.041237

Feature Importance (XGBoost):
             Feature  Importance
0   num__OverallQual    0.783737
1  num__OverallScore    0.060585
2     num__GrLivArea    0.045171
3     num__YearBuilt    0.016420
4      num__1stFlrSF    0.015661

Test Data:
   num__LotFrontage  num__LotArea  

---

## Pipeline Design

### Preprocessing Pipeline

### Model Integration

### Prediction Pipeline

---

## Validation

### Test Pipeline

### Output Verification

---

## Serialization

### Save Pipeline

### Save Other Artifacts

---

## Deployment Preparation

### Folder Structure

### Inference Example

### Environment File
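
One possible way to produce an environment file is to write the versions pinned in the dependency check to a `requirements.txt`. The sketch below assumes that file format and a helper named `write_requirements`; neither is prescribed by this notebook. It covers only the five packages pinned earlier. The versions of scikit-learn and XGBoost are not stated in this notebook, so they are left out rather than guessed.

```python
from pathlib import Path
import tempfile

# Versions copied from the dependency check earlier in this notebook.
pinned = {
    "pandas": "1.4.2",
    "numpy": "1.24.4",
    "matplotlib": "3.4.3",
    "seaborn": "0.11.2",
    "joblib": "1.4.2",
}


def write_requirements(packages, target):
    """Write one `name==version` line per package, sorted by name."""
    lines = [f"{name}=={ver}" for name, ver in sorted(packages.items())]
    Path(target).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


# Written to a temporary folder here; the final location is a project decision.
with tempfile.TemporaryDirectory() as tmp:
    target_file = Path(tmp) / "requirements.txt"
    requirements_lines = write_requirements(pinned, target_file)
    written = target_file.read_text(encoding="utf-8")
print(written)
```

Sorting the entries keeps the file stable between runs, which makes changes to pinned versions easy to spot in version control.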

---

## Documentation

### Detailed Steps

### Usage Notes

---

## Future Maintenance

### Recommendations