# Model Selection & Benchmarking

Objective: evaluate multiple candidate classifiers on the PdM dataset using the champion preprocessing pipeline (Feature Engineering → OneHotEncoder + RobustScaler → SMOTE) and log everything to MLflow (DagsHub) for traceability.

**Models to test (initial batch):** Logistic Regression, Random Forest, Gradient Boosting, XGBoost.

> Re-run after tweaking hyperparameters or adding new models as we learn more.


In [1]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from tqdm import tqdm

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import (
    recall_score,
    f1_score,
    roc_auc_score,
    ConfusionMatrixDisplay,
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

import mlflow
import dagshub




In [2]:
class FeatureEngineer(BaseEstimator, TransformerMixin):
    """Physics-informed feature generator used across all models."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        X["Power [W]"] = X["Torque [Nm]"] * (X["Rotational speed [rpm]"] * (2 * np.pi / 60))
        X["Temp Diff [K]"] = X["Process temperature [K]"] - X["Air temperature [K]"]
        X["Wear_Status"] = pd.cut(
            X["Tool wear [min]"], bins=[-1, 60, 180, 300], labels=[0, 1, 2]
        ).astype(int)
        return X

# Load raw data
raw_df = pd.read_csv("../data/raw/ai4i2020.csv")
raw_df = raw_df.drop(columns=["UDI", "Product ID"])  # identifiers carry no predictive signal

# Feature lists consumed by the ColumnTransformer after FeatureEngineer
NUMERIC_FEATURES = [
    "Air temperature [K]",
    "Process temperature [K]",
    "Rotational speed [rpm]",
    "Torque [Nm]",
    "Tool wear [min]",
    "Power [W]",
    "Temp Diff [K]",
    "Wear_Status",
]
CATEGORICAL_FEATURES = ["Type"]

# Drop the target and the per-failure-mode flags, which would leak the label
X = raw_df.drop(columns=["Machine failure", "TWF", "HDF", "PWF", "OSF", "RNF"])
y = raw_df["Machine failure"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)


In [3]:
load_dotenv()

CONFIG = {
    "experiment_name": "Predictive_Maintenance_IIOT_Model_Selection",
    "random_state": 42,
    "test_size": 0.2,
    "DAGSHUB_REPO_OWNER": os.getenv("DagsHub_Repo_Owner"),
    "DAGSHUB_REPO_NAME": os.getenv("DagsHub_Repo_Name"),
    "DAGSHUB_TRACKING_URI": os.getenv("DagsHub_MLflow_Tracking_URI"),
}

# Initialize DagsHub-backed MLflow tracking
print("Tracking URI:", CONFIG["DAGSHUB_TRACKING_URI"])
dagshub.init(
    repo_owner=CONFIG["DAGSHUB_REPO_OWNER"],
    repo_name=CONFIG["DAGSHUB_REPO_NAME"],
    mlflow=True,
)
mlflow.set_tracking_uri(CONFIG["DAGSHUB_TRACKING_URI"])
mlflow.set_experiment(CONFIG["experiment_name"])


Tracking URI: https://dagshub.com/PrakashD2003/Smart-IIOT-Monitoring.mlflow


2025/12/09 19:54:54 INFO mlflow.tracking.fluent: Experiment with name 'Predictive_Maintenance_IIOT_Model_Selection' does not exist. Creating a new experiment.


<Experiment: artifact_location='mlflow-artifacts:/b2e5c5850d58468094c937aae6011e2e', creation_time=1765290295237, experiment_id='1', last_update_time=1765290295237, lifecycle_stage='active', name='Predictive_Maintenance_IIOT_Model_Selection', tags={}>

In [4]:
def log_model_params(algo_name: str, model) -> None:
    """Safely log hyperparameters to MLflow."""
    if mlflow.active_run() is None:
        raise RuntimeError("No active MLflow run. Use mlflow.start_run().")

    params = model.get_params()
    clean_params = {
        k: v if isinstance(v, (int, float, str, bool)) else str(v) for k, v in params.items()
    }
    mlflow.log_params(clean_params)


def build_pipeline(model):
    """Champion preprocessing + sampler + estimator."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", RobustScaler(), NUMERIC_FEATURES),
            (
                "cat",
                OneHotEncoder(handle_unknown="ignore", sparse_output=False),
                CATEGORICAL_FEATURES,
            ),
        ],
        remainder="drop",
    )

    return ImbPipeline(
        steps=[
            ("eng", FeatureEngineer()),
            ("prep", preprocessor),
            ("sampler", SMOTE(random_state=CONFIG["random_state"])),
            ("model", model),
        ]
    )


def get_scores(pipeline, X_eval):
    """Return probability-like scores for ROC AUC."""
    if hasattr(pipeline, "predict_proba"):
        return pipeline.predict_proba(X_eval)[:, 1]
    if hasattr(pipeline, "decision_function"):
        return pipeline.decision_function(X_eval)
    return pipeline.predict(X_eval)


In [5]:
candidate_models = {
    "LogReg": LogisticRegression(max_iter=2000, solver="lbfgs", n_jobs=-1),
    "RandomForest": RandomForestClassifier(
        n_estimators=300,
        max_depth=None,
        min_samples_split=2,
        min_samples_leaf=1,
        random_state=CONFIG["random_state"],
        n_jobs=-1,
    ),
    "GradientBoosting": GradientBoostingClassifier(random_state=CONFIG["random_state"]),
    "XGBoost": XGBClassifier(
        n_estimators=400,
        learning_rate=0.05,
        max_depth=5,
        subsample=0.9,
        colsample_bytree=0.9,
        eval_metric="logloss",
        random_state=CONFIG["random_state"],
        n_jobs=2,
    ),
}


In [6]:
results = []
cv_splitter = StratifiedKFold(
    n_splits=5, shuffle=True, random_state=CONFIG["random_state"]
)

with mlflow.start_run(run_name="Model_Selection") as parent_run:
    # Manual progress bar: advanced with pbar.update(1) after each model
    pbar = tqdm(total=len(candidate_models), desc="Models")

    for model_name, model in candidate_models.items():
        with mlflow.start_run(run_name=model_name, nested=True):
            try:
                start = time.time()
                pipeline = build_pipeline(model)

                cv_results = cross_validate(
                    pipeline,
                    X_train,
                    y_train,
                    cv=cv_splitter,
                    scoring=["recall", "f1", "roc_auc"],
                    n_jobs=2,
                )

                elapsed = time.time() - start

                cv_metrics = {
                    "cv_recall_mean": cv_results["test_recall"].mean(),
                    "cv_recall_std": cv_results["test_recall"].std(),
                    "cv_f1_mean": cv_results["test_f1"].mean(),
                    "cv_roc_auc_mean": cv_results["test_roc_auc"].mean(),
                    "cv_roc_auc_std": cv_results["test_roc_auc"].std(),
                    "cv_time_seconds": elapsed,
                }

                # Fit once on the full training split for holdout evaluation
                pipeline.fit(X_train, y_train)
                y_pred = pipeline.predict(X_test)
                y_scores = get_scores(pipeline, X_test)

                holdout_metrics = {
                    "holdout_recall": recall_score(y_test, y_pred),
                    "holdout_f1": f1_score(y_test, y_pred),
                    "holdout_roc_auc": roc_auc_score(y_test, y_scores),
                }

                # Log params & metrics
                mlflow.log_params(
                    {
                        "model_name": model_name,
                        "encoder": "OneHotEncoder",
                        "scaler": "RobustScaler",
                        "sampler": "SMOTE",
                    }
                )
                log_model_params(model_name, model)
                mlflow.log_metrics({**cv_metrics, **holdout_metrics})

                # Confusion matrix artifact
                disp = ConfusionMatrixDisplay.from_estimator(pipeline, X_test, y_test)
                plt.title(f"{model_name} Confusion Matrix (holdout)")
                plt.tight_layout()
                cm_path = f"confusion_matrix_{model_name}.png"
                plt.savefig(cm_path)
                mlflow.log_artifact(cm_path)
                plt.close()

                # Save pipeline
                mlflow.sklearn.log_model(pipeline, artifact_path="model")

                pbar.write(
                    f"✓ {model_name}: holdout recall={holdout_metrics['holdout_recall']:.3f}, "
                    f"f1={holdout_metrics['holdout_f1']:.3f}, auc={holdout_metrics['holdout_roc_auc']:.3f}"
                )

                results.append(
                    {
                        "model": model_name,
                        **cv_metrics,
                        **holdout_metrics,
                    }
                )

            except Exception as e:
                mlflow.log_param("status", "failed")
                mlflow.log_param("error", str(e))
                pbar.write(f"✗ {model_name}: {e}")

        pbar.update(1)

    pbar.close()

results_df = pd.DataFrame(results)
results_df.sort_values(by="holdout_recall", ascending=False)


Models:   0%|          | 0/4 [00:19<?, ?it/s]

✓ LogReg: holdout recall=0.868, f1=0.296, auc=0.937


Models:  25%|██▌       | 1/4 [02:10<00:59, 19.96s/it]

✓ RandomForest: holdout recall=0.809, f1=0.696, auc=0.974


Models:  50%|█████     | 2/4 [02:32<02:26, 73.47s/it]

✓ GradientBoosting: holdout recall=0.912, f1=0.551, auc=0.975


Models:  75%|███████▌  | 3/4 [02:47<00:49, 49.91s/it]

✓ XGBoost: holdout recall=0.838, f1=0.679, auc=0.980


Models: 100%|██████████| 4/4 [02:47<00:00, 41.93s/it]


| | model | cv_recall_mean | cv_recall_std | cv_f1_mean | cv_roc_auc_mean | cv_roc_auc_std | cv_time_seconds | holdout_recall | holdout_f1 | holdout_roc_auc |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 2 | GradientBoosting | 0.892862 | 0.032146 | 0.567393 | 0.977785 | 0.010637 | 6.39819 | 0.911765 | 0.551111 | 0.975083 |
| 0 | LogReg | 0.830438 | 0.041916 | 0.28181 | 0.919683 | 0.012282 | 1.571934 | 0.867647 | 0.296482 | 0.937234 |
| 3 | XGBoost | 0.82633 | 0.040531 | 0.726019 | 0.977017 | 0.010157 | 1.290008 | 0.838235 | 0.678571 | 0.979844 |
| 1 | RandomForest | 0.80431 | 0.045168 | 0.738947 | 0.973922 | 0.013828 | 3.461602 | 0.808824 | 0.696203 | 0.973625 |


# 🏆 Report: Model Selection & Benchmarking

**Date:** December 09, 2025
**Author:** Prakash Dwivedi
**Module:** Predictive Maintenance (PdM)
**Experiment:** `03_PDM_Model_Selection.ipynb`

-----

![alt text](image.png)

## 1\. Objective

Following the definition of our **Champion Preprocessing Pipeline** (RobustScaler + OneHotEncoder + SMOTE), the objective of this phase was to benchmark candidate Machine Learning algorithms to identify the best architecture for predicting machine failures.

**Selection Criteria:**

  * **Primary Metric:** **Recall** (Sensitivity). In PdM, missing a failure (False Negative) is the most expensive error.
  * **Secondary Metric:** **ROC-AUC**. Measures the model's ability to rank failure risks correctly, allowing for flexible threshold tuning.
  * **Sanity Check:** **F1-Score**. Ensures the model isn't just predicting "Failure" for everything (Precision check).
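
Because ROC-AUC only measures how well a model ranks machines, an operating threshold still has to be chosen before the scores become alerts. The following is a minimal sketch of one way to do that, assuming holdout labels and predicted failure probabilities are available. The helper name `threshold_for_recall` and the toy data are hypothetical and are not part of the notebook.

```python
def threshold_for_recall(y_true, scores, target_recall):
    """Return (threshold, recall, precision) for the strictest threshold
    whose recall reaches target_recall.

    A sample is flagged as a failure when its score >= threshold.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("y_true contains no positive samples")
    if not 0 < target_recall <= 1:
        raise ValueError("target_recall must be in (0, 1]")

    # Try each distinct score as a threshold, strictest first.
    for thr in sorted(set(scores), reverse=True):
        flagged = [y for y, s in zip(y_true, scores) if s >= thr]
        tp = sum(flagged)  # labels are 0/1, so the sum counts true positives
        recall = tp / n_pos
        if recall >= target_recall:
            return thr, recall, tp / len(flagged)

    # The lowest score flags every sample (recall = 1), so this is not reached.
    raise RuntimeError("no threshold reached the target recall")


# Toy data (illustrative only): 1 = failure, 0 = healthy
y_true = [0, 0, 1, 0, 1, 0, 1, 1]
scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.95]

thr, recall, precision = threshold_for_recall(y_true, scores, target_recall=0.75)
print(f"threshold={thr:.2f}, recall={recall:.2f}, precision={precision:.2f}")
```

Raising the target recall pushes the threshold down and usually costs precision, which is the trade-off the F1 sanity check is meant to expose.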

-----

## 2\. Models Evaluated

We evaluated four distinct model architectures to cover linear, bagging, and boosting approaches:

1.  **Logistic Regression:** Linear baseline.
2.  **Random Forest:** Bagging ensemble (Parallel trees).
3.  **Gradient Boosting (sklearn):** Boosting ensemble (Sequential trees).
4.  **XGBoost:** Advanced Gradient Boosting (Optimized for speed and performance).

-----

## 3\. Experimental Results (Holdout Set)

| Model Name | Recall (Catch Rate) | F1-Score (Balance) | ROC-AUC (Separability) | Status |
| :--- | :--- | :--- | :--- | :--- |
| **XGBoost** | 83.8% | 67.9% | **0.980** | 🏆 **Champion** |
| **Random Forest** | 80.9% | **69.6%** | 0.974 | 🥈 Runner-up |
| **Gradient Boosting** | **91.2%** | 55.1% | 0.975 | Rejected (Low Precision) |
| **Logistic Regression** | 86.8% | 29.6% | 0.937 | Rejected (Noise) |

-----

## 4\. Analysis & Decision

### A. The "False Alarm" Trap (Gradient Boosting & LogReg)

  * **Gradient Boosting** achieved the highest Recall (91.2%), but its F1-Score (55.1%) implies a precision of only about 39%, so most of its alerts are false alarms. That volume of false alarms would cause "alert fatigue" for operators.
  * **Logistic Regression** failed to distinguish complex patterns: its F1-Score of 29.6% at 86.8% Recall implies a precision of roughly 18%.
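
The notebook logs recall and F1 but not precision. Precision can be recovered by solving F1 = 2PR / (P + R) for P. The sketch below applies that identity to the holdout values from the results table; the helper name `implied_precision` is hypothetical.

```python
def implied_precision(recall: float, f1: float) -> float:
    """Solve F1 = 2 * P * R / (P + R) for precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 cannot be >= 2 * recall for a valid precision")
    return f1 * recall / denom


# (holdout_recall, holdout_f1) copied from the results table above
holdout = {
    "GradientBoosting": (0.911765, 0.551111),
    "LogReg": (0.867647, 0.296482),
    "XGBoost": (0.838235, 0.678571),
    "RandomForest": (0.808824, 0.696203),
}

precision = {name: implied_precision(r, f) for name, (r, f) in holdout.items()}
for name, p in precision.items():
    print(f"{name}: implied precision = {p:.3f}")
```

This gives roughly 0.39 for Gradient Boosting and 0.18 for Logistic Regression, against about 0.57 for XGBoost and 0.61 for Random Forest.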

### B. The Top Contenders (XGBoost vs. Random Forest)

This was a tight race between the two industry standards.

  * **Random Forest** offered the best F1-Score (69.6%) and the highest implied precision (about 61%), but it missed \~19% of failures (Recall: 80.9%).
  * **XGBoost** caught about 3 percentage points more failures (Recall: 83.8%) while maintaining a very similar F1-score (67.9%).
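
The cross-validation columns in the results table give a rough sense of how much recall moves between folds, which helps put the holdout gap in context. The sketch below compares the gap in mean CV recall with the larger of the two fold standard deviations. This is a crude heuristic and not a significance test, and the helper name `gap_in_std_units` is hypothetical.

```python
# (cv_recall_mean, cv_recall_std) copied from the results table above
cv_recall = {
    "XGBoost": (0.82633, 0.040531),
    "RandomForest": (0.80431, 0.045168),
}


def gap_in_std_units(a, b):
    """Return (gap, gap / larger fold std) for two (mean, std) pairs."""
    (mean_a, std_a), (mean_b, std_b) = a, b
    gap = mean_a - mean_b
    return gap, gap / max(std_a, std_b)


gap, ratio = gap_in_std_units(cv_recall["XGBoost"], cv_recall["RandomForest"])
print(f"CV recall gap = {gap:.4f} ({ratio:.2f} fold standard deviations)")
```

On these numbers the gap is about 0.022, or roughly half of one fold standard deviation, so the recall advantage is best read as modest.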

### C. Why XGBoost Won

1.  **Superior Ranking (AUC 0.98):** XGBoost has the highest holdout ROC-AUC (0.980), meaning it ranks "Failing" machines above "Healthy" ones more reliably than the other candidates on this split, although the margin over Gradient Boosting (0.975) and Random Forest (0.974) is small.
2.  **Recall Priority:** In our roadmap, we prioritized safety. Catching about 3 percentage points more failures than Random Forest is worth the minor trade-off in precision.
3.  **Future Proofing:** XGBoost handles missing values natively (if our pipeline ever leaks them), is relatively insensitive to feature outliers as a tree-based model, and scales well to larger datasets.

-----

## 5\. Next Steps

We will proceed with **XGBoost** as the core algorithm for the Predictive Maintenance module.

**Immediate Tasks:**

1.  **Hyperparameter Tuning:** Conduct a Bayesian or Grid Search on XGBoost to optimize:
      * `scale_pos_weight` (To further balance precision/recall).
      * `max_depth` (To prevent overfitting).
      * `learning_rate` (For convergence).
2.  **Final Training:** Train the tuned model on the full dataset and serialize it as `pdm_model.pkl`.
3.  **Neural Networks:** Once a larger dataset is available, we may experiment with neural networks.
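
As a starting point for the tuning task, the sketch below draws random candidates from a search space over the three parameters listed above. The ranges are illustrative assumptions rather than tuned recommendations, and `SEARCH_SPACE` and `sample_candidate` are hypothetical names. The keys use the `model__` prefix because the estimator is the pipeline step named `"model"` in `build_pipeline`.

```python
import math
import random

# Hypothetical search space: (distribution, low, high). Ranges are assumptions.
SEARCH_SPACE = {
    "model__scale_pos_weight": ("uniform", 1.0, 10.0),
    "model__max_depth": ("int", 3, 8),
    "model__learning_rate": ("loguniform", 0.01, 0.3),
}


def sample_candidate(space, rng):
    """Draw one parameter dict from the search space."""
    params = {}
    for name, (kind, low, high) in space.items():
        if kind == "int":
            params[name] = rng.randint(low, high)  # inclusive on both ends
        elif kind == "uniform":
            params[name] = rng.uniform(low, high)
        elif kind == "loguniform":
            # Uniform in log space, so small learning rates are sampled as often as large ones
            params[name] = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            raise ValueError(f"Unknown distribution: {kind}")
    return params


rng = random.Random(42)  # seeded for reproducible candidates
candidates = [sample_candidate(SEARCH_SPACE, rng) for _ in range(5)]
for params in candidates:
    print(params)
```

Each sampled dict could then be applied with `pipeline.set_params(**params)` before running the same cross-validation loop used in this notebook.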

-----

