# Classification models

**Date:** July 25th, 2025  
**Author:** Paola Rocha  
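**Description:** Trains and compares an XGBoost classifier and a decision tree for predicting `will_fail_30_days`, then logs the results to MLflow.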

**Dataset:** [Microsoft Azure Predictive Maintenance](https://www.kaggle.com/datasets/arnabbiswas1/microsoft-azure-predictive-maintenance/data) on Kaggle.

**Content:**  
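* **Loading Data:** Importing the libraries and loading the dataset.  
* **XGBoost:** Grid search over the XGBoost hyperparameters.  
* **Decision trees:** Grid search over the decision tree hyperparameters.  
* **Model tracking:** Evaluating both models on the test set and logging them to MLflow.  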


## Loading data

In [1]:
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# MLflow
import mlflow
import mlflow.sklearn
import mlflow.xgboost

In [2]:
telemetry = pd.read_csv('../data/processed/telemetry.csv', parse_dates=['date']).sort_values(['machineID', 'date'])
telemetry.info()

<class 'pandas.core.frame.DataFrame'>
Index: 36600 entries, 0 to 36599
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   date                  36600 non-null  datetime64[ns]
 1   machineID             36600 non-null  int64         
 2   volt                  36600 non-null  float64       
 3   rotate                36600 non-null  float64       
 4   pressure              36600 non-null  float64       
 5   vibration             36600 non-null  float64       
 6   error_last_7_days     36600 non-null  float64       
 7   error_last_14_days    36600 non-null  float64       
 8   error_last_30_days    36600 non-null  float64       
 9   failure_last_7_days   36600 non-null  float64       
 10  failure_last_14_days  36600 non-null  float64       
 11  failure_last_30_days  36600 non-null  float64       
 12  maint_last_7_days     36600 non-null  float64       
 13  maint_last_14_days   

In [3]:
feature_cols = telemetry.columns.drop(['date', 'will_fail_30_days'])
target_col = 'will_fail_30_days'

In [4]:
X = telemetry[feature_cols]
y = telemetry[target_col]

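# Split the data into training and testing sets.
# With shuffle=False the last 20% of rows form the test set. Because the frame
# is sorted by machineID and then date, this holds out the last machines
# rather than the most recent dates.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False  # keep the original row order
)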

# Standardize the features (zero mean, unit variance), fitting on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
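
Since the frame is sorted by `machineID` and then `date`, an unshuffled row-wise split holds out whole machines. If a hold-out by time is what is wanted, one option is to split on a date cutoff instead. The sketch below shows the idea on a toy frame. The `toy` data and the "80% of distinct dates" rule are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the telemetry table: 2 machines x 5 daily readings.
toy = pd.DataFrame({
    "date": list(pd.date_range("2015-01-01", periods=5, freq="D")) * 2,
    "machineID": [1] * 5 + [2] * 5,
    "volt": range(10),
}).sort_values(["machineID", "date"])

# Hypothetical rule: the first 80% of distinct dates go to training.
dates = toy["date"].sort_values().unique()
cutoff = pd.Timestamp(dates[int(len(dates) * 0.8)])

train = toy[toy["date"] < cutoff]
test = toy[toy["date"] >= cutoff]

print(len(train), len(test))
```

With this kind of split every machine appears in both sets, and all test rows are later than all training rows.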

## Model 1: XGBoost

In [5]:
# Define the hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 6, 7],
    'max_leaves': [10, 20, 30],
    'learning_rate': [1, 0.1, 0.01, 0.001]
}

# Create the XGBoost model object
xgb_model = xgb.XGBClassifier(random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(xgb_model, param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train_scaled, y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best set of hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Best set of hyperparameters:  {'learning_rate': 0.01, 'max_depth': 3, 'max_leaves': 10, 'n_estimators': 100}
Best score:  0.47035059331454965
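
The grid search above runs on features that were scaled once, before cross-validation, so each validation fold has already contributed to the scaler's statistics. A possible alternative is to put the scaler inside a `Pipeline` so it is refit on each training fold, and to use `TimeSeriesSplit` so validation rows always follow training rows. This is a minimal sketch on synthetic data. `X_demo`, `y_demo`, and the small grid are assumptions, and a decision tree stands in for the estimator to keep the example light. `TimeSeriesSplit` also assumes the rows are ordered by time, which would require re-sorting the telemetry frame by `date`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative data: 200 ordered rows, 4 features, binary target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# The scaler is a pipeline step, so it is refit on every training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=42)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
search = GridSearchCV(
    pipe,
    param_grid={"clf__max_depth": [3, 5]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="f1",
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```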


## Model 2: Decision trees

In [6]:
# Define the hyperparameter grid
param_grid = {
    'criterion': ["entropy", "gini", "log_loss"],
    'max_depth': [3, 5, 6, 7],
    # min_samples_split must be an int >= 2 (or a float fraction). The original
    # grid also included 1, which made 240 of the 960 fits fail validation.
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5, 10]
}

# Create the decision tree model object
model_tree = DecisionTreeClassifier()

# Create the GridSearchCV object
grid_search = GridSearchCV(model_tree, param_grid, cv=5, scoring='f1')

# Fit the GridSearchCV object to the training data
grid_search.fit(X_train_scaled, y_train)

# Print the best set of hyperparameters and the corresponding score
print("Best set of hyperparameters: ", grid_search.best_params_)
print("Best score: ", grid_search.best_score_)

Best set of hyperparameters:  {'criterion': 'gini', 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 10}
Best score:  0.6003406374279007
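
The count of 240 failed fits out of 960 follows directly from the size of a grid that includes `min_samples_split=1`. scikit-learn rejects that value because an integer `min_samples_split` must be at least 2. The sketch below enumerates the combinations with the standard library only and counts how many carry the invalid value.

```python
from itertools import product

# The decision tree grid as originally run, including min_samples_split=1.
param_grid = {
    "criterion": ["entropy", "gini", "log_loss"],
    "max_depth": [3, 5, 6, 7],
    "min_samples_split": [1, 2, 5, 10],
    "min_samples_leaf": [1, 2, 5, 10],
}
cv_folds = 5

# Enumerate every hyperparameter combination in the grid.
keys = list(param_grid)
combos = [dict(zip(keys, values)) for values in product(*param_grid.values())]

# An integer min_samples_split below 2 fails parameter validation at fit time.
invalid = [c for c in combos if c["min_samples_split"] < 2]

total_fits = len(combos) * cv_folds
failed_fits = len(invalid) * cv_folds
print(total_fits, failed_fits)  # 960 240
```

Failed fits are scored as `nan`, so they do not affect which of the valid combinations is chosen as best.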

## Tracking model performance

In [7]:
models = [
    (
        "DecisionTreeClassifier",
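        # Note: min_samples_leaf and min_samples_split differ from the grid-search
        # best values above (min_samples_leaf=2, min_samples_split=10).
        DecisionTreeClassifier(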
            criterion = 'gini',
            max_depth = 3,
            min_samples_leaf = 1,
            min_samples_split = 5,
            random_state = 42),
        (X_train_scaled, y_train),
        (X_test_scaled, y_test)
    ),
    (
        "XGBClassifier",
        xgb.XGBClassifier(
            max_depth = 3,
            max_leaves = 10,
            n_estimators = 100,
            learning_rate = 0.01,
            random_state = 42
        ),
        (X_train_scaled, y_train),
        (X_test_scaled, y_test)
    )
]

In [10]:
reports = []

for model_name, model, (X_tr, y_tr), (X_te, y_te) in models:
    # Fit on the training split and predict on the held-out split
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)

    # Keep the dict form for MLflow logging and print the text form
    report = classification_report(y_te, y_pred, output_dict=True)
    print(f"Model: {model_name}")
    print(classification_report(y_te, y_pred))
    reports.append(report)

Model: DecisionTreeClassifier
              precision    recall  f1-score   support

         0.0       0.56      0.64      0.60      3218
         1.0       0.68      0.61      0.64      4102

    accuracy                           0.62      7320
   macro avg       0.62      0.62      0.62      7320
weighted avg       0.63      0.62      0.62      7320

Model: XGBClassifier
              precision    recall  f1-score   support

         0.0       0.58      0.50      0.54      3218
         1.0       0.65      0.72      0.68      4102

    accuracy                           0.62      7320
   macro avg       0.61      0.61      0.61      7320
weighted avg       0.62      0.62      0.62      7320
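
To put the two reports in context, it helps to compare them with a majority-class baseline and to check how the macro and weighted averages come out of the per-class scores. The sketch below uses the rounded values printed in the reports above, so its results are only accurate to about two decimals. The `summary` dictionary is an illustrative name.

```python
# Per-class support and rounded F1 scores copied from the printed reports.
support = {"0.0": 3218, "1.0": 4102}
f1 = {
    "DecisionTreeClassifier": {"0.0": 0.60, "1.0": 0.64},
    "XGBClassifier": {"0.0": 0.54, "1.0": 0.68},
}

total = sum(support.values())

# Accuracy of always predicting the most frequent class.
majority_baseline = max(support.values()) / total

summary = {}
for name, scores in f1.items():
    # Macro average: unweighted mean over classes.
    macro = sum(scores.values()) / len(scores)
    # Weighted average: mean over classes weighted by support.
    weighted = sum(scores[c] * support[c] for c in support) / total
    summary[name] = {"macro": macro, "weighted": weighted}
    print(f"{name}: macro={macro:.2f} weighted={weighted:.2f}")

print(f"majority-class baseline accuracy: {majority_baseline:.2f}")
```

Both models reach 0.62 accuracy, against roughly 0.56 for always predicting class 1, so the gain over the trivial baseline is modest.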



In [9]:
# Initialize MLflow
mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("Machine Failure Prediction")

for (model_name, model, _, _), report in zip(models, reports):

    with mlflow.start_run(run_name=model_name):
        mlflow.log_param("model", model_name)
        mlflow.log_metric('accuracy', report['accuracy'])
        mlflow.log_metric('recall_class_1', report['1.0']['recall'])
        mlflow.log_metric('recall_class_0', report['0.0']['recall'])
        mlflow.log_metric('f1_score_macro', report['macro avg']['f1-score'])

        if "XGB" in model_name:
            mlflow.xgboost.log_model(model, name="model")
        else:
            mlflow.sklearn.log_model(model, name="model")

🏃 View run DecisionTreeClassifier at: http://127.0.0.1:5000/#/experiments/581840100263335195/runs/529cc0d0d476485eb9de38ef8dad56ba
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/581840100263335195




🏃 View run XGBClassifier at: http://127.0.0.1:5000/#/experiments/581840100263335195/runs/64868dc8748d47b7af3c1f776a18d5e5
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/581840100263335195
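
The logging cell picks four values out of each classification report by hand. If more of the report should be tracked, one option is to flatten the nested dictionary into a single name-to-value mapping first. The helper below is a hypothetical sketch, not part of the original notebook. `flatten_report` is an illustrative name, and the example values are the rounded decision tree figures printed earlier.

```python
def flatten_report(report: dict) -> dict:
    """Flatten a classification_report(output_dict=True) result into
    a single {metric_name: value} mapping."""
    flat = {}
    for label, value in report.items():
        if isinstance(value, dict):
            # Per-class and averaged rows: one entry per metric.
            for metric, score in value.items():
                flat[f"{label}_{metric}".replace(" ", "_")] = float(score)
        else:
            # Top-level scalars such as "accuracy".
            flat[label] = float(value)
    return flat


# Rounded values from the DecisionTreeClassifier report, for illustration.
example = {
    "0.0": {"precision": 0.56, "recall": 0.64, "f1-score": 0.60, "support": 3218},
    "1.0": {"precision": 0.68, "recall": 0.61, "f1-score": 0.64, "support": 4102},
    "accuracy": 0.62,
    "macro avg": {"precision": 0.62, "recall": 0.62, "f1-score": 0.62, "support": 7320},
}

metrics = flatten_report(example)
print(metrics["accuracy"], metrics["macro_avg_f1-score"])
```

Inside a `with mlflow.start_run(...)` block, the resulting dictionary could then be passed to `mlflow.log_metrics(metrics)` in a single call.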
