**Phase 2: Build a Supervised Learning Model**

---



| Name | Role / Task |
|-----------|----------------|
| Reema Almunasser | Algorithm Selection & Justification |
| Sadeem Alsayari | Implementation |
| Leen Alohali | Implementation |
| Noof Alkhalifa | Evaluation & Comparison |
| Sara Alshuwaier | Results Interpretation Evaluation |


In this phase, we focused on building and training supervised machine learning models using the cleaned and preprocessed dataset from Phase 1.

The main goal was to develop predictive models capable of identifying when a vehicle is likely to require maintenance based on its operational and performance data. By experimenting with different algorithms, we aim to determine which one predicts maintenance needs most accurately and reliably.

This phase sets the foundation for model evaluation and comparison, where we will analyze performance metrics to select the best-performing model for our Car Maintenance Tracking System.

We selected two machine learning algorithms that are commonly used and well-suited for structured data and predictive maintenance tasks:

1.   Gradient Boosting Decision Trees (GBDT): Gradient Boosting models are highly effective for structured data and have become a standard choice for predictive maintenance tasks.[1][2]

      *   Handles mixed feature types: GBDT can work effectively with both numerical and categorical data, capturing complex relationships without heavy manual feature engineering.
      *   Strong predictive performance: Known for achieving high accuracy on classification and ranking tasks in structured datasets.
      *   Feature importance & interpretability: Provides scores for feature importance and can be analyzed using techniques such as SHAP to understand which variables most influence predictions.
      *   Efficient training: Trains effectively on datasets of our size (11,585 records after cleaning) and delivers strong results even with moderate tuning of hyperparameters such as the number of trees, tree depth, and learning rate.



2.   Random Forest (RF): The Random Forest model was chosen for its robustness, high accuracy, and strong interpretability, making it well suited for predictive maintenance tasks.[3]

      *   Handles mixed feature types: Works effectively with both numerical and categorical features and does not need feature scaling. In scikit-learn, categorical features still have to be encoded (for example, one-hot encoded) before training.
      *   Easy to determine feature importance: Reports the contribution of each variable through Gini importance (also called mean decrease in impurity, MDI), which can be complemented by permutation importance.
      *   Captures non-linear relationships: Can model complex interactions between variables through an ensemble of decision trees.
      *   Efficient training: Trains on datasets of our size (11,585 records after cleaning) with moderate computational resources and without extensive hyperparameter tuning.


We selected these models for their effectiveness, ability to uncover complex patterns in the data, robustness across different feature types, and capacity to provide clear insights into what drives maintenance predictions, making them well suited to our vehicle maintenance system.
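As a rough illustration of the feature-importance point above, the sketch below fits both model families on synthetic stand-in data (not the project dataset) and prints their impurity-based importances. The data, the feature names, and the small `n_estimators` values are assumptions chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# Synthetic stand-in data (NOT the project dataset): 6 features, of which
# only the first 2 are informative because shuffle=False keeps them in front.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=2, n_redundant=0,
    shuffle=False, random_state=42,
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]  # hypothetical names

importances = {}
demo_models = {
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "GBDT": GradientBoostingClassifier(n_estimators=50, random_state=42),
}
for label, clf in demo_models.items():
    clf.fit(X_demo, y_demo)
    # feature_importances_ holds impurity-based scores normalized to sum to 1.
    importances[label] = clf.feature_importances_
    ranked = sorted(
        zip(feature_names, clf.feature_importances_),
        key=lambda pair: pair[1], reverse=True,
    )
    print(label, [(name, round(float(score), 3)) for name, score in ranked[:3]])
```

On the real data, the same attribute can be read from the fitted `clf` step of each pipeline; impurity-based scores can favor high-cardinality features, so permutation importance is a useful cross-check.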



### Implementation

In this section, we focus on the practical development of the supervised learning models.  
The goal is to implement the machine learning pipeline that prepares the data, trains the selected models,  
tunes their hyperparameters, and generates predictions.

The implementation process includes:
1. **Data loading and preparation** – importing the cleaned dataset from Phase 1 and identifying the target variable.  
2. **Preprocessing pipeline** – handling missing values, scaling numerical features, and encoding categorical features using a `ColumnTransformer`.  
3. **Model construction** – building two supervised learning models:  
   - Random Forest Classifier  
   - Gradient Boosting Decision Trees (GBDT)  
4. **Cross-validation** – training and validating models using 5-fold cross-validation to ensure generalization.  
5. **Hyperparameter tuning** – optimizing model parameters with `GridSearchCV` to enhance performance.  
6. **Final training and prediction** – retraining the best model on the full training set and generating predictions on the test data.  

All code blocks in this section are fully commented to make the workflow clear, reproducible, and aligned with the project’s Phase 2 requirements.


In [None]:
from google.colab import files
uploaded = files.upload()


Saving logistics_dataset_with_maintenance_required_cleaned.csv to logistics_dataset_with_maintenance_required_cleaned (1).csv


In [None]:
DATA_PATH = "logistics_dataset_with_maintenance_required_cleaned.csv"
TARGET_COL = "Maintenance_Required"


### Setup & Imports

This cell initializes the Phase 2 environment and defines global configuration:

- **Libraries**: Import core tools for data handling (`pandas`, `numpy`), preprocessing (`SimpleImputer`, `StandardScaler`, `OneHotEncoder`, `ColumnTransformer`), model building (`RandomForestClassifier`, `GradientBoostingClassifier`), and utilities for splitting and tuning (`train_test_split`, `StratifiedKFold`, `GridSearchCV`).  
- **Metrics**: Import evaluation functions (Accuracy, Precision, Recall, F1, ROC-AUC) and reporting helpers; although metrics are computed in the *Evaluation* section, we load them here for completeness.
- **Reproducibility**: Set a fixed `RANDOM_STATE = 42` to make data splits and model results repeatable.
- **Project constants**: Define `DATA_PATH` to point to the cleaned dataset from Phase 1 and `TARGET_COL = "Maintenance_Required"` as the prediction target.

> This setup supports the Implementation workflow that follows: preprocessing pipelines, model construction (RF & GBDT), cross-validation, tuning, and final training/prediction.


In [None]:
# === Setup & Imports ===
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    classification_report, confusion_matrix
)

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.inspection import permutation_importance

import matplotlib.pyplot as plt

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_PATH = r"logistics_dataset_with_maintenance_required_cleaned.csv"
TARGET_COL = r"Maintenance_Required"

### Load Data

In this step, the cleaned dataset generated in **Phase 1** is loaded into the workspace.  
The goal is to prepare the data for model training by performing the following tasks:

1. **Import the dataset** from the defined path (`DATA_PATH`).   
2. **Split the dataset** into:
   - `X`: the feature set (independent variables).  
   - `y`: the target label (dependent variable).  
3. **Display key information** such as dataset shape and target distribution to confirm that the data loaded correctly and to check how balanced the classes are.

This confirms that the data is clean, consistent, and ready for the subsequent steps.


In [None]:
# === Load data ===
df = pd.read_csv(DATA_PATH)
assert TARGET_COL in df.columns, f"Target '{TARGET_COL}' not found. Columns: {list(df.columns)}"

# Drop rows where target is missing
df = df.dropna(subset=[TARGET_COL]).reset_index(drop=True)

# Identify X, y
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

print("Shape:", df.shape)
print("Target distribution:")
print(y.value_counts(normalize=True).round(4))
X.head()

Shape: (11585, 48)
Target distribution:
Maintenance_Required
1.0    0.7631
0.0    0.2369
Name: proportion, dtype: float64


| | Vehicle_ID | Make_and_Model | Year_of_Manufacture | Usage_Hours | Load_Capacity | Actual_Load | Last_Maintenance_Date | Maintenance_Cost | Engine_Temperature | Tire_Pressure | ... | Make_volvo | Model_fh | Model_semi | Model_silverado | Service_Year | Service_Month | Service_DayOfWeek | Days_Since_Last_Service | Recent_Service_90d | Load_Utilization |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | ford f150 | 1.0 | 0.018946 | 0.034069 | 0.041134 | 2023-04-09 | 0.001723 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 2023 | 4 | 6 | 448 | 0 | 1.0 |
| 1 | 2 | volvo fh | 0.588235 | 0.381747 | 0.034689 | 0.027919 | 2023-07-20 | 0.028125 | 0.0 | 0.0 | ... | 1 | 1 | 0 | 0 | 2023 | 7 | 3 | 346 | 0 | 0.804853 |
| 2 | 3 | chevy silverado | 1.0 | 0.14946 | 0.013116 | 0.013731 | 2023-03-17 | 0.052977 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 1 | 2023 | 3 | 4 | 471 | 0 | 1.0 |
| 3 | 4 | chevy silverado | 0.352941 | 0.106313 | 0.071867 | 0.086002 | 2024-05-01 | 0.058339 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 2024 | 5 | 2 | 60 | 1 | 1.0 |
| 4 | 5 | ford f150 | 0.529412 | 0.090763 | 0.274337 | 0.29972 | 2023-11-15 | 0.064227 | 0.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 2023 | 11 | 2 | 228 | 0 | 1.0 |


### Column Typing

This step prepares the dataset for machine learning by identifying the data types of each feature  
and building a preprocessing pipeline to handle numerical and categorical attributes properly.

1. **Column Typing** – The feature set `X` is divided into:
   - **Numerical columns** (`numeric_cols`): variables with quantitative values such as mileage, temperature, or cost.
   - **Categorical columns** (`categorical_cols`): every non-numeric column. In this dataset these are `Make_and_Model` and `Last_Maintenance_Date`, the latter because pandas reads the date as text.

2. **Preprocessing Pipelines** – Two transformation pipelines are created:
   - **Numerical Pipeline:** Handles missing numeric values using the *median* strategy and scales features using `StandardScaler` (with `with_mean=False` to support sparse matrices).
   - **Categorical Pipeline:** Handles missing categorical values using the *most frequent* strategy and encodes categories with `OneHotEncoder`, which safely ignores unseen labels during testing.

3. **ColumnTransformer Integration** – Both pipelines are combined using `ColumnTransformer` to apply the appropriate transformations automatically to each feature type during training and prediction.

This ensures that all model inputs are properly scaled, encoded, and free of missing values before entering the learning algorithms.
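To make the "safely ignores unseen labels" behavior concrete, here is a minimal sketch on a toy column. The category values, including the unseen `"tesla semi"`, are made up for illustration and are not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy training and test frames (hypothetical values, not the project data).
train = pd.DataFrame({"Make_and_Model": ["ford f150", "volvo fh", "chevy silverado"]})
test = pd.DataFrame({"Make_and_Model": ["volvo fh", "tesla semi"]})  # second value is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories are stored in sorted order; an unseen label becomes an all-zero row
# instead of raising an error at prediction time.
encoded = encoder.transform(test).toarray()
print(encoder.categories_[0])
print(encoded)
```

The all-zero row means the model receives no information about that category, which is usually preferable to a failed prediction but worth monitoring if many unseen values appear.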


In [None]:
# === Column typing ===
numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()

print("Numeric:", numeric_cols[:12], "..." if len(numeric_cols)>12 else "")
print("Categorical:", categorical_cols[:12], "..." if len(categorical_cols)>12 else "")

# Preprocessing
numeric_tf = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler(with_mean=False))  # with_mean=False safe for sparse
])

categorical_tf = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_tf, numeric_cols),
    ("cat", categorical_tf, categorical_cols)
])

Numeric: ['Vehicle_ID', 'Year_of_Manufacture', 'Usage_Hours', 'Load_Capacity', 'Actual_Load', 'Maintenance_Cost', 'Engine_Temperature', 'Tire_Pressure', 'Fuel_Consumption', 'Battery_Status', 'Vibration_Levels', 'Oil_Quality'] ...
Categorical: ['Make_and_Model', 'Last_Maintenance_Date'] 


### Train–Test Split

In this step, the cleaned and preprocessed dataset is divided into two subsets — one for **training** and one for **testing**.  
This separation allows the model to learn patterns from the training data and then be evaluated on unseen data to measure generalization performance.

- **Function Used:** `train_test_split()` from *scikit-learn*.
- **Split Ratio:** 80% of the data is used for training and 20% for testing (`test_size=0.2`).
- **Stratified Sampling:** The parameter `stratify=y` ensures that the class distribution (vehicles requiring maintenance vs. not requiring maintenance) remains consistent across both sets.
- **Random State:** A fixed `RANDOM_STATE=42` is used to make the split reproducible in future runs.

After this operation, the variables `X_train`, `X_test`, `y_train`, and `y_test` will contain the respective training and testing subsets ready for model building.


In [None]:
# === Split ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)
X_train.shape, X_test.shape

((9268, 47), (2317, 47))
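The effect of `stratify=y` can be checked directly. The sketch below uses synthetic labels with roughly the same 76/24 class ratio as the target (an assumption made so the example is self-contained) and compares the class proportions in the two subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 samples, 76.3% positive, mirroring the target ratio.
rng = np.random.default_rng(42)
y_demo = np.array([1] * 763 + [0] * 237)
X_demo = rng.normal(size=(len(y_demo), 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, both subsets keep (almost exactly) the original ratio.
print("train positive share:", round(float(y_tr.mean()), 4))
print("test positive share: ", round(float(y_te.mean()), 4))
```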

### Candidate Models: Random Forest and Gradient Boosting

This section defines the two supervised learning models selected for the project:  
**Random Forest (RF)** and **Gradient Boosting Decision Trees (GBDT)**.  
Both models are implemented within *scikit-learn* Pipelines to ensure that the same preprocessing steps are consistently applied during training and prediction.

1. **Random Forest Classifier**
   - An ensemble of multiple decision trees trained on random subsets of data and features.
   - Helps reduce overfitting and improves prediction stability.
   - Parameters used:
     - `n_estimators=300`: number of trees in the forest.
     - `class_weight="balanced"`: weights each class inversely to its frequency, so errors on the minority class count more during training.
     - `random_state=42`: ensures reproducible results.

2. **Gradient Boosting Classifier**
   - Builds trees sequentially, where each new tree corrects the errors of the previous ones.
   - Known for its high accuracy on structured datasets.
   - Parameters used:
     - `n_estimators=300`: number of boosting stages.
     - `learning_rate=0.1`: controls how much each tree contributes.
     - `max_depth=3`: limits tree complexity to prevent overfitting.

Finally, both models are stored in a Python dictionary named `models` for easy iteration and comparison in later stages.
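For intuition about `class_weight="balanced"`, the weights follow the formula `n_samples / (n_classes * count_per_class)`. The sketch below reproduces that arithmetic on synthetic labels with a 76/24 split; the label counts are an assumption chosen to mirror the target distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels: 237 negatives and 763 positives (stand-in for y_train).
y_demo = np.array([0] * 237 + [1] * 763)
classes = np.array([0, 1])

# scikit-learn's "balanced" heuristic.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same values computed by hand: n_samples / (n_classes * count_per_class).
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), np.round(weights, 4).tolist())))
```

The minority class (0) receives the larger weight, so misclassifying a "no maintenance" record costs more than misclassifying a "maintenance" record.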


In [None]:
# === Candidate models (RF, GBDT) ===
rf = Pipeline([
    ("prep", preprocessor),
    ("clf", RandomForestClassifier(
        n_estimators=300, random_state=RANDOM_STATE,
        class_weight="balanced", n_jobs=-1
    ))
])

gbdt = Pipeline([
    ("prep", preprocessor),
    ("clf", GradientBoostingClassifier(
        n_estimators=300, learning_rate=0.1, max_depth=3, random_state=RANDOM_STATE
    ))
])

models = {"RandomForest": rf, "GBDT": gbdt}
list(models.keys())

['RandomForest', 'GBDT']

### Cross-Validation (F1-Macro Evaluation)

In this step, 5-fold **Stratified Cross-Validation** is applied to evaluate the performance of each candidate model (Random Forest and GBDT) using the **F1-macro** metric.

- **Purpose:** Cross-validation helps measure how well the model generalizes by training and testing it on multiple subsets of the training data.  
- **Method:** The data is divided into 5 folds (`n_splits=5`), where each fold serves once as a validation set while the remaining folds are used for training.  
- **Stratified Sampling:** Ensures that the proportion of maintenance and non-maintenance cases remains consistent across folds.  
- **Metric Used:**  
  - *F1-macro* computes the unweighted average of the per-class F1-scores, giving equal importance to each class. This suits an imbalanced dataset such as ours, where about 76% of records require maintenance.

The resulting table (`cv_df`) summarizes the mean and standard deviation of the F1-macro score for each model, helping identify which algorithm performs better before proceeding to hyperparameter tuning.
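As a small worked example of the metric, the sketch below computes F1-macro by hand on ten made-up labels and compares it with scikit-learn's result. The labels and the helper `f1_for_class` are illustrative only.

```python
import numpy as np
from sklearn.metrics import f1_score

# Ten made-up labels: 6 positives, 4 negatives, with a few mistakes.
y_true = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 1, 1, 0, 0, 0, 1, 1])

def f1_for_class(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    tp = np.sum((y_true == label) & (y_pred == label))
    fp = np.sum((y_true != label) & (y_pred == label))
    fn = np.sum((y_true == label) & (y_pred != label))
    return 2 * tp / (2 * tp + fp + fn)

# Macro average: plain mean of the per-class scores (10/13 and 4/7 here).
manual_macro = np.mean([f1_for_class(y_true, y_hat, c) for c in (0, 1)])
sk_macro = f1_score(y_true, y_hat, average="macro")
sk_weighted = f1_score(y_true, y_hat, average="weighted")

print(f"manual macro: {manual_macro:.4f}")
print(f"sklearn macro: {sk_macro:.4f}, weighted: {sk_weighted:.4f}")
```

Because the majority class has the higher per-class F1 in this toy case, the weighted average comes out above the macro average, which is why macro is the stricter choice when the minority class matters.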


In [None]:
# === Cross-validation (F1-macro) ===
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

cv_table = []
for name, pipe in models.items():
    scores = cross_val_score(pipe, X_train, y_train, cv=cv, scoring="f1_macro", n_jobs=None)
    cv_table.append({"model": name, "f1_macro_mean": scores.mean(), "f1_macro_std": scores.std()})
    print(f"{name}: F1_macro = {scores.mean():.4f} ± {scores.std():.4f}")

cv_df = pd.DataFrame(cv_table).sort_values(by="f1_macro_mean", ascending=False)
cv_df

RandomForest: F1_macro = 0.9988 ± 0.0008
GBDT: F1_macro = 0.9997 ± 0.0006


| | model | f1_macro_mean | f1_macro_std |
|---|---|---|---|
| 1 | GBDT | 0.999702 | 0.000596 |
| 0 | RandomForest | 0.998806 | 0.000761 |


### Light Hyperparameter Tuning

After obtaining the initial cross-validation results, this step focuses on improving each model’s performance through **hyperparameter tuning** using `GridSearchCV`.

- **Purpose:** Hyperparameter tuning systematically searches for the best combination of parameters that maximize the model’s performance (measured here by *F1-macro*).  
- **Method Used:** `GridSearchCV` evaluates different parameter combinations using 5-fold cross-validation (`cv=cv`) to identify the optimal settings.

The parameter grids tested include:

1. **Random Forest**
   - `n_estimators`: number of trees in the ensemble (300, 500).  
   - `max_depth`: maximum depth of each tree (None, 12, 20).  
   - `min_samples_split`: minimum number of samples required to split a node (2, 10).

2. **Gradient Boosting (GBDT)**
   - `n_estimators`: number of boosting stages (200, 300, 500).  
   - `learning_rate`: step size for each iteration (0.05, 0.1).  
   - `max_depth`: depth of individual trees (2, 3, 4).

At the end of this process, the best-performing model configurations and their corresponding *F1-macro* scores are stored in the dictionary `best_models`.  
These tuned models will be used in the final training and prediction step.
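To gauge the cost of the search, the number of model fits can be counted before running it. The sketch below copies the two grids described above and uses `ParameterGrid` to count the combinations; the names `param_grid_demo` and `fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as described above, repeated here so the snippet is self-contained.
param_grid_demo = {
    "RandomForest": {
        "clf__n_estimators": [300, 500],
        "clf__max_depth": [None, 12, 20],
        "clf__min_samples_split": [2, 10],
    },
    "GBDT": {
        "clf__n_estimators": [200, 300, 500],
        "clf__learning_rate": [0.05, 0.1],
        "clf__max_depth": [2, 3, 4],
    },
}
n_folds = 5

fits = {}
for name, grid in param_grid_demo.items():
    n_candidates = len(ParameterGrid(grid))   # product of the list lengths
    fits[name] = n_candidates * n_folds       # one fit per candidate per fold
    print(f"{name}: {n_candidates} candidates x {n_folds} folds = {fits[name]} fits")
```

Each search also performs one additional fit at the end, when `GridSearchCV` refits the best configuration on the full training set.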


In [None]:
# === Light hyperparameter tuning ===
param_grid = {
    "RandomForest": {
        "clf__n_estimators": [300, 500],
        "clf__max_depth": [None, 12, 20],
        "clf__min_samples_split": [2, 10]
    },
    "GBDT": {
        "clf__n_estimators": [200, 300, 500],
        "clf__learning_rate": [0.05, 0.1],
        "clf__max_depth": [2, 3, 4]
    }
}

best_models = {}
for name, pipe in models.items():
    grid = GridSearchCV(
        estimator=pipe,
        param_grid=param_grid[name],
        scoring="f1_macro",
        cv=cv,
        n_jobs=None,
        verbose=0
    )
    print(f"\nTuning {name}...")
    grid.fit(X_train, y_train)
    print("Best params:", grid.best_params_)
    print("Best CV F1_macro:", grid.best_score_)
    best_models[name] = grid.best_estimator_

best_models


Tuning RandomForest...
Best params: {'clf__max_depth': None, 'clf__min_samples_split': 10, 'clf__n_estimators': 300}
Best CV F1_macro: 0.9995528420420315

Tuning GBDT...
Best params: {'clf__learning_rate': 0.05, 'clf__max_depth': 3, 'clf__n_estimators': 200}
Best CV F1_macro: 0.9997019558643763


{'RandomForest': Pipeline(steps=[('prep',
                  ColumnTransformer(transformers=[('num',
                                                   Pipeline(steps=[('imputer',
                                                                    SimpleImputer(strategy='median')),
                                                                   ('scaler',
                                                                    StandardScaler(with_mean=False))]),
                                                   ['Vehicle_ID',
                                                    'Year_of_Manufacture',
                                                    'Usage_Hours',
                                                    'Load_Capacity',
                                                    'Actual_Load',
                                                    'Maintenance_Cost',
                                                    'Engine_Temperature',
                                              

### Final Training and Prediction

In this final part of the **Implementation** phase, the best-performing model identified during hyperparameter tuning is selected and retrained using the entire training dataset.  
This ensures that the model leverages all available data before making predictions.

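- **Model Selection:** Based on the tuning results, the **Gradient Boosting Decision Trees (GBDT)** model achieved the highest cross-validated F1-macro score (0.9997, versus 0.9996 for Random Forest) and is therefore chosen as the final model.  
- **Final Training:** The selected model is fit on the complete training data (`X_train`, `y_train`). Because `GridSearchCV` already refits the best configuration on the full training set by default (`refit=True`), this call repeats that fit and mainly makes the step explicit.  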
- **Prediction:** After training, the model generates predictions (`y_pred`) on the test set (`X_test`).  
  These predictions will later be used in the **Evaluation & Comparison** phase to calculate performance metrics such as Accuracy, Precision, Recall, F1-Score, and ROC-AUC.

This step concludes the *Implementation* section, ensuring that a fully trained and tuned model is now ready for evaluation.
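The selection step can also be made programmatic rather than hard-coded. The sketch below is one possible approach on small synthetic data (not the project dataset): it keeps each fitted search object, picks the name with the highest `best_score_`, and confirms that `best_estimator_` is already fitted. The reduced grids and `n_estimators=20` are assumptions to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.utils.validation import check_is_fitted

# Small synthetic stand-in data so the example runs in seconds.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

candidates = {
    "RandomForest": (RandomForestClassifier(n_estimators=20, random_state=42),
                     {"max_depth": [3, None]}),
    "GBDT": (GradientBoostingClassifier(n_estimators=20, random_state=42),
             {"max_depth": [2, 3]}),
}

searches = {}
for name, (estimator, grid) in candidates.items():
    # refit=True (the default) retrains the best configuration on all of X_demo.
    search = GridSearchCV(estimator, grid, scoring="f1_macro", cv=3)
    searches[name] = search.fit(X_demo, y_demo)

# Pick the model with the highest mean cross-validated score.
best_name_demo = max(searches, key=lambda n: searches[n].best_score_)
final_demo = searches[best_name_demo].best_estimator_

check_is_fitted(final_demo)          # raises NotFittedError if it were unfitted
preds = final_demo.predict(X_demo[:5])
print(best_name_demo, preds)
```

Keeping the search objects (rather than only the estimators) preserves `best_score_` and `cv_results_`, which makes the choice of final model traceable.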


In [None]:
# === Final Training & Prediction (end of Implementation) ===
# Select the best-performing model based on the tuning results (GBDT performed best)
best_name = "GBDT"
final_model = best_models[best_name]
print(f"Best model selected: {best_name}")

# Train the final model on the entire training dataset
final_model.fit(X_train, y_train)

# Generate predictions on the test set (to be evaluated later)
y_pred = final_model.predict(X_test)

print(" Final training completed. Predictions are stored in y_pred.")
print("Sample predictions:", y_pred[:10])


Best model selected: GBDT
 Final training completed. Predictions are stored in y_pred.
Sample predictions: [0. 0. 1. 0. 1. 0. 1. 1. 1. 1.]


**Evaluation & Comparison**

After training the two models, Random Forest (RF) and Gradient Boosting Decision Trees (GBDT), we evaluate and compare them on the held-out test set. We use the classification metrics Accuracy, Precision, Recall, and F1-score (each computed as a support-weighted average over the two classes, `average="weighted"`), together with ROC-AUC.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import pandas as pd
import numpy as np

results = {
    "Model": [],
    "Accuracy": [],
    "Precision": [],
    "Recall": [],
    "F1-Score": [],
    "ROC-AUC": []
}

for name, model in best_models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None


    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average="weighted")
    rec = recall_score(y_test, y_pred, average="weighted")
    f1 = f1_score(y_test, y_pred, average="weighted")
    roc_auc = roc_auc_score(y_test, y_proba) if y_proba is not None else np.nan


    results["Model"].append(name)
    results["Accuracy"].append(acc)
    results["Precision"].append(prec)
    results["Recall"].append(rec)
    results["F1-Score"].append(f1)
    results["ROC-AUC"].append(roc_auc)

df_results = pd.DataFrame(results)
df_results[["Accuracy", "Precision", "Recall", "F1-Score", "ROC-AUC"]] = (
    df_results[["Accuracy", "Precision", "Recall", "F1-Score", "ROC-AUC"]].round(4)
)

print("\nModel Evaluation Results:\n")
display(df_results)


Model Evaluation Results:



| | Model | Accuracy | Precision | Recall | F1-Score | ROC-AUC |
|---|---|---|---|---|---|---|
| 0 | RandomForest | 0.9996 | 0.9996 | 0.9996 | 0.9996 | 1.0 |
| 1 | GBDT | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


# Results Interpretation

## 1. Key Findings

### 1.1 Accuracy:
Accuracy reflects the overall correctness of the model in predicting all classes.

Random Forest: 0.9996

GBDT: 1.0000

Both models achieved near-perfect accuracy. GBDT classified every test instance correctly, while Random Forest's 0.9996 corresponds to a single misclassified sample out of the 2,317 in the test set. Both models therefore generalize very well to this held-out split.
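With 2,317 test samples, a rounded accuracy can be translated back into an error count. The short calculation below shows which rounded value each small number of errors would produce; the variable names are illustrative.

```python
# Test-set size taken from the split output above: X_test has 2,317 rows.
n_test = 2317

# Accuracy, rounded to 4 decimals, for 0 to 3 misclassified samples.
rounded = {k: round((n_test - k) / n_test, 4) for k in range(4)}
for errors, acc in rounded.items():
    print(f"{errors} error(s) -> accuracy {acc}")
```

Only a single error rounds to 0.9996 and only zero errors gives 1.0, so the gap between the two models on this split is one test sample.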

### 1.2 Precision:

Precision measures how many of the predicted positive cases were actually positive, and it is especially important when false positives are costly.

Random Forest: 0.9996

GBDT: 1.0000

Again, GBDT reached perfect precision, meaning it made no false positive predictions. Random Forest also performed extremely well, but slightly behind.

### 1.3 Recall:
Recall calculates how many actual positive cases were correctly predicted by the model, which is critical when missing positive cases is dangerous (e.g., maintenance needed but undetected).

Random Forest: 0.9996

GBDT: 1.0000

The perfect recall score of GBDT confirms that no true positive instances were missed, making it highly reliable for safety-critical scenarios.

### 1.4 F1-Score:
The F1-Score is the harmonic mean of precision and recall, offering a balanced metric in cases of class imbalance.

Random Forest: 0.9996

GBDT: 1.0000

GBDT outperformed here again. Its perfect F1-Score indicates excellent balance between false positives and false negatives.

### 1.5 ROC-AUC
The ROC-AUC (Receiver Operating Characteristic - Area Under Curve) score evaluates how well the model separates classes, regardless of threshold.

Both models: 1.0

Both models achieved a perfect ROC-AUC of 1.0, suggesting that they are both capable of completely distinguishing between positive and negative classes in probabilistic terms.
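Because ROC-AUC depends only on how the predicted scores rank the samples, a model can reach an AUC of 1.0 and still misclassify samples at a fixed threshold. The toy scores below are made up to illustrate this.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up scores: every positive is ranked above every negative,
# but all scores fall below the usual 0.5 decision threshold.
y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40])

auc = roc_auc_score(y_true, scores)                              # perfect ranking
acc_at_half = accuracy_score(y_true, (scores >= 0.5).astype(int))  # all predicted 0

print(f"ROC-AUC: {auc}, accuracy at threshold 0.5: {acc_at_half}")
```

This is consistent with the results table, where Random Forest has a ROC-AUC of 1.0 even though its accuracy is slightly below 1.0.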

## 2. Final Model Selection: GBDT
While both models performed extremely well, GBDT scored at least as high as Random Forest on every metric and made no classification errors on the test set. The margin is small, a single test sample, but it is consistent with GBDT's slightly higher cross-validated F1-macro score during tuning.

Therefore, we selected Gradient Boosting Decision Trees (GBDT) as the final model for deployment in the vehicle maintenance prediction system.

# References



[1] C3 AI, "Gradient-Boosted Decision Trees (GBDT) – C3 AI Glossary," [Online]. Available: https://c3.ai/glossary/data-science/gradient-boosted-decision-trees-gbdt/. [Accessed 29 Oct. 2025].

[2] V. S. F. G. & V. R. d. Carvalho, "A Review of Interpretability Methods for Gradient Boosting Decision Trees," [Online]. Available: https://www.researchgate.net/publication/395084817_A_Review_of_Interpretability_Methods_for_Gradient_Boosting_Decision_Trees. [Accessed 25 Oct. 2025].

[3] E. Kavlakoglu, "What is random forest?," IBM. [Online]. Available: https://www.ibm.com/think/topics/random-forest. [Accessed 29 Oct. 2025].