## Modeling Strategy

The modeling phase prioritizes algorithms that are well suited to tabular data, robust to class imbalance, and able to capture non-linear relationships between health indicators and diabetes risk.

Rather than maximizing overall accuracy, the focus is on sensitivity (recall for the diabetic class), because the model is designed to function as a screening tool within a clinical triage workflow.
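To illustrate why accuracy alone can mislead on imbalanced data, here is a minimal sketch on toy labels (illustrative values, not the BRFSS data). The helper `confusion_counts` is a hypothetical name introduced only for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


# Toy data (illustrative only): 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)  # share of true positives that were caught

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

The always-negative classifier scores well on accuracy while catching no positive case at all, which is the failure mode a screening tool has to avoid.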

## Gradient Boosting Model (XGBoost)

Given the limitations observed in the baseline Random Forest model, a gradient boosting approach was selected to better address class imbalance and improve sensitivity to diabetic cases.

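XGBoost was chosen for two reasons. First, boosting builds trees sequentially, with each new tree fit to the errors of the current ensemble, so observations that are still poorly predicted receive more attention. Second, the minority class can be weighted explicitly during training through the `scale_pos_weight` parameter. Both properties align with the project’s objective of minimizing false negatives in a screening context.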

In [1]:
import sys
import os

# Add project root to Python path
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

In [2]:
from src.preprocessing import basic_cleaning, engineer_features, prepare_train_test_split
import pandas as pd

df = pd.read_csv("../data/diabetes_binary_health_indicators_BRFSS2015.csv")

In [4]:
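# Clean the raw survey data, add engineered features, then split and scale
df_clean = basic_cleaning(df)
df_fe = engineer_features(df_clean)
# Positional arguments: target column, test fraction (0.2), and presumably the random seed (42)
X_train_scaled, X_test_scaled, y_train, y_test, scaler = prepare_train_test_split(df_fe, 'Diabetes_binary', 0.2, 42)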

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

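# Baseline: Random Forest with diabetic cases (class 1) weighted 8x
rf_model = RandomForestClassifier(
    n_estimators=100,
    class_weight={0: 1, 1: 8},
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set using the default decision rule
y_pred_rf = rf_model.predict(X_test_scaled)

print(classification_report(y_test, y_pred_rf))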

              precision    recall  f1-score   support

           0       0.86      0.97      0.91     38876
           1       0.46      0.15      0.22      7019

    accuracy                           0.84     45895
   macro avg       0.66      0.56      0.57     45895
weighted avg       0.80      0.84      0.81     45895



### Baseline Model Assessment

The Random Forest baseline identifies non-diabetic individuals well (recall 0.97) but fails to capture the minority class. With a recall of only 0.15 for diabetic cases, the model misses roughly 85% of at-risk individuals, even with the positive class weighted 8:1.

Given the project’s objective as a screening tool, this behavior is undesirable, as false negatives carry a significantly higher clinical cost than false positives. While precision for the diabetic class is moderate (0.46), the insufficient sensitivity indicates that a more suitable modeling strategy is required.

In [6]:
# Negative-to-positive class ratio in the training set, used as scale_pos_weight for XGBoost
ratio = (y_train == 0).sum() / (y_train == 1).sum()
ratio

np.float64(5.538179357504096)
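The ratio above is the number of non-diabetic training rows per diabetic row. Conceptually, using such a ratio as a positive-class weight makes an error on a positive case cost that many times more in the training loss. The sketch below shows the idea with a hand-written weighted log loss on toy values. It is a simplified illustration, not XGBoost's internal implementation, and `weighted_log_loss` is a hypothetical helper.

```python
import math


def weighted_log_loss(y_true, p_pred, pos_weight=1.0):
    """Mean binary log loss with positive-class terms multiplied by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        if y == 1:
            total += -pos_weight * math.log(p)
        else:
            total += -math.log(1.0 - p)
    return total / len(y_true)


# Toy labels (illustrative only): 5 negatives for every positive.
y_toy = [0, 0, 0, 0, 0, 1]
ratio_toy = y_toy.count(0) / y_toy.count(1)

# A model that assigns a low probability (0.1) to every row,
# so the single positive case is badly under-predicted.
p_toy = [0.1] * len(y_toy)

unweighted = weighted_log_loss(y_toy, p_toy)
weighted = weighted_log_loss(y_toy, p_toy, pos_weight=ratio_toy)

print(f"ratio={ratio_toy:.1f}")
print(f"unweighted loss={unweighted:.3f}, weighted loss={weighted:.3f}")
```

With the weight applied, the missed positive dominates the loss, so the optimizer has a stronger incentive to raise predicted probabilities for positive cases.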

## Threshold Optimization for Clinical Triage

Rather than using the default classification threshold of 0.5, a lower decision threshold was adopted to reflect the asymmetric cost of errors in a screening context.

A threshold of 0.3 was selected to prioritize sensitivity, so that individuals with moderate estimated risk are flagged for further clinical evaluation. While this increases the number of false positives, it substantially reduces the risk of missing true diabetic cases.
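The value 0.3 is fixed by hand in this notebook. As a possible alternative, the threshold could be chosen from data by picking the highest cutoff that still meets a target recall on a held-out validation split (not the test set). The sketch below shows this on toy values. The helper names and the 0.75 target are assumptions made for illustration only.

```python
def recall_at(y_true, probs, threshold):
    """Recall of the positive class when predicting 1 for probs >= threshold."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("recall is undefined without positive samples")
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= threshold)
    return tp / positives


def highest_threshold_for_recall(y_true, probs, target_recall):
    """Return the largest observed probability usable as a threshold
    while keeping recall >= target_recall, or None if none qualifies."""
    for threshold in sorted(set(probs), reverse=True):
        if recall_at(y_true, probs, threshold) >= target_recall:
            return threshold
    return None


# Toy validation labels and predicted probabilities (illustrative only).
y_val = [0, 0, 1, 0, 1, 0, 1, 1]
p_val = [0.05, 0.20, 0.25, 0.35, 0.40, 0.55, 0.70, 0.90]

chosen = highest_threshold_for_recall(y_val, p_val, target_recall=0.75)
print(f"chosen threshold={chosen}, recall={recall_at(y_val, p_val, chosen):.2f}")
```

Choosing the highest qualifying threshold keeps false positives as low as the recall target allows. For real data, scikit-learn's `precision_recall_curve` returns the same precision and recall trade-off over all candidate thresholds.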

In [8]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

xgb_model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    scale_pos_weight=ratio,
    random_state=42,
    eval_metric="logloss"
)

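xgb_model.fit(X_train_scaled, y_train)

# Predicted probability of the positive (diabetic) class
y_probs = xgb_model.predict_proba(X_test_scaled)[:, 1]
# Flag as diabetic when the probability reaches the screening threshold
threshold = 0.3
y_pred_xgb = (y_probs >= threshold).astype(int)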

print(classification_report(y_test, y_pred_xgb))

              precision    recall  f1-score   support

           0       0.97      0.50      0.66     38876
           1       0.25      0.92      0.39      7019

    accuracy                           0.57     45895
   macro avg       0.61      0.71      0.53     45895
weighted avg       0.86      0.57      0.62     45895



### XGBoost Model Assessment

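With the positive class weighted through `scale_pos_weight` and the classification threshold lowered to 0.3, the model reaches a recall of approximately 92% for diabetic cases, up from 15% for the Random Forest baseline. This behavior aligns with the intended role of the system as a screening tool, where minimizing false negatives is critical.

The cost is a sharp drop in precision and specificity: precision for the diabetic class falls to 0.25, and recall for the non-diabetic class falls to 0.50, so about half of non-diabetic individuals are flagged and overall accuracy drops from 0.84 to 0.57. This trade-off is considered acceptable in a clinical triage context, as false positives result in low-cost follow-up tests, while false negatives may lead to delayed diagnosis and severe long-term complications.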
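To make the trade-off concrete, the per-class recall and support values printed in the two classification reports can be converted back into approximate confusion-matrix counts. This is a rough sketch: the reported rates are rounded to two decimals, so the counts are approximations, and `approx_counts` is a hypothetical helper.

```python
def approx_counts(recall_pos, recall_neg, n_pos, n_neg):
    """Approximate confusion-matrix counts from per-class recall and support."""
    tp = round(recall_pos * n_pos)
    tn = round(recall_neg * n_neg)
    return {"tp": tp, "fn": n_pos - tp, "tn": tn, "fp": n_neg - tn}


# Test-set support taken from the classification reports above.
N_POS, N_NEG = 7019, 38876

# Per-class recall as printed (rounded) in each report.
baseline = approx_counts(recall_pos=0.15, recall_neg=0.97, n_pos=N_POS, n_neg=N_NEG)
xgb = approx_counts(recall_pos=0.92, recall_neg=0.50, n_pos=N_POS, n_neg=N_NEG)

for name, counts in [("Random Forest", baseline), ("XGBoost @ 0.3", xgb)]:
    fp_per_tp = counts["fp"] / counts["tp"]
    print(
        f"{name}: missed diabetic cases ~{counts['fn']}, "
        f"false alarms ~{counts['fp']}, "
        f"false alarms per detected case ~{fp_per_tp:.1f}"
    )
```

Read this way, the XGBoost configuration trades several thousand fewer missed diabetic cases for a much larger number of follow-up tests, which is the explicit cost of the screening-first objective.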