# DIABETES PREDICTION (Classification Problem)

### Analyzing Diagnostic Factors Using BOOSTING ALGORITHMS

In [465]:
#Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import joblib
from pickle import dump
import math
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

**STEP 1: PROBLEM STATEMENT & DATA COLLECTION**

***1.1 PROBLEM STATEMENT***

**Goal** - predict, based on diagnostic measures, whether or not a patient has diabetes.

**6.2 SAVING THE CSV FILES**

In [466]:
#dfs_train = {
#    'X_train_with_outliers_sel': X_train_with_outliers_sel,
#    'X_train_without_outliers_sel': X_train_without_outliers_sel,
#    'X_train_with_outliers_norm_sel': X_train_with_outliers_norm_sel,
#    'X_train_without_outliers_norm_sel': X_train_without_outliers_norm_sel,
#    'X_train_with_outliers_minmax_sel': X_train_with_outliers_minmax_sel,
#    'X_train_without_outliers_minmax_sel': X_train_without_outliers_minmax_sel
#}

#dfs_test = {
#    'X_test_with_outliers_sel': X_test_with_outliers_sel,
#    'X_test_without_outliers_sel': X_test_without_outliers_sel,
#    'X_test_with_outliers_norm_sel': X_test_with_outliers_norm_sel,
#    'X_test_without_outliers_norm_sel': X_test_without_outliers_norm_sel,
#    'X_test_with_outliers_minmax_sel': X_test_with_outliers_minmax_sel,
#    'X_test_without_outliers_minmax_sel': X_test_without_outliers_minmax_sel    
#}

#for name, df in dfs_train.items():
#    df.to_csv(f"../data/processed/{name}.csv", index=False)

#for name, df in dfs_test.items(): 
#    df.to_csv(f'../data/processed/{name}.csv', index=False)

In [467]:
# Load the processed datasets

# WITH outliers
X_train_with_outliers_sel = pd.read_csv("../data/processed/X_train_with_outliers_sel.csv")
X_test_with_outliers_sel = pd.read_csv("../data/processed/X_test_with_outliers_sel.csv")

# WITHOUT outliers
X_train_without_outliers_sel = pd.read_csv("../data/processed/X_train_without_outliers_sel.csv")
X_test_without_outliers_sel = pd.read_csv("../data/processed/X_test_without_outliers_sel.csv")

# TARGET VARIABLE
y_train = pd.read_csv("../data/processed/y_train.csv")
y_test = pd.read_csv("../data/processed/y_test.csv")

# Conversion to ensure that y_train and y_test are Series (read_csv returns DataFrames)
y_train = y_train.squeeze() if isinstance(y_train, pd.DataFrame) else y_train
y_test = y_test.squeeze() if isinstance(y_test, pd.DataFrame) else y_test


## MACHINE LEARNING

## **BOOSTING FOR CLASSIFICATION**

In [468]:
np.random.seed(42)

# Train with the dataset WITH outliers
model_with_outliers = XGBClassifier(random_state=42)
model_with_outliers.fit(X_train_with_outliers_sel, y_train)
y_pred_with_outliers = model_with_outliers.predict(X_test_with_outliers_sel)
accuracy_with_outliers = accuracy_score(y_test, y_pred_with_outliers)
print(f"Accuracy WITH outliers: {accuracy_with_outliers}")

# Train with the dataset WITHOUT outliers
model_without_outliers = XGBClassifier(random_state=42)
model_without_outliers.fit(X_train_without_outliers_sel, y_train)
y_pred_without_outliers = model_without_outliers.predict(X_test_without_outliers_sel)
accuracy_without_outliers = accuracy_score(y_test, y_pred_without_outliers)
print(f"Accuracy WITHOUT outliers: {accuracy_without_outliers}")



Accuracy WITH outliers: 0.7662337662337663
Accuracy WITHOUT outliers: 0.7727272727272727


In [469]:
# Model evaluation WITH outliers
print("\nMetrics for the model WITH outliers:")
print(f"Accuracy: {accuracy_with_outliers}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_with_outliers))
print("Classification Report:")
print(classification_report(y_test, y_pred_with_outliers))

# Model evaluation WITHOUT outliers
print("\nMetrics for the model WITHOUT outliers:")
print(f"Accuracy: {accuracy_without_outliers}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_without_outliers))
print("Classification Report:")
print(classification_report(y_test, y_pred_without_outliers))



Metrics for the model WITH outliers:
Accuracy: 0.7662337662337663
Confusion Matrix:
[[77 19]
 [17 41]]
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.80      0.81        96
           1       0.68      0.71      0.69        58

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154


Metrics for the model WITHOUT outliers:
Accuracy: 0.7727272727272727
Confusion Matrix:
[[76 20]
 [15 43]]
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.79      0.81        96
           1       0.68      0.74      0.71        58

    accuracy                           0.77       154
   macro avg       0.76      0.77      0.76       154
weighted avg       0.78      0.77      0.77       154



#### DECISION:

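The dataset WITH outliers was chosen for further optimization. On this test split the two baselines are nearly identical, and the model WITHOUT outliers actually scored marginally higher: accuracy 0.773 vs. 0.766 (one additional correct prediction out of 154) and class 1 recall 0.74 vs. 0.71 (15 vs. 17 false negatives). A gap this small is not strong evidence for either dataset. In medical contexts such as diabetes prediction (the current problem), recall of the positive class is the priority, because minimizing false negatives ensures that patients with diabetes are correctly identified.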
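The accuracy and class 1 recall of the two baselines can be re-derived directly from the confusion matrices printed above. The following is a minimal sketch in plain Python; the helper name `summarize` and the dictionary keys are illustrative and do not appear elsewhere in the notebook.

```python
# Confusion matrices copied from the outputs above.
# Layout follows scikit-learn: rows are true classes, columns are predicted classes.
baselines = {
    "with_outliers": [[77, 19], [17, 41]],
    "without_outliers": [[76, 20], [15, 43]],
}


def summarize(cm):
    """Return (accuracy, class-1 recall, false negatives) for a 2x2 matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    recall_pos = tp / (tp + fn)
    return accuracy, recall_pos, fn


for name, cm in baselines.items():
    acc, rec, fn = summarize(cm)
    print(f"{name}: accuracy={acc:.4f}, class-1 recall={rec:.4f}, false negatives={fn}")
```

With only 58 positive cases in the test set, each additional correctly identified patient moves class 1 recall by roughly 1.7 percentage points, which is worth keeping in mind when comparing models.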



#### **OPTIMIZING BOOSTING ALGORITHM MODEL**

#### **Optimize boosting algorithm with GridSearchCV**

In [470]:
# Hyperparameters
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'subsample': [0.8, 1.0],
    'max_depth': [None, 3, 5, 7],
    'colsample_bytree': [0.8, 1.0]
}

# Optimization with GridSearch
grid_search = GridSearchCV(estimator=model_with_outliers, param_grid=param_grid, cv=3, scoring='recall', n_jobs=2, verbose=1)
grid_search.fit(X_train_with_outliers_sel, y_train)


print(f"Best Hyperparameters: {grid_search.best_params_}")
best_model = grid_search.best_estimator_


Fitting 3 folds for each of 192 candidates, totalling 576 fits
Best Hyperparameters: {'colsample_bytree': 0.8, 'learning_rate': 0.2, 'max_depth': 7, 'n_estimators': 150, 'subsample': 1.0}
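The candidate and fit counts reported by GridSearchCV follow from the Cartesian product of the grid values multiplied by the number of folds. A small sketch that reproduces the arithmetic using only the standard library (the variable names `candidates`, `n_candidates`, and `n_fits` are illustrative):

```python
from itertools import product

# Same grid as in the cell above.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 150],
    "subsample": [0.8, 1.0],
    "max_depth": [None, 3, 5, 7],
    "colsample_bytree": [0.8, 1.0],
}
cv_folds = 3

# Every combination of one value per hyperparameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * cv_folds

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

Adding one more value to any single hyperparameter multiplies the whole count, which is why exhaustive grids become slow quickly.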


Note: I prioritized recall for the positive class (class 1, patients with diabetes) to reduce the risk of missing patients who have diabetes (false negatives). In medical diagnostics, the cost of missing a positive case (failing to diagnose a patient) is much higher than the cost of investigating a false positive.
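To make the asymmetry between the two error types concrete, the sketch below weights the errors of the two baseline models with hypothetical relative costs. The cost values (5 for a false negative, 1 for a false positive) and the function `weighted_error_cost` are assumptions chosen purely for illustration; they are not clinical figures and are not part of the original analysis.

```python
# Hypothetical relative costs (illustrative assumptions, not clinical values).
COST_FALSE_NEGATIVE = 5.0  # a patient with diabetes is missed
COST_FALSE_POSITIVE = 1.0  # a healthy patient is sent for follow-up


def weighted_error_cost(cm, fn_cost=COST_FALSE_NEGATIVE, fp_cost=COST_FALSE_POSITIVE):
    """Total cost of the errors in a 2x2 confusion matrix (rows: true, cols: predicted)."""
    (_, fp), (fn, _) = cm
    return fn * fn_cost + fp * fp_cost


# Baseline confusion matrices copied from the outputs above.
cm_with_outliers = [[77, 19], [17, 41]]
cm_without_outliers = [[76, 20], [15, 43]]

cost_with = weighted_error_cost(cm_with_outliers)        # 17*5 + 19*1
cost_without = weighted_error_cost(cm_without_outliers)  # 15*5 + 20*1
print(cost_with, cost_without)
```

Under these assumed costs, two models with almost the same number of total errors (36 vs. 35) differ more clearly once false negatives are weighted more heavily, which is the intuition behind optimizing for recall rather than accuracy.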

In [471]:
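# Train the final model with the best hyperparameters
# (best_estimator_ is already refit on the full training set because refit=True
# is the GridSearchCV default, so this call simply repeats that fit)
best_model.fit(X_train_with_outliers_sel, y_train)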

# Predicting the final model
y_pred_final = best_model.predict(X_test_with_outliers_sel)


In [472]:
# Accuracy
accuracy_final = accuracy_score(y_test, y_pred_final)
print(f"Final Model Accuracy: {accuracy_final}")

# Confusion Matrix
print("\nFinal Model Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_final))


print("\nFinal Model Classification Report:")
print(classification_report(y_test, y_pred_final))


Final Model Accuracy: 0.7597402597402597

Final Model Confusion Matrix:
[[78 18]
 [19 39]]

Final Model Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.81      0.81        96
           1       0.68      0.67      0.68        58

    accuracy                           0.76       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.76      0.76       154



#### Statements

* The model identifies class 0 (negative for diabetes) well, with a recall of 0.81, but still has difficulty with class 1 (positive for diabetes), where recall is only 0.67. This is lower than the 0.71 class 1 recall of the untuned model WITH outliers, so the grid search did not improve recall on the test set.

#### **Optimizing with RandomizedSearchCV**

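I first tried to optimize with GridSearchCV, but because it took so long to execute, I switched to RandomizedSearchCV, which evaluates a fixed number of sampled hyperparameter combinations instead of every combination in the grid.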

In [473]:
model_with_outliers_sel = XGBClassifier(random_state=42, eval_metric="logloss")

param_dist = {
    'learning_rate': np.linspace(0.01, 0.2, 10), 
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [3, 4, 5, 6, 7],
    'subsample': np.linspace(0.6, 1.0, 5),
    'colsample_bytree': np.linspace(0.6, 1.0, 5),
    'min_child_weight': [1, 3, 5],
    'gamma': np.linspace(0, 0.3, 4),
    'reg_alpha': np.linspace(0, 1.0, 5),
    'reg_lambda': [1, 2, 3]
}

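# Note: no random_state is passed here, so the sampled candidates depend on NumPy's
# global RNG state and may differ between runs.
random_search = RandomizedSearchCV(estimator=model_with_outliers_sel, param_distributions=param_dist, n_iter=100, cv=3, scoring='recall', verbose=2, n_jobs=2)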
random_search.fit(X_train_with_outliers_sel, y_train)

print(f"Best Hyperparameters: {random_search.best_params_}")
best_model = random_search.best_estimator_


Fitting 3 folds for each of 100 candidates, totalling 300 fits
Best Hyperparameters: {'subsample': np.float64(0.8), 'reg_lambda': 1, 'reg_alpha': np.float64(1.0), 'n_estimators': 150, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': np.float64(0.052222222222222225), 'gamma': np.float64(0.0), 'colsample_bytree': np.float64(1.0)}
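For a sense of scale, the sketch below counts how many combinations the search space above contains and how small a share of it 100 sampled candidates cover. The per-parameter counts are read off `param_dist`; the variable names are illustrative.

```python
import math

# Number of distinct values per hyperparameter in param_dist above.
values_per_param = {
    "learning_rate": 10,
    "n_estimators": 4,
    "max_depth": 5,
    "subsample": 5,
    "colsample_bytree": 5,
    "min_child_weight": 3,
    "gamma": 4,
    "reg_alpha": 5,
    "reg_lambda": 3,
}
n_iter = 100
cv_folds = 3

total_combinations = math.prod(values_per_param.values())
exhaustive_fits = total_combinations * cv_folds  # cost of a full grid search
random_fits = n_iter * cv_folds                  # cost of the randomized search
coverage = n_iter / total_combinations           # share of the space that is sampled

print(f"{total_combinations:,} combinations, {exhaustive_fits:,} fits if exhaustive")
print(f"{random_fits} fits with n_iter={n_iter} ({coverage:.4%} of the space)")
```

The randomized search is therefore much cheaper than an exhaustive grid over the same space, at the price of exploring only a tiny sample of it, so the reported best hyperparameters are the best among those sampled rather than a guaranteed optimum.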


In [474]:
# Apply the best hyperparameters found in RandomizedSearchCV
best_params = random_search.best_params_
best_model = XGBClassifier(
    random_state=42,
    **best_params,
    eval_metric="logloss",
    n_jobs=2
)

# Train the model with the best hyperparameters
best_model.fit(X_train_with_outliers_sel, y_train)

In [475]:
# Predicting
y_pred_final = best_model.predict(X_test_with_outliers_sel)

# Evaluating 
accuracy_final = accuracy_score(y_test, y_pred_final)
print(f"Final Model Accuracy: {accuracy_final}")
print("\nFinal Model Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_final))
print("\nFinal Model Classification Report:")
print(classification_report(y_test, y_pred_final))


Final Model Accuracy: 0.7922077922077922

Final Model Confusion Matrix:
[[79 17]
 [15 43]]

Final Model Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.82      0.83        96
           1       0.72      0.74      0.73        58

    accuracy                           0.79       154
   macro avg       0.78      0.78      0.78       154
weighted avg       0.79      0.79      0.79       154



## **FINAL STATEMENT** 
The final model achieved an accuracy of 79.22%, the highest of the models evaluated here, with a reasonable balance between the two classes. Recall for class 1 (positive diabetes) reached 74%, up from 71% for the initial model WITH outliers and 67% for the GridSearchCV model; false negatives fell from 17 and 19, respectively, to 15. This matches the class 1 recall of the initial model WITHOUT outliers, so the gain in recall is modest, while precision for class 1 improved from 0.68 to 0.72. Identifying patients at risk of diabetes is critical in a medical context, where timely diagnosis and treatment are vital.
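The headline figures for the final model can be checked by hand from its confusion matrix. A minimal sketch in plain Python, using the matrix copied from the output above (variable names are illustrative):

```python
# Final confusion matrix copied from the output above (rows: true, cols: predicted).
(tn, fp), (fn, tp) = [[79, 17], [15, 43]]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

These values agree with the class 1 row of the classification report, which confirms that the report's "recall" for class 1 is the fraction of diabetic patients the model identifies.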

#### **Saving Optimized Boosting Algorithms model**

In [476]:
# Saving the final model
best_model.save_model("xgb_final_model_with_outliers_42.json")
print("Final model saved successfully!")


Final model saved successfully!
