# DIABETES PREDICTION (Classification Problem)

### Analyzing Diagnostic Factors Using BOOSTING ALGORITHMS

In [411]:
#Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import joblib
from pickle import dump
import math
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

**STEP 1: PROBLEM STATEMENT & DATA COLLECTION**

***1.1 PROBLEM STATEMENT***

**Goal** - Predict, from diagnostic measurements, whether or not a patient has diabetes.

**6.2 SAVING THE CSV FILES**

In [412]:
#dfs_train = {
    #'X_train_with_outliers_sel': X_train_with_outliers_sel,
    #'X_train_without_outliers_sel': X_train_without_outliers_sel,
    #'X_train_with_outliers_norm_sel': X_train_with_outliers_norm_sel,
    #'X_train_without_outliers_norm_sel': X_train_without_outliers_norm_sel,
    #'X_train_with_outliers_minmax_sel': X_train_with_outliers_minmax_sel,
    #'X_train_without_outliers_minmax_sel': X_train_without_outliers_minmax_sel 
#}

#dfs_test = {
#    'X_test_with_outliers_sel': X_test_with_outliers_sel,
#    'X_test_without_outliers_sel': X_test_without_outliers_sel,
#    'X_test_with_outliers_norm_sel': X_test_with_outliers_norm_sel,
#    'X_test_without_outliers_norm_sel': X_test_without_outliers_norm_sel,
#    'X_test_with_outliers_minmax_sel': X_test_with_outliers_minmax_sel,
#    'X_test_without_outliers_minmax_sel': X_test_without_outliers_minmax_sel    
#}

#for name, df in dfs_train.items():
#    df.to_csv(f"../data/processed/{name}.csv", index=False)

#for name, df in dfs_test.items(): 
#    df.to_csv(f'../data/processed/{name}.csv', index=False)

In [413]:
# Load the processed datasets

# WITH outliers
X_train_with_outliers = pd.read_csv("../data/processed/X_train_with_outliers.csv")
X_test_with_outliers = pd.read_csv("../data/processed/X_test_with_outliers.csv")

# WITHOUT outliers
X_train_without_outliers = pd.read_csv("../data/processed/X_train_without_outliers.csv")
X_test_without_outliers = pd.read_csv("../data/processed/X_test_without_outliers.csv")

# TARGET VARIABLE
y_train = pd.read_csv("../data/processed/y_train.csv")
y_test = pd.read_csv("../data/processed/y_test.csv")

# Conversion to ensure that y_train and y_test are Series
y_train = y_train.squeeze() if isinstance(y_train, pd.DataFrame) else y_train
y_test = y_test.squeeze() if isinstance(y_test, pd.DataFrame) else y_test


## MACHINE LEARNING

## **BOOSTING FOR CLASSIFICATION**

In [414]:
np.random.seed(42)

# Train with the dataset WITH outliers
model_with_outliers = XGBClassifier(random_state=42)
model_with_outliers.fit(X_train_with_outliers, y_train)
y_pred_with_outliers = model_with_outliers.predict(X_test_with_outliers)
accuracy_with_outliers = accuracy_score(y_test, y_pred_with_outliers)
print(f"Accuracy WITH outliers: {accuracy_with_outliers}")

# Train with the dataset WITHOUT outliers
model_without_outliers = XGBClassifier(random_state=42)
model_without_outliers.fit(X_train_without_outliers, y_train)
y_pred_without_outliers = model_without_outliers.predict(X_test_without_outliers)
accuracy_without_outliers = accuracy_score(y_test, y_pred_without_outliers)
print(f"Accuracy WITHOUT outliers: {accuracy_without_outliers}")



Accuracy WITH outliers: 0.7662337662337663
Accuracy WITHOUT outliers: 0.7532467532467533


In [415]:
# Model evaluation WITH outliers
print("\nMetrics for the model WITH outliers:")
print(f"Accuracy: {accuracy_with_outliers}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_with_outliers))
print("Classification Report:")
print(classification_report(y_test, y_pred_with_outliers))

# Model evaluation WITHOUT outliers
print("\nMetrics for the model WITHOUT outliers:")
print(f"Accuracy: {accuracy_without_outliers}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_without_outliers))
print("Classification Report:")
print(classification_report(y_test, y_pred_without_outliers))



Metrics for the model WITH outliers:
Accuracy: 0.7662337662337663
Confusion Matrix:
[[77 19]
 [17 41]]
Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.80      0.81        96
           1       0.68      0.71      0.69        58

    accuracy                           0.77       154
   macro avg       0.75      0.75      0.75       154
weighted avg       0.77      0.77      0.77       154


Metrics for the model WITHOUT outliers:
Accuracy: 0.7532467532467533
Confusion Matrix:
[[76 20]
 [18 40]]
Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.79      0.80        96
           1       0.67      0.69      0.68        58

    accuracy                           0.75       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.75      0.75       154



#### DECISION:

The dataset WITH outliers was chosen for further optimization because it scored slightly higher on both overall accuracy (0.766 vs. 0.753) and class 1 (diabetes-positive) recall (0.71 vs. 0.69). The gap is small: it amounts to two more correct predictions out of 154 test samples. In a medical setting such as diabetes prediction, recall on the positive class matters most, because a false negative means a patient with diabetes goes unidentified.
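
To make the comparison concrete, the class 1 recall, precision, and accuracy can be recomputed directly from the two confusion matrices printed above. This is a minimal sketch in plain Python; the helper name `positive_class_metrics` is hypothetical, and it assumes scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
def positive_class_metrics(cm):
    """Return (recall, precision, accuracy) for class 1 from a 2x2 confusion matrix.

    Assumes scikit-learn's layout: cm[i][j] counts samples with true label i
    that were predicted as label j.
    """
    (tn, fp), (fn, tp) = cm
    recall = tp / (tp + fn)        # share of true positives that were found
    precision = tp / (tp + fp)     # share of positive predictions that were right
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return recall, precision, accuracy


# Confusion matrices copied from the outputs above
cm_with_outliers = [[77, 19], [17, 41]]
cm_without_outliers = [[76, 20], [18, 40]]

for name, cm in [("WITH outliers", cm_with_outliers),
                 ("WITHOUT outliers", cm_without_outliers)]:
    recall, precision, accuracy = positive_class_metrics(cm)
    print(f"{name}: recall={recall:.2f}, precision={precision:.2f}, accuracy={accuracy:.4f}")
```

The two models differ by one true positive and one true negative, so the choice between them rests on a very small margin.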



#### **OPTIMIZING BOOSTING ALGORITHM MODEL**

#### **Optimize boosting algorithm with GridSearchCV**

In [416]:
# Hyperparameters
param_grid = {
    'learning_rate': [0.001, 0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'subsample': [0.8, 1.0],
    'max_depth': [None, 3, 5, 7],
    'colsample_bytree': [0.8, 1.0]
}

# Optimization with GridSearch
grid_search = GridSearchCV(estimator=model_with_outliers, param_grid=param_grid, cv=3, scoring='recall', n_jobs=2, verbose=1)
grid_search.fit(X_train_with_outliers, y_train)


print(f"Best Hyperparameters: {grid_search.best_params_}")
best_model = grid_search.best_estimator_


Fitting 3 folds for each of 192 candidates, totalling 576 fits
Best Hyperparameters: {'colsample_bytree': 1.0, 'learning_rate': 0.2, 'max_depth': None, 'n_estimators': 50, 'subsample': 1.0}


In [417]:
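# Train the final model with the best hyperparameters
# (GridSearchCV already refits best_estimator_ on the full training set because
# refit=True by default, so this call is redundant but harmless)
best_model.fit(X_train_with_outliers, y_train)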

# Predicting the final model
y_pred_final = best_model.predict(X_test_with_outliers)


In [418]:
# Accuracy
accuracy_final = accuracy_score(y_test, y_pred_final)
print(f"Final Model Accuracy: {accuracy_final}")

# Confusion Matrix
print("\nFinal Model Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_final))


print("\nFinal Model Classification Report:")
print(classification_report(y_test, y_pred_final))


Final Model Accuracy: 0.7597402597402597

Final Model Confusion Matrix:
[[76 20]
 [17 41]]

Final Model Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.79      0.80        96
           1       0.67      0.71      0.69        58

    accuracy                           0.76       154
   macro avg       0.74      0.75      0.75       154
weighted avg       0.76      0.76      0.76       154



#### Statements

* The tuned model identifies class 0 (negative for diabetes) more reliably than class 1 (positive for diabetes): recall is 0.79 for class 0 and 0.71 for class 1.
* Grid search did not improve on the default model, which had the same class 1 recall (41 of 58) and slightly higher accuracy (0.766 vs. 0.760).

#### **Optimizing with RandomizedSearchCV**

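Grid search is exhaustive, and the 576 fits above took a long time to run, so I switched to RandomizedSearchCV. It evaluates a fixed number of randomly sampled candidates (`n_iter`), which makes it practical to search a wider space with more hyperparameters.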
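
The difference in cost can be estimated by counting candidates. The sketch below uses only the standard library; `param_dist_sizes` is a hypothetical helper dictionary that lists how many values each hyperparameter takes in the `param_dist` used in this section, and the fold and iteration counts are the `cv=3` and `n_iter=50` settings from this notebook.

```python
import math

# Grid used with GridSearchCV: every combination is evaluated
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 150],
    "subsample": [0.8, 1.0],
    "max_depth": [None, 3, 5, 7],
    "colsample_bytree": [0.8, 1.0],
}

# Number of values per hyperparameter in param_dist (RandomizedSearchCV)
param_dist_sizes = {
    "learning_rate": 10, "n_estimators": 4, "max_depth": 5,
    "subsample": 5, "colsample_bytree": 5, "min_child_weight": 3,
    "gamma": 4, "reg_alpha": 5, "reg_lambda": 3,
}

cv = 3        # folds used in both searches
n_iter = 50   # candidates sampled by RandomizedSearchCV

grid_candidates = math.prod(len(values) for values in param_grid.values())
grid_fits = grid_candidates * cv

random_space = math.prod(param_dist_sizes.values())
random_fits = n_iter * cv

print(f"Grid search: {grid_candidates} candidates, {grid_fits} fits")
print(f"Randomized search: {n_iter} of {random_space} combinations, {random_fits} fits")
```

An exhaustive search over the larger space would be impractical, which is why sampling 50 candidates is the more workable option here, at the price of possibly missing the best combination.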

In [419]:
model_with_outliers = XGBClassifier(random_state=42, eval_metric="logloss")

param_dist = {
    'learning_rate': np.linspace(0.01, 0.2, 10), 
    'n_estimators': [50, 100, 150, 200],
    'max_depth': [3, 4, 5, 6, 7],
    'subsample': np.linspace(0.6, 1.0, 5),
    'colsample_bytree': np.linspace(0.6, 1.0, 5),
    'min_child_weight': [1, 3, 5],
    'gamma': np.linspace(0, 0.3, 4),
    'reg_alpha': np.linspace(0, 1.0, 5),
    'reg_lambda': [1, 2, 3]
}

random_search = RandomizedSearchCV(estimator=model_with_outliers, param_distributions=param_dist, n_iter=50, cv=3, scoring='recall', verbose=2, n_jobs=2)
random_search.fit(X_train_with_outliers, y_train)

print(f"Best Hyperparameters: {random_search.best_params_}")
best_model = random_search.best_estimator_


Fitting 3 folds for each of 50 candidates, totalling 150 fits
Best Hyperparameters: {'subsample': np.float64(0.7), 'reg_lambda': 1, 'reg_alpha': np.float64(0.0), 'n_estimators': 50, 'min_child_weight': 5, 'max_depth': 7, 'learning_rate': np.float64(0.09444444444444444), 'gamma': np.float64(0.09999999999999999), 'colsample_bytree': np.float64(1.0)}


In [420]:
# Apply the best hyperparameters found in RandomizedSearchCV
best_params = random_search.best_params_
best_model = XGBClassifier(
    random_state=42,
    **best_params,
    eval_metric="logloss",
    n_jobs=2
)

# Train the model with the best hyperparameters
best_model.fit(X_train_with_outliers, y_train)

In [421]:
# Predicting
y_pred_final = best_model.predict(X_test_with_outliers)

# Evaluating 
accuracy_final = accuracy_score(y_test, y_pred_final)
print(f"Final Model Accuracy: {accuracy_final}")
print("\nFinal Model Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_final))
print("\nFinal Model Classification Report:")
print(classification_report(y_test, y_pred_final))


Final Model Accuracy: 0.7727272727272727

Final Model Confusion Matrix:
[[80 16]
 [19 39]]

Final Model Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.83      0.82        96
           1       0.71      0.67      0.69        58

    accuracy                           0.77       154
   macro avg       0.76      0.75      0.76       154
weighted avg       0.77      0.77      0.77       154



The confusion matrix shows that the model still misses a substantial share of class 1: 19 positive examples were classified as negative (false negatives).

**NOTE** - The model achieves an overall accuracy of 77.27%, but performance is uneven across classes: recall is 83% for class 0 and only 67% for class 1 (positive diabetes), which is lower than the 71% of the untuned model. Missing diabetes-positive patients is the costliest error in this medical context. Adjusting the class weights should raise recall for class 1 and reduce the number of false negatives, making the model more useful for diagnosing diabetes.
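
One way to see why class weights raise recall is to look at how a weight on the positive class changes the penalty for a missed positive. The sketch below is a conceptual illustration in plain Python, not XGBoost's internal code; the function name `example_log_loss` and the weight of 2.0 are illustrative assumptions.

```python
import math


def example_log_loss(y_true, p_pos, pos_weight=1.0):
    """Log loss for a single example; positive examples are scaled by pos_weight."""
    if y_true == 1:
        return -pos_weight * math.log(p_pos)
    return -math.log(1.0 - p_pos)


# Two equally confident mistakes: a missed positive and a false alarm
missed_positive = example_log_loss(1, 0.3)   # true class 1, predicted P(1) = 0.3
false_alarm = example_log_loss(0, 0.7)       # true class 0, predicted P(1) = 0.7

# With a positive-class weight of 2.0 (illustrative), the missed positive costs twice as much
missed_positive_weighted = example_log_loss(1, 0.3, pos_weight=2.0)

print(f"Missed positive, unweighted: {missed_positive:.3f}")
print(f"False alarm:                 {false_alarm:.3f}")
print(f"Missed positive, weighted:   {missed_positive_weighted:.3f}")
```

Without weighting, both mistakes cost the same. With the weight, training is pushed toward avoiding false negatives, typically at the expense of more false positives.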



In [422]:
# Calculate the ratio of negative to positive examples in the training set
class_0_count = (y_train == 0).sum()
class_1_count = (y_train == 1).sum()
scale_pos_weight = class_0_count / class_1_count

print(f"Scale Pos Weight: {scale_pos_weight}")


Scale Pos Weight: 1.9238095238095239


In [423]:
# Train the model with weight adjustment

best_model_with_weights = XGBClassifier(
    subsample=best_params['subsample'],
    reg_lambda=best_params['reg_lambda'],
    reg_alpha=best_params['reg_alpha'],
    n_estimators=best_params['n_estimators'],
    min_child_weight=best_params['min_child_weight'],
    max_depth=best_params['max_depth'],
    learning_rate=best_params['learning_rate'],
    gamma=best_params['gamma'],
    colsample_bytree=best_params['colsample_bytree'],
    random_state=42,
    eval_metric="logloss",
    n_jobs=2,
    scale_pos_weight=scale_pos_weight,
)


# Training the model
best_model_with_weights.fit(X_train_with_outliers, y_train)


In [424]:
# Making predictions on the test set
y_pred_with_weights = best_model_with_weights.predict(X_test_with_outliers)

# Evaluate the model
accuracy_with_weights = accuracy_score(y_test, y_pred_with_weights)
print(f"Accuracy with Adjusted Weights: {accuracy_with_weights}")

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred_with_weights))

print("\nClassification Report:")
print(classification_report(y_test, y_pred_with_weights))


Accuracy with Adjusted Weights: 0.7857142857142857

Confusion Matrix:
[[74 22]
 [11 47]]

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.77      0.82        96
           1       0.68      0.81      0.74        58

    accuracy                           0.79       154
   macro avg       0.78      0.79      0.78       154
weighted avg       0.80      0.79      0.79       154



#### **Statement**

After weight adjustment, class 1 (diabetes positive) recall improved markedly, from 67% to 81%, with false negatives falling from 19 to 11. Overall accuracy also rose slightly, from 77.27% to 78.57%. The cost is more false positives (16 to 22), which lowers class 1 precision from 0.71 to 0.68 and class 0 recall from 0.83 to 0.77. This trade-off is acceptable in medical scenarios, as it prioritizes the identification of diabetes-positive patients, which is crucial for timely diagnosis and treatment.
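
The trade-off can be read off the two confusion matrices printed above. A minimal sketch in plain Python, assuming scikit-learn's layout (rows are true labels, columns are predictions); the variable names are illustrative.

```python
# Confusion matrices copied from the outputs above
cm_before = [[80, 16], [19, 39]]   # RandomizedSearchCV model
cm_after = [[74, 22], [11, 47]]    # same hyperparameters plus scale_pos_weight

(tn_b, fp_b), (fn_b, tp_b) = cm_before
(tn_a, fp_a), (fn_a, tp_a) = cm_after

total = tn_b + fp_b + fn_b + tp_b  # 154 test samples in both cases

fewer_false_negatives = fn_b - fn_a
more_false_positives = fp_a - fp_b
net_correct_gain = (tn_a + tp_a) - (tn_b + tp_b)

print(f"False negatives: {fn_b} -> {fn_a} ({fewer_false_negatives} fewer)")
print(f"False positives: {fp_b} -> {fp_a} ({more_false_positives} more)")
print(f"Correct predictions: {tn_b + tp_b} -> {tn_a + tp_a} (net {net_correct_gain:+d})")
print(f"Accuracy: {(tn_b + tp_b) / total:.4f} -> {(tn_a + tp_a) / total:.4f}")
```

Eight missed diagnoses are recovered in exchange for six extra false alarms, so accuracy moves only slightly while the error mix shifts toward the less costly kind.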

In [425]:
# Comparing results (values taken from the classification reports above)
results = {
    "Model": ["With Outliers", "Without Outliers", "Best Params", "Adjusted Weights"],
    "Accuracy": [0.77, 0.75, 0.77, 0.79],
    "Recall Class 1": [0.71, 0.69, 0.67, 0.81],
    "F1-Score Class 1": [0.69, 0.68, 0.69, 0.74]
}

results_df = pd.DataFrame(results)
print(results_df)


              Model  Accuracy  Recall Class 1  F1-Score Class 1
0     With Outliers      0.77            0.71              0.69
1  Without Outliers      0.75            0.69              0.68
2       Best Params      0.77            0.67              0.69
3  Adjusted Weights      0.79            0.81              0.74


#### Statements
After adjusting the class weights, the model became better at identifying examples from class 1 (positive diabetes): recall rose from 67% to 81%. The overall accuracy also increased slightly, from 77.27% to 78.57%.
This adjustment addresses the model's limitation in identifying positive diabetes cases, aligning its performance with the clinical importance of minimizing missed diagnoses.

#### **Saving Optimized Boosting Algorithms model**

In [426]:
# Saving the final model
best_model_with_weights.save_model("xgb_final_model_with_outliers_42.json")
print("Final model saved successfully!")


Final model saved successfully!
