# DIABETES PREDICTION USING BOOSTING ALGORITHMS

The purpose of this project is to improve diabetes prediction with a boosting algorithm (XGBoost) and to compare its accuracy with the results obtained earlier with the Decision Tree and Random Forest models.

In [49]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
import pickle
import joblib
from sklearn.metrics import accuracy_score


The full CSV and the train/test split are loaded. Both come from the previous Decision Tree and Random Forest project, where an exploratory data analysis (EDA) was carried out.

In [3]:
total_data = pd.read_csv('../data/raw/diabetes.csv')
train_data = pd.read_csv('../data/raw/diabetes_train.csv')
test_data = pd.read_csv('../data/raw/diabetes_test.csv')
total_data.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [5]:
X_train = train_data.drop(["Outcome"], axis = 1)
y_train = train_data["Outcome"]
X_test = test_data.drop(["Outcome"], axis = 1)
y_test = test_data["Outcome"]

A dictionary is created with the hyperparameter values to be tested for the XGBoost model. Several values are tried for each hyperparameter.

In [22]:
param_grid = {
    'n_estimators': [100, 300, 500, 600],
    'learning_rate': [0.01, 0.1, 0.2, 0.4],
    'max_depth': [3, 4, 5, 6, 8],
    'subsample': [0.5, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.5, 0.6, 0.8, 1.0]
}
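The size of this grid determines how expensive the search is. As a quick sanity check, the number of candidate combinations is the product of the list lengths, and each candidate is fitted once per cross-validation fold. The sketch below repeats the dictionary so that it runs on its own; `n_folds = 5` is an assumption matching the `cv=5` setting used in the search.

```python
import math

# Same grid as above, repeated so the snippet is self-contained
param_grid = {
    'n_estimators': [100, 300, 500, 600],
    'learning_rate': [0.01, 0.1, 0.2, 0.4],
    'max_depth': [3, 4, 5, 6, 8],
    'subsample': [0.5, 0.6, 0.8, 1.0],
    'colsample_bytree': [0.5, 0.6, 0.8, 1.0],
}

# One candidate per combination of values: 4 * 4 * 5 * 4 * 4
n_candidates = math.prod(len(values) for values in param_grid.values())

# Assumption: 5-fold cross-validation, so every candidate is fitted 5 times
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # 1280 6400
```

Adding one more value to any list multiplies the total, so a grid like this grows quickly.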

A grid search with 5-fold cross-validation selects the combination of hyperparameters from the dictionary that gives the highest mean accuracy.

In [23]:
# Create the XGBoost model
model = XGBClassifier(random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=2)

# Perform hyperparameter search
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Cross-Validation Score:", best_score)

Fitting 5 folds for each of 1280 candidates, totalling 6400 fits
Best Parameters: {'colsample_bytree': 0.5, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500, 'subsample': 1.0}
Best Cross-Validation Score: 0.7866719978675196
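The cross-validation score reported here is the mean of the per-fold accuracies for the best candidate. The following is a minimal sketch of that idea in plain Python, not scikit-learn's implementation: `k_fold_indices` and `mean_cv_score` are hypothetical helpers, the per-fold accuracies are made-up numbers, and the folds are contiguous and unshuffled, whereas scikit-learn uses stratified folds for a classifier when `cv` is an integer.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def mean_cv_score(fold_scores):
    """Average the validation scores obtained on each fold."""
    return sum(fold_scores) / len(fold_scores)


folds = list(k_fold_indices(n_samples=10, n_folds=5))
val_sizes = [len(val_idx) for _, val_idx in folds]
print(val_sizes)  # [2, 2, 2, 2, 2]

# Hypothetical per-fold accuracies for a single candidate
fold_scores = [0.80, 0.75, 0.78, 0.81, 0.79]
cv_score = mean_cv_score(fold_scores)
print(round(cv_score, 3))  # 0.786
```

Each sample is used for validation exactly once, and the candidate with the highest mean score is the one reported as best.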


Building the boosting model with the best hyperparameters:

In [24]:
# Refit XGBoost on the full training set with the best hyperparameters
# (XGBClassifier is already imported in the first cell)
model = XGBClassifier(**best_params, random_state=42)
model.fit(X_train, y_train)

In [25]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0])

Accuracy of the boosting model on the test set:

In [42]:
accuracy_score(y_test, y_pred)

0.7857142857142857
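Accuracy is the fraction of test samples whose predicted label matches the true label. The sketch below shows the computation on toy labels (made up for illustration); `accuracy` is a hypothetical helper, not the scikit-learn function. The last line assumes the test set has the 154 rows shown in the prediction array, in which case the reported value corresponds to 121 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels, for illustration only
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]
toy_accuracy = accuracy(y_true_toy, y_pred_toy)
print(toy_accuracy)  # 0.8

# Assumption: 154 test rows, of which 121 are predicted correctly
reported_accuracy = 121 / 154
print(round(reported_accuracy, 4))  # 0.7857
```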


The model is saved as a .sav file using joblib:

In [45]:
joblib.dump(model, "../models/xgb_classifier_42.sav")

['../models/xgb_classifier_42.sav']

Loading the three saved models:

In [46]:
# Decision Tree model
model_dt = "../models/decision_tree_classifier.sav"
with open(model_dt, 'rb') as file:
    loaded_model_dt = pickle.load(file)

# Random Forest model
model_rf = "../models/random_forest_best_hyperparameters.sav"
with open(model_rf, 'rb') as file:
    loaded_model_rf = pickle.load(file)

# Boosting model
loaded_model_boosting = joblib.load('../models/xgb_classifier_42.sav')

Predictions from each model:

In [47]:
y_pred_dt = loaded_model_dt.predict(X_test)
y_pred_rf = loaded_model_rf.predict(X_test)
y_pred_bt = loaded_model_boosting.predict(X_test)

Printing the accuracy for each model:

In [48]:
accuracy_dt = accuracy_score(y_test, y_pred_dt)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
accuracy_bt = accuracy_score(y_test, y_pred_bt)
print(f'Accuracy Decision Tree Model: {accuracy_dt:.2f}')
print(f'Accuracy Random Forest Model: {accuracy_rf:.2f}')
print(f'Accuracy Boosting Model: {accuracy_bt:.2f}')

Accuracy Decision Tree Model: 0.71
Accuracy Random Forest Model: 0.76
Accuracy Boosting Model: 0.79


CONCLUSION:

On the test set, accuracy is lowest for the Decision Tree model (71%) and highest for the Boosting model (79%), with Random Forest in between (76%). For this diabetes detection project, Boosting is therefore the best of the three methods in terms of accuracy.
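One rough way to gauge how much weight these differences can bear is the binomial standard error of an accuracy estimate, sqrt(p * (1 - p) / n). The sketch below is an illustration under stated assumptions, not part of the original analysis: `accuracy_standard_error` is a hypothetical helper, the test rows are treated as independent, and `n_test = 154` is taken from the length of the prediction array shown earlier.

```python
import math


def accuracy_standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n samples."""
    return math.sqrt(p * (1 - p) / n)


n_test = 154  # assumption: rows in the printed prediction array
acc_boosting = 0.7857142857142857  # test accuracy reported for Boosting

se = accuracy_standard_error(acc_boosting, n_test)
print(round(se, 3))  # 0.033
```

Under these assumptions the standard error is about 0.033, similar in size to the gap between the Boosting and Random Forest accuracies, so a larger test set or repeated splits would be one way to confirm the ranking.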