# Pima Indians Diabetes Database

[About Dataset](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)
## Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

## Content
The dataset consists of eight medical predictor variables and one target variable, `Outcome`. The predictor variables are the number of pregnancies the patient has had, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.

## Acknowledgements
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261-265). IEEE Computer Society Press.

## Inspiration
Can you build a machine learning model that accurately predicts whether the patients in the dataset have diabetes?

In [1]:
import random
import numpy as np

random.seed(42)
np.random.seed(100)

In [2]:
import pandas as pd
import os
from f_importance.util.runner import compute_importance

In [3]:
diabetes = pd.read_csv("./dataset/diabetes.csv")

diabetes.sample(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
173,1,79,60,42,48,43.5,0.678,23,0
253,0,86,68,32,0,35.8,0.238,25,0
207,5,162,104,0,0,37.7,0.151,52,1
737,8,65,72,23,0,32.0,0.6,42,0
191,9,123,70,44,94,33.1,0.374,40,0
754,8,154,78,32,0,32.4,0.443,45,1
159,17,163,72,41,114,40.9,0.817,47,1
448,0,104,64,37,64,33.6,0.51,22,1
359,1,196,76,36,249,36.5,0.875,29,1
651,1,117,60,23,106,33.8,0.466,27,0


In [4]:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

In [5]:
def train_evaluate(data: pd.DataFrame, clazz=XGBClassifier, target="Outcome", proba=False, metric=metrics.accuracy_score, n_fold=5, **kwargs):
    """Return the mean `metric` of `clazz` over a stratified `n_fold` cross-validation."""
    X = data.drop(columns=[target])
    y = data[target]
    score = 0
    cv = StratifiedKFold(n_fold, shuffle=True, random_state=42)
    for train_index, test_index in cv.split(X, y):
        # split() yields positional indices, so select rows with .iloc rather than .loc
        X_train, y_train = X.iloc[train_index], y.iloc[train_index]
        X_test, y_test = X.iloc[test_index], y.iloc[test_index]
        # A fresh model per fold, so no state leaks between folds
        model = clazz(**kwargs)
        model.fit(X_train, y_train)
        if proba:
            # Probability of the positive class, for metrics that expect scores
            y_preds = model.predict_proba(X_test)[:, 1]
        else:
            y_preds = model.predict(X_test)
        score += metric(y_test, y_preds)
    return score / n_fold
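
The loop in `train_evaluate` is a stratified k-fold cross-validation that averages one score per fold. As a rough cross-check of that pattern, the sketch below runs a similar manual loop next to scikit-learn's `cross_val_score` and compares the two means. The synthetic data from `make_classification` and the `LogisticRegression` model are stand-ins chosen so the snippet runs without the diabetes CSV or XGBoost; they are assumptions for illustration and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for the diabetes table)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Same splitter configuration as in train_evaluate
cv = StratifiedKFold(5, shuffle=True, random_state=42)

# Manual loop: fit a fresh model per fold and collect its accuracy
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_mean = float(np.mean(manual_scores))

# Library helper: same folds, same estimator settings, same metric
library_mean = float(
    cross_val_score(
        LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="accuracy"
    ).mean()
)

print(manual_mean, library_mean)
```

Because the splitter uses a fixed integer `random_state`, both approaches see the same folds, so the two means should agree.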

In [6]:
print("Accuracy Score : ", train_evaluate(diabetes))

Accuracy Score :  0.7434173669467787


In [7]:
importance = compute_importance(
    model_name = "XGBClassifier",
    method = "DataFold",
    metric = "accuracy_score",
    dataset  = diabetes,
    targets = "Outcome",
    n_gram = (1, 1),
    val_rate = 0.15,
    shuffle = True,
    n = 10,
    is_regression = False,
    n_jobs = os.cpu_count(),
    refit = True,
    seed=41
)

In [8]:
importance

Unnamed: 0,Importance,Split0,Split1,Split2,Split3,Split4,Split5,Split6,Split7,Split8,Split9
,0.732895,0.792208,0.701299,0.753247,0.701299,0.831169,0.701299,0.714286,0.805195,0.684211,0.644737
'Glucose',0.055861,0.714286,0.636364,0.688312,0.662338,0.766234,0.623377,0.662338,0.701299,0.684211,0.631579
'Age',0.023308,0.74026,0.675325,0.701299,0.61039,0.779221,0.675325,0.753247,0.779221,0.710526,0.671053
'Insulin',0.012953,0.753247,0.727273,0.727273,0.675325,0.779221,0.662338,0.714286,0.805195,0.710526,0.644737
'BloodPressure',0.01285,0.779221,0.701299,0.753247,0.623377,0.753247,0.662338,0.727273,0.766234,0.763158,0.671053
'Pregnancies',0.003862,0.714286,0.74026,0.779221,0.688312,0.805195,0.701299,0.714286,0.792208,0.710526,0.644737
'DiabetesPedigreeFunction',0.00123,0.74026,0.688312,0.727273,0.727273,0.792208,0.688312,0.753247,0.818182,0.710526,0.671053
'BMI',-5.1e-05,0.766234,0.74026,0.805195,0.662338,0.818182,0.662338,0.74026,0.766234,0.710526,0.657895
'SkinThickness',-0.003947,0.792208,0.753247,0.766234,0.688312,0.818182,0.714286,0.701299,0.766234,0.723684,0.644737
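
The `Importance` column appears to be the mean score of the unlabeled first row (all features kept) minus the mean score of the row's own splits (that feature removed). This reading is inferred from the numbers in the table, not from the `f_importance` documentation, so treat it as an assumption. The sketch below recomputes two rows from the split scores printed above; the helper name `importance_from_splits` is hypothetical.

```python
from statistics import mean

# Split scores copied from the table above
baseline = [0.792208, 0.701299, 0.753247, 0.701299, 0.831169,
            0.701299, 0.714286, 0.805195, 0.684211, 0.644737]
without_feature = {
    "Glucose": [0.714286, 0.636364, 0.688312, 0.662338, 0.766234,
                0.623377, 0.662338, 0.701299, 0.684211, 0.631579],
    "SkinThickness": [0.792208, 0.753247, 0.766234, 0.688312, 0.818182,
                      0.714286, 0.701299, 0.766234, 0.723684, 0.644737],
}


def importance_from_splits(baseline_scores, ablated_scores):
    """Baseline mean minus ablated mean (assumed definition of Importance)."""
    return mean(baseline_scores) - mean(ablated_scores)


recomputed = {
    name: importance_from_splits(baseline, scores)
    for name, scores in without_feature.items()
}
for name, value in recomputed.items():
    print(f"{name}: {value:.6f}")
```

Under this reading, a positive value means the score fell when the feature was removed, and a negative value (as for `SkinThickness` and `BMI`) means the score was slightly higher without it.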


In [9]:
to_delete = ['SkinThickness', 'Insulin']


to_delete

['SkinThickness', 'Insulin']

In [10]:
print("Accuracy Score : ", train_evaluate(diabetes.drop(columns=[to_delete[0]])))

Accuracy Score :  0.7161276631864867


In [11]:
print("Accuracy Score : ", train_evaluate(diabetes.drop(columns=[to_delete[1]])))

Accuracy Score :  0.7331296154825566


In [12]:
print("Accuracy Score : ", train_evaluate(diabetes.drop(columns=['Age'])))

Accuracy Score :  0.722663610898905


In [13]:
importance2 = compute_importance(
    model_name = "XGBClassifier",
    method = "DataFold",
    metric = "accuracy_score",
    dataset  = diabetes,
    targets = "Outcome",
    n_gram = (1, 2),
    val_rate = 0.15,
    shuffle = True,
    n = 5,
    is_regression = False,
    n_jobs = os.cpu_count(),
    refit = True
)

In [14]:
importance2

Unnamed: 0,Importance,Split0,Split1,Split2,Split3,Split4
,0.744945,0.766234,0.668831,0.681818,0.79085,0.816993
"'Glucose', 'BMI'",0.087327,0.649351,0.584416,0.688312,0.633987,0.732026
"'Glucose', 'Insulin'",0.086054,0.701299,0.623377,0.623377,0.620915,0.72549
"'Glucose', 'DiabetesPedigreeFunction'",0.080876,0.681818,0.62987,0.675325,0.666667,0.666667
"'Glucose', 'Age'",0.07041,0.662338,0.636364,0.681818,0.640523,0.751634
"'Glucose', 'SkinThickness'",0.063916,0.707792,0.62987,0.675325,0.666667,0.72549
'Glucose',0.057457,0.720779,0.642857,0.707792,0.627451,0.738562
"'Pregnancies', 'Glucose'",0.05216,0.720779,0.616883,0.681818,0.699346,0.745098
"'Glucose', 'BloodPressure'",0.046991,0.733766,0.616883,0.714286,0.679739,0.745098
"'BMI', 'DiabetesPedigreeFunction'",0.033877,0.727273,0.642857,0.662338,0.745098,0.777778

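The second run differs from the first mainly in `n_gram = (1, 2)`, and its output lists both single features and pairs. One plausible reading, assumed here rather than taken from the `f_importance` documentation, is that `n_gram = (lo, hi)` asks for every feature subset of size `lo` through `hi`. The sketch below enumerates those subsets for the eight predictor columns to show how many ablations that would imply; `feature_subsets` is a hypothetical helper.

```python
from itertools import combinations

# Predictor columns of the diabetes table, in their original order
features = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
            "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]


def feature_subsets(names, n_gram):
    """All subsets whose size lies in the inclusive range n_gram = (lo, hi)."""
    lo, hi = n_gram
    return [
        combo
        for size in range(lo, hi + 1)
        for combo in combinations(names, size)
    ]


singles = feature_subsets(features, (1, 1))
singles_and_pairs = feature_subsets(features, (1, 2))
print(len(singles), len(singles_and_pairs))
```

Under this reading, `(1, 1)` yields 8 candidates and `(1, 2)` yields 36 (8 single features plus 28 pairs), of which nine appear in the table above.
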

In [15]:
print("Accuracy Score : ", train_evaluate(diabetes.drop(columns=['Glucose'])))

Accuracy Score :  0.6913419913419914
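
To compare the ablation runs side by side, the reported cross-validated accuracies can be expressed as differences from the full-feature baseline of In [6]. The sketch below only does arithmetic on the numbers printed in this notebook; the dictionary layout and variable names are illustrative.

```python
# Cross-validated accuracy with all features (In [6])
baseline_accuracy = 0.7434173669467787

# Cross-validated accuracy after dropping one column
accuracy_without = {
    "SkinThickness": 0.7161276631864867,  # In [10]
    "Insulin": 0.7331296154825566,        # In [11]
    "Age": 0.722663610898905,             # In [12]
    "Glucose": 0.6913419913419914,        # In [15]
}

# Positive drop = accuracy lost by removing the column
drops = {
    name: baseline_accuracy - acc for name, acc in accuracy_without.items()
}

# Print the columns from largest to smallest loss
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<15} {drop:+.4f}")
```

Removing `Glucose` costs the most accuracy (about 0.052), which agrees with its top rank in both importance tables. Removing `SkinThickness` costs about 0.027 here even though its importance in the first table was slightly negative; the two estimates come from different resampling runs, so they need not agree.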
