# Diabetes Health Prediction

**Data Source:** [Kaggle](https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset)

**Members**:
- David Mairena
- Fernando Sirias

In [163]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import warnings
warnings.simplefilter("ignore")
from collections import Counter

In [165]:
# Forward slash avoids the invalid "\d" escape and works on every OS
df = pd.read_csv("data/diabetes_inbalance.csv")
df.shape

(253680, 22)

In [166]:
df.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


In [None]:
df.Diabetes_binary.value_counts(normalize=True)

In [None]:
df.Diabetes_binary.value_counts(normalize=True).plot(kind='bar')
plt.title("Distribution of Diabetes")
plt.xlabel("Diabetes")
plt.ylabel("Normalized count")
plt.show()

In [None]:
df.Sex.value_counts(normalize=True)

In [None]:
plt.figure(figsize=(7, 4))
sns.countplot(data = df, x = 'Sex', hue='Diabetes_binary', dodge=False)
plt.title("Sex countplot by Diabetes")
plt.show()

## Data Cleaning

In [None]:
print("Total N/A:", df.isnull().sum().sum())

In [None]:
df.dtypes

## Exploratory Data Analysis

In [None]:
df.columns

In [None]:
df.head()

---
The `Age` column takes values between 1 and 13 because each value encodes an age range (5 years wide for most codes):
- **1:** 18-24.
- **2:** 25-29.
- **3:** 30-34.
- ...
- **13:** 80 or older.

The next plot shows the number of diabetes cases in each age range. The **65-69** range (code 10) has the most cases. Because these are raw counts, the size of each age group also influences the result.

In [None]:
plt.figure(figsize=(8,4))
temp = df[df.Diabetes_binary == 1].groupby('Age').Diabetes_binary.count()
temp.plot(kind="bar", color='navy')
plt.title("Diabetes by age range")
plt.ylabel("count")
plt.show()
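
The bar chart above counts positive cases, so large age groups can dominate it. As a complement, here is a minimal sketch that turns the age codes into labels and computes the share of positive cases per group. The helpers `age_code_to_label` and `positive_rate_by` and the toy frame are illustrative assumptions, not part of the original notebook, and the mapping assumes that codes 2 to 12 follow the 5-year pattern described above. With the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

def age_code_to_label(code):
    """Map an age code (1-13) to its age-range label."""
    code = int(code)
    if code == 1:
        return "18-24"
    if code == 13:
        return "80+"
    low = 25 + 5 * (code - 2)   # code 2 -> 25, code 3 -> 30, ...
    return f"{low}-{low + 4}"

def positive_rate_by(frame, column, target="Diabetes_binary"):
    """Share of rows with target == 1 within each value of `column`."""
    return frame.groupby(column)[target].mean()

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Age": [1, 1, 10, 10, 10, 13, 13, 13, 13],
    "Diabetes_binary": [0, 0, 1, 1, 0, 1, 0, 0, 0],
})

rates = positive_rate_by(toy, "Age")
rates.index = rates.index.map(age_code_to_label)
print(rates)
```

Plotting `rates` with `kind="bar"` gives a chart that is comparable across age groups of different sizes.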

---
The first plot below shows that, among people with diabetes, roughly **twice** as many have high cholesterol as do not, which suggests that high cholesterol is associated with a higher risk of diabetes.

The second plot shows that people who checked their cholesterol in the past 5 years account for most of the **negative** diabetes cases. One possible explanation is that people who never check their cholesterol receive no early warning that could help them avoid diabetes.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.countplot(data = df[df.Diabetes_binary == 1], x = 'HighChol', hue = 'Diabetes_binary', ax=axes[0])
axes[0].set_title("Diabetes with High cholesterol")
sns.countplot(data = df[df.Diabetes_binary == 0], x = 'Diabetes_binary', hue = 'CholCheck', ax=axes[1])
axes[1].set_title("Cholesterol check in the past 5 years (no diabetes)")
plt.show()
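
To put a number on a statement such as "twice as likely", one option is to compare the diabetes rate between the two groups directly. The sketch below does that with a hypothetical helper, `relative_risk`, on made-up toy data; neither is part of the original notebook, and the real ratio would have to be computed on `df`.

```python
import pandas as pd

def relative_risk(frame, exposure, target="Diabetes_binary"):
    """P(target=1 | exposure=1) divided by P(target=1 | exposure=0)."""
    rates = frame.groupby(exposure)[target].mean()
    return rates.loc[1] / rates.loc[0]

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "HighChol":        [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    "Diabetes_binary": [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

print(relative_risk(toy, "HighChol"))
```

A value close to 1 would mean the two groups have about the same rate, while values well above 1 point to a higher rate in the exposed group.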

---
The following plot shows that people who do physical activities are **less likely to have diabetes.**

This is plausible because exercise helps improve cholesterol levels, blood pressure and body weight. One of the most important factors is that it also helps improve insulin sensitivity.

In [None]:
sns.countplot(data = df, x = 'PhysActivity', hue = 'Diabetes_binary')
plt.show()

---
The following plot suggests that the `BMI` column is not a factor related to diabetes, since both classes of the `Diabetes_binary` column (0, 1) have roughly the same **distribution**.


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.violinplot(data = df, x = 'Diabetes_binary', y = 'BMI',hue='Sex', ax=axes[0])
sns.stripplot(data = df, x = 'Diabetes_binary', y = 'BMI', hue='Sex', ax=axes[1])
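
Violin plots can hide small shifts between groups, so it may help to check the visual impression against a few summary statistics. The following is a minimal sketch, assuming a hypothetical helper `summarize_by_class` and toy values that are not from the dataset.

```python
import pandas as pd

def summarize_by_class(frame, column, target="Diabetes_binary"):
    """Quartiles of `column` for each class of `target`, one row per class."""
    return frame.groupby(target)[column].quantile([0.25, 0.5, 0.75]).unstack()

# Toy data for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "BMI": [22, 24, 26, 28, 30, 27, 31, 33, 35, 39],
    "Diabetes_binary": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

summary = summarize_by_class(toy, "BMI")
print(summary)
```

If the medians and quartiles of the two classes are close on the real data, that supports the reading of the plot; a clear gap would be a reason to revisit it.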

---
Next we created a plot that shows the number of people who consume fruits and vegetables. As we can see, the risk of having diabetes appears higher for people who do not eat vegetables or fruits.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
fig.suptitle('Number of Diabetes Diagnosis on people that consume fruits and vegetables')

# fruits
sns.countplot(data = df, x = 'Fruits', hue = 'Diabetes_binary', ax=axes[0])
axes[0].set_title("Fruits")
axes[0].set_xlabel("Fruit consumer")
axes[0].set_ylabel("Number of Diabetes Diagnosis")

# veggie
sns.countplot(data = df, x = 'Veggies', hue = 'Diabetes_binary', ax=axes[1])
axes[1].set_title("Vegetables")
axes[1].set_xlabel("Vegetable consumer")
axes[1].set_ylabel("Number of Diabetes Diagnosis")

plt.show()

---
We also looked at the diabetes diagnosis for people who are smokers or heavy alcohol consumers.
As the following plot shows, neither habit seems to be correlated with diabetes, because the differences between positive and negative diagnoses are minimal in both subplots.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
fig.suptitle('Number of Diabetes Diagnosis on Smokers and People with Heavy Alcohol Consumption')

# smokers
sns.countplot(data = df, x = 'Smoker', hue = 'Diabetes_binary', ax=axes[0])
axes[0].set_title("Smokers")
axes[0].set_ylabel("Number of Diabetes Diagnosis")

# heavy alcohol consumption
sns.countplot(data = df, x = 'HvyAlcoholConsump', hue = 'Diabetes_binary', ax=axes[1])
axes[1].set_title("Heavy Alcohol Consumption")
axes[1].set_xlabel("Heavy Alcohol Consumption")
axes[1].set_ylabel("Number of Diabetes Diagnosis")
plt.show()

---
At the beginning we thought that the `Education` and `Income` columns would not be related to whether a person has diabetes, but the following plots do show a **difference between the classes**: the ratio of diabetes to no-diabetes cases varies considerably across levels. We are therefore going to keep these features when building the model.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
sns.countplot(data = df, x = 'Education', hue='Diabetes_binary', ax=axes[0, 0])
sns.countplot(data = df, x = 'Income', hue = 'Diabetes_binary', ax=axes[0, 1])
axes[0, 0].set_title("Diabetes by Education level")
axes[0, 1].set_title("Diabetes by Income level")
temp0 = df[df.Diabetes_binary == 0].groupby('Education').Education.count()
temp1 = df[df.Diabetes_binary == 1].groupby('Education').Education.count()
temp2 = df[df.Diabetes_binary == 0].groupby('Income').Income.count()
temp3 = df[df.Diabetes_binary == 1].groupby('Income').Income.count()
(temp1/temp0).plot(kind='bar', ax= axes[1, 0], color='#2C3E50')
(temp3/temp2).plot(kind='bar', ax= axes[1, 1], color ='#2C3E50')
axes[1,0].set_title("Diabetes/No Diabetes Ratio by Education")
axes[1,1].set_title("Diabetes/No Diabetes Ratio by Income")
plt.show()

## Feature Engineering & Model Creation

In [167]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, precision_score, recall_score, accuracy_score

In [None]:
df_corr = df.corr()
plt.figure(figsize=(10, 7))
sns.heatmap(df_corr)
plt.show()

In [None]:
#Q1 = df.quantile(0.25)
#Q3 = df.quantile(0.75)
#IQR = Q3 - Q1
#
#df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
#print("Shape with Outliers:", df.shape)
#print("Shape without Outliers:", df_out.shape)

In [168]:
features = df.drop('Diabetes_binary', axis=1)
target = df.Diabetes_binary

In [None]:
print("----- Unique values in each column -----")
for i in df.columns:
    print(f"{i} uniques: {len(df[i].unique())}")

### Feature Scaling

In [169]:
scaler = StandardScaler()
t = np.asarray(features.BMI)
t = t.reshape(-1, 1)
features.BMI = scaler.fit_transform(t)
features.head()

Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,1.0,1.0,1.0,1.757936,1.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,-0.511806,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,1.0,1.0,1.0,-0.057858,0.0,0.0,0.0,0.0,1.0,0.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,1.0,0.0,1.0,-0.209174,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,1.0,1.0,1.0,-0.663122,0.0,0.0,0.0,1.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
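
The cell above fits the scaler on every row, including rows that later end up in the test set. A common alternative is to learn the mean and standard deviation from the training rows only and reuse them on the held-out rows. The sketch below illustrates that order on a toy stand-in for `features`; the names `toy_features`, `train_part`, `test_part` and `bmi_scaler` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the notebook's `features` (illustration only)
rng = np.random.default_rng(0)
toy_features = pd.DataFrame({
    "BMI": rng.normal(28, 6, size=200),
    "HighBP": rng.integers(0, 2, size=200).astype(float),
})

train_part, test_part = train_test_split(toy_features, test_size=0.3, random_state=42)
train_part, test_part = train_part.copy(), test_part.copy()

bmi_scaler = StandardScaler()
# Learn mean and standard deviation from the training rows only
train_part["BMI"] = bmi_scaler.fit_transform(train_part[["BMI"]]).ravel()
# Reuse the training statistics on the held-out rows
test_part["BMI"] = bmi_scaler.transform(test_part[["BMI"]]).ravel()

print(train_part["BMI"].mean(), train_part["BMI"].std(ddof=0))
```

Passing `train_part[["BMI"]]` (double brackets) hands the scaler a 2-D frame, so no manual `reshape(-1, 1)` is needed.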


### Over-sampling

In [170]:
from imblearn.over_sampling import RandomOverSampler

rus = RandomOverSampler(random_state=42)
x_rus, y_rus = rus.fit_resample(features, target)

print('original dataset shape:', Counter(target))
print('Resample dataset shape', Counter(y_rus))

original dataset shape: Counter({0.0: 218334, 1.0: 35346})
Resample dataset shape Counter({0.0: 218334, 1.0: 218334})


In [171]:
X_train, X_test, y_train, y_test = train_test_split(x_rus, y_rus, test_size=0.3, random_state=42)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

X_train: (305667, 21)
X_test: (131001, 21)
y_train: (305667,)
y_test: (131001,)
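
One thing to be aware of: the cells above resample before splitting, so copies of the same minority row can land in both `X_train` and `X_test`, which can make test scores look better than they would on unseen data. A minimal sketch of the alternative order (split first, then oversample only the training rows) follows. It uses `sklearn.utils.resample` as a stand-in for `RandomOverSampler`, and the helper `split_then_oversample` and the toy data are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_then_oversample(features, target, test_size=0.3, random_state=42):
    """Hold out a test set first, then balance only the training rows."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, target, test_size=test_size,
        random_state=random_state, stratify=target)

    train = X_tr.copy()
    train["_target"] = y_tr
    n_major = train["_target"].value_counts().max()

    parts = []
    for _, group in train.groupby("_target"):
        if len(group) < n_major:
            # Sample minority rows with replacement up to the majority size
            group = resample(group, replace=True, n_samples=n_major,
                             random_state=random_state)
        parts.append(group)

    balanced = pd.concat(parts).sample(frac=1.0, random_state=random_state)
    return balanced.drop(columns="_target"), balanced["_target"], X_te, y_te

# Toy stand-in for `features` / `target` (illustration only)
toy_X = pd.DataFrame({"BMI": np.linspace(18, 45, 1000)})
toy_y = pd.Series((np.arange(1000) % 7 == 0).astype(float))

X_bal, y_bal, X_hold, y_hold = split_then_oversample(toy_X, toy_y)
print(y_bal.value_counts())
```

With this order the held-out set keeps the original class imbalance and contains no duplicated rows, so its metrics are closer to what the model would see in practice.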


---
>### K-Nearest Neighbor

In [172]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
#knn_scores = []
#for i in range(7,20):
#    print(f"Running: {i} neighbors")
#    knn = KNeighborsClassifier(n_neighbors=i)
#    knn.fit(X_train, y_train)
#    y_preds = knn.predict(X_test)
#    knn_scores.append(accuracy_score(y_test, y_preds))

In [None]:
#plt.plot(range(7,20), knn_scores)
#plt.xlabel("Number of Neighbors")
#plt.ylabel("Testing Accuracy")

In [173]:
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_test)

KeyboardInterrupt: 

In [None]:
print("----- K-Nearest Neighbor Metrics -----")
print("Accuracy:", accuracy_score(y_test, knn_preds))
print("AUC:", roc_auc_score(y_test, knn_preds))
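
A note on the AUC lines in this notebook: `roc_auc_score` is given hard 0/1 predictions, and in that case the result equals the balanced accuracy, which on a balanced test set is expected to stay very close to plain accuracy. To measure ranking quality, the function is usually fed scores or probabilities instead. The sketch below shows the difference on synthetic data from `make_classification`; the data and variable names are assumptions for illustration, not results from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data for illustration only
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)            # 0/1 predictions
scores = model.predict_proba(X_te)[:, 1]     # probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, scores)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from hard labels:", auc_from_labels)
print("Balanced accuracy:   ", balanced_acc)
print("AUC from scores:     ", auc_from_scores)
```

The same pattern applies to the other classifiers used here, since `XGBClassifier`, `MLPClassifier` and `RandomForestClassifier` all expose `predict_proba`.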

---
>### XGBClassifier

In [174]:
from xgboost import XGBClassifier, DMatrix
import xgboost

In [175]:
xgbc = XGBClassifier(n_estimators=1500, max_depth=20)
xgbc.fit(X_train, y_train)
xgbc_preds = xgbc.predict(X_test)

In [176]:
print("----- XGBClassifier Metrics -----")
print("Accuracy:", accuracy_score(y_test, xgbc_preds))
print("AUC:", roc_auc_score(y_test, xgbc_preds))

----- XGBClassifier Metrics -----
Accuracy: 0.9337027961618614
AUC: 0.9337791229099801


In [None]:
params = {'max_depth': [3, 6, 9, 10, 15],
        'min_child_weight': [1, 5, 10],
        'subsample': [0.5, 1]}

grid = GridSearchCV(xgbc, param_grid=params)
grid.fit(X_train, y_train)

In [None]:
pd.DataFrame(grid.cv_results_).sort_values("rank_test_score").head()
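
Before launching a grid search like the one above, it can be useful to estimate how many model fits it implies. The short sketch below counts them with `ParameterGrid`, assuming the default 5-fold cross-validation that `GridSearchCV` uses when `cv` is not given and the default `refit=True`.

```python
from sklearn.model_selection import ParameterGrid

params = {'max_depth': [3, 6, 9, 10, 15],
          'min_child_weight': [1, 5, 10],
          'subsample': [0.5, 1]}

n_candidates = len(ParameterGrid(params))   # 5 * 3 * 2 combinations
n_folds = 5                                 # GridSearchCV default when cv is not given
n_fits = n_candidates * n_folds + 1         # plus one final refit on all training data

print("Candidates:", n_candidates)
print("Total fits:", n_fits)
```

Since each fit here trains an `XGBClassifier` with `n_estimators=1500` on more than 300,000 rows, reducing the grid, lowering `n_estimators` during the search, or passing `n_jobs=-1` are options worth considering.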

---
>### Neural Network Model

In [177]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score 

mlp = MLPClassifier(hidden_layer_sizes=(21,21,21), activation='relu', solver='adam', max_iter=500)
mlp.fit(X_train,y_train)

predict_train = mlp.predict(X_train)
predict_test = mlp.predict(X_test)


y_preds = mlp.predict(X_test)
print("Accuracy",  accuracy_score(y_test,y_preds))
print("Accuracy_auc",  roc_auc_score(y_test,y_preds))


Accuracy 0.753505698429783
Accuracy_auc 0.7535675795143225


---
>### Random Forest 

In [178]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1500, random_state=1, max_depth=100, max_samples=10000)
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)
print("Accuracy",  accuracy_score(y_test,y_preds))
print("Accuracy_auc",  roc_auc_score(y_test,y_preds))

Accuracy 0.7861314035770719
Accuracy_auc 0.7862234695013852


---
>### TPOT

In [179]:
from tpot import TPOTClassifier
from sklearn.model_selection import StratifiedKFold

In [180]:
cv = StratifiedKFold(n_splits=6)
tpot = TPOTClassifier(generations=4, population_size=50, scoring='accuracy', cv=cv, verbosity=2, random_state=1, n_jobs=-1)
tpot.fit(X_train, y_train)
tpot.export('tpot_cv_best_model.py')


Generation 1 - Current best internal CV score: 0.7673023290504606

Generation 2 - Current best internal CV score: 0.7673023290504606

Generation 3 - Current best internal CV score: 0.7673023290504606

Generation 4 - Current best internal CV score: 0.7673023290504606

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.1, min_samples_leaf=20, min_samples_split=13, n_estimators=100)


In [None]:
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

exported_pipeline = make_pipeline(
    StackingEstimator(estimator=MLPClassifier(alpha=0.1, learning_rate_init=0.001)),
    KNeighborsClassifier(n_neighbors=78, p=2, weights="distance")
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(X_train, y_train)
results = exported_pipeline.predict(X_test)

In [None]:
print("----- TPOT KNN Metrics -----")
print("Accuracy:", accuracy_score(y_test, results))

### PyCaret