# Diabetes Classification Dataset

This data comes from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey that the CDC conducts annually. The goal of this project is to train machine learning classifiers and compare their performance across model types and dataset characteristics.

The data is the linked dataset from the University of California Irvine ML Repository. That entry points to a Kaggle dataset, which was itself cleaned from the original CDC data posted on Kaggle, so the data passed through several hands before reaching this notebook.

The two CSV files of interest to compare model performance are:
1. `diabetes_binary_5050split_health_indicators_BRFSS2015.csv` is a clean dataset of 70,692 survey responses to the CDC's BRFSS2015. It has an equal 50-50 split between respondents with no diabetes and respondents with either prediabetes or diabetes. The target variable `Diabetes_binary` has two classes: 0 for no diabetes and 1 for prediabetes or diabetes. This dataset has 21 feature variables and is balanced.

2. `diabetes_binary_health_indicators_BRFSS2015.csv` is a clean dataset of 253,680 survey responses to the CDC's BRFSS2015. The target variable `Diabetes_binary` has two classes: 0 for no diabetes and 1 for prediabetes or diabetes. This dataset has 21 feature variables and is not balanced.

We will compare model performance between these two datasets to determine whether balancing the data helps with binary classification of diabetes/prediabetes.

https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

In [34]:
# Import all necessary libraries for the project
import numpy as np 
import pandas as pd
import sklearn 

In [35]:
# Read the balanced data into a DataFrame in pandas
data_balanced = pd.read_csv('diabetes_prediction/diabetes_binary_5050split_health_indicators_BRFSS2015.csv')

# Read the unbalanced data into a DataFrame in pandas
data_unbalanced = pd.read_csv('diabetes_prediction/diabetes_binary_health_indicators_BRFSS2015.csv')

# Exploratory Data Analysis

In [44]:
# Compare dataframe sizes 
print(f"Shape of balanced:{data_balanced.shape}, Shape of un-balanced: {data_unbalanced.shape}")

Shape of balanced:(70692, 22), Shape of un-balanced: (253680, 22)


In [36]:
# Checking to make sure both datasets have the same columns
print(data_balanced.columns)
print(data_unbalanced.columns)

Index(['Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')
Index(['Diabetes_binary', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')
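Comparing the two printed indexes by eye works for 22 columns, but the check can also be made explicit. The following is a minimal sketch on two small stand-in frames; `df_a` and `df_b` are hypothetical names, and in this notebook they would correspond to `data_balanced` and `data_unbalanced`.

```python
import pandas as pd

# Hypothetical stand-ins for data_balanced and data_unbalanced
df_a = pd.DataFrame({"Diabetes_binary": [0.0, 1.0], "HighBP": [1.0, 0.0], "BMI": [26.0, 31.0]})
df_b = pd.DataFrame({"Diabetes_binary": [0.0, 0.0], "HighBP": [0.0, 1.0], "BMI": [22.0, 28.0]})

# True only if both frames have the same column labels in the same order
same_columns = df_a.columns.equals(df_b.columns)

# Labels present in one frame but not the other (empty when they match)
only_in_a = df_a.columns.difference(df_b.columns).tolist()
only_in_b = df_b.columns.difference(df_a.columns).tolist()

print(same_columns, only_in_a, only_in_b)
```

If the columns ever diverge, the two difference lists show which labels are missing on each side.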


In [37]:
# Data types for all columns 
print(data_balanced.dtypes)

# Print out the first five rows of the balanced dataset 
data_balanced.head()

Diabetes_binary         float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object


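| | Diabetes_binary | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 3.0 | 5.0 | 30.0 | 0.0 | 1.0 | 4.0 | 6.0 | 8.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 26.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 | 6.0 | 8.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 10.0 | 0.0 | 1.0 | 13.0 | 6.0 | 8.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 1.0 | 11.0 | 6.0 | 8.0 |
| 4 | 0.0 | 0.0 | 0.0 | 1.0 | 29.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 5.0 | 8.0 |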


In [42]:
# Check for columns with NA values
print(data_balanced.isna().sum())

Diabetes_binary         0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64


In [50]:
# The target is coded 0/1, so sum / length gives the fraction (not the percentage) of positives
print("Fraction of diabetes/prediabetes in balanced set:", data_balanced['Diabetes_binary'].sum() / len(data_balanced))

print("Fraction of diabetes/prediabetes in unbalanced set:", data_unbalanced['Diabetes_binary'].sum() / len(data_unbalanced))

Fraction of diabetes/prediabetes in balanced set: 0.5
Fraction of diabetes/prediabetes in unbalanced set: 0.13933301797540207
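These class proportions set a floor for any accuracy figure: a model that always predicts the more common class is correct `max(p, 1 - p)` of the time, where `p` is the positive-class proportion. A small worked sketch using the two values printed above (they describe the full datasets, not a particular train/test split):

```python
# Positive-class proportions copied from the output above
positive_rate = {
    "balanced": 0.5,
    "unbalanced": 0.13933301797540207,
}

# Always predicting the more common class is right max(p, 1 - p) of the time
baseline_accuracy = {name: max(p, 1 - p) for name, p in positive_rate.items()}

for name, acc in baseline_accuracy.items():
    print(f"{name}: majority-class baseline accuracy = {acc:.4f}")
```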


## Determine which columns we can remove before modeling 

In [81]:
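# Pearson correlation of every column with the target, on the balanced data
correlations = data_balanced.corrwith(data_balanced['Diabetes_binary'])
# Sorted by signed value, so strong negative correlations appear at the bottom
print(correlations.sort_values(ascending=False))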

Diabetes_binary         1.000000
GenHlth                 0.407612
HighBP                  0.381516
BMI                     0.293373
HighChol                0.289213
Age                     0.278738
DiffWalk                0.272646
PhysHlth                0.213081
HeartDiseaseorAttack    0.211523
Stroke                  0.125427
CholCheck               0.115382
MentHlth                0.087029
Smoker                  0.085999
Sex                     0.044413
NoDocbcCost             0.040977
AnyHealthcare           0.023191
Fruits                 -0.054077
Veggies                -0.079293
HvyAlcoholConsump      -0.094853
PhysActivity           -0.158666
Education              -0.170481
Income                 -0.224449
dtype: float64
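One way to turn this ranking into a list of removal candidates is to filter on the absolute correlation, so that negatively correlated features such as `Income` are not mistaken for weak ones. The sketch below uses a handful of the values printed above; the `threshold` of 0.05 is a hypothetical cutoff chosen for illustration, not a value used in this notebook.

```python
import pandas as pd

# A few correlations with Diabetes_binary, copied from the output above
correlations = pd.Series({
    "GenHlth": 0.407612,
    "HighBP": 0.381516,
    "Sex": 0.044413,
    "NoDocbcCost": 0.040977,
    "AnyHealthcare": 0.023191,
    "Fruits": -0.054077,
    "Income": -0.224449,
})

threshold = 0.05  # hypothetical cutoff, not a value used in the notebook

# Filter on absolute value so strong negative correlations are kept
weak = correlations[correlations.abs() < threshold].index.tolist()
print(weak)
```

Pearson correlation only captures linear, one-feature-at-a-time association, so a list like this is a starting point for removal rather than a final decision.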


# Data prep for modeling

In [66]:
from sklearn.model_selection import train_test_split

random_state = 12

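# Balanced Model variables
# Note: GenHlth is dropped here but kept in the unbalanced features below,
# so the two models are trained on different feature sets.
X_bal = data_balanced.drop(["Diabetes_binary", "GenHlth"], axis = 1)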
y_bal = data_balanced['Diabetes_binary']

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(X_bal, y_bal, test_size=.2, random_state=random_state)

# Unbalanced Model variables
X_unbal = data_unbalanced.drop("Diabetes_binary", axis = 1)
y_unbal = data_unbalanced['Diabetes_binary']

X_train_u, X_test_u, y_train_u, y_test_u = train_test_split(X_unbal, y_unbal, test_size=.2, random_state=random_state)
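The splits above are purely random, so the class proportions in the train and test sets can drift slightly from those of the full data. `train_test_split` accepts a `stratify` argument that preserves them. A minimal sketch on synthetic stand-in data (the arrays `X` and `y` are made up for illustration, with 14% positives to resemble the unbalanced set):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(12)

# Synthetic stand-in data: 1,000 rows, 140 positives (14%)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:140] = 1

# stratify=y keeps the class proportions (nearly) equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

print(y_train.mean(), y_test.mean())
```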


In [67]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


clf_bal = LogisticRegression(random_state=random_state, max_iter=1000).fit(X_train_b, y_train_b)

clf_unbal = LogisticRegression(random_state=random_state, max_iter=1000).fit(X_train_u, y_train_u)

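# Alternative (not run): standardize the features before fitting.
# StandardScaler and make_pipeline are imported only for these two lines.
# clf_bal = make_pipeline(StandardScaler(), LogisticRegression(random_state=random_state, max_iter=1000)).fit(X_train_b, y_train_b)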

# clf_unbal = make_pipeline(StandardScaler(), LogisticRegression(random_state=random_state, max_iter=1000)).fit(X_train_u, y_train_u)
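Training on a resampled 50-50 dataset is one way to handle class imbalance; another is to keep the unbalanced data and reweight the classes during fitting. This is a different technique from the one used in this notebook, shown only for comparison. The sketch below uses `LogisticRegression`'s `class_weight="balanced"` option on synthetic data generated with `make_classification`, which stands in for the real unbalanced set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unbalanced data: roughly 86% negatives
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.86, 0.14], random_state=12
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# class_weight="balanced" weights samples inversely to their class frequency
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_train, y_train)

for name, model in [("plain", plain), ("weighted", weighted)]:
    recall = recall_score(y_test, model.predict(X_test))
    print(f"{name}: recall on the positive class = {recall:.3f}")
```

Reweighting typically trades some overall accuracy for better recall on the minority class; whether that holds on the BRFSS data would need to be checked on the actual files.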

In [73]:
from sklearn.metrics import accuracy_score
y_pred_b = clf_bal.predict(X_test_b)

accuracy_score(y_pred=y_pred_b, y_true=y_test_b)

0.7364028573449325

In [74]:
from sklearn.metrics import accuracy_score
y_pred_u = clf_unbal.predict(X_test_u)

accuracy_score(y_pred=y_pred_u, y_true=y_test_u)

0.8626813308104699
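The two accuracies above are not directly comparable. The models are scored on different test sets, and about 86% of the rows in the full unbalanced dataset are class 0 (1 - 0.1393), so a model that always predicts 0 would already score close to 0.8627. Metrics that account for class balance make the comparison fairer. The sketch below, run on synthetic stand-in data rather than the BRFSS files, compares a majority-class `DummyClassifier` with logistic regression using accuracy, balanced accuracy, and the confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the unbalanced data: roughly 86% negatives
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.86, 0.14], random_state=12
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12, stratify=y
)

models = {
    "majority baseline": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # Mean of per-class recall; 0.5 for a model that ignores the features
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
    }
    print(name, scores[name])
    # Rows are true classes, columns are predicted classes
    print(confusion_matrix(y_test, y_pred))
```

Applied to this notebook, the same metrics could be computed from `y_test_b`/`y_pred_b` and `y_test_u`/`y_pred_u` to see how much of the unbalanced model's accuracy comes from the majority class.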