# Diabetes Classification Dataset

This data comes from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey conducted annually by the CDC. The goal of this project is to train machine learning classification models and compare their performance across model types and dataset characteristics.

The original data is the linked dataset from the University of California Irvine ML Repository. That link opens a Kaggle dataset, which was obtained from the original CDC dataset posted on Kaggle.

The CSV file used to compare model performance:
1. diabetes_012_health_indicators_BRFSS2015.csv is a dataset of 253,680 survey responses to the CDC's BRFSS 2015. The target variable Diabetes_012 has 3 classes: 0 for no diabetes or diabetes only during pregnancy, 1 for prediabetes, and 2 for diabetes. The dataset is class-imbalanced and has 21 feature variables.


https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

### Data Imports, package imports and dataframe setup

In [76]:
# Import all necessary libraries for the project
import numpy as np 
import pandas as pd
import sklearn 
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

In [77]:
# Read the data into a DataFrame in pandas
data = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')

# Exploratory Data Analysis

In [78]:
# Shape of the data
print(f"Shape of dataset:{data.shape}")

# All column names
print(data.columns)

Shape of dataset:(253680, 22)
Index(['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')


In [79]:
# Data types for all columns 
print(data.dtypes)

# Print out the first five rows 
data.head()

Diabetes_012            float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object


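| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |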


In [80]:
# Convert the binary columns (and the 3-class target Diabetes_012) from float to int; they hold only whole-number codes
non_binary_columns = ['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']

# Loop through all columns in the dataframe
for column in data.columns:
    # If the column is not in the list of non-binary columns, convert it to int
    if column not in non_binary_columns:
        data[column] = data[column].astype(int)


# Print out all dtypes after the conversion to int
print(data.dtypes)

Diabetes_012              int64
HighBP                    int64
HighChol                  int64
CholCheck                 int64
BMI                     float64
Smoker                    int64
Stroke                    int64
HeartDiseaseorAttack      int64
PhysActivity              int64
Fruits                    int64
Veggies                   int64
HvyAlcoholConsump         int64
AnyHealthcare             int64
NoDocbcCost               int64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                  int64
Sex                       int64
Age                     float64
Education               float64
Income                  float64
dtype: object
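
A side note on memory for the cast above: `int64` occupies the same 8 bytes per value as `float64`, so the conversion changes the dtype but not the footprint. The sketch below illustrates this on a hypothetical stand-in column (`flags` is synthetic, not the survey data) and shows that a narrower type such as `int8` is what actually reduces memory. Treat it as an illustration rather than a change to the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one binary column of the survey data
rng = np.random.default_rng(0)
flags = pd.Series(rng.integers(0, 2, size=1_000).astype("float64"), name="HighBP")

as_int64 = flags.astype("int64")  # same width as float64: 8 bytes per value
as_int8 = flags.astype("int8")    # 0/1 values fit in a single byte

# Memory of the values only (index excluded)
bytes_float64 = flags.memory_usage(index=False)
bytes_int64 = as_int64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)

print(bytes_float64, bytes_int64, bytes_int8)
```

If memory is a concern, casting the 0/1 columns to `int8` is one option; whether it matters for training speed depends on the estimator, since many scikit-learn models convert inputs to floats internally.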


# Scale our non-binary data to help with modeling

In [81]:
# Initialize MinMaxScaler for scaling data 
scaler = MinMaxScaler()
data[non_binary_columns] = scaler.fit_transform(data[non_binary_columns])
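
Note that the cell above fits the scaler on the full dataset, so the minimum and maximum of rows that later land in the test set influence the scaling. One common alternative, sketched here on synthetic data, is to split first, fit the scaler on the training rows only, and reuse the fitted scaler on the test rows. The names `toy`, `toy_train`, `toy_test`, and `toy_scaler` are hypothetical and the values are randomly generated, not taken from the BRFSS file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for two of the non-binary columns
rng = np.random.default_rng(12)
toy = pd.DataFrame({
    "BMI": rng.uniform(12, 98, size=200),
    "Age": rng.integers(1, 14, size=200).astype(float),
})

# Split before scaling so the test rows play no part in fitting
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=12)

toy_scaler = MinMaxScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learn min/max from training rows only
test_scaled = toy_scaler.transform(toy_test)        # apply the same min/max to test rows

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
print(test_scaled.shape)
```

With this ordering, scaled test values can fall slightly outside [0, 1] when a test row exceeds the training range, which is expected behavior.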

In [82]:
# Encode the diabetes column as binary: 0 = no diabetes, 1 = prediabetes or diabetes
data['Diabetes_binary'] = (data['Diabetes_012'] > 0).astype(int)

# Drop the original 'Diabetes_012' column so the target does not leak into the features
data = data.drop('Diabetes_012', axis = 1)

In [83]:
# Check for columns with NA values
print(data.isna().sum())
data.dtypes

HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
Diabetes_binary         0
dtype: int64


HighBP                    int64
HighChol                  int64
CholCheck                 int64
BMI                     float64
Smoker                    int64
Stroke                    int64
HeartDiseaseorAttack      int64
PhysActivity              int64
Fruits                    int64
Veggies                   int64
HvyAlcoholConsump         int64
AnyHealthcare             int64
NoDocbcCost               int64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                  int64
Sex                       int64
Age                     float64
Education               float64
Income                  float64
Diabetes_binary           int64
dtype: object

In [84]:
print("Proportion of diabetes/prediabetes in data set:", data['Diabetes_binary'].sum() / len(data))

Proportion of diabetes/prediabetes in data set: 0.15758830022075054


## Determine which columns we can remove before modeling 

In [85]:
correlations = data.corrwith(data['Diabetes_binary'])
print(correlations.sort_values(ascending=False))

Diabetes_binary         1.000000
GenHlth                 0.300785
HighBP                  0.270334
BMI                     0.223851
DiffWalk                0.222155
HighChol                0.210290
Age                     0.185891
HeartDiseaseorAttack    0.176933
PhysHlth                0.174948
Stroke                  0.104800
MentHlth                0.074971
CholCheck               0.067879
Smoker                  0.062778
NoDocbcCost             0.038025
Sex                     0.029606
AnyHealthcare           0.014079
Fruits                 -0.042088
HvyAlcoholConsump      -0.056682
Veggies                -0.059219
PhysActivity           -0.121392
Education              -0.131803
Income                 -0.172794
dtype: float64
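
One way to turn the ranking above into a list of removal candidates is to flag columns whose absolute correlation with the target falls below a cutoff. The sketch below copies a handful of values from the output above; the `threshold` of 0.05 is a hypothetical choice for illustration, not a value used in this analysis. Pearson correlation only captures linear association, so a weak value is a hint rather than proof that a column is uninformative.

```python
import pandas as pd

# A few correlations copied from the output above
corr_subset = pd.Series({
    "GenHlth": 0.300785,
    "HighBP": 0.270334,
    "Sex": 0.029606,
    "AnyHealthcare": 0.014079,
    "Fruits": -0.042088,
    "Income": -0.172794,
})

threshold = 0.05  # hypothetical cutoff, chosen only for illustration

# Use the absolute value so strong negative correlations are kept
weak = corr_subset[corr_subset.abs() < threshold].index.tolist()
print(weak)  # ['Sex', 'AnyHealthcare', 'Fruits']
```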


# Data prep for modeling

This dataset is class-imbalanced: only about 16% of respondents have diabetes or prediabetes. To give the model enough minority-class examples to learn from, we oversample the training data with SMOTE (Synthetic Minority Over-sampling Technique) from the imbalanced-learn package. Applying SMOTE to the training split only keeps synthetic samples out of the test set and avoids data leakage.

In [86]:
random_state = 12

# Balanced Model variables
X = data.drop(["Diabetes_binary", "GenHlth"], axis = 1)
y = data['Diabetes_binary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=random_state)

smote = SMOTE(random_state=random_state)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
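
To make the oversampling step less of a black box, the sketch below shows the core idea behind SMOTE: each synthetic row is an interpolation between a minority-class row and one of its nearest minority-class neighbors. This is a simplified illustration in plain NumPy, not imbalanced-learn's implementation; the function `smote_like_samples` and the array `toy_minority` are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=5, seed=12):
    """Simplified SMOTE-style interpolation between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances among minority rows
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a row is not its own neighbor

    k = min(k, len(minority) - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per row

    base_idx = rng.integers(0, len(minority), size=n_new)  # rows to start from
    nn_choice = rng.integers(0, k, size=n_new)             # which neighbor to use
    nn_idx = neighbors[base_idx, nn_choice]
    lam = rng.random((n_new, 1))                           # interpolation weight in [0, 1)

    # New point lies on the segment between the base row and its neighbor
    return minority[base_idx] + lam * (minority[nn_idx] - minority[base_idx])

toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(toy_minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because the new rows are interpolated, binary columns such as HighBP can receive values strictly between 0 and 1. imbalanced-learn also provides `SMOTENC` for data that mixes categorical and continuous features, which may be worth considering here.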

# Model Testing

In [87]:
clf = LogisticRegression(random_state=random_state, max_iter=1000).fit(X_train_balanced, y_train_balanced)

y_pred_b = clf.predict(X_test)

accuracy_score(y_pred=y_pred_b, y_true=y_test)

0.7167100283822138
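
Accuracy alone can be misleading on imbalanced data. With roughly 15.8% positives in the full dataset, a model that always predicts "no diabetes" would score about 0.84, which is higher than the 0.717 above, so the balanced model's value has to show up in how many positives it catches. The sketch below uses small hypothetical label arrays (`toy_true`, `toy_majority`, `toy_model`), not the notebook's predictions, to show how recall, precision, and the confusion matrix expose what accuracy hides.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 10 test rows, 2 of them positive
toy_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
toy_majority = np.zeros_like(toy_true)                  # always predicts class 0
toy_model = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])    # finds both positives, 3 false alarms

majority_acc = accuracy_score(toy_true, toy_majority)   # 0.8
model_acc = accuracy_score(toy_true, toy_model)         # 0.7
model_recall = recall_score(toy_true, toy_model)        # 1.0
model_precision = precision_score(toy_true, toy_model)  # 0.4

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_model)
print(majority_acc, model_acc, model_recall, model_precision)
print(cm)
```

In the notebook itself, the same functions could be called with `y_test` and `y_pred_b` to see how the logistic regression trades precision for recall on the diabetes class.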