# Diabetes Classification Dataset

This data comes from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey conducted annually by the CDC. The goal of this project is to train several classification models and compare their performance by model type and dataset characteristics. Identifying respondents at higher risk of diabetes could help prevent adverse health outcomes linked to undiagnosed diabetes or prediabetes, and earlier detection could support both patients and physicians in reaching a diagnosis.

The dataset is listed in the University of California Irvine (UCI) Machine Learning Repository. The UCI page links to a Kaggle dataset, which was in turn derived from the original CDC data posted on Kaggle.

The CSV file used to compare model performance:
- diabetes_012_health_indicators_BRFSS2015.csv contains 253,680 survey responses to the CDC's BRFSS 2015. The target variable Diabetes_012 has 3 classes: 0 for no diabetes or diabetes only during pregnancy, 1 for prediabetes, and 2 for diabetes. The classes are imbalanced. The dataset has 21 feature variables.


Kaggle dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

UCI ML Repository: https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators

### Data Imports, package imports and dataframe setup

In [2]:
# Import all necessary libraries for the project
import numpy as np 
import pandas as pd
import sklearn 
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

In [3]:
# Read the data into a DataFrame in pandas
data = pd.read_csv('diabetes_012_health_indicators_BRFSS2015.csv')

# Data Cleaning and Exploratory Data Analysis

In [4]:
# Shape of the data
print(f"Shape of dataset:{data.shape}")

# MegaByte size of the file
total_memory_usage_bytes = data.memory_usage(deep=True).sum()
megabytes = total_memory_usage_bytes / (1024**2)
print(f'Total Memory Usage: {megabytes:.2f} MB')

# All column names
print(data.columns)

Shape of dataset:(253680, 22)
Total Memory Usage: 42.58 MB
Index(['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'BMI', 'Smoker',
       'Stroke', 'HeartDiseaseorAttack', 'PhysActivity', 'Fruits', 'Veggies',
       'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'GenHlth',
       'MentHlth', 'PhysHlth', 'DiffWalk', 'Sex', 'Age', 'Education',
       'Income'],
      dtype='object')


In [5]:
# Data types for all columns 
print(data.dtypes)

# Print out the first five rows 
data.head()

Diabetes_012            float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object


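| | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |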


In [6]:
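# Convert the binary columns (and the 3-class target) from float to int so their types reflect the data.
# Note: the int64 dtype shown below takes the same 8 bytes per value as float64, so this alone does not reduce memory.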
non_binary_columns = ['BMI', 'GenHlth', 'MentHlth', 'PhysHlth', 'Age', 'Education', 'Income']

# Loop through all columns in the dataframe
for column in data.columns:
    # If the column is not in the list of non-binary columns, convert it to int
    if column not in non_binary_columns:
        data[column] = data[column].astype(int)


# Print out all dtypes after conversion to int
print(data.dtypes)

Diabetes_012              int64
HighBP                    int64
HighChol                  int64
CholCheck                 int64
BMI                     float64
Smoker                    int64
Stroke                    int64
HeartDiseaseorAttack      int64
PhysActivity              int64
Fruits                    int64
Veggies                   int64
HvyAlcoholConsump         int64
AnyHealthcare             int64
NoDocbcCost               int64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                  int64
Sex                       int64
Age                     float64
Education               float64
Income                  float64
dtype: object
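
As a side note on dtypes: `int64` occupies the same 8 bytes per value as `float64`, so a memory saving would need a narrower type. The sketch below uses a small hypothetical frame (the `toy` data and its size are assumptions, not the BRFSS file) to compare the footprint of `float64`, `int64`, and `int8` for 0/1 columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few binary BRFSS columns stored as float64.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 2, size=(1000, 3)).astype("float64"),
    columns=["HighBP", "HighChol", "Smoker"],
)

as_int64 = toy.astype("int64")  # same 8 bytes per value as float64
as_int8 = toy.astype("int8")    # 1 byte per value, enough for 0/1 flags

# Sum the per-column byte counts, ignoring the index.
bytes_float64 = toy.memory_usage(index=False, deep=True).sum()
bytes_int64 = as_int64.memory_usage(index=False, deep=True).sum()
bytes_int8 = as_int8.memory_usage(index=False, deep=True).sum()

print(bytes_float64, bytes_int64, bytes_int8)
```

Whether the narrower dtype speeds up training depends on the estimator, since many scikit-learn models convert their input to floating point internally.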


# Scale our non-binary data to help with modeling

In [7]:
# Initialize MinMaxScaler for scaling data 
scaler = MinMaxScaler()
data[non_binary_columns] = scaler.fit_transform(data[non_binary_columns])
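
MinMaxScaler rescales each column to (x - min) / (max - min). The cell above fits the scaler on the full dataset before the train/test split, so the test rows influence the learned min and max. A minimal sketch of the alternative, fitting on the training rows only and reusing those statistics for the test rows, with hypothetical BMI-like values (`X_train_toy` and `X_test_toy` are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical BMI-like values; the second array plays the role of held-out test rows.
X_train_toy = np.array([[20.0], [25.0], [40.0]])
X_test_toy = np.array([[30.0], [50.0]])

scaler_toy = MinMaxScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # learns min=20, max=40 from training rows
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training min/max

# The same transformation written out by hand: (x - min) / (max - min).
manual = (X_test_toy - 20.0) / (40.0 - 20.0)

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```

Test values outside the training range map outside [0, 1] (here 50 becomes 1.5), which is expected when the scaler has not seen the test data.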

# Encode Diabetes Column as Binary value
The original data codes the target as 0 for no diabetes or diabetes only during pregnancy, 1 for prediabetes, and 2 for diabetes. We recode it as binary: 0 for no diabetes or diabetes only during pregnancy, and 1 for prediabetes or diabetes.

In [8]:
# Encode the diabetes column as binary: 0 = no diabetes, 1 = prediabetes or diabetes
data['Diabetes_binary'] = (data['Diabetes_012'] > 0).astype(int)

# Drop the original 'Diabetes_012' column so the target does not leak into the features
data = data.drop('Diabetes_012', axis=1)

In [9]:
# Check for columns with NA values
print(data.isna().sum())
data.dtypes

HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
Diabetes_binary         0
dtype: int64


HighBP                    int64
HighChol                  int64
CholCheck                 int64
BMI                     float64
Smoker                    int64
Stroke                    int64
HeartDiseaseorAttack      int64
PhysActivity              int64
Fruits                    int64
Veggies                   int64
HvyAlcoholConsump         int64
AnyHealthcare             int64
NoDocbcCost               int64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                  int64
Sex                       int64
Age                     float64
Education               float64
Income                  float64
Diabetes_binary           int64
dtype: object

## Correlation of each column with Diabetes_binary before modeling

In [10]:
correlations = data.corrwith(data['Diabetes_binary'])
print(correlations.sort_values(ascending=False))

Diabetes_binary         1.000000
GenHlth                 0.300785
HighBP                  0.270334
BMI                     0.223851
DiffWalk                0.222155
HighChol                0.210290
Age                     0.185891
HeartDiseaseorAttack    0.176933
PhysHlth                0.174948
Stroke                  0.104800
MentHlth                0.074971
CholCheck               0.067879
Smoker                  0.062778
NoDocbcCost             0.038025
Sex                     0.029606
AnyHealthcare           0.014079
Fruits                 -0.042088
HvyAlcoholConsump      -0.056682
Veggies                -0.059219
PhysActivity           -0.121392
Education              -0.131803
Income                 -0.172794
dtype: float64


In [11]:
# Share of the positive class (diabetes or prediabetes) in the dataset
print("Proportion of diabetes/prediabetes (class 1) in dataset:", data['Diabetes_binary'].sum() / len(data))

Proportion of diabetes/prediabetes (class 1) in dataset: 0.15758830022075054


# Data prep for modeling

This dataset is imbalanced: only ~15.8% of respondents have diabetes or prediabetes. To give the minority class enough weight during training, we oversample it with SMOTE (Synthetic Minority Over-sampling Technique) from the imbalanced-learn package. SMOTE is applied to the training split only, so no synthetic samples leak into the test set.
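
SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The NumPy sketch below illustrates only that interpolation idea on hypothetical toy data; `smote_like_oversample` is an illustrative helper, not imbalanced-learn's implementation, which the notebook uses through `SMOTE.fit_resample`.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # indices of the k nearest neighbours

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour per base row
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical toy data: 20 majority rows, 5 minority rows, 2 features.
rng = np.random.default_rng(1)
X_majority = rng.normal(0.0, 1.0, size=(20, 2))
X_minority = rng.normal(3.0, 0.5, size=(5, 2))

synthetic = smote_like_oversample(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])

print(len(X_majority), len(X_minority_balanced))
```

Because every synthetic row lies on a segment between two real minority rows, the new points stay inside the region the minority class already occupies.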

In [12]:
random_state = 12

# Model variables
X = data.drop(["Diabetes_binary"], axis = 1)
y = data['Diabetes_binary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=random_state)

# Use SMOTE for balancing the training data
smote = SMOTE(random_state=random_state)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
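
The split above does not pass `stratify`, so the class ratio in the test set can drift slightly from that of the full dataset. A hedged sketch of a stratified split on hypothetical labels follows (`X_toy` and `y_toy` are made up, with about 16% positives to mimic the notebook's ratio). Applying this to the notebook would change the split and therefore the reported numbers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the notebook's class ratio (about 16% positives).
y_toy = np.array([1] * 160 + [0] * 840)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the positive share the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=12, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```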

# Model Testing

## Overview
Our goal is to identify the best-performing classification model for this dataset. We balance the training data with SMOTE, tune hyperparameters with GridSearchCV (3-fold, scored by ROC AUC), and run the Random Forest and Gradient Boosting searches on all CPU cores (n_jobs=-1). LinearSVC is fit once, without a grid search, and calibrated to obtain probability estimates.

## Models and Their Strengths
- **Logistic Regression**: A fundamental model for binary outcomes, valued for its simplicity and interpretability. It estimates probabilities that allow for a clear threshold decision.

- **Random Forest Classifier**: Utilizes an ensemble of decision trees to reduce overfitting and improve prediction accuracy. It's known for its robustness and ability to handle non-linear data.

- **K-Nearest Neighbors (KNN)**: A non-parametric method that classifies each data point based on the majority label of its nearest neighbors. It's simple yet effective, with performance depending on the choice of the distance metric and the value of k.

- **Gradient Boosting Classifier**: Builds models sequentially, each correcting its predecessor, which combines weak models to create a strong model. It's powerful for handling various data types and relationships.

- **LinearSVC (Support Vector Machine)**: A variant of SVM optimized for linear classification. It's effective in high-dimensional spaces and for cases where the number of dimensions exceeds the number of samples.
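
The Logistic Regression bullet mentions a threshold decision on predicted probabilities. As a small illustration, the sketch below shows how moving that threshold trades recall against precision; the labels and probabilities are hypothetical values chosen for the example, and `recall_precision_at` is an illustrative helper.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for eight patients.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.20, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

def recall_precision_at(threshold):
    """Classify as positive when probability >= threshold, then score the result."""
    y_pred = (y_proba >= threshold).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    recall = tp / (tp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

for threshold in (0.3, 0.5, 0.7):
    recall, precision = recall_precision_at(threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

In this toy case a low threshold catches every positive at the cost of more false alarms, while a high threshold does the reverse. For a screening task, the preferred trade-off depends on the relative cost of missed cases and false alarms.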

## Evaluation Strategy
We evaluate each model on accuracy, precision, recall, F1 score, and ROC AUC. Because the test set remains imbalanced, accuracy alone can be misleading, so precision, recall, and F1 for the positive class show how well each model handles the minority class.
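
To make the metric definitions concrete, here is a small worked example computed from confusion-matrix counts. The counts are hypothetical and are not taken from any model in this notebook.

```python
# Hypothetical confusion-matrix counts for the positive class (not from the notebook).
tp, fp, fn, tn = 60, 90, 20, 830

accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of all predictions that are correct
precision = tp / (tp + fp)                           # share of predicted positives that are real
recall = tp / (tp + fn)                              # share of real positives that are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# prints: accuracy=0.890 precision=0.400 recall=0.750 f1=0.522
```

With these counts, accuracy looks high mainly because negatives dominate, while precision shows that most positive predictions are false alarms.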

## Outcome
This approach should let us select a model that performs well across these metrics and copes with the class imbalance in this dataset.

In [13]:
# Define the parameter grid
param_grid_lr = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs']  
}

# Initialize the GridSearchCV object
grid_search_lr = GridSearchCV(LogisticRegression(random_state=random_state, max_iter=1000), param_grid_lr, cv=3, scoring='roc_auc')

# Fit the model
grid_search_lr.fit(X_train_balanced, y_train_balanced)

# Best parameters and best score
print(f"Best parameters for Logistic Regression: {grid_search_lr.best_params_}")
print(f"Best ROC AUC for Logistic Regression: {grid_search_lr.best_score_:.4f}")

# Evaluate on the test set
y_pred_lr = grid_search_lr.predict(X_test)
y_pred_proba_lr = grid_search_lr.predict_proba(X_test)[:, 1]  # For ROC AUC

# Print classification report
print(classification_report(y_test, y_pred_lr))

# Calculate and print ROC AUC
roc_auc_lr = roc_auc_score(y_test, y_pred_proba_lr)
print(f"ROC AUC on Test Set: {roc_auc_lr:.4f}\n")


Best parameters for Logistic Regression: {'C': 100, 'solver': 'liblinear'}
Best ROC AUC for Logistic Regression: 0.8220
              precision    recall  f1-score   support

           0       0.94      0.72      0.82     42646
           1       0.34      0.77      0.47      8090

    accuracy                           0.73     50736
   macro avg       0.64      0.74      0.65     50736
weighted avg       0.85      0.73      0.76     50736

ROC AUC on Test Set: 0.8194



In [14]:
# Define the parameter grid
param_grid_rf = {
    'n_estimators': [10, 50, 100],  
    'max_depth': [None, 10, 20, 30]  
}

# Initialize the GridSearchCV object
grid_search_rf = GridSearchCV(RandomForestClassifier(random_state=random_state), param_grid_rf, cv=3, scoring='roc_auc', n_jobs=-1)

# Fit the model
grid_search_rf.fit(X_train_balanced, y_train_balanced)

# Best parameters and best score
print(f"Best parameters for Random Forest: {grid_search_rf.best_params_}")
print(f"Best ROC AUC for Random Forest: {grid_search_rf.best_score_:.4f}")

# Evaluate on the test set
y_pred_rf = grid_search_rf.predict(X_test)
y_pred_proba_rf = grid_search_rf.predict_proba(X_test)[:, 1]

# Print classification report
print(classification_report(y_test, y_pred_rf))

# Calculate and print ROC AUC
roc_auc_rf = roc_auc_score(y_test, y_pred_proba_rf)
print(f"ROC AUC on Test Set: {roc_auc_rf:.4f}\n")


Best parameters for Random Forest: {'max_depth': None, 'n_estimators': 100}
Best ROC AUC for Random Forest: 0.9700
              precision    recall  f1-score   support

           0       0.88      0.93      0.90     42646
           1       0.46      0.33      0.38      8090

    accuracy                           0.83     50736
   macro avg       0.67      0.63      0.64     50736
weighted avg       0.81      0.83      0.82     50736

ROC AUC on Test Set: 0.7947



In [15]:
# Define the parameter grid
param_grid_knn = {
    'n_neighbors': [3, 5, 7, 9], 
    'metric': ['euclidean', 'manhattan']  
}

# Initialize the GridSearchCV object
grid_search_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=3, scoring='roc_auc')

# Fit the model
grid_search_knn.fit(X_train_balanced, y_train_balanced)

# Best parameters and best score
print(f"Best parameters for KNN: {grid_search_knn.best_params_}")
print(f"Best ROC AUC for KNN: {grid_search_knn.best_score_:.4f}")

# Evaluate on the test set
y_pred_knn = grid_search_knn.predict(X_test)

# Note: KNeighborsClassifier does support predict_proba; the test-set ROC AUC was simply not computed in this cell
# Print classification report
print(classification_report(y_test, y_pred_knn))


Best parameters for KNN: {'metric': 'manhattan', 'n_neighbors': 3}
Best ROC AUC for KNN: 0.9074
              precision    recall  f1-score   support

           0       0.89      0.79      0.84     42646
           1       0.31      0.50      0.38      8090

    accuracy                           0.75     50736
   macro avg       0.60      0.65      0.61     50736
weighted avg       0.80      0.75      0.77     50736
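
KNeighborsClassifier exposes `predict_proba`, so a test-set ROC AUC can be computed for KNN in the same way as for the other models. A minimal sketch on hypothetical toy data follows (the `make_classification` data is an assumption; the `n_neighbors=3` and `metric="manhattan"` settings mirror the best parameters reported above).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced toy data.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=12
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=12
)

knn = KNeighborsClassifier(n_neighbors=3, metric="manhattan").fit(X_tr, y_tr)

# With uniform weights, predict_proba is the fraction of the k neighbours in each class.
proba_pos = knn.predict_proba(X_te)[:, 1]
roc_auc_knn = roc_auc_score(y_te, proba_pos)
print(f"ROC AUC on toy test set: {roc_auc_knn:.4f}")
```

In the notebook this would correspond to passing `grid_search_knn.predict_proba(X_test)[:, 1]` to `roc_auc_score`; that value was not computed in the original run. With a small k the probabilities take only a few distinct values, which makes the ROC curve coarse.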



In [16]:
# Define the parameter grid
param_grid_gb = {
    'n_estimators': [100, 200],  
    'learning_rate': [0.01, 0.1, 0.2],  
    'max_depth': [3, 5, 7]  
}

# Initialize the GridSearchCV object
grid_search_gb = GridSearchCV(GradientBoostingClassifier(random_state=random_state), param_grid_gb, cv=3, scoring='roc_auc', n_jobs=-1)

# Fit the model
grid_search_gb.fit(X_train_balanced, y_train_balanced)

# Best parameters and best score
print(f"Best parameters for Gradient Boosting: {grid_search_gb.best_params_}")
print(f"Best ROC AUC for Gradient Boosting: {grid_search_gb.best_score_:.4f}")

# Evaluate on the test set
y_pred_gb = grid_search_gb.predict(X_test)
y_pred_proba_gb = grid_search_gb.predict_proba(X_test)[:, 1]

# Print classification report
print(classification_report(y_test, y_pred_gb))

# Calculate and print ROC AUC
roc_auc_gb = roc_auc_score(y_test, y_pred_proba_gb)
print(f"ROC AUC on Test Set: {roc_auc_gb:.4f}\n")


Best parameters for Gradient Boosting: {'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 100}
Best ROC AUC for Gradient Boosting: 0.9577
              precision    recall  f1-score   support

           0       0.89      0.93      0.91     42646
           1       0.50      0.36      0.42      8090

    accuracy                           0.84     50736
   macro avg       0.70      0.65      0.67     50736
weighted avg       0.82      0.84      0.83     50736

ROC AUC on Test Set: 0.8214



In [19]:
from sklearn.calibration import CalibratedClassifierCV

# Initialize LinearSVC
linear_svc = LinearSVC(random_state=random_state, max_iter=1000, dual='auto')

# Calibrate model to allow for probability estimates
calibrated_svc = CalibratedClassifierCV(linear_svc, method='sigmoid', cv=3)
calibrated_svc.fit(X_train_balanced, y_train_balanced)

# Predict on test set
y_pred_svc = calibrated_svc.predict(X_test)

# Obtain probabilities for ROC AUC
y_pred_proba_svc = calibrated_svc.predict_proba(X_test)[:, 1]

# Evaluation metrics
accuracy_svc = accuracy_score(y_test, y_pred_svc)
precision_svc = precision_score(y_test, y_pred_svc)
recall_svc = recall_score(y_test, y_pred_svc)
f1_svc = f1_score(y_test, y_pred_svc)
roc_auc_svc = roc_auc_score(y_test, y_pred_proba_svc)

# Print metrics
print("LinearSVC (with Calibration):")
print(f"Accuracy: {accuracy_svc:.4f}, Precision: {precision_svc:.4f}, Recall: {recall_svc:.4f}, F1 Score: {f1_svc:.4f}, ROC AUC: {roc_auc_svc:.4f}")

# Print classification report
print(classification_report(y_test, y_pred_svc))


LinearSVC (with Calibration):
Accuracy: 0.7279, Precision: 0.3424, Recall: 0.7674, F1 Score: 0.4735, ROC AUC: 0.8194
              precision    recall  f1-score   support

           0       0.94      0.72      0.82     42646
           1       0.34      0.77      0.47      8090

    accuracy                           0.73     50736
   macro avg       0.64      0.74      0.65     50736
weighted avg       0.85      0.73      0.76     50736
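
The cells above report each model separately. One way to compare them side by side is to gather the test metrics into a single table. The sketch below assumes a hypothetical helper `summarize` and uses made-up labels and scores for two placeholder models; it does not reproduce any result from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score

def summarize(name, y_true, y_pred, y_score):
    """Return one row of test metrics for a model's predictions and scores."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, y_score),
    }

# Hypothetical labels and scores for two placeholder models.
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
scores_a = np.array([0.1, 0.4, 0.6, 0.7, 0.9, 0.2, 0.3, 0.5])
scores_b = np.array([0.2, 0.1, 0.3, 0.8, 0.6, 0.4, 0.7, 0.2])

rows = [
    summarize("model_a", y_true, (scores_a >= 0.5).astype(int), scores_a),
    summarize("model_b", y_true, (scores_b >= 0.5).astype(int), scores_b),
]
summary = pd.DataFrame(rows).set_index("model").round(3)
print(summary)
```

In the notebook, each call would instead receive `y_test` together with a model's predictions and positive-class probabilities, for example `y_pred_lr` and `y_pred_proba_lr`.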

