# Names
- Turki Alrasheed
- Mylinh Tang
- Arnav Kulkarni
- Chanbin Na
- Yiming Sun

# Research Question

Can we build a predictive model to classify individuals in the United States as type 2 diabetic, pre-diabetic, or non-diabetic based on health indicators (such as BMI, physical activity, age, and smoking and drinking habits) and medical conditions (such as high cholesterol and high blood pressure)? Additionally, which features carry the most weight in the model’s prediction of diabetes?

## Background and Prior Work

According to the American Diabetes Association (ADA), approximately 38.4 million Americans had diabetes in 2021. Of these, around 8.7 million cases went undiagnosed.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Diagnosing diabetes typically involves tests such as the A1C test, Fasting Plasma Glucose (FPG) test, Oral Glucose Tolerance Test (OGTT), and the use of Continuous Glucose Monitors (CGM). These tests require laboratory analysis or expensive medical devices, making them inaccessible to individuals without proper healthcare coverage.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) To address this gap, we aim to develop a predictive model that can help individuals assess their risk for diabetes without requiring costly medical diagnostics.  

The paper by Chang, Ganatra, and colleagues aligns with our project: it demonstrates how machine learning models and algorithms can predict a diabetes diagnosis from health indicators such as Body Mass Index (BMI), age, blood pressure, and cholesterol levels. The authors use techniques such as PCA and SMOTE to improve data suitability, which is relevant to our work, and they evaluate several machine learning models for predicting type 2 diabetes, including Decision Tree, Logistic Regression, and Random Forest. We found it particularly useful that a Random Forest model can identify important variables in a large pool of data, which can help us determine which factors, such as age and BMI, most influence the model’s prediction of diabetes.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

The paper by Päivi Riihimaa compares traditional statistical regression models with more advanced machine learning algorithms for predicting type 2 diabetes. On the one hand, it discusses how machine learning algorithms tend to outperform regression models for diabetes risk prediction based on health indicators like BMI, physical activity, and medical conditions.
On the other hand, it notes that including features such as blood pressure and cholesterol levels can improve the accuracy of predictive models, which will help us when selecting features for our own models. This also lays a foundation for assessing which features have the highest weight in the model’s prediction of diabetes.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)


### References  
1.<a name="cite_note-1"></a> [^](#cite_ref-1) American Diabetes Association. (2021). *Statistics About Diabetes*. https://diabetes.org/about-diabetes/statistics/about-diabetes  
2.<a name="cite_note-2"></a> [^](#cite_ref-2) American Diabetes Association. (2015). *Classification and Diagnosis of Diabetes*. *Diabetes Care*, 38(Supplement_1), S8. https://diabetesjournals.org/care/article/38/Supplement_1/S8/37298/2-Classification-and-Diagnosis-of-Diabetes  
3.<a name="cite_note-3"></a> [^](#cite_ref-3) Chang, V., Ganatra, M. A., Hall, K., Golightly, L., & Xu, Q. A. (2022). *An assessment of machine learning models and algorithms for early prediction and diagnosis of diabetes using health indicators*. *Healthcare Analytics, 2*, 100118. https://doi.org/10.1016/j.health.2022.100118  
4.<a name="cite_note-4"></a> [^](#cite_ref-4) Riihimaa, P. (2020). *Impact of machine learning and feature selection on type 2 diabetes risk prediction*. *Journal of Medical Artificial Intelligence, 3*, 10. https://doi.org/10.21037/jmai-20-4  


# Hypothesis


We hypothesize that Body Mass Index (BMI) will be the strongest predictor of type 2 diabetes risk. A higher BMI is closely linked to dietary habits high in fats, carbohydrates, and sugars, which lead to glucose spikes and increased insulin demand and thereby contribute to a higher risk of developing diabetes. We also expect age to be a strong predictor, because older people are more likely to develop diabetes.

We also expect high cholesterol and high blood pressure to be strongly positively correlated with diabetes, as they are often associated with metabolic syndrome and insulin resistance, which drive the progression from normal glucose metabolism to prediabetes and type 2 diabetes.

Overall, because these features are strongly associated with diabetes, we believe it is possible to build an accurate machine learning model that predicts an individual's category: non-diabetic, pre-diabetic, or diabetic.
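One rough way to probe this hypothesis before any modeling is to rank the features by their rank correlation with the target. The sketch below is only an illustration: the toy data and the helper name `rank_by_correlation` are assumptions made for the example, and the synthetic target is deliberately built to depend on `BMI` and `HighBP` only. Correlation with the target is also not the same thing as importance inside a fitted model, so this is a first look rather than a test of the hypothesis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few of the survey columns.
rng = np.random.default_rng(0)
n = 1_000
toy = pd.DataFrame({
    "BMI": rng.normal(28, 6, n),
    "HighBP": rng.integers(0, 2, n),
    "Fruits": rng.integers(0, 2, n),
})

# Synthetic target that depends on BMI and HighBP only (an assumption for the demo).
score = 0.1 * (toy["BMI"] - 28) + 1.0 * toy["HighBP"] + rng.normal(0, 1, n)
toy["Diabetes_012"] = pd.cut(
    score, bins=[-np.inf, 0.5, 1.5, np.inf], labels=[0, 1, 2]
).astype(int)


def rank_by_correlation(df, target="Diabetes_012"):
    """Sort features by absolute Spearman correlation with the target.

    Spearman correlation is computed as the Pearson correlation of the ranks.
    """
    corr = df.rank().corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranking = rank_by_correlation(toy)
print(ranking)
```

On the real data, the same helper could be called on the cleaned DataFrame while `Age` is still numeric.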

# Data

## Data overview

- Dataset #1
  - Dataset Name: Diabetes Health Indicators Dataset
  - Link to the dataset: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
  - Number of observations: 253,680
  - Number of variables: 22

This dataset was originally sourced from the Behavioral Risk Factor Surveillance System (BRFSS) 2015 survey. The BRFSS is a health-related telephone survey conducted annually by the Centers for Disease Control and Prevention (CDC).

This dataset contains our target variable column, which categorizes individuals as non-diabetic, pre-diabetic, or diabetic. There are 21 feature columns, which include health indicators like BMI, physical activity levels, and smoking status, and medical conditions like high blood pressure and high cholesterol. Many columns are binary, answering a yes/no question, while some are numerical variables like BMI. These metrics are proxies for lifestyle factors, chronic disease risk, and metabolic health. To clean this data, we will select the columns that relate to our research question, convert the float-encoded columns to ints, check for missing data and use probabilistic imputation if any is found, and convert the encoded age column to age ranges for better readability.

Here is a brief description of each variable/column that could be used in our analysis:

|column name|description|
|---|---|
|`'Diabetes_012'`|0 = No diabetes, 1 = Pre-diabetes, 2 = Diabetes
|`'HighBP'`|1 = High blood pressure, 0 = No high blood pressure 
|`'HighChol'`|1 = High cholesterol, 0 = No high cholesterol
|`'CholCheck'`|1 = Has had cholesterol checked in the last five years, 0 = Has not
|`'BMI'`|Body Mass Index (weight in kilograms divided by height in meters squared)
|`'Smoker'`|1 = Has smoked at least 100 cigarettes in lifetime, 0 = Has not
|`'Stroke'`|1 = Has had a stroke, 0 = Has not  
|`'HeartDiseaseorAttack'`|1 = Has had a heart attack or coronary heart disease, 0 = Has not
|`'PhysActivity'`|1 = Engages in physical activity in last 30 days (excluding job-related), 0 = Does not 
|`'Fruits'`|1 = Eats fruit at least once per day, 0 = Does not
|`'Veggies'`|1 = Eats vegetables at least once per day, 0 = Does not
|`'HvyAlcoholConsump'`|1 = Heavy alcohol consumption (men: >14 drinks per week, women: >7 drinks per week), 0 = Not a heavy drinker
|`'AnyHealthcare'`|1 = Has health insurance or access to healthcare, 0 = Does not
|`'NoDocbcCost'`|1 = Could not see a doctor due to cost, 0 = Could see a doctor  
|`'GenHlth'`|Self-reported general health (1 = Excellent, 2 = Very good, 3 = Good, 4 = Fair, 5 = Poor) 
|`'MentHlth'`|Number of days in the past 30 with poor mental health
|`'PhysHlth'`|Number of days in the past 30 with poor physical health
|`'DiffWalk'`|1 = Has difficulty walking or climbing stairs, 0 = No difficulty  
|`'Sex'`|0 = Female, 1 = Male
|`'Age'`|Encoded age group (1: 18-24, 2: 25-29, 3: 30-34, 4: 35-39, 5: 40-44, 6: 45-49, 7: 50-54, 8: 55-59, 9: 60-64, 10: 65-69, 11: 70-74, 12: 75-79, 13: 80 or older)

## Diabetes Health Indicators Dataset

In [91]:
# imports needed for cleaning
import pandas as pd
import numpy as np

In [92]:
def drop_cols(df):
    # drop columns not related to our research question
    return df.drop(columns=['Education', 'Income'])

In [93]:
diabetes_df = pd.read_csv('diabetes.csv').pipe(drop_cols) # loading in our dataset

In [94]:
diabetes_df.isna().sum() # checking number of missing values for each column, none found

Diabetes_012            0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
dtype: int64

In [95]:
diabetes_df.dtypes # check dtypes of variables

Diabetes_012            float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
dtype: object

Below, we check that the values of the binary columns are either 1 or 0.

In [96]:
binary_columns = ['HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 'HeartDiseaseorAttack',
               'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare',
               'NoDocbcCost', 'DiffWalk', 'Sex']
for col in binary_columns:
    print(diabetes_df[col].unique())

[1. 0.]
[1. 0.]
[1. 0.]
[1. 0.]
[0. 1.]
[0. 1.]
[0. 1.]
[0. 1.]
[1. 0.]
[0. 1.]
[1. 0.]
[0. 1.]
[1. 0.]
[0. 1.]
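Reading fourteen printed arrays by eye works here, but the same check can be expressed as a single test that names any offending column. The following is a minimal sketch under that idea; the helper name `non_binary_columns` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def non_binary_columns(df, columns):
    """Return the subset of `columns` containing any value other than 0 or 1.

    Missing values also count as invalid, because isin() is False for NaN.
    """
    is_binary = df[columns].isin([0, 1]).all()
    return is_binary[~is_binary].index.tolist()


# Tiny hypothetical frame; 'Smoker' deliberately contains an invalid 2.0.
toy = pd.DataFrame({"HighBP": [1.0, 0.0, 1.0], "Smoker": [0.0, 2.0, 1.0]})
bad = non_binary_columns(toy, ["HighBP", "Smoker"])
print(bad)  # ['Smoker']
```

Applied to the notebook's data, `non_binary_columns(diabetes_df, binary_columns)` would be expected to return an empty list, given the output above.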


In [97]:
diabetes_df['Diabetes_012'].value_counts() # check that the only class values are 0, 1, and 2
# the classes are imbalanced, so we will have to account for this in the analysis later

Diabetes_012
0.0    213703
2.0     35346
1.0      4631
Name: count, dtype: int64
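To make the imbalance concrete, the sketch below computes each class's share of the data and a set of inverse-frequency weights from the counts shown above. Class weighting is only one possible way to handle imbalance (resampling is another), and nothing here is meant to suggest which approach the later analysis uses. The formula `n_samples / (n_classes * class_count)` is the same one scikit-learn applies for `class_weight='balanced'`.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
class_counts = pd.Series({0: 213703, 1: 4631, 2: 35346})

# Share of each class in the data.
class_share = class_counts / class_counts.sum()

# Inverse-frequency ("balanced") weights: n_samples / (n_classes * class_count).
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

print(class_share.round(3))
print(class_weights.round(2))
```

The rarest class (pre-diabetes) receives the largest weight, so misclassifying it would cost more during training.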

In [98]:
diabetes_df['GenHlth'].value_counts() # check values here: good, only 1, 2, 3, 4, 5

GenHlth
2.0    89084
3.0    75646
1.0    45299
4.0    31570
5.0    12081
Name: count, dtype: int64

In [99]:
diabetes_df['Age'].value_counts()  # good: values are correct here, they are supposed to range from 1-13

Age
9.0     33244
10.0    32194
8.0     30832
7.0     26314
11.0    23533
6.0     19819
13.0    17363
5.0     16157
12.0    15980
4.0     13823
3.0     11123
2.0      7598
1.0      5700
Name: count, dtype: int64

In [100]:
diabetes_df.describe() # we see that MentHlth and PhysHlth range from 0-30, so all good there

,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.296921,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.094186,0.756544,0.634256,0.81142,0.056197,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119
std,0.69816,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.292087,0.429169,0.481639,0.391175,0.230302,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0
max,2.0,1.0,1.0,1.0,98.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0


All of these columns are currently floats. Based on their descriptions, every column can be converted to ints except possibly BMI, which could be continuous (contain decimals). So, we check whether BMI contains only whole numbers. If it does, we can convert all columns in the DataFrame to ints.

In [101]:
# `% 1` is applied element-wise to the Series; a remainder of 0 means the value is a whole number.
# Comparing to 0 gives a boolean Series, and .all() returns True only if every BMI is a whole number.
is_whole_number = (diabetes_df['BMI'] % 1 == 0).all()
is_whole_number

np.True_

Since all BMI values are whole numbers, we can convert the whole DataFrame to ints.

In [102]:
diabetes_df = diabetes_df.astype(int)
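The cast above is safe because the earlier cells confirmed there are no missing values and no fractional parts; `astype(int)` would otherwise truncate decimals silently. As a hedged sketch, those preconditions can be bundled into one guarded helper. The function name `to_int_if_lossless` and the toy frame are hypothetical.

```python
import pandas as pd


def to_int_if_lossless(df):
    """Cast every column to int, but only if nothing would be lost.

    Raises ValueError if any value is missing or has a fractional part.
    """
    if df.isna().any().any():
        raise ValueError("missing values present; cannot cast to int")
    fractional = [c for c in df.columns if not (df[c] % 1 == 0).all()]
    if fractional:
        raise ValueError(f"non-integer values in: {fractional}")
    return df.astype(int)


# Tiny hypothetical frame with whole-number floats, like the survey columns.
toy = pd.DataFrame({"BMI": [40.0, 25.0], "HighBP": [1.0, 0.0]})
converted = to_int_if_lossless(toy)
print(converted.dtypes)
```

If a later version of the data contained a BMI such as 27.5, this helper would raise an error instead of rounding it down to 27.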

In [103]:
age_groups = { # dictionary to convert numbers to age ranges for readability & visualizations in EDA
    1: "18-24",
    2: "25-29",
    3: "30-34",
    4: "35-39",
    5: "40-44",
    6: "45-49",
    7: "50-54",
    8: "55-59",
    9: "60-64",
    10: "65-69",
    11: "70-74",
    12: "75-79",
    13: "80+"
}
diabetes_df['Age'] = diabetes_df['Age'].map(age_groups)
diabetes_df

,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age
0,0,1,1,1,40,1,0,0,0,0,1,0,1,0,5,18,15,1,0,60-64
1,0,0,0,0,25,1,0,0,1,0,0,0,0,1,3,0,0,0,0,50-54
2,0,1,1,1,28,0,0,0,0,1,0,0,1,1,5,30,30,1,0,60-64
3,0,1,0,1,27,0,0,0,1,1,1,0,1,0,2,0,0,0,0,70-74
4,0,1,1,1,24,0,0,0,1,1,1,0,1,0,2,3,0,0,0,70-74
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,0,1,1,1,45,0,0,0,0,1,1,0,1,0,3,0,5,0,1,40-44
253676,2,1,1,1,18,0,0,0,0,0,0,0,1,0,4,0,0,1,0,70-74
253677,0,0,0,1,28,0,0,0,1,1,0,0,1,0,1,0,0,0,0,25-29
253678,0,1,0,1,23,0,0,0,0,1,1,0,1,0,3,0,0,0,1,50-54
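The cell above replaces the numeric `Age` codes with strings, which is convenient for plots, but most models expect numeric input. One possible alternative, sketched below under that assumption, is to keep `Age` numeric and add the labels as a separate ordered categorical column, so that sorting and comparisons follow age order explicitly. The column name `AgeGroup` and the toy frame are hypothetical.

```python
import pandas as pd

# Labels for codes 1-13, in the same order as the age_groups dictionary above.
age_labels = ["18-24", "25-29", "30-34", "35-39", "40-44", "45-49", "50-54",
              "55-59", "60-64", "65-69", "70-74", "75-79", "80+"]
age_dtype = pd.CategoricalDtype(categories=age_labels, ordered=True)

# Tiny hypothetical frame holding the encoded ages.
toy = pd.DataFrame({"Age": [9, 1, 13, 7]})

# Codes run 1-13 while category positions run 0-12, hence the "- 1".
toy["AgeGroup"] = pd.Categorical.from_codes(
    (toy["Age"] - 1).to_numpy(), dtype=age_dtype
)

print(toy.sort_values("AgeGroup"))
```

With this layout, `Age` remains usable as a model feature, while `AgeGroup` serves as the readable label in EDA plots.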


# Ethics & Privacy

Developing a predictive model for diabetes classification raises important ethical and privacy concerns. Ethically, the model must be free from biases that could lead to inaccurate or unfair predictions, particularly for underrepresented groups. Ensuring a diverse dataset and evaluating the model's output on a diverse validation set can help mitigate these risks. Additionally, the model should serve as a decision-support tool rather than the sole determinant of a medical condition, keeping clinical expertise central to healthcare decisions.
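Evaluating the model across groups can be made concrete by computing a metric separately for each subgroup and comparing the results. The sketch below computes recall for the diabetic class within each group. The labels, predictions, and the helper name `recall_by_group` are all hypothetical, and the 0/1 group could stand for a column such as `Sex`.

```python
import pandas as pd


def recall_by_group(y_true, y_pred, group, positive=2):
    """Recall for the `positive` class, computed separately within each group."""
    frame = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "group": group})
    # Keep only the rows whose true label is the positive class.
    positives = frame[frame["y_true"] == positive]
    hit = positives["y_pred"] == positive
    return hit.groupby(positives["group"]).mean()


# Hypothetical labels and predictions for eight individuals in two groups.
y_true = [2, 2, 0, 2, 2, 0, 2, 1]
y_pred = [2, 0, 0, 2, 2, 0, 0, 1]
group = [0, 0, 0, 0, 1, 1, 1, 1]

group_recall = recall_by_group(y_true, y_pred, group)
print(group_recall)
```

A large gap in recall between groups would indicate that the model misses diabetic cases more often in one group, which would warrant closer investigation.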

Privacy is also critical, as the model will handle sensitive health data. Compliance with applicable state regulations and privacy policies is essential, which may require data anonymization, secure storage, and strict access controls. Individuals should be informed about how their data will be used and given the option to opt out, while transparency and explainability should be prioritized to foster trust and protect patient confidentiality.