## 1. Introduction

Diabetes is a major public health concern associated with increased morbidity, mortality, and healthcare costs. Understanding the demographic and behavioral factors associated with the prevalence of diabetes can help inform prevention strategies and public health interventions.

This project explores relationships between diabetes diagnosis and selected risk factors using public health survey data. The analysis focuses on exploratory data analysis and statistical inference rather than predictive modeling.

### Research Questions
The primary questions addressed in this analysis are:

1. Is body mass index (BMI) significantly associated with diabetes prevalence?
2. Do physically active individuals exhibit lower rates of diabetes than inactive individuals?
3. Does diabetes prevalence differ significantly across age groups?
4. Is income level associated with diabetes diagnosis?
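
As a rough illustration of how a question such as the second one could be approached, the sketch below compares diabetes rates between active and inactive respondents using a row-normalized cross-tabulation. The eight-row `toy` frame is invented for illustration only and is not drawn from the survey. The `has_diabetes` column, and the choice to treat only code 2 as a diabetes outcome, are assumptions.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the survey; column names mirror the dataset.
toy = pd.DataFrame({
    "PhysActivity": [1, 1, 1, 1, 0, 0, 0, 0],
    "Diabetes_012": [0, 0, 0, 2, 0, 2, 2, 1],
})

# Assumption: only code 2 counts as a diabetes diagnosis here.
toy["has_diabetes"] = (toy["Diabetes_012"] == 2).astype(int)

# Share of respondents with and without diabetes inside each activity group
# (normalize="index" makes every row sum to 1).
rates = pd.crosstab(toy["PhysActivity"], toy["has_diabetes"], normalize="index")
print(rates)
```

A formal comparison, such as a chi-square test of independence, would be run on the underlying counts rather than on these proportions.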

## 2. Data Overview

The dataset used in this project is derived from the 2015 Behavioral Risk Factor Surveillance System (BRFSS), a public health behavioral survey. It includes individual-level responses related to diabetes diagnosis, demographic characteristics, and health-related behaviors.

Source: "Diabetes Health Indicators Dataset" (Kaggle), https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

In [1]:
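import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

# load the survey extract
df = pd.read_csv("data/diabetes_012_health_indicators_BRFSS2015.csv")

print("Shape:", df.shape)
display(df.head())

# column data types
display(df.dtypes)

# missing values check (20 columns with the most missing values)
display(df.isna().sum().sort_values(ascending=False).head(20))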


Shape: (253680, 22)


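|  | Diabetes_012 | HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | ... | AnyHealthcare | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 40.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 5.0 | 18.0 | 15.0 | 1.0 | 0.0 | 9.0 | 4.0 | 3.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 25.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7.0 | 6.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 28.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 1.0 | 5.0 | 30.0 | 30.0 | 1.0 | 0.0 | 9.0 | 4.0 | 8.0 |
| 3 | 0.0 | 1.0 | 0.0 | 1.0 | 27.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 3.0 | 6.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 24.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 1.0 | 0.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 11.0 | 5.0 | 4.0 |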


Diabetes_012            float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
HeartDiseaseorAttack    float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object

Diabetes_012            0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
HeartDiseaseorAttack    0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
dtype: int64
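
Because the listing above is limited to `head(20)`, it covers only 20 of the 22 columns. A minimal sketch of a check that spans every column, and that also tests whether the `float64` columns hold only whole numbers, might look like the following. The small `sample` frame and the helper name `integer_like_columns` are hypothetical stand-ins; in the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the survey frame (values are illustrative only).
sample = pd.DataFrame({
    "Diabetes_012": [0.0, 2.0, 1.0],
    "BMI": [40.0, 25.0, 28.0],
    "Income": [3.0, 1.0, 8.0],
})

# Total number of missing cells across ALL columns, with no truncation.
total_missing = int(sample.isna().sum().sum())


def integer_like_columns(frame):
    """Return the float columns whose non-missing values are all whole numbers."""
    floats = frame.select_dtypes(include="float")
    return [
        col for col in floats.columns
        if (floats[col].dropna() % 1 == 0).all()
    ]


print("Missing cells:", total_missing)
print("Integer-like float columns:", integer_like_columns(sample))
```

Columns flagged this way are candidates for conversion to an integer or categorical dtype during cleaning, although whether to convert them is a design choice rather than a requirement.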

## 2a. Target Variable Definition

The primary outcome variable in this analysis is **Diabetes_012**, which represents diabetes status based on self-reported survey responses.

The variable is encoded as follows:
- **0**: No diabetes
- **1**: Prediabetes
- **2**: Diabetes

For the purposes of this exploratory and statistical analysis, diabetes status will be examined both in its original categorical form and, where appropriate, using binary groupings to support hypothesis testing.
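
One possible way to build such binary groupings is sketched below. Whether prediabetes (code 1) belongs in the positive class is an analytic choice that is left open here, so both variants are shown. The column names `diabetes_only` and `any_dysglycemia` and the toy frame are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data using the same coding as Diabetes_012.
toy = pd.DataFrame({"Diabetes_012": [0.0, 1.0, 2.0, 0.0, 2.0]})

# Variant A (assumption): only a diabetes diagnosis (code 2) is positive.
toy["diabetes_only"] = (toy["Diabetes_012"] == 2).astype(int)

# Variant B (assumption): prediabetes or diabetes (codes 1 and 2) is positive.
toy["any_dysglycemia"] = (toy["Diabetes_012"] >= 1).astype(int)

print(toy)
```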

## 2b. Key Variables and Descriptions

The following variables are central to the analyses:

- **BMI**: Body Mass Index, calculated from self-reported height and weight.
- **PhysActivity**: Indicator of whether the respondent engaged in physical activity during the past 30 days.
- **Age**: Categorical age group variable.
- **HighBP**: Indicator for high blood pressure diagnosis.
- **HighChol**: Indicator for high cholesterol diagnosis.
- **Smoker**: Indicator for current or former smoking status.
- **Income**: Ordinal variable representing household income category.
- **Education**: Ordinal variable representing highest level of education attained.

All variables are derived from self-reported survey data.
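
Since **Age** is stored as a numeric code, it can help to attach readable labels before plotting or tabulating. The sketch below maps the codes to five-year bands and stores them as an ordered categorical. The `AGE_LABELS` mapping reflects the 13-level BRFSS age grouping as commonly documented and should be treated as an assumption to verify against the survey codebook. The `AgeGroup` column and the toy frame are hypothetical.

```python
import pandas as pd

# Assumed mapping of Age codes to age bands (verify against the survey codebook).
AGE_LABELS = {
    1: "18-24", 2: "25-29", 3: "30-34", 4: "35-39", 5: "40-44",
    6: "45-49", 7: "50-54", 8: "55-59", 9: "60-64", 10: "65-69",
    11: "70-74", 12: "75-79", 13: "80+",
}

# Hypothetical toy data; the real column is stored as float64 codes.
toy = pd.DataFrame({"Age": [9.0, 7.0, 11.0, 1.0]})

# An ordered categorical keeps the bands in age order for sorting and plotting.
toy["AgeGroup"] = pd.Categorical(
    toy["Age"].astype(int).map(AGE_LABELS),
    categories=list(AGE_LABELS.values()),
    ordered=True,
)

print(toy)
```

The same pattern would apply to the ordinal **Income** and **Education** codes once their category labels are confirmed.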

In [None]:
# target variable distribution
df["Diabetes_012"].value_counts().sort_index()

Diabetes_012
0.0    213703
1.0      4631
2.0     35346
Name: count, dtype: int64

In [None]:
# target variable proportions
df["Diabetes_012"].value_counts(normalize=True).sort_index()

Diabetes_012
0.0    0.842412
1.0    0.018255
2.0    0.139333
Name: proportion, dtype: float64

## 3. Data Cleaning & Preparation

## 4. Exploratory Data Analysis

## 5. Statistical Analysis

## 6. Key Insights

## 7. Limitations & Next Steps