# Title

# Summary

# Introduction

Diabetes is a prevalent chronic disease that affects millions of people worldwide, and early detection and proper management reduce the associated health risks. This project aims to develop a classification model that predicts diabetes status from a set of health indicators. By applying data preprocessing strategies to a publicly available health dataset, we seek to improve the accuracy of diabetes detection.
The data come from the CDC Diabetes Health Indicators dataset, which contains demographic and lifestyle-related features that may influence the likelihood of diabetes. The primary objective is to classify individuals as diabetic (1) or non-diabetic (0) using predictive modeling.
The project follows a structured approach to data preparation, exploration, and classification modeling.
First, the dataset is obtained from an external source and loaded into R. The raw dataset is then inspected for completeness and correctness by checking for missing values and counting the unique values in each feature. Categorical variables (such as age, smoking status, and blood pressure) are converted into factor types to facilitate analysis.
The dataset is also highly imbalanced, with more non-diabetic cases than diabetic ones. To address this, the ROSE (Random Over-Sampling Examples) technique is applied to generate synthetic data points and balance the dataset.
The balanced dataset is then split into 75% training data and 25% testing data to build and evaluate machine learning models. Visualizations (bar plots and a density plot) are generated from the training data to explore the relationships between health indicators and diabetes status, including trends in factors such as BMI, high blood pressure, cholesterol levels, and physical activity.



# Methods & Results


#### (1) Loading Data from Original Source On The Web

In [5]:
install.packages("glmnet")




In [2]:
options(repr.plot.width = 15, repr.plot.height = 10, warn = -1)

library(reticulate) 
library(tidyverse) 
library(tidymodels)
library(glmnet)
library(patchwork)
library(ROSE)
library(purrr)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.7     ✔ rsample      

In [2]:
py_run_file("/home/jovyan/work/src/dataset_download.py")

raw_diabetes_df <- read_csv("/home/jovyan/work/data/raw/cdc_diabetes_health_indicators.csv", show_col_types = FALSE)
head(raw_diabetes_df, n = 3)

| HighBP | HighChol | CholCheck | BMI | Smoker | Stroke | HeartDiseaseorAttack | PhysActivity | Fruits | Veggies | ⋯ | NoDocbcCost | GenHlth | MentHlth | PhysHlth | DiffWalk | Sex | Age | Education | Income | Diabetes_binary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 40 | 1 | 0 | 0 | 0 | 0 | 1 | ⋯ | 0 | 5 | 18 | 15 | 1 | 0 | 9 | 4 | 3 | 0 |
| 0 | 0 | 0 | 25 | 1 | 0 | 0 | 1 | 0 | 0 | ⋯ | 1 | 3 | 0 | 0 | 0 | 0 | 7 | 6 | 1 | 0 |
| 1 | 1 | 1 | 28 | 0 | 0 | 0 | 0 | 1 | 0 | ⋯ | 1 | 5 | 30 | 30 | 1 | 0 | 9 | 4 | 8 | 0 |

All displayed columns are of type `dbl`.


#### (2) Preprocessing: Wrangle, Clean, and Balance Data From Original Format

In [None]:
# (1) Check NA values, the distinct Count of each variable (to see which ones are categorical/binary), and the current data type

checking_raw_matrix <- rbind(
  NA_Count = sapply(raw_diabetes_df, function(x) sum(is.na(x))),
  Distinct_Count = sapply(raw_diabetes_df, function(x) n_distinct(x)),
  Current_Data_Type = sapply(raw_diabetes_df, typeof)
)

checking_raw_df <- as.data.frame(t(checking_raw_matrix))

checking_raw_df
# --------------------------------------
# Observations:
# (1) There are no NA values.
# (2) Only BMI is numerical; the remaining variables are categorical/binary (see the metadata for details).
# (3) Currently, every column is stored as a double.

In [None]:
# (2) converting categorical/binary variables into factors

raw_diabetes_df <- raw_diabetes_df %>%
  mutate(across(!BMI, ~ factor(.)))

In [None]:
# (3) Check how imbalanced the dataset is, then balance it

# The classes are quite imbalanced
target_result <- raw_diabetes_df %>%
  group_by(Diabetes_binary) %>%
  summarise(Count = n(), Proportion = n() / nrow(raw_diabetes_df)) %>%
  ungroup()

# ----------------------------------------
# Use ROSE to balance the classes with synthetic examples (smoothed bootstrap).
# The seed argument makes the resampling reproducible, so no separate set.seed() call is needed.

balanced_raw_diabetes_df <- ROSE(Diabetes_binary ~ ., data = raw_diabetes_df, seed = 123)$data

balanced_target_result <- balanced_raw_diabetes_df %>%
  group_by(Diabetes_binary) %>%
  summarise(Count = n(), Proportion = n() / nrow(balanced_raw_diabetes_df)) %>%
  ungroup()

# -----------------------------------------
balanced_raw_comparison_df <- data.frame(
  Class = target_result$Diabetes_binary,
  Original_Count = target_result$Count,
  Original_Proportion = target_result$Proportion,
  Balanced_Count = balanced_target_result$Count,
  Balanced_Proportion = balanced_target_result$Proportion
)

balanced_raw_comparison_df
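
To make the effect of the balancing step easier to see, the following is a minimal sketch on made-up data, not the project dataset. It assumes the same `ROSE()` call pattern as above; `toy_df`, `toy_balanced`, and the simulated values are hypothetical names and numbers chosen only for illustration.

```r
library(ROSE)

# Toy stand-in for the real data: 900 non-diabetic and 100 diabetic rows
# (all values are made up for illustration).
set.seed(6)
toy_df <- data.frame(
  BMI = rnorm(1000, mean = 28, sd = 5),
  HighBP = factor(rbinom(1000, 1, 0.4)),
  Diabetes_binary = factor(rep(c(0, 1), times = c(900, 100)))
)

# By default ROSE returns a synthetic sample with as many rows as the input,
# with roughly half of the rows in each class (p = 0.5).
toy_balanced <- ROSE(Diabetes_binary ~ ., data = toy_df, seed = 123)$data

original_prop <- prop.table(table(toy_df$Diabetes_binary))
balanced_prop <- prop.table(table(toy_balanced$Diabetes_binary))

print(round(original_prop, 3))
print(round(balanced_prop, 3))
```

Because the total number of rows is unchanged by default, the majority class shrinks while the minority class grows, which is why the balanced counts in the comparison table differ from the original counts for both classes.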

In [None]:
# (4) Write the balanced data frame to data/processed/ so the steps above do not have to be rerun each time
balanced_raw_diabetes_df %>% write_csv("/home/jovyan/work/data/processed/balanced_cdc_diabetes_health_indicators.csv")

In [3]:
# Then read it back in and restore the factor types, which a CSV file does not store
balanced_raw_diabetes_df <- read_csv("/home/jovyan/work/data/processed/balanced_cdc_diabetes_health_indicators.csv", show_col_types = FALSE) %>%
  mutate(across(!BMI, ~ factor(.)))
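
The factor conversion has to be repeated after reloading because a CSV file keeps only the values, not the column types. A small base R sketch of this round trip, using a made-up `toy_df` and a temporary file rather than the project paths:

```r
# Toy stand-in for the project data (values are made up for illustration).
toy_df <- data.frame(
  BMI = c(40, 25, 28, 31),
  HighBP = factor(c(1, 0, 1, 0)),
  Diabetes_binary = factor(c(0, 0, 1, 1))
)

# Round-trip the data through a temporary CSV file.
tmp_path <- tempfile(fileext = ".csv")
write.csv(toy_df, tmp_path, row.names = FALSE)
reloaded_df <- read.csv(tmp_path)

# The factor columns are no longer factors after reloading.
classes_after_reload <- sapply(reloaded_df, class)
print(classes_after_reload)

# Re-apply the conversion to every column except BMI.
factor_cols <- setdiff(names(reloaded_df), "BMI")
reloaded_df[factor_cols] <- lapply(reloaded_df[factor_cols], factor)
classes_after_fix <- sapply(reloaded_df, class)
print(classes_after_fix)
```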

In [4]:
# (5) Split data into train + test for ML
set.seed(6)

diabetes_split <- initial_split(balanced_raw_diabetes_df, prop = 0.75, strata = Diabetes_binary)
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
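
The `strata` argument keeps the class proportions similar in the training and testing sets. The base R sketch below illustrates the idea on made-up data by sampling 75% of the rows within each class separately. It is a simplified illustration, not how rsample implements the split, and `toy_df`, `toy_train`, and `toy_test` are hypothetical names.

```r
# Toy data with a 75/25 class split (values are made up for illustration).
set.seed(6)
toy_df <- data.frame(
  BMI = rnorm(200, mean = 28, sd = 5),
  Diabetes_binary = factor(rep(c(0, 1), times = c(150, 50)))
)

# Sample 75% of the row indices within each class separately.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy_df)), toy_df$Diabetes_binary),
  function(idx) idx[sample.int(length(idx), size = floor(0.75 * length(idx)))]
))

toy_train <- toy_df[train_idx, ]
toy_test <- toy_df[-train_idx, ]

# Class proportions should be close to each other in both subsets.
train_prop <- prop.table(table(toy_train$Diabetes_binary))
test_prop <- prop.table(table(toy_test$Diabetes_binary))
print(round(train_prop, 3))
print(round(test_prop, 3))
```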

#### (3) EDA - Summary Statistics

In [None]:
# <chi-square-test-here>
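
The placeholder above is left as in the original. As a hedged illustration of what a chi-square test of independence between one categorical predictor and the outcome could look like, here is a minimal base R sketch on simulated data. The variable names mirror the dataset, but the values, the effect size, and the choice of `HighBP` are assumptions made only for this example.

```r
# Simulated data in which the outcome depends on HighBP
# (values are made up for illustration).
set.seed(6)
n <- 500
toy_df <- data.frame(HighBP = factor(rbinom(n, 1, 0.4)))
toy_df$Diabetes_binary <- factor(
  rbinom(n, 1, ifelse(toy_df$HighBP == "1", 0.6, 0.3))
)

# Cross-tabulate the predictor against the outcome.
contingency <- table(toy_df$HighBP, toy_df$Diabetes_binary,
                     dnn = c("HighBP", "Diabetes_binary"))
print(contingency)

# Chi-square test of independence; for a 2x2 table chisq.test()
# applies Yates' continuity correction by default.
chi_result <- chisq.test(contingency)
print(chi_result)
```

A small p-value suggests that the predictor and the outcome are not independent. In the project itself, such a test would be run on `diabetes_train` so that the test set stays untouched.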

#### (4) EDA - Visualization

In [None]:
# binary and other categorical variables (shown as bar plots)
binary_vars <- c("HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", 
                 "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", 
                 "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", 
                 "DiffWalk", "Sex", "Age", "Education", "Income", "MentHlth", "PhysHlth", "GenHlth")

# continuous variable (shown as a density plot)
nonbinary_var <- c("BMI")

# --------------------------------------------------
# inits
bar_plots <- list()
density_plots <- list()

# --------------------------------------------------
# Bar plots
for (var in binary_vars) {
  p <- ggplot(diabetes_train, aes(x = !!sym(var), fill = as.factor(Diabetes_binary))) +
    geom_bar(position = "fill") + 
    scale_fill_manual(values = c("#FF9999", "#66B2FF")) + 
    labs(title = paste("Diabetes Binary by", var),
         x = var,
         y = "Proportion",
         fill = "Diabetes_binary") +
    theme_minimal()
  bar_plots[[var]] <- p
}


# --------------------------------------------------
# Density plots

for (var in nonbinary_var) {
  p <- ggplot(diabetes_train, aes(x = !!sym(var), fill = as.factor(Diabetes_binary))) +
    geom_density(alpha = 0.5) +
    scale_fill_manual(values = c("#FF9999", "#66B2FF")) + 
    labs(title = paste("Diabetes Binary by", var),
         x = var,
         y = "Density",
         fill = "Diabetes_binary") +
    theme_minimal()
  density_plots[[var]] <- p
}

# ----------------------------------------------------------------------------------------------
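# Arrange all 21 plots in a 3-column grid. The figure size is controlled by the
# repr.plot.width and repr.plot.height options set at the top of the notebook.
combined_plots <- wrap_plots(c(bar_plots, density_plots), ncol = 3, nrow = 7)
combined_plots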

#### (5) Classification Analysis

In [5]:
# --------------------------------------------------
# keep only the outcome and the predictors used in the model

diabetes_train_filtered <- diabetes_train %>%
  select(Diabetes_binary, HighBP, HighChol, CholCheck, Stroke, HeartDiseaseorAttack, 
         HvyAlcoholConsump, DiffWalk, Age, Education, Income, GenHlth)

# ---------------------------------------------------
# pipeline for logistic regression 

lr_mod <- logistic_reg(penalty = tune(), mixture = 1) %>% 
    set_engine("glmnet") %>%
    set_mode("classification")

folds <- vfold_cv(diabetes_train_filtered, v=5)

lr_recipe <- recipe(Diabetes_binary ~ ., data = diabetes_train_filtered) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors())

lr_workflow <- workflow() %>%
  add_recipe(lr_recipe)

In [None]:
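# tune the lasso penalty with 5-fold cross-validation

lambda_grid <- grid_max_entropy(penalty(), size = 10)

# event_level = "second" makes recall refer to the diabetic class ("1"),
# matching the event level used for the test-set metrics
lasso_grid <- tune_grid(lr_workflow %>% add_model(lr_mod),
                        resamples = folds,
                        grid = lambda_grid,
                        metrics = metric_set(recall),
                        control = control_grid(event_level = "second"))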
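
The default `penalty()` range in dials runs from 10^-10 to 1 on a log10 scale, so candidate values are spread across orders of magnitude rather than evenly on the raw scale. A minimal base R sketch of a regular grid over that range is shown below. It is a simpler stand-in for the space-filling design used above, and `penalty_grid` is a hypothetical name.

```r
# Ten candidate penalties, evenly spaced on the log10 scale from 1e-10 to 1.
penalty_grid <- data.frame(penalty = 10^seq(-10, 0, length.out = 10))
print(penalty_grid)

# On the log10 scale the spacing between neighbouring candidates is constant.
log_steps <- diff(log10(penalty_grid$penalty))
print(log_steps)
```

A data frame with a `penalty` column like this one could be passed as the `grid` argument of `tune_grid()` in place of `lambda_grid`.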

In [None]:
# choose the penalty with the highest cross-validated recall

highest_recall <- lasso_grid %>% select_best(metric = "recall")

lasso_tuned_wflow <- finalize_workflow(lr_workflow %>% 
                     add_model(lr_mod), highest_recall) %>%
                     fit(data = diabetes_train_filtered)

#### (6) Result of Analysis - Visualization

In [None]:
# apply on test set

lasso_preds <- lasso_tuned_wflow %>% predict(diabetes_test)
lasso_probs <- lasso_tuned_wflow %>% predict(diabetes_test, type="prob")
lasso_modelOutputs <- cbind(diabetes_test, lasso_preds, lasso_probs)

classificationMetrics <- metric_set(sens, spec, ppv, npv, accuracy, recall, f_meas)

lasso_metrics <- rbind(classificationMetrics(lasso_modelOutputs, truth = Diabetes_binary, estimate = .pred_class, event_level = "second"),
                       roc_auc(lasso_modelOutputs, truth = Diabetes_binary, .pred_1, event_level = "second"))
lasso_metrics

autoplot(roc_curve(lasso_modelOutputs, Diabetes_binary, .pred_1, event_level = "second"))
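
To show what the reported metrics mean and why `event_level = "second"` matters, here is a small base R sketch that computes them by hand from a confusion matrix. The `truth` and `pred` vectors are made up for illustration and are not model output.

```r
# Made-up labels: 6 non-diabetic (0) and 4 diabetic (1) cases.
truth <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
pred  <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 0), levels = c(0, 1))

conf_mat <- table(Predicted = pred, Truth = truth)
print(conf_mat)

# Treat "1" (diabetic) as the event, mirroring event_level = "second".
tp <- conf_mat["1", "1"]  # predicted diabetic, truly diabetic
fn <- conf_mat["0", "1"]  # predicted non-diabetic, truly diabetic
fp <- conf_mat["1", "0"]  # predicted diabetic, truly non-diabetic
tn <- conf_mat["0", "0"]  # predicted non-diabetic, truly non-diabetic

sensitivity <- tp / (tp + fn)          # same quantity as recall: 0.75
specificity <- tn / (tn + fp)          # 4 / 6
ppv <- tp / (tp + fp)                  # 3 / 5
npv <- tn / (tn + fn)                  # 4 / 5
accuracy <- (tp + tn) / sum(conf_mat)  # 0.7

# If "0" were treated as the event instead (yardstick's default,
# event_level = "first"), recall would equal the specificity above.
recall_if_first_level <- tn / (tn + fp)
```

With the diabetic class as the event, recall answers the question of interest here: what share of truly diabetic individuals the model identifies.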

# Discussion

# References