### 1. Title : Diabetes Diagnosis by Classification

### 2. Introduction : 
Diabetes is a chronic ailment affecting millions of individuals in the United States and represents a significant public health concern worldwide. The disease is characterized by elevated blood sugar levels and major health complications requiring lifelong management. The prevalence of diabetes in the US has been steadily increasing, and approximately 11% of the population suffers from diabetes today. Investigating the prevalence and risk factors for diabetes can help drive efforts to identify individuals and populations at risk and tailor interventions specifically to them.
The Diabetes Health Indicators Dataset (2015) contains a valuable trove of data collected through the Behavioral Risk Factor Surveillance System (BRFSS), a telephone survey of health outcomes across the US. Our objective is to develop an accurate classification model for diabetes and to pinpoint the survey response variables that contribute most to predicting diabetes.


### 3. Preliminary exploratory data analysis :


In [1]:
# Installing Packages
# install.packages("tidyverse")
# install.packages("RColorBrewer")

# Loading in Libraries
library(tidyverse)
library(RColorBrewer)

# Loading in Data
url <- 'https://github.com/MatildaBae/dsci-100-2023W1-group45/raw/main/diabetes_binary_5050split_health_indicators_BRFSS2015.csv'
diab_data <- read_csv(url)


# Inspecting data
dim(diab_data)
str(diab_data)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


ERROR: Error in open.connection(structure(4L, class = c("curl", "connection"), conn_id = <pointer: 0x1d9>), : HTTP error 404.


In [2]:
# Formatting column classes appropriately
age_levels <- c('[18,24]', 
                '[25,29]', 
                '[30,34]', 
                '[35,39]', 
                '[40,44]', 
                '[45,49]', 
                '[50,54]', 
                '[55,59]', 
                '[60,64]', 
                '[65,69]', 
                '[70,74]', 
                '[75,79]', 
                '[80,∞)')

diab_data <- diab_data %>% 
        mutate(across(everything(), as_factor)) %>%
        mutate(Diabetes_binary = fct_recode(Diabetes_binary, 'Case' = '1', 'Control' = '0')) %>%
        mutate(Diabetes_binary = factor(Diabetes_binary, levels = c('Case', 'Control'))) %>%
        mutate(Sex = fct_recode(Sex, 'Female' = '0', 'Male' = '1')) %>%
        mutate(Sex = factor(Sex, levels = c('Male', 'Female'))) %>%
        mutate(Age = fct_recode(Age, 
                                '[18,24]' = '1',
                                '[25,29]' = '2',
                                '[30,34]' = '3',
                                '[35,39]' = '4',
                                '[40,44]' = '5',
                                '[45,49]' = '6',
                                '[50,54]' = '7',
                                '[55,59]' = '8',
                                '[60,64]' = '9',
                                '[65,69]' = '10',
                                '[70,74]' = '11',
                                '[75,79]' = '12',
                                '[80,∞)' = '13')) %>%
        mutate(Age = factor(Age, levels = age_levels)) %>%
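        # Convert via character first: as.numeric() on a factor returns level codes, not the recorded values
        mutate(BMI = as.numeric(as.character(BMI)),
               MentHlth = as.numeric(as.character(MentHlth)),
               GenHlth = as.numeric(as.character(GenHlth)))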


# Making sure that classes are balanced
diab_data %>% count(Diabetes_binary)

ERROR: Error in eval(expr, envir, enclos): object 'diab_data' not found


In [None]:
# Exploring BMI data
summary_bmi <- diab_data %>% 
        select(Diabetes_binary, BMI) %>%
        group_by(Diabetes_binary) %>%
        summarize(mean_BMI = mean(BMI),
                  sd_BMI = sd(BMI))
summary_bmi

mean_bmi_case <- summary_bmi %>% 
        filter(Diabetes_binary == 'Case') %>% 
        pull(mean_BMI)
mean_bmi_control <- summary_bmi %>% 
        filter(Diabetes_binary == 'Control') %>% 
        pull(mean_BMI)



In [None]:
# Plotting BMI frequency
bmi_plot <- diab_data %>%
        ggplot(aes(x = BMI, fill = Diabetes_binary)) +
        geom_histogram(bins = 30, 
                       color = 'white',
                       linewidth = 0.5,
                       alpha = 0.5,
                       position = 'identity') +
        geom_vline(xintercept = mean_bmi_case, color = 'darkgreen', linewidth = 1) +
        geom_vline(xintercept = mean_bmi_control, color = 'red', linewidth = 1) +
        scale_x_continuous(limits = c(0, 60),
                           breaks = seq(0, 60, 10)) +
        labs(x = 'Body Mass Index (BMI)', 
             y = 'Count',
             fill = '') +
        ggtitle('BMI distribution across cases-controls') +
        scale_fill_brewer(palette = 'Set2') +
        theme_classic() +
        theme(text = element_text(size = 12))

bmi_plot

In [None]:
# General and Mental Health scores
diab_data %>% select(Diabetes_binary, GenHlth, MentHlth) %>%
        group_by(Diabetes_binary) %>%
        summarize(across(GenHlth:MentHlth, mean, .names = 'mean_{.col}'),
                  across(GenHlth:MentHlth, sd, .names = 'sd_{.col}'))


In [None]:
# Population plot
diab_data %>% 
        count(Diabetes_binary, 
              Age, 
              Sex) %>%
        mutate(n = ifelse(Sex == 'Male', 
                          n * -1, 
                          n)) %>%
        ggplot(aes(x = Age, 
                   y = n, 
                   fill = Diabetes_binary)) +
        geom_bar(stat = 'identity', 
                 position = 'dodge',
                 color = 'white',
                 width = 1) +
        labs(x = 'Age bracket', 
             y = 'Count',
             fill = '') +
        ggtitle('Age and gender distribution across case-controls') +
        coord_flip() +
        scale_fill_brewer(palette = 'Set2') +
        theme_classic() +
        scale_y_continuous(labels = abs,
                           expand = c(0,0)) +
        facet_wrap(~Sex, 
                   strip.position = 'bottom', 
                   scales = 'free_x') +
        theme(text = element_text(size = 12), 
              panel.spacing.x = unit(0, 'pt'))

diab_data

In [None]:
# Looking at binary survey data
diab_summary <- diab_data %>% 
        select(-Sex) %>%
        group_by(Diabetes_binary) %>% 
        summarize(across(where(~ is.factor(.x) && length(unique(.x)) == 2), ~ sum(. == "1") / n(), .names = '{.col}'))

survey_plot <- diab_summary %>%
        pivot_longer(!Diabetes_binary) %>%
        ggplot(aes(x = name, y = value, fill = Diabetes_binary)) +
        geom_bar(stat = 'identity', position = 'dodge', width = 0.7) +
        labs(x = '', y = 'Proportion answered "Yes"', fill = '') +
        ggtitle("Survey results across cases-controls") +
        scale_fill_brewer(palette = 'Set2') +
        theme_classic() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

survey_plot

### 4. Methods :
The dataset comprises 21 variables, both categorical and numerical. We plan to identify the 5 to 10 variables with the highest predictive potential for diabetes and use them to train a robust classifier. To guide feature selection, we will combine a short literature search with basic data exploration and forward selection, identifying the variables that most influence an individual’s risk of diabetes.
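
As a rough illustration of forward selection, the sketch below uses base R's `step()` with AIC as the selection criterion on simulated data. The data frame `sim`, its columns, and the coefficients used to generate the outcome are all hypothetical stand-ins, and AIC-based stepwise selection is only one possible variant. It is not necessarily the procedure that will be applied to the survey data.

```r
# Sketch only: simulated stand-in data, not the BRFSS survey responses.
set.seed(2023)
n <- 500
sim <- data.frame(
  BMI     = rnorm(n, mean = 29, sd = 6),
  GenHlth = sample(1:5, n, replace = TRUE),
  HighBP  = rbinom(n, 1, 0.5),
  noise   = rnorm(n)                # deliberately unrelated to the outcome
)

# The simulated outcome depends on BMI, GenHlth and HighBP, but not on `noise`
log_odds <- -6 + 0.12 * sim$BMI + 0.5 * sim$GenHlth + 0.8 * sim$HighBP
sim$diabetes <- rbinom(n, 1, plogis(log_odds))

# Forward selection: start from the intercept-only model and add one predictor
# at a time, keeping an addition only if it lowers the AIC.
null_model <- glm(diabetes ~ 1, data = sim, family = binomial)
forward_fit <- step(null_model,
                    scope = list(lower = ~ 1,
                                 upper = ~ BMI + GenHlth + HighBP + noise),
                    direction = "forward",
                    trace = 0)

# Names of the predictors retained by the procedure
selected <- attr(terms(forward_fit), "term.labels")
selected
```

An alternative would be to score each candidate predictor set by cross-validated accuracy rather than AIC. That ties selection directly to predictive performance, at a higher computational cost.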
We plan to use cross-validation to train and compare several predictive models, including random forests and/or generalized linear models, to predict disease presence. We will assess the relative contribution of each predictor to model performance through variable importance analysis. We will also use dimensionality-reduction methods such as PCA, PCoA, and/or t-SNE to visualize these multidimensional data in clear, informative plots.
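
To make the cross-validation step concrete, here is a minimal sketch of 5-fold cross-validation for a logistic regression (a generalized linear model) in base R. It again uses simulated data. The data frame `sim`, the choice of predictors, the 0.5 decision threshold, and `k = 5` are assumptions made for illustration only.

```r
# Sketch only: simulated stand-in data, not the BRFSS survey responses.
set.seed(45)
n <- 600
sim <- data.frame(
  BMI     = rnorm(n, mean = 29, sd = 6),
  GenHlth = sample(1:5, n, replace = TRUE)
)
sim$diabetes <- rbinom(n, 1, plogis(-5 + 0.1 * sim$BMI + 0.6 * sim$GenHlth))

# Assign every row to one of k folds at random
k <- 5
fold_id <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, then score the held-out fold
fold_accuracy <- sapply(1:k, function(i) {
  train <- sim[fold_id != i, ]
  valid <- sim[fold_id == i, ]
  fit <- glm(diabetes ~ BMI + GenHlth, data = train, family = binomial)
  prob <- predict(fit, newdata = valid, type = "response")
  # Classify as a case when the predicted probability exceeds 0.5
  mean((prob > 0.5) == (valid$diabetes == 1))
})

# Cross-validated estimate of classification accuracy
cv_accuracy <- mean(fold_accuracy)
cv_accuracy
```

The same fold assignment can be reused across candidate models, so that differences in `cv_accuracy` reflect the models rather than the random split.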

### 5. Expected outcomes and significance :
