# COGS 108 - Data Checkpoint

## Authors


**Pranava Gande** - Conceptualization, Data Curation, Background research

**Sinai Brito** - Visualization, Analysis

**Alexa Covarrubias** - Visualization, Analysis

**Jenna Cheng** - Writing - Original Draft

**Monique Ramirez** - Writing - Original Draft

## Research Question

This project is about U.S. adults from the 2017-2018 National Health and Nutrition Examination Survey (NHANES) who meet the biomarker criteria for diabetes (either from tests of hemoglobin A1c or fasting plasma glucose).

This project asks three related questions:

1. Which non-clinical variables – namely, sex, race, ethnicity, education, income, health insurance, source of care, time since previous routine health check-up, exercise, smoking, and alcohol usage – are associated with remaining undiagnosed?

2. How well can we predict whether a person is aware of their diabetes (i.e., diagnosed vs. undiagnosed) using these non-clinical variables, and which of these variables have the highest feature importance?

3. Does classification performance differ across race, ethnicity, or income, thereby challenging the equity of non-laboratory screening models?

We will address these questions using survey-weighted logistic regression and supervised classification models.

## Background and Prior Work

Diabetes is among the most common chronic conditions in the United States, and it is a leading driver of illness, disability, and healthcare spending.[[1]](#ref1)[[2]](#ref2) According to CDC estimates, tens of millions of U.S. adults have diabetes, and roughly a quarter of them are unaware of it.[[1]](#ref1) High blood sugar can remain asymptomatic while causing insidious health effects.[[3]](#ref3) Thus, people who remain undiagnosed, for example because they skip routine health checkups, may first present with complications, including cardiovascular disease, kidney disease, or ocular damage.[[3]](#ref3)[[4]](#ref4) This makes early detection of diabetes critical.[[4]](#ref4)

NHANES is a nationally representative cross-sectional survey that provides interviews, physical exam results, and laboratory test data from the U.S. population.[[5]](#ref5) It is among the main data sources used to monitor the prevalence and burden of diabetes, both diagnosed and undiagnosed, at population scale.[[1]](#ref1)[[5]](#ref5)[[6]](#ref6) For each participant, it documents whether a physician has diagnosed diabetes as well as lab biomarker (hemoglobin A1c and fasting plasma glucose) results.[[5]](#ref5)[[6]](#ref6) This wealth of data has resulted in many important recent epidemiological findings.[[1]](#ref1)[[6]](#ref6) These include documented healthcare inequities; namely, Black, Hispanic, and Asian adults have a higher prevalence of both diagnosed and undiagnosed diabetes than non-Hispanic White adults.[[6]](#ref6) In addition, participants with lower education levels were found to have higher rates of undiagnosed diabetes.[[6]](#ref6) These findings indicate that there are broader inequities in which people are likely to be made aware of their diabetes via formal diagnosis.[[3]](#ref3)[[6]](#ref6)[[7]](#ref7)

A few studies have linked other non-clinical variables to diabetes diagnosis. These include variables that are not related to healthcare equity. For example, Wilder et al. showed that obesity was linked to undiagnosed diabetes in NHANES III (1988-1994).[[8]](#ref8) In the Korean NHANES (KNHANES 2008-2011), a national survey in Korea modeled on NHANES, cardiometabolic risk factors like hypertension and dyslipidemia were also linked to undiagnosed diabetes.[[9]](#ref9) Thus, undiagnosed diabetes is affected by the interplay of numerous social and medical factors, which can be difficult to disentangle despite the high stakes.[[3]](#ref3)[[4]](#ref4)[[8]](#ref8)[[9]](#ref9) This makes the problem a strong candidate for data science approaches.

There is a growing line of work that aims to do exactly this, i.e., detect undiagnosed diabetes solely from easily measured variables.[[10]](#ref10)[[11]](#ref11) A recent study in JMIR AI trained a model on NHANES 1999-2000 data using demographics, anthropometric measures, health behaviors, and some other clinical variables.[[10]](#ref10) This model achieved an AUC of 0.91 when distinguishing adults with undiagnosed diabetes from healthy adults.[[10]](#ref10) For comparison, an AUC above 0.8 is generally considered clinically usable for medical diagnostic models.[[13]](#ref13) Another paper by Riveros Perez and Avella-Molano reported that XGBoost achieved an AUC of 0.82 on the same task using only lifestyle and anthropometric measurements in NHANES 2007-2018.[[11]](#ref11) These promising results suggest that data science models might support early risk screening for diabetes.[[10]](#ref10)[[11]](#ref11)

Our project is also a predictive-modelling task, but otherwise differs from previous studies in the field.[[10]](#ref10)[[11]](#ref11) We focus on a different population: adults in NHANES 2017-2018 who already meet biomarker criteria for diabetes.[[5]](#ref5)[[6]](#ref6)[[12]](#ref12) In this group, we aim to model awareness of diabetes status and discover the features that best explain this awareness gap.[[4]](#ref4)[[7]](#ref7) As such, our work builds on previous NHANES studies of undiagnosed diabetes.[[4]](#ref4)[[8]](#ref8)[[9]](#ref9)[[10]](#ref10)[[11]](#ref11) We take prior results a step further by applying a novel predictive model to identify who among those with diabetes remains undiagnosed.[[10]](#ref10)[[11]](#ref11) These results will also shed light on equity-related questions by showing how diabetes screening models may perform across different demographic groups.[[3]](#ref3)[[6]](#ref6)[[7]](#ref7)

<a id="ref1"></a>[1] Centers for Disease Control and Prevention. *National Diabetes Statistics Report.* Atlanta, GA: U.S. Department of Health and Human Services. Available at: https://www.cdc.gov/diabetes/php/data-research/

<a id="ref2"></a>[2] American Diabetes Association. *Economic Costs of Diabetes in the U.S. in 2017.* Diabetes Care. 2018;41(5):917–928. PubMed: https://pubmed.ncbi.nlm.nih.gov/29567642/

<a id="ref3"></a>[3] Mayo Clinic. *Hyperglycemia in diabetes – Symptoms and causes.* Available at: https://www.mayoclinic.org/diseases-conditions/hyperglycemia/symptoms-causes/syc-20373631

<a id="ref4"></a>[4] Centers for Disease Control and Prevention. *How Diabetes Can Affect Your Body.* Infographic. Available at: https://www.cdc.gov/diabetes/communication-resources/how-diabetes-can-affect-your-body.html

<a id="ref5"></a>[5] National Center for Health Statistics. *National Health and Nutrition Examination Survey (NHANES) – Overview and Methods.* Available at: https://www.cdc.gov/nchs/hus/sources-definitions/nhanes.htm

<a id="ref6"></a>[6] Centers for Disease Control and Prevention. *Appendix A: Detailed Tables for the National Diabetes Statistics Report (2017–2020).* Table 1: Age-adjusted prevalence of diagnosed, undiagnosed, and total diabetes among adults, by race/ethnicity and education. Available at: https://www.cdc.gov/diabetes/php/data-research/appendix.html

<a id="ref7"></a>[7] To KG et al. *Awareness of having hypertension, diabetes and dyslipidaemia among US adults: The 2011–2018 NHANES data.* Scand J Public Health. 2025;53(4):391–399. PubMed: https://pubmed.ncbi.nlm.nih.gov/38679806/

<a id="ref8"></a>[8] Wilder RP, Majumdar SR, Klarenbach SW, Jacobs P. *Socio-economic status and undiagnosed diabetes.* Diabetes Res Clin Pract. 2005;70(1):26–30. PubMed: https://pubmed.ncbi.nlm.nih.gov/16126120/

<a id="ref9"></a>[9] Kim JH et al. *Prevalence and Risk Factors for Undiagnosed Glucose Intolerance Status in Apparently Healthy Young Adults Aged <40 Years: The Korean National Health and Nutrition Examination Survey 2014–2017.* Int J Environ Res Public Health. 2019;16(13):2396. PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6651181/

<a id="ref10"></a>[10] Liu J et al. *Use of Automated Machine Learning to Detect Undiagnosed Diabetes in US Adults: Development and Validation Study.* JMIR AI. 2025;4:e68260. Available at: https://ai.jmir.org/2025/1/e68260

<a id="ref11"></a>[11] Riveros Perez E, Avella-Molano B. *Learning from the machine: is diabetes in adults predicted by lifestyle variables? A retrospective predictive modelling study of NHANES 2007–2018.* BMJ Open. 2025;15(3):e096595. Available at: https://bmjopen.bmj.com/content/15/3/e096595

<a id="ref12"></a>[12] National Center for Health Statistics. *NHANES 2017–2018 Glycohemoglobin (GHB_J) Data Documentation, Codebook, and Frequencies.* Available at: https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2017/DataFiles/GHB_J.htm

<a id="ref13"></a>[13] Fan J, Upadhye S, Worster A. *Understanding receiver operating characteristic (ROC) curves.* CJEM. 2006;8(1):19–20. (See also: *Assessing the Accuracy of Diagnostic Tests* – Table 2 AUC interpretation.) PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6410404/

## Hypothesis


We hypothesize that, among U.S. adults meeting the biomarker criteria for diabetes in NHANES 2017-2018, being undiagnosed will be more common in those with worse healthcare access. The specific indicators of worse access are lower income, lack of health insurance, no consistent source of care, and lack of annual health checkups. Drawing on prior work, we also hypothesize that people of color (such as Black, Hispanic, and Asian adults) will have higher odds of being undiagnosed. With regard to predictive modelling, we hypothesize that models using only non-clinical variables will achieve moderate predictive power, roughly corresponding to an AUC of 0.7. We expect this because previous models have succeeded on similar tasks. Lastly, we hypothesize that predictive performance will differ across racial/ethnic and income groups.

## Data

**Ideal Dataset**

The ideal dataset would have a large sample size and be nationally representative. In addition, it would contain the following measurements:

*Outcome Data*
- Laboratory tests indicative of diabetes (hemoglobin A1c and fasting plasma blood glucose)
- Self-reported history of diabetes diagnosis by a physician

*Predictors and Controls*
- Demographic: age, sex, race, ethnicity
- Socioeconomic: household income, employment, education
- Healthcare access: insurance coverage, typical source of care, time since last routine checkup
- Lifestyle: smoking, alcohol, exercise
- Anthropometric: BMI, waist circumference, adiposity

*Overall Dataset Features*
- Large sample size to allow analysis of subgroups
- Weighted representation of data from different geographic regions
- Extensive documentation and codebooks

**Real Dataset**

*NHANES 2017-2018*

The primary dataset for our project consists of the public-use files from NHANES 2017-2018. These are freely available from the CDC website:
https://wwwn.cdc.gov/nchs/nhanes/Default.aspx

NHANES is a nationally representative survey that provides interviews, physical exam results, and laboratory test data from the U.S. population. The following files will be used in this project:
- Demographics – includes age, sex, race, ethnicity, education, income
- Glycohemoglobin – lab file; includes hemoglobin A1c values
- Plasma Fasting Glucose – lab file; includes fasting plasma glucose values
- Individual files under the Questionnaire section that contain insurance, source of care, smoking, alcohol, and exercise

All the above files are de-identified and do not require special permission to download.

```python
# Run this cell whenever you are actively developing modules in .py files.
# It makes sure that any modules we load are reloaded when their source .py files change;
# it is not needed if you aren't editing modules.

%load_ext autoreload
%autoreload 2
```


```python
# Setup code: this only needs to be run once after cloning the repo.
# It downloads the data from its source to the `data/00-raw/` directory;
# if the data hasn't been updated you don't need to run it again.

# If you don't already have these packages, uncomment this line:
# %pip install requests tqdm

import sys
sys.path.append('./modules')  # tells Python where to look for modules to import

import get_data  # provides the function we need to download data

# Template list of data files, as a list of dictionaries with keys
#   'url'      where the resource is located
#   'filename' the local filename where it will be stored
# NOTE: these entries are still the template placeholders, and the list is
# not passed to get_data.run_file() below.
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.run_file()
```

Overall Download Progress: 100% | 8/8 [00:11<00:00, 1.39s/it]
Successfully downloaded: DEMO_J.XPT
Successfully downloaded: GHB_J.XPT
Successfully downloaded: GLU_J.XPT
Successfully downloaded: HIQ_J.XPT
Successfully downloaded: SMQ_J.XPT
Successfully downloaded: ALQ_J.XPT
Successfully downloaded: PAQ_J.XPT
Successfully downloaded: DIQ_J.XPT

```python
# Load, clean, and merge the raw NHANES files, then read the interim dataset
import clean_data
import pandas as pd

clean_data.clean()

df = pd.read_csv("data/01-interim/nhanes_diabetes.csv")
df.head()
```


Dataset saved to data/01-interim/nhanes_diabetes.csv
Rows: 860
Columns: 181


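| Unnamed: 0 | SEQN | SDDSRVYR | RIDSTATR | RIAGENDR | RIDAGEYR | RIDAGEMN | RIDRETH1 | RIDRETH3 | RIDEXMON | RIDEXAGM | ... | DID310D | DID320 | DID330 | DID341 | DID350 | DIQ350U | DIQ360 | DIQ080 | diabetes_biomarker | diagnosed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 93714.0 | 10.0 | 2.0 | 2.0 | 54.0 | NaN | 4.0 | 4.0 | 2.0 | NaN | ... | 6666.0 | 9999.0 | 9999.0 | 5.397605e-79 | 4.0 | 2.0 | 3.0 | 2.0 | True | True |
| 1 | 93730.0 | 10.0 | 2.0 | 1.0 | 57.0 | NaN | 2.0 | 2.0 | 2.0 | NaN | ... | 6666.0 | 9999.0 | 9999.0 | 2.0 | 1.0 | 1.0 | 5.0 | 2.0 | True | True |
| 2 | 93758.0 | 10.0 | 2.0 | 2.0 | 55.0 | NaN | 3.0 | 3.0 | 1.0 | NaN | ... | 6666.0 | 9999.0 | 6666.0 | 5.397605e-79 | 3.0 | 2.0 | 4.0 | 1.0 | True | True |
| 3 | 93759.0 | 10.0 | 2.0 | 1.0 | 60.0 | NaN | 5.0 | 7.0 | 2.0 | NaN | ... | 80.0 | 82.0 | 9999.0 | 5.397605e-79 | 5.397605e-79 | NaN | 3.0 | 2.0 | True | True |
| 4 | 93762.0 | 10.0 | 2.0 | 2.0 | 74.0 | NaN | 1.0 | 1.0 | 1.0 | NaN | ... | 6666.0 | 9999.0 | 6666.0 | 5.397605e-79 | 5.397605e-79 | NaN | 2.0 | 2.0 | True | True |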
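
The value `5.397605e-79` in the preview above is most likely not a real measurement. It is a commonly seen artifact when zeros stored in SAS transport (XPT) files are read into floating-point columns. Below is a minimal sketch of one way to map such values back to 0, using a small made-up frame rather than the real data. The helper name `fix_xpt_zeros` and the tolerance are assumptions and are not part of `clean_data`.

```python
import numpy as np
import pandas as pd

def fix_xpt_zeros(df: pd.DataFrame, tol: float = 1e-70) -> pd.DataFrame:
    """Return a copy with tiny floats (assumed XPT zero artifacts) set to exactly 0.0."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    # abs(x) < tol evaluates to False for NaN, so missing values are left untouched
    tiny = out[num_cols].abs() < tol
    out[num_cols] = out[num_cols].mask(tiny, 0.0)
    return out

# Hypothetical values modeled on the preview above
demo = pd.DataFrame({
    "DID341": [5.397605e-79, 2.0, np.nan],
    "DID350": [4.0, 1.0, 5.397605e-79],
})
cleaned = fix_xpt_zeros(demo)
print(cleaned)
```

Codes such as 6666 and 9999 are a separate matter: they are questionnaire response codes, and their meaning should be looked up per variable in the NHANES codebooks before recoding.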


**Data Overview**

**Dataset #1**
- **Dataset Name:** NHANES 2017-2018
- **Link to the dataset:** https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?BeginYear=2017
- **Number of observations:** 8704 in the source data; 860 rows in the cleaned file (per the cleaning output above)
- **Number of variables:** 181 columns in the merged, cleaned file (per the cleaning output above)
- **Description of variables:** The Demographics file includes age, sex, race, ethnicity, education, and income. The Glycohemoglobin lab file includes hemoglobin A1c values, and the Plasma Fasting Glucose lab file includes fasting plasma glucose values. Individual files under the Questionnaire section contain insurance, source of care, smoking, alcohol, and exercise.
- **Descriptions of any shortcomings this dataset has with respect to the project:** NHANES is cross-sectional, meaning that we can only identify associations and cannot establish causality. Our project also relies on self-reported variables, which may introduce bias. Lastly, sample sizes for subgroups can be quite small, which reduces statistical power.

---

**Downloading and Loading Data**

The data comes from the 2017-2018 cycle of the National Health and Nutrition Examination Survey. The survey combines interview responses, physical examinations, and laboratory results, which makes it well suited to the study of diabetes.

Our analytic sample includes individuals who meet the biomarker criteria for diabetes, based on either an elevated hemoglobin A1c or an elevated fasting plasma glucose level. In this project we are not focusing on who has diabetes, but on who among them has been diagnosed by a physician and is therefore aware of it. We are comparing two groups: diagnosed and undiagnosed individuals. By restricting the sample to people who already meet the criteria, we get a clearer way of differentiating between those who are aware of their condition and those who are not.

We have merged several files: demographic information such as age, sex, race, education, and income; laboratory files containing hemoglobin A1c and fasting glucose levels; and questionnaire data such as insurance status, source of care, smoking, alcohol use, and exercise. NHANES is designed to be nationally representative, which gives it higher external validity and allows results to generalize to U.S. adults rather than just the survey participants.

This dataset is very detailed and provides insight into important factors like healthcare access and the social determinants of health. This information lets us examine non-clinical factors such as insurance, income, education, and routine checkups, and investigate why some adults remain undiagnosed even though they meet the criteria for diabetes. Having a large, diverse sample also makes it possible to evaluate subgroups defined by income and race, which supports equity-focused analysis.

There are some limitations with the dataset. It is cross-sectional, so it captures data at a single point in time and cannot establish causation. It also excludes unhoused and institutionalized people, which may lead us to underestimate how common undiagnosed diabetes is. Despite these limitations, the dataset is a good basis for examining awareness of diabetes and developing predictive models.

---

**Data Cleaning**

NHANES 2017-2018 is clean and well-organized by design. Each survey component (e.g., demographics, lab tests, questionnaires) is released as a separate SAS transport (XPT) file. Each file has a common participant ID field called "SEQN", which makes merging the different files straightforward. Additionally, variable names are standardized and described in the documentation. For this reason, little of the data preparation involves fixing errors, as is typical with messier datasets. Instead, it mainly consists of selecting the subset of participants and variables relevant to our question.

In our code, we have done the following:
1) Read in each NHANES component ("DEMO_J", "GHB_J", etc.)
2) Merge these into one wide dataset using "SEQN"
3) Apply basic inclusion criteria:
   - Age > 18
   - Include those with diabetes self-report
   - Include those with biomarker definition of diabetes (variables "LBXGH" and "LBXGLU")
4) Save to CSV file

This completes a reproducible data-preparation pipeline for NHANES.
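
The steps above can be sketched as follows. This is a minimal illustration on small made-up frames, not the project's actual `clean_data` module. The thresholds (A1c of at least 6.5%, fasting glucose of at least 126 mg/dL), the use of `DIQ010 == 1` for a self-reported diagnosis, and the `>= 18` age cutoff are assumptions that should be checked against the NHANES codebooks. The sketch restricts to the biomarker-positive cohort described in the research question and flags diagnosis separately. With the real files, each component would be loaded with `pandas.read_sas(path, format="xport")` instead of being typed in by hand.

```python
from functools import reduce

import pandas as pd

# Made-up stand-ins for DEMO_J, GHB_J, GLU_J and DIQ_J
demo = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [54, 16, 60, 45]})
ghb = pd.DataFrame({"SEQN": [1, 2, 3, 4], "LBXGH": [7.1, 6.8, 5.4, 5.2]})
glu = pd.DataFrame({"SEQN": [1, 3], "LBXGLU": [140.0, 131.0]})  # fasting subsample only
diq = pd.DataFrame({"SEQN": [1, 2, 3, 4], "DIQ010": [1, 2, 2, 2]})  # 1 = told by a doctor

# Step 2: left-merge every component onto the demographics file by participant ID
merged = reduce(
    lambda left, right: left.merge(right, on="SEQN", how="left"),
    [demo, ghb, glu, diq],
)

# Step 3: inclusion criteria (assumed cutoffs; NaN comparisons evaluate to False)
A1C_CUTOFF, FPG_CUTOFF, MIN_AGE = 6.5, 126.0, 18
merged["diabetes_biomarker"] = (merged["LBXGH"] >= A1C_CUTOFF) | (
    merged["LBXGLU"] >= FPG_CUTOFF
)
merged["diagnosed"] = merged["DIQ010"] == 1
cohort = merged[(merged["RIDAGEYR"] >= MIN_AGE) & merged["diabetes_biomarker"]].copy()
cohort["undiagnosed"] = ~cohort["diagnosed"]

# Step 4 would write the result out, e.g. cohort.to_csv(...)
print(cohort[["SEQN", "diabetes_biomarker", "diagnosed", "undiagnosed"]])
```

A left merge onto the demographics file keeps participants who are missing from the fasting subsample, so they can still qualify through A1c alone.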

## Ethics

**Ethics**

### A. Data Collection
- [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

Participants give informed consent to be part of the original survey. Furthermore, the data of our project is public-use and de-identified. We will not collect new data from human subjects for this project.

- [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

NHANES is designed to be nationally representative. Nevertheless, it still excludes some groups, such as institutionalized people and the unhoused. When describing our results, we will take special care to emphasize that they apply only to the target population of NHANES and may miss some high-risk groups.

- [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

The public-use NHANES files do not contain identifying information. Geographic information is present but limited. Though it is not possible to do any re-identification, our group will still adhere to the spirit of privacy and avoid any attempts to do so.

### B. Data Storage
- [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

Since the data is public-use and not sensitive, there is no need to take special care to secure data.

- [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

We have no way to remove individual participants because the data is de-identified. Requests for removal must be processed by the CDC.

- [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

Raw data files will only be stored as long as is needed to complete this course. Afterwards, data files will be deleted and only code and non-identifiable summary data will be kept.

### C. Analysis
- [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

Our team does not include a qualified physician. Thus, we will rely on peer-reviewed studies in the field to interpret our findings. Additionally, we do not have an expert in public policy, and making policy suggestions, even when our data seems to point in one direction, is complex and can be harmful if done wrongly. Thus, we will be careful not to overstate our results and will avoid making direct policy suggestions.

- [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

One of our research questions is about equity. This involves explicitly characterizing how dataset bias can lead to model underperformance for disadvantaged groups. We acknowledge that some residual confounding may remain in our predictive model and bias its estimates; confounding that our equity analysis does not surface is unlikely to be reducible with typical data science methods.

- [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

Our figures and tables will show confidence intervals and clearly indicate when survey weights are used. This aligns with correct practices for using the data and will help us maintain its national representativeness. We will also avoid cherry-picking results and practices such as data dredging.
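
To illustrate why survey weights matter for honest reporting, here is a minimal sketch of a weighted proportion compared with the unweighted one. The numbers and the helper name `weighted_proportion` are hypothetical. In NHANES the weights would come from columns such as `WTMEC2YR` or, for the fasting subsample, `WTSAF2YR`; which one applies to our analysis is an assumption to confirm against the NHANES analytic guidelines. Design-based confidence intervals additionally need the strata and PSU variables and are not shown here.

```python
import numpy as np

def weighted_proportion(flag, weight):
    """Weighted share of rows where flag == 1, skipping rows with missing values."""
    flag = np.asarray(flag, dtype=float)
    weight = np.asarray(weight, dtype=float)
    keep = ~np.isnan(flag) & ~np.isnan(weight)
    return float(np.sum(weight[keep] * flag[keep]) / np.sum(weight[keep]))

# Hypothetical participants: 1 = undiagnosed, with made-up survey weights
undiagnosed = [1, 0, 0, 1]
weights = [1000, 3000, 4000, 2000]

unweighted = float(np.mean(undiagnosed))
weighted = weighted_proportion(undiagnosed, weights)
print(f"unweighted = {unweighted:.2f}, weighted = {weighted:.2f}")
```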

- [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

Given that the data is de-identified and diabetes is such a ubiquitous condition, we believe that re-identification is not a concern. This is in contrast to an analysis of, for example, a rare disease, where re-identification is a risk.

- [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

All our analyses will take place in notebooks under Git version control. This ensures that our workflow is reproducible and verifiable.

### D. Modeling
- [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

One of our research questions is about equity. This involves explicitly characterizing the effect that dataset bias can have in model underperformance for disadvantaged groups. Therefore, this concern is adequately addressed.

- [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

Our project will explicitly measure model performance across race, ethnicity, income, and low healthcare access groups using AUC values. Possible causes for any disparities will be discussed.
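
As a rough illustration of this subgroup comparison, the sketch below computes AUC separately for each group using the rank-based (Mann-Whitney) formulation. The data, group labels, and helper names `auc_score` and `auc_by_group` are hypothetical. In practice a library routine such as scikit-learn's `roc_auc_score` would likely be used, and small subgroups will give noisy estimates.

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC as the chance that a random positive outranks a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined unless both classes are present
    diff = pos[:, None] - neg[None, :]  # every positive-negative pair
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def auc_by_group(y_true, y_score, groups):
    """Compute AUC within each distinct group label."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    return {
        str(g): auc_score(y_true[groups == g], y_score[groups == g])
        for g in np.unique(groups)
    }

# Hypothetical labels (1 = undiagnosed), model scores, and group membership
y = [1, 0, 1, 0, 1, 0, 1, 0]
score = [0.9, 0.2, 0.8, 0.3, 0.4, 0.6, 0.7, 0.5]
group = ["A", "A", "A", "A", "B", "B", "B", "B"]

result = auc_by_group(y, score, group)
print(result)
```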

- [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

We believe that for the purposes of our project, multiple metrics must be used in tandem, because each metric measures a slightly different facet of performance. These will be the standard metrics used to evaluate clinical model performance – AUC, sensitivity, specificity, PPV/NPV, and calibration. We will talk about the performance trade-offs that come with optimizing for any one of these metrics.
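
The threshold-based metrics named above all come from the same confusion matrix, which is why they trade off against each other. Below is a minimal sketch with made-up labels, treating "undiagnosed" as the positive class (an assumption); the helper name `screening_metrics` is hypothetical.

```python
def screening_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")  # undefined when the denominator is 0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }

# Hypothetical true labels and thresholded predictions
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = screening_metrics(y_true, y_pred)
print(metrics)
```

Lowering the decision threshold would typically raise sensitivity at the cost of specificity and PPV, which is the kind of trade-off the report will need to discuss.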

- [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

We will prioritize the development of interpretable models like logistic regression. We will also use carefully-chosen statistical measures like odds ratios and permutation importance, which will elucidate what our model is doing in human-understandable terms.
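
For readers unfamiliar with odds ratios, here is a minimal worked sketch for a single binary predictor, using a 2x2 table and a Wald confidence interval. The counts are made up and the helper name `odds_ratio_ci` is hypothetical. This is the simple unweighted, unadjusted version; the survey-weighted logistic regression planned for the project would yield adjusted odds ratios instead.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI for a 2x2 table.

    a, b: exposed group (outcome yes / no); c, d: unexposed group (outcome yes / no).
    All four counts must be positive.
    """
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical counts: uninsured (30 undiagnosed, 70 diagnosed) vs.
# insured (40 undiagnosed, 360 diagnosed)
or_, lo, hi = odds_ratio_ci(a=30, b=70, c=40, d=360)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An odds ratio above 1 with an interval that excludes 1 would suggest that the exposed group has higher odds of the outcome.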

- [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

The conclusion of our project will clearly highlight the main limitations of our analysis. These are that our analysis does not establish causality and may be affected by self-reporting bias, limited follow-up, and measurement error. We will also emphasize that this model is only a proof of concept and not ready for clinical deployment.

### E. Deployment
- [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

This model is a work-in-progress. It will not be implemented in a real-world setting without peer-review and further revisions based on that. However, we will discuss monitoring that must be undertaken were this model to be deployed, such as how to monitor model fairness.

- [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

In our public GitHub repository, we will make it clear that this model is not ready to use and should not be incorporated into clinical workflows in its current state. This will prevent misunderstandings about the progress of the project.

## Team Expectations 

**Team**

**Communication**

We will communicate via text message and check it once per day. If someone will be unavailable due to exams or travel, they will let the team know in advance.

**Meetings**

We will meet in person once a week for a progress update. If anyone has to miss a meeting, we will give that person a brief summary of what they missed and what work they need to catch up on.

**Division of Labor**

Our project is broken up into background writing, data wrangling, exploratory data analysis, modeling, evaluation, and final writing. We will put one person in charge of each part and try to distribute work as evenly as possible.

**Deadlines**

We will set internal deadlines ahead of the official deadlines so we can review work and make up for anything that is delayed. The expectation is that everyone on the team meets the internal deadlines or asks for help ahead of time when they are too busy to finish.

**Conflict Resolution**

We will discuss conflicts respectfully and resolve them as a group. We will assume good intentions and avoid blaming individuals. If all attempts to resolve a conflict fail, we will contact the instructional staff for help.


## Project Timeline Proposal

**Week 7: Data Acquisition & Cleaning**
- Finalize research questions and hypotheses.
- Download NHANES 2017–2018 datasets (Demographics, Glycohemoglobin, Plasma Fasting Glucose, Questionnaire files).
- Review documentation and codebooks.
- Merge datasets and restrict to adults meeting biomarker criteria.
- Define outcome variable (diagnosed vs. undiagnosed).
- Recode key predictors (income, education, insurance, race/ethnicity, etc.).
- Address missing data and document exclusions.
- **Internal deadline: 02/16/2026**

**Week 8: Exploratory Data Analysis & Regression**
- Conduct descriptive analyses using survey weights.
- Examine prevalence of diagnosed vs. undiagnosed diabetes.
- Visualize distributions of predictors.
- Fit survey-weighted logistic regression models.
- Interpret odds ratios and identify significant non-clinical predictors.
- Assess model diagnostics.
- **Internal deadline: 02/23/2026**

**Week 9: Predictive Modeling & Evaluation**
- Train classification models using non-clinical variables.
- Perform cross-validation.
- Evaluate performance (AUC, sensitivity, specificity, PPV/NPV, calibration).
- Compare predictive performance across models.
- Assess potential class imbalance.
- **Internal deadline: 03/02/2026**

**Week 10: Fairness Analysis & Finalization**
- Evaluate model performance across racial/ethnic and income groups.
- Compare subgroup AUC and sensitivity.
- Analyze disparities in error rates.
- Synthesize results and discuss equity implications.
- Write final report and refine figures/tables.
- Review limitations and ensure reproducibility.
- **Internal deadline: 03/09/2026**