# COGS 108 - Project Proposal

## Authors

- Hannah Daniel: Conceptualization, background research, writing
- Isaac Cordova: Conceptualization, data curation, analysis
- Evenie Osorio: Data curation, analysis, visualization
- Evelyn Cobian: Analysis, visualization, writing
- Deandre Juguilon: Project administration, writing, coordination

## Research Question

Do U.S. states with higher obesity rates also tend to report higher rates of diagnosed depression?

## Background and Prior Work

Obesity and depression are both major public health issues in the United States, and rates of both vary considerably from state to state. Obesity rates have increased over time, while depression is one of the most commonly diagnosed mental health conditions among adults. Since both conditions are influenced by multiple factors such as income, access to healthcare, and lifestyle, it is reasonable to think they may be related at the population level.

Previous research has examined the relationship between obesity and depression mostly at the individual level. Many studies have found that people with a higher body mass index (BMI) are more likely to report symptoms of depression. Some research also suggests that this relationship may work in both directions: obesity can increase the risk of depression, and depression can contribute to weight gain through changes in behavior or biological factors.

Researchers have also used geographic data to study health trends across the United States. Public health datasets, such as those collected by the CDC, allow researchers to compare health outcomes across states and identify regional patterns. These datasets are commonly used to examine differences in physical and mental health at the population level.

While the link between obesity and depression has been studied at the individual level, fewer studies focus on how these two variables relate to each other at the state level. This project aims to explore whether U.S. states with higher obesity rates also tend to report higher rates of diagnosed depression. Examining this relationship at the state level may provide insight into broader public health trends and help guide future research.

## Hypothesis


We hypothesize that U.S. states with higher obesity rates will also tend to report higher rates of diagnosed depression; in other words, we expect a positive association between the two variables across states. This expectation is based on previous research showing a connection between obesity and depression at the individual level, as well as on factors that influence both conditions, such as socioeconomic conditions and access to healthcare.
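One way a positive association across states could be quantified is with a Pearson correlation coefficient. The sketch below is only an illustration of that idea, not our final analysis plan: the function name `pearson_r` is our own, and the percentages are made-up placeholder values rather than real state data.

```python
import math


def pearson_r(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up placeholder percentages for five hypothetical states (NOT real data).
obesity_pct = [25.0, 28.5, 31.0, 34.5, 38.0]
depression_pct = [17.0, 18.5, 19.0, 22.0, 23.5]

r = pearson_r(obesity_pct, depression_pct)
print(f"Pearson r = {r:.3f}")
```

A value of `r` near +1 would be consistent with the hypothesis, a value near 0 would suggest no linear association, and a negative value would contradict it. A correlation across states would not by itself show that the same relationship holds for individuals.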

## Data

To answer our research question, the ideal dataset would include obesity rates and diagnosed depression rates for each U.S. state. The main variables we would need are the percentage of adults who are classified as obese and the percentage of adults who report having been diagnosed with depression. It would also be helpful to include the year the data were collected and basic population information for each state.

Ideally, the dataset would include data for all 50 states, and possibly Washington, D.C., from the same time period so the values can be compared fairly. These data would most likely be collected through large public health surveys and then summarized at the state level. The data would be stored in a simple format such as a CSV file, where each row represents a state and each column represents a variable.

One real dataset that could be used for this project comes from the CDC, specifically the Behavioral Risk Factor Surveillance System (BRFSS). This survey collects information on health behaviors and conditions from adults across the United States. Variables related to obesity and diagnosed depression are included, and the data are publicly available online without special permission.

Another possible dataset is the CDC PLACES dataset, which provides estimates of health outcomes at the state and county level. This dataset includes measures related to both obesity and mental health, making it useful for our analysis. Since the data are already summarized by state, they would work well for comparing trends across the U.S.
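The one-row-per-state CSV layout described above could be read and sanity-checked roughly as follows. This is a minimal sketch under stated assumptions: the column names (`state`, `year`, `obesity_pct`, `depression_pct`) and the helper `load_state_rows` are hypothetical and do not reflect the actual BRFSS or PLACES schema, and the sample values are made up.

```python
import csv
import io

# Hypothetical column names and made-up values, for illustration only (NOT real data).
SAMPLE_CSV = """state,year,obesity_pct,depression_pct
State A,2022,30.1,19.4
State B,2022,35.6,22.8
State C,2022,27.3,
"""


def load_state_rows(text):
    """Parse CSV text into one dict per state, converting values to numbers.

    States with a missing or non-numeric value are returned separately so
    they can be reported instead of being silently dropped.
    """
    complete, incomplete = [], []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            record = {
                "state": row["state"],
                "year": int(row["year"]),
                "obesity_pct": float(row["obesity_pct"]),
                "depression_pct": float(row["depression_pct"]),
            }
        except (ValueError, TypeError):
            incomplete.append(row["state"])
            continue
        complete.append(record)
    return complete, incomplete


rows, missing = load_state_rows(SAMPLE_CSV)
years = {r["year"] for r in rows}
print(f"{len(rows)} complete rows; missing data for: {missing}")
print(f"all complete rows are from the same year: {len(years) == 1}")
```

Checking that every row comes from the same year mirrors the requirement that values be drawn from the same time period, and listing incomplete states makes any gaps visible before analysis.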

## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> The data used in this project come from publicly available health surveys collected by the CDC. Participants gave informed consent when the data were originally collected. Since we are not collecting any data ourselves and are only using state-level summaries, informed consent was already handled by the original researchers.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> There may be some collection bias in these datasets because the information is self-reported. This means some people may choose not to respond or may not report their health information accurately. Certain groups could also be underrepresented, which could affect the results.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> This project uses only data aggregated at the state level and does not include any personal or identifying information. Because of this, the risk of exposing any individual's private information is very low.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We will look for possible sources of bias in the dataset, such as missing values or differences in how states report health information. These factors will be considered when analyzing and discussing the results.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> The results will be presented using clear and appropriate visualizations and summary statistics. We will try to avoid misleading graphs or conclusions and represent the data as accurately as possible.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Team Expectations 

Our group expects everyone to communicate regularly and respond to messages within a reasonable amount of time, especially as deadlines get closer. If someone is unable to complete a task on time, we expect them to let the group know as soon as possible so adjustments can be made.

We expect all group members to contribute fairly to the project and complete the portions they agree to work on. While everyone may not contribute in the same way, each member should put in good effort and stay involved throughout the project.

Our team plans to communicate primarily through group messaging and GitHub. We will use these tools to share updates, ask questions, and keep track of progress. If conflicts or disagreements come up, we will try to talk them through as a group and find a solution that works for everyone.

We also expect group members to be respectful of each other’s time and ideas, and give constructive feedback when reviewing each other’s work. Our goal is to work together efficiently and complete the project successfully.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|-------------|--------------|--------------------------|--------------------|
| 1/30 | Before 11:59 PM | Brainstorm project ideas and research questions | Finalize project topic and research question; submit project review |
| 2/4 | Before 11:59 PM | Complete background research and ethics section | Review and submit final project proposal |
| 2/18 | Before 11:59 PM | Import datasets and begin data cleaning | Review data wrangling progress and submit data checkpoint |
| 3/4 | Before 11:59 PM | Complete exploratory data analysis (EDA) | Discuss patterns, visualizations, and submit EDA checkpoint |
| 3/13 | Before 11:59 PM | Draft results, discussion, and video outline | Review analysis, refine visuals, and prepare final submission |
| 3/18 | Before 11:59 PM | Final edits and checks | Submit final project and video |