# COGS 108 - Data Checkpoint

## Authors

- Hannah Daniel: Conceptualization, background research, writing
- Isaac Cordova: Conceptualization, data curation, analysis
- Evenie Osorio: Data curation, analysis, visualization
- Evelyn Cobian: Analysis, visualization, writing
- Deandre Juguilon: Project administration, writing, coordination

## Research Question

How well do diagnosed depression rates predict obesity rates in the 50 US states and DC in the year 2020?

## Background and Prior Work

Obesity and depression are both major public health issues in the United States, and rates of both vary considerably from state to state. Obesity rates have increased over time, while depression is one of the most commonly diagnosed mental health conditions among adults. Since both conditions are influenced by overlapping factors such as income, access to healthcare, and lifestyle, it is reasonable to think they may be related at the population level.

Previous research has looked at the relationship between obesity and depression mostly at the individual level. Many studies have found that people with higher body mass index are more likely to report symptoms of depression. Some research also suggests this relationship may work in both directions, where obesity can increase the risk of depression and depression can also contribute to weight gain through changes in behavior or biological factors.

Researchers have also used geographic data to study health trends across the United States. Public health datasets, such as those collected by the CDC, allow researchers to compare health outcomes across states and identify regional patterns. These datasets are commonly used to examine differences in physical and mental health at the population level.

While the link between obesity and depression has been studied at the individual level, fewer studies focus on how these two variables relate to each other at the state level. This project aims to explore whether U.S. states with higher rates of diagnosed depression also tend to report higher obesity rates. Examining this relationship at the state level may provide insight into broader public health trends and help guide future research.

## Hypothesis

We hypothesize that U.S. states with higher rates of diagnosed depression will also tend to report higher obesity rates. This is based on previous research showing a connection between obesity and depression at the individual level, as well as shared factors such as socioeconomic conditions and access to healthcare. We expect to see a positive association between these two variables across states. Because this is an observational, state-level comparison, a positive association would not show that one condition causes the other.

Additional factors may also link these variables, such as stigma around mental health leading to unhealthy coping mechanisms, or societal attitudes towards overweight people. The scope of this project is limited to measuring the strength of the state-level relationship, not explaining its causes.

## Data

### Data overview

For this project, we are using the following dataset:

- **Dataset #1: CDC BRFSS Obesity/Depression Prevalence**
  - **Link:** [CDC BRFSS Data](https://data.cdc.gov/Behavioral-Risk-Factors/BRFSS-Graph-of-Current-Adult-Obesity-Prevalence-Na/tcmp-75zb)
  - **Observations:** 50 states + DC
  - **Variables:** State name, Obesity percentage (BMI ≥ 30.0), Depression percentage

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pandas as pd
import numpy as np

# 1. Load the data
df = pd.read_csv('data/00-raw/Behavioral_Risk_Factor_Surveillance_System_(BRFSS)_Prevalence_Data_(2011_to_present)_20260217.csv')

# 2. Quick check of the data
print("Dataset Shape:", df.shape)
df.head()
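
The raw BRFSS export is in long format (one row per state, year, question, and response), so it needs to be filtered and reshaped before it matches the "50 states + DC" table described above. The sketch below shows one possible way to do that. It runs on a tiny made-up table rather than the real file, and the column names (`Year`, `Locationabbr`, `Topic`, `Response`, `Break_Out`, `Data_value`) and response labels are assumptions that should be checked against `df.columns` and the actual values before reuse. The helper `to_state_level` is hypothetical, not part of any library.

```python
import pandas as pd

# Toy stand-in for the raw BRFSS export. Every value here is invented.
# Column names and response labels are ASSUMED; verify them against the real file.
raw = pd.DataFrame({
    "Year": [2020, 2020, 2020, 2020, 2019, 2020],
    "Locationabbr": ["AA", "AA", "BB", "BB", "AA", "AA"],
    "Topic": ["BMI Categories", "Depression", "BMI Categories",
              "Depression", "Depression", "BMI Categories"],
    "Response": ["Obese (BMI 30.0 - 99.8)", "Yes", "Obese (BMI 30.0 - 99.8)",
                 "Yes", "Yes", "Overweight (BMI 25.0-29.9)"],
    "Break_Out": ["Overall"] * 6,
    "Data_value": [30.0, 20.0, 35.0, 25.0, 19.0, 34.0],
})


def to_state_level(frame, year=2020):
    """Return one row per state with obesity and depression percentages."""
    # Keep only the requested year and the all-adults ("Overall") estimates.
    overall = frame[(frame["Year"] == year) & (frame["Break_Out"] == "Overall")]

    # Select the "obese" BMI category and the "Yes" answers for depression.
    is_obese = (overall["Topic"] == "BMI Categories") & overall["Response"].str.startswith("Obese", na=False)
    is_depressed = (overall["Topic"] == "Depression") & (overall["Response"] == "Yes")
    subset = overall[is_obese | is_depressed].copy()

    # Give each measure a short column name, then reshape long -> wide.
    subset["measure"] = subset["Topic"].map(
        {"BMI Categories": "obesity_pct", "Depression": "depression_pct"}
    )
    wide = subset.pivot(index="Locationabbr", columns="measure", values="Data_value")
    return wide.reset_index()


state_df = to_state_level(raw)
print(state_df)
```

On the real file, the result should be checked for the expected 51 rows. The raw export may also contain territories or nationwide summary rows, which would need to be dropped, and `pivot` raises an error if a state has more than one matching row per measure, which is a useful signal that the filter is too loose.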

## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: Participants gave informed consent when the data was originally collected by the CDC. Since we are using state-level summaries, informed consent was already handled by the original researchers.
 - [X] **A.2 Collection bias**: We acknowledge that BRFSS data is self-reported via telephone surveys. This may exclude populations without stable phone access and introduces social desirability bias (underreporting weight or mental health struggles).
 - [X] **A.3 Limit PII exposure**: This project uses aggregated state-level data. No individual-level identifiers are present, ensuring anonymity.
 - [X] **A.4 Downstream bias mitigation**: While our primary analysis is state-level, we will acknowledge that these trends may vary significantly across racial and socioeconomic lines within states.

### B. Data Storage
 - [X] **B.1 Data security**: Data is stored in a secure GitHub repository. We are using public datasets, so no sensitive proprietary data is at risk.
 - [X] **B.2 Right to be forgotten**: Not applicable at the individual level since the data is already anonymized and aggregated by the CDC.
 - [X] **B.3 Data retention plan**: We will retain the data for the duration of the course. If the project is made public, we will ensure it links to the primary CDC source to maintain data freshness.

### C. Analysis
 - [X] **C.1 Missing perspectives**: We acknowledge that BMI is a controversial metric for obesity and may not capture metabolic health accurately for all body types or ethnicities.
 - [X] **C.2 Dataset bias**: We will check for outliers (e.g., states with significantly different reporting standards) to ensure they don't skew the correlation.
 - [X] **C.3 Honest representation**: Our visualizations will use appropriate scales (e.g., avoiding truncated Y-axes that exaggerate small differences) to represent the correlation fairly.
 - [X] **C.5 Auditability**: All data cleaning and analysis steps are documented in this notebook, allowing for full reproducibility.
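
The outlier check mentioned in C.2 could be sketched as follows. This is a minimal example using the common 1.5 × IQR rule, which is our assumption here; the checklist does not commit to a specific method. The table values are invented and the helper name `flag_iqr_outliers` is hypothetical.

```python
import pandas as pd

# Hypothetical state-level table. Values are invented for illustration only.
toy = pd.DataFrame({
    "state": ["AA", "BB", "CC", "DD", "EE", "FF"],
    "obesity_pct": [30.0, 31.0, 32.0, 33.0, 34.0, 50.0],
})


def flag_iqr_outliers(values, k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


toy["is_outlier"] = flag_iqr_outliers(toy["obesity_pct"])
print(toy[toy["is_outlier"]])
```

Flagged states are candidates for a closer look rather than automatic removal. With only 51 observations, dropping a state can noticeably change the estimated correlation, so any exclusion should be reported alongside the results.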

### D. Modeling
 - [X] **D.1 Proxy discrimination**: We will ensure our model does not use state-level variables that might serve as proxies for racial or religious discrimination.
 - [X] **D.4 Explainability**: We will use a simple linear regression model so that the relationship between depression and obesity is easily interpretable by a general audience.
 - [X] **D.5 Communicate limitations**: We will explicitly state that state-level correlation does not imply that an individual with depression will become obese, or vice-versa (avoiding the ecological fallacy).
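
As a rough illustration of the simple linear regression named in D.4, the sketch below fits obesity rate on depression rate with NumPy's least-squares polynomial fit and reports the slope, intercept, and R². The numbers are invented placeholders, not BRFSS values, and the variable names are our own. The actual analysis may use a different library.

```python
import numpy as np

# Invented percentages for five hypothetical states, for illustration only.
depression_pct = np.array([15.0, 18.0, 20.0, 22.0, 25.0])
obesity_pct = np.array([28.0, 30.0, 33.0, 33.0, 36.0])

# Fit obesity = slope * depression + intercept by ordinary least squares.
slope, intercept = np.polyfit(depression_pct, obesity_pct, deg=1)

# R^2: share of the variance in obesity rates explained by the fitted line.
predicted = slope * depression_pct + intercept
ss_res = np.sum((obesity_pct - predicted) ** 2)
ss_tot = np.sum((obesity_pct - obesity_pct.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Pearson correlation; for a one-predictor fit, r**2 equals R^2.
r = np.corrcoef(depression_pct, obesity_pct)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}, r={r:.3f}")
```

The slope reads as the expected change in a state's obesity rate (in percentage points) per one-point increase in its diagnosed depression rate. In line with D.5, this describes states, not individuals.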

### E. Deployment
 - [X] **E.4 Unintended use**: We will include a disclaimer that this project is for educational purposes and should not be used to inform personal medical decisions or public health policy without expert consultation.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|-------------|--------------|--------------------------|--------------------|
| 1/30 | 11:59 PM | Brainstorming | Finalize topic |
| 2/4 | 11:59 PM | Ethics | Submit Proposal |
| 2/18 | 11:59 PM | Data Cleaning | Submit Checkpoint |
| 3/4 | 11:59 PM | EDA | Discuss patterns |
| 3/13 | 11:59 PM | Results draft | Refine visuals |
| 3/18 | 11:59 PM | Final edits | Submit Project |