# COGS 108 - Data Checkpoint

## Authors

- Hannah Daniel: Conceptualization, background research, writing
- Isaac Cordova: Conceptualization, data curation, analysis
- Evenie Osorio: Data curation, analysis, visualization
- Evelyn Cobian: Analysis, visualization, writing
- Deandre Juguilon: Project administration, writing, coordination

## Research Question

To what extent is state-level adult diagnosed depression prevalence associated with state-level adult obesity prevalence (BMI ≥ 30) across the 50 U.S. states and DC in 2020?

## Background and Prior Work

Obesity and depression are both major public health issues in the United States, and rates of both vary considerably from state to state. Obesity rates have increased over time, while depression is one of the most commonly diagnosed mental health conditions among adults. Since both conditions are influenced by factors such as income, access to healthcare, and lifestyle, it is reasonable to expect that they may be related at the population level.

Previous research has examined the relationship between obesity and depression mostly at the individual level. Many studies have found that people with a higher body mass index are more likely to report symptoms of depression. Some research also suggests that the relationship runs in both directions: obesity can increase the risk of depression, and depression can contribute to weight gain through changes in behavior or biological factors.

Researchers have also used geographic data to study health trends across the United States. Public health datasets, such as those collected by the CDC, allow researchers to compare health outcomes across states and identify regional patterns. These datasets are commonly used to examine differences in physical and mental health at the population level.

While obesity and depression have each been studied on their own, fewer studies focus on how the two relate to each other at the state level. This project explores whether U.S. states with higher obesity rates also tend to report higher rates of diagnosed depression. Examining this relationship at the state level may provide insight into broader public health trends and help guide future research.

## Hypothesis

We hypothesize that there will be a positive linear association between state-level adult diagnosed depression prevalence and state-level adult obesity prevalence (BMI ≥ 30) across the 50 U.S. states and DC in 2020. Specifically, states with higher percentages of adults reporting diagnosed depression will also tend to have higher percentages of adults classified as obese (BMI ≥ 30). This analysis will examine correlation rather than causation.
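
As a rough illustration of how a positive linear association could be quantified, the sketch below computes Pearson's r with pandas. It uses only the five rows visible in the `df_state.head()` output in the Data section, so the printed value is illustrative and is not the project's result; the name `sample` is introduced here for the example and does not appear elsewhere in the notebook.

```python
import pandas as pd

# Five rows copied from the df_state.head() output in this notebook.
# The full analysis would use all 51 rows of df_state instead.
sample = pd.DataFrame({
    "Locationdesc": ["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
    "BMI Categories": [39.0, 31.9, 30.9, 36.4, 30.2],
    "Depression": [23.5, 15.9, 17.4, 23.5, 14.1],
})

# Series.corr defaults to Pearson's r, which measures linear association.
r = sample["Depression"].corr(sample["BMI Categories"])
print(f"Pearson r on the five-row sample: {r:.3f}")
```

A value of r near +1 would be consistent with the hypothesis, while a value near 0 would not; in either case the coefficient describes association between states, not causation.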

## Data

### Data overview

For this project, we are using the following dataset:

- **Dataset #1: CDC BRFSS Obesity/Depression Prevalence**
  - **Link:** [CDC BRFSS Data](https://data.cdc.gov/Behavioral-Risk-Factors/BRFSS-Graph-of-Current-Adult-Obesity-Prevalence-Na/tcmp-75zb)
  - **Observations:** 50 states + DC
  - **Variables:** State name, obesity prevalence (% of adults with BMI ≥ 30), diagnosed depression prevalence (% of adults)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
import numpy as np

# 1. Load the data
df = pd.read_csv('data/00-raw/Behavioral_Risk_Factor_Surveillance_System_(BRFSS)_Prevalence_Data_(2011_to_present)_20260217.csv')

# 2. Quick check of the data
print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (2760, 27)


| | Year | Locationabbr | Locationdesc | Class | Topic | Question | Response | Break_Out | Break_Out_Category | Sample_Size | ... | Data_Value_Footnote | DataSource | ClassId | TopicId | LocationID | BreakoutID | BreakOutCategoryID | QuestionID | ResponseID | GeoLocation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | AK | Alaska | Chronic Health Indicators | Depression | Ever told you that you have a form of depression? | Yes | 18-24 | Age Group | 35 | ... | | BRFSS | CLASS03 | TOPIC17 | 2 | AGE01 | CAT3 | ADDEPEV3 | RESP046 | (64.84507995700051, -147.72205903599973) |
| 1 | 2020 | AK | Alaska | Chronic Health Indicators | Depression | Ever told you that you have a form of depression? | Yes | 25-34 | Age Group | 88 | ... | | BRFSS | CLASS03 | TOPIC17 | 2 | AGE02 | CAT3 | ADDEPEV3 | RESP046 | (64.84507995700051, -147.72205903599973) |
| 2 | 2020 | AK | Alaska | Chronic Health Indicators | Depression | Ever told you that you have a form of depression? | Yes | 35-44 | Age Group | 80 | ... | | BRFSS | CLASS03 | TOPIC17 | 2 | AGE03 | CAT3 | ADDEPEV3 | RESP046 | (64.84507995700051, -147.72205903599973) |
| 3 | 2020 | AK | Alaska | Chronic Health Indicators | Depression | Ever told you that you have a form of depression? | Yes | 45-54 | Age Group | 94 | ... | | BRFSS | CLASS03 | TOPIC17 | 2 | AGE04 | CAT3 | ADDEPEV3 | RESP046 | (64.84507995700051, -147.72205903599973) |
| 4 | 2020 | AK | Alaska | Chronic Health Indicators | Depression | Ever told you that you have a form of depression? | Yes | 55-64 | Age Group | 117 | ... | | BRFSS | CLASS03 | TOPIC17 | 2 | AGE05 | CAT3 | ADDEPEV3 | RESP046 | (64.84507995700051, -147.72205903599973) |


In [3]:
df["Break_Out_Category"].value_counts().head(20)

Break_Out_Category
Race/Ethnicity        848
Age Group             636
Household Income      530
Education Attained    424
Sex                   212
Overall               110
Name: count, dtype: int64

In [4]:
df_overall = df[df["Break_Out_Category"] == "Overall"]
df_overall.shape

(110, 27)

In [5]:
df_overall["Locationdesc"].unique()

array(['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California',
       'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
       'Florida', 'Georgia', 'Guam', 'Hawaii', 'Iowa', 'Idaho',
       'Illinois', 'Indiana', 'Kansas', 'Kentucky', 'Louisiana',
       'Massachusetts', 'Maryland', 'Maine', 'Michigan', 'Minnesota',
       'Missouri', 'Mississippi', 'Montana', 'North Carolina',
       'North Dakota', 'Nebraska', 'New Hampshire', 'New Jersey',
       'New Mexico', 'Nevada', 'New York', 'Ohio', 'Oklahoma', 'Oregon',
       'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina',
       'South Dakota', 'Tennessee', 'Texas',
       'All States, DC and Territories (median) **', 'Utah',
       'All States and DC (median) **', 'Virginia', 'Vermont',
       'Washington', 'Wisconsin', 'West Virginia', 'Wyoming'],
      dtype=object)

In [6]:
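# Remove unwanted locations (exact-match filter).
# Note: the two median labels in the data end with " **", so isin() does not
# match them here; they are dropped by the substring filter in a later cell.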
exclude = [
    "Guam",
    "Puerto Rico",
    "All States, DC and Territories (median)",
    "All States and DC (median)"
]

df_clean = df_overall[~df_overall["Locationdesc"].isin(exclude)]

df_clean.shape

(106, 27)

In [7]:
df_clean["Locationdesc"].unique()

array(['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California',
       'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
       'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois',
       'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts',
       'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri',
       'Mississippi', 'Montana', 'North Carolina', 'North Dakota',
       'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada',
       'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'All States, DC and Territories (median) **', 'Utah',
       'All States and DC (median) **', 'Virginia', 'Vermont',
       'Washington', 'Wisconsin', 'West Virginia', 'Wyoming'],
      dtype=object)

In [8]:
# remove summary rows that contain "median"
df_clean = df_clean[~df_clean["Locationdesc"].str.contains("median", na=False)]

df_clean.shape

(102, 27)

In [9]:
df_clean["Locationdesc"].unique()

array(['Alaska', 'Alabama', 'Arkansas', 'Arizona', 'California',
       'Colorado', 'Connecticut', 'District of Columbia', 'Delaware',
       'Florida', 'Georgia', 'Hawaii', 'Iowa', 'Idaho', 'Illinois',
       'Indiana', 'Kansas', 'Kentucky', 'Louisiana', 'Massachusetts',
       'Maryland', 'Maine', 'Michigan', 'Minnesota', 'Missouri',
       'Mississippi', 'Montana', 'North Carolina', 'North Dakota',
       'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'Nevada',
       'New York', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',
       'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee',
       'Texas', 'Utah', 'Virginia', 'Vermont', 'Washington', 'Wisconsin',
       'West Virginia', 'Wyoming'], dtype=object)

In [10]:
df_clean["Locationdesc"].nunique()

51

In [11]:
df_clean["Topic"].value_counts()

Topic
Depression        51
BMI Categories    51
Name: count, dtype: int64

In [12]:
df_state = df_clean.pivot_table(
    index="Locationdesc",
    columns="Topic",
    values="Data_value",
    aggfunc="mean"
).reset_index()

df_state.head()

| | Locationdesc | BMI Categories | Depression |
|---|---|---|---|
| 0 | Alabama | 39.0 | 23.5 |
| 1 | Alaska | 31.9 | 15.9 |
| 2 | Arizona | 30.9 | 17.4 |
| 3 | Arkansas | 36.4 | 23.5 |
| 4 | California | 30.2 | 14.1 |


In [13]:
df_state.shape

(51, 3)

In [14]:
df_state["Locationdesc"].sort_values().tail(10)

41     South Dakota
42        Tennessee
43            Texas
44             Utah
45          Vermont
46         Virginia
47       Washington
48    West Virginia
49        Wisconsin
50          Wyoming
Name: Locationdesc, dtype: object

In [15]:
df_state.isna().sum()

Topic
Locationdesc      0
BMI Categories    0
Depression        0
dtype: int64

### Data Cleaning

To isolate state-level prevalence estimates for 2020, we filtered the dataset to rows where `Break_Out_Category == "Overall"`. We excluded the territories (Guam and Puerto Rico) and the summary rows whose location label contains "median", which left 102 rows: one Depression row and one BMI Categories row for each of the 50 U.S. states and DC. We then pivoted these rows into a table with one row per location and one column per topic. The final dataset contains 51 observations with no missing values for obesity prevalence (BMI ≥ 30) or diagnosed depression prevalence.
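
The cleaning steps can also be collected into a single function, which makes them easier to rerun and check. The version below is a minimal sketch, not the code used above: the function name `to_state_table`, the constant `TERRITORIES`, and the small `toy` frame are assumptions made for illustration. The Alabama and Alaska "Overall" values come from the output shown earlier; the remaining numbers in `toy` are placeholders, not real estimates.

```python
import pandas as pd

TERRITORIES = ["Guam", "Puerto Rico"]  # locations outside the 50 states + DC


def to_state_table(df: pd.DataFrame) -> pd.DataFrame:
    """Filter BRFSS-style rows and pivot to one row per state."""
    # Keep only the all-adults estimate for each location and topic.
    overall = df[df["Break_Out_Category"] == "Overall"]
    # Exact match for territories; substring match for the median summary
    # rows, whose labels carry a trailing " **" in the raw data.
    is_territory = overall["Locationdesc"].isin(TERRITORIES)
    is_summary = overall["Locationdesc"].str.contains("median", na=False)
    kept = overall[~is_territory & ~is_summary]
    # One row per location, one column per topic.
    wide = kept.pivot_table(
        index="Locationdesc", columns="Topic",
        values="Data_value", aggfunc="mean",
    ).reset_index()
    wide.columns.name = None  # drop the leftover "Topic" axis label
    return wide


# Toy input: placeholder values for the Age Group, Guam, and median rows.
toy = pd.DataFrame({
    "Locationdesc": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska",
                     "Guam", "All States and DC (median) **"],
    "Topic": ["BMI Categories", "Depression", "Depression", "BMI Categories",
              "Depression", "Depression", "Depression"],
    "Break_Out_Category": ["Overall", "Overall", "Age Group", "Overall",
                           "Overall", "Overall", "Overall"],
    "Data_value": [39.0, 23.5, 20.0, 31.9, 15.9, 10.0, 19.0],
})

state_table = to_state_table(toy)
print(state_table)
```

On the real data, `to_state_table(df)` would be expected to return the same 51-row table as `df_state`, which gives a simple way to confirm that the cell-by-cell cleaning and the function agree.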

## Ethics 

### A. Data Collection
 - [X] **A.1 Informed consent**: Participants gave informed consent when the CDC originally collected the data. Because we use only state-level summaries, consent was handled by the original researchers at the point of collection.
 - [X] **A.2 Collection bias**: We acknowledge that BRFSS data is self-reported via telephone surveys. This may exclude populations without stable phone access and introduces social desirability bias (underreporting weight or mental health struggles).
 - [X] **A.3 Limit PII exposure**: This project uses aggregated state-level data. No individual-level identifiers are present, ensuring anonymity.
 - [X] **A.4 Downstream bias mitigation**: While our primary analysis is state-level, we will acknowledge that these trends may vary significantly across racial and socioeconomic lines within states.

### B. Data Storage
 - [X] **B.1 Data security**: Data is stored in a secure GitHub repository. We are using public datasets, so no sensitive proprietary data is at risk.
 - [X] **B.2 Right to be forgotten**: Not applicable at the individual level since the data is already anonymized and aggregated by the CDC.
 - [X] **B.3 Data retention plan**: We will retain the data for the duration of the course. If the project is made public, we will ensure it links to the primary CDC source to maintain data freshness.

### C. Analysis
 - [X] **C.1 Missing perspectives**: We acknowledge that BMI is a controversial metric for obesity and may not capture metabolic health accurately for all body types or ethnicities.
 - [X] **C.2 Dataset bias**: We will check for outliers (e.g., states with significantly different reporting standards) to ensure they don't skew the correlation.
 - [X] **C.3 Honest representation**: Our visualizations will use appropriate scales (e.g., avoiding truncated Y-axes that exaggerate small differences) to represent the correlation fairly.
 - [X] **C.5 Auditability**: All data cleaning and analysis steps are documented in this notebook, allowing for full reproducibility.
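
One possible way to carry out the outlier check in C.2 is the 1.5 × IQR rule, sketched below. The choice of rule, the function name `flag_iqr_outliers`, and the threshold `k` are assumptions for illustration; the checklist does not commit to a specific method. The obesity values are the five rows shown in the `df_state.head()` output.

```python
import pandas as pd


def flag_iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Obesity prevalence for the five states visible in df_state.head().
obesity_sample = pd.Series(
    [39.0, 31.9, 30.9, 36.4, 30.2],
    index=["Alabama", "Alaska", "Arizona", "Arkansas", "California"],
)

flags = flag_iqr_outliers(obesity_sample)
print(obesity_sample[flags])  # states flagged as potential outliers, if any
```

Flagged states would be inspected rather than dropped automatically, since an unusual value may be a real feature of that state rather than a reporting problem.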

### D. Modeling
 - [X] **D.1 Proxy discrimination**: We will ensure our model does not use state-level variables that might serve as proxies for racial or religious discrimination.
 - [X] **D.4 Explainability**: We will use a simple linear regression model so that the relationship between depression and obesity is easily interpretable by a general audience.
 - [X] **D.5 Communicate limitations**: We will explicitly state that state-level correlation does not imply that an individual with depression will become obese, or vice-versa (avoiding the ecological fallacy).
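
The simple linear regression mentioned in D.4 could look roughly like the sketch below, which fits a line with `numpy.polyfit`. Treating depression prevalence as the predictor and obesity prevalence as the response is an assumption based on the wording of the hypothesis, and the five data points are only the rows shown in the `df_state.head()` output, so the fitted numbers are illustrative and not the project's result.

```python
import numpy as np

# Five states from the df_state.head() output (percent of adults).
depression = np.array([23.5, 15.9, 17.4, 23.5, 14.1])  # assumed predictor
obesity = np.array([39.0, 31.9, 30.9, 36.4, 30.2])     # assumed response

# Degree-1 polynomial fit: obesity is modeled as slope * depression + intercept.
slope, intercept = np.polyfit(depression, obesity, deg=1)

# R^2: share of variance in obesity explained by the fitted line.
predicted = slope * depression + intercept
ss_res = np.sum((obesity - predicted) ** 2)
ss_tot = np.sum((obesity - obesity.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}")
```

The slope reads as the expected change in state obesity prevalence (in percentage points) per one-point increase in state depression prevalence, which keeps the model interpretable for a general audience while still describing states rather than individuals.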

### E. Deployment
 - [X] **E.4 Unintended use**: We will include a disclaimer that this project is for educational purposes and should not be used to inform personal medical decisions or public health policy without expert consultation.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|-------------|--------------|--------------------------|--------------------|
| 1/30 | 11:59 PM | Brainstorming | Finalize topic |
| 2/4 | 11:59 PM | Ethics | Submit Proposal |
| 2/18 | 11:59 PM | Data Cleaning | Submit Checkpoint |
| 3/4 | 11:59 PM | EDA | Discuss patterns |
| 3/13 | 11:59 PM | Results draft | Refine visuals |
| 3/18 | 11:59 PM | Final edits | Submit Project |

## Notes to TA: Revisions from Project Proposal

Based on the feedback from our project proposal, we made the following revisions:
- We clarified the research question by explicitly defining how both variables are measured. Obesity is defined as adult prevalence of BMI ≥ 30 and depression is defined as adult diagnosed depression prevalence at the state level in 2020.
- We specified that our analysis examines statistical association rather than causation.
- We ensured the dataset directly aligns with our research question by filtering to “Overall” state-level prevalence estimates and excluding territories and summary median rows.
- We included the CDC BRFSS dataset link and source information in the Data Overview section.