# COGS 108 - Project Proposal

## Authors


Team list & Credits:

- Ali Juhdi: Conceptualization, Data curation & Methodology
- Anchit Kumar: Analysis, Software, & Visualization
- Brandon Scappaticci: Project administration, Software, Writing – review & editing
- Cesar Vizcaino Garay: Analysis, Background research, & Validation
- John Caabay-Sandoval: Investigation & Supervision


## Research Question


This project explores whether **estimated country-level caffeine intake** is related to **mental health and mortality outcomes** around the world. Instead of just looking at simple correlations, the analysis will use regression models to see if caffeine intake is associated with outcomes like depression burden or all-cause mortality after accounting for important factors such as GDP per capita, smoking rates, alcohol use, and regional differences. Because the data is at the country level, results will be interpreted as population-level patterns rather than individual cause-and-effect.

Secondary analyses will test whether results differ based on the **main source of caffeine** (coffee, tea, or energy drinks). The project will also include basic robustness checks by trying different model setups and control variables to see whether the findings stay consistent. The goal is to understand whether caffeine consumption is associated with differences in health outcomes across countries in a clear and careful way.
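
As a sketch only, the adjusted model described in this section could be written as below. The symbols, the log transform of GDP, and the use of region fixed effects are illustrative assumptions on our part, not a final specification:

$$
Y_c = \beta_0 + \beta_1\,\mathrm{Caffeine}_c + \beta_2 \log(\mathrm{GDP}_c) + \beta_3\,\mathrm{Smoking}_c + \beta_4\,\mathrm{Alcohol}_c + \gamma_{r(c)} + \varepsilon_c
$$

Here $Y_c$ is a health outcome for country $c$ (for example depression burden or all-cause mortality), $\mathrm{Caffeine}_c$ is estimated caffeine intake per capita, $\gamma_{r(c)}$ is an effect for the region containing country $c$, and $\varepsilon_c$ is the error term. The coefficient of interest is $\beta_1$. The secondary analyses would extend this by interacting $\mathrm{Caffeine}_c$ with an indicator for the dominant caffeine source.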


## Background and Prior Work


Caffeine is one of the most widely consumed psychoactive substances worldwide, typically consumed through coffee, tea, and other caffeinated beverages. Because caffeine can affect sleep, population-level differences in caffeine intake could plausibly relate to mental health and mortality patterns.

At the individual level, observational research has reported associations between caffeine intake and depression-related outcomes. For example, analyses using NHANES data examined caffeine consumption and depression measures, suggesting a relationship may exist, though causality is not established and confounding remains a major concern.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Similarly, large observational analyses have examined caffeine intake and all-cause or cause-specific mortality, often identifying non-linear associations and emphasizing careful interpretation and adjustment for confounders.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

However, most prior work is within-country and individual-level, whereas this project is between-country and ecological. This makes the question different: we are not claiming caffeine causes outcomes in individuals, but asking whether country-level patterns co-vary and whether those patterns persist after controlling for structural factors (wealth, health system capacity, smoking, alcohol use, etc.).

To estimate country-level caffeine intake, we will start from publicly available per-capita consumption estimates for coffee and tea by country for recent years.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) We will then convert these amounts to approximate caffeine intake using caffeine-content references (acknowledging large variation across preparation methods and products).<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5)

In addition, beverage-market composition data (percent of caffeinated beverage volume sales by category) can help characterize where caffeine is coming from in different countries (coffee vs tea vs carbonates vs energy drinks), which we can use for subgroup and interaction analyses.<a name="cite_ref-6"></a>[<sup>6</sup>](#cite_note-6)

Footnotes:
1. <a name="cite_note-1"></a> [^](#cite_ref-1) Association between Caffeine Consumption and Depression in NHANES 2009–2010. https://pmc.ncbi.nlm.nih.gov/articles/PMC6407621/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Association between Caffeine Intake and All-Cause and Cause-Specific Mortality. https://pmc.ncbi.nlm.nih.gov/articles/PMC8715461/
3. <a name="cite_note-3"></a> [^](#cite_ref-3) World Population Review: Coffee consumption by country (2019–2023). https://worldpopulationreview.com/country-rankings/coffee-consumption-by-country
4. <a name="cite_note-4"></a> [^](#cite_ref-4) World Population Review: Tea consumption by country (2019–2022). https://worldpopulationreview.com/country-rankings/tea-consumption-by-country
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Kaggle dataset: Caffeine content of drinks. https://www.kaggle.com/datasets/heitornunes/caffeine-content-of-drinks
6. <a name="cite_note-6"></a> [^](#cite_ref-6) Nutrients (MDPI) article with country beverage-source percentages (Figure). https://www.mdpi.com/2072-6643/10/11/1772#Abstract


## Hypothesis


We hypothesize that countries with higher estimated caffeine consumption per capita will show differences in mental health and mortality outcomes, although the direction of these relationships may vary by outcome. Based on prior observational research, we expect these associations to change in direction or strength after controlling for confounders. We also recognize that economic development and lifestyle factors may strongly confound these patterns.<a name="cite_ref-1b"></a>[<sup>1</sup>](#cite_note-1)

We also hypothesize that the dominant caffeine source will matter. Countries where caffeinated beverage consumption is dominated by energy drinks or carbonated beverages may show different patterns compared to countries where caffeine primarily comes from coffee or tea. In addition, we expect higher caffeine intake (mg/day) to be associated with higher systolic blood pressure (SBP) and hypertension.<a name="cite_ref-6b"></a>[<sup>6</sup>](#cite_note-6)


## Data


### 1) Ideal dataset
The ideal dataset would contain, for each country and year (e.g., 2019–2023), **total caffeine intake per capita (mg/day)** measured directly from nationally representative dietary surveys or validated retail sales/scan data mapped to caffeine content. It would include beverage-type breakdown (coffee/tea/cola/energy/RTD), uncertainty estimates, and key covariates (GDP per capita, smoking prevalence, alcohol consumption, obesity prevalence, healthcare access indicators). The dataset would be stored in a tidy format with one row per country-year and a clear codebook.

### 2) Real datasets we have now (and how we will use them)

**Dataset A: Coffee consumption per capita (kg/person/year), 2019–2023**
- Location: https://worldpopulationreview.com/country-rankings/coffee-consumption-by-country
- Use: Convert kg/year to g/day, then estimate caffeine intake using assumed caffeine-per-gram ranges; run sensitivity analyses for low/medium/high assumptions.
- Limitations: definitions vary by country; brewing strength varies widely; this produces an estimate, not a direct measure.

**Dataset B: Tea consumption per capita, 2019–2022**
- Location: https://worldpopulationreview.com/country-rankings/tea-consumption-by-country
- Use: Convert to g/day and estimate caffeine intake using tea-specific caffeine-per-gram ranges; run sensitivity analyses.
- Limitations: tea type varies (black/green/herbal) and caffeine varies accordingly.

**Dataset C: Caffeine content reference dataset (mg per serving / mg per volume)**
- Location: https://www.kaggle.com/datasets/heitornunes/caffeine-content-of-drinks
- Use: Provides reference values and plausible ranges for caffeine content in different beverages; used to justify conversion assumptions and ranges.
- Limitations: product list may not be country-specific and values differ by brand/serving size.

**Dataset D: Percent composition of caffeinated beverage volume sales by category (country-level)**
- Location: https://www.mdpi.com/2072-6643/10/11/1772#Abstract
- Use: Use composition to classify countries by dominant caffeine source (coffee-dominant vs tea-dominant vs carbonates/energy-dominant) and test subgroup differences / interactions.
- Limitations: percent composition does not provide total volume per capita; it cannot produce a total caffeine estimate by itself.
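
The classification step for Dataset D could look roughly like the sketch below. The category names, the grouping of carbonates with energy drinks, and the example percentages are hypothetical placeholders; the real column names and values would come from the source figure.

```python
from typing import Dict, Mapping

# ASSUMPTION: hypothetical category names mapped onto the three groups
# named in the proposal (coffee, tea, carbonates/energy).
SOURCE_GROUPS: Dict[str, str] = {
    "coffee": "coffee",
    "tea": "tea",
    "carbonates": "carbonates/energy",
    "energy_drinks": "carbonates/energy",
}


def dominant_source(shares: Mapping[str, float]) -> str:
    """Return the group with the largest combined share of volume sales.

    `shares` maps a beverage category to its percent of caffeinated
    beverage volume. Categories not listed in SOURCE_GROUPS are ignored.
    On an exact tie, the group encountered first wins.
    """
    totals: Dict[str, float] = {}
    for category, pct in shares.items():
        group = SOURCE_GROUPS.get(category)
        if group is None:
            continue
        totals[group] = totals.get(group, 0.0) + pct
    if not totals:
        raise ValueError("no recognized beverage categories in shares")
    return max(totals, key=lambda g: totals[g])


# Made-up example row, for illustration only.
example = {"coffee": 60.0, "tea": 20.0, "carbonates": 15.0, "energy_drinks": 5.0}
label = dominant_source(example)
```

Because shares describe composition rather than total volume, the resulting label is only useful as a grouping variable for subgroup or interaction analyses, not as a measure of intake.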

### 3) Constructed core variable
We will construct an estimated caffeine intake per capita (mg/day) as:
- caffeine_mg_day ≈ (coffee_g_day × caffeine_mg_per_g_coffee) + (tea_g_day × caffeine_mg_per_g_tea)

We will report results across multiple assumption settings (low/medium/high caffeine-per-gram values) to test robustness.
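
A minimal sketch of this calculation follows. It assumes both coffee and tea are reported in kg/person/year (the tea units still need to be confirmed against Dataset B), and the caffeine-per-gram values are placeholders for illustration, not figures taken from our sources; they would be replaced by ranges justified from Dataset C.

```python
from typing import Dict

DAYS_PER_YEAR = 365.25

# ASSUMPTION: placeholder conversion factors (mg caffeine per gram of product).
# Illustrative only; final ranges should be justified from the reference data.
CAFFEINE_MG_PER_G: Dict[str, Dict[str, float]] = {
    "low": {"coffee": 8.0, "tea": 10.0},
    "medium": {"coffee": 12.0, "tea": 20.0},
    "high": {"coffee": 16.0, "tea": 30.0},
}


def kg_per_year_to_g_per_day(kg_per_year: float) -> float:
    """Convert per-capita consumption from kg/year to g/day."""
    return kg_per_year * 1000.0 / DAYS_PER_YEAR


def estimate_caffeine_mg_day(
    coffee_kg_year: float, tea_kg_year: float, scenario: str = "medium"
) -> float:
    """Estimate caffeine intake (mg/day) under one assumption scenario."""
    factors = CAFFEINE_MG_PER_G[scenario]
    coffee_g_day = kg_per_year_to_g_per_day(coffee_kg_year)
    tea_g_day = kg_per_year_to_g_per_day(tea_kg_year)
    return coffee_g_day * factors["coffee"] + tea_g_day * factors["tea"]


def caffeine_scenarios(coffee_kg_year: float, tea_kg_year: float) -> Dict[str, float]:
    """Return the estimate under every scenario, for sensitivity analysis."""
    return {
        name: estimate_caffeine_mg_day(coffee_kg_year, tea_kg_year, name)
        for name in CAFFEINE_MG_PER_G
    }


# Hypothetical country with 4.0 kg coffee and 0.5 kg tea per person per year.
estimates = caffeine_scenarios(4.0, 0.5)
```

Keeping the scenarios in one dictionary means each regression can be rerun per scenario with the same code path, which makes it easier to show whether conclusions depend on the conversion assumptions.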


## Ethics 

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

Because our primary analysis is ecological (country-level), the key “collection bias” risks include differences in how consumption is measured across countries, differences in reporting quality, and differences in how health outcomes are recorded. We will document how each source defines its variables and include sensitivity checks and cautious interpretation.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

We plan to use aggregated country-level data only. No personally identifiable information is needed or collected.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

A major risk is ecological fallacy (over-interpreting country-level correlations as individual-level causation). We will clearly frame results as associations at the country level and avoid causal claims.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

We will control for plausible confounders (e.g., GDP per capita, smoking, alcohol use) and will report limitations where confounding cannot be removed. We will also run sensitivity analyses because caffeine intake is estimated rather than directly measured.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

We will show uncertainty by reporting results across multiple assumption settings for caffeine conversion factors instead of presenting one “true” caffeine value. We will avoid misleading country ranking visualizations without context.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

We will keep a clear reproducible pipeline in the repository: raw sources, cleaning scripts, a data dictionary, and versioned outputs.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

We will report multiple evaluation views (e.g., coefficient estimates + confidence intervals, partial dependence / marginal effects where appropriate, and sensitivity checks) instead of relying on one metric.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

We will include a limitations section that emphasizes ecological design constraints, measurement uncertainty in caffeine estimates, and remaining confounding.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

Country-level correlations can be misused to shame or stereotype countries/cultures. We will avoid sensational framing and will emphasize uncertainty and confounding. We will not claim that caffeine causes outcomes, only that country-level variables are associated.


## Team Expectations 


- We will communicate in a shared group channel (Discord/iMessage) and respond within 24 hours on weekdays.
- Each task will have an owner, a deadline, and a clear definition of done.
- We will use GitHub issues to track tasks and do PR-based work (review before merging).
- If someone is blocked for more than 24 hours, they will post what they tried and ask for help early.
- If conflicts arise, we will address them respectfully and quickly; if needed, we will escalate to the TA/instructor.


## Project Timeline Proposal


| Meeting Date  | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 1/20  | 1 PM | Read COGS 108 team policies; brainstorm topic and candidate outcomes/covariates. | Confirm communication tools; finalize project direction; assign background research tasks. |
| 1/26  | 10 AM | Summarize 3–6 sources; list candidate outcome datasets. | Align on research question/hypothesis; draft proposal sections; decide confounders. |
| 2/1   | 10 AM | Finalize and submit proposal; collect coffee/tea datasets; draft caffeine estimation approach. | Define data dictionary; plan merging keys (country/year); outline analysis plan. |
| 2/14  | 6 PM | Import & clean consumption data; initial caffeine estimate scenarios; initial plots. | Review cleaning choices; decide final assumptions/ranges; confirm outcome dataset(s). |
| 2/23  | 12 PM | Merge caffeine estimates with outcome/covariate data; run baseline models. | Review model diagnostics; iterate controls; plan sensitivity analyses. |
| 3/13  | 12 PM | Complete analysis and sensitivity checks; draft results and discussion. | Edit full report; refine visuals; finalize limitations and ethics write-up. |
| 3/20  | Before 11:59 PM | Final polish; finalize repository and report. | Turn in final project and group surveys. |
