# COGS 108 - Project Proposal

## Authors


Team list & Credits:

- Ali Juhdi: Conceptualization, Data curation, Methodology
- Anchit Kumar: Analysis, Software, Visualization
- Brandon Scappaticci: Project administration, Software, Writing – review & editing
- Cesar Vizcaino Garay: Analysis, Background research, Validation
- John Caabay-Sandoval: Investigation, Supervision


## Research Question

We will test whether **estimated caffeine consumption per capita (by country)** is associated with **country-level health and social outcomes**, after accounting for major confounders.

Primary question: Across countries, is higher estimated caffeine intake associated with differences in mental health and mortality indicators (e.g., depression burden and all-cause mortality)?

Secondary questions:
- Do associations differ depending on the dominant caffeine source (coffee vs tea vs carbonated/energy drinks)?
- Do associations remain after controlling for factors like GDP per capita, smoking prevalence, alcohol use, and other relevant confounders?

This is an **inference** project using regression-based modeling and sensitivity analyses to test robustness to uncertainty in the caffeine estimates.
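
As a minimal sketch of what such a sensitivity analysis might look like, the example below refits the same regression under low, mid, and high caffeine conversion assumptions and compares the caffeine coefficient across them. Everything here is an assumption for illustration: the data are synthetic, the variable names are hypothetical, and the conversion factors are placeholders rather than sourced values.

```python
import numpy as np

def ols_coefs(X, y):
    """Fit OLS with an intercept; returns [intercept, slope_1, slope_2, ...]."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n_countries = 60  # hypothetical sample size

# Synthetic stand-ins for country-level variables (illustration only, not real data).
coffee_kg = rng.uniform(0.1, 10.0, n_countries)    # kg coffee per capita per year
tea_kg = rng.uniform(0.05, 3.0, n_countries)       # kg tea per capita per year
log_gdp = rng.normal(9.5, 1.0, n_countries)        # confounder: log GDP per capita
smoking_pct = rng.uniform(5.0, 40.0, n_countries)  # confounder: smoking prevalence
outcome = 5.0 - 0.3 * log_gdp + 0.02 * smoking_pct + rng.normal(0.0, 0.5, n_countries)

# Assumed mg of caffeine per kg of product; placeholder values, not sourced.
scenarios = {
    "low": {"coffee": 8_000.0, "tea": 10_000.0},
    "mid": {"coffee": 12_000.0, "tea": 20_000.0},
    "high": {"coffee": 16_000.0, "tea": 30_000.0},
}

caffeine_coef = {}
for name, factor in scenarios.items():
    # Estimated caffeine intake in mg per person per day under this scenario.
    caffeine_mg_day = (coffee_kg * factor["coffee"] + tea_kg * factor["tea"]) / 365
    X = np.column_stack([caffeine_mg_day / 100.0, log_gdp, smoking_pct])
    # Index 1 is the caffeine slope (outcome change per 100 mg/day), net of controls.
    caffeine_coef[name] = ols_coefs(X, outcome)[1]

for name, coef in caffeine_coef.items():
    print(f"{name}: {coef:+.4f}")
```

If the sign or rough magnitude of the caffeine coefficient changes across scenarios, that would suggest the result depends on the conversion assumptions rather than on the underlying consumption data.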



## Background and Prior Work

Caffeine is one of the most widely consumed psychoactive substances worldwide, typically consumed through coffee, tea, and other caffeinated beverages. Because caffeine affects sleep, mood, and cardiovascular physiology, population-level differences in caffeine intake could plausibly relate to mental health and mortality patterns.

At the individual level, observational research has reported associations between caffeine intake and depression-related outcomes. For example, analyses using NHANES data examined caffeine consumption and depression measures, suggesting a relationship may exist, though causality is not established and confounding remains a major concern.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

Similarly, large observational analyses have examined caffeine intake and all-cause or cause-specific mortality, often identifying non-linear associations and emphasizing careful interpretation and adjustment for confounders.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

However, most prior work is within-country and individual-level, while our project is between-country and ecological. This makes the question different: we are not claiming caffeine causes outcomes in individuals, but asking whether country-level patterns co-vary and whether those patterns persist after controlling for structural factors (wealth, health system capacity, smoking, alcohol use, etc.).

To estimate country-level caffeine intake, we will start from publicly available per-capita consumption estimates for coffee and tea by country for recent years.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) <a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) We will then convert these amounts to approximate caffeine intake using caffeine-content references (acknowledging large variation across preparation methods and products).<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5)
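
The conversion step could be sketched roughly as follows. This assumes consumption is reported as kilograms of dry product per person per year; the function name and the mg-per-kg factors are hypothetical placeholders, and real factors would need to be derived from the caffeine-content reference.

```python
DAYS_PER_YEAR = 365

# Assumed caffeine yield in mg per kg of dry product. Placeholder values for
# illustration only; they are not taken from any of the cited sources.
ASSUMED_MG_PER_KG = {"coffee": 12_000.0, "tea": 20_000.0}

def caffeine_mg_per_day(kg_per_year, mg_per_kg=ASSUMED_MG_PER_KG):
    """Convert annual per-capita consumption (kg by beverage) to mg caffeine per day."""
    total_mg_per_year = sum(kg * mg_per_kg[bev] for bev, kg in kg_per_year.items())
    return total_mg_per_year / DAYS_PER_YEAR

# Made-up consumption figures for a hypothetical country.
example_country = {"coffee": 3.65, "tea": 0.365}
estimate = caffeine_mg_per_day(example_country)
print(round(estimate, 1))
```

Passing a different `mg_per_kg` dictionary makes it straightforward to rerun the estimate under alternative assumptions about preparation method or product strength.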

In addition, beverage-market composition data (percent of caffeinated beverage volume sales by category) can help characterize where caffeine is coming from in different countries (coffee vs tea vs carbonates vs energy drinks), which we can use for subgroup and interaction analyses.<a name="cite_ref-6"></a>[<sup>6</sup>](#cite_note-6)
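
One simple way to turn these shares into a grouping variable is to label each country by its largest category. The sketch below uses made-up percentages and hypothetical category names; the real categories would follow whatever the source figure reports.

```python
def dominant_source(shares):
    """Return the category with the largest share of caffeinated beverage volume."""
    if not shares:
        raise ValueError("shares must not be empty")
    return max(shares, key=shares.get)

# Hypothetical percentages for two made-up countries (not real data).
market_shares = {
    "Country A": {"coffee": 55.0, "tea": 20.0, "carbonates": 20.0, "energy_drinks": 5.0},
    "Country B": {"coffee": 10.0, "tea": 60.0, "carbonates": 25.0, "energy_drinks": 5.0},
}

dominant = {country: dominant_source(s) for country, s in market_shares.items()}
print(dominant)
```

The resulting label could then serve as a categorical variable for subgroup comparisons or as one side of an interaction term.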

Footnotes:
1. <a name="cite_note-1"></a> [^](#cite_ref-1) Association between Caffeine Consumption and Depression in NHANES 2009–2010. https://pmc.ncbi.nlm.nih.gov/articles/PMC6407621/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Association between Caffeine Intake and All-Cause and Cause-Specific Mortality. https://pmc.ncbi.nlm.nih.gov/articles/PMC8715461/
3. <a name="cite_note-3"></a> [^](#cite_ref-3) World Population Review: Coffee consumption by country (2019–2023). https://worldpopulationreview.com/country-rankings/coffee-consumption-by-country
4. <a name="cite_note-4"></a> [^](#cite_ref-4) World Population Review: Tea consumption by country (2019–2022). https://worldpopulationreview.com/country-rankings/tea-consumption-by-country
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Kaggle dataset: Caffeine content of drinks. https://www.kaggle.com/datasets/heitornunes/caffeine-content-of-drinks
6. <a name="cite_note-6"></a> [^](#cite_ref-6) Nutrients (MDPI) article with country beverage-source percentages (Figure). https://www.mdpi.com/2072-6643/10/11/1772#Abstract


## Hypothesis


We hypothesize that countries with higher estimated caffeine consumption per capita will show measurable differences in mental health and mortality indicators, though the direction may differ by outcome. Based on prior observational findings, we predict a weak-to-moderate association between higher caffeine intake and lower depression burden and/or lower all-cause mortality, but we expect substantial confounding by economic development and lifestyle factors.<a name="cite_ref-1b"></a>[<sup>1</sup>](#cite_note-1)

We also hypothesize that the dominant caffeine source will matter: countries where caffeine comes mainly from energy drinks or carbonated beverages may show different patterns than countries where caffeine primarily comes from coffee or tea.<a name="cite_ref-6b"></a>[<sup>6</sup>](#cite_note-6)


## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [None]:
# Run this code every time you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
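
The size, missingness, and outlier checks listed above might be sketched like this. The table is a tiny made-up stand-in for a raw consumption file, the column names are hypothetical, and the 1.5 × IQR rule is just one common flagging heuristic, not a method required by the instructions.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for a raw consumption file (not real data).
df = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E", "F"],
    "coffee_kg_per_capita": [1.2, 3.4, np.nan, 2.8, 45.0, 2.1],
})

print(df.shape)                  # (rows, columns)
missing_counts = df.isna().sum() # number of missing values per column
print(missing_counts)

# Flag values outside 1.5 * IQR of the quartiles; NaN rows compare as False.
col = df["coffee_kg_per_capita"]
q1, q3 = col.quantile([0.25, 0.75])
iqr = q3 - q1
df["is_outlier"] = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
print(df[df["is_outlier"]])
```

Flagging rather than dropping keeps the suspicious rows available for inspection, so the decision to exclude them can be justified separately.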


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is to demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.
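
For the combining step, one possible approach is an outer merge on country and year with an indicator column, so rows that fail to match are visible rather than silently dropped. The frames, column names, and values below are made up for illustration.

```python
import pandas as pd

# Made-up frames standing in for the caffeine estimates and an outcome dataset.
caffeine = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C"],
    "year": [2020, 2020, 2020],
    "caffeine_mg_day": [140.0, 95.0, 210.0],
})
outcomes = pd.DataFrame({
    "country": ["Country A", "Country B", "Country D"],
    "year": [2020, 2020, 2020],
    "outcome_rate": [3.1, 4.2, 2.7],
})

# Outer merge keeps every row from both sides; `_merge` records where each came from.
merged = caffeine.merge(
    outcomes,
    on=["country", "year"],
    how="outer",
    indicator=True,
    validate="one_to_one",  # raises if a country-year pair is duplicated on either side
)

unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["country", "year", "_merge"]])
```

Unmatched rows often point to inconsistent country naming across sources, which would need to be harmonized (for example with a shared country code) before the final inner join.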

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics 

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?
 
 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

Because our primary analysis is ecological (country-level), the key “collection bias” risks include differences in how consumption is measured across countries, differences in reporting quality, and differences in how health outcomes are recorded. We will document how each source defines its variables and include sensitivity checks and cautious interpretation.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

We plan to use aggregated country-level data only. No personally identifiable information is needed or collected.

 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

A major risk is ecological fallacy (over-interpreting country-level correlations as individual-level causation). We will clearly frame results as associations at the country level and avoid causal claims.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

We will control for plausible confounders (e.g., GDP per capita, smoking, alcohol use) and will report limitations where confounding cannot be removed. We will also run sensitivity analyses because caffeine intake is estimated rather than directly measured.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

We will show uncertainty by reporting results across multiple assumption settings for caffeine conversion factors instead of presenting one “true” caffeine value. We will avoid misleading country ranking visualizations without context.

 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

We will keep a clear reproducible pipeline in the repository: raw sources, cleaning scripts, a data dictionary, and versioned outputs.

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

We will report multiple evaluation views (e.g., coefficient estimates + confidence intervals, partial dependence / marginal effects where appropriate, and sensitivity checks) instead of relying on one metric.

 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

We will include a limitations section that emphasizes ecological design constraints, measurement uncertainty in caffeine estimates, and remaining confounding.

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

Country-level correlations can be misused to shame or stereotype countries/cultures. We will avoid sensational framing and will emphasize uncertainty and confounding. We will not claim that caffeine causes outcomes, only that country-level variables are associated.


## Team Expectations 


- We will communicate in a shared group channel (Discord/iMessage) and respond within 24 hours on weekdays.
- Each task will have an owner, a deadline, and a clear definition of done.
- We will use GitHub issues to track tasks and do PR-based work (review before merging).
- If someone is blocked for more than 24 hours, they will post what they tried and ask for help early.
- If conflicts arise, we will address them respectfully and quickly; if needed, we will escalate to the TA/instructor.


## Project Timeline Proposal


| Meeting Date  | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 1/20  | 1 PM | Read COGS 108 team policies; brainstorm topic and candidate outcomes/covariates. | Confirm communication tools; finalize project direction; assign background research tasks. |
| 1/26  | 10 AM | Summarize 3–6 sources; list candidate outcome datasets. | Align on research question/hypothesis; draft proposal sections; decide confounders. |
| 2/1   | 10 AM | Finalize and submit proposal; collect coffee/tea datasets; draft caffeine estimation approach. | Define data dictionary; plan merging keys (country/year); outline analysis plan. |
| 2/14  | 6 PM | Import & clean consumption data; initial caffeine estimate scenarios; initial plots. | Review cleaning choices; decide final assumptions/ranges; confirm outcome dataset(s). |
| 2/23  | 12 PM | Merge caffeine estimates with outcome/covariate data; run baseline models. | Review model diagnostics; iterate controls; plan sensitivity analyses. |
| 3/13  | 12 PM | Complete analysis and sensitivity checks; draft results and discussion. | Edit full report; refine visuals; finalize limitations and ethics write-up. |
| 3/20  | Before 11:59 PM | Final polish; finalize repository and report. | Turn in final project and group surveys. |
