# COGS 108 - Project Proposal

## Authors

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Team list and credits:
- Wuyue Huang: Project administration, Conceptualization, Software, Writing – review & editing
- Tiger Huang: Methodology, Experimental investigation, Software, Conceptualization
- Yining Ning: Background research, Experimental investigation, Software, Visualization
- Zhixi Guo: Software, Analysis, Visualization, Writing – original draft
- Yilei Wang: Software, Analysis, Experimental investigation, Writing – original draft

## Research Question

At the U.S. county level, does a higher multi-year average annual mean PM2.5 concentration (µg/m³) predict higher adult COPD prevalence (%) and adult current asthma prevalence (%), after controlling for smoking, demographics, socioeconomic factors, urbanicity, and healthcare access?

## Background and Prior Work


Instructions: REPLACE the contents of this cell with your work

- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects who have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be primary research publications, but they need not be. Blogs, GitHub repositories, company websites, reputable news or magazine articles etc., are all viable references if they are relevant to your project. It may be very helpful to look for review papers in research publications; these surveys of a field can provide you with important domain expertise and make excellent citations.

Do not just give us a couple of random citations. These should be directly relevant. Depending on your needs, these could be:
- fundamental things that are very important in the broad field (think textbook chapters or famous case studies)
- very similar projects to what you want to do
- primary research or review articles about techniques you will use or an obstacle you will face 

Generally if two possible citations exist, choose the one which seems more central to the field (has more citations or appears in a reputable source).

Be aware that AI will make up citations and generally not necessarily pick out the most important ones.

You are expected to have three to six paragraphs of background information here and a *minimum* of three relevant citations. Use those citations in a way that makes it clear which information comes from which reference. Don't just claim a bunch of stuff in the text without citation and then dump a bibliography at the end.

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts and methods. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="..."></a>` is a tag that lets you create a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)`, which produces a clickable link that takes you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


## Hypothesis

Instructions: REPLACE the contents of this cell with your hypothesis

- Put your hypothesis here; this is different from your question. This is what you think the answer will be to the question you're asking
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

If your question is "What is the association between X and Y" then a hypothesis might be "We predict a strong correlation between X and Y" or you might predict no correlation or any other possible relationship. Briefly explain your thinking, referring to the background section as needed. (2-3 sentences)

## Data

To study whether a higher multi-year average PM<sub>2.5</sub> concentration predicts higher adult COPD and asthma prevalence at the county level while controlling for other factors, our ideal dataset would describe each county with the following features, measured over a 5 to 10 year period:
- Air Pollution Exposure: Multi-year average annual PM<sub>2.5</sub> concentration (µg/m³) for each county.
- Health Outcomes: Adult COPD prevalence (%) and adult current asthma prevalence (%) for the corresponding time period. These would ideally be age-adjusted prevalence estimates of diagnosed COPD and current asthma among adults (18+).

The ideal dataset would also include a comprehensive set of covariates for each county:
- Smoking Rate: Adult smoking prevalence (%) as smoking is a major COPD risk factor.
- Demographics: Population age distribution (e.g., % of population over 65), sex distribution, and racial/ethnic composition of the county. These factors control for population differences that affect disease prevalence.
- Socioeconomic Factors: Indicators such as median household income, poverty rate, education level (% high school graduates), and unemployment rate. These help control for social determinants of health that could confound the pollution-health relationship.
- Urbanicity: A measure of how urban or rural the county is (for example, a rural–urban continuum code from USDA). Urban counties may have different pollution levels and health care access than rural ones, so this needs to be accounted for.
- Healthcare Access: Metrics reflecting access to care, such as the number of primary care physicians per 100,000 population or the percentage of adults without health insurance. This controls for differences in medical diagnosis and management of COPD/asthma across counties.

Number of Observations: Ideally, all counties in the U.S. (≈3,142 counties) would be included to maximize statistical power and generalizability. Each county would be one observation (a cross-sectional dataset), since we are examining multi-year average exposure and current outcomes. If a panel design were considered, we could have multiple observations per county over time, but the question’s focus on a multi-year average suggests a single observation per county (using recent 5–10 year data). Thus, roughly 3,000+ observations would be needed, covering the entire United States.

Data Collection Methods: In an ideal scenario, all variables would be measured consistently over the same multi-year period:

- PM<sub>2.5</sub> Data Collection: Fine particulate pollution levels would come from a combination of ground monitors and satellite-derived models to ensure every county has an estimate. For example, the EPA’s Air Quality System monitors and modeled data (like CDC’s Environmental Public Health Tracking network) could provide daily PM<sub>2.5</sub> values, which are then averaged to an annual mean for each year and further averaged over 5+ years. This yields a stable multi-year mean for each county.

- Health Outcomes: COPD and asthma prevalence data would come from large health surveys (such as CDC’s Behavioral Risk Factor Surveillance System) analyzed with small-area estimation techniques. A model-based approach can produce county-level prevalence estimates for COPD and asthma by combining survey data with census population characteristics. This was done, for instance, in CDC’s PLACES project, which provides county estimates for chronic diseases.

- Smoking and Other Covariates: These would be collected from reliable national sources. Smoking rates and other health behavior data could also come from CDC survey estimates (e.g., BRFSS via PLACES). Demographic and socioeconomic variables come from the U.S. Census Bureau’s American Community Survey (ACS), which continuously collects data on population characteristics in every county. Urban/rural status can be assigned from USDA classifications, and healthcare access metrics can be obtained from health resource databases (e.g., HRSA’s Area Health Resources File). Each of these data sources covers all U.S. counties consistently.

Data Organization: All the above data would be merged into one cohesive dataset keyed by a common identifier for counties (such as the 5-digit FIPS code). Each row would represent one county, and columns would include the PM<sub>2.5</sub> exposure, health outcomes, and control variables. The data could be stored as a spreadsheet or CSV file, or in a relational database. For example, one might use a table with columns: County_FIPS, County_Name, State, PM25_avg, COPD_prev, Asthma_prev, Smoking_rate, Median_income, Poverty_rate, RUCC_code, PrimaryCarePhysicians_per100k, etc. This organized format makes it easy to run statistical analyses. We would ensure that the time frames align (e.g., pollution averaged over 2015–2019 and COPD/asthma prevalence around 2019–2020) so that the exposure precedes or coincides with the health outcome measurement. In summary, the ideal dataset includes all U.S. counties with a comprehensive set of variables collected over the last 5–10 years, cleaned and combined into a single analysis file.
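The merge described above can be sketched with pandas. This is a minimal illustration on made-up rows, not real data. The column names follow the example table above, and the frame names (`pm25`, `health`, `rucc`) and the helper `normalize_fips` are hypothetical. It assumes FIPS codes may have lost their leading zeros if a file was read as integers, so they are re-padded to five characters before joining.

```python
import pandas as pd


def normalize_fips(series: pd.Series) -> pd.Series:
    """Return county FIPS codes as zero-padded 5-character strings."""
    return series.astype(str).str.strip().str.zfill(5)


# Made-up rows for illustration only; these are not real measurements.
pm25 = pd.DataFrame({"County_FIPS": [1001, 6073, 36061], "PM25_avg": [8.1, 9.4, 7.7]})
health = pd.DataFrame(
    {"County_FIPS": ["01001", "06073"], "COPD_prev": [8.0, 5.1], "Asthma_prev": [9.9, 8.7]}
)
rucc = pd.DataFrame({"County_FIPS": ["01001", "06073", "36061"], "RUCC_code": [2, 1, 1]})

for frame in (pm25, health, rucc):
    frame["County_FIPS"] = normalize_fips(frame["County_FIPS"])

# Left-join onto the exposure table so counties missing an outcome stay visible as NaN.
# validate="one_to_one" raises an error if any FIPS code is duplicated in either table.
merged = pm25.merge(health, on="County_FIPS", how="left", validate="one_to_one")
merged = merged.merge(rucc, on="County_FIPS", how="left", validate="one_to_one")

# Flag counties without outcome data instead of silently dropping them.
missing_outcome = merged.loc[merged["COPD_prev"].isna(), "County_FIPS"].tolist()
print(merged)
print("Counties missing outcomes:", missing_outcome)
```

Reading FIPS columns as strings in the first place (for example with `dtype=str` in `pd.read_csv`) avoids the padding step, but normalizing again before each merge is a cheap safeguard.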

Potential **Real** Datasets:
1) CDC PLACES — County Data (COPD, asthma, smoking):
- The CDC PLACES County dataset is hosted on CDC’s open data portal (Socrata) and is publicly downloadable, with no application or special permission required. We can download it directly as a CSV from the dataset page or via the “rows.csv” endpoint. PLACES provides model-based estimates that cover the entire U.S. (50 states + DC) at the county level. The most recent “PLACES 2025 release” is based primarily on 2023 BRFSS data (with a small set of measures carried over from 2022).
- URLs:
    - Dataset page: https://data.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb
    - Direct CSV: https://data.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD

- Important variables to use:
    - Adult COPD prevalence (%) (crude and often age-adjusted versions available, depending on field)
    - Adult current asthma prevalence (%)
    - Adult current smoking prevalence (%)
    - County identifiers: FIPS, county/state names, measure name/category fields
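Because PLACES is distributed with one row per county per measure, the measures need to be pivoted into one row per county before merging. The sketch below shows that step on made-up rows. The column names (`LocationID`, `MeasureId`, `DataValueTypeID`, `Data_Value`) and the measure and value-type codes are assumptions about the file layout and should be checked against the dataset's data dictionary before use.

```python
import pandas as pd

# Made-up rows mimicking a long-format layout (one row per county per measure).
# Column names and codes are assumptions to verify against the data dictionary.
places_long = pd.DataFrame(
    {
        "LocationID": ["01001", "01001", "01001", "06073", "06073", "06073"],
        "MeasureId": ["COPD", "CASTHMA", "CSMOKING"] * 2,
        "DataValueTypeID": ["AgeAdjPrv"] * 6,
        "Data_Value": [8.0, 9.9, 17.2, 5.1, 8.7, 9.8],
    }
)

# Map the assumed measure codes to the column names used in our analysis file.
wanted = {"COPD": "COPD_prev", "CASTHMA": "Asthma_prev", "CSMOKING": "Smoking_rate"}

# Keep only the measures we need and a single value type (here, age-adjusted).
subset = places_long[
    places_long["MeasureId"].isin(list(wanted))
    & (places_long["DataValueTypeID"] == "AgeAdjPrv")
]

# One row per county, one column per measure.
places_wide = (
    subset.pivot(index="LocationID", columns="MeasureId", values="Data_Value")
    .rename(columns=wanted)
    .reset_index()
    .rename(columns={"LocationID": "County_FIPS"})
)
places_wide.columns.name = None
print(places_wide)
```

Filtering to one value type before pivoting matters: if both crude and age-adjusted rows are present for the same county and measure, `pivot` raises an error on the duplicate entries.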

2) CDC Environmental Public Health Tracking (EPHT) — PM2.5 (annual mean; county-level)
- CDC’s Environmental Public Health Tracking Network provides an online Data Explorer where we can view and export environmental measures, including PM2.5-related indicators, by geography (including counties) and by year. Access is public: we select the PM2.5 measure, set the geography to county, choose the year range (the last 5–10 years), and then export the data. This source is widely used for county-level air pollution indicators (including in County Health Rankings’ PM2.5 measure). We will compute the multi-year average annual mean PM2.5 by averaging the annual county values across our chosen years.
- URL:
    - Data Explorer: https://ephtracking.cdc.gov/DataExplorer/
    - County Health Rankings PM2.5 measure page: https://www.countyhealthrankings.org/health-data/community-conditions/physical-environment/air-water-and-land/air-pollution-particulate-matter

- Important variables to use: County annual mean PM2.5 (µg/m³) by year (exported from the explorer); County identifiers: FIPS (or equivalent geography codes), year
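The multi-year averaging step can be sketched as follows. The rows are made up, and the column names and the `min_years` threshold are hypothetical choices for illustration rather than part of the export format. The idea is to average annual county means within a year window and to mark counties with too few years of data as missing.

```python
import pandas as pd

# Made-up annual values for illustration; column names are hypothetical.
annual = pd.DataFrame(
    {
        "County_FIPS": ["01001"] * 5 + ["06073"] * 3,
        "Year": [2015, 2016, 2017, 2018, 2019, 2017, 2018, 2019],
        "PM25_annual_mean": [9.0, 8.5, 8.0, 8.5, 8.0, 10.0, 9.0, 11.0],
    }
)


def multi_year_average(df, start_year, end_year, min_years):
    """Average annual county means over the inclusive window [start_year, end_year].

    Counties with fewer than `min_years` annual values in the window get NaN,
    so a mean built from sparse data is not mistaken for a stable estimate.
    """
    window = df[df["Year"].between(start_year, end_year)]
    summary = window.groupby("County_FIPS")["PM25_annual_mean"].agg(
        PM25_avg="mean", n_years="count"
    )
    summary.loc[summary["n_years"] < min_years, "PM25_avg"] = float("nan")
    return summary.reset_index()


pm25_avg = multi_year_average(annual, 2015, 2019, min_years=4)
print(pm25_avg)
```

Keeping the `n_years` column in the output makes it easy to report how many counties were excluded for incomplete coverage.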

3) U.S. Census ACS 5-Year (county-level demographics + socioeconomic controls)
- The American Community Survey (ACS) 5-year estimates are available publicly and cover all counties, providing stable estimates for small areas. We can access ACS either through the Census API (recommended for reproducibility) or via the data.census.gov interface. No permission is needed; an API key is optional but helpful for larger queries. We’ll pull county-level controls aligned to our analysis window (e.g., 2019–2023 ACS 5-year for recent covariates).
- URLs:
    - ACS 5-year developer hub: https://www.census.gov/data/developers/data-sets/acs-5year.html
    - API endpoint docs (example): https://api.census.gov/data/2022/acs/acs5.html
    - Table browser: https://data.census.gov/

- Important variables we’ll use: Demographics: total population, % age 65+, sex composition, race/ethnicity composition; Socioeconomic: median income, poverty rate, education attainment, unemployment; Healthcare-related: % uninsured (if using ACS for insurance)
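A request to the Census API for all counties might be built and parsed as sketched below. The helper names are hypothetical, and the variable codes (`B01003_001E` for total population, `B19013_001E` for median household income) are given as examples to confirm in the table browser. The sample response is made up so the parser can be checked without a network call; it assumes the API returns a list of rows with the header first and separate `state` and `county` code columns.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/{year}/acs/acs5"


def build_county_query(year, variables, api_key=None):
    """Build an ACS 5-year request URL covering all counties."""
    params = {"get": ",".join(["NAME", *variables]), "for": "county:*"}
    if api_key:
        params["key"] = api_key
    # Keep commas, colons, and asterisks readable instead of percent-encoding them.
    return BASE_URL.format(year=year) + "?" + urlencode(params, safe=",:*")


def rows_to_records(payload):
    """Convert a list of rows (header row first) into dicts with a 5-digit FIPS key."""
    header, *rows = payload
    records = []
    for row in rows:
        record = dict(zip(header, row))
        record["County_FIPS"] = record["state"] + record["county"]
        records.append(record)
    return records


url = build_county_query(2022, ["B01003_001E", "B19013_001E"])
print(url)

# Made-up response in the assumed shape, used only to exercise the parser offline.
sample = json.loads(
    '[["NAME","B01003_001E","B19013_001E","state","county"],'
    '["Example County, Example State","1000","50000","01","001"]]'
)
print(rows_to_records(sample))
```

Values arrive as strings, so numeric columns need to be converted before analysis, and any special codes used for unavailable estimates should be treated as missing.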

4) USDA ERS Rural–Urban Continuum Codes (RUCC) — urbanicity control

- USDA ERS provides RUCC as a free county-level classification that captures metro/non-metro status and gradations of urbanization. The RUCC 2023 release is publicly downloadable as CSV or XLSX, with no application required. We’ll merge RUCC into our master dataset by county FIPS and use it as a categorical control (or collapse it into metro vs. non-metro).

- URL:

    - RUCC downloads page: https://www.ers.usda.gov/data-products/rural-urban-continuum-codes

- Important variables we’ll use: RUCC code (1–9) for each county; County identifiers (FIPS, county/state)
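The two ways of using RUCC as a control can be sketched as follows, assuming the standard grouping in which codes 1–3 are metro counties and 4–9 are non-metro counties. The rows and the column names `RUCC_code` and `Metro_status` are made up for illustration.

```python
import pandas as pd


def collapse_rucc(code: int) -> str:
    """Map a RUCC code (1-9) to 'metro' (1-3) or 'nonmetro' (4-9)."""
    if not 1 <= code <= 9:
        raise ValueError(f"Unexpected RUCC code: {code}")
    return "metro" if code <= 3 else "nonmetro"


# Made-up codes for illustration only.
rucc = pd.DataFrame({"County_FIPS": ["01001", "01005", "06073"], "RUCC_code": [2, 6, 1]})

# Option 1: a binary metro vs. non-metro indicator.
rucc["Metro_status"] = rucc["RUCC_code"].map(collapse_rucc)

# Option 2: treat the code as categorical with one dummy column per level,
# dropping the first level so it serves as the reference category.
rucc_dummies = pd.get_dummies(rucc["RUCC_code"], prefix="RUCC", drop_first=True)

print(rucc)
print(rucc_dummies)
```

The binary version uses fewer degrees of freedom, while the categorical version keeps the distinction between, for example, small-town and fully rural counties.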

5) HRSA Area Health Resources Files (AHRF): healthcare access controls
- HRSA’s AHRF is a public dataset providing 6,000+ variables for each U.S. county, including workforce and facility measures useful for healthcare access controls. We can download county-level files from HRSA’s “Data Downloads” page; typically it’s a ZIP containing data files plus documentation/metadata. No special permission is generally required beyond standard website terms. We’ll extract variables like primary care supply, provider density, hospital resources, and related access indicators, then merge by county codes.

- URL(s):

    - AHRF overview: https://data.hrsa.gov/topics/health-workforce/ahrf

    - HRSA download hub: https://data.hrsa.gov/data/download
- Important variables we’ll use: Primary care physicians per 100,000 (or related provider supply variables); Hospital resources (beds, facilities) (optional); Shortage/resource scarcity indicators (optional); County geographic codes for linking/merging
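If the file provides provider counts rather than rates, a per-100,000 measure can be derived as sketched below. The column names and rows are hypothetical (including the placeholder FIPS `99999`), and the sketch assumes the population denominator comes from the same reference year as the count.

```python
import pandas as pd


def rate_per_100k(count: pd.Series, population: pd.Series) -> pd.Series:
    """Return count per 100,000 residents; NaN where population is zero or missing."""
    safe_population = population.where(population > 0)
    return count / safe_population * 100_000


# Made-up rows for illustration only.
ahrf = pd.DataFrame(
    {
        "County_FIPS": ["01001", "06073", "99999"],
        "primary_care_physicians": [30, 2500, 4],
        "population": [60000, 3300000, 0],
    }
)
ahrf["PrimaryCarePhysicians_per100k"] = rate_per_100k(
    ahrf["primary_care_physicians"], ahrf["population"]
)
print(ahrf)
```

Returning NaN for a zero or missing population keeps division errors from turning into infinite rates that would distort later regression results.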

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [ ] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
> We considered that not every county has the same number of air quality monitoring stations during data collection. This could introduce bias because there are fewer PM2.5 monitoring stations in rural areas and more in urban areas.
 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
> Both the EPA monitoring data and the CDC estimates are reported at the county level and do not include personally identifying information such as names or precise addresses, so the privacy risk is extremely low.
 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?
> To enable testing for downstream bias, we plan to include variables related to protected group status and social position, such as race, income, and education level, rather than focusing solely on PM2.5 concentrations. This will allow us to identify potential environmental injustices, such as an uneven distribution of pollution exposure across different communities in the United States.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
> Although the raw data is publicly available, our cleaned data and derived tables are stored in a private GitHub repository, accessible only to team members and relevant faculty. Furthermore, by using Git, we can track who modified the data. This not only secures the data during the analysis process but also helps us detect and recover from accidental corruption.
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?
> Our data will only be used during the project period in the Winter 2026 quarter, and will be deleted or archived after the grading is completed.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
> Our analysis focuses primarily on outdoor PM2.5 concentrations, but our county level environmental data cannot capture factors affecting indoor air quality, such as household dust or workplace secondhand smoke, which contribute to chronic obstructive pulmonary disease and asthma.
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
> We identified a detection bias: not being diagnosed by a doctor does not necessarily mean not being sick. In counties with limited access to healthcare, the diagnosis rate of respiratory diseases may be low even when residents are exposed to high concentrations of PM2.5, which can make poorer counties look healthier than they actually are. To reduce this bias, we will include income and healthcare access measures as covariates, and we will note that this adjustment cannot fully recover the true prevalence.
 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
> To ensure that our charts and statistics accurately reflect the data, we will avoid misleading presentations: we will keep outliers visible rather than silently dropping them, and we will not truncate axes.
 - [ ] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?
> Our analysis is fully documented in Jupyter Notebooks, and the use of Git ensures that all modifications are recorded. Furthermore, we have listed all data sources to facilitate verification by the TA and ensure the reproducibility of our work.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
> We understand that income could be misread as a discriminatory variable, but in our project it is included only as a covariate that reflects existing structural inequality; it is not intended to discriminate against any particular group.
 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
> We will test whether the model performs as accurately in impoverished counties as it does in wealthy counties. If the model makes more errors in impoverished areas, we will honestly report this in our report and explain the reasons.
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
> We will use several metrics to evaluate the model. For example, we will look at R² to understand how much of the county-level variation in asthma prevalence the model explains, alongside the size and uncertainty of the PM2.5 coefficient.
 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
> Linear regression clearly shows the meaning of each number, which makes it easy to explain, for example, how much the asthma rate changes when PM2.5 levels increase.
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?
> We will include a "Limitations" section in the project and honestly state that this analysis primarily focuses on the relationship between air pollution and respiratory health, but does not consider other factors such as family genetics.
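The modeling checks described in D.2 through D.4 can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not our final analysis: the data are randomly generated with a known coefficient, the variable names are hypothetical, and ordinary least squares is fit with NumPy's `lstsq`. It shows how a coefficient is read, how R² is computed, and how prediction error can be compared between higher-poverty and lower-poverty counties.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic county-like data for illustration only (not real measurements).
pm25 = rng.uniform(4, 14, n)       # PM2.5 in µg/m³
smoking = rng.uniform(8, 28, n)    # adult smoking prevalence, %
poverty = rng.uniform(5, 30, n)    # poverty rate, %
copd = 1.0 + 0.2 * pm25 + 0.25 * smoking + 0.05 * poverty + rng.normal(0, 0.5, n)

# Design matrix with an intercept column, then an ordinary least squares fit.
X = np.column_stack([np.ones(n), pm25, smoking, poverty])
beta, *_ = np.linalg.lstsq(X, copd, rcond=None)
residuals = copd - X @ beta

# R²: share of outcome variance explained by the model.
ss_res = np.sum(residuals**2)
ss_tot = np.sum((copd - copd.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Fairness check: compare error in higher-poverty vs. lower-poverty counties.
high_poverty = poverty >= np.median(poverty)
rmse_high = np.sqrt(np.mean(residuals[high_poverty] ** 2))
rmse_low = np.sqrt(np.mean(residuals[~high_poverty] ** 2))

# beta[1] is the estimated change in COPD prevalence (percentage points)
# per 1 µg/m³ increase in PM2.5, holding the other covariates fixed.
print(f"PM2.5 coefficient: {beta[1]:.3f}")
print(f"R squared: {r_squared:.3f}")
print(f"RMSE high poverty: {rmse_high:.3f}, low poverty: {rmse_low:.3f}")
```

A large gap between the two RMSE values would be one signal that the model fits poorer counties worse and should be reported as a limitation.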
### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
> Our results are based on past data, so we would like to remind viewers that this data may need to be updated if future air pollution patterns change.
 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
> If we find that our analysis might mislead others, we will immediately update the content. We will also include a disclaimer stating that this is a school project and does not constitute any medical advice or policy recommendations.
 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
> We use Git to manage our project, so if we find any incorrect analysis or faulty code, we can immediately revert to an earlier working version.
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?
> To prevent unintended use, we will add a disclaimer explaining that our project only shows correlation, not causation. For example, finding more cases of asthma in areas with poor air quality does not by itself show that poor air quality causes asthma.

## Team Expectations 

Read over the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) individually. Then, include your group’s expectations of one another for successful completion of your COGS108 project below. Discuss and agree on what all of your expectations are. Discuss how your team will communicate throughout the quarter and consider how you will communicate respectfully should conflicts arise. By including each member’s name above and by adding their name to the submission, you are indicating that you have read the COGS108 Team Policies, accept your team’s expectations below, and have every intention to fulfill them. These expectations are for your team’s use and benefit — they won’t be graded for their details.

* *Team Expectation 1* : We will use WeChat as our main channel (and email for formal messages). We respond within 24 hours on weekdays (and will send a quick acknowledgement + ETA if we’re busy).
* *Team Expectation 2* : We will meet at least once per week (virtually on WeChat) for progress updates and planning. Each member will come prepared with: what they completed, what they’re doing next, and any blockers.
* *Team Expectation 3* : All work will be tracked in a shared task list. Each task has a clear owner, deliverable, and deadline, and we update progress regularly so everyone can see what’s happening.
* *Team Expectation 4* : We will divide work evenly by effort, and ensure everyone contributes to (1) defining/refining the research question + dataset choice, (2) coding/analysis, (3) writing explanations, and (4) editing/review, rather than one person doing all the coding or writing.
* *Team Expectation 5* : If a member is struggling to deliver, they will notify the group as soon as possible (ideally 48+ hours before a deadline). The team will respond by pairing up, adjusting scope, or redistributing tasks early rather than waiting until the last minute.
* *Team Expectation 6* : We will communicate respectfully using “I” statements and specific feedback (e.g., “I think X is an issue because Y. What do you think?”). If conflict persists, we will discuss it in a meeting, agree on a concrete plan, and follow up at the next check-in.

## Project Timeline Proposal

Instructions: REPLACE the contents of this cell with your work

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |