# COGS 108 - Project Proposal

## Authors

Team list and credits:
- Bernico Chandra: Software, Visualization, Writing – original draft, Analysis
- Sebastian Ferragut: Visualization, Data Curation, Software
- Aamir Haq: Analysis, Writing – original draft, Validation
- Aaron Quizon: Writing – original draft, Writing – review & editing, Background research
- Paige Schumsky: Writing – original draft, Writing – review & editing, Conceptualization

## Research Question

What is the relationship between household income and estimated residential carbon emission rates at the census tract level in San Diego County in 2022? Using tract-level data, we measure income with Area Median Income (AMI) indicators as well as average annual household income, and we estimate residential carbon emissions from household expenditures on electricity, natural gas, and other fuels. We treat this as a statistical inference problem: we estimate the degree of association between income levels and emissions to understand whether income disparities are systematically associated with differences in residential carbon impact across San Diego County census tracts.
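One way the expenditure-based estimate could work is sketched below: divide each fuel's annual spend by a unit price to get energy consumed, then multiply by an emission factor. This is only a minimal illustration. The `FuelAssumption` class, the function name, and every price and emission factor shown are hypothetical placeholders chosen for easy arithmetic, not values taken from the dataset or from any official source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuelAssumption:
    """Hypothetical price and emission factor for one fuel (placeholders only)."""
    price_per_unit: float    # dollars per unit of energy
    kg_co2_per_unit: float   # kg CO2 emitted per unit of energy


def estimate_emissions_kg(expenditures, assumptions):
    """Convert annual fuel expenditures (dollars) into estimated kg CO2.

    For each fuel: units consumed = dollars / price,
    emissions = units consumed * emission factor.
    """
    total = 0.0
    for fuel, dollars in expenditures.items():
        a = assumptions[fuel]
        total += dollars / a.price_per_unit * a.kg_co2_per_unit
    return total


# Placeholder numbers, NOT real prices or emission factors.
ASSUMPTIONS = {
    "electricity": FuelAssumption(price_per_unit=0.25, kg_co2_per_unit=0.5),
    "gas": FuelAssumption(price_per_unit=2.0, kg_co2_per_unit=5.0),
}

# Hypothetical average household spend for one tract, in dollars per year.
tract_spend = {"electricity": 1000.0, "gas": 400.0}
print(estimate_emissions_kg(tract_spend, ASSUMPTIONS))
```

In the actual analysis, the prices and factors would need to come from documented 2022 sources, and the choice of those sources is itself a modeling assumption worth reporting.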

## Background and Prior Work

Over the last fifty years, a growing number of eyes have been focused on the controversial topic of climate change and, by extension, carbon emissions. Over this past half-century, emissions have risen dramatically<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1), prompting analysts to explore every aspect of the problem, from its consequences and potential solutions to, most relevant to this project, its roots. As a result, there exist many journal articles examining potential relationships between particular variables and carbon emissions, one such variable being income level.

Scholars broadly agree that higher income levels, from local to national scales, correlate with higher levels of carbon emissions. Internationally, Sarah Schöngart et al. used models of wealth groups and emissions data to estimate the extent to which affluent groups are responsible for emissions, with "the wealthiest 10% [contributing] 6.5 times more," and "the top 1% and 0.1% contributing 20x and 76x more..."<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Focusing on the United States specifically, the conclusions are similar: macroeconomist Fredrik N. G. Andersson states that "income inequality is correlated with carbon emissions," though with a weaker correlation.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) These two articles are only a small portion of the many findings that attribute the bulk of carbon emissions to higher-income groups. Because this pattern appears across many nations, it is reasonable to expect similar results for a smaller region such as San Diego County, though it will be interesting to see whether the gap is as disproportionate as in other regions.

Prior analyses of relevant datasets appear to reinforce the same idea. French economist Lucas Chancel analyzed income inequality alongside environmental impact datasets, finding that the top 10% of earners in the United States and China emit more tons of CO2 than the other 90% of earners in their respective countries, further highlighting this severe emissions inequality.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) A chart produced by Our World in Data also shows that high-income and upper-middle-income nations together produce more than 30 billion tons of carbon dioxide, roughly 10 times more than lower-income nations.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5)

Many researchers have drawn attention to other income-related factors that affect carbon emissions. Amy Richmond and Robert Kaufmann determined that energy prices provide a more direct measure of increases in emissions than income per capita alone. They found that “real oil prices have a statistically measurable effect on per-capita energy use and carbon emissions,” as “higher energy prices reduce energy use… because both firms and households can substitute capital or labor.”<a name="cite_ref-6"></a>[<sup>6</sup>](#cite_note-6) Ultimately, this finding supports the idea that an increase in income is associated with an increase in carbon emissions: individuals with higher incomes can afford more energy and therefore produce more emissions than those with lower incomes. Richmond and Kaufmann’s findings indicate a need to account for energy prices in our analysis; by measuring carbon emissions only for the year 2022 in San Diego County, we limit the fluctuation in energy prices within our data.

It is important to note that much of the research and data surrounding carbon emissions focuses on limiting their overall production, that is, on determining the most efficient way to reduce total emissions and slow climate change. For instance, Chancel suggests that “governments [should] develop better data on individual emissions to monitor progress towards sustainable lifestyles,” while Richmond and Kaufmann propose that “raising real energy prices may be a more effective means for reducing energy use.” While our project is limited to investigating the relationship between income levels and carbon emissions, our results will fit into the larger effort to evaluate how outside factors affect energy consumption and, in turn, contribute to climate change.

Sources:
1. <a name="cite_note-1"></a> [^](#cite_ref-1) Hannah Ritchie and Max Roser (2020) - “CO₂ emissions” Published online at OurWorldinData.org. Retrieved from: 'https://archive.ourworldindata.org/20260119-070102/co2-emissions.html' [Online Resource] (archived on January 19, 2026).
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Schöngart, S., Nicholls, Z., Hoffmann, R. et al. High-income groups disproportionately contribute to climate extremes worldwide. Nat. Clim. Chang. 15, 627–633 (2025). https://doi.org/10.1038/s41558-025-02325-x
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Andersson, Fredrik N. G. “Income Inequality and Carbon Emissions in the United States 1929–2019.” SSRN Electronic Journal, vol. 204, no. 107633, 2022, https://doi.org/10.2139/ssrn.4123754. Accessed 7 June 2022.
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Chancel, L. Global carbon inequality over 1990–2019. Nat Sustain 5, 931–938 (2022). https://doi.org/10.1038/s41893-022-00955-z
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Global Carbon Budget (2025) – with major processing by Our World in Data. “Carbon dioxide emissions by income level” [dataset]. Global Carbon Project, “Global Carbon Budget v15” [original data]. Retrieved February 3, 2026 from https://archive.ourworldindata.org/20260119-070102/grapher/co2-income-level.html (archived on January 19, 2026).
6. <a name="cite_note-6"></a> [^](#cite_ref-6) Richmond, Amy K., and Robert K. Kaufmann. “Energy Prices and Turning Points: The Relationship between Income and Energy Use/Carbon Emissions.” The Energy Journal, vol. 27, no. 4, 2006, pp. 157–80. JSTOR, http://www.jstor.org/stable/23297037. Accessed 3 Feb. 2026.


## Hypothesis


We hypothesize that census tracts with higher household income will have higher estimated residential carbon emissions per household. We base this expectation on the idea that higher-income households tend to live in larger homes and consume more resources, producing greater emissions. We therefore expect a positive association between income and estimated emissions, which we can further test while controlling for demographics and housing types.
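As a rough illustration of what a "positive association" means operationally, the sketch below computes a Pearson correlation and a simple least-squares slope with only the standard library. The function names and the four data points are invented for illustration, and the sketch covers only the unadjusted association; the controlled analysis mentioned above would need a multiple regression instead.

```python
from math import sqrt


def _centered_sums(xs, ys):
    """Return (Sxy, Sxx, Syy), the centered sums of products and squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy


def pearson_r(xs, ys):
    """Pearson correlation coefficient between xs and ys."""
    sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def ols_slope(xs, ys):
    """Slope of the least-squares line predicting ys from xs."""
    sxy, sxx, _ = _centered_sums(xs, ys)
    return sxy / sxx


# Made-up tract values: income in thousands of dollars, emissions in tons CO2.
income = [40.0, 60.0, 80.0, 100.0]
emissions = [3.0, 4.0, 4.5, 6.0]

print("r =", pearson_r(income, emissions))
print("slope =", ols_slope(income, emissions))
```

A positive slope whose confidence interval excludes zero would be consistent with the hypothesis; a slope near zero or a negative slope would count against it.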

## Data

1. Ideal Dataset

The ideal dataset would contain tract-level data for San Diego County for a given year, providing enough observations for modeling and inference. The data would be collected by agencies or groups with access to official or government data collection or reports. Each observation would correspond to a census tract, for granularity and accurate analysis, and would include variables on household income, residential energy and fuel consumption, and potentially demographic information. The dataset would be clean and tidy in tabular format, organized with standardized procedures, with no missing data and clearly labelled columns.

2. Real Dataset

The data is freely available at https://data.openei.org/submissions/6219. It provides census-tract-level data for all of California, with detailed measurements of income and energy consumption; we will restrict it to San Diego County tracts. The variables we are most likely to use are AMI measures, average household income, energy expenditures, and potentially some demographic variables as controls.
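Because the dataset covers all of California, one early wrangling step would be to keep only San Diego County tracts. A minimal sketch follows, assuming the file exposes an 11-digit census tract GEOID. The column names `tract_geoid` and `avg_income` and the sample rows are hypothetical, and the real file's field names may differ. The prefix `06073` combines the California state FIPS code (06) with the San Diego County FIPS code (073).

```python
import csv
import io

# State FIPS 06 (California) + county FIPS 073 (San Diego County).
SAN_DIEGO_PREFIX = "06073"


def normalize_geoid(raw):
    """Return an 11-digit tract GEOID, restoring a leading zero that is
    often lost when the identifier is parsed as a number."""
    return str(raw).strip().zfill(11)


def san_diego_rows(rows, geoid_field="tract_geoid"):
    """Keep only the rows whose tract GEOID falls in San Diego County."""
    return [
        row for row in rows
        if normalize_geoid(row[geoid_field]).startswith(SAN_DIEGO_PREFIX)
    ]


# Tiny made-up sample; two of the IDs have lost their leading zero.
SAMPLE_CSV = """tract_geoid,avg_income
6073000100,90000
6037101110,70000
06073000201,85000
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
sd = san_diego_rows(rows)
print(len(sd), "San Diego County tracts kept")
```

Zero-padding before matching matters because a GEOID stored as an integer would otherwise start with `6073` and silently fail the prefix check.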

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: This is not relevant as the project does not involve direct interaction with people. All data is publicly available and aggregated at census tract level.
 - [X] **A.2 Collection bias**: The data could carry some systematic bias, such as sampling bias, from the original Department of Energy data collection process. However, the data was collected under the Better Buildings Clean Energy for Low Income Communities Accelerator (CELICA) to help state and local entities better understand the housing and energy characteristics of the low- and moderate-income communities they serve, so it was designed to limit these issues and is largely comprehensive, with millions of observations.
 - [X] **A.3 Limit PII exposure**: This is not relevant as the dataset is publicly available government data, inherently anonymized by the granularity and available information.
 - [X] **A.4 Downstream bias mitigation**: The dataset includes multiple variables centered around demographics, including race and education level. This allows us to potentially include some of these variables in order to control for bias in downstream analysis, serving a better understanding of the data and interpretation of the results of our research task.

### B. Data Storage
 - [X] **B.1 Data security**: This is not relevant as it is publicly accessible government data that is anonymized by the granularity.
 - [X] **B.2 Right to be forgotten**: This is not relevant as it is publicly accessible government data that is anonymized by the granularity.
 - [X] **B.3 Data retention plan**: This is not relevant as it is publicly accessible government data that is anonymized by the granularity.

### C. Analysis
 - [X] **C.1 Missing perspectives**: This project does not involve direct engagement with stakeholders, though we acknowledge that the data can contain assumptions that do not fully capture lived experience across census tracts. To address this, we plan to review the literature on energy and environmental justice.
 - [X] **C.2 Dataset bias**: Potential sources of bias include the modeling and inference assumptions used to estimate household energy expenditures. We plan to offset this by reviewing the literature and discourse on energy and environmental justice, and potentially by using demographic control variables.
 - [X] **C.3 Honest representation**: Our visualizations and summary statistics will be designed and communicated to reflect the underlying data accurately and without exaggeration. We will label visualizations clearly. Where uncertainty or variability is high, we can address this through distributional plots or confidence intervals rather than single summary values.
 - [X] **C.4 Privacy in analysis**: There is no PII used or displayed at any stage of the analysis. We examine the data that is provided at the census tract level, without any individual reconstruction. This ensures that privacy is preserved throughout the analytical process.
 - [X] **C.5 Auditability**: Our analysis will be documented through our GitHub repo with code and written descriptions of EDA and modeling/inference decisions. This documentation ensures that the analysis can be reproduced and audited if any errors or concerns arise. Clear separation between raw data, cleaned data, and derived variables for analysis will further support transparency and reproducibility.
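
To illustrate the point in C.3 about reporting intervals rather than single summary values, here is a minimal sketch of a percentile bootstrap confidence interval for a mean. The percentile bootstrap is one possible technique, not necessarily the one the final analysis will use, and the function name, defaults, and sample values are made up for illustration.

```python
import random


def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`.

    Resamples with replacement, records each resample's mean, and returns
    the alpha/2 and 1 - alpha/2 percentiles of those means.
    """
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-household emission estimates (tons CO2) for a handful of tracts.
sample = [3.1, 4.0, 4.4, 5.2, 3.8, 6.1, 4.9, 5.5]
low, high = bootstrap_mean_ci(sample)
print(f"mean = {sum(sample) / len(sample):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Reporting the interval alongside the mean makes it visible when an apparent difference between groups could be explained by sampling variability alone.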

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Although this project does not involve decision making about individuals, there is a risk that income or demographic variables could act as proxies for protected characteristics. To address this, we avoid using demographic variables as primary predictors and instead include them as controls to contextualize results. The goal is to understand structural associations rather than to rank or evaluate specific groups.
 - [X] **D.2 Fairness across groups**: Model results will be examined across different demographic and income strata to identify whether associations differ substantially between groups. We can check whether residual patterns or estimated effects systematically vary across protected groups to help ensure that conclusions do not disproportionately misrepresent or obscure impacts on particular populations.
 - [X] **D.3 Metric selection**: We primarily focus on the interpretability of regression coefficients rather than optimizing a single performance metric. Model fit statistics may be considered, but we will avoid overemphasizing overall fit at the expense of understanding differential impacts.
 - [X] **D.4 Explainability**: The modeling approach is intentionally limited to interpretable methods, such as linear or generalized linear regression. This ensures that relationships between income, energy expenditures, and estimated emissions can be clearly explained in non-technical terms. 
 - [X] **D.5 Communicate limitations**: Limitations of the model, including reliance on expenditure-based emissions estimates, aggregation at the census tract level, and potential omitted variables, will be clearly communicated in the final report. These limitations will be presented in accessible language to avoid overstating the precision or generalizability of our findings.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: This is not relevant as this is a short-term, exploratory analysis project and does not involve deploying a model or system for ongoing use. As a result, post-deployment monitoring, performance tracking, or concept drift evaluation are not applicable.
 - [X] **E.2 Redress**: This is not relevant as this is a short-term, exploratory analysis project and does not involve deploying a model or system for ongoing use. As such, there are no end users who can be harmed by the results or by operational impacts. We nevertheless aim to present results responsibly to minimize any need for redress arising from misinterpretation.
 - [X] **E.3 Roll back**: This is not relevant as this is a short-term, exploratory analysis project and does not involve deploying a model or system for ongoing use. As such, there’s no deployment that can be rolled back. If errors are discovered, the analysis and conclusions can be revised within the project submission.
 - [X] **E.4 Unintended use**: This is not relevant as this is a short-term, exploratory analysis project and does not involve deploying a model or system for ongoing use. While we acknowledge that findings could be misinterpreted out of context, preventing misuse beyond the scope of this class project is not applicable as it’s not deployed.

## Team Expectations 

* Communication should be done via Discord, mostly through the already established group chat, but private matters can be discussed through direct messages as well.
* In most cases, everyone is expected to read and respond to a group message within 24 hours of it being sent. This rule is relaxed over the weekend, when group members are expected to have read and responded to group messages by the following Monday. The response window may be shortened when deadlines are close, provided it was announced at an earlier group meeting that replies are expected as soon as possible.
* Meetings are held weekly on Mondays at 9 AM in Peterson Hall. Meetings are hybrid: the main meeting is in person, but those who cannot attend in person can join the Discord call instead.
* The tone of the group’s communication should be blunt, but friendly. Each group member is expected to voice any concerns or thoughts on every decision the group makes. However, friendliness and politeness in tone is expected. Generally, criticisms and suggestions should be delivered with reasons and without starting a fight.
* Decision making is done via majority voting, especially for the project-wide and big decisions. However, if the decision is about specific and delegated tasks, like how to implement an algorithm used for the project, the delegated team member is allowed to do as they please as long as it doesn’t go against the agreed big picture.
* Decision making for a short time frame should include calling the group chat to make sure as many members as possible are able to voice their opinions. However, for decisions due in a very short time frame like within an hour, a majority of whoever is available after calling the group chat is followed.
* Specializations and tasks will be assigned as we move through each part of the project based on who is available to work the most during the week.
* To-do lists and the agreed upon due dates are available in the group’s Google Docs.
* Each member is expected to submit their work by pushing their part to GitHub. This doesn’t apply to the project proposal.
* Everyone is expected to help other members who are struggling if they understand how. Members who are struggling with their delegated task must say so promptly in the Discord group chat, well before the agreed due date of the task.
* Due dates for planned items are set as we move through each part of the project. Generally, items that were agreed in meetings or the group chat to be part of the upcoming checkpoint must be completed about 12 hours before that checkpoint’s deadline.

## Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/26  |  9 AM |  Read the instructions for the project review assignment. | Work on the project review assignment. | 
| 2/2  | 9 AM  | Read through the project proposal’s sections and come up with topic ideas. | Discuss and vote on the topic / research question and the datasets to use. Discuss and assign each member’s part of the project proposal. Discuss and formalize team expectations. |
| 2/4  | Before 11:59 PM | NA | Push the Project Proposal to the GitHub repo. |
| 2/9  | 9 AM  | Read through the Data Checkpoint’s sections. Read through the TA’s feedback if it has already been returned. | Discuss each section due next in the Data Checkpoint assignment and assign each member’s part. If the TA has returned feedback, discuss how to fix parts of the proposal and assign who will do so, if needed. |
| 2/16  | 9 AM  | Finish data wrangling and the description of the dataset (finish what was assigned to each member for the Data Checkpoint assignment). | Review each part of the Data Checkpoint. Discuss how to fix parts of the proposal and assign who will do so, if needed. If everything is done, discuss the EDA Checkpoint’s sections and start assigning early work. |
| 2/18  | Before 11:59 PM | NA | Make sure everything up to the Data Checkpoint has been pushed to the GitHub repo. |
| 2/23  | 9 AM  | Read through the EDA Checkpoint’s sections. Read through the TA’s feedback if it has already been returned. | Discuss each section due next in the EDA Checkpoint assignment and assign each member’s part. If the TA has returned feedback, discuss how to fix parts of the project up to the Data Checkpoint and assign who will do so, if needed. |
| 3/2  | 9 AM  | Finish what was assigned to each member for the EDA Checkpoint assignment. | Review each part of the EDA Checkpoint. Discuss how to fix parts of the project up to the Data Checkpoint and assign who will do so, if needed. |
| 3/4  | Before 11:59 PM | NA | Make sure everything up to the EDA Checkpoint has been pushed to the GitHub repo. |
| 3/9  | 9 AM  | Read through the TA’s feedback if it has already been returned. | Discuss the TA’s feedback and assign each member’s work on the final project’s revision. Start discussing and assigning work for the video. |
| 3/14-3/15  | TBA  | Aim to finish the final project’s revision and the video. | Review and check the progress of each member’s work on the final revision and the video. |
| 3/18  | Before 11:59 PM | NA | Turn in Final Project, Video, and Group Project Surveys. |