# COGS 108 - Project Proposal

## Authors


- *Andrew Zhang*: Research Question, Background and Prior Work, Writing (original draft)
- *Andy Cao*: Hypothesis, Data, Writing (original draft)
- *Vicky Huang*: Data Collection, Data Storage, Writing (review and editing)
- *Jasmine Lou*: Analysis, Modeling, Writing (review and editing)
- *Yiwen Huang*: Deployment, Timeline, Writing (review and editing)

## Research Question

How have housing prices and affordability changed over time in regions surrounding University of California (UC) campuses, and how do these trends relate to local income levels and broader economic conditions?

Specifically, we examine whether housing prices near UC campuses have increased faster than median household income and how these trends differ across regions (e.g., Irvine, La Jolla, Berkeley, Los Angeles) and time periods. The main metrics are median home price (or home value index), median household income, and affordability (e.g., price-to-income ratio or rent burden). The analysis is primarily descriptive and comparative. We will visualize price and income trends, compare regions, and assess changes in affordability over at least 10 to 20 years using public data from government and real-estate sources (e.g., Zillow, U.S. Census/ACS, FRED).
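
As a minimal sketch of the affordability metric named above, the price-to-income ratio can be computed per region and year as follows. The function name, row layout, and all numbers are hypothetical placeholders for illustration, not real Zillow or ACS figures.

```python
def price_to_income_ratio(median_home_price: float, median_household_income: float) -> float:
    """Return the price-to-income ratio; higher values mean lower affordability."""
    if median_household_income <= 0:
        raise ValueError("median_household_income must be positive")
    return median_home_price / median_household_income


# Placeholder values for illustration only, not real Zillow or ACS figures.
example_rows = [
    {"region": "Region A", "year": 2010, "price": 400_000, "income": 80_000},
    {"region": "Region A", "year": 2020, "price": 720_000, "income": 90_000},
]

# One ratio per (region, year) observation.
ratios = {
    (row["region"], row["year"]): price_to_income_ratio(row["price"], row["income"])
    for row in example_rows
}

for (region, year), ratio in ratios.items():
    print(f"{region} {year}: price-to-income ratio = {ratio:.1f}")
```

A ratio that rises between two years for the same region would indicate declining affordability under this metric.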



## Background and Prior Work

Housing affordability has become a major economic and social issue in California, particularly in regions with high demand and limited housing supply.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Areas surrounding University of California campuses often experience additional housing pressure due to student populations, faculty demand, and local economic growth. Prior research has shown that housing prices in California have increased significantly over the past two decades, frequently outpacing wage growth and contributing to affordability challenges for renters and homeowners alike.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Several public data sources and prior analyses provide context for this project. Zillow's housing market reports document long-term growth in home values across major U.S. metropolitan areas, with especially rapid increases in coastal California cities.[<sup>1</sup>](#cite_note-1) U.S. Census data and American Community Survey (ACS) reports provide evidence that median household income growth has been slower and uneven across regions.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) Additionally, Federal Reserve economic data has been used in prior projects to analyze how interest rates and inflation influence housing prices over time.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

Related student and public data science projects have explored housing affordability using Zillow and Census data to compare regional price growth, rent burdens, and income trends. These projects generally find that housing affordability has declined over time, particularly in urban and high-education regions. Our project builds on this prior work by focusing specifically on UC-adjacent regions (e.g., Irvine, La Jolla, Berkeley, Los Angeles) and comparing affordability trends across multiple campuses, rather than analyzing California as a single aggregated market.

**References**

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Zillow Research, Housing Data and Market Reports. https://www.zillow.com/research/data/
2. <a name="cite_note-2"></a> [^](#cite_ref-2) California housing and affordability context, consistent with widely reported trends in coastal metros.
3. <a name="cite_note-3"></a> [^](#cite_ref-3) U.S. Census Bureau, American Community Survey. https://www.census.gov/programs-surveys/acs
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Federal Reserve Economic Data (FRED). https://fred.stlouisfed.org/


## Hypothesis


We hypothesize that housing prices in regions surrounding UC campuses have increased faster than median household income over time, leading to decreased affordability. We also expect that UC regions located in major metropolitan or coastal areas (e.g., near UC Berkeley or UCLA) will exhibit higher prices and lower affordability compared to UC campuses in less dense regions (e.g., UC Riverside or UC Merced). This is based on prior evidence that coastal and urban California markets have seen stronger price growth and that university towns often face extra demand from students and staff, while income growth has been slower and uneven across the state.
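
One way the first part of this hypothesis could be checked is by comparing annualized growth in prices and in income over the same period. The sketch below is an assumption about how we might operationalize it, not a fixed method; the function names and the input values are hypothetical placeholders.

```python
def annualized_growth(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def growth_gap(price_start: float, price_end: float,
               income_start: float, income_end: float, years: int) -> float:
    """Annualized price growth minus annualized income growth.

    A positive gap means prices grew faster than income over the period,
    which is the pattern the hypothesis predicts.
    """
    return (annualized_growth(price_start, price_end, years)
            - annualized_growth(income_start, income_end, years))


# Placeholder values for illustration only, not real data.
gap = growth_gap(400_000, 800_000, 80_000, 100_000, years=10)
print(f"Annualized growth gap: {gap:.2%}")
```

Computing this gap separately for each UC region would also allow the coastal versus inland comparison in the second part of the hypothesis.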

## Data

**1. Ideal dataset**

To answer our research question we would use multiple datasets combined at the regional and temporal level. *Variables:* median home price or home value index, median household income, rent prices (if available), location (UC campus region, county, or metro area), time (year or month), and economic indicators such as interest rates or inflation. *Observations:* region-time units (e.g., UC-adjacent counties or metro areas) over multiple years. We would want at least 10 to 20 years to capture long-term trends. *Collection and storage:* publicly released aggregate statistics from official and commercial sources, stored as structured tabular data (e.g., CSV) with rows as region-time observations, joined by geographic and time identifiers.
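
The join by geographic and time identifiers described above can be sketched in plain Python as an inner join on a `(region, year)` key. The field names and rows below are hypothetical placeholders; the real tables would come from the sources listed in the next subsection.

```python
# Placeholder rows for illustration only, not real data.
price_rows = [
    {"region": "Region A", "year": 2019, "median_home_price": 700_000},
    {"region": "Region A", "year": 2020, "median_home_price": 750_000},
    {"region": "Region B", "year": 2020, "median_home_price": 400_000},
]
income_rows = [
    {"region": "Region A", "year": 2019, "median_household_income": 95_000},
    {"region": "Region A", "year": 2020, "median_household_income": 100_000},
]


def join_on_region_year(left_rows, right_rows):
    """Inner-join two lists of dict rows on the (region, year) key."""
    right_by_key = {(r["region"], r["year"]): r for r in right_rows}
    joined = []
    for row in left_rows:
        key = (row["region"], row["year"])
        if key in right_by_key:
            # Merge the two rows into one region-time observation.
            joined.append({**row, **right_by_key[key]})
    return joined


panel = join_on_region_year(price_rows, income_rows)
for row in panel:
    print(row)
```

Because this is an inner join, observations without a match in both tables (here, Region B in 2020) are dropped, so the number of dropped rows would need to be reported as part of data cleaning.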

**2. Real datasets**

- **Zillow housing data.** Zillow Research provides housing market data and reports (e.g., home value indices, median list/sale prices) at metro, county, and sometimes ZIP level. Data are available at [Zillow Research](https://www.zillow.com/research/data/). No permission is required for research use of their public indices. Key variables include median home value, price indices over time, and optionally rent indices. We would use these to measure price levels and growth near UC campuses (e.g., Irvine, La Jolla, Berkeley, Los Angeles).

- **U.S. Census / American Community Survey (ACS).** The Census Bureau's ACS provides median household income, demographics, and housing characteristics at county and place level. Data are available at [census.gov](https://www.census.gov/programs-surveys/acs). Access is free and open. We would use median household income and possibly rent burden or housing cost measures to compare with Zillow price data and assess affordability across UC regions and over time.

- **Federal Reserve Economic Data (FRED).** FRED offers interest rates, inflation, and other macroeconomic series at [fred.stlouisfed.org](https://fred.stlouisfed.org/). No application is needed for standard use. We would use these as contextual variables (e.g., mortgage rates, CPI) to interpret price and affordability trends rather than as primary outcome variables.
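
Because the sources above may report at different frequencies (e.g., monthly price series versus annual income estimates), one preprocessing step would be to reshape and aggregate prices to yearly values. The sketch below assumes a wide layout with one row per region and one column per month. The column names and values are hypothetical placeholders, and the actual file layout should be checked against the downloaded data.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical wide-format extract: one row per region, one column per month.
# Column names and values are placeholders, not an actual downloaded file.
WIDE_CSV = """RegionName,2020-01-31,2020-02-29,2021-01-31,2021-02-28
Region A,700000,710000,760000,780000
Region B,400000,,420000,430000
"""


def wide_to_annual(csv_text, id_column="RegionName"):
    """Reshape wide monthly columns to {(region, year): annual mean value}."""
    monthly = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        region = row[id_column]
        for column, value in row.items():
            if column == id_column or value in ("", None):
                continue  # skip the identifier column and missing months
            year = int(column[:4])  # assumes YYYY-MM-DD style column names
            monthly[(region, year)].append(float(value))
    return {key: mean(values) for key, values in monthly.items()}


annual = wide_to_annual(WIDE_CSV)
for (region, year), value in sorted(annual.items()):
    print(region, year, round(value))
```

Averaging only the months that are present is one possible choice. A stricter alternative would be to drop any region-year with missing months.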

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that you can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Not applicable. We use only aggregated, publicly available data (Zillow, Census/ACS, FRED). There are no human subjects.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> Housing data from sources such as Zillow may reflect market activity more accurately in higher-income or urban areas, potentially underrepresenting lower-income or rural communities. Census median income may mask within-region inequality. We will state these limitations clearly and examine differences across regions to identify potential biases.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> The proposed datasets are aggregated and publicly available. They do not contain personally identifiable information. We will not collect or use individual-level data.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We do not collect protected group status. Our results are regional aggregates (e.g., county-level price and income). We will frame findings as regional trends and avoid claims about any demographic group so that results are not used to support biased downstream decisions.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Data are public aggregates (Zillow, Census, FRED) stored in our project repo. We will rely on standard repository access controls (e.g., only team members have edit access). No sensitive or PII data are stored.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> Not applicable. We do not collect or store any personal or individually identifiable information.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> We may keep downloaded public datasets for the duration of the course. After the project we can delete local copies if desired. There is no formal retention requirement for this coursework.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> We do not have direct stakeholder engagement. We will state limitations clearly (e.g., aggregate data do not capture within-region variation or lived experience) and avoid claiming to represent any community. We will present results as descriptive trends that readers can interpret in context.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We will examine differences across regions to identify potential biases and discuss how aggregation may obscure within-region inequality. We will avoid normative claims that could reinforce harmful stereotypes about specific communities and emphasize that findings reflect regional trends rather than individual experiences.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> We will design visualizations and summaries to represent the underlying data honestly, and we will communicate limitations (e.g., coverage of Zillow/Census, use of medians) so that readers can interpret results appropriately.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> Our data are aggregate only (e.g., median income by county). We will not use or display any personally identifiable information in the analysis or report.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> We will document data sources, cleaning steps, and code (e.g., in the repo and notebook) so that the analysis can be reproduced and checked later.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> We are not building a predictive or decision-making model. We do descriptive analysis (e.g., trends, comparisons). There are no model outputs that could discriminate against individuals or groups.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> Not applicable. We have no model that makes decisions or predictions about people. Our outputs are regional summaries (e.g., price and income by area).

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We use descriptive metrics (e.g., median price, median income, affordability ratios). We are not optimizing a model. We will report what we use and note limitations (e.g., medians can hide inequality within regions).

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> We have no decision-making model. Our analysis is transparent (e.g., we show data sources, steps, and visualizations) so readers can see how we reach our conclusions.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> We will state limitations in the report (e.g., data coverage, use of medians, no causal claims). We do not have a deployed model or formal stakeholders beyond the course.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> Not applicable. We are not deploying a model. This is a one-time course project (report and possibly presentation). There is no production system to monitor.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Not applicable. No model is deployed and no users receive decisions from our work. We will present findings with clear limitations so that misuse is less likely.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Not applicable. We have no model in production. Deliverables are a report and possibly a presentation for the course only.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> We are not deploying a model. We will write the report so that results and limitations are clear, reducing the chance that our descriptive findings are misused (e.g., as justification for stereotyping a region). We have no ongoing deployment to monitor.


## Team Expectations 

All team members have read the [COGS108 Team Policies](https://github.com/COGS108/Projects/blob/master/COGS108_TeamPolicies.md) and agree to the following expectations.

* **Communication.** Communicate regularly through a shared platform (e.g., Discord or Slack) and attend scheduled meetings.
* **Contribution.** Each member is expected to contribute equitably to data collection, analysis, and writing.
* **Conflict resolution.** If conflicts arise, we agree to address them respectfully and promptly through group discussion.
* **Commitment.** By submitting this proposal, each member affirms that they have read the COGS 108 Team Policies and intend to meet these expectations.

## Project Timeline Proposal

We do not anticipate needing specialized methods beyond those covered in COGS 108. Standard data analysis, visualization, and basic statistical techniques should be sufficient for this project.

| Week | Completed Before | Discuss / Deliverables |
|------|------------------|-------------------------|
| 4 | Finalize research question and identify datasets (Zillow, Census/ACS, FRED) | Confirm data access and align on UC regions (e.g., Irvine, La Jolla, Berkeley, LA) |
| 5 | Data cleaning and preprocessing | Joined region-time tables, consistent geography and time range |
| 6 | Exploratory data analysis and visualization | Price and income trend plots, cross-region price and income comparison visuals |
| 7 | Statistical analysis and interpretation | Compare price vs. income growth, affordability by region |
| 8 | Draft results and ethics discussion | Integrate limitations and ethics, first full draft |
| 9 | Finalize analysis and visualizations | Polished figures and tables, finalize methods and results |
| 10 | Complete final report and presentation | Turn in final project and group surveys |
