# COGS 108 - Project Proposal

## Authors

* **[Jasmine Gao]**: Background research, Conceptualization, Writing
* **[Ella Guo]**: Hypothesis, Data, Writing
* **[Qi Zhang]**: Ethics, Writing
* **[Shuheng Cao]**: Project Timeline Proposal, Writing
* **[Shujia Chen]**: Project administration, Review & editing

## Research Question

Among U.S. adults aged 25–54, how is household broadband internet subscription associated with labor-market outcomes (employment status and annual earnings), and does this association differ between rural and urban residents after controlling for demographics and human-capital factors?

**Dependent Variables:**

- Employment (binary: employed vs not employed)

- Annual earnings/wage income (continuous)

**Key Explanatory/Independent Variables (IV):**

- Household broadband subscription (binary)

- Rural vs urban status (binary, or a metro-status proxy, depending on the geography available in the data)

- Interaction: broadband × rural/urban

**Control Variables:**

- Age, sex, race/ethnicity

- Education level

- Marital status, number of children in household

- Immigration status / English proficiency

- State fixed effects (or region fixed effects), year fixed effects (if multi-year)
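
The variables listed above suggest a regression with an interaction term. One possible specification is sketched below; the notation is our own, and the functional form (for example, a linear model with log earnings) is an assumption rather than a final modeling decision:

$$
Y_{i} = \beta_0 + \beta_1\,\text{Broadband}_{i} + \beta_2\,\text{Rural}_{i} + \beta_3\,(\text{Broadband}_{i} \times \text{Rural}_{i}) + X_{i}'\gamma + \delta_{g(i)} + \tau_{t(i)} + \varepsilon_{i}
$$

Here $Y_i$ is either the employment indicator or (log) annual earnings for individual $i$, $X_i$ collects the control variables, $\delta_{g(i)}$ is a geographic fixed effect, and $\tau_{t(i)}$ is a year fixed effect (only if multiple years are used). Under this setup, $\beta_1$ is the broadband association for urban residents, $\beta_1 + \beta_3$ is the association for rural residents, and $\beta_3$ measures the rural–urban difference.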

## Background and Prior Work

High-speed internet access has become a key input to modern labor markets: it enables online job search and matching, remote work, access to training, and participation in digitally mediated services (e.g., gig work, online freelancing). Yet access and adoption are uneven across the United States, creating a persistent “digital divide” that often maps onto geography (rural vs. urban), income, and education. For this project, we ask whether household broadband subscription is associated with employment and earnings for U.S. adults ages 25–54, and whether the relationship differs in rural versus urban contexts after controlling for demographic and human-capital factors.

A large body of prior research suggests that broadband can affect labor outcomes, but the direction and magnitude may depend on worker characteristics and local context. Using U.S. data from 1999–2007, Atasoy (2013) finds that broadband expansion is associated with improved labor-market outcomes, consistent with the idea that connectivity reduces frictions in job search and supports labor-force participation.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

More causal work focusing on specific subpopulations also finds meaningful effects: Dettling (2017) uses an instrumental-variables strategy based on supply-side constraints to broadband access and reports that exogenous increases in high-speed internet use raise labor force participation among married women, highlighting a plausible mechanism where internet access expands feasible work arrangements and job search at home.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Beyond participation, broadband can also influence who benefits in the labor market. Akerman, Gaarder, and Mogstad (2015) provide evidence of “skill complementarity,” where broadband adoption improves outcomes for more-skilled workers and can disadvantage less-skilled workers—suggesting that broadband may amplify existing inequalities unless paired with complementary skills and opportunities.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Recent research also emphasizes that the broadband–employment relationship may look different in rural areas, where infrastructure, adoption, and job opportunities differ from urban labor markets. For example, Isley and Low (2022) study rural U.S. counties during the early COVID-19 period and find that both broadband availability and adoption are related to employment rates, using a two-stage least squares approach to address endogeneity.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

This motivates our rural–urban heterogeneity component: if broadband supports telework, job search, and access to wider labor markets, the payoff may be larger where geographic isolation is a bigger constraint; alternatively, if high-quality jobs and complementary skills are concentrated in urban areas, the earnings gains from broadband adoption may be larger in cities. Together, these studies justify examining (1) the overall association between household broadband subscription and employment/earnings and (2) whether that association differs across rural versus urban settings after accounting for education, age, race/ethnicity, household composition, and place-based factors.

Finally, prior work highlights an important methodological challenge: broadband adoption is not randomly assigned. Households with broadband may differ systematically in income, education, occupation, and local labor conditions, all of which also affect employment and earnings. The literature therefore often emphasizes careful adjustment strategies (e.g., rich controls, fixed effects, quasi-experimental instruments, or matching) to reduce confounding and clarify interpretation.<a name="cite_ref-2b"></a>[<sup>2</sup>](#cite_note-2) In this project, using public microdata, we will treat results primarily as associational unless we can justify a credible identification strategy.

### References

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Atasoy, H. (2013). The effects of broadband internet expansion on labor market outcomes. *Industrial and Labor Relations Review*, 66(2), 315–345. https://journals.sagepub.com/doi/epdf/10.1177/001979391306600202
2. <a name="cite_note-2"></a> [^](#cite_ref-2) [^](#cite_ref-2b) Dettling, L. J. (2017). Broadband in the Labor Market: The Impact of Residential High-Speed Internet on Married Women’s Labor Force Participation. *ILR Review*, 70(2), 451–482. https://journals.sagepub.com/doi/epub/10.1177/0019793916644721
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Akerman, A., Gaarder, I., & Mogstad, M. (2015). The Skill Complementarity of Broadband Internet. *The Quarterly Journal of Economics*, 130(4), 1781–1824. https://doi.org/10.1093/qje/qjv028
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Isley, C., & Low, S. A. (2022). Broadband adoption and availability: Impacts on rural employment during COVID-19. *Telecommunications Policy*, 46(7), 102310. https://doi.org/10.1016/j.telpol.2022.102310

## Hypothesis


We hypothesize that household broadband subscription is positively associated with employment status and annual earnings among U.S. adults aged 25–54. Individuals with household broadband subscriptions are expected to have better labor-market outcomes, as internet connectivity facilitates job search, remote work, and access to employment opportunities.
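
As a rough illustration of how the sign of this association could be checked, the sketch below fits an ordinary least squares model with a broadband × rural interaction on synthetic data. Everything in it is an assumption made for illustration: the variable names, the coefficient values in `true_beta`, and the use of log earnings as the outcome. It does not use or describe our actual data, and a binary outcome such as employment would call for a different model (e.g., logistic regression).

```python
import numpy as np

rng = np.random.default_rng(108)  # fixed seed so the sketch is reproducible
n = 20_000

# Synthetic stand-ins for the real survey variables (all hypothetical).
broadband = rng.integers(0, 2, n)      # 1 = household has broadband
rural = rng.integers(0, 2, n)          # 1 = rural resident
educ_years = rng.integers(10, 21, n)   # years of schooling, 10-20

# Assumed data-generating process, chosen only for illustration.
true_beta = {
    "const": 9.0,
    "broadband": 0.20,
    "rural": -0.10,
    "broadband_x_rural": 0.05,
    "educ_years": 0.08,
}
log_earnings = (
    true_beta["const"]
    + true_beta["broadband"] * broadband
    + true_beta["rural"] * rural
    + true_beta["broadband_x_rural"] * broadband * rural
    + true_beta["educ_years"] * educ_years
    + rng.normal(0, 0.5, n)  # unexplained variation
)

# Design matrix: intercept, main effects, interaction, and one control.
X = np.column_stack([np.ones(n), broadband, rural, broadband * rural, educ_years])

# Least-squares fit; the columns of X follow the key order of true_beta.
coef, *_ = np.linalg.lstsq(X, log_earnings, rcond=None)
estimates = dict(zip(true_beta, coef))

for name, value in estimates.items():
    print(f"{name:>18}: {value: .3f}")
```

A positive `broadband` estimate would be consistent with the hypothesis for urban residents, and the `broadband_x_rural` estimate indicates how much the association differs for rural residents. With real survey data, we would also need to consider survey design and the confounding issues discussed in the background section.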


## Data

### Ideal Dataset

The ideal dataset would include individual- and household-level data on broadband internet access and labor-market outcomes among U.S. adults aged 25–54.

Key variables would include employment status, annual earnings, and household broadband subscription status. Additional variables would include rural versus urban residence, education level, age, sex, race, marital status, number of children, immigration status, and English proficiency.

We would ideally want data from tens of thousands of individuals across multiple states to ensure national representativeness. Data could be collected through large-scale government surveys or census-based labor and technology access surveys.

The dataset would be stored in structured tabular format, with each row representing an individual and columns representing demographic, economic, and broadband access variables.
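
To make the tabular layout described above concrete, here is a minimal sketch of what a few rows might look like and how the 25–54 age restriction could be applied. The column names and values are hypothetical placeholders; real survey files use their own variable codes.

```python
import pandas as pd

# Hypothetical column names and toy values, one row per individual.
people = pd.DataFrame({
    "age": [23, 31, 47, 54, 60],
    "employed": [1, 1, 0, 1, 0],
    "annual_earnings": [18000.0, 52000.0, 0.0, 61000.0, 0.0],
    "broadband": [1, 1, 0, 1, 0],
    "rural": [0, 0, 1, 1, 1],
    "education": ["HS", "BA", "HS", "MA", "HS"],
})

# Keep the prime-age sample from the research question (25-54, inclusive).
prime_age = people[people["age"].between(25, 54)].copy()

# Interaction term used to examine rural-urban heterogeneity.
prime_age["broadband_x_rural"] = prime_age["broadband"] * prime_age["rural"]

print(prime_age)
```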

### Datasets

**Dataset Name:** Current Population Survey (CPS)

**Link:** https://www.census.gov/data/datasets/time-series/demo/cps/cps-basic.html

**Number of observations:** Approximately 60,000 households per month.

This dataset is publicly available through the U.S. Census Bureau website, where monthly CPS files can be directly downloaded in formats such as CSV. No special permission or application is required to access the data.

The CPS contains detailed labor market information including employment status, wage income, hours worked, education level, and demographic characteristics. Some supplements also include information related to technology and internet access. These variables can be used to examine how broadband access is associated with employment and earnings outcomes.

**Dataset Name:** Pew Internet & Technology Survey

**Link:** https://www.pewresearch.org/internet/datasets/

**Number of observations:** Typically between 1,000 and 5,000 respondents per survey.

This dataset is publicly available through the Pew Research Center website, where survey data can be downloaded directly after agreeing to terms of use.

The dataset includes variables such as broadband subscription, internet access, employment status, income level, and demographic characteristics.

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?
 - [ ] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
 - [ ] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
 - [ ] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
 - [ ] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
 - [ ] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

### C. Analysis
 - [ ] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?
 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?
 - [ ] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?
 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?
 - [ ] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

### D. Modeling
 - [ ] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
 - [ ] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [ ] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?
 - [ ] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [ ] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

### E. Deployment
 - [ ] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
 - [ ] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
 - [ ] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [ ] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?


## Ethics 

This project raises two main ethical considerations: bias in the data and the use of sensitive information.

First, selection bias is an important concern. Household broadband access is not randomly assigned: people who have broadband often differ from those who do not in ways that also affect employment and earnings, such as education level, income, job type, and place of residence. These differences may be especially pronounced between rural and urban areas. Although we control for many demographic and human-capital variables, some unobserved differences may remain, so we interpret our results as associations rather than causal effects.

Second, the use of sensitive socioeconomic variables raises privacy concerns. Variables such as employment status, earnings, immigration status, and English proficiency can be sensitive and potentially stigmatizing. While the CPS and Pew datasets are publicly available and anonymized, we will avoid reporting small subgroups or detailed geographic information that could risk harm or misinterpretation. All results will be presented in aggregated form, and the analysis will focus on overall patterns rather than individual outcomes.
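
One way to enforce the commitment to avoid reporting small subgroups is to suppress any summary cell that falls below a minimum count. The sketch below shows the idea on toy data; the threshold `MIN_CELL_SIZE = 30` and the column names are our own assumptions, not a rule set by the data providers.

```python
import pandas as pd

MIN_CELL_SIZE = 30  # assumed threshold, chosen only for illustration

# Toy records standing in for the analysis sample (hypothetical values).
sample = pd.DataFrame({
    "rural": [0] * 40 + [1] * 35 + [1] * 5,
    "broadband": [1] * 40 + [1] * 35 + [0] * 5,
    "employed": [1] * 30 + [0] * 10 + [1] * 25 + [0] * 10 + [1] * 2 + [0] * 3,
})

# Aggregate to group-level counts and employment rates.
summary = (
    sample.groupby(["rural", "broadband"])["employed"]
    .agg(n="count", employment_rate="mean")
    .reset_index()
)

# Replace the rate with NaN for cells too small to report.
summary["employment_rate"] = summary["employment_rate"].where(
    summary["n"] >= MIN_CELL_SIZE
)

print(summary)
```

In this toy example, the rural group without broadband has only 5 records, so its employment rate is left blank while the two larger groups are reported.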

## Team Expectations 

Success for our project will be defined by both the quality of the final deliverables and the effectiveness of our team collaboration throughout the quarter. Specifically, our project will be considered successful if it meets the following criteria:

* **Clear and Well-Defined Research Question:** The project addresses a clearly articulated, well-motivated data science research question that is appropriate in scope and grounded in relevant background literature.
  
* **Sound Data Practices:** The dataset(s) used are appropriate for answering the research question, ethically sourced, and clearly documented. Data wrangling, cleaning, and preprocessing steps are transparent, reproducible, and well-explained.

* **Appropriate and Justified Analysis:** The analytical methods and visualizations are suitable for the research question, correctly implemented, and clearly interpreted. Results are discussed thoughtfully, including limitations and potential sources of bias.


* **Clear Communication and Documentation:** The final notebook and written components are well-organized, clearly written, and understandable to a reader outside the group. Code is readable, commented where appropriate, and follows good data science practices.


* **Equitable Team Contribution:** All team members contribute meaningfully to multiple aspects of the project, including ideation, analysis, writing, and revision. Responsibilities are distributed fairly, and progress is communicated regularly.


* **Effective Team Communication and Collaboration:** The team maintains respectful, timely, and transparent communication, addresses conflicts constructively, and follows agreed-upon team expectations as outlined in the COGS108 Team Policies.


* **On-Time Completion of Milestones:** Intermediate milestones and the final project are completed on schedule, allowing sufficient time for review, revision, and quality control before submission.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2/4 | 6 PM | Finalize project topic and research question; confirm group roles and communication plan | Review finalized research question, hypotheses, and expectations; confirm dataset selection |
| 2/7 | 6 PM | Collect and import dataset; review dataset structure and variables | Discuss data wrangling plan and potential challenges; assign wrangling and EDA tasks |
| 2/11 | 6 PM | Complete data cleaning and preprocessing; begin exploratory data analysis (EDA) | Review EDA results; refine analysis plan and decide on statistical methods |
| 2/15 | 6 PM | Complete main analyses; generate initial visualizations and summary statistics | Discuss interpretation of results; identify missing analyses or improvements |
| 2/20 | 6 PM | Revise analyses; draft results and discussion sections | Review full analysis progress; plan final report structure |
| 2/28 | 6 PM | Draft full project report; finalize figures and tables | Peer review full draft; address clarity, formatting, and rubric alignment |
| 3/8 | 6 PM | Revise final report; complete ethics and limitations sections | Final check for completeness and rubric requirements |
| 3/15 | Before 11:59 PM | N/A | Submit Final Project and complete Group Project Surveys |