# COGS 108 - Data Checkpoint

## Authors

* **[Jasmine Gao]**: Background research, Conceptualization, Writing
* **[Ella Guo]**: Hypothesis, Data, Writing
* **[Qi Zhang]**: Ethics, Writing
* **[Shuheng Cao]**: Project Timeline Proposal, Writing
* **[Shujia Chen]**: Project administration, review & editing

## Research Question

Among U.S. adults aged 25–54, how is household broadband internet subscription associated with labor-market outcomes (employment status and annual earnings), and does this association differ between rural and urban residents after controlling for demographics and human-capital factors?

**Dependent Variables:**

- Employment (binary: employed vs not employed)

- Annual earnings/wage income (continuous)

**Key Explanatory/Independent Variables (IVs):**

- Household broadband subscription (binary)

- Rural vs. urban status (binary indicator, or a metro-status proxy, depending on the geographic detail available)

- Interaction: broadband × rural/urban

**Control Variables:**

- Age, sex, race/ethnicity

- Education level

- Marital status, number of children in household

- Immigration status / English proficiency

- Geographic fixed effects (e.g., region fixed effects), year fixed effects (if multi-year data are used)
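
The variables listed above can be combined into a single regression specification. The equation below is one possible sketch, not a finalized model: the linear form, the symbol names, and the subscripts $g(i)$ (geographic unit of person $i$) and $t(i)$ (survey year of person $i$) are our own illustrative assumptions.

$$
Y_{i} = \beta_0 + \beta_1\,\text{Broadband}_{i} + \beta_2\,\text{Rural}_{i} + \beta_3\,(\text{Broadband}_{i} \times \text{Rural}_{i}) + X_{i}^{\top}\gamma + \delta_{g(i)} + \tau_{t(i)} + \varepsilon_{i}
$$

Here $Y_i$ is either the employment indicator or annual earnings, $X_i$ collects the control variables, and $\delta_{g(i)}$ and $\tau_{t(i)}$ are the geographic and year fixed effects. Under this sketch, $\beta_1$ is the broadband association for urban residents, $\beta_1 + \beta_3$ is the association for rural residents, and $\beta_3$ captures the rural-urban difference. For the binary employment outcome, the same right-hand side could instead enter a logistic model.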

## Background and Prior Work

High-speed internet access has become a key input to modern labor markets: it enables online job search and matching, remote work, access to training, and participation in digitally mediated services (e.g., gig work, online freelancing). Yet access and adoption are uneven across the United States, creating a persistent “digital divide” that often maps onto geography (rural vs. urban), income, and education. For this project, we ask whether household broadband subscription is associated with employment and earnings for U.S. adults ages 25–54, and whether the relationship differs in rural versus urban contexts after controlling for demographic and human-capital factors.

A large body of prior research suggests that broadband can affect labor outcomes, but the direction and magnitude may depend on worker characteristics and local context. Using U.S. data from 1999–2007, Atasoy (2013) finds that broadband expansion is associated with improved labor-market outcomes, consistent with the idea that connectivity reduces frictions in job search and supports labor-force participation.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1)

More causal work focusing on specific subpopulations also finds meaningful effects: Dettling (2017) uses an instrumental-variables strategy based on supply-side constraints to broadband access and reports that exogenous increases in high-speed internet use raise labor force participation among married women, highlighting a plausible mechanism where internet access expands feasible work arrangements and job search at home.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2)

Beyond participation, broadband can also influence who benefits in the labor market. Akerman, Gaarder, and Mogstad (2015) provide evidence of “skill complementarity,” where broadband adoption improves outcomes for more-skilled workers and can disadvantage less-skilled workers—suggesting that broadband may amplify existing inequalities unless paired with complementary skills and opportunities.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3)

Recent research also emphasizes that the broadband–employment relationship may look different in rural areas, where infrastructure, adoption, and job opportunities differ from urban labor markets. For example, Isley (2022) studies rural U.S. counties during the early COVID-19 period and finds that both broadband availability and adoption are related to employment rates, using a two-stage least squares approach to address endogeneity.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

This motivates our rural–urban heterogeneity component: if broadband supports telework, job search, and access to wider labor markets, the payoff may be larger where geographic isolation is a bigger constraint; alternatively, if high-quality jobs and complementary skills are concentrated in urban areas, the earnings gains from broadband adoption may be larger in cities. Together, these studies justify examining (1) the overall association between household broadband subscription and employment/earnings and (2) whether that association differs across rural versus urban settings after accounting for education, age, race/ethnicity, household composition, and place-based factors.

Finally, prior work highlights an important methodological challenge: broadband adoption is not randomly assigned. Households with broadband may differ systematically in income, education, occupation, and local labor conditions, all of which also affect employment and earnings. The literature therefore often emphasizes careful adjustment strategies (e.g., rich controls, fixed effects, quasi-experimental instruments, or matching) to reduce confounding and clarify interpretation.<a name="cite_ref-2b"></a>[<sup>2</sup>](#cite_note-2) In this project, using public microdata, we will treat results primarily as associational unless we can justify a credible identification strategy.
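
To make the confounding concern concrete, the following is a minimal simulation sketch, not an analysis of real data. All numbers are assumptions chosen for illustration: a single confounder (education), the adoption rates, a "true" broadband effect of 2 (in arbitrary earnings units), and the noise level.

```python
import random
import statistics


def simulate(n=20000, true_effect=2.0, seed=108):
    """Simulate earnings where education drives both broadband and earnings."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        educated = rng.random() < 0.5
        # Assumed adoption rates: 80% if educated, 30% otherwise.
        broadband = rng.random() < (0.8 if educated else 0.3)
        earnings = 30 + 10 * educated + true_effect * broadband + rng.gauss(0, 5)
        rows.append((educated, broadband, earnings))
    return rows


def mean_gap(rows):
    """Mean earnings of broadband households minus non-broadband households."""
    with_bb = [y for _, b, y in rows if b]
    without_bb = [y for _, b, y in rows if not b]
    return statistics.mean(with_bb) - statistics.mean(without_bb)


rows = simulate()

# Naive comparison ignores education entirely.
naive_gap = mean_gap(rows)

# Adjusted comparison: compute the gap within each education group, then average.
adjusted_gap = statistics.mean(
    [mean_gap([r for r in rows if r[0] == level]) for level in (True, False)]
)

print(f"naive gap: {naive_gap:.2f}, adjusted gap: {adjusted_gap:.2f}")
```

In this toy setup the naive gap is much larger than the simulated effect because educated households are both more likely to subscribe and higher earning, while the within-group comparison lands close to the assumed effect. Real data will contain confounders we cannot observe, which is why stratifying or adding controls only reduces, and does not eliminate, this bias.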

### References

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Atasoy, H. (2013). The effects of broadband internet expansion on labor market outcomes. *Industrial and Labor Relations Review*, 66(2), 315-345. https://journals.sagepub.com/doi/epdf/10.1177/001979391306600202
2. <a name="cite_note-2"></a> [^](#cite_ref-2) [^](#cite_ref-2b) Dettling, L. J. (2017). Broadband in the Labor Market: The Impact of Internet Speed on Job Search and Married Women’s Labor Supply. *Industrial and Labor Relations Review*, 70(2), 451-482. https://journals.sagepub.com/doi/epub/10.1177/0019793916644721
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Akerman, A., Gaarder, I., & Mogstad, M. (2015). The Skill Complementarity of Broadband Internet. *The Quarterly Journal of Economics*, 130(4), 1781–1824. https://doi.org/10.1093/qje/qjv028
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Isley, C., & Low, S. A. (2022). Broadband adoption and availability: Impacts on rural employment during COVID-19. *Telecommunications Policy*, 46(7), 102310. https://doi.org/10.1016/j.telpol.2022.102310

## Hypothesis


We hypothesize that household broadband subscription is positively associated with employment status and annual earnings among U.S. adults aged 25–54. Individuals with household broadband subscriptions are expected to have better labor-market outcomes, as internet connectivity facilitates job search, remote work, and access to employment opportunities.

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

In [7]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [13]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean. For example "Fasting blood glucose in units of mg glucose per deciliter of blood. Normal values for healthy individuals range from 70 to 100 mg/dL. Values 100-125 are prediabetic and values >125 mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours. If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and whether it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean. You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
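
The wrangling steps above (load, check size, count missing values, flag outliers, clean) can be sketched as follows. This is a hedged illustration only: the inline CSV stands in for a file under `data/00-raw/`, and every column name, value, and the `EARNINGS_CAP` threshold is hypothetical rather than taken from our actual dataset. It uses only the standard library so it runs anywhere; the same steps map onto pandas calls such as `DataFrame.isna()` and `DataFrame.dropna()`.

```python
import csv
import io

# Hypothetical raw extract; column names and values are illustrative only.
RAW_CSV = """person_id,age,broadband,employed,earnings
1,34,1,1,52000
2,47,0,1,
3,29,1,0,0
4,61,1,1,48000
5,41,,1,9999999
"""

# Load: one dict per observation, one key per variable (already tidy).
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
print(f"{len(rows)} observations, {len(rows[0])} variables")

# Missingness: count empty values in each column.
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
print("missing per column:", missing)

# Restrict to the prime-age sample (25-54) named in the research question.
sample = [r for r in rows if 25 <= int(r["age"]) <= 54]

# Outliers: flag implausible earnings rather than silently dropping them.
EARNINGS_CAP = 1_000_000  # assumed plausibility threshold
flagged = [
    r["person_id"]
    for r in sample
    if r["earnings"] and float(r["earnings"]) > EARNINGS_CAP
]
print("flagged person_ids:", flagged)

# Clean: listwise deletion on the analysis variables, excluding flagged rows.
analysis_vars = ("broadband", "employed", "earnings")
clean = [
    r
    for r in sample
    if all(r[c] != "" for c in analysis_vars) and r["person_id"] not in flagged
]
print("clean person_ids:", [r["person_id"] for r in clean])
```

In the real notebook, the final `clean` rows would be written to `data/02-processed`, and the choice of listwise deletion would need to be justified by checking whether missingness is related to broadband status or geography.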


In [11]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [12]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

This project involves two main ethical considerations, related to data bias and the use of sensitive information. First, selection bias is an important concern. Household broadband access is not randomly assigned. People who have broadband often differ from those who do not in ways that also affect employment and earnings, such as education level, income, job type, and where they live. These differences may be especially strong between rural and urban areas. Although we control for many demographic and human-capital variables, some unobserved differences may remain. Because of this, we interpret our results as associations rather than causal effects.

Second, the use of sensitive socioeconomic variables raises privacy and ethical considerations. Variables such as employment status, earnings, immigration status, and English proficiency can be sensitive and potentially stigmatizing. While the Pew datasets are publicly available and anonymized, we avoid reporting small subgroups or detailed geographic information that could risk harm or misinterpretation. All results are presented in aggregated form, and the analysis focuses on overall patterns rather than individual outcomes.
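
One way to put the "avoid reporting small subgroups" commitment into practice is to suppress any cell of a summary table whose count falls below a minimum size. The sketch below is illustrative only: the function name `suppress_small_cells`, the threshold of 10, and the records are our own assumptions, not rules set by the data provider.

```python
from collections import Counter

MIN_CELL_SIZE = 10  # assumed disclosure threshold, chosen for illustration


def suppress_small_cells(records, keys, min_size=MIN_CELL_SIZE):
    """Count records per subgroup and mask counts below min_size with None."""
    counts = Counter(tuple(rec[k] for k in keys) for rec in records)
    return {group: (n if n >= min_size else None) for group, n in counts.items()}


# Hypothetical respondent records, not real survey data.
records = (
    [{"area": "urban", "broadband": "yes"}] * 40
    + [{"area": "rural", "broadband": "yes"}] * 12
    + [{"area": "rural", "broadband": "no"}] * 3
)

table = suppress_small_cells(records, keys=("area", "broadband"))
for group, n in table.items():
    print(group, n if n is not None else f"suppressed (n < {MIN_CELL_SIZE})")
```

Applying a check like this before any table or figure is exported keeps very small groups, such as rural respondents with a rare combination of characteristics, out of the reported results.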

## Team Expectations 

Success for our project will be defined by both the quality of the final deliverables and the effectiveness of our team collaboration throughout the quarter. Specifically, our project will be considered successful if it meets the following criteria:

* **Clear and Well-Defined Research Question:** The project addresses a clearly articulated, well-motivated data science research question that is appropriate in scope and grounded in relevant background literature.
  
* **Sound Data Practices:** The dataset(s) used are appropriate for answering the research question, ethically sourced, and clearly documented. Data wrangling, cleaning, and preprocessing steps are transparent, reproducible, and well-explained.

* **Appropriate and Justified Analysis:** The analytical methods and visualizations are suitable for the research question, correctly implemented, and clearly interpreted. Results are discussed thoughtfully, including limitations and potential sources of bias.


* **Clear Communication and Documentation:** The final notebook and written components are well-organized, clearly written, and understandable to a reader outside the group. Code is readable, commented where appropriate, and follows good data science practices.


* **Equitable Team Contribution:** All team members contribute meaningfully to multiple aspects of the project, including ideation, analysis, writing, and revision. Responsibilities are distributed fairly, and progress is communicated regularly.


* **Effective Team Communication and Collaboration:** The team maintains respectful, timely, and transparent communication, addresses conflicts constructively, and follows agreed-upon team expectations as outlined in the COGS108 Team Policies.


* **On-Time Completion of Milestones:** Intermediate milestones and the final project are completed on schedule, allowing sufficient time for review, revision, and quality control before submission.

## Project Timeline Proposal

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
|---|---|---|---|
| 2/4 | 6 PM | Finalize project topic and research question; confirm group roles and communication plan | Review finalized research question, hypotheses, and expectations; confirm dataset selection |
| 2/7 | 6 PM | Collect and import dataset; review dataset structure and variables | Discuss data wrangling plan and potential challenges; assign wrangling and EDA tasks |
| 2/11 | 6 PM | Complete data cleaning and preprocessing; begin exploratory data analysis (EDA) | Review EDA results; refine analysis plan and decide on statistical methods |
| 2/15 | 6 PM | Complete main analyses; generate initial visualizations and summary statistics | Discuss interpretation of results; identify missing analyses or improvements |
| 2/20 | 6 PM | Revise analyses; draft results and discussion sections | Review full analysis progress; plan final report structure |
| 2/28 | 6 PM | Draft full project report; finalize figures and tables | Peer review full draft; address clarity, formatting, and rubric alignment |
| 3/8 | 6 PM | Revise final report; complete ethics and limitations sections | Final check for completeness and rubric requirements |
| 3/15 | Before 11:59 PM | N/A | Submit Final Project and complete Group Project Surveys |