**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

**Issues Fixed**
- Research question hypothesis added for each factor
- Added further explanation for data by specifying the size and relevant variables of our dataset
- Elaborated on our ethics by expanding on unexpected consequences, such as the reinforcement of stereotypes, and how we could work to prevent them
- Updated our hypothesis to expand on the rationale and describe what results we hope to see from our dataset in more detail and fixed formatting
- Added sources from peer-reviewed scientific journals relating to our topic to the prior work and background section and used APA style citation

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Nessa Pantfoerder
- Lillian Ho
- Chi Cheng
- Romir Kant
- Xiangjun Fu

# Research Question

Is there a relationship between demographic characteristics, such as education level, work class, occupation, and marital status, and an individual's likelihood of earning an annual income over $50,000 in the United States? What specific factors have the greatest impact on one's earnings?

We start with the premise that educational attainment plays a significant role, anticipating that individuals with higher education levels tend to earn more, suggesting a positive correlation. 

For work class, our focus narrows to the two predominant groups in our dataset: private sector and government employees. Here, we hypothesize that those in the private sector generally outearn their government counterparts.

Occupation-wise, we divide our dataset into clerical and non-clerical roles, with an underlying assumption that clerical, or office-based, workers have higher average incomes compared to those in blue-collar jobs. 

Additionally, we consider marital status a potential factor: we speculate that individuals who are, or have been, married are more likely to have an annual income exceeding $50,000 than their never-married peers.

# Background and Prior Work

Income inequality is a critical issue in society, affecting the quality of life and well-being of many individuals. Various factors, such as education level, occupation, gender, and marital status, can contribute to differences in income. For instance, in 2022, individuals with doctoral and professional degrees earned three times more per week than those with less than a high school diploma.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Occupation type also plays a pivotal role, with some roles offering higher salaries than others.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Gender is also a factor in pay: a Pew Research study found that in 2022 American women earned 82 cents for every dollar earned by American men.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) It is essential to better understand these factors in order to design policies that counter this inequality.

Research in this field has explored the relationship between various demographic characteristics and income levels. Christopher Tamborini, Changhwan Kim, and Arthur Sakamoto examined the association between education and income in “Education and Lifetime Earnings in the United States”. They used data from the 2004 Survey of Income and Program Participation (SIPP) matched to the Detailed Earnings Record (DER) constructed by the Social Security Administration (SSA), and applied regression analysis to estimate the net effect of college education on long-term earnings. Their analysis revealed a positive correlation between education and lifetime earnings, as well as a gap in earnings between levels of educational attainment. The results also provided insight into gender differences in lifetime earnings, demonstrating that men consistently earned more than women across all educational levels.<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4)

Similarly, Esteban Ortiz-Ospina and Max Roser contributed to understanding income, job, and wealth inequality between men and women in “Economic Inequality by Gender”. By analyzing data from various sources, including the United Nations’ International Labour Organization (ILO), and reviewing relevant research on these inequalities, they found that men tend to earn more than women in all regions of the world. Their analysis also revealed the underrepresentation of women in senior positions within firms.<a name="cite_ref-5"></a>[<sup>5</sup>](#cite_note-5)

We hope to build upon this prior work by investigating the impact of specific factors, such as education level and length, work class (private vs. government), occupation, marital status, and gender, on an individual’s income. Specifically, we aim to determine which of these factors affect one’s likelihood of earning an annual income exceeding $50,000. To accomplish this, we plan to use the Adult dataset from UCI’s Machine Learning Repository and analyze these factors and their effects on income.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) U.S. Bureau of Labor Statistics. (2023, May). Education Pays, 2022. U.S. Bureau of Labor Statistics. https://www.bls.gov/careeroutlook/2023/data-on-display/education-pays.htm 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) U.S. Bureau of Labor Statistics. (2023, April 25). May 2022 OEWS State Occupational Employment and Wage Estimates. U.S. Bureau of Labor Statistics. https://www.bls.gov/oes/current/oes_ca.htm
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Kochhar, R. (2023, March 1). The Enduring Grip of the Gender Pay Gap. Pew Research Center’s Social & Demographic Trends Project. https://www.pewresearch.org/social-trends/2023/03/01/the-enduring-grip-of-the-gender-pay-gap/
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Tamborini, C. R., Kim, C., & Sakamoto, A. (2015). Education and Lifetime Earnings in the United States. Demography, 52(4), 1383–1407. https://doi.org/10.1007/s13524-015-0407-0 
5. <a name="cite_note-5"></a> [^](#cite_ref-5) Ortiz-Ospina, E., & Roser, M. (2019). Economic Inequality by Gender. Retrieved from https://ourworldindata.org/economic-inequality-by-gender




# Hypothesis


Our main hypothesis is that characteristics in our dataset, such as marital status, education level, type of work, and gender, can predict whether a person’s annual income is above or below $50,000.

Our rationale for analyzing these factors comes from commonly observed trends: people with higher education levels tend to make more money, and people working in the private sector generally have higher earning potential. We also want to examine how income relates to an individual’s gender.

Based on these previously observed trends, we anticipate a positive correlation between income and each of these factors. By testing this hypothesis, we hope to predict whether a person’s annual income exceeds $50,000 from their personal characteristics and to see how strongly each factor in our dataset relates to that outcome.
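
As a rough illustration of what a "positive correlation" with a binary income outcome could look like in practice, the sketch below computes a Pearson correlation between a numeric factor and a 0/1 income flag, plus the share of high earners at each level. The rows are hypothetical toy values, not records from our dataset, and the column names are only assumed to mirror the ones we use in the Data section.

```python
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration only.
toy = pd.DataFrame({
    "education-num": [7, 9, 9, 10, 13, 13, 14, 16],
    "income-over-50k": [0, 0, 0, 0, 0, 1, 1, 1],
})

# Pearson correlation between a numeric factor and the 0/1 income flag
# (for a binary variable this is the point-biserial correlation).
r = toy["education-num"].corr(toy["income-over-50k"])

# Share of rows earning over $50K at each education level.
rate_by_education = toy.groupby("education-num")["income-over-50k"].mean()

print(f"correlation: {r:.2f}")
print(rate_by_education)
```

A positive `r` would be consistent with the hypothesis for that factor, although a correlation on its own says nothing about causation.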


# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Adult
  - Link to the dataset: https://archive.ics.uci.edu/static/public/2/data.csv
  - Number of observations: 48842
  - Number of variables: 15

Our dataset captures demographic information, sourced from the U.S. census, with the primary goal of predicting income levels. Key variables include age (integer), workclass (categorical, representing employment status), education (categorical, denoting education level), marital status (categorical), occupation (categorical, with missing values), relationship (categorical), race (categorical), sex (binary), hours worked per week (integer), native country (categorical, with missing values), and the target variable income (binary).

To prepare our dataset for analysis, we will need to address missing values in "workclass", "occupation", and "native-country" through imputation or removal of affected rows. We can also drop columns that are not essential, such as "fnlwgt", "capital-gain", and "capital-loss". Finally, we can encode categorical variables, deal with potential outliers, and scale numerical features to further prepare the dataset for exploring relationships between demographic characteristics and an individual's likelihood of earning over $50k annually.
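
The encoding and scaling steps mentioned above could look roughly like the following. This is a minimal sketch on a hypothetical toy frame with Adult-style column names; the choice of one-hot encoding and z-score scaling is an assumption for illustration, not a decision we have finalized.

```python
import pandas as pd

# Hypothetical toy rows using Adult-style column names; values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "hours-per-week": [40, 13, 40, 40],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
    "sex": ["Male", "Male", "Male", "Female"],
})

numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["workclass", "sex"]

# One-hot encode the categorical columns (one 0/1 indicator column per category).
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

# Standardize numeric columns to zero mean and unit (sample) standard deviation.
encoded[numeric_cols] = (
    toy[numeric_cols] - toy[numeric_cols].mean()
) / toy[numeric_cols].std()

print(encoded.columns.tolist())
```

Scaling matters mainly for models that are sensitive to feature magnitude; for simple group comparisons the raw values are easier to interpret.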


## Dataset adult_data.csv

In [42]:
import pandas as pd
import numpy as np

df = pd.read_csv("https://archive.ics.uci.edu/static/public/2/data.csv")

df.head()

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [43]:
# Treat '?' placeholders as missing values, then drop every row that has any missing value
df = df.replace('?', np.nan).dropna(axis='rows', how='any')

df = df.drop(columns=['fnlwgt', 'capital-gain', 'capital-loss']) # Drop columns that are not needed

In [44]:
# Encode income as 0/1; labels with a trailing '.' are mapped to the same classes
df["income"] = df["income"].map({"<=50K": 0, ">50K": 1, "<=50K.": 0, ">50K.": 1})

df.rename(columns={'income': 'income-over-50k'}, inplace=True) # Rename income column so that it is more descriptive

df.head()

|  | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | hours-per-week | native-country | income-over-50k |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 40 | United-States | 0 |
| 1 | 50 | Self-emp-not-inc | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13 | United-States | 0 |
| 2 | 38 | Private | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40 | United-States | 0 |
| 3 | 53 | Private | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 40 | United-States | 0 |
| 4 | 28 | Private | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40 | Cuba | 0 |
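
With the income column encoded as 0/1, the group comparisons described in the Research Question (private vs. government, ever married vs. never married) reduce to group means. The sketch below shows one possible way to do this on a hypothetical toy frame; the grouping rules (any workclass ending in "-gov" counts as government, anything other than "Never-married" counts as ever married) are our assumptions for illustration, and the toy values are not real results.

```python
import pandas as pd

# Hypothetical toy rows in the shape of the cleaned frame; values are illustrative only.
toy = pd.DataFrame({
    "workclass": ["Private", "Private", "State-gov",
                  "Federal-gov", "Self-emp-not-inc", "Local-gov"],
    "marital-status": ["Never-married", "Married-civ-spouse", "Divorced",
                       "Married-civ-spouse", "Never-married", "Widowed"],
    "income-over-50k": [0, 1, 0, 1, 0, 1],
})

def to_sector(workclass: str) -> str:
    """Assumed grouping: '*-gov' is government, 'Private' is private, else other."""
    if workclass.endswith("-gov"):
        return "government"
    return "private" if workclass == "Private" else "other"

sector = toy["workclass"].map(to_sector)

# Assumed grouping: everyone except 'Never-married' counts as ever married.
marital_group = toy["marital-status"].map(
    lambda status: "never married" if status == "Never-married" else "ever married"
)

# The mean of a 0/1 column is the share of rows earning over $50K in each group.
rate_by_sector = toy.groupby(sector)["income-over-50k"].mean()
rate_by_marriage = toy.groupby(marital_group)["income-over-50k"].mean()

print(rate_by_sector)
print(rate_by_marriage)
```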


# Ethics & Privacy

Ethical and privacy considerations are central to our project, given that it handles sensitive demographic data. We address privacy concerns by ensuring that the dataset is anonymized, so that participants' identities are protected from those handling the data.

In working with this data, we need to adhere to ethical guidelines and obtain any permissions necessary to use it.

We acknowledge the potential for biases in the dataset, which could arise from factors such as data collection methods and sources (especially given the size of our dataset in proportion to all workers in the US). 

Further, our analysis will be conducted with sensitivity to potential biases and inequalities especially in income, race, and gender. We remain vigilant to additional ethical concerns, committed to equitable impact, and ready to implement corrective measures as needed to uphold the highest standards of ethics and privacy. 

Moreover, we acknowledge that our study could have unintended consequences, such as the reinforcement of stereotypes: our findings could reinforce stereotypes if they are interpreted without context. To mitigate this, we aim to provide comprehensive analysis and nuanced reporting that emphasizes the complex interplay of socioeconomic factors.

# Team Expectations 

* Communicate via text group chat to coordinate meetings
* Show up to planned meetings and/or communicate necessary absences in advance
* Complete tasks assigned to each team member
* Conflict resolution through open and respectful communication
* Communicate when struggling with tasks to ensure completion

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 11/13  | 4 PM  | Import & Wrangle Data (Ethan); EDA (Nessa, Ethan) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 11/27  | 4 PM  | Finalize wrangling/EDA; Begin Analysis (Lillian, Nessa, Ethan, Bruce) | Discuss/edit Analysis; Complete project check-in |
| 12/4  | 4 PM  | Complete analysis; Draft results/conclusion/discussion (Romir, Bruce)| Discuss/edit full project |
| 12/7  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |