# **COGS 108 \- Project Proposal**

# **Names**

* Arman Uddin  
* Ethan Williams  
* Vinson Nguyen  
* Alysia Kim

# **Research Question**

**By what percentage does each level of post-secondary education (e.g. associate degree, bachelor’s degree) increase the likelihood of earning an above-median salary among 40 hours/week workers in the United States, and in which age group are those increases best realized?** 

## **Background and Prior Work** 

Education has long been recognized as a key factor in income levels, with a higher level of education generally perceived to increase earning potential. Prior research supports this view: “a college degree is key to economic opportunity, conferring substantially higher earnings on those with credentials than those without” \[1\]. As access to education has increased, so has the expectation that higher levels of education can unlock better job opportunities and, with them, financial stability. A formal education is important, but further research is needed to examine the nuances of this relationship.

Previous studies have consistently shown that individuals with advanced education tend to have higher lifetime earnings. “The College Payoff” reports that “bachelor’s degree holders earn 31 percent more than workers with an associate’s degree and 74 percent more than those with just a high school diploma” \[1\]. The report finds a positive relationship between educational attainment and median earnings. Beyond educational attainment, age, gender, race/ethnicity, and job type also affect income to varying degrees. For example, the authors conclude that a gender wage gap persists across all levels of educational attainment. There are also disparities by race/ethnicity, which are more concentrated at the bachelor's and professional degree levels. A report by the National Center for Education Statistics communicates similar ideas: as levels of education increased, median earnings rose as well \[2\]. “The pattern of those with educational attainment having higher median earnings held, in general, for both male and female 25- to 34-year-olds who worked full time, year-round in 2022” \[2\]. This report also recognizes that annual earnings likely vary with sex/gender and race/ethnicity.  

Other work examines the effect of holding a degree in general, regardless of level \[3\]. That study concludes that “\[c\]ollege graduates are half as likely to be unemployed as their peers whose highest degree is a high school diploma,” which illustrates how much harder finding a job can be without a degree. The income gap is also large: the study reports that a person with a bachelor’s degree earns on average 86% more than someone with only a high school diploma. It also reports that “median lifetime earnings are $1.2 million higher for bachelor’s degree holders” \[3\]. Together, these findings suggest that degree holders tend to have higher incomes and greater job stability than those without a degree.

The relationship between education and salary can vary depending on the type of education pursued. Vocational education can lead to immediate employment opportunities but may not provide the same long-term income growth as academic degrees. This may be due to the more specific professional training in post-secondary education, with its emphasis on occupational hierarchy. Still, a variety of trade jobs, including but not limited to HVAC technicians, electricians, and plumbers, offer high salaries without requiring a four-year degree or above. Not all forms of education follow the same income trajectory, and there are many paths to financial success.

In our project, we look to similarly examine the relationship between post-secondary education and annual earnings, particularly focusing on whether individuals with such education are more likely to earn above or below $50,000 per year.


References:

* \[1\] “The College Payoff”  
  [https://cew.georgetown.edu/wp-content/uploads/collegepayoff-completed.pdf](https://cew.georgetown.edu/wp-content/uploads/collegepayoff-completed.pdf)  
* \[2\] “Annual Earnings by Educational Attainment”  
  [https://nces.ed.gov/programs/coe/indicator/cba/annual-earnings](https://nces.ed.gov/programs/coe/indicator/cba/annual-earnings)  
* \[3\] “How does a college degree improve graduates’ employment and earnings potential”  
  [https://www.aplu.org/our-work/4-policy-and-advocacy/publicuvalues/employment-earnings/](https://www.aplu.org/our-work/4-policy-and-advocacy/publicuvalues/employment-earnings/)

# **Hypothesis**

We hypothesize that the likelihood of earning $50k or more will increase with an individual’s level of education, with perhaps a small departure from the trend when comparing the salaries of young individuals with a vocational associate's degree to those with higher levels of education.  We predict this overall trend because our research and prior knowledge suggest that higher education on average results in more successful careers and higher salaries, and that four-plus-year degrees provide specialized knowledge and opportunities for career advancement compared with an associate degree or less college experience.  We predict an exception to the trend for vocational associate degrees because we believe vocational training may provide high initial wages in technical fields but constrain long-term salary growth, which could surface in our analysis.

# **Data** 

Dataset Name: Salary Prediction Classification

[https://www.kaggle.com/datasets/ayessa/salary-prediction-classification](https://www.kaggle.com/datasets/ayessa/salary-prediction-classification)

Our ideal dataset would be from the last few years and include basic demographic information, education level, employment history, work experience, work location, occupation, and some metrics to measure annual/hourly income with somewhat granular brackets (5-10k).  We would like several thousand observations in order to be confident that we have a representative sample of all the education levels involved.  This data could be collected through a US census and/or a survey conducted on alumni of several colleges and high schools.  It could then be cleaned of unique identifying information, properly anonymized, and stored in a relational database where each row is one person’s set of answers and each column is a data field.

The data we will be using was extracted from the 1994 Census database. It includes roughly 32,600 observations with background information such as age, level of education, hours worked per week, and a true/false value for whether the person’s salary is greater than $50k/year. Even though we would like the salary column to have more granularity, the data available will be sufficient to recognize clear correlations between the levels of education and their effects on salary.  

The census data is quite old, but not irrelevant, since all the salaries we analyze come from the same time period and should still reflect the impact of education level on annual salary.  It is worth noting that in 1994, $50k/year for a 40hr/week worker corresponds to a wage of $24.04/hour, which falls into a medium wage bracket.  According to \[4\], medium wages grew by roughly 0.1\% per year from 1979 to 2019, after which much higher wage growth occurred during the pandemic lockdowns, for a total of 17.4\% from 1979 to 2023.  Subtracting the estimated 1.4\% of growth between 1979 and 1994 leaves a total medium-wage growth of \~16\% since the census data was collected.  Therefore a study using modern census data that looks to build on our results here should look at 40 hour/week workers who annually make more than $58,000.

(Article on wage growth from about the time of the census data to almost the present day: \[4\] [Chart: Growth in U.S. Real Wages, by Income Group (1979-2023)](https://www.visualcapitalist.com/growth-in-real-wages-over-time-by-income-group-usa-1979-2023/))
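
The threshold arithmetic above can be checked with a short calculation. This is a minimal sketch, assuming 52 paid weeks per year and treating the subtraction of growth percentages as an approximation, as the text does. The function and constant names are hypothetical and chosen only for illustration.

```python
# Figures quoted in the text above; names are hypothetical.
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52        # assumption: paid for all 52 weeks
THRESHOLD_1994 = 50_000    # dataset salary cutoff, dollars per year
GROWTH_1979_2023 = 0.174   # total medium-wage growth cited from [4]
GROWTH_1979_1994 = 0.014   # estimated growth before the census year


def hourly_wage(annual_salary, hours_per_week=HOURS_PER_WEEK, weeks=WEEKS_PER_YEAR):
    """Convert an annual salary to an hourly wage."""
    return annual_salary / (hours_per_week * weeks)


def adjusted_threshold(threshold, total_growth, early_growth):
    """Scale a threshold by the growth remaining after removing the early period.

    Subtracting the two growth rates is an approximation, not exact compounding.
    """
    growth_since = total_growth - early_growth
    return threshold * (1 + growth_since)


wage = hourly_wage(THRESHOLD_1994)
modern = adjusted_threshold(THRESHOLD_1994, GROWTH_1979_2023, GROWTH_1979_1994)

print(f"1994 hourly wage: ${wage:.2f}")        # 1994 hourly wage: $24.04
print(f"Adjusted threshold: ${modern:,.0f}")   # Adjusted threshold: $58,000
```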

# **Dataset \#1: Salary Prediction Classification**







In [None]:
#Load:  
import pandas as pd  
df = pd.read_csv('salary.csv')

#Data cleaning/wrangling:  
df_new = df.dropna(how='all').copy()  #drop rows with all null values (if any); copy so later assignments are safe  
df_new["education"] = df_new["education"].str.strip()  #delete leading/trailing whitespace  
df_new["workclass"] = df_new["workclass"].str.strip()  #strip workclass too, in case it has the same whitespace  
df_new = df_new[df_new["hours-per-week"] == 40]  #keep only 40 hrs a week workers  
df_new = df_new[~df_new["workclass"].isin(["Without-pay", "Never-worked"])]  #remove anyone unpaid or who never worked

#Data tidying:  
df_new.columns = df_new.columns.str.replace('-', '_')  #use snake_case, better readability
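
As a rough illustration of the comparison the research question asks for, the sketch below computes the share of workers above the $50k cutoff for each education level and age group, then the percent increase relative to a high school baseline. It is not our final analysis: the records are made up, the two age brackets are an assumption, and the salary labels `">50K"` and `"<=50K"` are assumed to be already stripped of whitespace. It uses only the standard library so it runs on its own.

```python
from collections import defaultdict

# Hypothetical toy records: (education, age, salary label). Not real census rows.
records = (
    [("HS-grad", 25, "<=50K")] * 3 + [("HS-grad", 30, ">50K")] * 1
    + [("Bachelors", 25, "<=50K")] * 2 + [("Bachelors", 30, ">50K")] * 2
    + [("HS-grad", 45, "<=50K")] * 3 + [("HS-grad", 50, ">50K")] * 1
    + [("Bachelors", 45, "<=50K")] * 1 + [("Bachelors", 50, ">50K")] * 3
)


def age_group(age):
    """Assumed two-bracket split, chosen only for illustration."""
    return "18-34" if age < 35 else "35-64"


# (age group, education) -> [count above cutoff, total count]
counts = defaultdict(lambda: [0, 0])
for education, age, salary in records:
    cell = counts[(age_group(age), education)]
    cell[0] += int(salary == ">50K")
    cell[1] += 1


def share(group, education):
    """Fraction of workers in the cell who earn above the cutoff."""
    above, total = counts[(group, education)]
    return above / total


def pct_increase(group, level, baseline="HS-grad"):
    """Percent increase in the above-cutoff share relative to the baseline level."""
    base = share(group, baseline)
    return (share(group, level) - base) / base * 100


for group in ("18-34", "35-64"):
    print(f"{group}: {pct_increase(group, 'Bachelors'):+.0f}%")
# 18-34: +100%
# 35-64: +200%
```

With the cleaned pandas frame, the same shares could be obtained by grouping on age group and education and taking the mean of a boolean above-cutoff column.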

# **Ethics & Privacy** 

The 1994 U.S. Census dataset is publicly sourced data collected and mandated by the U.S. government, which lends confidence in the ethics of its data collection. However, ethical and other critical challenges need to be addressed when interpreting older data. These challenges include representation biases, the limitations of outdated economic data, and the risk of perpetuating harmful stereotypes. 

One of the biggest concerns is representation bias, because the dataset may not entirely represent the diversity of the U.S. population. Even though the 1994 Census covers a broad sample of the U.S. population, it may underrepresent various groups in certain ways. For example, the dataset may reflect historical gender disparities, where women have historically earned lower wages than men for equivalent work and education levels. This inequality should be interpreted as a systemic issue rather than an inherent relationship between gender and income. Likewise, racial and ethnic inequities are evident in the data, since historical barriers limited access to education and career opportunities for minority groups. Furthermore, because the dataset captures income at a single point in time, it does not reflect the long-term potential or career advancement of individuals with different levels of education. To address these biases, this study’s findings will be set within their historical context, recognizing that the observed trends may not fully align with the present day. Any conclusions regarding income disparities will also be made cautiously to avoid overgeneralization or predetermined claims that fail to consider broader influences. 

Another issue is the outdated nature of the dataset. Since 1994, the U.S. economy has experienced drastic shifts that altered the relationship between education and income. Changes such as technological advancements, the decline of manufacturing jobs, and the rise of remote work have redefined labor market opportunities, which affects how education is associated with earnings. In addition, increased tuition costs and student loan debt have changed the economics of higher education, while policy shifts like minimum wage changes and labor protections have influenced income distribution. To mitigate this limitation, our findings will be framed strictly as historical trends rather than as comparisons to present-day labor market conditions. If applicable, we can discuss relevant policy changes and shifts in educational accessibility to provide additional context. 

Lastly, the study should avoid implying harmful stereotypes when analyzing the relationship between education and income. The assumption that education alone determines earnings overlooks systemic factors such as historical discrimination, labor market inequality, and other barriers. Without careful analysis, our findings could unintentionally suggest that individuals are to blame for income disparities rather than recognizing the structural conditions that shape economic outcomes. To prevent this, we will avoid claims that oversimplify the relationship between education and income, and instead emphasize the broader economic and policy frameworks that influence wage distributions. 

# **Team Expectations** 

* **Organization**  
  * All team members are expected to show up to meetings using Discord  
  * We expect each team member to have their own way of stating their thoughts, whether bluntly or not  
  * Decisions will be made after every member has agreed, and each person will always have their say  
  * If a person is struggling to complete their part:  
    * They should tell the team before the deadline so the work is not finished late  
    * Other team members can take over and help when needed  
* **Contribution**  
  * Every person will be assigned their part, and some parts will have multiple members  
  * All members need to put in equal effort to the project, which means a fair splitting of work  
  * Some aspects everyone needs to work on:  
    * Code  
    * Checking for mistakes (grammar, spelling, graph mistakes)  
    * Text  
* **Communication**  
  * All communication must be respectful, avoiding conflicts  
  * Criticism:  
    * Respectful  
    * Not too blunt  
    * Think of solutions, not just stating the problems  
* **Conflict**  
  * Resolved through communication  
  * To understand better, everyone must take the quiz:   
    * [https://cuboulder.qualtrics.com/jfe/form/SV\_6Kkp5kCHt628Zg1](https://cuboulder.qualtrics.com/jfe/form/SV_6Kkp5kCHt628Zg1)  
      

# **Project Timeline Proposal**

| Meeting Date | Meeting Time | Completed Before Meeting | Discuss at Meeting |
| :---- | :---- | :---- | :---- |
| 1/31 | 2pm | Read through project guidelines, brainstorming ideas | Exchange contact information, start finalizing a topic |
| 2/8 | 10pm | Do background research, familiarize with chosen dataset | Filling out the project proposal, start splitting up work accordingly  |
| 2/15 | 12pm | Import the data/EDA | Discuss cleanliness of data and start editing/wrangling |
| 2/22 | 12pm | Review checkpoint guidelines | Data Checkpoint |
| 3/8 | 12pm | Review checkpoint guidelines | EDA Checkpoint |
| 3/13 | 12pm | Finish up data analysis  | Finalize project |
| 3/20 | 12pm | Finish up the project | Final Project submission  |

