## Project 3: Investigate the impact of educational levels on wages in different industries

#### Researcher: Chenyi JIANG

### 1. Introduction:
Understanding how education affects wages is central to labor-market policy, inequality research, and workforce development. While higher education generally predicts higher earnings, less is known about whether this relationship is consistent across industries.
This project explores the extent to which educational attainment influences income outcomes across different U.S. industries.

#### Dataset to be used:
https://www.kaggle.com/datasets/uciml/adult-census-income
(UCI Adult dataset on Kaggle)

#### Analysis question:
Do higher educational levels consistently lead to higher earnings across different U.S. industries?

#### Columns that will (likely) be used:
- Education – categorical education level  
- Education-num – numeric encoding of education level  
- Hours-per-week – potential labor supply control  
- Occupation – job category; a candidate proxy for industry / labor sector  
- Workclass – type of employer (government, private, self-employed); used as the industry grouping in the code in Section 2  
- Income – binary indicator (>50K or ≤50K)  
- Sex, age, race – optional controls for robustness  
- Capital-gain, capital-loss – may help capture high-earning outliers

#### Hypothesis: 
Education significantly predicts income, but its effect size varies across industries.  
For example, the income gap between education levels is expected to be smaller in government/public service than in technology/finance.

### 2. Code

#### 2.1 Upload and clean data

In [23]:
# Import libraries and load the dataset
import pandas as pd
import plotly.express as px

original_data = pd.read_csv("adult_wages.csv")
original_data.head()

| | age | workclass | fnlwgt | education | education.num | marital.status | occupation | relationship | race | sex | capital.gain | capital.loss | hours.per.week | native.country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 90 | ? | 77053 | HS-grad | 9 | Widowed | ? | Not-in-family | White | Female | 0 | 4356 | 40 | United-States | <=50K |
| 1 | 82 | Private | 132870 | HS-grad | 9 | Widowed | Exec-managerial | Not-in-family | White | Female | 0 | 4356 | 18 | United-States | <=50K |
| 2 | 66 | ? | 186061 | Some-college | 10 | Widowed | ? | Unmarried | Black | Female | 0 | 4356 | 40 | United-States | <=50K |
| 3 | 54 | Private | 140359 | 7th-8th | 4 | Divorced | Machine-op-inspct | Unmarried | White | Female | 0 | 3900 | 40 | United-States | <=50K |
| 4 | 41 | Private | 264663 | Some-college | 10 | Separated | Prof-specialty | Own-child | White | Female | 0 | 3900 | 40 | United-States | <=50K |


In [None]:
# Data cleaning
import numpy as np

df = original_data.replace("?", np.nan)
df = df.dropna(subset=["education", "workclass", "income"])
df.head()

In [19]:
# Convert numeric-like columns to numeric dtype
numeric_cols = ["age", "fnlwgt", "education.num", "hours.per.week"]

for col in numeric_cols:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors="coerce")

df = df.dropna(subset=numeric_cols)

In [20]:
# Labels must match the raw values exactly (e.g. "HS-grad"); unmatched values become NaN
education_order = [
    "Preschool","1st-4th","5th-6th","7th-8th","9th","10th","11th","12th",
    "HS-grad","Some-college","Assoc-voc","Assoc-acdm",
    "Bachelors","Masters","Prof-school","Doctorate"
]

df["education"] = pd.Categorical(df["education"], categories=education_order, ordered=True)
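`pd.Categorical` silently converts any value that is missing from `categories` into NaN, so a spelling or capitalisation mismatch can empty the column without raising an error. The sketch below is one way to check for that. The helper name `to_ordered` is hypothetical, and the toy values are spelled as in the sample rows shown earlier (`HS-grad`, `Some-college`, `7th-8th`).

```python
import pandas as pd

# Toy values spelled the way they appear in the sample rows.
raw = pd.Series(["HS-grad", "Some-college", "7th-8th", "HS-grad"])

def to_ordered(values, categories):
    """Hypothetical helper: build an ordered categorical and list raw labels it dropped."""
    cat = pd.Categorical(values, categories=categories, ordered=True)
    # Raw labels whose categorical value is NaN were not in `categories`.
    unmatched = sorted(set(values[cat.isna()]))
    return cat, unmatched

# Title-cased labels do not match the raw spelling, so every value is dropped.
_, bad = to_ordered(raw, ["7Th-8Th", "Hs-Grad", "Some-College"])
# Labels copied from the data match, so nothing is dropped.
_, ok = to_ordered(raw, ["7th-8th", "HS-grad", "Some-college"])

print("unmatched with title-cased labels:", bad)
print("unmatched with raw labels:", ok)
```

Running the same check on `df["education"]` before the conversion would show whether any label in `education_order` needs adjusting.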


In [25]:
df.dtypes

age                  int64
workclass           object
fnlwgt               int64
education         category
education.num        int64
marital.status      object
occupation          object
relationship        object
race                object
sex                 object
capital.gain         int64
capital.loss         int64
hours.per.week       int64
native.country      object
income              object
dtype: object

In [None]:
print(df.isna().sum())   
df.head()

#### 2.2 Calculation and grouping

In [27]:
df["high_income"] = (df["income"] == ">50K").astype(int)

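edu_col = "education.num"      # numeric education level (1 = Preschool ... 16 = Doctorate)
industry_col = "workclass"     # employer type, used here as the industry proxy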

In [32]:
edu_industry = (
    df.groupby([industry_col, edu_col], as_index=False)
            .agg(
                high_income_rate=("high_income", "mean"),
                n_obs=("high_income", "size")
            )
)

edu_industry.head()

| | workclass | education.num | high_income_rate | n_obs |
|---|---|---|---|---|
| 0 | Federal-gov | 3 | 0.0 | 1 |
| 1 | Federal-gov | 4 | 0.0 | 2 |
| 2 | Federal-gov | 5 | 0.333333 | 3 |
| 3 | Federal-gov | 6 | 0.0 | 6 |
| 4 | Federal-gov | 7 | 0.111111 | 9 |
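Several of these cells rest on only a handful of observations (`n_obs` of 1 to 9), so their rates are noisy. One possible way to show that uncertainty is a Wilson score interval around each rate. The sketch below applies it to the five rows printed above; the function name `wilson_interval` and the 95% level (`z=1.96`) are assumptions made for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def wilson_interval(rate, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    rate = np.asarray(rate, dtype=float)
    n = np.asarray(n, dtype=float)
    denom = 1 + z**2 / n
    centre = (rate + z**2 / (2 * n)) / denom
    half = z * np.sqrt(rate * (1 - rate) / n + z**2 / (4 * n**2)) / denom
    # Clip to [0, 1] to remove tiny floating-point overshoots.
    return np.clip(centre - half, 0, 1), np.clip(centre + half, 0, 1)

# The five Federal-gov rows from the output above.
cells = pd.DataFrame({
    "education.num": [3, 4, 5, 6, 7],
    "high_income_rate": [0.0, 0.0, 0.333333, 0.0, 0.111111],
    "n_obs": [1, 2, 3, 6, 9],
})
cells["ci_low"], cells["ci_high"] = wilson_interval(
    cells["high_income_rate"], cells["n_obs"]
)
print(cells.round(3))
```

With a single observation the interval spans most of the unit range, which suggests that cells this small are better filtered out or read with caution in the plots.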


#### 2.3 Visualization

In [None]:
import plotly.express as px

heat = edu_industry.pivot_table(
    index=industry_col,
    columns=edu_col,
    values="high_income_rate"
)

fig = px.imshow(
    heat,
    labels=dict(x="Education Level", y="Industry", color="High Income Share"),
    title="Education → High-Income Relationship Across Industries"
)
fig.show()



![Education Heatmap](https://github.com/JocelynJIANG-12/jocelynjiang-12.github.io/blob/034600fc6f0493a41547c0a640f2a4c164f7618b/images:/fig.png?raw=true)


In [34]:
fig2 = px.line(
    edu_industry,
    x=edu_col,
    y="high_income_rate",
    color=industry_col,
    markers=True,
    title="Education Returns Within Each Industry",
    labels={edu_col: "Education Level", "high_income_rate": "Probability of Income >50K"}
)

fig2.update_layout(legend_title_text="Industry")
fig2.show()


![Education Line Plot](https://github.com/JocelynJIANG-12/jocelynjiang-12.github.io/blob/034600fc6f0493a41547c0a640f2a4c164f7618b/images:/fig2.png?raw=true)

### 3. Interpretation and conclusion

#### 3.1 Key Findings

(1) Education generally increases the probability of high income  
Across almost all industries, the likelihood of earning >$50K rises with higher education levels.
This upward trend appears in:
- Private sector
- Federal / state / local government
- Self-employment (incorporated & not incorporated)
This points to a broad, economy-wide positive return to education.

(2) But the size of education returns differs sharply across industries.  
High-income industries
- Self-employed, incorporated
- Federal government
- Private sector

These show steep upward curves: workers with college or advanced degrees have dramatically higher probabilities of earning >$50K (60–95%).
These industries traditionally reward advanced technical, managerial, or professional skills.

Moderate-return industries
- State government
- Local government
These sectors show a positive but flatter slope.
Education matters, but wage structures are more regulated, compressing income differences between education groups.

Low-return or non-return industries
- Never-worked
- Without-pay

These categories remain near zero regardless of educational attainment.
This is expected: they structurally rule out high incomes, so education has no visible effect here.
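The steep, moderate, and flat groupings above are read off the line plot. One way to put a number on them would be a per-sector slope of `high_income_rate` on `education.num`, weighted by cell size. The sketch below uses a toy table with invented numbers (sectors `A` and `B` are placeholders, not results from the dataset), and `weighted_slope` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for edu_industry; all numbers are invented for illustration.
toy = pd.DataFrame({
    "workclass": ["A"] * 4 + ["B"] * 4,
    "education.num": [9, 10, 13, 14] * 2,
    "high_income_rate": [0.10, 0.20, 0.50, 0.60, 0.20, 0.22, 0.28, 0.30],
    "n_obs": [50, 40, 30, 20, 50, 40, 30, 20],
})

def weighted_slope(group):
    """Slope of a weighted linear fit of rate on education level."""
    # np.polyfit applies w to the unsquared residuals, so sqrt(n_obs)
    # weights each squared residual by its cell size.
    slope, _intercept = np.polyfit(
        group["education.num"],
        group["high_income_rate"],
        deg=1,
        w=np.sqrt(group["n_obs"]),
    )
    return slope

slopes = {name: weighted_slope(g) for name, g in toy.groupby("workclass")}
print(slopes)
```

A larger slope means each additional education level adds more to the share earning >$50K. A straight-line fit is a simplification, since the curves described here are not linear.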

#### 3.2 Important Observations

(1) Threshold effects

Several industries show minimal increases at low education levels (education.num 1–8, i.e. below high-school graduation).  
In other words, workers see significant returns only once education reaches an associate degree or higher.
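A threshold of this kind could be located by looking at the gain in the high-income rate per education level within each sector. The sketch below does this on a toy table; the values, the sector label `A`, and the 0.05 cutoff are all assumptions chosen for illustration.

```python
import pandas as pd

# Toy rates for one sector; values are invented for illustration only.
toy = pd.DataFrame({
    "workclass": ["A"] * 6,
    "education.num": [5, 7, 9, 11, 13, 14],
    "high_income_rate": [0.02, 0.03, 0.05, 0.20, 0.45, 0.60],
})

toy = toy.sort_values(["workclass", "education.num"])

# Change in rate per one-unit step in education.num, within each sector.
level_step = toy.groupby("workclass")["education.num"].diff()
rate_step = toy.groupby("workclass")["high_income_rate"].diff()
toy["gain_per_level"] = rate_step / level_step

# First education level where the per-level gain exceeds a hypothetical cutoff.
cutoff = 0.05
first_jump = (
    toy.loc[toy["gain_per_level"] > cutoff]
       .groupby("workclass")["education.num"]
       .min()
)
print(first_jump)
```

Applied to `edu_industry`, the same steps would give one threshold level per workclass, which could then be compared against the associate-degree levels.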

(2) Government vs. Private

Government sectors display:
- Less variation in high-income rates across education levels
- More compressed wage distribution  

This suggests institutional pay scales reduce the marginal wage return to higher education relative to the private sector.
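The comparison between sectors can be made concrete as a gap: the share earning >$50K among degree holders minus the share among everyone else, computed per sector. The sketch below uses toy row-level data with invented values, and the cut at `education.num >= 13` (Bachelors or higher) is an assumption about where to draw the line.

```python
import pandas as pd

# Toy row-level data; invented for illustration, not taken from the dataset.
rows = pd.DataFrame({
    "workclass": ["Private"] * 4 + ["Local-gov"] * 4,
    "education.num": [9, 9, 13, 13] * 2,
    "high_income": [0, 0, 1, 1, 0, 1, 1, 0],
})

# Label each row by whether it reaches the assumed degree cut-off.
rows["level"] = rows["education.num"].ge(13).map(
    {True: "degree", False: "no_degree"}
)

# Share of high earners per sector and education group, one column per group.
share = (
    rows.groupby(["workclass", "level"])["high_income"]
        .mean()
        .unstack("level")
)

# Positive gap: degree holders reach >50K more often than non-degree holders.
gap = share["degree"] - share["no_degree"]
print(gap)
```

A smaller gap in the government rows of the real data would be in line with the compression described above. This is a descriptive comparison only and does not control for age, hours, or occupation.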

(3) Self-employed have the steepest education-income link  

This group has high volatility at low education levels, but the highest peak at upper education levels.

A possible interpretation:
- Low barrier to entry → wide income dispersion  
- High education → substantially higher probability of high earnings when combined with entrepreneurship

#### 3.3 Conclusion

Overall, the analysis indicates that higher educational levels substantially increase the likelihood of earning >$50K in most U.S. industries. However, the magnitude of this effect varies considerably, with the private, federal, and incorporated self-employed sectors showing the strongest returns. In contrast, state and local government and the structurally low-income categories (Never-worked, Without-pay) exhibit flatter patterns, supporting the hypothesis that education's impact on income is industry-dependent.