# Problem Statement (Linear Regression with CPS Data)

## Context
Education is often viewed as a key driver of economic opportunity. Policymakers and researchers have long debated the extent to which additional years of schooling translate into higher wages. The Current Population Survey (CPS), a monthly household survey conducted by the U.S. Census Bureau for the U.S. Bureau of Labor Statistics, provides detailed data on earnings, education, and demographics. The extracts used here are available from the Rdatasets collection, for example [CPSSW04](https://vincentarelbundock.github.io/Rdatasets/doc/AER/CPSSW04.html).

## Problem Definition
Despite widespread belief in the “returns to education,” the magnitude of this relationship varies across time, degree level, and demographic groups. Using subsets of CPS data compiled by Stock and Watson (2007), this project investigates how years of education and degree attainment affect hourly earnings among full‑time workers.

## Objective
The goal is to apply a linear regression model to estimate the effect of education on earnings, controlling for factors such as age, gender, and region. Specifically:
- **Dependent variable:** Average hourly earnings (inflation‑adjusted to 2004 USD)  
- **Independent variables:** Years of education, degree type (high school vs. bachelor’s), age, gender, and region  
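
One way to make this objective concrete is an ordinary least squares fit of earnings on the regressors. The sketch below is only an illustration under stated assumptions: the `toy` frame is synthetic and noise-free (not CPS data), its column names mirror those in the CPS files, and `numpy.linalg.lstsq` stands in for whatever estimation routine the analysis ends up using. Degree type and region are left out to keep the example short.

```python
import numpy as np
import pandas as pd

# Synthetic, noise-free toy data (NOT CPS values), built so the true
# coefficients are known and the fit can be checked against them.
toy = pd.DataFrame({
    "education": [12, 16, 12, 14, 16, 18, 12, 16],
    "age":       [25, 27, 30, 32, 34, 29, 33, 26],
    "gender":    ["male", "female", "female", "male",
                  "male", "female", "male", "female"],
})
female = (toy["gender"] == "female").astype(float)
toy["earnings"] = 2.0 + 1.5 * toy["education"] + 0.2 * toy["age"] - 3.0 * female

# Design matrix: intercept, education, age, female dummy.
X = np.column_stack([
    np.ones(len(toy)),
    toy["education"].to_numpy(dtype=float),
    toy["age"].to_numpy(dtype=float),
    female.to_numpy(),
])
y = toy["earnings"].to_numpy(dtype=float)

# Ordinary least squares via numpy's least-squares solver.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
coefs = dict(zip(["intercept", "education", "age", "female"], beta))
print(coefs)
```

Because the toy data contain no noise, the printed coefficients should match the values used to build `earnings` (2.0, 1.5, 0.2, -3.0) up to floating-point error. With real survey data the fit would not be exact, and standard errors would be needed to judge the estimates.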

## Significance
Quantifying the returns to education provides evidence for labor economics and informs policy debates on college affordability, workforce development, and wage inequality. It also demonstrates the practical application of econometric methods in analyzing large‑scale survey data.

## Scope
The analysis uses CPS subsets from 1992–2004, focusing on full‑time workers aged 25–34. Earnings are inflation‑adjusted to 2004 dollars to ensure comparability across years.
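
Some of the CPS files cover a wider age range than this scope (the preview of the third CSV in this notebook includes ages 50 and 36), so an explicit age filter may be needed before modelling. A minimal sketch, assuming a frame with an `age` column; the `sample` rows are hypothetical placeholders, not CPS records:

```python
import pandas as pd

# Hypothetical rows standing in for a CPS extract (not real CPS values).
sample = pd.DataFrame({
    "earnings": [20.0, 24.0, 10.0, 15.0],
    "age": [31, 50, 36, 25],
})

# Keep workers aged 25 to 34; between() is inclusive on both ends by default.
in_scope = sample[sample["age"].between(25, 34)].reset_index(drop=True)
print(in_scope)
```

Rows with a missing `age` would also be dropped by this filter, since `between()` evaluates to `False` for `NaN`.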

### Set display options

## Import datasets

In [27]:
# Libraries
from github import Github
import pandas as pd

In [28]:
pd.set_option('display.max_rows', None)

In [29]:
g = Github()
repo = g.get_repo('Rooney-tech/Linear-Regression')
contents = repo.get_contents('')

# Get all CSV files
csv_files = [file for file in contents if file.name.endswith('.csv')]

# Download and load each CSV into a list of DataFrames
dfs = []
for file in csv_files:
    df = pd.read_csv(file.download_url)
    dfs.append(df)
    print(f"Loaded: {file.name} | Shape: {df.shape}")

# Combine all into one DataFrame; columns absent from a file become NaN for its rows
cps = pd.concat(dfs, ignore_index=True)

# Show first 5 rows of each DataFrame
for i, df in enumerate(dfs):
    print(f"\n--- CSV {i+1} ---")
    print(df.head())



Loaded: CPSSW04.csv | Shape: (7986, 5)
Loaded: CPSSW3.csv | Shape: (20999, 4)
Loaded: CPSSW8.csv | Shape: (61395, 6)
Loaded: CPSSW9204.csv | Shape: (15588, 6)
Loaded: CPSSW9298.csv | Shape: (13501, 6)
Loaded: CPSSWEducation.csv | Shape: (2950, 5)

--- CSV 1 ---
   rownames  earnings      degree  gender  age
0         1  34.61538    bachelor    male   30
1         2  19.23077    bachelor  female   30
2         3  13.73626  highschool  female   30
3         4  19.23077    bachelor  female   30
4         5  19.23077    bachelor    male   25

--- CSV 2 ---
   rownames  gender  year   earnings
0         1    male  1992  15.064622
1         2    male  1992  13.464005
2         3    male  1992  20.138470
3         4  female  1992  11.659959
4         5    male  1992  19.419239

--- CSV 3 ---
   rownames   earnings  gender  age region  education
0         1  20.673077    male   31  South         14
1         2  24.278847    male   50  South         12
2         3  10.149572    male   36  South
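
The shapes printed above show that the files do not share the same columns, so once they are concatenated it is no longer obvious which file a row came from. One possible way to keep that information is to tag each frame with its file name before combining. This is a sketch, not the notebook's method: the `frames` dictionary, its file names, and the `source` column are hypothetical.

```python
import pandas as pd

# Two tiny hypothetical frames with different columns (not real CPS rows).
frames = {
    "A.csv": pd.DataFrame({"earnings": [10.0, 12.0],
                           "degree": ["bachelor", "highschool"]}),
    "B.csv": pd.DataFrame({"earnings": [15.0], "year": [1992]}),
}

# assign() adds a 'source' column so each row remembers its file of origin.
tagged = [df.assign(source=name) for name, df in frames.items()]
combined = pd.concat(tagged, ignore_index=True)

# Columns missing from a file are filled with NaN for that file's rows.
print(combined)
print(combined.groupby("source").size())
```

With a `source` column in place, missing values can later be summarised per file rather than only for the combined frame.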

### 1. Exploratory Data Analysis

* 1.1 View the first 5 rows of the data.

In [30]:
cps.head()

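|   | rownames | earnings | degree | gender | age | year | region | education |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 34.61538 | bachelor | male | 30.0 | NaN | NaN | NaN |
| 1 | 2 | 19.23077 | bachelor | female | 30.0 | NaN | NaN | NaN |
| 2 | 3 | 13.73626 | highschool | female | 30.0 | NaN | NaN | NaN |
| 3 | 4 | 19.23077 | bachelor | female | 30.0 | NaN | NaN | NaN |
| 4 | 5 | 19.23077 | bachelor | male | 25.0 | NaN | NaN | NaN |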


* Drop the `rownames` column, since it only holds row numbers from the source files.

In [31]:
cps.drop(columns=['rownames'], inplace = True)
cps.head()

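|   | earnings | degree | gender | age | year | region | education |
|---|---|---|---|---|---|---|---|
| 0 | 34.61538 | bachelor | male | 30.0 | NaN | NaN | NaN |
| 1 | 19.23077 | bachelor | female | 30.0 | NaN | NaN | NaN |
| 2 | 13.73626 | highschool | female | 30.0 | NaN | NaN | NaN |
| 3 | 19.23077 | bachelor | female | 30.0 | NaN | NaN | NaN |
| 4 | 19.23077 | bachelor | male | 25.0 | NaN | NaN | NaN |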
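
In a notebook, a cell that drops a column in place raises a `KeyError` if it is run a second time, because the column is already gone. A small sketch of a re-runnable alternative, using the `errors="ignore"` option of `DataFrame.drop`; the `demo` frame is a hypothetical stand-in:

```python
import pandas as pd

# Hypothetical two-row frame (not real CPS rows).
demo = pd.DataFrame({"rownames": [1, 2], "earnings": [34.6, 19.2]})

# errors="ignore" makes the drop a no-op when the column is already gone.
once = demo.drop(columns=["rownames"], errors="ignore")
twice = once.drop(columns=["rownames"], errors="ignore")
print(list(twice.columns))
```

Returning a new frame instead of using `inplace=True` also leaves the original untouched, which can make it easier to re-run cells out of order.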


* Check data types and missing values in each column


In [38]:
cps.info()  # info() prints its summary itself and returns None
cps.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122419 entries, 0 to 122418
Data columns (total 7 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   earnings   122419 non-null  float64
 1   degree     37075 non-null   object 
 2   gender     122419 non-null  object 
 3   age        101420 non-null  float64
 4   year       50088 non-null   float64
 5   region     61395 non-null   object 
 6   education  64345 non-null   float64
dtypes: float64(4), object(3)
memory usage: 6.5+ MB


earnings         0
degree       85344
gender           0
age          20999
year         72331
region       61024
education    58074
dtype: int64
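
The missing counts above are large because each source file supplies only some of the columns, so a regression can only use rows where every chosen variable is present. The sketch below shows one way to express missingness as a share and to count complete cases for a given set of model columns. The `demo` frame and the `model_cols` choice are hypothetical, not CPS data or the notebook's final specification.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined frame (not real CPS rows).
demo = pd.DataFrame({
    "earnings":  [34.6, 15.1, 20.7, 24.3],
    "degree":    ["bachelor", np.nan, np.nan, np.nan],
    "age":       [30.0, np.nan, 31.0, 50.0],
    "education": [np.nan, np.nan, 14.0, 12.0],
})

# Share of missing values per column, as a fraction between 0 and 1.
missing_share = demo.isna().mean()
print(missing_share)

# Rows usable for a model in earnings, education and age only.
model_cols = ["earnings", "education", "age"]
complete = demo.dropna(subset=model_cols)
print(len(complete))
```

Running the same complete-case count for each candidate specification (for example one with `degree` and one with `education`) indicates how many observations each model could actually be estimated on.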

## References

- Becker, G.S. (1964). *Human Capital: A Theoretical and Empirical Analysis with Special Reference to Education*. NBER. [Link](https://www.nber.org/books-and-chapters/human-capital-theoretical-and-empirical-analysis-special-reference-education-first-edition)

- Mincer, J. (1974). *Schooling, Experience, and Earnings*. NBER. [Link](https://www.nber.org/books-and-chapters/schooling-experience-and-earnings)

- Stock, J.H., & Watson, M.W. (2007). *Introduction to Econometrics (2nd Edition)*. Pearson/Addison Wesley. [Link](https://archive.org/details/introductiontoec0000stoc)

- U.S. Bureau of Labor Statistics. *Current Population Survey (CPS)*. [Link](https://www.bls.gov/cps/)