# Introduction

The National Longitudinal Survey of Youth 1997-2011 dataset is one of the most important databases available to social scientists working with US data. 

It allows researchers to study the determinants of earnings and educational attainment, which makes it highly relevant for government policy. It can also shed light on politically sensitive issues, such as how educational attainment and salaries differ by ethnicity, sex, and other factors. A better understanding of how these variables affect education and earnings helps in formulating more suitable government policies.

<center><img src=https://i.imgur.com/cxBpQ3I.png height=400></center>


### Import Statements


In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

## Notebook Presentation

In [3]:
pd.options.display.float_format = '{:,.2f}'.format

# Load the Data



In [4]:
df_data = pd.read_csv('NLSY97_subset.csv')
df_expl = pd.read_csv("NLSY97_Variable_Names_and_Descriptions.csv")

# Preliminary Data Exploration 🔎

**Challenge**

* What is the shape of `df_data`? 
* How many rows and columns does it have?
* What are the column names?
* Are there any NaN values or duplicates?
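
The questions above can be answered with a handful of pandas calls. The sketch below uses a small made-up frame (`toy`) in place of `df_data`, so the values are illustrative only; the same calls should apply to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_data, with one NaN and one duplicated row.
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "EARNINGS": [19.23, 39.05, 39.05, 16.80],
    "HHINC97": [6000.0, 88252.0, 88252.0, np.nan],
})

n_rows, n_cols = toy.shape                      # (rows, columns)
column_names = toy.columns.tolist()             # list of column names
nan_per_column = toy.isna().sum()               # NaN count for each column
n_duplicate_rows = int(toy.duplicated().sum())  # fully identical rows

print(f"{n_rows} rows, {n_cols} columns")
print(column_names)
print(nan_per_column[nan_per_column > 0])
print(f"{n_duplicate_rows} duplicate rows")
```

Note that `duplicated()` only flags rows that match an earlier row in every column, which is stricter than checking for a repeated `ID`.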

In [5]:
df_data.isna().any().value_counts()

False    80
True     16
dtype: int64

In [6]:
df_data[df_data.isna().any(axis=1)]

Unnamed: 0,ID,EARNINGS,S,EXP,FEMALE,MALE,BYEAR,AGE,AGEMBTH,HHINC97,...,URBAN,REGNE,REGNC,REGW,REGS,MSA11NO,MSA11NCC,MSA11CC,MSA11NK,MSA11NIC
1,4328,19.23,17,5.71,0,1,1982,29,32.00,6000.00,...,2,0,0,1,0,0,1,0,0,0
2,8763,39.05,14,9.94,0,1,1981,30,23.00,88252.00,...,1,0,0,0,1,0,0,1,0,0
3,8879,16.80,18,1.54,0,1,1983,28,30.00,,...,1,0,1,0,0,0,1,0,0,0
7,268,12.79,14,7.71,0,1,1981,30,26.00,71100.00,...,1,1,0,0,0,0,0,1,0,0
9,2333,25.48,17,7.15,0,1,1981,30,30.00,61300.00,...,1,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1993,2740,14.00,12,12.44,1,0,1980,31,27.00,81800.00,...,1,0,1,0,0,0,1,0,0,0
1994,3779,9.33,12,9.12,1,0,1984,27,22.00,,...,1,0,0,1,0,0,1,0,0,0
1995,2456,14.00,8,7.87,1,0,1982,29,19.00,6000.00,...,1,1,0,0,0,0,1,0,0,0
1997,3561,35.88,18,2.67,1,0,1984,27,29.00,77610.00,...,1,0,0,1,0,0,0,1,0,0


In [8]:
df_expl[df_expl["Personal variables"]=="PRFSTYAE"]

Unnamed: 0,Personal variables,Variable Type,Description
95,PRFSTYAE,D,"Father, authoritative"


## Data Cleaning - Check for Missing Values and Duplicates

Find and remove any duplicate rows.

In [9]:
df_ID = pd.DataFrame(df_data["ID"])

In [10]:
df_ID[(df_ID.duplicated())].sort_values("ID")

Unnamed: 0,ID
1868,1
1299,28
1114,31
1957,81
1141,93
...,...
1320,8916
1282,8924
1450,8947
1673,8956
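
The cells above list repeated `ID` values but stop short of removing them. One possible way to drop them is sketched below on a made-up frame (`toy`); keeping the first occurrence is an assumption, and whether rows sharing an `ID` are true duplicates in the real data should be checked first.

```python
import pandas as pd

# Hypothetical stand-in for df_data: ID 28 appears twice.
toy = pd.DataFrame({
    "ID": [1, 28, 28, 31],
    "EARNINGS": [19.23, 39.05, 39.05, 16.80],
})

# Count rows whose ID already appeared earlier in the frame.
n_repeated_ids = int(toy.duplicated(subset="ID").sum())

# Keep the first row for each ID and rebuild a clean 0..n-1 index.
toy_clean = toy.drop_duplicates(subset="ID", keep="first").reset_index(drop=True)

print(n_repeated_ids, len(toy_clean))
```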


## Descriptive Statistics

In [11]:
df_data.describe()

Unnamed: 0,ID,EARNINGS,S,EXP,FEMALE,MALE,BYEAR,AGE,AGEMBTH,HHINC97,...,URBAN,REGNE,REGNC,REGW,REGS,MSA11NO,MSA11NCC,MSA11CC,MSA11NK,MSA11NIC
count,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,1956.0,1630.0,...,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0,2000.0
mean,3530.89,18.87,14.58,6.72,0.5,0.5,1982.05,28.95,26.42,58143.75,...,0.78,0.15,0.27,0.34,0.23,0.05,0.54,0.41,0.0,0.0
std,2023.07,11.95,2.74,2.84,0.5,0.5,1.39,1.39,5.04,42745.79,...,0.43,0.36,0.44,0.48,0.42,0.21,0.5,0.49,0.06,0.0
min,1.0,2.0,6.0,0.0,0.0,0.0,1980.0,27.0,12.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1825.0,11.54,12.0,4.69,0.0,0.0,1981.0,28.0,23.0,32000.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3471.5,15.75,15.0,6.63,0.5,0.5,1982.0,29.0,26.0,50502.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,5158.25,22.7,16.0,8.7,1.0,1.0,1983.0,30.0,30.0,72202.5,...,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
max,8980.0,132.89,20.0,14.73,1.0,1.0,1984.0,31.0,45.0,246474.0,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


# Split Training & Test Dataset

We *can't* use all the entries in our dataset to train our model. Keep 20% of the data for later as a testing dataset (out-of-sample data).  

In [13]:
# Round every numeric column to the nearest whole number.
df_data = df_data.round(0)

In [112]:
df_data.EARNINGS = df_data.EARNINGS.round(0).astype(int)

In [16]:
target = pd.DataFrame(df_data, columns=["EARNINGS"])
features = pd.DataFrame(df_data, columns=["S"])

In [17]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)
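
`train_test_split` shuffles the rows randomly, so each run of the cell above yields a different split and slightly different scores. A minimal sketch of a reproducible split follows; the toy data and the seed value `10` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the features and target.
toy_features = pd.DataFrame({"S": [12, 14, 16, 12, 18, 16, 13, 12, 20, 15]})
toy_target = pd.DataFrame({"EARNINGS": [11, 15, 22, 12, 30, 19, 14, 10, 35, 18]})

# Fixing random_state makes the shuffle, and so the split, repeatable.
split_a = train_test_split(toy_features, toy_target, test_size=0.2, random_state=10)
split_b = train_test_split(toy_features, toy_target, test_size=0.2, random_state=10)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

# Both calls select exactly the same test rows.
print(X_test_a.index.tolist() == X_test_b.index.tolist())
print(len(X_train_a), len(X_test_a))
```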

# Simple Linear Regression

Only use the years of schooling to predict earnings. Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data? 

In [113]:
regr = LinearRegression()
regr.fit(X_train, y_train)
rsquared = regr.score(X_train, y_train)
print(rsquared)

0.10994190684166716


# Logarithmic Linear Regression

In [19]:
log_target = np.log(target)
log_features = np.log(features)

In [20]:
log_X_train, log_X_test, log_y_train, log_y_test = train_test_split(log_features, log_target, test_size=0.2)

In [21]:
log_regr = LinearRegression()
log_regr.fit(log_X_train, log_y_train)
log_rsquared = log_regr.score(log_X_train, log_y_train)
print(log_rsquared)

0.0958969352424246


### Evaluate the Coefficients of the Model

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative). 

In [23]:
print(f"Fit: {regr.fit(X_train, y_train)}")
print(f"R-squared {regr.score(X_train, y_train)}")
print(f"Intercept: {regr.intercept_}")
print(f"Coefficient: {regr.coef_}")

Fit: LinearRegression()
R-squared 0.07827268935871146
Intercept: [0.83699969]
Coefficient: [[1.24071474]]


In [24]:
print(f"Fit: {log_regr.fit(log_X_train, log_y_train)}")
print(f"R-squared {log_regr.score(log_X_train, log_y_train)}")
print(f"Intercept: {log_regr.intercept_}")
print(f"Coefficient: {log_regr.coef_}")

Fit: LinearRegression()
R-squared 0.0958969352424246
Intercept: [0.51183455]
Coefficient: [[0.85603697]]
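
As a rough guide to reading these numbers, the sketch below plugs the intercepts and coefficients printed above into the two fitted equations. The 16-year example is an assumption chosen for illustration, and the log-log model was fitted on a different random split, so the two predictions are not strictly comparable.

```python
import math

# Values copied from the printed output of the two regressions.
lin_intercept, lin_coef = 0.83699969, 1.24071474
log_intercept, log_coef = 0.51183455, 0.85603697

schooling = 16  # illustrative: years of schooling

# Linear model: EARNINGS = intercept + coef * S.
# Each extra year of schooling adds lin_coef to predicted earnings.
lin_pred = lin_intercept + lin_coef * schooling

# Log-log model: log(EARNINGS) = intercept + coef * log(S).
# The coefficient is an elasticity: a 1% rise in S goes with roughly
# a log_coef % rise in predicted earnings.
log_pred = math.exp(log_intercept + log_coef * math.log(schooling))

print(round(lin_pred, 2), round(log_pred, 2))
```

Both coefficients are positive, which matches the expectation that more schooling goes with higher earnings.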


### Analyse the Estimated Values & Regression Residuals

How good our regression is also depends on the residuals - the difference between the model's predictions ($\hat{y}_i$) and the true values ($y_i$) inside `y_train`. Do you see any patterns in the distribution of the residuals?

In [25]:
log_prediction = log_regr.predict(log_X_test)
log_residual = (log_y_test-log_prediction)
log_residual.describe()

Unnamed: 0,EARNINGS
count,400.0
mean,-0.03
std,0.56
min,-2.29
25%,-0.34
50%,0.0
75%,0.32
max,1.79


In [26]:
prediction = regr.predict(X_test)


In [27]:
residual = (y_test - prediction)
residual.describe()

Unnamed: 0,EARNINGS
count,400.0
mean,-0.31
std,11.23
min,-18.93
25%,-6.93
50%,-2.69
75%,3.01
max,86.76
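
One pattern worth checking is skew: in the summary above the mean residual (-0.31) sits well above the median (-2.69), and the largest residual (86.76) dwarfs the smallest (-18.93), which hints at a long right tail. The sketch below shows how such a check could be done on synthetic, made-up data; all names and numbers in it are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: earnings grow with schooling, with right-skewed noise.
schooling = rng.integers(8, 21, size=500)
noise = rng.lognormal(mean=0.0, sigma=0.8, size=500)
toy = pd.DataFrame({"S": schooling, "EARNINGS": 1.2 * schooling + 5 * noise})

model = LinearRegression().fit(toy[["S"]], toy["EARNINGS"])
toy_residuals = toy["EARNINGS"] - model.predict(toy[["S"]])

# A mean above the median and a positive skew both point to a long right tail.
print(toy_residuals.describe())
print(f"skew: {toy_residuals.skew():.2f}")
```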


# Multivariable Regression

Now use both years of schooling and years of work experience to predict earnings. How high is the r-squared for the regression on the training data?

In [114]:
multiple_features = df_data[["S", "EXP"]]

In [57]:
mult_X_train, mult_X_test, mult_y_train, mult_y_test = train_test_split(multiple_features, target, test_size=0.2)

In [58]:
regr_mult = LinearRegression()
regr_mult.fit(mult_X_train, mult_y_train)
rsquared_mult = regr_mult.score(mult_X_train, mult_y_train)
print(rsquared_mult)

0.09645825149678688


### Evaluate the Coefficients of the Model

In [62]:
print(f"Fit: {regr_mult.fit(mult_X_train, mult_y_train)}")
print(f"R-squared {regr_mult.score(mult_X_train, mult_y_train)}")
print(f"Intercept: {regr_mult.intercept_}")
print(f"Coefficient: {regr_mult.coef_}")

Fit: LinearRegression()
R-squared 0.09645825149678688
Intercept: [-11.75034475]
Coefficient: [[1.71432636 0.84251638]]


### Analyse the Estimated Values & Regression Residuals

In [63]:
multiple_pred = regr_mult.predict(mult_X_test)
multiple_pred

array([[12.10375631],
       [15.56170262],
       [19.8914589 ],
       [11.29053352],
       [21.54719808],
       [21.57649166],
       [16.46280618],
       [ 8.73369077],
       [15.56170262],
       [13.87666985],
       [15.56170262],
       [24.162628  ],
       [17.30532256],
       [14.77777341],
       [16.43351259],
       [16.43351259],
       [24.162628  ],
       [23.26152443],
       [23.3494052 ],
       [18.08925177],
       [23.32011161],
       [20.73397528],
       [18.96106174],
       [19.8914589 ],
       [21.60578525],
       [20.73397528],
       [16.43351259],
       [18.20642613],
       [16.404219  ],
       [18.14783895],
       [20.73397528],
       [20.73397528],
       [14.74847982],
       [10.41872354],
       [23.3494052 ],
       [15.56170262],
       [18.99035533],
       [17.24673539],
       [18.96106174],
       [17.24673539],
       [19.86216531],
       [18.17713254],
       [15.56170262],
       [23.37869879],
       [11.37841428],
       ...

In [67]:
mult_residuals = (mult_y_test - multiple_pred)
mult_residuals.describe()

Unnamed: 0,EARNINGS
count,400.0
mean,-0.33
std,8.61
min,-18.32
25%,-5.96
50%,-2.07
75%,3.82
max,41.2


# Use Your Model to Make a Prediction

How much can someone with a bachelor's degree (12 + 4 years of schooling) and 5 years of work experience expect to earn in 2011?

In [72]:
# Build a single-row frame for the person of interest instead of relying on
# a matching row existing in df_data.
predict_data = pd.DataFrame({"S": [16], "EXP": [5]})

In [74]:
predict_values = regr_mult.predict(predict_data[["S", "EXP"]])

In [83]:
predicted_earnings = predict_values[0][0].round(2)
predicted_earnings

19.89
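
The same figure can be cross-checked by hand from the intercept and coefficients printed earlier for the multivariable model. This is only a sanity check on the arithmetic, with the values copied from that output.

```python
# Intercept and coefficients as printed for the multivariable regression.
intercept = -11.75034475
coef_schooling, coef_experience = 1.71432636, 0.84251638

schooling, experience = 16, 5  # bachelor's degree, 5 years of work

manual_prediction = intercept + coef_schooling * schooling + coef_experience * experience
print(round(manual_prediction, 2))  # 19.89
```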