# Introduction

The National Longitudinal Survey of Youth 1997-2011 dataset is one of the most important databases available to social scientists working with US data. 

It allows scientists to look at the determinants of earnings as well as educational attainment and has incredible relevance for government policy. It can also shed light on politically sensitive issues, such as how educational attainment and salaries differ across ethnicity, sex, and other factors. With a better understanding of how these variables affect education and earnings, we can also formulate more suitable government policies. 

<center><img src=https://i.imgur.com/cxBpQ3I.png height=400></center>


### Upgrade Plotly

In [77]:
#%pip install --upgrade plotly

###  Import Statements


In [203]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt


# Machine learning stuff
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.model_selection import train_test_split

# Evaluating predictions
from sklearn.metrics import mean_squared_error



## Notebook Presentation

In [79]:
pd.options.display.float_format = '{:,.2f}'.format

# Load the Data



In [80]:
df = pd.read_csv('NLSY97_subset.csv')

### Understand the Dataset

Have a look at the file entitled `NLSY97_Variable_Names_and_Descriptions.csv`. 

---------------------------

    :Key Variables:  
      1. S           Years of schooling (highest grade completed as of 2011)
      2. EXP         Total out-of-school work experience (years) as of the 2011 interview.
      3. EARNINGS    Current hourly earnings in $ reported at the 2011 interview

# RQ1: What variables predict earnings?
# RQ1a: What variables positively predict earnings?
# RQ1b: What variables negatively predict earnings?

# Preliminary Data Exploration 🔎

**Challenge**

* What is the shape of `df`? 
* How many rows and columns does it have?
* What are the column names?
* Are there any NaN values or duplicates?
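
The checks listed above can be sketched roughly as follows. The tiny `toy` frame is a made-up stand-in so the snippet runs on its own; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data (values invented for illustration).
toy = pd.DataFrame({
    "ID": [1, 2, 2, 3],
    "EARNINGS": [18.5, 19.23, 19.23, np.nan],
    "S": [12, 17, 17, 14],
})

n_rows, n_cols = toy.shape                   # (rows, columns)
column_names = list(toy.columns)             # column labels
n_missing = int(toy.isna().sum().sum())      # total number of NaN cells
n_duplicates = int(toy.duplicated().sum())   # rows that fully repeat an earlier row

print(n_rows, n_cols, column_names, n_missing, n_duplicates)
```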

In [81]:
df

Unnamed: 0,ID,EARNINGS,S,EXP,FEMALE,MALE,BYEAR,AGE,AGEMBTH,HHINC97,...,URBAN,REGNE,REGNC,REGW,REGS,MSA11NO,MSA11NCC,MSA11CC,MSA11NK,MSA11NIC
0,4275,18.50,12,9.71,0,1,1984,27,24.00,64000.00,...,1,0,0,1,0,0,0,1,0,0
1,4328,19.23,17,5.71,0,1,1982,29,32.00,6000.00,...,2,0,0,1,0,0,1,0,0,0
2,8763,39.05,14,9.94,0,1,1981,30,23.00,88252.00,...,1,0,0,0,1,0,0,1,0,0
3,8879,16.80,18,1.54,0,1,1983,28,30.00,,...,1,0,1,0,0,0,1,0,0,0
4,1994,36.06,15,2.94,0,1,1984,27,23.00,44188.00,...,1,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,2456,14.00,8,7.87,1,0,1982,29,19.00,6000.00,...,1,1,0,0,0,0,1,0,0,0
1996,1119,14.83,18,1.92,1,0,1983,28,28.00,50000.00,...,1,1,0,0,0,0,1,0,0,0
1997,3561,35.88,18,2.67,1,0,1984,27,29.00,77610.00,...,1,0,0,1,0,0,0,1,0,0
1998,5980,25.48,16,4.71,1,0,1982,29,23.00,69300.00,...,0,0,0,1,0,0,1,0,0,0


## Data Cleaning - Check for Missing Values and Duplicates

Find and remove any duplicate rows.

In [82]:
df[df.duplicated().values]

Unnamed: 0,ID,EARNINGS,S,EXP,FEMALE,MALE,BYEAR,AGE,AGEMBTH,HHINC97,...,URBAN,REGNE,REGNC,REGW,REGS,MSA11NO,MSA11NCC,MSA11CC,MSA11NK,MSA11NIC
1000,4693,14.50,12,7.25,0,1,1981,30,20.00,40700.00,...,1,0,0,0,1,0,1,0,0,0
1004,4827,38.48,16,8.50,0,1,1981,30,34.00,27700.00,...,0,1,0,0,0,0,1,0,0,0
1006,4176,4.29,16,2.04,0,1,1980,31,23.00,2500.00,...,1,0,0,1,0,0,1,0,0,0
1012,3256,10.00,12,8.02,0,1,1984,27,21.00,43000.00,...,0,0,1,0,0,0,1,0,0,0
1015,4600,52.00,17,9.08,0,1,1980,31,28.00,48900.00,...,1,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1993,2740,14.00,12,12.44,1,0,1980,31,27.00,81800.00,...,1,0,1,0,0,0,1,0,0,0
1996,1119,14.83,18,1.92,1,0,1983,28,28.00,50000.00,...,1,1,0,0,0,0,1,0,0,0
1997,3561,35.88,18,2.67,1,0,1984,27,29.00,77610.00,...,1,0,0,1,0,0,0,1,0,0
1998,5980,25.48,16,4.71,1,0,1982,29,23.00,69300.00,...,0,0,0,1,0,0,1,0,0,0


In [83]:
# Remove duplicated rows
df = df.drop_duplicates(ignore_index=True)

## Descriptive Statistics

In [84]:
print(f"Ratio of remaining IDs to number of predictors: {len(df)/(len(df.columns)-1)}")

Ratio of remaining IDs to number of predictors: 15.652631578947368


Considering the above, will the number of predictors still work for more advanced prediction methods?

In [85]:
df.describe()[1:]

Unnamed: 0,ID,EARNINGS,S,EXP,FEMALE,MALE,BYEAR,AGE,AGEMBTH,HHINC97,...,URBAN,REGNE,REGNC,REGW,REGS,MSA11NO,MSA11NCC,MSA11CC,MSA11NK,MSA11NIC
mean,3547.13,18.81,14.56,6.7,0.49,0.51,1982.07,28.93,26.32,58310.67,...,0.78,0.16,0.27,0.34,0.23,0.05,0.54,0.41,0.0,0.0
std,2009.84,12.0,2.77,2.86,0.5,0.5,1.38,1.38,5.08,43868.05,...,0.43,0.36,0.44,0.47,0.42,0.21,0.5,0.49,0.04,0.0
min,1.0,2.0,6.0,0.0,0.0,0.0,1980.0,27.0,12.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1888.0,11.41,12.0,4.65,0.0,0.0,1981.0,28.0,23.0,32000.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3474.0,15.75,15.0,6.63,0.0,1.0,1982.0,29.0,26.0,50500.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,5160.5,22.6,16.0,8.71,1.0,1.0,1983.0,30.0,30.0,72000.0,...,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0
max,8980.0,132.89,20.0,14.73,1.0,1.0,1984.0,31.0,45.0,246474.0,...,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


In [86]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1487 entries, 0 to 1486
Data columns (total 96 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        1487 non-null   int64  
 1   EARNINGS  1487 non-null   float64
 2   S         1487 non-null   int64  
 3   EXP       1487 non-null   float64
 4   FEMALE    1487 non-null   int64  
 5   MALE      1487 non-null   int64  
 6   BYEAR     1487 non-null   int64  
 7   AGE       1487 non-null   int64  
 8   AGEMBTH   1453 non-null   float64
 9   HHINC97   1205 non-null   float64
 10  POVRAT97  1203 non-null   float64
 11  HHBMBF    1487 non-null   int64  
 12  HHBMOF    1487 non-null   int64  
 13  HHOMBF    1487 non-null   int64  
 14  HHBMONLY  1487 non-null   int64  
 15  HHBFONLY  1487 non-null   int64  
 16  HHOTHER   1487 non-null   int64  
 17  MSA97NO   1487 non-null   int64  
 18  MSA97NCC  1487 non-null   int64  
 19  MSA97CC   1487 non-null   int64  
 20  MSA97NK   1487 non-null   int6

In [87]:
df.columns

Index(['ID', 'EARNINGS', 'S', 'EXP', 'FEMALE', 'MALE', 'BYEAR', 'AGE',
       'AGEMBTH', 'HHINC97', 'POVRAT97', 'HHBMBF', 'HHBMOF', 'HHOMBF',
       'HHBMONLY', 'HHBFONLY', 'HHOTHER', 'MSA97NO', 'MSA97NCC', 'MSA97CC',
       'MSA97NK', 'ETHBLACK', 'ETHHISP', 'ETHWHITE', 'EDUCPROF', 'EDUCPHD',
       'EDUCMAST', 'EDUCBA', 'EDUCAA', 'EDUCHSD', 'EDUCGED', 'EDUCDO',
       'PRMONM', 'PRMONF', 'PRMSTYUN', 'PRMSTYPE', 'PRMSTYAN', 'PRMSTYAE',
       'PRFSTYUN', 'PRFSTYPE', 'PRFSTYAN', 'PRFSTYAE', 'SINGLE', 'MARRIED',
       'COHABIT', 'OTHSING', 'FAITHN', 'FAITHP', 'FAITHC', 'FAITHJ', 'FAITHO',
       'FAITHM', 'ASVABAR', 'ASVABWK', 'ASVABPC', 'ASVABMK', 'ASVABNO',
       'ASVABCS', 'ASVABC', 'ASVABC4', 'VERBAL', 'ASVABMV', 'HEIGHT',
       'WEIGHT04', 'WEIGHT11', 'SF', 'SM', 'SFR', 'SMR', 'SIBLINGS', 'REG97NE',
       'REG97NC', 'REG97S', 'REG97W', 'RS97RURL', 'RS97URBN', 'RS97UNKN',
       'JOBS', 'HOURS', 'TENURE', 'CATGOV', 'CATPRI', 'CATNPO', 'CATMIS',
       'CATSE', 'COLLBARG', 'URBAN'

## Considering there were 96 columns, I had to narrow down which predictors measured between 1997 and 2004 would likely predict 2011 earnings. 

In [88]:
numerical_feat =[
    "ID", # key for matching
    'EARNINGS', # outcome
    'S', # years of schooling
    'EXP', # out of school work experience
    'BYEAR', # Year of birth
    'AGE', # Age at 2011 (will likely correlate with BYEAR)
    'HHINC97', # Gross household income
    'POVRAT97', # Ratio of household income to poverty level
    
    # Parental Monitoring (scale of 0 low, to 16 high)
    'PRMONM', 'PRMONF',
    
    # ASVAB battery scores
    'ASVABAR', 'ASVABWK', 'ASVABPC', 'ASVABMK', 'ASVABNO',
    'ASVABCS', 'ASVABC', 'ASVABC4', 'VERBAL', 'ASVABMV',
    
    # height and weight at 2004
    'HEIGHT', 'WEIGHT04',
    
    # Family background
    'SF', 'SM', 'SFR', 'SMR', 'SIBLINGS',
]

categorical_feat = [
    "ID", # key for matching
    'FEMALE', 
    'MALE',
    # Household structure 1997
    'HHBMBF', 'HHBMOF', 'HHOMBF',
    'HHBMONLY', 'HHBFONLY', 'HHOTHER',
    
    # Household location 1997
    'MSA97NO', 'MSA97NCC', 'MSA97CC',
    'MSA97NK', 'REG97NE', 'REG97NC',
    'REG97S', 'REG97W', 'RS97RURL', 
    'RS97URBN', 'RS97UNKN',
    
    # Ethnicity
    'ETHBLACK', 'ETHHISP', 'ETHWHITE',

    # Highest educational qualification
    'EDUCPROF', 'EDUCPHD',
    'EDUCMAST', 'EDUCBA', 
    'EDUCAA', 'EDUCHSD', 
    'EDUCGED', 'EDUCDO',
    
    # Faith:
    'FAITHN', 'FAITHP', 'FAITHC', 'FAITHJ', 'FAITHO','FAITHM',
    
     # Parenting style (0 or 1)
    'PRMSTYUN', 'PRMSTYPE', 'PRMSTYAN', 'PRMSTYAE',
    'PRFSTYUN', 'PRFSTYPE', 'PRFSTYAN', 'PRFSTYAE',
    

]

not_included =[
    # marital status at 2011
    'SINGLE', 'MARRIED',
    'COHABIT', 'OTHSING',
    
    # weight at 2011
    'WEIGHT11',
    
    # work related vars at 2011
    'JOBS', 'HOURS', 'TENURE', 'COLLBARG',
    
    # Category of employment at 2011
    'CATGOV', 'CATPRI', 'CATNPO', 'CATMIS','CATSE',
    
    # Living in 2011
    'URBAN', 'REGNE', 'REGNC', 'REGW', 'REGS',
    'MSA11NO', 'MSA11NCC', 'MSA11CC', 'MSA11NK', 'MSA11NIC'
]

## Visualise the Features

In [89]:
num_df = df[numerical_feat]
num_df

Unnamed: 0,ID,EARNINGS,S,EXP,BYEAR,AGE,HHINC97,POVRAT97,PRMONM,PRMONF,...,ASVABC4,VERBAL,ASVABMV,HEIGHT,WEIGHT04,SF,SM,SFR,SMR,SIBLINGS
0,4275,18.50,12,9.71,1984,27,64000.00,402.00,14.00,14.00,...,-0.32,-0.53,29818,70,155,12,12,12.00,12.00,1
1,4328,19.23,17,5.71,1982,29,6000.00,38.00,12.00,12.00,...,-0.14,-0.21,46246,74,200,16,12,16.00,12.00,3
2,8763,39.05,14,9.94,1981,30,88252.00,555.00,,,...,0.48,0.54,66480,72,168,16,6,,6.00,1
3,8879,16.80,18,1.54,1983,28,,,6.00,4.00,...,0.16,-0.05,51240,73,153,14,14,14.00,14.00,2
4,1994,36.06,15,2.94,1984,27,44188.00,278.00,11.00,8.00,...,1.07,0.59,89773,71,145,14,16,14.00,16.00,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1482,2400,9.00,12,10.83,1982,29,73100.00,390.00,11.00,4.00,...,-0.81,-1.09,22110,69,190,16,12,16.00,12.00,5
1483,3018,8.00,12,9.37,1982,29,66300.00,354.00,16.00,10.00,...,-1.78,-1.35,7706,67,125,13,14,13.00,14.00,2
1484,4550,8.57,17,6.29,1984,27,76300.00,364.00,11.00,,...,0.73,0.83,75186,62,173,14,18,16.00,18.00,4
1485,3779,9.33,12,9.12,1984,27,,,12.00,4.00,...,-0.84,-0.32,25700,64,158,12,8,,8.00,1


In [231]:
%%script False
# Plot one histogram per numerical feature, skipping the ID column
for col in numerical_feat[1:]:
    fig = px.histogram(num_df, x=col, title=col)
    fig.show()

Couldn't find program: 'False'


## My Analysis of graphs

I will replace NAs with the mean/median after the next section

## Now to look at the categorical features

In [91]:
cat_df = df[categorical_feat]
cat_df[:3]

Unnamed: 0,ID,FEMALE,MALE,HHBMBF,HHBMOF,HHOMBF,HHBMONLY,HHBFONLY,HHOTHER,MSA97NO,...,FAITHO,FAITHM,PRMSTYUN,PRMSTYPE,PRMSTYAN,PRMSTYAE,PRFSTYUN,PRFSTYPE,PRFSTYAN,PRFSTYAE
0,4275,0,1,1,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,4328,0,1,1,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,8763,0,1,0,0,0,1,0,0,0,...,0,0,,,,,,,,


In [92]:
# select only columns that contain NAs
cat_df.isna().sum()[cat_df.isna().sum()>0]
missing_cat = cat_df.isna().sum()[cat_df.isna().sum()>0].index
missing_cat

Index(['PRMSTYUN', 'PRMSTYPE', 'PRMSTYAN', 'PRMSTYAE', 'PRFSTYUN', 'PRFSTYPE',
       'PRFSTYAN', 'PRFSTYAE'],
      dtype='object')

Easy to clean. On inspection, these parenting-style dummies are mostly 0, so the NAs are filled with 0.

In [93]:
# fill NAs with 0
cat_df = cat_df.fillna(0)

## Handling missing data from numerical features

In [94]:
# select columns with NAs
num_df.isna().sum()[num_df.isna().sum()>0]
missing_num = num_df.isna().sum()[num_df.isna().sum()>0].index
missing_num

# see columns with data
num_df[missing_num][:3]

Unnamed: 0,HHINC97,POVRAT97,PRMONM,PRMONF,SFR,SMR
0,64000.0,402.0,14.0,14.0,12.0,12.0
1,6000.0,38.0,12.0,12.0,16.0,12.0
2,88252.0,555.0,,,,6.0


### Columns with Missing numerical values
- HHINC97: Gross household income, $, in year prior to 1997 interview
- POVRAT97: Ratio of household income to poverty level, 1997
- PRMONM: Monitoring by mother
- PRMONF: Monitoring by father
- SFR: Years of schooling of residential Father
- SMR: Years of schooling of residential Mother

I need to see the range of each to determine the necessary actions.

In [95]:
num_df[missing_num].describe()

Unnamed: 0,HHINC97,POVRAT97,PRMONM,PRMONF,SFR,SMR
count,1205.0,1203.0,851.0,698.0,1197.0,1430.0
mean,58310.67,362.26,9.8,7.48,13.6,13.43
std,43868.05,294.23,3.05,3.7,2.95,2.66
min,0.0,0.0,0.0,0.0,3.0,1.0
25%,32000.0,190.5,8.0,5.0,12.0,12.0
50%,50500.0,302.0,10.0,8.0,13.0,13.0
75%,72000.0,441.0,12.0,10.0,16.0,16.0
max,246474.0,1627.0,16.0,16.0,20.0,20.0


### Actions required: Imputation
- HHINC97: mean values
- POVRAT97: mean values
- PRMONM: median values
- PRMONF: median values
- SFR: median values
- SMR: median values

In [96]:
# HHINC97 mean imputation
num_df['HHINC97'] = num_df['HHINC97'].fillna(num_df['HHINC97'].mean())

# POVRAT97 mean imputation
num_df['POVRAT97'] = num_df['POVRAT97'].fillna(num_df['POVRAT97'].mean())

# Median imputation for the remaining columns
features = ["PRMONM","PRMONF","SFR","SMR"]

for col in features:
    num_df[col] = num_df[col].fillna(num_df[col].median())



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
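
The message above is pandas' `SettingWithCopyWarning`. It typically appears when columns are assigned on a frame that was itself selected out of another frame, as `num_df` was from `df`. A minimal sketch of one common way to avoid it is to take an explicit `.copy()` before imputing; the `toy` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for df (values invented).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "HHINC97": [64000.0, 6000.0, np.nan, 44000.0],
    "PRMONM": [14.0, np.nan, 6.0, 10.0],
})

# An explicit copy makes the subset an independent frame,
# so later column assignments are unambiguous.
subset = toy[["ID", "HHINC97", "PRMONM"]].copy()

# Mean imputation for income, median imputation for the monitoring score.
subset["HHINC97"] = subset["HHINC97"].fillna(subset["HHINC97"].mean())
subset["PRMONM"] = subset["PRMONM"].fillna(subset["PRMONM"].median())

print(subset)
```

The original `toy` frame keeps its missing values; only `subset` is filled.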






In [74]:
num_df[missing_num].isna().sum()

HHINC97     0
POVRAT97    0
PRMONM      0
PRMONF      0
SFR         0
SMR         0
dtype: int64

### NA filled, we can merge the data
At this point, I could add one more step: standardising the values.

But maybe for my next submission?
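
The standardising step mentioned above was not carried out in this notebook. As a rough sketch of what it might look like, scikit-learn's `StandardScaler` rescales each column to mean 0 and unit variance. The `toy` frame is invented, and in practice the scaler would usually be fitted on the training split only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two numerical features.
toy = pd.DataFrame({
    "S": [12.0, 17.0, 14.0, 18.0, 15.0],
    "HHINC97": [64000.0, 6000.0, 88252.0, 58000.0, 44188.0],
})

scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy),   # subtract column mean, divide by column std
    columns=toy.columns,
    index=toy.index,
)

print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```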

## Merging the data

In [109]:
df_new = num_df.merge(cat_df, on="ID")

In [114]:
# errors="ignore" keeps this cell safe to re-run once ID has already been dropped
df_new = df_new.drop(columns="ID", errors="ignore")

In [116]:
df_new[:3]

Unnamed: 0,EARNINGS,S,EXP,BYEAR,AGE,HHINC97,POVRAT97,PRMONM,PRMONF,ASVABAR,...,FAITHO,FAITHM,PRMSTYUN,PRMSTYPE,PRMSTYAN,PRMSTYAE,PRFSTYUN,PRFSTYPE,PRFSTYAN,PRFSTYAE
0,18.5,12,9.71,1984,27,64000.0,402.0,14.0,14.0,0.12,...,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,19.23,17,5.71,1982,29,6000.0,38.0,12.0,12.0,0.45,...,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,39.05,14,9.94,1981,30,88252.0,555.0,10.0,8.0,0.42,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


I learnt from previous projects to check after merging...
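
One way to make that post-merge check explicit is to let pandas validate the key and then compare row counts. This is a sketch on invented frames; `validate` is an existing `DataFrame.merge` parameter, everything else here is illustrative.

```python
import pandas as pd

# Hypothetical numerical and categorical halves sharing an ID key.
left = pd.DataFrame({"ID": [1, 2, 3], "EARNINGS": [18.5, 19.23, 39.05]})
right = pd.DataFrame({"ID": [1, 2, 3], "FEMALE": [0, 0, 1]})

# validate raises pandas.errors.MergeError if ID is repeated on either side.
merged = left.merge(right, on="ID", how="inner", validate="one_to_one")

# An inner one-to-one merge should neither add nor lose rows.
assert len(merged) == len(left) == len(right)
print(merged)
```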

# Split Training & Test Dataset

We *can't* use all the entries in our dataset to train our model. Keep 30% of the data for later as a testing dataset (out-of-sample data).  

In [117]:
earnings = df_new["EARNINGS"]
df_new = df_new.drop("EARNINGS", axis=1)

In [118]:
X_train, X_test, y_train, y_test = train_test_split(df_new, earnings,
                                                    test_size = .3, 
                                                    random_state = 1047)

### Checking the outputs to see if they are okay

In [119]:
len(X_train), len(X_test), len(y_train), len(y_test)

(1040, 447, 1040, 447)

# Simple Linear Regression

Only use the years of schooling to predict earnings. Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data? 

In [133]:
S = pd.DataFrame(X_train["S"])


In [134]:
%%time
LR = LinearRegression()
LR.fit(S, y_train)

CPU times: total: 0 ns
Wall time: 2.08 ms


LinearRegression()

### Evaluate the Coefficients of the Model

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative). 

Interpret the regression. How many extra dollars can one expect to earn for an additional year of schooling?

In [135]:
print(f"For every additional year of schooling, earnings increase by ${LR.coef_[0]:.2f} per hour.")

For every additional year of schooling, earnings increase by $1.29 per hour.


### Analyse the Estimated Values & Regression Residuals

How good our regression is also depends on the residuals - the difference between the model's predictions ($\hat{y}_i$) and the true values ($y_i$) inside `y_train`. Do you see any patterns in the distribution of the residuals?

In [155]:
S_score = LR.score(S, y_train)

In [156]:
print(f"Years of schooling alone account for only {S_score*100:.2f}% of the variance in earnings.")

Years of schooling alone account for only 8.33% of the variance in earnings.
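
The residual question posed in this section is not answered by the R² alone. A minimal sketch of how the residuals could be computed and summarised, using synthetic data in place of `S` and `y_train` (the generated numbers are made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins for years of schooling and hourly earnings.
schooling = pd.DataFrame({"S": rng.integers(8, 21, size=200)})
earnings = 1.3 * schooling["S"] + rng.normal(0, 5, size=200)

model = LinearRegression().fit(schooling, earnings)
predicted = model.predict(schooling)
residuals = earnings - predicted   # true value minus prediction, per row

# With an intercept, OLS residuals average to zero on the training data;
# the skew hints at whether errors are lopsided.
print(f"mean residual: {residuals.mean():.4f}")
print(f"skew: {residuals.skew():.4f}")
```

In the notebook, a histogram of the residuals or a scatter of residuals against predictions would show whether the errors are symmetric or fan out at higher earnings.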


# Multivariable Regression

Now use both years of schooling and years of work experience to predict earnings. How high is the r-squared for the regression on the training data? 

In [164]:
LR1 = LinearRegression()
LR1.fit(X_train[["S", "EXP"]], y_train)
LR1_score = LR1.score(X_train[["S", "EXP"]], y_train)

In [165]:
LR1_score

0.12469467150035185

In [166]:
LR1.coef_

array([1.93609408, 1.08040541])

In [167]:
print(f"Years of schooling and work experience together account for {LR1_score*100:.2f}% of the variance in earnings.")

Years of schooling and work experience together account for 12.47% of the variance in earnings.


### Evaluate the Coefficients of the Model

In [168]:
print(f"For every additional year of schooling, earnings increase by ${LR1.coef_[0]:.2f} per hour, holding work experience constant." )
print(f"For every additional year of work experience, earnings increase by ${LR1.coef_[1]:.2f} per hour, holding schooling constant." )

For every additional year of schooling, earnings increase by $1.94 per hour, holding work experience constant.
For every additional year of work experience, earnings increase by $1.08 per hour, holding schooling constant.


Note: there could be an interaction effect, where more years of work increase the effect of education on earnings.  
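
A hedged sketch of how such an interaction could be tested: add the product of `S` and `EXP` as an extra column and refit. The column name `S_x_EXP` and the synthetic data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = pd.DataFrame({
    "S": rng.integers(8, 21, size=300).astype(float),
    "EXP": rng.uniform(0, 15, size=300),
})
# Synthetic outcome that contains a true interaction of 0.1 * S * EXP.
y = 1.0 * X["S"] + 0.5 * X["EXP"] + 0.1 * X["S"] * X["EXP"] + rng.normal(0, 1, size=300)

# Hypothetical column name for the product term.
X_inter = X.assign(S_x_EXP=X["S"] * X["EXP"])

base = LinearRegression().fit(X, y)
inter = LinearRegression().fit(X_inter, y)

print(f"R² without interaction: {base.score(X, y):.4f}")
print(f"R² with interaction:    {inter.score(X_inter, y):.4f}")
print(f"Interaction coefficient: {inter.coef_[2]:.4f}")
```

A positive coefficient on the product term would mean each year of schooling is worth more to people with more work experience. On training data the R² can only stay the same or rise when a column is added, so the comparison is more informative on the test split.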

# Use Your Model to Make a Prediction

How much can someone with a bachelor's degree (12 + 4 years of schooling) and 5 years of work experience expect to earn in 2011?

In [177]:
test_pred = pd.DataFrame({"S":16, "EXP":5}, index=[1])
score = LR1.predict(test_pred)

In [182]:
print(f"A person with 16 years of schooling and 5 years of work experience is likely to earn ${score[0]:.2f} per hour in 2011.")

A person with 16 years of schooling and 5 years of work experience is likely to earn $19.94 per hour in 2011.
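
Under the hood, `predict` for a linear regression is the intercept plus each coefficient times its feature value. A small sketch on synthetic data (the numbers are invented, so the result will not match the figure printed above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = pd.DataFrame({
    "S": rng.integers(8, 21, size=100).astype(float),
    "EXP": rng.uniform(0, 15, size=100),
})
# Synthetic hourly earnings, for illustration only.
y = 2.0 * X["S"] + 1.0 * X["EXP"] - 15 + rng.normal(0, 3, size=100)

model = LinearRegression().fit(X, y)

person = pd.DataFrame({"S": [16.0], "EXP": [5.0]})
by_predict = model.predict(person)[0]
# Same calculation written out: intercept + coef_S * 16 + coef_EXP * 5
by_hand = model.intercept_ + model.coef_[0] * 16 + model.coef_[1] * 5

print(f"predict(): {by_predict:.4f}, by hand: {by_hand:.4f}")
```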


# Experiment and Investigate Further

Which other features could you consider adding to further improve the regression to better predict earnings? 

## Creating a scoring metric

In [188]:
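def RMSE(y_pred, y_true):
    """Root mean squared error between predictions and true values."""
    # sklearn's signature is (y_true, y_pred); the result is symmetric either way
    score = mean_squared_error(y_true, y_pred)
    score = np.sqrt(score)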
    
    return score
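
For reference, the quantity this function computes over $n$ observations, with $y_i$ the true values and $\hat{y}_i$ the predictions, is:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$

It is expressed in the same units as the target, here dollars per hour.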

## BASELINE MODEL WITH ALL FEATURES

In [185]:
%%time
LR2 = LinearRegression()
LR2.fit(X_train, y_train)
y_pred = LR2.predict(X_test)

CPU times: total: 0 ns
Wall time: 2.23 ms


LinearRegression()

In [194]:
LR2_score = RMSE(y_pred, y_test)
LR2_score

10.561814458929755

In [195]:
print(f"The root mean square error score for including ALL variables is {LR2_score:.4f}. That's quite high!")

The root mean square error score for including ALL variables is 10.5618. That's quite high!
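
Whether an RMSE of this size is high is easier to judge against a naive baseline that always predicts the training mean. A sketch with synthetic numbers (assumed for illustration; the notebook's `y_train` and `y_test` would take their place):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-ins for training and test earnings.
y_train_toy = rng.gamma(shape=2.5, scale=7.5, size=1000)
y_test_toy = rng.gamma(shape=2.5, scale=7.5, size=400)

# Baseline: predict the training mean for every test observation.
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())
baseline_rmse = np.sqrt(np.mean((y_test_toy - baseline_pred) ** 2))

print(f"Mean-only baseline RMSE: {baseline_rmse:.4f}")
```

A model adds value only to the extent that its test RMSE falls below this baseline. For context, the standard deviation of `EARNINGS` reported by `df.describe()` earlier is 12.00.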


In [200]:
# comparing with only 2 predictors

y_pred_2 = LR1.predict(X_test[["S", "EXP"]])
LR1_rmse = RMSE(y_pred_2, y_test)  # separate name so the R² stored in LR1_score is not overwritten

In [202]:
print(f"The root mean square error score for including only S and EXP is {LR1_rmse:.4f} (lower is better). That difference is negligible.")

The root mean square error score for including only S and EXP is 10.5634 (lower is better). That difference is negligible.


<h1> <span style='background:yellow'> There are many ways to reduce the RMSE </span> </h1>

<ol>
<li> Use a different model. There are many algorithms to predict continuous variables. 
    > The one I will demonstrate is 
<li> Feature selection (not all features contribute to the prediction)