# Tutorial 4: Chapter 3

## Goals:

The goal of this lab is to use real data to explore important concepts from Chapter 3 of Wooldridge. In this lab we will explore concepts related to:


• Estimation: Multiple Regression

• Omitted Variable Bias


## Basics

1. Always include your import statements. Remember that you can add to the import statements at any time,
 as long as you rerun your code afterwards. As always, copying and pasting what is inside these notebooks will
 suffice.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm
from IPython.display import Image

## Review: Loading the Data, Summary Statistics

In [4]:
df = pd.read_stata("C:/Users/patri/Desktop/Metrics TA/Intro-To-Econometrics-In-Python/Datasets/Lab 4.dta")
df

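| | wage | abil | educ | ne | nc | west | south | exper | motheduc | fatheduc | ... | urban | ne18 | nc18 | south18 | west18 | urban18 | tuit17 | tuit18 | expersq | ctuit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12.019231 | 5.027738 | 15 | 0 | 0 | 1 | 0 | 9 | 12 | 12 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 7.582914 | 7.260242 | 81 | -0.322671 |
| 1 | 8.912656 | 2.037170 | 13 | 1 | 0 | 0 | 0 | 8 | 12 | 10 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 8.595144 | 9.499537 | 64 | 0.904392 |
| 2 | 15.514334 | 2.475895 | 15 | 1 | 0 | 0 | 0 | 11 | 12 | 16 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 7.311346 | 7.311346 | 121 | 0.000000 |
| 3 | 13.333333 | 3.609240 | 15 | 1 | 0 | 0 | 0 | 6 | 12 | 12 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 9.499537 | 10.162070 | 36 | 0.662534 |
| 4 | 11.070110 | 2.636546 | 13 | 1 | 0 | 0 | 0 | 15 | 12 | 15 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 7.311346 | 7.311346 | 225 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1225 | 7.735584 | 2.803173 | 12 | 0 | 0 | 0 | 1 | 9 | 12 | 12 | ... | 1 | 0 | 0 | 1 | 0 | 1 | 3.895709 | 3.810777 | 81 | -0.084932 |
| 1226 | 91.309219 | 4.164562 | 19 | 0 | 0 | 1 | 0 | 6 | 13 | 14 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0.000000 | 0.000000 | 36 | 0.000000 |
| 1227 | 12.980769 | 0.893115 | 16 | 0 | 0 | 0 | 1 | 11 | 14 | 16 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 2.444079 | 2.444079 | 121 | 0.000000 |
| 1228 | 12.500000 | -0.633061 | 8 | 0 | 0 | 0 | 1 | 19 | 6 | 10 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 7.582914 | 7.582914 | 361 | 0.000000 |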


Now, as before, let's clean the data.

In particular, let's delete all rows where any of "wage", "abil", "educ", or "exper" is missing.

Let's also create a variable lnWage, which is the natural logarithm of wage.



In [10]:
df = df.dropna(subset=['wage','abil','educ', 'exper'])

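# As before, we need to drop rows where wage is 0, since ln(0) is undefined.
# Taking a copy avoids pandas' SettingWithCopyWarning when we add a column below.
df = df[df.wage != 0].copy()
df['lnWage'] = np.log(df.wage)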


## Review: Simple Regression, Correlation

1. To get an initial understanding of the relationship between the variables of interest (“wage”, “abil”, “educ”, and “exper”) calculate the correlation between:

(a) log of wage, highest grade completed, experience, ability.



In [11]:
# Make a new dataframe of only the columns we are interested in
# Make sure it is a copy so we are not changing the original df
corrDf = df[['lnWage','educ', 'exper','abil']].copy()
corrDf.corr()


| | lnWage | educ | exper | abil |
|---|---|---|---|---|
| lnWage | 1.0 | 0.401943 | -0.186335 | 0.366225 |
| educ | 0.401943 | 1.0 | -0.684677 | 0.594033 |
| exper | -0.186335 | -0.684677 | 1.0 | -0.445563 |
| abil | 0.366225 | 0.594033 | -0.445563 | 1.0 |
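
The entries of the matrix above are Pearson correlations, which is the default for `DataFrame.corr()`. As a minimal sketch on made-up data (the arrays `x` and `y` are hypothetical and are not columns from the lab dataset), each entry can be reproduced as the sample covariance divided by the product of the sample standard deviations:

```python
import numpy as np

# Hypothetical synthetic data, NOT the lab dataset.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)

# Pearson correlation by hand: cov(x, y) / (sd(x) * sd(y)),
# using the same degrees-of-freedom correction (ddof=1) throughout.
cov_xy = np.cov(x, y, ddof=1)[0, 1]
manual_corr = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The same quantity from NumPy's built-in correlation matrix.
builtin_corr = np.corrcoef(x, y)[0, 1]

print(manual_corr, builtin_corr)
```

The two printed numbers should agree up to floating-point error, which is a quick way to confirm what the correlation matrix is reporting.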


2. Run a simple regression with natural log of hourly wages in 1991 as the dependent variable, and highest grade completed as your independent variable. 




In [13]:
# Pass the Series directly so the regressor keeps its name ("educ") in the summary
X = sm.add_constant(df['educ'])
results = sm.OLS(df['lnWage'], X).fit()
print(results.summary())


                            OLS Regression Results                            
Dep. Variable:                 lnWage   R-squared:                       0.162
Model:                            OLS   Adj. R-squared:                  0.161
Method:                 Least Squares   F-statistic:                     236.6
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           5.80e-49
Time:                        11:43:16   Log-Likelihood:                -995.16
No. Observations:                1230   AIC:                             1994.
Df Residuals:                    1228   BIC:                             2005.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.0923      0.087     12.513      0.0
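
In a simple regression, the OLS slope has a closed form: the sample covariance between the regressor and the dependent variable divided by the sample variance of the regressor. The sketch below checks this on simulated data. The variable names and all numbers in the data-generating process are assumptions chosen for illustration, not values from the lab dataset.

```python
import numpy as np

# Hypothetical simulated data, NOT the lab dataset.
rng = np.random.default_rng(1)
educ = rng.normal(13, 2, size=1000)
ln_wage = 1.0 + 0.1 * educ + rng.normal(0, 0.5, size=1000)

# Closed-form simple regression estimates.
slope = np.cov(educ, ln_wage, ddof=1)[0, 1] / np.var(educ, ddof=1)
intercept = ln_wage.mean() - slope * educ.mean()

# The same estimates from a generic least-squares solver.
X = np.column_stack([np.ones_like(educ), educ])
coef, *_ = np.linalg.lstsq(X, ln_wage, rcond=None)

print("closed form:  ", intercept, slope)
print("least squares:", coef[0], coef[1])
```

Both approaches should give the same intercept and slope, which is also what `sm.OLS` computes for a model with a constant and one regressor.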

## New: Multiple Regression


Now we can run a regression with multiple independent variables. The steps are identical; we just have to list all the variables we want to include as independent variables.

Let's start with lnWage as the dependent variable, and education and ability as the independent variables.

The results are below. Let's start by interpreting the slope parameters in the first multiple regression, where education and ability are the independent variables. Here, we estimate that a one-year increase in education is associated with an approx. 7.1% increase in hourly wage, holding ability constant. We also see that a one-unit increase in the ability measure is associated with an approx. 5.3% increase in wages, holding education constant.
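
Reading a coefficient in a log-level model as a percentage change is an approximation. A minimal sketch, assuming the education coefficient is roughly 0.071 (the rounded value implied by the 7.1% figure above, not the full-precision estimate), compares the approximation $100 \cdot \beta$ with the exact change $100 \cdot (e^{\beta} - 1)$:

```python
import math

# Assumption: rounded coefficient implied by the "approx. 7.1%" figure in the text.
beta_educ = 0.071

# Approximate percentage change in wage for one more year of education.
approx_pct = 100 * beta_educ

# Exact percentage change implied by the log-level model.
exact_pct = 100 * (math.exp(beta_educ) - 1)

print(f"approximate: {approx_pct:.2f}%  exact: {exact_pct:.2f}%")
```

For coefficients this small the two numbers are close, which is why the approximation is commonly used; the gap grows as the coefficient gets larger.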

We can see right away that our estimate of the returns to education has changed as we add ability to the model. The estimate of the returns to education has fallen by about 3 percentage points (from approx. 10.1% to approx. 7.1%). This is because we were omitting ability in our simple regression above, meaning ability was in the error term of our original model, and we had omitted variable bias. We can see that, on average, we would be overestimating the returns to education if we left ability out of the regression. This is because the correlation between ability and education is positive ($\widehat{\delta} > 0$), and higher ability is likely associated with higher wages ($\beta_{ability} > 0$).
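
The direction of the bias follows from the standard omitted variable bias identity. As a sketch, in notation introduced here for illustration (the subscripted symbols are assumptions, not the lab's own notation): let $\tilde{\beta}_{educ}$ be the slope from the simple regression of lnWage on educ, let $\widehat{\beta}_{educ}$ and $\widehat{\beta}_{ability}$ be the slopes from the regression that also includes abil, and let $\widehat{\delta}$ be the slope from regressing abil on educ. Then, in any sample,

$$
\tilde{\beta}_{educ} = \widehat{\beta}_{educ} + \widehat{\beta}_{ability}\,\widehat{\delta}
$$

Taking expectations conditional on the regressors, and assuming the model with ability is correctly specified, the bias of the simple-regression slope is

$$
E\left[\tilde{\beta}_{educ}\right] - \beta_{educ} = \beta_{ability}\,\widehat{\delta}
$$

which is positive when $\beta_{ability} > 0$ and $\widehat{\delta} > 0$, and negative when the two have opposite signs.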


In [23]:

indVars = df[['educ', 'abil']]
dependant = df['lnWage']
indVars = sm.add_constant(indVars)

results = sm.OLS(dependant, indVars).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 lnWage   R-squared:                       0.187
Model:                            OLS   Adj. R-squared:                  0.185
Method:                 Least Squares   F-statistic:                     140.8
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           8.94e-56
Time:                        11:54:37   Log-Likelihood:                -976.46
No. Observations:                1230   AIC:                             1959.
Df Residuals:                    1227   BIC:                             1974.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3808      0.098     14.096      0.0

In the second regression we add experience as an independent variable (but leave out ability). The results below show that our estimate of the return to schooling is approx. 13.1%, as opposed to approx. 10.1% when education is the only independent variable. The estimate gets larger here because there is a negative correlation between experience and education in our sample ($\widehat{\delta} < 0$), and there is a positive relationship between experience and wages ($\beta_{experience} > 0$). On average, we'd be underestimating the returns to education by leaving experience out of the model.
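
The sign argument above can be checked on simulated data. The sketch below is not the lab dataset: the data-generating process, the coefficient values, and the helper `ols` are all hypothetical and chosen only so that experience is negatively related to education while raising log wages. It compares the education slope with and without experience, and confirms that the gap equals the experience coefficient times the slope from regressing experience on education.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical data-generating process (all numbers are made up):
# experience falls with education, and both raise log wages.
educ = rng.normal(13, 2, size=n)
exper = 20 - 0.8 * educ + rng.normal(0, 2, size=n)
ln_wage = 0.5 + 0.12 * educ + 0.03 * exper + rng.normal(0, 0.4, size=n)

def ols(y, *regressors):
    """Return OLS coefficients [const, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_long = ols(ln_wage, educ, exper)   # regression that includes exper
b_short = ols(ln_wage, educ)         # regression that omits exper
delta = ols(exper, educ)[1]          # slope from regressing exper on educ

# Omitted variable bias identity: short slope = long slope + b_exper * delta.
implied_short = b_long[1] + b_long[2] * delta

print("educ slope, exper included:", b_long[1])
print("educ slope, exper omitted: ", b_short[1])
print("implied by the identity:   ", implied_short)
```

Because `delta` is negative and the experience coefficient is positive in this simulation, the slope from the regression that omits experience is smaller than the slope from the regression that includes it, matching the underestimation described above.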


In [24]:
indVars = df[['educ', 'exper']]
dependant = df['lnWage']
indVars = sm.add_constant(indVars)

results = sm.OLS(dependant, indVars).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 lnWage   R-squared:                       0.176
Model:                            OLS   Adj. R-squared:                  0.175
Method:                 Least Squares   F-statistic:                     131.4
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           1.92e-52
Time:                        11:58:21   Log-Likelihood:                -984.15
No. Observations:                1230   AIC:                             1974.
Df Residuals:                    1227   BIC:                             1990.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.3730      0.176      2.123      0.0

The final regression includes education, ability, and experience as independent variables. Understanding bias when there are multiple variables is more complicated, because omitting a relevant variable generally biases all of the coefficient estimates, not just one. Note: we are still not capturing the causal effect.

In [26]:
indVars = df[['educ', 'exper', 'abil']]
dependant = df['lnWage']
indVars = sm.add_constant(indVars)

results = sm.OLS(dependant, indVars).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                 lnWage   R-squared:                       0.204
Model:                            OLS   Adj. R-squared:                  0.202
Method:                 Least Squares   F-statistic:                     104.9
Date:                Sun, 13 Dec 2020   Prob (F-statistic):           1.86e-60
Time:                        11:59:27   Log-Likelihood:                -963.00
No. Observations:                1230   AIC:                             1934.
Df Residuals:                    1226   BIC:                             1954.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.6121      0.177      3.467      0.0