# Lars Lefgren; Matthew Lindquist and David Sims, (2012), Rich Dad, Smart Dad: Decomposing the Intergenerational Transmission of Income, Journal of Political Economy, 120, (2), 268 - 303

Using data from the Lefgren et al. study, I demonstrate simple OLS in Python using the statsmodels library.

In [1]:
#importing modules....
from __future__ import division
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [2]:
# Read in the data (pd.read_stata already returns a DataFrame)
df = pd.read_stata(r"C:\Users\Rodgers\Desktop\PhD courses\PhD courses\EconS 593 Cowan\nls80.dta")
df.head()

Unnamed: 0,wage,hours,iq,kww,educ,exper,tenure,age,married,black,south,urban,sibs,brthord,meduc,feduc,lwage
0,769,40,93,35,12,11,2,31,1,0,0,1,1,2.0,8.0,8.0,6.645091
1,808,50,119,41,18,11,16,37,1,0,0,1,1,,14.0,14.0,6.694562
2,825,40,108,46,14,11,9,33,1,0,0,1,1,2.0,14.0,14.0,6.715384
3,650,40,96,32,12,13,7,32,1,0,0,1,4,3.0,12.0,12.0,6.476973
4,562,40,74,27,11,14,5,34,1,0,0,1,10,6.0,6.0,11.0,6.331502


In [3]:
# Fill the missing values in the father's education column (feduc) with zeros.
# Without this step, the regressions below fail on the NaN values.
df["feduc"] = df["feduc"].fillna(0)
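
Zero-filling keeps every row but treats a missing value as zero years of schooling, whereas dropping rows shrinks the sample instead. The following is a minimal sketch of the two options on synthetic data. The frame `toy`, the helper `slope`, and all numbers are assumptions made for illustration and are not taken from the NLS80 file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values, not NLS80).
rng = np.random.default_rng(0)
n = 500
feduc = rng.integers(6, 19, size=n).astype(float)          # years of schooling, 6..18
lwage = 6.0 + 0.03 * feduc + rng.normal(0.0, 0.4, size=n)  # true slope is 0.03
toy = pd.DataFrame({"lwage": lwage, "feduc": feduc})

# Knock out 100 values of feduc to mimic missing data.
toy.loc[rng.choice(n, size=100, replace=False), "feduc"] = np.nan


def slope(frame):
    """OLS slope of lwage on feduc (intercept included), via np.polyfit."""
    return np.polyfit(frame["feduc"], frame["lwage"], deg=1)[0]


filled = toy.assign(feduc=toy["feduc"].fillna(0))  # approach used in this notebook
dropped = toy.dropna(subset=["feduc"])             # alternative: drop incomplete rows

print("zero-filled: rows =", len(filled), " slope =", slope(filled))
print("dropped:     rows =", len(dropped), " slope =", slope(dropped))
```

Comparing the two printed slopes against the true value of 0.03 gives a feel for how much the choice matters on a given data set.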


In [4]:
# Alternative (not used): purge the rows that have infs and NaNs
#df[~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)]
# Ignore this piece of code; I'm just toying around with the model to see how it performs.


In [5]:
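# Let's run an OLS model with this data.
# I run this regression with the missing values set to zero. I thought that this was better
# than purging all the rows with missing data from the set.
# The intended model is a simple one: lwage = a + b*feduc + e
# Note: sm.OLS does not add an intercept on its own, so as written (no sm.add_constant)
# this fits lwage = b*feduc + e, which is why the summary reports uncentered R-squared.
a=df.lwage
b=df.feduc
model=sm.OLS(a,b).fit()
model_prediction=model.predict(b)
model_details=model.summary()
print(model_details)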

                                 OLS Regression Results                                
Dep. Variable:                  lwage   R-squared (uncentered):                   0.726
Model:                            OLS   Adj. R-squared (uncentered):              0.725
Method:                 Least Squares   F-statistic:                              2470.
Date:                Mon, 06 Apr 2020   Prob (F-statistic):                   1.50e-264
Time:                        08:54:02   Log-Likelihood:                         -2513.3
No. Observations:                 935   AIC:                                      5029.
Df Residuals:                     934   BIC:                                      5033.
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
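
The summary above reports "R-squared (uncentered)" because the regressor matrix contains no constant. Uncentered R-squared measures fit against zero rather than against the mean of the dependent variable, so it is not comparable with the usual centered figure. Here is a minimal numpy sketch of the difference on synthetic data; the variable names and numbers are assumptions for illustration only.

```python
import numpy as np

# Synthetic data with a large intercept, loosely shaped like log wages (hypothetical).
rng = np.random.default_rng(1)
n = 1000
x = rng.integers(6, 19, size=n).astype(float)
y = 6.0 + 0.03 * x + rng.normal(0.0, 0.4, size=n)

# Regression through the origin: y = b*x + e (no constant in the design).
b_origin = (x @ y) / (x @ x)
resid_origin = y - b_origin * x
# Uncentered R-squared compares residuals with the raw sum of squares of y.
r2_uncentered = 1.0 - (resid_origin @ resid_origin) / (y @ y)

# Regression with an intercept: y = a + b*x + e.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
# Centered R-squared compares residuals with variation of y around its mean.
y_dev = y - y.mean()
r2_centered = 1.0 - (resid @ resid) / (y_dev @ y_dev)

print("through origin: slope =", b_origin, " uncentered R2 =", r2_uncentered)
print("with intercept: slope =", beta[1], " centered R2 =", r2_centered)
```

In statsmodels, passing `sm.add_constant(b)` instead of `b` would estimate the intercept and switch the summary to the centered statistic.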

In [12]:
# IV regression.
#from statsmodels.sandbox.regression.gmm import IV2SLS
#import seaborn as sns
# Let's estimate the following IV regression:
# lnwage = a + iq + educ + age + feduc + e
# We will use this same equation to do the IV estimate (where we will assume that feduc instruments for educ).
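
The IV idea in the notes above can be sketched as manual two-stage least squares. This is an illustration on synthetic data, not the estimate for this data set: the variables `z` (standing in for feduc), `u` (an unobserved confounder), and every coefficient are assumptions. An instrument has to be excluded from the wage equation, so `z` enters only the first stage here.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)   # instrument (hypothetical stand-in for feduc)
u = rng.normal(size=n)   # unobserved confounder, e.g. ability
educ = 12.0 + 1.0 * z + 1.0 * u + rng.normal(size=n)
lwage = 5.0 + 0.05 * educ + 0.3 * u + rng.normal(0.0, 0.3, size=n)  # true effect 0.05


def ols(X, y):
    """Least-squares coefficients for y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)

# Naive OLS of lwage on educ: biased because educ is correlated with u.
beta_ols = ols(np.column_stack([const, educ]), lwage)

# First stage: project educ on the instrument.
Z = np.column_stack([const, z])
educ_hat = Z @ ols(Z, educ)

# Second stage: regress lwage on the fitted values of educ.
beta_iv = ols(np.column_stack([const, educ_hat]), lwage)

print("OLS estimate of the educ effect: ", beta_ols[1])
print("2SLS estimate of the educ effect:", beta_iv[1])
```

Running the second stage by hand gives the right point estimate but not the right standard errors; a dedicated routine such as the `IV2SLS` class imported in the commented line above handles that correction.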


In [6]:
# Multiple regression
# Let's do this one: lwage = a + iq + age + tenure + feduc + educ + e
from sklearn import linear_model
y=df['lwage']
x=df[['iq','age','tenure','feduc','educ']]
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 4.945151934695261
Coefficients: 
 [0.00542948 0.01960038 0.01216089 0.00832393 0.03565223]


In [7]:
#With statsmodels.api
x = sm.add_constant(x)
model=sm.OLS(y,x).fit()
model_prediction=model.predict(x)
model_details=model.summary()
print(model_details)

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.189
Model:                            OLS   Adj. R-squared:                  0.185
Method:                 Least Squares   F-statistic:                     43.30
Date:                Mon, 06 Apr 2020   Prob (F-statistic):           3.49e-40
Time:                        08:54:33   Log-Likelihood:                -419.71
No. Observations:                 935   AIC:                             851.4
Df Residuals:                     929   BIC:                             880.5
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9452      0.166     29.779      0.0

In [11]:
#Running OLS with the covariates we need to identify pi-1 and pi-2
import statsmodels.formula.api as smf
results = smf.ols('lwage ~ iq +age+tenure+feduc+educ',data=df).fit()
results_robust = results.get_robustcov_results()
print(results_robust.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.189
Model:                            OLS   Adj. R-squared:                  0.185
Method:                 Least Squares   F-statistic:                     42.62
Date:                Mon, 06 Apr 2020   Prob (F-statistic):           1.36e-39
Time:                        09:17:35   Log-Likelihood:                -419.71
No. Observations:                 935   AIC:                             851.4
Df Residuals:                     929   BIC:                             880.5
Df Model:                           5                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      4.9452      0.168     29.417      0.0
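
The "Covariance Type: HC1" line in the summary above refers to heteroskedasticity-robust (sandwich) standard errors with a degrees-of-freedom correction of n/(n-k). As a rough sketch of what that computation involves, the numpy code below builds both the classical and the HC1 standard errors on synthetic data. The data-generating process and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 800
x = rng.normal(size=n)
# Heteroskedastic errors: the error spread grows with |x| (hypothetical setup).
y = 1.0 + 0.5 * x + rng.normal(size=n) * (0.5 + np.abs(x))

X = np.column_stack([np.ones(n), x])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical (nonrobust) covariance: sigma^2 * (X'X)^-1
sigma2 = (resid @ resid) / (n - k)
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC1 covariance: n/(n-k) * (X'X)^-1 [X' diag(e_i^2) X] (X'X)^-1
meat = (X * resid[:, None] ** 2).T @ X
cov_hc1 = (n / (n - k)) * (XtX_inv @ meat @ XtX_inv)
se_hc1 = np.sqrt(np.diag(cov_hc1))

print("coefficients:   ", beta)
print("classical s.e.: ", se_classical)
print("HC1 robust s.e.:", se_hc1)
```

The coefficients are the same under both covariance types, which matches the two summaries above: only the standard errors, t statistics, and F statistic change.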

In [None]:
# There it is: simple OLS in Python.