# 2. Simple Regression


In [27]:
from wooldridge import dataWoo  # import only the loader that is actually used
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import numpy as np

dataWoo()

  J.M. Wooldridge (2016) Introductory Econometrics: A Modern Approach,
  Cengage Learning, 6th edition.

  401k       401ksubs    admnrev       affairs     airfare
  alcohol    apple       approval      athlet1     athlet2
  attend     audit       barium        beauty      benefits
  beveridge  big9salary  bwght         bwght2      campus
  card       catholic    cement        census2000  ceosal1
  ceosal2    charity     consump       corn        countymurders
  cps78_85   cps91       crime1        crime2      crime3
  crime4     discrim     driving       earns       econmath
  elem94_95  engin       expendshares  ezanders    ezunem
  fair       fertil1     fertil2       fertil3     fish
  fringe     gpa1        gpa2          gpa3        happiness
  hprice1    hprice2     hprice3       hseinv      htv
  infmrt     injury      intdef        intqrt      inven
  jtrain     jtrain2     jtrain3       kielmc      lawsch85
  loanapp    lowbrth     mathpnl       meap00_01   meap01
  meap93    

C1 The data in 401K are a subset of data analyzed by Papke (1995) to study the relationship between
participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the percentage of eligible workers with an active account; this is the variable we would like to explain. The
measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm
contributes to each worker’s plan for each $1 contribution by the worker. For example, if mrate = 0.50, then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.

In [28]:
df = dataWoo('401K')
df.head()

Unnamed: 0,prate,mrate,totpart,totelg,age,totemp,sole,ltotemp
0,26.1,0.21,1653.0,6322.0,8,8709.0,0,9.07
1,100.0,1.42,262.0,262.0,6,315.0,1,5.75
2,97.6,0.91,166.0,170.0,10,275.0,1,5.62
3,100.0,0.42,257.0,257.0,7,500.0,0,6.21
4,82.5,0.53,591.0,716.0,28,933.0,1,6.84


In [29]:
dataWoo('401K', description=True)

name of dataset: 401k
no of variables: 8
no of observations: 1534

+----------+---------------------------------+
| variable | label                           |
+----------+---------------------------------+
| prate    | participation rate, percent     |
| mrate    | 401k plan match rate            |
| totpart  | total 401k participants         |
| totelg   | total eligible for 401k plan    |
| age      | age of 401k plan                |
| totemp   | total number of firm employees  |
| sole     | = 1 if 401k is firm's sole plan |
| ltotemp  | log of totemp                   |
+----------+---------------------------------+

L.E. Papke (1995), “Participation in and Contributions to 401(k)
Pension Plans:Evidence from Plan Data,” Journal of Human Resources 30,
311-325. Professor Papke kindly provided these data. She gathered them
from the Internal Revenue Service’s Form 5500 tapes.


(i) Find the average participation rate and the average match rate in the sample of plans.

In [43]:
print("Average participation rate:", round(df['prate'].mean(), 2))
print("Average match rate:", round(df['mrate'].mean(), 2))

Average participation rate: 87.36
Average match rate: 0.73


(ii) Now, estimate the simple regression equation
$$
\widehat{prate} = \hat{\beta}_0 + \hat{\beta}_1 mrate
$$

and report the results along with the sample size and R-squared.

In [31]:
prate_hat = smf.ols("prate ~ 1 + mrate", data=df).fit()

print("results:", prate_hat.params)

print("R squared:", round(prate_hat.rsquared, 3))

print("Sample size:", prate_hat.nobs)

results: Intercept   83.08
mrate        5.86
dtype: float64
R squared: 0.075
Sample size: 1534.0
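
To make explicit what `smf.ols` computes here, the sketch below applies the textbook simple-regression formulas by hand. It uses only the five rows shown by `df.head()` above, so its numbers are an illustration of the mechanics and will not match the full-sample estimates. The variable names (`x_dev`, `y_dev`, `slope_check`, and so on) are chosen for this example.

```python
import numpy as np

# First five rows of the 401K data, copied from df.head() above.
# Illustration only: the reported estimates use all 1,534 plans.
mrate = np.array([0.21, 1.42, 0.91, 0.42, 0.53])
prate = np.array([26.1, 100.0, 97.6, 100.0, 82.5])

# Slope = sum of cross-deviations / sum of squared deviations of mrate
x_dev = mrate - mrate.mean()
y_dev = prate - prate.mean()
b1 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()

# Intercept makes the fitted line pass through the point of sample means
b0 = prate.mean() - b1 * mrate.mean()

# Cross-check against NumPy's degree-1 least-squares fit
slope_check, intercept_check = np.polyfit(mrate, prate, 1)

print("by hand:", b0, b1)
print("polyfit:", intercept_check, slope_check)
```

Both lines should agree up to floating-point error, since the two approaches solve the same least-squares problem.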


(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.

In [32]:
print('intercept:', round(prate_hat.params.iloc[0], 2))

intercept: 83.08
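
One way to read the two estimates: the intercept is the predicted participation rate for a plan with no match (mrate = 0), about 83.08 percent, and the coefficient on mrate is the change in predicted prate, in percentage points, when the match rate rises by one dollar per dollar contributed. The sketch below checks both readings using the rounded estimates reported above; the helper `predicted_prate` is a name introduced for this example.

```python
# Rounded estimates copied from the regression output above
b0, b1 = 83.08, 5.86

def predicted_prate(mrate):
    """Fitted participation rate (percent) for a given match rate."""
    return b0 + b1 * mrate

# Intercept: predicted participation rate when the firm matches nothing
no_match = predicted_prate(0.0)

# Slope: change in predicted prate when mrate rises by 1 (here from 0.5 to 1.5)
one_dollar_effect = predicted_prate(1.5) - predicted_prate(0.5)

print(no_match, round(one_dollar_effect, 2))
```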


(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is
happening here.

In [33]:
round(prate_hat.predict({'mrate': 3.5}), 2)

0   103.59
dtype: float64
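
A participation rate is a percentage, so it cannot exceed 100, yet the fitted line places no upper bound on its predictions. With mrate = 3.5, well above the sample average of 0.73, the line overshoots. As a rough check, assuming the rounded estimates reported above, the sketch below reproduces the prediction and finds the match rate at which the fitted line crosses 100.

```python
# Rounded estimates copied from the regression output above
b0, b1 = 83.08, 5.86

# Prediction at mrate = 3.5, reproducing the value printed above
pred = b0 + b1 * 3.5

# Match rate at which the fitted line reaches 100 percent
mrate_at_100 = (100 - b0) / b1

print(round(pred, 2), round(mrate_at_100, 2))
```

Any match rate above that crossing point yields a predicted participation rate that is impossible in practice.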

(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?

In [34]:
print("Percentage explained:", round(prate_hat.rsquared * 100, 1))

Percentage explained: 7.5
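
For reference, the R-squared used here is the share of the sample variation in prate accounted for by the fitted line. The symbols SST, SSE, and SSR below follow common textbook usage and are introduced for this note; they do not appear elsewhere in the notebook.

$$
R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}, \qquad
SST = \sum_{i=1}^{n} (prate_i - \overline{prate})^2, \quad
SSE = \sum_{i=1}^{n} (\widehat{prate}_i - \overline{prate})^2, \quad
SSR = \sum_{i=1}^{n} (prate_i - \widehat{prate}_i)^2
$$

With R-squared equal to 0.075, mrate accounts for about 7.5 percent of the variation in prate, which leaves roughly 92.5 percent to factors not in the model.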


C2 The data set in CEOSAL2 contains information on chief executive officers for U.S. corporations. The
variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as
company CEO.

In [35]:
df2 = dataWoo("CEOSAL2")
df2.head()

Unnamed: 0,salary,age,college,grad,comten,ceoten,sales,profits,mktval,lsalary,lsales,lmktval,comtensq,ceotensq,profmarg
0,1161,49,1,1,9,2,6200.0,966,23200.0,7.06,8.73,10.05,81,4,15.58
1,600,43,1,1,10,10,283.0,48,1100.0,6.4,5.65,7.0,100,100,16.96
2,379,51,1,1,9,3,169.0,40,1100.0,5.94,5.13,7.0,81,9,23.67
3,651,55,1,0,22,22,1100.0,-54,1000.0,6.48,7.0,6.91,484,484,-4.91
4,497,44,1,1,8,6,351.0,28,387.0,6.21,5.86,5.96,64,36,7.98


In [36]:
dataWoo("CEOSAL2", description=True)

name of dataset: ceosal2
no of variables: 15
no of observations: 177

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| salary   | 1990 compensation, $1000s      |
| age      | in years                       |
| college  | =1 if attended college         |
| grad     | =1 if attended graduate school |
| comten   | years with company             |
| ceoten   | years as ceo with company      |
| sales    | 1990 firm sales, millions      |
| profits  | 1990 profits, millions         |
| mktval   | market value, end 1990, mills. |
| lsalary  | log(salary)                    |
| lsales   | log(sales)                     |
| lmktval  | log(mktval)                    |
| comtensq | comten^2                       |
| ceotensq | ceoten^2                       |
| profmarg | profits as % of sales          |
+----------+--------------------------------+

See CEOSAL1.RAW


(i) Find the average salary and the average tenure in the sample.

In [37]:
print("Average Salary:", round(df2['salary'].mean(), 3))
print("Average ceoten:", round(df2["ceoten"].mean(), 2))

Average Salary: 865.864
Average ceoten: 7.95


(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure
as a CEO?

In [38]:
print("Number of first year CEO:", (df2['ceoten'] == 0).sum())
print("Longest Tenure:", df2["ceoten"].max())

Number of first year CEO: 5
Longest Tenure: 37


(iii) Estimate the simple regression model 
$$
\log(salary) = \beta_0 + \beta_1 ceoten + u,
$$
and report your results in the usual form. What is the (approximate) predicted percentage
increase in salary given one more year as a CEO?

In [39]:
log_salary_hat = smf.ols("np.log(salary) ~ 1 + ceoten", data=df2).fit()

print("Parameters:\n", log_salary_hat.params, sep='')

print("Percentage increase:", round(log_salary_hat.params.iloc[1] * 100, 2))

Parameters:
Intercept   6.51
ceoten      0.01
dtype: float64
Percentage increase: 0.97
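
Multiplying the slope by 100 is the usual log approximation to the percentage change. The exact change implied by a log-level model is 100 · (exp(slope) − 1). The sketch below compares the two, assuming a slope of 0.0097, which is backed out from the printed percentage increase of 0.97 and is therefore itself rounded.

```python
import math

# Slope on ceoten implied by "Percentage increase: 0.97" above (rounded;
# the parameter table displays it only as 0.01)
b1 = 0.0097

approx_pct = 100 * b1                   # log approximation
exact_pct = 100 * (math.exp(b1) - 1)    # exact change implied by the log model

print(round(approx_pct, 4), round(exact_pct, 4))
```

For a slope this small the two figures are nearly identical, so the approximation is adequate here; the gap widens as the coefficient grows.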


C3 Use the data in SLEEP75 from Biddle and Hamermesh (1990) to study whether there is a tradeoff
between the time spent sleeping per week and the time spent in paid work. We could use either variable
as the dependent variable. For concreteness, estimate the model
$$
sleep = \beta_0 + \beta_1 totwrk + u,
$$
where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during the week.

In [44]:
df3 = dataWoo("sleep75")
df3.head()

Unnamed: 0,age,black,case,clerical,construc,educ,earns74,gdhlth,inlf,leis1,...,spwrk75,totwrk,union,worknrm,workscnd,exper,yngkid,yrsmarr,hrwage,agesq
0,32,0,1,0.0,0.0,12,0.0,0,1,3529,...,0,3438,0,3438,0,14,0,13,7.07,1024
1,31,0,2,0.0,0.0,14,9500.0,1,1,2140,...,0,5020,0,5020,0,11,0,0,1.43,961
2,44,0,3,0.0,0.0,17,42500.0,1,1,4595,...,1,2815,0,2815,0,21,0,0,20.53,1936
3,30,0,4,0.0,0.0,12,42500.0,1,1,3211,...,1,3786,0,3786,0,12,0,12,9.62,900
4,64,0,5,0.0,0.0,14,2500.0,1,1,4052,...,1,2580,0,2580,0,44,0,33,2.75,4096


In [45]:
dataWoo("sleep75", description=True)

name of dataset: sleep75
no of variables: 34
no of observations: 706

+----------+--------------------------------+
| variable | label                          |
+----------+--------------------------------+
| age      | in years                       |
| black    | =1 if black                    |
| case     | identifier                     |
| clerical | =1 if clerical worker          |
| construc | =1 if construction worker      |
| educ     | years of schooling             |
| earns74  | total earnings, 1974           |
| gdhlth   | =1 if in good or excel. health |
| inlf     | =1 if in labor force           |
| leis1    | sleep - totwrk                 |
| leis2    | slpnaps - totwrk               |
| leis3    | rlxall - totwrk                |
| smsa     | =1 if live in smsa             |
| lhrwage  | log hourly wage                |
| lothinc  | log othinc, unless othinc < 0  |
| male     | =1 if male                     |
| marr     | =1 if married                  |
| prot    