# Lab 4: Multivariate OLS Review
### Jake Lee

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import math
%matplotlib inline

In [2]:
data = 'ARC.dta'
df = pd.read_stata(data)
df.head()

Unnamed: 0,emplid,year,quarter,sex,educ,age,pay,apt,prod,turn,title,tenure
0,1.0,2018.0,1.0,female,4.0,30.0,18.57,18.0,50.5,0.0,associate,4.0
1,1.0,2018.0,2.0,female,4.0,31.0,18.57,17.0,49.299999,0.0,associate,7.0
2,1.0,2018.0,3.0,female,4.0,31.0,18.57,14.0,49.900002,0.0,associate,10.0
3,1.0,2018.0,4.0,female,4.0,31.0,18.57,16.0,54.099998,0.0,associate,13.0
4,1.0,2019.0,1.0,female,4.0,31.0,18.57,21.0,55.700001,0.0,associate,16.0


# Part 1
We have been exploring the relationship between experience on the job (`tenure`) and
employee productivity (`prod`). Now, let’s re-familiarize ourselves with the concept of omitted
variable bias.

# 3. 
Omitted variable bias is caused by something in the error term that is both correlated
with X and affects Y. In our regression of pay on productivity, let’s consider the
omitted variable `tenure`.


(a) What do you think the sign of the correlation is between prod and tenure? 

Positive


(b) What do you think the sign of the impact of tenure is on pay?

Positive 


(c) Combining these, what do you predict the sign of the omitted variable bias will
be?

Positive. The bias takes the sign of the product of (a) and (b), and both are positive.


(d) If you control for tenure, what should happen to your estimate of the effect of
pay on productivity?

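It should fall. With a positive bias, the regression that omits `tenure` overstates the productivity coefficient, so controlling for `tenure` should move the estimate down and make it more accurate.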
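The sign argument in (a) through (d) can be checked on simulated data. The sketch below uses made-up numbers (not the ARC data) in which tenure raises both productivity and pay, and the helper `ols` is a hypothetical convenience function written for this illustration. It also verifies the in-sample identity that the slope with tenure omitted equals the slope with tenure controlled plus the tenure coefficient times the slope from regressing tenure on productivity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic data (NOT the ARC data): tenure raises both productivity and pay.
tenure = rng.uniform(0, 60, n)
prod = 45 + 0.10 * tenure + rng.normal(0, 3, n)
pay = 16 + 0.02 * prod + 0.03 * tenure + rng.normal(0, 1, n)

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slopes...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_short = ols(pay, prod)           # tenure omitted
b_long = ols(pay, prod, tenure)    # tenure controlled
delta = ols(tenure, prod)[1]       # slope from regressing tenure on prod

# Identity: short slope = long slope + (tenure coefficient) * delta
implied_short = b_long[1] + b_long[2] * delta

print(f"prod slope, tenure omitted:    {b_short[1]:.4f}")
print(f"prod slope, tenure controlled: {b_long[1]:.4f}")
print(f"long slope + bias term:        {implied_short:.4f}")
```

Because both the tenure coefficient and `delta` are positive in this setup, the slope with tenure omitted comes out larger than the slope with tenure controlled.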

# 4. 
Write down the multivariate OLS model that predicts pay based on productivity and
tenure.

$pay = \beta_0 + \beta_1 \, prod + \beta_2 \, tenure + u$

# 5.
Estimate a multivariate regression of pay on productivity and tenure.

In [3]:
X = df[['prod', 'tenure']]
y = df['pay']

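X = sm.add_constant(X)  # add the intercept column

model = sm.OLS(y, X)
results = model.fit(cov_type='HC3')  # heteroscedasticity-robust (HC3) standard errors

print(results.mse_total)  # total sum of squares / (n - 1), i.e. the sample variance of pay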

results.summary()

4.96881499550993


Dep. Variable:,pay,R-squared:,0.169
Model:,OLS,Adj. R-squared:,0.169
Method:,Least Squares,F-statistic:,615.9
Date:,"Tue, 06 Feb 2024",Prob (F-statistic):,4.69e-244
Time:,15:35:16,Log-Likelihood:,-12796.0
No. Observations:,6014,AIC:,25600.0
Df Residuals:,6011,BIC:,25620.0
Df Model:,2,,
Covariance Type:,HC3,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,15.7622,0.147,107.454,0.000,15.475,16.050
prod,0.0196,0.003,7.633,0.000,0.015,0.025
tenure,0.0279,0.001,28.240,0.000,0.026,0.030

Omnibus:,284.93,Durbin-Watson:,0.168
Prob(Omnibus):,0.0,Jarque-Bera (JB):,303.957
Skew:,0.525,Prob(JB):,9.92e-67
Kurtosis:,2.67,Cond. No.,430.0


Recall: 

standard error (SE): quantifies the uncertainty in an estimate. The standard error of a coefficient estimate measures how much that estimate would vary from sample to sample because of random sampling variability.

heteroscedasticity: the variance of the errors is not constant across values of the independent variables.

residual: the difference between the observed value of y and the value of y predicted by the regression model.
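To make these three definitions concrete, here is a minimal NumPy sketch on synthetic data (not the ARC data) whose error spread grows with x. It computes residuals, the classical standard errors, and HC3 standard errors by hand, using the standard HC3 form in which each squared residual is divided by (1 - leverage)^2. The variable names are illustrative only, and this is a sketch of the textbook formula rather than a copy of what statsmodels does internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Synthetic heteroscedastic data: the error spread grows with x.
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n) * 0.1 * x**2

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta                 # residual = observed y - predicted y

# Classical SE: assumes a single common error variance.
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC3 SE: each squared residual is scaled by 1 / (1 - h_ii)^2,
# where h_ii is the leverage of observation i.
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
weights = (resid / (1.0 - leverage)) ** 2
meat = X.T @ (X * weights[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", se_classical)
print("HC3 SE:      ", se_hc3)
```

In a design like this one, where the error variance is largest at the extreme values of x, the classical slope standard error tends to understate the uncertainty, which is the reason for fitting with `cov_type='HC3'`.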

# 7. 
Interpret your results

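(a) What is the interpretation of $\hat{\beta}_0$?

$\hat{\beta}_0 = 15.76$ is the predicted pay for an employee with `prod` = 0 and `tenure` = 0. It can be read loosely as a baseline wage, although the rows shown above have `prod` near 50, so zero productivity may lie outside the range of the data.

(b) What is the interpretation of $\hat{\beta}_1$?

A one-unit increase in `prod` is associated with a $\hat{\beta}_1 = 0.0196$ (about \$0.02) increase in `pay` on average, holding `tenure` constant (ceteris paribus).

(c) What is the interpretation of $\hat{\beta}_2$?

A one-unit increase in `tenure` is associated with a $\hat{\beta}_2 = 0.0279$ (about \$0.03) increase in `pay` on average, holding `prod` constant (ceteris paribus). In the rows shown above, `tenure` rises by 3 each quarter, which suggests it is measured in months rather than years.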


(d) Can you reject the two-sided null hypothesis that there is no effect of `tenure` on
`pay` at the 5% level, holding `prod` constant?

Holding `prod` constant and assuming $H_0$ is true, the p-value is the probability of observing a coefficient at least as extreme as the one we got in our sample of workers. The output reports z statistics, which rely on the large-sample normal approximation (reasonable with 6,014 observations).

The z statistic for `tenure` is 28.24 and the reported p-value is 0.000 (that is, it rounds to zero at three decimals), so we reject $H_0$ at the 5% level and conclude that `tenure` has a statistically significant effect on `pay`.

(e) Can you reject the two-sided null hypothesis that there is no effect of productivity on pay at the 5% level, holding tenure constant?

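Yes, by the same logic as (d): the z statistic for `prod` is 7.633 with a reported p-value of 0.000, so we reject $H_0$ at the 5% level.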
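The step from a z statistic to a two-sided p-value in (d) and (e) can be reproduced with the standard library alone. This is a small sketch, assuming the normal approximation that the output uses; `two_sided_p` is a hypothetical helper name, and the z values are copied from the regression table above.

```python
import math

def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|))
    return math.erfc(abs(z) / math.sqrt(2))

# z statistics as reported in the regression output above
z_prod, z_tenure = 7.633, 28.240

for name, z in [("prod", z_prod), ("tenure", z_tenure)]:
    p = two_sided_p(z)
    print(f"{name}: z = {z:.3f}, p = {p:.3g}, reject at 5%: {p < 0.05}")
```

Equivalently, at the 5% level the null is rejected whenever |z| exceeds roughly 1.96.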



# 8.
Can you reject the null hypothesis that there is no difference between the effect that
tenure has on pay and the effect that productivity has on pay at the 5% level?

In [4]:
H_0 = 'tenure = prod'

results.t_test(H_0)

<class 'statsmodels.stats.contrast.ContrastResults'>
                             Test for Constraints                             
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
c0             0.0084      0.003      2.703      0.007       0.002       0.014

From the results above, the estimated difference between the two coefficients is 0.0084 (z = 2.703, p = 0.007), so we can reject the null hypothesis at the 5% level and conclude that the coefficients on `tenure` and `prod` are statistically different.

In other words, a one-unit change in `tenure` and a one-unit change in `prod` are associated with different changes in `pay`, each holding the other variable constant. Note that the two variables are measured in different units, so this compares per-unit effects rather than overall importance.
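A test of equal coefficients is a test of one linear restriction, and its standard error depends on the covariance between the two estimates, not just their individual standard errors. The sketch below shows the mechanics on synthetic data (not the ARC data), using the classical covariance matrix for brevity instead of the HC3 one used in the lab. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3_000

# Synthetic data (not the ARC data) with two regressors on different scales.
x1 = rng.normal(50, 5, n)
x2 = rng.uniform(0, 60, n)
y = 16 + 0.02 * x1 + 0.03 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
V = XtX_inv * (resid @ resid / (n - X.shape[1]))  # classical covariance matrix

# H0: beta_2 - beta_1 = 0, written as R @ beta = 0
R = np.array([0.0, -1.0, 1.0])
diff = R @ beta
se_diff = np.sqrt(R @ V @ R)
z = diff / se_diff

# Same standard error from Var(b2) + Var(b1) - 2 * Cov(b1, b2)
se_manual = np.sqrt(V[2, 2] + V[1, 1] - 2 * V[1, 2])

print(f"difference = {diff:.4f}, se = {se_diff:.4f}, z = {z:.3f}")
```

The restriction-vector form generalizes to any single linear hypothesis about the coefficients.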

# 9.
Can you reject the null hypothesis that productivity and tenure jointly have no effect
on pay at the 5% level? 

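(The results from this test are reported in the F-statistic and its p-value in the upper right-hand corner of
your output.)


Yes. The F-statistic is 615.9 with Prob (F-statistic) = 4.69e-244, far below 0.05, so we reject the null hypothesis that both coefficients are zero. This agrees with 7(d) and 7(e), where each coefficient was individually significant.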
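For intuition about what the joint test computes, the following sketch builds a Wald statistic for "both slopes are zero" on synthetic data (not the ARC data). It assumes the classical covariance matrix, so the resulting F value can be checked against the familiar R-squared formula. The large-sample p-value uses the fact that a chi-square variable with 2 degrees of freedom has survival function exp(-x/2). With HC3 standard errors, as in the lab, the reported F will differ somewhat from the R-squared formula.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 3_000

# Synthetic data (not the ARC data)
x1 = rng.normal(50, 5, n)
x2 = rng.uniform(0, 60, n)
y = 16 + 0.02 * x1 + 0.03 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
k = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
V = XtX_inv * (resid @ resid / (n - k))  # classical covariance matrix

# H0: both slope coefficients are zero (q = 2 restrictions)
b_s = beta[1:]
V_s = V[1:, 1:]
wald = float(b_s @ np.linalg.solve(V_s, b_s))
f_stat = wald / 2
p_value = math.exp(-wald / 2)  # chi-square(2) survival function

# Cross-check against the R-squared form of the classical F statistic
tss = ((y - y.mean()) ** 2).sum()
r2 = 1 - (resid @ resid) / tss
f_from_r2 = (r2 / 2) / ((1 - r2) / (n - k))

print(f"Wald = {wald:.1f}, F = {f_stat:.1f}, F from R^2 = {f_from_r2:.1f}")
```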

# Part 2
# 1. 
Generate a dummy variable for female employees (1 if female, 0 if male).

In [5]:
df['female'] = (df['sex'] == 'female').astype(int)  # 1 if female, 0 otherwise
df['female'].value_counts(normalize=True).round(2)

0    0.55
1    0.45
Name: female, dtype: float64

# 2.
Generate an interaction term between your female dummy variable and productivity:

In [8]:
df['female'] = df['female'].astype(int)
df['female_prod'] = df['female'] * df['prod']

df['female_prod']

0       50.500000
1       49.299999
2       49.900002
3       54.099998
4       55.700001
          ...    
6009     0.000000
6010    51.599998
6011    50.200001
6012    53.099998
6013    54.000000
Name: female_prod, Length: 6014, dtype: float64

# 3. 
Run a multivariate regression that tests for whether or not productivity has a differential effect on pay for men and women.

In other words, does the relationship between prod and pay differ depending on the gender of the individuals?

In [12]:
X = df[['prod', 'female', 'female_prod']]
y = df['pay']

X = sm.add_constant(X)

model = sm.OLS(y,X)
results = model.fit(cov_type='HC3') # robust

results.summary()

Dep. Variable:,pay,R-squared:,0.066
Model:,OLS,Adj. R-squared:,0.065
Method:,Least Squares,F-statistic:,149.3
Date:,"Tue, 06 Feb 2024",Prob (F-statistic):,2.51e-93
Time:,15:53:55,Log-Likelihood:,-13149.0
No. Observations:,6014,AIC:,26310.0
Df Residuals:,6010,BIC:,26330.0
Df Model:,3,,
Covariance Type:,HC3,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,15.6396,0.225,69.451,0.000,15.198,16.081
prod,0.0411,0.004,11.705,0.000,0.034,0.048
female,-1.0813,0.306,-3.538,0.000,-1.680,-0.482
female_prod,0.0148,0.005,3.012,0.003,0.005,0.024

Omnibus:,335.066,Durbin-Watson:,0.166
Prob(Omnibus):,0.0,Jarque-Bera (JB):,214.857
Skew:,0.339,Prob(JB):,2.21e-47
Kurtosis:,2.369,Cond. No.,884.0


(b) What is the interpretation of the coefficient on productivity?

For men (`female` = 0), a one-unit increase in productivity is associated with a \$0.04 increase in pay on average, ceteris paribus. Because the model includes an interaction, this coefficient is the slope for the reference group only.

(c) What is the interpretation of the coefficient on female?

At `prod` = 0, women are estimated to earn \$1.08 less than men on average. Because of the interaction term, the gap is not constant: at a given productivity level it equals -1.0813 + 0.0148 × `prod`.

(d) What is the interpretation of the coefficient on the interaction term?

The coefficient (0.0148) is the additional change in pay per unit of productivity for women relative to men. It is positive, so the estimated slope for women is 0.0411 + 0.0148 = 0.0559, compared with 0.0411 for men.

(e) Can you reject the two-sided null hypothesis that there is no additional effect of productivity on pay for women at the 5% level?

Yes. The interaction coefficient has z = 3.012 and p = 0.003, and its 95% confidence interval [0.005, 0.024] excludes zero, so we reject the null hypothesis and conclude that there is a statistically significant additional effect of `prod` on pay for women. In this specification, productivity is more strongly associated with pay for women than for men.
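One way to see what the interaction coefficient measures: in a model with a dummy, a regressor, and their product (and nothing else), the fitted lines are the same as running a separate regression for each group. The sketch below checks this on synthetic data (not the ARC data); `ols` is a hypothetical helper, and the coefficient values in the data-generating line are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4_000

# Synthetic data (not the ARC data); the slope for group 1 is steeper by 0.015.
female = rng.integers(0, 2, n)
prod = rng.normal(50, 5, n)
pay = (15.5 + 0.04 * prod - 1.0 * female
       + 0.015 * female * prod + rng.normal(0, 1, n))

def ols(y, *regressors):
    """Return OLS coefficients [intercept, slopes...] via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Pooled model: [const, prod, female, female_prod]
b = ols(pay, prod, female, female * prod)

# Separate regressions by group: [const, prod]
b_men = ols(pay[female == 0], prod[female == 0])
b_women = ols(pay[female == 1], prod[female == 1])

print(f"men's slope:   pooled {b[1]:.4f}, separate {b_men[1]:.4f}")
print(f"women's slope: pooled {b[1] + b[3]:.4f}, separate {b_women[1]:.4f}")
```

This equivalence holds for the point estimates in the fully interacted model only. Once another control such as `tenure` enters without its own interaction, the pooled and separate fits no longer coincide.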

# 4.
Run a multivariate regression that tests for whether or not productivity has a differential effect on pay for men and women, controlling for tenure.

In [13]:
X = df[['prod', 'female', 'female_prod', 'tenure']]
y = df['pay']

X = sm.add_constant(X)

model = sm.OLS(y,X)
results = model.fit(cov_type='HC3') # robust

results.summary()

Dep. Variable:,pay,R-squared:,0.175
Model:,OLS,Adj. R-squared:,0.175
Method:,Least Squares,F-statistic:,329.5
Date:,"Tue, 06 Feb 2024",Prob (F-statistic):,9.82e-257
Time:,15:54:20,Log-Likelihood:,-12774.0
No. Observations:,6014,AIC:,25560.0
Df Residuals:,6009,BIC:,25590.0
Df Model:,4,,
Covariance Type:,HC3,,

,coef,std err,z,P>|z|,[0.025,0.975]
const,16.3008,0.212,76.884,0.000,15.885,16.716
prod,0.0130,0.003,3.799,0.000,0.006,0.020
female,-0.7211,0.293,-2.458,0.014,-1.296,-0.146
female_prod,0.0061,0.005,1.281,0.200,-0.003,0.015
tenure,0.0286,0.001,28.835,0.000,0.027,0.031

Omnibus:,260.159,Durbin-Watson:,0.167
Prob(Omnibus):,0.0,Jarque-Bera (JB):,282.097
Skew:,0.513,Prob(JB):,5.54e-62
Kurtosis:,2.727,Cond. No.,1040.0


(b) What is the interpretation of the coefficient on productivity?

For men (`female` = 0), a one-unit increase in productivity is associated with a \$0.013 increase in pay on average, holding tenure constant.

(c) What is the interpretation of the coefficient on female?

At `prod` = 0 and holding tenure constant, women are estimated to earn \$0.72 less than men on average (p = 0.014).

(d) What is the interpretation of the coefficient on the interaction term?

The coefficient (0.0061) is the estimated additional change in pay per unit of productivity for women relative to men, holding tenure constant. The point estimate is positive, but it is small relative to its standard error.

(e) Can you reject the two-sided null hypothesis that there is no additional effect of productivity on pay for women at the 5% level?

No. The interaction coefficient has z = 1.281 and p = 0.200, and its 95% confidence interval [-0.003, 0.015] includes zero, so we fail to reject the null. Once tenure is controlled for, there is no statistically significant additional effect of productivity on pay for women.
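To compare the two specifications side by side, the arithmetic behind the interpretations can be written out directly. The sketch below copies the coefficients from the two regression tables above and computes the implied productivity slope for men and women, plus the estimated female-male pay gap at an illustrative productivity level of 50 (an assumption, chosen because the `prod` values shown in `df.head()` are near 50). The helper `implied` is hypothetical.

```python
# Coefficients copied from the two regression tables above
models = {
    "without tenure": {"prod": 0.0411, "female": -1.0813, "female_prod": 0.0148},
    "with tenure": {"prod": 0.0130, "female": -0.7211, "female_prod": 0.0061},
}

PROD_LEVEL = 50.0  # illustrative value, not taken from the assignment

def implied(coefs, prod_level):
    """Return (men's slope, women's slope, female-male gap at prod_level)."""
    slope_men = coefs["prod"]
    slope_women = coefs["prod"] + coefs["female_prod"]
    gap = coefs["female"] + coefs["female_prod"] * prod_level
    return slope_men, slope_women, gap

for name, coefs in models.items():
    slope_men, slope_women, gap = implied(coefs, PROD_LEVEL)
    print(f"{name}: men {slope_men:.4f}, women {slope_women:.4f}, "
          f"gap at prod={PROD_LEVEL:.0f}: {gap:.4f}")
```

In the specification with tenure, the gap is for a man and a woman with the same tenure. The interaction term there is not statistically significant, so the difference between the two slopes should be read with that in mind.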