In [3]:
# Initialize Otter
import otter
grader = otter.Notebook("ps3.ipynb")

# Econ 140 – Problem Set 3

This section demonstrates how to do multivariable regression. We also went through a brief demonstration in PS0, so it might be worth reviewing that one as well. The procedure is very similar to the single-variable case; the only difference is that we need to select multiple columns for our independent `X` variable. Suppose we have a dataset called `df` that has three columns of observations, `wage`, `educ`, and `parents_wealth`, and suppose we want to regress `wage` on `educ`, `parents_wealth`, and a constant. To do so, we first identify the endogenous (dependent) variable and the exogenous (independent) variables.

```python
y = df['wage']
X = sm.add_constant(df[['educ', 'parents_wealth']])
```

Notice the double square brackets when we select multiple columns. `df['educ', 'parents_wealth']` will not work.
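To see why the double brackets matter, here is a small sketch on a made-up DataFrame (the values are invented purely for illustration and are not part of the problem set). A list inside the brackets selects several columns, while a bare comma makes pandas look for a single column labeled by the tuple `('educ', 'parents_wealth')`, which does not exist.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration only.
df = pd.DataFrame({
    "wage": [15.0, 22.5, 30.0],
    "educ": [12, 16, 18],
    "parents_wealth": [50.0, 80.0, 120.0],
})

# A list of column names returns a two-column DataFrame.
X_cols = df[["educ", "parents_wealth"]]

# Without the inner list, pandas treats the pair as one column label and fails.
try:
    df["educ", "parents_wealth"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(list(X_cols.columns), raised_key_error)
```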

Next, we will pass in our endogenous and exogenous variables (in that order) to `sm.OLS`, just like before.

```python
my_ols_model = sm.OLS(y, X)
results = my_ols_model.fit(cov_type = 'HC1')
results.summary()
```

And that's it!

Before getting started on the assignment, run the cell at the very top that imports `otter` and the cell below which will import the packages we need.

**Important:** As mentioned in problem set 0, if you leave this notebook alone for a while and come back, to save memory datahub will "forget" which code cells you have run, and you may need to restart your kernel and run all of the cells from the top. That includes this code cell that imports packages. If you get `<something> not defined` errors, this is because you didn't run an earlier code cell that you needed to run. It might be this cell or the `otter` cell above.

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

## Problem 1. Multivariate Linear Regression

In this problem we create a dataset by generating variables in the same way as we did in Problem Set 2. A main advantage of such an exercise is that we control the true data generating process (“DGP”), which is not possible in practical econometric analysis.

<!-- BEGIN QUESTION -->

**Question 1.a.**
Set the sample size at 1,000 and generate an error term, $u_i$, by randomly selecting from a normal distribution with mean 0, and standard deviation 5. Draw an explanatory variable, $X_{1i}$, from a standard normal distribution, $\mathcal{N}(0,1)$, and then define a second explanatory variable, $X_{2i}$, to be equal to $e^{X_{1i}}$ for all $i$. Finally, set the dependent variable to be linearly related to the two regressors plus an additive error term: $y_i = 2 + 4X_{1i} − 6X_{2i} + u_i$. Note that, by construction, the error term of this multivariate linear regression is homoskedastic.

*Hint*: You may want to refer to how you did this in Problem Set 2. Also, the function `np.exp()` takes a list/array of numbers and applies the exponential function to each element. It is the inverse function of `np.log()`.
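As a quick illustration of the hint, the sketch below applies `np.exp()` to a small array (the input values are arbitrary) and then undoes it with `np.log()`.

```python
import numpy as np

x = np.array([-1.0, 0.0, 2.0])   # arbitrary example values
exp_x = np.exp(x)                # element-wise e**x
round_trip = np.log(exp_x)       # the natural log undoes the exponential

print(exp_x)
print(round_trip)
```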

<!--
BEGIN QUESTION
name: q1_a
manual: true
-->

In [8]:
u = np.random.normal(0, 5, 1000)
X1 = np.random.normal(0, 1, 1000)
X2 = np.exp(X1)
y = 2 + 4*X1 - 6*X2 + u

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.b.**
Regress $y$ on $X_1$ with homoskedasticity-only standard errors (`statsmodels` does this by default, just don't specify a `cov_type` like we usually do to get robust errors). Do the same analysis for $y$ and $X_2$. Compare the results with the true data generating process. Explain why differences arise between the population slopes and the estimated slopes, if there are any.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_b
manual: true
-->

In [9]:
X1_const = sm.add_constant(X1)
model_1b_X1 = sm.OLS(y, X1_const)
results_1b_X1 = model_1b_X1.fit()
results_1b_X1.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.274
Model:,OLS,Adj. R-squared:,0.274
Method:,Least Squares,F-statistic:,377.3
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,1.55e-71
Time:,20:57:25,Log-Likelihood:,-3713.6
No. Observations:,1000,AIC:,7431.0
Df Residuals:,998,BIC:,7441.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-7.4904,0.314,-23.849,0.000,-8.107,-6.874
x1,-6.2500,0.322,-19.424,0.000,-6.881,-5.619

0,1,2,3
Omnibus:,817.586,Durbin-Watson:,2.014
Prob(Omnibus):,0.0,Jarque-Bera (JB):,30075.588
Skew:,-3.466,Prob(JB):,0.0
Kurtosis:,28.957,Cond. No.,1.03


<!-- END QUESTION -->

In [10]:
X2_const = sm.add_constant(X2)
model_1b_X2 = sm.OLS(y, X2_const)
results_1b_X2 = model_1b_X2.fit()
results_1b_X2.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.742
Model:,OLS,Adj. R-squared:,0.742
Method:,Least Squares,F-statistic:,2867.0
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,1.08e-295
Time:,20:57:25,Log-Likelihood:,-3197.0
No. Observations:,1000,AIC:,6398.0
Df Residuals:,998,BIC:,6408.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.2405,0.236,1.020,0.308,-0.222,0.703
x1,-4.7359,0.088,-53.547,0.000,-4.909,-4.562

0,1,2,3
Omnibus:,11.482,Durbin-Watson:,1.944
Prob(Omnibus):,0.003,Jarque-Bera (JB):,13.117
Skew:,-0.185,Prob(JB):,0.00142
Kurtosis:,3.422,Cond. No.,3.54


<!-- BEGIN QUESTION -->

**Question 1.c.**
Explain.

<!--
BEGIN QUESTION
name: q1_c
manual: true
-->

Our population slope for $\beta_1$ is 4 and our population slope for $\beta_2$ is -6.

Regressing only on X1 gives us the OLS regression: $$ \widehat{y} = -6.25 * X_1 - 7.49 $$
Our estimated slope coefficient for X1 is -6.25, which is far below the population slope of 4 and even has the opposite sign.

Regressing only on X2 gives us the OLS regression: $$ \widehat{y} = -4.74 * X_2 + 0.24 $$
Our estimated slope coefficient for X2 is -4.74, which is greater than the population slope of -6.

These estimated slopes differ from the population values because of omitted variable bias. When we regress on only one of the X terms, the other X term is absorbed into the error term, and since $X_2 = e^{X_1}$ is positively correlated with $X_1$, the error term is correlated with the included regressor. Omitting X2 (population coefficient -6) biases the X1 slope downward, while omitting X1 (population coefficient 4) biases the X2 slope upward.
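The omitted variable bias can be checked numerically. The sketch below re-simulates the DGP from 1.a with an arbitrary seed (so the draws differ from the notebook's) and uses plain NumPy least squares rather than `statsmodels`; the helper `ols` is a hypothetical convenience function. It verifies the in-sample identity that the slope from the regression omitting X2 equals the X1 coefficient from the full regression plus the X2 coefficient times the slope from regressing X2 on X1.

```python
import numpy as np

rng = np.random.default_rng(140)  # arbitrary seed, not the notebook's draws
n = 1000
u = rng.normal(0, 5, n)
X1 = rng.normal(0, 1, n)
X2 = np.exp(X1)
y = 2 + 4 * X1 - 6 * X2 + u

def ols(target, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(target)), *regressors])
    return np.linalg.lstsq(X, target, rcond=None)[0]

b_long = ols(y, X1, X2)   # [const, coef on X1, coef on X2]
b_short = ols(y, X1)      # [const, coef on X1], with X2 omitted
delta = ols(X2, X1)[1]    # slope from regressing X2 on X1

# Omitted variable bias identity: short slope = long slope + (coef on X2) * delta
implied_short_slope = b_long[1] + b_long[2] * delta
print(b_short[1], implied_short_slope)
```

In the population, $\mathrm{Cov}(X_1, e^{X_1}) = e^{1/2} \approx 1.65$ and $\mathrm{Var}(X_1) = 1$, so the slope when X2 is omitted converges to roughly $4 - 6(1.65) \approx -5.9$, in line with the estimate of -6.25.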

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.d.**
Next, regress $y$ on both $X_1$ and $X_2$. Compare the estimation results with those you did in part (b/c), especially the model with only the regressor $X_1$. Examine differences across the three regressions in terms of the coefficient estimates, their standard errors, the $R^2$, and the adjusted $R^2$.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_d
manual: true
-->

In [11]:
X_const = sm.add_constant(np.stack([X1, X2], axis=1)) # This just puts our two variables together with a const
model_1d = sm.OLS(y, X_const)
results_1d = model_1d.fit()
results_1d.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.798
Model:,OLS,Adj. R-squared:,0.798
Method:,Least Squares,F-statistic:,1974.0
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,0.0
Time:,20:57:25,Log-Likelihood:,-3073.3
No. Observations:,1000,AIC:,6153.0
Df Residuals:,997,BIC:,6167.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.9355,0.263,11.144,0.000,2.419,3.452
x1,4.5420,0.272,16.726,0.000,4.009,5.075
x2,-6.3700,0.125,-50.905,0.000,-6.616,-6.124

0,1,2,3
Omnibus:,0.003,Durbin-Watson:,1.955
Prob(Omnibus):,0.999,Jarque-Bera (JB):,0.012
Skew:,0.004,Prob(JB):,0.994
Kurtosis:,2.984,Cond. No.,6.04


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.e.**
Explain.

<!--
BEGIN QUESTION
name: q1_e
manual: true
-->

The estimated coefficients in the regression with both X1 and X2 (4.54 and -6.37) are much closer to the population slopes (4 and -6) than those from either bivariate regression in part (b/c). The standard error on X1 falls from 0.322 to 0.272, while the standard error on X2 rises from 0.088 to 0.125. The R-squared for the multivariate regression is the highest (0.798), though it is only a little higher than the R-squared from regressing on X2 alone (0.742). Adding a regressor can never lower the R-squared, so the model with both X1 and X2 explains the most variance. The regression on X2 alone still explains a lot of the variance because the coefficient on X2 is larger in magnitude than the coefficient on X1 and X2 varies much more than X1, since X2 is the exponential of X1. This is also why the regression on X1 alone has an R-squared of only 0.274: X1 explains much less of the variance in y than X2 does. The adjusted R-squared values are very close to the unadjusted values because the sample size is very large relative to the number of regressors.
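To make the last point concrete, here is a small sketch of the standard adjusted R-squared formula, $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, applied to the rounded values printed in the part (d) summary table. The function name `adjusted_r2` is a hypothetical helper, not part of `statsmodels`.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k regressors (excluding the constant)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Inputs are the rounded values printed in the part (d) summary table.
adj_d = adjusted_r2(0.798, n=1000, k=2)
print(round(adj_d, 3))
```

With n = 1000 the penalty factor (n - 1)/(n - k - 1) is 999/997, which is so close to 1 that the adjustment barely moves the R-squared.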

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.f.**
Generate a third regressor: $X_{3i} = 1 + X_{1i} − X_{2i} + v_i$ where $v_i$ is drawn from a normal distribution with mean 0 and standard deviation 0.5. Estimate the model $y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3X_{3i} + w_i$. Compare the result with part (d/e). Do changes in OLS estimates, standard errors, the $R^2$, and the adjusted $R^2$ make sense to you? Explain why or why not.

*Hint: Think about the concept of “imperfect multicollinearity”.*

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q1_f
manual: true
-->

In [12]:
v = np.random.normal(0, 0.5, 1000)
X3 = 1 + X1 - X2 + v

X_const_f = sm.add_constant(np.stack([X1, X2, X3], axis=1))
model_1f = sm.OLS(y, X_const_f)
results_1f = model_1f.fit()
results_1f.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.799
Model:,OLS,Adj. R-squared:,0.799
Method:,Least Squares,F-statistic:,1323.0
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,0.0
Time:,20:57:25,Log-Likelihood:,-3070.8
No. Observations:,1000,AIC:,6150.0
Df Residuals:,996,BIC:,6169.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,2.1890,0.427,5.121,0.000,1.350,3.028
x1,3.7802,0.438,8.633,0.000,2.921,4.639
x2,-5.6222,0.360,-15.618,0.000,-6.329,-4.916
x3,0.7445,0.336,2.215,0.027,0.085,1.404

0,1,2,3
Omnibus:,0.012,Durbin-Watson:,1.967
Prob(Omnibus):,0.994,Jarque-Bera (JB):,0.032
Skew:,-0.008,Prob(JB):,0.984
Kurtosis:,2.977,Cond. No.,14.3


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 1.g.**
Explain.

<!--
BEGIN QUESTION
name: q1_g
manual: true
-->

The OLS estimates of the X1 and X2 slope coefficients remain reasonably close to the population values of 4 and -6 (3.78 and -5.62), and the R-squared and adjusted R-squared are essentially unchanged (0.798 to 0.799), because X3 adds almost no new information to the model. However, the standard errors increase substantially (from 0.272 to 0.438 for X1 and from 0.125 to 0.360 for X2) because of imperfect multicollinearity. Since X3 is a linear combination of X1 and X2 plus a small error term, it is highly correlated with them, which makes it hard to separate their individual effects and inflates the standard errors. Even though the degrees of freedom decrease, the adjusted R-squared remains high because the sample size is very large.
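One common way to quantify imperfect multicollinearity is the variance inflation factor (VIF), $1/(1 - R_j^2)$, where $R_j^2$ comes from regressing regressor $j$ on the other regressors. The VIF is not part of the problem set; the sketch below is only an illustration, re-simulating the regressors with an arbitrary seed and using a hypothetical helper `r_squared` built on NumPy least squares.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed, not the notebook's draws
n = 1000
X1 = rng.normal(0, 1, n)
X2 = np.exp(X1)
X3 = 1 + X1 - X2 + rng.normal(0, 0.5, n)

def r_squared(target, *regressors):
    """R-squared from regressing target on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(target)), *regressors])
    fitted = X @ np.linalg.lstsq(X, target, rcond=None)[0]
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1 - ss_res / ss_tot

# VIF for each regressor: 1 / (1 - R^2 from regressing it on the other two)
vif = {
    "X1": 1 / (1 - r_squared(X1, X2, X3)),
    "X2": 1 / (1 - r_squared(X2, X1, X3)),
    "X3": 1 / (1 - r_squared(X3, X1, X2)),
}
print(vif)
```

A VIF well above 1 means most of that regressor's variation is already explained by the others, which is what drives up the standard errors.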

<!-- END QUESTION -->



---

## Problem 2. Teaching Ratings

We will use `teaching_ratings.csv` which contains data on course evaluations, course characteristics, and professor characteristics for 463 courses at the University of Texas at Austin. One of the characteristics is an index of the professor’s “beauty” as rated by a panel of six judges. The variable `course_eval` is an overall teaching evaluation score, on a scale of 1 (very unsatisfactory) to 5 (excellent). In this exercise, you will investigate how course evaluations are related to the professor’s beauty.

In [5]:
ratings = pd.read_csv("teaching_ratings.csv")
ratings.head()

Unnamed: 0,minority,age,female,onecredit,beauty,course_eval,intro,nnenglish
0,1.0,36.0,1.0,0,0.289916,4.3,0.0,0.0
1,0.0,59.0,0.0,0,-0.737732,4.5,0.0,0.0
2,0.0,51.0,0.0,0,-0.571984,3.7,0.0,0.0
3,0.0,40.0,1.0,0,-0.677963,4.3,0.0,0.0
4,0.0,31.0,1.0,0,1.509794,4.4,0.0,0.0


<!-- BEGIN QUESTION -->

**Question 2.a.**
Run a regression of `course_eval` on `beauty` using robust standard errors. What is the estimated slope? Is it statistically significant?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_a
manual: true
-->

In [6]:
y_2a = ratings['course_eval']
X_2a = sm.add_constant(ratings['beauty'])
model_2a = sm.OLS(y_2a, X_2a)
results_2a = model_2a.fit(cov_type = 'HC1')
results_2a.summary()



0,1,2,3
Dep. Variable:,course_eval,R-squared:,0.036
Model:,OLS,Adj. R-squared:,0.034
Method:,Least Squares,F-statistic:,16.94
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,4.58e-05
Time:,23:04:32,Log-Likelihood:,-375.32
No. Observations:,463,AIC:,754.6
Df Residuals:,461,BIC:,762.9
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,3.9983,0.025,157.727,0.000,3.949,4.048
beauty,0.1330,0.032,4.115,0.000,0.070,0.196

0,1,2,3
Omnibus:,15.399,Durbin-Watson:,1.41
Prob(Omnibus):,0.0,Jarque-Bera (JB):,16.405
Skew:,-0.453,Prob(JB):,0.000274
Kurtosis:,2.831,Cond. No.,1.27


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.b.**
Explain.

<!--
BEGIN QUESTION
name: q2_b
manual: true
-->

$$ \widehat{course\ eval} = 0.133 * beauty + 4 $$
The estimated slope is 0.133. It is statistically significant at the 5% level because the z-statistic for the slope is 4.115, which exceeds the critical value of 1.96, and the p-value is very close to 0.
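The significance check can be reproduced by hand from the printed coefficient and standard error. This is a rough sketch using the rounded table values, so the z-statistic comes out slightly different from the 4.115 that `statsmodels` reports from unrounded numbers.

```python
coef, se = 0.1330, 0.032   # rounded values from the part (a) summary table

z = coef / se                      # test statistic for H0: slope = 0
ci_low = coef - 1.96 * se          # lower end of the 95% confidence interval
ci_high = coef + 1.96 * se         # upper end of the 95% confidence interval
significant_at_5pct = abs(z) > 1.96

print(round(z, 2), round(ci_low, 3), round(ci_high, 3), significant_at_5pct)
```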

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.c.**
Run a regression of `course_eval` on `beauty`, including some additional variables to control for the type of course and professor characteristics. In particular, include as additional regressors `intro`, `onecredit`, `female`, `minority`, and `nnenglish`. What is the estimated effect of `beauty` on `course_eval`? Does the regression in (a) suffer from important omitted variable bias (OVB)? What happens with the $R^2$? Based on the confidence interval from the regression, can you reject the null hypothesis that the effect of beauty is the same as in part (a)? What can you say about the effect of the new variables included?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_c
manual: true
-->

In [15]:
y_2c = ratings['course_eval']
X_2c = sm.add_constant(ratings[['beauty','intro', 'onecredit', 'female', 'minority', 'nnenglish']])
model_2c = sm.OLS(y_2c, X_2c)
results_2c = model_2c.fit(cov_type = 'HC1')
results_2c.summary()



0,1,2,3
Dep. Variable:,course_eval,R-squared:,0.155
Model:,OLS,Adj. R-squared:,0.144
Method:,Least Squares,F-statistic:,17.03
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,8.67e-18
Time:,20:57:25,Log-Likelihood:,-344.85
No. Observations:,463,AIC:,703.7
Df Residuals:,456,BIC:,732.7
Df Model:,6,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,4.0683,0.037,109.926,0.000,3.996,4.141
beauty,0.1656,0.032,5.246,0.000,0.104,0.227
intro,0.0113,0.056,0.202,0.840,-0.099,0.121
onecredit,0.6345,0.108,5.871,0.000,0.423,0.846
female,-0.1735,0.049,-3.505,0.000,-0.270,-0.076
minority,-0.1666,0.067,-2.472,0.013,-0.299,-0.034
nnenglish,-0.2442,0.094,-2.608,0.009,-0.428,-0.061

0,1,2,3
Omnibus:,22.413,Durbin-Watson:,1.516
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24.406
Skew:,-0.555,Prob(JB):,5.02e-06
Kurtosis:,3.179,Cond. No.,5.81


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.d.**
Explain.

<!--
BEGIN QUESTION
name: q2_d
manual: true
-->

After including the additional regressors, the estimated effect of beauty is 0.1656, compared with 0.133 in the bivariate regression in (a). The difference suggests some omitted variable bias in (a), but it is modest: the change of about 0.03 is roughly one standard error (0.032). Both the R-squared and the adjusted R-squared increase substantially (from 0.036 to 0.155 and from 0.034 to 0.144) because the additional regressors explain considerably more of the variance in course evaluations. Based on the 95% confidence interval for the beauty coefficient in the multivariate regression, (0.104, 0.227), we fail to reject the null hypothesis that the effect of beauty equals the part (a) estimate, since 0.133 lies inside this interval. As for the new variables, `onecredit` has a large, positive, and significant coefficient (0.6345); `female`, `minority`, and `nnenglish` have negative and significant coefficients; and `intro` is not statistically significant (p = 0.840). Including them removes these factors from the error term, so there is less in the error term that could be correlated with `beauty`, which makes a causal interpretation of the beauty coefficient more credible.
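The confidence-interval argument is equivalent to a z-test of the null hypothesis that the beauty coefficient equals 0.133. The sketch below treats the part (a) slope as a fixed hypothesized value (ignoring its own sampling error, which is an assumption of this simplified check) and uses the rounded values printed in the summary tables.

```python
beta_c, se_c = 0.1656, 0.032   # beauty coefficient and robust SE from part (c), as printed
beta_a = 0.1330                # part (a) slope, treated as a fixed hypothesized value

z = (beta_c - beta_a) / se_c           # test statistic for H0: beta = 0.133
ci_low = beta_c - 1.96 * se_c          # 95% confidence interval from rounded inputs
ci_high = beta_c + 1.96 * se_c
reject = abs(z) > 1.96

print(round(z, 2), reject)
```

Because the inputs are rounded, the interval computed here can differ from the printed (0.104, 0.227) in the third decimal place, but the conclusion is the same.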

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.e.**
Estimate the coefficient on beauty for the multiple regression model in (c) using the three-step process in Appendix 6.3 (the Frisch-Waugh theorem). Verify that the three-step process yields the same estimated coefficient for beauty as that obtained in (c). Comment.

*Hint: Recall that if your regression results are called `results`, you could get the residuals using `results.resid`.*

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q2_e
manual: true
-->

In [18]:
# Do the first step here (regress the outcome variable on covariates)
course_eval = ratings['course_eval']
covariates = sm.add_constant(ratings[['intro', 'onecredit', 'female', 'minority', 'nnenglish']])
model_eval_on_covariates = sm.OLS(course_eval, covariates)
results_eval = model_eval_on_covariates.fit(cov_type = 'HC1')
eval_residuals = results_eval.resid

# Do the second step here (regress the explanatory variable on covariates)
beauty = ratings['beauty']
model_beauty_on_covariates = sm.OLS(beauty, covariates)
results_beauty = model_beauty_on_covariates.fit(cov_type = 'HC1')
beauty_residuals = results_beauty.resid

# Do the last step here (regress the outcome variable's residuals on the explanatory variable's residuals)
model_fw = sm.OLS(eval_residuals, beauty_residuals)
results_fw = model_fw.fit(cov_type = 'HC1')
results_fw.summary()



0,1,2,3
Dep. Variable:,y,R-squared (uncentered):,0.06
Model:,OLS,Adj. R-squared (uncentered):,0.058
Method:,Least Squares,F-statistic:,27.88
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,1.99e-07
Time:,23:26:02,Log-Likelihood:,-344.85
No. Observations:,463,AIC:,691.7
Df Residuals:,462,BIC:,695.8
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
x1,0.1656,0.031,5.280,0.000,0.104,0.227

0,1,2,3
Omnibus:,22.413,Durbin-Watson:,1.516
Prob(Omnibus):,0.0,Jarque-Bera (JB):,24.406
Skew:,-0.555,Prob(JB):,5.02e-06
Kurtosis:,3.179,Cond. No.,1.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.f.**
Explain.

<!--
BEGIN QUESTION
name: q2_f
manual: true
-->

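After following the steps outlined in the Frisch-Waugh theorem, we get the same estimated slope coefficient as in the multivariate linear regression, 0.1656. This works because the first two steps remove from both `course_eval` and `beauty` the variation explained by the other regressors; regressing one set of residuals on the other then isolates the partial effect of beauty, which is exactly what the multivariate regression coefficient measures. The standard error differs slightly (0.031 versus 0.032) because the final step does not account for the degrees of freedom used up in the first two steps.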
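The equivalence holds for any dataset, not just the teaching ratings. As a sketch, the code below checks it on simulated data using plain NumPy least squares; the variable names (`w1`, `w2`, `x`), coefficients, and seed are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
# Hypothetical data: x is correlated with two control variables w1 and w2.
w1 = rng.normal(size=n)
w2 = rng.normal(size=n)
x = 0.5 * w1 - 0.3 * w2 + rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * w1 - 1.0 * w2 + rng.normal(size=n)

W = np.column_stack([np.ones(n), w1, w2])  # constant plus controls

def residuals(target, X):
    """Residuals from an OLS regression of target on the columns of X."""
    return target - X @ np.linalg.lstsq(X, target, rcond=None)[0]

# Full regression: the coefficient on x is the last element.
beta_full = np.linalg.lstsq(np.column_stack([W, x]), y, rcond=None)[0][-1]

# Steps 1 and 2: partial the controls out of y and x. Step 3: regress residuals on residuals.
y_res = residuals(y, W)
x_res = residuals(x, W)
beta_fw = (x_res @ y_res) / (x_res @ x_res)

print(beta_full, beta_fw)
```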

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.g.**
Professor Smith is a black male with average beauty and is a native English speaker. He teaches a three-credit upper-division course. Predict Professor Smith’s course evaluation.

<!--
BEGIN QUESTION
name: q2_g
manual: true
-->

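Professor Smith has `minority` = 1 and `female` = `nnenglish` = `onecredit` = `intro` = 0. Average beauty means `beauty` takes its sample mean, which is approximately 0 because the beauty index is constructed to have a mean of zero.
$$ \widehat{course\ eval} = 4.0683 + 0.1656 * (0) + 1 * (-0.1666)$$
$$ \widehat{course\ eval} \approx 3.90 $$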

<!-- END QUESTION -->



---

## Problem 3. Education and Distance to College

The file `college_distance.csv` contains data from a random sample of high school seniors interviewed in 1980 and re-interviewed in 1986. A detailed description is given in `college_distance_description.pdf`, which will be shared on Piazza and bCourses. In this exercise, you will use these data to investigate the relationship between the number of completed years of education for young adults and the distance from each student’s high school to the nearest four-year college.

In [19]:
dist = pd.read_csv("college_distance.csv")
dist.head()

Unnamed: 0,female,black,hispanic,bytest,dadcoll,momcoll,ownhome,urban,cue80,stwmfg80,dist,tuition,yrsed,incomehi
0,0.0,0.0,0.0,39.15,1.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,1.0
1,1.0,0.0,0.0,48.87,0.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,0.0
2,0.0,0.0,0.0,48.74,0.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,0.0
3,0.0,1.0,0.0,40.4,0.0,0.0,1.0,1.0,6.2,8.09,0.2,0.88915,12.0,0.0
4,1.0,0.0,0.0,40.48,0.0,0.0,0.0,1.0,5.6,8.09,0.4,0.88915,13.0,0.0


<!-- BEGIN QUESTION -->

**Question 3.a.**
What do you expect for the sign of the relationship and what mechanism can you think about to explain it?

<!--
BEGIN QUESTION
name: q3_a
manual: true
-->

I expect the sign of the relationship to be negative. As the distance from the student's high school to the nearest four-year college increases, the number of completed years of education will decrease. The mechanism I can think about to explain it is homesickness, as you go farther away you miss home more and want to come back, but if you're closer to home already then you can go to school and visit home more easily.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.b.**
Run a regression of years of completed education (`yrsed`) on distance to the nearest college (`dist`), measured in tens of miles (For example, dist = 2 means that the distance is 20 miles). What is the estimated slope? Is it statistically significant? Does distance to college explain a large fraction of the variance in educational attainment across individuals? Explain.

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q3_b
manual: true
-->

In [20]:
y_3b = dist['yrsed']
X_3b = sm.add_constant(dist['dist'])
model_3b = sm.OLS(y_3b, X_3b)
results_3b = model_3b.fit(cov_type = 'HC1')
results_3b.summary()



0,1,2,3
Dep. Variable:,yrsed,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,29.83
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,5.01e-08
Time:,23:39:22,Log-Likelihood:,-7632.2
No. Observations:,3796,AIC:,15270.0
Df Residuals:,3794,BIC:,15280.0
Df Model:,1,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,13.9559,0.038,369.093,0.000,13.882,14.030
dist,-0.0734,0.013,-5.462,0.000,-0.100,-0.047

0,1,2,3
Omnibus:,7187.794,Durbin-Watson:,1.769
Prob(Omnibus):,0.0,Jarque-Bera (JB):,361.676
Skew:,0.41,Prob(JB):,2.9e-79
Kurtosis:,1.729,Cond. No.,3.73


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.c.**
Explain.

<!--
BEGIN QUESTION
name: q3_c
manual: true
-->

The estimated slope on distance is -0.0734: each additional 10 miles from the nearest college is associated with about 0.07 fewer years of completed education. This value is statistically significant since its z-score (-5.462) has an absolute value greater than 1.96.  
Distance to college does not explain a large fraction of the variation in educational attainment: the R-squared of the regression is only 0.007, which means only 0.7 percent of the variation is explained.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.d.**
Now run a regression of `yrsed` on `dist`, but include some additional regressors to control for characteristics of the student, the student’s family, and the local labor market. In particular, include as additional regressors: `bytest`, `female`, `black`, `hispanic`, `incomehi`, `ownhome`, `dadcoll`, `cue80`, and `stwmfg80`.  What is the estimated effect of `dist` on `yrsed`?  Is it substantively different from the regression in (b)? Based on this, does the regression in (b) seem to suffer from important omitted variable bias?

This question is for your code, the next is for your explanation.

<!--
BEGIN QUESTION
name: q3_d
manual: true
-->

In [22]:
y_3d = dist['yrsed']
X_3d = sm.add_constant(dist[['dist', 'bytest', 'female', 'black', 'hispanic', 'incomehi', 'ownhome', 'dadcoll', 'cue80', 'stwmfg80']])
model_3d = sm.OLS(y_3d, X_3d)
results_3d = model_3d.fit(cov_type = 'HC1')
results_3d.summary()



0,1,2,3
Dep. Variable:,yrsed,R-squared:,0.279
Model:,OLS,Adj. R-squared:,0.277
Method:,Least Squares,F-statistic:,197.7
Date:,"Tue, 19 Oct 2021",Prob (F-statistic):,0.0
Time:,23:48:55,Log-Likelihood:,-7025.9
No. Observations:,3796,AIC:,14070.0
Df Residuals:,3785,BIC:,14140.0
Df Model:,10,,
Covariance Type:,HC1,,

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,8.8275,0.241,36.583,0.000,8.355,9.300
dist,-0.0315,0.012,-2.705,0.007,-0.054,-0.009
bytest,0.0938,0.003,31.479,0.000,0.088,0.100
female,0.1454,0.050,2.885,0.004,0.047,0.244
black,0.3680,0.068,5.449,0.000,0.236,0.500
hispanic,0.3985,0.074,5.394,0.000,0.254,0.543
incomehi,0.3952,0.062,6.382,0.000,0.274,0.517
ownhome,0.1521,0.065,2.343,0.019,0.025,0.279
dadcoll,0.6961,0.071,9.838,0.000,0.557,0.835

0,1,2,3
Omnibus:,118.266,Durbin-Watson:,1.924
Prob(Omnibus):,0.0,Jarque-Bera (JB):,97.867
Skew:,0.32,Prob(JB):,5.6e-22
Kurtosis:,2.543,Cond. No.,539.0


<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.e.**
Explain.

<!--
BEGIN QUESTION
name: q3_e
manual: true
-->

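The estimated effect of `dist` on `yrsed` is -0.0315. This is substantively different from the estimate in (b): the coefficient falls by more than half in magnitude, from -0.0734 to -0.0315. Based on this, the regression in (b) seems to suffer from important omitted variable bias.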

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.f.**
The value of the coefficient on `dadcoll` is positive. What does this coefficient measure?
Interpret this effect.

<!--
BEGIN QUESTION
name: q3_f
manual: true
-->

This coefficient measures the average difference in years of educational attainment between students whose father is a college graduate and students whose father is not, holding the other regressors constant. Based on this coefficient, students whose father is a college graduate complete 0.6961 more years of education, on average, than otherwise similar students whose father is not.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.g.** Explain why `cue80` and `stwmfg80` appear in the regression. Are the signs of their estimated coefficients what you would have believed? Explain.

<!--
BEGIN QUESTION
name: q3_g
manual: true
-->

These variables appear in the regression because measures of the local economy influence whether people go to college or take a job instead. The signs of both estimated coefficients are what I would expect. If the unemployment rate is higher, then you are less likely to find a job and more likely to go to college instead. Likewise, if wages for manufacturing jobs in the state are higher, then you are more likely to take a job than go to college.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.h.**
Bob is a black male. His high school was 20 miles from the nearest college. His base-year composite test score (`bytest`) was 58. His family income in 1980 was \\$26,000, and his family owned a home. His mother attended college, but his father did not. The unemployment rate in his county was 7.5%, and the state average manufacturing hourly wage was \\$9.75. Predict Bob’s years of completed schooling using the regression in (d).

<!--
BEGIN QUESTION
name: q3_h
manual: true
-->

Predicted years of completed schooling = 14.78

<!-- END QUESTION -->



---

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [24]:
# Save your notebook first, then run this cell to export your submission.
grader.to_pdf(pagebreaks=False, display_link=True)