# Problem 1

In [34]:
import numpy as np
import pandas as pd 
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import norm

### a) You have data (Yi,X1i,X2i) generated from the model Yi = β0 + β1X1i + β2X2i + ei, where ei satisfies the usual exogeneity condition. You are interested in estimating the causal effect of X1 on Y . Suppose X1 and X2 are positively correlated, and β2 > 0. If you regress Y only on X1, will your estimate of β1 be biased? In what direction?

We start with the true regression model:

$$
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i
$$

where $E[e_i \mid X_1, X_2] = 0$.

⸻

Now, suppose instead we run a regression of $Y$ only on $X_1$:

$$
Y_i = \tilde{\beta}_0 + \tilde{\beta}_1 X_{1i} + u_i
$$

We want to determine the relationship between the OLS estimate of $\tilde{\beta}_1$, written $\hat{\beta}_1$ below, and the true coefficient $\beta_1$.

⸻

In population terms (the large-sample limit of the OLS estimator), the slope in this simple regression is:

$$
\hat{\beta}_1 = \frac{\text{Cov}(Y_i, X_{1i})}{\text{Var}(X_{1i})}
$$

Substituting $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i$:

$$
\hat{\beta}_1 = \frac{\text{Cov}(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i, X_{1i})}{\text{Var}(X_{1i})}
$$

Since $\beta_0$ is a constant and $e_i$ is mean-independent of $X_1$ (so $\text{Cov}(e_i, X_{1i}) = 0$), this reduces to:

$$
\hat{\beta}_1 = \frac{\beta_1 \text{Var}(X_{1i}) + \beta_2 \text{Cov}(X_{2i}, X_{1i})}{\text{Var}(X_{1i})}
$$

⸻

Thus:

$$
\hat{\beta}_1 = \beta_1 + \beta_2 \cdot \frac{\text{Cov}(X_{1i}, X_{2i})}{\text{Var}(X_{1i})}
$$

⸻

Direction of Bias

$\beta_2 > 0$ (given)

$\text{Cov}(X_1, X_2) > 0$ (since they are positively correlated)

$\text{Var}(X_1) > 0$

Therefore:

$$
\hat{\beta}_1 > \beta_1
$$

⸻

Final Conclusion

$$
\hat{\beta}_1 \text{ has a positive bias.}
$$
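The bias formula above can be checked numerically. The following is a minimal simulation sketch, not part of the original problem: the coefficient values, the sample size, and the way $X_2$ is built from $X_1$ are all hypothetical choices made only so that $\text{Cov}(X_1, X_2) > 0$ and $\beta_2 > 0$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta0, beta1, beta2 = 1.0, 2.0, 3.0   # hypothetical true coefficients

# X1 and X2 are positively correlated by construction (assumption for the demo)
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
e = rng.normal(size=n)
y = beta0 + beta1 * x1 + beta2 * x2 + e

# Short regression of Y on X1 only: slope = Cov(Y, X1) / Var(X1)
var_x1 = np.var(x1, ddof=1)
slope_short = np.cov(y, x1)[0, 1] / var_x1

# Value predicted by the omitted variable bias formula
predicted = beta1 + beta2 * np.cov(x1, x2)[0, 1] / var_x1

print(f"short-regression slope: {slope_short:.3f}")
print(f"beta1 + beta2*Cov/Var:  {predicted:.3f}")
```

Both numbers should land near $2 + 3 \times 0.5 = 3.5$, above the true $\beta_1 = 2$, which is the positive bias derived above.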

### b) A researcher runs a simple regression of income (Y ) on years of education (X1) and finds a strong positive effect. Another researcher argues that this estimate is likely upward biased because it omits cognitive ability (X2). Explain why omitted variable bias may be present, what is the most likely sign of the bias and how it affects the interpretation of the education coefficient

Omitted variable bias may be present because cognitive ability (X2) is likely positively correlated with years of education and also has its own positive effect on income. Omitting it pushes part of the effect of ability onto education (X1). As in part a), the likely sign of the bias is positive, so the education coefficient in the simple regression probably overstates the causal effect of education on income: it captures both the return to schooling and the return to the ability of the people who obtain more schooling.

### (c) Consider the regression model Yi = β0 + β1X1i + β2X2i + ei. Transform the regression so that you can use a t-statistic to test (i) H0 : β1 = β2, (ii) H0 : β1 + 2β2 = 0, (iii) H0 : β1 + β2 = 1 (you may redefine the dependent variable)

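i) Add and subtract $\beta_2 X_{1i}$ to get $Y_i = \beta_0 + (\beta_1 - \beta_2) X_{1i} + \beta_2 (X_{1i} + X_{2i}) + e_i$. Regress $Y_i$ on $X_{1i}$ and the new variable $(X_{1i} + X_{2i})$. The coefficient on $X_{1i}$ is $\lambda_1 = \beta_1 - \beta_2$, so the usual t-test of $\lambda_1 = 0$ tests $\beta_1=\beta_2$.


ii) Add and subtract $2\beta_2 X_{1i}$ to get $Y_i = \beta_0 + (\beta_1 + 2\beta_2) X_{1i} + \beta_2 (X_{2i} - 2X_{1i}) + e_i$. Regress $Y_i$ on $X_{1i}$ and $(X_{2i} - 2X_{1i})$. The coefficient on $X_{1i}$ is $\lambda_2 = \beta_1 + 2\beta_2$, so the t-test of $\lambda_2 = 0$ tests $\beta_1+2\beta_2=0$.

iii) Subtract $X_{1i}$ from both sides and add and subtract $\beta_2 X_{1i}$ to get $Y_i - X_{1i} = \beta_0 + (\beta_1 + \beta_2 - 1) X_{1i} + \beta_2 (X_{2i} - X_{1i}) + e_i$. Regress the redefined dependent variable $(Y_i - X_{1i})$ on $X_{1i}$ and $(X_{2i} - X_{1i})$. The coefficient on $X_{1i}$ is $\lambda_3 = \beta_1 + \beta_2 - 1$, so the t-test of $\lambda_3 = 0$ tests $\beta_1+\beta_2=1$.

Note that in each case both regressors must stay in the transformed regression: regressing $Y_i$ on the combined variable alone would impose a restriction instead of testing it. Equivalently, the same t-statistics can be built from the original regression. Write the hypothesis as $\lambda = a'\beta$, estimate it by $\hat{\lambda} = a'\hat{\beta}$, and use $\text{Var}(\hat{\lambda}) = a'\,\text{Var}(\hat{\beta})\,a$, so that $s.e.(\hat{\lambda})=\sqrt{a'\,\widehat{\text{Var}}(\hat{\beta})\,a}$. The estimated covariance matrix of $\hat{\beta}$ already reflects the sample size, so no additional $\sqrt{n}$ scaling is needed.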
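One way to check a reparameterisation for $H_0: \beta_1 = \beta_2$ is to confirm numerically that the coefficient on $X_1$ in a regression of $Y$ on $X_1$ and $(X_1 + X_2)$ equals $\hat{\beta}_1 - \hat{\beta}_2$ from the original regression, with standard error $\sqrt{a'\widehat{\text{Var}}(\hat{\beta})a}$. The sketch below uses simulated data and a small hypothetical helper `ols`; the homoskedasticity-only covariance matrix is an assumption made to keep the example short.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and the homoskedasticity-only covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return b, sigma2 * XtX_inv

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.3 * x1 + rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.8 * x2 + rng.normal(size=n)  # simulated so that beta1 = beta2
const = np.ones(n)

# Original regression: Y on (1, X1, X2), with lambda = a'beta = beta1 - beta2
b, V = ols(np.column_stack([const, x1, x2]), y)
a = np.array([0.0, 1.0, -1.0])
lam_direct = a @ b
se_direct = np.sqrt(a @ V @ a)

# Transformed regression: Y on (1, X1, X1 + X2); the X1 coefficient is beta1 - beta2
g, Vg = ols(np.column_stack([const, x1, x1 + x2]), y)
lam_transformed = g[1]
se_transformed = np.sqrt(Vg[1, 1])

t_stat = lam_transformed / se_transformed
print(lam_direct, lam_transformed, t_stat)
```

The two routes agree exactly because the transformed regression is only a relabelling of the same fitted model.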


### (d) True or False: In a multiple regression model, the coefficient β1 on regressor X1 always equals the coefficient from a simple regression of Y on X1. Briefly justify your answer.

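False. The two coefficients coincide only in special cases: when X1 is uncorrelated in the sample with the other regressors, or when the other regressors have zero coefficients. Otherwise the simple-regression coefficient also picks up the effect of the omitted regressors, as the omitted variable bias formula in part a) shows.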

# Problem 2

Policy makers often debate whether increasing public school funding leads to
better student outcomes. This question is particularly important in the context
of persistent achievement gaps across states. To study this question, a researcher collects data from 50 U.S. states in 2019. The dependent variable is AvgScore, the average score on a standardized 8th
grade math test (out of 500 points). One key independent variable is Spend, per-pupil public school expenditure (in thousands of dollars). Heteroskedasticity-robust standard errors are reported in parentheses.

### a) Interpret the coefficient on Spend in Regression (1). Is it large in a real-world sense? Is it statistically significant?

The coefficient on Spend does not seem large in a real-world sense. It means that a $1,000 increase in per-pupil spending is associated with an average score that is 2.9 points higher on a 500-point test, which is about 0.6% of the maximum score. Raising average scores by 50 points (10% of the scale) would take roughly $17,200 of additional spending per pupil (50 / 2.9 ≈ 17.2 thousand), so extra spending does not appear to make much of a difference.


It is statistically significant, however. The t-statistic is $\frac{\hat{\beta}}{s.e.} = \frac{2.90}{1.25} = 2.32$, which exceeds the critical value of 1.96, so the coefficient is statistically significant at the $5\%$ level.

### b) Suppose Mississippi spends $8,000 per pupil and Massachusetts spends $16,000. Predict the difference in average test scores between the two states using Regression (1).

The predicted difference in average test scores is $(8 - 16) \times 2.90 = -23.2$ points: Massachusetts is predicted to score about 23.2 points higher than Mississippi on average.

### c) Compute a 95% confidence interval for the coefficient on Spend. What does this interval imply about the precision of the estimate?

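The 95% confidence interval for the Spend coefficient is $[2.9-1.96 \times 1.25,\ 2.9+1.96 \times 1.25] = [0.45, 5.35]$. The interval excludes zero, but it is wide relative to the point estimate: the effect of an extra $1,000 per pupil could plausibly be anywhere from under half a point to more than five points, so the estimate is not very precise.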
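The arithmetic in parts a) to c) can be reproduced in a few lines. This sketch uses only the coefficient (2.90) and standard error (1.25) quoted in the answers, together with the large-sample normal critical value; the variable names are illustrative.

```python
from statistics import NormalDist

coef, se = 2.90, 1.25                  # Spend coefficient and robust SE quoted above

t_stat = coef / se
z_crit = NormalDist().inv_cdf(0.975)   # two-sided 5% critical value, about 1.96
ci_low, ci_high = coef - z_crit * se, coef + z_crit * se

# Predicted score gap for a spending difference of 16 - 8 = 8 (thousand dollars)
gap = coef * (16 - 8)

print(f"t = {t_stat:.2f}")
print(f"95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
print(f"predicted gap = {gap:.1f} points")
```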

### d) Do you think the regression error is likely to be homoskedastic or heteroskedastic in this context? Explain briefly.

I think the regression errors here are likely to be heteroskedastic. Spending is a coarse measure: states with similar per-pupil expenditure can allocate it very differently (for example, some schools may put more of their budget into sports than into academics), so test scores among high-spending states may be more dispersed than among low-spending states. If the variance of the error changes with Spend in this way, the errors are heteroskedastic, which is why robust standard errors are appropriate.

### The researcher suspects that other state-level characteristics may also affect student performance. In particular, states differ in income levels and demographics. She runs a new regression

### e) The coefficient on Spend decreased substantially from Regression (1) to Regression (2). Why might this have happened? Explain both the direction and magnitude of the change.

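This may have happened because Regression (1) suffered from omitted variable bias: Spend is plausibly negatively correlated with poverty and positively correlated with PctCollege. Since poverty has a negative coefficient and PctCollege a positive coefficient in the test-score regression, both omitted variables induce a positive bias on the Spend coefficient in Regression (1). Once they are controlled for, the Spend coefficient falls, and the size of the drop reflects how strongly these variables are correlated with Spend and how strongly they affect scores.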


### f) The researcher is concerned about omitted variable bias from not including ClassSize (average class size in public schools). Would including this variable likely increase or decrease the estimated effect of Spend ? Justify your reasoning

ClassSize most likely has a negative coefficient in the test-score regression (the smaller the class, the higher the score), and it is probably negatively correlated with Spend (the more a state spends, the smaller its classes). The product of these two negative signs is positive, so omitting ClassSize induces a positive bias on the Spend coefficient. Including it would therefore likely decrease the estimated effect of Spend.

### g) Based on Regression (2), interpret the coefficient on PctCollege. What does this suggest about the relationship between adult education levels and student performance?

For every one percentage point increase in the share of adults with a college education, the average score is higher by 1.75 points, holding the other regressors fixed. This suggests that adult education levels are positively associated with student performance.

### h) What is the adjusted R2 trying to capture in this context? Would you expect it to be higher or lower than the reported R2?

The adjusted $R^2$ measures how much of the variation in test scores the regression explains while penalizing for the number of regressors, so adding a regressor raises it only if the fit improves enough. You would expect the adjusted $R^2$ to be lower than the reported $R^2$.
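For reference, the standard definition with $n$ observations and $k$ regressors (not counting the intercept) makes the penalty explicit:

$$
\bar{R}^2 = 1 - \frac{n-1}{n-k-1}\,(1 - R^2)
$$

Because $\frac{n-1}{n-k-1} > 1$ whenever $k \geq 1$, the adjusted value falls below $R^2$ unless $R^2 = 1$, and with only $n = 50$ states the penalty per added regressor is not negligible.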

# Problem 3

### a) run an OLS regression of Earnings on Height. Discuss the interpretation, the sign and the size of the coefficient.

In [35]:
raw_data = pd.read_stata('../Earnings_and_Height_v2.dta')
raw_data

Unnamed: 0,sex,educ,earnings,height
0,0:female,13,84.054749,65.0
1,0:female,12,14.021395,65.0
2,0:female,16,84.054749,60.0
3,0:female,16,84.054749,67.0
4,0:female,16,28.560387,68.0
...,...,...,...,...
17865,1:male,12,18.168842,70.0
17866,1:male,12,84.054749,74.0
17867,1:male,12,16.081589,65.0
17868,1:male,12,84.054749,68.0


In [36]:
Y = raw_data[['earnings']]
X = sm.add_constant(raw_data[['height']])

model = sm.OLS(endog=Y, exog=X).fit(cov_type='HC0')
print(model.summary(alpha=0.05))


                            OLS Regression Results                            
Dep. Variable:               earnings   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                  0.011
Method:                 Least Squares   F-statistic:                     197.2
Date:                Sun, 28 Sep 2025   Prob (F-statistic):           1.46e-44
Time:                        16:08:33   Log-Likelihood:                -84104.
No. Observations:               17870   AIC:                         1.682e+05
Df Residuals:                   17868   BIC:                         1.682e+05
Df Model:                           1                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.5127      3.380     -0.152      0.8

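Interpretation of the size and sign of the coefficient: the coefficient on height is positive, about 0.71, so each additional inch of height is associated with earnings that are higher by 0.71 units (roughly $710 per year if earnings are recorded in thousands of dollars, which the data values above suggest). It is statistically significant: the regression F-statistic is 197.2 with a p-value of 1.46e-44.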

### b) Suppose that the mechanism described above is correct. Explain how this leads to omitted variable bias in the OLS regression of Earnings on Height. Does the bias lead the estimated slope to be too large or too small?

In this case height would be positively correlated with cognitive ability, and cognitive ability would be positively correlated with earnings. Cognitive ability is therefore an omitted variable in the regression of Earnings on Height, and it induces a positive bias on the height coefficient, making the estimated slope too large.

### c) Run a regression of Earnings on Height, including LT HS,HS, and some col as control variables

In [47]:
# Education-category dummies (0/1) built from years of education
raw_data['LT_HS'] = (raw_data['educ'] < 12).astype(int)
raw_data['HS'] = (raw_data['educ'] == 12).astype(int)
raw_data['Some_Col'] = ((raw_data['educ'] > 12) & (raw_data['educ'] < 16)).astype(int)
raw_data['College'] = (raw_data['educ'] >= 16).astype(int)


X = raw_data[["height", "LT_HS", "HS", "Some_Col"]]
X = sm.add_constant(X)

model = sm.OLS(endog=Y, exog=X).fit(cov_type='HC0')
print(model.summary(alpha=0.05))

                            OLS Regression Results                            
Dep. Variable:               earnings   R-squared:                       0.152
Model:                            OLS   Adj. R-squared:                  0.152
Method:                 Least Squares   F-statistic:                     904.6
Date:                Sun, 28 Sep 2025   Prob (F-statistic):               0.00
Time:                        16:11:09   Log-Likelihood:                -82731.
No. Observations:               17870   AIC:                         1.655e+05
Df Residuals:                   17865   BIC:                         1.655e+05
Df Model:                           4                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         30.2331      3.199      9.451      0.0

### c i) Compare the estimated coefficient on Height in two regressions. Is there a large change in the coefficient? Has it changed in a way consistent with the cognitive ability explanation? Explain


Yes, there is a major change: the estimated coefficient on height decreases substantially once the education controls are included. This is consistent with the cognitive ability explanation, which predicted a positive omitted variable bias in the regression without controls.

### c ii) The regression omits the control variable College. Why?

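The regression omits College to avoid perfect multicollinearity (the dummy variable trap). The four education dummies sum to one for every observation, which is exactly the constant term, so with all four included the regressor matrix would not have full column rank and the OLS coefficients would not be uniquely determined.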
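The rank argument can be illustrated on a toy example. The `educ` values below are hypothetical and chosen only so that every education category appears at least once.

```python
import numpy as np

educ = np.array([10, 12, 12, 14, 16, 18, 11, 13])  # hypothetical years of education

lt_hs = (educ < 12).astype(float)
hs = (educ == 12).astype(float)
some_col = ((educ > 12) & (educ < 16)).astype(float)
college = (educ >= 16).astype(float)
const = np.ones(len(educ))

X_all = np.column_stack([const, lt_hs, hs, some_col, college])   # 5 columns
X_drop = np.column_stack([const, lt_hs, hs, some_col])           # 4 columns

rank_all = np.linalg.matrix_rank(X_all)    # one column is redundant
rank_drop = np.linalg.matrix_rank(X_drop)  # full column rank

print(rank_all, X_all.shape[1])
print(rank_drop, X_drop.shape[1])
```

With all four dummies the matrix has five columns but rank four; dropping College restores full column rank.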

### c iii) Test the joint null hypothesis that the coefficients on the education variables are equal to 0

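The F-statistic of 904.6 reported in the summary does not test this hypothesis: it tests that all four slope coefficients, including the one on height, are zero. The joint null here restricts only the coefficients on LT_HS, HS and Some_Col, so it requires a separate F-test (or Wald test) with three restrictions. Given that adding the education dummies raises the $R^2$ from 0.011 to 0.152 with 17,870 observations, the statistic will be far above the 1% critical value of 3.78 for $F_{3,\infty}$, so the null that the education coefficients are all zero is rejected.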
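A heteroskedasticity-robust joint test of this kind can be sketched as a Wald statistic divided by the number of restrictions. The example below runs on simulated data rather than the earnings dataset, and the helper `ols_hc0`, the coefficient values and the sample size are all assumptions made for illustration. In statsmodels, the fitted results object also has an `f_test` method, so a call along the lines of `model.f_test("LT_HS = 0, HS = 0, Some_Col = 0")` on the fit from c) should carry out the corresponding test.

```python
import numpy as np

def ols_hc0(X, y):
    """OLS coefficients with the HC0 (White) sandwich covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    u = y - X @ b
    meat = (X * u[:, None] ** 2).T @ X          # X' diag(u^2) X
    return b, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(2)
n = 5_000
height = rng.normal(67, 4, size=n)
group = rng.integers(0, 4, size=n)              # 0=LT_HS, 1=HS, 2=Some_Col, 3=College
D = (group[:, None] == np.arange(3)).astype(float)   # College is the omitted base
y = (30 + 0.1 * height + D @ np.array([-30.0, -20.0, -10.0])
     + rng.normal(0, 20, size=n))

X = np.column_stack([np.ones(n), height, D])
b, V = ols_hc0(X, y)

# H0: the three education coefficients are all zero (q = 3 restrictions)
q = 3
R = np.zeros((q, X.shape[1]))
R[:, 2:] = np.eye(q)
Rb = R @ b
wald = Rb @ np.linalg.solve(R @ V @ R.T, Rb)
f_stat = wald / q

print(f"robust F-statistic: {f_stat:.1f}")
```

The restriction matrix `R` picks out only the education coefficients, which is what distinguishes this test from the overall regression F-statistic.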

### c iv) Discuss the values of the estimated coefficients on LT HS,HS, and Some Col. (Each of the estimated coefficients is negative, and the coefficient on LT HS is more negative than the coefficient on HS, which in turn is more negative than the coefficient on Some Col. Why? What do the coefficients measure?)

They are all negative because each coefficient measures the difference in expected earnings between that education level and the omitted category, College, holding height fixed. Each coefficient is therefore the decrease in earnings one can expect relative to an otherwise similar individual who completed college. Leaving education earlier lowers earnings by more than leaving at some point during college, which is why the coefficient on LT_HS is the most negative and the coefficient on Some_Col the least.

### d) Run an OLS regression of Height, on LT HS,HS, and Some Col, get residuals from this regression. Regress Earnings on the residuals you just obtained, compare the results with the ones you obtained in c. Discuss.

In [None]:
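# Note: `model` is still the fit from part c), so these are the residuals of the
# earnings regression, not of a regression of height on the education dummies
raw_data['resids_first_regression'] = model.resid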

X = sm.add_constant(raw_data['resids_first_regression'])
Y = raw_data['earnings']

model = sm.OLS(endog=Y, exog=X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:               earnings   R-squared:                       0.848
Model:                            OLS   Adj. R-squared:                  0.848
Method:                 Least Squares   F-statistic:                 9.988e+04
Date:                Sun, 28 Sep 2025   Prob (F-statistic):               0.00
Time:                        16:26:41   Log-Likelihood:                -67354.
No. Observations:               17870   AIC:                         1.347e+05
Df Residuals:                   17868   BIC:                         1.347e+05
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                     

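The slope on the residuals is 1 and the $R^2$ is 0.848, which equals 1 − 0.152. This is a mechanical result rather than evidence of correct specification: `model.resid` at this point holds the residuals from the regression in c) (Earnings on Height and the education dummies), and regressing Earnings on its own OLS residuals always gives a slope of 1 and an $R^2$ equal to one minus the $R^2$ of the original regression. The exercise asks instead for the residuals from a regression of Height on LT_HS, HS and Some_Col. By the Frisch-Waugh-Lovell theorem, regressing Earnings on those residuals reproduces the Height coefficient from c) exactly, because the residuals are the part of Height that is uncorrelated with the education controls. The standard errors differ somewhat, since the education dummies are not partialled out of Earnings.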
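The partialling-out procedure described in the question can be sketched on simulated data. The variable names mirror the exercise, but the data-generating values and the helper `ols_coef` are hypothetical; the point is only that the two-step slope matches the long-regression coefficient on height.

```python
import numpy as np

def ols_coef(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(3)
n = 2_000
group = rng.integers(0, 4, size=n)                   # 0=LT_HS, 1=HS, 2=Some_Col, 3=College
D = (group[:, None] == np.arange(3)).astype(float)   # College is the omitted base
const = np.ones(n)
W = np.column_stack([const, D])                      # controls: constant + education dummies

height = 66 + D @ np.array([-1.5, -1.0, -0.5]) + rng.normal(0, 4, size=n)
earnings = (30 + 0.2 * height + D @ np.array([-30.0, -20.0, -10.0])
            + rng.normal(0, 20, size=n))

# Long regression: earnings on height and the controls
b_long = ols_coef(np.column_stack([height, W]), earnings)[0]

# Step 1: residualise height on the controls
height_resid = height - W @ ols_coef(W, height)
# Step 2: regress earnings on the residualised height (plus a constant)
b_fwl = ols_coef(np.column_stack([const, height_resid]), earnings)[1]

print(b_long, b_fwl)
```

Applied to the earnings data, the first step would be a regression of `height` on a constant, `LT_HS`, `HS` and `Some_Col`, with the residuals from that fit used as the single regressor in the second step.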