# Problem 1

In [2]:
import numpy as np
import pandas as pd 
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import norm

### a) You have data (Yi,X1i,X2i) generated from the model Yi = β0 + β1X1i + β2X2i + ei, where ei satisfies the usual exogeneity condition. You are interested in estimating the causal effect of X1 on Y . Suppose X1 and X2 are positively correlated, and β2 > 0. If you regress Y only on X1, will your estimate of β1 be biased? In what direction?

We start with the true regression model:

$$
Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i
$$

where $E[e_i \mid X_{1i}, X_{2i}] = 0$.

⸻

Now, suppose instead we run a regression of $Y$ only on $X_1$:

$$
Y_i = \tilde{\beta}_0 + \tilde{\beta}_1 X_{1i} + u_i
$$

We want to determine the relationship between $\tilde{\beta}_1$ and the true coefficient $\beta_1$.

⸻

The OLS estimator in this simple regression is:

$$
\hat{\beta}_1 = \frac{\text{Cov}(Y_i, X_{1i})}{\text{Var}(X_{1i})}
$$

Substituting $Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i$:

$$
\hat{\beta}_1 = \frac{\text{Cov}(\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + e_i,\, X_{1i})}{\text{Var}(X_{1i})}
$$

Since $\beta_0$ is a constant and $e_i$ is mean-independent of $X_1$, this reduces to:

$$
\hat{\beta}_1 = \frac{\beta_1 \,\text{Var}(X_{1i}) + \beta_2 \,\text{Cov}(X_{2i}, X_{1i})}{\text{Var}(X_{1i})}
$$

⸻

Thus:

$$
\hat{\beta}_1 = \beta_1 + \beta_2 \cdot \frac{\text{Cov}(X_{1i}, X_{2i})}{\text{Var}(X_{1i})}
$$

⸻

Direction of Bias

- $\beta_2 > 0$ (given)
- $\text{Cov}(X_1, X_2) > 0$ (since they are positively correlated)
- $\text{Var}(X_1) > 0$

Therefore:

$$
\hat{\beta}_1 > \beta_1
$$

⸻

Final Conclusion

$$
\hat{\beta}_1 \text{ has a positive bias.}
$$
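
As a rough numerical check of the formula above, the following sketch simulates data from the model and compares the slope of the short regression with $\beta_1 + \beta_2\,\text{Cov}(X_1, X_2)/\text{Var}(X_1)$. The parameter values, the sample size and the way $X_2$ is built from $X_1$ are illustrative assumptions, not part of the problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical parameter values chosen only for illustration
beta0, beta1, beta2 = 1.0, 2.0, 3.0

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # positively correlated with x1
e = rng.normal(size=n)               # drawn independently of x1 and x2
y = beta0 + beta1 * x1 + beta2 * x2 + e

# Slope from regressing y on x1 only: sample Cov(y, x1) / Var(x1)
var_x1 = np.var(x1, ddof=1)
short_slope = np.cov(y, x1)[0, 1] / var_x1

# Value predicted by the omitted variable bias formula
ovb_prediction = beta1 + beta2 * np.cov(x1, x2)[0, 1] / var_x1

print(f"short-regression slope: {short_slope:.3f}")
print(f"OVB formula prediction: {ovb_prediction:.3f}")
print(f"true beta1:             {beta1:.3f}")
```

Under these assumptions the short-regression slope should sit above the true $\beta_1$ and close to the value given by the formula.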

### b) A researcher runs a simple regression of income (Y ) on years of education (X1) and finds a strong positive effect. Another researcher argues that this estimate is likely upward biased because it omits cognitive ability (X2). Explain why omitted variable bias may be present, what is the most likely sign of the bias and how it affects the interpretation of the education coefficient

Omitted variable bias may be present because cognitive ability ($X_2$) is likely positively correlated with years of education and likely has its own causal effect on income. When it is omitted, part of its effect is picked up by the education coefficient. The likely sign of the bias is positive (as in part (a)), so the education coefficient in the simple regression is probably overstated (i.e., too large): it attributes to education part of the income difference that is really due to ability, and it should not be read as the causal effect of an extra year of education.

The distribution of the test statistic is approximately standard normal, $N(0,1)$, or a $t$ distribution with $30 + 35 - 2 = 63$ degrees of freedom.

### (c) Consider the regression model Yi = β0 + β1X1i + β2X2i + ei. Transform the regression so that you can use a t-statistic to test (i) H0 : β1 = β2, (ii) H0 : β1 + 2β2 = 0, (iii) H0 : β1 + β2 = 1 (you may redefine the dependent variable)

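i) Define $\gamma = \beta_1 - \beta_2$, so that $\beta_1 = \gamma + \beta_2$. Substituting into the model gives $Y_i = \beta_0 + \gamma X_{1i} + \beta_2(X_{1i} + X_{2i}) + e_i$. Regress $Y$ on $X_1$ and the new variable $X_1 + X_2$; the t-statistic on $X_1$ tests $\gamma = 0$, which is equivalent to $\beta_1 = \beta_2$.

ii) Define $\gamma = \beta_1 + 2\beta_2$, so that $\beta_1 = \gamma - 2\beta_2$. Substituting gives $Y_i = \beta_0 + \gamma X_{1i} + \beta_2(X_{2i} - 2X_{1i}) + e_i$. Regress $Y$ on $X_1$ and $X_2 - 2X_1$; the t-statistic on $X_1$ tests $\gamma = 0$, which is equivalent to $\beta_1 + 2\beta_2 = 0$.

iii) Define $\gamma = \beta_1 + \beta_2 - 1$, so that $\beta_1 = \gamma + 1 - \beta_2$. Substituting and moving $X_{1i}$ to the left-hand side gives $Y_i - X_{1i} = \beta_0 + \gamma X_{1i} + \beta_2(X_{2i} - X_{1i}) + e_i$. Regress the redefined dependent variable $Y - X_1$ on $X_1$ and $X_2 - X_1$; the t-statistic on $X_1$ tests $\gamma = 0$, which is equivalent to $\beta_1 + \beta_2 = 1$. In each case you essentially need to create a new variable computed as specified by the transformation.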
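
One way to set up the test of $\beta_1 = \beta_2$ as a single t-test is to regress $Y$ on $X_1$ and $X_1 + X_2$, so that the coefficient on $X_1$ equals $\beta_1 - \beta_2$. The sketch below checks this numerically. The simulated data, the helper `ols` and the use of classical (homoskedastic) standard errors are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000

# Hypothetical data-generating values, for illustration only
x1 = rng.normal(size=n)
x2 = 0.3 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients and classical standard errors (intercept first)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * XtX_inv))
    return coef, se


# Original regression: y on x1 and x2
b, _ = ols(y, x1, x2)

# Transformed regression: y on x1 and (x1 + x2)
g, g_se = ols(y, x1, x1 + x2)

gamma_hat = g[1]               # estimates beta1 - beta2
t_stat = gamma_hat / g_se[1]   # t-statistic for H0: beta1 = beta2

print(f"beta1_hat - beta2_hat = {b[1] - b[2]:.4f}")
print(f"gamma_hat             = {gamma_hat:.4f}")
print(f"t-statistic           = {t_stat:.2f}")
```

The same pattern applies to the other two hypotheses by changing the constructed regressor (and, for the third, the dependent variable).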

### (d) True or False: In a multiple regression model, the coefficient $\hat{\beta}_1$ on regressor X1 always equals the coefficient from a simple regression of Y on X1. Briefly justify your answer.

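False. The two coefficients coincide only in special cases, for example when $X_1$ is uncorrelated in the sample with the other regressors. Otherwise the simple regression coefficient also picks up the effect of the omitted regressors, which is the omitted variable bias shown in part (a).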

# Problem 2

Policy makers often debate whether increasing public school funding leads to
better student outcomes. This question is particularly important in the context
of persistent achievement gaps across states. To study this question, a researcher collects data from 50 U.S. states in 2019. The dependent variable is AvgScore, the average score on a standardized 8th
grade math test (out of 500 points). One key independent variable is Spend, per-pupil public school expenditure (in thousands of dollars). Heteroskedasticity-robust standard errors are reported in parentheses.

### a) Interpret the coefficient on Spend in Regression (1). Is it large in a real-world sense? Is it statistically significant?

The coefficient on Spend does not seem that large in a real-world sense. It means that an extra \$1,000 of spending per pupil is associated with an average score that is 2.9 points higher on a test scored out of 500. This is a fairly marginal increase (about 0.6% of the maximum score per extra thousand dollars). For average scores to rise by 10% of the scale (50 points), spending would have to increase by roughly \$17,000 per student, so extra spending does not appear to make much of a difference.


It is statistically significant, however. The t-statistic is $\frac{\hat{\beta}}{s.e.} = \frac{2.90}{1.25} = 2.32$, which exceeds the critical value of 1.96, so the coefficient is statistically significant at the $5\%$ level.
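
The t-statistic and an approximate two-sided p-value can be computed as follows. This is a minimal sketch that assumes the large-sample standard normal approximation; with only 50 states, a t distribution would give a slightly larger p-value.

```python
from statistics import NormalDist

coef = 2.90   # coefficient on Spend, Regression (1)
se = 1.25     # its heteroskedasticity-robust standard error

t_stat = coef / se
# Two-sided p-value under the large-sample N(0, 1) approximation
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, two-sided p-value = {p_value:.3f}")
```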

### b) Suppose Mississippi spends $8,000 per pupil and Massachusetts spends $16,000. Predict the difference in average test scores between the two states using Regression (1).

Using Regression (1), the predicted difference in average test scores (Mississippi minus Massachusetts) is $(8 - 16) \times 2.90 = -23.2$, so Massachusetts is predicted to score about 23.2 points higher on average.

### c) Compute a 95% confidence interval for the coefficient on Spend. What does this interval imply about the precision of the estimate?

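The 95% confidence interval for the coefficient on Spend is $[2.90 - 1.96 \times 1.25,\ 2.90 + 1.96 \times 1.25] = [0.45, 5.35]$. The interval excludes zero but is wide relative to the point estimate, so the effect is not estimated very precisely.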
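
A short sketch of the interval calculation, assuming the usual large-sample normal critical value (about 1.96) applied to the reported estimate and standard error:

```python
from statistics import NormalDist

coef, se = 2.90, 1.25

# 97.5th percentile of N(0, 1), the critical value for a 95% interval
z = NormalDist().inv_cdf(0.975)

lower, upper = coef - z * se, coef + z * se
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```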

### d) Do you think the regression error is likely to be homoskedastic or heteroskedastic in this context? Explain briefly.

I think the regression errors here are likely to be heteroskedastic. As spending per student increases, schools can put the money toward different priorities: some may focus on sports yet spend a similar amount per student as those that focus on academics. Test scores would therefore vary more around the regression line at high spending levels, which suggests heteroskedasticity.

### The researcher suspects that other state-level characteristics may also affect student performance. In particular, states differ in income levels and demographics. She runs a new regression

### e) The coefficient on Spend decreased substantially from Regression (1) to Regression (2). Why might this have happened? Explain both the direction and magnitude of the change.

This may have happened because Regression (1) suffered from omitted variable bias: Spend is likely negatively correlated with poverty and positively correlated with pct in college. Since the coefficients on these variables are negative and positive, respectively, in the test score regression, both omissions induce a positive bias on Spend in the original regression. Controlling for them removes that bias, which explains the substantial reduction in the coefficient on Spend.


### f) The researcher is concerned about omitted variable bias from not including ClassSize (average class size in public schools). Would including this variable likely increase or decrease the estimated effect of Spend ? Justify your reasoning

ClassSize most likely has a negative coefficient in the test score regression (i.e., the smaller the class, the higher the test score), and it is probably negatively correlated with Spend (e.g., the more you spend, the smaller your classes). Omitting it therefore induces a positive omitted variable bias on Spend, so including it would likely decrease the estimated effect of Spend.
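
The sign reasoning used in parts (e) and (f) follows one rule: the bias has the sign of the omitted variable's effect on the outcome times the sign of its correlation with the included regressor. The sketch below encodes that rule. The helper `ovb_sign` is hypothetical, and the signs passed to it are the assumptions argued in the text, not estimates.

```python
def ovb_sign(effect_on_y: int, corr_with_x: int) -> int:
    """Sign of omitted variable bias: sign(beta_omitted) * sign(Cov(X, omitted))."""
    return effect_on_y * corr_with_x


# (assumed sign of effect on AvgScore, assumed sign of correlation with Spend)
assumed = {
    "poverty": (-1, -1),
    "pct in college": (+1, +1),
    "ClassSize": (-1, -1),
}

for name, (effect, corr) in assumed.items():
    sign = ovb_sign(effect, corr)
    print(f"{name}: bias on Spend is {'positive' if sign > 0 else 'negative'}")
```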

### h) What is the adjusted R2 trying to capture in this context? Would you expect it to be higher or lower than the reported R2

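The adjusted $R^2$ measures goodness of fit while penalizing for the number of regressors: adding a regressor raises it only if the fit improves by enough to offset the lost degree of freedom. I would expect it to be lower than the reported $R^2$, because the correction factor $(n-1)/(n-k-1)$ is greater than 1 whenever the regression has at least one regressor.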
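
For reference, a minimal sketch of the standard adjustment, $\bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$. Only $n = 50$ comes from the problem; the $R^2$ value and the number of regressors $k$ below are hypothetical placeholders, since the regression table is not reproduced here.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k regressors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# n = 50 states is from the problem; r2 and k are hypothetical placeholders
r2, n, k = 0.60, 50, 3
adj = adjusted_r2(r2, n, k)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj:.3f}")
```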

# Problem 3

In [17]:
X = sm.add_constant(vote_df['X'])
Y = vote_df['Y']

model = sm.OLS(endog=Y, exog=X).fit(cov_type='HC0')
print(model.summary(alpha=0.05))


                            OLS Regression Results                            
Dep. Variable:                      Y   R-squared:                       0.856
Model:                            OLS   Adj. R-squared:                  0.855
Method:                 Least Squares   F-statistic:                     1197.
Date:                Wed, 24 Sep 2025   Prob (F-statistic):           3.96e-79
Time:                        09:31:25   Log-Likelihood:                -565.20
No. Observations:                 173   AIC:                             1134.
Df Residuals:                     171   BIC:                             1141.
Df Model:                           1                                         
Covariance Type:                  HC0                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         26.8122      0.880     30.465      0.0

### b) What is the estimated slope? Explain in words what it means. Is the estimated effect of spending on share large or small? Explain what you mean by "large" or "small"

The estimated slope is about 0.46: a one percentage point increase in the share of spending is associated with a 0.46 percentage point increase in the share of votes. I would call this effect large, in the sense that realistic changes in spending translate into meaningful changes in vote share. For example, a 10 percentage point increase in spending share predicts a vote share that is about 4.6 percentage points higher.

### Report the 95% confidence interval for β1, the slope of the population regression line.

From the regression above, the 95% confidence interval is 0.438 to 0.490.

### c) Does spending explain a large fraction of the variance in vote? Explain.

Yes. The $R^2$ is 0.856, so about 86% of the sample variance in vote share is explained by its linear relationship with spending share.

### d) Look at the correlation coefficient between share and vote computed in the previous problem, and compare its square to the R2. How are they related? Provide a simple mathematical derivation of this fact.

In [18]:
corr_matrix = vote_df.corr()

rho = corr_matrix.loc['Y','X']
print(rho)

rho_squared = rho**2
print(rho_squared)


0.9252801762904964
0.8561434046361721


The squared correlation (0.8561) equals the regression $R^2$ (0.856) up to rounding.

Derivation:

$$
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}
$$

$$
r^2 = \frac{\left(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)\left(\sum_{i=1}^n (y_i - \bar{y})^2\right)}
$$

---

In simple linear regression with intercept:
$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad \text{with} \quad 
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.
$$

Since $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$,
$$
\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x}).
$$

---

Explained sum of squares:
$$
ESS = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 
     = \sum_{i=1}^n \big(\hat{\beta}_1 (x_i - \bar{x})\big)^2
     = \hat{\beta}_1^2 \sum_{i=1}^n (x_i - \bar{x})^2.
$$

Substitute $\hat{\beta}_1$:
$$
ESS = \left(\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)^2 \sum_{i=1}^n (x_i - \bar{x})^2,
$$

$$
ESS = \frac{\left(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}.
$$

---

Total sum of squares:
$$
SST = \sum_{i=1}^n (y_i - \bar{y})^2.
$$

Coefficient of determination:
$$
R^2 = \frac{ESS}{SST} 
     = \frac{\dfrac{\left(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}{\sum_{i=1}^n (y_i - \bar{y})^2}.
$$

Simplify:
$$
R^2 = \frac{\left(\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)\left(\sum_{i=1}^n (y_i - \bar{y})^2\right)}.
$$

---

Therefore,
$$
R^2 = r^2.
$$

The relationship between correlation coefficient squared and R² in simple linear regression:

In simple linear regression, $\rho^2_{XY} = R^2$ because both measure the proportion of variance in Y explained by X.
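
The identity can also be checked numerically. The sketch below uses simulated data (an assumption made for illustration, not the vote data) and computes $R^2$ as $ESS/SST$, following the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 + 0.5 * x + rng.normal(size=200)   # hypothetical data

# Sample correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Simple regression with intercept; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r_squared = ess / sst

print(r ** 2, r_squared)
```

The two printed numbers should agree up to floating-point error.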



### What is the root mean squared error of the regression? What does this mean?

In [19]:
RMSE = np.sqrt(np.mean(vote_df['u_hat']**2))
print(RMSE)

6.347770441872386


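The RMSE is the square root of the average squared residual. It measures the typical distance between the observed values of Y and the fitted regression line, in the units of Y: here a typical prediction error is about 6.3.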
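
A small sketch of how the RMSE is computed, shown next to the mean absolute error and the degrees-of-freedom-adjusted standard error of the regression for comparison. The residuals are hypothetical values chosen only to illustrate the definitions.

```python
import math

# Hypothetical residuals, chosen only to illustrate the definitions
residuals = [3.0, -4.0, 0.0, 5.0, -4.0]
n = len(residuals)

sum_sq = sum(u ** 2 for u in residuals)

rmse = math.sqrt(sum_sq / n)               # root mean squared error
mae = sum(abs(u) for u in residuals) / n   # mean absolute error
ser = math.sqrt(sum_sq / (n - 2))          # divides by n - 2 in a simple regression

print(f"RMSE = {rmse:.3f}, mean absolute error = {mae:.3f}, SER = {ser:.3f}")
```

Because squaring weights large residuals more heavily, the RMSE is at least as large as the mean absolute error.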

### f) Based on your graph from 2(h), does the error term appear to be homoskedastic or heteroskedastic?

The error term appears to be homoskedastic: the spread of the residuals does not change noticeably across the fitted line.


### Run the regression again without the "robust" option. Compare the results to what you obtained with the "robust" option. What is the same and what is different?

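This was done earlier. The coefficient estimates and the $R^2$ are the same with and without the robust option. Only the standard errors differ, and with them the test statistics, p-values and confidence intervals. Here the heteroskedasticity-robust standard errors give a slightly tighter confidence interval. That is the only major difference.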
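
To illustrate which quantities can change between the two options, the sketch below computes classical and HC0 standard errors by hand for one OLS fit. The data are simulated with an error variance that grows with the regressor (an assumption for illustration, not the vote data), and the formulas are the textbook ones rather than a reproduction of the `statsmodels` internals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical heteroskedastic data: error spread grows with x
x = rng.uniform(0, 10, size=n)
e = rng.normal(scale=0.5 + 0.5 * x, size=n)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

# The coefficient estimates do not depend on the choice of standard errors
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Classical (homoskedasticity-only) standard errors
sigma2 = resid @ resid / (n - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 heteroskedasticity-robust (sandwich) standard errors
meat = X.T @ (X * resid[:, None] ** 2)
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("coefficients:", beta_hat)
print("classical SE:", se_classical)
print("HC0 SE:      ", se_hc0)
```

The robust standard errors can be larger or smaller than the classical ones, depending on how the error variance relates to the regressor.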