# 7-3 Bias-Variance Tradeoff: an Exercise

Assume $y = 1 + 2x_1 + 3x_2 + 4x_3 + u$, where $u\sim \mathcal{N}(0,1)$. We also know that $x_3\sim U(1,10)$ is independent of both $x_1$ and $x_2$, while $x_1$ and $x_2$ are correlated and follow a bivariate normal distribution with
$$\mu= \begin{bmatrix} 0 \\ 0\end{bmatrix}$$
and
$$\Sigma = \begin{bmatrix} 4&5 \\ 5&9 \end{bmatrix}$$

1. Generate a sample with 1000 observations. \
You can use `np.random.uniform` to generate $x_3$, \
`np.random.multivariate_normal` to generate $x_1$ and $x_2$, and \
`np.random.randn` to generate $u$.

For reproducibility, add `np.random.seed(1)`.

In [12]:
import numpy as np
import pandas as pd


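np.random.seed(1)  # fix the seed so the draws are reproducible
x3 = np.random.uniform(1,10,1000)  # x3 ~ U(1,10), independent of x1 and x2
X12 = np.random.multivariate_normal([0,0],[[4,5],[5,9]],1000)  # columns are x1 and x2
u = np.random.randn(1000)  # u ~ N(0,1)


df = pd.DataFrame({"x1":X12[:,0], "x2":X12[:,1],"x3":x3, "u":u})

# true model: y = 1 + 2*x1 + 3*x2 + 4*x3 + u
df["y"] = 1 + 2*df["x1"]+ 3* df["x2"]+ 4*df["x3"]+df["u"]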

df.head()

| | x1 | x2 | x3 | u | y |
|---|---|---|---|---|---|
| 0 | -1.06326 | -2.308605 | 4.753198 | 0.29637 | 11.256825 |
| 1 | -2.1219 | -2.017276 | 7.48292 | 8e-06 | 20.636062 |
| 2 | 2.801666 | 4.252892 | 1.001029 | -0.034211 | 23.331915 |
| 3 | 0.361978 | 0.154109 | 3.720993 | -0.281499 | 16.788755 |
| 4 | -0.120881 | 3.163531 | 2.320803 | 0.580178 | 20.11222 |
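
As a quick sanity check on the simulated regressors, the sample covariance of $x_1$ and $x_2$ should be close to $\Sigma$, and their correlation close to $5/\sqrt{4\cdot 9}\approx 0.83$. The sketch below is self-contained and uses `np.random.default_rng(0)`, which is an assumption made here for illustration and not the notebook's `np.random.seed(1)`, so its draws differ from the `df` above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: illustrative seed, not the notebook's
sigma = np.array([[4.0, 5.0], [5.0, 9.0]])
draws = rng.multivariate_normal([0.0, 0.0], sigma, size=1000)

# Sample moments of (x1, x2); rowvar=False treats columns as variables
sample_cov = np.cov(draws, rowvar=False)
sample_corr = np.corrcoef(draws, rowvar=False)[0, 1]

# Population correlation implied by sigma: 5 / (2 * 3)
population_corr = sigma[0, 1] / np.sqrt(sigma[0, 0] * sigma[1, 1])

print(sample_cov)
print(sample_corr, population_corr)
```

The strong positive correlation is what makes omitting x2 costly in question 3.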


2. Regress y on x1, x2, and x3, and obtain the estimates.

In [22]:
import statsmodels.formula.api as smf
reg = smf.ols("y~x1 + x2 + x3", data = df)
res = reg.fit()
res.params

Intercept    0.921935
x1           2.021681
x2           2.994631
x3           4.011834
dtype: float64

3. If we omit the key variable x2, what happens to the estimated coefficient on x1? What about the coefficient on x3 and the estimated constant?

In [23]:
reg2 = smf.ols("y~x1+x3", data=df)
res2 = reg2.fit()
res2.params

Intercept    1.122309
x1           5.837949
x3           3.965432
dtype: float64
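
The jump in the x1 coefficient (from about 2.02 to about 5.84) is what the omitted-variable-bias formula predicts: the short regression loads onto x1 the effect of x2 times the slope of x2 on x1. In the population that slope is $\mathrm{Cov}(x_1,x_2)/\mathrm{Var}(x_1) = 5/4$, so the bias is $3 \cdot 1.25 = 3.75$ and the short coefficient should be near $2 + 3.75 = 5.75$. The x3 coefficient and the constant move little because x3 is independent of x2 and x2 has mean zero. The following is a minimal sketch of that decomposition, assuming a fresh simulated sample (seeded with `np.random.default_rng(0)`, not the notebook's seed) and a small hypothetical helper `ols` built on `np.linalg.lstsq` rather than statsmodels.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: illustrative seed, not the notebook's
n = 1000
x3 = rng.uniform(1, 10, n)
x12 = rng.multivariate_normal([0, 0], [[4, 5], [5, 9]], n)
x1, x2 = x12[:, 0], x12[:, 1]
y = 1 + 2 * x1 + 3 * x2 + 4 * x3 + rng.standard_normal(n)


def ols(columns, target):
    """Hypothetical helper: OLS coefficients (constant first) via least squares."""
    X = np.column_stack([np.ones(len(target))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef


b_long = ols([x1, x2, x3], y)   # constant, x1, x2, x3
b_short = ols([x1, x3], y)      # constant, x1, x3 (x2 omitted)
delta = ols([x1, x3], x2)       # auxiliary regression of x2 on constant, x1, x3

# In-sample identity: short = long (included terms) + beta2_hat * delta
implied_short = b_long[[0, 1, 3]] + b_long[2] * delta

print("short regression:", b_short)
print("implied by OVB  :", implied_short)
print("slope of x2 on x1:", delta[1])
```

The identity holds exactly in any sample; only the value of `delta` varies from draw to draw around its population counterpart.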

4. If we omit the key variable x3, what happens to the estimated coefficient on x1? What about the standard error of $\hat{\beta}_1$?

In [24]:
reg3 = smf.ols("y~x1+x2", data=df)
res3 = reg3.fit()
res3.params

Intercept    22.990205
x1            1.938675
x2            2.840492
dtype: float64

In [25]:
res3.bse

Intercept    0.330763
x1           0.304058
x2           0.200900
dtype: float64

In [26]:
res.bse

Intercept    0.072874
x1           0.028554
x2           0.018871
x3           0.011984
dtype: float64
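
Here the x1 coefficient stays close to 2 (1.94), because x3 is independent of x1 and x2, but the omitted term $4x_3$ has to go somewhere: its mean is absorbed by the constant and its variation by the error term. That makes the estimates much noisier, and the standard error on x1 rises from 0.028554 to 0.304058, a factor of roughly 10.6. The back-of-the-envelope arithmetic below is a sketch under the stated population assumptions ($x_3\sim U(1,10)$, $\mathrm{Var}(u)=1$); the variable names are chosen here for illustration.

```python
import math

beta0, beta3 = 1.0, 4.0
a, b = 1.0, 10.0                      # support of the uniform distribution for x3

mean_x3 = (a + b) / 2                 # E[x3] for U(a, b)
var_x3 = (b - a) ** 2 / 12            # Var(x3) for U(a, b)

# The constant absorbs beta3 * E[x3] when x3 is left out
intercept_short = beta0 + beta3 * mean_x3

# The error term becomes u + beta3 * (x3 - E[x3])
error_var_long = 1.0
error_var_short = error_var_long + beta3 ** 2 * var_x3

# Standard errors scale with the square root of the error variance
se_ratio = math.sqrt(error_var_short / error_var_long)

print(intercept_short, error_var_short, se_ratio)
```

These values line up with the estimated constant of 22.99 and with the observed tenfold increase in the standard errors: omitting an independent regressor costs precision rather than unbiasedness of the slope estimates.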

5. To avoid biased estimates, you regress y on x1, x2, x3, and $x_3^2$. Add an `x3sq` column to `df`, and run a new regression. Comment on your findings.

In [27]:
df["x3sq"] = df["x3"]**2

In [28]:
reg4 = smf.ols("y~x1+x2+x3+x3sq", data=df)
res4 = reg4.fit()
res4.params

Intercept    0.841531
x1           2.022856
x2           2.993615
x3           4.049749
x3sq        -0.003465
dtype: float64

In [29]:
res4.bse

Intercept    0.139316
x1           0.028615
x2           0.018936
x3           0.057253
x3sq         0.005116
dtype: float64
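
Adding the irrelevant regressor leaves the coefficients essentially unchanged, and the estimate on `x3sq` is small relative to its standard error (-0.003465 against 0.005116, a t-ratio of about -0.68). The price is precision: the standard error on x3 rises from 0.011984 to 0.057253, a factor of about 4.8, because x3 and its square are highly correlated on $[1,10]$. One way to see this is through the variance inflation factor, $1/(1-R^2)$, where $R^2$ comes from regressing x3 on the other regressors. The sketch below approximates that $R^2$ by the squared correlation between x3 and `x3sq` alone, which is an assumption justified by x3 being independent of x1 and x2; it also uses its own seed rather than the notebook's.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: illustrative seed, not the notebook's
x3 = rng.uniform(1, 10, 1000)
x3sq = x3 ** 2

# With a single other regressor, R^2 equals the squared sample correlation
r2 = np.corrcoef(x3, x3sq)[0, 1] ** 2

vif = 1.0 / (1.0 - r2)          # variance inflation factor for x3
se_inflation = np.sqrt(vif)     # approximate multiplier on the standard error

print(r2, vif, se_inflation)
```

The resulting multiplier should land in the neighborhood of the observed ratio of the two standard errors.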

6. What if we add a variable x4 that is equal to $x_1+x_2+x_3$?

In [65]:
# x4 is an exact linear combination of x1, x2, and x3 (no noise term is added)
df["x4"] = df["x1"] + df["x2"] + df["x3"]

In [66]:
reg5 = smf.ols("y~x1+x2+x3+x4", data=df)
res5 = reg5.fit()
res5.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.996 |
| Model: | OLS | Adj. R-squared: | 0.996 |
| Method: | Least Squares | F-statistic: | 88550.0 |
| Date: | Mon, 21 Mar 2022 | Prob (F-statistic): | 0.0 |
| Time: | 13:34:39 | Log-Likelihood: | -1398.6 |
| No. Observations: | 1000 | AIC: | 2805.0 |
| Df Residuals: | 996 | BIC: | 2825.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.9219 | 0.073 | 12.651 | 0.000 | 0.779 | 1.065 |
| x1 | -0.2354 | 0.026 | -9.164 | 0.000 | -0.286 | -0.185 |
| x2 | 0.7376 | 0.021 | 35.622 | 0.000 | 0.697 | 0.778 |
| x3 | 1.7548 | 0.010 | 180.905 | 0.000 | 1.736 | 1.774 |
| x4 | 2.2570 | 0.005 | 438.244 | 0.000 | 2.247 | 2.267 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.88 | Durbin-Watson: | 1.938 |
| Prob(Omnibus): | 0.644 | Jarque-Bera (JB): | 0.923 |
| Skew: | 0.07 | Prob(JB): | 0.63 |
| Kurtosis: | 2.952 | Cond. No. | 2.05e+16 |
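
Because x4 is an exact linear combination of the other regressors, the design matrix is rank deficient and the individual coefficients are not identified. The summary hints at this: `Df Model` is 3 although four regressors were supplied, and the condition number is about 2.05e+16. statsmodels still reports numbers because its default fitting method uses a pseudoinverse, which picks the minimum-norm solution among the infinitely many that fit equally well. Only the sums are pinned down, and they reproduce the estimates from question 2: $-0.2354 + 2.2570 \approx 2.02$, $0.7376 + 2.2570 \approx 2.99$, and $1.7548 + 2.2570 \approx 4.01$. The sketch below illustrates the same behavior with `np.linalg.lstsq`, which also returns a minimum-norm solution for rank-deficient problems; the data are re-simulated with an illustrative seed, so the numbers differ from the table above.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: illustrative seed, not the notebook's
n = 1000
x3 = rng.uniform(1, 10, n)
x12 = rng.multivariate_normal([0, 0], [[4, 5], [5, 9]], n)
x1, x2 = x12[:, 0], x12[:, 1]
y = 1 + 2 * x1 + 3 * x2 + 4 * x3 + rng.standard_normal(n)
x4 = x1 + x2 + x3               # exact linear combination

X_full = np.column_stack([np.ones(n), x1, x2, x3])
X_collinear = np.column_stack([np.ones(n), x1, x2, x3, x4])

b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
# For a rank-deficient matrix, lstsq returns the minimum-norm solution and the rank
b_min_norm, _, rank, _ = np.linalg.lstsq(X_collinear, y, rcond=None)

# Only b_j + b_4 is identified for j = 1, 2, 3
identified = b_min_norm[1:4] + b_min_norm[4]

print("rank:", rank, "of", X_collinear.shape[1], "columns")
print("min-norm coefficients:", b_min_norm)
print("b_j + b_4:", identified)
print("full-rank estimates:", b_full[1:])
```

The minimum-norm solution is orthogonal to the null direction $(0, 1, 1, 1, -1)$, so its coefficients satisfy $b_1 + b_2 + b_3 = b_4$; the reported estimates do as well, since $-0.2354 + 0.7376 + 1.7548 = 2.2570$.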
