Alex Kappes <br>
Problem Set 4 <br>
EconS 512

**Problem 1**

Given the equation $Y_i = \beta_0 + \beta_1T_i + u_i$, it is assumed that $T_i \sim \text{U}(0,1)$ and $u_i \sim \text{N}(0, \sigma_u^2)$. It is further assumed that $T_i\ \forall\ i \in I$ is unobservable. The true value for $T_i$ follows $\tau_{1i} = T_i + \xi_i$, where $\xi_i \sim \text{N}(0, \sigma_\xi^2)$. The true population parameter values are $\boldsymbol{\beta} = [0.2, 1.5]$.

**(a)** The accuracy of OLS estimation procedures using observed $\mathbf{T}$ values is presented below.

In [1]:
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm

T = stats.uniform.rvs(size=1000)
u = stats.norm.rvs(scale=0.5, size=1000)
xi = stats.norm.rvs(scale=0.4, size=1000)
eta = stats.norm.rvs(scale=0.6, size=1000)
tau1 = T + xi
tau2 = T + eta

B = np.array([0.2, 1.5])
y = B[0] + B[1]*T + u

mod_unobs_slp = sm.OLS(y, sm.add_constant(T)).fit().params[1]

if abs(mod_unobs_slp - B[1]) < .01:
    print('Estimation is accurate within .01')
else:
    print('Estimation is not accurate within .01. The difference is', round(mod_unobs_slp - B[1], 3))

Estimation is not accurate within .01. The difference is 0.013


**(b)** The true values $\tau_{1i}$ will provide for a slope estimate approximately equal to $\beta_1\frac{\sigma_T^2}{\sigma_T^2 + \sigma_\xi^2}$. The results are shown below.

In [2]:
true_slp = round(B[1]*T.var() / (T.var() + xi.var()), 3)
mod_obs_slp = round(sm.OLS(y, sm.add_constant(tau1)).fit().params[1], 3)

print('B_1 true value is', true_slp, ', and is estimated at', mod_obs_slp)

B_1 true value is 0.5 , and is estimated at 0.519


**(c)** Instrumenting $\tau_{1i}$ with the variable $\tau_{2i} = T_i + \eta_i$ for $\eta_i \sim \text{N}(0, \sigma_\eta^2)$, where $\tau_{2i}$ is assumed to be observable, provides the following slope parameter estimate. 

In [3]:
X = np.array([np.repeat(1, len(y)), tau1]).T
Z = np.array([np.repeat(1, len(y)), tau2])
B1_iv = np.round(np.linalg.multi_dot([np.linalg.inv(np.dot(Z, X)), Z, y[np.newaxis].T])[1], 3)

print('The IV estimate is', float(B1_iv), ', compared to the true population parameter of', B[1])

The IV estimate is 1.578 , compared to the true population parameter of 1.5


**Problem 2**

Note: The linear equation is specified in the problem set.

**(a)** OLS optimization provides $\hat{\beta}_1$ below. A White's Heteroskedastic Consistent covariance matrix estimator is produced below. The standard error used for the $\hat{\beta}_1$ confidence interval is taken from the White's covariance estimator. 

In [4]:
df = pd.read_csv('/home/akappes/WSU/512_MetricsII/ps4_data.csv')

X = df[['pct_insclnxtyr', 'mhighgrad', 'msomcol', 'fhighgrad', 'fsomcol', 'parincome', 'afqt']].dropna()
y = df.loc[list(X.index), 'dayssmklm17']
mod = sm.OLS(y, sm.add_constant(X)).fit()

# White's HC estimator
e_sq = np.array(np.power(mod.resid, 2))
Xmat = np.concatenate((np.repeat(1, len(X.index))[np.newaxis].T, X), axis=1)
k = Xmat.shape[1]
n = Xmat.shape[0]

w = np.zeros((k, k))
i = 1

while i < n:

    w = w + e_sq[i] * Xmat[i][np.newaxis].T * Xmat[i]
    i = i + 1

    if i > n:
        break

w_hce = np.linalg.multi_dot([np.linalg.inv(np.dot(Xmat.T, Xmat)),
                             w,
                             np.linalg.inv(np.dot(Xmat.T, Xmat))])

w_se = np.sqrt(np.diag(w_hce))

# Robust B1 CI
alpha = .05
t_crit = stats.t.ppf(1-alpha/2, n-k)

b1_ci = np.array([mod.params[1] - t_crit * w_se[1],
                  mod.params[1] + t_crit * w_se[1]])

print('The 95% CI for B_1 is', b1_ci)

The 95% CI for B_1 is [-0.08105284 -0.02244233]


**(b)** Random $\mathbf{X}$ bootstrap sampling for $\hat{\beta}_1$ is produced below. The sampling provides a distribution of 1000 $\hat{\beta}_1$ and $se\{\hat{\beta}_1\}$ estimates. The means of bootstrapped $\hat{\beta}_1$ and $se\{\hat{\beta}_1\}$ are used to create the bootstrap $\hat{\beta}_1$ confidence interval.

In [5]:
n_boot = 1000
boot_mat = pd.concat([pd.DataFrame(y), X], axis=1).reset_index()
boot_vals = pd.DataFrame({'b1_boot': 0, 'se_boot': 0}, index=range(n_boot))

for i in range(len(boot_vals)):

    boot_sample = boot_mat.sample(n, replace=True)
    X_boot = boot_sample[['pct_insclnxtyr', 'mhighgrad', 'msomcol', 'fhighgrad', 'fsomcol', 'parincome', 'afqt']]
    y_boot = boot_sample['dayssmklm17']

    boot_vals.loc[i, 'b1_boot'] = sm.OLS(y_boot, sm.add_constant(X_boot)).fit().params[1]
    boot_vals.loc[i, 'se_boot'] = sm.OLS(y_boot, sm.add_constant(X_boot)).fit().bse[1]

b1_boot_mean = boot_vals['b1_boot'].mean()
se_boot_mean = boot_vals['se_boot'].mean()

b1_boot_ci = np.array([b1_boot_mean - t_crit * se_boot_mean,
                       b1_boot_mean + t_crit * se_boot_mean])

print('The 95% CI for bootstrapped B_1 is', b1_boot_ci)

The 95% CI for bootstrapped B_1 is [-0.07704628 -0.02665728]


The bootstrapped $\hat{\beta}_1$ confidence interval has shrunk around the *true* $\beta_1$ population paramter.

**(c)** The bootstrap process for $\hat{\beta}_1$ is repeated using a just-identified instrumental variable approach. The bootstrap $\hat{\beta}_1$ confidence interval is produced below.

In [6]:
boot_iv_mat = pd.concat([pd.DataFrame(y), df.loc[list(X.index), 'ctuition17'], X], axis=1).reset_index()
boot_iv_vals = pd.DataFrame({'b1_iv_boot': 0, 'se_iv_boot': 0}, index=range(n_boot))

for i in range(len(boot_iv_vals)):

    boot_iv_sample = boot_iv_mat.sample(n, replace=True)
    Z_boot = np.concatenate((np.repeat(1, len(boot_iv_sample.index))[np.newaxis].T,
                             boot_iv_sample[['ctuition17', 'mhighgrad', 'msomcol', 'fhighgrad',
                                             'fsomcol', 'parincome', 'afqt']]),
                            axis=1)
    X_boot = np.concatenate((np.repeat(1, len(boot_iv_sample.index))[np.newaxis].T,
                             boot_iv_sample[['pct_insclnxtyr', 'mhighgrad', 'msomcol', 'fhighgrad',
                                             'fsomcol', 'parincome', 'afqt']]),
                            axis=1)
    y_boot = np.array(boot_iv_sample['dayssmklm17'])[np.newaxis].T

    B_iv_boot = np.linalg.multi_dot([np.linalg.inv(np.dot(Z_boot.T, X_boot)), Z_boot.T, y_boot])

    sig2 = np.dot(np.array((y_boot - np.dot(X_boot, B_iv_boot))).T, (y_boot - np.dot(X_boot, B_iv_boot))) / (n - k)

    cov_B_iv = sig2 * np.linalg.multi_dot([np.linalg.inv(np.dot(Z_boot.T, X_boot)),
                                           np.dot(Z_boot.T, Z_boot),
                                           np.linalg.inv(np.dot(X_boot.T, Z_boot))])

    boot_iv_vals.loc[i, 'b1_iv_boot'] = B_iv_boot[1]
    boot_iv_vals.loc[i, 'se_iv_boot'] = np.sqrt(np.diag(cov_B_iv))[1]

b1_iv_boot_mean = boot_iv_vals['b1_iv_boot'].mean()
se_iv_boot_mean = boot_iv_vals['se_iv_boot'].mean()

b1_iv_boot_ci = np.array([b1_iv_boot_mean - t_crit * se_iv_boot_mean,
                          b1_iv_boot_mean + t_crit * se_iv_boot_mean])

print('The 95% CI for IV bootstrapped B_1 is', b1_iv_boot_ci)

The 95% CI for IV bootstrapped B_1 is [-0.53605269  0.29218561]


The IV bootstrapped CI shows that under IV framework, the instrument is not significant in explaining variation in the dependent variable. The working paper by Alwyn Young titled *Consistency without Inference: Instrumental Variables in Practical Application* eludes to this result in the explanation of there being less variability in IV estimation procedures. When bootstrapping, the results are often more sensitive to influential observations, leading to wider variability in parameter estimates and significance.  