## Programming Assignment - Bootstrap Methods - Adv. Econometrics 2

**Deadline**:  Friday 17:00 hours, 8 January 2021

|Nr|**Name**|**Student ID**|**Email**|
|--|--------|--------------|---------|
|1.|        |              |         |
|2.|        |              |         |
|3.|        |              |         |

**Declaration of Originality**

We whose names are given under 1., 2. and 3. above declare that:

1. These solutions are solely our own work.
2. We have not made (part of) these solutions available to any other student.
3. We shall not engage in any other activities that will dishonestly improve our results or dishonestly improve or hurt the results of others.

## Instructions for completing and submitting the assignment
1. Submit your work as (i) a Jupyter Notebook and (ii) a PDF file via Canvas before the deadline. Your notebook should not give errors when executed with `Run All`.
2. Complete the table with the details of your group members. By submitting the Jupyter Notebook, you agree with the included declaration of originality. Do not copy the work of others; this will be considered fraud!
3. Clarify your code with comments.

## Hints
- Only use the paired bootstrap
- Work with NumPy vectors or matrices as much as possible, e.g. `np.quantile(tB_OLS,[0.05,0.95],axis=0)` returns two quantiles for the whole vector of OLS estimates
- When coding, you can reduce the running time by setting `BOOTREP=99` and reducing the number of simulations. For the final execution, please return to the original values!
- For a progress bar, install tqdm with `conda install -c conda-forge tqdm` or, if you don't use Anaconda, with `pip install tqdm`
- If you want to use plotly, install it with `conda install -c plotly plotly`
- Below, you can find Python code for generating the data and running a simulation on multiple cores. To use multiple cores, you have to install `multiprocess`: `conda install -c conda-forge multiprocess`. Otherwise, execute `pip install multiprocess`
- The idea behind multiprocess is that each CPU core receives a sample, executes the resampling and returns the results. These results will be stored in one big list, which can be analyzed after the simulation.

## Assignment 

The purpose of this assignment is for you to gain practical experience with resampling methods. You will investigate several bootstrap confidence intervals for OLS and LASSO estimators. The DGP is given by:

- $X_i \sim N(0,\Sigma)$, $\Sigma=(\sigma_{ij}) \in \mathbb{R}^{p\times p}$ with $\sigma_{ij}=\rho^{|i-j|}$, $\beta_j=0$ for $1\leq j\leq p-15$, $\beta_j=0.5$ for $p-14\leq j\leq p-10$, $\beta_j=1.5$ for $p-9\leq j \leq p-5$, and $\beta_j=2.5$ for $p-4 \leq j \leq p$.

- $\varepsilon_1,...,\varepsilon_n \sim N(0,1)$

- $y=X \beta+\varepsilon$

Let $\hat{\beta}=(X'X)^{-1}X'y$ denote the OLS estimator, and let $\breve{\beta}$ denote the LASSO estimator based on minimizing
$$ \sum_{i=1}^{n} (y_i- b'X_i)^2+\alpha \sum_{j=1}^{p}|b_j|.$$
Only consider `lasso = linear_model.Lasso(alpha=0.02)`, so keep the amount of regularization fixed!
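
A note on scaling, offered as an assumption about the library rather than as part of the assignment: according to the scikit-learn documentation, `linear_model.Lasso` minimizes a rescaled criterion (and fits an intercept by default), so its `alpha` argument, written $a$ here to avoid a clash, is not numerically equal to the $\alpha$ in the display above:

$$ \frac{1}{2n}\sum_{i=1}^{n} (y_i- b'X_i)^2+a \sum_{j=1}^{p}|b_j|. $$

Multiplying by $2n$ leaves the minimizer unchanged, so, ignoring the intercept, the two criteria agree when $\alpha = 2na$. With $n=50$ and $a=0.02$ this would give $\alpha = 2\times 50\times 0.02 = 2$.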

Please answer all questions below briefly, using graphs and, if necessary, tables.

1. Choose $n=50$, $p=25$ and $\rho=0.6$. Determine the bias and RMSE of the OLS and LASSO estimators using 1,000 Monte Carlo replications.

2. Estimate by simulation the coverage probabilities (cov. prob.) of the 90% first-order asymptotic two-sided confidence intervals for the OLS estimator, i.e. the fraction of confidence intervals (CI)
$$[\hat{\beta_j}-1.645 SE(\hat{\beta_j}),\hat{\beta_j}+1.645 SE(\hat{\beta_j})]$$
that contains the true parameter $\beta_j$. Here $SE(\hat{\beta_j})$ is the usual (non-robust) standard error based on $s^2(X'X)^{-1}$.

3. Estimate by simulation the cov. prob. of the 90% first-order asymptotic two-sided CI for OLS and LASSO using the bootstrap standard error $$SE_{boot}(\tilde{\beta}),$$
for $\tilde{\beta}\in \{\hat{\beta},\breve{\beta}\}$.

4. Note that the cov. prob. can be interpreted as the number of successes in 1,000 trials. Hence, the estimated cov. prob. is not significantly different (at the 95% confidence level) from 90% if its value is contained in the interval
$$0.90\pm 1.96\sqrt{(0.90\times 0.10/1000)}=[0.8814,0.9186].$$
Check if the estimated cov. prob. of questions 2 & 3 are significantly different from 90%. What is your conclusion?

5. Estimate by simulation the cov. prob. for OLS and LASSO of 90% equal-tailed two-sided percentile bootstrap confidence intervals:
$$(\tilde{\beta}_{95\%}^*,\tilde{\beta}_{5\%}^*),$$
where $\mathbb{P}_*[\tilde{\beta}^*>\tilde{\beta}_{5\%}^*]=5\%$ (so $\tilde{\beta}_{5\%}^*$ is the upper 5% point of the bootstrap distribution) and $\tilde{\beta}\in \{\hat{\beta},\breve{\beta}\}$.

6.  Estimate by simulation the cov. prob. for OLS of 90% equal-tailed two-sided percentile-$t$ bootstrap confidence intervals based on the quantiles of the root
$$(\hat{\beta}^*-\hat{\beta})/SE(\hat{\beta}^*),$$ 
where $SE(\hat{\beta}^*)$ is based on $s^{*2}(X^{*}\,' X^{*})^{-1}$.

7. What problems would you encounter if you wanted to implement the percentile-$t$ intervals for the LASSO? How could you remedy these problems (you don't have to implement this)?

8.  Estimate by simulation the cov. prob. for LASSO of 90% two-sided bias-corrected and accelerated (BC$_a$) confidence intervals. For this, use the bootstrap to estimate the (median) bias and the Jackknife for the acceleration constant. In the BC$_a$ method, the quantiles are adjusted:
$$\begin{aligned}
\alpha_1&=\Phi\left ( \hat{z}_0+\frac{\hat{z}_0+z_{\alpha/2}}{1-\hat{a}(\hat{z}_0+z_{\alpha/2})} \right ), \\
\alpha_2&=\Phi\left ( \hat{z}_0+\frac{\hat{z}_0+z_{1-\alpha/2}}{1-\hat{a}(\hat{z}_0+z_{1-\alpha/2})} \right ),
\end{aligned}$$
with $z_{0.95}=1.645$. Here
$$\hat{z}_0=\Phi^{-1}\left ( \frac{\sum_{b=1}^B \mathbb{1} \{\hat{\theta}^*(b)<\hat{\theta}\}}{B} \right)$$
and
$$ \hat{a}=\frac{\sum_{i=1}^n (\hat{\theta}_{(\cdot)}-\hat{\theta}_{(i)})^3}{6\{\sum_{i=1}^n(\hat{\theta}_{(\cdot)}-\hat{\theta}_{(i)})^2\}^{3/2}}$$
with $\hat{\theta}_{(\cdot)}=\sum_{i=1}^n \hat{\theta}_{(i)}/n$, where $\hat{\theta}_{(i)}$ is the jackknife estimate computed with observation $i$ deleted. For more details, see Section 14.3 of Efron, B., & Tibshirani, R. J. (1994), An Introduction to the Bootstrap ([link](http://www.ru.ac.bd/stat/wp-content/uploads/sites/25/2019/03/501_02_Efron_Introduction-to-the-Bootstrap.pdf)).

9. Based on all the results of this assignment, which inference procedure would you recommend to a practitioner who wants to conduct inference in a model described by the DGP? Motivate your recommendation.
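
The BC$_a$ level adjustment described in question 8 could be sketched as follows for a single coefficient. This is a minimal illustration, not a reference solution: the function name `bca_levels` and its arguments are hypothetical, and the clipping of the bootstrap fraction is an added assumption to keep $\Phi^{-1}$ finite when, for example, a LASSO coefficient is exactly zero in every bootstrap sample.

```python
import numpy as np
from scipy.stats import norm

def bca_levels(theta_hat, theta_boot, theta_jack, alpha=0.10):
    """Hypothetical helper: BCa-adjusted levels (alpha_1, alpha_2).

    theta_hat  : scalar estimate on the original sample
    theta_boot : (B,) bootstrap estimates
    theta_jack : (n,) leave-one-out (jackknife) estimates
    """
    theta_boot = np.asarray(theta_boot, dtype=float)
    theta_jack = np.asarray(theta_jack, dtype=float)
    B = theta_boot.size
    # Median-bias correction z0; the clip is an assumption, not part of the formula
    frac = np.clip(np.mean(theta_boot < theta_hat), 1.0 / (B + 1), B / (B + 1))
    z0 = norm.ppf(frac)
    # Acceleration constant from the jackknife estimates
    d = theta_jack.mean() - theta_jack
    denom = 6.0 * np.sum(d**2) ** 1.5
    a = np.sum(d**3) / denom if denom > 0 else 0.0
    # Adjusted levels for the lower and upper end points
    z_lo, z_hi = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
    alpha_1 = norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))
    alpha_2 = norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
    return alpha_1, alpha_2
```

The interval end points would then be the corresponding bootstrap quantiles, e.g. `np.quantile(theta_boot, [alpha_1, alpha_2])`. With $\hat{z}_0=0$ and $\hat{a}=0$ the levels reduce to 0.05 and 0.95, i.e. the ordinary percentile interval.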

## Generate Samples

In [None]:
import numpy as np
from sklearn import linear_model
#import plotly.express as px        # uncomment if you want to use plotly.express
from tqdm.notebook import tqdm

REP=1000                            # number of Monte Carlo simulations
BOOTREP=999                         # number of bootstrap replications
n=50
p=25
rho=0.6
mu=np.zeros(p)
Sigma=np.identity(p)
for i in range(p):
    for j in range(p):
        Sigma[i,j]=rho**abs(i-j)
beta=np.zeros(p)
beta[(p-15):(p-10)]=0.5
beta[(p-10):(p-5)]=1.5
beta[(p-5):]=2.5
arglist=[]
for r in tqdm(range(REP)):
    X = np.random.multivariate_normal(mean=mu, cov=Sigma, size=n)
    eps = np.random.normal(size=n)
    y = X@beta + eps
    arglist.append((r,BOOTREP,y,X))
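
In the spirit of the hint about working with NumPy arrays, the double loop that fills `Sigma` could also be written with broadcasting. The sketch below uses a separate, hypothetical name `Sigma_vec` so that it does not overwrite the matrix built above.

```python
import numpy as np

p, rho = 25, 0.6
idx = np.arange(p)
# |i - j| for all pairs at once via broadcasting, then rho raised elementwise
Sigma_vec = rho ** np.abs(idx[:, None] - idx[None, :])
```

Both constructions should give the same Toeplitz matrix with ones on the diagonal.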

## Resampling Procedure

In [None]:
def Bootstrap(args):
    (rep,BOOTREP,y,X)=args              # rep: Monte Carlo replication index (renamed from iter, which shadows a built-in)
    import numpy as np
    from sklearn import linear_model
    from scipy.stats import norm
    
    def OLS(y,X):
        N,p = X.shape                   # number of observations and regressors
        XXi = np.linalg.inv(X.T @ X)
        b_ols = XXi @ (X.T @ y)
        res = y-X @ b_ols
        s2 = (res @ res)/(N-p)
        SE = np.sqrt(s2*np.diag(XXi))
        return b_ols,SE,res
    
    n,p = X.shape
    # Estimates original sample
    lasso = linear_model.Lasso(alpha=0.02)
    lasso.fit(X, y)
    b_LASSO=np.copy(lasso.coef_)
    b_OLS,b_OLS_SE,res = OLS(y,X)
    # initialize bootstrap arrays
    bB_LASSO = np.zeros((BOOTREP,p))
    bB_OLS   = np.zeros((BOOTREP,p))
    #
    #
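    np.random.seed(1)     # fixed seed: every Monte Carlo sample reuses the same index matrix
    # balanced bootstrap: each observation index appears exactly BOOTREP times in total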
    index_B=np.random.permutation(np.repeat(np.arange(n),BOOTREP)).reshape((BOOTREP,n))
    for b in range(BOOTREP):
        index = index_B[b,:]  # select the indices
        yB = np.copy(y[index])
        XB = np.copy(X[index,:])
        lasso.fit(XB, yB)
        bB_LASSO[b,:] = np.copy(lasso.coef_)
        # other code
    # percentile
    q_bB_LASSO = np.quantile(bB_LASSO,[0.05,0.95],axis=0)
    q_bB_OLS   = np.quantile(bB_OLS,[0.05,0.95],axis=0)
    argout = [b_LASSO,b_OLS     # add more when necessary
             ]
    return(argout)
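
For the percentile-$t$ intervals of question 6, the quantities stored inside the bootstrap loop could be combined along the following lines. This is only a sketch under assumptions: the helper name `percentile_t_ci` and its arguments are hypothetical, and it presumes that the bootstrap OLS estimates and their standard errors have been collected in `(BOOTREP, p)` arrays.

```python
import numpy as np

def percentile_t_ci(b_hat, se_hat, b_boot, se_boot, level=0.90):
    """Hypothetical helper: equal-tailed percentile-t interval per coefficient.

    b_hat, se_hat   : (p,) estimate and standard error on the original sample
    b_boot, se_boot : (B, p) bootstrap estimates and their standard errors
    """
    tail = (1 - level) / 2
    # Bootstrap roots (b* - b_hat) / SE(b*), one column per coefficient
    t_boot = (b_boot - b_hat) / se_boot
    q_lo, q_hi = np.quantile(t_boot, [tail, 1 - tail], axis=0)
    # The upper quantile of the root defines the lower end point, and vice versa
    lower = b_hat - q_hi * se_hat
    upper = b_hat - q_lo * se_hat
    return lower, upper
```

Note the reversal of the quantiles: inverting the root $(\hat{\beta}^*-\hat{\beta})/SE(\hat{\beta}^*)$ for $\beta$ swaps the roles of the two tails.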

## Execute the Simulation and get Results

In [None]:
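from multiprocess import Pool
pool4 = Pool(processes=4)           # four worker processes
# imap_unordered yields results as they finish, so result_list is not in replication order
result_list = list(tqdm(pool4.imap_unordered(Bootstrap, arglist), total=REP))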
pool4.close()
pool4.join()

## Perform the Post-Processing

In [None]:
nr_methods=4                        # number of methods
                                    # 0 = percentile
                                    # 1 = SE_boot
                                    # 2 = percentile-t
                                    # 3 = BCa
b_LASSO    =np.zeros((REP,p))
b_OLS      =np.zeros((REP,p))
for r in tqdm(range(REP)):
    b_LASSO[r,:]  = result_list[r][0]
    b_OLS[r,:]    = result_list[r][1]
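
Once the estimates are stacked in `(REP, p)` arrays as above, the bias and RMSE asked for in question 1 could be computed per coefficient in a vectorized way. The helper name `bias_rmse` is hypothetical and the block is a sketch, not a prescribed solution.

```python
import numpy as np

def bias_rmse(estimates, beta_true):
    """Hypothetical helper: per-coefficient Monte Carlo bias and RMSE.

    estimates : (REP, p) array, one row per Monte Carlo replication
    beta_true : (p,) true coefficient vector
    """
    err = np.asarray(estimates, dtype=float) - np.asarray(beta_true, dtype=float)
    bias = err.mean(axis=0)                  # average estimation error
    rmse = np.sqrt((err**2).mean(axis=0))    # root mean squared error
    return bias, rmse
```

A call such as `bias_OLS, rmse_OLS = bias_rmse(b_OLS, beta)` would return two vectors of length `p` that can be plotted against the coefficient index.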

## Carry Out Analysis
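
As a starting point for the analysis, the coverage probabilities and the significance band from question 4 could be computed as in the sketch below. The names `coverage` and `mc_band` are hypothetical, and the lower and upper end points are assumed to have been collected in `(REP, p)` arrays.

```python
import numpy as np

def coverage(lower, upper, beta_true):
    """Hypothetical helper: fraction of intervals containing the true value."""
    lower, upper = np.asarray(lower), np.asarray(upper)
    hit = (lower <= beta_true) & (beta_true <= upper)   # (REP, p) booleans
    return hit.mean(axis=0)

def mc_band(nominal=0.90, reps=1000, z=1.96):
    """Band within which an estimated cov. prob. is not significantly off."""
    half = z * np.sqrt(nominal * (1 - nominal) / reps)
    return nominal - half, nominal + half

low, high = mc_band()
print(round(low, 4), round(high, 4))   # 0.8814 0.9186
```

For the asymptotic intervals of question 2, the end points would be `b - 1.645*se` and `b + 1.645*se`, with `b` and `se` the stacked OLS estimates and standard errors.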