# Appendix E Sampling Distribution 

## E.2 Sampling from observed samples

In most cases, the true population or the true data-generating mechanism is unknown. As a workaround, we can use the observed samples to *approximate* the true population. This approximation is *exact* only when the samples exhaust the true population (e.g., a mandatory survey in a class). Otherwise, it remains an approximation.


### E.2.1 Bootstrap 

Two factors affect the sampling distribution of any estimator: (i) the true population and (ii) the sampling scheme. The bootstrap is based on the simple idea of treating the observed samples as the true population. In the bootstrap, (i) the observed samples are treated as the true population, and (ii) a sampling scheme is chosen to mimic the true mechanism. Specifically, if the samples are i.i.d. (independent and identically distributed), the bootstrap procedure draws samples with replacement and with equal probability from the observed samples: "with replacement" reflects independence, and "equal probability" reflects identical distribution. The bootstrap sampling distribution approximates the true sampling distribution when the sample size is large.
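
The resampling step described above can be sketched for a simpler estimator, the sample mean. This is a minimal illustration rather than part of the regression example: the data `obs`, the helper `boot.mean`, and the number of replicates are all assumptions made up for this sketch.

```r
set.seed(1)
# Hypothetical data for illustration: 200 i.i.d. draws from an exponential distribution
obs <- rexp(200, rate = 1)
n <- length(obs)

# One bootstrap replicate: draw n values from obs with replacement,
# each observation having probability 1/n, then recompute the estimator
boot.mean <- function(obs) {
  mean(sample(obs, size = length(obs), replace = TRUE))
}

B <- 2000
mean.boot <- replicate(B, boot.mean(obs))

# Compare the bootstrap standard error with the usual formula sd/sqrt(n)
se.boot <- sd(mean.boot)
se.formula <- sd(obs) / sqrt(n)
c(bootstrap = se.boot, formula = se.formula)
```

For the sample mean the two standard errors should be close, which gives a quick sanity check that resampling from the observed data mimics sampling from the population.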

There are many variants of bootstrap (see, e.g., [here](https://en.wikipedia.org/wiki/Bootstrapping_(statistics))). We will examine the nonparametric bootstrap in this section. 

We will revisit the confidence interval from Appendix D.3. This time we write our own bootstrap function to construct the confidence interval. Note the similarity between the `simulate.one.instance` function in Appendix D.3 and the `boot.fit` function below.

In practice, we use the `boot` function in the package `boot` for bootstrapping. 
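
As a rough sketch of what that might look like for the regression coefficients, the example below resamples rows of a data frame with `boot()` and computes percentile intervals with `boot.ci()`. The names `dat.demo` and `coef.stat` and the choice of `R = 2000` are assumptions for illustration only; the data are simulated with the same recipe used later in this section.

```r
library(boot)  # recommended package distributed with R

set.seed(1)
# Hypothetical data for illustration
n <- 50
x.demo <- rnorm(n, mean = 10, sd = 2)
y.demo <- 20 + 0.15 * x.demo + rnorm(n) * 5
dat.demo <- data.frame(x = x.demo, y = y.demo)

# boot() calls the statistic with the data and a vector of resampled row indices
coef.stat <- function(data, indices) {
  coef(lm(y ~ x, data = data[indices, ]))
}

boot.out <- boot(data = dat.demo, statistic = coef.stat, R = 2000)

# Percentile intervals: index = 1 is the intercept, index = 2 is the slope
ci.intercept <- boot.ci(boot.out, conf = 0.95, type = "perc", index = 1)
ci.slope <- boot.ci(boot.out, conf = 0.95, type = "perc", index = 2)
ci.slope$percent[4:5]  # lower and upper endpoints for the slope
```

The statistic function must accept the index vector as its second argument; `boot()` supplies the resampled indices, so no explicit call to `sample` is needed.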


In [None]:
### Simulation function from E.1
simulate.one.instance<-function(x,beta.true){
  n=length(x);
  Ey= x*beta.true[2]+beta.true[1];
  error.terms= (runif(n)-0.5)*5;
  y=Ey+error.terms;
  beta.hat=lm(y~x)$coef;
  return(beta.hat)
}


In [1]:
boot.fit<-function(x,y){
  # One approach (residual bootstrap, not implemented here):
  #   Ey = x*beta.true[2]+beta.true[1], treating beta.hat as beta.true
  #   error.terms are drawn from the residuals instead of from runif
  # ------------

  # Another approach (resample the (x, y) pairs), used below:
  n=length(x);
  samples_indices=sample(1:n, n, replace =TRUE);
  x.boot=x[samples_indices];
  y.boot=y[samples_indices];
  beta.hat=lm(y.boot~x.boot)$coef;
  return(beta.hat)
}
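
The first approach, left as comments inside `boot.fit`, could be sketched as follows. This is one possible reading of those comments, not the author's implementation: the function name `boot.fit.residual` and the demo data are hypothetical. It treats `x` as fixed, uses the fitted coefficients in place of `beta.true`, and draws error terms with replacement from the residuals, which implicitly assumes the errors are identically distributed across observations.

```r
# Residual bootstrap: a sketch under the assumptions stated above
boot.fit.residual <- function(x, y) {
  n <- length(x)
  fit <- lm(y ~ x)
  Ey.hat <- unname(fitted(fit))      # beta.hat plays the role of beta.true
  res <- unname(residuals(fit))      # residuals play the role of the errors
  error.boot <- sample(res, size = n, replace = TRUE)
  y.boot <- Ey.hat + error.boot      # x stays fixed; only y is regenerated
  beta.hat <- coef(lm(y.boot ~ x))
  return(beta.hat)
}

# Hypothetical data for illustration
set.seed(1)
x.demo <- rnorm(50, mean = 10, sd = 2)
y.demo <- 20 + 0.15 * x.demo + rnorm(50) * 5

beta.hat.res <- replicate(1000, boot.fit.residual(x.demo, y.demo))
apply(beta.hat.res, 1, quantile, probs = c(0.025, 0.975))
```

Resampling pairs, as `boot.fit` does, makes fewer assumptions about the error terms, while the residual version keeps the design points fixed, which is closer to how `simulate.one.instance` generates data.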

In [2]:
set.seed(1)
n=50;
x=rnorm(n,mean=10,sd=2);
beta.true=c(20,0.15)
Ey=x*beta.true[2]+beta.true[1];
error.terms=rnorm(n)*5;
y=Ey+error.terms;

In [3]:
B=1e4;
beta.hat.boot=replicate(B, boot.fit(x=x,y=y))

In [5]:
alpha=0.05;
apply(beta.hat.boot, 1, quantile,probs=c(alpha/2,1-alpha/2) )

| | (Intercept) | x.boot |
|---|---|---|
| 2.5% | 13.79914 | -0.8703025 |
| 97.5% | 31.0153 | 0.8336293 |


In [7]:
# confint() takes the confidence level via `level`; an `alpha` argument is silently ignored
t(confint(lm(y~x+1),level=1-alpha))

| | (Intercept) | x |
|---|---|---|
| 2.5 % | 13.0189 | -0.8086838 |
| 97.5 % | 30.47755 | 0.8809413 |
