# Bootstrapping

**Can we estimate the mean of a distribution without using a parametric model?**

The key idea is to first estimate the distribution non-parametrically, and then get an estimate of the mean from it.

**What about the standard error of that estimator? How do we get a C.I.?**

Bootstrapping!

## Empirical distribution

Let $x_1, ..., x_n$ be independent real-valued random variables with common distribution $P$.

Define a probability distribution $\hat{P}$ by

$$\hat{P}(A) = \frac{1}{n} \sum_{i=1}^n I_A(x_i)$$

where $I_A (x_i) = \begin{cases}
1 & \text{if } x_i \in A\\
0 & \text{otherwise}\\
\end{cases}$

$\hat{P}$ is called the empirical distribution of the sample $x_1, ..., x_n$. $\hat{P}$ can be thought of as the distribution which puts mass $\frac{1}{n}$ on each observation $x_i$. It can be shown that $\hat{P}$ is the nonparametric maximum likelihood estimator of $P$. This justifies estimating $P$ by $\hat{P}$ if no other information about $P$ is available.

**Theorem:** Let $A \subseteq \mathbf{R}$ (s.t. $P(A)$ is defined, i.e. $A$ belongs to the Borel $\sigma$-algebra), then $\hat{P}(A) \xrightarrow{} P(A)$ as $n \xrightarrow{} \infty$. This result can be proved using the Law of Large Numbers.

**Proof:**

$$n \hat{P}(A) = \sum_{i=1}^n I_A (x_i) \sim Bin(n, P(A))$$

This is a binomial because each indicator $I_A(x_i)$ is an independent Bernoulli trial with success probability $P(A)$. We then use the law of large numbers to say $\hat{P}(A) \xrightarrow{} P(A)$, its expectation, as $n\xrightarrow{} \infty$. In other words, $P(A)$ can be approximated by $\hat{P}(A)$ equally well for all $A \in I$, where $I$ is the set of all intervals of $\mathbf{R}$.

## Empirical distribution function

Now we move in the direction of a CDF. Let $x_1, ..., x_n \sim F$ where $F(x) = P(X \leq x)$. Because $F(x) = P((-\infty, x])$, $F$ is determined by the $P$ defined above, so the results above carry over.

We can estimate $F$ with the empirical distribution function $\hat{F}_n$, the CDF that puts mass $\frac{1}{n}$ at each data point $x_i$.

$$\hat{F}_n (x) = \frac{1}{n} \sum_{i=1}^n I(x_i \leq x)$$

where $I(x_i \leq x) = \begin{cases}
1 & \text{if } x_i \leq x\\
0 & \text{otherwise}\\
\end{cases}$

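According to the **Glivenko-Cantelli theorem**,

$$\sup_x |\hat{F}_n(x) - F(x)| \xrightarrow[a.s.]{} 0$$

So $\hat{F}_n(x) \xrightarrow[a.s.]{} F(x)$ uniformly in $x$ as $n \xrightarrow{} \infty$, i.e. $\hat{F}_n$ is a **consistent** estimator of $F$. In fact, the convergence is fast: by the Dvoretzky-Kiefer-Wolfowitz inequality, $P(\sup_x |\hat{F}_n(x) - F(x)| > \epsilon) \leq 2e^{-2n\epsilon^2}$ for every $\epsilon > 0$.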
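To make the empirical distribution function concrete, here is a minimal sketch that evaluates $\hat{F}_n$ and approximates $\sup_x |\hat{F}_n(x) - F(x)|$ on a grid. The helper name `ecdf`, the choice of a standard normal $F$, and the grid are assumptions made only for illustration.

```python
import numpy as np
from statistics import NormalDist

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at each point in `x`."""
    sample = np.sort(np.asarray(sample))
    # searchsorted(..., side="right") counts how many observations are <= x
    return np.searchsorted(sample, x, side="right") / sample.size

rng = np.random.default_rng(0)
true_cdf = np.vectorize(NormalDist().cdf)  # assumed F: standard normal

for n in (100, 10_000):
    sample = rng.normal(size=n)
    grid = np.linspace(-4, 4, 801)
    # Approximate sup_x |F_n(x) - F(x)| on a finite grid
    sup_dist = np.max(np.abs(ecdf(sample, grid) - true_cdf(grid)))
    print(n, sup_dist)
```

The printed distance should shrink as $n$ grows, which is what the uniform convergence above predicts.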

## Sampling from the empirical distribution $\hat{F}_n$

To recap, we have a distribution $F$ which we do not know. So, we can use the estimator $\hat{F}_n$, and we know that this estimator is a consistent estimator of $F$. So, how do we get a sample from $\hat{F}_n$?

Suppose we want to draw an iid sample $\vec{X}^\star = (X_1^\star, ..., X_n^\star)^T$ from $\hat{F}_n$. When sampling from $\hat{F}_n$, the $i$th observation $X_i$ in the original sample is selected with probability $\frac{1}{n}$. If we use this idea, we can define a two-step procedure to do our sampling.

Step 1: Draw $i_1, i_2, ..., i_n$ independently from the uniform distribution on $\{ 1, 2, 3, ..., n \}$.

Step 2: Set $X_j^\star = X_{i_j}$ and $\vec{X}^\star = (X_1^\star, ..., X_n^\star)^T$

For example, let's say $\vec{X} = (3,2,7,8,10,25)^T$ is a sample coming from a distribution that we do not know. Now, we want a sample from the empirical distribution $\hat{F}_n$ of $X$. 

Step 1: Draw $i_1, i_2, i_3, i_4, i_5, i_6$ from the uniform $\{ 1,2,3,4,5,6 \}$. Assume that $i_1 = 6,  i_2=3, i_3 = 1, i_4 = 2, i_5=1, i_6 = 3$ comes from the first drawing of the uniform, for instance.

Step 2: We assign $X_j^\star = X_{i_j}$. So, our first sample is $\vec{X}^{\star (1)} = (X_{i_1}, X_{i_2}, X_{i_3}, X_{i_4}, X_{i_5}, X_{i_6}) = (X_6, X_3, X_1, X_2, X_1, X_3) = (25, 7, 3, 2, 3, 7)$

Each bootstrap sample must have $n$ observations! We sample with replacement from the original sample $X_1, X_2, ..., X_n$.

So, sampling from $\hat{F}_n$ is the same as saying sampling with replacement from the original sample.
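The two-step procedure and the worked example above can be reproduced in a few lines. This is a sketch: the indices are fixed to the ones from the example, and the variable names are illustrative.

```python
import numpy as np

x = np.array([3, 2, 7, 8, 10, 25])

# Step 1 (fixed here to match the worked example): indices i_1, ..., i_6,
# written 1-based as in the text
idx_one_based = np.array([6, 3, 1, 2, 1, 3])

# Step 2: X*_j = X_{i_j}; subtract 1 because NumPy indexing is 0-based
x_star = x[idx_one_based - 1]
print(x_star.tolist())  # [25, 7, 3, 2, 3, 7]

# In practice the indices are random; these two draws have the same distribution
rng = np.random.default_rng(0)
x_star_a = x[rng.integers(0, x.size, size=x.size)]
x_star_b = rng.choice(x, size=x.size, replace=True)
```

Drawing uniform indices and calling `rng.choice(..., replace=True)` are two ways of writing the same sampling-with-replacement step.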



## Bootstrap principle

Let $\vec{X} = (X_1, ..., X_n)^T$ be a random sample from $F$. Let $\theta = t(F)$ be some parameter of the distribution $F$. 

$\hat{\theta} = s(\vec{X})$ is an estimate of $\theta$.

Example: $\vec{X} = (X_1, ..., X_n) \sim F$,

$\theta = \mu ( = t(F))$, which is a parameter

$\hat{\theta} = \bar{X} (= s(\vec{X}))$, which is a function of the sample, and is an estimator of $\theta$

To evaluate statistical properties (bias or standard error) of $\hat{\theta}$, we need to estimate the sampling distribution of $\hat{\theta}$. The bootstrap method mimics the data generating process by sampling from an estimate $\hat{F}_n$ of $F$. The role of the above real-world quantities is taken by their analogous quantities in the "bootstrap world".

$\vec{X}^\star = (X_1^\star, ..., X_n^\star)^T$ is a bootstrap sample from $\hat{F}_n$. 

$\theta^\star = t(\hat{F}_n)$ is the parameter in the bootstrap world.

$\hat{\theta}^\star = s(\vec{X}^\star)$ is the bootstrap replication of $\hat{\theta}$

So, our bootstrap sample $\vec{X}^\star$ is found by sampling with replacement. Then, we have $\theta$, which we are trying to understand. We have $\theta^\star$, which is the parameter in the bootstrap world. $\hat{\theta}^\star$ is the estimate of $\theta^\star$ computed from the bootstrap sample.

So, what is the sampling distribution of $\hat{\theta}$?  This is estimated by its bootstrap equivalent $\hat{\theta}^\star$.


In [None]:
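import numpy as np
n = 1000    # sample size
sets = 100  # number of bootstrap index sets
i = np.random.randint(0, n, size=(sets, n))  # each row: n indices drawn uniformly from {0, ..., n-1}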

The bootstrap principle can be summarized as follows:

**In the real world**, there is some unknown probability distribution and we have a sample from it, with the goal of computing a statistic that estimates a feature of the underlying distribution.

We have an unknown probability distribution (Usually denoted as $P$ or $F$) and an observed random sample $\vec{x}$.

$$P, F \xrightarrow{} \vec{x} = (x_1, ..., x_n)$$

producing $\hat{\theta} = s(\vec{x})$, which is the **statistic of interest**

**In the bootstrap world**, we have an empirical distribution (usually denoted as $\hat{P}$ or $\hat{F}_n$) and a bootstrap sample $\vec{x}^\star$. The bootstrap sample is derived from the observed random sample $\vec{x}$. In the bootstrap world, the observed random sample $\vec{x}$ acts as the "population". Some people call the $\vec{x}^\star$ a "sample of a sample".

$$\hat{P}, \hat{F}_n \xrightarrow{} \vec{x}^\star = (x_1^\star, ..., x_n^\star)$$

producing $\hat{\theta}^\star = s(\vec{x}^\star)$, which is the **bootstrap replication**

Even though the distribution of the bootstrap sample $\vec{x}^\star$ is known, evaluating the exact bootstrap sampling distribution of $\hat{\theta}^\star$ can still be intractable.

### Example: Bootstrap median

Let's say our stat of interest is the median. Our bootstrap sample can still be generated and we can get the bootstrap median. But, what is the sampling distribution of the sample median?

In general, the bootstrap estimate of the sampling distribution of $\hat{\theta}^\star$ is computed using the _Monte Carlo method_.

**Algorithm**

Step 1: Draw $B$ independent bootstrap samples $\vec{X}^{\star (1)}, ..., \vec{X}^{\star (B)}$ from $\hat{F}_n$ (i.e. samples of size $n$, drawn with replacement, repeated $B$ times)

Step 2: Evaluate the bootstrap replications $\hat{\theta}^{\star (b)} = s(\vec{X}^{\star (b)})$, $b = 1, ..., B$

This will generate a lot of $\hat{\theta}^\star$ values, and we can visualize their distribution.

Step 3: Estimate the sampling distribution of $\hat{\theta}$ by the empirical distribution of the bootstrap replications $\hat{\theta}^{\star (1)}, ..., \hat{\theta}^{\star (B)}$
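The three steps above can be sketched for the median as follows. The function name `bootstrap_replications`, the exponential data, and $B = 2000$ are assumptions chosen for illustration, not part of the original notes.

```python
import numpy as np

def bootstrap_replications(x, statistic, B=2000, rng=None):
    """Return B bootstrap replications of `statistic` computed from `x`."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = x.size
    # Step 1: B bootstrap samples of size n, one per row (sampling with replacement)
    samples = x[rng.integers(0, n, size=(B, n))]
    # Step 2: one replication per bootstrap sample
    return np.array([statistic(row) for row in samples])

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=50)   # illustrative data only
medians_star = bootstrap_replications(x, np.median, B=2000, rng=rng)

# Step 3: the empirical distribution of the replications, summarized by quantiles
print(np.quantile(medians_star, [0.05, 0.5, 0.95]))
```

A histogram of `medians_star` would show the estimated sampling distribution of the sample median directly.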

**Question: Why should bootstrap work?**

Glivenko-Cantelli Theorem says that $\hat{F}_n \xrightarrow[a.s.]{} F$ as $n \xrightarrow{} \infty$. So, iid sampling from $\hat{F}_n$ should be approximately the same as iid sampling from $F$ when $n$ is large.


## Applications

### Bootstrap for standard error (s.e.)

Let $\hat{\theta} = s(\vec{x})$ be an estimator for $\theta$ and suppose we want to know the s.e. of $\hat{\theta}$.

**Algorithm:**

Step 1: Draw $B$ independent samples $\vec{X}^{\star (1)}, ..., \vec{X}^{\star (B)}$ from $\hat{F}_n$.

Step 2: Evaluate the bootstrap replications $\hat{\theta}^{\star (b)} = s(\vec{X}^{\star (b)})$, $b = 1, ..., B$

Step 3: Estimate the s.e., $se(\hat{\theta})$, by the standard deviation of the $B$ replications.

$$se_{boot}(\hat{\theta}) = \left[\frac{1}{B-1} \sum_{b=1}^B (\hat{\theta}^{\star (b)} - \hat{\theta}^{\star (o)})^2\right]^{1/2}$$

where $\hat{\theta}^{\star (o)} = \frac{1}{B} \sum_{b=1}^B \hat{\theta}^{\star (b)}$

Remark: $\bar{\hat{\theta}}^\star$ or $\hat{\theta}^{\star (o)}$ denotes the mean of the bootstrap replications

In [1]:
import numpy as np
from numpy.random import choice
import matplotlib.pyplot as plt

n = 1000
rv = np.random.randint(0, 10, n)

def bootstrap(samples, statistic_func, nboot=1000):
    """Bootstrap replications and standard error of a statistic.

    TODO: also return the bootstrap confidence interval.

    Parameters
    ----------
    samples : array (n,)
        1D array of samples
    statistic_func : function
        Function which summarizes an array
        theta = f(samples)
        Must accept an `axis` argument; axis=0 reduces over rows,
        giving one value per column (one per bootstrap sample).
    nboot : int
        Number of bootstrap samples

    Returns
    -------
    theta_star : array (nboot,)
        Bootstrap replications of the statistic
    se_boot : float
        Bootstrap estimate of the standard error
    """
    # Draw nboot bootstrap samples (with replacement), one per column
    samples_star = choice(samples, size=(len(samples), nboot), replace=True)
    # Evaluate the bootstrap replications
    theta_star = statistic_func(samples_star, axis=0)
    # 1. Standard error: standard deviation of the replications (B - 1 denominator)
    theta_star_mean = np.mean(theta_star)
    se_boot = np.sqrt(np.sum(np.square(theta_star - theta_star_mean)) / (nboot - 1))
    # 2. Visualize the bootstrap sampling distribution of the statistic
    plt.hist(theta_star)
    plt.show()

    return theta_star, se_boot

theta_star, se_boot = bootstrap(rv, np.std, nboot=100)
se_boot, np.std(rv)

### Bootstrap estimate of bias

Suppose we estimate the parameter $\theta = t(F)$ by the statistic $\hat{\theta} = s(\vec{x})$

The bias of $\hat{\theta}$ is defined as 

$$bias(\hat{\theta}) = E(\hat{\theta}) - \theta$$

Recall: if $bias(\hat{\theta}) = 0$, then $E(\hat{\theta}) = \theta$, and $\hat{\theta}$ is an unbiased estimator. 

Example: Let $\theta = \mu$, and $\hat{\theta} = \bar{x}$.

$E[\bar{x}] = \mu$, $\bar{x}$ is an unbiased estimator for $\mu$.

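Substituting the empirical distribution $\hat{F}_n$ for $F$, the bootstrap estimate of the bias is

$$\widehat{bias}(\hat{\theta}) = bias^\star (\hat{\theta}^\star) = E^\star [\hat{\theta}^\star] - \hat{\theta} \approx \hat{\theta}^{\star (o)} - \hat{\theta}$$

where the last step approximates $E^\star [\hat{\theta}^\star]$ by the mean $\hat{\theta}^{\star (o)}$ of the $B$ bootstrap replications.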
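As a sketch of the bias estimate, consider the plug-in variance (dividing by $n$), which is a biased estimator of $\sigma^2$. For this particular statistic the bootstrap expectation is known exactly, $E^\star[\hat{\theta}^\star] = \frac{n-1}{n}\hat{\theta}$, so the Monte Carlo bias estimate can be checked against $-\hat{\theta}/n$. The data, `plug_in_var`, and $B$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=30)   # illustrative data only

def plug_in_var(a, axis=None):
    """Plug-in variance (divides by n), a biased estimator of sigma^2."""
    return np.var(a, axis=axis)   # ddof=0 by default

B = 5000
theta_hat = plug_in_var(x)
# One bootstrap sample per row, then one replication per row
samples = x[rng.integers(0, x.size, size=(B, x.size))]
theta_star = plug_in_var(samples, axis=1)

# Bootstrap bias estimate: mean of the replications minus theta_hat
bias_boot = theta_star.mean() - theta_hat
print(bias_boot, -theta_hat / x.size)   # the two should be close
```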

### Confidence Intervals

From the sampling distribution of $\hat{\theta}$, we can construct CIs for $\theta$

**Standard CI**

Suppose that $\hat{\theta}$ is approximately normally distributed with mean $\theta$ and variance $se(\hat{\theta})^2$.

An approximate $(1 - \alpha) 100 \%$ CI for $\theta$ is $[\hat{\theta}_L, \hat{\theta}_U]$ with $\hat{\theta}_L = \hat{\theta} - Z_{\alpha/2} \hat{se}_{boot}(\hat{\theta})$ and $\hat{\theta}_U = \hat{\theta} + Z_{\alpha/2} \hat{se}_{boot}(\hat{\theta})$, where $Z_{\alpha/2}$ is the upper $\alpha/2$ critical value of the standard normal (e.g. $Z_{0.025} \approx 1.96$).
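A minimal sketch of the standard interval for the mean, using the bootstrap standard error and the normal critical value from the standard library. The data and the choices $B = 2000$, $\alpha = 0.05$ are assumptions for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=40)   # illustrative data only

B, alpha = 2000, 0.05
theta_hat = x.mean()
# Bootstrap replications of the mean, one per resampled row
theta_star = x[rng.integers(0, x.size, size=(B, x.size))].mean(axis=1)
se_boot = theta_star.std(ddof=1)               # B - 1 denominator

z = NormalDist().inv_cdf(1 - alpha / 2)        # upper alpha/2 critical value, about 1.96
ci = (theta_hat - z * se_boot, theta_hat + z * se_boot)
print(ci)
```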

**Bootstrap t-interval**

See function in Rizzo textbook - TODO: replicate it in python.

Suppose that $\hat{\theta}$ is approximately normally distributed. Also, suppose that $\hat{\sigma}_n$ is an estimate of the standard deviation of $\hat{\theta}$. For example, $\hat{\sigma}_n = \frac{s}{\sqrt{n}}$ in the normal mean problem. We can also use the **delta method approximation** for $\hat{\sigma}_n$.

Let $T = \frac{\hat{\theta} - \theta}{\hat{\sigma}_n}$ be the "t - statistic".

We can also use the bootstrap approximation of $\hat{\sigma}_n$!

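From the bootstrap samples $X^{\star (b)}$, we calculate $T^{\star (b)} = \frac{\hat{\theta}^{\star (b)} - \hat{\theta}}{\hat{\sigma}_n^{\star (b)}}$ as the $T$ counterpart, where $\hat{\theta}$ plays the role of $\theta$ in the bootstrap world and $\hat{\sigma}_n^{\star (b)}$ is $\hat{\sigma}_n$ recomputed on the $b$th bootstrap sample.

Then the bootstrap t CI for $\theta$ is given by $[\hat{\theta} - C^\star_{1 - \alpha /2} \hat{\sigma}_n, \hat{\theta} - C^\star_{\alpha /2} \hat{\sigma}_n]$ where $C^\star_p$ is the 100 $p$th percentile of the $T^{\star (b)}$. The percentiles swap ends because $C^\star_{\alpha/2} \leq \frac{\hat{\theta} - \theta}{\hat{\sigma}_n} \leq C^\star_{1-\alpha/2}$ rearranges to $\hat{\theta} - C^\star_{1-\alpha/2}\hat{\sigma}_n \leq \theta \leq \hat{\theta} - C^\star_{\alpha/2}\hat{\sigma}_n$.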
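Here is a hedged sketch of a bootstrap t-interval for the mean with $\hat{\sigma}_n = s/\sqrt{n}$. It assumes the usual studentized form: each replication is centered at $\hat{\theta}$, and the limits are taken as $\hat{\theta} - C^\star_{1-\alpha/2}\hat{\sigma}_n$ and $\hat{\theta} - C^\star_{\alpha/2}\hat{\sigma}_n$. It is not a port of the Rizzo function; the helper `mean_and_se` and the data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=40)    # illustrative, skewed data

def mean_and_se(a):
    """Sample mean and its estimated standard deviation s / sqrt(n)."""
    return a.mean(), a.std(ddof=1) / np.sqrt(a.size)

B, alpha = 2000, 0.05
theta_hat, sigma_hat = mean_and_se(x)

t_star = np.empty(B)
for b in range(B):
    x_star = rng.choice(x, size=x.size, replace=True)
    theta_b, sigma_b = mean_and_se(x_star)
    # Studentize each replication around theta_hat, the bootstrap-world "truth"
    t_star[b] = (theta_b - theta_hat) / sigma_b

c_lo, c_hi = np.quantile(t_star, [alpha / 2, 1 - alpha / 2])
# The upper percentile sets the lower limit and vice versa
ci = (theta_hat - c_hi * sigma_hat, theta_hat - c_lo * sigma_hat)
print(ci)
```

For skewed data like this, the resulting interval is typically not symmetric around $\hat{\theta}$, unlike the standard CI.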

**Bootstrap percentile interval**

Let $\hat{\theta}^{\star (1)}, ..., \hat{\theta}^{\star (B)}$ be the bootstrap replications. **Order** them as $\hat{\theta}^\star_{(1)} \leq \hat{\theta}^\star_{(2)} \leq ... \leq \hat{\theta}^\star_{(B)}$. Let $m = [\frac{\alpha}{2} \times B]$, where $[u]$ is the largest integer less than or equal to $u$.

An approximate $(1 - \alpha) 100 \%$ CI for $\theta$ is $(\hat{\theta}_{(m)}^\star, \hat{\theta}_{(B-m)}^\star)$.

Generally, we want $B\geq 1000$

This construction is simple and intuitive, but there are "better" methods

Takeaway: While the standard CI and bootstrap t-interval rely on approximate normality of $\hat{\theta}$, the bootstrap percentile interval does not.
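The percentile construction in this section can be sketched directly from the ordered replications. The statistic (median), data, and $B = 1000$ are illustrative assumptions; the sketch also assumes $m \geq 1$.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=2.0, size=60)    # illustrative data only

B, alpha = 1000, 0.05
# Bootstrap replications of the median, one per resampled row
theta_star = np.median(x[rng.integers(0, x.size, size=(B, x.size))], axis=1)

theta_sorted = np.sort(theta_star)         # ordered replications
m = int(np.floor(alpha / 2 * B))           # m = [alpha/2 * B] = 25 here
# Order statistics are 1-based in the text, so (m) and (B - m) map to m-1 and B-m-1
ci = (theta_sorted[m - 1], theta_sorted[B - m - 1])
print(m, ci)
```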

## Bias-corrected percentile interval (BCPI)

Correct the percentile interval by using the bias!

The CI should have equal probability to both sides of $\hat{\theta}$, that is $P(\hat{\theta} < \theta < \hat{\theta}_U) = P(\hat{\theta}_L < \theta < \hat{\theta})$   (Condition *). 

**Example:** For $\theta = \mu$, our $\bar{x}$ is in the middle of the interval. 

If $\hat{\theta}$ is not the median of the bootstrap distribution, Condition * is not fulfilled.

The bias-corrected (BC) percentile interval is one approach to fix this issue.

The 2-sided $100(1-\alpha)\%$ BCPI boils down to picking different quantiles of the bootstrap distribution of $\hat{\theta}$.

Instead of using $[\zeta^\star_{\alpha/2}, \zeta^\star_{1 - \alpha/2}]$ where $\zeta^\star_{p}$ is the 100 $p$th percentile in the bootstrap sample, use $[\zeta_{\beta_1}^\star, \zeta_{\beta_2}^\star]$ where $\beta_1$ and $\beta_2$ are quantities that depend on user-specified constants $(a,b)$. (Formulas for $\beta_1$ and $\beta_2$ are in Rizzo textbook along with discussion on choices)

Idea: does the above principle also apply to the $C^\star_{\alpha/2}$ from Bootstrap t-interval?

## Parametric bootstrap

$\lambda$ : parameter

$\theta$: statistic

Parametric bootstrap is a variation of the standard (non-parametric) bootstrap discussed previously. Suppose that we know that the distribution $F$ belongs to a parametric family of distributions $F_\lambda$ with densities $p(x|\lambda)$. If $\hat{\lambda}$ is an estimate of the true parameter $\lambda$, an obvious estimate of $F$ is the distribution $\hat{F} = F_{\hat{\lambda}}$ with density $p(x|\hat{\lambda})$. In this case, we can still use the bootstrapping method to obtain an estimate of the sampling distribution of $\hat{\theta}$ (or of any function of $\hat{\theta}$). The parametric bootstrap replaces sampling iid from $\hat{F}_n$ (empirical distribution) with sampling iid from $F_{\hat{\lambda}}$.

Algorithm:

Step 1: Draw $B$ independent bootstrap samples $\vec{X}^{\star (1)}, ..., \vec{X}^{\star (B)}$ from $F_{\hat{\lambda}}$

Step 2: Evaluate bootstrap replications $\hat{\theta}^{\star (b)} = s(\vec{X}^{\star (b)}), b = 1, ..., B$

Step 3: Estimate the sampling distribution of $\hat{\theta}$ by the empirical distribution of the bootstrap replications $\hat{\theta}^{\star (1)}, ..., \hat{\theta}^{\star (B)}$
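A minimal sketch of the three parametric bootstrap steps, assuming (purely for illustration) that the family $F_\lambda$ is Exponential with rate $\lambda$, that $\hat{\lambda} = 1/\bar{x}$ is its MLE, and that the statistic of interest is the sample median.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=50)    # illustrative data only

# Assumed parametric family: Exponential with rate lambda; its MLE is 1 / mean
lambda_hat = 1.0 / x.mean()

B = 2000
# Step 1: B samples of size n from F_{lambda_hat} (NumPy uses scale = 1 / rate)
samples = rng.exponential(scale=1.0 / lambda_hat, size=(B, x.size))
# Step 2: bootstrap replications of the statistic (the sample median here)
theta_star = np.median(samples, axis=1)
# Step 3: summarize the empirical distribution of the replications
se_param = theta_star.std(ddof=1)
print(np.median(x), se_param)
```

The only change from the nonparametric version is Step 1: new samples come from the fitted model rather than from resampling the observed data.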

This is potentially more complicated than nonparametric because sampling from $F_{\hat{\lambda}}$ might be more difficult than sampling from $\hat{F}_n$

Question: Is $\hat{F}_n$ or $F_{\hat{\lambda}}$ better? Why?


**Example:** Assume a normal population $N(\mu, \sigma^2)$ and $H_0: \mu = 2$ vs. $H_1: \mu \neq 2$. 

If $\sigma^2$ is unknown, we can use a $t$-statistic (but here, we try something else). 

We can also use the likelihood ratio statistic $\Lambda$. Get the sampling distribution of $\Lambda$ when $H_0$ is true. If $H_0$ is true, $\mu=2$ and we have now $N(2, \sigma^2)$

As a note, in the notation of the algorithm presented above, $\lambda = (\mu, \sigma^2)$ (with $\mu = 2$ fixed under $H_0$ and $\sigma^2$ estimated) and $\hat{\theta} = \Lambda$

We want to use Monte Carlo simulation to get the sampling distribution of $\Lambda$ when $\mu = 2$. However, plain Monte Carlo hypothesis testing will not work here since we do not know $\sigma^2$. So, an alternative method is to use the **parametric bootstrap**. Use $s^2$ from the initial sample as an estimate of $\sigma^2$. Sample from $N(2, s^2)$, then compute $\Lambda$ for each sample to obtain the bootstrap replications $\hat{\Lambda}^{\star (1)}, ..., \hat{\Lambda}^{\star (B)}$, whose empirical distribution estimates the sampling distribution of $\Lambda$ under $H_0$.
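The test described above can be sketched as follows. The sketch works with $-2\log\Lambda = n \log(\hat{\sigma}_0^2 / \hat{\sigma}^2)$, where $\hat{\sigma}^2$ and $\hat{\sigma}_0^2$ are the variance MLEs without and with the constraint $\mu = 2$; large values count against $H_0$. The function name `neg2_log_lr`, the simulated data, $B = 999$, and the "+1" p-value convention are assumptions for illustration.

```python
import numpy as np

def neg2_log_lr(x, mu0):
    """-2 log(Lambda) for H0: mu = mu0 in a normal model with unknown variance."""
    x = np.asarray(x)
    sigma2_hat = np.mean((x - x.mean()) ** 2)   # unrestricted MLE of sigma^2
    sigma2_0 = np.mean((x - mu0) ** 2)          # MLE of sigma^2 under H0
    return x.size * np.log(sigma2_0 / sigma2_hat)

rng = np.random.default_rng(6)
x = rng.normal(loc=5.0, scale=1.0, size=30)     # illustrative data; H0 is false here
mu0, B = 2.0, 999

stat_obs = neg2_log_lr(x, mu0)
s = x.std(ddof=1)                               # s^2 estimates sigma^2
# Simulate under H0 from N(mu0, s^2) and recompute the statistic each time
stat_star = np.array([
    neg2_log_lr(rng.normal(loc=mu0, scale=s, size=x.size), mu0)
    for _ in range(B)
])
# Large values of -2 log(Lambda) are evidence against H0
p_value = (1 + np.sum(stat_star >= stat_obs)) / (B + 1)
print(stat_obs, p_value)
```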


## Bootstrap Failures

**Example:** Let iid $X_1, ..., X_n \sim Unif(0,\theta)$

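Consider statistic $T_n = n(\theta - \hat{\theta}_n)$, where $\hat{\theta}_n = X_{(n)} = \max_i X_i$.

Let the bootstrap stat $T^\star_n = n(\hat{\theta}_n - \hat{\theta}^\star_n)$, where $\hat{\theta}^\star_n = X^\star_{(n)}$

**Claim:** The distributions of $T^\star_n$ and $T_n$ do not become close as $n \xrightarrow{} \infty$. Therefore, the bootstrap is not consistent here. The reason: $T_n$ has a continuous limit (exponential with mean $\theta$), so $P(T_n = 0) = 0$, whereas $P(T^\star_n = 0) = 1 - (1 - \frac{1}{n})^n \xrightarrow{} 1 - e^{-1} \approx 0.632$, since the bootstrap sample contains the observed maximum with that probability.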
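A quick simulation makes the failure visible by measuring how often $T^\star_n$ is exactly zero. The values of $\theta$, $n$, and $B$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, n, B = 1.0, 50, 20_000
x = rng.uniform(0.0, theta, size=n)
theta_hat = x.max()

# Bootstrap replications of the maximum and of T*_n = n (theta_hat - theta*_n)
max_star = x[rng.integers(0, n, size=(B, n))].max(axis=1)
t_star = n * (theta_hat - max_star)

# T*_n has an atom at 0: the bootstrap sample contains the observed maximum
# with probability 1 - (1 - 1/n)^n, which tends to 1 - exp(-1), about 0.632
frac_zero = np.mean(t_star == 0.0)
print(frac_zero, 1 - (1 - 1 / n) ** n)
```

A point mass of this size at zero cannot match a continuous limiting distribution, no matter how large $n$ gets.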

**Remedy:** Surprisingly, one way to fix bootstrap failure is to take bootstrap samples of size $m = o(n)$ (little o, not big-O) instead of size $n$.

This is called the "m-out-of-n bootstrap", and it is consistent in cases such as the $Unif(0, \theta)$ model. We use bootstrap samples of size $m$ instead of $n$, still sampling with replacement.
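The same simulation with resamples of size $m$ shows the point mass at zero shrinking. The choice $m = \lfloor\sqrt{n}\rfloor$ is only one possible rate satisfying $m = o(n)$ and is an assumption of this sketch, as are $\theta$, $n$, and $B$.

```python
import numpy as np

rng = np.random.default_rng(8)
theta, n, B = 1.0, 2000, 5000
m = int(np.sqrt(n))                        # assumed choice with m = o(n)
x = rng.uniform(0.0, theta, size=n)
theta_hat = x.max()

# Bootstrap samples of size m (not n), drawn with replacement from the n observations
max_star = x[rng.integers(0, n, size=(B, m))].max(axis=1)
t_star_m = m * (theta_hat - max_star)      # T*_m = m (theta_hat_n - theta*_m)

# The atom at zero now has mass 1 - (1 - 1/n)^m, which vanishes when m / n -> 0
frac_zero = np.mean(t_star_m == 0.0)
print(m, frac_zero, t_star_m.mean())       # the mean is roughly theta here
```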

Intuition: https://stats.stackexchange.com/questions/476018/intuition-behind-m-out-of-n-bootstrap

## Remarks

1. The bootstrap is nonparametric but it does require some assumptions. You can't assume it is always valid.

2. The bootstrap is an asymptotic method. It is very accurate as $n \xrightarrow{} \infty$ (because each bootstrap sample draws on all the data)

3. There is a related method called the "jackknife". However, the bootstrap is valid under weaker conditions.

4. There are many cases where the bootstrap is not formally justified. This is especially true with discrete structures like trees and graphs. It is an informal way to get some intuition on the variability of the procedure. But, keep in mind that the formal guarantees may not apply in these cases.

5. There is a method related to the bootstrap called subsampling. In this case, we draw samples of size $m < n$ without replacement (sampling without replacement is what makes it different from the bootstrap). Subsampling produces valid CIs under weaker conditions than the bootstrap.