[![Open In
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap5/16-Bootstrap-Dist.ipynb)




# <a name="16title">5.1: Bootstrap Distributions</a>

---

<figure>
<img
src="https://upload.wikimedia.org/wikipedia/commons/e/e5/La_Roulette_de_Monte-Carlo_R%C3%A8gle_du_Jeu.jpg"
alt="Monte Carlo roulette table and rules of the game" width = "50%"/>
<figcaption aria-hidden="true">
Credit: <a href="https://commons.wikimedia.org/wiki/File:La_Roulette_de_Monte-Carlo_R%C3%A8gle_du_Jeu.jpg">Lévy et Neurdein réunis</a>, Public domain, via Wikimedia Commons
</figcaption>
</figure>


# <a name="16intro">Sampling From a Population</a>

---

Rarely, we have access to data from the entire population of interest,
in which case we are able to calculate the value(s) of population
parameter(s) of interest. We can generate a sampling distribution by
simulating the selection of many different random samples from the
population data, and we can compute the standard error of the sampling
distribution to measure how much uncertainty we can expect due to the
randomness of sampling. If we have access to data from the entire
population, there is no need for statistics to estimate parameters since
we know the values of the parameters! Recall the distinction and
connection between parameters and statistics:

-   A  <font color="dodgerblue">**parameter**</font> is a   characteristic of a population (which may be a probability   distribution).
    -   The values of population parameters are unknown, but their values are fixed and not random.
-   A  <font color="dodgerblue">**statistic**</font> is a   characteristic of a sample that we can use to approximate a   parameter.
    -   Statistics are not fixed values. They will vary due to the randomness of sampling.
-   Recall we use different notation for parameters and statistics.
    -   We typically use $\theta$ to denote a generic population parameter.
    -   We use  <font color="dodgerblue">**hat notation**</font>, $\color{dodgerblue}{\hat{\theta}}$, to denote estimators of $\theta$.

In some situations, we have a known probability distribution that we can
use to build a model and make predictions. For example, characteristics
such as height
([normal](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap2/08-Common-Continuous-Distributions.ipynb#08append-normal)),
time between successive events
([exponential](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap2/08-Common-Continuous-Distributions.ipynb#08append-exp)),
and the number of times an event occurs over an interval of time
([Poisson](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap2/07-Common-Discrete-RandVar.ipynb#07append-pois))
all behave predictably. We can pick a
random sample and use point estimators such as
[MLE](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/13-Estimation-MLE.ipynb) and [MoM](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/14-Estimation-MOM.ipynb) to
estimate unknown population parameters. *What happens if the data does
not follow a known distribution?*

Suppose we would like to estimate the value of a parameter for a
population about which we know very little information (this is often
the case). We collect data from a single random sample of size $n$, and
then we can use statistics from the sample to make predictions about the
population:

-   To estimate a population mean, it makes practical and [mathematical   sense to use a sample   mean](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/15-Properties-Estimators.ipynb#15q5).
-   To estimate a population proportion, the sample proportion is an   [unbiased estimator](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/15-Properties-Estimators.ipynb#sec-prop-bias).
-   To estimate a population variance, we can use the [unbiased   estimator](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/15-Properties-Estimators.ipynb#15var)   $s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$.

In any of these cases, how certain can (or should) we be in our estimate? In
practice, we do not repeatedly pick thousands of random samples from the
population. That is likely impractical, expensive, and time-consuming.
We have only collected data from a single random sample.

>  <font color="dodgerblue">**How can we account for the
> uncertainty in our estimate if we only have one random
> sample?**</font>



## <a name="16q1">Question 1</a>

---


<figure>
<img
src="https://upload.wikimedia.org/wikipedia/commons/5/5e/Golden_jackal_-_portrait.jpg"
alt="Golden Jackal Portrait" width = "50%"/>
<figcaption aria-hidden="true">
Credit:  Вых Пыхманн, <a href="https://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a>, via Wikimedia Commons
</figcaption>
</figure>


A zoologist would like to answer the following question:

> “What is the average mandible (jaw) length of all golden jackals?”

Devise a method for collecting and analyzing data to help them answer
this question.

### <a name="16sol1">Solution to Question 1</a>

---

<br>  
<br>  
<br>



# <a name="16jackal">Case Study: Golden Jackal Mandible Length</a>

---

The data frame `jackal` in the `permute` package contains a sample of
$n=20$ mandibles from male and female golden jackals (*Canis aureus*). For
each of the 20 observations, two variables are recorded<sup>1</sup>:

-   `Length` is the length of the mandible in millimeters (mm).
-   `Sex` is a categorical variable with two levels: `Male` and   `Female`.

<br>

<font size="2"> 1.  Manly, B.F.J. (2007) *Randomization, bootstrap and Monte Carlo
methods in biology*. Third Edition. Chapman & Hall/CRC, Boca Raton.</font>


## <a name="16load-jackal">Loading the Data</a>

---

It is very likely you do not have the package `permute` installed. You
will need to first install the `permute` package.

-   Go to the R Console window.
-   Run the command `install.packages("permute")`.

You will only need to run the `install.packages()` command one time. You
can now access `permute` anytime you like! However, you will need to run
the command `library(permute)` during any R session in which you want to
access data from the `permute` package. **Be sure you have first
installed the `permute` package before executing the code cell below.**


In [None]:
install.packages("permute")

In [None]:
# be sure you have already installed the permute package
library(permute)  # loading permute package

### <a name="16sum-jackal">Summarizing and Storing the Data</a>

---

In the code cell below we load the `jackal` data from the `permute`
package and provide a numerical summary of the two variables in the
sample.

In [None]:
data(jackal)  # load jackal data
summary(jackal)  # numerical summary of each variable

The code cell below displays the distribution of mandible lengths
separately for males and females.

In [None]:
# side by side box plots
plot(Length ~ Sex, data = jackal,
     col = c("dodgerblue", "mediumseagreen"),
     main = "Mandible Length of Golden Jackals",
     ylab = "Length (in mm)",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

We will be analyzing mandible lengths for both adult male and female
golden jackals. In the code cell below, we save the $n=20$ mandible
lengths to a vector called `jaw.sample`.

In [None]:
jaw.sample <- jackal$Length  # store mandible lengths to vector
jaw.sample  # print sample to screen

## <a name="16q2">Question 2</a>

---

Based on the sample above, what is your estimate for $\mu$, the mean
mandible length of all adult golden jackals?

### <a name="16sol2">Solution to Question 2</a>

---

<br>  
<br>  
<br>



## <a name="16q3">Question 3</a>

---

How much confidence do you have in your estimate in [Question 2](#16q2)? Any
suggestions on how we can measure the uncertainty in our estimate due to
the randomness of sampling?

### <a name="16sol3">Solution to Question 3</a>

---

<br>  
<br>  
<br>



## <a name="16stat-q">What is a Statistical Question?</a>

---

A  <font color="dodgerblue">**statistical question**</font> is one
that can be answered by collecting data and where there will be
variability in that data.

> Based on a random sample of $n=20$ adult golden jackals, what is the
> mean mandible length of all adult golden jackals?

-   Each time we pick a different sample we have a different subset of   data.
-   Different samples have different sample means, leading to different   estimates.
-   **This is a statistical question!**
-   How can we account for this variability in our estimate?

> Using a database that contains information on all registered voters in
> Colorado, what proportion of all Colorado voters are over 50 years
> old?

-   The database includes information from the population of all   registered voters in Colorado.
-   We can use the population data to calculate the proportion.
-   The population data does not change, so there is no variability in   the value of the proportion.
-   **This is not an example of a statistical question.**



# <a name="16boot-intro">Bootstrapping: Sampling from a Sample</a>

---

We have explored the  <font color="dodgerblue">**sampling
distributions**</font> of [sample means](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/10-Sampling-Dist-Mean.ipynb),
[proportions](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/11-Sampling-Dist-Prop.ipynb), [medians, variances](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/12-Sampling-Dist-Other.ipynb) and other
[estimators](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap4/15-Properties-Estimators.ipynb) as a tool to assess the
variability in those statistics and measure the level of uncertainty or
precision in the estimate we obtain from the sample. In particular, the
variance of a sampling distribution or the <font color="dodgerblue">**standard error**</font> (which is the square
root of the variance of a sampling distribution) are commonly used to
assess the variability in sample statistics.

In the case of the mean mandible length of all golden jackals, we have
collected one sample of $n=20$ adult golden jackals. We do not have
access to data from the entire population, so we cannot construct a
sampling distribution by picking many different random samples, each of size
$n=20$. Collecting unbiased samples can be quite expensive,
time-consuming, and logistically difficult. <font color="tomato">**If we only have one sample and know very little
about the population, how can we generate a sampling distribution from
this limited information?**</font>



## <a name="16boot-dist">Bootstrap Distributions</a>

---

 <font color="dodgerblue">**Bootstrapping**</font> is the process
of generating many different random resamples from one random sample in
order to approximate the sampling distribution of a statistic. For each
randomly selected resample, we calculate a statistic of interest. Then we
construct a distribution of these bootstrap statistics that approximates
the sampling distribution of the sample statistic (such as a mean,
proportion, variance, and others). We can apply bootstrapping to samples
of any size, even small ones, and to nearly any statistic, although the
approximation tends to be less reliable for very small samples. Thus,
bootstrapping provides a robust method for performing statistical
inference that we can adapt to many different situations in statistics
and data science.



### <a name="16boot-steps">A Bootstrapping Algorithm</a>

---

Given an original sample of size $n$ from a population:

1.  Draw a  <font color="dodgerblue">**bootstrap resample**</font>   of the same size, $n$, with replacement from the original sample.
2.  Compute the relevant statistic (mean, proportion, max, variance,   etc.) of that resample.
3.  Repeat this many times (say $100,\!000$ times).

-   A distribution of statistics from the bootstrap samples is called a    <font color="dodgerblue">**bootstrap distribution**</font>.
-   A bootstrap distribution gives an *approximation* for the sampling   distribution.
-   We can inspect the center, spread and shape of the bootstrap   distribution and do statistical inference.
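
The steps above can be sketched as a small helper function. This is a minimal sketch rather than the approach used in the rest of this notebook: the function name `boot_stat`, its arguments, and the four-value toy sample are hypothetical choices made only for illustration, and a smaller number of resamples is used to keep the run time short.

```r
# Hypothetical helper: bootstrap distribution of any statistic
boot_stat <- function(x, stat = mean, N = 10^4) {
  n <- length(x)  # each resample has the same size as the original sample
  # Steps 1-3: resample with replacement, compute the statistic, repeat N times
  replicate(N, stat(sample(x, size = n, replace = TRUE)))
}

set.seed(16)  # arbitrary seed so the sketch is reproducible
toy.sample <- c(120, 107, 110, 116)  # toy sample, for illustration only
toy.boot <- boot_stat(toy.sample, stat = median)  # bootstrap the median

length(toy.boot)  # one bootstrap statistic per resample
range(toy.boot)  # bootstrap medians stay within the range of the sample
```

Passing the statistic as an argument is one way to reuse the same resampling code for a mean, a median, a variance, and so on.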



## <a name="16q4">Question 4</a>

---

Consider a random sample of 4 golden jackal mandible lengths (in mm):
120, 107, 110, and 116. Which of the following could be a possible
bootstrap resample? Explain why or why not.

### <a name="16q4a">Question 4a</a>

---

120, 107, 116

#### <a name="16sol4a">Solution to Question 4a</a>

---

<br>  
<br>  
<br>



### <a name="16q4b">Question 4b</a>

---

110, 110, 110, 110

#### <a name="16sol4b">Solution to Question 4b</a>

---

<br>  
<br>  
<br>



### <a name="16q4c">Question 4c</a>

---

120, 107, 110, 116

#### <a name="16sol4c">Solution to Question 4c</a>

---

<br>  
<br>  
<br>



### <a name="16q4d">Question 4d</a>

---

120, 107, 110, 116, 120

#### <a name="16sol4d">Solution to Question 4d</a>

---

<br>  
<br>  
<br>



### <a name="16q4e">Question 4e</a>

---

110, 130, 120, 107

#### <a name="16sol4e">Solution to Question 4e</a>

---

<br>  
<br>  
<br>

## <a name="16q5">Question 5</a>

---

How many possible bootstrap resamples can be constructed from an
original sample that has $n=20$ values?

### <a name="16sol5">Solution to Question 5</a>

---

In [None]:
# How many possible resamples are there for n=20?



<br>  
<br>  
<br>



# <a name="16monte">Monte Carlo Methods</a>

---

 <font color="dodgerblue">**Monte Carlo methods**</font> are
computational algorithms that rely on repeated random sampling. A
bootstrap distribution is one example of a Monte Carlo method. A
bootstrap distribution theoretically would contain the sample statistics
from *all possible bootstrap resamples*. If the original sample has
size $n$, then there are a total of $n^n$ possible bootstrap
resamples. In the case of $n=20$, we have
$20^{20} \approx 1.049 \times 10^{26}$ possible resamples. If we ignore
the ordering in which we pick the sample, when $n=20$, we have a total
of $68,\!923,\!264,\!410$ (almost 69 billion!) distinct bootstrap
resamples.
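
The two counts above can be checked directly in R. This is a short sketch that assumes the $n$ observations are all distinguishable; the unordered count uses the standard formula for the number of multisets of size $n$ chosen from $n$ items, $\binom{2n-1}{n}$, which is not stated in the text above. The variable names are hypothetical.

```r
n <- 20  # size of the original sample

# ordered resamples: n choices for each of the n draws
ordered.resamples <- n^n

# unordered resamples: multisets of size n chosen from n values
distinct.resamples <- choose(2 * n - 1, n)

format(ordered.resamples, digits = 4)  # print in scientific notation
format(distinct.resamples, big.mark = ",")  # print with comma separators
```

If the original sample contains repeated values, the number of distinct resamples is smaller than these counts.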

For small samples, we could write out all possible bootstrap resamples.
For larger values of $n$ (and we see $n=20$ is already extremely large),
it is really not practical or feasible to generate all possible
bootstrap resamples while avoiding duplicates. Instead, we use Monte
Carlo methods to repeatedly pick random samples that we use to
approximate a sampling distribution. The Monte Carlo method of
generating many (but not necessarily all) bootstrap resamples introduces
additional uncertainty and variability into the analysis. The more
bootstrap resamples we choose, the less of this additional uncertainty we have.

-   By default, we will create $N =100,\!000 = 10^5$ bootstrap   resamples.
-   In some cases (very large $n$), we may choose a smaller number of   resamples for the sake of time.
-   For typical bootstrapping, it is recommended to use at least   $N=10,\!000$ bootstrap resamples.
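
To see how the number of resamples affects the extra Monte Carlo variability, we can rerun an entire bootstrap several times and watch how much the resulting standard error changes from run to run. The sketch below is only an illustration: the toy sample, the helper `boot_se_once`, and the choice of 20 repetitions are all hypothetical.

```r
set.seed(16)  # arbitrary seed so the sketch is reproducible
toy.sample <- c(120, 107, 110, 116, 114, 111, 113, 117)  # hypothetical data

# hypothetical helper: bootstrap standard error of the mean from N resamples
boot_se_once <- function(x, N) {
  sd(replicate(N, mean(sample(x, size = length(x), replace = TRUE))))
}

# repeat the whole bootstrap 20 times with a small N and with a larger N
se.small <- replicate(20, boot_se_once(toy.sample, N = 100))
se.large <- replicate(20, boot_se_once(toy.sample, N = 10^4))

sd(se.small)  # run-to-run variability when N = 100
sd(se.large)  # run-to-run variability when N = 10,000
```

With more resamples per run, the bootstrap standard errors should agree much more closely from one run to the next.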

Monte Carlo methods were first explored by the Polish mathematician
Stanisław Ulam in the 1940s while working on the initial development of
nuclear weapons at Los Alamos National Lab in New Mexico. The research
required evaluating extremely challenging integrals. Ulam devised a
numerical algorithm based on repeated random sampling to approximate the
integrals. The method was later named “Monte Carlo”, after the district of
Monaco famous for its casino, due to the randomness involved in the
computations.



## <a name="16jackal-boot">Creating a Bootstrap Distribution in R</a>

---

Let’s return to our statistical question:

> “What is the average mandible (jaw) length of all golden jackals?”

We have already picked one random sample of $n=20$ adult golden jackals.
The mandible lengths of our sample are stored in the vector
`jaw.sample`.



### <a name="16jackal-pick">Step 1: Pick a Bootstrap Resample</a>

---

We use the `sample()` function in R to pick a random sample of values
out of the values in `jaw.sample`.

-   Notice the resample has size $n=20$, the same as the original   sample.
-   We use the option `replace = TRUE` since we want to sample with   replacement.
-   Running the code cell below creates one bootstrap resample stored in   `temp.samp`.

In [None]:
temp.samp <- sample(jaw.sample, size=20, replace = TRUE)  # sample with replacement
temp.samp  # print sample to screen

### <a name="16jackal-stat">Step 2: Calculate Statistic(s) from the Bootstrap Sample</a>

---

In the golden jackal mandible length example, we want to use information
about the distribution of sample means to estimate a population mean.
Thus, we calculate the mean of the bootstrap resample `temp.samp` that
we picked in the previous code cell.

In [None]:
mean(temp.samp)  # mean of bootstrap resample

### <a name="16jackal-repeat">Step 3: Repeat Over and Over Again</a>

---

In the code cell below, we repeat steps 1 and 2 over and over again. The
sample means we calculate from each bootstrap resample are stored in a
vector named `boot.dist`. Run the code cell below to generate a
bootstrap distribution for the sample mean.

-   A  <font color="tomato">solid red line</font> marks the   location of the  <font color="tomato">sample mean from the   original sample</font>.
-   A  <font color="dodgerblue">dashed blue line</font> marks the   location of the  <font color="dodgerblue">mean of the bootstrap   distribution</font>.
-   A  <font color="mediumseagreen">solid green line</font> marks   the location of the  <font color="mediumseagreen">actual   population mean</font><sup>2</sup>.

<br>

<font size=2>2. Ali Louei Monfared, “Macro-Anatomical Investigation of the Skull of Golden Jackal (Canis aureus) and its Clinical Application during Regional Anesthesia”, *Global Veterinaria* 10 (5): 547-550, 2013.</font>

In [None]:
##########################
# cell is ready to run
# no need for edits
##########################
N <- 10^5  # Number of bootstrap samples
boot.dist <- numeric(N)  # create vector to store bootstrap means

# for loop that creates bootstrap dist
for (i in 1:N)
{
  x <- sample(jaw.sample, 20, replace = TRUE)  # pick a bootstrap resample
  boot.dist[i] <- mean(x)  # compute mean of bootstrap resample
}

# plot bootstrap distribution
hist(boot.dist,
     breaks=20,
     xlab = "x-bar, mandible length (in mm)",
     main = "Bootstrap Distribution for Sample Mean (n=20)",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

# red line at the observed sample mean
abline(v = mean(jaw.sample), col = "firebrick2", lwd = 2, lty = 1)

# blue line at the center of bootstrap dist
abline(v = mean(boot.dist), col = "blue", lwd = 2, lty = 2)

# green line at the population mean, 112 mm
abline(v = 112, col = "mediumseagreen", lwd = 2, lty = 1)

## <a name="16q6">Question 6</a>

---

What are the mean and standard error of the bootstrap distribution?
Complete the code below to compute these values.

In [None]:
# calculate center of bootstrap dist


# calculate bootstrap standard error



### <a name="16sol6">Solution to Question 6</a>

---

<br>  
<br>



# <a name="16bias">Measuring the Bias of Bootstrap Estimates</a>

---

Recall if ${\color{tomato}{\widehat{\theta}}}$ is an estimator for the
parameter ${\color{mediumseagreen}{\theta}}$, then we define the <font color="dodgerblue">**bias**</font> of the estimator as the difference between its expected value and the parameter:

$${\large \color{dodgerblue}{ \boxed{\mbox{Bias}(\widehat{\theta}) = {\color{tomato}{E(\widehat{\theta})}} - {\color{mediumseagreen}{\theta}}}.}}$$

**In the case of bootstrapping:**

-   We use   ${\color{tomato}{E( \overline{X}_{\rm{boot}}) = \hat{\mu}_{\rm{boot}}}}$,   the  <font color="tomato">mean of the bootstrap distribution</font>, in place of the <font color="tomato">expected value of the estimator</font>.
-   We use the <font color="mediumseagreen">mean from the original   sample</font>, ${\color{mediumseagreen}{\bar{x}}}$, <font color="mediumseagreen">in place of the parameter $\theta$</font>.
-   We define the  <font color="dodgerblue">**bootstrap estimate of   bias**</font> as

$${\large \color{dodgerblue}{ \boxed{\mbox{Bias}_{\rm{boot}} \big( \hat{\mu}_{\rm{boot}} \big) = {\color{tomato}{\hat{\mu}_{\rm{boot}}}} - {\color{mediumseagreen}{\bar{x}}}.}}}$$

We use the center of the bootstrap distribution,
${\color{tomato}{E( \overline{X}_{\rm{boot}}) = \hat{\mu}_{\rm{boot}}}}$
as the estimator for the population mean.

-   Do not use the mean of the original sample, $\bar{x}$, as the   estimator.
-   The values of $\hat{\mu}_{\rm{boot}}$ and $\bar{x}$ are usually very   close, but not equal.
-   The difference $\hat{\mu}_{\rm{boot}} - \bar{x}$ is used to estimate   the bias of $\hat{\mu}_{\rm{boot}}$.
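
The bootstrap estimate of bias can be computed in a few lines. The sketch below uses a hypothetical toy sample (not the jackal data) and hypothetical variable names, with fewer resamples than the default to keep it quick.

```r
set.seed(16)  # arbitrary seed so the sketch is reproducible
toy.sample <- c(23, 31, 18, 27, 40, 22, 35, 29)  # hypothetical data
n.toy <- length(toy.sample)

# bootstrap distribution for the sample mean
toy.boot <- replicate(10^4, mean(sample(toy.sample, size = n.toy, replace = TRUE)))

boot.center <- mean(toy.boot)  # center of the bootstrap distribution
toy.bias <- boot.center - mean(toy.sample)  # bootstrap estimate of bias
toy.bias
```

For a sample mean, we would expect this value to be close to zero, since the resamples are drawn from a set of values whose mean is exactly $\bar{x}$. Any remaining difference is mostly due to using a finite number of resamples.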



## <a name="16q7">Question 7</a>

---

Compute the bootstrap estimate of bias if we use the mean of the
bootstrap distribution from [Question 6](#16q6) as our estimate for the mean
mandible length of all adult golden jackals.

### <a name="16sol7">Solution to Question 7</a>

---

<br>  
<br>  
<br>



## <a name="16q8">Question 8</a>

---

What common distribution do you believe is the best model for mandible
lengths of all golden jackals? Explain your reasoning.

### <a name="16sol8">Solution to Question 8</a>

---

<br>  
<br>  
<br>



# <a name="16clt">A Central Limit Theorem Model</a>

---

Let $X$ be the mandible length (in mm) of a randomly selected adult
golden jackal. Based on the sample data in `jaw.sample`, we could come
up with unbiased estimates for the population mean and population
variance using:

-   $\mu \approx \hat{\mu} = \frac{1}{n} \sum x_i =$ `mean(jaw.sample)`   $=111$ mm.
-   $\sigma^2 \approx s^2 = \frac{1}{n-1} \sum (x_i - \bar{x})^2 =$   `var(jaw.sample)` $=15.05$ mm<sup>2</sup>.
-   This gives $\sigma \approx s = \sqrt{15.05} = 3.88$ mm.

We now have an estimated model for the population, namely
$X \sim N(111, 3.88)$, where the second value is the standard deviation. Although our sample size is relatively small
($n=20 < 30$), we can apply the [Central Limit Theorem (CLT) for
Means](https://githubtocolab.com/CU-Denver-MathStats-OER/Statistical-Theory/blob/main/Chap3/10-Sampling-Dist-Mean.ipynb) since the population is assumed to be
symmetric (normally distributed).



## <a name="16q9">Question 9</a>

---

Using our normal model $X \sim N(111, 3.88)$ for
the population, we can derive a theoretical model for the distribution
of sample means for $n=20$. Using the CLT for means, give the mean and
standard error of the sampling distribution for $\overline{X}$. **How
do your answers compare to the approximations you found in [Question 6](#16q6)
for the bootstrap approximation of the sampling
distribution for sample means?**

### <a name="16sol9">Solution to Question 9</a>

---

<br>  
<br>  
<br>



# <a name="16compare-clt">Comparison of Bootstrap Approximations to CLT</a>

---

Consider the theoretical population $X \sim N(23,7)$. Below we compare the sampling distribution for the mean obtained from the central limit theorem (top row) with one random sample of size $n=50$ and its corresponding bootstrap distribution for the sample mean (bottom row).


<figure>
<img
src="https://upload.wikimedia.org/wikipedia/commons/1/1c/16fig-clt-boot-compare.png"
alt="Comparing Bootstrap Approximation of Sampling Distribution to CLT" width = "100%"/>
<figcaption aria-hidden="true">
Credit: Adam Spiegler <a href="https://commons.wikimedia.org/wiki/File:16fig-clt-boot-compare.png">“Comparing Bootstrap and CLT”</a>, <a href="https://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a>
</figcaption>
</figure>

|  | <font size="3">Mean</font> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | <font size="3">Standard deviation</font> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; |
|----------|----------------|-------------------------|
| <font size="3">Population</font>  | ${\large \mu_X= 23}$ | ${\large\sigma_X = 7}$ |
| <font size="3">Theoretical Sampling Dist for</font> ${\large\overline{X} }$ | ${\large \mu_{\bar{X}}= 23}$ | ${\large \sigma_{\overline{X}} = \mbox{SE}(\overline{X}) = 0.99}$ |
| <font size="3">Sample ($n=50$)</font> | ${\large \bar{x} = 22.69}$ | ${\large s = 6.15}$  |
| <font size="3">Bootstrap distribution</font> | ${\large\hat{\mu}_{\rm{boot}} = 22.88}$ | ${\large\mbox{SE}_{\rm{boot}}(\overline{X}) = 0.938}$ |
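
The theoretical standard error in the table comes from the CLT formula $\sigma/\sqrt{n}$ with $\sigma = 7$ and $n = 50$. As a rough check, the sketch below compares that value to a simulated sampling distribution built from many random samples drawn from $N(23, 7)$. The variable names and the number of simulated samples are hypothetical choices.

```r
set.seed(16)  # arbitrary seed so the sketch is reproducible
mu <- 23  # population mean
sigma <- 7  # population standard deviation
n <- 50  # sample size

clt.se <- sigma / sqrt(n)  # theoretical standard error from the CLT

# simulate 10,000 sample means, each from a new random sample of size n
sim.means <- replicate(10^4, mean(rnorm(n, mean = mu, sd = sigma)))
sim.se <- sd(sim.means)  # simulated standard error

clt.se  # compare to the theoretical value in the table
sim.se  # should be close to clt.se
```

Unlike a bootstrap distribution, this simulation requires knowing the population, which is exactly what we usually lack in practice.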



## <a name="16q10">Question 10</a>

---

a.  Compare the population and sample distributions. What is similar   about the two distributions? What are the differences?

<br>

b.  Compare the CLT sampling distribution and bootstrap sampling   distribution. What is similar about the two distributions? What are   the differences?

### <a name="16sol10">Solution to Question 10</a>

---

a.  ??

<br>  
<br>

b.  ??

<br>  
<br>



## <a name="16plug-in">The Plug-in Principle</a>

---

 <font color="dodgerblue">**The Plug-in Principle:**</font> If
something (such as a characteristic of a population) is unknown,
substitute (plug-in) an estimate. For example, if we do not know the
population mean $\mu$, the sample mean $\bar{x}$ is a nice, unbiased
substitute. If a population standard deviation $\sigma$ is unknown, we
can substitute the sample standard deviation, $s$.

-   Bootstrapping is an extreme application of this principle.
-   We replace the entire population (not just one parameter) by the   entire set of data from the sample.
-   We can use our substitute for the population to generate a bootstrap   distribution to estimate a sampling distribution.
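
One way to picture "replacing the population by the sample" is to treat the sample as a distribution that assigns probability $1/n$ to each observed value. Sampling with replacement from the data is the same as sampling from this distribution. The sketch below computes the mean and variance of that substitute population for a hypothetical toy sample; all names are illustrative assumptions.

```r
toy.sample <- c(120, 107, 110, 116)  # hypothetical data
n.toy <- length(toy.sample)

# treat the sample as the population: each value has probability 1/n
plugin.mean <- sum(toy.sample * (1 / n.toy))
plugin.var <- sum((toy.sample - plugin.mean)^2 * (1 / n.toy))

# the mean of the substitute population is the sample mean
all.equal(plugin.mean, mean(toy.sample))

# its variance uses a divisor of n rather than n - 1
all.equal(plugin.var, (n.toy - 1) / n.toy * var(toy.sample))
```

This shows why a bootstrap distribution for the mean is centered near $\bar{x}$: the resamples come from a population whose mean is exactly $\bar{x}$.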



# <a name="16prop-boot">Properties of Bootstrap Estimators</a>

---

-   The goal of a bootstrap distribution is to estimate a sampling   distribution for some statistic.
-   Bootstrap distributions are  <font color="tomato">**biased   estimators for a population mean**</font> since they are centered   near $\bar{x}$ not necessarily $\mu$.
    -   Thus, the mean of a bootstrap distribution is not useful alone.
    -   But bootstrapping is useful for quantifying the behavior of a parameter estimate.
-   For most common statistics, bootstrap distributions provide good   estimates for the true <font color="dodgerblue">**spread**</font>, <font color="dodgerblue">**shape**</font>, and <font color="dodgerblue">**bias**</font> of a sampling distribution   for the statistic of interest.



## <a name="16q11">Question 11: Arsenic Case Study</a>

---

Arsenic is a naturally occurring element in the groundwater in
Bangladesh. Much of this water is used for drinking in rural areas, so
arsenic poisoning is a serious health issue. The data set `Bangladesh`
in the `resampledata` package<sup>3</sup> contains measurements on arsenic,
chlorine, and cobalt levels (in parts per billion, ppb) present in each
of 271 groundwater samples.

<br>

<font size=2>3.  Laura M. Chihara and Tim C. Hesterberg (2019) *Mathematical Statistics with Resampling and R*. Second Edition. John Wiley & Sons, Hoboken, NJ.</font>

### <a name="16load-arsenic">Loading the Data</a>

---

It is very likely you do not have the package `resampledata` installed.
You will need to first install the `resampledata` package.

-   Go to the R Console.
-   Run the command `install.packages("resampledata")`.

You will only need to run the `install.packages()` command one time. You
can now access `resampledata` anytime you like! However, you will need
to run the command `library(resampledata)` during any R session in which
you want to access data from the `resampledata` package. **Be sure you
have first installed the `resampledata` package before executing the
code cell below.**

In [None]:
install.packages("resampledata")

In [None]:
# be sure you have already installed the resampledata package
library(resampledata)  # loading resampledata package

### <a name="16q11a">Question 11a</a>

---

Complete the code cell below to calculate the sample mean, sample
standard deviation, and sample size for the arsenic levels.

#### <a name="16sol11a">Solution to Question 11a</a>

---

In [None]:
# be sure you have already loaded the resampledata package
arsenic <- Bangladesh$Arsenic  # store arsenic levels (in ppb) to a vector
mean.arsenic <- ??  # sample mean
sd.arsenic <- ??  # sample standard deviation
n.arsenic <- length(arsenic)  # how many observations in arsenic

<br>  
<br>



### <a name="16q11b">Question 11b</a>

---

Create a histogram to show the shape of the distribution of the sample
data. How would you describe the shape?

#### <a name="16ol11b">Solution to Question 11b</a>

---

In [None]:
# create a histogram of the sample arsenic data
hist(??)

<br>  
<br>



### <a name="16q11c">Question 11c</a>

---

Complete the code cell below to generate a bootstrap distribution for
the sample mean. What are the mean and standard error of the bootstrap
distribution?

#### <a name="16ol11c">Solution to Question 11c</a>

---

In [None]:
N <- 10^5  # number of bootstrap samples
boot.arsenic <- numeric(N)  # create vector to store bootstrap means

# Set up a for loop!
for (i in 1:N)
{
  x <- sample(??, size = ??, replace = ??)  # pick a bootstrap resample
  boot.arsenic[i] <- ??  # calculate relevant sample statistic
}

boot.mean <- mean(??)  # calculate center of bootstrap dist
boot.se <- sd(??)  # calculate bootstrap standard error

# plot bootstrap distribution
hist(boot.arsenic,  xlab = "xbar",
     main = "Bootstrap Distribution",
     cex.lab=1.5, cex.axis=1.5, cex.main=1.5)

# add a red line at the observed sample mean
abline(v = ??, col = "firebrick2", lwd = 2, lty = 1)

# add a blue line at the center of bootstrap dist
abline(v = ??, col = "blue", lwd = 2, lty = 2)

In [None]:
# print bootstrap estimate and standard error to screen
boot.mean  # mean (center) of bootstrap dist
boot.se  # standard error (spread) of bootstrap dist

In [None]:
# compare bootstrap dist to normal dist
# run to create a qq-plot
qqnorm(boot.arsenic)
qqline(boot.arsenic)

<br>  
<br>



### <a name="16q11d">Question 11d</a>

---

Calculate the bootstrap estimate for bias.

#### <a name="16ol11d">Solution to Question 11d</a>

---

In [None]:
# calculate bootstrap estimate of bias


# <a name="16CC License">Creative Commons License Information</a>
---


![Creative Commons
License](https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png)

*Statistical Methods: Exploring the Uncertain* by [Adam
Spiegler (University of Colorado Denver)](https://github.com/CU-Denver-MathStats-OER/Statistical-Theory)
is licensed under a [Creative Commons
Attribution-NonCommercial-ShareAlike 4.0 International
License](http://creativecommons.org/licenses/by-nc-sa/4.0/). This work is funded by an [Institutional OER Grant from the Colorado Department of Higher Education (CDHE)](https://cdhe.colorado.gov/educators/administration/institutional-groups/open-educational-resources-in-colorado).

For similar interactive OER materials in other courses funded by this project in the Department of Mathematical and Statistical Sciences at the University of Colorado Denver, visit <https://github.com/CU-Denver-MathStats-OER>.

<br>  
<br>