## Bootstrap Confidence Intervals

The **empirical bootstrap** is a technique introduced by Bradley Efron in 1979. It is easy to understand and implement, but it only became practical with modern computing power, which is why its popularity is relatively recent. The bootstrap lets us substitute fast computation for theoretical math.

**Big Idea:** Instead of trying to derive the sampling distribution, approximate it by resampling from your given sample.

The procedure for finding a bootstrap confidence interval:
1. Draw a large number (1,000 or so) of resamples, with replacement, from the original sample, each the same size as the original sample, and calculate the statistic $s^*$ for each.
2. Find the 0.025 and 0.975 quantiles of the set of $s^*$. We'll call them $a$ and $b$, respectively.
3. The 95% confidence interval is given by $[a, b]$.
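The three steps above can be sketched as a small helper function. This is only an illustration under stated assumptions: the name `bootstrap_ci`, its parameters, and the synthetic `demo` data are hypothetical and are not part of this notebook's dataset or of any library.

```python
import numpy as np

def bootstrap_ci(sample, statistic=np.mean, num_resamples=1000,
                 conf_level=0.95, seed=None):
    """Percentile bootstrap interval (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)

    # Step 1: draw resamples with replacement, one per row,
    # each the same size as the original sample
    resamples = rng.choice(sample, size=(num_resamples, sample.size), replace=True)
    resample_stats = statistic(resamples, axis=1)

    # Step 2: find the lower and upper quantiles of the resampled statistics
    margin = (1 - conf_level) / 2
    a, b = np.quantile(resample_stats, [margin, 1 - margin])

    # Step 3: the interval is [a, b]
    return a, b

# Synthetic data, used only to show the call
demo = np.random.default_rng(0).normal(loc=50, scale=10, size=200)
low, high = bootstrap_ci(demo, seed=1)
print('lower bound: ', low)
print('upper bound: ', high)
```

Any statistic that accepts an `axis` argument, such as `np.median`, could be passed in place of `np.mean`.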

In [None]:
import pandas as pd
import numpy as np

Let's see how we can create bootstrap confidence intervals for different parameters.

For this notebook, we'll be working with the Palmer penguins dataset.

In [None]:
penguins = pd.read_csv('../data/penguins.csv').dropna()
penguins.head()

## Bootstrap Confidence Intervals for a Single Parameter

Let's look at just the Adelie penguins.

In [None]:
adelie = penguins[penguins['species'] == 'Adelie']

Let's build a confidence interval for the mean body mass of the Adelie penguins.

First, let's set the number of resamples that we'll use, along with the confidence level.

In [None]:
#Number of Resamples
num_resamples = 1000

#Confidence Level
conf_level = 0.95

Using the confidence level, we can determine the cutoff values.

In [None]:
margin = (1 - conf_level) / 2
margin

We can extract the sample values.

In [None]:
values = adelie['body_mass_g']

To resample these values, we can use the `choice` function from NumPy's random module.

This function takes in a 1-D array and randomly samples from that array (with replacement by default). We can also use the `size` parameter to draw all of our resamples simultaneously. To use this parameter, we'll give it a tuple containing the number of resamples that we want and the size of each resample (the same size as the original dataset).

Here is an example use of this function:

In [None]:
np.random.choice([1,2,3], size = (5,3))
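If reproducible resamples are needed, NumPy's newer `Generator` interface offers the same `choice` method and accepts a seed. A minimal sketch, where the seed value and the name `toy` are arbitrary choices made for illustration:

```python
import numpy as np

toy = np.array([1, 2, 3])

# A seeded generator produces the same resamples on every run
rng = np.random.default_rng(42)
toy_resamples = rng.choice(toy, size=(5, 3), replace=True)

# A second generator with the same seed reproduces them exactly
toy_resamples_again = np.random.default_rng(42).choice(toy, size=(5, 3), replace=True)

print(toy_resamples)
print(np.array_equal(toy_resamples, toy_resamples_again))
```

Each of the 5 rows is one resample of size 3, drawn with replacement from the original values.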

Now, let's apply it to our sample values.

In [None]:
resamples = np.random.choice(values, size = (num_resamples, values.count()))

In [None]:
resamples

Let's verify that this is the correct shape.

In [None]:
resamples.shape

This is an array where each row is a different resample. An advantage of creating the resamples in this way is that we can compute the means of all resamples simultaneously using the `mean` method. When doing this, we need to specify `axis = 1`, which indicates that we want one mean per row, computed across the values within each row.

In [None]:
resample_stats = resamples.mean(axis = 1)
resample_stats
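To make the meaning of the `axis` argument concrete, here is a tiny example on a made-up 2 x 3 array (the array `toy` is hypothetical and unrelated to the penguins data):

```python
import numpy as np

toy = np.array([[1, 2, 3],
                [4, 5, 6]])

# axis=1 collapses the columns, giving one mean per row
row_means = toy.mean(axis=1)   # array([2., 5.])

# axis=0 collapses the rows, giving one mean per column
col_means = toy.mean(axis=0)   # array([2.5, 3.5, 4.5])

print(row_means)
print(col_means)
```

Since each row of `resamples` is one resample, `axis = 1` is the choice that yields one statistic per resample.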

In [None]:
a = np.quantile(resample_stats, q = margin)
b = np.quantile(resample_stats, q = 1 - margin)

print('lower bound: ', a)
print('upper bound: ', b)

Let's condense all of the code into one cell.

In [None]:
num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

values = adelie['body_mass_g']

resamples = np.random.choice(values, size = (num_resamples, values.count()))

resample_stats = resamples.mean(axis = 1)

a = np.quantile(resample_stats, q = margin)
b = np.quantile(resample_stats, q = 1 - margin)

print('lower bound: ', a)
print('upper bound: ', b)

Let's compare this to the $t$-interval, which we've seen before.

In [None]:
from scipy.stats import t, sem

In [None]:
t.interval(confidence = 0.95, 
           df = values.count() - 1, 
           loc = values.mean(), 
           scale = sem(values))
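For reference, the $t$-interval is $\bar{x} \pm t^* \cdot s / \sqrt{n}$, where $t^*$ is the 0.975 quantile of the $t$ distribution with $n - 1$ degrees of freedom. The sketch below checks that formula against `t.interval` on synthetic data; the array `demo` is a made-up stand-in, not the penguin measurements.

```python
import numpy as np
from scipy.stats import t, sem

# Synthetic stand-in for a sample of body masses (illustration only)
demo = np.random.default_rng(0).normal(loc=3700, scale=450, size=150)

n = demo.size
se = demo.std(ddof=1) / np.sqrt(n)   # standard error, same quantity as sem(demo)
t_star = t.ppf(0.975, df=n - 1)      # critical value for a 95% interval

manual = (demo.mean() - t_star * se, demo.mean() + t_star * se)
library = t.interval(0.95, df=n - 1, loc=demo.mean(), scale=se)

print(manual)
print(library)
```

The two tuples should agree up to floating-point error.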

One major advantage of the bootstrap is that we can use it for statistics even when we don't know the exact sampling distribution. For example, let's find the 95% bootstrap confidence interval for the median.

Note that NumPy arrays do not have a `median` method, but we can use the `np.median` function and specify the axis for this calculation.

In [None]:
num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

values = adelie['body_mass_g']

resamples = np.random.choice(values, size = (num_resamples, values.count()))

resample_stats = np.median(resamples,axis = 1)

a = np.quantile(resample_stats, q = margin)
b = np.quantile(resample_stats, q = 1 - margin)

print('lower bound: ', a)
print('upper bound: ', b)

**Your Turn:** Modify the above code to find a 95% bootstrap confidence interval for the standard deviation of the flipper length.

In [None]:
# Your code here

### Bootstrap Confidence Interval for a Proportion

Now, let's see how we can create a bootstrap confidence interval for a proportion.

For this, we'll work with a sample from the hotel booking demand dataset, which is described [here](https://www.sciencedirect.com/science/article/pii/S2352340918315191).

In [None]:
bookings = pd.read_csv('../data/bookings_sample.csv')

Specifically, let's estimate the proportion of reservations at city hotels that are canceled.

To quickly calculate a proportion, we can make use of the `mean` method. We do this by checking whether each observation is equal to a desired value and then using `mean` on the resulting array of Boolean values, since `True` counts as 1 and `False` as 0. (Note: in this case we do not have to check that the values are equal to 1 prior to using `mean`, but in cases where your variables are not encoded as 0/1, it would be necessary.)

First, let's see the observed proportion of bookings that are canceled.

In [None]:
(bookings['is_canceled'] == 1).mean()
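As a small illustration of the case where the variable is not encoded as 0/1, the comparison step becomes necessary. The labels in `status` below are hypothetical and do not come from the bookings data:

```python
import numpy as np

# Hypothetical text labels instead of a 0/1 encoding
status = np.array(['canceled', 'kept', 'kept', 'canceled', 'kept'])

# The comparison gives a Boolean array; its mean is the proportion of True values
is_canceled = (status == 'canceled')
prop = is_canceled.mean()   # 2 of 5 values match, so 0.4

print(prop)
```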

Now, we can largely reuse the code from above for our confidence interval.

In [None]:
num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

values = bookings['is_canceled']

resamples = np.random.choice(values, size = (num_resamples, values.count()))

resample_stats = resamples.mean(axis = 1)

a = np.quantile(resample_stats, q = margin)
b = np.quantile(resample_stats, q = 1 - margin)

print('lower bound: ', a)
print('upper bound: ', b)

## Confidence Interval for a Difference

Sometimes, you may be interested in comparing a parameter's value between two different groups. We can construct a confidence interval for the difference in parameters in much the same way as above, but we'll need to resample from each group separately.

For example, let's say we want to build a confidence interval for the difference in the mean body mass between Adelie and Chinstrap penguins.

In [None]:
import seaborn as sns
sns.boxplot(data = penguins, x = 'species', y = 'body_mass_g');

In [None]:
adelie = penguins[penguins['species'] == 'Adelie']
chinstrap = penguins[penguins['species'] == 'Chinstrap']

chinstrap_values = chinstrap['body_mass_g']
adelie_values = adelie['body_mass_g']

num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

chinstrap_resamples = np.random.choice(chinstrap_values, size = (num_resamples, chinstrap_values.count()))
adelie_resamples = np.random.choice(adelie_values, size = (num_resamples, adelie_values.count()))

resample_stats = chinstrap_resamples.mean(axis = 1) - adelie_resamples.mean(axis = 1)

a = np.quantile(resample_stats, q = margin)
b = np.quantile(resample_stats, q = 1 - margin)

print('lower bound: ', a)
print('upper bound: ', b)

The fact that our confidence interval contains zero says that we can't immediately dismiss the possibility that Adelie and Chinstrap penguins have the same mean body mass.
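For contrast, here is a sketch of the same procedure on two synthetic groups whose means are far apart, together with an explicit check of whether zero falls inside the interval. The arrays `group_a` and `group_b` are made up for illustration and have nothing to do with the penguins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic groups with clearly different means (illustration only)
group_a = rng.normal(loc=0.0, scale=1.0, size=100)
group_b = rng.normal(loc=5.0, scale=1.0, size=100)

num_resamples = 10000
conf_level = 0.95
margin = (1 - conf_level) / 2

# Resample each group separately, one resample per row
a_resamples = rng.choice(group_a, size=(num_resamples, group_a.size))
b_resamples = rng.choice(group_b, size=(num_resamples, group_b.size))

# Difference in resample means, one value per pair of resamples
diffs = b_resamples.mean(axis=1) - a_resamples.mean(axis=1)

lower, upper = np.quantile(diffs, [margin, 1 - margin])
contains_zero = lower <= 0 <= upper

print('lower bound: ', lower)
print('upper bound: ', upper)
print('contains zero: ', contains_zero)
```

With groups this far apart, the interval should sit well away from zero, which is the situation where a difference of zero can be dismissed.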