## Bootstrap Confidence Intervals

The **empirical bootstrap** is a technique popularized by Bradley Efron in 1979. It is easy to understand and implement, but it became widely used only once modern computing power made drawing thousands of resamples feasible. The bootstrap allows us to substitute fast computation for theoretical math.

**Big Idea:** perform computations on the data itself to estimate the variation of statistics that are themselves computed from the same data. That is, the data is 'pulling itself up by its own bootstraps.'

Since the bootstrap allows you to estimate the variance of the sampling distribution of these statistics, you can use this technique to construct confidence intervals.

Recall the procedure for building a 95% bootstrap confidence interval:

1. Given a sample, find the sample statistic $s$. This is the **point estimate**.
2. Draw a large number (1,000 or so) of resamples, with replacement, from the original sample, each the same size as the original sample, and calculate the statistic $s^*$ for each.
3. Find the 0.025 and 0.975 quantiles of the set of differences $s^* - s$. We'll call them $a$ and $b$, respectively.
4. The 95% confidence interval is given by $[s - b, s - a]$.
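The four steps above can be sketched as a single function. This is a minimal illustration, not the notebook's own code: the helper name `bootstrap_ci`, its parameters, and the toy data are all assumptions made up for this example.

```python
import numpy as np

def bootstrap_ci(sample, statistic, num_resamples=1000, conf_level=0.95, seed=None):
    """Empirical bootstrap confidence interval (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample)

    # Step 1: point estimate from the original sample
    s = statistic(sample)

    # Step 2: draw resamples with replacement, one per row, and
    # compute the statistic for each resample
    resamples = rng.choice(sample, size=(num_resamples, sample.size))
    s_star = np.array([statistic(resample) for resample in resamples])

    # Step 3: quantiles of the differences s* - s
    margin = (1 - conf_level) / 2
    a, b = np.quantile(s_star - s, [margin, 1 - margin])

    # Step 4: the interval [s - b, s - a]
    return s - b, s - a

# Made-up data, purely for illustration
toy_sample = np.array([3.1, 4.7, 5.0, 5.2, 6.8, 7.3, 7.9, 9.4])
lower, upper = bootstrap_ci(toy_sample, np.mean, seed=0)
print('lower bound: ', lower)
print('upper bound: ', upper)
```

Passing the statistic as a function means the same sketch works for the mean, the median, or any other statistic that maps a 1-D array to a number.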

In [None]:
import pandas as pd
import numpy as np

Let's see how we can create bootstrap confidence intervals for different parameters.

For this notebook, we'll be working with the Palmer penguins dataset.

In [None]:
penguins = pd.read_csv('../data/penguins.csv').dropna()

In [None]:
penguins

## Bootstrap Confidence Intervals for a Single Parameter

Let's look at just the Adelie penguins.

In [None]:
adelie = penguins[penguins['species'] == 'Adelie']

First, let's build a confidence interval for the mean body mass.

The first thing to do is to find our point estimate, the sample mean.

In [None]:
point_estimate = adelie['body_mass_g'].mean()
point_estimate

Then set the number of resamples and the confidence level.

In [None]:
# Number of resamples
num_resamples = 1000

# Confidence level
conf_level = 0.95

Using the confidence level, we can determine the cutoff values.

In [None]:
margin = (1 - conf_level) / 2
margin

Next, we extract the sample values.

In [None]:
values = adelie['body_mass_g']

To resample these values, we can use the `choice` function from numpy's random module.

This function takes in a 1-D array and will randomly sample from that array (with replacement by default). We can also use the `size` parameter in order to draw all of our resamples simultaneously. To use this parameter, we'll give it a tuple containing the number of resamples that we want and the size of each resample (the same size as the original dataset).

Here is an example use of this function:

In [None]:
np.random.choice([1,2,3], size = (5,3))
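Because the resamples are random, the results change on every run. If reproducible output is wanted, one option (an addition here, not something the rest of this notebook uses) is NumPy's `Generator` interface with a fixed seed. The seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator produces the same draws every time the cell is run
rng = np.random.default_rng(seed=42)

# Same idea as np.random.choice: 5 resamples of size 3, drawn with
# replacement (replace=True is the default)
demo = rng.choice([1, 2, 3], size=(5, 3))
print(demo)
```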

Now, let's apply it to our sample values.

In [None]:
resamples = np.random.choice(values, size = (num_resamples, values.count()))

In [None]:
resamples

Let's verify that this is the correct shape.

In [None]:
resamples.shape

This is an array where each row is a different resample. An advantage of creating the resamples in this way is that we can compute the mean values of all resamples simultaneously using the `mean` method. When doing this, we need to specify `axis = 1`, which indicates that the mean is computed within each row, giving one mean per resample.

In [None]:
resamples.mean(axis = 1)
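To make the meaning of `axis` concrete, here is a tiny example on a made-up 2-by-3 array (the names `toy`, `row_means`, and `col_means` are only for illustration).

```python
import numpy as np

toy = np.array([[1, 2, 3],
                [4, 5, 6]])

# axis=1 collapses the columns: one mean per row
row_means = toy.mean(axis=1)   # array([2., 5.])

# axis=0 collapses the rows: one mean per column
col_means = toy.mean(axis=0)   # array([2.5, 3.5, 4.5])

print(row_means)
print(col_means)
```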

To build our bootstrap distribution, we need to compute the difference between each resample mean and the point estimate.

In [None]:
diffs = resamples.mean(axis = 1) - point_estimate

Now, we can find the quantiles of these differences to get the upper and lower bounds.

In [None]:
a = np.quantile(diffs, q = margin)
b = np.quantile(diffs, q = 1 - margin)

print('lower bound: ', point_estimate - b)
print('upper bound: ', point_estimate - a)

Let's condense all of the code into one cell.

In [None]:
num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

values = adelie['body_mass_g']

point_estimate = values.mean()

resamples = np.random.choice(values, size = (num_resamples, values.count()))

diffs = resamples.mean(axis = 1) - point_estimate

a = np.quantile(diffs, q = margin)
b = np.quantile(diffs, q = 1 - margin)

print('lower bound: ', point_estimate - b)
print('upper bound: ', point_estimate - a)

Let's compare this to the $t$-interval, which we've seen before.

In [None]:
from scipy.stats import t, sem

In [None]:
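# The confidence level is passed positionally because the `alpha` keyword
# was renamed to `confidence` in newer SciPy releases
t.interval(0.95,
           df = values.count() - 1,
           loc = values.mean(),
           scale = sem(values))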
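For reference, the $t$-interval is the sample mean plus or minus a critical value times the standard error. The sketch below computes it by hand on made-up numbers (`toy_values` is an assumption for illustration, not the penguin data) and checks that it matches `t.interval`.

```python
import numpy as np
from scipy.stats import t, sem

# Made-up measurements, purely for illustration
toy_values = np.array([3650.0, 3800.0, 3450.0, 3700.0, 3900.0, 3550.0, 3725.0])

n = toy_values.size
mean = toy_values.mean()

# Critical value: the 0.975 quantile of a t distribution with n - 1 degrees of freedom
t_crit = t.ppf(0.975, df=n - 1)

# sem uses the sample standard deviation (ddof=1) divided by sqrt(n)
half_width = t_crit * sem(toy_values)
manual_interval = (mean - half_width, mean + half_width)

scipy_interval = t.interval(0.95, df=n - 1, loc=mean, scale=sem(toy_values))

print(manual_interval)
print(scipy_interval)
```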

One major advantage of the bootstrap is that we can use it for statistics even when we don't know the exact sampling distribution. For example, let's find the 95% bootstrap confidence interval for the median.

Note that NumPy arrays do not have a `median` method, but we can use the `np.median` function and specify the axis for this calculation.

In [None]:
num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

values = adelie['flipper_length_mm']

point_estimate = values.median()

resamples = np.random.choice(values, size = (num_resamples, values.count()))

diffs = np.median(resamples, axis = 1) - point_estimate

a = np.quantile(diffs, q = margin)
b = np.quantile(diffs, q = 1 - margin)

print('lower bound: ', point_estimate - b)
print('upper bound: ', point_estimate - a)

**Your Turn:** Modify the above code to find a 95% bootstrap confidence interval for the standard deviation of the flipper length.

In [None]:
# Your code here

### Bootstrap Confidence Interval for a Proportion

Now, let's see how we can create a bootstrap confidence interval for a proportion.

For this, we'll work with a sample from the hotel booking demand dataset, which is described [here](https://www.sciencedirect.com/science/article/pii/S2352340918315191).

In [None]:
bookings = pd.read_csv('../data/bookings_sample.csv')

Specifically, let's estimate the proportion of reservations at city hotels that are canceled.

In order to quickly calculate a proportion, we can make use of the `mean` method of a numpy array. We can do this by checking that an observation is equal to a desired value and then using `mean` on the resulting array of Boolean values. (Note: in this case we do not have to check that the values are equal to 1 prior to using `mean`, but in cases where your variables are not encoded as 0/1, it would be necessary.)
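As a small illustration of this trick on a variable that is not encoded as 0/1, consider the made-up labels below (the array `statuses` and its values are hypothetical).

```python
import numpy as np

# Hypothetical labels, not taken from the bookings data
statuses = np.array(['canceled', 'kept', 'canceled', 'kept', 'kept'])

# The comparison gives a Boolean array; True counts as 1 and False as 0
is_canceled = (statuses == 'canceled')

# The mean of the Boolean array is the proportion of True values: 2 out of 5
proportion = is_canceled.mean()
print(proportion)
```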

First, let's see the observed proportion of bookings that are canceled.

In [None]:
(bookings['is_canceled'] == 1).mean()

Now, we can largely reuse the code from above for our confidence interval.

In [None]:
num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

values = bookings['is_canceled']

point_estimate = (values == 1).mean()

resamples = np.random.choice(values, size = (num_resamples, values.count()))

diffs = (resamples == 1).mean(axis = 1) - point_estimate

a = np.quantile(diffs, q = margin)
b = np.quantile(diffs, q = 1 - margin)

print('lower bound: ', point_estimate - b)
print('upper bound: ', point_estimate - a)

## Confidence Interval for a Difference

Sometimes, you may be interested in comparing the parameter value of two different groups. We can construct a confidence interval for the difference in parameters in a similar way as above, but we'll need to resample from each group.

For example, let's say we want to build a confidence interval for the difference in the mean body mass between Adelie and Chinstrap penguins.

In [None]:
import seaborn as sns
sns.boxplot(data = penguins, x = 'species', y = 'body_mass_g');

In [None]:
chinstrap = penguins[penguins['species'] == 'Chinstrap']

chinstrap_values = chinstrap['body_mass_g']
adelie_values = adelie['body_mass_g']

num_resamples = 10000
conf_level = 0.95

margin = (1 - conf_level) / 2

chinstrap_resamples = np.random.choice(chinstrap_values, size = (num_resamples, chinstrap_values.count()))
adelie_resamples = np.random.choice(adelie_values, size = (num_resamples, adelie_values.count()))

point_estimate = chinstrap_values.mean() - adelie_values.mean()

diffs = chinstrap_resamples.mean(axis = 1) - adelie_resamples.mean(axis = 1) - point_estimate

a = np.quantile(diffs, q = margin)
b = np.quantile(diffs, q = 1 - margin)

print('lower bound: ', point_estimate - b)
print('upper bound: ', point_estimate - a)

The fact that our confidence interval contains zero means that we can't immediately dismiss the possibility that Adelie and Chinstrap penguins have the same body mass on average.
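The two-group procedure can also be wrapped in a function. The sketch below is one possible packaging, with a hypothetical helper name `bootstrap_diff_ci` and made-up data. The important detail is that each group is resampled separately, at its own sample size.

```python
import numpy as np

def bootstrap_diff_ci(group_a, group_b, num_resamples=10000, conf_level=0.95, seed=None):
    """Empirical bootstrap CI for mean(group_a) - mean(group_b) (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    group_a = np.asarray(group_a)
    group_b = np.asarray(group_b)

    point_estimate = group_a.mean() - group_b.mean()

    # Resample each group independently, keeping its original size
    resamples_a = rng.choice(group_a, size=(num_resamples, group_a.size))
    resamples_b = rng.choice(group_b, size=(num_resamples, group_b.size))

    diffs = resamples_a.mean(axis=1) - resamples_b.mean(axis=1) - point_estimate

    margin = (1 - conf_level) / 2
    a, b = np.quantile(diffs, [margin, 1 - margin])
    return point_estimate - b, point_estimate - a

# Made-up groups whose values do not overlap, so the interval should exclude zero
heavy = np.array([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])
light = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

diff_lower, diff_upper = bootstrap_diff_ci(heavy, light, seed=0)
print('lower bound: ', diff_lower)
print('upper bound: ', diff_upper)
```

With groups this far apart the interval sits entirely above zero, in contrast to an interval that straddles zero.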

## Confidence Interval for Correlation

Finally, let's see how we can construct a bootstrap confidence interval for a correlation coefficient.

First, let's remember how to find the correlation. Specifically, let's look at the correlation between flipper length and bill depth for Adelie penguins.

In [None]:
adelie.plot(kind = 'scatter', x = 'bill_depth_mm', y = 'flipper_length_mm');

In [None]:
adelie[['flipper_length_mm', 'bill_depth_mm']].corr()

To extract just the correlation we care about, we can use the `iloc` accessor.

In [None]:
point_estimate = adelie[['flipper_length_mm', 'bill_depth_mm']].corr().iloc[0, 1]
point_estimate

In order to use the `np.random.choice` function, we need a one-dimensional array. For this, we can use the index of the adelie DataFrame.

Note that the index object does not have a `count` method, but we can use the `len` function here.

In [None]:
values = adelie.index

resamples = np.random.choice(values, size = (num_resamples, len(values)))

What we have now is a two-dimensional array in which each row holds the index values for one resample. Let's look at the procedure that we'll use to extract the correlation coefficient for each resample.

For demonstration purposes, we'll look at the first resample.

In [None]:
resample = resamples[0]
resample

To retrieve the corresponding rows, we can use the `.loc` accessor.

In [None]:
adelie.loc[resample]

We can then filter and calculate the correlation for this resample.

In [None]:
adelie.loc[resample][['flipper_length_mm', 'bill_depth_mm']].corr()

But, what we really want is the difference between this correlation and the point estimate.

In [None]:
adelie.loc[resample][['flipper_length_mm', 'bill_depth_mm']].corr().iloc[0, 1] - point_estimate

Now that we know the procedure that we need to do for each resample, we can create a **for loop** to take each resample, extract the correlation and subtract our point estimate.

We do need to store the result of each calculation, so we'll create a list and append the results to it as we go.

In [None]:
diffs = []

for resample in resamples:
    # Use the same two columns as the point estimate: flipper length and bill depth
    diffs.append(adelie.loc[resample][['flipper_length_mm', 'bill_depth_mm']].corr().iloc[0, 1] - point_estimate)

In [None]:
a = np.quantile(diffs, q = margin)
b = np.quantile(diffs, q = 1 - margin)

print('lower bound: ', point_estimate - b)
print('upper bound: ', point_estimate - a)

Based on this interval, there is at best a moderate correlation between flipper length and bill depth, and it might even be a very weak one.
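The essential point of the correlation procedure is that whole rows are resampled, so each flipper length stays paired with its own bill depth. The sketch below shows the same idea with plain NumPy arrays and `np.corrcoef` instead of a DataFrame. The data are simulated and every name here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated paired measurements, purely for illustration
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

point_estimate = np.corrcoef(x, y)[0, 1]

n = x.size
num_resamples = 2000
margin = (1 - 0.95) / 2

# Each row holds the row positions for one resample. Using the same
# positions for x and y keeps the pairs together.
positions = rng.integers(0, n, size=(num_resamples, n))

diffs = np.array([
    np.corrcoef(x[idx], y[idx])[0, 1] - point_estimate
    for idx in positions
])

a, b = np.quantile(diffs, [margin, 1 - margin])
corr_lower, corr_upper = point_estimate - b, point_estimate - a

print('lower bound: ', corr_lower)
print('upper bound: ', corr_upper)
```

Resampling `x` and `y` independently would break the pairing and drive the resampled correlations toward zero, which is why the positions are shared.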