# Statistical Inference

In this notebook, you'll see how to construct bootstrap confidence intervals.

## Bootstrap Confidence Intervals

The **empirical bootstrap** is a resampling technique introduced by Bradley Efron in 1979. It is easy to understand and implement, but it only became practical with modern computing power, because it relies on recomputing a statistic over thousands of resamples. The bootstrap allows us to substitute fast computation for theoretical math.

**Big Idea:** perform computations on the data itself to estimate the variation of statistics that are themselves computed from the same data. That is, the data is ‘pulling itself up by its own bootstraps.’

Since the bootstrap allows you to estimate the variance of the sampling distribution of these statistics, you can use this technique to construct confidence intervals.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

You have already seen how to create a t-interval to estimate the population mean. Now, you will use the bootstrap to estimate the population median.

You'll use the American Time Use Survey sleeping data again.

In [None]:
sleeping = pd.read_csv('../data/atus_sleeping.csv')

In [None]:
sleeping.head()

First, get a point estimate:

In [None]:
sleeping.minutes_spent_sleeping.median()

The point estimate is 550 minutes.

Now, the idea is to repeatedly resample with replacement from the observations. Each resample has the same size as the original sample, and you compute the median of each one.
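
As a small illustration of what a single resample looks like, the sketch below draws from a made-up five-value sample. The numbers and the seeded `default_rng` generator are assumptions for illustration and are not part of the sleeping data. Because the draw is with replacement, some observations can appear more than once and others not at all.

```python
import numpy as np

# Hypothetical toy sample, not taken from the ATUS data
toy_sample = np.array([420, 480, 510, 550, 600])

# Seeded generator so the illustration is reproducible
rng = np.random.default_rng(seed=0)

# One bootstrap resample: same size as the original, drawn with replacement
toy_resample = rng.choice(toy_sample, size=len(toy_sample), replace=True)

print(toy_resample)
print(np.median(toy_resample))
```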

In [None]:
point_estimate = np.median(sleeping.minutes_spent_sleeping)

# Number of resamples
num_resamples = 10000

# Confidence level
conf_level = 0.95

# Split the remaining area evenly between the left and right tails
margin = (1 - conf_level) / 2

values = sleeping.minutes_spent_sleeping.to_list()

resample_values = []

for i in range(num_resamples):
    # Resample with replacement
    resample = np.random.choice(values, len(values))

    # Compute the resample median and save the value
    resample_values.append(np.median(resample))

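# Find the quantiles of the resample medians
top_quantile = np.quantile(resample_values, q = 1 - margin)
bottom_quantile = np.quantile(resample_values, q = margin)

# Empirical bootstrap interval: reflect the quantiles around the point estimate,
# so the top quantile sets the lower bound and the bottom quantile the upper bound
print('lower bound: ', point_estimate - (top_quantile - point_estimate))
print('upper bound: ', point_estimate + (point_estimate - bottom_quantile))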
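
It may look odd that the lower bound is built from the top quantile. A short derivation, sketched here under the usual empirical bootstrap assumption, explains why. The symbols are introduced only for this sketch: $\hat{\theta}$ is the sample median, $\theta$ the population median, $\theta^*$ a resample median, $q^*_p$ the $p$-quantile of the resample medians, and $\alpha = 1 - \text{conf\_level}$. The assumption is that the distribution of $\theta^* - \hat{\theta}$ approximates that of $\hat{\theta} - \theta$.

```latex
\begin{align*}
P\left(q^*_{\alpha/2} - \hat{\theta} \;\le\; \hat{\theta} - \theta \;\le\; q^*_{1-\alpha/2} - \hat{\theta}\right) &\approx 1 - \alpha \\
\Longrightarrow\quad \hat{\theta} - \left(q^*_{1-\alpha/2} - \hat{\theta}\right) \;\le\; \theta \;&\le\; \hat{\theta} + \left(\hat{\theta} - q^*_{\alpha/2}\right) \\
\text{i.e.}\quad \theta &\in \left[\,2\hat{\theta} - q^*_{1-\alpha/2},\;\; 2\hat{\theta} - q^*_{\alpha/2}\,\right]
\end{align*}
```

The two endpoints correspond to the `lower bound` and `upper bound` expressions printed in the cell above.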

You can also look at the distribution of the resample values to get an idea of how much the resampled medians vary.

In [None]:
plt.hist(resample_values);

Rather than rewriting or copying and pasting the above code every time you want a bootstrap confidence interval, you can import a function that does the work for you.

In [None]:
from nssstats.bootstrap import bootstrap_ci

To use this function, pass in the values you want to resample along with the statistic you want to compute.

In [None]:
bootstrap_ci(sleeping['minutes_spent_sleeping'], statistic = np.median)
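
To make the idea of wrapping the resampling loop in a function concrete, here is a minimal sketch of what such a helper could look like. It is a hypothetical stand-in named `bootstrap_ci_sketch`, not the `nssstats` implementation. Its parameters (`conf_level`, `num_resamples`, `seed`), its return value, and the way it forwards extra keyword arguments to the statistic are all assumptions. The data in the usage lines is simulated.

```python
import numpy as np

def bootstrap_ci_sketch(values, statistic, conf_level=0.95,
                        num_resamples=10_000, seed=None, **kwargs):
    """Empirical bootstrap interval (hypothetical helper, not nssstats)."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)

    # Statistic on the original sample; extra keyword arguments are forwarded
    point_estimate = statistic(values, **kwargs)

    # Area left in each tail
    margin = (1 - conf_level) / 2

    # Recompute the statistic on resamples drawn with replacement
    resample_stats = [
        statistic(rng.choice(values, size=len(values), replace=True), **kwargs)
        for _ in range(num_resamples)
    ]

    bottom_quantile, top_quantile = np.quantile(resample_stats,
                                                [margin, 1 - margin])

    # Reflect the quantiles around the point estimate
    lower = 2 * point_estimate - top_quantile
    upper = 2 * point_estimate - bottom_quantile
    return lower, upper

# Usage on simulated data (not the ATUS sleeping data)
toy = np.random.default_rng(1).normal(loc=500, scale=60, size=200)
lower, upper = bootstrap_ci_sketch(toy, np.median, num_resamples=2000, seed=42)
print(lower, upper)
```

Fixing `seed` makes the interval reproducible. With `seed=None`, repeated calls give slightly different bounds, which is expected for a resampling method.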

## Bootstrap For Friday Crashes

Let's say that you want to ensure that enough emergency responders are scheduled for this upcoming Friday. You want a good idea of how many crashes to expect so that enough responders are on call, without scheduling far too many.

The file `friday_crashes.csv` contains the number of reported accidents for each Friday in 2018 in Davidson County.


In [None]:
friday_crashes = pd.read_csv('../data/friday_crashes.csv')

In [None]:
friday_crashes['Accident Number'].hist();

While this contains data for every single Friday in 2018, you can view it as a sample from the *population* of *all* Fridays. If you want to make inferences about all Fridays, you need to construct a confidence interval rather than just look at sample statistics.

Perhaps you would like a good estimate of the 80th percentile of the number of crashes. Scheduling for that value would provide enough responders to cover roughly 80% of Fridays.

You can use the bootstrap, via the function above, to construct a confidence interval for this percentile. This time, you need to specify that the statistic of interest is `np.quantile`, and you also need to pass in the `q = 0.8` argument to specify that you're interested in the 80th percentile.

In [None]:
bootstrap_ci(friday_crashes['Accident Number'].values, 
             statistic = np.quantile, q = 0.8)

If you wanted to be extra cautious, you could plan for the upper value of this interval.
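
As a point of reference, `np.quantile(x, q=0.8)` returns the value below which roughly 80% of the observations fall. If a resampling helper did not forward extra keyword arguments, one option would be to bind `q` ahead of time with `functools.partial`, as sketched below. The counts are made up for illustration and are not the Davidson County data.

```python
from functools import partial

import numpy as np

# Hypothetical Friday crash counts, for illustration only
toy_counts = np.array([88, 95, 102, 110, 97, 120, 105, 93, 115, 101])

# Direct call: the 80th percentile of the toy counts
direct = np.quantile(toy_counts, q=0.8)

# Equivalent single-argument statistic with q fixed in advance
percentile_80 = partial(np.quantile, q=0.8)
bound = percentile_80(toy_counts)

print(direct, bound)
```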

One of the advantages of using bootstrap resampling is that you can use it to compute a confidence interval for any statistic you like. Let's say you want a confidence interval for the standard deviation of the number of accidents.

In [None]:
bootstrap_ci(friday_crashes['Accident Number'].values, 
             statistic = np.std)
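
One detail worth keeping in mind: by default `np.std` divides by `n` (`ddof=0`), while the usual sample standard deviation divides by `n - 1` (`ddof=1`). The sketch below compares the two on made-up counts. Passing `ddof = 1` through `bootstrap_ci` in the same way `q = 0.8` was passed above is an assumption about that function, not something confirmed here.

```python
import numpy as np

# Hypothetical Friday crash counts, for illustration only
toy_counts = np.array([88, 95, 102, 110, 97, 120, 105, 93, 115, 101])
n = len(toy_counts)

# Default: divides by n
population_sd = np.std(toy_counts)

# Sample standard deviation: divides by n - 1
sample_sd = np.std(toy_counts, ddof=1)

print(population_sd, sample_sd)
```

The gap between the two shrinks as the sample grows, so it matters more for small samples than for large ones.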