# Foundations of inference - confidence intervals with bootstrapping

In New York City on October 23rd, 2014, a doctor who had recently been treating Ebola patients in Guinea went to the hospital with a slight fever and was subsequently diagnosed with Ebola. Soon thereafter, an NBC 4 New York/The Wall Street Journal/Marist Poll asked New Yorkers whether they favored a "mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient". The poll collected responses from 1,042 New York adults between October 26th and 28th, 2014.

In this lab you'll be using the data from the poll to investigate how to bootstrap from a sample in order to construct confidence intervals.

# Getting started

## Load packages

For this lab we will need the following packages.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```

## Creating a reproducible lab report

You will be using Jupyter notebook to create reproducible lab reports. Download the lab report template and load the template into Jupyter notebook. These templates can be used for each of the labs.

## The data

The data we are working with is in the `ebola_survey.csv` file. Download the file and load it into **Python** as a data frame named `ebola_survey`. The only variable used is `quarantine`, with values of either `favor` or `against`.
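
A minimal sketch of the loading step, assuming the downloaded file sits in the notebook's working directory. So that the sketch runs anywhere, it reads a four-row stand-in from memory; the stand-in text and the name `ebola_survey_demo` are hypothetical, not the poll data.

```python
import io

import pandas as pd

# Stand-in for the real file so this sketch runs anywhere. With the
# downloaded data you would call pd.read_csv("ebola_survey.csv") instead.
csv_text = "quarantine\nfavor\nfavor\nagainst\nfavor\n"
ebola_survey_demo = pd.read_csv(io.StringIO(csv_text))

# Check the number of rows and columns, and the distinct responses.
print(ebola_survey_demo.shape)
print(ebola_survey_demo["quarantine"].unique())
```

Checking the shape and the distinct values right after loading is a quick way to confirm the file was read as a single `quarantine` column.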

<div class="alert alert-block alert-info">
<b>Exercise 1:</b> Determine the observed proportion of New York adults who said they were in favor of a mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient. <b>Hint:</b> You can use the <code>value_counts</code> method with <code>normalize=True</code> to find proportions in a data frame.</div>

The observed proportion represents the proportion of the 1,042 New Yorkers who were polled. This may or may not reflect how New Yorkers as a whole feel about a mandatory 21-day quarantine. It is unlikely that the observed proportion is equal to the proportion of the population. It is likely, however, that the observed proportion is near the population proportion.

# Bootstrapping samples

As in Lab 4, you will repeatedly resample from the sample data frame to build a simulated distribution. This time, however, you will use bootstrapping instead of the randomization technique. When bootstrapping, you treat your sample as an estimate of the population and draw new samples from it. In this case, you have a sample of 1,042 data points. You treat it as an infinitely large population with the same proportion of `favor` and `against` responses, then sample from that estimated population. In practice, this means sampling from the original 1,042 responses with replacement. The code to perform bootstrapping is therefore almost identical to the code used for the randomization technique; the only difference is that you sample with replacement (in the randomization technique you sampled without replacement). Let's see how that looks.

```python
simulated_ebola_survey = pd.DataFrame().assign(quarantine=ebola_survey['quarantine'].sample(frac=1, replace=True, ignore_index=True))
simulated_ebola_survey.value_counts(normalize=True)
```

Let's compare this code to the code from Lab 4, where you used randomization.

```python
# Code from Lab 4 for the randomization technique
sim_cancer_in_dogs = pd.DataFrame().assign(order=cancer_in_dogs['order'], response=cancer_in_dogs['response'].sample(frac=1, ignore_index=True))
```

Both snippets create a new data frame and `assign` its variables. The randomization code has two variables, whereas the bootstrapping code has only one. Comparing the variables that are being sampled, however, you can see that the only difference is that the bootstrapping code passes the argument `replace=True` to the `sample` method. By default, `sample` uses `replace=False`, which is why the randomization code does not need to include that argument.
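
The effect of `replace=True` is easier to see on a small example. The sketch below uses a hypothetical 10-response series (`toy` is not the poll data) and a fixed `random_state` so the result is repeatable.

```python
import pandas as pd

# Hypothetical toy sample: 7 "favor" and 3 "against" responses.
toy = pd.Series(["favor"] * 7 + ["against"] * 3, name="quarantine")

# Without replacement (the default), sampling every row is just a shuffle,
# so the counts of each response never change.
shuffled = toy.sample(frac=1, ignore_index=True, random_state=1)

# With replacement, each draw is independent of the others, so a response
# can be picked several times or not at all and the counts can vary.
resampled = toy.sample(frac=1, replace=True, ignore_index=True, random_state=1)

print(shuffled.value_counts(normalize=True))
print(resampled.value_counts(normalize=True))
```

The shuffled series always reproduces the original proportions exactly, which is why sampling without replacement cannot be used to study how a proportion varies from sample to sample.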

<div class="alert alert-block alert-info">
<b>Exercise 2:</b> Simulate 3 bootstrap samples from the <code>ebola_survey</code> data frame. How does each of your samples differ from the observed proportion?</div>

<div class="alert alert-block alert-info">
<b>Exercise 3:</b> Create a bootstrapped distribution of 10,000 simulated samples in a variable <code>bootstrap_dist</code>. Display your findings in a histogram.</div>

<div class="alert alert-block alert-info">
<b>Exercise 4:</b> Describe your histogram. Why is the histogram shaped the way it is? What do the data in the histogram represent? Why is the center of the histogram where it is?</div>

# Constructing a confidence interval

The goal is to estimate the proportion of New Yorkers who are in favor of a mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient. Once again, it is unlikely that the sample proportion, $\hat{p}$, is equal to the population proportion, $p$. It is likely that the sample proportion is near the population proportion, though. Ultimately, the exact value of the parameter $p$ is unknown. You can use the bootstrapped distribution to construct a range of plausible values for $p$. If you find the 2.5th and 97.5th percentiles of the distribution, then 95% of the bootstrap resampled values will be captured between them. This range of values is the bootstrap percentile confidence interval.
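
The idea can be sketched on made-up data. Everything below is an assumption for illustration: `toy_survey` is a hypothetical 200-row sample (not the poll data), only 1,000 resamples are drawn, and the percentiles come from the **pandas** `quantile` method on a `Series` rather than from a sorted list.

```python
import pandas as pd

# Hypothetical stand-in sample (not the poll data): 120 "favor", 80 "against".
toy_survey = pd.DataFrame({"quarantine": ["favor"] * 120 + ["against"] * 80})

# Build a bootstrap distribution of the proportion in favor.
toy_bootstrap_dist = []
for i in range(1000):
    resample = toy_survey["quarantine"].sample(frac=1, replace=True, random_state=i)
    toy_bootstrap_dist.append((resample == "favor").mean())

# Percentile interval: the 2.5th and 97.5th percentiles of the distribution.
toy_lower, toy_upper = pd.Series(toy_bootstrap_dist).quantile([0.025, 0.975])

# Share of resampled proportions that fall between the two bounds.
inside = sum(toy_lower <= p <= toy_upper for p in toy_bootstrap_dist) / len(toy_bootstrap_dist)
print(toy_lower, toy_upper, inside)
```

The printed share should be about 0.95, or slightly more when several resampled proportions tie with a bound.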

Let's construct the 95% confidence interval. First we need to find the lower and upper bounds, which are the 2.5th and 97.5th percentiles. Begin by sorting the bootstrapped distribution.

```python
bootstrap_dist.sort()
```

The `.sort()` method sorts the list `bootstrap_dist` in place, in ascending order. Next, you need to find the 2.5th and 97.5th percentiles. Since `bootstrap_dist` is a list and not a data frame, we cannot use the **pandas** tools on it directly. You could convert `bootstrap_dist` into a data frame, but finding the percentiles in a sorted list is simple enough.

```python
lower_bound = bootstrap_dist[int(0.025*len(bootstrap_dist))-1]
```

Since `bootstrap_dist` is in ascending order, the 2.5th percentile is 2.5% of the way through the list. Thus, `0.025*len(bootstrap_dist)` gives roughly the position of the 2.5th percentile. Depending on the number of simulations, this product may not be a whole number. The `int()` function truncates it, dropping the fractional part rather than rounding to the nearest integer. Lastly, subtract 1 from this value since indexes begin at 0 instead of 1. The 97.5th percentile, stored in `upper_bound`, is found the same way with `0.975` in place of `0.025`.
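
A short sketch of the index arithmetic, using a hypothetical sorted list (`toy_sorted`) whose values are simply 1 through 990, so the selected value shows which position was picked. It also shows that `int()` drops the fractional part instead of rounding.

```python
# int() truncates toward zero, while round() goes to the nearest integer.
print(int(24.75))    # 24
print(round(24.75))  # 25

# Hypothetical sorted list standing in for a bootstrap distribution.
# The values 1..990 make it easy to see which position is selected.
toy_sorted = list(range(1, 991))

# 2.5% of 990 is about 24.75, which is not a whole number.
position = 0.025 * len(toy_sorted)

# Truncate to 24, then subtract 1 because list indexes start at 0.
toy_lower_bound = toy_sorted[int(position) - 1]
print(toy_lower_bound)  # 24, the 24th smallest value
```

With 10,000 simulations the product is a whole number, so truncation only matters for simulation counts that are not multiples of 40.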

<div class="alert alert-block alert-info">
<b>Exercise 5:</b> What are the bounds of the 95% confidence interval?</div>

You can graph these bounds in a histogram using the same fancy code used in Lab 4.

```python
ax = sns.histplot(data=bootstrap_dist, bins=20)
ax.axvline(x = lower_bound, ymin = 0, ymax = 1, color = "black", linestyle = "dashed")
ax.axvline(x = upper_bound, ymin = 0, ymax = 1, color = "black", linestyle = "dashed")
```

<div class="alert alert-block alert-info">
<b>Exercise 6:</b> Run a new bootstrap distribution, but this time with 1,000 samples. Find the 95% confidence interval and plot the bounds on a histogram. How does this new simulation differ from the one previously created? Why is there a difference? Which would you trust more? Why?</div>

<div class="alert alert-block alert-info">
<b>Exercise 7:</b> You have found the bounds of a 95% confidence interval. Interpret what this confidence interval means in the context of how New York adults felt regarding mandatory 21-day quarantine for anyone who has come in contact with an Ebola patient.</div>

---

# Additional Practice

<div class="alert alert-block alert-info">
<b>Exercise 8:</b> Construct a 95% confidence interval for the <code>nuclear_survey</code> dataset. A simple random sample of 1,028 US adults in March 2013 found that 56% support nuclear arms reduction.</div>