# Question 0: The Bootstrap

The bootstrap is an approach to generating new samples from the "true" distribution. There's only one **real** way to get new samples from the actual true distribution, and that's to run your experiment again. The bootstrap offers a different idea: we assume that our existing sample is representative of the "true" distribution, so to approximate new samples from the "true" distribution, we can repeatedly draw new samples from our existing sample. This process of repeatedly drawing new samples is called "resampling" or "bootstrapping".

Technically, what the bootstrap assumes is that the relationship between the "true" distribution and our sample is approximately the same as the relationship between our sample and the bootstrap (re-)samples that we draw from it. We will explore this idea more in the next lecture. For now, however, let's just get comfortable with resampling.

## Problem 0.0
Below, a list of lifespans is provided, giving the length of the lives of 96 *C. elegans* (in days). Write a function called `resample_means` that generates a specified number of resamples of a given dataset and calculates the mean of each resample. As is standard for bootstrapping, each resampled dataset should be the same size as the original dataset. The function should return a list containing the mean value for each resampled dataset.

The most straightforward way to perform the actual resampling is with the `choices` function from the `random` module. (Try importing `random` and then typing `random.choices?` to see the documentation. In short, `choices` performs sampling *with replacement*, which is exactly what is needed here. Note that `random.sample` is the equivalent function for sampling *without replacement*, but you really don't want to use that here!)

**Tip:** use the parameter `k` of `random.choices` to control the data size returned (e.g. `resample = random.choices(data, k=10)` will return 10 elements sampled with replacement from `data`).
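
To see why sampling *with replacement* matters, here is a small sketch on a made-up dataset (`toy_data` is a hypothetical list used only for illustration, not part of the assignment). A full-size sample drawn *without* replacement is just a shuffled copy of the data, so its mean never changes, whereas a full-size sample drawn *with* replacement can repeat some values and omit others.

```python
import random

toy_data = [1, 2, 3, 4, 5]  # hypothetical toy dataset, not the lifespans

# Sampling WITH replacement: elements may repeat and some may be left out.
with_replacement = random.choices(toy_data, k=len(toy_data))

# Sampling WITHOUT replacement: a full-size sample is only a reordering.
without_replacement = random.sample(toy_data, k=len(toy_data))

print('with replacement:   ', with_replacement)
print('without replacement:', without_replacement)

# The mean of the reshuffled data is always the original mean (3.0 here),
# while the mean of the resample varies from run to run.
print('mean with replacement:   ', sum(with_replacement) / len(with_replacement))
print('mean without replacement:', sum(without_replacement) / len(without_replacement))
```

Run the cell a few times: only the "with replacement" mean should change between runs.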

In [None]:
import random

lifespans = [12.01,  11.75,   8.09,   6.21,   8.82,   8.35,   3.65,  10.12,
              9.63,   5.41,  13.08,  10.87,  13.31,  11.24,  10.87,   8.76,
             11.97,   9.26,  11.6 ,  12.22,  10.62,   9.63,   9.14,  10.62,
              3.38,   9.02,   9.99,   6.94,  11.95,  10.08,   8.11,   8.85,
             13.65,  13.89,   9.23,  14.87,   8.73,  11.58,  13.42,  11.46,
              9.1 ,   8.  ,  14.76,   9.71,  13.66,  14.39,   9.21,  10.79,
             12.88,   8.95,  10.92,  11.65,  10.79,  10.63,  10.07,   9.21,
              9.08,  11.65,   6.77,  12.38,   4.8 ,  11.29,  11.05,  10.68,
             11.66,  11.29,  10.06,  11.88,  11.4 ,  10.3 ,  11.15,  10.3 ,
             11.52,  13.6 ,  11.16,   9.57,   9.69,   8.83,  10.91,   8.23,
             12.75,  15.47,  10.79,  12.48,   9.08,  11.04,  11.16,  10.78,
              6.76,  12.97,   9.19,   9.81,  13.82,  14.33,  10.3 ,  11.28]

def mean(v):
    return sum(v)/len(v)

def resample_means(data, n_resamples):
    means = []
    # YOUR ANSWER HERE
    return means

print('mean lifespan:', mean(lifespans))
print('min and max lifespan:', min(lifespans), max(lifespans))
resampled = resample_means(lifespans, 10000)
print('min and max mean lifespan from resamples:', min(resampled), max(resampled))

In [None]:
assert len(resampled) == 10000
new_samples = resample_means([1,1,1,2], 1000)
for sample in new_samples:
    assert sample in {1, 1.25, 1.5, 1.75, 2}
assert 275 < new_samples.count(1.0) < 357 # should get all 1's about 31.6% of the time

Note from the output above that while there's a wide range of lifespans in the population, the range of mean lifespans from the resamples is comparatively narrow. 
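
As an aside, the 31.6% figure in the test cell above can be checked by hand: each draw from `[1,1,1,2]` is a 1 with probability 3/4, so all four draws are 1 with probability (3/4)^4. The sketch below works through that arithmetic and, treating the number of all-1 resamples as a binomial count, estimates how much that count should vary. The variable names are chosen here for illustration only.

```python
import math

n_resamples = 1000          # number of resamples drawn in the test cell
p_all_ones = (3 / 4) ** 4   # chance that all four draws from [1, 1, 1, 2] are a 1

expected_count = n_resamples * p_all_ones
# Standard deviation of a binomial count: sqrt(n * p * (1 - p))
count_sd = math.sqrt(n_resamples * p_all_ones * (1 - p_all_ones))

print('probability of an all-1 resample:', p_all_ones)
print('expected count out of 1000:', expected_count)
print('standard deviation of that count:', round(count_sd, 1))
```

The bounds in the assertion (275 and 357) sit a little under three standard deviations either side of the expected count, so a correct solution should pass almost every time but may, on rare occasions, fail by chance alone. If that happens, simply re-run the cells.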

Run the code below to use `plt.hist` (from `matplotlib.pyplot`) to plot histograms of the lifespans and the resampled means. Note several new techniques:
 1. `plt.style.use` allows you to choose among [several style options](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html). 'ggplot' is a decent one.
 1. you can use `bins='auto'` to have the histogram function guess a good number of bins from the data.
 1. `plt.xlim(min, max)` sets the range of the x-values in the plot. We use this below to put both plots on the same x scale.
 1. `plt.axvline(x)` will plot a line at the given x position. An optional `color` parameter can be used.
 1. To create additional plots in one cell without piling the plots on top of one another, you can use `plt.figure()`. Try removing this call and re-run the cell...
 
From this, you should see how narrow the distribution of resampled means is compared with the distribution of individual lifespans! By chance alone, it's pretty rare to get a resample whose mean is more than half a day or so away from the mean of the original sample...

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')

plt.hist(lifespans, bins='auto')
plt.xlim(min(lifespans)-1, max(lifespans)+1)
plt.axvline(mean(lifespans), color='blue')
plt.title('Distribution of lifespans')
plt.figure()

plt.xlim(min(lifespans)-1, max(lifespans)+1)
plt.hist(resampled, bins='auto')
plt.axvline(mean(lifespans), color='blue')
plt.title('Distribution of resampled means')

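pass
# "pass" does nothing, but because it is now the last line of the cell, Jupyter has no final
# expression value to display. This suppresses the matplotlib barf (the text form of the title
# object) that would otherwise be printed. Try commenting it out to see...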

## Problem 0.1
The mean lifespan from our sample is 10.5 days. Let's calculate the percentage of the time that one of the resampled datasets has a mean lifespan less than 10 days.

To do so, write a more general function that determines how often a list contains a value less than or equal to `below` or greater than or equal to `above`. Note that sometimes (as in the case above) we only want to use one of these thresholds: for example, we might only count values at or below the `below` threshold and not bother with a high threshold. To do this, we will provide the function with *default arguments* that allow the user to leave out one of the thresholds.

The following example demonstrates default arguments, which you have seen before, but which we haven't used in our own functions yet. Before you run the code below, try to guess what the outputs will be...

In [None]:
def test_defaults(a=1, b='b'):
    print('a is', a)
    print('b is', b)
    print()
    
print('no arguments:')
test_defaults()

print('a=5:')
test_defaults(a=5)

print('b=5:')
test_defaults(b=5)

print('out of order:')
test_defaults(b=5, a='a')

print('no names:')
test_defaults(4, 'c')

print('only one:')
test_defaults(4)

Now, we need to choose a default value for the `below` and `above` thresholds that will let us decide whether or not to ignore those thresholds. The classic choice in Python is `None`. We can detect whether an argument is `None` as follows:
```python
def test_none(x=None):
    if x is None:
        print('no x specified!')
    else:
        print(x)

def test_not_none(x=None):
    if x is not None:
        print(x)
    # don't print anything if x is None...
```
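
One reason to test with `is None` rather than relying on truthiness (e.g. `if not x:`) is that a perfectly valid threshold such as `0` is "falsy" in Python and would be mistaken for a missing argument. The sketch below illustrates the difference. Both functions are hypothetical helpers written for this illustration and are not part of the assignment.

```python
def describe_threshold(threshold=None):
    # Correct check: only a truly missing argument counts as "no threshold".
    if threshold is None:
        return 'no threshold'
    return 'threshold is ' + str(threshold)

def describe_threshold_buggy(threshold=None):
    # Buggy check: 0 (and 0.0) are falsy, so they are treated as missing.
    if not threshold:
        return 'no threshold'
    return 'threshold is ' + str(threshold)

print(describe_threshold())          # argument omitted
print(describe_threshold(0))         # 0 is recognized as a real threshold
print(describe_threshold_buggy(0))   # 0 is wrongly treated as missing
```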

Using similar logic, finish defining the `fraction_extreme` function to determine the *fraction* of values in a list that are less than or equal to `below` (if `below` is not `None`) or are greater than or equal to `above` (if `above` is not `None`). 

In [None]:
def fraction_extreme(values, below=None, above=None):
    total = 0
    # YOUR ANSWER HERE
    return total

print(fraction_extreme(resampled, below=10))
print(fraction_extreme(resampled, above=11))
print(fraction_extreme(resampled, below=10, above=11))

In [None]:
both = fraction_extreme(resampled, below=10) + fraction_extreme(resampled, above=11)
assert abs(both - fraction_extreme(resampled, below=10, above=11)) < 0.0001
assert fraction_extreme([1,2,3,4,5], below=2) == 2/5

So, less than 2% of the time do we get a resample that has a mean of 10 days or less. This suggests that we should be surprised if we ran another experiment and found such a short mean lifespan in those animals. (Either the genotype in the other experiment is short-lived, or the incubator temperature was different, or the media was different, or the person scoring the worms as alive or dead wasn't consistent, and on and on. We could reject the null hypothesis that the two samples come from the same distribution, but we don't know which alternative hypothesis explains the difference!)

In the next questions, we will develop these methods into specific null hypothesis significance tests.