# Question 1: One-Sample Tests

Using the logic of "resampling under the null hypothesis", we will use our resampling tools from the previous question to compute p-values for some simple tests.

## Problem 1.0
Below, paste in your answers for the `resample_means` and `fraction_extreme` functions from the last question. 

Also, recall that the process of resampling data under the null hypothesis means *forcing* your dataset to comply with the null hypothesis, usually by shifting the data around to have zero mean (or a zero value for some other statistic). To help with this, write a function called `shift` that takes a list of values and a `shift_factor` and returns a new list where `shift_factor` has been subtracted from each of the original values. (Style points for using the "list comprehension" syntax to create the new list from the old one, e.g. `[x**2 for x in y]`.)

In [None]:
import random

def mean(v):
    return sum(v)/len(v)

def resample_means(data, n_resamples):
    # YOUR ANSWER HERE

def fraction_extreme(values, below=None, above=None):
    # YOUR ANSWER HERE

def shift(values, shift_factor):
    # YOUR ANSWER HERE

data = [1, 2, 3, 4]
m = mean(data)
shifted = shift(data, m)
print(m, mean(shifted))

In [None]:
assert mean(shifted) == 0
assert shifted == [-1.5, -0.5, 0.5, 1.5]

Let's first examine the experiment discussed in class to determine whether rats have a left-hand bias. The experimental setup is as follows. We put each of a number of rats in a cage, and let it press on a lever 20 times. Each time it presses on the lever with its left paw, we add -1 to a running total, and each time it presses with its right paw, we add +1 to the total. So a rat with a 100% left-hand bias will have a score of -20, and a rat with no bias will have a score that averages around 0.

The `rat_biases` list below has scores for 24 rats. Its mean is -2, indicating that the average rat pressed the lever two more times with its left paw than with its right (i.e. 11 left-paw presses and 9 right-paw presses). Is this indicative of something more than the expected amount of random variability in a small sample?

## Problem 1.1
We will first answer this question by **simulation**. Write a function, `simulate_rat_trial`, that will simulate 20 presses on a bar for each of 24 rats, assuming no left/right bias, and return the mean score. (To simulate 20 presses, it might be simplest to use `random.choices` to draw from `[-1, 1]` 20 times, and then use `sum` to calculate the total bias score from those 20 presses.)

Next write a function, `simulate_rat_trials`, that will run `simulate_rat_trial` a specified number of times and return a list of the mean bias scores for each trial.
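
As a minimal sketch of the press-level simulation described above (not the full `simulate_rat_trial` solution), the snippet below scores a handful of unbiased rats. The helper name `simulate_one_rat` and its `n_presses` parameter are hypothetical, introduced only for illustration.

```python
import random

def simulate_one_rat(n_presses=20):
    """Score one unbiased rat: each press is -1 (left) or +1 (right) with equal odds."""
    presses = random.choices([-1, 1], k=n_presses)
    return sum(presses)

# Scores for 24 independent, unbiased rats.
scores = [simulate_one_rat() for _ in range(24)]
print(scores)
print(sum(scores) / len(scores))
```

Because each press contributes either -1 or +1, a 20-press score is always an even integer between -20 and 20.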

In [None]:
rat_biases = [-2,  -2,   8,   6,  -2, -12,  -4,  -2,  -6,   0,  -4,  -8,
              -4,   2,   2,  -2,   2,  -4,  -4,  -4,   2,  -2,  -8,   0]
print('mean bias', mean(rat_biases))

def simulate_rat_trial():
    # YOUR ANSWER HERE

print("trial 1:", simulate_rat_trial())
print("trial 2:", simulate_rat_trial())

def simulate_rat_trials(n_trials):
    # YOUR ANSWER HERE

simulated_trials = simulate_rat_trials(5000)
print('range of simulated trials:', min(simulated_trials), max(simulated_trials))

In [None]:
assert len(simulated_trials) == 5000
assert -0.05 < mean(simulated_trials) < 0.05
for s in simulated_trials:
    assert (s * 24) == int(s * 24) # each mean should be a sum of integers divided by 24...
assert min(simulated_trials) < -2.5
assert max(simulated_trials) > 2.5

So, the simulation shows that even in a no-bias condition, it's possible for random samples of 24 rats to have a wide range of possible mean bias scores! In 5000 repeat trials, you should have gotten some with bias scores as low as -3 and as high as 3 (or so). So we know that even if rats use each paw equally on average, it is still *possible* to get a score of -2, such as we observed in the actual rat data. But how *likely* is that? Recall that "the probability of observing a result as or more extreme by chance alone" is the definition of the p-value, so let's calculate a p-value for the observed bias score based on these simulation results.

## Problem 1.2
Use the function `fraction_extreme` to calculate the p-value of obtaining a bias score as far or farther from zero than the one we actually observed. Store this in a variable `p_value`.

Next, plot the null distribution: the histogram of scores from your simulated trials where there was no bias. Last, plot a vertical line at the position of the actual observed bias score.
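
To make "as far or farther from zero" concrete, here is a small sketch of a two-tailed fraction computed on a toy list. The helper `two_tailed_fraction` is hypothetical and is not the `fraction_extreme` function from the last question; it assumes that values exactly equal to the observed score count as extreme.

```python
def two_tailed_fraction(null_values, observed):
    """Fraction of null_values at least as far from zero as observed."""
    cutoff = abs(observed)
    n_extreme = sum(1 for v in null_values if abs(v) >= cutoff)
    return n_extreme / len(null_values)

# Toy null distribution, chosen by hand for illustration only.
toy_null = [-3, -2, -1, 0, 0, 1, 2, 3]
toy_p = two_tailed_fraction(toy_null, -2)
print(toy_p)  # 0.5: four of the eight values (-3, -2, 2, 3) are at least 2 from zero
```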


In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')

# YOUR ANSWER HERE
print('p =', p_value)

In [None]:
assert 0.01 < p_value < 0.05

So, our simulation shows that while it's *possible* to get a bias score as extreme as -2 or more so, it happens less than 5% of the time. (If you run the simulation repeatedly, you should find that scores at least that extreme occur around 3% of the time.) So we would say that the result is "significantly unlikely to be due to chance alone", or in shorthand, "statistically significant" (using the p=0.05 threshold favored by biologists).

Now, what about bootstrapping? How does that compare to running a simulation? Under the simulation, we made new data that complied with the null hypothesis (of no left-right bias) by construction. On the other hand, our existing dataset has a slight (but apparently significant) bias. How can we resample these data to generate a null distribution?

As discussed in class, we simply shift the location of the data points so that the null hypothesis is true, and then repeatedly resample to generate a null distribution.  
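
The shift-then-resample idea can be sketched on a toy dataset as follows. This is an illustration under assumptions, not the assignment's reference solution: the names `center` and `resampled_mean` are hypothetical, and resampling is assumed to mean drawing `len(values)` values with replacement via `random.choices`.

```python
import random

def center(values):
    """Return a copy of values shifted to have zero mean."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def resampled_mean(values):
    """Mean of one resample drawn with replacement, same size as values."""
    resample = random.choices(values, k=len(values))
    return sum(resample) / len(resample)

toy_data = [-4, -2, -2, 0]       # toy scores with mean -2
null_data = center(toy_data)     # [-2.0, 0.0, 0.0, 2.0], which has mean 0
null_means = [resampled_mean(null_data) for _ in range(1000)]
print(min(null_means), max(null_means))
```

The resampled means scatter around zero, so they can play the same role as the simulated no-bias trials did earlier.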

## Problem 1.3
Write a function `bootstrap_mean`, that takes a data set and shifts it to have zero mean (using your `shift` function), and then uses your `resample_means` function to resample the data a given number of times and return the mean of each resampled dataset. Last, calculate a p-value as `bootstrap_p_value`, and plot the null distribution and a vertical line at the actual bias score of -2.

In [None]:
def bootstrap_mean(data, n_resamples):
    # YOUR ANSWER HERE

bootstrapped_trials = bootstrap_mean(rat_biases, 5000)
print('range of bootstrapped trials:', min(bootstrapped_trials), max(bootstrapped_trials))

# Now calculate p-value and plot histogram of the null distribution
# YOUR ANSWER HERE
print('p =', bootstrap_p_value)

In [None]:
assert bootstrap_mean([1,1,1,1], 200) == [0]*200
assert len(bootstrapped_trials) == 5000
assert -0.05 < mean(bootstrapped_trials) < 0.05
for b in bootstrapped_trials:
    assert (b * 24) == int(b * 24) # each mean should be a sum of integers divided by 24...
assert min(bootstrapped_trials) < -2.5
assert max(bootstrapped_trials) > 2.5
assert 0.01 < bootstrap_p_value < 0.05

Wow! The bootstrap procedure produces results that are extremely similar to the exact simulation. (Note that the smaller your original dataset is, the worse the bootstrap will perform compared to simulating new data or compared to using parametric statistics.) Even better, though, is that the bootstrap can be used in situations where it's not obvious how to write a simulation.

For an example of this, consider the question posed in class about whether attending a tutorial session is guaranteed to boost a student's homework score. (Imagine that we graded the homework once before and once after the tutorial session, so we could measure the score increase.)

Let's assume that of 30 students who attended the tutorial session, each of them saw an improvement in score of at least one point. That is, the minimum *observed* improvement was > 0.

But how likely would it be to get a sample of 30 students who all had improvements > 0 in the case that there was a theoretical possibility of having no improvement? (I.e. under the null hypothesis that the minimum possible improvement is zero.)

How would you simulate this situation? You would have to have some idea of how the learning process works in order to model the distribution of score improvements under the null hypothesis! This would not be particularly simple to do...

Bootstrapping allows us to short-circuit this whole issue. After all, we have a perfectly good example of the distribution of improvements: those that we observed. The only problem is that the minimum improvement is 1 point, while the null hypothesis requires that the minimum be 0. So to generate new data under the null hypothesis, all we have to do is shift the observed distribution to have a minimum of 0 and resample from that.
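
The same recipe with the minimum in place of the mean might look like the sketch below, again on toy data. The names `shift_to_zero_minimum` and `resampled_minimum` are hypothetical, and resampling is assumed to draw with replacement via `random.choices`.

```python
import random

def shift_to_zero_minimum(values):
    """Return a copy of values shifted so that the smallest value is 0."""
    lowest = min(values)
    return [v - lowest for v in values]

def resampled_minimum(values):
    """Minimum of one resample drawn with replacement, same size as values."""
    return min(random.choices(values, k=len(values)))

toy_improvements = [3, 1, 4, 1, 5]
null_improvements = shift_to_zero_minimum(toy_improvements)  # [2, 0, 3, 0, 4]
null_minima = [resampled_minimum(null_improvements) for _ in range(1000)]
print(sorted(set(null_minima)))
```

Every resampled minimum is one of the shifted values, and any resample that happens to miss both zeros has a minimum greater than 0.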

## Problem 1.4
A list of 30 score improvements is provided below. Note that most students improved a lot, but two stragglers had an improvement of only one point. Write a function `bootstrap_minimum` that takes a list of improvements, shifts it so that its *minimum* is zero, and then performs a specified number of resamplings. For each resampled dataset, calculate its *minimum*, and return a list of all the calculated minima.

Then use the resampled data to calculate the probability of observing a minimum improvement of 1 or larger across 30 students, under the null hypothesis that the theoretical minimum improvement is zero. (I.e. calculate the p-value.) Store this as `improvement_p_value`. Note that this will be a one-tailed p-value: the fraction of scores that go up by one point or more. (As the tutorial session never seems to make homework scores worse, we needn't bother examining the case that scores go down by one point or more...)

Finally, it would be good to visualize the null distribution of improvement scores. Unfortunately, in this case the distribution is quite skewed (you'll see). This makes it hard to see the true shape of the distribution from a histogram, which would be dominated by one huge bin.

The good news is that we know that each minimum must be an integer >=0, so we can just count how many 0s we see, how many 1s, and so forth, up to the maximum value seen in the list of `bootstrapped_minima`. To do this, write a function called `bincount` that will count the number of occurrences of each integer in a list. This function should create a list of counters: one for each integer from zero to the maximum value in the input data. The function should then step through the input values in a loop and, for each value, increment the corresponding counter.

For example, consider the input `[1,2,3,1]`. In this list, there are no 0s, two 1s, one 2 and one 3. Thus, the bincount should be `[0, 2, 1, 1]`. If the input is `[0,2,0,0]`, then the bincount should be `[3, 0, 1]` (i.e. three 0s, no 1s, and one 2).
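
One way to double-check a hand-written `bincount` is to compare it against the standard library's `collections.Counter`. The sketch below is a cross-check only, not the loop-based implementation the problem asks for, and the name `counts_via_counter` is hypothetical.

```python
from collections import Counter

def counts_via_counter(values):
    """Count occurrences of each integer from 0 up to max(values)."""
    tally = Counter(values)
    # A Counter returns 0 for keys it has never seen, so gaps are filled in automatically.
    return [tally[i] for i in range(max(values) + 1)]

print(counts_via_counter([1, 2, 3, 1]))  # [0, 2, 1, 1]
print(counts_via_counter([0, 2, 0, 0]))  # [3, 0, 1]
```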


In [None]:
improvements = [17,  8,  5,  1,  9, 11,  8,  1, 11,  8,  6,  6,  9, 11, 10,
                12,  5, 18, 19,  4,  5,  9, 15, 11,  9, 17, 16, 10, 16, 11]

print('mean improvement', mean(improvements))

def bootstrap_minimum(data, n_resamples):
    # YOUR ANSWER HERE

bootstrapped_minima = bootstrap_minimum(improvements, 10000)
print('range of bootstrapped minima:', min(bootstrapped_minima), max(bootstrapped_minima))

# Now calculate p-value 
# YOUR ANSWER HERE
print('p =', improvement_p_value)
print() # print a blank line

def bincount(data):
    num_bins = max(data) + 1
    # If the maximum value is 3, we need to keep a count of the number of 0s, 1s, 2s, and 3s.
    # Thus, we need 4 total bins. Every time we see a 0 in the data, we should increment the
    # counter in the 0th position of the counts list. Every time we see a 1, we should increment
    # the counter in the 1st position, and so forth...
    counters = [0] * num_bins
    # YOUR ANSWER HERE
    return counters

print('bincount tests:')
print(bincount([1,1,2,3])) # should be [0, 2, 1, 1]
print(bincount([0,0,0,2])) # should be [3, 0, 1]
print() # print a blank line

print('bootstrapped minima counts:')
minima_counts = bincount(bootstrapped_minima)
print(minima_counts)

In [None]:
assert bootstrap_minimum([1,1,1,1], 200) == [0]*200
assert len(bootstrapped_minima) == 10000
assert min(bootstrapped_minima) == 0
shifted_improvements = set(shift(improvements, 1))
for b in bootstrapped_minima:
    assert b in shifted_improvements
assert max(bootstrapped_minima) < 10
assert 0.12 < improvement_p_value < 0.14
counts = bincount([0]*50 + [999])
assert len(counts) == 1000
assert counts[0] == 50
assert counts[999] == 1
assert sum(counts) == 51
assert abs(improvement_p_value - sum(minima_counts[1:])/10000) < 0.0000001

To recap: we have resampled data under the scenario that score improvements are distributed basically as we observed, except that the minimum value could be zero. In this case, it is still reasonably common for a set of 30 students to all improve by at least one point, just by chance alone. (In particular, this happens about 13% of the time.)

So that means that observing data such as we did, where two students improved by one point and the rest by more, is not really a strong indication that the minimum theoretical improvement is greater than zero.

In other words, given these data, we cannot say that the null hypothesis (that the tutoring sessions don't raise everyone's scores) is particularly unlikely. 

## Problem 1.5
What if we observed slightly different data? In particular, what if **more** students saw a one-point improvement and fewer saw larger improvements? The minimum improvement observed would still be one, and the mean would be decreased somewhat. It seems like in this scenario it could only be *less likely* that the tutorial session was helping everyone improve, compared to the previous scenario where more students improved by more than 1 point, right?

Test this intuition out on the `improvements2` dataset, where 5 students saw an improvement of 1 point. Calculate a p-value as `improvement2_p_value`.

In [None]:
improvements2 = [1,  8,  5,  1,  9, 11,  8,  1, 11,  8,  6,  6,  9, 11, 10,
                12,  5, 18, 19,  4,  5,  9, 1, 11,  9, 1, 16, 10, 16, 11]


print('mean improvement 2', mean(improvements2))

bootstrapped_minima2 = bootstrap_minimum(improvements2, 10000)
print('range of bootstrapped minima 2:', min(bootstrapped_minima2), max(bootstrapped_minima2))

# Now calculate p-value
# YOUR ANSWER HERE
print('p =', improvement2_p_value)
minima2_counts = bincount(bootstrapped_minima2)
print(minima2_counts)

In [None]:
assert 0.002 < improvement2_p_value < 0.006
assert abs(improvement2_p_value - sum(minima2_counts[1:])/10000) < 0.0000001

***What what what?***

How could it be that with a set of observed scores where *more* students had a *smaller* improvement, we are nevertheless more certain that tutoring has a significant effect on the minimum level of improvement?

The intuitive answer is that if you see a lot of students with a 1-point improvement but none with a 0-point improvement, you might imagine that 1 point is a hard floor. (Imagine observing 1000 students with a 1-point improvement and none with a 0-point improvement. It would be hard to claim that these data support the hypothesis that it's possible to not improve your score at all.) On the other hand, if you see just a few instances of a 1-point improvement, it's much harder to be certain whether that is the real lower bound, or if an even lower bound might be possible.

More formally, what the bootstrapping shows is that when there are a lot of values at the true minimum, then most of the time a new sample *will contain* that true minimum value. So when we shifted the data in the second example to set the true minimum to zero, over 99% of the bootstrap samples had a zero in them. 

In other words, this showed that in the second scenario, the minimum of any given sample was generally representative of the minimum of the true distribution. As such, we could be confident that a 1-point improvement was the true minimum.

On the other hand, the first scenario showed that if there are only a few values at the true minimum, then much of the time a given sample will *not* contain the true minimum. Thus, in the first case, it's harder to know whether the true minimum is 1 or 0.
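
Both bootstrap p-values can be cross-checked with a short calculation. If k of the n observed values sit at the minimum, each draw in a resample misses all of them with probability (n - k)/n, and the n draws are independent, so a whole resample misses the minimum with probability ((n - k)/n)^n. The function name `prob_sample_misses_minimum` below is hypothetical.

```python
def prob_sample_misses_minimum(n, k):
    """Chance that n draws with replacement from n values avoid all k copies of the minimum."""
    return ((n - k) / n) ** n

# First scenario: 2 of 30 students at the minimum. Second scenario: 5 of 30.
p_two_at_minimum = prob_sample_misses_minimum(30, 2)
p_five_at_minimum = prob_sample_misses_minimum(30, 5)
print(round(p_two_at_minimum, 3))   # roughly 0.126
print(round(p_five_at_minimum, 4))  # roughly 0.0042
```

These values fall inside the ranges that the bootstrap estimates for the two scenarios are expected to land in (about 13%, and under 1%).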

Note especially that we cannot conclude from this that the true minimum is **not** 1. We just don't have much good evidence either way! We cannot reject the null hypothesis that just by chance our sample did not contain the true population minimum. But equally we cannot reject the alternative hypothesis that it did contain the true minimum! This is the fundamental limit of hypothesis testing with p-values.

**One last note:** Even in the second scenario where we saw a significant improvement in scores, we can't attribute that improvement to the tutorial session! It's possible that scores just go up as students have more time to think about a homework. (In this case, the better experiment would be a controlled trial, where students are randomized to go to the tutorial session or not. We'll examine such two-sample tests in the homework for the next statistics lecture.) The main lesson of this is that even if you *can* reject the null hypothesis, you can never be sure *which* alternative hypothesis is correct (i.e. whether tutoring helped, or some other factor improved the scores) unless you design an experiment to distinguish among the alternatives.