# Question 1: Bootstrap Hypothesis Testing

We will use the tools developed in the previous question to perform basic bootstrap hypothesis testing. First, paste your definitions of `p_value`, `resample`, `mean_difference`, `t_stat` and `F_stat` from Question 0 below. (Two helper functions are provided for you: `sum_of_squares`, and a numpy-ified `shift_mean`, which returns a dataset shifted such that its mean is the specified value.)

In [None]:
import numpy

def shift_mean(data, new_mean):
    data = numpy.asarray(data)
    mean = numpy.mean(data)
    shift_factor = new_mean - mean
    return data + shift_factor

def sum_of_squares(data, reference):
    data = numpy.asarray(data)
    return numpy.sum((data - reference)**2)

### YOUR CODE BELOW:

def p_value(actual_stat, resampled_stats, null_hyp_stat=0, two_tailed=True):
    resampled_stats = numpy.asarray(resampled_stats)
    # YOUR ANSWER HERE
    
def resample(data_sets, statistic, n_resamples):
    # turn every dataset in `data_sets` into a `numpy` array 
    data_sets = [numpy.asarray(data) for data in data_sets]
    # YOUR ANSWER HERE

def mean_difference(data1, data2):
    # YOUR ANSWER HERE

def t_stat(data1, data2):
    # YOUR ANSWER HERE

def F_stat(*data_sets):
    # YOUR ANSWER HERE
    return ss_between / ss_within

## Question 1.1

Now, let's read in some data. The file "bloodpressure.txt" contains a number of lines, each with a measured blood pressure and a genotype at a particular SNP specified as "AA" or "GG". The values are separated by a tab character (spelled `'\t'` in Python). Note that there *is* a header row. (Open the file in the Jupyter file browser to see. Note that the browser helpfully shows the tab characters as distinct from spaces...)

Read in the file, and store the "AA" blood pressures (converted to floating-point numbers) in a list named `aa_bp` and the "GG" blood pressures in a list named `gg_bp`. Last, convert each list to a numpy array.
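As a rough illustration of the parsing pattern described above (skip the header row, split each line on the tab character, convert the numeric field to a float, and sort values into lists by label), here is a minimal sketch on made-up in-memory data. The `io.StringIO` stand-in, the column order, the labels `X` and `Y`, and all of the numbers are assumptions for illustration only; check the real file to see how its columns are actually laid out.

```python
import io
import numpy

# Stand-in for an open file: a header row, then tab-separated value and label columns.
# The column names, column order, and numbers are invented for this sketch.
fake_file = io.StringIO("value\tlabel\n1.5\tX\n2.5\tY\n3.5\tX\n")

x_values = []
y_values = []
header = fake_file.readline()  # read (and ignore) the header row
for line in fake_file:
    # strip the trailing newline, then split on the tab character
    value, label = line.rstrip('\n').split('\t')
    if label == 'X':
        x_values.append(float(value))
    elif label == 'Y':
        y_values.append(float(value))

# convert each list to a numpy array once all lines have been read
x_values = numpy.array(x_values)
y_values = numpy.array(y_values)
print(x_values, y_values)
```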

In [None]:
# YOUR ANSWER HERE

print(len(aa_bp), len(gg_bp))
print(numpy.mean(aa_bp), numpy.mean(gg_bp))

In [None]:
assert len(aa_bp) == 56
assert len(gg_bp) == 205
assert type(aa_bp) is numpy.ndarray
assert abs(numpy.mean(gg_bp) - 124.021365854) < 0.00000001

Below, we plot the GG and AA data. Note several useful features of the plotting code:
  1. By changing `plt.rcParams['figure.figsize']`, we can make the default figure size a little larger, which helps readability.
  2. The `density=True` parameter changes the histogram from using raw data counts to a normalized density, so that each histogram has a total area of 1. This is handy for comparing datasets with different total sizes, such as these.
  3. It's convenient to plot both histograms on the same set of bins. To ensure this happens, we collect the bin boundaries that are automatically calculated from the GG data, and use those as the boundaries for the AA histogram. (Read the documentation for `plt.hist` and it will tell you that the `bins` parameter can be a number, 'auto', or a list of bin boundaries. Moreover, the documentation will tell you that `plt.hist` provides the bin boundaries used as the second element in the returned values.)
  4. Note how to provide labels and legends.
  5. Note the use of the `alpha=0.5` parameter to turn the AA histogram 50% transparent.
  
Also, looking at the plot, do the AA and GG data seem particularly distinguishable? Make a guess as to whether the difference will be statistically significant.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [12, 4]

gg_mean = numpy.mean(gg_bp)
aa_mean = numpy.mean(aa_bp)
hist_data = plt.hist(gg_bp, bins='auto', density=True, label='GG')
plt.axvline(gg_mean, color='orange', label='GG mean')
bin_edges = hist_data[1]
hist_data = plt.hist(aa_bp, bins=bin_edges, density=True, alpha=0.5, label='AA')
plt.axvline(aa_mean, color='blue', label='AA mean')
legend = plt.legend()

## Question 1.2
Let's ask if the difference in blood pressure between these groups is statistically significant. 

First, write a function to calculate a bootstrapped p-value to test the hypothesis that the means of two groups are different. Since the null hypothesis is that the means are the same, make sure to resample under this assumption. Let the user provide a statistic function to score how different the means are (i.e. `mean_difference` or `t_stat` or `F_stat`). This function should calculate the actual statistic, then shift each data group so that its mean is equal to the grand mean (the mean of all the data concatenated together). The `shift_mean` function defined above may help.

Next, the function should generate a set of resampled statistics of the shifted data (which comply with the null hypothesis that each data set has an identical mean), and calculate a p-value therefrom. The function should return the actual statistic, the list of statistics calculated from the resamples under the null hypothesis, and the p-value.
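To make the null-hypothesis construction concrete, here is a small sketch on toy numbers (not the blood pressure data). It shifts two groups to their grand mean and then draws one bootstrap resample from each. Drawing with replacement via `numpy.random.default_rng` is an assumption about how resampling is done; your `resample` function from Question 0 may be organized differently.

```python
import numpy

rng = numpy.random.default_rng(0)  # seeded generator; the seed is an arbitrary choice

# Toy groups with clearly different means (made-up numbers)
group_a = numpy.array([10.0, 12.0, 14.0])
group_b = numpy.array([20.0, 21.0, 22.0, 25.0])

# Grand mean: the mean of all the data concatenated together
grand_mean = numpy.mean(numpy.concatenate([group_a, group_b]))

# Shift each group so its mean equals the grand mean (this is what shift_mean does)
null_a = group_a + (grand_mean - numpy.mean(group_a))
null_b = group_b + (grand_mean - numpy.mean(group_b))

# One bootstrap resample per group: same size as the group, drawn with replacement
resampled_a = rng.choice(null_a, size=len(null_a), replace=True)
resampled_b = rng.choice(null_b, size=len(null_b), replace=True)

print(numpy.mean(null_a), numpy.mean(null_b))  # the two shifted means now agree
print(numpy.mean(resampled_a) - numpy.mean(resampled_b))  # one null-distribution value
```

Shifting changes only the location of each group: the spread within each group is untouched, so the resamples still reflect each group's own variability.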

The histogram plotted below shows in red the null distribution (the values of the statistic you might expect by chance when the means are actually the same), and the blue line shows the value of the actual statistic we obtained. The further out on the tail of the histogram the blue line is, the lower the p-value.

In [None]:
def bootstrap_means_different(data1, data2, statistic, n_resamples=10000):
    # YOUR ANSWER HERE
    return actual_stat, resampled_stats, p

actual_diff, resampled_diffs, mean_diff_p = bootstrap_means_different(aa_bp, gg_bp, mean_difference)
plt.hist(resampled_diffs, bins='auto')
plt.axvline(actual_diff, color='blue')
print(actual_diff, mean_diff_p)

In [None]:
stat, samples, p = bootstrap_means_different(aa_bp, gg_bp, mean_difference)
assert len(samples) == 10000
assert abs(stat + 5.25058013937) < 0.000001
assert 0.021 < p < 0.029

stat, samples, p = bootstrap_means_different(aa_bp, gg_bp - 3, mean_difference)
assert len(samples) == 10000
assert abs(stat + 2.25058013937) < 0.000001
assert 0.32 < p < 0.36

In class I mentioned that the correction for sample variance is a critical feature of the t-statistic for parametric statistics: for a given difference in the means, datasets with large standard deviations are harder to distinguish from one another than datasets with small standard deviations.

Does this matter here? Generate 200 p-values from the AA and GG data using the `mean_difference` statistic and 200 using the `t_stat` statistic. (Use `n_resamples=500` to make this not take forever...) Plot superimposed histograms of these p-values just as above (though don't bother with vertical lines showing the "mean p-value"). How different do these histograms appear?

In [None]:
mean_diff_ps = []
t_stat_ps = []

# YOUR ANSWER HERE
print('Average p-value with mean differences:', numpy.mean(mean_diff_ps))
print('Average p-value with t-statistic:', numpy.mean(t_stat_ps))

In [None]:
assert numpy.mean(mean_diff_ps) < numpy.mean(t_stat_ps)
assert len(mean_diff_ps) == 200

From this, you should be able to see that the t-statistic produces a slightly more conservative test (i.e. larger p-values). Because it (sensibly) takes into account the spread in the data, the t-statistic will generally judge these datasets to be less different than the difference in means alone would suggest.

## Question 1.3

From the histograms of the blood pressure data, there may be a few outliers. Write a new statistic to compare the differences in median (using `numpy.median`), and compute the p-value for this test of medians. You can't just use your `bootstrap_means_different` function above, because that resamples under the null hypothesis of equal means, not equal medians. So write new functions for `bootstrap_medians_different` and `shift_median`. Note that unlike `bootstrap_means_different` which took a `statistic` function (allowing us to compare the t-test's way of scoring the difference between means to the simpler way of just subtracting the means), `bootstrap_medians_different` will only ever use `median_difference`, so you can just code that in directly.

In [None]:
def median_difference(data1, data2):
    # YOUR ANSWER HERE

def shift_median(data, new_median):
    # YOUR ANSWER HERE

# below should print 42 (or something arbitrarily close to it), to prove that shifting to the specified median works...
print(numpy.median(shift_median(aa_bp, 42)))
    
def bootstrap_medians_different(data1, data2, n_resamples=10000):
    # YOUR ANSWER HERE
    return actual_stat, resampled_stats, p

actual_diff, resampled_diffs, median_diff_p = bootstrap_medians_different(aa_bp, gg_bp)
plt.hist(resampled_diffs, bins='auto')
plt.axvline(actual_diff, color='blue')
print(actual_diff, median_diff_p)

In [None]:
assert abs(numpy.median(shift_median(aa_bp, 42)) - 42) < 0.0000001
assert 0.41 < median_diff_p < 0.46

Interesting! Even though there were clearly significant differences in the mean (either by the mean-differences method or the t-statistic), there is no significant difference in the median blood pressure between the genotypes! This means that the differences observed may be more due to outliers than the position of most of the individuals in the datasets.

Note that this doesn't mean that there are no real biological differences in the AA or GG genotypes! It just means that those differences mostly manifest in a few extreme individuals of each genotype. This is critical information for understanding the biology of these differences. (e.g. perhaps this indicates that there is some other, relatively rare interacting allele that potentiates the effect of the AA or GG genotype? Or perhaps there is some gene-by-environment interaction driven by an environment that only a few individuals in the population experience...)

## Question 1.4

So far we have been examining questions about the location of the data (means or medians, or even in the last homework, the minimum). What about the scale of the data? Do the datasets under consideration have equal variance? Or is there more variability in one dataset than the other?

Let's start by defining a statistic. For the hypothesis that the means were not the same, our statistic was the difference between means. However, standard deviation is a measure of *scale*, not position, and typically we think of scales as a multiplicative factor. So instead of looking at the difference in the standard deviations, let's look at the *ratio* of the standard deviations of the two datasets. The null hypothesis in this case will be that the ratio is 1.

Now, how should we sample under this null hypothesis of equal standard deviations? To modify the standard deviation of a dataset without changing its mean, you need to first shift it to have zero mean, then multiply the dataset by whatever scaling factor is desired, and then shift back to the original mean. (Note: if a dataset has a standard deviation of 3 and you want it to have a standard deviation of 6, the scaling factor to multiply by is 2.)
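The shift, scale, and shift-back recipe can be sketched as follows. The helper name `rescale_std` and the toy numbers are hypothetical, chosen only for illustration; this is one way to arrange the arithmetic, not necessarily how your `equalize_stds` should be structured.

```python
import numpy

def rescale_std(data, new_std):
    # Hypothetical helper for illustration: returns a copy of `data` with the
    # same mean but with standard deviation `new_std`.
    data = numpy.asarray(data, dtype=float)
    mean = numpy.mean(data)
    scale = new_std / numpy.std(data)  # e.g. going from std 3 to std 6 gives scale 2
    # shift to zero mean, multiply by the scaling factor, shift back
    return (data - mean) * scale + mean

toy = numpy.array([1.0, 4.0, 7.0, 10.0])
rescaled = rescale_std(toy, 2 * numpy.std(toy))
print(numpy.mean(toy), numpy.mean(rescaled))  # the mean is unchanged
print(numpy.std(toy), numpy.std(rescaled))    # the standard deviation has doubled
```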

So first write an `equalize_stds` function that calculates a grand standard deviation of the concatenated data and rescales each individual data set such that they all have that same standard deviation. Then write a `std_ratio` function to return the ratio of standard deviations (using `numpy.std`). 

Now, the null hypothesis of equal standard deviations means that we expect the "true" ratio of standard deviations to be 1. To calculate a two-tailed p-value, we need to count the number of times that the resampled data (which were forced to comply with that null hypothesis) produces a ratio of standard deviations farther -- in either direction -- from the null-hypothesis value of 1 than the actual standard deviation ratio is. 

The current `p_value` function uses subtraction to define the distance between the statistic and the null hypothesis, but this doesn't make sense for ratios. If we observe a ratio of 3 and the null hypothesis is that the ratio is 1, a ratio similarly extreme in the other direction is not -1 (two units less than 1), it is 1/3 (three times smaller than 1). That is, you need to measure distance multiplicatively, not additively.

So, you will need to write a new `p_value_ratio` function that calculates whether a resampled ratio or the actual ratio is multiplicatively farther from 1. The bookkeeping for doing this in all possible cases gets complex fast, though. Previously we avoided similar bookkeeping by taking the absolute value of the distance between a statistic and the null hypothesis, so we could just ask whether a resample's distance was at least as large as the actual one. The multiplicative equivalent is to make sure that a ratio is always >= 1 by taking the reciprocal if the ratio is less than one. 

As such, you should reciprocate all the ratios (both the resampled ones and the actual one) if they are < 1. Then, since all of the ratios are by construction >= 1, it becomes quite easy to see whether a resampled ratio is more extreme than the actual one.
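Here is a small sketch of that "folding" step on made-up ratios. It uses `numpy.where` rather than the for-loop suggested in the code cell below, purely as an alternative illustration; both approaches give the same values. The variable names and numbers are hypothetical.

```python
import numpy

# Made-up ratios scattered on both sides of 1
ratios = numpy.array([0.2, 0.5, 1.0, 2.0, 4.0])

# Replace every ratio < 1 with its reciprocal, so all values are >= 1
folded = numpy.where(ratios < 1, 1 / ratios, ratios)

# Fold a made-up "actual" ratio the same way
actual = 0.25
folded_actual = 1 / actual if actual < 1 else actual

# Fraction of folded ratios at least as extreme as the folded actual ratio
fraction_as_extreme = numpy.mean(folded >= folded_actual)
print(folded, folded_actual, fraction_as_extreme)
```

After folding, a ratio of 0.2 and a ratio of 5 count as equally far from 1, which is exactly the multiplicative notion of distance described above.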


Last, write a function `bootstrap_std_different` that brings all of these together to calculate a p-value. No need for this function to take a user-specified statistic function; it will always use `std_ratio`.


In [None]:
def equalize_stds(data1, data2):
    # YOUR ANSWER HERE
    return equalized1, equalized2

eq_aa, eq_gg = equalize_stds(aa_bp, gg_bp)
print(numpy.std(eq_aa), numpy.std(eq_gg))
    
def std_ratio(data1, data2):
    # YOUR ANSWER HERE


def p_value_ratio(actual_ratio, resampled_ratios):
    # Since this is a two-tailed test, we don't care if the actual or resampled ratios
    # are larger or smaller than 1, but just by what factor they are larger or smaller.
    # So we will just convert all ratios to be larger than 1 and compare for 
    # extreme-ness only in one direction. (This is akin to taking absolute values
    # in the previous p-value calculation.)
    
    # Sometimes a for-loop is the easiest way to do things, even when numpy is
    # available. So make a new version of resampled_ratios using a for-loop,
    # where every value < 1 is replaced by its reciprocal.
    # Also don't forget to make sure actual_ratio is >= 1 as well...
    # YOUR ANSWER HERE
    
def bootstrap_std_different(data1, data2, n_resamples=10000):
    # YOUR ANSWER HERE
    return actual_ratio, resampled_ratios, p_val

ratio, resampled_ratios, std_ratio_p = bootstrap_std_different(gg_bp, aa_bp)
plt.hist(resampled_ratios, bins='auto')
plt.axvline(ratio, color='blue')
print(ratio, std_ratio_p)

In [None]:
assert abs(numpy.std(eq_aa) - 13.642075681) < 0.000001
assert abs(numpy.std(eq_aa) - numpy.std(eq_gg)) < 0.000001
assert abs(numpy.mean(aa_bp) - numpy.mean(eq_aa)) < 0.000001

assert std_ratio([0, 4, 8], [1, 2, 3]) == 1/std_ratio([1, 2, 3], [0, 4, 8])
assert std_ratio([0, 4, 8], [1, 2, 3]) == 4

assert p_value_ratio(1/4, [1/5, 1/4, 1/2, 2, 4, 5, 6]) == 5/7
assert 0.023 < std_ratio_p < 0.031

So, the AA data is significantly more spread out than the GG data (the GG/AA standard deviation ratio is around 0.77). That is, it was rare to see a ratio so different from 1 when we resampled data that had identical standard deviations.

## Question 1.5

Last, we haven't done an ANOVA test yet to see if three or more samples have different means. Imagine that we are testing a novel protocol to differentiate iPS cells into pancreatic beta cells. We have tried four different doses of a drug, and it looks like maybe one or two of those doses had an effect on the differentiation efficiency. But the data were very noisy, so we want to know whether the effect is likely just due to the variability within each replicate.

Write a function, `bootstrap_anova`, that takes an arbitrary number of datasets, shifts them each to the grand mean, and then resamples them to estimate the distribution of F-statistics under the null hypothesis of no difference in means.


In [None]:
dose0 = numpy.array([
        0.44353122,  0.01012966,  0.28873548,  0.41508564,  0.63920005,
        0.38840745,  0.56994441,  0.39461312,  0.63531554,  0.2243476 ,
        0.0146331 ,  0.34027282,  0.11461404,  0.3706856 ,  0.03022053,
        0.26903881,  0.40752708,  0.57226634,  0.56036688,  0.45346642,
        0.24393661,  0.11314218,  0.37424453,  0.02731236,  0.36170049,
        0.28624436,  0.27673137,  0.02059751,  0.33590967,  0.34159704])

dose1 = numpy.array([
        0.40447368,  0.04894949,  0.43691414,  0.36156942,  0.55082948,
        0.45234243,  0.08263355,  0.53735037,  0.19556559,  0.14684213,
        0.12670269,  0.09968791,  0.09903378,  0.23750902,  0.30392521,
        0.27720465,  0.64437095,  0.62903639,  0.62678483,  0.41837535,
        0.02564389,  0.34572511,  0.48980329,  0.67060822,  0.40915237,
        0.67581135,  0.63069843,  0.1503978 ,  0.12660994,  0.25990044])

dose2 = numpy.array([
        0.2023481 ,  0.17263792,  0.24061181,  0.23310464,  0.75607685,
        0.67992754,  0.1684292 ,  0.72103724,  0.50828973,  0.19148194,
        0.76176207,  0.31807931,  0.04664898,  0.22524644,  0.83347876,
        0.58708986,  0.40128552,  0.33489851,  0.16898191,  0.0937998 ,
        0.82937534,  0.69083288,  0.14723692,  0.51264047,  0.46377266,
        0.55422885,  0.49082234,  0.80979359,  0.43772453,  0.35936454])

dose3 = numpy.array([
        0.56307295,  0.260053  ,  0.65498897,  0.4074656 ,  0.38936101,
        0.10776682,  0.46990373,  0.39733009,  0.04181612,  0.15412012,
        0.44600949,  0.56160325,  0.09421104,  0.29194078,  0.69317676,
        0.22814752,  0.17328897,  0.40468055,  0.69985375,  0.39791401,
        0.55387151,  0.68740186,  0.52072545,  0.65457568,  0.19796182,
        0.58060724,  0.6489602 ,  0.2342047 ,  0.42521541,  0.62100328])

plt.scatter([0]*30, dose0)
plt.scatter([1]*30, dose1)
plt.scatter([2]*30, dose2)
plt.scatter([3]*30, dose3)

def bootstrap_anova(*data_sets, n_resamples=10000):
    # YOUR ANSWER HERE
    return actual_stat, resampled_stats, p_val

f_value, resampled_f_values, anova_p = bootstrap_anova(dose0, dose1, dose2, dose3)
plt.figure()
plt.hist(resampled_f_values, bins='auto')
plt.axvline(f_value, color='blue')
print(f_value, anova_p)

In [None]:
assert abs(f_value - 0.0519515028907) < 0.000001
assert 0.11 < anova_p < 0.13

It appears that the blip in the data for dose 2 was just that: a blip. Maybe with a larger dataset, that dose would really have a higher differentiation efficiency... or maybe not.