# Question 2: Statistical Power

We will now explore statistical power. First, definitions:

1. The p-value cutoff controls the *false positive rate*: the chance that we declare there to be a true effect when what we saw was really due to chance alone. (This is called a **"Type I Error"**.) Our p-value cutoff (typically 0.05 in biology) is the false positive rate that we are willing to tolerate.

2. Statistical power is the *true positive rate* (or equivalently, 1 minus the false negative rate): the chance that when there really is an effect, we will get a p-value below the cutoff. Power depends on, amongst other things, (a) the size of the effect, (b) the size of the sample, and (c) the p-value cutoff. (Increasing the p-value cutoff will improve statistical power, but at the cost of also increasing the false positive rate.) Getting a false negative (which is more common when a statistical test has low power) is called a **"Type II Error"**.

As a warm-up, let's first explore the false positive / type I error rate. We will do so by examining how often we see "significant" p-values from data that we know come from identical populations.

To do this, we need to repeatedly generate pairs of "new" datasets drawn from the same underlying population. We can do this with the random number generators in Python, or we can test resamples from the same data set for differences.

**Crucially**, in neither case should we repeatedly apply a statistical test to the same data set over and over again. We want to know how often a pair of samples will be "significantly" far apart (which requires drawing new samples each time), **not** how often a test gives a significant p-value for any specific pair of samples.
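As a rough illustration of the "new samples each time" idea, the sketch below uses a parametric t-test from `scipy.stats` as a stand-in for the bootstrap test used in the questions that follow. The helper name `false_positive_rate_sketch` and its default values are assumptions made for this example only.

```python
import numpy
from scipy import stats

def false_positive_rate_sketch(n_trials=400, sample_size=50, p_thresh=0.05, seed=0):
    # Hypothetical helper: counts how often two samples drawn from the SAME
    # population look "significantly" different.
    rng = numpy.random.default_rng(seed)
    n_significant = 0
    for _ in range(n_trials):
        # Draw a brand-new pair of samples on every trial, rather than
        # re-testing one fixed pair over and over.
        sample1 = rng.normal(size=sample_size)
        sample2 = rng.normal(size=sample_size)
        p_val = stats.ttest_ind(sample1, sample2).pvalue
        if p_val <= p_thresh:
            n_significant += 1
    return n_significant / n_trials

rate = false_positive_rate_sketch()
print(rate)
```

The printed fraction should land in the general neighborhood of the 0.05 cutoff, though it will vary with the seed and the number of trials.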

## Question 2.1
Some code that's pretty familiar from the previous questions is provided below. (Note that the implementation of `bootstrap_means_different` has a couple of changes: it no longer takes a `statistic` parameter, and instead always just calls `mean_difference` as the statistic. It also only returns the p-value.)

We will use these to test whether the bootstrap test of two means gives the correct number of false positives given a specific p-value threshold. Write a function, `significant_fraction(test, get_sample, sample_size, n_trials=250, n_boot_resamples=1000, p_thresh=0.05)` to run this test:
  - `test` is a function that implements a statistical test. Assume that the "function signature" of test (i.e. what parameters it takes) is `test(data_sets, n_resamples)`. That is, we will pass the `bootstrap_means_different` function below as the test of interest. (We will write and use other tests that have the same function signature later on.)
  - `get_sample` is a function that will be called as `get_sample(sample_size)` and will return a new sample. (Just as in Question 0.5.)
  - `sample_size` is the sample-size parameter to pass to `get_sample`.
  - `n_trials` is the number of times to run the statistical test.
  - `n_boot_resamples` is the number of resamples parameter to pass to `test`.
  - `p_thresh` is a p-value threshold.

For each trial (of the `n_trials` total trials), `significant_fraction` should call `get_sample` **twice** to generate two samples from the same "true distribution" (i.e. the true difference in means is zero). Then the function should run the given statistical test with those two samples. The function should return the fraction of trials for which the p-value was less than or equal to the provided threshold.

In [None]:
import numpy

aa_bp = []
gg_bp = []
# 'with' closes the file automatically, even if an error occurs while reading
with open('bloodpressure3.txt') as file:
    header = file.readline()  # skip the column-name line
    for line in file:
        bp, genotype = line.strip('\n').split('\t')
        bp = float(bp)
        if genotype == 'AA':
            aa_bp.append(bp)
        elif genotype == 'GG':
            gg_bp.append(bp)
        else:
            print('unknown genotype!', genotype)
aa_bp = numpy.array(aa_bp)
gg_bp = numpy.array(gg_bp)

def mean_difference(data1, data2):
    return numpy.mean(data1) - numpy.mean(data2)

def resample(data_sets, statistic, n_resamples):
    data_sets = [numpy.asarray(data_set) for data_set in data_sets]
    stats = []
    for i in range(n_resamples):
        resampled_data_sets = [numpy.random.choice(data, size=len(data), replace=True) for data in data_sets]
        stats.append(statistic(*resampled_data_sets))
    return stats

def two_tail_p_value(actual_stat, resampled_stats, null_hyp_stat=0):
    resampled_stats = numpy.asarray(resampled_stats)
    actual_diff = abs(actual_stat - null_hyp_stat)
    diff = numpy.abs(resampled_stats - null_hyp_stat)
    count_extreme = numpy.count_nonzero(diff >= actual_diff)
    return count_extreme / len(resampled_stats)

def bootstrap_means_different(data_sets, n_resamples=10000):
    data_sets = [numpy.asarray(data_set) for data_set in data_sets]
    actual_stat = mean_difference(*data_sets)
    shifted = [data_set - numpy.mean(data_set) for data_set in data_sets]
    resampled_stats = resample(shifted, mean_difference, n_resamples)
    p_val = two_tail_p_value(actual_stat, resampled_stats)
    return p_val

def get_normal_sample(sample_size):
    # draw data from a normal distribution with mean=0 and std=1
    return numpy.random.normal(size=sample_size)

def get_gg_sample(sample_size):
    # draw a re-sample of the GG data.
    return numpy.random.choice(gg_bp, size=sample_size)

# YOUR ANSWER HERE

false_pos_norm = significant_fraction(bootstrap_means_different, get_normal_sample, sample_size=100)
print(false_pos_norm)

false_pos_gg = significant_fraction(bootstrap_means_different, get_gg_sample, sample_size=len(gg_bp))
print(false_pos_gg)

In [None]:
assert 0.02 < false_pos_norm < 0.08
assert 0.02 < false_pos_gg < 0.08
num_false_pos = false_pos_gg * 250
assert int(num_false_pos) == num_false_pos

You should see that the false positive rate is somewhere near 5%, which is as it should be for a p-value threshold of 0.05.

## Question 2.2
Recall from Question 0.8 that we didn't always get the "right" confidence interval for the median, compared to the mean. Essentially the problem was that resampling a specific dataset gave a distribution of medians that wasn't quite the same as you would see by sampling a truly new dataset over and over again.

Specifically, the distribution of resamples was a tad too narrow, leading to confidence intervals that were too small. One might also suspect that a too-narrow distribution might lead to p-values that are smaller than maybe they ought to be. Is this a cause for concern? Let's test. Below, paste your code for `median_difference` and `bootstrap_medians_different` from Question 1.2. Modify `bootstrap_medians_different` so that it does not take a `statistic` parameter and instead always just uses `median_difference`. Also modify the function to only return the p-value. This way, it will be compatible with your `significant_fraction` function above.

Running the code will answer the question of whether we get too many false positives for difference-of-medians tests.

Note that we only test this with the `get_normal_sample` function, which returns a uniquely new sample from the normal distribution. In particular, since `get_gg_sample` "cheats" and returns a *resample* of the GG data (instead of getting a new population of individuals, genotyping them, and measuring their blood pressure!) we can't use it to ask whether *resamples* act just like *new samples* in this case.

In [None]:
# YOUR ANSWER HERE
false_pos_norm_median = significant_fraction(bootstrap_medians_different, get_normal_sample, sample_size=100)
print(false_pos_norm_median)

In [None]:
assert 0.02 < false_pos_norm_median < 0.08

Good news! The false-positive rate isn't too bad. Clearly, even if the distribution of medians of resampled statistics is a bit wonky, the distribution of *differences between two medians* isn't that different for resamples or true new samples.

Run the code below to see that. This is just like the example at the end of Question 0, except that we're looking at differences between medians rather than medians themselves. (Again, note that the overall positions of the distributions might differ, but that in this case the shapes are more similar than in Question 0.)

This is an illustration of the *[central limit theorem](https://en.wikipedia.org/wiki/Central_limit_theorem)*, one of the crown jewels of statistical theory. This theorem states that sums (or differences) of random values will typically have a normal (bell-curve-like) distribution, even when the underlying random variables come from distributions with very different shapes. (Sums of larger numbers of independent random variables will be more normally distributed than sums of fewer numbers, but as we see in this case, even two numbers added together look a lot smoother than the underlying distribution.)

In [None]:
median_diffs = [median_difference(get_normal_sample(200), get_normal_sample(200)) for i in range(10000)]

sample1 = get_normal_sample(200)
sample2 = get_normal_sample(200)
resampled_median_diffs = resample([sample1, sample2], median_difference, n_resamples=10000)

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = [12, 4]

plt.hist(median_diffs, bins='auto', label='sample median diffs')
plt.hist(resampled_median_diffs, bins='auto', alpha=0.6, label='resample median diffs')
x = plt.legend() # store result of this to suppress printing it out

Next, let's compute statistical power: the probability of observing a statistically significant p-value, given a particular size of effect, a particular sample size, a statistical test, and a p-value threshold.

We want to write a function called `power` which will run any given hypothesis test. However, some tests might need different function arguments (e.g. a bootstrap or permutation test might require `n_resamples`, but a parametric Student's t-Test won't...) How can our `power` function accept arbitrary keyword arguments to pass along to the statistical test?

Last week, we learned about the `*` syntax, whereby a function could accept an arbitrary list of parameters, or could call another function with such a list. There is another similar syntax for accepting arbitrary keyword arguments: `**`. Try running the below, and maybe changing it around a bit to see what happens:


In [None]:
def print_kws(**kws):
    # kws is a dictionary of the keyword arguments provided to the function.
    # kws maps keyword names (as strings) to values
    for name, value in kws.items():
        print(name, value)

print_kws(hello=5, goodnight='moon')
print_kws(a=4, b=5, c=5)
# The below will generate an error if you uncomment and run it (try it)!
# Note that "positional" arguments (as referred to by the error text) are
# arguments not specified by keyword, like those below:
# print_kws(2, 4, 5)

In addition, we can call a function with an arbitrary dictionary of keyword arguments with `**`. Note that you can also provide some number of positional and keyword arguments before the `**` syntax. It's only an error to not provide enough arguments, or to provide too many or overlapping arguments. Again, try running (and playing with) the below:

In [None]:
def many_args(a, b, c, d=5):
    print('a', a)
    print('b', b)
    print('c', c)
    print('d', d)
    print()

# can use all positional arguments
many_args(1, 2, 3, 4)

# equivalently, we can use a list with the * syntax:
args = [1, 2, 3, 4]
many_args(*args)

# can use all keyword arguments
many_args(a=1, b=2, c=3, d=4)

# equivalently, can use a dictionary with the ** syntax, mapping strings to argument values:
kws = {'a':1, 'b':2, 'c':3, 'd':4}
many_args(**kws)

# we can mix these patterns ad nauseam:

kws = {'b':2, 'c':3}
many_args(a=1, **kws) # note that we don't specify d at all! This gives us its default value of 5.
many_args(4, d='hello', **kws) # specify a by position and d by explicit keyword

args = [1, 2]
kws = {'c': 3, 'd':4}
many_args(*args, **kws)
many_args(*args, c=3) # just use default value for d
many_args(*args, c=3, d=4)

Last, note that while it's traditional to name the `**` parameter `**kws` or `**kwargs` (and use `*args` for the counterpart syntax), this isn't required in the slightest. Any variable name can be used (as you'll see below).
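To tie the two uses of `**` together, here is a small sketch in which one function collects keyword arguments it does not name itself and forwards them, unchanged, to another function. The names `describe`, `run_twice`, and `options` are made up for this illustration.

```python
def describe(values, label='data', precision=2):
    # A stand-in for any function that takes optional keyword arguments.
    mean = sum(values) / len(values)
    return f'{label}: {mean:.{precision}f}'

def run_twice(func, values, **options):
    # 'options' is a dictionary of every keyword argument that run_twice
    # does not declare itself; ** then forwards them on to 'func'.
    return [func(values, **options), func(values[::-1], **options)]

# No extra keywords: describe falls back to its own defaults.
print(run_twice(describe, [1, 2, 4]))

# Extra keywords pass straight through run_twice to describe.
print(run_twice(describe, [1, 2, 4], label='bp', precision=1))
```

Note that `run_twice` never needs to know which keyword arguments `describe` accepts. If an unsupported keyword is passed, the error comes from `describe` itself.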

## Question 2.3
Write a function `power(test, get_sample, effect_size, sample_size, n_trials=250, p_thresh=0.05, **test_args)` that will repeatedly get new samples and run a two-sample test on those samples, where:
 - `test` is a function that can be called as `test(data_sets, **test_args)`, where `data_sets` is a list of two new samples. This function can be assumed to return a p-value and nothing else.
 - `get_sample` is a function that can be called as `get_sample(sample_size, effect_size)` and will return a new sample of the given size and degree of "effect". `get_sample(sample_size, 0)` should return a no-effect "control" sample.
 - `effect_size` and `sample_size` are the parameters to `get_sample` described above.
 - `n_trials` is the number of times that new samples should be generated and the test run.
 - `p_thresh` is the threshold for calling a result significant.
 - `**test_args` are arguments to pass to the `test` function.

This function should return the fraction of `n_trials` where `test` returns a p-value that is less than or equal to `p_thresh`.

The function should call `get_sample` twice to generate two separate datasets: once as `get_sample(sample_size, 0)` to get the "control" sample, and once as `get_sample(sample_size, effect_size)` to generate a sample that is distinct from the control by some specified effect size. (Usually this will amount to shifting the mean by the specified effect size...)

Some potential `get_sample` functions are defined below.

In [None]:
def power(test, get_sample, effect_size, sample_size, n_trials=250, p_thresh=0.05, **test_args):
    # YOUR ANSWER HERE

# define some sample-getting tools. Each of these returns samples with a standard deviation
# around 12 (which is that of the GG data), so that the "effect size" parameters mean the same
# thing for each. (Recall that a "big effect" is always relative to how variable the data are.)
def get_normal_sample(sample_size, effect_size):
    # note: loc = "mean" and scale = "standard deviation"
    return numpy.random.normal(size=sample_size, loc=effect_size, scale=12)

def get_gg_sample(sample_size, effect_size):
    # draw a re-sample of the GG data, translated by effect_size units.
    return numpy.random.choice(gg_bp + effect_size, size=sample_size)

def get_bimodal_sample(sample_size, effect_size):
    # draw a sample from a bimodal distribution
    large_mode_size = int(sample_size * 2/3)
    small_mode_size = sample_size - large_mode_size
    large_mode = numpy.random.normal(size=large_mode_size, loc=effect_size, scale=7)
    small_mode = numpy.random.normal(size=small_mode_size, loc=22+effect_size, scale=6)
    return numpy.concatenate([large_mode, small_mode])

plt.hist(get_normal_sample(10000, 0), bins='auto')
plt.title('normal distribution')
plt.figure()
print('normal std:', numpy.std(get_normal_sample(10000, 0)))

plt.hist(get_bimodal_sample(10000, 0), bins='auto')
plt.title('bimodal distribution')
plt.figure()
print('bimodal std:', numpy.std(get_bimodal_sample(10000, 0)))

plt.hist(gg_bp, bins='auto')
plt.title('GG distribution')
plt.figure()
print('GG std:', numpy.std(gg_bp))
print()

# test the power of our bootstrap_means_different test and a normal sample.
# note that the 'n_resamples' parameter, because it is not defined explicitly in the declaration of the
# 'power' function, gets stuffed into the '**test_args' parameter, and thus passed through to the 
# 'bootstrap_means_different' function. (If you look at that function, note that it does in fact define
# such a parameter...)
norm_p = power(bootstrap_means_different, get_normal_sample, effect_size=4, sample_size=100, n_resamples=1000)
print('normal distribution power:', norm_p)

bimodal_p = power(bootstrap_means_different, get_bimodal_sample, effect_size=4, sample_size=100, n_resamples=1000)
print('bimodal distribution power:', bimodal_p)

gg_p = power(bootstrap_means_different, get_gg_sample, effect_size=4, sample_size=100, n_resamples=1000)
print('GG distribution power:', gg_p)

In [None]:
# make sure that 250 trials were performed
assert (norm_p * 250) == int(norm_p * 250)
assert abs(norm_p - 0.68) < 0.1
assert abs(bimodal_p - 0.7) < 0.1
assert abs(gg_p - 0.62) < 0.1

From this, you should see that with a 4-unit difference between two groups (each with a standard deviation around 12) and a sample size of 100, there is about a 65% chance of getting a "significant" p-value. There may be some slight differences between the distributions, but if you re-run the above a few times the numbers bounce around a lot. (Something like 500+ trials and 5,000-10,000 resamples are necessary to start reliably seeing these small differences, but that's pretty slow.)
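As a sanity check on the "about 65%" figure, the sketch below uses a textbook normal approximation for the power of a two-sided comparison of two means. This is not the simulation approach used in this question: it assumes equal group sizes and a known, shared standard deviation, and the helper name `approx_power_sketch` is invented here.

```python
import numpy
from scipy import stats

def approx_power_sketch(effect_size, std, sample_size, p_thresh=0.05):
    # Standard error of the difference between two group means, assuming
    # equal group sizes and a shared, known standard deviation.
    std_error = std * numpy.sqrt(2 / sample_size)
    # True difference in means, measured in standard-error units.
    shift = effect_size / std_error
    # Critical value for a two-sided test at the given threshold.
    z_crit = stats.norm.ppf(1 - p_thresh / 2)
    # Chance of landing beyond either critical value.
    return stats.norm.cdf(shift - z_crit) + stats.norm.cdf(-shift - z_crit)

approx = approx_power_sketch(effect_size=4, std=12, sample_size=100)
print(round(approx, 3))
```

With an effect size of 4, a standard deviation of 12, and a sample size of 100, this approximation comes out at roughly 0.65, in line with the simulated values. With an effect size of 0 it reduces to the p-value threshold itself.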

## Question 2.4
Let's now test the relationship between power and sample size. Using the bimodal sample and the `bootstrap_means_different` test (with 1000 resamples), calculate the power for each of the provided sample sizes and store the results in a list named `powers`.

In [None]:
sample_sizes = [10, 40, 80]
effect_size = 6

# YOUR ANSWER HERE
plt.plot(sample_sizes, powers)
print(powers)

In [None]:
assert len(powers) == 3
assert powers[2] > powers[1] > powers[0]
assert powers[0] < 0.15
assert powers[2] > 0.85

So, as expected, power increases with sample size: with a sample size of 10, you have only a small chance (under about 15%) of getting a positive result when the difference between means is about half of the standard deviation of the data (which is a pretty big effect). But with a sample size of 80, you are very likely (more than about 85% of the time) to get a positive result.

You could plot a similar curve for effect size as well, of course, or for any particular property of your data or statistical test.
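A minimal sketch of such an effect-size curve is below. To keep it fast and self-contained, it swaps in a vectorized parametric t-test from `scipy.stats` rather than the bootstrap test, and draws normal data with a standard deviation of 12. The helper name `t_test_power_sketch` and its defaults are assumptions for illustration.

```python
import numpy
from scipy import stats

def t_test_power_sketch(effect_size, sample_size, std=12, n_trials=2000, p_thresh=0.05, seed=0):
    rng = numpy.random.default_rng(seed)
    # Each row is one simulated experiment: a control sample and a sample
    # whose mean is shifted by effect_size.
    control = rng.normal(loc=0, scale=std, size=(n_trials, sample_size))
    shifted = rng.normal(loc=effect_size, scale=std, size=(n_trials, sample_size))
    # axis=1 runs one t-test per row, giving n_trials p-values at once.
    p_vals = stats.ttest_ind(control, shifted, axis=1).pvalue
    return numpy.mean(p_vals <= p_thresh)

effect_sizes = [0, 4, 8, 12]
effect_powers = [t_test_power_sketch(effect, sample_size=40) for effect in effect_sizes]
print(effect_powers)
```

At an effect size of 0 the "power" is just the false positive rate, and it should climb toward 1 as the effect grows. Passing `effect_sizes` and `effect_powers` to `plt.plot` would draw the curve.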

## Question 2.5
What about the specific statistic used? Can we compare the t-statistic to the plain old difference in means?

Our previous `bootstrap_means_different` function was hard-coded to use only the mean-difference statistic. Previously we had defined a more-general bootstrap function, a simplified version of which is below as `bootstrap_p`. This function requires a statistic to be provided in addition to the `data_sets` and `n_resamples` parameters.

Can you use your above `power` function with `bootstrap_p` to compare the power of using the `mean_difference` statistic to that of the `t_stat` statistic? This shouldn't require editing any function definitions: you just have to call `power` in the right way to provide `bootstrap_p` with all the parameters it needs.

Use an effect size of 6, a sample size of 20, the default 250 trials, 1000 bootstrap resamples and `get_normal_sample`. Store the results in `mean_diff_p` and `t_stat_p`.

In [None]:
def bootstrap_p(data_sets, statistic, n_resamples=10000):
    data_sets = [numpy.asarray(data_set) for data_set in data_sets]
    actual_stat = statistic(*data_sets)
    shifted = [data_set - numpy.mean(data_set) for data_set in data_sets]
    resampled_stats = resample(shifted, statistic, n_resamples)
    p_val = two_tail_p_value(actual_stat, resampled_stats)
    return p_val

def t_stat(data1, data2):
    # Here we use the data several times, so it makes sense to convert to arrays just once beforehand,
    # rather than having numpy do it for us each time.
    data1 = numpy.asarray(data1)
    data2 = numpy.asarray(data2)
    numerator = numpy.mean(data1) - numpy.mean(data2)
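    # Note: this is the pooled sum of squares, so the result is proportional to the textbook pooled
    # two-sample t statistic. The omitted factor depends only on the sample sizes, which stay fixed
    # across resamples, so it does not change a resampling-based p-value.
    denominator = (len(data1) * numpy.var(data1) + len(data2) * numpy.var(data2))**0.5
    return numerator / denominator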

# YOUR ANSWER HERE

print('mean diff power:', mean_diff_p)
print('t statistic power:', t_stat_p)


In [None]:
assert abs(mean_diff_p - t_stat_p) < 0.2
assert abs(mean_diff_p - 0.35) < 0.1

Overall, in this context there isn't a lot of difference between using the t-statistic versus the difference in means. (Remember that due to the small `n_trials` and `n_resamples` per trial, the numbers you get above can bounce around quite a bit. But if you re-run the tests a few times, or increase the `n_trials` and `n_resamples` parameters at the cost of longer execution time, you will see that the values are typically right in the same range...)
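One detail worth checking is how the `t_stat` function above relates to the textbook pooled two-sample t statistic. The sketch below compares it against the `statistic` value reported by `scipy.stats.ttest_ind` (which assumes equal variances by default) on a few made-up normal samples. The sample sizes, seed, and variable names are arbitrary choices for this illustration.

```python
import numpy
from scipy import stats

def t_stat(data1, data2):
    # Same formula as the t_stat defined above.
    data1 = numpy.asarray(data1)
    data2 = numpy.asarray(data2)
    numerator = numpy.mean(data1) - numpy.mean(data2)
    denominator = (len(data1) * numpy.var(data1) + len(data2) * numpy.var(data2))**0.5
    return numerator / denominator

rng = numpy.random.default_rng(0)
n1, n2 = 20, 20
ratios = []
for _ in range(5):
    data1 = rng.normal(loc=0, scale=12, size=n1)
    data2 = rng.normal(loc=6, scale=12, size=n2)
    textbook_t = stats.ttest_ind(data1, data2).statistic
    ratios.append(t_stat(data1, data2) / textbook_t)

# Constant implied by the pooled-variance formula for these sample sizes.
expected_ratio = numpy.sqrt((1 / n1 + 1 / n2) / (n1 + n2 - 2))
print(ratios)
print(expected_ratio)
```

The ratio is the same for every dataset because it depends only on the two sample sizes. Since resampling keeps those sizes fixed, this scaling should not change a bootstrap or permutation p-value.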

## Question 2.6
Let's compare the power of a permutation test, a bootstrap test (both using the t-statistic), and a parametric t-test. Paste your solution for `permutation_p` from the previous question below, and modify it to return just the p-value.

A function to return a parametric t-test result is provided below.

Use an effect size of 6, a sample size of 12, the default 250 trials, and `get_normal_sample`. Use 1000 permutations/resamples for the permutation/bootstrap tests respectively. Store the results in `boot_p`, `perm_p`, and `parametric_p`.

In [None]:
def permutation_p(data_sets, statistic, n_permutations=10000):
    # YOUR ANSWER HERE
    return p_val

from scipy import stats
def t_test_parametric(data_sets):
    # use the pre-canned two-sample t-test from the "scipy.stats" library. Scipy is a
    # sister library to numpy, and it provides higher-level scientific computing tools,
    # compared to the raw numerical computing primitives provided by numpy.
    data1, data2 = data_sets
    return stats.ttest_ind(data1, data2).pvalue

# YOUR ANSWER HERE

print('bootstrap power:', boot_p)
print('permutation power:', perm_p)
print('parametric power:', parametric_p)

In [None]:
assert abs(boot_p - 0.2) < 0.1
assert abs(perm_p - 0.2) < 0.1
assert abs(parametric_p - 0.2) < 0.1

As you can see, all of the tests have roughly the same power, even with a very small sample and a normal distribution (which should be the home turf for a parametric t-test). There are cases where a parametric t-test is theoretically better-powered than a permutation or bootstrap test, but they're hard to find. However, the real strength of the parametric test is that it is many, many times faster to execute than the simulation-based tests.

## Question 2.7
What about a case that's less favorable for the parametric test: one with a decidedly non-normal distribution?

Re-run the above, but change the sample to `get_bimodal_sample` and increase the sample size to 20. (It turns out that with a sample size of just 12, the power is very low for all the tests...)

In [None]:
# YOUR ANSWER HERE

print('bootstrap power:', boot_p)
print('permutation power:', perm_p)
print('parametric power:', parametric_p)

In [None]:
assert abs(boot_p - 0.16) < 0.1
assert abs(perm_p - 0.16) < 0.1
assert abs(parametric_p - 0.16) < 0.1

So in this case of non-normality, the parametric test does just fine: it's basically in the same range as the nonparametric tests. This is in large part due to the central limit theorem: differences between strangely-distributed data are generally distributed with something closer to a normal bell-curve shape.
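To put a rough number on "closer to a normal bell-curve shape", the sketch below compares the skewness of raw bimodal data with the skewness of differences between sample means. Skewness near zero is what a symmetric, bell-like shape would give. The generator `bimodal_sketch` re-creates the shape of `get_bimodal_sample` with no effect applied. Its name, the seed, and the use of `scipy.stats.skew` as the measure of shape are choices made for this illustration.

```python
import numpy
from scipy import stats

def bimodal_sketch(rng, sample_size):
    # Same two-mode recipe as get_bimodal_sample, with effect_size = 0.
    large_mode_size = int(sample_size * 2 / 3)
    small_mode_size = sample_size - large_mode_size
    large_mode = rng.normal(loc=0, scale=7, size=large_mode_size)
    small_mode = rng.normal(loc=22, scale=6, size=small_mode_size)
    return numpy.concatenate([large_mode, small_mode])

rng = numpy.random.default_rng(0)

# Shape of the raw data: clearly lopsided.
raw_skew = stats.skew(bimodal_sketch(rng, 10000))

# Shape of the quantity the tests actually care about: the difference
# between the means of two independent samples of size 20.
bimodal_mean_diffs = [
    numpy.mean(bimodal_sketch(rng, 20)) - numpy.mean(bimodal_sketch(rng, 20))
    for _ in range(5000)
]
diff_skew = stats.skew(bimodal_mean_diffs)

print('raw data skewness:', raw_skew)
print('difference-of-means skewness:', diff_skew)
```

The raw data should show a noticeably positive skewness, while the differences between means should sit much closer to zero.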

## Question 2.8
Below is a much more pathological sample distribution: a case with large outliers that do not shift along with the effect size. You might get something like this with a noisy instrument that has some specific failure modes (crud that shows up as readings, or similar).

The histogram that the code below produces shows the shape of this distribution.

Run the above tests again, using an effect size of 16, a sample size of 100, the t-statistic as your statistic, and 1000 resamples/permutations.

In [None]:
def get_outliers_bimodal_sample(sample_size, effect_size):
    # draw a sample from a bimodal distribution plus a uniform background of outliers;
    # only the large mode shifts with effect_size
    large_mode_size = int(sample_size * 2/3)
    background_size = int(sample_size / 5)
    small_mode_size = sample_size - large_mode_size - background_size
    large_mode = numpy.abs(numpy.random.normal(size=large_mode_size, loc=effect_size, scale=7))
    small_mode = numpy.random.normal(size=small_mode_size, loc=56, scale=3)
    background_mode = numpy.random.uniform(0, 75, size=background_size)
    return numpy.concatenate([large_mode, small_mode, background_mode])

plt.hist(get_outliers_bimodal_sample(10000, 0), bins='auto')
plt.hist(get_outliers_bimodal_sample(10000, 6), bins='auto', alpha=0.6)

# YOUR ANSWER HERE

print('bootstrap power:', boot_p)
print('permutation power:', perm_p)
print('parametric power:', parametric_p)

In [None]:
assert abs(boot_p - 0.78) < 0.1
assert abs(perm_p - 0.78) < 0.1
assert abs(parametric_p - 0.78) < 0.1

Even in this textbook case of violated parametric t-test assumptions (a highly asymmetric sample distribution, with many outliers), the parametric t-test works just as well as the bootstrap and permutation tests.

As we can see, the parametric t-test is quite robust to non-normally-distributed data -- especially for two-sample data where the difference between means is much closer to normally distributed. (Run the code below to see the distribution of the differences between the means of two independent samples of these data.)

Even so, it is sometimes good to check funny-looking data with a nonparametric test like the bootstrap or the permutation test too, especially when using a one-sample test (a case we didn't examine here), where the central limit theorem doesn't help!

In [None]:
mean_diffs = [numpy.mean(get_outliers_bimodal_sample(100, 16)) - numpy.mean(get_outliers_bimodal_sample(100, 16)) for i in range(10000)]
x = plt.hist(mean_diffs, bins='auto')