<img src="./img/HWNI_logo.svg"/>

In [1]:
import scipy.special

# Tutorial - Tests for 2-Sample Data

Though a simple hypothesis test like comparing a single value to a reference distribution, as in the
[hypothesis testing lab](../03 - Hypothesis Testing/Lab - Hypothesis Testing.ipynb),
is sufficient to explore the core concepts of hypothesis testing,
such a test is not sufficient for practical purposes.

This tutorial covers a class of tests that's the next step in complexity:
*2-sample tests*,
where we compare two values of the same statistic,
one calculated on a sample from the null distribution
and the other calculated on a sample from the experimental distribution.

## Example Experiment

As our example experiment,
let's consider the administration of
[octopamine](https://en.wikipedia.org/wiki/Octopamine_%28neurotransmitter%29)
to flies.
Octopamine has been shown to
[increase food consumption](http://www.pnas.org/content/80/13/4159.short).
We'd like to replicate that experiment.

To do so,
we need to measure the change in food consumption
when flies are given octopamine.
One way to do so
is to collect a large group of flies,
randomly split it into two sub-groups,
and then administer octopamine to one of the groups.
We then measure the sucrose consumption
of all flies in each group.

Intuitively, the appropriate test statistic in this case
is some measure of the difference between the two groups.
For example, we might calculate
the difference between the largest values
or the difference between two randomly chosen values.
While either of those statistics would work,
neither of them is particularly good:
the former is very sensitive to outliers and
the latter doesn't get better when we collect more data.

Instead, we use the difference between the two groups in one of the
descriptive statistics
that we learned about.
The mean and the median are popular choices.
The mean is best for symmetric data with no outliers
while the median is better in other cases.
For now, let's use the difference in means as our test statistic.
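
As a minimal sketch, the difference in means can be written as a small function. The function name `difference_in_means` and the consumption values below are made up for illustration; they are not data from the experiment.

```python
import numpy as np


def difference_in_means(treated, control):
    """Test statistic: mean of the treated group minus mean of the control group."""
    return np.mean(treated) - np.mean(control)


# Hypothetical sucrose consumption values (arbitrary units), for illustration only.
control = np.array([1.0, 2.0, 3.0, 2.0])
treated = np.array([3.0, 4.0, 5.0, 4.0])

observed = difference_in_means(treated, control)
print(observed)
```

Swapping `np.mean` for `np.median` would give the difference in medians instead.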

### First Pass: Directly Estimating the Null Distribution

As before, we need to get our hands on the distribution of our test statistic
under the null hypothesis.

In our first foray into hypothesis testing,
we used the fact that,
under the null hypothesis, 
our untreated condition would
be the same as our treated condition.
We therefore measured our statistic in the untreated condition
many times and used the measured distribution
as an estimate of the null distribution.

Now, our hypothesis concerns the difference
between our two groups directly,
so we have to change our strategy.
For example,
we could measure the difference of statistics between
the two groups under the null hypothesis,
then compare that to the difference we actually measured.

More specifically, we could repeatedly gather groups of flies,
split them randomly into groups,
and then administer octopamine to *neither* group.
Since both groups are the same,
the distribution of differences we measure
will be the null distribution of our test statistic.

This works,
but it's somewhat inefficient.
Measuring the null well enough to get $p$-values
with two decimal places takes at least 100 samples,
and each sample uses up a fresh set of flies,
so we'll need 1000 flies
just to estimate the null distribution
for an experiment involving two groups each of size 5.
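
The arithmetic behind that fly count can be spelled out in a few lines. The variable names are only illustrative.

```python
# Each sample from the null distribution needs a fresh experiment:
# two groups of flies, neither of which receives octopamine.
samples_needed = 100        # enough to resolve p-values in steps of 1/100
groups_per_experiment = 2
flies_per_group = 5

p_value_resolution = 1 / samples_needed
flies_needed = samples_needed * groups_per_experiment * flies_per_group
print(p_value_resolution, flies_needed)
```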

### Second Pass: Permutation and Randomization

We'll have to come up with a new strategy if we want to make our two-sample tests more efficient.

Luckily, there's a general purpose and flexible way to perform two-sample tests
that doesn't require us to collect any extra data:
*randomization*.

Consider:
if the null hypothesis is correct,
then there are no differences between the two groups of flies,
at least so far as our statistic is concerned.
Any difference that we appear to uncover is simply due to chance.
If this is true, then
which group the fly came from has no effect on its feeding behavior.

Under the null hypothesis, then, our experiment is no different from one like this:
We collect a bunch of flies and randomly label some "control" and some "experimental",
then measure the difference in feeding behaviors.

In this experiment, it's clear that re-labeling all of the flies
shouldn't, on average, change the value of the test statistic.
If, for example,
we accidentally mixed around the labels of
our two groups
-- flies were inadvertently allowed to move between the two enclosures,
or we accidentally randomized the labels while entering them into the computer --
and calculated the difference,
we shouldn't be able to tell that we messed up from looking at the test statistic.

This gives us a way to measure the null distribution of the test statistic
without having to collect any more data!
We can simply randomly assign the flies
to two groups of the same size as the groups we measured,
then calculate the test statistic.
This gives us one sample from the null distribution of the statistic.
We then repeat this process until we have a good estimate of the null distribution.
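
The procedure above can be sketched with `numpy` as follows. This is one possible implementation, not the one used in the lab notebooks: the function name `randomization_test`, the choice of a one-tailed test, the number of resamples, and the data values are all assumptions made for illustration.

```python
import numpy as np


def randomization_test(treated, control, n_resamples=10_000, seed=0):
    """One-tailed randomization test on the difference in means."""
    rng = np.random.default_rng(seed)
    treated = np.asarray(treated, dtype=float)
    control = np.asarray(control, dtype=float)
    observed = treated.mean() - control.mean()

    pooled = np.concatenate([treated, control])
    n_treated = len(treated)
    null_samples = np.empty(n_resamples)
    for i in range(n_resamples):
        # Randomly re-assign every fly to one of the two groups,
        # keeping the group sizes the same as in the real experiment.
        shuffled = rng.permutation(pooled)
        null_samples[i] = shuffled[:n_treated].mean() - shuffled[n_treated:].mean()

    # Fraction of shuffled statistics at least as large as the observed one.
    # The small tolerance guards against floating-point round-off in ties.
    p_value = np.mean(null_samples >= observed - 1e-12)
    return observed, null_samples, p_value


# Hypothetical sucrose consumption values, for illustration only.
control = [1.0, 2.0, 3.0, 2.0, 1.0]
treated = [3.0, 4.0, 5.0, 4.0, 6.0]

observed, null_samples, p_value = randomization_test(treated, control)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

Each pass through the loop produces one sample from the null distribution, so `null_samples` can also be plotted as a histogram to see where the observed value falls.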

You might notice that there are only a finite number of ways to
rearrange the data points we collected into two groups.
That means we could get an exact measurement of the null distribution
and then use it to calculate the $p$-value.

Unfortunately, just because a number is finite doesn't mean
it's reasonable.
We can calculate the exact number of ways our data could be rearranged
by using a function from `scipy` that calculates the
[binomial coefficient](https://en.wikipedia.org/wiki/Binomial_coefficient).
You may have heard this function called "choose", as in
"$n$ choose $k$" or $\binom{n}{k}$,
where $n$ is the size of the total population
and $k$ is the size of the group being selected.
If the groups are of unequal size,
we can use either group's size as $k$ and we'll get the same answer.
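
To check that claim about unequal group sizes, here is a quick sketch using `math.comb` from the Python standard library, which returns exact integers. The group sizes are hypothetical.

```python
import math

total_population = 12  # hypothetical total number of flies
small_group = 5
large_group = total_population - small_group

# Choosing which flies go in the small group also fixes the large group,
# so both counts must agree.
ways_small = math.comb(total_population, small_group)
ways_large = math.comb(total_population, large_group)
print(ways_small, ways_large)
```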

Run the code cell below to see how many rearrangements
you'd need to test for a typical experiment in your field.
Even with relatively low numbers like a total population of 30
and sub-group sizes of 15 apiece,
the number of rearrangements to test is already over 100 million!

In [2]:
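# total number of flies across both groups
total_population = 30
# number of flies assigned to one of the two groups
sub_group = 15

# number of distinct ways to choose the sub-group from the total population
scipy.special.binom(total_population, sub_group)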

155117520.0

With the advent of modern computing power,
it is in fact possible to perform tests for total populations
of a reasonable size -- between 30 and 70,
depending on the machine and the operator's patience.
These tests are called
[Fisher exact tests](https://en.wikipedia.org/wiki/Fisher%27s_exact_test)
or *permutation tests*.
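
For very small groups, the full enumeration is easy to sketch with `itertools.combinations`. The function name `exact_permutation_p_value` and the data are hypothetical, and the test is one-tailed on the difference in means.

```python
from itertools import combinations


def exact_permutation_p_value(treated, control):
    """One-tailed exact p-value from every possible split into two groups."""
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(control)
    total = sum(pooled)
    observed = sum(treated) / n_treated - sum(control) / n_control

    n_splits = 0
    n_extreme = 0
    for indices in combinations(range(len(pooled)), n_treated):
        treated_sum = sum(pooled[i] for i in indices)
        diff = treated_sum / n_treated - (total - treated_sum) / n_control
        n_splits += 1
        # Small tolerance so that ties with the observed value are counted.
        if diff >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_splits, n_splits


# Hypothetical sucrose consumption values, for illustration only.
control = [1.0, 2.0, 3.0, 2.0, 1.0]
treated = [3.0, 4.0, 5.0, 4.0, 6.0]

exact_p, n_splits = exact_permutation_p_value(treated, control)
print(f"{n_splits} splits, exact p-value: {exact_p:.4f}")
```

The loop visits every split exactly once, so its running time grows with the binomial coefficient computed earlier.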

However, this isn't often necessary.
The estimate of the null distribution given by 
10,000 or 100,000 or 1,000,000 random samples
is usually more than adequate to determine the $p$-value,
and any errors caused by slight deviations
from the true null distribution are in general swamped
by the randomness in the $p$-value caused by randomness in the data.
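
To get a feel for how small those errors are, we can treat each random sample as a coin flip that lands "at least as extreme" with probability $p$. Under that assumption, the standard error of the estimated $p$-value is $\sqrt{p(1-p)/N}$ for $N$ random samples. The value of `true_p` below is hypothetical.

```python
import math


def monte_carlo_standard_error(p_value, n_resamples):
    """Approximate standard error of a p-value estimated from random resamples."""
    return math.sqrt(p_value * (1 - p_value) / n_resamples)


true_p = 0.05  # hypothetical p-value under the exact null distribution
standard_errors = {
    n: monte_carlo_standard_error(true_p, n)
    for n in (10_000, 100_000, 1_000_000)
}
for n_resamples, se in standard_errors.items():
    print(f"{n_resamples:>9} resamples: standard error {se:.5f}")
```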

### Third Pass: Student's $t$

The randomization test described above
is applicable in any scenario,
even when we know nothing about the data
except what we measured.

If we do know more about the null distribution of the data,
for example if we know its shape,
then we can infer facts about the null distribution of the test statistic.
We can use that information to design better statistical tests --
tests with more power and lower false positive rates
using less data.

The most common shape assumed for distributions is
Gaussian,
thanks in part to the Central Limit Theorem,
as discussed in the section on Inferential Statistics.

If we assume that the data is distributed as a Gaussian, then
[it can be shown](https://www.nature.com/nmeth/journal/v11/n3/full/nmeth.2858.html)
that the null distribution of the difference of two means, divided by its estimated standard error, is a distribution called the
[Student's $t$-distribution](http://www.nature.com/nmeth/journal/v10/n11/full/nmeth.2698.html).
This distribution is *almost* a Gaussian --
and gets closer and closer to a Gaussian as the size of the samples increases --
but has a slightly higher propensity to produce outliers.
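
One way to see the heavier tails is to compare the probability of landing more than 3 units above zero under each distribution, using `scipy.stats`. The threshold and the degrees of freedom are arbitrary choices for illustration; for the equal-variance two-sample test, the degrees of freedom are the total number of flies minus 2.

```python
from scipy import stats

threshold = 3.0
gaussian_tail = stats.norm.sf(threshold)  # P(Z > 3) for a standard Gaussian

degrees_of_freedom = (4, 8, 28, 98)
t_tails = [stats.t.sf(threshold, df) for df in degrees_of_freedom]

for df, t_tail in zip(degrees_of_freedom, t_tails):
    print(f"df = {df:>2}: t tail = {t_tail:.5f}, Gaussian tail = {gaussian_tail:.5f}")
```

The $t$ tail is always the larger of the two, and the gap shrinks as the degrees of freedom grow.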

The Student's $t$ test, in its usual form,
assumes that both groups have the same variance.
Though it is fairly robust to violations of this assumption --
meaning that the power and false positive rate
don't rapidly change as the variances become more different --
it is important to remember that this assumption is being made.
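
In `scipy`, both the equal-variance test and a variant that drops that assumption (Welch's $t$-test) are available through `scipy.stats.ttest_ind`, switched by the `equal_var` argument. The data below are hypothetical.

```python
from scipy import stats

# Hypothetical sucrose consumption values, for illustration only.
control = [1.0, 2.0, 3.0, 2.0, 1.0]
treated = [3.0, 4.0, 5.0, 4.0, 6.0]

# Student's t-test: assumes both groups have the same variance.
student = stats.ttest_ind(treated, control, equal_var=True)

# Welch's t-test: does not assume equal variances.
welch = stats.ttest_ind(treated, control, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```

Both calls return two-tailed $p$-values by default.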

## Paired Tests

Consider an experiment much like the one described above,
but instead of splitting our flies into two groups,
one of which receives octopamine treatment
and one of which does not,
we keep our flies in a single group and
measure sucrose consumption before
and after administration of octopamine.

This is a better experimental design than the one proposed above,
because instead of asking the question
"do flies exposed to octopamine consume more sucrose?"
it asks the question
"does exposing a fly to octopamine cause it to consume more sucrose?",
which is much closer to the scientific question we are asking.

However, it will require a re-thinking of our statistical tests.
Because all of our observations are made as pairs
-- one measurement before octopamine and one measurement after --
these new tests will be called *paired* tests.

Let's walk through this process for the randomization test.

In designing our randomization test,
we took advantage of the fact that, under the null hypothesis,
we could swap measurements between groups.
Now, there is a measurement in each group from the same fly,
so we can't just swap them willy-nilly.

That's because
under our new null hypothesis,
instead of the group labels of the flies being assigned randomly,
which of the two measurements
*from the same fly*
is labeled as "control" and which is labeled "experiment"
is determined randomly.

This gives us a new randomization procedure:
instead of randomly re-assigning flies to the two groups
and calculating the test statistic,
we randomly swap the labels "control" and "experiment"
on the measurements from some of our flies
and calculate the test statistic.
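
Swapping the two labels for one fly is the same as flipping the sign of that fly's difference, which suggests the following sketch. The function name `paired_randomization_test`, the one-tailed choice, and the data are assumptions made for illustration.

```python
import numpy as np


def paired_randomization_test(before, after, n_resamples=10_000, seed=0):
    """One-tailed paired randomization test on the mean difference."""
    rng = np.random.default_rng(seed)
    differences = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    observed = differences.mean()

    # Each row is one random re-labeling: -1 swaps a fly's two labels,
    # +1 leaves them as they were.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, len(differences)))
    null_samples = (signs * differences).mean(axis=1)

    # Small tolerance so that ties with the observed value are counted.
    p_value = np.mean(null_samples >= observed - 1e-12)
    return observed, null_samples, p_value


# Hypothetical measurements from six flies, for illustration only.
before = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0]
after = [3.0, 3.0, 5.0, 4.0, 2.0, 4.0]

observed, null_samples, p_value = paired_randomization_test(before, after)
print(f"observed mean difference: {observed:.3f}, p-value: {p_value:.4f}")
```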

It's helpful to consider what the results of our randomization test
would look like in two cases:
first, let's imagine that, for every fly,
its sucrose consumption is higher in the experimental condition
than in the control condition.
If we randomly switch around some of the measurements and then recompute
the differences, some of them will now be negative instead of positive,
and the value of the test statistic will go down.
Only the original labeling gives a test statistic as extreme as the one we observed,
so the $p$ value will be as small as it can possibly be: with $n$ flies,
just 1 in $2^n$, which is essentially 0 for a reasonable number of flies!

Notice that we only had to assume that *for each fly*,
the sucrose consumption was higher in the experimental condition
than in the control condition, and we got the smallest possible $p$ value.
If we'd been doing our original randomization test,
where we weren't tracking the differences,
then we'd only get the smallest possible $p$ value if
every fly's sucrose consumption in the experimental condition was higher
than that of every fly in the control condition.
This is a much stronger assumption!

From this example, we can see the greater power of
tests that keep track of differences,
also known as *paired* tests,
relative to tests that do not,
also known as *unpaired* tests.
With a paired test,
we don't lose some of our power
when some flies just have a higher or lower
baseline sucrose consumption than other flies.
In statistical lingo,
the paired design
*controls for individual differences*
and so lets us observe smaller differences
with the same sample size.
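
A small exact calculation illustrates the point. In the hypothetical data below, baseline consumption varies a lot from fly to fly, but every fly eats exactly one unit more after treatment. Because there are only six flies, we can enumerate every re-labeling instead of sampling them.

```python
from itertools import combinations, product

# Hypothetical data: large individual differences, small consistent effect.
before = [1, 5, 9, 13, 17, 21]
after = [b + 1 for b in before]
n = len(before)

# Paired test: enumerate all 2**n ways to swap labels within flies.
differences = [a - b for a, b in zip(after, before)]
observed_paired = sum(differences) / n
paired_stats = [
    sum(sign * d for sign, d in zip(signs, differences)) / n
    for signs in product([1, -1], repeat=n)
]
paired_p = sum(stat >= observed_paired for stat in paired_stats) / len(paired_stats)

# Unpaired test: enumerate all ways to split the pooled values into two groups.
pooled = after + before
total = sum(pooled)
observed_unpaired = sum(after) / n - sum(before) / n
unpaired_stats = []
for indices in combinations(range(2 * n), n):
    group_sum = sum(pooled[i] for i in indices)
    unpaired_stats.append(group_sum / n - (total - group_sum) / n)
unpaired_p = sum(stat >= observed_unpaired for stat in unpaired_stats) / len(unpaired_stats)

print(f"paired p-value:   {paired_p:.4f}")
print(f"unpaired p-value: {unpaired_p:.4f}")
```

The paired test reaches the smallest $p$-value available with six flies, while the unpaired test cannot separate the one-unit effect from the spread in baselines.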

Now, let's imagine that, for half the flies,
sucrose consumption is higher in the experimental condition,
and vice versa for the other half.
Now, when we start randomly swapping labels for flies,
sometimes the test statistic will go up,
and sometimes the test statistic will go down.
Once we've randomly swapped values enough times to get
a good estimate of the null distribution,
we'll discover that roughly half of the test statistic values are lower,
and half are higher,
than the value we originally observed.
If we were doing a one-tailed test,
we'd end up with a $p$-value of around 0.5,
whereas a two-tailed test
would give us a $p$ value of around 1.
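
The same kind of enumeration works for this second case. The differences below are hypothetical: half are positive, half are negative, and they cancel exactly. Ties with the observed value count as "at least as extreme", which pushes the one-tailed $p$-value slightly above 0.5.

```python
from itertools import product

# Hypothetical per-fly differences (experimental minus control).
differences = [3, -3, 2, -2, 1, -1]
n = len(differences)
observed = sum(differences) / n

# Enumerate every way of swapping labels within flies.
null_stats = [
    sum(sign * d for sign, d in zip(signs, differences)) / n
    for signs in product([1, -1], repeat=n)
]

one_tailed_p = sum(stat >= observed for stat in null_stats) / len(null_stats)
two_tailed_p = sum(abs(stat) >= abs(observed) for stat in null_stats) / len(null_stats)
print(f"one-tailed: {one_tailed_p:.3f}, two-tailed: {two_tailed_p:.3f}")
```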

There is also a paired version of the $t$-test,
in which the average of the within-fly differences between the two conditions
is tested against zero,
rather than the difference of the two group averages.
Just like the unpaired $t$-test,
it's a good choice if we can assume that our data is distributed as a Gaussian.
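
In `scipy`, the paired $t$-test is `scipy.stats.ttest_rel`. As the sketch below shows, it gives the same answer as a one-sample $t$-test (`scipy.stats.ttest_1samp`) of the per-fly differences against zero. The measurements are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements from six flies, for illustration only.
before = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 2.0])
after = np.array([3.0, 3.0, 5.0, 4.0, 2.0, 4.0])

# Paired t-test on the two sets of measurements.
paired = stats.ttest_rel(after, before)

# Equivalent one-sample t-test on the differences.
one_sample = stats.ttest_1samp(after - before, popmean=0.0)

print(f"paired:     t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"one-sample: t = {one_sample.statistic:.3f}, p = {one_sample.pvalue:.4f}")
```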

## Going Hands-On

For more on two-sample tests,
including the chance to apply these tests
to real data from the experiment described above,
check out the lab notebooks in this folder!