# Question 0: Python Tools for Resampling

In this question, we will develop general, higher-performance tools for resampling and bootstrap hypothesis testing using `numpy`.

General-purpose tools are nice so that we won't have to keep re-writing functions to bootstrap the mean vs. the min, say.

Performance is important in bootstrapping (and for the power analysis that will be the subject of the next lecture), because in these applications we will be simulating things tens of thousands of times over.

Let's tackle performance first, using some tools from IPython to compare the speed of various operations using `numpy` vs. basic Python.

A very useful IPython "magic" command for analyzing performance is `%timeit`. (Note for advanced users: "magic" commands are available in the Notebook, or by running `ipython` from the command-line, but not via the plain-vanilla `python` interpreter.)

Try running the cell below to see how `%timeit` works. (Ignore any warnings about "The slowest run took xx times longer than the fastest": these are usually false positives, and certainly so in simple cases like these...)

In [None]:
print('How long to make a list?')
%timeit a = [1,2,3]
print()

print('How long to add numbers?')
a = 5
b = 6
%timeit a + b
print()

print('How long for a simple loop?')
def loop():
    a = []
    for i in range(100000):
        a.append(i)

%timeit loop()

As you can see, `%timeit` will carefully time fast things (that take nanoseconds) and slow things (taking milliseconds or longer).
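
For scripts run with the plain `python` interpreter, where `%timeit` is unavailable, the standard-library `timeit` module does a similar job. The sketch below is one way to time the same `loop` function; the choice of 20 runs is arbitrary, and unlike `%timeit`, `timeit.timeit` does not choose the number of runs automatically.

```python
import timeit

def loop():
    a = []
    for i in range(100000):
        a.append(i)

# timeit.timeit calls `loop` n_runs times and returns the TOTAL time in seconds
n_runs = 20
total_seconds = timeit.timeit(loop, number=n_runs)
per_call_ms = 1000 * total_seconds / n_runs
print(f'{per_call_ms:.3f} ms per call (average over {n_runs} runs)')
```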

First, let's figure out the best way to resample data. Remember that resampling involves drawing a new dataset, the same size as the old one, with replacement. The Python module `random` has a function `choices` that we previously used for this task, to draw elements with replacement from a list:
```python
import random
a = [1,2,3,4,5,6]
random.choices(a, k=len(a))
```

It happens that `numpy` also has a similar function, adapted to work not on lists, but on `numpy` arrays of numbers:
```python
import numpy
a = numpy.array([1,2,3,4,5,6])
numpy.random.choice(a, size=len(a), replace=True)
```

There are a few things to note:
  1. `numpy.random.choice` can sample with or without replacement, controlled by the `replace` parameter.
  2. The number of points to draw is controlled by the `size` parameter in `numpy.random.choice`, vs. the `k` parameter in `random.choices`.
  3. `numpy` functions, such as `numpy.random.choice`, can take a list as input, and will internally convert that list into an array. But if you're going to do something a lot of times over, it makes more sense to convert the list to an array once at the beginning, rather than each time through...
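
To make the first point concrete, here is a small sketch of what the `replace` parameter changes. The variable names are just for illustration.

```python
import numpy

a = numpy.array([1, 2, 3, 4, 5, 6])

# With replacement: a proper resample, so values may repeat or be missing
with_replacement = numpy.random.choice(a, size=len(a), replace=True)

# Without replacement: drawing len(a) values just shuffles the data,
# so every original value appears exactly once
without_replacement = numpy.random.choice(a, size=len(a), replace=False)

print(with_replacement)
print(without_replacement)
print(sorted(without_replacement.tolist()) == a.tolist())  # prints True
```

Only the `replace=True` version is a bootstrap resample; a shuffle always has the same mean, min, and so on as the original data.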
 
## Question 0.1
In the cell below, use `%timeit` to compare the speed of three conditions: First, `random.choices` applied to a list; second, `numpy.random.choice` applied to a list; and third, `numpy.random.choice` applied to an array. Use the provided `data_list` and `data_array` below, and make sure to draw proper resamples of the right size, and with replacement.

In the cell below that, write out which is the fastest, and approximately the fold-speedup between slowest and fastest (rounded to the nearest 5).

In [None]:
import numpy
import random
data_list = list(range(1000))
data_array = numpy.array(data_list)
# YOUR ANSWER HERE

# Note that the difference in speed between numpy.random.choice applied to a list
# vs. to an array should be *approximately* the amount of time it takes to turn a
# list into a numpy array. Test this:
print('time to convert list to array:')
%timeit data_array = numpy.array(data_list)

YOUR ANSWER HERE

A general performance rule is that "if you are working with numbers, use numpy". Another rule is that you should try to avoid looping through a `numpy` array with an explicit for loop. Instead, rely on `numpy` functions that operate on all the entries of the array at once.

The function you wrote in the last stats homework to calculate the fraction of data points more extreme (i.e. a p-value) is a good example of looping through a list of numbers to calculate something.

## Question 0.2
Write a similar function, below, called `p_value_from_list`. (In the next question, we'll make it work faster with `numpy`.) This function will take four parameters:
  - `actual_stat`: the actual value of some statistic (like the mean of a dataset)
  - `resampled_stats`: a list of values for the statistic after resampling *under the null hypothesis*.
  - `null_hyp_stat`: the value of the statistic if the null hypothesis were true (often zero, which we will actually use as the default value of this parameter).
  - `two_tailed`: if True (default), count differences as or more extreme in *either direction* from the null hypothesis. Otherwise only count differences as or more extreme in the same direction from the null hypothesis as `actual_stat`.

Return the p-value. Note that you cannot assume that `actual_stat` is greater or less than `null_hyp_stat`. Make sure to handle all cases properly. For the two-tailed case, you will probably want to use the `abs` function to calculate the absolute value of the distance between `actual_stat` and `null_hyp_stat` and so forth.

Note: it's better Python style to test if something like `two_tailed` is `True` as follows:
```python
if two_tailed:
    do_something()
```
compared to the more redundant:
```python
if two_tailed == True:
    do_something()
```


In [None]:
def p_value_from_list(actual_stat, resampled_stats, null_hyp_stat=0, two_tailed=True):
    # YOUR ANSWER HERE

In [None]:
# 100,000 numbers uniformly spaced from 0 to 1
resampled_stats = list(numpy.linspace(0, 1, 100000))

# 5% of the data should be as far or farther from 0.5 (in either direction) than is 0.975
assert p_value_from_list(0.975, resampled_stats, null_hyp_stat=0.5) == 0.05

# 2.5% of the data should be as far or farther from 0.5 (in the same direction) than is 0.975
assert p_value_from_list(0.975, resampled_stats, null_hyp_stat=0.5, two_tailed=False) == 0.025

# 5% of the data should be as far or farther from 0.5 (in either direction) than is 0.025
assert p_value_from_list(0.025, resampled_stats, null_hyp_stat=0.5) == 0.05

# 2.5% of the data should be as far or farther from 0.5 (in the same direction) than is 0.025
assert p_value_from_list(0.025, resampled_stats, null_hyp_stat=0.5, two_tailed=False) == 0.025

# 100,000 numbers uniformly spaced from -1 to 1
resampled_stats = list(numpy.linspace(-1, 1, 100000))

# 5% of the data should be as far or farther from 0 (in either direction) than is 0.95
assert p_value_from_list(0.95, resampled_stats) == 0.05

# 2.5% of the data should be as far or farther from 0 (in the same direction) than is 0.95
assert p_value_from_list(0.95, resampled_stats, two_tailed=False) == 0.025

# 5% of the data should be as far or farther from 0 (in either direction) than is -0.95
assert p_value_from_list(-0.95, resampled_stats) == 0.05

# 2.5% of the data should be as far or farther from 0 (in the same direction) than is -0.95
assert p_value_from_list(-0.95, resampled_stats, two_tailed=False) == 0.025

print('counting time:')
%timeit p_value_from_list(0.975, resampled_stats, null_hyp_stat=0.5)

One thing you can do with `numpy` arrays is add two arrays together:
```python
a = numpy.array([1,2,3])
b = numpy.array([2,4,6])
a + b # is an array of [3, 6, 9]
```
As you can see (and test this out / play with similar commands in another notebook!), the addition proceeds element-wise: the first element in `a` is added to the first in `b`, and so forth. Multiplying, dividing, &c. work similarly.

It turns out that numpy will also allow you to add/subtract/divide/whatnot a single number (i.e. a "scalar") to an array:
```python
a = numpy.array([1,2,3])
a + 2 # is an array of [3, 4, 5]
```

One convenient feature of numpy is that you can *also* use comparisons in the same way:
```python
a = numpy.array([1,2,3])
a > 2 # is an array of [False, False, True]
```

Now, how can we get a count of how many True values there are in an array? If you type `numpy.count` and then press TAB, to see what `numpy` functions have names starting with `count`, you will see exactly one: `numpy.count_nonzero`. Reading the documentation, you will see that what this function does is count how many elements are non-zero **or** not False. (Remember that `0` acts the same as `False` in many cases... `if 0:` is the same as `if False:` -- whatever is in the body of that if-statement will never run.)

So:
```python
a = numpy.array([1,2,3])
numpy.count_nonzero(a > 2) # returns 1
numpy.count_nonzero(a >= 2) # returns 2
```
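
As a small illustration of why this matters, the sketch below counts the values above a threshold twice: once with an explicit loop, and once with a comparison plus `numpy.count_nonzero`. The data and the `threshold` name are made up for the example.

```python
import numpy

values = numpy.array([0.5, 1.5, 2.5, 3.5])
threshold = 2

# Explicit loop: visit every element in Python
loop_count = 0
for v in values:
    if v > threshold:
        loop_count += 1

# Vectorized: one comparison over the whole array, then count the True entries
vector_count = numpy.count_nonzero(values > threshold)

print(loop_count, vector_count)    # prints 2 2
print(vector_count / values.size)  # prints 0.5
```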

## Question 0.3
Rewrite `p_value_from_list` as `p_value`, using addition/subtraction/comparisons of all of the elements in the `resampled_stats` array at once, and `numpy.count_nonzero`. (There should be no for loops!) Note that you can use `numpy.abs` to get the absolute value of every element in an array.

One issue: how do we know that `resampled_stats` is an array? If we do something like `a + 6`, we will get an error if `a` is `[1,2,3]`, but we will get what we want if `a` is `numpy.array([1,2,3])`. The standard approach to making sure that an input is always an array is to do this:
```python
def square_array(data):
    data = numpy.asarray(data)
    return data**2
```

What `numpy.asarray` does is (a) convert its input to an array if the input is not an array, or (b) return the input without doing anything if its input is an array.

This differs slightly from `numpy.array`, which (a) converts its input to an array if the input is not an array, or (b) copies the input to a new array if the input is an array. (This is a lot like what the Python function `list` does.)

When dealing with large arrays, it can be slow to copy the contents to a new array. So generally `numpy.asarray` is the right approach.
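
The difference is easy to check directly. This is a minimal sketch; the variable names are arbitrary.

```python
import numpy

original = numpy.array([1, 2, 3])

same = numpy.asarray(original)  # already an array: returned as-is
copy = numpy.array(original)    # already an array: copied anyway

print(same is original)  # prints True
print(copy is original)  # prints False

# Changing the copy leaves the original untouched
copy[0] = 100
print(original[0])  # prints 1

# Both functions convert a list to a new array
from_list = numpy.asarray([1, 2, 3])
print(type(from_list).__name__)  # prints ndarray
```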

**Tip:** `arr.size` gives the number of elements in an array. (If you're sure that the array is 1-dimensional, this is equivalent to `len(arr)` or `arr.shape[0]`.) 

In [None]:
def p_value(actual_stat, resampled_stats, null_hyp_stat=0, two_tailed=True):
    resampled_stats = numpy.asarray(resampled_stats)
    # YOUR ANSWER HERE

In [None]:
# 100,000 numbers uniformly spaced from 0 to 1
resampled_stats = numpy.linspace(0, 1, 100000)

# 5% of the data should be as far or farther from 0.5 (in either direction) than is 0.975
assert p_value(0.975, resampled_stats, null_hyp_stat=0.5) == 0.05

# 2.5% of the data should be as far or farther from 0.5 (in the same direction) than is 0.975
assert p_value(0.975, resampled_stats, null_hyp_stat=0.5, two_tailed=False) == 0.025

# 5% of the data should be as far or farther from 0.5 (in either direction) than is 0.025
assert p_value(0.025, resampled_stats, null_hyp_stat=0.5) == 0.05

# 2.5% of the data should be as far or farther from 0.5 (in the same direction) than is 0.025
assert p_value(0.025, resampled_stats, null_hyp_stat=0.5, two_tailed=False) == 0.025

# 100,000 numbers uniformly spaced from -1 to 1
resampled_stats = list(numpy.linspace(-1, 1, 100000))

# 5% of the data should be as far or farther from 0 (in either direction) than is 0.95
assert p_value(0.95, resampled_stats) == 0.05

# 2.5% of the data should be as far or farther from 0 (in the same direction) than is 0.95
assert p_value(0.95, resampled_stats, two_tailed=False) == 0.025

# 5% of the data should be as far or farther from 0 (in either direction) than is -0.95
assert p_value(-0.95, resampled_stats) == 0.05

# 2.5% of the data should be as far or farther from 0 (in the same direction) than is -0.95
assert p_value(-0.95, resampled_stats, two_tailed=False) == 0.025

print('counting time:')
%timeit p_value(0.975, resampled_stats, null_hyp_stat=0.5)

So, the `numpy` version should be around 5-10× faster. 

Though, the astute among you might notice that even the "slow" version should take only a handful of milliseconds to complete. You'd have to calculate thousands of p-values before you'd ever notice that the "slow" version is actually bothersome. Well, in the next problem set when doing power analysis, we will do just that!

## Question 0.4

Next, let's make a very general `resample` function. Previously we would have had to write different `resample_mean` and `resample_min` functions, even though absolutely everything about the function was the same except for calling the `mean` or `min` function:
```python
def mean(data):
    return sum(data)/len(data)

def resample_means(data, n_resamples):
    means = []
    for i in range(n_resamples):
        resampled_data = random.choices(data, k=len(data))
        means.append(mean(resampled_data))
    return means

def resample_mins(data, n_resamples):
    mins = []
    for i in range(n_resamples):
        resampled_data = random.choices(data, k=len(data))
        mins.append(min(resampled_data))
    return mins
```

If you remember the last programming lecture, however, you know that you can store a function (either one that we wrote, like `mean`, or a Python built-in, like `min`) in a variable and pass that to another function. Rewrite the above into a generic `resample_1sample` function that can resample values for any statistic (i.e. any Python function) that is calculated from a single dataset (such as `min`, `mean`, &c.)
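
As a quick reminder of how passing functions around works, here is a small sketch. The `describe` and `value_range` functions are hypothetical helpers invented for this example; they are not part of the assignment.

```python
def describe(data, statistic):
    # `statistic` is just a variable that happens to hold a function,
    # so we can call it like any other function
    return statistic(data)

def value_range(data):
    return max(data) - min(data)

stat = min  # store a built-in function in a variable (note: no parentheses)
print(describe([3, 1, 2], stat))         # prints 1
print(describe([3, 1, 2], value_range))  # prints 2
```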

Your `resample_1sample` function will take three parameters: `data` (which may be a list or a numpy array); `statistic`, which will be any Python function that takes a numpy array and returns a single number; and `n_resamples`, the number of resamples to perform. The function should return a list of `n_resamples` values, each obtained by calling `statistic` on a fresh resample of the input data.

Note that even in a `numpy` world, it's still fine to incrementally build up a list in a for loop like the above functions do. There's no good way to avoid a for loop in this case. However, let's not use `random.choices` like in the functions above, given what we learned earlier in this question... And remember that `numpy.random.choice` works faster on a `numpy` array, so make sure to turn the data into a numpy array with `numpy.asarray` before doing anything else.


In [None]:
def resample_1sample(data, statistic, n_resamples):
    # YOUR ANSWER HERE

# simple usage example
print(resample_1sample([1,2,3,4,5], min, n_resamples=10))
print()

# now let's compare speed
def mean(data):
    return sum(data) / len(data)

# make up 1000 fake data points
data = numpy.linspace(0, 10, 1000)

def resample_means(data, n_resamples):
    means = []
    for i in range(n_resamples):
        resampled_data = random.choices(data, k=len(data))
        means.append(mean(resampled_data))
    return means

print('Time using resample_means:')
%timeit resample_means(data, n_resamples=1000)
print()

print('Time using resample_1sample with the above mean function defined in python:')
%timeit resample_1sample(data, mean, n_resamples=1000)
print()

print('Time using resample_1sample with numpy.mean, which is optimized for numpy arrays:')
%timeit resample_1sample(data, numpy.mean, n_resamples=1000)




In [None]:
mins = resample_1sample([1,2,3,4,5], min, n_resamples=10000)
# there is a (4/5)**5, or about 32.7%, chance that any given resample doesn't include 1.
# So 32.7% of the time the min should be > 1. 
expected = 10000 * (4/5)**5
actual = numpy.count_nonzero(numpy.array(mins) > 1)
print(actual, expected)
assert abs(actual - expected) < 200

data = numpy.linspace(0, 10, 1000)
means = resample_1sample(data, numpy.mean, n_resamples=10000)
print(min(means), max(means))
assert min(means) > 4.4
assert max(means) < 5.5

So, that's a generic 1-sample resampling function. But we also have lots of two-sample statistics we might be interested in! For example, the difference in means, or the t-statistic.

Remember that the t-statistic is defined as:
$$ t = k(n_1, n_2)\cdot\frac{\mu_1 - \mu_2}{\sqrt{n_1\cdot\sigma^2_1 + n_2\cdot\sigma^2_2 }}$$

where $\mu_i$ is the mean of dataset i, $n_i$ is the number of elements in that dataset, and $\sigma^2_i$ is its variance. As $k$ depends only on the $n_i$, it won't ever change within our resampling runs. Thus we will ignore it for our purposes. (It's critical for comparing t-values between experiments, but we don't need to do that here.)
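
As a worked example, take the datasets $[1, 2, 3]$ and $[4, 5, 6]$, and assume $\sigma^2_i$ is the population variance (which is what `numpy.var` computes by default). Then $\mu_1 = 2$, $\mu_2 = 5$, $n_1 = n_2 = 3$, and $\sigma^2_1 = \sigma^2_2 = 2/3$, so ignoring $k$:
$$ \frac{t}{k(n_1, n_2)} = \frac{\mu_1 - \mu_2}{\sqrt{n_1\cdot\sigma^2_1 + n_2\cdot\sigma^2_2 }} = \frac{2 - 5}{\sqrt{3\cdot\frac{2}{3} + 3\cdot\frac{2}{3}}} = \frac{-3}{\sqrt{4}} = -1.5 $$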

Let's define the difference in means as simply $\mu_1 - \mu_2$ (as opposed to $\mu_2 - \mu_1$).

## Question 0.5
Use the functions `numpy.mean` and `numpy.var` to write `mean_difference` and `t_stat` functions, and then write a `resample_2sample` function to work with them. Note that the difference in means and the t-statistic can be negative or positive.

Unlike the `statistic` functions that we passed to `resample_1sample` above, which operate on a single array of data (like `numpy.mean` for example, which gets called as `numpy.mean(data)`), the functions we provide to `resample_2sample` will take two different arrays and will get called as e.g. `t_stat(data1, data2)`.

In [None]:
def mean_difference(data1, data2):
    # YOUR ANSWER HERE

def t_stat(data1, data2):
    # YOUR ANSWER HERE

def resample_2sample(data1, data2, statistic, n_resamples):
    # YOUR ANSWER HERE

data1 = numpy.random.normal(loc=1, scale=0.1, size=10000)
data2 = numpy.random.normal(loc=0, scale=0.1, size=10000)
print(mean_difference(data1, data2), t_stat(data1, data2))

data3 = numpy.random.normal(loc=1, scale=2, size=10000)
data4 = numpy.random.normal(loc=0, scale=2, size=10000)
print(mean_difference(data3, data4), t_stat(data3, data4))

print(resample_2sample(data3, data4, t_stat, n_resamples=5))

In [None]:
assert mean_difference([1,2,3], [4,5,6]) == -mean_difference([4,5,6], [1,2,3])
assert mean_difference([1,2,3], [4,5,6]) == -3
assert t_stat([1,2,3], [4,5,6]) == -1.5
assert t_stat([0,2,4], [3,5,7]) == -0.75 # larger spread = smaller t
assert t_stat([1,2,3], [7,8,9]) == -3 # larger difference in means = larger t

mean_diffs = resample_2sample([1,1,1], [2,2,2], mean_difference, n_resamples=100)
assert len(mean_diffs) == 100
assert numpy.count_nonzero(numpy.array(mean_diffs) == -1) == 100

data3 = numpy.random.normal(loc=1, scale=2, size=1000)
data4 = numpy.random.normal(loc=0, scale=2, size=3000)
t_stats = resample_2sample(data3, data4, t_stat, n_resamples=5000)
print(min(t_stats), max(t_stats))
assert min(t_stats) > 0.003
assert max(t_stats) < 0.013

You may have noticed that `resample_1sample` and `resample_2sample` looked pretty similar. And you can probably imagine that `resample_3sample` and `resample_4sample` would very much follow the same template. How could we make a totally-generic `resample` function that could take an arbitrary number of samples?

To do this, we need to learn one special bit of syntax in Python. Recall list and tuple unpacking:
```python
a = [5, 6, 7]

# list unpacking is convenient:
first, middle, last = a

# this is inconvenient:
first = a[0]
middle = a[1]
last = a[2]
```

What if we had a three-parameter function and a three-element list? Is there a way to "unpack" that list into the function's parameters?
```python
def do_something(first, middle, last):
    return (first + last) / middle

a = [5, 6, 7]
# can we do better than:
do_something(a[0], a[1], a[2])

# yes we can!
do_something(*a)
```

That `*a` syntax means "unpack the list `a` into the individual parameters that `do_something` requires". If `a` is too long or too short, an error is raised, just as it would be if you did `do_something(1, 2)` or `do_something(1, 2, 3, 4)`.
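
The sketch below shows both outcomes: a list of the right length unpacks cleanly, while a list that is too short raises a `TypeError`. The `raised` flag is only there to make the result easy to inspect.

```python
def do_something(first, middle, last):
    return (first + last) / middle

a = [5, 6, 7]
result = do_something(*a)  # same as do_something(5, 6, 7)
print(result)  # prints 2.0

# A list of the wrong length fails exactly like a call with missing arguments
too_short = [1, 2]
try:
    do_something(*too_short)
    raised = False
except TypeError as error:
    raised = True
    print('TypeError:', error)
```

Unpacking works the same way with tuples, so `do_something(*(5, 6, 7))` is equivalent.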

## Question 0.6
Use this syntax to write a fully-general `resample` function, that takes a list of `data_sets` in addition to the statistic to use and the number of resamples. For each resampling run, resample each of the datasets, and then call the statistic on all of the resampled datasets. For example, each of the following should work:
```python
resample([data], numpy.mean, n_resamples=1000)
resample([data1, data2], t_stat, n_resamples=1000)

def three_sample_stat(data1, data2, data3):
    # just silly
    return numpy.mean(data1) - numpy.mean(data2) + numpy.var(data3)

resample([data1, data2, data3], three_sample_stat, n_resamples=1000)
```

In [None]:
def resample(data_sets, statistic, n_resamples):
    # turn every dataset in `data_sets` into a `numpy` array 
    data_sets = [numpy.asarray(data) for data in data_sets]
    # YOUR ANSWER HERE

data1 = numpy.random.normal(loc=1, scale=0.1, size=1000)
data2 = numpy.random.normal(loc=0, scale=0.1, size=1000)

means = resample([data1], numpy.mean, n_resamples=1000)
t_stats = resample([data1, data2], t_stat, n_resamples=1000)

In [None]:
mins = resample([[1,2,3,4,5]], min, n_resamples=10000)
# there is a (4/5)**5, or about 32.7%, chance that any given resample doesn't include 1.
# So 32.7% of the time the min should be > 1. 
expected = 10000 * (4/5)**5
actual = numpy.count_nonzero(numpy.array(mins) > 1)
assert abs(actual - expected) < 200

data = numpy.linspace(0, 10, 1000)
means = resample([data], numpy.mean, n_resamples=10000)
assert min(means) > 4.4
assert max(means) < 5.5

mean_diffs = resample([[1,1,1], [2,2,2]], mean_difference, n_resamples=100)
assert len(mean_diffs) == 100
assert numpy.count_nonzero(numpy.array(mean_diffs) == -1) == 100

data3 = numpy.random.normal(loc=1, scale=2, size=1000)
data4 = numpy.random.normal(loc=0, scale=2, size=3000)
t_stats = resample([data3, data4], t_stat, n_resamples=5000)
assert min(t_stats) > 0.003
assert max(t_stats) < 0.013

Now we know how to call a function with a variable number of arguments: just put them into a list and use the `*` syntax:
```python
a = [1, 2, 3]
print(*a)
```

Note also that some functions, such as `print` above, can *receive* a variable number of arguments! `print('hello')` works, as does `print('hello', 2, 'goodbye')`, and so forth. As yet, we don't know how to write such a function. Fortunately, the `*` syntax is even more versatile, and can be used to define a function that takes zero or more arguments:
```python
def var_args(*args):
    print(type(args), len(args))
    print(args)

var_args()
# prints:
# <class 'tuple'> 0
# ()

var_args(1)
# prints:
# <class 'tuple'> 1
# (1,)

var_args(1, 2, 3)
# prints:
# <class 'tuple'> 3
# (1, 2, 3)

a = [1, 2, 3]
var_args(*a)
# prints:
# <class 'tuple'> 3
# (1, 2, 3)
```

So, as you can see, defining a function that takes a `*` argument (which can be named anything, though `*args` is traditional) allows that function to be called with a variable number of arguments. Inside the function, the arguments are all put into a single tuple containing the arguments the function was called with.
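
The two uses of `*` combine naturally. In this sketch, `count_points` is a hypothetical helper (not part of the assignment) that accepts any number of datasets and can be called either with separate arguments or with an unpacked list.

```python
def count_points(*data_sets):
    # `data_sets` is a tuple holding every positional argument
    return sum(len(data) for data in data_sets)

print(count_points([1, 2, 3], [4, 5]))  # prints 5

groups = [[1, 2, 3], [4, 5], [6]]
print(count_points(*groups))  # prints 6

print(count_points())  # prints 0
```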

In class, we talked about a statistic that can work with an arbitrary number of data groups: the F-statistic, which drives ANOVA. For a given number of data groups, the F-statistic concerns the relationship between three sums:
1. The total sum of squared differences
$$ SS_{\mathrm{total}} = \sum_i \sum_j (\mathrm{data}_{ij} - \mu)^2 $$
2. The within-group sum of squared differences
$$ SS_{\mathrm{within}} = \sum_i \sum_j (\mathrm{data}_{ij} - \mu_i)^2 $$
3. The between-group sum of squared differences
$$ SS_{\mathrm{between}} = \sum_i n_i \cdot (\mu_i - \mu)^2 $$
Where:
  - the indices *j* refer to the different data points in each group *i*
  - $\mathrm{data}_{ij}$ is the *j*th element of group *i*
  - $\mu_i$ is the mean of data group *i*
  - $n_i$ is the number of data points in group *i*
  - $\mu$ is the grand mean of all the data points together. 

Recall that these sums of squares are related as: $SS_{\mathrm{total}} = SS_{\mathrm{within}} + SS_{\mathrm{between}}$.

The F-statistic is:
$$ F = \frac{SS_{\mathrm{between}}}{SS_{\mathrm{within}}} $$
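
As a small worked example, take two groups, $[1, 2, 3]$ and $[4, 5, 6]$. The group means are $\mu_1 = 2$ and $\mu_2 = 5$, and the grand mean is $\mu = 3.5$. Then:
$$ SS_{\mathrm{total}} = 2\cdot(2.5^2 + 1.5^2 + 0.5^2) = 17.5 $$
$$ SS_{\mathrm{within}} = (1 + 0 + 1) + (1 + 0 + 1) = 4 $$
$$ SS_{\mathrm{between}} = 3\cdot(2 - 3.5)^2 + 3\cdot(5 - 3.5)^2 = 13.5 $$
so that $SS_{\mathrm{within}} + SS_{\mathrm{between}} = 17.5 = SS_{\mathrm{total}}$, and
$$ F = \frac{13.5}{4} = 3.375 $$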

## Question 0.7
Below, a function called `sum_of_squares` is defined that returns the sum of squared distances between each point in a dataset and a given reference value (i.e. the "reference" could be $\mu_i$ or $\mu$ above).

Write two functions. The first, `all_sums`, we will use to test that the sums of squares add as we expect: it should take an arbitrary number of data sets and return three values: the total, within-group, and between-group sums of squares, each calculated as described above. Note that to calculate the grand mean, you will need a single array consisting of all the data sets "stuck together". (This array can simplify calculating the total sum of squares too.) We can do that with `numpy.concatenate`, which literally means "to stick together end-to-end". 

The second, `F_stat`, should share a lot of similarity with `all_sums`, but it should just calculate and return the ratio of the between- to within-group sum of squares.
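
If `numpy.concatenate` is new to you, this short sketch shows what it does with two groups of different sizes. The group names and values are made up.

```python
import numpy

group_a = numpy.array([1, 2, 3])
group_b = numpy.array([4, 5, 6, 7])

# concatenate takes a LIST (or tuple) of arrays and joins them end-to-end
all_data = numpy.concatenate([group_a, group_b])
print(all_data)       # prints [1 2 3 4 5 6 7]
print(all_data.size)  # prints 7

# The grand mean is the mean over every point, regardless of group
grand_mean = numpy.mean(all_data)
print(grand_mean)     # prints 4.0
```

Note that the grand mean (4.0 here) is generally not the average of the group means (2.0 and 5.5) when the groups have different sizes.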

In [None]:
def sum_of_squares(data, reference):
    data = numpy.asarray(data)
    # look: no for loop! We let numpy do everything for us. Much faster that way, and simpler to read.
    return numpy.sum((data - reference)**2)

def all_sums(*data_sets):
    # turn every dataset in `data_sets` into a `numpy` array 
    data_sets = [numpy.asarray(data) for data in data_sets]
    all_data = numpy.concatenate(data_sets)
    grand_mean = numpy.mean(all_data)
    # YOUR ANSWER HERE
    return ss_total, ss_within, ss_between

data1 = numpy.random.normal(loc=0, scale=2, size=300)
data2 = numpy.random.normal(loc=1, scale=2, size=500)
data3 = numpy.random.normal(loc=2, scale=2, size=700)

ss_total, ss_within, ss_between = all_sums(data1, data2, data3)
print(ss_total, ss_within + ss_between)

def F_stat(*data_sets):
    # YOUR ANSWER HERE
    return ss_between / ss_within

print(F_stat(data1, data2, data3))

data4 = numpy.random.normal(loc=0, scale=1, size=300)
data5 = numpy.random.normal(loc=1, scale=1, size=500)
data6 = numpy.random.normal(loc=2, scale=1, size=700)
# F-stat should get bigger as the spread within each group decreases
# Here we go from "scale" (i.e. standard deviation) = 2 in data 1-3 to 1 in data 4-6.
print(F_stat(data4, data5, data6))

In [None]:
data1 = numpy.random.normal(loc=0, scale=2, size=300)
data2 = numpy.random.normal(loc=1, scale=2, size=500)
data3 = numpy.random.normal(loc=2, scale=2, size=700)
ss_total, ss_within, ss_between = all_sums(data1, data2, data3)
assert abs(ss_total - (ss_within + ss_between)) < 0.0000001

F_stats = resample([data1, data2, data3], F_stat, n_resamples=3000)
print(min(F_stats), F_stat(data1, data2, data3), max(F_stats))
assert min(F_stats) > 0.04
assert max(F_stats) < 0.32

In class, I said that the F-statistic is just the square of the t-statistic. This is exactly true, if you take into account the definition of $k(n_1, n_2)$ that we ignored above. If we ignore $k$, then the F-statistic is simply proportional to the square of the t-statistic. Run the code below to generate F- and t-statistics on data with a range of differences in mean, and plot their relationship:

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')

Fs = []
ts = []
data1 = numpy.random.normal(loc=0, scale=2, size=300)
for center in numpy.linspace(-4, 4, 500):
    data2 = numpy.random.normal(loc=center, scale=2, size=500)
    Fs.append(F_stat(data1, data2))
    ts.append(t_stat(data1, data2))
    
plt.scatter(ts, Fs)
plt.xlabel('t-statistic')
plt.ylabel('F-statistic')
print('looks quadratic indeed!')
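
For the curious, the proportionality between $F$ and $t^2$ in the two-group case can be sketched directly from the definitions used in this question (with $k$ ignored). With two groups, the grand mean is $\mu = (n_1\mu_1 + n_2\mu_2)/(n_1 + n_2)$, so $\mu_1 - \mu = n_2(\mu_1 - \mu_2)/(n_1 + n_2)$ and $\mu_2 - \mu = -n_1(\mu_1 - \mu_2)/(n_1 + n_2)$. Substituting:
$$ SS_{\mathrm{between}} = n_1(\mu_1 - \mu)^2 + n_2(\mu_2 - \mu)^2 = \frac{n_1 n_2}{n_1 + n_2}\cdot(\mu_1 - \mu_2)^2 $$
$$ SS_{\mathrm{within}} = n_1\cdot\sigma^2_1 + n_2\cdot\sigma^2_2 $$
$$ F = \frac{SS_{\mathrm{between}}}{SS_{\mathrm{within}}} = \frac{n_1 n_2}{n_1 + n_2}\cdot\frac{(\mu_1 - \mu_2)^2}{n_1\cdot\sigma^2_1 + n_2\cdot\sigma^2_2} $$
The last fraction is the square of the t-statistic with $k$ dropped, so for fixed $n_1$ and $n_2$ the plot should trace out a parabola.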