# 2: Cancer mice (6 points)

Recall that we imagined an experiment with 10 blueberry-fed mice and 15 regular-chow mice, which found that 4 of the blueberry-fed mice were cancer-free (40%), while 3 of the regular-chow mice were cancer-free (20%). 

Before the National Blueberry Council put out the press release about how blueberries prevent cancer, we did a quick simulation to make sure that a result such as this would not be common by chance alone.

Our final version of the simulation assigned cancer randomly to blueberry or regular mice, with a cancer-free probability of 7/25. We then asked how often we would expect to see more than double or less than half the rate of cancer-freedom in the blueberry mice. By looking at double *or* half, we are performing a two-tailed test: asking about the probability of extreme deviations in either direction.
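As a quick check on where the 7/25 comes from, here is a minimal sketch of the arithmetic using only the counts quoted above. The variable names are illustrative and not part of the in-class code.

```python
from fractions import Fraction

# Observed counts from the original experiment
blueberry_free, n_blueberry = 4, 10
regular_free, n_regular = 3, 15

# Pooled cancer-free rate, ignoring group labels: (4 + 3) / (10 + 15)
pooled = Fraction(blueberry_free + regular_free, n_blueberry + n_regular)
print(pooled, float(pooled))

# Ratio of the two observed cancer-free rates (40% vs. 20%)
ratio = Fraction(blueberry_free, n_blueberry) / Fraction(regular_free, n_regular)
print(ratio)
```

Pooling the two groups is what "by chance alone" means here: if blueberries have no effect, every mouse shares the same cancer-free probability.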

Below is the code from the in-class simulation. Try running it a few times. You should find that almost 30% of the time, we would see a result as extreme as a doubling or halving of the cancer-free rate in the blueberry group by chance alone.

In [None]:
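import random
# 10 blueberry-fed mice and 15 regular-chow mice
mice = ['blueberry'] * 10 + ['regular'] * 15

count = 0  # number of trials with an extreme result
n_trials = 10000
for i in range(n_trials):
    # Each mouse is independently cancer-free with probability 7/25
    no_cancer = []
    for mouse in mice:
        if random.random() < 7/25:
            no_cancer.append(mouse)
    blueberry_cancer_free_rate = no_cancer.count('blueberry') / 10
    regular_cancer_free_rate = no_cancer.count('regular') / 15
    # Two-tailed test: count more than double OR less than half
    if blueberry_cancer_free_rate > 2 * regular_cancer_free_rate:
        count += 1
    elif blueberry_cancer_free_rate < 0.5 * regular_cancer_free_rate:
        count += 1

# Fraction of trials that were extreme by chance alone
print(count / n_trials)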
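If you want to cross-check the simulated value, the same probability can also be computed exactly by summing over every possible pair of cancer-free counts. The sketch below is not part of the in-class code: `binom_pmf` and `exact_p_extreme` are hypothetical helper names, and it assumes Python 3.8 or later for `math.comb`. It compares integer cross-products rather than floating-point rates, so exact ties (such as exactly double) are never counted as extreme.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k cancer-free mice out of n (binomial)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def exact_p_extreme(n_blueberry, n_regular, p=7/25):
    """Exact chance of more than double or less than half the
    cancer-free rate in the blueberry group (hypothetical helper)."""
    total = 0.0
    for b in range(n_blueberry + 1):        # cancer-free blueberry mice
        for r in range(n_regular + 1):      # cancer-free regular mice
            # b/n_blueberry > 2 * r/n_regular, written with integers
            more_than_double = b * n_regular > 2 * r * n_blueberry
            # b/n_blueberry < 0.5 * r/n_regular, written with integers
            less_than_half = 2 * b * n_regular < r * n_blueberry
            if more_than_double or less_than_half:
                total += binom_pmf(b, n_blueberry, p) * binom_pmf(r, n_regular, p)
    return total

exact = exact_p_extreme(10, 15)
print(round(exact, 3))
```

The printed value should land close to what the simulation reports, with the simulation wobbling around it from run to run.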

Based on these results, you were able to convince the National Blueberry Council to **not** put out a press release.

Regardless, the Council was intrigued by your result that 40% of the mice in the blueberry group were cancer-free, while only 20% in the control group were spared. Though the result was potentially just due to chance, your best estimate of the effect of blueberries was, after all, that they doubled the chances of being cancer-free. (The problem is just that your best estimate really isn't all that reliable.)

The good news is that the National Blueberry Council wants to pay for a larger experiment! They hope that with more mice, it might be possible to distinguish whether the effect observed is more likely due to blueberries or to random chance.

The bad news is that the Council is pretty angry with you for not having done these calculations beforehand. If you had done so, you would have realized that in an experiment this small, even an effect as strong as an increase from 20% to 40% in cancer-free mice would be indistinguishable from random chance. This means that the experiment you did was a waste of their money and your time. (Not to mention the lives of 25 mice! It's one thing to do experiments on animals if you are going to learn something. It's completely different to do experiments that are statistically incapable of producing meaningful results.)

So, you need to do some statistics to figure out how many mice need to be in each group. As a first step toward that, let's examine the probability by chance alone of getting similar results using different numbers of mice. Modify the above code to be a function that takes two parameters, the number of blueberry and regular mice, and returns the probability of seeing twice or half the cancer-free rate in the blueberry mice by chance alone (based on 10,000 sampling runs). For simplicity, assume that the cancer-free rate is always 7/25.

In [None]:
def p_cancer_free_extreme(n_blueberry, n_regular):
    # YOUR ANSWER HERE

print(p_cancer_free_extreme(10, 15))

In [None]:
assert 0.28 < p_cancer_free_extreme(10, 15) < 0.31
assert 0.13 < p_cancer_free_extreme(20, 30) < 0.16

Write a for-loop to print out these probabilities for 10, 20, 40, 50, and 100 blueberry mice, each time with 1.5 times as many control mice. Print the number of blueberry mice on the same line. Last, store the probabilities in a list named `ps`.

Note that while `['a','b'] * 2` gives `['a','b','a','b']`, if we were to do `['a','b'] * 2.0`, this would produce an error. That's because Python distinguishes between integers (e.g. `2`, also called an "int") and floating-point numbers (e.g. `2.0`, also called a "float"). Because in general it doesn't make sense to multiply a list some fractional number of times, it is always an error to multiply a list by a float. Also note that an int times a float is always a float: `10 * 2.0` gives `20.0`.

So, if you use multiplication to calculate the number of control mice, make sure to convert that number back to an integer before you pass it to your `p_cancer_free_extreme()` function. (Converting a floating-point number to an integer works just like converting a string: `int(2.0)` gives `2`. Note that converting a non-whole number to an integer drops the fractional part, which for positive numbers means rounding down: `int(2.8)` gives `2`. To round to the closest whole number, use `round()`; in Python 3, `round(2.8)` gives the int `3`. But for this specific question, we don't need to round anything, because 1.5 times each of our group sizes is already a whole number.)
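The int/float behavior described above can be seen in a minimal sketch like the following. It only illustrates the type conversion, not the loop you are asked to write, and the variable names are just examples.

```python
n_blueberry = 10
n_regular = n_blueberry * 1.5   # int times float gives a float
print(n_regular)

# Multiplying a list by a float raises a TypeError
try:
    mice = ['regular'] * n_regular
except TypeError as err:
    print('TypeError:', err)

# Converting back to an int makes list multiplication work again
mice = ['blueberry'] * n_blueberry + ['regular'] * int(n_regular)
print(len(mice))
```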

In [None]:
ps = []
n_blueberries = [10, 20, 40, 50, 100]
# YOUR ANSWER HERE

In [None]:
assert 0.28 < ps[0] < 0.31
assert 0.13 < ps[1] < 0.16
assert 0.02 < ps[3] < 0.03

So, based on the above, about how many mice, total, will we need to get the probability of a chance-only finding below 5%?

This is related to the concept of "statistical power", which we will develop in a future lecture. For now, though, this number gives a rough estimate of how many mice you'd need in an experiment, assuming tha

YOUR ANSWER HERE