# Question 2: The one-sample chi-squared test

Recall from class that the chi-squared statistic is used to compare counts of categorical variables between the "observed" and "expected" cases. The definition of the statistic is:
$$\chi^2 = \sum_{i = 1}^{n}\frac{(\textrm{observed}_i - \textrm{expected}_i)^2}{\textrm{expected}_i}$$
where the summation runs over the $n$ possible categories, indexed by $i$.

## Problem 2.0
The first thing we need to do to build a $\chi^2$ test is to write a function `category_count` that will count the number of instances of each category in a list. This will be a little like our `bincount` function from the previous question, but with some salient differences. We will allow category names to be strings or numbers (or really anything), and we will store the count for each category in a dictionary.

The function will take a list of all the possible categories, and a list of data containing the actual categories to count. To do this, initialize a dictionary mapping each category name to a counter that starts at zero. Then step through the data and for each element, increment the appropriate counter in the dictionary.

For example, `category_count(['apple', 'banana', 'plum'], ['apple', 'apple', 'apple', 'plum'])` should return the following dictionary: `{'apple':3, 'banana':0, 'plum':1}`. (Don't worry about handling the case where the data list contains a category not in the list of possible categories.)

Then write a function, `chi2`, which will take the list of possible categories and two dictionaries: `observed` and `expected`. Assume that these dictionaries map from each category to the count, just as returned from the `category_count` function. For each category, calculate that category's contribution to the $\chi^2$ statistic and add it to a running sum.

In [None]:
def category_count(all_categories, values):
    counts = {}
    # YOUR ANSWER HERE
    return counts

def chi2(all_categories, observed, expected):
    chi2_statistic = 0
    # YOUR ANSWER HERE
    return chi2_statistic

fruits = ['apple', 'banana', 'plum']
print(category_count(fruits, ['apple']*3 + ['banana']*4))

observed_fruit_bowl = {'apple': 4, 'banana': 0, 'plum': 10}
fruit_bowl_as_pictured_in_menu = {'apple': 3, 'banana': 1, 'plum': 8}
print(chi2(fruits, observed_fruit_bowl, fruit_bowl_as_pictured_in_menu))

In [None]:
assert category_count(fruits, ['plum']*100) == {'apple': 0, 'banana': 0, 'plum': 100}
assert category_count(fruits, ['plum', 'apple', 'banana']) == {'apple': 1, 'banana': 1, 'plum': 1}
# Compare floats with a small tolerance rather than exact equality
assert abs(chi2(fruits, observed_fruit_bowl, fruit_bowl_as_pictured_in_menu) - (1/3 + 1/1 + 4/8)) < 1e-9

Let's return to the second rat-behavior problem from class: "do rats prefer food to electric shocks?"

Let us assume that we have again obtained 24 rats, and this time we placed each in a cage with three buttons: one that delivers food, one that does nothing, and one that delivers a mild electric discharge. We allow each rat 10 button presses, for 240 presses total. 

Under the null hypothesis that the rats have no preference either way, we would expect on average 80 button presses for each category. In reality, we observed 98 food-button presses, 79 nothing-button presses, and 63 shock-button presses.

The setup below shows how we can calculate the $\chi^2$ score to see how far our experimental data depart from our theoretical prediction assuming no preference. 
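Written out by hand for the counts above (98, 79, and 63 observed against 80 expected in each category), the statistic works out as follows. This is only the definition applied term by term:

$$\chi^2 = \frac{(98 - 80)^2}{80} + \frac{(79 - 80)^2}{80} + \frac{(63 - 80)^2}{80} = \frac{324 + 1 + 289}{80} = \frac{614}{80} = 7.675$$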

But what does this score mean? Does it mean that our data are *really* different from the theoretical prediction, or only a *little* different? That is, what's the relevant scale for comparing $\chi^2$ scores?

Again, null hypothesis significance testing comes in handy. If a $\chi^2$ score that large could easily arise by chance, just from sampling a population of rats that complies with our null hypothesis, then no, our data aren't significantly different from the theoretical prediction.

Let's bootstrap to find out what kind of $\chi^2$ scores we might get if the null hypothesis were true. How do we make our experimental data look like the null hypothesis? In the previous problems, the null hypothesis merely specified the location of the mean or the minimum of the data, so all we had to do was shift the data around a bit. Here, though, the null hypothesis completely specifies the distribution: from 24 completely unbiased rats, we would expect 80 button presses for each category. So that's our "transformed" dataset.

You may note that the lines can get a bit blurry between "resampling" and "simulation"... which are we doing in this case?

## Problem 2.1
Finish writing the `bootstrap_chi2` function below. It takes a list of categories and a list containing the expected distribution of categories (i.e. not a dictionary of counts, but a list in which each category name appears as many times as it is expected to be observed). Your function should calculate the expected counts from this list (using `category_count`), and then repeatedly resample the expected list, with replacement, to generate a new list of the same length to get new counts from. For each new set of counts, calculate the $\chi^2$ score compared to the expected counts, and add that score to the list of $\chi^2$ scores to return.

This will give the distribution of $\chi^2$ scores under the null hypothesis that the observed values are distributed identically to the expected values. (Again, note that we don't need to use the actual observed values anywhere in this calculation...)
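The resampling step described above amounts to drawing, with replacement, a new list of the same length as the expected list. Here is a minimal sketch of just that one step on a small made-up list. The names `toy_expected` and `rng` are illustrative and not part of the assignment, and the seed is arbitrary:

```python
import random

# Hypothetical toy data: 4 presses expected for each of three buttons.
toy_expected = ['food'] * 4 + ['nothing'] * 4 + ['shock'] * 4

# A seeded generator makes this sketch repeatable; the seed value is arbitrary.
rng = random.Random(0)

# Draw a new list of the same length, sampling with replacement.
resampled = rng.choices(toy_expected, k=len(toy_expected))

print(resampled)
print('food presses in this resample:', resampled.count('food'))
```

Because the draw is with replacement, the per-category counts in `resampled` will usually differ from 4, and that variation is what produces the spread of $\chi^2$ scores under the null hypothesis.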

Next, paste in your `fraction_extreme` function from the previous questions, and use that to calculate a p-value for the experimental $\chi^2$ score compared to the null distribution. Store this as `p_value`. Note that this is a one-tailed value, because $\chi^2$ scores are by definition non-negative (so there's only one direction to go from the null hypothesis that the "true" $\chi^2 = 0$).
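To make the one-tailed idea concrete, here is a small sketch of an upper-tail fraction on made-up numbers. The helper name `upper_tail_fraction` and the toy scores are hypothetical, and counting ties with `>=` is an assumption here; your own `fraction_extreme` may treat ties differently:

```python
def upper_tail_fraction(null_scores, observed_score):
    """Fraction of null scores at least as large as the observed score."""
    n_extreme = sum(1 for score in null_scores if score >= observed_score)
    return n_extreme / len(null_scores)

# Made-up null scores, purely for illustration.
toy_null_scores = [0.1, 0.5, 1.2, 3.0, 7.9]

# Two of the five toy scores (3.0 and 7.9) are >= 2.5.
print(upper_tail_fraction(toy_null_scores, 2.5))
```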

Finally, plot the histogram of resampled $\chi^2$ scores and plot the experimental value with a blue line, as before.

In [None]:
import random
%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use('ggplot')

buttons = ['food', 'nothing', 'shock']
button_presses_observed = ['food']*98 + ['nothing']*79 + ['shock']*63
button_presses_expected = ['food']*80 + ['nothing']*80 + ['shock']*80

observed_button_counts = category_count(buttons, button_presses_observed)
expected_button_counts = category_count(buttons, button_presses_expected)
experimental_chi2 = chi2(buttons, observed_button_counts, expected_button_counts)
print('experimental chi^2:', experimental_chi2)

def bootstrap_chi2(all_categories, expected, n_resamples):
    chi2_vals = []
    # YOUR ANSWER HERE
    return chi2_vals

chi2s = bootstrap_chi2(buttons, button_presses_expected, 10000)
print('bootstrap chi^2 range:', min(chi2s), max(chi2s))

def fraction_extreme(values, below=None, above=None):
    # YOUR ANSWER HERE

# Now calculate the p-value and plot a histogram of the bootstrapped chi^2 values
# vs. the experimentally observed value
# YOUR ANSWER HERE

print('p =', p_value)

In [None]:
# Compare floats with a small tolerance rather than exact equality
assert abs(experimental_chi2 - 7.675) < 1e-9
assert len(chi2s) == 10000
assert min(chi2s) == 0
assert max(chi2s) > 13
assert 0.015 < p_value < 0.03

## Problem 2.2
So, it was pretty unlikely to have a set of category counts as far from (80, 80, 80) as our experimental data of (98, 79, 63) by chance alone. Thus we can reject the null hypothesis that the observed differences can be accounted for simply as an artifact of random sampling.

What if we had run a smaller experiment, though? Repeat the above steps, using the new experimental dataset below, with half as many button presses...

Calculate `experimental_chi2_2`, `chi2s_2`, and `p_value_2` as above, but using the new datasets below, and plot the relevant histogram and line.

In [None]:
button_presses_observed_2 = ['food']*49 + ['nothing']*39 + ['shock']*32
button_presses_expected_2 = ['food']*40 + ['nothing']*40 + ['shock']*40

observed_button_counts_2 = category_count(buttons, button_presses_observed_2)
expected_button_counts_2 = category_count(buttons, button_presses_expected_2)

# YOUR ANSWER HERE
print('experimental chi^2:', experimental_chi2_2)
print('bootstrap chi^2 range:', min(chi2s_2), max(chi2s_2))
print('p =', p_value_2)

In [None]:
# Compare floats with a small tolerance rather than exact equality
assert abs(experimental_chi2_2 - 3.65) < 1e-9
assert len(chi2s_2) == 10000
assert min(chi2s_2) == 0
assert max(chi2s_2) > 13
assert 0.16 < p_value_2 < 0.19

What a difference a little more data make, huh?
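One way to see why sample size matters: if every observed and expected count is multiplied by the same factor $k$, the statistic itself is multiplied by $k$, as the short derivation below shows.

$$\sum_{i = 1}^{n}\frac{(k\,\textrm{observed}_i - k\,\textrm{expected}_i)^2}{k\,\textrm{expected}_i} = k\sum_{i = 1}^{n}\frac{(\textrm{observed}_i - \textrm{expected}_i)^2}{\textrm{expected}_i}$$

The halved dataset here is only approximately half of the original, since 79 and 63 do not divide evenly by 2. That is why its score of 3.65 is close to, but not exactly, $7.675 / 2$.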

In the next problem set, we'll perform a two-sample $\chi^2$ test where experimental and control data are compared, rather than comparing experimental data against a theoretical prediction.