### Lecture Notes: Introduction to Hypothesis Testing

**Helpful Resource:**
- [Python Reference](http://data8.org/sp22/python-reference.html)

**Recommended Readings:**
- [Assessing a Model](https://www.inferentialthinking.com/chapters/11/1/Assessing_a_Model.html)
- [Decisions and P-value](https://www.inferentialthinking.com/chapters/11/3/Decisions_and_Uncertainty.html)
- [A/B Testing](https://www.inferentialthinking.com/chapters/12/1/AB_Testing.html)
- [Testing two proportions](https://www.inferentialthinking.com/chapters/12/2/Causality.html)

In [2]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
students = Table.read_table('student_data.csv')


### Hypothesis Testing Based on Simulations

1. State the null and alternative hypotheses.
2. Pick a test statistic.
3. Simulate the sampling distribution of the test statistic under the null model.
4. Draw a conclusion about the null model based on the p-value, computed from the distribution generated in Step 3. A significance level of 0.05 is often used as the cutoff: a p-value below it counts as evidence against the null model.
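
The four steps can be sketched end to end on a toy problem. The example below is only an illustration: the coin, the 100 flips, and the 60 observed heads are made-up numbers, and NumPy's `rng.binomial` stands in for the course's `sample_proportions`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Step 1: null model: the coin is fair. Alternative: it favors heads.
null_prob_heads = 0.5
num_flips = 100
observed_heads = 60  # hypothetical observed data

# Step 2: test statistic: the number of heads in num_flips flips.
# Step 3: simulate that statistic many times under the null model.
simulated_heads = rng.binomial(num_flips, null_prob_heads, size=10_000)

# Step 4: the p-value is the share of simulated statistics that are at
# least as extreme as the observed one, in the direction of the alternative.
p_value = np.count_nonzero(simulated_heads >= observed_heads) / len(simulated_heads)
print(p_value)
print('reject the null at 0.05' if p_value < 0.05 else 'fail to reject the null at 0.05')
```

Which tail counts as "extreme" depends on the alternative hypothesis; here the alternative says "too many heads", so only the upper tail is counted.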


#### Example 1: Swain v. Alabama

In Swain v. Alabama (1965), Robert Swain argued that the jury panel that convicted him under-represented the population of eligible jurors, 26% of whom were black. Of the 100 jurors on the panel, only 8 were black. The Supreme Court opinion stated that "the overall percentage disparity has been small".

What are the null and alternative models? 

In the following code, what is the test statistic used in the simulation? 

In [6]:
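def sampling_distribution_black(n, prop):
    # Simulate, 100,000 times, the number of black jurors in a panel of size n
    # when each juror is black with probability prop (the null model).
    results = make_array()
    proportions = make_array(prop, 1 - prop)
    for i in np.arange(100000):
        # sample_proportions returns proportions; multiply by n to get a count
        count = sample_proportions(n, proportions).item(0) * n
        results = np.append(results, count)
    return results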

How do you find the p-value based on the following distribution? 

In [None]:
results = sampling_distribution_black(100, 0.26)
Table().with_column('Number of black jurors', results).hist()
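
One way to turn a simulated distribution into a p-value is to count how often the simulation is at least as extreme as the observed value. The sketch below assumes the numbers stated in the example (panel of 100, 26% eligible, 8 observed) and uses NumPy's `rng.binomial` as a stand-in for `sampling_distribution_black`; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

panel_size = 100
eligible_prop = 0.26
observed_count = 8

# Each entry is the number of black jurors in one panel simulated under the null.
simulated_counts = rng.binomial(panel_size, eligible_prop, size=100_000)

# The alternative says the panel has too FEW black jurors, so "at least as
# extreme" means simulated counts at or below the observed count.
p_value = np.count_nonzero(simulated_counts <= observed_count) / len(simulated_counts)
print(p_value)
```

With the array returned by `sampling_distribution_black`, the same `np.count_nonzero(... <= observed_count)` line should apply unchanged.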

What is the overall conclusion from the hypothesis test? 

Suppose that 12% of the eligible jurors in the county were black, instead of 26%. Re-run the simulation under this new null model. What is the conclusion of this hypothesis test?

#### Example 2: The psychic octopus

During the 2010 World Cup tournament, Paul the Octopus (in a German aquarium) became famous for correctly predicting the winner in all 8 games he was asked to predict. (Two containers of food were lowered into Paul’s tank, each marked with the flag of one of the opposing teams. He made his selection by choosing which container to eat from.) Is this evidence that Paul has psychic powers and can choose correctly more than half the time?

What are the null and alternative models? 

Using the same test statistic as in Swain v. Alabama, simulate a sampling distribution for the test statistic and use it to complete the hypothesis test.

Suppose a different and less psychically powerful octopus named "Polly" got only 6 correct out of 8. Will the p-value be greater or less than the p-value for Paul? Estimate this new p-value using the same simulated sampling distribution, and draw a conclusion for this hypothesis test.
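
As a rough check on the simulation, the exact chance of 8 correct out of 8 under pure guessing is $(1/2)^8 = 1/256 \approx 0.004$, and the chance of 6 or more correct is $(28 + 8 + 1)/256 = 37/256 \approx 0.145$. The sketch below estimates both from one simulated distribution; `rng.binomial` and the helper `p_value_at_least` are assumptions made for illustration, not part of the course library.

```python
import numpy as np

rng = np.random.default_rng(2)

num_games = 8
# Null model: every pick is a 50/50 guess.
simulated_correct = rng.binomial(num_games, 0.5, size=100_000)

def p_value_at_least(observed):
    """Share of simulated octopuses with at least `observed` correct picks."""
    return np.count_nonzero(simulated_correct >= observed) / len(simulated_correct)

p_paul = p_value_at_least(8)   # all 8 correct
p_polly = p_value_at_least(6)  # 6 or more correct
print(p_paul, p_polly)
```

The same simulated array serves both octopuses because the null model is identical; only the observed value changes.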

#### Example 3: the loaded die

A gambler at the Graton Casino observed 120 dice rolls and noticed that the six faces did not come up equally often. Specifically, he counted the following frequencies for the faces of the die: (18, 21, 19, 23, 24, 15). He suspects the die may be loaded.

What are the null and alternative models? 

Based on the following code, what is the value of the test statistic for the observed data? Complete the definition of ```fair_die``` based on the null model.

In [None]:
dice_data = make_array(18, 21, 19, 23, 24, 15)

def tvd(dist1, dist2):
    return sum(abs(dist1 - dist2))/2

fair_die = make_array( ... )

results = make_array()
for i in np.arange(10000):
    test_stat = ... 
    results = np.append(results, test_stat)
Table().with_column('Test stat', results).hist()

Based on the simulated sampling distribution, what is the p-value and the conclusion of the test? 

Now imagine that instead of 120 rolls there were 1200, and each frequency was 10 times larger. Repeat the hypothesis test. Does the conclusion change?

In [None]:
dice_data = dice_data * 10
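
To see why sample size matters for a total variation distance (TVD) test, here is a minimal sketch on a different, made-up problem: a hypothetical three-outcome spinner rather than the die, so the blanks in the exercise stay open. The observed proportions, the sample sizes, and the helper `tvd_p_value` are all assumptions for illustration, and `rng.multinomial` stands in for `sample_proportions`.

```python
import numpy as np

rng = np.random.default_rng(3)

def tvd(dist1, dist2):
    """Total variation distance between two distributions given as proportions."""
    return np.sum(np.abs(dist1 - dist2)) / 2

fair_spinner = np.array([1/3, 1/3, 1/3])       # hypothetical null model
observed_props = np.array([0.40, 0.35, 0.25])  # hypothetical observed proportions
observed_tvd = tvd(observed_props, fair_spinner)

def tvd_p_value(num_spins, repetitions=10_000):
    """Estimate P(simulated TVD >= observed TVD) for samples of num_spins spins."""
    counts = rng.multinomial(num_spins, fair_spinner, size=repetitions)
    simulated = np.abs(counts / num_spins - fair_spinner).sum(axis=1) / 2
    return np.count_nonzero(simulated >= observed_tvd) / repetitions

p_small = tvd_p_value(60)    # same observed proportions, 60 spins
p_large = tvd_p_value(600)   # same observed proportions, 600 spins
print(p_small, p_large)
```

The observed TVD is the same in both cases, but the simulated TVDs cluster closer to 0 as the sample grows, so with these made-up numbers the larger sample should give a much smaller p-value.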

#### Example 4: Testing whether one group mean is different

Based on the student data in ```student_data.csv```, suppose we want to test the hypothesis that students who do not use social networking sites tend to be older in terms of average age. 

What are the null and alternative models?

Suppose we use the average age as the test statistic. Does the following output appear to support the null or alternative model? 

In [None]:
students.group('SOCIAL', np.average)

In [None]:
students.group('SOCIAL')

In the following code, what is the test statistic? What is the p-value and conclusion of the hypothesis test based on the student data? 

In [None]:
def sampling_distribution_mean(n, variable):
    random_group = students.sample(n, with_replacement=False)
    return np.average(random_group.column(variable))

sample_means = make_array()
for i in np.arange(10000):
    sample_means = np.append(sample_means, sampling_distribution_mean(20, 'AGE'))

Table().with_column('Sample Means', sample_means).hist()
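
The comparison behind this test can be sketched without the course library. The ages below are synthetic (they are not taken from `student_data.csv`), the group sizes are assumptions, and `rng.choice(..., replace=False)` plays the role of `students.sample(n, with_replacement=False)`.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the student table, for illustration only.
user_ages = rng.integers(18, 25, size=180)     # ages 18-24: use social sites
non_user_ages = rng.integers(21, 29, size=20)  # ages 21-28: do not
all_ages = np.concatenate([user_ages, non_user_ages])

observed_mean = non_user_ages.mean()
group_size = len(non_user_ages)

# Null model: the non-users look like a random sample of group_size students
# drawn without replacement from all students.
sample_means = np.array([
    rng.choice(all_ages, size=group_size, replace=False).mean()
    for _ in range(5_000)
])

# Alternative: non-users are older, so count simulated means at or above the observed one.
p_value = np.count_nonzero(sample_means >= observed_mean) / len(sample_means)
print(observed_mean, p_value)
```

The sample size passed to the simulation should match the size of the group being tested, since the spread of a sample mean depends on how many values are averaged.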

Now modify the code and use it to test the hypothesis that on average, female students own more pets than male students. 

#### Example 5: A/B Testing 

In A/B testing, we ask whether the two samples are likely to come from the same population. Instead of using the sample mean as the test statistic as in Example 4, we use the difference between the two sample means as the test statistic.

In the experiment (c) mentioned above, the test statistic is the difference between the mean number of words recalled by people who took a nap ($\bar{x}_1$) and the mean number recalled by those who took a caffeine pill ($\bar{x}_2$). As in the example used in the A/B testing lecture, the null hypothesis here assumes that the two groups have the same population mean.

<img src="two_means_diff_1000samples.GIF" alt="drawing" width="600"/>



What is the P-value if the observed difference in the original sample is 3.5 words? 

What is the P-value if the observed difference in the original sample is 3.0 words? 


Use the student data to test whether there is a significant difference in average height between female and male students.

In [None]:
def difference_of_means(table, variable):
    subtable = table.select('SEX', variable)
    means_table = subtable.group('SEX', np.average)
    means = means_table.column(1)
    return means.item(0) - means.item(1)

difference_of_means(students, 'HEIGHT')

What are the null and alternative models? 

What is the value of the test statistic? 

Under the null model, the sampling distribution of the test statistic is generated using a technique called "label shuffling".

In [None]:
shuffled_sex = students.sample(with_replacement = False)
shuffled_students = students.drop('SEX').with_column('SEX', shuffled_sex.column('SEX'))
shuffled_students


Using the code provided above, create a simulation of the sampling distribution of the test statistic. Each iteration should shuffle the labels once and use ```difference_of_means``` to compute the test statistic.

Use the simulated sampling distribution to compute the p-value and draw the conclusion. 
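
The shape of a label-shuffling test can be sketched with plain NumPy arrays. Everything below is an assumption for illustration: the heights are synthetic, the labels `'F'` and `'M'` may not match the coding in `student_data.csv`, and this array-based `difference_of_means` is a simplified stand-in for the table-based function above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-in data: 30 heights per group, in inches.
labels = np.array(['F'] * 30 + ['M'] * 30)
heights = np.concatenate([rng.normal(64, 3, size=30), rng.normal(69, 3, size=30)])

def difference_of_means(values, groups):
    """Mean of the 'F' group minus mean of the 'M' group."""
    return values[groups == 'F'].mean() - values[groups == 'M'].mean()

observed = difference_of_means(heights, labels)

# Under the null, the labels carry no information about height, so shuffling
# them produces another equally plausible assignment of labels to heights.
simulated = np.array([
    difference_of_means(heights, rng.permutation(labels))
    for _ in range(5_000)
])

# Two-sided test ("is there a difference?"): compare absolute values.
p_value = np.count_nonzero(np.abs(simulated) >= abs(observed)) / len(simulated)
print(observed, p_value)
```

For a one-sided alternative, the absolute values would be dropped and only the tail in the direction of the alternative counted.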

Can you use the same procedure to test whether there is a difference in attitude towards math between male and female students?

#### Example 6: Label Shuffling for Categorical Data

The following data come from an experiment on whether Lithium was effective in preventing cocaine users from relapsing. In the column labeled "Result", 1 represents relapse (returning to cocaine use), and 0 represents no relapse.

In [None]:
coke = Table.read_table('cocaine_lithium.csv')
coke.show(3)
coke.pivot('Result', 'Group')

The following function is used to compute the test statistic, based on a table similar to ```coke```. What does the test statistic compute in terms of the original data? 

In [None]:
def distance(table, group_label):
    proportions = table.group(group_label, np.average).column(1)
    print(proportions)
    return abs(proportions.item(1) - proportions.item(0))

distance(coke, 'Group')

The code below uses a technique similar to the one in Lecture 20: Causality. It shuffles the labels under "Group" to produce another sample under the null hypothesis (Lithium and Placebo have the same rate of relapse). Run the code a few times and describe what happens after each shuffle.

In [None]:
def one_shuffle(table):
    shuffled_labels = table.sample(with_replacement = False).column('Group')
    shuffled_table = table.select('Group', 'Result').with_column(
        'Shuffled', shuffled_labels)
    return shuffled_table


coke_shuffled = one_shuffle(coke)
coke_shuffled.show(10)
coke_shuffled.pivot('Result','Shuffled')

In [None]:
distance(coke_shuffled.drop('Group'), 'Shuffled')

Write code to simulate 10000 values of the test statistic based on the ```distance``` and ```one_shuffle``` functions defined above. Plot the histogram of the sampling distribution.

Based on the test statistic from the original data, write some code to answer the question: what is the p-value of this test? If you were a doctor, would you recommend the Lithium treatment to your patients? 
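
For a 0/1 outcome, the group average is the group's proportion of 1s, so the same shuffle-and-compare loop applies. The sketch below uses made-up counts (40 people per group, 30 and 32 relapses) and generic labels; it does not reproduce `cocaine_lithium.csv`, and the array-based `distance` is a simplified stand-in for the table-based function above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up data for illustration: 1 = relapse, 0 = no relapse.
group = np.array(['Treatment'] * 40 + ['Control'] * 40)
result = np.array([1] * 30 + [0] * 10 + [1] * 32 + [0] * 8)

def distance(outcomes, labels):
    """Absolute difference between the two groups' proportions of 1s."""
    treated = outcomes[labels == 'Treatment'].mean()
    control = outcomes[labels == 'Control'].mean()
    return abs(treated - control)

observed = distance(result, group)

# Shuffle the group labels 10,000 times and recompute the statistic each time.
simulated = np.array([
    distance(result, rng.permutation(group))
    for _ in range(10_000)
])

# Larger distances are more extreme, so count simulated values at or above the observed one.
p_value = np.count_nonzero(simulated >= observed) / len(simulated)
print(observed, p_value)
```

With these made-up counts the two relapse rates are close, so the p-value should come out large; real data may of course behave differently.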

Now, using the same technique and the same test statistic, conduct another hypothesis test to answer the question: based on the student data, do female students tend to use social media more than male students? (Hint: you can reuse the code above by converting the student data to a similar table.)