In [None]:
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothesis testing

## T Tests

A p-value below the chosen significance threshold (commonly 0.05) is considered statistically significant.

#### One-sample t-test

Suppose we want to compare exam scores for students who attended a test prep program to the global average score of 35 points. Do students who attend this program score higher than 35 points? The global average is the hypothesized population value and the average of the exam scores of students who attended the program is the sample statistic (in this case, sample mean).

Below is the code to run a one-sample t-test to address the above question. In this example the null hypothesis is that the population mean score for students in the program is 35, and the alternative hypothesis is that it differs from 35 (ttest_1samp runs a two-sided test by default).

In [None]:
from scipy.stats import ttest_1samp

global_average_score = 35
sample_scores = [12, 42, 37, 18, 23, 39, 45, … , 52]

t_stat, p_value = ttest_1samp(sample_scores, global_average_score)
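The question above asks whether program attendees score higher than 35, which is a one-sided question, while the default test is two-sided. A minimal sketch of both versions follows, assuming SciPy 1.6 or later (where `ttest_1samp` accepts an `alternative` argument). The scores are made-up illustrative values, not the data from the example above.

```python
from scipy.stats import ttest_1samp

global_average_score = 35
# Hypothetical scores, for illustration only
example_scores = [38, 42, 37, 31, 44, 39, 45, 36, 41, 52]

# Two-sided test (default): does the mean differ from 35?
t_stat, p_two_sided = ttest_1samp(example_scores, global_average_score)

# One-sided test: is the mean greater than 35?
_, p_greater = ttest_1samp(example_scores, global_average_score, alternative='greater')

print(p_two_sided, p_greater)
```

When the sample mean lies above the hypothesized value, the one-sided p-value is half the two-sided one, so the choice of alternative should be made before looking at the data.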

#### Two-sample t-test
A two-sample t-test is used to investigate an association between a quantitative variable and a binary categorical variable. For example, suppose we want to test if there is an association between claw size (quantitative) and species: black or grizzly bear (binary categorical). To answer this question, we could sample a selection of black bears and grizzly bears, then calculate the average claw size for each species. Then, we can use a two-sample t-test to determine whether the average claw sizes for these two species are significantly different (among the entire population of black and grizzly bears).

Other uses of two-sample t-tests include drug trials and psychology studies with a control group and an experimental group, and A/B tests with quantitative outcomes such as “time spent on a website”.

In [None]:
from scipy.stats import ttest_ind
# a and b are the two samples being compared
a_b_tval, a_b_pval = ttest_ind(a, b)
print(a_b_pval)

In [None]:
from scipy.stats import ttest_ind

#separate out claw lengths for two species
grizzly_bear = data.claw_length[data.species=='grizzly']
black_bear = data.claw_length[data.species=='black']

#run the t-test here:
tstat, pval = ttest_ind(grizzly_bear, black_bear)
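The cell above assumes a DataFrame named data already exists. A self-contained sketch of the same test is shown below. The claw lengths are hypothetical values invented for illustration, and the column names simply mirror the ones used above.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical claw lengths (cm), for illustration only
data = pd.DataFrame({
    'species': ['grizzly'] * 5 + ['black'] * 5,
    'claw_length': [9.1, 8.7, 10.2, 9.5, 8.9, 4.2, 3.9, 4.8, 4.4, 4.0],
})

# Separate out claw lengths for the two species
grizzly_bear = data.claw_length[data.species == 'grizzly']
black_bear = data.claw_length[data.species == 'black']

# Run the two-sample t-test
tstat, pval = ttest_ind(grizzly_bear, black_bear)
print(tstat, pval)
```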

#### Binomial test
If we instead have a sample of binary data and want to compare a sample proportion/frequency to an underlying probability (population value), a binomial test is appropriate. The classic example of a binomial test is tossing a coin to determine if it’s fair (fair means that the probability of either heads or tails is exactly 50%).

For example, suppose that you collect sample data from a coin by tossing it 100 times, and find that 45 flips result in heads. If the coin were actually fair (if you flipped it infinitely many times, exactly half those flips would be heads), how likely would a result at least this far from 50 heads be? The following code runs the binomial test to answer this question:

In [4]:
from scipy.stats import binomtest

# binomtest returns a result object; the p-value itself is result.pvalue
result = binomtest(45, 100, p=0.50)

result

BinomTestResult(k=45, n=100, alternative='two-sided', statistic=0.45, pvalue=0.36820161732669565)

The alternative hypothesis for this test is that the probability is different than p = 0.50, and the null is that it is equal to 0.50.

Here are some other examples of situations where a binomial test would be useful:

* Is the number of passengers who show up for a flight fewer than normal?
* Is the open rate on a marketing email different from the company target?
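The first question in the list is one-sided ("fewer than normal"), which `binomtest` supports through its `alternative` argument. A rough sketch, assuming SciPy 1.7 or later; the passenger counts and the 90% show-up rate are hypothetical numbers chosen only for illustration.

```python
from scipy.stats import binomtest

# Hypothetical numbers: 170 of 200 ticketed passengers showed up,
# and the assumed typical show-up rate is 90%
result = binomtest(170, 200, p=0.90, alternative='less')

# The p-value is stored on the result object
print(result.pvalue)
```

For the email open-rate question, where a difference in either direction matters, the default two-sided alternative would be the natural choice.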

## ANOVA (Analysis of variance)

ANOVA tests the null hypothesis that all groups have the same population mean. A p-value below the chosen significance threshold (commonly 0.05) is considered statistically significant.

In cases similar to the two-sample t-test, but when the categorical variable has three or more categories, an ANOVA can be used to see if there is a significant difference between any of the groups. Then, if at least one pair of groups is significantly different, Tukey’s range test can be used to determine which groups are different. This is better than running multiple two-sample t-tests because it leads to a lower probability of making a type I error.

For example, if we want to compare the heights of three different tree species, in order to test the hypothesis that average tree heights vary by species, we can use an ANOVA. Then, if the p-value from the ANOVA is below our significance threshold, we can run Tukey’s range test to determine which tree species have significantly different heights.

In [None]:
fstat, pval = f_oneway(a, b, c)
print(pval)

In [None]:
# ANOVA Test
from scipy.stats import f_oneway
fstat, pval = f_oneway(heights_pine, heights_oak, heights_spruce)
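The two-step workflow described above (ANOVA first, Tukey’s range test only if the ANOVA is significant) can be sketched as follows. The tree heights and the name `significance_threshold` are hypothetical, introduced only for this illustration.

```python
from scipy.stats import f_oneway

# Hypothetical tree heights (m), for illustration only
heights_pine = [18.2, 20.1, 19.5, 21.0, 18.8]
heights_oak = [15.1, 16.3, 14.8, 15.9, 16.0]
heights_spruce = [19.0, 20.5, 18.7, 19.9, 20.2]

fstat, pval = f_oneway(heights_pine, heights_oak, heights_spruce)

significance_threshold = 0.05
if pval < significance_threshold:
    # ANOVA only says that some difference exists, not which groups differ
    print("At least one species differs in mean height; follow up with Tukey's range test.")
else:
    print("No significant difference in mean height detected.")
```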


## Tukey’s range test

Run Tukey’s range test with a family-wise error rate of 0.05 (the third argument, alpha). Pairs of groups whose means differ significantly are marked True in the reject column of the results table.


In [None]:
# Tukey’s Range Test
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_results = pairwise_tukeyhsd(tree_data.height, tree_data.species, 0.05)

In [None]:
tukey_results = pairwise_tukeyhsd(data.col_a, data.col_b, 0.05)
print(tukey_results)
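A self-contained sketch of the Tukey step, assuming statsmodels is installed. The `tree_data` DataFrame below is hypothetical and reuses made-up heights for three species; the `reject` attribute of the result holds one boolean per pair of groups.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical tree heights (m), for illustration only
tree_data = pd.DataFrame({
    'species': ['pine'] * 5 + ['oak'] * 5 + ['spruce'] * 5,
    'height': [18.2, 20.1, 19.5, 21.0, 18.8,
               15.1, 16.3, 14.8, 15.9, 16.0,
               19.0, 20.5, 18.7, 19.9, 20.2],
})

tukey_results = pairwise_tukeyhsd(tree_data.height, tree_data.species, alpha=0.05)

# Summary table with one row per pair of groups
print(tukey_results)

# One boolean per pair, in the same order as the rows of the table
print(tukey_results.reject)
```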

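## Assumptions of two-sample t-tests, ANOVA, and Tukey’s range test

Before running a two-sample t-test, ANOVA, or Tukey’s range test, check the following assumptions.

1. The observations should be independently randomly sampled from the population

Random sampling will help ensure that our sample is representative of the population we care about.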

2. The standard deviations of the groups should be equal

For example, if we’re comparing time spent on a website for two versions of a homepage, we first want to make sure that the standard deviation of time spent on version 1 is roughly equal to the standard deviation of time spent on version 2. To check this assumption, it is normally sufficient to divide one standard deviation by the other and see if the ratio is “close” to 1. Generally, a ratio between 0.9 and 1.1 should suffice.

That said, there is also a way to run a 2-sample t-test without assuming equal standard deviations — for example, by setting the equal_var parameter in the scipy.stats.ttest_ind() function equal to False. Running the test in this way has some disadvantages (it essentially makes it harder to reject the null hypothesis even when there is a true difference between groups), so it’s important to check for equal standard deviations before running the test.

3. The data should be normally distributed…ish

Data analysts in the real world often still perform these tests on data that are not normally distributed. This is usually not a problem if sample size is large, but it depends on how non-normal the data is. In general, the bigger the sample size, the safer you are!

4. The groups created by the categorical variable must be independent

Here are some examples where the groups are not independent:

* the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen (not independent because the same players are measured in each category)
* years of schooling completed by a group of adults compared to their parents (not independent because kids and their parents can influence one another)
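The check described under assumption 2 (the ratio of standard deviations) and the equal_var option can be sketched as below. The time-on-site values are hypothetical, and NumPy is an added dependency used only to compute the sample standard deviations.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical time-on-site samples (minutes), for illustration only
version_1 = [3.1, 4.5, 2.8, 5.0, 3.9, 4.2, 3.3, 4.8]
version_2 = [4.0, 5.2, 3.7, 5.9, 4.6, 5.1, 4.4, 5.5]

# Ratio of sample standard deviations (ddof=1 gives the sample estimate)
sd_ratio = np.std(version_1, ddof=1) / np.std(version_2, ddof=1)
print(sd_ratio)

# Default test: assumes equal standard deviations
tstat_equal, pval_equal = ttest_ind(version_1, version_2)

# Welch's t-test: does not assume equal standard deviations
tstat_welch, pval_welch = ttest_ind(version_1, version_2, equal_var=False)
print(pval_equal, pval_welch)
```

With equal group sizes the two versions produce the same t statistic and differ only in the degrees of freedom, so the p-values tend to be close when the standard deviations are similar.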

## Chi-Square test

A Chi-Square test is used to investigate an association between two categorical variables. Examples include:

* An A/B test where half of users were shown a green submit button and the other half were shown a purple submit button. Was one group more likely to click the submit button?
* People under and over age 40 were given a survey asking “Which of the following three products is your favorite?” Did these age groups have significantly different preferences?

A Chi-Square test relies on the following assumptions.

1. The observations should be independently randomly sampled from the population

This is also true of 2-sample t-tests, ANOVA, and Tukey. The purpose of this assumption is to ensure that the sample is representative of the population of interest.

2. The categories of both variables must be mutually exclusive

In other words, individual observations should only fall into one category per variable. This means that categorical variables like “college major”, where students can have multiple different college majors, would not be appropriate for a Chi-Square test.

3. The groups should be independent

Similar to 2-sample t-tests, ANOVA, and Tukey, a Chi-Square test also shouldn’t be used if either of the categorical variables splits observations into groups that can influence one another. For example, a Chi-Square test would not be appropriate if one of the variables represents three different time points.

In [None]:
from scipy.stats import chi2_contingency

table = pd.crosstab(variable_1, variable_2)
chi2, pval, dof, expected = chi2_contingency(table)

In [None]:
import pandas as pd
from scipy.stats import chi2_contingency

# create contingency table
ab_contingency = pd.crosstab(data.Web_Version, data.Subscribed)

# run a Chi-Square test
chi2, pval, dof, expected = chi2_contingency(ab_contingency)
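The cell above assumes a DataFrame named data with Web_Version and Subscribed columns. A self-contained sketch with hypothetical A/B data (invented counts, for illustration only) shows what the contingency table and the returned values look like.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical A/B test data: 50 users per version, for illustration only
data = pd.DataFrame({
    'Web_Version': ['A'] * 50 + ['B'] * 50,
    'Subscribed': ['yes'] * 10 + ['no'] * 40 + ['yes'] * 25 + ['no'] * 25,
})

# Create the contingency table of counts
ab_contingency = pd.crosstab(data.Web_Version, data.Subscribed)
print(ab_contingency)

# Run the Chi-Square test
chi2, pval, dof, expected = chi2_contingency(ab_contingency)
print(pval)

# Counts that would be expected in each cell if the variables were unrelated
print(expected)
```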