In [1]:
from datascience import *
import numpy as np
from math import *
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

## Hypothesis Testing

In Data8.2x, you have been using simulation to conduct hypothesis testing. Now that we have completed Data8.2x, this is a good time to take a step back and reflect on hypothesis testing. 

Every hypothesis test has roughly the same structure. The following four steps provide an outline:

1) State the null and alternative hypotheses. Generally, the alternative hypothesis is what you are trying to show. In essence, to show a result, we assume the opposite is true and then try to prove ourselves wrong. 

2) Determine/calculate a test statistic. See your book for a formal definition, but generally, the test statistic is any quantity that helps us evaluate our sample with respect to our null hypothesis. 

3) Determine the distribution of the test statistic and compute a $p$-value. If you have taken inferential statistics before, you likely computed a $z$ or $t$ statistic and used a calculator or table to compute a $p$-value. That approach is based on the asymptotic theory of sample means/proportions. It is not the approach taken in Data8.2x. With better computing power, we can use simulation to obtain an empirical distribution of our test statistic under the null hypothesis. 

4) Conclude. For a low $p$-value (generally below 0.05), we reject the null hypothesis. For a high $p$-value, we fail to reject it. A low $p$-value means that our sample would be very unusual if the null hypothesis were actually true, which is evidence against the null hypothesis. 
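
The four steps above can be sketched as one small helper. This is only an illustration, not code from the course: the function name `simulated_p_value`, its parameters, and the coin-toss example are all assumptions made up for this sketch, and it handles only an upper-tail ("greater than") alternative.

```python
import numpy as np

def simulated_p_value(simulate_statistic, observed, num_sims=10_000, seed=0):
    """Estimate an upper-tail p-value by simulation.

    simulate_statistic: hypothetical callable that takes a numpy Generator
        and returns one test statistic drawn under the null hypothesis.
    observed: the test statistic computed from the actual sample.
    """
    rng = np.random.default_rng(seed)
    simulated = np.array([simulate_statistic(rng) for _ in range(num_sims)])
    # Fraction of null-world statistics at least as extreme as the observed one.
    return np.mean(simulated >= observed)

# Toy usage: 60 heads in 100 tosses; null hypothesis is a fair coin.
def heads_under_null(rng):
    return rng.binomial(n=100, p=0.5)

p_value = simulated_p_value(heads_under_null, observed=60)
print(p_value)
```

The returned fraction is then compared to the chosen cutoff (for example 0.05) in the final step.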

### Example

Let's work through an example. Suppose that in the upcoming election, Referendum A is up for approval in Colorado. You suspect that in El Paso County, more than half of eligible voters support the referendum. You collect a random sample of 200 eligible voters in El Paso County and 115 of them express support. Is there evidence to support your suspicion? 

#### Step 1: Hypothesis

State the null and alternative hypotheses.

Null hypothesis: the population proportion of supporters, $\pi$, is equal to 0.5.

Alternative hypothesis: the population proportion of supporters, $\pi$, is greater than 0.5.

#### Step 2: Test Statistic

Select a test statistic and compute that test statistic for the sample.

The test statistic is the random variable $X$.  

$X$ = the number of voters in the sample of 200 who support the referendum, minus 100. (We would expect this value to be close to 0 if the null were true.) For our sample, $X = 115 - 100 = 15$.
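
As a quick check, the observed value of the test statistic can be computed from the sample counts given in the example. The variable names below are assumptions chosen for this sketch and are not used elsewhere in the notebook.

```python
n_sampled = 200   # voters in the random sample
n_support = 115   # sampled voters who support the referendum

# Number of supporters we would expect if the null (proportion 0.5) were true.
expected_under_null = n_sampled // 2

# Observed test statistic: supporters in excess of the null expectation.
observed_x = n_support - expected_under_null
print(observed_x)
```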

#### Step 3: $p$-value

3a) If $H_0$ were true, what should the value of $\hat{p}$ be close to? In other words, if in fact, half of eligible voters support the referendum, what value should your test statistic take? 

The test statistic $X$ should be close to 0; equivalently, $\hat{p}$ should be close to 0.5. 

3b) In words (and in the context of this problem), describe what the $p$-value is. 

If the true proportion of voters in favor of the referendum were 0.5, the $p$-value is the probability of seeing 115 or more supporters in a random sample of 200, that is, 15 or more in excess of the expected 100. 

3c) Find the $p$-value directly and using simulation. Hint: the binomial distribution will be of use here. 

In [4]:
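# NOTE: I used several Python cells, so don't feel the need to put everything in this one.
# Under H0 the supporter count is Binomial(200, 0.5); loc=-100 shifts it so X = count - 100.
my_binom = stats.binom(n=200, p=.5, loc=-100)
# P(X >= 15) = 1 - P(X <= 14)
p_val = 1 - my_binom.cdf(14)
p_val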

0.020018595806698514

In [7]:
# Using simulation: draw 10,000 values of X under H0 and count how often X >= 15.
sim_num = 10000
test_stat = 15
np.count_nonzero(my_binom.rvs(sim_num) >= test_stat) / sim_num

0.0199
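
For comparison, the table-based $z$ approach mentioned in the outline can be sketched as follows. This is an illustrative normal approximation without a continuity correction, not part of the original solution; the variable names are assumptions for this sketch.

```python
from math import sqrt
from scipy import stats

n, successes, pi_0 = 200, 115, 0.5

p_hat = successes / n                      # sample proportion, 0.575
std_error = sqrt(pi_0 * (1 - pi_0) / n)    # standard error under the null
z = (p_hat - pi_0) / std_error             # standardized test statistic

# Upper-tail probability under the standard normal distribution.
p_val_normal = stats.norm.sf(z)
print(z, p_val_normal)
```

Because no continuity correction is applied, this approximation should land near, but not exactly at, the exact binomial $p$-value computed above.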

#### Step 4: Conclude

What is your conclusion? Be sure to state your conclusion in the context of the problem.

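The $p$-value of about 0.02 is below the usual 0.05 cutoff, so we reject the null hypothesis: the sample provides evidence that more than half of eligible El Paso County voters support Referendum A. The result is only marginally significant, though, so I would collect more data before drawing a firm conclusion. 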

## Confidence Intervals

Construct and interpret a 95% confidence interval on $p$, the true proportion of eligible El Paso County voters who support the referendum. There are many ways to construct such an interval (bootstrap, the binomial distribution, asymptotically). Select one and implement. 

Also, compare your interval to the results of your hypothesis test. Does your interval contain the value 0.5? Why does that matter? 

In [10]:
my_data = np.repeat([0,1],[85,115]) # 85 zeros (no support) and 115 ones (support): the 200 sampled voters
my_data

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1])

In [12]:
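np.random.seed(678)
# Bootstrap: resample the 200 responses with replacement 10,000 times and
# record X = (number of supporters in the resample) - 100 each time.
my_boots = []
for i in np.arange(10000):
    resample = np.random.choice(my_data, size=200)
    my_boots.append(resample.sum() - 100)
test_stat = Table().with_column("test stat", np.array(my_boots))
test_stat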

test stat
11
13
6
13
-2
20
22
29
17
3


In [13]:
test_stat.percentile(5)

test stat
3


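We are 95% confident that the true proportion of voters in favor of the referendum is greater than 103/200 = 0.515. This one-sided interval lies entirely above 0.5, which agrees with the hypothesis test: at the 5% level, we rejected the null hypothesis that the proportion is 0.5.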
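
The interval above is one-sided. As a rough cross-check, two of the other approaches named in the question can be sketched to give two-sided 95% intervals: the asymptotic (Wald) interval and a bootstrap percentile interval. This is an illustration under assumptions, not the original solution: the variable names and the use of `np.random.default_rng` are choices made for this sketch.

```python
import numpy as np
from scipy import stats

n, successes = 200, 115
p_hat = successes / n

# Asymptotic (Wald) interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
z_crit = stats.norm.ppf(0.975)
half_width = z_crit * np.sqrt(p_hat * (1 - p_hat) / n)
wald_interval = (p_hat - half_width, p_hat + half_width)

# Bootstrap percentile interval: resample the 200 responses with replacement,
# one resample per row, and take the middle 95% of the resampled proportions.
rng = np.random.default_rng(678)
sample = np.repeat([0, 1], [n - successes, successes])
boot_props = rng.choice(sample, size=(10_000, n)).mean(axis=1)
boot_interval = tuple(np.percentile(boot_props, [2.5, 97.5]))

print(wald_interval)
print(boot_interval)
```

If both intervals sit above 0.5, they tell the same story as the one-sided bound; a two-sided interval is somewhat wider on the low end because it splits the 5% error between both tails.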