In [1]:
from datascience import *
import numpy as np
from math import *
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

## Lesson 24: Hypothesis Testing Errors & Power

Throughout this block, we have been studying hypothesis tests. We have covered the four basic steps of any hypothesis test, and we have practiced various methods for obtaining the distribution of our test statistic under the null hypothesis. 

After we have reached a conclusion (reject or fail to reject), we must consider possible errors. 

### Type I error 

A Type I error occurs when we reject the null hypothesis even though it is actually true. It is also known as a false positive. The probability of a Type I error is set by the threshold used for rejection (the significance level). A common threshold is 0.05. Those of you who have taken statistics before may recognize this value as $\alpha$. 

### Type II error

A Type II error occurs when we fail to reject the null hypothesis even though it is actually false. This is otherwise known as a false negative. The probability of a Type II error is harder to find because it depends on what the truth actually is, so it requires a more in-depth analysis of the hypothesis test. The probability of a Type II error is often written as $\beta$, and $1-\beta$ is referred to as **Power**. The power of a test is the probability that we reject the null hypothesis when it is in fact false. 

Which one of these errors is more serious? It depends on the context of the problem. 
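To make the two error probabilities concrete, here is a minimal sketch for a hypothetical one-sided test on a proportion. The numbers (100 trials, a cutoff of 59, a null value of 0.5 and a true value of 0.6) are assumptions chosen only for illustration and are not part of the lesson's example.

```python
from scipy import stats

# Hypothetical test: H0 says p = 0.5, and we reject when we see 59 or more
# successes in n = 100 trials. Suppose the truth is actually p = 0.6.
n, cutoff = 100, 59
p_null, p_true = 0.5, 0.6

# Type I error rate: P(reject | H0 true) = P(X >= cutoff) when p = p_null.
alpha = stats.binom.sf(cutoff - 1, n, p_null)

# Type II error rate: P(fail to reject | H0 false) = P(X < cutoff) when p = p_true.
beta = stats.binom.cdf(cutoff - 1, n, p_true)

# Power is the complement of the Type II error rate.
power = 1 - beta
print("alpha:", round(alpha, 3), "beta:", round(beta, 3), "power:", round(power, 3))
```

Raising the cutoff would lower `alpha` but raise `beta`, which is one reason the more serious error depends on context.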

### Example: Golf Balls

Joe has a summer job at a golf course, and one of his tasks is to fish golf balls out of the water traps. He has a theory that certain types of golf ball are more likely to end up in the water than others. Let's assume there are four brands of golf ball, and let's assume that all four are used equally at this golf course. He fishes out 100 golf balls and counts each brand. He finds 30 of brand A, 30 of brand B, 20 of brand C and 20 of brand D. Conduct a hypothesis test to determine whether certain types of golf ball are more likely than others to end up in the water.

Step 1: Hypotheses

Null Hypothesis: Each brand is equally likely to end up in the water.
Alternative Hypothesis: At least one brand is more likely than the others to end up in the water.

Step 2: Test statistic

There are many correct answers, but let's go with the sum of absolute differences between the observed and expected counts under $H_0$. To do this, we need to find the expected counts. If each brand were equally likely, how many balls of each brand should we expect to find if we select 100 golf balls? 

In [2]:
25

25
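As a small worked sketch, the expected counts and the observed value of the test statistic described in Step 2 can be computed from Joe's counts. The variable names here are illustrative and not part of the original notebook.

```python
import numpy as np

observed = np.array([30, 30, 20, 20])   # brands A, B, C, D from Joe's sample
n_balls = observed.sum()                 # 100 balls in total
expected = np.full(4, n_balls / 4)       # 25 of each brand under H0

# Test statistic: sum of absolute differences between observed and expected counts.
observed_stat = np.abs(observed - expected).sum()
print(expected, observed_stat)
```

Each brand is 5 balls away from its expected count of 25, so the statistic is 5 + 5 + 5 + 5 = 20.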

Step 3: $p$-value

We need the distribution of the test statistic under $H_0$. 

In [5]:
Balls=['A','B','C','D']
CountsA=[]
CountsB=[]
CountsC=[]
CountsD=[]
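# Simulate 10,000 samples of 100 balls under H0 (each brand equally likely)
# and record how many balls of each brand appear in each sample.
for x in np.arange(10000):
    Array=np.random.choice(Balls,100,replace=True)
    CountsA=np.append(CountsA,sum(Array=='A'))
    CountsB=np.append(CountsB,sum(Array=='B'))
    CountsC=np.append(CountsC,sum(Array=='C'))
    CountsD=np.append(CountsD,sum(Array=='D'))
# One row per simulated sample, one column per brand.
Table().with_columns("A",CountsA,"B",CountsB,"C",CountsC,"D",CountsD)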

A,B,C,D
29,19,29,23
28,11,34,27
23,27,25,25
26,25,34,15
25,19,27,29
20,28,25,27
33,17,28,22
23,23,24,30
22,24,28,26
27,25,21,27


In [9]:
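# Critical value: the 95th percentile of each brand's simulated count under H0, averaged over the four brands.
# A brand count at or above this value would make us suspect that brand is more likely than the others.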
print("The 5% that would make us suspect that a brand is more likely to land in the water is",(percentile(95,CountsA)+percentile(95,CountsB)+percentile(95,CountsC)+percentile(95,CountsD))/4)

The 5% that would make us suspect that a brand is more likely to land in the water is 32.0
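The cell above works with per-brand counts. As an alternative sketch, the statistic named in Step 2 (the sum of absolute differences) can be simulated directly to obtain a $p$-value. This assumes NumPy's `Generator.multinomial` for drawing the counts, and the seed is arbitrary; it is one possible approach, not the notebook's own method.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n_balls, n_sims = 100, 10000
expected = n_balls / 4

# Each row is one simulated sample of 100 balls under H0 (equal proportions).
null_counts = rng.multinomial(n_balls, [0.25] * 4, size=n_sims)
null_stats = np.abs(null_counts - expected).sum(axis=1)

# Observed statistic from Joe's sample of 30, 30, 20, 20.
observed_stat = np.abs(np.array([30, 30, 20, 20]) - expected).sum()

# p-value: share of simulated statistics at least as large as the observed one.
p_value = np.mean(null_stats >= observed_stat)
print("observed statistic:", observed_stat, "p-value:", p_value)
```

The resulting `p_value` would then be compared with the 0.05 threshold to decide whether to reject $H_0$.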


Step 4: Conclude

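We fail to reject the null hypothesis. The largest observed count (30) is below the critical value of 32, so there is not enough evidence that any brand is more likely than the others to end up in the water.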

What kind of error could we have made in this case? 

Type II

#### Power 
Suppose that, in truth, 30% of the balls found in the water were brand A, 30% were brand B, 20% were brand C and 20% were brand D. In this case, our collected sample reflected this truth perfectly. However, our hypothesis test failed to recognize this deviation from equal proportions, so we made a Type II error. This happened because the test has fairly low power. Use simulation to determine the power of this test. 

I am looking for the probability that I reject the null hypothesis given the true proportions laid out above. Well, first I need to figure out for what values of my test statistic I would reject $H_0$. 

Reject $H_0$ when 32 or more of the 100 balls (32% or more) are of a single brand.

Next, I need to simulate from the true population and determine how often my test statistic would have met this threshold. 

In [16]:
True_Balls=['A','A','A','B','B','B','C','C','D','D']
True_CountsA=[]
True_CountsB=[]
True_CountsC=[]
True_CountsD=[]
for x in np.arange(10000):
    Array=np.random.choice(True_Balls,100,replace=True)
    True_CountsA=np.append(True_CountsA,sum(Array=='A'))
    True_CountsB=np.append(True_CountsB,sum(Array=='B'))
    True_CountsC=np.append(True_CountsC,sum(Array=='C'))
    True_CountsD=np.append(True_CountsD,sum(Array=='D'))
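Table().with_columns("A",True_CountsA,"B",True_CountsB,"C",True_CountsC,"D",True_CountsD)
# Fraction of all 40,000 simulated brand counts (4 brands x 10,000 samples) that reach the critical value of 32.
(sum(True_CountsA>=32)+sum(True_CountsB>=32)+sum(True_CountsC>=32)+sum(True_CountsD>=32))/40000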

0.18465

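Therefore, when sampling from the true proportions, a brand's count reached the critical value of 32 in about 18.5% of the simulated cases. This is our estimate of the probability of rejecting $H_0$, that is, the power of the test.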

What do you think about this power? 

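The power is only about 18.5%, which is low. Equivalently, the probability of a Type II error is $\beta \approx 1 - 0.185 = 0.815$, so in most samples this test would fail to detect the deviation from equal proportions.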
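As a cross-check on the simulated value of about 0.185, each brand's count in a sample of 100 follows a binomial distribution, so the per-brand probabilities of reaching 32 can be computed exactly. This sketch assumes `scipy.stats.binom` and mirrors the way the simulation cell pools the four brands, which amounts to averaging their individual rates.

```python
from scipy import stats

n_balls, critical_value = 100, 32
true_props = {"A": 0.3, "B": 0.3, "C": 0.2, "D": 0.2}

# P(count >= 32) for each brand; each brand's count is Binomial(100, p).
exceed = {brand: stats.binom.sf(critical_value - 1, n_balls, p)
          for brand, p in true_props.items()}

# Pooling all four brands, as the simulation does, averages these probabilities.
average_rate = sum(exceed.values()) / len(exceed)
print(exceed)
print("average per-brand rate:", round(average_rate, 3))
```

Note that this average is not the same quantity as the chance that at least one brand reaches 32 in a given sample, so the figure depends on which rejection rule is intended.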

Repeat this power calculation, but assume Joe collects 500 balls instead of 100. Note that you will have to obtain a new critical value. What does this tell you about power and sample size?

In [13]:
New_Balls=['A','B','C','D']
New_CountsA=[]
New_CountsB=[]
New_CountsC=[]
New_CountsD=[]
for x in np.arange(10000):
    Array=np.random.choice(New_Balls,500,replace=True)
    New_CountsA=np.append(New_CountsA,sum(Array=='A'))
    New_CountsB=np.append(New_CountsB,sum(Array=='B'))
    New_CountsC=np.append(New_CountsC,sum(Array=='C'))
    New_CountsD=np.append(New_CountsD,sum(Array=='D'))
Table().with_columns("A",New_CountsA,"B",New_CountsB,"C",New_CountsC,"D",New_CountsD)

A,B,C,D
108,121,137,134
129,121,129,121
136,108,125,131
117,147,119,117
119,127,118,136
135,123,124,118
125,124,128,123
131,127,126,116
148,121,106,125
132,92,132,144


In [14]:
print("The 5% that would make us suspect that a brand is more likely to land in the water is",(percentile(95,New_CountsA)+percentile(95,New_CountsB)+percentile(95,New_CountsC)+percentile(95,New_CountsD))/4)

The 5% that would make us suspect that a brand is more likely to land in the water is 141.0


In [17]:
New_True_Balls=['A','A','A','B','B','B','C','C','D','D']
New_True_CountsA=[]
New_True_CountsB=[]
New_True_CountsC=[]
New_True_CountsD=[]
for x in np.arange(10000):
    Array=np.random.choice(New_True_Balls,500,replace=True)
    New_True_CountsA=np.append(New_True_CountsA,sum(Array=='A'))
    New_True_CountsB=np.append(New_True_CountsB,sum(Array=='B'))
    New_True_CountsC=np.append(New_True_CountsC,sum(Array=='C'))
    New_True_CountsD=np.append(New_True_CountsD,sum(Array=='D'))
Table().with_columns("A",New_True_CountsA,"B",New_True_CountsB,"C",New_True_CountsC,"D",New_True_CountsD)
(sum(New_True_CountsA>=141)+sum(New_True_CountsB>=141)+sum(New_True_CountsC>=141)+sum(New_True_CountsD>=141))/40000

0.41055

With the new sample size of 500, the estimated power is about 41.1%, which is much higher than the 18.5% obtained with 100 balls. Holding everything else fixed, a larger sample gives the test more power.
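The link between power and sample size can also be sketched with a small helper that repeats the whole procedure for any number of balls. This version is an assumption-laden illustration: it uses the sum-of-absolute-differences statistic from Step 2 rather than the per-brand threshold used in the cells above, so its numbers will differ from 18.5% and 41.1%. The function name `estimate_power` and the seed are hypothetical.

```python
import numpy as np

def estimate_power(n_balls, true_props, n_sims=10000, seed=0):
    """Estimate power of the sum-of-absolute-differences test at the 5% level."""
    rng = np.random.default_rng(seed)
    expected = n_balls / 4  # expected count per brand under H0 (four brands)

    def stat(counts):
        return np.abs(counts - expected).sum(axis=1)

    # Critical value: 95th percentile of the statistic under H0.
    null_stats = stat(rng.multinomial(n_balls, [0.25] * 4, size=n_sims))
    critical = np.percentile(null_stats, 95)

    # Power: share of samples from the true proportions that exceed it.
    alt_stats = stat(rng.multinomial(n_balls, true_props, size=n_sims))
    return np.mean(alt_stats > critical)

true_props = [0.3, 0.3, 0.2, 0.2]
power_100 = estimate_power(100, true_props)
power_500 = estimate_power(500, true_props)
print(power_100, power_500)
```

Calling the helper with other sample sizes gives a rough picture of how many balls Joe would need to collect to reach a desired power.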