# Hypothesis Testing and Inference
The science part of "data science" is about forming and testing hypotheses, things like "this coin is fair" and "data scientists prefer Python to R". To decide whether a claim like that holds up, we have to check it against data.


We set up an experiment with a default position, the null hypothesis H0, and some alternative hypothesis H1 that we'd like to compare it with. We then use statistics to decide whether or not we can reject the null hypothesis.

In [1]:
# first we need to pull in some functions from the probability notebook and it's easier to just c/p them into this 
# notebook than run the whole notebook here...

# PDFs and CDFs
import math
import random  # used by bernoulli_trial below
from collections import Counter

def uniform_pdf(x):
    return 1 if x >= 0 and x <1 else 0

def uniform_cdf(x):
    if x < 0: return 0
    elif x < 1: return x
    else: return 1
    
def normal_pdf(x, mu=0, sigma=1):
    sqrt_two_pi = math.sqrt(2* math.pi)
    return (math.exp(-(x-mu) ** 2 /2 / sigma**2)/(sqrt_two_pi * sigma))

def normal_cdf(x, mu=0, sigma =1):
    return (1 + math.erf((x - mu) / math.sqrt(2) / sigma)) / 2

def inverse_normal_cdf(p, mu=0, sigma=1, tolerance=0.00001):
    # if the distribution is not standard, solve the standard case and rescale
    if mu != 0 or sigma != 1:
        return mu + sigma * inverse_normal_cdf(p, tolerance=tolerance)
    
    low_z = -10.0 # normal_cdf(-10) is very close to zero
    hi_z = 10.0 # normal_cdf(10) is very close to one
    # binary search for the z whose cdf is p
    while hi_z - low_z > tolerance:
        mid_z = (low_z + hi_z) / 2
        mid_p = normal_cdf(mid_z)
        if mid_p < p:
            low_z = mid_z
        elif mid_p > p:
            hi_z = mid_z
        else:
            break
            
    return mid_z

def bernoulli_trial(p):
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli_trial(p) for _ in range(n))


In [67]:
# here we will start coding functions for finding probabilities in the binomial distribution

def normal_approximation_to_binomial(n, p):
    mu = p * n
    sigma = math.sqrt(p * (1-p) * n)
    return mu, sigma

# The normal cdf is the probability the variable is below a threshold
normal_probability_below = normal_cdf

# if it's above a threshold it's not below it
def normal_probability_above(lo, mu=0, sigma=1):
    return 1 - normal_cdf(lo, mu, sigma)

# it's in between if it's less than hi but not less than lo
def normal_probability_between(lo, hi, mu=0, sigma=1):
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)
    
# it's outside if it's not between    
def normal_probability_outside(lo, hi, mu=0, sigma=1):
    return 1 - normal_probability_between(lo, hi, mu, sigma)
    
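# we can also go the other way and find the bound(s) that correspond to a given probability
def normal_upper_bound(probability, mu=0, sigma=1):
    # the z for which P(Z <= z) = probability
    return inverse_normal_cdf(probability, mu, sigma)

def normal_lower_bound(probability, mu=0, sigma=1):
    # the z for which P(Z >= z) = probability
    return inverse_normal_cdf(1 - probability, mu, sigma)

def normal_two_sided_bounds(probability, mu=0, sigma=1):
    # symmetric bounds around the mean that contain the given probability
    tail_prob = (1-probability) / 2
    
    # the upper bound should have tail_prob above it
    upper_bound = normal_lower_bound(tail_prob, mu, sigma)
    # the lower bound should have tail_prob below it
    lower_bound = normal_upper_bound(tail_prob, mu, sigma)
    
    return lower_bound, upper_bound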





In [69]:
# If we choose to flip a coin n = 1000 times, X (the number of heads) will be distributed approx. normally with
# mean = 500 and standard deviation of 15.8 if our hypothesis of fairness is true

# Make the coin...
mu_0, sigma_0 = normal_approximation_to_binomial(1000, 0.5)


# We need to decide how willing we are to get false positives, or Type 1 errors.
# For reasons that are lost and may ultimately be the downfall of all of academic science, this is usually set to 5%

normal_two_sided_bounds(0.95, mu_0, sigma_0)

# Assuming p really equals 0.5, there's just a 5% chance we observe an X that lies outside this interval
# Approx 19 out of 20 tests will give the correct result

(469.01026640487555, 530.9897335951244)
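
As a sanity check on the hand-rolled bisection, the same bounds can be reproduced with `statistics.NormalDist` from the standard library (Python 3.8+). This is only a cross-check sketch, not part of the original notebook, and the variable names (`fair_coin`, `lower`, `upper`) are my own.

```python
import math
from statistics import NormalDist

n, p = 1000, 0.5
mu_0 = n * p
sigma_0 = math.sqrt(n * p * (1 - p))

# Normal approximation to the number of heads under H0
fair_coin = NormalDist(mu_0, sigma_0)

# Central 95% interval: 2.5% of the probability in each tail
lower = fair_coin.inv_cdf(0.025)
upper = fair_coin.inv_cdf(0.975)
print(lower, upper)
```

The two numbers should agree with the output above to within the bisection tolerance.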

In [72]:
# We can also look at the power of a test, which is the probability of not making a Type 2 error.
# A Type 2 error is failing to reject the null hypothesis even though it's false.

#0.95 bounds based on assumption p is 0.5
lo, hi = normal_two_sided_bounds(0.95, mu_0, sigma_0)

# actual mu and sigma based on p = 0.55 (making an unfair coin)
mu_1, sigma_1 = normal_approximation_to_binomial(1000, 0.55)

type_2_probability = normal_probability_between(lo, hi, mu_1, sigma_1)
power = 1 - type_2_probability
print(power)


0.8865480012953671
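
The power above comes from the normal approximation. A rough way to see how good that approximation is: compute the same quantity from the exact binomial distribution. This is a sketch under my own assumptions; `binomial_pmf` is a helper I made up for it (computed in log space so the huge binomial coefficients don't overflow), and the bounds are the rounded values from the fair-coin interval.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)

n = 1000
lo, hi = 469.01, 530.99  # rounded 95% bounds under H0 (p = 0.5)

# Type 2 error: the coin really has p = 0.55 but the head count lands inside [lo, hi]
type_2_exact = sum(binomial_pmf(k, n, 0.55) for k in range(n + 1) if lo <= k <= hi)
power_exact = 1 - type_2_exact
print(power_exact)
```

The exact figure should land close to the approximate one; the small gap comes from treating a discrete head count as continuous.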


In [75]:
# Now we can check more directional stuff. Imagine our null hypothesis is that the coin is not biased toward heads
# H0: p <= 0.5
# We want a one-sided test that rejects the null hypothesis when X is much larger than 500 but not when X is smaller than 500.

hi = normal_upper_bound(0.95, mu_0, sigma_0)
print(hi) # it's less than 531 since we need "more probability" in the upper tail. Scoop up the probabilities and move them over

type_2_probability = normal_probability_below(hi, mu_1, sigma_1)
# Joel doesn't make this function but why not
def power(type_2_probability):
    return 1 - type_2_probability

power(type_2_probability)

# This is a more powerful test since it no longer rejects H0 when X is below 469 (unlikely if H1 is true) and instead
# rejects H0 when X is between 526 and 531 (somewhat likely if H1 is true)



526.0073585242053


0.9363794803307173
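
To put the two tests side by side, here is a compact recomputation of both powers using `statistics.NormalDist` in place of the notebook's own helper functions. Treat it as an illustrative sketch; the names `null`, `alt` and `cutoff` are mine.

```python
import math
from statistics import NormalDist

n = 1000
null = NormalDist(n * 0.5, math.sqrt(n * 0.5 * 0.5))    # H0: p = 0.5
alt = NormalDist(n * 0.55, math.sqrt(n * 0.55 * 0.45))  # the coin actually has p = 0.55

# Two-sided test: reject outside the central 95% of the null distribution
lo, hi = null.inv_cdf(0.025), null.inv_cdf(0.975)
two_sided_power = 1 - (alt.cdf(hi) - alt.cdf(lo))

# One-sided test: reject only above the 95th percentile of the null distribution
cutoff = null.inv_cdf(0.95)
one_sided_power = 1 - alt.cdf(cutoff)

print(two_sided_power, one_sided_power)
```

Both tests have the same 5% false positive rate, but the one-sided test spends all of it in the tail where the alternative actually lives, which is why its power is higher.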

In [77]:
# This naturally leads us into thinking about p-values. Instead of picking an interval, we compute the probability,
# assuming H0 is true, that we would see a value at least as extreme as the one we actually observed.

def two_sided_p_value(x, mu=0, sigma=1):
    if x >= mu:
        return 2 * normal_probability_above(x, mu, sigma)
    
    else:
        return 2 * normal_probability_below(x, mu, sigma)
    
# So if we see 530 heads, we would compute
two_sided_p_value(529.5, mu_0, sigma_0)

# (Side note: why did we use 529.5 instead of 530? This is a continuity correction.
# normal_probability_between(529.5, 530.5, mu_0, sigma_0) is a better estimate of seeing exactly 530 heads than
# normal_probability_between(530, 531, mu_0, sigma_0), so correspondingly normal_probability_above(529.5, mu_0, sigma_0)
# is a better estimate of seeing at least 530 heads. In case you were wondering.)



0.06207721579598857
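
The side note about 529.5 can be checked directly, since for a fair coin the exact tail probability is just a sum of binomial coefficients. The sketch below compares the exact P(X >= 530) with the normal estimate with and without the half-unit shift. It uses `math.comb` and `statistics.NormalDist` (Python 3.8+) rather than the notebook's helpers, and the variable names are mine.

```python
import math
from statistics import NormalDist

n = 1000
fair_coin = NormalDist(n * 0.5, math.sqrt(n * 0.25))

# Exact P(X >= 530) for X ~ Binomial(1000, 0.5), using integer arithmetic
exact = sum(math.comb(n, k) for k in range(530, n + 1)) / 2 ** n

with_correction = 1 - fair_coin.cdf(529.5)  # what the notebook uses
without_correction = 1 - fair_coin.cdf(530)

print(exact, with_correction, without_correction)
```

Doubling `with_correction` gives the two-sided p-value computed above.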

In [81]:
# Let's try a simulation
import random

extreme_value_count = 0
for _ in range(100000):
    num_heads = sum(1 if random.random() <0.5 else 0 for _ in range(1000))
    
    if num_heads >= 530 or num_heads <= 470:
        extreme_value_count += 1
        
print(extreme_value_count / 100000)



0.06119


In [83]:
# the p-value is greater than our 5% so we don't reject the null. If we saw 532 heads, the p-value would be:
two_sided_p_value(531.5, mu_0, sigma_0)

0.046345287837786575

In [84]:
# This is less than our 5% significance level so we reject the null.
# similarly we could have

upper_p_value = normal_probability_above
lower_p_value = normal_probability_below

# for our one-sided test, if we saw 525 heads we would compute:
upper_p_value(524.5, mu_0, sigma_0)


0.06062885772582083

In [86]:
# Which means we wouldn't reject the null. If we saw 527 heads
upper_p_value(526.5, mu_0, sigma_0)

0.04686839508859242

In [87]:
# And we would reject the null.
# This of course assumes our data is normally distributed. There are various normality tests, but plotting is a good start

## Confidence intervals

If we observe 525 out of 1000 flips as heads, we estimate p = 0.525. Confidence intervals are a good way to say how confident we are in that estimate. The central limit theorem tells us that the average of those Bernoulli variables should be approximately normal, with mean p and standard deviation



In [91]:
# math.sqrt(p * (1-p) / n), with n = 1000 flips

# we don't actually know p, so we use our estimate

p_hat = 525 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1-p_hat)/1000)
print(sigma)

0.015791611697353755


In [93]:
# Using the normal approximation, we conclude that we are 95% confident that the following interval contains the true 
# parameter p

normal_two_sided_bounds(0.95, mu, sigma)

(0.4940490278129096, 0.5559509721870904)

This is a statement about the interval, not p. It means that if we were to repeat the experiment many times, 95% of the time the "true" parameter (which is always the same) would lie within the observed confidence interval (which could be different every time).
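
That "repeat the experiment many times" reading is easy to try out. The sketch below is my own addition, not from the book: it flips a coin with a known p many times over, builds the normal-approximation interval each time, and counts how often the interval contains the true p. The helper name `wald_interval` and the experiment count are assumptions made for the illustration.

```python
import math
import random
from statistics import NormalDist

def wald_interval(heads, n, confidence=0.95):
    """Normal-approximation confidence interval for p from one experiment."""
    p_hat = heads / n
    sigma = math.sqrt(p_hat * (1 - p_hat) / n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z * sigma, p_hat + z * sigma

random.seed(0)
true_p, n, num_experiments = 0.5, 1000, 2000

covered = 0
for _ in range(num_experiments):
    heads = sum(random.random() < true_p for _ in range(n))
    lower, upper = wald_interval(heads, n)
    if lower <= true_p <= upper:
        covered += 1

coverage = covered / num_experiments
print(coverage)
```

The printed fraction should come out near 0.95. The interval moves from run to run while `true_p` stays fixed, which is exactly the distinction the paragraph is making.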

We do not conclude that the coin is unfair, since 0.5 falls within our confidence interval.
If instead we had seen 540 heads:

In [95]:
p_hat = 540 / 1000
mu = p_hat
sigma = math.sqrt(p_hat * (1-p_hat)/1000)
print(sigma)
normal_two_sided_bounds(0.95, mu, sigma)

# 0.5 (a fair coin) does not lie in the confidence interval: the lower bound is already greater than 0.5

0.015760710643876435


(0.5091095927295919, 0.5708904072704082)

## P-hacking: The bane of science

In [98]:
# If our significance level is set to 5%, then 5% of the time we will erroneously reject the null hypothesis
# (see something that isn't there, chase noise, etc.)

def run_experiment():
    return [random.random() < 0.5 for _ in range(1000)]

def reject_fairness(experiment):
    num_heads = len([flip for flip in experiment if flip])
    return num_heads < 469 or num_heads > 531

random.seed(0)
experiments = [run_experiment() for _ in range(1000)]
num_rejections = len([experiment for experiment in experiments if reject_fairness(experiment)])

print(num_rejections)

46


You can usually find significant results if you throw enough hypotheses at your data set. Remove the right outliers and you can probably get a p-value below 5%. This is p-hacking, and it is in some ways a consequence of the inference-from-p-values framework.
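
To put a number on "enough hypotheses": if every test has a 5% false positive rate, and we assume the tests are independent (a simplification I'm adding for the arithmetic, real tests on one data set usually aren't), the chance that at least one of them comes out "significant" by luck alone grows quickly. The function name below is made up for this sketch.

```python
alpha = 0.05  # significance level of each individual test

def prob_at_least_one_false_positive(num_tests, alpha=alpha):
    """Chance that at least one of num_tests independent true nulls gets rejected."""
    return 1 - (1 - alpha) ** num_tests

for k in (1, 5, 20, 100):
    print(k, round(prob_at_least_one_false_positive(k), 3))
```

With 20 hypotheses the chance of at least one spurious "finding" is already well over half.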

If you want to do good *science*, you should determine your hypotheses before looking at the data, you should clean your data without the hypotheses in mind, and you should keep in mind that p-values are not substitutes for common sense.

We'll talk about Bayesian inference later. 

In [110]:
# Case study: Running an A/B test
# (Joel uses the example of getting people to click on ads but that sounds horrid (he says as much) so I came up with my own)

# We're A/B testing how much joy something we show people brings them: which is more upvoted?
# We don't know the actual standard deviations so we should really be using a t-test, but our samples are large
# enough that the difference isn't all that important

def estimated_parameters(N, n):
    p = n/N
    sigma = math.sqrt(p * (1-p) / N)
    return p, sigma

# Say one version gets 200 upvotes out of 1000 while the other gets 180 out of 1000

def ab_test_stat(N_A, n_A, N_B, n_B):
    p_A, sigma_A = estimated_parameters(N_A, n_A)
    p_B, sigma_B = estimated_parameters(N_B, n_B)
    
    return ((p_B - p_A) / math.sqrt(sigma_A ** 2 + sigma_B ** 2))

z = ab_test_stat(1000, 200, 1000, 180)
print(z)

-1.1403464899034472


In [112]:
# The probability you'd see such a difference if the means were actually equal is 

two_sided_p_value(z)

0.254141976542236

In [114]:
# 25%, which is far too large to reject the null hypothesis of no difference

# If one only got 150 upvotes on one:

z = ab_test_stat(1000, 200, 1000, 150)
print(z)
two_sided_p_value(z)

-2.948839123097944


0.003189699706216853

Only about a 0.3% chance we'd see such a large difference if they had the same mean.
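
Written out, the statistic that `ab_test_stat` computes is the difference in observed rates divided by the standard error of that difference, which should be approximately standard normal if the two true rates are equal. The hat notation for the estimated rates is mine; the symbols otherwise follow the code.

$$
z = \frac{\hat{p}_B - \hat{p}_A}{\sqrt{\sigma_A^2 + \sigma_B^2}},
\qquad
\sigma_A = \sqrt{\frac{\hat{p}_A (1 - \hat{p}_A)}{N_A}},
\qquad
\sigma_B = \sqrt{\frac{\hat{p}_B (1 - \hat{p}_B)}{N_B}}
$$

For the first example (200 and 180 out of 1000):

$$
z = \frac{0.18 - 0.20}{\sqrt{\frac{0.20 \cdot 0.80}{1000} + \frac{0.18 \cdot 0.82}{1000}}}
  = \frac{-0.02}{\sqrt{0.0003076}} \approx -1.14
$$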

## Bayesian Inference

So far we've just been making probability statements about our *tests*. The alternative approach is to treat the unknown parameters as random variables themselves. We can start with a prior distribution for the parameters and then use the observed data and Bayes's Theorem to get an updated posterior distribution for the parameters. Rather than make probability judgements about the tests, we make probability judgements about the parameters themselves.


In [115]:
# when the unknown parameter is a probability, we can take a prior from the Beta distribution, putting it between 0 and 1

def B(alpha, beta):
    return math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)

def beta_pdf(x, alpha, beta):
    if x <= 0 or x >= 1:
        return 0
    return x ** (alpha -1) * (1-x) ** (beta - 1) / B(alpha, beta)

#we might use these later. 

(I'm just going to essentially copy what Joel writes to finish the chapter because it's worth having in here, but there's not really any more code.)

If alpha and beta are both 1, it's just the uniform distribution (centered at 0.5, very dispersed). If alpha is much larger than beta, most of the weight is near 1. And if alpha is much smaller than beta, most of the weight is near zero.
We pick our prior (maybe one saying the coin is fair) and flip our coin, seeing h heads and t tails. Our posterior distribution for p is again a Beta distribution, but with parameters alpha + h and beta + t. (*Note: the Beta distribution is the conjugate prior to the binomial distribution, meaning that if you update a Beta prior using binomial observations, you get back a Beta posterior.*)

Let's say you flip 10 times and see 3 heads. If you started with the uniform prior, the new distribution is Beta(4,8), centered around 0.33. Since you said everything is equally likely, your best guess is something pretty close to the observed probability. If you started with Beta(20,20) (saying it's roughly fair) you'd get Beta(23,27), centered around 0.46, indicating a revised belief that the coin might be biased toward tails. If you started with Beta(30,10), meaning a belief that the coin lands heads 75% of the time, you'd get Beta(33,17), centered around 0.66. You still believe in the heads bias, but not as strongly as before.
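
The three updates are just additions, so they are easy to reproduce. A minimal sketch follows; `update_beta` and `beta_mean` are hypothetical helper names of my own, and "centered around" is read here as the mean of the Beta distribution, alpha / (alpha + beta).

```python
def update_beta(alpha, beta, heads, tails):
    """Posterior Beta parameters after observing the given flips."""
    return alpha + heads, beta + tails

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

heads, tails = 3, 7
priors = {"uniform": (1, 1), "roughly fair": (20, 20), "heads-biased": (30, 10)}

for name, (a, b) in priors.items():
    post_a, post_b = update_beta(a, b, heads, tails)
    print(f"{name}: Beta({post_a}, {post_b}), mean {beta_mean(post_a, post_b):.2f}")
```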

If you flipped the coin more and more times, the prior would matter less and less, until you'd have nearly the same posterior distribution no matter which prior you started with. No matter how biased you thought the coin was, it'd be hard to maintain that belief after seeing 1000 heads in 2000 flips (unless your prior was ridiculous).
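
Here is a quick numerical look at the prior washing out, again only a sketch: the flip counts and the `posterior_mean` helper are my own choices. It tracks the posterior mean under each of the three priors when exactly half the flips come up heads.

```python
def posterior_mean(alpha, beta, heads, tails):
    """Mean of the Beta posterior after updating a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + heads + tails)

priors = {"uniform": (1, 1), "roughly fair": (20, 20), "heads-biased": (30, 10)}

spreads = []
for flips in (10, 100, 2000):
    heads = tails = flips // 2
    means = {name: posterior_mean(a, b, heads, tails) for name, (a, b) in priors.items()}
    # gap between the most and least optimistic posterior means
    spreads.append(max(means.values()) - min(means.values()))
    print(flips, {name: round(m, 4) for name, m in means.items()})
```

The gap between the priors shrinks as the flips pile up; by 2000 flips even the heads-biased prior ends up with a posterior mean within about half a percentage point of 0.5.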

Bayesian inference can be controversial because of the complicated background mathematics and the subjective nature of choosing a prior.

We've just scratched the surface of statistical inference. 

