# Week 8 - Hypothesis Testing
A way to check whether results are statistically significant or just due to random chance.

Scenario (Week 6):  
You want to track the average customer wait time,  
but you can't track every customer because there are too many people.  
- Take a random sample of customers instead.  

CLT:  
- If you take lots of random samples, the distribution of the sample means will  
    look like a bell curve (normal) even if the original data doesn't, provided each sample is reasonably large.  
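
The CLT claim can be checked with a small simulation. This is only a sketch: the wait times are simulated from an exponential distribution, and the mean of 4 minutes, the sample size and the seed are all assumptions chosen for illustration.

```r
# Minimal CLT sketch: all values here are simulated, not real wait times.
set.seed(1)                      # arbitrary seed for reproducibility
true_mean <- 4                   # hypothetical population mean wait (minutes)
n <- 50                          # customers per sample
n_samples <- 5000                # number of repeated samples

# Exponential wait times are strongly right-skewed, not bell-shaped
sample_means <- replicate(n_samples, mean(rexp(n, rate = 1 / true_mean)))

# Theoretical standard error: the sd of an exponential equals its mean
se <- true_mean / sqrt(n)

mean(sample_means)               # centre of the sample means
sd(sample_means)                 # spread of the sample means, compare with se

# For a normal distribution about 95% of values lie within 1.96 sd of the centre
coverage <- mean(abs(sample_means - true_mean) < 1.96 * se)
coverage
```

If the sample means are close to normal, `coverage` should land near 0.95 even though the individual wait times are far from bell-shaped.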

Question:  
Is there evidence that our average waiting time is under 5 minutes?  


## Learning Outcomes
- Demonstrate an understanding of one-sample and two-sample hypothesis testing
    for normal random variables, and the difference between one-sided and two-sided
    testing.
- Apply basic hypothesis testing to data.

We ask whether there is evidence against a null hypothesis:  
$$H_0: \text{null hypothesis}$$
$$\text{vs}$$
$$H_A: \text{alternative hypothesis}$$

The null hypothesis is always the default position. A typical question is:  
“Is there sufficient evidence in the data to dismiss the  
hypothesis that $\mu$ is equal to some fixed value $\mu_0$?”  

- We use the Neyman-Pearson framework to answer questions like this.  
- We are interested in the evidence against the null hypothesis
    contained in our data.  
- To do this, we ask: “How likely would it be to see our data
    sample $y$ by chance if the null hypothesis were true?”  
- So the key ideas are:  
    - We assume the null hypothesis is true.  
    - We calculate the probability of observing a sample at least as extreme as ours by chance
    if it were true.  
- The smaller this probability, the stronger the evidence against
    the null being true: the data are what actually happened,
    whereas the null is only an assumption.  


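1. Calculate the sample mean  
$\hat\mu = \frac{1}{n} \sum_{i=1}^{n}{y_i}$  

2. Calculate the z-score, using the mean $\mu_0$ specified by the null hypothesis  
$Z_n = \frac{\hat\mu - \mu_0}{\sigma/\sqrt{n}}$  
Under $H_0$, $Z_n \sim N(0, 1)$  

3. Two-sided p-value, where $Z \sim N(0, 1)$:  
$p = 1 - P(-|Z_n| < Z < |Z_n|)$  
$= 2P(Z < -|Z_n|)$  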

4. We can informally grade the p-value:
- for p > 0.05 we have weak/no evidence against the null;
- for 0.01 < p < 0.05 we have moderate evidence against the null;
- for p < 0.01 we have strong evidence against the null.

These grades correspond to a more than 5%, 1-5%, and less than 1% chance of seeing data at least as extreme as ours if the null hypothesis were true. The smaller that chance, the harder it is to put the result down to chance alone.

5. One-sided: $H_0: \mu \le \mu_0$ vs $H_A: \mu > \mu_0$, with p-value $p = P(Z > Z_n)$  
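
The steps above can be sketched as a small R function. `z_test()` and its arguments are hypothetical names (base R has no z-test function), the wait times are made up, and the population standard deviation of 1.5 is assumed known purely for illustration.

```r
# Sketch of a one-sample z-test with known population sd.
# z_test() is a hypothetical helper, not part of base R.
z_test <- function(y, mu0, sigma, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  mu_hat <- mean(y)                                # step 1: sample mean
  z <- (mu_hat - mu0) / (sigma / sqrt(length(y)))  # step 2: z-score under H0
  p <- switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),                # both tails
    greater   = pnorm(z, lower.tail = FALSE),      # H_A: mu > mu0
    less      = pnorm(z)                           # H_A: mu < mu0
  )
  list(mu_hat = mu_hat, z = z, p_value = p)
}

# Made-up wait times in minutes; sigma = 1.5 is an assumption
wait <- c(4.1, 5.3, 3.8, 4.9, 4.4, 3.6, 5.0, 4.2)
two_sided <- z_test(wait, mu0 = 5, sigma = 1.5)
one_sided <- z_test(wait, mu0 = 5, sigma = 1.5, alternative = "less")
two_sided$p_value
one_sided$p_value
```

Because the sample mean here falls below 5, the one-sided p-value for the "less" alternative is exactly half the two-sided one.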

### Testing $\mu$ with unknown variance
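We estimate the variance using the unbiased estimator $\hat\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \hat\mu)^2$ (usually better than the MLE, which divides by $n$).  
We then use a t-test instead of a z-test to account for the extra uncertainty of not knowing the population variance $\sigma^2$.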
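
As a rough sketch, the t-test can be computed by hand and checked against R's built-in `t.test()`. The data are made-up wait times and $\mu_0 = 5$ is an assumed null value.

```r
# Sketch of a one-sample t-test by hand, checked against t.test().
y <- c(4.1, 5.3, 3.8, 4.9, 4.4, 3.6, 5.0, 4.2)  # made-up wait times (minutes)
mu0 <- 5                                         # assumed mean under H0
n <- length(y)

# Unbiased variance estimate: divide by n - 1 (this is what var() does)
sigma2_hat <- sum((y - mean(y))^2) / (n - 1)

# t-score: same form as the z-score, with the estimated variance plugged in
t_score <- (mean(y) - mu0) / sqrt(sigma2_hat / n)

# Two-sided p-value from a t distribution with n - 1 degrees of freedom
p_manual <- 2 * pt(-abs(t_score), df = n - 1)

# Built-in equivalent
p_builtin <- t.test(y, mu = mu0)$p.value
c(manual = p_manual, builtin = p_builtin)
```

The t distribution has heavier tails than the normal, so for the same score it gives a larger p-value than the z-test, especially for small $n$.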

### Discussion Questions
1. A small micro-loan bank has 500 loan customers. If the total annual loan repayments
    made by an individual is a random variable with fixed but unknown mean $\mu$ and
    standard deviation \$900, approximate the 95% confidence interval for the
    population mean $\mu$ given the sample mean is \$755.  

$\left(\hat\mu - 1.96 \frac{\sigma}{\sqrt{n}},\ \hat\mu + 1.96 \frac{\sigma}{\sqrt{n}}\right)$  
$= \left(755 - 1.96 \frac{900}{\sqrt{500}},\ 755 + 1.96 \frac{900}{\sqrt{500}}\right)$  
$= (676.11,\ 833.89)$  
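
The same interval can be computed in R as a quick check. This sketch uses `qnorm(0.975)` rather than the rounded 1.96, and treats all 500 customers as the sample, as the working above does.

```r
# 95% confidence interval for the mean with known population sd
mu_hat <- 755     # sample mean
sigma  <- 900     # known population standard deviation
n      <- 500     # number of customers, treated as the sample size

z_crit <- qnorm(0.975)                    # two-sided 95% critical value
half_width <- z_crit * sigma / sqrt(n)    # critical value times standard error

ci <- c(lower = mu_hat - half_width, upper = mu_hat + half_width)
round(ci, 2)
```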



2. Imagine we have measured BMI on a sample of women aged 20-34 from the Pima ethnic group
    who do not have diabetes and a sample of BMI measurements from Pima ethnic women aged
    20-34 who do have diabetes. Let us assume that the population standard deviations of BMI for
    Pima ethnic people with and without diabetes have been estimated from another, larger study,
    and are known to be 𝜎𝑛 = 6.79 and 𝜎𝑑 = 6.69 for non-diabetics and diabetics, respectively.
    The two samples are:  
    𝒚𝑛 =(46.8,27.8,32.5,39.5,32.8,31.0,26.2,20.8)  
    𝒚𝑑 =(33.6,23.3,43.1,31.0,30.5,38.0,30.1,25.8)  
    Researchers want to know if there is a difference, at the population level, in BMIs between
    Pima ethnic women aged 20-34 with and without diabetes, i.e., we want to test:  
    𝐻0: 𝜇𝑛 = 𝜇𝑑  
    vs  
    𝐻𝐴: 𝜇𝑛 ≠ 𝜇𝑑  
    What would be the p-value and our conclusion assuming a significance level of 0.05?  

𝜎𝑛 = 6.79 and 𝜎𝑑 = 6.69 are known, so we can use a z-test for the difference of two means.  

In [3]:
# Sample means for yn and yd:
yn_values <- c(46.8, 27.8, 32.5, 39.5, 32.8, 31.0, 26.2, 20.8)
yd_values <- c(33.6, 23.3, 43.1, 31.0, 30.5, 38.0, 30.1, 25.8)

yn_mean <- mean(yn_values)
yd_mean <- mean(yd_values)

print(yn_mean)
print(yd_mean)

# Z(μn - μd) = (μn - μd) / sqrt((σn^2 + σd^2)/n)
Zscore <- (yn_mean - yd_mean) / sqrt((6.79^2 + 6.69^2) / 8)
print(Zscore)


[1] 32.175
[1] 31.925
[1] 0.07418194
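
The question also asks for the p-value and a conclusion, which follow from the z-score printed above. A minimal sketch, with the z-score copied from that output:

```r
# Two-sided p-value for the difference-of-means z-test
z_score <- 0.07418194                 # z-score from the output above
alpha <- 0.05                         # significance level from the question

p_value <- 2 * pnorm(-abs(z_score))   # probability in both tails
p_value
reject_null <- p_value < alpha        # TRUE would mean rejecting H0
reject_null
```

The p-value comes out at roughly 0.94, far above 0.05, so we do not reject the null: these samples give no evidence of a difference in mean BMI between the two groups.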


In [None]:
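# 3. Question 2 again, but without the population variances: use the approximate method for testing a difference of means.
#    Each population variance is replaced by the unbiased sample estimate (var() divides by n - 1).
yn_values <- c(46.8, 27.8, 32.5, 39.5, 32.8, 31.0, 26.2, 20.8)
yd_values <- c(33.6, 23.3, 43.1, 31.0, 30.5, 38.0, 30.1, 25.8)
yn_mean <- mean(yn_values)
yd_mean <- mean(yd_values)
# Estimated standard error of the difference in means
se_diff <- sqrt(var(yn_values) / length(yn_values) + var(yd_values) / length(yd_values))
T_score <- (yn_mean - yd_mean) / se_diff
# Assumption: the statistic is treated as approximately N(0, 1); with only 8 values per group,
# t.test(yn_values, yd_values) gives a more careful small-sample answer.
p_value <- 2 * pnorm(-abs(T_score))
print(T_score)
print(p_value)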


# Tutorial

2.1 Increase the variance?  
For a given sample with sample mean $\hat\mu$, what happens to the z-score if the population variance increases?  
Moreover, what happens to the p-value if the population variance increases? How can you interpret this?  
    If the population variance increases, the standard error $\sigma/\sqrt{n}$ grows, so the z-score shrinks towards 0 and the p-value increases.  
    With noisier data, the same gap between $\hat\mu$ and $\mu_0$ is more easily explained by random chance, so it is weaker evidence against the null.  
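
A small numerical check of how the z-score and p-value respond as the population standard deviation grows. The sample mean, null value and sample size are hypothetical and held fixed.

```r
# Effect of the population sd on the z-score and the two-sided p-value
mu_hat <- 4.4                  # hypothetical sample mean
mu0    <- 5                    # hypothetical null value
n      <- 25                   # hypothetical sample size
sigma  <- c(0.5, 1, 2, 4)      # increasing population standard deviations

z <- (mu_hat - mu0) / (sigma / sqrt(n))   # one z-score per sigma
p <- 2 * pnorm(-abs(z))                   # matching two-sided p-values

data.frame(sigma = sigma, z = z, p = p)   # print them side by side
```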

2.2 What is a p-value?  
Imagine we observed some data $y$, and test $H_0: \mu = \mu_0$ vs $H_A: \mu \ne \mu_0$ using the above procedure. We find
that the p-value is 0.9. What does a “p-value of 0.9” mean?  
    A p-value of 0.9 means that, if the null hypothesis were true, there would be a 90% chance of seeing a sample mean at least as far from $\mu_0$ as ours. Data like ours are very likely under the null, so we cannot reject it.
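
To get a feel for how close to $\mu_0$ the sample mean must be to give a p-value of 0.9, we can invert the two-sided formula. This sketch assumes the z-test setting with known variance.

```r
# Which |z| gives a two-sided p-value of 0.9?
p <- 0.9
z_abs <- qnorm(1 - p / 2)        # solves 2 * P(Z < -z_abs) = p
z_abs

# Sanity check: plugging z_abs back in recovers the p-value
p_check <- 2 * pnorm(-z_abs)
p_check
```

The result is about 0.13, so the observed sample mean sits only about 0.13 standard errors from $\mu_0$.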

2.3 Interpret the p-value  
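Does a p-value of 0.9 prove that the population mean $\mu = \mu_0$? If not, what does this p-value suggest?  
    No. It only suggests that the data are consistent with $\mu = \mu_0$: we have no evidence against the null, which is our default position, so we keep it. Values of $\mu$ close to $\mu_0$ would explain the data just as well.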

2.4 The t.test() command  
Of course, it is generally not possible to know what the population variance is, so this assumption is usually
unrealistic. Instead, it is common to use the data itself to estimate the variance using the unbiased estimate
of variance, which means that our test statistic becomes a t-score (see Lecture 5, Slides 34–35).  
t.test() computes t-test p-values.  

2.5 Using the data bpdata.csv: a person is at risk if their systolic blood pressure is 120-139 mmHg, and has high blood pressure
if it is more than 139 mmHg. We use t.test() to test $H_0: \mu = 120$; if the population mean is 120 or above, this group is an at-risk population.  

2.6 One sided test  
If we can believe that the blood pressure in this population should only be lower than 120, we can use a
one-sided t-test. (Note that one-sided hypothesis testing may be of interest when it is reasonable to assume
the parameter can take on values only on one side of the null value, as might be the case if it can be assumed
that the factor under study can only have either no effect or a positive effect on the outcome of interest.
However, this assumption is often difficult to justify, and for this reason two-sided hypotheses are more
common in practice.)  

In [22]:
bp <- read.csv("data/bpdata.csv")
t.test(x = bp$BP, mu = 120)
t.test(x = bp$BP, mu = 120, alternative = "less")


	One Sample t-test

data:  bp$BP
t = -2.0123, df = 19, p-value = 0.05858
alternative hypothesis: true mean is not equal to 120
95 percent confidence interval:
 113.0636 120.1364
sample estimates:
mean of x 
    116.6 



	One Sample t-test

data:  bp$BP
t = -2.0123, df = 19, p-value = 0.02929
alternative hypothesis: true mean is less than 120
95 percent confidence interval:
     -Inf 119.5215
sample estimates:
mean of x 
    116.6 
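
Both outputs share the same t-statistic and degrees of freedom, so the two p-values can be reproduced without the data file. A minimal sketch, with `t_stat` and `df` copied from the output above:

```r
# Reproduce the reported p-values from the t-statistic and degrees of freedom
t_stat <- -2.0123    # t from the t.test() output
df     <- 19         # degrees of freedom (n - 1 for 20 observations)

p_two_sided <- 2 * pt(-abs(t_stat), df)   # alternative: mean is not 120
p_less      <- pt(t_stat, df)             # alternative: mean is less than 120

c(two_sided = p_two_sided, less = p_less)
```

Because the sample mean is below 120, the one-sided p-value is half the two-sided one. At the 0.05 level the two-sided test (p = 0.05858) does not reject the null while the one-sided test (p = 0.02929) does, which is why the choice of alternative has to be justified before looking at the data.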
