# Statistical Hypothesis Testing I

In this notebook, we will implement and apply **statistical hypothesis tests** to make inferences about populations based on sample data.

At the start, we clarify common misconceptions in statistical hypothesis testing.

Subsequently, we will implement the one-sample $z$-test and the one-sample $t$-test.

Finally, we will apply one of the tests to a concrete example.

### **Table of Contents**
1. [Clarification of Misconceptions](#misconceptions)
2. [One-sample Tests](#one-sample-tests)
3. [Example](#example)

In [1]:
%load_ext autoreload
%autoreload 2

import matplotlib.pyplot as plt
import numpy as np

from scipy import stats

### **1. Clarification of Misconceptions** <a class="anchor" id="misconceptions"></a>
Statistical hypothesis testing can often cause confusion and thus misconceptions, which we would like to clarify below.

#### **Questions:**
1. (a) Is the $p$-value the probability that the null hypothesis $H_0$ is true given the data?
   
   It is the probability of obtaining more extreme values (values deviating from the $H_0$ distribution), provided that $H_0$ is true.
   
   (b) Are hypothesis tests carried out to decide if $H_0$ is true or false?

   There is no certainty, only evidence.
   If the $p$-value is below the significance level, $H_0$ can be rejected.
   Rejecting $H_0$ can be done with more certainty than claiming that it is true.
   Given $\alpha$: does the evidence speak against or for $H_0$?
   The conjecture is $H_1$, i.e., the claim we want to examine.
   
   (c) Are hypothesis tests carried out to establish the test statistic?
   
   The test statistic describes how well the observations match the distribution assumed under $H_0$.
   Hypothesis tests do not (only) serve to establish test statistics.
   They are meant to enable inferences.


BEGIN SOLUTION
(a)
   No, the $p$-value is the probability of observing a value at least as extreme as the test statistic $s$ if the null hypothesis $H_0$ is true. The question instead describes the posterior probability $$p(H_0 \mid \widetilde{\mathcal{X}}_N),$$ which a hypothesis test does not compute.


   (b) Are hypothesis tests carried out to decide if the null hypothesis is true or false?



   No, a hypothesis test investigates a claim made about a population parameter $\theta$. Given a significance level $\alpha$, we decide whether the sample data provide sufficient evidence to reject the null hypothesis $H_0$. However, it is not possible to conclude that $H_0$ is true: failing to reject $H_0$ only means that the evidence against it is insufficient, so it is statistically incorrect to accept the null hypothesis.



   (c) Are hypothesis tests carried out to establish the test statistic?



   No, the purpose of hypothesis testing is to make an inference about a population parameter. The test statistic is only a means to that end.

 END SOLUTION
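
The answer to (a) can be illustrated with a small simulation. The following is a minimal sketch under assumed settings (standard normal data, $N=30$, $\alpha=0.05$, 2000 repetitions, none of which come from the exercise): the null hypothesis is true in every repetition, yet roughly a fraction $\alpha$ of the $p$-values still falls below $\alpha$. A small $p$-value therefore cannot be read as the probability that $H_0$ is true.

```python
import numpy as np
from scipy import stats

# Assumed illustration settings (not part of the exercise).
rng = np.random.default_rng(0)
alpha = 0.05
n_experiments, n_samples = 2000, 30

# Every row is drawn from N(0, 1), so H_0: mu = 0 is true in all experiments.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_samples))

# Two-sided one-sample t-test applied to each row.
_, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Share of experiments in which H_0 would be rejected although it is true.
rejection_rate = np.mean(p_values < alpha)
print(f"Share of experiments with p < alpha: {rejection_rate:.3f}")
```

The printed share should be close to $\alpha$, which is the type I error rate the significance level is designed to control.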




### **2. One-sample Tests** <a class="anchor" id="one-sample-tests"></a>

We implement the function [`z_test_one_sample`](../e2ml/evaluation/_one_sample_tests.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage. Once the implementation has been completed, we check it for varying types of tests.

In [2]:
from e2ml.evaluation import z_test_one_sample
sigma = 0.5
mu_0 = 2
sample_data = np.round(stats.norm.rvs(loc=2, scale=sigma, size=10, random_state=50), 1)
z_statistic, p = z_test_one_sample(sample_data=sample_data, mu_0=mu_0, sigma=sigma, test_type="right-tail")
assert np.round(z_statistic, 4) == -1.5811, 'The z-test statistic must be ca. -1.581.'
assert np.round(p, 4) == 0.9431, 'The p-value must be ca. 0.9431 for the one-sided right-tail test.'
z_statistic, p = z_test_one_sample(sample_data=sample_data, mu_0=mu_0, sigma=sigma, test_type="left-tail")
assert np.round(z_statistic, 4) == -1.5811, 'The z-test statistic must be ca. -1.581.'
assert np.round(p, 4) == 0.0569, 'The p-value must be ca. 0.0569 for the one-sided left-tail test.'
z_statistic, p = z_test_one_sample(sample_data=sample_data, mu_0=mu_0, sigma=sigma, test_type="two-sided")
assert np.round(z_statistic, 4) == -1.5811, 'The z-test statistic must be ca. -1.581.'
assert np.round(p, 4) == 0.1138, 'The p-value must be ca. 0.1138 for the two-sided test.'
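
For orientation, a one-sample $z$-test with this call signature could be sketched as follows. This is a hypothetical stand-in named `z_test_one_sample_sketch`, not the actual `e2ml` implementation. It assumes the statistic $z = (\bar{x} - \mu_0) / (\sigma / \sqrt{N})$ and reads the $p$-value off the standard normal distribution according to `test_type`.

```python
import numpy as np
from scipy import stats


def z_test_one_sample_sketch(sample_data, mu_0, sigma, test_type="two-sided"):
    """Hypothetical sketch of a one-sample z-test (not the e2ml implementation)."""
    x = np.asarray(sample_data, dtype=float)
    n = x.size
    # Standardize the empirical mean with the known population std sigma.
    z = (x.mean() - mu_0) / (sigma / np.sqrt(n))
    if test_type == "right-tail":
        p = stats.norm.sf(z)  # P(Z >= z)
    elif test_type == "left-tail":
        p = stats.norm.cdf(z)  # P(Z <= z)
    elif test_type == "two-sided":
        p = 2 * stats.norm.sf(np.abs(z))  # P(|Z| >= |z|)
    else:
        raise ValueError(f"Unknown test_type: {test_type!r}")
    return z, p


# Toy data chosen for illustration: mean 2.25, so z = 0.25 / (0.5 / 2) = 1.0.
z_right, p_right = z_test_one_sample_sketch([1.5, 2.0, 2.5, 3.0], mu_0=2, sigma=0.5, test_type="right-tail")
z_left, p_left = z_test_one_sample_sketch([1.5, 2.0, 2.5, 3.0], mu_0=2, sigma=0.5, test_type="left-tail")
print(z_right, p_right, p_left)
```

Because the left-tail and right-tail $p$-values are complementary areas under the same density, they sum to one, which matches the pattern of the expected values in the assertions above ($0.9431 + 0.0569 = 1$).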

We implement the function [`t_test_one_sample`](../e2ml/evaluation/_one_sample_tests.py) in the [`e2ml.evaluation`](../e2ml/evaluation) subpackage. Once the implementation has been completed, we check it for varying types of tests.

In [3]:
from e2ml.evaluation import t_test_one_sample
sample_data = np.round(stats.norm.rvs(loc=13.5, scale=0.25, size=10, random_state=1), 1)
t_statistic, p = t_test_one_sample(sample_data=sample_data, mu_0=13, test_type="right-tail")
assert np.round(t_statistic, 4) == 4.5898 , 'The t-test statistic must be ca. 4.590.' 
assert np.round(p, 4) == 0.0007, 'The p-value must be ca. 0.0007 for the one-sided right-tail test.' 
t_statistic, p = t_test_one_sample(sample_data=sample_data, mu_0=13, test_type="left-tail")
assert np.round(t_statistic, 4) == 4.5898 , 'The t-test statistic must be ca. 4.590.' 
assert np.round(p, 4) == 0.9993, 'The p-value must be ca. 0.9993 for the one-sided left-tail test.' 
t_statistic, p = t_test_one_sample(sample_data=sample_data, mu_0=13, test_type="two-sided")
assert np.round(t_statistic, 4) == 4.5898 , 'The t-test statistic must be ca. 4.590.' 
assert np.round(p, 4) == 0.0013, 'The p-value must be ca. 0.0013 for the two-sided test.'
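
A possible way to sanity-check such an implementation is to compare it with SciPy. The sketch below uses a hypothetical function `t_test_one_sample_sketch` (again not the `e2ml` code) that estimates the standard deviation with `ddof=1` and uses a Student's $t$-distribution with $N-1$ degrees of freedom. It assumes a SciPy version in which `scipy.stats.ttest_1samp` accepts the `alternative` argument.

```python
import numpy as np
from scipy import stats


def t_test_one_sample_sketch(sample_data, mu_0, test_type="two-sided"):
    """Hypothetical sketch of a one-sample t-test (not the e2ml implementation)."""
    x = np.asarray(sample_data, dtype=float)
    n = x.size
    # The unbiased sample std (ddof=1) replaces the unknown population std.
    t = (x.mean() - mu_0) / (x.std(ddof=1) / np.sqrt(n))
    df = n - 1
    if test_type == "right-tail":
        p = stats.t.sf(t, df)
    elif test_type == "left-tail":
        p = stats.t.cdf(t, df)
    elif test_type == "two-sided":
        p = 2 * stats.t.sf(np.abs(t), df)
    else:
        raise ValueError(f"Unknown test_type: {test_type!r}")
    return t, p


# Same sample generation as in the cell above; SciPy serves as the reference.
sample = np.round(stats.norm.rvs(loc=13.5, scale=0.25, size=10, random_state=1), 1)
alternatives = {"right-tail": "greater", "left-tail": "less", "two-sided": "two-sided"}
for test_type, alternative in alternatives.items():
    t, p = t_test_one_sample_sketch(sample, mu_0=13, test_type=test_type)
    reference = stats.ttest_1samp(sample, popmean=13, alternative=alternative)
    print(test_type, np.isclose(t, reference.statistic), np.isclose(p, reference.pvalue))
```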


### **3. Example** <a class="anchor" id="example"></a>

Let us assume we have access to the following *independent and identically distributed* (i.i.d.) heart rate measurements $[\mathrm{beats/min}]$ of 40 patients in an *intensive care unit* (ICU):

$124, 111,  96, 104,  89, 106,  94,  48, 117,  61, 117, 104,  72,
86, 126, 103,  97,  49,  78,  52, 119, 107, 131, 112,  78, 132,
80, 139,  87,  44,  40,  60,  40,  80,  41, 103, 102,  44, 115,
103.$

#### **Questions:**
3. (a) Are the heart rates of ICU patients unusual, given that the normal heart rate has a mean of 72 beats/min? Use a significance level of $\alpha = 0.01$. Perform a statistical hypothesis test by following the steps presented in the lecture and by using Python.

   cf. below

   p-hacking: 
      - Defining $\alpha$ after the $p$-value has been computed.
      - Drawing a subset of the observed samples and testing on it: although the data were actually generated by the assumed distribution, some subsets do not look that way, so a chance effect gets attributed to the hypothesis. Drawing subsets is then repeated until the result fits.
      - Relevant for the exam (see YouTube).

   It is not possible to say in general whether it is better to test on the whole sample or repeatedly on subsamples.
   (For large datasets, repeated subsampling might possibly be more informative.)
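
The subset variant of p-hacking described in these notes can be made concrete with a simulation. The following is a minimal sketch under assumed settings that are not part of the exercise: normally distributed data with mean $\mu_0 = 72$ and standard deviation 20, $\alpha = 0.05$, 300 repetitions, and 10 random subsets of size 20 per repetition. It compares the rejection rate of one test on the full sample with the rate obtained when only the smallest $p$-value over the subsets is reported.

```python
import numpy as np
from scipy import stats

# Assumed illustration settings (not part of the exercise).
rng = np.random.default_rng(42)
alpha = 0.05
mu_0 = 72.0
n_experiments, n_samples = 300, 40
n_subsets, subset_size = 10, 20

full_rejections = 0
hacked_rejections = 0
for _ in range(n_experiments):
    # H_0 is true by construction: the data have mean mu_0.
    data = rng.normal(loc=mu_0, scale=20.0, size=n_samples)

    # Honest analysis: a single test on the full sample.
    full_rejections += stats.ttest_1samp(data, popmean=mu_0).pvalue < alpha

    # p-hacking: test on several random subsets and keep the smallest p-value.
    subset_p_values = [
        stats.ttest_1samp(
            rng.choice(data, size=subset_size, replace=False), popmean=mu_0
        ).pvalue
        for _ in range(n_subsets)
    ]
    hacked_rejections += min(subset_p_values) < alpha

full_rate = full_rejections / n_experiments
hacked_rate = hacked_rejections / n_experiments
print(f"False positive rate, full sample: {full_rate:.3f}")
print(f"False positive rate, best subset: {hacked_rate:.3f}")
```

Since $H_0$ holds in every repetition, each rejection is a false positive. Selecting the best subset gives several chances to cross the threshold, so its false positive rate is expected to exceed $\alpha$ noticeably.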

BEGIN SOLUTION

   Step 1 (Define null and alternative hypothesis): We perform a two-sided test:

   $$
   H_0: \mu = 72 \text{ versus } H_1: \mu \neq 72.
   $$

   Step 2 (Select test statistic): Since we study the mean of the population of ICU patients, we select the empirical mean as test statistic:

   $$
   \mu_{40} = \frac{1}{40} \sum_{n=1}^{40} x_n.
   $$

   Step 3 (Find sampling distribution of the test statistic under $H_0$): We have no information about the population distribution. However, we have $N=40$ i.i.d. observed samples, so we rely on the rule of thumb for the central limit theorem ($N > 30$) as the basis for the $t$-transformation:

   $$
   t_{40} = \frac{\mu_{40} - \mu}{\frac{\sigma_{40}}{\sqrt{40}}} \sim \mathrm{St}(39).
   $$

   Step 4 (Choose significance level): According to the question, we use $\alpha=0.01$.

   Step 5 (Evaluate test statistic) + Step 6 (Compute $p$-value): We perform both steps using Python in the next cell to obtain:

   $$
   \widetilde{t}_{40} \approx 3.8685, \quad
   p \approx 0.0004.
   $$

   Step 7 (Decide on the null hypothesis): Since $p \approx 0.0004 < 0.01 = \alpha$, we reject the null hypothesis $H_0$. That is, the sample data $\widetilde{\mathcal{X}}_{40}$ provide sufficient evidence that the ICU patients' mean heart rate differs significantly from $\mu_0 = 72$ at the significance level $\alpha = 0.01$.


   END SOLUTION


In [4]:
data = [124, 111,  96, 104,  89, 106,  94,  48, 117,  61, 117, 104,  72,
86, 126, 103,  97,  49,  78,  52, 119, 107, 131, 112,  78, 132,
80, 139,  87,  44,  40,  60,  40,  80,  41, 103, 102,  44, 115,
103]

# (1) define hypotheses
# H_0: mu = 72
# H_1: mu != 72
mu_0 = 72   # hypothesized population mean under H_0

# (2) select test statistic: empirical mean
s_n = np.mean(data)

# (3) find sampling distribution of test statistic under H_0
# z-statistic: requires a known population std, which we do not have -> NOT usable
# t-statistic: std is unknown, samples are independent, and N > 30 (CLT rule of thumb) -> usable
# applying the t-transformation means assuming a Student's t-distribution as the
# underlying sampling distribution (justified because the prerequisites are met)

# (4) define significance level: always define alpha before computing p-value
alpha = 0.01

# (5) evaluate test statistic for observed data & (6) compute p-value
t_statistic, p = t_test_one_sample(sample_data=data, mu_0=mu_0, test_type="two-sided")

# (7) make decision
if p < alpha:
    print("Reject H_0 with significance level alpha={}, and pvalue={}".format(np.round(alpha,4),np.round(p,4)))
else:
    print("Do not reject H_0")



Reject H_0 with significance level alpha=0.01, and pvalue=0.0004
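
As an independent cross-check that does not rely on the `e2ml` function, the statistic can be computed by hand and compared with `scipy.stats.ttest_1samp`. The sketch below assumes the unbiased standard deviation estimate (`ddof=1`) and also shows the equivalent decision via the critical value of the $t$-distribution with 39 degrees of freedom. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy import stats

heart_rates = np.array([
    124, 111, 96, 104, 89, 106, 94, 48, 117, 61, 117, 104, 72,
    86, 126, 103, 97, 49, 78, 52, 119, 107, 131, 112, 78, 132,
    80, 139, 87, 44, 40, 60, 40, 80, 41, 103, 102, 44, 115,
    103,
])
mu_0, alpha = 72, 0.01
n = heart_rates.size

# Manual t-statistic: standardized distance of the empirical mean from mu_0.
t_manual = (heart_rates.mean() - mu_0) / (heart_rates.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Reference result from SciPy (two-sided by default).
reference = stats.ttest_1samp(heart_rates, popmean=mu_0)

# Equivalent decision rule: reject H_0 if |t| exceeds the critical value.
t_critical = stats.t.ppf(1 - alpha / 2, df=n - 1)

print(f"manual: t={t_manual:.4f}, p={p_manual:.4f}")
print(f"scipy:  t={reference.statistic:.4f}, p={reference.pvalue:.4f}")
print(f"critical value: {t_critical:.4f}, reject H_0: {abs(t_manual) > t_critical}")
```

Comparing $|t|$ with the critical value and comparing $p$ with $\alpha$ are two views of the same decision, so both lead to the same conclusion.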
