# Practice Session 4: Hypothesis Testing I

In [2]:
import pandas as pd
import numpy as np
from numpy.random import default_rng

import seaborn as sns

from scipy import stats
from scipy.stats import t, norm, ttest_1samp

import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider
plt.rcParams["figure.figsize"] = (8, 8)

## Part 1: *t*-Distribution
Yesterday, we talked about the normal distribution and saw that roughly 95% of the area under the normal curve
lies within $\mu \pm 1.96\sigma$ (that is, within approximately 1.96 standard deviations of the mean).\
We also saw that when we repeatedly drew samples and calculated their means, the histogram of those sample means looked bell-shaped.\
That was exactly the **Central Limit Theorem (CLT)** in action.\
Regardless of the original data’s distribution (as long as the observations are independent and identically distributed with finite variance):
- the sampling distribution of the mean $\bar{X}$ becomes approximately normal as the sample size *n* grows,  
- centered at the true mean $\mu$,
- with a spread described by the **standard error**:\
 $\text{SE} = \frac{\sigma}{\sqrt{n}}$

So, according to the CLT, the sample mean is approximately distributed as $\bar{X} \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$.\
This in turn means that $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)$
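
The standard error formula can be checked numerically. The sketch below is only an illustration: the exponential population, the sample size, the number of repetitions, and the seed are all arbitrary choices, not values from this session.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

n = 50           # sample size (illustrative)
n_rep = 20_000   # number of repeated samples (illustrative)
scale = 2.0      # exponential population: mean = standard deviation = scale

# Each row is one sample of size n from a clearly non-normal population
samples = rng.exponential(scale=scale, size=(n_rep, n))
sample_means = samples.mean(axis=1)

theoretical_se = scale / np.sqrt(n)       # sigma / sqrt(n)
empirical_se = sample_means.std(ddof=1)   # spread of the simulated sample means

print(f"mean of sample means: {sample_means.mean():.3f} (population mean = {scale})")
print(f"empirical SE: {empirical_se:.3f}, theoretical SE: {theoretical_se:.3f}")
```

The two standard errors should agree closely, even though the individual observations are strongly skewed.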

Based on the properties of the standard normal distribution and after a few algebraic steps we can write:
$P\left(\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}\right) = 0.95$\
This result defines the **95% confidence interval** for the population mean $\mu$.

> It’s important to interpret this correctly:
> The 95% confidence level means that *if we repeated the sampling process many times*,
> about **95% of those intervals** would contain the true mean $\mu$.\
> So the confidence level refers to the **method’s reliability**, not the probability that $\mu$ lies inside one specific interval.


Let’s now connect this theory to a concrete example.

Yesterday, we considered a population that followed a normal distribution with a **mean of 10.0** and a **standard deviation of 1.8**, that is: $\mathcal{N}(10.0, 1.8^2)$

If we repeatedly draw random samples of size $n = 5$ from this population, the sample means are normally distributed as well (exactly so here, because the population itself is normal; the CLT is only needed for non-normal populations):\
$\bar{X} \sim \mathcal{N}\left(10.0, \frac{1.8^2}{5}\right)$\
Based on this, we can use the formula derived above to construct a 95% confidence interval for the population mean:\
$\bar{X} \pm 1.96 \times \frac{1.8}{\sqrt{5}}$

So, if we were to repeat this sampling process many times, about **95% of the calculated intervals** would include the true mean value of **10.0**.
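
As a quick check of this claim, here is a minimal sketch that simulates the interval with the **known** $\sigma$. The number of repetitions and the seed are arbitrary assumptions; the population parameters are the ones stated above.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

mu, sigma, n = 10.0, 1.8, 5
n_rep = 100_000  # number of simulated samples (illustrative)

# Each row is one sample of size n
samples = rng.normal(mu, sigma, size=(n_rep, n))
xbar = samples.mean(axis=1)

# Half-width of the interval, using the TRUE population sigma
half_width = 1.96 * sigma / np.sqrt(n)

# Does each interval contain the true mean?
covered = (xbar - half_width <= mu) & (mu <= xbar + half_width)
coverage_known_sigma = covered.mean()

print(f"coverage with known sigma: {coverage_known_sigma:.3f}")
```

With $\sigma$ known, the proportion of intervals containing $\mu$ should land very close to 0.95.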

However, in real-world situations we typically **do not know** the true standard deviation $\sigma = 1.8$.\
Instead, we have to estimate it from the sample using the sample standard deviation ($s$).

If we now use this sample estimate in our formula for the 95% confidence interval, we would write: $\bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}}$

Let’s check whether this version of the interval still captures the true mean about 95% of the time when the sample size is small ($n = 5$).

Let’s test this empirically by simulation.

We will:
1. Draw many samples (100000) of size $n = 5$ from $\mathcal{N}(10.0, 1.8^2)$.
2. For each sample, compute $\bar{X}$ and $s$.
3. Build the interval $\bar{X} \pm 1.96 \cdot \frac{s}{\sqrt{n}}$.
4. Check whether $\mu = 10.0$ falls inside.
5. Calculate the **coverage proportion** across all repetitions.

In [None]:
# YOUR CODE HERE!

As we see, the coverage is lower - only about **88%**, not 95%.  So, why does this happen?

In our simulation, the population standard deviation $\sigma$ was **unknown**, so we estimated it using the sample standard deviation $s$.\
When the sample size $n$ is **large**, this substitution has very little effect - the sampling distribution of the statistic
$\frac{\bar{X} - \mu}{s / \sqrt{n}}$ remains approximately **standard normal**.  
This gives us the familiar large-sample confidence interval for the mean: $\bar{X} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$,\
which is valid for sufficiently large samples (usually $n \ge 30\text{–}40$), regardless of the exact shape of the population distribution.  

However, when the **sample size is small**, replacing $\sigma$ with $s$  adds noticeable **extra variability**.  
The statistic above is no longer normally distributed, and using the standard normal critical value $z_{\alpha/2}=1.96$
produces confidence intervals that are **too narrow** - as we just observed.

To account for this additional uncertainty, we use a different reference distribution: the [**Student’s *t*-distribution**](https://onlinestatbook.com/2/estimation/t_distribution.html).

#### The *t*-distribution

The *t*-distribution was developed in [1908 by William Sealy Gosset](https://seismo.berkeley.edu/~kirchner/eps_120/Odds_n_ends/Students_original_paper.pdf), a statistician working at the Guinness Brewery in Dublin.  
Because his employer required anonymity, he published under the pseudonym **“Student”**, hence the name **Student’s *t*-distribution**.

The *t*-distribution applies when:
- the underlying population is approximately normal,
- the sample size $n$ is small (for large $n$ it remains valid, but is nearly identical to the standard normal),
- and the population standard deviation $\sigma$ is unknown and estimated by $s$.

It is **symmetric** like the normal distribution, but has **heavier tails**, reflecting the extra uncertainty from estimating $\sigma$.  
The exact shape depends on the **degrees of freedom** ($\text{df} = n - 1$).  
As the sample size increases, the *t*-distribution gradually approaches the **standard normal**.

Let’s visualize this next to see how the *t*-distribution compares to the normal distribution for different sample sizes.


In [3]:
# x values for the density plot
x = np.linspace(-5, 5, 400)

# Standard normal PDF (fixed)
normal_pdf = norm.pdf(x)

def plot_t_vs_normal(df=4):
    """Plot the t-distribution for a given df compared to the standard normal."""
    t_pdf = t.pdf(x, df)
    
    plt.figure(figsize=(7, 4))
    plt.plot(x, normal_pdf, 'k--', lw=2, label='Normal (Z)')
    plt.plot(x, t_pdf, color='#00bf63', lw=2, label=f't-distribution (df={df})')
  
    plt.xlabel("Value")
    plt.ylabel("Density")
    plt.legend()
    plt.grid(alpha=0.3)
    plt.ylim(0, 0.45)
    plt.show()

# Interactive slider for df
interact(plot_t_vs_normal, df=IntSlider(min=1, max=100, step=1, value=4, description='df'));


> As we can see, both curves are symmetric and centered at zero, but the *t*-distribution has heavier tails.  
> This means that extreme values are more likely - reflecting the extra uncertainty when estimating the standard deviation from a small sample.
>
> As the degrees of freedom increase (df = 10, 30, ...), the *t*-curve gradually approaches the standard normal.  
> For large samples (roughly $n > 30$), the difference becomes negligible.
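
The same convergence can be seen in the critical values themselves. The following sketch tabulates the two-sided 95% critical value of the *t*-distribution for a few degrees of freedom (the particular df values are arbitrary) and compares them with the normal value:

```python
from scipy.stats import norm, t

# Two-sided 95% critical value of the standard normal (about 1.96)
z_crit = norm.ppf(0.975)
print(f"z critical value: {z_crit:.3f}")

# Corresponding t critical values for a few illustrative degrees of freedom
t_crits = {}
for df in (2, 4, 10, 30, 100):
    t_crits[df] = t.ppf(0.975, df)
    print(f"df = {df:>3}: t critical value = {t_crits[df]:.3f}")
```

Every *t* critical value is larger than the normal one, and the gap shrinks as df grows. For small samples this is what widens the confidence interval.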


#### <font color='#fc7202'>Task 1:</font>

In the previous simulation, we saw that using the normal critical value (1.96) with a small sample size ($n = 5$) gave us coverage of only about **88%**, not 95%.  
Now, let’s see if using the *t*-distribution instead fixes the problem.

1. Repeat the same simulation as before - draw many samples of size $n = 5$ from  $\mathcal{N}(10.0, 1.8^2)$.
2. For each sample:
   - Compute the sample mean ($\bar{X}$) and the sample standard deviation ($s$).
   - Construct a 95% confidence interval using the *t*-distribution:\
     $\bar{X} \pm t_{0.975,\,df} \times \frac{s}{\sqrt{n}}$,
     where $df = n - 1$.
3. Check in how many of these intervals the true mean $\mu = 10.0$ falls.
4. Compute the coverage proportion (should be close to 0.95).

> *Hint:*
> You can get the correct *t*-critical value ($t_{0.975, df}$) using `SciPy`:
> ```python
> df = n - 1
> t_crit = t.ppf(0.975, df)   # 0.975 quantile: leaves 2.5% in the upper tail, so ±t_crit covers the central 95%
> ```

In [None]:
# YOUR CODE HERE!

> You should now see a **coverage** very close to **0.95** (typically around 0.949 - 0.952 depending on random seed).\
> This confirms that using the *t*-distribution instead of the normal distribution correctly adjusts for the additional uncertainty from estimating $\sigma$ with $s$.\
> As the sample size $n$ increases, the *t*-distribution approaches the normal distribution, and both methods yield almost identical results.


*You might now be wondering: why are we spending so much time on the t-distribution when today’s topic is **Hypothesis Testing**?*\
*The t-distribution forms the basis of the t-test, which we’ll start exploring next.*


## Part 2: Introduction to Statistical Inference and Hypothesis Testing

By *inference*, we refer to a formal process of drawing conclusions from data.  
The goal of statistical inference is to reach conclusions that are supported by evidence, not merely by observation or intuition.  
In statistics, evidence arises through the careful and reasoned application of statistical methods and the evaluation of probabilities.

To carry out a **hypothesis test**, we always begin with a **question** - what do we want to find out from our data?

---

Once the research question is clear, the **frequentist framework** guides us through  a formal and reproducible sequence of steps:

1. **Formulate the hypotheses**  
   We express the research question as two competing statements:  
   - **Null hypothesis ($H_0$)** - assumes there is **no real effect or difference**, and any variation we observe is due to **random chance**.  
   - **Alternative hypothesis ($H_1$)** - represents the claim that there **is** a real effect or that the parameter differs from the value stated in $H_0$.
   
   These two statements must cover **all possibilities** and be **mutually exclusive**.\
   It’s important to note that we can **never prove $H_0$ to be true** - we can only **fail to reject it** based on the available data.  
   Statistical tests are designed to look for **sufficient evidence to reject $H_0$**, not to confirm it.

2. **Establish the test statistic (and its null distribution)**  
   Identify which **test statistic** will be used to compare $H_0$ and $H_1$.  
   This choice depends on:
   - the **type of data** (e.g., means, proportions, variances),  
   - the **experimental design** (e.g., one-sample, independent two-sample, paired), and  
   - what is **known about the population** (e.g., whether the population standard deviation $\sigma$ is known).  

   Under the null hypothesis, this statistic follows a **known reference (sampling) distribution**, such as the  *z*, *t*, *χ²*, or *F* distribution.

3. **Set the decision rule (choose $\alpha$ and define rejection criteria)**  
   Next, define the rule for deciding when to reject $H_0$:  
   - Choose the **significance level** $\alpha$, which represents the probability of making a *Type I error* (rejecting a true $H_0$).  
     The most common choice is **$\alpha = 0.05$**, but stricter values (e.g., 0.01) may be used in sensitive applications.  
   - Using the null distribution, determine the **critical value(s)** that define the **rejection region** -  
     the range of test statistic values that would be considered unlikely if $H_0$ were true.  
   - *Equivalent p-value approach:* reject $H_0$ if the computed ***p*-value** ≤ **α**, where the *p*-value is computed for the chosen one- or two-tailed alternative.

4. **Collect data, compute the test statistic, and compare**  
   Gather the sample data and calculate the observed value of the chosen test statistic.  
   Compare this value with the **critical region** (or use the *p*-value approach):  
   - If the test statistic falls inside the rejection region (equivalently, if *p* ≤ *α*) → **Reject $H_0$**.  
   - Otherwise → **Fail to reject $H_0$**.

5. **Interpret the result in context**  
   A statistical result only has meaning when connected back to the original question.  
   - Clearly report whether $H_0$ was rejected or not, and **what that implies about the research question**.  
   - Remember: “Failing to reject $H_0$” does *not* prove that $H_0$ is true - it simply means there was **not enough evidence** to conclude otherwise.  
   - Whenever possible, complement the test result with **effect sizes** and **confidence intervals** to provide a fuller picture of the findings.
---
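
The five steps above can be sketched in code. This is a minimal illustration, not a solution to any of the tasks below: the data are simulated, and the reference value `mu_0 = 5.0`, the sample size, and the seed are hypothetical choices made only for this example.

```python
import numpy as np
from scipy.stats import t, ttest_1samp

# Hypothetical data: 12 simulated measurements (NOT from the exercises)
rng = np.random.default_rng(42)
data = rng.normal(loc=5.3, scale=0.5, size=12)

# Step 1: hypotheses. H0: mu = 5.0 versus H1: mu != 5.0 (two-tailed)
mu_0 = 5.0

# Step 2: test statistic t = (xbar - mu_0) / (s / sqrt(n)); null distribution t(n - 1)
n = data.size
df = n - 1

# Step 3: decision rule
alpha = 0.05
t_crit = t.ppf(1 - alpha / 2, df)  # two-tailed critical value

# Step 4: compute the statistic and compare
xbar = data.mean()
s = data.std(ddof=1)               # sample standard deviation
t_obs = (xbar - mu_0) / (s / np.sqrt(n))
p_value = 2 * t.sf(abs(t_obs), df) # two-tailed p-value
reject = abs(t_obs) > t_crit

# Cross-check against SciPy's built-in one-sample t-test
res = ttest_1samp(data, mu_0)

# Step 5: report
print(f"t = {t_obs:.3f}, critical value = ±{t_crit:.3f}, p = {p_value:.4f}")
print("Reject H0" if reject else "Fail to reject H0")
```

The critical-value rule and the *p*-value rule always lead to the same decision; they are two views of the same comparison.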

#### One-Tailed or Two-Tailed Tests
Based on how we formulate our hypotheses, a test can be **one-tailed** or **two-tailed**.  
- If the alternative hypothesis states only that the parameter is *different* from the null value ($H_1\!:\,\mu \ne \mu_0$), the test is **two-tailed**,\
and the significance level $\alpha$ is divided equally between both tails of the distribution.  
- If the alternative specifies a *direction* of difference ($H_1\!:\,\mu > \mu_0$ or $H_1\!:\,\mu < \mu_0$), the test is **one-tailed**, and the entire $\alpha$ lies in a single tail.  

This choice must be made **before** the analysis and should always reflect the research question.
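
To see how the choice of alternative changes the *p*-value, here is a small sketch on simulated (hypothetical) data. It assumes a SciPy version in which `ttest_1samp` accepts the `alternative` argument (SciPy 1.6 or later).

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(7)                   # arbitrary seed
data = rng.normal(loc=9.5, scale=1.0, size=15)   # hypothetical sample
mu_0 = 10.0                                      # hypothetical null value

two_sided = ttest_1samp(data, mu_0, alternative="two-sided")  # H1: mu != mu_0
less = ttest_1samp(data, mu_0, alternative="less")            # H1: mu <  mu_0
greater = ttest_1samp(data, mu_0, alternative="greater")      # H1: mu >  mu_0

print(f"t = {two_sided.statistic:.3f}")
print(f"p (mu != mu_0): {two_sided.pvalue:.4f}")
print(f"p (mu <  mu_0): {less.pvalue:.4f}")
print(f"p (mu >  mu_0): {greater.pvalue:.4f}")
```

The test statistic is identical in all three cases; only the tail area changes. The two one-tailed *p*-values sum to 1, and the two-tailed *p*-value is twice the smaller of them.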


#### Type I and Type II Errors

When we make decisions based on sample data, there are **two kinds of mistakes** we can make.
These are called **Type I** and **Type II** errors.

Whenever we perform a hypothesis test, there are two possible *truths* (whether $H_0$ is true or false) and two possible *decisions* (whether we reject $H_0$ or not).  
The four combinations lead to the following outcomes:

| **Null hypothesis ($H_0$) is ...**| **True** | **False** |
|-------------------------|----------------|-----------------|
| **Rejected** | ❌ **Type I error** (false positive) | ✅ **Correct decision** (true positive) |
| **Failed to reject** | ✅ **Correct decision** (true negative) | ❌ **Type II error** (false negative) |

- **Type I error (α):**  
  Occurs when we *reject a true null hypothesis*.  
  In other words, we detect an effect that **does not actually exist**.  
  The probability of making this error is the **significance level**,  
  typically $\alpha = 0.05$ (5%).

- **Type II error (β):**  
  Occurs when we *fail to reject a false null hypothesis*.  
  In other words, we **miss a real effect** that does exist.  
  The probability of this mistake is denoted by **β**.

- **Power of a test (1 − β):**  
  The probability of correctly rejecting a false $H_0$.  
  A more *powerful* test has a smaller chance of missing true effects.

**How to interpret α and β**
- The smaller the **α**, the less likely we are to make a **Type I error**, but the harder it becomes to detect true effects (increasing **β**).  
- Conversely, increasing **α** makes it easier to find effects (reducing **β**), but increases the risk of false positives.  
- Therefore, α and β are **interconnected** - lowering one usually raises the other.

In practice, we choose α (e.g., 0.05) before testing, and aim to design studies with high **power** (often ≥ 0.8), to minimize the risk of both errors as much as possible.

![errors](https://www.researchgate.net/publication/361295532/figure/fig2/AS:11431281100133326@1669325756681/The-relationship-between-a-type-I-error-alpha-and-a-type-II-error-beta-Note.png)
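
The meaning of α, β, and power can also be explored by simulation. In the sketch below, the sample size, the number of repetitions, the seed, and the size of the true shift (2.0) are all illustrative assumptions; the population values reuse the $\mathcal{N}(10.0, 1.8^2)$ example from Part 1.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(123)  # arbitrary seed
alpha, n, n_rep = 0.05, 10, 20_000
mu_0, sigma = 10.0, 1.8

def rejection_rate(true_mu):
    """Fraction of simulated samples for which H0: mu = mu_0 is rejected."""
    samples = rng.normal(true_mu, sigma, size=(n_rep, n))
    p_values = ttest_1samp(samples, mu_0, axis=1).pvalue  # one test per row
    return np.mean(p_values <= alpha)

# H0 is true: every rejection is a Type I error
type_i_rate = rejection_rate(true_mu=mu_0)

# H0 is false (true mean shifted by 2.0): every rejection is a correct decision
power = rejection_rate(true_mu=mu_0 + 2.0)
beta = 1 - power

print(f"estimated Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
print(f"estimated power: {power:.3f}, estimated beta: {beta:.3f}")
```

The estimated Type I error rate should sit close to the chosen α. Re-running with a smaller `alpha`, a smaller shift, or a smaller `n` lowers the power, which is the trade-off described above.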


> But the main question in practice is often: **how do we choose the right statistical test?**  
> Thanks to Matt, we have a helpful guideline image that summarizes this decision process.  
> Today, we’ll focus on one of the most common tests from that chart - the ***t*-test**.
![what_test](https://raw.githubusercontent.com/RaHub4AI/MI7032/refs/heads/main/Pictures/Hypothesis%20Testing%20Key%202023.png)


#### Putting Theory into Practice

Let’s now put all this theory into practice.  
We’ll start with one of the most widely used tools in data analysis - the ***t*-test**.

The *t*-test allows us to compare **means** and decide whether any observed difference is **statistically significant** or simply due to random variation.

Depending on the question we want to answer, there are three main types of *t*-tests:

1. **One-sample *t*-test**  
   Used to compare the **mean of a single sample** against a **known or reference value**.  
   Statistic: $t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}}$, where:  
      - $\bar{X}$ = sample mean  
      - $\mu_0$ = hypothesized (reference) mean  
      - $s$ = sample standard deviation  
      - $n$ = sample size  

      *Example:* Testing whether the mean concentration measured in a reference material differs from the certified value.

2. **Independent two-sample *t*-test**\
   Used to compare the **means of two independent groups** (e.g., two different treatments or populations).\
   Statistic: $t = \frac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where:
   - $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ is the **pooled standard deviation** (this form assumes that both groups have equal variances), and
   - $n_1$, $n_2$, $s_1$, $s_2$ are the sample sizes and standard deviations of the two groups.

   *Example:* Comparing the average pollutant concentration in two different rivers.

3. **Paired-sample (dependent) *t*-test**\
   Used when the two samples are **related**, for example, measurements taken **before and after** a treatment on the same individuals.\
   Statistic: $t = \frac{\bar{D}}{s_D / \sqrt{n}}$, where:  
      - $\bar{D}$ = mean of the differences between paired observations  
      - $s_D$ = standard deviation of the differences  
      - $n$ = number of pairs  

   *Example:* Temperature measurements made at the same times, but at two different locations.
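
The two-sample and paired formulas can be checked against SciPy. This is a sketch on simulated data; all values, group sizes, and the seed are hypothetical and unrelated to the tasks below.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(2024)  # arbitrary seed

# --- Independent two-sample t-test (pooled standard deviation) ---
group_1 = rng.normal(20.0, 2.0, size=8)    # hypothetical group 1
group_2 = rng.normal(22.0, 2.0, size=10)   # hypothetical group 2

n1, n2 = group_1.size, group_2.size
s1, s2 = group_1.std(ddof=1), group_2.std(ddof=1)
s_p = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_pooled = (group_1.mean() - group_2.mean()) / (s_p * np.sqrt(1 / n1 + 1 / n2))

# equal_var=True selects the pooled-variance (Student) version
res_ind = ttest_ind(group_1, group_2, equal_var=True)

# --- Paired t-test (same units measured twice) ---
before = rng.normal(50.0, 5.0, size=9)            # hypothetical first measurement
after = before + rng.normal(1.0, 1.5, size=9)     # hypothetical second measurement

d = before - after                                # paired differences
t_paired = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))
res_rel = ttest_rel(before, after)

print(f"two-sample: manual t = {t_pooled:.3f}, SciPy t = {res_ind.statistic:.3f}")
print(f"paired:     manual t = {t_paired:.3f}, SciPy t = {res_rel.statistic:.3f}")
```

The paired test is simply a one-sample *t*-test on the differences, which is why it needs only $\bar{D}$, $s_D$, and the number of pairs.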


In the next step, we’ll begin with the simplest case - the **one-sample *t*-test** - to test whether our measured mean differs significantly from a known reference value.

#### <font color='#fc7202'>Task 2: </font>
You purchase a **certified reference material (CRM)** for *lead (Pb)* concentration in rainwater. According to **NIST**, the certified concentration is **190 ng/L**.\
You analyze this reference sample **ten times** using your instrument and obtain the following results (in ng/L):
$187,\; 171,\; 191,\; 176,\; 196,\; 181,\; 189,\; 190,\; 185,\; 189$

1. Test whether your measurements **differ** from the certified value at the **95% confidence level**.
2. Test whether your measurements are **significantly lower** than the certified value  
   at the **95% confidence level**.

> *Hints:*
> Start by clearly stating your hypotheses ($H_0$ and $H_1$).   
> Check the assumptions of the one-sample *t*-test:  
>    - The data are independent.  
>    - The measurements come from an approximately normal distribution (you can visualize or test this).
>
> To perform the test, you can use the built-in function `scipy.stats.ttest_1samp()`.
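
One possible way to check the normality assumption is sketched below on simulated (hypothetical) measurements. The Shapiro-Wilk test and a normal Q-Q plot are common choices, but they are assumptions of this example rather than methods prescribed by the task.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)                    # arbitrary seed
measurements = rng.normal(100.0, 4.0, size=10)    # hypothetical data

# Shapiro-Wilk test: H0 is that the data come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(measurements)
print(f"Shapiro-Wilk W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")

# Coordinates for a normal Q-Q plot: theoretical quantiles vs. ordered data
(theoretical_q, ordered_data), (slope, intercept, r) = stats.probplot(
    measurements, dist="norm"
)
print(f"Q-Q plot correlation r = {r:.3f}")
```

A small *p*-value from the Shapiro-Wilk test would be evidence against normality. With only about ten values the test has little power, so a visual check (for example `stats.probplot(measurements, dist="norm", plot=plt)`) is a useful complement.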

<font color='#00bf63'>*Your hypotheses here!*</font>

In [None]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

#### <font color='#fc7202'>Task 3:</font>

You are developing a new extraction and analysis method for **carbon tetrachloride** in air.  
In spike/recovery experiments, you add **exactly 50 ng** of carbon tetrachloride to air in a closed chamber.  
Five experiments yield the following measurements (ng): $50.4,\; 50.7,\; 49.1,\; 49.0,\; 51.1$

Is there evidence of **systematic error (bias)** in your analysis at the **95% confidence level**?


<font color='#00bf63'>*Your hypotheses here!*</font>

In [50]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>

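So far, we have compared a **sample mean** to a **known reference value** using the one-sample *t*-test.  
Next, we’ll see how to compare the **means of two sets of measurements**. Before choosing between the independent two-sample and the paired *t*-test, check whether the observations are independent or paired.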

#### <font color="#fc7202">Task 4:</font>

Two methods for determining **chromium** concentrations in grass were applied to the **same samples**.  
Results (units consistent):

| Sample | Method 1 | Method 2 |
|:-----:|:--------:|:--------:|
| 1 | 1.79 | 2.01 |
| 2 | 1.74 | 2.81 |
| 3 | 1.41 | 2.34 |
| 4 | 1.29 | 2.12 |
| 5 | 1.15 | 2.39 |

Do the two methods give results with **means that differ significantly** at the 95% confidence level?

<font color='#00bf63'>*Your hypotheses here!*</font>

In [None]:
# YOUR CODE HERE!

<font color='#00bf63'>*Your answer here!*</font>