<h1 style="font-size: 1.6rem; font-weight: bold">Module 6 - Topic 1: Statistics</h1>
<p style="margin-top: 5px; margin-bottom: 5px;">Monash University Australia</p>
<p style="margin-top: 5px; margin-bottom: 5px;">ITO 4001: Foundations of Computing</p>
<p style="margin-top: 5px; margin-bottom: 5px;">Jupyter Notebook by: Tristan Sim Yook Min</p>
References: Images and Diagrams from Monash Faculty of Information Technology

---

### **Z-Tests: Statistical Inference for Normal Populations with Known Variance**

A statistical hypothesis represents a claim or assumption about the parameters of a population distribution. We call it a hypothesis because its truth remains uncertain until tested. The fundamental challenge in hypothesis testing is creating a systematic method to evaluate whether observed sample data supports or contradicts our initial assumption about the population.

### **Example: Testing Population Means When Variance is Known**

Consider a random sample $X_1, X_2, \ldots, X_n$ drawn from a normal distribution with unknown mean $\mu$ but known variance $\sigma^2$. Our goal is to evaluate the null hypothesis:

$H_0: \mu = \mu_0$

against the competing alternative hypothesis:

$H_1: \mu \neq \mu_0$

where $\mu_0$ represents a specific value we want to test against.

#### Building the Test Statistic

The sample mean $\bar{X} = \frac{\sum_{i=1}^n X_i}{n}$ serves as our natural estimator for the population mean $\mu$. Intuitively, we should fail to reject the null hypothesis $H_0$ when $\bar{X}$ falls reasonably close to $\mu_0$, and reject it when $\bar{X}$ lies far from $\mu_0$. This logic leads us to define a rejection region:

$C = \{(X_1, \ldots, X_n) : |\bar{X} - \mu_0| > c\}$

where $c$ represents a threshold value we need to determine.

#### Finding the Critical Threshold

To construct a test with significance level $\alpha$, we must choose $c$ such that the probability of Type I error equals $\alpha$. This means finding $c$ where:

$P_{\mu_0}\{|\bar{X} - \mu_0| > c\} = \alpha$

The notation $P_{\mu_0}$ indicates we calculate this probability assuming the null hypothesis is true (i.e., $\mu = \mu_0$).

Under the null hypothesis, $\bar{X}$ follows a normal distribution with mean $\mu_0$ and variance $\sigma^2/n$. Therefore, we can standardize using:

$Z \equiv \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

This standardized variable $Z$ follows a standard normal distribution.

#### Deriving the Decision Rule

The condition $P_{\mu_0}\{|\bar{X} - \mu_0| > c\} = \alpha$ can be rewritten as:

$P\left\{|Z| > \frac{c\sqrt{n}}{\sigma}\right\} = \alpha$

Due to the symmetry of the standard normal distribution:

$2P\left\{Z > \frac{c\sqrt{n}}{\sigma}\right\} = \alpha$

Since we know that $P\{Z > z_{\alpha/2}\} = \alpha/2$ for a standard normal variable, we can set:

$\frac{c\sqrt{n}}{\sigma} = z_{\alpha/2}$

Solving for $c$ gives us:

$c = \frac{z_{\alpha/2}\sigma}{\sqrt{n}}$
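
The threshold $c = z_{\alpha/2}\sigma/\sqrt{n}$ can be sanity-checked by simulation. The sketch below is only illustrative: the values of `mu0`, `sigma`, `n`, and `alpha` are assumptions chosen for the demonstration, not values from the course material. It draws many samples with $H_0$ true and counts how often $|\bar{X} - \mu_0| > c$, which should happen in roughly a fraction $\alpha$ of the trials.

```python
import math
import random
from statistics import NormalDist, fmean

# Illustrative (assumed) setup: none of these numbers come from the notes.
mu0, sigma, n, alpha = 50.0, 4.0, 25, 0.05

# z_{alpha/2} is the upper alpha/2 point of the standard normal distribution.
z_half = NormalDist().inv_cdf(1 - alpha / 2)
c = z_half * sigma / math.sqrt(n)

# Simulate samples with H0 true (mu = mu0) and count false rejections.
rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 20_000
rejections = 0
for _ in range(trials):
    sample = [rng.gauss(mu0, sigma) for _ in range(n)]
    if abs(fmean(sample) - mu0) > c:
        rejections += 1

type_one_rate = rejections / trials
print(f"c = {c:.4f}")
print(f"simulated Type I error rate = {type_one_rate:.4f} (target {alpha})")
```

The simulated rate will not equal $\alpha$ exactly because of Monte Carlo noise, but it should sit close to it.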

### **Two-Tailed Test Decision Framework**

With our critical value established, the test at significance level $\alpha$ follows this decision rule:

- **Reject** $H_0$ when $\frac{\sqrt{n}}{\sigma}|\bar{X} - \mu_0| > z_{\alpha/2}$
- **Fail to reject** $H_0$ when $\frac{\sqrt{n}}{\sigma}|\bar{X} - \mu_0| \leq z_{\alpha/2}$

This approach, testing $\mu = \mu_0$ against $\mu \neq \mu_0$, is termed a **two-tailed test**. We consider both extremely large positive and negative deviations of the sample mean from $\mu_0$ as evidence against our null hypothesis.
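
As a minimal sketch of the two-tailed decision rule, the helper below computes $\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_0)$ and compares its absolute value with $z_{\alpha/2}$. The function name `z_test_two_sided` and the example numbers are hypothetical, introduced only for illustration.

```python
import math
from statistics import NormalDist


def z_test_two_sided(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed z-test of H0: mu = mu0 when sigma is known."""
    z = math.sqrt(n) * (sample_mean - mu0) / sigma
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    reject = abs(z) > z_crit
    return z, z_crit, reject


# Assumed example: n = 5 observations, known sigma = 2, testing mu0 = 8.
z, z_crit, reject = z_test_two_sided(sample_mean=9.5, mu0=8.0, sigma=2.0, n=5)
print(f"z = {z:.3f}, z_crit = {z_crit:.3f}, reject H0: {reject}")
```

With these assumed numbers the statistic is about 1.68, which is below $z_{0.025} \approx 1.96$, so $H_0$ is not rejected at the 5% level.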

### **One-Tailed Tests**

When we specifically want to determine if the population mean is greater than or less than $\mu_0$ (rather than simply different from it), we employ **one-tailed tests**.

### **Upper-Tail Testing**

Consider testing the directional hypothesis:

$H_0: \mu \leq \mu_0 \text{ versus } H_1: \mu > \mu_0$

Logic dictates that we should reject $H_0$ when our sample mean $\bar{X}$ substantially exceeds $\mu_0$. This leads to a rejection region:

$C = \{(X_1, \ldots, X_n): \bar{X} - \mu_0 > c\}$

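To keep the Type I error rate at most $\alpha$ for every $\mu \leq \mu_0$, it suffices to control it at the boundary $\mu = \mu_0$, where the probability of rejection is largest. We therefore need the critical value $c$ to satisfy: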

$P_{\mu_0}\{\bar{X} - \mu_0 > c\} = \alpha$

Using our standardization $Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$, which follows a standard normal distribution when $\mu = \mu_0$:

$P\left\{Z > \frac{c\sqrt{n}}{\sigma}\right\} = \alpha$

Since $P\{Z > z_\alpha\} = \alpha$ for a standard normal variable, we obtain:

$c = \frac{z_\alpha \sigma}{\sqrt{n}}$

### **One-Tailed Test Decision Framework**

The upper-tail hypothesis test follows this decision rule:

- **Fail to reject** $H_0$ when $\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_0) \leq z_\alpha$
- **Reject** $H_0$ when $\frac{\sqrt{n}}{\sigma}(\bar{X} - \mu_0) > z_\alpha$
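
The upper-tail rule can be sketched in the same way. The helper name `z_test_upper_tail` and the numbers are assumptions for illustration; the only change from a two-tailed test is that the signed statistic is compared with $z_\alpha$ instead of comparing its absolute value with $z_{\alpha/2}$.

```python
import math
from statistics import NormalDist


def z_test_upper_tail(sample_mean, mu0, sigma, n, alpha=0.05):
    """Upper-tail z-test of H0: mu <= mu0 against H1: mu > mu0 (sigma known)."""
    z = math.sqrt(n) * (sample_mean - mu0) / sigma
    z_crit = NormalDist().inv_cdf(1 - alpha)  # z_alpha, not z_{alpha/2}
    reject = z > z_crit
    return z, z_crit, reject


# Assumed example: n = 5 observations, known sigma = 2, testing mu0 = 8.
z_up, z_crit_up, reject_up = z_test_upper_tail(9.5, 8.0, 2.0, 5)
print(f"z = {z_up:.3f}, z_crit = {z_crit_up:.3f}, reject H0: {reject_up}")
```

Here the statistic of about 1.68 exceeds $z_{0.05} \approx 1.645$, so the one-tailed test rejects $H_0$ even though a two-tailed test at the same level, with critical value $z_{0.025} \approx 1.96$, would not.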

---

### **T-Tests: Statistical Inference When Population Variance is Unknown**

While z-tests are powerful when population variance is known, real-world scenarios often involve unknown variances. The t-test addresses this limitation by using sample variance to estimate the unknown population variance, making it one of the most practical tools in statistical inference.

### **Single Sample T-Test: Testing Population Mean with Unknown Variance**

#### **The Problem Setup**

When both the population mean and variance are unknown, the z-statistic cannot be computed because it depends on $\sigma$. Consider testing:

$$H_0: \mu = \mu_0$$

against the alternative:

$$H_1: \mu \neq \mu_0$$

Note that this null hypothesis is **composite** rather than simple, since it doesn't specify the variance value.

#### **Estimating the Unknown Variance**

Since the population variance $\sigma^2$ is unknown, we estimate it using the sample variance:

$$S^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$

Our intuition suggests rejecting $H_0$ when the standardized difference is large:

$$\left|\frac{\bar{X} - \mu_0}{S/\sqrt{n}}\right|$$

#### **The T-Distribution**

To establish the critical values, we need the distribution of our test statistic. When $H_0$ is true, the statistic:

$$T = \frac{\sqrt{n}(\bar{X} - \mu_0)}{S}$$

follows a **t-distribution with $(n-1)$ degrees of freedom**.

#### **Probability Statement**

Under the null hypothesis:

$$P_{\mu_0}\left\{-t_{\alpha/2, n-1} \leq \frac{\sqrt{n}(\bar{X} - \mu_0)}{S} \leq t_{\alpha/2, n-1}\right\} = 1 - \alpha$$

where $t_{\alpha/2, n-1}$ represents the $100(\alpha/2)$ upper percentile of the t-distribution with $(n-1)$ degrees of freedom.

By definition: $P\{T_{n-1} \geq t_{\alpha/2, n-1}\} = P\{T_{n-1} \leq -t_{\alpha/2, n-1}\} = \alpha/2$

### **Decision Framework for Single Sample T-Test**

The test at significance level $\alpha$ follows this rule:

- **Fail to reject** $H_0$ if: $\left|\frac{\sqrt{n}(\bar{X} - \mu_0)}{S}\right| \leq t_{\alpha/2, n-1}$

- **Reject** $H_0$ if: $\left|\frac{\sqrt{n}(\bar{X} - \mu_0)}{S}\right| > t_{\alpha/2, n-1}$
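
A small sketch of the one-sample t-test, using only the standard library. The data and the name `one_sample_t` are made up for illustration, and the critical values in `T_CRIT_025` are copied from the $t_{0.025}$ column of the reference table in this notebook, so only the degrees of freedom listed there are available.

```python
import math
from statistics import mean, stdev

# Upper 0.025 critical values, keyed by degrees of freedom (from the t-table).
T_CRIT_025 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 10: 2.228}


def one_sample_t(sample, mu0):
    """Return T = sqrt(n) * (xbar - mu0) / S and its degrees of freedom."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation, divisor n - 1
    t = math.sqrt(n) * (mean(sample) - mu0) / s
    return t, n - 1


# Assumed data: five measurements, testing H0: mu = 10 at alpha = 0.05.
data = [9.8, 10.4, 10.1, 9.6, 10.6]
t_stat, df = one_sample_t(data, mu0=10.0)
reject_t = abs(t_stat) > T_CRIT_025[df]
print(f"T = {t_stat:.4f}, df = {df}, reject H0: {reject_t}")
```

For this sample $T \approx 0.54$, well inside $\pm t_{0.025,4} = \pm 2.776$, so $H_0$ is not rejected.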

### **P-Value Calculation**

If $t$ represents the observed value of our test statistic $T = \sqrt{n}(\bar{X} - \mu_0)/S$, then:

**p-value** = Probability that $|T|$ would exceed $|t|$ when $H_0$ is true

This equals the probability that the absolute value of a t-random variable with $(n-1)$ degrees of freedom exceeds $|t|$.
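
By symmetry of the t-distribution, this p-value equals $1 - P\{-|t| \leq T_{n-1} \leq |t|\}$. The sketch below approximates it by integrating the t density numerically with Simpson's rule. This is one possible approach chosen to avoid external dependencies, not the method prescribed by the course, and the function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_obs, df, steps=2000):
    """Approximate P(|T_df| > |t_obs|) with Simpson's rule (steps must be even)."""
    b = abs(t_obs)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-|t| <= T <= |t|) by symmetry
    return max(0.0, 1.0 - central_mass)


# 2.228 is the tabulated t_{0.025, 10}, so the p-value should be close to 0.05.
p_check = two_sided_p_value(2.228, df=10)
print(f"p-value for t = 2.228 with 10 df: {p_check:.4f}")
```

A small p-value (below the chosen $\alpha$) leads to the same decision as the critical-value rule: reject $H_0$.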

### **Two-Sample Tests: Comparing Means of Two Populations**

#### **Two-Sample Z-Test (Known Variances)**

When comparing two populations with **known variances**, suppose we have independent samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ from normal populations with unknown means $\mu_x, \mu_y$ but known variances $\sigma_x^2, \sigma_y^2$.

**Hypotheses:**
$$H_0: \mu_x = \mu_y \text{ versus } H_1: \mu_x \neq \mu_y$$

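**Distribution of Difference:**
Because the two samples are independent and normally distributed, the difference of sample means satisfies, for any values of $\mu_x$ and $\mu_y$:

$$\bar{X} - \bar{Y} \sim N\left(\mu_x - \mu_y, \frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}\right)$$

**Standardized Test Statistic:**
$$\frac{\bar{X} - \bar{Y} - (\mu_x - \mu_y)}{\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}} \sim N(0,1)$$

Under $H_0$ we have $\mu_x - \mu_y = 0$, so the statistic reduces to $\frac{\bar{X} - \bar{Y}}{\sqrt{\sigma_x^2/n + \sigma_y^2/m}}$, which is the quantity used in the decision rule.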

**Decision Rule:**
- **Fail to reject** $H_0$ if: $\frac{|\bar{X} - \bar{Y}|}{\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}} \leq z_{\alpha/2}$

- **Reject** $H_0$ if: $\frac{|\bar{X} - \bar{Y}|}{\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}} > z_{\alpha/2}$
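
The two-sample z-test rule can be sketched as follows. The function name `two_sample_z` and the summary statistics are assumptions made up for the example.

```python
import math
from statistics import NormalDist


def two_sample_z(xbar, ybar, var_x, var_y, n, m, alpha=0.05):
    """Two-tailed z-test of H0: mu_x = mu_y with known variances."""
    se = math.sqrt(var_x / n + var_y / m)  # std. deviation of xbar - ybar
    z = (xbar - ybar) / se
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return z, z_crit, abs(z) > z_crit


# Assumed summary statistics for two independent samples.
z2, z2_crit, reject2 = two_sample_z(xbar=52.0, ybar=50.0,
                                    var_x=16.0, var_y=9.0, n=40, m=30)
print(f"z = {z2:.3f}, z_crit = {z2_crit:.3f}, reject H0: {reject2}")
```

With these numbers the standard deviation of the difference is $\sqrt{0.7} \approx 0.837$, giving $z \approx 2.39 > 1.96$, so $H_0$ is rejected at the 5% level.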

### **Two-Sample T-Test (Unknown Equal Variances)**

More realistically, when all parameters are unknown, we test:

$$H_0: \mu_x = \mu_y \text{ versus } H_1: \mu_x \neq \mu_y$$

**Key Assumption:** The unknown variances are equal: $\sigma^2 = \sigma_x^2 = \sigma_y^2$

### **Sample Variance Calculations**

Define the individual sample variances:

$$S_x^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1}$$

$$S_y^2 = \frac{\sum_{i=1}^m (Y_i - \bar{Y})^2}{m-1}$$

### **Pooled Variance Estimator**

The **pooled estimator** of the common variance $\sigma^2$ is:

$$S_p^2 = \frac{(n-1)S_x^2 + (m-1)S_y^2}{n + m - 2}$$

This combines information from both samples to estimate the shared variance.

### **Test Statistic Distribution**

Under $H_0$ (when $\mu_x - \mu_y = 0$):

$$T \equiv \frac{\bar{X} - \bar{Y}}{\sqrt{S_p^2\left(\frac{1}{n} + \frac{1}{m}\right)}} \sim t_{n+m-2}$$

This follows a t-distribution with $(n + m - 2)$ degrees of freedom.

### **Decision Framework for Two-Sample T-Test**

- **Fail to reject** $H_0$ if: $|T| \leq t_{\alpha/2, n+m-2}$

- **Reject** $H_0$ if: $|T| > t_{\alpha/2, n+m-2}$

where $t_{\alpha/2, n+m-2}$ is the upper $100(\alpha/2)$ percentile point of a t-distribution with $(n+m-2)$ degrees of freedom.
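
The pooled two-sample t-test can be sketched with the standard library as below. The data and the name `pooled_t_statistic` are invented for illustration, and the critical value $t_{0.025,10} = 2.228$ is taken from the reference table in this notebook. The sketch assumes equal population variances, as stated above.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Return (T, pooled variance, degrees of freedom) for two samples."""
    n, m = len(x), len(y)
    sx2, sy2 = variance(x), variance(y)  # divisors n - 1 and m - 1
    sp2 = ((n - 1) * sx2 + (m - 1) * sy2) / (n + m - 2)
    t = (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / n + 1 / m))
    return t, sp2, n + m - 2


# Assumed data: two independent samples of size 6, so df = 10.
x = [20.0, 22.0, 19.0, 21.0, 23.0, 21.0]
y = [18.0, 20.0, 19.0, 21.0, 19.0, 17.0]
t2, sp2, df2 = pooled_t_statistic(x, y)

t_crit_10 = 2.228  # t_{0.025, 10} from the t-table
reject_pooled = abs(t2) > t_crit_10
print(f"T = {t2:.4f}, Sp^2 = {sp2:.3f}, df = {df2}, reject H0: {reject_pooled}")
```

Both samples here have sample variance 2, so $S_p^2 = 2$ and $T = 2/\sqrt{2/3} \approx 2.449$, which exceeds 2.228 and leads to rejecting $H_0$ at the 5% level.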

### **Critical Values Reference Table**

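The table below lists upper-tail critical values $t_{\alpha, \text{df}}$ of the t-distribution, that is, the value exceeded with probability $\alpha$ by a t-random variable with the given degrees of freedom: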

| df | $t_{0.10}$ | $t_{0.05}$ | $t_{0.025}$ | $t_{0.01}$ | $t_{0.005}$ |
|---|---|---|---|---|---|
| 1 | 3.078 | 6.314 | 12.706 | 31.821 | 63.657 |
| 2 | 1.886 | 2.920 | 4.303 | 6.965 | 9.925 |
| 3 | 1.638 | 2.353 | 3.182 | 4.541 | 5.841 |
| 4 | 1.533 | 2.132 | 2.776 | 3.747 | 4.604 |
| 5 | 1.476 | 2.015 | 2.571 | 3.365 | 4.032 |
| 10 | 1.372 | 1.812 | 2.228 | 2.764 | 3.169 |
| 15 | 1.341 | 1.753 | 2.131 | 2.602 | 2.947 |
| 20 | 1.325 | 1.725 | 2.086 | 2.528 | 2.845 |
| 25 | 1.316 | 1.708 | 2.060 | 2.485 | 2.787 |
| 30 | 1.310 | 1.697 | 2.042 | 2.457 | 2.750 |
| $\infty$ | 1.282 | 1.645 | 1.960 | 2.326 | 2.576 |

**Note:** As degrees of freedom approach infinity, t-values converge to z-values (standard normal).
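
As a quick check of this limit, the $\infty$ row can be reproduced from the standard normal distribution. The snippet is a minimal sketch using Python's `statistics.NormalDist`; the variable names are arbitrary.

```python
from statistics import NormalDist

# Upper-tail probabilities matching the table columns.
alphas = [0.10, 0.05, 0.025, 0.01, 0.005]

# z_alpha is the point exceeded with probability alpha by a standard normal.
z_values = [round(NormalDist().inv_cdf(1 - a), 3) for a in alphas]

for a, z_val in zip(alphas, z_values):
    print(f"z_{a} = {z_val:.3f}")
```

The printed values agree with the last row of the table to three decimal places.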

### **Comprehensive Test Summary**

| Test Type | Conditions | Test Statistic | Degrees of Freedom | Decision Rule (two-tailed) |
|-----------|------------|----------------|-------------------|---------------|
| One-sample z-test | $\sigma^2$ known | $\frac{\sqrt{n}(\bar{X} - \mu_0)}{\sigma}$ | N/A (use $z_{\alpha/2}$) | Reject if $\lvert Z \rvert > z_{\alpha/2}$ |
| One-sample t-test | $\sigma^2$ unknown | $\frac{\sqrt{n}(\bar{X} - \mu_0)}{S}$ | $n-1$ | Reject if $\lvert T \rvert > t_{\alpha/2, n-1}$ |
| Two-sample z-test | $\sigma_x^2, \sigma_y^2$ known | $\frac{\bar{X} - \bar{Y}}{\sqrt{\frac{\sigma_x^2}{n} + \frac{\sigma_y^2}{m}}}$ | N/A (use $z_{\alpha/2}$) | Reject if $\lvert Z \rvert > z_{\alpha/2}$ |
| Two-sample t-test | $\sigma_x^2 = \sigma_y^2$ unknown | $\frac{\bar{X} - \bar{Y}}{\sqrt{S_p^2(\frac{1}{n} + \frac{1}{m})}}$ | $n+m-2$ | Reject if $\lvert T \rvert > t_{\alpha/2, n+m-2}$ |


---