# 6 Hypothesis Testing

Suppose we have a hypothesis $H_0$, called the null hypothesis. We check whether the observed data give enough evidence to reject it. This procedure is called hypothesis testing. 

**Not rejecting the hypothesis does not mean that the data accept the hypothesis**. It is simply that the data cannot provide significant counterevidence.

### P-Value

The P-value is the probability, computed under $H_0$, of observing the data we actually observed or data that are more extreme.

Example: Consider a Bernoulli trial with success probability $q$. We have a sample of size $20$ containing $17$ ones and $3$ zeros. The hypotheses are $H_0:\ q=0.5$ and $H_1:\ q\neq 0.5$. Under $H_0$ we expect a sample of size $20$ to contain approximately $10$ ones. Yet there are $17$ ones, and the results that are at least as extreme are $0,1,2,3,17,18,19,20$ ones. Hence the probability of observing such an extreme result under $H_0$ is 
$$\mathbb P(X\leqslant 3\ {\rm or  }\ X\geqslant 17) \approx 0.0026.$$

### $\alpha$-Significance Level

We fix a significance level $\alpha$ for a test of $H_0$ against $H_1$. When the P-value is smaller than $\alpha$, we reject the null hypothesis $H_0$. For example, with $\alpha = 0.05$ in the example above, we have $0.0026<0.05$ and so we reject $H_0$. In other words, the observed data are too unusual under $q = 0.5$, so we reject the hypothesis.

In [11]:
(1 - pbinom(16, 20, 0.5))*2

### Two Types of Errors

There are four possible outcomes in hypothesis testing, as listed below. 

| Cases|Reject $H_0$|Not reject $H_0$|
|--|--|--|
|$H_0$ is true|Type I Error|Correct|
|$H_1$ is true|Correct|Type II Error|

By definition, the significance level bounds the probability of a Type I error: $\alpha \geqslant \mathbb P({\rm Type\ I\ Error})$.

Define the power function of a test by 
$$\beta(\theta) = 1 - \mathbb P_\theta({\rm Type \ II\ Error}),$$
the probability that $H_0$ is correctly rejected when the true parameter is $\theta$ $(\theta\in H_1)$. For a fixed $\alpha$, one chooses, among the tests with significance level $\alpha$, a test that maximizes the power $\beta(\theta)$, which is the same as minimizing the probability of a Type II error.
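
As a rough illustration of the power function, the sketch below revisits the Bernoulli example with $n=20$ and $H_0:\ q=0.5$. It assumes a symmetric rejection region $|X-10|\geqslant c$ (a choice made here for illustration, not prescribed by the text) and picks the smallest $c$ whose Type I error probability does not exceed $\alpha=0.05$. The names `region_prob`, `cutoff` and `power_at` are hypothetical.

```r
n <- 20
alpha <- 0.05
k <- 0:n

# Probability of the region {|X - n/2| >= cutoff} when the success probability is q
region_prob <- function(cutoff, q) sum(dbinom(k[abs(k - n / 2) >= cutoff], n, q))

# Smallest cutoff whose Type I error probability under q = 0.5 is at most alpha
candidates <- 1:(n / 2)
sizes <- sapply(candidates, region_prob, q = 0.5)
cutoff <- min(candidates[sizes <= alpha])

# Power function: probability of rejecting H0 when the true parameter is q
power_at <- function(q) region_prob(cutoff, q)

cat('cutoff =', cutoff, '\n')
cat('size   =', power_at(0.5), '\n')
cat('power at q = 0.6, 0.7, 0.8:', sapply(c(0.6, 0.7, 0.8), power_at), '\n')
```

The power grows as $q$ moves away from $0.5$, which is the behaviour one hopes for in a reasonable test.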

## Wald Test

Suppose the null hypothesis is $H_0:\ \theta=\theta_0$ for some parameter $\theta$. Given data, we compute an estimate $\hat \theta$ and its estimated standard error $\widehat{\rm SE}(\hat \theta)$. If $\hat \theta$ is an asymptotically normal estimator, i.e. 
$$(\hat\theta - \theta)/ \widehat{\rm SE}(\hat \theta)\stackrel{d}{\rightarrow} N(0,1),$$

then when $n$ is large and $H_0$ holds, the statistic above with $\theta=\theta_0$ is approximately standard normal, so it should be close to $0$. If the value we observe is far from $0$, we consider rejecting the hypothesis.

Let $T= (\hat\theta - \theta_0)/ \widehat{\rm SE}(\hat\theta)$ be the test statistic and let $\alpha$ be the significance level. 

* If the alternative hypothesis is $\theta\neq \theta_0$, then we reject $H_0$ if $|T|>\Phi^{-1}(1-\frac \alpha 2)$. 
* If the alternative hypothesis is $\theta< \theta_0$, then we reject $H_0$ if $T<\Phi^{-1}(\alpha )$. 
* If the alternative hypothesis is $\theta>\theta_0$, then we reject $H_0$ if $T>\Phi^{-1}(1-\alpha)$.
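
A minimal sketch of the two-sided Wald test, applied to the Bernoulli data from the P-value example ($17$ ones out of $20$, $\theta_0=0.5$). It assumes the usual plug-in standard error $\sqrt{\hat\theta(1-\hat\theta)/n}$ for a sample proportion; with $n=20$ the normal approximation is only rough, so the numbers are illustrative.

```r
n <- 20
ones <- 17
theta0 <- 0.5

theta_hat <- ones / n                            # estimate of q
se_hat <- sqrt(theta_hat * (1 - theta_hat) / n)  # plug-in standard error (assumption)
T_stat <- (theta_hat - theta0) / se_hat

alpha <- 0.05
reject <- abs(T_stat) > qnorm(1 - alpha / 2)     # two-sided rejection rule
p_value <- 2 * (1 - pnorm(abs(T_stat)))

cat('T =', T_stat, '\n', 'reject H0:', reject, '\n', 'p =', p_value, '\n')
```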

## Chi-square Tests

### Chi-square Distribution
Recall the definition of the chi-square distribution: let $X_1,\dotsc,X_k\sim N(0,1)$ be independent standard normal variables; then 
$$Z = X_1^2+\dotsc+X_k^2$$
has the chi-square distribution with $k$ degrees of freedom. We denote it by $\chi_k^2$.

$$\mathbb E(\chi_k^2) = k\quad {\rm and}\quad {\rm Var}(\chi_k^2) = 2k.$$

It has density $f(x) = \frac{1}{2^\frac k2 \Gamma(k/2)}x^{k/2 - 1}e^{-x/2}$.

Proof: First we compute the CDF by high-dimensional spherical coordinates (https://zhuanlan.zhihu.com/p/128580414),

$$\begin{aligned}\mathbb P(\chi_k^2\leqslant t)&=
\int\dotsi\int_{x_1^2+\dotsc +x_k^2\leqslant t}\frac{1}{\sqrt{2\pi}^k}e^{-\frac{x_1^2+\dotsc+x_k^2}{2}}dx_k\dotsm dx_1\\
&=\int_0^{2\pi} \int_0^\pi \dotsi\int_0^{\pi}\int_0^{\sqrt t}
\frac{1}{\sqrt{2\pi }^k }e^{-\frac {r^2}{2}}r^{k-1}\sin^{k-2}\theta_1\dotsm\sin\theta_{k-2}drd\theta_1\dotsm d\theta_{k-2}d\theta_{k-1}\\
&= \int_0^{2\pi}d\theta_{k-1}\cdot\prod_{j=2}^{k-1}\int_0^{\pi}\sin^{j-1}\theta_{k-j}d\theta_{k-j} \cdot \int_0^{\sqrt t}\frac{1}{\sqrt{2\pi}^k}r^{k-1}e^{-\frac{r^2}{2}}dr\\
&=2\pi \cdot\left(\frac{\pi}{2}\right)^{[\frac{k-2}{2}]}\cdot 2^{k-2}\cdot \prod_{j=1}^{k-2}\frac{(j-1)!!}{j!!}\cdot (2\pi)^{-\frac k2}\int_0^{\sqrt t} r^{k-1}e^{-\frac{r^2}{2}}dr\\
&=\left\{\begin{array}{ll}\frac{1}{(k-2)!!}\int_0^{\sqrt t} r^{k-1}e^{-\frac{r^2}{2}}dr & {\rm for \ even\ }k\\
\sqrt{\frac{2}{\pi}}\frac{1}{(k-2)!!}\int_0^{\sqrt t} r^{k-1}e^{-\frac{r^2}{2}}dr & {\rm for \ odd\ }k\end{array}\right.\\
&=\frac{1}{2^{\frac k2-1}\Gamma(k/2)}\int_0^{\sqrt t} r^{k-1}e^{-\frac{r^2}{2}}dr.
\end{aligned}$$

Taking the derivative with respect to $t$ yields the result. The mean and variance are given by 
$$\begin{aligned}\mathbb E(\chi_k^2) &= k\mathbb E(X_1^2)=k{\rm Var}(X_1) = k\\
{\rm Var}(\chi_k^2) &= \mathbb E\left(\left(\sum_{j=1}^k X_j^2\right)^2\right) - k^2
=k\mathbb E(X_1^4)+k(k-1)\mathbb E(X_1^2)\mathbb E(X_2^2) - k^2= 3k+k(k-1)-k^2 = 2k.
\end{aligned}$$
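
The density formula and the two moments can be checked numerically. The sketch below compares the formula with R's built-in `dchisq` and estimates the mean and variance by simulation; the choice $k=5$, the grid, the sample size and the seed are arbitrary assumptions.

```r
k <- 5
x <- seq(0.1, 20, by = 0.1)

# Density formula from the text
f <- function(x, k) x^(k / 2 - 1) * exp(-x / 2) / (2^(k / 2) * gamma(k / 2))
max_diff <- max(abs(f(x, k) - dchisq(x, k)))

# Monte Carlo check: sum of k squared standard normals, repeated 1e5 times
set.seed(1)
z <- rowSums(matrix(rnorm(1e5 * k), ncol = k)^2)

cat('max density difference:', max_diff, '\n')
cat('sample mean:', mean(z), ' sample variance:', var(z), '\n')
```

The sample mean and variance should land close to $k$ and $2k$ respectively.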

### Chi-square Test for Normal Variance 

Suppose $X_1,\dotsc,X_n$ are independent samples from a normal distribution $N(\mu,\sigma^2)$ with $\mu$ unknown, and we want to test a hypothesized value of the variance $\sigma^2$ under $H_0$. Then we use the statistic 
$$T = \sigma^{-2}\sum_{i=1}^n (X_i - \overline X)^2\sim \chi_{n-1}^2,$$
where $\sigma^2$ is the value specified by $H_0$.

When $T$ is far away from the mean $\mathbb E(\chi_{n-1}^2) = n-1$, we reject the null hypothesis $H_0$. More explicitly, we reject $H_0$ at the $\alpha$-significance level when $T\notin (F^{-1}(\frac \alpha 2),F^{-1}(1 - \frac \alpha 2))$ where $F$ is the CDF of $\chi_{n-1}^2$.

Equivalently, 
$$\left(\frac{\sum_{i=1}^n (X_i - \overline X)^2}{F^{-1}(1 - \frac\alpha 2)},
\frac{\sum_{i=1}^n (X_i - \overline X)^2}{F^{-1}(\frac\alpha 2)}\right)$$
is a $1 - \alpha$ confidence interval for $\sigma^2$.
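
The test and the interval can be sketched in a few lines. The data vector and the hypothesized variance `sigma0_sq` below are made up for illustration only.

```r
x <- c(4.9, 5.6, 5.1, 4.4, 5.3, 5.0, 4.7, 5.8, 5.2, 4.6)  # hypothetical data
sigma0_sq <- 0.25   # hypothesized variance under H0 (assumed for illustration)
alpha <- 0.05
n <- length(x)

ss <- sum((x - mean(x))^2)              # sum of squared deviations
T_stat <- ss / sigma0_sq                # chi-square statistic with n - 1 df
lower <- qchisq(alpha / 2, n - 1)       # F^{-1}(alpha / 2)
upper <- qchisq(1 - alpha / 2, n - 1)   # F^{-1}(1 - alpha / 2)

reject <- (T_stat <= lower) || (T_stat >= upper)
ci <- c(ss / upper, ss / lower)         # 1 - alpha confidence interval for sigma^2

cat('T =', T_stat, '\n', 'reject H0:', reject, '\n', 'CI =', ci, '\n')
```

By construction, $H_0$ is rejected exactly when the hypothesized variance falls outside the interval.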

### Likelihood Ratio Test

Assume $\theta = [\theta_1,\dotsc,\theta_n]^T\in \Theta$ are parameters and we want to test   whether $A\theta =b$, where $A\in \mathbb R^{m\times n}\ (m\leqslant n)$ is of full row rank.

Let $\Theta_0=\{\theta:\ A\theta =b\}$ be the target subspace, while $\Theta$ is the whole space. 

Given observations $X_1,\dotsc,X_n$ (written $x$ below), we define 
$$\lambda = \frac{\sup_{\theta\in \Theta} L(\theta;x)}{\sup_{\theta\in\Theta_0} L(\theta;x)}
 = \frac{\sup_{\theta\in \Theta}f_\theta (x)}{\sup_{\theta\in\Theta_0}f_\theta (x)} $$
to be the ratio of the two maximized likelihoods. Clearly $\lambda\geqslant 1$ because $\Theta_0\subseteq\Theta$. Our test statistic is $2\log \lambda$, which by Wilks' theorem satisfies
$$2\log \lambda\stackrel{d}{\rightarrow} \chi_m^2\quad {\rm under\ }H_0.$$

Since $\Theta_0$ is cut out by $m$ independent linear equations in $n$ variables, it has $n-m$ degrees of freedom. The degrees of freedom of the chi-square limit equal the difference between the dimensions of $\Theta$ and $\Theta_0$, which is $n-(n-m)=m$.
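
As a small worked case, the sketch below applies the likelihood ratio test to the Bernoulli data from the P-value example ($17$ ones out of $20$, $H_0:\ q=0.5$). Here the whole space is one-dimensional and $H_0$ imposes one constraint, so the reference distribution is $\chi_1^2$. The helper `loglik` is a hypothetical name, and with $n=20$ the asymptotic approximation is only rough.

```r
n <- 20
ones <- 17
q0 <- 0.5

# Log-likelihood of q for the observed counts
loglik <- function(q) ones * log(q) + (n - ones) * log(1 - q)

q_hat <- ones / n                          # maximizer over the whole space
stat <- 2 * (loglik(q_hat) - loglik(q0))   # 2 log(lambda)
p_value <- 1 - pchisq(stat, df = 1)        # one constraint, so m = 1

cat('2 log(lambda) =', stat, '\n', 'p =', p_value, '\n')
```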

### Goodness-of-fit Test

Suppose we have data $X_1,\dotsc,X_n$ categorized into $k$ classes with frequencies $Z_1,\dotsc,Z_k$, where $Z_j\in\mathbb N$ is the number of observations falling in class $j$. We fit the data by some model $M(\hat \Theta)$, e.g. Poisson, and we would like to test whether the model is reasonable. We first compute the expected frequency of each class under the fitted model, $\hat Z_1,\dotsc,\hat Z_k$. Then we test 
$$T = \sum_{j=1}^k \frac{(Z_j - \hat Z_j)^2}{\hat Z_j}\stackrel{d}{\rightarrow}\chi_{k-1-d}^2$$
where $d$ is the number of parameters estimated in the model $M(\hat \Theta)$. We reject $H_0$ when $T$ is too large.

Note that we require all $\hat Z_j\geqslant 5$ to avoid large errors. Small classes that have small frequencies should be merged into one greater class.

In [29]:
x = c(70, 57, 45, 21, 7)
n = sum(x)
lambda = sum(x * (0:(length(x) - 1))) / n
p = dpois((0:(length(x) - 2)), lambda) # compute P(k=0,1,2,3) for Poisson
p = c(p, 1 - sum(p))  # compute P(k>=4)      for Poisson
cat('Expectation', p * n, '\n')
t = sum((x - p*n)^2 / (p*n))
cat('df =' , length(x)-1-1, '\n',
     't =' , t, '\n',
     'p =' , 1 - pchisq(t, 3))

Expectation 60.84425 72.40466 43.08077 17.08871 6.581606 
df = 3 
 t = 5.662527 
 p = 0.1292345

### Chi-square Tests for Contingency Tables

Suppose $(X,Y)$ are two discrete random variables and we want to know whether they are independent. $X$ belongs to $r$ classes while $Y$ belongs to $c$ classes and $(X,Y)$ belongs to $r\times c$ joint classes.

Sum up the frequency in each row and each column as below.
$$
\begin{array}{ll|llll|l}
                     &   &  & &    Y      &         \\
          &     & 1        & 2        & \cdots & c             & {\rm  sum        }        \\ \hline
  & 1                     & Z_{11}      & Z_{12}      & \cdots & Z_{1c}      & Z_{1\cdot}       \\
 & 2                     & Z_{21}      & Z_{22}      & \cdots & Z_{2c}      & Z_{2\cdot}       \\
               X    & \vdots              & \vdots      & \vdots      & \ddots & \vdots      & \vdots           \\
 & r                     & Z_{r1}      & Z_{r2}      & \cdots & Z_{rc}      & Z_{r\cdot}       \\ \hline
 & {\rm sum}                 & Z_{\cdot 1} & Z_{\cdot 2} & \cdots & Z_{\cdot c} & Z_{\cdot\cdot}=N
\end{array}$$

On the one hand, if we assume that $X,Y$ are independent, then, estimating the marginal probabilities by the observed proportions, 
$$\mathbb P(X = i,\ Y = j) = \mathbb P(X = i)\mathbb P(Y = j) \approx \frac{Z_{i\cdot}Z_{\cdot j}}{N^2}$$
and the corresponding entry has estimated expectation
$$ E_{ij} = \frac{Z_{i\cdot}Z_{\cdot j}}{N}\approx N\cdot \mathbb P(X = i,\ Y = j).$$

On the other hand, without assuming independence, we estimate $N\cdot \mathbb P(X = i,\ Y = j)$ by $Z_{ij}$. So, following the idea of the goodness-of-fit test, we test the statistic
$$T = \sum_{i=1}^r\sum_{j=1}^c\frac{(Z_{ij} - E_{ij})^2}{E_{ij}}\stackrel{d}{\rightarrow}\chi_{(r-1)(c-1)}^2.$$

#### Degrees of Freedom 

We can show that the chi-square statistic has $(r-1)(c-1)$ degrees of freedom by following the idea of the likelihood ratio test.
The whole space $\Theta$ consists of the $r\times c$ parameters $Z_{ij}$ subject to the constraint 
$\sum_{i,j} Z_{ij} = N$, so it has $(rc-1)$ degrees of freedom.

Our target subspace $\Theta_0$ is given by ${\rm rank}([Z_{ij}])=1$ and $\sum_{i,j}Z_{ij}=N$. It has $(r+c-1)-1=r+c-2$ degrees of freedom because $(r+c-1)$ entries (for example, one row and one column) are sufficient to determine a rank-1 matrix.

So the chi-square has $(rc-1)-(r+c-2)= (r-1)(c-1)$ degrees of freedom.

#### Special Case 


When $r = c = 2$, we have the following $2\times 2$ table 
$$\begin{array}{l|ll|l}
    & Y   &     & {\rm sum}     \\ \hline
X   & a   & b   & a+b     \\
    & c   & d   & c+d     \\ \hline
{\rm sum} & a+c & b+d & a+b+c+d\end{array}$$
One can verify that the test statistic is 
$$T =K^2= \frac{(a+b+c+d)(ad-bc)^2}{(a+b)(c+d)(a+c)(b+d)}\stackrel{d}{\rightarrow} \chi_1^2.$$
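
The shortcut formula can be checked against the general statistic $\sum_{i,j}(Z_{ij}-E_{ij})^2/E_{ij}$ on a small table. The counts below are made up for illustration. Note that R's `chisq.test` applies Yates' continuity correction to $2\times 2$ tables by default, so `correct = FALSE` is needed to reproduce the uncorrected statistic.

```r
tab <- matrix(c(12, 5, 8, 15), nrow = 2)  # hypothetical counts: a = 12, c = 5, b = 8, d = 15
a <- tab[1, 1]; b <- tab[1, 2]; cc <- tab[2, 1]; d <- tab[2, 2]

# Shortcut formula for the 2 x 2 table
K2 <- (a + b + cc + d) * (a * d - b * cc)^2 /
  ((a + b) * (cc + d) * (a + cc) * (b + d))

# General formula with E_ij = (row sum)(column sum) / N
E <- outer(rowSums(tab), colSums(tab)) / sum(tab)
T_general <- sum((tab - E)^2 / E)

# Built-in test without the continuity correction
T_builtin <- unname(chisq.test(tab, correct = FALSE)$statistic)

cat('K2 =', K2, '\n', 'general =', T_general, '\n', 'chisq.test =', T_builtin, '\n')
```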

In [1]:
x = matrix(c(332,1360,318,104,29,35,27,18), nrow = 2)
print(x)
s = 0
for (i in (1:2)){
    for (j in (1:4)){
        eij = sum(x[i,1:4]) * sum(x[1:2,j]) / sum(x)
        s = s + (x[i,j] - eij)^2 / eij
    }
}
print(s)
print(pchisq(s, 3)) # CDF at s; the p-value is 1 minus this value, essentially 0
chisq.test(x)

     [,1] [,2] [,3] [,4]
[1,]  332  318   29   27
[2,] 1360  104   35   18
[1] 507.0797
[1] 1



	Pearson's Chi-squared test

data:  x
X-squared = 507.08, df = 3, p-value < 2.2e-16


## T Tests

### T Distribution

The t distribution, also known as Student's t distribution, is defined as follows. Let $X\sim N(0,1)$ and $Z\sim \chi_k^2$ be independent; then 
$$T = \frac{X}{\sqrt{\frac Zk}}$$
has t-distribution with $k$ degrees of freedom. We denote it by $t_k$. The distribution is symmetric and it has zero mean as long as it has finite mean.

It has heavy tails: $\mathbb E(|T_k|^k) = \infty$ for $T_k\sim t_k$. Also, since $\chi_k^2/k\stackrel{\mathbb P}{\rightarrow} 1$ by the LLN, we have $T_k \stackrel{d}{\rightarrow}N(0,1)$ as $k\to\infty$ by Slutsky's theorem.

The density of $t_k$ is given by 
$f(x) = \frac{\Gamma(\frac{k+1}{2})}{\sqrt{k\pi}\Gamma(\frac k2)}\left(1 + \frac{x^2}{k}\right)^{-\frac{k+1}{2}}$.

Proof: We first derive the CDF by 
$$\begin{aligned}\mathbb P(T_k\leqslant t)&= 
\iint_{ x\leqslant t\sqrt{z/k}}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\cdot 
\frac{1}{2^{\frac k2}\Gamma(k/2)}z^{\frac k2 - 1} e^{-\frac z2}dxdz\\&
=\int_{0}^{+\infty}\int_{-\infty}^{t\sqrt{z/k}}\frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\cdot 
\frac{1}{2^{\frac k2}\Gamma(k/2)}z^{\frac k2 - 1} e^{-\frac z2}dxdz.
\end{aligned}
$$
Differentiating with respect to $t$ under the integral sign gives 
$$\begin{aligned}f(t)&
=\int_0^{+\infty}\sqrt{\frac zk}\frac{1}{\sqrt{2\pi}}e^{-\frac{t^2z}{2k}}\cdot 
\frac{1}{2^{\frac k2}\Gamma(k/2)}z^{\frac k2 - 1} e^{-\frac z2}dz
=\frac{1}{2^{\frac k2}\sqrt{2k\pi}\Gamma(\frac k2)}\int_0^{+\infty} 
 z^{\frac{k-1}{2}} e^{-\frac 12\left(\frac{t^2}{k}+1\right)z}dz\\
 &=\frac{1}{2^{\frac k2}\sqrt{2k\pi}\Gamma(\frac k2)}\cdot  2^\frac{k+1}{2}\left(\frac{t^2}{k}+1\right)^{-\frac{k+1}{2}} \int_0^{+\infty} 
 u^{\frac{k-1}{2}} e^{-u}du\\ &= 
 \frac{\Gamma(\frac {k+1}{2})}{\sqrt{k\pi}\Gamma(\frac k2)}\left(\frac{t^2}{k}+1\right)^{-\frac{k+1}{2}}.
\end{aligned}
$$
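
The derived density can be compared with R's built-in `dt`, and the convergence to the standard normal can be observed numerically. The value $k=4$, the grid, and the degrees of freedom used in `gap` are arbitrary choices for this sketch.

```r
k <- 4
x <- seq(-6, 6, by = 0.05)

# Density formula from the text
f <- function(x, k) gamma((k + 1) / 2) / (sqrt(k * pi) * gamma(k / 2)) *
  (1 + x^2 / k)^(-(k + 1) / 2)
max_diff <- max(abs(f(x, k) - dt(x, k)))

# Largest gap between the t density and the N(0,1) density on the grid, for growing k
gap <- sapply(c(2, 10, 100), function(df) max(abs(dt(x, df) - dnorm(x))))

cat('max density difference:', max_diff, '\n')
cat('gap to N(0,1) for k = 2, 10, 100:', gap, '\n')
```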


### T-test for Normal Mean 
If $X_1,\dotsc,X_n$ are independent samples from a normal distribution whose mean is unknown and whose variance is a nuisance parameter, then we can test a hypothesized mean $\mu$ by assuming $H_0:\ X_i\sim N(\mu,\sigma^2)$ and using
$$T = \frac{\overline X - \mu}{\widehat{\rm SE}(\overline X)}=\frac{\sqrt n (\overline X - \mu)}{\sqrt{\frac {1}{n-1}\sum_{i=1}^n (X_i- \overline X)^2}}\sim t_{n-1}.$$
Here we have used the facts that $\sigma^{-2}\sum_{i=1}^n (X_i - \overline X)^2 \sim \chi_{n-1}^2$, that $\sigma^{-1}\sqrt n(\overline X - \mu)\sim N(0,1)$, and that the two are independent.

We reject the null hypothesis $H_0$ when $T$ is far from the origin, i.e. when $|T|$ is large.
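
A minimal sketch of the one-sample t-test, computed by hand and compared with R's `t.test`. The data are the sample `x` that appears in the permutation test example of this chapter; the hypothesized mean `mu0` is an assumption made only for illustration.

```r
x <- c(225, 262, 217, 240, 230, 229, 235, 217) * .001
mu0 <- 0.22   # hypothesized mean (assumed for illustration)
n <- length(x)

# sd(x) uses the 1 / (n - 1) normalization, matching the formula for T
T_stat <- sqrt(n) * (mean(x) - mu0) / sd(x)
p_value <- 2 * (1 - pt(abs(T_stat), df = n - 1))   # two-sided p-value

builtin <- t.test(x, mu = mu0)
cat('T =', T_stat, ' p =', p_value, '\n')
cat('t.test: T =', builtin$statistic, ' p =', builtin$p.value, '\n')
```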

### T-test for Two Samples 

Suppose we have two independent normal samples $X_1,\dotsc,X_n$ and $Y_1,\dotsc,Y_m$. We want to test $H_0:\ \mu_X-\mu_Y=\Delta \mu$. If we **assume the variances of the two samples are equal**, then 
$\sigma^{-1}\sqrt{\frac{nm}{n+m}}(\overline X - \overline Y - \Delta \mu)\sim N(0,1)$ and 
$\sigma^{-2}\left( \sum_{j=1}^n (X_j - \overline X)^2 +  \sum_{k=1}^m (Y_k - \overline Y)^2\right)\sim \chi_{m+n-2}^2$ and they are independent. So we test the statistic
$$T = \sqrt{\frac{n+m-2}{\frac{n+m}{nm}}}\frac{\overline X - \overline Y  - \Delta \mu}{\sqrt{ \sum_{j=1}^n (X_j  - \overline X)^2+ \sum_{k=1}^m (Y_k - \overline Y)^2}}\sim t_{n+m-2}.$$

Proof: Since $\overline X \sim N(\mu_X,\frac 1n \sigma^2)$ and $\overline Y\sim N(\mu_Y,\frac 1m \sigma^2)$ are independent, under $H_0$ we have $\overline X - \overline Y -\Delta\mu \sim N(0, \left(\frac 1n+\frac 1m\right)\sigma^2)$. Therefore, by normalization we have
$$\sigma^{-1}\sqrt{\frac{nm}{n+m}}(\overline X - \overline Y - \Delta \mu)\sim N(0,1).$$
Also, we already know that $\sigma^{-2}\sum_{j=1}^n (X_j - \overline X)^2\sim \chi^2_{n-1}$ and $\sigma^{-2}\sum_{k=1}^m (Y_k - \overline Y)^2\sim \chi^2_{m-1}$ are independent of each other and of $\overline X, \overline Y$, so 
$$\sigma^{-2}\left(\sum_{j=1}^n (X_j - \overline X)^2+\sum_{k=1}^m (Y_k - \overline Y)^2\right)\sim \chi^2_{n+m-2}$$
and is independent of $\overline X,\overline Y$. Hence the statistic $T$ has the $t_{n+m-2}$ distribution.
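
The pooled statistic can be computed directly from its formula and compared with `t.test(..., var.equal = TRUE)`. The sketch uses the two samples from the permutation test example of this chapter and takes $\Delta\mu=0$ as an illustrative assumption.

```r
x <- c(225, 262, 217, 240, 230, 229, 235, 217) * .001
y <- c(209, 205, 196, 210, 202, 207, 224, 223, 220, 201) * .001
n <- length(x)
m <- length(y)
delta <- 0   # hypothesized difference of means (assumed for illustration)

# Pooled sum of squared deviations
ss <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
T_stat <- sqrt((n + m - 2) * n * m / (n + m)) * (mean(x) - mean(y) - delta) / sqrt(ss)
p_value <- 2 * (1 - pt(abs(T_stat), df = n + m - 2))

builtin <- t.test(x, y, var.equal = TRUE)
cat('T =', T_stat, ' p =', p_value, '\n')
cat('t.test: T =', builtin$statistic, ' p =', builtin$p.value, '\n')
```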

## Other Tests
### Permutation Test

Suppose we have two samples $X:\ X_1,\dotsc,X_m$ and $Y:\ Y_1,\dotsc,Y_n$. We want to know whether $X,Y$ are sampled from the same distribution. 

Under the hypothesis that they are from the same distribution, we merge the two samples into one sample of size $m+n$ and randomly redivide it into two (ordered) samples $X':\ X_1',\dotsc,X_m'$ and $Y':\ Y_1',\dotsc,Y_n'$. We measure the difference between them by 
$$T = |\overline X' - \overline Y'|\quad{\rm or}\quad T = |\overline X' - \overline Y'|^2+\left|\frac{1}{m-1}\sum_{i=1}^m (X_i' - \overline X')^2
-\frac{1}{n-1}\sum_{j=1}^n(Y_j' - \overline Y')^2\right|.$$

 Among all possible divisions, our initial observation $(X,Y)$ should look typical if the two samples really have the same distribution. We compute the proportion of divisions $(X',Y')$ that show a more significant difference,
 $$p = \frac{1}{(m+n)!}\sum_{{\rm Divisions\ }(X',Y')}\mathbb I_{T(X',Y')>T{(X,Y)}}.$$

Here we divide it by $(m+n)!$ because there are $(m+n)!$ different ordered divisions. 

When $p$ is small, say $p<\alpha$, the observed $T(X,Y)$ is at least as large as the statistic of a proportion $1 - \alpha$ of all divisions, which indicates that $X$ and $Y$ differ greatly. In this case we reject $H_0$ at the $\alpha$-significance level.

<br>

In general, $(m+n)!$ might be large. We can estimate $p$ by randomly sampling some of the  divisions.

<br>

The permutation test is nonparametric.

In [4]:
# example: test whether two independent samples are from the same distribution
# X: (0.225, 0.262, 0.217, 0.240, 0.230, 0.229, 0.235, 0.217)
# Y: (0.209, 0.205, 0.196, 0.210, 0.202, 0.207, 0.224, 0.223, 0.220, 0.201)
x <- c(225,262,217,240,230,229,235,217) * .001
m <- length(x)
y <- c(209,205,196,210,202,207,224,223,220,201) * .001
t <- abs(mean(x) - mean(y))
x <- c(x,y)
n <- length(x)
s <- 0
for (i in (1:10000)){
	z <- sample(x, n) # random permutation, the first m data are X', while the other are Y'
	if (abs(mean(z[1:m]) - mean(z[(m+1):n])) > t){ # two-sided: compare |mean difference| with t
		s <- s + 1
	}
}
print(s / 10000) # very small, so reject H0 (the exact value varies between runs)

[1] 6e-04
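
Because $\binom{18}{8}=43758$ is small here, the p-value can also be computed exactly instead of by random sampling. The sketch below enumerates unordered divisions with `combn`; each unordered division corresponds to $m!\,n!$ ordered ones, so the proportion agrees with the formula for $p$. The tolerance `1e-12` is an assumption added to guard the strict inequality against floating-point ties.

```r
x <- c(225, 262, 217, 240, 230, 229, 235, 217) * .001
y <- c(209, 205, 196, 210, 202, 207, 224, 223, 220, 201) * .001
pooled <- c(x, y)
m <- length(x)
N <- length(pooled)
t_obs <- abs(mean(x) - mean(y))

idx <- combn(N, m)                               # each column: positions assigned to X'
sum_x <- colSums(matrix(pooled[idx], nrow = m))  # sum of X' for every division
t_all <- abs(sum_x / m - (sum(pooled) - sum_x) / (N - m))

# Proportion of divisions with a strictly larger statistic than the observed one
p_exact <- mean(t_all > t_obs + 1e-12)
cat('divisions =', ncol(idx), '\n', 'p =', p_exact, '\n')
```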


## Neyman-Pearson Lemma 

If a test of size $\alpha$ has the following form:
$$\begin{aligned}&H_0:\ \theta = \theta_0\quad {\rm against}\quad H_1:\ \theta = \theta_1\\ &
{\rm reject \ }H_0 {\rm \ when \quad\quad }
L(\theta_1; x) > K\,L(\theta_0; x)\\ &
{\rm not\ reject \ }H_0{\rm \ when \ }L(\theta_1; x) < K\,L(\theta_0; x),
\end{aligned}$$
where $L(\theta;x)$ is the likelihood function and $K>0$ is a constant determined by $\alpha$, then the test is a **most powerful test (MPT)** among all tests of size $\alpha$, in the sense of minimizing the probability of Type II Error, or equivalently maximizing the power $\beta(\theta_1)$.

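Corollary: For a simple null hypothesis against a simple alternative, the likelihood ratio test is most powerful. If the same rejection region is most powerful for every $\theta_1$ in $H_1$, the test is a UMPT (uniformly most powerful test).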
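
A small sketch of the lemma for Bernoulli data with $n=20$, testing $q_0=0.5$ against $q_1=0.8$ (both values chosen here only for illustration). The likelihood ratio is increasing in the number of ones, so the rule $L(q_1;x)>K\,L(q_0;x)$ amounts to rejecting when the count is at least some cutoff. Since the count is discrete and no randomization is used, the attained size is in general below $\alpha$.

```r
n <- 20
q0 <- 0.5
q1 <- 0.8
alpha <- 0.05
k <- 0:n

# Likelihood ratio L(q1; x) / L(q0; x) as a function of the number of ones
ratio <- dbinom(k, n, q1) / dbinom(k, n, q0)
increasing <- all(diff(ratio) > 0)   # so "ratio > K" is the same as "count >= cutoff"

# Smallest cutoff with P_{q0}(X >= cutoff) <= alpha
tail_q0 <- sapply(k, function(j) sum(dbinom(j:n, n, q0)))
cutoff <- min(k[tail_q0 <= alpha])

size <- sum(dbinom(cutoff:n, n, q0))       # attained Type I error probability
power_q1 <- sum(dbinom(cutoff:n, n, q1))   # power at q1

cat('cutoff =', cutoff, '\n', 'size =', size, '\n', 'power =', power_q1, '\n')
```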