# Lecture: Wald Test and Likelihood-ratio Test

The Wald test, the likelihood-ratio test, and the Lagrange multiplier test are the three classical approaches to hypothesis testing.

In previous lectures we considered tests for various statistics and hypotheses, but what about regression coefficients? Today we cover the first two of these tests, which can be used to test the significance of a single coefficient, say $\theta$, or the joint significance of several components of a coefficient vector $\beta$. We begin by introducing the Wald test.

## Wald Tests

This portion of the lecture details the different forms of the Wald test and their many uses. In general, the Wald test assesses whether the data are consistent with a parameter (or set of parameters) in a model taking a specific value. As we will see, its most common use is to determine whether a coefficient is significant. We begin with the Wald test of a single parameter.

### Single Parameter

The null and alternative hypotheses for testing a single parameter $\theta$ take the general form:

$$
    H_0: \theta = \theta_0\\
    H_a: \theta \neq \theta_0
$$

The Wald Test statistic is then written as:

$$
    W = \frac{(\hat{\theta} - \theta_0)^2}{Var(\hat{\theta})},
$$

where $\hat{\theta}$ is the maximum likelihood estimate (MLE) of the true parameter $\theta$. Under $H_0$, $W$ asymptotically follows a $\chi^2$ distribution with one degree of freedom. An alternative form of the statistic is obtained by taking the signed square root of $W$:

$$
    \sqrt{W} = \frac{\hat{\theta} - \theta_0}{se(\hat{\theta})}
$$

Here $se(\hat{\theta})$ is the standard error of $\hat{\theta}$, the square root of its variance. This value is not always known. When it is estimated from the data and the data are normally distributed, the ratio follows a Student's t distribution, which makes calculating a p-value relatively simple. If the standard error is known, or the sample is large, the ratio has approximately a standard normal distribution, so one can treat it as a z-statistic and perform a z-test when deciding whether to reject the null hypothesis.

The Wald test is most commonly used to determine whether a parameter has any significance in a model. In this case “significant” means that the parameter adds something to the model; parameters that add nothing can be removed without affecting the model in any meaningful way. This is done using the null hypothesis $H_0: \theta = 0$, which simplifies our test statistic to:

$$
    \sqrt{W} = \frac{\hat{\theta}}{se(\hat{\theta})}
$$

The intuition is that the test measures how far the estimated parameter is from zero (or from any other value specified by the null hypothesis), in units of standard errors.

\begin{example}\label{example:1}
Say we have performed a linear regression on a large, approximately normally distributed data set $(n > 30)$, and obtained the model $y = 4.2 + .38x_1 + .86x_2$ with known $se(\hat{\beta_2}) = .409$. We want to determine whether the variable $x_2$ has any significance in our model. We do so by performing a Wald test on $\beta_2$ with null hypothesis $H_0: \beta_2 = 0$.

We first find our test statistic, which in this case will be a z-statistic since we are working with a large sample.

$$
    Z = \frac{\hat{\beta_2}}{se(\hat{\beta_2})} = \frac{.86}{.409} = 2.10
$$

We can then calculate the p-value using a z-test:

$$
    p = 2 * \mathbb{P}(Z > 2.10) = .036
$$

At significance level $\alpha = .05$ we reject the null hypothesis and conclude that $\beta_2$ is significantly different from zero.
\end{example}
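The calculation in the example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `wald_z_test` is ours and is not part of any package.

```python
from statistics import NormalDist


def wald_z_test(estimate, std_error, null_value=0.0):
    """Return the Wald z-statistic and its two-sided p-value."""
    z = (estimate - null_value) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value


# Values taken from the example: beta_2 estimate 0.86, standard error 0.409.
z, p = wald_z_test(estimate=0.86, std_error=0.409)
print(f"z = {z:.4f}, p-value = {p:.4f}")
```

Because the code keeps the unrounded statistic rather than rounding $z$ to 2.10 first, its p-value is slightly smaller than the hand calculation, but the conclusion at $\alpha = .05$ is the same.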

\begin{example}
The Wald Test can also be used to construct a confidence interval for a parameter. Looking at our example, we can again use a z-test to create a 95% $(\alpha = .05)$ confidence interval for $\beta_2$ as follows:

$$
    \hat{\beta_2} \pm z_{1-\alpha/2}*se(\hat{\beta_2}) = .86 \pm 0.80164 = [.058, 1.66]
$$

* Note: If the standard error of $\hat{\beta_2}$ were unknown, the process would be almost the same, but with an estimate of the standard error obtained from the residual sum of squares, and with a t-test replacing the z-test wherever we have used one.
\end{example}
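The interval above can be reproduced in the same way. The sketch below again relies only on the standard library, and the function name `wald_confidence_interval` is a hypothetical helper introduced for illustration.

```python
from statistics import NormalDist


def wald_confidence_interval(estimate, std_error, alpha=0.05):
    """Return the (1 - alpha) Wald confidence interval based on the normal quantile."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z_crit * std_error
    return estimate - margin, estimate + margin


# Values taken from the example: beta_2 estimate 0.86, standard error 0.409.
lower, upper = wald_confidence_interval(0.86, 0.409)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

The interval excludes zero, which agrees with rejecting $H_0: \beta_2 = 0$ at the 5% level.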

### Multiple Parameters

The Wald test can also be used to test the joint significance of several coefficients. For instance, take a vector of parameters $\beta$ and separate it into two components $\beta_1$ and $\beta_2$ with $p_1$ and $p_2$ elements respectively, so that $\beta' = (\beta_1', \beta_2')$. Now consider the hypothesis:

$$
    H_0: \beta_1 = 0
$$

Given this setup, the equation for the Wald Test statistic would then be:

$$
    W = \hat{\beta_1'} Var^{-1}(\hat{\beta_1}) \hat{\beta_1}
$$

* Note: If $\beta_1$ contains a single coefficient, so that $p_1 = 1$, this formula reduces to the one for a single parameter shown above.

In large samples, the quadratic form $W$ follows a $\chi^2$ distribution with $p_1$ degrees of freedom, regardless of whether the variance is known or estimated. The same holds in smaller samples under the assumption of normality and with a known variance.

However, in small samples where the variance is estimated using the residual sum of squares with $n - p$ degrees of freedom (where $p = p_1 + p_2$), $W/p_1$ follows an $F$ distribution with $p_1$ and $n-p$ degrees of freedom.
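To make the quadratic form concrete, here is a small sketch for the case $p_1 = 2$, written with the standard library only. The coefficient estimates and the covariance matrix are hypothetical numbers chosen for illustration, and the helper names are ours. For two degrees of freedom the $\chi^2$ survival function has the closed form $e^{-x/2}$, which is what the sketch uses for the large-sample p-value.

```python
import math


def wald_statistic_2d(beta_hat, cov):
    """Quadratic form b' V^{-1} b for a 2-vector b and a 2x2 covariance matrix V."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    b1, b2 = beta_hat
    # V^{-1} = (1 / det) * [[d, -b], [-c, a]]
    return (d * b1 * b1 - (b + c) * b1 * b2 + a * b2 * b2) / det


def chi2_sf_2df(x):
    """Survival function of a chi-squared variable with 2 degrees of freedom."""
    return math.exp(-x / 2.0)


# Hypothetical estimates and covariance matrix (not taken from any data set).
beta_1_hat = (0.38, 0.86)
cov_beta_1 = ((0.040, 0.010),
              (0.010, 0.167))

W = wald_statistic_2d(beta_1_hat, cov_beta_1)
p_value = chi2_sf_2df(W)      # large-sample chi-squared(2) p-value
f_statistic = W / 2           # small-sample version: compare to F(2, n - p)
print(f"W = {W:.3f}, chi-squared p-value = {p_value:.4f}, F statistic = {f_statistic:.3f}")
```

With an estimated variance in a small sample, `f_statistic` would be compared with an $F$ distribution with $2$ and $n - p$ degrees of freedom instead.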

## Likelihood-ratio Tests

The likelihood-ratio test compares two competing statistical models based on their likelihoods. We discuss it in two cases: the likelihood-ratio test with simple hypotheses and the general likelihood-ratio test. In the general case, one model is subject to some constraint while the other uses the entire parameter space. The likelihood-ratio test tells us whether the two models differ significantly in how well they fit the data. If the unconstrained model offers no significant improvement, we can simply use the constrained one, which has fewer free parameters.

We start with the Neyman–Pearson lemma, which paves the way for the test.

### The Neyman–Pearson Lemma

Let $H_0$ and $H_1$ be simple hypotheses, meaning that each one completely specifies the distribution of the data (either both discrete or both continuous), with no unknown parameters. Let $\lambda(x)$ denote the likelihood ratio, that is, the likelihood of the data under $H_0$ divided by the likelihood under $H_1$. For a constant $c > 0$, suppose that the likelihood-ratio test which rejects $H_0$ when $\lambda(x) < c$ has significance level $\alpha$. Then any other test of $H_0$ with significance level at most $\alpha$ has power against $H_1$ at most that of this likelihood-ratio test.

What the Neyman–Pearson lemma demonstrates is that when we are comparing two models without unknown parameters, the likelihood-ratio test has the highest power among all tests at a specified significance level $\alpha$.

### Likelihood Function

Let $X_1, X_2, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$. We observe that $X_1 = x_1, X_2 = x_2, \dots, X_n = x_n$.

* If the $X_i$'s are discrete, then the likelihood function is defined as their joint probability mass function, viewed as a function of the parameter $\theta$. For an independent sample, this is the product of the individual probability mass functions.

$$L(x_1,x_2,\dots,x_n;\theta) = P_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta)$$

* If the $X_i$'s are jointly continuous, then the likelihood function is defined as their joint probability density function, viewed as a function of the parameter $\theta$. For an independent sample, this is the product of the individual densities.

$$L(x_1,x_2,\dots,x_n;\theta) = f_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta)$$

The likelihood function tends to be highest near the true value of $\theta$.
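As a small illustration of the definition, the sketch below evaluates the log-likelihood of an exponential sample with mean $\theta$ on a grid of candidate values and picks the maximizer. The observations are hypothetical, and the grid search is used only for transparency; for this model the maximizer is known to be the sample mean.

```python
import math


def exp_log_likelihood(theta, xs):
    """Log-likelihood of i.i.d. exponential data with mean theta."""
    return -len(xs) * math.log(theta) - sum(xs) / theta


sample = [0.8, 2.9, 1.1, 3.4, 1.8]            # hypothetical observations
grid = [0.5 + 0.01 * k for k in range(451)]   # candidate theta values from 0.5 to 5.0

best_theta = max(grid, key=lambda t: exp_log_likelihood(t, sample))
sample_mean = sum(sample) / len(sample)
print(f"grid maximizer = {best_theta:.2f}, sample mean = {sample_mean:.2f}")
```

The grid maximizer lands on (or next to) the sample mean, as expected for the exponential model.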

## Likelihood-ratio Test with Simple Hypotheses

With simple hypotheses, the distribution of the data is fully specified under both the null hypothesis and the alternative hypothesis:

$$
    H_0: \theta = \theta_0\\
    H_1: \theta = \theta_1
$$

We define

$$LR = \lambda(x_1,x_2,\dots,x_n) = \frac{L(x_1,x_2,\dots,x_n;\theta_0)}{L(x_1,x_2,\dots,x_n;\theta_1)}$$

Like the likelihood, the log-likelihood measures how well a model fits the data: the higher the log-likelihood, the better the fit.

By this definition, the likelihood ratio $\lambda$ is small if the alternative model fits the data better than the null model, and large if the null model fits the data better.

To perform a likelihood-ratio test, we choose a constant $c$. We reject $H_0$ if $\lambda < c$ and accept it if $\lambda \geq c$. The value of $c$ is chosen based on the desired significance level $\alpha$.

\begin{example}

Consider a motion sensor for museums that uses infrared waves to detect thieves. The system receives a signal and, based on it, must decide whether someone is approaching the collection without permission. Let $X$ be the received signal. Suppose that we know:

$X = W$, if no one is in range.

$X = 1 + W$, if someone is in range.

where $W \sim \mathcal{N}(0,\sigma^2 = \frac{1}{4})$. Thus, we can write $X = \theta + W$, where $\theta = 0$ if no one is in range, and $\theta = 1$ if at least one person is in range. $H_0$ and $H_1$ are defined as follows:

$H_0: \theta = \theta_0 = 0$,

$H_1: \theta = \theta_1 = 1$.

Let $X = x$. Design a level 0.05 test ($\alpha = 0.05$) to decide between $H_0$ and $H_1$.

\end{example}

Under $H_0$, $X \sim \mathcal{N}(0,\sigma^2 = \frac{1}{4})$. Therefore, $L(x;\theta_0) = f_X(x;\theta_0) = \frac{2}{\sqrt{2\pi}}e^{-\frac{4x^2}{2}} = \sqrt{\frac{2}{\pi}}e^{-2x^2}$

On the other hand, under $H_1$, $X \sim \mathcal{N}(1,\sigma^2 = \frac{1}{4})$. Therefore, $L(x;\theta_1) = f_X(x;\theta_1) = \frac{2}{\sqrt{2\pi}}e^{-\frac{4(x-1)^2}{2}} = \sqrt{\frac{2}{\pi}}e^{-2(x-1)^2}$

Therefore, $\lambda(x) = \frac{L(x;\theta_0)}{L(x;\theta_1)} = e^{-2x^2+2(x-1)^2} = e^{2-4x}$.

Thus, we accept $H_0$ if $e^{2-4x}\geq c$,

where $c$ is the threshold. Equivalently, we accept $H_0$ if $x \leq \frac{1}{4}(2 - \ln c)$.

Let us define $c^{'} = \frac{1}{4}(2 - \ln c)$, where $c^{'}$ is a new threshold. Remember that $x$ is the observed value of the random variable $X$. Thus, we can summarize the decision rule as follows. We accept $H_0$ if $X \leq c^{'}$.

Because the sensor protects the collection from theft, a false alarm (signaling a theft when no one is in range) is more acceptable than a missed detection. In other words, we would rather commit a type I error, rejecting the null hypothesis when it is actually true, than a type II error.

Let $\alpha = \mathbb{P}(\text{type I error}) = \mathbb{P}(\text{Reject } H_0 \mid H_0) = \mathbb{P}(X > c^{'} \mid H_0) = 1 - \Phi(2c^{'})$, where the last step holds because $X/\sigma = 2X$ is standard normal under $H_0$.

Solving $\alpha = 0.05 = 1 - \Phi(2c^{'})$, we get $c^{'} = \frac{1}{2}\Phi^{-1}(1-\alpha) = \frac{1}{2}\Phi^{-1}(0.95) = 0.8225$.

Therefore, we accept $H_0$ if $X \leq 0.8225$.
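The threshold for the sensor example can be verified numerically. The following is a minimal sketch with the standard library; it also reports the type II error of the resulting test and the cutoff $c = e^{2 - 4c'}$ on the likelihood-ratio scale. Neither quantity is computed in the text, so treat them as supplementary.

```python
import math
from statistics import NormalDist

sigma = 0.5          # standard deviation of the noise W (variance 1/4)
alpha = 0.05

null_dist = NormalDist(mu=0.0, sigma=sigma)   # distribution of X under H0
alt_dist = NormalDist(mu=1.0, sigma=sigma)    # distribution of X under H1

# Threshold c' on the signal scale: accept H0 if X <= c'.
threshold = null_dist.inv_cdf(1.0 - alpha)

type_1_error = 1.0 - null_dist.cdf(threshold)  # P(X > c' | H0)
type_2_error = alt_dist.cdf(threshold)         # P(X <= c' | H1)

# Corresponding cutoff on the likelihood-ratio scale, from lambda(x) = exp(2 - 4x).
lr_cutoff = math.exp(2.0 - 4.0 * threshold)

print(f"c' = {threshold:.4f}, type I = {type_1_error:.3f}, "
      f"type II = {type_2_error:.3f}, c = {lr_cutoff:.4f}")
```

The computed threshold differs from 0.8225 only in the fourth decimal place, because the hand calculation rounds $\Phi^{-1}(0.95)$ to 1.645.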

\begin{example}

Suppose $X_1, \dots, X_n$ is a random sample of size $n$ from an exponential distribution with density $f(x|\theta) = \frac{1}{\theta}e^{-\frac{x}{\theta}}$, $x>0$.

Consider the following simple hypothesis testing problem:

$H_0: \theta = \theta_0 = 2$,

$H_1: \theta = \theta_1 = 1$.

Suppose we observe $X_i = x_i$ for a sample of size $n = 5$. Design a level 0.05 test ($\alpha = 0.05$) to decide between $H_0$ and $H_1$.

\end{example}

Under $H_0$, the likelihood function is $L(x_1,x_2,\dots,x_n;\theta_0) = f_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta_0) = \prod_{i=1}^{n}\frac{1}{\theta_0}e^{-\frac{x_i}{\theta_0}} = \theta_0^{-n}e^{-\sum\frac{x_i}{\theta_0}}$.

Similarly, under $H_1$, the likelihood function is $L(x_1,x_2,\dots,x_n;\theta_1) = f_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta_1) = \theta_1^{-n}e^{-\sum\frac{x_i}{\theta_1}}$.

We define the likelihood ratio as follows:

$$LR = \lambda(x) = \frac{f_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta_0)}{f_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta_1)} = \frac{\theta_0^{-n}e^{-\sum\frac{x_i}{\theta_0}}}{\theta_1^{-n}e^{-\sum\frac{x_i}{\theta_1}}} = \left(\frac{\theta_0}{\theta_1}\right)^{-n}e^{(\frac{1}{\theta_1}-\frac{1}{\theta_0})\sum x_i}$$

If the data supports $H_1$, then the likelihood function $f_{X_1X_2 \dots X_n}(x_1,x_2,\dots,x_n;\theta_1)$ should be large, therefore the $LR = \lambda$ is small. Thus, we reject the null hypothesis if $LR \leq c$, where $c$ is a constant such that $\mathbb{P}(LR \leq c) = \alpha$ under the null hypothesis $H_0$.

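Setting

$$
\begin{aligned}
\alpha = \mathbb{P}(LR \leq c) &= \mathbb{P}\left(\left(\frac{\theta_0}{\theta_1}\right)^{-n}e^{(\frac{1}{\theta_1}-\frac{1}{\theta_0})\sum x_i} \leq c\right)\\
&= \mathbb{P}\left(e^{(\frac{1}{\theta_1}-\frac{1}{\theta_0})\sum x_i} \leq \left(\frac{\theta_0}{\theta_1}\right)^{n}c\right)\\
&= \mathbb{P}\left(\left(\frac{1}{\theta_1}-\frac{1}{\theta_0}\right)\sum x_i \leq \ln \left[\left(\frac{\theta_0}{\theta_1}\right)^{n}c\right]\right)\\
&= \mathbb{P}\left(\sum x_i \leq \frac{\ln c + n\ln \theta_0 - n\ln \theta_1}{\frac{1}{\theta_1}-\frac{1}{\theta_0}}\right)
\end{aligned}
$$

where the last step uses $\frac{1}{\theta_1}-\frac{1}{\theta_0} > 0$, which holds because $\theta_1 < \theta_0$.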

Let $V = \frac{2}{\theta_0}\sum x_i$. Then $\alpha = \mathbb{P}(V \leq \frac{2}{\theta_0}\frac{\ln c + n\ln \theta_0 - n\ln \theta_1}{\frac{1}{\theta_1}-\frac{1}{\theta_0}})$.

A chi-squared distribution with 2 degrees of freedom is an exponential distribution with mean 2 (that is, rate $\frac{1}{2}$). Under $H_0$, each $\frac{2}{\theta_0}X_i$ is exponential with mean 2 and therefore follows a $\chi_2^2$ distribution. Consequently, $V$, a sum of $n$ independent such terms, follows a chi-squared distribution with $2n$ degrees of freedom.

So in this question, we plug in $n = 5$ and look up the chi-squared table with $2n = 10$ degrees of freedom for the value with lower-tail area $\alpha = 0.05$, which is 3.94. So $\mathbb{P}(\frac{2}{2}\sum x_i \leq 3.94) = 0.05$. This implies that we should reject $H_0$ if $\sum x_i \leq 3.94$.

Solving $\frac{2}{\theta_0}\frac{\ln c + n\ln \theta_0 - n\ln \theta_1}{\frac{1}{\theta_1}-\frac{1}{\theta_0}} = 3.94$ with $\theta_0 = 2$, $\theta_1 = 1$, and $n = 5$ gives $\ln c = 1.97 - 5\ln 2 \approx -1.4957$, so $c \approx 0.2241$.

Therefore, we reject the null hypothesis $H_0$ if $(\frac{\theta_0}{\theta_1})^{-n}e^{(\frac{1}{\theta_1}-\frac{1}{\theta_0})\sum x_i} = (\frac{2}{1})^{-5}e^{(\frac{1}{1}-\frac{1}{2})\sum x_i} = \frac{1}{32}e^{\frac{1}{2}\sum x_i} \leq 0.2241 = c$, which is equivalent to $\sum x_i \leq 3.94$.
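The critical values in this example can be reproduced without a table. The sketch below uses only the standard library: for an even number of degrees of freedom the chi-squared CDF has a closed form, and the quantile is found by bisection. The helper names are ours. With SciPy available, `scipy.stats.chi2.ppf(0.05, 10)` should give the same quantile.

```python
import math


def chi2_cdf_even(x, df):
    """CDF of a chi-squared variable with an even number of degrees of freedom."""
    k = df // 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= (x / 2.0) / j
        total += term
    # Survival function is exp(-x/2) * sum_{j<k} (x/2)^j / j!
    return 1.0 - math.exp(-x / 2.0) * total


def chi2_quantile_even(p, df, lo=0.0, hi=200.0):
    """Quantile of chi-squared(df), df even, found by bisection on the CDF."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


theta0, theta1, n, alpha = 2.0, 1.0, 5, 0.05

v_crit = chi2_quantile_even(alpha, 2 * n)   # lower 5% point of chi-squared(10)
sum_crit = theta0 * v_crit / 2.0            # reject H0 if sum(x_i) <= sum_crit
c = (theta0 / theta1) ** (-n) * math.exp((1 / theta1 - 1 / theta0) * sum_crit)

print(f"chi-squared quantile = {v_crit:.4f}, "
      f"sum threshold = {sum_crit:.4f}, c = {c:.4f}")
```

The likelihood-ratio cutoff `c` is obtained by plugging the threshold for $\sum x_i$ back into the expression for $\lambda$.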

## General Likelihood-ratio Test (GLRT)

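The general likelihood-ratio test is used to test composite hypotheses, in which the parameter $\theta$ is only restricted to lie in a set rather than fixed at a single value. Let $S$ denote the full parameter space, and let $S_0$ and $S_1 = S_0^c$ be complementary subsets of it, so that $S_0 \cup S_1 = S$.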

Consider the following hypotheses:

$$
    H_0:\theta \in S_0\\
    H_1:\theta \in S_1
$$

Unlike the likelihood-ratio test with simple hypotheses, the GLRT does not form the ratio of $L(x_1,x_2,\dots,x_n;\theta_0)$ and $L(x_1,x_2,\dots,x_n;\theta_1)$ at two fixed parameter values. Instead, $LR = \lambda$ is the ratio of the supremum of the likelihood over the constrained set $S_0$ to the supremum of the likelihood over the full set $S$, which plays the role of the unconstrained model introduced earlier.

The definition of the supremum is:

The supremum (abbreviated sup) of a subset $S$ of a partially ordered set $P$ is the least element in $P$ that is greater than or equal to all elements of $S$, if such an element exists. The supremum is also referred to as the least upper bound.

The definition of the GLRT is then:

Let $X_1, X_2, X_3, \dots, X_n$ be a random sample from a distribution with a parameter $\theta$. Suppose that we have observed $X_1 = x_1, X_2 = x_2,\dots, X_n = x_n$.

Define:

$$LR = \lambda(x_1,x_2,\dots,x_n) = \frac{l_0}{l} = \frac{\sup \{ L(x_1,x_2,\dots,x_n;\theta):\theta \in S_0 \}}{\sup \{ L(x_1,x_2,\dots,x_n;\theta):\theta \in S \}}$$

The idea behind the GLRT is as follows. We first find the likelihoods corresponding to the most likely values of $\theta$ in $S_0$ and $S_1$ respectively, namely $l_0$ and $l_1 = \sup \{ L(x_1,x_2,\dots,x_n;\theta):\theta \in S_1 \}$, so that $l = \max(l_0, l_1)$. Consider two extreme cases. If $l_0 = l$, then the most likely value of $\theta$ belongs to $S_0$, and we should not reject $H_0$. On the other hand, if $\frac{l_0}{l_1}$ is much smaller than 1, then $\frac{l_0}{l}$ is also much smaller than 1, and we should reject $H_0$ in favor of $H_1$.

To conduct a likelihood-ratio test, we choose a threshold $0 \leq c \leq 1$ and compare $\frac{l_0}{l}$ to $c$.

Notice that $l$ denotes the supremum of the likelihood over the entire parameter space (the unconstrained model mentioned before), while $l_0$ denotes the supremum over the constrained subset $S_0$. Hence $0 \leq \frac{l_0}{l} \leq 1$, because the constrained supremum can never exceed the unconstrained one.

If $\frac{l_0}{l} \geq c$, we accept $H_0$. If $\frac{l_0}{l} < c$, we reject $H_0$. The value of $c$ can be chosen based on the desired significance level $\alpha$.
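As a concrete instance of the ratio $\frac{l_0}{l}$, consider a normal sample with known standard deviation and the hypotheses $H_0: \mu = \mu_0$ against $H_1: \mu \neq \mu_0$. This setting is our choice for illustration and is not one of the examples above. Here $S_0 = \{\mu_0\}$, and the likelihood over the full space is maximized at the sample mean. The sketch below computes the ratio directly from the two log-likelihoods; the observations are hypothetical.

```python
import math


def normal_log_likelihood(mu, xs, sigma=1.0):
    """Log-likelihood of i.i.d. N(mu, sigma^2) data with known sigma."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)
        for x in xs
    )


def glrt_ratio(xs, mu0=0.0, sigma=1.0):
    """Likelihood ratio l_0 / l for H0: mu = mu0 against H1: mu != mu0."""
    mu_hat = sum(xs) / len(xs)                          # maximizer over the full space S
    log_l0 = normal_log_likelihood(mu0, xs, sigma)      # supremum over S_0 = {mu0}
    log_l = normal_log_likelihood(mu_hat, xs, sigma)    # supremum over S
    return math.exp(log_l0 - log_l)


sample = [0.3, -0.4, 1.2, 0.9, 0.5]   # hypothetical observations
lam = glrt_ratio(sample, mu0=0.0)
print(f"lambda = {lam:.4f}")
```

For this model the ratio simplifies to $\exp(-n(\bar{x} - \mu_0)^2 / (2\sigma^2))$, so it always lies between 0 and 1 and shrinks as the sample mean moves away from $\mu_0$.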

### Test Statistic

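With the same idea, the test statistic of the likelihood-ratio test is often defined as minus twice the difference between the log-likelihoods:

$$\lambda_{LR} = -2 \ln \left[\frac{\sup_{\theta \in S_0}L(\theta)}{\sup_{\theta \in S}L(\theta)}\right] = -2[l(\theta_0)-l(\hat{\theta})]$$

Here $l(\theta_0)$ denotes the maximized log-likelihood over the constrained set $S_0$, and $l(\hat{\theta})$ the maximized log-likelihood over the full space $S$.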

### Wilks' Theorem

Assuming $H_0$ is true, as the sample size $n$ approaches $\infty$, the test statistic $\lambda_{LR}$ is asymptotically chi-squared ($\chi^2$) distributed, with degrees of freedom $q$ equal to the difference in dimensionality between the full parameter space and the constrained set.

Wilks' theorem implies that for a great variety of hypotheses, we can calculate the likelihood ratio $\lambda$ for the data and then compare the observed $\lambda_{LR}$ to the $\chi^2$ critical value corresponding to the desired significance level $\alpha$.

$$\lambda_{LR} = -2[l(\theta_0)-l(\hat{\theta})] \stackrel{H_0}{\sim}\chi_q^2$$
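Wilks' theorem can be checked informally by simulation. The sketch below is an illustration under assumptions of our own choosing: exponential data with mean $\theta$, $H_0: \theta = 2$, sample size 50, and 5000 replications with a fixed seed. For this model the statistic has the closed form $2n[\bar{x}/\theta_0 - 1 - \ln(\bar{x}/\theta_0)]$, and $q = 1$ because the full space has one free parameter and the null has none.

```python
import math
import random
from statistics import NormalDist


def exp_lr_statistic(xs, theta0):
    """-2 * [l(theta0) - l(theta_hat)] for i.i.d. exponential data with mean theta."""
    n = len(xs)
    ratio = (sum(xs) / n) / theta0   # theta_hat / theta0, theta_hat is the sample mean
    return 2.0 * n * (ratio - 1.0 - math.log(ratio))


# The 95% quantile of chi-squared(1) is the squared 97.5% standard normal quantile.
critical_value = NormalDist().inv_cdf(0.975) ** 2

rng = random.Random(0)               # fixed seed so the sketch is reproducible
theta0, n, reps = 2.0, 50, 5000
rejections = 0
for _ in range(reps):
    # Data are generated under H0; expovariate takes the rate 1 / theta0.
    sample = [rng.expovariate(1.0 / theta0) for _ in range(n)]
    if exp_lr_statistic(sample, theta0) > critical_value:
        rejections += 1

rejection_rate = rejections / reps
print(f"critical value = {critical_value:.3f}, rejection rate = {rejection_rate:.3f}")
```

If the chi-squared approximation is adequate at this sample size, the rejection rate under $H_0$ should be close to the nominal 5%.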

## LRT in Linear Regression

In linear regression, a likelihood-ratio test compares how well two nested regression models fit the data and determines whether one fits significantly better.

A nested model is a regression model that contains a subset of the predictor variables in another regression model. We can think of the maximized likelihood of the nested model as $l_0$ and that of the full model as $l$.

Suppose we have a regression model with four predictor variables: 

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \epsilon$ 

A nested model of the regression model above, which includes only some of the original predictor variables, is:

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$

Our goal is to determine whether these two models fit the data significantly differently. We can perform a likelihood-ratio test which uses the following hypotheses:

$H_0$: The full model and the nested model fit the data equally well. Thus, we choose the nested model, as it contains fewer parameters and is easier to compute.

$H_1$: The full model fits the data significantly better than the nested model. Thus, we should use the full model, because the nested model cannot adequately represent the data.

We compare the p-value of the test to the chosen significance level $\alpha$: if the p-value is smaller, we reject $H_0$ and conclude that the full model offers a significantly better fit than the nested model.

\begin{example}

Using the `mtcars` dataset, we ask whether the nested regression model can substitute for the full regression model.

The significance level is $\alpha = 0.05$.

Two models are given below:

Full model: $mpg = \beta_0 + \beta_1disp + \beta_2carb + \beta_3hp + \beta_4cyl$

Reduced model: $mpg = \beta_0 + \beta_1disp + \beta_2carb$

$H_0$: There is no significant difference. We can use the nested regression model in place of the full regression model.

$H_1$: There is a significant difference. We cannot use the nested regression model in place of the full regression model.

\end{example}

    import pandas as pd
    import scipy.stats
    import statsmodels.api as sm

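    # load the mtcars dataset from a local CSV file
    data = pd.read_csv("mtcars.csv")
    data.head(10)

| | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| 5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| 6 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
| 7 | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 |
| 8 | Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 |
| 9 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |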


    # Full model
    # define response variable
    Y_full = data['mpg']

    # define predictor variables
    X_full = data[['disp', 'carb', 'hp', 'cyl']]

    # add constant to predictor variables
    X_full = sm.add_constant(X_full)

    # fit regression model
    full_model = sm.OLS(Y_full, X_full).fit()

    # log-likelihood of the fitted model
    full_ll = full_model.llf
    full_ll

Output: `-77.55789711787898`

    # Nested (reduced) model
    # define response variable
    Y_nested = data['mpg']

    # define predictor variables
    X_nested = data[['disp', 'carb']]

    # add constant to predictor variables
    X_nested = sm.add_constant(X_nested)

    # fit regression model
    reduced_model = sm.OLS(Y_nested, X_nested).fit()

    # log-likelihood of the fitted model
    reduced_ll = reduced_model.llf
    reduced_ll

Output: `-78.60301334355185`

    # likelihood-ratio chi-squared test statistic
    LR_statistic = -2*(reduced_ll-full_ll)

    print("likelihood-ratio test statistic is", LR_statistic)

    # p-value with 2 degrees of freedom: the nested model drops two coefficients (hp and cyl)
    p_val = scipy.stats.chi2.sf(LR_statistic, 2)

    print("p-value is", p_val)

Output:

    likelihood-ratio test statistic is 2.0902324513457415
    p-value is 0.35165094613502257


Because the calculated p-value (0.3517) is greater than the significance level $\alpha = 0.05$, we do not reject the null hypothesis $H_0$.

Therefore, we can use the nested regression model in place of the full regression model.
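The reported numbers can be double-checked without the data file. The sketch below takes the two log-likelihoods printed above as given and recomputes the statistic and p-value with the standard library. It relies on the fact that the $\chi^2$ survival function with 2 degrees of freedom is $e^{-x/2}$, so it applies only to comparisons with exactly two dropped coefficients. The function name `lr_test_2df` is ours.

```python
import math


def lr_test_2df(loglik_reduced, loglik_full):
    """LR statistic and chi-squared(2) p-value for two nested models."""
    statistic = -2.0 * (loglik_reduced - loglik_full)
    p_value = math.exp(-statistic / 2.0)   # chi-squared(2) survival function
    return statistic, p_value


# Log-likelihoods as printed by the fitted models above.
stat, p_value = lr_test_2df(-78.60301334355185, -77.55789711787898)
print(f"LR statistic = {stat:.4f}, p-value = {p_value:.4f}")
```

As a side note, with 2 degrees of freedom the p-value $e^{-\lambda_{LR}/2}$ equals the likelihood ratio $\frac{l_0}{l}$ itself. To our knowledge, fitted statsmodels OLS results also offer a `compare_lr_test` method that performs this comparison directly, which could serve as a further cross-check.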

## References

### References for Wald Test Part:
- https://web-s-ebscohost-com.ezproxy.library.wisc.edu/ehost/ebookviewer/ebook/bmxlYmtfXzE4NTc0OF9fQU41?sid=9645d4ef-3afc-42db-8ee2-fa5a7476607b@redis&vid=0&format=EB&lpid=lp_387&rid=0
- https://data.princeton.edu/wws509/notes/c2s3
- https://www.statisticshowto.com/wald-test/
- https://web.stanford.edu/~rjohari/teaching/notes/226_lecture15_inference.pdf

### Bibliographical Notes for Wald Test Part:
The Wald Test portion of this lecture is largely based on lecture notes from "Generalized Linear Models" by Germán Rodríguez at Princeton, lecture notes from "MS&E 226" by Ramesh Johari at Stanford, and Chapter 14.1 of the book "Biostatistics" by Ronald N. Forthofer, Eun Sul Lee and Mike Hernandez.

### References for Likelihood-ratio Test Part:
- https://en.wikipedia.org/wiki/Likelihood-ratio_test
- https://www.probabilitycourse.com/chapter8/8_4_5_likelihood_ratio_tests.php
- http://people.missouristate.edu/songfengzheng/Teaching/MTH541/Lecture%20notes/LRT.pdf
- https://www.statology.org/likelihood-ratio-test-in-python/

### Bibliographical Notes for Likelihood-ratio Test Part:
The likelihood-ratio test part, written as a final project for COMP SCI 639, is largely based on "Likelihood-ratio test" from Wikipedia, Chapter 8.4.5 in "Introduction to Probability, Statistics, and Random Processes" by Hossein Pishro-Nik, lecture notes from "Math 541: Statistical Theory II" by Songfeng Zheng, and "How to Perform a Likelihood Ratio Test in Python" by Zach.