# Statistical Hypothesis Testing 

We use hypothesis testing (or significance testing) to decide whether the data provide enough evidence to reject one of two competing hypotheses about the distribution of the data. We have seen an example of it before; here, we will look into the details of the testing procedure and a variety of tests to be used depending on the assumptions and properties of the data sets we analyze. 

The general testing procedure has the following steps: 

### 1. State the null and alternative hypotheses

The null hypothesis represents the case that there is no statistically significant difference between the observed samples, or between a sample and some distribution; any observed difference is due to sampling variance or some error. We would like to be able to **reject the null hypothesis** to show that there is evidence for the alternative hypothesis. 

We have to make sure that the null hypothesis ($H_0$) and the alternative hypothesis ($H_1$ or $H_a$) are mutually exclusive and the $H_0$ represents the case of no effect or no change. 

**Example**: Suppose that we analyze the effect of gender on the average height in the population. Our hypothesis is that the average height of males is larger than the average height of females. Our null hypothesis should state that there is no significant difference between the average heights of males and females. To put it in statistical terms, we write:

$$H_0 : \mu_m = \mu_f$$
$$H_a : \mu_m > \mu_f$$


Here, $\mu_m$ is the mean height of males and $\mu_f$ is the mean height of females in the population. Our alternative hypothesis is what we want to show, and the null hypothesis states that there is no difference. 

**Note that** our $H_a$ here calls for a **one-tailed** test; we will discuss that later. 

### 2. Choose a significance level 

The significance level (also known as the alpha level) determines the probability threshold for rejecting the null hypothesis. This is the value that we compare the **p-value** against. As we will see, if the p-value is smaller than this level, we reject the null hypothesis and conclude that the effect or change is **statistically significant**. It is customary to use an alpha level of 0.05 or 0.01. 

### 3. Choose and compute a test statistic 

Remember the difference between a parameter and a statistic: the mean and standard deviation of a **population** are parameters; the mean and standard deviation of a **sample** are statistics, and we want to use statistics to make inferences about the population. 


A test statistic is a random variable that is calculated from the sample data and used in the hypothesis test to determine whether to reject the null hypothesis. The choice of the test statistic depends on the properties of the data, the assumptions we make about the distribution, and the sample size, as we will see later. 

**Example**: Z-statistic or Z-test can be used as a test statistic if the distribution of the test statistic under the null hypothesis follows a normal distribution. Z-test tests the **mean** ($\overline{\mu}$) of a sample. 

In a **one-sample** z-test, we compare the mean of a sample to the mean of a population to see if the sample comes from that population. The null hypothesis is that the means are equal; the alternative hypothesis is that the means are not equal (two-tailed test). Formally we write $H_0: \overline{\mu}-\mu = 0$ or $H_0: \overline{\mu} = \mu$ and $H_1: \overline{\mu} \neq \mu$.

Recall the [**sampling distributions**](https://www.statcrunch.com/applets/type3&samplingdist): the mean of a random sample from a population is a random variable. If we pick different samples from the population, their means will approximately follow a normal distribution for a large enough sample size (the central limit theorem), and **the mean of the means** will converge to the population mean. **The standard deviation of the means** (also called the standard error) will be $\frac{\sigma}{\sqrt{n}}$ where $\sigma$ is the population's standard deviation and $n$ is the size of the sample. So the z-statistic becomes: 

$$ Z = \frac{\overline{\mu} - \mu}{\frac{\sigma}{\sqrt{n}}} $$

What we are doing here is finding the Z value (Z score, standard score) so that we can compute the p-value based on the standard normal distribution $N(0,1)$, which has a mean of zero and a standard deviation of one. 

<img src="../images/snorm.gif">

**The area under the curve** in the standard normal distribution is 1; we can compute the probability of a z value being less than a particular threshold as the $Pr(z<Z_0)$ by **computing the area under the curve to the left of $Z_0$**. As we know from standard deviation rule, about **68%** of the data falls within **one** standard deviation of the mean, so the area between -1 and 1 in the **standard** normal distribution is about 0.68. 


<img src="../images/std3.png" width="500">

We can compute this using the `pnorm()` function. It will return the area to the left of the z value, so we should find the difference: 

In [None]:
# area to the left of Z=1 minus area to the left of Z=-1

pnorm(1)-pnorm(-1)

We can compute the tail area under the curve beyond our $z$ value and compare it to the **significance level alpha** to decide whether to reject the null hypothesis. 

**If we have a two-tailed test**, we are looking at the extreme values on both tails of the curve, so we should divide alpha by two. **If we have a one-tailed test**, depending on the $H_1$, we are either looking at the right tail or the left tail. 

<img src="../images/alpha_z.png">

**The critical value** for the $z$ is the $Z$ value that corresponds to an alpha level of 0.05 (or 0.01 depending on your choice). 

For a **one-tailed test**, we have to pick $-Z_\alpha$ or $Z_\alpha$ depending on our $H_1$ and compare whether to reject the null hypothesis. 

The critical values corresponding to a significance level of 0.05 are 1.645 and -1.645. See below how the area under the curve corresponds to 0.05 for those z values. 

In [None]:

# lower tail: area under the curve on the LEFT of the critical Z value 
pnorm(-1.645)

# upper tail: area under the curve on the RIGHT of the critical Z value 
1-pnorm(1.645)
#or 
pnorm(1.645, lower.tail=FALSE)


**We can use `qnorm` to get the critical z values for a given area under the curve:**

In [None]:
qnorm(0.05)
qnorm(0.05, lower.tail=FALSE)


If our alternative hypothesis is $H_1: \mu \lt \mu_0$ then the rejection region is $z \lt - 1.645$; 
if our alternative hypothesis is $H_1: \mu \gt \mu_0$ then the rejection region is $z \gt  1.645$.


For a **two-tailed test**, we need to find the critical **Z values on both ends** that each cut off a tail area of $\frac{\alpha}{2}=0.025$ (for $\alpha=0.05$). If our $z$ value is greater than $Z_{\frac{\alpha}{2}}$ **or** less than $-Z_{\frac{\alpha}{2}}$, we can reject the null hypothesis; otherwise, we **fail to reject** the null hypothesis. 

The corresponding z values: 

In [None]:
# lower tail (negative critical value)
qnorm(0.025)

# upper tail (positive critical value)
qnorm(0.025, lower.tail=FALSE)


In this case, our alternative hypothesis is $H_1: \mu \ne \mu_0$ and the rejection region is $|z| \gt 1.96$; we are looking at the absolute value of z. 
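
The steps above can be put together into a single one-sample, two-tailed z-test. The following is a minimal sketch; the numbers (sample mean 52, hypothesized population mean 50, population standard deviation 6, sample size 36) are hypothetical and chosen only for illustration.

```r
# hypothetical inputs (not from a real data set)
mubar <- 52    # sample mean
mu    <- 50    # population mean under H0
sigma <- 6     # known population standard deviation
n     <- 36    # sample size
alpha <- 0.05  # significance level

# z-statistic: distance of the sample mean from mu in standard-error units
z <- (mubar - mu) / (sigma / sqrt(n))

# two-tailed critical value and p-value
z_crit  <- qnorm(alpha / 2, lower.tail = FALSE)
p_value <- 2 * pnorm(abs(z), lower.tail = FALSE)

z
z_crit
p_value

# reject H0 when |z| exceeds the critical value (equivalently, p_value < alpha)
abs(z) > z_crit
```

Comparing `abs(z)` with the critical value and comparing `p_value` with `alpha` always lead to the same decision; they are two views of the same tail area.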

### 4. Interpret the p-value 

We can either find the critical z values as above, or find the **p-value** that corresponds to our z value by using a [**Z table**](http://www.z-table.com/). **If the p-value is smaller than our significance level alpha, we reject the null hypothesis.** Note that for the positive z values, we have to subtract the table entry from 1 to find the area on the **right** of the z value. 

If the p-value is NOT smaller than the significance level alpha, we **fail to reject** the null hypothesis. This does **NOT** mean the null hypothesis is true; it simply means that there is not enough evidence to claim that the effect or change is **statistically significant**. 

The p-value given for the Z-statistic should be interpreted as how likely it is that a result as extreme as, or more extreme than, the one observed would have occurred under the null hypothesis. Practically, the p-value is the fraction of the time that we would expect to see such an extreme value of the test statistic if we repeated the experiment many times, given that $H_0$ holds.
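
This "repeated experiments" reading can be checked with a small simulation. The sketch below assumes a hypothetical population (mean 50, standard deviation 6), draws many samples of size 36 with $H_0$ true, and counts how often the z-statistic is at least as extreme as a hypothetical observed value of 2.

```r
set.seed(1)   # for reproducibility

mu    <- 50   # hypothetical population mean (H0 is true in the simulation)
sigma <- 6    # hypothetical population standard deviation
n     <- 36   # sample size
z_obs <- 2    # hypothetical observed z-statistic

# simulate 20,000 experiments in which H0 holds
z_sim <- replicate(20000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sigma / sqrt(n))
})

# fraction of simulated statistics at least as extreme as z_obs (two-tailed)
mean(abs(z_sim) >= z_obs)

# theoretical two-tailed p-value for comparison
2 * pnorm(z_obs, lower.tail = FALSE)
```

The two numbers should be close to each other; the simulated fraction fluctuates slightly with the seed and the number of repetitions.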


**There are some misconceptions about interpreting the p-value:**


- **The p-value is not the probability that the null hypothesis is true**, or the probability that the alternative hypothesis is false. The p-value is the probability of obtaining an effect that is at least as extreme as the observed effect, given that the null hypothesis is true. This is NOT the same as the probability that the null hypothesis is true given the observed effect.


 - **The p-value is not the probability that the observed effects were produced by random chance alone**. The p-value is computed under the assumption that the null hypothesis is true. This means that the p-value is a statement about the relation of the data to that hypothesis.
 

 - **The p-value does not indicate the size or importance of the observed effect**. A small p-value can be observed for an effect that is NOT meaningful or important. In fact, the larger the sample size, the smaller the minimum effect needed to produce a statistically significant p-value.
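
The last point can be illustrated numerically. In this sketch, a fixed and practically negligible difference of 0.1 units (hypothetical, with a hypothetical population standard deviation of 5) is tested with increasing sample sizes; only the sample size changes.

```r
effect <- 0.1                      # hypothetical difference between sample and population mean
sigma  <- 5                        # hypothetical known population standard deviation
n      <- c(100, 10000, 1000000)   # increasing sample sizes

# z-statistic and two-tailed p-value for each sample size
z       <- effect / (sigma / sqrt(n))
p_value <- 2 * pnorm(z, lower.tail = FALSE)

data.frame(n, z, p_value)
```

The same tiny effect is far from significant at n = 100 but becomes significant as n grows, which is why a p-value should be reported together with the effect size.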

---


**In the special case of $n=1$, we want to know the chances of a single measurement coming from the population distribution.**

**EXAMPLE**: In a class of 50 students, the mean of a test score is 85 and the standard deviation is 5. What is the chance that a randomly picked student will have a score greater than 92? 

Here, we take the class as the population and our sample is of size 1 (n=1), and we want to get a probability for the **upper tail** of the distribution. 

In [None]:
pnorm(92, mean=85, sd=5, lower.tail=FALSE)

We simply found the area under the standard normal distribution to the **right** of the z value that corresponds to a score of 92. 

In [None]:
Z = (92-85)/(5/sqrt(1))
Z

In [None]:
pnorm(1.4, lower.tail=FALSE)

There is about a 0.08 probability that the picked student will have a score higher than 92. If the scores are approximately normally distributed, it also means that about 8% of the students scored higher than 92. 

---

## Test Types 



The z-test above is called a **one-sample Z-test**. We will see different types of tests depending on the data and the assumptions. 


The assumptions we have to make to run a Z-test:

 - The data is continuous (not categorical) 
 
 - The sample is a simple random sample from its population (each observation has an equal probability of being selected). 
 
 - The data follows normal probability distribution. 
 
 - Sample size is large (practically n>30). 
 
 - We know the population standard deviation. 
 
If these assumptions do not hold, we will have to use other statistical tests. The testing procedure and the test statistics depend on whether:

- we have dependent or independent samples

- we have one sample or two samples 

- we have large enough sample size. 

- we can assume normality

- we know population parameters (parametric or non-parametric) 

- we have continuous or categorical variables 

We will look at the most common test types and discuss the issues above. 


---

### How are dependent and independent samples different?

Dependent samples are paired measurements for one set of items. Independent samples are measurements made on two different sets of items.

When you conduct a hypothesis test using two random samples, you must choose the type of test based on whether the samples are dependent or independent. Therefore, it's important to know whether your samples are dependent or independent:

 - If the values in one sample affect the values in the other sample, then the samples are dependent.

 - If the values in one sample reveal no information about those of the other sample, then the samples are independent.
 
 
 **Example:** 
Consider a drug company that wants to test the effectiveness of a new drug in reducing blood pressure. They could collect data in two ways:

 1. Sample the blood pressures of the **same** people **before and after** they receive a dose. The two samples are **dependent** because they are taken from the same people. The people with the highest blood pressure in the first sample will likely have the highest blood pressure in the second sample.
 
 
 
 2. Give **one group** of people an active drug and give a **different group** of people (control group) an inactive placebo, then compare the blood pressures between the groups. These two samples would likely be independent because the measurements are from different people. Knowing something about the distribution of values in the first sample doesn't inform you about the distribution of values in the second.
 
 
It may not always be easy to figure out the dependence. Imagine that you are asking married couples to rate happiness in marriage and want to know if wives have a significantly different rating than husbands. Even though you have **two separate groups**, the ratings will not be perfectly independent since the couples are in the same marriage. You **cannot** use a testing procedure meant for independent samples. 
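
In R, the distinction shows up as the `paired` argument of `t.test()`. The sketch below uses made-up blood pressure readings for eight people measured before and after a dose; the values are hypothetical and only illustrate how the two analyses can differ on the same numbers.

```r
# hypothetical systolic blood pressure for the same 8 people
before <- c(150, 142, 160, 155, 148, 165, 158, 152)
after  <- c(145, 140, 152, 150, 147, 158, 151, 149)

# dependent samples: test the within-person differences
paired_res <- t.test(before, after, paired = TRUE)

# the same numbers treated as two independent groups (not appropriate here)
indep_res <- t.test(before, after, paired = FALSE)

paired_res$p.value
indep_res$p.value
```

Because each person acts as their own control, the paired analysis removes the person-to-person variability; with these made-up numbers it gives a much smaller p-value than the independent-samples analysis.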
 
 
### One-sample vs Two-sample 

A two-sample test is used to compare the statistics of two independent samples. As discussed above, if we have two samples that do not overlap and the variable we want to test has no dependence between them, we can use two-sample tests. We use a one-sample test if we want to test whether a sample comes from a population, or whether a sample's statistic differs from some value. 

Considering the heights of males versus females, if our hypothesis is that the average male height is significantly different from the average female height, we are talking about two independent samples (two distinct groups, and the height variable has no dependence **between** the two groups). This is a case for two-sample testing. 

If our hypothesis is that average male height is significantly different than average height of **the population**, we have one sample coming from a population that includes **both** males and females. This is a case for one-sample testing.

**Example:**  $H_a$: The per capita income of West Virginia counties is significantly less than the national average. This alternative hypothesis calls for a one-sample test since the West Virginia counties are a sample from the whole US, so they also contribute to the national average. 


### Sample Size

If we have large sample size and we know the standard deviation of the population, we can use z-tests. If the sample size is small (typically n<30) or we do not know the standard deviation of the population, we use **t-test**. 

---

## T-test

**T-test**, also known as Student's t-test, is used for small samples and/or unknown standard deviation of the population, given that the data follows a normal distribution. 

Just like with the z-test, we can do a one-sample or two-sample t-test. For larger sample sizes, the t-distribution approaches the normal distribution. [See the comparison graphics here](https://en.wikipedia.org/wiki/Student%27s_t-distribution). 
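
The convergence of the t-distribution to the normal distribution can also be seen numerically. This short sketch compares two-tailed 5% critical values from `qt()` for a few arbitrarily chosen degrees of freedom with the corresponding value from `qnorm()`.

```r
df <- c(5, 30, 1000)   # arbitrarily chosen degrees of freedom

# two-tailed 5% critical values of the t-distribution for each df
t_crit <- qt(0.975, df = df)

# corresponding critical value of the standard normal distribution
z_crit <- qnorm(0.975)

data.frame(df, t_crit, z_crit)
```

The t critical values are always larger than the normal one (heavier tails), and the gap shrinks as the degrees of freedom grow.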

We can do a **paired t-test** on the differences of the matched pairs if they follow a normal distribution. Here is the [t-distribution table](https://www.tdistributiontable.com/) to look up the values. We can look up the critical values for the given degrees of freedom and a given alpha. 

**Example:** You want to figure out how much to charge for a product that was initially priced at 50 dollars. You survey 30 potential customers to find out their willingness to pay for the product; the average comes out to 55.70 dollars and the variance is 64.8. Does this mean you can charge more for the product? 

First, we set our hypotheses. Our null hypothesis is $H_0: \mu \le 50$ and alternative hypothesis is $H_1: \mu \gt 50$. 

Then, we choose a significance level: alpha=0.01

Our test will be a one-sample t-test: we don't know the population standard deviation (how much all our customers are willing to pay and the variability of it).

Here, we take 50 dollars as the hypothesized population mean. It is not the true population mean (which we do not know), but we want to test how much the sample deviates from this reference value. 

Our sample size is n=30. Using the standard deviation of the sample, our t-statistic is:

$$ t = \frac{\overline{\mu} - \mu}{S/\sqrt{n}}$$


$$ t = \frac{55.7 - 50}{\sqrt{64.8}/\sqrt{30}} = 3.878 $$

Degrees of freedom $df=n-1=29$. We need to look up the critical value for alpha=0.01 and df=29. Remember that our $H_1$ is one tailed. The critical value is 2.462 which is smaller than our $t$ so we can reject the null hypothesis. 
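
Instead of looking it up in a table, the critical value can be obtained with `qt()`, the t-distribution counterpart of `qnorm()`. A short sketch for this example (the variable name `t_crit` is only illustrative):

```r
# one-tailed (upper tail) critical value for alpha = 0.01 and df = 29
t_crit <- qt(0.01, df = 29, lower.tail = FALSE)
t_crit

# the t-statistic computed above (3.878) is compared against it
3.878 > t_crit
```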




In [None]:
s = sqrt(64.8)
mubar = 55.7
mu = 50 
n=30

t= (mubar-mu)/(s/sqrt(n))
t

Just like we used `pnorm` with the z-test, we can use `pt` with t-test to find out the area that corresponds to our t value. 

In [None]:
pt(t, df=29, lower.tail=FALSE)

The p-value is much smaller than alpha=0.01, so we can reject the null hypothesis and conclude that the population mean for the price exceeds 50 dollars. This p-value shows that there is very little chance that we would have observed this evidence if the population mean were 50 dollars or less. So we can increase the price of the product. 


---

## Tests for Categorical Variables 

If we have categorical variables, we cannot use the t-test any more; we have to use Fisher's Exact Test or the Chi-squared test. 

The Chi-Square Test of Independence determines whether there is an association between categorical variables (i.e., whether the variables are independent or related). It is a **non-parametric test**.

This test utilizes a contingency table to analyze the data. A contingency table (also known as a cross-tabulation, crosstab, or two-way table) is an arrangement in which data is classified according to two categorical variables. The categories for one variable appear in the rows, and the categories for the other variable appear in columns. Each variable must have two or more categories. Each cell reflects the total count of cases for a specific pair of categories.

If your categorical variables represent "before" and "after" observations, then the chi-square test of independence is NOT appropriate. This is because the assumption of the independence of observations is violated. In this situation, **McNemar's Test** is appropriate.

Assumptions: 

 - Two categorical variables.
 - Two or more categories (groups) for each variable.
 - Independence of observations.
 - There is no relationship between the subjects in each group.
 - The categorical variables are not "paired" in any way.
 - Relatively large sample size.
 - Expected frequencies for each cell are at least 1.
 - Expected frequencies should be at least 5 for the majority (80%) of the cells.


**In these tests, $H_0$ is always "the variables are independent" and $H_1$ is always "the variables are dependent".** 

---
**EXAMPLE:**
 
We will use the iris data set and will create some categories out of it. We will add the variable `size` which will have the values `small` and `big`. 
 


In [None]:
dat <- iris

dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
  "small", "big"
)

In [None]:
head(dat)

# create a two-way table 

table(dat$Species, dat$size)


Let's see if there is an association between `Species` and `size`. 

In [None]:
test <- chisq.test(table(dat$Species, dat$size))
test

The p-value shows that there is a significant association between species and size. 

To see how this test can be run manually, [take a look here](https://statsandr.com/blog/chi-square-test-of-independence-by-hand/). 
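
As a rough sketch of that manual computation in R: the expected count of each cell under independence is (row total × column total) / grand total, and the test statistic sums the squared differences between observed and expected counts, each divided by the expected count. The block rebuilds the same Species-by-size table from `iris` so that it runs on its own; the variable names are illustrative.

```r
# rebuild the Species-by-size table from the iris data
size     <- ifelse(iris$Sepal.Length < median(iris$Sepal.Length), "small", "big")
observed <- table(iris$Species, size)

# expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df     <- (nrow(observed) - 1) * (ncol(observed) - 1)

chi_sq
pchisq(chi_sq, df = df, lower.tail = FALSE)   # p-value
```

For a table larger than 2x2, `chisq.test()` applies no continuity correction, so the statistic computed here should match the one it reports.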


If a warning such as “Chi-squared approximation may be incorrect” appears, it means that the smallest expected frequency is lower than 5. To avoid this issue, you can either collect more observations or use Fisher's Exact Test. 



In [None]:
test.f <- fisher.test(table(dat$Species, dat$size))
test.f

---

**Another Example** 

Suppose that we want to determine whether there is a statistically significant association between smoking and being a professional athlete. Smoking can only be “yes” or “no” and being a professional athlete can only be “yes” or “no.” The two variables of interest are qualitative variables and we collected data on 14 persons.


In [None]:
dat2 <- data.frame(
  "smoke_no" = c(7, 0),
  "smoke_yes" = c(2, 5),
  row.names = c("Athlete", "Non-athlete"),
  stringsAsFactors = FALSE
)
colnames(dat2) <- c("Non-smoker", "Smoker")

dat2

We should use Fisher’s Exact Test if at least one cell in the table of expected frequencies is below 5. To retrieve the expected frequencies, we can use the `chisq.test()` function together with `$expected`:


In [None]:
chisq.test(dat2)$expected

The table of expected frequencies above confirms that we should use Fisher’s Exact Test instead of the Chi-square test because there is at least one cell below 5. Note that `dat2` is already a contingency table, so we can call `fisher.test()` directly. 

In [None]:
test.f2 <- fisher.test(dat2)
test.f2

---

### McNemar's Test 

McNemar's Test is used to determine if there are differences on a dichotomous
**dependent** variable between **two related** groups. A dichotomous variable is **a categorical
variable with two categories only**. It can be considered to be similar to the paired-samples
t-test, but for a dichotomous rather than a continuous dependent variable.

Assumptions: 

 -  You have one categorical dependent variable with two categories
      and one categorical independent variable with two related groups.
      
 -  The two groups of your dependent variable must be mutually exclusive. This means that no groups can overlap: a participant can only be in one of the two groups.
 
 - The cases (e.g., participants) are a random sample from the population of interest.

If your nominal variable has more than two categories, an extension of McNemar’s test called the McNemar-Bowker test might be appropriate. Fortunately, the R code used in the example below would be identical, as the `mcnemar.test` function can handle multiple categories in the nominal variable.


**EXAMPLE**

Suppose researchers want to know if a certain marketing video can change people’s opinion of a particular product. They survey 100 people to find out if they do or do not like the product. Then, they show all 100 people the marketing video and survey them again once the video is over.

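The following table shows the number of people who liked or disliked the product before and after viewing the video:

| | Before: Like | Before: Dislike |
|---|---|---|
| **After: Like** | 30 | 40 |
| **After: Dislike** | 12 | 18 |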

To determine if there was a statistically significant difference in the proportion of people who liked the product before and after viewing the video, we can perform McNemar’s Test.


In [None]:
#create data
data <- matrix(c(30, 12, 40, 18), nrow = 2,
    dimnames = list("After Video" = c("After_Like", "After_Dislike"),
                    "Before Video" = c("Before_Like", "Before_Dislike")))

#view data
data

In [None]:
mcnemar.test(data)

The p-value shows that there is a statistically significant difference in the proportion of people who liked the product before and after viewing the video. 
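
McNemar's statistic depends only on the discordant pairs, that is, the people who changed their opinion. As a sketch of what `mcnemar.test()` computes by default for a 2x2 table (with its continuity correction), using the two off-diagonal counts from the matrix in the example; the variable names are illustrative:

```r
# discordant pairs from the example matrix
like_gained <- 40   # disliked before, liked after
like_lost   <- 12   # liked before, disliked after

# McNemar's chi-squared statistic with continuity correction
stat    <- (abs(like_gained - like_lost) - 1)^2 / (like_gained + like_lost)
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

stat
p_value
```

The 30 people who liked the product both times and the 18 who disliked it both times do not enter the statistic at all.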