Research typically involves measuring one or more variables for a sample and computing descriptive statistics for that sample. In general, however, the researcher's goal is _not to draw conclusions about that sample but to draw conclusions about the population_ that the sample was selected from. Thus researchers must use sample statistics to draw conclusions about the corresponding values in the population.

Imagine, for example, that a researcher measures the number of depressive symptoms exhibited by each of 50 clinically depressed adults and computes the mean number of symptoms. The researcher probably wants to use this sample statistic (the mean number of symptoms for the sample) to draw conclusions about the corresponding population parameter (the mean number of symptoms for clinically depressed adults).


There is a certain amount of random variability in any statistic from sample to sample. The mean number of depressive symptoms might be 8.73 in one sample of clinically depressed adults, 6.45 in a second sample, and 9.44 in a third—even though these samples are selected randomly from the same population. Similarly, the correlation between two variables might be +.24 in one sample, −.04 in a second sample, and +.15 in a third—again, even though these samples are selected randomly from the same population. This random variability in a statistic from sample to sample is called **sampling error**. 
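Sampling error can be illustrated with a short simulation. The sketch below assumes a hypothetical population of symptom counts (normally distributed with mean 8 and standard deviation 3, values chosen only for illustration) and draws three random samples of 50 from it.

```r
# Hypothetical population: the mean (8) and SD (3) are assumed values,
# chosen only to illustrate sampling error
set.seed(1)
population <- rnorm(100000, mean = 8, sd = 3)

# Draw three random samples of 50 and compute the mean of each
sample_means <- replicate(3, mean(sample(population, size = 50)))
round(sample_means, 2)

# The population mean that every sample is trying to estimate
mean(population)
```

The three sample means differ from one another and from the population mean, even though every sample comes from the same population.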


Any statistical relationship in a sample can be interpreted in two ways:

- There is a relationship in the population, and the relationship in the sample reflects this.

- There is no relationship in the population, and the relationship in the sample reflects only sampling error.


The purpose of null hypothesis testing is simply to help researchers decide between these two interpretations.


## The Logic of Null Hypothesis Testing

Null hypothesis testing is a formal approach to deciding between two interpretations of a statistical relationship in a sample. One interpretation is called the **null hypothesis** (often symbolized $H_0$ and read as _H-naught_). This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship "occurred by chance". The other interpretation is called the **alternative hypothesis** (often symbolized as $H_1$ or $H_a$). This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.

General steps:

- Assume for the moment that the null hypothesis is true. There is no relationship between the variables in the population.

- Determine how likely the sample relationship would be if the null hypothesis were true.

- If the sample relationship would be *extremely unlikely*, then reject the null hypothesis in favor of the alternative hypothesis. If it would not be extremely unlikely, then retain the null hypothesis.


A crucial step in null hypothesis testing is finding the **probability** of obtaining a sample result at least as extreme as the one observed if the null hypothesis were true. This probability is called the **p value**.


- A low p value means that the sample result would be unlikely if the null hypothesis were true and leads to the rejection of the null hypothesis. 

- A high p value means that the sample result would be likely if the null hypothesis were true and leads to the retention of the null hypothesis. 

But how low must the p value be before the sample result is considered unlikely enough to reject the null hypothesis?

In null hypothesis testing, this criterion is called $\alpha$ (alpha) and is conventionally set to .05. If there is less than a 5% chance of a result at least as extreme as the sample result if the null hypothesis were true, then the null hypothesis is rejected.

When this happens, the result is said to be **statistically significant**. If there is greater than a 5% chance of a result as extreme as the sample result when the null hypothesis is true, then the null hypothesis is retained. 

This does not necessarily mean that the researcher accepts the null hypothesis as true, only that there is not currently _enough evidence_ to reject it. Researchers often use the expression _"fail to reject the null hypothesis"_ rather than _"retain the null hypothesis."_
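The three steps can be sketched with a permutation test, used here as one simple way to approximate a p value by simulation (it is not the t-test used in the examples later on). The two groups of scores are hypothetical values invented for illustration.

```r
set.seed(123)

# Hypothetical scores for two groups (illustrative values, not real data)
group_a <- c(12, 15, 11, 18, 14, 16, 13, 17)
group_b <- c(10, 9, 13, 8, 12, 11, 7, 10)
observed_diff <- mean(group_a) - mean(group_b)

# Step 1: assume H0 is true, i.e. group membership is unrelated to the score.
# Step 2: under H0 the labels are arbitrary, so shuffle them many times and
#         record the difference in means that arises by chance alone.
scores <- c(group_a, group_b)
n_a <- length(group_a)
null_diffs <- replicate(10000, {
  shuffled <- sample(scores)
  mean(shuffled[1:n_a]) - mean(shuffled[-(1:n_a)])
})

# p value: share of shuffled differences at least as extreme as the observed one
p_value <- mean(abs(null_diffs) >= abs(observed_diff))

# Step 3: compare the p value with alpha
alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
p_value
decision
```

Because the p value is estimated from random shuffles, its exact value changes slightly with the seed and the number of shuffles.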



## The Misunderstood p Value

A misguided researcher might say that because the p value is .02, there is only a 2% chance that the result is due to chance and a 98% chance that it reflects a real relationship in the population. But this is _incorrect_. 

The p value is really the probability of a result _at least as extreme_ as the sample result _if_ the _null hypothesis were true_. So a p value of .02 means that if the null hypothesis were true, a sample result this extreme would occur only 2% of the time.

You can avoid this misunderstanding by remembering that the p value is _not the probability that any particular hypothesis is true or false_. Instead, it is the probability of obtaining a result at least as extreme as the sample result if the null hypothesis were true.


Recall that null hypothesis testing involves answering the question, _"If the null hypothesis were true, what is the probability of a sample result as extreme as this one?"_ 
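A small simulation may make this concrete. The sketch below assumes a setting in which the null hypothesis is true by construction: in every simulated study, both groups are drawn from the same normal population. The group size (20) and number of studies (5000) are arbitrary choices.

```r
set.seed(2024)

# Simulate 5000 studies in which H0 is true: both groups come from
# the same population, so any sample difference is pure sampling error
p_values <- replicate(5000, {
  g1 <- rnorm(20)
  g2 <- rnorm(20)
  t.test(g1, g2)$p.value
})

# Share of studies reaching p < .05 even though H0 is true in all of them
false_positive_rate <- mean(p_values < 0.05)
false_positive_rate
```

Roughly 5% of these studies yield p < .05, yet the null hypothesis is true in all of them. A small p value therefore describes how unusual the data are under the null hypothesis, not how probable the null hypothesis is.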



## Some Examples


### Example 1 (one sample t-test)

An outbreak of Salmonella-related illness was attributed to ice cream produced at a certain factory. Scientists measured the level of Salmonella in 9 randomly sampled
batches of ice cream. The levels (in MPN/g - Most Probable Number per Gram) were:


```r
# MPN/g
x <- c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793,
       0.519, 0.392, 0.418)
```


Is there evidence that the mean level of Salmonella in the ice cream is greater than 0.3 MPN/g?


Let $\mu$ be the mean level of Salmonella in all batches of ice cream. Here the hypothesis of interest can be expressed as:

$$
H_0 : \mu = 0.3 \\
H_a : \mu > 0.3
$$

Hence, we will need to include the options `alternative="greater", mu=0.3` to use the function `t.test` in R. 


```r
t.test(x, alternative="greater", mu=0.3)
```



From the output we see that the p-value = 0.029. Hence, there is moderately strong evidence that the mean Salmonella level in the ice cream is above 0.3 MPN/g.
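To see where this p value comes from, the test statistic $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$ can be computed by hand and compared with a t distribution with $n - 1$ degrees of freedom. This is a minimal sketch; the variable names `mu0`, `t_stat` and `p_value` are introduced here for illustration.

```r
# Salmonella levels (MPN/g) from the example
x <- c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793,
       0.519, 0.392, 0.418)

mu0 <- 0.3                                     # value of the mean under H0
n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))  # one-sample t statistic

# Upper-tail probability, because the alternative is mu > 0.3
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
c(t = t_stat, p = p_value)

# Check against the built-in test
all.equal(p_value, t.test(x, alternative = "greater", mu = 0.3)$p.value)
```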


### Example 2 (Two-sample t-tests)

Six subjects were given a drug (treatment group) and an additional six subjects a placebo (control group). Their reaction times to a stimulus were measured (in ms). We
want to perform a two-sample t-test to compare the means of the treatment and control groups.

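Let $\mu_1$ be the mean reaction time of the untreated (control) population and $\mu_2$ the mean of the population taking the drug. The alternative of interest is that reaction times are longer under the drug, so the hypotheses can be expressed as:

$$
H_0 : \mu_1 - \mu_2 = 0  \\
H_a : \mu_1 - \mu_2 < 0
$$


Here we pass the data for the control group as the first argument (`x`) and the data for the treatment group as the second argument (`y`), so that `t.test` tests the difference $\mu_1 - \mu_2$. We also need the option `alternative="less"`. The null difference `mu=0` is the default and can be omitted.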


Finally, we need to decide whether or not the standard deviations are the same in both groups.

Below is the relevant R-code when _assuming equal standard deviation_:


```r
# 6 measurements each in the control and treatment groups
Control <- c(91, 87, 99, 77, 88, 91)
Treat   <- c(101, 110, 103, 93, 99, 104)
# perform two-sample t-test
t.test(Control, Treat, alternative="less", var.equal=TRUE)
```


Below is the relevant R-code when _not assuming equal standard deviation_:



```r
t.test(Control, Treat, alternative = "less")
```



Here the pooled t-test and the Welch t-test give roughly the same results (p-value = 0.00313 and 0.00339, respectively).
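The difference between the two tests can be made explicit by computing both by hand. The sketch below uses the standard pooled-variance formula and the Welch-Satterthwaite approximation for the degrees of freedom. The names `sp2`, `se2` and `df_welch` are introduced here for illustration.

```r
Control <- c(91, 87, 99, 77, 88, 91)
Treat   <- c(101, 110, 103, 93, 99, 104)
n1 <- length(Control)
n2 <- length(Treat)
diff_means <- mean(Control) - mean(Treat)

# Pooled test: a single variance estimate shared by both groups
sp2 <- ((n1 - 1) * var(Control) + (n2 - 1) * var(Treat)) / (n1 + n2 - 2)
t_pooled <- diff_means / sqrt(sp2 * (1 / n1 + 1 / n2))
p_pooled <- pt(t_pooled, df = n1 + n2 - 2)   # lower tail: alternative = "less"

# Welch test: separate variances and approximate degrees of freedom
se2 <- var(Control) / n1 + var(Treat) / n2
t_welch <- diff_means / sqrt(se2)
df_welch <- se2^2 /
  ((var(Control) / n1)^2 / (n1 - 1) + (var(Treat) / n2)^2 / (n2 - 1))
p_welch <- pt(t_welch, df = df_welch)

c(p_pooled = p_pooled, p_welch = p_welch, df_welch = df_welch)
```

Because the two groups have the same size, the two t statistics coincide. Only the degrees of freedom differ, which is why the p values are so close.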


### Example 3 (Paired t-tests)

There are many experimental settings where each subject in the study is in both the treatment and control group. For example, in a matched pairs design, subjects are matched in pairs and different treatments are given to each subject in the pair. 
The outcomes are thereafter compared pair-wise. Alternatively, one can measure each subject twice, before and after a treatment. In either of these situations we cannot use two-sample t-tests since the independence assumption is not valid. Instead we need to
use a _paired_ t-test. This can be done using the option `paired = TRUE`.



A study was performed to test whether cars get better mileage on premium gas than on regular gas. Each of 10 cars was first filled with either regular or premium gas,
decided by a coin toss, and the mileage for that tank was recorded. The mileage was recorded again for the same cars using the other kind of gasoline. We use a paired t-test
to determine whether cars get significantly better mileage with premium gas.


```r
# Regular gas
reg <-  c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
# Premium gas
prem <- c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
```


Perform the paired t-test:

```r
t.test(prem, reg, alternative = "greater", paired = TRUE)
```



The results show that the t-statistic is equal to 4.47 and the p-value is approximately 0.0008. Since the p-value is very low, we reject the null hypothesis. There is strong evidence of a
_mean increase_ in gas mileage when using premium rather than regular gasoline.
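A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below checks this on the mileage data; the names `d`, `paired` and `one_sample` are introduced here for illustration.

```r
reg  <- c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
prem <- c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)

# Within-car differences: premium minus regular
d <- prem - reg

paired     <- t.test(prem, reg, alternative = "greater", paired = TRUE)
one_sample <- t.test(d, alternative = "greater", mu = 0)

# Both approaches give the same t statistic and p value
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)
```

Working with the differences removes the car-to-car variability, which is why the paired analysis is appropriate when the same cars are measured under both conditions.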





#### References

1. <https://opentextbc.ca/researchmethods/chapter/understanding-null-hypothesis-testing/>

2. De Veaux, Velleman and Bock, _"Stats: Data and Models"_, 2nd Edition. 

3. M.J. Crawley, _"Statistics: An Introduction Using R"_
