# NHST III: Statistical Significance
In the previous parts of this lesson, we have established the main mechanics of NHST. During those discussions, we made multiple references to calculating *probabilities* in order to reach conclusions. In this part of the lesson, we will finally formalise how this is achieved. To briefly review, we begin the process of inference using NHST by determining a null value for our parameter of interest and then comparing that value to the actual value we have calculated. This produces a *difference* called $\delta$. For all the reasons given previously, we then convert $\delta$ into a standardised value called a *test statistic*. This is often a $t$-statistic, which re-expresses $\delta$ in standard error units. Under the null hypothesis, the $t$-statistic has a $t$-distribution, whose width is governed by a parameter known as the *degrees of freedom*. This allows the distribution to vary its width in small samples to accommodate how the precision of the standard error estimate influences the range of probable $t$-values. So, we now have a test statistic that captures the deviation from the null in our dataset, as well as the distribution of this test statistic when the null is true. Our very last step is to use both of these pieces of information to calculate the *probability* of a value of $t$ at least as extreme as the one we observed, if the null hypothesis is true. This probability is perhaps one of the most controversial aspects of statistical inference, and is known as the $p$-value.
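As a rough sketch of the review above, the steps from $\delta$ to a $t$-statistic can be reproduced by hand in `R`. This assumes the `mtcars` regression used later in this lesson and a null value of 0 for the `wt` coefficient. The variable names are illustrative only.

```r
# Fit the example model used later in this lesson (mtcars ships with R)
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Estimate and standard error for the 'wt' coefficient
est <- coef(summary(mod))["wt", "Estimate"]
se  <- coef(summary(mod))["wt", "Std. Error"]

null_value <- 0                 # null value for the parameter (assumed to be 0 here)
delta      <- est - null_value  # difference between the estimate and the null value
t_stat     <- delta / se        # delta re-expressed in standard error units
df_resid   <- df.residual(mod)  # degrees of freedom of the null t-distribution

print(c(delta = delta, t = t_stat, df = df_resid))
```

The value of `t_stat` should agree with the `t value` column that `summary(mod)` reports for `wt`.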

## The $p$-value and Statistical Significance
It is fairly likely ($p < 0.05$) that you will have been exposed to the $p$-value already in your statistical education. If you come from a Psychology background, this is practically guaranteed as the history of scientific enquiry in the field of Psychology is largely the history of collecting $p$-values. As a method of trying to reach conclusions from a statistical model, the $p$-value is ubiquitous. This is largely because, on the face of it, the $p$-value provides an easily applicable and objective criterion for determining evidence for or against the null hypothesis. However, as we will see in the final part of this lesson, this is arguably not true. Nevertheless, the $p$-value endures as the metric most commonly employed in Experimental Psychology to determine whether we can reject the null hypothesis of no effect in the population. We will leave criticisms of this approach to one side for the moment and just focus on making sure the definition of the $p$-value is clear. This is critical, as the $p$-value is often misunderstood by students and sometimes (rather worryingly) by researchers as well.

### Fisher's Concept of the $p$-value
The $p$-value comes from Fisher's concept of NHST. As we will explore further below, Fisher's original conceptualisation of the $p$-value has been somewhat warped into our modern framework of NHST. So, for the moment, we will focus on Fisher's original concept, before seeing what has happened over the years.

Probabilities from continuous distributions are given by the area under the curve either *above* or *below* the value. So we are not calculating the probability of getting this *exact* value of $t$. Rather, it is the probability of this value of $t$ *or larger*.
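A minimal sketch of this tail-area idea uses `pt()`, the cumulative distribution function of the $t$-distribution in `R`. The inputs are taken from the `wt` row of the `summary()` output shown later in this lesson (a $t$-value of magnitude 3.565 on 28 degrees of freedom). Adding the two tails together is an assumption about how the two-sided value in the `Pr(>|t|)` column is formed.

```r
t_obs <- 3.565   # magnitude of the t-value reported for 'wt' later in this lesson
df    <- 28      # residual degrees of freedom from the same output

# Area under the t-distribution at or above t_obs (upper tail)
p_upper <- pt(t_obs, df = df, lower.tail = FALSE)

# Area at or below -t_obs (lower tail); equal to p_upper by symmetry
p_lower <- pt(-t_obs, df = df)

# Two-sided p-value: both tails together, as in the Pr(>|t|) column
p_two_sided <- p_upper + p_lower

print(c(upper = p_upper, lower = p_lower, two_sided = p_two_sided))
```

The probability of any single exact value of $t$ is zero for a continuous distribution, which is why the calculation is always framed in terms of a tail area.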

...

Fisher actually viewed $p$-values as continuous evidence against the null, rather than as a threshold value for declaring significance. In fact, although the criterion of 0.05 is attributed to Fisher, it is insightful to see what he *actually* said about it:

```{epigraph}
The value for which $p = 0.05$, or 1 in 20, is convenient to use in practice if we want to draw a line beyond which we say that the deviation is significant, but it must not be forgotten that we shall often draw such a line when there is nothing there but chance.

-- Ronald Fisher
```

So, Fisher viewed this criterion as a convenient heuristic, but fully acknowledged its limitations and did not ever state that we should *reject* the null hypothesis when $p < 0.05$. Indeed, Fisher strongly disagreed with the concept of *rejecting* or *accepting* the null. These concepts came from a framework for NHST developed by [Jerzy Neyman](https://en.wikipedia.org/wiki/Jerzy_Neyman) and [Egon Pearson](https://en.wikipedia.org/wiki/Egon_Pearson).


## Fisher vs Neyman-Pearson
The history of NHST is rather tangled and often misunderstood, largely because it is a hybrid of two *fundamentally different* statistical philosophies. On the one side was Fisher and on the other was Neyman-Pearson. Each side was severely opposed to the other's framework, with major disagreements on the fundamental purpose of statistical inference and how evidence can be used. These two approaches were never meant to be combined. Indeed, both Fisher and Neyman-Pearson would probably agree that they *cannot* be combined in any meaningful way, due to their philosophical disagreement. And yet, modern NHST *has* merged these different ideas into a confused mishmash that neither side would have endorsed.

### Fisher's Method

```{epigraph}
It is a mistake to speak of the null hypothesis as though it were a thing we are testing, or as though we were trying to accept or reject it.

-- Ronald Fisher
```

Indeed, in Fisher's conceptualisation, a small $p$-value simply means that the data are surprising under the null, whereas a large $p$-value means that the data are not surprising under the null. Smaller $p$-values present more evidence against the null, but only so far as we make the logical leap of saying that very surprising data make it less likely that the null is correct. However, we would not *reject* the null, because it still remains possible that the null is correct. All we are saying is that the data we have are less compatible with the null. Nothing more.

### Neyman-Pearson's Method
In opposition to Fisher's approach, Neyman-Pearson saw inference as an exercise in *decision making*. More specifically, they saw inference as the process of making decisions where the long-run frequency of errors is controlled. They fundamentally rejected the $p$-value as having any utility in this approach. Because a $p$-value is a *data-dependent* quantity, $P(\mathcal{D}|\mathcal{H}_{0})$, it flies in the face of Neyman-Pearson's philosophy of defining an error-control procedure that is divorced from the data itself. If we tie the $p$-value to a pre-specified $\alpha$ (such as $\alpha = 0.05$), then it can be used as a decision rule, but the *actual value of p does not matter*. At which point, we may as well just use the critical value of the test statistic. The $p$-value adds nothing to this procedure. Another way of thinking about this is that the $p$-value gives us a *post-hoc* summary of how surprising we find the data we have collected. Neyman-Pearson would claim that this is just a *description* of the data, rather than *inference*. For Neyman-Pearson, their error-control procedure must be fully defined prior to collecting the data. The only reason for calculating a $p$-value is to consider the gradation of evidence. However, Neyman-Pearson were not interested in this. For them, statistical inference is a yes/no question, no more.
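To illustrate why the $p$-value adds nothing to a pure decision rule, the sketch below compares a critical-value decision with a $p$-value decision. It assumes a two-sided test with $\alpha = 0.05$ and borrows the $t$-values and degrees of freedom from the example model later in this lesson. The variable names are illustrative only.

```r
alpha <- 0.05    # error rate fixed before seeing the data
df    <- 28      # degrees of freedom (from the lesson's example model)

# Two-sided critical value: reject when |t| exceeds this
t_crit <- qt(1 - alpha / 2, df = df)

# t-values copied from the summary() output later in this lesson
t_vals <- c(wt = -3.565, hp = -2.724, disp = -0.091)

# Neyman-Pearson style decision: only the side of the threshold matters
decision_crit <- abs(t_vals) > t_crit

# The same decision expressed through two-sided p-values
p_vals     <- 2 * pt(abs(t_vals), df = df, lower.tail = FALSE)
decision_p <- p_vals < alpha

print(t_crit)
print(data.frame(t = t_vals, reject_by_crit = decision_crit, reject_by_p = decision_p))
```

Both columns give the same yes/no answer for every coefficient, so under this framework the critical value alone is sufficient.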

### Modern NHST
Given the descriptions above, it is hopefully clear that modern use of NHST would probably horrify Fisher and Neyman-Pearson in equal measure. Fisher would hate to see his $p$-values being used as a decision making tool for rejecting hypotheses, rather than as a measure of evidence. Neyman-Pearson would hate to see statistical software reporting $p$-values, rather than simply stating whether a test passed the critical threshold for significance. Our modern approach to NHST bleeds decision making and thresholds into Fisher's continuous measures of evidence or, alternatively, bleeds continuous measures of evidence into Neyman-Pearson's strict decision making approach. We are trying to [have our cake, and eat it](https://en.wikipedia.org/wiki/You_can%27t_have_your_cake_and_eat_it).

To see more clearly why the modern mish-mash of these ideas is philosophically inconsistent, consider the situation where we calculate $p = 0.001$:

- If we simply *reject* the null because $p < 0.05$ then we are doing Neyman-Pearson NHST with $\alpha = 0.05$, but using Fisher's $p$-values to do so. There is nothing fundamentally wrong with this, as long as we are only using the $p$-value as a *binary indicator*. However, this raises the question of what the point of the $p$-value is. We could reach the same conclusion using the critical value of the test statistic, which is how Neyman-Pearson designed this approach. If we are not interpreting the $p$-value as a continuous metric of evidence, then we do not need it.
- If we interpret this as strong evidence against the null, because the $p$-value is quite small, then we are doing Fisher's NHST. We may use a heuristic of $p < 0.05$ as some marker of "significance", but this is not a binary rule to accept or reject anything. We simply say that the data are quite surprising, if the null hypothesis were true.
- If we mash these two approaches together by rejecting the null hypothesis because $p < 0.05$, but also (whether explicitly or implicitly) take the magnitude of the $p$-value as stronger evidence than another $p$-value in our output table (say $p = 0.048$), then we are doing an incoherent muddling of these two approaches.

Although it *feels* like we should be able to combine these methods, this causes problems. For instance, it feels like we should be able to pre-specify a level of error-control, reject the null based on this level and then use the magnitude of the $p$-value to indicate our confidence in that decision. However, what happens quite often is that researchers will treat values *close* to $p = 0.05$ as *marginally significant*, or move their criterion for significance based on the calculated $p$-values. If we start moving the goalposts based on the $p$-values, then the whole principle of fixing error rates before collecting the data goes out the window. Furthermore, treating $p$-values as *both* a binary indicator and a continuous measure of evidence is inconsistent. For instance, consider calculating $p = 0.06$. Neyman-Pearson would say that you have failed to reject the null hypothesis. It does not matter if $p = 0.06$ or $p = 0.6$, the decision rule is final. Fisher would say that the data are somewhat surprising under the null. Although the value is not below our heuristic of $p < 0.05$, it is suggestive of moderate evidence against the null. This may imply that more data are needed to fully understand what is going on, but there is certainly a hint that the null may not be true.

```{figure} images/pvaluepumpkin.jpg
---
scale: 50%
align: left
---
```

Anyone who has used NHST for long enough knows the pains of grappling with a $p$-value that is *nearly significant*. In these situations, the inherent conflict between a strict decision rule and a continuous measure of evidence becomes clear. We see a value that tells us that we cannot reject the null, yet the value is so close to our cutoff that it feels as if it should *mean something*. This is especially true when we get a value like $p = 0.051$ versus $p = 0.96$. In these situations, we want to be Fisher but have been taught to be Neyman-Pearson. We are trying to adhere to a yes/no decision-making framework, yet have been presented with a continuous metric of evidence. If we really want to adhere to the Neyman-Pearson framework, we should remove $p$-value columns from software and replace them with a column that simply says "reject" or "fail to reject". If we want to adhere to Fisher's framework, we keep $p$-values but remove strict decision rules, as well as concepts of error rates and power. We cannot really have both and remain logically consistent. And yet, here we are.

## Confidence Intervals

What we were actually looking at here were 68% confidence intervals.

Notice that this depends upon assuming normality.

## NHST in `R`

```r
data(mtcars)

mod <- lm(mpg ~ wt + hp + disp, data=mtcars)
summary(mod)
```

```
Call:
lm(formula = mpg ~ wt + hp + disp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.891 -1.640 -0.172  1.061  5.861 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
wt          -3.800891   1.066191  -3.565  0.00133 ** 
hp          -0.031157   0.011436  -2.724  0.01097 *  
disp        -0.000937   0.010350  -0.091  0.92851    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.639 on 28 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8083 
F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11
```


We can also produce confidence intervals using the `confint()` function:

```r
print(confint(mod, level=0.95))
```

```
                  2.5 %       97.5 %
(Intercept) 32.78169625 41.429314293
wt          -5.98488310 -1.616898063
hp          -0.05458171 -0.007731388
disp        -0.02213750  0.020263482
```


Notably, other confidence interval levels can be produced. For instance, the intervals we used previously of $\pm 1 \times \text{SE}$ correspond to an approximate 68% CI, as shown below.

```r
print(confint(mod, level=0.68))
```

```
                   16 %         84 %
(Intercept) 34.96844420 39.242566342
wt          -4.88033821 -2.721442954
hp          -0.04273454 -0.019578564
disp        -0.01141544  0.009541424
```
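To see why $\pm 1 \times \text{SE}$ is only *approximately* a 68% interval, the following sketch rebuilds the interval by hand as the estimate plus or minus a $t$-quantile multiplied by the standard error. This assumes `confint()` uses the same $t$-based construction for `lm` objects. The names `mult` and `manual_ci` are illustrative only.

```r
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

est <- coef(mod)                 # parameter estimates
se  <- sqrt(diag(vcov(mod)))     # standard errors
df  <- df.residual(mod)          # residual degrees of freedom

level <- 0.68
# Multiplier for a central interval at this level; close to 1 for 68%
mult <- qt(1 - (1 - level) / 2, df = df)

manual_ci <- cbind(lower = est - mult * se, upper = est + mult * se)

print(mult)
print(manual_ci)
print(confint(mod, level = level))  # should agree with manual_ci
```

The multiplier is slightly larger than 1 because the $t$-distribution on 28 degrees of freedom is a little wider than the normal distribution.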


Note, however, that we do not need to subscribe to the theory of CIs in order to interpret our parameters in terms of $\pm 1 \times \text{SE}$ or $\pm 2 \times \text{SE}$ or whatever we want. The interpretation remains valid. The part that the theory of CIs adds is the *probabilistic* information about the interval. So, the 68% part or the 95% part. These percentages are only valid under the assumptions of CIs. So we can interpret any interval around the estimates we like. However, we need to make further assumptions in order to make any probabilistic claims about the interval's behaviour over repeated sampling.

However, it is not very typical to report intervals at other levels, as 95% CIs are considered the standard due to their compatibility with $\alpha = 0.05$.
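The compatibility between 95% CIs and $\alpha = 0.05$ can be sketched as follows: for the $t$-based tests reported by `summary()`, a coefficient has a two-sided $p < 0.05$ exactly when its 95% interval excludes the null value of 0. The column names in the printed table are illustrative only.

```r
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

alpha <- 0.05
ci    <- confint(mod, level = 1 - alpha)
p     <- coef(summary(mod))[, "Pr(>|t|)"]

# Does each 95% interval exclude the null value of 0?
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0

print(data.frame(p_value = p, below_alpha = p < alpha, ci_excludes_zero = excludes_zero))
```

The last two columns should match row by row, which is the sense in which a 95% interval and a test at $\alpha = 0.05$ carry the same decision.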

[^overunderfoot]: We will get more *overestimates* as well, leading to smaller test-statistic values. These are also ca