# NHST III: Statistical Significance
... As explained previously, we begin by determining a null value for our parameter of interest and then compare that value to the actual value we have calculated to produce a *difference* called $\delta$. For all the reasons given in the previous part of the lesson, we then convert $\delta$ into a standardised value called a *test statistic*. This is typically a $t$-statistic, which expresses $\delta$ in standard error units. Under the null hypothesis, the $t$-statistic has a $t$-distribution, whose width is governed by a parameter known as the *degrees of freedom*. This allows the distribution to accommodate how the precision of the estimated standard error influences the range of probable $t$-values under the null. So, we now have a test statistic that captures the deviation from the null in our dataset. Given that we also know how $t$ will be distributed when the null is true, our very last step is to calculate the *probability* of observing this value of $t$, if the null hypothesis were correct. This probability is known as a $p$-value.

## The $p$-value

Probabilities from continuous distributions are given by the area under the curve either *above* or *below* the value. So we are not calculating the probability of getting this *exact* value of $t$. Rather, it is the probability of this value of $t$ *or larger*.
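
Written out as a formula, this tail area might look as follows. The symbols $T$ (a random variable following the null $t$-distribution), $t_{\text{obs}}$ (the value of $t$ calculated from our data), $\nu$ (the degrees of freedom), $f_{\nu}$ (the density of the $t$-distribution) and $H_0$ (the null hypothesis) are notation assumed here for illustration:

$$
p = P(T \geq t_{\text{obs}} \mid H_0) = \int_{t_{\text{obs}}}^{\infty} f_{\nu}(x) \, dx
$$

This is the one-sided version. Software such as `R` typically reports the two-sided version, which counts deviations of at least this size in *either* direction:

$$
p = P(|T| \geq |t_{\text{obs}}| \mid H_0) = 2 \int_{|t_{\text{obs}}|}^{\infty} f_{\nu}(x) \, dx
$$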

Fisher actually viewed $p$-values as continuous evidence against the null, rather than as a threshold value for declaring significance. In fact, although the criterion of 0.05 is attributed to Fisher, it is insightful to see what he *actually* said about it:

```{epigraph}
The value for which $p = 0.05$, or 1 in 20, is convenient to use in practice if we want to draw a line beyond which we say that the deviation is significant, but it must not be forgotten that we shall often draw such a line when there is nothing there but chance.

-- Ronald Fisher
```

So, Fisher viewed this criterion as a convenient heuristic, but fully acknowledged its limitations and never stated that we should *reject* the null hypothesis when $p < 0.05$. Indeed, Fisher strongly disagreed with the Neyman-Pearson concept of rejecting or failing to reject the null:

```{epigraph}
It is a mistake to speak of the null hypothesis as though it were a thing we are testing, or as though we were trying to accept or reject it.

-- Ronald Fisher
```

Indeed, in Fisher's conceptualisation a small $p$-value simply means that the data are surprising under the null, whereas a large $p$-value means that the data are not surprising under the null. Smaller $p$-values present more evidence against the null, but only insofar as we make the logical leap of saying that very surprising data make it less likely that the null is correct. However, we would not *reject* the null, because it still remains possible that the null is correct. All we are saying is that the data we have are less compatible with the null. Nothing more.

## Fisher vs Neyman-Pearson

## Confidence Intervals

What we were actually looking at here were 68% confidence intervals.

Notice that this depends upon assuming normality.

## NHST in `R`

In [6]:
data(mtcars)

mod <- lm(mpg ~ wt + hp + disp, data=mtcars)
summary(mod)


Call:
lm(formula = mpg ~ wt + hp + disp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.891 -1.640 -0.172  1.061  5.861 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
wt          -3.800891   1.066191  -3.565  0.00133 ** 
hp          -0.031157   0.011436  -2.724  0.01097 *  
disp        -0.000937   0.010350  -0.091  0.92851    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.639 on 28 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8083 
F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11
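
The `t value` and `Pr(>|t|)` columns of this table can be reproduced by hand, which makes the logic of the test explicit. The following is a minimal sketch, assuming a null value of 0 for every coefficient (the default used by `summary()`). The object names `est`, `se`, `t_manual` and `p_manual` are chosen here purely for illustration.

```r
# Refit the model so that this block is self-contained
mod   <- lm(mpg ~ wt + hp + disp, data = mtcars)
coefs <- summary(mod)$coefficients

est <- coefs[, "Estimate"]
se  <- coefs[, "Std. Error"]

# Test statistic: the difference from the null value (assumed to be 0)
# expressed in standard error units
t_manual <- (est - 0) / se

# Two-sided p-value: the area in both tails of the null t-distribution
# beyond the observed |t|, using the residual degrees of freedom
df       <- df.residual(mod)
p_manual <- 2 * pt(abs(t_manual), df = df, lower.tail = FALSE)

print(cbind(t_manual, p_manual))
```

Note that the factor of 2 is what makes the $p$-value two-sided, matching the `Pr(>|t|)` label in the table.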


We can also produce confidence intervals using the `confint()` function:

In [7]:
print(confint(mod, level=0.95))

                  2.5 %       97.5 %
(Intercept) 32.78169625 41.429314293
wt          -5.98488310 -1.616898063
hp          -0.05458171 -0.007731388
disp        -0.02213750  0.020263482
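
These limits are simply the estimate plus or minus a multiple of the standard error, where the multiple is a quantile of the $t$-distribution. As a rough sketch of that calculation (the names `t_crit` and `ci_manual` are illustrative, not part of any `R` API):

```r
# Refit the model so that this block is self-contained
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

est <- coef(mod)
se  <- sqrt(diag(vcov(mod)))   # standard errors of the coefficients
df  <- df.residual(mod)

# For a 95% interval we leave 2.5% in each tail, so we need the
# 97.5% quantile of the t-distribution
t_crit <- qt(0.975, df = df)

ci_manual <- cbind(lower = est - t_crit * se,
                   upper = est + t_crit * se)
print(ci_manual)
```

The result should agree with the output of `confint(mod, level=0.95)`.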


Notably, other confidence interval levels can be produced. For instance, the intervals we used previously of $\pm 1 \times \text{SE}$ correspond to an approximate 68% CI, as shown below.

In [8]:
print(confint(mod, level=0.68))

                   16 %         84 %
(Intercept) 34.96844420 39.242566342
wt          -4.88033821 -2.721442954
hp          -0.04273454 -0.019578564
disp        -0.01141544  0.009541424
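
To see why this correspondence is only approximate, we can compare the multiplier behind a 68% CI with the value of 1 used for $\pm 1 \times \text{SE}$. This is a small sketch under the same model; the names `t_mult`, `cover_1se` and `one_se` are made up for illustration.

```r
# Refit the model so that this block is self-contained
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)
df  <- df.residual(mod)

# Multiplier used for a 68% CI (16% in each tail).
# With 28 degrees of freedom this is roughly 1.01, so close to 1
t_mult <- qt(0.84, df = df)
print(t_mult)

# Going the other way: the coverage implied by exactly +/- 1 SE
cover_1se <- pt(1, df = df) - pt(-1, df = df)
print(cover_1se)

# The +/- 1 SE intervals, for comparison with confint(mod, level=0.68)
est    <- coef(mod)
se     <- sqrt(diag(vcov(mod)))
one_se <- cbind(lower = est - se, upper = est + se)
print(one_se)
print(confint(mod, level = 0.68))
```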


Note, however, that we do not need to subscribe to the theory of CIs in order to interpret our parameters in terms of $\pm 1 \times \text{SE}$ or $\pm 2 \times \text{SE}$ or whatever we want. The interpretation remains valid. The part that the theory of CIs adds is the *probabilistic* information about the interval, namely the 68% part or the 95% part. These percentages are only valid under the assumptions of CIs. So we can interpret any interval around the estimates we like, but we need to make further assumptions in order to make any probabilistic claims about the interval's behaviour over repeated sampling.

In practice, however, it is not very typical to report levels other than 95%, as 95% CIs are considered the standard due to their correspondence with $\alpha = 0.05$.
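
The link between 95% CIs and $\alpha = 0.05$ can be checked directly: for these $t$-based intervals, a 95% CI excludes the null value of 0 exactly when the two-sided $p$-value falls below 0.05. A minimal sketch, with illustrative names `excludes_zero` and `p_below_alpha`:

```r
# Refit the model so that this block is self-contained
mod <- lm(mpg ~ wt + hp + disp, data = mtcars)

ci <- confint(mod, level = 0.95)
p  <- summary(mod)$coefficients[, "Pr(>|t|)"]

# Does each interval lie entirely above or entirely below zero?
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0

# Is each two-sided p-value below alpha = 0.05?
p_below_alpha <- p < 0.05

# The two columns should agree row by row
print(cbind(excludes_zero, p_below_alpha))
```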

[^overunderfoot]: We will get more *overestimates* as well, leading to smaller test-statistic values. These are also ca