# NHST III: Statistical Significance
In the previous parts of this lesson, we established the core mechanics of NHST. During those discussions, we made multiple references to calculating *probabilities* in order to reach conclusions. In this part of the lesson, we will finally formalise how this is achieved. To briefly review, the process of NHST involves the following steps

1. Determine a null value for our parameter of interest. This does not have to be 0, but can be any value indicative of *no effect*. 
2. Compare the null value to the actual value we have calculated and call this difference $\delta$. 
3. Convert $\delta$ into a *standardised* value by dividing by its standard error. The resultant value is called a *test statistic* and expresses our calculated difference in standard error units. 
4. If we know the standard error, call the test statistic $z$. If we have estimated the standard error from the data, call the test statistic $t$. 
5. Finally, determine the *null distribution* of the test statistic. For a $z$-statistic, this is a standard normal distribution. For a $t$-statistic, this is a $t$-distribution whose width is governed by the *degrees of freedom*. This allows the distribution to flexibly accommodate how the precision of the estimated standard error influences the range of probable $t$-values.
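As a rough illustration, the steps above can be sketched in `R` for the simple case of a single sample mean. The data here are simulated purely for illustration, and the variable names (`null.value`, `delta`, `t.stat`, `df.null`) are our own labels rather than anything built into `R`.

```r
# Hypothetical data: 21 simulated observations (for illustration only)
set.seed(1)
y <- rnorm(n = 21, mean = 0.3, sd = 1)

null.value <- 0                          # step 1: value indicating no effect
delta      <- mean(y) - null.value       # step 2: difference from the null value
se         <- sd(y) / sqrt(length(y))    # standard error estimated from the data
t.stat     <- delta / se                 # steps 3-4: estimated SE, so we call this t
df.null    <- length(y) - 1              # step 5: df of the null t-distribution

print(c(delta = delta, se = se, t = t.stat, df = df.null))

# Cross-check against R's built-in one-sample t-test
check <- t.test(y, mu = null.value)
print(check$statistic)
```

The built-in `t.test()` function should report the same test statistic and degrees of freedom, which is a useful sanity check on the manual calculation.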

So, by this point, we have a test statistic that captures the deviation from the null in our dataset, and we know the distribution of the test statistic when the null is true. Our very last step is to use both of these pieces of information to calculate the *probability* of our observed value of $t$, if the null hypothesis is true. This probability is perhaps one of the most controversial, misunderstood and misused aspects of classical statistical inference. This is the infamous $p$-value.

## The $p$-value and Statistical Significance
It is fairly likely ($p < 0.05$) that you will have been exposed to the $p$-value already in your statistical education. If you come from a Psychology background, this is practically guaranteed, as the history of scientific enquiry in the field of Psychology is largely the history of amassing $p$-values. As a method of trying to reach conclusions from a statistical model, the $p$-value is ubiquitous. This is largely because, on the face of it, the $p$-value provides an easily applicable and objective criterion for determining evidence against the null hypothesis. However, as we will come to see, the ability of the $p$-value to provide the information that researchers typically want is limited and arguably misleading. Despite the many issues with how $p$-values are used in practice, this humble little probability endures as the metric most commonly employed in Experimental Psychology to reach conclusions. As such, you *have* to understand this approach, even if you disagree with it. We will leave most of the criticisms to one side for the moment and just focus on making sure the definition of the $p$-value is clear. This is critical, as the $p$-value is often misunderstood by students and sometimes (rather worryingly) by researchers as well. As part of this, we will also examine the history of the $p$-value, and NHST in general, because this is illuminating in terms of understanding the role of the $p$-value and also, critically, why its modern usage can appear to be so logically unsatisfying.


### Defining the $p$-value
To begin with, it is useful to understand that the definition of the $p$-value comes directly from Fisher's development of NHST. As we will explore further below, Fisher's original conceptualisation of the $p$-value has been somewhat warped into our modern framework of NHST. So, for the moment, we will focus on Fisher's original concept, before seeing what has happened over the years.

The basic definition of the $p$-value is the *probability of obtaining a test-statistic value as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true*. Formally, we can write this as

$$
p = P(T \geq t \mid \mathcal{H}_{0}),
$$

where $T$ is the test statistic treated as a random variable and $t$ is our observed test-statistic value. How do we calculate this probability? Well, we know the distribution of $t$ when the null hypothesis is true, as this is just the null $t$-distribution which is centred on 0 and has a width controlled by the degrees of freedom. So, if our hypothesised null value is *correct*, then we would expect the $t$-value calculated over many repeats to have a null $t$-distribution. This captures the degree of natural variation we would expect in the $t$-value across different samples, even when the null hypothesis is correct. If our calculated $t$-value appears *consistent* with this distribution, then it might be that the null is true. However, if our calculated $t$-value appears *inconsistent* with this distribution, then it might be that the null is incorrect.
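The idea of the $t$-value following the null distribution "over many repeats" can be checked with a small simulation. The sketch below is only illustrative: the sample size, the number of repeats, the seed and the use of normally distributed data are all assumptions chosen for convenience.

```r
set.seed(42)                 # arbitrary seed so the sketch is reproducible
n      <- 21                 # sample size, giving n - 1 = 20 degrees of freedom
n.reps <- 10000              # number of repeated "experiments"

# Each repeat: draw a sample where the null (mean = 0) is true, then compute t
t.null <- replicate(n.reps, {
    y <- rnorm(n, mean = 0, sd = 1)
    (mean(y) - 0) / (sd(y) / sqrt(n))
})

# Simulated quantiles should sit close to those of the theoretical t-distribution
probs      <- c(0.05, 0.25, 0.50, 0.75, 0.95)
comparison <- rbind(simulated   = quantile(t.null, probs),
                    theoretical = qt(probs, df = n - 1))
print(round(comparison, 2))
```

If the two rows of the printed table agree closely, this suggests that the $t$-values calculated across repeated samples do behave like draws from the null $t$-distribution, even though every individual sample gives a slightly different value.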

A visualisation of the $p$-value is given in {numref}`pval-fig`. The distribution shown here represents the null distribution of the test statistic. The $p$-value is then the probability of the observed value of the test statistic, which corresponds to the area of the distribution *above* the observed value. This is important, because we are not calculating the probability of seeing *exactly* this calculated value under the null[^probfoot], instead it is the probability of this value *or larger*.

```{figure} images/pvalue.png
---
width: 600px
name: pval-fig
---
Illustration of the p-value as the area under the null distribution of the test statistic above the calculated value.
```

As an example, say we calculated $t = 1.2$ and have $20$ degrees of freedom. The $p$-value can then be derived in `R` using the `pt()` function (probability from the $t$-distribution)

In [6]:
t  <- 1.2
df <- 20
p  <- pt(q=t, df=df, lower.tail=FALSE)
print(p)

[1] 0.1220808


So, the probability of getting a $t$-value of $1.2$ or larger when the null hypothesis is true is $p = 0.122$, or $12.2\%$. This is only based on the *upper-tail* of the distribution. If we wanted to test the probability of both $t > 1.2$ *and* $t < -1.2$, we would add both tails together which, because the $t$-distribution is symmetric, amounts to multiplying the one-tailed $p$-value by 2. This is typically done because we do not always have a strong *directional* hypothesis. In other words, we do not care whether $\delta$ is positive or negative, only whether the difference is *large*. This is the difference between *one-tailed* and *two-tailed* tests, as shown below

In [7]:
one.tailed.p  <-     pt(q=t, df=df, lower.tail=FALSE)
two.tailed.p  <- 2 * pt(q=t, df=df, lower.tail=FALSE)

# store the table under its own name so that df (degrees of freedom) is not overwritten
p.table <- data.frame("One.tailed"=one.tailed.p,
                      "Two.tailed"=two.tailed.p)
print(p.table,row.names=FALSE)

 One.tailed Two.tailed
  0.1220808  0.2441616


Note that the majority of statistical software will show you two-tailed tests by default because strong directional hypotheses tend to be the *exception* rather than the rule and it is generally safer to report a $p$-value that is *too large* rather than one that is *too small*.
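Returning to the interpretation of the $p$-value as an *area* under the null distribution, this can be checked numerically. The sketch below integrates the $t$-density directly using `integrate()` and `dt()`, rather than relying on `pt()`. The names `t.obs` and `df.null` are our own and simply hold the example values of $t = 1.2$ and 20 degrees of freedom.

```r
t.obs   <- 1.2     # observed test statistic from the example above
df.null <- 20      # degrees of freedom of the null distribution

# Area under the null density from the observed value up to infinity
upper.area <- integrate(dt, lower = t.obs, upper = Inf, df = df.null)$value

# Area in the matching lower tail, from minus infinity up to -t.obs
lower.area <- integrate(dt, lower = -Inf, upper = -t.obs, df = df.null)$value

print(c(one.tailed = upper.area,
        two.tailed = upper.area + lower.area))
```

Up to numerical integration error, these areas should agree with the one-tailed and two-tailed values returned by `pt()` (approximately $0.122$ and $0.244$).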

### Fisher's Concept of the $p$-value
The next most obvious question is, once we have calculated the $p$-value, how do we interpret it? For Fisher, the $p$-value was a *continuous measure of evidence against the null*. The smaller the $p$-value, the greater the evidence *against* the null. Or, more precisely, the more *surprising* the data would be, if the null were true. In advocating this perspective of continuous evidence, Fisher provided some heuristics to aid interpretation:

```{epigraph}
The value for which $p = 0.05$, or 1 in 20, is convenient to use in practice if we want to draw a line beyond which we say that the deviation is significant...

-- Fisher, *Statistical Methods for Research Workers* (1925)
```

So Fisher suggested that we use $p < 0.05$ as a metric for determining *statistical significance*. However, the important point here (which is most readily forgotten in modern NHST) is that this is a *rough guideline* not some sort of *scientific truth*. Indeed, the full quote is much more illuminating

```{epigraph}
The value for which $p = 0.05$, or 1 in 20, is convenient to use in practice if we want to draw a line beyond which we say that the deviation is significant, but it must not be forgotten that we shall often draw such a line when there is nothing there but chance.

-- Fisher, *Statistical Methods for Research Workers* (1925)
```

So, Fisher viewed this criterion as a convenient heuristic, but fully acknowledged its limitations. A quote from later in his career is also illuminating of Fisher's view 

```{epigraph}
The level of significance in such tests is a matter of convenience, and does not affect the logic of the procedure. The tests of significance...are not to be interpreted rigidly, but are meant to afford evidence, which can be taken into account, together with the rest of the evidence available

-- Fisher, *Statistical Methods and Scientific Inference* (1956)
```

So, again, the criterion of $p < 0.05$ is not a rigid cut-off, but simply an interpretational device that should be used in conjunction with all other available evidence. It is interesting to think about this in light of criticisms of using a threshold of $p < 0.05$ for statistical inference. A more recent famous quote is

```{epigraph}
Surely, God loves the 0.06 nearly as much as the 0.05?

-- Rosnow & Rosenthal (1989)
```

Indeed, Fisher would probably agree. As a continuous measure of evidence, a $p$-value of 0.06 *is* informative. Although not quite reaching our heuristic of $p < 0.05$, this is still *very close* and suggests some modest evidence against the null. For Fisher, the actual value of $p$ was *very important* for inference. Fisher *never* advocated the position of *rejecting* the null hypothesis when $p < 0.05$. Indeed, Fisher strongly disagreed with this idea, which came from a separate development of NHST attributable to two statisticians called [Jerzy Neyman](https://en.wikipedia.org/wiki/Jerzy_Neyman) and [Egon Pearson](https://en.wikipedia.org/wiki/Egon_Pearson).

## Neyman-Pearson NHST
In opposition to Fisher's approach, Neyman-Pearson's approach to NHST saw inference as an exercise in *decision making*. More than that, they saw inference as the process of making decisions *under uncertainty*, emphasising the long-run control of error rates across repeated experiments. It is here that the concepts of the *alternative hypothesis*, *Type I/Type II errors*, *statistical power* and *rejection of the null* come from. Fisher, in his advocacy of the $p$-value as a continuous metric of evidence, strongly disagreed with all these ideas. Neyman-Pearson NHST proceeds in several steps, which we will now briefly explore.

First, we start by defining *two* competing hypotheses: the null hypothesis $(\mathcal{H}_{0})$ and the *alternative* hypothesis $(\mathcal{H}_{1})$. The alternative requires the definition of an *effect size*, which is the magnitude of effect that we expect to see, if the alternative is true. We then define a test-statistic with a known distribution under the null for capturing the discrepancy between the data and the null hypothesis. This is the same procedure as Fisher's approach and so does not need repeating.

Next, we need to consider the two different types of error that can occur when we make a decision about the null hypothesis:

1. We *reject* $\mathcal{H}_{0}$ when it is *true*
    - This is a Type I error, also known as a *false-positive*, and its probability is denoted $\alpha$
2. We *fail to reject* $\mathcal{H}_{0}$ when $\mathcal{H}_{1}$ is *true*
    - This is a Type II error, also known as a *false-negative*, and its probability is denoted $\beta$

The idea is that we want to set $\alpha$ to an acceptable level to minimise false positives across repeat experiments. A typical value is $\alpha = 0.05$, which no doubt adds to the conflation of Fisher's approach and Neyman-Pearson's approach. We also want to *minimise* $\beta$ or, equivalently, *maximise* $1 - \beta$. The quantity $1 - \beta$ is known as *statistical power* and can be interpreted as the probability of correctly rejecting $\mathcal{H}_{0}$ when $\mathcal{H}_{1}$ is true. This quantity depends upon the magnitude of the effect size, as well as the sample size and variance.
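To get a feel for how power depends on these quantities, we can use the `power.t.test()` function from base `R`. The sketch below is only an illustration: the effect size, standard deviation and sample sizes are hypothetical values picked for the example, and a one-sample, two-sided $t$-test is assumed.

```r
# Hypothetical design values, chosen only for illustration
effect.size <- 0.5    # assumed true mean difference under the alternative
sd.pop      <- 1      # assumed population standard deviation
alpha       <- 0.05   # Type I error rate

# Power of a one-sample, two-sided t-test at several sample sizes
sample.sizes <- c(10, 20, 40, 80)
power <- sapply(sample.sizes, function(n) {
    power.t.test(n = n, delta = effect.size, sd = sd.pop,
                 sig.level = alpha, type = "one.sample",
                 alternative = "two.sided")$power
})

print(data.frame(n     = sample.sizes,
                 power = round(power, 3),
                 beta  = round(1 - power, 3)))
```

With everything else held fixed, power should increase (and $\beta$ decrease) as the sample size grows. Changing `effect.size` or `sd.pop` shows the same trade-off from the perspective of the effect magnitude and the variance.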

Based on our choice of $\alpha$ and our test-statistic, we can then create a decision rule. This is achieved using the null distribution of the test-statistic. For instance, if we were using a $t$-statistic with $\alpha = 0.05$ and 20 degrees of freedom, the critical value for a one-tailed (upper-tail) test is $t_{\text{crit}} = 1.725$. This can be calculated in `R` using

In [None]:
alpha  <- 0.05
df     <- 20
t.crit <- qt(p=alpha, df=df, lower.tail=FALSE)

print(t.crit)

[1] 1.724718


where the `qt()` function will calculate a $t$-value from a probability (whereas `pt()` calculated a probability from a $t$-value). If the null is true, we should only calculate a $t$-value *larger* than $1.725$ around 5% of the time. Thus, over repeated experiments, if the null is true then we will only make a Type I error around 5% of the time. The Type I error rate has therefore been *controlled* at the given $\alpha$-level. Any value of $t > 1.725$ therefore falls within a *rejection-region*. In other words, a region of values of $t$ which are so large that we will reject the null. 
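The claim that the Type I error rate is controlled over repeated experiments can be illustrated by simulation. This is a minimal sketch under assumed conditions: normally distributed data, a true mean of 0 (so the null is correct), 21 observations per experiment and an arbitrary seed.

```r
set.seed(123)                # arbitrary seed for reproducibility
alpha  <- 0.05
n      <- 21                 # n - 1 = 20 degrees of freedom
t.crit <- qt(p = alpha, df = n - 1, lower.tail = FALSE)  # one-tailed critical value

# Repeat the experiment many times with the null hypothesis true
n.reps <- 10000
t.null <- replicate(n.reps, {
    y <- rnorm(n, mean = 0, sd = 1)
    mean(y) / (sd(y) / sqrt(n))
})

# Proportion of experiments that land in the rejection region
type1.rate <- mean(t.null > t.crit)
print(type1.rate)
```

The proportion of simulated experiments falling in the rejection region should be close to $\alpha = 0.05$, with some simulation noise. Every one of those rejections is a Type I error, because the null was true by construction.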

Finally, based on the value of $t$ calculated from the data, we either *reject the null hypothesis* when $t > t_{\text{crit}}$, or *fail to reject the null* when $t \leq t_{\text{crit}}$. If we stick rigidly to these rejection regions then the probability of false-positives is controlled at the desired $\alpha$-level. Notice here that there is no continuous assessment of the evidence against the null hypothesis, we simply *reject* or *fail to reject* based on a strict threshold. Because of this, the $p$-value has *no role* in this procedure. The critical value of the test-statistic is determined *a priori* and we simply make a binary decision depending on what we calculate from the data. The precise control of the error rates requires a black-and-white decision. If our evidence shifts around and our decisions can be flexible, then we lose this control. For Neyman-Pearson, making a decision that controls long-term error *is the point* of statistical inference and a $p$-value is *unnecessary* for this purpose.

### Confidence Intervals
... These were derived *directly* from the rejection regions indicated above. ... The interval contains every parameter value that would lead us to *fail to reject the null hypothesis*, given our current data. If we set our proposed value to anything within this interval, the test-statistic will not exceed the critical value and we will *fail to reject* the null. So this gives us a range of proposed values, any of which, if used as the null value, would lead to us failing to reject the null. As such, any proposed value *outside* the interval is far enough away from our estimated value that it would lead to us *rejecting the null*. So, if our null value of interest is *inside* the interval, then we fail to reject the null. If it is *outside* the interval, then we do reject the null.

- Proposed value *outside* the interval = reject the null
- Proposed value *inside* the interval = fail to reject the null

So then we can consider our *null* proposed value. If it is inside the interval then it is close enough to our estimated value and we fail to reject the null. If it is outside the interval then it is far enough away that we will reject the null. In practice, if we use a null value of 0, we end up seeing if 0 is *inside* or *outside* the confidence interval.
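This correspondence between the interval and the decision rule can be sketched in `R` using the `mtcars` regression model that appears later in this lesson. The object names (`fit`, `ci`, `reject.null`) are our own, and a two-tailed test with $\alpha = 0.05$ is assumed.

```r
data(mtcars)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

alpha  <- 0.05
est    <- coef(fit)                    # parameter estimates
se     <- sqrt(diag(vcov(fit)))        # estimated standard errors
t.crit <- qt(p = 1 - alpha / 2, df = fit$df.residual)  # two-tailed critical value

# Range of proposed values that would NOT be rejected at this alpha
ci <- cbind(lower = est - t.crit * se,
            upper = est + t.crit * se)

# A null value of 0 lying outside the interval corresponds to rejecting the null
reject.null <- ci[, "lower"] > 0 | ci[, "upper"] < 0

print(round(ci, 3))
print(reject.null)
```

The manually constructed limits should match those returned by `confint(fit)`, and the `reject.null` indicator should agree with the decision we would reach by comparing each $t$-statistic to the critical value.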

## Modern Interpretation of NHST
As we can see from the discussions above, the history of NHST is rather tangled and has often been misunderstood. This is largely because its modern incarnation is a hybrid of both Fisherian and Neyman-Pearsonian NHST. However, these are two *fundamentally different* statistical philosophies. Each framework was severely opposed by the other side, with major disagreements on the fundamental purpose of statistical inference and how evidence can be used. These two approaches were never meant to be combined. Indeed, both Fisher and Neyman-Pearson would probably agree that they *cannot* be combined in any meaningful way. And yet, modern NHST *has* merged these different ideas into a confused mish-mash that neither Fisher nor Neyman-Pearson would have endorsed.

### Fisher vs Neyman-Pearson
Fisher vehemently opposed Neyman-Pearson's approach to NHST. As we saw above, Fisher believed in a continuous assessment of the evidence against the null, using heuristics to *guide* interpretation, rather than creating strict thresholds. Nature does not say "yes" or "no"; rather, it provides evidence that should be interpreted by science. Turning statistical inference into a simple binary decision makes the process robotic, automated and incapable of any nuance. For Fisher, this went against how nature actually operates. It is interesting to note that a common criticism of NHST is that it creates an unrealistic dichotomisation of evidence[^evidencefoot], yet this is something that Fisher never endorsed. Fisher emphasised that a null hypothesis might never actually be true in reality. Instead, it is a simplifying assumption. The goal is therefore not to "accept" or "reject" it, but to see how well it can explain the data. In his own words, Fisher said:

```{epigraph}
It is a mistake to speak of the null hypothesis as though it were a thing we are testing, or as though we were trying to accept or reject it.

-- Fisher, *Statistical Methods and Scientific Inference* (1956)
```

In opposition, Neyman-Pearson were of the view that one cannot sit and ponder the meaning of a $p$-value forever. Eventually, some decision must be reached

- "Is this treatment effective enough to consider offering it to patients?" 
- "Is the quality control of this product good enough to enter production?" 
- "Are the side-effects of this drug severe enough that the medication should be banned?"

In all these cases a clear decision must be made. For this to happen, we need rules for reaching decisions in the face of uncertainty that *minimise the probability of severe mistakes*. Indeed, Neyman was very critical of the seemingly subjective way a $p$-value could be used to reason about a hypothesis, particularly when a $p$-value is a statement about a *single set of data*

```{epigraph}
The test of a hypothesis is a rule which tells us whether to reject or not to reject a given hypothesis in the light of observed data. It is not a method of reasoning which we can apply at will to interpret a particular set of data.

-- Neyman, *"Inductive Behavior" as a Basic Concept of Philosophy of Science* (1957)
```

Because of this, Neyman-Pearson fundamentally rejected the $p$-value as having any utility in their approach to inference. Because a $p$-value is *data-dependent*, it goes against the Neyman-Pearson philosophy of defining an error-control procedure that is divorced from the data itself. If we tie the $p$-value to a pre-specified $\alpha$ (such as $\alpha = 0.05$), then it can be used as a decision rule, but the *actual value of p does not matter*. At which point, we may as well just use the critical value of the test statistic. The $p$-value adds nothing to this procedure. Neyman-Pearson would likely claim that the $p$-value is just a *post-hoc description of how surprising the data are*. To them, this is just a description of the data, not *inference*. The only reason to calculate a $p$-value is to consider how compatible our data are with the null hypothesis. But Neyman-Pearson were not interested in this. For them, statistical inference is simply a yes/no question with no grey areas.
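The point that a $p$-value tied to a fixed $\alpha$ adds nothing beyond the critical value can be illustrated with a short sketch. Here we assume a one-tailed test with 20 degrees of freedom and a grid of hypothetical observed $t$-values, then compare the decision reached by each route.

```r
alpha   <- 0.05
df.null <- 20
t.crit  <- qt(p = alpha, df = df.null, lower.tail = FALSE)

# A grid of hypothetical observed t-values
t.obs <- seq(from = -3, to = 3, by = 0.25)

# Decision using the critical value of the test statistic
reject.by.t <- t.obs > t.crit

# Decision using the one-tailed p-value compared against alpha
p.vals      <- pt(q = t.obs, df = df.null, lower.tail = FALSE)
reject.by.p <- p.vals < alpha

print(data.frame(t = t.obs, p = round(p.vals, 4), reject.by.t, reject.by.p))
```

The two decision columns should be identical for every value of $t$, because $p < \alpha$ and $t > t_{\text{crit}}$ are two ways of expressing the same rejection region. The only extra information carried by the $p$-value is its magnitude, which is exactly the part that Neyman-Pearson would ignore.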

### Modern NHST
In its most modern incarnation, NHST involves *both* the calculation of $p$-values *and* the use of stringent cut-offs that result in either *rejecting* or *failing to reject* the null. We speak of *statistical power*, *alternative hypotheses* and $\alpha$*-levels*, yet our statistical software prints $p$-values rather than binary decisions. It is hopefully clear that this modern use of NHST would have horrified Fisher and Neyman-Pearson in equal measure.

Fisher would hate to see his $p$-values being used as a decision making tool for rejecting hypotheses, rather than as a measure of evidence. Neyman-Pearson would hate to see statistical software reporting $p$-values, rather than simply stating whether a test passed the critical threshold for rejection. Our modern approach to NHST bleeds decision making and thresholds into Fisher's continuous measures of evidence, and bleeds continuous measures of evidence and the term "significance" into Neyman-Pearson's strict decision making approach. This has been called the *hybrid logic* of modern NHST by [Gigerenzer (1993)](https://media.pluto.psy.uconn.edu/Gigerenzer%20superego%20ego%20id.pdf), who stated

```{epigraph}
The Freudian metaphor suggests that the resulting conceptual confusion in the minds of researchers, editors, and textbook writers is not due to limited intelligence. The metaphor brings the anxiety and guilt, the compulsive and ritualistic behavior, and the dogmatic blindness associated with the hybrid logic into the foreground. It is as if the raging personal and intellectual conflicts between Fisher and Neyman and Pearson, and between these frequentists and the Bayesians were projected into an "intrapsychic" conflict in the minds of researchers. And the attempts of textbook writers to solve this conflict by denying it have produced remarkable emotional, behavioral, and cognitive distortions.

-- Gerd Gigerenzer, *The Superego, the Ego, and the Id in Statistical Reasoning* (1993)
```

To see more clearly why the modern mish-mash of these ideas is philosophically inconsistent, consider the situation where we calculate $p = 0.001$:

- If we simply *reject* the null because $p < 0.05$ then we are doing Neyman-Pearson NHST but using Fisher's $p$-values to implement it. There is nothing fundamentally wrong with this, as long as we are only using the $p$-value as a *binary indicator*. However, this raises the question of what the point of the $p$-value is. We could reach the same conclusion using the critical value of the test-statistic, which is how Neyman-Pearson designed their approach. By looking at the $p$-value, the temptation is there to interpret its magnitude.
- If we interpret this as strong evidence against the null, because the $p$-value is quite small, then we are doing Fisher's NHST. We may use a heuristic of $p < 0.05$ as some marker of "significance", but this is not a binary rule to accept or reject anything. We simply say that the data are quite surprising, if the null hypothesis were true. What we do *not* do is reject the null hypothesis on this basis, because it still remains possible that we have witnessed a rare event. Also, the null may never *actually* be true, so all we can really say is that these data seem incompatible with this simplifying device.
- If we squash these two approaches together by rejecting the null hypothesis because $p < 0.05$, but also (whether explicitly or implicitly) take the magnitude of the $p$-value as stronger evidence than another $p$-value in our output table (say $p = 0.048$), then we are doing the modern incoherent muddling of these two approaches.

Although it *feels* like we should be able to combine these methods, this causes problems. For instance, it feels like we should be able to pre-specify a level of error-control, reject the null based on this level and then use the magnitude of the $p$-value to indicate our confidence in that decision. However, what happens quite often is that researchers will treat values *close* to $p = 0.05$ as *marginally significant*, or move their criteria for significance based on the calculated $p$-values. If we start moving the goalposts based on the $p$-values, then the whole principle of fixing error rates before collecting the data goes out the window. Furthermore, treating $p$-values as both a binary indicator *and* a continuous measure of evidence is inconsistent. For instance, consider calculating $p = 0.06$. Neyman-Pearson would say that you have *failed to reject the null*. It does not matter if $p = 0.06$ or $p = 0.6$, the decision rule is final. Fisher would say that the data are somewhat surprising under the null. Although the value is not below our heuristic of $p < 0.05$, it is suggestive of moderate evidence *against the null*. This may imply that more data is needed to fully understand what is going on, but there is certainly a hint that the null may not be true. So we reach two different conclusions depending on how we treat the $p$-value.

```{figure} images/pvaluepumpkin.jpg
---
scale: 50%
align: left
---
```

Anyone who has used NHST for long enough knows the pains of grappling with a $p$-value that is *nearly significant*. In these situations, the inherent conflict between a strict decision rule and a continuous measure of evidence becomes clear. We see a value that tells us that we cannot reject the null, yet the value is so close to the threshold that it feels as if it should *mean something*. This is especially true when we get a value like $p = 0.051$ versus $p = 0.96$. In these situations, we want to be Fisher but have been taught to be Neyman-Pearson. We are trying to adhere to a yes/no decision-making framework, yet have been presented with a continuous metric of evidence. When this happens, researchers often try to find a way out by claiming "marginal significance", which is another way of saying "yes, we know we have failed to reject the null here, but it is really close!" So is this meaningful or not? This use of NHST claims that a result with $p = 0.051$ is simultaneously *meaningful* and *not meaningful*. We are trying to adhere to a strict threshold that maintains long-run error rates, but also move our goalposts around in terms of what we want to conclude based on how close the $p$-value is to that threshold. Logically, we cannot be doing both. We cannot both reject a result to maintain a long-term error rate *and* include it in our discussion because it is really close to the threshold. Given this, it is not really surprising that when faced with such a logical conflict, unscrupulous researchers will find means to *nudge* a $p$-value under the threshold and resolve their dilemma.

In reality, if we really want to adhere to the Neyman-Pearson framework, we should remove $p$-value columns from software. Instead, software should simply state whether to "reject" or "fail to reject" a result based on the critical value. No more information is given because no more information is needed and the conflict above is resolved. Alternatively, if we want to adhere to Fisher's framework, we keep $p$-values but remove strict decision rules, as well as concepts of error rates and power. We simply see the $p$-value as a continuous metric of evidence and interpret it with certain heuristics in mind. Again, this resolves the conflict. What we cannot really do is have all these elements and remain logically consistent...and yet, here we are.

## NHST in `R`
Despite the philosophical issues highlighted above, NHST remains a core element of modern statistical inference. We can muse about why this is from a historical perspective and complain about its inconsistencies, but that does not stop it from being used. As such, we need to at least see how NHST is implemented within `R`. First, let us look again at the output table that results from the multiple regression model `mpg ~ wt + hp + disp`:

In [1]:
data(mtcars)
mod <- lm(mpg ~ wt + hp + disp, data=mtcars)
summary(mod)


Call:
lm(formula = mpg ~ wt + hp + disp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.891 -1.640 -0.172  1.061  5.861 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.105505   2.110815  17.579  < 2e-16 ***
wt          -3.800891   1.066191  -3.565  0.00133 ** 
hp          -0.031157   0.011436  -2.724  0.01097 *  
disp        -0.000937   0.010350  -0.091  0.92851    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.639 on 28 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8083 
F-statistic: 44.57 on 3 and 28 DF,  p-value: 8.65e-11


Here we can see that each parameter estimate is listed, alongside the standard error estimate, the $t$-statistic and the associated two-tailed $p$-value. Each $t$-statistic is constructed by dividing the estimate by the standard error. As such, each test is an implicit comparison with a proposed population parameter of 0. This is a reasonable null for each slope, as it captures the idea of *no relationship* in the population. For the intercept, the utility of testing for a regression through the origin can be somewhat questionable and is dependent upon context.

To take a single example, consider the `wt` predictor. The estimates are $\hat{\beta}_{1} = -3.80$ and $\text{SE}\left(\hat{\beta}_{1}\right) = 1.066$. The $t$-statistic is then

$$
t = \frac{\hat{\beta}_{1} - 0}{\text{SE}\left(\hat{\beta}_{1}\right)} = \frac{-3.80}{1.066} = -3.565.
$$

This tells us that the difference between our estimated slope and 0 is -3.80 in the original units of the slope (MPG per 1,000 lbs of weight), which is equivalent to 3.565 standard errors. Because the estimated slope is *negative* the $t$-statistic is *also* negative. So this is 3.565 standard errors *below* 0. The associated $p$-value is $0.0013$, which we can also calculate manually in `R` using

In [13]:
t  <- summary(mod)$coefficients[2,"t value"] # t-statistic for wt
df <- mod$df.residual
p  <- 2 * pt(q=abs(t), df=df, lower.tail=FALSE) # absolute value of t for upper-tail
print(p)

[1] 0.001330991


This indicates that, if this slope were 0 in the population, the probability of calculating a $t$-statistic at least as extreme as -3.565 (in either direction) is about 0.13% (around a *tenth* of 1%, or roughly 1 out of 1,000). Fisher would say that this value is significant because $p < 0.05$, but would also emphasise that such a small $p$-value suggests that our data are *very inconsistent* with a null slope of 0 in the population. If we are using $\alpha = 0.05$, Neyman-Pearson would tell us to *reject* the null hypothesis that the slope is 0. In doing so, we would ignore the magnitude of the $p$-value because only the decision rule matters. As we can see, `R` tells us both. We have the $p$-values listed from Fisher, with stars next to them to indicate the Neyman-Pearson decision rule for different values of $\alpha$ (as listed under `Signif. codes`).
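To see both perspectives side by side for every coefficient, the sketch below rebuilds the $t$ and $p$ columns of the table by hand and adds an explicit Neyman-Pearson decision. The object names (`fit`, `t.vals`, `p.vals`, `decision`) are our own, and a two-tailed test with $\alpha = 0.05$ is assumed.

```r
data(mtcars)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

coefs <- summary(fit)$coefficients
est   <- coefs[, "Estimate"]
se    <- coefs[, "Std. Error"]
df.r  <- fit$df.residual

# Fisher: test statistic against a null of 0, and the two-tailed p-value
t.vals <- (est - 0) / se
p.vals <- 2 * pt(q = abs(t.vals), df = df.r, lower.tail = FALSE)

# Neyman-Pearson: binary decision from a pre-specified alpha
alpha    <- 0.05
t.crit   <- qt(p = 1 - alpha / 2, df = df.r)
decision <- ifelse(abs(t.vals) > t.crit, "reject", "fail to reject")

print(data.frame(t = round(t.vals, 3), p = signif(p.vals, 3), decision))
```

The `t` and `p` columns should reproduce those printed by `summary()`, whereas the `decision` column is the only piece of information that a strict Neyman-Pearson analysis would report.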


### Practical Implications
In this example, the *practical* implication of adhering to either the Fisher or Neyman-Pearson approach is minimal. For both `wt` and `hp`, Fisher would say there was strong evidence against the null (though more so for `wt` than `hp`) and Neyman-Pearson would tell us to reject the null for both tests. For `disp`, Fisher would say that there was essentially no evidence *against* the null or, more precisely, that the data are very *compatible* with, or *well-explained* by, the null. Similarly, Neyman-Pearson would tell us that we have failed to reject the null. So, although these two concepts are philosophically at odds, their practical application in this setting has led to similar conclusions. Where the issues appear are not in the extremes, but around the margins of the threshold. In other words, when we get $p = 0.049$, $p = 0.051$ or (heaven help us) $p = 0.050$. We already discussed how the problem of induction has largely been side-stepped by science on the grounds of *pragmatism* and here we see the same phenomenon occurring. Because *practically* the conclusions are very similar a lot of the time, we are happy to push forward mashing these approaches together. However, this conflict does get in the way sometimes and it feels rather unsatisfying to use a method with principles that do not make sense when examined closely, even if we can justify it pragmatically. We have already entered into the world of induction by appealing to pragmatism, and here we are again having to resolve another logical problem by appealing to pragmatism. [It is pragmatism all the way down...](https://en.wikipedia.org/wiki/Turtles_all_the_way_down).

`````{topic} What do you now know?
In this section, we have explored the final part of NHST in terms of the concept of $p$-values, as well as different perspectives on hypothesis testing. After reading this section, you should have a good sense of:

- The definition of the $p$-value as the probability of a test-statistic as large, or larger, than the one we have calculated, assuming the null hypothesis is true.
- The difference between *one-tailed* and *two-tailed* $p$-values.
- Fisher's concept of the $p$-value as a continuous metric of evidence for the compatibility of the data with the null.
- The idea that $p < 0.05$ was originally a *heuristic* for interpretation, rather than a fixed threshold.
- The conflicting view on hypothesis testing introduced by Neyman-Pearson, who saw statistical inference as a form of decision-making under uncertainty.
- The fact that rejecting the null, concepts of Type I and Type II errors, and statistical power are all part of Neyman-Pearson's framework that was *not* endorsed by Fisher.
- The fact that modern NHST is a somewhat illogical hybrid of these two approaches.
- The idea that justification for this hybrid approach can only really be made on pragmatic grounds.
`````

[^probfoot]: Technically, the probability of an exact value (call it $x^{\ast}$) from a continuous probability distribution is 0. This is not to say it is *impossible*; rather, a single point has no area under the distribution curve. It has a height, given by the probability density at that point, but it stretches only from $x^{\ast}$ to $x^{\ast}$, giving a width of $x^{\ast} - x^{\ast} = 0$.

[^evidencefoot]: This is also true of alternative methods that use cut-off values, even if they are *supposed* to guide interpretation rather than being *rules*. For instance, using "small", "medium" or "large" to categorise effect sizes serves only to *trichotomise* evidence. Similarly, more recent applications of Bayes Factors as an alternative to NHST suffer from the same issue of categorising results into "weak evidence", "moderate evidence", "strong evidence" etc. In all cases, the danger comes from an "interpretational device" turning into blind application of a rule, which is exactly what happened to Fisher's heuristic of $p=0.05$.