In [1]:
suppressMessages(library(car))
suppressMessages(library(effects))
suppressMessages(library(emmeans))

# Effect Sizes and Multiple Comparisons Correction



As the final section in this part of the lesson, we need to take a moment to consider the topic of follow-up tests and multiple comparisons. In the previous lesson, we justified the use of omnibus tests as a guard against the multiple testing problem. Even though this guard exists at a higher level, we still run the risk of false positives when we use the omnibus tests to guide the follow-up tests. As such, it is customary to apply some degree of correction to the follow-up tests.

### Corrections in `R`

General access to $p$-value corrections is available in base `R` using the `p.adjust()` function. We simply supply a vector of $p$-values and the name of the method we want to use, and `R` will return the adjusted values. For instance:

In [17]:
p.raw <- c(0.002, 0.765, 0.05, 0.123)
p.adj <- p.adjust(p=p.raw, method="bonferroni")

print(p.adj)

[1] 0.008 1.000 0.200 0.492


In this example, the basic Bonferroni method is used, where each $p$-value is multiplied by the number of $p$-values and any result above 1 is capped at 1. We can see this for ourselves by calculating the correction manually.

In [18]:
# bonferroni correction
n.p   <- length(p.raw)
p.adj <- p.raw * n.p 

# make sure all p <= 1
for (i in 1:n.p){
    if (p.adj[i] > 1){
        p.adj[i] <- 1
    }
}

print(p.adj)

[1] 0.008 1.000 0.200 0.492
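
The loop above can also be written in vectorised form. The following is a minimal sketch, assuming the same `p.raw` values as before, that uses base `R`'s `pmin()` to cap the scaled values at 1 and then checks the result against `p.adjust()`. The name `p.bonf` is illustrative only.

```r
# same raw p-values as in the example above
p.raw <- c(0.002, 0.765, 0.05, 0.123)

# multiply by the number of tests, then cap each value at 1
p.bonf <- pmin(1, p.raw * length(p.raw))

# compare with the built-in implementation
print(all.equal(p.bonf, p.adjust(p.raw, method="bonferroni")))
```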


In terms of other adjustment methods, `R` has 6 built-in possibilities that can be listed by printing `p.adjust.methods`

In [19]:
print(p.adjust.methods)

[1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
[6] "BY"         "fdr"        "none"      


noting that `fdr` is an alias for `BH`, and `none` is just a pass-through option that does nothing. Going into the details of all of these is somewhat beyond this lesson, so a very general heuristic is to use the `holm` method as the most general-purpose approach. It is more powerful than `bonferroni`[^bonf-foot], but requires no additional assumptions and remains valid however the $p$-values are related to each other. If the tests are independent or *positively correlated* (as is often the case with repeated measurements), then `hochberg` is an alternative that offers slightly more power, at the cost of relying on that assumption.
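
To make the `holm` method less of a black box, here is a minimal sketch of the step-down calculation done by hand, using the same `p.raw` values as earlier. The intermediate names (`ord`, `p.sorted`, `p.holm`) are illustrative only, and the final line checks the result against `p.adjust()`.

```r
# same raw p-values as in the earlier examples
p.raw <- c(0.002, 0.765, 0.05, 0.123)
n.p   <- length(p.raw)

# sort the p-values from smallest to largest
ord      <- order(p.raw)
p.sorted <- p.raw[ord]

# smallest p is multiplied by n, the next by n-1, ..., the largest by 1
p.holm.sorted <- p.sorted * (n.p - seq_len(n.p) + 1)

# adjusted values may not decrease along the ordering, nor exceed 1
p.holm.sorted <- pmin(1, cummax(p.holm.sorted))

# return the adjusted values to their original positions
p.holm      <- numeric(n.p)
p.holm[ord] <- p.holm.sorted

print(p.holm)
print(all.equal(p.holm, p.adjust(p.raw, method="holm")))
```

Because only the smallest $p$-value receives the full multiplier, each Holm-adjusted value is never larger than its Bonferroni counterpart, which is where the extra power comes from.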

### Corrections and Families of Tests in `emmeans`
In terms of using these corrections within `emmeans`, it is as simple as using the `adjust` argument to name the correction that we want applied. As an example, using the `holm` method with the follow-up tests of the `vs:origin` interaction gives:

In [20]:
emm <- emmeans(mod, pairwise ~ vs|origin, adjust="holm")
print(emm$contrasts)

origin = Europe:
 contrast              estimate   SE df t.ratio p.value
 Straight - (V-shaped)     6.20 2.24 26   2.765  0.0103

origin = Japan:
 contrast              estimate   SE df t.ratio p.value
 Straight - (V-shaped)    -7.40 3.79 26  -1.954  0.0616

origin = USA:
 contrast              estimate   SE df t.ratio p.value
 Straight - (V-shaped)     6.02 2.73 26   2.203  0.0367



Now, you would be forgiven for thinking that nothing has changed here. And, in fact, you would be right. In this instance, `emmeans` has not applied any correction, despite our request. So what is going on?

The answer is that the concept of multiple comparisons is not always as straightforward as it might seem. ...


... Each level of the second factor is taken to define a family of tests, independent from other levels. We can see this if we swap the tests defined earlier to look at the effects of `origin` at each level of `vs`

In [21]:
emm <- emmeans(mod, pairwise ~ origin|vs, adjust="holm")
print(emm$contrasts)

vs = Straight:
 contrast       estimate   SE df t.ratio p.value
 Europe - Japan    -4.14 2.81 26  -1.473  0.3056
 Europe - USA       3.70 2.81 26   1.316  0.3056
 Japan - USA        7.83 3.39 26   2.312  0.0869

vs = V-shaped:
 contrast       estimate   SE df t.ratio p.value
 Europe - Japan   -17.73 3.39 26  -5.234  <.0001
 Europe - USA       3.52 2.14 26   1.641  0.1128
 Japan - USA       21.25 3.21 26   6.611  <.0001

P value adjustment: holm method for 3 tests 


So, `emmeans` takes this as two families, each containing 3 tests. Corrections to the $p$-values are then applied *within* families.
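
The effect of the family definition can be illustrated with base `R` alone. The sketch below takes the three `vs|origin` $p$-values printed earlier (copied by hand and therefore rounded to 4 decimal places) and applies `holm` in two ways: once within three families of a single test, and once across a single family of three tests. The names `p.vs`, `p.within` and `p.across` are illustrative only.

```r
# p-values for Straight vs V-shaped within each origin,
# copied (rounded) from the printed vs|origin output
p.vs <- c(Europe=0.0103, Japan=0.0616, USA=0.0367)

# three families of one test each: Holm leaves every value unchanged
p.within <- sapply(p.vs, p.adjust, method="holm")

# one family of three tests: Holm now alters the values
p.across <- p.adjust(p.vs, method="holm")

print(rbind(within=p.within, across=p.across))
```

With one test per family there is nothing to correct, which is why the earlier output looked untouched. Within `emmeans` itself, the grouping used for adjustment is tied to the `by` variable, so something like `summary(emm$contrasts, by=NULL, adjust="holm")` should treat all the contrasts as a single family. Take this as a pointer to check against the package documentation rather than as output reproduced in this lesson.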

### Planned Comparisons vs Post-hoc Tests
... Beware of this reasoning. Although planned tests are more *credible* in the sense that you are not $p$-hacking, this does not change anything about the error rate. The error rate does not magically know what your intentions were and then change itself. It is a fact of multiple testing that does not change with intent. As such, even if you have pre-specified tests, you still need to control for multiple comparisons.
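
To put a number on this, the sketch below computes the chance of at least one false positive across `m` tests when every null hypothesis is true. It assumes the tests are independent and that each uses $\alpha = 0.05$, which is a simplification. The names `alpha`, `m` and `fwer` are illustrative only.

```r
alpha <- 0.05
m     <- c(1, 3, 6)    # number of tests in the family

# probability of at least one false positive among m independent tests
fwer <- 1 - (1 - alpha)^m

print(round(fwer, 3))
```

Nothing in this calculation depends on whether the tests were planned in advance. Only the number of tests enters the formula.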

[^fake-foot]: Remember, these data are *entirely* fabricated. There is nothing magical about Japanese cars that makes V-shaped engines super efficient in terms of MPG.

[^modmat-foot]: You do not need to construct these weights manually, as shown in the code. Instead, you can get the coding for the *whole* model by using the `model.matrix()` function and passing in the model object. This will show you the *exact* coding for every data point in the model.

[^cohen-foot]: Cohen's aim was to get practitioners away from $p$-values and the dichotomisation of evidence. So, instead, he decided to *trichotomise* his measure of effect size into *Small = 0.2*, *Medium = 0.5* and *Large = 0.8*. In Cohen's defence, this was likely a compromise position to try and get his effect size adopted by researchers who do not like subtlety and nuance in their results and would rather just have rules to follow. This is something you will come up against time and time again in the wider world of research.

[^neg-foot]: The fact that this is `-1` rather than `1` makes no difference. It just reverses the direction of the comparison.

[^followup-foot]: This is the approach often advocated in Psychology to follow-up an ANOVA, but it is far from the most appropriate.

[^bonf-foot]: Bonferroni is a very common correction where each $p$-value is multiplied by the number of tests.

[^interact-foot]: Alternatively, we can think of this as indicating that the differences in MPG between `Japan`, `USA` and `Europe` depend upon whether the engine is `Straight` or `V-shaped`. Although interactions can be interpreted different ways around, for most problems there is usually one way of conceptualising the interaction that is more intuitive or useful than the other.

[^regmc-foot]: Though we should be aware that the regression tests are not corrected for multiple comparisons.