In [1]:
suppressMessages(library(car))
suppressMessages(library(effects))
suppressMessages(library(emmeans))

data(mtcars)

# Origin factor
mtcars$origin <- c('Japan','Japan','USA','USA','USA','USA','USA','Europe','Europe',
                   'Europe','Europe','Europe','Europe','Europe','USA','USA','USA',
                   'Europe','Japan','Japan','Japan','USA','USA','USA','USA',
                   'Europe','Europe','Europe','USA','Europe','Europe','Europe')
mtcars$origin <- as.factor(mtcars$origin)

# VS factor
vs.lab <- rep("",length(mtcars$vs)) 
vs.lab[mtcars$vs == 0] <- "V-shaped"
vs.lab[mtcars$vs == 1] <- "Straight"
mtcars$vs <- as.factor(vs.lab)

# Create fake interaction
mpg.fake          <- mtcars$mpg                  # copy mpg
mpg.idx           <- mtcars$origin == "Japan" &
                     mtcars$vs     == "V-shaped" # index of Japan-VShaped cell
mpg.fake[mpg.idx] <- mpg.fake[mpg.idx] + 15      # add constant to all data from that cell
mtcars$mpg.fake   <- mpg.fake   

mod <- lm(mpg.fake ~ origin + vs + origin:vs, data=mtcars)

# Advanced Uses of `emmeans`


## Multiple Comparison Correction


### Familywise Error
Recall that the justification usually given for omnibus tests is one of error control. It is interesting to note that, historically, this perspective is much more in the Neyman-Pearson tradition than it is in the Fisherian tradition. After all, the notion of the familywise error rate (FWER) is couched in Neyman-Pearson decision making and how often we would make a false-positive decision over repeated tests. Fisher did not deny that calculating many $p$-values increased the chance of seeing small $p$-values. If you do something enough times, rare events will happen. However, he advocated for context and the use of the actual magnitude of the $p$-value to guide interpretation. In contrast, Neyman-Pearson wanted strict error control at a given $\alpha$-level. From their perspective, if every null hypothesis is true and the tests are independent, the probability of making at least one false-positive decision is

$$
\text{FWER} = 1 - (1 - \alpha)^{m},
$$

where $m$ is the number of tests being performed in the "family". For instance, if we only perform one test at $\alpha = 0.05$, we have

$$
\text{FWER} = 1 - (1 - 0.05)^{1} = 0.05.
$$

However, if we perform 100 tests we have

$$
\text{FWER} = 1 - (1 - 0.05)^{100} = 1 - 0.006 = 0.994.
$$

So we go from only a 5% probability of one or more false positives under the null to a 99.4% probability of one or more false positives under the null. As such, using a strict decision rule with a fixed $\alpha$ only controls the error rate when we perform a single test. As soon as we start performing multiple tests, this error rate increases because the chance of seeing a rare event also increases if we do something over and over. It is like buying millions of lottery tickets. The chance of any single ticket winning is very low, but if we buy enough of them we can almost guarantee that *at least one* of the tickets will be a winner. We are simply providing more opportunities for a rare event to occur.
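As a quick numerical check of the formula above, the following sketch evaluates the FWER for a few family sizes. The helper `fwer()` is defined here purely for illustration (it is not a base `R` function), and the calculation assumes that the $m$ tests are independent.

```r
# FWER for m independent tests, each performed at level alpha
# (fwer() is a helper defined here for illustration only)
fwer <- function(m, alpha=0.05) {
    1 - (1 - alpha)^m
}

m.tests  <- c(1, 3, 10, 100)
fwer.tab <- data.frame(m=m.tests, FWER=round(fwer(m.tests), 3))

print(fwer.tab)
```

Even a small family of 3 tests already pushes the probability of at least one false positive to around 14%.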

Fisher acknowledged this situation, but he rejected the idea of fixing the error-rate *a priori* and then adjusting tests to suit that error-rate. Indeed, what he said about this whole idea was

```{epigraph}
...the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.

-- Fisher (Statistical Methods and Scientific Inference, 1956, p.42-45)
```

So Fisher would tell us to interpret the $p$-values in context. We do not choose a single $\alpha$ and then stick dogmatically to it in all circumstances. Rather, our marker for "significance" must be context-bound and relative to the problem at hand.

However, in our modern mish-mash of NHST, it is more typical to do exactly what Fisher *did not* want and fix $\alpha = 0.05$ in all circumstances, adjusting the results of tests to try and maintain this level irrespective of context. This is a deeply Neyman-Pearson idea, though we typically apply such corrections to Fisher's $p$-values (rather than adjusting the critical value of the test, as Neyman-Pearson would want).


### $p$-value Corrections in `R`

General access to $p$-value corrections is available in base `R` using the `p.adjust()` function. We simply supply a vector of $p$-values and the name of the method we want to use and `R` will return the adjusted values. For instance

In [2]:
p.raw <- c(0.002, 0.765, 0.05, 0.123)
p.adj <- p.adjust(p=p.raw, method="bonferroni")

print(p.adj)

[1] 0.008 1.000 0.200 0.492


In this example, the basic Bonferroni method is used, where each $p$-value is multiplied by the number of $p$-values and capped at 1. We can see this for ourselves by calculating this correction manually.

In [3]:
# bonferroni correction
n.p   <- length(p.raw)
p.adj <- p.raw * n.p 

# make sure all p <= 1
p.adj <- pmin(p.adj, 1)

print(p.adj)

[1] 0.008 1.000 0.200 0.492


In terms of other adjustment methods, `R` has 6 built-in possibilities that can be listed by calling `p.adjust.methods`

In [4]:
print(p.adjust.methods)

[1] "holm"       "hochberg"   "hommel"     "bonferroni" "BH"        
[6] "BY"         "fdr"        "none"      


noting that `fdr` is an alias for `BH`, and `none` is just a pass-through option that does nothing. It is somewhat beyond this lesson to go into the details of all of these. So, a very general heuristic is to use the `holm` method as the most general-purpose approach. This is more powerful than `bonferroni`[^bonf-foot], but has no additional assumptions, so it remains valid even when the $p$-values are *correlated* (e.g. from repeated measurements). The `hochberg` method can be slightly more powerful again, but its guarantee only holds when the tests are independent or positively dependent.
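To make the `holm` method less of a black box, the sketch below reproduces it by hand for the same `p.raw` values used earlier. The variable names are our own. The smallest $p$-value is multiplied by the number of tests, the next smallest by one fewer, and so on, with each adjusted value forced to be at least as large as the one before it.

```r
p.raw <- c(0.002, 0.765, 0.05, 0.123)
n.p   <- length(p.raw)

# sort from smallest to largest, remembering the original positions
ord      <- order(p.raw)
p.sorted <- p.raw[ord]

# step-down multipliers: n, n-1, ..., 1
p.step <- p.sorted * (n.p - seq_len(n.p) + 1)

# enforce monotonicity and make sure all p <= 1
p.step <- pmin(cummax(p.step), 1)

# put the adjusted values back in the original order
p.holm      <- numeric(n.p)
p.holm[ord] <- p.step

print(p.holm)
print(p.adjust(p=p.raw, method="holm"))
```

Both calls should print the same values. Compared with the Bonferroni results, only the smallest $p$-value is multiplied by the full number of tests, which is where the additional power comes from.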

### Corrections in `emmeans` and Families of Tests
In terms of using these corrections within `emmeans`, it is as simple as using the `adjust` argument to name the correction that we want applied. As an example, using the `holm` method with the follow-up tests of the `vs:origin` interaction gives

In [5]:
emm <- emmeans(mod, pairwise ~ vs|origin, adjust="holm")
print(emm$contrasts)

origin = Europe:
 contrast              estimate   SE df t.ratio p.value
 Straight - (V-shaped)     6.20 2.24 26   2.765  0.0103

origin = Japan:
 contrast              estimate   SE df t.ratio p.value
 Straight - (V-shaped)    -7.40 3.79 26  -1.954  0.0616

origin = USA:
 contrast              estimate   SE df t.ratio p.value
 Straight - (V-shaped)     6.02 2.73 26   2.203  0.0367



Now, you would be forgiven for thinking that nothing has changed here. And, in fact, you would be right. In this instance, `emmeans` has not applied any correction, despite our request. So what is going on?

The answer is that the concept of multiple comparisons is not always as straightforward as it might seem. ...


... Each level of the second factor is taken to define a family of tests, independent from other levels. We can see this if we swap the tests defined earlier to look at the effects of `origin` at each level of `vs`

In [6]:
emm <- emmeans(mod, pairwise ~ origin|vs, adjust="holm")
print(emm$contrasts)

vs = Straight:
 contrast       estimate   SE df t.ratio p.value
 Europe - Japan    -4.14 2.81 26  -1.473  0.3056
 Europe - USA       3.70 2.81 26   1.316  0.3056
 Japan - USA        7.83 3.39 26   2.312  0.0869

vs = V-shaped:
 contrast       estimate   SE df t.ratio p.value
 Europe - Japan   -17.73 3.39 26  -5.234  <.0001
 Europe - USA       3.52 2.14 26   1.641  0.1128
 Japan - USA       21.25 3.21 26   6.611  <.0001

P value adjustment: holm method for 3 tests 


So, `emmeans` takes this as two families, each containing 3 tests. Corrections to the $p$-values are then applied *within* families.
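The effect of the family definition can be checked by hand. The sketch below is only illustrative: it copies the (rounded) `t.ratio` values and the residual degrees of freedom from the output above, recovers the unadjusted $p$-values, and then applies `p.adjust()` either within each level of `vs` or across all 6 tests at once. The object names are our own.

```r
# t-ratios copied from the printed output (rounded to 3 d.p.)
t.straight <- c(-1.473, 1.316, 2.312)
t.vshaped  <- c(-5.234, 1.641, 6.611)
df.resid   <- 26

# unadjusted two-sided p-values
p.straight <- 2 * pt(-abs(t.straight), df=df.resid)
p.vshaped  <- 2 * pt(-abs(t.vshaped),  df=df.resid)

# two families of 3 tests each (what emmeans did above)
within <- c(p.adjust(p.straight, method="holm"),
            p.adjust(p.vshaped,  method="holm"))

# one family of 6 tests
across <- p.adjust(c(p.straight, p.vshaped), method="holm")

print(round(cbind(within, across), 4))
```

The `within` column should agree with the `emmeans` output up to rounding, whereas the `across` column can never be smaller because each $p$-value now competes with 5 other tests rather than 2. Within `emmeans` itself, it should be possible to request the single-family behaviour by removing the grouping when summarising, e.g. `summary(emm$contrasts, by=NULL, adjust="holm")`.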

`````{admonition} Follow-up tests, post-hoc tests and planned comparisons
:class: tip
As with many elements of statistics, comparisons that we run after the omnibus ANOVA tests have some different names with some subtle interpretational differences. In general, any test performed in an effort to unpack the ANOVA omnibus test is regarded as a *follow-up test*, because it follows the main ANOVA analysis. This is clearly unnecessary if the omnibus test only involves 2 means, so these tests are really for *main effects* of factors with $k > 2$ levels and for interactions. In some instances, this may necessitate further omnibus tests to break down a larger omnibus test. So the core distinction in terminology is not related to the form of the tests, but rather their *intent*. This is the difference between *post-hoc tests* and *planned comparisons*. Post-hoc tests are those where we have found a significant effect and then wish to break it down to find out what is driving it. Planned comparisons are those tests that we decided on *before* running the analysis. For instance, if we had a clinical trial then the only differences we might care about are those between the patients and controls. There are arguments around error control and the type of follow-up test we are using, but we will discuss those further below. For now, the main practical difference is that planned comparisons can make omnibus tests *redundant*. So, if you only have a small number of comparisons you want to make, you can probably skip the omnibus tests entirely and just go straight to those comparisons of interest.

... Beware of this reasoning. Although planned tests are more *credible* in the sense that you are not $p$-hacking, this does not change anything about the error rate. The error does not magically know what your intentions were and then change itself. The error rate is a fact of multiple testing that does not change with intent. As such, even if you have pre-specified tests, you still need to control for multiple comparisons.
``````

## Effect Sizes and Confidence Intervals
...On this basis, it is often not worth worrying about effect sizes on omnibus tests. Indeed, from the perspective of *estimation*, this form of error control is not really the focus. Remember, these ideas come from the Neyman-Pearson school of NHST and are based entirely on the concept of *accepting* or *rejecting* hypotheses over many repeats of an experiment. When focusing on effect sizes, our interpretation relates to whether an effect is *large* or *meaningful* in context. The effect we estimate is what it is, no matter how many other effects we also estimate. The same is true of a confidence interval. We do not need correction here because we are not looking at long-run error control of binary decisions. We simply use the interval to inform us about the *precision* of an estimate. From this perspective, not only are effect sizes on omnibus tests rarely of interest, the whole concept of an omnibus test is debatable. As such, we can go straight from the model to our comparisons of interest, without having to go via an ANOVA table and without having to worry about multiple comparisons.

In [16]:
emm <- emmeans(mod, ~ vs|origin)
print(eff_size(emm, sigma=sigma(mod), edf=df.residual(mod)))

origin = Europe:
 contrast              effect.size    SE df lower.CL upper.CL
 Straight - (V-shaped)        1.49 0.578 26   0.3042    2.682

origin = Japan:
 contrast              effect.size    SE df lower.CL upper.CL
 Straight - (V-shaped)       -1.78 0.946 26  -3.7274    0.161

origin = USA:
 contrast              effect.size    SE df lower.CL upper.CL
 Straight - (V-shaped)        1.45 0.688 26   0.0351    2.865

sigma used for effect sizes: 4.15 
Confidence level used: 0.95 


If these effects were all very similar in magnitude, we might conclude that `origin` seems to make little difference here and collapse over that factor to look at the effects of `vs` alone. However, we need to take care to do this correctly in order to produce effects reflective of the Type II tests, as we will now discuss.

## Type II Follow-up Tests
... If we go back to our original (non-fake) data where there is no 2-way interaction, ...

## Higher-order Interactions


Although we have now seen how to follow up a 2-way interaction with `emmeans`, things get more complex when we have even higher-order interactions in the model. In general, it is not recommended to go beyond a 3-way interaction due to the complexities that come with interpretation. You can, if you wish, but things start getting difficult very quickly. Here, we will demonstrate using `emmeans` to break down a 3-way interaction.

In order to do so, we will be using a different dataset. Within the `datarium` package there is a dataset called `headache` that gives results from a clinical trial of 3 treatments for headaches. The factors available are `gender`, `risk` and `treatment`. The outcome variable is `pain_score`.

In [None]:
library('datarium')
data(headache)
print(head(headache))

# A data frame: 6 × 5
     id gender risk  treatment pain_score
* <int> <fct>  <fct> <fct>          <dbl>
1     1 male   low   X               79.3
2     2 male   low   X               76.8
3     3 male   low   X               70.8
4     4 male   low   X               81.2
5     5 male   low   X               75.1
6     6 male   low   X               73.1


We can proceed using our usual approach, though we will skip the call to `summary()` and the checking of assumptions, in the interests of saving space.

In [None]:
mod <- lm(pain_score ~ gender*risk*treatment, data=headache)
print(Anova(mod))

Anova Table (Type II tests)

Response: pain_score
                       Sum Sq Df F value    Pr(>F)    
gender                 313.36  1 16.1957 0.0001625 ***
risk                  1793.56  1 92.6988   8.8e-14 ***
treatment              283.17  2  7.3177 0.0014328 ** 
gender:risk              2.73  1  0.1411 0.7084867    
gender:treatment       129.18  2  3.3384 0.0422001 *  
risk:treatment          27.60  2  0.7131 0.4942214    
gender:risk:treatment  286.60  2  7.4063 0.0013345 ** 
Residuals             1160.89 60                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


So we can see that there is a significant 3-way interaction. How do we follow this up? We could adapt the `pairwise ~` syntax we used earlier to give

In [None]:
emm <- emmeans(mod, pairwise ~ gender:risk|treatment)
print(emm$contrasts)

treatment = X:
 contrast                 estimate   SE df t.ratio p.value
 male high - female high     13.87 2.54 60   5.463  <.0001
 male high - male low        16.69 2.54 60   6.571  <.0001
 male high - female low      18.58 2.54 60   7.317  <.0001
 female high - male low       2.81 2.54 60   1.108  0.6862
 female high - female low     4.71 2.54 60   1.854  0.2588
 male low - female low        1.90 2.54 60   0.746  0.8778

treatment = Y:
 contrast                 estimate   SE df t.ratio p.value
 male high - female high      1.17 2.54 60   0.459  0.9676
 male high - male low         9.20 2.54 60   3.624  0.0033
 male high - female low      13.98 2.54 60   5.505  <.0001
 female high - male low       8.04 2.54 60   3.165  0.0127
 female high - female low    12.81 2.54 60   5.045  <.0001
 male low - female low        4.78 2.54 60   1.881  0.2471

treatment = Z:
 contrast                 estimate   SE df t.ratio p.value
 male high - female high     -1.35 2.54 60  -0.533  0.9506
 male hig

However, the problem here is that the `gender` and `risk` factors have been merged into a single combined factor, leading to some comparisons that keep one factor constant (e.g. `male high - female high`) and others that change both factors at once (e.g. `male high - female low`).

A better approach is to get `emmeans` to more sensibly combine the levels of the various factors. One approach is to simply generate all the means and then use the `contrast()` function with the `interaction` and `by` options. For instance

In [None]:
emm <- emmeans(mod, ~ gender*risk*treatment)
contrast(emm, interaction=c(gender="pairwise", risk="pairwise"), by="treatment")

treatment = X:
 gender_pairwise risk_pairwise estimate   SE df t.ratio p.value
 male - female   high - low       11.98 3.59 60   3.335  0.0015

treatment = Y:
 gender_pairwise risk_pairwise estimate   SE df t.ratio p.value
 male - female   high - low       -3.61 3.59 60  -1.005  0.3188

treatment = Z:
 gender_pairwise risk_pairwise estimate   SE df t.ratio p.value
 male - female   high - low       -6.03 3.59 60  -1.679  0.0983


This has effectively given us 3 interaction tests, one for each level of treatment. So we can conclude that there is a significant interaction effect between `gender` and `risk`, but only for treatment `X`. We could investigate this further by holding one of the factors in the `interaction` option constant by setting its contrasts to `"identity"`. For example

In [None]:
contrast(emm, interaction=c(gender="pairwise", risk="identity"), by="treatment")

treatment = X:
 gender_pairwise risk_identity estimate   SE df t.ratio p.value
 male - female   high             13.87 2.54 60   5.463  <.0001
 male - female   low               1.90 2.54 60   0.746  0.4583

treatment = Y:
 gender_pairwise risk_identity estimate   SE df t.ratio p.value
 male - female   high              1.17 2.54 60   0.459  0.6477
 male - female   low               4.78 2.54 60   1.881  0.0648

treatment = Z:
 gender_pairwise risk_identity estimate   SE df t.ratio p.value
 male - female   high             -1.35 2.54 60  -0.533  0.5958
 male - female   low               4.68 2.54 60   1.841  0.0705


So we can see that the `gender` effect for treatment `X` is only present within the `high` risk group and not the `low` risk group. There are marginal effects of `gender` for the other treatments, but we need to be careful interpreting these given that the interaction effects for those treatments were not significant. In addition, these effects may not survive multiple comparisons correction, as we will see in the next section.
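
The link between the two sets of output above can be made explicit: each `gender` by `risk` interaction contrast is the difference between the two simple `gender` effects within a treatment. The sketch below checks this using the rounded estimates printed above, so small discrepancies in the final decimal place are expected. The object names are our own.

```r
# simple effects of gender (male - female) copied from the printed output
gender.high <- c(X=13.87, Y=1.17, Z=-1.35)   # at risk = high
gender.low  <- c(X=1.90,  Y=4.78, Z=4.68)    # at risk = low

# interaction contrast = difference of differences
interaction.est <- gender.high - gender.low
print(interaction.est)

# the two simple effects use different cells, so their variances add
se.simple      <- 2.54
se.interaction <- sqrt(se.simple^2 + se.simple^2)
print(round(se.interaction, 2))
```

These values should reproduce the `estimate` and `SE` columns from the `interaction=c(gender="pairwise", risk="pairwise")` output, up to rounding.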


## Custom Contrast Weights
Finally, if we ever have a comparison that we want to make that cannot be expressed using the `emmeans` formula syntax, we always have the option of using our own contrast weights. 
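
As a minimal sketch of what this looks like, the example below uses the built-in `mtcars` data and a one-way model that is *not* part of the analysis above. The comparison (4-cylinder cars against the average of 6- and 8-cylinder cars) and all object names are chosen purely for illustration. The weights are supplied to `contrast()` as a named list, with one weight per estimated marginal mean, in the order that the means are printed.

```r
library(emmeans)

# illustrative one-way model (not part of the earlier analysis)
dat.cyl     <- mtcars
dat.cyl$cyl <- as.factor(dat.cyl$cyl)
mod.cyl     <- lm(mpg ~ cyl, data=dat.cyl)

# estimated marginal means for the 3 levels of cyl (order: 4, 6, 8)
emm.cyl <- emmeans(mod.cyl, ~ cyl)

# custom weights: 4 cylinders vs the average of 6 and 8 cylinders
w   <- c(1, -0.5, -0.5)
con <- contrast(emm.cyl, method=list("4 vs avg(6,8)"=w))
print(con)

# the same estimate by hand: weights multiplied by the cell means
cell.means <- tapply(dat.cyl$mpg, dat.cyl$cyl, mean)
print(sum(w * cell.means))
```

Note that the weights sum to zero, as they should for any comparison between means, and that the hand calculation only matches so directly here because the estimated marginal means of a one-way model are simply the cell means.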

[^fake-foot]: Remember, these data are *entirely* fabricated. There is nothing magical about Japanese cars that makes V-shaped engines super efficient in terms of MPG.

[^modmat-foot]: You do not need to construct these weights manually, as shown in the code. Instead, you can get the coding for the *whole* model by using the `model.matrix()` function and passing in the model object. This will show you the *exact* coding for every data point in the model.

[^cohen-foot]: Cohen's aim was to get practitioners away from $p$-values and the dichotomisation of evidence. So, instead, he decided to *trichotomise* his measure of effect size into *Small = 0.2*, *Medium = 0.5* and *Large = 0.8*. In Cohen's defence, this was likely a compromise position to try and get his effect size adopted by researchers who do not like subtlety and nuance in their results and would rather just have rules to follow. This is something you will come up against time and time again in the wider world of research.

[^neg-foot]: The fact that this is `-1` rather than `1` makes no difference. It just reverses the direction of the comparison.

[^followup-foot]: This is the approach often advocated in Psychology to follow up an ANOVA, but it is far from the most appropriate.

[^bonf-foot]: Bonferroni is a very common correction where each $p$-value is multiplied by the number of tests.

[^interact-foot]: Alternatively, we can think of this as indicating that the differences in MPG between `Japan`, `USA` and `Europe` depend upon whether the engine is `Straight` or `V-shaped`. Although interactions can be interpreted different ways around, for most problems there is usually one way of conceptualising the interaction that is more intuitive or useful than the other.

[^regmc-foot]: Though we should be aware that the regression tests are not corrected for multiple comparisons.