Keegan Korthauer
- Two-way ANOVA or a linear model with interaction
- Two-way ANOVA without interaction: additive models
- Some additional examples
Note that the source Rmd for this document can be found here
Which group means are we comparing in a model with 2 factors?
For simplicity, we first consider only two levels of dev_stage
: E16
and P28
By default, lm
assumes a reference-treatment
effect parametrization. We just need more indicator
variables!!
The names of these parameters and variables may look overwhelming but it helps to think of them as just names for:
$x_{KO,ijk}$ : a indicator variable with value 1 for NrlKO genotype samples (any sample with j=NrlKO), and 0 otherwise. I call this variable$x_{KO}$
$x_{P28,ijk}$ : a different indicator variable with value 1 for P28 samples (any sample with k=P28), and 0 otherwise. I call this variable$x_{P28}$
$\tau_{KO}$ ,$\tau_{P28}$ , and$\tau_{KO:P28}$ : parameters to model the simple effects of genotype (NrlKO), development (P28), and their interaction
Note: in this “simple” version with 2 levels per factor we need only
one indicator variable per factor:
As before, comparisons are relative to a reference but now we have reference levels in both factors: E16 and WT
For any sample
as before
Here is the lm
output table for the two factor fit (extracted using
broom::tidy()
).
twoFactFit <- lm(expression ~ genotype * dev_stage, oneGene)
tidy(twoFactFit)
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.91 0.157 62.9 2.02e-15
## 2 genotypeNrlKO -0.984 0.240 -4.09 1.78e- 3
## 3 dev_stageP28 1.64 0.223 7.35 1.44e- 5
## 4 genotypeNrlKO:dev_stageP28 -2.04 0.328 -6.23 6.47e- 5
Notice that the lm
estimate,
(means.2Fact <- group_by(oneGene, dev_stage, genotype) %>%
summarize(cellMeans = mean(expression)) %>%
ungroup() %>%
mutate(txEffects = cellMeans - cellMeans[1],
lmEst = tidy(twoFactFit)$estimate))
## `summarise()` has grouped output by 'dev_stage'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 5
## dev_stage genotype cellMeans txEffects lmEst
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 E16 WT 9.91 0 9.91
## 2 E16 NrlKO 8.92 -0.984 -0.984
## 3 P28 WT 11.5 1.64 1.64
## 4 P28 NrlKO 8.52 -1.39 -2.04
To show this explicitly, we pull out the lm
estimate for the reference
group (WT E16):
means.2Fact %>%
filter(dev_stage == "E16" & genotype == "WT") %>%
pull(lmEst)
## [1] 9.906954
And now the sample mean of the reference group (WT E16):
means.2Fact %>%
filter(dev_stage == "E16" & genotype == "WT") %>%
pull(cellMeans)
## [1] 9.906954
For any WT sample at E16:
For any KO sample at E16:
Substracting these expectations we get
And its lm
estimate,
To show this explicitly, we pull out the lm
estimate for the KO effect
(diff between E16:NrlKO and E16:WT):
means.2Fact %>%
filter(dev_stage == "E16" & genotype == "NrlKO") %>%
pull(lmEst)
## [1] -0.9844049
And now the differences in sample means between the E16:NrlKO group and the reference group (WT E16):
means.2Fact %>%
filter(dev_stage == "E16" & genotype == "NrlKO") %>%
pull(txEffects)
## [1] -0.9844049
For any WT sample at E16:
For any WT sample at P28:
Substracting these expectations we get
And its lm
estimate,
To show this explicitly, we pull out the lm
estimate for the P28
effect (diff between P28:WT and E16:WT):
means.2Fact %>%
filter(dev_stage == "P28" & genotype == "WT") %>%
pull(lmEst)
## [1] 1.636609
And now the differences in sample means between the E16:NrlKO group and the reference group (WT E16):
means.2Fact %>%
filter(dev_stage == "P28" & genotype == "WT") %>%
pull(txEffects)
## [1] 1.636609
Can we simply add up the simple effect of genotype NrlKO, and the simple effect of developmental stage P28, to get the effect at P28 in NrlKO?? If so, we’d expect the P28:NrlKO group to have a mean predicted by the dotted grey line (i.e.that the effect of KO is the same at E16 as it is at P28:
We see that this does not seem to be the case. This is where the interaction effect comes in. Let’s see what it means mathematically.
For any KO sample at P28:
Using the expectations from above, you can show that:
This term is represented by the blue arrow:
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
And its lm
estimate,
To show this explicitly, we pull out the lm
estimate for the
interaction effect:
means.2Fact %>%
filter(dev_stage == "P28" & genotype == "NrlKO") %>%
pull(lmEst)
## [1] -2.040372
And now the differences in sample means between the E16:NrlKO group and the reference group (WT E16):
mean.E16.WT <- means.2Fact %>% filter(dev_stage == "E16" & genotype == "WT") %>% pull(cellMeans)
mean.E16.KO <- means.2Fact %>% filter(dev_stage == "E16" & genotype == "NrlKO") %>% pull(cellMeans)
mean.P28.WT <- means.2Fact %>% filter(dev_stage == "P28" & genotype == "WT") %>% pull(cellMeans)
mean.P28.KO <- means.2Fact %>% filter(dev_stage == "P28" & genotype == "NrlKO") %>% pull(cellMeans)
(mean.P28.KO - mean.P28.WT) - (mean.E16.KO - mean.E16.WT)
## [1] -2.040372
The interpretation of the coefficients changed when we drop the interaction terms
As before, let’s look at the expectations of each group:
-
for any sample
$i$ with$j=WT$ and$k=E16$ :$x_{KO}=0$ and$x_{P28}=0$ (only$\theta$ remains):$$E[Y_{WT,E16}]=\theta$$ -
for any sample
$i$ with$j=WT$ and$k=P28$ :$x_{KO}=0$ and$x_{P28}=1$ :$$E[Y_{WT,P28}]=\theta + \tau_{P28}$$ -
for any sample
$i$ with$j=NrlKO$ and$k=E16$ :$x_{KO}=1$ and$x_{P28}=0$ :$$E[Y_{KO,E16}]=\theta + \tau_{KO}$$ -
for any sample
$i$ with$j=NrlKO$ and$k=P28$ :$x_{KO}=1$ and$x_{P28}=1$ :$$E[Y_{KO,P28}]=\theta + \tau_{KO} + \tau_{P28}$$
After some simple algebra, you get:
Then,
$$\tau_{P28}= (\text{Eff}{P28|WT} + \text{Eff}{P28|KO})/2$$ is the
average effect of P28 over the levels of genotype
!!
Similar for the other main effect.
The intercept parameter is now $\bar{Y} - \bar{x}{ij,KO}\hat{\tau}{KO} - \bar{x}{ij,P28}\hat{\tau}{P28}$
Here is an example gene which has only the effect of developmental stage significant (in WT). The other two terms are not significant (effect of genotype at E16 and interaction).
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.12 0.143 42.8 1.37e-13
## 2 genotypeNrlKO -0.119 0.218 -0.547 5.95e- 1
## 3 dev_stageP28 1.11 0.202 5.47 1.94e- 4
## 4 genotypeNrlKO:dev_stageP28 -0.412 0.298 -1.39 1.93e- 1
Again, the interaction effect is not significant, so there may be a
development effect regardless of the genotype. Or likewise a genotype
effect regardless of developmental stage. We need to test those
hypotheses (main effects) using anova
anova(multFit)
## Analysis of Variance Table
##
## Response: expression
## Df Sum Sq Mean Sq F value Pr(>F)
## genotype 1 0.28444 0.28444 3.4808 0.08896 .
## dev_stage 1 3.11838 3.11838 38.1606 6.933e-05 ***
## genotype:dev_stage 1 0.15690 0.15690 1.9200 0.19331
## Residuals 11 0.89889 0.08172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that there is indeed a significant main effect of developmental stage. But the main effect of genotype is not significant.
but not the interaction… note the almost parallel pattern
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.47 0.127 74.6 3.11e-16
## 2 genotypeNrlKO -0.526 0.194 -2.71 2.02e- 2
## 3 dev_stageP28 0.795 0.180 4.43 1.01e- 3
## 4 genotypeNrlKO:dev_stageP28 -0.209 0.264 -0.791 4.45e- 1
Note that the main effects for both are also significant (but not the interaction).
## Analysis of Variance Table
##
## Response: expression
## Df Sum Sq Mean Sq F value Pr(>F)
## genotype 1 1.29452 1.29452 20.0799 0.0009298 ***
## dev_stage 1 1.81238 1.81238 28.1126 0.0002516 ***
## genotype:dev_stage 1 0.04039 0.04039 0.6265 0.4453874
## Residuals 11 0.70915 0.06447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1