In [1]:
library(car)

Loading required package: carData



# Unbalanced ANOVA Models

```{figure} images/unbalanced-text.webp
---
scale: 80%
align: right
---
```

... Indeed, whole textbooks were written about unbalanced data (as can be seen on the *right*). So this is a topic that deserves some attention, even if it is largely *ignored* by modern teaching in Psychology. There is something of an assumption that the issues of balance have been *solved* and thus do not need considering anymore. However, this is not really true. The "solution" implemented by SAS and SPSS is the Type III sums-of-squares, which researchers continue to use because it is the default[^default-foot]. However, as discussed briefly last week, this approach is highly flawed.

In this part of the lesson, we will dig deeper into the Type I/II/III debate so that you understand what each type of sums-of-squares means, when they are most appropriate to use and what the various arguments are for/against them. In general, we will be recommending Type II for 95% of all use-cases. However, it is important not to just take our word for it. Instead, it is important that you *understand* the difference and can make your own informed judgement.

## The Problem of Imbalance
... Perhaps the most important thing to recognise here is that imbalance is only a problem when we insist on trying to interpret effects that *do not make sense* in the context of the model, such as a main effect in the presence of an interaction. If an interaction effect is *large*, then the main effects make no sense. However, when the interaction effect is *small*, it adds little to the predictive accuracy of the model and should not be there.

The key point here is that all this hassle goes away if we just engage with the idea of *model building* and only interpret tests once we have a suitable model in place.

## The Principle of Marginality
One of the key ways of resolving 

## Type I Sums-of-squares

## Type II Sums-of-squares

## Type III Sums-of-squares

In [2]:
data(mtcars)

# Origin factor
mtcars$origin <- c('Japan','Japan','USA','USA','USA','USA','USA','Europe','Europe',
                   'Europe','Europe','Europe','Europe','Europe','USA','USA','USA',
                   'Europe','Japan','Japan','Japan','USA','USA','USA','USA',
                   'Europe','Europe','Europe','USA','Europe','Europe','Europe')
mtcars$origin <- as.factor(mtcars$origin)

# VS factor
vs.lab <- rep("",length(mtcars$vs)) 
vs.lab[mtcars$vs == 0] <- "V-shaped"
vs.lab[mtcars$vs == 1] <- "Straight"
mtcars$vs <- as.factor(vs.lab)

# Create fake interaction
mpg.fake          <- mtcars$mpg                  # copy mpg
mpg.idx           <- mtcars$origin == "Japan" &
                     mtcars$vs     == "V-shaped" # index of Japan-VShaped cell
mpg.fake[mpg.idx] <- mpg.fake[mpg.idx] + 15      # add constant to all data from that cell
mtcars$mpg.fake   <- mpg.fake   

In [7]:
mod.full <- lm(mpg.fake ~ vs + origin + vs:origin, data=mtcars)

print(drop1(mod.full, scope="vs", test="F"))
Anova(mod.full, type="III")


Single term deletions

Model:
mpg.fake ~ vs + origin + vs:origin
       Df Sum of Sq    RSS     AIC F value  Pr(>F)  
<none>              447.70  96.429                  
vs      1    131.62 579.32 102.676  7.6436 0.01034 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


                Sum Sq Df    F value       Pr(>F)
(Intercept) 4787.31125  1 278.018874 2.114455e-15
vs           131.6172   1   7.643553   0.01033623
origin        92.20887  2   2.677474   0.08763434
vs:origin    181.54541  2   5.271545   0.01197251
Residuals    447.70375 26


So this has not worked. Why? The reason is the form of dummy coding used by `R`. Put simply, the missing term `vs` is absorbed into the `vs:origin` term, making the two models *identical*. In fact, the only way to get this comparison to produce a result for us is to *change* the dummy coding. The fact that we need to adjust an arbitrary element of the model to get the numbers we want should be a clue that this is *not* a sensible comparison to make. Nevertheless, if we change the coding to a form where each term represents an independent element of the variance decomposition, we can get a result here. This form of coding is known as *sum-to-zero* coding, or *sum* coding for short. We do not need to understand this in any great detail, but we do need to highlight that the result depends on the coding. This means that Type III tests only work *under certain ANOVA constraints*. Given that the constraint is not a core component of the model (as the model will fit identically under any arbitrary constraint), this is another clue that this method makes little sense.
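
One way to see the absorption described above is to fit both models directly and compare them. The sketch below is illustrative only: it uses the unmodified built-in `mtcars` data with `gear` standing in for `origin` (an assumption made so that the block is self-contained), and the names `dat`, `full` and `no.vs` are hypothetical.

```r
# Self-contained copy of the data; `gear` stands in for `origin` (assumption)
dat <- data.frame(mpg  = mtcars$mpg,
                  vs   = factor(mtcars$vs),
                  gear = factor(mtcars$gear))

# Full model, and the same model with the `vs` main effect left out of the formula
full  <- lm(mpg ~ vs + gear + vs:gear, data = dat)
no.vs <- lm(mpg ~      gear + vs:gear, data = dat)

# Compare the number of estimated parameters, the residual SS and the fitted values
same.rank   <- full$rank == no.vs$rank
same.rss    <- isTRUE(all.equal(deviance(full), deviance(no.vs)))
same.fitted <- isTRUE(all.equal(fitted(full), fitted(no.vs)))

print(c(same.rank = same.rank, same.rss = same.rss, same.fitted = same.fitted))

# The interaction now carries one `vs` column per level of `gear`
print(colnames(model.matrix(no.vs)))
```

If all three comparisons come back `TRUE`, the two formulas describe the same fitted model, so there is nothing left for a model comparison to test.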

In the example below, we set the coding used for each factor in the call to `lm()`. This *can* be set globally, but then it becomes easy to forget to switch it back again and we may get confused when trying to interpret the model parameters.
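
To make the two codings concrete, the contrast matrices for a three-level factor can be printed directly. The global route mentioned above is also sketched, wrapped so that the previous setting is restored afterwards. This is a minimal sketch, and the variable names `before` and `old` are hypothetical.

```r
# Treatment (dummy) coding: the first level is the reference and gets all zeros
print(contr.treatment(3))

# Sum-to-zero coding: each column sums to zero across the levels
print(contr.sum(3))

# Global alternative: switch the default for unordered factors, then restore it
before <- getOption("contrasts")
old    <- options(contrasts = c("contr.sum", "contr.poly"))
# ... fit any models here that should use sum coding ...
options(old)  # put the previous setting back
```

Setting the coding inside the call to `lm()` avoids the restore step entirely, which is why that route is used in this lesson.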

In [9]:
mod.sum <- lm(mpg.fake ~ vs + origin + vs:origin, data=mtcars, contrasts=list(vs=contr.sum, origin=contr.sum))

print(Anova(mod.sum, type="III"))
print(drop1(mod.sum, scope = "vs", test = "F"))

Anova Table (Type III tests)

Response: mpg.fake
             Sum Sq Df  F value    Pr(>F)    
(Intercept) 13094.1  1 760.4268 < 2.2e-16 ***
vs             14.9  1   0.8631   0.36142    
origin        680.8  2  19.7672 6.033e-06 ***
vs:origin     181.5  2   5.2715   0.01197 *  
Residuals     447.7 26                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Single term deletions

Model:
mpg.fake ~ vs + origin + vs:origin
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              447.70 96.429               
vs      1    14.862 462.57 95.474  0.8631 0.3614


## Resolving the Sums-of-squares Circus
... The truth is that the main reason all this hassle exists is that the neat partition of the ANOVA effects disappears when the data are imbalanced. In order to resolve this, we have to choose a method of partitioning the sums-of-squares. The definitions given above come directly from SAS, whose aim was not some principled statistical derivation that makes sense. Rather, it was to give their users what they wanted: identical ANOVA output irrespective of balance. Because the traditional ANOVA was not seen as an exercise in model building, it was not typical to remove terms that appeared redundant. In order to maintain this completeness, SAS wanted ANOVA tables that contained *all* terms, rather than certain terms disappearing under imbalance. As such, different methods for decomposing these effects were developed and a choice was provided.
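
The loss of the neat partition can be seen in a small sketch. Assuming the unmodified built-in `mtcars` data, with `vs` and `gear` as two unbalanced factors (an illustrative choice, not the example used elsewhere in this lesson), the sequential sums-of-squares from `anova()` change with the order in which the terms enter the model, even though the total stays the same.

```r
# Self-contained copy of the data with two unbalanced factors (illustrative choice)
dat <- data.frame(mpg  = mtcars$mpg,
                  vs   = factor(mtcars$vs),
                  gear = factor(mtcars$gear))

print(table(dat$vs, dat$gear))  # unequal cell counts

# Same additive model, terms entered in a different order
a.vs.first   <- anova(lm(mpg ~ vs + gear, data = dat))
a.gear.first <- anova(lm(mpg ~ gear + vs, data = dat))

# Sum-of-squares attributed to `vs` under each ordering
ss.vs <- c(vs.first  = a.vs.first["vs", "Sum Sq"],
           vs.second = a.gear.first["vs", "Sum Sq"])
print(ss.vs)

# The totals (all terms plus residuals) agree across the two orderings
ss.total <- c(sum(a.vs.first[["Sum Sq"]]), sum(a.gear.first[["Sum Sq"]]))
print(ss.total)
```

With balanced data the two values in `ss.vs` would coincide, and no choice between partitioning methods would be needed.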

From a modern perspective, all this hassle is unnecessary if we engage with the process of *model building*. This is something we will discuss in much greater detail in the machine learning module next semester. However, the idea is very simple. If a term adds little predictive utility, remove it and create the simplest model you can. From this perspective, if the highest-order interaction is *small*, it would be removed and then the lower-order terms become interpretable again. There is no need for Type II tests to make them interpretable *despite* the presence of the interaction term. However, if an interaction is *large*, it stays in the model and we only interpret the highest-order term for each factor. Under this scheme, the whole Type I/II/III debate disappears.

As an example, say we have the model 

$$
Y = A + B + C + AB + AC + BC + ABC.
$$

If the 3-way interaction is uninteresting, we can drop it to form

$$
Y = A + B + C + AB + AC + BC.
$$

Now, if $AC$ and $BC$ are also uninteresting, we can settle on

$$
Y = A + B + C + AB.
$$

We would now interpret the 2-way interaction $AB$ and the main effect $C$. Because we have respected marginality here when building these models, all these terms have interpretable effects.
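
The sequence above can be sketched in code. So that the block is self-contained, it assumes the built-in `npk` data, with `N`, `P` and `K` playing the roles of $A$, $B$ and $C$, and it deletes two rows arbitrarily to create imbalance. The decisions to drop terms are made here purely to mirror the formulas, not because these data necessarily support them.

```r
# Built-in 2x2x2 factorial; drop two rows arbitrarily so cell sizes differ (assumption)
dat <- npk[-c(1, 2), ]

# Y = A + B + C + AB + AC + BC + ABC
m3 <- lm(yield ~ N * P * K, data = dat)

# Drop the 3-way interaction and test what was lost
m2 <- update(m3, . ~ . - N:P:K)
print(anova(m2, m3))

# Drop the two 2-way interactions involving K (the AC and BC terms)
m1 <- update(m2, . ~ . - N:K - P:K)
print(anova(m1, m2))

# Final model: Y = A + B + C + AB
final.terms <- attr(terms(m1), "term.labels")
print(final.terms)
```

Each step removes only terms that are not contained in a higher-order term still in the model, which is what keeps every remaining effect interpretable.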

In [5]:
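# Requires the datarium package: install.packages("datarium")
library('datarium')
library('car')
data(headache)

# Full three-way factorial model, with Type II tests (the default for Anova())
mod <- lm(pain_score ~ gender*risk*treatment, data=headache)
print(Anova(mod))

# Reduced model that respects marginality, under the default treatment coding
mod <- lm(pain_score ~ gender + risk + treatment + risk:treatment, data=headache)

# The same reduced model under sum-to-zero coding, as needed for Type III tests
mod.sum <- lm(pain_score ~ gender + risk + treatment + risk:treatment, data=headache, contrasts=list(gender=contr.sum,risk=contr.sum,treatment=contr.sum))

print(anova(mod))                  # Type I (sequential)
print(Anova(mod))                  # Type II
print(Anova(mod.sum, type="III"))  # Type III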


`````{topic} What do you now know?
In this section, we have explored ... After reading this section, you should have a good sense of:

- ...
- ...
- ...

`````

[^default-foot]: Always be wary of defaults. If there is one way of getting an entire scientific field to adhere to a particular way of doing something without the need for any critical evaluation, simply make it the default in software. Defaults do not automatically hold some higher level of credibility simply because they were the value that the developer picked. Many times these are well-considered, but this is not a *guarantee*. We can easily be led astray by default choices because we do not have to justify using them. This does not have an official name, but we could perhaps call it *the default authority effect*. It is effectively a reversal of the burden of proof: deviating from defaults requires defence, whereas using defaults is treated as neutral. Yet this presupposes that the defaults are normatively sound, which is rarely demonstrated or even documented.