# One-way ANOVA
In the previous section, we saw how dummy variables can be incorporated into a regression model to represent a categorical predictor. We also saw how this can produce *identical* results to a traditional $t$-test. Now it is time to take this further and see how dummy variable regression can also be equivalent to a traditional Analysis of Variance (ANOVA). The starting point for this is the basic *One-way ANOVA*, where we have a *single* categorical predictor with $> 2$ levels. As we will see, this is simply a generalisation of the $t$-test we have already seen.

Before we begin, it is important to set expectations and address a common misunderstanding. When you think of an ANOVA, you may automatically think of an ANOVA *table*, like the one shown below

| Source         | df  | SS      | MS      | F     | p-value |
|----------------|-----|---------|---------|-------|---------|
| Between Groups | 2   | 24.67   | 12.33   | 5.22  | 0.012   |
| Within Groups  | 27  | 63.83   | 2.36    |       |         |
| Total          | 29  | 88.50   |         |       |         |

However, this is *not* the ANOVA. In reality, an ANOVA is a very basic *linear model* concerned with *group means*. The ANOVA table is simply a convenient way to organise the arithmetic of hypothesis tests. The term *Analysis of Variance* comes from the way that these hypothesis tests are determined, but the estimates that these hypotheses are actually about come from a *model*. Traditional statistical education in Psychology tends to skip the actual model and go straight to the hypothesis tests. But this is not what we are going to do. Instead, we will ignore the ANOVA table for the moment and start by focusing on the *true form* of the ANOVA as a *basic linear model of group means*.

## One-way ANOVA as a Linear Model
The linear model at the heart of the one-way ANOVA is

$$
\begin{align*}
    y_{ij} &= \mu + \alpha_{j} + \epsilon_{ij} \\
    \epsilon_{ij} &\sim \mathcal{N}\left(0,\sigma^{2}\right).
\end{align*}
$$

Or, equivalently

$$
y_{ij} \sim \mathcal{N}\left(\mu + \alpha_{j},\sigma^{2}\right).
$$

Here, $\mu$ represents the *grand mean* (the overall mean of all the data) and $\alpha_{j}$ represents the *deviation* from the grand mean to the mean of group $j$. If our categorical predictor has $k$ levels ($j = 1,2,\dots,k$), then the $i\text{th}$ observation from the first level would be denoted $y_{i1}$, the $i\text{th}$ observation from the second level would be denoted $y_{i2}$ and so on. The model given above has $k$ unique model predictions, because $\alpha$ has $k$ different values, one for each level of the categorical predictor. For example, if $j = 1,2,3$ then $k=3$ and we have

$$
\begin{align*}
    E\left(y_{i1}\right) &= \mu + \alpha_{1} = \mu_{1} \\
    E\left(y_{i2}\right) &= \mu + \alpha_{2} = \mu_{2} \\
    E\left(y_{i3}\right) &= \mu + \alpha_{3} = \mu_{3}
\end{align*}
$$

When written this way, the term $\alpha$ captures the *effect* of the categorical predictor. This is encoded as deviations from the grand mean. When the group means are very similar, they will all be close to the grand mean and these deviations will be *small*. However, if the group means are quite different from each other, they will be much further away from the grand mean and the deviations will be *larger*. So, the magnitude of the $\alpha_{j}$'s tells us precisely how different the group means are from each other. Indeed, if there were *no* differences between the means then the group means would be the same as the grand mean and $\alpha_{1} = \alpha_{2} = \dots = \alpha_{k} = 0$.

The most important element of this model is that the grand mean plus the effect of the predictor will always equal the group mean. Much like we saw for regression models, we can think of the model equation as a set of *instructions*. For every observation, our starting point is the grand mean $\mu$. The $\alpha_{j}$ then tells us how far we have to travel to reach the mean of group $j$. So $\mu + \alpha_{j} = \mu_{j}$, the mean of group $j$. In order to reach observation $i$ from that point, we need to walk a final $\epsilon_{ij}$ number of steps. So think of the model equation as a map that takes you from the grand mean to the actual data value, via the group means. This makes the model *very simple*. Every data point has a predicted value that is equal to the mean of the group that the data point belongs to. The errors then reflect the differences between the group means and the raw data. 

To make this even clearer, if we say that there are $n = 4$ data points per group then $i = 1,2,3,4$ and the model for the whole dataset is

$$
\begin{alignat*}{2}
&\text{Group 1} \begin{cases}
        y_{11} &= \mu + \alpha_{1} + \epsilon_{11} &&= \mu_{1} + \epsilon_{11} \\
        y_{21} &= \mu + \alpha_{1} + \epsilon_{21} &&= \mu_{1} + \epsilon_{21} \\
        y_{31} &= \mu + \alpha_{1} + \epsilon_{31} &&= \mu_{1} + \epsilon_{31} \\
        y_{41} &= \mu + \alpha_{1} + \epsilon_{41} &&= \mu_{1} + \epsilon_{41} \\
\end{cases} \\
&\text{Group 2} \begin{cases}
        y_{12} &= \mu + \alpha_{2} + \epsilon_{12} &&= \mu_{2} + \epsilon_{12} \\
        y_{22} &= \mu + \alpha_{2} + \epsilon_{22} &&= \mu_{2} + \epsilon_{22} \\
        y_{32} &= \mu + \alpha_{2} + \epsilon_{32} &&= \mu_{2} + \epsilon_{32} \\
        y_{42} &= \mu + \alpha_{2} + \epsilon_{42} &&= \mu_{2} + \epsilon_{42} \\
\end{cases} \\
&\text{Group 3} \begin{cases}
        y_{13} &= \mu + \alpha_{3} + \epsilon_{13} &&= \mu_{3} + \epsilon_{13} \\
        y_{23} &= \mu + \alpha_{3} + \epsilon_{23} &&= \mu_{3} + \epsilon_{23} \\
        y_{33} &= \mu + \alpha_{3} + \epsilon_{33} &&= \mu_{3} + \epsilon_{33} \\
        y_{43} &= \mu + \alpha_{3} + \epsilon_{43} &&= \mu_{3} + \epsilon_{43} \\
\end{cases}
\end{alignat*}.
$$

So, $\mu$ is common across all the model equations (a *constant*), with $\alpha_{j}$ encoding the unique effect of level $j$ of the categorical predictor. Together, these values will always equal the *group mean*.
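The "map" from the grand mean to each data value can also be followed in code. The sketch below simulates a small dataset directly from the model equation. All parameter values (`mu`, `alpha`, `sigma`) are hypothetical and chosen purely for illustration; they do not come from any dataset used in this lesson.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)                       # make the random errors reproducible
mu    <- 10                       # grand mean
alpha <- c(-2, 0.5, 1.5)          # group effects (deviations from mu)
sigma <- 1                        # error standard deviation
n     <- 4                        # observations per group
k     <- length(alpha)            # number of groups

group <- rep(1:k, each = n)       # group index j for every observation
eps   <- rnorm(n * k, mean = 0, sd = sigma)   # errors epsilon_ij
pred  <- mu + alpha[group]        # model prediction: the group mean mu_j
y     <- pred + eps               # y_ij = mu + alpha_j + epsilon_ij

print(data.frame(group      = group,
                 prediction = pred,
                 error      = round(eps, 2),
                 y          = round(y, 2)))
```

Every row within a group shares the same prediction, so the only thing separating observations in the same group is the error term.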

`````{admonition} Why write the ANOVA this way?
:class: tip, dropdown
At first glance, it may seem unnecessarily complicated to write the ANOVA model as 

$$
y_{ij} = \mu + \alpha_{j} + \epsilon_{ij},
$$

rather than 

$$
y_{ij} = \mu_{j} + \epsilon_{ij}.
$$ 

The second form is actually more intuitive and makes it much more explicit that the ANOVA is using *means* as its predicted values. 

However, there are more subtle benefits to a model that uses *deviations* from the grand mean to parameterise the model. To begin with, there is a clearer link here with regression models that contain an intercept (constant) and a slope. This will make mapping the ANOVA on to a regression model simpler. In addition, when we come to add *more predictors* to the ANOVA, each one can be conceptualised as a separate deviation from the grand mean. So, a model with an additional factor can be written as

$$
y_{ijk} = \mu + \alpha_{j} + \beta_{k} + \epsilon_{ijk}
$$

as opposed to

$$
y_{ijk} = \mu_{jk} + \epsilon_{ijk},
$$

where the additional factor is hidden within the subscripts. This becomes even more important when we talk about an ANOVA as a model comparison procedure, because we need to add and remove different predictors from the model. This is much clearer to do when the predictors are represented by explicit terms, rather than hidden in the definition of a mean. Finally, thinking about elements of a model as *deviations* from a constant will be helpful for more complex circumstances, such as mixed-effects models, which we will cover later in the course. 

Aside from practical considerations, there are also some conceptual advantages to spelling the ANOVA out this way. To see why, consider the analogy of the model equation providing travel instructions to reach each data point. From this perspective, the model $y_{ij} = \mu_{j} + \epsilon_{ij}$ is like starting from 0 and then travelling $\mu_{j}$ steps to reach the group mean. The problem here is that most of that journey relates to the *units* of the data, rather than anything meaningful about our experiment. For instance, if we have measured reaction time in milliseconds, we will need to travel thousands of units to reach the data, but most of that journey has nothing to do with our experimental manipulations. If instead we *start* at the grand mean, then this scaling has already been taken care of and we start in a more reasonable position. In a way, the grand mean serves to start us in a decent place, separating out the experimental effects from the magnitude of the units. At the risk of stretching this metaphor too far, it is the difference between instructions for climbing a mountain starting from the base of the mountain, or including all the steps to travel to the base of the mountain from your own front door.
`````

### Numeric Example
To make this particular model parameterisation even clearer, let us insert some actual numbers. In the table below, we list 3 groups with 3 observations each.

| Group | Observations |
| ----- | ------------ |
| A     | 2, 7, 4      |
| B     | 8, 6, 9      |
| C     | 10, 12, 8    |

The estimated group means are then simply the sample means

$$
\begin{align*}
    \hat{\mu}_{1} &= \frac{2 + 7 + 4}{3} = 4.33 \\
    \hat{\mu}_{2} &= \frac{8 + 6 + 9}{3} = 7.66 \\
    \hat{\mu}_{3} &= \frac{10 + 12 + 8}{3} = 10
\end{align*}
$$

and the estimated grand mean is

$$
\hat{\mu} = \frac{4.33 + 7.66 + 10}{3} = 7.33.
$$

In terms of the factor effects, we therefore have

$$
\begin{alignat*}{2}
    \hat{\alpha}_{1} &= \hat{\mu}_{1} - \hat{\mu} = 4.33 - 7.33 &&= -3   \\
    \hat{\alpha}_{2} &= \hat{\mu}_{2} - \hat{\mu} = 7.66 - 7.33 &&= 0.33 \\
    \hat{\alpha}_{3} &= \hat{\mu}_{3} - \hat{\mu} = 10.0 - 7.33 &&= 2.67 
\end{alignat*}
$$

Notice as well that $\sum{\hat{\alpha}_{j}} = 0$, which is an important feature we will come back to later. Inserting all of this into the model equations, we have

$$
\begin{alignat*}{2}
&\text{Group A} \begin{cases}
        2\phantom{0} &= \overbrace{7.33 + (-3)}^{\mu + \alpha_{1}} &+& \overbrace{(-2.33)}^{\epsilon_{i1}} \\
        7\phantom{0} &= 7.33 + (-3) &+& \phantom{(-}2.67  \\
        4\phantom{0} &= 7.33 + (-3) &+& (-0.33) \\
\end{cases} \\
&\text{Group B} \begin{cases}
        8\phantom{0} &= \overbrace{7.33 + 0.33}^{\mu + \alpha_{2}} &+& \overbrace{\phantom{(-}0.34\phantom{)}}^{\epsilon_{i2}} \\
        6\phantom{0} &= 7.33 + 0.33 &+& (-1.66) \\
        9\phantom{0} &= 7.33 + 0.33 &+& \phantom{(-}1.34 \\
\end{cases} \\
&\text{Group C} \begin{cases}
        10 &= \overbrace{7.33 + 2.67}^{\mu + \alpha_{3}} &+& \overbrace{\phantom{(-}0.00\phantom{)}}^{\epsilon_{i3}} \\
        12 &= 7.33 + 2.67 &+& \phantom{(-}2.00 \\
        8\phantom{0} &= 7.33 + 2.67 &+& (-2.00) \\
\end{cases}
\end{alignat*}.
$$

It is hopefully now clear that $\mu$ is constant across all observations, $\alpha_{j}$ is unique to each group and $\epsilon_{ij}$ is unique to each observation. In other words, $\mu$ captures a universal truth of the outcome variable, $\alpha_{j}$ captures how this truth changes across the different groups and $\epsilon_{ij}$ captures the random aberrations from this truth that are unique to each measurement.
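The arithmetic of this example can be reproduced in a few lines of `R`. This is only a sketch using the nine values from the table above; the object names (`mu.j`, `alpha`, `eps`) are illustrative choices rather than anything used elsewhere in the lesson.

```r
# Data from the numeric example: 3 groups with 3 observations each
y     <- c(2, 7, 4,   8, 6, 9,   10, 12, 8)
group <- factor(rep(c("A", "B", "C"), each = 3))

mu.j  <- sapply(split(y, group), mean)   # estimated group means
mu    <- mean(mu.j)                      # grand mean (mean of the group means)
alpha <- mu.j - mu                       # deviations from the grand mean
eps   <- y - mu.j[as.character(group)]   # errors: data minus own group mean

print(round(mu.j, 2))
print(round(mu, 2))
print(round(alpha, 2))
print(round(sum(alpha), 10))             # zero, up to floating-point rounding

# Rebuild every observation as mu + alpha_j + epsilon_ij
print(data.frame(group = group,
                 mu    = round(mu, 2),
                 alpha = round(alpha[as.character(group)], 2),
                 eps   = round(eps, 2),
                 y     = y))
```

Note that `R` rounds the mean of group B to 7.67, whereas the worked example above truncates it to 7.66, so the second decimal place of some values may differ slightly.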

## Connecting Dummy Variable Regression to the ANOVA Model
We have now seen in more detail the theory behind the ANOVA model. However, we have yet to explicitly connect this with the dummy variable regression model we saw in the previous part of this lesson. To do so, notice that there is nothing in the definition above that states how many values $j$ should take. This is important because both the one-sample and two-sample $t$-tests are *special cases* of the one-way ANOVA model. In the *one-sample* case, we set $j = 1$ and we have 

$$
y_{i} = \mu + \epsilon_{i},
$$

which is about as simple as any linear model can get[^j-foot]. In the *two-sample* case, we set $j = 1,2$ and we have 

$$
\begin{align*}
    y_{i1} &= \mu + \alpha_{1} + \epsilon_{i1} \\
    y_{i2} &= \mu + \alpha_{2} + \epsilon_{i2},
\end{align*}
$$

which is even more clearly a simplification of the one-way ANOVA. From a modelling perspective, there is therefore *no difference* between a $t$-test and an ANOVA. In fact, giving these two procedures different names is almost entirely unnecessary. 
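This equivalence is easy to check numerically. The sketch below reuses groups A and C from the numeric example as a two-group dataset (an arbitrary choice made only for illustration) and compares a classical equal-variance $t$-test with the same comparison fitted through `lm()`.

```r
# Two groups only: A and C from the numeric example
y <- c(2, 7, 4, 10, 12, 8)
g <- factor(rep(c("A", "C"), each = 3))

tt  <- t.test(y ~ g, var.equal = TRUE)   # classical two-sample t-test
fit <- lm(y ~ g)                         # the same comparison as a linear model

t.lm <- summary(fit)$coefficients[2, "t value"]   # test on the group effect
p.lm <- summary(fit)$coefficients[2, "Pr(>|t|)"]
F.lm <- anova(fit)["g", "F value"]                # ANOVA F for the factor

# The t statistics differ only in sign (A - C versus C - A)
print(c(t.test = unname(tt$statistic), lm = t.lm))
print(c(t.test = tt$p.value, lm = p.lm))
print(c(F = F.lm, t.squared = t.lm^2))            # F equals t squared
```

The sign flips because `t.test()` reports the first group minus the second, whereas the model coefficient is the second group minus the first. The $p$-values are the same either way.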

Seeing that the two-sample $t$-test is equivalent to a simplified one-way ANOVA provides us with a good opportunity to make a more explicit connection between the one-way ANOVA model

$$
y_{ij} = \mu + \alpha_{j} + \epsilon_{ij}
$$

and the simple regression model

$$
y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}.
$$

You may *almost* be able to see it, as these two models are very close to each other, save for the use of $\mu$ and $\alpha$ instead of $\beta$, and the inclusion of $x$ in the regression model. So, we just need to do a little bit more work before these specifications become *identical*.

### Model Parameterisation and Constraints
One of the problems with ANOVA models is that there are many different ways of writing them, using different numbers of parameters. These different options are known as model *parameterisations*. We have already seen two examples, where we have chosen to write the model as $E(y_{ij}) = \mu + \alpha_{j}$ instead of $E(y_{ij}) = \mu_{j}$. Remember, the most important element of an ANOVA model is that the model predictions are always the *group means*. These are the least-squares estimates as well as the maximum likelihood estimates. So, no matter how we want to write the model, the best estimates will always be the sample means. 
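As a quick check on this claim, the sketch below fits the data from the numeric example in two ways: once with one parameter per group mean (removing the intercept with `0 +`) and once with an intercept plus effects, using the default coding in `R`. The coefficients differ, but the fitted values are the sample means in both cases.

```r
# Data from the numeric example
y <- c(2, 7, 4,   8, 6, 9,   10, 12, 8)
g <- factor(rep(c("A", "B", "C"), each = 3))

fit.means   <- lm(y ~ 0 + g)   # E(y_ij) = mu_j: one parameter per group
fit.effects <- lm(y ~ g)       # intercept plus effects (default coding in R)

print(coef(fit.means))         # the three sample means
print(coef(fit.effects))       # different numbers, same model

# Fitted values agree with each other and with the group sample means
print(cbind(group.mean = ave(y, g),
            means      = fitted(fit.means),
            effects    = fitted(fit.effects)))
```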

With this in mind, consider possible solutions for $\mu$ and $\alpha_{j}$ when we have

$$
E(y_{ij}) = \mu + \alpha_{j} = \mu_{j}.
$$

In effect, we want two numbers that add up to $\mu_{j}$, the sample mean. The problem is that there are literally an *infinite* number of possible choices here. For instance, if $\mu_{1} = 6$, we could choose $\mu = 5$ and $\alpha_{1} = 1$, or $\mu = 200$ and $\alpha_{1} = -194$, or $\mu = -3{,}467$ and $\alpha_{1} = 3{,}473$ and so on. This is known as the model being *overparameterised* or *unidentifiable*. If we try to use this model in a computer, the computer will not be able to find a single unique set of parameters because the options are, quite literally, *limitless*. This is not just a problem in a computer. If you tried to find the parameter values that maximise the likelihood, you would find yourself dealing with an infinite number of possibilities.
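The following sketch makes the problem concrete. The group means in `mu.j` are hypothetical (only the first value, 6, comes from the example in the text). For each of three arbitrary choices of $\mu$, the matching $\alpha_{j}$ are found by subtraction, and the resulting predictions are identical every time.

```r
# Hypothetical group means; the first matches the example in the text
mu.j <- c(6, 8, 10)

# Three arbitrary choices for mu, each paired with the alphas that "fix" it
for (mu in c(5, 200, -3467)) {
  alpha <- mu.j - mu
  cat("mu =", mu,
      "| alpha =", alpha,
      "| mu + alpha =", mu + alpha, "\n")
}
```

Because every line ends with the same predictions, the data alone cannot tell these parameter sets apart.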

To stop this from happening and to force the model to choose one possibility out of the infinity of options, we have to impose a *constraint*. Think of this as adding some *rule* to finding parameters that allows the computer (or person) to settle on one possible solution. Although these constraints could be *anything*, we typically choose between constraints that lead to useful values of the parameters. Whilst $\mu = -3{,}467$ and $\alpha_{1} = 3{,}473$ would work (because the predicted value is still $\mu_{1}$), neither of these values is particularly useful or meaningful. However, if we impose the constraint that $\sum{\alpha_{j}} = 0$, then the *only* choice is that $\mu$ is the grand mean and the $\alpha_{j}$ are deflections from the grand mean[^constraint-foot]. These are much more useful parameters because they actually tell us something about the data that we are interested in.

Imposing $\sum{\alpha_{j}} = 0$ is one constraint we can use, but there are many others. In fact, the use of dummy variables is a different form of constraint. Instead of imposing $\sum{\alpha_{j}} = 0$, the dummy variables actually impose that one of the factor effects is equal to 0. For instance, if we have $j = 1,2$ and set $\alpha_{1} = 0$, then we have

$$
\begin{alignat*}{2}
    E\left(y_{i1}\right) &= \mu + \alpha_{1} = \mu &&= \mu_{1} \\
    E\left(y_{i2}\right) &= \mu + \alpha_{2} = \mu + \alpha_{2} &&= \mu_{2}.
\end{alignat*}
$$

This means that $\mu$ *has* to be the mean of the first group, which then forces $\alpha_{2}$ to be the *mean difference*. This is the *only way* that this combination of parameters with this constraint can ever equal the group means. So, the correct way of viewing a regression model containing dummy variables is as an ANOVA model with a *built-in constraint* that produces useful parameter estimates[^R-constraints-foot].
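To see both constraints side by side, the sketch below fits the data from the numeric example twice using the built-in contrast functions `contr.sum` (which imposes $\sum{\alpha_{j}} = 0$) and `contr.treatment` (which sets the effect of the first level to 0). The object names are illustrative only.

```r
# Data from the numeric example
y <- c(2, 7, 4,   8, 6, 9,   10, 12, 8)
g <- factor(rep(c("A", "B", "C"), each = 3))

# Constraint 1: effects sum to zero
fit.sum <- lm(y ~ g, contrasts = list(g = "contr.sum"))
# Constraint 2: effect of the first level is zero (dummy coding)
fit.trt <- lm(y ~ g, contrasts = list(g = "contr.treatment"))

b.sum <- unname(coef(fit.sum))   # grand mean, alpha_1, alpha_2
b.trt <- unname(coef(fit.trt))   # mean of A, B - A, C - A

print(round(b.sum, 2))
print(round(-sum(b.sum[-1]), 2)) # alpha_3, implied by the sum-to-zero rule
print(round(b.trt, 2))

# Different parameters, identical predictions (the group means)
print(all.equal(fitted(fit.sum), fitted(fit.trt)))
```

Under the sum-to-zero constraint the intercept is the grand mean and the remaining coefficients are the deviations from the worked example. Under the dummy-variable constraint the intercept is the mean of group A and the remaining coefficients are differences from it.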

### The Dummy Variable ANOVA Model
As a final step, we will just make the link between the constrained one-way ANOVA and a dummy variable regression model absolutely explicit, so there can be no doubts that these are *exactly the same model*.

As we saw in the previous part of the lesson, the dummy variable representation of the two-sample $t$-test creates *two* equations for the *two* different categories

$$
\begin{align*}
    E\left(y_{i1}\right) &= \beta_{0} + \left(\beta_{1} \times 0\right) = \beta_{0} \\
    E\left(y_{i2}\right) &= \beta_{0} + \left(\beta_{1} \times 1\right) = \beta_{0} + \beta_{1}
\end{align*}
$$

In other words, every data point from the first category gets a predicted value of $\beta_{0}$ and every data point from the second category gets a predicted value of $\beta_{0} + \beta_{1}$. Putting this in the context of group means, we also know that $\beta_{0} = \mu_{1}$ and $\beta_{1} = \left(\mu_{2} - \mu_{1}\right)$. If we substitute these values in to the regression equations above, we get

$$
\begin{align*}
    E\left(y_{i1}\right) &= \mu_{1} \\
    E\left(y_{i2}\right) &= \mu_{1} + \left(\mu_{2} - \mu_{1}\right) = \mu_{2}
\end{align*}
$$

Putting it all together, we get

$$
\begin{alignat*}{2}
    y_{i1} &= \beta_{0} + \epsilon_{i1} &&= \mu_{1} + \epsilon_{i1} \\
    y_{i2} &= \beta_{0} + \beta_{1} + \epsilon_{i2} &&= \mu_{2} + \epsilon_{i2}
\end{alignat*}
$$

So we know these models are equivalent in terms of the predicted values. However, if we now rename $\beta_{0} \rightarrow \mu$, $\beta_{1} \rightarrow \alpha_{2}$ and assume that $\alpha_{1} = 0$, then we have

$$
\begin{alignat*}{2}
    y_{i1} &= \mu + \epsilon_{i1} &&= \mu_{1} + \epsilon_{i1} \\
    y_{i2} &= \mu + \alpha_{2} + \epsilon_{i2} &&= \mu_{2} + \epsilon_{i2},
\end{alignat*}
$$

which is exactly the one-way ANOVA model with the constraint that $\alpha_{1} = 0$. So, we can now say that the dummy variable regression model is *identical* to a one-way ANOVA model, with the constraint that the $\alpha_{j}$ associated with the *reference category* is equal to 0.
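The renaming can be verified by solving the least-squares problem by hand. The sketch below again uses groups A and C from the numeric example as a stand-in two-group dataset, builds the design matrix from a constant and a single dummy variable, and solves the normal equations directly.

```r
# Two groups (A and C from the numeric example) and a single dummy variable
y <- c(2, 7, 4, 10, 12, 8)
x <- c(0, 0, 0,  1,  1, 1)             # 0 = reference group, 1 = other group

X    <- cbind(intercept = 1, dummy = x)       # design matrix
beta <- solve(t(X) %*% X, t(X) %*% y)         # least-squares estimates

mu      <- beta[1]   # beta_0 renamed: mean of the reference group
alpha.1 <- 0         # the constraint imposed by dummy coding
alpha.2 <- beta[2]   # beta_1 renamed: the mean difference

print(c(mu = mu, alpha.1 = alpha.1, alpha.2 = alpha.2))
print(c(mu.1 = mu + alpha.1, mu.2 = mu + alpha.2))   # the two group means
print(c(mean(y[x == 0]), mean(y[x == 1])))           # sample means, for comparison
```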

## Dummy Variables Coding > 2 Levels
At this point, we should have a sense of the more formal linear model specification of a one-way ANOVA *and* should have a sense of how we can create this same model within the context of linear regression by using dummy variables. However, we have only seen this for cases where the categorical predictor has 2 levels. What happens when our predictor has *more* than 2 levels?

The short answer is that we simply *add more dummy variables*. The general rule is that for $k$ levels of a predictor we need to add $k - 1$ dummy variables to the model. Notice that this fits with what we have already seen. For a two-sample $t$-test, $k = 2$ and we had to add $2 - 1 = 1$ dummy variable to the model. If it were a one-sample $t$-test, we would have $k = 1$ and we would add $1 - 1 = 0$ dummy variables to the model (because we only need an intercept). For the most basic one-way ANOVA case where $k = 3$, we therefore need to add $3 - 1 = 2$ dummy variables to the model. 

`````{admonition} The number of dummy variables and constraints
:class: tip
The number of dummy variables we add is actually directly tied to the model constraint. We add $k-1$ because we are setting one of the effects to 0. So we explicitly model $k-1$ levels of the categorical predictor and let the $k\text{th}$ level fall into the intercept. If we were to add $k$ dummy variables alongside the intercept then the model would become unidentifiable (because the choice of estimates is infinite) and it would not work. `R` will not let you do this when it makes its own dummy variables, but you can try it for yourself and see what happens if you manually create as many dummy variables as factor levels.
`````
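If you want to try this, the sketch below is one way to do it, using the small dataset from the numeric example rather than `mtcars`. The dummy variable names (`d.A`, `d.B`, `d.C`) are made up for this illustration. In `lm()`, the usual symptom of the problem is an `NA` coefficient rather than an outright error.

```r
# Data from the numeric example, with one dummy variable per level
y   <- c(2, 7, 4,   8, 6, 9,   10, 12, 8)
g   <- rep(c("A", "B", "C"), each = 3)
d.A <- as.numeric(g == "A")
d.B <- as.numeric(g == "B")
d.C <- as.numeric(g == "C")

# The intercept column equals d.A + d.B + d.C, so one column is redundant
X <- cbind(1, d.A, d.B, d.C)
print(ncol(X))       # 4 columns ...
print(qr(X)$rank)    # ... but only 3 independent pieces of information

# lm() cannot estimate all four parameters and reports NA for one of them
fit <- lm(y ~ d.A + d.B + d.C)
print(coef(fit))
```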

### Example in `R`
We will examine how this is done in `R` and then use this to discuss the theory in more detail. To begin with, we will redefine our `origin` variable to further split the `Other` category into `Japan` and `Europe`. We also convert this into a factor straight away and check the levels.

In [2]:
data(mtcars)
mtcars$origin <- c('Japan','Japan','USA','USA','USA','USA','USA','Europe','Europe',
                   'Europe','Europe','Europe','Europe','Europe','USA','USA','USA',
                   'Europe','Japan','Japan','Japan','USA','USA','USA','USA',
                   'Europe','Europe','Europe','USA','Europe','Europe','Europe')
mtcars$origin <- as.factor(mtcars$origin)
print(levels(mtcars$origin))

[1] "Europe" "Japan"  "USA"   


So we now have a categorical variable with $k = 3$ levels and the dataset looks like this

In [3]:
print(mtcars)

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb origin
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  Japan
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Japan
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    USA
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    USA
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    USA
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    USA
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    USA
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 Europe
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 Europe
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4 Europe
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4 Europe
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17

Before modelling anything, let us now see how `R` has coded these categories in terms of dummy variables

In [4]:
print(contrasts(mtcars$origin))

       Japan USA
Europe     0   0
Japan      1   0
USA        0   1


So, now we have *two* dummy variables. The first (labelled `Japan`) is a 1 for all the Japanese cars and a 0 for everything else, whereas the second (labelled `USA`) is a 1 for all the USA cars and a 0 for everything else. Although we will not be using manually-defined dummy variables, it can be helpful to create these so you have the clearest sense of what is being put into the model. The code below shows the creation of these two dummy variables. Adding these to the model will result in an *identical* fit to the model we will use a little further below[^tryit-foot].

In [5]:
n         <- length(mtcars$origin)
dummy.JAP <- rep(0,n)
dummy.USA <- rep(0,n)

dummy.JAP[mtcars$origin == "Japan"] <- 1
dummy.USA[mtcars$origin == "USA"]   <- 1

print(dummy.JAP)
print(dummy.USA)

 [1] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0
 [1] 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0
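For readers who want something to check their own attempt against, one possible sketch of the comparison is given below. It is self-contained, so it rebuilds `origin` (the same 32 labels as above, written compactly with `rep()`) inside a copy of the data called `cars`, which is a name introduced only for this illustration.

```r
data(mtcars)
cars <- mtcars

# Same origin labels as in the lesson, written compactly
cars$origin <- factor(rep(
  c("Japan", "USA", "Europe", "USA", "Europe",
    "Japan", "USA", "Europe", "USA", "Europe"),
  times = c(2, 5, 7, 3, 1, 3, 4, 3, 1, 3)))

# Manually-defined dummy variables
cars$dummy.JAP <- as.numeric(cars$origin == "Japan")
cars$dummy.USA <- as.numeric(cars$origin == "USA")

fit.manual <- lm(mpg ~ dummy.JAP + dummy.USA, data = cars)  # our dummies
fit.factor <- lm(mpg ~ origin, data = cars)                 # R's dummies

# Same estimates and same fitted values, only the labels differ
print(cbind(manual = coef(fit.manual), factor = coef(fit.factor)))
print(all.equal(unname(fitted(fit.manual)), unname(fitted(fit.factor))))
```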


Including *two* dummy variables in the model will lead to *three* model parameters (intercept + two dummies), which we will call $\beta_{0}$, $\beta_{1}$ and $\beta_{2}$. As we saw in previous examples, the intercept parameter $\beta_{0}$ will become the *mean of the reference category*. In this example, this is whatever level is coded as a 0 across *both* dummy variables. The two slope parameters $\beta_{1}$ and $\beta_{2}$ will again become *mean differences* relative to the reference category. Without even fitting the model yet, we can work out that

$$
\begin{align*}
    \mu_{\text{EUR}} &= \beta_{0} + (\beta_{1} \times \mathbf{0}) + (\beta_{2} \times \mathbf{0}) = \beta_{0} \\
    \mu_{\text{JAP}}  &= \beta_{0} + (\beta_{1} \times \mathbf{1}) + (\beta_{2} \times \mathbf{0}) = \beta_{0} + \beta_{1} \\
    \mu_{\text{USA}}    &= \beta_{0} + (\beta_{1} \times \mathbf{0}) + (\beta_{2} \times \mathbf{1}) = \beta_{0} + \beta_{2} \\
\end{align*}
$$

The dummy coding has been highlighted here in **bold** to make the clearest connection to the table of dummy values given by `R` earlier. Our interpretation of the parameters in this model is therefore

| Parameter   | Interpretation                   |
|-------------|--------------------------------- |
| $\beta_{0}$ | Mean of `Europe`                 |
| $\beta_{1}$ | Mean difference `Japan - Europe` |
| $\beta_{2}$ | Mean difference `USA - Europe`   |

Linking back with the ANOVA model from earlier, this gives

$$
y_{ij} = \mu + \alpha_{j} + \epsilon_{ij},
$$

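where $j = 1,2,3$ indexes `Europe`, `Japan` and `USA` respectively, and the constraint has been imposed that $\alpha_{1} = 0$, because `Europe` is the reference category. Under this constraint, $\mu = \beta_{0}$, $\alpha_{2} = \beta_{1}$ and $\alpha_{3} = \beta_{2}$. Let us double-check our understanding by fitting this model and examining the output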

In [6]:
origin.mod <- lm(mpg ~ origin, data=mtcars)
summary(origin.mod)


Call:
lm(formula = mpg ~ origin, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8071 -4.1718 -0.7885  3.3444 10.5929 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   21.807      1.343  16.239 4.26e-16 ***
originJapan    3.753      2.618   1.434  0.16238    
originUSA     -5.669      1.935  -2.929  0.00656 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.025 on 29 degrees of freedom
Multiple R-squared:  0.3498,	Adjusted R-squared:  0.3049 
F-statistic:   7.8 on 2 and 29 DF,  p-value: 0.001947


Let us calculate the group means and differences manually and compare them with the model estimates

In [7]:
mu.EUR <- mean(mtcars$mpg[mtcars$origin == "Europe"])
mu.JAP <- mean(mtcars$mpg[mtcars$origin == "Japan"])
mu.USA <- mean(mtcars$mpg[mtcars$origin == "USA"])

beta.0 <- mu.EUR
beta.1 <- mu.JAP - mu.EUR
beta.2 <- mu.USA - mu.EUR

comp.table <- data.frame("Means"     = c(beta.0,beta.1,beta.2),
                         "Estimates" = coef(origin.mod))

print(comp.table)

                Means Estimates
(Intercept) 21.807143 21.807143
originJapan  3.752857  3.752857
originUSA   -5.668681 -5.668681


As we can see, the model fit is exactly as expected from our knowledge of the dummy variables. Furthermore, the automatic tests on the slope parameters are equivalent to performing $t$-tests on two of the three possible pairwise mean differences across the levels of `origin`, with the error variance pooled across all three groups. On this basis, we could immediately draw some conclusions about the differences in MPG between these categories. For instance, if we want to use NHST for inference, we could say that there is a significant difference between the average MPG of USA cars and European cars ($t_{29} = -2.929, p < 0.01$), but not between the average MPG of Japanese cars and European cars ($t_{29} = 1.434, p = 0.16$). What about comparing the USA and Japan? We could simply relevel the `origin` factor and change the reference category in order to generate this test. 
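A minimal sketch of that releveling step is shown below. It is self-contained, so it rebuilds `origin` (the same labels as earlier, written compactly) in a copy of the data. The names `cars`, `origin.jap` and `jap.mod` are hypothetical and used only here.

```r
data(mtcars)
cars <- mtcars

# Same origin labels as in the lesson, written compactly
cars$origin <- factor(rep(
  c("Japan", "USA", "Europe", "USA", "Europe",
    "Japan", "USA", "Europe", "USA", "Europe"),
  times = c(2, 5, 7, 3, 1, 3, 4, 3, 1, 3)))

# Make Japan the reference category, so both slopes are differences from Japan
cars$origin.jap <- relevel(cars$origin, ref = "Japan")
print(levels(cars$origin.jap))      # Japan now comes first

jap.mod <- lm(mpg ~ origin.jap, data = cars)
print(round(summary(jap.mod)$coefficients, 3))

# The USA coefficient is now the USA - Japan mean difference
print(mean(cars$mpg[cars$origin == "USA"]) -
      mean(cars$mpg[cars$origin == "Japan"]))
```

The fitted values and residual variance are unchanged by releveling. Only the meaning of the intercept and slopes changes, which is another instance of the same model under a different constraint.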

`````{admonition} Typical ANOVA Approach
:class: note
An important element to recognise at present is that what we have demonstrated here is not the *typical* approach to an ANOVA. If you think of a more traditional ANOVA, it would be usual to generate an *omnibus* test of the factor as a whole and then consider *follow-up* tests after this. However, there is actually nothing wrong with just jumping straight to the tests on the parameter estimates. After all, this is what we always do in regression models. The insistence on an omnibus test within an ANOVA framework is largely *historical* and a matter of *convention*, rather than mathematical necessity. Nevertheless, omnibus tests are useful because they group together parameter estimates that relate to the same variable. So, in a way, omnibus tests test the *higher-level structure* of the model, before we drill down into the specifics. We will explore this more typical procedure and the nature of omnibus tests in the next part of this lesson.
`````

[^j-foot]: Here, we have removed $j$ because it is constant across all the model equations and thus has become redundant.

[^constraint-foot]: We actually imposed this constraint *backwards* earlier, because we started from the position of saying what the parameter values should be (i.e. the grand mean and deflections from the grand mean) and then noted the constraint that appeared as a result. In reality, what you would do is start with the constraint that *forces* the model estimates to be the values that you want.

[^R-constraints-foot]: In fact, `R` contains different options for how factors are coded, which are effectively different constraints. Each one differs in how it forces the model estimates to be certain values. So, if you do not like the default constraint `R` imposes, you can always change it. We will see more about this later.

[^tryit-foot]: We will leave it as an exercise if you want to validate this for yourself. Remember, the only difference between doing this manually and letting `R` do it automatically is whether other functions inside `R` "know" that the two dummy variables actually represent a single categorical variable. Without this knowledge, everything else inside `R` will treat them as two separate binary variables.