

## 12.18 ANOVA for models with categorical independent variables

Another useful application of ANOVA is to test for differences in means between categories of a categorical variable. 

Suppose we are interested in the association between an outcome $Y$ and a categorical variable $X$ with $K$ groups. We have already seen how to define a multivariable linear regression model using dummy variables for this situation. An alternative model, often termed the **ANOVA model**, is as follows: 

Let $y_{ki}$ be the value of the outcome for the $i^{th}$ observation in the $k^{th}$ group ($i=1,...,n_k$ and $k=1,...,K$). The ANOVA model is then defined as:

$$
y_{ki}=\mu_k + \epsilon_{ki} \text{, where } \epsilon_{ki} \sim NID(0,\sigma^2)
$$

Here, $\mu_k$ is the mean of the outcome in the $k^{th}$ group. With this representation, the null and alternative hypotheses are: 

+ $H_0: \mu_k = \mu$ for all $k=1,...,K$ (i.e. the means in all groups defined by the categorical variable are equal to a common value). 
+ $H_1: \mu_k \neq \mu_j$ for at least one pair of groups $k \neq j$ (i.e. the group means are not all equal). 
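As a minimal sketch of what the ANOVA model estimates, the code below uses R's built-in `PlantGrowth` data (an assumption made for illustration; it is not the birthweight data used in these notes). Removing the intercept with `0 +` asks `lm()` to estimate one mean per group, which should reproduce the sample group means.

```r
# Illustrative data: PlantGrowth ships with R (weight = outcome, group = 3-level factor).
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)

# Cell-means (ANOVA) parametrisation: one coefficient per group, no intercept.
fit_means <- lm(weight ~ 0 + group, data = PlantGrowth)

# The estimated coefficients are the estimates of mu_k.
print(group_means)
print(coef(fit_means))
```

The two printed vectors should agree up to naming, since the least-squares estimate of each $\mu_k$ is the sample mean of that group.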

### 12.18.1 Sum of squares for models with categorical variables

For models with a single independent categorical variable the fitted values are simply the group means ($\bar{y}_k$). Under the null hypothesis that the group means are all equal, the fitted values are all equal to the overall mean ($\bar{y}$). The residual sum of squares ($SS_{RES}$) and the sum of squares explained by the model ($SS_{REG}$) therefore take a simple form:

+ $SS_{RES} = \sum_{k=1}^K \sum_{i=1}^{n_k} (y_{ki} - \bar{y}_k)^2$
+ $SS_{REG} = \sum_{k=1}^K \sum_{i=1}^{n_k} (\bar{y}_k - \bar{y})^2 = \sum_{k=1}^K n_k (\bar{y}_k - \bar{y})^2$




In this case, the residual sum of squares is often termed the **within group sum of squares $(SS_{Within})$** and the regression sum of squares is often termed the **between group sum of squares $(SS_{Between})$**.
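The two sums of squares can be computed directly from their definitions. The sketch below again assumes R's built-in `PlantGrowth` data purely for illustration, and checks that the within and between group sums of squares add up to the total sum of squares.

```r
# Illustrative data: PlantGrowth ships with R.
y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean <- mean(y)
fitted_group_mean <- ave(y, g)  # each observation's own group mean (ave() defaults to mean)

ss_within  <- sum((y - fitted_group_mean)^2)           # residual SS
ss_between <- sum((fitted_group_mean - grand_mean)^2)  # regression SS
ss_total   <- sum((y - grand_mean)^2)

# Equivalent form of the between group SS using group sizes n_k.
n_k    <- tapply(y, g, length)
ybar_k <- tapply(y, g, mean)
ss_between_alt <- sum(n_k * (ybar_k - grand_mean)^2)

print(c(within = ss_within, between = ss_between, total = ss_total))
```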

### 12.18.2 The ANOVA table

When there are $K$ groups and $n=\sum_{k=1}^K n_k$ observations in total, the degrees of freedom for the within groups sum of squares is $n-K$ (because the model includes $K$ parameters) and the degrees of freedom for the between groups sum of squares is $K-1$ (because the null model contains a single parameter, the overall mean). Hence the ANOVA table is as follows: 

Source          | d.f.      | SS             | Mean Square                               |
----------------|-----------|----------------|-------------------------------------------|
Between groups  | $K-1$     | $SS_{Between}$ | $MS_{Between}=\frac{SS_{Between}}{K-1}$   |
Within groups   | $n-K$     | $SS_{Within}$  | $MS_{Within}=\frac{SS_{Within}}{n-K}$     |
Total           | $n-1$     | $SS_{TOT}$     | $MS_{TOT}=\frac{SS_{TOT}}{n-1}$           |

Table 4: The ANOVA Table for models with categorical variables
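To see how the table maps onto R output, the sketch below fits a one-factor model to the built-in `PlantGrowth` data (an illustrative assumption, not the data used in these notes) and compares the degrees of freedom and mean squares from `anova()` with values computed by hand. Note that `anova()` prints the between groups row (named after the factor) and the `Residuals` row, but not the Total row.

```r
# Illustrative data: PlantGrowth ships with R.
fit <- lm(weight ~ group, data = PlantGrowth)
tab <- anova(fit)
print(tab)

n <- nrow(PlantGrowth)            # total number of observations
K <- nlevels(PlantGrowth$group)   # number of groups

# Degrees of freedom and mean squares rebuilt from the definitions in the table.
df_by_hand <- c(between = K - 1, within = n - K)
ms_by_hand <- tab[["Sum Sq"]] / df_by_hand
print(df_by_hand)
print(ms_by_hand)
```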


### 12.18.3 The F-test

To test the null hypothesis that the means in all groups are equal to a common value, the appropriate $F$-statistic is:

$$
F = \frac{MS_{Between}}{MS_{Within}} \sim F_{K-1,\, n-K} \text{ under } H_0.
$$

If this test yields a small $p$-value, we have evidence that the group means are not all equal. However, the test does not tell us which group means differ from which others. For this reason, if the $F$-test does provide evidence of a difference in means, we may want to follow up with further analysis, such as pairwise comparisons of means, each restricted to two groups. 
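The $F$-statistic and its $p$-value can be reproduced from the mean squares, as in the sketch below. It assumes the built-in `PlantGrowth` data for illustration, and the choice of the pair `trt1` versus `trt2` for the follow-up comparison is arbitrary.

```r
# Illustrative data: PlantGrowth ships with R.
fit <- lm(weight ~ group, data = PlantGrowth)
tab <- anova(fit)

# F = MS_Between / MS_Within, referred to an F distribution with (K-1, n-K) d.f.
f_stat  <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
p_value <- pf(f_stat, df1 = tab$Df[1], df2 = tab$Df[2], lower.tail = FALSE)
print(c(F = f_stat, p = p_value))

# One possible follow-up: a comparison restricted to two of the groups.
two_groups <- droplevels(subset(PlantGrowth, group %in% c("trt1", "trt2")))
pairwise <- t.test(weight ~ group, data = two_groups, var.equal = TRUE)
print(pairwise)
```

The upper tail of the $F$ distribution is used because large values of $F$ (between group variation large relative to within group variation) count as evidence against $H_0$.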

*Example.* Using the birthweight data, we conduct an $F$-test to compare the mean birthweight of babies whose mothers smoke with that of babies whose mothers do not smoke. 

Let $\mu_1$ and $\mu_0$ denote the mean birthweight for babies whose mothers do smoke and don't smoke, respectively. Then, the relevant hypotheses are: 

+ $H_0: \mu_1= \mu_0$ (i.e. the birthweight of a baby does not depend on whether the mother smoked)
+ $H_1: \mu_1\neq \mu_0$

Recall that we previously defined Model 2 to relate birthweight and mother's smoking status: 

$$ 
y_i = \alpha_0 + \alpha_1 s_i + \epsilon_i
$$

where $y_i$ denotes the birthweight of the $i^{th}$ baby and

$$ 
s_{i} =
\begin{cases}
    1 & \text{ if the mother smokes} \\
    0 & \text{ if the mother does not smoke}
\end{cases} 
$$

We can rewrite this equation using the ANOVA model as follows: 

$$
\begin{align}
    y_{1i} &=\mu_1 + \epsilon_{1i}  \text{ if the mother smokes} \\
    y_{0i} &=\mu_0 + \epsilon_{0i}  \text{ if the mother does not smoke}
\end{align} 
$$

where $y_{ki}$ is the birthweight of the $i^{th}$ baby in the $k^{th}$ group (groups are defined by mother's smoking status), $\mu_1 = \alpha_0 + \alpha_1$ and $\mu_0=\alpha_0$. In other words, the null hypothesis $\mu_1=\mu_0$ can be rewritten as $\alpha_1=0$. 
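The correspondence between the two parametrisations can be checked numerically. The sketch below uses simulated data, not the birthweight data: the variable names `s` and `y`, the sample size and the numerical values are all arbitrary assumptions chosen for illustration.

```r
# Simulated data (hypothetical values, not the birthweight data).
set.seed(1)
n <- 200
s <- rbinom(n, 1, 0.4)                    # 1 = smoker, 0 = non-smoker
y <- 3.4 - 0.3 * s + rnorm(n, sd = 0.5)   # arbitrary group means and spread

# Dummy-variable regression: intercept plus coefficient on s.
fit_dummy <- lm(y ~ s)
coefs <- coef(fit_dummy)

# ANOVA-model quantities: the two group means.
mu0_hat <- mean(y[s == 0])
mu1_hat <- mean(y[s == 1])

print(c(mu0 = mu0_hat, intercept = coefs[[1]]))
print(c(mu1 = mu1_hat, intercept_plus_slope = coefs[[1]] + coefs[[2]]))
```

The intercept should match the mean in the non-smoking group, and the intercept plus the coefficient on `s` should match the mean in the smoking group, so a zero coefficient corresponds to equal group means.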

We can use either ```anova()``` or ```summary()``` to conduct the test in R: 

In [1]:
# model2 is the regression of birthweight on factor(Maternal.Smoker) fitted earlier;
# re-run that cell first if the object is not in the workspace.
anova(model2)
summary(model2)


In the ANOVA table, the ```factor(Maternal.Smoker)``` row gives the between groups results and the ```Residuals``` row gives the within group results. 

The $F$-statistic is equal to 76.02 with a $p$-value equal to $9.46\times10^{-18}$. This is strong evidence that mean birthweight differs between the two groups defined by mother's smoking status. 
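With only two groups, the $F$-test from `anova()` and the $t$-test of the smoking coefficient from `summary()` are the same test: the $F$-statistic is the square of the $t$-statistic and the $p$-values coincide. The sketch below illustrates this on simulated data; the names `smoker` and `bw` and all numerical values are hypothetical and are not taken from the birthweight data.

```r
# Simulated two-group data (hypothetical values, not the birthweight data).
set.seed(2)
n <- 150
smoker <- factor(rbinom(n, 1, 0.4), levels = c(0, 1), labels = c("no", "yes"))
bw <- 3.4 - 0.3 * (smoker == "yes") + rnorm(n, sd = 0.5)

fit <- lm(bw ~ smoker)

# F-test from the ANOVA table.
f_from_anova <- anova(fit)[["F value"]][1]
p_from_anova <- anova(fit)[["Pr(>F)"]][1]

# t-test of the group coefficient from the regression summary.
coef_table <- summary(fit)$coefficients
t_slope <- coef_table["smokeryes", "t value"]
p_slope <- coef_table["smokeryes", "Pr(>|t|)"]

print(c(F = f_from_anova, t_squared = t_slope^2))
print(c(p_anova = p_from_anova, p_t_test = p_slope))
```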

