## 12.17 ANOVA for multivariable linear regression

In the context of multivariable linear regression, ANOVA can be used to test whether a more complex model is a better fit than the null model (**the Global F test**), or whether a more complex model is a better fit than a simpler model that includes a subset of the covariates in the complex model (**the partial F test**). Each test requires slight modifications to the ANOVA table defined above and we will discuss these in turn. 

### 12.17.1 The Global F test

The general formulation of the ANOVA table (suitable for simple and multivariable linear regression models) is given in Table 3, where $p$ is the number of covariates in the model. 

Source      | d.f.      | SS         | Mean Square                        | 
------------|-----------|------------|------------------------------------|
Regression  | $p$       | $SS_{REG}$ | $MS_{REG}=\frac{SS_{REG}}{p}$      |
Residual    | $n-(p+1)$ | $SS_{RES}$ | $MS_{RES}=\frac{SS_{RES}}{n-p-1}$  | 
Total       | $n-1$     | $SS_{TOT}$ | $MS_{TOT}=\frac{SS_{TOT}}{n-1}$    | 

Table 3: The ANOVA Table 

Note that this is equivalent to Table 2 when $p=1$. 

The Global F test tests the null hypothesis ($H_0$) that the null model is a better fit than the more complex model against the alternative hypothesis ($H_1$) that the complex model is a better fit. Or, equivalently:

+ $H_0:$ All slope parameters in the complex model are equal to 0.  
+ $H_1:$ At least one of the slope parameters in the complex model is not equal to 0. 

The appropriate $F$ statistic is the ratio $MS_{REG}/MS_{RES}$ (as defined in Table 3). Under the null hypothesis, $F$ follows an $F_{p,(n-(p+1))}$ distribution. 
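The entries of the ANOVA table and the resulting $F$ statistic can also be computed by hand, which makes the degrees of freedom explicit. The following is a minimal sketch only: it assumes R's built-in `mtcars` data and an arbitrary two-covariate model (`mpg ~ wt + hp`), chosen purely for illustration and not taken from the birth weight example.

```r
# Illustrative data: built-in mtcars (an assumption, not the birth weight data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # number of covariates (excludes the intercept)
y <- mtcars$mpg

# Sums of squares
ss_tot <- sum((y - mean(y))^2)   # total
ss_res <- sum(residuals(fit)^2)  # residual
ss_reg <- ss_tot - ss_res        # regression

# Mean squares, using the degrees of freedom from the ANOVA table
ms_reg <- ss_reg / p
ms_res <- ss_res / (n - (p + 1))

# Global F statistic and its p-value from the F distribution on p and n-(p+1) d.f.
f_stat  <- ms_reg / ms_res
p_value <- pf(f_stat, df1 = p, df2 = n - (p + 1), lower.tail = FALSE)

# Compare with the F statistic stored by summary()
f_summary <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
all.equal(unname(f_summary["value"]), f_stat)
```

The hand-computed value should agree with the `F-statistic` line printed by `summary()`, since both are the ratio $MS_{REG}/MS_{RES}$.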

*Example:* We can use ```summary()``` to conduct a global F test for Model 3.

```r
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')

# Create new (centered) variables in our data
data$Gestational.Days.Centered <- data$Gestational.Days - mean(data$Gestational.Days)
data$Maternal.Height.Centered <- data$Maternal.Height - mean(data$Maternal.Height)

# Fit models
model1 <- lm(Birth.Weight ~ Gestational.Days, data = data)
model3 <- lm(Birth.Weight ~ Gestational.Days.Centered + Maternal.Height.Centered, data = data)

# Global F test for Model 3 (reported in the last line of the summary)
summary(model3)
```


```
Call:
lm(formula = Birth.Weight ~ Gestational.Days.Centered + Maternal.Height.Centered, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-53.829 -10.589   0.246  10.254  54.403 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)               119.46252    0.47980 248.983  < 2e-16 ***
Gestational.Days.Centered   0.45237    0.03006  15.051  < 2e-16 ***
Maternal.Height.Centered    1.27598    0.19049   6.698 3.27e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.44 on 1171 degrees of freedom
Multiple R-squared:  0.1969,	Adjusted R-squared:  0.1955 
F-statistic: 143.5 on 2 and 1171 DF,  p-value: < 2.2e-16
```


Our hypotheses are defined as:

+ $H_0$: the regression coefficients for both gestational days and mother's height are equal to 0.
+ $H_1$: the regression coefficient for either gestational days or mother's height (or both) is not equal to 0. 

The $F$ statistic is 143.5 on 2 and 1171 degrees of freedom, with a $p$-value $<2.2 \times 10^{-16}$. Therefore, there is strong evidence against the null hypothesis and we can conclude that at least one of the regression coefficients is non-zero (i.e. Model 3 is a better fit than the null model). 

### 12.17.2 The partial F test

The global $F$-test is a joint test of the statistical significance of all the slope parameters in a linear regression model. The partial $F$-test, on the other hand, compares the fit of a complex model (say Model A, with $p$ predictors) with that of a simpler model (say Model B, with $p-k$ predictors) whose predictors are a subset of those in Model A. 

The key to the partial $F$-test is the construction of an Analysis of Variance table that partitions the sum of squares explained by the complex model into the part explained by the simpler model and the extra sum of squares explained only by the complex model. Using the notation that $SS_{REG_A}$ denotes the sum of squares explained by the complex model, whilst $SS_{REG_B}$ denotes the sum of squares explained by the simpler model, the ANOVA table is as shown in Table 4. 

Source                     | d.f.      | SS                      | Mean Square                                      
---------------------------|-----------|-------------------------|-------------------------------------------------
Explained by Model B       | $p-k$     | $SS_{REG_B}$            | $MS_{REG_B}=\frac{SS_{REG_B}}{p-k}$   
Explained by Model A       | $p$       | $SS_{REG_A}$            | $MS_{REG_A}=\frac{SS_{REG_A}}{p}$   
Extra explained by Model A | $k$       | $SS_{REG_A}-SS_{REG_B}$ | $MS_{REG_X}=\frac{SS_{REG_A}-SS_{REG_B}}{k}$   
Residual from Model A      | $n-(p+1)$ | $SS_{RES_A}$            | $MS_{RES}=\frac{SS_{RES_A}}{n-(p+1)}$         
Total                      | $n-1$     | $SS_{TOT}$              | $MS_{TOT}=\frac{SS_{TOT}}{n-1}$                 

Table 4: The ANOVA table comparing the fit of a model (Model A) with $p$ predictors with that of one (Model B) with $p-k$ predictors

The partial $F$-test tests the null hypothesis that all of the slope parameters included in Model A but omitted from Model B are equal to zero. The alternative hypothesis is that at least one of the additional parameters in Model A is not equal to 0. 

The appropriate test statistic ($F$) is the ratio of extra mean sum of squares in Model A to the mean residual sum of squares from Model A. Under the null hypothesis, this test statistic follows an $F$-distribution:

$$\text{Under } H_0: F = \frac{MS_{REG_X}}{MS_{RES}} \sim F_{k,(n-(p+1))}$$
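The partial $F$ statistic can be assembled directly from the residual sums of squares of the two fitted models, because both models share the same $SS_{TOT}$ and so $SS_{REG_A}-SS_{REG_B}=SS_{RES_B}-SS_{RES_A}$. The sketch below is illustrative only: it assumes R's built-in `mtcars` data and two arbitrary nested models (names `model_a` and `model_b` are hypothetical), not the birth weight models.

```r
# Illustrative nested models on built-in mtcars (an assumption for this sketch)
model_b <- lm(mpg ~ wt, data = mtcars)              # simpler model, p - k predictors
model_a <- lm(mpg ~ wt + hp + qsec, data = mtcars)  # complex model, p predictors

n <- nrow(mtcars)
p <- length(coef(model_a)) - 1        # predictors in Model A
k <- p - (length(coef(model_b)) - 1)  # extra predictors in Model A

# Residual sums of squares for each model
ss_res_a <- sum(residuals(model_a)^2)
ss_res_b <- sum(residuals(model_b)^2)

# Extra sum of squares explained by Model A (equals SS_REG_A - SS_REG_B)
ss_extra <- ss_res_b - ss_res_a

# Mean squares and the partial F statistic
ms_extra  <- ss_extra / k
ms_res    <- ss_res_a / (n - (p + 1))
f_partial <- ms_extra / ms_res
p_value   <- pf(f_partial, df1 = k, df2 = n - (p + 1), lower.tail = FALSE)

# Compare with anova(), which reports the same test in its second row
tab <- anova(model_b, model_a)
all.equal(tab$F[2], f_partial)
all.equal(tab[2, "Pr(>F)"], p_value)
```

Here the two models differ by $k=2$ predictors, so the test has 2 numerator degrees of freedom and cannot be replaced by a single $t$-test.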


*Example*: We can use ```anova()``` to conduct a partial F-test to compare Models 1 and 3: 

+ $H_0:$ Model 1 is the better fit 
+ $H_1:$ Model 3 is the better fit


```r
anova(model1, model3)
```

Res.Df | RSS      | Df | Sum of Sq | F        | Pr(>F)
-------|----------|----|-----------|----------|-------------
1172   | 328608.3 |    |           |          |
1171   | 316482.2 | 1  | 12126.13  | 44.86728 | 3.266475e-11


The $F$-statistic is 44.87 with a $p$-value of $3.27\times 10^{-11}$. This is strong evidence against the null hypothesis, and hence the data indicate that Model 3 is the better fit. 

In this case, the two models only differed by one variable (mother's height) and so the hypotheses could be re-written as: 

+ $H_0: \beta_2=0$, where $\beta_2$ is the regression coefficient for mother's height. 
+ $H_1: \beta_2 \neq 0$. 

In other words, when the models only differ by one variable, the partial F test is equivalent to the t test of the null hypothesis that the regression coefficient for that variable is equal to 0. Notice in our example that the results of the partial F test are the same as the t-test for $\beta_2=0$, with $F=t^2$. 

For this reason, partial F tests are more useful in situations where we wish to compare models that differ by more than one variable. 
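The equivalence between the partial F test and the t test when the models differ by a single variable can be checked numerically. The sketch below is a hedged illustration using R's built-in `mtcars` data and two arbitrary models (hypothetical names `model_small` and `model_large`) rather than Models 1 and 3.

```r
# Two nested models that differ by exactly one variable (hp); illustrative only
model_small <- lm(mpg ~ wt, data = mtcars)
model_large <- lm(mpg ~ wt + hp, data = mtcars)

# t statistic and p-value for the added variable, from the coefficient table
coefs <- summary(model_large)$coefficients
t_hp  <- coefs["hp", "t value"]
p_t   <- coefs["hp", "Pr(>|t|)"]

# Partial F statistic and p-value from the model comparison
tab       <- anova(model_small, model_large)
f_partial <- tab$F[2]
p_f       <- tab[2, "Pr(>F)"]

# The squared t statistic equals F, and the two p-values coincide
all.equal(t_hp^2, f_partial)
all.equal(p_t, p_f)
```

Both comparisons should return `TRUE`, mirroring the relationship $F=t^2$ seen in the birth weight example.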
