# Introduction to R Part 26: Analysis of Variance (ANOVA)

In lesson 24 we introduced the t-test for checking whether the means of two groups differ. The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level, or group, of the variable. We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.
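
To get a rough sense of how quickly false positives accumulate, the sketch below computes the chance of at least one false positive when every pair among five groups is tested at the 5% level. It assumes the tests are independent, which pairwise t-tests on shared groups are not, so treat the number as an approximation rather than an exact rate.

```r
alpha   <- 0.05               # per-test significance level
k       <- 5                  # number of groups being compared
n_tests <- choose(k, 2)       # number of pairwise comparisons

# Chance of at least one false positive across all tests,
# assuming (as a simplification) that the tests are independent
fwer <- 1 - (1 - alpha)^n_tests

n_tests                       # 10
round(fwer, 3)                # 0.401
```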

### One-Way ANOVA

The one-way ANOVA tests whether the mean of some numeric variable differs across the levels of one categorical variable. It essentially answers the question: do any of the group means differ from one another? We won't get into the details of carrying out an ANOVA by hand as it involves more calculations than the t-test, but the process is similar: you go through several calculations to arrive at a test statistic and then you compare the test statistic to a critical value based on a probability distribution. In the case of the ANOVA, you use the F-distribution, which you can access with the functions rf(), pf(), qf() and df().
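
For readers who want to see the shape of the calculation anyway, here is a minimal sketch on a made-up toy data set (the `values` and `groups` vectors are invented for illustration). It builds the F statistic from the between-group and within-group sums of squares, looks up a critical value with qf() and checks the result against aov().

```r
# Hypothetical toy data: three groups of four observations each
values <- c(4, 5, 6, 5,   7, 8, 6, 7,   9, 10, 8, 9)
groups <- factor(rep(c("a", "b", "c"), each = 4))

k <- nlevels(groups)                          # number of groups
n <- length(values)                           # total observations
grand_mean  <- mean(values)
group_means <- tapply(values, groups, mean)
group_sizes <- tapply(values, groups, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((values - ave(values, groups))^2)

# F statistic: ratio of between-group to within-group mean squares
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))

# Critical value at the 5% level and the upper-tail p-value
f_crit  <- qf(0.95, df1 = k - 1, df2 = n - k)
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# The hand calculation should agree with aov()
aov_f <- summary(aov(values ~ groups))[[1]][["F value"]][1]
all.equal(f_stat, aov_f)
```

With these numbers the group means are 5, 7 and 9, so the between-group sum of squares is 32, the within-group sum of squares is 6, and the F statistic works out to 24.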

To carry out an ANOVA in R, you can use the aov() function. aov() takes a formula as its first argument, of the form: numeric_response_variable ~ categorical_variable. Let's generate some fake voter age and demographic data and use the ANOVA to compare average ages across the demographic groups:

In [1]:
set.seed(12)
voter_race <- sample(c("white", "hispanic",                   # generate race data
                     "black", "asian", "other"),              
                     prob = c(0.5, 0.25 ,0.15, 0.1, 0.1), 
                     size=1000,
                     replace=TRUE)
 
voter_age <- rnorm(1000,50,20)                        # generate age data (equal means)
voter_age <- ifelse(voter_age<18,18,voter_age)

av_model <- aov(voter_age ~ voter_race)               # conduct the ANOVA and store the result
summary(av_model)                                     # check a summary of the test result

             Df Sum Sq Mean Sq F value Pr(>F)
voter_race    4   1204   300.9   0.815  0.515
Residuals   995 367270   369.1               

In the test output, the test statistic is the F value of 0.815 and the p-value is 0.515. We could have calculated the p-value using the test statistic, the given degrees of freedom and the F-distribution:

In [2]:
pf(q=0.815,           # F value
   df1=4,             # number of groups minus 1
   df2=995,           # observations minus number of groups
   lower.tail=FALSE)  # check upper tail only*

*Note: similar to the chi-squared test we are only interested in the upper tail of the distribution.
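
Rather than copying numbers from the printed table, you can pull them out of the summary object. The sketch below uses the built-in PlantGrowth data set as a stand-in for the voter data, so the specific values differ from those in this lesson; the indexing by position assumes a one-way model, where the first row is the grouping term and the second is the residuals.

```r
# PlantGrowth ships with base R; it stands in for the voter data here
model <- aov(weight ~ group, data = PlantGrowth)
tab   <- summary(model)[[1]]         # ANOVA table as a data frame

f_value <- tab[["F value"]][1]       # F statistic for the group term
df1     <- tab[["Df"]][1]            # number of groups minus 1
df2     <- tab[["Df"]][2]            # observations minus number of groups

# Upper-tail probability of the F-distribution at the observed statistic
p_manual <- pf(f_value, df1 = df1, df2 = df2, lower.tail = FALSE)

# Should match the p-value that summary() reports
all.equal(p_manual, tab[["Pr(>F)"]][1])
```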

The test result gives no evidence that average ages differ based on the race variable, so we fail to reject the null hypothesis that none of the groups differ.

Now let's make new age data where the group means do differ and run a second ANOVA:

In [3]:
set.seed(12)
white_dist <- rnorm(1000,55,20)     # draw ages from a different distribution for white voters
white_dist <- ifelse(white_dist<18,18,white_dist)

new_voter_ages <- ifelse(voter_race == "white", white_dist, voter_age)


av_model <- aov(new_voter_ages ~ voter_race)          # conduct the ANOVA and store the result
summary(av_model)                                     # check a summary of the test result

             Df Sum Sq Mean Sq F value Pr(>F)  
voter_race    4   3932   983.0    2.61 0.0342 *
Residuals   995 374665   376.5                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the code above, we changed the average age for white voters to 55 while keeping the other groups unchanged with an average age of 50. The resulting p-value of 0.034 means the test is now significant at the 5% level (95% confidence). Notice that the test output does not indicate which group mean(s) differ from the others. We know that it is the white voters who differ because we set it up that way, but when testing real data, you may not know which group(s) caused the test to return a positive result. To check which groups differ after getting a positive ANOVA result, you can perform a follow-up test or "post-hoc test".

One possible post-hoc test is to perform a separate t-test for each pair of groups. You can perform a t-test between all pairs using the pairwise.t.test() function:

In [4]:
pairwise.t.test(new_voter_ages,             # conduct pairwise t-tests between all groups
                voter_race, 
                p.adjust.method = "none")   # do not adjust resulting p-values


	Pairwise comparisons using t tests with pooled SD 

data:  new_voter_ages and voter_race 

         asian black hispanic other
black    0.255 -     -        -    
hispanic 0.402 0.648 -        -    
other    0.888 0.202 0.323    -    
white    0.408 0.008 0.013    0.532

P value adjustment method: none 

The resulting table shows the p-values for each pairwise t-test. Using unadjusted pairwise t-tests can overestimate significance because the more comparisons you make, the more likely you are to come across an unlikely result due to chance. We can account for this multiple comparison problem by specifying a p-value adjustment method:

In [5]:
pairwise.t.test(new_voter_ages,                   # conduct pairwise t-tests between all groups
                voter_race, 
                p.adjust.method = "bonferroni")   # use Bonferroni correction*


	Pairwise comparisons using t tests with pooled SD 

data:  new_voter_ages and voter_race 

         asian black hispanic other
black    1.00  -     -        -    
hispanic 1.00  1.00  -        -    
other    1.00  1.00  1.00     -    
white    1.00  0.08  0.13     1.00 

P value adjustment method: bonferroni 

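*Note: the Bonferroni correction is usually described as dividing the significance level α by the number of comparisons made. R applies the equivalent adjustment to the p-values instead, multiplying each one by the number of comparisons and capping the result at 1.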

Note that after adjusting for multiple comparisons, none of the p-values are significant at the 5% level (95% confidence). The Bonferroni correction is somewhat conservative, so it can fail to flag differences that are real.
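
The adjustment itself is simple arithmetic. As a small check, the sketch below takes three unadjusted p-values from the first pairwise table and applies the correction both by hand and with p.adjust(). The vector names are made up for readability, and `n` is passed explicitly because only a subset of the ten comparisons is included.

```r
# Unadjusted p-values copied from the first pairwise table
p_raw <- c(white_black = 0.008, white_hispanic = 0.013, black_asian = 0.255)

n_comparisons <- choose(5, 2)      # 10 pairs among five groups

# Bonferroni by hand: multiply by the number of comparisons, cap at 1
p_manual <- pmin(p_raw * n_comparisons, 1)

# p.adjust() does the same; n is needed because p_raw is only a subset
p_bonf <- p.adjust(p_raw, method = "bonferroni", n = n_comparisons)

all.equal(unname(p_manual), unname(p_bonf))
p_bonf                             # 0.08, 0.13 and 1, matching the adjusted table
```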

Another common post-hoc test is Tukey's test. You can carry out Tukey's test using the built-in R function TukeyHSD():

In [6]:
TukeyHSD(av_model)      # pass fitted ANOVA model

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = new_voter_ages ~ voter_race)

$voter_race
                     diff         lwr       upr     p adj
black-asian    -3.0009576 -10.2001836  4.198268 0.7856694
hispanic-asian -2.0604424  -8.7833237  4.662439 0.9188866
other-asian     0.4235040  -7.7872349  8.634243 0.9999112
white-asian     1.8851439  -4.3437325  8.114020 0.9222716
hispanic-black  0.9405152  -4.6833808  6.564411 0.9910128
other-black     3.4244616  -3.9136107 10.762534 0.7065118
white-black     4.8861015  -0.1368432  9.909046 0.0610682
other-hispanic  2.4839463  -4.3874134  9.355306 0.8608383
white-hispanic  3.9455862  -0.3669829  8.258155 0.0913566
white-other     1.4616399  -4.9272059  7.850486 0.9710417


The output of the Tukey test shows the average difference, a confidence interval and a p-value for each pair of groups. Again, we see relatively low p-values for the white-black and white-hispanic comparisons (about 0.06 and 0.09), suggesting that the white group is the one that led to the positive ANOVA result, although neither comparison falls below 0.05 after adjustment.
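
With many groups the Tukey table gets long, so it can help to sort or filter it in code. The sketch below does this on the built-in PlantGrowth data set, used here only as a stand-in. The result of TukeyHSD() is a list with one matrix per model term, named after the term (`group` in this case).

```r
model <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(model)

pairs <- tukey$group                    # matrix: one row per pair of groups

# Sort so the smallest adjusted p-values come first
pairs[order(pairs[, "p adj"]), ]

# Names of the pairs with an adjusted p-value below 0.05
significant <- rownames(pairs)[pairs[, "p adj"] < 0.05]
significant
```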

### Two-Way ANOVA

The two-way ANOVA extends the analysis of variance to cases where you have two categorical variables of interest. For example, a two-way ANOVA would let us check whether voter age varies across two demographic variables like race and gender at the same time. You can conduct a two-way ANOVA by passing an extra categorical variable into the formula supplied to the aov() function. Let's make a new variable for voter gender, alter voter ages based on that variable and then do a two-way ANOVA test investigating the effects of voter gender and race on age:

In [7]:
set.seed(10)
voter_gender <- sample(c("male","female"),     # generate genders
                       size=1000, 
                       prob=c(0.5,0.5),
                       replace = TRUE)

voter_age2 <- ifelse(voter_gender=="male", voter_age-1.5, voter_age+1.5)  # alter age based on gender
voter_age2 <- ifelse(voter_age2<18,18,voter_age2)

av_model <- aov(voter_age2 ~ voter_race + voter_gender)    # perform ANOVA
summary(av_model)                                          # show the result

              Df Sum Sq Mean Sq F value Pr(>F)  
voter_race     4   1203   300.8   0.821 0.5117  
voter_gender   1   1743  1743.4   4.760 0.0294 *
Residuals    994 364093   366.3                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the code above we added 1.5 years to the age of each female voter and subtracted 1.5 years from the age of each male voter. The test result detects this difference in age based on gender with a p-value of 0.029 for the voter_gender variable. On the other hand, the voter_race variable appears to have no significant effect on age.

The two-way ANOVA can also test the interaction between the categorical variables. To check for interaction, add a third term to the formula you supply to aov(), formed by joining the two categorical variables with the * operator:

In [8]:
av_model <- aov(voter_age2 ~ voter_race + voter_gender +    # repeat the test
               (voter_race * voter_gender))                 # add interaction term

summary(av_model)                                           # check result

                         Df Sum Sq Mean Sq F value Pr(>F)  
voter_race                4   1203   300.8   0.820 0.5122  
voter_gender              1   1743  1743.4   4.755 0.0294 *
voter_race:voter_gender   4   1132   282.9   0.772 0.5438  
Residuals               990 362961   366.6                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
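
A note on the formula syntax: in an R formula, a * b already expands to both main effects plus their interaction, and the interaction on its own is written a:b. The formula used above is therefore equivalent to two shorter spellings. The sketch below checks this with terms(), using placeholder variable names y, a and b that are not part of the voter data.

```r
# Three ways of writing main effects plus interaction (placeholder names)
f_long  <- y ~ a + b + (a * b)   # form used in this lesson
f_star  <- y ~ a * b             # shorthand
f_colon <- y ~ a + b + a:b       # fully spelled out

# Helper (defined here for illustration) that lists a formula's terms
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(f_long)              # "a" "b" "a:b"
identical(term_labels(f_long), term_labels(f_star))
identical(term_labels(f_long), term_labels(f_colon))
```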

The test result shows no significant interaction between gender and race, which is expected given that we created both independently. Let's create a new age variable with an interaction between gender and race and then run the test again:

In [9]:
# Increase the age of asian female voters by 10
interaction_age <- ifelse((voter_gender=="female")&(voter_race=="asian"), 
                    (voter_age + 10), voter_age)    # alter age based on gender and race

av_model <- aov(interaction_age ~ voter_race + voter_gender +     # repeat the test
               (voter_race * voter_gender))                       

summary(av_model)                                                 

                         Df Sum Sq Mean Sq F value Pr(>F)  
voter_race                4   4612  1153.0   3.118 0.0146 *
voter_gender              1     89    88.7   0.240 0.6244  
voter_race:voter_gender   4   4229  1057.2   2.859 0.0226 *
Residuals               990 366107   369.8                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, we see a low p-value for the interaction between race and gender. A low p-value for the interaction term suggests that some group defined by a combination of the two categorical variables may be having a large influence on the test results. In this case, we added 10 to the ages of all asian women voters, while all other gender/race combinations are drawn from the same distribution. To identify the specific variable combination affecting our results, we can run Tukey's test and inspect the interactions:

In [10]:
TukeyHSD(av_model)      # pass fitted ANOVA model

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = interaction_age ~ voter_race + voter_gender + (voter_race * voter_gender))

$voter_race
                     diff        lwr         upr     p adj
black-asian    -7.6521204 -14.786664 -0.51757695 0.0284489
hispanic-asian -6.7116051 -13.374084 -0.04912651 0.0473129
other-asian    -4.2276588 -12.364627  3.90930939 0.6149472
white-asian    -7.3934760 -13.566388 -1.22056391 0.0096866
hispanic-black  0.9405152  -4.632852  6.51388251 0.9906984
other-black     3.4244616  -3.847681 10.69660382 0.6994030
white-black     0.2586444  -4.719171  5.23645967 0.9999086
other-hispanic  2.4839463  -4.325677  9.29356924 0.8568273
white-hispanic -0.6818709  -4.955693  3.59195136 0.9924901
white-other    -3.1658172  -9.497261  3.16562710 0.6493747

$voter_gender
                  diff       lwr      upr     p adj
male-female -0.5955427 -2.982237 1.791151 0.6244823

$`voter_race:voter_gender`
                     

The output shows low p-values for several comparisons between asian females and other groups. If this were a real study, this result might lead us toward investigating asian females as a subgroup in more detail.
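
The interaction table has one row for every pair of race/gender combinations, so it is long. One way to keep it manageable is to request only the interaction term through the `which` argument of TukeyHSD() and then filter by adjusted p-value. The sketch below shows the pattern on the built-in ToothGrowth data set, a stand-in with two categorical variables (`supp` and `dose`), not on the voter data.

```r
# ToothGrowth ships with base R; dose is numeric, so convert it to a factor
tg <- transform(ToothGrowth, dose = factor(dose))

model <- aov(len ~ supp * dose, data = tg)

# Restrict Tukey's test to the interaction term only
tukey_int <- TukeyHSD(model, which = "supp:dose")
cells <- tukey_int[["supp:dose"]]     # one row per pair of supp/dose cells

nrow(cells)                           # choose(6, 2) = 15 comparisons

# Keep only the pairs with an adjusted p-value below 0.05
cells[cells[, "p adj"] < 0.05, , drop = FALSE]
```

For the model in this lesson, the corresponding term name would be "voter_race:voter_gender".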

### Wrap Up

The one-way and two-way ANOVA tests let us check whether a numeric response variable varies according to the levels of one or two categorical variables. R makes it easy to perform ANOVA tests without diving too deep into the details of the procedure.

Next time, we'll move on from statistical inference to the final topic of this introduction: predictive modeling.

### Next Time: Introduction to R Part 27: TBD