# S07: ANOVA

The $t$-test was a great step forward in both statistics and science: it gave scientists a robust test that works well even with small sample sizes.

The only problem was that the $t$-test could compare **at most 2 samples**. For situations where the grouping variable has 3 or more levels (for example freshman, sophomore, junior, senior), the $t$-test does not work.

Ronald Fisher took up the challenge and created the mathematical extension of the $t$-test for use with more than 2 groups. His statistical test is called **ANOVA**, which stands for <span style = 'color:blue;font-weight:bold'>Analysis of Variance</span>.

## Conducting the ANOVA in R
**Example.** Test whether the variable GPA varies based upon year in school (Yr).

We proceed through the following steps:
- Create the Model
   $$\text{mod = lm(GPA $\sim$ Yr, data = pers)}$$

- Extract the Residuals
   $$\text{res = residuals(mod)}$$

- Produce a QQ plot of the Residuals (as we did with linear regression)

- If the Data are Appropriate, Run the ANOVA
   $$\text{anova = aov(GPA $\sim$ Yr, data = pers)}$$

- <span style = 'color:blue;font-weight:bold'>Only if we reject the null, conduct a *post hoc* </span><span style = 'color:red;font-weight:bold'>Tukey HSD</span>
   $$\text{TukeyHSD(anova)}$$
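
The steps above can be sketched end to end. Since the personality data are only loaded later from a URL, this sketch uses R's built-in `PlantGrowth` data frame (columns `weight` and `group`) as a stand-in; that substitution is an assumption made purely for illustration. The fitted object is named `fit` rather than `anova` so that it does not mask R's built-in `anova()` function.

```r
# Stand-in data: PlantGrowth ships with base R (30 plants, 3 groups).
# It replaces GPA ~ Yr from the example above for illustration only.
mod <- lm(weight ~ group, data = PlantGrowth)   # create the model
res <- residuals(mod)                           # extract the residuals

# QQ plot of the residuals: points close to the line suggest
# the residuals are approximately normal
qqnorm(res)
qqline(res)

# Run the ANOVA and pull the p-value out of the summary table
fit <- aov(weight ~ group, data = PlantGrowth)
tab <- summary(fit)[[1]]
p_value <- tab[["Pr(>F)"]][1]
print(tab)

# Post hoc comparisons only when the overall test rejects the null
alpha <- 0.05
if (p_value < alpha) {
  print(TukeyHSD(fit))
}
```

The Tukey HSD output lists each pair of groups with an adjusted p-value, which shows where the differences lie once the overall test has found one.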

## Getting Started

Let's load some data frames for our examples.

In [4]:
pers <- read.csv('https://faculty.ung.edu/rsinn/data/personality.csv')
city <- read.csv('https://faculty.ung.edu/rsinn/data/worldcity.csv')

## Example 1: Perfectionism

Test whether levels of perfectionism vary depending upon whether one prefers to sit in the front, middle, or back of the classroom using the **Perf** and **SitClass** columns of the personality data frame. Test at the $\alpha = 0.05$ level.
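
The decision rule at a chosen $\alpha$ can be wrapped in a small helper. The function `anova_at_alpha()` below is hypothetical (it is not part of R or of the course materials), and it is demonstrated on the built-in `PlantGrowth` data so that it runs without downloading anything. The commented last line shows how it might be called on the personality data, assuming **SitClass** is stored as a categorical column.

```r
# Hypothetical helper: run a one-way ANOVA and, only if the overall test
# rejects the null at level alpha, follow up with Tukey HSD.
anova_at_alpha <- function(formula, data, alpha = 0.05) {
  fit <- aov(formula, data = data)
  p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
  reject <- p_value < alpha
  list(
    p_value = p_value,
    reject  = reject,
    tukey   = if (reject) TukeyHSD(fit) else NULL
  )
}

# Demonstration on built-in data (a stand-in for the personality data)
out <- anova_at_alpha(weight ~ group, data = PlantGrowth)
out$reject
out$tukey

# With the course data loaded, the call would look like:
# anova_at_alpha(Perf ~ SitClass, data = pers)
```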

## Example 2: Optimism and Humor

Test whether <span style = 'color:blue'>optimism scores</span> vary based upon <span style = 'color:blue'>primary humor style</span> at the $\alpha = 0.05$ level. We will use the **CHS** and **PHS** variables in the personality data frame.
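
One thing worth checking before running this test: `aov()` treats a numeric grouping column as a continuous predictor, not as a set of categories. Whether **PHS** is stored as text labels or as numeric codes is not stated here, so the sketch below illustrates the issue on R's built-in `mtcars` data (an assumed stand-in), where `cyl` is a numeric column with three distinct values.

```r
# cyl is numeric in mtcars, with three distinct values (4, 6, 8)
str(mtcars$cyl)

# Numeric grouping variable: fitted as a single slope, 1 degree of freedom
fit_numeric <- aov(mpg ~ cyl, data = mtcars)
df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]

# Wrapped in factor(): treated as 3 groups, so 3 - 1 = 2 degrees of freedom
fit_factor <- aov(mpg ~ factor(cyl), data = mtcars)
df_factor <- summary(fit_factor)[[1]][["Df"]][1]

c(numeric = df_numeric, factor = df_factor)

# With the course data, the same guard would be (only needed if PHS is numeric):
# aov(CHS ~ factor(PHS), data = pers)
```

If the degrees of freedom for the grouping term are not one less than the number of groups, the variable was probably read as numeric.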

## Example 3: Cities and Population

The **city** data frame lists the populations of the largest cities in the world. Test whether the average city population differs among China, India, Brazil and Japan at the $\alpha = 0.05$ level.

In [17]:
# Keep only the four countries of interest; columns 1, 4 and 5 hold city, country and population
data <- subset(city, country %in% c('China', 'India', 'Brazil', 'Japan'),
               select = c(1, 4, 5))
head(data)

| | city | country | population |
|---|---|---|---|
| 1 | Tokyo | Japan | 37977000 |
| 3 | Delhi | India | 29617000 |
| 4 | Mumbai | India | 23355000 |
| 6 | Shanghai | China | 22120000 |
| 7 | Sao Paulo | Brazil | 22046000 |
| 10 | Guangzhou | China | 20902000 |
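
Before running the ANOVA it can help to look at how many cities each country contributes and at the group means. The sketch below rebuilds only the six rows shown in the preview above, so its numbers are not the full result. The data frame name `preview` and the column name `city` are assumptions for illustration.

```r
# Only the six preview rows printed above; the real subset has more cities
preview <- data.frame(
  city       = c('Tokyo', 'Delhi', 'Mumbai', 'Shanghai', 'Sao Paulo', 'Guangzhou'),
  country    = c('Japan', 'India', 'India', 'China', 'Brazil', 'China'),
  population = c(37977000, 29617000, 23355000, 22120000, 22046000, 20902000)
)

# Number of cities per country: groups with very few cities give
# the test little information about within-group variation
counts <- table(preview$country)
counts

# Mean population per country, the quantities the ANOVA compares
means <- aggregate(population ~ country, data = preview, FUN = mean)
means
```

The same two calls applied to the full `data` subset would show the group sizes and means that the ANOVA actually works with.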
