# ANOVA and Bootstrapping

## Comparing More Than Two Means

* Compare means of 2 groups using a T statistic.
* Compare means of 3+ groups using a new test called **analysis of variance (ANOVA)** and a new statistic called **F**.


* ANOVA
    * H0: The mean outcome is the same across all categories.
    * HA: At least one pair of means are different from each other.


$$ F = \frac{\text{variability between groups}}{\text{variability within groups}}$$

* Obtaining a large F statistic requires that the variability between sample means is greater than the variability within the samples.

## ANOVA

* Variability partitioning: ANOVA splits the total variability in the response variable into two parts.


* **Group**: **Between-group variability**.
* **Error**: **Within-group variability**.


### Degrees of Freedom

* **Total degrees of freedom** is calculated as sample size minus one.

$$
df_T = n - 1
$$

* **Group degrees of freedom** is calculated as number of groups minus one.

$$
df_G = k - 1
$$

* **Error degrees of freedom** is the difference between the above two DF.

$$
df_E = df_T - df_G
$$
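
* Worked example, using the values from the p-value example in these notes ($df_G = 3$, $df_E = 791$). These imply $k = 4$ groups and, assuming every observation enters the analysis, $n = 795$:

$$
df_T = 795 - 1 = 794, \quad df_G = 4 - 1 = 3, \quad df_E = 794 - 3 = 791
$$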

### Sum of Squares

* **Sum of squares total (SST)** measures the **total variability** in the response variable. 
* Calculated very similarly to variance, except that the sum is not divided by $n - 1$.

$$
SST = \sum^n_{i=1} (y_i - \bar{y})^2
$$

$$
SST = SSG + SSE
$$

* **Sum of squares groups (SSG)** measures the variability **between groups**. 
* It is the **explained variability**.

$$
SSG = \sum^k_{j=1} n_j (\bar{y}_j - \bar{y})^2
$$

* **Sum of squares error (SSE)** measures the variability **within groups**.
* It is the **unexplained variability**, unexplained by the group variable, due to other reasons.

$$
SSE = SST - SSG
$$

### Mean Squares

* Mean squares is the average variability between and within groups, calculated as the total variability (sum of squares) scaled by the associated degrees of freedom.

* **Mean squares group (MSG)**

$$
MSG = \frac{SSG}{df_G}
$$

* **Mean squares error (MSE)**

$$
MSE = \frac{SSE}{df_E}
$$

### F-Statistic

* **F-statistic** is the ratio of the average between-group and within-group variabilities.
* It is never negative, since it is a ratio of two non-negative quantities, so the F distribution is right-skewed.

$$
F = \frac{MSG}{MSE}
$$
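
The formulas above can be traced by hand on a small data set. The sketch below uses made-up toy values (the vectors `y` and `g` are hypothetical, not from the notes) and cross-checks the hand-computed F against R's built-in ANOVA table.

```r
# Toy data (hypothetical): three groups of three observations each.
y <- c(4, 5, 6,  7, 8, 9,  1, 2, 3)
g <- factor(rep(c("A", "B", "C"), each = 3))

n <- length(y)     # total sample size
k <- nlevels(g)    # number of groups

# Degrees of freedom
df_T <- n - 1
df_G <- k - 1
df_E <- df_T - df_G

# Sums of squares
grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

SST <- sum((y - grand_mean)^2)
SSG <- sum(group_sizes * (group_means - grand_mean)^2)
SSE <- SST - SSG

# Mean squares, F statistic, and upper-tail p-value
MSG <- SSG / df_G
MSE <- SSE / df_E
f_stat  <- MSG / MSE
p_value <- pf(f_stat, df1 = df_G, df2 = df_E, lower.tail = FALSE)

# Cross-check against R's built-in ANOVA table
tab <- anova(lm(y ~ g))
c(by_hand = f_stat, built_in = tab[["F value"]][1])
```

Computing `SSE` as `SST - SSG` mirrors the definition in these notes; summing squared deviations from each group mean would give the same value.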

### P-Value

* **P-value** is the probability of at least as large a ratio between the "between" and "within" group variabilities if in fact the means of all groups are equal.

**Example**

* F-statistic = 21.735
* DF_G = 3
* DF_E = 791

In [1]:
pf(q = 21.735, df1 = 3, df2 = 791, lower.tail = FALSE)

# If p-value is small (less than alpha), the data provide convincing
# evidence that at least one pair of population means are different
# from each other (but we can't tell which one).

# If p-value is large, the data do not provide convincing evidence that 
# at least one pair of population means are different from each other,
# the observed differences in sample means are attributable to 
# sampling variability (or chance).

## Conditions for ANOVA

* (1) **Independence**
    * Within groups: sampled observations must be independent.
        * Random sample / assignment
        * Each $n_j$ less than 10% of respective population
    * Between groups: the groups must be independent of each other (non-paired).
        * Carefully consider whether the groups may be dependent -> repeated measures ANOVA
* (2) **Approximate normality**: distribution should be nearly normal within each group.
    * Especially important when sample sizes are small.
* (3) **Equal variance**: groups should have roughly equal variability.
    * Especially important when sample sizes differ between groups.
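
A rough way to look at conditions (2) and (3) in R is sketched below. It assumes the built-in `chickwts` data set (chick weight by feed type) as a stand-in for real data; which SD ratio counts as "roughly equal" is a judgment call and is not fixed by these notes. Independence has to be argued from the study design and cannot be checked this way.

```r
# chickwts ships with base R (datasets package): chick weight by feed type.
data(chickwts)

group_n  <- tapply(chickwts$weight, chickwts$feed, length)
group_sd <- tapply(chickwts$weight, chickwts$feed, sd)

# Sample sizes and spreads side by side
cbind(n = group_n, sd = round(group_sd, 1))

# Ratio of largest to smallest SD; values near 1 suggest similar variability.
sd_ratio <- max(group_sd) / min(group_sd)
sd_ratio

# Visual checks: side-by-side box plots for spread, and a normal
# quantile plot of the within-group residuals for approximate normality.
boxplot(weight ~ feed, data = chickwts)
res <- residuals(lm(weight ~ feed, data = chickwts))
qqnorm(res)
qqline(res)
```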

## Multiple Comparisons

* Which means are different?


* Two sample T tests for differences in each possible pair of groups.
* Multiple tests will inflate the Type I error rate ($\alpha$ significance level).
* Solution: use **modified significance level**.


* Testing many pairs of groups is called **multiple comparisons**.
* The **Bonferroni correction** $\alpha^\star$ suggests that a more stringent significance level is more appropriate for these tests.
    * Adjust $\alpha$ by the number of comparisons $K$ being considered.

$$
K = \frac{k(k-1)}{2}
$$

$$
\alpha^\star = \frac{\alpha}{K}
$$

* Constant variance: because ANOVA assumes equal variability across groups, use a consistent standard error and degrees of freedom for all tests.
* Compare the p-values from each test to the modified significance level.


* **Standard error for multiple pairwise comparisons**

$$
SE = \sqrt{
\frac{MSE}{n_1} + \frac{MSE}{n_2}
}
$$

* **Degrees of freedom for multiple pairwise comparisons**

$$
df = df_E
$$
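
As a sketch of how these pieces fit together in R, `pairwise.t.test()` runs every pairwise test with a pooled standard deviation and a Bonferroni adjustment. The built-in `chickwts` data set is assumed here purely for illustration. One pair is then recomputed by hand with the formulas above to show that the two routes agree.

```r
data(chickwts)

# All pairwise t tests with a pooled SD (sqrt(MSE)) and df_E.
# R multiplies each p-value by K (capped at 1) instead of dividing
# alpha by K; the resulting decisions are the same.
pw <- pairwise.t.test(chickwts$weight, chickwts$feed,
                      p.adjust.method = "bonferroni", pool.sd = TRUE)
pw$p.value

# One pair by hand: casein vs horsebean
tab  <- anova(lm(weight ~ feed, data = chickwts))
MSE  <- tab[["Mean Sq"]][2]
df_E <- tab[["Df"]][2]
K    <- choose(nlevels(chickwts$feed), 2)   # k(k-1)/2 comparisons

m <- tapply(chickwts$weight, chickwts$feed, mean)
n <- tapply(chickwts$weight, chickwts$feed, length)

se     <- sqrt(MSE / n["casein"] + MSE / n["horsebean"])
t_stat <- (m["casein"] - m["horsebean"]) / se
p_raw  <- 2 * pt(abs(t_stat), df = df_E, lower.tail = FALSE)
p_adj  <- min(1, K * p_raw)                 # Bonferroni-adjusted p-value
p_adj
```

Comparing `p_adj` to the original $\alpha$ is equivalent to comparing the unadjusted p-value to $\alpha^\star$.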

**Example**

* If the explanatory variable in an ANOVA has 3 levels, and the F-test in ANOVA yields a significant result, how many pairwise comparisons are needed to compare each group to one another?

In [4]:
3 * (3-1) / 2

**Example**

* 4 class levels
* $\alpha$ = 0.05 for the original ANOVA

In [5]:
# Number of comparisons
(K <- 4 * (4-1) / 2)
# Corrected significance level
0.05 / K

* Is there a difference between the average vocabulary scores of middle and lower class Americans? (A single pairwise comparison.)
* DF_E = 791
* MSE = 3.628
* Lower class
    * N = 41
    * Mean = 5.07
* Middle class
    * N = 331
    * Mean = 6.76

In [13]:
# H0: mu_middle - mu_lower = 0
# HA: mu_middle - mu_lower != 0

(se <- sqrt(3.628/41 + 3.628/331))

(t = ((6.76 - 5.07) - 0) / se)

pt(t, df = 791, lower.tail = FALSE) * 2

# P-value is smaller than the alpha 0.00833. Reject the null hypothesis.

## Bootstrapping

* Take a bootstrap sample - a random sample taken **with replacement** from **the original sample**, of **the same size** as the original sample.
* Calculate the bootstrap statistic - a statistic such as mean, median, proportion, etc. computed on the bootstrap samples.
* Repeat the above two steps many times to create a bootstrap distribution - a distribution of bootstrap statistics.


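* **Percentile method**: the confidence interval bounds are the corresponding percentiles of the bootstrap distribution (e.g. the 2.5th and 97.5th percentiles for a 95% interval).
* **Standard error method**: point estimate ± critical value × the bootstrap standard error (the standard deviation of the bootstrap distribution).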


* Conditions are less rigid than for CLT-based methods.
* If the bootstrap distribution is extremely skewed or sparse, the bootstrap interval might be unreliable.
* A representative sample is still required - if the sample is biased, the estimates resulting from this sample will also be biased.
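
The bootstrap steps and the two interval methods can be sketched in a few lines of R. The sample `x` below is made up for illustration, and the use of a t quantile with `n - 1` degrees of freedom as the critical value in the standard error method is an assumption, not something these notes specify.

```r
set.seed(1)  # for reproducibility

# Hypothetical original sample (made-up values for illustration)
x <- c(12, 15, 9, 20, 14, 11, 18, 25, 13, 16, 10, 17)
n <- length(x)

# Resample with replacement (same size as the original sample),
# compute the statistic, and repeat many times.
n_boot <- 10000
boot_means <- replicate(n_boot, mean(sample(x, size = n, replace = TRUE)))

# Percentile method: middle 95% of the bootstrap distribution
ci_percentile <- quantile(boot_means, probs = c(0.025, 0.975))

# Standard error method: point estimate +/- critical value * bootstrap SE
# (a t quantile with n - 1 df is assumed here)
se_boot <- sd(boot_means)
t_star  <- qt(0.975, df = n - 1)
ci_se   <- mean(x) + c(-1, 1) * t_star * se_boot

ci_percentile
ci_se
```

A histogram of `boot_means` (`hist(boot_means)`) is a quick way to check whether the bootstrap distribution is badly skewed or sparse before trusting either interval.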

## Exercises

OpenIntro Statistics, 3rd edition<br>
5.41, 5.43, 5.45, 5.47, 5.49, 5.51

**5.41 Fill in the blank.**
* When doing an ANOVA, you observe large differences in means between groups. Within the ANOVA framework, this would most likely be interpreted as evidence strongly favoring the ? hypothesis.

In [1]:
# alternative

**5.43 Chicken diet and weight, Part III.**
* In Exercises 5.31 and 5.33 we compared the effects of two types of feed at a time. A better analysis would first consider all feed types at once: casein, horsebean, linseed, meat meal, soybean, and sunflower. The ANOVA output below can be used to test for differences between the average weights of chicks on different diets.

|/|Df |Sum Sq |Mean Sq |F value |Pr(>F) |
|---|---|---|---|---|---|
|feed |5 |231,129.16 |46,225.83 |15.36 |0.0000 |
|Residuals |65 |195,556.02 |3,008.55 | | |

* Conduct a hypothesis test to determine if these data provide convincing evidence that the average weight of chicks varies across some (or all) groups. Make sure to check relevant conditions. Figures and summary statistics are shown below.

In [5]:
# H0: The mean outcome is the same across all feed types.
# HA: At least one pair of means are different from each other.

# F = MSG / MSE
(f = 46225.83/3008.55)

pf(f, 5, 65, lower.tail = FALSE)
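
As a cross-check, the same table can be rebuilt from raw data. The sketch below assumes that R's built-in `chickwts` data set is the data behind this exercise; its degrees of freedom (5 and 65) match the output above, but the exercise itself does not name the data set.

```r
data(chickwts)

# One-way ANOVA of chick weight on feed type
tab <- anova(lm(weight ~ feed, data = chickwts))
tab

# Pull out the F statistic and its upper-tail p-value
f_builtin <- tab[["F value"]][1]
p_builtin <- tab[["Pr(>F)"]][1]
c(F = f_builtin, p = p_builtin)
```

With a p-value this far below 0.05, the null hypothesis of equal mean weights across feed types is rejected, provided the ANOVA conditions are judged to be reasonably met.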

**5.45 Coffee, depression, and physical activity.** 
* Caffeine is the world’s most widely used stimulant, with approximately 80% consumed in the form of coffee. Participants in a study investigating the relationship between coffee consumption and exercise were asked to report the number of hours they spent per week on moderate (e.g., brisk walking) and vigorous (e.g., strenuous sports and jogging) exercise. Based on these data the researchers estimated the total hours of metabolic equivalent tasks (MET) per week, a value always greater than 0. The table below gives summary statistics of MET for women in this study based on the amount of coffee consumed.

|/|≤ 1 cup/week |2-6 cups/week |1 cup/day |2-3 cups/day |≥ 4 cups/day |Total |
|---|---|---|---|---|---|---|
|Mean |18.7 |19.6 |19.3 |18.9 |17.5 ||
|SD |21.1 |25.5 |22.5 |22.0 |22.0 ||
|n |12,215 |6,617 |17,234 |12,290 |2,383 |50,739|

* (a) Write the hypotheses for evaluating if the average physical activity level varies among the different levels of coffee consumption.
* (b) Check conditions and describe any assumptions you must make to proceed with the test.
* (c) Below is part of the output associated with this test. Fill in the empty cells.
* (d) What is the conclusion of the test?

In [6]:
# (a)
# H0: The mean MET is the same across all groups of coffee consumption.
# HA: At least one pair of means is different.

In [9]:
# (c)
(df_total = 50739 - 1)
(df_coffee = 5 - 1)
(df_residuals = df_total - df_coffee)

In [10]:
(sum_sq_coffee = 25575327 - 25564819)

In [12]:
(mean_sq_coffee = 10508 / df_coffee)
(mean_sq_residuals = 25564819 / df_residuals)

In [14]:
# F = MSG / MSE
(f = mean_sq_coffee / mean_sq_residuals)

In [17]:
pf(5.21, 4, 50734, lower.tail = FALSE)

|/|Df |Sum Sq |Mean Sq |F value |Pr(>F) |
|---|---|---|---|---|---|
|coffee| 4| 10508| 2627| 5.21|**0.0003** |
|Residuals| 50734|**25,564,819**| 503.9| /|/ |
|Total| 50738|**25,575,327**| /| /| /|

In [18]:
# (d)
# Since p-value is very small,
# reject the null hypothesis and conclude that there is
# at least one mean of MET that is different.

**5.47 GPA and major.** 
* Undergraduate students taking an introductory statistics course at Duke University conducted a survey about GPA and major. The side-by-side box plots show the distribution of GPA among three groups of majors. Also provided is the ANOVA output.

|/|Df |Sum Sq |Mean Sq |F value |Pr(>F) |
|---|---|---|---|---|---|
|major| 2| 0.03| 0.015| 0.185| 0.8313|
|Residuals| 195| 15.77| 0.081| /|/ |

* (a) Write the hypotheses for testing for a difference between average GPA across majors.
* (b) What is the conclusion of the hypothesis test? 
* (c) How many students answered these questions on the survey, i.e. what is the sample size?

In [19]:
# (a)
# H0: The average GPA across majors is the same.
# HA: At least one pair of average GPA is different.

In [20]:
# (b)
# The p-value is too large and we cannot reject the null hypothesis.
# We cannot conclude that there is any difference in
# average GPA across majors.

In [22]:
# (c)
195+2+1

**5.49 True / False: ANOVA, Part I.** 
* Determine if the following statements are true or false in ANOVA, and explain your reasoning for statements you identify as false.
* (a) As the number of groups increases, the modified significance level for pairwise tests increases as well.
* (b) As the total sample size increases, the degrees of freedom for the residuals increases as well.
* (c) The constant variance condition can be somewhat relaxed when the sample sizes are relatively consistent across groups.
* (d) The independence assumption can be relaxed when the total sample size is large.

In [23]:
# (a)
# K = (k * (k-1)) / 2
# a_modified = a / K

# False
# The larger the number of groups, the larger the number of comparisons K,
# hence the smaller the modified significance level.

In [24]:
# (b)
# df_residuals = df_total - df_group
# df_residuals = (n - 1) - (k - 1) = n - k

# True

In [26]:
# (c)
# True

In [27]:
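# (d)
# False
# Independence is always required. A large sample size does not
# make dependent observations independent.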

**5.51 Prison isolation experiment, Part II.**
* Exercise 5.37 introduced an experiment that was conducted with the goal of identifying a treatment that reduces subjects’ psychopathic deviant T scores, where this score measures a person’s need for control or his rebellion against control. In Exercise 5.37 you evaluated the success of each treatment individually. An alternative analysis involves comparing the success of treatments. The relevant ANOVA output is given below.

|/|Df |Sum Sq |Mean Sq |F value |Pr(>F) |
|---|---|---|---|---|---|
|treatment |2 |639.48 |319.74 |3.33 |0.0461 |
|Residuals |39 |3740.43 |95.91 | | |

*s_pooled = 9.793 on df = 39*

* (a) What are the hypotheses? 
* (b) What is the conclusion of the test? Use a 5% significance level.
* (c) If in part (b) you determined that the test is significant, conduct pairwise tests to determine which groups are different from each other. If you did not reject the null hypothesis in part (b), recheck your answer.

|/|Tr 1|Tr 2|Tr 3|
|---|---|---|---|
|Mean|6.21|2.86|-3.21|
|SD|12.3|7.94|8.57|
|n|14|14|14|

In [28]:
# (b)
# Since p is smaller than 0.05, reject the null hypothesis
# and conclude that at least one pair of mean scores is different.

In [32]:
# (c)

# K = (k * (k-1)) / 2
(K = 3 * 2 / 2)

# a_modified = a / K
(a_modified = 0.05 / K)

In [33]:
s_pooled = 9.793
df = 39

In [34]:
(se = sqrt(9.793^2/14 + 9.793^2/14))

In [41]:
# tr 1, tr 2
(t = (6.21-2.86)/se)
pt(t, df, lower.tail = FALSE) * 2

In [40]:
# tr 1, tr 3
(t = (6.21-(-3.21))/se)
pt(t, df, lower.tail = FALSE) * 2

In [42]:
# tr 2, tr 3
(t = (2.86-(-3.21))/se)
pt(t, df, lower.tail = FALSE) * 2
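
To finish part (c), each p-value has to be compared with the modified significance level. The sketch below reruns the three comparisons in one pass using the summary statistics from the exercise; the variable names (`means`, `grp_pairs`, `alpha_star`) are illustrative choices, not part of the original solution.

```r
# Summary statistics from the exercise
means <- c(tr1 = 6.21, tr2 = 2.86, tr3 = -3.21)
n_per_group <- 14
s_pooled <- 9.793
df_E <- 39

alpha_star <- 0.05 / choose(3, 2)   # Bonferroni-corrected significance level
se <- sqrt(s_pooled^2 / n_per_group + s_pooled^2 / n_per_group)

# All pairs of groups, one pair per column: tr1-tr2, tr1-tr3, tr2-tr3
grp_pairs <- combn(names(means), 2)
t_stats <- apply(grp_pairs, 2, function(p) (means[[p[1]]] - means[[p[2]]]) / se)

# Two-sided p-values using the pooled degrees of freedom
p_values <- 2 * pt(abs(t_stats), df = df_E, lower.tail = FALSE)
names(p_values) <- apply(grp_pairs, 2, paste, collapse = " vs ")

data.frame(p_value = round(p_values, 4), reject_H0 = p_values < alpha_star)
```

Only the treatment 1 vs treatment 3 comparison falls below the modified significance level of about 0.0167, so that is the only pair for which the data provide convincing evidence of a difference.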