# Hypothesis Testing for Multiple Groups

### Data Science 410

## Testing Multiple Groups and ANOVA

So far, we have only looked at tests for comparing two samples. What if we have multiple groups and want to compare their means? Why can’t we just do multiple two-sample t-tests for all pairs?
- Doing so increases the probability of a false positive, that is, of rejecting a null hypothesis that is actually true.
- For example, if we had 7 groups, there would be (7 choose 2) = 21 pairs to test. If our alpha cutoff is 5% and no real differences exist, we would still expect about 1 false positive (approximately 21*0.05).
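
The arithmetic behind this example can be checked with a short sketch. The family-wise error rate computed here assumes the 21 tests are independent, which is a simplification since pairwise tests share groups; the variable names are illustrative only.

```python
from math import comb

n_groups = 7
alpha = 0.05

# Number of distinct pairs of groups
n_pairs = comb(n_groups, 2)

# Expected number of false positives if every null hypothesis is true
expected_false_positives = n_pairs * alpha

# Chance of at least one false positive, ASSUMING independent tests
family_wise_error = 1 - (1 - alpha) ** n_pairs

print('Pairs to test = ' + str(n_pairs))
print('Expected false positives = ' + str(round(expected_false_positives, 2)))
print('P(at least one false positive) = ' + str(round(family_wise_error, 3)))
```

Under the independence assumption, the chance of at least one false positive is far larger than the 5% cutoff applied to each individual test.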

There is another alternative:

- Null Hypothesis: All groups are samples from the same population.
- Alternative Hypothesis: At least one group has a statistically different mean.

This type of analysis is called “ANalysis Of VAriance”, or ANOVA. ANOVA is one of a large family of models used for **experimental design**.

### Brief History of ANOVA

ANOVA is not a new idea. 

- Laplace pioneered multiple comparison methods in 1827.
- Ronald A Fisher published seminal work in 1922, 1925 and 1935. The F (Fisher) statistic is named in his honor.

Fisher pioneered the use of linear models for testing multiple groups. Key methods he developed are ANOVA and the design of experiments. 

<img src="img/Ronald_Fisher.jpg" alt="Drawing" style="width:275x; height:350px"/>

<center>Ronald A. Fisher, another scary looking statistics professor!</center>   

Fisher had an overwhelming influence on the theory of classical (frequentist) statistics in the 20th Century. He was vehemently opposed to Bayesian methods, and ostracized any practitioners. In fact, Fisher's long shadow explains why we are only beginning to teach Bayesian methods in the 21st century. Unfortunately, as with Pearson, Fisher was also a eugenicist and a racist. Another serious blemish on the early history of statistics. 

Fisher's two books are still influential and in print. 

<img src="img/Fisher1.jpg" alt="Drawing" style="width:400x; height:350px"/>

<img src="img/Fisher2.jpg" alt="Drawing" style="width:400x; height:350px"/>

<center>Fisher's books of 1935 and 1925</center>


## Basic ANOVA Theory 

Let's have a look at how we would perform the comparisons between the multiple groups of data. The groups have each been subject to a different treatment. This method is known as **one-way ANOVA**.  

The general idea is to determine whether the group means differ by more than we would expect from the variation within the groups. In other words, is the variance between the treatment group means large compared to the variance within the groups? 

This comparison of variances is measured by the **F statistic**. The F statistic is defined by the ratio: 

$$F = \frac{Variance\ between\ treatments}{Variance\ within\ treatments}$$  

This ratio will be close to 1.0 if the treatments did not exhibit a significant effect. On the other hand, if the treatment has a significant effect on **at least one of the groups**, the F statistic will tend to be noticeably $\gt 1$. 
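
A small simulation can illustrate the behavior of the ratio when the treatments have no effect. This is a sketch under assumed settings (4 groups of 50 standard Normal values, 2000 repetitions, an arbitrary seed), none of which come from the lesson itself.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(42)   # arbitrary seed for reproducibility
n_sims = 2000
f_null = np.empty(n_sims)
p_null = np.empty(n_sims)

for i in range(n_sims):
    # All 4 groups come from the same population, so the null is true
    groups = [rng.normal(size=50) for _ in range(4)]
    f_null[i], p_null[i] = ss.f_oneway(*groups)

print('Mean F statistic under the null = ' + str(round(f_null.mean(), 3)))
print('Fraction of p-values below 0.05 = ' + str(round((p_null < 0.05).mean(), 3)))
```

The average F statistic should land near 1, and roughly 5% of the simulated data sets should still fall below the 0.05 cutoff purely by chance.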

The ratio of variances, or F statistic, follows an F distribution. This distribution has two parameters, both of which are **degrees of freedom**. The variance between treatments and the variance within treatments each have their own degrees of freedom. For the F distribution of one-way ANOVA the degrees of freedom can be written:

\begin{align}
degrees\ of\ freedom\ between\ treatments\ &= DFT = I - 1 \\
degrees\ of\ freedom\ within\ treatments\ &= DFE = n - I
\end{align}

Where:

\begin{align}
I &= number\ of\ treatments\ or\ groups \\
n &= total\ number\ of\ subjects\ or\ samples
\end{align}

The shape of the F distribution is defined by these two degrees of freedom, as shown in the figure below. Notice how the distribution becomes more symmetric and peaked as the degrees of freedom increase. 

<img src="img/F_Distribution.jpg" alt="Drawing" style="width:300x; height:350px"/>
<center>F distribution for different degrees of freedom</center>

The null hypothesis is that the treatments have had no effect. The p-value is computed from the F distribution, given the two degrees of freedom parameters, and is then compared to the cutoff value.  
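
As a rough sketch of this step, the critical value and p-value can be read off the F distribution with `scipy.stats.f`. The group count, sample count and observed F statistic below are made-up numbers for illustration.

```python
import scipy.stats as ss

I, n = 4, 200              # hypothetical: 4 treatments, 200 samples in total
dft, dfe = I - 1, n - I    # degrees of freedom between and within treatments
alpha = 0.05

# F value that cuts off the upper 5% tail of the null distribution
f_crit = ss.f.isf(alpha, dft, dfe)

# p-value for a made-up observed F statistic (upper tail probability)
f_observed = 3.5
p_value = ss.f.sf(f_observed, dft, dfe)

print('Critical F = ' + str(round(f_crit, 3)))
print('P-value = ' + str(round(p_value, 4)))
```

An observed F statistic above the critical value corresponds to a p-value below the cutoff, so the two comparisons are equivalent.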

### Constructing an ANOVA Table

Comparisons between the multiple groups are traditionally laid out using an **ANOVA table**. Here we will construct the elements of this table piece by piece.  

First, we assume the observations are independent, Normally distributed within each group, and have equal variance across the groups. Then define:

\begin{align}
I &= number\ of\ treatments\\
n &= number\ of\ data\ or\ samples\\
SS &= sum\ of\ squares
\end{align}

We can calculate the following **sum of squares** quantities:

\begin{align}
SST &= SS\ Treatment\\
SSE &= SS\ Error\ within\ groups\\
SSTotal &= SST + SSE
\end{align}

Further, 

\begin{align}
DFT &= degrees\ of\ freedom\ between\ Treatment\ groups = I - 1\\
DFE &= degrees\ of\ freedom\ within\ treatment\ groups = n - I\\
DFTotal &= DFT + DFE = (I-1) + (n-I) = n -1
\end{align}

And,

\begin{align}
MST &= mean\ square\ Treatment = \frac{SST}{DFT}\\
MSE &= mean\ square\ Error\ within\ groups = \frac{SSE}{DFE}
\end{align}

Finally we can compute the F statistic with DFT and DFE degrees of freedom:

$$F = \frac{Variance\ between\ treatments}{Variance\ within\ treatments} = \frac{MST}{MSE} =  \frac{\frac{SST}{DFT}}{\frac{SSE}{DFE}}$$

Using the F statistic and the two degrees of freedom we can compute the p-value of the test. The significance of the test is then determined by comparing the p-value to the cutoff value. We can lay these results out in an ANOVA table as follows:

|Type|Sum of Squares|df|Mean Square|F|Significance|
|---|---|---|---|---|---|
|Between Groups|SST|DFT|SST/DFT|F Statistic| p-value|
|Within Groups|SSE|DFE|SSE/DFE|||
|Groups Total|SSTotal|DFTotal||||
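
The entries of the table can be computed directly from the definitions above. The following is a minimal sketch, not a reference implementation: the helper name `anova_table` and the three small groups of data are hypothetical, and the result is checked against `scipy.stats.f_oneway`.

```python
import numpy as np
import scipy.stats as ss

def anova_table(groups):
    """Compute one-way ANOVA table entries for a list of samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_vals = np.concatenate(groups)
    I, n = len(groups), all_vals.size
    grand_mean = all_vals.mean()

    # Sum of squares between treatments and within treatments
    sst = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)

    dft, dfe = I - 1, n - I
    mst, mse = sst / dft, sse / dfe
    f_stat = mst / mse
    p_value = ss.f.sf(f_stat, dft, dfe)
    return {'SST': sst, 'SSE': sse, 'SSTotal': sst + sse,
            'DFT': dft, 'DFE': dfe, 'MST': mst, 'MSE': mse,
            'F': f_stat, 'p_value': p_value}

# Made-up data for three treatments
example_groups = [[4.1, 5.0, 4.6, 5.3],
                  [5.9, 6.4, 5.5, 6.8, 6.1],
                  [4.9, 5.2, 5.8]]
table = anova_table(example_groups)
for name, value in table.items():
    print(name + ' = ' + str(value))

# Cross-check against scipy
f_check, p_check = ss.f_oneway(*example_groups)
print('scipy F = ' + str(f_check) + ', scipy p-value = ' + str(p_check))
```

The hand-computed F statistic and p-value should agree with the scipy values up to floating point error.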

### ANOVA Example

Let's start with an example with 4 groups. In Fisher's experimental design terminology we say the data arise from 4 **treatments**. The code in the cell below simulates the data for each of the 4 treatments. Run the code in the cell below and examine the differences in the box plots.

In [None]:
import numpy as np
import numpy.random as nr
import pandas as pd
import scipy.stats as ss
import statsmodels.stats.power as ssp
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
nr.seed(335566)
df1 = nr.normal(size = 50).tolist()
df2 = nr.normal(size = 50).tolist()
df3 = nr.normal(loc = 0.5, size = 60).tolist()
df4 = nr.normal(size = 40).tolist()


plt.boxplot([df1, df2, df3, df4])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')

The plot shows variation between the distributions of the four variables. The question is, are these differences significant? 

The code in the cell below uses the [scipy.stats.f_oneway](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) function to perform the one-way ANOVA on the data from the 4 treatment groups. This function computes an F-Statistic and a p-value. Run this code and examine the results. 

In [None]:
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))

The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 4 variables have the same mean. These treatment groups are unlikely to have arisen from the null distribution that all treatments had no effect. 

### Power of the Test

There is also the question of the power of this ANOVA test. In other words, what is the probability that we will detect a difference in means when one actually exists? 

The code in the cell below uses the [statsmodels.stats.power.FTestAnovaPower.solve_power](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.FTestAnovaPower.solve_power.html) function to compute power for effect sizes in the range $[0.1, 1.0)$. The effect size is a standardized difference, measured in units of the common standard deviation. Since the simulated groups have unit standard deviation, we loosely refer to it as the difference in means. The power is plotted against this difference. To be conservative, we are using the smallest number of samples for the variables as the number of observations, nobs. Execute this code.

In [None]:
def plot_power(x, y, xlabel, title):
    plt.plot(x, y, color = 'red', linewidth = 2)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel('Power')

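# Standardized effect sizes to evaluate
diffs = np.arange(start = 0.1, stop = 1.0, step = 0.01) 
# k_groups defaults to 2, so set it to the 4 treatment groups of this example
powers = ssp.FTestAnovaPower().solve_power(effect_size = diffs, nobs=40, alpha=0.05, k_groups=4)
plot_power(diffs, powers, xlabel = 'Difference', title = 'Power vs. difference') 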

You can see that even with 40 observations, the probability of detecting a fairly small difference in means between the groups is quite high. 

**Your turn:** In a hypothetical example, a new manager at an auto dealership observes changes in the average daily total sales by day of the week. She has collected the daily sales data by day for the past 8 weeks. She wants to know if these differences are significant or just from random variation.
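
To connect the effect size to the running example, the sketch below computes a standardized effect size (Cohen's $f$) from the parameters used to simulate the 4 groups, then asks for the power at the full sample size. Treating the simulation parameters as the true population values is an assumption made for illustration, as is the interpretation of `nobs` as the total number of observations.

```python
import numpy as np
import statsmodels.stats.power as ssp

# Parameters used to simulate the 4 treatment groups
means = np.array([0.0, 0.0, 0.5, 0.0])
sizes = np.array([50, 50, 60, 40])
sigma = 1.0   # common standard deviation of the simulated groups

# Cohen's f: spread of the group means relative to the within-group spread
weights = sizes / sizes.sum()
grand_mean = np.sum(weights * means)
cohen_f = np.sqrt(np.sum(weights * (means - grand_mean) ** 2)) / sigma

# Power of the one-way ANOVA F test at the total sample size
power = ssp.FTestAnovaPower().solve_power(effect_size=cohen_f,
                                          nobs=int(sizes.sum()),
                                          alpha=0.05,
                                          power=None,
                                          k_groups=len(means))
print('Cohen f = ' + str(round(cohen_f, 4)))
print('Power = ' + str(round(float(power), 3)))
```

A shift of 0.5 in only one of four groups corresponds to a noticeably smaller standardized effect size than 0.5, which is worth keeping in mind when reading the power curve.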

To solve the problem you will do the following:
1. Execute the code in the cell provided below to compute some simulated data values by day of the week and display a box plot. The parameters for the Normal distributions for each day of the week are based on the average sales for each day and the standard deviation of sales over the month. In other words, we are assuming the variance is constant over the days of the week.
2. In the next cell compute and display the F-Statistic and p-value for this sample. Is this p-value significant with a 0.05 cutoff? 
3. In the third cell compute the power of this test with the  following parameters:
  - Range of differences from 1.0 to 10 in steps of 0.1. 
  - To display the plot of power vs. dollars, you must scale these differences by 50,000, the scale (standard deviation) of the Normal distribution of these data. Do this after you have computed the power values. This process is necessary since the manager will want to see the results in units she understands, dollars. 

In [None]:
nr.seed(3356)
Mon = nr.normal(loc = 400000, scale = 50000, size = 8).tolist()
Tue = nr.normal(loc = 405000, scale = 50000, size = 8).tolist()
Wed = nr.normal(loc = 415000, scale = 50000, size = 8).tolist()
Thr = nr.normal(loc = 420000, scale = 50000, size = 8).tolist()
Fri = nr.normal(loc = 440000, scale = 50000, size = 8).tolist()
Sat = nr.normal(loc = 455000, scale = 50000, size = 8).tolist()

import matplotlib.pyplot as plt
plt.boxplot([Mon,Tue,Wed,Thr,Fri,Sat])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')

Examine the results of your analysis and answer the following questions:
1. Is the difference between the sales on the different days statistically significant at the 95% confidence level? 
2. For sales differences of \$100,000 and \$200,000, what is the approximate power of this test? 

### Tukey's HSD Test: Telling Groups Apart

From the above ANOVA results we know that there is some difference in the means of these variables. However, the **ANOVA does not tell us which variable is significantly different**. From the box plot of the first example, we could guess that group 3 has the greatest difference with respect to group 2, but we really don't know. ANOVA cannot tell us this. 

In 1949 John Tukey proposed a test, which he dubbed the HSD, or [**Honestly Significant Difference**](https://en.wikipedia.org/wiki/Tukey%27s_range_test), test, also known as **Tukey's range test**. The test exhaustively computes the following for each pair of groups:
- Difference of the means
- Confidence interval of the difference in the means
- A p-value from the distribution of the differences

These results are laid out in a table or can be plotted graphically. Only differences in means with a confidence interval that does not contain zero are considered significant.

The cells below contain the code to compute the Tukey HSD for the running example. The code uses the [statsmodels.stats.multicomp.pairwise_tukeyhsd](https://www.statsmodels.org/stable/generated/statsmodels.stats.multicomp.pairwise_tukeyhsd.html) function. Run this code and examine the results to determine which differences are significant.

In [None]:
df = pd.DataFrame({'vals': df1 + df2 + df3 + df4,
                   'group': ['1']*50 + ['2']*50 + ['3']*60 + ['4']*40})
Tukey_HSD = pairwise_tukeyhsd(df.vals, df.group)
print(Tukey_HSD)

Examine the table above. If the difference in means between a pair of groups is significant, the confidence interval will not include 0. Which pairs have a significant difference at the 95% confidence level? You can see the results of this test, with cutoff of 0.05, in the `reject` column at the right of the table. 

The [statsmodels.stats.multicomp.tukeyhsd.plot_simultaneous](https://www.statsmodels.org/dev/generated/statsmodels.sandbox.stats.multicomp.TukeyHSDResults.plot_simultaneous.html) method of the results object returned by pairwise_tukeyhsd allows you to create a plot of the test results. The plot shows the mean of each group as a dot, along with a confidence interval for that mean. Plot these figures and examine the results.
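
For more than a handful of groups it can help to pull the pairwise results into a data frame and filter them. The sketch below is one way to do this, assuming the `meandiffs`, `confint`, `reject` and `groupsunique` attributes of the statsmodels results object; the data are a stand-in with the same layout as the running example, generated with an arbitrary seed.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Stand-in data: 4 groups, with group 3 shifted by 0.5 (arbitrary seed)
rng = np.random.default_rng(1234)
sizes = {'1': 50, '2': 50, '3': 60, '4': 40}
locs = {'1': 0.0, '2': 0.0, '3': 0.5, '4': 0.0}
vals = np.concatenate([rng.normal(loc=locs[g], size=n) for g, n in sizes.items()])
labels = np.repeat(list(sizes.keys()), list(sizes.values()))

result = pairwise_tukeyhsd(vals, labels, alpha=0.05)

# Pairs are reported in the same order as combinations of the sorted group labels
pairs = list(combinations(result.groupsunique, 2))
tukey_df = pd.DataFrame({'group1': [p[0] for p in pairs],
                         'group2': [p[1] for p in pairs],
                         'meandiff': result.meandiffs,
                         'lower': result.confint[:, 0],
                         'upper': result.confint[:, 1],
                         'reject': result.reject})

# Keep only the pairs whose confidence interval excludes zero
print(tukey_df[tukey_df['reject']])
```

Filtering on the `reject` column is equivalent to checking that the interval from `lower` to `upper` does not contain zero.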

In [None]:
_ = Tukey_HSD.plot_simultaneous()

Examine the plot above. There is a line with a dot shown for each variable. The dot is the mean and the line shows the range of the confidence interval for that mean. If the difference in means is significant at the confidence level, the confidence intervals will not overlap. Which pairs in the above plot have a significant difference at the 95% confidence level?

**Your turn:** It would be useful for the manager of the auto dealership to understand which days of the week have significantly different average sales at the 95% confidence level. To solve this problem do the following:
1. Construct a data frame by concatenating the sales data for each day of the week and creating a column that indicates the day of the week for each of these cases. 
2. Compute and print the results of the Tukey HSD test using the `pairwise_tukeyhsd` function. 
3. Use the `plot_simultaneous` method on your results object to display the confidence intervals of the means. 

Which pairs of the day of the week are statistically different at the 95% confidence level? Do you think this result might help the manager to better schedule her sales people? 

## A Word of Caution

While the ANOVA and Tukey HSD methods allow us to tell if there are statistically significant differences between the means of multiple groups, there are limitations. Fundamentally, these tests make **multiple comparisons** between the groups. As the number of groups increases, so does the chance of **false positives!**    

## Summary

We have covered a lot of ground in this lesson. Specifically we have discussed:

- ANOVA tests for differences between multiple groups by comparing the variance between groups to the variance within groups. The null hypothesis is that there are no differences in the means of the groups and they are all samples from the same population. 
- The Tukey HSD test provides a way to determine which group pairs have significant differences. 
- All of these methods involve multiple comparisons and are therefore subject to finding false positives. 

#### Copyright 2017, 2018, 2019, 2020, Stephen F Elston. All rights reserved.