In [407]:
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
from scipy import stats
import seaborn as sns
sns.set()

plt.rcParams["figure.figsize"] = (20,12)

ANOVA = Analysis of Variance

What is it for? 

It lets us compare the means of several groups and decide whether the differences between their sample means are statistically significant, or whether they could plausibly be due to random variation alone.

When would we use it?

Imagine we have 100 respondents to a survey. We categorise the respondents into three categories:

1. Big spenders
2. Average spenders
3. Budget customers

All respondents rate how much they like our new branding on a scale of one to ten. 

Imagine we find the following:

$$ \bar{x}_{big} = 8.2 $$

$$\bar{x}_{avg} = 7.8$$

$$\bar{x}_{budget} = 6.9  $$

Note that these are the sample means from the data! They are estimates, not parameters, i.e. each one is evidence about a hidden population mean.

What if we want to test this hypothesis:

$$H_0 : \mu_{big} = \mu_{avg} = \mu_{budget}$$

You might think it is as simple as saying: if the sample means are different, then the true means (parameters) are probably different!

Perhaps, but how confident can we be? In hypothesis testing, we only reject $H_0$ if the data would be sufficiently unlikely under it, at a significance level $\alpha$ chosen in advance.

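In ANOVA, we use the following formulae to test it, where $k$ is the number of groups, $n_i$ is the number of observations in group $i$, $N$ is the total number of observations, $\bar{Y}_i$ is the mean of group $i$ and $\bar{Y}$ is the grand mean: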

$SST = \sum_{i=1}^k  \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{a = 1}^N (Y_a - \bar{Y})^2$

$SSE = \sum_{i=1}^k  \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y_{i}})^2$

$SST = SSE + SSTr$

Therefore:

$SST - SSE = SSTr$

$MSE = \frac{SSE}{N-k}$

$MSTr = \frac{SSTr}{k-1}$

$F = \frac{MSTr}{MSE}$

$F > F_{\alpha}(k-1, N-k) \Rightarrow \text{reject } H_0$

This actually turns out to be a special case of linear regression: a regression on dummy variables that encode group membership!
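To make the link with regression concrete, here is a minimal sketch. It assumes a small invented data set (three groups of three values, not the survey data) and fits an ordinary least squares model with an intercept and one dummy column per non-reference group. The variable names are illustrative only.

```python
import numpy as np

# Invented toy data (an assumption for illustration, not the survey scores).
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

y = np.concatenate(groups)
N, k = y.size, len(groups)

# Design matrix: intercept plus one 0/1 dummy column for groups 1..k-1.
labels = np.repeat(np.arange(k), [len(g) for g in groups])
X = np.column_stack([np.ones(N)] + [(labels == i).astype(float) for i in range(1, k)])

# Least squares fit; the fitted value for each observation is its group mean.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
rss_full = ((y - X @ beta) ** 2).sum()   # residual sum of squares, equals SSE
rss_null = ((y - y.mean()) ** 2).sum()   # intercept-only model, equals SST

# F-test comparing the dummy-variable model with the intercept-only model.
F_regression = ((rss_null - rss_full) / (k - 1)) / (rss_full / (N - k))

# The same statistic from the ANOVA sums of squares.
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
sstr = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups)
F_anova = (sstr / (k - 1)) / (sse / (N - k))

print(F_regression, F_anova)
```

Both numbers agree up to floating-point error, because the regression's residual sum of squares is exactly the SSE and the intercept-only model's is exactly the SST.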

Q1. Find $F_{\alpha} (k-1, N-k)$ if $\alpha = 0.05$, with $k = 3$ groups and $N = 15$ observations (so the degrees of freedom are 2 and 12)

Answer:

In [408]:
sp.stats.f.ppf(q=0.95, dfn=2, dfd=12)

3.8852938346523933

In [409]:
big = [8.2, 8.4, 8, 8.6, 7.8]

In [410]:
avg = [7.8, 7.7, 7.9, 7.5, 8.1]

In [411]:
budget = [6.9, 5.9, 7.9, 5, 8.8]

Big spenders: $[8.2, 8.4, 8, 8.6, 7.8]$

Avg. spenders: $[7.8, 7.7, 7.9, 7.5, 8.1]$

Budget spenders: $[6.9, 5.9, 7.9, 5, 8.8]$

Can we conclude that the group means are different at a 95% confidence level ($\alpha = 0.05$)?

In [412]:
obs = big + avg + budget

In [413]:
obs

[8.2, 8.4, 8, 8.6, 7.8, 7.8, 7.7, 7.9, 7.5, 8.1, 6.9, 5.9, 7.9, 5, 8.8]

Let's find the grand mean $\bar{Y}$

In [414]:
gmean = sum(obs)/len(obs)

In [415]:
gmean

7.633333333333335

Now let's find the SST

$SST = \sum_{i=1}^k  \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{a = 1}^N (Y_a - \bar{Y})^2$

In [416]:
SST = sum([(_ - gmean)**2 for _ in obs])

In [417]:
SST

14.253333333333334

Now the SSE: 

$SSE = \sum_{i=1}^k  \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y_{i}})^2$



In [418]:
SSE = sum([(_ - np.mean(big))**2 for _ in big]) + sum([(_ - np.mean(avg))**2 for _ in avg]) + sum([(_ - np.mean(budget))**2 for _ in budget])

In [419]:
SSE

9.820000000000002

$SSTr = SST - SSE$

$MSE = \frac{SSE}{N-k}$

$MSTr = \frac{SSTr}{k-1}$

In [420]:
SSTr = SST - SSE

In [421]:
MSE = SSE/(15-3)

In [422]:
MSTr = SSTr/(3-1)

In [423]:
SSTr

4.433333333333332

In [424]:
MSE

0.8183333333333335

In [425]:
MSTr

2.216666666666666

In [426]:
F = MSTr/MSE

In [427]:
F

2.7087576374745406

In [428]:
Fcrit = sp.stats.f.ppf(q=0.95, dfn=2, dfd=12)

In [429]:
Fcrit

3.8852938346523933

In [430]:
F > Fcrit

False

Conclusion: Since $F < F_{0.05}(2, 12)$, we cannot reject $H_0$ at $\alpha = 0.05$
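As a cross-check on the hand calculation, SciPy's `scipy.stats.f_oneway` runs the same one-way ANOVA in a single call and also returns a p-value. This is only a sketch for comparison; the lists are repeated so that the cell runs on its own.

```python
from scipy import stats

big = [8.2, 8.4, 8, 8.6, 7.8]
avg = [7.8, 7.7, 7.9, 7.5, 8.1]
budget = [6.9, 5.9, 7.9, 5, 8.8]

# One-way ANOVA: returns the F-statistic and the p-value P(F_{2,12} >= F).
result = stats.f_oneway(big, avg, budget)

print(result.statistic)  # should match the hand-computed F of about 2.709
print(result.pvalue)     # larger than 0.05, so H0 is not rejected at alpha = 0.05
```

The p-value gives the same verdict as comparing $F$ with the critical value: $F > F_{\alpha}$ exactly when the p-value is below $\alpha$.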

What about a 90% confidence level ($\alpha = 0.10$)?

In [431]:
Fcrit2 = sp.stats.f.ppf(q=0.90, dfn=2, dfd=12)

In [432]:
Fcrit2

2.806795605732417

In [433]:
F > Fcrit2

False

Also no, but it's close!
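The steps so far can be wrapped in one function so they are easy to rerun on other data. This is a minimal sketch: `one_way_anova` is a hypothetical helper name, and it simply follows the SST, SSE, SSTr, MSE and MSTr formulae used above.

```python
import numpy as np
from scipy import stats


def one_way_anova(groups, alpha=0.05):
    """Hypothetical helper: one-way ANOVA from the sums-of-squares formulae."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    obs = np.concatenate(groups)
    N, k = obs.size, len(groups)

    sst = ((obs - obs.mean()) ** 2).sum()                    # total variation
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)   # within-group variation
    sstr = sst - sse                                         # between-group variation

    mse = sse / (N - k)
    mstr = sstr / (k - 1)
    f_stat = mstr / mse
    f_crit = stats.f.ppf(1 - alpha, dfn=k - 1, dfd=N - k)

    return {"SST": sst, "SSE": sse, "SSTr": sstr, "MSE": mse, "MSTr": mstr,
            "F": f_stat, "F_crit": f_crit, "reject_H0": bool(f_stat > f_crit)}


big = [8.2, 8.4, 8, 8.6, 7.8]
avg = [7.8, 7.7, 7.9, 7.5, 8.1]
budget = [6.9, 5.9, 7.9, 5, 8.8]

res_05 = one_way_anova([big, avg, budget], alpha=0.05)
res_10 = one_way_anova([big, avg, budget], alpha=0.10)
print(res_05["F"], res_05["F_crit"], res_05["reject_H0"])
print(res_10["F"], res_10["F_crit"], res_10["reject_H0"])
```

Computing $N$ and $k$ from the input, rather than hard-coding 15 and 3, means the same function works for groups of different sizes.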

Now let's repeat the process, but this time we'll reduce the amount of variance for the budget spender column!

In [434]:
np.mean(budget)

6.9

In [435]:
np.var(budget)

1.8440000000000005

In [436]:
budget2 = [6.9, 6.8, 7, 6.8, 7]

In [437]:
np.mean(budget2)

6.9

In [438]:
np.var(budget2)

0.008000000000000014

In [439]:
obs = big + avg + budget2

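Note how small the SST computed below has become! Previously it was 14.253. SST measures how far the observations lie from the grand mean, and the new budget scores are far less spread out. The grand mean itself is unchanged, because the budget group mean is still 6.9, so we can reuse `gmean`.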

In [440]:
SST = sum([(_ - gmean)**2 for _ in obs])
SST

5.0733333333333315

However, the SSE has also become smaller! A lot smaller! Previously it was 9.82!

Remember that the SSE measures the within-treatment variation, pooled across all groups. Dividing it by $N-k$ gives the MSE.

In [441]:
SSE = sum([(_ - np.mean(big))**2 for _ in big]) + sum([(_ - np.mean(avg))**2 for _ in avg]) + sum([(_ - np.mean(budget2))**2 for _ in budget2])
SSE

0.6399999999999999

In [442]:
SSTr = SST - SSE
MSE = SSE/(15-3)
MSTr = SSTr/(3-1)

SSTr (the treatment sum of squares) measures the variation between the treatment groups, specifically between the means of the treatment groups.

In fact, SSTr can be calculated directly:

$$SSTr = \sum_{i=1}^k n_i (\bar{Y}_i - \bar{Y})^2 $$

This is just a sum of squares comparing the mean of each group with the grand mean, weighted by the size of the sub-sample for each group!

In [443]:
output = []

for group in [big, avg, budget2]:
    # n_i * (group mean - grand mean)^2
    output.append(len(group)*(np.mean(group) - gmean)**2)

sum(output)

4.4333333333333265

In [444]:
SSTr

4.433333333333332

Note how the sum of squares treatment is the same as before we reduced the variance for the budget group! This is because the means of each group i.e. the $\bar{Y_i}$'s have remained the same.

Below, note that MSE is a measure of the within-group variance. Look how small it is! 0.053!

In [445]:
MSE

0.05333333333333332

Below we see the MSTr, a measure of the between-group variance. Look, it's the same as before! 2.217

In [446]:
MSTr

2.216666666666666

The ratio of the between-group variance to the within-group variance will be very different now. Look at the F-stat we get!

In [447]:
F = MSTr/MSE
F

41.56249999999999

Wow! Look at that F-stat.

In [448]:
Fcrit3 = sp.stats.f.ppf(q=0.99, dfn=2, dfd=12)
Fcrit3

6.9266081401913

In [449]:
F > Fcrit3

True

Now that we have reduced the variance for budget customers, we can reject $H_0$ even at $\alpha = 0.01$
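Instead of checking one $\alpha$ at a time, we can ask for the p-value directly. The sketch below uses the survival function `scipy.stats.f.sf`, which gives $P(F_{2,12} \geq F)$; the two F values are copied from the outputs above.

```python
from scipy import stats

dfn, dfd = 2, 12  # k - 1 and N - k for three groups of five

F_original = 2.7087576374745406  # F with the original budget scores
F_reduced = 41.56249999999999    # F after shrinking the budget-group variance

# Survival function: probability of an F at least this large if H0 is true.
p_original = stats.f.sf(F_original, dfn, dfd)
p_reduced = stats.f.sf(F_reduced, dfn, dfd)

print(p_original)  # above 0.10, so no rejection at the 5% or 10% level
print(p_reduced)   # far below 0.01
```

A p-value below $\alpha$ is equivalent to $F > F_{\alpha}(k-1, N-k)$, so this is the same test expressed on a different scale.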

Conclusion: The smaller the within-group variation (measured by MSE via the SSE) is relative to the between-group variation (measured by MSTr via the SSTr), the larger the F-stat, and the more confident we can be that the means of the groups are truly different from each other.

Once the ratio $F = MSTr/MSE$ exceeds the critical value $F_{\alpha}(k-1, N-k)$, we reject $H_0$ at the significance level $\alpha$.

Think about it. The less variance inside a group, the more confident we can be that we will find members of that "class" or group clustered around the mean for that group. 

If we find two groups like this i.e. very clustered around the mean for their group, and the means for each group are very different from each other, it becomes more likely that the means for each class are truly different. 

Alternatively, you can think of this as us having more reason to believe that the population means (the parameters) are truly different (Bayesian thinking!).

On the other hand, imagine we had two groups that had quite similar means, and within each group we found a huge amount of variance (i.e. the values are scattered and spread). This would make it far more likely that the difference between the means could be explained by the random noise exhibited by the groups.

In that case, it would not be reasonable to believe that we had found a true difference.
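To see this effect numerically, here is a small sketch that keeps every group mean fixed and only stretches or shrinks the spread inside each group. `rescale_spread` is a hypothetical helper written for this illustration, and the scaling factors are arbitrary choices.

```python
import numpy as np
from scipy import stats

big = [8.2, 8.4, 8, 8.6, 7.8]
avg = [7.8, 7.7, 7.9, 7.5, 8.1]
budget = [6.9, 5.9, 7.9, 5, 8.8]


def rescale_spread(group, factor):
    """Hypothetical helper: keep the group mean, scale deviations from it by `factor`."""
    g = np.asarray(group, dtype=float)
    return g.mean() + factor * (g - g.mean())


results = {}
for factor in [2.0, 1.0, 0.5, 0.25]:
    scaled = [rescale_spread(g, factor) for g in (big, avg, budget)]
    f_stat, p_value = stats.f_oneway(*scaled)
    results[factor] = f_stat
    print(f"spread x{factor}: F = {f_stat:.2f}, p = {p_value:.5f}")
```

Because the group means do not move, MSTr stays constant while MSE scales with the square of the factor, so halving the within-group spread multiplies F by four.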