This is optional content!  For a deep dive into testing for normality and what happens to that 5% chance of a false positive when you run lots and lots of tests, read on.  You do not need to master this content to succeed in the course.

## Testing for normality
This is the formula for the Shapiro-Wilk test statistic, *W*.

\begin{align} 
W=\frac { (\sum_{i=1}^n a_ix_{(i)} )^2}{ \sum_{i=1}^n (x_i-\bar{x})^2}
\end{align}

$x_{(i)}$ is the *i*th smallest number in the sample.  
$a_i$ is a constant determined by the sample size and acts as a scaling value.  
$\bar{x}$ is the sample mean.  
$x_i$ is the *i*th observation in the sample.

The statistic is a ratio.  The numerator is the squared weighted sum of the ordered sample values, which reflects the spread that would be expected if a sample of that size came from a normal distribution.  The denominator is the sum of the squared differences between each observation and the sample mean.  Values of *W* close to 1 indicate that the distribution is similar to a normal distribution.  The smaller *W* becomes, the more the data diverge from a normal distribution.  

Note that when comparing groups, the distribution of each *group* must be normal.  
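As a minimal sketch of checking each group separately, the snippet below runs `scipy.stats.shapiro` on every group in turn.  The group names and the simulated data are placeholders invented for illustration, not values from this course.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three simulated groups, used only for illustration.
rng = np.random.default_rng(seed=42)
groups = {
    "group_a": rng.normal(loc=20, scale=5, size=30),
    "group_b": rng.normal(loc=30, scale=5, size=30),
    "group_c": rng.exponential(scale=10, size=30),  # deliberately skewed
}

# Test each group on its own rather than pooling all observations together.
shapiro_results = {}
for name, values in groups.items():
    w_stat, p_value = stats.shapiro(values)
    shapiro_results[name] = (w_stat, p_value)
    print(f"{name}: W = {w_stat:.3f}, p = {p_value:.3f}")
```

A small p-value for any one group suggests that group departs from normality, even if the others look fine.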

The Shapiro-Wilk test (like all other tests of normality) comes with an important caveat: it is very sensitive to sample size. With small samples the test has little power to detect non-normality, while with large samples (by some accounts more than 50 observations, by others more than 2000) it will detect even very small and unimportant deviations from normality. Statistical tests of normality should always be accompanied by visualizations.
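The effect of sample size can be sketched with simulated data.  The example below draws a small and a large sample from the same mildly heavy-tailed distribution and tests both.  The choice of a *t* distribution with 5 degrees of freedom, the seed, and the sample sizes are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)

# Assumption for illustration: a t distribution with 5 degrees of freedom
# stands in for data that are bell-shaped but heavier-tailed than normal.
small_sample = rng.standard_t(df=5, size=20)
large_sample = rng.standard_t(df=5, size=4000)

w_small, p_small = stats.shapiro(small_sample)
w_large, p_large = stats.shapiro(large_sample)
print(f"n = 20:   W = {w_small:.3f}, p = {p_small:.4f}")
print(f"n = 4000: W = {w_large:.3f}, p = {p_large:.4g}")

# Q-Q fit against the normal distribution: r close to 1 means the points
# fall near a straight line.
(_, _), (slope, intercept, r) = stats.probplot(large_sample, dist="norm")
print(f"Q-Q correlation for the large sample: r = {r:.3f}")
```

With the large sample the test should reject normality decisively, even though both samples come from the same distribution.  Passing `plot=plt` to `stats.probplot` (with `matplotlib.pyplot` imported as `plt`) draws the Q-Q plot itself, which helps in judging whether a flagged deviation is large enough to matter.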

## Multiple Testing Correction: Tukey's Honest Significant Differences (HSD) Test

Instead of running many pairs of t-tests to find out which of the three roller coaster materials was the odd one out, we could run a Tukey's HSD test.  Unlike a t-test, Tukey's HSD test does pairwise tests that use a variability estimate based on variability from all the groups combined (the denominator from the F-test mentioned in the previous checkpoint) rather than variability from only the two groups being tested.  

\begin{equation}
Q=\frac{M_i-M_j}{\sqrt{MSE/n}}
\end{equation}

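where $MSE={\sum\sum(Y_{ij}-\bar{Y}_j)^2}/{(N-a)}$ is the denominator of the F-test, $M_i$ and $M_j$ are the two group means being compared, $n$ is the number of observations in each group, $N$ is the total number of observations, and $a$ is the number of groups.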

In addition, when calculating the probability of getting this ratio, the test statistic *Q* is evaluated against a modified probability distribution (the studentized range distribution) that takes into account the number of means being compared across all pairwise tests. 
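To make the formula concrete, here is a rough sketch that computes *Q* by hand for one pair of groups, using the roller coaster heights from this notebook.  It assumes equal group sizes and uses `scipy.stats.studentized_range` (available in SciPy 1.7 and later) for the probability; the variable names are chosen for illustration.

```python
import numpy as np
from scipy import stats

# Heights copied from the roller coaster example in this notebook.
steel = np.array([18.5, 14, 30.2, 25.2024, 15, 16, 13.5, 30, 20, 17, 13.716,
                  8.5, 16.1, 18, 41, 30.3, 32.004, 28.004, 30.48, 34])
wood = np.array([38.70, 46, 27.8, 43.52, 33.77, 29.26, 16.764, 45, 48.1,
                 16.764, 24.384, 24.5, 40, 35.96, 22.24, 21.33, 27.73, 23.46,
                 21.64, 30.12])
plastic = np.array([9, 8.2, 12, 21, 6.3, 11.7, 19.44, 4.75, 13, 18, 15.5,
                    15.6, 10, 11.77, 29, 5, 3.2, 14.75, 18.2, 17.7])
groups = [steel, wood, plastic]

a = len(groups)                    # number of groups
n = len(wood)                      # observations per group (equal here)
N = sum(len(g) for g in groups)    # total number of observations

# MSE: pooled within-group sum of squares divided by N - a.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
mse = ss_within / (N - a)

# Q for the Wood vs. Plastic comparison.
mean_diff = wood.mean() - plastic.mean()
q_stat = abs(mean_diff) / np.sqrt(mse / n)

# Probability of a Q at least this large under the null hypothesis,
# from the studentized range distribution with a means and N - a df.
p_value = stats.studentized_range.sf(q_stat, a, N - a)
print(f"mean difference = {mean_diff:.4f}, Q = {q_stat:.3f}, p = {p_value:.4g}")
```

The mean difference printed here should agree with the `meandiff` value that `pairwise_tukeyhsd` reports for the same pair.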

Running Tukey's HSD using Python's `statsmodels` package will get us a table with the differences between each pair of means, the upper and lower bounds of that difference estimate, and whether we should reject the null hypothesis that each pair of groups is not different.  

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
%matplotlib inline

In [2]:
coaster_heights = pd.DataFrame()

steel_heights = [
    18.5, 14, 30.2, 25.2024, 15, 16, 13.5, 30, 20, 17, 13.716, 8.5, 16.1, 18,
    41, 30.3, 32.004, 28.004, 30.48, 34
    ]

wood_heights = [
    38.70, 46, 27.8, 43.52, 33.77, 29.26, 16.764, 45, 48.1, 16.764, 24.384,
    24.5, 40, 35.96, 22.24, 21.33, 27.73, 23.46, 21.64, 30.12
    ]

plastic_heights = [
    9, 8.2, 12, 21, 6.3, 11.7, 19.44, 4.75, 13, 18, 15.5, 15.6, 10, 11.77, 29,
    5, 3.2, 14.75, 18.2, 17.7
    ]

coaster_heights['Steel'] = steel_heights
coaster_heights['Wood'] = wood_heights
coaster_heights['Plastic'] = plastic_heights

heights=np.asarray(
    coaster_heights['Steel'].tolist() +
    coaster_heights['Wood'].tolist() +
    coaster_heights['Plastic'].tolist())

materials = np.array(['Steel', 'Wood','Plastic'])
materials = np.repeat(materials, 20)

tukey = pairwise_tukeyhsd(endog=heights,      # Data
                          groups=materials,   # Groups
                          alpha=0.05)         # Significance level

tukey.summary()  

| group1 | group2 | meandiff | lower | upper | reject |
|---|---|---|---|---|---|
| Plastic | Steel | 9.3698 | 2.8923 | 15.8474 | True |
| Plastic | Wood | 17.6466 | 11.1691 | 24.1241 | True |
| Steel | Wood | 8.2768 | 1.7992 | 14.7543 | True |


#### Multiple Testing Correction

You may be wondering why we went through all this trouble, when we could have just done a bunch of t-tests comparing the groups.  The reason for doing an ANOVA first, and following up with post-hoc tests if the F-test suggests differences, is to prevent false positive results.  

As you've read several times by now, data scientists often use a probability threshold of 5% (p = .05) to determine whether the groups we are comparing in our sample are meaningfully different in the population (though the use of this threshold, or even the existence of a threshold at all, is widely debated: [see this article for a deep dive into the issue](http://www.stats.org.uk/statistical-inference/Johnson1999.pdf)).   One way of interpreting that threshold is that if there were no real difference between the groups in the population, we would see differences at least as large as those in our sample less than 5% of the time.  To put it another way, when there is no real difference, this threshold gives us a 1 in 20 chance of a false positive: of claiming there is a real difference when in fact there is not.  

This 1 in 20 chance, however, only holds when we have done exactly one statistical test on the data.  If we perform two statistical tests on the same data, the chance of getting a false positive on at least one of them is now roughly 1 in 10: at most $\frac1{20}+\frac1{20}$, and $1-(1-\frac1{20})^2 \approx 0.0975$ if the two tests are independent.  If we were to perform 200 tests on the same data and none of the differences were real, we would expect about 10 of our tests to be false positives.  
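The arithmetic can be checked directly.  The sketch below assumes the tests are independent and that every null hypothesis is true (both assumptions are added here for illustration), and the function name is made up for this example.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


alpha = 0.05
for num_tests in (1, 2, 10, 200):
    fwer = familywise_error_rate(num_tests, alpha)
    expected_false_positives = num_tests * alpha
    print(f"{num_tests:>3} tests: P(at least one false positive) = {fwer:.4f}, "
          f"expected false positives = {expected_false_positives:.1f}")
```

For two tests the result is just under 1 in 10, and by 200 tests at least one false positive is all but guaranteed.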

As data scientists, we want to be confident that our conclusions accurately reflect the population.  There are various **multiple test correction** methods that can be used to keep the overall chance of a false positive from rising above the 5% threshold.  One is Tukey's HSD test, which uses information about the overall sample variance and the number of groups being compared to "raise the bar" on how large a group difference must be before it passes the 5% probability threshold.  

There are many other methods, some specific to certain types of analysis and others representing more general approaches suitable for many different statistical goals.  [This article in Nature Biotechnology](https://www.nature.com/articles/nbt1209-1135) has a good review of the decision points involved in selecting a correction approach.  
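As one example of a more general approach, `statsmodels` provides `multipletests`, which adjusts a set of p-values using a chosen method.  The p-values below are made up for illustration, and the three methods shown (Bonferroni, Holm, and Benjamini-Hochberg) are only a sample of what is available.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five separate tests (invented for illustration).
p_values = np.array([0.001, 0.008, 0.02, 0.045, 0.20])

corrections = {}
for method in ("bonferroni", "holm", "fdr_bh"):
    # multipletests returns the reject decisions and adjusted p-values first;
    # the remaining return values are not needed here.
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method=method)
    corrections[method] = (reject, p_adjusted)
    print(f"{method:>10}: adjusted p = {np.round(p_adjusted, 4)}, "
          f"rejected = {int(reject.sum())} of {len(p_values)}")
```

Bonferroni and Holm control the chance of any false positive at all, while Benjamini-Hochberg controls the expected proportion of false positives among the rejected hypotheses, so it is typically less strict.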
