# Dangers of Multiple Comparisons

Testing multiple hypotheses on the same data can be problematic. Exhaustively testing all pairwise relationships between variables in a data set is a commonly used, but generally misleading, form of multiple comparisons. With such a **data dredging** approach, the chance of finding false significance can be surprisingly high.

In this exercise you will perform multiple comparisons on 20 **independent, identically distributed (iid)** variables. Ideally, such tests should not find significant relationships, but the actual result is quite different.

To get started, execute the code in the cell below to load the required packages. 

In [None]:
import pandas as pd
import numpy as np
import numpy.random as nr
from scipy.stats import ttest_ind, f_oneway
from itertools import product

In this exercise you will apply a t-test to all pairwise combinations of identically distributed Normal variables. To do so, we create a data set of 20 iid standard Normal variables with 1000 samples each. Execute the code in the cell below to generate this data and display the mean and variance of each variable.

In [None]:
ncolumns = 20
nr.seed(234)
normal_vars = nr.normal(size=(1000,ncolumns))
print('The means of the columns are\n', np.mean(normal_vars, axis = 0))
print('\nThe variances of the columns are\n', np.var(normal_vars, axis = 0))

Notice that the means and variances are close to 0.0 and 1.0, respectively. As expected, there is not much difference between these variables.

Now for each pair of variables we will compute the t-statistic and p-value and append them to lists.

In [None]:
ttest_results = []
p_values = []
for i,j in product(range(ncolumns),range(ncolumns)):
    if(i != j): # We only want to test between different samples 
        t1, t2 = ttest_ind(normal_vars[:,i], normal_vars[:,j])
        ttest_results.append(t1)
        p_values.append(t2)
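
The loop above uses `product`, which visits both `(i, j)` and `(j, i)`. Swapping the two samples only flips the sign of the t-statistic, so the p-values come in duplicate pairs. As a minimal sketch of an alternative, `itertools.combinations` visits each unordered pair once. The names `rng`, `data`, and `unique_p_values` are illustrative, and the data is regenerated with a different random generator than the cell above, so the values will not match.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Assumption: a fresh generator, independent of the nr.seed() state used above
rng = np.random.default_rng(234)
n_cols = 20
data = rng.normal(size=(1000, n_cols))

unique_p_values = []
for i, j in combinations(range(n_cols), 2):
    result = ttest_ind(data[:, i], data[:, j])
    unique_p_values.append(result.pvalue)

# 20 columns give 20 * 19 / 2 distinct pairs
print('Number of distinct pairwise tests =', len(unique_p_values))
```

With this approach the number of tests is half the number produced by the loop over ordered pairs.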

How many of these t-tests will show **significance** at the 0.05 cut-off level? The loop above runs 380 tests in total (each of the 190 distinct pairs is tested in both orders), so we expect to find a number of falsely significant test results at this level. To find out, complete and execute the code in the cell below to filter the test results and print those that show significance.

In [None]:
signifiance_level = 0.05
def find_significant(p_values, ttest_results, signifiance_level):
    n_cases = 0
    for i in range(len(p_values)):
        ##### Add the missing if statement here #############
        if(?????????????????????????): 
            n_cases += 1
            print('SIGNIFICANT t-test: t-statistic = ', round(ttest_results[i],2), ' and p-value = ', round(p_values[i],4))
    print('\nNumber of falsely significant tests = ', n_cases)        
find_significant(p_values, ttest_results, signifiance_level)        

Notice the large number of apparently significant tests. Do you trust these results to show any important relationships in the data? 
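
As a rough check on how many false positives to expect, the arithmetic can be sketched as follows. Each test on null data has probability `alpha` of a false positive, so the expected count is `alpha` times the number of tests. The family-wise figure additionally assumes the tests are independent, which is not true here because pairs share columns, so treat it as an approximation only. All variable names are illustrative.

```python
alpha = 0.05
n_tests = 380      # ordered pairs tested in the loop
n_distinct = 190   # distinct unordered pairs: 20 * 19 / 2

# Expected number of false positives (holds regardless of dependence)
expected_false_positives = alpha * n_tests

# Approximate chance of at least one false positive among the distinct
# pairs, ASSUMING independent tests (a simplification for this data)
approx_fwer = 1.0 - (1.0 - alpha) ** n_distinct

print('Expected false positives =', round(expected_false_positives, 1))
print('Approximate family-wise error rate =', round(approx_fwer, 5))
```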

Can the Bonferroni correction help? Execute the code in the cell below to apply the Bonferroni-adjusted significance level to the p-values and t-statistics.

> ### Bonferroni correction  
> Several adjustments for the multiple comparisons problem have been proposed. The simplest is the **Bonferroni correction**, named after Carlo Emilio Bonferroni; the method Holm published in 1979 is a step-down refinement of it. The adjustment is simple:
> $$\alpha_b = \frac{\alpha}{m}$$
> where $m$ is the number of tests performed.
>
> The problem with the Bonferroni correction is the reduction in power as the adjusted significance level $\alpha_b$ grows smaller. For big data problems with large numbers of tests, this issue can be especially serious.


In [None]:
# Divide the significance level by the number of tests performed (380)
signifiance_bonforoni = signifiance_level/380.0
print('With Bonferroni correction the significance level is now = ', signifiance_bonforoni)
find_significant(p_values, ttest_results, signifiance_bonforoni)
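
An equivalent way to apply the correction is to adjust the p-values rather than the threshold: multiply each p-value by the number of tests, cap it at 1, and compare against the original significance level. The sketch below uses a small set of made-up p-values (`example_p` is hypothetical, not output from the cells above) to show that both forms give the same decisions.

```python
import numpy as np

alpha = 0.05
# Hypothetical p-values, for illustration only
example_p = np.array([0.0001, 0.004, 0.03, 0.2, 0.6])
m = len(example_p)

# Form 1: shrink the threshold
reject_by_threshold = example_p < alpha / m

# Form 2: inflate the p-values, capped at 1
adjusted_p = np.minimum(example_p * m, 1.0)
reject_by_adjusted_p = adjusted_p < alpha

print('Adjusted p-values:', adjusted_p)
print('Rejected (threshold form): ', reject_by_threshold)
print('Rejected (adjusted-p form):', reject_by_adjusted_p)
```

Reporting adjusted p-values can be convenient because readers can compare them against any significance level without knowing the number of tests.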

Even with the Bonferroni correction, some tests are still falsely significant, if only just barely!

But can we still detect a small effect with the Bonferroni correction, given that it substantially reduces the power of the tests? Execute the code in the cell below, which compares a standard Normal sample to a Normal sample with a small nonzero mean (the effect size), to find out.

In [None]:
nr.seed(567)
# Use a 1-D sample so both arguments have the same shape
ttest_ind(normal_vars[:,0], nr.normal(loc = 0.01, size=1000))

Given the Bonferroni correction, this difference in means would not be found significant. This illustrates the downside of the correction: it may prevent detection of real effects while still producing false positives.
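
The loss of power can be estimated by simulation. The sketch below repeats a two-sample experiment many times and counts how often the test rejects at the unadjusted level and at the Bonferroni-adjusted level. The effect size of 0.1, the number of simulations, and all variable names are assumptions chosen for illustration; they are not taken from the cell above.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # separate generator; does not touch nr.seed()
n_sims = 500        # assumed number of repeated experiments
n_samples = 1000    # samples per group, as in the exercise
effect = 0.1        # assumed true difference in means
alpha = 0.05
n_tests = 380

# Each row is one simulated experiment
group_a = rng.normal(size=(n_sims, n_samples))
group_b = rng.normal(loc=effect, size=(n_sims, n_samples))
sim_p = ttest_ind(group_a, group_b, axis=1).pvalue

# Fraction of experiments in which the real effect is detected
power_unadjusted = np.mean(sim_p < alpha)
power_bonferroni = np.mean(sim_p < alpha / n_tests)

print('Estimated power without correction =', power_unadjusted)
print('Estimated power with Bonferroni    =', power_bonferroni)
```

Because the adjusted threshold is stricter, the estimated power with the correction can never exceed the unadjusted estimate.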

##### Copyright 2020, Stephen F. Elston. All rights reserved. 