# Assignment 01 - Pitfalls in Data Mining     
## CSCI E-96

The goal of data mining is to find important relationships in large, complex datasets. These datasets typically contain a large number of variables. The **high-dimensional** nature of the data leads to some commonly encountered pitfalls, which in turn lead to incorrect inferences.   

A related problem is cutting off a large-scale analysis when a desired relationship is 'found'. This practice of **p-value mining** often leads to unwarranted inferences.     

## Multiple Hypothesis Testing

Testing multiple hypotheses in high-dimensional data can be problematic. Exhaustively testing all pairwise relationships between variables in a data set is a commonly used, but generally misleading, form of **multiple comparisons**. The chance of finding false significance using such a **data dredging** approach can be surprisingly high. 

In this exercise you will perform multiple comparisons on only 20 **independent, identically distributed (iid)** variables. Ideally, such tests should not find significant relationships, but the actual result is quite different. 

To get started, execute the code in the cell below to load the required packages. 

In [18]:
import pandas as pd
import numpy as np
import numpy.random as nr
from scipy.stats import ttest_ind, f_oneway
from itertools import product, combinations

In this exercise you will apply a t-test to all pairwise combinations of identical Normally distributed variables. In this case, we will create a data set with 20 iid Normal variables of 1000 samples each. Execute the code in the cell below to create this data and display the mean and variance of each variable.  

In [19]:
ncolumns = 20
nr.seed(234)
normal_vars = nr.normal(size=(1000,ncolumns))
print('The means of the columns are\n', np.mean(normal_vars, axis = 0))
print('\nThe variances of the columns are\n', np.var(normal_vars, axis = 0))

The means of the columns are
 [-1.16191649e-01  2.80829317e-02 -1.78516419e-02 -1.44691489e-02
  3.03718152e-02  1.20007442e-02 -9.58845606e-05  1.98662580e-03
  4.94154934e-02 -4.11640866e-02 -6.32977862e-03 -5.93868192e-02
 -2.56373595e-02  1.43568791e-02 -1.44725765e-02 -1.37023955e-02
  1.80622439e-02  5.87029691e-02 -2.02650514e-02 -1.56346106e-02]

The variances of the columns are
 [0.94834508 1.04744241 1.0258018  0.96977571 1.0089001  1.04113864
 1.00657222 0.99192594 1.04713487 1.04329434 1.04023108 0.96791346
 1.03706907 1.07179865 1.01431404 1.05060289 1.02054329 0.9686211
 1.02810287 0.99521555]


Notice that the means and variances are close to 0.0 and 1.0, respectively. As expected, there is not much difference between these variables.

How many of these t-tests will show **significance** at the 0.05 cut-off level? There are 190 unique pairwise combinations of the 20 variables (380 if the order of each pair is counted), so we expect to find a number of falsely significant test results at this level. To find out, complete and execute the code in the cells below to filter the test results and print those that show significance. 
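As a rough guide to how many false positives to expect, the sketch below computes the number of unique pairs, the expected count of falsely significant tests, and the chance of at least one. It assumes, purely for simplicity, that the tests are independent; they are not quite, since each vector takes part in 19 tests. The variable names are illustrative and not part of the assignment code.

```python
from math import comb

n_vectors = 20
alpha = 0.05

# Number of unique unordered pairs of vectors
n_tests = comb(n_vectors, 2)

# Expected number of false positives when every null hypothesis is true
expected_false_positives = alpha * n_tests

# Probability of at least one false positive, ASSUMING independent tests
prob_at_least_one = 1.0 - (1.0 - alpha) ** n_tests

print(n_tests, expected_false_positives, prob_at_least_one)
```

Under the independence assumption, at least one false positive is close to certain at this number of tests.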

### Creating a hash 

The goal of this exercise is to compute pairwise hypothesis tests of the differences in means for each of the iid Normal vectors. As an intermediate step you will create a dictionary with the results of these hypothesis tests. The dictionaries store **key-value**, $(K,V)$, pairs. Each key must represent an index for the two vectors used to compute the test statistic. 

The question is, how can we represent the key for the pair of vectors? One option is to use a dictionary of dictionaries. This approach has a dictionary indexed by the first key, which contains a dictionary indexed by the second key. While this nested dictionary approach would work, it requires two key look-ups per value. A better approach is to create a hash of the two indexes and use the hash as the key for the dictionary. Using a hash, the values in the dictionary can be accessed in a single step with the hashed key.


> **Computational Note:** The Python dictionary is an efficient and reasonably scalable **hash table**. The hash function used depends on the type of the key: integer, string, etc. The resulting dictionary of key-value pairs, $(K,V)$, can therefore be accessed in far less than linear time, on average about $O(1)$ per look-up.  

If you are not familiar with Python dictionaries you can find a short tutorial [here](https://www.tutorialspoint.com/python_data_structure/python_hash_table.htm), as well as many other places on the web.
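To make the two key designs concrete, here is a small sketch comparing a nested dictionary with a single dictionary keyed by one combined integer. It also shows that a Python tuple `(i, j)` is itself hashable and can be used directly as a key. That is an alternative to a hand-written hash, not the approach the exercises follow. The helper `combine` and the stored strings are hypothetical.

```python
def combine(i, j, base=4096):
    # Hypothetical helper: pack two small indexes into one integer key
    return i + (j + 1) * base

nested = {0: {1: 'test 0-1'}}          # two look-ups: nested[0][1]
flat = {combine(0, 1): 'test 0-1'}     # one look-up with an integer key
tupled = {(0, 1): 'test 0-1'}          # one look-up with a tuple key

print(nested[0][1], flat[combine(0, 1)], tupled[(0, 1)])
```

All three look-ups return the same stored value; the designs differ only in how many dictionary accesses are needed and in how the key is built.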

> **Exercise 1-1:** Given that our space of vectors is actually quite small, just 20, we do not need a sophisticated and scalable hash function. The hashed key will be used to store and retrieve the values using a Python dictionary, in about $O(1)$ time on average.     

> In this exercise you will test a simple hash function and its inverse. Examine the code below and notice that the hash function encodes the two indexes into a single integer by simple addition and multiplication. Division (a slower process) is avoided. Efficiency of the inverse hash function is less important, since it is used less frequently.  

> To test this hash, do the following:    
> 1. Using the Python [itertools.combinations](https://docs.python.org/3/library/itertools.html#itertools.combinations) function, create all unique pairwise combinations of indexes i and j. The arguments to this function are the indexes to the iid Normal vectors and the combination length, 2. The iterable is `range(ncolumns)`.    
> 2. Within this loop compute the hash, and the inverse hash of the indexes, i, and j.   
> 3. On a single line print the following; the values of i and j, the hash key value, and the unhashed values of i and j.  

In [20]:
def hash_function(i, j, hash_key=4096):
    return i + (j + 1) * hash_key

def unhash(hash_val, hash_key=4096):
    # Integer floor division recovers j without a float round trip
    j = hash_val // hash_key - 1
    i = hash_val - (j + 1) * hash_key
    return i, j

## Put your code below. 
for i,j in combinations(range(ncolumns), 2):
    hash_value = hash_function(i,j)
    i_out, j_out = unhash(hash_value)
    print('i = ' + str(i) + '  j = ' + str(j) + '   hash = ' + str(hash_value) + '  after, i= ' + str(i_out) + '  j = ' + str(j_out))

i = 0  j = 1   hash = 8192  after, i= 0  j = 1
i = 0  j = 2   hash = 12288  after, i= 0  j = 2
i = 0  j = 3   hash = 16384  after, i= 0  j = 3
i = 0  j = 4   hash = 20480  after, i= 0  j = 4
i = 0  j = 5   hash = 24576  after, i= 0  j = 5
i = 0  j = 6   hash = 28672  after, i= 0  j = 6
i = 0  j = 7   hash = 32768  after, i= 0  j = 7
i = 0  j = 8   hash = 36864  after, i= 0  j = 8
i = 0  j = 9   hash = 40960  after, i= 0  j = 9
i = 0  j = 10   hash = 45056  after, i= 0  j = 10
i = 0  j = 11   hash = 49152  after, i= 0  j = 11
i = 0  j = 12   hash = 53248  after, i= 0  j = 12
i = 0  j = 13   hash = 57344  after, i= 0  j = 13
i = 0  j = 14   hash = 61440  after, i= 0  j = 14
i = 0  j = 15   hash = 65536  after, i= 0  j = 15
i = 0  j = 16   hash = 69632  after, i= 0  j = 16
i = 0  j = 17   hash = 73728  after, i= 0  j = 17
i = 0  j = 18   hash = 77824  after, i= 0  j = 18
i = 0  j = 19   hash = 81920  after, i= 0  j = 19
i = 1  j = 2   hash = 12289  after, i= 1  j = 2
i = 1  j = 3   hash =

> Examine the results you have printed. Do the unhashed values of i and j agree with the original values? Is there any evidence of a **hash key collision** whereby two combinations of i and j hash to the same value? You can get a feel for the answer to the second question by noticing how the hash values change with i for a few fixed values of j.     
> **End of exercise.**
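Reading the printed lines is one way to check the hash; a programmatic check is another. The sketch below repeats the two functions so that it is self-contained, then tests for collisions and for a correct round trip over all pairs. The names `pairs`, `keys`, `no_collisions`, and `round_trip_ok` are illustrative.

```python
from itertools import combinations

def hash_function(i, j, hash_key=4096):
    return i + (j + 1) * hash_key

def unhash(hash_val, hash_key=4096):
    j = hash_val // hash_key - 1
    i = hash_val - (j + 1) * hash_key
    return i, j

ncolumns = 20
pairs = list(combinations(range(ncolumns), 2))
keys = [hash_function(i, j) for i, j in pairs]

# No collisions: every pair maps to a distinct key
no_collisions = len(set(keys)) == len(pairs)

# Round trip: unhash recovers the original indexes for every key
round_trip_ok = all(unhash(k) == pair for k, pair in zip(keys, pairs))

print(len(pairs), no_collisions, round_trip_ok)
```

The encoding stays collision-free only while `i` is smaller than `hash_key`, which holds easily for 20 columns.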

### The map process

We are constructing this example as a map-reduce algorithm. The first step is the map process that computes the t-test for the pairwise iid Normal vectors.   

> **Exercise 1-2:** You will now create the code for the map task, which computes the t-test results for every pairwise combination of the iid Normal vectors. Use the following steps:  
> 1. Create a loop over all combinations of the pairs of iid Normal vectors, i and j.  
> 2. Filter out cases where $i = j$, the same vector.   
> 3. Compute the hash key value for the indexes, i and j.  
> 4. Add the two-tailed t-statistic and p-value returned by the [ttest_ind](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) function to the dictionary, indexed by the hash key. The arguments to this function are the iid Normal vectors indexed by i and j. 
> 5. Once the loop has executed, verify that the length of the dictionary is $(N^2 - N)/2$, as expected. This result verifies there have been no hash key collisions. 

In [21]:
def map_hypothesis(vars):
    ncolumns = vars.shape[1]
    hypothesis_tests = {}
    for i,j in combinations(range(ncolumns), 2): 
        if(i != j): # We only want to test between different samples 
            ## Compute the hash key
            hash_key = hash_function(i,j)
            hypothesis_tests[hash_key] = ttest_ind(vars[:,i], vars[:,j])
    return hypothesis_tests

hypothesis_tests = map_hypothesis(normal_vars)
len(hypothesis_tests)        

190

> **End of exercise.**

### The reduce task

Now that the t-tests have been computed in the map process it is time to create a reduce process to filter the results to find the falsely significant results. The reduce task filters any non-significant cases from the dictionary. 

> **Exercise 1-3:** You will now create and apply the following code for the reduce process:   
> 1. Create a loop over all combinations of the pairs of iid Normal vectors, i and j.  
> 2. Filter out cases where $i = j$, where the keys will not exist in the dictionary.
> 3. Compute the hash key with `hash_function()` defined above   
> 4. Extract the t-statistic and p-value from the entry in the dictionary, indexed by the key.   
> 5. If the p-value is greater than the significance cut-off level remove the dictionary entry with the `pop(key)` method. 
> 6. Once the loop has executed, print the length of the remaining dictionary and on one line the indexes of the iid Normal vectors, the t-statistic, and p-value of the significant tests. 

In [23]:
significance_level = 0.05
def reduce_significance(test_dictionary, significance_level):    
    keys = list(test_dictionary.keys())
    for key in keys: 
        t_statistic, p_value = test_dictionary[key] 
        if p_value > significance_level:
            test_dictionary.pop(key)
    return test_dictionary                

hypothesis_tests = reduce_significance(hypothesis_tests, significance_level)
len(hypothesis_tests)            

22

In [24]:
for key in hypothesis_tests.keys():
    i, j = unhash(key)
    print('For vectors i = {0:2d}, j = {1:2d} the t-statistic = {2:6.4f}  p-value = {3:6.4f}'.format(i, j, hypothesis_tests[key][0],hypothesis_tests[key][1]))

For vectors i =  0, j =  1 the t-statistic = -3.2279  p-value = 0.0013
For vectors i =  0, j =  2 the t-statistic = -2.2122  p-value = 0.0271
For vectors i =  0, j =  3 the t-statistic = -2.3215  p-value = 0.0204
For vectors i =  0, j =  4 the t-statistic = -3.3112  p-value = 0.0009
For vectors i =  0, j =  5 the t-statistic = -2.8726  p-value = 0.0041
For vectors i =  0, j =  6 the t-statistic = -2.6244  p-value = 0.0087
For vectors i =  0, j =  7 the t-statistic = -2.6816  p-value = 0.0074
For vectors i =  0, j =  8 the t-statistic = -3.7054  p-value = 0.0002
For vectors i =  0, j = 10 the t-statistic = -2.4624  p-value = 0.0139
For vectors i =  0, j = 12 the t-statistic = -2.0313  p-value = 0.0424
For vectors i =  0, j = 13 the t-statistic = -2.9031  p-value = 0.0037
For vectors i =  0, j = 14 the t-statistic = -2.2949  p-value = 0.0218
For vectors i =  0, j = 15 the t-statistic = -2.2912  p-value = 0.0221
For vectors i =  0, j = 16 the t-statistic = -3.0241  p-value = 0.0025
For ve

> Notice the large number of apparently significant tests. Answer the following questions:  
> 1. Is the number of false positive cases higher than expected?    
> 2. Examine which of the iid Normal vectors contribute to the false positive results. Are there vectors which contribute multiple times?   
> **End of exercise.**
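One way to approach the second question is to tally how often each vector index appears among the significant pairs. The sketch below uses only the pairs visible in the truncated output above, so the tally is incomplete; the variable names are illustrative.

```python
from collections import Counter

# Pairs (i, j) copied from the visible part of the truncated output above
significant_pairs = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7),
                     (0, 8), (0, 10), (0, 12), (0, 13), (0, 14), (0, 15), (0, 16)]

counts = Counter()
for i, j in significant_pairs:
    counts.update((i, j))    # each test involves both vectors

print(counts.most_common(3))
```

In a full analysis the same tally would be run over every key left in the dictionary after the reduce task, using `unhash(key)` to recover the indexes.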

### Bonferroni correction  

Several adjustments to the multiple comparisons problem have been proposed. Dunn published a method known as the **Bonferroni correction** in 1961. The Bonferroni correction is a widely used method to reduce the false positive rate of hypothesis tests. The adjustment is simple:
$$\alpha_b = \frac{\alpha}{m}$$

where $m$ is the number of groups (here, the number of hypothesis tests).

Can the Bonferroni correction help? Yes, by greatly increasing the confidence level required for a statistically significant result. The problem with the Bonferroni correction is the reduction in power as the adjusted significance level, $\alpha_b$, grows smaller. For big data problems with large numbers of groups, this issue can be especially serious. 
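A minimal sketch of the adjustment follows, applied to a handful of made-up p-values. The function name `bonferroni_reject` and the example values are hypothetical and are not part of the assignment code.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Return the adjusted threshold and a boolean mask of rejected nulls."""
    p_values = np.asarray(p_values, dtype=float)
    m = p_values.size              # number of tests
    alpha_b = alpha / m            # Bonferroni-adjusted cut-off
    return alpha_b, p_values <= alpha_b

# Hypothetical p-values, for illustration only
p_example = [0.0001, 0.004, 0.03, 0.2, 0.8]
alpha_b, rejected = bonferroni_reject(p_example)
print(alpha_b, rejected)
```

With five tests the cut-off falls from 0.05 to 0.01, so the p-value of 0.03 that would pass unadjusted is no longer significant. This is the loss of power described above.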

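> **Exercise 1-4:** You will now apply the Bonferroni correction to the iid Normal vectors. To do so, divide the significance level by the number of tests, 190, and rerun the reduce task with the adjusted cut-off:   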

In [25]:
significance_bonforoni = significance_level/190.0
hypothesis_tests = reduce_significance(hypothesis_tests, significance_bonforoni)
len(hypothesis_tests)            

2

> Even with the Bonferroni correction we have some false significance tests, if only just barely!    
> **End of exercise.**

But can we detect a small effect with the Bonferroni correction, given that this method significantly reduces the power of tests? Execute the code in the cell below, which compares a standard Normal to a Normal with a small mean (effect size), to find out. 

In [26]:
nr.seed(567)
print(significance_bonforoni)
ttest_ind(normal_vars[:,0], nr.normal(loc = 0.01, size=(1000,1)))

0.0002631578947368421


Ttest_indResult(statistic=array([-2.49553488]), pvalue=array([0.01265684]))

Given the Bonferroni correction, this difference in means would not be found significant. This illustrates the downside of the correction, which may prevent detection of significant effects, while still finding false significance. 

## False Discovery Rate Control Methods   

In [27]:
ncolumns = 20
nr.seed(2334)
normal_samples = nr.normal(size=(1000,ncolumns))
normal_samples[:,:2] = np.add(normal_samples[:,:2], 0.1)
hypothesis_tests = map_hypothesis(normal_samples)

In [28]:
hypothesis_tests = map_hypothesis(normal_vars)
len(hypothesis_tests)        

190

### Holm's method

$$p(i) \le \text{Threshold(Holm's)} = \frac{\alpha}{N - i + 1}$$

where $p(i)$ is the $i$th smallest p-value among the $N$ tests.

Example: for the 10th ordered p-value with 1,000 total tests (genes) and a significance level of 0.05, the cutoff is:   

$$p(10) \le \frac{0.05}{1000 - 10 + 1} = 0.00005045$$
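The threshold above can be turned into a short procedure. The sketch below sorts the p-values, compares the $i$th smallest to $\alpha/(N - i + 1)$, and stops at the first p-value that fails. That stopping rule is the usual step-down form of Holm's method and is an assumption here, since the text gives only the threshold. The function name `holm_reject` and the example p-values are hypothetical.

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Step-down Holm procedure; returns a boolean mask in the input order."""
    p_values = np.asarray(p_values, dtype=float)
    n = p_values.size
    order = np.argsort(p_values)            # indexes of ascending p-values
    reject = np.zeros(n, dtype=bool)
    for rank, idx in enumerate(order, start=1):
        threshold = alpha / (n - rank + 1)
        if p_values[idx] > threshold:
            break                           # stop at the first failure
        reject[idx] = True
    return reject

# Hypothetical p-values, for illustration only
p_example = [0.03, 0.0001, 0.8, 0.012, 0.2]
print(holm_reject(p_example))
```

Because the threshold relaxes as the rank increases, Holm's method rejects at least as many hypotheses as the Bonferroni correction at the same $\alpha$.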

In [None]:
def reduce_sort_key_reverse(kv_dictionary):  
    

##### Copyright 2020, 2021, Stephen F. Elston. All rights reserved. 