### Theory

#### Hypothesis Testing – The problem of multiple comparisons [5 points]

Experimentation in AI often proceeds like this:

1. Modify or build an algorithm.
2. Compare the algorithm to a baseline by running a hypothesis test.
3. If the result is not significant, go back to step 1.
4. If the result is significant, start writing a paper.

Assume every test has a Type I error rate of α. As a function of the number of hypothesis tests m, what is:

1. P(the m-th experiment gives a significant result | m experiments lacking power to reject H0)?
2. P(at least one significant result | m experiments lacking power to reject H0)?
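
One way to explore these two quantities numerically is sketched below. It rests on two assumptions that the question does not state: the tests are independent, and H0 is true in every experiment, so each test is significant with probability α. For the first quantity it takes one possible reading, namely that the loop stops at the first significant result, so the m-th test being significant means the first m - 1 were not. The function names are illustrative.

```python
def prob_first_significant_at(m, alpha):
    """P(tests 1..m-1 are not significant and test m is), assuming independent tests under H0."""
    return (1 - alpha) ** (m - 1) * alpha


def prob_at_least_one_significant(m, alpha):
    """P(at least one of m tests is significant), assuming independent tests under H0."""
    return 1 - (1 - alpha) ** m


if __name__ == "__main__":
    alpha = 0.05
    for m in (1, 5, 14, 50):
        print(m,
              round(prob_first_significant_at(m, alpha), 3),
              round(prob_at_least_one_significant(m, alpha), 3))
```

Under these assumptions and α = 0.05, the probability of at least one spurious significant result passes 50% at the 14th test.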

#### Bias and unfairness in Interleaving experiments [10 points]
Balanced interleaving has been shown to be biased in a number of corner cases. The lecture gave an example in which two ranked lists of length 3 are interleaved and a population of randomly clicking users makes algorithm A win ⅔ of the time, even though in theory each algorithm should win 50% of the time. Can you come up with two ranked lists of length 3 and a distribution of clicks over them for which team-draft interleaving is unfair to the better algorithm?
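
To experiment with candidate lists, a minimal sketch of team-draft interleaving may help. It assumes the common formulation in which the team with fewer picks goes next, ties are broken by a coin flip, and each click is credited to the team that contributed the clicked result. The function names, the document ids (`d1` to `d4`), and the choice to interleave until every document is placed are illustrative assumptions, not taken from the lecture.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Interleave two rankings; return (interleaved list, team label per position)."""
    rng = rng or random.Random()
    n_docs = len(set(ranking_a) | set(ranking_b))
    sources = {'A': ranking_a, 'B': ranking_b}
    picks = {'A': 0, 'B': 0}
    interleaved, teams = [], []

    while len(interleaved) < n_docs:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks['A'] != picks['B']:
            team = 'A' if picks['A'] < picks['B'] else 'B'
        else:
            team = rng.choice(['A', 'B'])

        remaining = [d for d in sources[team] if d not in interleaved]
        if not remaining:
            # This team's list is exhausted, so the other team picks instead.
            team = 'B' if team == 'A' else 'A'
            remaining = [d for d in sources[team] if d not in interleaved]

        interleaved.append(remaining[0])
        teams.append(team)
        picks[team] += 1

    return interleaved, teams


def winner(teams, clicked_positions):
    """Credit each clicked position to the team that contributed it."""
    a = sum(1 for i in clicked_positions if teams[i] == 'A')
    b = sum(1 for i in clicked_positions if teams[i] == 'B')
    return 'A' if a > b else 'B' if b > a else 'tie'


if __name__ == "__main__":
    rng = random.Random(0)
    mixed, teams = team_draft_interleave(['d1', 'd2', 'd3'], ['d2', 'd3', 'd4'], rng)
    print(mixed, teams, winner(teams, clicked_positions=[0]))
```

Repeating the interleaving many times with a fixed click distribution and counting the outcomes of `winner` gives an empirical win rate for each algorithm.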


In [56]:
import itertools
import numpy as np

# Numeric prefixes make the labels sort by relevance grade (N < R < HR).
RELAVANCE_SYMBOLS = ['0N', '1R', '2HR']

# Generate all possible rankings of length 5 for both the Production and Experimental systems
P = list(itertools.product(RELAVANCE_SYMBOLS, repeat=5))
E = list(itertools.product(RELAVANCE_SYMBOLS, repeat=5))

# Lazily pair every Production ranking with every Experimental ranking (one-shot iterator)
rankings = itertools.product(P, E)

In [57]:
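def precision_at_k(ranking, k):
    # Fraction of the top-k results that are relevant (R or HR)
    subset = ranking[:k]
    n_relevant = 0

    for result in subset:
        # Compare by value: `is not` tests object identity, which is unreliable for strings
        if result != '0N':
            n_relevant += 1

    return n_relevant / k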

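def DGC_at_k(ranking, k):
    # Discounted cumulative gain (DCG) over the top-k results
    subset = ranking[:k]
    discounted_scores = []

    for i, result in enumerate(subset):
        # Prepare variables for the DCG formula
        rank = i + 1
        rel = RELAVANCE_SYMBOLS.index(result)

        # Gain (2^rel - 1) discounted by log2(1 + rank)
        score = (2**rel - 1) / np.log2(1 + rank)
        discounted_scores.append(score)

    # NB: the sum is part of the formula
    return np.sum(discounted_scores)

def nDGC_at_k(ranking, k):
    # Normalize by the DCG of the same results sorted from most to least relevant
    true = DGC_at_k(ranking, k)
    best = DGC_at_k(sorted(ranking, reverse=True), k)

    # A ranking with no relevant results has best == 0; return 0 instead of dividing 0 by 0
    if best == 0:
        return 0.0

    return true / best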

precision_p = []
precision_e = []
dgc_p = []
dgc_e = []

for p, e in rankings:
    precision_p.append(precision_at_k(p, 3))
    precision_e.append(precision_at_k(e, 3))
    
    dgc_p.append(nDGC_at_k(p, 3))
    dgc_e.append(nDGC_at_k(e, 3))
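
For reference, the metric computed above appears to be the exponential-gain form of DCG. The notation here is introduced for illustration: $rel_i$ is the grade of the result at rank $i$ (N = 0, R = 1, HR = 2, matching each label's index in `RELAVANCE_SYMBOLS`), and $\mathrm{IDCG@}k$ is the DCG of the same results sorted from most to least relevant.

$$
\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(1 + i)}, \qquad
\mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}
$$

As a worked example, take the ranking (HR, N, R, N, N) at $k = 3$:

$$
\mathrm{DCG@}3 = \frac{3}{\log_2 2} + \frac{0}{\log_2 3} + \frac{1}{\log_2 4} = 3.5, \qquad
\mathrm{IDCG@}3 = \frac{3}{\log_2 2} + \frac{1}{\log_2 3} + \frac{0}{\log_2 4} \approx 3.631, \qquad
\mathrm{nDCG@}3 \approx 0.964
$$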



In [None]:
print(dgc_p)