### Theory

#### Hypothesis Testing – The problem of multiple comparisons [5 points]

Experimentation in AI often happens like this: 

1. Modify or build an algorithm.
2. Compare the algorithm to a baseline by running a hypothesis test.
3. If the result is not significant, go back to step 1.
4. If the result is significant, start writing a paper.

With a Type I error of α for each test, how many hypothesis tests, m, does it take to reach a given value of each of the following probabilities?

1. P(mth experiment gives significant result | m experiments lacking power to reject H0)?
2. P(at least one significant result | m experiments lacking power to reject H0)?

#### Bias and unfairness in Interleaving experiments [10 points]
Balanced interleaving has been shown to be biased in a number of corner cases. The lecture gave an example in which two ranked lists of length 3 are interleaved and a population of randomly clicking users makes algorithm A win ⅔ of the time, even though in theory each algorithm should win 50% of the time. Can you come up with two ranked lists of length 3 and a distribution of clicks over them for which Team-draft interleaving is unfair to the better algorithm?
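One candidate answer can be checked by exhaustive enumeration. The sketch below assumes rankings A = (a, b, c) and B = (b, c, a), and a user who clicks only on document c, the single relevant document. B ranks c higher and is therefore the better ranker under this assumption. The helper `team_draft_outcomes` is hypothetical and written for this illustration. It lists every interleaving that Team-draft can produce, together with its probability.

```python
from fractions import Fraction


def team_draft_outcomes(A, B, length):
    """Enumerate all Team-draft interleavings as (documents, teams, probability)."""
    outcomes = []

    def pick(ranking, shown):
        # Highest-ranked document that is not yet in the interleaved list.
        return next((d for d in ranking if d not in shown), None)

    def recurse(shown, teams, prob):
        next_a, next_b = pick(A, shown), pick(B, shown)
        if len(shown) == length or next_a is None or next_b is None:
            outcomes.append((tuple(shown), tuple(teams), prob))
            return
        n_a, n_b = teams.count('A'), teams.count('B')
        if n_a < n_b:
            choices = [('A', Fraction(1))]
        elif n_b < n_a:
            choices = [('B', Fraction(1))]
        else:
            # Teams are level, so a fair coin decides who picks next.
            choices = [('A', Fraction(1, 2)), ('B', Fraction(1, 2))]
        for team, p in choices:
            doc = next_a if team == 'A' else next_b
            recurse(shown + [doc], teams + [team], prob * p)

    recurse([], [], Fraction(1))
    return outcomes


A = ['a', 'b', 'c']
B = ['b', 'c', 'a']
clicked = {'c'}  # assumption: users click only the relevant document c

wins = {'A': Fraction(0), 'B': Fraction(0), 'tie': Fraction(0)}
for shown, teams, prob in team_draft_outcomes(A, B, 3):
    credit = {'A': 0, 'B': 0}
    for doc, team in zip(shown, teams):
        if doc in clicked:
            credit[team] += 1
    if credit['A'] > credit['B']:
        wins['A'] += prob
    elif credit['B'] > credit['A']:
        wins['B'] += prob
    else:
        wins['tie'] += prob

print(wins)
```

In every interleaving, c lands at rank 3 and is credited to each team with equal probability. Both rankers therefore win half of the time, even though B places the relevant document higher.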


In [26]:
import itertools
import numpy as np

# Added the numbers to facilitate easy sorting.
RELAVANCE_SYMBOLS = ['0N', '1R', '2HR']

# Generate all possible rankings for both Production and Experimental systems
P = list(itertools.product(RELAVANCE_SYMBOLS, repeat = 5))
E = list(itertools.product(RELAVANCE_SYMBOLS, repeat = 5))
 
rankings = itertools.product(P, E)

rank_list = list(rankings)
rank_len = len(rank_list)


In [45]:
# quote from slide 15: "usually user only looks at very few top results : e.g. precision@3"
def precision_at_k(ranking, k):
    subset = ranking[:k]
    n_relevant = 0
    
    for result in subset:
        # Compare by value; 'is not' tests object identity and is unreliable for strings.
        if result != '0N':
            n_relevant += 1
    
    return n_relevant / k

def DGC_at_k(ranking, k):
    subset = ranking[:k]
    discounted_scores = []
    
    for i, result in enumerate(subset):
        # prepare variables for the DCG (discounted cumulative gain) formula
        rank = i + 1
        rel = RELAVANCE_SYMBOLS.index(result)
        
        # Calculate score
        score = (2**rel - 1) / np.log2(1 + rank)
        discounted_scores.append(score)
    
    # NB: sum is part of the formula
    return np.sum(discounted_scores)
        
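def nDGC_at_k(ranking, k):
    true = DGC_at_k(ranking, k)
    # Ideal ordering: most relevant results first ('2HR' > '1R' > '0N').
    best = DGC_at_k(sorted(ranking, reverse=True), k)
    
    # A ranking without any relevant result has an ideal score of 0; avoid 0/0.
    return true / best if best > 0 else 0.0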

precision_p = []
precision_e = []
dgc_p = []
dgc_e = []

# Iterate over rank_list: the 'rankings' iterator was already consumed by list(rankings).
for p, e in rank_list:
    precision_p.append(precision_at_k(p, 3))
    precision_e.append(precision_at_k(e, 3))
    
    dgc_p.append(nDGC_at_k(p, 3))
    dgc_e.append(nDGC_at_k(e, 3))



In [None]:
print(dgc_p)

In [72]:
import random
# Step 4: Implement 2 interleaving algorithms: 
## (1) Team-Draft Interleaving OR Balanced Interleaving, AND 
## (2) Probabilistic Interleaving.

# The interleaving algorithms should: 
## (a) given two rankings of relevance interleave them into a single ranking, and 
## (b) given the user's clicks on the interleaved ranking assign credit to the algorithms 
##     that produced the rankings.


### BALANCED INTERLEAVING:
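# Use the fixed example rankings A and B; uncomment the line below them to draw a random pair instead.
# randrange excludes rank_len, whereas randint(0, rank_len) could return an out-of-range index.
random_pick = random.randrange(rank_len)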
A, B = ['a','b','c','d','g','h'], ['b','e','a','f','g','h']
#A, B = rank_list[random_pick]

# initialize pointers p_a and p_b
p_a, p_b, I = 0, 0, []

# Flip a coin to decide which ranking contributes first when the pointers are tied.
turn = 'A' if random.randint(0, 1) == 0 else 'B'

# Greedily collect documents from both rankings and build the interleaved list.
print(turn, "is first:")
while p_a < len(A) and p_b < len(B):
    if p_a < p_b or (p_a == p_b and turn == 'A'):
        if A[p_a] not in I:
            I.append(A[p_a])
        p_a += 1
    else:
        if B[p_b] not in I:
            I.append(B[p_b])
        p_b += 1

print(I)

B is first:
['b', 'a', 'e', 'c', 'f', 'd', 'g', 'h']


In [108]:
### TEAM-DRAFT INTERLEAVING:

# Fully based on example in paper:
# randrange excludes rank_len, whereas randint(0, rank_len) could return an out-of-range index.
random_pick = random.randrange(rank_len)
A, B = ['a','b','c','d','g','h'], ['b','e','a','f','g','h']
#A, B = rank_list[random_pick]

# team_a and team_b count how many documents each ranking has contributed so far
team_a, team_b, I = 0, 0, []

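# Continue while both rankings still hold documents that are not in I.
while set(A) - set(I) and set(B) - set(I):
    # The coin flip only decides the pick when both teams are the same size.
    turn = 'A' if random.randint(0, 1) == 0 else 'B'
    print("Turn is: ", turn)
    if team_a < team_b or (team_a == team_b and turn == 'A'):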
        print("A")
        for a in A:
            if a not in I:
                I.append(a)
                break
        team_a += 1
    else:
        print("B")
        for b in B:
            if b not in I:
                I.append(b)
                break
        team_b += 1

print(I)

Turn is:  A
A
Turn is:  B
B
Turn is:  A
A
Turn is:  A
B
Turn is:  A
A
Turn is:  B
B
Turn is:  A
A
Turn is:  A
B
['a', 'b', 'c', 'e', 'd', 'f', 'g', 'h']
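
Step (b), assigning credit for clicks, is not covered by the cell above. The block below is a minimal sketch of one way to do it, assuming that each clicked position gives one unit of credit to the team that contributed the document at that position. The names `team_draft_interleave` and `assign_credit` are illustrative, and the interleaving follows the same picking rule as the cell above while also recording which team placed each document.

```python
import random


def team_draft_interleave(A, B, rng=random):
    """Interleave A and B; return the list and the team that placed each document."""
    interleaved, teams = [], []
    # Continue while both rankings still hold documents that are not yet shown.
    while set(A) - set(interleaved) and set(B) - set(interleaved):
        n_a, n_b = teams.count('A'), teams.count('B')
        # The smaller team picks; a coin flip breaks ties.
        a_turn = n_a < n_b or (n_a == n_b and rng.random() < 0.5)
        source, team = (A, 'A') if a_turn else (B, 'B')
        doc = next(d for d in source if d not in interleaved)
        interleaved.append(doc)
        teams.append(team)
    return interleaved, teams


def assign_credit(teams, clicked_positions):
    """Give one unit of credit per click to the team that placed the clicked document."""
    credit = {'A': 0, 'B': 0}
    for pos in clicked_positions:
        credit[teams[pos]] += 1
    return credit


A = ['a', 'b', 'c', 'd', 'g', 'h']
B = ['b', 'e', 'a', 'f', 'g', 'h']
interleaved, teams = team_draft_interleave(A, B)
clicks = [0, 2]  # hypothetical clicks on the first and third results
print(interleaved)
print(teams)
print(assign_credit(teams, clicks))
```

The ranking with more credit wins the comparison for that impression, and equal credit counts as a tie.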
