## Homework 1

#### Michelle Appel (10170359)
#### Nils Hulzeboch (10749411)
#### Yves van Montfort (XXX)

#### 11-01-2018

### Theoretical Part [15 pts]

1. Hypothesis Testing – The problem of multiple comparisons [ 5 points ]
Experimentation in AI often happens like this:
A. Modify/Build an algorithm
B. Compare the algorithm to a baseline by running a hypothesis test.
C. If not significant, go back to step A
D. If significant, start writing a paper.
Compute the probabilities below. How many hypothesis tests, $m$, does it take to get to (with Type I error for each test = $\alpha$):
(a) P($m^{th}$ experiment gives significant result | $m$ experiments lacking power to reject $H_0$)?
(b) P(at least one significant result | $m$ experiments lacking power to reject $H_0$)?

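(a) P($m^{th}$ experiment gives significant result | m experiments lacking power to reject $H_0$) = $\alpha (1 - \alpha)^{m-1}$

Where m is the number of experiments and $\alpha$ is the Type I error of each test. The tests are assumed to be independent. Following the loop A to D, the $m^{th}$ test is only run when the first $m - 1$ tests were not significant, which gives the factor $(1 - \alpha)^{m-1}$; taken on its own, any single test is significant with probability $\alpha$.

(b) P(at least one significant result | m experiments lacking power to reject $H_0$) = $1 - (1 - \alpha)^m$ ($\approx m \alpha$ when $m \alpha$ is small)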

2. Bias and unfairness in Interleaving experiments [ 10 points ]
Balanced interleaving has been shown to be biased in a number of corner cases. An example was given during the lecture with two ranked lists of length 3 being interleaved, and a randomly clicking population of users that resulted in algorithm A winning 2/3 of the time, even though in theory the percentage of wins should be 50% for both algorithms. Can you come up with a situation of two ranked lists of length 3 and a distribution of clicks over them for which Team-draft interleaving is unfair to the better algorithm?

### Experimental Part [85 pts]

In [31]:
import itertools
import numpy as np
import random

#### Step 1: Simulate Rankings of Relevance for E and P (5 points)

In the first step you will generate pairs of rankings of relevance, for the production P and experimental E, respectively, for a hypothetical query q. Assume a 3-graded relevance, i.e. {N, R, HR}. Construct all possible P and E ranking pairs of length 5. This step should give you about.

In [6]:
relevances = ('N', 'R', 'HR')  # The three relevance classes

# All rankings of length 5 over the relevance classes
permutations = list(itertools.product(relevances, repeat=5))

# Ranking pairs of production P and experimental E, skipping identical pairs
ranking_pairs = [(ranking_p, ranking_e)
                 for ranking_p in permutations
                 for ranking_e in permutations
                 if ranking_p != ranking_e]
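
A quick sanity check on the sizes involved, written as a standalone sketch: with 3 grades and length 5 there are $3^5 = 243$ rankings, hence $243 \cdot 242 = 58806$ ordered pairs once identical pairs are excluded. The snippet enumerates the rankings with `itertools.product`; the variable names are illustrative.

```python
import itertools

relevances = ('N', 'R', 'HR')

# Every length-5 sequence over the three grades
rankings = list(itertools.product(relevances, repeat=5))

# Ordered (P, E) pairs with P != E
n_pairs = sum(1 for p in rankings for e in rankings if p != e)

print(len(rankings), n_pairs)
```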

#### Step 2: Implement Evaluation Measures (10 points)
    
Implement 1 binary and 2 multi-graded evaluation measures out of the 7 measures mentioned above. 
(Note 2: Some of the aforementioned measures require the total number of relevant and highly relevant documents in the entire collection – pay extra attention on how to find this)

##### Binary evaluation measures:
1. Precision at rank k,
2. Recall at rank k,
3. Average Precision,

In [3]:
# Precision = TP / (TP + FP)
def precision(ranking, rank=None):
    tp = ranking[:rank].count('R') + ranking[:rank].count('HR')
    fp = ranking[:rank].count('N')
    return tp / (tp + fp)


# Recall = TP / (TP + FN)
def recall(ranking, no_relevant_documents, rank=None):
    if rank is None:
        rank = len(ranking)

    if no_relevant_documents == 0:
        return 0  # Nothing relevant to retrieve; recall is set to 0 by convention here

    tp = ranking[:rank].count('R') + ranking[:rank].count('HR')
    return tp / no_relevant_documents  # TP + FN is the number of relevant documents


# Recall for both rankings of a pair. The relevant documents of P and E together
# are taken as the relevant documents of the collection.
def recalls(ranking_pair, rank=None):
    no_relevant_documents = sum(ranking.count('R') + ranking.count('HR')
                                for ranking in ranking_pair)
    return tuple(recall(ranking, no_relevant_documents, rank=rank)
                 for ranking in ranking_pair)


# Average precision
def average_precision(ranking):
    precisions = ()
    for rank in range(1, len(ranking)+1):
        if ranking[rank-1] == 'R' or ranking[rank-1] == 'HR':
            precisions += (precision(ranking, rank=rank),)

    if len(precisions) > 0:
        return np.mean(precisions)
    else:
        return 0
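
A worked example makes the binary measures concrete. The standalone sketch below treats both R and HR as relevant and evaluates the ranking (R, N, HR, N, N); the helper names are illustrative and separate from the functions above.

```python
def is_relevant(grade):
    """Binary relevance: both R and HR count as relevant."""
    return grade in ('R', 'HR')


def precision_at(ranking, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(is_relevant(g) for g in ranking[:k]) / k


def average_precision_example(ranking):
    """Mean of precision@k over the ranks k that hold a relevant document."""
    values = [precision_at(ranking, k)
              for k, g in enumerate(ranking, start=1) if is_relevant(g)]
    return sum(values) / len(values) if values else 0.0


ranking = ('R', 'N', 'HR', 'N', 'N')
print(precision_at(ranking, 3))            # 2 relevant documents in the top 3
print(average_precision_example(ranking))  # mean of 1/1 and 2/3
```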

##### Multi-graded evaluation measures:

1. Normalized Discounted Cumulative Gain at rank k (nDCG@k),
2. Expected Reciprocal Rank (ERR).
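
For reference, the standard definitions of the two measures, written with a numeric grade $g_r$ for the document at rank $r$. The mapping N = 0, R = 0.5, HR = 1 with $g_{max} = 1$ is the one assumed in the code of this notebook, not something fixed by the assignment:

$$\mathrm{DCG@}k = \sum_{r=1}^{k} \frac{2^{g_r} - 1}{\log_2(1 + r)}, \qquad \mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$$

$$\mathrm{ERR} = \sum_{r=1}^{n} \frac{1}{r} \, R_r \prod_{i=1}^{r-1} (1 - R_i), \qquad R_i = \frac{2^{g_i} - 1}{2^{g_{max}}}$$

Here $\mathrm{IDCG@}k$ is the DCG@k of an ideal ordering (grades sorted from highest to lowest), and $R_i$ is the probability that the document at rank $i$ satisfies the user.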

In [4]:
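# Numeric gain per relevance grade
GRADE_VALUES = {'N': 0, 'R': 0.5, 'HR': 1}


# Discounted Cumulative Gain at rank k (DCG@k)
def DCGk(ranking, rank=None):
    if rank is None:
        rank = len(ranking)

    return sum((2**GRADE_VALUES[rel_grade] - 1) / np.log2(2 + r)
               for r, rel_grade in enumerate(ranking[:rank]))


# Normalized Discounted Cumulative Gain at rank k (nDCG@k)
def nDCGk(ranking, rank=None):
    # Assumption: the ideal ranking is the same set of labels sorted by decreasing relevance
    ideal_ranking = sorted(ranking, key=GRADE_VALUES.get, reverse=True)
    ideal_dcg = DCGk(ideal_ranking, rank)

    if ideal_dcg == 0:
        return 0  # No relevant documents at all
    return DCGk(ranking, rank) / ideal_dcg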


# Mapping from relevance grades to probability of relevance
def R(rel_grade, g_max):
    if rel_grade == 'N':
        g = 0
    elif rel_grade == 'R':
        g = 0.5
    elif rel_grade == 'HR':
        g = 1

    return (2**g - 1)/(2**g_max)

# Expected Reciprocal Rank (ERR)
def ERR(ranking, rank=None):
    if rank is None:
        rank = len(ranking)

    g_max = 1

    err = 0
    p_reach = 1  # Probability that the user reaches the current rank
    for r, rel_grade in enumerate(ranking[:rank], start=1):
        p_stop = R(rel_grade, g_max)  # Probability that the user is satisfied at rank r
        err += p_reach * p_stop / r
        p_reach *= 1 - p_stop

    return err
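
As a cross-check, here is a standalone sketch of ERR in its cascade form: a user walks down the list, stops at rank $r$ with probability $R_r$, and contributes $1/r$ when stopping. It assumes the grade mapping N = 0, R = 0.5, HR = 1 and $g_{max} = 1$; the names `stop_probability` and `err_cascade` are illustrative.

```python
GRADE = {'N': 0.0, 'R': 0.5, 'HR': 1.0}  # assumed numeric grades
G_MAX = 1.0


def stop_probability(label):
    """Probability that a document with this label satisfies the user."""
    return (2 ** GRADE[label] - 1) / (2 ** G_MAX)


def err_cascade(ranking):
    """ERR = sum_r (1/r) * R_r * prod_{i<r} (1 - R_i)."""
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for r, label in enumerate(ranking, start=1):
        p_stop = stop_probability(label)
        err += p_reach * p_stop / r
        p_reach *= 1 - p_stop
    return err


print(err_cascade(('HR', 'N', 'N')))
print(err_cascade(('N', 'N', 'HR')))
```

Under these assumptions a single HR document is worth 1/2 at rank 1 but only 1/6 at rank 3, which shows how strongly ERR rewards early relevant results.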

#### Step 3: Calculate the 𝛥measure (0 points)
    
For the three measures and all P and E ranking pairs constructed above calculate the difference: 𝛥measure = measureE-measureP. Consider only those pairs for which E outperforms P.

In [5]:
def delta_measure(ranking_pairs, evaluation_measure):
    deltas = []

    for ranking_p, ranking_e in ranking_pairs:
        delta = evaluation_measure(ranking_e) - evaluation_measure(ranking_p)
        if delta > 0:  # Keep only the pairs for which E outperforms P
            deltas.append(delta)

    return np.mean(deltas)
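
For the comparison in Step 7 it can help to keep the individual differences rather than only their mean. The sketch below is one way to do that, assuming any function that maps a ranking to a number can serve as the measure; `deltas_per_pair` and the toy measure `precision_at_3` are hypothetical names used only for illustration.

```python
def precision_at_3(ranking):
    """Toy binary measure: fraction of relevant (R or HR) labels in the top 3."""
    return sum(label in ('R', 'HR') for label in ranking[:3]) / 3


def deltas_per_pair(ranking_pairs, measure):
    """Map each (P, E) pair to measure(E) - measure(P), keeping only E > P."""
    deltas = {}
    for ranking_p, ranking_e in ranking_pairs:
        delta = measure(ranking_e) - measure(ranking_p)
        if delta > 0:
            deltas[(ranking_p, ranking_e)] = delta
    return deltas


pairs = [
    (('N', 'N', 'R', 'N', 'N'), ('HR', 'R', 'N', 'N', 'N')),  # E better, kept
    (('R', 'R', 'R', 'N', 'N'), ('N', 'N', 'N', 'R', 'R')),   # P better, dropped
]
result = deltas_per_pair(pairs, precision_at_3)
print(len(result), list(result.values()))
```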

[Based on Lecture 2]
#### Step 4: Implement Interleaving (15 points)

Implement 2 interleaving algorithms: (1) Team-Draft Interleaving OR Balanced Interleaving, AND (2) Probabilistic Interleaving. The interleaving algorithms should (a) given two rankings of relevance, interleave them into a single ranking, and (b) given the user's clicks on the interleaved ranking, assign credit to the algorithms that produced the rankings.
(Note 4: As opposed to a normal interleaving experiment, where rankings consist of URLs or docids, in our case the rankings consist of relevance labels. Hence in this case (a) you will assume that E and P return different documents, and (b) the interleaved ranking will also be a ranking of labels.)
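
Both requirements, (a) interleaving and (b) credit assignment, can be sketched together for Team-Draft Interleaving. The version below records which team contributed each position so that clicks can be credited afterwards. It assumes P and E return different documents (as stated in Note 4) and that clicks are given as 0-based positions in the interleaved list; `team_draft_with_teams` and `assign_credit` are hypothetical names.

```python
import random


def team_draft_with_teams(ranking_p, ranking_e, rng=random):
    """Interleave two label rankings; return (labels, teams) of equal length."""
    remaining = {'P': list(ranking_p), 'E': list(ranking_e)}
    picks = {'P': 0, 'E': 0}
    labels, teams = [], []
    while remaining['P'] or remaining['E']:
        # The team with fewer picks goes next; ties are broken by a coin flip
        if picks['P'] != picks['E']:
            team = 'P' if picks['P'] < picks['E'] else 'E'
        else:
            team = rng.choice(('P', 'E'))
        if not remaining[team]:  # fall back when one ranking is exhausted
            team = 'E' if team == 'P' else 'P'
        labels.append(remaining[team].pop(0))
        teams.append(team)
        picks[team] += 1
    return labels, teams


def assign_credit(teams, clicked_positions):
    """Count clicks per team and name the winner ('P', 'E' or 'tie')."""
    clicks = {'P': 0, 'E': 0}
    for position in clicked_positions:
        clicks[teams[position]] += 1
    if clicks['P'] == clicks['E']:
        return clicks, 'tie'
    return clicks, max(clicks, key=clicks.get)


labels, teams = team_draft_with_teams(('N', 'R', 'N'), ('HR', 'HR', 'R'))
clicks, winner = assign_credit(teams, clicked_positions=[0, 1, 2])
print(labels, teams, clicks, winner)
```

Keeping the team list next to the labels is a design choice: the interleaved labels alone cannot tell which ranker supplied a clicked document.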

In [196]:
# Balanced Interleaving
def balanced_interleaving(ranking_pair, rank=None):
    if rank is None:
        rank = len(ranking_pair[0]) + len(ranking_pair[1])
    
    p_first = random.randint(0,1)
    
    interleaved_ranking = ()    
    for p, e in zip(*ranking_pair):
        if p_first:
            interleaved_ranking += (p, e)
        else:
            interleaved_ranking += (e, p)            
    
    return interleaved_ranking[:rank]
       
# Team-Draft Interleaving
def team_draft_interleaving(ranking_pair, rank=None):
    ranking_p = list(ranking_pair[0])
    ranking_e = list(ranking_pair[1])
                             
    if rank is None:
        rank = len(ranking_p) + len(ranking_e)
        
    team_p = ()
    team_e = ()
        
    interleaved_ranking = ()
    for i in range(rank):
        if (len(team_p) < len(team_e)) or (len(team_p) == len(team_e) and random.randint(0,1)):
            rel_grade = ranking_p.pop(0)
            interleaved_ranking += (rel_grade,)
            team_p += (rel_grade,)
        else:
            rel_grade = ranking_e.pop(0)
            interleaved_ranking += (rel_grade,)
            team_e += (rel_grade,)
    
    return interleaved_ranking[:rank]

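# Probabilistic Interleaving (softmax over ranks, sampled without replacement)
def probabilistic_interleaving(ranking_pair, rank=None, tau=3):
    if rank is None:
        rank = len(ranking_pair[0]) + len(ranking_pair[1])

    # Remaining (original rank, relevance grade) entries per ranker.
    # Assumptions: P and E return different documents (Note 4); tau=3 is an assumed default.
    remaining = [list(enumerate(ranking, start=1)) for ranking in ranking_pair]

    interleaved_ranking = ()
    while len(interleaved_ranking) < rank and any(remaining):
        ranker = random.randint(0, 1)  # Pick P (0) or E (1) uniformly at random
        if not remaining[ranker]:
            ranker = 1 - ranker  # Fall back to the other ranker when one is exhausted
        # Sampling weights proportional to 1 / rank^tau over the remaining documents
        weights = [1 / r**tau for r, _ in remaining[ranker]]
        index = random.choices(range(len(weights)), weights=weights)[0]
        interleaved_ranking += (remaining[ranker].pop(index)[1],)

    return interleaved_ranking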

#### Step 5: Implement User Clicks Simulation (15 points)
    
Having interleaved all the ranking pairs, an online experiment could be run. However, given that we do not have any users (and the entire homework is a big simulation), we will simulate user clicks.
We have considered a number of click models including:
1. Random Click Model (RCM)
2. Position-Based Model (PBM)
3. Simple Dependent Click Model (SDCM)
4. Simple Dynamic Bayesian Network (SDBN)

Consider two different click models, (a) the Random Click Model (RCM), and (b) one out of the remaining 3 aforementioned models. The parameters of some of these models can be estimated using the Maximum Likelihood Estimation (MLE) method, while others require using the Expectation-Maximization (EM) method. Implement the two models so that (a) there is a method that learns the parameters of the model given a set of training data, (b) there is a method that predicts the click probability given a ranked list of relevance labels, (c) there is a method that decides - stochastically - whether a document is clicked based on these probabilities.
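
The three methods (a) to (c) are simplest to see for the Random Click Model, which has a single parameter $\rho$: the probability that any shown document is clicked. Its maximum-likelihood estimate is the number of clicks divided by the number of documents shown. The class below is a minimal sketch; the session format (one list of 0/1 click indicators per result page) and all names are assumptions made for illustration.

```python
import random


class RandomClickModel:
    """RCM: every document is clicked with the same probability rho."""

    def __init__(self):
        self.rho = 0.0

    def fit(self, sessions):
        """MLE of rho: total clicks / total documents shown."""
        shown = sum(len(clicks) for clicks in sessions)
        clicked = sum(sum(clicks) for clicks in sessions)
        self.rho = clicked / shown if shown else 0.0
        return self

    def click_probabilities(self, ranking):
        """Click probability per position; RCM ignores labels and ranks."""
        return [self.rho] * len(ranking)

    def simulate(self, ranking, rng=random):
        """Sample a 0/1 click for each position of the ranking."""
        return [int(rng.random() < p) for p in self.click_probabilities(ranking)]


# Toy training data: two result pages of 5 documents, 3 clicks in total
sessions = [[1, 0, 0, 0, 0], [0, 1, 0, 1, 0]]
model = RandomClickModel().fit(sessions)
print(model.rho)
print(model.simulate(('HR', 'N', 'R', 'N', 'N')))
```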

Having implemented the two click models, estimate the model parameters using the Yandex Click Log [https://drive.google.com/file/d/1tqMptjHvAisN1CJ35oCEZ9_lb0cEJwV0/view].

(Note 6: Do not learn the attractiveness parameter 𝑎uq)
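
With attractiveness left out, the only parameters to learn for the Simple Dependent Click Model are the continuation probabilities $\lambda_r$. The sketch below assumes the usual simplification that a user examines everything up to the last click and stops there, so that $\lambda_r$ can be estimated as one minus the share of sessions clicked at rank $r$ in which that click was the last one. The session format (a list of 0/1 clicks per result page), the default of 0 for ranks that were never clicked, and all names are assumptions.

```python
import random


def estimate_continuation(sessions, n_ranks=10):
    """lambda_r = 1 - (#sessions whose last click is at r) / (#sessions clicked at r)."""
    clicked = [0] * n_ranks
    last_clicked = [0] * n_ranks
    for clicks in sessions:
        ranks = [r for r, c in enumerate(clicks[:n_ranks]) if c]
        for r in ranks:
            clicked[r] += 1
        if ranks:
            last_clicked[ranks[-1]] += 1
    # Ranks without any click get 0.0 (an arbitrary default, not an estimate)
    return [1 - last_clicked[r] / clicked[r] if clicked[r] else 0.0
            for r in range(n_ranks)]


def simulate_sdcm(ranking, attractiveness, lambdas, rng=random):
    """Sample clicks top-down; after a click the user continues with prob. lambda_r."""
    clicks = []
    examining = True
    for r, label in enumerate(ranking):
        click = examining and rng.random() < attractiveness[label]
        clicks.append(int(click))
        if click and rng.random() >= lambdas[r]:
            examining = False  # satisfied: the user leaves after this click
    return clicks


sessions = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0]]
lambdas = estimate_continuation(sessions, n_ranks=4)
print(lambdas)
```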

#### Step 6: Simulate Interleaving Experiment (10 points)

Having implemented the click models, it is time to run the simulated experiment.
For each interleaved ranking, run N simulations for each of the click models implemented and measure the proportion p of wins for E.

(Note 7: Some of the models above include an attractiveness parameter 𝑎uq. Use the relevance label to assign this parameter by setting 𝑎uq for a document u in the ranked list accordingly. See Click Models for Web Search, http://clickmodels.weebly.com/uploads/5/2/2/5/52257029/mc2015-clickmodels.pdf)
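
The experiment loop can be sketched as follows: interleave, simulate clicks, credit the clicks, repeat N times, and report the share of decided runs won by E. Everything here is a simplified stand-in. The interleaver is a plain team-draft for equal-length rankings, the click simulator clicks each document independently with a label-dependent probability, and the values in `ATTRACTIVENESS` are made-up numbers, not estimates from the Yandex log. Ignoring ties in the proportion is also a choice, not something the assignment prescribes.

```python
import random

ATTRACTIVENESS = {'N': 0.05, 'R': 0.5, 'HR': 0.95}  # hypothetical values


def interleave(ranking_p, ranking_e, rng):
    """Team-draft for equal-length rankings: a coin flip orders each round's two picks."""
    labels, teams = [], []
    for label_p, label_e in zip(ranking_p, ranking_e):
        round_picks = [('P', label_p), ('E', label_e)]
        if rng.random() < 0.5:
            round_picks.reverse()
        for team, label in round_picks:
            teams.append(team)
            labels.append(label)
    return labels, teams


def simulate_clicks(labels, rng):
    """Click each document independently with its label's attractiveness."""
    return [rng.random() < ATTRACTIVENESS[label] for label in labels]


def proportion_e_wins(ranking_p, ranking_e, n_simulations=1000, seed=0):
    """Share of decided simulations (ties ignored) in which E gets more clicks."""
    rng = random.Random(seed)
    wins = {'P': 0, 'E': 0}
    for _ in range(n_simulations):
        labels, teams = interleave(ranking_p, ranking_e, rng)
        clicks = simulate_clicks(labels, rng)
        score_p = sum(c for c, t in zip(clicks, teams) if t == 'P')
        score_e = sum(c for c, t in zip(clicks, teams) if t == 'E')
        if score_e > score_p:
            wins['E'] += 1
        elif score_p > score_e:
            wins['P'] += 1
    decided = wins['P'] + wins['E']
    return wins['E'] / decided if decided else 0.5  # 0.5 when every run is a tie


p = proportion_e_wins(('N', 'N', 'R', 'N', 'N'), ('HR', 'HR', 'R', 'N', 'N'))
print(p)
```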

#### Step 7: Results and Analysis (30 points)

Compare the results of the offline experiments (i.e. the values of the 𝛥measure) with the results of the online experiment (i.e. proportion of wins), analyze them and reach your conclusions regarding their agreement.
Use easy-to-read visuals to demonstrate the results.
Analyze the results on the basis of:
1. the evaluation measure used,
2. the interleaving method used,
3. the click model used.

Report and ground your conclusions.
(Note 8: This is the place where you need to demonstrate your deeper understanding of what you have implemented so far; hence the large number of points assigned. Make sure you clearly do that so that the examiner of your work can grade it accordingly.)
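
One possible way to quantify the agreement between offline and online results is sketched below. Given one 𝛥measure value and one win proportion per ranking pair, it reports their Pearson correlation, the share of pairs for which E also wins online (proportion above 0.5), and the mean win proportion per 𝛥measure bin. The function name, the choice of three equal-width bins, and the input numbers are made up for illustration only; they are not results of this homework.

```python
import numpy as np


def agreement_summary(deltas, win_proportions, n_bins=3):
    """Return (correlation, online/offline agreement rate, mean win share per bin)."""
    deltas = np.asarray(deltas, dtype=float)
    wins = np.asarray(win_proportions, dtype=float)

    # Pearson correlation between the offline difference and the online win share
    correlation = np.corrcoef(deltas, wins)[0, 1]

    # All deltas are positive (E better offline), so agreement means E wins online
    agreement = float(np.mean(wins > 0.5))

    # Mean win proportion per equal-width delta bin
    edges = np.linspace(deltas.min(), deltas.max(), n_bins + 1)
    bin_index = np.clip(np.digitize(deltas, edges) - 1, 0, n_bins - 1)
    bin_means = [float(wins[bin_index == b].mean()) if np.any(bin_index == b)
                 else float('nan') for b in range(n_bins)]
    return correlation, agreement, bin_means


# Illustrative inputs only
toy_deltas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
toy_wins = [0.48, 0.55, 0.60, 0.70, 0.72, 0.90]
correlation, agreement, bin_means = agreement_summary(toy_deltas, toy_wins)
print(correlation, agreement, bin_means)
```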

Yandex Click Log File:

The dataset includes user sessions extracted from Yandex logs, with queries, URL rankings and clicks. To allay privacy concerns the user data is fully anonymized. So, only meaningless numeric IDs of queries, sessions, and URLs are released. The queries are grouped only by sessions and no user IDs are provided. The dataset consists of several parts. Logs represent a set of rows, where each row represents one of the possible user actions: query or click.
In the case of a Query:

SessionID TimePassed TypeOfAction QueryID RegionID ListOfURLs

In the case of a Click:
SessionID TimePassed TypeOfAction URLID

SessionID - the unique identifier of the user session.
TimePassed - the time elapsed since the beginning of the current session in standard time units.
TypeOfAction - type of user action. This may be either a query (Q), or a click (C).
QueryID - the unique identifier of the request.
RegionID - the unique identifier of the country from which a given query was issued. This identifier may take four values.
URLID - the unique identifier of the document.
ListOfURLs - the list of documents from left to right, in the order they were shown to users on the Yandex results page (top to bottom).
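
Given the two row formats above, a parser only needs to branch on the third field. The sketch below assumes fields are separated by whitespace (tabs or spaces), that `ListOfURLs` occupies all remaining fields of a query row, and that a click belongs to the most recent query of the same session. These are assumptions about the file layout, the example rows are invented, and `parse_log_line` and `click_vectors` are hypothetical helper names.

```python
def parse_log_line(line):
    """Parse one log row into a dict; the third field selects the row type."""
    fields = line.split()
    row = {'session': int(fields[0]), 'time': int(fields[1]), 'type': fields[2]}
    if row['type'] == 'Q':
        row['query'] = int(fields[3])
        row['region'] = int(fields[4])
        row['urls'] = [int(u) for u in fields[5:]]
    elif row['type'] == 'C':
        row['url'] = int(fields[3])
    else:
        raise ValueError('unknown action type: ' + row['type'])
    return row


def click_vectors(lines):
    """Turn log rows into one 0/1 click vector per query, in log order."""
    vectors = []     # click indicators per query
    shown = []       # URLs shown per query, aligned with vectors
    last_query = {}  # session id -> index of that session's latest query
    for line in lines:
        row = parse_log_line(line)
        if row['type'] == 'Q':
            last_query[row['session']] = len(vectors)
            shown.append(row['urls'])
            vectors.append([0] * len(row['urls']))
        else:
            index = last_query.get(row['session'])
            if index is not None and row['url'] in shown[index]:
                vectors[index][shown[index].index(row['url'])] = 1
    return vectors


# Invented rows that follow the documented field order
example = [
    '0 0 Q 7 1 10 11 12',
    '0 5 C 11',
    '1 0 Q 8 2 20 21 22',
    '1 3 C 20',
    '1 9 C 22',
]
print(click_vectors(example))
```

The resulting click vectors have the shape that a click model's training method can consume directly.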
