This notebook creates three relevance label TSV files. The first keeps the graded assessor labels and thus gives a graded relevance dataset. The other two convert the graded assessor labels to binary labels with different thresholds. With the threshold at 2, a passage is irrelevant if the assessor label is < 2 and relevant otherwise (>= 2). With the threshold at 3, a passage is irrelevant if the assessor label is < 3 and relevant otherwise (>= 3).
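
As a quick illustration of the two thresholds, the sketch below maps each graded label to a binary label. It assumes the assessor labels run from 1 to 5, and `to_binary` is a hypothetical helper used only here; the notebook's own conversion happens in `makeBinary` further down.

```python
def to_binary(label, threshold):
    """Map a graded label to 0 (irrelevant) or 1 (relevant)."""
    return 1 if int(label) >= threshold else 0

# Assumed label scale: 1 (irrelevant) up to 5 (highly relevant).
graded = [1, 2, 3, 4, 5]

binary_t2 = [to_binary(g, 2) for g in graded]
binary_t3 = [to_binary(g, 3) for g in graded]

print(binary_t2)  # [0, 1, 1, 1, 1]
print(binary_t3)  # [0, 0, 1, 1, 1]
```

The only label that changes between the two settings is 2, which counts as relevant at threshold 2 and as irrelevant at threshold 3.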

# Import Libraries

In [1]:
import pandas as pd
import json
import numpy as np
from collections import Counter

# Specify Filenames

In [2]:
binary_threshold2_dataset_filename = 'output/thesis_dataset_binary_threshold2.tsv'
binary_threshold3_dataset_filename = 'output/thesis_dataset_binary_threshold3.tsv'
graded_dataset_filename = 'output/thesis_dataset_graded_relevance.tsv'

# Load Data

In [3]:
with open('data/data.json', 'r') as infile:
    data = json.load(infile)
query_ids = list(data.keys())

# Create Dataframes per Query

In [4]:
def create_labels(nr_assessors):
    labels = ["query_id","passageid","msmarco"]
    for i in range(nr_assessors):
        labels = labels + ['user%s_id'%(i+1),'user%s_label'%(i+1)]
    return labels

In [5]:
dataframe_data = {}
label_data = {}
assessor_data = {}
for query_id in query_ids:
    query_data_lists = []
    query_data = data[query_id]
    nr_assessors = 0
    for i, passage_id in enumerate(query_data.keys()):
        dataFrameRow2be = [query_id, passage_id] + query_data[passage_id]
        query_data_lists.append(dataFrameRow2be)
        if i == 0:
            # Each row holds the msmarco label followed by one (id, label) pair per assessor.
            nr_assessors = (len(query_data[passage_id]) - 1) // 2
    dataframe_data[query_id] = query_data_lists
    label_data[query_id] = create_labels(nr_assessors)
    assessor_data[query_id] = nr_assessors

From the initial data analysis we know which query ids we can use and which we will not use.

As a reminder:

1. There should be at least 3 assessors.
2. The assessors should agree with the original MS MARCO relevant passage. 
3. The MS MARCO relevant passage should not be the only relevant passage after majority voting of the assessor input.
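
The three criteria above can be sketched as a single check. This is only one possible reading of them, not the code used in the initial data analysis. The function names, the `(msmarco_label, assessor_labels)` tuple layout, and the value `"relevant"` for the MS MARCO label are assumptions made for illustration.

```python
def majority_relevant(labels, threshold):
    """True if a strict majority of the usable labels is >= threshold."""
    votes = [1 if int(l) >= threshold else 0 for l in labels if l != "no_input"]
    return sum(votes) > len(votes) / 2

def is_usable_query(passages, threshold=2):
    """passages: list of (msmarco_label, assessor_labels) tuples (hypothetical layout)."""
    # 1. At least 3 assessors labelled every passage.
    if any(len(labels) < 3 for _, labels in passages):
        return False
    voted = [(m, majority_relevant(labels, threshold)) for m, labels in passages]
    # 2. The assessors agree that the MS MARCO relevant passage is relevant.
    if not all(rel for m, rel in voted if m != "irrelevant"):
        return False
    # 3. At least one other passage is relevant after majority voting.
    return any(rel for m, rel in voted if m == "irrelevant")

usable = [
    ("relevant", ["4", "5", "3"]),
    ("irrelevant", ["1", "3", "4"]),
    ("irrelevant", ["1", "1", "2"]),
]
only_marco_relevant = [
    ("relevant", ["4", "5", "3"]),
    ("irrelevant", ["1", "1", "2"]),
]

print(is_usable_query(usable))               # True
print(is_usable_query(only_marco_relevant))  # False
```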

In [6]:
experiment_query_ids = []
with open("data/experiment_queries.txt", "r") as infile:
    for line in infile:
        experiment_query_ids.append(line.rstrip())

In [7]:
dataframes = {}
for query_id in experiment_query_ids:
    df = pd.DataFrame(dataframe_data[query_id],columns=label_data[query_id])
    dataframes[query_id] = df

In [8]:
print(len(experiment_query_ids))

42


## Fix query 993153

For query 993153, one assessor forgot to provide input for 19 of the 20 passages. As there are still 6 other assessors for this query, it is wise to remove that assessor's input.
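
An assessor like this can also be spotted programmatically. The following is a minimal sketch on a toy frame, assuming the `userN_label` column naming used in this notebook and a hypothetical cutoff of more than half of the labels missing.

```python
import pandas as pd

# Toy data with the same column naming as the query dataframes.
toy = pd.DataFrame({
    "user1_label": ["1", "4", "2"],
    "user2_label": ["no_input", "no_input", "3"],
    "user3_label": ["1", "2", "no_input"],
})

label_cols = [c for c in toy.columns if c.endswith("_label")]

# Fraction of passages per assessor without input.
missing_share = (toy[label_cols] == "no_input").mean()

# Hypothetical cutoff: flag assessors missing more than half of their labels.
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)  # ['user2_label']
```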

In [9]:
dataframes['993153']

Unnamed: 0,query_id,passageid,msmarco,user1_id,user1_label,user2_id,user2_label,user3_id,user3_label,user4_id,user4_label,user5_id,user5_label,user6_id,user6_label,user7_id,user7_label
0,993153,931774,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
1,993153,6818707,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,4,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,4,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,3,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
2,993153,931772,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
3,993153,8174479,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,2,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
4,993153,2970085,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,2,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,3,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,5,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,4
5,993153,7679926,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,3,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
6,993153,6304514,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
7,993153,3731448,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,2,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
8,993153,6160228,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,3,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,2
9,993153,447864,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,CSy5HjxiGvOFJI2GrRW5tWjIhPU2,no_input,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1


In [10]:
del dataframes['993153']['user2_label']
del dataframes['993153']['user2_id']

In [11]:
assessor_data['993153'] = 6

In [12]:
nr_assessors = assessor_data['993153']
user_column_names = []
for i in range(nr_assessors):
    user_column_names = user_column_names + ["user%s_id"%(i+1), "user%s_label"%(i+1)]
dataframes['993153'].columns = ['query_id','passageid','msmarco'] + user_column_names

In [13]:
dataframes['993153']

Unnamed: 0,query_id,passageid,msmarco,user1_id,user1_label,user2_id,user2_label,user3_id,user3_label,user4_id,user4_label,user5_id,user5_label,user6_id,user6_label
0,993153,931774,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
1,993153,6818707,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,4,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,4,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,3,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
2,993153,931772,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
3,993153,8174479,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,2,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
4,993153,2970085,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,2,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,3,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,5,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,4
5,993153,7679926,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,3,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
6,993153,6304514,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
7,993153,3731448,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,2,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1
8,993153,6160228,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,2,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,3,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,2
9,993153,447864,irrelevant,AdqXiUfSnyZTKzaEEUtz4JrITjD2,1,XqZsRA3c3QQdWbsJ27oEs8plMYl2,1,bJ8IzR4NOPRXuDFEBVpiTZzdxUG2,1,jIAS83dqkiWODrjmPAvj8CIfsYP2,1,stmaxqNmmMPHcRWGv2BsI6tp0rH3,1,xkVAJ8mrGPUrhBPQC38nFa6Q3Za2,1


## Compute Binary Relevance MS MARCO Label

In [14]:
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    df['msmarco_binary'] = df['msmarco'].apply(lambda x: 0 if (x == 'irrelevant') else 1)

## Compute Graded Relevance MS MARCO Label

This is tricky: MS MARCO does provide relevant passages, but no graded relevance. So we can either leave the MS MARCO relevance information out and only use the labels provided by the subjects, or always treat the MS MARCO relevant passage as a 5. For now, let's add a column where we set the relevant passages to 5 and the irrelevant passages to 1. It is probably wise, though, to consider the subject labels, even if they disagree with MS MARCO and label a passage irrelevant (1).

In [15]:
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    df['msmarco_graded'] = df['msmarco'].apply(lambda x: 1 if (x == 'irrelevant') else 5)

## Helper Functions

In [16]:
def getListUserLabelColumns(nr_assessors):
    label_columns = []
    for i in range(nr_assessors):
        label_columns.append("user%s_label"%(i+1))
    return label_columns

In [17]:
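def makeBinary(labels):
    # Uses the global binary_threshold, which must be set before calling.
    # "no_input" labels are skipped, so the result can be shorter than the input.
    binary_labels = []
    for label in labels:
        if label != "no_input":
            if int(label) < binary_threshold:
                binary_labels.append(0)
            else:
                binary_labels.append(1)
    return binary_labels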

In [18]:
def setVoteThreshold(labels):
    # The even and the odd case both reduce to integer division by 2;
    # a label needs strictly more votes than this to win.
    return len(labels) // 2

In [19]:
def performMajorityVoting(msmarco_label,labels):
    vote_counts = Counter(labels)
    majority_label = msmarco_label  # fallback when no label has a strict majority (e.g. a tie)
    vote_threshold = setVoteThreshold(labels)
    for label, votes in vote_counts.items():
        if votes > vote_threshold:
            majority_label = label
    return majority_label
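
To make the tie behaviour of the voting explicit, here is a small self-contained sketch. `majority_vote` is a hypothetical compact equivalent of `performMajorityVoting`, written only for this illustration.

```python
from collections import Counter

def majority_vote(fallback_label, labels):
    """Return the label held by a strict majority, else fallback_label."""
    needed = len(labels) // 2
    for label, votes in Counter(labels).items():
        if votes > needed:
            return label
    return fallback_label

print(majority_vote(0, [1, 1, 0]))     # 1: two of three votes
print(majority_vote(0, [1, 1, 0, 0]))  # 0: tie, falls back to the MS MARCO label
print(majority_vote(1, []))            # 1: no usable labels, falls back as well
```

With an even number of assessors a 50/50 split is possible, and in that case the binary MS MARCO label decides.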

In [20]:
def str2int(labels):
    integer_labels = []
    for label in labels:
        if not label == "no_input":
            integer_labels.append(int(label))
    return integer_labels

In [21]:
def getAgreementLabel(labels):
    return int(np.ceil(np.median(labels)))
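
The graded agreement label is the median of the assessor labels, rounded up. A short worked example, using a hypothetical standalone copy of the helper:

```python
import numpy as np

def agreement_label(labels):
    """Median of the integer labels, rounded up to the next integer."""
    return int(np.ceil(np.median(labels)))

print(agreement_label([1, 2, 4]))     # median 2.0 -> 2
print(agreement_label([1, 2, 3, 4]))  # median 2.5 -> 3
print(agreement_label([1, 1, 2, 2]))  # median 1.5 -> 2
```

Rounding up only matters for an even number of labels, where the median can fall halfway between two grades. The higher grade is chosen in that case.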

# Compute Binary Agreement Label

## Threshold == 2

In [22]:
# < 2 is irrelevant, >= 2 is relevant
binary_threshold = 2

In [23]:
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    agreement_labels = []
    nr_assessors = assessor_data[query_id]
    for index, row in df.iterrows():
        user_labels = row[getListUserLabelColumns(nr_assessors)].values
        binary_labels = makeBinary(user_labels)
        agreement_label = performMajorityVoting(row['msmarco_binary'],binary_labels)
        agreement_labels.append(agreement_label)
    df['agreement_label_threshold2'] = agreement_labels
    dataframes[query_id] = df

## Threshold == 3

In [24]:
# < 3 is irrelevant, >= 3 is relevant
binary_threshold = 3

In [25]:
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    agreement_labels = []
    nr_assessors = assessor_data[query_id]
    for index, row in df.iterrows():
        user_labels = row[getListUserLabelColumns(nr_assessors)].values
        binary_labels = makeBinary(user_labels)
        agreement_label = performMajorityVoting(row['msmarco_binary'],binary_labels)
        agreement_labels.append(agreement_label)
    df['agreement_label_threshold3'] = agreement_labels
    dataframes[query_id] = df

## Graded

In [26]:
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    agreement_labels = []
    nr_assessors = assessor_data[query_id]
    for index, row in df.iterrows():
        user_labels = row[getListUserLabelColumns(nr_assessors)].values
        integer_labels = str2int(user_labels)
        agreement_label = getAgreementLabel(integer_labels)
        agreement_labels.append(agreement_label)
    df['agreement_label_graded'] = agreement_labels
    dataframes[query_id] = df

# Save Thesis Datasets

## Threshold == 2

In [27]:
relevant_queries = []
relevant_passages = []
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    relevance_df = df[df['agreement_label_threshold2'] == 1]
    relevant_queries = relevant_queries + relevance_df['query_id'].values.tolist()
    relevant_passages = relevant_passages + relevance_df['passageid'].values.tolist()
output_df = pd.DataFrame()
output_df['query_id'] = relevant_queries
output_df['label1'] = 0
output_df['passage_id'] = relevant_passages
output_df['label2'] = 1

In [28]:
output_df.to_csv(binary_threshold2_dataset_filename,sep='\t',index=False,header=False)

## Threshold == 3

In [29]:
relevant_queries = []
relevant_passages = []
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    relevance_df = df[df['agreement_label_threshold3'] == 1]
    relevant_queries = relevant_queries + relevance_df['query_id'].values.tolist()
    relevant_passages = relevant_passages + relevance_df['passageid'].values.tolist()
output_df = pd.DataFrame()
output_df['query_id'] = relevant_queries
output_df['label1'] = 0
output_df['passage_id'] = relevant_passages
output_df['label2'] = 1

In [30]:
output_df.to_csv(binary_threshold3_dataset_filename,sep='\t',index=False,header=False)

## Graded

Now we want to create the actual graded dataset. This new relevance dataset looks different from the qrels.tsv file from MS MARCO, which has the columns (query_id;label1;passage_id;label2). It should look more like this: (query_id;passage_id;graded_label)

In [31]:
output_df_data = []
for query_id in experiment_query_ids:
    df = dataframes[query_id]
    for index, row in df.iterrows():
        output_df_data.append({"query_id": query_id, "passage_id":row['passageid'], "label":row['agreement_label_graded']})

In [32]:
output_df = pd.DataFrame(output_df_data, columns=['query_id','passage_id','label'])

In [33]:
output_df.to_csv(graded_dataset_filename,sep='\t',index=False,header=False)
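
Because the files are written without a header, a reader has to supply the column names itself. The sketch below round-trips two toy rows through an in-memory buffer instead of the real output file. The rows are made up for illustration, and reading the ids as strings is an assumption about how the file will be consumed.

```python
import io
import pandas as pd

# Toy rows in the graded layout: query_id, passage_id, label.
toy = pd.DataFrame(
    [
        {"query_id": "993153", "passage_id": "931774", "label": 1},
        {"query_id": "993153", "passage_id": "2970085", "label": 3},
    ],
    columns=["query_id", "passage_id", "label"],
)

# Write exactly as above: tab-separated, no index, no header.
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", index=False, header=False)
buffer.seek(0)

# Read back, naming the columns and keeping the ids as strings.
loaded = pd.read_csv(
    buffer,
    sep="\t",
    header=None,
    names=["query_id", "passage_id", "label"],
    dtype={"query_id": str, "passage_id": str},
)
print(loaded)
```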