This notebook creates a relevance label dataset that uses graded relevance labels (1 to 5).

## Import Libraries

In [1]:
import pandas as pd
import json
import numpy as np
from collections import Counter

## Load Data

In [2]:
with open('data.json', 'r') as infile:
    data = json.load(infile)
query_ids = list(data.keys())

## Create Dataframes per Query

In [3]:
def create_labels(nr_assessors):
    labels = ["query_id","passageid","msmarco"]
    for i in range(nr_assessors):
        labels = labels + ['user%s_id'%(i+1),'user%s_label'%(i+1)]
    return labels

In [4]:
dataframe_data = {}
label_data = {}
assessor_data = {}
for query_id in query_ids:
    query_data_lists = []
    query_data = data[query_id]
    nr_assessors = 0
    for i, passage_id in enumerate(query_data.keys()):
        dataFrameRow2be = [query_id, passage_id] + query_data[passage_id]
        query_data_lists.append(dataFrameRow2be)
        if i == 0:
            nr_assessors = int((len(query_data[passage_id])-1)/2)
    dataframe_data[query_id] = query_data_lists
    label_data[query_id] = create_labels(nr_assessors)
    assessor_data[query_id] = nr_assessors

In [5]:
dataframes = {}
for query_id in query_ids:
    df = pd.DataFrame(dataframe_data[query_id],columns=label_data[query_id])
    dataframes[query_id] = df
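
The cells above imply a particular layout for `data.json`. The sketch below rebuilds one query's dataframe from a made-up miniature of that layout: each query maps passage IDs to a flat list that starts with the MSMARCO label and continues with alternating assessor ID and label entries. All IDs and labels here, and the helper name `build_columns`, are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of data.json (IDs and labels are made up).
# Layout assumed from the cells above:
#   {query_id: {passage_id: [msmarco, user1_id, user1_label, user2_id, user2_label, ...]}}
toy_data = {
    "q1": {
        "p1": ["relevant", "uA", "5", "uB", "4"],
        "p2": ["irrelevant", "uA", "1", "uB", "no_input"],
    }
}

def build_columns(nr_assessors):
    """Column names for one query, mirroring create_labels above."""
    columns = ["query_id", "passageid", "msmarco"]
    for i in range(1, nr_assessors + 1):
        columns += [f"user{i}_id", f"user{i}_label"]
    return columns

# One row per (query, passage) pair.
rows = [[qid, pid] + values
        for qid, passages in toy_data.items()
        for pid, values in passages.items()]

# Each list holds one msmarco entry plus an (id, label) pair per assessor.
nr_assessors = (len(toy_data["q1"]["p1"]) - 1) // 2
toy_df = pd.DataFrame(rows, columns=build_columns(nr_assessors))
print(toy_df)
```

The number of assessors is derived from the list length, which is why the column names have to be generated per query.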

## Compute Graded Relevance MSMARCO Label

This is tricky: MSMARCO marks passages as relevant but provides no graded relevance. We can either leave the MSMARCO relevance information out and use only the labels provided by the subjects, or always treat the MSMARCO-relevant passage as a 5. For now, let's add a column that sets the relevant passages to 5 and all other passages to 1. It is probably still wise to take the subject labels into account, even when they disagree with MSMARCO and label a passage irrelevant (1).

In [6]:
for query_id in query_ids:
    df = dataframes[query_id]
    df['msmarco_graded'] = df['msmarco'].apply(lambda x: 1 if (x == 'irrelevant') else 5)
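
As a small illustration of this mapping, the sketch below applies the same rule to a made-up frame with a vectorized `np.where` instead of `apply`. The value `"relevant"` is an assumption; the rule only checks for `"irrelevant"` and grades everything else 5.

```python
import numpy as np
import pandas as pd

# Made-up example; "relevant" is an assumed value for the msmarco column.
toy = pd.DataFrame({"msmarco": ["relevant", "irrelevant", "irrelevant"]})

# Same rule as above: "irrelevant" becomes 1, anything else becomes 5.
toy["msmarco_graded"] = np.where(toy["msmarco"] == "irrelevant", 1, 5)
print(toy)
```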

## Check for Missing Data

Sometimes an assessor forgot to provide input, in which case the dataset contains "no_input". This is missing data, and we can often work around it by considering the input of the remaining assessors. If the remaining assessors agree, unanimously or by strict majority, I can still provide the agreed relevance label. If there is no such agreement between the remaining assessors, I simply take the relevance label that MSMARCO provides.

In [7]:
queries_with_missing_data = []
counter = 0
for query_id in query_ids:
    nr_assessors = assessor_data[query_id]
    if nr_assessors >= 3:
        df = dataframes[query_id]
        if 'no_input' in df.values:
            counter += 1
            print(query_id)
            queries_with_missing_data.append(query_id)
print("nr of queries with missing data: %s"%(counter))

1077356
904389
1097449
426442
993153
825147
92542
689885
758519
1096257
202306
nr of queries with missing data: 11


In [8]:
for query_id in queries_with_missing_data:
    print(query_id)
    df = dataframes[query_id]
    column_labels = list(df.columns)
    for i in range(7):
        column = 'user%s_label'%(i+1)
        if column in column_labels:
            print(column)
            print(len(df[df[column].isin(["no_input"])]))
    print("\n")

1077356
user1_label
1
user2_label
0
user3_label
0


904389
user1_label
0
user2_label
0
user3_label
1


1097449
user1_label
0
user2_label
0
user3_label
0
user4_label
1
user5_label
0
user6_label
0
user7_label
0


426442
user1_label
0
user2_label
3
user3_label
0


993153
user1_label
0
user2_label
19
user3_label
1
user4_label
0
user5_label
0
user6_label
0
user7_label
0


825147
user1_label
0
user2_label
0
user3_label
1
user4_label
0
user5_label
0


92542
user1_label
0
user2_label
0
user3_label
1


689885
user1_label
1
user2_label
0
user3_label
0


758519
user1_label
0
user2_label
1
user3_label
0


1096257
user1_label
0
user2_label
1
user3_label
0


202306
user1_label
0
user2_label
1
user3_label
0




Most queries have just one 'no_input' entry. Query 426442 has three, and query 993153 has 20, of which 19 come from a single assessor (user2). So we need to remove query 993153 from the data and find a way to deal with the remaining queries.

In [9]:
query_ids.remove('993153')

## Compute Graded Relevance Agreement Label

In [10]:
def getListUserLabelColumns(nr_assessors):
    label_columns = []
    for i in range(nr_assessors):
        label_columns.append("user%s_label"%(i+1))
    return label_columns

In [15]:
def str2int(labels):
    integer_labels = []
    for label in labels:
        if not label == "no_input":
            integer_labels.append(int(label))
    return integer_labels

In [16]:
def getAgreementLabel(msmarco_label,labels):
    if usersDoAgree(labels):
        return labels[0]
    else:
        return performMajorityVote(msmarco_label,labels)

In [19]:
def usersDoAgree(labels):
    return (len(set(labels)) == 1)

In [20]:
def performMajorityVote(msmarco_label,labels):
    # Fall back to the MSMARCO graded label unless one label has a strict majority.
    count_votes = Counter(labels)
    majority_label = msmarco_label
    for label, votes in count_votes.items():
        # Strict majority: more than half of the votes that remain after
        # dropping "no_input". This covers both even and odd numbers of votes.
        if votes > len(labels) / 2:
            majority_label = label
    return majority_label
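
To make the voting rule concrete, here is a compact, self-contained restatement that should behave like the functions above, applied to three label patterns taken from the query 825147 table shown further down in this notebook. The names `to_int_labels` and `agreement_label` are illustrative and not part of the notebook.

```python
from collections import Counter

def to_int_labels(raw_labels):
    """Drop "no_input" entries and convert the remaining labels to int."""
    return [int(label) for label in raw_labels if label != "no_input"]

def agreement_label(msmarco_graded, labels):
    """Label backed by a strict majority of the votes, else the MSMARCO fallback."""
    if not labels:
        return msmarco_graded
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else msmarco_graded

# All three passages are MSMARCO-irrelevant, so the fallback label is 1.
clear_majority = agreement_label(1, to_int_labels(["1", "3", "3", "3", "3"]))
no_majority = agreement_label(1, to_int_labels(["1", "2", "5", "5", "4"]))
half_after_missing = agreement_label(1, to_int_labels(["3", "4", "no_input", "5", "5"]))
print(clear_majority, no_majority, half_after_missing)  # 3 1 1
```

In the last case, label 5 has two of the four remaining votes. That is exactly half, not a strict majority, so the MSMARCO fallback is used.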

In [22]:
for query_id in query_ids:
    df = dataframes[query_id]
    agreement_labels = []
    nr_assessors = assessor_data[query_id]
    for index, row in df.iterrows():
        user_labels = row[getListUserLabelColumns(nr_assessors)].values
        integer_labels = str2int(user_labels)
        agreement_label = getAgreementLabel(row['msmarco_graded'],integer_labels)
        agreement_labels.append(agreement_label)
    df['agreement_label'] = agreement_labels
    dataframes[query_id] = df

In [23]:
dataframes['825147']

Unnamed: 0,query_id,passageid,msmarco,user1_id,user1_label,user2_id,user2_label,user3_id,user3_label,user4_id,user4_label,user5_id,user5_label,msmarco_graded,agreement_label
0,825147,4474806,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,1,Y66GSjYrILSPtH6sTPPGGg2UVvk2,1,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,2,wJEeouJBg2cDvZjvrHQGiJX8vNf2,2,1,1
1,825147,3572516,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,1,Y66GSjYrILSPtH6sTPPGGg2UVvk2,1,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,1,wJEeouJBg2cDvZjvrHQGiJX8vNf2,2,1,1
2,825147,2072179,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,1,Y66GSjYrILSPtH6sTPPGGg2UVvk2,1,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,2,wJEeouJBg2cDvZjvrHQGiJX8vNf2,1,1,1
3,825147,5742137,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,3,Y66GSjYrILSPtH6sTPPGGg2UVvk2,3,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,3,wJEeouJBg2cDvZjvrHQGiJX8vNf2,3,1,3
4,825147,1178895,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,2,UREdJ6PMp1ambtR003yUY9YzMin1,2,Y66GSjYrILSPtH6sTPPGGg2UVvk2,2,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,1,wJEeouJBg2cDvZjvrHQGiJX8vNf2,2,1,2
5,825147,702337,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,2,Y66GSjYrILSPtH6sTPPGGg2UVvk2,5,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,5,wJEeouJBg2cDvZjvrHQGiJX8vNf2,4,1,1
6,825147,2978890,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,1,Y66GSjYrILSPtH6sTPPGGg2UVvk2,1,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,2,wJEeouJBg2cDvZjvrHQGiJX8vNf2,2,1,1
7,825147,8602943,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,3,UREdJ6PMp1ambtR003yUY9YzMin1,4,Y66GSjYrILSPtH6sTPPGGg2UVvk2,1,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,1,wJEeouJBg2cDvZjvrHQGiJX8vNf2,2,1,1
8,825147,573858,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,1,UREdJ6PMp1ambtR003yUY9YzMin1,1,Y66GSjYrILSPtH6sTPPGGg2UVvk2,1,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,3,wJEeouJBg2cDvZjvrHQGiJX8vNf2,3,1,1
9,825147,425816,irrelevant,IFYqTNplX3NAC0bqca41rpGT2552,3,UREdJ6PMp1ambtR003yUY9YzMin1,4,Y66GSjYrILSPtH6sTPPGGg2UVvk2,no_input,Zsx92rJo3vNY5Au4v3ojF1PMzJq1,5,wJEeouJBg2cDvZjvrHQGiJX8vNf2,5,1,1


## Create Thesis Dataset

In contrast to the binary dataset, it's a bit tricky to decide how to deal with cases where the subjects did not grade the MSMARCO-relevant passage that high. We could leave out queries where the subjects graded the MSMARCO passage lower than 3, or only those where they graded it lower than 2. The latter sounds reasonable, as the users would grade it 1 while MSMARCO grades it 5. Let's start by removing those queries.

In [24]:
queries_2_remove = []
for query_id in query_ids:
    nr_assessors = assessor_data[query_id]
    if nr_assessors >= 3:
        df = dataframes[query_id]
        idx = df.index[(df['msmarco_graded'] == 5) & (df['agreement_label'] == 1)]
        if not (idx.values.size == 0):
            queries_2_remove.append(query_id)

In [25]:
queries_2_remove

['427323', '993987', '540906']

For 3 queries, the subjects graded the MSMARCO-relevant passage as 1, that is, as not relevant. These queries will be left out of the dataset.
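
The removal rule can be checked on toy data. The sketch below wraps the same boolean mask in a hypothetical helper, `msmarco_passage_rejected`, and applies it to two made-up frames.

```python
import pandas as pd

def msmarco_passage_rejected(df):
    """True if any MSMARCO-relevant passage (graded 5) got agreement label 1."""
    mask = (df["msmarco_graded"] == 5) & (df["agreement_label"] == 1)
    return bool(mask.any())

# Made-up queries: the first row of each frame is the MSMARCO-relevant passage.
kept = pd.DataFrame({"msmarco_graded": [5, 1, 1], "agreement_label": [4, 1, 2]})
dropped = pd.DataFrame({"msmarco_graded": [5, 1, 1], "agreement_label": [1, 3, 1]})

print(msmarco_passage_rejected(kept), msmarco_passage_rejected(dropped))  # False True
```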

Next, we want to get rid of all queries for which, after majority voting, only the MSMARCO passage is relevant. These queries change nothing in the original dataset except for the graded relevance. This is the case when all passages except the MSMARCO-relevant passage are graded 1.

In [29]:
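for query_id in query_ids:
    nr_assessors = assessor_data[query_id]
    if nr_assessors >= 3:
        df = dataframes[query_id]
        # Sum the agreement labels of the passages MSMARCO marks as irrelevant.
        # The hard-coded 19 presumably reflects 19 such passages per query,
        # in which case a sum of 19 means every one of them was graded 1.
        label_sum = df[df['msmarco_graded'] == 1]['agreement_label'].sum()
        if (label_sum == 19):
            queries_2_remove.append(query_id)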

In [30]:
queries_2_remove

['427323', '993987', '540906', '1034595', '335710']
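
The check above compares a label sum against a fixed value. An alternative that does not depend on the number of passages per query is to test directly whether every non-MSMARCO passage has agreement label 1. This is a sketch of that alternative on made-up data, not the check used in this notebook; the helper name `only_msmarco_relevant` is hypothetical.

```python
import pandas as pd

def only_msmarco_relevant(df):
    """True if every passage MSMARCO marks irrelevant also has agreement label 1."""
    others = df.loc[df["msmarco_graded"] == 1, "agreement_label"]
    # Note: .all() is also True when there are no such passages at all.
    return bool((others == 1).all())

# Made-up queries: the first row of each frame is the MSMARCO-relevant passage.
nothing_new = pd.DataFrame({"msmarco_graded": [5, 1, 1, 1], "agreement_label": [5, 1, 1, 1]})
adds_labels = pd.DataFrame({"msmarco_graded": [5, 1, 1, 1], "agreement_label": [4, 1, 3, 1]})

print(only_msmarco_relevant(nothing_new), only_msmarco_relevant(adds_labels))  # True False
```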

In [31]:
for query_id in queries_2_remove:
    query_ids.remove(query_id)

Now we want to create the actual dataset. This new relevance dataset has a different layout than the qrels.tsv file from MSMARCO (query_id;label1;passage_id;label2). It should look more like this: (query_id;passage_id;graded_label)

In [36]:
output_df_data = []
counter = 0

for query_id in query_ids:
    nr_assessors = assessor_data[query_id]
    if nr_assessors >= 3:
        counter += 1
        df = dataframes[query_id]
        for index, row in df.iterrows():
            output_df_data.append({"query_id": query_id, "passage_id":row['passageid'], "label":row['agreement_label']})

In [41]:
output_df = pd.DataFrame(output_df_data, columns=['query_id','passage_id','label'])

In [43]:
output_df.to_csv('thesis_dataset_graded_relevance.tsv',sep='\t',index=False,header=False)
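
Because the file is written without a header row, the column names have to be supplied again when reading it back. The sketch below round-trips a made-up two-row frame through an in-memory buffer instead of a file; the IDs are hypothetical, and reading the ID columns as strings is an assumption that keeps them comparable to the string keys used in this notebook.

```python
import io
import pandas as pd

# Made-up rows in the (query_id, passage_id, label) layout.
toy_output = pd.DataFrame(
    [{"query_id": "q1", "passage_id": "p1", "label": 5},
     {"query_id": "q1", "passage_id": "p2", "label": 1}],
    columns=["query_id", "passage_id", "label"],
)

# Write tab-separated text without index or header, as in the cell above.
buffer = io.StringIO()
toy_output.to_csv(buffer, sep="\t", index=False, header=False)
tsv_text = buffer.getvalue()

# Read it back, supplying the column names that the file itself does not carry.
reloaded = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    header=None,
    names=["query_id", "passage_id", "label"],
    dtype={"query_id": str, "passage_id": str},
)
print(reloaded)
```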

In [44]:
counter

45