# Project Data Overview

This notebook gives an overview of the data available for this project: where each file comes from, how many labels it holds, and how the topics of the different datasets overlap.

The judgements used by this project are the NIST expert judgements ('stage1-dev') and the consensus labels ('stage2-dev') from [the 2011 TREC Crowdsourcing track](https://sites.google.com/site/treccrowd/2011).

The actual document data (the T11Crowd subset of the [ClueWeb09 dataset](http://lemurproject.org/clueweb09/)) is, sadly, not publicly available, but can be acquired by signing a non-commercial use agreement with the provider.

Some of the data management code has been cannibalized from [Martin Davtyan's previous work on the subject](https://github.com/martinthenext/ir-crowd-thesis) (while at ETH Zurich).

In [9]:
import io
import os

In [134]:
# This should be the root folder containing the different judgement datasets.
DATA_ROOT = os.path.join(os.path.expanduser('~'), 'data')

# This file contains exclusively the NIST expert judgements from 'stage1-dev'
# 'cat'ed together into a single file (from all the teams, as well as the
# common data).
EXPERT_GROUND_TRUTH_FILE = os.path.join(DATA_ROOT, 'ground_truth')

# This file contains the document labels computed by the Mechanical Turk
# workers.
# TODO(andrei) There can be contradictions, right?
# Unlike the NIST expert judgement file, this one is provided from the 
# beginning as just one file (yay!). Contains the development data for the
# second part of the challenge (consensus).
WORKER_LABEL_FILE = os.path.join(DATA_ROOT, 'stage2-dev', 'stage2.dev')

# Mechanical Turk worker judgements for the 2011 Crowdsourcing Track. 
JUDGEMENT_FILE = os.path.join(DATA_ROOT, 'all_judgements.tsv')

# Provided test data for the 1st stage of the TREC 2011 Crowdsourcing Track.
TEST_LABEL_FILE_SHARED = os.path.join(DATA_ROOT, 'test-set-Aug-8', 'trec-cs-2011-test-set-shared.csv')
TEST_LABEL_FILE_TEAMS = os.path.join(DATA_ROOT, 'test-set-Aug-8', 'trec-cs-2011-test-set-assigned-to-teams.csv')

In [102]:
class JudgementRecord(object):
    """ Judgement record submitted in the 2011 Crowdsourcing Track.
    
        Attributes:
            label_type: Additional label metadata (enum).
                0: default
                1: rejected label: where you would have filtered this
                   label out before subsequent use
                2: automated label: label was produced by automation
                   ("artificial artificial artificial intelligence")
                3: training / quality-control label: used in training/evaluating
                   worker, not for labeling test data
    """
    def __init__(self, table_row):
        attributes = table_row.split('\t')
        team_id, worker_id, _, topic_id, doc_id, _, relevance, _, _, _, label_type = attributes
        self.team_id = team_id
        self.worker_id = worker_id
        self.label_type = int(label_type)
        self.topic_id = topic_id
        self.doc_id = doc_id
        if not relevance == 'na':
            self.is_relevant = (float(relevance) >= 0.5)
        else:
            self.is_relevant = None
            
    def is_useful(self):
        return self.label_type == 0 and (self.is_relevant is not None)
        
            
class WorkerLabel(object):
    def __init__(self, table_row):
        attributes = table_row.split()
        topic_id, hit_id, worker_id, document_id, nist_label, worker_label = attributes
        self.topic_id = topic_id
        self.hit_id = hit_id
        self.worker_id = worker_id
        self.document_id = document_id
        self.nist_label = nist_label
        self.worker_label = worker_label
        
        
class ExpertLabel(object):
    def __init__(self, attributes):
        if len(attributes) == 3:
            topic_id, document_id, label = attributes
        elif len(attributes) == 4:
            # Also includes set column, which we ignore
            _, topic_id, document_id, label = attributes
        elif len(attributes) == 5:
            # Also includes team and set columns, which we ignore
            _, _, topic_id, document_id, label = attributes
        else:
            # Report the offending columns ('table_row' is not defined in this scope).
            raise Exception("Unsupported expert label format: [%s]" % (attributes,))
        
        self.topic_id = topic_id
        self.document_id = document_id
        # 0 (non-relevant), 1 (relevant) or 2 (highly relevant)
        self.label = int(label)
        
    def is_relevant(self):
        return self.label > 0
    
    def __repr__(self):
        return "%s:%s:%s" % (self.topic_id, self.document_id, "Relevant" if self.is_relevant() else "Not relevant")
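
The way `JudgementRecord` turns a row into a usable label can be illustrated on a single made-up row. This is a minimal standalone sketch: `parse_judgement` and the sample values are hypothetical, and only the column positions, the 0.5 threshold and the `label_type == 0` filter mirror the class above.

```python
# Hypothetical 11-column, tab-separated row. The values are invented; only
# the column positions follow the unpacking in JudgementRecord.__init__.
sample_row = "\t".join([
    "team1", "worker7", "x", "20542", "clueweb09-en0000-00-00000",
    "x", "0.75", "x", "x", "x", "0",
])


def parse_judgement(row):
    """Standalone mirror of JudgementRecord's parsing, for illustration only."""
    fields = row.split("\t")
    team_id, worker_id, _, topic_id, doc_id, _, relevance, _, _, _, label_type = fields
    # 'na' means the worker gave no relevance value at all.
    is_relevant = None if relevance == "na" else float(relevance) >= 0.5
    return {
        "topic_id": topic_id,
        "doc_id": doc_id,
        "label_type": int(label_type),
        "is_relevant": is_relevant,
    }


def is_useful(record):
    """Same rule as JudgementRecord.is_useful: default label type, known relevance."""
    return record["label_type"] == 0 and record["is_relevant"] is not None


parsed = parse_judgement(sample_row)
missing = parse_judgement(sample_row.replace("0.75", "na"))
print(parsed["is_relevant"], is_useful(parsed))
print(missing["is_relevant"], is_useful(missing))
```

A row with relevance `na` is kept as a record but reported as not useful, which is what later removes it from the filtered judgement list.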

In [141]:
def read_judgement_labels(file_name):
    with io.open(file_name, 'r') as f:
        # Strip only the trailing newline, so a final line without one stays intact.
        return [JudgementRecord(line.rstrip('\n')) for line in f]
            
def read_expert_labels(file_name, header=False, sep=None):
    with io.open(file_name, 'r') as f:
        if header:
            # Skip the header
            f.readline()
        return [ExpertLabel(line.split(sep)) for line in f]

def read_worker_labels(file_name):
    with io.open(file_name, 'r') as f:
        return [WorkerLabel(line) for line in f]
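
The three layouts accepted by `ExpertLabel` differ only in their leading columns, so the topic, document and label are always the last three fields. A minimal sketch of that observation follows; `split_expert_row` is a hypothetical helper that is not part of the original code, and the `team1` and `set1` values are invented placeholders.

```python
def split_expert_row(line, sep=None):
    """Return (topic_id, document_id, label) from a 3-, 4- or 5-column row."""
    fields = line.strip().split(sep)
    if not 3 <= len(fields) <= 5:
        raise ValueError("Unsupported expert label format: %r" % (fields,))
    # Any extra leading columns (set, team) are ignored.
    topic_id, document_id, label = fields[-3:]
    return topic_id, document_id, int(label)


# sep=None splits on any whitespace, as used for the ground truth file.
whitespace_row = "20542 clueweb09-en0000-00-00000 0\n"
# sep=',' matches how the CSV test files are read.
csv_row = "team1,set1,20542,clueweb09-en0003-47-17392,1\n"

print(split_expert_row(whitespace_row))
print(split_expert_row(csv_row, sep=","))
```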

In [142]:
expert_labels = read_expert_labels(EXPERT_GROUND_TRUTH_FILE)
print("%d NIST expert labels" % len(expert_labels))

2033 NIST expert labels


In [143]:
worker_labels = read_worker_labels(WORKER_LABEL_FILE)
print("%d Mechanical Turk worker labels" % len(worker_labels))

10770 Mechanical Turk worker labels
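
The TODO in the configuration cell asks whether worker labels can contradict each other. One possible way to check is to group labels by (topic, document) pair and look for pairs that received more than one distinct label. The sketch below runs on invented toy triples; `find_conflicts` is a hypothetical helper, not part of the original code.

```python
from collections import defaultdict


def find_conflicts(labels):
    """Return the (topic_id, document_id) pairs with more than one distinct label."""
    seen = defaultdict(set)
    for topic_id, document_id, worker_label in labels:
        seen[(topic_id, document_id)].add(worker_label)
    return {key for key, values in seen.items() if len(values) > 1}


# Invented toy data: the workers disagree on d1 and agree on d2.
toy_labels = [
    ("t1", "d1", "1"),
    ("t1", "d1", "0"),
    ("t1", "d2", "1"),
    ("t1", "d2", "1"),
]
conflicts = find_conflicts(toy_labels)
print(conflicts)
```

In this notebook the input could be built as `[(l.topic_id, l.document_id, l.worker_label) for l in worker_labels]`.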


In [144]:
expert_label_topic_ids = { l.topic_id for l in expert_labels }
print("%d topics in NIST expert label data" % len(expert_label_topic_ids))

244 topics in NIST expert label data


In [138]:
worker_label_topic_ids = { l.topic_id for l in worker_labels }
print("%d topics in development worker label data" % len(worker_label_topic_ids))

25 topics in development worker label data


Is it normal to have 25 topics in the worker labels, but 244 topics in the expert labels? The stage1-dev and stage2-dev READMEs confirm both counts.

In [140]:
common_expert_worker_topic_ids = expert_label_topic_ids & worker_label_topic_ids
str(len(common_expert_worker_topic_ids)) + ' topics in common (NIST expert labels and development worker labels)'

'24 topics in common (NIST expert labels and development worker labels)'
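
With 25 worker-label topics and 24 in common, one worker-label topic has no expert labels. A set difference would show which one. The sketch below uses invented toy IDs in place of `expert_label_topic_ids` and `worker_label_topic_ids`.

```python
# Toy stand-ins for expert_label_topic_ids and worker_label_topic_ids.
toy_expert_ids = {"a", "b", "c", "d"}
toy_worker_ids = {"c", "d", "e"}

# Topics present in both sets (intersection).
shared = toy_worker_ids & toy_expert_ids
# Topics with worker labels but no expert labels (difference).
only_worker = toy_worker_ids - toy_expert_ids

print(sorted(shared))
print(sorted(only_worker))
```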

In [109]:
judgement_labels_2011 = read_judgement_labels(JUDGEMENT_FILE)
str(len(judgement_labels_2011)) + ' judgement labels'

'64042 judgement labels'

### 2011 Judgement Data

In [110]:
judgement_topic_ids = { l.topic_id for l in judgement_labels_2011 }
len(judgement_topic_ids)

46

In [111]:
print(len(judgement_topic_ids & expert_label_topic_ids))
print(len(judgement_topic_ids & worker_label_topic_ids))

17
1


In [145]:
# Keep only usable labels: drop rejected, automated and quality-control labels
# (label_type != 0), as well as labels without a relevance value ('na').

useful_judgement_labels_2011 = [l for l in judgement_labels_2011 if l.is_useful()]

In [147]:
useful_judgement_topic_ids = { l.topic_id for l in useful_judgement_labels_2011 }
print("%d different topics in 2011 judgement data" % len(useful_judgement_topic_ids))
print("%d topics in common between 2011 judgement data and original NIST expert label data." %
      len(useful_judgement_topic_ids & expert_label_topic_ids))
print("%d topics in common between 2011 judgement data and original (dev) worker label data." %
      len(useful_judgement_topic_ids & worker_label_topic_ids))

30 different topics in 2011 judgement data
2 topics in common between 2011 judgement data and original NIST expert label data.
0 topics in common between 2011 judgement data and original (dev) worker label data.


### 2011 Test Data

In [148]:
test_data_shared = read_expert_labels(TEST_LABEL_FILE_SHARED, header=True, sep=',')
test_data_team = read_expert_labels(TEST_LABEL_FILE_TEAMS, header=True, sep=',')

print(len(test_data_shared))
print("First 5:\n" + "\n".join([str(d) for d in test_data_shared[:5]]))
print("Last 5:\n" + "\n".join([str(d) for d in test_data_shared[-5:]]))

test_data = test_data_shared + test_data_team
print("Last 5 (after merge):")
print("\n".join([str(d) for d in test_data[-5:]]))

1655
First 5:
20542:clueweb09-en0003-47-17392:Relevant
20542:clueweb09-en0002-74-25816:Relevant
20542:clueweb09-en0000-00-00000:Not relevant
20542:clueweb09-enwp00-69-12844:Relevant
20542:clueweb09-en0002-93-19628:Relevant
Last 5:
20996:clueweb09-en0129-94-14964:Not relevant
20996:clueweb09-en0129-94-14966:Not relevant
20996:clueweb09-en0131-42-22886:Not relevant
20996:clueweb09-en0132-77-26392:Not relevant
20996:clueweb09-enwp01-17-03021:Not relevant
Last 5 (after merge):
20958:clueweb09-en0112-59-01254:Not relevant
20958:clueweb09-en0114-14-25526:Not relevant
20958:clueweb09-en0116-09-14871:Not relevant
20958:clueweb09-en0116-09-14873:Not relevant
20958:clueweb09-en0121-92-02032:Not relevant


In [129]:
test_topic_ids = { l.topic_id for l in test_data }
print("%d different topics in test data." % len(test_topic_ids))

30 different topics in test data.


In [149]:
print(len(test_topic_ids & useful_judgement_topic_ids))

30


In [131]:
print(len(test_topic_ids & expert_label_topic_ids))

2


In [132]:
print(len(test_topic_ids & worker_label_topic_ids))

0


## Summary
 * Full topic overlap between judgement data and test data.
 * About 7% (2/30) topic overlap between expert label data and test data.
 * 0% (0/30) topic overlap between original worker consensus training data labels and test data.
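
The percentages follow directly from the overlap counts printed above, each taken relative to the 30 test topics. A minimal sketch of the arithmetic, with the counts copied in by hand; `overlap_percent` is a hypothetical helper:

```python
def overlap_percent(common, total):
    """Share of `total` topics that are also in the other dataset, in percent."""
    return 100.0 * common / total


TEST_TOPIC_COUNT = 30
# Number of test topics also found in each dataset, as printed by the cells above.
common_counts = {"judgement": 30, "expert": 2, "worker": 0}

for name, common in common_counts.items():
    print("%s: %.1f%%" % (name, overlap_percent(common, TEST_TOPIC_COUNT)))
```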