# Project Data Overview

This notebook gives an overview of the data available for this project.

The judgements used by this project are the NIST expert judgements ('stage1-dev') and the consensus labels ('stage2-dev') from [the 2011 TREC Crowdsourcing track](https://sites.google.com/site/treccrowd/2011).

The actual document data (the T11Crowd subset of the [ClueWeb09 dataset](http://lemurproject.org/clueweb09/)) is, sadly, not publicly available, but it can be acquired by signing a non-commercial use agreement with the provider.

Some of the data management code has been cannibalized from [Martin Davtyan's previous work on the subject](https://github.com/martinthenext/ir-crowd-thesis) (while at ETH Zurich).

In [1]:
import io
import os

In [2]:
# This should be the root folder containing the different judgement datasets.
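# expanduser('~') is used instead of os.getenv('HOME'), which returns None
# (and makes os.path.join fail) when HOME is not set.
DATA_ROOT = os.path.join(os.path.expanduser('~'), 'data')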

# This file contains exclusively the NIST expert judgements from 'stage1-dev'
# 'cat'ed together into a single file (from all the teams, as well as the
# common data).
EXPERT_GROUND_TRUTH_FILE = os.path.join(DATA_ROOT, 'ground_truth')

# This file contains the document labels computed by the Mechanical Turk
# workers. It holds the development data for the second part of the
# challenge (consensus). Unlike the NIST expert judgement file, it is
# provided as a single file from the beginning (yay!).
# TODO(andrei) There can be contradictions, right?
WORKER_LABEL_FILE = os.path.join(DATA_ROOT, 'stage2-dev', 'stage2.dev')

# Mechanical Turk worker judgements for the 2011 Crowdsourcing Track. 
JUDGEMENT_FILE = os.path.join(DATA_ROOT, 'all_judgements.tsv')

# Provided test data for the 1st stage of the TREC 2011 Crowdsourcing Track.
TEST_LABEL_FILE_SHARED = os.path.join(DATA_ROOT, 'test-set-Aug-8', 'trec-cs-2011-test-set-shared.csv')
TEST_LABEL_FILE_TEAMS = os.path.join(DATA_ROOT, 'test-set-Aug-8', 'trec-cs-2011-test-set-assigned-to-teams.csv')
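
Since the paths above depend on a local, manually assembled data folder, it can help to check that they exist before loading anything. The following is a minimal sketch of such a check; `find_missing_paths` and `example_paths` are hypothetical names introduced here for illustration and are not part of `data.py`.

```python
import os


def find_missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk, in order."""
    return [p for p in paths if not os.path.exists(p)]


# Hypothetical stand-ins for the constants defined above. The first path is
# chosen so that it should not exist; the second is the current directory.
example_paths = [
    os.path.join(os.sep, 'no', 'such', 'dir', 'ground_truth'),
    os.curdir,
]
missing = find_missing_paths(example_paths)
print("Missing: %s" % missing)
```

In the notebook itself, the same function could be called with the list of constants from the cell above, e.g. `find_missing_paths([EXPERT_GROUND_TRUTH_FILE, WORKER_LABEL_FILE, JUDGEMENT_FILE])`.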

In [9]:
# This loads the necessary data wrangling classes and functions.
%run ../data.py

In [None]:
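def read_judgement_labels(file_name):
    """Reads one JudgementRecord per line of the given file."""
    with io.open(file_name, 'r') as f:
        # Strip only the trailing newline. Slicing with [:-1] would drop a
        # real character if the last line had no newline.
        return [JudgementRecord(line.rstrip('\n')) for line in f]

def read_expert_labels(file_name, header=False, sep=None):
    """Reads one ExpertLabel per line, optionally skipping a header line."""
    with io.open(file_name, 'r') as f:
        if header:
            # Skip the header
            f.readline()
        return [ExpertLabel(line.split(sep)) for line in f]

def read_worker_labels(file_name):
    """Reads one WorkerLabel per line of the given file."""
    with io.open(file_name, 'r') as f:
        return [WorkerLabel(line) for line in f]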

In [None]:
expert_labels = read_expert_labels(EXPERT_GROUND_TRUTH_FILE)
print("%d NIST expert labels" % len(expert_labels))

In [None]:
worker_labels = read_worker_labels(WORKER_LABEL_FILE)
print("%d Mechanical Turk worker labels" % len(worker_labels))

In [None]:
expert_label_topic_ids = { l.topic_id for l in expert_labels }
print("%d topics in NIST expert label data" % len(expert_label_topic_ids))

In [None]:
worker_label_topic_ids = { l.topic_id for l in worker_labels }
print("%d topics in development worker label data" % len(worker_label_topic_ids))

Is it normal to have 25 topics in the worker labels but 244 topics in the expert labels? (The stage1-dev and stage2-dev READMEs confirm these counts!)
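
One way to look into this mismatch is to count how many labels each topic has in either dataset. The snippet below is a rough sketch of that idea on toy data. The `Label` namedtuple is a hypothetical stand-in for the classes loaded from `data.py` (only a `topic_id` attribute is assumed), and the topic IDs are made up.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the label classes from data.py; only the
# `topic_id` attribute (used throughout this notebook) is assumed.
Label = namedtuple('Label', ['topic_id', 'document_id'])


def labels_per_topic(labels):
    """Map each topic ID to the number of labels that refer to it."""
    return Counter(l.topic_id for l in labels)


# Toy data with made-up IDs, for illustration only.
toy_labels = [
    Label('t1', 'doc-a'),
    Label('t1', 'doc-b'),
    Label('t2', 'doc-c'),
]
counts = labels_per_topic(toy_labels)
for topic_id, n in counts.most_common():
    print("%s: %d labels" % (topic_id, n))
```

Applied to `expert_labels` and `worker_labels`, this would show whether the expert data spreads a few labels over many topics or whether the two sets simply cover different topics.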

In [None]:
common_expert_worker_topic_ids = expert_label_topic_ids & worker_label_topic_ids
str(len(common_expert_worker_topic_ids)) + ' topics in common (NIST expert labels and development worker labels)'

In [None]:
judgement_labels_2011 = read_judgement_labels(JUDGEMENT_FILE)
str(len(judgement_labels_2011)) + ' judgement labels'

### 2011 Judgement Data

In [None]:
judgement_topic_ids = { l.topic_id for l in judgement_labels_2011 }
len(judgement_topic_ids)

In [None]:
print(len(judgement_topic_ids & expert_label_topic_ids))
print(len(judgement_topic_ids & worker_label_topic_ids))

In [None]:
# Clear out labels deemed irrelevant (e.g. ones used for worker assessment).

useful_judgement_labels_2011 = [l for l in judgement_labels_2011 if l.is_useful()]

In [None]:
useful_judgement_topic_ids = { l.topic_id for l in useful_judgement_labels_2011 }
print("%d different topics in 2011 judgement data" % len(useful_judgement_topic_ids))
print("%d topics in common between 2011 judgement data and original NIST expert label data." %
      len(useful_judgement_topic_ids & expert_label_topic_ids))
print("%d topics in common between 2011 judgement data and original (dev) worker label data." % 
      len(useful_judgement_topic_ids & worker_label_topic_ids))

### 2011 Test Data

In [None]:
test_data_shared = read_expert_labels(TEST_LABEL_FILE_SHARED, header=True, sep=',')
test_data_team = read_expert_labels(TEST_LABEL_FILE_TEAMS, header=True, sep=',')

print(len(test_data_shared))
print("First 5:\n" + "\n".join([str(d) for d in test_data_shared[:5]]))
print("Last 5:\n" + "\n".join([str(d) for d in test_data_shared[-5:]]))

test_data = test_data_shared + test_data_team
print("Last 5 (after merge):")
print("\n".join([str(d) for d in test_data[-5:]]))

In [None]:
test_topic_ids = { l.topic_id for l in test_data }
print("%d different topics in test data." % len(test_topic_ids))

In [None]:
print(len(test_topic_ids & useful_judgement_topic_ids))

In [None]:
print(len(test_topic_ids & expert_label_topic_ids))

In [None]:
print(len(test_topic_ids & worker_label_topic_ids))

## Summary
 * Full topic overlap between judgement data and test data.
 * About 7% (2/30) topic overlap between expert label data and test data.
 * 0% (0/30) topic overlap between the original (dev) worker consensus labels and test data.
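
The percentages above can be recomputed directly from the topic ID sets. Below is a minimal sketch, assuming a hypothetical helper `topic_overlap` (not part of `data.py`) and toy ID sets that mimic the 2/30 case.

```python
def topic_overlap(reference_ids, other_ids):
    """Return (count, fraction) of `reference_ids` also found in `other_ids`."""
    reference_ids = set(reference_ids)
    if not reference_ids:
        # Avoid dividing by zero when there is nothing to compare against.
        return 0, 0.0
    shared = reference_ids & set(other_ids)
    return len(shared), len(shared) / float(len(reference_ids))


# Toy sets, for illustration only: 30 "test" topics, 2 of which also appear
# among the "expert" topics.
toy_test_ids = set(range(30))
toy_expert_ids = {0, 1, 100, 101}
shared, fraction = topic_overlap(toy_test_ids, toy_expert_ids)
print("%d/%d topics shared (%.1f%%)" % (shared, len(toy_test_ids), 100 * fraction))
```

In the notebook, `topic_overlap(test_topic_ids, expert_label_topic_ids)` would give the figure for the expert labels, with the test topics as the reference set.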