# Project Data Overview

This notebook gives a general overview of the data available for this project.

The judgements used by this project are the NIST expert judgements ('stage1-dev') and the consensus labels ('stage2-dev') from [the 2011 TREC Crowdsourcing track](https://sites.google.com/site/treccrowd/2011).

The underlying document data (the T11Crowd subset of the [ClueWeb09 dataset](http://lemurproject.org/clueweb09/)) is not publicly available, but it can be obtained by signing a non-commercial use agreement with the provider.

Some of the data management code has been adapted from [Martin Davtyan's previous work on the subject](https://github.com/martinthenext/ir-crowd-thesis) (done while at ETH Zurich).

**Some of the numbers in this notebook may be out of date: the old files have since been moved and given new, clearer names, and most of the confusion has been resolved.**

In [4]:
%load_ext autoreload

In [2]:
import io
import os

In [3]:
# Add the parent directory to the import path so that the `crowd` package
# can be imported when the notebook is run from a subfolder.

import sys
sys.path.append(os.path.join(os.getcwd(), '..'))

In [8]:
%autoreload 2

from crowd.data import *
from crowd.config import *

In [29]:
if 'notebooks' in os.getcwd():
    print(os.getcwd())
    os.chdir('..')
    print(os.getcwd())

## Some statistics

In [31]:
ground_truth = read_ground_truth()
turk_labels = read_useful_judgement_labels()

In [34]:
ground_truth_topics = {t.topic_id for t in ground_truth}
turk_label_topics = {lbl.topic_id for lbl in turk_labels}

assert len(ground_truth_topics & turk_label_topics) == 30, \
    "All 30 topics must be covered by the ground truth and label data!"

In [35]:
judgements_by_doc_id = get_all_judgements_by_doc_id(turk_labels)

In [40]:
print(len(judgements_by_doc_id))

3195


In [43]:
labels_by_topic_doc = {}
for label in turk_labels:
    if label.topic_id not in labels_by_topic_doc:
        labels_by_topic_doc[label.topic_id] = {}
        
    if label.doc_id not in labels_by_topic_doc[label.topic_id]:
        labels_by_topic_doc[label.topic_id][label.doc_id] = []
        
    labels_by_topic_doc[label.topic_id][label.doc_id].append(label)
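
The same topic-then-document grouping can be written more compactly with `collections.defaultdict`. The sketch below is only an illustration: the `Label` namedtuple, its `worker_id` and `value` fields, and the sample values are hypothetical stand-ins for the objects returned by `read_useful_judgement_labels()`. The only thing assumed about the real objects is that they expose `topic_id` and `doc_id`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the real label objects from crowd.data.
Label = namedtuple("Label", ["topic_id", "doc_id", "worker_id", "value"])


def group_by_topic_and_doc(labels):
    """Group labels into a {topic_id: {doc_id: [label, ...]}} mapping."""
    grouped = defaultdict(lambda: defaultdict(list))
    for label in labels:
        grouped[label.topic_id][label.doc_id].append(label)
    return grouped


# Made-up sample data, purely for illustration.
sample_labels = [
    Label("20958", "doc-a", "w1", 1),
    Label("20958", "doc-a", "w2", 0),
    Label("20958", "doc-b", "w1", 1),
    Label("20956", "doc-c", "w3", 1),
]

grouped = group_by_topic_and_doc(sample_labels)
for topic_id, doc_map in grouped.items():
    print(topic_id, {doc_id: len(ls) for doc_id, ls in doc_map.items()})
```

One caveat with this approach: looking up a missing key in a `defaultdict` silently creates an empty entry, so use `in` or `.get()` when checking whether a topic or document is present.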

In [46]:
print(len(labels_by_topic_doc))
print(labels_by_topic_doc.keys())

30
dict_keys(['20958', '20956', '20996', '20686', '20696', '20814', '20488', '20764', '20694', '20976', '20714', '20704', '20916', '20424', '20778', '20922', '20542', '20972', '20690', '20910', '20584', '20766', '20780', '20962', '20644', '20832', '20636', '20812', '20932', '20642'])


In [58]:
for tid, doc_map in labels_by_topic_doc.items():
    # Every document in doc_map has at least one label, so the sum below
    # is the total number of votes cast for this topic.
    total_votes = sum(len(labels) for labels in doc_map.values())
    avg_labels_per_doc = total_votes / len(doc_map)
    print("#{0}\t Avg. labels per labeled doc: {1:.2f}; Total votes: {2}"
          .format(tid, avg_labels_per_doc, total_votes))

#20958	 Avg. labels per labeled doc: 8.60; Total votes: 860
#20956	 Avg. labels per labeled doc: 20.62; Total votes: 2268
#20996	 Avg. labels per labeled doc: 21.15; Total votes: 2327
#20686	 Avg. labels per labeled doc: 20.79; Total votes: 2391
#20696	 Avg. labels per labeled doc: 4.05; Total votes: 446
#20814	 Avg. labels per labeled doc: 9.38; Total votes: 938
#20488	 Avg. labels per labeled doc: 12.36; Total votes: 1360
#20764	 Avg. labels per labeled doc: 11.55; Total votes: 1155
#20694	 Avg. labels per labeled doc: 19.48; Total votes: 1948
#20976	 Avg. labels per labeled doc: 11.93; Total votes: 1074
#20714	 Avg. labels per labeled doc: 6.21; Total votes: 683
#20704	 Avg. labels per labeled doc: 5.17; Total votes: 465
#20916	 Avg. labels per labeled doc: 7.86; Total votes: 865
#20424	 Avg. labels per labeled doc: 3.90; Total votes: 390
#20778	 Avg. labels per labeled doc: 21.72; Total votes: 2389
#20922	 Avg. labels per labeled doc: 1.05; Total votes: 105
#20542	 Avg. labels per 

## Old data (not used in the project)

This code evaluates the overlap and usefulness of several datasets that I downloaded separately, in an attempt to establish which ones are correct. The issue is now resolved: use the methods from the section above when working with the 2011 corpus used in my, Martin's, or Piyush's paper.
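
The overlap checks in this section all come down to intersecting sets of topic IDs. As a minimal sketch of that pattern, the hypothetical helper `topic_overlaps` below computes the number of shared topics for every pair of datasets at once. The dataset names and topic IDs are made up for illustration and are not taken from the real files.

```python
from itertools import combinations


def topic_overlaps(topic_ids_by_dataset):
    """Return {(name_a, name_b): number of shared topic IDs} per dataset pair."""
    return {
        (a, b): len(topic_ids_by_dataset[a] & topic_ids_by_dataset[b])
        for a, b in combinations(sorted(topic_ids_by_dataset), 2)
    }


# Made-up topic IDs, purely for illustration.
example = {
    "expert": {"20958", "20956", "20996"},
    "worker": {"10001", "10002"},
    "judgement": {"20958", "20956", "10002", "30000"},
}

overlaps = topic_overlaps(example)
for pair, shared in overlaps.items():
    print(pair, shared)
```

With the real data, each set would be built the same way as in the cells of this section, e.g. `{l.topic_id for l in expert_labels}`.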

In [12]:
expert_labels = read_expert_labels(EXPERT_GROUND_TRUTH_FILE, header=True, sep=',')
print("%d NIST expert labels" % len(expert_labels))

9380 NIST expert labels


In [13]:
worker_labels = read_worker_labels(WORKER_LABEL_FILE)
print("%d Mechanical Turk worker labels" % len(worker_labels))

10770 Mechanical Turk worker labels


In [14]:
expert_label_topic_ids = { l.topic_id for l in expert_labels }
print("%d topics in NIST expert label data" % len(expert_label_topic_ids))

30 topics in NIST expert label data


In [15]:
worker_label_topic_ids = { l.topic_id for l in worker_labels }
print("%d topics in development worker label data" % len(worker_label_topic_ids))

25 topics in development worker label data


In [16]:
common_expert_worker_topic_ids = expert_label_topic_ids & worker_label_topic_ids
str(len(common_expert_worker_topic_ids)) + ' topics in common (ground truth expert labels and development worker labels)'

'0 topics in common (ground truth expert labels and development worker labels)'

In [18]:
judgement_labels_2011 = read_judgement_labels(JUDGEMENT_FILE)
str(len(judgement_labels_2011)) + ' judgement labels'

'64042 judgement labels'

The lack of topic overlap between the expert labels and the worker labels is to be expected, since the expert label data is from 2011, while the worker label data is older and comes from an entirely different session.

### 2011 Judgement Data

In [19]:
judgement_topic_ids = { l.topic_id for l in judgement_labels_2011 }
len(judgement_topic_ids)

46

In [20]:
print(len(judgement_topic_ids & expert_label_topic_ids))
print(len(judgement_topic_ids & worker_label_topic_ids))

30
1


In [21]:
# Clear out labels deemed irrelevant (e.g. ones used for worker assessment).
useful_judgement_labels_2011 = read_useful_judgement_labels()

In [22]:
useful_judgement_topic_ids = { l.topic_id for l in useful_judgement_labels_2011 }
print("%d different topics in 2011 judgement data" % len(useful_judgement_topic_ids))
print("%d topics in common between 2011 judgement data and original NIST expert label data." %
      len(useful_judgement_topic_ids & expert_label_topic_ids))
print("%d topics in common between 2011 judgement data and original (dev) worker label data." % 
      len(useful_judgement_topic_ids & worker_label_topic_ids))

30 different topics in 2011 judgement data
30 topics in common between 2011 judgement data and original NIST expert label data.
0 topics in common between 2011 judgement data and original (dev) worker label data.


### 2011 Test Data

In [24]:
test_data_shared = read_expert_labels(TEST_LABEL_FILE_SHARED, header=True, sep=',')
test_data_team = read_expert_labels(TEST_LABEL_FILE_TEAMS, header=True, sep=',')

print(len(test_data_shared))
print("First 5:\n" + "\n".join([str(d) for d in test_data_shared[:5]]))
print("Last 5:\n" + "\n".join([str(d) for d in test_data_shared[-5:]]))

test_data = test_data_shared + test_data_team
print("Last 5 (after merge):")
print("\n".join([str(d) for d in test_data[-5:]]))

1655
First 5:
20542:clueweb09-en0003-47-17392:Relevant
20542:clueweb09-en0002-74-25816:Relevant
20542:clueweb09-en0000-00-00000:Non-relevant
20542:clueweb09-enwp00-69-12844:Relevant
20542:clueweb09-en0002-93-19628:Relevant
Last 5:
20996:clueweb09-en0129-94-14964:Unknown
20996:clueweb09-en0129-94-14966:Unknown
20996:clueweb09-en0131-42-22886:Unknown
20996:clueweb09-en0132-77-26392:Unknown
20996:clueweb09-enwp01-17-03021:Unknown
Last 5 (after merge):
20958:clueweb09-en0112-59-01254:Unknown
20958:clueweb09-en0114-14-25526:Unknown
20958:clueweb09-en0116-09-14871:Unknown
20958:clueweb09-en0116-09-14873:Unknown
20958:clueweb09-en0121-92-02032:Unknown


In [25]:
test_topic_ids = { l.topic_id for l in test_data }
print("%d different topics in test data." % len(test_topic_ids))

30 different topics in test data.


In [26]:
print(len(test_topic_ids & useful_judgement_topic_ids))

30


In [27]:
print(len(test_topic_ids & expert_label_topic_ids))

30


In [28]:
print(len(test_topic_ids & worker_label_topic_ids))

0


# Ground truth stats (old)

## Also including non-ground-truth labels (-1)

In [28]:
print("[%d] total entries in test data." % len(test_data))
test_data_unique_docs = {l.document_id for l in test_data}
test_data_unique_topics = {l.topic_id for l in test_data}
print("[%d] Unique document IDs in test data." % len(test_data_unique_docs))
print("[%d] Unique topic IDs in test data." % len(test_data_unique_topics))

test_data_unique_points = {(l.topic_id, l.document_id) for l in test_data}
print("[%d] Unique judgements in test data." % len(test_data_unique_points))

[9380] total entries in test data.
[3195] Unique document IDs in test data.
[30] Unique topic IDs in test data.
[3200] Unique judgements in test data.


## Filtering out non-ground-truth labels

In [32]:
useful_test_dp = [l for l in test_data if l.label != -1]
print("[%d] Useful data points in test data (with labels != -1)." % len(useful_test_dp))
print("[%d] Topic IDs in useful data." % len({l.topic_id for l in useful_test_dp}))
print("[%d] Document IDs in useful data." % len({l.document_id for l in useful_test_dp}))
unique_useful_test_dp = {(l.topic_id, l.document_id) for l in useful_test_dp}
print("[%d] Unique useful data points." % len(unique_useful_test_dp))

[1015] Useful data points in test data (with labels != -1).
[30] Topic IDs in useful data.
[394] Document IDs in useful data.
[395] Unique useful data points.
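
The counts above differ because the same (topic, document) pair can appear in several entries, and the same document can appear under more than one topic. The sketch below shows one way to make those distinctions explicit with `collections.Counter`. The `Entry` namedtuple, the `summarize` helper, and the sample rows are hypothetical. The code assumes only the `topic_id`, `document_id`, and `label` attributes used in the cell above, with `-1` marking a missing ground-truth label.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the objects returned by read_expert_labels().
Entry = namedtuple("Entry", ["topic_id", "document_id", "label"])


def summarize(entries, missing_label=-1):
    """Count entries, unique pairs, and unique documents among labeled entries."""
    useful = [e for e in entries if e.label != missing_label]
    pair_counts = Counter((e.topic_id, e.document_id) for e in useful)
    return {
        "entries": len(useful),
        "unique_pairs": len(pair_counts),
        "unique_docs": len({doc_id for _, doc_id in pair_counts}),
        "repeated_pairs": sum(1 for n in pair_counts.values() if n > 1),
    }


# Made-up rows, purely for illustration.
sample = [
    Entry("t1", "d1", 1),
    Entry("t1", "d1", 1),   # same (topic, document) pair listed twice
    Entry("t2", "d1", 0),   # same document under a different topic
    Entry("t2", "d2", -1),  # no ground truth, so it is filtered out
]

summary = summarize(sample)
print(summary)
```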
