# Project Data Overview

This notebook gives an overview of the data available for this project: the label sets, their sizes, and how their topics overlap.

The judgements used by this project are the NIST expert judgements ('stage1-dev') and the consensus labels ('stage2-dev') from [the 2011 TREC Crowdsourcing track](https://sites.google.com/site/treccrowd/2011).

The actual document data (the T11Crowd subset of the [ClueWeb09 dataset](http://lemurproject.org/clueweb09/)) is, sadly, not publicly available, but it can be obtained by signing a non-commercial use agreement with the provider.

Some of the data management code has been cannibalized from [Martin Davtyan's previous work on the subject](https://github.com/martinthenext/ir-crowd-thesis) (while at ETH Zurich).

In [37]:
import io
import os

In [38]:
# This loads the necessary data wrangling classes and functions.
# It's done this way since notebooks aren't (and shouldn't be) part
# of an actual Python project.
%run ../data.py

In [39]:
%run ../config.py

In [40]:
expert_labels = read_expert_labels(EXPERT_GROUND_TRUTH_FILE)
print("%d NIST expert labels" % len(expert_labels))

2033 NIST expert labels


In [41]:
worker_labels = read_worker_labels(WORKER_LABEL_FILE)
print("%d Mechanical Turk worker labels" % len(worker_labels))

10770 Mechanical Turk worker labels


In [42]:
expert_label_topic_ids = { l.topic_id for l in expert_labels }
print("%d topics in NIST expert label data" % len(expert_label_topic_ids))

244 topics in NIST expert label data


In [43]:
worker_label_topic_ids = { l.topic_id for l in worker_labels }
print("%d topics in development worker label data" % len(worker_label_topic_ids))

25 topics in development worker label data


Note the mismatch: the worker labels cover only 25 topics, while the expert labels cover 244. This looks odd, but the stage1-dev and stage2-dev READMEs confirm both counts.

In [44]:
common_expert_worker_topic_ids = expert_label_topic_ids & worker_label_topic_ids
str(len(common_expert_worker_topic_ids)) + ' topics in common (NIST expert labels and development worker labels)'

'24 topics in common (NIST expert labels and development worker labels)'
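
With 25 topics in the worker labels and 24 in common, one worker-label topic has no NIST expert labels. A minimal sketch of how a set difference would surface such topics, using made-up topic IDs in place of the real `expert_label_topic_ids` and `worker_label_topic_ids`:

```python
# Hypothetical topic IDs, for illustration only; not the real project data.
expert_topic_ids = {20002, 20004, 20006, 20008}
worker_topic_ids = {20004, 20006, 20010}

# Topics labelled by both groups.
common_topic_ids = expert_topic_ids & worker_topic_ids

# Topics with worker labels but no expert labels.
worker_only_topic_ids = worker_topic_ids - expert_topic_ids

print("%d topics in common" % len(common_topic_ids))
print("Worker-only topics: %s" % sorted(worker_only_topic_ids))
```

On the real data, `worker_label_topic_ids - expert_label_topic_ids` should contain exactly one topic ID, given the counts above.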

### 2011 Judgement Data

In [45]:
judgement_labels_2011 = read_judgement_labels(JUDGEMENT_FILE)
str(len(judgement_labels_2011)) + ' judgement labels'

'64042 judgement labels'

In [46]:
judgement_topic_ids = { l.topic_id for l in judgement_labels_2011 }
len(judgement_topic_ids)

46

In [47]:
# Topic overlap of the unfiltered judgement data with the expert and worker label sets.
print(len(judgement_topic_ids & expert_label_topic_ids))
print(len(judgement_topic_ids & worker_label_topic_ids))

17
1


In [48]:
# Clear out labels deemed irrelevant (e.g. ones used for worker assessment).

useful_judgement_labels_2011 = read_useful_judgement_labels(JUDGEMENT_FILE)

In [49]:
useful_judgement_topic_ids = { l.topic_id for l in useful_judgement_labels_2011 }
print("%d different topics in 2011 judgement data" % len(useful_judgement_topic_ids))
print("%d topics in common between 2011 judgement data and original NIST expert label data." %
      len(useful_judgement_topic_ids & expert_label_topic_ids))
print("%d topics in common between 2011 judgement data and original (dev) worker label data." % 
      len(useful_judgement_topic_ids & worker_label_topic_ids))

30 different topics in 2011 judgement data
2 topics in common between 2011 judgement data and original NIST expert label data.
0 topics in common between 2011 judgement data and original (dev) worker label data.


### 2011 Test Data

In [29]:
test_data_shared = read_expert_labels(TEST_LABEL_FILE_SHARED, header=True, sep=',')
test_data_team = read_expert_labels(TEST_LABEL_FILE_TEAMS, header=True, sep=',')

print(len(test_data_shared))
print("First 5:\n" + "\n".join([str(d) for d in test_data_shared[:5]]))
print("Last 5:\n" + "\n".join([str(d) for d in test_data_shared[-5:]]))

test_data = test_data_shared + test_data_team
print("Last 5 (after merge):")
print("\n".join([str(d) for d in test_data[-5:]]))

1655
First 5:
20542:clueweb09-en0003-47-17392:Relevant
20542:clueweb09-en0002-74-25816:Relevant
20542:clueweb09-en0000-00-00000:Not relevant
20542:clueweb09-enwp00-69-12844:Relevant
20542:clueweb09-en0002-93-19628:Relevant
Last 5:
20996:clueweb09-en0129-94-14964:Not relevant
20996:clueweb09-en0129-94-14966:Not relevant
20996:clueweb09-en0131-42-22886:Not relevant
20996:clueweb09-en0132-77-26392:Not relevant
20996:clueweb09-enwp01-17-03021:Not relevant
Last 5 (after merge):
20958:clueweb09-en0112-59-01254:Not relevant
20958:clueweb09-en0114-14-25526:Not relevant
20958:clueweb09-en0116-09-14871:Not relevant
20958:clueweb09-en0116-09-14873:Not relevant
20958:clueweb09-en0121-92-02032:Not relevant


In [30]:
test_topic_ids = { l.topic_id for l in test_data }
print("%d different topics in test data." % len(test_topic_ids))

30 different topics in test data.


In [31]:
# Test topics that also appear in the filtered 2011 judgement data.
print(len(test_topic_ids & useful_judgement_topic_ids))

30


In [32]:
# Test topics that also appear in the NIST expert label data.
print(len(test_topic_ids & expert_label_topic_ids))

2


In [33]:
# Test topics that also appear in the (dev) worker label data.
print(len(test_topic_ids & worker_label_topic_ids))

0


## Summary
 * Full topic overlap (30/30) between the filtered 2011 judgement data and the test data.
 * About 7% (2/30) topic overlap between the NIST expert label data and the test data.
 * 0% (0/30) topic overlap between the original (dev) worker consensus label data and the test data.
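
The overlap figures in this summary are each the share of test topics that also occur in another label set. A minimal sketch of that calculation follows; `topic_overlap` is a hypothetical helper that is not part of `data.py`, and the topic IDs are made up to mirror the 2/30 case:

```python
def topic_overlap(reference_ids, other_ids):
    """Return (shared, total, fraction) for the reference topics found in other_ids."""
    reference_ids = set(reference_ids)
    shared = len(reference_ids & set(other_ids))
    total = len(reference_ids)
    # float() forces true division under Python 2 as well as Python 3.
    fraction = shared / float(total) if total else 0.0
    return shared, total, fraction


# Hypothetical topic IDs chosen to mirror the 2/30 case; not the real data.
example_test_ids = set(range(30))
example_expert_ids = {0, 1, 100, 101}

shared, total, fraction = topic_overlap(example_test_ids, example_expert_ids)
print("%.1f%% (%d/%d) topic overlap" % (100 * fraction, shared, total))
```

In the notebook itself, the same helper could be called as `topic_overlap(test_topic_ids, expert_label_topic_ids)`.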