# BioCreative V Task 3b CrowdFlower Work Unit Formatter

Tong Shu Li<br>
Created on 2015-07-02<br>
Last updated 2015-07-30

From our preliminary experiments with making the crowd perform the chemical-induced disease relation extraction task at the abstract level (jobs #746297, #746647, #748223), we saw that the crowd performed very well on relations contained within a single sentence and poorly on relations which spanned the whole abstract.

We will now divide the task up into two parts:
1. The simpler, sentence-level task will involve verifying one relationship from one sentence in which both concepts co-occur.
2. The harder, abstract-level task will involve verifying one relationship from the entire abstract when the two concepts never co-occur within any sentence.
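
The split between the two parts can be sketched as follows. This is only an illustration of the co-occurrence rule, not the project's `classify_relations()` code; the function name `split_pairs` and the toy input format (one set of concept identifiers per sentence) are assumptions made for this example.

```python
from itertools import product


def split_pairs(chemicals, diseases, sentences):
    """Partition all chemical-disease pairs by sentence co-occurrence.

    `sentences` is a list of sets, each holding the concept identifiers
    mentioned in one sentence (a hypothetical input format).
    """
    sentence_level = set()
    abstract_level = set()
    for chemical_id, disease_id in product(sorted(chemicals), sorted(diseases)):
        # a pair is sentence-level if at least one sentence mentions both concepts
        if any(chemical_id in s and disease_id in s for s in sentences):
            sentence_level.add((chemical_id, disease_id))
        else:
            abstract_level.add((chemical_id, disease_id))
    return sentence_level, abstract_level


# toy abstract with three sentences
sentences = [{"C1", "D1"}, {"C2"}, {"D2"}]
easy, hard = split_pairs({"C1", "C2"}, {"D1", "D2"}, sentences)
```

Here only the pair `("C1", "D1")` shares a sentence, so it would become a sentence-level task; the other three pairs would go to the abstract-level task.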

---

The <code>classify_relations()</code> routines of the <code>Sentence</code> and <code>Paper</code> objects have already separated all possible chemical-disease relation pairs into three disjoint categories:

1. Relations which follow the "[chemical]-induced [disease]" (CID) structure.
2. Relations which co-occur within a sentence but do not follow the CID structure.
3. Relations which do not co-occur within any sentence.

This notebook takes the relation pairs in each category and generates the information needed for the CrowdFlower interface. No decision about which category each relation belongs to is made here.

---

This work unit formatter also supports the batch upload of test questions. To batch upload test questions, the following are needed:

1. A "_golden" column that contains "TRUE" for every test question to be uploaded.
2. A "_gold" and a "_gold_reason" column for each CML question that needs an answer, named by appending these suffixes to the CML question's name.
3. The CML tag attribute "gold=true" on each CML question.
4. The regular data columns for the test questions.
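
As a rough sketch of what such an upload file could look like, the snippet below writes one regular work unit and one test question to a tab-separated string. The CML question name `verify_relationship`, the answer value `yes_direct`, the identifier `test_0`, and the reason text are all made up for illustration; the real job's names and values may differ.

```python
import csv
import io

# Hypothetical CML question name; the real job's question names may differ.
QUESTION = "verify_relationship"

rows = [
    # a regular work unit: the test-question columns stay empty
    {"uniq_id": "bcv_hard_0", "_golden": "",
     QUESTION + "_gold": "", QUESTION + "_gold_reason": ""},
    # a test question: flagged as golden, with an answer and an explanation
    {"uniq_id": "test_0", "_golden": "TRUE",
     QUESTION + "_gold": "yes_direct",
     QUESTION + "_gold_reason": "The abstract states the link directly."},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), delimiter="\t")
writer.writeheader()
writer.writerows(rows)
tsv_text = buffer.getvalue()
```

Regular rows and test questions share one header, which is why appending test questions to the work units adds columns to the whole table.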

---

In [1]:
from collections import defaultdict
import os
import pandas as pd
import pickle
import random

In [2]:
# old random seed for original 30 abstract testset:
# random.seed("2015-06-11:14:46")

# new random seed
random.seed("2015-08-01:14:07")

In [3]:
from src.data_model import parse_input
from src.make_sections import create_sections

In [4]:
def add_simple_tag(tag_name, tag_class, text):
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)
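
A quick usage sketch of this helper is shown below. The function is repeated so the snippet runs on its own. Note that the helper inserts the text verbatim; `add_escaped_tag` is a hypothetical variant (not part of the notebook) that escapes HTML special characters first, which may or may not be wanted depending on how the CrowdFlower template renders the field.

```python
from html import escape  # Python 3 standard library


def add_simple_tag(tag_name, tag_class, text):
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)


def add_escaped_tag(tag_name, tag_class, text):
    # hypothetical variant: escape &, <, > and quotes before wrapping
    return add_simple_tag(tag_name, tag_class, escape(text))


tagged = add_simple_tag("span", "chemical", "propofol")
# tagged == '<span class="chemical">propofol</span>'

escaped = add_escaped_tag("span", "disease", "a < b")
# escaped == '<span class="disease">a &lt; b</span>'
```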

---

### Read the gold standard training data:

In [5]:
if os.path.exists("data/training/parsed_training_set.pickle"):
    print("Reading cached version of training set")
    with open("data/training/parsed_training_set.pickle", "rb") as fin:
        training_data = pickle.load(fin)
else:
    training_data = parse_input("data/training", "CDR_TrainingSet.txt")
    with open("data/training/parsed_training_set.pickle", "wb") as fout:
        pickle.dump(training_data, fout)

Reading cached version of training set
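
The load-or-parse pattern used in this cell can be factored into a small helper. The sketch below is one possible generalisation; `load_or_build` and `expensive_parse` are hypothetical names, and the toy dictionary stands in for the parsed papers.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if the file exists; otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fin:
            return pickle.load(fin)
    result = build()
    with open(cache_path, "wb") as fout:
        pickle.dump(result, fout)
    return result


calls = []


def expensive_parse():
    # stand-in for a slow parser; records how often it actually runs
    calls.append(1)
    return {"toy": "data"}


cache_path = os.path.join(tempfile.mkdtemp(), "parsed.pickle")
first = load_or_build(cache_path, expensive_parse)   # builds and writes the cache
second = load_or_build(cache_path, expensive_parse)  # reads the cache instead
```

One caveat of this pattern is that the cache is never invalidated: if the parser or the input file changes, the pickle has to be deleted by hand.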


In [6]:
len(training_data)

500

In [7]:
development_data = parse_input("data/development", "CDR_DevelopmentSet.txt")

### Check for errors:

During parsing, the Paper objects already check that each relation joins exactly two MeSH IDs.

During loading, the Paper objects also check that each annotation matches the text at its stated position.

---

### Choose a random sample of papers as the testset:

In [8]:
TESTSET_SIZE = 200
testset = random.sample(development_data, TESTSET_SIZE)

for paper in testset:
    print(paper.pmid)

15579441
18006530
3732088
11208990
19721134
20552622
625456
17366349
7282516
16428827
16820346
3703509
15673851
16418614
10411803
8766220
458486
2051906
11847945
18997632
10565806
11230490
12013711
2021202
11195262
2980315
11705128
7072798
8958188
11009181
18657397
8012887
921394
9226773
19234905
17786501
871943
2339463
2004015
20727411
3115150
12093990
3769769
9545159
3670965
20164825
663266
4090988
3962737
12739036
11337188
9672273
16574713
2886572
10840460
19893084
6150641
12691807
982002
16710500
7007443
3131282
6323692
85485
12448656
3123611
10225068
326460
4038130
12907924
2320800
15974569
188339
2435991
8911359
2893236
8251368
761833
6127992
3183120
3973521
12231232
7752389
6540303
3950060
18589141
2840807
8267029
17943461
1969772
1928887
14975762
3961813
20735774
7843916
18441470
11302406
14982270
7619765
3686155
1899352
11185967
384871
19274460
2722224
6454943
430165
12734532
10524660
11063349
19944736
11694026
12851669
19674115
3084782
9564988
2790457
18483878
18356633
160063
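
The reason for fixing the seed is that the same sample can be drawn again later. The sketch below illustrates this with a stand-in population of integers instead of Paper objects. The draw also depends on the order of the population and on the Python version, so it should not be expected to reproduce the PMIDs listed above.

```python
import random

population = list(range(1000))  # stand-in for the list of Paper objects

random.seed("2015-08-01:14:07")
first_draw = random.sample(population, 5)

# reseeding with the same string restarts the generator from the same state
random.seed("2015-08-01:14:07")
second_draw = random.sample(population, 5)
```

Within one interpreter version, `first_draw` and `second_draw` are identical, and `random.sample` never picks the same element twice.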

---

### Highlighting functions:

In [9]:
def highlight_concepts(text, breaks):
    """
    Inserts HTML tags around the pieces of text
    which need to be highlighted in a string.
    """
    breaks = sorted(breaks, key = lambda x: x[0])
    
    final = []
    for i in range(len(breaks) - 1):
        s = text[breaks[i][0] : breaks[i+1][0]]
        if breaks[i][1] != "n":
            s = add_simple_tag("span", breaks[i][1], s)
            
        final.append(s)
        
    return "".join(final)

In [10]:
def highlight_text(text, offset, uniq_spans):
    """
    Given a string and the annotations which fall
    within this string, highlights the concepts.
    """
    # index of break, type of break (n = nothing)
    breaks = [(0, "n"), (len(text), "n")]
    
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
            
    return highlight_concepts(text, breaks)
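
To make the break-list idea concrete, here is a condensed, self-contained sketch of the same approach applied to a toy sentence. `Span` is a stand-in for the project's annotation objects (only `start`, `stop`, and `stype` are modelled), and the sentence and offsets are invented for the example.

```python
from collections import namedtuple

# Stand-in for the project's annotation objects.
Span = namedtuple("Span", ["start", "stop", "stype"])


def add_simple_tag(tag_name, tag_class, text):
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)


def highlight_text(text, offset, uniq_spans):
    # each break is (position in text, class starting there); "n" = no highlight
    breaks = [(0, "n"), (len(text), "n")]
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
    breaks.sort(key=lambda x: x[0])  # stable sort by position only

    pieces = []
    for (begin, kind), (end, _) in zip(breaks, breaks[1:]):
        piece = text[begin:end]
        pieces.append(piece if kind == "n" else add_simple_tag("span", kind, piece))
    return "".join(pieces)


sentence = "Aspirin can cause nausea."
# The toy sentence is assumed to start at character 100 of its abstract,
# so the spans carry abstract-level positions and `offset` shifts them back.
spans = [Span(100, 107, "chemical"), Span(118, 124, "disease")]
result = highlight_text(sentence, 100, spans)
# result == '<span class="chemical">Aspirin</span> can cause '
#           '<span class="disease">nausea</span>.'
```

This sketch assumes that the spans do not overlap; overlapping annotations would need extra handling that is not shown here.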

---

### Processors for each of the subtasks:

In [11]:
def grab_names(annotations):
    """
    Determines the unique names of the annotations.
    """
    # determine the names of the concept identifiers
    used_names = defaultdict(set) # lower case set of used names (to avoid repeats)
    real_name = defaultdict(set) # set of unique names verbatim (to preserve capitalization)
    for annotation in annotations:
        if annotation.text.lower() not in used_names[annotation.stype]:
            used_names[annotation.stype].add(annotation.text.lower())
            real_name[annotation.stype].add(annotation.text)
            
    return real_name
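
Because the names are stored in sets, the order in which they are later joined with "/" is not guaranteed. The sketch below shows a variant that keeps first-seen order instead. It is not the notebook's function: `grab_names_ordered` and the `Annotation` stand-in type are assumptions made for this example.

```python
from collections import defaultdict, namedtuple

Annotation = namedtuple("Annotation", ["text", "stype"])  # stand-in type


def grab_names_ordered(annotations):
    """Like grab_names(), but keeps names in first-seen order."""
    seen = defaultdict(set)    # lower-cased names already recorded, per type
    names = defaultdict(list)  # verbatim names in first-seen order, per type
    for annotation in annotations:
        key = annotation.text.lower()
        if key not in seen[annotation.stype]:
            seen[annotation.stype].add(key)
            names[annotation.stype].append(annotation.text)
    return names


mentions = [
    Annotation("Pain", "disease"),
    Annotation("pain", "disease"),      # same name in lower case: skipped
    Annotation("propofol", "chemical"),
]
names = grab_names_ordered(mentions)
display = "/".join(names["disease"])
```

Only the first capitalization of each name is kept, so `display` is `"Pain"` here.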

In [12]:
def process_sentence_task(sentence, rel_pairs):
    """
    Given a Sentence object, and the set of chemical-disease relation
    identifier pairs, creates a set of sentence-level verification tasks.
    """
    data = defaultdict(list)
    for chemical_id, disease_id in rel_pairs:
        spans = [annotation for annotation in sentence.annotations if annotation.uid in [chemical_id, disease_id]]
        real_name = grab_names(spans)

        data["pmid"].append(sentence.pmid)
        
        data["form_sentence"].append(highlight_text(sentence.text, sentence.start, spans))
        
        data["chemical_id"].append(chemical_id)
        data["disease_id"].append(disease_id)
        
        data["chemical_name"].append(add_simple_tag("span", "chemical", "/".join(real_name["chemical"])))
        data["disease_name"].append(add_simple_tag("span", "disease", "/".join(real_name["disease"])))
        
        data["relation_pair_id"].append("{0}_{1}_{2}".format(sentence.pmid, chemical_id, disease_id))
        
    return pd.DataFrame(data)

In [13]:
def process_abstract_task(paper, rel_pairs):
    """
    Makes a set of abstract-level tasks for one paper.
    """
    data = defaultdict(list)
    for chemical_id, disease_id in rel_pairs:
        spans = [annotation for annotation in paper.annotations if annotation.uid in [chemical_id, disease_id]]
        real_name = grab_names(spans)

        form_title = highlight_text(paper.title, 0,
                                    filter(lambda x: x.stop <= len(paper.title), spans))

        form_abstract = highlight_text(paper.abstract, len(paper.title) + 1,
                                       filter(lambda x: x.start > len(paper.title), spans))

        form_abstract = create_sections(form_abstract)

        data["pmid"].append(paper.pmid)

        data["form_title"].append(form_title)
        data["form_abstract"].append(form_abstract)

        data["chemical_id"].append(chemical_id)
        data["disease_id"].append(disease_id)
        data["chemical_name"].append(add_simple_tag("span", "chemical", "/".join(real_name["chemical"])))
        data["disease_name"].append(add_simple_tag("span", "disease", "/".join(real_name["disease"])))
            
    return pd.DataFrame(data)

---

In [14]:
def create_work_units(dataset):
    """
    Given a list of Paper objects representing the abstracts
    we wish to find the CID relations in, this function
    creates the work units for the CrowdFlower tasks.
    
    CID relations are assumed to be true and are never shown
    to crowd workers.
    
    Each sentence-bound non-CID relation can create one or
    multiple sentence-level work units, depending on how
    many sentences in that abstract contain the relationship.
    
    Each non-sentence-bound relation creates one abstract-level
    work unit.
    
    Relation type classification is already done by the Paper
    objects.
    """
    cid_relations = dict()
    easy_units = []
    hard_units = []
    for paper in dataset:
        cid_relations[paper.pmid] = paper.poss_relations["CID"]
        
        # create the sentence-level tasks:
        for sentence in paper.sentences:
            work = sentence.poss_relations[False] - paper.poss_relations["CID"]
            easy_units.append(process_sentence_task(sentence, work))
                
        # create the abstract-level tasks:
        hard_units.append(process_abstract_task(paper, paper.poss_relations["not_sentence_bound"]))
            
    # return two dataframes
    easy_units = pd.concat(easy_units).reset_index(drop = True)
    hard_units = pd.concat(hard_units).reset_index(drop = True)
    
    easy_units["uniq_id"] = pd.Series(["bcv_easy_{0}".format(i) for i in range(len(easy_units))])
    hard_units["uniq_id"] = pd.Series(["bcv_hard_{0}".format(i) for i in range(len(hard_units))])
    
    return (cid_relations, easy_units, hard_units)
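
The set difference in the sentence-level loop is the step that keeps CID relations away from the crowd. The toy sketch below shows its effect; the identifiers are invented, and plain sets stand in for `sentence.poss_relations[False]` and `paper.poss_relations["CID"]`.

```python
# Toy identifiers; the real pairs are MeSH ID tuples taken from the Paper objects.
paper_cid = {("C1", "D1")}
sentence_pairs = [
    {("C1", "D1"), ("C1", "D2")},  # sentence 0
    {("C1", "D2")},                # sentence 1
    set(),                         # sentence 2 contains no candidate pair
]

units = []
for index, pairs in enumerate(sentence_pairs):
    # drop every pair that is already accepted as CID somewhere in the paper
    for pair in sorted(pairs - paper_cid):
        units.append((index, pair))
```

The pair `("C1", "D2")` appears in two sentences and therefore yields two sentence-level work units, while `("C1", "D1")` yields none.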

---

### Generate the work units and print to file:

In [15]:
cid_relations, easy_units, hard_units = create_work_units(testset)

In [16]:
cid_relations

{48362: set(),
 85485: set(),
 188339: set(),
 326460: {('D001539', 'D007008')},
 384871: set(),
 430165: set(),
 458486: {('D007980', 'D004409')},
 625456: set(),
 663266: set(),
 761833: set(),
 871943: set(),
 921394: {('D007649', 'D013610')},
 982002: set(),
 1255900: set(),
 1355091: set(),
 1504402: {('D007741', 'D056486')},
 1687392: {('D013469', 'D002375')},
 1899352: set(),
 1928887: {('D002216', 'D007022')},
 1969772: {('D009599', 'D007022'), ('D018818', 'D007022')},
 2004015: {('D015215', 'D000740')},
 2021202: {('D007530', 'D007022')},
 2051906: {('D001379', 'D002779')},
 2320800: set(),
 2339463: set(),
 2435991: {('D004837', 'D001145')},
 2564649: {('D005283', 'D009127')},
 2576810: {('D000661', 'D006948')},
 2716967: {('D009020', 'D000699')},
 2722224: {('D006854', 'D006973')},
 2790457: {('D003042', 'D012640')},
 2840807: set(),
 2886572: {('D004837', 'D006973')},
 2893236: set(),
 2980315: set(),
 3084782: set(),
 3108839: set(),
 3115150: {('D015760', 'D009127')},
 31

---

### Write the CID relations to a file for later aggregation:

In [17]:
with open("data/cid_relations.pickle", "wb") as fout:
    pickle.dump(cid_relations, fout)
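
The aggregation step itself is not part of this notebook. As a rough sketch of how the pickled dictionary might be read back and flattened into one row per relation, consider the following; the two entries are copied from the output above, while the temporary path and the row layout are assumptions.

```python
import os
import pickle
import tempfile

# two entries copied from the cid_relations output above
cid_relations = {326460: {("D001539", "D007008")}, 85485: set()}

path = os.path.join(tempfile.mkdtemp(), "cid_relations.pickle")
with open(path, "wb") as fout:
    pickle.dump(cid_relations, fout)

with open(path, "rb") as fin:
    loaded = pickle.load(fin)

# one (pmid, chemical_id, disease_id) row per CID relation
rows = sorted(
    (pmid, chemical_id, disease_id)
    for pmid, pairs in loaded.items()
    for chemical_id, disease_id in pairs
)
```

Papers without any CID relation, such as 85485 here, simply contribute no rows.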

---

In [18]:
easy_units.head()

| | chemical_id | chemical_name | disease_id | disease_name | form_sentence | pmid | relation_pair_id | uniq_id |
|---|---|---|---|---|---|---|---|---|
| 0 | D011692 | `<span class="chemical">puromycin aminonucleosi...` | D000860 | `<span class="disease">hypoxia</span>` | With this model, we were able to identify diff... | 15579441 | 15579441_D011692_D000860 | bcv_easy_0 |
| 1 | D011692 | `<span class="chemical">puromycin aminonucleosi...` | D000860 | `<span class="disease">hypoxia</span>` | `Expression of the <span class="disease">hypoxi...` | 15579441 | 15579441_D011692_D000860 | bcv_easy_1 |
| 2 | D015742 | `<span class="chemical">propofol</span>` | D010146 | `<span class="disease">pain</span>` | `Reduction of <span class="disease">pain</span>...` | 18006530 | 18006530_D015742_D010146 | bcv_easy_2 |
| 3 | C071741 | `<span class="chemical">remifentanil</span>` | D010146 | `<span class="disease">pain</span>` | `Reduction of <span class="disease">pain</span>...` | 18006530 | 18006530_C071741_D010146 | bcv_easy_3 |
| 4 | D015742 | `<span class="chemical">propofol</span>` | D010146 | `<span class="disease">Pain</span>` | `BACKGROUND: <span class="disease">Pain</span> ...` | 18006530 | 18006530_D015742_D010146 | bcv_easy_4 |


In [19]:
hard_units.head()

| | chemical_id | chemical_name | disease_id | disease_name | form_abstract | form_title | pmid | uniq_id |
|---|---|---|---|---|---|---|---|---|
| 0 | D011692 | `<span class="chemical">puromycin aminonucleosi...` | D011507 | `<span class="disease">proteinuria</span>` | Despite the increasing need to identify and qu... | `Hypoxia in renal disease with <span class="dis...` | 15579441 | bcv_hard_0 |
| 1 | D011692 | `<span class="chemical">puromycin aminonucleosi...` | D007674 | `<span class="disease">diseased kidney/glomerul...` | Despite the increasing need to identify and qu... | `Hypoxia in <span class="disease">renal disease...` | 15579441 | bcv_hard_1 |
| 2 | D011692 | `<span class="chemical">puromycin aminonucleosi...` | D006973 | `<span class="disease">hypertension</span>` | Despite the increasing need to identify and qu... | Hypoxia in renal disease with proteinuria and/... | 15579441 | bcv_hard_2 |
| 3 | D002752 | `<span class="chemical">chlorthalidone</span>` | D003327 | `<span class="disease">coronary disease</span>` | It has been proposed that modest changes in pl... | Diuretics, potassium and arrhythmias in hypert... | 3732088 | bcv_hard_3 |
| 4 | D002752 | `<span class="chemical">chlorthalidone</span>` | D017202 | `<span class="disease">ischaemic heart disease<...` | It has been proposed that modest changes in pl... | Diuretics, potassium and arrhythmias in hypert... | 3732088 | bcv_hard_4 |


---

In [20]:
easy_units.shape

(1279, 8)

In [21]:
hard_units.shape

(1063, 8)

### Add the test questions:

In [22]:
abs_test_ques = pd.read_csv("data/crowdflower/test_questions/final_abstract_test_questions.tsv", sep = '\t')

In [23]:
# pd.concat is used because DataFrame.append was removed in pandas 2.0
hard_units = pd.concat([hard_units, abs_test_ques])

In [24]:
hard_units.shape

(1115, 13)

### Write work units to file:

In [25]:
easy_units.to_csv("data/crowdflower/data_for_easy_job_.tsv", sep = '\t', index = False)

In [26]:
hard_units.to_csv("data/crowdflower/data_for_hard_job_.tsv", sep = '\t', index = False)