# BioCreative V Task 3b CrowdFlower Work Unit Formatter

Tong Shu Li<br>
Created on 2015-07-02<br>
Last updated 2015-07-30

From our preliminary experiments with making the crowd perform the chemical-induced disease relation extraction task at the abstract level (jobs #746297, #746647, #748223), we saw that the crowd performed very well on relations contained within a single sentence, and poorly on relations which spanned the whole abstract.

We will now divide the task up into two parts:
1. The simpler, sentence-level task will involve verifying one relationship from one sentence in which both concepts co-occur.
2. The harder, abstract-level task will involve verifying one relationship from the entire abstract when the two concepts never co-occur within any sentence.

---

The <code>classify_relations()</code> routine of the <code>Sentence</code> and <code>Paper</code> objects has already separated all possible chemical-disease relation pairs into three disjoint categories:

1. Relations which follow the "[chemical]-induced [disease]" (CID) structure.
2. Relations which co-occur within a sentence but do not follow the CID structure.
3. Relations which do not co-occur within any sentences.

This notebook takes the relation pairs in each category and generates the information needed for the CrowdFlower interface. No decision making about which category each relation belongs to is performed here.
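
The three categories can be pictured as a partition of the set of all candidate pairs. The sketch below is only an illustration with made-up identifiers. The real `classify_relations()` lives in `src.data_model` and is not shown in this notebook, so the function name `partition_relations` and all of its inputs are assumptions.

```python
# Hypothetical inputs: every identifier and value here is made up for illustration.
all_pairs = {("C1", "D1"), ("C1", "D2"), ("C2", "D1"), ("C2", "D2")}
cid_pairs = {("C1", "D1")}                        # assumed to match "[chemical]-induced [disease]"
cooccurring_pairs = {("C1", "D1"), ("C1", "D2")}  # assumed to share at least one sentence


def partition_relations(all_pairs, cid_pairs, cooccurring_pairs):
    """Split candidate pairs into three disjoint sets (hypothetical helper)."""
    # Category 2: co-occur in a sentence but are not CID-structured.
    sentence_bound = cooccurring_pairs - cid_pairs
    # Category 3: never co-occur within any sentence.
    not_sentence_bound = all_pairs - cooccurring_pairs - cid_pairs
    return cid_pairs, sentence_bound, not_sentence_bound


cid, sentence_bound, not_sentence_bound = partition_relations(
    all_pairs, cid_pairs, cooccurring_pairs
)
```

Under these assumptions every pair lands in exactly one category, which is what lets the later cells treat the categories independently.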

In [1]:
from collections import defaultdict
import os
import pandas as pd
import pickle
import random

In [2]:
# old random seed for original 30 abstract testset:
# random.seed("2015-06-11:14:46")

# new random seed
random.seed("2015-07-30:11:27")
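
Fixing the seed is what makes the sampled testset repeatable. The sketch below shows the idea with a private `random.Random` instance rather than the module-level generator used in the cell above. The helper `draw_sample` and the toy population are assumptions for illustration only.

```python
import random


def draw_sample(seed, population, k):
    """Return a sample that depends only on the seed (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, leaves the global state untouched
    return rng.sample(population, k)


toy_pmids = list(range(1000, 1020))  # made-up identifiers
first = draw_sample("2015-07-30:11:27", toy_pmids, 5)
second = draw_sample("2015-07-30:11:27", toy_pmids, 5)
```

Both calls return the same list because they start from the same seed. As far as I know, Python 2 and Python 3 hash string seeds differently, so a sample is likely to be reproducible only under the interpreter family that produced it.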

In [3]:
from src.data_model import parse_input
from src.make_sections import create_sections

In [4]:
def add_simple_tag(tag_name, tag_class, text):
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)
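
As a quick usage example, the helper (repeated here so the snippet runs on its own) wraps a piece of text in a single tag with a class attribute. The input values are made up.

```python
def add_simple_tag(tag_name, tag_class, text):
    # Same definition as in the cell above.
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)


tagged = add_simple_tag("span", "chemical", "aspirin")
# tagged == '<span class="chemical">aspirin</span>'
```

Note that the text is inserted verbatim, so any `<` or `&` characters in it are not escaped.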

---

### Read the gold standard training data:

In [5]:
if os.path.exists("data/training/parsed_training_set.pickle"):
    print("Reading cached version of training set")
    with open("data/training/parsed_training_set.pickle", "rb") as fin:
        training_data = pickle.load(fin)
else:
    training_data = parse_input("data/training", "CDR_TrainingSet.txt")
    with open("data/training/parsed_training_set.pickle", "wb") as fout:
        pickle.dump(training_data, fout)

Reading cached version of training set
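
The cell above follows a load-or-build caching pattern. A minimal generic sketch of that pattern is shown below, where `load_or_build` and `expensive_parse` are hypothetical names and a temporary directory stands in for `data/training`.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object if present, else build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fin:
            return pickle.load(fin)
    result = build()
    with open(cache_path, "wb") as fout:
        pickle.dump(result, fout)
    return result


calls = []


def expensive_parse():
    # Stand-in for a slow parser; records how often it actually runs.
    calls.append(1)
    return {"pmid": 11569530}


cache_path = os.path.join(tempfile.mkdtemp(), "parsed.pickle")
first = load_or_build(cache_path, expensive_parse)   # builds and writes the cache
second = load_or_build(cache_path, expensive_parse)  # reads the cache
```

A cache like this is never invalidated automatically, so the pickle has to be deleted by hand whenever the parser or the input file changes.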


In [6]:
len(training_data)

500

### Check for errors:

During parsing, the Paper objects already checked that each relation joins only two MeSH ids.

During loading, the Paper objects are also inspected to ensure that each annotation matches its text position.
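
The position check can be pictured as comparing each annotation's recorded text with the slice of the document it points to. The real check is performed inside `src.data_model` and is not shown here, so the sketch below is an assumption about its general shape. The `Annotation` tuple and `mismatched_annotations` are hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for the annotation objects; only start, stop and text are used.
Annotation = namedtuple("Annotation", ["start", "stop", "text"])


def mismatched_annotations(full_text, annotations):
    """Return the annotations whose offsets do not reproduce their text."""
    return [a for a in annotations if full_text[a.start:a.stop] != a.text]


full_text = "Aspirin can cause ulcers."
good = Annotation(0, 7, "Aspirin")
bad = Annotation(18, 23, "ulcers")  # stop is one too small, so the slice is "ulcer"

problems = mismatched_annotations(full_text, [good, bad])
```

An empty result means every annotation lines up with the text, which is the precondition the highlighting functions below rely on.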

---

### Choose a random sample of papers as the testset:

In [7]:
TESTSET_SIZE = 500
testset = random.sample(training_data, TESTSET_SIZE)

for paper in testset:
    print(paper.pmid)

11569530
11135224
1378968
10091617
11250767
18162529
2334618
220563
3780846
8919272
16904497
11198499
8437969
19346865
12921865
3856631
10683478
2515254
869641
11391224
8953972
10739826
8955532
21029050
20080419
6386793
16005948
8829025
2440413
18186898
10743446
12820454
8590259
18261172
2071257
10193204
3865016
11166519
7890216
18726058
3341566
1436384
12584269
11206082
18083142
12589964
7516729
19037603
20722491
15572383
12464714
11532387
15233872
603022
8649546
9855119
322550
1835291
11897407
9653867
19957053
17042910
8595686
20880751
11875660
12615818
17020434
9334596
8421099
7083920
6503301
2355241
14596845
18004067
6308277
12443032
19356053
11642480
9284778
9746003
9625142
11334364
11752354
15632880
9931093
227508
3560095
7292072
7352670
8638206
6538499
17111419
7881871
1592014
1009330
18503483
20882060
9351491
15737522
3371379
6133211
354896
14513889
11431197
7628595
11007689
3719553
2522601
15804801
3970039
3107448
15863244
3412544
18631865
18081909
18464113
1639466
1992636
107

---

### Highlighting functions:

In [8]:
def highlight_concepts(text, breaks):
    """
    Inserts HTML tags around the pieces of text
    which need to be highlighted in a string.
    """
    breaks = sorted(breaks, key = lambda x: x[0])
    
    final = []
    for i in range(len(breaks) - 1):
        s = text[breaks[i][0] : breaks[i+1][0]]
        if breaks[i][1] != "n":
            s = add_simple_tag("span", breaks[i][1], s)
            
        final.append(s)
        
    return "".join(final)

In [9]:
def highlight_text(text, offset, uniq_spans):
    """
    Given a string and the annotations which fall
    within this string, highlights the concepts.
    """
    # index of break, type of break (n = nothing)
    breaks = [(0, "n"), (len(text), "n")]
    
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
            
    return highlight_concepts(text, breaks)
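
To see how the two functions cooperate, the sketch below repeats them so it runs on its own and applies them to a made-up sentence. The `Span` tuple is a hypothetical stand-in for the annotation objects, carrying only the `start`, `stop` and `stype` attributes used above. The sentence is assumed to begin at character 40 of its abstract.

```python
from collections import namedtuple

Span = namedtuple("Span", ["start", "stop", "stype"])  # hypothetical stand-in


def add_simple_tag(tag_name, tag_class, text):
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)


def highlight_concepts(text, breaks):
    # Sort the break points, then tag every segment whose type is not "n".
    breaks = sorted(breaks, key=lambda x: x[0])
    final = []
    for i in range(len(breaks) - 1):
        s = text[breaks[i][0]:breaks[i + 1][0]]
        if breaks[i][1] != "n":
            s = add_simple_tag("span", breaks[i][1], s)
        final.append(s)
    return "".join(final)


def highlight_text(text, offset, uniq_spans):
    # Offsets are absolute, so subtract the position where this text starts.
    breaks = [(0, "n"), (len(text), "n")]
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
    return highlight_concepts(text, breaks)


sentence = "Aspirin can cause ulcers."
spans = [Span(40, 47, "chemical"), Span(58, 64, "disease")]
html = highlight_text(sentence, 40, spans)
# html == '<span class="chemical">Aspirin</span> can cause <span class="disease">ulcers</span>.'
```

Each span contributes an opening break carrying its type and a closing break of type "n", so the text between consecutive breaks is either tagged or passed through unchanged. This scheme assumes the spans do not overlap.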

---

### Processors for each of the subtasks:

In [10]:
def grab_names(annotations):
    """
    Determines the unique names of the annotations.
    """
    # determine the names of the concept identifiers
    used_names = defaultdict(set) # lower case set of used names (to avoid repeats)
    real_name = defaultdict(set) # set of unique names verbatim (to preserve capitalization)
    for annotation in annotations:
        if annotation.text.lower() not in used_names[annotation.stype]:
            used_names[annotation.stype].add(annotation.text.lower())
            real_name[annotation.stype].add(annotation.text)
            
    return real_name
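
The effect of the case-insensitive de-duplication is easiest to see on a small example. The function is repeated so the snippet is self-contained. The `Annotation` tuple and the sample values are made up for illustration.

```python
from collections import defaultdict, namedtuple

Annotation = namedtuple("Annotation", ["stype", "text"])  # hypothetical stand-in


def grab_names(annotations):
    used_names = defaultdict(set)  # lower-cased names already seen
    real_name = defaultdict(set)   # first-seen spelling of each name
    for annotation in annotations:
        if annotation.text.lower() not in used_names[annotation.stype]:
            used_names[annotation.stype].add(annotation.text.lower())
            real_name[annotation.stype].add(annotation.text)
    return real_name


names = grab_names([
    Annotation("chemical", "Cisplatin"),
    Annotation("chemical", "cisplatin"),  # dropped: same name, different case
    Annotation("chemical", "CDDP"),
    Annotation("disease", "neutropenia"),
])

# Sets have no guaranteed order, so sort before joining for a stable label.
chemical_label = "/".join(sorted(names["chemical"]))
```

Because the names are stored in sets, joining them directly can give a different order from run to run. Sorting first, as in the last line, is one possible way to make the displayed label deterministic.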

In [11]:
def process_sentence_task(sentence, chemical_id, disease_id):
    """
    Given a Sentence object, and the chemical-disease relation
    identifier pair, creates one sentence-level verification task.
    """
    spans = [annotation for annotation in sentence.annotations if annotation.uid in [chemical_id, disease_id]]
    real_name = grab_names(spans)

    data = dict()
    data["pmid"] = [sentence.pmid]
    data["form_sentence"] = [highlight_text(sentence.text, sentence.start, spans)]
    data["chemical_id"] = [chemical_id]
    data["disease_id"] = [disease_id]
    data["relation_pair_id"] = ["{0}_{1}_{2}".format(sentence.pmid, chemical_id, disease_id)]
    data["chemical_name"] = [add_simple_tag("span", "chemical", "/".join(real_name["chemical"]))]
    data["disease_name"] = [add_simple_tag("span", "disease", "/".join(real_name["disease"]))]
    
    return pd.DataFrame(data)
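
The `relation_pair_id` column ties every sentence-level unit back to the relation it verifies. The sketch below builds such an identifier in the same format and splits it again. Both helper names are hypothetical, and the split assumes that neither the PMID nor the MeSH identifiers contain an underscore.

```python
def make_relation_pair_id(pmid, chemical_id, disease_id):
    # Same format string as in process_sentence_task.
    return "{0}_{1}_{2}".format(pmid, chemical_id, disease_id)


def split_relation_pair_id(relation_pair_id):
    # Assumption: the three parts never contain "_" themselves.
    pmid, chemical_id, disease_id = relation_pair_id.split("_")
    return int(pmid), chemical_id, disease_id


pair_id = make_relation_pair_id(11569530, "D016593", "D016171")
parts = split_relation_pair_id(pair_id)
```

Since one relation can appear in several sentences, several work units can share the same `relation_pair_id`, which makes it a natural key for grouping their judgments later.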

In [12]:
def process_abstract_task(paper, chemical_id, disease_id):
    """
    Makes one abstract level chemical-disease relationship
    verification task.
    """
    spans = [annotation for annotation in paper.annotations if annotation.uid in [chemical_id, disease_id]]
    real_name = grab_names(spans)
            
    form_title = highlight_text(paper.title, 0,
                                filter(lambda x: x.stop <= len(paper.title), spans))
    
    form_abstract = highlight_text(paper.abstract, len(paper.title) + 1,
                                   filter(lambda x: x.start > len(paper.title), spans))
            
    form_abstract = create_sections(form_abstract)
            
    data = dict()
    data["pmid"] = [paper.pmid]
    
    data["form_title"] = [form_title]
    data["form_abstract"] = [form_abstract]
    
    data["chemical_id"] = [chemical_id]
    data["disease_id"] = [disease_id]
    data["chemical_name"] = [add_simple_tag("span", "chemical", "/".join(real_name["chemical"]))]
    data["disease_name"] = [add_simple_tag("span", "disease", "/".join(real_name["disease"]))]
            
    return pd.DataFrame(data)
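
The title and abstract are highlighted separately, so the spans have to be divided and the abstract offsets shifted. The sketch below walks through that arithmetic on made-up text. It assumes, as the `len(paper.title) + 1` offset above suggests, that annotation offsets count through the title, then one separator character, then the abstract. The `Span` tuple is a hypothetical stand-in.

```python
from collections import namedtuple

Span = namedtuple("Span", ["start", "stop", "stype"])  # hypothetical stand-in

title = "Aspirin toxicity"
abstract = "Aspirin caused ulcers."

# Offsets are absolute positions in: title + one separator character + abstract.
spans = [
    Span(0, 7, "chemical"),    # "Aspirin" in the title
    Span(17, 24, "chemical"),  # "Aspirin" in the abstract
    Span(32, 38, "disease"),   # "ulcers" in the abstract
]

title_spans = [s for s in spans if s.stop <= len(title)]
abstract_spans = [s for s in spans if s.start > len(title)]
abstract_offset = len(title) + 1

title_hits = [title[s.start:s.stop] for s in title_spans]
abstract_hits = [
    abstract[s.start - abstract_offset:s.stop - abstract_offset]
    for s in abstract_spans
]
```

The list comprehensions play the role of the `filter` calls above and behave the same way under Python 2 and Python 3.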

---

In [13]:
def create_work_units(dataset):
    """
    Given a list of Paper objects representing the abstracts
    we wish to find the CID relations in, this function
    creates the work units for the CrowdFlower tasks.
    
    CID relations are assumed to always be true, so no crowd
    worker ever sees them.
    
    Each sentence-bound non-CID relation can create one or
    multiple sentence-level work units, depending on how
    many sentences in that abstract contain the relationship.
    
    Each non-sentence-bound relation creates one abstract-level
    work unit.
    
    Relation type classification is already done by the Paper
    objects.
    """
    cid_relations = dict()
    easy_units = []
    hard_units = []
    for paper in dataset:
        cid_relations[paper.pmid] = paper.poss_relations["CID"]
        
        # create the sentence-level tasks:
        for sentence in paper.sentences:
            work = sentence.poss_relations[False] - paper.poss_relations["CID"]
            for rel_pair in work:
                easy_units.append(process_sentence_task(sentence, rel_pair[0], rel_pair[1]))
                
        # create the abstract-level tasks:
        for rel_pair in paper.poss_relations["not_sentence_bound"]:
            hard_units.append(process_abstract_task(paper, rel_pair[0], rel_pair[1]))
            
    # return two dataframes
    easy_units = pd.concat(easy_units).reset_index(drop = True)
    hard_units = pd.concat(hard_units).reset_index(drop = True)
    
    easy_units["uniq_id"] = pd.Series(["bcv_easy_{0}".format(i) for i in range(len(easy_units))])
    hard_units["uniq_id"] = pd.Series(["bcv_hard_{0}".format(i) for i in range(len(hard_units))])
    
    return (cid_relations, easy_units, hard_units)
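
The final assembly step stacks many one-row data frames and then assigns a running identifier. The sketch below shows that step in isolation on made-up rows. The column values are assumptions and only the pandas calls mirror the function above.

```python
import pandas as pd

# Three hypothetical one-row work units, as the process_* functions would return.
rows = [
    pd.DataFrame({"pmid": [111], "chemical_id": ["C1"], "disease_id": ["D1"]}),
    pd.DataFrame({"pmid": [111], "chemical_id": ["C2"], "disease_id": ["D1"]}),
    pd.DataFrame({"pmid": [222], "chemical_id": ["C1"], "disease_id": ["D2"]}),
]

stacked = pd.concat(rows)                # every row keeps its original index 0
units = stacked.reset_index(drop=True)   # renumber rows 0, 1, 2
units["uniq_id"] = ["bcv_easy_{0}".format(i) for i in range(len(units))]
```

Resetting the index matters because each one-row frame starts at index 0, so the stacked frame would otherwise carry duplicate index labels. Note also that `pd.concat` raises a `ValueError` when given an empty list, so a dataset that produces no units of one kind would need a guard before this step.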

---

### Generate the work units and print to file:

In [14]:
cid_relations, easy_units, hard_units = create_work_units(testset)

In [15]:
cid_relations

{26094: set(),
 84204: set(),
 88336: set(),
 137340: {('D009020', 'D006948')},
 150790: set(),
 220563: set(),
 227508: set(),
 230316: set(),
 234669: set(),
 322550: {('D009599', 'D007022')},
 347884: set(),
 354896: {('D008012', 'D006323')},
 424937: {('D008750', 'D056486')},
 435349: set(),
 567256: set(),
 603022: set(),
 809711: {('D005996', 'D007022')},
 869641: set(),
 891050: set(),
 983936: {('D004837', 'D009202')},
 1009330: set(),
 1085609: set(),
 1130930: set(),
 1147734: set(),
 1378968: {('D008094', 'D007674'), ('D008094', 'D007676')},
 1420741: set(),
 1428568: set(),
 1436384: {('D000661', 'D020258')},
 1468485: {('D003520', 'D006470|D003556')},
 1527456: set(),
 1549199: {('D007980', 'D011618')},
 1592014: {('C005618', 'D012640'), ('D003042', 'D012640')},
 1601297: set(),
 1616457: set(),
 1639466: set(),
 1664218: set(),
 1720453: set(),
 1728522: {('D019980', 'D056486')},
 1732369: set(),
 1749407: {('D003042', 'D009203')},
 1786266: set(),
 1833784: {('D009538', 

---

### Write the CID relations to a file for later aggregation:

In [16]:
with open("data/cid_relations.pickle", "wb") as fout:
    pickle.dump(cid_relations, fout)

---

In [17]:
easy_units.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,form_sentence,pmid,relation_pair_id,uniq_id
0,D016593,"<span class=""chemical"">terfenadine</span>",D016171,"<span class=""disease"">TDP</span>","<span class=""disease"">TDP</span> is a side-eff...",11569530,11569530_D016593_D016171,bcv_easy_0
1,C010637,"<span class=""chemical"">terodiline</span>",D016171,"<span class=""disease"">TDP</span>","<span class=""disease"">TDP</span> is a side-eff...",11569530,11569530_C010637_D016171,bcv_easy_1
2,D020117,"<span class=""chemical"">cisapride</span>",D016171,"<span class=""disease"">TDP</span>",Four compounds known to increase QT interval a...,11569530,11569530_D020117_D016171,bcv_easy_2
3,C063968,"<span class=""chemical"">E4031</span>",D016171,"<span class=""disease"">TDP</span>",Four compounds known to increase QT interval a...,11569530,11569530_C063968_D016171,bcv_easy_3
4,D016593,"<span class=""chemical"">terfenadine</span>",D016171,"<span class=""disease"">TDP</span>",Four compounds known to increase QT interval a...,11569530,11569530_D016593_D016171,bcv_easy_4


In [18]:
hard_units.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,form_abstract,form_title,pmid,uniq_id
0,C010637,"<span class=""chemical"">terodiline</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_0
1,D016593,"<span class=""chemical"">terfenadine</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_1
2,C063968,"<span class=""chemical"">E4031</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_2
3,D020117,"<span class=""chemical"">cisapride</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_3
4,D002945,"<span class=""chemical"">cisplatin</span>",D009503,"<span class=""disease"">neutropenia</span>","<p>BACKGROUND: <span class=""chemical"">Cisplati...","Paclitaxel, <span class=""chemical"">cisplatin</...",11135224,bcv_hard_4


---

In [19]:
easy_units.shape

(3188, 8)

In [20]:
hard_units.shape

(3080, 8)

### Write work units to file

In [21]:
easy_units.to_csv("data/crowdflower/data_for_easy_job_.tsv", sep = '\t', index = False)

In [22]:
hard_units.to_csv("data/crowdflower/data_for_hard_job_.tsv", sep = '\t', index = False)