# BioCreative V Task 3b workflow decision maker

Tong Shu Li<br>
Created on 2015-07-02<br>
Last updated 2015-07-29

In our preliminary experiments, the crowd performed the chemical-induced disease relation extraction task at the abstract level (jobs #746297, #746647, #748223). The crowd performed very well on relations contained within a single sentence, and poorly on relations that spanned the whole abstract.

We will now divide the task into two parts:
1. The simpler task involves judging one relation from one sentence in which both concepts co-occur.
2. The harder task involves judging one relation from the entire abstract when the two concepts never co-occur in the same sentence.
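
The split described above can be sketched as follows. This is a minimal illustration, not the logic of the `Paper` class in `src.data_model`: the function name `split_pairs`, the input layout (one set of concept ids per sentence), and the ids `C1`, `D1`, `D2` are all assumptions made for the example.

```python
from itertools import product


def split_pairs(chemicals, diseases, sentence_ids):
    """Partition chemical-disease pairs by sentence co-occurrence.

    chemicals, diseases: sets of concept ids found in the abstract.
    sentence_ids: list of sets, the concept ids found in each sentence.
    (Hypothetical layout, for illustration only.)
    """
    sentence_bound = set()
    not_sentence_bound = set()
    for chemical, disease in product(sorted(chemicals), sorted(diseases)):
        # A pair is sentence-bound if at least one sentence mentions both ids.
        if any(chemical in ids and disease in ids for ids in sentence_ids):
            sentence_bound.add((chemical, disease))
        else:
            not_sentence_bound.add((chemical, disease))
    return sentence_bound, not_sentence_bound


# Placeholder ids: C1 and D1 share a sentence, C1 and D2 never do.
easy, hard = split_pairs({"C1"}, {"D1", "D2"}, [{"C1", "D1"}, {"D2"}])
```

Here `easy` holds the pairs that would go to the simpler sentence-level task and `hard` the pairs that would go to the abstract-level task.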

In [1]:
from collections import defaultdict
import os
import pandas as pd
import pickle
import random
import sys

In [2]:
random.seed("2015-06-11:14:46")

In [3]:
sys.path.append("/home/toby/Code/util/")
from web_util import add_simple_tag

In [4]:
from src.data_model import parse_input
from src.data_model import Relation
from src.make_sections import create_sections

---

### Read the gold standard training data:

In [5]:
if os.path.exists("data/training/parsed_training_set.pickle"):
    print("Reading cached version of training set")
    with open("data/training/parsed_training_set.pickle", "rb") as fin:
        training_data = pickle.load(fin)
else:
    training_data = parse_input("data/training", "CDR_TrainingSet.txt")
    with open("data/training/parsed_training_set.pickle", "wb") as fout:
        pickle.dump(training_data, fout)

Reading cached version of training set


In [6]:
len(training_data)

500

### Check for errors:

During parsing, the Paper objects already check that each relation joins only two MeSH ids.

During loading, the Paper objects also check that each annotation matches its text positions.
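
A position check of this kind could look roughly like the sketch below. It is not the code in `src.data_model`. The `Span` tuple is a hypothetical stand-in for the annotation objects, reusing the `start`, `stop`, and `text` field names that appear later in this notebook, and the example sentence is made up.

```python
from collections import namedtuple

# Hypothetical stand-in for the annotation objects used in this notebook.
Span = namedtuple("Span", ["start", "stop", "text"])


def find_offset_errors(full_text, spans):
    """Return the spans whose text does not equal full_text[start:stop]."""
    return [s for s in spans if full_text[s.start:s.stop] != s.text]


doc = "Lithium can cause tremor."
spans = [
    Span(0, 7, "Lithium"),
    Span(18, 24, "tremor"),
    Span(8, 11, "cat"),  # deliberately wrong: the text at 8..11 is "can"
]
bad = find_offset_errors(doc, spans)
```

Only the deliberately wrong span ends up in `bad`.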

---

### Choose a random sample of papers as the test set:

In [7]:
TESTSET_SIZE = 30
testset = random.sample(training_data, TESTSET_SIZE)

for paper in testset:
    print(paper.pmid)

11569530
11135224
1378968
19269743
8595686
16337777
10520387
17931375
3800626
9522143
17241784
2265898
6666578
15632880
8590259
12198388
2515254
1835291
12041669
7582165
2096243
7449470
2375138
6692345
17261653
18631865
1130930
10835440
15602202
16167916
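
The sample is repeatable because the generator was seeded earlier. The sketch below shows the same idea with a private `random.Random` instance, which is an assumption of this example rather than what the notebook does (the notebook seeds the global generator). The helper name `draw_sample` and the placeholder ids are hypothetical, and the exact items drawn for a given seed may differ between Python versions.

```python
import random


def draw_sample(items, k, seed):
    """Draw k items reproducibly without touching the global generator."""
    rng = random.Random(seed)  # private generator seeded with the given value
    return rng.sample(items, k)


placeholder_ids = list(range(500))  # stands in for the 500 parsed papers
first = draw_sample(placeholder_ids, 30, "2015-06-11:14:46")
second = draw_sample(placeholder_ids, 30, "2015-06-11:14:46")
```

Both calls return the same 30 distinct items, so the test set can be regenerated later from the seed alone.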


---

### Highlighting functions:

In [8]:
def highlight_concepts(text, breaks):
    """
    Inserts HTML tags around the pieces of text
    which need to be highlighted in a string.
    """
    breaks = sorted(breaks, key = lambda x: x[0])
    
    final = []
    for i in range(len(breaks) - 1):
        s = text[breaks[i][0] : breaks[i+1][0]]
        if breaks[i][1] != "n":
            s = add_simple_tag("span", breaks[i][1], s)
            
        final.append(s)
        
    return "".join(final)

In [9]:
def highlight_text(text, offset, uniq_spans):
    """
    Given a string and the annotations which fall
    within this string, highlights the concepts.
    """
    # index of break, type of break (n = nothing)
    breaks = [(0, "n"), (len(text), "n")]
    
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
            
    return highlight_concepts(text, breaks)
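
To show the kind of output these functions produce, here is a simplified, cursor-based variant of the same idea. It is not the break-list implementation above. The `add_simple_tag` stand-in is an assumption inferred from the `<span class="...">` markup in this notebook's output (the real helper lives in `web_util`), and spans are given as plain `(start, stop, stype)` tuples that must not overlap.

```python
def add_simple_tag(tag, css_class, text):
    # Assumed behavior of web_util.add_simple_tag, inferred from the
    # <span class="..."> markup seen in this notebook's output.
    return '<{0} class="{1}">{2}</{0}>'.format(tag, css_class, text)


def highlight(text, spans):
    """Wrap each (start, stop, stype) span in a tag; spans must not overlap."""
    pieces = []
    cursor = 0
    for start, stop, stype in sorted(spans):
        pieces.append(text[cursor:start])  # plain text before the span
        pieces.append(add_simple_tag("span", stype, text[start:stop]))
        cursor = stop
    pieces.append(text[cursor:])  # plain text after the last span
    return "".join(pieces)


result = highlight(
    "Lithium can cause tremor.",
    [(18, 24, "disease"), (0, 7, "chemical")],
)
```

With these inputs, `result` is `<span class="chemical">Lithium</span> can cause <span class="disease">tremor</span>.`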

---

### Processors for each of the subtasks:

In [10]:
def grab_names(annotations):
    """
    Determines the unique names of the annotations.
    """
    # determine the names of the concept identifiers
    used_names = defaultdict(set) # lower case set of used names (to avoid repeats)
    real_name = defaultdict(set) # set of unique names verbatim (to preserve capitalization)
    for annotation in annotations:
        if annotation.text.lower() not in used_names[annotation.stype]:
            used_names[annotation.stype].add(annotation.text.lower())
            real_name[annotation.stype].add(annotation.text)
            
    return real_name
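
Because `real_name` holds sets, the order of names in a later `"/".join(...)` is not guaranteed. If stable output matters, one possible variant keeps the names in first-seen order. This is an assumption for illustration, not the notebook's method: `grab_names_ordered` and the `Ann` tuple are hypothetical names.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the annotation objects (only the fields used here).
Ann = namedtuple("Ann", ["stype", "text"])


def grab_names_ordered(annotations):
    """Collect case-insensitively unique names per type, in first-seen order."""
    seen = defaultdict(set)    # lower-cased names already recorded, per type
    names = defaultdict(list)  # verbatim names, in order of first appearance
    for ann in annotations:
        key = ann.text.lower()
        if key not in seen[ann.stype]:
            seen[ann.stype].add(key)
            names[ann.stype].append(ann.text)
    return names


anns = [
    Ann("disease", "TDP"),
    Ann("disease", "tdp"),  # same name in another case, skipped
    Ann("disease", "Torsades de pointes"),
    Ann("chemical", "cisapride"),
]
names = grab_names_ordered(anns)
label = "/".join(names["disease"])
```

Here `label` is always `TDP/Torsades de pointes`, regardless of set iteration order.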

In [11]:
def process_sentence_task(sentence, chemical_id, disease_id):
    """
    Given a Sentence object, and the chemical-disease relation
    identifier pair, creates one sentence-level verification task.
    """
    spans = [annotation for annotation in sentence.annotations if annotation.uid in [chemical_id, disease_id]]
    real_name = grab_names(spans)

    data = dict()
    data["pmid"] = [sentence.pmid]
    data["form_sentence"] = [highlight_text(sentence.text, sentence.start, spans)]
    data["chemical_id"] = [chemical_id]
    data["disease_id"] = [disease_id]
    data["chemical_name"] = [add_simple_tag("span", "chemical", "/".join(real_name["chemical"]))]
    data["disease_name"] = [add_simple_tag("span", "disease", "/".join(real_name["disease"]))]
    
    return pd.DataFrame(data)

In [12]:
def process_abstract_task(paper, chemical_id, disease_id):
    """
    Makes one abstract level chemical-disease relationship
    verification task.
    """
    spans = [annotation for annotation in paper.annotations if annotation.uid in [chemical_id, disease_id]]
    real_name = grab_names(spans)
            
    form_title = highlight_text(paper.title, 0,
                                filter(lambda x: x.stop <= len(paper.title), spans))
    
    form_abstract = highlight_text(paper.abstract, len(paper.title) + 1,
                                   filter(lambda x: x.start > len(paper.title), spans))
            
    form_abstract = create_sections(form_abstract)
            
    data = dict()
    data["pmid"] = [paper.pmid]
    
    data["form_title"] = [form_title]
    data["form_abstract"] = [form_abstract]
    
    data["chemical_id"] = [chemical_id]
    data["disease_id"] = [disease_id]
    data["chemical_name"] = [add_simple_tag("span", "chemical", "/".join(real_name["chemical"]))]
    data["disease_name"] = [add_simple_tag("span", "disease", "/".join(real_name["disease"]))]
            
    return pd.DataFrame(data)

---

In [13]:
def create_work_units(dataset):
    """
    Given a list of Paper objects representing the abstracts
    we wish to find the CID relations in, this function
    creates the work units for the CrowdFlower tasks.
    
    CID relations are assumed to be always true, and no crowd
    worker ever sees those relations.
    
    Each sentence-bound non-CID relation can create one or
    multiple sentence-level work units, depending on how
    many sentences in that abstract contain the relationship.
    
    Each non-sentence bound relation creates one abstract-level
    work unit.
    
    Relation type classification is already done by the Paper
    objects.
    """
    cid_relations = dict()
    easy_units = []
    hard_units = []
    for paper in dataset:
        cid_relations[paper.pmid] = paper.poss_relations["CID"]
        
        # create the sentence-level tasks:
        for sentence in paper.sentences:
            work = sentence.poss_relations[False] - paper.poss_relations["CID"]
            for rel_pair in work:
                easy_units.append(process_sentence_task(sentence, rel_pair[0], rel_pair[1]))
                
        # create the abstract-level tasks:
        for rel_pair in paper.poss_relations["not_sentence_bound"]:
            hard_units.append(process_abstract_task(paper, rel_pair[0], rel_pair[1]))
            
    # return two dataframes
    easy_units = pd.concat(easy_units).reset_index(drop = True)
    hard_units = pd.concat(hard_units).reset_index(drop = True)
    
    easy_units["uniq_id"] = pd.Series(["bcv_easy_{0}".format(i) for i in range(len(easy_units))])
    hard_units["uniq_id"] = pd.Series(["bcv_hard_{0}".format(i) for i in range(len(hard_units))])
    
    return (cid_relations, easy_units, hard_units)
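
One edge case worth noting: `pd.concat` raises a `ValueError` when it receives an empty list, which would happen if a dataset produced no sentence-level or no abstract-level units. A guarded helper is sketched below. The name `concat_units`, the column list, and the ids are assumptions for this example and are not part of the notebook.

```python
import pandas as pd


def concat_units(frames, columns):
    """Concatenate work-unit frames, or return an empty frame with the
    expected columns when no units were generated."""
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames).reset_index(drop=True)


columns = ["pmid", "chemical_id", "disease_id"]
no_units = concat_units([], columns)
some_units = concat_units(
    [
        pd.DataFrame({"pmid": [1], "chemical_id": ["C1"], "disease_id": ["D1"]}),
        pd.DataFrame({"pmid": [2], "chemical_id": ["C2"], "disease_id": ["D2"]}),
    ],
    columns,
)
```

The empty case then yields a frame with zero rows instead of an exception, so the later `uniq_id` assignment and `to_csv` calls could still run.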

---

### Generate the work units and print to file:

In [14]:
cid_relations, easy_units, hard_units = create_work_units(testset)

In [15]:
cid_relations

{1130930: set(),
 1378968: {('D008094', 'D007674'), ('D008094', 'D007676')},
 1835291: set(),
 2096243: {('C017367', 'D019965'), ('C017367', 'D056784')},
 2265898: set(),
 2375138: set(),
 2515254: set(),
 3800626: {('D010423', 'D005355|D009135'), ('D010423', 'D009135')},
 6666578: {('D010396', 'D001018')},
 6692345: {('D005200', 'D001749')},
 7449470: set(),
 7582165: set(),
 8590259: set(),
 8595686: set(),
 9522143: set(),
 10520387: set(),
 10835440: {('D009553', 'D007022')},
 11135224: set(),
 11569530: set(),
 12041669: {('D010396', 'D000741')},
 12198388: set(),
 15602202: set(),
 15632880: {('D013148', 'D006947'), ('D013148', 'D051437')},
 16167916: set(),
 16337777: set(),
 17241784: {('D019821', 'D009135')},
 17261653: {('D003000', 'D001919')},
 17931375: set(),
 18631865: set(),
 19269743: {('D002211', 'D010146')}}

---

### Write the CID relations to a file for later aggregation:

In [16]:
with open("data/cid_relations.pickle", "wb") as fout:
    pickle.dump(cid_relations, fout)
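
A later aggregation step would need to read this file back. The round trip can be sketched as below, using two entries copied from the output above and a temporary directory in place of `data/`. How the aggregation itself works is not shown in this notebook, so the sketch stops at loading.

```python
import os
import pickle
import tempfile

# Two entries copied from the cid_relations output above.
cid_relations = {6666578: {("D010396", "D001018")}, 1130930: set()}

# Temporary location so the sketch does not touch the real data directory.
path = os.path.join(tempfile.mkdtemp(), "cid_relations.pickle")

with open(path, "wb") as fout:
    pickle.dump(cid_relations, fout)

with open(path, "rb") as fin:
    restored = pickle.load(fin)
```

The restored dictionary compares equal to the original, with the sets of id tuples intact.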

---

In [17]:
easy_units.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,form_sentence,pmid,uniq_id
0,D016593,"<span class=""chemical"">terfenadine</span>",D016171,"<span class=""disease"">TDP</span>","<span class=""disease"">TDP</span> is a side-eff...",11569530,bcv_easy_0
1,C010637,"<span class=""chemical"">terodiline</span>",D016171,"<span class=""disease"">TDP</span>","<span class=""disease"">TDP</span> is a side-eff...",11569530,bcv_easy_1
2,D020117,"<span class=""chemical"">cisapride</span>",D016171,"<span class=""disease"">TDP</span>",Four compounds known to increase QT interval a...,11569530,bcv_easy_2
3,C063968,"<span class=""chemical"">E4031</span>",D016171,"<span class=""disease"">TDP</span>",Four compounds known to increase QT interval a...,11569530,bcv_easy_3
4,D016593,"<span class=""chemical"">terfenadine</span>",D016171,"<span class=""disease"">TDP</span>",Four compounds known to increase QT interval a...,11569530,bcv_easy_4


In [18]:
hard_units.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,form_abstract,form_title,pmid,uniq_id
0,C010637,"<span class=""chemical"">terodiline</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_0
1,D016593,"<span class=""chemical"">terfenadine</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_1
2,C063968,"<span class=""chemical"">E4031</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_2
3,D020117,"<span class=""chemical"">cisapride</span>",D017180,"<span class=""disease"">ventricular tachycardia<...",1. Torsades de pointes (TDP) is a potentially ...,Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_hard_3
4,D002945,"<span class=""chemical"">cisplatin</span>",D009503,"<span class=""disease"">neutropenia</span>","<p>BACKGROUND: <span class=""chemical"">Cisplati...","Paclitaxel, <span class=""chemical"">cisplatin</...",11135224,bcv_hard_4


---

In [19]:
easy_units.shape

(142, 7)

In [20]:
hard_units.shape

(89, 8)

### Write work units to file

In [21]:
easy_units.to_csv("data/crowdflower/data_for_easy_job_.tsv", sep = '\t', index = False)

In [22]:
hard_units.to_csv("data/crowdflower/data_for_hard_job_.tsv", sep = '\t', index = False)
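
Since several columns contain HTML with double quotes, it may be worth confirming that the tab-separated output reads back unchanged. The sketch below does this for a single row taken from the `easy_units` output above, writing to an in-memory buffer rather than to the files under `data/crowdflower/`.

```python
import io

import pandas as pd

# One row modeled on the easy_units output shown earlier.
units = pd.DataFrame({
    "pmid": [11569530],
    "chemical_name": ['<span class="chemical">terfenadine</span>'],
    "uniq_id": ["bcv_easy_0"],
})

buffer = io.StringIO()
units.to_csv(buffer, sep="\t", index=False)  # same options as the cells above
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
```

The quoted markup survives the round trip because pandas escapes and restores the embedded double quotes.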