#BioCreative V Task 3b workflow decision maker

Tong Shu Li<br>
Created on 2015-07-02<br>
Last updated 2015-07-06

In our preliminary experiments, the crowd performed the chemical-induced disease relation extraction task at the abstract level (jobs #746297, #746647, #748223). Workers performed very well on relations contained within a single sentence and poorly on relations that spanned the whole abstract.

We will now divide the task into two parts:
1. The simpler task involves judging one relationship from one sentence in which both concepts co-occur.
2. The harder task involves judging one relationship from the entire abstract when the two concepts never co-occur in the same sentence.
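
The split above can be sketched as a single co-occurrence check. This is only an illustration: it represents each sentence as a plain set of concept identifiers, which is an assumption made for brevity and not the `Sentence` class used later in this notebook. The function name and the toy identifiers are hypothetical.

```python
def classify_pair(sentence_concepts, drug_id, disease_id):
    """
    Decide which task a drug-disease pair belongs to.

    sentence_concepts: list of sets of concept ids, one set per
    sentence (a hypothetical stand-in for annotated sentences).
    Returns ("easy", indices of sentences containing both concepts)
    or ("hard", []) when the concepts never share a sentence.
    """
    shared = [i for i, ids in enumerate(sentence_concepts)
              if drug_id in ids and disease_id in ids]
    return ("easy", shared) if shared else ("hard", [])

# Toy abstract with three sentences (identifiers made up for illustration)
toy = [{"C1", "D1"}, {"C1"}, {"D2"}]
easy_result = classify_pair(toy, "C1", "D1")   # both appear in sentence 0
hard_result = classify_pair(toy, "C1", "D2")   # never in the same sentence
```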

In [1]:
from collections import defaultdict
import os
import pandas as pd
import pickle
import random
import sys

In [2]:
random.seed("2015-06-11:14:46")

In [3]:
sys.path.append("/home/toby/Code/util/")
from web_util import add_simple_tag

In [4]:
from src.data_model import parse_input
from src.data_model import Relation

---

###Read the gold standard training data:

In [5]:
if os.path.exists("data/training/parsed_training_set.pickle"):
    print("Reading cached version of training set")
    with open("data/training/parsed_training_set.pickle", "rb") as fin:
        training_data = pickle.load(fin)
else:
    training_data = parse_input("data/training", "CDR_TrainingSet.txt")
    with open("data/training/parsed_training_set.pickle", "wb") as fout:
        pickle.dump(training_data, fout)

In [6]:
len(training_data)

500

###Check for errors:

During parsing, the Paper objects already check that each relation joins exactly two MeSH IDs.

During loading, the Paper objects also check that each annotation's positions match the corresponding text.
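
As a rough illustration of what a position check of this kind can look like, the sketch below compares each annotation's stated text against the slice of the full text at its offsets. It does not reproduce the actual `Paper` implementation: `ToyAnnotation`, `find_mismatches`, and the example text are all hypothetical.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation: half-open [start, stop) offsets
ToyAnnotation = namedtuple("ToyAnnotation", ["start", "stop", "text"])

def find_mismatches(full_text, annotations):
    """Return the annotations whose offsets do not slice out their stated text."""
    return [a for a in annotations if full_text[a.start:a.stop] != a.text]

full_text = "Lithium induced tremor. Tremor resolved."
good = ToyAnnotation(0, 7, "Lithium")
bad = ToyAnnotation(16, 22, "tremors")   # deliberately wrong text
mismatches = find_mismatches(full_text, [good, bad])
```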

---

### Choose a random sample of papers as the testset:

In [7]:
TESTSET_SIZE = 30
testset = random.sample(training_data, TESTSET_SIZE)

for paper in testset:
    print(paper.pmid)

11569530
11135224
1378968
19269743
8595686
16337777
10520387
17931375
3800626
9522143
17241784
2265898
6666578
15632880
8590259
12198388
2515254
1835291
12041669
7582165
2096243
7449470
2375138
6692345
17261653
18631865
1130930
10835440
15602202
16167916


---

###Generate the work units:

In [8]:
def is_CID_relation(sentence, drug, disease):
    """
    Given a Sentence object, and two Annotation
    objects representing the drug and disease,
    decides whether the pair follows the 'chemical-induced disease'
    relationship structure: the drug precedes the disease, at most
    15 characters separate them, and the text between them
    contains "induce".
    """
    return (drug.stop < disease.start and
            disease.start - drug.stop <= 15 and
            "induce" in sentence.text[drug.stop - sentence.start :
                                      disease.start - sentence.start].lower())

def check_CID_structure(sentence, drug_id, disease_id):
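    """
    Applies is_CID_relation to the first mention of the drug
    and the last mention of the disease in the sentence.
    Both concepts must be annotated in the sentence.
    """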
    for annotation in sentence.annotations:
        if annotation.uid == drug_id:
            drug_obj = annotation
            break
            
    for annotation in reversed(sentence.annotations):
        if annotation.uid == disease_id:
            disease_obj = annotation
            break
            
    return is_CID_relation(sentence, drug_obj, disease_obj)
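
To make the heuristic concrete, here is a self-contained restatement that uses plain strings and character offsets in place of the `Sentence` and `Annotation` objects. The function name `looks_like_cid`, the `max_gap` parameter, and the example sentences are illustrative assumptions; the rule itself (drug before disease, gap of at most 15 characters, gap contains "induce") is the one coded above.

```python
def looks_like_cid(text, drug_stop, disease_start, max_gap=15):
    """
    True when the drug ends before the disease starts, the gap
    between them is at most max_gap characters, and the gap
    contains "induce" (case-insensitive).
    """
    gap = text[drug_stop:disease_start]
    return (drug_stop < disease_start
            and disease_start - drug_stop <= max_gap
            and "induce" in gap.lower())

near = "Lithium-induced tremor was observed."
# "Lithium" occupies [0, 7); the gap up to "tremor" is "-induced "
match = looks_like_cid(near, 7, near.index("tremor"))

far = "Lithium was given and later induced tremor."
# "induced" is present, but the gap is longer than 15 characters
no_match = looks_like_cid(far, 7, far.index("tremor"))
```

The second sentence shows the main limitation of the pattern: a true relation is missed as soon as the wording puts more than a few words between the two mentions, so such pairs fall through to the crowd.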

###Highlighting functions:

In [9]:
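def highlight_concepts(text, breaks):
    """
    Wraps each tagged region of text in a span element.
    breaks is a list of (index, type) pairs; the text from one
    break to the next takes the type of the earlier break,
    and type "n" means no highlighting.
    """
    breaks = sorted(breaks, key = lambda x: x[0])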
    
    final = []
    for i in range(len(breaks) - 1):
        s = text[breaks[i][0] : breaks[i+1][0]]
        if breaks[i][1] != "n":
            s = add_simple_tag("span", breaks[i][1], s)
            
        final.append(s)
        
    return "".join(final)

In [10]:
def highlight_sentence(text, offset, uniq_spans):
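    """
    Highlights the given spans within one sentence.
    offset is the sentence's start position in the full text,
    used to convert span positions to sentence-local indices.
    """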
    # index of break, type of break (n = nothing)
    breaks = [(0, "n"), (len(text), "n")]
    
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
            
    return highlight_concepts(text, breaks)
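
The break-list approach can be tried in isolation with the sketch below. It is a condensed restatement, not the notebook's code: `Span` is a hypothetical stand-in for the annotation objects, and `add_simple_tag` is reimplemented under the assumption that it emits `<span class="...">...</span>`, which matches the markup seen in this notebook's output but has not been checked against `web_util`.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation with document-level offsets
Span = namedtuple("Span", ["start", "stop", "stype"])

def add_simple_tag(tag, css_class, text):
    # Assumed behaviour of web_util.add_simple_tag (not verified)
    return '<{0} class="{1}">{2}</{0}>'.format(tag, css_class, text)

def highlight(text, offset, spans):
    # Each break is (index into text, type); "n" means no highlighting
    breaks = [(0, "n"), (len(text), "n")]
    for span in spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
    breaks.sort(key=lambda b: b[0])   # stable sort keeps tie order

    pieces = []
    for (pos, kind), (next_pos, _) in zip(breaks, breaks[1:]):
        piece = text[pos:next_pos]
        pieces.append(add_simple_tag("span", kind, piece) if kind != "n" else piece)
    return "".join(pieces)

sentence = "Lithium-induced tremor was observed."
# Pretend the sentence starts at position 100 of the full text
spans = [Span(100, 107, "chemical"), Span(116, 122, "disease")]
highlighted = highlight(sentence, 100, spans)
```

Here `highlighted` is `<span class="chemical">Lithium</span>-induced <span class="disease">tremor</span> was observed.`. Overlapping spans are not handled by this scheme, since each break simply ends whatever region came before it.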

In [11]:
def highlight_title(title, uniq_spans):
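    """
    Highlights the spans that fall within the title.
    The title starts at position 0 of the full text,
    so span positions are used without an offset.
    """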
    # index of break, type of break (n = nothing)
    breaks = [(0, "n"), (len(title), "n")]
    
    for span in uniq_spans:
        if span.start < len(title):
            breaks.append((span.start, span.stype))
            breaks.append((span.stop, "n"))
            
    return highlight_concepts(title, breaks)

In [12]:
def highlight_abstract(title_length, abstract, uniq_spans):
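    """
    Highlights the spans that fall within the abstract.
    """
    # span positions are relative to the full text (title, one
    # separator character, then abstract), so we subtract
    # title_length + 1 to get abstract-local indices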
    breaks = [(0, "n"), (len(abstract), "n")]
    
    for span in uniq_spans:
        if span.start > title_length:
            breaks.append((span.start - title_length - 1, span.stype))
            breaks.append((span.stop - title_length - 1, "n"))
            
    return highlight_concepts(abstract, breaks)
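
The offset arithmetic can be checked by hand with a small example. The sketch assumes that the full text is the title, a single separator character, and then the abstract, which is what the `title_length + 1` correction implies; the title and abstract strings are made up.

```python
title = "Lithium toxicity"
abstract = "Lithium-induced tremor was observed."
full_text = title + " " + abstract   # assumed layout: one separator character

# Offsets of "tremor" in full-text coordinates
start = full_text.index("tremor")
stop = start + len("tremor")

# Shift into abstract-local coordinates
local_start = start - len(title) - 1
local_stop = stop - len(title) - 1
recovered = abstract[local_start:local_stop]
```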

---

In [13]:
def process_sentence_task(sentence, drug_id, disease_id):
    """
    Makes one sentence level chemical-disease relationship
    verification task.
    """
    spans = []
    name = defaultdict(set)
    for annotation in sentence.annotations:
        if annotation.uid in [drug_id, disease_id]:
            spans.append(annotation)
            name[annotation.stype].add(annotation.text)

    data = dict()
    data["pmid"] = [sentence.pmid]
    data["form_sentence"] = [highlight_sentence(sentence.text, sentence.start, spans)]
    data["drug_id"] = [drug_id]
    data["disease_id"] = [disease_id]
    data["drug_name"] = [add_simple_tag("span", "chemical", "/".join(name["chemical"]))]
    data["disease_name"] = [add_simple_tag("span", "disease", "/".join(name["disease"]))]
    
    return pd.DataFrame(data)

In [14]:
def process_abstract_task(paper, drug_id, disease_id):
    """
    Makes one abstract level chemical-disease relationship
    verification task.
    """
    spans = []
    name = defaultdict(set)
    for annotation in paper.annotations:
        if annotation.uid in [drug_id, disease_id]:
            spans.append(annotation)
            name[annotation.stype].add(annotation.text)
            
    data = dict()
    data["pmid"] = [paper.pmid]
    data["form_title"] = [highlight_title(paper.title, spans)]
    data["form_abstract"] = [highlight_abstract(len(paper.title), paper.abstract, spans)]
    data["drug_id"] = [drug_id]
    data["disease_id"] = [disease_id]    
    data["drug_name"] = [add_simple_tag("span", "chemical", "/".join(name["chemical"]))]
    data["disease_name"] = [add_simple_tag("span", "disease", "/".join(name["disease"]))]
            
    return pd.DataFrame(data)

In [15]:
def create_work_units(dataset):
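    """
    Makes CrowdFlower work units for every drug-disease
    pair returned by each paper's get_work_units().
    Pairs that co-occur in a sentence become easy (sentence
    level) units, one per sentence; pairs that never co-occur
    become hard (abstract level) units.
    Returns the tuple (easy_units, hard_units) of dataframes.
    """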
    easy_units = []
    hard_units = []
    for paper in dataset:
        for rel_pair in paper.get_work_units():
            cooccur = False
            for sentence in paper.sentences:
                found = [False, False]
                for annotation in sentence.annotations:
                    for i, concept_id in enumerate(rel_pair):
                        if annotation.uid == concept_id:
                            found[i] = True
                    
                if found[0] and found[1]:
                    cooccur = True
                    if check_CID_structure(sentence, rel_pair[0], rel_pair[1]):
                        # the sentence matches the "induce" pattern:
                        # assume the relation is true and make no work unit
                        print("{0}|{1}|{2}".format(paper.pmid, rel_pair[0], rel_pair[1]))
                    else:
                        # make one easy work unit for this sentence
                        easy_units.append(process_sentence_task(sentence, rel_pair[0], rel_pair[1]))
                        
            if not cooccur:
                hard_units.append(process_abstract_task(paper, rel_pair[0], rel_pair[1]))
                
    # return two dataframes
    easy_units = pd.concat(easy_units).reset_index(drop = True)
    hard_units = pd.concat(hard_units).reset_index(drop = True)
    
    easy_units["uniq_id"] = pd.Series(["bcv_easy_{0}".format(i) for i in range(len(easy_units))])
    hard_units["uniq_id"] = pd.Series(["bcv_hard_{0}".format(i) for i in range(len(hard_units))])
    
    return (easy_units, hard_units)

###Generate the work units and print to file:

In [16]:
easy_units, hard_units = create_work_units(testset)

1378968|D008094|D007674
1378968|D008094|D007674
3800626|D010423|D009135
17241784|D019821|D009135
15632880|D013148|D006947
12041669|D010396|D000741
2096243|C017367|D056784
2096243|C017367|D056784
17261653|D003000|D001919
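
Each line above has the form `pmid|drug_id|disease_id` and is printed once per matching sentence, so a pair that matches the pattern in two sentences appears twice. If these auto-accepted relations are collected later, they may need deduplicating. A minimal sketch, with the printed lines pasted in as a string (the variable names are illustrative):

```python
printed = """1378968|D008094|D007674
1378968|D008094|D007674
3800626|D010423|D009135
17241784|D019821|D009135
15632880|D013148|D006947
12041669|D010396|D000741
2096243|C017367|D056784
2096243|C017367|D056784
17261653|D003000|D001919""".splitlines()

seen = set()
unique_relations = []   # (pmid, drug_id, disease_id), first-seen order
for line in printed:
    key = tuple(line.split("|"))
    if key not in seen:
        seen.add(key)
        unique_relations.append(key)
```

For this test set the nine printed lines reduce to seven distinct relations.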


In [17]:
easy_units.head()

| | disease_id | disease_name | drug_id | drug_name | form_sentence | pmid | uniq_id |
|---|---|---|---|---|---|---|---|
| 0 | D016171 | `<span class="disease">TDP</span>` | D020117 | `<span class="chemical">cisapride</span>` | `Four compounds known to increase QT interval a...` | 11569530 | bcv_easy_0 |
| 1 | D016171 | `<span class="disease">TDP</span>` | D020117 | `<span class="chemical">cisapride</span>` | `For compounds that have shown <span class="dis...` | 11569530 | bcv_easy_1 |
| 2 | D016171 | `<span class="disease">TDP</span>` | C063968 | `<span class="chemical">E4031</span>` | `Four compounds known to increase QT interval a...` | 11569530 | bcv_easy_2 |
| 3 | D016171 | `<span class="disease">TDP</span>` | D016593 | `<span class="chemical">terfenadine</span>` | `<span class="disease">TDP</span> is a side-eff...` | 11569530 | bcv_easy_3 |
| 4 | D016171 | `<span class="disease">TDP</span>` | D016593 | `<span class="chemical">terfenadine</span>` | `Four compounds known to increase QT interval a...` | 11569530 | bcv_easy_4 |


In [18]:
hard_units.head()

| | disease_id | disease_name | drug_id | drug_name | form_abstract | form_title | pmid | uniq_id |
|---|---|---|---|---|---|---|---|---|
| 0 | D017180 | `<span class="disease">ventricular tachycardia<...` | D020117 | `<span class="chemical">cisapride</span>` | `1. Torsades de pointes (TDP) is a potentially ...` | `Pharmacokinetic/pharmacodynamic assessment of ...` | 11569530 | bcv_hard_0 |
| 1 | D017180 | `<span class="disease">ventricular tachycardia<...` | C063968 | `<span class="chemical">E4031</span>` | `1. Torsades de pointes (TDP) is a potentially ...` | `Pharmacokinetic/pharmacodynamic assessment of ...` | 11569530 | bcv_hard_1 |
| 2 | D017180 | `<span class="disease">ventricular tachycardia<...` | D016593 | `<span class="chemical">terfenadine</span>` | `1. Torsades de pointes (TDP) is a potentially ...` | `Pharmacokinetic/pharmacodynamic assessment of ...` | 11569530 | bcv_hard_2 |
| 3 | D017180 | `<span class="disease">ventricular tachycardia<...` | C010637 | `<span class="chemical">terodiline</span>` | `1. Torsades de pointes (TDP) is a potentially ...` | `Pharmacokinetic/pharmacodynamic assessment of ...` | 11569530 | bcv_hard_3 |
| 4 | D013921 | `<span class="disease">thrombocytopenia</span>` | D017239 | `<span class="chemical">Paclitaxel/paclitaxel</...` | `BACKGROUND: Cisplatin-based chemotherapy combi...` | `<span class="chemical">Paclitaxel</span>, cisp...` | 11135224 | bcv_hard_4 |


In [19]:
easy_units.to_csv("data/crowdflower/data_for_easy_job_.tsv", sep = '\t', index = False)

In [20]:
hard_units.to_csv("data/crowdflower/data_for_hard_job_.tsv", sep = '\t', index = False)