# Analysis of job 746647: Finding drug-disease relationships for BioCreative V (version 2)

2015-06-22 Tong Shu Li

CrowdFlower job 746647 is the second iteration of the chemical-induced-disease relationship extraction task for BioCreative V. Version 1.0 (job 743229) lacked proper test question validation and was therefore plagued by spammers.

This version is exactly the same as job 746297, except that the payment was doubled. All other settings, the data, and the test questions were identical.

This version properly eliminated workers who chose wrong answers.

Job 746647 was launched at 3:24 pm, Tuesday June 23, 2015, and completed at 5:50 pm on Tuesday June 23, 2015. The total cost was $96.00 USD.

Settings:
- 5 rows per page
- 5 judgements per row
- 100 cents per page
- level 1 contributor
- 50 seconds minimum per page
- worker has to maintain 70% minimum accuracy
- there were 11 test questions
- responses had to match the test questions exactly

The question grading scheme used (answers had to be exactly the same as the gold in order to be considered correct) was admittedly a bit strict, but I wanted to see what the results were before deciding whether to relax it.
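
As a rough illustration of what exact-match grading implies, the sketch below compares a response to a gold answer as sets of choices. It assumes answers are newline-separated strings (the format seen later in the `chemical_disease_relationships` column) and that the order of choices does not matter. The function names and the overlap-based relaxation are hypothetical; this is not CrowdFlower's grading code.

```python
def parse_choices(raw):
    """Split a newline-separated answer string into a set of choices."""
    return frozenset(part for part in raw.split("\n") if part)


def is_exact_match(response, gold):
    """Strict grading: the selected set must equal the gold set."""
    return parse_choices(response) == parse_choices(gold)


def jaccard_overlap(response, gold):
    """One hypothetical relaxation: partial credit by set overlap."""
    chosen, expected = parse_choices(response), parse_choices(gold)
    union = chosen | expected
    return len(chosen & expected) / float(len(union)) if union else 1.0


# A worker who finds one of two gold relations fails the strict check
# but would earn half credit under the overlap score.
strict = is_exact_match("choice_2", "choice_0\nchoice_2")
partial = jaccard_overlap("choice_2", "choice_0\nchoice_2")
```

Under the strict scheme a partially correct answer counts the same as a completely wrong one, which is the behaviour a relaxed scheme would change.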

In [1]:
from collections import defaultdict
import pandas as pd

In [2]:
from src.filter_data import filter_data
from src.parse_gold import parse_input
from src.parse_gold import Relation

In [3]:
settings = {
    "loc": "data/crowdflower/results",
    "fname": "job_746647_full_with_untrusted.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

raw_data = filter_data(settings)

In [4]:
len(raw_data)

255

In [5]:
res = raw_data.query("pmid == 18631865")

In [6]:
res

Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,choice_2_ids,choice_2_label,choice_3_ids,choice_3_label,choice_4_ids,choice_4_label,form_abstract,form_title,pmid,uniq_id
277,741717814,6/23/2015 22:38:18,False,1668637444,,6/23/2015 22:37:44,False,neodev,0.875,32608383,...,D020123_induces_D011507,"<span class=""chemical"">rapamycin</span> contri...",empty,<strong>Do not</strong> choose this choice.,empty,<strong>Do not</strong> choose this choice.,Massive urinary protein excretion has been obs...,"mToR inhibitors-induced <span class=""disease"">...",18631865,bcv_id_44
278,741717814,6/23/2015 22:43:39,False,1668641717,,6/23/2015 22:38:47,False,elite,1.0,31599083,...,D020123_induces_D011507,"<span class=""chemical"">rapamycin</span> contri...",empty,<strong>Do not</strong> choose this choice.,empty,<strong>Do not</strong> choose this choice.,Massive urinary protein excretion has been obs...,"mToR inhibitors-induced <span class=""disease"">...",18631865,bcv_id_44
279,741717814,6/23/2015 22:47:44,False,1668645586,,6/23/2015 22:39:12,False,clixsense,0.7143,6591664,...,D020123_induces_D011507,"<span class=""chemical"">rapamycin</span> contri...",empty,<strong>Do not</strong> choose this choice.,empty,<strong>Do not</strong> choose this choice.,Massive urinary protein excretion has been obs...,"mToR inhibitors-induced <span class=""disease"">...",18631865,bcv_id_44
280,741717814,6/23/2015 23:14:40,False,1668672786,,6/23/2015 23:07:53,False,dollarsignup,0.8,10824531,...,D020123_induces_D011507,"<span class=""chemical"">rapamycin</span> contri...",empty,<strong>Do not</strong> choose this choice.,empty,<strong>Do not</strong> choose this choice.,Massive urinary protein excretion has been obs...,"mToR inhibitors-induced <span class=""disease"">...",18631865,bcv_id_44
281,741717814,6/23/2015 23:19:27,False,1668677808,,6/23/2015 23:09:23,False,neodev,0.8182,14596658,...,D020123_induces_D011507,"<span class=""chemical"">rapamycin</span> contri...",empty,<strong>Do not</strong> choose this choice.,empty,<strong>Do not</strong> choose this choice.,Massive urinary protein excretion has been obs...,"mToR inhibitors-induced <span class=""disease"">...",18631865,bcv_id_44


In [7]:
res["chemical_disease_relationships"]

277              choice_2
278    choice_0\nchoice_2
279         none_are_true
280         none_are_true
281         none_are_true
Name: chemical_disease_relationships, dtype: object

In [8]:
raw_data.head()

Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,choice_2_ids,choice_2_label,choice_3_ids,choice_3_label,choice_4_ids,choice_4_label,form_abstract,form_title,pmid,uniq_id
0,741717769,6/23/2015 22:56:48,False,1668654997,,6/23/2015 22:53:20,False,neodev,0.8,33014938,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
2,741717769,6/23/2015 23:00:44,False,1668658995,,6/23/2015 22:57:38,False,neodev,0.8182,14596658,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
3,741717769,6/23/2015 23:00:47,False,1668659007,,6/23/2015 22:53:29,False,neodev,0.8333,32591740,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
4,741717769,6/23/2015 23:03:04,False,1668661305,,6/23/2015 22:53:03,False,dollarsignup,0.8,10824531,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
5,741717769,6/23/2015 23:25:28,False,1668683944,,6/23/2015 23:23:48,False,elite,0.8,26583095,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0


### First let's clean up the data a bit:

We need to ensure that the choices make sense. If a worker chose "none are true", then no other choice should be selected. We also want to check that nobody selected any empty choices.

In [9]:
def check_data(data_frame):
    bad_responses = set()

    for idx, row in data_frame.iterrows():
        unit_id = row["_unit_id"]
        worker_id = row["_worker_id"]

        response = row["chemical_disease_relationships"].split('\n')

        # if "none are true" was chosen, it should be the only choice
        if "none_are_true" in response and len(response) > 1:
            bad_responses.add((unit_id, worker_id))

        for i in range(5):
            column = "choice_{0}_ids".format(i)
            if (row[column] == "empty") and ("choice_{0}".format(i) in response): # clicked empty response
                bad_responses.add((unit_id, worker_id))
                
    return bad_responses

In [10]:
check_data(raw_data)

set()

This time everyone followed the rules regarding the job! There were no accidental choices either, which is great.
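
Since the check came back empty, here is a small sketch of what a violation would look like. It applies the same two rules to a single raw response string rather than to a data frame. The helper `response_problems` and the placeholder id pairs are hypothetical and do not come from the job data.

```python
def response_problems(response, choice_ids):
    """Return a list of rule violations for one raw response string.

    choice_ids maps names such as "choice_0" to an id pair or "empty".
    """
    selected = response.split("\n")
    problems = []
    if "none_are_true" in selected and len(selected) > 1:
        problems.append("none_are_true combined with other choices")
    for name in selected:
        if choice_ids.get(name) == "empty":
            problems.append("selected empty choice " + name)
    return problems


# Hypothetical work unit: two real choices and one empty slot.
choice_ids = {
    "choice_0": "CHEM_A_induces_DIS_X",
    "choice_1": "CHEM_B_induces_DIS_X",
    "choice_2": "empty",
}

clean = response_problems("choice_0\nchoice_1", choice_ids)
mixed = response_problems("none_are_true\nchoice_0", choice_ids)
empty_pick = response_problems("choice_2", choice_ids)
```

Only the first response passes; the other two each trip one of the rules.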

Now that the data have been checked, we can proceed to the analysis.

In [11]:
def aggregate_votes(uniq_id, data_frame):
    """
    Given a data frame representing all the unique votes
    for one work unit, aggregates the votes for each of the
    possible choices.
    
    Returns an unsorted data frame containing the relationships
    with normalized scores.
    """
    rel_id = dict()
    
    # first map the ids: choice # -> id_pair
    for i in range(5):
        colname = "choice_{0}_ids".format(i)
        assert len(data_frame[colname].unique()) == 1
        rel_id["choice_{0}".format(i)] = data_frame.iloc[0][colname]  

    scores = defaultdict(float)
    # increment each relationship pair id by the worker's trust score
    for idx, row in data_frame.iterrows():
        # "none_are_true" must be the only selection; it cannot be combined with other choices
        user_choices = row["chemical_disease_relationships"].split('\n')
        if "none_are_true" in user_choices:
            assert len(user_choices) == 1, idx
            # vote against all other choices
            for i in range(5):
                scores[rel_id["choice_{0}".format(i)]] -= row["_trust"]
        else:
            for choice in user_choices:
                scores[rel_id[choice]] += row["_trust"]
            
    total_trust = sum(data_frame["_trust"])
    
    # normalize choices and remove those below zero or empty
    temp = defaultdict(list)
    for id_pair, score in scores.items():
        score /= total_trust
        if score > 0 and id_pair != "empty":
            temp["id_pair"].append(id_pair)
            temp["normalized_score"].append(score)
            
    df = pd.DataFrame(temp)
    
    df["uniq_id"] = uniq_id
    assert len(data_frame["_unit_id"].unique()) == 1
    df["unit_id"] = data_frame["_unit_id"].iloc[0]
    
    return df
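
To make the scoring concrete, the sketch below replays the five votes shown earlier for PMID 18631865, using the trust values and selections from that output. It is a pure-Python restatement of the same rule (add trust for a selected choice, subtract trust from every choice on "none are true", then divide by total trust), not the function above. The helper name `normalized_scores` is hypothetical, and only `choice_0` and `choice_2` are tracked because those are the choices the workers selected.

```python
from collections import defaultdict


def normalized_scores(votes, choice_names):
    """votes is a list of (trust, selected_choices) pairs."""
    scores = defaultdict(float)
    for trust, selected in votes:
        if "none_are_true" in selected:
            # a "none are true" vote counts against every choice
            for name in choice_names:
                scores[name] -= trust
        else:
            for name in selected:
                scores[name] += trust
    total_trust = sum(trust for trust, _ in votes)
    return dict((name, scores[name] / total_trust) for name in choice_names)


# Trust scores and selections copied from the PMID 18631865 output above.
votes = [
    (0.875, ["choice_2"]),
    (1.0, ["choice_0", "choice_2"]),
    (0.7143, ["none_are_true"]),
    (0.8, ["none_are_true"]),
    (0.8182, ["none_are_true"]),
]

result = normalized_scores(votes, ["choice_0", "choice_2"])
# choice_2: (0.875 + 1.0 - 0.7143 - 0.8 - 0.8182) / 4.2075, about -0.109
# choice_0: (1.0 - 0.7143 - 0.8 - 0.8182) / 4.2075, about -0.317
surviving = [name for name, score in result.items() if score > 0]
```

Both scores are negative, so neither relation passes the `score > 0` filter and this work unit contributes no rows to the aggregated output.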

In [12]:
def generate_results(data_frame):
    results = []
    for pmid, pmid_group in data_frame.groupby("pmid"):
        temp = []
        for uniq_id, group in pmid_group.groupby("uniq_id"):
            scores = aggregate_votes(uniq_id, group)
            temp.append(scores)

        df = pd.concat(temp)
        if not df.empty:
            df = df.sort("normalized_score", axis = 0, ascending = False)
            df["pmid"] = pmid
            results.append(df)
            
    return pd.concat(results)

In [13]:
results = generate_results(raw_data)

In [14]:
results.head()

Unnamed: 0,id_pair,normalized_score,uniq_id,unit_id,pmid
1,D002512_induces_D007683,0.811571,bcv_id_46,741717816,1130930
0,D002512_induces_D007674,0.78802,bcv_id_46,741717816,1130930
0,D005839_induces_D009846|D051437,0.403663,bcv_id_45,741717815,1130930
1,D005839_induces_D007683,0.403663,bcv_id_45,741717815,1130930
3,D002512_induces_D009846|D051437,0.403663,bcv_id_45,741717815,1130930


Now that we have our results table, listing some drug-disease relationships with a confidence score, we can perform a ROC analysis on the score as a predictor. To generate our ROC curve, we will:

1. Use the gold standard to look up whether each chemical-disease id pair is a true relationship, and convert the answer to 1 or 0.
2. Use the R ROCR package to generate the ROC curve.

In [15]:
training_data = parse_input("data/training", "CDR_TrainingSet.txt")

In [16]:
used_pmids = set(results["pmid"].unique())
used_pmids

{1130930,
 1378968,
 1835291,
 2096243,
 2265898,
 2375138,
 2515254,
 3800626,
 6666578,
 6692345,
 7449470,
 7582165,
 8590259,
 8595686,
 9522143,
 10520387,
 10835440,
 11135224,
 11569530,
 12041669,
 12198388,
 15602202,
 15632880,
 16167916,
 16337777,
 17241784,
 17261653,
 17931375,
 19269743}

In [17]:
# create the gold
gold_relations = dict()

for paper in training_data:
    if int(paper.pmid) in used_pmids:
        gold_relations[paper.pmid] = paper.relations
    
print len(gold_relations)

29


In [18]:
sum(map(len, gold_relations.values()))

63

In [19]:
def in_gold(pmid, annot):
    for gold in gold_relations[str(pmid)]:
        if gold == annot:
            return True
        
    return False

In [20]:
is_in_gold = []
for idx, row in results.iterrows():
    pmid = row["pmid"]
    temp = row["id_pair"].split("_induces_")
    annot = Relation(temp[0], temp[1])
    
    is_in_gold.append(int(in_gold(pmid, annot)))
    
results["in_gold"] = is_in_gold

In [21]:
results.head()

Unnamed: 0,id_pair,normalized_score,uniq_id,unit_id,pmid,in_gold
1,D002512_induces_D007683,0.811571,bcv_id_46,741717816,1130930,1
0,D002512_induces_D007674,0.78802,bcv_id_46,741717816,1130930,0
0,D005839_induces_D009846|D051437,0.403663,bcv_id_45,741717815,1130930,0
1,D005839_induces_D007683,0.403663,bcv_id_45,741717815,1130930,1
3,D002512_induces_D009846|D051437,0.403663,bcv_id_45,741717815,1130930,0
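
One thing worth noting in this table: `D005839_induces_D009846|D051437` is marked as not in the gold, although the gold listing for PMID 1130930 printed further down includes the pair `D005839 D009846`. The sketch below shows one possible way to handle such composite ids, under the assumption that `|` separates alternative identifiers for the same disease mention. Both that interpretation and the helper names are assumptions, and plain tuples stand in for `Relation` objects.

```python
def split_id_pair(id_pair):
    """Split "CHEM_induces_D1|D2" into the chemical id and a list of disease ids."""
    chemical, disease = id_pair.split("_induces_")
    return chemical, disease.split("|")


def matches_any(id_pair, gold_pairs):
    """True if any component disease id forms a gold (chemical, disease) pair."""
    chemical, diseases = split_id_pair(id_pair)
    return any((chemical, disease) in gold_pairs for disease in diseases)


# Gold pairs for PMID 1130930, copied from the listing in this notebook.
gold_pairs = set([
    ("D005839", "D007683"),
    ("D002512", "D007683"),
    ("D005839", "D009846"),
    ("D002512", "D009846"),
])

composite = matches_any("D005839_induces_D009846|D051437", gold_pairs)
simple_hit = matches_any("D002512_induces_D007683", gold_pairs)
simple_miss = matches_any("D002512_induces_D007674", gold_pairs)
```

Whether a composite id should count as a match is a judgment call; if it should, the `in_gold` labels for such rows would change from 0 to 1.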


In [22]:
results.to_csv("data/746647_ROC_test.txt", sep = '\t', index = False)

Generation of the ROC curve is done in R.
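
For a quick sanity check without leaving Python, the area under the ROC curve can also be computed directly as the fraction of (positive, negative) pairs that the score ranks correctly, with ties counted as half. This is a swapped-in illustration rather than the ROCR workflow used here. `auc_from_scores` is a hypothetical helper, and the sample holds only the five rows shown by `results.head()`, so the value it gives is not the AUC for the whole job.

```python
def auc_from_scores(scored):
    """Compute AUC from (score, label) pairs, where label is 1 or 0."""
    positives = [score for score, label in scored if label == 1]
    negatives = [score for score, label in scored if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    credit = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                credit += 1.0   # correctly ranked pair
            elif pos == neg:
                credit += 0.5   # tie counts as half
    return credit / (len(positives) * len(negatives))


# (normalized_score, in_gold) for the five rows shown by results.head()
sample = [
    (0.811571, 1),
    (0.78802, 0),
    (0.403663, 0),
    (0.403663, 1),
    (0.403663, 0),
]

sample_auc = auc_from_scores(sample)
```

The pairwise loop is quadratic in the number of rows, which is fine for a table of this size.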

---

In [37]:
for pmid, group in results.groupby("pmid"):
    print pmid
    for rel in gold_relations[str(pmid)]:
        rel.output()
    print
    
    print group
    print "------------------"

1130930
D005839 D007683
D002512 D007683
D005839 D009846
D002512 D009846

                           id_pair  normalized_score    uniq_id    unit_id  \
1          D002512_induces_D007683          0.811571  bcv_id_46  741717816   
0          D002512_induces_D007674          0.788020  bcv_id_46  741717816   
0  D005839_induces_D009846|D051437          0.403663  bcv_id_45  741717815   
1          D005839_induces_D007683          0.403663  bcv_id_45  741717815   
3  D002512_induces_D009846|D051437          0.403663  bcv_id_45  741717815   
2          D005839_induces_D007674          0.178899  bcv_id_45  741717815   

      pmid  in_gold  
1  1130930        1  
0  1130930        0  
0  1130930        0  
1  1130930        1  
3  1130930        0  
2  1130930        0  
------------------
1378968
D008094 D006973
D008094 D011507
D008094 D007676

                   id_pair  normalized_score   uniq_id    unit_id     pmid  \
0  D008094_induces_D007676          1.000000  bcv_id_7  741717776  13789

---

I noticed that there was one person who did both this job (746647) and the one yesterday (746297). Let's see if anyone else worked on both jobs and how they did with respect to the gold in both jobs:

In [23]:
settings = {
    "loc": "data/crowdflower/results",
    "fname": "job_746297_full_with_untrusted.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

job_746297_data = filter_data(settings)

In [24]:
len(job_746297_data["_worker_id"].unique())

23

In [25]:
settings = {
    "loc": "data/crowdflower/results",
    "fname": "job_746647_full_with_untrusted.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

job_746647_data = filter_data(settings)

In [26]:
len(job_746647_data["_worker_id"].unique())

22

One fewer person worked on the task today.

In [27]:
veterans = set(job_746297_data["_worker_id"]) & set(job_746647_data["_worker_id"])
veterans

{6591664, 31599083, 32591740, 33014938}

Interesting. There are 4 people who worked on both jobs. Let's look at their performance:

In [28]:
for worker_id in veterans:
    trust_746297 = job_746297_data.query("_worker_id == {0}".format(worker_id)).iloc[0]["_trust"]
    trust_746647 = job_746647_data.query("_worker_id == {0}".format(worker_id)).iloc[0]["_trust"]
    
    print "Worker id: {0}".format(worker_id)
    print "Trust for 746297: {0}".format(trust_746297)
    print "Trust for 746647: {0}".format(trust_746647)
    print

Worker id: 6591664
Trust for 746297: 0.8333
Trust for 746647: 0.7143

Worker id: 33014938
Trust for 746297: 0.7143
Trust for 746647: 0.8

Worker id: 31599083
Trust for 746297: 0.8182
Trust for 746647: 1.0

Worker id: 32591740
Trust for 746297: 0.8333
Trust for 746647: 0.8333



Two people got better at the same test questions while one remained constant and one got worse. How did they do against the gold?

We will take each contributor who worked on both jobs, look at their responses to the questions they worked on, and see if their responses match the gold.

In [29]:
for worker_id in veterans:
    first_subset = job_746297_data.query("_worker_id == {0}".format(worker_id))
    second_subset = job_746647_data.query("_worker_id == {0}".format(worker_id))
    
    print worker_id
    print len(first_subset)
    print len(second_subset)
    print

6591664
4
8

33014938
7
20

31599083
23
24

32591740
4
4



Overall, the returning workers did more work the second time: three of the four submitted more responses and one submitted the same number. Whether this is because they came earlier (before the work ran out), were more confident, or were motivated by the higher pay is uncertain.

In [30]:
list(job_746647_data.columns.values)

['_unit_id',
 '_created_at',
 '_golden',
 '_id',
 '_missed',
 '_started_at',
 '_tainted',
 '_channel',
 '_trust',
 '_worker_id',
 '_country',
 '_region',
 '_city',
 '_ip',
 'chemical_disease_relationships',
 'comment_box',
 'chemical_disease_relationships_gold',
 'choice_0_ids',
 'choice_0_label',
 'choice_1_ids',
 'choice_1_label',
 'choice_2_ids',
 'choice_2_label',
 'choice_3_ids',
 'choice_3_label',
 'choice_4_ids',
 'choice_4_label',
 'form_abstract',
 'form_title',
 'pmid',
 'uniq_id']

In [31]:
def statistics(data_frame):
    """
    Counts the TP, TN, FP, and FN over all responses in the
    data frame and returns them in that order.
    """
    true_pos = 0
    true_neg = 0
    false_pos = 0
    false_neg = 0
    for idx, row in data_frame.iterrows():
        pmid = row["pmid"]
        
        choices = row["chemical_disease_relationships"].split('\n')
        if "none_are_true" in choices:
            # look at all non empty and verify not in gold
            for i in range(5):
                col_name = "choice_{0}_ids".format(i)
                if row[col_name] != "empty":
                    id_pair = row[col_name].split("_induces_")
                    
                    if not in_gold(pmid, Relation(id_pair[0], id_pair[1])):
                        true_neg += 1
                    else:
                        false_neg += 1
        else:
            for choice in choices:
                assert row["{0}_ids".format(choice)] != "empty"
                id_pair = row["{0}_ids".format(choice)].split("_induces_")
                if in_gold(pmid, Relation(id_pair[0], id_pair[1])):
                    true_pos += 1
                else:
                    false_pos += 1
                    
    return (true_pos, true_neg, false_pos, false_neg)
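
The four counts are easier to compare across workers once they are turned into rates. The sketch below is one way to do that, with `summarize` as a hypothetical helper that is not part of the notebook. The example tuple is the one reported further down for worker 6591664's first job.

```python
def summarize(counts):
    """Turn a (TP, TN, FP, FN) tuple into precision, recall and accuracy.

    A metric is None when its denominator is zero.
    """
    true_pos, true_neg, false_pos, false_neg = counts

    def ratio(numerator, denominator):
        return numerator / float(denominator) if denominator else None

    return {
        "precision": ratio(true_pos, true_pos + false_pos),
        "recall": ratio(true_pos, true_pos + false_neg),
        "accuracy": ratio(true_pos + true_neg, sum(counts)),
    }


# Tuple in the (TP, TN, FP, FN) order returned by statistics().
example = summarize((2, 10, 0, 0))
```

One caveat: unless a worker answered "none are true", `statistics` only tallies the choices that worker selected, so gold relations that were left unselected are not counted as false negatives. Recall computed from these counts is therefore likely to be optimistic.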

In [32]:
for worker_id in veterans:
    first_subset = job_746297_data.query("_worker_id == {0}".format(worker_id))
    second_subset = job_746647_data.query("_worker_id == {0}".format(worker_id))
    
    print "worker id:", worker_id
    print "TP TN FP FN"
    print "first time:", statistics(first_subset)
    print "second time:", statistics(second_subset)
    print

worker id: 6591664
TP TN FP FN
first time: (2, 10, 0, 0)
second time:

KeyError: '18631865'
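
The traceback is cut off, but the notebook itself points to a likely cause. PMID 18631865 does not appear in `used_pmids`, which was built from `results`, and `gold_relations` only holds entries for PMIDs in `used_pmids`. Because `in_gold` indexes `gold_relations[str(pmid)]` directly, any response for an abstract with no rows in `results` would raise a `KeyError`. The sketch below shows a pre-check that would surface this earlier, assuming the gold dictionary is keyed by PMID strings as above. The helper names are hypothetical and tuples stand in for `Relation` objects.

```python
def missing_from_gold(gold_relations, pmids):
    """Return the PMIDs (as strings) that have no entry in the gold dict."""
    return sorted(set(str(p) for p in pmids) - set(gold_relations))


def in_gold_checked(gold_relations, pmid, annot):
    """Like in_gold, but fails with an explicit message for unknown PMIDs."""
    key = str(pmid)
    if key not in gold_relations:
        raise LookupError("no gold relations loaded for PMID " + key)
    return annot in gold_relations[key]


# Placeholder gold dictionary containing a single abstract.
gold = {"1130930": [("D002512", "D007683")]}

absent = missing_from_gold(gold, [1130930, 18631865])
found = in_gold_checked(gold, 1130930, ("D002512", "D007683"))
```

Building `gold_relations` from every PMID in the raw data, rather than only those that survive into `results`, would presumably avoid the error, provided those abstracts are present in the training set.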