# Analysis of job 746297: Finding drug-disease relationships for BioCreative V (version 2)

2015-06-22 Tong Shu Li

CrowdFlower job 746297 is the second iteration of the chemical-induced-disease relationship extraction task for BioCreative V. Version 1.0 (job 743229) did not have proper test question validation and was therefore plagued by spammers.

This version properly removed workers who answered too many test questions incorrectly.

Job 746297 was launched at 3:24 pm on Monday, June 22, 2015, and completed at 8:05 pm the same day. The total cost was $54.66 USD.

Settings:
- 5 rows per page
- 5 judgements per row
- 50 cents per page
- level 1 contributor
- 50 seconds minimum per page
- worker has to maintain 70% minimum accuracy
- there were 11 test questions
- responses had to match the test questions exactly

The question grading scheme (answers had to be exactly the same as the gold standard to be considered correct) was admittedly a bit strict, but I wanted to see the results before deciding whether to relax it.
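To make the strictness concrete, here is a minimal sketch of exact-match grading. It assumes that both the worker response and the gold answer are newline-separated choice names, in the same form as the `chemical_disease_relationships` column. The function name `is_exact_match` is hypothetical, and this is not CrowdFlower's own grading code.

```python
def is_exact_match(response, gold):
    """Return True only if the worker selected exactly the gold choices.

    Both arguments are newline-separated choice names (an assumption about
    the storage format). The order of the choices is ignored.
    """
    return set(response.split("\n")) == set(gold.split("\n"))


# Same choices in a different order: graded as correct.
print(is_exact_match("choice_0\nchoice_2", "choice_2\nchoice_0"))  # True

# Only one of the two gold relations found: graded as wrong.
print(is_exact_match("choice_0", "choice_0\nchoice_2"))  # False
```

Under this scheme a worker who finds one of two gold relations is treated the same as one who finds neither, which is what makes it strict.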

The analysis procedure is very similar to that of job 743229.

Open question: how do I make the checkbox group a hybrid? If a worker chooses "none are true", then no other choice should be selectable.

In [1]:
from collections import defaultdict
import pandas as pd

In [2]:
%load_ext rpy2.ipython

In [3]:
from src.filter_data import filter_data
from src.parse_gold import parse_input
from src.parse_gold import Relation

In [4]:
settings = {
    "loc": "data/crowdflower/results",
    "fname": "job_746297_full_with_untrusted.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

raw_data = filter_data(settings)

In [5]:
len(raw_data)

255

In [6]:
raw_data.head()

Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,choice_2_ids,choice_2_label,choice_3_ids,choice_3_label,choice_4_ids,choice_4_label,form_abstract,form_title,pmid,uniq_id
0,741091284,6/22/2015 22:44:33,False,1667877616,,6/22/2015 22:43:35,False,neodev,0.7273,32824409,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
1,741091284,6/22/2015 23:10:22,False,1667886309,,6/22/2015 23:07:02,False,elite,0.9091,30936260,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
2,741091284,6/22/2015 23:10:53,False,1667886491,,6/22/2015 23:08:45,False,neodev,0.7273,11064916,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
3,741091284,6/22/2015 23:27:32,False,1667891825,,6/22/2015 23:13:27,False,neodev,0.7143,11029942,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
4,741091284,6/22/2015 23:38:33,False,1667895583,,6/22/2015 23:37:10,False,prizerebel,0.875,28853816,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0


### First let's clean up the data a bit:

We need to ensure that the choices make sense. If they chose "none are true", then no other choice should be selected. We also want to check that they didn't choose any empty choices.

In [7]:
def check_data(data_frame):
    bad_responses = set()

    for idx, row in data_frame.iterrows():
        unit_id = row["_unit_id"]
        worker_id = row["_worker_id"]

        response = row["chemical_disease_relationships"].split('\n')

        # if "none are true" was chosen, it should be the only choice
        if "none_are_true" in response and len(response) > 1:
            bad_responses.add((unit_id, worker_id))

        for i in range(5):
            column = "choice_{0}_ids".format(i)
            if (row[column] == "empty") and ("choice_{0}".format(i) in response): # clicked empty response
                bad_responses.add((unit_id, worker_id))
                
    return bad_responses

In [8]:
check_data(raw_data)

{(741091304, 30936260)}

Great, there was only one response that had an answer which didn't make sense. We can look at this answer and choose the result manually.

In [9]:
raw_data.query("_unit_id == 741091304 and _worker_id == 30936260")

Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,choice_2_ids,choice_2_label,choice_3_ids,choice_3_label,choice_4_ids,choice_4_label,form_abstract,form_title,pmid,uniq_id
101,741091304,6/22/2015 22:55:07,False,1667881387,,6/22/2015 22:49:15,False,elite,0.9091,30936260,...,D025101_induces_D007676,"<span class=""chemical"">vitamin B6</span> contr...",empty,<strong>Do not</strong> choose this choice.,empty,<strong>Do not</strong> choose this choice.,Two patients with similar clinical features ar...,"Serial <span class=""disease"">epilepsy</span> c...",2265898,bcv_id_20


In [10]:
raw_data.loc[101]

_unit_id                                                                       741091304
_created_at                                                           6/22/2015 22:55:07
_golden                                                                            False
_id                                                                           1667881387
_missed                                                                              NaN
_started_at                                                           6/22/2015 22:49:15
_tainted                                                                           False
_channel                                                                           elite
_trust                                                                            0.9091
_worker_id                                                                      30936260
_country                                                                             VNM
_region              

This contributor chose both "none are true" and the second choice, so we will manually change it to "none are true" because that overrides the other choice.
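If more than one response had this conflict, the same override rule could be applied programmatically instead of by hand. The following is a minimal sketch under that assumption; the helper name `resolve_none_override` is hypothetical and is not part of the `src` package.

```python
def resolve_none_override(response):
    """Collapse a response to "none_are_true" if that option was selected.

    `response` is a newline-separated string of choice names. Responses
    that do not contain "none_are_true" are returned unchanged.
    """
    choices = response.split("\n")
    if "none_are_true" in choices:
        return "none_are_true"
    return response


print(resolve_none_override("none_are_true\nchoice_1"))  # none_are_true
```

It could be applied to the whole column with `raw_data["chemical_disease_relationships"].apply(resolve_none_override)`. Since only one response is affected here, editing it manually is just as easy.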

In [11]:
raw_data.loc[101, "chemical_disease_relationships"] = "none_are_true"

In [12]:
raw_data.loc[101]

_unit_id                                                                       741091304
_created_at                                                           6/22/2015 22:55:07
_golden                                                                            False
_id                                                                           1667881387
_missed                                                                              NaN
_started_at                                                           6/22/2015 22:49:15
_tainted                                                                           False
_channel                                                                           elite
_trust                                                                            0.9091
_worker_id                                                                      30936260
_country                                                                             VNM
_region              

In [13]:
check_data(raw_data)

set()

Now that the data have been cleaned up, we can proceed to the analysis.

In [14]:
list(raw_data.columns.values)

['_unit_id',
 '_created_at',
 '_golden',
 '_id',
 '_missed',
 '_started_at',
 '_tainted',
 '_channel',
 '_trust',
 '_worker_id',
 '_country',
 '_region',
 '_city',
 '_ip',
 'chemical_disease_relationships',
 'comment_box',
 'chemical_disease_relationships_gold',
 'choice_0_ids',
 'choice_0_label',
 'choice_1_ids',
 'choice_1_label',
 'choice_2_ids',
 'choice_2_label',
 'choice_3_ids',
 'choice_3_label',
 'choice_4_ids',
 'choice_4_label',
 'form_abstract',
 'form_title',
 'pmid',
 'uniq_id']

In [15]:
def aggregate_votes(uniq_id, data_frame):
    """
    Given a data frame representing all the unique votes
    for one work unit, aggregates the votes for each of the
    possible choices.
    
    Returns an unsorted data frame containing the relationships
    with normalized scores.
    """
    rel_id = dict()
    
    # first map the ids: choice # -> id_pair
    for i in range(5):
        colname = "choice_{0}_ids".format(i)
        assert len(data_frame[colname].unique()) == 1
        rel_id["choice_{0}".format(i)] = data_frame.iloc[0][colname]  

    scores = defaultdict(float)
    # increment each relationship pair id by the worker's trust score
    for idx, row in data_frame.iterrows():
        # check that "none are true" does not conflict with the other choices
        user_choices = row["chemical_disease_relationships"].split('\n')
        if "none_are_true" in user_choices:
            assert len(user_choices) == 1, idx
            # vote against all other choices
            for i in range(5):
                scores[rel_id["choice_{0}".format(i)]] -= row["_trust"]
        else:
            for choice in user_choices:
                scores[rel_id[choice]] += row["_trust"]
            
    total_trust = sum(data_frame["_trust"])
    
    # normalize choices and remove those below zero or empty
    temp = defaultdict(list)
    for id_pair, score in scores.items():
        score /= total_trust
        if score > 0 and id_pair != "empty":
            temp["id_pair"].append(id_pair)
            temp["normalized_score"].append(score)
            
    df = pd.DataFrame(temp)
    
    df["uniq_id"] = uniq_id
    assert len(data_frame["_unit_id"].unique()) == 1
    df["unit_id"] = data_frame["_unit_id"].iloc[0]
    
    return df
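To see how the trust-weighted voting behaves, here is a small worked example on made-up votes. It is a simplified sketch of the same scoring rule using plain Python structures instead of a data frame: the function name `toy_aggregate`, the trust values, and the votes are all invented for illustration, and it keys scores by choice name rather than by id pair, so it does not need the "empty" filter.

```python
from collections import defaultdict


def toy_aggregate(votes, n_choices=5):
    """Aggregate votes for one work unit.

    `votes` is a list of (trust, [choice names]) pairs, one per worker.
    Returns a dict mapping each choice with a positive score to its
    trust-normalized score.
    """
    scores = defaultdict(float)
    for trust, choices in votes:
        if "none_are_true" in choices:
            # a "none are true" vote counts against every choice
            for i in range(n_choices):
                scores["choice_{0}".format(i)] -= trust
        else:
            for choice in choices:
                scores[choice] += trust

    total_trust = sum(trust for trust, _ in votes)
    # normalize and keep only choices with a positive score
    return {c: s / total_trust for c, s in scores.items() if s > 0}


votes = [
    (1.0, ["choice_0", "choice_1"]),  # fully trusted worker picks two choices
    (0.5, ["choice_0"]),              # less trusted worker picks one
    (0.5, ["none_are_true"]),         # less trusted worker rejects all
]

print(sorted(toy_aggregate(votes).items()))
# [('choice_0', 0.5), ('choice_1', 0.25)]
```

Here `choice_0` collects 1.0 + 0.5 - 0.5 = 1.0 out of a total trust of 2.0, giving 0.5, while `choice_1` collects 1.0 - 0.5 = 0.5, giving 0.25. The remaining choices end up negative and are dropped.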

In [17]:
def generate_results(data_frame):
    results = []
    for pmid, pmid_group in data_frame.groupby("pmid"):
        temp = []
        for uniq_id, group in pmid_group.groupby("uniq_id"):
            scores = aggregate_votes(uniq_id, group)
            temp.append(scores)

        df = pd.concat(temp)
        if not df.empty:
            df = df.sort_values("normalized_score", axis = 0, ascending = False)
            df["pmid"] = pmid
            results.append(df)
            
    return pd.concat(results)

In [18]:
results = generate_results(raw_data)

In [20]:
results

Unnamed: 0,id_pair,normalized_score,uniq_id,unit_id,pmid
2,D005839_induces_D007674,0.794784,bcv_id_45,741091329,1130930
0,D002512_induces_D007674,0.770023,bcv_id_46,741091330,1130930
1,D005839_induces_D007683,0.605403,bcv_id_45,741091329,1130930
2,D002512_induces_D007683,0.594662,bcv_id_46,741091330,1130930
0,D005839_induces_D009846|D051437,0.420701,bcv_id_45,741091329,1130930
3,D002512_induces_D009846|D051437,0.389918,bcv_id_45,741091329,1130930
1,D002512_induces_D051437,0.180698,bcv_id_46,741091330,1130930
0,D008094_induces_D007674,0.816750,bcv_id_6,741091290,1378968
1,D008094_induces_D011507,0.816750,bcv_id_6,741091290,1378968
3,D008094_induces_D006973,0.816750,bcv_id_6,741091290,1378968


Now that we have our results table, which lists each candidate drug-disease relationship with a confidence score, we can perform a ROC analysis on the score as a predictor. To generate our ROC curve, we will:

1. Use the gold standard to look up whether each chemical-disease pair id was a true positive or not, and label each pair with a 1 or a 0 accordingly.
2. Use the R ROCR package to generate the ROC curve.

In [21]:
training_data = parse_input("data/training", "CDR_TrainingSet.txt")

In [22]:
used_pmids = set(results["pmid"].unique())
used_pmids

{1130930,
 1378968,
 1835291,
 2096243,
 2265898,
 2375138,
 2515254,
 3800626,
 6666578,
 6692345,
 7582165,
 8590259,
 8595686,
 9522143,
 10520387,
 10835440,
 11135224,
 11569530,
 12041669,
 12198388,
 15602202,
 15632880,
 16167916,
 16337777,
 17241784,
 17261653,
 17931375,
 18631865,
 19269743}

In [23]:
# create the gold
gold_relations = dict()

for paper in training_data:
    if int(paper.pmid) in used_pmids:
        gold_relations[paper.pmid] = paper.relations
    
print(len(gold_relations))
print(sum([]))  # leftover placeholder; always prints 0

29
0


In [24]:
sum(map(len, gold_relations.values()))

63

In [25]:
def in_gold(pmid, annot):
    for gold in gold_relations[str(pmid)]:
        if gold == annot:
            return True
        
    return False

In [26]:
is_in_gold = []
for idx, row in results.iterrows():
    pmid = row["pmid"]
    temp = row["id_pair"].split("_induces_")
    annot = Relation(temp[0], temp[1])
    
    is_in_gold.append(int(in_gold(pmid, annot)))
    
results["in_gold"] = is_in_gold

In [27]:
results

Unnamed: 0,id_pair,normalized_score,uniq_id,unit_id,pmid,in_gold
2,D005839_induces_D007674,0.794784,bcv_id_45,741091329,1130930,0
0,D002512_induces_D007674,0.770023,bcv_id_46,741091330,1130930,0
1,D005839_induces_D007683,0.605403,bcv_id_45,741091329,1130930,1
2,D002512_induces_D007683,0.594662,bcv_id_46,741091330,1130930,1
0,D005839_induces_D009846|D051437,0.420701,bcv_id_45,741091329,1130930,0
3,D002512_induces_D009846|D051437,0.389918,bcv_id_45,741091329,1130930,0
1,D002512_induces_D051437,0.180698,bcv_id_46,741091330,1130930,0
0,D008094_induces_D007674,0.816750,bcv_id_6,741091290,1378968,0
1,D008094_induces_D011507,0.816750,bcv_id_6,741091290,1378968,1
3,D008094_induces_D006973,0.816750,bcv_id_6,741091290,1378968,1


In [34]:
results.to_csv("data/ROC_test.txt", sep = '\t', index = False)

Generation of the ROC curve is done in R.
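As a quick sanity check from within Python, the area under the ROC curve can also be estimated directly from a list of scores and 0/1 labels. The following is a minimal sketch using the pairwise (Mann-Whitney) formulation, not the ROCR procedure used for the actual analysis. The function name `roc_auc` and the example values are hypothetical.

```python
def roc_auc(scores, labels):
    """Estimate the area under the ROC curve.

    Over all (positive, negative) pairs, counts how often the positive
    example has the higher score. Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: three of the four positive/negative pairs are ranked correctly.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

On the real data this could be called as `roc_auc(results["normalized_score"], results["in_gold"])`. The pairwise loop is quadratic in the number of rows, which is acceptable for a table of this size.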