# Sentence-level CID relation verification task test question processing

Tong Shu Li<br>
Created on: Thursday 2015-08-13<br>
Last updated: 2015-08-13

For the strictly sentence-level CID relation verification task for BioCreative V, we need a lot of test questions in order to ensure that a job will complete quickly. The number of unique test questions determines the maximum amount of work a single worker can perform. Since we want able contributors to do as much work as possible, we need a large set of test questions.

The sentence-level task is used both in the crowd-only processing workflow and in the collaboration with Alex and Laura of BeFree. BeFree only works at the sentence level, so this task closely mimics what the algorithm itself is doing.

CrowdFlower by default displays one test question per page of work, which contains a variable number of work units. An initial test containing one full page of test questions is used to screen workers. If at any time during the task a worker drops below the minimum accuracy threshold, then they are removed from the job and all of their judgements are discredited. However, the workers are still paid for the work they have already performed.

Therefore the number of work units per page must be set carefully, so that good workers can do a large amount of work while bad workers are removed quickly without wasting too much money. The initial quiz page uses up P test questions, and every later page pairs one unseen test question with P - 1 regular work units. The maximum number of work units a single contributor can do is therefore: W_m = (N_t - P)(P - 1), where W_m is the maximum number of work units, N_t is the number of test questions, and P is the number of units per page (counting the test question).

For the 3000 abstracts which need to be annotated for BeFree, there are 17383 unique sentences containing 5279 unique co-occurring chemical-disease identifier pairs. Therefore if there are 150 test questions, and 6 work units per page, then a single contributor can do (150 - 6)(6 - 1) = 720 work units. This is about 13.6% of the entire dataset (720 of 5279), which seems like a reasonably large portion of the data. Any larger and individuals would have too much effect on the dataset.
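The arithmetic above can be checked with a small helper. This is a minimal sketch: the function name `max_work_units` is hypothetical, and it assumes the page layout implied by the formula (a first page made entirely of test questions, then one test question per later page).

```python
def max_work_units(num_test_questions, units_per_page):
    """Upper bound on the work units one contributor can complete.

    Assumes the initial quiz page uses `units_per_page` test questions and
    every later page shows one unseen test question plus
    `units_per_page - 1` regular work units.
    """
    if units_per_page < 1 or num_test_questions < units_per_page:
        raise ValueError("need at least one full page of test questions")
    pages_after_quiz = num_test_questions - units_per_page
    return pages_after_quiz * (units_per_page - 1)


# Settings discussed above: 150 test questions, 6 units per page.
limit = max_work_units(150, 6)
print(limit)  # 720
```

Varying `units_per_page` with this helper shows the trade-off directly: larger pages let a trusted worker do more, but also let an untrusted worker complete more paid units before the next test question catches them.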

### Test Question Origins

Since many prototypes and iterations of the sentence-level task have been tested, test questions come from a variety of sources. A list of the origins of each test question and how it was generated is given in the below table.

Test question set | # Test questions | BioCreative dataset of origin | Method of generation | Used in jobs | Description
--- | --- | --- | --- | --- | ---
sent_dev_set_762850 | 101 | Development set 500 abstracts | CrowdFlower #762850 online interface | N/A | This set of 101 test questions was generated by Toby by hand using the CrowdFlower online interface. All 500 development set abstracts were used to generate this set of test questions. The answer distribution is 45% yes_direct and 55% no_relation.
sent_train_set_revised_760841 | 52 | Training set 500 abstracts | CrowdFlower #760841 online interface | #761593 | The original 54 test questions (containing the NER error choice) from the training set were generated by Toby online in job #760841. The test questions were used in job #761593. After running job #761593, the test questions were revised in the online interface in job #763633. The NER error choice was removed, and test questions that used it were changed to "no_relation". Two confusing test questions were also dropped for clarity, leaving 52.
sent_work_dev_set_761593 | 400 | Development set 200 abstracts | Extracted from work units of job #761593 | #761593 | This set of test questions was derived from the normal work units seen by the crowd in job #761593. The original data came from a randomly selected 200 abstracts of the development set. From the set of 1279 work units for #761593, a subset of work units which had 5 judgements and the answer matched the gold standard was selected. The data was further subsetted to ensure an even answer choice distribution.

---

In [1]:
from __future__ import division
from collections import Counter
from collections import defaultdict
from IPython.display import Image
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pickle

In [2]:
NUMPY_RAND_SEED = 993402

In [3]:
from src.filter_data import filter_data
from src.data_model import Relation
from src.aggregate_results import *

---

### Create the sent_train_set_revised_760841 test question set:

In [4]:
train_set_revised = pd.read_csv("data/crowdflower/gold_reports/job_763633_gold_report.csv", sep = ',')

In [5]:
train_set_revised.shape

(54, 31)

In [6]:
# remove the two bad test questions
train_set_revised = train_set_revised.query("~_hidden")

In [7]:
# columns to keep
columns = ["verify_relationship_gold",
           "verify_relationship_gold_reason",
           "chemical_id", "chemical_name",
           "disease_id", "disease_name",
           "form_sentence", "original_job_id",
           "pmid", "relation_pair_id", "uniq_id"]

In [8]:
train_set_revised = train_set_revised[columns]

In [9]:
def renamer(val):
    return "{0}_sent_train_set_revised_760841".format(val.split("_job")[0])

train_set_revised.loc[:, "uniq_id"] = train_set_revised.loc[:, "uniq_id"].map(renamer)

In [10]:
train_set_revised.shape

(52, 11)

In [11]:
train_set_revised.head()

Unnamed: 0,verify_relationship_gold,verify_relationship_gold_reason,chemical_id,chemical_name,disease_id,disease_name,form_sentence,original_job_id,pmid,relation_pair_id,uniq_id
0,yes_direct,The patient developed psychosis after receivin...,D010672,"<span class=""chemical"">phenytoin</span>",D011605,"<span class=""disease"">psychosis</span>",The case of a nonepileptic patient who develop...,760841,14698717,14698717_D010672_D011605,bcv_easy_3152_sent_train_set_revised_760841
1,no_relation,Azathioprine had no effect on cancer prevalenc...,D001379,"<span class=""chemical"">azathioprine</span>",D009369,"<span class=""disease"">cancers</span>",There have been several long-term studies of p...,760841,3970039,3970039_D001379_D009369,bcv_easy_2200_sent_train_set_revised_760841
2,no_relation,The sentence says that the effects of diazepam...,D011433,"<span class=""chemical"">propranolol</span>",D016584,"<span class=""disease"">panic disorders</span>",The effects of oral doses of diazepam (single ...,760841,6387529,6387529_D011433_D016584,bcv_easy_174_sent_train_set_revised_760841
3,yes_direct,Giving mepivacaine and adrenaline together cau...,D004837,"<span class=""chemical"">adrenaline</span>",D001281,"<span class=""disease"">atrial fibrillation</span>","An increase in blood pressure, accompanied by ...",760841,9698967,9698967_D004837_D001281,bcv_easy_762_sent_train_set_revised_760841
4,yes_direct,Long-term patient monitoring showed that both ...,D007654,"<span class=""chemical"">ketoconazole</span>",D006973,"<span class=""disease"">hypertension</span>",In both cases normal plasma and urinary free c...,760841,2632720,2632720_D007654_D006973,bcv_easy_58_sent_train_set_revised_760841


---

### Create the sent_dev_set_762850 test question set:

In [12]:
dev_set = pd.read_csv("data/crowdflower/gold_reports/job_762850_gold_report.csv", sep = ',')

In [13]:
dev_set.shape

(101, 20)

In [14]:
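# keep only the data columns (CrowdFlower metadata columns start with "_")
columns = [col for col in dev_set.columns.values if not col.startswith("_")]
# drop the last remaining column
columns = columns[:-1]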

In [15]:
dev_set = dev_set[columns]
dev_set.loc[:, "uniq_id"] = dev_set.loc[:, "uniq_id"].map(lambda x: "{0}_sent_dev_set_762850".format(x))
dev_set["original_job_id"] = "762850"

In [16]:
dev_set.shape

(101, 12)

In [17]:
dev_set.head()

Unnamed: 0,verify_relationship_gold,verify_relationship_gold_reason,chemical_id,chemical_name,disease_id,disease_name,form_sentence,pmid,relation_pair_id,sentence_id,uniq_id,original_job_id
0,no_relation,The receptors to 5-HT6 are related to psychoti...,D012701,"<span class=""chemical"">5-HT</span>",D011605,"<span class=""disease"">psychotic disorders</span>",These animal models were considered to reflect...,20705401,20705401_D012701_D011605,20705401_4,bcv_easy_68_sent_dev_set_762850,762850
1,no_relation,D-penicillamine was used to treat the patients.,D010396,"<span class=""chemical"">D-penicillamine</span>",D012594,"<span class=""disease"">localized scleroderma</s...","Case reports of 11 patients with severe, exten...",2334179,2334179_D010396_D012594,2334179_3,bcv_easy_2399_sent_dev_set_762850,762850
2,no_relation,The sentence says the affective disorders were...,D015016,"<span class=""chemical"">yohimbine</span>",D019964,"<span class=""disease"">affective disorders</span>",METHOD: Six patients with either obsessive com...,1535072,1535072_D015016_D019964,1535072_4,bcv_easy_642_sent_dev_set_762850,762850
3,no_relation,The sentence says calcium supplementation cann...,D002118,"<span class=""chemical"">calcium</span>",D013035,"<span class=""disease"">muscle spasms</span>",While severe hypokalemia may cause muscle weak...,8492347,8492347_D002118_D013035,8492347_2,bcv_easy_1271_sent_dev_set_762850,762850
4,no_relation,Dexrazoxane is being used to try and make the ...,D064730,"<span class=""chemical"">dexrazoxane</span>",D006402,"<span class=""disease"">hematologic toxicity</span>",Clinical trials in patients with brain metasta...,15897593,15897593_D064730_D006402,15897593_8,bcv_easy_2793_sent_dev_set_762850,762850


---

### Create the sent_work_dev_set_761593 test questions:

In [18]:
settings = {
    "loc": "data/crowdflower/results",
    "fname": "job_761593_full_with_untrusted.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

raw_data = filter_data(settings)

In [19]:
def read_gold_standard(dataset, file_format = "list"):
    assert dataset in ["training", "development"]
    assert file_format in ["list", "dict"]
    
    fname = "data/{0}/parsed_{0}_set_{1}.pickle".format(dataset, file_format)
    
    if os.path.exists(fname):
        print "Reading cached version of {0} set ({1})".format(dataset, file_format)
        
        with open(fname, "rb") as fin:
            data = pickle.load(fin)
    else:
        print "Parsing raw {0} file".format(dataset)
        data = parse_input("data/{0}".format(dataset),
                           "CDR_{0}Set.txt".format(dataset.capitalize()),
                           return_format = file_format)
        
        with open(fname, "wb") as fout:
            pickle.dump(data, fout)
            
    return data

development_set = read_gold_standard("development", "dict")

Reading cached version of development set (dict)


In [20]:
def in_gold(row):
    pmid = int(row["pmid"])
    return int(development_set[pmid].has_relation(Relation(pmid, row["chemical_id"], row["disease_id"])))

### Remove tainted judgements

In [21]:
raw_data = raw_data.query("~_tainted")

In [22]:
raw_data.shape

(5500, 31)

### Add the original sentence id back in:

In [23]:
sentence_id_mapping = pd.read_csv("data/befree/sentence_id_mapping.tsv", sep = '\t')

In [24]:
def get_sentence_id(uniq_id):
    temp = sentence_id_mapping.query("uniq_id == '{0}'".format(uniq_id))
    return temp["sentence_id"].iloc[0]

raw_data.loc[:, "sentence_id"] = raw_data["uniq_id"].map(get_sentence_id)
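The lookup above runs one `query` per judgement row. A minimal alternative sketch, using made-up toy data in place of `raw_data` and `sentence_id_mapping`, builds the lookup once and maps it over the column. The frame names and values below are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the notebook's data; all values are made up.
judgements = pd.DataFrame({
    "uniq_id": ["bcv_easy_1", "bcv_easy_1", "bcv_easy_2"],
    "_worker_id": [11, 12, 11],
})
mapping = pd.DataFrame({
    "uniq_id": ["bcv_easy_1", "bcv_easy_2"],
    "sentence_id": ["100_0", "200_3"],
})

# Build the uniq_id -> sentence_id lookup once, then apply it to every row.
lookup = mapping.drop_duplicates("uniq_id").set_index("uniq_id")["sentence_id"]
judgements["sentence_id"] = judgements["uniq_id"].map(lookup)
```

One behavioural difference to keep in mind: an unknown `uniq_id` raises an error in the per-row version, while `map` silently yields a missing value, so a check such as `judgements["sentence_id"].notnull().all()` would be worth adding.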

### Count the number of judgements for each work unit:

In [25]:
work_votes = defaultdict(set)
for uniq_id, group in raw_data.groupby("uniq_id"):
    work_votes[len(group["_worker_id"].unique())].add(uniq_id)

### Take only the work units which received a full 5 judgements

In [26]:
raw_data = raw_data.query("uniq_id in {0}".format(list(work_votes[5])))

In [27]:
raw_data.shape

(5175, 32)
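The two steps above (count distinct workers per work unit, then keep fully judged units) can also be sketched with `groupby` and `isin`. The toy frame and the `REQUIRED` constant are assumptions for illustration; the notebook requires 5 judgements.

```python
import pandas as pd

# Toy judgements; unit "c" has a repeated worker, so only 2 distinct workers.
judgements = pd.DataFrame({
    "uniq_id": ["a", "a", "a", "b", "b", "c", "c", "c"],
    "_worker_id": [1, 2, 3, 1, 2, 1, 2, 2],
})
REQUIRED = 3  # illustrative; the notebook uses 5

# Number of distinct workers who judged each work unit.
workers_per_unit = judgements.groupby("uniq_id")["_worker_id"].nunique()

# Keep only the rows belonging to units with exactly REQUIRED distinct workers.
full_units = workers_per_unit[workers_per_unit == REQUIRED].index
full = judgements[judgements["uniq_id"].isin(full_units)]
```

Counting distinct workers rather than rows means a duplicated judgement from the same worker does not make a unit look complete.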

### Aggregate results without choice mapping:

In [28]:
res = aggregate_results("uniq_id", "verify_relationship", raw_data, "majority_vote",
                        ["pmid", "_unit_id", "chemical_id", "disease_id", "relation_pair_id", "sentence_id"])

In [29]:
len(res["uniq_id"].unique())

1035

### Determine whether each relation is in the gold standard:

In [30]:
res["in_gold"] = res.loc[:, ["pmid", "chemical_id", "disease_id"]].apply(in_gold, axis = 1)

### Select only work units which had high crowd consensus for use as test questions:

In [31]:
res = res.query("num_votes >= 4")

In [32]:
res.shape

(716, 12)

### Select only sentences where the crowd's response matches the gold standard:

In [33]:
res.loc[:, "response"] = res["verify_relationship"].map(lambda x: int(x == "yes_direct"))

In [34]:
res = res.query("in_gold == response")

In [35]:
res.shape

(562, 13)

In [36]:
len(res["uniq_id"].unique())

562
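The consensus filter and the gold standard filter can be illustrated together on a toy table. This is a sketch under assumptions: the column names follow the notebook, but the rows are invented and the aggregation step is skipped.

```python
import pandas as pd

# Toy aggregated results; all values are made up for illustration.
aggregated = pd.DataFrame({
    "uniq_id": ["a", "b", "c", "d"],
    "verify_relationship": ["yes_direct", "no_relation",
                            "yes_direct", "no_relation"],
    "num_votes": [5, 4, 3, 5],
    "in_gold": [1, 0, 1, 1],
})

# Binary crowd response: 1 only when the majority answer is yes_direct.
aggregated["response"] = (
    aggregated["verify_relationship"] == "yes_direct"
).astype(int)

# Keep units with at least 4 agreeing votes whose response matches the gold.
keep = (aggregated["num_votes"] >= 4) & (
    aggregated["in_gold"] == aggregated["response"]
)
trusted = aggregated[keep]
```

Because every answer other than `yes_direct` maps to a response of 0, choices such as `ner_mistake` and `yes_indirect` can pass this filter whenever the relation is absent from the gold standard, which is why a separate step restricting to the simple yes/no answers is still needed.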

### Take only the simple yes/no answers:

In [37]:
res["verify_relationship"].value_counts()

no_relation     343
yes_direct      211
ner_mistake       6
yes_indirect      2
dtype: int64

In [38]:
res = res.query("verify_relationship in {0}".format(["yes_direct", "no_relation"]))

In [39]:
res["verify_relationship"].value_counts(normalize = True)

no_relation    0.619134
yes_direct     0.380866
dtype: float64

### Sanity check to see crowd is doing things properly:

In [40]:
def sample_check(dataframe):
    SAMPLE_SIZE = 20
    units = dataframe.sample(SAMPLE_SIZE, random_state = NUMPY_RAND_SEED)
    
    for unit_id in units["unit_id"].unique():
        print "https://crowdflower.com/jobs/761593/units/{0}".format(int(unit_id))

A quick manual review of 20 of the work units for each category showed that the crowd is doing very well on these work units, and that the sentences are clear and unambiguous. Therefore it should be safe to use these sentences as test questions.

### Artificially select work units to keep answer distributions roughly equal:

In [41]:
TEST_QUES_PER_CHOICE = 200

In [42]:
yes_sample = res.query("verify_relationship == 'yes_direct'").sample(TEST_QUES_PER_CHOICE, random_state = NUMPY_RAND_SEED)

In [43]:
no_sample = res.query("verify_relationship == 'no_relation'").sample(TEST_QUES_PER_CHOICE, random_state = NUMPY_RAND_SEED)

In [44]:
test_ques = pd.concat([yes_sample, no_sample])

In [45]:
test_ques["verify_relationship"].value_counts()

yes_direct     200
no_relation    200
dtype: int64
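The balancing step can be written once for any number of answer choices by sampling inside each group. A minimal sketch on invented data follows; `PER_CHOICE` and the toy frame are assumptions, and the seed simply reuses the notebook's value.

```python
import pandas as pd

SEED = 993402   # same value as NUMPY_RAND_SEED in the notebook
PER_CHOICE = 2  # illustrative; the notebook uses 200

# Toy candidates with an uneven answer distribution (3 yes, 4 no).
candidates = pd.DataFrame({
    "uniq_id": ["u{0}".format(i) for i in range(7)],
    "verify_relationship": ["yes_direct"] * 3 + ["no_relation"] * 4,
})

# Draw the same number of rows from each answer choice, without replacement.
balanced = pd.concat([
    group.sample(PER_CHOICE, random_state=SEED)
    for _, group in candidates.groupby("verify_relationship")
])
counts = balanced["verify_relationship"].value_counts()
```

`sample` raises an error if a group has fewer than `PER_CHOICE` rows, so the per-choice size is bounded by the rarest answer (211 `yes_direct` units in the data above).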

In [46]:
test_ques_ids = set(test_ques["uniq_id"])

In [47]:
len(test_ques_ids)

400

### Add the necessary data columns to the test questions:

In [48]:
def get_data(uniq_id, col):
    temp = raw_data.query("uniq_id == '{0}'".format(uniq_id))
    assert len(temp[col].unique()) == 1
    return temp[col].iloc[0]

columns = ["chemical_id", "chemical_name",
          "disease_id", "disease_name",
          "form_sentence"]

for col in columns:
    test_ques.loc[:, col] = test_ques.loc[:, "uniq_id"].map(lambda v: get_data(v, col))

### Remove unnecessary columns:

In [49]:
test_ques = test_ques.drop(["conf_score", "num_votes", "percent_agree", "unit_id", "in_gold", "response"], axis = 1)

In [50]:
test_ques = test_ques.rename(columns = {"verify_relationship": "verify_relationship_gold"})
test_ques["verify_relationship_gold_reason"] = ""
test_ques["original_job_id"] = "761593"
test_ques.loc[:, "uniq_id"] = test_ques.loc[:, "uniq_id"].map(lambda x: "{0}_sent_work_dev_set_761593".format(x))

In [51]:
test_ques.head()

Unnamed: 0,uniq_id,verify_relationship_gold,pmid,chemical_id,disease_id,relation_pair_id,sentence_id,chemical_name,disease_name,form_sentence,verify_relationship_gold_reason,original_job_id
0,bcv_easy_330_sent_work_dev_set_761593,yes_direct,12691807,D007980,D004421,12691807_D007980_D004421,12691807_0,"<span class=""chemical"">Levodopa</span>","<span class=""disease"">dystonia</span>","<span class=""chemical"">Levodopa</span>-induced...",,761593
0,bcv_easy_395_sent_work_dev_set_761593,yes_direct,4038130,D008619,D009135,4038130_D008619_D009135,4038130_5,"<span class=""chemical"">mepivacaine</span>","<span class=""disease"">muscle damage</span>","In addition to <span class=""disease"">muscle da...",,761593
0,bcv_easy_258_sent_work_dev_set_761593,yes_direct,9545159,D004917,D008133,9545159_D004917_D008133,9545159_2,"<span class=""chemical"">erythromycin</span>","<span class=""disease"">Prolongation of QT inter...","<span class=""disease"">Prolongation of QT inter...",,761593
0,bcv_easy_1055_sent_work_dev_set_761593,yes_direct,6892185,D002217,D014202,6892185_D002217_D014202,6892185_4,"<span class=""chemical"">carbachol</span>","<span class=""disease"">tremor</span>","It is apparent that calcium chloride can ""diss...",,761593
0,bcv_easy_809_sent_work_dev_set_761593,yes_direct,7604176,D017035,D009220,7604176_D017035_D009220,7604176_1,"<span class=""chemical"">pravastatin</span>","<span class=""disease"">inflammatory myopathy</s...","A case of acute <span class=""disease"">inflamma...",,761593


### Write to file:

In [52]:
train_set_revised.to_csv("data/crowdflower/test_questions/sent_train_set_revised_760841.tsv", sep = '\t', index = False)

In [53]:
dev_set.to_csv("data/crowdflower/test_questions/sent_dev_set_762850.tsv", sep = '\t', index = False)

In [54]:
test_ques.to_csv("data/crowdflower/test_questions/sent_work_dev_set_761593.tsv", sep = '\t', index = False)

### Test question answer distributions:

In [55]:
temp = pd.concat([train_set_revised, dev_set, test_ques])

In [56]:
temp.shape

(553, 12)

In [57]:
temp["verify_relationship_gold"].value_counts(normalize = True)

no_relation     0.515371
yes_direct      0.479204
yes_indirect    0.005425
dtype: float64

The test question distribution is quite balanced: roughly 52% no_relation and 48% yes_direct, with yes_indirect making up about 0.5%. Although the utility of the yes_indirect choice is not great, it can be used as a method for monitoring and removing cheaters automatically.