# Processing pre-annotated data for BeFree

Tong Shu Li<br>
Created on 2015-08-10<br>
Last updated 2015-08-13

Alex and Laura sent us 3000 abstracts containing sentences that they would like the crowd to annotate. These sentences have already been processed for us (annotated and split), so we just need to get them into the correct form for CrowdFlower.

This notebook processes the data into something that can be used by CrowdFlower.

### Note:

The files Alex sent us directly have carriage returns (\r) which need to be stripped.

CrowdFlower can display Unicode properly.
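
As an illustration of the carriage-return cleanup mentioned above, here is a minimal sketch. The helper name `strip_carriage_returns` and the two-row sample are hypothetical and only mimic the tab-separated layout; this is not the code that was actually used to clean the files.

```python
def strip_carriage_returns(raw_text):
    """Remove every carriage return so only "\n" line endings remain."""
    return raw_text.replace("\r", "")

# Hypothetical sample with Windows-style "\r\n" line endings.
sample = "pmid\tmention_text\r\n18477409\tfatty-acid\r\n"

cleaned = strip_carriage_returns(sample)

# Split into rows and tab-separated fields to confirm nothing else changed.
rows = [line.split("\t") for line in cleaned.splitlines()]
```

Removing the character outright, rather than replacing it with a space, keeps the sentence-relative character offsets of every other field untouched.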

In [1]:
import pandas as pd
from collections import defaultdict  # used below by grab_names() and process_sentence_task()

In [2]:
from src.data_model import *

---

### Read the data Alex sent us:

In [3]:
# quoting = 3 is csv.QUOTE_NONE: quote characters get no special treatment at all
chem_data = pd.read_csv("data/befree/chemicals_entities_tmChem_sent_relative_offsets_v2.txt",
                        sep = '\t', quoting = 3)

In [4]:
dise_data = pd.read_csv("data/befree/diseases_entities_DNorm_sent_relative_offsets_v3.txt",
                        sep = '\t', quoting = 3)

In [5]:
chem_data.shape

(14995, 8)

In [6]:
dise_data.shape

(19109, 8)

In [7]:
chem_data.head()

Unnamed: 0,pmid,num_sent,mention_start,mention_stop,mention_text,mention_type,mention_id,sentence_text
0,18477409,1,21,67,omega-3 long-chain polyunsaturated fatty acids,Chemical,-1,While consumption of omega-3 long-chain polyun...
1,18477409,6,31,41,fatty-acid,Chemical,CHEBI:35366,Erythrocytes were analyzed for fatty-acid cont...
2,18477409,9,160,181,eicosapentaenoic acid,Chemical,MESH:D015118,"However, natural killer (NK) (CD3-CD16+CD56+) ..."
3,18477409,9,185,205,docosahexaenoic acid,Chemical,CHEBI:28125,"However, natural killer (NK) (CD3-CD16+CD56+) ..."
4,18477409,10,179,185,LCPUFA,Chemical,-1,No significant correlations were found with re...


In [8]:
dise_data.head()

Unnamed: 0,pmid,num_sent,mention_start,mention_stop,mention_text,mention_type,mention_id,sentence_text
0,18477409,1,123,143,inflammatory disease,Disease,MESH:D007249,While consumption of omega-3 long-chain polyun...
1,18477409,1,152,172,rheumatoid arthritis,Disease,MESH:D001172,While consumption of omega-3 long-chain polyun...
2,18477409,2,259,271,inflammation,Disease,MESH:D007249,The objective of this study was to determine w...
3,18477409,5,79,100,neutrophil chemotaxis,Disease,MESH:C565534,Leukocytes were also examined for lymphoprolif...
4,18477409,11,11,32,neutrophil chemotaxis,Disease,MESH:C565534,"Similarly, neutrophil chemotaxis, chemokinesis..."


### Convert all text to Unicode:

In [9]:
def convert_to_unicode(columns, dataframe):
    for col in columns:
        dataframe.loc[:, col] = dataframe.loc[:, col].map(lambda x: x.decode("utf-8"))
        
    return dataframe    

In [10]:
chem_data = convert_to_unicode(["mention_text", "sentence_text"], chem_data)

In [11]:
dise_data = convert_to_unicode(["mention_text", "sentence_text"], dise_data)

### Check that the indices match:

In [12]:
def check(row):
    start = int(row["mention_start"])
    stop = int(row["mention_stop"])
    sent = row["sentence_text"]
    text = row["mention_text"]
    
    return sent[start : stop] == text
    
def check_indicies(dataframe):
    res = dataframe[["mention_text", "mention_start", "mention_stop", "sentence_text"]].apply(check, axis = 1)
    print res.value_counts()

In [13]:
check_indicies(chem_data)

True    14995
dtype: int64


In [14]:
check_indicies(dise_data)

True    19109
dtype: int64


The indices of the text match only if everything is in Unicode. **They do not match if the text is a regular byte string (`str`), because each non-ASCII character then occupies more than one position.**
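
The effect can be reproduced on a tiny example. The sentence, the mention and the offsets below are made up for illustration; only the slicing behaviour is the point.

```python
# Hypothetical sentence containing one non-ASCII character (i with diaeresis).
sentence = u"na\u00efve Bayes"
mention = u"Bayes"
start, stop = 6, 11  # character offsets of the mention

# Slicing the Unicode text uses character positions, so the mention is found.
matches_as_text = sentence[start:stop] == mention

# In UTF-8 the non-ASCII character takes two bytes, shifting everything after it.
encoded = sentence.encode("utf-8")
matches_as_bytes = encoded[start:stop] == mention.encode("utf-8")
```

Here `matches_as_text` is true while `matches_as_bytes` is false, because the byte slice starts one position too early.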

### Filter by identifier type

Zhiyong confirmed that the identifiers used for any relation will always be MeSH. Therefore we will filter out all identifiers which are not MeSH, because those will definitely not exist in the gold standard.

### Count id types:

In [15]:
def count_types(dataframe):
    ids = dataframe["mention_id"]
    ids = ids.map(lambda x: "|".join(map(lambda v: v.split(":")[0], x.split("|"))))
    print ids.value_counts()

In [16]:
count_types(chem_data)

MESH         12076
-1            1796
CHEBI         1104
MESH|MESH       19
dtype: int64


In [17]:
count_types(dise_data)

MESH    18999
OMIM      110
dtype: int64


We see that 2900 rows of the chemical mentions (1796 with no identifier and 1104 with ChEBI identifiers) are not using MeSH identifiers, and that only 110 rows of the diseases are not using MeSH identifiers. We will retain only the MeSH ids. 19 rows use compound MeSH ids, so we need to remember to process those properly.
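
A small standalone sketch of how compound ids can be handled when checking the identifier source. The helper names `id_prefixes` and `is_all_mesh` and the compound id in the sample are hypothetical; the sketch uses plain Python rather than pandas so it can be read on its own.

```python
from collections import Counter

def id_prefixes(mention_id):
    """Return the source of each sub-id joined by pipes, e.g. "MESH|MESH"."""
    return "|".join(sub_id.split(":")[0] for sub_id in mention_id.split("|"))

def is_all_mesh(mention_id):
    """True only if every sub-id of a (possibly compound) id is a MeSH id."""
    return all(prefix == "MESH" for prefix in id_prefixes(mention_id).split("|"))

# The last entry is a made-up compound id, used for illustration only.
sample_ids = ["MESH:D015118", "CHEBI:28125", "-1", "MESH:D000001|MESH:D000002"]

counts = Counter(id_prefixes(mention_id) for mention_id in sample_ids)
kept = [mention_id for mention_id in sample_ids if is_all_mesh(mention_id)]
```

Checking every sub-id instead of only the first one is a choice made for this sketch. The counts above show no mixed-source compound ids in this data, so looking at the first prefix alone gives the same result here.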

### Retain MeSH rows only:

In [18]:
def get_mesh_rows(dataframe):
    """
    Look at all rows and return a dataframe containing only MeSH id rows.
    """
    ids = dataframe["mention_id"]
    ids = ids.map(lambda x: x.split(":")[0])
    dataframe["id_type"] = ids
    dataframe = dataframe.query("id_type == 'MESH'")
    return dataframe

In [19]:
mesh_chem = get_mesh_rows(chem_data)

In [20]:
mesh_dise = get_mesh_rows(dise_data)

In [21]:
mesh_chem["id_type"].value_counts()

MESH    12095
dtype: int64

In [22]:
mesh_dise["id_type"].value_counts()

MESH    18999
dtype: int64

---

### Join into one large dataframe:

In [23]:
clean_data = pd.concat([mesh_chem, mesh_dise])

In [24]:
clean_data.shape

(31094, 9)

### Remove the "MESH:" part from the ids:

In [25]:
def clean_identifier(val):
    """
    Some ids are compound ids joined by pipes.
    """
    sub_ids = val.split("|")
    cleaned = map(lambda x: x.split(":")[1], sub_ids)
    return "|".join(cleaned)

clean_data.loc[:, "mention_id"] = clean_data.loc[:, "mention_id"].map(clean_identifier)

In [26]:
clean_data.head()

Unnamed: 0,pmid,num_sent,mention_start,mention_stop,mention_text,mention_type,mention_id,sentence_text,id_type
2,18477409,9,160,181,eicosapentaenoic acid,Chemical,D015118,"However, natural killer (NK) (CD3-CD16+CD56+) ...",MESH
5,16819507,0,85,96,hydroxyurea,Chemical,D006918,RNA polymerase II transcription is required fo...,MESH
6,16819507,4,63,74,hydroxyurea,Chemical,D006918,We found that human papillomavirus type 16 E7 ...,MESH
7,16819507,4,131,145,alpha-amanitin,Chemical,D053959,We found that human papillomavirus type 16 E7 ...,MESH
9,22280402,0,71,79,androgen,Chemical,D000728,"""True"" antiandrogens-selective non-ligand-bind...",MESH


---

### Total PMIDs in dataset:

In [27]:
len(clean_data["pmid"].unique())

2756

## Generating data file for CrowdFlower:

We will not use our previous data generation scheme since this time we are only working on sentences. We will need to adapt our previous work unit formatter to work only on these sentences.

To reuse our previous work unit formatter code, we need to fit Alex's data into our data_model's Sentence class. This should be relatively easy. Once we have that we can just pass things off to the <code>process_sentence_task()</code> function, which will take care of the rest.

### Converting data to Sentence objects:

In [28]:
# for each unique sentence
# grab all concept annotations in that sentence
# sort the annotations in order
# convert to a Sentence object

def create_annotation(row):
    return Annotation(row["mention_id"], row["mention_type"],
                      row["mention_text"], row["mention_start"], row["mention_stop"])

def create_sentences(dataframe):
    """
    Given a dataframe, converts it to a batch of Sentence objects.
    """
    sentences = []
    for info, group in dataframe.groupby(["pmid", "num_sent"]):
        # for each unique sentence
        
        sent_text = group["sentence_text"].iloc[0]
        
        # each row in group is a single annotation
        
        # use each row to make one annotation object
        annotations = list(group[["mention_id", "mention_type", "mention_text", "mention_start", "mention_stop"]].apply(
            create_annotation, axis = 1))
        
        annotations = sorted(annotations)
        
        sentences.append(Sentence(info[0], info[1], sent_text, 0, len(sent_text), annotations))
        
    return sentences

In [29]:
all_sentences = create_sentences(clean_data)

In [30]:
len(all_sentences)

17198

### Create formatted data for CrowdFlower:

Now that we have Alex's data in our Sentence objects, we can use our existing data formatting code to make the tsv file for CrowdFlower.
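
The core of that formatting is wrapping each annotated span of a sentence in an HTML `<span>` tag whose class is the concept type. The following is a simplified, self-contained sketch of the idea, not the notebook's own implementation; `wrap_spans`, the example sentence and its offsets are all hypothetical, and spans are assumed not to overlap.

```python
def wrap_spans(text, spans):
    """Wrap each (start, stop, css_class) span of text in an HTML span tag.

    Spans are assumed not to overlap; they may be given in any order.
    """
    pieces = []
    cursor = 0
    for start, stop, css_class in sorted(spans):
        pieces.append(text[cursor:start])  # plain text before the span
        pieces.append('<span class="{0}">{1}</span>'.format(css_class, text[start:stop]))
        cursor = stop
    pieces.append(text[cursor:])  # plain text after the last span
    return "".join(pieces)

# Made-up sentence with one chemical and one disease mention.
sentence = "Aspirin can cause ulcers."
spans = [(18, 24, "disease"), (0, 7, "chemical")]

highlighted = wrap_spans(sentence, spans)
```

For this input, `highlighted` is `<span class="chemical">Aspirin</span> can cause <span class="disease">ulcers</span>.`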

In [31]:
def add_simple_tag(tag_name, tag_class, text):
    return "<{0} class=\"{1}\">{2}</{0}>".format(tag_name, tag_class, text)

In [32]:
def highlight_concepts(text, breaks):
    """
    Inserts HTML tags around the pieces of text
    which need to be highlighted in a string.
    """
    breaks = sorted(breaks, key = lambda x: x[0])
    
    final = []
    for i in range(len(breaks) - 1):
        s = text[breaks[i][0] : breaks[i+1][0]]
        if breaks[i][1] != "n":
            s = add_simple_tag("span", breaks[i][1], s)
            
        final.append(s)
        
    return "".join(final)

In [33]:
def highlight_text(text, offset, uniq_spans):
    """
    Given a string and the annotations which fall
    within this string, highlights the concepts.
    """
    # index of break, type of break (n = nothing)
    breaks = [(0, "n"), (len(text), "n")]
    
    for span in uniq_spans:
        breaks.append((span.start - offset, span.stype))
        breaks.append((span.stop - offset, "n"))
            
    return highlight_concepts(text, breaks)

In [34]:
def grab_names(annotations):
    """
    Determines the unique names of the annotations.
    """
    # determine the names of the concept identifiers
    used_names = defaultdict(set) # lower case set of used names (to avoid repeats)
    real_name = defaultdict(set) # set of unique names verbatim (to preserve capitalization)
    for annotation in annotations:
        if annotation.text.lower() not in used_names[annotation.stype]:
            used_names[annotation.stype].add(annotation.text.lower())
            real_name[annotation.stype].add(annotation.text)
            
    return real_name

In [35]:
def process_sentence_task(sentence, rel_pairs):
    """
    Given a Sentence object, and the set of chemical-disease relation
    identifier pairs, creates a set of sentence-level verification tasks.
    """
    data = defaultdict(list)
    for chemical_id, disease_id in rel_pairs:
        spans = [annotation for annotation in sentence.annotations if annotation.uid in [chemical_id, disease_id]]
        real_name = grab_names(spans)

        data["pmid"].append(sentence.pmid)
        
        data["form_sentence"].append(highlight_text(sentence.text, sentence.start, spans))
        
        data["chemical_id"].append(chemical_id)
        data["disease_id"].append(disease_id)
        
        data["chemical_name"].append(add_simple_tag("span", "chemical", "/".join(real_name["chemical"])))
        data["disease_name"].append(add_simple_tag("span", "disease", "/".join(real_name["disease"])))
        
        data["relation_pair_id"].append("{0}_{1}_{2}".format(sentence.pmid, chemical_id, disease_id))
        
        data["sentence_id"].append(sentence.uid)
        
    return pd.DataFrame(data)

---

In [36]:
def create_work_units(dataset):
    work_units = []
    for sentence in dataset:
        # process all the relations, even those we think are true due to the CID format
        res = process_sentence_task(sentence, sentence.poss_relations[False] | sentence.poss_relations[True])
        
        work_units.append(res)
        
    work_units = pd.concat(work_units).reset_index(drop = True)
    work_units["uniq_id"] = map(lambda x: "bcv_sentence_task_befree_{0}".format(x), work_units.index)
    
    return work_units

In [37]:
work_units = create_work_units(all_sentences)

In [38]:
work_units.shape

(5160, 9)

In [39]:
work_units.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,form_sentence,pmid,relation_pair_id,sentence_id,uniq_id
0,D016047,"<span class=""chemical"">2',3'-dideoxycytidine</...",D000744,"<span class=""disease"">immunodeficiency</span>","Novel mutation in the human <span class=""disea...",1279198,1279198_D016047_D000744,1279198_0,bcv_sentence_task_befree_0
1,D016049,"<span class=""chemical"">2',3'-dideoxyinosine</s...",D000744,"<span class=""disease"">immunodeficiency</span>","Novel mutation in the human <span class=""disea...",1279198,1279198_D016049_D000744,1279198_0,bcv_sentence_task_befree_1
2,D016047,"<span class=""chemical"">ddC/2',3'-dideoxycytidi...",D000744,"<span class=""disease"">immunodeficiency</span>",We have used the technique of in vitro selecti...,1279198,1279198_D016047_D000744,1279198_1,bcv_sentence_task_befree_2
3,D016049,"<span class=""chemical"">ddI/2',3'-dideoxyinosin...",D000744,"<span class=""disease"">immunodeficiency</span>",We have used the technique of in vitro selecti...,1279198,1279198_D016049_D000744,1279198_1,bcv_sentence_task_befree_3
4,D008274,"<span class=""chemical"">magnesium</span>",D009135,"<span class=""disease"">muscle damage</span>",This study examined the effect of <span class=...,1299490,1299490_D008274_D009135,1299490_1,bcv_sentence_task_befree_4


So we have 5160 sentence-bound relation pairs which need to be verified by the crowd. This is roughly 4x the number of rows of job 761593, which was extracted from 200 abstracts. Thankfully the scaling is not linear; otherwise we would have about 20000 relations to verify.

To process such a large amount of data, we will need a lot of test questions, since that is the primary factor limiting how much work a single worker can do. We will probably also want to raise the number of rows per page so that people can do more work with the same number of test questions.
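
The trade-off can be made concrete with a rough calculation. This sketch assumes that every page shown to a worker contains exactly one test question and that a worker never sees the same test question twice; it also ignores any test questions used up before real work starts. These are assumptions about how the platform serves test questions, and the numbers are purely illustrative.

```python
def max_work_rows(num_test_questions, rows_per_page):
    """Upper bound on the real (non-test) rows a single worker can judge.

    Assumes one test question per page, never repeated for the same worker.
    """
    return num_test_questions * (rows_per_page - 1)

# Illustrative numbers only: 100 test questions at two page sizes.
small_pages = max_work_rows(100, 5)
large_pages = max_work_rows(100, 10)
```

Under these assumptions, doubling the page size from 5 to 10 rows raises the ceiling from 400 to 900 real rows per worker without writing a single new test question.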

### Add test questions to data file:

In [40]:
dev_set = pd.read_csv("data/crowdflower/test_questions/sent_dev_set_762850.tsv", sep = '\t')

In [41]:
train_set = pd.read_csv("data/crowdflower/test_questions/sent_train_set_revised_760841.tsv", sep = '\t')

In [42]:
test_ques = pd.concat([dev_set, train_set])

In [43]:
test_ques.shape

(153, 12)

CrowdFlower has a bug where if a test question has only one correct answer and is uploaded using the batch uploader, then the job will not collect any judgements at all. This is due to a backend error where the representation of the correct answer is different depending on how a test question was created.

The correct representation should be as a list with a single element. However when a test question is created through batch upload, the representation is as a string, and not as a list. This causes some error in their backend.

I discovered this error through trial and error on 2015-08-09, and CrowdFlower has acknowledged this bug as of 2015-08-13. As of 2015-08-13 the bug has not been fixed.

The workaround I found was to do the following:
- Create two new columns, called "useless_gold" and "useless_gold_reason" where "useless" is a non-existent CML element name.
- Populate the "useless_gold" column with "test\ntest\ntest"
- When this file is uploaded and processed by the batch uploader, it seems that the presence of newlines forces the representation to the list version. However, only the useless_gold column will have the list representation. The real gold column will still be a string.
- This will allow the job to collect judgements normally. Why this works is beyond me.
- With this bug, it is impossible to tell if the gold_reason columns are being displayed properly to the workers.

CrowdFlower support wrote back with two other workarounds:
1. Add a single newline to the end of the single answer choice in the _gold column. This causes the representation to be a list with a single element.
2. Save each test question again in the online interface. This guarantees the correct representation internally, but is time-consuming and therefore best avoided.
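
The first workaround from CrowdFlower support amounts to a one-line string change per gold answer. Below is a minimal sketch on plain strings; the helper name `with_trailing_newline` and the answer value `yes_direct` are hypothetical.

```python
def with_trailing_newline(answer):
    """Append one newline unless the answer already ends with one."""
    return answer if answer.endswith("\n") else answer + "\n"

# "yes_direct" is a made-up answer value; the second entry is already fixed.
gold_answers = ["no_relation", "yes_direct\n"]

fixed_answers = [with_trailing_newline(answer) for answer in gold_answers]
```

The guard against adding a second newline is a defensive choice of this sketch, so the step can be rerun safely. Whether a doubled newline would upset the uploader is not something the support reply covered.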

In [44]:
test_ques.loc[:, "original_job_id"] = test_ques.loc[:, "original_job_id"].map(int)

test_ques.loc[:, "verify_relationship_gold"] = test_ques.loc[:, "verify_relationship_gold"].map(lambda x: "{0}\n".format(x))

test_ques["_golden"] = "TRUE"

In [45]:
test_ques.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,form_sentence,original_job_id,pmid,relation_pair_id,sentence_id,uniq_id,verify_relationship_gold,verify_relationship_gold_reason,_golden
0,D012701,"<span class=""chemical"">5-HT</span>",D011605,"<span class=""disease"">psychotic disorders</span>",These animal models were considered to reflect...,762850,20705401,20705401_D012701_D011605,20705401_4,bcv_easy_68_sent_dev_set_762850,no_relation\n,The receptors to 5-HT6 are related to psychoti...,True
1,D010396,"<span class=""chemical"">D-penicillamine</span>",D012594,"<span class=""disease"">localized scleroderma</s...","Case reports of 11 patients with severe, exten...",762850,2334179,2334179_D010396_D012594,2334179_3,bcv_easy_2399_sent_dev_set_762850,no_relation\n,D-penicillamine was used to treat the patients.,True
2,D015016,"<span class=""chemical"">yohimbine</span>",D019964,"<span class=""disease"">affective disorders</span>",METHOD: Six patients with either obsessive com...,762850,1535072,1535072_D015016_D019964,1535072_4,bcv_easy_642_sent_dev_set_762850,no_relation\n,The sentence says the affective disorders were...,True
3,D002118,"<span class=""chemical"">calcium</span>",D013035,"<span class=""disease"">muscle spasms</span>",While severe hypokalemia may cause muscle weak...,762850,8492347,8492347_D002118_D013035,8492347_2,bcv_easy_1271_sent_dev_set_762850,no_relation\n,The sentence says calcium supplementation cann...,True
4,D064730,"<span class=""chemical"">dexrazoxane</span>",D006402,"<span class=""disease"">hematologic toxicity</span>",Clinical trials in patients with brain metasta...,762850,15897593,15897593_D064730_D006402,15897593_8,bcv_easy_2793_sent_dev_set_762850,no_relation\n,Dexrazoxane is being used to try and make the ...,True


### Write final file to disk:

In [46]:
final_units = pd.concat([test_ques, work_units])

In [47]:
final_units.to_csv("data/crowdflower/data_for_befree_sentence_task.tsv", sep = '\t', index = False, encoding = "utf-8")