# BioCreative V Task 3 CrowdFlower Work Unit Formatter (Refinement version)

Tong Shu Li<br>
Created on: Monday 2015-12-16<br>
Last updated: 2015-12-17

The <code>classify_relations()</code> routine of the <code>Sentence</code> and <code>Paper</code> objects has already separated all possible chemical-disease relation pairs into three disjoint categories:

1. Relations which follow the "[chemical]-induced [disease]" (CID) structure.
2. Relations which co-occur within a sentence but do not follow the CID structure.
3. Relations which do not co-occur within any sentence.

This notebook takes the relation pairs in each category and generates the information needed for the CrowdFlower interface. No decision about which category each relation belongs to is made here.

In [1]:
from collections import defaultdict
import pandas as pd
import sys

In [2]:
sys.path.append("..")

In [3]:
from src.data_model import parse_input
from src.make_cf_work_units import create_work_units

---

### Read our 50 + 50 papers

In [4]:
loc = "../data/refinement"
fname = "CDR_train_50_subset.txt"

train_sub = parse_input(loc, fname, fix_acronyms = False)

In [5]:
loc = "../data/refinement"
fname = "CDR_dev_50_subset.txt"

dev_sub = parse_input(loc, fname, fix_acronyms = False)

In [6]:
testset = train_sub.copy()
testset.update(dev_sub)

In [7]:
len(testset)

100
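The count of 100 only equals 50 + 50 if the two subsets share no entries, since `update()` silently collapses duplicates. Below is a minimal sketch of a merge that makes this check explicit. It assumes the objects returned by `parse_input` behave like a `dict` or `set` (they support `copy()`, `update()` and `len()`, as used above); `merge_checked` and the toy data are hypothetical names for illustration, not part of the project code.

```python
def merge_checked(first, second):
    """Merge two dicts (or two sets) and fail loudly if any entries collide.

    Hypothetical helper: relies only on copy(), update() and len().
    """
    merged = first.copy()
    merged.update(second)
    if len(merged) != len(first) + len(second):
        raise ValueError("the two collections share at least one entry")
    return merged


# Toy stand-ins for the parsed training and development subsets.
train_demo = {"pmid_1": "paper A", "pmid_2": "paper B"}
dev_demo = {"pmid_3": "paper C"}

merged_demo = merge_checked(train_demo, dev_demo)
print(len(merged_demo))
```

If the two subsets ever overlapped, this version would raise an error instead of quietly producing fewer than 100 papers.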

---

## Create CrowdFlower work units

Highlighting strategy:

1. Highlight annotations.
2. Highlight co-occurring sentences.
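The two highlighting steps can be illustrated roughly as follows. This is only a sketch, not the code in `src.make_cf_work_units`: the `highlight` helper, its span format and the toy sentence are assumptions made for illustration. Only the `chemical`, `disease` and `sentence` class names come from the work unit output.

```python
def highlight(text, spans):
    """Wrap character ranges of `text` in <span class="..."> tags.

    `spans` is a list of non-overlapping (start, stop, css_class) tuples.
    Hypothetical helper written for this illustration.
    """
    pieces = []
    cursor = 0
    for start, stop, css_class in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        pieces.append(text[cursor:start])
        pieces.append('<span class="{}">{}</span>'.format(css_class, text[start:stop]))
        cursor = stop
    pieces.append(text[cursor:])
    return "".join(pieces)


# Made-up sentence; the names are placeholders, not real annotations.
sentence = "ChemX may cause DiseaseY."
chem, dis = "ChemX", "DiseaseY"
c, d = sentence.index(chem), sentence.index(dis)

# Step 1: highlight the annotations.
marked = highlight(sentence, [(c, c + len(chem), "chemical"),
                              (d, d + len(dis), "disease")])

# Step 2: mark the whole sentence as one in which the pair co-occurs.
marked_sentence = '<span class="sentence">{}</span>'.format(marked)
print(marked_sentence)
```

Working from character offsets rather than string replacement keeps each annotation tied to one position, so a repeated word elsewhere in the text is not highlighted by accident.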

In [8]:
cid_rels, work_units = create_work_units(testset, "refine_try_1")

In [9]:
work_units.shape

(874, 9)

In [10]:
work_units.head()

| | chemical_id | chemical_name | disease_id | disease_name | form_body | form_title | pmid | rel_origin | uniq_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MESH:D018021 | `<span class="chemical">LiCl</span>` | MESH:D003919 | `<span class="disease">diabetes-insipidus-like ...` | The effect of amiloride on lithium-induced pol... | `Attenuation of the lithium-induced <span class...` | 7453952 | abs | refine_try_1_0 |
| 1 | MESH:D011188 | `<span class="chemical">potassium</span>` | MESH:D011141 | `<span class="disease">polyuria</span>` | The effect of amiloride on lithium-induced pol... | Attenuation of the lithium-induced diabetes-in... | 7453952 | abs | refine_try_1_1 |
| 2 | MESH:D011188 | `<span class="chemical">potassium</span>` | MESH:D059606 | `<span class="disease">polydipsia</span>` | `The effect of amiloride on lithium-induced <sp...` | Attenuation of the lithium-induced diabetes-in... | 7453952 | abs | refine_try_1_2 |
| 3 | MESH:D018021 | `<span class="chemical">LiCl</span>` | MESH:D059606 | `<span class="disease">polydipsia</span>` | `<span class="sentence">The effect of amiloride...` | Attenuation of the lithium-induced diabetes-in... | 7453952 | sent | refine_try_1_3 |
| 4 | MESH:D000584 | `<span class="chemical">amiloride/Amiloride</span>` | MESH:D011141 | `<span class="disease">polyuria</span>` | `<span class="sentence">The effect of <span cla...` | Attenuation of the lithium-induced diabetes-in... | 7453952 | sent | refine_try_1_4 |


---

## Add test questions

In [11]:
test_ques = pd.read_csv("refine_800_test_ques.tsv", sep = '\t')

In [12]:
total_data = pd.concat([test_ques, work_units])

In [13]:
total_data.to_csv("refine_run_1_all_data.tsv", sep = '\t', index = False)
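Because `pd.concat` takes the union of columns and fills gaps with `NaN`, it can be worth checking how the two tables line up before uploading the file. The following is a rough sketch on toy data, not on the real files: the `is_gold` column and all values are hypothetical, and the layout of `refine_800_test_ques.tsv` is not assumed to match.

```python
import io

import pandas as pd

# Toy stand-ins for the test questions and the generated work units.
test_demo = pd.DataFrame({"uniq_id": ["test_0"], "pmid": [111], "is_gold": [True]})
work_demo = pd.DataFrame({"uniq_id": ["demo_0", "demo_1"], "pmid": [222, 333]})

# Columns present in only one table will contain NaN after concatenation.
only_in_test = sorted(set(test_demo.columns) - set(work_demo.columns))
only_in_work = sorted(set(work_demo.columns) - set(test_demo.columns))
print("only in test questions:", only_in_test)
print("only in work units:", only_in_work)

# ignore_index=True avoids duplicate row labels in the combined frame.
combined = pd.concat([test_demo, work_demo], ignore_index=True, sort=False)

# Every row sent to the crowd should carry a distinct identifier.
if not combined["uniq_id"].is_unique:
    raise ValueError("duplicate uniq_id values in combined data")

# Round-trip through an in-memory TSV to confirm the file would reload intact.
buffer = io.StringIO()
combined.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t")
print(reloaded.shape)
```

The same checks could be run on `test_ques` and `work_units` before writing `refine_run_1_all_data.tsv`.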