# Selecting data for additional CrowdFlower tests

Tong Shu Li<br>
Created on: 2015-12-15<br>
Last updated: 2015-12-15

We will try to push the crowd's performance as high as possible before moving on from the BioCreative dataset.

First we will select a subset of the data to use as our testing set.

In [1]:
import os
import pandas as pd
import random
import sys

The seed should be the integer 1916771088044731497.

In [2]:
# seed number generated by using hash() with a random string multiple times
random.seed(a = 1916771088044731497, version = 2)
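
To illustrate what fixing the seed buys us, here is a minimal sketch. The helper `draw` and the toy population are assumptions made for illustration only. It uses a private `random.Random` instance so that the notebook's global generator is left untouched.

```python
import random

SEED = 1916771088044731497

def draw(seed, population, k):
    """Draw k items without replacement from a freshly seeded generator."""
    rng = random.Random(seed)  # private generator; global state is not modified
    return rng.sample(population, k)

population = list(range(1000))  # toy population, not real PMIDs
first = draw(SEED, population, 5)
second = draw(SEED, population, 5)
print(first == second)  # True: the same seed gives the same sample
```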

In [3]:
sys.path.append("..")

In [4]:
from src.lingpipe.file_util import read_file

---

### Grab the set of all available PMIDs

In [5]:
def get_pmid(line):
    """Parse the PMID out of a line of a PubTator file."""
    v = line.find("\t")
    pos = v if v != -1 else line.find("|")
    return int(line[: pos])

def get_pmids(fname):
    """Return the set of PMIDs found in a PubTator file."""
    return {get_pmid(line) for line in read_file(fname) if line}
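
As a rough illustration of how the parsing behaves, the sketch below runs the same logic over a few made-up lines. The sample lines are hypothetical and only mimic the PubTator layout, where title and abstract lines are pipe-delimited and annotation lines are tab-delimited. The in-memory list stands in for `read_file`.

```python
def get_pmid(line):
    """Parse the PMID out of a line of a PubTator file."""
    v = line.find("\t")
    pos = v if v != -1 else line.find("|")
    return int(line[:pos])

# Made-up lines for illustration; not taken from the gold standard
sample_lines = [
    "12345|t|A made-up title",
    "12345|a|A made-up abstract.",
    "12345\t2\t9\tmade-up\tChemical\tD000000",
    "",
    "67890|t|Another made-up title",
]

# Blank separator lines are skipped, as in get_pmids
pmids = {get_pmid(line) for line in sample_lines if line}
print(sorted(pmids))  # [12345, 67890]
```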

In [6]:
fname = os.path.join("..", "data", "gold_standard", "CDR_DevelopmentSet.txt")
devset_pmids = get_pmids(fname)

In [7]:
len(devset_pmids)

500

In [8]:
fname = os.path.join("..", "data", "gold_standard", "CDR_TrainingSet.txt")
train_pmids = get_pmids(fname)

In [9]:
len(train_pmids)

500

In [10]:
fname = os.path.join("..", "data", "devset_100_test", "processed_CDR_devset.txt")
dev100_pmids = get_pmids(fname)

In [11]:
len(dev100_pmids)

100

---

### Make a big summary dataframe

In [12]:
train = pd.DataFrame({
        "dataset": ["train"] * len(train_pmids),
        "pmid": list(train_pmids)
    })

dev = pd.DataFrame({
        "dataset": ["dev"] * len(devset_pmids),
        "pmid": list(devset_pmids)
    })

# DataFrame.sort() was removed from pandas; sort_values() is its replacement
papers = pd.concat([train, dev]).sort_values("pmid").reset_index(drop = True)

### Check for overlap between test questions and actual data

In [13]:
abs_data = pd.read_csv("../data/crowdflower/data_for_abs_task_job_771158.tsv", sep = '\t')
abs_tq = set(abs_data.query("_golden == True")["pmid"].map(lambda v: int(v)))

sent_data = pd.read_csv("../data/crowdflower/data_for_sent_task_job_771159.tsv", sep = '\t')
sent_tq = set(sent_data.query("_golden == True")["pmid"].map(lambda v: int(v)))

In [14]:
print(len(abs_tq), len(sent_tq))

115 261


---

### Add info to the dataframe

In [15]:
papers["used_as_sent_tq"] = papers["pmid"].isin(sent_tq)
papers["used_as_abs_tq"] = papers["pmid"].isin(abs_tq)

papers["used_as_devset_100"] = papers["pmid"].isin(dev100_pmids)

### Old test question dataset origin

In [16]:
len(papers.query("used_as_sent_tq & dataset == 'train'"))

47

In [17]:
len(papers.query("used_as_sent_tq & dataset == 'dev'"))

214

In [18]:
len(papers.query("used_as_sent_tq & used_as_devset_100"))

43

In [19]:
len(papers.query("used_as_abs_tq & dataset == 'train'"))

51

In [20]:
len(papers.query("used_as_abs_tq & dataset == 'dev'"))

64

In [21]:
len(papers.query("used_as_abs_tq & used_as_devset_100"))

11

I wasn't careful enough during the previous test question selection process: some abstracts were used both as test questions and as part of the actual data. This time I will make sure that the two use separate sets of abstracts.

### Choosing data

We will take at least 100 abstracts from the training set to use as the data that we want our workers to work on. Unfortunately, this won't let us compare directly with the previous development 100 dataset, since there are overlaps with the actual data.

We will also "reserve" additional abstracts from the training and development sets that we will not use to make test questions (the selection below takes 200 from each set), so that we can scale up later with the same number of test questions if we want to.

In [22]:
unseen_train = set(papers.query("~used_as_abs_tq & ~used_as_sent_tq & dataset == 'train'")["pmid"])

In [23]:
len(unseen_train)

405

In [24]:
unseen_dev = set(papers.query("~used_as_abs_tq & ~used_as_sent_tq & ~used_as_devset_100 & dataset == 'dev'")["pmid"])

In [25]:
len(unseen_dev)

211

So we have 405 abstracts from the training set and 211 abstracts from the development set that the crowd has never seen before (either as a task or as a test question). We will choose 200 abstracts from each of the training and development sets as the official data for evaluating our crowd's performance. The remaining abstracts will be used for test questions.

In [26]:
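# random.sample() no longer accepts sets (Python 3.11+); converting to a
# tuple first is what older versions did internally
hidden_train = set(random.sample(tuple(unseen_train), 200))
hidden_dev = set(random.sample(tuple(unseen_dev), 200))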

In [27]:
len(hidden_train)

200

In [28]:
len(hidden_dev)

200

In [29]:
# tuple() is needed because random.sample() no longer accepts sets (Python 3.11+)
small_hidden_train = set(random.sample(tuple(hidden_train), 50))
small_hidden_dev = set(random.sample(tuple(hidden_dev), 50))

These are the 400 abstracts which will never be used for test question purposes. We will use these to filter the original data to a new file.
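
The nested selection can be sketched on toy data as follows. The PMIDs, the seed `0`, and the variable names are assumptions for illustration. Sorting before sampling is one way to make the draw independent of set iteration order; it is not necessarily what produced the selections recorded in this notebook.

```python
import random

rng = random.Random(0)  # arbitrary toy seed, not the notebook's seed

unseen = set(range(1000, 1405))  # toy stand-in for 405 unseen PMIDs

# Draw the evaluation set, then a smaller subset nested inside it
hidden = set(rng.sample(sorted(unseen), 200))
small_hidden = set(rng.sample(sorted(hidden), 50))

# Anything not hidden stays available for test questions
remaining = unseen - hidden

print(len(hidden), len(small_hidden), len(remaining))  # 200 50 205
```

Because `small_hidden` is drawn from `hidden`, the small development subset never touches the abstracts left over for test questions.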

---

### Add information back to the dataframe

In [30]:
# 200 papers from each dataset (training, development) that the crowd has never
# seen, which we will use for performance evaluation
papers["new_unseen_train"] = papers["pmid"].isin(hidden_train)
papers["new_unseen_dev"] = papers["pmid"].isin(hidden_dev)

# a smaller subset of 100 total (50 dev, 50 train) papers that the crowd
# has never seen, which we will use for our development iterations
# (too expensive to use larger sets)
papers["new_unseen_small_train"] = papers["pmid"].isin(small_hidden_train)
papers["new_unseen_small_dev"] = papers["pmid"].isin(small_hidden_dev)

# the rest of the papers will be used to create test questions
papers["for_test_ques_use"] = ~(papers["new_unseen_train"] | papers["new_unseen_dev"])

In [31]:
papers.head()

| | dataset | pmid | used_as_sent_tq | used_as_abs_tq | used_as_devset_100 | new_unseen_train | new_unseen_dev | new_unseen_small_train | new_unseen_small_dev | for_test_ques_use |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dev | 2004 | False | False | False | False | True | False | False | False |
| 1 | train | 26094 | False | False | False | True | False | False | False | False |
| 2 | dev | 28952 | False | False | False | False | True | False | False | False |
| 3 | dev | 33969 | True | False | True | False | False | False | False | True |
| 4 | dev | 48362 | False | False | False | False | True | False | True | False |


In [32]:
papers.to_csv("../data/refinement/dataset_summary.tsv", sep = '\t', index = False)

### Writing data to new file

In [33]:
def pipe_and_filter(pmids, infile, outfile):
    """Redirect a subset of the gold standard to a new file."""
    # True while no kept record is open, so that blank separator lines
    # following dropped records are not copied
    skipped = True
    with open(outfile, "w") as fout:
        for line in read_file(infile):
            if not line and not skipped:
                # blank line ending a kept record: keep the separator
                fout.write("\n")
                skipped = True
            elif line and (get_pmid(line) in pmids):
                fout.write("{}\n".format(line))
                skipped = False
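
To make the control flow easier to follow, here is a minimal in-memory sketch of the same filtering logic. `filter_records` is a hypothetical stand-in that takes a list of lines instead of file names, and the sample records are made up.

```python
def get_pmid(line):
    """Parse the PMID out of a line of a PubTator file."""
    v = line.find("\t")
    pos = v if v != -1 else line.find("|")
    return int(line[:pos])

def filter_records(pmids, lines):
    """Same control flow as pipe_and_filter, but on an in-memory list."""
    out = []
    skipped = True
    for line in lines:
        if not line and not skipped:
            out.append("")  # separator after a kept record
            skipped = True
        elif line and get_pmid(line) in pmids:
            out.append(line)
            skipped = False
    return out

# Made-up records separated by blank lines
lines = [
    "111|t|First made-up title",
    "111|a|First made-up abstract.",
    "",
    "222|t|Second made-up title",
    "222|a|Second made-up abstract.",
    "",
    "333|t|Third made-up title",
    "333|a|Third made-up abstract.",
    "",
]

kept = filter_records({111, 333}, lines)
for line in kept:
    print(line)
```

Record 222 is dropped together with its trailing blank line, so the output keeps exactly one separator after each kept record.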

### Writing all the unseen abstracts

In [34]:
infile = os.path.join("..", "data", "gold_standard", "CDR_TrainingSet.txt")
outfile = os.path.join("..", "data", "refinement", "CDR_train_200_subset.txt")

pipe_and_filter(hidden_train, infile, outfile)

In [35]:
infile = os.path.join("..", "data", "gold_standard", "CDR_DevelopmentSet.txt")
outfile = os.path.join("..", "data", "refinement", "CDR_dev_200_subset.txt")

pipe_and_filter(hidden_dev, infile, outfile)

In [36]:
infile = os.path.join("..", "data", "gold_standard", "CDR_TrainingSet.txt")
outfile = os.path.join("..", "data", "refinement", "CDR_train_50_subset.txt")

pipe_and_filter(small_hidden_train, infile, outfile)

In [37]:
infile = os.path.join("..", "data", "gold_standard", "CDR_DevelopmentSet.txt")
outfile = os.path.join("..", "data", "refinement", "CDR_dev_50_subset.txt")

pipe_and_filter(small_hidden_dev, infile, outfile)

### Writing the abstracts for use as test questions

In [38]:
infile = os.path.join("..", "data", "gold_standard", "CDR_TrainingSet.txt")
outfile = os.path.join("..", "data", "refinement", "CDR_train_for_test_ques.txt")

pmids = set(papers.query("dataset == 'train' & for_test_ques_use")["pmid"])

pipe_and_filter(pmids, infile, outfile)

In [39]:
infile = os.path.join("..", "data", "gold_standard", "CDR_DevelopmentSet.txt")
outfile = os.path.join("..", "data", "refinement", "CDR_dev_for_test_ques.txt")

pmids = set(papers.query("dataset == 'dev' & for_test_ques_use")["pmid"])

pipe_and_filter(pmids, infile, outfile)

Now all of our data is partitioned into disjoint sets: one for actual performance evaluation and one for test question creation.
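
A quick sanity check of this claim could look like the sketch below. The helper `is_partition` and the toy PMIDs are assumptions for illustration and are not part of the notebook's code.

```python
def is_partition(universe, *parts):
    """Return True if the parts are pairwise disjoint and together cover universe."""
    seen = set()
    for part in parts:
        if seen & part:  # an element appears in more than one part
            return False
        seen |= part
    return seen == universe

# Toy data standing in for PMIDs
all_pmids = set(range(10))
evaluation = {0, 1, 2, 3}
test_questions = all_pmids - evaluation

ok = is_partition(all_pmids, evaluation, test_questions)
bad = is_partition(all_pmids, evaluation, test_questions | {0})
print(ok, bad)  # True False
```

In the notebook, the analogous check for the training set would compare `hidden_train` with the `pmids` set built for `CDR_train_for_test_ques.txt` against `train_pmids`.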