#Analysis of CrowdFlower job #758438: Abstract-level CID verification task with NER mistake choice

Tong Shu Li<br>
Created on Tuesday 2015-08-04<br>
Last updated 2015-08-04

Results from CrowdFlower jobs #754530 and #755704 showed that splitting the task into two parts did indeed improve performance slightly. It also increased the job satisfaction rate.

One aspect of the task which has not been tested is NER mistake identification. For this job a fourth choice was added to the question which allowed a worker to say that the highlighting was incorrect and therefore the task was impossible.

In this job, a number of cheaters completed a substantial share of the work units. They were thankfully easy to detect because they chose the same response for every work unit, and that response was one of the least likely answer choices ("ner_mistake"). These workers have been banned from all future tasks, but all future tasks will need a thorough analysis of each worker's answer distribution and response times in order to identify cheaters.
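
The answer distribution check mentioned above could look roughly like the following. This is a minimal sketch, assuming the judgements are available as `(worker_id, answer)` pairs; the function name `flag_uniform_workers` and the thresholds `min_judgements` and `max_share` are hypothetical and were not used in this analysis.

```python
from collections import Counter, defaultdict


def flag_uniform_workers(judgements, min_judgements=20, max_share=0.95):
    """Return workers whose most common answer makes up at least
    `max_share` of their judgements (both thresholds are hypothetical)."""
    answers = defaultdict(Counter)
    for worker_id, answer in judgements:
        answers[worker_id][answer] += 1

    flagged = {}
    for worker_id, counts in answers.items():
        total = sum(counts.values())
        if total < min_judgements:
            continue  # too few judgements to draw a conclusion
        top_answer, top_count = counts.most_common(1)[0]
        if top_count / float(total) >= max_share:
            flagged[worker_id] = (top_answer, total)
    return flagged


# Toy data: worker 1 always answers "ner_mistake", worker 2 varies.
toy = ([(1, "ner_mistake")] * 30
       + [(2, "yes_direct")] * 10
       + [(2, "no_relation")] * 15)
suspects = flag_uniform_workers(toy)
print(suspects)
```

A flagged worker still needs manual review: a worker who answers "no_relation" almost every time may be honest if most relations really are false, whereas a worker who always picks a rare choice such as "ner_mistake" is much more suspicious.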

The analysis of the job follows the analysis of job #754530. "yes_indirect" and "ner_mistake" choices are counted as "no" votes.

###Job settings:

Parameter | Value
--- | ---
Job ID | #758438
Rows per page | 6
Judgements per row | 5
Payment per page | 24 cents USD
Payment per row | 4 cents USD
Contributor level | 2
Minimum time per page | 30 seconds
Minimum accuracy threshold | 70%
Number of test questions | 52
Date of launch | 4:56 pm PDT on Saturday 2015-08-01
Date of completion | 9:40 pm PDT on Saturday 2015-08-01
Total cost before bonuses | \$331.31 USD
Total cost after bonuses | \$333.31 USD

---

In [1]:
from __future__ import division
from collections import Counter
from collections import defaultdict
from IPython.display import Image
import matplotlib.pyplot as plt
import os
import pandas as pd
import pickle

In [2]:
%matplotlib inline

In [3]:
%%bash

rm src/get_AUC_value.pyc

In [4]:
from src.filter_data import filter_data
from src.data_model import parse_input
from src.data_model import Relation
from src.get_AUC_value import get_AUC_value
from src.F_score import *
from src.aggregate_results import *

###Read the results of job #758438:

In [5]:
settings = {
    "loc": "data/crowdflower/results",
    "fname": "job_758438_full_with_untrusted.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

raw_data = filter_data(settings)

In [6]:
raw_data.shape

(5315, 29)

###Map uniq id and unit id:

In [7]:
id_mapping = dict()
for uniq_id, group in raw_data.groupby("uniq_id"):
    id_mapping[uniq_id] = int(group["_unit_id"].iloc[0])

###Read the gold standard:

In [8]:
def read_gold_standard(dataset, file_format = "list"):
    assert dataset in ["training", "development"]
    assert file_format in ["list", "dict"]
    
    fname = "data/{0}/parsed_{0}_set_{1}.pickle".format(dataset, file_format)
    
    if os.path.exists(fname):
        print "Reading cached version of {0} set ({1})".format(dataset, file_format)
        
        with open(fname, "rb") as fin:
            data = pickle.load(fin)
    else:
        print "Parsing raw {0} file".format(dataset)
        data = parse_input("data/{0}".format(dataset),
                           "CDR_{0}Set.txt".format(dataset.capitalize()),
                           return_format = file_format)
        
        with open(fname, "wb") as fout:
            pickle.dump(data, fout)
            
    return data

In [9]:
development_set = read_gold_standard("development", "dict")

Reading cached version of development set (dict)


###Remove judgements made by cheaters:

In [10]:
bad_workers = {
 31501233,
 31720388,
 31720815,
 32025293,
 33081102,
 33081299,
 33081469,
 33081531,
 33085305,
 33085428,
 33238902,
 33301062,
 33301138,
 33596095}

In [11]:
clean_data = raw_data.query("_worker_id not in {0}".format(list(bad_workers)))

In [12]:
clean_data.shape

(3709, 29)

The cheaters unfortunately used scripts to submit a large number of judgements (1606 of 5315, about 30%). This is worrisome, since many work units now have fewer than 5 judgements. We can filter these out to see what the effect is.

###Comments:

In [13]:
clean_data["comment_box"].unique()

array([nan,
       "I'm guessing here... I don't have a slightest clue what is the correct answer.",
       "It's 'liver rupture', not just 'rupture'",
       "'Glutamate receptors' and not just 'glutamate'. :D",
       'Dexamethasone IS a steroid. :D',
       'Yes, cocaine causes cocaine abuse... that one was easy.  XD',
       "It's 'Cortical Spreading Depression' and not just 'Depression'",
       "It's 'stroke-like' and not 'stroke'",
       "It's 'high fat diet' and not just 'fat'",
       "It's 'ATP/ADP ratio' and not just 'ADP'. :D",
       "It's 'calcium antagonists' and not just 'calcium'. :D",
       "it's 'Na(+)/H(+) exchanger type 3 (NHE3)' (read as Sodium-proton exchanger type 3) and not 'H'",
       "It's 'Na(+)-K(+)-2Cl(-) cotransporter (BSC-1)' and not just'K'",
       'Venlafaxine causes serotonine syndrome which causes thrombocytopenia',
       "It's 'serotonin-1A receptor agonist' and not just 'serotonin'",
       "It's 'citrate synthase' and not just 'citrate'",
   

In [14]:
comments = clean_data[~pd.isnull(clean_data["comment_box"])]

In [15]:
comments.shape

(17, 29)

In [16]:
comments["_worker_id"].unique()

array([32708888, 27555842])

Worker 27555842 left lots of helpful comments about what the correct concept highlighting should have been when the response was an NER error. No one else left any useful comments.

###Result aggregation:

In [17]:
res = aggregate_results("uniq_id", "verify_relationship", clean_data, "majority_vote", ["pmid", "_unit_id"])

In [18]:
len(res["uniq_id"].unique())

1060

In [19]:
res.head()

| | uniq_id | verify_relationship | conf_score | num_votes | percent_agree | pmid | unit_id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bcv_hard_0 | no_relation | 2.517 | 3 | 0.752331 | 15579441 | 765527869 |
| 0 | bcv_hard_0 | yes_direct | 0.8286 | 1 | 0.247669 | 15579441 | 765527869 |
| 1 | bcv_hard_1 | no_relation | 3.3311 | 4 | 0.786583 | 15579441 | 765527870 |
| 0 | bcv_hard_1 | yes_direct | 0.9038 | 1 | 0.213417 | 15579441 | 765527870 |
| 0 | bcv_hard_10 | no_relation | 3.5156 | 4 | 1.0 | 3732088 | 765527879 |
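
The `percent_agree` values in the output above are consistent with each choice's `conf_score` divided by the summed `conf_score` of all choices for the same work unit. The quick check below uses the numbers from the first four rows. How `aggregate_results` actually computes the column is not shown in this notebook, so treat this as an observation about the output rather than a definition.

```python
# conf_score values copied from the first four rows of res.head()
conf_scores = {
    "bcv_hard_0": {"no_relation": 2.517, "yes_direct": 0.8286},
    "bcv_hard_1": {"no_relation": 3.3311, "yes_direct": 0.9038},
}

# Share of the total conf_score that each choice received per work unit
shares = {}
for uniq_id, scores in conf_scores.items():
    total = sum(scores.values())
    shares[uniq_id] = dict(
        (choice, score / total) for choice, score in scores.items()
    )

for uniq_id in sorted(shares):
    for choice in sorted(shares[uniq_id]):
        print("{0} {1} {2:.6f}".format(uniq_id, choice, shares[uniq_id][choice]))
```

If this reading is right, `percent_agree` is a trust-weighted agreement score rather than a raw fraction of votes, which is why a 3-to-1 vote gives about 0.75 instead of exactly 0.75.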


###Error checking:

In [20]:
# how many work units have fewer than 5 votes?
unit_votes = defaultdict(set)
for uniq_id, group in res.groupby("uniq_id"):
    total_votes = group["num_votes"].sum()
    unit_votes[total_votes].add(uniq_id)

In [21]:
for num_votes, ids in unit_votes.items():
    print "Num votes:", num_votes
    print "Num work units:", len(ids)

Num votes: 1
Num work units: 67
Num votes: 2
Num work units: 186
Num votes: 3
Num work units: 273
Num votes: 4
Num work units: 219
Num votes: 5
Num work units: 315


A lot of work units have been affected by the cheaters. This is bad. For the analysis we can try looking at only the work units with some minimum number of votes.
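
Selecting the work units with at least a given number of votes could be sketched as follows. It assumes a mapping from total vote count to a set of work unit ids, shaped like `unit_votes`; the helper name `units_with_min_votes` and the toy ids are hypothetical.

```python
from collections import defaultdict


def units_with_min_votes(unit_votes, min_votes):
    """Union of all work unit ids whose total vote count is >= min_votes."""
    selected = set()
    for num_votes, ids in unit_votes.items():
        if num_votes >= min_votes:
            selected |= ids
    return selected


# Toy stand-in for the unit_votes mapping (vote total -> set of unit ids).
toy_unit_votes = defaultdict(set)
toy_unit_votes[3].update({"unit_a", "unit_b"})
toy_unit_votes[4].add("unit_c")
toy_unit_votes[5].update({"unit_d", "unit_e"})

kept = units_with_min_votes(toy_unit_votes, 4)
print(sorted(kept))
```

With the counts printed above, a minimum of 4 votes would keep 219 + 315 = 534 of the 1060 work units, and a minimum of 5 votes would keep only 315.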

###NER errors:

Were there any work units where NER error was the top choice?

In [22]:
units = set()
for uniq_id, group in res.groupby("uniq_id"):
    if group["verify_relationship"].iloc[0] == "ner_mistake":
        units.add((uniq_id, group["num_votes"].iloc[0]))

In [23]:
len(units)

7

In [24]:
units

{('bcv_hard_152', 2),
 ('bcv_hard_689', 2),
 ('bcv_hard_820', 1),
 ('bcv_hard_830', 2),
 ('bcv_hard_831', 2),
 ('bcv_hard_833', 2),
 ('bcv_hard_982', 1)}

In [25]:
for uniq_id, votes in units:
    print uniq_id, votes
    print "https://crowdflower.com/jobs/758438/units/{0}".format(id_mapping[uniq_id])

bcv_hard_689 2
https://crowdflower.com/jobs/758438/units/765528558
bcv_hard_820 1
https://crowdflower.com/jobs/758438/units/765528689
bcv_hard_830 2
https://crowdflower.com/jobs/758438/units/765528699
bcv_hard_982 1
https://crowdflower.com/jobs/758438/units/765528851
bcv_hard_152 2
https://crowdflower.com/jobs/758438/units/765528021
bcv_hard_831 2
https://crowdflower.com/jobs/758438/units/765528700
bcv_hard_833 2
https://crowdflower.com/jobs/758438/units/765528702


Some of these are indeed NER errors, but each has only one or two votes for that choice. For our analysis, we can just count NER votes as "no" votes, but we will need a much stronger signal before giving the data to BeFree.
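
Collapsing the four answer choices into a binary vote can be sketched as below. The two mappings mirror the "indirect is yes" and "indirect is no" variants used in this notebook; the helper `collapse_votes` is hypothetical and is not the implementation inside `aggregate_results`.

```python
from collections import Counter

# "indirect is no" variant: only yes_direct stays positive.
INDIRECT_IS_NO = {"yes_indirect": "no_relation", "ner_mistake": "no_relation"}
# "indirect is yes" variant: yes_indirect is folded into yes_direct.
INDIRECT_IS_YES = {"yes_indirect": "yes_direct", "ner_mistake": "no_relation"}


def collapse_votes(votes, mapping):
    """Relabel each vote via `mapping` (unmapped labels pass through)
    and count the resulting binary labels."""
    return Counter(mapping.get(vote, vote) for vote in votes)


# Toy votes for a single work unit.
toy_votes = ["yes_direct", "yes_indirect", "ner_mistake",
             "no_relation", "yes_indirect"]
as_no = collapse_votes(toy_votes, INDIRECT_IS_NO)
as_yes = collapse_votes(toy_votes, INDIRECT_IS_YES)
print(as_no)
print(as_yes)
```

In both variants "ner_mistake" ends up as a "no" vote, so the only difference between the two analyses is how "yes_indirect" is treated.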

###Mapping choices to a binary judgement:

In [None]:
# take positive votes only, indirect is yes
res_positive_yes = aggregate_results("uniq_id", "verify_relationship", clean_data,
                                    "positive_signal_only", ["pmid", "_unit_id", "chemical_id", "disease_id"],
                                    "yes_direct", {"yes_indirect" : "yes_direct", "ner_mistake": "no_relation"})

In [None]:
# take positive votes only, indirect is no
res_positive_no = aggregate_results("uniq_id", "verify_relationship", clean_data,
                                    "positive_signal_only", ["pmid", "_unit_id", "chemical_id", "disease_id"],
                                    "yes_direct", {"yes_indirect" : "no_relation", "ner_mistake": "no_relation"})

In [None]:
res_positive_yes.head()

In [None]:
res_positive_no.head()

###Add the in_gold column to each result dataframe:

In [None]:
def in_gold(row):
    pmid = int(row["pmid"])
    return int(development_set[pmid].has_relation(Relation(pmid, row["chemical_id"], row["disease_id"])))

In [None]:
res_positive_yes["in_gold"] = res_positive_yes.loc[:, ["pmid", "chemical_id", "disease_id"]].apply(in_gold, axis = 1)

In [None]:
res_positive_no["in_gold"] = res_positive_no.loc[:, ["pmid", "chemical_id", "disease_id"]].apply(in_gold, axis = 1)

In [None]:
res_positive_yes.head()

In [None]:
res_positive_no.head()

In [None]:
res_positive_yes["in_gold"].value_counts()

In [None]:
res_positive_yes.shape

In [None]:
res_positive_yes = res_positive_yes.query("verify_relationship == 'yes_direct'")

In [None]:
res_positive_no = res_positive_no.query("verify_relationship == 'yes_direct'")

In [None]:
res_positive_yes.shape

In [None]:
res_positive_no.shape

In [None]:
res_positive_yes.head()

In [None]:
fname = "data/roc/job_758438_non_majority_metric_indirect_is_yes.png"
title = "ROC for job 758438 (abstract level) testing non-majority voting aggregation (indirect is yes)"
get_AUC_value(res_positive_yes, "percent_agree", "in_gold", fname, title)

In [None]:
Image(fname)

In [None]:
fname = "data/roc/job_758438_non_majority_metric_indirect_is_no.png"
title = "ROC for job 758438 (abstract level) testing non-majority voting aggregation (indirect is no)"
get_AUC_value(res_positive_no, "percent_agree", "in_gold", fname, title)

In [None]:
Image(fname)
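
As background for the ROC plots: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way. The function `auc_from_scores` is hypothetical and is not the code in `src/get_AUC_value`, which is not shown in this notebook.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and
    negatives; ties count as half a win. O(n^2), fine for small data."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy scores standing in for percent_agree, toy labels for in_gold.
toy_scores = [1.0, 0.8, 0.8, 0.4, 0.2]
toy_labels = [1, 1, 0, 0, 1]
auc = auc_from_scores(toy_scores, toy_labels)
print(auc)
```

An AUC of 0.5 corresponds to a score that ranks positives no better than chance, and 1.0 to a score that ranks every positive above every negative.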

In [None]:
max_F_score("percent_agree", "in_gold", res_positive_yes)

In [None]:
max_F_score("percent_agree", "in_gold", res_positive_no)
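
A maximum F-score search of this kind presumably sweeps a threshold over the score column and reports the best result. The sketch below does this under the assumption that a row is predicted positive when its score is at or above the threshold. The function `f_scores_by_threshold` is hypothetical and is not the code in `src/F_score`.

```python
def f_scores_by_threshold(scores, labels):
    """For each distinct score used as a threshold, predict positive when
    score >= threshold and return (threshold, precision, recall, F1)."""
    results = []
    total_pos = sum(labels)
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        num_pred = sum(predicted)
        precision = true_pos / float(num_pred) if num_pred else 0.0
        recall = true_pos / float(total_pos) if total_pos else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((threshold, precision, recall, f1))
    return results


# Toy scores standing in for percent_agree, toy labels for in_gold.
toy_scores = [1.0, 0.8, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0]
table = f_scores_by_threshold(toy_scores, toy_labels)
best = max(table, key=lambda row: row[3])
print(best)
```

Because the threshold is chosen on the same data it is evaluated on, the maximum F-score found this way is an optimistic estimate.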

In [None]:
def plot_results(score_column, class_column, dataframe):
    res = all_F_scores(score_column, class_column, dataframe)
    res = res.sort("threshold")
    
    graph = res.plot(x = "threshold", figsize = (7, 7))
    graph.set_ylim((0, 1.1))

In [None]:
plot_results("percent_agree", "in_gold", res_positive_yes)

In [None]:
plot_results("percent_agree", "in_gold", res_positive_no)

In [None]:
plot_results("num_votes", "in_gold", res_positive_yes)

In [None]:
plot_results("num_votes", "in_gold", res_positive_no)

---

###Filter original cleaned data by the number of votes we got at the end:

In [None]:
sub_pos_yes = res_positive_yes.query("uniq_id in {0}".format(list(unit_votes[5])))

In [None]:
sub_pos_no = res_positive_no.query("uniq_id in {0}".format(list(unit_votes[5])))

In [None]:
sub_pos_yes.shape

In [None]:
sub_pos_no.shape

In [None]:
fname = "data/roc/job_758438_non_majority_metric_indirect_is_yes_5_votes_only.png"
title = ("ROC for job 758438 (abstract level) testing non-majority voting aggregation\n"
    "(indirect is yes); only units with 5 votes")
get_AUC_value(sub_pos_yes, "percent_agree", "in_gold", fname, title)

In [None]:
Image(fname)

In [None]:
fname = "data/roc/job_758438_non_majority_metric_indirect_is_no_5_votes_only.png"
title = ("ROC for job 758438 (abstract level) testing non-majority voting aggregation\n"
    "(indirect is no); only units with 5 votes")
get_AUC_value(sub_pos_no, "percent_agree", "in_gold", fname, title)

In [None]:
Image(fname)

In [None]:
max_F_score("percent_agree", "in_gold", sub_pos_yes)

In [None]:
max_F_score("percent_agree", "in_gold", sub_pos_no)

In [None]:
max_F_score("num_votes", "in_gold", sub_pos_yes)

In [None]:
max_F_score("num_votes", "in_gold", sub_pos_no)

In [None]:
plot_results("percent_agree", "in_gold", sub_pos_yes)

In [None]:
plot_results("percent_agree", "in_gold", sub_pos_no)

In [None]:
plot_results("num_votes", "in_gold", sub_pos_yes)

In [None]:
plot_results("num_votes", "in_gold", sub_pos_no)

In [None]:
sub_pos_yes = res_positive_yes.query("uniq_id in {0}".format(list(unit_votes[4] | unit_votes[5])))

In [None]:
sub_pos_no = res_positive_no.query("uniq_id in {0}".format(list(unit_votes[4] | unit_votes[5])))

In [None]:
fname = "data/roc/job_758438_non_majority_metric_indirect_is_yes_4_5_votes_only.png"
title = ("ROC for job 758438 (abstract level) testing non-majority voting aggregation\n"
    "(indirect is yes); only units with 4 and 5 votes")
get_AUC_value(sub_pos_yes, "percent_agree", "in_gold", fname, title)

In [None]:
Image(fname)

In [None]:
fname = "data/roc/job_758438_non_majority_metric_indirect_is_no_4_5_votes_only.png"
title = ("ROC for job 758438 (abstract level) testing non-majority voting aggregation\n"
    "(indirect is no); only units with 4 and 5 votes")
get_AUC_value(sub_pos_no, "percent_agree", "in_gold", fname, title)

In [None]:
Image(fname)

In [None]:
max_F_score("percent_agree", "in_gold", sub_pos_yes)

In [None]:
max_F_score("percent_agree", "in_gold", sub_pos_no)

In [None]:
plot_results("percent_agree", "in_gold", sub_pos_yes)

In [None]:
plot_results("percent_agree", "in_gold", sub_pos_no)

In [None]:
plot_results("num_votes", "in_gold", sub_pos_yes)

In [None]:
plot_results("num_votes", "in_gold", sub_pos_no)

## Conclusion

Workers did very poorly this time on the completely new data. The cheaters greatly decreased the number of work units which received the full 5 judgements. It also seems that the workers were simply not as good this time, since the F-score is almost 0.2 lower than before.

It still concerns me that the majority of the relations tested are supposed to be false according to the gold standard.