# Analysis of CrowdFlower Job 743229: determining drug-induced disease relationships from a full abstract

2015-06-18 Tong Shu Li

CrowdFlower job 743229 was launched at 5:15 pm on Thursday 18 June 2015 and completed at 6:24 pm. Each page showed 5 rows of data, and each row received 5 judgements. Payment was 50 cents per page. Level 2 contributors were requested; they had to maintain a minimum accuracy of 70% on 10 test questions and spend at least 30 seconds per page of work.
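
As a rough sketch of what these settings imply, the arithmetic below derives the pay per judgement and the number of test questions a contributor had to answer correctly. The constant names are illustrative, platform fees (not mentioned here) are ignored, and rounding the accuracy threshold up is an assumption about how the platform applies it.

```python
# Job settings as stated in the paragraph above (names are illustrative).
CENTS_PER_PAGE = 50
ROWS_PER_PAGE = 5
JUDGEMENTS_PER_ROW = 5
TEST_QUESTIONS = 10
MIN_ACCURACY_PCT = 70

# One worker judging one row earns a fifth of the page payment.
cents_per_judgement = CENTS_PER_PAGE // ROWS_PER_PAGE

# Each row is judged by several workers, so its total cost scales accordingly.
cents_per_row = cents_per_judgement * JUDGEMENTS_PER_ROW

# Ceiling of 70% of 10 questions, using integer arithmetic (assumed rounding).
min_correct = (MIN_ACCURACY_PCT * TEST_QUESTIONS + 99) // 100
```

Under these assumptions a single judgement pays 10 cents, a fully judged row costs 50 cents, and a contributor needs 7 of the 10 test questions correct.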

In [1]:
from collections import Counter
from collections import defaultdict
import pandas as pd

In [2]:
from src.filter_data import filter_data

In [3]:
settings = {
    "loc": "data/crowdflower",
    "fname": "job_743229_full_results.csv",
    "data_subset": "normal",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

In [4]:
raw_data = filter_data(settings)

In [5]:
len(raw_data)

255

In [6]:
raw_data.head()

Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,choice_2_ids,choice_2_label,choice_3_ids,choice_3_label,choice_4_ids,choice_4_label,form_abstract,form_title,pmid,uniq_id
0,739660086,6/19/2015 00:30:20,False,1665411847,,6/19/2015 00:29:42,False,neodev,0.8889,33203286,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
1,739660086,6/19/2015 00:31:30,False,1665412995,,6/19/2015 00:30:39,False,elite,0.7778,28769627,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
2,739660086,6/19/2015 00:37:16,False,1665418608,,6/19/2015 00:30:23,False,clixsense,0.875,31178177,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
3,739660086,6/19/2015 00:39:37,False,1665420942,,6/19/2015 00:30:32,False,neodev,1.0,29150840,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0
4,739660086,6/19/2015 00:43:00,False,1665424014,,6/19/2015 00:30:35,False,clixsense,0.8333,27026688,...,C063968_induces_D016171,"<span class=""chemical"">E4031</span> contribute...",C063968_induces_D017180,"<span class=""chemical"">E4031</span> contribute...",D016593_induces_D016171,"<span class=""chemical"">terfenadine</span> cont...","1. <span class=""disease"">Torsades de pointes</...",Pharmacokinetic/pharmacodynamic assessment of ...,11569530,bcv_id_0


In [7]:
list(raw_data.columns.values)

['_unit_id',
 '_created_at',
 '_golden',
 '_id',
 '_missed',
 '_started_at',
 '_tainted',
 '_channel',
 '_trust',
 '_worker_id',
 '_country',
 '_region',
 '_city',
 '_ip',
 'chemical_disease_relationships',
 'comment_box',
 'chemical_disease_relationships_gold',
 'choice_0_ids',
 'choice_0_label',
 'choice_1_ids',
 'choice_1_label',
 'choice_2_ids',
 'choice_2_label',
 'choice_3_ids',
 'choice_3_label',
 'choice_4_ids',
 'choice_4_label',
 'form_abstract',
 'form_title',
 'pmid',
 'uniq_id']

### A cursory glance at the data showed that there were some cheaters and bots

Let's look at the individual responses in depth to see if there were any that didn't make any logical sense considering the problem description.

Specifically, we will see if any answers:
1. Chose "none of the choices are correct" and also chose other choices.
2. Chose any choice that was marked as empty when I specifically stated not to click those choices.
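
The two checks above can be sketched as a standalone function that works on a single judgement. This is only an illustration: `violates_rules` is a hypothetical helper name, and it assumes the response is the newline-separated string stored in `chemical_disease_relationships` and that unused choices carry the literal id `"empty"`.

```python
def violates_rules(response, choice_ids):
    """Return True if a single judgement breaks either rule.

    response   -- newline-separated string of selected options
    choice_ids -- the choice_N_ids values for the work unit, in order
    """
    selected = response.split("\n")

    # Rule 1: "none_are_true" must be the only selection.
    if "none_are_true" in selected and len(selected) > 1:
        return True

    # Rule 2: a choice whose ids field is "empty" must never be selected.
    for i, ids in enumerate(choice_ids):
        if ids == "empty" and "choice_{0}".format(i) in selected:
            return True

    return False
```

For example, with made-up ids `["A_induces_B", "empty", "empty", "empty", "empty"]`, the response `"choice_0"` passes, while `"choice_1"` and `"choice_0\nnone_are_true"` are both flagged.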

In [8]:
def answer_distribution(worker_id):
    """
    Examine the answer distribution for this worker.
    """
    responses = raw_data.query("_worker_id == {0}".format(worker_id))
    print "Worker {0} made {1} judgements total".format(worker_id, len(responses))
    
    distribution = Counter(responses["chemical_disease_relationships"])
    print "Worker answer distribution:"
    print distribution
    print

### A look at the answer distributions for all workers shows that most people performed the task properly: their answer distributions follow no discernible pattern

In [9]:
for worker_id in raw_data["_worker_id"].unique():
    answer_distribution(worker_id)

Worker 33203286 made 16 judgements total
Worker answer distribution:
Counter({'choice_0\nchoice_1\nchoice_2\nchoice_3\nchoice_4\nnone_are_true': 16})

Worker 28769627 made 16 judgements total
Worker answer distribution:
Counter({'choice_0': 5, 'choice_0\nchoice_1': 4, 'choice_0\nchoice_1\nchoice_2\nchoice_3\nchoice_4': 2, 'choice_0\nchoice_2': 1, 'choice_0\nchoice_2\nchoice_4': 1, 'choice_0\nchoice_3': 1, 'choice_0\nchoice_1\nchoice_2\nchoice_3': 1, 'choice_1\nchoice_2\nchoice_3': 1})

Worker 31178177 made 11 judgements total
Worker answer distribution:
Counter({'choice_0\nchoice_1\nchoice_2\nchoice_3\nchoice_4': 6, 'choice_0\nchoice_1': 2, 'choice_0': 1, 'none_are_true': 1, 'choice_0\nchoice_1\nchoice_2': 1})

Worker 29150840 made 4 judgements total
Worker answer distribution:
Counter({'choice_0\nchoice_2\nchoice_4': 1, 'choice_0\nchoice_1\nchoice_2\nchoice_3': 1, 'choice_0': 1, 'none_are_true': 1})

Worker 27026688 made 4 judgements total
Worker answer distribution:
Counter({'choice_
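
One way to summarise each distribution as a single number is the share of a worker's judgements that equal their most common answer. The helper below is a minimal sketch with a hypothetical name; a high share is only a hint, not proof of cheating, since a worker with few judgements can legitimately repeat an answer.

```python
from collections import Counter


def dominant_answer_share(answers):
    """Fraction of judgements that equal the worker's most common answer."""
    if not answers:
        return 0.0
    counts = Counter(answers)
    top_count = counts.most_common(1)[0][1]
    return float(top_count) / len(answers)
```

Worker 33203286 gave the same answer for all 16 judgements, so the share would be 1.0; worker 28769627's most common answer appears 5 times in 16, giving about 0.31.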

### Now we will scan the data and look for odd responses:

In [10]:
bad_responses = set()

for idx, row in raw_data.iterrows():
    unit_id = row["_unit_id"]
    worker_id = row["_worker_id"]
    
    response = row["chemical_disease_relationships"].split('\n')
    
    # if none are true, then it should be the only choice
    if "none_are_true" in response and len(response) > 1:
        bad_responses.add((unit_id, worker_id))
        
    for i in range(5):
        column = "choice_{0}_ids".format(i)
        if (row[column] == "empty") and ("choice_{0}".format(i) in response): # clicked empty response
            bad_responses.add((unit_id, worker_id))

### A total of 77 judgements have been tainted:

In [11]:
len(bad_responses)

77

### Seven users account for all of the bad responses

In [12]:
# look through and count the users who made the most bad responses

cheaters = defaultdict(list)
for unit_id, worker_id in bad_responses:
    cheaters[worker_id].append(unit_id)

temp = []
for key, val in cheaters.items():
    temp.append((key, len(val)))
    
temp = sorted(temp, key = lambda x: -x[1])
temp

[(33078019, 20),
 (32088050, 20),
 (33203286, 16),
 (29794479, 11),
 (31706962, 4),
 (32337300, 4),
 (29269406, 2)]
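
The same per-worker tally can also be produced directly with `Counter`. This is an alternative sketch rather than the code used above; `bad_counts` is a hypothetical name, and the order of workers with equal counts may differ from the listing above.

```python
from collections import Counter


def bad_counts(bad_responses):
    """Count flagged (unit_id, worker_id) pairs per worker, highest first."""
    tally = Counter(worker_id for _unit_id, worker_id in bad_responses)
    return tally.most_common()
```

Calling `bad_counts(bad_responses)` on the set built in `In [10]` would return a list of `(worker_id, count)` tuples.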

### An examination of the response each bad user made for each work unit shows a consistent pattern:

In [13]:
for worker_id, num_bad_resp in temp:
    print "# of bad responses: {0}".format(num_bad_resp)
    answer_distribution(worker_id)
    
    # what did each worker answer for each question?
    
    for unit_id in cheaters[worker_id]:
        r = raw_data.query("_unit_id == {0} and _worker_id == {1}".format(unit_id, worker_id))
        assert len(r) == 1
        
        print "Unit {0} worker {1} response: {2}".format(unit_id,
                                                         worker_id,
                                                         r.iloc[0]["chemical_disease_relationships"].split('\n'))
        
    print "\n-----------------------------------------------\n"    

# of bad responses: 20
Worker 33078019 made 20 judgements total
Worker answer distribution:
Counter({'choice_0\nchoice_1\nchoice_2\nchoice_3\nchoice_4\nnone_are_true': 12, 'choice_0\nnone_are_true': 2, 'choice_0\nchoice_1\nchoice_2\nnone_are_true': 2, 'choice_0\nchoice_1\nchoice_2\nchoice_3\nnone_are_true': 2, 'choice_0\nchoice_1\nnone_are_true': 2})

Unit 739660126 worker 33078019 response: ['choice_0', 'choice_1', 'choice_2', 'choice_3', 'choice_4', 'none_are_true']
Unit 739660103 worker 33078019 response: ['choice_0', 'choice_1', 'choice_2', 'choice_3', 'choice_4', 'none_are_true']
Unit 739660111 worker 33078019 response: ['choice_0', 'choice_1', 'choice_2', 'choice_3', 'choice_4', 'none_are_true']
Unit 739660105 worker 33078019 response: ['choice_0', 'choice_1', 'choice_2', 'choice_3', 'choice_4', 'none_are_true']
Unit 739660124 worker 33078019 response: ['choice_0', 'choice_1', 'none_are_true']
Unit 739660096 worker 33078019 response: ['choice_0', 'none_are_true']
Unit 739660131 w

### No one selected an empty choice by accident alongside otherwise normal choices:

In [14]:
# did anyone click an empty response but not none of the above?
# (use a new name so the set of flagged responses from In [10] is not overwritten)
empty_only_responses = set()

for idx, row in raw_data.iterrows():
    unit_id = row["_unit_id"]
    worker_id = row["_worker_id"]

    response = row["chemical_disease_relationships"].split('\n')

    for i in range(5):
        column = "choice_{0}_ids".format(i)
        clicked = "choice_{0}".format(i) in response
        # clicked an empty choice without also selecting "none_are_true"
        if (row[column] == "empty") and clicked and ("none_are_true" not in response):
            empty_only_responses.add((unit_id, worker_id))

empty_only_responses

set()

The fact that no person chose an empty choice without also choosing "none are true" is pretty damning. This means that everyone who was working properly followed the rules and did not choose the empty choices.

Conversely, anyone who chose an empty choice also chose none of the above, which implies that they were not following the rules.
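
The reasoning above amounts to a subset relation: every judgement that clicked an empty choice is also one that combined "none are true" with other choices. The sketch below shows that check on plain Python data; the function names and the `(key, selected, choice_ids)` tuple layout are assumptions made for illustration.

```python
def clicked_empty(selected, choice_ids):
    """True if any selected choice_N has the placeholder id "empty"."""
    return any(
        ids == "empty" and "choice_{0}".format(i) in selected
        for i, ids in enumerate(choice_ids)
    )


def split_flags(judgements):
    """Return (empty_clickers, none_plus_other) as sets of judgement keys.

    judgements -- iterable of (key, selected_list, choice_ids) tuples
    """
    empty_clickers = set()
    none_plus_other = set()
    for key, selected, choice_ids in judgements:
        if clicked_empty(selected, choice_ids):
            empty_clickers.add(key)
        if "none_are_true" in selected and len(selected) > 1:
            none_plus_other.add(key)
    return empty_clickers, none_plus_other
```

With these two sets, `empty_clickers <= none_plus_other` being true corresponds to the observation that nobody clicked an empty choice on its own.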

### Luckily, the cheaters were not very smart and made obvious cheating patterns. I have flagged and rejected all the work that these users performed.

All responses from these seven users ([33078019, 31706962, 29794479, 32088050, 32337300, 33203286, 29269406]) will be eliminated from the data. Thankfully, the 77 tainted responses represent only about 30% of the 255 judgements, so we can still perform some useful analyses.
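
Because every judgement from a flagged worker is dropped, not only the flagged ones, the amount of data lost is somewhat larger than the tainted share. The short calculation below uses only the counts reported in this notebook (255 judgements before filtering, 77 flagged, 167 remaining); the variable names are illustrative.

```python
total_judgements = 255      # len(raw_data)
flagged_judgements = 77     # len(bad_responses)
remaining_judgements = 167  # len(cleaned_data)

# Judgements dropped when all work by the flagged workers is removed.
removed_judgements = total_judgements - remaining_judgements

# Shares of the original dataset, in percent.
flagged_share = 100.0 * flagged_judgements / total_judgements
removed_share = 100.0 * removed_judgements / total_judgements
```

This gives 88 removed judgements: roughly 30.2% of the data was flagged, and roughly 34.5% is dropped in total.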

In [15]:
cheaters.keys()

[33078019, 31706962, 29794479, 32088050, 32337300, 33203286, 29269406]

In [16]:
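# remove all responses made by the cheaters

# materialise the ids as a list so this also works where keys() returns a view
cheater_ids = list(cheaters.keys())

# "@cheater_ids" lets query() read the local variable directly
cleaned_data = raw_data.query("_worker_id not in @cheater_ids")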

In [17]:
len(cleaned_data)

167
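
A quick sanity check after a removal step like this is to confirm that none of the excluded workers remain. The sketch below shows the idea on plain dictionaries rather than the DataFrame; `drop_workers` is a hypothetical helper and the toy rows are made up.

```python
def drop_workers(rows, banned_ids):
    """Return the rows whose _worker_id is not in banned_ids."""
    banned = set(banned_ids)
    return [row for row in rows if row["_worker_id"] not in banned]


# Toy data: three judgements, one from a banned worker.
toy_rows = [
    {"_unit_id": 1, "_worker_id": 100},
    {"_unit_id": 1, "_worker_id": 200},
    {"_unit_id": 2, "_worker_id": 100},
]
kept = drop_workers(toy_rows, [200])

# No banned worker should survive the filter.
assert all(row["_worker_id"] != 200 for row in kept)
```

On the real data, the equivalent check would be that the set of worker ids in `cleaned_data` shares no element with `cheater_ids`.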

Save to file:

In [18]:
cleaned_data.to_csv("data/crowdflower/cleaned_job_743229_full.csv", sep = ",", index = False)