# Analysis of job 743229: Vote aggregation

2015-06-19 Tong Shu Li

Now that the data are cleaned up, we can ask how well our crowd performed with respect to the gold standard.

In [1]:
from __future__ import division
from collections import defaultdict
import pandas as pd

In [2]:
from src.filter_data import filter_data
from src.parse_gold import parse_input
from src.parse_gold import Relation

### Read our finalized data:

In [3]:
settings = {
    "loc": "data/crowdflower",
    "fname": "job_743229_final_data.csv",
    "data_subset": "all",
    "min_accuracy": 0.7,
    "max_accuracy": 1.0
}

final_data = filter_data(settings)

In [4]:
len(final_data)

58

In [5]:
final_data.head()

Unnamed: 0,_unit_id,_created_at,_golden,_id,_missed,_started_at,_tainted,_channel,_trust,_worker_id,...,choice_2_ids,choice_2_label,choice_3_ids,choice_3_label,choice_4_ids,choice_4_label,form_abstract,form_title,pmid,uniq_id
0,739660089,6/19/2015 00:44:51,False,1665425696,,6/19/2015 00:39:14,False,clixsense,0.833333,31001914,...,D002945_induces_D003643,"<span class=""chemical"">cisplatin</span> contri...",D002945_induces_D009503,"<span class=""chemical"">cisplatin</span> contri...",D002945_induces_D002289,"<span class=""chemical"">cisplatin</span> contri...","<p>BACKGROUND: <span class=""chemical"">Cisplati...","Paclitaxel, <span class=""chemical"">cisplatin</...",11135224,bcv_id_3
1,739660089,6/19/2015 01:09:45,False,1665448869,,6/19/2015 01:06:44,False,elite,0.714286,31668998,...,D002945_induces_D003643,"<span class=""chemical"">cisplatin</span> contri...",D002945_induces_D009503,"<span class=""chemical"">cisplatin</span> contri...",D002945_induces_D002289,"<span class=""chemical"">cisplatin</span> contri...","<p>BACKGROUND: <span class=""chemical"">Cisplati...","Paclitaxel, <span class=""chemical"">cisplatin</...",11135224,bcv_id_3
2,739660095,6/19/2015 00:44:51,False,1665425703,,6/19/2015 00:39:14,False,clixsense,0.833333,31001914,...,D013874_induces_D010146,"<span class=""chemical"">Thiopentone</span> cont...",D013874_induces_D014474,"<span class=""chemical"">Thiopentone</span> cont...",D008012_induces_D010146,"<span class=""chemical"">lidocaine</span> contri...","This study investigated <span class=""chemical""...","<span class=""chemical"">Thiopentone</span> pret...",8595686,bcv_id_9
3,739660095,6/19/2015 01:09:45,False,1665448873,,6/19/2015 01:06:44,False,elite,0.714286,31668998,...,D013874_induces_D010146,"<span class=""chemical"">Thiopentone</span> cont...",D013874_induces_D014474,"<span class=""chemical"">Thiopentone</span> cont...",D008012_induces_D010146,"<span class=""chemical"">lidocaine</span> contri...","This study investigated <span class=""chemical""...","<span class=""chemical"">Thiopentone</span> pret...",8595686,bcv_id_9
4,739660097,6/19/2015 00:29:23,False,1665411021,,6/19/2015 00:27:14,False,neodev,0.7,11000920,...,D010862_induces_D028361,"<span class=""chemical"">pilocarpine</span> cont...",D010862_induces_D004827,"<span class=""chemical"">pilocarpine</span> cont...",D010862_induces_D004833,"<span class=""chemical"">pilocarpine</span> cont...","<span class=""disease"">Mitochondrial abnormalit...",Investigation of mitochondrial involvement in ...,16337777,bcv_id_11


## Aggregation scheme:

1. Aggregate on unique id (part of the N x M expansion of all drug-disease pairs)
2. Aggregate above results based on PMID
3. Return a ranked list of drug-disease pairs for each PMID
4. Perform ROC analysis on ranked list using normalized threshold.
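
Step 4 can be sketched roughly as follows. This is only an illustration of sweeping a threshold over a ranked list: `roc_points`, the candidate names, and the scores are hypothetical and not part of this notebook. It assumes the candidate list contains at least one gold and one non-gold pair, and it ignores gold relations that the crowd never proposed.

```python
def roc_points(scored, gold):
    """scored: list of (item, normalized_score); gold: set of true items.
    Returns one (threshold, fpr, tpr) tuple per distinct score, highest first."""
    positives = sum(1 for item, _ in scored if item in gold)
    negatives = len(scored) - positives
    points = []
    for threshold in sorted(set(s for _, s in scored), reverse=True):
        # everything scoring at or above the threshold counts as predicted
        predicted = [item for item, s in scored if s >= threshold]
        tp = sum(1 for item in predicted if item in gold)
        fp = len(predicted) - tp
        points.append((threshold, fp / float(negatives), tp / float(positives)))
    return points

# hypothetical ranked candidates for one abstract
scored = [("A_induces_X", 1.0), ("B_induces_X", 0.6),
          ("A_induces_Y", 0.4), ("B_induces_Y", 0.1)]
gold = {"A_induces_X", "A_induces_Y"}
points = roc_points(scored, gold)
```

Each tuple is one point on the ROC curve; lowering the threshold admits more candidates, so both rates can only stay the same or rise.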

## Voting scheme:

Consider the case where we have M choices voted upon by N people. One of the M choices is that "None of the given choices are true". How do we aggregate votes?

Since the M choices represent the set of all possible drug-disease relationships in an abstract, we want a list of the possible relationships for each work unit, each with a score indicating how confident we are that the relationship is true. Picking only the top answer is not the right approach, because it would discard the scores of all the other choices.

With these considerations in mind, the voting scheme will be as follows:
1. Choices that a person does not pick keep their current score.
2. Each choice that a person explicitly picks has that person's trust score added to its score.
3. The "none of the above" choice subtracts that person's trust score from every other choice.

Finally, all the choices are ranked by decreasing score, and any choice with a positive score is kept.
A negative score means that the trust-weighted votes against a choice outweighed the votes for it.

The scores are then normalized by the sum of the trust scores of everyone who worked on that work unit. This lets us compare work units that received different numbers of votes.
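
As a small worked example of the scheme above, consider a hypothetical work unit with two choices and three workers. The trust scores and votes are made up for illustration, and `score_votes` is a simplified stand-in, not the function used later in this notebook.

```python
from collections import defaultdict

def score_votes(votes, choices):
    """votes: list of (trust, picked), where picked is a list of choice
    labels or ["none_are_true"]. Returns {choice: normalized score}."""
    scores = defaultdict(float)
    for trust, picked in votes:
        if "none_are_true" in picked:
            # rule 3: vote against every choice
            for choice in choices:
                scores[choice] -= trust
        else:
            # rule 2: add the worker's trust to each explicit pick
            for choice in picked:
                scores[choice] += trust
    # normalize by the total trust of everyone who voted on this unit
    total_trust = sum(trust for trust, _ in votes)
    return {choice: scores[choice] / total_trust for choice in choices}

choices = ["choice_0", "choice_1"]
votes = [
    (0.8, ["choice_0"]),               # picks only choice_0
    (0.7, ["choice_0", "choice_1"]),   # picks both
    (0.5, ["none_are_true"]),          # votes against both
]
normalized = score_votes(votes, choices)
```

Here `choice_0` collects 0.8 + 0.7 - 0.5 = 1.0 and `choice_1` collects 0.7 - 0.5 = 0.2. Dividing by the total trust of 2.0 gives roughly 0.5 and 0.1, so both choices would be kept, with `choice_0` ranked first.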



### Group votes first by unique id:

Given a data frame containing all N votes for the M choices of one work unit, aggregate the votes and return a data frame with 4 columns: the id pair for the relationship, the normalized score, the uniq id, and the unit id. The pmid column is added afterwards, when the results are grouped by PMID.

In [6]:
def aggregate_votes(uniq_id, data_frame):
    """
    Given a data frame representing all the unique votes
    for one work unit, aggregates the votes for each of the
    possible choices.
    
    Returns an unsorted data frame containing the relationships
    with normalized scores.
    """
    rel_id = dict()
    
    # first map the ids: choice # -> id_pair
    for i in range(5):
        colname = "choice_{0}_ids".format(i)
        assert len(data_frame[colname].unique()) == 1
        rel_id["choice_{0}".format(i)] = data_frame.iloc[0][colname]  

    scores = defaultdict(float)
    # increment each relationship pair id by the worker's trust score
    for idx, row in data_frame.iterrows():
        # check that none of the above does not conflict with the other choices
        user_choices = row["chemical_disease_relationships"].split('\n')
        if "none_are_true" in user_choices:
            assert len(user_choices) == 1
            # vote against all other choices
            for i in range(5):
                scores[rel_id["choice_{0}".format(i)]] -= row["_trust"]
        else:
            for choice in user_choices:
                scores[rel_id[choice]] += row["_trust"]
            
    total_trust = sum(data_frame["_trust"])
    
    # normalize scores; keep only strictly positive ones and skip the "empty" placeholder
    temp = defaultdict(list)
    for id_pair, score in scores.items():
        score /= total_trust
        if score > 0 and id_pair != "empty":
            temp["id_pair"].append(id_pair)
            temp["normalized_score"].append(score)
            
    df = pd.DataFrame(temp)
    
    df["uniq_id"] = uniq_id
    assert len(data_frame["_unit_id"].unique()) == 1
    df["unit_id"] = data_frame["_unit_id"].iloc[0]
    
    return df

### Aggregate by PMID:

In [7]:
def generate_results():
    results = []
    for pmid, pmid_group in final_data.groupby("pmid"):
        temp = []
        for uniq_id, group in pmid_group.groupby("uniq_id"):
            scores = aggregate_votes(uniq_id, group)
            temp.append(scores)

        df = pd.concat(temp)
        if not df.empty:
            # DataFrame.sort was removed from later pandas releases; sort_values is the replacement
            df = df.sort_values("normalized_score", ascending = False)
            df["pmid"] = pmid
            results.append(df)
            
    return pd.concat(results)

In [8]:
results = generate_results()

results

Unnamed: 0,id_pair,normalized_score,uniq_id,unit_id,pmid
1,D002512_induces_D007683,1.0,bcv_id_46,739660132,1130930
0,D002512_induces_D007674,0.482759,bcv_id_46,739660132,1130930
0,D008094_induces_D007676,1.0,bcv_id_7,739717853,1378968
0,D009241_induces_D029424,0.378238,bcv_id_33,739660119,1835291
1,D013806_induces_D029424,0.051813,bcv_id_33,739660119,1835291
0,D008874_induces_D012140|D002318,1.0,bcv_id_39,739660125,2375138
0,D005996_induces_D008881,1.0,bcv_id_32,739675179,2515254
0,D010423_induces_D009408|D020425,1.0,bcv_id_15,739660101,3800626
0,D001241_induces_D013274,1.0,bcv_id_41,739660127,6692345
1,D010248_induces_D014581,0.714286,bcv_id_36,739660122,7582165


### We didn't manage to collect enough data with this job, so let's just look at performance if we take the top result for each paper:
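
Taking the top result for each paper amounts to keeping the highest-scoring pair per PMID. A minimal sketch in plain Python, using four rows copied from the results table above; `top_per_pmid` is a hypothetical helper, not part of this notebook.

```python
def top_per_pmid(rows):
    """rows: iterable of (pmid, id_pair, normalized_score).
    Returns {pmid: (id_pair, score)} keeping the highest-scoring pair."""
    best = {}
    for pmid, id_pair, score in rows:
        if pmid not in best or score > best[pmid][1]:
            best[pmid] = (id_pair, score)
    return best

rows = [
    (1130930, "D002512_induces_D007683", 1.0),
    (1130930, "D002512_induces_D007674", 0.482759),
    (1835291, "D009241_induces_D029424", 0.378238),
    (1835291, "D013806_induces_D029424", 0.051813),
]
top = top_per_pmid(rows)
```

With these rows, PMID 1130930 keeps `D002512_induces_D007683` and PMID 1835291 keeps `D009241_induces_D029424`. Ties are resolved in favor of whichever pair appears first.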

In [9]:
training_data = parse_input("data/training", "CDR_TrainingSet.txt")

In [10]:
used_pmids = set(results["pmid"].unique())
used_pmids

{1130930,
 1378968,
 1835291,
 2375138,
 2515254,
 3800626,
 6692345,
 7582165,
 8590259,
 8595686,
 9522143,
 10520387,
 10835440,
 11135224,
 12041669,
 12198388,
 15602202,
 15632880,
 16167916,
 16337777,
 17261653}

In [11]:
# create the gold
gold_relations = dict()

for paper in training_data:
    if int(paper.pmid) in used_pmids:
        gold_relations[paper.pmid] = paper.relations
    
print len(gold_relations)

21


In [12]:
def in_gold(pmid, annot):
    for gold in gold_relations[str(pmid)]:
        if gold == annot:
            return True
        
    return False

In [19]:
# check results of crowd against gold:

# just do all of them:

def statistics():
    num_intersect = 0
    num_guesses = 0

    for pmid, group in results.groupby("pmid"):
        
        temp = group["id_pair"].iloc[0].split("_induces_")
        annot = Relation(temp[0], temp[1])
        if in_gold(pmid, annot):
            num_intersect += 1
            
        num_guesses += 1
        

#         for crowd_rel in group["id_pair"]:
#             temp = crowd_rel.split("_induces_")
#             annot = Relation(temp[0], temp[1])

#             if in_gold(pmid, annot):
#                 num_intersect += 1

    total_pos = 0
    for pmid, rels in gold_relations.items():
        total_pos += len(rels)

    print "recall: {0}".format(num_intersect / total_pos)
    print "precision: {0}".format(num_intersect / num_guesses)

In [20]:
statistics()

recall: 0.28
precision: 0.666666666667
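
These two figures can be reproduced by hand. The count of 14 correct top guesses is an assumption: it is not printed anywhere in the notebook and is inferred here from the reported recall and the 50 gold relations. The 21 guesses correspond to one top result for each of the 21 PMIDs.

```python
num_intersect = 14  # assumed: inferred from recall 0.28 * 50, not printed in the notebook
num_guesses = 21    # one top guess per PMID
total_pos = 50      # total gold relations across the 21 papers

recall = num_intersect / float(total_pos)
precision = num_intersect / float(num_guesses)
```

Under that assumption, recall is 14 / 50 = 0.28 and precision is 14 / 21, or about 0.667, matching the output above.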


In [17]:
total_pos = 0
for pmid, rels in gold_relations.items():
    total_pos += len(rels)
    
print "total gold relations: {0}".format(total_pos)

print len(results)

    

total gold relations: 50
28


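In [15]:
# num_intersect is not defined at the top level by any cell shown above; this cell
# appears to reuse a value from an earlier run that counted all 28 crowd relations
# rather than only the top result per PMID.
print "recall: {0}".format(num_intersect / total_pos)
print "precision: {0}".format(num_intersect / len(results))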

recall: 0.34
precision: 0.607142857143


Precision is fairly low (about 0.61 to 0.67 depending on how the guesses are counted), but we have to keep in mind that the data are far from complete due to CrowdFlower quiz issues. Overall, these data are mostly inconclusive about how well the crowd can handle this task. We need to gather far more responses before we can say anything about precision and recall.