# Relationship Annotation Sampling for Manual Quality Inspection

This notebook covers the basic data cleanup that eases downstream analysis, and the random sampling of relationship annotations for manual quality inspection. Mark2Cure is an open source, citizen science effort. Although this notebook could be much shorter, it is written to be easy to follow for people WITHOUT bioinformatics or programming experience, and will hopefully serve as an introduction to some basic [Python](https://en.wikibooks.org/wiki/Python_Programming) and [pandas](https://pandas.pydata.org/) functions.

## Import [modules](https://en.wikibooks.org/wiki/Python_Programming/Modules) and [dictionaries](https://en.wikibooks.org/wiki/Python_Programming/Dictionaries)

In [1]:
import pandas
import random
import relationship_dictionaries
from m2c_rel_basic import add_type_cols
from m2c_rel_basic import split_out_testers
from pandas import read_csv

The relationship annotations made by users are saved as hash codes. To make sense of them, import the dictionaries available for translating them. The dictionaries can be found in the relationship_dictionaries.py file.

In [2]:
rel_hash_dict,redundant_response_dict,abbreviated_rels_dict,abbreviated_rels_dict_4_hash,concept_broken_dict,concept_not_broken_dict = relationship_dictionaries.load_RE_dictionaries()


## Load, Format, and Clean up data

Import the exported relationship annotations data. 

In [3]:
datasource = '2017.11.22 RE anns export.txt'
savepath = 'data/'
exppath = 'exports/'
filesrc = savepath+datasource
all_data_imported = read_csv(filesrc, delimiter='\t', header=0)
#print(all_data_imported.head(2))
print(len(all_data_imported))

28059


For concept 1 and concept 2, add the concept type based on the information in the relation_type column (renamed to reltype below).

In [4]:
all_data_imported.rename(columns={'relation_type':'reltype','answer':'evtype','concept_1_id':'refid1','concept_2_id':'refid2'}, inplace=True)
all_data_imported['concept_pair']=all_data_imported['refid1'].astype(str).str.cat(all_data_imported['refid2'].astype(str),sep="_x_")
all_relation_anns = add_type_cols(all_data_imported)
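
The add_type_cols function lives in m2c_rel_basic and its code is not shown in this notebook. As a rough illustration only, the sketch below shows one way concept types could be derived from a reltype value such as g_d. The function name add_type_cols_sketch and the abbreviation mapping (g, c, d) are assumptions made for this example, not the actual implementation.

```python
import pandas

# Hypothetical mapping from reltype abbreviations to concept types (an assumption)
TYPE_ABBREVIATIONS = {'g': 'gene', 'c': 'chemical', 'd': 'disease'}

def add_type_cols_sketch(df):
    """Return a copy of df with refid1_type and refid2_type derived from reltype."""
    out = df.copy()
    # 'g_d' is split on the underscore into two columns: 'g' and 'd'
    parts = out['reltype'].str.split('_', expand=True)
    out['refid1_type'] = parts[0].map(TYPE_ABBREVIATIONS)
    out['refid2_type'] = parts[1].map(TYPE_ABBREVIATIONS)
    return out

toy = pandas.DataFrame({'reltype': ['g_d', 'c_d'],
                        'refid1': ['5443', 'C095810'],
                        'refid2': ['D004931', 'D008232']})
typed = add_type_cols_sketch(toy)
print(typed[['reltype', 'refid1_type', 'refid2_type']])
```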

Standardize the user responses using the dictionaries in order to make the data more interpretable. The rel_hash_dict translates the relationship annotation hash code to the response that the user selected. Note that some options are redundant and were included based on internal usability study results indicating that users were more comfortable having such options. The redundant_response_dict merges redundant responses so that they are treated appropriately.

In [5]:
print('notice that: \n')
print(all_relation_anns[['user_id','evtype','reltype','pmid']].head(n=2))
all_relation_anns.replace({'evtype':rel_hash_dict}, inplace=True)
all_relation_anns.replace({'evtype':redundant_response_dict}, inplace=True)
print('\n becomes: \n')
print(all_relation_anns[['user_id','evtype','reltype','pmid']].head(n=2))

notice that: 

   user_id                                    evtype reltype     pmid
0      364  52d80rv4t0h0g14gb83oamjfm8h9rz19zl1ubzku     g_d  9621534
1      364  zl4RlTGwZM9Ud3CCXpU2VZa7eQVnJj0MdbsRBMGy     g_d  9621534

 becomes: 

   user_id                           evtype reltype     pmid
0      364  gene has no relation to disease     g_d  9621534
1      364                       c_1_broken     g_d  9621534
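
To see how the two-step translation works on a small scale, here is a minimal sketch using made-up codes and made-up response wordings (the names toy_hash_dict and toy_redundant_dict, and all of their contents, are assumptions for illustration). Codes that are missing from the dictionary are simply left unchanged.

```python
import pandas

# Made-up codes standing in for the real hash codes
toy_hash_dict = {'abc123': 'gene relates to disease',
                 'def456': 'gene is related to disease'}
# Made-up redundancy mapping: fold one wording into another
toy_redundant_dict = {'gene is related to disease': 'gene relates to disease'}

toy = pandas.DataFrame({'user_id': [1, 2, 3],
                        'evtype': ['abc123', 'def456', 'zzz999']})

# Step 1: translate codes into the response text
toy = toy.replace({'evtype': toy_hash_dict})
# Step 2: merge redundant responses into a single response
toy = toy.replace({'evtype': toy_redundant_dict})
print(toy)
```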


**A single task unit is the annotation of a specific concept pair from a specific abstract (pmid). Hence, a single task can be identified by the concept pair and pmid.**

To get a count of the number of users that classified each task, we use the pandas groupby and size functions. The groupby function groups the rows of the table by the values of whatever columns you specify. The size function then returns the number of rows in each group.

Group the relationship table by pmid and concept pair and obtain the number of rows (i.e., the size) in which each task appears. This results in a table of unique tasks with a count of the number of times each task was done by the Mark2Cure community.

In [6]:
## Group relationship table by pmid and concept pair to get number of users that classified each concept pair
rel_ann_counts = all_relation_anns.groupby(['pmid','refid1','refid2','reltype','concept_pair']).size().reset_index(name='user_count')
pmid_counts = all_relation_anns.groupby(['pmid']).size().reset_index(name='pmid_count')
print('total number of unique tasks done by users: ',len(rel_ann_counts),'\n')

## Pull annotations that have been completed by at least 15 users
threshold = 15
ann_threshold = rel_ann_counts.loc[rel_ann_counts['user_count']>=threshold]
completed_conceptpairs = rel_ann_counts.loc[rel_ann_counts['user_count']>=threshold]
print('total number of unique tasks considered complete (done by 15 users): ', len(ann_threshold))


total number of unique tasks done by users:  4047 

total number of unique tasks considered complete (done by 15 users):  1009
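
The groupby and size steps can be tried on a tiny made-up table. This is only a sketch: the toy values and the threshold of 2 are assumptions chosen to keep the example small.

```python
import pandas

toy = pandas.DataFrame({
    'pmid': [111, 111, 111, 222],
    'concept_pair': ['a_x_b', 'a_x_b', 'a_x_c', 'a_x_b'],
    'user_id': [1, 2, 1, 1],
})

# One row per unique task, with the number of rows (submissions) in each group
task_counts = toy.groupby(['pmid', 'concept_pair']).size().reset_index(name='user_count')
print(task_counts)

# Keep only the tasks that meet the (toy) completion threshold
toy_threshold = 2
complete = task_counts.loc[task_counts['user_count'] >= toy_threshold]
print(len(complete))
```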


Mark2Cure does not have a separate development server; thus, internal testing is sometimes performed on the live production server using test accounts. The data from these accounts should always be excluded from downstream analysis. 

For our downstream analysis we want to focus on annotation tasks that can be considered complete. These are tasks which have been done by at least 15 users and should no longer be available for users to work on. Because test accounts may have done some of these tasks, we distinguish between true annotations (submitted by users) and test annotations (test submissions).

Going back to our relationship annotation table (with concept types added), we can use the split_out_testers function to split this table into two separate tables: one without test annotations (filtered_results) and one with only test annotations (test_anns).

In [7]:
## Split the relationship annotation table into true responses vs test responses
filtered_results, test_anns, test_account_list = split_out_testers(all_relation_anns)
#print(filtered_results)
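
The code for split_out_testers is in m2c_rel_basic and is not shown here. The sketch below only illustrates the general idea of splitting a table on a list of account ids. The function name split_out_testers_sketch, its test_user_ids parameter, and the ids used are all assumptions, and unlike the real function it does not return the list of test accounts.

```python
import pandas

def split_out_testers_sketch(df, test_user_ids):
    """Split df into (non-test rows, test rows) based on user_id."""
    is_test = df['user_id'].isin(test_user_ids)
    return df.loc[~is_test].copy(), df.loc[is_test].copy()

toy = pandas.DataFrame({'user_id': [10, 9001, 11, 9001],
                        'pmid': [111, 111, 222, 222]})

# 9001 is a made-up test account id
toy_filtered, toy_tests = split_out_testers_sketch(toy, test_user_ids=[9001])
print(len(toy_filtered), len(toy_tests))
```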

**Clean up the nontest, user annotations**

In [8]:
## Get the number of times each response (evtype) was selected for each unique task (pmid,concept pair)
nontest_anns = filtered_results.copy()
cprelation_counts = nontest_anns.groupby(['pmid','concept_pair','reltype','evtype','refid1','refid2']).size().reset_index(name='relation_count')
#print(cprelation_counts.head(n=5))

## Get the number of users that annotated each task (eg- number of times the task was done by real users)
nontest_counts = nontest_anns.groupby(['pmid','concept_pair']).size().reset_index(name='true_completions')

**Analyze the test annotations for inclusion in the cleaned up data file**

In [9]:
## Get the test completion counts
test_counts = test_anns.groupby(['pmid','concept_pair','reltype','evtype','refid1','refid2']).size().reset_index(name='test_completions')
print(test_counts.head(n=5))


      pmid    concept_pair reltype                   evtype refid1   refid2  \
0  1325164   5443_x_202200     g_d  gene relates to disease   5443   202200   
1  1325164  5443_x_C536009     g_d  gene relates to disease   5443  C536009   
2  1325164  5443_x_D004931     g_d  gene relates to disease   5443  D004931   
3  1325164  5443_x_D035583     g_d               c_2_broken   5443  D035583   
4  1325164  5443_x_D052439     g_d               c_2_broken   5443  D052439   

   test_completions  
0                 1  
1                 1  
2                 1  
3                 1  
4                 1  


There are many ways to use pandas for filtering data. One way is to use the merge function, which merges the data from multiple tables into one based on selected columns they have in common. There is a nice introductory explanation of pandas merge types [here](https://www.shanelynn.ie/merge-join-dataframes-python-pandas-index-1/#mergetypes). By using a left merge, we can very quickly pull the corresponding data from an unfiltered table into a filtered one, resulting in an expanded filtered table.
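
As a small illustration of a left merge, the sketch below uses two made-up tables (the table names and values are assumptions). Every row of the left table is kept, and rows with no match in the right table get a missing value, which fillna(0) then turns into 0.

```python
import pandas

toy_true_counts = pandas.DataFrame({'pmid': [111, 111, 222],
                                    'evtype': ['relates', 'no relation', 'relates'],
                                    'relation_count': [12, 3, 15]})
toy_test_counts = pandas.DataFrame({'pmid': [111],
                                    'evtype': ['relates'],
                                    'test_completions': [1]})

# Keep all rows of the left table; pull in test_completions where pmid and evtype match
joined = toy_true_counts.merge(toy_test_counts, on=['pmid', 'evtype'], how='left').fillna(0)
print(joined)
```

Note that test_completions becomes a float column, because the missing values introduced by the merge cannot be stored in an integer column.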

In [10]:
## Add the data on test completions as a column to the non-test table for ease of downstream analysis
tmpjoined = cprelation_counts.merge(test_counts,on=['pmid','concept_pair','reltype','evtype','refid1','refid2'],how='left')

## Use completed_conceptpairs table to select only cp/pmids in cprelation_counts table that have been completed by at least 15 users
annresults = pandas.merge(completed_conceptpairs, tmpjoined, on=['pmid', 'concept_pair','reltype','refid1','refid2'], how='left').fillna(0)

## Calculate number of true completions by subtracting out the test completions
annresults['true_responses'] = annresults['user_count'].astype(int).sub(annresults['test_completions'].astype(int))

## obtain ratios for each answer selected for each task
annresults['response_ratio'] = annresults['relation_count'].astype(int).div(annresults['true_responses'].astype(int))
print(annresults.head(n=5))

      pmid   refid1   refid2 reltype       concept_pair  user_count  \
0  1299347  C095810  D008232     c_d  C095810_x_D008232          15   
1  1299347  C095810  D008232     c_d  C095810_x_D008232          15   
2  1299347  D005944  D008232     c_d  D005944_x_D008232          15   
3  1299347  D005944  D008232     c_d  D005944_x_D008232          15   
4  1299347  D005944  D008232     c_d  D005944_x_D008232          15   

                                   evtype  relation_count  test_completions  \
0                              c_1_broken              14               0.0   
1             drug (may) cause(s) disease               1               0.0   
2                              c_1_broken               8               0.0   
3  drug (may) increase(s) risk of disease               1               0.0   
4         drug has no relation to disease               3               0.0   

   true_responses  response_ratio  
0              15        0.933333  
1              15        0
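
The response ratio is the number of times a response was selected divided by the number of true (non-test) responses for that task. The sketch below repeats the same two calculations on a small table built from the kind of values shown for the first task above (15 users, no test completions, 14 selections of c_1_broken and 1 of another response). The table name toy is an assumption.

```python
import pandas

toy = pandas.DataFrame({'evtype': ['c_1_broken', 'drug (may) cause(s) disease'],
                        'user_count': [15, 15],
                        'relation_count': [14, 1],
                        'test_completions': [0.0, 0.0]})

# True responses = all submissions minus test submissions
toy['true_responses'] = toy['user_count'].astype(int).sub(toy['test_completions'].astype(int))
# Ratio = times this response was chosen / true responses for the task
toy['response_ratio'] = toy['relation_count'].astype(int).div(toy['true_responses'].astype(int))
print(toy[['evtype', 'true_responses', 'response_ratio']])
```

For a task with no test completions, the ratios of all of its responses add up to 1.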

Pull the annotations completed by at least 15 users for downstream analysis.

In [11]:
## Add the true response and response_ratio data for completed annotations to the original nontest annotation table.
all_anns = nontest_anns.merge(annresults,on=['concept_pair','refid1','refid2','reltype','pmid','evtype'],how='left').fillna(-1)
all_completed_anns = all_anns.loc[all_anns['true_responses']!=-1].copy()
print('Task submissions in the set of completed Relationship tasks: ',len(all_completed_anns))

## Create the set of all concept pairs x pmid done by real users
all_cp_pmids = all_anns.groupby(['pmid','concept_pair','refid1','refid2','refid1_type','refid2_type','reltype']).size().reset_index(name='counts')

## Create a unique task identifier by joining the pmid and the concept pair into one string for ease of downstream analysis
all_completed_anns['cpmid'] = all_completed_anns['pmid'].astype(str).str.cat(all_completed_anns['concept_pair'].astype(str), sep='_')
all_cp_pmids['cpmid'] = all_cp_pmids['pmid'].astype(str).str.cat(all_cp_pmids['concept_pair'].astype(str), sep='_')

cpmid_set = all_completed_anns['cpmid'].unique().tolist()
print('Unique pmid-specific concept pairs (ie-relation tasks) completed: ',len(cpmid_set))

Task submissions in the set of completed Relationship tasks:  15739
Unique pmid-specific concept pairs (ie-relation tasks) completed:  1009
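
The cpmid identifier is simply the pmid and the concept pair joined into one string. A minimal sketch on a two-row toy table, using a pmid and concept pairs that appear in the output earlier in this notebook:

```python
import pandas

toy = pandas.DataFrame({'pmid': [1299347, 1299347],
                        'concept_pair': ['C095810_x_D008232', 'D005944_x_D008232']})

# Convert pmid to text, then join it to the concept pair with an underscore
toy['cpmid'] = toy['pmid'].astype(str).str.cat(toy['concept_pair'].astype(str), sep='_')
print(toy['cpmid'].tolist())
```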


### Save the cleaned up and completed annotation data for future downstream analysis

In [45]:
#### Export the completed annotations dataframe for future analysis
annresults.to_csv(savepath+'annresults.txt', sep='\t', header=True)
all_completed_anns.to_csv(savepath+'all_completed_anns.txt', sep='\t', header=True)

In [16]:
all_completed_anns_pmids = all_completed_anns[['pmid']].drop_duplicates(keep='first').reset_index(drop=True)
all_completed_anns_pmids.to_csv(exppath+'all_completed_anns_pmids.txt', sep='\t', header=True)

## Take a 10% sample of the completed annotations for manual quality inspection

Take a sample of the cp_pmids.  
The set of unique tasks (cpmids), with test accounts removed, contains 1009 cpmids.  
10% of this would be about 100 cpmids.  
Draw 4 samples of 30 unique cpmids each (120 in total, slightly more than 10%), which may allow for expert inconsistency as well.  
Export the samples for manual inspection.

In [46]:
iterations_to_try = 4
samples_per_iteration = 30

for i in range(iterations_to_try):
    ## Draw 30 unique cpmids; each draw is independent, so a cpmid can appear in more than one sample
    sampling_set = random.sample(cpmid_set, samples_per_iteration)
    ## Build a fresh table on each pass so that each exported file holds only that pass's sample
    sampling_table = all_cp_pmids.loc[all_cp_pmids['cpmid'].isin(sampling_set)]
    sampling_table.to_csv(exppath+'sample_'+str(i)+'_for_expert_ann.txt', sep='\t', header=True)
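
Because random.sample gives a different result on every run, the exported samples cannot be regenerated exactly. If reproducible samples are wanted, one option is to use a seeded random number generator, as in the sketch below. The seed value 42 and the made-up task names are assumptions. The sketch also counts how many distinct tasks the 4 samples cover, since independently drawn samples can overlap.

```python
import random

# Made-up stand-in for cpmid_set, with the same length as the real set
toy_cpmid_set = ['task_' + str(n) for n in range(1009)]

# A seeded generator returns the same samples every time the cell is run
rng = random.Random(42)  # the seed value is arbitrary
samples = [rng.sample(toy_cpmid_set, 30) for _ in range(4)]

# Each sample has 30 unique tasks, but different samples may share tasks
distinct_tasks = set().union(*samples)
print(len(distinct_tasks))
```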
