#### step 1: cluster entities by text span (spans that share the same starting index)
- For now, we skip mentions from the same annotator that share a starting index but have different ending indices (nested annotations), for example:
- [{'span': [276, 288], ...'annotator': 'NIO'},
   {'span': [276, 296], ...'annotator': 'NIO'}]
- Open question: how to exclude the case where a cluster contains both PTC and NIO annotations, but the NIO annotations are themselves nested, for example:
- [{'span': [276, 288], ...'annotator': 'NIO'},
   {'span': [276, 288], ...'annotator': 'PTC'},
   {'span': [276, 296], ...'annotator': 'NIO'}]
- Restrict each comparison to exactly one entity from PTC and one entity from NIO.
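
The restriction in the last bullet can be sketched as below. This is only an illustration, assuming each entity is a dict with `span` and `annotator` keys as in the examples above; the helper name `cross_annotator_pairs` is hypothetical and not part of the notebook.

```python
from itertools import product


def cross_annotator_pairs(cluster, first="PTC", second="NIO"):
    """Return every (first, second) entity pair from one cluster.

    Pairs of entities from the same annotator are never produced,
    so nested NIO annotations are not compared with each other.
    """
    first_ents = [e for e in cluster if e["annotator"] == first]
    second_ents = [e for e in cluster if e["annotator"] == second]
    return list(product(first_ents, second_ents))


# The mixed cluster from the example above: one PTC entity, two nested NIO entities.
cluster = [
    {"span": [276, 288], "annotator": "NIO"},
    {"span": [276, 288], "annotator": "PTC"},
    {"span": [276, 296], "annotator": "NIO"},
]

pairs = cross_annotator_pairs(cluster)
for ptc_ent, nio_ent in pairs:
    print(ptc_ent["span"], nio_ent["span"])
```

With this pairing, the one PTC entity is compared against each nested NIO entity separately, which is one possible way to treat the open case above rather than a settled decision.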

#### step 2: use the Levenshtein distance to compare the entities within the same cluster
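
The notebook does not yet fix a Levenshtein implementation (it is marked tbd further down), so the following is only a minimal pure-Python sketch of the standard dynamic-programming edit distance. The function name `levenshtein` is an assumption made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("amyloid", "amyloid-beta"))  # 5
```

Applied to two mentions from one cluster, a small distance relative to the mention length suggests the annotators marked the same surface form.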

In [1]:
## Would it be better to do this pre-processing when creating the JSON file?
## Before starting, remove the duplicates within each annotator's results.
## This matters for NIO: the dictionary may contain plurals and the annotator uses stemmers,
## which results in duplicated entities like the ones below:
# {
#                 "span": [114, 119],
#                 "mention": "brain",
#                 "identifier": "AlzheimerOntology:brain",
#                 "concept": "brain",
#                 "type": ""
#         }, {
#                 "span": [114, 119],
#                 "mention": "brain",
#                 "identifier": "AlzheimerOntology:brain",
#                 "concept": "brain",
#                 "type": ""
#         }
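
A possible de-duplication pass is sketched below. It assumes two entities are duplicates when their `span`, `mention`, and `identifier` all match, and that it is run on one annotator's entity list at a time; the helper name `dedupe_entities` is hypothetical.

```python
def dedupe_entities(ents):
    """Drop repeated entities, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for ent in ents:
        # Lists are not hashable, so the span is converted to a tuple for the key.
        key = (tuple(ent["span"]), ent["mention"], ent["identifier"])
        if key not in seen:
            seen.add(key)
            unique.append(ent)
    return unique


# The duplicated "brain" entity from the comment above.
ents = [
    {"span": [114, 119], "mention": "brain",
     "identifier": "AlzheimerOntology:brain", "concept": "brain", "type": ""},
    {"span": [114, 119], "mention": "brain",
     "identifier": "AlzheimerOntology:brain", "concept": "brain", "type": ""},
]

deduped = dedupe_entities(ents)
print(len(deduped))  # 1
```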

In [9]:
import json
from itertools import groupby, combinations
from collections import defaultdict
# import enchant

In [2]:
path = "/Users/yidesdo21/Projects/outputs/10_ptc_ten_recent/json/"
nio_path = "nio_ten_recent_spans_nested_case.json"
ptc_path = "ptc_ten_recent_spans_nested_case.json"
combined_path = "combined_ten_recent_spans_nested_case.json"

In [3]:
with open(path+nio_path) as f:
    nio_annos = json.load(f)
    
with open(path+ptc_path) as f:
    ptc_annos = json.load(f)

with open(path+combined_path) as f:
    combined_annos = json.load(f)

In [4]:
combined_annos[0]

{'title': 'CCC_000637925000033',
 'text': 'Glycoengineering artificial receptors for microglia to phagocytose A beta aggregates.Oligomeric and fibrillar amyloid-beta (A beta) are principally internalized via receptor-mediated endocytosis (RME) by microglia, the main scavenger of A beta in the brain. Nevertheless, the inflammatory cascade will be evoked after vast A beta aggregate binding to pattern recognition receptors on the cell membrane, which then significantly decreases the expression of these receptors and further deteriorate A beta deposition. This vicious circle will weaken the ability of microglia for A beta elimination. Herein, a combination of metabolic glycoengineering and self-triggered click chemistry is utilized to engineer microglial membranes with ThS as artificial A beta receptors to promote microglia to phagocytose A beta aggregates. Additionally, to circumvent the undesirable immune response during the process of the bioorthogonal chemistry reaction and A beta-micr

In [5]:
# tc_list is a list of (title, clusters) tuples, where each cluster is a list of entities to be compared.
# Entities are compared if they 1) have the same starting index, and 2) come from different annotators.
# One case is left unhandled for now: a cluster where one entity comes from PTC
# but two or more come from nested NIO annotations.

# each cluster is to be compared by the <Levenshtein distance> (tbd)
# tr_list holds the entities that don't need to be compared because there is no ambiguity among them.

# key function for sorting and grouping: the starting index of the span
def key_func(k):
    return k['span'][0]

anno_num = len(combined_annos)
# title_compare, title_reserve = dict(), dict()
tc_list, tr_list = list(), list()

for i in range(anno_num):
    combined_ents = combined_annos[i].get("ents")
    combined_title = combined_annos[i].get("title")

    groups = []
    uniquekeys = []

    # sort INFO data by the starting index.
    INFO = sorted(combined_ents, key=key_func)

    for key, value in groupby(INFO, key_func):   # group the spans by the same starting index
        groups.append(list(value))
        uniquekeys.append(key)
    
#     print(groups)
    
    compare_list, reserve_list = list(), list()   # the compare list includes the entities to be compared by similarity measures

    for group in groups:    
        end_indices = set()
        annotators = set()

        if len(group) < 2:  # only one entity in the group, so there is nothing to compare
            reserve_list.append(group)

        else:
            for e in group:
                end_indices.add(e["span"][1])
                annotators.add(e["annotator"])

            if len(annotators) == 1:  # same annotator -- the nested annotations or annotations from different ontologies in NIO
                reserve_list.append(group)

            else: 
#                 print(group)
                compare_list.append(group)
                
#     title_compare["title"] = combined_title 
#     title_compare["comparisons"] = compare_list 
#     title_reserve["title"] = combined_title 
#     title_reserve["reserves"] = reserve_list
    
    tc_list.append((combined_title, compare_list))  # tc_list plus tr_list together hold all the ents for each article
    tr_list.append((combined_title, reserve_list))
    
#     break
    

In [6]:
len(tc_list)

10

In [7]:
len(tr_list)

10

In [8]:
# tc_list[0]

In [16]:
dict_id = defaultdict(int)
dict_mention = defaultdict(int)

for tc in tc_list:
    title, comparisons = tc[0], tc[1]
    print(title)
    
   
    for comparison in comparisons:
#         print(comparison)
        mentions = list()
        identifiers = list()
        for ent in comparison:
            mention, identifier = ent["mention"], ent["identifier"]
            mentions.append(mention)
            identifiers.append(identifier)
#             print(mention)
        
        set_id = list(set(identifiers))
        set_mention = list(set(mentions))
        dict_id[tuple(set_id)] += 1
        dict_mention[tuple(set_mention)] += 1
        
        print(mentions)
        print(identifiers)
        print("------------")
        # compare between each two mentions
#         for pair in combinations(mentions,2):   # need identifiers for each pair 
#             print(pair)
#         print("------------")
        
    print("--------------------")

CCC_000637925000033
['amyloid', 'amyloid-beta', 'amyloid-beta']
['AlzheimerOntology:amyloid_beta_protein', 'MESH:D016229', 'AlzheimerOntology:amyloid_beta_protein']
------------
['oxygen', 'oxygen']
['MESH:D010100', 'AlzheimerOntology:oxygen']
------------
--------------------
CCC_000647663200001
['amyloid', 'amyloid', 'amyloid-beta']
['MESH:D016229', 'AlzheimerOntology:amyloid_beta_protein', 'AlzheimerOntology:amyloid_beta_protein']
------------
['mice', 'mice']
['10090', 'NDDUO:Mouse_']
------------
['Amyloid', 'Amyloid precursor protein', 'Amyloid precursor protein']
['AlzheimerOntology:amyloid_beta_protein', '11820', 'AlzheimerOntology:amyloid_precursor_protein']
------------
['amyloid', 'amyloid-beta', 'amyloid-beta']
['AlzheimerOntology:amyloid_beta_protein', 'MESH:D016229', 'AlzheimerOntology:amyloid_beta_protein']
------------
['Alzheimer', 'Alzheimer', 'Alzheimer?s', 'Alzheimer?s disease']
['MESH:D000544', 'AlzheimerOntology:Subtypes', 'AlzheimerOntology:Subtypes', 'AlzheimerO

In [17]:
{k: v for k, v in sorted(dict_mention.items(), key=lambda item: item[1], reverse=True)}

{('AD',): 21,
 ("Alzheimer's", 'Alzheimer', "Alzheimer's disease"): 11,
 ('BIN1',): 8,
 ('ApoE',): 7,
 ('mice',): 6,
 ('amyloid-beta', 'amyloid'): 5,
 ('tau', 'tau aggregation'): 3,
 ('tauopathies',): 2,
 ('oxygen',): 1,
 ('Amyloid precursor protein', 'Amyloid'): 1,
 ('Alzheimer?s', 'Alzheimer?s disease', 'Alzheimer'): 1,
 ('amyloid plaque', 'amyloid'): 1,
 ('Presenilin 1', 'Presenilin'): 1,
 ('mouse', 'mouse models'): 1,
 ('mouse',): 1,
 ('Alzheimer',): 1,
 ('amyloid beta', 'amyloid'): 1,
 ('Neurodegenerative Disorders',): 1,
 ("Alzheimer's", 'Alzheimer'): 1}

In [15]:
{k: v for k, v in sorted(dict_id.items(), key=lambda item: item[1], reverse=True)}

{('AlzheimerOntology:Subtypes', 'MESH:D000544'): 35,
 ('AlzheimerOntology:BIN1', '274'): 8,
 ('NDDUO:Mouse_', '10090'): 7,
 ('AlzheimerOntology:amyloid_beta_protein', 'MESH:D016229'): 6,
 ('AlzheimerOntology:APOE', 'AlzheimerOntology:ApoE_Protein', '11816'): 4,
 ('MESH:C536599', 'AlzheimerOntology:t_tau'): 3,
 ('348', 'AlzheimerOntology:APOE', 'AlzheimerOntology:ApoE_Protein'): 3,
 ('AlzheimerOntology:tauopathy', 'MESH:D024801', 'obo:ND_0000151'): 2,
 ('MESH:D010100', 'AlzheimerOntology:oxygen'): 1,
 ('AlzheimerOntology:amyloid_precursor_protein',
  'AlzheimerOntology:amyloid_beta_protein',
  '11820'): 1,
 ('AlzheimerOntology:amyloid_beta_protein',
  'AlzheimerOntology:presence_of_amyloid_plaque',
  'MESH:D016229'): 1,
 ('AlzheimerOntology:presenilin',
  'AlzheimerOntology:presenilin_1',
  '19164'): 1,
 ('NDDUO:Mouse_', '10090', 'NDDUO:In_vivo_models'): 1,
 ('NDDUO:neurodegenerative_disease', 'NDDUO:disorder', 'MESH:D019636'): 1}
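
The tallies above use `tuple(set(...))` as the dictionary key. The iteration order of a Python set is not guaranteed, so two clusters with the same labels could in principle end up under differently ordered keys. A hedged alternative, sketched here with a hypothetical helper `count_label_sets`, is to sort the labels before building the key.

```python
from collections import Counter


def count_label_sets(clusters, field):
    """Count how often each distinct set of labels occurs across clusters."""
    counts = Counter()
    for cluster in clusters:
        # Sorting gives one canonical key per label set, whatever the input order.
        key = tuple(sorted({ent[field] for ent in cluster}))
        counts[key] += 1
    return counts


# Toy clusters holding the same two identifiers in different orders.
clusters = [
    [{"identifier": "MESH:D010100"}, {"identifier": "AlzheimerOntology:oxygen"}],
    [{"identifier": "AlzheimerOntology:oxygen"}, {"identifier": "MESH:D010100"}],
]

id_counts = count_label_sets(clusters, "identifier")
print(id_counts.most_common())
```

`Counter.most_common()` also returns the entries sorted by count, which replaces the explicit `sorted(..., reverse=True)` call.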

In [11]:
# groups = []
# uniquekeys = []
 
# # sort INFO data by the starting index.
# INFO = sorted(combined_ents, key=key_func)
  
# for key, value in groupby(INFO, key_func):
#     groups.append(list(value))
#     uniquekeys.append(key)
    

In [12]:
# compare_list, reserve_list = list(), list()   # the compare list includes the entities to be compared by similarity measures

# for group in groups:    
#     end_indices = set()
#     annotators = set()
    
#     if len(group) < 2:  # the entities that are not needed to be compared 
#         reserve_list.append(group)
        
#     else:
#         for e in group:
#             end_indices.add(e["span"][1])
#             annotators.add(e["annotator"])
        
#         if len(annotators) == 1 and len(end_indices) > 1:
#             reserve_list.append(group)
        
#         else: 
#             compare_list.append(group)
    

In [13]:
# compare_list