### What this notebook offers:
#### Check the titles of the start-end pairs for a specific link type

In [2]:
import pandas as pd
import re
import json
import xmltodict
pd.set_option('display.max_rows', 1000)


path = "/Users/yidesdo21/Projects/outputs/12_time_slicing/"

- Aim: use link prediction to find AD research discoveries 
- Scopes: 
    - corpus from Colin's collections,
    - articles in 2019 as training data, and 2020 as testing data
    - PTC and NIO annotators to represent the corpus 
    - co-occurrence on the article level to link the concepts 
    - time slicing to evaluate the model outcomes
- Contributions:

### The outcomes from last time

- issue 01: the start and end pairs are not interesting 
    - solution: use node types and link types to filter the outcomes
        - PTC uids have types: Gene, Variant, Disease, Chemical, Species, and Cell Line
        - NIO uids don't have types, so I created them by mapping each uid's ontology prefix onto the four domains covered by NIO (cognitive functions, neuropsychological tests, Alzheimer's Disease, and brain areas) plus a catch-all. NIO uids carry more than 10 ontology prefixes, which I classify into five types:
            - MF --> AD - cognitive functions
            - NPT --> AD - neuropsychological tests
            - ADO, NDDUO --> AD - diagnosis, treatment, and molecular mechanisms
            - FMA, NDDO --> AD - brain areas
            - others --> AD - other
        - A link type is the combination of the two node types, e.g. "Gene -- AD - cognitive functions"
    - outcomes: 
        - classify top 100 results with the link types
        - check the start and end pairs with a certain link type
    - thoughts:
        - The titles alone do not reveal the evidence for a link between two nodes. Co-occurrence at the abstract level may be too weak a signal to count as a link.
        - The node type schema may contain overlapping types, e.g. an NIO uid with the type "AD - brain areas" can also be a Gene. This may be why the 'interesting' outcomes do not appear at the top of the ranking.
        - Filtering the outcomes by node type and link type seems to surface more 'interesting' outcomes, but we may need a better node type schema. The 'interestingness' issue may also need work at the modeling and ranking stage rather than the analysis stage, and we need a way to define 'interestingness' if that is what we are evaluating. So far, the outcomes do answer the research question: what AD research discoveries can link prediction models find? The difficulty is that case studies and manual analysis are hard to perform in the evaluation section.


##### prefixes for NIO uids. NIO uids example: 'NDDO:NDDO_00000021'
    Counter({'NDDO': 840,
     'NDDUO': 217,
     'dc': 15,
     'go': 1,
     'iao': 1,
     'obo': 2376,
     'oboInOwl': 10,
     'rdfs': 1,
     'AlzheimerOntology': 1306,
     'OntoDT': 32,
     'bfo': 9,
     'http': 1,
     'snap': 21,
     'span': 17})
     
##### uids with 'obo' as prefix are more complex, e.g. "obo:BFO_0000001"
    Counter({'BFO': 51,
         'FMA': 619,
         'GO': 66,
         'IAO': 113,
         'IDO': 1,
         'MF': 59,
         'NBO': 489,
         'NCBITaxon': 13,
         'ND': 108,
         'NPT': 470,
         'OBI': 273,
         'OGMS': 11,
         'PATO': 7,
         'RO': 34,
         'UBERON': 1,
         'UO': 55,
         'fma': 6})
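
The prefix-to-type mapping described under issue 01 can be sketched roughly as follows. This is a minimal illustration, not the code that produced `nio_category_uid.json`: the helper names are hypothetical, "ADO" is assumed to correspond to the `AlzheimerOntology` prefix, and the label text for the cognitive-function and neuropsychological-test types is assumed (only the other three labels appear in the outputs of this notebook).

```python
# Labels seen in this notebook's outputs
DIAGNOSIS = "AD -- Diagnosis, treatment, and molecular mechanisms"
BRAIN = "AD -- Brain areas"
OTHERS = "AD -- Others"

PREFIX_TO_TYPE = {
    "MF": "AD -- Cognitive functions",        # label text is an assumption
    "NPT": "AD -- Neuropsychological tests",  # label text is an assumption
    "AlzheimerOntology": DIAGNOSIS,           # assumed to be what "ADO" refers to
    "NDDUO": DIAGNOSIS,
    "FMA": BRAIN,
    "NDDO": BRAIN,
}

def nio_prefix(uid):
    """Return the ontology prefix of an NIO uid.

    For 'obo:' uids such as 'obo:FMA_62493' the real prefix sits after the
    colon, so this looks one level deeper.
    """
    head, _, tail = uid.partition(":")
    if head == "obo":
        return tail.split("_", 1)[0]
    return head

def nio_type(uid):
    """Map an NIO uid to one of the five node types (exact, case-sensitive match)."""
    return PREFIX_TO_TYPE.get(nio_prefix(uid), OTHERS)

examples = ["obo:FMA_62493", "AlzheimerOntology:APP", "obo:GO_0007612", "snap:Site"]
types = {uid: nio_type(uid) for uid in examples}
```

Because the match is case-sensitive, the six uids with the lower-case `fma` prefix would fall into the catch-all type in this sketch.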

![title](img/four_domains_nio.png)

- issue 02: the limitation of using link prediction models
    - We filter the training and testing networks so that they have the same set of nodes. The CN, PA, and AA models use local neighbourhood similarities to predict potential links, so they cannot predict links between existing and new nodes. However, the filtering discards many nodes and links.
        - For instance, with 2016 as the cut-off date, 31% of nodes are ignored in the testing dataset and 67% of nodes are ignored in the training dataset.
        - See 'predict_missing_links.ipynb' and the gr_ke table
    - Despite this limitation of the current link prediction models, the outcomes still answer the research question.
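
To make the limitation concrete, here is a minimal sketch of the three scores, assuming CN, PA, and AA stand for the usual common-neighbours, preferential-attachment, and Adamic-Adar measures. The toy edges and function names are hypothetical and are not taken from the corpus. All three scores depend only on the neighbourhoods in the training graph, so a node that is absent from training cannot be scored.

```python
import math

def build_graph(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def score_pair(adj, u, v):
    """Return (cn, pa, aa) for the node pair (u, v)."""
    common = adj[u] & adj[v]
    cn = len(common)                       # number of shared neighbours
    pa = len(adj[u]) * len(adj[v])         # product of the two degrees
    # A shared neighbour always has degree >= 2, so log(degree) > 0.
    aa = sum(1.0 / math.log(len(adj[z])) for z in common)
    return cn, pa, aa

# Hypothetical training edges
train_edges = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("d", "e")]
adj = build_graph(train_edges)
cn, pa, aa = score_pair(adj, "a", "b")

# Test edges that touch a node unseen in training ("f") have to be dropped.
test_edges = [("a", "b"), ("a", "f")]
scorable = [(u, v) for u, v in test_edges if u in adj and v in adj]
```

In this toy graph the pair `("a", "f")` is discarded because `"f"` has no neighbourhood in the training graph, which is the same effect as the node filtering described above.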

In [61]:
measure_19 = pd.read_csv(path+"features/measure_2019.csv")

In [62]:
measure_19

Unnamed: 0,start,end,cn,pa,aa
0,AlzheimerOntology:disease_onset,obo:OBI_0000070,263,160801,47.994927
1,AlzheimerOntology:APP,AlzheimerOntology:Populations,235,155803,42.039344
2,NDDUO:Distribution,obo:FMA_62493,232,149703,41.052517
3,AlzheimerOntology:inflammation,snap:Site,230,146673,40.976206
4,obo:GO_0007612,obo:RO_0000081,229,139092,40.829211
...,...,...,...,...,...
387114,4851,obo:FMA_61924,0,600,0.000000
387115,4851,MESH:D019833,0,250,0.000000
387116,4851,AlzheimerOntology:Incidence_rate,0,370,0.000000
387117,14805,4851,0,330,0.000000


In [63]:
## open the dictionary that links ptc uids and the ptc categories
with open(path+"metadata/"+"ptc_category_uid.txt") as f:
    lines = f.read().split("}, '")
    split = [l.split("': {'") for l in lines]
    
    ptc_uid_category = dict()
    # special case:"defaultdict(<class 'set'>, {'MESH:D009369", "Disease'"    
    ptc_uid_category["MESH:D009369"] = "Disease"

    for s in split[1:]:
        uid, category = s[0], s[1][:-1]
        ptc_uid_category[uid] = category

## open the dictionary that links nio uids and the nio categories
with open(path+"metadata/"+'nio_category_uid.json', 'r') as fp:
    nio_uid_category = json.load(fp)
    
## add two uid dictionaries together
def merge(D1,D2):
    py={**D1,**D2}
    return py

uid_category =(merge(ptc_uid_category,nio_uid_category))
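
The string splitting above needs a special case for the first entry. A possible alternative, sketched under the assumption that `ptc_category_uid.txt` holds the printed repr of a `defaultdict(set)` (as the special-case comment suggests), is to cut out the dict literal and hand it to `ast.literal_eval`. The function name and the sample string are illustrative only.

```python
import ast

def parse_defaultdict_repr(text):
    """Parse text like "defaultdict(<class 'set'>, {...})" into a dict of sets."""
    start = text.index("{")       # first brace opens the dict literal
    end = text.rindex("}") + 1    # last brace closes it
    return ast.literal_eval(text[start:end])

# Hypothetical two-entry sample in the same shape as the file
sample = "defaultdict(<class 'set'>, {'MESH:D009369': {'Disease'}, '11820': {'Gene'}})"
parsed_sets = parse_defaultdict_repr(sample)

# Keep one category per uid; this assumes each set holds a single category.
uid_to_category = {uid: next(iter(cats)) for uid, cats in parsed_sets.items()}
```

This keeps every entry, including the first one, without manual patching. Uids whose set holds more than one category would need an explicit rule for choosing among them.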


In [64]:
len(ptc_uid_category)

10433

In [65]:
len(nio_uid_category)

4847

In [66]:
len(uid_category)

15280

In [67]:
## link nio uid to mention 
xml_path = "/Users/yidesdo21/Projects/inputs/dictionary/"

with open(xml_path+"nio_ado_case.xml") as f:
    nio_xml = f.read()

nio_parsed = xmltodict.parse(nio_xml)
nio_dict = nio_parsed["synonym"]["token"]
nio_uid_canonical = dict()

for nio in nio_dict:
    token_id, cano = nio["@id"], nio["@canonical"]
    nio_uid_canonical[token_id] = cano

## open the dictionary that links ptc uids and the ptc mentions
with open(path+"metadata/"+"ptc_mention_uid.txt") as f:
    lines = f.read().split("}, '")
    split = [l.split("': {'") for l in lines]
    
    ptc_uid_mention = dict()
    # special case:"defaultdict(<class 'set'>, {'MESH:D009369"    
    ptc_uid_mention["MESH:D009369"] = ["'cancer', 'spheroids', 'tumour', 'neoplasia', 'cancers', 'tumors', 'lesions', 'tumoral', 'neoplasms', 'neoplasm', 'keratocarcinomata', 'cancerous', 'carcinomas', 'Tumors', 'malignancies', 'tumor', 'carcinoma', 'malignancy', 'tumours', 'disease', 'Cancer'"]

    for s in split[1:]:
        uid, mention = s[0], s[1:]
        ptc_uid_mention[uid] = mention
        
## add two dictionaries together
uid_canonical =(merge(nio_uid_canonical,ptc_uid_mention))


In [68]:
len(nio_uid_canonical)

4847

In [69]:
len(ptc_uid_mention)

10433

In [70]:
len(uid_canonical)

15280

In [71]:
## link article to the uid pairs 
with open(path+"metadata/"+'articles_with_anno.json', 'r') as fp:
    articles = json.load(fp)


In [72]:
def uid_to_article(start,end,yr):
    """Given a start uid, an end uid, and the year of the testing data,
    return the titles of the articles in which both uids co-occur.
    -- start: string, end: string, yr: string"""
    titles = list()
#     abstracts = list()
    
    for article in articles:
        year = article.get("year")
        if year == yr:
            ptc_uids,nio_uids = article.get("ptc_uids"),article.get("nio_uids")

            if ptc_uids is None:
                ptc_uids = list()
            if nio_uids is None:
                nio_uids = list()

            uids = []
            uids.extend(ptc_uids)
            uids.extend(nio_uids)

            if start in uids and end in uids:
                titles.append(article.get("title"))
#                 abstracts.append(article.get("abstract"))
    
    return titles
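
`uid_to_article` scans every article for each pair, which can get slow when it is applied row by row. One possible speed-up, sketched here with hypothetical helper names and toy articles, is to build a uid-to-article index once per year and intersect the two entries. It assumes the same article fields as above (`year`, `title`, `ptc_uids`, `nio_uids`).

```python
from collections import defaultdict

def build_uid_index(articles, yr):
    """Map each uid to the positions of the articles from year `yr` that mention it."""
    index = defaultdict(set)
    for pos, article in enumerate(articles):
        if article.get("year") != yr:
            continue
        # Missing annotation lists are treated as empty.
        uids = (article.get("ptc_uids") or []) + (article.get("nio_uids") or [])
        for uid in uids:
            index[uid].add(pos)
    return index

def titles_for_pair(articles, index, start, end):
    """Return the titles of the articles in which both uids co-occur."""
    shared = index.get(start, set()) & index.get(end, set())
    return [articles[pos].get("title") for pos in sorted(shared)]

# Toy data for illustration only
articles_demo = [
    {"title": "t1", "year": "2020", "ptc_uids": ["11820"],
     "nio_uids": ["AlzheimerOntology:Tangles"]},
    {"title": "t2", "year": "2020", "ptc_uids": None,
     "nio_uids": ["AlzheimerOntology:Tangles"]},
    {"title": "t3", "year": "2019", "ptc_uids": ["11820"],
     "nio_uids": ["AlzheimerOntology:Tangles"]},
]
index_2020 = build_uid_index(articles_demo, "2020")
demo_titles = titles_for_pair(articles_demo, index_2020, "11820", "AlzheimerOntology:Tangles")
```

The index has to be rebuilt whenever the testing year changes.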

In [73]:
uid_to_article("AlzheimerOntology:APP","AlzheimerOntology:Populations","2020")

['abrogation of type-i interferon signalling alters the microglial response to a beta(1-42)',
 'computational analysis of alzheimer amyloid plaque composition in 2d-and elastically reconstructed 3d-maldi ms images']

In [75]:
measure_19['start_category'] = measure_19['start'].map(uid_category)
measure_19['end_category'] = measure_19['end'].map(uid_category)
measure_19['start_mention'] = measure_19['start'].map(uid_canonical)
measure_19['end_mention'] = measure_19['end'].map(uid_canonical)
# measure_19[['start_category', 'end_category']] = measure_19[['start_category','end_category']].fillna('AD_hallmark')
measure_19["link"] = measure_19['start_category'] + "+" + measure_19['end_category']
measure_19['rank'] = measure_19.index
measure_19 = measure_19[["rank","start","start_category","start_mention","end",
                         "end_category","end_mention","link","cn","pa","aa"]]
# measure_19.style.hide_index()


In [76]:
measure_19.iloc[:100]

Unnamed: 0,rank,start,start_category,start_mention,end,end_category,end_mention,link,cn,pa,aa
0,0,AlzheimerOntology:disease_onset,"AD -- Diagnosis, treatment, and molecular mech...",disease onset,obo:OBI_0000070,AD -- Others,assay,"AD -- Diagnosis, treatment, and molecular mech...",263,160801,47.994927
1,1,AlzheimerOntology:APP,"AD -- Diagnosis, treatment, and molecular mech...",APP,AlzheimerOntology:Populations,"AD -- Diagnosis, treatment, and molecular mech...",Population,"AD -- Diagnosis, treatment, and molecular mech...",235,155803,42.039344
2,2,NDDUO:Distribution,"AD -- Diagnosis, treatment, and molecular mech...",Distribution,obo:FMA_62493,AD -- Brain areas,Hippocampus,"AD -- Diagnosis, treatment, and molecular mech...",232,149703,41.052517
3,3,AlzheimerOntology:inflammation,"AD -- Diagnosis, treatment, and molecular mech...",inflammation,snap:Site,AD -- Others,site,"AD -- Diagnosis, treatment, and molecular mech...",230,146673,40.976206
4,4,obo:GO_0007612,AD -- Others,learning,obo:RO_0000081,AD -- Others,role of,AD -- Others+AD -- Others,229,139092,40.829211
5,5,obo:RO_0002211,AD -- Others,regulates,obo:UO_0000036,AD -- Others,year,AD -- Others+AD -- Others,227,161664,40.584781
6,6,NDDUO:rate,"AD -- Diagnosis, treatment, and molecular mech...",rate,obo:FMA_62493,AD -- Brain areas,Hippocampus,"AD -- Diagnosis, treatment, and molecular mech...",225,142197,39.817861
7,7,11820,Gene,"[beta', 'A(beta)', 'AbetaPP', 'arcAbeta', '(Ab...",AlzheimerOntology:Tangles,"AD -- Diagnosis, treatment, and molecular mech...",Tangle,"Gene+AD -- Diagnosis, treatment, and molecular...",224,149292,40.133814
8,8,AlzheimerOntology:Tangles,"AD -- Diagnosis, treatment, and molecular mech...",Tangle,obo:NBO_0000282,AD -- Others,liking,"AD -- Diagnosis, treatment, and molecular mech...",224,136764,40.044052
9,9,AlzheimerOntology:Fibrils,"AD -- Diagnosis, treatment, and molecular mech...",Fibril,AlzheimerOntology:age_risk_factor,"AD -- Diagnosis, treatment, and molecular mech...",age risk factor,"AD -- Diagnosis, treatment, and molecular mech...",223,160892,40.31567


In [77]:
## no NA values in the dataframe -- each uid has a category 
# measure_19.isnull().sum()

In [78]:
# cn model -- link categories of top 100 results
measure_19.iloc[:100]["link"].value_counts()

AD -- Diagnosis, treatment, and molecular mechanisms+AD -- Diagnosis, treatment, and molecular mechanisms    49
AD -- Diagnosis, treatment, and molecular mechanisms+AD -- Others                                            25
AD -- Diagnosis, treatment, and molecular mechanisms+AD -- Brain areas                                        5
Disease+AD -- Diagnosis, treatment, and molecular mechanisms                                                  4
AD -- Diagnosis, treatment, and molecular mechanisms+Disease                                                  4
Gene+AD -- Diagnosis, treatment, and molecular mechanisms                                                     3
AD -- Others+AD -- Others                                                                                     3
Disease+AD -- Others                                                                                          2
AD -- Brain areas+AD -- Diagnosis, treatment, and molecular mechanisms                                  
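
The counts above treat "A+B" and "B+A" as different link types, so the Disease and diagnosis pairs are listed once in each order. If the start/end order carries no meaning for a co-occurrence link (an assumption, not something established above), the two categories can be sorted before they are joined. The helper name and the toy pairs below are hypothetical.

```python
from collections import Counter

DIAGNOSIS = "AD -- Diagnosis, treatment, and molecular mechanisms"

def undirected_link(start_category, end_category):
    """Build a link label that ignores which node is start and which is end."""
    return "+".join(sorted([start_category, end_category]))

# Toy category pairs for illustration only
pairs = [("Disease", DIAGNOSIS), (DIAGNOSIS, "Disease"), ("Gene", DIAGNOSIS)]
link_counts = Counter(undirected_link(s, e) for s, e in pairs)
```

On the notebook's DataFrame the same helper could be applied with `measure_19.apply(lambda r: undirected_link(r["start_category"], r["end_category"]), axis=1)`.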

In [79]:
# check pairs with specific links
measure_19_100 = measure_19.iloc[:100]

def check_links(link):
    check_link = link
    measure_19_100_link = measure_19_100[measure_19_100['link']==check_link]
    return measure_19_100_link[["rank","start","start_mention","end","end_mention","cn"]]

In [80]:
check_links(link="AD -- Diagnosis, treatment, and molecular mechanisms+AD -- Diagnosis, treatment, and molecular mechanisms")


Unnamed: 0,rank,start,start_mention,end,end_mention,cn
1,1,AlzheimerOntology:APP,APP,AlzheimerOntology:Populations,Population,235
9,9,AlzheimerOntology:Fibrils,Fibril,AlzheimerOntology:age_risk_factor,age risk factor,223
10,10,AlzheimerOntology:Fibrils,Fibril,NDDUO:Age,Age,223
11,11,AlzheimerOntology:Fibrils,Fibril,AlzheimerOntology:advanced_glycation_end_product,advanced glycation end product,223
19,19,AlzheimerOntology:inflammation,inflammation,NDDUO:Sampling,Sampling,215
20,20,AlzheimerOntology:Positron_emission_tomography,Positron emission tomography,NDDUO:Neurons,Neuron,215
22,22,AlzheimerOntology:Late_Onset_Alzheimer_s_Disease,Late Onset Alzheimer's Disease,NDDUO:Severe,thing related to severe stage,214
25,25,AlzheimerOntology:Microglia,Microglia,AlzheimerOntology:Tangles,Tangle,212
29,29,AlzheimerOntology:Fibrils,Fibril,AlzheimerOntology:Mild_cognitive_Impairment,Mild cognitive Impairment,209
30,30,AlzheimerOntology:Positron_emission_tomography,Positron emission tomography,AlzheimerOntology:inflammation,inflammation,209


In [81]:
check_links(link="AD -- Diagnosis, treatment, and molecular mechanisms+AD -- Brain areas")

Unnamed: 0,rank,start,start_mention,end,end_mention,cn
2,2,NDDUO:Distribution,Distribution,obo:FMA_62493,Hippocampus,232
6,6,NDDUO:rate,rate,obo:FMA_62493,Hippocampus,225
12,12,AlzheimerOntology:Positron_emission_tomography,Positron emission tomography,obo:FMA_62493,Hippocampus,223
37,37,NDDUO:inhibitor,inhibitor,obo:FMA_256135,Body,206
88,88,AlzheimerOntology:Receptors,Receptor,NDDO:NDDO_00000107,cohort,197


In [56]:
check_links(link="AD -- Diagnosis, treatment, and molecular mechanisms+Disease")

Unnamed: 0,rank,start,start_mention,end,end_mention,cn
49,49,AlzheimerOntology:APOE,APOE,MESH:D009410,"[loss', 'toxicity', 'decline', 'dysfunctions',...",204
56,56,AlzheimerOntology:cerebrospinal_fluid,cerebrospinal fluid,MESH:D003643,"[Death', 'cancer', 'deaths', 'Dis', 'Mortality...",201
79,79,AlzheimerOntology:ratio,ratio,MESH:D007249,"[impairs', 'inflammation', 'disorder', 'reacti...",198
97,97,AlzheimerOntology:secretase,secretase,MESH:D019636,"[conditions', 'dementia', 'illnesses', 'loss',...",195


In [54]:
check_links(link="Disease+AD -- Diagnosis, treatment, and molecular mechanisms")

Unnamed: 0,rank,start,start_mention,end,end_mention,cn
15,15,MESH:D003643,"[Death', 'cancer', 'deaths', 'Dis', 'Mortality...",NDDUO:Mild,thing related to mild stage,220
40,40,MESH:D009410,"[loss', 'toxicity', 'decline', 'dysfunctions',...",NDDUO:Severe,thing related to severe stage,206
77,77,MESH:D003643,"[Death', 'cancer', 'deaths', 'Dis', 'Mortality...",NDDUO:Pathogenesis,thing related to pathogenesis,198
94,94,MESH:D020258,"[Neuritogenic-neurotoxic', 'neurotoxicity', 'd...",NDDUO:Sampling,Sampling,196


In [55]:
check_links(link="Gene+AD -- Diagnosis, treatment, and molecular mechanisms")

Unnamed: 0,rank,start,start_mention,end,end_mention,cn
7,7,11820,"[beta', 'A(beta)', 'AbetaPP', 'arcAbeta', '(Ab...",AlzheimerOntology:Tangles,Tangle,224
14,14,11820,"[beta', 'A(beta)', 'AbetaPP', 'arcAbeta', '(Ab...",AlzheimerOntology:Populations,Population,222
31,31,11820,"[beta', 'A(beta)', 'AbetaPP', 'arcAbeta', '(Ab...",AlzheimerOntology:neurofibrillary_tangle,neurofibrillary tangle,208


In [45]:
# check the titles of the start-end pairs for a specific link type
# a high cn value doesn't necessarily mean the pair co-occurs in the testing network
## careful: the third argument of uid_to_article must match the year of the testing data
check_link = "AD -- Diagnosis, treatment, and molecular mechanisms+Disease"
# take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
measure_19_100_link = measure_19_100[measure_19_100["link"] == check_link].copy()
measure_19_100_link["title"] = measure_19_100_link.apply(lambda x: uid_to_article(x['start'], x['end'], "2020"), axis=1)
print(check_link)
result = measure_19_100_link[["rank","start","start_mention","end","end_mention","cn","title"]].to_json(orient="records")
parsed = json.loads(result)
parsed


AD -- Diagnosis, treatment, and molecular mechanisms+Disease


[{'rank': 49,
  'start': 'AlzheimerOntology:APOE',
  'start_mention': 'APOE',
  'end': 'MESH:D009410',
  'end_mention': ["loss', 'toxicity', 'decline', 'dysfunctions', 'disorders', 'mitochondria', 'homeostasis', 'atrophy', 'Dysfunction', 'networks', 'damage', 'deficits', 'dysfunction/degeneration', 'neuritic', 'disorder', 'viability', 'ectopias', 'Toxicity', 'impairment', 'diseases', 'abnormalities', 'bodies', 'activity', 'death', 'linked', 'soma', 'necrosis', 'endocytosis', 'somata', 'glucose', 'dendritic', 'defects', 'alterations', 'metabolism', 'malfunction', 'hyperexcitation', 'hyperactivation', 'astrogliosis', 'injuries', 'dyshomeostasis', 'neuroinflammation', 'glia', 'dysfunction', 'excitotoxicity', 'function', 'dendrites', 'pyroptosis', 'accumulation', 'neuron', 'perturbation', 'degeneration', 'Injury', 'injury', 'insulin', 'Death', 'deaths', 'axons', 'neurons', 'dystrophy', 'Abeta', 'hyperhomocysteinemic', 'colonies', 'hyperexcitability', 'hypometabolism', 'spinal', 'brain', 'n