# Search for Ontologies using bioontology Recommender

* Reference: https://data.bioontology.org/documentation#nav_recommender

Recommender Tab:
The Recommender takes as input a text or a list of keywords and suggests appropriate ontologies for it.

The ontology ranking algorithm used by the Recommender evaluates the adequacy of each ontology to the input using a combination of four evaluation criteria:

* Coverage: To what extent does the ontology represent the input? The Recommender invokes the NCBO Annotator service to obtain all the annotations for the input and then uses those annotations to compute a coverage score for each ontology.
* Acceptance: How well-known and trusted is the ontology by the biomedical community? The number of visits to the ontology page in BioPortal and the presence or absence of the ontology in UMLS are used to compute an acceptance score for each ontology.
* Detail of knowledge: What is the level of detail provided by the ontology for the input data? It is computed using the number of definitions, synonyms and properties of the ontology classes that cover the input data.
* Specialization: How specialized is the ontology to the input data’s domain? It is calculated using the number and type of the annotations done with the ontology and the position of each annotated class in the ontology hierarchy. The result is normalized by the size of the ontology, in order to identify small ontologies that are specialized to the input data.

* Parameters
    * input_type={1|2} // default = 1. 1 means that the input type is text. 2 means that the input type is a list of comma separated keywords.
    * output_type={1|2} // default = 1. 1 means that the output will be a ranked list of individual ontologies. 2 means that the output will be a ranked list of ontology sets.
    * max_elements_set={2|3|4} // default = 3. Maximum number of ontologies per set (only for output_type = 2).
    * wc={value in the range [0,1]} // default = 0.55. Weight assigned to the ontology coverage criterion.
    * wa={value in the range [0,1]} // default = 0.15. Weight assigned to the ontology acceptance criterion.
    * wd={value in the range [0,1]} // default = 0.15. Weight assigned to the ontology detail criterion.
    * ws={value in the range [0,1]} // default = 0.15. Weight assigned to the ontology specialization criterion.
    * ontologies={ontology_id1, ontology_id2, …, ontology_idN} // default = (empty) (all BioPortal ontologies will be evaluated).
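
As a rough sketch of how these parameters fit together, the snippet below assembles a Recommender request URL with `urllib.parse.urlencode`. The helper name `build_recommender_url` and the example keywords are illustrative assumptions. Only the parameter names and defaults come from the list above, and `input` is the query parameter used later in this notebook. No request is sent.

```python
from urllib.parse import urlencode

RECOMMENDER_URL = "https://data.bioontology.org/recommender"


def build_recommender_url(text, input_type=1, output_type=1,
                          wc=0.55, wa=0.15, wd=0.15, ws=0.15, ontologies=()):
    """Assemble a Recommender query URL (hypothetical helper, nothing is sent)."""
    # every weight must lie in [0, 1], as stated in the parameter list
    for weight in (wc, wa, wd, ws):
        if not 0 <= weight <= 1:
            raise ValueError("weights must lie in the range [0, 1]")
    params = {
        "input": text,
        "input_type": input_type,
        "output_type": output_type,
        "wc": wc, "wa": wa, "wd": wd, "ws": ws,
    }
    # an empty `ontologies` value means all BioPortal ontologies are evaluated
    if ontologies:
        params["ontologies"] = ",".join(ontologies)
    return RECOMMENDER_URL + "?" + urlencode(params)


# keyword input (input_type=2), restricted to a single ontology
url = build_recommender_url("physical exercise,sleep", input_type=2, ontologies=["EXMO"])
print(url)
```

The resulting URL could then be passed to a helper such as `get_json`, which adds the API key header.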

In [7]:
from pronto import Ontology,xref
from collections import defaultdict
import pandas as pd
import numpy as np
from tqdm import tqdm
from urllib.parse import quote
import urllib.request, urllib.error, urllib.parse
import json
import os
from pprint import pprint
import pickle
import sys
# retrieve_LSFC and importlib are imported in the next cell, after the utils folder is added to sys.path



# Load LSFC 


In [8]:
import warnings
warnings.filterwarnings("ignore")
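module_path = os.path.abspath(os.path.join('../../'))
utils_path = os.path.join(module_path, 'utils')
# add the utils folder to the import path once, so that retrieve_LSFC can be imported
if utils_path not in sys.path:
    sys.path.append(utils_path)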

import retrieve_LSFC

import importlib
# reload if library gets updated
importlib.reload(retrieve_LSFC)

### Load LSFC 

LSFC_file='../../LSFC/LSFC.obo'
id_to_name,name_to_id,id_to_synonyms,id_2_childs,id_2_parents=retrieve_LSFC.read_LSFC(LSFC_file)
# Get names of the 9 main LSF categories
LSF_exisiting_names, LSF_Labels,LFIDs,categories=retrieve_LSFC.generate_lfid_categories_labels(LSFC_file)



# General functions

In [9]:

REST_URL = "http://data.bioontology.org"
API_KEY = "017c7a81-885c-43e5-abcf-4071b6ff5373"


def make_query(term):
    #term='"'+term+'"'
    return quote(term)


def get_json(url):
    opener = urllib.request.build_opener()
    opener.addheaders = [('Authorization', 'apikey token=' + API_KEY)]
    return json.loads(opener.open(url).read())



# Produce LSF-Slim
* Cover names and synonyms of high-level concepts of the LSF ontology:
    * For all branches except nutrition, select second-level concepts
    * For the nutrition branch, select second-level concepts, except 'Micronutrient' and 'Macronutrient', for which third-level concepts are selected
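
The selection rule above can be illustrated on a toy hierarchy. This is a minimal sketch of one possible reading of the rule, using plain dictionaries. The concept names under each branch, the `children` mapping and the helper `slim_concepts` are hypothetical and are not taken from LSFC or from the `LSF_Slim` function below.

```python
# Toy hierarchy (hypothetical, not LSFC data): parent -> list of children
children = {
    "root": ["Nutrition", "Sleep"],
    "Nutrition": ["Micronutrient", "Macronutrient", "Diet"],
    "Micronutrient": ["Vitamin", "Mineral"],
    "Macronutrient": ["Protein", "Fat"],
    "Sleep": ["Sleep quality", "Sleep duration"],
}

# second-level nutrition concepts that are replaced by their own children
EXPAND_ONE_LEVEL_DEEPER = {"Micronutrient", "Macronutrient"}


def slim_concepts(children, root="root"):
    """Return second-level concepts, expanding the two nutrition groups to the third level."""
    selected = []
    for branch in children.get(root, []):           # first level: branches
        for concept in children.get(branch, []):    # second level
            if branch == "Nutrition" and concept in EXPAND_ONE_LEVEL_DEEPER:
                selected.extend(children.get(concept, []))  # third level
            else:
                selected.append(concept)
    return selected


print(slim_concepts(children))
# ['Vitamin', 'Mineral', 'Protein', 'Fat', 'Diet', 'Sleep quality', 'Sleep duration']
```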

In [16]:
def LSF_Slim(LSFC_file):
    LSFC = Ontology(LSFC_file)
    root='LFID:0000000'
    #get LSF branches by expanding root
    branches=list(LSFC[root].subclasses(1,with_self=False))

    high_level_names=[]
    for branch in branches:
        # the branch itself plus its descendants up to two levels below it
        nodes=list(LSFC[branch.id].subclasses(2,with_self=True))
        # add node name and all its synonyms
        for node in nodes:
            # skip leaf nodes, since they cannot be considered high-level nodes
            if not node.is_leaf():
                high_level_names.append(node.name)
                for syn in node.synonyms: # add synonyms
                    high_level_names.append(syn.description) # description is the name of the synonym

    return high_level_names

high_level_names=LSF_Slim(LSFC_file)



# Calling Bioportal Annotator



* Each high-level name of the LSFC is annotated against all available ontologies

* For every ontology we keep the set of existing matched names in 'dict_ontology_to_annotated_names'

* Metrics for shortlisting the ontologies:
    * Coverage: the percentage of LSF-Slim names that appear in the target ontology
    * Overlap: the percentage of all names in the target ontology that appear in the LSF-Slim
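
The two metrics can be sketched on toy data as follows. The names, the class count and the helper functions are illustrative assumptions. The formulas mirror the ones used in `add_metric` later in this notebook (matches divided by the number of LSF-Slim names, and matches divided by the ontology's class count).

```python
# Toy inputs (hypothetical values)
slim_names = ["physical exercise", "sleep quality", "vegetables", "income"]
matched_names = {"physical exercise", "vegetables"}   # slim names found in one ontology
class_count = 400                                     # number of classes in that ontology


def coverage(matched, slim):
    """Fraction of LSF-Slim names that were found in the ontology."""
    return len(matched) / len(slim)


def overlap(matched, n_classes):
    """Fraction of the ontology's classes that correspond to LSF-Slim names."""
    return len(matched) / n_classes if n_classes else 0.0


print(coverage(matched_names, slim_names))  # 0.5
print(overlap(matched_names, class_count))  # 0.005
```

A large general-purpose ontology tends to score high on coverage and low on overlap, while a small focused ontology tends to show the opposite pattern, which is why both metrics are tracked.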

In [22]:
def annotator(high_level_names):
    """Look up each LSF-Slim name in BioPortal (exact-match search over all ontologies)

    Args:
        high_level_names (list): LSF-Slim names and synonyms

    Returns:
        dict: maps each ontology acronym to the set of LSF-Slim names found in it
    """
    dict_ontology_to_annotated_names=defaultdict(set)

    for i,name in tqdm(enumerate(high_level_names)):
        results=get_json('https://data.bioontology.org/search?q=' + make_query(name)+'&pagesize=500&ontologies=&include_properties=TRUE&include_views=TRUE&includeObsolete=TRUE&require_definition=false&exact_match=true&categories=&suggest=TRUE')

        for result in results['collection']:
            ontology_acronym=result['links']['ontology'].split('https://data.bioontology.org/ontologies/')[1]   
            dict_ontology_to_annotated_names[ontology_acronym].add(name)
    return dict_ontology_to_annotated_names


dict_ontology_to_annotated_names=annotator(high_level_names)


509it [15:12,  1.79s/it]


In [39]:
def add_metric(dict_ontology_to_annotated_names):
    """Add quality metrics to the annotated ontologies

    Args:
        dict_ontology_to_annotated_names (dict): maps each ontology acronym to the set of LSF-Slim names found in it

    Returns:
        dataframe: A dataframe for the annotated ontologies along with their quality metrics
    """
    columns=['acronym','ontology_name','acceptance_score','match_count','coverage','names','class_counts','number_of_releases','first_release_year','last_release_year','overlap']
    rows=[]  # one dict per ontology; DataFrame.append was removed in pandas 2.0

    for ontology in tqdm(dict_ontology_to_annotated_names):
        names=list(dict_ontology_to_annotated_names[ontology])
        match_count=len(names)
        names='#'.join(names)

        ontology_name=get_json('http://data.bioontology.org/ontologies/'+ontology)['name']

        submissions=get_json('https://data.bioontology.org/ontologies/'+ontology+'/submissions')

        number_of_releases=len(submissions)
        # assumes the submissions are returned newest first
        first_release_year=submissions[-1]['released'].split('-')[0]
        last_release_year=submissions[0]['released'].split('-')[0]

        # share of LSF-Slim names found in this ontology (high_level_names is the global LSF-Slim list)
        coverage=match_count/len(high_level_names)

        try:
            class_counts=get_json('https://data.bioontology.org/ontologies/'+ontology+'/metrics')['classes']
        except Exception:
            class_counts=0

        try:
            # pick any matching word for this ontology, just to get its acceptance score from the Recommender API
            # The acceptance score shows how well-known and trusted the ontology is in the biomedical community
            word=list(dict_ontology_to_annotated_names[ontology])[0]
            acceptance_score=get_json('https://data.bioontology.org/recommender?input=' +make_query(word)+ '&ontologies='+ontology)[0]['acceptanceResult']['normalizedScore']
        except Exception:
            acceptance_score=0

        try:
            # ignore ontologies with fewer than 100 classes or fewer than 5 matches
            if match_count > class_counts or  class_counts< 100 or  match_count < 5:
                # the class count is unreliable or the sample is too small
                overlap=0
            else:
                overlap=match_count/class_counts
        except Exception:
            overlap=0

        rows.append({'acronym':ontology,'ontology_name':ontology_name,'acceptance_score':acceptance_score,'match_count':match_count,'coverage':coverage,'names':names,'class_counts':class_counts,'number_of_releases':number_of_releases,'first_release_year':first_release_year,'last_release_year':last_release_year,'overlap':overlap})

    return pd.DataFrame(rows,columns=columns)



df_ontology_to_annotated_names=add_metric(dict_ontology_to_annotated_names)
   

100%|██████████| 313/313 [49:57<00:00,  9.58s/it]


In [43]:
df_ontology_to_annotated_names.to_csv('../../data/xref/Xrefs_ontology_to_annotated_names.tsv',sep='\t',index=None)


# Filter on acceptance score unless the first release is after 2018

In [45]:
df_ontologies=pd.read_csv('../../data/xref/Xrefs_ontology_to_annotated_names.tsv',sep='\t')
df_ontologies_filtered=df_ontologies[(df_ontologies.acceptance_score>0.5)|(df_ontologies.first_release_year > 2018) ]
df_ontologies_filtered.shape

(111, 11)
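
The boolean logic of this filter can be sketched on a few toy records. The acronyms and values below are hypothetical and are not the notebook's results. The thresholds (0.5 and 2018) are the ones used in the cell above. An ontology is kept when it is well accepted or when it was first released after 2018, the latter presumably because a recent ontology has had little time to build up an acceptance score.

```python
# Toy records (hypothetical values)
ontologies = [
    {"acronym": "ONT_A", "acceptance_score": 0.8, "first_release_year": 2010},
    {"acronym": "ONT_B", "acceptance_score": 0.1, "first_release_year": 2015},
    {"acronym": "ONT_C", "acceptance_score": 0.0, "first_release_year": 2022},
    {"acronym": "ONT_D", "acceptance_score": 0.3, "first_release_year": 2018},
]


def keep(record, min_acceptance=0.5, recent_after=2018):
    """Keep well-accepted ontologies, plus any first released after `recent_after`."""
    return (record["acceptance_score"] > min_acceptance
            or record["first_release_year"] > recent_after)


kept = [record["acronym"] for record in ontologies if keep(record)]
print(kept)  # ['ONT_A', 'ONT_C']
```

Note that an ontology first released in 2018 itself, such as `ONT_D` here, still has to pass the acceptance threshold.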

# Assign ranking based on coverage:


In [46]:
df_ontologies_filtered=df_ontologies_filtered.sort_values(by='coverage',ascending=False)
df_ontologies_filtered.reset_index(drop=True,inplace=True)

for i,row in tqdm(df_ontologies_filtered.iterrows()):
    df_ontologies_filtered.at[i,'covergae_rank']=i+1



111it [00:00, 22080.52it/s]


# Assign ranking based on overlap:


In [47]:
df_ontologies_filtered=df_ontologies_filtered.sort_values(by='overlap',ascending=False)
df_ontologies_filtered.reset_index(drop=True,inplace=True)

for i,row in tqdm(df_ontologies_filtered.iterrows()):
    df_ontologies_filtered.at[i,'overlap_rank']=i+1



111it [00:00, 16681.04it/s]


# Get min of two rankings

In [48]:
for i,row in tqdm(df_ontologies_filtered.iterrows()):
    df_ontologies_filtered.at[i,'rank']=min(row['covergae_rank'],row['overlap_rank'])

111it [00:00, 18606.34it/s]
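
The three ranking steps above (rank by coverage, rank by overlap, take the minimum) can be sketched in plain Python. The acronyms and metric values are hypothetical, and `rank_by` is an illustrative helper rather than part of this notebook.

```python
# Toy metrics (hypothetical values): acronym -> (coverage, overlap)
metrics = {
    "ONT_A": (0.23, 0.003),
    "ONT_B": (0.02, 0.027),
    "ONT_C": (0.09, 0.0001),
}


def rank_by(metrics, position):
    """Rank acronyms from 1 (best) by the metric stored at the given tuple position."""
    ordered = sorted(metrics, key=lambda acronym: metrics[acronym][position], reverse=True)
    return {acronym: rank for rank, acronym in enumerate(ordered, start=1)}


coverage_rank = rank_by(metrics, 0)   # higher coverage -> better rank
overlap_rank = rank_by(metrics, 1)    # higher overlap -> better rank

# an ontology is ranked by whichever of its two ranks is better
final_rank = {a: min(coverage_rank[a], overlap_rank[a]) for a in metrics}
print(final_rank)  # {'ONT_A': 1, 'ONT_B': 1, 'ONT_C': 2}
```

Taking the minimum means two ontologies can share the same final rank, one for leading on coverage and one for leading on overlap. In pandas, comparable per-column ranks could likely be obtained with `Series.rank(ascending=False, method='first')` instead of an `iterrows` loop.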


# Sort based on final rank

In [49]:
df_ontologies_filtered=df_ontologies_filtered.sort_values(by='rank')
df_ontologies_filtered.reset_index(drop=True,inplace=True)


In [50]:
df_ontologies_filtered.head()

| | acronym | ontology_name | acceptance_score | match_count | coverage | names | class_counts | number_of_releases | first_release_year | last_release_year | overlap | covergae_rank | overlap_rank | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXMO | Exercise Medicine Ontology | 0.0 | 11 | 0.021611 | energy#physical exercise#anaerobic exercise#ty... | 407 | 2 | 2022 | 2022 | 0.027027 | 8.0 | 1.0 | 1.0 |
| 1 | MDM | Mapping of Drug Names and MeSH 2022 | 0.0 | 117 | 0.229862 | pesticide residues#physical activities#video g... | 44789 | 5 | 2021 | 2022 | 0.002612 | 1.0 | 14.0 | 1.0 |
| 2 | EMIF-AD | EMIF-AD ontology | 0.0 | 8 | 0.015717 | employment status#vegetables#income#alcohol us... | 700 | 3 | 2020 | 2020 | 0.011429 | 15.0 | 2.0 | 2.0 |
| 3 | BERO | Biological and Environmental Research Ontology | 0.0 | 47 | 0.092338 | pesticide residues#radiation exposure#dietary ... | 392307 | 3 | 2022 | 2022 | 0.00012 | 2.0 | 28.0 | 2.0 |
| 4 | INBIO | Invasion Biology Ontology | 0.0 | 5 | 0.009823 | urbanized area#energy#radiation#environmental ... | 458 | 2 | 2022 | 2023 | 0.010917 | 30.0 | 3.0 | 3.0 |



# Remove manually checked irrelevant ontologies

In [54]:
manually_removed_ontology_acronyms=['NXDX', 'CMDO', 'OLAM', 'PATEL', 'TIMEBANK', 'WC', 'BIPOM', 'BIOLINK', 'MIO', 'COVID19-IBO', 'DRPSNPTO', 'STMSO', 'PCAO', 'INBIO', 'CHEMROF', 'MODSCI', 'INTO', 'TOCWWE', 'HASCO', 'DRANPTO', 'NMDCO', 'COVID-19-ONT-PM', 'WWECA', 'ZP', 'PARTUMDO', 'MHMO', 'LEPAO', 'ICD10CM', 'VIDO', 'OCD', 'REGN_BRO', 'IDO-COVID-19', 'HIO', 'CLAO', 'INBANCIDO', 'MELO', 'ECOCORE', 'COID', 'COVID-19', 'INBIODIV', 'BKO', 'MDDB', 'EMIF-AD', 'FRMO', 'DPO', 'MCO', 'EPIO', 'ENVTHES', 'PSDO', 'XPO', 'AISM', 'AFO', 'FMA', 'FNS-H', 'ITO', 'OBI_IEE', 'ZONMW-CONTENT', 'NPOKB', 'ID-AMR', 'COB', 'FOVT', 'ICEO', 'INCENTIVE-VARS', 'INCENTIVE', 'ETHANC', 'ETANC', 'ODHT', 'CDPEO', 'I2SV', 'ADCAD', 'CSTD', 'HOME', 'M4M19-SUBS', 'CASE-BASE-ONTO']
for i,ontology in df_ontologies_filtered.iterrows():
    if ontology['acronym']  in manually_removed_ontology_acronyms:
        df_ontologies_filtered=df_ontologies_filtered.drop([i])


In [56]:
bioportal='https://bioportal.bioontology.org/ontologies/'
for i,ontology in df_ontologies_filtered.iterrows():
    df_ontologies_filtered.at[i,'url']=bioportal+ontology['acronym']


In [58]:
df_ontologies_filtered.to_csv('../../data/xref/Xrefs_df_ontology_to_annotated_names_Ranked_Filtered_Checked.tsv',sep='\t')
# Final list of 50 ontologies is saved as 50_Selected Ontologies.tsv

