# Using ROBOKOP's similarity service

The ROBOKOP similarity service takes an entity as input and returns similar entities. Here, similarity is defined by shared connections to intermediate entities. For instance, a user may be interested in a particular disease, such as asthma, and want to find other diseases that are phenotypically similar.

Similarity between two entities, based on an intermediate type that relates them, is measured with a Jaccard coefficient: the number of intermediate nodes linked to both end nodes, divided by the total number of intermediate nodes linked to either end node.

For example, suppose that we are looking for diseases that are phenotypically similar to asthma. Asthma is related to a set of phenotypes. Another disease, such as COPD, is related to another set of phenotypes. The Jaccard coefficient between asthma and COPD is the size of the intersection of these sets divided by the size of their union. By definition, the similarity coefficient is in the range [0,1].
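As a minimal sketch of the definition above, the coefficient can be computed locally from two sets of intermediate nodes. The phenotype names below are made up for illustration and are not taken from ROBOKOP; the function name `jaccard` is likewise just a convenient label.

```python
def jaccard(a, b):
    """Jaccard coefficient of two collections of intermediate-node identifiers."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Neither entity has any linked nodes; treat the similarity as 0 by convention.
        return 0.0
    return len(a & b) / len(union)

# Hypothetical phenotype sets, for illustration only.
asthma_phenotypes = {'wheezing', 'cough', 'dyspnea', 'chest tightness'}
copd_phenotypes = {'cough', 'dyspnea', 'sputum production'}

# 2 shared phenotypes out of 5 distinct phenotypes overall.
print(jaccard(asthma_phenotypes, copd_phenotypes))  # 0.4
```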

## Basic Usage

This section shows how to call the similarity function. It takes the type of the input entity (`type1`), its identifier (`ident`), the type of the entities to return (`type2`), and the type of the intermediate nodes defining the similarity (`by_type`).

In [10]:
robokop_server = 'robokop.renci.org'

In [11]:
import requests
import pandas as pd

In [12]:
def similarity(type1,ident,type2,by_type,threshhold=None,max_results=None,rebuild=None):
    """Query the similarity service and return the parsed JSON response."""
    url=f'http://{robokop_server}/api/simple/similarity/{type1}/{ident}/{type2}/{by_type}'
    params = { 'threshhold': threshhold, 'max_results': max_results, 'rebuild': rebuild }
    # Send only the parameters that were explicitly set, so server defaults apply otherwise.
    params = { k:v for k,v in params.items() if v is not None }
    response=requests.get(url, params = params)
    print( 'Return code:',response.status_code )
    return response.json()

In [13]:
asthma = 'MONDO:0004979'
similar = similarity('disease',asthma,'disease','phenotypic_feature')
print(similar)

Return code: 200
[{'id': 'MONDO:0003014', 'name': 'rhinitis', 'similarity': 0.5079787234042553}, {'id': 'MONDO:0003781', 'name': 'bronchitis', 'similarity': 0.5026737967914439}, {'id': 'MONDO:0005002', 'name': 'chronic obstructive pulmonary disease', 'similarity': 0.42795698924731185}, {'id': 'MONDO:0004822', 'name': 'bronchiectasis', 'similarity': 0.42574257425742573}, {'id': 'MONDO:0005812', 'name': 'influenza', 'similarity': 0.42035398230088494}, {'id': 'MONDO:0005324', 'name': 'seasonal allergic rhinitis', 'similarity': 0.4119318181818182}]


The return is a list of dictionaries, and is easily converted into a pandas DataFrame:

In [14]:
pd.DataFrame(similar)

| | id | name | similarity |
|---|---|---|---|
| 0 | MONDO:0003014 | rhinitis | 0.507979 |
| 1 | MONDO:0003781 | bronchitis | 0.502674 |
| 2 | MONDO:0005002 | chronic obstructive pulmonary disease | 0.427957 |
| 3 | MONDO:0004822 | bronchiectasis | 0.425743 |
| 4 | MONDO:0005812 | influenza | 0.420354 |
| 5 | MONDO:0005324 | seasonal allergic rhinitis | 0.411932 |


## Controlling the number of responses with threshhold and max_results

By default, the similarity service returns the top 100 matches with a coefficient > 0.4. These somewhat arbitrary cutoffs can be modified using the `threshhold` and `max_results` parameters. In the example above, only six results for asthma are found with these cutoffs. We can lower the threshold to return more results:

In [16]:
similar = similarity('disease',asthma,'disease','phenotypic_feature',threshhold=0.25)
pd.DataFrame(similar)

Return code: 200


| | id | name | similarity |
|---|---|---|---|
| 0 | MONDO:0003014 | rhinitis | 0.507979 |
| 1 | MONDO:0003781 | bronchitis | 0.502674 |
| 2 | MONDO:0005002 | chronic obstructive pulmonary disease | 0.427957 |
| 3 | MONDO:0004822 | bronchiectasis | 0.425743 |
| 4 | MONDO:0005812 | influenza | 0.420354 |
| 5 | MONDO:0005324 | seasonal allergic rhinitis | 0.411932 |
| 6 | MONDO:0005249 | pneumonia | 0.383471 |
| 7 | MONDO:0005077 | pertussis | 0.381616 |
| 8 | MONDO:0006052 | pulmonary tuberculosis | 0.378092 |
| 9 | MONDO:0000265 | aspiration pneumonia (disease) | 0.373434 |


The number of results can be reduced again using `max_results`:

In [17]:
similar = similarity('disease',asthma,'disease','phenotypic_feature',threshhold=0.25,max_results=5)
pd.DataFrame(similar)

Return code: 200


| | id | name | similarity |
|---|---|---|---|
| 0 | MONDO:0003014 | rhinitis | 0.507979 |
| 1 | MONDO:0003781 | bronchitis | 0.502674 |
| 2 | MONDO:0005002 | chronic obstructive pulmonary disease | 0.427957 |
| 3 | MONDO:0004822 | bronchiectasis | 0.425743 |
| 4 | MONDO:0005812 | influenza | 0.420354 |
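The effect of the two parameters can be mimicked locally on a result list that has already been retrieved. This is only a sketch of the filtering logic, not the server's implementation: the helper name `filter_results` is hypothetical, and treating the threshold as a strict lower bound is an assumption based on the "> 0.4" default described above.

```python
def filter_results(results, threshhold=0.4, max_results=100):
    """Keep results above the threshold, best first, capped at max_results."""
    # Assumption: the cutoff is strict (similarity > threshhold).
    kept = [r for r in results if r['similarity'] > threshhold]
    kept.sort(key=lambda r: r['similarity'], reverse=True)
    return kept[:max_results]

# A few rows copied from the asthma output above, deliberately out of order.
rows = [
    {'id': 'MONDO:0005249', 'name': 'pneumonia', 'similarity': 0.383471},
    {'id': 'MONDO:0003014', 'name': 'rhinitis', 'similarity': 0.507979},
    {'id': 'MONDO:0003781', 'name': 'bronchitis', 'similarity': 0.502674},
]

# Default cutoffs drop pneumonia; a lower threshold keeps it; max_results caps the list.
print([r['name'] for r in filter_results(rows)])
print([r['name'] for r in filter_results(rows, threshhold=0.25)])
print([r['name'] for r in filter_results(rows, threshhold=0.25, max_results=1)])
```

Filtering locally like this only helps when the server has already returned everything of interest; lowering `threshhold` in the request itself is still needed to see matches below the server's default cutoff.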


## Caching and rebuilding

ROBOKOP maintains cached results. The cache is built both opportunistically (including the results of all previous queries) and proactively (pre-loading data that is expected to be heavily used). By default, the service only looks in its cache. If a result has not been previously cached, then the call will not return anything (and may return a status code of 500).

If a user wants to force the service to look beyond its local cache, they can send the parameter `rebuild=True`, as shown below.

If a user wants to be sure to retrieve all relevant data, they should use `rebuild=True`, but this comes at the expense of performance. To increase performance without sacrificing reliability, certain type pairs are preloaded into the cache. For these pairs, there will be no difference in results between calling with `rebuild=True` and `rebuild=False`, but calling with `rebuild=True` will be noticeably slower.

Because these pairs of types are preloaded into ROBOKOP's cache, there is no point in using rebuild for them. The list of preloaded pairs will be updated as it is modified. Note that once the data is loaded, it doesn't matter which type is the query and which is the result. That is, if an entry in the list specifies `disease` and `phenotypic_feature`, then there is no reason to use rebuild for `type1=type2='disease' by_type='phenotypic_feature'` or similar combinations.

See the "Expand" notebook for the most up-to-date list of features.

In [8]:
fanconis_anemia = 'MONDO:0019391'
similar_by_gene = similarity('disease',fanconis_anemia,'disease','gene',threshhold=0.1,rebuild=True)
pd.DataFrame(similar_by_gene)

Return code: 200


| | id | name | similarity |
|---|---|---|---|
| 0 | MONDO:0009833 | Shwachman-Diamond syndrome | 0.204819 |
| 1 | MONDO:0008876 | Bloom syndrome | 0.153061 |
| 2 | MONDO:0001044 | esophageal atresia (disease) | 0.144144 |
| 3 | MONDO:0008586 | esophageal atresia/tracheoesophageal fistula | 0.144144 |
| 4 | MONDO:0009215 | Fanconi anemia complementation group A | 0.108434 |
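One way to balance speed and completeness is to query the cache first and fall back to `rebuild=True` only when nothing comes back. The sketch below is an assumption about how a client might do this, not part of ROBOKOP. `similarity_with_fallback`, `fetch`, and `fake_fetch` are hypothetical names; `fetch` stands in for any callable that accepts a `rebuild` keyword and returns a list, and `fake_fetch` is a stand-in that returns placeholder data without contacting any server.

```python
def similarity_with_fallback(fetch, *args, **kwargs):
    """Try the cache first; retry with rebuild=True if nothing usable comes back."""
    kwargs.pop('rebuild', None)  # this helper decides when to rebuild
    try:
        cached = fetch(*args, **kwargs)
    except ValueError:
        # A failed lookup may not return parseable JSON; treat it as a cache miss.
        cached = None
    if isinstance(cached, list) and cached:
        return cached
    # Slower path: ask the service to look beyond its cache.
    return fetch(*args, rebuild=True, **kwargs)

def fake_fetch(*args, rebuild=None, **kwargs):
    # Stand-in for a real call: the "cache" is empty unless a rebuild is requested.
    return [{'id': 'EXAMPLE:1', 'similarity': 0.2}] if rebuild else []

print(similarity_with_fallback(fake_fetch, 'disease', 'EXAMPLE:0', 'disease', 'gene'))
```

With the `similarity` function defined earlier, the call would look like `similarity_with_fallback(similarity, 'disease', fanconis_anemia, 'disease', 'gene', threshhold=0.1)`. One trade-off of this design is that an empty cached answer always triggers the slow path, even when there are genuinely no matches above the threshold.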


## Similarity of discordant types

An unusual feature of this knowledge-graph-based similarity is that the query type and result type don't need to be the same, as long as they share an intermediate type. For instance, we might use similarity by gene to find biological processes and activities for a given disease. Note that we're using `rebuild=True`, since many of these types are not necessarily present in our cache:

In [9]:
similar_go = similarity('disease',fanconis_anemia,'biological_process_or_activity','gene',threshhold=0.1,rebuild=True)
pd.DataFrame(similar_go)

Return code: 200


| | id | name | similarity |
|---|---|---|---|
| 0 | GO:0036297 | interstrand cross-link repair | 0.294118 |
| 1 | GO:0000724 | double-strand break repair via homologous reco... | 0.136 |
| 2 | GO:0031297 | replication fork processing | 0.111111 |
