# MELODI Presto API Example Usage

In [1]:
import json
import pandas as pd
import requests
import time
from random import randint
import scipy.stats as stats
from utils import enrich, overlap, sentence

### Configure parameters

In [2]:
API_URL = "https://melodi-presto.mrcieu.ac.uk/api/"

# API_URL already ends with "/", so append the endpoint name directly
requests.get(f"{API_URL}status").json()

True
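
Joining a base URL that ends in `/` with a path that begins with `/` produces a doubled slash, which some servers tolerate and others reject. The following is a minimal sketch of a join helper that avoids this. `endpoint_url` is a hypothetical name introduced here for illustration and is not part of any MELODI Presto client; only the base URL and the `status` endpoint come from this notebook.

```python
def endpoint_url(base: str, endpoint: str) -> str:
    """Join a base URL and an endpoint with exactly one slash between them."""
    return f"{base.rstrip('/')}/{endpoint.lstrip('/')}"


API_URL = "https://melodi-presto.mrcieu.ac.uk/api/"
status_url = endpoint_url(API_URL, "/status")
print(status_url)
```

The helper gives the same result whether or not either part carries a slash, so the base URL can be written either way.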

### How the enrichment is performed

This is a basic Fisher's exact test, using the scipy stats function.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html

For example, suppose a query returns 147 triples in total, and one particular triple appears 10 times among them. Across all publications that same triple occurs 3,505 times, and the entire data set contains 6,533,824 triples.

In [8]:
import scipy.stats as stats

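# Counts of the triple of interest and triple totals, for the query and for the whole data set
queryTripleCount, globalTripleCount, queryTripleTotal, globalTripleTotal = 10, 3505, 147, 6533824
# 2x2 contingency table: [[triple in query, triple overall], [query total, overall total]]
oddsratio, pvalue = stats.fisher_exact([[queryTripleCount, globalTripleCount], [queryTripleTotal, globalTripleTotal]])
oddsratio, pvalue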

(126.81250303259678, 3.4724305806153405e-18)
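
As a sanity check, the odds ratio reported above can be reproduced by hand, since it is the cross-product ratio of the four cells in the 2x2 table. The sketch below uses only the numbers from the example; the variable names are illustrative.

```python
# 2x2 table from the example, laid out as passed to fisher_exact:
#                    query    whole data set
# this triple          10          3505
# all triples         147       6533824
query_count, global_count = 10, 3505
query_total, global_total = 147, 6533824

# Cross-product (sample) odds ratio of the table
odds_ratio = (query_count * global_total) / (global_count * query_total)
print(round(odds_ratio, 4))  # 126.8125
```

An odds ratio this far above 1 reflects that the triple makes up roughly 6.8% of the query's triples (10 of 147) but only about 0.05% of the whole data set (3,505 of 6,533,824).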

### Performance 

We can compare performance before and after a query has been run for the first time. To make sure the query has not already been run, we append a random number to it, e.g. `physical activity or 123`. In a PubMed search this returns over 550,000 articles. The first run takes over 20 seconds and creates a data set of around 10,000 triples; the second run takes only a few seconds.

In [9]:
r=randint(0, 1000000)
q='physical activity or '+str(r)
print(q)

def run_enrich(query_term):
    # Time a single enrichment call and return the elapsed seconds as a string
    start = time.time()
    enrich_df = enrich(query_term)
    print(enrich_df.shape)
    end = time.time()
    t = "{:.4f}".format(end-start)
    return t
    
t1 = run_enrich(q)
t2 = run_enrich(q)
print('t1:',t1,'\nt2:',t2)

physical activity or 355736
(12848, 16)
(12848, 16)
t1: 32.8849 
t2: 4.0485
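
A slow first run followed by a fast repeat is the pattern a cache would produce. The notebook does not describe how MELODI Presto stores results on the server, so the sketch below is only a local analogy: `slow_lookup` is a hypothetical stand-in for an expensive first-time query, memoised with `functools.lru_cache`, and `timed` is an illustrative helper.

```python
import time
from functools import lru_cache


@lru_cache(maxsize=None)
def slow_lookup(query: str) -> int:
    """Hypothetical stand-in for an expensive first-time query."""
    time.sleep(0.2)  # simulate the work done on the first request
    return len(query)


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call to func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


_, t1 = timed(slow_lookup, "physical activity or 123")
_, t2 = timed(slow_lookup, "physical activity or 123")
print(f"t1: {t1:.4f}\nt2: {t2:.4f}")
```

`time.perf_counter` is used here because it is a monotonic, high-resolution clock intended for measuring short intervals.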


Likewise, we can run the overlap query with two new queries, and then run it again with the same queries.

In [5]:
r=randint(0, 1000000)
q1=['vitamin d or '+str(r)]
q2=['prostate cancer or '+str(r)]
print(q1,':',q2)

def run_overlap(q1,q2):
    start = time.time()
    overlap_df = overlap(q1,q2)
    print(overlap_df.shape)
    end = time.time()
    t = "{:.4f}".format(end-start)
    return t
    
t1 = run_overlap(q1,q2)
t2 = run_overlap(q1,q2)
print('t1:',t1,'\nt2:',t2)

['vitamin d or 363782'] : ['prostate cancer or 363782']
(1497, 32)
(1497, 32)
t1: 27.1155 
t2: 1.4967
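
For intuition about what an overlap result might contain, the sketch below pairs triples from two toy sets wherever the object of a triple in the first set equals the subject of a triple in the second. The matching rule, the toy triples, and the `find_overlaps` helper are all assumptions made for illustration; none of them is taken from the `utils.overlap` implementation.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def find_overlaps(x: List[Triple], y: List[Triple]) -> List[Tuple[Triple, Triple]]:
    """Pair each triple in x with every triple in y whose subject is x's object."""
    by_subject: Dict[str, List[Triple]] = {}
    for triple in y:
        by_subject.setdefault(triple[0], []).append(triple)
    return [(tx, ty) for tx in x for ty in by_subject.get(tx[2], [])]


# Toy triples, invented for illustration only.
set_x = [("A", "STIMULATES", "B"), ("A", "INHIBITS", "C")]
set_y = [("B", "ASSOCIATED_WITH", "D"), ("E", "CAUSES", "D")]

pairs = find_overlaps(set_x, set_y)
for tx, ty in pairs:
    print(tx, "->", ty)
```

Indexing the second set by subject keeps the pairing roughly linear in the number of triples, rather than comparing every triple in one set against every triple in the other.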


##### Comparing performance of similar tools

To our knowledge, the only methods providing this kind of overlap analysis are Arrowsmith (http://arrowsmith.psych.uic.edu/) and MELODI (http://melodi.biocompute.org.uk/). A query of `vitamin d` and `prostate cancer` takes over 30 minutes on both platforms. 

### Comparing output

We can attempt to compare the output of the same overlap query across the three platforms mentioned above. In this case, MELODI Presto data will be derived in real time, whereas data from Arrowsmith and MELODI have to be pre-calculated and downloaded as CSV files.

MELODI - http://melodi.biocompute.org.uk/results/b1741206-90ae-490b-8580-4ad7c50f7f45/  
Arrowsmith - failed to complete (02/07/20):


`Software error:
file: ../../arrowsmith_uic/data/stopwords_pubmed: No such file or directory at /var/www/arrowsmith-uic/cgi-bin/arrowsmith_uic/Arrowsmith/general.pm line 21.
For help, please send mail to the webmaster (neils@uic.edu), giving this error message and the time and date of the error.
`

##### Comparison of tools:

| Tool | URL | Data source | Updated | Article limit | API | 
| --- | --- | --- | --- | --- | --- |
|Arrowsmith | http://arrowsmith.psych.uic.edu/ | MEDLINE| 2014 | 50,000 | No |  
|MELODI | http://melodi.biocompute.org.uk/ | SemMedDB | 2018 | 1,000,000 | No | 
|MELODI Presto | https://melodi-presto.mrcieu.ac.uk/ | SemMedDB | 2020 | Unlimited | Yes |

In [6]:
#load the MELODI data
melodi_df=pd.read_csv('melodi_result_4534.csv')
#add column for overlap type
melodi_df['name3_type'] = melodi_df.apply(lambda row: row.name3.split(' ')[-1], axis = 1)
m_overlap_counts = melodi_df.groupby('name3_type')['name3'].value_counts().reset_index(name='counts')
m_overlap_counts[m_overlap_counts['name3_type']=='(gngm)']

| | name3_type | name3 | counts |
| --- | --- | --- | --- |
| 50 | (gngm) | Vitamin D3 Receptor (gngm) | 144 |
| 51 | (gngm) | NF-kappa B (gngm) | 6 |


In [7]:
#create MELODI Presto data
melodi_presto_df = overlap(['vitamin d'],['prostate cancer'])
mp_overlap_counts = melodi_presto_df.groupby('object_type_x')['object_name_x'].value_counts().reset_index(name='counts')
#mp_overlap_counts
mp_overlap_counts[mp_overlap_counts['object_type_x']=='gngm']

| | object_type_x | object_name_x | counts |
| --- | --- | --- | --- |
| 38 | gngm | Androgen Receptor\|AR | 70 |
| 39 | gngm | FLVCR1 | 66 |
| 40 | gngm | Vitamin D3 Receptor | 50 |
| 41 | gngm | Interleukin-6 | 28 |
| 42 | gngm | Alkaline Phosphatase | 27 |
| 43 | gngm | Mixed Function Oxygenases | 21 |
| 44 | gngm | PTH gene\|PTH | 21 |
| 45 | gngm | NF-kappa B | 14 |
| 46 | gngm | Osteocalcin | 11 |
| 47 | gngm | Proto-Oncogene Proteins c-akt\|AKT1 | 11 |


Comparing the output from MELODI and MELODI Presto, we can see the extra information now available. This is due to several factors, including an updated version of SemMedDB and the ability to return all triples, not just those below a set enrichment threshold.
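
The difference can be made concrete with the `gngm` rows shown in the two outputs above. The sketch below copies those counts into plain dictionaries (rather than re-running either tool) and lists which names appear in both results; the variable names are illustrative.

```python
# Counts copied from the two outputs shown above (gngm rows only).
melodi = {"Vitamin D3 Receptor (gngm)": 144, "NF-kappa B (gngm)": 6}
melodi_presto = {
    "Androgen Receptor|AR": 70,
    "FLVCR1": 66,
    "Vitamin D3 Receptor": 50,
    "Interleukin-6": 28,
    "Alkaline Phosphatase": 27,
    "Mixed Function Oxygenases": 21,
    "PTH gene|PTH": 21,
    "NF-kappa B": 14,
    "Osteocalcin": 11,
    "Proto-Oncogene Proteins c-akt|AKT1": 11,
}

# MELODI appends the semantic type to each name; strip it before comparing.
melodi_names = {name.rsplit(" (", 1)[0] for name in melodi}
presto_names = set(melodi_presto)

shared = sorted(melodi_names & presto_names)
presto_only = sorted(presto_names - melodi_names)

print("shared:", shared)
print("MELODI Presto only:", len(presto_only))
```

Both names reported by MELODI also appear in the MELODI Presto rows, which additionally contain eight names absent from the MELODI output.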