# Getting started with EpiGraphDB in Python

This notebook is provided as a brief introductory guide to working with the EpiGraphDB platform through Python. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the [API endpoint documentation](http://docs.epigraphdb.org/api/api-endpoints/).

A Python wrapper for EpiGraphDB's API is currently in the works, but for now we will query the API directly using the `requests` library. Familiarity with this package is helpful but not essential.


In [5]:
import requests

First, we will ping the API to check our connection:

In [6]:
# Store our API URL as a string for future use
API_URL = "https://api.epigraphdb.org"

# Here we use the .get() method to send a GET request to the /ping endpoint of the API
endpoint = '/ping'
response_object = requests.get(API_URL + endpoint)  

# Check that the ping was successful
response_object.raise_for_status()
print("If this line gets printed, ping was successful.")

If this line gets printed, ping was successful.
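
To see what `raise_for_status()` does without touching the network, the sketch below builds bare `requests.Response` objects by hand and sets their status codes. Constructing responses this way is purely illustrative (real responses come from `requests.get` or `requests.post`), and the helper name `is_ok` is hypothetical.

```python
import requests


def is_ok(status_code):
    """Return True if raise_for_status() accepts this status code."""
    # Build an empty response by hand; real responses come from the API.
    response = requests.Response()
    response.status_code = status_code
    try:
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
    except requests.exceptions.HTTPError:
        return False
    return True


print(is_ok(200))
print(is_ok(404))
print(is_ok(500))
```

In other words, the `print` call in the ping cell is only reached when the status code is not a client or server error.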


## 1. Using EpiGraphDB to get biological mappings

In this first section, we will take an arbitrary list of genes and query the EpiGraphDB API to find the proteins that they map to. We will be using the `POST` HTTP method, which requires its parameters to be passed in JSON format, a conversion that is easy to do using the `json` library. To find the correct names of the parameters that we are about to set, we navigate to the [EpiGraphDB API documentation](https://api.epigraphdb.org/) and locate the dropdown box for the endpoint that we are about to access. Clicking this box displays information about the endpoint, including parameter names and an example request body (which corresponds to the `params` dictionary that we will create below).
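
As a small illustration of the JSON conversion described above, the sketch below prepares a `POST` request locally, without sending it, and inspects the body. It also shows the `json=` keyword of `requests`, which serialises the dictionary for you and sets the `Content-Type` header to `application/json`. Whether the EpiGraphDB API needs that header is not stated here, so treat the second form as an alternative to try rather than a requirement.

```python
import json
import requests

API_URL = "https://api.epigraphdb.org"
params = {"gene_name_list": ["TP53", "BRCA1", "TNF"]}

# Option 1: serialise the dictionary ourselves and pass it as `data`
manual = requests.Request(
    "POST", API_URL + "/mappings/gene-to-protein", data=json.dumps(params)
).prepare()

# Option 2: let requests serialise the dictionary via `json`
automatic = requests.Request(
    "POST", API_URL + "/mappings/gene-to-protein", json=params
).prepare()

# .prepare() builds the request locally; nothing is sent over the network
print(manual.body)
print(automatic.headers["Content-Type"])
```

Both prepared requests carry the same JSON payload; they differ only in who does the serialisation and in the headers that are set.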

In [7]:
# 1.1 Mapping genes to proteins

# Set parameters and convert to JSON format
import json
params = {
  "gene_name_list": [
    "TP53",
    "BRCA1", 
    "TNF"
  ]
}
json_params = json.dumps(params)

# Send the POST request
endpoint = '/mappings/gene-to-protein'
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store in a pandas dataframe
import pandas as pd
results = response_object.json()['results']
gene_protein_df = pd.json_normalize(results)

gene_protein_df

| | gene.name | gene.ensembl_id | protein.uniprot_id |
|---|---|---|---|
| 0 | TP53 | ENSG00000141510 | P04637 |
| 1 | BRCA1 | ENSG00000012048 | P38398 |
| 2 | TNF | ENSG00000232810 | P01375 |


In the above cell, we queried EpiGraphDB for the proteins that have been mapped to the genes *TP53*, *BRCA1*, and *TNF*. Our query went through successfully and we received an associated protein for each. The columns in our output dataframe take the general form `entity.attribute`, e.g. the `gene.ensembl_id` column comprises the [Ensembl IDs](https://www.ensembl.org/index.html) of the genes in the table. 
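
The `entity.attribute` naming comes from how `pd.json_normalize` flattens nested dictionaries. The sketch below uses a hand-written record that mimics the shape of one entry in `results`; the exact nesting of the real API response is an assumption here.

```python
import pandas as pd

# Hand-written sample mimicking one entry of response_object.json()['results']
sample_results = [
    {
        "gene": {"name": "TP53", "ensembl_id": "ENSG00000141510"},
        "protein": {"uniprot_id": "P04637"},
    }
]

# Nested keys are joined with "." to form the column names
sample_df = pd.json_normalize(sample_results)
print(list(sample_df.columns))
```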

In [44]:
# 1.2 Proteins to pathways

# As above, this is another POST request, so we need our data in JSON format
json_params = json.dumps({
  "uniprot_id_list": list(gene_protein_df['protein.uniprot_id'].values)
})

# Send the request
endpoint = '/protein/in-pathway'
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for a successful request
response_object.raise_for_status()

# Store results
results = response_object.json()['results']
protein_pathway_df = pd.json_normalize(results)

protein_pathway_df

| | uniprot_id | pathway_count | pathway_reactome_id |
|---|---|---|---|
| 0 | P04637 | 5 | [R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R... |
| 1 | P38398 | 6 | [R-HSA-6796648, R-HSA-1221632, R-HSA-8953750, ... |
| 2 | P01375 | 3 | [R-HSA-6785807, R-HSA-6783783, R-HSA-5357905] |


Above, we took the proteins that had been mapped to our genes of interest and queried the platform for their associated pathway data. The API found multiple pathways for each protein and returned the respective Reactome IDs to us as lists.
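
Because each row stores its pathways as a list, it can be convenient to reshape the table to one row per protein and pathway pair. A minimal sketch using `DataFrame.explode`, with only the complete *P01375* row from the output above copied in by hand:

```python
import pandas as pd

# Row copied from the output above (the other rows are truncated in the display)
sample_df = pd.DataFrame(
    {
        "uniprot_id": ["P01375"],
        "pathway_count": [3],
        "pathway_reactome_id": [["R-HSA-6785807", "R-HSA-6783783", "R-HSA-5357905"]],
    }
)

# explode() gives each list element its own row and repeats the other columns
long_df = sample_df.explode("pathway_reactome_id").reset_index(drop=True)
print(long_df)
```

With the full dataframe, the same call would be `protein_pathway_df.explode('pathway_reactome_id')`.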


It is worth noting here that so far we have only been accessing the `'results'` key in the nested dictionary returned by the `.json()` method of our response object. The other available key is `'metadata'`, which provides information about the request itself, including the specific Cypher query that the platform ran to get these results. The example output below comes from the gene-to-protein request in section 1.1. If you would like to know more about the use of Cypher in these requests, there is a dedicated section at the end of this notebook.

In [52]:
from pprint import pprint
metadata = response_object.json()['metadata']

pprint(metadata)

{'empty_results': False,
 'query': 'MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) WHERE '
          "gene.name IN ['TP53', 'BRCA1', 'TNF'] RETURN gene {.ensembl_id, "
          '.name}, protein {.uniprot_id}',
 'total_seconds': 0.007318}


## 2. Mendelian randomization (MR) evidence for an exposure trait

In this example, we will query EpiGraphDB to obtain a list of traits for which there is strong evidence of an effect from the exposure trait 'body mass index'. This time we will be using a different HTTP method than before: the `GET` method, which is easier to use in Python because the parameters can be passed directly as a dictionary. To learn more about the differences between `GET` and `POST`, please see [this guide](https://www.w3schools.com/tags/ref_httpmethods.asp).
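
To make the `GET` parameter handling concrete, the sketch below prepares a request locally (nothing is sent) and prints the URL that `requests` would call for the `/mr` endpoint. Note that parameters set to `None` are left out of the query string.

```python
import requests

API_URL = "https://api.epigraphdb.org"
params = {
    "exposure_trait": "Body mass index",
    "outcome_trait": None,  # None values are omitted from the URL
    "pval_threshold": 1e-10,
}

# Build the request without sending it and inspect the final URL
prepared = requests.Request("GET", API_URL + "/mr", params=params).prepare()
print(prepared.url)
```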

The planned steps for this section are: (1) get a list of GWAS that match 'body mass index', (2) get all MR results with it as the exposure, (3) get all MR results with something else as the outcome, and (4) get all MR results for a specified exposure and outcome pair.

In [None]:
# 2.1 maybe gwas on BMI? 

In [50]:
# 2.2 MR results with BMI as the exposure

# Create a dictionary for the parameters to be passed
params = {'exposure_trait': 'Body mass index',
          'outcome_trait': None,
          'pval_threshold': 1e-10}

# Send the request
endpoint = '/mr'
response_object = requests.get(API_URL + endpoint, params=params)

# Check for a successful response
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()['results']
BMI_MR_df = pd.json_normalize(result) 

BMI_MR_df

| | exposure.id | exposure.trait | outcome.id | outcome.trait | mr.b | mr.se | mr.pval | mr.method | mr.selection | mr.moescore |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ieu-a-2 | Body mass index | ukb-a-74 | Non-cancer illness code self-reported: diabetes | 0.034559 | 0.002418 | 0.000000e+00 | FE IVW | DF | 0.93 |
| 1 | ieu-a-2 | Body mass index | ukb-a-388 | Hip circumference | 0.724105 | 0.026588 | 0.000000e+00 | Simple median | Tophits | 0.95 |
| 2 | ieu-a-2 | Body mass index | ukb-a-382 | Waist circumference | 0.656440 | 0.024496 | 0.000000e+00 | Simple median | Tophits | 0.94 |
| 3 | ieu-a-2 | Body mass index | ukb-a-35 | Comparative height size at age 10 | 0.136684 | 0.007909 | 0.000000e+00 | FE IVW | Tophits | 0.94 |
| 4 | ieu-a-2 | Body mass index | ukb-a-34 | Comparative body size at age 10 | 0.365580 | 0.023556 | 0.000000e+00 | Simple median | HF | 0.87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 517 | ieu-a-974 | Body mass index | ukb-a-476 | Pain type(s) experienced in last month: Knee pain | 0.052106 | 0.005613 | 7.582779e-11 | Simple mean | HF | 0.90 |
| 518 | ieu-a-785 | Body mass index | ieu-a-1037 | Difference in height between childhood and adu... | -0.520875 | 0.080135 | 8.034037e-11 | FE IVW | Tophits | 0.71 |
| 519 | ieu-a-2 | Body mass index | ieu-a-1034 | Height | 0.356558 | 0.055004 | 9.023410e-11 | FE IVW | DF + HF | 0.78 |
| 520 | ieu-a-2 | Body mass index | ukb-a-294 | Wheeze or whistling in the chest in last year | 0.052605 | 0.008118 | 9.166369e-11 | Simple median | DF | 0.89 |


The dataframe above displays the results of our query. We requested all MR results in which body mass index is the exposure variable and the causal estimate has a p-value lower than 1e-10. The query returned 521 rows (indexed 0 to 520). Columns prefixed `exposure.` and `outcome.` describe the two traits, and columns prefixed `mr.` hold the MR estimates and method details.
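
Once the results are in a dataframe, ordinary pandas operations can be used to explore them. The sketch below copies three rows from the output above by hand (keeping a subset of columns) and ranks them by effect estimate. The 0.9 cut-off on `mr.moescore` is an arbitrary value chosen for illustration, not a recommended threshold.

```python
import pandas as pd

# Three rows copied from the output above, keeping a subset of columns
sample_df = pd.DataFrame(
    {
        "outcome.trait": [
            "Hip circumference",
            "Waist circumference",
            "Comparative body size at age 10",
        ],
        "mr.b": [0.724105, 0.656440, 0.365580],
        "mr.moescore": [0.95, 0.94, 0.87],
    }
)

# Keep rows above an arbitrary score cut-off, then sort by effect estimate
filtered_df = (
    sample_df[sample_df["mr.moescore"] >= 0.9]
    .sort_values("mr.b", ascending=False)
    .reset_index(drop=True)
)
print(filtered_df)
```

The same filter and sort can be applied directly to `BMI_MR_df`.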

## 3. Querying biomedical literature

Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form `(subject, predicate, object)` and have been generated by applying contemporary natural language processing techniques to a large body of published biomedical research papers. In the following section we will query the API for the relationship between a given gene and an outcome trait.

In [9]:
# Establish parameters
params = {
    'gene_name': "IL23R",
    'object_name': "Inflammatory bowel disease"
}

# Send the request
endpoint = "/literature/gene"
response_object = requests.get(API_URL + endpoint, params=params)

# Check for a successful response
response_object.raise_for_status()

# Store the results of the query and display
result = response_object.json()['results']
lit_df = pd.json_normalize(result) 

lit_df

| | pubmed_id | gene.name | st.predicate | st.object_name |
|---|---|---|---|---|
| 0 | [17484863, 21155887] | IL23R | NEG_ASSOCIATED_WITH | Inflammatory Bowel Diseases |
| 1 | [27852544] | IL23R | AFFECTS | Inflammatory Bowel Diseases |
| 2 | [17484863, 19575361, 19496308, 18383521, 18341... | IL23R | ASSOCIATED_WITH | Inflammatory Bowel Diseases |
| 3 | [23131344] | IL23R | PREDISPOSES | Inflammatory Bowel Diseases |


The dataframe above shows the results of our query: four distinct predicates were found between the gene *IL23R* and the trait *Inflammatory bowel disease*, and they are displayed in the `st.predicate` column. The leftmost column contains the PubMed IDs of the papers from which each triple was derived; a paper can be viewed by navigating to `https://pubmed.ncbi.nlm.nih.gov/<pubmed_id>`. In this case it seems that *ASSOCIATED_WITH* is the most common predicate linking our gene to the trait, but the truncated list hides exactly how many papers there are. Let's add a paper count to the dataframe.

In [40]:
lit_df['publication_count'] = [len(papers_list) for papers_list in lit_df['pubmed_id']]
lit_df

| | pubmed_id | gene.name | st.predicate | st.object_name | publication_count |
|---|---|---|---|---|---|
| 0 | [17484863, 21155887] | IL23R | NEG_ASSOCIATED_WITH | Inflammatory Bowel Diseases | 2 |
| 1 | [27852544] | IL23R | AFFECTS | Inflammatory Bowel Diseases | 1 |
| 2 | [17484863, 19575361, 19496308, 18383521, 18341... | IL23R | ASSOCIATED_WITH | Inflammatory Bowel Diseases | 21 |
| 3 | [23131344] | IL23R | PREDISPOSES | Inflammatory Bowel Diseases | 1 |
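
With the counts in place, the predicates can be ranked and the supporting papers turned into clickable links. The sketch below uses three complete rows copied by hand from the output above (the truncated *ASSOCIATED_WITH* row is left out). The helper name `pubmed_url` is hypothetical, and storing the IDs as strings is an assumption; the f-string works the same way if the API returns integers.

```python
import pandas as pd


def pubmed_url(pubmed_id):
    """Return the PubMed page URL for a single publication ID."""
    return f"https://pubmed.ncbi.nlm.nih.gov/{pubmed_id}"


# Rows copied from the output above (the truncated ASSOCIATED_WITH row is left out)
sample_df = pd.DataFrame(
    {
        "pubmed_id": [["17484863", "21155887"], ["27852544"], ["23131344"]],
        "st.predicate": ["NEG_ASSOCIATED_WITH", "AFFECTS", "PREDISPOSES"],
    }
)
sample_df["publication_count"] = sample_df["pubmed_id"].map(len)

# Most frequently reported predicate first
ranked_df = sample_df.sort_values("publication_count", ascending=False)

# One URL per paper for the top-ranked predicate in this sample
top_urls = [pubmed_url(pmid) for pmid in ranked_df.iloc[0]["pubmed_id"]]
print(top_urls)
```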


## 4. Requests and Cypher

Until now, to get information from the platform we have simply created a dictionary or JSON object containing our parameters and sent it to the correct endpoint of the API using the `requests` library. This is good practice, and the API has been designed specifically to allow this method of use, as we have (non-exhaustively) demonstrated above. It works because the API automatically converts each HTTP request that it receives into a Cypher query, which it then passes to the Neo4j database on which EpiGraphDB is built. The database passes back the result of the query, which is then returned to us in Python as a response object. Each response object contains metadata that includes the Cypher query that was run on the database, as shown in the cell below.

In [63]:
# 4.1 Viewing the Cypher query behind a request

params = {
  "gene_name_list": [
    "TP53"
  ]
}
json_params = json.dumps(params)
endpoint = '/mappings/gene-to-protein'
response_object = requests.post(API_URL + endpoint, data=json_params)
response_object.raise_for_status()

# Extract and print the Cypher query
cypher_query = response_object.json()['metadata']['query']
print(cypher_query)

MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) WHERE gene.name IN ['TP53'] RETURN gene {.ensembl_id, .name}, protein {.uniprot_id}


The text printed above is the Cypher query that the API ran behind the scenes for this request. It is the same query as in section 1.1, restricted to the gene *TP53*. The basic structure of these queries is as follows:
***
MATCH *subgraph\**

WHERE *condition*

RETURN *data*
***
\* note that the subgraph takes this general form: *(node)-[relationship]-(node)*

For more detailed information on Cypher queries, please refer to the official Neo4j Cypher documentation. Otherwise, let's write our own basic query and send it to EpiGraphDB.
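
Before sending a query, it can help to assemble it from the three clauses described above. The helper below is a hypothetical convenience function, not part of EpiGraphDB or `requests`; it simply joins the clauses into one string and reproduces the query printed in 4.1.

```python
def build_cypher(match, where, returns):
    """Join the three clauses into a single Cypher query string."""
    return f"MATCH {match} WHERE {where} RETURN {returns}"


cypher_query = build_cypher(
    match="(gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein)",
    where="gene.name IN ['TP53']",
    returns="gene {.ensembl_id, .name}, protein {.uniprot_id}",
)
print(cypher_query)
```

Plain string formatting like this is fine for exploratory use, but it should not be used to build queries from untrusted input.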

 

In [None]:
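# 4.2 Sending custom Cypher queries

# Reuse the query printed in 4.1, with a different gene name
params = json.dumps({
    'query': "MATCH (gene:Gene)-[gp:GENE_TO_PROTEIN]-(protein:Protein) "
             "WHERE gene.name IN ['BRCA1'] "
             "RETURN gene {.ensembl_id, .name}, protein {.uniprot_id}"
})

# Send the request
# NOTE: the '/cypher' endpoint name is an assumption; confirm it in the API documentation
endpoint = '/cypher'
response_object = requests.post(API_URL + endpoint, data=params)
response_object.raise_for_status()
response_object.json()['results']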