# Getting started with EpiGraphDB in Python

This notebook is provided as a brief introductory guide to working with the EpiGraphDB platform through Python. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the [API endpoint documentation](http://docs.epigraphdb.org/api/api-endpoints/).

A Python wrapper for EpiGraphDB's API is currently in the works, but for now we will query the API directly using the `requests` library. Knowledge of this package will be helpful but is by no means essential.


In [1]:
import requests
import pandas as pd  # used below to display query results as dataframes

First, we will ping the API to check our connection:

In [23]:
# Store our API URL as a string for future use
API_URL = "https://api.epigraphdb.org"

# Here we use the .get() method to send a GET request to the /ping endpoint of the API
endpoint = '/ping'
response_object = requests.get(API_URL + endpoint)  

# Check that the ping was successful
response_object.raise_for_status()
print("If this line gets printed, ping was successful.")

If this line gets printed, ping was successful.


## 1. Using EpiGraphDB to get biological mappings

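In this section we use EpiGraphDB to map a gene to its protein product, and then map that protein to the pathways it takes part in. The ping above was a GET request; the mapping endpoints used below expect POST requests, where the parameters are sent in the request body as JSON rather than in the URL. The [API endpoint documentation](http://docs.epigraphdb.org/api/api-endpoints/) describes each endpoint, and the [`requests` documentation](https://requests.readthedocs.io/) covers both `requests.get()` and `requests.post()`. Our gene of interest is *TP53*, a commonly studied gene.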
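To make the difference between GET and POST requests concrete, the sketch below uses only the standard library (so it runs without network access) to show where the parameters end up in each case. The `/mr` parameters are the ones used later in this notebook; the helper names `build_get_url` and `build_post_body` are illustrative and are not part of `requests` or the EpiGraphDB API.

```python
import json
from urllib.parse import urlencode

API_URL = "https://api.epigraphdb.org"


def build_get_url(endpoint, params):
    """Return the URL a GET request would use: parameters go in the query string."""
    return f"{API_URL}{endpoint}?{urlencode(params)}"


def build_post_body(params):
    """Return the body a POST request would carry: parameters serialised as JSON."""
    return json.dumps(params)


# GET: parameters are encoded into the URL itself
get_url = build_get_url(
    "/mr", {"exposure_trait": "Body mass index", "pval_threshold": 1e-10}
)

# POST: the URL stays bare and the parameters travel in the request body
post_body = build_post_body({"gene_name_list": ["TP53"]})

print(get_url)
print(post_body)
```

In the cells of this notebook, `requests.get(..., params=...)` and `requests.post(..., data=...)` take care of these two steps for us.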


In [25]:
# 1.1 Mapping genes to proteins

# Set parameters (note that the server requires parameters to be in JSON format for a POST request)
import json
params = {
  "gene_name_list": [
    "TP53"
  ]
}
json_params = json.dumps(params)

# Send the POST request
endpoint = '/mappings/gene-to-protein'
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store and display the results
results = response_object.json()['results']
gene_protein_df = pd.json_normalize(results)
gene_protein_df

| | gene.name | gene.ensembl_id | protein.uniprot_id |
|---|---|---|---|
| 0 | TP53 | ENSG00000141510 | P04637 |


In the above cell, we queried EpiGraphDB for the proteins that have been mapped to the gene *TP53*. Our query went through successfully and we received a single result: the protein *P04637*. The columns in our output dataframe take the general form `entity.attribute`, e.g. the `gene.ensembl_id` column contains the [Ensembl ID](https://www.ensembl.org/index.html) of each gene in the table.
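The `entity.attribute` column names come from flattening nested records. As a rough illustration of that naming scheme, here is a simplified, one-level sketch in plain Python; `pd.json_normalize` handles deeper nesting and more cases. The helper `flatten_record` is hypothetical, and the nested shape of `record` is an assumption inferred from the column names above rather than copied from the API response.

```python
def flatten_record(record, sep="."):
    """Flatten a one-level nested dict into `entity.attribute` keys."""
    flat = {}
    for entity, attributes in record.items():
        if isinstance(attributes, dict):
            # Nested dict: prefix each attribute with its entity name
            for attribute, value in attributes.items():
                flat[f"{entity}{sep}{attribute}"] = value
        else:
            # Plain value: keep the key unchanged
            flat[entity] = attributes
    return flat


# Assumed shape of one entry in `results` (inferred, not taken from the API)
record = {
    "gene": {"name": "TP53", "ensembl_id": "ENSG00000141510"},
    "protein": {"uniprot_id": "P04637"},
}

flat_record = flatten_record(record)
print(flat_record)
```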

In [40]:
# 1.2 Mapping proteins to pathways

# As above, this is another POST request, so we need our data in JSON format
json_params = json.dumps({
  "uniprot_id_list": [
    gene_protein_df['protein.uniprot_id'][0]  # Grab the first outputted protein ID from the code cell above
  ]
})

# Send the request
endpoint = '/protein/in-pathway'
response_object = requests.post(API_URL + endpoint, data=json_params)

# Check for successful request
response_object.raise_for_status()

# Store results
results = response_object.json()['results']
protein_pathway_df = pd.json_normalize(results)
protein_pathway_df

| | uniprot_id | pathway_count | pathway_reactome_id |
|---|---|---|---|
| 0 | P04637 | 5 | [R-HSA-6785807, R-HSA-390471, R-HSA-5689896, R... |


Above, we took a protein that had been mapped to our gene of interest and queried the platform for its associated pathway data. The API found five such pathways and returned their Reactome IDs to us as a list.
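Because the pathway IDs arrive as a list inside a single cell, it can be convenient to expand them into one row per pathway. The sketch below does this in plain Python; with a dataframe, `DataFrame.explode` serves the same purpose. The helper `expand_pathways` is hypothetical, and only the three IDs visible in the truncated output above are included, so the list is deliberately incomplete.

```python
# One flattened result row. Only the three Reactome IDs visible in the
# truncated output are listed here; the remaining IDs are left out.
row = {
    "uniprot_id": "P04637",
    "pathway_reactome_id": ["R-HSA-6785807", "R-HSA-390471", "R-HSA-5689896"],
}


def expand_pathways(row):
    """Return one (protein, pathway) record per Reactome ID in the row."""
    return [
        {"uniprot_id": row["uniprot_id"], "pathway_reactome_id": pathway_id}
        for pathway_id in row["pathway_reactome_id"]
    ]


long_rows = expand_pathways(row)
for long_row in long_rows:
    print(long_row)
```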


It is worth noting here that so far we have only been accessing the `'results'` key in the dictionary returned by the `.json()` method of our response object. The other available key is `'metadata'`, which provides information about the request itself, including the run time and the specific Cypher query that the platform ran to get these results. If you would like to know more about the use of Cypher in these requests, there is a section at the end of this notebook dedicated to that.
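As a small illustration of working with both keys, the sketch below separates a decoded response into its results and its metadata. It runs on a hand-written stand-in for `response_object.json()`, not a real response: the metadata key names `query` and `total_seconds`, their values, and the helper `split_payload` are all assumptions made for illustration.

```python
def split_payload(payload):
    """Return (results, metadata) from a decoded response, tolerating missing keys."""
    return payload.get("results", []), payload.get("metadata", {})


# Stand-in for response_object.json(); key names and values are hypothetical
payload = {
    "metadata": {"query": "<Cypher query text>", "total_seconds": 0.01},
    "results": [{"gene": {"name": "TP53"}}],
}

results, metadata = split_payload(payload)
print(sorted(metadata))  # names of the metadata fields
print(len(results))      # number of result records
```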

## 2. Using EpiGraphDB to obtain MR evidence for an exposure trait

In this example, we will query EpiGraphDB to obtain a list of traits for which there is strong Mendelian randomization (MR) evidence of an effect from the exposure trait 'body mass index'.

In [5]:
# Create a dictionary for the parameters to be passed
BMI_params = {
    'exposure_trait': 'Body mass index',
    'pval_threshold': 1e-10,
}

# Send the request
BMI_response = requests.get(f"{API_URL}/mr", params=BMI_params)

# Check for a successful response status, raise an error if unsuccessful
BMI_response.raise_for_status()

# Store the results of the query, which can be obtained by calling the .json() method on the response object
BMI_result = BMI_response.json()['results']

# Convert our results from a nested dictionary to a pandas dataframe and display it below
import pandas as pd
BMI_df = pd.json_normalize(BMI_result) 
BMI_df

| | exposure.id | exposure.trait | outcome.id | outcome.trait | mr.b | mr.se | mr.pval | mr.method | mr.selection | mr.moescore |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ieu-a-2 | Body mass index | ukb-a-74 | Non-cancer illness code self-reported: diabetes | 0.034559 | 0.002418 | 0.000000e+00 | FE IVW | DF | 0.93 |
| 1 | ieu-a-2 | Body mass index | ukb-a-388 | Hip circumference | 0.724105 | 0.026588 | 0.000000e+00 | Simple median | Tophits | 0.95 |
| 2 | ieu-a-2 | Body mass index | ukb-a-382 | Waist circumference | 0.656440 | 0.024496 | 0.000000e+00 | Simple median | Tophits | 0.94 |
| 3 | ieu-a-2 | Body mass index | ukb-a-35 | Comparative height size at age 10 | 0.136684 | 0.007909 | 0.000000e+00 | FE IVW | Tophits | 0.94 |
| 4 | ieu-a-2 | Body mass index | ukb-a-34 | Comparative body size at age 10 | 0.365580 | 0.023556 | 0.000000e+00 | Simple median | HF | 0.87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 517 | ieu-a-974 | Body mass index | ukb-a-476 | Pain type(s) experienced in last month: Knee pain | 0.052106 | 0.005613 | 7.582779e-11 | Simple mean | HF | 0.90 |
| 518 | ieu-a-785 | Body mass index | ieu-a-1037 | Difference in height between childhood and adu... | -0.520875 | 0.080135 | 8.034037e-11 | FE IVW | Tophits | 0.71 |
| 519 | ieu-a-2 | Body mass index | ieu-a-1034 | Height | 0.356558 | 0.055004 | 9.023410e-11 | FE IVW | DF + HF | 0.78 |
| 520 | ieu-a-2 | Body mass index | ukb-a-294 | Wheeze or whistling in the chest in last year | 0.052605 | 0.008118 | 9.166369e-11 | Simple median | DF | 0.89 |


The dataframe above displays the results of our query. We requested all traits for which an MR analysis using body mass index as the exposure variable returned a causal estimate with a p-value lower than 1e-10. 522 such traits were found. Information about the exposure variable, the outcome variable, and the MR parameters is recorded in the columns whose names start with `exposure.`, `outcome.`, and `mr.`, respectively.
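Once the results are flattened into `exposure.`, `outcome.`, and `mr.` fields, they are straightforward to filter and rank. The sketch below works on four rows copied from the table above (reduced to three columns) and keeps those whose `mr.moescore` meets a cutoff, ordered by the absolute size of the estimate `mr.b`. The helper `top_estimates` is hypothetical and the cutoff of 0.9 is an arbitrary value chosen for illustration, not a recommended threshold.

```python
# Four rows copied from the table above, reduced to three columns
rows = [
    {"outcome.id": "ukb-a-74", "mr.b": 0.034559, "mr.moescore": 0.93},
    {"outcome.id": "ukb-a-388", "mr.b": 0.724105, "mr.moescore": 0.95},
    {"outcome.id": "ieu-a-1037", "mr.b": -0.520875, "mr.moescore": 0.71},
    {"outcome.id": "ieu-a-1034", "mr.b": 0.356558, "mr.moescore": 0.78},
]


def top_estimates(rows, min_moescore=0.9):
    """Keep rows at or above the score cutoff, largest absolute estimate first."""
    kept = [row for row in rows if row["mr.moescore"] >= min_moescore]
    return sorted(kept, key=lambda row: abs(row["mr.b"]), reverse=True)


ranked = top_estimates(rows)
for row in ranked:
    print(row["outcome.id"], row["mr.b"])
```

The same filter can be written on the dataframe itself with boolean indexing and `sort_values`.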