# PollyGraph : Answering biomedical questions with large biomedical knowledge graphs

<img src="images/logo-logo.jpg" alt="Drawing" style="width: 500px;"/>

### What is PollyGraph ?

PollyGraph, an ontology-driven knowledge graph, is a semantic model for biomedical molecular data.

In most organisations, big and small, information is spread across different systems in a variety of formats and contexts. As a result, users spend a lot of time manually searching for the data relevant to their work.

Polly, which is home to over 1.6 million biomedical molecular datasets, is standardised for both data and metadata. This effort has two major advantages. First, data standardisation using uniform file formats such as GCT, h5ad and VCF enables large-scale consumption of data for analysis. Second, metadata harmonisation lets users find data at scale without having to supply exact search keywords.

To make search even more powerful for scientists, we introduce PollyGraph to perform semantic searches using several biological concepts.

### What is Semantic Search ? 

Semantic search is an advanced technology for optimizing the accuracy of our search results by enabling models to understand the meaning of concepts and the relationships between them. 

### Why do we need semantics ? 

As the data on Polly is rapidly growing in number, more and more knowledge populates the platform. Therefore, it becomes increasingly challenging for machines to process and retrieve "relevant" information on our behalf. Though it is easy for humans to decide whether two or more entities are associated, machines struggle and often fail to do it. 

### The perks of semantic-search

- It's context-aware: It identifies the entities that are relevant to the task and retrieves relevant datasets.
- It's extensively interlinked: It provides multiple references for an entity across multiple domain-specific concepts, and knowledge grows significantly as new associations between entities are added.
- It's highly personalised: It is capable of returning results that are more closely aligned with the user's interests.

<img src="images/onto-text.png" alt="Drawing" style="width: 500px;"/>

## Ontology-driven Biomedical Knowledge Graphs

One of the main mechanisms behind semantic search is the use of publicly available biomedical ontologies, organised as a knowledge graph, to return more meaningful results. The knowledge graph is built by linking biomedical ontologies (e.g. MeSH, BTO, CVCL, HGNC, GO) and mining the associations between them.

By leveraging ontologies, semantic search is able to provide a suitable response even if the results don’t contain the exact wording of the query.
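
To illustrate why synonyms stored on ontology nodes let a query succeed without exact wording, here is a minimal pure-Python sketch. The node dictionaries and their synonym lists are hypothetical stand-ins for a disease node's `ns0__name` and `ns0__synonyms` properties; this is not PollyGraph's implementation, only the matching idea.

```python
# Hypothetical ontology nodes: each has a preferred name and a list of synonyms.
nodes = [
    {"name": ["Prostatic Neoplasms"], "synonyms": ["Prostate Cancer", "Cancer of the Prostate"]},
    {"name": ["Lung Neoplasms"], "synonyms": ["Lung Cancer"]},
]

def match_nodes(term, nodes):
    """Return nodes whose name or any synonym contains the term (case-insensitive)."""
    needle = term.lower()
    return [
        node for node in nodes
        if any(needle in label.lower() for label in node["name"] + node["synonyms"])
    ]

hits = match_nodes("prostate cancer", nodes)
print([node["name"][0] for node in hits])  # ['Prostatic Neoplasms']
```

The query text never appears in the preferred name, yet the node is found through one of its synonyms.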

<img src="images/primary_model.jpg" alt="model" style="width: 800px;"/>

### 1. Connecting to PollyGraph

In [2]:
pip install neo4j



In [1]:
from neo4j import GraphDatabase
import pandas as pd
import tqdm
import os
import time
from polly.auth import Polly
from polly.omixatlas import OmixAtlas


# Authenticate with a Polly refresh token. Do not hard-code the token in a shared notebook;
# here it is read from an environment variable (the name POLLY_REFRESH_TOKEN is an assumption).
Polly.auth(os.environ["POLLY_REFRESH_TOKEN"])

#Defining omixatlas object
omixatlas = OmixAtlas()

In [2]:
class PollyGraph:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response
    
def build_rel_query(field_name, col_name, meta_field, rel):
    
    if rel == 'mentioned_in':
        q = """
            UNWIND $rows as row
            MATCH (n:ns0__"""+field_name +"""{ns0__"""+meta_field+""": [row."""+col_name+"""]})
            MATCH (d:dataset {src_dataset_id: row.src_dataset_id})
            MERGE (n)-[:"""+rel+"""]->(d)
            RETURN count(*) as total
            """
    else:
        q = """
            UNWIND $rows as row
            MATCH (n:ns0__"""+field_name +"""{ns0__"""+meta_field+""": row."""+col_name+"""})
            MATCH (d:dataset {src_dataset_id: row.src_dataset_id})
            MERGE (n)-[:"""+rel+"""]->(d)
            RETURN count(*) as total
            """
        
    return(q)
    
def add_dataset(rows, batch_size=500):
    # Adds dataset nodes to the Neo4j graph.
    query = '''
            UNWIND $rows AS row
            MERGE (c:dataset {
                dataset_id: row.dataset_id,
                src_dataset_id: row.src_dataset_id, 
                src_overall_design: row.src_overall_design,
                src_description: row.src_description,
                src_summary: row.src_summary,
                data_type: row.data_type,
                curated_cell_line: row.curated_cell_line,
                curated_cell_type: row.curated_cell_type,
                curated_disease: row.curated_disease,
                curated_drug: row.curated_drug,
                curated_gene: row.curated_gene,
                curated_tissue: row.curated_tissue,
                condition_column: row.condition_column,
                condition_control: row.condition_control,
                condition_perturbation: row.condition_perturbation
                })
            
            RETURN count(*) as total
            '''
    return insert_data(query, rows, batch_size)

def insert_data(query, rows, batch_size = 500):
    # Runs the query against Neo4j in batches of batch_size rows (uses the global pollygraph connection).

    total = 0
    batch = 0
    start = time.time()
    result = None

    while batch * batch_size < len(rows):
        res = pollygraph.query(query, 
                         parameters = {'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        total += res[0]['total']
        batch += 1
        result = {"total":total, 
                  "batches":batch, 
                  "time":time.time()-start}
        #print(result)

    return result

In [3]:
pollygraph = PollyGraph(uri="bolt://localhost:7687", user="neo4j", pwd="password")

### 2. Get dataset metadata from Polly

In [8]:
q = """SELECT 
              dataset_id,
              src_dataset_id, 
              src_overall_design,
              src_description,
              src_summary,
              data_type,
              curated_cell_line,
              curated_cell_type,
              curated_disease,
              curated_drug,
              curated_gene,
              curated_tissue,
              condition_column,
              condition_control,
              condition_perturbation
       FROM gdx_atlas.datasets"""

df = omixatlas.query_metadata(q, query_api_version='v2')
df.head()

Query execution succeeded (time taken: 2.37 seconds, data scanned: 2.382 MB)
Fetched 5120 rows


Unnamed: 0,dataset_id,src_dataset_id,src_overall_design,src_description,src_summary,data_type,curated_cell_line,curated_cell_type,curated_disease,curated_drug,curated_gene,curated_tissue,condition_column,condition_control,condition_perturbation
0,GSE25014_GPL570-2022-07-08-05-55-59,GSE25014_GPL570,Human pulmonary artery endothelial cells (PAEC...,Gene expression data of endothelium exposed to...,Sickle cell disease is characterized by hemoly...,Transcriptomics,[None],"[lung endothelial cell, microvascular endothe...","[Atherosclerosis, Systemic carnitine deficien...",[heme],[None],[None],kw_curated_drug,[none],[heme]
1,GSE25088_GPL1261-2022-07-08-05-56-00,GSE25088_GPL1261,3 C57Bl/6 wild-type and 3 STAT6 KO mice were u...,PPARg and IL-4-induced gene expression data fr...,C57Bl/6 wild-type and STAT6 KO mice were used ...,Transcriptomics,[None],[macrophage],[Normal],"[rosiglitazone, dimethyl sulfoxide, ethanol]","[STAT6, PPARG, IL4, FABP4, PPARA]",[bone marrow],kw_curated_drug,[none],[rosiglitazone]
2,GSE25098_GPL8321-2022-06-21-08-26-43,GSE25098_GPL8321,Contains rhabdomyosarcomas derived in vivo usi...,Rhabdomyosarcoma can be initiated in activated...,Microarray analysis of rhabdomyosarcomas gener...,Transcriptomics,[None],[progenitor cell],[Rhabdomyosarcoma],[None],"[TP53, KRAS, PAX7]",[muscle],kw_curated_disease,[Normal],[Rhabdomyosarcoma]
3,GSE25101_GPL6947-2022-06-21-08-26-44,GSE25101_GPL6947,RNA was extracted from whole blood using PAXGe...,Expression profiling in whole blood in ankylos...,Introduction: A number of genetic-association ...,Transcriptomics,[None],[None],"[Spondylitis, Ankylosing]",[None],"[EP300, SPOCK2]",[blood],kw_curated_disease,[Normal],"[Spondylitis, Ankylosing]"
4,GSE25123_GPL1261-2022-07-08-05-56-01,GSE25123_GPL1261,3 PPARg +/- LysCre and 3 PPARg fl/- LysCre mic...,PPARg and IL-4-induced gene expression data fr...,Conditional macrophage-specific PPARg knockout...,Transcriptomics,[None],[macrophage],[Normal],"[rosiglitazone, dimethyl sulfoxide, ethanol]","[STAT6, PPARG, IL4, FABP4, PPARA]",[bone marrow],kw_curated_drug,[none],[rosiglitazone]


### 3. Integrate Dataset Metadata on Polly with KG

In [9]:
add_dataset(df)

{'total': 5120, 'batches': 11, 'time': 11.312557220458984}
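
The batch count in the output above follows from the slicing in `insert_data`: 5120 rows in chunks of 500 give ten full batches plus one final batch of 120 rows. A minimal sketch of the same arithmetic, where `batch_bounds` is a hypothetical helper that is not part of the notebook:

```python
import math

def batch_bounds(n_rows, batch_size=500):
    """Yield (start, stop) index pairs covering n_rows in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

bounds = list(batch_bounds(5120, 500))
print(len(bounds))   # 11
print(bounds[-1])    # (5000, 5120)

# The number of batches is the ceiling of n_rows / batch_size.
assert len(bounds) == math.ceil(5120 / 500)
```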

In [10]:
# Adding relationships for each dataset to biological entities
curated_columns = ['curated_cell_line', 'curated_cell_type',
                  'curated_drug', 'curated_tissue', 'curated_gene']

for index, row in tqdm.tqdm(df.iterrows()):
    index += 1
    row = df[index-1:index]
    pert_col = 'condition_perturbation'
    
    if row['condition_column'][index-1] == 'kw_curated_disease' or row['condition_column'][index-1] == 'kw_curated_drug':
        pert_field = row['condition_column'][index-1].split('_')[-1]+' '
        query = build_rel_query(pert_field, pert_col, "name", "is_perturbed_in")
        insert_data(query, row)
    
    for col in curated_columns:
        field_name = col.replace('curated_', '')
        item = row[col][index-1]
        row_df = row.explode(col)
        if col == 'curated_gene' and "".join(item).lower() != "none":
            query = build_rel_query(field_name, col, "gene_symbol", "mentioned_in")
            insert_data(query, row_df)
            
        elif col != 'curated_gene' and "".join(item).lower() != "none":
            query = build_rel_query(field_name, col, "name", "mentioned_in")
            insert_data(query, row_df)
            
        else:
            continue

5120it [10:47,  7.91it/s]
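
The loop above calls `row.explode(col)` so that a dataset whose curated field holds several values produces one row per value, and each row can then be matched to its own node. A small pure-Python sketch of that reshaping follows. The `explode_rows` helper is hypothetical and only mimics what pandas does for a list-valued column; the example record reuses values from the metadata preview in step 2.

```python
def explode_rows(record, column):
    """Return one copy of record per element of the list stored in record[column]."""
    return [{**record, column: value} for value in record[column]]

record = {
    "src_dataset_id": "GSE25098_GPL8321",
    "curated_gene": ["TP53", "KRAS", "PAX7"],
}

# Each resulting row carries the same dataset ID and a single gene symbol.
for row in explode_rows(record, "curated_gene"):
    print(row)
```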


## Context-driven search examples using PollyGraph

### Q1: List genes in the TGF-β receptor signaling pathway that are differentially expressed in Prostate Cancer

In [4]:
query_string = '''
MATCH (n:ns0__pathway)--(m:ns0__disease)--(p:ns0__gene)
WHERE ANY (x in n.ns0__name WHERE toLower(x) CONTAINS("transforming growth factor beta receptor signaling")) AND
ANY (y in m.ns0__name+m.ns0__synonyms WHERE toLower(y) CONTAINS("prostate cancer"))
RETURN p.ns0__gene_symbol as gene, n.ns0__name as pathway, m.ns0__name as disease;
'''
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query_string)])
top_cat_df

Unnamed: 0,gene,pathway,disease
0,[ALKBH5],[Negative regulation of transforming growth fa...,[Prostatic Neoplasms]
1,[TSPY26P],[Negative regulation of transforming growth fa...,[Prostatic Neoplasms]
2,[METTL1],[Negative regulation of transforming growth fa...,[Prostatic Neoplasms]
3,[STIL],[Negative regulation of transforming growth fa...,[Prostatic Neoplasms]
4,[ROR2],[Negative regulation of transforming growth fa...,[Prostatic Neoplasms]
...,...,...,...
821,[TRIM24],[Transforming growth factor beta receptor sign...,[Prostatic Neoplasms]
822,[DMBT1],[Transforming growth factor beta receptor sign...,[Prostatic Neoplasms]
823,[ARG1],[Transforming growth factor beta receptor sign...,[Prostatic Neoplasms]
824,[DUS3L],[Transforming growth factor beta receptor sign...,[Prostatic Neoplasms]


### Q2: Find datasets related to Brain Injury where the BRD1 gene is regulated

In [5]:
query_string = '''
MATCH (n:ns0__disease) WHERE ANY (x in n.ns0__name + n.ns0__synonyms where toLower(x) CONTAINS("brain injury")) 
WITH n.ns0__name as disease
UNWIND disease as d
MATCH (m:ns0__gene {ns0__gene_symbol : ['BRD1']})--(n:ns0__disease {ns0__name : [d]})--(p: dataset)
RETURN p.src_dataset_id as src_dataset_id;
'''
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query_string)])
top_cat_df

### Q3: Fetch datasets with disease as Parkinson's disease and gene as LRRK2

In [15]:
query = """ MATCH (n:ns0__disease) WHERE ANY (x in n.ns0__name + n.ns0__synonyms where toLower(x) CONTAINS("parkinson")) 
WITH n.ns0__name as disease
UNWIND disease as d 
MATCH (m:ns0__gene {ns0__gene_symbol : ['LRRK2']})--(p: dataset)--(n:ns0__disease {ns0__name : [d]})
RETURN p.src_dataset_id as src_dataset_id,n.ns0__name as disease,m.ns0__gene_symbol as gene; 
"""
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df

### Q4: Find datasets using a term

In [13]:
def find_datasets_using_term(node_type,node_property,term):
    node_label = "ns0__"+node_type
    properties = ""
    for i in node_property:
        prop_label = "ns0__"+i
        properties = properties +"n."+prop_label+"+ "
    properties = properties[0:len(properties)-2]+" "
    query = f"""MATCH (p: dataset)--(n:{node_label}) WHERE 
    ANY (x in {properties}where toLower(x) CONTAINS("{term.lower()}"))
    RETURN p.src_dataset_id as src_dataset_id,n.ns0__name as term;"""
    return query 

top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(find_datasets_using_term('tissue',['name','synonyms'],'skin'))])
top_cat_df

Unnamed: 0,src_dataset_id,term
0,GSE26934_GPL6480,[skin]
1,GSE27041_GPL570,[skin]
2,GSE27349_GPL11093,[skin fibroblast]
3,GSE27355_GPL11094,[skin]
4,GSE27628_GPL1261,[skin]
...,...,...
156,GSE39612_GPL570,[skin]
157,GSE41524_GPL5175,[skin]
158,GSE43996_GPL571,[skin fibroblast]
159,GSE44327_GPL6244,[skin]
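
`find_datasets_using_term` splices the search term directly into the Cypher text, which breaks on terms that contain quotes. Since `PollyGraph.query` already accepts a `parameters` argument, one alternative is to pass the term as a `$term` parameter. The sketch below is a variant under stated assumptions, not the notebook's method: `build_term_query` is a hypothetical name, and because labels and property names cannot be parameterised in Cypher, they are checked against a simple identifier pattern instead.

```python
import re

# Labels and property names are interpolated, so restrict them to plain identifiers.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_term_query(node_type, node_properties, term):
    """Return (cypher, params) for a case-insensitive substring search on a node type."""
    for name in [node_type, *node_properties]:
        if not _IDENT.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # Concatenate the list-valued properties, e.g. n.ns0__name + n.ns0__synonyms
    props = " + ".join(f"n.ns0__{p}" for p in node_properties)
    cypher = (
        f"MATCH (p:dataset)--(n:ns0__{node_type}) "
        f"WHERE ANY (x IN {props} WHERE toLower(x) CONTAINS $term) "
        "RETURN p.src_dataset_id AS src_dataset_id, n.ns0__name AS term"
    )
    return cypher, {"term": term.lower()}

cypher, params = build_term_query("tissue", ["name", "synonyms"], "Skin")
print(cypher)
print(params)
```

The pair can then be passed on as `pollygraph.query(cypher, parameters=params)`.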


In [14]:
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(find_datasets_using_term('disease',['name','synonyms'],'neoplasm'))])
top_cat_df

Unnamed: 0,src_dataset_id,term
0,GSE25140_GPL1261,[Prostatic Neoplasms]
1,GSE25251_GPL570,[Lung Neoplasms]
2,GSE25407_GPL570,[Breast Neoplasms]
3,GSE25835_GPL3921,[Neoplasms]
4,GSE25858_GPL7202,[Neoplasms]
...,...,...
437,GSE44705_GPL6246,[Neoplasms]
438,GSE44707_GPL6246,[Neoplasms]
439,GSE44740_GPL570,[Neoplasms]
440,GSE44971_GPL570,[Neoplasms]


### Q5: Find datasets for related terms

In [28]:
def find_datasets_using_related_terms(node_type,node_property,term):
    node_label = "ns0__"+node_type
    properties = ""
    for i in node_property:
        prop_label = "ns0__"+i
        properties = properties +"n."+prop_label+"+ "
    properties = properties[0:len(properties)-2]+" "
    relation = "ns0__is_a_"+node_type
    query = f"""MATCH (n:{node_label})-[:{relation}]-(m:{node_label}) 
    WHERE ANY (x in {properties}WHERE tolower(x) CONTAINS('{term.lower()}'))
    WITH {properties}as terms
    UNWIND terms as t 
    MATCH (p:dataset)--(n:{node_label} """+"{ns0__name: [t]}"+""") 
    RETURN p.src_dataset_id as dataset_id,n.ns0__name AS name;"""
    return query 

top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(find_datasets_using_related_terms('disease',['name','synonyms'],'cancer'))])
top_cat_df

Unnamed: 0,dataset_id,name
0,GSE36091_GPL1261,[Colorectal Neoplasms]
1,GSE97689_GPL6244,[Colorectal Neoplasms]
2,GSE84650_GPL17400,[Colorectal Neoplasms]
3,GSE90524_GPL16956,[Colorectal Neoplasms]
4,GSE93821_GPL18635,[Colorectal Neoplasms]
...,...,...
4113,GSE6004_GPL570,[Thyroid Neoplasms]
4114,GSE6004_GPL570,[Thyroid Neoplasms]
4115,GSE6004_GPL570,[Thyroid Neoplasms]
4116,GSE6004_GPL570,[Thyroid Neoplasms]


In [29]:
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(find_datasets_using_related_terms('cell_type',
                                                                                               ['name','synonyms'],'fibroblast'))])
top_cat_df

Unnamed: 0,dataset_id,name
0,GSE70818_GPL570,[fibroblast]
1,GSE46240_GPL9185,[fibroblast]
2,GSE45516_GPL570,[fibroblast]
3,GSE44429_GPL10904,[fibroblast]
4,GSE43894_GPL6885,[fibroblast]
...,...,...
1376,GSE35034_GPL14550,[skin fibroblast]
1377,GSE34308_GPL570,[skin fibroblast]
1378,GSE32502_GPL5175,[skin fibroblast]
1379,GSE28300_GPL4133,[skin fibroblast]


### Q5: Fetch datasets with cell lines obtained from samples of a particular disease 

In [33]:
query = """MATCH (p:dataset)--(y:ns0__cell_line)-[r:ns0__obtained_from_sample_with_disease]-(n:ns0__disease)
WHERE ANY (x in n.ns0__synonyms + n.ns0__name where toLower(x) CONTAINS("sarcoma")) 
RETURN p.src_dataset_id as dataset_id,y.ns0__name as cell_line;""" 

top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df

### Q6: Fetch drugs that downregulate the gene TP53

In [52]:
query = """MATCH (n:ns0__drug)-[r:ns0__drug_downregulates]-(m:ns0__gene)
WHERE ANY (x in m.ns0__gene_symbol + m.ns0__alias_symbol where x CONTAINS('TP53'))
RETURN n.ns0__name AS drug;"""
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df

### Q7: Fetch other drugs that target the same genes as the drug colistin

In [50]:
query = """MATCH (p:ns0__drug)-[r1]-(q:ns0__gene)-[r2]-(r:ns0__drug) 
WHERE ANY(x IN p.ns0__name + p.ns0__synonyms WHERE tolower(x) CONTAINS("colistin")) 
RETURN p.ns0__name as drug1,type(r1) as drug1_int,q.ns0__gene_symbol as gene,type(r2) as drug2_int,r.ns0__name as drug2;"""
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df

### Q8:  Datasets for a pathway (with a p-value)

In [46]:
#Will return a list for both
#assign into new dataframe and filter using desired p-value cut off
query = """MATCH (n:ns0__pathway) 
WHERE ANY (x in n.ns0__name WHERE toLower(x) CONTAINS("regulation of ras protein signal transduction")) 
RETURN n.ns0__datasets AS dataset_id,n.ns0__pval as p_value;""" 
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df = top_cat_df.explode('dataset_id')
top_cat_df.set_index(['dataset_id']).apply(pd.Series.explode).reset_index()
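
The comments in the cell above suggest filtering the result by a p-value cut-off. A minimal sketch of that step in plain Python, assuming that `ns0__datasets` and `ns0__pval` are parallel lists of equal length. The dataset IDs and p-values below are made-up placeholders, and `filter_by_pvalue` is a hypothetical helper:

```python
def filter_by_pvalue(dataset_ids, p_values, cutoff=0.05):
    """Pair each dataset with its p-value and keep those below the cutoff."""
    if len(dataset_ids) != len(p_values):
        raise ValueError("dataset_ids and p_values must have the same length")
    return [(d, p) for d, p in zip(dataset_ids, p_values) if p < cutoff]

# Placeholder values for illustration only, not real query output.
datasets = ["DATASET_A", "DATASET_B", "DATASET_C"]
pvals = [0.001, 0.2, 0.04]

print(filter_by_pvalue(datasets, pvals))  # [('DATASET_A', 0.001), ('DATASET_C', 0.04)]
```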

### Q9: Fetch sarcoma datasets for drugs that upregulate genes downregulated in sarcoma

In [49]:
query = """MATCH (p:ns0__disease)-[:ns0__downregulates]-(q:ns0__gene)-[:ns0__drug_upregulates]-(r:ns0__drug)
WHERE ANY(x IN p.ns0__name + p.ns0__synonyms WHERE tolower(x) CONTAINS('sarcoma')) 
WITH r.ns0__name AS drug, p.ns0__name AS disease 
UNWIND drug as dr
UNWIND disease as d
MATCH (p:ns0__disease {ns0__name: [d]})--(n:dataset)--(r:ns0__drug {ns0__name: [dr]})
RETURN p.ns0__name as disease,n.src_dataset_id as dataset_id,r.ns0__name as drug_upreg_gene"""
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df

### Q10: Query to find types of relations between two node types

Node Types
- ns0__gene
- ns0__pathway
- ns0__cell_type
- ns0__tissue,ns0__cell_type
- ns0__disease
- ns0__cell_line
- ns0__tissue
- ns0__drug

In [36]:
query = """MATCH (n:ns0__disease)-[r]-(m:ns0__gene)
RETURN distinct type(r) as relation_type;"""
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query)])
top_cat_df
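
To survey relation types across the whole graph rather than one pair at a time, the same query can be generated for every pair of node types. A hedged sketch: `relation_type_queries` is a hypothetical helper, and the label list simply copies the single-label entries from the node type list above.

```python
from itertools import combinations

NODE_TYPES = [
    "ns0__gene", "ns0__pathway", "ns0__cell_type", "ns0__disease",
    "ns0__cell_line", "ns0__tissue", "ns0__drug",
]

def relation_type_queries(node_types):
    """Map each unordered pair of labels to a Cypher query listing its relation types."""
    return {
        (a, b): f"MATCH (n:{a})-[r]-(m:{b}) RETURN DISTINCT type(r) AS relation_type;"
        for a, b in combinations(node_types, 2)
    }

queries = relation_type_queries(NODE_TYPES)
print(len(queries))  # 21 pairs from 7 labels
print(queries[("ns0__gene", "ns0__disease")])
```

Each generated string can be passed to `pollygraph.query` in the same way as the single-pair query above.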