# PollyGraph : Answering biomedical questions with large biomedical knowledge graphs

<img src="images/logo-logo.jpg" alt="Drawing" style="width: 500px;"/>

### What is PollyGraph ?

PollyGraph, an ontology-driven knowledge graph, is a semantic model for biomedical molecular data.

In most organisations, big and small, information is spread across different systems in a variety of formats and contexts. As a result, users spend a lot of time manually searching for the data relevant to their work.

Polly, which is home to over 1.6 million biomedical molecular datasets, standardises both data and metadata. This effort has two major advantages. First, data standardisation using uniform file formats such as GCT, h5ad and VCF enables large-scale consumption of data for analysis. Second, metadata harmonisation lets users find data at scale without having to supply exact search keywords.

To make search even more powerful for scientists, we introduce PollyGraph to perform semantic searches using several biological concepts.

### What is Semantic Search ? 

Semantic search improves the accuracy of search results by enabling models to understand the meaning of concepts and the relationships between them, rather than only matching keywords.

### Why do we need semantics ? 

As the volume of data on Polly grows rapidly, more and more knowledge populates the platform. It therefore becomes increasingly challenging for machines to process and retrieve "relevant" information on our behalf. Humans can easily decide whether two or more entities are associated, but machines struggle and often fail to do so.

### The perks of semantic-search

- It's context-aware : It identifies the entities that are relevant to the task and retrieves the relevant datasets.
- It's extensively interlinked : It provides multiple references for an entity across multiple domain-specific concepts, and its knowledge grows significantly as new associations between entities are added.
- It's highly personalised : It can return results that are more closely aligned with the user's interests.

<img src="images/onto-text.png" alt="Drawing" style="width: 500px;"/>

## Ontology-driven Biomedical Knowledge Graphs

Semantic search gets this ability largely from publicly available biomedical ontologies (e.g. MeSH, BTO, CVCL, HGNC, GO), which are linked into a knowledge graph by mining associations between them.

By leveraging ontologies, semantic search is able to provide a suitable response even if the results don’t contain the exact wording of the query.
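
The idea can be illustrated with a minimal sketch that uses only the Python standard library. The synonym table, the dataset ids and the function names below are all hypothetical and serve only to contrast a literal keyword match with a lookup that first resolves the query through ontology synonyms; PollyGraph itself does this inside the graph database.

```python
# Hypothetical mini-ontology: canonical term -> known synonyms (illustrative only).
SYNONYMS = {
    "Prostatic Neoplasms": ["prostate cancer", "cancer of the prostate"],
    "Parkinson Disease": ["parkinson's disease", "paralysis agitans"],
}

# Hypothetical dataset annotations keyed by a made-up dataset id.
DATASETS = {
    "DS1": ["Prostatic Neoplasms"],
    "DS2": ["Parkinson Disease"],
}


def resolve_terms(query):
    """Return canonical terms whose name or synonyms contain the query text."""
    needle = query.lower()
    return [
        term
        for term, synonyms in SYNONYMS.items()
        if any(needle in label.lower() for label in [term, *synonyms])
    ]


def keyword_search(query):
    """Plain substring match against the dataset annotations."""
    needle = query.lower()
    return sorted(
        ds for ds, terms in DATASETS.items()
        if any(needle in term.lower() for term in terms)
    )


def semantic_search(query):
    """Match datasets annotated with any term the query resolves to."""
    wanted = set(resolve_terms(query))
    return sorted(ds for ds, terms in DATASETS.items() if wanted & set(terms))


print(keyword_search("prostate cancer"))   # no annotation contains the literal text
print(semantic_search("prostate cancer"))  # resolved through the synonym list
```

The keyword search finds nothing because the annotation reads "Prostatic Neoplasms", while the synonym-aware search still reaches the dataset.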

<img src="images/primary_model.jpg" alt="model" style="width: 800px;"/>

### 1. Connecting to PollyGraph

In [1]:
from neo4j import GraphDatabase
import pandas as pd
import tqdm
import os
import time
from polly.auth import Polly
from polly.omixatlas import OmixAtlas

# Read the Polly authentication token from the environment rather than hard-coding it.
# The variable name POLLY_REFRESH_TOKEN is an assumption; use whichever name holds your token.
AUTH_TOKEN = os.environ["POLLY_REFRESH_TOKEN"]
Polly.auth(AUTH_TOKEN)

# Define the OmixAtlas object
omixatlas = OmixAtlas()

In [42]:
class PollyGraph:
    
    def __init__(self, uri, user, pwd):
        self.__uri = uri
        self.__user = user
        self.__pwd = pwd
        self.__driver = None
        try:
            self.__driver = GraphDatabase.driver(self.__uri, auth=(self.__user, self.__pwd))
        except Exception as e:
            print("Failed to create the driver:", e)
        
    def close(self):
        if self.__driver is not None:
            self.__driver.close()
        
    def query(self, query, parameters=None, db=None):
        assert self.__driver is not None, "Driver not initialized!"
        session = None
        response = None
        try: 
            session = self.__driver.session(database=db) if db is not None else self.__driver.session() 
            response = list(session.run(query, parameters))
        except Exception as e:
            print("Query failed:", e)
        finally: 
            if session is not None:
                session.close()
        return response
    
def build_rel_query(field_name, col_name, meta_field, rel):
    # Build a Cypher query that links ontology nodes to dataset nodes.
    # For 'mentioned_in' the row value is a single (exploded) item, so it is
    # wrapped in a list before matching; otherwise the row value is matched as-is.
    if rel == 'mentioned_in':
        value = "[row." + col_name + "]"
    else:
        value = "row." + col_name

    q = """
        UNWIND $rows as row
        MATCH (n:ns0__""" + field_name + """ {ns0__""" + meta_field + """: """ + value + """})
        MATCH (d:dataset {src_dataset_id: row.src_dataset_id})
        MERGE (n)-[:""" + rel + """]->(d)
        RETURN count(*) as total
        """

    return q
    
def add_dataset(rows, batch_size=500):
    # Add dataset nodes to the Neo4j graph.
    query = '''
            UNWIND $rows AS row
            MERGE (c:dataset {
                dataset_id: row.dataset_id,
                src_dataset_id: row.src_dataset_id, 
                src_overall_design: row.src_overall_design,
                src_description: row.src_description,
                src_summary: row.src_summary,
                data_type: row.data_type,
                curated_cell_line: row.curated_cell_line,
                curated_cell_type: row.curated_cell_type,
                curated_disease: row.curated_disease,
                curated_drug: row.curated_drug,
                curated_gene: row.curated_gene,
                curated_tissue: row.curated_tissue,
                condition_column: row.condition_column,
                condition_control: row.condition_control,
                condition_perturbation: row.condition_perturbation
                })
            
            RETURN count(*) as total
            '''
    return insert_data(query, rows, batch_size)

def insert_data(query, rows, batch_size = 500):
    # Run the query against Neo4j in batches of `batch_size` rows.
    # Relies on the module-level `pollygraph` connection.

    total = 0
    batch = 0
    start = time.time()
    result = None

    while batch * batch_size < len(rows):
        res = pollygraph.query(query, 
                         parameters = {'rows': rows[batch*batch_size:(batch+1)*batch_size].to_dict('records')})
        # query() returns None when a query fails, so only add to the total on success
        if res:
            total += res[0]['total']
        batch += 1
        result = {"total":total, 
                  "batches":batch, 
                  "time":time.time()-start}

    return result

In [4]:
pollygraph = PollyGraph(uri="bolt://localhost:7687", user="neo4j", pwd="password")
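
Node labels, property keys and relationship types are concatenated into the Cypher text by `build_rel_query`, whereas the row data travels separately as the `$rows` parameter. A minimal sketch of how the interpolated parts could be validated first is shown below; `safe_identifier` and `build_rel_query_checked` are hypothetical helpers, not part of the notebook, and the rule that identifiers contain only letters, digits and underscores is an assumption about the naming used in this graph.

```python
import re

# Assumption: every label, property key and relationship type in the graph
# is a plain identifier made of letters, digits and underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def safe_identifier(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe Cypher identifier: {name!r}")
    return name


def build_rel_query_checked(field_name, col_name, meta_field, rel):
    """Hypothetical variant of build_rel_query with validated identifiers."""
    field_name, col_name, meta_field, rel = (
        safe_identifier(part.strip())
        for part in (field_name, col_name, meta_field, rel)
    )
    # For 'mentioned_in' the exploded value is wrapped in a list before matching.
    value = f"[row.{col_name}]" if rel == "mentioned_in" else f"row.{col_name}"
    return (
        "UNWIND $rows AS row\n"
        f"MATCH (n:ns0__{field_name} {{ns0__{meta_field}: {value}}})\n"
        "MATCH (d:dataset {src_dataset_id: row.src_dataset_id})\n"
        f"MERGE (n)-[:{rel}]->(d)\n"
        "RETURN count(*) AS total"
    )


print(build_rel_query_checked("gene", "curated_gene", "gene_symbol", "mentioned_in"))
```

Anything that is not a plain identifier raises a `ValueError` before the query text is assembled, so a malformed column name cannot change the shape of the query.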

### 2. Get dataset metadata from Polly

In [5]:
q = """SELECT 
              dataset_id,
              src_dataset_id, 
              src_overall_design,
              src_description,
              src_summary,
              data_type,
              curated_cell_line,
              curated_cell_type,
              curated_disease,
              curated_drug,
              curated_gene,
              curated_tissue,
              condition_column,
              condition_control,
              condition_perturbation
       FROM gdx_atlas.datasets"""

df = omixatlas.query_metadata(q, query_api_version='v2')
df.head()

Query execution succeeded (time taken: 2.81 seconds, data scanned: 2.382 MB)
Fetched 5120 rows


| | dataset_id | src_dataset_id | condition_column | condition_control | condition_perturbation | curated_cell_line | curated_cell_type | curated_disease | curated_drug | curated_gene | curated_tissue | data_type | src_description | src_overall_design | src_summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GSE100054_GPL23126-2022-06-21-07-47-56 | GSE100054_GPL23126 | kw_curated_disease | [Normal] | [Parkinson Disease] | [None] | [neuron, peripheral blood mononuclear cell] | [Parkinson Disease] | [None] | [AMBRA1, HDAC6, BECN1, ULK1, ATG5, ATG4B, SNCA... | [brain, blood] | Transcriptomics | Expression profiling of peripheral blood monon... | PBMC in 9 normal controls and 10 patients with... | Autophagy is a highly conserved degradation pa... |
| 1 | GSE10006_GPL570-2022-06-21-07-47-57 | GSE10006_GPL570 | kw_curated_disease | [Normal] | [Pulmonary Disease, Chronic Obstructive] | [None] | [secretory cell, respiratory epithelial cell] | [Bacterial Infections, Pulmonary Disease, Chro... | [None] | [ITLN1] | [None] | Transcriptomics | Decreased Expression of Intelectin 1 in The Hu... | Comparison of gene expression in airway epithe... | Lectins are proteins present on cell surfaces ... |
| 2 | GSE100095_GPL17586-2022-07-08-05-43-08 | GSE100095_GPL17586 | kw_curated_drug | [none] | [3',5'-cyclic AMP] | [BeWo] | [None] | [Choriocarcinoma] | [3',5'-cyclic AMP, progesterone] | [None] | [None] | Transcriptomics | Expression data from BeWo cells treated withou... | Human placenta choriocarcinoma cell line BeWo ... | Cyclic AMP activates two downstream factors, p... |
| 3 | GSE100194_GPL7202-2022-07-08-05-43-09 | GSE100194_GPL7202 | kw_curated_drug | [none] | [ketotifen] | [None] | [mast cell] | [Dengue, Flavivirus Infections] | [ketotifen] | [None] | [liver, spleen] | Transcriptomics | Transcriptional Profiling Confirms the Therape... | Gene expression in liver and spleen were measu... | In this study, we show that the host response ... |
| 4 | GSE100195_GPL7202-2022-07-08-05-43-10 | GSE100195_GPL7202 | kw_curated_drug | [none] | [ketotifen] | [None] | [mast cell] | [Seizures, Febrile, Hemorrhagic Disorders, Vas... | [ketotifen] | [None] | [spleen, liver] | Transcriptomics | Transcriptional Profiling Confirms the Therape... | Gene expression in liver and spleen were measu... | In this study, we show that the host response ... |


### 3. Integrate Dataset Metadata on Polly with KG

In [44]:
add_dataset(df)

{'total': 5120, 'batches': 11, 'time': 5.165494680404663}
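
The batch count reported above follows directly from the default batch size of 500: 5120 rows fill ten full batches and leave 120 rows for an eleventh. A minimal sketch of the same slicing on a plain list is given below; `iter_batches` is a hypothetical helper and the records are placeholders, not Polly metadata.

```python
import math


def iter_batches(records, batch_size=500):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# 5120 placeholder records, matching the row count fetched from Polly above.
records = [{"dataset_id": f"placeholder-{i}"} for i in range(5120)]
batches = list(iter_batches(records))

print(len(batches))           # 11 round trips to the database
print(len(batches[-1]))       # 120 rows in the final, partial batch
print(math.ceil(5120 / 500))  # 11, the same count computed directly
```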

In [None]:
# Add relationships from each dataset to the biological entities it mentions
curated_columns = ['curated_cell_line', 'curated_cell_type',
                  'curated_drug', 'curated_tissue', 'curated_gene']
pert_col = 'condition_perturbation'

for i in tqdm.tqdm(range(len(df))):
    # Keep the row as a one-row DataFrame so it can be passed to insert_data
    row = df.iloc[i:i+1]
    condition = row['condition_column'].iloc[0]
    
    # Link the perturbed disease or drug to the dataset
    if condition in ('kw_curated_disease', 'kw_curated_drug'):
        pert_field = condition.split('_')[-1]
        query = build_rel_query(pert_field, pert_col, "name", "is_perturbed_in")
        insert_data(query, row)
    
    for col in curated_columns:
        item = row[col].iloc[0]
        # Skip columns whose only value is the placeholder "None"
        if "".join(item).lower() == "none":
            continue
        field_name = col.replace('curated_', '')
        # Genes are matched on their symbol, every other entity on its name
        meta_field = "gene_symbol" if col == 'curated_gene' else "name"
        query = build_rel_query(field_name, col, meta_field, "mentioned_in")
        # Explode the list so that each entity gets its own record
        insert_data(query, row.explode(col))
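
To make the explode step concrete, here is a minimal standard-library sketch of what happens to one metadata record before it is sent to the graph. `explode_record` and `is_placeholder` are hypothetical helpers that only approximate the pandas calls used in the notebook for a non-empty list, and the sketch assumes that the missing-value placeholder is stored as the string "None".

```python
def explode_record(record, column):
    """Return one copy of record per value in record[column] (a list)."""
    return [{**record, column: value} for value in record[column]]


def is_placeholder(values):
    """Mirror the notebook's check: the joined values spell 'none'."""
    return "".join(values).lower() == "none"


# Values taken from the first row of the metadata preview; the drug
# placeholder is assumed to be the string "None".
record = {
    "src_dataset_id": "GSE100054_GPL23126",
    "curated_tissue": ["brain", "blood"],
    "curated_drug": ["None"],
}

for column in ("curated_tissue", "curated_drug"):
    if is_placeholder(record[column]):
        continue  # nothing to link for this column
    for row in explode_record(record, column):
        # Each exploded row would produce one relationship to the dataset
        print(column, "->", row[column], "mentioned_in", row["src_dataset_id"])
```

Each tissue ends up in its own record, which is why the `mentioned_in` query wraps the single value back into a list before matching it against the node property.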

## Context-driven search examples using PollyGraph

### Q1: List genes in the TGF-β receptor signaling pathway that are differentially expressed in prostate cancer

In [7]:
query_string = '''
MATCH (n:ns0__pathway)--(m:ns0__disease)--(p:ns0__gene)
WHERE ANY (x in n.ns0__name WHERE toLower(x) CONTAINS("transforming growth factor beta receptor signaling")) AND
ANY (y in m.ns0__name+m.ns0__synonyms WHERE toLower(y) CONTAINS("prostate cancer"))
RETURN p.ns0__gene_symbol as gene, n.ns0__name as pathway, m.ns0__name as disease;
'''
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query_string)])
top_cat_df

| | gene | pathway | disease |
|---|---|---|---|
| 0 | [ALKBH5] | [Negative regulation of transforming growth fa... | [Prostatic Neoplasms] |
| 1 | [TSPY26P] | [Negative regulation of transforming growth fa... | [Prostatic Neoplasms] |
| 2 | [METTL1] | [Negative regulation of transforming growth fa... | [Prostatic Neoplasms] |
| 3 | [STIL] | [Negative regulation of transforming growth fa... | [Prostatic Neoplasms] |
| 4 | [ROR2] | [Negative regulation of transforming growth fa... | [Prostatic Neoplasms] |
| ... | ... | ... | ... |
| 821 | [TRIM24] | [Transforming growth factor beta receptor sign... | [Prostatic Neoplasms] |
| 822 | [DMBT1] | [Transforming growth factor beta receptor sign... | [Prostatic Neoplasms] |
| 823 | [ARG1] | [Transforming growth factor beta receptor sign... | [Prostatic Neoplasms] |
| 824 | [DUS3L] | [Transforming growth factor beta receptor sign... | [Prostatic Neoplasms] |
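
The search terms in this query are written directly into the Cypher text. Since `PollyGraph.query` accepts a `parameters` argument, the same search could be expressed with `$pathway` and `$disease` placeholders so that it can be reused for other pathways and diseases. The sketch below only builds the query text and the parameter dictionary; the helper names are hypothetical and nothing is executed against a database.

```python
def pathway_disease_query():
    """Cypher text with $pathway and $disease placeholders instead of literals."""
    return (
        "MATCH (n:ns0__pathway)--(m:ns0__disease)--(p:ns0__gene)\n"
        "WHERE ANY (x IN n.ns0__name WHERE toLower(x) CONTAINS $pathway) AND\n"
        "      ANY (y IN m.ns0__name + m.ns0__synonyms WHERE toLower(y) CONTAINS $disease)\n"
        "RETURN p.ns0__gene_symbol AS gene, n.ns0__name AS pathway, m.ns0__name AS disease"
    )


def search_parameters(pathway, disease):
    """Lower-case the search terms so they compare against toLower(...)."""
    return {"pathway": pathway.strip().lower(), "disease": disease.strip().lower()}


params = search_parameters(
    "Transforming growth factor beta receptor signaling", "Prostate Cancer"
)
print(pathway_disease_query())
print(params)

# With a live connection this could be run as (not executed here):
# records = pollygraph.query(pathway_disease_query(), parameters=params)
```

Because `CONTAINS` is a substring match, pathways such as "Negative regulation of transforming growth factor beta receptor signaling" are returned as well, as the first rows of the result show.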


### Q2: Find datasets related to Brain Injury where the BRD1 gene is regulated

In [8]:
query_string = '''
MATCH (n:ns0__disease) WHERE ANY (x in n.ns0__name + n.ns0__synonyms where toLower(x) CONTAINS("brain injury")) 
WITH n.ns0__name as disease
UNWIND disease as d
MATCH (m:ns0__gene {ns0__gene_symbol : ['BRD1']})--(n:ns0__disease {ns0__name : [d]})--(p: dataset)
RETURN p.src_dataset_id as src_dataset_id;
'''
top_cat_df = pd.DataFrame([dict(_) for _ in pollygraph.query(query_string)])
top_cat_df

| | src_dataset_id |
|---|---|
| 0 | GSE68207_GPL18694 |
| 1 | GSE92363_GPL14746 |
| 2 | GSE45997_GPL1355 |
| 3 | GSE44625_GPL6885 |
| 4 | GSE41345_GPL6885 |
| 5 | GSE39759_GPL7202 |
| 6 | GSE58613_GPL571 |
| 7 | GSE116980_GPL22396 |
| 8 | GSE111452_GPL15084 |
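
The pattern in this query walks two hops, from a gene to a disease and from that disease to a dataset, ignoring edge direction. A minimal in-memory sketch of the same traversal is shown below. The graph, the node names and the helper functions are entirely hypothetical and are not taken from PollyGraph; they only illustrate which paths such a pattern matches.

```python
# Hypothetical toy graph: undirected edges between (label, name) nodes.
EDGES = [
    (("gene", "GENE_A"), ("disease", "Disease X")),
    (("gene", "GENE_B"), ("disease", "Disease Y")),
    (("disease", "Disease X"), ("dataset", "DS1")),
    (("disease", "Disease X"), ("dataset", "DS2")),
    (("disease", "Disease Y"), ("dataset", "DS3")),
]


def neighbours(node):
    """Nodes that share an edge with node, ignoring direction like `--` in Cypher."""
    found = set()
    for a, b in EDGES:
        if a == node:
            found.add(b)
        elif b == node:
            found.add(a)
    return found


def datasets_for(gene, disease_text):
    """Follow gene -- disease -- dataset for diseases whose name contains the text."""
    datasets = set()
    for label, name in neighbours(("gene", gene)):
        if label == "disease" and disease_text.lower() in name.lower():
            datasets.update(
                other_name
                for other_label, other_name in neighbours((label, name))
                if other_label == "dataset"
            )
    return sorted(datasets)


print(datasets_for("GENE_A", "disease x"))
print(datasets_for("GENE_B", "disease x"))
```

Only datasets reachable through a disease that is linked to the gene are returned, so the second call yields an empty list.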
