# Cypher Queries for Determining Regulatory Paths
*Núria Queralt Rosinach, Andrew Su*

**Freeze Lab meeting, January 2019**

## Overview
NGLY1 - AQP1 **regulatory review** (*NGLY1 v3.2*)

## Servers 

* Local: bolt://kylo.scripps.edu:7690

### Imports

In [1]:
from neo4j.v1 import GraphDatabase, basic_auth
import pandas as pd
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', -1)  # or 199

### Functions

In [2]:
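def runQuery( driver, query ):
    '''
    This function runs the query against the database and returns the result.
        in: neo4j driver object, cypher query string
        out: list of neo4j record objects
    '''
    
    with driver.session() as session:
        # materialize the records while the session is still open,
        # so the caller can iterate over them after it closes
        result = list(session.run(query))
        
    return result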


def parseNode( node ):
    '''
    This function parses the information gathered in the node data structure object resulting after querying neo4j.
        in: node record neo4j object
        out: node as dict
    '''
    
    n = dict()
    n["idx"] = int(node.id)
    n["type"] = list(node.labels)[0]
    n["id"] = str(node.properties['id'])
    n["preflabel"] = str(node.properties['preflabel'])
    n["name"] = str(node.properties['name'])
    n["description"] = str(node.properties['description'])

    return n


def parsePath( path ):
    '''
    This function parses the information gathered in the path data structure object resulting after querying neo4j.
        in: path record neo4j object
        out: path as dict
    '''
    
    out = {}
    out['Nodes'] = []
    for node in path['path'].nodes:
        n = {}
        n['idx'] = int(node.id)
        n['type'] = list(node.labels)[0]
        n['id'] = str(node.properties['id'])
        n['preflabel'] = str(node.properties['preflabel'])
        n['name'] = str(node.properties['name'])
        n['description'] = str(node.properties['description'])
        out['Nodes'].append(n)
    out['Edges'] = []
    for edge in path['path'].relationships:
        e = {}
        e['idx'] = int(edge.id)
        e['start_node'] = int(edge.start)
        e['end_node'] = int(edge.end)
        e['type'] = str(edge.type)
        e['property_label'] = str(edge.properties['property_label'])
        e['property_uri'] = str(edge.properties['property_uri'])
        e['reference_uri'] = str(edge.properties['reference_uri'])
        e['reference_date'] = str(edge.properties['reference_date'])
        e['reference_supporting_text'] = str(edge.properties['reference_supporting_text'])
        out['Edges'].append(e)
        
    return out
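
`parseNode` and `parsePath` both copy the same four node properties by direct indexing, which raises a `KeyError` if a property is absent. The following is a minimal sketch of a shared helper, assuming the property container behaves like a plain mapping; the names `NODE_FIELDS` and `props_to_dict` are hypothetical and not part of the original notebook.

```python
# Hypothetical helper: not part of the original notebook.
NODE_FIELDS = ('id', 'preflabel', 'name', 'description')


def props_to_dict(properties, fields=NODE_FIELDS, default=None):
    """Copy the listed fields from a property mapping, filling gaps with `default`."""
    return {field: properties.get(field, default) for field in fields}


# A plain dict stands in for node.properties here (assumption for illustration).
example = props_to_dict({'id': 'HGNC:633', 'preflabel': 'AQP1'})
print(example)
```

Missing properties come back as `None` rather than stopping the parse, which may or may not be the desired behaviour for a given analysis.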

### Initialize neo4j

In [3]:
driver = GraphDatabase.driver("bolt://kylo.scripps.edu:7690", auth=basic_auth("neo4j", "xena"))

# TRANSCRIPTOME ANALYSIS

### Common TFs for human RNA transcriptome
* **Query web 1: < NGLY1 -[Human RNA expression (Freeze)]- gene <-- TF > order by freq**
    * ngly1-[interacts with]-gene<-tf

In [5]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(fly_gene:GENE)-[:`RO:HOM0000020`]-(gene:GENE)<-[i2:`RO:0002434`]-(tf:GENE)

        WHERE source.id = 'FlyBase:FBgn0033050' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,tf,gene,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_uri) =~ '.*pubmed/29346549.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN DISTINCT tf.id as id, tf.preflabel as symbol, tf.name as name, tf.description as description, count(distinct gene.preflabel) as freq
        
        ORDER BY freq DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({
        'TF_id': record['id'], 
        'TF_symbol': record['symbol'],
        'TF_name': record['name'],
        'TF_description': record['description'],
        'Frequency': record['freq']
    })
    
res_df = pd.DataFrame(out_l)
print(res_df.shape)

(12, 5)
CPU times: user 0 ns, sys: 3.7 ms, total: 3.7 ms
Wall time: 994 ms
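
The "parse results" loop that renames record keys into table columns recurs in most cells of this notebook. A minimal sketch of how it could be factored out is shown here, assuming records support `record[key]` lookup; the helper name `records_to_rows`, the `TF_COLUMNS` mapping and the toy record are illustrative stand-ins, not objects returned by the server.

```python
def records_to_rows(records, column_map):
    """Rename selected record keys into table columns, one dict per record."""
    return [
        {column: record[key] for key, column in column_map.items()}
        for record in records
    ]


# Mapping from the aliases in the RETURN clause to the DataFrame column names.
TF_COLUMNS = {
    'id': 'TF_id',
    'symbol': 'TF_symbol',
    'name': 'TF_name',
    'description': 'TF_description',
    'freq': 'Frequency',
}

# Toy stand-in for one record; a plain dict, with a placeholder description.
toy_records = [{'id': 'HGNC:7996', 'symbol': 'NRF1',
                'name': 'nuclear respiratory factor 1',
                'description': 'toy description', 'freq': 5}]

rows = records_to_rows(toy_records, TF_COLUMNS)
print(rows)
```

The resulting list of dicts can be handed to `pd.DataFrame(rows)` in the same way as `out_l`.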


In [6]:
# Summary table
res_df[['Frequency', 'TF_id', 'TF_symbol', 'TF_name', 'TF_description']].head()

| | Frequency | TF_id | TF_symbol | TF_name | TF_description |
|---|---|---|---|---|---|
| 0 | 5 | HGNC:7996 | NRF1 | nuclear respiratory factor 1 | This gene encodes a protein that homodimerizes and functions as a transcription factor which activates the expression of some key metabolic genes regulating cellular growth and nuclear genes required for respiration, heme biosynthesis, and mitochondrial DNA transcription and replication. The protein has also been associated with the regulation of neurite outgrowth. Alternative splicing results in multiple transcript variants. Confusion has occurred in bibliographic databases due to the shared symbol of NRF1 for this gene and for 'nuclear factor (erythroid-derived 2)-like 1' which has an official symbol of NFE2L1. [provided by RefSeq, May 2014]. |
| 1 | 5 | HGNC:6204 | JUN | Jun proto-oncogene, AP-1 transcription factor subunit | This gene is the putative transforming gene of avian sarcoma virus 17. It encodes a protein which is highly similar to the viral protein, and which interacts directly with specific target DNA sequences to regulate gene expression. This gene is intronless and is mapped to 1p32-p31, a chromosomal region involved in both translocations and deletions in human malignancies. [provided by RefSeq, Jul 2008]. |
| 2 | 3 | HGNC:7782 | NFE2L2 | nuclear factor, erythroid 2 like 2 | This gene encodes a transcription factor which is a member of a small family of basic leucine zipper (bZIP) proteins. The encoded transcription factor regulates genes which contain antioxidant response elements (ARE) in their promoters; many of these genes encode proteins involved in response to injury and inflammation which includes the production of free radicals. Multiple transcript variants encoding different isoforms have been characterized for this gene. [provided by RefSeq, Sep 2015]. |
| 3 | 3 | HGNC:6780 | MAFF | MAF bZIP transcription factor F | The protein encoded by this gene is a basic leucine zipper (bZIP) transcription factor that lacks a transactivation domain. It is known to bind the US-2 DNA element in the promoter of the oxytocin receptor (OTR) gene and most likely heterodimerizes with other leucine zipper-containing proteins to enhance expression of the OTR gene during term pregnancy. The encoded protein can also form homodimers, and since it lacks a transactivation domain, the homodimer may act as a repressor of transcription. This gene may also be involved in the cellular stress response. Multiple transcript variants encoding two different isoforms have been found for this gene. [provided by RefSeq, Jun 2009]. |
| 4 | 3 | HGNC:7780 | NFE2 | nuclear factor, erythroid 2 | |
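
The query in this section keeps a path only if every node occurs exactly once and no node carries one of the listed generic labels. The same two tests can be sketched in Python on plain lists; real paths are neo4j objects, so the list inputs and the function names here are assumptions for illustration.

```python
from collections import Counter

# Labels copied from the nodes_marked list in the Cypher query.
GENERIC_LABELS = {'cytoplasm', 'cytosol', 'nucleus', 'metabolism', 'membrane',
                  'protein binding', 'visible', 'viable', 'phenotype'}


def has_unique_nodes(node_ids):
    """True when no node id repeats, mirroring ALL(x ... single(y ... y = x))."""
    return all(count == 1 for count in Counter(node_ids).values())


def has_generic_node(preflabels):
    """True when any label is in the generic set (case-sensitive, like Cypher IN)."""
    return any(label in GENERIC_LABELS for label in preflabels)


def keep_path(node_ids, preflabels):
    """Combine both tests the way the WHERE clauses do."""
    return has_unique_nodes(node_ids) and not has_generic_node(preflabels)
```

Because the comparison is case-sensitive, a label such as `Metabolism` would not be caught by the lowercase entry `metabolism`.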


### RNA gene clusters by TF and Pathway (pathways associated with gene sets)
* Cluster RNA genes-common TF clusters by pathway (pathways associated with RNA genes in gene sets)

In [7]:
%%time
# Query 20
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]->(fly_g:GENE)-[:`RO:HOM0000020`]-(g:GENE)<-[i2:`RO:0002434`]-(tf:GENE), (g:GENE)-[i]-(pw:PHYS)

        WHERE source.id = 'FlyBase:FBgn0033050' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,g,tf,i,pw,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_uri) =~ '.*pubmed/29346549.*'  
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'
        
        AND toLower(i.property_label) <> 'enables'

        RETURN DISTINCT tf.preflabel as TF_symbol, tf.name as TF_name, 
                        collect(DISTINCT g.preflabel) as geneset, count(distinct g.preflabel) as genes,
                        collect(DISTINCT pw.preflabel) as pathway, count(distinct pw.preflabel) as pathways
        
        ORDER BY genes DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'TF symbol': record['TF_symbol'],
                  'TF name': record['TF_name'],
                  'Target geneset': record['geneset'],
                  'Total target genes': record['genes'],
                  'Pathway': record['pathway'],
                  'Total pathways': record['pathways']
                 })
    
res_df = pd.DataFrame(out_l)
print(res_df.shape)

(12, 6)
CPU times: user 5.66 ms, sys: 133 µs, total: 5.79 ms
Wall time: 5.74 s


In [8]:
# Summary table
res_df.head()

| | Pathway | TF name | TF symbol | Target geneset | Total pathways | Total target genes |
|---|---|---|---|---|---|---|
| 0 | [extracellular exosome, Metabolism, nucleus, Metabolism of proteins, Post-translational protein modification, transmembrane transport, Transmembrane transport of small molecules, positive regulation of transcription from RNA polymerase II promoter, positive regulation of transcription, DNA-templated, regulation of transcription from RNA polymerase II promoter, negative regulation of apoptotic process, protein complex, transcription from RNA polymerase II promoter, response to drug, cellular response to UV, cellular response to hypoxia, positive regulation of fibroblast proliferation] | nuclear respiratory factor 1 | NRF1 | [GART, PSMA4, HIBADH, LIG3, MYC] | 17 | 5 |
| 1 | [nucleus, Metabolism of proteins, positive regulation of transcription from RNA polymerase II promoter, positive regulation of transcription, DNA-templated, regulation of transcription from RNA polymerase II promoter, negative regulation of apoptotic process, Post-translational protein modification, protein complex, transcription from RNA polymerase II promoter, response to drug, cellular response to UV, cellular response to hypoxia, positive regulation of fibroblast proliferation, transmembrane transport, Transmembrane transport of small molecules, Metabolism, extracellular exosome, endoplasmic reticulum membrane, neuron projection, dendrite, anatomical structure morphogenesis, plasma membrane, integral component of plasma membrane, cell surface] | Jun proto-oncogene, AP-1 transcription factor subunit | JUN | [MYC, PSMC3, HPD, TH, SLC6A2] | 24 | 5 |
| 2 | [nucleus, Metabolism of proteins, positive regulation of transcription from RNA polymerase II promoter, positive regulation of transcription, DNA-templated, regulation of transcription from RNA polymerase II promoter, negative regulation of apoptotic process, Post-translational protein modification, protein complex, transcription from RNA polymerase II promoter, response to drug, cellular response to UV, cellular response to hypoxia, positive regulation of fibroblast proliferation, extracellular exosome, Metabolism, endoplasmic reticulum membrane] | MAF bZIP transcription factor F | MAFF | [MYC, PSMG2, HPD] | 16 | 3 |
| 3 | [nucleus, endoplasmic reticulum, extracellular exosome, positive regulation of transcription from RNA polymerase II promoter, negative regulation of apoptotic process, Metabolism of proteins, positive regulation of transcription, DNA-templated, regulation of transcription from RNA polymerase II promoter, Post-translational protein modification, protein complex, transcription from RNA polymerase II promoter, response to drug, cellular response to UV, cellular response to hypoxia, positive regulation of fibroblast proliferation] | nuclear factor, erythroid 2 like 2 | NFE2L2 | [POMP, SQSTM1, MYC] | 15 | 3 |
| 4 | [nucleus, Metabolism of proteins, extracellular exosome, Post-translational protein modification, transmembrane transport, Transmembrane transport of small molecules, Metabolism, integral component of membrane, endoplasmic reticulum, endoplasmic reticulum membrane, brush border, positive regulation of transcription from RNA polymerase II promoter, negative regulation of apoptotic process] | nuclear factor, erythroid 2 | NFE2 | [PSMA1, LCTL, SQSTM1] | 13 | 3 |
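
The table above is produced by aggregating in Cypher with `collect(DISTINCT ...)` and `count(distinct ...)` per TF. As a rough sketch of the same clustering done client-side, assuming the query instead returned one `(TF, gene, pathway)` row per match, the grouping could look like this; the function name and the toy rows are placeholders, not results from the graph.

```python
from collections import defaultdict


def cluster_by_tf(rows):
    """Group (tf, gene, pathway) rows into one summary dict per TF."""
    genes = defaultdict(set)
    pathways = defaultdict(set)
    for tf, gene, pathway in rows:
        genes[tf].add(gene)
        pathways[tf].add(pathway)
    summary = [
        {'TF symbol': tf,
         'Target geneset': sorted(genes[tf]),
         'Total target genes': len(genes[tf]),
         'Pathway': sorted(pathways[tf]),
         'Total pathways': len(pathways[tf])}
        for tf in genes
    ]
    # Same ordering as the query: most target genes first.
    return sorted(summary, key=lambda row: row['Total target genes'], reverse=True)


# Toy rows with placeholder names (not taken from the database).
toy_rows = [('TF1', 'geneA', 'pw1'), ('TF1', 'geneB', 'pw1'),
            ('TF1', 'geneA', 'pw2'), ('TF2', 'geneA', 'pw1')]
summary = cluster_by_tf(toy_rows)
print(summary)
```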


# NETWORK HYPOTHESES

### template that involves genes and pathways
* **Query 1: ( L = 3 ) (ngly1-ppi-pw-aqp1) ngly1-[interacts with]-gene--phys--aqp1**
    * without degree filters
    * this query in graph v2 gives 0 paths

In [9]:
%%time
# Query 1
with driver.session() as session:
    result = session.run(
    """
    MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)--(:PHYS)--(target:GENE)

    WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

    WITH path,

    [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked,

    [r IN relationships(path) WHERE r.property_label IN ['in paralogy relationship with','in orthology relationship with','colocalizes with']] AS edges_marked

    WHERE size(nodes_marked) = 0 AND size(edges_marked) = 0

    RETURN count(distinct path) as paths
    """
    )

# print result
for record in result:
    print('Total paths: {}'.format(record['paths']))

Total paths: 111
CPU times: user 1.49 ms, sys: 0 ns, total: 1.49 ms
Wall time: 95.5 ms


### template rna-reg L=3 min
* **Query 14: TF-pw (L=4) NGLY1--RNA_GENE--TF--PHYS--AQP1**
    * filter expression edge 
    * filter TF regulatory edge and directionality
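
In the query for this template, the two edge filters are regular-expression tests on lowercased edge properties: `reference_uri` for the expression edge and `reference_supporting_text` for the TF regulatory edge. Cypher's `=~` must match the whole string, which corresponds to `re.fullmatch` in Python. The sketch below uses hypothetical function names, and made-up example strings in the usage note.

```python
import re

# Patterns copied from the Cypher query.
EXPRESSION_REF = re.compile(r'.*pubmed/29346549.*')
TF_SOURCE = re.compile(r'.*tftargets.*|.*msigdb.*')


def _matches(pattern, value):
    """Whole-string, case-insensitive match; a missing value never matches."""
    if value is None:
        return False
    return pattern.fullmatch(value.lower()) is not None


def is_expression_edge(reference_uri):
    """Edge supported by the expression study (PubMed 29346549)."""
    return _matches(EXPRESSION_REF, reference_uri)


def is_tf_regulatory_edge(supporting_text):
    """Edge whose supporting text mentions tftargets or msigdb."""
    return _matches(TF_SOURCE, supporting_text)
```

For instance, `is_tf_regulatory_edge('Source: tftargets R package')` is true, while an edge with no supporting text is rejected, in line with how a null comparison is treated in a Cypher `WHERE` clause.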

In [10]:
%%time
# Query 14
query = (
        """
        MATCH path=(source:GENE {id: 'FlyBase:FBgn0033050'})-[i1:`RO:0002434`]-(fly_g:GENE)-[:`RO:HOM0000020`]-(rna:GENE)<-[i2:`RO:0002434`]-(tf:GENE)--(pw:PHYS)--(target:GENE {id: 'HGNC:633'})

        WHERE ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,rna,tf,pw,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_uri) =~ '.*pubmed/29346549.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct rna) as rnas, count(distinct tf) as tfs, count(distinct pw) as pws, count(distinct path) as paths
        
        ORDER BY paths DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'Expressed_genes': record['rnas'], 
                  'TFs': record['tfs'],
                  'Pathways': record['pws'],
                  'Paths': record['paths']})
    
res_df = pd.DataFrame(out_l)

CPU times: user 4.57 ms, sys: 0 ns, total: 4.57 ms
Wall time: 3.45 s


In [11]:
# Summary table
res_df

| | Expressed_genes | Paths | Pathways | TFs |
|---|---|---|---|---|
| 0 | 12 | 74 | 9 | 4 |


* **Query 14.1: TF-pw (L=3) NGLY1--RNA_GENE--TF--AQP1**
    * filter expression edge 
    * filter TF regulatory edge and directionality

In [12]:
%%time
# Query 14.1
query = (
        """
        MATCH path=(source:GENE {id: 'FlyBase:FBgn0033050'})-[i1:`RO:0002434`]-(fly_g:GENE)-[:`RO:HOM0000020`]-(rna:GENE)<-[i2:`RO:0002434`]-(tf:GENE)--(target:GENE {id: 'HGNC:633'})

        WHERE ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,rna,tf,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_uri) =~ '.*pubmed/29346549.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct rna) as rnas, count(distinct tf) as tfs, count(distinct path) as paths
        
        ORDER BY paths DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'Expressed_genes': record['rnas'], 
                  'TFs': record['tfs'],
                  'Paths': record['paths']})
    
res_df = pd.DataFrame(out_l)

CPU times: user 2.47 ms, sys: 0 ns, total: 2.47 ms
Wall time: 355 ms


In [13]:
# summary table
res_df

| | Expressed_genes | Paths | TFs |
|---|---|---|---|
| 0 | 2 | 8 | 2 |


### Look for a template of interest (only regulation edge): Open query + metapaths 
* **Query 13: Variable length 2-4 and pattern**
    * pattern: ngly1<-regulatory-tf-[<=3]-aqp1 
    * ngly1-[interacts with]-gene-[1..3]-aqp1 
    * filter specific regulatory types on specific edges 

In [14]:
%%time
# Query 13
query = (
        """
        MATCH path=(source:GENE)<-[i1:`RO:0002434`]-(tf:GENE)-[*..3]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 4973
CPU times: user 2.84 ms, sys: 242 µs, total: 3.08 ms
Wall time: 21.7 s


In [15]:
%%time
# Query 13
query = (
        """
        MATCH path=(source:GENE)<-[i1:`RO:0002434`]-(tf:GENE)-[*..3]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN DISTINCT extract (x in rels(path) | type(x)) as types, extract (n in nodes(path) | labels(n)) as labels, length(path) as mp_length, count(distinct path) as paths 
        
        ORDER BY mp_length, paths DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'Nodes': record['labels'], 
                  'Relations': record['types'],
                  'Metapath length': record['mp_length'],
                  'Paths': record['paths']})
    
res_df = pd.DataFrame(out_l)

CPU times: user 6.93 ms, sys: 148 µs, total: 7.08 ms
Wall time: 21.7 s


In [16]:
# summary table of metapaths
res_df

| | Metapath length | Nodes | Paths | Relations |
|---|---|---|---|---|
| 0 | 3 | [[GENE], [GENE], [GENE], [GENE]] | 24 | [RO:0002434, RO:0002434, RO:0002434] |
| 1 | 3 | [[GENE], [GENE], [GENE], [GENE]] | 8 | [RO:0002434, RO:HOM0000011, RO:0002434] |
| 2 | 3 | [[GENE], [GENE], [ANAT], [GENE]] | 5 | [RO:0002434, RO:0002206, RO:0002206] |
| 3 | 4 | [[GENE], [GENE], [GENE], [GENE], [GENE]] | 2493 | [RO:0002434, RO:0002434, RO:0002434, RO:0002434] |
| 4 | 4 | [[GENE], [GENE], [GENE], [ANAT], [GENE]] | 544 | [RO:0002434, RO:0002434, RO:0002206, RO:0002206] |
| 5 | 4 | [[GENE], [GENE], [GENE], [PHYS], [GENE]] | 196 | [RO:0002434, RO:0002434, RO:0002331, RO:0002331] |
| 6 | 4 | [[GENE], [GENE], [GENE], [GENE], [GENE]] | 196 | [RO:0002434, RO:HOM0000011, RO:0002434, RO:0002434] |
| 7 | 4 | [[GENE], [GENE], [DISO], [PHYS], [GENE]] | 182 | [RO:0002434, RO:0002200, None, RO:0002331] |
| 8 | 4 | [[GENE], [GENE], [DISO], [GENE], [GENE]] | 171 | [RO:0002434, RO:0002200, RO:0002200, RO:0002434] |
| 9 | 4 | [[GENE], [GENE], [GENE], [PHYS], [GENE]] | 162 | [RO:0002434, RO:0002434, BFO:0000050, BFO:0000050] |
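
Each row of the metapath table is one combination of node labels and relationship types, with the number of concrete paths that share it. The grouping is done in Cypher, but the idea can be sketched in Python on toy data. Flat label strings are used here instead of the nested label lists shown in the table, and the function name and toy paths are illustrative only.

```python
from collections import Counter


def metapath_key(node_labels, rel_types):
    """A hashable signature: the sequence of node labels and edge types."""
    return (tuple(node_labels), tuple(rel_types))


# Toy paths as (node labels, relationship types); not query output.
toy_paths = [
    (['GENE', 'GENE', 'GENE', 'GENE', 'GENE'], ['RO:0002434'] * 4),
    (['GENE', 'GENE', 'GENE', 'GENE'], ['RO:0002434'] * 3),
    (['GENE', 'GENE', 'ANAT', 'GENE'], ['RO:0002434', 'RO:0002206', 'RO:0002206']),
    (['GENE', 'GENE', 'GENE', 'GENE'], ['RO:0002434'] * 3),
]

counts = Counter(metapath_key(nodes, rels) for nodes, rels in toy_paths)

# Shortest metapaths first, then most frequent, as in ORDER BY mp_length, paths DESC.
ranked = sorted(counts.items(), key=lambda item: (len(item[0][1]), -item[1]))
for (labels, rels), n_paths in ranked:
    print(len(rels), labels, rels, n_paths)
```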


### template refinement (only regulation edge)
* **Query 12: ( L = 4 ) ngly1--TF-[regulatory]->gene--phys--aqp1**
    * filter specific regulatory types on specific edges 
    * filter directionality
    * filter node type on third node, otherwise runtime is long

In [17]:
%%time
# Query 12
query = (
        """
        MATCH path=(source:GENE)--(tf:GENE)-[i1:`RO:0002434`]->(:GENE)--(:PHYS)--(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 16
CPU times: user 2.94 ms, sys: 64 µs, total: 3.01 ms
Wall time: 2.54 s


* **Query 11: ( L = 3 ) ngly1--TF-[regulatory]->gene--aqp1**
    * filter specific regulatory types on specific edges 
    * filter directionality

In [18]:
%%time
# Query 11
query = (
        """
        MATCH path=(source:GENE)--(tf:GENE)-[i1:`RO:0002434`]->(:GENE)--(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 6
CPU times: user 0 ns, sys: 2.51 ms, total: 2.51 ms
Wall time: 267 ms


## Mitali queries
### 14 Jan 2019 meeting 

Here are the details of the query we talked about yesterday:


Look for gene-gene interactions linking NGLY1 and AQPs through the following genes:

* ATF1 (ENSG00000123268)
    * https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000123268;r=12:50763710-50821122

* CREB1 (ENSG00000118260)
* PRKACA (ENSG00000072062)
* cyclic adenosine monophosphate (cAMP)
    * **cAMP** is not a protein, so there is no identification number for it.

_Note_: Protein kinase A has multiple subunits. Each subunit has a different identifier. I have provided you with the gene identifier for its catalytic subunit PRKACA.


**QUERY TOPOLOGY**:

*  NGLY1-interacts-(gene)-interacts-(cAMP|ATF1|CREB1|PKA),  with length n=1/2
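
Since the same topology is to be checked against several targets, one option is to build the Cypher text once and pass the node identifiers as query parameters. This is a minimal sketch, not the notebook's own method. The function `build_link_query` is hypothetical, and only the NGLY1 (`HGNC:17646`) and ATF1 (`HGNC:783`) identifiers come from the queries in this notebook. The hop bound is formatted into the text because Cypher does not accept a parameter in a variable-length range.

```python
def build_link_query(max_hops):
    """Cypher text: $source_id -[interacts with]- gene -[up to max_hops edges]- $target_id."""
    if not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError('max_hops must be a positive integer')
    return (
        "MATCH path=(source:GENE {id: $source_id})-[:`RO:0002434`]-(:GENE)"
        "-[*.." + str(max_hops) + "]-(target:GENE {id: $target_id}) "
        "WHERE ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x)) "
        "RETURN count(DISTINCT path) AS paths"
    )


query = build_link_query(2)
params = {'source_id': 'HGNC:17646', 'target_id': 'HGNC:783'}
print(query)
print(params)
```

With the neo4j driver the pair would be run as `session.run(query, params)`; `runQuery` as defined above takes no parameters, so it would need a variant that forwards them.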

* **Query: ( L = 3 ) ngly1--TF-[regulatory]->gene--atf1**
    * filter specific regulatory types on specific edges 
    * filter directionality

In [19]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(tf:GENE)-[*..2]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))



Paths: 52
CPU times: user 1.77 ms, sys: 141 µs, total: 1.91 ms
Wall time: 153 ms


In [20]:
%%time
query = (
        """
        MATCH path=(source:GENE)--(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 0
CPU times: user 1.63 ms, sys: 135 µs, total: 1.77 ms
Wall time: 12.9 ms


In [21]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(:GENE)-[i2:`RO:0002434`]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 0
CPU times: user 1.92 ms, sys: 3 µs, total: 1.92 ms
Wall time: 24.7 ms


In [22]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(:GENE)-[i2:`RO:0002434`]-(:GENE)-[i3:`RO:0002434`]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 508
CPU times: user 3.26 ms, sys: 0 ns, total: 3.26 ms
Wall time: 502 ms
