# Cypher Queries for Determining Regulatory Paths
*Núria Queralt Rosinach, Andrew Su*

**Freeze Lab meeting, January 2019**

## Overview
NGLY1 - AQP1 **regulatory review** (*NGLY1 v3.1*)

## Servers

* Local: bolt://kylo.scripps.edu:7689
* AWS: bolt://52.87.232.110:7689

### Imports

In [3]:
from neo4j.v1 import GraphDatabase, basic_auth
import pandas as pd
pd.set_option('display.max_columns', None)  # or 1000
pd.set_option('display.max_rows', None)  # or 1000
pd.set_option('display.max_colwidth', -1)  # or 199

### Functions

In [4]:
def runQuery( driver, query ):
    '''
    This function runs the query against the database and returns the result.
        in: neo4j driver object, cypher query string
        out: list of neo4j record objects
    '''
    
    with driver.session() as session:
        # consume the result while the session is still open
        result = list(session.run(query))
        
    return result


def parseNode( node ):
    '''
    This function parses the information gathered in the node data structure object resulting after querying neo4j.
        in: node record neo4j object
        out: node as dict
    '''
    
    n = dict()
    n["idx"] = int(node.id)
    n["type"] = list(node.labels)[0]
    n["id"] = str(node.properties['id'])
    n["preflabel"] = str(node.properties['preflabel'])
    n["name"] = str(node.properties['name'])
    n["description"] = str(node.properties['description'])

    return n


def parsePath( path ):
    '''
    This function parses the information gathered in the path data structure object resulting after querying neo4j.
        in: path record neo4j object
        out: path as dict
    '''
    
    out = {}
    out['Nodes'] = []
    for node in path['path'].nodes:
        n = {}
        n['idx'] = int(node.id)
        n['type'] = list(node.labels)[0]
        n['id'] = str(node.properties['id'])
        n['preflabel'] = str(node.properties['preflabel'])
        n['name'] = str(node.properties['name'])
        n['description'] = str(node.properties['description'])
        out['Nodes'].append(n)
    out['Edges'] = []
    for edge in path['path'].relationships:
        e = {}
        e['idx'] = int(edge.id)
        e['start_node'] = int(edge.start)
        e['end_node'] = int(edge.end)
        e['type'] = str(edge.type)
        e['property_label'] = str(edge.properties['property_label'])
        e['property_uri'] = str(edge.properties['property_uri'])
        e['reference_uri'] = str(edge.properties['reference_uri'])
        e['reference_date'] = str(edge.properties['reference_date'])
        e['reference_supporting_text'] = str(edge.properties['reference_supporting_text'])
        out['Edges'].append(e)
        
    return out
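
The dict returned by `parsePath` can be hard to read when many paths come back. The sketch below shows one way to print such a dict as a single line. `format_path` and `toy_path` are hypothetical names introduced here, and the toy path uses made-up ids and a placeholder gene label rather than real query output.

```python
def format_path(parsed):
    """Render a parsePath-style dict as a one-line, human-readable chain."""
    # Index node labels by internal neo4j id so edges can look them up.
    labels = {n['idx']: n['preflabel'] for n in parsed['Nodes']}
    parts = []
    for e in parsed['Edges']:
        parts.append('{} -[{}]-> {}'.format(
            labels[e['start_node']], e['property_label'], labels[e['end_node']]))
    return '; '.join(parts)


# Hypothetical two-edge path; ids and the GENE_X label are made up for illustration.
toy_path = {
    'Nodes': [
        {'idx': 1, 'preflabel': 'NGLY1'},
        {'idx': 2, 'preflabel': 'GENE_X'},
        {'idx': 3, 'preflabel': 'AQP1'},
    ],
    'Edges': [
        {'idx': 10, 'start_node': 1, 'end_node': 2, 'property_label': 'interacts with'},
        {'idx': 11, 'start_node': 3, 'end_node': 2, 'property_label': 'interacts with'},
    ],
}
print(format_path(toy_path))
```

Each edge is printed in its stored direction (start node to end node), which may differ from the order in which the path was traversed.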

### Initialize neo4j

In [5]:
driver = GraphDatabase.driver("bolt://kylo.scripps.edu:7689", auth=basic_auth("neo4j", "xena"))
#driver = GraphDatabase.driver("bolt://52.87.232.110:7689")

In [6]:
# Question
## Query topology graph
## Table of summary
## Graph of paths
## Explore paths=> executable cypher query

# TRANSCRIPTOME ANALYSIS

### Common TFs for human RNA transcriptome
* **Query web 1: < NGLY1 -[Human RNA expression (Freeze)]- gene <-- TF > order by freq**
    * ngly1-[interacts with]-gene<-tf

In [7]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(gene:GENE)<-[i2:`RO:0002434`]-(tf:GENE)

        WHERE source.id = 'HGNC:17646' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,tf,gene,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_supporting_text) =~ '.*freeze.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN DISTINCT tf.id as id, tf.preflabel as symbol, tf.name as name, tf.description as description, count(distinct gene.preflabel) as freq
        
        ORDER BY freq DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({
        'TF_id': record['id'], 
        'TF_symbol': record['symbol'],
        'TF_name': record['name'],
        'TF_description': record['description'],
        'Frequency': record['freq']
    })
    
res_df = pd.DataFrame(out_l)
print(res_df.shape)

(857, 5)
CPU times: user 30.9 ms, sys: 286 µs, total: 31.2 ms
Wall time: 10.5 s


In [8]:
# Summary table
res_df[['Frequency', 'TF_id', 'TF_symbol', 'TF_name', 'TF_description']].head()

Unnamed: 0,Frequency,TF_id,TF_symbol,TF_name,TF_description
0,471,HGNC:11205,SP1,Sp1 transcription factor,"The protein encoded by this gene is a zinc finger transcription factor that binds to GC-rich motifs of many promoters. The encoded protein is involved in many cellular processes, including cell differentiation, cell growth, apoptosis, immune responses, response to DNA damage, and chromatin remodeling. Post-translational modifications such as phosphorylation, acetylation, glycosylation, and proteolytic processing significantly affect the activity of this protein, which can be an activator or a repressor. Three transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Nov 2014]."
1,403,HGNC:6551,LEF1,lymphoid enhancer binding factor 1,"This gene encodes a transcription factor belonging to a family of proteins that share homology with the high mobility group protein-1. The protein encoded by this gene can bind to a functionally important site in the T-cell receptor-alpha enhancer, thereby conferring maximal enhancer activity. This transcription factor is involved in the Wnt signaling pathway, and it may function in hair cell differentiation and follicle morphogenesis. Mutations in this gene have been found in somatic sebaceous tumors. This gene has also been linked to other cancers, including androgen-independent prostate cancer. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Oct 2009]."
2,305,HGNC:6914,MAZ,MYC associated zinc finger protein,
3,287,HGNC:7139,FOXO4,forkhead box O4,"This gene encodes a member of the O class of winged helix/forkhead transcription factor family. Proteins encoded by this class are regulated by factors involved in growth and differentiation indicating they play a role in these processes. A translocation involving this gene on chromosome X and the homolog of the Drosophila trithorax gene, encoding a DNA binding protein, located on chromosome 11 is associated with leukemia. Multiple transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jan 2010]."
4,283,NCBIGene:26196534,E12,transcription regulator,
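
The `Frequency` column is the number of distinct expressed genes that each TF regulates. As a rough sketch, the same aggregation can be reproduced client-side in pandas if the query returned one row per (TF, gene) pair instead. The `pairs` table below is a toy input assumed for illustration, not output from the database.

```python
import pandas as pd

# Toy (TF, target gene) pairs; illustrative only, not query output.
pairs = pd.DataFrame({
    'TF_symbol': ['SP1', 'SP1', 'SP1', 'LEF1', 'LEF1', 'SP1'],
    'gene': ['TEX2', 'STMN2', 'TEX2', 'TEX2', 'TAOK3', 'SGK1'],
})

freq = (
    pairs.groupby('TF_symbol')['gene']
    .nunique()                      # mirrors count(distinct gene.preflabel)
    .sort_values(ascending=False)   # mirrors ORDER BY freq DESC
    .rename('Frequency')
    .reset_index()
)
print(freq)
```

Duplicate (TF, gene) rows do not inflate the count, which is the reason the Cypher query uses `count(distinct ...)`.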


### RNA gene clusters by TF and Pathway (pathways associated with gene sets)
* Cluster RNA genes by common TF, then by pathway (pathways associated with the RNA genes in each gene set)

In [9]:
%%time
# Query 20
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]->(g:GENE)<-[i2:`RO:0002434`]-(tf:GENE), (g:GENE)-[i]-(pw:PHYS)

        WHERE source.id = 'HGNC:17646' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,g,tf,i,pw,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_supporting_text) =~ '.*freeze.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'
        
        AND toLower(i.property_label) <> 'enables'

        RETURN DISTINCT tf.preflabel as TF_symbol, tf.name as TF_name, 
                        collect(DISTINCT g.preflabel) as geneset, count(distinct g.preflabel) as genes,
                        collect(DISTINCT pw.preflabel) as pathway, count(distinct pw.preflabel) as pathways
        
        ORDER BY genes DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'TF symbol': record['TF_symbol'],
                  'TF name': record['TF_name'],
                  'Target geneset': record['geneset'],
                  'Total target genes': record['genes'],
                  'Pathway': record['pathway'],
                  'Total pathways': record['pathways']
                 })
    
res_df = pd.DataFrame(out_l)
print(res_df.shape)

(850, 6)
CPU times: user 166 ms, sys: 0 ns, total: 166 ms
Wall time: 42 s


In [10]:
# Summary table
res_df.head()

Unnamed: 0,Pathway,TF name,TF symbol,Target geneset,Total pathways,Total target genes
0,"[nucleus, regulation of transcription from RNA polymerase II promoter, plasma membrane, endoplasmic reticulum membrane, Transmembrane transport of small molecules, Metabolism, integral component of membrane, response to drug, negative regulation of apoptotic process, cellular homeostasis, endoplasmic reticulum, neuron projection, axon, integral component of plasma membrane, cellular response to retinoic acid, extracellular exosome, dendrite, glycoprotein catabolic process, Protein processing in endoplasmic reticulum, protein complex, nuclear membrane, positive regulation of cell migration, apical plasma membrane, cell-cell junction, cell surface, Post-translational protein modification, Metabolism of proteins, response to calcium ion, brush border, cellular response to oxygen-glucose deprivation, protein homooligomerization, positive regulation of transcription from RNA polymerase II promoter, Renin secretion, positive regulation of transcription, DNA-templated, transcription from RNA polymerase II promoter, protein ubiquitination, N-glycan trimming in the ER and Calnexin/Calreticulin cycle, Asparagine N-linked glycosylation, cellular response to hydrogen peroxide, cellular response to oxidative stress, positive regulation of angiogenesis, Vasopressin regulates renal water homeostasis via Aquaporins, Aquaporin-mediated transport, Vasopressin-regulated water reabsorption, renal water homeostasis, positive regulation of epithelial cell migration, caveola, erythrocyte differentiation, response to retinoic acid, basolateral plasma membrane, wound healing, protein folding, negative regulation of cysteine-type endopeptidase activity involved in apoptotic process, brush border membrane, inflammatory response, cellular response to mechanical stimulus, transmembrane transport, apical part of cell, cellular response to hypoxia, positive regulation of fibroblast proliferation, sarcolemma, O2/CO2 exchange in erythrocytes, bicarbonate transport, Erythrocytes take up carbon 
dioxide and release oxygen, cellular response to UV, establishment or maintenance of actin cytoskeleton polarity, anatomical structure morphogenesis, axon terminus, Proximal tubule bicarbonate reclamation, transport, odontogenesis, metanephric glomerulus vasculature development, secretion by cell, potassium ion transport, potassium ion transmembrane transport, glomerular filtration, lateral ventricle development, sensory perception of pain, cellular response to dexamethasone stimulus, cellular response to cAMP]",Sp1 transcription factor,SP1,"[RBBP8, SGK1, SERINC3, ATG5, TEX2, UNC13B, TAOK3, STMN2, ABCA1, IL6ST, STT3B, XPO1, ADAM17, TCEAL1, PHTF1, RPAP2, IGF1R, PIGK, NUCB2, TRPC1, BCL2L13, RAB22A, PTPRA, DNM1L, JKAMP, EIF2AK3, BACE2, HSPH1, CNOT7, PPP3CB, PNRC1, SDCCAG8, CALCOCO1, SRP54, TPR, ZEB2, SLC35F5, LYPD3, MAFB, RNF139, MLLT3, ARL6IP5, ATL1, PKN2, ARIH1, SETX, ZFP91, PARP8, OSBPL9, ANXA4, GPD1L, BTG1, TNFSF10, ARID4A, UBA6, XRN2, LRIG1, RAB11A, JAK2, ZNF148, SMARCA5, VPS29, FBXO11, CCND1, ADK, IGFBP2, SQOR, RAB10, APPBP2, LEPR, TNC, ELF1, CRYAB, ETFDH, GNA13, UBE4B, TRPM7, GPM6B, CROT, MGAT4B, CLDND1, PAPOLG, FAS, HK2, MOCS2, MLPH, TIMP2, ARL1, ASTN1, PCNP, CASK, PPFIA2, MITF, UST, HAT1, CIT, KDR, FEM1C, ATG3, MT2A, ...]",80,402
1,"[integral component of plasma membrane, integral component of membrane, endoplasmic reticulum, plasma membrane, neuron projection, axon, extracellular exosome, sensory perception of pain, nucleus, protein ubiquitination, Post-translational protein modification, Metabolism of proteins, Asparagine N-linked glycosylation, positive regulation of cell migration, negative regulation of apoptotic process, endoplasmic reticulum membrane, Metabolism, Protein processing in endoplasmic reticulum, transport, protein complex, positive regulation of transcription from RNA polymerase II promoter, transmembrane transport, transcription from RNA polymerase II promoter, cell-cell junction, nuclear membrane, positive regulation of transcription, DNA-templated, Vasopressin regulates renal water homeostasis via Aquaporins, Aquaporin-mediated transport, Vasopressin-regulated water reabsorption, Transmembrane transport of small molecules, renal water homeostasis, positive regulation of epithelial cell migration, regulation of keratinocyte differentiation, cell surface, response to drug, response to calcium ion, regulation of transcription from RNA polymerase II promoter, cellular response to dexamethasone stimulus, excretion, water homeostasis, brush border membrane, brush border, basal plasma membrane, apical plasma membrane, cellular response to hypoxia, cellular response to cAMP, response to vitamin D, basolateral plasma membrane, dendrite, apical part of cell, cellular response to hydrogen peroxide, cellular response to UV, potassium ion transport, wound healing, sarcolemma, protein homooligomerization, negative regulation of cysteine-type endopeptidase activity involved in apoptotic process, anatomical structure morphogenesis, cellular response to oxidative stress, axon terminus, Renin secretion, inflammatory response, positive regulation of fibroblast proliferation, cellular response to mechanical stimulus, response to retinoic acid, cellular response to retinoic acid, 
odontogenesis, metanephric glomerulus vasculature development, erythrocyte differentiation, pancreatic juice secretion, lateral ventricle development, potassium ion transmembrane transport, Proximal tubule bicarbonate reclamation, N-glycan trimming in the ER and Calnexin/Calreticulin cycle, protein folding, positive regulation of angiogenesis, caveola, glycoprotein catabolic process, protein deglycosylation, Bile secretion, O2/CO2 exchange in erythrocytes, bicarbonate transport, glycerol transport, carbon dioxide transport, Erythrocytes take up carbon dioxide and release oxygen, Erythrocytes take up oxygen and release carbon dioxide, cell volume homeostasis, water transport, ammonium transport, Passive transport by Aquaporins, carbon dioxide transmembrane transport, multicellular organismal water homeostasis, cellular response to copper ion, cellular hyperosmotic response, cellular response to nitric oxide, ammonium transmembrane transport, maintenance of symbiont-containing vacuole by host, cellular homeostasis, nitric oxide transport, cellular response to salt stress, ...]",lymphoid enhancer binding factor 1,LEF1,"[GPR137B, TEX2, TAOK3, STMN2, RAP2A, TMEM62, TMEM35A, DLG2, SF3B1, UBR3, JDP2, PHTF1, WDR11, GOLGB1, IGF1R, SEC24B, RAB22A, GRIK2, NAALADL2, CRLS1, HNRNPLL, PLOD2, HSPH1, CNOT7, ZNF664, PNRC1, SLC22A23, PBX3, SDCCAG8, RB1CC1, ZEB2, IPO11, TXLNG, MAFB, ZNF569, PKN2, ARIH1, AKT3, TBL1XR1, ARRDC3, ZFP91, PARP8, FNBP1L, HSF2, TNFSF10, XRN2, RAB11A, ROCK2, ZNF148, FBXO11, SYNE2, AK9, NOTCH2, CCND1, LAMA4, LRCH2, CHD1, FBXO32, CSNK1G3, RAB10, STXBP6, PCNX1, NEDD4L, RANBP9, ELF1, GNA13, PDS5A, ACTR3, UBE4B, CTSC, GPM6B, LAMB2, DST, USP34, STC1, KREMEN1, SWT1, ASTN1, AFDN, TOB1, CASK, BCAR3, URI1, COMMD3, SVIL, CHMP2B, MITF, UST, RHOQ, VPS50, FEM1C, MT2A, ZNF644, SLC44A1, BNIP2, UBE3A, MFSD14A, NCAM1, FAP, EHBP1, ...]",111,340
2,"[integral component of membrane, nucleus, inflammatory response, Metabolism of proteins, plasma membrane, extracellular exosome, Renin secretion, response to drug, Metabolism, apical plasma membrane, Vasopressin regulates renal water homeostasis via Aquaporins, Aquaporin-mediated transport, Vasopressin-regulated water reabsorption, dendrite, Transmembrane transport of small molecules, Bile secretion, negative regulation of apoptotic process, cell-cell junction, protein complex, positive regulation of transcription from RNA polymerase II promoter, positive regulation of transcription, DNA-templated, integral component of plasma membrane, transport, caveola, positive regulation of cell migration, erythrocyte differentiation, regulation of transcription from RNA polymerase II promoter, neuron projection, brush border membrane, Post-translational protein modification, transcription from RNA polymerase II promoter, brush border, axon, protein ubiquitination, anatomical structure morphogenesis, cellular response to retinoic acid, positive regulation of fibroblast proliferation, nuclear membrane, cell surface, endoplasmic reticulum membrane, Asparagine N-linked glycosylation, Protein processing in endoplasmic reticulum, response to calcium ion, cellular response to hypoxia, cellular response to cAMP, response to vitamin D, endoplasmic reticulum, protein folding, regulation of keratinocyte differentiation, protein homooligomerization, sensory perception of pain, cellular response to dexamethasone stimulus, wound healing, cellular response to UV, basal plasma membrane, Proximal tubule bicarbonate reclamation, potassium ion transport, cellular response to mechanical stimulus, O2/CO2 exchange in erythrocytes, bicarbonate transport, glycerol transport, carbon dioxide transport, positive regulation of angiogenesis, basolateral plasma membrane, negative regulation of cysteine-type endopeptidase activity involved in apoptotic process, cellular response to hydrogen peroxide, 
Erythrocytes take up carbon dioxide and release oxygen, Erythrocytes take up oxygen and release carbon dioxide, cell volume homeostasis, potassium ion transmembrane transport, odontogenesis, water transport, apical part of cell, renal water homeostasis, ammonium transport, Passive transport by Aquaporins, sarcolemma, carbon dioxide transmembrane transport, pancreatic juice secretion, multicellular organismal water homeostasis, cellular response to copper ion, cellular hyperosmotic response, cellular response to nitric oxide, ammonium transmembrane transport, maintenance of symbiont-containing vacuole by host, cellular homeostasis, nitric oxide transport, cellular response to salt stress, cellular response to mercury ion, renal water transport, transepithelial water transport, cGMP biosynthetic process, cellular response to stress, symbiont-containing vacuole membrane, lateral ventricle development, positive regulation of saliva secretion, cerebrospinal fluid secretion, establishment or maintenance of actin cytoskeleton polarity, symbiont-containing vacuole, cellular response to inorganic substance, ...]",MYC associated zinc finger protein,MAZ,"[CYYR1, PGRMC2, PAPOLG, LOXL3, XRN2, STXBP6, EIF4A1, GNAS, DPT, CDC42EP5, CFL1, TINAGL1, PNRC1, RASSF2, NCOA3, SLC6A9, JAK2, EHBP1, CHD1, SHANK2, FBXL19, CSDC2, ID4, EBF1, ADAMTS4, ACTC1, PKN3, DOCK4, MNT, ACTR3, PHF20L1, STMN2, TRIM41, IGF2BP3, MITF, WNT2, SYNE1, ADAMTS15, BCAM, TGIF2, ZNF703, PPM1A, EFNB1, SMC6, PARP12, PTMA, EPB41L1, KDELR1, ADAM19, UBA3, LRFN5, ATF6B, CCND1, LAMA5, SUMO1, RB1CC1, LRRC4, XPO1, STC1, CUL2, GRN, ERF, RGS7, TCF21, HNRNPA3, WNK4, STAG2, PELP1, CD109, HOXA13, SDCCAG8, BCAR3, TMEM35A, ACAN, ZNF644, ATP5MC2, ASIC1, CD47, RAB10, ATL1, MARCH7, CBX6, EIF5A, SWT1, HNRNPA2B1, DLG2, CCP110, FSTL3, FGFRL1, EFNB3, GDNF, UST, CBX3, UBE2D3, HNRNPA0, MMP2, TFAP2C, GPM6B, TLK1, WAC, ...]",106,273
3,"[nucleus, protein complex, plasma membrane, positive regulation of transcription from RNA polymerase II promoter, keratan sulfate biosynthetic process, extracellular exosome, Metabolism, neuron projection, axon, integral component of membrane, sensory perception of pain, protein ubiquitination, Renin secretion, response to drug, apical plasma membrane, Vasopressin regulates renal water homeostasis via Aquaporins, Aquaporin-mediated transport, Vasopressin-regulated water reabsorption, dendrite, Transmembrane transport of small molecules, Bile secretion, integral component of plasma membrane, positive regulation of cell migration, negative regulation of apoptotic process, Post-translational protein modification, Metabolism of proteins, endoplasmic reticulum membrane, Protein processing in endoplasmic reticulum, endoplasmic reticulum, protein homooligomerization, transcription from RNA polymerase II promoter, nuclear membrane, positive regulation of transcription, DNA-templated, cellular response to dexamethasone stimulus, excretion, water homeostasis, brush border membrane, cellular response to hypoxia, Asparagine N-linked glycosylation, cell surface, basal plasma membrane, cellular response to cAMP, response to vitamin D, transmembrane transport, cellular response to UV, regulation of transcription from RNA polymerase II promoter, camera-type eye morphogenesis, protein folding, O2/CO2 exchange in erythrocytes, bicarbonate transport, Erythrocytes take up carbon dioxide and release oxygen, cell-cell junction, positive regulation of angiogenesis, positive regulation of epithelial cell migration, wound healing, secretion by cell, secretory granule organization, axon terminus, apical part of cell, negative regulation of cysteine-type endopeptidase activity involved in apoptotic process, cellular response to retinoic acid, odontogenesis, metanephric glomerulus vasculature development, inflammatory response, cellular response to hydrogen peroxide, pancreatic juice 
secretion, transport, cellular response to mechanical stimulus, glomerular filtration, anatomical structure morphogenesis, potassium ion transport, sarcolemma, potassium ion transmembrane transport, basolateral plasma membrane, Proximal tubule bicarbonate reclamation, brush border, glycerol transport, carbon dioxide transport, Erythrocytes take up oxygen and release carbon dioxide, cell volume homeostasis, water transport, renal water homeostasis, ammonium transport, Passive transport by Aquaporins, positive regulation of fibroblast proliferation, carbon dioxide transmembrane transport, multicellular organismal water homeostasis, cellular response to copper ion, cellular hyperosmotic response, cellular response to nitric oxide, ammonium transmembrane transport, maintenance of symbiont-containing vacuole by host, cellular homeostasis, nitric oxide transport, cellular response to salt stress, cellular response to mercury ion, renal water transport, transepithelial water transport, cGMP biosynthetic process, cellular response to stress, ...]",transcription regulator,E12,"[STK3, UNC13B, EGLN1, LUM, STMN2, TMEM62, TMEM35A, FNDC3A, DLG2, UBR3, CASC4, GNAS, IGF1R, RAB22A, ACSL4, CRLS1, EIF2AK3, PLOD2, SPINT2, ZNF664, ZEB2, TXLNG, MLLT3, ARL6IP5, PKN2, ARRDC3, ZFP91, PTGFR, FNBP1L, OSBPL9, GPD1L, LRIG1, GOLGA4, CDK14, SYNE2, AK9, LAMA4, ADK, LTBP2, FBXO32, NEDD4L, RANBP9, ZFR, GNA13, ACAA2, GPM6B, MMP15, MGAT4B, CLDND1, DST, PAPOLG, CMTM4, ZNF189, STC1, CHMP2B, UST, RHOQ, ZNF644, USP1, SDK1, SLC44A1, POLK, NCAM1, ADGRB2, KIF5B, RTN3, DSTN, KTN1, KIF13A, UCHL3, ITGA7, SYNE1, TFAP2C, AGL, COL8A1, DUSP3, TAB2, RSF1, TIGAR, RGS7, SCAMP5, USO1, DENND4A, SKIL, CYB5R4, PIK3R1, HIF1A, IDE, DGKA, ELK3, RAP1GAP2, TSC22D1, JMJD1C, JAZF1, PHKB, HMCN1, CYLD, ENPP5, C1GALT1C1, EPB41, ...]",107,238
4,"[integral component of plasma membrane, plasma membrane, nucleus, endoplasmic reticulum membrane, Transmembrane transport of small molecules, Metabolism, integral component of membrane, endoplasmic reticulum, regulation of transcription from RNA polymerase II promoter, neuron projection, axon, extracellular exosome, negative regulation of apoptotic process, dendrite, sensory perception of pain, Renin secretion, positive regulation of transcription from RNA polymerase II promoter, sarcolemma, response to drug, apical plasma membrane, Vasopressin regulates renal water homeostasis via Aquaporins, Aquaporin-mediated transport, Vasopressin-regulated water reabsorption, Bile secretion, Post-translational protein modification, Metabolism of proteins, Asparagine N-linked glycosylation, positive regulation of cell migration, Protein processing in endoplasmic reticulum, transport, protein homooligomerization, positive regulation of transcription, DNA-templated, transcription from RNA polymerase II promoter, nuclear membrane, protein complex, positive regulation of angiogenesis, protein ubiquitination, cellular response to dexamethasone stimulus, bicarbonate transport, basolateral plasma membrane, basal plasma membrane, cellular response to hypoxia, cellular response to cAMP, response to vitamin D, caveola, cell surface, transmembrane transport, inflammatory response, cellular response to hydrogen peroxide, cellular response to UV, camera-type eye morphogenesis, potassium ion transport, wound healing, positive regulation of fibroblast proliferation, cell-cell junction, positive regulation of epithelial cell migration, cellular response to oxidative stress, erythrocyte differentiation, odontogenesis, cellular response to retinoic acid, axon terminus, potassium ion transmembrane transport, brush border, apical part of cell, protein folding, anatomical structure morphogenesis, response to retinoic acid]",forkhead box O4,FOXO4,"[GPR137B, SGK1, SERINC3, TEX2, ZNF770, STMN2, 
IL6ST, DLG2, PPP3R1, KANK1, GNAS, GOLGB1, IFIH1, IGF1R, SEC24B, EIF3E, ACSL4, GRIK2, NAALADL2, EIF2AK3, HNRNPLL, ZNF664, PPP3CB, PNRC1, PBX3, TPR, ZEB2, TXLNG, MLLT3, PKN2, AKT3, TMEM167A, TBL1XR1, ARRDC3, BTG1, TNFSF10, ARID4A, ZZZ3, ZNF148, FBXO11, LAMA4, BBS7, LRCH2, CHD1, FBXO32, ACKR4, CSNK1G3, HIP1R, RAB10, APPBP2, STXBP6, PCNX1, RANBP9, ELF1, SLC4A7, UBE4B, SAT1, MGAT4B, DST, STC1, KREMEN1, SWT1, TOB1, COMMD3, CHMP2B, STARD5, MITF, SUMO1, HEBP2, RHOQ, DNAJB14, ADTRP, ZNF644, SDK1, SLC44A1, SLMAP, UBE3A, MGST3, SLC30A9, NCAM1, EHBP1, KBTBD2, GNAQ, SATB1, ANXA1, ZBTB20, NDFIP1, KIF2A, KDM3A, ZNF277, FAR1, STXBP3, RAB6A, NAP1L1, ZC3H7A, PCNA, AGL, COL8A1, KCNJ15, STAG2, ...]",67,237
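
One possible way to compare the TF clusters is the Jaccard overlap between their target gene sets. The sketch below is an assumption about how such a comparison could be done, not part of the original analysis. The gene lists are small subsets taken from the truncated lists above, so the scores are illustrative only, and `jaccard` is a helper name introduced here.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Small illustrative subsets of the target gene sets, not the full lists.
genesets = {
    'SP1': ['TEX2', 'STMN2', 'SGK1', 'TAOK3'],
    'LEF1': ['TEX2', 'STMN2', 'TAOK3', 'RAP2A'],
    'FOXO4': ['SGK1', 'TEX2', 'STMN2'],
}

# Score every unordered pair of TFs.
overlap = {
    (a, b): jaccard(genesets[a], genesets[b])
    for a, b in combinations(sorted(genesets), 2)
}
for pair, score in sorted(overlap.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

With the real result, `genesets` could be built from the `TF symbol` and `Target geneset` columns of `res_df`.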


# NETWORK HYPOTHESES

### template that involves genes and pathways
* **Query 1: ( L = 3 ) (ngly1-ppi-pw-aqp1) ngly1-[interacts with]-gene--phys--aqp1**
    * without degree filters
    * this query in graph v2 gives 0 paths

In [11]:
%%time
# Query 1
with driver.session() as session:
    result = session.run(
    """
    MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)--(:PHYS)--(target:GENE)

    WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

    WITH path,

    [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked,

    [r IN relationships(path) WHERE r.property_label IN ['in paralogy relationship with','in orthology relationship with','colocalizes with']] AS edges_marked

    WHERE size(nodes_marked) = 0 AND size(edges_marked) = 0

    RETURN count(distinct path) as paths
    """
    )

# print result
for record in result:
    print('Total paths: {}'.format(record['paths']))

Total paths: 1670
CPU times: user 1.5 ms, sys: 142 µs, total: 1.64 ms
Wall time: 511 ms
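
Every query in this notebook applies the same three filters: no node may appear twice in a path, no node may carry one of the uninformative labels, and (in Query 1) no edge may be of an excluded type. The sketch below restates that logic in plain Python for readers less familiar with Cypher list predicates. `keep_path` and its argument layout are assumptions made for illustration; the filtering itself runs inside the database.

```python
# Same label lists as in the Cypher WHERE clauses (matching is case-sensitive).
BLOCKED_NODE_LABELS = {
    'cytoplasm', 'cytosol', 'nucleus', 'metabolism', 'membrane',
    'protein binding', 'visible', 'viable', 'phenotype',
}
BLOCKED_EDGE_LABELS = {
    'in paralogy relationship with',
    'in orthology relationship with',
    'colocalizes with',
}


def keep_path(node_ids, node_labels, edge_labels):
    """Return True if a path passes the three filters used in Query 1."""
    # ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))
    is_simple = len(set(node_ids)) == len(node_ids)
    # size(nodes_marked) = 0
    no_blocked_nodes = not any(l in BLOCKED_NODE_LABELS for l in node_labels)
    # size(edges_marked) = 0
    no_blocked_edges = not any(l in BLOCKED_EDGE_LABELS for l in edge_labels)
    return is_simple and no_blocked_nodes and no_blocked_edges


# Hypothetical paths: placeholder ids and labels, not database content.
print(keep_path([1, 2, 3, 4], ['NGLY1', 'G', 'P', 'AQP1'], ['interacts with', 'a', 'b']))
print(keep_path([1, 2, 1, 4], ['NGLY1', 'G', 'NGLY1', 'AQP1'], ['interacts with', 'a', 'b']))
```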


### template rna-reg L=3 min
* **Query 14: TF-pw (L=4) NGLY1--RNA_GENE--TF--PHYS--AQP1**
    * filter expression edge 
    * filter TF regulatory edge and directionality

In [12]:
%%time
# Query 14
query = (
        """
        MATCH path=(source:GENE {id: 'HGNC:17646'})-[i1:`RO:0002434`]-(rna:GENE)<-[i2:`RO:0002434`]-(tf:GENE)--(pw:PHYS)--(target:GENE {id: 'HGNC:633'})

        WHERE ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,rna,tf,pw,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_supporting_text) =~ '.*freeze.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct rna) as rnas, count(distinct tf) as tfs, count(distinct pw) as pws, count(distinct path) as paths
        
        ORDER BY paths DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'Expressed_genes': record['rnas'], 
                  'TFs': record['tfs'],
                  'Pathways': record['pws'],
                  'Paths': record['paths']})
    
res_df = pd.DataFrame(out_l)

CPU times: user 0 ns, sys: 4.37 ms, total: 4.37 ms
Wall time: 32.5 s


In [13]:
# Summary table
res_df

Unnamed: 0,Expressed_genes,Paths,Pathways,TFs
0,1349,13467,29,221
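
The expression and regulation edges are selected by matching `reference_supporting_text` against a regular expression. In Cypher, `=~` has to match the whole string, which is why each pattern is wrapped in `.*`. A minimal Python sketch of the same test follows; `matches` is a hypothetical helper and the example strings are made up, not actual edge annotations.

```python
import re

# Same patterns as in the Cypher WHERE clauses.
EXPRESSION = re.compile(r'.*freeze.*')
REGULATION = re.compile(r'.*tftargets.*|.*msigdb.*')


def matches(pattern, supporting_text):
    """Whole-string, case-insensitive match, like toLower(text) =~ pattern."""
    if supporting_text is None:
        # In Cypher a null property yields null, so the row is filtered out.
        return False
    return pattern.fullmatch(supporting_text.lower()) is not None


# Made-up supporting texts for illustration.
print(matches(EXPRESSION, 'This edge comes from the Freeze lab'))
print(matches(REGULATION, 'Source: MSigDB'))
print(matches(REGULATION, 'literature curation'))
```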


* **Query 14.1: TF-pw (L=3) NGLY1--RNA_GENE--TF--AQP1**
    * filter expression edge 
    * filter TF regulatory edge and directionality

In [14]:
%%time
# Query 14.1
query = (
        """
        MATCH path=(source:GENE {id: 'HGNC:17646'})-[i1:`RO:0002434`]-(rna:GENE)<-[i2:`RO:0002434`]-(tf:GENE)--(target:GENE {id: 'HGNC:633'})

        WHERE ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,i2,rna,tf,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 
        
        AND toLower(i1.reference_supporting_text) =~ '.*freeze.*' 
        
        AND toLower(i2.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct rna) as rnas, count(distinct tf) as tfs, count(distinct path) as paths
        
        ORDER BY paths DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'Expressed_genes': record['rnas'], 
                  'TFs': record['tfs'],
                  'Paths': record['paths']})
    
res_df = pd.DataFrame(out_l)

CPU times: user 2.96 ms, sys: 0 ns, total: 2.96 ms
Wall time: 1.56 s


In [15]:
# summary table
res_df

Unnamed: 0,Expressed_genes,Paths,TFs
0,818,1839,13


### Look for a template of interest (only regulation edge): Open query + metapaths 
* **Query 13: Variable length 2-4 and pattern**
    * pattern: ngly1<-regulatory-tf-[<=3]-aqp1 
    * ngly1-[interacts with]-gene-[1..3]-aqp1 
    * filter specific regulatory types on specific edges 

In [16]:
%%time
# Query 13
query = (
        """
        MATCH path=(source:GENE)<-[i1:`RO:0002434`]-(tf:GENE)-[*..3]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 11213
CPU times: user 1.32 ms, sys: 3.02 ms, total: 4.34 ms
Wall time: 52.2 s


In [17]:
%%time
# Query 13: metapath summary
query = (
        """
        MATCH path=(source:GENE)<-[i1:`RO:0002434`]-(tf:GENE)-[*..3]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN DISTINCT extract (x in rels(path) | type(x)) as types, extract (n in nodes(path) | labels(n)) as labels, length(path) as mp_length, count(distinct path) as paths 
        
        ORDER BY mp_length, paths DESC
        """
)

# run query
result = runQuery( driver, query )

# parse results
out_l = list()
for record in result:
    out_l.append({'Nodes': record['labels'], 
                  'Relations': record['types'],
                  'Metapath length': record['mp_length'],
                  'Paths': record['paths']})
    
res_df = pd.DataFrame(out_l)

CPU times: user 5.17 ms, sys: 3.32 ms, total: 8.48 ms
Wall time: 52.6 s


In [18]:
# summary table of metapaths
res_df

| | Metapath length | Nodes | Paths | Relations |
|---|---|---|---|---|
| 0 | 3 | [[GENE], [GENE], [GENE], [GENE]] | 36 | [RO:0002434, RO:0002434, RO:0002434] |
| 1 | 3 | [[GENE], [GENE], [GENE], [GENE]] | 8 | [RO:0002434, RO:HOM0000011, RO:0002434] |
| 2 | 3 | [[GENE], [GENE], [ANAT], [GENE]] | 5 | [RO:0002434, RO:0002206, RO:0002206] |
| 3 | 4 | [[GENE], [GENE], [GENE], [GENE], [GENE]] | 6534 | [RO:0002434, RO:0002434, RO:0002434, RO:0002434] |
| 4 | 4 | [[GENE], [GENE], [GENE], [ANAT], [GENE]] | 1502 | [RO:0002434, RO:0002434, RO:0002206, RO:0002206] |
| 5 | 4 | [[GENE], [GENE], [GENE], [PHYS], [GENE]] | 432 | [RO:0002434, RO:0002434, BFO:0000050, BFO:0000050] |
| 6 | 4 | [[GENE], [GENE], [GENE], [PHYS], [GENE]] | 425 | [RO:0002434, RO:0002434, RO:0002331, RO:0002331] |
| 7 | 4 | [[GENE], [GENE], [GENE], [GENE], [GENE]] | 400 | [RO:0002434, RO:HOM0000011, RO:0002434, RO:0002434] |
| 8 | 4 | [[GENE], [GENE], [GENE], [GENE], [GENE]] | 208 | [RO:0002434, RO:HOM0000011, RO:HOM0000011, RO:0002434] |
| 9 | 4 | [[GENE], [GENE], [GENE], [GENE], [GENE]] | 190 | [RO:0002434, RO:0002325, RO:0002434, RO:0002434] |
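
The `Nodes` and `Relations` columns can be interleaved into a single string per metapath, which is easier to scan than two parallel lists. A minimal sketch, assuming rows shaped like the dicts appended to `out_l`; the helper name `metapath_string` is hypothetical, and the two rows are copied from the summary table:

```python
def metapath_string(labels, rel_types):
    """Interleave node labels and edge types, e.g. GENE-[RO:0002434]-GENE."""
    names = [node_labels[0] for node_labels in labels]  # labels(n) is a list per node
    parts = [names[0]]
    for rel, name in zip(rel_types, names[1:]):
        parts.append('-[{}]-{}'.format(rel, name))
    return ''.join(parts)

# two rows copied from the summary table of metapaths
rows = [
    {'Nodes': [['GENE'], ['GENE'], ['ANAT'], ['GENE']],
     'Relations': ['RO:0002434', 'RO:0002206', 'RO:0002206'],
     'Paths': 5},
    {'Nodes': [['GENE'], ['GENE'], ['GENE'], ['PHYS'], ['GENE']],
     'Relations': ['RO:0002434', 'RO:0002434', 'BFO:0000050', 'BFO:0000050'],
     'Paths': 432},
]

for row in rows:
    print(row['Paths'], metapath_string(row['Nodes'], row['Relations']))
```

Edge direction is not recoverable from `type(x)` alone, so the string uses undirected `-[...]-` separators.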


### template refinement (only regulation edge)
* **Query 12: ( L = 4 ) ngly1--TF-[regulatory]->gene--phys--aqp1**
    * filter specific regulatory types on specific edges 
    * filter directionality
    * filter node type on third node, otherwise runtime is long

In [19]:
%%time
# Query 12
query = (
        """
        MATCH path=(source:GENE)--(tf:GENE)-[i1:`RO:0002434`]->(:GENE)--(:PHYS)--(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 15628
CPU times: user 2.79 ms, sys: 273 µs, total: 3.06 ms
Wall time: 31.4 s
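
The last `WHERE` condition keeps a path only when the supporting text of the regulatory edge `i1` mentions tftargets or msigdb. Cypher's `=~` has to match the whole string, which is why both alternatives are wrapped in `.*`. A rough Python equivalent follows; the function name `from_tf_source` and the example strings are made up, since the actual values of `reference_supporting_text` are not shown in this notebook:

```python
import re

# same pattern as in the Cypher query
PROVENANCE = re.compile(r'.*tftargets.*|.*msigdb.*')

def from_tf_source(supporting_text):
    """Mimic toLower(text) =~ '.*tftargets.*|.*msigdb.*' (full-string match)."""
    if supporting_text is None:
        # a missing property yields null in Cypher, so the edge is filtered out
        return False
    return PROVENANCE.fullmatch(supporting_text.lower()) is not None

print(from_tf_source('Edge imported from tftargets'))  # hypothetical text
print(from_tf_source('From MSigDB gene sets'))         # hypothetical text
print(from_tf_source('text-mined co-occurrence'))      # hypothetical text
```

Lower-casing before matching is what makes `MSigDB` and `msigdb` equivalent here.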


* **Query 11: ( L = 3 ) ngly1--TF-[regulatory]->gene--aqp1**
    * filter specific regulatory types on specific edges 
    * filter directionality

In [20]:
%%time
# Query 11
query = (
        """
        MATCH path=(source:GENE)--(tf:GENE)-[i1:`RO:0002434`]->(:GENE)--(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:633' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 338
CPU times: user 2.36 ms, sys: 112 µs, total: 2.48 ms
Wall time: 1.06 s
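
Queries 11 and 12 share one template and differ only in the typed nodes placed between the regulated gene and the target. Node labels cannot be passed as Cypher parameters, so one option is to assemble the `MATCH` clause as a string. A small sketch, where `template_match` is a hypothetical helper and the labels are assumed to come from a trusted list:

```python
def template_match(intermediate_labels):
    """Build the MATCH clause ngly1--TF-[regulatory]->gene--...--aqp1."""
    tail = ''.join('--(:{})'.format(label) for label in intermediate_labels)
    return ('MATCH path=(source:GENE)--(tf:GENE)-[i1:`RO:0002434`]->(:GENE)'
            + tail + '--(target:GENE)')

print(template_match([]))        # pattern used in Query 11
print(template_match(['PHYS']))  # pattern used in Query 12
```

The `WHERE`, `WITH` and `RETURN` parts stay the same across templates and can be appended unchanged.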


## Mitali queries
### 14 Jan 2019 meeting 

Here are the details of the query we talked about yesterday:


Look for gene-gene interactions linking NGLY1 and AQPs through the following genes:

* ATF1 (ENSG00000123268)
    * https://uswest.ensembl.org/Homo_sapiens/Gene/Summary?g=ENSG00000123268;r=12:50763710-50821122

* CREB1 (ENSG00000118260)
* PRKACA (ENSG00000072062)
* cyclic adenosine monophosphate (cAMP)
    * **cAMP** is not a protein, so it has no gene identification number.

_Note_: Protein kinase A has multiple subunits. Each subunit has a different identifier. I have provided you with the gene identifier for its catalytic subunit PRKACA.


**QUERY TOPOLOGY**:

*  NGLY1-interacts-(gene)-interacts-(cAMP|ATF1|CREB1|PKA),  with length n=1/2

#### GENE IDs

I use BioThings to retrieve gene IDs [http://mygene.info/v3/api#/annotation/get_gene__geneid_]

* ATF1: "ATF1", "ENSEMBL:ENSG00000123268", "HGNC:783", "NCBIGene:466", "MGI:1298366", "FlyBase:FBgn0265784", 
    * [http://mygene.info/v3/gene/ENSG00000123268]
    * curl -X GET "http://mygene.info/v3/gene/ENSG00000123268" -H "accept: application/json"
    * "alias": "EWS-ATF1","FUS/ATF-1","TREB36"

* CREB1: "CREB1", "ENSEMBL:ENSG00000118260", "HGNC:2345", "NCBIGene:1385", "MGI:88494", "FlyBase:FBgn0265784", 
    * [http://mygene.info/v3/gene/ENSG00000118260]
    * curl -X GET "http://mygene.info/v3/gene/ENSG00000118260" -H "accept: application/json"
    * "alias": "CREB","CREB-1"
  
* PKA_PRKACA: "PRKACA", "ENSEMBL:ENSG00000072062", "HGNC:9380", "NCBIGene:5566", "MGI:97592", "FlyBase:FBgn0039796", "FlyBase:FBgn0000274", "FlyBase:FBgn0000273", 
    * [http://mygene.info/v3/gene/ENSG00000072062]
    * curl -X GET "http://mygene.info/v3/gene/ENSG00000072062" -H "accept: application/json"
    * Fly genes have orthology relation type "O" (the mouse gene is "LDO")
    * "alias": "PKACA","PPNAD4"
    
 _Note_: Fly genes are more distant, so their orthology relation type is broader.
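
The `curl` calls above can also be made from Python. The sketch below uses only the standard library; `fetch_gene` and `to_curies` are hypothetical helpers, and the JSON field names `HGNC` and `entrezgene` are assumptions about the response layout that should be checked against a real response. The example runs on a canned record, so no network call is made:

```python
import json
from urllib.request import urlopen

def fetch_gene(ensembl_id):
    """GET http://mygene.info/v3/gene/<id> and decode the JSON body."""
    with urlopen('http://mygene.info/v3/gene/' + ensembl_id) as resp:
        return json.loads(resp.read().decode('utf-8'))

def to_curies(gene):
    """Turn selected fields of a gene record into prefixed IDs."""
    curies = []
    if 'HGNC' in gene:          # assumed field name
        curies.append('HGNC:' + str(gene['HGNC']))
    if 'entrezgene' in gene:    # assumed field name
        curies.append('NCBIGene:' + str(gene['entrezgene']))
    return curies

# canned record using the assumed field names and the ATF1 IDs listed above
example = {'symbol': 'ATF1', 'HGNC': '783', 'entrezgene': '466'}
print(to_curies(example))

# live call, left commented out because it needs network access:
# print(to_curies(fetch_gene('ENSG00000123268')))
```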

In [41]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)-[:`RO:0002434`]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' 
        
        AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 92
CPU times: user 0 ns, sys: 1.63 ms, total: 1.63 ms
Wall time: 80.8 ms
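
The query above targets ATF1 (`HGNC:783`) only. The same pattern could cover all three gene targets in one round trip by passing the IDs as a query parameter. This is a sketch under two assumptions: that the server accepts `$name` parameters, and that the query is sent with `session.run(QUERY, params)` rather than through `runQuery`, which takes no parameters. cAMP is left out because it has no gene ID.

```python
# HGNC IDs taken from the GENE IDs list above
TARGETS = {
    'ATF1': 'HGNC:783',
    'CREB1': 'HGNC:2345',
    'PRKACA': 'HGNC:9380',
}

QUERY = """
MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)-[:`RO:0002434`]-(target:GENE)
WHERE source.id = $source AND target.id IN $targets
AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))
RETURN target.id AS target, count(DISTINCT path) AS paths
"""

params = {'source': 'HGNC:17646', 'targets': list(TARGETS.values())}
print(params)

# usage sketch (needs a live driver, so it is not executed here):
# with driver.session() as session:
#     for record in session.run(QUERY, params):
#         print(record['target'], record['paths'])
```

Returning `target.id` next to the count gives one row per target instead of a single total.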


In [44]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)-[:`RO:0002434`]-(:GENE)-[:`RO:0002434`]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' 
        
        AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 12042
CPU times: user 2.64 ms, sys: 0 ns, total: 2.64 ms
Wall time: 2.6 s


In [45]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)-[:`RO:0002434`*..2]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' 
        
        AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 12134
CPU times: user 2.72 ms, sys: 359 µs, total: 3.08 ms
Wall time: 484 ms


In [47]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[:`RO:0002434`]-(:GENE)-[*..2]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' 
        
        AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 17005
CPU times: user 0 ns, sys: 2.73 ms, total: 2.73 ms
Wall time: 677 ms


In [48]:
%%time
query = (
        """
        MATCH path=(source:GENE)--(:GENE)-[*..2]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' 
        
        AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 17069
CPU times: user 0 ns, sys: 2.34 ms, total: 2.34 ms
Wall time: 696 ms
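
The five counts above come from successive relaxations of one pattern, which allows a quick consistency check: the fixed-length results for two and three regulatory edges should add up to the variable-length result. The numbers below are copied from the cell outputs; the variable names are only for illustration.

```python
# path counts copied from the cell outputs above
reg_len2 = 92      # In [41]: two RO:0002434 edges
reg_len3 = 12042   # In [44]: three RO:0002434 edges
reg_var = 12134    # In [45]: one RO:0002434 edge, then RO:0002434*..2
any_tail = 17005   # In [47]: one RO:0002434 edge, then any edge type *..2
any_all = 17069    # In [48]: any first edge, then any edge type *..2

# lengths 2 and 3 together account for the variable-length count
assert reg_len2 + reg_len3 == reg_var

print('paths whose tail uses a non-RO:0002434 edge:', any_tail - reg_var)
print('paths whose first edge is not RO:0002434:', any_all - any_tail)
```

The differences are 4871 and 64, so most paths between NGLY1 and ATF1 at this length start with an `RO:0002434` edge.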


---

* **Query: ( L = 3 ) ngly1--TF-[regulatory]->gene--atf1**
    * filter specific regulatory types on specific edges 
    * filter directionality

In [21]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(tf:GENE)-[*..2]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 AND toLower(i1.reference_supporting_text) =~ '.*tftargets.*|.*msigdb.*'

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))



Paths: 141
CPU times: user 1.9 ms, sys: 185 µs, total: 2.09 ms
Wall time: 1.83 s


In [22]:
%%time
query = (
        """
        MATCH path=(source:GENE)--(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 0
CPU times: user 1.78 ms, sys: 176 µs, total: 1.95 ms
Wall time: 29.2 ms


In [24]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(:GENE)-[i2:`RO:0002434`]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 92
CPU times: user 1.65 ms, sys: 162 µs, total: 1.81 ms
Wall time: 51.8 ms


In [25]:
%%time
query = (
        """
        MATCH path=(source:GENE)-[i1:`RO:0002434`]-(:GENE)-[i2:`RO:0002434`]-(:GENE)-[i3:`RO:0002434`]-(target:GENE)

        WHERE source.id = 'HGNC:17646' AND target.id = 'HGNC:783' AND ALL(x IN nodes(path) WHERE single(y IN nodes(path) WHERE y = x))

        WITH path,i1,

        [n IN nodes(path) WHERE n.preflabel IN ['cytoplasm','cytosol','nucleus','metabolism','membrane','protein binding','visible','viable','phenotype']] AS nodes_marked

        WHERE size(nodes_marked) = 0 

        RETURN count(distinct path) as paths
        """
)

# run query
result = runQuery( driver, query )

# parse results
for record in result:
    print('Paths: {}'.format(record['paths']))

Paths: 12042
CPU times: user 125 µs, sys: 3.22 ms, total: 3.35 ms
Wall time: 2.78 s
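
The `nodes_marked` filter leaves the counts for the gene-only patterns unchanged (92 and 12042, with or without it). The same blacklist can also be applied on the client side to already parsed paths. A minimal sketch, assuming paths in the shape returned by `parsePath()`, with a `Nodes` list of dicts carrying `preflabel`; the function name `drop_marked_paths` and the toy labels are illustrative only:

```python
# same terms as in the Cypher nodes_marked filter
BLACKLIST = {'cytoplasm', 'cytosol', 'nucleus', 'metabolism', 'membrane',
             'protein binding', 'visible', 'viable', 'phenotype'}

def drop_marked_paths(paths):
    """Keep paths in which no node has a blacklisted preflabel."""
    return [p for p in paths
            if not any(n['preflabel'] in BLACKLIST for n in p['Nodes'])]

# two toy paths; the preflabel values are made up for illustration
paths = [
    {'Nodes': [{'preflabel': 'NGLY1'}, {'preflabel': 'nucleus'}, {'preflabel': 'ATF1'}]},
    {'Nodes': [{'preflabel': 'NGLY1'}, {'preflabel': 'CREB1'}, {'preflabel': 'ATF1'}]},
]
kept = drop_marked_paths(paths)
print(len(kept))
```

As with Cypher's `IN`, the comparison is an exact, case-sensitive match.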
