# Extracting Edges from Reactome

Reactome exists in its purest form as a Neo4j data dump. We will download Neo4j, populate it with the Reactome dump, then start the server and extract the edges we're interested in.

We will be extracting the following edge types:

- Mappings to external resources
    - Physical Entities and Events to Taxa
    - Physical Entities and Events to GO Biological Process
    - Physical Entities and Events to GO Cell Component
    - Physical Entities and Events to GO Molecular Function
    - Physical Entities and Events to Disease
        - Rxn AssocWith Disease
        - Rxn Disrupted in Disease
        - PW AssocWith Disease
        - PW Disrupted in Disease
    - Taxa to Disease through an intermediary node
- Mappings between internal Resources
    - Failed Rxn to Rxn
    - Disrupted PW to PW
- Membership Details
    - Preceding Reaction to Post Reaction
    - Pathway has event Reaction
    - PE input and output of Reactions
    - PE Component of Complex
- Regulation
    - Positive Regulation
    - Negative Regulation
  

In [1]:
import pandas as pd
from pathlib import Path
from data_tools import download

## Download Neo4j Database

In [2]:
dl_loc = Path('../0_data/external/').resolve()
neo4j_url = 'https://neo4j.com/artifact.php?name=neo4j-community-3.5.12-unix.tar.gz'
neo_f_name = neo4j_url.split('=')[-1]

download(neo4j_url, dl_loc.joinpath(neo_f_name))

File neo4j-community-3.5.12-unix.tar.gz exits. Skipping...


### Extract Neo4j to the parent directory

In [3]:
neo_dir = Path('../').resolve()

In [4]:
import tarfile

with tarfile.open(dl_loc.joinpath(neo_f_name), 'r:gz') as n4j:
    n4j.extractall(neo_dir)
    neo_dir = neo_dir.joinpath(n4j.getnames()[0])

## Download Reactome Neo4j Dump

In [5]:
reactome_db_url = 'https://reactome.org/download/current/reactome.graphdb.tgz'
rct_f_name = reactome_db_url.split('/')[-1]

In [6]:
download(reactome_db_url, dl_loc.joinpath(rct_f_name))

File reactome.graphdb.tgz exits. Skipping...


### Extract Reactome to the database directory within Neo4j

In [7]:
db_dir = neo_dir.joinpath('data/databases')

In [8]:
with tarfile.open(dl_loc.joinpath(rct_f_name), 'r:gz') as react_db:
    react_db.extractall(db_dir)

## Spin up the Neo4j server

In [9]:
import subprocess
import time

command = str(neo_dir.joinpath('bin/neo4j')) + ' start'
subprocess.call(command, shell=True)
# The new Neo4j database server should take less than 5 seconds to spin up
time.sleep(5)
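
A fixed sleep can be too short on a slow machine. As a minimal sketch (not part of the original workflow), the wait could instead poll the Bolt port until it accepts a TCP connection. The helper name `wait_for_port` and its default timeout values are assumptions made for illustration, and an open port does not by itself guarantee that the database is ready to answer queries.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Under these assumptions, calling `wait_for_port('localhost', 7687)` right after the `start` command would stand in for the fixed `time.sleep(5)`.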

### Authenticate to connect to the server

In [10]:
import neo4j

try:
    # Use the Neo4j default port, user, and password
    driver = neo4j.GraphDatabase.driver('bolt://localhost:7687', auth=("neo4j", "neo4j"))
    session = driver.session()

    # Neo4j requires a password change before any queries can run
    # This data is neither private nor sensitive, so we will use an insecure password
    session.run("CALL dbms.changePassword('new password')")

except Exception:
    # On subsequent runs the password has already been changed, so the new one is used to authenticate
    # Need to wait a bit before re-auth, otherwise an error will be thrown
    time.sleep(5)
    driver = neo4j.GraphDatabase.driver('bolt://localhost:7687', auth=("neo4j", "new password"))
    session = driver.session()

# Run Database Queries

In [11]:
def query_to_df(query):
    txn = session.run(query)
    return pd.DataFrame(txn.data())

In [12]:
query_results = dict()

## Mappings To External Resources

### Physical Entities and Events to Taxa

In [13]:
q1 = "MATCH (n:Taxon)-[e]-(p:PhysicalEntity) RETURN n.taxId, n.displayName, type(e), p.stId, p.displayName"
pe_tax = query_to_df(q1)
len(pe_tax)

Failed to write data to connection Address(host='localhost', port=7687) (Address(host='127.0.0.1', port=7687)); ("0; 'Underlying socket connection gone (_ssl.c:2084)'")


399929

In [14]:
pe_tax.head()

Unnamed: 0,n.taxId,n.displayName,type(e),p.stId,p.displayName
0,9606,Homo sapiens,species,R-NUL-351080,Fz1:Dvl2:Daam1:RhoA:GTP [plasma membrane]
1,9606,Homo sapiens,species,R-NUL-351023,Fz1:Dvl2:Daam1 [plasma membrane]
2,9606,Homo sapiens,species,R-HSA-194545,RhoA (Mg cofactor):GTP [plasma membrane]
3,9606,Homo sapiens,species,R-NUL-206831,Wnt1:Frizzled1:phospho (3 sites) LRP6:CKIgamma...
4,9606,Homo sapiens,species,R-HSA-205884,"p-T1479,S1490,T1493-LRP6 [plasma membrane]"


In [15]:
pe_tax.columns = ['tax_id', 'tax_name', 'type', 'pe_id', 'pe_name']
pe_tax['tax_id'] = 'NCBITaxon:' + pe_tax['tax_id']
pe_tax.head(2)

Unnamed: 0,tax_id,tax_name,type,pe_id,pe_name
0,NCBITaxon:9606,Homo sapiens,species,R-NUL-351080,Fz1:Dvl2:Daam1:RhoA:GTP [plasma membrane]
1,NCBITaxon:9606,Homo sapiens,species,R-NUL-351023,Fz1:Dvl2:Daam1 [plasma membrane]


In [16]:
query_results['pe_tax'] = pe_tax

In [17]:
q2 = "MATCH (n:Taxon)-[e]-(p:Event) RETURN n.taxId, n.displayName, type(e), p.stId, p.displayName"
evt_tax = query_to_df(q2)
len(evt_tax)

104253

In [18]:
evt_tax.head()

Unnamed: 0,n.taxId,n.displayName,type(e),p.stId,p.displayName
0,9606,Homo sapiens,species,R-NUL-350375,Dvl2 binds to Daam1
1,9606,Homo sapiens,species,R-NUL-350485,RhoA:GTP binds to activated Daam1
2,9606,Homo sapiens,species,R-NUL-209104,Frog CKIgamma further phosphorylates Human LRP...
3,9606,Homo sapiens,species,R-NUL-209144,Human APC is finally phosphorylated by Murine ...
4,9606,Homo sapiens,species,R-NUL-209132,Human APC is initially phosphorylated by Murin...


In [19]:
evt_tax.columns = ['tax_id', 'tax_name', 'type', 'evt_id', 'evt_name']
evt_tax['tax_id'] = 'NCBITaxon:' + evt_tax['tax_id']
evt_tax.head(2)

Unnamed: 0,tax_id,tax_name,type,evt_id,evt_name
0,NCBITaxon:9606,Homo sapiens,species,R-NUL-350375,Dvl2 binds to Daam1
1,NCBITaxon:9606,Homo sapiens,species,R-NUL-350485,RhoA:GTP binds to activated Daam1


In [20]:
query_results['evt_tax'] = evt_tax

### Physical Entities and Events to GO Biological Process

In [21]:
q3 = """MATCH (pe:PhysicalEntity)-[e]-(bp:GO_BiologicalProcess) 
        RETURN pe.stId, pe.displayName, type(e), bp.databaseName, bp.accession, bp.displayName"""

pe_bp = query_to_df(q3)
len(pe_bp)

0

In [22]:
pe_bp.head(2)

No results for Physical Entity to Biological Process...

In [23]:
q4 = """MATCH (pe:Event)-[e]-(bp:GO_BiologicalProcess) 
        RETURN pe.stId, pe.displayName, type(e), bp.databaseName, bp.accession, bp.displayName"""

evt_bp = query_to_df(q4)
len(evt_bp)

13698

In [24]:
evt_bp.head(2)

Unnamed: 0,pe.stId,pe.displayName,type(e),bp.databaseName,bp.accession,bp.displayName
0,R-DME-9613354,Lipophagy,goBiologicalProcess,GO,61724,lipophagy
1,R-XTR-9613354,Lipophagy,goBiologicalProcess,GO,61724,lipophagy


In [25]:
evt_bp.columns = ['evt_id', 'evt_name', 'type', 'curi', 'go_id', 'go_name']
evt_bp['go_id'] = evt_bp['curi'] + ':' + evt_bp['go_id']
evt_bp.head(2)

Unnamed: 0,evt_id,evt_name,type,curi,go_id,go_name
0,R-DME-9613354,Lipophagy,goBiologicalProcess,GO,GO:0061724,lipophagy
1,R-XTR-9613354,Lipophagy,goBiologicalProcess,GO,GO:0061724,lipophagy


In [26]:
query_results['evt_bp'] = evt_bp
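
The rename-then-prefix pattern in the cells above recurs for every external mapping in this notebook. As a hedged sketch, it could be factored into a small helper. The name `add_curie` is hypothetical, and the toy row below only illustrates the shape of the data (it assumes the identifier column holds strings, as the concatenation in the cells above requires).

```python
import pandas as pd


def add_curie(df, id_col, prefix):
    """Return a copy of df with '<prefix>:' prepended to the IDs in id_col.

    Hypothetical helper, not part of the original notebook.
    """
    out = df.copy()
    out[id_col] = prefix + ':' + out[id_col].astype(str)
    return out


# Toy row for illustration only
toy = pd.DataFrame({'evt_id': ['R-DME-9613354'], 'go_id': ['0061724']})
toy_curie = add_curie(toy, 'go_id', 'GO')
```

Returning a copy leaves the input frame untouched, so re-running a cell would not prepend the prefix twice.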

### Physical Entities and Events to GO Cell Component

In [27]:
q5 = """MATCH (pe:PhysicalEntity)-[e]-(cc:GO_CellularComponent) 
        RETURN pe.stId, pe.displayName, type(e), cc.databaseName, cc.accession, cc.displayName"""

pe_cc = query_to_df(q5)
len(pe_cc)

409878

In [28]:
pe_cc.head(2)

Unnamed: 0,pe.stId,pe.displayName,type(e),cc.databaseName,cc.accession,cc.displayName
0,R-ALL-5334668,"PI(3,5)P2 [lysosomal membrane]",compartment,GO,5765,lysosomal membrane
1,R-XTR-429825,BLOC-1 Complex [lysosomal membrane],compartment,GO,5765,lysosomal membrane


In [29]:
pe_cc.columns = ['pe_id', 'pe_name', 'type', 'curi', 'go_id', 'go_name']
pe_cc['go_id'] = pe_cc['curi'] + ':' + pe_cc['go_id']
pe_cc.head(2)

Unnamed: 0,pe_id,pe_name,type,curi,go_id,go_name
0,R-ALL-5334668,"PI(3,5)P2 [lysosomal membrane]",compartment,GO,GO:0005765,lysosomal membrane
1,R-XTR-429825,BLOC-1 Complex [lysosomal membrane],compartment,GO,GO:0005765,lysosomal membrane


In [30]:
query_results['pe_cc'] = pe_cc

In [31]:
q6 = """MATCH (pe:Event)-[e]-(cc:GO_CellularComponent) 
        RETURN pe.stId, pe.displayName, type(e), cc.databaseName, cc.accession, cc.displayName"""

evt_cc = query_to_df(q6)
len(evt_cc)

133474

In [32]:
evt_cc.head(2)

Unnamed: 0,pe.stId,pe.displayName,type(e),cc.databaseName,cc.accession,cc.displayName
0,R-SCE-5333658,"CLAT:AP1:CLVS bind PI(3,5)P2",compartment,GO,5765,lysosomal membrane
1,R-SPO-5333658,"CLAT:AP1:CLVS bind PI(3,5)P2",compartment,GO,5765,lysosomal membrane


In [33]:
evt_cc.columns = ['evt_id', 'evt_name', 'type', 'curi', 'go_id', 'go_name']
evt_cc['go_id'] = evt_cc['curi'] + ':' + evt_cc['go_id']
evt_cc.head(2)

Unnamed: 0,evt_id,evt_name,type,curi,go_id,go_name
0,R-SCE-5333658,"CLAT:AP1:CLVS bind PI(3,5)P2",compartment,GO,GO:0005765,lysosomal membrane
1,R-SPO-5333658,"CLAT:AP1:CLVS bind PI(3,5)P2",compartment,GO,GO:0005765,lysosomal membrane


In [34]:
query_results['evt_cc'] = evt_cc

### Physical Entities and Events to GO Molecular Function

In [35]:
q7 = """MATCH (pe:PhysicalEntity)-[e]-(mf:GO_MolecularFunction) 
        RETURN pe.stId, pe.displayName, type(e), mf.databaseName, mf.accession, mf.displayName"""

pe_mf = query_to_df(q7)
len(pe_mf)

0

In [36]:
pe_mf.head(2)

In [37]:
q8 = """MATCH (pe:Event)-[e]-(mf:GO_MolecularFunction) 
        RETURN pe.stId, pe.displayName, type(e), mf.databaseName, mf.accession, mf.displayName"""

evt_mf = query_to_df(q8)
len(evt_mf)

0

In [38]:
evt_mf.head(2)

### Physical Entities and Events to Disease

#### Rxn AssocWith Disease

In [39]:
q9 = """Match (s:Reaction)-[e]-(t:Disease) 
        return s.stId, s.displayName, type(e), t.databaseName, t.identifier, t.displayName """

rxn_aw_dis = query_to_df(q9)
len(rxn_aw_dis)

673

In [40]:
rxn_aw_dis.head(2)

Unnamed: 0,s.stId,s.displayName,type(e),t.databaseName,t.identifier,t.displayName
0,R-HSA-2400009,PI3K inhibitors block PI3K catalytic activity,disease,DOID,162,cancer
1,R-HSA-2394007,PI3K gain of function mutants phosphorylate PI...,disease,DOID,162,cancer


In [41]:
rxn_aw_dis.columns = ['rxn_id', 'rxn_name', 'type', 'curi', 'do_id', 'do_name']
rxn_aw_dis['do_id'] = rxn_aw_dis['curi'] + ':' + rxn_aw_dis['do_id']
rxn_aw_dis['type'] = 'associated_with'
rxn_aw_dis.head(2)

Unnamed: 0,rxn_id,rxn_name,type,curi,do_id,do_name
0,R-HSA-2400009,PI3K inhibitors block PI3K catalytic activity,associated_with,DOID,DOID:162,cancer
1,R-HSA-2394007,PI3K gain of function mutants phosphorylate PI...,associated_with,DOID,DOID:162,cancer


In [42]:
query_results['rxn_aw_dis'] = rxn_aw_dis

####    Rxn Disrupted in Disease

In [43]:
q10 = """Match (s:Reaction)-[e]-(t:FailedReaction)-[e1]-(t1:Disease) 
       return s.stId, s.displayName, type(e), t.displayName, t.stId, type(e1), 
       t1.databaseName, t1.identifier, t1.displayName"""

rxn_di_dis = query_to_df(q10)
len(rxn_di_dis)

411

In [44]:
rxn_di_dis.head(2)

Unnamed: 0,s.stId,s.displayName,type(e),t.displayName,t.stId,type(e1),t1.databaseName,t1.identifier,t1.displayName
0,R-HSA-5658435,RAS GAPs bind RAS:GTP,normalReaction,RAS GAP mutants aren't stimulated by GAPs,R-HSA-9651280,disease,DOID,60233,cardiofaciocutaneous syndrome
1,R-HSA-5658435,RAS GAPs bind RAS:GTP,normalReaction,RAS GAP mutants aren't stimulated by GAPs,R-HSA-9651280,disease,DOID,162,cancer


In [45]:
# q10 returns t.displayName before t.stId, so the name column comes before the id column
rxn_di_dis.columns = ['rxn_id', 'rxn_name', 'type', 'drxn_name', 'drxn_id', 'type2', 'curi', 'do_id', 'do_name']
rxn_di_dis['do_id'] = rxn_di_dis['curi'] + ':' + rxn_di_dis['do_id']
rxn_di_dis['type'] = 'disrupted_in'
rxn_di_dis.head(2)

Unnamed: 0,rxn_id,rxn_name,type,drxn_name,drxn_id,type2,curi,do_id,do_name
0,R-HSA-5658435,RAS GAPs bind RAS:GTP,disrupted_in,RAS GAP mutants aren't stimulated by GAPs,R-HSA-9651280,disease,DOID,DOID:0060233,cardiofaciocutaneous syndrome
1,R-HSA-5658435,RAS GAPs bind RAS:GTP,disrupted_in,RAS GAP mutants aren't stimulated by GAPs,R-HSA-9651280,disease,DOID,DOID:162,cancer


In [46]:
query_results['rxn_di_dis'] = rxn_di_dis

#### PW AssocWith Disease

In [47]:
q11 = """Match (s:Pathway)-[e]-(t:Disease) 
        return s.stId, s.displayName, type(e), t.databaseName, t.identifier, t.displayName"""

pw_aw_dis = query_to_df(q11)
len(pw_aw_dis)

559

In [48]:
pw_aw_dis.head(2)

Unnamed: 0,s.stId,s.displayName,type(e),t.databaseName,t.identifier,t.displayName
0,R-HSA-9605308,Diseases of Base Excision Repair,disease,DOID,162,cancer
1,R-HSA-9616333,Defective Base Excision Repair Associated with...,disease,DOID,162,cancer


In [49]:
pw_aw_dis.columns = ['pw_id', 'pw_name', 'type', 'curi', 'do_id', 'do_name']
pw_aw_dis['do_id'] = pw_aw_dis['curi'] + ':' + pw_aw_dis['do_id']
pw_aw_dis['type'] = 'associated_with'
pw_aw_dis.head(2)

Unnamed: 0,pw_id,pw_name,type,curi,do_id,do_name
0,R-HSA-9605308,Diseases of Base Excision Repair,associated_with,DOID,DOID:162,cancer
1,R-HSA-9616333,Defective Base Excision Repair Associated with...,associated_with,DOID,DOID:162,cancer


In [50]:
query_results['pw_aw_dis'] = pw_aw_dis

#### PW Disrupted in Disease

In [51]:
q12 = """MATCH (pw:Pathway)<-[e:normalPathway]-(dpw:Pathway)-[e1]-(d:Disease) 
         RETURN pw.stId, pw.displayName, type(e), dpw.stId, dpw.displayName, type(e1), 
         d.databaseName, d.identifier, d.displayName"""

pw_di_dis = query_to_df(q12)
len(pw_di_dis)

368

In [52]:
pw_di_dis.head(2)

Unnamed: 0,pw.stId,pw.displayName,type(e),dpw.stId,dpw.displayName,type(e1),d.databaseName,d.identifier,d.displayName
0,R-HSA-73929,"Base-Excision Repair, AP Site Formation",normalPathway,R-HSA-9630221,Defective NTHL1 substrate processing,disease,DOID,162,cancer
1,R-HSA-73929,"Base-Excision Repair, AP Site Formation",normalPathway,R-HSA-9630222,Defective NTHL1 substrate binding,disease,DOID,162,cancer


In [53]:
pw_di_dis.columns = ['pw_id', 'pw_name', 'type', 'dpw_id', 'dpw_name', 'type2', 'curi', 'do_id', 'do_name']
pw_di_dis['do_id'] = pw_di_dis['curi'] + ':' + pw_di_dis['do_id']
pw_di_dis['type'] = 'disrupted_in'
pw_di_dis.head(2)

Unnamed: 0,pw_id,pw_name,type,dpw_id,dpw_name,type2,curi,do_id,do_name
0,R-HSA-73929,"Base-Excision Repair, AP Site Formation",disrupted_in,R-HSA-9630221,Defective NTHL1 substrate processing,disease,DOID,DOID:162,cancer
1,R-HSA-73929,"Base-Excision Repair, AP Site Formation",disrupted_in,R-HSA-9630222,Defective NTHL1 substrate binding,disease,DOID,DOID:162,cancer


In [54]:
query_results['pw_di_dis'] = pw_di_dis

### Taxa to Disease through an intermediary node

In [55]:
q13 = """MATCH (n:Taxon)-[e1]-(p)-[e2]-(d:Disease) WHERE n.taxId <> '9606' 
         RETURN DISTINCT n.taxId, n.displayName, d.databaseName, d.identifier, d.displayName"""

tax_dis = query_to_df(q13)
len(tax_dis)

66

In [56]:
tax_dis.head(2)

Unnamed: 0,n.taxId,n.displayName,d.databaseName,d.identifier,d.displayName
0,10090,Mus musculus,DOID,162,cancer
1,28875,Rotavirus A,DOID,934,viral infectious disease


In [57]:
tax_dis.columns = ['tax_id', 'tax_name', 'curi', 'do_id', 'do_name']
tax_dis['tax_id'] = 'NCBITaxon:' + tax_dis['tax_id']
tax_dis['do_id'] = tax_dis['curi'] + ':' + tax_dis['do_id']
tax_dis.head(5)

Unnamed: 0,tax_id,tax_name,curi,do_id,do_name
0,NCBITaxon:10090,Mus musculus,DOID,DOID:162,cancer
1,NCBITaxon:28875,Rotavirus A,DOID,DOID:934,viral infectious disease
2,NCBITaxon:11034,Sindbis virus,DOID,DOID:934,viral infectious disease
3,NCBITaxon:11084,Tick-borne encephalitis virus,DOID,DOID:934,viral infectious disease
4,NCBITaxon:12637,Dengue virus,DOID,DOID:934,viral infectious disease


In [58]:
query_results['tax_dis'] = tax_dis

## Mappings between internal Resources

### Failed Rxn to Rxn

In [59]:
q14 = """MATCH (fr:FailedReaction)-[e]-(r:Reaction) 
         RETURN fr.stId, fr.displayName, type(e), r.stId, r.displayName"""
rxn_frxn = query_to_df(q14)
len(rxn_frxn)

358

In [60]:
rxn_frxn.head(2)

Unnamed: 0,fr.stId,fr.displayName,type(e),r.stId,r.displayName
0,R-HSA-9651280,RAS GAP mutants aren't stimulated by GAPs,normalReaction,R-HSA-5658435,RAS GAPs bind RAS:GTP
1,R-HSA-6802837,Loss-of-function NF1 variants don't stimulate ...,normalReaction,R-HSA-5658435,RAS GAPs bind RAS:GTP


In [61]:
rxn_frxn.columns = ['frxn_id', 'frxn_name', 'type', 'rxn_id', 'rxn_name']
rxn_frxn.head()

Unnamed: 0,frxn_id,frxn_name,type,rxn_id,rxn_name
0,R-HSA-9651280,RAS GAP mutants aren't stimulated by GAPs,normalReaction,R-HSA-5658435,RAS GAPs bind RAS:GTP
1,R-HSA-6802837,Loss-of-function NF1 variants don't stimulate ...,normalReaction,R-HSA-5658435,RAS GAPs bind RAS:GTP
2,R-HSA-4085027,Defective GFPT1 does not transfer an amino gro...,normalReaction,R-HSA-449715,"GFPT1,2 transfer an amino group from L-Gln to ..."
3,R-HSA-5609939,Defective PGM1 does not isomerise G6P to G1P,normalReaction,R-HSA-9638127,PGM1:Mg2+ isomerises G6P to G1P
4,R-HSA-4341669,Defective NEU1 does not hydrolyse Neu5Ac from ...,normalReaction,R-HSA-4084999,NEU1 hydrolyses Neu5Ac from glycoconjugates


In [62]:
query_results['rxn_frxn'] = rxn_frxn

### Disrupted PW to PW

In [63]:
q15 = """MATCH (dpw:Pathway)-[e:normalPathway]->(pw:Pathway) 
         RETURN dpw.stId, dpw.displayName, type(e), pw.stId, pw.displayName"""

pw_dpw = query_to_df(q15)
len(pw_dpw)

313

In [64]:
pw_dpw.head(2)

Unnamed: 0,dpw.stId,dpw.displayName,type(e),pw.stId,pw.displayName
0,R-HSA-6802957,Oncogenic MAPK signaling,normalPathway,R-HSA-5673001,RAF/MAP kinase cascade
1,R-HSA-6802949,Signaling by RAS mutants,normalPathway,R-HSA-5673001,RAF/MAP kinase cascade


In [65]:
pw_dpw.columns = ['dpw_id', 'dpw_name', 'type', 'pw_id', 'pw_name']
pw_dpw.tail()

Unnamed: 0,dpw_id,dpw_name,type,pw_id,pw_name
308,R-HSA-5602636,IKBKB deficiency causes SCID,normalPathway,R-HSA-445989,TAK1 activates NFkB by phosphorylation and act...
309,R-HSA-5602415,UNC93B1 deficiency - HSE,normalPathway,R-HSA-1679131,Trafficking and processing of endosomal TLR
310,R-HSA-5602680,MyD88 deficiency (TLR5),normalPathway,R-HSA-975871,MyD88 cascade initiated on plasma membrane
311,R-HSA-5602571,TRAF3 deficiency - HSE,normalPathway,R-HSA-168164,Toll Like Receptor 3 (TLR3) Cascade
312,R-HSA-5602410,TLR3 deficiency - HSE,normalPathway,R-HSA-168164,Toll Like Receptor 3 (TLR3) Cascade


In [66]:
query_results['pw_dpw'] = pw_dpw

## Membership Details

### Preceding Reaction to Post Reaction

In [67]:
q16 = """MATCH (rx:Reaction)-[e:precedingEvent]->(rx2:Reaction) 
         RETURN rx.stId, rx.displayName, type(e), rx2.stId, rx2.displayName"""

rxn_p_rxn = query_to_df(q16)
len(rxn_p_rxn)

49821

In [68]:
rxn_p_rxn.head(2)

Unnamed: 0,rx.stId,rx.displayName,type(e),rx2.stId,rx2.displayName
0,R-HSA-9626034,EEF1A1 dissociates from p-GFAP,precedingEvent,R-HSA-9626046,p-GFAP binds EEF1A1
1,R-MMU-9626034,EEF1A1 dissociates from p-GFAP,precedingEvent,R-MMU-9626046,p-GFAP binds EEF1A1


In [69]:
rxn_p_rxn.columns = ['af_rxn_id', 'af_rxn_name', 'type', 'b4_rxn_id', 'b4_rxn_name']
rxn_p_rxn.drop_duplicates(subset=['af_rxn_name', 'b4_rxn_name']).head()

Unnamed: 0,af_rxn_id,af_rxn_name,type,b4_rxn_id,b4_rxn_name
0,R-HSA-9626034,EEF1A1 dissociates from p-GFAP,precedingEvent,R-HSA-9626046,p-GFAP binds EEF1A1
9,R-HSA-9626067,EEF1A1:GTP translocates from lysosomal membran...,precedingEvent,R-HSA-9626038,EEF1A1 binds GTP
11,R-MMU-9626038,EEF1A1 binds GTP,precedingEvent,R-MMU-9626034,EEF1A1 dissociates from p-GFAP
31,R-MMU-9626253,HSPA8 binds LAMP2a multimers,precedingEvent,R-MMU-9626242,p-GFAP:GFAP dissociates from LAMP2a multimer
34,R-RNO-9626242,p-GFAP:GFAP dissociates from LAMP2a multimer,precedingEvent,R-RNO-9626039,pGFAP binds GFAP in LAMP2a multimer


In [70]:
query_results['rxn_p_rxn'] = rxn_p_rxn

### Pathway has event Reaction

In [71]:
q17 = """MATCH (pw:Pathway)-[e]-(r:Reaction) 
         RETURN pw.stId, pw.displayName, type(e), r.stId, r.displayName"""

pw_rxn = query_to_df(q17)
len(pw_rxn)

77730

In [72]:
pw_rxn.head(2)

Unnamed: 0,pw.stId,pw.displayName,type(e),r.stId,r.displayName
0,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9626253,HSPA8 binds LAMP2a multimers
1,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9625197,GFAP binds LAMP2a multimer


In [73]:
pw_rxn.columns = ['pw_id', 'pw_name', 'type', 'rxn_id', 'rxn_name']
pw_rxn.head()

Unnamed: 0,pw_id,pw_name,type,rxn_id,rxn_name
0,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9626253,HSPA8 binds LAMP2a multimers
1,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9625197,GFAP binds LAMP2a multimer
2,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9626060,HSPA8:Substrate dissociates from LAMP2a multimer
3,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9625196,Unfolded substrate in LAMP2a multimeric comple...
4,R-HSA-9613829,Chaperone Mediated Autophagy,hasEvent,R-HSA-9622831,Substrate:LAMP2a binds HSP90


In [74]:
query_results['pw_rxn'] = pw_rxn

### Physical Entity input and output of Reactions

The flat-file download also covers these relationships, but it lacks the input and output roles of the physical entities, so every edge there is a generic `PhysicalEntity part_of Reaction` edge.

In [75]:
q18 = """Match (s:Reaction)-[e]-(t:PhysicalEntity) 
         return s.stId, s.displayName, type(e), t.stId, t.displayName"""

rxn_pe = query_to_df(q18)
len(rxn_pe)

275079

In [76]:
rxn_pe.head(2)

Unnamed: 0,s.stId,s.displayName,type(e),t.stId,t.displayName
0,R-HSA-9626034,EEF1A1 dissociates from p-GFAP,output,R-HSA-9626022,EEF1A1 [lysosomal membrane]
1,R-HSA-9626034,EEF1A1 dissociates from p-GFAP,output,R-HSA-9626054,p-GFAP [lysosomal membrane]


In [77]:
rxn_pe.columns = ['rxn_id', 'rxn_name', 'type', 'pe_id', 'pe_name']
rxn_pe['type'].value_counts()

input                     149108
output                    125678
entityOnOtherCell            266
requiredInputComponent        27
Name: type, dtype: int64

We are mostly interested in `input` and `output`, and may drop the other types later.
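
If the less common edge types are dropped later, one way to do it is a simple `isin` filter. The sketch below runs on a toy stand-in for `rxn_pe`. The identifiers are placeholders rather than query output, and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for rxn_pe; rows are illustrative placeholders, not query results
toy_rxn_pe = pd.DataFrame({
    'rxn_id': ['rxn-1', 'rxn-1', 'rxn-2', 'rxn-3'],
    'type': ['input', 'output', 'entityOnOtherCell', 'requiredInputComponent'],
    'pe_id': ['pe-1', 'pe-2', 'pe-3', 'pe-4'],
})

# Keep only the two edge types of interest
keep_types = ['input', 'output']
io_only = toy_rxn_pe[toy_rxn_pe['type'].isin(keep_types)].reset_index(drop=True)
```

Filtering at a later stage rather than in the query keeps the full set of edge types in the saved CSV, which is consistent with deferring the decision.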

In [78]:
query_results['rxn_pe'] = rxn_pe

### Physical Entity component of Complexes

Complexes are a type internal to the Reactome Neo4j database. However, many are made up of items that have external identifiers, such as drugs and proteins. We would like to extract these relationships for future use.

In [79]:
q19 = """MATCH (n1:Complex)-[e:hasComponent]-(n2:PhysicalEntity) 
        WHERE n2:Drug OR n2:GenomeEncodedEntity OR n2:SimpleEntity 
        RETURN DISTINCT n1.stId, n1.displayName, n2.stId, n2.displayName, labels(n2)"""

cplx_pe = query_to_df(q19)
len(cplx_pe)

139855

In [80]:
cplx_pe.head(2)

Unnamed: 0,n1.stId,n1.displayName,n2.stId,n2.displayName,labels(n2)
0,R-GGA-9652349,ERBB2 heterodimer:trastuzumab [plasma membrane],R-ALL-9634466,trastuzumab [extracellular region],"[DatabaseObject, PhysicalEntity, ProteinDrug, ..."
1,R-BTA-9652349,ERBB2 heterodimer:trastuzumab [plasma membrane],R-ALL-9634466,trastuzumab [extracellular region],"[DatabaseObject, PhysicalEntity, ProteinDrug, ..."


In [81]:
cplx_pe.columns = ['cplx_id', 'cplx_name', 'pe_id', 'pe_name', 'pe_label']
cplx_pe['pe_label'].astype(str).value_counts()

['DatabaseObject', 'PhysicalEntity', 'GenomeEncodedEntity', 'EntityWithAccessionedSequence']    114067
['DatabaseObject', 'PhysicalEntity', 'SimpleEntity']                                             16645
['DatabaseObject', 'PhysicalEntity', 'GenomeEncodedEntity']                                       9116
['DatabaseObject', 'PhysicalEntity', 'Drug', 'ChemicalDrug']                                        16
['DatabaseObject', 'PhysicalEntity', 'ProteinDrug', 'Drug']                                         11
Name: pe_label, dtype: int64

In [82]:
query_results['cplx_pe'] = cplx_pe
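
The `pe_label` column holds full label lists, which are awkward to group on. As a sketch only, each list could be collapsed to the single label that the query filtered on. The priority order and the name `primary_label` are assumptions, and the label lists in the toy Series are copied from the value counts above.

```python
import pandas as pd

# Assumed priority order: the first of these labels found in a list wins
LABEL_PRIORITY = ['Drug', 'GenomeEncodedEntity', 'SimpleEntity']


def primary_label(labels):
    """Return the first priority label present in `labels`, or None if none match."""
    for label in LABEL_PRIORITY:
        if label in labels:
            return label
    return None


toy_labels = pd.Series([
    ['DatabaseObject', 'PhysicalEntity', 'ProteinDrug', 'Drug'],
    ['DatabaseObject', 'PhysicalEntity', 'GenomeEncodedEntity', 'EntityWithAccessionedSequence'],
    ['DatabaseObject', 'PhysicalEntity', 'SimpleEntity'],
])
simple = toy_labels.apply(primary_label)
```

This mirrors the way the regulation labels are reduced to `PositiveRegulation` and `NegativeRegulation` later in the notebook.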

## Regulation

In [83]:
q20 = """MATCH (n1)<-[e:regulator]-(r)<-[e2:regulatedBy]-(n2) 
         RETURN DISTINCT n1.stId, n1.displayName, labels(n1), 
         labels(r), n2.stId, n2.displayName, labels(n2)"""

reg_edge = query_to_df(q20)
len(reg_edge)

7729

In [84]:
reg_edge.head(2)

Unnamed: 0,n1.stId,n1.displayName,labels(n1),labels(r),n2.stId,n2.displayName,labels(n2)
0,R-HSA-917723,ESCRT-III [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-HSA-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."
1,R-HSA-5683632,UVRAG complex [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-HSA-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."


In [85]:
reg_edge.columns = ['n1_id', 'n1_name', 'n1_label', 'type', 'n2_id', 'n2_name', 'n2_label']
reg_edge.head()

Unnamed: 0,n1_id,n1_name,n1_label,type,n2_id,n2_name,n2_label
0,R-HSA-917723,ESCRT-III [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-HSA-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."
1,R-HSA-5683632,UVRAG complex [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-HSA-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."
2,R-MMU-5683632,UVRAG complex [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-MMU-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."
3,R-MMU-917723,ESCRT-III [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-MMU-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."
4,R-RNO-5683632,UVRAG complex [cytosol],"[DatabaseObject, PhysicalEntity, Complex]","[DatabaseObject, PositiveRegulation, Regulation]",R-RNO-5682388,Autophagosome maturation,"[DatabaseObject, Event, ReactionLikeEvent, Bla..."


In [86]:
# Boolean masks marking rows whose label list contains each regulation type
pos_idx = reg_edge['type'].apply(lambda x: 'PositiveRegulation' in x)
neg_idx = reg_edge['type'].apply(lambda x: 'NegativeRegulation' in x)

reg_edge.loc[pos_idx, 'type'] = 'PositiveRegulation'
reg_edge.loc[neg_idx, 'type'] = 'NegativeRegulation'
reg_edge['type'].value_counts()

PositiveRegulation    4896
NegativeRegulation    2833
Name: type, dtype: int64

My guess is that it is almost always a PhysicalEntity that regulates an Event. Let's see if that's true.

In [87]:
reg_edge['n1_label'].apply(lambda x: 'PhysicalEntity' in x).sum() == len(reg_edge)

True

In [88]:
reg_edge['n2_label'].apply(lambda x: 'Event' in x).sum() == len(reg_edge)

True

In [89]:
reg_edge['n1_label'] = 'PhysicalEntity'
reg_edge['n2_label'] = 'Event'

reg_edge.columns = ['pe_id', 'pe_name', 'pe_label', 'type', 'evt_id', 'evt_name', 'evt_label']

reg_edge.head(5)

Unnamed: 0,pe_id,pe_name,pe_label,type,evt_id,evt_name,evt_label
0,R-HSA-917723,ESCRT-III [cytosol],PhysicalEntity,PositiveRegulation,R-HSA-5682388,Autophagosome maturation,Event
1,R-HSA-5683632,UVRAG complex [cytosol],PhysicalEntity,PositiveRegulation,R-HSA-5682388,Autophagosome maturation,Event
2,R-MMU-5683632,UVRAG complex [cytosol],PhysicalEntity,PositiveRegulation,R-MMU-5682388,Autophagosome maturation,Event
3,R-MMU-917723,ESCRT-III [cytosol],PhysicalEntity,PositiveRegulation,R-MMU-5682388,Autophagosome maturation,Event
4,R-RNO-5683632,UVRAG complex [cytosol],PhysicalEntity,PositiveRegulation,R-RNO-5682388,Autophagosome maturation,Event


In [90]:
query_results['reg_edge'] = reg_edge

# Save and exit

In [91]:
nb_name = '02a_Reactome_DB_Queries'
out_dir = Path('../2_pipeline/').joinpath(nb_name, 'out').resolve()

# Make the output directory if it doesn't already exist
out_dir.mkdir(parents=True, exist_ok=True)

for name, df in query_results.items():
    df.to_csv(out_dir.joinpath(name+'.csv'), index=False)

#### Shut down the Neo4j server

In [92]:
subprocess.call(command.replace('start', 'stop'), shell=True)

0