In [1]:
# @name
# @description
# @author
# @date

# Description

This notebook creates the first review network and derives hypotheses from it.

* Review network: built from the Monarch knowledge graph and seeded with 8 nodes. We retrieved the explicit relationships of each seed node, then all the relationships among the resulting nodes. Seed nodes:

    - 'MONDO:0014109', # NGLY1 deficiency
    - 'HGNC:17646', # NGLY1 human gene
    - 'HGNC:633', # AQP1 human gene
    - 'MGI:103201', # AQP1 mouse gene
    - 'HGNC:7781', # NRF1 human gene (note from Ginger: known as NFE2L1, http://biogps.org/#goto=genereport&id=4779)
    - 'HGNC:24622', # ENGASE human gene
    - 'HGNC:636', # AQP3 human gene
    - 'HGNC:19940' # AQP11 human gene
    

* Connecting paths: query templates.

In [11]:
#import time
import transcriptomics, regulation, curation, monarch, graph, neo4jlib, hypothesis, summary

## Edges library
### Review edges to integrate into the knowledge graph
#### import transcriptomics
We retrieved edges from RNA-seq transcriptomics profiles using the transcriptomics module:

    - Experimental data sets: from the Chow et al. paper [pmid:29346549] (NGLY1 deficiency model in fruit fly)

In [7]:
# prepare data to graph schema
# raw_data: '~/workspace/ngly1-graph/regulation/ngly1-fly-chow-2018/data/supp_table_1.csv' 
# save at '/transcriptomics/ngly1-fly-chow-2018/data'
data = transcriptomics.read_data()


* This is the size of the raw expression data structure: (15370, 9)
* These are the expression attributes: Index(['FlyBase ID', 'Symbol', 'baseMean', 'log2FoldChange', 'Unnamed: 4',
       'lfcSE', 'stat', 'pvalue', 'padj'],
      dtype='object')
* This is the first record:
    FlyBase ID  Symbol    baseMean  log2FoldChange  Unnamed: 4     lfcSE  \
0  FBgn0030880  CG6788  175.577087       -4.209283     0.05406  0.190308   

        stat         pvalue           padj  
0 -22.118249  2.110000e-108  2.860000e-104  


In [8]:
transcriptomics.clean_data(data)


* This is the size of the clean expression data structure: (386, 6)
* These are the clean expression attributes: Index(['FlyBase ID', 'Symbol', 'log2FoldChange', 'pvalue', 'padj',
       'Regulation'],
      dtype='object')
* This is the first record:
    FlyBase ID Symbol  log2FoldChange        pvalue      padj   Regulation
0  FBgn0035904  GstO3        0.576871  2.130000e-08  0.000002  Upregulated
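
The cleaning step reduces the raw table from 15370 rows to 386 and adds a `Regulation` column. The criteria used inside `transcriptomics.clean_data` are not shown in this notebook, so the following is only a minimal sketch of that kind of filter. The thresholds `PADJ_MAX` and `ABS_LOG2FC_MIN`, the function name `clean_expression` and the two toy rows are assumptions for illustration.

```python
# Hypothetical thresholds: the real values used by transcriptomics.clean_data
# are not shown in the notebook.
PADJ_MAX = 0.05
ABS_LOG2FC_MIN = 0.5


def clean_expression(rows, padj_max=PADJ_MAX, abs_lfc_min=ABS_LOG2FC_MIN):
    """Keep significant rows and label the direction of regulation."""
    cleaned = []
    for row in rows:
        padj = row.get("padj")
        lfc = row.get("log2FoldChange")
        if padj is None or lfc is None:
            continue  # skip genes without test statistics
        if padj >= padj_max or abs(lfc) < abs_lfc_min:
            continue  # not significant, or change too small
        cleaned.append({
            "FlyBase ID": row["FlyBase ID"],
            "Symbol": row["Symbol"],
            "log2FoldChange": lfc,
            "pvalue": row["pvalue"],
            "padj": padj,
            "Regulation": "Upregulated" if lfc > 0 else "Downregulated",
        })
    return cleaned


rows = [
    # Record taken from the notebook output.
    {"FlyBase ID": "FBgn0035904", "Symbol": "GstO3",
     "log2FoldChange": 0.576871, "pvalue": 2.13e-08, "padj": 0.000002},
    # Toy records (not real genes), added to exercise both branches.
    {"FlyBase ID": "toy_down", "Symbol": "toyA",
     "log2FoldChange": -2.0, "pvalue": 1e-05, "padj": 0.001},
    {"FlyBase ID": "toy_ns", "Symbol": "toyB",
     "log2FoldChange": 0.1, "pvalue": 0.5, "padj": 0.9},
]
clean = clean_expression(rows)
for record in clean:
    print(record["Symbol"], record["Regulation"])
```

Keeping the thresholds as keyword arguments makes it easy to rerun the filter with the values actually used in the module.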


In [9]:
transcriptomics.prepare_data_edges()


* This is the size of the expression data structure: (386, 6)
* These are the expression attributes: Index(['FlyBase ID', 'Symbol', 'log2FoldChange', 'pvalue', 'padj',
       'Regulation'],
      dtype='object')
* This is the first record:
    FlyBase ID Symbol  log2FoldChange        pvalue      padj   Regulation
0  FBgn0035904  GstO3        0.576871  2.130000e-08  0.000002  Upregulated


In [11]:
edges = transcriptomics.prepare_rna_edges()


* This is the size of the expression data structure: (386, 13)
* These are the expression attributes: Index(['flybase_id', 'symbol', 'log2FoldChange', 'pvalue', 'padj',
       'regulation', 'source', 'subject_id', 'subject_label', 'property_id',
       'property_label', 'reference_id', 'object_id'],
      dtype='object')
* This is the first record:
    flybase_id symbol  log2FoldChange        pvalue      padj   regulation  \
0  FBgn0035904  GstO3        0.576871  2.130000e-08  0.000002  Upregulated   

  source           subject_id subject_label property_id  property_label  \
0   Chow  FlyBase:FBgn0033050          Pngl  RO:0002434  interacts with   

    reference_id            object_id  
0  PMID:29346549  FlyBase:FBgn0035904  

* This is the size of the edges data structure: (386, 12)
* These are the edges attributes: Index(['subject_id', 'subject_label', 'property_id', 'property_label',
       'object_id', 'object_label', 'log2FoldChange', 'pvalue', 'fdr',
       'regulation', 'sou

In [12]:
# build network
transcriptomics.build_edges(edges)


* This is the size of the edges file data structure: (386, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
             object_id property_description property_id  property_label  \
0  FlyBase:FBgn0035904                   NA  RO:0002434  interacts with   

                                property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_0002434     2018-03-15   

                           reference_supporting_text  \
0  To understand how loss of NGLY1 contributes to...   

                                  reference_uri           subject_id  
0  https://www.ncbi.nlm.nih.gov/pubmed/29346549  FlyBase:FBgn0033050  
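
The edges file uses nine columns, with Pngl (`FlyBase:FBgn0033050`) as subject and each differentially expressed gene as object. As a rough sketch of how one expression row could be turned into such a record, the code below reuses the constant values visible in the output above. The function names `expression_row_to_edge` and `edges_to_csv` are hypothetical, and the supporting text is a placeholder because the real text is truncated in the output.

```python
import csv
import io

# Column order as printed for the edges file.
EDGE_COLUMNS = [
    "object_id", "property_description", "property_id", "property_label",
    "property_uri", "reference_date", "reference_supporting_text",
    "reference_uri", "subject_id",
]


def expression_row_to_edge(row):
    """Build one edge record (hypothetical helper) from an expression row."""
    return {
        "subject_id": "FlyBase:FBgn0033050",  # Pngl
        "property_id": "RO:0002434",
        "property_label": "interacts with",
        "property_description": "NA",
        "property_uri": "http://purl.obolibrary.org/obo/RO_0002434",
        "object_id": "FlyBase:" + row["FlyBase ID"],
        "reference_uri": "https://www.ncbi.nlm.nih.gov/pubmed/29346549",
        "reference_date": "2018-03-15",
        "reference_supporting_text": "NA",  # placeholder, real text not shown
    }


def edges_to_csv(edges):
    """Serialise edge records to CSV text using the fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=EDGE_COLUMNS)
    writer.writeheader()
    writer.writerows(edges)
    return buffer.getvalue()


edge = expression_row_to_edge({"FlyBase ID": "FBgn0035904", "Symbol": "GstO3"})
csv_text = edges_to_csv([edge])
print(csv_text)
```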


In [13]:
transcriptomics.build_nodes(edges)

* Total number of nodes: 386
querying 1-386...done.
Finished.

* This is the size of the nodes file data structure: (386, 6)
* These are the nodes attributes: Index(['description', 'id', 'name', 'preflabel', 'semantic_groups',
       'synonyms'],
      dtype='object')
* This is the first record:
  description                   id         name preflabel semantic_groups  \
0         NaN  FlyBase:FBgn0033050  PNGase-like      Pngl            GENE   

                         synonyms  
0  CG7865|Dmel\CG7865|PNGase|png1  


###### At once

In [19]:
%%time
# prepare data to graph schema
data = transcriptomics.read_data()
transcriptomics.clean_data(data)
transcriptomics.prepare_data_edges()
edges = transcriptomics.prepare_rna_edges()

# build network
transcriptomics.build_edges(edges)
transcriptomics.build_nodes(edges)


* This is the size of the raw expression data structure: (15370, 9)
* These are the expression attributes: Index(['FlyBase ID', 'Symbol', 'baseMean', 'log2FoldChange', 'Unnamed: 4',
       'lfcSE', 'stat', 'pvalue', 'padj'],
      dtype='object')
* This is the first record:
    FlyBase ID  Symbol    baseMean  log2FoldChange  Unnamed: 4     lfcSE  \
0  FBgn0030880  CG6788  175.577087       -4.209283     0.05406  0.190308   

        stat         pvalue           padj  
0 -22.118249  2.110000e-108  2.860000e-104  

* This is the size of the clean expression data structure: (386, 6)
* These are the clean expression attributes: Index(['FlyBase ID', 'Symbol', 'log2FoldChange', 'pvalue', 'padj',
       'Regulation'],
      dtype='object')
* This is the first record:
    FlyBase ID Symbol  log2FoldChange        pvalue      padj   Regulation
0  FBgn0035904  GstO3        0.576871  2.130000e-08  0.000002  Upregulated

* This is the size of the expression data structure: (386, 6)
* These are the

The network is returned only as CSV files in graph/.

#### import regulation

In [12]:
# prepare msigdb data
regulation.prepare_msigdb_data()


* Number of Transcription Factor Targets (TFT) gene sets: 615


{'AAANWWTGC_UNKNOWN': ['4208',
  '481',
  '6095',
  '10370',
  '351',
  '4216',
  '493',
  '2904',
  '10745',
  '9874',
  '8929',
  '4620',
  '6502',
  '23051',
  '7050',
  '51747',
  '10891',
  '81628',
  '9043',
  '5595',
  '5159',
  '1946',
  '10890',
  '10783',
  '148872',
  '2048',
  '54885',
  '1756',
  '133584',
  '1781',
  '55534',
  '23471',
  '10602',
  '10439',
  '148170',
  '11278',
  '340419',
  '136259',
  '4916',
  '2045',
  '10500',
  '6423',
  '5612',
  '2252',
  '115752',
  '2000',
  '2742',
  '3769',
  '23189',
  '7871',
  '7026',
  '7091',
  '10368',
  '3624',
  '196264',
  '6444',
  '6272',
  '23598',
  '261729',
  '8499',
  '23193',
  '4826',
  '81618',
  '4610',
  '5122',
  '11061',
  '53335',
  '5533',
  '604',
  '4302',
  '30012',
  '7058',
  '22839',
  '5514',
  '1996',
  '4281',
  '8324',
  '79590',
  '4893',
  '1012',
  '1843',
  '54796',
  '23338',
  '57633',
  '58499',
  '80344',
  '929',
  '801',
  '8899',
  '1112',
  '9627',
  '4314',
  '2961',
  '443',


In [13]:
# prepare individual networks
data = regulation.load_tf_gene_edges()

In [14]:
dicts = regulation.get_gene_id_normalization_dictionaries(data)


* Querying BioThings to map gene symbols to hgnc and entrez IDs...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-3071...done.
Finished.
53 input query terms found no hit:
	['LMO2COM', 'CACCCBINDINGFACTOR', 'E2F4DP1', 'FOX', 'NFMUE1', 'MMEF2', 'TCF1P', 'IK3', 'INSAF', 'GNC
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

* Querying BioThings to map entrez to hgnc IDs and gene symbols...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-16632...done.
Finished.
137 input query terms found no hit:
	['136015',
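
The normalization step maps gene symbols and Entrez IDs to HGNC IDs and reports the terms that found no hit. The real lookup queries BioThings. The sketch below replaces that service with a small local dictionary so that the bookkeeping of hits and misses is visible. The function name `normalize_ids` and the dictionary contents are assumptions for illustration.

```python
def normalize_ids(terms, mapping):
    """Split terms into those with a mapped ID and those with no hit."""
    normalized = {}
    missing = []
    for term in terms:
        hit = mapping.get(term)
        if hit is None:
            missing.append(term)  # reported as "found no hit"
        else:
            normalized[term] = hit
    return normalized, missing


# Local stand-in for the remote lookup (IDs taken from the seed list).
symbol_to_hgnc = {"NGLY1": "HGNC:17646", "AQP1": "HGNC:633"}

normalized, missing = normalize_ids(["NGLY1", "AQP1", "E2F4DP1"], symbol_to_hgnc)
print(normalized)
print("{} input query terms found no hit: {}".format(len(missing), missing))
```

Returning the misses as a list, instead of dropping them silently, makes it possible to trap retired or composite identifiers later, as the nodes step does.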

In [15]:
regulation.prepare_data_edges(data, dicts)

In [16]:
# prepare regulation network
network = regulation.prepare_regulation_edges()

In [17]:
# build regulation network
regulation.build_edges(network)


* This is the size of the edges file data structure: (197267, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
   object_id property_description property_id  property_label  \
0  HGNC:8803                   NA  RO:0002434  interacts with   

                                property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_0002434     2007-01-01   

                           reference_supporting_text  \
0  This edge comes from the TRED dataset in "tfta...   

                                  reference_uri subject_id  
0  https://www.ncbi.nlm.nih.gov/pubmed/17202159  HGNC:8615  


In [18]:
regulation.build_nodes(network)

* Total number of nodes: 16963

* Trap genes without gene symbol, i.e. genes with discontinued entrez ID...
* Number of concepts without gene symbol: 137
* Check that all genes without gene symbol are identified by entrez ID...
* Number of concepts without gene symbol by namespace:  NCBIGene    137
Name: id, dtype: int64

* Querying BioThings to map retired entrez to gene symbols...
querying 1-137...done.
Finished.
88 input query terms found no hit:
	['79907', '79911', '93333', '121301', '143902', '146856', '151720', '197379', '219392', '221943', '2
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

* Querying BioThings to retrieve node attributes...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 1

###### At once

In [14]:
%%time
# prepare msigdb data
regulation.prepare_msigdb_data()
# prepare individual networks
data = regulation.load_tf_gene_edges()
dicts = regulation.get_gene_id_normalization_dictionaries(data)
regulation.prepare_data_edges(data, dicts)
# prepare regulation network
network = regulation.prepare_regulation_edges()
# build regulation network
regulation.build_edges(network)
regulation.build_nodes(network)


* Number of Transcription Factor Targets (TFT) gene sets: 615

* Querying BioThings to map gene symbols to hgnc and entrez IDs...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-3071...done.
Finished.
53 input query terms found no hit:
	['E2F1DP1RB', 'FOX', 'ALPHACP1', 'E2F4DP1', 'CREBP1CJUN', 'CDPCR3HD', 'TAL1BETAE47', 'ISRE', 'IK3', 
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

* Querying BioThings to map entrez to hgnc IDs and gene symbols...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-16632...

The network is returned only as CSV files in graph/.

#### import curation

In [20]:
# prepare curated edges and nodes
# from file:
edges_df, nodes_df = curation.read_data()


 Read data from curation/data/v2018 dir...

* Number of curated edges: 321

* Number of curated nodes: 288
CPU times: user 16.6 ms, sys: 16 ms, total: 32.6 ms
Wall time: 27.4 ms


In [21]:
curated_graph_df = curation.prepare_curated_edges(edges_df)


ID conversion: from ngly1 curated network to graph schema...

Mapping genes to HGNC ID...
querying 1-18...done.
Finished.

Adding diseases to MONDO ID network...

Adding gene to protein network...
querying 1-43...done.
Finished.

Drop duplicated gene-protein relations...


In [22]:
curated_graph_nodes_df = curation.prepare_curated_nodes(nodes_df)


Mapping genes to HGNC ID...
querying 1-18...done.
Finished.

Adding diseases described by the MONDO ontology...

Adding Name attribute: gene names from BioThings...
querying 1-298...done.
Finished.
38 input query terms found dup hits:
	[('or', 5), ('NGLY1', 3), ('of', 16), ('1', 14), ('by', 9), ('MRS', 9), ('CSF', 8), ('acid', 4), ('B
591 input query terms found no hit:
	['NGLY1-deficiency', 'misfolded', 'incompletely', 'synthesized', 'protein', 'catabolic', 'process', 
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

Preparing encoding genes from ngly1 curated network...

Adding BioThings annotation: gene symbol, name, synonyms, description...
querying 1-43...done.
Finished.


In [23]:
# build edges and nodes files
edges_l = curation.build_edges(curated_graph_df)


Save curated graph edges file at graph/...

* This is the size of the edges file data structure: (362, 10)
* These are the edges attributes: Index(['g2p_mark', 'object_id', 'property_description', 'property_id',
       'property_label', 'property_uri', 'reference_date',
       'reference_supporting_text', 'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
  g2p_mark     object_id property_description property_id property_label  \
0        0  DOID:0060728                  NaN  RO:0002200  has phenotype   

  property_uri reference_date  \
0          NaN     2016-07-07   

                           reference_supporting_text  \
0  NGLY1 deficiency (OMIM 610661 and 615273), or ...   

                                  reference_uri  subject_id  
0  https://www.ncbi.nlm.nih.gov/pubmed/27388694  HGNC:17646  


In [24]:
nodes_l = curation.build_nodes(curated_graph_nodes_df)


Save curated graph nodes file at graph/...

* This is the size of the nodes file data structure: (302, 6)
* These are the nodes attributes: Index(['description', 'id', 'name', 'preflabel', 'semantic_groups',
       'synonyms'],
      dtype='object')
* This is the first record:
     description            id name         preflabel semantic_groups  \
0  Human disease  DOID:0060728  NaN  NGLY1-deficiency            DISO   

                                            synonyms  
0  congenital disorder of deglycosylation|congeni...  


###### At once

In [15]:
%%time
# prepare curated edges and nodes
# from file:
edges_df, nodes_df = curation.read_data()
curated_graph_df = curation.prepare_curated_edges(edges_df)
curated_graph_nodes_df = curation.prepare_curated_nodes(nodes_df)

# build edges and nodes files
edges_l = curation.build_edges(curated_graph_df)
nodes_l = curation.build_nodes(curated_graph_nodes_df)


 Read data from curation/data/v2018 dir...

* Number of curated edges: 321

* Number of curated nodes: 288

ID conversion: from ngly1 curated network to graph schema...

Mapping genes to HGNC ID...
querying 1-18...done.
Finished.

Adding diseases to MONDO ID network...

Adding gene to protein network...
querying 1-43...done.
Finished.

Drop duplicated gene-protein relations...

Mapping genes to HGNC ID...
querying 1-18...done.
Finished.

Adding diseases described by the MONDO ontology...

Adding Name attribute: gene names from BioThings...
querying 1-298...done.
Finished.
38 input query terms found dup hits:
	[('or', 5), ('NGLY1', 3), ('of', 16), ('1', 14), ('by', 9), ('MRS', 9), ('CSF', 8), ('acid', 4), ('B
591 input query terms found no hit:
	['NGLY1-deficiency', 'misfolded', 'incompletely', 'synthesized', 'protein', 'catabolic', 'process', 
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

Preparing encoding genes from ngly1 curated network...

Ad

The network is returned both as a CSV file and as an in-memory object.

#### import monarch
We retrieved edges from Monarch using the monarch module:

    - From the 8 seed nodes we retrieved the 1st shell (their direct neighbours)
    - From all seed and 1st-shell nodes we retrieved the edges among them
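
The two retrieval steps above can be sketched on a toy edge list. This is not the `monarch` module implementation, which queries the Monarch service. The function names `first_shell` and `edges_among` and the toy nodes `A` to `D` are assumptions for illustration.

```python
def first_shell(seeds, edges):
    """Return the nodes directly connected to any seed (seeds excluded)."""
    seed_set = set(seeds)
    neighbours = set()
    for subject, obj in edges:
        if subject in seed_set and obj not in seed_set:
            neighbours.add(obj)
        if obj in seed_set and subject not in seed_set:
            neighbours.add(subject)
    return neighbours


def edges_among(nodes, edges):
    """Keep only the edges whose two ends are both in the node set."""
    node_set = set(nodes)
    return [(s, o) for s, o in edges if s in node_set and o in node_set]


# Toy edge list: A and B touch a seed, C and D do not.
toy_edges = [
    ("HGNC:17646", "A"),
    ("B", "HGNC:633"),
    ("A", "B"),
    ("B", "C"),
    ("C", "D"),
]
seeds = ["HGNC:17646", "HGNC:633"]

shell = first_shell(seeds, toy_edges)
network_nodes = sorted(set(seeds) | shell)
toy_network = edges_among(network_nodes, toy_edges)
print("nodes:", len(network_nodes))
print("edges:", len(toy_network))
```

The edge `A`-`B` is kept although neither end is a seed, which is the point of the second step: it captures relationships among first-shell nodes.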

%%time
# seed nodes
seedList = [ 
    'MONDO:0014109', # NGLY1 deficiency
    'HGNC:17646', # NGLY1 human gene
    'HGNC:633', # AQP1 human gene
    'MGI:103201', # AQP1 mouse gene
    'HGNC:7781', # NRF1 human gene* Ginger: known as NFE2L1. http://biogps.org/#goto=genereport&id=4779
    'HGNC:24622', # ENGASE human gene
    'HGNC:636', # AQP3 human gene
    'HGNC:19940' # AQP11 human gene
]

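# FASTER: reduced seed list for a quick run. It overrides the full list above; remove it to use all 8 seeds.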
seedList = [ 
    'MONDO:0014109', # NGLY1 deficiency
    'HGNC:19940' # AQP11 human gene
]

# get 1st layer
neighbourList = monarch.get_neighbours_list(seedList)

# get network nodes list: seed + 1st layer
geneList = seedList + neighbourList
print('genelist: ',len(geneList))

# get edge expansion
network = monarch.expand_edges(geneList)
print('network: ',len(network))

# save network
monarch.print_network(network, 'monarch_8seeds_1shell_network')

In [22]:
import monarch
# build monarch graph from monarch connections network
monarch_connections = monarch.read_connections() # OR monarch_network = read_connections()
print('### len of monarch_connections input:',len(monarch_connections))


* This is the size of the data structure: (32715, 7)
* These are the attributes: Index(['subject_id', 'subject_label', 'relation_id', 'relation_label',
       'object_id', 'object_label', 'reference_id_list'],
      dtype='object')
* This is the first record:
  subject_id subject_label relation_id  relation_label    object_id  \
0   RGD:2145          Aqp7  RO:0002434  interacts with  RGD:1303263   

  object_label reference_id_list  
0          Gk2               NaN  
### len of monarch_connections input: 32715


In [23]:
monarch_edges = monarch.build_edges(monarch_connections)
print('### len of monarch_edges output:',len(monarch_edges))


* This is the size of the edges file data structure: (32715, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
     object_id property_description property_id  property_label  \
0  RGD:1303263                   NA  RO:0002434  interacts with   

                                property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_0002434             NA   

                           reference_supporting_text reference_uri subject_id  
0  This edge comes from the Monarch Knowledge Gra...            NA   RGD:2145  
### len of monarch_edges output: 32715


###### At once

In [24]:
%%time
# build monarch graph from monarch connections network
monarch_connections = monarch.read_connections() # OR monarch_network = read_connections()
print('### len of monarch_connections input:',len(monarch_connections))
monarch_edges = monarch.build_edges(monarch_connections)
print('### len of monarch_edges output:',len(monarch_edges))
monarch_nodes = monarch.build_nodes(monarch_connections)
print('### len of monarch_nodes output:',len(monarch_nodes))


* This is the size of the data structure: (32715, 7)
* These are the attributes: Index(['subject_id', 'subject_label', 'relation_id', 'relation_label',
       'object_id', 'object_label', 'reference_id_list'],
      dtype='object')
* This is the first record:
  subject_id subject_label relation_id  relation_label    object_id  \
0   RGD:2145          Aqp7  RO:0002434  interacts with  RGD:1303263   

  object_label reference_id_list  
0          Gk2               NaN  
### len of monarch_connections input: 32715

* This is the size of the edges file data structure: (32715, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
     object_id property_description property_id  property_label  \
0  RGD:1303263                   NA  RO:0002434  interacts with   

            

The network is returned both as a CSV file and as an in-memory object.

## Graph library
### Create the review knowledge graph
#### import graph

Tasks:

* Load Networks and calculate graph nodes
* Monarch graph connectivity
* Build graph
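
The tasks above can be sketched as follows: align every network to the nine shared edge columns, concatenate them, derive the graph nodes from the edge ends, and check that each node has an annotation. This is a toy sketch, not the `graph` module code. The function names are hypothetical and the records are reduced to a few fields.

```python
EDGE_COLUMNS = [
    "subject_id", "property_id", "object_id", "reference_uri",
    "reference_supporting_text", "reference_date", "property_label",
    "property_description", "property_uri",
]


def concat_edges(*networks):
    """Concatenate networks, keeping only the shared edge columns."""
    graph_edges = []
    for network in networks:
        for edge in network:
            graph_edges.append({col: edge.get(col, "NA") for col in EDGE_COLUMNS})
    return graph_edges


def graph_node_ids(edges):
    """Graph nodes are the IDs that appear as subject or object of an edge."""
    return {e["subject_id"] for e in edges} | {e["object_id"] for e in edges}


def annotate(node_ids, *annotations):
    """Keep one annotation per graph node; report nodes left unannotated."""
    annotated = {}
    for node_table in annotations:
        for node in node_table:
            if node["id"] in node_ids and node["id"] not in annotated:
                annotated[node["id"]] = node
    missing = node_ids - set(annotated)
    return list(annotated.values()), missing


curated = [{"subject_id": "HGNC:17646", "property_id": "RO:0002200",
            "object_id": "DOID:0060728", "g2p_mark": 0}]  # extra column
monarch_net = [{"subject_id": "RGD:2145", "property_id": "RO:0002434",
                "object_id": "RGD:1303263"}]

toy_edges = concat_edges(curated, monarch_net)
node_ids = graph_node_ids(toy_edges)

curated_nodes = [{"id": "HGNC:17646"}, {"id": "DOID:0060728"}]
monarch_nodes = [{"id": "RGD:2145"}, {"id": "RGD:1303263"}]
regulation_nodes = [{"id": "HGNC:8803"}]  # annotated but not in the graph

toy_nodes, missing = annotate(node_ids, curated_nodes, monarch_nodes, regulation_nodes)
print(len(toy_edges), len(toy_nodes), missing)
```

An empty `missing` set plays the same role as the `diff set()` line printed by the nodes step: every node referenced by an edge has an annotation.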

In [25]:
# build graph
edges = graph.build_edges()


Preparing networks...
Curated:
(362, 10)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri', 'g2p_mark'],
      dtype='object')
Monarch:
(226556, 9)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')
Transcriptomics:
(386, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
Regulatory:
(9723, 9)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')

Concatenating into a graph...
(237027, 9)

D

In [26]:
nodes = graph.build_nodes(edges)


Preparing networks...
Curated:
(302, 6)
Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'description',
       'name'],
      dtype='object')
Monarch:
(4644, 6)
Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'description',
       'name'],
      dtype='object')
Transcriptomics:
(386, 6)
Index(['description', 'id', 'name', 'preflabel', 'semantic_groups',
       'synonyms'],
      dtype='object')
Regulatory:
(16963, 6)
Index(['description', 'id', 'name', 'preflabel', 'semantic_groups',
       'synonyms'],
      dtype='object')

Annotating nodes in the graph...
graph from e (9365, 1)
annotation check
curated (302, 6)
monarch (4644, 6)
rna (386, 6)
regulation (4226, 6)

Concatenating all nodes...
graph ann (9558, 6)
diff set()

Drop duplicated rows...
(9424, 6)

Drop duplicated nodes...
(9365, 6)

All graph nodes are annotated.
Regulation nodes not in the graph: 12737

Saving final graph...
(9365, 6)
Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'name',
       

###### At once

In [7]:
%%time
# build graph
edges = graph.build_edges()
nodes = graph.build_nodes(edges)


Preparing networks...
Curated:
(362, 10)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri', 'g2p_mark'],
      dtype='object')
Monarch:
(226556, 9)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')
Transcriptomics:
(386, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
Regulatory:
(9723, 9)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')

Concatenating into a graph...
(237027, 9)

D

## Neo4jlib library
### Import the graph into Neo4j graph database
#### import neo4jlib

In [27]:
# import into the graph interface (Neo4j for now)
## get edges and nodes dataframes for neo4j
edges_df = neo4jlib.get_dataframe(edges)

In [28]:
nodes_df = neo4jlib.get_dataframe(nodes)

In [29]:
statements = neo4jlib.get_statements(edges_df)

In [30]:
concepts = neo4jlib.get_concepts(nodes_df)

In [31]:
## import the graph into neo4j
# save files into neo4j import dir
neo4j_path = '/home/nuria/ngly1-graph/neo4j-graphs/neo4j/neo4j-lib'
#neo4j_path = './neo4j-community-3.0.3'
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type = 'statements')


File '/home/nuria/ngly1-graph/neo4j-graphs/neo4j/neo4j-lib/import/ngly1/ngly1_statements.csv' saved.


In [32]:
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type = 'concepts')


File '/home/nuria/ngly1-graph/neo4j-graphs/neo4j/neo4j-lib/import/ngly1/ngly1_concepts.csv' saved.


In [33]:
# import graph into neo4j
neo4jlib.do_import(neo4j_path)


The graph is imported into the server. The server is running.
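
The saved files follow the layout `<neo4j_path>/import/ngly1/ngly1_<file_type>.csv`, as the messages above show. A minimal sketch of a writer that reproduces that layout is given below. It is not the `neo4jlib.save_neo4j_files` implementation: the function name `save_import_file` and the column names of the toy row are assumptions, and a temporary directory stands in for the Neo4j installation.

```python
import csv
import os
import tempfile


def save_import_file(rows, neo4j_path, file_type):
    """Write rows to <neo4j_path>/import/ngly1/ngly1_<file_type>.csv."""
    import_dir = os.path.join(neo4j_path, "import", "ngly1")
    os.makedirs(import_dir, exist_ok=True)
    path = os.path.join(import_dir, "ngly1_{}.csv".format(file_type))
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    print("File '{}' saved.".format(path))
    return path


# Toy statement with hypothetical column names.
toy_statements = [
    {"subject_id": "HGNC:17646", "property_id": "RO:0002200",
     "object_id": "DOID:0060728"},
]
toy_neo4j_path = tempfile.mkdtemp()  # stand-in for the Neo4j directory
saved_path = save_import_file(toy_statements, toy_neo4j_path, "statements")
```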



###### At once

In [11]:
%%time
# import into the graph interface (Neo4j for now)
## get edges and nodes dataframes for neo4j
edges_df = neo4jlib.get_dataframe(edges)
nodes_df = neo4jlib.get_dataframe(nodes)
statements = neo4jlib.get_statements(edges_df)
concepts = neo4jlib.get_concepts(nodes_df)
print('statements: ', len(statements))
print('concepts: ',len(concepts))

## import the graph into neo4j
# save files into neo4j import dir
neo4j_path = '/home/nuria/ngly1-graph/neo4j-graphs/neo4j/neo4j-lib'
#neo4j_path = './neo4j-community-3.0.3'
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type = 'statements')
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type = 'concepts')

# import graph into neo4j
neo4jlib.do_import(neo4j_path)

statements:  237027
concepts:  9365

File '/home/nuria/ngly1-graph/neo4j-graphs/neo4j/neo4j-lib/import/ngly1/ngly1_statements.csv' saved.

File '/home/nuria/ngly1-graph/neo4j-graphs/neo4j/neo4j-lib/import/ngly1/ngly1_concepts.csv' saved.

The graph is imported into the server. The server is running.

CPU times: user 3.03 s, sys: 200 ms, total: 3.23 s
Wall time: 3.29 s


Alternatively, import from file:

In [12]:
%%time
import neo4jlib
# import into the graph interface (Neo4j for now)
## get edges and nodes dataframes for neo4j
edges = neo4jlib.get_dataframe_from_file('./graph/graph_edges_v2019-02-22')
nodes = neo4jlib.get_dataframe_from_file('./graph/graph_nodes_v2019-02-22')
statements = neo4jlib.get_statements(edges)
concepts = neo4jlib.get_concepts(nodes)
print('statements: ', len(statements))
print('concepts: ',len(concepts))

## import the graph into neo4j
# save files into neo4j import dir
neo4j_path = './neo4j-community-3.0.3'
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type='statements')
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type='concepts')

# import graph into neo4j
neo4jlib.do_import(neo4j_path)


statements:  237027
concepts:  9365

File './neo4j-community-3.0.3/import/ngly1/ngly1_statements.csv' saved.

File './neo4j-community-3.0.3/import/ngly1/ngly1_concepts.csv' saved.

The graph is imported into the server. The server is running.

CPU times: user 3.6 s, sys: 228 ms, total: 3.83 s
Wall time: 11.4 s


## hypothesis-generation library
### Query the graph for mechanistic explanation, then summarize the extracted paths
#### import hypothesis, summary

### Orthopheno query with general nodes/relations removed
#### import the graph into neo4j from file

In [2]:
import neo4jlib, hypothesis, summary

In [4]:
%%time
# import into the graph interface (Neo4j for now)
## get edges and nodes dataframes for neo4j
edges_df = neo4jlib.get_dataframe_from_file('./graph/graph_edges_v2019-02-22')
nodes_df = neo4jlib.get_dataframe_from_file('./graph/graph_nodes_v2019-02-22')
statements = neo4jlib.get_statements(edges_df)
concepts = neo4jlib.get_concepts(nodes_df)
print('statements: ', len(statements))
print('concepts: ',len(concepts))

## import the graph into neo4j
# save files into neo4j import dir
neo4j_path = './neo4j-community-3.0.3'
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type = 'statements')
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type = 'concepts')

# import graph into neo4j
neo4jlib.do_import(neo4j_path)

statements:  237027
concepts:  9365

File './neo4j-community-3.0.3/import/ngly1/ngly1_statements.csv' saved.

File './neo4j-community-3.0.3/import/ngly1/ngly1_concepts.csv' saved.

The graph is imported into the server. The server is running.

CPU times: user 3.57 s, sys: 247 ms, total: 3.81 s
Wall time: 11.3 s


In [5]:
%%time
# get orthopheno paths
seed = list([
        'HGNC:17646',  # NGLY1 human gene
        'HGNC:633'  # AQP1 human gene
])
hypothesis.query(seed,queryname='ngly1_aqp1',port='7687') #http_port= 7470; bolt_port=7680

CPU times: user 466 ms, sys: 19.7 ms, total: 485 ms
Wall time: 4.94 s



Hypothesis generator has finished. 2 QUERIES completed.
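
Two seed nodes produce 2 queries. A plausible reading, consistent with the summary file names in this notebook that list both source/target orders, is that one query runs per ordered pair of seeds. The sketch below only enumerates those pairs. It assumes this pairing rule and does not reproduce the `hypothesis.query` code. The name `seed_pairs` is hypothetical.

```python
from itertools import permutations


def seed_pairs(seed):
    """Enumerate (source, target) pairs: every ordered pair of distinct seeds."""
    return list(permutations(seed, 2))


pairs = seed_pairs(["HGNC:17646", "HGNC:633"])
for source, target in pairs:
    print("source:", source, "target:", target)
print(len(pairs), "QUERIES")
```

Under this rule the number of queries grows as n * (n - 1) for n seeds, so the full 8-seed list would mean 56 queries.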


Checked manually that there are paths

In [6]:
%%time
# get orthopheno paths
seed = list([
        'HGNC:17646',  # NGLY1 human gene
        'HGNC:633'  # AQP1 human gene
])
hypothesis.query(seed, queryname='ngly1_aqp1', pwdegree='1000', phdegree='1000', port='7687')

CPU times: user 2.79 s, sys: 31.4 ms, total: 2.82 s
Wall time: 5.32 s



Hypothesis generator has finished. 2 QUERIES completed.


Checked manually that there are results.

**There is no automatic check that the number of results is different from 0.**
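
A check of this kind could look like the sketch below. The structure of the query output is not shown in this notebook, so the layout used here (a list of query results, each with a `paths` list) and the names `count_paths` and `check_has_paths` are assumptions for illustration.

```python
import json


def count_paths(query_results):
    """Total number of paths over all queries (assumed 'paths' key)."""
    return sum(len(query.get("paths", [])) for query in query_results)


def check_has_paths(query_results):
    """Raise an error when the queries returned no path at all."""
    total = count_paths(query_results)
    if total == 0:
        raise ValueError("The hypothesis query returned no paths.")
    return total


# Toy output in the assumed layout: one path found in one direction only.
toy_json = """
[
  {"source": "HGNC:17646", "target": "HGNC:633", "paths": [{"nodes": [], "edges": []}]},
  {"source": "HGNC:633", "target": "HGNC:17646", "paths": []}
]
"""
results = json.loads(toy_json)
total = check_has_paths(results)
print("paths found:", total)
```

Raising an error, instead of printing a message, would stop the notebook before the summary step runs on an empty result.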

In [7]:
%%time
import hypothesis
# get orthopheno paths
seed = list([
        'HGNC:17646',  # NGLY1 human gene
        'HGNC:633'  # AQP1 human gene
])
hypothesis.open_query(seed,queryname='ngly1_aqp1',port='7687')

CPU times: user 8.08 s, sys: 91.2 ms, total: 8.17 s
Wall time: 8.74 s



Hypothesis generator has finished. 2 QUERIES completed.


Results!  

* outfile: query_ngly1_aqp1_paths_v2019-02-22.json

In [9]:
%%time
import summary
# get summary
data = summary.path_load('./hypothesis/query_ngly1_aqp1_paths_v2019-02-22')

#parse data for summarization
data_parsed = list()
#funcs = [summary.metapaths, summary.nodes, summary.node_types, summary.edges, summary.edge_types]
for query in data:
    query_parsed = summary.query_parser(query)
    #metapath(query_parsed)
    #map(lambda x: x(query_parsed), funcs)
    data_parsed.append(query_parsed)
summary.metapaths(data_parsed)
summary.nodes(data_parsed)
summary.node_types(data_parsed)
summary.edges(data_parsed)
summary.edge_types(data_parsed)
#for query in data_parsed:
#    map(lambda x: x(query), funcs)


File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_v2019-02-21_source:HGNC:17646_target:HGNC:633_summary_metapaths_v2019-02-22.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_v2019-02-21_source:HGNC:17646_target:HGNC:633_summary_entities_in_metapaths_v2019-02-22.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_v2019-02-21_source:HGNC:633_target:HGNC:17646_summary_metapaths_v2019-02-22.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_v2019-02-21_source:HGNC:633_target:HGNC:17646_summary_entities_in_metapaths_v2019-02-22.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/monarch_orthopeno_network_query_source:HGNC:17646_target:HGNC:633_summary_nodes_v2019-02-22.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-ge
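
The metapath summary groups paths by their sequence of node and edge types. The sketch below shows the idea on toy paths. It is not the `summary.metapaths` implementation: the path layout (`nodes` and `edges` lists with a `type` field) and the function names are assumptions for illustration.

```python
from collections import Counter


def metapath(path):
    """Interleave node types and edge types into one metapath string."""
    node_types = [node["type"] for node in path["nodes"]]
    edge_types = [edge["type"] for edge in path["edges"]]
    parts = []
    for index, node_type in enumerate(node_types):
        parts.append(node_type)
        if index < len(edge_types):
            parts.append(edge_types[index])
    return " - ".join(parts)


def count_metapaths(paths):
    """Count how many paths share each metapath."""
    return Counter(metapath(path) for path in paths)


gene = {"type": "GENE"}
diso = {"type": "DISO"}
interacts = {"type": "interacts with"}
has_phenotype = {"type": "has phenotype"}

toy_paths = [
    {"nodes": [gene, gene, diso], "edges": [interacts, has_phenotype]},
    {"nodes": [gene, gene, diso], "edges": [interacts, has_phenotype]},
    {"nodes": [gene, gene], "edges": [interacts]},
]
counts = count_metapaths(toy_paths)
for pattern, number in counts.most_common():
    print(number, pattern)
```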

This seems to work. Only summaries for aqp1->ngly1 appear; check whether that is correct. Also check what the debugging part below is for.