In [1]:
# @name
# @description
# @author
# @date

# Description

This notebook builds the first review network and derives hypotheses from it.

* Intermediary variables come from workflow objects: in this workflow, the variables returned by one step are passed directly to the next step.


* Review network: from the Monarch knowledge graph, we built a network seeded by 8 nodes, retrieving their explicit relationships and then all the relationships among the resulting nodes. Seed nodes:

    - 'MONDO:0014109', # NGLY1 deficiency
    - 'HGNC:17646', # NGLY1 human gene
    - 'HGNC:633', # AQP1 human gene
    - 'MGI:103201', # AQP1 mouse gene
    - 'HGNC:7781', # NRF1 human gene (note from Ginger: known as NFE2L1, see http://biogps.org/#goto=genereport&id=4779)
    - 'HGNC:24622', # ENGASE human gene
    - 'HGNC:636', # AQP3 human gene
    - 'HGNC:19940' # AQP11 human gene
    

* Connecting paths: query templates.

In [2]:
#import time
import transcriptomics, regulation, curation, monarch, graph, neo4jlib, hypothesis, summary, utils

## Edges library
### Review edges to integrate into the knowledge graph
#### import transcriptomics
We retrieved edges from RNA-seq transcriptomics profiles using the transcriptomics module:

    - Experimental data set: Chow et al. [pmid:29346549] (NGLY1 deficiency model in fruit fly)

In [3]:
%%time
# prepare data to graph schema
csv_path = './transcriptomics/ngly1-fly-chow-2018/data/supp_table_1.csv'
data = transcriptomics.read_data(csv_path)
clean_data = transcriptomics.clean_data(data)
data_edges = transcriptomics.prepare_data_edges(clean_data)
rna_network = transcriptomics.prepare_rna_edges(data_edges)

# build network with graph schema
rna_edges = transcriptomics.build_edges(rna_network)
rna_nodes = transcriptomics.build_nodes(rna_network)


* This is the size of the raw expression data structure: (15370, 9)
* These are the expression attributes: Index(['FlyBase ID', 'Symbol', 'baseMean', 'log2FoldChange', 'Unnamed: 4',
       'lfcSE', 'stat', 'pvalue', 'padj'],
      dtype='object')
* This is the first record:
    FlyBase ID  Symbol    baseMean  log2FoldChange  Unnamed: 4     lfcSE  \
0  FBgn0030880  CG6788  175.577087       -4.209283     0.05406  0.190308   

        stat         pvalue           padj  
0 -22.118249  2.110000e-108  2.860000e-104  

* This is the size of the clean expression data structure: (386, 6)
* These are the clean expression attributes: Index(['FlyBase ID', 'Symbol', 'log2FoldChange', 'pvalue', 'padj',
       'Regulation'],
      dtype='object')
* This is the first record:
    FlyBase ID Symbol  log2FoldChange        pvalue      padj   Regulation
0  FBgn0035904  GstO3        0.576871  2.130000e-08  0.000002  Upregulated

* This is the size of the expression data structure: (386, 13)
* These are th

The network is returned both as CSV files in graph/ and as in-memory Python objects.
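
The cleaning step reported above shrinks the expression table from 15370 to 386 rows and adds a `Regulation` column. The notebook does not show the thresholds that `transcriptomics.clean_data` applies, so the following is only a minimal sketch of this kind of filter. The cutoffs `PADJ_MAX` and `LFC_MIN`, the function `label_regulation`, and the two example rows marked as made up are all assumptions, not part of the library.

```python
# Assumed cutoffs, for illustration only: the notebook does not show the
# values that transcriptomics.clean_data actually applies.
PADJ_MAX = 0.05
LFC_MIN = 0.5


def label_regulation(rows, padj_max=PADJ_MAX, lfc_min=LFC_MIN):
    """Keep significant rows and tag each one as up- or downregulated."""
    kept = []
    for row in rows:
        lfc = row["log2FoldChange"]
        if row["padj"] >= padj_max or abs(lfc) < lfc_min:
            continue  # not significant, or effect size too small
        regulation = "Upregulated" if lfc > 0 else "Downregulated"
        kept.append({**row, "Regulation": regulation})
    return kept


rows = [
    # GstO3 values are copied from the notebook output.
    {"FlyBase ID": "FBgn0035904", "Symbol": "GstO3",
     "log2FoldChange": 0.576871, "pvalue": 2.13e-08, "padj": 0.000002},
    # The next two rows are made up to exercise the other branches.
    {"FlyBase ID": "FBgn_example_1", "Symbol": "geneA",
     "log2FoldChange": -1.2, "pvalue": 1e-06, "padj": 1e-04},
    {"FlyBase ID": "FBgn_example_2", "Symbol": "geneB",
     "log2FoldChange": 0.1, "pvalue": 0.4, "padj": 0.9},
]
clean_rows = label_regulation(rows)
print([(r["Symbol"], r["Regulation"]) for r in clean_rows])
```

With these assumed cutoffs, the third row is dropped and the remaining two are labelled by the sign of `log2FoldChange`.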

In [4]:
# print type of objects
print('type edges:', type(rna_edges))
print('type nodes:', type(rna_nodes))
print()

# print objects sizes
print('len edges:', len(rna_edges))
print('len nodes:', len(rna_nodes))
print()

# print object attribute
print('attribute edges:', rna_edges[0].keys())
print('attribute nodes:', rna_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 386
len nodes: 386

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])
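
The printed types show that each network is a list of dictionaries sharing one set of edge keys. As a rough sketch of that record shape, the helper below (`make_edge`, a hypothetical name that is not part of the modules used here) fills every schema field and defaults missing ones to `'NA'`, the placeholder that appears in the Monarch output of this notebook. The example edge is copied from that output.

```python
# Edge keys in the order printed by the notebook.
EDGE_FIELDS = [
    "subject_id", "object_id", "property_id", "property_label",
    "property_description", "property_uri", "reference_uri",
    "reference_supporting_text", "reference_date",
]


def make_edge(subject_id, object_id, property_id, **attributes):
    """Build one edge record with every schema field present."""
    unknown = set(attributes) - set(EDGE_FIELDS)
    if unknown:
        raise ValueError(f"fields not in the edge schema: {sorted(unknown)}")
    edge = {field: "NA" for field in EDGE_FIELDS}
    edge.update(subject_id=subject_id, object_id=object_id, property_id=property_id)
    edge.update(attributes)
    return edge


# Values taken from the first record of the Monarch graph output.
edge = make_edge(
    "HGNC:12590", "HGNC:19986", "RO:0002434",
    property_label="interacts with",
    property_uri="http://purl.obolibrary.org/obo/RO_0002434",
)
print(list(edge.keys()) == EDGE_FIELDS)
```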


#### import regulation

In [5]:
%%time
# prepare msigdb data
gmt_path = './regulation/msigdb/data/c3.tft.v6.1.entrez.gmt'
regulation.prepare_msigdb_data(gmt_path)

# prepare individual networks
data = regulation.load_tf_gene_edges()
dicts = regulation.get_gene_id_normalization_dictionaries(data)
data_edges = regulation.prepare_data_edges(data, dicts)

# prepare regulation network
reg_network = regulation.prepare_regulation_edges(data_edges)

# build network with graph schema
reg_edges = regulation.build_edges(reg_network)
reg_nodes = regulation.build_nodes(reg_network)


* Number of Transcription Factor Targets (TFT) gene sets: 615

* Querying BioThings to map gene symbols to hgnc and entrez IDs...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-3071...done.
Finished.
53 input query terms found no hit:
	['NKX22', 'MEIS1BHOXA9', 'TAL1BETAE47', 'CEBPGAMMA', 'GNCF', 'NFKAPPAB65', 'E2F1DP1RB', 'MMEF2', 'E2
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

* Querying BioThings to map entrez to hgnc IDs and gene symbols...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-8000...done.
querying 8001-9000...done.
querying 9001-10000...done.
querying 10001-11000...done.
querying 11001-12000...done.
querying 12001-13000...done.
querying 13001-14000...done.
querying 14001-15000...done.
querying 15001-16000...done.
querying 16001-16632...

The network is returned both as a CSV file in graph/ and as in-memory Python objects.
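
The regulation step starts from an MSigDB `.gmt` file. In the GMT format each line is tab-separated: a gene set name, a description field, then the member genes. The sketch below reads such content into (gene set, gene) pairs. It illustrates the file format only: `read_gmt` and `gene_set_edges` are hypothetical helpers, the two sample lines are made up, and the way `regulation.prepare_msigdb_data` derives transcription factor names from set names is not shown in this notebook.

```python
import io


def read_gmt(handle):
    """Parse GMT lines into a dict: gene set name -> list of gene IDs."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


def gene_set_edges(gene_sets):
    """Flatten gene sets into (gene set name, gene ID) pairs."""
    return [(name, gene) for name, genes in gene_sets.items() for gene in genes]


# Made-up sample content in GMT layout.
sample = (
    "SET_A\thttp://example.org/SET_A\t101\t102\t103\n"
    "SET_B\thttp://example.org/SET_B\t102\t104\n"
)
gene_sets = read_gmt(io.StringIO(sample))
pairs = gene_set_edges(gene_sets)
print(len(gene_sets), len(pairs))
```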

In [6]:
# print type of objects
print('type edges:', type(reg_edges))
print('type nodes:', type(reg_nodes))
print()

# print objects sizes
print('len edges:', len(reg_edges))
print('len nodes:', len(reg_nodes))
print()

# print object attribute
print('attribute edges:', reg_edges[0].keys())
print('attribute nodes:', reg_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 197267
len nodes: 16963

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])


#### import curation

In [8]:
%%time
# graph v3.2
# read network from drive and concat all curated statements
curation_edges, curation_nodes = curation.read_network(version='v20180118')

# prepare data edges and nodes
data_edges = curation.prepare_data_edges(curation_edges)
data_nodes = curation.prepare_data_nodes(curation_nodes)

# prepare curated edges and nodes
curated_network = curation.prepare_curated_edges(data_edges)
curated_concepts = curation.prepare_curated_nodes(data_nodes)


# build edges and nodes files
curation_edges = curation.build_edges(curated_network)
curation_nodes = curation.build_nodes(curated_concepts)


Reading and concatenating all curated statements in the network...

* Curation edge files concatenated shape: (322, 22)

Reading and concatenating all curated nodes in the network...

* Curation node files concatenated shape: (318, 9)

Preparing curated network...

Drop duplicated rows...
Before drop: 322
After drop: 321

Save curated network at curation/...
Curated edges data structure shape: (321, 9)
Curated edges data structure fields: Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')

Preparing curated nodes...

Drop duplicated rows...
Before drop: 318
After drop: 288

Save curated nodes at curation/...
Curated nodes data structure shape: (288, 5)
Curated nodes data structure fields: Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'description'], dtype='object')

ID conversion: from ngly1 curated network to graph sch

The network is returned both as a CSV file and as in-memory Python objects.
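
The curation log above reports dropped duplicates (322 to 321 edges, 318 to 288 nodes). A minimal sketch of row-level deduplication on a chosen set of fields is given below. The function `drop_duplicate_rows` and the sample rows are illustrative assumptions; the `curation` module presumably works on data frames and its exact criteria are not shown here.

```python
def drop_duplicate_rows(rows, fields):
    """Return rows in original order, keeping the first of each duplicate."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row.get(field) for field in fields)
        if key in seen:
            continue
        seen.add(key)
        unique.append(row)
    return unique


# Made-up statements: the second row repeats the first.
rows = [
    {"subject_id": "n1", "property_id": "p1", "object_id": "n2"},
    {"subject_id": "n1", "property_id": "p1", "object_id": "n2"},
    {"subject_id": "n2", "property_id": "p1", "object_id": "n1"},
]
unique_rows = drop_duplicate_rows(rows, ["subject_id", "property_id", "object_id"])
print("Before drop:", len(rows))
print("After drop:", len(unique_rows))
```

Note that the reversed statement (`n2` to `n1`) is kept, because direction is part of the key.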

In [9]:
# print type of objects
print('type edges:', type(curation_edges))
print('type nodes:', type(curation_nodes))
print()

# print objects sizes
print('len edges:', len(curation_edges))
print('len nodes:', len(curation_nodes))
print()

# print object attribute
print('attribute edges:', curation_edges[0].keys())
print('attribute nodes:', curation_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 362
len nodes: 302

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date', 'g2p_mark'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])


#### import monarch
We retrieved edges from Monarch using the monarch module:

    - From the 8 seed nodes we retrieved the first shell of neighbours
    - From all seed and first-shell nodes we retrieved the edges among them
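
The two retrieval steps can be pictured on a small in-memory edge list. This is a sketch of the idea only: `first_shell` and `edges_among` are hypothetical helpers, the node names are made up, and the real `monarch` module queries the Monarch service rather than a local list.

```python
def first_shell(seeds, edges):
    """Return nodes directly connected to any seed, excluding the seeds."""
    seeds = set(seeds)
    shell = set()
    for subject, obj in edges:
        if subject in seeds:
            shell.add(obj)
        if obj in seeds:
            shell.add(subject)
    return shell - seeds


def edges_among(nodes, edges):
    """Return the edges whose two endpoints both belong to `nodes`."""
    nodes = set(nodes)
    return [(s, o) for s, o in edges if s in nodes and o in nodes]


# Made-up toy graph.
toy_edges = [
    ("seed1", "n1"), ("seed1", "n2"), ("n1", "n2"),
    ("n2", "n3"), ("seed2", "n1"), ("n3", "n4"),
]
seeds = {"seed1", "seed2"}
shell = first_shell(seeds, toy_edges)
network = edges_among(seeds | shell, toy_edges)
print(sorted(shell), len(network))
```

The edge between `n1` and `n2` is included even though neither node is a seed, which is what "edges among them" adds over the seed-centred retrieval.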

In [10]:
%%time
# prepare data to graph schema
# seed nodes
seedList = [ 
    'MONDO:0014109', # NGLY1 deficiency
    'HGNC:17646', # NGLY1 human gene
    'HGNC:633', # AQP1 human gene
    'MGI:103201', # AQP1 mouse gene
    'HGNC:7781', # NRF1 human gene (note from Ginger: known as NFE2L1, see http://biogps.org/#goto=genereport&id=4779)
    'HGNC:24622', # ENGASE human gene
    'HGNC:636', # AQP3 human gene
    'HGNC:19940' # AQP11 human gene
] 

# get first shell of neighbours
neighboursList = monarch.get_neighbours_list(seedList)
print(len(neighboursList))

# introduce animal model ortho-phenotypes for seed and 1st shell neighbors
seed_orthophenoList = monarch.get_orthopheno_list(seedList)
print(len(seed_orthophenoList))
neighbours_orthophenoList = monarch.get_orthopheno_list(neighboursList)
print(len(neighbours_orthophenoList))

# network nodes: seed + 1st shell + ortholog-phenotype
geneList = sum([seedList,
                neighboursList,
                seed_orthophenoList,
                neighbours_orthophenoList], 
               [])
print('genelist: ',len(geneList))

# get Monarch network
monarch_network = monarch.extract_edges(geneList)
print('network: ',len(monarch_network))

# save edges
monarch.print_network(monarch_network, 'monarch_connections')

# build network with graph schema #!!!#
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)


The function "get_neighbours_list()" is running, please keep calm and have some coffee...


100%|██████████| 8/8 [00:20<00:00,  2.11s/it]


Finished get_neighbours...
705

The function "get_orthopheno_list()" is running, please keep calm and have some coffee...


100%|██████████| 8/8 [00:22<00:00,  2.49s/it]


Finished get_neighbours...


100%|██████████| 82/82 [01:26<00:00,  1.32s/it]


Finished get_neighbours...
240

The function "get_orthopheno_list()" is running, please keep calm and have some coffee...


100%|██████████| 705/705 [1:36:08<00:00,  5.23s/it]  


Finished get_neighbours...


100%|██████████| 1497/1497 [32:32<00:00,  1.53it/s] 


Finished get_neighbours...
4411
genelist:  5364

The function "extract_edges()" is running, please keep calm and have some coffee...


100%|██████████| 5086/5086 [5:39:15<00:00,  5.03s/it]   


Finished get_connections...
network:  36187

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/monarch/monarch_connections_v2019-03-10.csv' saved.
df (36187, 9)

* This is the size of the edges file data structure: (36187, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
        object_id property_description property_id property_label  \
0  UBERON:0010307                   NA  RO:0002206   expressed in   

                                property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_0002206             NA   

                           reference_supporting_text reference_uri  \
0  This edge comes from the Monarch Knowledge Gra...            NA   

    subject_id  
0  MGI:1344333  
Number of concepts: 5086
Number of nodes CURIEs:

In [11]:
# print type of objects
print('type edges:', type(monarch_edges))
print('type nodes:', type(monarch_nodes))
print()

# print objects sizes
print('len edges:', len(monarch_edges))
print('len nodes:', len(monarch_nodes))
print()

# print object attribute
print('attribute edges:', monarch_edges[0].keys())
print('attribute nodes:', monarch_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 36187
len nodes: 5086

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])


The network is returned both as a CSV file and as in-memory Python objects.
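
In the Monarch cell, the combined list has 5364 entries (8 + 705 + 240 + 4411), while the edge extraction iterates over 5086 items and the output reports 5086 concepts, which suggests that repeated identifiers are collapsed somewhere along the way. A small sketch of merging identifier lists while dropping repeats and keeping first-seen order follows. `merge_id_lists` is a hypothetical helper and the non-seed identifiers are made up.

```python
from itertools import chain


def merge_id_lists(*id_lists):
    """Concatenate lists of IDs, keeping only the first occurrence of each."""
    return list(dict.fromkeys(chain.from_iterable(id_lists)))


seed_ids = ["HGNC:17646", "HGNC:633"]
# Made-up neighbour and ortho-phenotype IDs, with deliberate repeats.
neighbour_ids = ["EX:1", "EX:2", "HGNC:633"]
orthopheno_ids = ["EX:2", "EX:3"]

merged = merge_id_lists(seed_ids, neighbour_ids, orthopheno_ids)
print(len(seed_ids) + len(neighbour_ids) + len(orthopheno_ids), len(merged))
```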

## Graph library
### Create the review knowledge graph
#### import graph

Tasks:

* Load Networks and calculate graph nodes
* Monarch graph connectivity
* Build graph
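
For the first task, the output of the next cell reports a concatenated table of 36935 rows, which equals 362 + 36187 + 386, the curated, Monarch and transcriptomics edge counts. One plausible reading is that graph nodes are then collected from the endpoints of those edges. The sketch below illustrates that reading with made-up edges; `graph_node_ids` is a hypothetical helper and not the implementation of `graph.graph_nodes`.

```python
def graph_node_ids(*edge_lists):
    """Collect the distinct subject and object IDs across several edge lists."""
    ids = set()
    for edge_list in edge_lists:
        for edge in edge_list:
            ids.add(edge["subject_id"])
            ids.add(edge["object_id"])
    return sorted(ids)


# Made-up edges standing in for three of the source networks.
curated = [{"subject_id": "a", "object_id": "b"}]
monarch_like = [{"subject_id": "b", "object_id": "c"},
                {"subject_id": "a", "object_id": "c"}]
rna_like = [{"subject_id": "a", "object_id": "d"}]

node_ids = graph_node_ids(curated, monarch_like, rna_like)
total_edges = len(curated) + len(monarch_like) + len(rna_like)
print(total_edges, node_ids)
```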

In [12]:
%%time
# load networks and calculate graph nodes
graph_nodes_list, reg_graph_edges = graph.graph_nodes(
    curation=curation_edges,
    monarch=monarch_edges,
    transcriptomics=rna_edges,
    regulation=reg_edges
)

# monarch graph connectivity
# get Monarch edges
monarch_network_graph = monarch.extract_edges(graph_nodes_list)
print('network: ',len(monarch_network_graph))

# save network
monarch.print_network(monarch_network_graph, 'monarch_connections_graph')

# build Monarch graph network
monarch_graph_edges = monarch.build_edges(monarch_network_graph)
monarch_graph_nodes = monarch.build_nodes(monarch_network_graph)

# build graph
edges = graph.build_edges(
    curation=curation_edges,
    monarch=monarch_graph_edges,
    transcriptomics=rna_edges,
    regulation=reg_graph_edges
)
nodes = graph.build_nodes(
    statements=edges,
    curation=curation_nodes,
    monarch=monarch_graph_nodes,
    transcriptomics=rna_nodes,
    regulation=reg_nodes
)

Curated:
(362, 10)
Index(['g2p_mark', 'object_id', 'property_description', 'property_id',
       'property_label', 'property_uri', 'reference_date',
       'reference_supporting_text', 'reference_uri', 'subject_id'],
      dtype='object')
Monarch:
(36187, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
Transcriptomics:
(386, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
Regulatory:
(197267, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')

Concatenating into a graph...
(36935, 9)

Drop duplicated rows...


100%|██████████| 9808/9808 [10:19:57<00:00,  3.15s/it]   


Finished get_connections...
network:  236374

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/monarch/monarch_connections_graph_v2019-03-10.csv' saved.
df (236374, 9)

* This is the size of the edges file data structure: (236374, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
    object_id property_description property_id  property_label  \
0  HGNC:19986                   NA  RO:0002434  interacts with   

                                property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_0002434             NA   

                           reference_supporting_text reference_uri  subject_id  
0  This edge comes from the Monarch Knowledge Gra...            NA  HGNC:12590  
Number of concepts: 9466
Number of nodes CURIEs: 30
List o

In [16]:
# print type of objects
print('type edges:', type(edges))
print('type nodes:', type(nodes))
print()

# print objects sizes
print('len edges:', len(edges))
print('len nodes:', len(nodes))
print()

# print object attribute
print('attribute edges:', edges.columns)
print('attribute nodes:', nodes.columns)

type edges: <class 'pandas.core.frame.DataFrame'>
type nodes: <class 'pandas.core.frame.DataFrame'>

len edges: 246917
len nodes: 9808

attribute edges: Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')
attribute nodes: Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'name',
       'description'],
      dtype='object')


## Neo4jlib library
### Import the graph into Neo4j graph database
#### import neo4jlib

In [17]:
%%time
# import into the graph interface (for now, Neo4j)
## get edges and files for neo4j
edges_df = utils.get_dataframe(edges)
nodes_df = utils.get_dataframe(nodes)
statements = neo4jlib.get_statements(edges_df)
concepts = neo4jlib.get_concepts(nodes_df)
print('statements: ', len(statements))
print('concepts: ',len(concepts))

## import the graph into neo4j
# save files into neo4j import dir
neo4j_path = './neo4j-community-3.0.3'
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type = 'statements')
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type = 'concepts')

# import graph into neo4j
neo4jlib.do_import(neo4j_path)

statements:  246917
concepts:  9808

File './neo4j-community-3.0.3/import/ngly1/ngly1_statements.csv' saved.

File './neo4j-community-3.0.3/import/ngly1/ngly1_concepts.csv' saved.

The graph is imported into the server. The server is running.

CPU times: user 3.42 s, sys: 220 ms, total: 3.64 s
Wall time: 11.6 s


In [19]:
# print type of objects
print('type edges:', type(statements))
print('type nodes:', type(concepts))
print()

# print objects sizes
print('len edges:', len(statements))
print('len nodes:', len(concepts))
print()

# print object attribute
print('attribute edges:', statements.columns)
print('attribute nodes:', concepts.columns)

type edges: <class 'pandas.core.frame.DataFrame'>
type nodes: <class 'pandas.core.frame.DataFrame'>

len edges: 246917
len nodes: 9808

attribute edges: Index([':START_ID', ':TYPE', ':END_ID', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description:IGNORE', 'property_uri'],
      dtype='object')
attribute nodes: Index(['id:ID', ':LABEL', 'preflabel', 'synonyms:IGNORE', 'name',
       'description'],
      dtype='object')
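
Comparing these columns with the graph schema printed earlier, the statements appear to be relabelled with the header tokens of the Neo4j CSV import format (`:START_ID`, `:TYPE`, `:END_ID`, and `:IGNORE` for skipped columns). The sketch below writes rows under such a header. The mapping is inferred from the two printed column lists and is an assumption, `to_neo4j_csv` is a hypothetical helper rather than part of `neo4jlib`, and the example row reuses the Monarch edge shown in the graph output.

```python
import csv
import io

# Mapping inferred by aligning the two printed column lists (an assumption).
STATEMENT_HEADER = {
    "subject_id": ":START_ID",
    "property_id": ":TYPE",
    "object_id": ":END_ID",
    "reference_uri": "reference_uri",
    "reference_supporting_text": "reference_supporting_text",
    "reference_date": "reference_date",
    "property_label": "property_label",
    "property_description": "property_description:IGNORE",
    "property_uri": "property_uri",
}


def to_neo4j_csv(rows, header_map):
    """Serialize rows as CSV text with import-style column headers."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header_map.values())
    for row in rows:
        # Missing fields are written as 'NA', as in the Monarch output.
        writer.writerow([row.get(field, "NA") for field in header_map])
    return buffer.getvalue()


statement_rows = [{
    "subject_id": "HGNC:12590",
    "property_id": "RO:0002434",
    "object_id": "HGNC:19986",
    "property_label": "interacts with",
    "property_uri": "http://purl.obolibrary.org/obo/RO_0002434",
}]
csv_text = to_neo4j_csv(statement_rows, STATEMENT_HEADER)
print(csv_text.splitlines()[0])
```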


## hypothesis-generation library
### Query the graph for mechanistic explanation, then summarize the extracted paths
#### import hypothesis, summary

### Orthopheno query with general nodes/relations removed

In [20]:
%%time
# get orthopheno paths
seed = list([
        'HGNC:17646',  # NGLY1 human gene
        'HGNC:633'  # AQP1 human gene
])
hypothesis.query(seed,queryname='ngly1_aqp1',port='7687') #http_port= 7470; bolt_port=7680


Hypothesis generator has finished. 2 QUERIES completed.


CPU times: user 405 ms, sys: 32.4 ms, total: 438 ms
Wall time: 5 s


In [21]:
%%time
# get orthopheno paths
seed = list([
        'HGNC:17646',  # NGLY1 human gene
        'HGNC:633'  # AQP1 human gene
])
hypothesis.query(seed, queryname='ngly1_aqp1', pwdegree='1000', phdegree='1000', port='7687')


Hypothesis generator has finished. 2 QUERIES completed.


CPU times: user 2.73 s, sys: 38.1 ms, total: 2.77 s
Wall time: 5.35 s


In [22]:
%%time
import hypothesis
# get orthopheno paths
seed = list([
        'HGNC:17646',  # NGLY1 human gene
        'HGNC:633'  # AQP1 human gene
])
hypothesis.open_query(seed,queryname='ngly1_aqp1',port='7687')


Hypothesis generator has finished. 2 QUERIES completed.


CPU times: user 8.19 s, sys: 75.7 ms, total: 8.27 s
Wall time: 8.87 s
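
The summary step that follows groups the extracted paths by metapath, that is, by the sequence of node and edge types along a path. As a hedged illustration of that idea, the sketch below counts metapaths over made-up paths. The path representation (a list of steps with a `type` key), the type labels, and the function `metapath` are assumptions, not the format produced by `summary.path_load`.

```python
from collections import Counter


def metapath(path):
    """Return the sequence of node and edge types along one path."""
    return tuple(step["type"] for step in path)


# Made-up paths: node and edge steps alternate, each carrying a type label.
paths = [
    [{"id": "g1", "type": "GENE"}, {"id": "e1", "type": "interacts with"},
     {"id": "g2", "type": "GENE"}],
    [{"id": "g1", "type": "GENE"}, {"id": "e2", "type": "interacts with"},
     {"id": "g3", "type": "GENE"}],
    [{"id": "g1", "type": "GENE"}, {"id": "e3", "type": "has phenotype"},
     {"id": "p1", "type": "PHENOTYPE"}],
]
metapath_counts = Counter(metapath(path) for path in paths)
for types, count in metapath_counts.most_common():
    print(count, " -> ".join(types))
```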


In [24]:
%%time
# get summary
data = summary.path_load('./hypothesis/query_ngly1_aqp1_paths_v2019-03-10')

#parse data for summarization
data_parsed = list()
#funcs = [summary.metapaths, summary.nodes, summary.node_types, summary.edges, summary.edge_types]
for query in data:
    query_parsed = summary.query_parser(query)
    #metapath(query_parsed)
    #map(lambda x: x(query_parsed), funcs)
    data_parsed.append(query_parsed)
summary.metapaths(data_parsed)
summary.nodes(data_parsed)
summary.node_types(data_parsed)
summary.edges(data_parsed)
summary.edge_types(data_parsed)
#for query in data_parsed:
#    map(lambda x: x(query), funcs)


File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_source:HGNC:17646_target:HGNC:633_summary_metapaths_v2019-03-10.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_source:HGNC:17646_target:HGNC:633_summary_entities_in_metapaths_v2019-03-10.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_source:HGNC:633_target:HGNC:17646_summary_metapaths_v2019-03-10.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/query_ngly1_aqp1_paths_source:HGNC:633_target:HGNC:17646_summary_entities_in_metapaths_v2019-03-10.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/monarch_orthopeno_network_query_source:HGNC:17646_target:HGNC:633_summary_nodes_v2019-03-10.csv' saved.

File '/home/nuria/workspace/graph-hypothesis-generation-lib/plan/summaries/monarch_orthopeno_ne