In [1]:
# @name graph_v3.2_v20190616
# @description notebook to build the NGLY1 Deficiency review knowledge graph v3.2
# @author Núria Queralt Rosinach
# @date 16 June 2019

# Description

This is the notebook for the creation of the first review network and derived hypotheses. 

* Intermediate variables are kept as workflow objects: the output of each step is passed directly to the next step.


* Review network: from the Monarch knowledge graph, we built a network seeded by the nodes listed below, retrieving their explicit relationships and all the relationships among the retrieved nodes. Seed nodes:

    - 'MONDO:0007739' HD
    - 'HGNC:4851' Htt
    - 'CHEBI:18248' Iron (not working)
    - 'HGNC:18229' Rhes (RASD2)
    
Possible seed nodes:
https://monarchinitiative.org/search/Iron
* Connecting paths: query templates.

In [1]:
import transcriptomics, regulation, curation, monarch, graph, neo4jlib, hypothesis, summary, utils
import pandas as pd

## Edges library
### Review edges to integrate into the knowledge graph and prepare them as individual networks

#### TRANSCRIPTOMICS NETWORK
#### import transcriptomics
We retrieved edges from RNA-seq transcriptomics profiles using the `transcriptomics` module:

    - Experimental data set: differential expression results (HD vs. control) from GEO series GSE64810; the resulting edges cite PMID:26636579

In [2]:
%%time
# prepare data to graph schema
csv_path = './transcriptomics/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.txt'
data = transcriptomics.read_data(csv_path, "\t")


The function "read_data()" is running...

* This is the size of the raw expression data structure: (28087, 10)
* These are the expression attributes: Index(['Unnamed: 0', 'symbol', 'baseMean', 'HD.mean', 'Control.mean',
       'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj'],
      dtype='object')
* This is the first record:
           Unnamed: 0 symbol  baseMean    HD.mean  Control.mean  \
0  ENSG00000069011.10  PITX1  5.645675  18.684286      0.323793   

   log2FoldChange     lfcSE      stat        pvalue          padj  
0        4.769658  0.366367  13.01879  9.567529e-39  2.687232e-34  

The raw data is saved at: /home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/transcriptomics/HD/data/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.csv


Finished read_data().

CPU times: user 46.1 ms, sys: 1.52 ms, total: 47.6 ms
Wall time: 46.2 ms


In [6]:
%%time
# prepare data to graph schema
csv_path = './transcriptomics/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.txt'
data = transcriptomics.read_data(csv_path, "\t")
clean_data = transcriptomics.clean_data(data)
data_edges = transcriptomics.prepare_data_edges(clean_data)
rna_network = transcriptomics.prepare_rna_edges(data_edges)

# TODO: this function still produces Ensembl IDs even though the code should already return HGNC IDs

# build network with graph schema
rna_edges = transcriptomics.build_edges(rna_network)
rna_nodes, rna_dict = transcriptomics.build_nodes(rna_edges)
rna_edges = transcriptomics.rework_edges(rna_edges, rna_dict)


The function "read_data()" is running...

* This is the size of the raw expression data structure: (28087, 10)
* These are the expression attributes: Index(['Unnamed: 0', 'symbol', 'baseMean', 'HD.mean', 'Control.mean',
       'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj'],
      dtype='object')
* This is the first record:
           Unnamed: 0 symbol  baseMean    HD.mean  Control.mean  \
0  ENSG00000069011.10  PITX1  5.645675  18.684286      0.323793   

   log2FoldChange     lfcSE      stat        pvalue          padj  
0        4.769658  0.366367  13.01879  9.567529e-39  2.687232e-34  

The raw data is saved at: /home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/transcriptomics/HD/data/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.csv


Finished read_data().


The function "clean_data()" is running. Keeping only data with FC > 1.5 and FDR < 5% ...

* This is the size of the clean expression data structure: (3209, 6)
* These are the clean expressi

AttributeError: 'list' object has no attribute 'iterrows'

- The transcriptomics network is returned both as in-memory objects (`rna_edges`, `rna_nodes`) and as CSV files in _**graph/**_ (`rna_edges_version.csv`, `rna_nodes_version.csv`)
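
The `clean_data()` log above reports that only records with FC > 1.5 and FDR < 5% are kept. The following is a minimal pandas sketch of such a filter, not the module's implementation. It assumes that the fold-change threshold applies in both directions (|log2FoldChange| > log2(1.5)) and that the `padj` column holds the FDR. The function name `filter_expression` and the toy rows (apart from the PITX1 values printed by `read_data()`) are hypothetical.

```python
import math
import pandas as pd

def filter_expression(df, fc=1.5, fdr=0.05):
    """Keep rows with |log2FC| > log2(fc) and adjusted p-value < fdr (assumed semantics)."""
    log2_cutoff = math.log2(fc)
    mask = (df["log2FoldChange"].abs() > log2_cutoff) & (df["padj"] < fdr)
    return df.loc[mask].reset_index(drop=True)

# Toy data using the column names printed by read_data(); GENE_* rows are made up
toy = pd.DataFrame({
    "symbol": ["PITX1", "GENE_A", "GENE_B", "GENE_C"],
    "log2FoldChange": [4.769658, 0.2, -1.1, 2.0],
    "padj": [2.687232e-34, 0.001, 0.01, 0.3],
})
kept = filter_expression(toy)
print(kept["symbol"].tolist())
```

In this toy input, GENE_A fails the fold-change cutoff and GENE_C fails the FDR cutoff, so only PITX1 and GENE_B remain.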

In [4]:
# print type of objects
print('type edges:', type(rna_edges))
print('type nodes:', type(rna_nodes))
print()

# print objects sizes
print('len edges:', len(rna_edges))
print('len nodes:', len(rna_nodes))
print()

# print object attribute
print('attribute edges:', rna_edges[0].keys())
print('attribute nodes:', rna_nodes[0].keys())

type edges: <class 'dict'>
type nodes: <class 'list'>

len edges: 3210
len nodes: 3210



KeyError: 0
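
The `KeyError: 0` above is what Python raises when `[0]` is applied to a dict that has no key `0`; the type print shows that `rna_edges` is a dict in this run, whereas the other networks expose their edges as a list of dicts. The sketch below shows one way to peek at the first record of either container. The helper `first_record` is hypothetical, and what the dict values actually hold depends on how `rna_edges` was built, which the log does not show.

```python
def first_record(container):
    """Return the first record of a list of dicts, or the first value of a dict."""
    if isinstance(container, dict):
        # dicts are keyed, not positional: take the first value in insertion order
        return next(iter(container.values()), None)
    return container[0] if container else None

# Toy containers holding the same record in both layouts
edges_as_list = [{"subject_id": "HGNC:4851", "object_id": "HGNC:9004"}]
edges_as_dict = {"HGNC:9004": {"subject_id": "HGNC:4851", "object_id": "HGNC:9004"}}

print(sorted(first_record(edges_as_list).keys()))
print(sorted(first_record(edges_as_dict).keys()))
```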

#### REGULATION NETWORK
#### import regulation

We retrieved human TF gene expression regulation edges from several sources using the `regulation` module.

In [6]:
%%time
# prepare msigdb data
gmt_path = './regulation/msigdb/data/c3.tft.v6.1.entrez.gmt'
regulation.prepare_msigdb_data(gmt_path)

# prepare individual networks
data = regulation.load_tf_gene_edges()
dicts = regulation.get_gene_id_normalization_dictionaries(data)
data_edges = regulation.prepare_data_edges(data, dicts)

# prepare regulation network
reg_network = regulation.prepare_regulation_edges(data_edges)

# build network with graph schema
reg_edges = regulation.build_edges(reg_network)
reg_nodes = regulation.build_nodes(reg_network)


The function "prepare_msigdb_data()" is running...

* Number of Transcription Factor Targets (TFT) gene sets: 615

The MSigDB raw network is saved at: /home/karolis/Structured review/HD-SR/bioknowledge-reviewer/bioknowledge_reviewer/regulation/msigdb/out/tf_genelist_entrez_msigdb.json. Other reporting files are also saved at the same directory.


Finished prepare_msigdb_data().


The function "load_tf_gene_edges()" is running...

Finished load_tf_gene_edges().


The function "get_gene_id_normalization_dictionaries()" is running...

* Querying BioThings to map gene symbols to HGNC and Entrez IDs...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-3071...done.
Finished.
48 input query terms found no hit:
	['TAL1BETAE47', 'COMP1', 'PTF1BETA', 'MYCMAX', 'NKX62', 'MYOGNF1', 'MEIS1AHOXA9', 'NFMUE1', 'NFKAPPA
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

Saving not found gene symbols at: /home/karolis/Structure

- The regulation network is returned both as in-memory objects (`reg_edges`, `reg_nodes`) and as CSV files in _**graph/**_ (`regulation_edges_version.csv`, `regulation_nodes_version.csv`)
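
The MSigDB input passed to `prepare_msigdb_data()` is a GMT file: one gene set per line, tab-separated, with the set name first, a description second, and the member genes after that. The sketch below reads such content into a name-to-genes mapping. It is only an illustration of the file format; `parse_gmt` is a hypothetical helper, not part of the `regulation` module, and the set names and IDs are placeholders.

```python
import io

def parse_gmt(handle):
    """Map each gene-set name to its list of member gene IDs."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets

# Toy content; set names, URLs and IDs are illustrative only
toy_gmt = io.StringIO(
    "TF_SET_A\thttp://example.org/a\t1017\t1019\n"
    "TF_SET_B\thttp://example.org/b\t7157\n"
)
sets = parse_gmt(toy_gmt)
print(len(sets), sets["TF_SET_A"])
```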

In [7]:
# print type of objects
print('type edges:', type(reg_edges))
print('type nodes:', type(reg_nodes))
print()

# print objects sizes
print('len edges:', len(reg_edges))
print('len nodes:', len(reg_nodes))
print()

# print object attribute
print('attribute edges:', reg_edges[0].keys())
print('attribute nodes:', reg_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 197267
len nodes: 16968

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])


#### CURATED NETWORK
#### import curation

We retrieved and prepared curated edges using the `curation` module. 

curation_edges = pd.read_csv("curation/data/HD/Empty_edges.csv")
curation_nodes = pd.read_csv("curation/data/HD/Empty_nodes.csv")

%%time
# graph v3.2
# read network from drive and concat all curated statements
curation_edges, curation_nodes = curation.read_network(version='v20180118')

# prepare data edges and nodes
data_edges = curation.prepare_data_edges(curation_edges)
data_nodes = curation.prepare_data_nodes(curation_nodes)

# prepare curated edges and nodes
curated_network = curation.prepare_curated_edges(data_edges)
curated_concepts = curation.prepare_curated_nodes(data_nodes)


# build edges and nodes files
curation_edges = curation.build_edges(curated_network)
curation_nodes = curation.build_nodes(curated_concepts)

- The curated network is returned both as in-memory objects (`curation_edges`, `curation_nodes`) and as CSV files in _**graph/**_ (`curated_graph_edges_version.csv`, `curated_graph_nodes_version.csv`)
- The original curated network, i.e. without graph data model normalization, is saved as CSV files at _**curation/**_ (`curated_edges_version.csv`, `curated_nodes_version.csv`)

# print type of objects
print('type edges:', type(curation_edges))
print('type nodes:', type(curation_nodes))
print()

# print objects sizes
print('len edges:', len(curation_edges))
print('len nodes:', len(curation_nodes))
print()

# print object attribute
print('attribute edges:', curation_edges[0].keys())
print('attribute nodes:', curation_nodes[0].keys())

#### MONARCH NETWORK
#### import monarch
We retrieved edges from Monarch using the `monarch` module.

Tasks:

- From the seed nodes we retrieved 1st shell nodes
- From all seed and 1st shell nodes we retrieved ortho-phenotypes
- We retrieved extra edges among all of them, i.e. extra connectivity between: seed, 1st shell, ortholog-phenotype nodes
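
The first and last tasks can be pictured on a toy edge list: collect the direct neighbours of the seeds, then keep every edge whose two endpoints are in the combined node set. This is a sketch of the idea only; it leaves out the ortho-phenotype step, does not call Monarch, and the node names and helper functions are hypothetical.

```python
# Toy undirected edge list; node names are placeholders, not Monarch identifiers
toy_edges = [
    ("seed1", "a"), ("seed1", "b"), ("seed2", "b"),
    ("a", "b"), ("a", "x"), ("x", "y"),
]

def first_shell(seeds, edges):
    """Nodes directly connected to any seed, excluding the seeds themselves."""
    seeds = set(seeds)
    shell = set()
    for u, v in edges:
        if u in seeds:
            shell.add(v)
        if v in seeds:
            shell.add(u)
    return shell - seeds

def edges_among(nodes, edges):
    """Edges whose two endpoints both belong to the node set."""
    nodes = set(nodes)
    return [(u, v) for u, v in edges if u in nodes and v in nodes]

seeds = ["seed1", "seed2"]
shell = first_shell(seeds, toy_edges)
network = edges_among(set(seeds) | shell, toy_edges)
print(sorted(shell))
print(network)
```

The edge `("a", "b")` is the kind of extra connectivity meant in the last task: neither endpoint is a seed, but both are in the node set.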

In [8]:
%%time
# prepare data to graph schema
# seed nodes
seedList = [ 
    'MONDO:0007739', # HD
    'HGNC:4851', # Htt
    'HGNC:182293', # Rhes (RASD2); note that the description above lists 'HGNC:18229'
    'REACT:R-HSA-917937' # Iron uptake pathway
] 

# get first shell of neighbours
neighboursList = monarch.get_neighbours_list(seedList)
print(len(neighboursList))

# introduce animal model ortho-phenotypes for seed and 1st shell neighbors
## For seed nodes:
seed_orthophenoList = monarch.get_orthopheno_list(seedList)
print(len(seed_orthophenoList))
## For 1st shell nodes:
neighbours_orthophenoList = monarch.get_orthopheno_list(neighboursList)
print(len(neighbours_orthophenoList))

# network nodes: seed + 1st shell + ortholog-phenotype
geneList = sum([seedList,
                neighboursList,
                seed_orthophenoList,
                neighbours_orthophenoList
               ], 
               [])
print('genelist: ',len(geneList))

# get Monarch network
monarch_network = monarch.extract_edges(geneList)
print('network: ',len(monarch_network))

# save edges
monarch.print_network(monarch_network, 'monarch_connections')

# build network with graph schema 
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)


The function "get_neighbours_list()" is running. Its runtime may take some minutes. If you interrupt the process, you will lose all the nodes retrieved and you should start over the execution of this function.


100%|██████████| 4/4 [00:22<00:00,  4.91s/it]



Finished get_neighbours_list().

1024

The function "get_orthopheno_list()" is running. Its runtime may take some hours. If you interrupt the process, you will lose all the nodes retrieved and you should start over the execution of this function.


100%|██████████| 4/4 [00:22<00:00,  4.87s/it]
100%|██████████| 16/16 [00:50<00:00,  4.90s/it]



Finished get_orthopheno_list().

227

The function "get_orthopheno_list()" is running. Its runtime may take some hours. If you interrupt the process, you will lose all the nodes retrieved and you should start over the execution of this function.


100%|██████████| 1024/1024 [1:19:32<00:00,  3.38s/it]
100%|██████████| 7307/7307 [6:06:49<00:00,  2.21s/it]   



Finished get_orthopheno_list().

13015
genelist:  14270

The function "extract_edges()" is running. Its runtime may take some hours. If you interrupt the process, you will lose all the edges retrieved and you should start over the execution of this function.


100%|██████████| 14021/14021 [15:36:11<00:00,  3.07s/it]   



Finished extract_edges(). To save the retrieved Monarch edges use the function "print_network()".

network:  233182

Saving Monarch edges at: '/home/karolis/Structured review/HD-SR/bioknowledge-reviewer/bioknowledge_reviewer/monarch/monarch_connections_v2020-12-12.csv'...


The function "build_edges()" is running...
df (233182, 9)

* This is the size of the edges file data structure: (233182, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
    object_id property_description    property_id  \
0  MGI:892004                   NA  RO:HOM0000017   

                   property_label  \
0  in orthology relationship with   

                                   property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_HOM0000017             NA   

               

In [9]:
monarch.print_network(monarch_network, 'monarch_connections')

# build network with graph schema 
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)


Saving Monarch edges at: '/home/karolis/Structured review/HD-SR/bioknowledge-reviewer/bioknowledge_reviewer/monarch/monarch_connections_v2020-12-12.csv'...


The function "build_edges()" is running...
df (233182, 9)

* This is the size of the edges file data structure: (233182, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
    object_id property_description    property_id  \
0  MGI:892004                   NA  RO:HOM0000017   

                   property_label  \
0  in orthology relationship with   

                                   property_uri reference_date  \
0  http://purl.obolibrary.org/obo/RO_HOM0000017             NA   

                           reference_supporting_text reference_uri  \
0  This edge comes from the Monarch Knowledge Gra...          

In [10]:
# print type of objects
print('type edges:', type(monarch_edges))
print('type nodes:', type(monarch_nodes))
print()

# print objects sizes
print('len edges:', len(monarch_edges))
print('len nodes:', len(monarch_nodes))
print()

# print object attribute
print('attribute edges:', monarch_edges[0].keys())
print('attribute nodes:', monarch_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 233182
len nodes: 14018

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])


%%time
# prepare data to graph schema
# seed nodes
seedList = [ 
    'MONDO:0007739', # HD
    'HGNC:4851', # Htt
    'HGNC:182293', # Rhes (RASD2); note that the description above lists 'HGNC:18229'
] 

# get first shell of neighbours
neighboursList = monarch.get_neighbours_list(seedList)
print(len(neighboursList))

# introduce animal model ortho-phenotypes for seed and 1st shell neighbors
## For seed nodes:
seed_orthophenoList = monarch.get_orthopheno_list(seedList)
print(len(seed_orthophenoList))
## For 1st shell nodes:
neighbours_orthophenoList = monarch.get_orthopheno_list(neighboursList)
print(len(neighbours_orthophenoList))

# network nodes: seed + 1st shell + ortholog-phenotype
geneList = sum([seedList,
                neighboursList,
                #seed_orthophenoList,
                #neighbours_orthophenoList
               ], 
               [])
print('genelist: ',len(geneList))

# get Monarch network
monarch_network = monarch.extract_edges(geneList)
print('network: ',len(monarch_network))

# save edges
monarch.print_network(monarch_network, 'monarch_connections')

# build network with graph schema 
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)

monarch_network = monarch.read_connections("monarch_connections_v2020-11-15.csv")
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)


In [11]:
# print type of objects
print('type edges:', type(monarch_edges))
print('type nodes:', type(monarch_nodes))
print()

# print objects sizes
print('len edges:', len(monarch_edges))
print('len nodes:', len(monarch_nodes))
print()

# print object attribute
print('attribute edges:', monarch_edges[0].keys())
print('attribute nodes:', monarch_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 233182
len nodes: 14018

attribute edges: dict_keys(['subject_id', 'object_id', 'property_id', 'property_label', 'property_description', 'property_uri', 'reference_uri', 'reference_supporting_text', 'reference_date'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms', 'description'])


- The Monarch network is returned both as in-memory objects (`monarch_edges`, `monarch_nodes`) and as CSV files in _**monarch/**_ (`monarch_edges_version.csv`, `monarch_nodes_version.csv`)
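
In the executed run above, `geneList` is a plain concatenation (4 + 1024 + 227 + 13015 = 14270 entries), while the `extract_edges()` progress bar counts 14021 items. This suggests, as an assumption, that repeated IDs are collapsed inside the function. If a duplicate-free list is wanted before the call, a minimal sketch is the following; `unique_in_order` is a hypothetical helper and the `NODE:*` IDs are placeholders.

```python
from itertools import chain

def unique_in_order(*lists):
    """Concatenate lists and drop repeated IDs, keeping first-seen order."""
    return list(dict.fromkeys(chain.from_iterable(lists)))

seeds = ["MONDO:0007739", "HGNC:4851"]
neighbours = ["HGNC:4851", "NODE:1", "NODE:2"]   # placeholder IDs
orthophenos = ["NODE:2", "NODE:3"]               # placeholder IDs

node_list = unique_in_order(seeds, neighbours, orthophenos)
print(len(seeds) + len(neighbours) + len(orthophenos), len(node_list))
```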

## Graph library
### Create the review knowledge graph
#### import graph

Tasks:

* Load Networks and calculate graph nodes
* Retrieve extra connectivity for the graph from Monarch
* Build the review graph
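
Calculating the graph nodes from several edge lists amounts to collecting every ID that appears as a subject or an object. The sketch below illustrates this on records in the list-of-dicts layout shown earlier. It is not the implementation of `graph.graph_nodes()`; `collect_node_ids` is a hypothetical helper, and the toy edges reuse IDs that appear in this notebook's outputs.

```python
def collect_node_ids(*edge_lists):
    """Return the sorted set of IDs used as subject or object in any edge list."""
    node_ids = set()
    for edges in edge_lists:
        for edge in edges:
            node_ids.add(edge["subject_id"])
            node_ids.add(edge["object_id"])
    return sorted(node_ids)

monarch_toy = [{"subject_id": "RGD:1595815", "object_id": "MGI:892004"}]
rna_toy = [{"subject_id": "HGNC:4851", "object_id": "HGNC:9004"}]
reg_toy = [{"subject_id": "HGNC:8615", "object_id": "HGNC:8803"},
           {"subject_id": "HGNC:8615", "object_id": "HGNC:12687"}]

node_ids = collect_node_ids(monarch_toy, rna_toy, reg_toy)
print(node_ids)
```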

In [2]:
curation_edges = pd.read_csv("curation/data/HD/Empty_edges.csv")
curation_nodes = pd.read_csv("curation/data/HD/Empty_nodes.csv")
curation_nodes = curation_nodes.astype('object')
curation_nodes['name'] = ["NaN", "NaN"]
curation_edges = curation_edges.to_dict("records")

In [16]:
# obtain variables:
#curation_edges = pd.read_csv("curation/data/HD/Empty_edges.csv")
#curation_nodes = pd.read_csv("curation/data/HD/Empty_nodes.csv")

monarch_network = monarch.read_connections("monarch_connections_v2020-12-12.csv")
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)

#monarch_network_graph = monarch.read_connections("monarch_connections_graph_v2020-12-02.csv")
#monarch_graph_edges = monarch.build_edges(monarch_network_graph)
#monarch_graph_nodes = monarch.build_nodes(monarch_network_graph)

csv_path = './transcriptomics/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.txt'
data = transcriptomics.read_data(csv_path)
clean_data = transcriptomics.clean_data(data)
data_edges = transcriptomics.prepare_data_edges(clean_data)
rna_network = transcriptomics.prepare_rna_edges(data_edges)

# build network with graph schema
rna_edges = transcriptomics.build_edges(rna_network)
rna_nodes, rna_edges = transcriptomics.build_nodes(rna_network)

gmt_path = './regulation/msigdb/data/c3.tft.v6.1.entrez.gmt'
regulation.prepare_msigdb_data(gmt_path)

# prepare individual networks
data = regulation.load_tf_gene_edges()
dicts = regulation.get_gene_id_normalization_dictionaries(data)
data_edges = regulation.prepare_data_edges(data, dicts)

# prepare regulation network
reg_network = regulation.prepare_regulation_edges(data_edges)

# build network with graph schema
reg_edges = regulation.build_edges(reg_network)
reg_nodes = regulation.build_nodes(reg_network)

#graph_nodes_list, reg_graph_edges = graph.graph_nodes(
#    curation=curation_edges,
#    monarch=monarch_edges,
#    transcriptomics=rna_edges,
#    regulation=reg_edges
#)



* This is the size of the data structure: (233182, 7)
* These are the attributes: Index(['object_id', 'object_label', 'reference_id_list', 'relation_id',
       'relation_label', 'subject_id', 'subject_label'],
      dtype='object')
* This is the first record:
    object_id object_label  reference_id_list    relation_id  \
0  MGI:892004        H2-Bl                NaN  RO:HOM0000017   

                   relation_label   subject_id subject_label  
0  in orthology relationship with  RGD:1595815      RT1-M6-2  

The function "build_edges()" is running...
df (233182, 9)

* This is the size of the edges file data structure: (233182, 9)
* These are the edges attributes: Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
* This is the first record:
    object_id property_description    property_id  \
0  MGI:892004                   NA 

In [19]:
rna_network.shape

(3209, 12)

In [21]:
pd.DataFrame(rna_edges).shape

(3209, 12)

In [17]:
%%time
# load networks and calculate graph nodes
graph_nodes_list, reg_graph_edges = graph.graph_nodes(
    curation=curation_edges,
    monarch=monarch_edges,
    transcriptomics=rna_edges,
    regulation=reg_edges
)

# Monarch graph connectivity
## get Monarch edges
monarch_network_graph = monarch.extract_edges(graph_nodes_list)
print('network: ',len(monarch_network_graph))

## save Monarch network
monarch.print_network(monarch_network_graph, 'monarch_connections_graph')

## build Monarch network with graph schema
monarch_graph_edges = monarch.build_edges(monarch_network_graph)
monarch_graph_nodes = monarch.build_nodes(monarch_network_graph)

# build review graph
edges = graph.build_edges(
    curation=curation_edges,
    monarch=monarch_graph_edges,
    transcriptomics=rna_edges,
    regulation=reg_graph_edges
)
nodes = graph.build_nodes(
    statements=edges,
    curation=curation_nodes,
    monarch=monarch_graph_nodes,
    transcriptomics=rna_nodes,
    regulation=reg_nodes
)


The function "graph_nodes()" is running...

Preparing networks...
Curated:
(2, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
Monarch:
(233182, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')
Transcriptomics:
(3209, 12)
Index(['fdr', 'log2FoldChange', 'object_id', 'object_label', 'property_id',
       'property_label', 'pvalue', 'reference_id', 'regulation', 'source',
       'subject_id', 'subject_label'],
      dtype='object')
Regulatory:
(197267, 9)
Index(['object_id', 'property_description', 'property_id', 'property_label',
       'property_uri', 'reference_date', 'reference_supporting_text',
       'reference_uri', 'subject_id'],
      dtype='object')

Concate

KeyError: "['reference_uri_tf' 'reference_supporting_text_tf' 'reference_date_tf'\n 'property_description_tf' 'property_uri_tf'] not in index"
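
The column listings printed by `graph_nodes()` above show that the transcriptomics edges carry 12 differently named columns, while the other three networks share the same nine. A quick way to see which expected columns a network lacks is sketched below. The expected set is the nine attributes printed for the Monarch and regulation edges; `missing_columns` is a hypothetical helper, not part of the `graph` module.

```python
EDGE_COLUMNS = {
    "subject_id", "object_id", "property_id", "property_label",
    "property_description", "property_uri", "reference_uri",
    "reference_supporting_text", "reference_date",
}

def missing_columns(records, expected=EDGE_COLUMNS):
    """Return the expected column names absent from the first record."""
    if not records:
        return sorted(expected)
    return sorted(expected - set(records[0].keys()))

# Column names copied from the log output above; values are left empty
rna_record = dict.fromkeys([
    "fdr", "log2FoldChange", "object_id", "object_label", "property_id",
    "property_label", "pvalue", "reference_id", "regulation", "source",
    "subject_id", "subject_label",
])
reg_record = dict.fromkeys(EDGE_COLUMNS)

print(missing_columns([rna_record]))
print(missing_columns([reg_record]))
```

The five columns reported as missing for the transcriptomics record are the same five named, with a `_tf` suffix, in the `KeyError`.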

In [4]:
curated_df = pd.DataFrame(curation_edges)
monarch_df = pd.DataFrame(monarch_edges)
rna = pd.DataFrame(rna_edges)
tf = pd.DataFrame(reg_edges)

In [11]:
curated_df

Unnamed: 0,object_id,property_description,property_id,property_label,property_uri,reference_date,reference_supporting_text,reference_uri,subject_id
0,,,,,,,,,
1,,,,,,,,,


In [12]:
monarch_df

Unnamed: 0,object_id,property_description,property_id,property_label,property_uri,reference_date,reference_supporting_text,reference_uri,subject_id
0,MGI:892004,,RO:HOM0000017,in orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000017,,This edge comes from the Monarch Knowledge Gra...,,RGD:1595815
1,MGI:3704134,,RO:HOM0000017,in orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000017,,This edge comes from the Monarch Knowledge Gra...,,ENSEMBL:ENSRNOG00000061639
2,MGI:97912,,RO:HOM0000020,in 1 to 1 orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000020,,This edge comes from the Monarch Knowledge Gra...,,FlyBase:FBgn0041191
3,ENSEMBL:ENSOANG00000006132,,RO:HOM0000020,in 1 to 1 orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000020,,This edge comes from the Monarch Knowledge Gra...,,MGI:1344381
4,ENSEMBL:ENSECAG00000019077,,RO:HOM0000020,in 1 to 1 orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000020,,This edge comes from the Monarch Knowledge Gra...,,ENSEMBL:ENSCAFG00000017647
5,MONARCH:APO_0000222APO_0000004,,RO:0002200,has phenotype,http://purl.obolibrary.org/obo/RO_0002200,,This edge comes from the Monarch Knowledge Gra...,,SGD:S000002780
6,FlyBase:FBgn0013275,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,,This edge comes from the Monarch Knowledge Gra...,,FlyBase:FBgn0036505
7,ENSEMBL:ENSGALG00000044425,,RO:HOM0000017,in orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000017,,This edge comes from the Monarch Knowledge Gra...,,ENSEMBL:ENSCAFG00000016509
8,RGD:1359648,,RO:HOM0000020,in 1 to 1 orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000020,,This edge comes from the Monarch Knowledge Gra...,,ENSEMBL:ENSSSCG00000038628
9,ENSEMBL:ENSGALG00000042895,,RO:HOM0000020,in 1 to 1 orthology relationship with,http://purl.obolibrary.org/obo/RO_HOM0000020,,This edge comes from the Monarch Knowledge Gra...,,ENSEMBL:ENSMMUG00000002630


In [13]:
rna

Unnamed: 0,fdr,log2FoldChange,object_id,object_label,property_id,property_label,pvalue,reference_id,regulation,source,subject_id,subject_label
0,2.687232e-34,4.769658,HGNC:9004,PITX1,RO:0002434,interacts with,9.567529e-39,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
1,2.287927e-21,4.760790,HGNC:5120,HOXB9,RO:0002434,interacts with,1.629171e-25,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
2,2.724842e-20,4.573976,HGNC:5122,HOXC10,RO:0002434,interacts with,2.910430e-24,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
3,2.752160e-20,4.704005,HGNC:5101,HOXA11,RO:0002434,interacts with,3.919478e-24,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
4,4.511536e-20,4.273311,HGNC:5100,HOXA10,RO:0002434,interacts with,8.031360e-24,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
5,6.329454e-20,4.602451,HGNC:5133,HOXD10,RO:0002434,interacts with,1.352110e-23,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
6,1.371856e-19,3.962235,HGNC:9219,POU4F2,RO:0002434,interacts with,3.419015e-23,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
7,2.176886e-19,4.165899,HGNC:5102,HOXA13,RO:0002434,interacts with,6.200410e-23,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
8,3.799669e-15,3.657288,HGNC:5140,HOXD9,RO:0002434,interacts with,1.217539e-18,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT
9,5.876207e-15,3.866840,HGNC:5139,HOXD8,RO:0002434,interacts with,2.092145e-18,PMID:26636579,Upregulated,Chow,HGNC:4851,HTT


In [10]:
tf

Unnamed: 0,object_id,property_description,property_id,property_label,property_uri,reference_date,reference_supporting_text,reference_uri,subject_id
0,HGNC:8803,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:8615
1,HGNC:12687,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:8615
2,HGNC:10970,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:8615
3,HGNC:10972,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:8615
4,HGNC:25567,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:8615
5,HGNC:14357,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:8615
6,HGNC:1694,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:3240
7,HGNC:11246,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:3240
8,HGNC:12003,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:3240
9,HGNC:11920,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2007-01-01,"This edge comes from the TRED dataset in ""tfta...",https://www.ncbi.nlm.nih.gov/pubmed/17202159,HGNC:3239


In [5]:
statements = pd.concat([curated_df, monarch_df, rna], ignore_index=True, join="inner")

In [6]:
statements

Unnamed: 0,object_id,property_id,property_label,subject_id
0,,,,
1,,,,
2,MGI:892004,RO:HOM0000017,in orthology relationship with,RGD:1595815
3,MGI:3704134,RO:HOM0000017,in orthology relationship with,ENSEMBL:ENSRNOG00000061639
4,MGI:97912,RO:HOM0000020,in 1 to 1 orthology relationship with,FlyBase:FBgn0041191
5,ENSEMBL:ENSOANG00000006132,RO:HOM0000020,in 1 to 1 orthology relationship with,MGI:1344381
6,ENSEMBL:ENSECAG00000019077,RO:HOM0000020,in 1 to 1 orthology relationship with,ENSEMBL:ENSCAFG00000017647
7,MONARCH:APO_0000222APO_0000004,RO:0002200,has phenotype,SGD:S000002780
8,FlyBase:FBgn0013275,RO:0002434,interacts with,FlyBase:FBgn0036505
9,ENSEMBL:ENSGALG00000044425,RO:HOM0000017,in orthology relationship with,ENSEMBL:ENSCAFG00000016509


In [7]:
merge1 = pd.merge(statements, tf, how='inner', left_on='subject_id', right_on='subject_id',
                      suffixes=('_graph', '_tf'))

In [8]:
merge1

Unnamed: 0,object_id_graph,property_id_graph,property_label_graph,subject_id,object_id_tf,property_description,property_id_tf,property_label_tf,property_uri,reference_date,reference_supporting_text,reference_uri
0,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:9701,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
1,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:7879,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
2,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:8618,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
3,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:11290,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
4,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:11743,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
5,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:11289,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
6,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:3238,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
7,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:6206,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
8,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:3241,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076
9,HP:0000343,RO:0002200,has phenotype,HGNC:11551,HGNC:3488,,RO:0002434,interacts with,http://purl.obolibrary.org/obo/RO_0002434,2012-09-14,"This edge comes from the NEPH2012 dataset in ""...",https://www.ncbi.nlm.nih.gov/pubmed/22959076


In [9]:
merge1_clean = (
    merge1[['subject_id', 'property_id_tf', 'object_id_tf', 'reference_uri_tf',
            'reference_supporting_text_tf', 'reference_date_tf', 'property_label_tf',
            'property_description_tf', 'property_uri_tf']]
    .rename(columns={
        'property_id_tf': 'property_id',
        'object_id_tf': 'object_id',
        'reference_uri_tf': 'reference_uri',
        'reference_supporting_text_tf': 'reference_supporting_text',
        'reference_date_tf': 'reference_date',
        'property_label_tf': 'property_label',
        'property_description_tf': 'property_description',
        'property_uri_tf': 'property_uri'
    })
)

KeyError: "['reference_uri_tf' 'reference_supporting_text_tf' 'reference_date_tf'\n 'property_description_tf' 'property_uri_tf'] not in index"
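
The `KeyError` above reports that several of the `_tf`-suffixed columns are not present in `merge1`. A minimal sketch of a more defensive selection follows, assuming the intent is to keep `subject_id` plus whichever `_tf` columns actually exist and to strip the suffix. The helper name `select_and_strip_suffix` and the toy data frame are hypothetical and are not part of the notebook's modules.

```python
import pandas as pd


def select_and_strip_suffix(df, keep, suffix="_tf"):
    """Return the `keep` columns plus every column ending in `suffix`,
    with the suffix removed from the column names (hypothetical helper)."""
    missing = [c for c in keep if c not in df.columns]
    if missing:
        raise KeyError("required columns missing: {}".format(missing))
    # only select suffixed columns that are really in the frame
    suffixed = [c for c in df.columns if c.endswith(suffix)]
    renamed = {c: c[:-len(suffix)] for c in suffixed}
    return df[keep + suffixed].rename(columns=renamed)


# toy frame standing in for merge1 (values are illustrative only)
toy = pd.DataFrame({
    "subject_id": ["HGNC:11551"],
    "property_id_tf": ["RO:0002434"],
    "object_id_tf": ["HGNC:7879"],
    "property_label_tf": ["interacts with"],
    "other": [1],
})
clean = select_and_strip_suffix(toy, keep=["subject_id"])
print(list(clean.columns))
```

Selecting by suffix avoids hard-coding column names that a given merge may not produce, at the cost of silently accepting a frame with fewer reference columns than expected.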

# fix curation data: cast all columns to object dtype and fill the
# 'name' column with the placeholder string "NaN"
curation_nodes = curation_nodes.astype('object')
curation_nodes['name'] = "NaN"  # scalar is broadcast to every row

curation_nodes.loc[0]

# construct df's
edges = graph.build_edges(
    curation=curation_edges,
    monarch=monarch_edges,
    transcriptomics=rna_edges,
    regulation=reg_edges
)
nodes = graph.build_nodes(
    statements=edges,
    curation=curation_nodes,
    monarch=monarch_nodes,
    transcriptomics=rna_nodes,
    regulation=reg_nodes
)

nodes = pd.read_csv("/home/karolis/Structured review/bioknowledge-reviewer/bioknowledge_reviewer/graph/graph_nodes_v2020-11-20.csv")
edges = pd.read_csv("/home/karolis/Structured review/bioknowledge-reviewer/bioknowledge_reviewer/graph/graph_edges_v2020-11-20.csv")

- Regulation edges _merged_ with the graph are returned as both a digital object (`reg_graph_edges`) and a CSV file at _**graph/**_ (`regulation_graph_edges_version.csv`).
- The Monarch network is returned as both digital objects (`monarch_graph_edges`, `monarch_graph_nodes`) and CSV files at _**monarch/**_ (`monarch_edges_version.csv`, `monarch_nodes_version.csv`), overwriting the previous ones.
- The review knowledge graph is returned as both digital objects (`edges`, `nodes`) and CSV files at _**graph/**_ (`graph_edges_version.csv`, `graph_nodes_version.csv`).
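
The file names loaded elsewhere in this notebook (for example `graph_edges_v2020-12-02.csv`) suggest that `version` is an ISO date prefixed with `v`. A minimal sketch of building such a path follows, under that assumption. The helper `versioned_csv_path` is hypothetical and is not a function of the `graph` module.

```python
import datetime
from pathlib import PurePosixPath


def versioned_csv_path(directory, prefix, date=None):
    """Build '<directory>/<prefix>_v<YYYY-MM-DD>.csv' (hypothetical helper)."""
    date = date or datetime.date.today()
    return PurePosixPath(directory) / "{}_v{}.csv".format(prefix, date.isoformat())


path = versioned_csv_path("graph", "graph_edges", datetime.date(2020, 12, 2))
print(path)
# a DataFrame could then be written with: edges.to_csv(str(path), index=False)
```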

## Neo4jlib library
### Import the graph into Neo4j graph database
#### import neo4jlib

Tasks:

- Create Neo4j server instance
- Import review graph into the Neo4j graph database

In [15]:
edges_df = pd.read_csv("~/Structured review/HD-SR/bioknowledge-reviewer/bioknowledge_reviewer/graph/graph_edges_v2020-12-02.csv")
nodes_df = pd.read_csv("~/Structured review/HD-SR/bioknowledge-reviewer/bioknowledge_reviewer/graph/graph_nodes_v2020-12-02.csv")



In [16]:
%%time
# create a Neo4j server instance
neo4j_dir = neo4jlib.create_neo4j_instance('3.5.6')
print('The name of the neo4j directory is {}'.format(neo4j_dir))

# import to graph database
## prepare the graph to neo4j format
#edges_df = utils.get_dataframe(edges)
#nodes_df = utils.get_dataframe(nodes)
statements = neo4jlib.get_statements(edges_df)
concepts = neo4jlib.get_concepts(nodes_df)
print('statements: ', len(statements))
print('concepts: ',len(concepts))

## save files into neo4j import dir
neo4j_path = './{}'.format(neo4j_dir)
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type = 'statements')
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type = 'concepts')

## import graph into neo4j database
neo4jlib.do_import(neo4j_path)

Creating a Neo4j community v3.5.6 server instance...
Configuration adjusted!
Neo4j v3.5.6 is running.
The name of the neo4j directory is neo4j-community-3.5.6
statements:  433601
concepts:  33817

File './neo4j-community-3.5.6/import/ngly1/ngly1_statements.csv' saved.

File './neo4j-community-3.5.6/import/ngly1/ngly1_concepts.csv' saved.

The function "do_import()" is running...

The graph is imported into the server. Neo4j is running.You can start exploring and querying for hypothesis. If you change ports or authentication in the Neo4j configuration file, the hypothesis-generation modules performance, hypothesis.py and summary.py, will be affected.

CPU times: user 6.09 s, sys: 399 ms, total: 6.49 s
Wall time: 32.8 s


In [6]:
del edges_df
del nodes_df

In [15]:
# print type of objects
print('type edges:', type(statements))
print('type nodes:', type(concepts))
print()

# print objects sizes
print('len edges:', len(statements))
print('len nodes:', len(concepts))
print()

# print object attribute
print('attribute edges:', statements.columns)
print('attribute nodes:', concepts.columns)

NameError: name 'statements' is not defined

## hypothesis-generation library
### Query the graph for mechanistic explanation, then summarize the extracted paths
#### import hypothesis, summary


Tasks:

* Retrieve orthopheno paths with the `query` method.
* Retrieve orthopheno paths using relaxing node degree parameters with the `query` method.
* Retrieve orthopheno paths from a more open query topology with the `open_query` method.
* Get hypothesis summaries

### Orthopheno query with general nodes/relations removed

In [None]:
from neo4j import GraphDatabase
import sys,os
import json
import yaml
import datetime
import neo4j.exceptions

In [None]:
def parse_path( path ):
    """
    This function parses neo4j results.
    :param path: neo4j path object
    :return: parsed path dictionary
    """

    out = {}
    out['Nodes'] = []
    for node in path['path'].nodes:
        n = {}
        n['idx'] = node.id
        n['label'] = list(node.labels)[0]
        n['id'] = node.get('id')
        n['preflabel'] = node.get('preflabel')
        n['name'] = node.get('name')
        n['description'] = node.get('description')
        out['Nodes'].append(n)
    out['Edges'] = []
    for edge in path['path'].relationships:
        e = {}
        e['idx'] = edge.id
        e['start_node'] = edge.start_node.id
        e['end_node'] = edge.end_node.id
        e['type'] = edge.type
        e['preflabel'] = edge.get('property_label')
        e['references'] = edge.get('reference_uri')
        out['Edges'].append(e)
    return out

In [None]:
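def get_node(source, target, port='7687'):
    """
    This function retrieves all shortest paths (up to 15 hops) between
    the source and target nodes, which also shows whether both nodes
    exist and are connected within the graph.
    :param source: id of the source node, e.g. 'HGNC:4851'
    :param target: id of the target node
    :return: list of result records, and the neo4j result object
    """
    driver = GraphDatabase.driver("bolt://localhost:{}".format(port), auth=("neo4j", "ngly1"))
    # node ids are passed as query parameters instead of being concatenated into the query text
    query = """MATCH (source { id: $source }), (target { id: $target }), p = allShortestPaths((source)-[*..15]-(target)) RETURN p"""
    with driver.session() as session:
        result = session.run(query, source=source, target=target)
        print(result)
        # collect the records while the session is still open
        x = []
        for record in result:
            #path_dct = parse_path(record)
            x.append(record)
        print(x)
        return x, result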

In [None]:
x, result = get_node('HGNC:4851', 'HGNC:18229')

In [None]:
x[1]

In [None]:
result.consume()

In [None]:
def query_shortest_paths(source, target, max_length=4, port='7687'):
    """
    This function gets a shortest path between source and
    target. max_length determines the maximum path size allowed.
    """
    driver = GraphDatabase.driver("bolt://localhost:{}".format(port), auth=("neo4j", "ngly1"))
    # the hop limit is formatted into the query text as an integer (it was
    # previously hard-coded to 15, so max_length had no effect); node ids
    # are passed as query parameters
    query = """MATCH (source {{ id: $source }}), (target {{ id: $target }}), p = shortestPath((source)-[*..{}]-(target)) RETURN p""".format(int(max_length))
    with driver.session() as session:
        result = session.run(query, source=source, target=target)
        print(result)
        return result
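
The two query functions above differ only in the Cypher path function and the hop bound. A minimal sketch of a shared query builder follows. The name `build_shortest_path_query` is hypothetical. It assumes node ids are supplied as the query parameters `$source` and `$target`, and that the hop bound has to be written into the query text because, to my knowledge, Cypher does not accept a parameter in that position.

```python
def build_shortest_path_query(max_length=4, all_paths=False):
    """Build a parameterized Cypher shortest-path query with a bounded hop count
    (hypothetical helper, not part of the hypothesis module)."""
    max_length = int(max_length)  # reject non-numeric input before formatting
    if max_length < 1:
        raise ValueError("max_length must be a positive integer")
    func = "allShortestPaths" if all_paths else "shortestPath"
    return (
        "MATCH (source {id: $source}), (target {id: $target}), "
        "p = " + func + "((source)-[*.." + str(max_length) + "]-(target)) RETURN p"
    )


query = build_shortest_path_query(max_length=4)
params = {"source": "HGNC:4851", "target": "HGNC:18229"}
print(query)
# with an open session this could be run as: session.run(query, **params)
```

Keeping the ids out of the query text avoids quoting problems with identifiers and lets the database reuse the query plan across seed pairs.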

In [None]:
seed = list([
        'HGNC:4851',  # Htt human gene
        'HGNC:18229'  # Rhes human gene
])
result = query_shortest_paths(seed[0], seed[1], max_length=999)

In [None]:
# use a separate name so the imported `summary` module is not overwritten
result_summary = result.consume()

In [None]:
result_summary.query

In [None]:
result.data()

In [None]:
%%time
# get orthopheno paths
seed = list([
        'HGNC:4851',  # Htt human gene
        'HGNC:18229'  # Rhes human gene
])
hypothesis.query(seed,queryname='Htt_Rhes',port='7687') 

In [None]:
%%time
# get orthopheno paths relaxing pathway and phenotype node degrees
seed = list([
        'HGNC:4851',  # Htt human gene
        'HGNC:18229'  # Rhes human gene
])
hypothesis.query(seed, queryname='Htt_Rhes', pwdegree='1000', phdegree='1000', port='7687')

In [None]:
%%time
# get orthopheno paths from a more open query topology
seed = list([
        'HGNC:4851',  # Htt human gene
        'HGNC:18229'  # Rhes human gene
])
hypothesis.open_query(seed,queryname='Htt_Rhes',port='7687')

In [None]:
import hypothesis

In [None]:
%%time
seed = list([
        'HGNC:4851',  # Htt human gene
        'HGNC:18229'  # Rhes human gene
])
r = hypothesis.shortest_paths(seed[0], seed[1], max_length=50, port='7687')

In [None]:
print(r)

In [None]:
%%time
# get summary
data = summary.path_load('./hypothesis/query_ngly1_aqp1_pwdl1000_phdl1000_paths_v2020-09-14.json')

# parse data for summarization
data_parsed = list()
for query in data:
    query_parsed = summary.query_parser(query)
    data_parsed.append(query_parsed)
summary.metapaths(data_parsed)
summary.nodes(data_parsed)
summary.node_types(data_parsed)
summary.edges(data_parsed)
summary.edge_types(data_parsed)
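
To illustrate what a metapath summary might compute, here is a minimal sketch that counts node-label sequences over path dictionaries shaped like the output of `parse_path` (a `'Nodes'` list whose items carry a `'label'`). It is not the implementation of `summary.metapaths`, which may work differently. The helper names and the toy labels are hypothetical.

```python
from collections import Counter


def metapath_of(parsed_path):
    """Join the node labels of one parsed path into a metapath string."""
    return "-".join(node["label"] for node in parsed_path["Nodes"])


def count_metapaths(parsed_paths):
    """Count how often each metapath occurs across the parsed paths."""
    return Counter(metapath_of(p) for p in parsed_paths)


# toy paths with illustrative labels only
toy_paths = [
    {"Nodes": [{"label": "GENE"}, {"label": "DISO"}, {"label": "GENE"}], "Edges": []},
    {"Nodes": [{"label": "GENE"}, {"label": "PHYS"}, {"label": "GENE"}], "Edges": []},
    {"Nodes": [{"label": "GENE"}, {"label": "DISO"}, {"label": "GENE"}], "Edges": []},
]
counts = count_metapaths(toy_paths)
for metapath, n in counts.most_common():
    print(metapath, n)
```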