# Description

This notebook builds the first review network and the hypotheses derived from it.

* Intermediary variables come from workflow objects: in this workflow, the variables produced by one step are passed directly to the next step.


* Review network: from the Monarch knowledge graph, we built a network seeded by 8 nodes. We retrieved the explicit relationships of the seeds and then all the relationships among all the resulting nodes. Seed nodes:

    - 'MONDO:0007739' HD
    - 'HGNC:4851' Htt
    - 'CHEBI:18248' Iron (not working; the extraction cell below seeds 'REACT:R-HSA-917937', the iron uptake pathway, instead)
    - 'HGNC:18229' Rhes (RASD2)
    
Possible seed nodes: https://monarchinitiative.org/search/Iron

* Connecting paths: query templates.
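
The seeding idea in the review-network bullet can be illustrated with a toy example. This is only a sketch under stated assumptions: the helper `expand_seeds` and the `EX:` identifiers are hypothetical, and nothing here calls the Monarch API or the `monarch` module. It shows the two steps described: collect the direct neighbours of the seeds, then keep every edge whose two endpoints both lie in the collected node set.

```python
def expand_seeds(seeds, edges):
    """Return (nodes, induced_edges) for a one-step expansion around `seeds`.

    `edges` is a list of (subject_id, object_id) pairs; direction is
    ignored when looking for neighbours.
    """
    seeds = set(seeds)
    nodes = set(seeds)
    # Step 1: add every node that shares an edge with a seed.
    for subject, obj in edges:
        if subject in seeds:
            nodes.add(obj)
        if obj in seeds:
            nodes.add(subject)
    # Step 2: keep all edges among the collected nodes, not only seed edges.
    induced = [(s, o) for s, o in edges if s in nodes and o in nodes]
    return nodes, induced


# Hypothetical toy edges; the EX: identifiers are made up for illustration.
toy_edges = [
    ("HGNC:4851", "MONDO:0007739"),
    ("HGNC:4851", "EX:geneA"),
    ("EX:geneA", "EX:geneB"),
    ("EX:geneA", "MONDO:0007739"),
    ("EX:geneB", "EX:geneC"),
]
nodes, induced = expand_seeds(["MONDO:0007739", "HGNC:4851"], toy_edges)
print(sorted(nodes))
print(len(induced))
```

Note that the edge between `EX:geneA` and the disease node is kept even though it was not needed to discover `EX:geneA`; that is what "all the relationships among all these nodes" amounts to in this sketch.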

In [1]:
import transcriptomics, regulation, curation, monarch, graph, neo4jlib, hypothesis, summary, utils
import pandas as pd

In [2]:
%%time
# load the curated network (edges and nodes) from CSV:
curation_edges = pd.read_csv("curation/data/HD/HD_curated_edges.csv")
curation_nodes = pd.read_csv("curation/data/HD/HD_curated_nodes.csv")


CPU times: user 3.29 ms, sys: 0 ns, total: 3.29 ms
Wall time: 2.9 ms


In [3]:
%%time
# transcriptomics:
csv_path = './transcriptomics/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.txt'
data = transcriptomics.read_data(csv_path, sep='\t')
clean_data = transcriptomics.clean_data(data)
data_edges = transcriptomics.prepare_data_edges(clean_data)
rna_network = transcriptomics.prepare_rna_edges(data_edges)

# build network with graph schema
rna_edges = transcriptomics.build_edges(rna_network)
rna_nodes, node_dict = transcriptomics.build_nodes(rna_network)
rna_edges = transcriptomics.rework_edges(pd.DataFrame(rna_edges), node_dict)


The function "read_data()" is running...

* This is the size of the raw expression data structure: (28087, 10)
* These are the expression attributes: Index(['Unnamed: 0', 'symbol', 'baseMean', 'HD.mean', 'Control.mean',
       'log2FoldChange', 'lfcSE', 'stat', 'pvalue', 'padj'],
      dtype='object')
* This is the first record:
           Unnamed: 0 symbol  baseMean    HD.mean  Control.mean  \
0  ENSG00000069011.10  PITX1  5.645675  18.684286      0.323793   

   log2FoldChange     lfcSE      stat        pvalue          padj  
0        4.769658  0.366367  13.01879  9.567529e-39  2.687232e-34  

The raw data is saved at: /home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/transcriptomics/HD/data/GSE64810_mlhd_DESeq2_diffexp_DESeq2_outlier_trimmed_adjust.csv


Finished read_data().


The function "clean_data()" is running. Keeping only data with FC > 1.5 and FDR < 5% ...

* This is the size of the clean expression data structure: (3209, 6)
* These are the clean expressi
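
The log above states that `clean_data()` keeps rows with FC > 1.5 and FDR < 5%. A minimal sketch of such a filter follows. It assumes that the fold-change cut-off is applied in both directions (that is, |log2FoldChange| > log2(1.5)) and that FDR corresponds to the `padj` column; neither assumption is confirmed by the log. The helper `passes_thresholds` is hypothetical and is not the implementation in `transcriptomics.clean_data`.

```python
import math

FC_THRESHOLD = 1.5      # fold-change cut-off quoted in the log
FDR_THRESHOLD = 0.05    # 5% false discovery rate


def passes_thresholds(row, fc=FC_THRESHOLD, fdr=FDR_THRESHOLD):
    """True if a DESeq2-style row is kept (assumes a two-sided FC filter)."""
    return abs(row["log2FoldChange"]) > math.log2(fc) and row["padj"] < fdr


rows = [
    # First record printed by read_data() above.
    {"symbol": "PITX1", "log2FoldChange": 4.769658, "padj": 2.687232e-34},
    # Hypothetical rows, for illustration only.
    {"symbol": "GENE_X", "log2FoldChange": 0.30, "padj": 0.001},
    {"symbol": "GENE_Y", "log2FoldChange": -1.20, "padj": 0.20},
]
kept = [r["symbol"] for r in rows if passes_thresholds(r)]
print(kept)
```

Working on the log2 scale keeps up- and down-regulated genes symmetric: a log2 fold change of -1.2 is as far from zero as +1.2.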

In [4]:
%%time
# regulation:
gmt_path = './regulation/msigdb/data/c3.tft.v6.1.entrez.gmt'
regulation.prepare_msigdb_data(gmt_path)

# prepare individual networks
data = regulation.load_tf_gene_edges()
dicts = regulation.get_gene_id_normalization_dictionaries(data)
data_edges = regulation.prepare_data_edges(data, dicts)

# prepare regulation network
reg_network = regulation.prepare_regulation_edges(data_edges)

# build network with graph schema
reg_edges = regulation.build_edges(reg_network)
reg_nodes = regulation.build_nodes(reg_network)


The function "prepare_msigdb_data()" is running...

* Number of Transcription Factor Targets (TFT) gene sets: 615

The MSigDB raw network is saved at: /home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/regulation/msigdb/out/tf_genelist_entrez_msigdb.json. Other reporting files are also saved at the same directory.


Finished prepare_msigdb_data().


The function "load_tf_gene_edges()" is running...

Finished load_tf_gene_edges().


The function "get_gene_id_normalization_dictionaries()" is running...

* Querying BioThings to map gene symbols to HGNC and Entrez IDs...
querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-3071...done.
Finished.
49 input query terms found no hit:
	['NKX25', 'CDPCR3HD', 'TCF1P', 'MEIS1AHOXA9', 'MMEF2', 'GNCF', 'TCF11MAFG', 'E2F1DP2', 'TAL1BETAITF2
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

Saving not found gene symbols at: /home/karolis/LUMC/HDSR/bioknowledge-
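
The input `c3.tft.v6.1.entrez.gmt` is a GMT file: each row is tab-separated and holds a gene-set name, a description, and then one column per gene. The sketch below shows how such rows can be turned into (gene set, gene) pairs. The helper `parse_gmt` and the sample rows are hypothetical; this is not the code in `regulation.prepare_msigdb_data`.

```python
def parse_gmt(lines):
    """Map each gene-set name to its list of gene identifiers.

    GMT rows are tab-separated: set name, description, then one column
    per gene.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Invented sample rows; real set names and Entrez IDs come from the file.
sample = [
    "SET_A\thttp://example.org/SET_A\t101\t102\t103\n",
    "SET_B\thttp://example.org/SET_B\t102\t200\n",
    "\n",
]
gene_sets = parse_gmt(sample)
# Flatten into (transcription-factor set, target gene) pairs.
tf_gene_pairs = [(name, gene) for name, genes in gene_sets.items() for gene in genes]
print(len(gene_sets), len(tf_gene_pairs))
```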

In [None]:
%%time
# Monarch extraction:
# seed nodes
seedList = [ 
    'MONDO:0007739', # HD
    'HGNC:4851', # Htt
    'HGNC:182293', # Rhes; NOTE: the description lists Rhes (RASD2) as 'HGNC:18229', so this ID should be verified
    'REACT:R-HSA-917937' # Iron uptake pathway
] 

# get first degree neighbours
neighboursList = monarch.get_neighbours_list(seedList)
print("Nodes 1st degree neighbours:")
print(len(neighboursList))

seed_orthophenoList = monarch.get_orthopheno_list(seedList)
print(len(seed_orthophenoList))
## For 1st shell nodes:
neighbours_orthophenoList = monarch.get_orthopheno_list(neighboursList)
print(len(neighbours_orthophenoList))

# concatenate the seed, neighbour, and ortholog/phenotype lists
node_list = seedList + neighboursList + seed_orthophenoList + neighbours_orthophenoList

monarch_network = monarch.extract_edges(node_list)
print("connections after 2nd degree:")
print(len(monarch_network))
monarch.print_network(monarch_network, 'monarch_connections')


# build network with graph schema 
monarch_edges = monarch.build_edges(monarch_network)
monarch_nodes = monarch.build_nodes(monarch_network)

# print type of objects
print('type edges:', type(monarch_edges))
print('type nodes:', type(monarch_nodes))
print()

# print objects sizes
print('len edges:', len(monarch_edges))
print('len nodes:', len(monarch_nodes))
print()

# print object attribute
print('attribute edges:', monarch_edges[0].keys())
print('attribute nodes:', monarch_nodes[0].keys())

  0%|          | 0/4 [00:00<?, ?it/s]


The function "get_neighbours_list()" is running. Its runtime may take some minutes. If you interrupt the process, you will lose all the nodes retrieved and you should start over the execution of this function.


 25%|██▌       | 1/4 [00:00<00:00,  7.58it/s]

error: <class 'KeyboardInterrupt'>
MONDO:0007739


 75%|███████▌  | 3/4 [00:01<00:00,  2.66it/s]

error: <class 'KeyboardInterrupt'>
REACT:R-HSA-917937
error: <class 'KeyboardInterrupt'>
HGNC:4851


100%|██████████| 4/4 [00:01<00:00,  3.04it/s]
  0%|          | 0/4 [00:00<?, ?it/s]

error: <class 'KeyboardInterrupt'>
HGNC:182293

Finished get_neighbours_list().

Nodes 1st degree neighbours:
0

The function "get_orthopheno_list()" is running. Its runtime may take some hours. If you interrupt the process, you will lose all the nodes retrieved and you should start over the execution of this function.


 50%|█████     | 2/4 [00:00<00:00,  4.87it/s]

error: <class 'KeyboardInterrupt'>
MONDO:0007739
error: <class 'KeyboardInterrupt'>
REACT:R-HSA-917937


100%|██████████| 4/4 [00:00<00:00,  5.61it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
0it [00:00, ?it/s]
  0%|          | 0/4 [00:00<?, ?it/s]

error: <class 'KeyboardInterrupt'>
HGNC:4851
error: <class 'KeyboardInterrupt'>
HGNC:182293

Finished get_orthopheno_list().

0

The function "get_orthopheno_list()" is running. Its runtime may take some hours. If you interrupt the process, you will lose all the nodes retrieved and you should start over the execution of this function.

Finished get_orthopheno_list().

0

The function "extract_edges()" is running. Its runtime may take some hours. If you interrupt the process, you will lose all the edges retrieved and you should start over the execution of this function.


In [5]:
# This cell is a stand-in for edge debugging and should be removed in the final version.
# Both tables are converted into lists of records (dicts).
monarch_edges = pd.read_csv("/home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/monarch/monarch_edges_v2022-04-12.csv")
monarch_nodes = pd.read_csv("/home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/monarch/monarch_nodes_v2022-04-12.csv")
monarch_edges = monarch_edges.to_dict('records')
monarch_nodes = monarch_nodes.to_dict('records')

In [6]:
# print type of objects
print('type edges:', type(monarch_edges))
print('type nodes:', type(monarch_nodes))
print()

# print objects sizes
print('len edges:', len(monarch_edges))
print('len nodes:', len(monarch_nodes))
print()

# print object attribute
print('attribute edges:', monarch_edges[0].keys())
print('attribute nodes:', monarch_nodes[0].keys())

type edges: <class 'list'>
type nodes: <class 'list'>

len edges: 200617
len nodes: 13082

attribute edges: dict_keys(['subject_id', 'property_id', 'object_id', 'reference_uri', 'reference_supporting_text', 'reference_date', 'property_label', 'property_description', 'property_uri'])
attribute nodes: dict_keys(['id', 'semantic_groups', 'preflabel', 'synonyms', 'description', 'name'])
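
Printing the keys of the first record only checks one row. The sketch below checks every record against the attribute names printed above. The helper `find_bad_records` and the sample records are hypothetical and only illustrate the idea.

```python
EDGE_KEYS = {
    "subject_id", "property_id", "object_id", "reference_uri",
    "reference_supporting_text", "reference_date", "property_label",
    "property_description", "property_uri",
}
NODE_KEYS = {"id", "semantic_groups", "preflabel", "synonyms", "description", "name"}


def find_bad_records(records, expected_keys):
    """Return the indices of records whose keys differ from `expected_keys`."""
    expected = set(expected_keys)
    return [i for i, rec in enumerate(records) if set(rec) != expected]


# Hypothetical node records: the second one lacks the 'name' key.
sample_nodes = [
    dict.fromkeys(NODE_KEYS, "placeholder"),
    {k: "placeholder" for k in NODE_KEYS if k != "name"},
]
bad = find_bad_records(sample_nodes, NODE_KEYS)
print(bad)
```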


In [7]:
# construct the combined edge table from all four sources
edges = graph.build_edges(
    curation=curation_edges,
    monarch=monarch_edges,
    transcriptomics=rna_edges,
    regulation=reg_edges
)



The function "build_edges()" is running...

Preparing networks...
Curated:
(1, 9)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')
Monarch:
(200617, 9)
Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')
Transcriptomics:
(3209, 9)
Index(['subject_id', 'object_id', 'property_id', 'property_label',
       'property_description', 'property_uri', 'reference_uri',
       'reference_supporting_text', 'reference_date'],
      dtype='object')
Regulatory:
(197267, 9)
Index(['subject_id', 'object_id', 'property_id', 'property_label',
       'property_description', 'property_uri', 'reference_uri',
       'reference_supporting_text', 'reference_date'],
      dtype='object')

Concatenatin
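
Two details in the log above are worth noting. First, the Monarch and transcriptomics tables list the same nine columns in a different order, so they have to be combined by column name rather than by position. Second, the four inputs sum to 1 + 200617 + 3209 + 197267 = 401094 rows, while the graph loaded later in this notebook has 401070 edges; the difference of 24 would be consistent with duplicate removal, although the truncated log does not confirm this. The sketch below illustrates both points with a hypothetical helper, `merge_edge_tables`; it is not the implementation of `graph.build_edges`.

```python
EDGE_COLUMNS = [
    "subject_id", "property_id", "object_id", "reference_uri",
    "reference_supporting_text", "reference_date", "property_label",
    "property_description", "property_uri",
]


def merge_edge_tables(*tables):
    """Concatenate lists of edge dicts and drop exact duplicates.

    Rows are compared by column name, so the key order inside each dict
    does not matter. The first occurrence of a duplicate is kept, and
    missing columns are filled with None.
    """
    seen = set()
    merged = []
    for table in tables:
        for row in table:
            key = tuple(row.get(col) for col in EDGE_COLUMNS)
            if key not in seen:
                seen.add(key)
                merged.append({col: row.get(col) for col in EDGE_COLUMNS})
    return merged


# Hypothetical rows; the second table repeats the first edge with keys reordered.
table_a = [{"subject_id": "EX:1", "property_id": "EX:rel", "object_id": "EX:2"}]
table_b = [
    {"object_id": "EX:2", "subject_id": "EX:1", "property_id": "EX:rel"},
    {"subject_id": "EX:2", "object_id": "EX:3", "property_id": "EX:rel"},
]
merged = merge_edge_tables(table_a, table_b)
print(len(merged))
```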

In [8]:
nodes = graph.build_nodes(
    statements=edges,
    curation=curation_nodes,
    monarch=monarch_nodes,
    transcriptomics=rna_nodes,
    regulation=reg_nodes
)


The function "build_nodes()" is running...

Preparing networks...
Curated:
(2, 6)
Index(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms',
       'description'],
      dtype='object')
Monarch:
(13082, 6)
Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'description',
       'name'],
      dtype='object')
Transcriptomics:
(3210, 6)
Index(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms',
       'description'],
      dtype='object')
Regulatory:
(16989, 6)
Index(['id', 'semantic_groups', 'preflabel', 'name', 'synonyms',
       'description'],
      dtype='object')

Annotating nodes in the graph...
graph from e (31266, 1)
annotation check
curated (2, 6)
monarch (13082, 6)
rna (31266, 6)
regulation (16989, 6)

Concatenating all nodes...
graph ann (61339, 6)
diff set()

Drop duplicated rows...
(59439, 6)

Drop duplicated nodes...
(31266, 6)

All graph nodes are annotated.
Regulation nodes not in the graph: 0

Saving final graph...
(31266, 6)
Index(['id', 'semantic_gr
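
The row counts in the log above trace a two-stage de-duplication: the annotated tables add up to 2 + 13082 + 31266 + 16989 = 61339 rows, dropping identical rows leaves 59439, and keeping one row per node leaves 31266. A minimal sketch of those two stages follows. The helper names are hypothetical, and keeping the first annotation seen for each `id` is an assumption; the log does not say which duplicate `graph.build_nodes` keeps.

```python
NODE_COLUMNS = ["id", "semantic_groups", "preflabel", "synonyms", "name", "description"]


def drop_duplicate_rows(rows, columns=NODE_COLUMNS):
    """Stage 1: remove rows that are identical in every column."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out


def drop_duplicate_ids(rows):
    """Stage 2: keep one row per node id (assumption: the first one seen)."""
    seen, out = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row)
    return out


# Hypothetical annotations: EX:1 appears three times, twice identically.
rows = [
    {"id": "EX:1", "name": "alpha"},
    {"id": "EX:1", "name": "alpha"},
    {"id": "EX:1", "name": "alpha (alternative label)"},
    {"id": "EX:2", "name": "beta"},
]
unique_rows = drop_duplicate_rows(rows)
unique_nodes = drop_duplicate_ids(unique_rows)
print(len(rows), len(unique_rows), len(unique_nodes))
```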

edges = pd.read_csv('/home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/graph/graph_edges_v2022-05-04.csv')
nodes = pd.read_csv('/home/karolis/LUMC/HDSR/bioknowledge-reviewer/bioknowledge_reviewer/graph/graph_nodes_v2022-05-04.csv')

In [9]:
# print type of objects
print('type edges:', type(edges))
print('type nodes:', type(nodes))
print()

# print objects sizes
print('len edges:', len(edges))
print('len nodes:', len(nodes))
print()

# print object attribute
print('attribute edges:', edges.columns)
print('attribute nodes:', nodes.columns)

type edges: <class 'pandas.core.frame.DataFrame'>
type nodes: <class 'pandas.core.frame.DataFrame'>

len edges: 401070
len nodes: 31266

attribute edges: Index(['subject_id', 'property_id', 'object_id', 'reference_uri',
       'reference_supporting_text', 'reference_date', 'property_label',
       'property_description', 'property_uri'],
      dtype='object')
attribute nodes: Index(['id', 'semantic_groups', 'preflabel', 'synonyms', 'name',
       'description'],
      dtype='object')
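
Before importing the graph into a database, it can help to confirm that every identifier used in an edge also exists as a node. The sketch below shows one way to do that on lists of records; `dangling_ids` is a hypothetical helper and the `EX:` identifiers are placeholders, not data from this graph.

```python
def dangling_ids(edges, nodes):
    """Return identifiers that occur in edges but have no node record."""
    node_ids = {node["id"] for node in nodes}
    used_ids = {edge["subject_id"] for edge in edges}
    used_ids |= {edge["object_id"] for edge in edges}
    return used_ids - node_ids


# Placeholder records: EX:3 is referenced by an edge but is not a node.
sample_edges = [
    {"subject_id": "EX:1", "object_id": "EX:2"},
    {"subject_id": "EX:2", "object_id": "EX:3"},
]
sample_nodes = [{"id": "EX:1"}, {"id": "EX:2"}]
missing = dangling_ids(sample_edges, sample_nodes)
print(missing)
```

With pandas DataFrames such as `edges` and `nodes`, the same records can be obtained with `to_dict('records')`, as done earlier in this notebook.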


In [10]:
%%time
# create a Neo4j server instance
neo4j_dir = neo4jlib.create_neo4j_instance('4.2.1')
print('The name of the neo4j directory is {}'.format(neo4j_dir))

# import to graph database
## prepare the graph to neo4j format
edges_df = utils.get_dataframe(edges)
nodes_df = utils.get_dataframe(nodes)
statements = neo4jlib.get_statements(edges_df)
concepts = neo4jlib.get_concepts(nodes_df)
print('statements: ', len(statements))
print('concepts: ',len(concepts))

## save files into neo4j import dir
neo4j_path = './{}'.format(neo4j_dir)
neo4jlib.save_neo4j_files(statements, neo4j_path, file_type = 'statements')
neo4jlib.save_neo4j_files(concepts, neo4j_path, file_type = 'concepts')

## import graph into neo4j database
neo4jlib.do_import(neo4j_path)

Creating a Neo4j community v4.2.1 server instance...
Configuration adjusted!
Starting the server...
Neo4j v4.2.1 is running.
The name of the neo4j directory is neo4j-community-4.2.1
statements:  401070
concepts:  31266

File './neo4j-community-4.2.1/import/HD/HD_statements.csv' saved.

File './neo4j-community-4.2.1/import/HD/HD_concepts.csv' saved.

The function "do_import()" is running...

The graph is imported into the server. Neo4j is running.You can start exploring and querying for hypothesis. If you change ports or authentication in the Neo4j configuration file, the hypothesis-generation modules performance, hypothesis.py and summary.py, will be affected.

CPU times: user 3.32 s, sys: 160 ms, total: 3.48 s
Wall time: 18.3 s
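
The two "saved" messages above suggest that the import files follow the pattern `<neo4j_dir>/import/HD/HD_<file_type>.csv`. The sketch below reproduces that pattern so the files can be located later. It is inferred from the log only: `import_file_path` is a hypothetical helper, and treating `HD` as a configurable project name is an assumption, not a documented feature of `neo4jlib`.

```python
from pathlib import PurePosixPath


def import_file_path(neo4j_dir, file_type, project="HD"):
    """Build the path of an import CSV, following the pattern seen in the log."""
    file_name = "{}_{}.csv".format(project, file_type)
    return PurePosixPath(neo4j_dir) / "import" / project / file_name


statements_path = import_file_path("neo4j-community-4.2.1", "statements")
concepts_path = import_file_path("neo4j-community-4.2.1", "concepts")
print(statements_path)
print(concepts_path)
```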
