# KG-Hub: Advanced Knowledge Graph Assembly

This notebook serves as a practical guide to advanced KG-Hub features and resources. A brief review of the Getting Started tutorial notebook is not a strict prerequisite but may be helpful.  

This notebook also assumes you are in a Linux environment, but Google Colab is an option as well.

Here's an example question for our use case: which foods may impact DNA repair pathways? It's a broad question with many possible answers, or no answers at all. A KG may hold some clues.

## Setup

First, we'll install the requirements.


In [4]:
!pip install kgx
!pip install kghub-downloader

  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes


In [7]:
import os
import yaml

Now we need to set up two things for KGX to work properly:
* A download config file
* A merge config file

In practice, we may need to write a new transform for each new source, but all of the sources we'll use here are conveniently already available as KGX node and edge files on KG-Hub.

We'll download five sources. Two are ontologies available through the KG-OBO project on KG-Hub: FOODON, a food ontology, and CHEBI, a chemical ontology. The other sources are sets of preprocessed [Reactome](https://reactome.org) pathways, connections between those pathways, and mappings between those pathways and chemicals. They're all defined in a dictionary below, with the name of each source as its key and a list of one or more source URLs as its value. We've also defined a set of local filenames, as we know what the compressed ontology files should contain.

In [7]:
data_dir = "./" # Just the current directory, though in practice it would be something like data/raw/
sources = {"foodon":["https://kg-hub.berkeleybop.io/kg-obo/foodon/2022-02-01/foodon_kgx_tsv.tar.gz"],
           "chebi":["https://kg-hub.berkeleybop.io/kg-obo/chebi/210/chebi_kgx_tsv.tar.gz"],
           "chebi2reactome":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/chebi2reactome_edges.tsv",
                             "https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/chebi2reactome_nodes.tsv"],
           "reactome_pathways":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathways_nodes.tsv"],
           "reactome_relations":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathwaysrelation_edges.tsv"]}
local_filepaths = {"foodon":["foodon_kgx_tsv_edges.tsv",
                            "foodon_kgx_tsv_nodes.tsv"],
           "chebi":["chebi_kgx_tsv_edges.tsv",
                    "chebi_kgx_tsv_nodes.tsv"],
           "chebi2reactome":["chebi2reactome_edges.tsv",
                             "chebi2reactome_nodes.tsv"],
           "reactome_pathways":["reactomepathways_nodes.tsv"],
           "reactome_relations":["reactomepathwaysrelation_edges.tsv"]}

There is an example of a KGX download config file [here](https://github.com/Knowledge-Graph-Hub/kg-dtm-template/blob/master/download.yaml), but it's easy to assemble from scratch with something like the following:

In [19]:
source_data = []
for source in sources:
  for url in sources[source]:
    local_name = url.rpartition('/')[-1]
    source_data.append({"url":url,
                        "local_name":local_name})

with open("download.yaml", "w") as dl_config:
  yaml.dump(source_data, dl_config, default_flow_style=False)

Now we may use the config file with the `kghub-downloader` to download all sources.

In [4]:
!downloader download.yaml

/bin/bash: downloader: command not found


Decompress the compressed sources.

In [5]:
!cat *.tar.gz | tar zxvf - -i

chebi_kgx_tsv_nodes.tsv
chebi_kgx_tsv_edges.tsv
foodon_kgx_tsv_nodes.tsv
foodon_kgx_tsv_edges.tsv


Next step: set up a merge config file. Our sources are already in the expected KGX graph format, so no transformation is necessary.

See the [example merge config](https://github.com/Knowledge-Graph-Hub/kg-dtm-template/blob/master/merge.yaml) in this repository for further inspiration.

In [10]:
merge_data = {"configuration":{"output_directory":data_dir,
                              "checkpoint":"false"
                              },
              "merged_graph":{"name":"tutorial_graph",
                              "source":{},
                              "operations":[{"name": "kgx.graph_operations.summarize_graph.generate_graph_stats",
                                        "args":{"graph_name":"tutorial_graph",
                                        "filename":"merged_graph_stats.yaml"
                                                }
                                                }
                                                ],
                                "destination":{"merged-kg-tsv":{"format":"tsv",
                                              "filename": "merged-kg"}
                                                },            
                                }
                }

for source in local_filepaths:
  merge_data["merged_graph"]["source"][source] = {"name":source,
                                                  "input":{"format":"tsv",
                                                          "filename":local_filepaths[source]}
                                                  }

with open("merge.yaml", "w") as merge_config:
  yaml.dump(merge_data, merge_config, default_flow_style=False)

## KG Assembly

The data files are all here and the configuration files are ready. We may now use `kgx` to assemble a single set of nodes and edges from them all.

In [2]:
from kgx.cli.cli_utils import merge



In [3]:
merged_graph = merge("merge.yaml")

[KGX][cli_utils.py][               merge] INFO: Spawning process for 'chebi'
[KGX][cli_utils.py][               merge] INFO: Spawning process for 'chebi2reactome'
[KGX][cli_utils.py][               merge] INFO: Spawning process for 'foodon'
[KGX][cli_utils.py][               merge] INFO: Spawning process for 'reactome_pathways'
[KGX][cli_utils.py][               merge] INFO: Spawning process for 'reactome_relations'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'chebi'
[KGX][cli_utils.py][        parse_source] INFO: Writing checkpoint for source 'chebi'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'chebi2reactome'
[KGX][cli_utils.py][        parse_source] INFO: Writing checkpoint for source 'chebi2reactome'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'foodon'
[KGX][cli_utils.py][        parse_source] INFO: Writing checkpoint for source 'foodon'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'reactome_path

If everything went as expected, the merged KG will be in `merged-kg_edges.tsv` and `merged-kg_nodes.tsv`. There will also be a `merged_graph_stats.yaml` detailing the new graph contents. Let's take a quick look at the stats file first.

In [8]:
with open("merged_graph_stats.yaml") as yaml_file:
    config = yaml.load(yaml_file, Loader=yaml.FullLoader)

In [16]:
# Count of all edges in the graph
print(config["edge_stats"]["total_edges"])

# Count of all nodes in the graph
print(config["node_stats"]["total_nodes"])

706192
232980


In [18]:
# What kind of nodes are in the graph?
for category in config["node_stats"]["node_categories"]:
    print(category)

biolink:BiologicalProcess
biolink:BiologicalProcessOrActivity
biolink:ChemicalEntity
biolink:NamedThing
biolink:Pathway


Nodes in ontologies and data sources are assigned appropriate Biolink Model categories whenever possible. Those assigned `NamedThing` may still belong to a more detailed category, but assigning such a category may be challenging.

Now let's take a look at the graph contents to begin examining how they may answer our questions.

Let's get a set of all relations between food entries in FOODON and chemical entries in CHEBI.

In [1]:
!grep FOODON merged-kg_edges.tsv | grep CHEBI

CHEBI:35149-biolink:subclass_of-FOODON:03412972	CHEBI:35149	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:16199-biolink:subclass_of-FOODON:03412972	CHEBI:16199	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:35366-biolink:subclass_of-FOODON:03412972	CHEBI:35366	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:83501-biolink:subclass_of-FOODON:03412972	CHEBI:83501	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:18185-biolink:subclass_of-FOODON:03412972	CHEBI:18185	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:15428-biolink:subclass_of-FOODON:03412972	CHEBI:15428	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:30746-biolink:subclass_of-FOODON:03412972	CHEBI:30746	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:15366-biolink:subclass_of-FOODON:03412972	CHEBI:15366	biolink:subclass_of	FOODON:03412972		rdfs:subClassOf		Graph		
CHEBI:17561-biolink:subc

The "subject, predicate, and object" of each relation are found in the second, third, and fourth columns, respectively.
I'll save you some trouble: every relation like `CHEBI:XXXXX    biolink:subclass_of FOODON:03412972` is just saying "this chemical is a [food additive](https://www.ebi.ac.uk/ols/ontologies/foodon/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FFOODON_03412972)". There are several different *types* of relations in this set, however. Continue to the next section for some examples of how to learn more about the new graph with the `grape` tools.

## Loading Graphs with `grape`

In [2]:
!pip install grape

  from cryptography.utils import int_from_bytes
  from cryptography.utils import int_from_bytes


In [4]:
from grape import Graph

ImportError: cannot import name 'Graph' from 'grape' (/home/harry/.local/lib/python3.8/site-packages/grape/__init__.py)

## Graph Embeddings

## Link Prediction