# KG-Hub: Advanced Knowledge Graph Assembly

This notebook serves as a practical guide to advanced KG-Hub features and resources. A brief review of the Getting Started tutorial notebook is not a strict prerequisite but may be helpful.  

This notebook also assumes you are in a Linux environment, but Google Colab is an option as well.

Here's an example question for our use case: which foods may impact DNA repair pathways? It's a broad question with many possible answers, or no answers at all. A KG may hold some clues. We don't want to be entirely reliant on existing data, however: starting with sets of chemicals, foods, and biological pathways, we can perform link prediction to predict additional connections.

## Setup

First, we'll install the requirements.


In [None]:
!pip install kgx
!pip install kghub-downloader

In [None]:
import os
import yaml

Now we need to set up two things for KGX to work properly:
* A download config file
* A merge config file

In practice, we may need to write a new transform for each new source, but all of the sources we'll use here are conveniently already available as KGX node and edge files on KG-Hub.

We'll download five sources. Two are ontologies available through the KG-OBO project on KG-Hub: FOODON, a food ontology, and CHEBI, a chemical ontology. The other sources are sets of preprocessed [Reactome](https://reactome.org) pathways, connections between those pathways, and mappings between those pathways and chemicals. They're all defined in a dictionary below, with the name of each source as its key and a list of one or more source URLs as its value. We've also defined a set of local filenames, as we know what the compressed ontology files should contain.

In [None]:
data_dir = "./" # Just the current directory, though in practice it would be something like data/raw/
sources = {"foodon":["https://kg-hub.berkeleybop.io/kg-obo/foodon/2022-02-01/foodon_kgx_tsv.tar.gz"],
           "chebi":["https://kg-hub.berkeleybop.io/kg-obo/chebi/210/chebi_kgx_tsv.tar.gz"],
           "chebi2reactome":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/chebi2reactome_edges.tsv",
                             "https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/chebi2reactome_nodes.tsv"],
           "reactome_pathways":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathways_nodes.tsv"],
           "reactome_relations":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathwaysrelation_edges.tsv"]}
local_filepaths = {"foodon":["foodon_kgx_tsv_edges.tsv",
                            "foodon_kgx_tsv_nodes.tsv"],
           "chebi":["chebi_kgx_tsv_edges.tsv",
                    "chebi_kgx_tsv_nodes.tsv"],
           "chebi2reactome":["chebi2reactome_edges.tsv",
                             "chebi2reactome_nodes.tsv"],
           "reactome_pathways":["reactomepathways_nodes.tsv"],
           "reactome_relations":["reactomepathwaysrelation_edges.tsv"]}

There is an example of a KGX download config file [here](https://github.com/Knowledge-Graph-Hub/kg-dtm-template/blob/master/download.yaml), but it's easy to assemble from scratch with something like the following:

In [None]:
source_data = []
for source in sources:
  for url in sources[source]:
    local_name = url.rpartition('/')[-1]
    source_data.append({"url":url,
                        "local_name":local_name})

with open("download.yaml", "w") as dl_config:
  yaml.dump(source_data, dl_config, default_flow_style=False)

Now we may use the config file with the `kghub-downloader` to download all sources.

In [None]:
!downloader download.yaml

Decompress the compressed sources.

In [None]:
!cat *.tar.gz | tar zxvf - -i

Next step: set up a merge config file. Our sources are already in the expected KGX graph format, so no transformation is necessary.

See the [example merge config](https://github.com/Knowledge-Graph-Hub/kg-dtm-template/blob/master/merge.yaml) in this repository for further inspiration.

In [None]:
merge_data = {"configuration":{"output_directory":data_dir,
                              "checkpoint":"false"
                              },
              "merged_graph":{"name":"tutorial_graph",
                              "source":{},
                              "operations":[{"name": "kgx.graph_operations.summarize_graph.generate_graph_stats",
                                        "args":{"graph_name":"tutorial_graph",
                                        "filename":"merged_graph_stats.yaml"
                                                }
                                                }
                                                ],
                                "destination":{"merged-kg-tsv":{"format":"tsv",
                                              "filename": "merged-kg"}
                                                },            
                                }
                }

for source in local_filepaths:
  merge_data["merged_graph"]["source"][source] = {"name":source,
                                                  "input":{"format":"tsv",
                                                          "filename":local_filepaths[source]}
                                                  }

with open("merge.yaml", "w") as merge_config:
  yaml.dump(merge_data, merge_config, default_flow_style=False)

## KG Assembly

The data files are all here and the configuration files are ready. We may now use `kgx` to assemble a single set of nodes and edges from them all.

In [None]:
from kgx.cli.cli_utils import merge

In [None]:
merged_graph = merge("merge.yaml")

If everything went as expected, the merged KG will be in `merged-kg_edges.tsv` and `merged-kg_nodes.tsv`. There will also be a `merged_graph_stats.yaml` detailing the new graph contents. Let's take a quick look at the stats file first.

In [None]:
with open("merged_graph_stats.yaml") as yaml_file:
    config = yaml.load(yaml_file, Loader=yaml.FullLoader)

In [None]:
# Count of all edges in the graph
print(config["edge_stats"]["total_edges"])

# Count of all nodes in the graph
print(config["node_stats"]["total_nodes"])

In [None]:
# What kind of nodes are in the graph?
for category in config["node_stats"]["node_categories"]:
    print(category)

Nodes in ontologies and data sources are assigned appropriate Biolink Model categories whenever possible. Those assigned `NamedThing` may still belong to a more detailed category, but assigning such a category may be challenging.

Now let's take a look at the graph contents to begin examining how they may answer our questions.

Let's get a set of all relations between food entries in FOODON and chemical entries in CHEBI.

In [None]:
!grep FOODON merged-kg_edges.tsv | grep CHEBI

The "subject, predicate, and object" of each relation are found in the second, third, and fourth columns, respectively.
I'll save you some trouble: every relation like `CHEBI:XXXXX    biolink:subclass_of FOODON:03412972` is just saying "this chemical is a [food additive](https://www.ebi.ac.uk/ols/ontologies/foodon/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FFOODON_03412972)". There are several different *types* of relations in this set, however. We can get a quick idea about those types by looking at the `merged_graph_stats.yaml` KGX has prepared for us.

In [3]:
!sed -n '/count_by_predicates/,/count_by_spo/p' merged_graph_stats.yaml

  count_by_predicates:
    biolink:close_match:
      count: 253
    biolink:coexists_with:
      count: 4
    biolink:contains_process:
      count: 21527
    biolink:derives_from:
      count: 2879
    biolink:develops_from:
      count: 36
    biolink:has_attribute:
      count: 42214
    biolink:has_input:
      count: 2
    biolink:has_output:
      count: 86
    biolink:has_part:
      count: 4318
    biolink:has_participant:
      count: 12
    biolink:in_taxon:
      count: 103
    biolink:located_in:
      count: 67
    biolink:model_of:
      count: 14
    biolink:part_of:
      count: 2896
    biolink:participates_in:
      count: 93025
    biolink:preceded_by:
      count: 1
    biolink:produced_by:
      count: 3
    biolink:related_to:
      count: 274196
    biolink:subclass_of:
      count: 264328
    unknown:
      count: 0
  count_by_spo:


We can see, for example, that there are >93 thousand "participates_in" relations. Let's see what the participants in these graph edges are:

In [5]:
!grep -A 1 "participates_in" merged_graph_stats.yaml

    biolink:participates_in:
      count: 93025
--
    biolink:ChemicalEntity-biolink:participates_in-biolink:BiologicalProcess:
      count: 93018
    biolink:ChemicalEntity-biolink:participates_in-biolink:BiologicalProcessOrActivity:
      count: 93018
    biolink:ChemicalEntity-biolink:participates_in-biolink:NamedThing:
      count: 93018
    biolink:ChemicalEntity-biolink:participates_in-biolink:Pathway:
      count: 93018
--
    biolink:NamedThing-biolink:participates_in-biolink:BiologicalProcess:
      count: 93018
    biolink:NamedThing-biolink:participates_in-biolink:BiologicalProcessOrActivity:
      count: 93018
    biolink:NamedThing-biolink:participates_in-biolink:NamedThing:
      count: 93025
    biolink:NamedThing-biolink:participates_in-biolink:Pathway:
      count: 93018
--
  - biolink:participates_in
  - biolink:preceded_by


So these are our connections between chemicals and pathways. The same counts appear multiple times because each node may have more than one category (e.g., a ChemicalEntity is also a NamedThing).

Continue to the next section for some examples of how to learn more about the new graph with the `grape` tools.

## Loading Graphs with `grape`

The `grape` library includes a substantial array of tools for working with graph data, generating reports and plots about graph contents, and preparing graph representations. We'll start by loading the graph from the previous section, then we'll get more details about its contents.

In [37]:
!pip install grape -U

Collecting grape
  Downloading grape-0.1.5.tar.gz (8.9 kB)
  Preparing metadata (setup.py) ... [?25ldone
Collecting embiggen>=0.11.9
  Downloading embiggen-0.11.9.tar.gz (133 kB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m133.9/133.9 KB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone


Building wheels for collected packages: grape, embiggen
  Building wheel for grape (setup.py) ... [?25ldone
[?25h  Created wheel for grape: filename=grape-0.1.5-py3-none-any.whl size=6766 sha256=257aa919e6841de19ca57d2a7e1b2a178917ac472eb5734eeb595cdd94c04f90
  Stored in directory: /home/harry/.cache/pip/wheels/38/55/88/c0236384ede950c7b2d3c8f7310ab41cd9caa34ea92c1f264e
  Building wheel for embiggen (setup.py) ... [?25ldone
[?25h  Created wheel for embiggen: filename=embiggen-0.11.9-py3-none-any.whl size=267765 sha256=0ab9d6169fbc73c099a956c7ab578358365f4acd2dbd588f6d07842e8cd71159
  Stored in directory: /home/harry/.cache/pip/wheels/99/1f/55/8cd844c5a9bcc4fd27b43745ca19c85a6c699379625d270deb
Successfully built grape embiggen
Installing collected packages: embiggen, grape
  Attempting uninstall: embiggen
    Found existing installation: embiggen 0.11.7
    Uninstalling embiggen-0.11.7:
      Successfully uninstalled embiggen-0.11.7
  Attempting uninstall: grape
    Found existing i

In [44]:
from grape import Graph

Once the next block completes, it will output a long text report about the graph's properties and a variety of its "topological oddities". These don't mean anything is intrinsically *wrong* with the graph - rather, they are features of the data we have used to construct the graph. In some cases, for example, a CHEBI entry may be present within our imported data despite being deleted from the dataset and therefore obsolete. 

In [7]:
g = Graph.from_csv(
  directed=False, # This graph is, in fact, directed, but we'll treat it as undirected.
  node_path='merged-kg_nodes.tsv',
  edge_path='merged-kg_edges.tsv',
  verbose=True,
  nodes_column='id',
  node_list_node_types_column='category',
  default_node_type='biolink:NamedThing',
  sources_column='subject',
  destinations_column='object',
  edge_list_edge_types_column='predicate'
)
g

## Graph Embeddings

The `grape` library is particularly efficient at preparing graph embeddings. 
Let's see a list of its available node embedding methods:

In [10]:
from grape import get_available_models_for_node_embedding
grape.get_available_models_for_node_embedding()

Unnamed: 0,model_name,task_name,library_name,available,requires_node_types,can_use_node_types,requires_edge_types,requires_edge_weights,can_use_edge_weights,requires_positive_edge_weights
0,CBOW,Node Embedding,Ensmallen,True,False,True,False,False,True,True
1,CBOW,Node Embedding,TensorFlow,True,False,True,False,False,True,True
2,CBOW,Node Embedding,Karate Club,True,,,,,,
3,TransE,Node Embedding,Ensmallen,True,False,False,True,False,True,False
4,TransE,Node Embedding,TensorFlow,True,False,False,True,False,False,False
5,TransE,Node Embedding,PyKeen,True,,,,,,
6,SPINE,Node Embedding,Ensmallen,True,False,False,False,False,False,False
7,WeightedSPINE,Node Embedding,Ensmallen,True,False,False,False,True,True,True
8,SkipGram,Node Embedding,Ensmallen,True,False,True,False,False,True,True
9,SkipGram,Node Embedding,TensorFlow,True,False,True,False,False,True,True


Let's use the SkipGram method, and specifically the fast Ensmallen implementation.

In [65]:
from grape.embedders import SkipGramEnsmallen

Remove disconnected nodes first, as they won't contribute much to our embeddings and may cause errors.

In [60]:
g = g.remove_disconnected_nodes()

In [66]:
model = TransEEnsmallen()
embedding = model.fit_transform(g)

Now let's see what those embeddings look like. They won't be too informative just yet.

In [67]:
embedding.get_node_embedding_from_index(0)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
CHEBI:35149,-1.351520e-39,-0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00
CHEBI:24995,-1.620334e-39,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
CHEBI:26666,5.726851e-32,-1.166641e-39,-3.194327e-36,1.368858e-37,2.750679e-37,-5.437841e-38,3.383582e-36,-4.595268e-38,5.908711e-36,3.888799e-40,...,1.573668e-37,-3.636030e-38,-1.468671e-37,2.637904e-38,1.313767e-37,-3.426683e-37,1.613853e-38,1.449227e-37,1.062291e-37,-3.499922e-38
CHEBI:58436,-1.037204e-37,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,...,-0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00
CHEBI:83824,-1.620334e-39,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,-0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,-0.000000e+00,0.000000e+00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
REACT:R-XTR-110313,5.726851e-32,-1.166641e-39,-3.194327e-36,1.368858e-37,2.750679e-37,-5.437841e-38,3.383582e-36,-4.595268e-38,5.908711e-36,3.888799e-40,...,1.573668e-37,-3.636030e-38,-1.468671e-37,2.637904e-38,1.313767e-37,-3.426683e-37,1.613853e-38,1.449227e-37,1.062291e-37,-3.499922e-38
REACT:R-XTR-948021,-7.148710e-32,1.456294e-39,3.987412e-36,-1.708718e-37,-3.433616e-37,6.787945e-38,-4.223655e-36,5.736178e-38,-7.375721e-36,-4.854308e-40,...,-1.964378e-37,4.538781e-38,1.833311e-37,-3.292841e-38,-1.639948e-37,4.277457e-37,-2.014539e-38,-1.809040e-37,-1.326036e-37,4.368880e-38
REACT:R-XTR-5653656,1.379774e-37,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00
REACT:R-XTR-2046104,2.690667e-39,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,...,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00,-0.000000e+00,0.000000e+00,0.000000e+00,0.000000e+00


In [68]:
from grape import GraphVisualizer
visualizer = GraphVisualizer(g)

In [70]:
visualizer.fit_and_plot_all(embedding)

ValueError: array must not contain infs or NaNs

## Link Prediction