# KG-Hub: Advanced Knowledge Graph Assembly

This notebook is a practical guide to advanced KG-Hub features and resources. Reviewing the Getting Started tutorial notebook first is not strictly required, but it may help.

The notebook assumes a Linux environment, though Google Colab is an option as well.

## Setup

First, we'll install the requirements.


In [2]:
%pip install kgx
%pip install kghub-downloader

Note: you may need to restart the kernel to use updated packages.


In [3]:
import os
import yaml

Now we need to set up two config files before we can assemble a graph:
* A download config file, read by `kghub-downloader`
* A merge config file, read by KGX

In practice, we may need to write a new transform for each new source, but all of the sources we'll use here are conveniently already available as KGX node and edge files on KG-Hub.

We'll download five sources. Two are ontologies available through the KG-OBO project on KG-Hub: FOODON, a food ontology, and CHEBI, a chemical ontology. The other sources are sets of preprocessed [Reactome](https://reactome.org) pathways, connections between those pathways, and mappings between those pathways and chemicals. They're all defined in a dictionary below, with the name of each source as its key and a list of one or more source URLs as its value. We've also defined a set of local filenames, as we know what the compressed ontology files should contain.

In [7]:
data_dir = "./" # Just the current directory, though in practice it would be something like data/raw/
sources = {"foodon":["https://kg-hub.berkeleybop.io/kg-obo/foodon/2022-02-01/foodon_kgx_tsv.tar.gz"],
           "chebi":["https://kg-hub.berkeleybop.io/kg-obo/chebi/210/chebi_kgx_tsv.tar.gz"],
           "chebi2reactome":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/chebi2reactome_edges.tsv",
                             "https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/chebi2reactome_nodes.tsv"],
           "reactome_pathways":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathways_nodes.tsv"],
           "reactome_relations":["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathwaysrelation_edges.tsv"]}
local_filepaths = {"foodon":["foodon_kgx_tsv_edges.tsv",
                             "foodon_kgx_tsv_nodes.tsv"],
                   "chebi":["chebi_kgx_tsv_edges.tsv",
                            "chebi_kgx_tsv_nodes.tsv"],
                   "chebi2reactome":["chebi2reactome_edges.tsv",
                                     "chebi2reactome_nodes.tsv"],
                   "reactome_pathways":["reactomepathways_nodes.tsv"],
                   "reactome_relations":["reactomepathwaysrelation_edges.tsv"]}

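Because the two dictionaries are maintained by hand, a quick consistency check can catch typos before anything is downloaded. The sketch below is only illustrative: `expected_local_files` is a hypothetical helper, and it assumes (based on the archive names used here) that each `*.tar.gz` expands to one `<stem>_nodes.tsv` and one `<stem>_edges.tsv` file.

```python
def expected_local_files(urls):
    """Guess the local file names a list of source URLs should produce."""
    names = []
    for url in urls:
        basename = url.rpartition("/")[-1]
        if basename.endswith(".tar.gz"):
            # Assumption: a KG-OBO archive holds one node file and one edge file.
            stem = basename[: -len(".tar.gz")]
            names += [f"{stem}_edges.tsv", f"{stem}_nodes.tsv"]
        else:
            names.append(basename)
    return sorted(names)


# Two entries in the same shape as `sources` and `local_filepaths`.
example_sources = {
    "foodon": ["https://kg-hub.berkeleybop.io/kg-obo/foodon/2022-02-01/foodon_kgx_tsv.tar.gz"],
    "reactome_pathways": ["https://kg-hub.berkeleybop.io/kg-idg/20220601/transformed/reactome/reactomepathways_nodes.tsv"],
}
example_local = {
    "foodon": ["foodon_kgx_tsv_edges.tsv", "foodon_kgx_tsv_nodes.tsv"],
    "reactome_pathways": ["reactomepathways_nodes.tsv"],
}

# Collect every source whose expected files differ from the declared ones.
mismatches = {
    name: expected_local_files(urls)
    for name, urls in example_sources.items()
    if expected_local_files(urls) != sorted(example_local.get(name, []))
}
print(mismatches or "All sources match their local file names.")
```

Running the same comparison on the full `sources` and `local_filepaths` dictionaries works the same way.
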
An example download config file is available [in the kg-dtm-template repository](https://github.com/Knowledge-Graph-Hub/kg-dtm-template/blob/master/download.yaml), but it's easy to assemble one from scratch with something like the following:

In [19]:
source_data = []
for source in sources:
  for url in sources[source]:
    local_name = url.rpartition('/')[-1]
    source_data.append({"url":url,
                        "local_name":local_name})

with open("download.yaml", "w") as dl_config:
  yaml.dump(source_data, dl_config, default_flow_style=False)
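
Since each `local_name` is just the last path segment of its URL, two sources could in principle map to the same file name and overwrite each other. A minimal sketch of a check for that, using a hypothetical helper `duplicate_local_names` and made-up example URLs:

```python
from collections import Counter


def duplicate_local_names(entries):
    """Return the local names that appear more than once in a download config."""
    counts = Counter(entry["local_name"] for entry in entries)
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative entries only; these URLs are placeholders, not real sources.
example_entries = [
    {"url": "https://example.org/a/nodes.tsv", "local_name": "nodes.tsv"},
    {"url": "https://example.org/b/nodes.tsv", "local_name": "nodes.tsv"},
    {"url": "https://example.org/b/edges.tsv", "local_name": "edges.tsv"},
]
print(duplicate_local_names(example_entries))
```

Calling `duplicate_local_names(source_data)` on the list built in this notebook should return an empty list, since all of the file names used here are distinct.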

Now we can pass the config file to `kghub-downloader` to download all sources.

In [4]:
!downloader download.yaml

/bin/bash: downloader: command not found
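
The `command not found` message means the shell could not locate the `downloader` script. One possible cause is that the package was installed into a different environment than the one the shell uses, or that the kernel has not been restarted since installation. The sketch below first checks whether the script is on the PATH and then defines a plain standard-library fallback. `fetch_sources` is a hypothetical helper written for this notebook, not part of `kghub-downloader`, and it offers none of that tool's extra features.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_sources(entries, output_dir="."):
    """Download each entry's `url` to `output_dir/local_name`, skipping existing files."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for entry in entries:
        target = output_dir / entry["local_name"]
        if target.exists():
            continue  # already downloaded in an earlier run
        with urllib.request.urlopen(entry["url"]) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        fetched.append(target)
    return fetched


# True when the kghub-downloader command-line script is visible on the PATH.
downloader_available = shutil.which("downloader") is not None
print("downloader on PATH:", downloader_available)
```

If the check prints `False`, calling `fetch_sources(source_data, data_dir)` would fetch the same files listed in `download.yaml`.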


Decompress the compressed sources.

In [5]:
!cat *.tar.gz | tar zxvf - -i

chebi_kgx_tsv_nodes.tsv
chebi_kgx_tsv_edges.tsv
foodon_kgx_tsv_nodes.tsv
foodon_kgx_tsv_edges.tsv
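
The shell pipeline above relies on `tar` being available. Where it is not, Python's `tarfile` module can do the same job. This is a minimal sketch with a hypothetical helper, `extract_archives`, that unpacks every `*.tar.gz` in a directory and skips any member whose path would land outside it.

```python
import tarfile
from pathlib import Path


def extract_archives(directory=".", pattern="*.tar.gz"):
    """Extract every matching archive into `directory`; return the member names."""
    directory = Path(directory).resolve()
    extracted = []
    for archive in sorted(directory.glob(pattern)):
        with tarfile.open(archive, "r:gz") as tar:
            for member in tar.getmembers():
                target = (directory / member.name).resolve()
                # Skip members that would be written outside the target directory.
                if directory not in target.parents:
                    continue
                tar.extract(member, path=directory)
                extracted.append(member.name)
    return extracted
```

Calling `extract_archives(data_dir)` after the download step should list the same four node and edge files shown above.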


Next step: set up a merge config file. Our sources are already in the expected KGX graph format, so no transformation is necessary.

See the [example merge config](https://github.com/Knowledge-Graph-Hub/kg-dtm-template/blob/master/merge.yaml) in the kg-dtm-template repository for further inspiration.

In [10]:
merge_data = {
  "configuration": {
    "output_directory": data_dir,
    # A real boolean, so YAML emits `false` rather than the quoted string 'false'
    "checkpoint": False,
  },
  "merged_graph": {
    "name": "tutorial_graph",
    "source": {},  # filled in per source below
    "operations": [
      {"name": "kgx.graph_operations.summarize_graph.generate_graph_stats",
       "args": {"graph_name": "tutorial_graph",
                "filename": "merged_graph_stats.yaml"}}
    ],
    "destination": {
      "merged-kg-tsv": {"format": "tsv",
                        "filename": "merged-kg"}
    },
  },
}

for source in local_filepaths:
  merge_data["merged_graph"]["source"][source] = {"name":source,
                                                  "input":{"format":"tsv",
                                                          "filename":local_filepaths[source]}
                                                  }

with open("merge.yaml", "w") as merge_config:
  yaml.dump(merge_data, merge_config, default_flow_style=False)
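
Before running the merge, it can help to confirm that every input file named in the config actually exists, since a failed download or extraction would otherwise only surface during the merge. The sketch below uses a hypothetical helper, `missing_inputs`, and assumes the file names are resolved relative to the current directory, as they are in this notebook.

```python
from pathlib import Path


def missing_inputs(filepaths_by_source, base_dir="."):
    """Map each source name to the input files that are not present on disk."""
    missing = {}
    for source, filenames in filepaths_by_source.items():
        absent = [name for name in filenames if not (Path(base_dir) / name).is_file()]
        if absent:
            missing[source] = absent
    return missing


# Illustrative call with a file name that is not expected to exist.
print(missing_inputs({"demo": ["no_such_file_nodes.tsv"]}))
```

An empty result from `missing_inputs(local_filepaths)` means all of the listed node and edge files were found.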

## KG Assembly

In [None]:
!kgx merge --merge-config merge.yaml 
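
Once the merge finishes, a quick look at the node categories is a reasonable first check on the result. This is a minimal sketch assuming the merged node file is a KGX TSV with a `category` column in which multiple values are joined by `|`. `count_node_categories` is a hypothetical helper, and the output file name used in the note below is an assumption derived from the `merged-kg` destination name.

```python
import csv
from collections import Counter


def count_node_categories(nodes_tsv):
    """Count nodes per category in a KGX node TSV file."""
    counts = Counter()
    with open(nodes_tsv, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Multi-valued fields are split on "|"; count each category once per node.
            for category in set((row.get("category") or "").split("|")):
                if category:
                    counts[category] += 1
    return counts
```

If the merge wrote its nodes to `merged-kg_nodes.tsv`, then `count_node_categories("merged-kg_nodes.tsv").most_common(5)` would list the five most frequent categories.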

## Graph Embeddings

## Link Prediction