# KG-Hub: Getting Started (from scratch)

This notebook serves as a walkthrough for creating a KG-Hub project, building a new graph, and using the graph for machine learning. Some familiarity with the command line, GitHub, and Python will be helpful. This notebook also assumes you're running in a Linux environment, but it should be informative even if you're on Windows or some other fancy operating system.

## Table of Contents
- [KG-Hub basics](#KG-Hub-basics)
- [Planning and setup](#Planning-and-setup)
- [Walking through the KG project](#Walking-through-the-KG-project)
- [Setting up your new KG project](#Setting-up-your-new-KG-project)
- [KGX format basics and the KGX config files](#KGX-format-basics-and-the-KGX-config-files)
  - [The KGX format](#The-KGX-format)
  - [KGX config files](#KGX-config-files)
  

## KG-Hub basics

The purpose of Knowledge Graph Hub (KG-Hub) is to provide a platform for building knowledge graphs (KGs) by adopting a set of guidelines and design principles. The goal of KG-Hub is to serve as a collective resource that simplifies the process of generating biological and biomedical KGs and thus lowers the barrier to entry for new participants.

Each independent effort for building a KG is considered a KG-Hub project.

Projects include a code repository and a storage location for each set of graph products. Repositories are generally on GitHub and storage is available on https://kg-hub.berkeleybop.io/.

For example, the following are all KG-Hub project repositories and storage locations:



| **Name**    | **Repository**                                     | **Graphs**                                 |
|-------------|----------------------------------------------------|--------------------------------------------|
| KG-COVID-19 | https://github.com/Knowledge-Graph-Hub/kg-covid-19 | https://kg-hub.berkeleybop.io/kg-covid-19/ |
| KG-IDG      | https://github.com/Knowledge-Graph-Hub/kg-idg      | https://kg-hub.berkeleybop.io/kg-idg/      |
| KG-OBO      | https://github.com/Knowledge-Graph-Hub/kg-obo      | https://kg-hub.berkeleybop.io/kg-obo/      |


Each project *should*:
* live in its own GitHub repository within the [Knowledge-Graph-Hub](https://github.com/Knowledge-Graph-Hub/) organization.
* have enough code and/or configurations for Extract, Transform, and Load (ETL) to yield a reproducible product.
* model data using the [Biolink Model](https://biolink.github.io/biolink-model/), where possible.
* make use of ontologies from the [OBO Foundry](http://www.obofoundry.org/), where possible.
* be responsible for the veracity of the datasets that they ingest.
* be responsible for keeping track of evidence and provenance for assertions in their KG.
* provide their KG for download, following [semantic versioning guidelines](https://semver.org/).
* provide their KG in the [KGX interchange format](https://github.com/biolink/kgx/blob/master/specification/kgx-format.md) in addition to their format of choice (e.g., n-triples).

We also *highly recommend* including the following in the repository: 
* a README describing the intended purpose of the KG and its contributors
* a License (as its own LICENSE file) 
* Contributing guidelines
* a Code of Conduct
* statements emphasizing how the KG and KG-Hub are open to the community for contributions as well as consumption

## Planning and setup

You likely already found the KG-Hub project template repository - that's where this notebook is.
If you found this someplace else, the template repository is https://github.com/Knowledge-Graph-Hub/kg-dtm-template.

First, create a name for your project repository. In KG-Hub, most projects include **KG** somewhere, e.g., "KG-Squid" or "KG-Mimivirus-Proteomics".

Make a new repository for your project, based on the template, in one of the following two ways:
* [Follow the browser-based directions here.](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template)
* Use the GitHub command line interface. This may be a preferable option if you're using a Windows command line and don't want to use a browser interface. Follow the [install instructions here](https://cli.github.com/manual/installation) as needed, then run:


In [None]:
%cd ~
!gh repo create kg-project-name --public --clone --template https://github.com/Knowledge-Graph-Hub/kg-dtm-template

Next, select a name for your project. It should resemble your repository name, though any dashes will need to be changed to underscores.

Change the values below to your repository name so they may be used later in the walkthrough.

In [6]:
kg_repo_name = "kg-placeholder-name"
kg_project_name = "kg_placeholder_name"
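
If you'd rather derive the project name from the repository name than type it twice, something like the following may help. This is only a sketch: the helper `to_project_name` is hypothetical (it isn't part of the template), and it assumes the only change needed is swapping dashes for underscores so the result can serve as a Python package name.

```python
def to_project_name(repo_name: str) -> str:
    """Turn a repository name into a Python package name (dashes become underscores)."""
    project_name = repo_name.replace("-", "_")
    # A package name must be a valid Python identifier to be importable.
    if not project_name.isidentifier():
        raise ValueError(f"{project_name!r} is not a valid Python package name")
    return project_name


kg_repo_name = "kg-placeholder-name"
kg_project_name = to_project_name(kg_repo_name)
print(kg_project_name)  # kg_placeholder_name
```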

Define some additional details so they may be used later as well.

In [9]:
description = '' # A short description of the project
long_description = '' # A slightly less short description of the project
gh_name = '' # Your GitHub user name, assuming that you created the project repository in your own account.
author_name = '' # Your name
author_email = '' # Your email address

## Walking through the KG project

Each KG project in KG-Hub is generally structured like this (with some omissions for clarity):
```
📦kg-project-name
 ┣ 📂kg_project_name
 ┃ ┣ 📂merge_utils
 ┃ ┃ ┗ 📜merge_kg.py - this produces the final, merged KG
 ┃ ┣ 📂transform_utils - data source-specific transformation functions
 ┃ ┃ ┣ 📂transform_one
 ┃ ┃ ┃ ┗ 📜transform_one.py
 ┃ ┃ ┣ 📂transform_two
 ┃ ┃ ┃ ┗ 📜transform_two.py
 ┃ ┃ ┗ 📜transform.py - sets defaults for transform outputs
 ┃ ┣ 📂utils - utilities and helper functions
 ┃ ┃ ┣ 📜download_utils.py
 ┃ ┃ ┣ 📜robot_utils.py - utilities for working with the ROBOT tool
 ┃ ┃ ┗ 📜transform_utils.py
 ┃ ┣ 📜download.py 
 ┃ ┗ 📜transform.py - sets up the individual transformations
 ┣ 📂tests
 ┃ ┣ 📂resources
 ┃ ┃ ┣ files required to run the tests
 ┃ ┣ various tests
 ┣ 📜LICENSE.txt
 ┣ 📜README.md - modify as needed
 ┣ 📜download.yaml - the download configuration file
 ┣ 📜merge.yaml - the merge configuration file
 ┣ 📜requirements.txt - empty by default, but add any new requirements here
 ┣ 📜run.py - the main interface for downloading, transforming, and merging
 ┗ 📜setup.py
```

The general process of *defining* how to assemble a KG looks like this:
1. Add data sources to `download.yaml`.
2. Add a new transform to `transform.py` and in the `transform_utils` directory to handle the new data source. If the data source is already a set of KGX TSV node and edge lists, then it may only require a 'passthrough' transform (i.e., files aren't modified but may be validated and moved).
3. Modify `merge.yaml` to include the new sources.
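
To give a rough idea of what a 'passthrough' transform does, here is a minimal sketch using only the standard library. It is not the template's own transform class: the function `passthrough` and the `REQUIRED_NODE_COLUMNS` set are hypothetical names chosen for illustration, and the column check is a deliberately small assumption about what "validated" might mean.

```python
import csv
import shutil
import tempfile
from pathlib import Path

# Assumed minimal set of node columns for this sketch.
REQUIRED_NODE_COLUMNS = {"id", "category"}


def passthrough(source: Path, output_dir: Path, required: set) -> Path:
    """Check the TSV header for required columns, then copy the file unchanged."""
    with open(source, newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"), [])
    missing = required - set(header)
    if missing:
        raise ValueError(f"{source.name} is missing columns: {sorted(missing)}")
    output_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(source, output_dir))


# Demonstration on a throwaway file rather than real project data.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw" / "example_nodes.tsv"
    raw.parent.mkdir()
    raw.write_text("id\tcategory\tname\nEX:1\tbiolink:NamedThing\tthing\n")
    copied = passthrough(raw, Path(tmp) / "transformed", REQUIRED_NODE_COLUMNS)
    copied_text = copied.read_text()

print(copied.name)  # example_nodes.tsv
```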

The general process of *assembling* the KG looks like this. Even if you haven't changed much in the new project yet, these commands will still retrieve several files, transform them, and merge them into a graph.

In [2]:
!python run.py download

Downloading files: 100%|██████████████████████████| 6/6 [00:03<00:00,  1.58it/s]


In [None]:
!python run.py transform

In [None]:
!python run.py merge

## Setting up your new KG project

Before going any further, open the project in your favorite development environment. You will need to replace all instances of `project_name` with your project's name.

You can also run the following script, assuming you've specified `kg_project_name` above:

In [None]:
%%bash -s "$kg_project_name"
# Rename the package directory, then update references in every Python file.
mv project_name/ "$1"
find . -name "*.py" -print0 | xargs -0 sed -i -e "s|project_name|$1|g"
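
If `sed` and `xargs` aren't available (for example, on Windows), the same renaming can be approximated in Python. This is a sketch under a few assumptions: `rename_project` is a hypothetical helper, it performs the same plain substring replacement as the shell script, and the demonstration runs in a temporary directory rather than on your real project.

```python
import tempfile
from pathlib import Path


def rename_project(root: Path, new_name: str, old_name: str = "project_name") -> int:
    """Rename the package directory and rewrite old_name in every .py file.

    Returns the number of files that were changed.
    """
    package_dir = root / old_name
    if package_dir.is_dir():
        package_dir.rename(root / new_name)
    changed = 0
    for path in root.rglob("*.py"):
        text = path.read_text(encoding="utf-8")
        if old_name in text:
            path.write_text(text.replace(old_name, new_name), encoding="utf-8")
            changed += 1
    return changed


# Demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "project_name").mkdir()
    (root / "project_name" / "download.py").write_text("import project_name.utils\n")
    changed = rename_project(root, "kg_placeholder_name")
    result = (root / "kg_placeholder_name" / "download.py").read_text()

print(changed, result.strip())  # 1 import kg_placeholder_name.utils
```

On a real project you would call `rename_project(Path("."), kg_project_name)` from the repository root instead.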

Next, update the `setup.py` file with some details about your project. You should only need to modify the first several lines of the `setup` block, as these define project metadata. Don't modify the value for `version` as that is defined elsewhere (specifically, in `project_name/__version__.py`). You may also need to change the value for `license` if you're using a license other than BSD-3. 

If you've defined metadata values in the "Planning and setup" section above, you may run the following, then copy and paste the result into your `setup.py` immediately under `setup(`:

In [11]:
print( 
f"""
    name='{kg_project_name}',
    version=__version__,
    description='{description}',
    long_description='{long_description}',
    url='https://github.com/Knowledge-Graph-Hub/{kg_repo_name}',
    author='{author_name}',
    author_email='{author_email}',
    python_requires='>=3.7',
"""
)


    name='kg_placeholder_name',
    version=__version__,
    description='',
    long_description='',
    url='https://github.com/Knowledge-Graph-Hub/kg-placeholder-name',
    author='',
    author_email='',
    python_requires='>=3.7',



Now it's all yours! Feel free to update the README, too.

## KGX format basics and the KGX config files

Two more steps remain before the KG is ready to be built: setting up the KGX configuration files (specifically, `download.yaml` and `merge.yaml`). You've likely noticed that the three primary stages in this pipeline are to download, transform, and merge. These two configuration files handle the first and last stages, respectively, but transforms are source-specific and each requires its own process.

Our goal is to get all data in the same format - KGX tab-separated values - and adhering to the same data model. The next section will discuss the [Biolink Model](https://biolink.github.io/biolink-model/), but there isn't anything preventing you from reading about it now. In short, we need as much consistency as possible before combining sources into a single KG.

### The KGX format

[You can find an exhaustive specification for the KGX data format here.](https://github.com/biolink/kgx/blob/master/specification/kgx-format.md)

If you're already familiar with RDF, triples, and the idea of a [graph data model](https://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#section-data-model) then this will all appear quite simple. If not, just consider nodes to be things and edges to be the relationships between those things. For example, a node may be a farmer and an edge may be a specific connection to something else which may or may not be the same type of thing, e.g., "Farmer Alphonse *grows* lentils".

Here are the key points about KGX:
* Each graph consists of one node file and one edge file.
* Both files are tab-delimited and have a single header line each.
* Both files contain one record per line - one node or one edge.

A node file generally looks something like this:

```
id      category        name    description
ENSEMBL:ENSG00000143933 biolink:Gene|biolink:NamedThing CALM2   calmodulin 2
ENSEMBL:ENSG00000131089 biolink:Gene|biolink:NamedThing ARHGEF9 Cdc42 guanine nucleotide exchange factor 9
ENSEMBL:ENSG00000147889 biolink:Gene|biolink:NamedThing CDKN2A  cyclin dependent kinase inhibitor 2A
```

Note that each value in the `id` column is a [CURIE](https://en.wikipedia.org/wiki/CURIE). It contains a prefix denoting the data source (in this case, ENSEMBL) and, after the colon, an identifier. Each node shown here also has two categories, with the | character separating items in each list. 
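
As a small illustration of those two conventions, the sketch below reads a node table with Python's standard `csv` module, splits each CURIE into its prefix and local identifier, and splits the category list on `|`. The inline `node_tsv` string stands in for a real node file; nothing here depends on KGX itself.

```python
import csv
import io

# One header line plus one record, taken from the example above.
node_tsv = (
    "id\tcategory\tname\tdescription\n"
    "ENSEMBL:ENSG00000143933\tbiolink:Gene|biolink:NamedThing\tCALM2\tcalmodulin 2\n"
)

nodes = []
reader = csv.DictReader(io.StringIO(node_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    prefix, local_id = row["id"].split(":", 1)  # CURIE = prefix:identifier
    categories = row["category"].split("|")     # multivalued field
    nodes.append({"prefix": prefix, "local_id": local_id, "categories": categories})

print(nodes[0]["prefix"], nodes[0]["categories"])
# ENSEMBL ['biolink:Gene', 'biolink:NamedThing']
```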

An edge file generally looks something like this:

```
id      subject predicate       object
urn:uuid:e99e9dd6-0b4f-416e-8c81-061b4a61711c   ENSEMBL:ENSP00000000233 biolink:interacts_with  ENSEMBL:ENSP00000272298
urn:uuid:a819d828-3df7-4384-8612-38a17e521320   ENSEMBL:ENSP00000000233 biolink:interacts_with  ENSEMBL:ENSP00000418915
urn:uuid:b412961f-8939-488a-bd37-6ae3bc7237b9   ENSEMBL:ENSP00000000233 biolink:interacts_with  ENSEMBL:ENSP00000356737
```

Here, each `id` is actually a [Uniform Resource Name](https://en.wikipedia.org/wiki/Uniform_Resource_Name) - this is for consistency because the KG may contain a mix of relationships from other sources *and* newly-created connections. The crucial aspect is the set of *subject*, *predicate*, and *object*, or "*S* has relationship *P* with *O*".
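
If you need to create edges of your own, a record like the ones above could be assembled as follows. This is a sketch, not how KGX or any KG-Hub transform does it internally: `make_edge` is a hypothetical helper, and using a random UUID (via the standard `uuid` module) for the `id` is an assumption based on the look of the example identifiers.

```python
import uuid

EDGE_COLUMNS = ("id", "subject", "predicate", "object")


def make_edge(subject: str, predicate: str, obj: str) -> dict:
    """Build one edge record with a freshly generated URN as its id."""
    return {
        "id": uuid.uuid4().urn,  # e.g. 'urn:uuid:...' (random each time)
        "subject": subject,
        "predicate": predicate,
        "object": obj,
    }


edge = make_edge(
    "ENSEMBL:ENSP00000000233",
    "biolink:interacts_with",
    "ENSEMBL:ENSP00000272298",
)
# Serialize in the same column order as the header line.
line = "\t".join(edge[column] for column in EDGE_COLUMNS)
print(line)
```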

### KGX config files

The KGX configuration files usually remain in the root of each project. Run this to see the download config, `download.yaml`:

In [6]:
!head download.yaml --lines=60

# This file is a list of things to be downloaded using the command:
#   run.py download

# To add a new item to be download, add a block like this - must have 'url',
# 'local_name' is optional, use to avoid name collisions

#  #
#  # Description of source
#  #
#  -
#    # brief comment about file, and optionally a local_name:
#    url: http://curefordisease.org/some_data.txt
#    local_name: some_data_more_chars_prevent_name_collision.pdf
#
#  For downloading from S3 buckets, see here for information about what URL to use:
#  https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html#access-bucket-intro
#  Amazon S3 virtual hosted style URLs follow the format shown below:
#  https://bucket-name.s3.Region.amazonaws.com/key_name
#
---

#
# **** ROBOT ****
#
-
  url: https://github.com/ontodev/robot/releases/download/v1.8.3/robot.jar
  local_name: robot.jar
-
  url: https://raw.githubusercontent.com/ontodev/robot/master/bin/robot 
  local_name: robot



By default, this file lists five separate files to download:
* Two files for [ROBOT](http://robot.obolibrary.org/), the Java tool used for ontology processing
* The [ENVO ontology](https://obofoundry.org/ontology/envo.html) in JSON format
* The [CHEBI ontology](https://obofoundry.org/ontology/chebi.html) in OWL format, as a gzip-compressed file
* A set of mappings between CHEBI and pathways in the [Reactome knowledge base](https://reactome.org/), as a tab-delimited txt file

The downloads are all stored in the `data/raw` directory.

The actual download process is handled by a separate package, [kghub-downloader](https://github.com/monarch-initiative/kghub-downloader), so consult the documentation for that package to see the full extent of options you can use with `download.yaml`. It isn't limited to downloading single files from HTTP URLs: there's functionality for FTP and for retrieving the results of Elasticsearch queries.
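
Since `local_name` exists to avoid name collisions, it can be worth checking for them before adding many sources. The sketch below works on entries written as Python dictionaries. It assumes that an entry without `local_name` is saved under the last path component of its URL, which is a guess about the downloader's default rather than something confirmed here, and the `example.org` URLs are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath
from urllib.parse import urlparse


def local_name(entry: dict) -> str:
    """Use local_name if given; otherwise fall back to the URL's file name (assumed default)."""
    return entry.get("local_name") or PurePosixPath(urlparse(entry["url"]).path).name


entries = [
    {"url": "https://github.com/ontodev/robot/releases/download/v1.8.3/robot.jar",
     "local_name": "robot.jar"},
    {"url": "https://raw.githubusercontent.com/ontodev/robot/master/bin/robot",
     "local_name": "robot"},
    {"url": "https://example.org/a/data.tsv"},  # hypothetical source
    {"url": "https://example.org/b/data.tsv"},  # hypothetical; same file name as above
]

name_counts = Counter(local_name(entry) for entry in entries)
collisions = [name for name, count in name_counts.items() if count > 1]
print(collisions)  # ['data.tsv']
```

Giving one of the two colliding entries its own `local_name` would resolve the clash.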

In [7]:
!head merge.yaml --lines=60

---
configuration:
  output_directory: data/merged
  checkpoint: false

merged_graph:
  name: project_name graph
  source:
    chebi:
      name: "CHEBI"
      input:
        format: tsv
        filename:
          - data/transformed/ontologies/chebi_nodes.tsv
          - data/transformed/ontologies/chebi_edges.tsv
    envo:
      name: "ENVO"
      input:
        format: tsv
        filename:
          - data/transformed/ontologies/envo_nodes.tsv
          - data/transformed/ontologies/envo_edges.tsv
    chebi_to_reactome:
      name: "CHEBI to Reactome Pathways"
      input:
        format: tsv
        filename:
          - data/transformed/reactome/chebi2reactome_nodes.tsv
          - data/transformed/reactome/chebi2reactome_edges.tsv
  operations:
    - name: kgx.graph_operations.summarize_graph.generate_graph_stats
      args:
        graph_name: project_name graph
        filename: merged_graph_stats.yaml
        node_facet_properties:
         

As the name implies, `merge.yaml` instructs KGX to merge specific transformed data into one or more merged output graphs. The merged graph is written to `data/merged`, so that directory is specified as `output_directory` at the top. If `checkpoint` is set to `true`, then each input will be converted and saved to a TSV before merging, but this isn't necessary here. We have three `source`s: the ENVO ontology, the CHEBI ontology, and the CHEBI to Reactome pathway mappings, each in nice, convenient KGX TSV format; the transforms place these files in `data/transformed`. The `operations` block defines additional processes to perform in the course of the merge. Here, a statistics file describing the merged graph is generated. Finally, the `destination` block allows us to define the format(s) of the merged graph. The default file tells KGX to produce a tar.gz compressed set of KGX TSVs *and* a gz-compressed n-triple format file.
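
When you add your own source later, its entry under `source:` follows the same shape as the three shown above. As a rough sketch, the hypothetical helper below (`merge_source_block` is not part of KGX or the template) formats such an entry as text with matching indentation; the `my_source` names and paths are placeholders.

```python
def merge_source_block(key: str, name: str, node_file: str, edge_file: str) -> str:
    """Format one source entry, indented to sit under `source:` in merge.yaml."""
    lines = [
        f"    {key}:",
        f'      name: "{name}"',
        "      input:",
        "        format: tsv",
        "        filename:",
        f"          - {node_file}",
        f"          - {edge_file}",
    ]
    return "\n".join(lines) + "\n"


block = merge_source_block(
    key="my_source",  # placeholder key
    name="My Source",
    node_file="data/transformed/my_source/my_source_nodes.tsv",
    edge_file="data/transformed/my_source/my_source_edges.tsv",
)
print(block)
```

The printed text can be pasted below the last existing source, before the `operations` block.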

Give it a try, if you haven't done so already:

In [8]:
!python run.py download

Downloading files:   0%|                                  | 0/5 [00:00<?, ?it/s]Downloading files: 100%|███████████████████████| 5/5 [00:00<00:00, 23912.79it/s]


In [9]:
!python run.py transform # This may take a few minutes. Take a break - you deserve it.

Parsing data/raw/chebi.owl.gz
[KGX][cli_utils.py][    transform_source] INFO: Processing source 'chebi.json'
Parsing data/raw/envo.json
[KGX][cli_utils.py][    transform_source] INFO: Processing source 'envo.json'
Parsing data/raw/ChEBI2Reactome_PE_Pathway.txt
Transforming using source in project_name/transform_utils/reactome/chebi2reactome.yaml


In [10]:
!python run.py merge

[KGX][cli_utils.py][               merge] INFO: Spawning process for 'chebi'
[KGX][cli_utils.py][               merge] INFO: Spawning process for 'envo'
[KGX][cli_utils.py][               merge] INFO: Spawning process for 'chebi_to_reactome'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'chebi'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'envo'
[KGX][cli_utils.py][        parse_source] INFO: Processing source 'chebi_to_reactome'
[KGX][graph_merge.py][       add_all_nodes] INFO: Adding 6773 nodes from envo to chebi
[KGX][graph_merge.py][        merge_graphs] INFO: Number of nodes merged between chebi and envo: 924
[KGX][graph_merge.py][       add_all_edges] INFO: Adding 10370 edges from <kgx.graph.nx_graph.NxGraph object at 0x7fc5616d00d0> to <kgx.graph.nx_graph.NxGraph object at 0x7fc593c84040>
[KGX][graph_merge.py][        merge_graphs] INFO: Number of edges merged between chebi and envo: 1329
[KGX][graph_merge.py][       add_all_nodes] INFO: 

The merged graph will be in `data/merged`, as per the merge configuration.

Let's take a quick look at `merged_graph_stats.yaml` to get an idea of what the merged graph contains. The log output of the merge will tell us how many nodes and edges the graph contains, but so will the graph stats:

In [11]:
!grep total merged_graph_stats.yaml

  total_edges: 420049
  total_nodes: 196791


Take a look at the list of values under `predicates` in the stats file for the list of all predicates, or look under `node_stats` to find the count of all nodes by category.
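
The same totals can be pulled out in Python, which is handy if you want to compute something from them. The sketch below parses the two lines shown above from an inline string using a regular expression instead of a YAML library; to use the real file, you could read its text with `Path("merged_graph_stats.yaml").read_text()` instead.

```python
import re

# The two lines reported by grep above, standing in for the full stats file.
stats_text = """\
  total_edges: 420049
  total_nodes: 196791
"""

pattern = re.compile(r"^[ \t]*(total_\w+):[ \t]*(\d+)[ \t]*$", flags=re.MULTILINE)
totals = {key: int(value) for key, value in pattern.findall(stats_text)}

edges_per_node = totals["total_edges"] / totals["total_nodes"]
print(totals)                    # {'total_edges': 420049, 'total_nodes': 196791}
print(round(edges_per_node, 2))  # 2.13
```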

The important point: **now you have a KG!** 

Presumably your interests extend beyond chemicals and pathways, though! In the following sections, you'll see how to customize your KG-Hub project for your own needs.

## Biolink basics

Are you working with biological or biomedical data?

No? OK, skip to the next section.

Otherwise, you'll need to know at least a bit about the Biolink Model.

## How to write transforms for new sources

### Writing transforms with Koza

### Retrieving source graphs from KG-Hub

## How and where to store results

### Jenkins builds

## Loading graphs with GraPE

## Embeddings and basic ML approaches w/ NEAT