# Running KG-COVID-19 pipeline

The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end.

We will also demonstrate some ways that you can use the KG downstream, and show some other features of the framework.

**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [Installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation).

## Downloading all required datasets

First, we download all required datasets, as listed in [download.yaml](../download.yaml).

In [None]:
!python run.py download

## Transform all required datasets

We then transform all the datasets and generate the files `nodes.tsv` and `edges.tsv` for each dataset.

The files are located in `data/transformed/SOURCE_NAME` where `SOURCE_NAME` is the name of the data source.

In [None]:
!python run.py transform
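
To check that the transform step produced output for every source, a short script can walk the output folder and count rows. This is a minimal sketch that assumes the layout described above (`data/transformed/SOURCE_NAME/nodes.tsv` and `edges.tsv`) and that every file has one header line and one record per line; the helper names `count_rows` and `summarize_transformed` are hypothetical and not part of KG-COVID-19.

```python
from pathlib import Path


def count_rows(tsv_path):
    """Count data rows in a TSV file, excluding the header line."""
    with open(tsv_path, encoding="utf-8") as handle:
        return max(sum(1 for _ in handle) - 1, 0)


def summarize_transformed(root="data/transformed"):
    """Map each source folder to the row counts of its nodes.tsv and edges.tsv.

    A count of None means the file is missing for that source.
    """
    summary = {}
    for source_dir in sorted(Path(root).iterdir()):
        if not source_dir.is_dir():
            continue
        counts = {}
        for name in ("nodes.tsv", "edges.tsv"):
            path = source_dir / name
            counts[name] = count_rows(path) if path.exists() else None
        summary[source_dir.name] = counts
    return summary


# Only print a summary if the transform step has already been run.
if Path("data/transformed").is_dir():
    for source, counts in summarize_transformed().items():
        print(source, counts)
```

A source that reports `None` for either file is worth investigating before moving on to the merge step.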

## Merge all datasets into a single graph

Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.
The merge process is driven by [merge.yaml](../merge.yaml).

In [3]:
!python run.py merge


The merged graph should be available in the `data/merged/` folder.

This pipeline generates a graph in KGX TSV format here:
`data/merged/merged-kg.tar.gz`
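
Before extracting the archive, it can be useful to peek at its contents. The sketch below uses only Python's standard `tarfile` module; the helper name `list_archive_members` is hypothetical, and the default path is the one given above.

```python
import tarfile


def list_archive_members(archive_path="data/merged/merged-kg.tar.gz"):
    """Return (name, size_in_bytes) for every regular file in a gzipped tar archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [
            (member.name, member.size)
            for member in archive.getmembers()
            if member.isfile()
        ]
```

Calling `list_archive_members()` after the merge step returns one entry per file stored in the archive, without writing anything to disk.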

Prebuilt graphs are also available here:
https://kg-hub.berkeleybop.io/kg-covid-19/index.html

# Other tooling/functionality

## Make training data for machine learning use case

KG-COVID-19 contains tooling to produce training data for machine learning. Briefly, a training graph is produced that contains 80% of the edges (the default; override with the `-t` parameter). The remaining 20% of edges are held out, chosen so that removing them does not create new connected components. These graphs are emitted as KGX TSV files in the `data/holdouts/` folder.
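
The idea of removing edges without creating new components can be illustrated with a small, self-contained sketch. This is not the pipeline's implementation: it is one simple way to get the same guarantee, assuming an undirected graph given as a list of `(source, destination)` pairs. It builds a spanning forest with union-find, keeps every forest edge in the training set, and draws holdout edges only from the remaining edges. The function name `connected_holdout_sketch` is hypothetical.

```python
import random


def connected_holdout_sketch(edges, train_fraction=0.8, seed=42):
    """Split edges into (train, holdout) without increasing the component count.

    If there are too few non-forest edges, the holdout is smaller than requested.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    forest_edges, spare_edges = [], []
    for src, dst in shuffled:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # This edge joins two components, so it must stay in training.
            parent[root_src] = root_dst
            forest_edges.append((src, dst))
        else:
            # Endpoints are already connected, so the edge is safe to hold out.
            spare_edges.append((src, dst))

    wanted = round(len(shuffled) * (1 - train_fraction))
    holdout = spare_edges[:wanted]
    train = forest_edges + spare_edges[wanted:]
    return train, holdout


# Toy example: a 4-node cycle with one chord.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
toy_train, toy_holdout = connected_holdout_sketch(toy_edges)
print(len(toy_train), "training edges;", len(toy_holdout), "holdout edges")
```

Because every spanning-forest edge stays in the training set, any two nodes that were connected in the full graph remain connected in the training graph.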

### Extract the generated graph

Extract the generated graph from `data/merged/merged-kg.tar.gz`.

> You can use the graph generated in the previous step OR download the latest graph from https://kg-hub.berkeleybop.io/kg-covid-19/current/kg-covid-19.tar.gz

In [14]:
!tar -xvzf data/merged/merged-kg.tar.gz

x merged-kg_nodes.tsv
x merged-kg_edges.tsv


### Create the training/holdout data

We then generate the training/holdout data, which will be used in subsequent steps for training.

In [15]:
# this might take 10 minutes or so
!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv

INFO:root:Loading graph from nodes merged-kg_nodes.tsv and edges merged-kg_edges.tsv files
[Progress-bar output omitted. Stages reported, in order: Reading csv (nodes, then edges), Sorting and building graph, Picking validation edges, Building the train partition, Computing negative edges, Building negative graph, Picking validation edges, Building the train partition, Writing to file.]

### Explore the training data

Let's get some statistics on our training data. KG-COVID-19 is tightly integrated with `ensmallen_graph`, so we'll use that package to do this.

In [16]:
from ensmallen_graph import EnsmallenGraph

training = EnsmallenGraph.from_unsorted_csv(
    edge_path="data/holdouts/pos_train_edges.tsv",
    sources_column="subject",
    destinations_column="object",
    directed=False,
    edge_types_column='label',
    default_edge_type='biolink:Association',
    node_path="data/holdouts/pos_train_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
)

training.report()

{'directed': 'false',
 'singletons': '23268',
 'edges_number': '24761921',
 'unique_edge_types_number': '33',
 'density': '0.02983031227755192',
 'self_loops_number': '405',
 'self_loops_rate': '0.000016355758505166057',
 'nodes_number': '377577',
 'degree_mean': '65.5811159048353',
 'unique_node_types_number': '37'}

Stats for the original graph, for comparison:

In [17]:
from ensmallen_graph import EnsmallenGraph

graph = EnsmallenGraph.from_unsorted_csv(
    edge_path="merged-kg_edges.tsv",
    sources_column="subject",
    destinations_column="object",
    directed=False,
    edge_types_column='edge_label',
    default_edge_type='biolink:Association',
    node_path="merged-kg_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
)

graph.report()

{'self_loops_rate': '0.000016154151442455874',
 'nodes_number': '377577',
 'density': '0.0372871612114053',
 'degree_mean': '81.97479189675218',
 'directed': 'false',
 'unique_edge_types_number': '33',
 'edges_number': '30951796',
 'self_loops_number': '500',
 'singletons': '8314',
 'unique_node_types_number': '37'}
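
To make the comparison concrete, the two reports can be reduced to a few ratios. The sketch below copies the relevant values from the reports above. Note that the report values are strings, so they are converted before any arithmetic. The helper name `compare_reports` is hypothetical.

```python
# Values copied from the two reports above (the reports store them as strings).
training_report = {"edges_number": "24761921", "nodes_number": "377577"}
graph_report = {"edges_number": "30951796", "nodes_number": "377577"}


def compare_reports(train, full):
    """Derive a few summary numbers from a training report and a full-graph report."""
    train_edges = int(train["edges_number"])
    full_edges = int(full["edges_number"])
    return {
        "edge_fraction_kept": train_edges / full_edges,
        "edges_held_out": full_edges - train_edges,
        "train_mean_degree": train_edges / int(train["nodes_number"]),
    }


comparison = compare_reports(training_report, graph_report)
for key, value in comparison.items():
    print(f"{key}: {value}")
```

The fraction of edges kept should come out very close to the default training fraction of 80%, and the mean degree computed here should agree with the `degree_mean` field of the training report.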

## Making embeddings for a KG

To generate embeddings from the KG you've created above, take a look at notebooks available at https://github.com/monarch-initiative/embiggen/blob/master/notebooks/

There are notebooks to make embeddings using:
- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)
- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)
- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)

These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Classical%20Link%20Prediction.ipynb).

**Note:** Consider running the code in the notebooks above on a server with GPUs so that it completes in a reasonable amount of time. Currently, on a server with 2 V100 GPUs, creating the embeddings and training the classifiers each take on the order of one day.

## Use SPARQL queries to query our Blazegraph endpoint

KG-COVID-19 has tooling to query our Blazegraph endpoint using templated SPARQL queries and emit the results as a TSV file. Different SPARQL queries on our endpoint or other endpoints can be used by creating a new YAML file and specifying this file with the `-y` flag.

The following is a simple query that retrieves a summary of the types of entities in the current KG-COVID-19 knowledge graph loaded on the Blazegraph endpoint. These are counted as Biolink Model categories, which are high-level entities such as genes, proteins, publications, etc. You can read more about the Biolink Model [here](https://biolink.github.io/biolink-model/).

In [18]:
!python run.py query -y queries/sparql/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query


In [19]:
import csv

with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:
    read_tsv = csv.reader(tsv, delimiter="\t")
    for row in read_tsv:
        print(row)

['v1', 'v0']
['199', 'organism taxon']
['19131', 'https://w3id.org/biolink/vocab/Gene']
['3908', 'https://w3id.org/biolink/vocab/NamedThing']
['20248', 'https://w3id.org/biolink/vocab/Protein']
['30534', 'https://w3id.org/biolink/vocab/BiologicalProcess']
['4468', 'https://w3id.org/biolink/vocab/CellularComponent']
['30018', 'https://w3id.org/biolink/vocab/ChemicalSubstance']
['32247', 'https://w3id.org/biolink/vocab/Drug']
['12241', 'https://w3id.org/biolink/vocab/MolecularActivity']
['62446', 'https://w3id.org/biolink/vocab/OntologyClass']
['6', 'https://w3id.org/biolink/vocab/OrganismalEntity']
['15530', 'https://w3id.org/biolink/vocab/PhenotypicFeature']
['129930', 'https://w3id.org/biolink/vocab/Publication']
['4687', 'https://w3id.org/biolink/vocab/AnatomicalEntity']
['48', 'https://w3id.org/biolink/vocab/Assay']
['703', 'https://w3id.org/biolink/vocab/Cell']
['3', 'https://w3id.org/biolink/vocab/MolecularEntity']
['21', 'https://w3id.org/biolink/vocab/RNA']
['47', 'https://w3id.