# Running the KG-COVID-19 pipeline

The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end. 

We will also demonstrate some ways to use the KG downstream and show some other features of the framework.

**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation).

## Downloading all required datasets

First, we download all required datasets listed in [download.yaml](../download.yaml).

In [1]:
!python run.py download

Downloading files: 100%|█████████████████████| 24/24 [00:00<00:00, 19807.81it/s]


## Transform all required datasets

We then transform all the datasets, generating a `nodes.tsv` and an `edges.tsv` file for each one.

The files are located in `data/transformed/SOURCE_NAME` where `SOURCE_NAME` is the name of the data source.
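Once the transform step has been run, a quick way to see which sources produced output is to walk that directory. The following is a minimal sketch; the helper name `summarize_transformed` is our own and is not part of KG-COVID-19.

```python
from pathlib import Path


def summarize_transformed(root="data/transformed"):
    """Return {source_name: sorted .tsv file names} for each source directory.

    `summarize_transformed` is a hypothetical helper written for this
    notebook, not a function provided by KG-COVID-19.
    """
    root = Path(root)
    summary = {}
    if not root.is_dir():
        return summary  # nothing has been transformed yet
    for source_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[source_dir.name] = sorted(f.name for f in source_dir.glob("*.tsv"))
    return summary


# Print one line per data source found under data/transformed
for source, files in summarize_transformed().items():
    print(f"{source}: {', '.join(files) or 'no TSV files found'}")
```

A source that is listed without both a nodes file and an edges file is a good candidate for a closer look at the transform logs.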

In [None]:
!python run.py transform


## Merge all datasets into a single graph

Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.
The merge process is driven by [merge.yaml](../merge.yaml).

In [None]:
!python run.py merge

The merged graph should be available in the `data/merged/` folder.

The pipeline writes the graph in KGX TSV format to `data/merged/merged-kg.tar.gz`.

Prebuilt graphs are also available at https://kg-hub.berkeleybop.io/kg-covid-19/index.html

## Make training data for a machine learning use case

KG-COVID-19 contains tooling to produce training data for machine learning. Briefly, a training graph is produced containing 80% of the edges (the default; override with the `-t` parameter). The remaining 20% of edges are removed in such a way that their removal does not create new connected components. These graphs are emitted as KGX TSV files in `data/holdouts`.
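To build some intuition for what "removed without creating new components" means, here is a small sketch of one way such a split can be done. It is an illustration only and is not taken from the KG-COVID-19 code: it keeps a random spanning forest in the training set (using union-find) and draws holdout edges only from the remaining, cycle-closing edges. The function name and its parameters are our own.

```python
import random


def split_preserving_components(edges, train_fraction=0.8, seed=0):
    """Split edges into (train, holdout) without disconnecting any component.

    Illustrative sketch, not the KG-COVID-19 implementation: a random
    spanning forest always stays in training, and holdout edges are taken
    from the edges that close a cycle.
    """
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    forest, removable = [], []
    for source, target in shuffled:
        root_s, root_t = find(source), find(target)
        if root_s != root_t:
            parent[root_s] = root_t  # joins two trees, so it must stay in training
            forest.append((source, target))
        else:
            removable.append((source, target))  # closes a cycle, safe to hold out

    # Hold out the requested share, capped by how many edges are safe to remove
    n_holdout = min(len(removable), round(len(shuffled) * (1 - train_fraction)))
    return forest + removable[n_holdout:], removable[:n_holdout]


# Toy example: the complete graph on five nodes has 10 edges
nodes = "abcde"
toy_edges = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
train, holdout = split_preserving_components(toy_edges)
print(len(train), len(holdout))  # prints: 8 2
```

Note that in a sparse graph there may be fewer than 20% of edges that are safe to remove, in which case this sketch simply returns a smaller holdout set.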

#### untar and gunzip the graph

In [None]:
!tar -xvzf data/merged/merged-kg.tar.gz
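If you would rather check what is inside the archive from Python, the standard-library `tarfile` module can read the header row of each TSV member without extracting anything. This is a minimal sketch; `peek_tsv_headers` is a hypothetical helper name, and we assume each TSV starts with a header line.

```python
import tarfile
from pathlib import Path


def peek_tsv_headers(archive_path):
    """Map each .tsv member of a gzipped tarball to its header columns.

    Hypothetical helper for this notebook; assumes the first line of every
    TSV member is a tab-separated header row.
    """
    headers = {}
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() and member.name.endswith(".tsv")):
                continue
            with tar.extractfile(member) as handle:
                first_line = handle.readline().decode("utf-8")
            headers[member.name] = first_line.rstrip("\r\n").split("\t")
    return headers


archive = Path("data/merged/merged-kg.tar.gz")
if archive.exists():  # only runs once the merge step has produced the archive
    for name, columns in peek_tsv_headers(archive).items():
        print(name, columns)
```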

#### create the training/holdout data

In [None]:
!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv  # this might take 10 minutes or so
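As a rough sanity check, we can compare the number of edge rows in the training file with the number in the full graph. This sketch assumes one edge per line with a single header row in each file, and that no edges were deduplicated along the way, so treat the result as approximate. The helper names are our own.

```python
from pathlib import Path


def count_data_rows(tsv_path):
    """Count lines in a TSV file, excluding the header line."""
    with open(tsv_path, newline="") as handle:
        return max(sum(1 for _ in handle) - 1, 0)


def training_fraction(full_edges, train_edges):
    """Fraction of the full graph's edge rows found in the training file."""
    total = count_data_rows(full_edges)
    return count_data_rows(train_edges) / total if total else 0.0


full = Path("merged-kg_edges.tsv")
train = Path("data/holdouts/pos_train_edges.tsv")
if full.exists() and train.exists():  # skip quietly if the files are not there yet
    print(f"training fraction: {training_fraction(full, train):.3f}")
```

With the default settings we would expect a value close to 0.8.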

#### Let's get some stats on our training graph. We're tightly integrated with ensmallen_graph, so we'll use that package to do this.

In [None]:
from ensmallen_graph import EnsmallenGraph

training = EnsmallenGraph.from_csv(
    edge_path="data/holdouts/pos_train_edges.tsv",
    sources_column='subject',
    destinations_column='object',
    directed=False,
    edge_types_column='edge_label',
    default_edge_type='biolink:Association',
    node_path="data/holdouts/pos_train_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
);

training.report()

In [None]:
graph = EnsmallenGraph.from_csv(
    edge_path="merged-kg_edges.tsv",
    sources_column='subject',
    destinations_column='object',
    directed=False,
    edge_types_column='edge_label',
    default_edge_type='biolink:Association',
    node_path="merged-kg_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
    force_conversion_to_undirected=True # deprecated, removed in ensmallen_graph 0.4
);
graph.report()
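For a quick look at the node categories without any graph library, the nodes file can also be tallied with the standard library. This is a minimal sketch that relies on the `category` column used above; `count_node_categories` is a hypothetical helper, and each distinct value in the column is counted as-is (a multi-valued field is treated as one string).

```python
import csv
from collections import Counter
from pathlib import Path


def count_node_categories(nodes_path, column="category"):
    """Tally the values in one column of a KGX nodes TSV file.

    Hypothetical helper for this notebook; empty or absent values are
    counted under "(missing)".
    """
    counts = Counter()
    with open(nodes_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row.get(column) or "(missing)"] += 1
    return counts


nodes_file = Path("merged-kg_nodes.tsv")
if nodes_file.exists():  # only runs once the graph has been extracted
    for category, n in count_node_categories(nodes_file).most_common(10):
        print(f"{n:>10}  {category}")
```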

#### See [these notebooks](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/) to generate embeddings from the KG you've created above. There are notebooks to make embeddings using:
- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)
- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)
- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)

#### These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Classical%20Link%20Prediction.ipynb).

##### Note: consider running the code in these notebooks on a server with GPUs so that it completes in a reasonable amount of time.

## Use prebuilt SPARQL queries to query our Blazegraph endpoint on the command line

KG-COVID-19 has tooling to query our Blazegraph endpoint using predetermined SPARQL queries and emit the results as a TSV file. Different SPARQL queries, on our endpoint or on other endpoints, can be used by creating a new YAML file and specifying this file with the `-y` flag.
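For readers curious about what a SPARQL request looks like over HTTP, here is a generic sketch based on the standard SPARQL 1.1 Protocol. It is not the KG-COVID-19 implementation: the endpoint URL is a placeholder, the query and the predicate IRI in it are assumptions made for illustration, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint, NOT the real KG-COVID-19 Blazegraph URL
ENDPOINT = "https://example.org/blazegraph/sparql"

# Illustrative query; the predicate IRI is an assumption about how
# categories are stored in the RDF version of the graph
QUERY = """
SELECT ?category (COUNT(?s) AS ?n)
WHERE { ?s <https://w3id.org/biolink/vocab/category> ?category }
GROUP BY ?category
ORDER BY DESC(?n)
"""


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET request that asks for TSV results."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    return Request(url, headers={"Accept": "text/tab-separated-values"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url.split("?")[0])
# To send it against a real endpoint:
#   from urllib.request import urlopen
#   print(urlopen(request).read().decode("utf-8"))
```

Asking for `text/tab-separated-values` mirrors the TSV output described above; whether a given endpoint honours that media type is something to check against its documentation.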

In [None]:
!python run.py query -y queries/sparql/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query

In [None]:
# Have a look at the Biolink category counts currently in the KG-COVID-19 graph loaded on the Blazegraph endpoint
import csv

with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:
    read_tsv = csv.reader(tsv, delimiter="\t")
    for row in read_tsv:
        print(row)