# Running the KG-COVID-19 pipeline

The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end.

We will also demonstrate some ways that you can use the KG downstream, and show some other features of the framework.

**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [Installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation).

## Downloading all required datasets

First, we download all required datasets listed in [download.yaml](../download.yaml).

In [None]:
!python run.py download

## Transform all required datasets

We then transform all the datasets and generate the files `nodes.tsv` and `edges.tsv` for each dataset.

The files are located in `data/transformed/SOURCE_NAME` where `SOURCE_NAME` is the name of the data source.

In [None]:
!python run.py transform
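Before moving on, it can be useful to confirm that each source directory actually contains both output files. The following is a minimal sketch, assuming the layout described above (`data/transformed/SOURCE_NAME/nodes.tsv` and `edges.tsv`); the helper name `check_transformed` is hypothetical and not part of KG-COVID-19.

```python
from pathlib import Path


def check_transformed(root="data/transformed"):
    """Return {source_name: [missing files]} for each source directory under root."""
    expected = ("nodes.tsv", "edges.tsv")
    report = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return report  # nothing has been transformed yet
    for source_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # Record which of the expected files are absent for this source
        report[source_dir.name] = [
            name for name in expected if not (source_dir / name).is_file()
        ]
    return report


for source, missing in check_transformed().items():
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{source}: {status}")
```

An empty list for a source means both files were found; if the directory does not exist yet, nothing is printed.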

## Merge all datasets into a single graph

Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.
The merge process is driven by [merge.yaml](../merge.yaml).

In [None]:
!python run.py merge

The merged graph should be available in the `data/merged/` folder.

This pipeline generates a graph in KGX TSV format here:
`data/merged/merged-kg.tar.gz`

Prebuilt graphs are also available here:
https://kg-hub.berkeleybop.io/kg-covid-19/index.html

# Other tooling/functionality

## Make training data for machine learning use case

KG-COVID-19 contains tooling to produce training data for machine learning. Briefly, a training graph is produced containing 80% of the edges (the default; override with the `-t` parameter). The remaining 20% of edges are removed in such a way that their removal does not create new components. These graphs are emitted as KGX TSV files in the `data/holdouts/` folder.
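To make the "does not create new components" constraint concrete, here is a toy sketch of one way such a split can be done. It is an illustration only and not the implementation behind `run.py holdouts`: the function name `split_edges` and the union-find approach are assumptions. The idea is that edges forming a spanning forest must stay in the training graph, so only the remaining edges are candidates for the holdout set.

```python
import random


def split_edges(edges, train_fraction=0.8, seed=42):
    """Split undirected edges into (train, holdout) without adding components."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)

    parent = {}

    def find(node):
        # Union-find lookup with path halving
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    train, removable = [], []
    for u, v in shuffled:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # This edge joins two components, so it must stay in training
            parent[root_u] = root_v
            train.append((u, v))
        else:
            # Both endpoints are already connected; safe to hold out
            removable.append((u, v))

    # Hold out as many edges as requested, limited by what is safely removable
    wanted = len(shuffled) - round(len(shuffled) * train_fraction)
    n_holdout = min(len(removable), wanted)
    holdout = removable[:n_holdout]
    train.extend(removable[n_holdout:])
    return train, holdout


# Toy graph: a 10-node ring plus chords between nodes two steps apart
toy_edges = [(i, (i + 1) % 10) for i in range(10)] + [(i, (i + 2) % 10) for i in range(10)]
train_edges, holdout_edges = split_edges(toy_edges)
print(len(train_edges), "training edges,", len(holdout_edges), "held-out edges")
```

Note that in a sparse graph there may be fewer removable edges than the requested holdout size, in which case this sketch simply holds out fewer edges.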

### Extract the generated graph

Extract the generated graph from `data/merged/merged-kg.tar.gz`.

> You can use the graph generated in the previous step OR download the latest graph from https://kg-hub.berkeleybop.io/kg-covid-19/current/kg-covid-19.tar.gz

In [None]:
!tar -xvzf data/merged/merged-kg.tar.gz
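The holdouts step reads `merged-kg_edges.tsv` and `merged-kg_nodes.tsv`, so it can help to check which files the archive contains. This is a small sketch using Python's standard `tarfile` module; the helper name `list_archive` is hypothetical.

```python
import tarfile
from pathlib import Path


def list_archive(archive_path):
    """Return (member name, size in bytes) for each regular file in a .tar.gz."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


archive_path = Path("data/merged/merged-kg.tar.gz")
if archive_path.is_file():
    for name, size in list_archive(archive_path):
        print(f"{size:>12,d}  {name}")
```

If the archive is not present at that path, the cell prints nothing.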

### Create the training/holdout data

We then generate the training/holdout data that will be used in subsequent steps for training.

In [None]:
# this might take 10 minutes or so
!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv

### Explore the training data

Let's get some stats on our training data. We're tightly integrated with ensmallen_graph, so we'll use that package to do this.

In [None]:
from ensmallen_graph import EnsmallenGraph

training = EnsmallenGraph.from_unsorted_csv(
    edge_path="data/holdouts/pos_train_edges.tsv",
    sources_column="subject",
    destinations_column="object",
    directed=False,
    edge_types_column='label',
    default_edge_type='biolink:Association',
    node_path="data/holdouts/pos_train_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,    
)

training.report()

Stats for the original graph, for comparison:

In [None]:
from ensmallen_graph import EnsmallenGraph

graph = EnsmallenGraph.from_unsorted_csv(
    edge_path="merged-kg_edges.tsv",
    sources_column="subject",
    destinations_column="object",
    directed=False,
    edge_types_column='edge_label',
    default_edge_type='biolink:Association',
    node_path="merged-kg_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,    
)

graph.report()
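As a quick cross-check that does not depend on ensmallen_graph, we can compare raw row counts in the two edge files. This is a rough sketch, assuming each file has one header line followed by one edge per row; the helper names `count_rows` and `edge_fraction` are hypothetical. Because the graphs above are loaded with duplicate edges ignored, the fraction computed here may not match the reports exactly.

```python
import csv
from pathlib import Path


def count_rows(tsv_path):
    """Count data rows in a TSV file, excluding the header line."""
    with open(tsv_path, newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary text in each field
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header
        return sum(1 for _ in reader)


def edge_fraction(train_path, full_path):
    """Return training rows / full rows, or None if the full file has no rows."""
    n_full = count_rows(full_path)
    if n_full == 0:
        return None
    return count_rows(train_path) / n_full


train_path = Path("data/holdouts/pos_train_edges.tsv")
full_path = Path("merged-kg_edges.tsv")
if train_path.is_file() and full_path.is_file():
    fraction = edge_fraction(train_path, full_path)
    if fraction is not None:
        print(f"training edges are {fraction:.1%} of all edge rows")
```

With the default settings, the printed value should be in the neighborhood of 80%.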

## Making embeddings for a KG

To generate embeddings from the KG you've created above, take a look at the notebooks available at https://github.com/monarch-initiative/embiggen/blob/master/notebooks/

There are notebooks to make embeddings using:
- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)
- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)
- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)

These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Classical%20Link%20Prediction.ipynb).

**Note:** Consider running the code in the above notebooks on a server with GPUs so that it completes in a reasonable amount of time. Currently, on a server with 2 V100 GPUs, the creation of embeddings and the training of classifiers each take on the order of 1 day to complete.

## Use SPARQL queries to query our Blazegraph endpoint

KG-COVID-19 has tooling to query our Blazegraph endpoint using templated SPARQL queries and emit the results as a TSV file. Different SPARQL queries on our endpoint or other endpoints can be used by creating a new YAML file and specifying this file with the `-y` flag.

The following is a simple query that retrieves a summary of the types of entities in the current KG-COVID-19 knowledge graph loaded on the Blazegraph endpoint. These are counted as Biolink Model categories, which are high-level entities such as genes, proteins, publications, etc. You can read more about the Biolink Model [here](https://biolink.github.io/biolink-model/).

In [None]:
!python run.py query -y queries/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query

In [None]:
import csv

with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:
    read_tsv = csv.reader(tsv, delimiter="\t")
    for row in read_tsv:
        print(row)