# Running the KG-COVID-19 pipeline

The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end. 

We will also demonstrate some ways to use the KG downstream and show some other features of the framework.

**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [Installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation).

## Downloading all required datasets

First, we download all required datasets, as listed in [download.yaml](../download.yaml).

In [1]:
!python run.py download

Downloading files: 100%|█████████████████████| 24/24 [00:00<00:00, 19599.55it/s]


## Transform all required datasets

We then transform all the datasets and generate a `nodes.tsv` and an `edges.tsv` for each dataset.

The files are located in `data/transformed/SOURCE_NAME` where `SOURCE_NAME` is the name of the data source.
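
One way to check what the transform step produced is to walk the output directory. The sketch below assumes the layout described above (one sub-directory per source under `data/transformed`, each holding `nodes.tsv` and `edges.tsv`). The helper name `list_transformed` is ours, for illustration only, and is not part of `run.py`.

```python
from pathlib import Path


def list_transformed(base="data/transformed"):
    """Map each source directory name to the expected TSV files it contains."""
    found = {}
    base_path = Path(base)
    if not base_path.is_dir():
        # Nothing has been transformed yet (or the path is wrong).
        return found
    for source_dir in sorted(p for p in base_path.iterdir() if p.is_dir()):
        present = [name for name in ("nodes.tsv", "edges.tsv")
                   if (source_dir / name).is_file()]
        found[source_dir.name] = present
    return found


# Print one line per source; a source missing a file shows a shorter list.
for source, files in list_transformed().items():
    print(source, files)
```

A source that lists fewer than two files may indicate that its transform did not finish.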

In [2]:
!python run.py transform

Sep 28, 2020 2:44:04 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: Your current java version is: 1.8.0_161
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO: To get higher rendering speed on old java 1.8 or 9 versions,
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   or
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
Sep 28, 2020 2:44:04 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS
INFO:   or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")
Sep 28, 2020 2:4

## Merge all datasets into a single graph

Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.
The merge process is driven by [merge.yaml](../merge.yaml).

In [3]:
!python run.py merge

[KGX][cli_utils.py][        parse_target] INFO: Processing target 'drug-central'
[KGX][cli_utils.py][        parse_target] INFO: Processing target 'pharmgkb'
[KGX][cli_utils.py][        parse_target] INFO: Processing target 'STRING'
[KGX][cli_utils.py][       apply_filters] INFO: with node filters: {'category': ['biolink:Gene', 'biolink:Protein']}
[KGX][cli_utils.py][       apply_filters] INFO: with edge filters: {'subject_category': ['biolink:Gene', 'biolink:Protein'], 'object_category': ['biolink:Gene', 'biolink:Protein'], 'edge_label': ['biolink:interacts_with', 'biolink:has_gene_product']}
^C

Aborted!


The merged graph should be available in the `data/merged/` folder.

This pipeline generates a graph in KGX TSV format here:
`data/merged/merged-kg.tar.gz`
Prebuilt graphs are also available here:
https://kg-hub.berkeleybop.io/kg-covid-19/index.html
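
Before unpacking, it can be useful to see what the archive contains. The following is a minimal sketch using Python's standard `tarfile` module. It assumes the archive path given above, and the helper name `list_archive_members` is illustrative rather than part of the pipeline.

```python
import os
import tarfile


def list_archive_members(path="data/merged/merged-kg.tar.gz"):
    """Return (name, size in bytes) for every regular file in a gzipped tar archive."""
    with tarfile.open(path, mode="r:gz") as archive:
        return [(m.name, m.size) for m in archive.getmembers() if m.isfile()]


# Only inspect the archive if the merge step has already produced it.
if os.path.exists("data/merged/merged-kg.tar.gz"):
    for name, size in list_archive_members():
        print(f"{name}\t{size} bytes")
```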

## Make training data for a machine learning use case

#### Untar and gunzip the graph

In [None]:
!tar -xvzf data/merged/merged-kg.tar.gz

#### Create the training/holdout data

In [None]:
!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv  # this might take 10 minutes or so

#### Let's get some stats on our training graph. We're tightly integrated with ensmallen_graph, so we'll use that package to do this.

In [11]:
from ensmallen_graph import EnsmallenGraph

training = EnsmallenGraph.from_csv(
    edge_path="data/holdouts/pos_train_edges.tsv",
    sources_column='subject',
    destinations_column='object',
    directed=False,
    edge_types_column='edge_label',
    default_edge_type='biolink:Association',
    node_path="data/holdouts/pos_train_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
)

training.report()

{'bidirectional_rate': '1',
 'unique_edge_types_number': '32',
 'degrees_mode': '1',
 'is_directed': 'false',
 'degrees_max': '72280',
 'selfloops_rate': '0.000014902610832412994',
 'edges_number': '24760762',
 'traps_rate': '0.0612818094137847',
 'singleton_nodes': '23128',
 'connected_components_number': '24763',
 'nodes_number': '377404',
 'strongly_connected_components_number': '24763',
 'degrees_min': '0',
 'density': '0.0001738405182835909',
 'degrees_median': '5',
 'degrees_mean': '65.60810696230034',
 'unique_node_types_number': '37'}

In [21]:
graph = EnsmallenGraph.from_csv(
    edge_path="merged-kg_edges.tsv",
    sources_column='subject',
    destinations_column='object',
    directed=False,
    edge_types_column='edge_label',
    default_edge_type='biolink:Association',
    node_path="merged-kg_nodes.tsv",
    nodes_column='id',
    default_node_type='biolink:NamedThing',
    node_types_column='category',
    ignore_duplicated_edges=True,
    ignore_duplicated_nodes=True,
    force_conversion_to_undirected=True # deprecated, removed in ensmallen_graph 0.4
)
graph.report()

{'density': '0.00021729963334795238',
 'degrees_mean': '82.00975082405061',
 'degrees_max': '90378',
 'nodes_number': '377404',
 'selfloops_rate': '0.00001534693375371654',
 'connected_components_number': '8996',
 'unique_node_types_number': '37',
 'degrees_mode': '1',
 'edges_number': '30950808',
 'degrees_median': '6',
 'unique_edge_types_number': '32',
 'bidirectional_rate': '1',
 'degrees_min': '0',
 'singleton_nodes': '8243',
 'strongly_connected_components_number': '8996',
 'is_directed': 'false',
 'traps_rate': '0.02184131593729796'}
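
The two reports can be compared directly. The sketch below copies the node and edge counts from the outputs above and derives a few ratios. Treating density as `edges / nodes**2` and mean degree as `edges / nodes` is an inference from the reported figures (both reproduce the printed values), not a statement about how `ensmallen_graph` defines them internally.

```python
# Counts copied from the two report() outputs above.
NODES = 377404
TRAIN_EDGES = 24760762
FULL_EDGES = 30950808


def density(edges, nodes):
    """Edges divided by the number of ordered node pairs, self-pairs included."""
    return edges / nodes ** 2


def mean_degree(edges, nodes):
    """Average number of reported edges per node."""
    return edges / nodes


# Share of the merged graph's edges that remain in the training graph.
train_fraction = TRAIN_EDGES / FULL_EDGES

print(f"training density:     {density(TRAIN_EDGES, NODES):.6e}")
print(f"merged density:       {density(FULL_EDGES, NODES):.6e}")
print(f"training mean degree: {mean_degree(TRAIN_EDGES, NODES):.3f}")
print(f"merged mean degree:   {mean_degree(FULL_EDGES, NODES):.3f}")
print(f"share of edges kept for training: {train_fraction:.4f}")
```

By these counts the training graph keeps roughly 80% of the merged graph's edges while retaining all 377,404 nodes, which would explain why it reports more singleton nodes and more connected components.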

#### See [these notebooks](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/) to generate embeddings from the KG you've created above. There are notebooks to make embeddings using:
- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)
- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)
- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)

#### These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Link%20Prediction.ipynb).

##### Note: 

## Use prebuilt SPARQL queries to query our Blazegraph endpoint on the command line

In [None]:
!python run.py query -y queries/sparql/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query

In [20]:
# have a look at biolink category counts currently in KG-COVID-19 loaded on Blazegraph endpoint
import csv

with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:
    read_tsv = csv.reader(tsv, delimiter="\t")
    for row in read_tsv:
        print(row)

['v1', 'v0']
['199', 'organism taxon']
['19131', 'https://w3id.org/biolink/vocab/Gene']
['3908', 'https://w3id.org/biolink/vocab/NamedThing']
['20167', 'https://w3id.org/biolink/vocab/Protein']
['30534', 'https://w3id.org/biolink/vocab/BiologicalProcess']
['4468', 'https://w3id.org/biolink/vocab/CellularComponent']
['30018', 'https://w3id.org/biolink/vocab/ChemicalSubstance']
['32228', 'https://w3id.org/biolink/vocab/Drug']
['12241', 'https://w3id.org/biolink/vocab/MolecularActivity']
['62446', 'https://w3id.org/biolink/vocab/OntologyClass']
['6', 'https://w3id.org/biolink/vocab/OrganismalEntity']
['15530', 'https://w3id.org/biolink/vocab/PhenotypicFeature']
['129930', 'https://w3id.org/biolink/vocab/Publication']
['4687', 'https://w3id.org/biolink/vocab/AnatomicalEntity']
['48', 'https://w3id.org/biolink/vocab/Assay']
['703', 'https://w3id.org/biolink/vocab/Cell']
['24229', 'https://w3id.org/biolink/vocab/Disease']
['1', 'https://w3id.org/biolink/vocab/MolecularEntity']
['17', 'https: