Showing 1 changed file with 345 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,345 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Running KG-COVID-19 pipeline" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end.\n", | ||
"\n", | ||
"We will also demonstrate some ways that you can use the KG downstream, and show some other features of the framework.\n", | ||
"\n", | ||
"**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Download all required datasets" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"First, we download all required datasets listed in [download.yaml](../download.yaml)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\r", | ||
"Downloading files: 0%| | 0/24 [00:00<?, ?it/s]\r", | ||
"Downloading files: 100%|█████████████████████| 24/24 [00:00<00:00, 19807.81it/s]\r\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!python run.py download" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Transform all required datasets" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We then transform all the datasets, generating a `nodes.tsv` and an `edges.tsv` file for each dataset.\n", | ||
"\n", | ||
"The files are located in `data/transformed/SOURCE_NAME` where `SOURCE_NAME` is the name of the data source." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"WARNING:tabula.io:Got stderr: Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n", | ||
"Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n", | ||
"INFO: Your current java version is: 1.8.0_161\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n", | ||
"INFO: To get higher rendering speed on old java 1.8 or 9 versions,\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n", | ||
"INFO: update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n", | ||
"INFO: or\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n", | ||
"INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n", | ||
"INFO: or call System.setProperty(\"sun.java2d.cmm\", \"sun.java2d.cmm.kcms.KcmsServiceProvider\")\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n", | ||
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n", | ||
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n", | ||
"\n", | ||
"5782864it [00:21, 270243.44it/s]\n", | ||
"5782864it [00:21, 271635.01it/s]\n", | ||
"Loading gene info: 28496648it [01:33, 304282.49it/s]\n", | ||
"Loading country codes: 264it [00:00, 238538.62it/s]\n", | ||
"Unzipping files: 100%|███████████████████████████| 2/2 [03:30<00:00, 105.07s/it]\n", | ||
"100%|█████████████████████████████████████| 54137/54137 [11:09<00:00, 80.84it/s]\n", | ||
"100%|█████████████████████████████████████| 75785/75785 [17:06<00:00, 73.86it/s]\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!python run.py transform" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Merge all datasets into a single graph" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.\n", | ||
"The merge process is driven by [merge.yaml](../merge.yaml)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!python run.py merge" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The merged graph should be available in the `data/merged/` folder.\n", | ||
"\n", | ||
"This pipeline generates a graph in KGX TSV format here: `data/merged/merged-kg.tar.gz`\n", | ||
"\n", | ||
"Prebuilt graphs are also available here: https://kg-hub.berkeleybop.io/kg-covid-19/index.html" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Make training data for machine learning use case\n", | ||
"\n", | ||
"KG-COVID-19 contains tooling to produce training data for machine learning. Briefly, a training graph is produced with 80% of the edges (the default; override with the `-t` parameter). The remaining 20% of edges are held out, chosen so that removing them does not create new components. These graphs are emitted as KGX TSV files in `data/holdouts`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Untar and gunzip the graph" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!tar -xvzf data/merged/merged-kg.tar.gz" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Create the training/holdout data" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": { | ||
"scrolled": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv # this might take 10 minutes or so" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### Let's get some stats on our training graph. We're tightly integrated with ensmallen_graph, so we'll use that package to do this." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from ensmallen_graph import EnsmallenGraph\n", | ||
"\n", | ||
"training = EnsmallenGraph.from_csv(\n", | ||
" edge_path=\"data/holdouts/pos_train_edges.tsv\",\n", | ||
" sources_column='subject',\n", | ||
" destinations_column='object',\n", | ||
" directed=False,\n", | ||
" edge_types_column='edge_label',\n", | ||
" default_edge_type='biolink:Association',\n", | ||
" node_path=\"data/holdouts/pos_train_nodes.tsv\",\n", | ||
" nodes_column='id',\n", | ||
" default_node_type='biolink:NamedThing',\n", | ||
" node_types_column='category',\n", | ||
" ignore_duplicated_edges=True,\n", | ||
" ignore_duplicated_nodes=True,\n", | ||
");\n", | ||
"\n", | ||
"training.report()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"graph = EnsmallenGraph.from_csv(\n", | ||
" edge_path=\"merged-kg_edges.tsv\",\n", | ||
" sources_column='subject',\n", | ||
" destinations_column='object',\n", | ||
" directed=False,\n", | ||
" edge_types_column='edge_label',\n", | ||
" default_edge_type='biolink:Association',\n", | ||
" node_path=\"merged-kg_nodes.tsv\",\n", | ||
" nodes_column='id',\n", | ||
" default_node_type='biolink:NamedThing',\n", | ||
" node_types_column='category',\n", | ||
" ignore_duplicated_edges=True,\n", | ||
" ignore_duplicated_nodes=True,\n", | ||
" force_conversion_to_undirected=True # deprecated, removed in ensmallen_graph 0.4\n", | ||
");\n", | ||
"graph.report()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"#### See [these notebooks](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/) to generate embeddings from the KG you've created above. There are notebooks to make embeddings using:\n", | ||
"- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)\n", | ||
"- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)\n", | ||
"- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)\n", | ||
"\n", | ||
"#### These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Classical%20Link%20Prediction.ipynb).\n", | ||
"\n", | ||
"##### Note: consider running the code in these notebooks on a server with GPUs so that it completes in a reasonable amount of time." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Use prebuilt SPARQL queries to query our Blazegraph endpoint on the command line\n", | ||
"\n", | ||
"KG-COVID-19 has tooling to query our Blazegraph endpoint using predetermined SPARQL queries, and emit the results as a TSV file. Different SPARQL queries on our endpoint or other endpoints can be used by creating a new YAML file and specifying this file with the `-y` flag." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!python run.py query -y queries/sparql/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# have a look at biolink category counts currently in KG-COVID-19 loaded on Blazegraph endpoint\n", | ||
"import csv\n", | ||
"\n", | ||
"with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:\n", | ||
" read_tsv = csv.reader(tsv, delimiter=\"\\t\")\n", | ||
" for row in read_tsv:\n", | ||
" print(row)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
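
The holdout step described in the notebook (keep 80% of edges for training, remove 20% without creating new components) can be illustrated with a small, self-contained sketch. This is not the code behind `run.py holdouts`; the names `UnionFind`, `connected_holdout`, `count_components`, and `example_edges` are hypothetical and exist only for this example. The idea assumed here is that any edge outside a spanning forest can be removed without splitting a component, so holdout edges are drawn only from those spare edges.

```python
import random


class UnionFind:
    """Minimal disjoint-set structure over hashable node IDs (illustrative only)."""

    def __init__(self):
        self.parent = {}

    def find(self, node):
        self.parent.setdefault(node, node)
        while self.parent[node] != node:
            # Path halving keeps later lookups short.
            self.parent[node] = self.parent[self.parent[node]]
            node = self.parent[node]
        return node

    def union(self, a, b):
        """Join the sets containing a and b; return True if they were separate."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        self.parent[root_a] = root_b
        return True


def count_components(edges):
    """Count connected components among nodes that appear in at least one edge."""
    uf = UnionFind()
    for subject, obj in edges:
        uf.union(subject, obj)
    return len({uf.find(node) for node in list(uf.parent)})


def connected_holdout(edges, train_fraction=0.8, seed=42):
    """Split undirected edges into (train, holdout) without creating new components.

    Hypothetical helper: edges that join two previously separate sets form a
    spanning forest and always stay in the training graph; only the remaining
    "spare" edges are eligible for the holdout set.
    """
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)

    uf = UnionFind()
    tree_edges, spare_edges = [], []
    for subject, obj in shuffled:
        if uf.union(subject, obj):
            tree_edges.append((subject, obj))
        else:
            spare_edges.append((subject, obj))

    # Hold out as close to the requested share as the spare edges allow.
    wanted = len(shuffled) - round(len(shuffled) * train_fraction)
    n_holdout = min(len(spare_edges), wanted)
    holdout = spare_edges[:n_holdout]
    train = tree_edges + spare_edges[n_holdout:]
    return train, holdout


# Toy graph: 7 nodes, 10 edges, a single connected component.
example_edges = [
    ("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e"),
    ("e", "c"), ("e", "f"), ("f", "g"), ("g", "e"), ("a", "d"),
]

if __name__ == "__main__":
    train, holdout = connected_holdout(example_edges)
    print(len(train), "training edges,", len(holdout), "holdout edges")
    print("components before:", count_components(example_edges))
    print("components after: ", count_components(train))
```

In a sparse graph there may be fewer spare edges than the requested holdout size, in which case this sketch holds out fewer edges instead of breaking connectivity.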