Merge 5205d0e into 4a1fda7
deepakunni3 committed Sep 29, 2020
2 parents 4a1fda7 + 5205d0e commit 54dfef0
345 changes: 345 additions & 0 deletions Run-KG-COVID-19-pipeline.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Running KG-COVID-19 pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end. \n",
"\n",
"We will also demonstrate some ways you can use the KG downstream and show some other features of the framework.\n",
"\n",
"**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Downloading all required datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we download all of the required datasets listed in [download.yaml](../download.yaml):"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\r",
"Downloading files: 0%| | 0/24 [00:00<?, ?it/s]\r",
"Downloading files: 100%|█████████████████████| 24/24 [00:00<00:00, 19807.81it/s]\r\n"
]
}
],
"source": [
"!python run.py download"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform all required datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We then transform all of the datasets, generating a `nodes.tsv` and an `edges.tsv` file for each dataset.\n",
"\n",
"The files are located in `data/transformed/SOURCE_NAME`, where `SOURCE_NAME` is the name of the data source."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WARNING:tabula.io:Got stderr: Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n",
"Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
"INFO: Your current java version is: 1.8.0_161\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
"INFO: To get higher rendering speed on old java 1.8 or 9 versions,\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
"INFO: update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
"INFO: or\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
"INFO: use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
"INFO: or call System.setProperty(\"sun.java2d.cmm\", \"sun.java2d.cmm.kcms.KcmsServiceProvider\")\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
"Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
"WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
"\n",
"5782864it [00:21, 270243.44it/s]\n",
"5782864it [00:21, 271635.01it/s]\n",
"Loading gene info: 28496648it [01:33, 304282.49it/s]\n",
"Loading country codes: 264it [00:00, 238538.62it/s]\n",
"Unzipping files: 100%|███████████████████████████| 2/2 [03:30<00:00, 105.07s/it]\n",
"100%|█████████████████████████████████████| 54137/54137 [11:09<00:00, 80.84it/s]\n",
"100%|█████████████████████████████████████| 75785/75785 [17:06<00:00, 73.86it/s]\n"
]
}
],
"source": [
"!python run.py transform"
]
},
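{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check (an optional aside, not part of the official pipeline), you can list `data/transformed/` to confirm that each source produced its own output directory:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!ls data/transformed"
]
},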
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Merge all datasets into a single graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.\n",
"The merge process is driven by [merge.yaml](../merge.yaml)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python run.py merge"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The merged graph will be written to the `data/merged/` folder.\n",
"\n",
"The pipeline emits the graph in KGX TSV format as `data/merged/merged-kg.tar.gz`.\n",
"\n",
"Prebuilt graphs are also available at https://kg-hub.berkeleybop.io/kg-covid-19/index.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Make training data for a machine learning use case\n",
"\n",
"KG-COVID-19 contains tooling to produce training data for machine learning. Briefly, a training graph is produced from 80% of the edges (by default; override with the `-t` parameter). The remaining 20% of edges are removed in such a way that removing them does not create new components. These graphs are emitted as KGX TSV files in `data/holdouts`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Untar and gunzip the graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!tar -xvzf data/merged/merged-kg.tar.gz"
]
},
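{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before building holdouts, you can peek at the extracted KGX TSVs with pandas (a minimal sketch; it assumes the tarball extracted `merged-kg_nodes.tsv` and `merged-kg_edges.tsv` into the current directory, as used in the cells below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Load the node and edge tables produced by the merge step\n",
"nodes = pd.read_csv('merged-kg_nodes.tsv', sep='\\t', low_memory=False)\n",
"edges = pd.read_csv('merged-kg_edges.tsv', sep='\\t', low_memory=False)\n",
"\n",
"print(nodes.shape, edges.shape)\n",
"nodes.head()"
]
},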
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Create the training/holdout data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv # this might take 10 minutes or so"
]
},
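{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default the split is 80/20; to keep a different fraction of edges in the training graph, pass the `-t` parameter mentioned above (a sketch, assuming `-t` takes the training fraction):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv -t 0.9"
]
},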
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Let's get some stats on our training graph. KG-COVID-19 is tightly integrated with `ensmallen_graph`, so we'll use that package to do this."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from ensmallen_graph import EnsmallenGraph\n",
"\n",
"training = EnsmallenGraph.from_csv(\n",
" edge_path=\"data/holdouts/pos_train_edges.tsv\",\n",
" sources_column='subject',\n",
" destinations_column='object',\n",
" directed=False,\n",
" edge_types_column='edge_label',\n",
" default_edge_type='biolink:Association',\n",
" node_path=\"data/holdouts/pos_train_nodes.tsv\",\n",
" nodes_column='id',\n",
" default_node_type='biolink:NamedThing',\n",
" node_types_column='category',\n",
" ignore_duplicated_edges=True,\n",
" ignore_duplicated_nodes=True,\n",
");\n",
"\n",
"training.report()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"graph = EnsmallenGraph.from_csv(\n",
" edge_path=\"merged-kg_edges.tsv\",\n",
" sources_column='subject',\n",
" destinations_column='object',\n",
" directed=False,\n",
" edge_types_column='edge_label',\n",
" default_edge_type='biolink:Association',\n",
" node_path=\"merged-kg_nodes.tsv\",\n",
" nodes_column='id',\n",
" default_node_type='biolink:NamedThing',\n",
" node_types_column='category',\n",
" ignore_duplicated_edges=True,\n",
" ignore_duplicated_nodes=True,\n",
" force_conversion_to_undirected=True # deprecated, removed in ensmallen_graph 0.4\n",
");\n",
"graph.report()"
]
},
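{
"cell_type": "markdown",
"metadata": {},
"source": [
"To compare the training graph against the full graph, you can query basic counts (a hedged sketch; it assumes the `get_nodes_number`/`get_edges_number` accessors of `EnsmallenGraph` and that both `training` and `graph` were loaded in the cells above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The training graph should retain the nodes but only ~80% of the edges\n",
"print('full graph:', graph.get_nodes_number(), 'nodes,', graph.get_edges_number(), 'edges')\n",
"print('training: ', training.get_nodes_number(), 'nodes,', training.get_edges_number(), 'edges')"
]
},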
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### See [these notebooks](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/) to generate embeddings from the KG you've created above. There are notebooks to make embeddings using:\n",
"- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)\n",
"- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)\n",
"- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)\n",
"\n",
"#### These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Classical%20Link%20Prediction.ipynb).\n",
"\n",
"##### Note: consider running the code in these notebooks on a server with GPUs so that they complete in a reasonable amount of time."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Use prebuilt SPARQL queries to query our Blazegraph endpoint on the command line\n",
"\n",
"KG-COVID-19 has tooling to query our Blazegraph endpoint using predefined SPARQL queries and emit the results as a TSV file. You can run different SPARQL queries, against our endpoint or others, by creating a new YAML file and specifying it with the `-y` flag."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!python run.py query -y queries/sparql/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# have a look at the biolink category counts currently loaded on the KG-COVID-19 Blazegraph endpoint\n",
"import csv\n",
"\n",
"with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:\n",
" read_tsv = csv.reader(tsv, delimiter=\"\\t\")\n",
" for row in read_tsv:\n",
" print(row)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
