Merge 5205d0e into 4a1fda7

Knowledge-Graph-Hub · Sep 29, 2020 · 54dfef0
2 parents 4a1fda7 + 5205d0e
commit 54dfef0
Showing 1 changed file with 345 additions and 0 deletions.
diff --git a/Run-KG-COVID-19-pipeline.ipynb b/Run-KG-COVID-19-pipeline.ipynb
@@ -0,0 +1,345 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Running KG-COVID-19 pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The KG-COVID-19 pipeline can be run on the command line or via this notebook. The goal here is to run the pipeline end-to-end. \n",
+    "\n",
+    "We will also demonstrate some ways that you can use the KG downstream, and show some other features of the framework.\n",
+    "\n",
+    "**Note:** This notebook assumes that you have already installed the required dependencies for KG-COVID-19. For more information, refer to the [installation instructions](https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki#installation)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Downloading all required datasets"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "First, we download all required datasets listed in [download.yaml](../download.yaml)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\r",
+      "Downloading files:   0%|                                 | 0/24 [00:00<?, ?it/s]\r",
+      "Downloading files: 100%|█████████████████████| 24/24 [00:00<00:00, 19807.81it/s]\r\n"
+     ]
+    }
+   ],
+   "source": [
+    "!python run.py download"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Transform all required datasets"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We then transform all the datasets, generating a `nodes.tsv` and an `edges.tsv` file for each dataset.\n",
+    "\n",
+    "The files are located in `data/transformed/SOURCE_NAME` where `SOURCE_NAME` is the name of the data source."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "WARNING:tabula.io:Got stderr: Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n",
+      "Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:20 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
+      "INFO: Your current java version is: 1.8.0_161\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
+      "INFO: To get higher rendering speed on old java 1.8 or 9 versions,\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
+      "INFO:   update to the latest 1.8 or 9 version (>= 1.8.0_191 or >= 9.0.4),\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
+      "INFO:   or\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
+      "INFO:   use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.rendering.PDFRenderer suggestKCMS\n",
+      "INFO:   or call System.setProperty(\"sun.java2d.cmm\", \"sun.java2d.cmm.kcms.KcmsServiceProvider\")\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial-BoldMT'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "Sep 28, 2020 4:12:21 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'ArialMT'\n",
+      "Sep 28, 2020 4:12:22 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>\n",
+      "WARNING: Using fallback font 'LiberationSans' for 'Arial'\n",
+      "\n",
+      "5782864it [00:21, 270243.44it/s]\n",
+      "5782864it [00:21, 271635.01it/s]\n",
+      "Loading gene info: 28496648it [01:33, 304282.49it/s]\n",
+      "Loading country codes: 264it [00:00, 238538.62it/s]\n",
+      "Unzipping files: 100%|███████████████████████████| 2/2 [03:30<00:00, 105.07s/it]\n",
+      "100%|█████████████████████████████████████| 54137/54137 [11:09<00:00, 80.84it/s]\n",
+      "100%|█████████████████████████████████████| 75785/75785 [17:06<00:00, 73.86it/s]\n"
+     ]
+    }
+   ],
+   "source": [
+    "!python run.py transform"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Merge all datasets into a single graph"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally, we create a merged graph by reading in the individual `nodes.tsv` and `edges.tsv` files and merging them.\n",
+    "The merge process is driven by [merge.yaml](../merge.yaml)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python run.py merge"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The merged graph should be available in the `data/merged/` folder.\n",
+    "\n",
+    "This pipeline generates a graph in KGX TSV format here:\n",
+    "`data/merged/merged-kg.tar.gz`\n",
+    "Prebuilt graphs are also available here:\n",
+    "https://kg-hub.berkeleybop.io/kg-covid-19/index.html"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Make training data for machine learning use case\n",
+    "\n",
+    "KG-COVID-19 contains tooling to produce training data for machine learning. Briefly, a training graph is produced that retains 80% of the edges (the default; override with the `-t` parameter). The remaining 20% of edges are held out, chosen so that removing them does not create new components. These graphs are emitted as KGX TSV files in `data/holdouts`."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Untar and gunzip the graph"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!tar -xvzf data/merged/merged-kg.tar.gz"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Create the training/holdout data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "!python run.py holdouts -e merged-kg_edges.tsv -n merged-kg_nodes.tsv  # this might take 10 minutes or so"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Let's get some stats on our training graph. We're tightly integrated with ensmallen_graph, so we'll use that package to do this."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ensmallen_graph import EnsmallenGraph\n",
+    "\n",
+    "training = EnsmallenGraph.from_csv(\n",
+    "    edge_path=\"data/holdouts/pos_train_edges.tsv\",\n",
+    "    sources_column='subject',\n",
+    "    destinations_column='object',\n",
+    "    directed=False,\n",
+    "    edge_types_column='edge_label',\n",
+    "    default_edge_type='biolink:Association',\n",
+    "    node_path=\"data/holdouts/pos_train_nodes.tsv\",\n",
+    "    nodes_column='id',\n",
+    "    default_node_type='biolink:NamedThing',\n",
+    "    node_types_column='category',\n",
+    "    ignore_duplicated_edges=True,\n",
+    "    ignore_duplicated_nodes=True,\n",
+    ")\n",
+    "\n",
+    "training.report()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "graph = EnsmallenGraph.from_csv(\n",
+    "    edge_path=\"merged-kg_edges.tsv\",\n",
+    "    sources_column='subject',\n",
+    "    destinations_column='object',\n",
+    "    directed=False,\n",
+    "    edge_types_column='edge_label',\n",
+    "    default_edge_type='biolink:Association',\n",
+    "    node_path=\"merged-kg_nodes.tsv\",\n",
+    "    nodes_column='id',\n",
+    "    default_node_type='biolink:NamedThing',\n",
+    "    node_types_column='category',\n",
+    "    ignore_duplicated_edges=True,\n",
+    "    ignore_duplicated_nodes=True,\n",
+    "    force_conversion_to_undirected=True # deprecated, removed in ensmallen_graph 0.4\n",
+    ")\n",
+    "graph.report()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### See [these notebooks](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/) to generate embeddings from the KG you've created above. There are notebooks to make embeddings using:\n",
+    "- [Skipgram](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20SkipGram.ipynb)\n",
+    "- [CBOW](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20CBOW.ipynb)\n",
+    "- [GloVe](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Graph%20embedding%20using%20GloVe.ipynb)\n",
+    "\n",
+    "#### These embeddings can then be used to train MLP, random forest, decision tree, and logistic regression classifiers using [this notebook](https://github.com/monarch-initiative/embiggen/blob/master/notebooks/Classical%20Link%20Prediction.ipynb).\n",
+    "\n",
+    "##### Note: consider running the code in these notebooks on a server with GPUs so that it completes in a reasonable amount of time."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Use prebuilt SPARQL queries to query our Blazegraph endpoint on the command line\n",
+    "\n",
+    "KG-COVID-19 has tooling to query our Blazegraph endpoint using predetermined SPARQL queries and emit the results as a TSV file. Different SPARQL queries, on our endpoint or on other endpoints, can be used by creating a new YAML file and specifying this file with the `-y` flag."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!python run.py query -y queries/sparql/query-01-bl-cat-counts.yaml # or make a new YAML file and write your own query"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# have a look at the Biolink category counts for the KG-COVID-19 graph currently loaded on the Blazegraph endpoint\n",
+    "import csv\n",
+    "\n",
+    "with open('data/queries/query-01-bl-cat-counts.tsv', newline='') as tsv:\n",
+    "    read_tsv = csv.reader(tsv, delimiter=\"\\t\")\n",
+    "    for row in read_tsv:\n",
+    "        print(row)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}