# Build Graph For The Tutorial

This notebook works for any root node; the default is `Q2685` (Arnold Schwarzenegger).

In [1]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
from IPython.display import display, HTML

import papermill as pm

sys.path.insert(0,'../..')
from configure_kgtk_notebooks import ConfigureKGTK

In [2]:
# Parameters
kgtk_path = "/Users/pedroszekely/Documents/GitHub/kgtk"

# Folder on local machine where to create the output and temporary folders
input_path = "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data/"
output_path = "/data1/rogers/kgtk/tutorial/"
project_name = "build-tutorial"
root = "Q2685"

Put the root q-node in the environment variable `ROOT`

In [3]:
os.environ['ROOT'] = root
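
A small illustration of why this is enough: shell commands started with `!` run as child processes of the notebook kernel, so they inherit whatever is in `os.environ`. The sketch below is not part of the pipeline. It uses `subprocess` with a second Python interpreter as a stand-in for a shell command, so it also runs outside Jupyter.

```python
import os
import subprocess
import sys

# Set the variable in the current process, as the cell above does.
os.environ["ROOT"] = "Q2685"

# A child process (a second Python interpreter standing in for a shell
# command) inherits the parent's environment and can read the value.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['ROOT'])"],
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)
```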

In [4]:
files = [
    "claims",
    "item",
    "wikibase_property",
    "datatypes",
    "qualifiers",
    "p31",
    "p279",
    "p279star",
    "quantity",
    "time",
    "external_id",
    "globe_coordinate",
    "monolingualtext",
    "string",
    "label",
    "alias",
    "description"
]
ck = ConfigureKGTK(kgtk_path=kgtk_path)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name)

User home: /home/rogers
Current dir: /data1/rogers/kgtk/github/kgtk/tutorial
KGTK dir: /data1/rogers/kgtk/github/kgtk
Use-cases dir: /data1/rogers/kgtk/github/kgtk/use-cases


In [5]:
os.environ['KGTK_LABEL_FILE'] = os.environ['label']

In [6]:
ck.print_env_variables(files)

USE_CASES_DIR: /data1/rogers/kgtk/github/kgtk/use-cases
kypher: kgtk query --graph-cache /data1/rogers/kgtk/tutorial//build-tutorial/temp.build-tutorial/wikidata.sqlite3.db
OUT: /data1/rogers/kgtk/tutorial//build-tutorial
STORE: /data1/rogers/kgtk/tutorial//build-tutorial/temp.build-tutorial/wikidata.sqlite3.db
GRAPH: /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data/
kgtk: kgtk
EXAMPLES_DIR: /data1/rogers/kgtk/github/kgtk/examples
TEMP: /data1/rogers/kgtk/tutorial//build-tutorial/temp.build-tutorial
claims: /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//claims.tsv.gz
item: /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//claims.wikibase-item.tsv.gz
wikibase_property: /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//claims.wikibase-property.tsv.gz
datatypes: /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//metadata.property.datatypes.tsv.gz
qualifiers: /data3/rogers/kgtk/gd/kgt

## Define a custom location for the store when working with full Wikidata so that I can reuse it

In [7]:
os.environ['STORE'] = "/data1/rogers/kgtk/tutorial/wikidata.sqlite3.db"

Turn on debugging for kypher

In [8]:
os.environ['kypher'] = "kgtk --debug query --graph-cache " + os.environ['STORE']

In [9]:
!echo "$kypher"

kgtk --debug query --graph-cache /data1/rogers/kgtk/tutorial/wikidata.sqlite3.db


Load all my files into the kypher cache so that all graph aliases are defined

In [10]:
ck.load_files_into_cache(file_list=files)

kgtk --debug query --graph-cache /data1/rogers/kgtk/tutorial/wikidata.sqlite3.db -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//claims.tsv.gz" --as claims  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//claims.wikibase-item.tsv.gz" --as item  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//claims.wikibase-property.tsv.gz" --as wikibase_property  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//metadata.property.datatypes.tsv.gz" --as datatypes  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//qualifiers.tsv.gz" --as qualifiers  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//derived.P31.tsv.gz" --as p31  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//derived.P279.tsv.gz" --as p279  -i "/data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20210215/data//derived.P279star.tsv.gz" --as p279star  -i "

In [11]:
%cd {os.environ['OUT']}

/data1/rogers/kgtk/tutorial/build-tutorial


# Approach:
- Select a subgraph of full Wikidata that includes films (Q11424), people (Q5), organizations (Q43229), geographic regions (Q82794), and awards (Q618779). This graph contains all edges that connect instances of the target classes listed above. Output the graph using a single relation we call `link`.
- Starting from Schwarzenegger Q2685, compute reachable nodes in the graph computed in the previous step. This step will produce the collection of nodes that will be part of the Schwarzenegger graph.
- Extract from Wikidata all the edges that connect nodes from the previous step.
- Extract from Wikidata the time, quantity, monolingual and string properties.
- Extract from Wikidata the qualifiers for the edges computed in the previous steps.
- Extract from Wikidata the labels, aliases and descriptions for the Schwarzenegger nodes.
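
To make the first step concrete, here is a minimal pure-Python sketch of the class-based selection on toy data. It is not the Kypher query used in this notebook. The nodes `film_1`, `rock_1`, `feature_film`, `mineral` and the property `P_other` are hypothetical, and the sketch assumes the subclass closure contains each class itself.

```python
# Target classes, taken from the --where clause of the selection query.
TARGET_CLASSES = {"Q11424", "Q5", "Q43229", "Q82794", "Q618779"}

# instance-of edges: node -> direct classes (hypothetical toy data)
p31 = {
    "Q2685": {"Q5"},
    "film_1": {"feature_film"},
    "rock_1": {"mineral"},
}

# subclass closure: class -> all superclasses, assumed to include the class itself
p279star = {
    "Q5": {"Q5"},
    "feature_film": {"feature_film", "Q11424"},
    "mineral": {"mineral"},
}

# item-to-item edges as (node1, label, node2, id)
item = [
    ("film_1", "P161", "Q2685", "e1"),
    ("Q2685", "P_other", "rock_1", "e2"),  # rock_1 is not in any target class
]


def in_target(node):
    """True if any class of the node has a target class among its superclasses."""
    return any(p279star.get(c, set()) & TARGET_CLASSES for c in p31.get(node, ()))


# Keep an edge only when both endpoints are instances of a target class,
# and replace its label with the single relation "link".
links = [
    (n1, "link", n2, edge_id)
    for n1, _label, n2, edge_id in item
    if in_target(n1) and in_target(n2)
]
print(links)
```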

## Extract a subset of Wikidata to use as the base for the Schwarzenegger graph

This query takes a really long time, so don't re-execute unless you have to.

In [12]:
%%time
!$kypher -i p31 -i item -i p279star \
--match ' \
    p31: (n1)-[]->(n1_class), \
    item: (n1)-[l]->(n2), \
    p31: (n2)-[]->(n2_class), \
    p279star: (n1_class)-[]->(n1_superclass), \
    p279star: (n2_class)-[]->(n2_superclass)' \
--where 'n1_superclass in ["Q11424", "Q5", "Q43229", "Q82794", "Q618779"] and n2_superclass in ["Q11424", "Q5", "Q43229", "Q82794", "Q618779"]' \
--return 'distinct n1 as node1, "link" as label, n2 as node2, l as id' \
-o "$TEMP"/item.per.org.cw.geo.award.link.tsv.gz 

[2021-10-04 16:37:54 sqlstore]: IMPORT graph via csv.reader into table graph_6 from /data1/rogers/kgtk/tutorial/build-tutorial/p31 ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 761, in

In the original graph there are qualifier values that we want to follow in the reachability search. To do so, we create `link` edges from the value of the statement on which the qualifier is defined to the qualifier value.

In [None]:
%%time
!$kypher -i qualifiers -i datatypes -i "$TEMP"/item.per.org.cw.geo.award.link.tsv.gz --as links \
--match ' \
    links: ()-[l]->(n2), \
    qualifiers: (l)-[q {label: property}]->(qualifier), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item"]' \
--return 'n2 as node1, "link" as label, qualifier as node2' \
/ add-id --id-style wikidata \
/ cat -i - -i "$TEMP"/item.per.org.cw.geo.award.link.tsv.gz \
-o "$TEMP"/item.per.org.cw.geo.award.link.qualifier.tsv.gz
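
The idea behind this cell can be sketched on toy data as follows. This is plain Python, not Kypher. The nodes `award_1` and `film_1` are hypothetical, and the time literal is only an example value.

```python
# link edges as (node1, label, node2, id)
links = [("Q2685", "link", "award_1", "e1")]

# qualifier edges as (statement_id, property, value)
qualifiers = [
    ("e1", "P585", "^2004-01-01T00:00:00Z/11"),  # time-valued qualifier
    ("e1", "P1686", "film_1"),                   # item-valued qualifier
]

# property -> datatype
datatypes = {"P585": "time", "P1686": "wikibase-item"}

# Index the link edges by id so qualifiers can find their statement.
statement_value = {edge_id: node2 for _n1, _label, node2, edge_id in links}

# For every item-valued qualifier on a link edge, add a new link from
# the statement's value (node2) to the qualifier value.
qualifier_links = [
    (statement_value[stmt], "link", value)
    for stmt, prop, value in qualifiers
    if stmt in statement_value and datatypes.get(prop) == "wikibase-item"
]
print(qualifier_links)
```

Only the item-valued qualifier produces a new edge, because only items can be traversed further in the reachability search.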

Starting from `ROOT`, traverse links forward in breadth-first mode, up to a fixed number of levels (three here), to build the graph

In [None]:
%%time
!$kgtk reachable-nodes \
    --root $ROOT \
    --prop link \
    --label "reachable" \
    --selflink \
    --breadth-first --depth-limit 3 \
    -i "$TEMP"/item.per.org.cw.geo.award.link.qualifier.tsv.gz  \
    -o "$TEMP"/root.reachable.per.org.cw.geo.award.tsv.gz
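
For intuition, a depth-limited breadth-first traversal can be sketched in a few lines of Python. This is not how `reachable-nodes` is implemented. The function `reachable` and the toy edges are hypothetical, and the sketch assumes the root sits at depth 0 and is part of the result.

```python
from collections import deque


def reachable(edges, root, depth_limit):
    """Return {node: depth} for nodes within depth_limit hops of root."""
    adjacency = {}
    for node1, node2 in edges:
        adjacency.setdefault(node1, []).append(node2)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == depth_limit:
            continue  # do not expand nodes that are already at the limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return depth


# A chain a -> b -> c -> d -> e; with a limit of 3, "e" is out of reach.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
result = reachable(toy_edges, "a", 3)
print(result)
```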

In [None]:
!$kgtk head -i "$TEMP"/root.reachable.per.org.cw.geo.award.tsv.gz

Index the resulting file in kypher

In [None]:
!$kypher -i $TEMP/root.reachable.per.org.cw.geo.award.tsv.gz --as root_nodes --limit 2

## Build initial graph containing the item edges

Figure out which properties are used so that we can add them as node1s and get all the info about them.

In [None]:
%%time
!$kypher -i root_nodes -i datatypes -i claims \
--match ' \
    root_nodes: ()-[]->(n1), \
    claims: (n1)-[l {label: property}]->(), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item", "string", "quantity", "time", "monolingualtext"]' \
--return 'distinct "root" as node1, "link" as label, property as node2' \
-o "$TEMP"/root.nodes.property.tsv.gz
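
A toy version of this property collection, in plain Python rather than Kypher, might look like the sketch below. The claims are illustrative examples, and the output rows use the same `"root"` / `"link"` shape as the query's return clause.

```python
# Nodes found by the reachability step (toy data).
root_nodes = {"Q2685"}

# claims as (node1, property, node2, id); values are illustrative only
claims = [
    ("Q2685", "P569", "^1947-07-30T00:00:00Z/11", "e1"),
    ("Q2685", "P18", "portrait.jpg", "e2"),
    ("Q2685", "P27", "Q30", "e3"),
    ("Q999999", "P27", "Q30", "e4"),  # node1 is not one of the root nodes
]

# property -> datatype
datatypes = {"P569": "time", "P18": "commonsMedia", "P27": "wikibase-item"}

WANTED = {"wikibase-item", "string", "quantity", "time", "monolingualtext"}

# Distinct properties used on the selected nodes, restricted to wanted datatypes.
properties = sorted(
    {prop for n1, prop, _n2, _id in claims
     if n1 in root_nodes and datatypes.get(prop) in WANTED}
)

# Emit them in the same shape as the reachable-nodes output.
property_rows = [("root", "link", prop) for prop in properties]
print(property_rows)
```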

Concatenate the new nodes with the ones we found via reachability

In [None]:
!kgtk cat -i "$TEMP"/root.nodes.property.tsv.gz -i "$TEMP"/root.reachable.per.org.cw.geo.award.tsv.gz \
-o "$TEMP"/root.nodes.all.tsv.gz

Print number of nodes that we have so far for the new graph

In [None]:
!zcat < "$TEMP"/root.nodes.all.tsv.gz | wc -l

Update the Kypher database

In [None]:
!$kypher -i "$TEMP"/root.nodes.all.tsv.gz --as root_nodes --limit 2

Extract the item to item edges connecting the nodes in the new graph

In [None]:
%%time
!$kypher -i root_nodes -i item \
--match ' \
    root_nodes: ()-[]->(n1), \
    root_nodes: ()-[]->(n2), \
    item: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.item.tsv.gz
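
What this query computes is the subgraph induced by the selected nodes: an edge is kept only if both of its endpoints are in `root_nodes`. A minimal sketch on hypothetical toy data, in plain Python:

```python
# Selected nodes (toy data).
root_nodes = {"a", "b", "c"}

# item edges as (node1, label, node2, id)
item = [
    ("a", "P1", "b", "e1"),
    ("a", "P2", "x", "e2"),  # dropped: x is not selected
    ("y", "P3", "c", "e3"),  # dropped: y is not selected
    ("b", "P1", "c", "e4"),
]

# Keep edges whose two endpoints are both selected, then sort them,
# mirroring the distinct + sort steps of the cell above.
induced = sorted(
    {edge for edge in item if edge[0] in root_nodes and edge[2] in root_nodes}
)
print(induced)
```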

Add to the kypher database

In [None]:
!$kypher -i $OUT/root.graph.item.tsv.gz --as rootitems --limit 2

## Extract the other types of edges

Extract the quantities

In [None]:
%%time
!$kypher -i quantity -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    quantity: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.quantity.tsv.gz

Extract the time edges

In [None]:
%%time
!$kypher -i time -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    time: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.time.tsv.gz

Extract the monolingual text edges

In [None]:
%%time
!$kypher -i monolingualtext -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    monolingualtext: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.monolingual.tsv.gz

Extract the string edges

In [None]:
%%time
!$kypher -i string -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    string: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.string.tsv.gz

Extract the external identifier edges

In [None]:
%%time
!$kypher -i external_id -i root_nodes \
--match ' \
    root_nodes: ()-[]->(n1), \
    external_id: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.external_ids.tsv.gz

## Complete the graph

Concatenate the item, quantity, time, monolingual text, string and external identifier edges into one base file

In [None]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.tsv.gz \
-i $OUT/root.graph.quantity.tsv.gz \
-i $OUT/root.graph.time.tsv.gz \
-i $OUT/root.graph.monolingual.tsv.gz \
-i $OUT/root.graph.string.tsv.gz \
-i $OUT/root.graph.external_ids.tsv.gz \
-o $OUT/root.graph.item.quantity.time.monolingual.string.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.tsv.gz --as rootbase --limit 2

### Collect all the properties

Get edges for the properties

In [None]:
%%time
!$kypher -i rootbase -i wikibase_property \
--match ' \
    rootbase: ()-[l {label: property}]->(), \
    wikibase_property: (property)-[lp]->(n) \
    ' \
--return 'distinct property as node1, lp.label as label, n as node2, lp as id' \
/ sort \
-o $OUT/root.graph.property.tsv.gz

Update the base

In [None]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.quantity.time.monolingual.string.tsv.gz \
-i $OUT/root.graph.property.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.tsv.gz --as rootbase --limit 2

### Compute qualifiers

In [None]:
%%time
!$kypher -i qualifiers -i rootbase \
--match ' \
    rootbase: ()-[l]->(), \
    qualifiers: (l)-[lq {label: property}]->(n) \
    ' \
--return 'distinct l as node1, property as label, n as node2, lq as id' \
/ sort \
-o $OUT/root.graph.qualifiers.tsv.gz

Update the base again

In [None]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.quantity.time.monolingual.string.property.tsv.gz \
-i $OUT/root.graph.qualifiers.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz 

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz --as rootbase --limit 2

### Add the units

Find all values of quantity properties, and get the units defined for them.

> `kgtk_quantity_wd_units` throws an exception when it gets a quantity without units, so we have to hack around that using grep.

In [None]:
%%time
!$kypher -i datatypes -i rootbase \
--match ' \
    rootbase: ()-[l {label: property}]->(n2), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["quantity"]' \
--return 'distinct n2' \
| grep Q > "$TEMP"/units.noheader.tsv

!echo -e "node1" | cat - "$TEMP"/units.noheader.tsv > "$TEMP"/quantities.units.tsv

In [None]:
%%time
!$kypher -i "$TEMP"/quantities.units.tsv \
--match '(quantity)' \
--return 'distinct kgtk_quantity_wd_units(quantity) as node1' \
-o "$TEMP"/units.tsv
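
To show what the two cells above are after, here is a pure-Python stand-in that pulls the Wikidata unit out of a quantity literal. It is not the `kgtk_quantity_wd_units` built-in. It assumes that a quantity with a Wikidata unit ends in the unit's q-node (for example `+188Q174728`), and the helper `wd_unit` is hypothetical.

```python
import re

# Assumption: the unit, when present, is a q-node at the end of the literal.
_UNIT = re.compile(r"(Q\d+)$")


def wd_unit(quantity):
    """Return the trailing q-node of a quantity literal, or None if there is none."""
    match = _UNIT.search(quantity)
    return match.group(1) if match else None


values = ["+188Q174728", "+2", "+107Q11570", "+190Q174728"]

# Mirrors the `grep Q` filter: drop quantities that carry no unit at all.
with_units = [v for v in values if "Q" in v]

# Distinct units, as the second query returns them.
units = sorted({u for u in map(wd_unit, with_units) if u})
print(units)
```

Unlike the built-in described in the note above, this helper returns `None` for a unitless quantity instead of raising, which is why the sketch does not strictly need the filter.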

Now that we have the units in a file, we can get all the properties we want about them

In [None]:
%%time
!$kypher -i "$TEMP"/units.tsv -i item -i datatypes \
--match ' \
    units: (unit), \
    datatypes: (property)-[:datatype]->(datatype), \
    item: (unit)-[l {label: property}]->(n2) \
    ' \
--where 'datatype in ["wikibase-item", "string", "quantity", "time", "monolingualtext"]' \
--return 'distinct unit as node1, property as label, n2 as node2, l as id' \
/ cat -i - -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz --as rootbase --limit 2

### Make sure that every q-node has at least P31 and P279
We need to do this twice: once for the nodes in the `node1` position and once for the nodes in the `node2` position.
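
The two queries that follow differ only in which end of the `rootbase` edges supplies the nodes. A toy sketch of the same idea in plain Python (the helper `typing_edges` is hypothetical and the data is illustrative):

```python
# rootbase edges as (node1, label, node2, id)
rootbase = [("Q2685", "P27", "Q30", "e1")]

# claims as (node1, property, node2, id)
claims = [
    ("Q2685", "P31", "Q5", "c1"),
    ("Q30", "P31", "Q6256", "c2"),
    ("Q30", "P36", "Q61", "c3"),  # not a P31/P279 edge, so never selected
]


def typing_edges(nodes):
    """P31 and P279 claims whose subject is one of the given nodes."""
    return sorted(c for c in claims if c[0] in nodes and c[1] in {"P31", "P279"})


# First pass: nodes that appear as node1 in the graph.
node1_typing = typing_edges({edge[0] for edge in rootbase})
# Second pass: nodes that appear as node2 in the graph.
node2_typing = typing_edges({edge[2] for edge in rootbase})

print(node1_typing)
print(node2_typing)
```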

In [36]:
%%time
!$kypher -i rootbase -i claims \
--match 'rootbase: (n)-[]->(), claims: (n)-[l {label: property}]->(n2)' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n as node1, property as label, n2 as node2, l as id' \
-o "$TEMP"/root.node1.P31.P279.tsv.gz

[2021-10-04 16:38:11 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76

In [37]:
%%time
!$kypher -i rootbase -i claims \
--match 'rootbase: ()-[]->(n), claims: (n)-[l {label: property}]->(n2)' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n as node1, property as label, n2 as node2, l as id' \
-o "$TEMP"/root.node2.P31.P279.tsv.gz

[2021-10-04 16:38:12 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76

Recreate the base file
> The output file should have the `.units` segment in its name, but I didn't add it so that I don't have to modify all the other commands.
> A better design for the file names would not have this problem.

In [38]:
%%time
!kgtk cat \
-i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz \
-i "$TEMP"/root.node1.P31.P279.tsv.gz \
-i "$TEMP"/root.node2.P31.P279.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz --as rootbase --limit 2

[Errno 2] No such file or directory: '/data1/rogers/kgtk/tutorial/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.units.tsv.gz'
No header line in file
[2021-10-04 16:38:13 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, i

### Incorporate all nodes up to the top of the class hierarchy
When we do a breadth-first traversal with a depth limit, we may not follow enough links in the P279 hierarchy to reach the top. We need a full traversal of the P279 hierarchy to incorporate all the relevant classes.

Approach:
- Create a graph including P31 and P279 to do the traversal
- Create a file of all the nodes in the Schwarzenegger graph to use as roots
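
The difference from the earlier depth-limited search is that this traversal has no depth limit and starts from many roots at once. A minimal sketch, with a hypothetical `upward_closure` function and made-up class names:

```python
def upward_closure(edges, roots):
    """All nodes reachable from any root by following edges, with no depth limit."""
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    seen = set(roots)
    stack = list(roots)
    while stack:
        node = stack.pop()  # depth-first: expand the most recently found node
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# P31 and P279 edges merged into one "link" relation (toy data).
toy_edges = [
    ("Q2685", "Q5"),
    ("Q5", "person"),
    ("person", "agent"),
    ("agent", "entity"),
    ("unrelated", "thing"),
]
closure = upward_closure(toy_edges, ["Q2685"])
print(sorted(closure))
```

Every class above a root is collected no matter how many levels up it sits, and nodes that no root leads to are left out.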

In [39]:
%%time
!$kypher -i claims \
--match '(n1)-[l {label:property}]->(n2)' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n1 as node1, "link" as label, n2 as node2' \
-o "$TEMP"/P31.P279.subgraph.tsv.gz

[2021-10-04 16:38:13 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_1_c1."node1" "_aLias.node1", ? "_aLias.label", graph_1_c1."node2" "_aLias.node2"
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label" = graph_1_c1."label"
        AND (graph_1_c1."label" IN (?, ?))
  PARAS: ['link', 'P31', 'P279']
---------------------------------------------
[2021-10-04 16:38:13 sqlstore]: CREATE INDEX on table graph_1 column label ...
[2021-10-04 17:04:56 sqlstore]: ANALYZE INDEX on table graph_1 column label ...
1402.219u 120.318s 51:52.18 48.9%	0+0k 458695856+86016408io 23pf+0w
CPU times: user 26.1 s, sys: 3.95 s, total: 30.1 s
Wall time: 51min 52s


#### Create the roots

Find roots in node1

> This step is including qualifier ids in node1, which makes reachable nodes have more roots than necessary. Would be nice to eliminate qualifiers here.

In [40]:
%%time
!$kypher -i rootbase \
--match '(n)-[]->()' \
--return 'distinct n as node1' \
-o "$TEMP"/root.node1.tsv.gz

[2021-10-04 17:30:08 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76

Find roots in node2

In [41]:
%%time
!$kypher -i rootbase -i datatypes \
--match ' \
    rootbase: ()-[l {label: property}]->(n), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item"]' \
--return 'distinct n as node1' \
-o "$TEMP"/root.node2.tsv.gz

[2021-10-04 17:30:09 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76

Combine the two files to create all the roots
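
The next cell does this with `cat` and `compact`. For reference, the same merge and deduplication on one-column files can be sketched in plain Python. The helpers `write_nodes` and `read_nodes` are hypothetical, and the `csv` module is used only for simplicity; it is not a full reader for the KGTK file format.

```python
import csv
import gzip
import os
import tempfile


def write_nodes(path, nodes):
    """Write a gzip-compressed, one-column TSV with a `node1` header."""
    with gzip.open(path, "wt", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["node1"])
        writer.writerows([node] for node in nodes)


def read_nodes(path):
    """Read the node1 column back, skipping the header row."""
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)
        return [row[0] for row in reader]


with tempfile.TemporaryDirectory() as tmp:
    node1_path = os.path.join(tmp, "root.node1.tsv.gz")
    node2_path = os.path.join(tmp, "root.node2.tsv.gz")
    merged_path = os.path.join(tmp, "root.nodes.tsv.gz")

    write_nodes(node1_path, ["Q2685", "Q5", "Q2685"])
    write_nodes(node2_path, ["Q5", "Q30"])

    # Union of both files; each node is kept exactly once.
    merged = sorted(set(read_nodes(node1_path)) | set(read_nodes(node2_path)))
    write_nodes(merged_path, merged)
    all_roots = read_nodes(merged_path)

print(all_roots)
```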

In [42]:
%%time
!$kgtk cat --mode=NONE -i "$TEMP"/root.node1.tsv.gz -i "$TEMP"/root.node2.tsv.gz \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.tsv.gz

!$kypher -i "$TEMP"/root.nodes.tsv.gz --as rootnode1 --limit 2

No header line in file
No header line in file
[2021-10-04 17:30:10 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/temp.build-tutorial/root.nodes.tsv.gz ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=ali

Circumvent a problem in `reachable-nodes` where it does not accept a root file with column header `node1`

In [43]:
%%time
!$kgtk rename-columns -i "$TEMP"/root.nodes.tsv.gz --output-columns id --mode=NONE \
/ compact -o "$TEMP"/root.roots.tsv.gz

[Errno 2] No such file or directory: '/data1/rogers/kgtk/tutorial/build-tutorial/temp.build-tutorial/root.nodes.tsv.gz'
No header line in file
CPU times: user 7.77 ms, sys: 3.55 ms, total: 11.3 ms
Wall time: 731 ms


Do a depth-first traversal of the P31/P279 graph using as roots all items in the Schwarzenegger graph

In [44]:
%%time
!$kgtk reachable-nodes \
    --rootfile "$TEMP"/root.roots.tsv.gz \
    --rootfilecolumn id \
    --prop link \
    --label "reachable" \
    --selflink \
    -i "$TEMP"/P31.P279.subgraph.tsv.gz \
    -o "$TEMP"/P31.P279.reachable.tsv.gz

KGTKException found

CPU times: user 7.67 ms, sys: 4.33 ms, total: 12 ms
Wall time: 1.02 s


Deduplicate the reachable nodes file

In [45]:
%%time
!$kgtk remove-columns -i "$TEMP"/P31.P279.reachable.tsv.gz --columns node1 label \
/ rename-columns --mode=NONE --output-columns node1 \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/P31.P279.reachable.dedup.tsv.gz

KGTKException found
[Errno 2] No such file or directory: '/data1/rogers/kgtk/tutorial/build-tutorial/temp.build-tutorial/P31.P279.reachable.tsv.gz'
No header line in file
No header line in file
CPU times: user 8.77 ms, sys: 908 µs, total: 9.68 ms
Wall time: 734 ms


Put all the reachable nodes in `rootnode1`

In [46]:
%%time
!$kgtk cat --mode=NONE \
-i "$TEMP"/root.nodes.tsv.gz \
-i "$TEMP"/P31.P279.reachable.dedup.tsv.gz \
/ compact --deduplicate --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.ontology.tsv.gz

!$kypher -i "$TEMP"/root.nodes.ontology.tsv.gz --as rootnode1 --limit 2

[Errno 2] No such file or directory: '/data1/rogers/kgtk/tutorial/build-tutorial/temp.build-tutorial/root.nodes.tsv.gz'
No header line in file
[2021-10-04 17:30:14 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/temp.build-tutorial/root.nodes.ontology.tsv.gz ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/hom

Extract all P31/P279 edges from Wikidata for all the nodes in the new graph and consolidate.

In [47]:
%%time
!$kypher -i claims -i rootnode1 \
--match ' \
    rootnode1: (n1), \
    claims: (n1)-[l {label:property}]->(n2) \
    ' \
--where 'property in ["P31", "P279"]' \
--return 'distinct n1 as node1, property as label, n2 as node2, l as id' \
/ cat -i - -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.gz \

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.gz --as rootbase --limit 2

[2021-10-04 17:30:15 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootnode1 ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 7

I am not certain about the need for this cell, whether new nodes can appear after adding P31 and P279.

In [48]:
%%time
!$kypher -i rootbase -i datatypes \
--match ' \
    rootbase: ()-[l {label: property}]->(n), \
    datatypes: (property)-[:datatype]->(datatype) \
    ' \
--where 'datatype in ["wikibase-item"]' \
--return 'distinct n as node1' \
/ cat -i - -i "$TEMP"/root.nodes.ontology.tsv.gz --mode=NONE \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.ontology.star.tsv.gz \

!$kypher -i "$TEMP"/root.nodes.ontology.star.tsv.gz --as rootnode1 --limit 2

[2021-10-04 17:30:16 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76

Include in node1 all the properties in the graph

In [49]:
%%time
!$kypher -i rootbase \
--match ' \
    rootbase: ()-[l {label: property}]->(n)' \
--return 'distinct property as node1' \
/ cat -i - -i "$TEMP"/root.nodes.ontology.star.tsv.gz --mode=NONE \
/ compact --mode=NONE --columns node1 \
-o "$TEMP"/root.nodes.ontology.star.property.tsv.gz \

!$kypher -i "$TEMP"/root.nodes.ontology.star.property.tsv.gz --as rootnode1 --limit 2

[2021-10-04 17:30:17 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76

## Add property datatypes

In [50]:
%%time
!$kypher -i datatypes -i rootbase \
--match ' \
    rootbase: ()-[r {label: property}]->(), \
    datatypes: (property)-[l:datatype]->(datatype) \
    ' \
--return 'distinct property as node1, l.label as label, datatype as node2, l as id' \
/ cat -i - -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.tsv.gz \
/ compact \
-o $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz

!$kypher -i $OUT/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz --as rootbase --limit 2

[2021-10-04 17:30:18 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76
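
The datatype join in the cell above can be pictured in plain Python. This is only a sketch under stated assumptions: edges are held as `(node1, label, node2, id)` tuples, the sample rows and their ids are illustrative, and `datatype_edges` is a hypothetical helper, not part of KGTK. It covers the `--match`/`--return` step only, not the `cat` and `compact` steps that follow it.

```python
def datatype_edges(rootbase, datatypes):
    """Return the datatype edges of every property used as an edge label in rootbase.

    Both arguments are iterables of (node1, label, node2, id) tuples.
    """
    # Properties that actually occur as edge labels in the root graph.
    used_properties = {label for _node1, label, _node2, _id in rootbase}
    # Keep only datatype edges whose subject is one of those properties;
    # the set removes duplicates, like `distinct` in the query.
    selected = {
        edge for edge in datatypes
        if edge[1] == "datatype" and edge[0] in used_properties
    }
    return sorted(selected)


# Illustrative sample rows (the edge ids are made up for this sketch).
sample_rootbase = [
    ("Q2685", "P31", "Q5", "e1"),
    ("Q2685", "P569", "^1947-07-30T00:00:00Z/11", "e2"),
]
sample_datatypes = [
    ("P31", "datatype", "wikibase-item", "P31-datatype"),
    ("P569", "datatype", "time", "P569-datatype"),
    ("P625", "datatype", "globe-coordinate", "P625-datatype"),
]

for edge in datatype_edges(sample_rootbase, sample_datatypes):
    print("\t".join(edge))
```

In this sample `P625` is dropped because no edge in the sample graph uses it as a label.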

## Build labels, aliases and descriptions

Extract the label edges

In [51]:
%%time
!$kypher -i label -i rootnode1 \
--match ' \
    rootnode1: (n1)-[]->(), \
    label: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.label.tsv.gz

[2021-10-04 17:30:20 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/label ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 761, 

Extract the alias edges

In [52]:
%%time
!$kypher -i alias -i rootnode1 \
--match ' \
    rootnode1: (n1)-[]->(), \
    alias: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.alias.tsv.gz

[2021-10-04 17:30:20 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/alias ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 761, 

Extract the description edges

In [53]:
%%time
!$kypher -i description -i rootnode1 \
--match ' \
    rootnode1: (n1)-[]->(), \
    description: (n1)-[l]->(n2) \
    ' \
--return 'distinct n1 as node1, l.label as label, n2 as node2, l as id' \
/ sort \
-o $OUT/root.graph.description.tsv.gz

[2021-10-04 17:30:21 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/description ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line
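
The label, alias and description cells above share one pattern: keep the edges of a source file whose `node1` appears in `rootnode1`, then sort them. A minimal Python sketch of that pattern follows. It assumes edges are `(node1, label, node2, id)` tuples, uses made-up node ids and edge ids, and sorts whole rows, which may not match the key that `kgtk sort` uses. `edges_for_nodes` is a hypothetical helper, not a KGTK function.

```python
def edges_for_nodes(source_edges, nodes):
    """Return the distinct edges in source_edges whose node1 is in nodes, sorted."""
    wanted = set(nodes)
    selected = {edge for edge in source_edges if edge[0] in wanted}
    return sorted(selected)


# Illustrative inputs: node ids and edge ids are placeholders.
sample_nodes = ["Q_a", "P_x"]
sample_labels = [
    ("Q_a", "label", "'thing a'@en", "Q_a-label-en"),
    ("Q_z", "label", "'thing z'@en", "Q_z-label-en"),
    ("P_x", "label", "'prop x'@en", "P_x-label-en"),
]

for edge in edges_for_nodes(sample_labels, sample_nodes):
    print("\t".join(edge))
```

The same function would apply unchanged to alias and description edges, since only the source file differs between the three cells.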

## Compute useful derived files

### Inverses of `P279`

> To do: need to define the `P279_` property, its datatype, label, etc.

In [54]:
!$kypher -i rootbase \
--match '(n1)-[:P279]->(class)' \
--return 'distinct class as node1, "P279_" as label, n1 as node2' \
/ add-id --id-style wikidata \
/ sort \
-o "$OUT"/root.derived.P279inv.tsv.gz

[2021-10-04 17:30:22 sqlstore]: IMPORT graph via csv.reader into table graph_8 from /data1/rogers/kgtk/tutorial/build-tutorial/rootbase ...
Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 758, in add_graph
    self.import_graph_data_via_import(table, file)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 816, in import_graph_data_via_import
    raise KGTKException('only implemented for existing, named files')
kgtk.exceptions.KGTKException: only implemented for existing, named files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/rogers/kgtk/github/kgtk/kgtk/cli/query.py", line 219, in run
    query = kyquery.KgtkQuery(inputs, store, loglevel=loglevel,
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/query.py", line 216, in __init__
    store.add_graph(file, alias=alias)
  File "/home/rogers/kgtk/github/kgtk/kgtk/kypher/sqlstore.py", line 76
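
The inversion performed by the query above is simple to state in Python: every `(n1, P279, class)` edge yields one `(class, P279_, n1)` edge. The sketch below assumes edges are `(node1, label, node2)` triples and uses placeholder node ids. It does not reproduce the ids generated by `add-id --id-style wikidata`, and `invert_p279` is a hypothetical helper.

```python
def invert_p279(edges):
    """Return the distinct inverse edges (class, 'P279_', subclass), sorted."""
    inverse = {
        (node2, "P279_", node1)
        for node1, label, node2 in edges
        if label == "P279"
    }
    return sorted(inverse)


# Illustrative edges with placeholder ids; the repeated row shows deduplication.
sample_edges = [
    ("Q_a", "P279", "Q_top"),
    ("Q_b", "P279", "Q_top"),
    ("Q_a", "P31", "Q_other"),
    ("Q_a", "P279", "Q_top"),
]

for edge in invert_p279(sample_edges):
    print("\t".join(edge))
```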

## Final files
- base, includes all edges except labels, aliases and descriptions
- labels
- aliases
- descriptions

In [55]:
%%time
!$kgtk cat \
-i "$OUT"/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz \
-i "$OUT"/root.graph.alias.tsv.gz \
-i "$OUT"/root.graph.label.tsv.gz \
-i "$OUT"/root.graph.description.tsv.gz \
-o "$OUT"/all.tsv.gz

[Errno 2] No such file or directory: '/data1/rogers/kgtk/tutorial/build-tutorial/root.graph.item.quantity.time.monolingual.string.property.qualifiers.P31.P279.ontology.datatype.tsv.gz'
CPU times: user 4.83 ms, sys: 3.54 ms, total: 8.38 ms
Wall time: 428 ms
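
The `[Errno 2]` message above means the first input to `kgtk cat` was never written, because the earlier cells failed. A small check run before the concatenation makes that visible up front. The sketch below is an assumption about how such a check could look: `missing_inputs` is a hypothetical helper, and it falls back to the current directory when the `OUT` environment variable is not set.

```python
import os


def missing_inputs(paths):
    """Return the paths that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]


# File names are the four inputs of the `kgtk cat` cell above.
out_dir = os.environ.get("OUT", ".")
final_inputs = [
    os.path.join(out_dir, name)
    for name in (
        "root.graph.item.quantity.time.monolingual.string.property"
        ".qualifiers.P31.P279.ontology.datatype.tsv.gz",
        "root.graph.alias.tsv.gz",
        "root.graph.label.tsv.gz",
        "root.graph.description.tsv.gz",
    )
]

missing = missing_inputs(final_inputs)
if missing:
    print("Missing inputs; rerun the cells that produce them:")
    for path in missing:
        print("  " + path)
else:
    print("All inputs present; safe to run kgtk cat.")
```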
