# Generating Subsets of Wikidata

### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

UPDATE EXAMPLE INVOCATION


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no 
```

In [3]:
# Parameters

# Folder on local machine where to create the output and temporary folders
output_path = "/Users/pedroszekely/Downloads/kypher"

# The names of the output and temporary folders
output_folder = "wikidata_os_v1"
temp_folder = "temp.wikidata_os_v1"

# Classes to remove
remove_classes = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054"

# The location of input files
wiki_root_folder = "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/"
claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
isa_file = "derived.isa.tsv.gz"
p279star_file = "derived.P279star.tsv.gz"

# Location of the cache database for kypher
cache_path = "/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4"

# Whether to delete the cache database
delete_database = False

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/Users/pedroszekely/Documents/GitHub/kgtk/examples/"

In [4]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

import papermill as pm

## Set up environment and folders to store the files

- `OUT` folder where the output files go
- `TEMP` folder to keep temporary files, including the database
- `kgtk` shortcut to invoke the kgtk software
- `kypher` shortcut to invoke `kgtk query` with the cache database
- `EDGES` the `all.tsv` file of wikidata that contains all edges except label/alias/description
- `LABELS` the file with the English labels
- `ITEMS` the wikibase-item file (currently does not include edges whose `node1` is a property, so for now we need the net file)
- `PROPERTY_ITEMS` the items that are properties
- `STORE` location of the cache file

In [5]:
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)
os.environ['OUT'] = "{}/{}".format(output_path, output_folder)
os.environ['TEMP'] = "{}/{}".format(output_path, temp_folder)
os.environ['kgtk'] = "kgtk"
# os.environ['kgtk'] = "time kgtk --debug"
os.environ['kypher'] = "time kgtk --debug query --graph-cache " + os.environ['STORE']
os.environ['column'] = "column -t -s $'\t'" 
os.environ['CLAIMS'] = wiki_root_folder + claims_file
os.environ['LABELS'] = wiki_root_folder + label_file
os.environ['ALIASES'] = wiki_root_folder + alias_file
os.environ['DESCRIPTIONS'] = wiki_root_folder + description_file
os.environ['ITEMS'] = wiki_root_folder + item_file
os.environ['QUALS'] = wiki_root_folder + qual_file
os.environ['DATATYPES'] = wiki_root_folder + property_datatypes_file
os.environ['ISA'] = wiki_root_folder + isa_file
os.environ['P279star'] = wiki_root_folder + p279star_file

Echo the variables to see if they are all set correctly

In [6]:
!echo $OUT
!echo $TEMP
!echo $kgtk
!echo $kypher
!echo $CLAIMS
!echo $LABELS
!echo $ALIASES
!echo $DESCRIPTIONS
!echo $ITEMS
!echo $QUALS
!echo $DATATYPES
!echo $ISA
!echo $P279star
!echo $STORE
!alias col="column -t -s $'\t' "

/Users/pedroszekely/Downloads/kypher/wikidata_os_v1
/Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v1
kgtk
time kgtk --debug query --graph-cache /Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4/wikidata.sqlite3.db
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/labels.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/aliases.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/descriptions.en.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.wikibase-item.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/qualifiers.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/metadata.property.datatypes.tsv.gz
/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/derived.isa.tsv.gz
/Volumes/GoogleDrive/Shared d

Go to the output directory and create the subfolders for the output files and the temporary files

In [7]:
cd $output_path

/Users/pedroszekely/Downloads/kypher


In [6]:
!mkdir $OUT
!mkdir $TEMP

mkdir: /Users/pedroszekely/Downloads/kypher/wikidata_os_v1: File exists
mkdir: /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v1: File exists


Clean up the output and temp folders before we start

In [7]:
# !rm $OUT/*.tsv $OUT/*.tsv.gz
# !rm $TEMP/*.tsv $TEMP/*.tsv.gz

Set `delete_database` to remove the SQLite database. It takes a long time to load all the data and create indices, so don't remove the database unless you have changed files that were already loaded and need to force a reload.

In [8]:
if delete_database and delete_database != "no":
    print("Deleted database")
    !rm $STORE
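
The batch invocation passes `-p delete_database no`, so the parameter can arrive as the string `no` instead of a boolean, which is why the cell above tests both conditions. A minimal sketch of a more explicit normalization follows. The helper name `should_delete_database` and the accepted spellings are assumptions for illustration, not part of the notebook. Unlike the cell above, this sketch treats any unrecognized string (for example `"false"`) as "do not delete".

```python
def should_delete_database(value):
    """Return True only when the parameter clearly asks for deletion.

    Hypothetical helper: strings are compared case-insensitively against a
    small allow-list; every other type falls back to its truthiness.
    """
    if isinstance(value, str):
        return value.strip().lower() in {"yes", "true", "1"}
    return bool(value)


# The string "no" from the batch invocation and the notebook default False
# both leave the database in place.
print(should_delete_database("no"), should_delete_database(False))
```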

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [9]:
!$kypher -i "$CLAIMS" --limit 10 | col

[2020-11-15 17:40:13 sqlstore]: IMPORT graph directly into table graph_1 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz ...
[2020-11-15 19:44:15 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
     7442.72 real     12090.35 user       326.27 sys
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q15075950-7eff6d65-0	P1

## Creating a list of all the items we want to remove

### Compute the items to be removed

First look at the classes we will remove

In [10]:
cmd = "wd u {}".format(" ".join(remove_classes.split(",")))
!{cmd}

id Q13442814
Label scholarly article
Description article in an academic publication, usually peer reviewed
subclass of (P279): scholarly publication (Q591041) | article (Q191067) | scholarly work (Q55915575)

id Q523
Label star
Description astronomical object consisting of a luminous spheroid of plasma held together by its own gravity
instance of (P31): astronomical object type (Q17444909)
subclass of (P279): astronomical object (Q6999) | fusor (Q1027098)

id Q16521
Label taxon
Description group of one or more organism(s), which a taxonomist adjudges to be a unit
instance of (P31): first-order metaclass (Q24017414)
subclass of (P279): 

Compose the kypher command to remove the classes

In [11]:
!zcat < "$ISA" | head | col

zcat: node1	label	node2
error writing to outputP10	isa	Q18610173
: Broken pipe
P1000	isa	Q18608871
P1001	isa	Q15720608
P1001	isa	Q22984026
P1001	isa	Q22997934
P1001	isa	Q61719275
P1002	isa	Q22963600
P1003	isa	Q19595382
P1003	isa	Q19833377


In [12]:
classes = map(lambda x: '"{}"'.format(x), remove_classes.replace(" ", "").split(","))
remove_command = "$kypher -i \"$ISA\" -i \"$P279star\" -o $TEMP/items.remove.tsv.gz \
--match 'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)' \
--where 'class in [CLASSES]' \
--return 'distinct n1, \"p31_p279star\" as label, class as node2' ".replace("CLASSES", ", ".join(list(classes)))
remove_command

'$kypher -i "$ISA" -i "$P279star" -o $TEMP/items.remove.tsv.gz --match \'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)\' --where \'class in ["Q13442814", "Q523", "Q16521", "Q318", "Q7318358", "Q7187", "Q11173", "Q8054"]\' --return \'distinct n1, "p31_p279star" as label, class as node2\' '
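
The `--where` clause needs the class ids as a comma-separated list of double-quoted strings. The cell above builds it with `replace(" ", "")` and `map`. A minimal sketch of the same step as a standalone function is shown here. The name `quoted_class_list` is hypothetical, and skipping empty entries (for example after a trailing comma) is an added assumption.

```python
def quoted_class_list(remove_classes):
    """Turn 'Q1, Q2' into '"Q1", "Q2"' for use inside a kypher --where clause."""
    # Split on commas, trim whitespace, and drop empty entries.
    ids = [c.strip() for c in remove_classes.split(",") if c.strip()]
    return ", ".join('"{}"'.format(c) for c in ids)


# Two of the class ids from the parameters cell.
print(quoted_class_list("Q13442814, Q523"))
```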

Run the command. The items to remove will be written to the file `$TEMP/items.remove.tsv.gz`.

In [13]:
!{remove_command}

[2020-11-15 19:44:23 sqlstore]: IMPORT graph directly into table graph_2 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/derived.isa.tsv.gz ...
[2020-11-15 19:46:56 sqlstore]: IMPORT graph directly into table graph_3 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/derived.P279star.tsv.gz ...
[2020-11-15 19:53:40 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_2_c1."node1", ? "label", graph_3_c2."node2" "node2"
     FROM graph_2 AS graph_2_c1, graph_3 AS graph_3_c2
     WHERE graph_2_c1."label"=?
     AND graph_2_c1."node2"=graph_3_c2."node1"
     AND (graph_3_c2."node2" IN (?, ?, ?, ?, ?, ?, ?, ?))
  PARAS: ['p31_p279star', 'isa', 'Q13442814', 'Q523', 'Q16521', 'Q318', 'Q7318358', 'Q7187', 'Q11173', 'Q8054']
---------------------------------------------
[2020-11-15 19:53:40 sqlstore]: CREATE INDEX on table graph_3 column node1 ...
[2020-11-15 19:54:53 sqlstore]: ANALYZE INDEX on table g

Preview the file

In [14]:
!zcat < $TEMP/items.remove.tsv.gz | head | col

zcat: node1	label	node2
Q1000017	p31_p279star	Q16521
Q1000126	p31_p279star	Q16521
Q1000261	p31_p279star	Q16521
Q1000262	p31_p279star	Q16521
Q1000266	p31_p279star	Q16521
error writing to outputQ1000270	p31_p279star	Q16521
: Q1000274	p31_p279star	Q16521
Broken pipe
Q1000278	p31_p279star	Q16521
Q1000280	p31_p279star	Q16521


Collect all the classes of items we will remove, just as a sanity check

In [15]:
!$kypher -i $TEMP/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

[2020-11-15 20:15:19 sqlstore]: IMPORT graph directly into table graph_4 from /Users/pedroszekely/Downloads/kypher/temp.wikidata_os_v1/items.remove.tsv.gz ...
[2020-11-15 20:17:30 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_4_c1."node2"
     FROM graph_4 AS graph_4_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
node2
Q16521
Q11173
Q523
Q13442814
Q7318358
Q318
Q7187
Q8054
      138.86 real       227.93 user         6.12 sys
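
The same sanity check can be expressed as a set comparison: every `node2` in the removal file should be one of the classes listed in `remove_classes`. This is a minimal in-memory sketch. The function name `unexpected_classes` and the sample rows are assumptions for illustration, and it does not read the actual file.

```python
def unexpected_classes(rows, remove_classes):
    """Return node2 values that are not among the classes we asked to remove."""
    expected = {c.strip() for c in remove_classes.split(",") if c.strip()}
    found = {row["node2"] for row in rows}
    return found - expected


# Rows shaped like the preview of items.remove.tsv.gz (node1, label, node2).
sample_rows = [
    {"node1": "Q1000017", "label": "p31_p279star", "node2": "Q16521"},
    {"node1": "Q1000126", "label": "p31_p279star", "node2": "Q16521"},
]
print(unexpected_classes(sample_rows, "Q13442814, Q523, Q16521"))
```

An empty set means the removal list only mentions the requested classes.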


## Create the reduced edges file

### Remove the items from the all.tsv and the label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. The items must be removed from both the `node1` and `node2` positions, so we need to run the `ifnotexists` command twice.
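
The two-pass idea can be sketched in memory as follows. This only illustrates the logic and is not how `kgtk ifnotexists` is implemented; the function name `drop_edges_touching` and the toy edges are hypothetical.

```python
def drop_edges_touching(edges, removed_items):
    """Drop every edge whose node1 or node2 is in removed_items."""
    removed = set(removed_items)
    # Pass 1: remove edges whose subject (node1) is an unwanted item.
    pass1 = [e for e in edges if e["node1"] not in removed]
    # Pass 2: from what is left, remove edges whose object (node2) is unwanted.
    return [e for e in pass1 if e["node2"] not in removed]


# Toy edges with made-up ids; "Qx" plays the role of an item to remove.
toy_edges = [
    {"id": "e1", "node1": "Qa", "label": "P1", "node2": "Qb"},
    {"id": "e2", "node1": "Qx", "label": "P1", "node2": "Qb"},
    {"id": "e3", "node1": "Qa", "label": "P1", "node2": "Qx"},
]
print([e["id"] for e in drop_edges_touching(toy_edges, ["Qx"])])
```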

Before we start, preview the files to see the column headings and check whether they look sorted.

In [16]:
!$kgtk sort2 -i $TEMP/items.remove.tsv.gz -o $TEMP/items.remove.sorted.tsv.gz

In [17]:
!zcat < $TEMP/items.remove.sorted.tsv.gz | head | col

node1	label	node2
zcat: Q1000017	p31_p279star	Q16521
Q1000126	p31_p279star	Q16521
Q1000261	p31_p279star	Q16521
Q1000262	p31_p279star	Q16521
Q1000266	p31_p279star	Q16521
error writing to outputQ1000270	p31_p279star	Q16521
: Broken pipe
Q1000274	p31_p279star	Q16521
Q1000278	p31_p279star	Q16521
Q1000280	p31_p279star	Q16521


In [18]:
!zcat < "$CLAIMS" | head -5 | col

id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
zcat: P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
error writing to output: Broken pipe
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property


Remove from the full set of edges those edges that have a `node1` present in `items.remove.sorted.tsv`

In [19]:
!$kgtk ifnotexists -i "$CLAIMS" -o $TEMP/item.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 

From the remaining edges, remove those that have a `node2` present in `items.remove.sorted.tsv`

In [20]:
!$kgtk sort2 -i $TEMP/item.edges.reduced.tsv.gz -o $TEMP/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id

In [21]:
!$kgtk ifnotexists -i $TEMP/item.edges.reduced.sorted.tsv.gz -o $TEMP/item.edges.reduced.2.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 

Create a file with the labels

In [22]:
!$kgtk ifnotexists -i "$LABELS" -o $TEMP/label.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

Create a file with the aliases

In [23]:
!$kgtk ifnotexists -i "$ALIASES" -o $TEMP/alias.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

Create a file with the descriptions

In [24]:
!$kgtk ifnotexists -i "$DESCRIPTIONS" -o $TEMP/description.edges.reduced.tsv.gz \
--filter-on $TEMP/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

### Produce the output files for claims, labels, aliases and descriptions

In [25]:
!$kgtk sort2 -i $TEMP/item.edges.reduced.2.tsv.gz -o $OUT/claims.tsv.gz 

In [26]:
!$kgtk sort2 -i $TEMP/label.edges.reduced.tsv.gz -o $OUT/labels.en.tsv.gz 

In [27]:
!$kgtk sort2 -i $TEMP/alias.edges.reduced.tsv.gz -o $OUT/aliases.en.tsv.gz 

In [28]:
!$kgtk sort2 -i $TEMP/description.edges.reduced.tsv.gz -o $OUT/descriptions.en.tsv.gz 

Sanity checks to see if it looks reasonable

## Create the reduced qualifiers file
We do this by finding all the ids of the reduced edges file, and then selecting out from `qual.tsv`

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `$QUALS` 
- `$OUT/claims.tsv.gz` 

In [29]:
!zcat < "$QUALS" | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                                node1                           label  node2                                                                    node2;wikidatatype
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0       P10-P1855-Q15075950-7eff6d65-0  P10    "Smoorverliefd 12 september.webm"                                        commonsMedia
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0    P10-P1855-Q15075950-7eff6d65-0  P3831  Q622550                                                                  wikibase-item
P10-P1855-Q4504-a69d2c73-0-P10-bef003-0           P10-P1855-Q4504-a69d2c73-0      P10    "Komodo dragons video.ogv"                                               commonsMedia
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0       P10-P1855-Q69063653-c8cdb04c-0  P10    "Couch Commander.webm"                                                   commonsMedia
P10-P1855-Q7378-555592a4-0-P10-8a982d-0           P10-P1855-Q7378-555592a4-

Run `ifexists` to select the qualifiers for the edges that remain in `$OUT/claims.tsv.gz`, writing the result to `$OUT/qualifiers.tsv.gz`. Note that we use `node1` in the qualifier file, matching it to `id` in the claims file.

In [30]:
!$kgtk ifexists -i "$QUALS" -o $OUT/qualifiers.tsv.gz \
--filter-on $OUT/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted
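
The selection above is a semi-join: a qualifier is kept when its `node1` equals the `id` of a surviving claim. Here is a minimal in-memory sketch of that logic; the function name `keep_qualifiers` and the toy rows are assumptions, and the real command works on the sorted files rather than on Python lists.

```python
def keep_qualifiers(qualifiers, claims):
    """Keep qualifier edges whose node1 is the id of a claim we kept."""
    claim_ids = {c["id"] for c in claims}
    return [q for q in qualifiers if q["node1"] in claim_ids]


# Toy rows with made-up ids: claim "c2" was removed, so its qualifier goes too.
toy_claims = [{"id": "c1", "node1": "Qa", "label": "P1", "node2": "Qb"}]
toy_quals = [
    {"id": "c1-q1", "node1": "c1", "label": "P2", "node2": "Qc"},
    {"id": "c2-q1", "node1": "c2", "label": "P2", "node2": "Qd"},
]
print([q["id"] for q in keep_qualifiers(toy_quals, toy_claims)])
```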

Look at the final output for qualifiers

In [31]:
!zcat < $OUT/qualifiers.tsv.gz | head | col

zcat: error writing to output: Broken pipe
id	node1	label	node2	node2;wikidatatype
P10-P1855-Q15075950-7eff6d65-0-P10-54b214-0	P10-P1855-Q15075950-7eff6d65-0	P10	"Smoorverliefd 12 september.webm"	commonsMedia
P10-P1855-Q15075950-7eff6d65-0-P3831-Q622550-0	P10-P1855-Q15075950-7eff6d65-0	P3831	Q622550 wikibase-item
P10-P1855-Q69063653-c8cdb04c-0-P10-6fb08f-0	P10-P1855-Q69063653-c8cdb04c-0	P10	"Couch Commander.webm"	commonsMedia
P10-P1855-Q7378-555592a4-0-P10-8a982d-0 P10-P1855-Q7378-555592a4-0	P10	"Elephants Dream (2006).webm"	commonsMedia
P10-P2302-Q21502404-d012aef4-0-P1793-f4c2ed-0	P10-P2302-Q21502404-d012aef4-0	P1793	"(?i).+\\.(webm\|ogv\|ogg\|gif)"	string
P10-P2302-Q21502404-d012aef4-0-P2316-Q21502408-0	P10-P2302-Q21502404-d012aef4-0	P2316	Q21502408	wikibase-item
P10-P2302-Q21502404-d012aef4-0-P2916-cb0917-0	P10-P2302-Q21502404-d012aef4-0	P2916	'filename with extension: webm, ogg, ogv, or gif (case insensitive)'@en monolingualtext
P10-P2302-Q21510851-5224fe0b-0-P2306-P175-0	P10-P230

## Sanity checks

In [32]:
!$kypher -i $OUT/claims.tsv.gz \
--match '(n1:Q368441)-[l]->(n2)' \
--limit 10 \
| col

[2020-11-16 08:43:11 sqlstore]: IMPORT graph directly into table graph_5 from /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/claims.tsv.gz ...
[2020-11-16 09:11:20 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."node1"=?
     LIMIT ?
  PARAS: ['Q368441', 10]
---------------------------------------------
[2020-11-16 09:11:20 sqlstore]: CREATE INDEX on table graph_5 column node1 ...
[2020-11-16 09:15:50 sqlstore]: ANALYZE INDEX on table graph_5 column node1 ...
     1983.59 real      2792.43 user       148.56 sys
id	node1	label	node2	rank	node2;wikidatatype
Q368441-P106-Q937857-ba9afa6b-0 Q368441 P106	Q937857 normal	wikibase-item
Q368441-P109-358e4e-63970f77-0	Q368441 P109	"James Rodriguez Signature.svg" normal	commonsMedia
Q368441-P118-Q82595-62cd72d9-0	Q368441 P118	Q82595	normal	wikibase-item
Q368441-P1344-Q170645-3f2d9c6a-0	Q368441 P1344	Q170645 normal	wikibase-item
Q368441-P1344-Q4630358-8e2

In [33]:
!$kypher -i $OUT/claims.tsv.gz \
--match '(n1:P131)-[l]->(n2)' \
--limit 10 \
| col

[2020-11-16 09:16:15 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_5 AS graph_5_c1
     WHERE graph_5_c1."node1"=?
     LIMIT ?
  PARAS: ['P131', 10]
---------------------------------------------
        1.08 real         0.61 user         0.19 sys
id	node1	label	node2	rank	node2;wikidatatype
P131-P1628-951146-4681d72b-0	P131	P1628	"http://dati.beniculturali.it/cis/GovernamentalAdministrativeArea"	normal	url
P131-P1629-Q56061-0d5b0586-0	P131	P1629	Q56061	normal	wikibase-item
P131-P1647-P276-5cc63556-0	P131	P1647	P276	normal	wikibase-property
P131-P1647-P361-257a2660-0	P131	P1647	P361	normal	wikibase-property
P131-P1659-P1001-f0f7e26a-0	P131	P1659	P1001	normal	wikibase-property
P131-P1659-P1383-3ebd92d5-0	P131	P1659	P1383	normal	wikibase-property
P131-P1659-P150-d414f410-0	P131	P1659	P150	normal	wikibase-property
P131-P1659-P159-e71dc93e-0	P131	P1659	P159	normal	wikibase-property
P131-P1659-P206-7eb31568-0	P131	P1659	P206	normal	wikiba

## Compute the derived files using the `Wikidata Useful Files` Jupyter notebook

Compute `claims.wikibase-item.tsv.gz` which would be computed by the Wikidata partitioner, but we are not using it here yet

In [34]:
!zcat < "$DATATYPES" | head | col

id	node1	label	node2
P10-datatype	P10	datatype	commonsMedia
P1000-datatype	P1000	datatype	wikibase-item
P1001-datatype	P1001	datatype	wikibase-item
P1002-datatype	P1002	datatype	wikibase-item
P1003-datatype	P1003	datatype	external-id
P1004-datatype	P1004	datatype	external-id
zcat: P1005-datatype	P1005	datatype	external-id
P1006-datatype	P1006	datatype	external-id
P1007-datatype	P1007	datatype	external-id
error writing to output: Broken pipe


In [8]:
!$kypher -i $OUT/claims.tsv.gz -i "$DATATYPES" -o $OUT/claims.wikibase-item.tsv.gz \
--match 'claims: (n1)-[l {label: p}]->(n2), datatypes: (p)-[:datatype]->(:`wikibase-item`)' \
--return 'l as id, n1 as node1, p as label, n2 as node2' \
--order-by 'l' 

[2020-11-16 09:38:24 query]: SQL Translation:
---------------------------------------------
  SELECT graph_5_c1."id" "id", graph_5_c1."node1" "node1", graph_5_c1."label" "label", graph_5_c1."node2" "node2"
     FROM graph_5 AS graph_5_c1, graph_6 AS graph_6_c2
     WHERE graph_5_c1."label"=graph_5_c1."label"
     AND graph_6_c2."label"=?
     AND graph_6_c2."node2"=?
     AND graph_5_c1."label"=graph_6_c2."node1"
     ORDER BY graph_5_c1."id" ASC
  PARAS: ['datatype', 'wikibase-item']
---------------------------------------------
     4427.33 real      1670.34 user       763.58 sys
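
In plain terms, the query keeps each claim whose property has datatype `wikibase-item`, returns only the `id`, `node1`, `label` and `node2` columns, and orders the result by `id`. A minimal in-memory sketch of the same selection follows. The function name `wikibase_item_claims` and the toy rows are hypothetical, and it makes no claim about how kypher executes the query.

```python
def wikibase_item_claims(claims, datatypes):
    """Select claims whose property is declared with datatype wikibase-item."""
    # Properties whose datatype edge points at 'wikibase-item'.
    item_props = {
        d["node1"]
        for d in datatypes
        if d["label"] == "datatype" and d["node2"] == "wikibase-item"
    }
    selected = [
        {"id": c["id"], "node1": c["node1"], "label": c["label"], "node2": c["node2"]}
        for c in claims
        if c["label"] in item_props
    ]
    # Mirror the --order-by on the edge id.
    return sorted(selected, key=lambda c: c["id"])


# Toy rows with made-up property ids.
toy_claims = [
    {"id": "e2", "node1": "Qa", "label": "Pitem", "node2": "Qb", "rank": "normal"},
    {"id": "e1", "node1": "Qa", "label": "Pitem", "node2": "Qc", "rank": "normal"},
    {"id": "e3", "node1": "Qa", "label": "Pstr", "node2": '"text"', "rank": "normal"},
]
toy_datatypes = [
    {"node1": "Pitem", "label": "datatype", "node2": "wikibase-item"},
    {"node1": "Pstr", "label": "datatype", "node2": "string"},
]
print([c["id"] for c in wikibase_item_claims(toy_claims, toy_datatypes)])
```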


To compute the derived files we use papermill to run the `Wikidata Useful Files` notebook.

In [9]:
pm.execute_notebook(
    notebooks_folder + useful_files_notebook,
    os.environ["TEMP"] + "/useful_files_notebook_output.ipynb",
    parameters=dict(
        output_path="/Users/pedroszekely/Downloads/kypher",
        output_folder="wikidata_os_v1",
        temp_folder="temp.wikidata_os_v1",
        wiki_root_folder="/Users/pedroszekely/Downloads/kypher/wikidata_os_v1/",
        claims_file="claims.tsv.gz",
        label_file="labels.en.tsv.gz",
        alias_file="aliases.en.tsv.gz",
        description_file="descriptions.en.tsv.gz",
        item_file="claims.wikibase-item.tsv.gz",
        cache_path="/Users/pedroszekely/Downloads/kypher/temp.useful_wikidata_files_v4",
        delete_database=False,
        compute_pagerank=False
    )
)

HBox(children=(FloatProgress(value=0.0, description='Executing', max=90.0, style=ProgressStyle(description_wid…




{'cells': [{'cell_type': 'markdown',
   'metadata': {'tags': [],
    'papermill': {'exception': False,
     'start_time': '2020-11-16T18:52:13.399368',
     'end_time': '2020-11-16T18:52:13.567905',
     'duration': 0.168537,
     'status': 'completed'}},
   'source': '# Generating Useful Wikidata Files\n\nThis notebook generates files that contain derived data that is useful in many applications. The input to the notebook is the full Wikidata or a subset of Wikidata. It also works for arbutrary KGs as long as they follow the representation requirements of Wikidata:\n\n- the *instance of* relation is represented using the `P31` property\n- the *subclass of* relation is represented using the `P279` property\n- all properties declare a datatype, and the data types must be one of the datatypes in Wikidata.\n\nInputs:\n\n- `claims_file`: contains all statements, which consist of edges `node1/label/node2` where `label` is a property in Wikidata (e.g., sitelinks, labels, aliases and descript

Look at the columns so we know how to construct the kypher query

## Summary of results

In [10]:
!ls -lh $OUT/*wikidataos.*

ls: /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/*wikidataos.*: No such file or directory


In [11]:
!zcat < $OUT/wikidataos.all.tsv.gz | wc

/bin/bash: /Users/pedroszekely/Downloads/kypher/wikidata_os_v1/wikidataos.all.tsv.gz: No such file or directory
       0       0       0


## Verification

The edges file must contain edges for properties; this was not the case on 2020-11-10.


In [13]:
!$kypher -i "$CLAIMS" \
--match '(:P10)-[l]->(n2)' \
--limit 10

[2020-11-16 13:26:55 sqlstore]: DROP graph data table graph_1 from /Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidata-20200803-v4/claims.tsv.gz
Traceback (most recent call last):
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/cli/query.py", line 148, in run
    index=options.get('index'))
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/query.py", line 180, in __init__
    store.add_graph(file)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 560, in add_graph
    self.drop_graph(file_info.graph)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.4.0-py3.7.egg/kgtk/kypher/sqlstore.py", line 586, in drop_graph
    self.execute('DROP TABLE %s' % table_name)
  File "/Users/pedroszekely/opt/anaconda3/envs/kgtk/lib/python3.7/site-packages/kgtk-0.