## Step 6: partition the files to follow the conventions KGTK uses for Wikidata

In [1]:
import sys  
sys.path.insert(0, 'tutorial')
from tutorial_setup import *
from generate_report import run

ALIAS: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/aliases.en.tsv.gz"
ALL: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/all.tsv.gz"
CLAIMS: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.tsv.gz"
DESCRIPTION: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/descriptions.en.tsv.gz"
EXAMPLES_DIR: "/Users/amandeep/Github/kgtk/examples"
GE: "/Users/amandeep/Documents/kypher_2/temp.wikidata_os_v5/graph-embedding"
ISA: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.isa.tsv.gz"
ITEM: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/claims.wikibase-item.tsv.gz"
LABEL: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/labels.en.tsv.gz"
OUT: "/Users/amandeep/Documents/kypher_2/wikidata_os_v5"
P279: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.P279.tsv.gz"
P279STAR: "/Volumes/GoogleDrive/Shared drives/KGTK/datasets/wikidataos-v4/derived.P279

In [2]:
%cd {output_path}

/Users/amandeep/Documents/kypher_2


We'll use the partition-wikidata notebook to complete this step. The notebook expects a single input file that contains all edges and qualifiers. We also need to specify a directory where the partitioned files will be written and a directory for temporary files. The latter must be different from our own temp directory, because the partition notebook deletes any existing files in that folder.
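The directory requirements above can be sketched in Python. This is a minimal illustration, not part of the tutorial setup: the helper name `prepare_partition_dirs` and its arguments are hypothetical, and the demo runs against a throwaway directory rather than `$OUT`.

```python
import tempfile
from pathlib import Path


def prepare_partition_dirs(out_root, shared_temp):
    """Create <out_root>/parts and <out_root>/parts/temp (hypothetical helper)."""
    parts_dir = Path(out_root) / "parts"
    partition_temp = parts_dir / "temp"
    # The partition notebook clears its temporary folder, so it must not be
    # the directory that already holds our other intermediate files.
    if partition_temp.resolve() == Path(shared_temp).resolve():
        raise ValueError("partition temp folder must differ from the shared temp folder")
    partition_temp.mkdir(parents=True, exist_ok=True)
    return parts_dir, partition_temp


# Demo on a throwaway directory instead of the real output folder
demo_root = Path(tempfile.mkdtemp())
parts_dir, partition_temp = prepare_partition_dirs(demo_root / "out", demo_root / "temp")
print(parts_dir.name, partition_temp.name)
```

Keeping the check in one place makes it harder to point the partition notebook at a folder whose contents we still need.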

In [3]:
!mkdir -p $OUT/parts

In [5]:
!$kgtk cat -i $OUT/all.tsv.gz -i $OUT/Q154.qualifiers.tsv.gz -o $TEMP/all_and_qualifiers.tsv.gz

        7.68 real         7.41 user         0.19 sys
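To make the concatenation step concrete, here is a rough pure-Python sketch of joining gzipped TSV files. It is not how `kgtk cat` is implemented: it assumes every input has exactly the same header line, and the function name `concat_gz_tsv` and the demo rows are made up for illustration.

```python
import gzip
import os
import tempfile


def concat_gz_tsv(inputs, output):
    """Concatenate gzipped TSV files that share an identical header line."""
    header = None
    with gzip.open(output, "wt", encoding="utf-8") as out:
        for path in inputs:
            with gzip.open(path, "rt", encoding="utf-8") as f:
                first = f.readline().rstrip("\n")
                if header is None:
                    # Write the header once, taken from the first input
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError(f"header mismatch in {path}")
                for line in f:
                    # Guard against a missing newline on a file's last line
                    out.write(line if line.endswith("\n") else line + "\n")


# Demo with two tiny made-up files (not taken from the real dataset)
demo_dir = tempfile.mkdtemp()
edges = os.path.join(demo_dir, "edges.tsv.gz")
quals = os.path.join(demo_dir, "qualifiers.tsv.gz")
combined = os.path.join(demo_dir, "combined.tsv.gz")
with gzip.open(edges, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\nE1\tQ1\tP31\tQ2\n")
with gzip.open(quals, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\nE1-q\tE1\tP585\tQ3")
concat_gz_tsv([edges, quals], combined)
with gzip.open(combined, "rt", encoding="utf-8") as f:
    combined_lines = f.read().splitlines()
print(len(combined_lines))
```

For real KGTK files, `kgtk cat` remains the safer choice, since this sketch simply refuses inputs whose headers differ.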


In [6]:
!zcat < $TEMP/all_and_qualifiers.tsv.gz | head

id	node1	label	node2
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508
P10-P1659-P1651-c4068028-0	P10	P1659	P1651
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238
P10-P1659-P51-86aca4c5-0	P10	P1659	P51
P10-P1855-Q15075950-7eff6d65-0	P10	P1855	Q15075950
P10-P1855-Q69063653-c8cdb04c-0	P10	P1855	Q69063653
zcat: error writing to output: Broken pipe
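The `Broken pipe` message is harmless: `head` exits after ten lines while `zcat` is still writing. If the message is distracting, the same peek can be done from Python. The sketch below uses only the standard library; the helper name `head_gz` is hypothetical, and the demo file is a tiny stand-in built from rows shown in the output above.

```python
import gzip
import os
import tempfile
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        # islice stops reading after n lines, so the rest is never decompressed
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a small stand-in file rather than the real all_and_qualifiers.tsv.gz
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\n")
    f.write("P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\n")
    f.write("P10-P1659-P18-5e4b9c4f-0\tP10\tP1659\tP18\n")

for line in head_gz(demo_path, n=2):
    print(line)
```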


In [7]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["TEMP"] + "/all_and_qualifiers.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
)
;

Executing:   0%|          | 0/49 [00:00<?, ?cell/s]

''

The partition-wikidata notebook created the following partitioned KGTK files:

In [8]:
!ls $OUT/parts

aliases.en.tsv.gz                   metadata.property.datatypes.tsv.gz
aliases.tsv.gz                      metadata.types.tsv.gz
all.tsv.gz                          qualifiers.commonsMedia.tsv.gz
claims.commonsMedia.tsv.gz          qualifiers.external-id.tsv.gz
claims.external-id.tsv.gz           qualifiers.geo-shape.tsv.gz
claims.geo-shape.tsv.gz             qualifiers.globe-coordinate.tsv.gz
claims.globe-coordinate.tsv.gz      qualifiers.math.tsv.gz
claims.math.tsv.gz                  qualifiers.monolingualtext.tsv.gz
claims.monolingualtext.tsv.gz       qualifiers.musical-notation.tsv.gz
claims.musical-notation.tsv.gz      qualifiers.quantity.tsv.gz
claims.other.tsv.gz                 qualifiers.string.tsv.gz
claims.quantity.tsv.gz              qualifiers.tabular-data.tsv.gz
claims.string.tsv.gz                qualifiers.time.tsv.gz
claims.tabular-data.tsv.gz          qualifiers.tsv.gz
claims.time.tsv.gz                  qualifiers.url.tsv.gz
claims.tsv.gz                       quali

In [9]:
!$kypher -i $OUT/parts/claims.tsv.gz \
--match '(n1)-[]->()' \
--return 'count(distinct n1)'

count(DISTINCT graph_33_c1."node1")
16698
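The query above counts the distinct values that appear in the `node1` column of the claims file. As a cross-check, roughly the same number can be computed without Kypher. This is a sketch under the assumption that the file is a gzipped TSV with a `node1` column in its header; the function name `count_distinct_node1` is hypothetical, and the last demo row is made up.

```python
import csv
import gzip
import os
import tempfile


def count_distinct_node1(path):
    """Count distinct values in the node1 column of a gzipped KGTK edge file."""
    subjects = set()
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps KGTK string literals such as "https://..." untouched
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            subjects.add(row["node1"])
    return len(subjects)


# Demo file: two rows from the earlier output plus one made-up row
demo_claims = os.path.join(tempfile.mkdtemp(), "claims.tsv.gz")
with gzip.open(demo_claims, "wt", encoding="utf-8") as f:
    f.write("id\tnode1\tlabel\tnode2\n")
    f.write("P10-P1629-Q34508-bcc39400-0\tP10\tP1629\tQ34508\n")
    f.write("P10-P1659-P18-5e4b9c4f-0\tP10\tP1659\tP18\n")
    f.write("demo-edge-0\tQ1\tP31\tQ2\n")  # hypothetical row

print(count_distinct_node1(demo_claims))
```

This holds every distinct subject in memory, which is fine for a subset of this size but may not suit a full Wikidata dump.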


## Step 7: run the Useful Files notebook

In [10]:
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        cache_path = os.environ["OUT"] + "/temp.useful_files",
        languages = 'en',
        compute_pagerank = True,
        delete_database = True
    )
)
;

Executing:   0%|          | 0/96 [00:00<?, ?cell/s]

''

The Useful Files notebook created the following files:

In [11]:
!ls -lh $OUT/useful_files

total 19464
-rw-r--r--  1 amandeep  staff   860K Jan 22 17:10 aliases.en.tsv.gz
-rw-r--r--  1 amandeep  staff   157K Jan 22 17:11 derived.P279.tsv.gz
-rw-r--r--  1 amandeep  staff   1.7M Jan 22 17:11 derived.P279star.tsv.gz
-rw-r--r--  1 amandeep  staff   191K Jan 22 17:11 derived.P31.tsv.gz
-rw-r--r--  1 amandeep  staff   106K Jan 22 17:11 derived.isa.tsv.gz
-rw-r--r--  1 amandeep  staff   878K Jan 22 17:10 descriptions.en.tsv.gz
-rw-r--r--  1 amandeep  staff   787K Jan 22 17:10 labels.en.tsv.gz
-rw-r--r--  1 amandeep  staff   278K Jan 22 17:12 metadata.in_degree.tsv.gz
-rw-r--r--  1 amandeep  staff   123K Jan 22 17:12 metadata.out_degree.tsv.gz
-rw-r--r--  1 amandeep  staff   1.6M Jan 22 17:12 metadata.pagerank.directed.tsv.gz
-rw-r--r--  1 amandeep  staff   1.6M Jan 22 17:12 metadata.pagerank.undirected.tsv.gz
-rw-r--r--  1 amandeep  staff   1.6K Jan 22 17:12 statistics.in_degree.distribution.tsv
-rw-r--r--  1 amandeep  staff   3.5K Jan 22 17:12 statistics.out_degree.distribution.ts

## Step 8: run the Knowledge Graph Profiler

In [12]:
# The ; at the end suppresses this cell's output, which is a very large JSON object produced by executing the profiler notebook.
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Knowledge-Graph-Profiler.ipynb",
    "Knowledge-Graph-Profiler.out.ipynb",
    parameters=dict(
        wikidata_parts_folder = os.environ["OUT"] + "/parts",
        cache_folder = os.environ['TEMP'] + "/profiler_temp",
        output_folder = os.environ["OUT"] + "/profiler",
        compute_graph_statistics = "true"
    )
)
;

Executing:   0%|          | 0/76 [00:00<?, ?cell/s]

''

[Knowledge Graph Profiler output](Knowledge-Graph-Profiler.out.ipynb)

### Generate a report on Profiler output

In [13]:
run(f'{os.environ["OUT"]}/profiler')

First_Growth example div creation failed
First_Growth incoming properties div creation failed
Alsace_wine example div creation failed
Alsace_wine incoming properties div creation failed


Click [here](report.html) to see the report

In [14]:
from IPython.display import display, HTML
display(HTML(open('report.html').read()))

Class_Label,Number of Instances
,
wine,2276.0
First Growth,1078.0
white wine,734.0
Alsace wine,722.0
beer brand,683.0
Other Classes,8305.0

Property_Label,Number_of_Statements
,
Freebase ID,1677.0
VIAF ID,750.0
Quora topic ID,710.0
TasteAtlas ID,670.0
GND ID,657.0
Other Properties,26237.0

Property_Label,Number_of_Statements
,
inception,1386.0
"dissolved, abolished or demolished date",107.0
date of birth,70.0
date of death,54.0
publication date,54.0
Other Properties,124.0

Property_Label,Number_of_Statements
,
official website,1177.0
described at URL,94.0
exact match,49.0
equivalent class,29.0
external data available at,18.0
Other Properties,102.0

Property_Label,Number_of_Statements
,
geoshape,244.0
Other Properties,0.0

Property_Label,Number_of_Statements
,
demonym,4114.0
official name,862.0
short name,432.0
native label,412.0
Wikidata usage instructions,361.0
Other Properties,774.0

Property_Label,Number_of_Statements
,
population,7797.0
nominal GDP,4627.0
nominal GDP per capita,4583.0
Human Development Index,3546.0
PPP GDP per capita,2702.0
Other Properties,11183.0

Property_Label,Number_of_Statements
,
image,2240.0
locator map image,774.0
pronunciation audio,747.0
flag image,555.0
coat of arms image,446.0
Other Properties,889.0

Property_Label,Number_of_Statements
,
Other Properties,0.0

Property_Label,Number_of_Statements
,
Commons category,2042.0
postal code,502.0
Commons gallery,472.0
licence plate code,281.0
IPA transcription,221.0
Other Properties,1279.0

Property_Label,Number_of_Statements
,
Other Properties,0.0

Property_Label,Number_of_Statements
,
coordinate location,855.0
coordinates of northernmost point,210.0
coordinates of westernmost point,201.0
coordinates of southernmost point,198.0
coordinates of easternmost point,197.0
Other Properties,18.0

Property_Label,Number_of_Statements
,
Other Properties,0.0

Property_Label,Number_of_Statements
,
contains administrative territorial entity,14689.0
instance of,13824.0
subclass of,9940.0
language used,5567.0
diplomatic relation,4889.0
Other Properties,47738.0

Label_,Pagerank
,
Honoro Vera Monastrell,1.079e-05
Honoro Vera Garnacha,1.063e-05
Atteca,1.063e-05
Atteca Armas,1.063e-05
Etna bianco,1.062e-05
Etna bianco superiore,1.062e-05

Property Name,Instances,% Instances
,,
instance of,2276.0,100.0
subclass of,2144.0,94.2
country,184.0,8.08
product certification,159.0,6.99
inception,74.0,3.25
TasteAtlas ID,60.0,2.64

Property Name,Instances
,
product or material produced,16.0
has part,8.0
category's main topic,3.0
typically sells,2.0
replaced by,1.0
subclass of,1.0

Property Name,Instances,% Instances
,,
instance of,1078.0,100.0
subclass of,27.0,2.5

Property Name,Instances
,
product or material produced,16.0
has part,8.0
category's main topic,3.0
typically sells,2.0
replaced by,1.0
subclass of,1.0

Label_,Pagerank
,
Champagne,0.0003705
Prosecco,5.737e-05
Clairette du Languedoc,2.527e-05
Pouilly-Fumé,1.075e-05
Beyaz,9.359e-06
Cortese di Gavi,9.359e-06

Property Name,Instances,% Instances
,,
instance of,734.0,100.0
subclass of,42.0,5.72
country,17.0,2.32
country of origin,9.0,1.23
product certification,9.0,1.23
native label,8.0,1.09

Property Name,Instances
,
subclass of,5.0
has part,3.0
product or material produced,2.0
material used,2.0
category's main topic,2.0
different from,1.0

Property Name,Instances,% Instances
,,
instance of,722.0,100.0
subclass of,28.0,3.88

Property Name,Instances
,
subclass of,5.0
has part,3.0
product or material produced,2.0
material used,2.0
category's main topic,2.0
different from,1.0

Label_,Pagerank
,
Vergina Lager,1.056e-05
Vergina Red,1.056e-05
Affligem beer,1.053e-05
Postel,1.053e-05
Coral beer,1.04e-05
Cruzcampo,1.023e-05

Property Name,Instances,% Instances
,,
instance of,683.0,100.0
manufacturer,129.0,18.89
image,128.0,18.74
country of origin,101.0,14.79
Commons category,90.0,13.18
official website,77.0,11.27

Property Name,Instances
,
owner of,16.0
product or material produced,15.0
brand,8.0
manufacturer,6.0
sponsor,5.0
different from,5.0

Property_Label,Number_of_Statements,Property Type
,,
contains administrative territorial entity,14689.0,wikibase_item
instance of,13824.0,wikibase_item
subclass of,9940.0,wikibase_item
population,7797.0,quantity
language used,5567.0,wikibase_item
diplomatic relation,4889.0,wikibase_item

Unit,Number_of_Statements
,
Number,171.0

Unit,Number_of_Statements
,
United states dollar,4619.0
Euro,7.0
Russian ruble,1.0

Unit,Number_of_Statements
,
United states dollar,4576.0
Euro,6.0
Russian ruble,1.0

Unit,Number_of_Statements
,
Number,1.0

Unit,Number_of_Statements
,
Q550207,2669.0
United states dollar,33.0
