## Constraint validation example

In this notebook we will run a simple constraint validation query against the Arnold KG.

We validate the property P276 (location). This notebook performs the following tasks:

1- Simple query for records of P276, with examples.

2- Type/Value type constraint validation with kgtk queries.

3- Exploration of an example.

### Disclaimer
The constraint validation uses only the data available in the graph, which does not contain the complete Wikidata taxonomy. Some reported constraint violations may therefore be caused by the incomplete taxonomy rather than by an actual error in the data.

### Setup
This notebook was tested using the Docker installation of KGTK. It was run with the cloned [KGTK Notebooks repository](https://github.com/usc-isi-i2/kgtk-notebooks/) as the working directory (`$PWD`). With that configuration, no changes should be needed to run the notebook.

```
docker run -it --rm -v $PWD:/out -p 8888:8888 uscisii2/kgtk:latest-dev /bin/bash -c "jupyter notebook --ip='*' --port=8888 --no-browser"
```

In [1]:
# Import all the libraries needed to run KGTK.
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd
from IPython.display import display, HTML

from kgtk.configure_kgtk_notebooks import ConfigureKGTK
from kgtk.functions import kgtk, kypher

In [2]:
# Parameters

#kgtk_path = "/kgtk"

# Folder on local machine where to create the output and temporary folders
input_path = None
output_path = "/tmp/projects"
project_name = "tutorial-constraints"

In [3]:
files = [
    "all",
    "label",
    "alias",
    "description",
    "external_id",
    "monolingualtext",
    "quantity",
    "string",
    "time",
    "item",
    "wikibase_property",
    "qualifiers",
    "datatypes",
    "p279",
    "p279star",
    "p31",
    "in_degree",
    "out_degree",
    "pagerank_directed",
    "pagerank_undirected"
]
ck = ConfigureKGTK(files)
ck.configure_kgtk(input_graph_path=input_path,
                  output_path=output_path,
                  project_name=project_name)

User home: /Users/grantxie
Current dir: /Users/grantxie/Downloads/kgtk-notebooks/tutorial
KGTK dir: /Users/grantxie/Downloads/kgtk-notebooks
Use-cases dir: /Users/grantxie/Downloads/kgtk-notebooks/use-cases
--2022-02-01 10:19:13--  https://github.com/usc-isi-i2/kgtk-tutorial-files/raw/main/datasets/arnold/all.tsv.gz
Resolving github.com... 140.82.112.3
Connecting to github.com|140.82.112.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/arnold/all.tsv.gz [following]
--2022-02-01 10:19:14--  https://github.com/usc-isi-i2/kgtk-notebooks/raw/main/datasets/arnold/all.tsv.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/datasets/arnold/all.tsv.gz [following]
--2022-02-01 10:19:14--  https://raw.githubusercontent.com/usc-isi-i2/kgtk-notebooks/main/datasets/arnold/

In [4]:
ck.print_env_variables()

TEMP: /tmp/projects/tutorial-constraints/temp.tutorial-constraints
KGTK_GRAPH_CACHE: /tmp/projects/tutorial-constraints/temp.tutorial-constraints/wikidata.sqlite3.db
OUT: /tmp/projects/tutorial-constraints
kgtk: kgtk
GRAPH: /Users/grantxie/isi-kgtk-tutorial/input
STORE: /tmp/projects/tutorial-constraints/temp.tutorial-constraints/wikidata.sqlite3.db
USE_CASES_DIR: /Users/grantxie/Downloads/kgtk-notebooks/use-cases
EXAMPLES_DIR: /Users/grantxie/Downloads/kgtk-notebooks/examples
KGTK_OPTION_DEBUG: false
kypher: kgtk query --graph-cache /tmp/projects/tutorial-constraints/temp.tutorial-constraints/wikidata.sqlite3.db
KGTK_LABEL_FILE: /Users/grantxie/isi-kgtk-tutorial/input/labels.en.tsv.gz
all: /Users/grantxie/isi-kgtk-tutorial/input/all.tsv.gz
label: /Users/grantxie/isi-kgtk-tutorial/input/labels.en.tsv.gz
alias: /Users/grantxie/isi-kgtk-tutorial/input/aliases.en.tsv.gz
description: /Users/grantxie/isi-kgtk-tutorial/input/descriptions.en.tsv.gz
external_id: /Users/grantxie/isi-kgtk-tutori

In [5]:
os.environ['KGTK_GRAPH_CACHE'] = os.environ['STORE']

In [6]:
%%time
ck.load_files_into_cache()

kgtk query --graph-cache /tmp/projects/tutorial-constraints/temp.tutorial-constraints/wikidata.sqlite3.db -i "/Users/grantxie/isi-kgtk-tutorial/input/all.tsv.gz" --as all  -i "/Users/grantxie/isi-kgtk-tutorial/input/labels.en.tsv.gz" --as label  -i "/Users/grantxie/isi-kgtk-tutorial/input/aliases.en.tsv.gz" --as alias  -i "/Users/grantxie/isi-kgtk-tutorial/input/descriptions.en.tsv.gz" --as description  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.external-id.tsv.gz" --as external_id  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.monolingualtext.tsv.gz" --as monolingualtext  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.quantity.tsv.gz" --as quantity  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.string.tsv.gz" --as string  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.time.tsv.gz" --as time  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.wikibase-item.tsv.gz" --as item  -i "/Users/grantxie/isi-kgtk-tutorial/input/claims.wikibase-property.tsv.gz" --as wiki

Print the 20 most frequent locations, together with the number of distinct instances that point to each of them through P276:

In [7]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P276]->(location)'
        --return 'location as location, count(distinct instance) as count'
        --order-by 'cast(count, int) desc'
        --limit 20 
    / add-labels
""")

CPU times: user 10.5 ms, sys: 10.7 ms, total: 21.2 ms
Wall time: 3.93 s


| | location | count | location;label |
|---|---|---|---|
| 0 | Q84 | 23 | 'London'@en |
| 1 | Q65 | 21 | 'Los Angeles'@en |
| 2 | Q30 | 20 | 'United States of America'@en |
| 3 | Q383689 | 18 | 'Northwest'@en |
| 4 | Q98 | 10 | 'Pacific Ocean'@en |
| 5 | Q90 | 10 | 'Paris'@en |
| 6 | Q64 | 9 | 'Berlin'@en |
| 7 | Q656 | 8 | 'Saint Petersburg'@en |
| 8 | Q1085 | 8 | 'Prague'@en |
| 9 | Q61 | 7 | 'Washington, D.C.'@en |
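
To make the aggregation explicit, here is a minimal plain-Python sketch of what `count(distinct instance)` grouped by `location` computes. The edge list is a hypothetical sample (not taken from the Arnold KG) and the helper name `count_instances_per_location` is introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical sample of (node1, label, node2) edges; not real data from the graph.
edges = [
    ("Q1", "P276", "Q84"),
    ("Q2", "P276", "Q84"),
    ("Q2", "P276", "Q84"),  # duplicate edge: the instance is counted only once
    ("Q3", "P276", "Q65"),
    ("Q4", "P31", "Q5"),    # different property: ignored
]

def count_instances_per_location(edges, prop="P276"):
    """Count the distinct subjects per object for one property, most frequent first."""
    instances = defaultdict(set)
    for node1, label, node2 in edges:
        if label == prop:
            instances[node2].add(node1)
    counts = [(location, len(subjects)) for location, subjects in instances.items()]
    # Sort by descending count; ties are broken by location id for determinism.
    return sorted(counts, key=lambda pair: (-pair[1], pair[0]))

print(count_instances_per_location(edges))
```

Collecting the subjects in a set is what makes the count distinct: repeated edges between the same pair of nodes do not inflate the result.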


Print additional information about the locations: for each location that is the value of a P276 statement, list its own properties and values.

In [8]:
%%time
kgtk("""
    query -i all
        --match '(instance)-[:P276]->(location), (location)-[r {label: property}]->(value)'
        --return 'location as location, property as p, value as v'
        --limit 20 
    / add-labels
""")

CPU times: user 7.62 ms, sys: 9.07 ms, total: 16.7 ms
Wall time: 5.19 s


| | location | p | v | location;label | p;label | v;label |
|---|---|---|---|---|---|---|
| 0 | Q105397154 | alias | 'Public Policy Building'@en | 'UCLA Public Affairs Building'@en | | |
| 1 | Q105397154 | alias | 'UCLA Public Policy Building'@en | 'UCLA Public Affairs Building'@en | | |
| 2 | Q105397154 | alias | 'Public Affairs Building'@en | 'UCLA Public Affairs Building'@en | | |
| 3 | Q105397154 | alias | 'University of California, Los Angeles Public ... | 'UCLA Public Affairs Building'@en | | |
| 4 | Q105397154 | alias | 'University of California, Los Angeles Public ... | 'UCLA Public Affairs Building'@en | | |
| 5 | Q105397154 | P455 | 230730 | 'UCLA Public Affairs Building'@en | 'Emporis building ID'@en | |
| 6 | Q105397154 | P1448 | 'Public Affairs Building'@en | 'UCLA Public Affairs Building'@en | 'official name'@en | |
| 7 | Q105397154 | P6375 | '337 Charles E. Young Drive East, Los Angeles,... | 'UCLA Public Affairs Building'@en | 'street address'@en | |
| 8 | Q105397154 | P1101 | +7 | 'UCLA Public Affairs Building'@en | 'floors above ground'@en | |
| 9 | Q105397154 | P2046 | +242912Q857027 | 'UCLA Public Affairs Building'@en | 'area'@en | |


## An example with value type constraint (similar to rdfs:range)

In [9]:
# Keep only the statements whose property is P276 and save them in claims.P276.tsv
kgtk("""
    filter -i all -p ";P276;" / add-labels 
    -o $OUT/claims.P276.tsv
""")
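
The pattern `";P276;"` passed to `filter` has three fields (`node1;label;node2`), and an empty field places no restriction on that column. The following is a rough sketch of that matching rule in plain Python. It is an illustration only, not KGTK's implementation: the helper `matches` is hypothetical, and treating commas inside a field as alternatives is an assumption made for this sketch.

```python
def matches(edge, pattern):
    """Return True if an edge (node1, label, node2) matches a 'node1;label;node2' pattern.

    An empty field acts as a wildcard. Comma-separated values inside a field
    are treated as alternatives (an assumption of this sketch).
    """
    fields = [field.strip() for field in pattern.split(";")]
    if len(fields) != 3:
        raise ValueError("pattern must have exactly three ';'-separated fields")
    for value, field in zip(edge, fields):
        if field and value not in [alt.strip() for alt in field.split(",")]:
            return False
    return True

# Hypothetical edges used only to exercise the helper.
sample = [("Q1", "P276", "Q84"), ("Q1", "P31", "Q5"), ("Q2", "P276", "Q65")]
print([edge for edge in sample if matches(edge, ";P276;")])
```

With the pattern `";P276;"` only the middle field is constrained, so every edge labeled P276 is kept regardless of its subject and value.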

Let's look at some of the P276 statements we just saved:

In [10]:
kgtk("""
    cat -i $OUT/claims.P276.tsv
""")

| | node1 | label | node2 | id | node2;wikidatatype | node1;label | label;label | node2;label |
|---|---|---|---|---|---|---|---|---|
| 0 | Q101024448 | P276 | Q105397154 | Q101024448-P276-Q105397154-0b107969-0 | wikibase-item | 'University of California Los Angeles Departme... | 'location'@en | 'UCLA Public Affairs Building'@en |
| 1 | Q1011509 | P276 | Q30 | Q1011509-P276-Q30-2b888575-0 | wikibase-item | 'Golden Globe Award for Best Motion Picture – ... | 'location'@en | 'United States of America'@en |
| 2 | Q1011547 | P276 | Q30 | Q1011547-P276-Q30-1d4b97b1-0 | wikibase-item | 'Golden Globe Award'@en | 'location'@en | 'United States of America'@en |
| 3 | Q102083688 | P276 | Q174710 | Q102083688-P276-Q174710-9fc202b3-0 | wikibase-item | 'Dodd Hall'@en | 'location'@en | 'University of California, Los Angeles'@en |
| 4 | Q102253933 | P276 | Q102254486 | Q102253933-P276-Q102254486-59d6b447-0 | wikibase-item | 'UCLA Department of Economics'@en | 'location'@en | 'Bunche Hall'@en |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 986 | Q9696-P69-Q5103452-2205a8e8-0 | P276 | Q755745 | Q9696-P69-Q5103452-2205a8e8-0-P276-Q755745-0 | | | 'location'@en | 'Wallingford'@en |
| 987 | Q9696-P69-Q7338137-1aa3cffa-0 | P276 | Q60 | Q9696-P69-Q7338137-1aa3cffa-0-P276-Q60-0 | | | 'location'@en | 'New York City'@en |
| 988 | Q99-P6591-2f7689-dca2a805-0 | P276 | Q967966 | Q99-P6591-2f7689-dca2a805-0-P276-Q967966-0 | | | 'location'@en | 'Furnace Creek'@en |
| 989 | Q99-P7422-3499f0-c105a329-0 | P276 | Q2908225 | Q99-P7422-3499f0-c105a329-0-P276-Q2908225-0 | | | 'location'@en | 'Boca'@en |


First, let us retrieve the statements that satisfy the constraint, that is, those whose value is an instance of one of the allowed classes or of one of their subclasses.
By looking at https://www.wikidata.org/wiki/Property:P276 we can retrieve the list of classes that define the range of the property, e.g., [location](https://www.wikidata.org/wiki/Q17334923), [geographical feature](https://www.wikidata.org/wiki/Q618123), [geographic region](https://www.wikidata.org/wiki/Q82794), etc. The query below also accepts the individual items Q2 and Q663611 as values.

In [11]:
%%time
kgtk("""
    query 
    -i $OUT/claims.P276.tsv -i  p31 -i p279star
    --match 'P276: (node1)-[nodeProp]->(node2), p31: (node2)-[]->(nodex), p279star: (nodex)-[]->(par)' 
    --where 'par in ["Q1299240", "Q1656682", "Q17334923", "Q17350442", "Q188193", "Q190463", "Q20203388", "Q2133296", "Q22698", "Q24334893", "Q3895768", "Q40397", "Q4130", "Q4503801", "Q47495022", "Q4936952", "Q618123", "Q634", "Q712378", "Q82794", "Q89464513", "Q988108"] 
        or node2 in ["Q2", "Q663611"]'   
    --return 'distinct nodeProp.id, node1 as `node1`, nodeProp.label as `label`, node2 as `node2`' 
    -o $OUT/P276.correct_temp.tsv
""")

CPU times: user 4.24 ms, sys: 7.39 ms, total: 11.6 ms
Wall time: 1.22 s
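
The logic of the query above can be sketched in plain Python: a value passes when one of its P31 classes has, among its P279* ancestors, one of the allowed classes. The class and item ids are a subset of those in the `--where` clause; everything else (the placeholder ids, the toy `P31` and `P279STAR` dictionaries, and the helper `satisfies_value_type`) is hypothetical. The sketch also assumes that the `p279star` data relates each class to itself.

```python
# Subset of the allowed classes and items listed in the query's --where clause.
ALLOWED_CLASSES = {"Q17334923", "Q618123", "Q82794"}
ALLOWED_ITEMS = {"Q2", "Q663611"}

# Hypothetical toy data with placeholder ids (not real Wikidata content).
P31 = {"place_a": {"class_city"}, "thing_b": {"class_language"}}
P279STAR = {
    "class_city": {"class_city", "Q17334923"},   # assumed reflexive closure
    "class_language": {"class_language"},
}

def satisfies_value_type(value, p31=P31, p279star=P279STAR):
    """Mirror the query: walk value -> P31 class -> P279* ancestor, then apply the filter."""
    for cls in p31.get(value, ()):
        for ancestor in p279star.get(cls, ()):
            if ancestor in ALLOWED_CLASSES or value in ALLOWED_ITEMS:
                return True
    return False

for value in ("place_a", "thing_b", "untyped_c"):
    print(value, satisfies_value_type(value))
```

Like the `--match` clause, the sketch requires a value to have at least one P31 edge and one P279* edge before the filter is applied, so a value with no type information in the graph never passes, even if it is one of the allowed items.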


In [12]:
# Now keep only the P276 claims that are NOT in the set of correct claims (the candidate violations)
kgtk("""
    ifnotexists -i $OUT/claims.P276.tsv --filter-on $OUT/P276.correct_temp.tsv  -o $OUT/P276.incorrect_temp.tsv
""")
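
Conceptually, this step is an anti-join: keep every row of the first file that has no counterpart in the filter file. Below is a minimal sketch in plain Python, assuming rows are compared on the `(node1, label, node2)` columns. The helper `if_not_exists` and the sample rows are hypothetical and only illustrate the idea.

```python
def if_not_exists(rows, filter_rows, key=("node1", "label", "node2")):
    """Return the rows whose key columns do not appear in filter_rows (an anti-join)."""
    seen = {tuple(row[column] for column in key) for row in filter_rows}
    return [row for row in rows if tuple(row[column] for column in key) not in seen]

# Hypothetical rows with placeholder ids (not real data from the graph).
claims = [
    {"node1": "item_1", "label": "P276", "node2": "place_a"},
    {"node1": "item_2", "label": "P276", "node2": "thing_b"},
]
correct = [{"node1": "item_1", "label": "P276", "node2": "place_a"}]

print(if_not_exists(claims, correct))
```

Building the set of keys once keeps the comparison linear in the number of rows instead of comparing every pair of rows.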

In [13]:
# let's browse the final results
kgtk("""
    cat -i $OUT/P276.incorrect_temp.tsv
""")

| | node1 | label | node2 | id | node2;wikidatatype | node1;label | label;label | node2;label |
|---|---|---|---|---|---|---|---|---|
| 0 | Q38695 | P276 | Q628858 | Q38695-P276-Q628858-def911c8-0 | wikibase-item | 'cooking'@en | 'location'@en | 'workplace'@en |
| 1 | Q42177 | P276 | Q193837 | Q42177-P276-Q193837-6a360c34-0 | wikibase-item | 'bed'@en | 'location'@en | 'bedroom'@en |
| 2 | Q42177 | P276 | Q4260475 | Q42177-P276-Q4260475-f6af1a06-0 | wikibase-item | 'bed'@en | 'location'@en | 'medical facility'@en |
| 3 | Q64809639 | P276 | Q27686 | Q64809639-P276-Q27686-9cacc491-0 | wikibase-item | 'hotel management'@en | 'location'@en | 'hotel'@en |
| 4 | Q7809 | P276 | Q43895552 | Q7809-P276-Q43895552-9187c9e7-0 | wikibase-item | 'UNESCO'@en | 'location'@en | 'UNESCO office United States'@en |
| 5 | Q7809 | P276 | Q50356724 | Q7809-P276-Q50356724-26a033ef-0 | wikibase-item | 'UNESCO'@en | 'location'@en | 'UNESCO office Netherlands'@en |
| 6 | Q7809 | P276 | Q50356730 | Q7809-P276-Q50356730-e22d3164-0 | wikibase-item | 'UNESCO'@en | 'location'@en | 'UNESCO office Qatar'@en |
| 7 | Q7809 | P276 | Q50356741 | Q7809-P276-Q50356741-3581054d-0 | wikibase-item | 'UNESCO'@en | 'location'@en | 'UNESCO office Egypt'@en |
| 8 | Q7809 | P276 | Q50356747 | Q7809-P276-Q50356747-36fd09f9-0 | wikibase-item | 'UNESCO'@en | 'location'@en | 'UNESCO office Afghanistan'@en |
| 9 | Q7931198 | P276 | Q188 | Q7931198-P276-Q188-7a514a9a-0 | wikibase-item | 'Vilnius University Institute of International... | 'location'@en | 'German'@en |


The first result is cooking (Q38695), whose "location" statement points to workplace (Q628858).
In this graph, workplace is not an instance of any of the classes (or their subclasses) in the range specified by the constraint, so the statement is flagged as a violation.

Another example is Q7931198, the Vilnius University Institute of International Relations and Political Science, whose location is given as **German** (Q188) instead of **Germany**.

**Note that the graph we are handling is an incomplete subset of Wikidata, and part of the taxonomy has not been imported. Thus, some of the statements in this list may actually be correct.**
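
One possible way to act on this caveat is to separate flagged values that have no type information in the graph from those that are typed but fall outside the allowed range. The first group is the more likely product of the incomplete taxonomy. This is a hedged sketch: the helper `triage`, the category names, and the toy data are all hypothetical and are not part of the notebook's workflow.

```python
def triage(flagged_values, p31):
    """Group flagged values by whether the graph holds any P31 (instance of) edge for them."""
    report = {"no_type_in_graph": [], "typed_but_out_of_range": []}
    for value in flagged_values:
        category = "typed_but_out_of_range" if p31.get(value) else "no_type_in_graph"
        report[category].append(value)
    return report

# Hypothetical toy data with placeholder ids (not real data from the graph).
p31 = {"thing_b": {"class_language"}}
flagged = ["thing_b", "untyped_c"]

print(triage(flagged, p31))
```

Values in the `no_type_in_graph` group are candidates for a manual check against the full Wikidata before they are reported as violations.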