# SHACL validation with `pySHACL`

Let's explore the use of the [W3C Shapes Constraint Language](https://www.w3.org/TR/shacl/) (SHACL) with the [`pySHACL`](https://github.com/RDFLib/pySHACL) library.

When we build KGs, it can be helpful to think of the different semantic technologies in terms of layers:

  * [SKOS](https://www.w3.org/2001/sw/wiki/SKOS) - *thesauri* and *classification*
  * [SHACL](https://www.w3.org/TR/shacl/) - *requirements*
  * [OWL](https://www.w3.org/OWL/) - *concepts*
  * [RDF](https://www.w3.org/TR/rdf11-primer/) - *represent nodes, predicates, literals*

For an excellent overview and demos of SHACL, see [`shacl-masterclass`](https://github.com/veleda/shacl-masterclass) by Veronika Heimsbakk.
Another great online resource for working with SHACL is the [SHACL Playground](https://shacl.org/playground/).

With SHACL we can *validate* data as well as run some forms of inference to complement what RDF, OWL, and so on provide. For a good overview of SHACL and other rule-based approaches, see ["Rules for Knowledge Graphs Rules"](https://dmccreary.medium.com/rules-for-knowledge-graphs-rules-f22587307a8f) by Dan McCreary.

First, we'll show one of the examples from `pySHACL`, starting with its SHACL *shapes graph* in Turtle format:

In [1]:
shapes_graph = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix schema: <http://schema.org/> .

schema:PersonShape
    a sh:NodeShape ;
    sh:targetClass schema:Person ;
    sh:property [
        sh:path schema:givenName ;
        sh:datatype xsd:string ;
        sh:name "given name" ;
    ] ;
    sh:property [
        sh:path schema:birthDate ;
        sh:lessThan schema:deathDate ;
        sh:maxCount 1 ;
    ] ;
    sh:property [
        sh:path schema:gender ;
        sh:in ( "female" "male" ) ;
    ] ;
    sh:property [
        sh:path schema:address ;
        sh:node schema:AddressShape ;
    ] .

schema:AddressShape
    a sh:NodeShape ;
    sh:closed true ;
    sh:property [
        sh:path schema:streetAddress ;
        sh:datatype xsd:string ;
    ] ;
    sh:property [
        sh:path schema:postalCode ;
        sh:datatype xsd:integer ;
        sh:minInclusive 10000 ;
        sh:maxInclusive 99999 ;
    ] .
"""

Then define a simple *data graph* to test against, given in JSON-LD format:

In [2]:
data_graph = """
{
    "@context": { "@vocab": "http://schema.org/" },
    "@id": "http://example.org/ns#Bob",
    "@type": "Person",
    "givenName": "Robert",
    "familyName": "Junior",

    "birthDate": "1971-07-07",
    "deathDate": "1968-09-10",
    "address": {
        "@id": "http://example.org/ns#BobsAddress",
        "streetAddress": "1600 Amphitheatre Pkway",
        "postalCode": 9404
    }
}
"""

Now let's run `pySHACL` directly, to test whether this data graph conforms to the shapes graph, then print out the results:

In [3]:
import pyshacl

results = pyshacl.validate(
    data_graph,
    shacl_graph=shapes_graph,
    data_graph_format="json-ld",
    shacl_graph_format="turtle",
    inference="rdfs",
    debug=True,
    serialize_report_graph=False,
    )

conforms, v_graph, v_text = results

print("conforms", conforms)
print("graph", v_graph)
print("text", v_text)

Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
	Focus Node: <http://example.org/ns#Bob>
	Value Node: Literal("1971-07-07")
	Result Path: schema:birthDate
	Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")

Constraint Violation in MinInclusiveConstraintComponent (http://www.w3.org/ns/shacl#MinInclusiveConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:datatype xsd:integer ; sh:maxInclusive Literal("99999", datatype=xsd:integer) ; sh:minInclusive Literal("10000", datatype=xsd:integer) ; sh:path schema:postalCode ]
	Focus Node: <http://example.org/ns#BobsAddress>
	Value Node: Literal("9404", datatype=xsd:integer)
	Result Path: schema:postalCode
	Message: Value is not >= Literal("10000", datatype=xsd:integer)

Constraint Violatio

conforms False
graph [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory2']].
text Validation Report
Conforms: False
Results (2):
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:lessThan schema:deathDate ; sh:maxCount Literal("1", datatype=xsd:integer) ; sh:path schema:birthDate ]
	Focus Node: <http://example.org/ns#Bob>
	Value Node: Literal("1971-07-07")
	Result Path: schema:birthDate
	Message: Value of <http://example.org/ns#Bob>->schema:deathDate <= Literal("1971-07-07")
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:node schema:AddressShape ; sh:path schema:address ]
	Focus Node: <http://example.org/ns#Bob>
	Value Node: <http://example.org/ns#BobsAddress>
	Result Path: schema:address
	Message: Value does not conform to Shape schema:AddressShape



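The birth date value should cause a `LessThanConstraintComponent` violation, because `birthDate` (1971-07-07) is later than `deathDate` (1968-09-10). The postal code value (9404) falls below the `sh:minInclusive` bound of 10000, so the address fails `schema:AddressShape`, which the report lists as a `NodeConstraintComponent` violation on Bob.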
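To make these two violations concrete, here is a plain-Python sketch of the comparisons involved. It is not how `pySHACL` evaluates shapes: the helper names `check_less_than` and `check_inclusive_range` are hypothetical, the dates are parsed as ISO 8601 strings as an assumption for illustration, and the values are copied from the data graph above.

```python
from datetime import date


def check_less_than(value: str, other: str) -> bool:
    """Rough analogue of sh:lessThan for ISO 8601 date strings (hypothetical helper)."""
    return date.fromisoformat(value) < date.fromisoformat(other)


def check_inclusive_range(value: int, minimum: int, maximum: int) -> bool:
    """Rough analogue of sh:minInclusive / sh:maxInclusive (hypothetical helper)."""
    return minimum <= value <= maximum


# values copied from the data graph for Bob and his address
birth_ok = check_less_than("1971-07-07", "1968-09-10")
postal_ok = check_inclusive_range(9404, 10000, 99999)

print("birthDate < deathDate:", birth_ok)
print("postalCode in range:", postal_ok)
```

Both checks come out `False` for the values in the data graph, which matches the two problems flagged in the validation output.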

---
## Validating RDF graphs with `kglab`

*The following lines are only needed to verify this example against the library source code while it's in development. You do not need to include them in production code:*

In [4]:
import sys
sys.path.insert(0, "../")

Now let's try this again, using the `kglab` abstraction layer.
First we'll load our recipe graph:

In [5]:
import kglab

namespaces = {
    "nom":  "http://example.org/#",
    "wtm":  "http://purl.org/heals/food/",
    "ind":  "http://purl.org/heals/ingredient/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    }

kg = kglab.KnowledgeGraph(
    name = "A recipe KG example based on Food.com",
    base_uri = "https://www.food.com/recipe/",
    language = "en",
    namespaces = namespaces,
    )

kg.load_ttl("tmp.ttl")

Next we define a SHACL shapes graph to provide *requirements* for our recipes KG:

In [6]:
shacl_graph = """
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix nom:  <http://example.org/#> .
@prefix wtm:  <http://purl.org/heals/food/> .
@prefix ind:  <http://purl.org/heals/ingredient/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

nom:RecipeShape
    a sh:NodeShape ;
    sh:targetClass wtm:Recipe ;
    sh:property [
        sh:path wtm:hasIngredient ;
        sh:node wtm:Ingredient ;
        sh:minCount 3 ;
    ] ;
    sh:property [
        sh:path skos:definition ;
        sh:datatype xsd:string ;
        sh:maxLength 50 ;
    ] .
"""

Now run the SHACL validation through `kglab`.
Note that the `shacl_graph` parameter is optional; alternatively, the SHACL shapes graph could have been included in the RDF file that we loaded into the KG.

In [7]:
conforms, v_graph, v_text = kg.validate(shacl_graph=shacl_graph)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
	Focus Node: <https://www.food.com/recipe/137158>
	Value Node: Literal("pikkuleipienperustaikina  finnish butter cookie dough")
	Result Path: skos:definition
	Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
	Focus Node: <https://www.food.com/recipe/279314>
	Value Node: Literal("choux pastry  for profiteroles  cream puffs or eclairs")
	Result Path: skos:definition
	Message: String length not <= Literal("50", datatype=xsd:integer)

Constraint Violation in MaxLengthCons

Then print the results, which should roughly match the violations that were just printed:

In [8]:
print("conforms:", conforms)
print("\ngraph:", v_graph)
print("\ntext:", v_text)

conforms: False

graph: [a rdfg:Graph;rdflib:storage [a rdflib:Store;rdfs:label 'Memory2']].

text: Validation Report
Conforms: False
Results (4):
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
	Focus Node: <https://www.food.com/recipe/137158>
	Value Node: Literal("pikkuleipienperustaikina  finnish butter cookie dough")
	Result Path: skos:definition
	Message: String length not <= Literal("50", datatype=xsd:integer)
Constraint Violation in MaxLengthConstraintComponent (http://www.w3.org/ns/shacl#MaxLengthConstraintComponent):
	Severity: sh:Violation
	Source Shape: [ sh:datatype xsd:string ; sh:maxLength Literal("50", datatype=xsd:integer) ; sh:path skos:definition ]
	Focus Node: <https://www.food.com/recipe/279314>
	Value Node: Literal("choux pastry  for profiteroles  cream puffs or

SHACL provides excellent resources for ensuring data quality when working with KGs.
In addition to validation, SHACL can also be applied for auditing and inference.
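For a quick audit of a long report, one option is to tally the violations by constraint component. The following is a minimal sketch that assumes the text layout shown in the outputs above, which may differ between library versions; the report graph is likely the more robust source for anything beyond a quick look. The function name `tally_components` and the `sample_report` excerpt are illustrative only.

```python
import re
from collections import Counter

# Excerpt in the text layout shown earlier; in the notebook, pass the
# `v_text` string returned by validation instead of this sample.
sample_report = """Validation Report
Conforms: False
Results (2):
Constraint Violation in LessThanConstraintComponent (http://www.w3.org/ns/shacl#LessThanConstraintComponent):
\tSeverity: sh:Violation
\tFocus Node: <http://example.org/ns#Bob>
Constraint Violation in NodeConstraintComponent (http://www.w3.org/ns/shacl#NodeConstraintComponent):
\tSeverity: sh:Violation
\tFocus Node: <http://example.org/ns#Bob>
"""


def tally_components(report_text: str) -> Counter:
    """Count violations per constraint component named in the report text."""
    pattern = re.compile(r"^Constraint Violation in (\w+) ", re.MULTILINE)
    return Counter(pattern.findall(report_text))


counts = tally_components(sample_report)
for component, n in counts.most_common():
    print(f"{component}: {n}")
```

Applied to the recipe report, a tally like this would show at a glance which requirement accounts for most of the violations.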

---

## Exercises

**Exercise 1:**

Fix the errors in the first example by modifying its *data graph*, i.e., its ABox.
Can you get it to a state where the returned flag `conforms` is true?

**Exercise 2:**

Extend the SHACL *shapes graph* for our recipe KG to validate that each recipe has a non-zero cooking time.
How large must the maximum cooking time be set to avoid violations?