[Discussion] PySHACL Alternate Modes #60

Open
ashleysommer opened this issue Sep 7, 2020 · 24 comments
Labels
discussion, help wanted

Comments

@ashleysommer
Collaborator

ashleysommer commented Sep 7, 2020

PySHACL was originally built to be a basic (but fully standards compliant) SHACL validator. That is, it uses SHACL shapes to check conformance of a data graph, and gives you the result (True/False, plus a ValidationReport).
PySHACL does that job quite well. It can be called from python or from the command line, and it delivers the results users expect.

Over the last 12 months, I've been slowly implementing more of the SHACL Advanced Features spec, and pySHACL is now almost AF-complete.

The Advanced Features add capabilities to SHACL that extend beyond just validating. E.g. SHACL Rules allow you to run SHACL-based entailment on your data graph, SHACL Functions allow you to execute parameterised custom SPARQL functions over the data graph, and Custom Targets allow you to bypass the standard SHACL node-targeting mechanism and use SPARQL to select targets.

These features can be useful for executing validation in a more customisable way, but their major benefit is in general use beyond just validating a data graph against constraints.

With these new features I see the possibility of PySHACL operating in additional alternative modes, besides just validating. E.g. an expansion mode could run SHACL-AF Functions and Rules on the data graph, then return the expanded data graph (without validating).

Related to #20
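To make the expansion-mode idea concrete, here is a minimal sketch of how the existing API can already be bent in that direction, using the advanced and inplace options discussed later in this thread (the prefixes, shapes and data are illustrative only):

# Minimal sketch (not an official pySHACL mode): run SHACL-AF rules with
# `advanced=True` and `inplace=True`, then read the expanded triples back
# out of the data graph that was modified in place.
from rdflib import Graph
from pyshacl import validate

data_ttl = """
@prefix ex: <http://example.org/> .
ex:Alice a ex:Person .
"""

# A shapes graph containing one SHACL-AF TripleRule that adds a triple to every ex:Person.
shapes_ttl = """
@prefix ex: <http://example.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
ex:PersonRuleShape a sh:NodeShape ;
    sh:targetClass ex:Person ;
    sh:rule [
        a sh:TripleRule ;
        sh:subject sh:this ;
        sh:predicate ex:status ;
        sh:object ex:Labelled ;
    ] .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, report_graph, report_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    advanced=True,   # enable SHACL-AF features (rules, functions, custom targets)
    inplace=True,    # skip the working copy so the rule output stays in data_graph
)

# data_graph now also contains: ex:Alice ex:status ex:Labelled .
print(data_graph.serialize(format="turtle"))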

@ashleysommer added the discussion and help wanted labels Sep 7, 2020
@ashleysommer
Collaborator Author

Another alternate operating mode could be a highly targeted version.
See #53

You would be able to specify a single focus-node in the data graph to target, or use one of the SHACL-AF sh:target methods to select a target or set of targets.
You would also pass in the ID of one or more Shapes from the SHACL graph. PySHACL will then validate only those target nodes against only the given shapes.

This would make it possible to programmatically guide the PySHACL operation using your own external logic.
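To make that concrete, a call into such a targeted mode might look roughly like the sketch below; note that the focus_nodes and use_shapes keyword arguments are purely hypothetical illustrations of the proposal, not an existing pySHACL API, and the filenames are placeholders.

# Hypothetical sketch only: `focus_nodes` and `use_shapes` are illustrative
# parameter names for the proposed targeted mode, not part of pySHACL's API.
from rdflib import Graph, URIRef
from pyshacl import validate

data_graph = Graph().parse("data.ttl", format="turtle")      # placeholder file
shapes_graph = Graph().parse("shapes.ttl", format="turtle")  # placeholder file

conforms, report_graph, report_text = validate(
    data_graph,
    shacl_graph=shapes_graph,
    focus_nodes=[URIRef("http://example.org/Alice")],         # validate only this node...
    use_shapes=[URIRef("http://example.org/PersonShape")],    # ...against only this shape
)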

@ashleysommer
Collaborator Author

ashleysommer commented Sep 7, 2020

I have another idea for a different tool entirely, called shacl-quickcheck. It will be a heavily cut-down version of pyshacl with zero dependencies.
It will have:

  • Removed SHACL-SPARQL features including SPARQLConstraint and SPARQLConstraintComponent
  • Removed SHACL-AF Custom Targets including SPARQL-Targets and SPARQL-TargetTypes
  • Removed SHACL-AF SPARQLFunction (and other SHACL Functions)
  • Potentially remove other features (like owl:imports?)
  • Bundled stripped down and customized version of RDFLib 5.1.0 with sparql module and sparqlwrapper removed
  • New efficient in-memory store (Memory2) available as the only Graph store backend
  • Bundled rdflib-jsonld, so it doesn't need to be imported separately.
  • Bundled OWL-RL (or removed OWL/RDFS inferencing).
  • If possible, all smashed into a single python file.
  • Able to be executed directly from the commandline as a python3 script

This will result in:

  • Much faster validator start up
  • Faster validating - Quicker turnaround time from execution to result.
  • No need to install the module before using it
  • No need to mess with virtualenvs, dependency trees, version conflicts etc
  • ???

@ashleysommer pinned this issue Nov 30, 2020
@JaimieMurdock

First: thank you for all your work on PySHACL. This project and its compliance with the W3C specs has made development of SHACL-based tooling incredibly easy.

This issue is tagged as "help wanted". I've been diving into the code base for the validate function and particularly exception handling throughout the module and would like to offer some development cycles for:

  1. Improved documentation.
  2. Testing of SHACL Advanced Features, particularly the Node Expressions.

There are a lot of exciting things about the advanced features and your exploration of alternate modes. One thing that intersects with my interests in the project is reporting how many matches a particular rule was evaluated against. There's a whole class of false positives caused by ill-formed shapes that never match anything in the ontology being validated - building an engine that can recognize those cases could be very interesting and possibly spur development of an expansion of the SHACL validation report spec.

Two other notes on the shacl-quickcheck project:

  1. Based on past experience with bundling stripped-down versions of the usual libraries, I would strongly recommend against it - as a maintainer, it significantly increases your workload. There's very little shame in developing a module that can be run with python -m pyshacl.validate myfile. You can also define new entry_points to expose it as a cleaner console script (see the sketch after this list).
  2. Is installing an environment really a barrier for the target audience (i.e., people deep enough into the semantic web/RDF world that they're doing SHACL validation)? If so, maybe a Dockerfile or an environment.yml for conda-based env management would be a better fit?
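For reference, the entry_points route could be declared roughly like the sketch below (setuptools form; the shacl-quickcheck script name and the pyshacl.cli:main module path are illustrative placeholders, not the project's actual layout).

# Sketch of a console-script entry point using setuptools; the script name and
# the "pyshacl.cli:main" target are hypothetical placeholders.
from setuptools import setup

setup(
    name="shacl-quickcheck",
    entry_points={
        "console_scripts": [
            # Installs a `shacl-quickcheck` command that calls main() in pyshacl/cli.py
            "shacl-quickcheck = pyshacl.cli:main",
        ],
    },
)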

@rob-metalinkage

I was about to add a new issue, but I guess this fits under here... When validating, it is tricky to get the graph closure right: too little and you get validation errors for object types where the object is not in the graph closure; too much and you get validation errors for all the stuff in the graph closure rather than just the graph you are validating.

So, rather than running a specific rule (which is probably a good idea...) I want to be able to target a specific graph in validation. (i.e. ignore the ont_graph contents as focus nodes for validation and entailment.)

I also want (I currently do this long-hand in scripting) to be able to do entailment+validation stepwise - and be able to generate validation checks as a set of entailments and/or transformations are performed on a series of interoperability goals, and identify which rules are failing to entail or validate properly.

@rob-metalinkage

Another idea is to validate in a Linked Data context, where object references are resolved (via URI or a local catalog of some form); it could output the local catalog if you wish to persist it between executions.

@ashleysommer
Collaborator Author

Hi @rob-metalinkage
Thanks for the feature requests.

I want to be able to target a specific graph in validation.

PySHACL already has built-in support for validating on multi-graph datasets (using rdflib.ConjunctiveGraph and rdflib.Dataset abstractions), but the only option is to do all of the graphs. So it follows that it should be possible to add a feature to generate focus and value nodes from only one target named graph in that Dataset.
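For context, a multi-graph Dataset is built and validated along the following lines today; restricting focus-node generation to a single named graph would be the new part (a sketch with placeholder filenames).

# Sketch: validating a multi-graph rdflib Dataset with pySHACL as it works today.
# pySHACL currently generates focus nodes from all graphs in the Dataset; limiting
# targeting to one named graph is the feature being discussed here.
from rdflib import Dataset, Graph, URIRef
from pyshacl import validate

ds = Dataset()
data_g = ds.graph(URIRef("http://example.org/graph/data"))       # the graph to validate
ont_g = ds.graph(URIRef("http://example.org/graph/ontology"))    # supporting closure

data_g.parse("test.ttl", format="turtle")        # placeholder file
ont_g.parse("ontology.ttl", format="turtle")     # placeholder file

shapes_graph = Graph().parse("shapes.ttl", format="turtle")      # placeholder file
conforms, report_graph, report_text = validate(ds, shacl_graph=shapes_graph)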

To make it a bit easier for implementation and testing, would you be able to generate a small demonstration set of files (shape_graph, ont_graph, and data_graph) which demonstrates both cases:

  1. too little and you get validation errors for object types where the object is not in the graph closure
  2. too much and you get validation errors for all the stuff in the graph closure and not the graph you are validating.

I also want to be able to do entailment+validation step wise
be able to generate validation checks as a set of entailments and/or transformations are performed on a series of interoperability goals and identify which rules are failing to entail or validate properly.

I don't think I'm quite following what you're asking there. Can you simplify it, or provide an example?

Another idea is to validate in a Linked Data context - where object references are resolved (via URI or a local catalog of some form) - could output the local catalog if you wish to persist between executions.

I think another user asked about a similar feature. Could be worth visiting in the future.

@rob-metalinkage

OK, here are three tiny graphs for the source, validator, and ont_graph, plus the test report.

The extra ont has two things: a skos:Concept to make test.ttl valid, and an extra invalid reference to show that it gets validated as well.

test_validator.zip

@rob-metalinkage

And it's also important that, when running advanced features, the output graph excludes the content of the ont_graph... I don't think it can be treated as working unless that's the default behaviour. There may be a need to include it, but I can't imagine one - and it would be trivial for users to add it after the fact if required.

@ashleysommer
Collaborator Author

important that when running advanced features the output graph excludes the content of the ont_graph

It's unclear: are you talking about using SHACL-AF features as they stand now, or about potential future alternate modes (as per the discussion in this thread)?

In its current form, PySHACL's output graph is a graph containing a validation report and nothing else. I think you're referring to the input data graph. PySHACL creates a working copy of the input datagraph in memory, which it then uses to mix in the ont_graph, apply RDFS/OWL inferencing, and apply SHACL-AF Rules. This is done in a working copy to avoid polluting the datagraph (it's also a requirement that SHACL validation engines should not modify the datagraph).

The inplace option was added for users who need to skip the working-copy step; it forces PySHACL to operate directly on the input datagraph. This is useful for inspecting the contents of the datagraph after validation is complete, which is not normally possible (it is also useful when the datagraph is very large and it's not feasible to create a working copy in memory). The inplace option is undocumented and is an unsupported mode of operation at this time.

When using the inplace option, it is expected and desired that the contents of the ont_graph are in the datagraph after the validation is run; that is its purpose.

If, however, you're talking about a future alternate mode of operation, where PySHACL is used as a kind of entailment engine, then yes, in that case the output graph would be an inflated closure of the input graph, and I agree it should probably not include the ont_graph contents by default.

@rob-metalinkage

I am talking about the entailment engine idea, I guess.

When I'm dealing with a source artefact, my in-memory graph is always a working copy, so I'm agnostic about changing it.

@ashleysommer
Collaborator Author

A related topic has come up on the SHACL Mailing list today, in relation to the TopBraid validation engine. It was discussed there that TopBraid has a tool called shaclinfer that runs the engine in an alternate mode as discussed at the top of this thread.

I think that is a logical starting point to look at how such a mode would work for PySHACL.

@rob-metalinkage

TQ have a few SHACL rules execution components implemented in different places - the programmatically accessible ones work exactly as I have asked for.

@huanyu-li

huanyu-li commented Nov 17, 2022

Hi,

May I ask if it is possible to get inferred triples using PySHACL? If so, how should I do it?
@ashleysommer Thanks!

Best,
HL

@ashleysommer
Collaborator Author

Hi @huanyu-li
Please see the recent comments in this issue thread: #20
There has not been any further development on this feature since that discussion.

@majidaldo

Hi @huanyu-li Please see the recent comments in this issue thread: #20 There has not been any further development on this feature since that discussion.

Wouldn't be too hard, eh? Put g as part of the return:

return (not non_conformant), v_report, v_text

@ashleysommer
Collaborator Author

ashleysommer commented Jan 9, 2023

@majidaldo Correct, the implementation of returning the working graph to the user is not difficult. The difficult part is the thought process behind it. The validator's internal working datagraph (g in this context) exists for the validation engine to determine datagraph compliance; there is no mention in the SHACL W3C spec of returning this graph back to the user, and as far as I know, there are no other SHACL validation engines that do this.

However, I do believe this to be a valuable feature to have, and I am in the planning phase of a major update for PySHACL, that will expand its capabilities beyond just validation.

@majidaldo

majidaldo commented Jan 9, 2023

However, I do believe this to be a valuable feature to have, and I am in the planning phase of a major update for PySHACL, that will expand its capabilities beyond just validation.

do you mind letting the community know of these plans somewhere? a discussion?

I say all generated data - OWL/RDFS inferencing and SHACL rules output - should be produced in a separate step that is written out, so the expensive calculation can be skipped later.
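A sketch of that write-it-out-once workflow, using only the options pySHACL already exposes (inference, advanced, inplace); the filenames are placeholders.

# Sketch: do the expensive inferencing/rule expansion once, persist the result,
# then validate the persisted copy later with inference switched off.
from rdflib import Graph
from pyshacl import validate

g = Graph().parse("data.ttl", format="turtle")                   # placeholder file
shapes = Graph().parse("shapes.ttl", format="turtle")            # placeholder file

# First pass: expand in place (OWL-RL inferencing plus SHACL-AF rules), then save it.
validate(g, shacl_graph=shapes, advanced=True, inference="owlrl", inplace=True)
g.serialize("data_expanded.ttl", format="turtle")

# Later passes: load the pre-expanded copy and validate without re-running inference.
g2 = Graph().parse("data_expanded.ttl", format="turtle")
conforms, report_graph, report_text = validate(g2, shacl_graph=shapes, inference="none")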

@majidaldo

majidaldo commented Jan 9, 2023

engine to determine datagraph compliance, there is no mention in the SHACL w3c spec about returning this graph back to the

can shacl rules subsume n3 rules?

@ashleysommer
Collaborator Author

do you mind letting the community know of these plans somewhere? a discussion?

This thread is the discussion. You are already participating in it.

And remember there is always the official SHACL community Discord server used for discussion and help topics too: https://discord.gg/RTbGfJqdKB

@KonradHoeffner
Contributor

I could really use an alternate "ontology mode" that allows SHACL validation of classes instead of only instances.
For better explanation, see this example from https://stackoverflow.com/questions/70756167/how-to-apply-shacl-to-subclasses-instead-of-instances:

If I have a class, for example "Animal", then I can use SHACL to validate its instances:

:Elefant a :Animal;
 :family  	:Elephantidae;
 :order     :Proboscidea.
:AnimalShape a sh:NodeShape;
 sh:targetClass :Animal;
 sh:property [sh:path :family], [sh:path :order].

This works on DBpedia, where animals are modeled as instances; for example, https://dbpedia.org/page/Elephant has rdf:type dbo:Mammal, which is an rdfs:subClassOf dbo:Animal.

However, assume I want to model animals as classes, because an elephant is just a set of actual elephant individuals:

:Elefant rdf:type owl:Class;
 rdfs:subClassOf :Animal;
 :family  	:Elephantidae;
 :order     :Proboscidea.

This will not be validated using the aforementioned SHACL shape. However, I would like to have an alternate pySHACL mode that does just that.
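For what it's worth, this kind of class-level targeting can be approximated today with a SHACL-AF SPARQL-based custom target, which pySHACL supports when run with advanced=True; the sketch below uses full IRIs in the target query so no prefix declarations are needed, and all names are illustrative.

# Sketch: approximating a "targetSubClassOf"-style mode with a SHACL-AF
# SPARQL-based custom target (supported by pySHACL when advanced=True).
from rdflib import Graph
from pyshacl import validate

shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:AnimalSubclassShape a sh:NodeShape ;
    sh:target [
        a sh:SPARQLTarget ;
        sh:select "SELECT ?this WHERE { ?this <http://www.w3.org/2000/01/rdf-schema#subClassOf>+ <http://example.org/Animal> }" ;
    ] ;
    sh:property [ sh:path ex:family ; sh:minCount 1 ] ,
                [ sh:path ex:order ; sh:minCount 1 ] .
"""

data_ttl = """
@prefix ex: <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Elefant a owl:Class ;
    rdfs:subClassOf ex:Animal ;
    ex:family ex:Elephantidae .
# ex:order is missing, so ex:Elefant should be reported by the shape above.
"""

conforms, report_graph, report_text = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
    advanced=True,  # SPARQL-based custom targets are a SHACL-AF feature
)
print(report_text)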

@ajnelson-nist
Contributor

(This and Konrad's comment might be worth breaking out into a separate Issue. If GitHub's make-an-Issue button doesn't grab both at once, I'll happily migrate this comment.)

(Opinions in this post are my own and nobody else's. I'm also not a biologist and some of my domain knowledge is likely outdated.)

@KonradHoeffner - I think your example of animal taxonomies is an OWL design question more than a SHACL issue. I do have a suggestion that follows on where I think SHACL could be used on OWL-specific tasks. I did try to start a SHACL-based discussion on using OWL to model animal taxonomies, but I ended up getting sidetracked by the specific properties in the example (family and order) being, in my own opinion, better to model as classes because of some specific benefits from OWL entailment (/inference/knowledge expansion). OWL and SHACL might be writable to ensure that the family and order values are related to one another, but I think this example more motivates a different OWL subclass design without putting family and order into object properties. I've left that whole discussion under a block to be expanded by those interested.

Your first snippet states :Elefant a :Animal, so :Animal is a metaclass, a class where instances of the class are classes. Here's an example of a consequence of that design: Say you encounter an elephant on foot, and record it in some graph:

kb:thatElephantISaw a :Elefant .

If :Animal is a metaclass, then this is NOT true (nor entailed):

kb:thatElephantISaw a :Animal .

Is that how you want your model to work?

Take instead a knowledge-refining example. I'll use birds and an OWL design with a different application of metaclasses, that focuses the metaclasses on describing taxonomic levels.

:Animal a owl:Class .
:Bird a owl:Class ; rdfs:subClassOf :Animal .
:Eagle a owl:Class ; rdfs:subClassOf :Bird .
:Seagull a owl:Class ; rdfs:subClassOf :Bird .

One day, you take a picture of a bird flying high, and can only see a silhouette. You can record this in your journal-graph:

:thatFlyingBirdISaw a :Bird .

Later, you check silhouette references and conclude that, among your taxonomy, eagle's the most likely answer, and note so in the same journal-graph:

:thatFlyingBirdISaw a :Eagle .

You've made your graph more precise, and your prior triple is now redundant from entailment. OWL entailment of the latter triple would expand your graph to include:

:thatFlyingBirdISaw a :Bird .
:thatFlyingBirdISaw a :Animal .

Going back to the family and order classifications, one taxonomy design could do the tree-of-life division (kingdom, phylum, class, order, family, genus, species - apologies in advance if it's outdated, I think the last time I thought of that whole ordering was 20 years ago). Somewhere in there, (Bald) Eagle's class hierarchy would show up as:

:Animalia a owl:Class .
:Accipitriformes a owl:Class ; rdfs:subClassOf :Animalia .  # (This skipped a few steps.)
:Accipitridae a owl:Class ; rdfs:subClassOf :Accipitriformes .
:BaldEagle rdfs:subClassOf :Accipitridae .  # (This skipped a few steps.)

A metaclass based design could note the family and order:

:Animalia a owl:Class , :TaxonomicKingdom .
:Accipitriformes a owl:Class , :TaxonomicOrder .
:Accipitridae a owl:Class , :TaxonomicFamily ; rdfs:subClassOf :Accipitriformes .

Then, these are entailed, and look (to me) correct - the eagle-individual is an instance of :Accipitridae and :Accipitriformes and :Animalia, but not an instance of a :TaxonomicOrder (or :TaxonomicKingdom, etc.).

If you're curious what the family of :thatFlyingBirdISaw is, here is the SPARQL query (and /rdfs:subClassOf* is not necessary if OWL inference has been done):

SELECT ?nFamily
WHERE {
  :thatFlyingBirdISaw a/rdfs:subClassOf* ?nFamily .
  ?nFamily a :TaxonomicFamily .
}

From the above, I meant to supplement what I saw in the discussion thread on StackOverflow. In summary, I don't think your example demonstrates a SHACL problem with reviewing OWL - it looks particular to a question of when to use metaclasses. You could write SHACL to require that all instances of :Animal in your graph be an :Animal subclass that has, somewhere along the class hierarchy, membership in each of your requested taxonomic levels. I think that SHACL-SPARQL has to be used to deal with the metaclass, though I'd be very happy if someone could suggest otherwise. E.g. here's family again, as a SPARQL constraint:

:MustHaveFamily-shape
  a sh:NodeShape ;
  sh:targetClass :Animal ;
  sh:sparql [
    a sh:SPARQLConstraint ;
    sh:description "Find all animal individuals that do not have a taxonomic family specified."@en ;
    sh:message "Focus node is not a subclass of a class that is a taxonomic family instance."@en .
    sh:select """
SELECT $this
WHERE {
  $this a ?nClass .
  FILTER NOT EXISTS {
    ?nClass rdfs:subClassOf* ?nFamilyClass .
    ?nFamilyClass a :TaxonomicFamily .
  }
}
""" ;
  ] ;
  .

Warning - While I think the constraint will give correct answers, I do not think it would be fast to execute, because it's necessarily searching for absent information.

Back to SHACL and specific review of OWL class and property design (i.e. the TBox), rather than OWL individuals-data (i.e. the ABox): SHACL can be used to review OWL syntax and constructs, such as for conformance against the OWL-to-RDF mapping, or for consistency checking between OWL definitions and SHACL shapes, as done in this shape 1 that checks that owl:DatatypeProperty properties are only used with SHACL constraints on literal values.

Footnotes

  1. Participation by NIST in the creation of the documentation of mentioned software is not intended to imply a recommendation or endorsement by the National Institute of Standards and Technology, nor is it intended to imply that any specific software is necessarily the best available for the purpose.

@KonradHoeffner
Contributor

KonradHoeffner commented Sep 8, 2023

@ajnelson-nist: Wow, thank you for the extremely detailed answer! While I totally agree with your points in theory, I want to explain my experiences and motivation for such a mode:

In Semantic Web ontology / knowledge base research projects there are often three groups: the domain experts (A), the ontologists (B) and the Semantic Web / Linked Open Data people (C).

A: Know everything about the domain but are not experts in ontologies or Semantic Web technologies. They can model their domain into an ontology / knowledge base with good tooling provided by C, but have difficulties understanding some theoretical distinctions like subclass vs. part-of. The domain they are modelling contains extensive hierarchies and the concepts they are describing are abstract (like "Elephant", "Hospital" and so on), so while validation would be easy with SHACL if they modelled a knowledge base (i.e. individuals), the particularities of the domain are better expressed with an ontology of classes, even though it is more of an "ontology light" or a "knowledge base with classes instead of individuals". In the end, the data is entered with a table-like tool, so there are only database-like relations + hierarchies at this stage.
B: Are experts in ontologies, reasoning and so on, but their ontologies are often not automatically validated, not published in as many ways, or even not published at all in a recent version. They provide the ontology for the domain, but because the domain experts also provide an ontology, B provide a "meta / core ontology" and A provide the domain ontology, which is connected via subclass-of statements to the core ontology (+ top-level ontology). If the core paper of the research project is targeted at an ontology conference / journal, there are also other considerations. For example, I'm not sure if modelling classes as individuals / metaclasses will create paper-acceptance problems.
C: Focus more on practical aspects like resolvable URLs, deployments, validation and tooling, search, in general providing a SPARQL endpoint and services using this SPARQL endpoint as API. Because of the many difficulties with OWL constructs such as axioms mapped into RDF, relations are often modelled as object properties so one fact is represented by a single triple (e.g. :Elefant :numberOfLegs "4"^^xsd:positiveInteger) even though the theoretically correct way would be to say that each individual of the elephant class has 4 legs and so on.

In the end, what the research project needs is a simple method to catch basic errors such as missing values, wrong cardinalities, or invalid references; complex reasoning is not needed. If it can be integrated into continuous integration, like a GitHub Action, all the better.
PySHACL is perfect for this, with the only problem being that it does not have a "targetSubClassOf" mode.
As I have experienced similar situations twice, it could be a common problem that many others also face; however, it could also be that this is not actually that common, in which case I understand that it is not useful to add such a mode. I would be interested in feedback from others who are in a similar situation and would find this useful.

However after reading your take again and writing this, I guess I should just accept that metaclasses are the correct solution here and just add them.

@majidaldo

majidaldo commented Sep 11, 2023

I've developed a Python-coordinated 'rules' engine around Oxigraph.
Since the rule 'signature' is just rule: db --> triples, you can practically put everything under one roof:
sparqlConstructRule and pyfuncRule are everything that you'd need.
I even used RDF-star to 'annotate' generated triples.

  • SHACL rules could just be translated to sparql (preferably)
    or incur a serialization cost by routing relevant data to an interface complying with mypyShaclrule-->pyfuncrule.
  • ...so really an arbitrary python function can be a rule
  • For validation, as a final one-time 'rule' application after closure,
    SHACL validation could, again, be cast as a sparql ASK (preferably) or routed to a pyShaclFunc

This works for my use with practicality in mind but provides a pathway to elegance,
where elegance means offloading work to the db and using sparql as a native and compact language.
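(As a rough illustration of that rule signature - db --> triples - the fragment below applies a SPARQL CONSTRUCT query as a rule over an rdflib graph until a fixpoint is reached; it only shows the general pattern, not the Oxigraph-based engine described above.)

# Minimal sketch of the "rule: db --> triples" signature: a SPARQL CONSTRUCT
# query is the rule body, applied repeatedly until no new triples appear.
from rdflib import Graph

SUBCLASS_RULE = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
CONSTRUCT { ?x a ?super }
WHERE { ?x a ?sub . ?sub rdfs:subClassOf ?super . }
"""

def apply_rule_until_fixpoint(g: Graph, rule: str) -> Graph:
    """Add the rule's constructed triples to g until the graph stops growing."""
    while True:
        before = len(g)
        for triple in g.query(rule):   # CONSTRUCT results iterate as triples
            g.add(triple)
        if len(g) == before:
            return g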

After reviewing the pySHACL and owlrl codebases, I felt that both could make use of a common rule system.
I went ahead and created my own.

Side: I'm not an ontologist, but why isn't sparql used for inferencing?

@cgrain

cgrain commented Apr 18, 2024

A SHACL 'infer' capability would be a great idea! Do you have any additional thoughts on your planning?
