-
Notifications
You must be signed in to change notification settings - Fork 558
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Graph_diff fails on presence of unrelated blank nodes #1294
Comments
How should rdflib know that the blank node in g1 is the same as the one in g2? I don't think anything in rdf semantics make them the same, and not even in rdfs or owl semantics AFAIK. So I don't think this behaviour is incorrect, but I may be wrong, if you can provide some reference to the standards that dictate they should be the same I will have a look. Regardless of standards/specs however, if there is some way in which you think equivalence can be established consistently I think it would be good to know, it would be useful to have. I think it may actually be a bug that it works if there is only one blank node however, I don't think blank nodes in independent graphs should ever be equivalent with pure rdf semantics. I can see cases for OWL where that would be different though. |
I guess this could be done: Blank Node Matching and RDF/S Comparison Functions - but it is not standard RDF semantics, so it should likely not go inside the normal graph diff function. |
I see you import |
Comparing skolemized canonical graphs seems to yield the result you expect: from pprint import pprint
from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph
g1t = """@prefix : <http://test.org#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:C1 a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:minCardinality "1"^^xsd:int ;
owl:onProperty rdfs:label
] ;
rdfs:label "C1"@en ."""
g2t = (
g1t
+ """ :C2 a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:minCardinality "1"^^xsd:int ;
owl:onProperty rdfs:comment
] ;
rdfs:label "C2"@en ."""
)
g1 = Graph()
g2 = Graph()
g1.parse(data=g1t, format="turtle")
# g1.parse(source="profile.ttl", format='turtle')
# g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format="turtle")
g1c = to_canonical_graph(g1).skolemize()
g2c = to_canonical_graph(g2).skolemize()
g1cs = g1c.skolemize()
g2cs = g2c.skolemize()
in_both, in_first, in_second = graph_diff(g1cs, g1cs)
print("should be empty as all in g1 is in g2")
print(in_first.serialize(format="turtle").decode())
in_both, in_first, in_second = graph_diff(g1cs, g1cs)
print("should be empty as all in g1 is in g1")
print(in_first.serialize(format="turtle").decode()) output:
|
Thanks for the feedback.. to_canonical_graph() is used within graph_diff the output of to_canonical graph illustrates the problem: (rdflib.term.URIRef('http://test.org#C1'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'), rdflib.term.BNode('cb0'))
I tested you code and it doesnt help - you have an error in you are comparing the first graph to itself.. graph_diff(g1cs, g1cs)
|
PS - thanks for helping - if you are the original author could you please create a PR to drop in a comment saying where the algorithm came from and add some tests ? It looks cool - and I'm guessing there is some very interesting information about its performance relative to other options it would also be good to have explained in the docs :-) |
Apologies. It seems like under some circumstances the hashing fails and results in a 0 value (i.e. If I add some dummy blank nodes from pprint import pprint
from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph
g1t = """@prefix : <http://test.org#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
[] rdfs:seeAlso [].
:C1 a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:minCardinality "1"^^xsd:int ;
owl:onProperty rdfs:label
] ;
rdfs:label "C1"@en ."""
g2t = (
g1t
+ """ :C2 a owl:Class ;
rdfs:subClassOf [ a owl:Restriction ;
owl:minCardinality "1"^^xsd:int ;
owl:onProperty rdfs:comment
] ;
rdfs:label "C2"@en ."""
)
g1 = Graph()
g2 = Graph()
g1.parse(data=g1t, format="turtle")
# g1.parse(source="profile.ttl", format='turtle')
# g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format="turtle")
#g1c = to_canonical_graph(g1).skolemize()
#g2c = to_canonical_graph(g2).skolemize()
#g1cs = g1c.skolemize()
#g2cs = g2c.skolemize()
#print(g1cs.serialize(format="turtle").decode())
#print(g2cs.serialize(format="turtle").decode())
in_both, in_first, in_second = graph_diff(g1, g2)
print("should be empty as all in g1 is in g2")
print(in_first.serialize(format="turtle").decode())
in_both, in_first, in_second = graph_diff(g1, g2)
print("should be empty as all in g1 is in g1")
print(in_first.serialize(format="turtle").decode()) (hopefully this time it is not just a bug again 😄) output:
So yes there is clearly a bug here.
Not the original author. I am not sure when I will get time to look at this, or if I will at all, but at least the problem is clearer now. |
So this is a workaround until a fix is made - this could be used to replace the current graph_diff() code... def graph_diff_fix( g1: Graph, g2:Graph ) :
|
@jimmccusker as far as I can tell you wrote the diff algorithm here, is it documented somewhere? I'm trying to make sense of it to see why its going wrong. |
Okay I did some digging, the algorithm is explained in chapter 06 of this document: WEBSIG: A DIGITAL SIGNATURE FRAMEWORK FOR THE WEB
This references: Still not sure what is actually wrong with it. Once #1330 is merged I will merge some type annotations that made it slightly easier for me to understand some of what is happening and I will also merge some tests marked with I think it may be worth considering other graph canonicalization algorithms, for example: Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes which was implemented by RDF4J |
Yeah. I only wrote the test to see if the graphs are different, not the one
that checks to see what is different between the graphs. However, if two
bnodes are connected via another bnode, that will change their color. Do
you have an example graph pair that causes the problem?
Thanks,
Jamie
On Sun, May 30, 2021 at 6:33 PM Iwan Aucamp ***@***.***> wrote:
Okay I did some digging, the algorithm is explained in ch06 of this
document: http://tw.rpi.edu/media/2015/09/21/8275/websig_jpm_thesis.pdf
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1294 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAETCEOHMM6BGLB3WMAVQHDTQK4JXANCNFSM43N7VMEQ>
.
--
Jamie McCusker (they<she)
Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
***@***.*** ***@***.***>
http://tw.rpi.edu
|
see the in-line examples earlier in this issue... |
Also add type annotations to rdflib.compare to make it easier to understand the code.
Also add type annotations to rdflib.compare to make it easier to understand the code.
Also add type annotations to rdflib.compare to make it easier to understand the code.
Also add type annotations to rdflib.compare to make it easier to understand the code.
Add @expectedfailure unit tests for #1294 and type annotations for compare.py
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
If both graph has multiple blank nodes graph diff fails to detect matches. It succeeds for a single blank node case.
code:
The text was updated successfully, but these errors were encountered: