Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Graph_diff fails on presence of unrelated blank nodes #1294

Closed
rob-metalinkage opened this issue Apr 23, 2021 · 12 comments
Closed

Graph_diff fails on presence of unrelated blank nodes #1294

rob-metalinkage opened this issue Apr 23, 2021 · 12 comments

Comments

@rob-metalinkage
Copy link

rob-metalinkage commented Apr 23, 2021

If both graph has multiple blank nodes graph diff fails to detect matches. It succeeds for a single blank node case.

code:

from pprint import pprint

from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph

g1t = """@prefix : <http://test.org#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:C1  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:label
                         ] ;
        rdfs:label   "C1"@en ."""

g2t =  g1t + """ :C2  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:comment
                         ] ;
        rdfs:label   "C2"@en ."""
g1 = Graph()
g2 = Graph()

g1.parse(data=g1t, format='turtle')
# g1.parse(source="profile.ttl", format='turtle')
#g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format='turtle')


in_both, in_first, in_second = graph_diff(g1,g2)
print("should be empty as all in g1 is in g2")
pprint(in_first.serialize(format='turtle'))

in_both, in_first, in_second = graph_diff(g1,g1)

print("should be empty as all in g1 is in g1")
pprint(in_first.serialize(format='turtle'))


output: (which should be things only in first (in g1) - g2 is g1 + stuff so this should always be empty..

should be empty as all in g1 is in g2

(b'@prefix ns1: <http://www.w3.org/2002/07/owl#> .\n@prefix rdfs: <http://ww'
 b'w.w3.org/2000/01/rdf-schema#> .\n@prefix xsd: <http://www.w3.org/2001/XML'
 b'Schema#> .\n\n<http://test.org#C1> rdfs:subClassOf [ a ns1:Restriction ;\n '
 b'           ns1:minCardinality "1"^^xsd:int ;\n            ns1:onProperty '
 b'rdfs:label ] .\n\n')

should be empty as all in g1 is in g1

b'\n'
@aucampia
Copy link
Member

aucampia commented May 12, 2021

How should rdflib know that the blank node in g1 is the same as the one in g2? I don't think anything in rdf semantics make them the same, and not even in rdfs or owl semantics AFAIK. So I don't think this behaviour is incorrect, but I may be wrong, if you can provide some reference to the standards that dictate they should be the same I will have a look.

Regardless of standards/specs however, if there is some way in which you think equivalence can be established consistently I think it would be good to know, it would be useful to have.

I think it may actually be a bug that it works if there is only one blank node however, I don't think blank nodes in independent graphs should ever be equivalent with pure rdf semantics. I can see cases for OWL where that would be different though.

@aucampia
Copy link
Member

I guess this could be done: Blank Node Matching and RDF/S Comparison Functions - but it is not standard RDF semantics, so it should likely not go inside the normal graph diff function.

@aucampia
Copy link
Member

I see you import to_canonical_graph but you don't use it. Can you give it a try?

@aucampia
Copy link
Member

Comparing skolemized canonical graphs seems to yield the result you expect:

from pprint import pprint

from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph

g1t = """@prefix : <http://test.org#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:C1  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:label
                         ] ;
        rdfs:label   "C1"@en ."""

g2t = (
    g1t
    + """ :C2  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:comment
                         ] ;
        rdfs:label   "C2"@en ."""
)
g1 = Graph()
g2 = Graph()

g1.parse(data=g1t, format="turtle")
# g1.parse(source="profile.ttl", format='turtle')
# g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format="turtle")


g1c = to_canonical_graph(g1).skolemize()
g2c = to_canonical_graph(g2).skolemize()
g1cs = g1c.skolemize()
g2cs = g2c.skolemize()

in_both, in_first, in_second = graph_diff(g1cs, g1cs)
print("should be empty as all in g1 is in g2")
print(in_first.serialize(format="turtle").decode())

in_both, in_first, in_second = graph_diff(g1cs, g1cs)

print("should be empty as all in g1 is in g1")
print(in_first.serialize(format="turtle").decode())

output:

$ poetry run python src/aucampia/scratchpad/rdflib-000.py 
should be empty as all in g1 is in g2


should be empty as all in g1 is in g1



@rob-metalinkage
Copy link
Author

rob-metalinkage commented May 13, 2021

Thanks for the feedback..

to_canonical_graph() is used within graph_diff

the output of to_canonical graph illustrates the problem:

(rdflib.term.URIRef('http://test.org#C1'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'), rdflib.term.BNode('cb0'))

  • multiple graphs create 'cb0' node ids which get incorrectly diffed as a result.

I tested you code and it doesnt help - you have an error in you are comparing the first graph to itself..

graph_diff(g1cs, g1cs)

  • should be graph_diff(g1cs, g2cs)

@rob-metalinkage
Copy link
Author

PS - thanks for helping - if you are the original author could you please create a PR to drop in a comment saying where the algorithm came from and add some tests ? It looks cool - and I'm guessing there is some very interesting information about its performance relative to other options it would also be good to have explained in the docs :-)

@aucampia
Copy link
Member

aucampia commented May 13, 2021

@rob-metalinkage

I tested you code and it doesnt help - you have an error in you are comparing the first graph to itself..

Apologies. It seems like under some circumstances the hashing fails and results in a 0 value (i.e. cb0), and for some reason having one blank node seems to cause this. I guess this may be at the core of the problem.

If I add some dummy blank nodes [] rdfs:seeAlso []. to your graphs it works:

from pprint import pprint

from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph

g1t = """@prefix : <http://test.org#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

[] rdfs:seeAlso [].

:C1  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:label
                         ] ;
        rdfs:label   "C1"@en ."""

g2t = (
    g1t
    + """ :C2  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:comment
                         ] ;
        rdfs:label   "C2"@en ."""
)
g1 = Graph()
g2 = Graph()

g1.parse(data=g1t, format="turtle")
# g1.parse(source="profile.ttl", format='turtle')
# g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format="turtle")


#g1c = to_canonical_graph(g1).skolemize()
#g2c = to_canonical_graph(g2).skolemize()
#g1cs = g1c.skolemize()
#g2cs = g2c.skolemize()

#print(g1cs.serialize(format="turtle").decode())
#print(g2cs.serialize(format="turtle").decode())


in_both, in_first, in_second = graph_diff(g1, g2)
print("should be empty as all in g1 is in g2")
print(in_first.serialize(format="turtle").decode())

in_both, in_first, in_second = graph_diff(g1, g2)

print("should be empty as all in g1 is in g1")
print(in_first.serialize(format="turtle").decode())

(hopefully this time it is not just a bug again 😄)

output:

$ poetry run python src/aucampia/scratchpad/rdflib-000.py 
should be empty as all in g1 is in g2


should be empty as all in g1 is in g1


So yes there is clearly a bug here.

if you are the original author could you please create a PR to drop in a comment saying where the algorithm came from and add some tests ?

Not the original author.

I am not sure when I will get time to look at this, or if I will at all, but at least the problem is clearer now.

@rob-metalinkage
Copy link
Author

So this is a workaround until a fix is made - this could be used to replace the current graph_diff() code...

def graph_diff_fix( g1: Graph, g2:Graph ) :
""" workaround for blank node bug in rdflib.compare.graph_diff"""
dummyg = Graph()

dummy = URIRef("http://dummy.fix")

dummyg.parse(data=" [] <http://dummy.fix> [] . " , format="turtle")
b,f,s = graph_diff(g1+dummyg, g2+dummyg)
#  clean out dummy node
for bmn in b.subjects( predicate=dummy):
    b.remove ( ( bmn , None, None ))
return  b, f, s

@aucampia
Copy link
Member

@jimmccusker as far as I can tell you wrote the diff algorithm here, is it documented somewhere? I'm trying to make sense of it to see why its going wrong.

@aucampia
Copy link
Member

aucampia commented May 30, 2021

Okay I did some digging, the algorithm is explained in chapter 06 of this document: WEBSIG: A DIGITAL SIGNATURE FRAMEWORK FOR THE WEB

We present a new digest algorithm for Resource Description Framework (RDF) graphs called RGDA1 (RDF Graph Digest Algorithm 1) based on the digest method from Sayers and Karp and the graph canonicalization method traces from McKay and Piperno.

This references:

Still not sure what is actually wrong with it. Once #1330 is merged I will merge some type annotations that made it slightly easier for me to understand some of what is happening and I will also merge some tests marked with @unittest.expectedFailure that is representative of the problem being experienced here.

I think it may be worth considering other graph canonicalization algorithms, for example: Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes which was implemented by RDF4J

@jpmccu
Copy link
Contributor

jpmccu commented May 31, 2021 via email

@rob-metalinkage
Copy link
Author

see the in-line examples earlier in this issue...

aucampia added a commit to aucampia/rdflib that referenced this issue Jun 27, 2021
aucampia added a commit to aucampia/rdflib that referenced this issue Jun 27, 2021
aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021
Also add type annotations to rdflib.compare to make it easier
to understand the code.
aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021
Also add type annotations to rdflib.compare to make it easier
to understand the code.
aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021
Also add type annotations to rdflib.compare to make it easier
to understand the code.
aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021
Also add type annotations to rdflib.compare to make it easier
to understand the code.
nicholascar added a commit that referenced this issue Jul 12, 2021
Add @expectedfailure unit tests for #1294 and type annotations for compare.py
@ghost ghost locked and limited conversation to collaborators Dec 27, 2021
@ghost ghost converted this issue into discussion #1610 Dec 27, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants