Graph_diff fails on presence of unrelated blank nodes #1294

rob-metalinkage · 2021-04-23T06:20:45Z

If both graph has multiple blank nodes graph diff fails to detect matches. It succeeds for a single blank node case.

code:

from pprint import pprint

from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph

g1t = """@prefix : <http://test.org#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:C1  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:label
                         ] ;
        rdfs:label   "C1"@en ."""

g2t =  g1t + """ :C2  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:comment
                         ] ;
        rdfs:label   "C2"@en ."""
g1 = Graph()
g2 = Graph()

g1.parse(data=g1t, format='turtle')
# g1.parse(source="profile.ttl", format='turtle')
#g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format='turtle')


in_both, in_first, in_second = graph_diff(g1,g2)
print("should be empty as all in g1 is in g2")
pprint(in_first.serialize(format='turtle'))

in_both, in_first, in_second = graph_diff(g1,g1)

print("should be empty as all in g1 is in g1")
pprint(in_first.serialize(format='turtle'))


output: (which should be things only in first (in g1) - g2 is g1 + stuff so this should always be empty..

should be empty as all in g1 is in g2

(b'@prefix ns1: <http://www.w3.org/2002/07/owl#> .\n@prefix rdfs: <http://ww'
 b'w.w3.org/2000/01/rdf-schema#> .\n@prefix xsd: <http://www.w3.org/2001/XML'
 b'Schema#> .\n\n<http://test.org#C1> rdfs:subClassOf [ a ns1:Restriction ;\n '
 b'           ns1:minCardinality "1"^^xsd:int ;\n            ns1:onProperty '
 b'rdfs:label ] .\n\n')

should be empty as all in g1 is in g1

b'\n'

The text was updated successfully, but these errors were encountered:

aucampia · 2021-05-12T22:45:38Z

How should rdflib know that the blank node in g1 is the same as the one in g2? I don't think anything in rdf semantics make them the same, and not even in rdfs or owl semantics AFAIK. So I don't think this behaviour is incorrect, but I may be wrong, if you can provide some reference to the standards that dictate they should be the same I will have a look.

Regardless of standards/specs however, if there is some way in which you think equivalence can be established consistently I think it would be good to know, it would be useful to have.

I think it may actually be a bug that it works if there is only one blank node however, I don't think blank nodes in independent graphs should ever be equivalent with pure rdf semantics. I can see cases for OWL where that would be different though.

aucampia · 2021-05-12T22:55:46Z

I guess this could be done: Blank Node Matching and RDF/S Comparison Functions - but it is not standard RDF semantics, so it should likely not go inside the normal graph diff function.

aucampia · 2021-05-12T23:15:43Z

I see you import to_canonical_graph but you don't use it. Can you give it a try?

aucampia · 2021-05-12T23:27:57Z

Comparing skolemized canonical graphs seems to yield the result you expect:

from pprint import pprint

from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph

g1t = """@prefix : <http://test.org#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

:C1  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:label
                         ] ;
        rdfs:label   "C1"@en ."""

g2t = (
    g1t
    + """ :C2  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:comment
                         ] ;
        rdfs:label   "C2"@en ."""
)
g1 = Graph()
g2 = Graph()

g1.parse(data=g1t, format="turtle")
# g1.parse(source="profile.ttl", format='turtle')
# g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format="turtle")


g1c = to_canonical_graph(g1).skolemize()
g2c = to_canonical_graph(g2).skolemize()
g1cs = g1c.skolemize()
g2cs = g2c.skolemize()

in_both, in_first, in_second = graph_diff(g1cs, g1cs)
print("should be empty as all in g1 is in g2")
print(in_first.serialize(format="turtle").decode())

in_both, in_first, in_second = graph_diff(g1cs, g1cs)

print("should be empty as all in g1 is in g1")
print(in_first.serialize(format="turtle").decode())

output:

$ poetry run python src/aucampia/scratchpad/rdflib-000.py 
should be empty as all in g1 is in g2


should be empty as all in g1 is in g1

rob-metalinkage · 2021-05-13T00:01:33Z

Thanks for the feedback..

to_canonical_graph() is used within graph_diff

the output of to_canonical graph illustrates the problem:

(rdflib.term.URIRef('http://test.org#C1'), rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#subClassOf'), rdflib.term.BNode('cb0'))

multiple graphs create 'cb0' node ids which get incorrectly diffed as a result.

I tested you code and it doesnt help - you have an error in you are comparing the first graph to itself..

graph_diff(g1cs, g1cs)

should be graph_diff(g1cs, g2cs)

rob-metalinkage · 2021-05-13T00:13:33Z

PS - thanks for helping - if you are the original author could you please create a PR to drop in a comment saying where the algorithm came from and add some tests ? It looks cool - and I'm guessing there is some very interesting information about its performance relative to other options it would also be good to have explained in the docs :-)

aucampia · 2021-05-13T00:29:19Z

@rob-metalinkage

I tested you code and it doesnt help - you have an error in you are comparing the first graph to itself..

Apologies. It seems like under some circumstances the hashing fails and results in a 0 value (i.e. cb0), and for some reason having one blank node seems to cause this. I guess this may be at the core of the problem.

If I add some dummy blank nodes [] rdfs:seeAlso []. to your graphs it works:

from pprint import pprint

from rdflib import Graph
from rdflib.compare import graph_diff, to_canonical_graph

g1t = """@prefix : <http://test.org#> .
@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix xsd:   <http://www.w3.org/2001/XMLSchema#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

[] rdfs:seeAlso [].

:C1  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:label
                         ] ;
        rdfs:label   "C1"@en ."""

g2t = (
    g1t
    + """ :C2  a          owl:Class ;
        rdfs:subClassOf  [ a                   owl:Restriction ;
                           owl:minCardinality  "1"^^xsd:int ;
                           owl:onProperty      rdfs:comment
                         ] ;
        rdfs:label   "C2"@en ."""
)
g1 = Graph()
g2 = Graph()

g1.parse(data=g1t, format="turtle")
# g1.parse(source="profile.ttl", format='turtle')
# g2.parse(source="source.ttl", format='turtle')
g2.parse(data=g2t, format="turtle")


#g1c = to_canonical_graph(g1).skolemize()
#g2c = to_canonical_graph(g2).skolemize()
#g1cs = g1c.skolemize()
#g2cs = g2c.skolemize()

#print(g1cs.serialize(format="turtle").decode())
#print(g2cs.serialize(format="turtle").decode())


in_both, in_first, in_second = graph_diff(g1, g2)
print("should be empty as all in g1 is in g2")
print(in_first.serialize(format="turtle").decode())

in_both, in_first, in_second = graph_diff(g1, g2)

print("should be empty as all in g1 is in g1")
print(in_first.serialize(format="turtle").decode())

(hopefully this time it is not just a bug again 😄)

output:

$ poetry run python src/aucampia/scratchpad/rdflib-000.py 
should be empty as all in g1 is in g2


should be empty as all in g1 is in g1

So yes there is clearly a bug here.

if you are the original author could you please create a PR to drop in a comment saying where the algorithm came from and add some tests ?

Not the original author.

I am not sure when I will get time to look at this, or if I will at all, but at least the problem is clearer now.

rob-metalinkage · 2021-05-14T01:52:44Z

So this is a workaround until a fix is made - this could be used to replace the current graph_diff() code...

def graph_diff_fix( g1: Graph, g2:Graph ) :
""" workaround for blank node bug in rdflib.compare.graph_diff"""
dummyg = Graph()

dummy = URIRef("http://dummy.fix")

dummyg.parse(data=" [] <http://dummy.fix> [] . " , format="turtle")
b,f,s = graph_diff(g1+dummyg, g2+dummyg)
#  clean out dummy node
for bmn in b.subjects( predicate=dummy):
    b.remove ( ( bmn , None, None ))
return  b, f, s

aucampia · 2021-05-30T19:57:17Z

@jimmccusker as far as I can tell you wrote the diff algorithm here, is it documented somewhere? I'm trying to make sense of it to see why its going wrong.

aucampia · 2021-05-30T22:32:48Z

Okay I did some digging, the algorithm is explained in chapter 06 of this document: WEBSIG: A DIGITAL SIGNATURE FRAMEWORK FOR THE WEB

We present a new digest algorithm for Resource Description Framework (RDF) graphs called RGDA1 (RDF Graph Digest Algorithm 1) based on the digest method from Sayers and Karp and the graph canonicalization method traces from McKay and Piperno.

This references:

Still not sure what is actually wrong with it. Once #1330 is merged I will merge some type annotations that made it slightly easier for me to understand some of what is happening and I will also merge some tests marked with @unittest.expectedFailure that is representative of the problem being experienced here.

I think it may be worth considering other graph canonicalization algorithms, for example: Canonical Forms for Isomorphic and Equivalent RDF Graphs: Algorithms for Leaning and Labelling Blank Nodes which was implemented by RDF4J

jpmccu · 2021-05-31T01:16:29Z

Yeah. I only wrote the test to see if the graphs are different, not the one that checks to see what is different between the graphs. However, if two bnodes are connected via another bnode, that will change their color. Do you have an example graph pair that causes the problem? Thanks, Jamie

On Sun, May 30, 2021 at 6:33 PM Iwan Aucamp ***@***.***> wrote: Okay I did some digging, the algorithm is explained in ch06 of this document: http://tw.rpi.edu/media/2015/09/21/8275/websig_jpm_thesis.pdf — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1294 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAETCEOHMM6BGLB3WMAVQHDTQK4JXANCNFSM43N7VMEQ> .

-- Jamie McCusker (they<she) Director, Data Operations Tetherless World Constellation Rensselaer Polytechnic Institute ***@***.*** ***@***.***> http://tw.rpi.edu

rob-metalinkage · 2021-05-31T02:22:46Z

see the in-line examples earlier in this issue...

Also add type annotations to rdflib.compare to make it easier to understand the code.

@expectedfailure

Add @expectedfailure unit tests for #1294 and type annotations for compare.py

aucampia added a commit to aucampia/rdflib that referenced this issue Jun 27, 2021

Add unit tests for issue RDFLib#1294 with @expectedfailure

07ef33c

aucampia added a commit to aucampia/rdflib that referenced this issue Jun 27, 2021

Add unit tests for issue RDFLib#1294 with @expectedfailure

2d85679

aucampia mentioned this issue Jun 27, 2021

Add @expectedFailure unit tests for #1294 and type annotations for compare.py #1346

Merged

white-gecko mentioned this issue Jun 30, 2021

WIP: Add Atomic Graph AKSW/QuitStore#262

Draft

aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021

Add unit tests for issue RDFLib#1294 with @expectedfailure

60594b0

Also add type annotations to rdflib.compare to make it easier to understand the code.

aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021

Add unit tests for issue RDFLib#1294 with @expectedfailure

4664767

Also add type annotations to rdflib.compare to make it easier to understand the code.

aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021

Add unit tests for issue RDFLib#1294 with @expectedfailure

400052a

Also add type annotations to rdflib.compare to make it easier to understand the code.

aucampia added a commit to aucampia/rdflib that referenced this issue Jul 12, 2021

Add unit tests for issue RDFLib#1294 with @expectedfailure

bf9a616

Also add type annotations to rdflib.compare to make it easier to understand the code.

nicholascar added a commit that referenced this issue Jul 12, 2021

Merge pull request #1346 from iafork/iwana-issue1294-000

b81eaa4

Add @expectedfailure unit tests for #1294 and type annotations for compare.py

ghost locked and limited conversation to collaborators Dec 27, 2021

ghost converted this issue into discussion #1610 Dec 27, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Graph_diff fails on presence of unrelated blank nodes #1294

Graph_diff fails on presence of unrelated blank nodes #1294

rob-metalinkage commented Apr 23, 2021 •

edited

Loading

aucampia commented May 12, 2021 •

edited

Loading

aucampia commented May 12, 2021

aucampia commented May 12, 2021

aucampia commented May 12, 2021

rob-metalinkage commented May 13, 2021 •

edited

Loading

rob-metalinkage commented May 13, 2021

aucampia commented May 13, 2021 •

edited

Loading

rob-metalinkage commented May 14, 2021

aucampia commented May 30, 2021

aucampia commented May 30, 2021 •

edited

Loading

jpmccu commented May 31, 2021 via email

rob-metalinkage commented May 31, 2021

This issue was moved to a discussion.

This issue was moved to a discussion.

Graph_diff fails on presence of unrelated blank nodes #1294

Graph_diff fails on presence of unrelated blank nodes #1294

Comments

rob-metalinkage commented Apr 23, 2021 • edited Loading

aucampia commented May 12, 2021 • edited Loading

aucampia commented May 12, 2021

aucampia commented May 12, 2021

aucampia commented May 12, 2021

rob-metalinkage commented May 13, 2021 • edited Loading

rob-metalinkage commented May 13, 2021

aucampia commented May 13, 2021 • edited Loading

rob-metalinkage commented May 14, 2021

aucampia commented May 30, 2021

aucampia commented May 30, 2021 • edited Loading

jpmccu commented May 31, 2021 via email

rob-metalinkage commented May 31, 2021

This issue was moved to a discussion.

rob-metalinkage commented Apr 23, 2021 •

edited

Loading

aucampia commented May 12, 2021 •

edited

Loading

rob-metalinkage commented May 13, 2021 •

edited

Loading

aucampia commented May 13, 2021 •

edited

Loading

aucampia commented May 30, 2021 •

edited

Loading