-
Notifications
You must be signed in to change notification settings - Fork 546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Turtle serializing creates invalid PNames #1661
Comments
would be interesting to know if this works with RDF4J and Jena, will take a look when I can, possibly only next week somewhere. |
AFAICS this is not valid Turtle. The parens need to be escaped, like The nt cannot be the same as NTriples cannot contain PNames. (Also, the |
Ok, so it is not a valid turtle, but is it a valid nt?: Yes, the ns2 was a typo. |
Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName. I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping. |
I created the triple by parsing the nt triple and serializing it as ttl. So the problem is actually with the serializer. |
@GreenfishK can you please create a new issue to indicate the serialization error and close this one? Just so we can keep on top of all the issues. Thanks. |
Is it OK to just rephrase the title? The issue is with serializing (and thus round-tripping) the given data. I took a stab at it, and a fix in import re
...
# From: https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC
# Note that '_.-' are OK within, but need escaping at start/end.
PNAME_LOCAL_ESCAPES = re.compile(r"([~!$&'()*+,;=/?#@%])")
...
def getQName(self, uri, gen_prefix=True): # should be renamed to get_pname ...
....
local = PNAME_LOCAL_ESCAPES.sub(r"\\\1", local)
.... (Sorry for just suggesting the fix in a comment. If I find more time I can make a PR, albeit I suspect I'd like to overhaul the Turtle handling thoroughly (using PNames consistently), and then I'd end up rewriting this and the parser.) |
I've just updated the title.
The Turtle parser is ancient and a bit of a mess. The N-Triples parser/serializer also needs attention. It would be great to see an overhaul although I fear this would be a large task. I suspect that modern Python domain-specific language handling could greatly improve things here but I don't know the area much... ...and the current parsers/serializers to mostly work of course! Recently I added the @niklasl if you want to do a major rewrite of any of these things, let us know! It could be a great consider that a great rdflib 7.0.0 candidate update. However, it would also be great just to see a small PR for this PName fix by itself! |
I will see about integrating @niklasl 's suggestion tonight. I think we should have a discussion about the future of the parsers though, or at least discuss some options, but like @nicholascar said for now we will do our best to get the ones we have in the best shape and hopefully that leaves us with a good test suite to ensure future parsers behave well. |
Thank you all! |
I have been looking at this a bit, the only cases in which invalid PNames are returned are:
In all these cases there is no problem:
And the reason it fails is actually because of is_ncname in rdflib.namespace: rdflib/rdflib/namespace/__init__.py Lines 643 to 669 in c0bd5ea
Basically, if "%", "(" and ")" is removed from there all turtle test pass, of course that is not a fix for this, but rather this code is XML specific and should probably not run at all for Turtle based syntax. @niklasl 's suggestion will fix it though, but will have to dig a bit into this at some point to try and rationalize what is going on here. |
Draft fix in #1678 - however this causes one regression, in that it breaks percent encoded stuff:
Possibly we should just go very conservative and only deal with So something like what @niklasl suggested but with: PNAME_LOCAL_ESCAPES = re.compile(r"([()])") Potentially also with a different name. I think there may be better places to fix this though, but will look at it more tomorrow. I think a targetted fix for only the brackets may make sense though, and then we can unravel the rest separately. |
Sure, right down to solving the particular motivating Issue here and no more, unless there are some obvious other cases of things to handle. But don't put in much more effort than that! |
@GreenfishK #1678 should address your issue, but there may still be some further issues with turtle pname serialization that need fixing. |
Consider following triple from DBPedia serialized as .ttl:
test.ttl
It is not possible to parse this triple due to the brackets in ns2:spouses(s)_ in a .ttl file. However, there is no such problem for an .nt file.
Code to parse the test.ttl file with the triple inside.
Exception:
The text was updated successfully, but these errors were encountered: