Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turtle serializing creates invalid PNames #1661

Closed
GreenfishK opened this issue Jan 11, 2022 · 14 comments · Fixed by #1678
Closed

Turtle serializing creates invalid PNames #1661

GreenfishK opened this issue Jan 11, 2022 · 14 comments · Fixed by #1678
Labels
bug Something isn't working in-resolution

Comments

@GreenfishK
Copy link

GreenfishK commented Jan 11, 2022

Consider following triple from DBPedia serialized as .ttl:

test.ttl

@prefix ns1: <http://dbpedia.org/property/> .
<http://dbpedia.org/resource/Maine_Mendoza> ns2:spouse(s)_ "Nate Winchester 2013-Present"@en .

It is not possible to parse this triple due to the brackets in ns2:spouses(s)_ in a .ttl file. However, there is no such problem for an .nt file.

Code to parse the test.ttl file with the triple inside.

from rdflib import Graph
g = Graph()
g.parse(test.ttl)

Exception:

Traceback (most recent call last):
  File "/home/fkovacev/PhD/RDFStarVersioningEval/OSTRICH/BEAR/scripts/build_tb_rdf_star_datasets.py", line 204, in <module>
    construct_tb_star_ds(source_ic0=data_dir + "/alldata.IC.nt/000001.nt",
  File "/home/fkovacev/PhD/RDFStarVersioningEval/OSTRICH/BEAR/scripts/build_tb_rdf_star_datasets.py", line 140, in construct_tb_star_ds
    cs_add.parse(source_cs + "/" + t[1])
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/graph.py", line 1261, in parse
    raise se
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/graph.py", line 1252, in parse
    parser.parse(source, self, **args)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1921, in parse
    p.loadStream(stream)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 438, in loadStream
    return self.loadBuf(stream.read())  # Not ideal
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 444, in loadBuf
    self.feed(buf)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 470, in feed
    i = self.directiveOrStatement(s, j)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 490, in directiveOrStatement
    j = self.statement(argstr, i)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 740, in statement
    j = self.property_list(argstr, i, r[0])
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1088, in property_list
    i = self.objectList(argstr, j, objs)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1132, in objectList
    i = self.object(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1418, in object
    j = self.subject(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 747, in subject
    return self.item(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 839, in item
    return self.path(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 846, in path
    j = self.nodeOrLiteral(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1446, in nodeOrLiteral
    j = self.node(argstr, i, res)
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1025, in node
    self.BadSyntax(argstr, i, "expected item in list or ')'")
  File "/home/fkovacev/anaconda3/envs/baseDSEnv/lib/python3.8/site-packages/rdflib/plugins/parsers/notation3.py", line 1653, in BadSyntax
    raise BadSyntax(self._thisDoc, self.lines, argstr, i, msg)
rdflib.plugins.parsers.notation3.BadSyntax: at line 3 of <>:
Bad syntax (expected item in list or ')') at ^ in:
"...b'> .\n\n<http://dbpedia.org/resource/Maine_Mendoza> ns1:spouse('^b's)_ "Nate Winchester 2013-Present"@en .\n\n\n'"

@aucampia
Copy link
Member

would be interesting to know if this works with RDF4J and Jena, will take a look when I can, possibly only next week somewhere.

@niklasl
Copy link
Member

niklasl commented Jan 13, 2022

AFAICS this is not valid Turtle. The parens need to be escaped, like ns2:spouse\(s\)_, otherwise this cannot be disambiguated from a list, like: <> ns2:spouse(ns2:s). (which is valid Turtle).

The nt cannot be the same as NTriples cannot contain PNames.

(Also, the ns2 prefix appears undeclared in the example; typo of ns1 I presume?)

@GreenfishK
Copy link
Author

GreenfishK commented Jan 13, 2022

AFAICS this is not valid Turtle. The parens need to be escaped, like ns2:spouse\(s\)_, otherwise this cannot be disambiguated from a list, like: <> ns2:spouse(ns2:s). (which is valid Turtle).

The nt cannot be the same as NTriples cannot contain PNames.

(Also, the ns2 prefix appears undeclared in the example; typo of ns1 I presume?)

Ok, so it is not a valid turtle, but is it a valid nt?:
<http://dbpedia.org/resource/Maine_Mendoza> <http://dbpedia.org/property/spouse(s)_> "Nate Winchester 2013-Present"@en .
The nt parser has no problem with it.

Yes, the ns2 was a typo.

@niklasl
Copy link
Member

niklasl commented Jan 13, 2022

Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName.

I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping.

@GreenfishK
Copy link
Author

Yes, that line is valid NT and valid Turtle. It's the PName (which is only in Turtle) which is invalid, as the parens must be escaped in a PName.

I think RDFLib outputs invalid PNames here though, so it would fail on round-tripping.

I created the triple by parsing the nt triple and serializing it as ttl. So the problem is actually with the serializer.

@GreenfishK GreenfishK changed the title Issue with parsing resources with round bracktes from a .ttl file Issue with parsing resources with round brackets from a .ttl file Jan 14, 2022
@nicholascar
Copy link
Member

@GreenfishK can you please create a new issue to indicate the serialization error and close this one? Just so we can keep on top of all the issues. Thanks.

@niklasl
Copy link
Member

niklasl commented Jan 15, 2022

Is it OK to just rephrase the title? The issue is with serializing (and thus round-tripping) the given data. I took a stab at it, and a fix in rdflib/plugins/serializers/turtle.py should be along the lines of:

import re
...
# From: https://www.w3.org/TR/turtle/#grammar-production-PN_LOCAL_ESC
# Note that '_.-' are OK within, but need escaping at start/end.
PNAME_LOCAL_ESCAPES = re.compile(r"([~!$&'()*+,;=/?#@%])")
...

    def getQName(self, uri, gen_prefix=True): # should be renamed to get_pname ...
        ....
        local = PNAME_LOCAL_ESCAPES.sub(r"\\\1", local)
        ....

(Sorry for just suggesting the fix in a comment. If I find more time I can make a PR, albeit I suspect I'd like to overhaul the Turtle handling thoroughly (using PNames consistently), and then I'd end up rewriting this and the parser.)

@nicholascar nicholascar changed the title Issue with parsing resources with round brackets from a .ttl file Turtle serializing creates invlid PNames Jan 16, 2022
@nicholascar nicholascar changed the title Turtle serializing creates invlid PNames Turtle serializing creates invalid PNames Jan 16, 2022
@nicholascar
Copy link
Member

I've just updated the title.

I'd end up rewriting this and the parser

The Turtle parser is ancient and a bit of a mess. The N-Triples parser/serializer also needs attention. It would be great to see an overhaul although I fear this would be a large task. I suspect that modern Python domain-specific language handling could greatly improve things here but I don't know the area much...

...and the current parsers/serializers to mostly work of course!

Recently I added the longturtle turtle variant serializer to allow for Git-friendly Turtle files, so I got into the serializer a bit there but just wrapped the underlying N3 serializer with some slightly different, still valid Turtle, conventions, rather than implementing an actual, new, serializer. I'm a coward!

@niklasl if you want to do a major rewrite of any of these things, let us know! It could be a great consider that a great rdflib 7.0.0 candidate update. However, it would also be great just to see a small PR for this PName fix by itself!

@aucampia
Copy link
Member

I will see about integrating @niklasl 's suggestion tonight. I think we should have a discussion about the future of the parsers though, or at least discuss some options, but like @nicholascar said for now we will do our best to get the ones we have in the best shape and hopefully that leaves us with a good test suite to ensure future parsers behave well.

@GreenfishK
Copy link
Author

Thank you all!

@aucampia
Copy link
Member

aucampia commented Jan 17, 2022

I have been looking at this a bit, the only cases in which invalid PNames are returned are:

FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-(] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-)] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>
FAILED test/test_turtle_quoting.py::test_serialize_correctness[turtle-%] - rdflib.plugins.parsers.notation3.BadSyntax: <no detail available>

In all these cases there is no problem:

PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-A]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-_]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-~]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-.]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle--]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-!]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-$]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-&]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-']
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-*]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-+]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-,]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-;]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-=]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-/]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-?]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-#]
PASSED test/test_turtle_quoting.py::test_serialize_correctness[turtle-@]

Test code: https://github.com/aucampia/rdflib/blob/092f228752a89e2c1eecac376a7c148e64a83241/test/test_turtle_quoting.py#L155-L176

And the reason it fails is actually because of is_ncname in rdflib.namespace:

ALLOWED_NAME_CHARS = ["\u00B7", "\u0387", "-", ".", "_", "%", "(", ")"]
# http://www.w3.org/TR/REC-xml-names/#NT-NCName
# [4] NCName ::= (Letter | '_') (NCNameChar)* /* An XML Name, minus
# the ":" */
# [5] NCNameChar ::= Letter | Digit | '.' | '-' | '_' | CombiningChar
# | Extender
def is_ncname(name):
if name:
first = name[0]
if first == "_" or category(first) in NAME_START_CATEGORIES:
for i in range(1, len(name)):
c = name[i]
if not category(c) in NAME_CATEGORIES:
if c in ALLOWED_NAME_CHARS:
continue
return 0
# if in compatibility area
# if decomposition(c)!='':
# return 0
return 1
return 0

Basically, if "%", "(" and ")" is removed from there all turtle test pass, of course that is not a fix for this, but rather this code is XML specific and should probably not run at all for Turtle based syntax.

@niklasl 's suggestion will fix it though, but will have to dig a bit into this at some point to try and rationalize what is going on here.

@aucampia
Copy link
Member

Draft fix in #1678 - however this causes one regression, in that it breaks percent encoded stuff:

$ .venv/bin/pytest  'test/test_issue801.py::TestIssue801::test_issue_801'
============================================================================ test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/github.com/iafork/rdflib, configfile: tox.ini
plugins: subtests-0.5.0, cov-3.0.0
collected 1 item                                                                                                                                                            

test/test_issue801.py F                                                                                                                                               [100%]

================================================================================= FAILURES ==================================================================================
________________________________________________________________________ TestIssue801.test_issue_801 ________________________________________________________________________

self = <test.test_issue801.TestIssue801 testMethod=test_issue_801>

    def test_issue_801(self):
        g = Graph()
        example = Namespace("http://example.org/")
        g.bind("", example)
        node = BNode()
        g.add((node, example["first%20name"], Literal("John")))
>       self.assertEqual(
            g.serialize(format="turtle").split("\n")[-3], '[] :first%20name "John" .'
        )
E       AssertionError: '[] :first\\%20name "John" .' != '[] :first%20name "John" .'
E       - [] :first\%20name "John" .
E       ?          -
E       + [] :first%20name "John" .

test/test_issue801.py:15: AssertionError
========================================================================== short test summary info ==========================================================================
FAILED test/test_issue801.py::TestIssue801::test_issue_801 - AssertionError: '[] :first\\%20name "John" .' != '[] :first%20name "John" .'
============================================================================= 1 failed in 0.18s =============================================================================

Possibly we should just go very conservative and only deal with ( and ) as these are actually the main problems.

So something like what @niklasl suggested but with:

PNAME_LOCAL_ESCAPES = re.compile(r"([()])")

Potentially also with a different name.

I think there may be better places to fix this though, but will look at it more tomorrow. I think a targetted fix for only the brackets may make sense though, and then we can unravel the rest separately.

@nicholascar
Copy link
Member

Possibly we should just go very conservative...

Sure, right down to solving the particular motivating Issue here and no more, unless there are some obvious other cases of things to handle. But don't put in much more effort than that!

@aucampia
Copy link
Member

@GreenfishK #1678 should address your issue, but there may still be some further issues with turtle pname serialization that need fixing.

@aucampia aucampia added bug Something isn't working in-resolution labels Jan 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working in-resolution
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants