Trailing backslash in literal causes from_n3 to throw exception #546

Closed
scossu opened this issue Nov 20, 2015 · 2 comments · Fixed by #548

scossu commented Nov 20, 2015

Steps to reproduce (Python 3.4):

>>> from rdflib.util import from_n3
>>> sample = "\"Sample string with trailing backslash\\\"^^xsd:string"
>>> from_n3(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/rdflib/util.py", line 183, in from_n3
    value = value.encode("raw-unicode-escape").decode("unicode-escape")
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 37: \ at end of string

I am aware of the trailing backslash problem with raw strings: https://docs.python.org/2/faq/design.html#why-can-t-raw-strings-r-strings-end-with-a-backslash

I would like to know whether this is a rdflib bug and, if not, how I should handle my string before passing it to from_n3.
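
The failing line from the traceback can be exercised without rdflib. The sketch below (Python 3, standard library only) assumes that by the time line 183 runs, `value` holds the literal's inner text with a single trailing backslash; that is an inference from "position 37" in the error message, not something confirmed from the rdflib source.

```python
# Stand-in for the literal's inner text at the failing line.
# Assumption: the quotes and the datatype suffix are already stripped
# and exactly one trailing backslash is left.
value = "Sample string with trailing backslash\\"

try:
    # Same expression as line 183 of rdflib/util.py in the traceback.
    decoded = value.encode("raw-unicode-escape").decode("unicode-escape")
    error = None
except UnicodeDecodeError as exc:
    decoded = None
    error = exc

# Index of the trailing backslash in the inner text.
print(value.index("\\"))

# The codec's complaint about the dangling backslash.
print(error)
```

A lone backslash at the end of the input is an incomplete escape sequence for the `unicode-escape` codec, which is why decoding fails instead of returning the backslash.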

joernhees added the bug, fix-in-progress, and parsing labels on Nov 22, 2015
joernhees added this to the rdflib 4.2.2 milestone on Nov 22, 2015
joernhees self-assigned this on Nov 22, 2015
joernhees (Member) commented

This is a tricky one, as it keeps confusing my brain with several encoding/escaping layers from Python and N3 itself. Python 2.7 vs. 3.4 doesn't seem to be part of the problem here, which is why I'll just use Python 2.7.

To start off, let me put what I think you want in a foo.n3 file:

<foo:s> <foo:p> "Sample string with trailing backslash\\"^^<http://www.w3.org/2001/XMLSchema#string> .

As you can see, the N3 representation already needs to escape the \, as otherwise the \ would itself escape the following ".

With that file we can do this:

In [1]: import rdflib
INFO:rdflib:RDFLib Version: 4.2.1

In [2]: g = rdflib.Graph()

In [3]: g.parse('foo.n3', format='n3')
Out[3]: <Graph identifier=Nb7a7399152c14612a6443bdb3c96453d (<class 'rdflib.graph.Graph'>)>

In [4]: list(g)
Out[4]:
[(rdflib.term.URIRef(u'foo:s'),
  rdflib.term.URIRef(u'foo:p'),
  rdflib.term.Literal(u'Sample string with trailing backslash\\', datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#string')))]

In [5]: lit = list(g)[0][2]

In [6]: lit
Out[6]: rdflib.term.Literal(u'Sample string with trailing backslash\\', datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#string'))

In [7]: lit.n3()
Out[7]: u'"Sample string with trailing backslash\\\\"^^<http://www.w3.org/2001/XMLSchema#string>'

In [8]: print lit
Sample string with trailing backslash\

In [9]: print lit.n3()
"Sample string with trailing backslash\\"^^<http://www.w3.org/2001/XMLSchema#string>

Actually Out[7] is the most interesting one, as it shows that to represent the N3 string in Python we need to double-escape it. I kind of abuse print here to strip the Python quoting and escaping layer and show what arrives at N3 / the end user.

Now your version used " quotes for the Python string, so let's explore:

In [1]: sample = "\"Sample string with trailing backslash\\\"^^xsd:string"

In [2]: sample
Out[2]: '"Sample string with trailing backslash\\"^^xsd:string'

In [3]: print sample
"Sample string with trailing backslash\"^^xsd:string

The last one shows one of the problems: your version lacks a second \ at the N3 level, and each N3-level \ must be doubled again for the Python layer. So you need 5 \ in the Python source (4 for the escaped backslash, 1 to escape the closing ") if you want to use " quotes:

In [4]: sample = "\"Sample string with trailing backslash\\\\\"^^xsd:string"

In [5]: sample
Out[5]: '"Sample string with trailing backslash\\\\"^^xsd:string'
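
To make the backslash counting above easier to check, here is a small sketch (Python 3, standard library only) that compares the original three-backslash literal with the five-backslash one. The variable names `three`, `five` and `raw` are only illustrative; the raw-string spelling is an alternative way to write the same text and is not taken from the issue.

```python
# Original sample from the issue: three backslashes in the Python source.
three = "\"Sample string with trailing backslash\\\"^^xsd:string"

# Corrected sample: five backslashes in the Python source.
five = "\"Sample string with trailing backslash\\\\\"^^xsd:string"

# The same text as a raw string with ' delimiters, so nothing is
# escaped at the Python layer and only the N3-level escape remains.
raw = r'"Sample string with trailing backslash\\"^^xsd:string'

# Number of backslash characters that actually reach from_n3.
print(three.count("\\"))
print(five.count("\\"))

# The five-backslash literal and the raw string are the same text.
print(five == raw)

# print() removes the Python repr layer and shows the N3-level text.
print(five)
```

Only the `five` / `raw` variant carries an escaped backslash (`\\`) at the N3 level; in `three` the single remaining backslash escapes the closing quote instead.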

Why "one of the problems"? Well, because it still doesn't work 💃 👊 :

In [6]: from rdflib.util import from_n3
INFO:rdflib:RDFLib Version: 4.2.1

In [7]: from_n3(sample)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-2ac3b89358bc> in <module>()
----> 1 from_n3(sample)

/usr/local/lib/python2.7/site-packages/rdflib/util.pyc in from_n3(s, default, backend, nsm)
    181         # Hack: this should correctly handle strings with either native unicode
    182         # characters, or \u1234 unicode escapes.
--> 183         value = value.encode("raw-unicode-escape").decode("unicode-escape")
    184         return Literal(value, language, datatype)
    185     elif s == 'true' or s == 'false':

UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 37: \ at end of string

Great, and that's where it becomes fishy. rdflib.util.from_n3 is a utility function that tries to give us an rdflib.term representation of a single N3 term. It uses a completely different code base (< 100 lines) than the full N3 parser (> 1900 lines). My current guess is that one of 3a62086 and 8ea987e (accounting for 3b10849 in Literal.n3()) introduced a bug by replacing \\\\ with \\ before .decode('unicode-escape') does the same. So, IMO, the escapes were erroneously reduced twice:

In [1]: 'foo\\\\bar'.encode('raw-unicode-escape')
Out[1]: 'foo\\\\bar'

In [2]: 'foo\\\\bar'.encode('raw-unicode-escape').decode('unicode-escape')
Out[2]: u'foo\\bar'

In [3]: 'foo\\\\bar'.decode('unicode-escape')
Out[3]: u'foo\\bar'

In [4]: 'foo\\bar'.decode('unicode-escape')
Out[4]: u'foo\x08ar'

(The last one is obviously wrong: the \b was decoded as a backspace character, \x08.)
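
The suspected double reduction can be sketched in Python 3 with the standard library only. `reduce_once` and `reduce_twice` are hypothetical helper names; `reduce_twice` models the guess above (a replace step before the decode) and is not copied from rdflib.

```python
def reduce_once(value):
    """Resolve backslash escapes a single time via the unicode-escape codec."""
    return value.encode("raw-unicode-escape").decode("unicode-escape")


def reduce_twice(value):
    """Hypothetical model of the bug: collapse doubled backslashes first,
    then let the codec resolve escapes again."""
    value = value.replace("\\\\", "\\")
    return reduce_once(value)


# Inner text of the N3 literal: two backslash characters at the end.
inner = "Sample string with trailing backslash\\\\"

# A single reduction yields one trailing backslash, as intended.
once = reduce_once(inner)

# The double reduction leaves a dangling backslash for the codec.
try:
    reduce_twice(inner)
    twice_error = None
except UnicodeDecodeError as exc:
    twice_error = exc

print(repr(once))
print(twice_error)

# With a character after the backslashes, the double reduction does not
# fail but silently corrupts the text (\b becomes a backspace).
print(repr(reduce_twice("foo\\\\bar")))
```

If the guess is right, dropping the extra replace step and reducing the escapes only once would handle both the trailing-backslash case and the `foo\\bar` case.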

I'll make a pull request (up for discussion) to fix that, see #548.

joernhees (Member) commented

For more inconsistencies in from_n3, also see #549.

joernhees added a commit to joernhees/rdflib that referenced this issue Nov 22, 2015
joernhees added a commit that referenced this issue Nov 28, 2015
fix double reduction of \ escapes in from_n3, fixes #546
joernhees added a commit that referenced this issue Jan 28, 2016
* master: (49 commits)
  Update reference to "Emulating container types"
  Avoid class reference to imported function
  Prevent RDFa parser from failing on time elements with child nodes
  Second proposed fix for the broken top_level.txt
  make Prologue and Query new style classes
  DOC: minor typo in paramater
  DOC: unamed -> unnamed
  AuditableStore.commit does not call self.store.commit anymore
  ignore operations with no effect
  fixed trivial copy-paste bug
  added test cases for AuditableStore
  expanded path comparison ops in order to keep py2.6 support and not use total_ordering
  let paths be comparable against all nodes. Fixes #545
  re-introduces special handling for DCTERMS.title and test for it
  Fix initBindings handling. Fixes #294
  added .n3 methods for path objects
  Made ClosedNamespace (and _RDFNamespace) inherit from Namespace
  cleaned up trailing whitespace
  Small but nice SPARQL Optimisation fix
  test for #546 from_n3 trailing backslash
  ...