Trailing backslash in literal causes from_n3 to throw exception #546

scossu · 2015-11-20T16:09:17Z

Steps to reproduce (Python 3.4):

>>> from rdflib.util import from_n3
>>> sample = "\"Sample string with trailing backslash\\\"^^xsd:string"
>>> from_n3(sample)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/rdflib/util.py", line 183, in from_n3
    value = value.encode("raw-unicode-escape").decode("unicode-escape")
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 37: \ at end of string

I am aware of the trailing backslash problem with raw strings: https://docs.python.org/2/faq/design.html#why-can-t-raw-strings-r-strings-end-with-a-backslash

I would like to know whether this is an rdflib bug and, if not, how I should handle my string before passing it to from_n3.

joernhees · 2015-11-22T00:47:25Z

This is a tricky one, as it keeps confusing my brain with several encoding/escaping layers from Python and N3 itself... Python 2.7 vs. 3.4 doesn't seem to be part of the problem here, which is why I'll just use Python 2.7.

To start off, let me just put what I think you want in a foo.n3 file:

<foo:s> <foo:p> "Sample string with trailing backslash\\"^^<http://www.w3.org/2001/XMLSchema#string> .

As you can see, the N3 representation already needs to escape the \, as otherwise it would itself escape the following ".

With that file we can do this:

In [1]: import rdflib
INFO:rdflib:RDFLib Version: 4.2.1

In [2]: g = rdflib.Graph()

In [3]: g.parse('foo.n3', format='n3')
Out[3]: <Graph identifier=Nb7a7399152c14612a6443bdb3c96453d (<class 'rdflib.graph.Graph'>)>

In [4]: list(g)
Out[4]:
[(rdflib.term.URIRef(u'foo:s'),
  rdflib.term.URIRef(u'foo:p'),
  rdflib.term.Literal(u'Sample string with trailing backslash\\', datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#string')))]

In [5]: lit = list(g)[0][2]

In [6]: lit
Out[6]: rdflib.term.Literal(u'Sample string with trailing backslash\\', datatype=rdflib.term.URIRef(u'http://www.w3.org/2001/XMLSchema#string'))

In [7]: lit.n3()
Out[7]: u'"Sample string with trailing backslash\\\\"^^<http://www.w3.org/2001/XMLSchema#string>'

In [8]: print lit
Sample string with trailing backslash\

In [9]: print lit.n3()
"Sample string with trailing backslash\\"^^<http://www.w3.org/2001/XMLSchema#string>

Actually, In [7] is the most interesting line, as it shows that representing the N3 string in Python requires escaping it twice: once for N3 and once for the Python string literal. I kind of abuse print here to strip the Python quoting and escaping layer and show what actually arrives at N3 / the end user.
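The layers can also be counted directly. Below is a minimal Python 3 sketch using only the standard library. The variable names are illustrative, and the codec round-trip is merely a stand-in for N3 unescaping, not a complete implementation of it.

```python
# The N3 text as it should reach the parser: the literal body ends in two
# backslashes, i.e. one escaped backslash at the N3 level.
n3_term = '"Sample string with trailing backslash\\\\"'

# Layer 1: Python source / repr() shows four backslashes.
source_level = repr(n3_term).count("\\")

# Layer 2: the actual characters (what print shows) contain two.
n3_level = n3_term.count("\\")

# Layer 3: the decoded literal value contains one. The codec round-trip is
# an assumed stand-in for N3 unescaping.
body = n3_term[1:-1]  # strip the surrounding double quotes
value = body.encode("raw-unicode-escape").decode("unicode-escape")
value_level = value.count("\\")

print(source_level, n3_level, value_level)
```

Each layer halves the number of backslashes, which is why it is easy to be off by one level when writing the term by hand.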

Now, your version used \" (escaped double quotes inside a double-quoted Python string), so let's explore:

In [1]: sample = "\"Sample string with trailing backslash\\\"^^xsd:string"

In [2]: sample
Out[2]: '"Sample string with trailing backslash\\"^^xsd:string'

In [3]: print sample
"Sample string with trailing backslash\"^^xsd:string

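The printed output shows one of the problems: the N3 text ends in \", which N3 reads as an escaped quote, so the literal is never closed. N3 needs \\ before the closing quote, and each of those backslashes has to be doubled again for the Python layer, plus one more to escape the quote itself. So you need 5 \ if you want to use " quotes: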

In [4]: sample = "\"Sample string with trailing backslash\\\\\"^^xsd:string"

In [5]: sample
Out[5]: '"Sample string with trailing backslash\\\\"^^xsd:string'
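If counting backslashes by hand feels error-prone, a raw string or a small helper removes one layer. This is only a sketch, assuming Python 3; `to_n3_string` is a hypothetical helper, not part of rdflib, and it handles nothing beyond backslashes and double quotes.

```python
# The five-backslash version, written as a regular double-quoted string.
escaped = "\"Sample string with trailing backslash\\\\\"^^xsd:string"

# A raw string drops the Python escaping layer; only the N3 layer remains.
# (A raw string may end in a quote, just not in an odd number of backslashes.)
raw = r'"Sample string with trailing backslash\\"^^xsd:string'


def to_n3_string(value, datatype="xsd:string"):
    """Hypothetical helper: quote a Python string as an N3 literal.

    Only backslashes and double quotes are escaped here; newlines and
    other special characters are ignored for brevity.
    """
    body = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"{}"^^{}'.format(body, datatype)


built = to_n3_string("Sample string with trailing backslash\\")
print(escaped == raw == built)
```

All three spellings produce the same characters, so the choice is purely about readability of the Python source.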

Why "one of the problems"? Well, cause it still doesn't work 💃 👊 :

In [6]: from rdflib.util import from_n3
INFO:rdflib:RDFLib Version: 4.2.1

In [7]: from_n3(sample)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-2ac3b89358bc> in <module>()
----> 1 from_n3(sample)

/usr/local/lib/python2.7/site-packages/rdflib/util.pyc in from_n3(s, default, backend, nsm)
    181         # Hack: this should correctly handle strings with either native unicode
    182         # characters, or \u1234 unicode escapes.
--> 183         value = value.encode("raw-unicode-escape").decode("unicode-escape")
    184         return Literal(value, language, datatype)
    185     elif s == 'true' or s == 'false':

UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 37: \ at end of string

Great, and that's where it now becomes fishy... rdflib.util.from_n3 is a utility function that tries to give us an rdflib.term representation of a single N3 term. It uses a completely different code base (< 100 lines) than the full N3 parser (> 1900 lines). My current guess is that one of 3a62086 and 8ea987e (accounting for 3b10849 in Literal.n3()) introduced a bug by replacing \\\\ with \\ before .decode('unicode-escape') does the same. So the escapes were, IMO, erroneously reduced twice:

In [1]: 'foo\\\\bar'.encode('raw-unicode-escape')
Out[1]: 'foo\\\\bar'

In [2]: 'foo\\\\bar'.encode('raw-unicode-escape').decode('unicode-escape')
Out[2]: u'foo\\bar'

In [3]: 'foo\\\\bar'.decode('unicode-escape')
Out[3]: u'foo\\bar'

In [4]: 'foo\\bar'.decode('unicode-escape')
Out[4]: u'foo\x08ar'

(The last one obviously being wrong.)
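The same effect can be reproduced on Python 3 without rdflib. The sketch below models the guess above: `single_reduce` and `double_reduce` are hypothetical names, and the `replace` call is an assumption about what the suspected commit does, not a copy of the rdflib source.

```python
def single_reduce(body):
    """Let the codec alone turn each escaped backslash into one backslash."""
    return body.encode("raw-unicode-escape").decode("unicode-escape")


def double_reduce(body):
    """Assumed buggy variant: halve the backslashes, then decode again."""
    body = body.replace("\\\\", "\\")
    return body.encode("raw-unicode-escape").decode("unicode-escape")


# N3 literal body ending in an escaped backslash (two characters).
body = "Sample string with trailing backslash\\\\"

print(repr(single_reduce(body)))

error = None
try:
    double_reduce(body)
except UnicodeDecodeError as exc:
    # After the first reduction a lone backslash is left at the end,
    # which the unicode-escape codec cannot decode.
    error = exc
    print("double reduction failed:", exc.reason)

# In the middle of a string the damage is silent: \b becomes a backspace.
print(repr(double_reduce("foo\\\\bar")))
```

With a single reduction the trailing backslash survives; with two, a trailing escape raises and an inner one is silently reinterpreted.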

I'll make a pull request (up for discussion) to fix that, see #548.

joernhees · 2015-11-22T00:52:00Z

for more inconsistencies of from_n3 also see #549

joernhees mentioned this issue Nov 22, 2015

fix double reduction of \ escapes in from_n3 #548

Merged

joernhees added the bug, fix-in-progress, and parsing labels Nov 22, 2015

joernhees added this to the rdflib 4.2.2 milestone Nov 22, 2015

joernhees self-assigned this Nov 22, 2015

joernhees mentioned this issue Nov 22, 2015

from_n3 erroneously unescapes \xhh #549

Closed

joernhees added a commit to joernhees/rdflib that referenced this issue Nov 22, 2015

test for RDFLib#546 from_n3 trailing backslash

3670738

joernhees closed this as completed in #548 Nov 28, 2015

joernhees added a commit that referenced this issue Nov 28, 2015

Merge pull request #548 from joernhees/fix_from_n3_backslashreduction

e261af3

fix double reduction of \ escapes in from_n3, fixes #546

pyup-bot mentioned this issue Jan 29, 2017

Update rdflib to 4.2.2 mytardis/mytardis#815

Merged

This was referenced Mar 16, 2017

Initial Update mozilla/amo-validator#510

Closed

Update rdflib to 4.2.2 mozilla/amo-validator#515

Closed
