NQuads: unicode escape issue #352

Hi

I have an issue parsing N-Quads with rdflib.

I have this string:

'Production date :: 2532\u20132503BC :: circa'

It is the object of one of my triples. You can see that it has a unicode escape for the dash between the two years: '\u2013'.

The triple is in a file (see attached). When rdflib parses it, I get an error:
`ValueError: chr() arg not in range(0x110000)`. This seems to be raised in py3compat.py.

Digging into it, it seems that the regex correctly identifies the start of the unicode escape, but the optional second group of four hex digits also grabs the '2503' that follows, so the match is '\u20132503' instead of '\u2013'. The resulting value, 0x20132503, is far above the 0x10FFFF maximum that chr() accepts. The line that is failing is:

``` python
import re

# 4 hex digits, optionally followed by 4 more (meant for \UXXXXXXXX)
r_unicodeEscape = re.compile(r'(\\[uU][0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?)')
def _unicodeExpand(s):
    return r_unicodeEscape.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)  # this is the line where I get the error
```
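To make the failure easy to reproduce outside rdflib, here is a minimal sketch that runs the same pattern against the string from my triple. Only the regex comes from the snippet above; the variable names (`literal`, `matched`, `codepoint`, `error`) are just for illustration.

```python
import re

# Same pattern as in the snippet above: 4 hex digits plus an optional 4 more.
r_unicodeEscape = re.compile(r'(\\[uU][0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?)')

# Raw string: the backslash-u is literal text, as it is in the N-Quads file.
literal = r'Production date :: 2532\u20132503BC :: circa'

match = r_unicodeEscape.search(literal)
matched = match.group(0)          # the optional group also swallows "2503"
codepoint = int(matched[2:], 16)  # 0x20132503

print(matched)               # \u20132503
print(hex(codepoint))        # 0x20132503
print(codepoint > 0x10FFFF)  # True

# chr() only accepts values up to 0x10FFFF, hence the ValueError.
try:
    chr(codepoint)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)
```

The two year digits groups happen to be valid hex, which is why the optional part of the pattern matches here but would not for most other text following a `\u` escape.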

What does seem to work is either:

- changing the `m.group(0)` to `m.group(1)`, or
- using the regexes for unicode escape characters as defined in notation3.py (https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/parsers/notation3.py):

``` python
import re

unicodeEscape4 = re.compile(r'\\u([0-9a-f]{4})', flags=re.I)
unicodeEscape8 = re.compile(r'\\U([0-9a-f]{8})', flags=re.I)
def _unicodeExpand(s):
    # first pass: 4-digit escapes, second pass: 8-digit escapes
    a = unicodeEscape4.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)
    return unicodeEscape8.sub(lambda m: chr(int(m.group(0)[2:], 16)), a)
```

Would this be an adequate solution?
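One caveat with the two-pass variant: because of `flags=re.I`, the `\\u` in `unicodeEscape4` also matches an uppercase `\U`, so an 8-digit escape would be consumed as a 4-digit one by the first pass. A possible alternative, sketched below, is a single case-sensitive pattern that keeps the two forms apart. The names `r_escape` and `unicode_expand` are hypothetical and not part of rdflib, and a `\U` escape above 0x10FFFF would still raise ValueError.

```python
import re

# The re.I pitfall described above: the 4-digit pattern also matches "\U".
unicodeEscape4 = re.compile(r'\\u([0-9a-f]{4})', flags=re.I)
print(unicodeEscape4.search(r'\U0001F600').group(0))  # \U0001

# Hypothetical single-pass variant (not rdflib's code): the case-sensitive
# alternation keeps \uXXXX (4 digits) and \UXXXXXXXX (8 digits) apart.
r_escape = re.compile(r'\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})')

def unicode_expand(s):
    """Expand 4-digit and 8-digit unicode escapes found in s."""
    def repl(m):
        # Exactly one of the two groups is set for any given match.
        hex_digits = m.group(1) or m.group(2)
        return chr(int(hex_digits, 16))
    return r_escape.sub(repl, s)

expanded = unicode_expand(r'Production date :: 2532\u20132503BC :: circa')
print(expanded)    # Production date :: 2532–2503BC :: circa

emoji = unicode_expand(r'\U0001F600')
print(len(emoji))  # 1
```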

Thanks
Josh

