NQuads: unicode escape issue #352

Closed
jmahmud opened this issue Jan 9, 2014 · 2 comments

Comments

@jmahmud

jmahmud commented Jan 9, 2014

Hi

I have an issue parsing N-Quads with rdflib.

I have this string:

'Production date :: 2532\u20132503BC :: circa'

This string is the object of one of my triples. You can see that it has a Unicode escape for the en dash: '\u2013'.

The triple is in a file (see attached). When rdflib parses it, I get an error:
ValueError: chr() arg not in range(0x110000)
This seems to be raised in py3compat.py.

Digging into it, it seems that the regex correctly identifies the Unicode escape, but when it extracts the match it also grabs the four hex digits that follow \u2013, here the characters '2503', i.e. '\u20132503'. The code that fails is:

import re

r_unicodeEscape = re.compile(r'(\\[uU][0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?)')
def _unicodeExpand(s):
    return r_unicodeEscape.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)  # this is the line that raises the error
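
To make the failure easy to reproduce outside rdflib, here is a minimal sketch using only the pattern and the sample string quoted above. The variable names `sample`, `matched_text`, `code_point` and `error` are made up for illustration and are not part of rdflib.

```python
import re

# Pattern as quoted in the issue: four hex digits plus an optional further four.
r_unicodeEscape = re.compile(r'(\\[uU][0-9A-Fa-f]{4}(?:[0-9A-Fa-f]{4})?)')

# Raw string, so the backslash escape reaches the regex untouched.
sample = r'Production date :: 2532\u20132503BC :: circa'

match = r_unicodeEscape.search(sample)
matched_text = match.group(0)           # the escape plus the following "2503"
code_point = int(matched_text[2:], 16)  # parsed as one eight-digit hex number

try:
    chr(code_point)
    error = None
except ValueError as exc:
    # The value is above the largest Unicode code point (0x10FFFF).
    error = exc

print(matched_text, hex(code_point), error)
```

The optional group is greedy, so any `\uXXXX` escape that happens to be followed by four more hex digits is read as an eight-digit escape.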

What does seem to work is either:
changing m.group(0) to m.group(1), OR
using the regexes for Unicode escapes as defined in notation3.py (https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/parsers/notation3.py):

unicodeEscape4 = re.compile(r'\\u([0-9a-f]{4})', flags=re.I)
unicodeEscape8 = re.compile(r'\\U([0-9a-f]{8})', flags=re.I)
def _unicodeExpand(s):
    a = unicodeEscape4.sub(lambda m: chr(int(m.group(0)[2:], 16)), s) 
    return unicodeEscape8.sub(lambda m: chr(int(m.group(0)[2:], 16)), a)
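
One thing worth checking with the two-pattern approach: `flags=re.I` makes the literal `u` case-insensitive as well, so the four-digit pattern can also match the start of a `\UXXXXXXXX` escape. The sketch below compares the patterns as written above with a case-sensitive variant. The function names and the case-sensitive patterns are assumptions for illustration, not code taken from rdflib, and `m.group(1)` is used in place of `m.group(0)[2:]` since both yield the hex digits here.

```python
import re

# Patterns as proposed above; re.I also relaxes the literal 'u' / 'U'.
unicodeEscape4_ci = re.compile(r'\\u([0-9a-f]{4})', flags=re.I)
unicodeEscape8_ci = re.compile(r'\\U([0-9a-f]{8})', flags=re.I)

def expand_case_insensitive(s):
    a = unicodeEscape4_ci.sub(lambda m: chr(int(m.group(1), 16)), s)
    return unicodeEscape8_ci.sub(lambda m: chr(int(m.group(1), 16)), a)

# Hypothetical variant: only the hex digits are case-insensitive.
unicodeEscape4 = re.compile(r'\\u([0-9A-Fa-f]{4})')
unicodeEscape8 = re.compile(r'\\U([0-9A-Fa-f]{8})')

def expand_case_sensitive(s):
    a = unicodeEscape4.sub(lambda m: chr(int(m.group(1), 16)), s)
    return unicodeEscape8.sub(lambda m: chr(int(m.group(1), 16)), a)

bmp = r'2532\u20132503BC'   # four-digit escape followed by hex digits
astral = r'\U0001F600'      # eight-digit escape

print(repr(expand_case_insensitive(bmp)))
print(repr(expand_case_insensitive(astral)))  # the \U escape is split after 4 digits
print(repr(expand_case_sensitive(astral)))
```

Both versions handle the `\u2013` case from this issue; they differ only on eight-digit `\U` escapes.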

Would this be an adequate solution?

Thanks
Josh

@joernhees
Member

group(1) is not an option, as it's wrong for \UXXXXXXXX escapes.

@joernhees
Member

I think you can just use this regexp:

r_unicodeEscape = re.compile(r'(\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})')
