Some special characters might be parsed wrongly (?) #1655
If I take this nt file:
$ cat test/variants/special_chars.nt
<example:special> <example:newline> "\n" .
<example:special> <example:form_feed> "\f" .
<example:special> <example:backspace> "\b" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:backslash> "\\" .
<example:special> <example:string-000> "\\r" .
<example:special> <example:string-001> "\\\r" .
And round trip it:
$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i nt -o nt test/variants/special_chars.nt
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
warnings.warn(
<example:special> <example:string-000> "\\\r" .
<example:special> <example:backspace> " .
<example:special> <example:backslash> "\\" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:form_feed> "
" .
<example:special> <example:string-001> "\\\r" .
<example:special> <example:newline> "\n" .
It looks right, though it may be missing some escaping:
$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i nt -o nt test/variants/special_chars.nt | xxd
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
warnings.warn(
00000000: 3c65 7861 6d70 6c65 3a73 7065 6369 616c <example:special
00000010: 3e20 3c65 7861 6d70 6c65 3a66 6f72 6d5f > <example:form_
00000020: 6665 6564 3e20 220c 2220 2e0a 3c65 7861 feed> "." ..<exa
00000030: 6d70 6c65 3a73 7065 6369 616c 3e20 3c65 mple:special> <e
00000040: 7861 6d70 6c65 3a63 6172 7269 6167 655f xample:carriage_
00000050: 7265 7475 726e 3e20 225c 7222 202e 0a3c return> "\r" ..<
00000060: 6578 616d 706c 653a 7370 6563 6961 6c3e example:special>
00000070: 203c 6578 616d 706c 653a 7374 7269 6e67 <example:string
00000080: 2d30 3030 3e20 225c 5c5c 7222 202e 0a3c -000> "\\\r" ..<
00000090: 6578 616d 706c 653a 7370 6563 6961 6c3e example:special>
000000a0: 203c 6578 616d 706c 653a 7374 7269 6e67 <example:string
000000b0: 2d30 3031 3e20 225c 5c5c 7222 202e 0a3c -001> "\\\r" ..<
000000c0: 6578 616d 706c 653a 7370 6563 6961 6c3e example:special>
000000d0: 203c 6578 616d 706c 653a 6e65 776c 696e <example:newlin
000000e0: 653e 2022 5c6e 2220 2e0a 3c65 7861 6d70 e> "\n" ..<examp
000000f0: 6c65 3a73 7065 6369 616c 3e20 3c65 7861 le:special> <exa
00000100: 6d70 6c65 3a62 6163 6b73 6c61 7368 3e20 mple:backslash>
00000110: 225c 5c22 202e 0a3c 6578 616d 706c 653a "\\" ..<example:
00000120: 7370 6563 6961 6c3e 203c 6578 616d 706c special> <exampl
00000130: 653a 6261 636b 7370 6163 653e 2022 0822 e:backspace> "."
00000140: 202e 0a0a |
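To spot bytes such as the raw 0x0c and 0x08 in a dump like the one above without reading hex by eye, a small stdlib-only helper can list control bytes that were written unescaped. This is only a sketch: `raw_control_bytes` is a hypothetical name, not part of rdflib, and the sample bytes are copied from two lines of the dump.

```python
from typing import List, Tuple


def raw_control_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte) pairs for control bytes other than the line-ending LF."""
    return [(i, b) for i, b in enumerate(data) if b < 0x20 and b != 0x0A]


# Two lines from the serialized output: form feed is a raw byte,
# carriage return is written as the two characters backslash + r.
sample = (
    b'<example:special> <example:form_feed> "\x0c" .\n'
    b'<example:special> <example:carriage_return> "\\r" .\n'
)

for offset, byte in raw_control_bytes(sample):
    print(f"offset {offset:#06x}: raw byte {byte:#04x}")
```

Only the form feed is reported here, because the carriage return in the sample is already escaped text rather than a control byte.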
Actually my apologies, it seems there is something wrong with |
And yes, when parsing that as ttl it works fine:
|
I would expect to receive the same canonical string when I round-trip it. However, these special escapes are either not recognized or are interpreted, for both nt and ttl. E.g. \n is an actual newline in the output, even for ttl. Is this desired? |
It may not be desired, but it is correct I think, will have to check the spec to be sure though. |
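For reference while checking the spec: one reading of the canonical form described in the RDF 1.1 N-Triples recommendation is that only the double quote, backslash, line feed and carriage return are written as escapes inside a literal, while other characters (form feed and backspace included) are written directly. Under that reading, raw form feed and backspace bytes in serialized output would be acceptable. A minimal sketch of such an escaper follows; `escape_nt_literal` is a hypothetical helper for illustration, not rdflib's serializer.

```python
# Assumption: only these four characters are escaped in canonical N-Triples literals.
_CANONICAL_ESCAPES = {
    "\\": "\\\\",  # backslash
    '"': '\\"',    # double quote
    "\n": "\\n",   # line feed
    "\r": "\\r",   # carriage return
}


def escape_nt_literal(value: str) -> str:
    """Quote a plain string value the way the assumed canonical form would."""
    return '"' + "".join(_CANONICAL_ESCAPES.get(ch, ch) for ch in value) + '"'


print(escape_nt_literal("\n"))           # escaped as backslash + n
print(repr(escape_nt_literal("\x0c")))   # form feed stays a raw character
```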
So for this triple I get this when I round trip the nt file: |
In Python,
"""
"""
is just a multiline string that contains a single newline, so what you showed earlier seems okay. It is odd that n3 renders it as repr(), and maybe that should be fixed, but n3 should be used with care anyway. If you look at the actual value of the literal object in |
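The point about the multiline string can be confirmed in plain Python, with no rdflib involved (the variable name is only for illustration):

```python
# A triple-quoted string whose only content is a line break.
multiline = """
"""

# It is exactly one character long: the newline U+000A.
assert multiline == "\n"
assert len(multiline) == 1

# Printing the string writes a real line break; repr() shows the escaped form.
print(repr(multiline))  # '\n'
```

So a serializer that writes the value directly and one that writes an escaped form are showing the same one-character string in two notations.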
Have you tried to parse the triples above from .nt and then serialize them as .nt (I guess this is what you mean by round trip)? For me, all the objects above had a different representation in the destination file, especially regarding the newline. I still don't understand why it is correct for it to be interpreted as an actual newline when serialized. E.g. if I have an ontology about Python where I have a triple like |
This includes xfails for the following issues: - RDFLib#1216 - RDFLib#1655 - RDFLib#1649 Also: - Add graph variant test scaffolding. Multiple files representing the same graph can now easily be tested to be isomorphic by just adding them in `test/variants`. - Add more things to `testutils.GraphHelper`, including some methods that does asserts with better messages. Also include some tests for GraphHelper. - Add some extra files to test_roundtrip, set the default identifier when parsing, and change verbose flag to rather be based on debug logging. - move one test from `test/test_issue247.py` to variants.
This includes xfails for the following issues: - RDFLib#1216 - RDFLib#1655 - RDFLib#1649 Also: - Add graph variant test scaffolding. Multiple files representing the same graph can now easily be tested to be isomorphic by just adding them in `test/variants`. - Add more things to `testutils.GraphHelper`, including some methods that does asserts with better messages. Also include some tests for GraphHelper. - Add some extra files to test_roundtrip, set the default identifier when parsing, and change verbose flag to rather be based on debug logging. - move one test from `test/test_issue247.py` to variants. - Fix problems with `.editorconfig` which prevents it from working properly.
$ .venv/bin/pytest 'test/test_roundtrip.py::test_extra[roundtrip_special_chars.nt_ntriples_ntriples]' --log-level DEBUG -rA
============================================================================ test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/github.com/iafork/rdflib, configfile: tox.ini
plugins: subtests-0.5.0, cov-3.0.0
collected 1 item
test/test_roundtrip.py . [100%]
================================================================================== PASSES ===================================================================================
_________________________________________________________ test_extra[roundtrip_special_chars.nt_ntriples_ntriples] __________________________________________________________
----------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
DEBUG test.test_roundtrip:test_roundtrip.py:221 serailized =
<example:special> <example:backspace> " .
<example:special> <example:backslash> "\\" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:string-000> "\\\r" .
<example:special> <example:form_feed> "
" .
<example:special> <example:newline> "\n" .
<example:special> <example:string-001> "\\\r" .
DEBUG test.test_roundtrip:test_roundtrip.py:227 Items in both:
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:backspace'), rdflib.term.Literal('\x08'))
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-001'), rdflib.term.Literal('\\\r'))
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:form_feed'), rdflib.term.Literal('\x0c'))
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:backslash'), rdflib.term.Literal('\\'))
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:carriage_return'), rdflib.term.Literal('\r'))
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:newline'), rdflib.term.Literal('\n'))
(rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-000'), rdflib.term.Literal('\\\r'))
DEBUG test.test_roundtrip:test_roundtrip.py:228 Items in G1 Only:
DEBUG test.test_roundtrip:test_roundtrip.py:229 Items in G2 Only:
DEBUG test.test_roundtrip:test_roundtrip.py:234 OK
========================================================================== short test summary info ==========================================================================
PASSED test/test_roundtrip.py::test_extra[roundtrip_special_chars.nt_ntriples_ntriples]
============================================================================= 1 passed in 0.18s =============================================================================
Test added in #1658
This is wrong (i.e. parsed wrong):
And this is an issue I also added xfail tests for, but the rest all seem right to me. I'm looking at fixing the |
Thank you!! |
@aucampia I made a few more checks for the .ttl dataset with the special characters/escapes and I think it looks good. In particular, I imported the following two triples into GraphDB and then queried them, and the representation for both was exactly the same.
and SPARQL Query result when querying from GraphDB
However, for .nt I did not manage to get the correct output after calling parse(...) on a UTF-8 encoded .nt dataset and then serialize(...). I see errors especially for \\b and \\f. See the example from the original DBPedia triple below. Example triple from DBPedia: Output after function parse(...) and serialize(...) is called: Should I make a Python notebook for this issue with the code I am using, so that it is easier to verify and more visible? |
The most reasonable solution I can find for this is to replace:

r_unicodeEscape = re.compile(r"(\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})")


def _unicodeExpand(s: str) -> str:
    return r_unicodeEscape.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)


def decodeUnicodeEscape(s: str) -> str:
    """
    s is a unicode string
    replace ``\\n`` and ``\\u00AC`` unicode escapes
    """
    # if "\\" not in s:
    #     # Most of times, there are no backslashes in strings.
    #     # In the general case, it could use maketrans and translate.
    #     return s
    s = s.replace("\\t", "\t")
    s = s.replace("\\n", "\n")
    s = s.replace("\\r", "\r")
    s = s.replace("\\b", "\b")
    s = s.replace("\\f", "\f")
    s = s.replace('\\"', '"')
    s = s.replace("\\'", "'")
    s = s.replace("\\\\", "\\")
    s = _unicodeExpand(s)  # hmm - string escape doesn't do unicode escaping
    return s

with:

import re
from typing import Match

# Maps the character after the backslash to the character it stands for.
string_escape_map = {
    "t": "\t",
    "b": "\b",
    "n": "\n",
    "r": "\r",
    "f": "\f",
    '"': '"',
    "'": "'",
    "\\": "\\",
}
string_escape_trans = str.maketrans(string_escape_map)


def _group_handler(match: Match[str]) -> str:
    # Exactly one of the three groups matches for any escape sequence.
    rmatch, smatch, umatch = match.groups()
    if rmatch is not None:
        return rmatch
    elif smatch is not None:
        return smatch.translate(string_escape_trans)
    else:
        return chr(int(umatch[1:], 16))


turtle_escape_pattern = re.compile(
    r"""\\(?:([~.\-!$&'()*+,;=\/?#@%_])|([tbnrf"'\\])|(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}))""",
    # re.ASCII,
)


def turtle_unescape(escaped: str) -> str:
    return turtle_escape_pattern.sub(_group_handler, escaped)

The replacement is slower in simple string-replacement cases, but roughly as fast or faster in other cases. I will try to make a patch tomorrow. In the long run I think we should look at the parsers themselves; using https://github.com/lark-parser/lark to make a parser for N3 may make more sense, and that can compile to Python. |
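To make the difference concrete, here is a minimal, stdlib-only sketch that contrasts the two approaches on the inputs from this issue. `sequential_unescape` mirrors the order of the `replace` calls quoted above and `single_pass_unescape` is a reduced version of the regex idea; both names are hypothetical, and the sketch deliberately handles only the single-character escapes, not `\u`/`\U` or the reserved-character escapes.

```python
import re

# Character after the backslash -> character it stands for.
ECHAR = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
         '"': '"', "'": "'", "\\": "\\"}


def sequential_unescape(s: str) -> str:
    """Chained replaces: each pass rescans text that earlier passes left behind."""
    for esc in ("t", "n", "r", "b", "f", '"', "'", "\\"):
        s = s.replace("\\" + esc, ECHAR[esc])
    return s


_ECHAR_PATTERN = re.compile(r"""\\([tbnrf"'\\])""")


def single_pass_unescape(s: str) -> str:
    """One left-to-right scan: every escape is consumed exactly once."""
    return _ECHAR_PATTERN.sub(lambda m: ECHAR[m.group(1)], s)


lexical = r"\\r"  # escaped backslash followed by the letter r
print(repr(sequential_unescape(lexical)))   # backslash + carriage return (wrong)
print(repr(single_pass_unescape(lexical)))  # backslash + letter r
```

The chained version sees the second backslash and the `r` as a `\r` escape before it ever handles `\\`, which is why `"\\r"` and `"\\\r"` collapse into the same literal. The single scan pairs each backslash with the character that follows it, so the two inputs stay distinct.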
I made a draft fix for this in #1663. You can have a look at the tests being done in this file. It quite extensively tests the handling of various escape sequences through the Turtle-related parsers, and all seems okay with that fix. Many things are fixed; I placed a list of all added tests that fail with the old code in the PR for you to look at. All other cases seem to work correctly in master, and continue to work correctly. If you think I missed something in the tests, please let me know. |
It seems that some special characters in RDF literals are not preserved after parsing but are instead translated into something faulty. So far, I have found the following ones:
n3_test.nt:
Sorted Output:
We see e.g. that "\\r" and "\\\r" result in the same literal, and I am not sure if this is the expected behavior. Unfortunately there are some DBPedia logs which contain such characters, and currently I just cannot parse them correctly.
Is there a trick or a proper way to get around this?
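As an independent cross-check of what "\\r" and "\\\r" should decode to, the lexical forms from the test file can be run through `json.loads`. This relies on an assumption worth stating: for these particular escapes (`\n`, `\f`, `\b`, `\r`, `\\`) JSON string syntax and N-Triples agree, but JSON is only a stand-in here and does not cover `\'` or `\U`. The dictionary below is hypothetical scaffolding for illustration.

```python
import json

# Lexical forms exactly as they appear between the quotes in special_chars.nt.
lexical_forms = {
    "newline": r"\n",
    "form_feed": r"\f",
    "backspace": r"\b",
    "carriage_return": r"\r",
    "backslash": r"\\",
    "string-000": r"\\r",
    "string-001": r"\\\r",
}

# Decode each form with a single-pass decoder that shares these escapes.
expected = {name: json.loads('"' + form + '"') for name, form in lexical_forms.items()}

for name, value in expected.items():
    code_points = [f"U+{ord(ch):04X}" for ch in value]
    print(f"{name:16} {value!r:10} {code_points}")
```

Under this cross-check, string-000 is a backslash followed by the letter r, while string-001 is a backslash followed by a carriage return, so a parser that returns the same value for both has lost information.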