Some special characters might be parsed wrongly (?) #1655

GreenfishK · 2022-01-08T19:04:27Z

It seems that some special characters in RDF literals are not preserved after parsing, but are instead translated into something faulty. So far, I have found the following ones:

n3_test.nt:

<http:s> <http:o1> "\n" .
<http:s> <http:o2> "\f" .
<http:s> <http:o3> "\b" .
<http:s> <http:o4> "\\r" .
<http:s> <http:o5> "\\\r" .

from rdflib import Graph
from rdflib import term

g = Graph()
g.parse("n3_test.nt")

for s, p, o in g:
    # Every object in the test file should be parsed as a literal.
    assert isinstance(o, term.Literal)
    print("{s} {p} {o}".format(s=s.n3(), p=p.n3(), o=o.n3()))

Sorted Output:

<http:s> <http:o1> """
"""
<http:s> <http:o2> ""
<http:s> <http:o3> "
<http:s> <http:o4> "\\\r"
<http:s> <http:o5> "\\\r"

We see, for example, that "\\r" and "\\\r" result in the same literal, and I am not sure whether this is the expected behavior. Unfortunately, some DBPedia logs contain such characters, and currently I cannot parse them correctly.

Is there a trick or a proper way to work around this?

GreenfishK · 2022-01-08T19:17:29Z

I did the same check with a .ttl file instead of an .nt file, and the result seems to be as expected for case 4 (= <http:s> <http:o4> "\r" .), but in all other cases the output is as faulty as it is for .nt.

aucampia · 2022-01-09T11:47:40Z

If I take this nt file:

$ cat test/variants/special_chars.nt 
<example:special> <example:newline> "\n" .
<example:special> <example:form_feed> "\f" .
<example:special> <example:backspace> "\b" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:backslash> "\\" .
<example:special> <example:string-000> "\\r" .
<example:special> <example:string-001> "\\\r" .

And round trip it:

$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i nt -o nt test/variants/special_chars.nt
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
<example:special> <example:string-000> "\\\r" .
<example:special> <example:backspace> " .
<example:special> <example:backslash> "\\" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:form_feed> "
                                       " .
<example:special> <example:string-001> "\\\r" .
<example:special> <example:newline> "\n" .

It looks right, even though it may be missing some escaping:

$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i nt -o nt test/variants/special_chars.nt | xxd
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
00000000: 3c65 7861 6d70 6c65 3a73 7065 6369 616c  <example:special
00000010: 3e20 3c65 7861 6d70 6c65 3a66 6f72 6d5f  > <example:form_
00000020: 6665 6564 3e20 220c 2220 2e0a 3c65 7861  feed> "." ..<exa
00000030: 6d70 6c65 3a73 7065 6369 616c 3e20 3c65  mple:special> <e
00000040: 7861 6d70 6c65 3a63 6172 7269 6167 655f  xample:carriage_
00000050: 7265 7475 726e 3e20 225c 7222 202e 0a3c  return> "\r" ..<
00000060: 6578 616d 706c 653a 7370 6563 6961 6c3e  example:special>
00000070: 203c 6578 616d 706c 653a 7374 7269 6e67   <example:string
00000080: 2d30 3030 3e20 225c 5c5c 7222 202e 0a3c  -000> "\\\r" ..<
00000090: 6578 616d 706c 653a 7370 6563 6961 6c3e  example:special>
000000a0: 203c 6578 616d 706c 653a 7374 7269 6e67   <example:string
000000b0: 2d30 3031 3e20 225c 5c5c 7222 202e 0a3c  -001> "\\\r" ..<
000000c0: 6578 616d 706c 653a 7370 6563 6961 6c3e  example:special>
000000d0: 203c 6578 616d 706c 653a 6e65 776c 696e   <example:newlin
000000e0: 653e 2022 5c6e 2220 2e0a 3c65 7861 6d70  e> "\n" ..<examp
000000f0: 6c65 3a73 7065 6369 616c 3e20 3c65 7861  le:special> <exa
00000100: 6d70 6c65 3a62 6163 6b73 6c61 7368 3e20  mple:backslash> 
00000110: 225c 5c22 202e 0a3c 6578 616d 706c 653a  "\\" ..<example:
00000120: 7370 6563 6961 6c3e 203c 6578 616d 706c  special> <exampl
00000130: 653a 6261 636b 7370 6163 653e 2022 0822  e:backspace> "."
00000140: 202e 0a0a

aucampia · 2022-01-09T11:51:00Z

Actually, my apologies, it seems there is something wrong with <example:special> <example:string-000> "\\\r" . in the output.

aucampia · 2022-01-09T11:56:26Z

And yes, when parsing that as ttl it works fine:

$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i ttl -o nt test/variants/special_chars.nt 
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
<example:special> <example:backspace> " .
<example:special> <example:newline> "\n" .
<example:special> <example:string-001> "\\\r" .
<example:special> <example:string-000> "\\r" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:form_feed> "
                                       " .
<example:special> <example:backslash> "\\" .

GreenfishK · 2022-01-09T12:10:11Z

I would expect to receive the same canonical string when I round trip it. However, these special escapes are either not recognized or are interpreted, for both nt and ttl. E.g. \n is an actual new line in the output, even for ttl. Is this desired?

aucampia · 2022-01-09T12:11:30Z

\n is an actual new line in the output, even for ttl. Is this desired?

It may not be desired, but it is correct I think, will have to check the spec to be sure though.

GreenfishK · 2022-01-09T14:50:57Z

So for this triple
<example:special> <example:newline> "\n" .

I get this when I round trip the nt file:
<example:special> <example:newline> """ """.
Is this really correct? If yes, then my apologies.

aucampia · 2022-01-09T15:26:17Z

In python,

"""
"""

That is just a multiline string containing a single newline, so what you showed earlier seems okay. It is odd that n3 renders it like repr(), and maybe that should be fixed, but n3 should be used with care anyway. If you look at the actual value of the literal object in <http:s> <http:o1> "\n" . then I think it is right. I will maybe add some more checks just to be sure, but I have done a couple of different checks for this already and all seem fine. The only problem I can see for now is that <http:s> <http:o4> "\\r" . does not parse correctly when parsed as ntriples.
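To make the point about the triple-quoted form concrete, here is a minimal sketch in plain Python (no rdflib involved; the variable names are only for illustration). It shows that a triple-quoted string spanning one line break is the same value as "\n", and that repr() is a safer way to inspect control characters than printing them.

```python
# A triple-quoted string that spans exactly one line break ...
multiline = """
"""
# ... is the same one-character string as the escaped form.
escaped = "\n"

assert multiline == escaped
assert len(multiline) == 1

# Printing control characters can be misleading (a backspace or form feed
# may be invisible in a terminal), so inspect values with repr() instead.
for value in ("\n", "\f", "\b", "\r"):
    print(repr(value), [hex(ord(ch)) for ch in value])
```

The same reasoning applies to the value held by a parsed literal: comparing values or looking at their repr() tells you more than looking at serialized output in a terminal.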

GreenfishK · 2022-01-09T16:30:36Z

Have you tried to parse the triples above from .nt and then serialize them as .nt (I guess this is what you mean by round trip)? For me, all the objects above had a different representation in the destination file. In particular,
<http:s> <http:o3> "\b" . was serialized as an unrecognizable character.

Regarding the newline: I still don't understand why it is correct for it to be interpreted as an actual new line when serialized. E.g. if I have an ontology about Python where I have a triple like <newline> <label> "\n" and I parse and serialize this, then the object literal would be lost. Or would I in this case add an additional backslash as an escape character, like <newline> <label> "\\n" ?
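On the last question: in N-Triples and Turtle, the two characters \n inside a quoted literal are an escape for a line feed, so a label that should contain a literal backslash followed by n has to be written as "\\n". Below is a minimal sketch of the escaping direction. The helper nt_escape is hypothetical, is not part of rdflib, and covers only a few characters.

```python
def nt_escape(value: str) -> str:
    """Escape a string for use inside a double-quoted N-Triples literal.

    Hypothetical helper for illustration; handles only a few characters.
    """
    # The backslash must be escaped first, otherwise the backslashes
    # introduced by the later replacements would be escaped again.
    value = value.replace("\\", "\\\\")
    value = value.replace("\n", "\\n")
    value = value.replace("\r", "\\r")
    value = value.replace('"', '\\"')
    return value


# A real line feed is written as the two characters backslash + n ...
assert nt_escape("\n") == "\\n"
# ... while a literal backslash followed by n needs a doubled backslash.
assert nt_escape("\\n") == "\\\\n"

print('<newline> <label> "%s" .' % nt_escape("\\n"))
```

Doing the replacements sequentially is safe in this direction because every later step only adds backslashes that no other step looks at again.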

This includes xfails for the following issues:
- RDFLib#1216
- RDFLib#1655
- RDFLib#1649

Also:
- Add graph variant test scaffolding. Multiple files representing the same graph can now easily be tested to be isomorphic by just adding them in `test/variants`.
- Add more things to `testutils.GraphHelper`, including some methods that does asserts with better messages. Also include some tests for GraphHelper.
- Add some extra files to test_roundtrip, set the default identifier when parsing, and change verbose flag to rather be based on debug logging.
- move one test from `test/test_issue247.py` to variants.
- Fix problems with `.editorconfig` which prevents it from working properly.

aucampia · 2022-01-09T18:14:41Z

@GreenfishK

$ .venv/bin/pytest 'test/test_roundtrip.py::test_extra[roundtrip_special_chars.nt_ntriples_ntriples]' --log-level DEBUG -rA
============================================================================ test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/github.com/iafork/rdflib, configfile: tox.ini
plugins: subtests-0.5.0, cov-3.0.0
collected 1 item                                                                                                                                                            

test/test_roundtrip.py .                                                                                                                                              [100%]

================================================================================== PASSES ===================================================================================
_________________________________________________________ test_extra[roundtrip_special_chars.nt_ntriples_ntriples] __________________________________________________________
----------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
DEBUG    test.test_roundtrip:test_roundtrip.py:221 serailized = 
<example:special> <example:backspace> " .
<example:special> <example:backslash> "\\" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:string-000> "\\\r" .
<example:special> <example:form_feed> "
                                       " .
<example:special> <example:newline> "\n" .
<example:special> <example:string-001> "\\\r" .


DEBUG    test.test_roundtrip:test_roundtrip.py:227 Items in both:
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:backspace'), rdflib.term.Literal('\x08'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-001'), rdflib.term.Literal('\\\r'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:form_feed'), rdflib.term.Literal('\x0c'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:backslash'), rdflib.term.Literal('\\'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:carriage_return'), rdflib.term.Literal('\r'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:newline'), rdflib.term.Literal('\n'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-000'), rdflib.term.Literal('\\\r'))
DEBUG    test.test_roundtrip:test_roundtrip.py:228 Items in G1 Only:

DEBUG    test.test_roundtrip:test_roundtrip.py:229 Items in G2 Only:

DEBUG    test.test_roundtrip:test_roundtrip.py:234 OK
========================================================================== short test summary info ==========================================================================
PASSED test/test_roundtrip.py::test_extra[roundtrip_special_chars.nt_ntriples_ntriples]
============================================================================= 1 passed in 0.18s =============================================================================

Test added in #1658

This is wrong (i.e. parsed incorrectly):

  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-000'), rdflib.term.Literal('\\\r'))

And this is an issue I also added xfail tests for, but the rest all seem right to me. I'm looking at fixing the \\r issue now but will have to completely rewrite the unescaping for ntriples to fix this.
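A minimal sketch of why this kind of bug can occur, assuming the unescaping is done as a chain of str.replace calls (the function name naive_unescape is made up for this illustration). Each call rescans the whole string, so the r after an escaped backslash is consumed by the \r rule before the \\ rule ever sees the pair of backslashes:

```python
def naive_unescape(s: str) -> str:
    # Each replace() scans the full string independently of the others.
    s = s.replace("\\r", "\r")   # backslash + r   -> carriage return
    s = s.replace("\\\\", "\\")  # two backslashes -> one backslash
    return s


# Raw strings keep the backslashes exactly as they appear in the .nt file.
string_000 = r"\\r"   # should decode to: backslash, 'r'
string_001 = r"\\\r"  # should decode to: backslash, carriage return

print(repr(naive_unescape(string_000)))
print(repr(naive_unescape(string_001)))

# Both inputs collapse to the same value, which matches the symptom above.
assert naive_unescape(string_000) == naive_unescape(string_001) == "\\\r"
```

Reordering the replace calls does not help in general, because whichever rule runs first can still split an escape sequence that belongs to another rule.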

GreenfishK · 2022-01-09T18:46:26Z

Thank you!!

GreenfishK · 2022-01-10T11:09:28Z

@aucampia I made a few more checks on the .ttl dataset with the special characters/escapes, and I think it looks good. In particular, I imported the following two triples into GraphDB and then queried them, and the representation of both was exactly the same.

<http:s> <http:p> """
"""

and
<http:s> <http:p> "\n"

SPARQL Query result when querying from GraphDB

<http:s> <http:p> "
"

However, for .nt I did not manage to get the correct output after calling parse(...) on a UTF-8 encoded .nt dataset and then serialize(...). I see errors especially for \\b and \\f. See the example from the original DBPedia triple below.

Example triple from DBPedia:
<http://dbpedia.org/resource/Rodeo_(Travis_Scott_album)> <http://dbpedia.org/property/cover> "{\\rtf1\\ansi\\ansicpg1252{\\fonttbl}\n{\\colortbl;\\red255\\green255\\blue255;"@en .

Output after function parse(...) and serialize(...) is called:
<http://dbpedia.org/resource/Rodeo_(Travis_Scott_album)> <http://dbpedia.org/property/cover> "{\\\rtf1\\ansi\\ansicpg1252{\\onttbl}\n{\\colortbl;\\\red255\\green255\\�lue255;"@en .

Should I make a Python notebook for this issue with the code I am using, so that it is easier to verify and more visible?

aucampia · 2022-01-12T00:16:02Z

The most reasonable solution I can find for this is to replace:

r_unicodeEscape = re.compile(r"(\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})")


def _unicodeExpand(s: str) -> str:
    return r_unicodeEscape.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)


def decodeUnicodeEscape(s: str) -> str:
    """
    s is a unicode string
    replace ``\\n`` and ``\\u00AC`` unicode escapes
    """
    # if "\\" not in s:
    #     # Most of times, there are no backslashes in strings.
    #     # In the general case, it could use maketrans and translate.
    #     return s

    s = s.replace("\\t", "\t")
    s = s.replace("\\n", "\n")
    s = s.replace("\\r", "\r")
    s = s.replace("\\b", "\b")
    s = s.replace("\\f", "\f")
    s = s.replace('\\"', '"')
    s = s.replace("\\'", "'")
    s = s.replace("\\\\", "\\")

    s = _unicodeExpand(s)  # hmm - string escape doesn't do unicode escaping

    return s

with

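import re
from typing import Match

# Maps the character after the backslash to the character it denotes.
string_escape_map = {
    "t": "\t",
    "b": "\b",
    "n": "\n",
    "r": "\r",
    "f": "\f",
    '"': '"',
    "'": "'",
    "\\": "\\",
}
string_escape_trans = str.maketrans(string_escape_map)


def _group_handler(match: Match[str]) -> str:
    # Exactly one group matches: a reserved character escape, a
    # single-character string escape, or a \uXXXX / \UXXXXXXXX escape.
    rmatch, smatch, umatch = match.groups()
    if rmatch is not None:
        return rmatch
    elif smatch is not None:
        return smatch.translate(string_escape_trans)
    else:
        # Drop the leading "u"/"U" and decode the hex code point.
        return chr(int(umatch[1:], 16))


turtle_escape_pattern = re.compile(
    r"""\\(?:([~.\-!$&'()*+,;=\/?#@%_])|([tbnrf"'\\])|(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}))""",
    # re.ASCII,
)


def turtle_unescape(escaped: str) -> str:
    # One left-to-right pass, so "\\" is consumed before what follows it.
    return turtle_escape_pattern.sub(_group_handler, escaped)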

The replacement is slower in simple string replacement cases, but faster or more or less equally fast in other cases. I will try to make a patch tomorrow.
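The key property of the replacement is that the input is scanned once from left to right, so an escaped backslash is consumed before the following character is looked at. Here is a condensed sketch restricted to the single-character string escapes. The names unescape_single_pass, _ESCAPES and _ESCAPE_RE are hypothetical and this is not the rdflib code; the sample inputs are taken from the triples discussed in this thread.

```python
import re

# Single-character escapes only; \uXXXX and reserved-character escapes
# are deliberately left out of this sketch.
_ESCAPES = {
    "t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
    '"': '"', "'": "'", "\\": "\\",
}
_ESCAPE_RE = re.compile(r"""\\([tbnrf"'\\])""")


def unescape_single_pass(escaped: str) -> str:
    # re.sub never rescans text it has already replaced, so "\\" is
    # consumed as one unit and the following "r" is left untouched.
    return _ESCAPE_RE.sub(lambda m: _ESCAPES[m.group(1)], escaped)


# backslash, backslash, r  ->  backslash, 'r'
assert unescape_single_pass(r"\\r") == "\\r"
# backslash, backslash, backslash, r  ->  backslash, carriage return
assert unescape_single_pass(r"\\\r") == "\\\r"
# Fragments of the DBPedia literal keep their literal backslashes.
assert unescape_single_pass(r"{\\rtf1\\ansi{\\fonttbl}") == "{\\rtf1\\ansi{\\fonttbl}"
assert unescape_single_pass(r"\\blue255") == "\\blue255"
```

With this approach the two literals "\\r" and "\\\r" decode to different values, which is what the round trip needs.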

I think in the long run we should look at the parsers themselves. Potentially using https://github.com/lark-parser/lark to make a parser for N3 may make more sense, and that can compile to Python.

aucampia · 2022-01-12T22:50:57Z

I made a draft fix for this here #1663

You can have a look at the tests being done in this file.

This quite extensively tests the handling of various escape sequences through the Turtle-related parsers, and it seems all is okay with that fix. Many things are fixed. I placed a list of all added tests that fail with the old code in the PR for you to have a look at. All other cases seem to work correctly in master, and continue to work correctly.

If you think there is something I missed in the tests let me know please.

aucampia added the bug label Jan 9, 2022

aucampia mentioned this issue Jan 9, 2022

Add xfail tests for a couple of issues #1658

Closed

aucampia self-assigned this Jan 9, 2022

aucampia mentioned this issue Jan 10, 2022

Remove narrow build detection #1660

Merged

aucampia mentioned this issue Jan 12, 2022

Add xfail tests for a couple of issues #1662

Closed

aucampia mentioned this issue Jan 12, 2022

Fixed the handling of escape sequences in the ntriples and nquads parsers #1663

Merged

aucampia closed this as completed in #1663 Jan 18, 2022
