Some special characters might be parsed wrongly (?) #1655

Closed
GreenfishK opened this issue Jan 8, 2022 · 14 comments · Fixed by #1663
Assignees
Labels
bug Something isn't working

Comments

@GreenfishK

GreenfishK commented Jan 8, 2022

It seems that some special characters in RDF literals are not preserved after parsing but are instead translated into something faulty. So far, I have found the following ones:

n3_test.nt:

<http:s> <http:o1> "\n" .
<http:s> <http:o2> "\f" .
<http:s> <http:o3> "\b" .
<http:s> <http:o4> "\\r" .
<http:s> <http:o5> "\\\r" .
from rdflib import Graph
from rdflib import term

g = Graph()
g.parse("n3_test.nt")

for s, p, o in g:
    assert (type(o) == term.Literal)
    print("{s} {p} {o}".format(s=s.n3(), p=p.n3(), o=o.n3()))

Sorted Output:

<http:s> <http:o1> """
"""
<http:s> <http:o2> ""
<http:s> <http:o3> "
<http:s> <http:o4> "\\\r"
<http:s> <http:o5> "\\\r"

We see, for example, that "\\r" and "\\\r" result in the same literal, and I am not sure whether this is the expected behavior. Unfortunately, some DBpedia logs contain such characters, and currently I cannot parse them correctly.

Is there a trick or a proper way to get around this?

@GreenfishK
Author

I did the same check with a .ttl file instead of an .nt file. The result seems to be as expected for case 4 (<http:s> <http:o4> "\\r" .), but in all other cases the output is as faulty as it is for .nt.

@aucampia
Member

aucampia commented Jan 9, 2022

If I take this nt file:

$ cat test/variants/special_chars.nt 
<example:special> <example:newline> "\n" .
<example:special> <example:form_feed> "\f" .
<example:special> <example:backspace> "\b" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:backslash> "\\" .
<example:special> <example:string-000> "\\r" .
<example:special> <example:string-001> "\\\r" .

And round trip it:

$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i nt -o nt test/variants/special_chars.nt
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
<example:special> <example:string-000> "\\\r" .
<example:special> <example:backspace> " .
<example:special> <example:backslash> "\\" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:form_feed> "
                                       " .
<example:special> <example:string-001> "\\\r" .
<example:special> <example:newline> "\n" .

It looks right, even though it may be missing some escaping:

$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i nt -o nt test/variants/special_chars.nt | xxd
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
00000000: 3c65 7861 6d70 6c65 3a73 7065 6369 616c  <example:special
00000010: 3e20 3c65 7861 6d70 6c65 3a66 6f72 6d5f  > <example:form_
00000020: 6665 6564 3e20 220c 2220 2e0a 3c65 7861  feed> "." ..<exa
00000030: 6d70 6c65 3a73 7065 6369 616c 3e20 3c65  mple:special> <e
00000040: 7861 6d70 6c65 3a63 6172 7269 6167 655f  xample:carriage_
00000050: 7265 7475 726e 3e20 225c 7222 202e 0a3c  return> "\r" ..<
00000060: 6578 616d 706c 653a 7370 6563 6961 6c3e  example:special>
00000070: 203c 6578 616d 706c 653a 7374 7269 6e67   <example:string
00000080: 2d30 3030 3e20 225c 5c5c 7222 202e 0a3c  -000> "\\\r" ..<
00000090: 6578 616d 706c 653a 7370 6563 6961 6c3e  example:special>
000000a0: 203c 6578 616d 706c 653a 7374 7269 6e67   <example:string
000000b0: 2d30 3031 3e20 225c 5c5c 7222 202e 0a3c  -001> "\\\r" ..<
000000c0: 6578 616d 706c 653a 7370 6563 6961 6c3e  example:special>
000000d0: 203c 6578 616d 706c 653a 6e65 776c 696e   <example:newlin
000000e0: 653e 2022 5c6e 2220 2e0a 3c65 7861 6d70  e> "\n" ..<examp
000000f0: 6c65 3a73 7065 6369 616c 3e20 3c65 7861  le:special> <exa
00000100: 6d70 6c65 3a62 6163 6b73 6c61 7368 3e20  mple:backslash> 
00000110: 225c 5c22 202e 0a3c 6578 616d 706c 653a  "\\" ..<example:
00000120: 7370 6563 6961 6c3e 203c 6578 616d 706c  special> <exampl
00000130: 653a 6261 636b 7370 6163 653e 2022 0822  e:backspace> "."
00000140: 202e 0a0a          
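
For readers without xxd, roughly the same check can be sketched in Python. The text below is retyped from two lines of the round-trip output above, and the helper name `find_control_chars` is made up for illustration; it only reports where raw control characters sit in the serialized text.

```python
# Two lines retyped from the round-trip output above. The form feed (0x0c)
# and the backspace (0x08) are raw characters here, not escape sequences.
serialized = (
    '<example:special> <example:form_feed> "\x0c" .\n'
    '<example:special> <example:backspace> "\x08" .\n'
)


def find_control_chars(text: str):
    """Return (line_number, column, code_point) for each raw control character.

    The text is split on line feeds first, so the line terminators
    themselves are never reported.
    """
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line):
            if ord(ch) < 0x20:
                hits.append((line_no, col, f"0x{ord(ch):02x}"))
    return hits


for hit in find_control_chars(serialized):
    print(hit)
```

As far as I understand the N-Triples grammar, a quoted string literal only has to escape the double quote, the backslash, line feed and carriage return, so raw 0x0c and 0x08 characters would be unusual but not necessarily invalid. That reading is an assumption on my part and should be checked against the spec.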

@aucampia
Member

aucampia commented Jan 9, 2022

Actually my apologies, it seems there is something wrong with <example:special> <example:string-000> "\\\r" . in output.

@aucampia
Member

aucampia commented Jan 9, 2022

And yes, when parsing that as ttl it works fine:

$ .venv/bin/python3 -m rdflib.tools.rdfpipe -i ttl -o nt test/variants/special_chars.nt 
/home/iwana/sw/d/github.com/iafork/rdflib/rdflib/plugins/serializers/nt.py:36: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
<example:special> <example:backspace> " .
<example:special> <example:newline> "\n" .
<example:special> <example:string-001> "\\\r" .
<example:special> <example:string-000> "\\r" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:form_feed> "
                                       " .
<example:special> <example:backslash> "\\" .

@GreenfishK
Author

I would expect to receive the same canonical string when I round-trip it. However, these special escapes are either not recognized or are interpreted, for both nt and ttl. E.g. \n is an actual new line in the output, even for ttl. Is this desired?

@aucampia
Member

aucampia commented Jan 9, 2022

\n is an actual new line in the output, even for ttl. Is this desired?

It may not be desired, but it is correct I think, will have to check the spec to be sure though.

@aucampia aucampia added the bug Something isn't working label Jan 9, 2022
@GreenfishK
Author

So for this triple
<example:special> <example:newline> "\n" .

I get this when I round-trip the nt file:
<example:special> <example:newline> """ """.
Is this really correct? If yes, then my apologies.

@aucampia
Copy link
Member

aucampia commented Jan 9, 2022

In python,

"""
"""

That is just a multiline string consisting of a single newline, so what you showed earlier seems okay. It is odd that n3 renders it like repr(), and maybe that should be fixed, but n3 should be used with care anyway. If you look at the actual value of the literal object in <http:s> <http:o1> "\n" . then I think it is right. I will maybe add some more checks just to be sure, but I have already done a couple of different checks for this and all seem fine. The only problem I can see for now is that <http:s> <http:o4> "\\r" . does not parse correctly when parsed as ntriples.
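
A quick way to check this in plain Python, with no rdflib involved (the variable names are only for illustration):

```python
# A triple-quoted string that contains only a line break is the same value
# as the escaped form "\n": a single character, U+000A.
multiline = """
"""
escaped = "\n"

assert multiline == escaped
assert len(multiline) == 1 and ord(multiline) == 0x0A

# repr() shows the escape sequence, while print() emits the raw character,
# so the same value can look different depending on how it is displayed.
print(repr(multiline))
```

So a literal shown across two lines and a literal shown as "\n" can be the same value; only the rendering differs.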

@GreenfishK
Author

GreenfishK commented Jan 9, 2022

Have you tried to parse the triples above from .nt and then serialize them as .nt? (I guess this is what you mean by round trip.) For me, all the objects above had a different representation in the destination file. In particular,
<http:s> <http:o3> "\b" . was serialized as an unrecognizable character.

Regarding the newline: I still don’t understand why it is correct for it to be interpreted as an actual new line when serialized. E.g. if I have an ontology about Python with a triple like <newline> <label> "\n" and I parse and serialize it, then the object literal would be lost. Or would I in this case add an additional backslash as an escape character, like <newline> <label> "\\n" ?

aucampia added a commit to aucampia/rdflib that referenced this issue Jan 9, 2022
This includes xfails for the following issues:

- RDFLib#1216
- RDFLib#1655
- RDFLib#1649

Also:

- Add graph variant test scaffolding. Multiple files representing the
  same graph can now easily be tested to be isomorphic by just adding
  them in `test/variants`.
- Add more things to `testutils.GraphHelper`, including some methods that does
  asserts with better messages. Also include some tests for GraphHelper.
- Add some extra files to test_roundtrip, set the default identifier
  when parsing, and change verbose flag to rather be based on debug
  logging.
- move one test from `test/test_issue247.py` to variants.
aucampia added a commit to aucampia/rdflib that referenced this issue Jan 9, 2022
This includes xfails for the following issues:

- RDFLib#1216
- RDFLib#1655
- RDFLib#1649

Also:

- Add graph variant test scaffolding. Multiple files representing the
  same graph can now easily be tested to be isomorphic by just adding
  them in `test/variants`.
- Add more things to `testutils.GraphHelper`, including some methods that does
  asserts with better messages. Also include some tests for GraphHelper.
- Add some extra files to test_roundtrip, set the default identifier
  when parsing, and change verbose flag to rather be based on debug
  logging.
- move one test from `test/test_issue247.py` to variants.
- Fix problems with `.editorconfig` which prevents it from working
  properly.
@aucampia
Member

aucampia commented Jan 9, 2022

@GreenfishK

$ .venv/bin/pytest 'test/test_roundtrip.py::test_extra[roundtrip_special_chars.nt_ntriples_ntriples]' --log-level DEBUG -rA
============================================================================ test session starts ============================================================================
platform linux -- Python 3.8.12, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/github.com/iafork/rdflib, configfile: tox.ini
plugins: subtests-0.5.0, cov-3.0.0
collected 1 item                                                                                                                                                            

test/test_roundtrip.py .                                                                                                                                              [100%]

================================================================================== PASSES ===================================================================================
_________________________________________________________ test_extra[roundtrip_special_chars.nt_ntriples_ntriples] __________________________________________________________
----------------------------------------------------------------------------- Captured log call -----------------------------------------------------------------------------
DEBUG    test.test_roundtrip:test_roundtrip.py:221 serailized = 
<example:special> <example:backspace> " .
<example:special> <example:backslash> "\\" .
<example:special> <example:carriage_return> "\r" .
<example:special> <example:string-000> "\\\r" .
<example:special> <example:form_feed> "
                                       " .
<example:special> <example:newline> "\n" .
<example:special> <example:string-001> "\\\r" .


DEBUG    test.test_roundtrip:test_roundtrip.py:227 Items in both:
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:backspace'), rdflib.term.Literal('\x08'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-001'), rdflib.term.Literal('\\\r'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:form_feed'), rdflib.term.Literal('\x0c'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:backslash'), rdflib.term.Literal('\\'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:carriage_return'), rdflib.term.Literal('\r'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:newline'), rdflib.term.Literal('\n'))
  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-000'), rdflib.term.Literal('\\\r'))
DEBUG    test.test_roundtrip:test_roundtrip.py:228 Items in G1 Only:

DEBUG    test.test_roundtrip:test_roundtrip.py:229 Items in G2 Only:

DEBUG    test.test_roundtrip:test_roundtrip.py:234 OK
========================================================================== short test summary info ==========================================================================
PASSED test/test_roundtrip.py::test_extra[roundtrip_special_chars.nt_ntriples_ntriples]
============================================================================= 1 passed in 0.18s =============================================================================

Test added in #1658

This is wrong (i.e. parsed wrong), because "\\r" in the file should become a backslash followed by the letter r, not a backslash followed by a carriage return:

  (rdflib.term.URIRef('example:special'), rdflib.term.URIRef('example:string-000'), rdflib.term.Literal('\\\r'))

This is an issue I also added xfail tests for, but the rest all seem right to me. I'm looking at fixing the \\r issue now, but I will have to completely rewrite the unescaping for ntriples to fix this.

@aucampia aucampia self-assigned this Jan 9, 2022
@GreenfishK
Author

Thank you!!

@GreenfishK
Author

@aucampia I made a few more checks for the .ttl dataset with the special characters/escapes and I think it looks good. In particular, I imported the following two triples into GraphDB and then queried them, and the representation for both was exactly the same.

<http:s> <http:p> """
"""

and
<http:s> <http:p> "\n"

SPARQL Query result when querying from GraphDB

<http:s> <http:p> "
"

However, for .nt I did not manage to get the correct output after calling parse(...) on a UTF-8-encoded .nt dataset and then serialize(...). I see errors especially for \\b and \\f. See the example below, which uses an original DBpedia triple.

Example triple from DBPedia:
<http://dbpedia.org/resource/Rodeo_(Travis_Scott_album)> <http://dbpedia.org/property/cover> "{\\rtf1\\ansi\\ansicpg1252{\\fonttbl}\n{\\colortbl;\\red255\\green255\\blue255;"@en .

Output after function parse(...) and serialize(...) is called:
<http://dbpedia.org/resource/Rodeo_(Travis_Scott_album)> <http://dbpedia.org/property/cover> "{\\\rtf1\\ansi\\ansicpg1252{\\ onttbl}\n{\\colortbl;\\\red255\\green255\\�lue255;"@en .

Should I make a Python notebook for this issue with the code I am using, so that it is easier to verify and to see?

aucampia added a commit to aucampia/rdflib that referenced this issue Jan 10, 2022
@aucampia
Member

The most reasonable solution I can find for this is to replace:

import re

r_unicodeEscape = re.compile(r"(\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})")


def _unicodeExpand(s: str) -> str:
    return r_unicodeEscape.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)


def decodeUnicodeEscape(s: str) -> str:
    """
    s is a unicode string
    replace ``\\n`` and ``\\u00AC`` unicode escapes
    """
    # if "\\" not in s:
    #     # Most of times, there are no backslashes in strings.
    #     # In the general case, it could use maketrans and translate.
    #     return s

    s = s.replace("\\t", "\t")
    s = s.replace("\\n", "\n")
    s = s.replace("\\r", "\r")
    s = s.replace("\\b", "\b")
    s = s.replace("\\f", "\f")
    s = s.replace('\\"', '"')
    s = s.replace("\\'", "'")
    s = s.replace("\\\\", "\\")

    s = _unicodeExpand(s)  # hmm - string escape doesn't do unicode escaping

    return s

with

import re
from typing import Match

string_escape_map = {
    "t": "\t",
    "b": "\b",
    "n": "\n",
    "r": "\r",
    "f": "\f",
    '"': '"',
    "'": "'",
    "\\": "\\",
}
string_escape_trans = str.maketrans(string_escape_map)


def _group_handler(match: Match[str]) -> str:
    rmatch, smatch, umatch = match.groups()
    if rmatch is not None:
        return rmatch
    elif smatch is not None:
        return smatch.translate(string_escape_trans)
    else:
        return chr(int(umatch[1:], 16))


turtle_escape_pattern = re.compile(
    r"""\\(?:([~.\-!$&'()*+,;=\/?#@%_])|([tbnrf"'\\])|(u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8}))""",
    # re.ASCII,
)


def turtle_unescape(escaped: str) -> str:
    return turtle_escape_pattern.sub(_group_handler, escaped)

The replacement is slower in simple string replacement cases, but roughly as fast or faster in other cases. I will try to make a patch tomorrow.
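
To make the difference concrete, here is a minimal sketch that runs both approaches on the "\\r" case from this issue. `chained_unescape` and `single_pass_unescape` are illustrative names, not rdflib functions, and both leave out the \u/\U handling and the reserved-character escapes from the snippets above.

```python
import re


# Old approach, condensed from the function quoted above: every escape is
# replaced in its own pass over the whole string, in this fixed order.
def chained_unescape(s: str) -> str:
    for escaped, char in (
        ("\\t", "\t"),
        ("\\n", "\n"),
        ("\\r", "\r"),
        ("\\b", "\b"),
        ("\\f", "\f"),
        ('\\"', '"'),
        ("\\'", "'"),
        ("\\\\", "\\"),
    ):
        s = s.replace(escaped, char)
    return s


# Single-pass approach: the regex consumes each escape sequence exactly once,
# left to right, so an escaped backslash is never re-read as the start of
# another escape.
ESCAPES = {"t": "\t", "b": "\b", "n": "\n", "r": "\r", "f": "\f",
           '"': '"', "'": "'", "\\": "\\"}
PATTERN = re.compile(r"""\\([tbnrf"'\\])""")


def single_pass_unescape(s: str) -> str:
    return PATTERN.sub(lambda m: ESCAPES[m.group(1)], s)


# Lexical form as written in the .nt file: backslash, backslash, r.
lexical = "\\\\r"

print(repr(chained_unescape(lexical)))      # backslash + carriage return
print(repr(single_pass_unescape(lexical)))  # backslash + the letter r
```

The chained version goes wrong because the "\\r" pass runs before the "\\\\" pass and matches the second backslash together with the r. Both versions agree on the "\\\r" input, which is why the two literals in the original report end up identical with the old code.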

In the long run, I think we should look at the parsers themselves. Using https://github.com/lark-parser/lark to write a parser for N3 may make more sense, and that can compile to Python.

aucampia added a commit to aucampia/rdflib that referenced this issue Jan 12, 2022
@aucampia
Member

I made a draft fix for this here #1663

You can have a look at the tests being done in this file.

This tests the handling of various escape sequences through the Turtle-related parsers quite extensively, and all seems okay with that fix. Many things are fixed; in the PR I placed a list of all the added tests that fail with the old code for you to look at. All other cases seem to work correctly in master and continue to work correctly.

If you think there is something I missed in the tests let me know please.
