
Surrogateescape errors #713

Merged
slozier merged 3 commits into IronLanguages:master from BCSharp:surrogateescape_errors on Jan 2, 2020

Conversation

@BCSharp
Member

@BCSharp BCSharp commented Dec 31, 2019

This PR implements the codecs.surrogateescape_errors builtin function with the accompanying tests. The same caveat applies as for the previous error handlers: when given a malformed UnicodeError, the way the range gets coerced may sometimes differ from CPython, but in no instance will it throw IndexError.
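For reference, here is a minimal sketch of how such an error handler is invoked directly, assuming CPython 3, where the handler is obtained with codecs.lookup_error("surrogateescape"). The exception arguments below are made up for illustration; they are not taken from the tests in this PR.

```python
import codecs

# Look the handler up by its registered name (the documented CPython route).
handler = codecs.lookup_error("surrogateescape")

# Decoding direction: the undecodable byte 0xFF at position 1 is mapped to
# the lone surrogate U+DCFF, and decoding resumes at position 2.
dec_exc = UnicodeDecodeError("utf-8", b"a\xffb", 1, 2, "invalid start byte")
dec_result = handler(dec_exc)
print(dec_result)

# Encoding direction: the lone surrogate U+DCFF at position 1 is mapped back
# to the byte 0xFF, and encoding resumes at position 2.
enc_exc = UnicodeEncodeError("ascii", "a\udcffb", 1, 2, "ordinal not in range(128)")
enc_result = handler(enc_exc)
print(enc_result)
```

The handler returns a (replacement, resume position) pair in both directions; the range it reads from the exception is the part that may be coerced differently when the exception is malformed.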

During testing I discovered that my original understanding of PEP 383 was wrong. I thought that PEP 383 forbids smuggling of ASCII characters, but it actually forbids smuggling of ASCII bytes. Consequently, surrogateescape cannot be meaningfully used with wide-char encodings, like UTF-16 or UTF-32. The Python documentation doesn't mention it, but the PEP does say it is intended for ASCII-compatible encodings only. On the bright side, the implementation becomes simpler.
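A small sketch of the intended use with an ASCII-compatible encoding, assuming CPython 3 behaviour. The sample bytes and the helper name can_smuggle are made up for illustration.

```python
# Latin-1 encoded bytes that are not valid UTF-8 (0xE9 is followed by '.').
raw = b"caf\xe9.txt"

# The undecodable byte 0xE9 is escaped as the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")
print(ascii(text))

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("utf-8", "surrogateescape")
print(round_trip)


def can_smuggle(ch):
    """Return True if the handler turns the surrogate `ch` back into a byte."""
    try:
        ch.encode("ascii", "surrogateescape")
        return True
    except UnicodeEncodeError:
        return False


# U+DC80..U+DCFF map back to bytes 0x80..0xFF; U+DC41 would map to the ASCII
# byte 0x41 ('A'), which the handler refuses to produce.
high_byte_ok = can_smuggle("\udc80")
ascii_byte_ok = can_smuggle("\udc41")
print(high_byte_ok, ascii_byte_ok)
```

Only bytes 0x80 and above ever travel through the escape mechanism, which is why the scheme presupposes an encoding in which bytes below 0x80 always mean ASCII.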

For instance, in the case of UTF-16, if the low byte of a lone surrogate (ordinarily an error) is 0x80 or above, both of its bytes will be escaped; if that byte is less than 0x80, it will not be escaped but treated as belonging to the next character, leaving all subsequent bytes misaligned:

>>> b"\xdd\x80".decode('utf-16-be', 'surrogateescape')
'\udcdd\udc80'
>>> b"\xdd\x20".decode('utf-16-be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Pawel\.conda\envs\py37\lib\encodings\utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x20 in position 1: truncated data
>>> b"\xdd\x20\xac".decode('utf-16-be', 'surrogateescape')
'\udcdd€'

In the case of UTF-32, no surrogates can be escaped, as their encoded form invariably contains two zero bytes.
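The UTF-32 case can be checked with a short sketch, assuming CPython 3 behaviour; the helper try_decode is a hypothetical name introduced here for illustration.

```python
def try_decode(data, encoding):
    """Decode with surrogateescape; return the text or the exception name."""
    try:
        return data.decode(encoding, "surrogateescape")
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# U+DC80 as UTF-32-BE is 00 00 DC 80. The failing range starts with a zero
# byte, which is below 0x80, so the handler escapes nothing and the error
# propagates.
utf32_result = try_decode(b"\x00\x00\xdc\x80", "utf-32-be")
print(utf32_result)

# The same code unit as UTF-16-BE is DC 80. Both bytes are 0x80 or above,
# so each one is escaped individually.
utf16_result = try_decode(b"\xdc\x80", "utf-16-be")
print(ascii(utf16_result))
```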

Since this is a security issue, in the second commit of this PR I correct PythonSurrogateEscapeEncoding and modify the relevant tests. The change prevents smuggling of ASCII bytes, although the behaviour is still not exactly like CPython: the third statement from the example above will raise an exception rather than misalign the subsequent bytes. I am leaving it like this for now because I am considering getting rid of PythonSurrogateEscapeEncoding altogether at some point in the future.

@slozier slozier merged commit 83a7cdf into IronLanguages:master Jan 2, 2020
@slozier
Contributor

slozier commented Jan 2, 2020

Thanks for the PR!

@BCSharp BCSharp deleted the surrogateescape_errors branch February 8, 2020 00:10