Surrogateescape errors by BCSharp · Pull Request #713 · IronLanguages/ironpython3

BCSharp · 2019-12-31T03:18:48Z

This PR implements the codecs.surrogateescape_errors builtin function, with accompanying tests. The same caveat applies as for the previous error handlers: given a misconstructed UnicodeError, the way the range gets coerced may sometimes differ from CPython, but in no case will it raise IndexError.

During testing I discovered that my original understanding of PEP 383 was wrong. I thought that PEP 383 forbids smuggling of ASCII characters, but it actually forbids smuggling of ASCII bytes. Consequently, surrogateescape cannot be meaningfully used with wide-char encodings, like UTF-16 or UTF-32. The Python documentation doesn't mention it, but the PEP does say it is intended for ASCII-compatible encodings only. On the bright side, the implementation becomes simpler.
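To illustrate the distinction, here is a small sketch using CPython's built-in handler with an ASCII-compatible encoding (UTF-8). The helper name `try_encode` is made up for this example. It shows an undecodable high byte surviving a round trip, and an attempt to smuggle the ASCII byte 0x41 through U+DC41 being refused:

```python
# Built-in surrogateescape with an ASCII-compatible encoding (UTF-8).
raw = b"abc\xff"                      # 0xFF is never valid in UTF-8
text = raw.decode("utf-8", "surrogateescape")
# The offending byte 0xFF is escaped as the lone surrogate U+DCFF.
round_trip = text.encode("utf-8", "surrogateescape")


def try_encode(s):
    """Hypothetical helper: return the encoded bytes, or None on failure."""
    try:
        return s.encode("utf-8", "surrogateescape")
    except UnicodeEncodeError:
        return None


# U+DC41 would map back to byte 0x41 ("A"), an ASCII byte, so it is rejected.
smuggled = try_encode("\udc41")

print(repr(text), round_trip == raw, smuggled)
```

Only U+DC80 through U+DCFF are accepted on the encoding side, which is what keeps ASCII bytes from being smuggled.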

For instance, in the case of UTF-16, if a lone surrogate (ordinarily an error) has a least significant byte of 0x80 or above, both of its bytes will be escaped; if that byte is less than 0x80, only the high byte is escaped and the low byte is treated as belonging to the next character, which misaligns all subsequent bytes:

>>> b"\xdd\x80".decode('utf-16-be', 'surrogateescape')
'\udcdd\udc80'
>>> b"\xdd\x20".decode('utf-16-be', 'surrogateescape')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Pawel\.conda\envs\py37\lib\encodings\utf_16_be.py", line 16, in decode
    return codecs.utf_16_be_decode(input, errors, True)
UnicodeDecodeError: 'utf-16-be' codec can't decode byte 0x20 in position 1: truncated data
>>> b"\xdd\x20\xac".decode('utf-16-be', 'surrogateescape')
'\udcdd€'
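The session above can be reproduced with a pure-Python approximation of the decoding side of the handler. This is only a sketch of my reading of the rule (escape up to four bytes of the failing range, stop at the first byte below 0x80), not the code from this PR or from CPython; the handler name `sketch_surrogateescape` is an assumption made for the example.

```python
import codecs


def sketch_surrogateescape(exc):
    """Illustrative decode-only approximation of surrogateescape."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    escaped = []
    # Look at no more than 4 bytes of the failing range.
    for byte in exc.object[exc.start:exc.end][:4]:
        if byte < 0x80:
            break  # ASCII bytes must never be escaped
        escaped.append(chr(0xDC00 + byte))
    if not escaped:
        raise exc  # nothing could be escaped: propagate the original error
    # Resume right after the last escaped byte, which may be mid-code-unit.
    return "".join(escaped), exc.start + len(escaped)


codecs.register_error("sketch_surrogateescape", sketch_surrogateescape)

both = b"\xdd\x80".decode("utf-16-be", "sketch_surrogateescape")
shifted = b"\xdd\x20\xac".decode("utf-16-be", "sketch_surrogateescape")
print(ascii(both), ascii(shifted))
```

Because the handler resumes at the byte it could not escape, the decoder pairs 0x20 with the following 0xAC and produces U+20AC, which is the misalignment shown in the third statement.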

In the case of UTF-32, no surrogates can be escaped, as their encoded form invariably contains two zero bytes.
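A quick check of the UTF-32 case, assuming CPython's built-in handler and big-endian byte order (the helper name `decode_or_none` is made up for this example). The surrogate U+DD80 is encoded as 00 00 DD 80, so the first byte the handler sees is already below 0x80 and nothing can be escaped:

```python
def decode_or_none(data, encoding):
    """Hypothetical helper: decode with surrogateescape, or None on failure."""
    try:
        return data.decode(encoding, "surrogateescape")
    except UnicodeDecodeError:
        return None


# A surrogate code point in UTF-32-BE starts with two zero bytes.
utf32_result = decode_or_none(b"\x00\x00\xdd\x80", "utf-32-be")
print(utf32_result)
```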

Since this is a security issue, in the second commit of this PR I correct PythonSurrogateEscapeEncoding and modify the relevant tests. The change prevents smuggling of ASCII bytes, although the behaviour is still not exactly like CPython's: the third statement in the example above will raise an exception rather than misalign the subsequent bytes. I am leaving it like this for now because I am considering getting rid of PythonSurrogateEscapeEncoding altogether at some point in the future.

slozier · 2020-01-02T15:05:09Z

Thanks for the PR!

BCSharp added 3 commits December 30, 2019 11:37

Implement codecs.surrogateescape_errors

6ecf5aa

Fix PythonSurrogateEscapeEncoding

2897e75

Fix PythonSurrogateEscapeEncoding, encoding side

1803da8

slozier merged commit 83a7cdf into IronLanguages:master Jan 2, 2020

BCSharp deleted the surrogateescape_errors branch February 8, 2020 00:10
