Implements 'surrogateescape' encoding error handler #545

BCSharp · 2019-02-18T05:37:38Z

Implements PEP 383 and Issue #2.
Works with all .NET encodings: ASCII, Latin1, UTF-8, UTF-7, UTF-16, UTF-32, etc.

The challenge is that .NET encoder and decoder fallback interfaces are only character oriented. This was also a problem in CPython, but they could extend the interface to allow fallbacks to produce output on byte level (see discussion PEP 383).

Src/IronPython/Runtime/Operations/StringOps.cs

slozier · 2019-02-18T15:33:07Z

I like the two-pass encoding idea. Do you have test cases for these? Might be useful to include in the testing.

There are a few failures related to my comment about CodePage on the CI. For the Linux tests you'll have to revert the workarounds I made in Src/StdLib/Lib/test/support/__init__.py of #539 to see how it's working.

This reverts commit 864c44c.

BCSharp · 2019-02-18T19:33:36Z

Frankly, I do not know how to do it in a single pass (short from rewriting all .NET encoding classes from scratch). The good thing is that, if there are no decoding/encoding errors (which will be the vast majority of practical cases), the decoding/encoding will still be done in one pass.

I do have C# unit tests for PythonSurrogateEscapeEncoding. They test whether the encoding is round-trip safe, whether it produces the same results as CPython, whether decoding/encoding in in blocks works OK (this will never be used for OS paths, but should work nevertheless), and throw some curve balls for specific encodings. I will include them once I figure out where they fit.

I also have a few tests in Python, which are less extensive but focus on (potential) differences with CPython.

Funny but locally, all tests were passing. Does CI do more tests?

slozier · 2019-02-18T20:14:40Z

The CI runs tests on .NET Framework + .NET Core on Windows and Mono + .NET Core on Linux and macOS. Not sure why it would pass for you but fail on the CI (I managed to reproduce the failure over here).

You can include the C# unit tests in ÌronPythonTest, it's set up to have access to the IronPython internals. It should pick up and run NUnit tests. You can probably set it up like:

namespace IronPythonTest {
    [TestFixture(Category="IronPython")]
    public class SurrogateEscape {
        [Test]
        public void Utf8() {
            // ...
        }
        // etc...
    }
}

BCSharp · 2019-02-18T20:44:08Z

I have reverted the changes to PythonAsciiEncoding and EncodingWrapper plus the Linux workaround. All test pass (still), but I am going to push the changes to see the CI tests. If they pass, I will include the unit tests in a separate commit. Thanks for the info.

slozier · 2019-02-18T21:49:04Z

Sorry, I should have been clear that it should have been a partial revert of #539 since the changes to Src/StdLib/Lib/os.py are still required for things to work properly right now.

BCSharp · 2019-02-19T05:57:24Z

The Windows tests pass but the Linux/macOS tests fail. The main cause is os.environ, which tries to decode with surrogateescape the dictionary it got from the posix module. This fails (cannot decode a str), because the posix module is not really Posix, it is nt renamed. One of the differences between CPython's posix and nt is that the environ dictionary on Posix contains bytestrings, but unicode strings on NT.

The nt module in turn gets its environ dictionary from EnvironmentDictionaryStorage in PythonDicttionary.cs, which obtains it through .NET's Environment.GetEnvironmentVariables(). The last one always returns a dictionary of strings, even on Posix systems. So the decoding of the environment strings already happened inside .NET.

I have done some tests to see how .NET handles decoding errors, and sure enough, they do not get escaped in lone surrogates, but simply replaced by U+FFFD. So information is lost and we will never have full CPython behavior here. What we still can do is to make sure that we've got the types right, that would be good enough for practical purposes. This means that posix.environ should be a dictionary of bytestrings.

The question is where to do the conversion of strings to bytes:

Since module nt dubs as posix, perhaps that is the best place to handle all platform specific differences. There may be more than just the environment variables, there are so many tests failing that I couldn't tell.

slozier · 2019-02-19T15:22:07Z

@BCSharp I would probably fork EnvironmentDictionaryStorage into BytesEnvironmentDictionaryStorage (or something of the sort) and then in the nt module call the appropriate one. Either that or a bunch of platform checks in EnvironmentDictionaryStorage. I don't think it can be any higher up since we'd end up having to implement yet another dict wrapper so that Add/Remove updates the underlying EnvironmentDictionaryStorage...

I'm also completely fine with just having the os patch (although it's a pain whenever the StdLib needs to be updated).

BCSharp · 2019-02-19T20:19:27Z

I'm also completely fine with just having the os patch

Great. I leave updating ``EnvironmentDictionaryStorage` for another PR.

slozier · 2019-02-23T14:30:41Z

Thanks for adding the tests.

Looks like Mono is having issues with GetChars. In particular (SurrogateEscapeDecoderFallbackBuffer)_pass1decoder.FallbackBuffer is failing. It is using a DefaultDecoder which simply does a pass through to the Encoding.GetChars so Fallback and FallbackBuffer are never set. Is it documented that the Decoder must inherit the Fallback from the Encoding? You might be relying on an implementation detail here...

BCSharp · 2019-02-23T18:44:01Z

I have noticed that one test on Mono was failing but I haven’t got time to look into it yet. Probably I will have some time tonight (Pacific time). Thank you for providing some insight into the problem.

What is interesting is that the test that is failing was meant to be run only on .NET Framework on Windows, as it uses an Encoding that is not guaranteed to be available anywhere else. I have excluded it from .NET Core by wrapping it in #if !NETCOREAPP2_1 and #endif but forgotten to disable on Mono. So the fact that is failing is not surprising but it is failing in a surprising way.

By the way, is there a better way to mark NUnit tests as applicable only to a specific framework or operating system? I was looking fir some attributes to apply to the test method, but haven’t found any.

slozier · 2019-02-23T22:15:23Z

Hmm, while cp1252 is primarily a Windows encoding I think it should be available on all platforms. CPython supports it (via a charmap encoding?) and .NET also supports it. In the case of .NET Core, it's not available natively, but it is still available via the System.Text.Encoding.CodePages package which we use. You could the following to your SetUp if you want to make it available:

#if NETCOREAPP2_1
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
#endif

We have an explicit hook up for cp1252 encoding on .NET Core (see CodecsInfo.MakeCodecsDict) to make it available in IronPython.

Maybe if we didn't hook up encodings (other than the Python builtin ones) then we wouldn't have to deal with these compatibility issues when using normal Python. However we still should still support the interop scenario: "aaa".encode(SomeDotNetEncoding) (which I seem to have broken recently #551).

Looks like the standard method for skipping the test would be something like:

if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) {
    Assert.Ignore("some message");
}

BCSharp · 2019-02-24T05:39:31Z

Is it documented that the Decoder must inherit the Fallback from the Encoding?

No, it is not. What is documented is that if a Decoder is stateless, it may forward to Encoding.GetChars. For stateless decoders, the FallbackBuffer is irrelevant as it is never used, so I assume that Mono developers don't set it as an optimization. I have corrected my code to handle such case appropriately and I keep the test enabled on Mono.

We have an explicit hook up for cp1252 encoding on .NET Core (see CodecsInfo.MakeCodecsDict) to make it available in IronPython.

Strange, but when I do Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) in the test fixture, Encoding 1252 is still not available.

slozier · 2019-02-24T06:05:57Z

Odd about Encoding 1252, the test works fine for me... If you're happy with your PR as it is I'll take another look tomorrow and merge it in if everything looks good.

BCSharp · 2019-02-24T06:34:58Z

I have just noticed that System.Text.Encoding.CodePages is not referenced from IronPythonTest. I will add it tomorrow and test again.

slozier · 2019-02-24T14:08:48Z

The reference should have been inherited from IronPython. I'm not sure that the CodePagesEncodingProvider.Instance API is even available if you don't reference System.Text.Encoding.CodePages.

BCSharp · 2019-02-25T05:43:01Z

Adding the reference to System.Text.Encoding.CodePages did not help. The API is available, with or without this reference, but code page 1252 is not. If you have ideas to make it work, feel free to do it, otherwise, I am OK with a merge as it is.

slozier · 2019-02-25T21:16:06Z

Thanks again for the PR!

BCSharp added 3 commits February 17, 2019 21:26

New class PythonSurrogateEscapeEncoding

2c32543

Enable surrogateescape encoding error handler

130ba22

Merge remote-tracking branch 'upstream/master' into surrogateencode

b7fba29

slozier reviewed Feb 18, 2019

View reviewed changes

Src/IronPython/Runtime/Operations/StringOps.cs Outdated Show resolved Hide resolved

Revert "Bypass surrogateescape in test.support (IronLanguages#539)"

9558b31

This reverts commit 864c44c.

Revert changes to PythonAsciiEncoding and EncodingWrapper

53060ce

Work around PythonAsciiEncoding in handling surrogateescape

9e95e85

Bypass (again) surrogateescape in os.environ (IronLanguages#539)

42c1199

BCSharp added 4 commits February 21, 2019 19:11

Unit test for PythonSurrogateEscapeEncoding

1d75534

Support running tests by a query from make.ps1

52268ad

Bugfix in PythonSurrogateEscapeEncoding

d3b553d

Add IronPython.test_surrogateescape

8ab5507

BCSharp added 3 commits February 23, 2019 20:17

Bugfix: handle stateless Encoders and Decoders

485e345

Improve tests for surrogateescape

3f733ad

Optimizations in encoder and decoder for PythonSurrogateEscapeEncoding

92a4cb0

slozier merged commit 744c9d7 into IronLanguages:master Feb 25, 2019

BCSharp deleted the surrogateencode branch April 1, 2019 18:06

slozier changed the title ~~Implements 'surrogateencode' encoding error handler~~ Implements 'surrogateescape' encoding error handler Sep 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implements 'surrogateescape' encoding error handler #545

Implements 'surrogateescape' encoding error handler #545

BCSharp commented Feb 18, 2019 •

edited

slozier commented Feb 18, 2019

BCSharp commented Feb 18, 2019

slozier commented Feb 18, 2019

BCSharp commented Feb 18, 2019

slozier commented Feb 18, 2019

BCSharp commented Feb 19, 2019

slozier commented Feb 19, 2019

BCSharp commented Feb 19, 2019

slozier commented Feb 23, 2019

BCSharp commented Feb 23, 2019

slozier commented Feb 23, 2019 •

edited

BCSharp commented Feb 24, 2019

slozier commented Feb 24, 2019

BCSharp commented Feb 24, 2019

slozier commented Feb 24, 2019

BCSharp commented Feb 25, 2019

slozier commented Feb 25, 2019

Implements 'surrogateescape' encoding error handler #545

Implements 'surrogateescape' encoding error handler #545

Conversation

BCSharp commented Feb 18, 2019 • edited

slozier commented Feb 18, 2019

BCSharp commented Feb 18, 2019

slozier commented Feb 18, 2019

BCSharp commented Feb 18, 2019

slozier commented Feb 18, 2019

BCSharp commented Feb 19, 2019

slozier commented Feb 19, 2019

BCSharp commented Feb 19, 2019

slozier commented Feb 23, 2019

BCSharp commented Feb 23, 2019

slozier commented Feb 23, 2019 • edited

BCSharp commented Feb 24, 2019

slozier commented Feb 24, 2019

BCSharp commented Feb 24, 2019

slozier commented Feb 24, 2019

BCSharp commented Feb 25, 2019

slozier commented Feb 25, 2019

BCSharp commented Feb 18, 2019 •

edited

slozier commented Feb 23, 2019 •

edited