Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implements 'surrogateescape' encoding error handler #545

Merged
merged 14 commits into from Feb 25, 2019

Conversation

BCSharp
Copy link
Member

@BCSharp BCSharp commented Feb 18, 2019

Implements PEP 383 and Issue #2.
Works with all .NET encodings: ASCII, Latin1, UTF-8, UTF-7, UTF-16, UTF-32, etc.

The challenge is that .NET encoder and decoder fallback interfaces are only character oriented. This was also a problem in CPython, but they could extend the interface to allow fallbacks to produce output on byte level (see discussion PEP 383).

@slozier
Copy link
Contributor

slozier commented Feb 18, 2019

I like the two-pass encoding idea. Do you have test cases for these? Might be useful to include in the testing.

There are a few failures related to my comment about CodePage on the CI. For the Linux tests you'll have to revert the workarounds I made in Src/StdLib/Lib/test/support/__init__.py of #539 to see how it's working.

@BCSharp
Copy link
Member Author

BCSharp commented Feb 18, 2019

Frankly, I do not know how to do it in a single pass (short from rewriting all .NET encoding classes from scratch). The good thing is that, if there are no decoding/encoding errors (which will be the vast majority of practical cases), the decoding/encoding will still be done in one pass.

I do have C# unit tests for PythonSurrogateEscapeEncoding. They test whether the encoding is round-trip safe, whether it produces the same results as CPython, whether decoding/encoding in in blocks works OK (this will never be used for OS paths, but should work nevertheless), and throw some curve balls for specific encodings. I will include them once I figure out where they fit.

I also have a few tests in Python, which are less extensive but focus on (potential) differences with CPython.

Funny but locally, all tests were passing. Does CI do more tests?

@slozier
Copy link
Contributor

slozier commented Feb 18, 2019

The CI runs tests on .NET Framework + .NET Core on Windows and Mono + .NET Core on Linux and macOS. Not sure why it would pass for you but fail on the CI (I managed to reproduce the failure over here).

You can include the C# unit tests in ÌronPythonTest, it's set up to have access to the IronPython internals. It should pick up and run NUnit tests. You can probably set it up like:

namespace IronPythonTest {
    [TestFixture(Category="IronPython")]
    public class SurrogateEscape {
        [Test]
        public void Utf8() {
            // ...
        }
        // etc...
    }
}

@BCSharp
Copy link
Member Author

BCSharp commented Feb 18, 2019

I have reverted the changes to PythonAsciiEncoding and EncodingWrapper plus the Linux workaround. All test pass (still), but I am going to push the changes to see the CI tests. If they pass, I will include the unit tests in a separate commit. Thanks for the info.

@slozier
Copy link
Contributor

slozier commented Feb 18, 2019

Sorry, I should have been clear that it should have been a partial revert of #539 since the changes to Src/StdLib/Lib/os.py are still required for things to work properly right now.

@BCSharp
Copy link
Member Author

BCSharp commented Feb 19, 2019

The Windows tests pass but the Linux/macOS tests fail. The main cause is os.environ, which tries to decode with surrogateescape the dictionary it got from the posix module. This fails (cannot decode a str), because the posix module is not really Posix, it is nt renamed. One of the differences between CPython's posix and nt is that the environ dictionary on Posix contains bytestrings, but unicode strings on NT.

The nt module in turn gets its environ dictionary from EnvironmentDictionaryStorage in PythonDicttionary.cs, which obtains it through .NET's Environment.GetEnvironmentVariables(). The last one always returns a dictionary of strings, even on Posix systems. So the decoding of the environment strings already happened inside .NET.

I have done some tests to see how .NET handles decoding errors, and sure enough, they do not get escaped in lone surrogates, but simply replaced by U+FFFD. So information is lost and we will never have full CPython behavior here. What we still can do is to make sure that we've got the types right, that would be good enough for practical purposes. This means that posix.environ should be a dictionary of bytestrings.

The question is where to do the conversion of strings to bytes:

Since module nt dubs as posix, perhaps that is the best place to handle all platform specific differences. There may be more than just the environment variables, there are so many tests failing that I couldn't tell.

@slozier
Copy link
Contributor

slozier commented Feb 19, 2019

@BCSharp I would probably fork EnvironmentDictionaryStorage into BytesEnvironmentDictionaryStorage (or something of the sort) and then in the nt module call the appropriate one. Either that or a bunch of platform checks in EnvironmentDictionaryStorage. I don't think it can be any higher up since we'd end up having to implement yet another dict wrapper so that Add/Remove updates the underlying EnvironmentDictionaryStorage...

I'm also completely fine with just having the os patch (although it's a pain whenever the StdLib needs to be updated).

@BCSharp
Copy link
Member Author

BCSharp commented Feb 19, 2019

I'm also completely fine with just having the os patch

Great. I leave updating ``EnvironmentDictionaryStorage` for another PR.

@slozier
Copy link
Contributor

slozier commented Feb 23, 2019

Thanks for adding the tests.

Looks like Mono is having issues with GetChars. In particular (SurrogateEscapeDecoderFallbackBuffer)_pass1decoder.FallbackBuffer is failing. It is using a DefaultDecoder which simply does a pass through to the Encoding.GetChars so Fallback and FallbackBuffer are never set. Is it documented that the Decoder must inherit the Fallback from the Encoding? You might be relying on an implementation detail here...

@BCSharp
Copy link
Member Author

BCSharp commented Feb 23, 2019

I have noticed that one test on Mono was failing but I haven’t got time to look into it yet. Probably I will have some time tonight (Pacific time). Thank you for providing some insight into the problem.

What is interesting is that the test that is failing was meant to be run only on .NET Framework on Windows, as it uses an Encoding that is not guaranteed to be available anywhere else. I have excluded it from .NET Core by wrapping it in #if !NETCOREAPP2_1 and #endif but forgotten to disable on Mono. So the fact that is failing is not surprising but it is failing in a surprising way.

By the way, is there a better way to mark NUnit tests as applicable only to a specific framework or operating system? I was looking fir some attributes to apply to the test method, but haven’t found any.

@slozier
Copy link
Contributor

slozier commented Feb 23, 2019

Hmm, while cp1252 is primarily a Windows encoding I think it should be available on all platforms. CPython supports it (via a charmap encoding?) and .NET also supports it. In the case of .NET Core, it's not available natively, but it is still available via the System.Text.Encoding.CodePages package which we use. You could the following to your SetUp if you want to make it available:

#if NETCOREAPP2_1
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
#endif

We have an explicit hook up for cp1252 encoding on .NET Core (see CodecsInfo.MakeCodecsDict) to make it available in IronPython.

Maybe if we didn't hook up encodings (other than the Python builtin ones) then we wouldn't have to deal with these compatibility issues when using normal Python. However we still should still support the interop scenario: "aaa".encode(SomeDotNetEncoding) (which I seem to have broken recently #551).

Looks like the standard method for skipping the test would be something like:

if (!RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) {
    Assert.Ignore("some message");
}

@BCSharp
Copy link
Member Author

BCSharp commented Feb 24, 2019

Is it documented that the Decoder must inherit the Fallback from the Encoding?

No, it is not. What is documented is that if a Decoder is stateless, it may forward to Encoding.GetChars. For stateless decoders, the FallbackBuffer is irrelevant as it is never used, so I assume that Mono developers don't set it as an optimization. I have corrected my code to handle such case appropriately and I keep the test enabled on Mono.

We have an explicit hook up for cp1252 encoding on .NET Core (see CodecsInfo.MakeCodecsDict) to make it available in IronPython.

Strange, but when I do Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) in the test fixture, Encoding 1252 is still not available.

@slozier
Copy link
Contributor

slozier commented Feb 24, 2019

Odd about Encoding 1252, the test works fine for me... If you're happy with your PR as it is I'll take another look tomorrow and merge it in if everything looks good.

@BCSharp
Copy link
Member Author

BCSharp commented Feb 24, 2019

I have just noticed that System.Text.Encoding.CodePages is not referenced from IronPythonTest. I will add it tomorrow and test again.

@slozier
Copy link
Contributor

slozier commented Feb 24, 2019

The reference should have been inherited from IronPython. I'm not sure that the CodePagesEncodingProvider.Instance API is even available if you don't reference System.Text.Encoding.CodePages.

@BCSharp
Copy link
Member Author

BCSharp commented Feb 25, 2019

Adding the reference to System.Text.Encoding.CodePages did not help. The API is available, with or without this reference, but code page 1252 is not. If you have ideas to make it work, feel free to do it, otherwise, I am OK with a merge as it is.

@slozier slozier merged commit 744c9d7 into IronLanguages:master Feb 25, 2019
@slozier
Copy link
Contributor

slozier commented Feb 25, 2019

Thanks again for the PR!

@BCSharp BCSharp deleted the surrogateencode branch April 1, 2019 18:06
@slozier slozier changed the title Implements 'surrogateencode' encoding error handler Implements 'surrogateescape' encoding error handler Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants