Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement conversion from IBufferProtocol to IList<byte> in Binder #762

Closed

Conversation

BCSharp
Copy link
Member

@BCSharp BCSharp commented Mar 5, 2020

This PR originated from a failing codecs test. It turns out that all codec functions accept a 'b' type array just as well as Bytes. At first I thought of adding another overload accepting IBufferProtocol objects, but it turns out that [BytestConversion]IList<byte> will catch array as well. However, it does not go through IBufferProtocol, instead what is happening is that the binder sees that array implements IList<object> and wraps it in ListGenericWrapper<byte>, hoping for the best.

But it does not work. ListGenericWrapper<byte>, on access or on copy (which is triggered by ToArray() in the first line of StringOps.DoDecode) accesses each element as object and unboxes to byte. But the array, even for type 'B', when accessed thought IList<object>, boxes its elements to int for some reason. Even if I changed that and made array box a byte, this still would not be good enough: the conversion will fail for other array types, like sbyte or float, and in CPython all objects work as long as they support the buffer protocol:

>>> codecs.latin_1_decode(array.array('f', [12345.67]))
('®æ@F', 4)

So it is clear that wrapping in ListGenericWrapper<byte> is not a way to do it. In the end I went aheand modyfing the Binder logic to do the conversion on the fly in the expression tree. This seems to do the right thing, does it?

Questions/issues:

  1. Is BytesConversionAttribute really needed? I do not see it being checked/used anywhere by the Binder.

  2. Wouldn't it be better to wrap IBufferProtocol objects on binding and serialize to a memory stream only on access? This would delay the serialization to when the object is actually accessed, rather than having it done already in the binding expression tree.

  3. Use a dedicated attribute to enable this functionality, e.g. BufferProtocolAttribute? In this way, BytesConversion will be used only for Bytes and ByteArray, and MemoryView. On the other hand, I haven't found a case when BytesConversion would apply but BufferProtocol conversion not. In all cases I tried, where IronPython uses BytesConversion on parameters, CPython does accept an arbitrary array as well:

 >>> b"spam".startswith(array.array('b', [ord('s'), ord('p')]))
True
  1. Is the binding constraint right? It is supposed to constraint the binging to the argument that match the type exactly (I assume that LimitType means specialized type for generics). Ideally, the constraint would be more relaxed, matching any type that implements interface IBufferProtocol but I coudn't figure out how to define such constraint. Maybe constraining on the exact type match is more efficient in the average case.

  2. Even with this change, things are not quite right. CPython will accept memoryview in all those instances, but IronPython not. Shouldn't MemoryView implement IBufferProtocol? (Or at least IList<byte>).

  3. A side issue but caught my eye: ConversionBinder.cs, line 173, "blindly" converts a string to Bytes. This is error prone (only Latin 1 characters succeed) and it seems to me it is not a way to do it in Python 3, where all str to bytes conversions are explicit. See NewTypeMaker.cs, line 413: "[BytesConversionAttribute] to make enable automatic cast between .net IList and string." Really?

@slozier
Copy link
Contributor

slozier commented Mar 5, 2020

1. Is `BytesConversionAttribute` really needed? I do not see it being checked/used anywhere by the Binder.

The only thing the BytesConversionAttribute currently does is prevent PythonList from binding to IList<byte>. In 2.7 it used to do automatic string to bytes conversions.

2. Wouldn't it be better to wrap `IBufferProtocol` objects on binding and serialize to a memory stream only on access? This would delay the serialization to when the object is actually accessed, rather than having it done already in the binding expression tree.

The whole buffer protocol thing has been bugging me for a while. They should really be a zero-copy way of exposing the underlying memory instead of serializing the object. It seems to me like we could easily do this using Span/Memory. For example, if IBufferProtocol required an AsSpan<T>() where T struct or something of the sort. The whole binding story then becomes somewhat simpler.

3. Use a dedicated attribute to enable this functionality, e.g. `BufferProtocolAttribute`? In this way, `BytesConversion` will be used only for `Bytes` and `ByteArray`, and `MemoryView`. On the other hand, I haven't found a case when `BytesConversion` would apply but `BufferProtocol` conversion not. In all cases I tried, where IronPython uses `BytesConversion` on parameters, CPython does accept an arbitrary `array` as well:
 >>> b"spam".startswith(array.array('b', [ord('s'), ord('p')]))
True

If we're going with a new attribute to enable conversion, maybe we could do something like [BytesLike]IList<byte>. If we went with Span then it might just be Span<byte> and have IBufferProtocol implement an explicit cast? Just some ideas.

4. Is the binding constraint right? It is supposed to constraint the binging to the argument that match the type exactly (I assume that `LimitType` means specialized type for generics). Ideally, the constraint would be more relaxed, matching any type that implements interface `IBufferProtocol` but I coudn't figure out how to define such constraint. Maybe constraining on the exact type match is more efficient in the average case.

Did you already resolve this? It seems like you're binding to any IBufferProtocol. Or maybe I'm misunderstanding.

5. Even with this change, things are not quite right. CPython will accept `memoryview` in all those instances, but IronPython not. Shouldn't `MemoryView` implement `IBufferProtocol`? (Or at least `IList<byte>`).

I've thought about having MemoryView implement IBufferProtocol. I'm not sure it's technically correct. However, as you mentioned, you can always pass a memoryview object where it expects a Py_buffer. I'm not exactly sure how CPython deals with this...

6. A side issue but caught my eye: `ConversionBinder.cs`, line 173, "blindly" converts a string to `Bytes`. This is error prone (only Latin 1 characters succeed) and it seems to me it is not a way to do it in Python 3, where all `str` to `bytes` conversions are explicit. See NewTypeMaker.cs, line 413: "`[BytesConversionAttribute]` to make enable automatic cast between .net IList and string." Really?

These are probably left over from 2.7 and could be removed.

@BCSharp
Copy link
Member Author

BCSharp commented Mar 6, 2020

1. BytesConversionAttribute

The only thing the BytesConversionAttribute currently does is prevent PythonList from binding to IList<byte>.

OK, I understand now, but then maybe it is an opportunity to rethink/reorganize the mechanism. When I remove BytesConversionAttribute from a parameter, the binding to a Python list indeed happens, but soon fails with the same TypeError, though with a less informative error message:

With [BytesConversion]:

TypeError: expected IList[Byte], got list

Without [BytesConversion]:

TypeError: Specified cast is not valid.

In CPython (example):

TypeError: 'list' does not support the buffer interface

The type error still occurs because the Binder wraps PythonList, which implements IList<object>, in ListGenericWrapper<byte>, which was also the cause of problems for array bindings. So I think the better semantics of BytesConversionAttribute is probably not just prevent binding with PythonList but with any type if the type implements IList<object> and ListGenericWrapper would be needed.

But maybe the whole problem is stated backwards. I have scanned the whole codebase; there are 154 public interface functions that use an IList<byte> parameter. Of those, only 9 do not use [BytesConversion]:

IronPython.Runtime.ByteArray.__init__([NotNull]IList<byte> input)

IronPython.Runtime.Operations.DoubleOps.__new__(CodeContext context, PythonType cls, IList<byte> s)
IronPython.Runtime.Operations.SingleOps.__new__(CodeContext context, PythonType cls, IList<byte> s)
IronPython.Runtime.Operations.Int32Ops.__new__(CodeContext context, PythonType cls, IList<byte> s, int @base = 10)
IronPython.Runtime.Operations.BigIntegerOps.__new__(CodeContext context, PythonType cls, IList<byte> s, int @base = 10)

IronPython.Modules.PythonRegex.findall(CodeContext context, object pattern, IList<byte> @string, int flags = 0)
IronPython.Modules.PythonIOModule.IncrementalNewlineDecoder.decode(CodeContext context, [NotNull]IList<byte> input, bool final=false)
IronPython.Modules.PythonWindoundModule.PlaySound(CodeContext context, [NotNull] IList<byte> sound, int flags)
IronPython.Modules.PythonSocket.socket.recvfrom_into(IList<byte> buffer, int nbytes=0, int flags=0)

A cursory check shows that most of them should have been tagged with [BytesConversion] anyway. I believe that only the initializer of ByteArray legitimately should accept PythonList. If so, what about making a rule of never making PythonList (or better, IList<object>) match IList<byte> coded directly into the Binder, and for those very few cases (perhaps just one case...) when it is desired anyway, simply provide an overload? Then we can get away with BytesConversionAttribute altogether. Or better yet, rename and repurpose it, see point 3 below.

2. IBufferProtocol

The whole buffer protocol thing has been bugging me for a while. They should really be a zero-copy way of exposing the underlying memory instead of serializing the object.

I absolutely agree. I was having exactly the same thoughts. But I thought that we cannot use Span/Memory because it is not available in .NET 4.5/Mono? Anyway, this seems a more localized challenge. If we change the IBufferProtocol, then we could adapt the Binder accordingly, but at this stage I would like to figure out the principles of bytes conversion for the whole project.

3. BytesLikeAttribute

I like the name. CPython's documentation often talks about a "bytes-like" object as an acceptable input type. But on some occasion it talks about objects implementing a buffer protocol. It is being reflected in error messages as well:

>>> int([0x31])  
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
>>> codecs.latin_1_decode([0x31])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'list' does not support the buffer interface

I don't know if there is any difference between those two concepts, will have to dig deeper. But it appears that whenever Python API accepts a "bytes-like object" or an object with a "buffer interface", in IronPython, the parameter is tagged with BytesConversionAttibute. So if we agree to abolish this attribute, I would not just remove it but rename to BytesLikeAttribute and perform a specialized conversion in the expression tree generated by the Binder. Ideally without any copying. Maybe through Span<byte>/ReadOnlySpan<byte> or Memory<byte>/ReadOnlyMemory<byte>. I also like IReadOnlyList<byte>, which I think is under-utilized by the current code. However, I would not go with Span in function signatures. Span requires the underlying memory contiguous. IList does not, it simply offers indexed access. If a type provides Span/ReadOnlySpan but the interface requires IList/IReadOnlyList, maybe we can provide a ref struct wrapper type like ListGenericWrapper<byte> except type-safe. ListSpanWrapper<byte> perhaps?

So here are the options for the interface:

[BytesLike]IList<bytes>                           // if object must be modifiable
[BytesLike]IReadOnlyList<bytes>                   // my current favorite
[BytesLike,BufferProtocol]IList<bytes>            // in case buffer interface is something different than bytes-like
[BytesLike,BufferProtocol]IReadOnlyList<bytes>    // in case buffer interface is something different than bytes-like

Also, if a parameter is tagged with BytesLikeAttribute this would preclude wrapping in ListGenericWrapper<byte>. The binding has to succeed in a type safe way or there is no match. This will avoid several of those strange error messages while casting from object to byte.

And finally, maybe the overload resolver could produce a better error message in such case, something more in line with CPython, e.g. "expected bytes-like object, got xxx" rather "expected IList[byte], got xxx"?

4. Binding Constraint

Did you already resolve this? It seems like you're binding to any IBufferProtocol.

No I haven't solved it yet. I think I wasn't clear in my description of the issue. It is true that the proposed code will perform binding for all instances implementing IBufferProtocol. However, when a Call Site encounters the first case (say, an array) and invokes the Binder, the Binder will construct a binding expression tree (which will perform the conversion) and constrain ("restrict") it to only apply for other objects of the exact same type (i.e. another array). This binding is then stored in the Site cache. When another object later arrives at the same Call Site, and it is also array, the cache is hit, the binding will be reused, and the Binder is not employed. But if another type shows up that also implements IBufferProtocol, the cache will be missed and the Binder asked to create another expression tree, which will be exactly like the first one. This seemed to me like inefficiency: the first expression tree would have done just fine, and it was already in the cache.

But after thinking about it for a day I am now not so sure what is best; after all, there is a tradeof. If the first expression tree produced were restricted to apply to every object implementing IBufferProtocol, the cache would be smaller, the Binder would not be invoked several times, but the cost of checking the restriction would be higher. After all, it is simpler to check whether object's type is exactly a given type, than to check whether its type, or any of the base types implement a given interface. With this in mind, I put the question to rest now, it is not as important. And without performance measurements in various scenarios it is impossible to say which approach is better.

5. Memoryview

I've thought about having MemoryView implement IBufferProtocol. I'm not sure it's technically correct.

If implementing IBufferProtocol is not right, what about implementing IList<byte> and IReadOnlyList<byte>? Then binding would be trivial and automatic.

6. str to bytes conversion

These are probably left over from 2.7 and could be removed.

I will gladly remove this code at the first opportunity.

@slozier
Copy link
Contributor

slozier commented Mar 6, 2020

1. BytesConversionAttribute

what about making a rule of never making PythonList (or better, IList<object>) match IList<byte>

Preventing IList<object> from binding to IList<byte> is probably fine. I think the whole point of the wrappers is to make interop with .NET easier (e.g. passing [1,2,3] to a .NET method). However, since you would have to go out of your way to build a Python list with actual byte objects it seems unlikely to be beneficial. The only potential use case I can think of would be with array, but as you mention that doesn't even work (because values are converted back to Python types when extracted from the array).

2. IBufferProtocol

But I thought that we cannot use Span/Memory because it is not available in .NET 4.5/Mono?

We could include the System.Memory package. We wouldn't get all the overloads that .NET Core provides but it would enable us to use the types.

3. BytesLikeAttribute

But on some occasion it talks about objects implementing a buffer protocol.

CPython is somewhat inconsistent with their error messages. I think in most cases where they say "buffer protocol" they mean "bytes-like". I think it's somewhat cleaner in newer releases.

So if we agree to abolish this attribute, I would not just remove it but rename to BytesLikeAttribute

I am fine with renaming it.

Span requires the underlying memory contiguous. IList does not, it simply offers indexed access.

While this is true, if you look at Python's definition of bytes-like object they specifically mention contiguous memory.

I also like IReadOnlyList<byte>, which I think is under-utilized by the current code.

Because it didn't exist when the bulk of the codebase was written! 😄 I would be fine with making use of this since 4.5 is now the minimum target. Presumably Bytes and ByteArray would need to implement it.

[BytesLike]IList<bytes>                           // if object must be modifiable
[BytesLike]IReadOnlyList<bytes>                   // my current favorite

These seem good.

5. Memoryview

If implementing IBufferProtocol is not right, what about implementing IList and IReadOnlyList? Then binding would be trivial and automatic.

That would be an option. We could go further and implement it on all IBufferProtocol types. Then the whole binding issue becomes somewhat trivial and we no longer need special binding?


Anyway, it seems to me like IBufferProtocol should reflect the Py_buffer structure and so using Span/Memory appears to be reasonable.

@BCSharp
Copy link
Member Author

BCSharp commented Mar 7, 2020

if you look at Python's definition of bytes-like object they specifically mention contiguous memory.

So a bytes-like object has two constraints: implement buffer protocol and be a contiguous buffer. Great. This indeed makes Span/Memory qualify.

Anyway, it seems to me like IBufferProtocol should reflect the Py_buffer structure and so using Span/Memory appears to be reasonable.

I agree. Span/Memory allow the underlying data to be pinned thus allowing optimized acces in unsafe blocks. This lives up to the spirit of low-level, optimized memory access Py_Buffer provides. Although Py_Buffer need not be continuous and allows for fairly complex array layouts, these scenarios only become relevant when providing support to 3rd party libraries like NumPy and PIL. I am looking forward to implementing that, but we are not there yet, and for the builtin types supporting buffer protocol (bytes, bytearray, memoryview, array.array), a contiguous buffer will be enough.

I have realized that a ref struct cannot implement interfaces, or be members of non-ref structs, so the idea of providing a lightweight wrapper for a Span by the Binder to match an IList<byte> parameter is impossible. Conversely, it is impossible to convert an arbitrary IList<byte> object to a Span<byte> without copying data. So when making a choice for the interface type for IronPython's "bytes-like" parameter, it is either-or: one choice eliminates the other. It is a strategic choice with far-reaching consequences.

Below there are some thoughts I have on the choices I currently see. I evaluate them from the implementation point of view, and from the client point of view (e.g. C# host code calling into IronPython code).

A. IList<byte>/IReadOnlyList<byte>

  • Implementation: Most functions just loop over data and don't care whether the data is contiguous or not, so they are quite happy with IList. In fact, in most (all?) cases IReadOnlyList would do just fine. A notable exception are all decoding functions in _codecs: the actual decoding work is delegated to System.Text.Encoding.GetChars, and that interface not only requires a contiguous block but an actual byte[] array. The current code has an optimization to downcast the argument to byte[] or Bytes and access the underlying byte[], and only if that fails, a data copy is performed to a fresh array. This is good enough for most cases but probably should be changed/extended to go through IBufferProtocol to support more cases. This PR partially tried to address it.

  • Clients: Since IList is ubiquitous (e.g. every array implements it), it is easy for a client to provide data in this form. The exception is of course Span: it will require data copy to provide IList. If the data is available as Memory, things are a little easier: it is possible to provide a lightweight wrapper, automatically supplied by the Binder, to make a match.

    Another thing is that IList does not derive from IReadOnlyList, as the latter came later on the scene. This means that there can be objects that implement IList but not IReadOnlyList. This does not apply to any standard .NET types: all of them that implement IList do also implement IReadOnlyList. Also we can make sure that this is so for all types in IronPython. For those unfortunate user types, an automatic lightweight wrapper provided by the Binder would do the trick.

B. Span<byte>/ReadOnlySpan<byte>

  • Implementation: Those simple functions that just iterate over data will be just happy with this efficient choice. Even more so: we have now a possibility to drop to unsafe code if performance optimization is required. However, the _codecs functions will have a big problem: I am convinced that it is impossible to obtain a reference to the original System.Array from Span, even if that span was created from a whole bytes[] array. This means that every call to those functions will result in data copy, although the input data is never modified during decoding. I don't think it is acceptable.

  • Clients: It is easy and cheap to get Span from System.Array, and the client can also span unmanned memory or a stackalloc. However, if the client has only a reference to IList and downcasting to bytes[] does not work, the only remedy is data copy. I find it a fair trade-off.

C. Memory<byte>/ReadOnlyMemory<byte>

  • Implementation: It is somewhat less performing than Span and the management can be a little more complicated, but not so in the simple cases, which are the most common. And the option of dropping to unsafe code if needed remains. However, although Memory can be boxed, it still does not allow to obtain a reference to the original array, even when created over the whole such array. At least I fail to figure out how. So the big memory/performance hit to codecs remains the same as for Span.

  • Clients: Again, byte[] will be easily accepted, but not IList (<= no guarantee of data contiguity) nor Span (<= cannot be boxed). So the clients will have to resort to data copy in those cases. From the client's perspective, Span would be better as it is trivial to convert Memory to Span.

D. IBufferProtocol

  • Implementation: We could make IBufferProtocol derive from both IList and IReadOnlyList and the simple functions can keep their IList parameter types. Or if the simple functions require Span, a method in IBufferProtocol will provide that, and the Binder will invoke it on demand, similarly to what this PR does. The codecs functions would use a method from IBufferProtocol to obtain bytes[]. A copy would only be made when absolutely unavoidable.

  • Clients: Obviously we will make all relevant IronPython types implement IBufferProtocol, but interoperability with native .NET types suffers. To support that, a sort of automatic Binder wrapper will be needed. Possibly several wrapper types, one for each .NET type we want to support: bytes[], Span<byte>, IList<byte>, IReadOnlyList<byte>. Or we disqualify the IList family on principle as the data is not guaranteed contiguous. Whether such wrapper is automatically created or not would be controlled by the presence of BytesLikeAttribute.


I am currently leaning towards Choice D with a mixin from B. The codec functions would have a parameter declared as [BytesLike]IBufferProtocol, with the semantics from Choice D. Simple functions like Bytes.startswith would declare their parameter as [BytesLike]Span<byte> or [BytesLike]ReadOnlySpan<byte>. The binder would create a relevant IBufferProtocol wrapper on the fly in the same way as for codecs but then immediately convert it to Span. If we disqualify IList<byte> as an acceptable bytes-like type, then this case becomes even simpler: bytes[] can be easily represented as Span<byte>. The only reason I am still floating IList around is because this is what IronPython 2 accepts, so accepting IList will make porting to IronPython3 easier. However I would prefer to drop support for IList as "bytes-like" altogether. Developers can easily convert their calls drom data to data.ToArray(), and although it copies data, it at least makes it visible to the developer. And maybe it will be a motivation to rethink their data types and replace IList with Span or byte[] or a downcast to byte[]. The latter two would be most beneficial for codecs.

@slozier, I am curious of your thoughts on this.

@slozier
Copy link
Contributor

slozier commented Mar 8, 2020

@BCSharp Thanks for the analysis!

However, the _codecs functions will have a big problem: I am convinced that it is impossible to obtain a reference to the original System.Array from Span, even if that span was created from a whole bytes[] array.

Why do we need a reference to the original array? If the concern is that methods like Encoding.GetString require arrays, then we could potentially resolve it by bumping the .NET Framework version to 4.6.1 (which is .NET Standard 2.0 "compatible") and using unsafe code.

unsafe {
    fixed (byte* bp = span.Slice(start)) {
        var z = e.GetString(bp, length);
    }
}

The binder would create a relevant IBufferProtocol wrapper on the fly in the same way as for codecs but then immediately convert it to Span.

This sounds like a good approach.

I would prefer to drop support for IList as "bytes-like" altogether

I agree. If you think about it, a Python list of bytes does not qualify as bytes-like so would would a .NET list of bytes work (although one could argue about the IList being strongly typed). There are enough differences between IronPython 2 and 3 that I think this one will be a minor pain point.


Before jumping all in we should do some small scale experiments. I think there are issues with using ref struct with the interpreter so using Span directly in public method signatures might be a non-starter.

@BCSharp
Copy link
Member Author

BCSharp commented Mar 9, 2020

we could potentially resolve it by bumping the .NET Framework version to 4.6.1 (which is .NET Standard 2.0 "compatible") and using unsafe code.

I like the idea of going though unsafe code in this case. Why bump up to .NET 4.6.1?

Default Encoding.GetString(byte*, ...) calls Encoding.GetChars(byte*, ...), which by default will copy the unsafe bytes into a managed array and then call Encoding.GetChars(byte[], ...). So to avoid copying, all of our Encoding subclasses will have to implement unsafe GetChars(byte*, ...) override as the workhorse. We can mention this in the IronPython documentation to warn/guide host developers who want to implement their own Encoding. The standard .NET encodings are fine as they already use GetChars(byte*, ...) as the workhorse.

Before jumping all in we should do some small scale experiments.

Agreed. I will start with the experiments around encodings and make a draft pull requests if there is anything interesting to share/review.

I think there are issues with using ref struct with the interpreter so using Span directly in public method signatures might be a non-starter.

I cannot think of any issues. Do you have any (pointers to) information or examples of those issues?
I know that Span cannot be created on the stack of async methods, but async methods can still create Memory and call other methods with Span parameters by accessing Memory.Span. It seems to me that Span and ReadOnlySpan are perfect types for public interfaces.

@slozier
Copy link
Contributor

slozier commented Mar 9, 2020

I like the idea of going though unsafe code in this case. Why bump up to .NET 4.6.1?

Pointer overloads were only added to Encoding in 4.6. I suggested 4.6.1 since I was thinking that we might as well target the .NET Standard 2.0 "compliant" version of the Framework. Thinking about it now, might be better to stick with 4.6 to avoid any potential "DLL hell" (if you build for 4.6.1 and reference System.Memory it'll bring in ~100 .NET Standard assemblies).

Do you have any (pointers to) information or examples of those issues?

IronLanguages/ironpython2#347

@BCSharp
Copy link
Member Author

BCSharp commented Apr 2, 2020

Superseded by #765.

@BCSharp BCSharp closed this Apr 2, 2020
@BCSharp BCSharp deleted the buffer_protocol_conversion branch April 2, 2020 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants