Factor out IPythonBuffer from IBufferProtocol #808

BCSharp · 2020-04-27T21:57:07Z

Here is my exploration around the re-implementation of the buffer protocol (#21). I will try to make incremental changes that will keep IronPython working (so all tests should be passing).

Each step is a proposal of an idea that can be discussed and adjusted, so smaller steps are preferable, rather a large dive that goes nowhere.

The PR is writable to the maintainers, so I welcome commits on top in a collaborative effort.

BCSharp · 2020-04-27T21:58:08Z

In the first commit I separated the buffer API from the buffer exporter API. This is how PEP 3118 handles it too. The difference is how the buffer is being released. PEP 3118 argues against an idea of a separate releaser object (that is, different than the exported object), and that argument makes sense for C, but with OO C#, the Dispose pattern makes more sense. We have discussed it before. For the record, here are some arguments for the IDisposable patterns that come to my mind.

It doesn't require for the consumer to hold on to the reference of the exporter. The buffer keeps that reference and GC will not collect it until the buffer is collected.
It fits the prevailing .NET usage pattern, which has some dedicated support from the C# compiler (using).
It is fairly simple to go from one form to the other in the interface adaptation layer, if a 3-rd party library expect a separate PyBuffer_Release().
It is similar to .NET IMemoryOwner<T> approach.

The changes here deliberately stop short from changing the existing behaviour in any big way. In particular, I think that several methods in IPythonBuffer should be removed. Also, generally speaking, exporters should not implement IPythonBuffer themselves, but spawn a separate object representing the buffer (with the exceptions of MemoryView, and maybe CData). This seems fairly straightforward to implement, but there is one pint that I do not yet know how to handle: the automatic conversion of bytes-like objects to IList or ReadOnlyMemory in the Binder.

In CPython, a bytes-like handling function accepts the exporter object as argument (i.e. an object that has getbufferproc and releasebufferproc), and then requests and releases the buffer as needed. This corresponds to having bytes-like parameters typed IBufferProtocol (Option D). However, we opted for more convenient types IList and ReadOnlyMemory, with the conversion implemented in PythonConversionBinder. The issue is, that although the converter can easily call GetBuffer on IBufferProtocol to extract a compatible data access type, there no obvious way to call Dispose. It should be called after the method call returns, but the conversion expression tree does not know when it happens. Maybe there is a sort of de-convert mechanism available in the DLR that I am not aware of. Or maybe there is an option to provide conversion on the whole method level.

A workaround could be to release the buffer in the conversion ET right after the desired data object has been obtained, but this is not as the buffer protocol should work. The data object will be used after the buffer is released. It will work sort-of OK with IronPython built-in types that implement IBufferProtocol, but could be disastrous for unmanaged/unsafe 3-rd party libraries like NumPy.

slozier · 2020-04-28T00:18:19Z

This is a good start. Hopefully I can find some time to experiment with this. Here are some of my initial thoughts:

In particular, I think that several methods in IPythonBuffer should be removed.

Agreed, most of them can probably go.

However, we opted for more convenient types IList and ReadOnlyMemory, with the conversion implemented in PythonConversionBinder.

Well, maybe we should revisit this. While these types are more from a .NET point of view it seems like they're not a great fit for the Python world. In particular IList is terrible because it would force us to make a copy the buffer? Another point which makes using the convenient types moot is that a lot of these methods are already a annoying to invoke from .NET due to the CodeContext argument. Perhaps we should just stick to IBufferProtocol overloads. While we'd need to call GetBuffer every time it doesn't seem like such a big deal and we'd have more controls over the flags (otherwise we'd need some sort of attribute to specify them?).

We could have IBufferProtocol wrappers (for example static IBufferProtocol ToBufferProtocol(this Array[byte] bytes)) for the convenient .NET types. When invoking from the Python side of things the binder could easily use those wrappers to automatically convert the object to an IBufferProtocol.

BCSharp · 2020-04-28T17:54:51Z

Well, maybe we should revisit this. While these types are more from a .NET point of view it seems like they're not a great fit for the Python world. [...] Perhaps we should just stick to IBufferProtocol overloads.

We can revisit it, but I wouldn't dismiss the existing mechanism as yet. First of all, nothing needs to change to support IBufferProtocol as bytes-like input; it can already be used and there is no need for any special parameter attributes. So having support for [BytesLike] in the converter is a bonus, that may be useful or convenient in some cases (assuming that it works correctly). On the other hand, if the "case for those cases" is weak, then it does not warrant the extra complexity. I will try to sum up some pros and cons of both approaches, as I see it.

Using IBufferProtocol in the method signature:

It matches how CPython and offers the most flexibility on when, how (i.e. flags), and for how long a buffer is exported, unlike with the Binder mechanism. With the Binder mechanism, the buffer will be always requested at the entry to the method and released at the exit. This may not always match CPython. For instance, it is conceivable than for some corner cases created by values of other arguments, CPython never does request the buffer but fast-tracks to a default answer. If so, buffer error handling behaviour would be different. (Though I think this is an unlikely scenario).
It does not require any special handling code in the Binder, or any parameter attribute.
It makes it more difficult for interop, but in many cases methods that accept a bytes-like object are there just to provide an existing .NET functionality to Python. So there is little benefit for C# to call them. And indeed, we could provide a wrapper around byte[] or Memory<byte> for convenience.

Using [BytesLike] support in the Binder.

The change has less impact on the codebase, so the migration can be gradual. The common GetBuffer functionality is factored out to the Binder-generated code. The actual method code is cleaner.
If we settle on Memory<byte>/ReadOnlyMemory<byte>, interop is easier.
For advanced scenarios, a method can still use IBufferProtocol in its signature.

In particular IList is terrible because it would force us to make a copy the buffer?

I agree that IList is not the best match for the buffer semantics. I would like to phase it out and replace with either [BytesLike]ReadOnlyMemory or IBufferProtocol, depending on the choice. Keeping support for IList as a possibility for buffer access will either require some exporters to make a copy of the data or we will need a wrapper around Memory that provides IList. The latter is functionally possible but will not offer efficient element access. One way or another, IList always underperforms, or is at best equal to other alternatives. It does offer an ability to pass discontinuous data, but as you have indicated before, this is not a requirement of bytes-like objects, which are expected to have contiguous byte data.

So now how to phase it out. Since the full progression of types is IBubberProtocol -> IPythonBuffer -> ReadOnlyMemory -> ReadOnlySpan -> individual byte, I think the best first step is to replace [BytesLike]IList<byte> with [BytesLike]ReadOnlyMemory<byte>. These changes will be useful regardless the final mechanism of handling of bytes-like objects.

slozier · 2020-04-28T19:27:55Z

I think the best first step is to replace [BytesLike]IList with [BytesLike]ReadOnlyMemory

This seems like a good first step, however, before jumping into it, I'd like to make sure this actually works. In particular, I'm trying to see how we would get from, for example, an array('i') (which is backed by an Array<int>) to a Memory<byte>. Right now we can write:

array a = new array("i", new byte[] {1,2,3,4,5});
using IPythonBuffer buffer = a.GetBuffer();
ReadOnlyMemory<byte> mem = buffer.ToMemory();

The issues which we would need to resolve are:

buffer.ToMemory() copies the data, it should be able to provide direct access to the memory.
a is backed by an ArrayData<int> which in turn is backed by an int[]. I know we can go from the array to a Memory<int>, but how do we go from there to a Memory<byte>? There's MemoryMarshal.Cast for Span but I don't see something like that for Memory.
ToMemory returns a ReadOnlyMemory, what do we do when we want write access?

I guess in all the [BytesLike] cases we only need read-only access, however, we need to support the array scenario where the data is not backed by a byte array.

bytes(array.array('i', [1, 2, 3, 4, 5]))

Any ideas?

BCSharp · 2020-04-29T03:39:02Z

Here are my ideas (none of them tested out, though).

First of all, although buffer.ToMemory() copies data if buffer comes from an array so does converting buffer to IList<byte>. Eliminating IList<byte> use would allow to remove IPythonBuffer.ToBytes(). It is not a small task because IList<byte> is used in many methods of Bytes and ByteArray, but it will be needed regardless which final solution we settle on.

As for the IPythonBuffer interface, I wanted to remove GetItem, SetItem, SetSlice, ToBytes, and ToList. If we need ToBytes and ToList they can come as extension methods.

Yes, it would be easier with spans, but we can't use spans in public methods.

So I was thinking about replacing ToMemory(), which returns ReadOnlyMemory<byte> and possibly copying data, with two methods: Memory<T>? AsMemory<T>() and ReadOnlyMemory<T>? AsReadOnlyMemory<T>(). Neither of them would copy data. The former would only work for writable buffers (that is, ! IPythonBuffer.IsReadOnly), throwing InvalidOperationException otherwise. The latter would work all the time. But: if T does not "match" IPythonBuffer.Format, null is returned.

An alternative: forget the generics and always return Memory<byte>/ReadOnlyMemory<byte>. The client then has to do proper data marshalling by dropping to Span and using MemoryMarshal.Cast if needed. This option is currently my favourite.

This does not solve the ArrayData<int> problem, which should be able to provide both Memory<int>, when BufferFlags.Format is used, and Memory<byte>, for BufferFlags.Simple requests. Maybe a solution is in changing the backing object from int[] to some unmanaged blob?

Yet another idea: use IBufferProtocol as the designated bytes-like type, and then each method does GetBuffer(BufferFlags.Format) -> AsMemory<T> -> AsSpan<T> -> AsBytes<T>. This could be an utility method in PythonOps, doing the job based on IPythonBuffer.Format Not ideal though, because CPython's array can satisfy simple (i.e. byte-typed) buffer requests too, so ideally the problem should be addressed in array.

BCSharp · 2020-04-29T18:16:48Z

Here is another idea that I have tested out a little bit. It addresses the array problem locally, in ArrayData<T>. Instead of using T[] as the backing object, it can use byte[]. Providing Memory<byte> for IPythonBuffer would become trivial. Indexed access, on the other hand, would go something like this:

public T this[int index] {
    get => MemoryMarshal.Cast<byte, T>(_items.Span)[index];
    set => MemoryMarshal.Cast<byte, T>(_items.Span)[index] = value;
}

I would like to see it tied out. @slozier, you have been working on ArrayData<T> on the past, do you prefer to take it on?

slozier · 2020-04-29T18:47:17Z

As for the IPythonBuffer interface, I wanted to remove GetItem, SetItem, SetSlice, ToBytes, and ToList. If we need ToBytes and ToList they can come as extension methods.

Agreed, these are the methods I would also remove. I'm not sure that they are even needed as extension methods as their (only?) consumer is memoryview.

Maybe a solution is in changing the backing object from int[] to some unmanaged blob?

I've thought about that as well, but while it solves the issue with array, it does not really solve the larger issue. For instance, I would really like to be able to use .NET arrays anywhere Python arrays are accepted (so with any bytes-like call site). We'd basically be running into the same issue.

Yet another idea: use IBufferProtocol as the designated bytes-like type, and then each method does GetBuffer(BufferFlags.Format) -> AsMemory -> AsSpan -> AsBytes

This is the idea I'm currently leaning towards. In fact, if we do this then I don't see why we couldn't just get rid of Memory and just have an AsSpan<byte> on IPythonBuffer?

BCSharp · 2020-04-29T20:18:56Z

I don't see why we couldn't just get rid of Memory and just have an AsSpan<byte> on IPythonBuffer?

I remember that having span as a method parameter of a public method was not working well. I never fully understood the root cause of this bug so I was avoiding spans anywhere public. But depending on the nature of the bug, maybe it is OK to have Span<byte> as return type of a method of a public interface? It will not be involved in the overload resolution process.

I am currently testing it.

slozier · 2020-04-29T20:33:51Z

I remember that having span as a method parameter of a public method was not working well.

Part of the issue with Span is that it can't be boxed or used as a generic argument so it causes a bunch of issues with call sites and the interpreter (which uses a stack of objects). If we're just using it "behind the scenes" via C# then there should be no issue with it.

BCSharp · 2020-05-06T19:31:47Z

This was originally "Remove IPythonBuffer.SetItem(...)" but things turned out to be quite interdependent. I tried to keep changes to minimum, just to the level of all tests passing, and commit them to get early feedback on the direction it is going in. The code is not proofread or cleaned up, and is sprinkled with TODO comments.

There are some additional tests that are passing now (not included) but to my knowledge there are no regressions.

It is strange that the build fails; does CI do something that a local ./make clean; ./make doesn't? I've successfully tested the commit locally on Windows and macOS.

BCSharp · 2020-05-06T19:35:51Z

I will try to merge in master and see if it helps.

slozier · 2020-05-07T17:17:46Z

I like it. I added an ArrayDataView to test out the approach. I put the logic there instead of in ByteArray so that it could be reused with array but I'm not sure if it's worth it? I don't think my ArrayDataView is quite correct since it should probably throw whenever Count changes instead of when the array is replaced. Also there might be threading issues with just using a counter as I do?

BCSharp · 2020-05-08T05:01:27Z

@slozier, thanks for the commit. I started working on ByteArrayView, leaving factoring out commonalities between it and ArrayView for later, but looking at your solution I see it can be done right away. Nevertheless, it was a worthwhile exercise because it got me thinking about multithreading early on. After some deliberation I came to a conclusion that it will be best not to make the Buffer Protocol views thread-safe, leaving any synchronization to the client (CPython also does not give any guarantees on thread-safety of buffer data). In the last commit, I wrote some rules of engagement as I see them currently.

However, obtaining and releasing (and in the case of ByteArray, checking for exports) has to be thread-safe, as the rest of API of the exporting object. I added the modification to the code to ensure that. Perhaps they all can be moved to ArrayData for symmetry. It also means that all CheckBuffer calls have to happen in a locked critical section. This can be better done at ByteArray level. And yes, every method of ByteArray that may modify the length of the data needs to to the check.

Further, I am not sure about calling Dispose() from the finalizer. I vaguely recall a possibility that a referenced object from a field may already been garbage-collected, in this case, _arrayData. That would lead to an exception, with GC would probably ignore...?

Handling unformatted buffers is not quite clear to me. I find Python documentation on this point incomplete and ambiguous. Maybe I will have to look into CPython's code in the end to get the full picture. But one thing that is clear is that exporters, when given requests without Formatted flag, MUST provide buffers that report Format as null. See BytesView.

slozier · 2020-05-11T18:21:23Z

I wonder if there's any value in making the ByteArray._bytes field readonly. We could do this be adding an ArrayData<T>.Capacity property (similar to List<T>.Capacity) to allow shrinking the underlying array.

Personally I don't see the issue with calling Dispose from the finalizer. It's essentially what the dispose pattern says to do? Although I'm not an expert in this domain.

BCSharp · 2020-05-17T16:42:07Z

Src/IronPython/Runtime/MemoryView.cs

                return Bytes.Empty;
            }

-            var buf = _buffer.AsReadOnlySpan();
+            var buf = _buffer!.AsReadOnlySpan();


Is there a smart way to indicate _buffer being not null after the CheckBuffer() call, withouth changing the signature of CheckBuffer()?

I believe MemberNotNullAttribute would allow that, unfortunately it's still considered a preview feature.

BCSharp · 2020-05-17T16:46:26Z

Src/IronPython/Runtime/MemoryView.cs

@@ -396,7 +396,7 @@ public sealed class MemoryView : ICodeFormattable, IWeakReferenceable, IBufferPr
            return cast(format, null);
        }

-        public MemoryView cast([NotNull]string format, [NotNull]object shape) {
+        public MemoryView cast([NotNull]string format, [NotNull, AllowNull]object shape) {


[NotNull, AllowNull] — I find it not very readable. Not to say the confusion with NotNull from System.Diagnostics.CodeAnalysis. How about adopting a convention to always using-renaming Microsoft.Scripting.Runtime.NotNullAttribute to NotNoneAttribute?

I would be fine with NotNone if there is a conflict.

using NotNoneAttribute = Microsoft.Scripting.Runtime.NotNullAttribute;

slozier

This is great! A few things I noted (for other PRs if any):

BytesIO.getbuffer should be reworked to not return a copy
Does CData work? (very low priority)
Should MemoryView be an IPythonBuffer?

Tests/test_memoryview.py

Src/IronPython/Runtime/ArrayData.cs

Src/IronPython/Runtime/MemoryView.cs

slozier · 2020-05-21T17:06:30Z

Src/IronPython/Runtime/MemoryView.cs

                }

-                int newStart = _start + start * firstIndexWidth;
-                int newEnd = _start + stop * firstIndexWidth;
-                int newStep = ((long)_step * step).ClampToInt32();


I guess we can get rid of PythonOps.ClampToInt32 now? 😏

I almost used it in 6a6841d...
It can go now, but it may come back depending on the approach to #52.

Src/IronPython/Runtime/MemoryView.cs

BCSharp · 2020-05-21T20:14:40Z

BytesIO.getbuffer should be reworked to not return a copy

Agreed. I think there are more parts in BytesIO that could be reworked now that we have System.Buffers.

Does `CData` work? (very low priority)

To my knowledge, CData works as good (or as bad) as before. That is, it does work except for indirect multidimensional arrays (i.e. arrays of pointers to arrays). That last one will require support for suboffsets in MemoryView — very low priority for me at the moment. I am planning simple general cleanup in CData in the coming weeks. I will also see if there are more loose ends there.

Should `MemoryView` be an `IPythonBuffer`?

Good point, but debatable. The way I see it memoryview is Python Buffer, but elevated to be a first class Python type. In the early CPython 3 versions even more so, for instance it would return None for shape if ndim == 0, just as it is mandated for Python Buffers. IPythonBuffer is not exactly CPython Buffer, but close enough, so conceptually I think that MemoryView is an IPythonBuffer.
But I don't feel strongly about it. Introducing a MemoryViewView of sorts would prevent code like

IPythonBuffer buf = memoryView.GetBuffer();

// somewhere else, perhaps in a sub-method:
if (buf is MemoryView mv) {
   // have a feast...
}

but I also don't feel strongly of adding extra complexity just to prevent it.

On my side, besides the remaining TODOs, I see the following:

Constructors/initializers of bytes/bytearray should be able to accept any object implementing Buffer protocol, not just bytes-like objects.
Slice deletion in bytearray/array.array can be simplified and made more efficient.
Perhaps some of the switch on typecode statements in array.array can be simplified or eliminated. On the other hand, what is there now works just fine, so it is more like a mind teaser...

slozier

I'm happy with this, feel free to merge when ready.

Factor out IPythonBuffer from IBufferProtocol

9b28c32

BCSharp added 3 commits April 29, 2020 14:21

Remove IPythonBuffer.SetSlice(...)

c7a3d66

Introduce IPythonBuffer.AsSpan()/AsReadOnlySpan()

c16769a

Redesign MemoryView

5965bd2

Merge remote-tracking branch 'ironpython3/master' into buffer_protocol

fd07f00

BCSharp and others added 4 commits May 6, 2020 13:19

Fix compile errors in re

4df9e7f

Move IPythonBuffer from Bytes to BytesView

f9bd595

Properly dispose IPythonBuffer created during conversion to BytesLike

613735f

Add ArrayDataView

7282c9c

Synchronise disposal of ArrayDataView

87e55b6

BCSharp added 6 commits May 9, 2020 12:57

Support optional byteorder mark in memoryview casts

ebe9e6a

Thread-safe buffer export management of ArrayData

02c42c7

Fix bytearray extended slice assignment

28488f3

Add/enable some bytearray tests

8c1e57d

Workaround for HTTPResponse.client.readinto

a0da749

Merge remote-tracking branch 'ironpython3/master' into buffer_protocol

28fec90

slozier and others added 5 commits May 15, 2020 17:58

Merge branch 'master' into buffer_protocol

bbb2457

Fix compile errors

d1ab07c

Remove obsolete code

78093d6

Remove IPythonBuffer.ToBytes(int, int?)

533255b

Finish implementing support for unformatted MemoryView exports

de4b45c

BCSharp commented May 17, 2020

View reviewed changes

BCSharp added 7 commits May 17, 2020 14:58

Increase test coverage of codecs and fix bugs

5a6cc1a

Optimize _codecs.escape_encode

ee0cff3

Let array.array use ArrayDataView

61ffbdd

Eliminate some of switches on typecode in array.array

39a497a

Make ArrayData field in ByteArray readonly

a73fa3e

Prevent overflow bugs of ArrayData size

6a6841d

Implement a finalizer in MemoryView

3611b2e

BCSharp marked this pull request as ready for review May 21, 2020 04:29

slozier added 3 commits May 21, 2020 09:20

Merge branch 'master' into buffer_protocol

a66f5b9

Enable test_array buffer test

589cc93

Replace Marshal.SizeOf with Unsafe.SizeOf

7a29b5a

slozier reviewed May 21, 2020

View reviewed changes

slozier mentioned this pull request May 21, 2020

test_array failures #767

Open

8 tasks

slozier approved these changes May 21, 2020

View reviewed changes

Update after review

824fefe

BCSharp merged commit e8028c2 into IronLanguages:master May 22, 2020

BCSharp deleted the buffer_protocol branch May 22, 2020 04:42

BCSharp linked an issue May 22, 2020 that may be closed by this pull request

PEP 3118 -- Revising the buffer protocol #21

Open

This was referenced Oct 15, 2020

__del__ method is not called #985

Open

Clean up globals defined in Python source, make sure their __del__ is called #988

Draft

BCSharp mentioned this pull request Mar 19, 2021

Add namereplace error handler #1140

Merged

BCSharp mentioned this pull request Apr 20, 2022

Confusing use of NotNullAttribute #1412

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Factor out IPythonBuffer from IBufferProtocol #808

Factor out IPythonBuffer from IBufferProtocol #808

BCSharp commented Apr 27, 2020

BCSharp commented Apr 27, 2020 •

edited

Loading

slozier commented Apr 28, 2020

BCSharp commented Apr 28, 2020 •

edited

Loading

slozier commented Apr 28, 2020

BCSharp commented Apr 29, 2020

BCSharp commented Apr 29, 2020

slozier commented Apr 29, 2020

BCSharp commented Apr 29, 2020

slozier commented Apr 29, 2020 •

edited

Loading

BCSharp commented May 6, 2020

BCSharp commented May 6, 2020

slozier commented May 7, 2020

BCSharp commented May 8, 2020

slozier commented May 11, 2020

BCSharp May 17, 2020

slozier May 17, 2020

BCSharp May 17, 2020

slozier May 17, 2020

slozier left a comment

slozier May 21, 2020

BCSharp May 21, 2020

BCSharp commented May 21, 2020

slozier left a comment

Factor out IPythonBuffer from IBufferProtocol #808

Factor out IPythonBuffer from IBufferProtocol #808

Conversation

BCSharp commented Apr 27, 2020

BCSharp commented Apr 27, 2020 • edited Loading

slozier commented Apr 28, 2020

BCSharp commented Apr 28, 2020 • edited Loading

slozier commented Apr 28, 2020

BCSharp commented Apr 29, 2020

BCSharp commented Apr 29, 2020

slozier commented Apr 29, 2020

BCSharp commented Apr 29, 2020

slozier commented Apr 29, 2020 • edited Loading

BCSharp commented May 6, 2020

BCSharp commented May 6, 2020

slozier commented May 7, 2020

BCSharp commented May 8, 2020

slozier commented May 11, 2020

BCSharp May 17, 2020

Choose a reason for hiding this comment

slozier May 17, 2020

Choose a reason for hiding this comment

BCSharp May 17, 2020

Choose a reason for hiding this comment

slozier May 17, 2020

Choose a reason for hiding this comment

slozier left a comment

Choose a reason for hiding this comment

slozier May 21, 2020

Choose a reason for hiding this comment

BCSharp May 21, 2020

Choose a reason for hiding this comment

BCSharp commented May 21, 2020

slozier left a comment

Choose a reason for hiding this comment

BCSharp commented Apr 27, 2020 •

edited

Loading

BCSharp commented Apr 28, 2020 •

edited

Loading

slozier commented Apr 29, 2020 •

edited

Loading