Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Factor out IPythonBuffer from IBufferProtocol #808

Merged
merged 38 commits into from
May 22, 2020

Conversation

BCSharp
Copy link
Member

@BCSharp BCSharp commented Apr 27, 2020

Here is my exploration around the re-implementation of the buffer protocol (#21). I will try to make incremental changes that will keep IronPython working (so all tests should be passing).

Each step is a proposal of an idea that can be discussed and adjusted, so smaller steps are preferable, rather a large dive that goes nowhere.

The PR is writable to the maintainers, so I welcome commits on top in a collaborative effort.

@BCSharp
Copy link
Member Author

BCSharp commented Apr 27, 2020

In the first commit I separated the buffer API from the buffer exporter API. This is how PEP 3118 handles it too. The difference is how the buffer is being released. PEP 3118 argues against an idea of a separate releaser object (that is, different than the exported object), and that argument makes sense for C, but with OO C#, the Dispose pattern makes more sense. We have discussed it before. For the record, here are some arguments for the IDisposable patterns that come to my mind.

  1. It doesn't require for the consumer to hold on to the reference of the exporter. The buffer keeps that reference and GC will not collect it until the buffer is collected.
  2. It fits the prevailing .NET usage pattern, which has some dedicated support from the C# compiler (using).
  3. It is fairly simple to go from one form to the other in the interface adaptation layer, if a 3-rd party library expect a separate PyBuffer_Release().
  4. It is similar to .NET IMemoryOwner<T> approach.

The changes here deliberately stop short from changing the existing behaviour in any big way. In particular, I think that several methods in IPythonBuffer should be removed. Also, generally speaking, exporters should not implement IPythonBuffer themselves, but spawn a separate object representing the buffer (with the exceptions of MemoryView, and maybe CData). This seems fairly straightforward to implement, but there is one pint that I do not yet know how to handle: the automatic conversion of bytes-like objects to IList or ReadOnlyMemory in the Binder.

In CPython, a bytes-like handling function accepts the exporter object as argument (i.e. an object that has getbufferproc and releasebufferproc), and then requests and releases the buffer as needed. This corresponds to having bytes-like parameters typed IBufferProtocol (Option D). However, we opted for more convenient types IList and ReadOnlyMemory, with the conversion implemented in PythonConversionBinder. The issue is, that although the converter can easily call GetBuffer on IBufferProtocol to extract a compatible data access type, there no obvious way to call Dispose. It should be called after the method call returns, but the conversion expression tree does not know when it happens. Maybe there is a sort of de-convert mechanism available in the DLR that I am not aware of. Or maybe there is an option to provide conversion on the whole method level.

A workaround could be to release the buffer in the conversion ET right after the desired data object has been obtained, but this is not as the buffer protocol should work. The data object will be used after the buffer is released. It will work sort-of OK with IronPython built-in types that implement IBufferProtocol, but could be disastrous for unmanaged/unsafe 3-rd party libraries like NumPy.

@slozier
Copy link
Contributor

slozier commented Apr 28, 2020

This is a good start. Hopefully I can find some time to experiment with this. Here are some of my initial thoughts:

In particular, I think that several methods in IPythonBuffer should be removed.

Agreed, most of them can probably go.

However, we opted for more convenient types IList and ReadOnlyMemory, with the conversion implemented in PythonConversionBinder.

Well, maybe we should revisit this. While these types are more from a .NET point of view it seems like they're not a great fit for the Python world. In particular IList is terrible because it would force us to make a copy the buffer? Another point which makes using the convenient types moot is that a lot of these methods are already a annoying to invoke from .NET due to the CodeContext argument. Perhaps we should just stick to IBufferProtocol overloads. While we'd need to call GetBuffer every time it doesn't seem like such a big deal and we'd have more controls over the flags (otherwise we'd need some sort of attribute to specify them?).

We could have IBufferProtocol wrappers (for example static IBufferProtocol ToBufferProtocol(this Array[byte] bytes)) for the convenient .NET types. When invoking from the Python side of things the binder could easily use those wrappers to automatically convert the object to an IBufferProtocol.

@BCSharp
Copy link
Member Author

BCSharp commented Apr 28, 2020

Well, maybe we should revisit this. While these types are more from a .NET point of view it seems like they're not a great fit for the Python world. [...] Perhaps we should just stick to IBufferProtocol overloads.

We can revisit it, but I wouldn't dismiss the existing mechanism as yet. First of all, nothing needs to change to support IBufferProtocol as bytes-like input; it can already be used and there is no need for any special parameter attributes. So having support for [BytesLike] in the converter is a bonus, that may be useful or convenient in some cases (assuming that it works correctly). On the other hand, if the "case for those cases" is weak, then it does not warrant the extra complexity. I will try to sum up some pros and cons of both approaches, as I see it.

Using IBufferProtocol in the method signature:

  1. It matches how CPython and offers the most flexibility on when, how (i.e. flags), and for how long a buffer is exported, unlike with the Binder mechanism. With the Binder mechanism, the buffer will be always requested at the entry to the method and released at the exit. This may not always match CPython. For instance, it is conceivable than for some corner cases created by values of other arguments, CPython never does request the buffer but fast-tracks to a default answer. If so, buffer error handling behaviour would be different. (Though I think this is an unlikely scenario).
  2. It does not require any special handling code in the Binder, or any parameter attribute.
  3. It makes it more difficult for interop, but in many cases methods that accept a bytes-like object are there just to provide an existing .NET functionality to Python. So there is little benefit for C# to call them. And indeed, we could provide a wrapper around byte[] or Memory<byte> for convenience.

Using [BytesLike] support in the Binder.

  1. The change has less impact on the codebase, so the migration can be gradual. The common GetBuffer functionality is factored out to the Binder-generated code. The actual method code is cleaner.
  2. If we settle on Memory<byte>/ReadOnlyMemory<byte>, interop is easier.
  3. For advanced scenarios, a method can still use IBufferProtocol in its signature.

In particular IList is terrible because it would force us to make a copy the buffer?

I agree that IList is not the best match for the buffer semantics. I would like to phase it out and replace with either [BytesLike]ReadOnlyMemory or IBufferProtocol, depending on the choice. Keeping support for IList as a possibility for buffer access will either require some exporters to make a copy of the data or we will need a wrapper around Memory that provides IList. The latter is functionally possible but will not offer efficient element access. One way or another, IList always underperforms, or is at best equal to other alternatives. It does offer an ability to pass discontinuous data, but as you have indicated before, this is not a requirement of bytes-like objects, which are expected to have contiguous byte data.

So now how to phase it out. Since the full progression of types is IBubberProtocol -> IPythonBuffer -> ReadOnlyMemory -> ReadOnlySpan -> individual byte, I think the best first step is to replace [BytesLike]IList<byte> with [BytesLike]ReadOnlyMemory<byte>. These changes will be useful regardless the final mechanism of handling of bytes-like objects.

@slozier
Copy link
Contributor

slozier commented Apr 28, 2020

I think the best first step is to replace [BytesLike]IList with [BytesLike]ReadOnlyMemory

This seems like a good first step, however, before jumping into it, I'd like to make sure this actually works. In particular, I'm trying to see how we would get from, for example, an array('i') (which is backed by an Array<int>) to a Memory<byte>. Right now we can write:

array a = new array("i", new byte[] {1,2,3,4,5});
using IPythonBuffer buffer = a.GetBuffer();
ReadOnlyMemory<byte> mem = buffer.ToMemory();

The issues which we would need to resolve are:

  1. buffer.ToMemory() copies the data, it should be able to provide direct access to the memory.
  2. a is backed by an ArrayData<int> which in turn is backed by an int[]. I know we can go from the array to a Memory<int>, but how do we go from there to a Memory<byte>? There's MemoryMarshal.Cast for Span but I don't see something like that for Memory.
  3. ToMemory returns a ReadOnlyMemory, what do we do when we want write access?

I guess in all the [BytesLike] cases we only need read-only access, however, we need to support the array scenario where the data is not backed by a byte array.

bytes(array.array('i', [1, 2, 3, 4, 5]))

Any ideas?

@BCSharp
Copy link
Member Author

BCSharp commented Apr 29, 2020

Here are my ideas (none of them tested out, though).

First of all, although buffer.ToMemory() copies data if buffer comes from an array so does converting buffer to IList<byte>. Eliminating IList<byte> use would allow to remove IPythonBuffer.ToBytes(). It is not a small task because IList<byte> is used in many methods of Bytes and ByteArray, but it will be needed regardless which final solution we settle on.

As for the IPythonBuffer interface, I wanted to remove GetItem, SetItem, SetSlice, ToBytes, and ToList. If we need ToBytes and ToList they can come as extension methods.

Yes, it would be easier with spans, but we can't use spans in public methods.

So I was thinking about replacing ToMemory(), which returns ReadOnlyMemory<byte> and possibly copying data, with two methods: Memory<T>? AsMemory<T>() and ReadOnlyMemory<T>? AsReadOnlyMemory<T>(). Neither of them would copy data. The former would only work for writable buffers (that is, ! IPythonBuffer.IsReadOnly), throwing InvalidOperationException otherwise. The latter would work all the time. But: if T does not "match" IPythonBuffer.Format, null is returned.

An alternative: forget the generics and always return Memory<byte>/ReadOnlyMemory<byte>. The client then has to do proper data marshalling by dropping to Span and using MemoryMarshal.Cast if needed. This option is currently my favourite.

This does not solve the ArrayData<int> problem, which should be able to provide both Memory<int>, when BufferFlags.Format is used, and Memory<byte>, for BufferFlags.Simple requests. Maybe a solution is in changing the backing object from int[] to some unmanaged blob?

Yet another idea: use IBufferProtocol as the designated bytes-like type, and then each method does GetBuffer(BufferFlags.Format) -> AsMemory<T> -> AsSpan<T> -> AsBytes<T>. This could be an utility method in PythonOps, doing the job based on IPythonBuffer.Format Not ideal though, because CPython's array can satisfy simple (i.e. byte-typed) buffer requests too, so ideally the problem should be addressed in array.

@BCSharp
Copy link
Member Author

BCSharp commented Apr 29, 2020

Here is another idea that I have tested out a little bit. It addresses the array problem locally, in ArrayData<T>. Instead of using T[] as the backing object, it can use byte[]. Providing Memory<byte> for IPythonBuffer would become trivial. Indexed access, on the other hand, would go something like this:

public T this[int index] {
    get => MemoryMarshal.Cast<byte, T>(_items.Span)[index];
    set => MemoryMarshal.Cast<byte, T>(_items.Span)[index] = value;
}

I would like to see it tied out. @slozier, you have been working on ArrayData<T> on the past, do you prefer to take it on?

@slozier
Copy link
Contributor

slozier commented Apr 29, 2020

As for the IPythonBuffer interface, I wanted to remove GetItem, SetItem, SetSlice, ToBytes, and ToList. If we need ToBytes and ToList they can come as extension methods.

Agreed, these are the methods I would also remove. I'm not sure that they are even needed as extension methods as their (only?) consumer is memoryview.

Maybe a solution is in changing the backing object from int[] to some unmanaged blob?

I've thought about that as well, but while it solves the issue with array, it does not really solve the larger issue. For instance, I would really like to be able to use .NET arrays anywhere Python arrays are accepted (so with any bytes-like call site). We'd basically be running into the same issue.

Yet another idea: use IBufferProtocol as the designated bytes-like type, and then each method does GetBuffer(BufferFlags.Format) -> AsMemory -> AsSpan -> AsBytes

This is the idea I'm currently leaning towards. In fact, if we do this then I don't see why we couldn't just get rid of Memory and just have an AsSpan<byte> on IPythonBuffer?

@BCSharp
Copy link
Member Author

BCSharp commented Apr 29, 2020

I don't see why we couldn't just get rid of Memory and just have an AsSpan<byte> on IPythonBuffer?

I remember that having span as a method parameter of a public method was not working well. I never fully understood the root cause of this bug so I was avoiding spans anywhere public. But depending on the nature of the bug, maybe it is OK to have Span<byte> as return type of a method of a public interface? It will not be involved in the overload resolution process.

I am currently testing it.

@slozier
Copy link
Contributor

slozier commented Apr 29, 2020

I remember that having span as a method parameter of a public method was not working well.

Part of the issue with Span is that it can't be boxed or used as a generic argument so it causes a bunch of issues with call sites and the interpreter (which uses a stack of objects). If we're just using it "behind the scenes" via C# then there should be no issue with it.

@BCSharp
Copy link
Member Author

BCSharp commented May 6, 2020

This was originally "Remove IPythonBuffer.SetItem(...)" but things turned out to be quite interdependent. I tried to keep changes to minimum, just to the level of all tests passing, and commit them to get early feedback on the direction it is going in. The code is not proofread or cleaned up, and is sprinkled with TODO comments.

There are some additional tests that are passing now (not included) but to my knowledge there are no regressions.

It is strange that the build fails; does CI do something that a local ./make clean; ./make doesn't? I've successfully tested the commit locally on Windows and macOS.

@BCSharp
Copy link
Member Author

BCSharp commented May 6, 2020

I will try to merge in master and see if it helps.

@slozier
Copy link
Contributor

slozier commented May 7, 2020

I like it. I added an ArrayDataView to test out the approach. I put the logic there instead of in ByteArray so that it could be reused with array but I'm not sure if it's worth it? I don't think my ArrayDataView is quite correct since it should probably throw whenever Count changes instead of when the array is replaced. Also there might be threading issues with just using a counter as I do?

@BCSharp
Copy link
Member Author

BCSharp commented May 8, 2020

@slozier, thanks for the commit. I started working on ByteArrayView, leaving factoring out commonalities between it and ArrayView for later, but looking at your solution I see it can be done right away. Nevertheless, it was a worthwhile exercise because it got me thinking about multithreading early on. After some deliberation I came to a conclusion that it will be best not to make the Buffer Protocol views thread-safe, leaving any synchronization to the client (CPython also does not give any guarantees on thread-safety of buffer data). In the last commit, I wrote some rules of engagement as I see them currently.

However, obtaining and releasing (and in the case of ByteArray, checking for exports) has to be thread-safe, as the rest of API of the exporting object. I added the modification to the code to ensure that. Perhaps they all can be moved to ArrayData for symmetry. It also means that all CheckBuffer calls have to happen in a locked critical section. This can be better done at ByteArray level. And yes, every method of ByteArray that may modify the length of the data needs to to the check.

Further, I am not sure about calling Dispose() from the finalizer. I vaguely recall a possibility that a referenced object from a field may already been garbage-collected, in this case, _arrayData. That would lead to an exception, with GC would probably ignore...?

Handling unformatted buffers is not quite clear to me. I find Python documentation on this point incomplete and ambiguous. Maybe I will have to look into CPython's code in the end to get the full picture. But one thing that is clear is that exporters, when given requests without Formatted flag, MUST provide buffers that report Format as null. See BytesView.

@slozier
Copy link
Contributor

slozier commented May 11, 2020

I wonder if there's any value in making the ByteArray._bytes field readonly. We could do this be adding an ArrayData<T>.Capacity property (similar to List<T>.Capacity) to allow shrinking the underlying array.

Personally I don't see the issue with calling Dispose from the finalizer. It's essentially what the dispose pattern says to do? Although I'm not an expert in this domain.

return Bytes.Empty;
}

var buf = _buffer.AsReadOnlySpan();
var buf = _buffer!.AsReadOnlySpan();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a smart way to indicate _buffer being not null after the CheckBuffer() call, withouth changing the signature of CheckBuffer()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe MemberNotNullAttribute would allow that, unfortunately it's still considered a preview feature.

@@ -396,7 +396,7 @@ public sealed class MemoryView : ICodeFormattable, IWeakReferenceable, IBufferPr
return cast(format, null);
}

public MemoryView cast([NotNull]string format, [NotNull]object shape) {
public MemoryView cast([NotNull]string format, [NotNull, AllowNull]object shape) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[NotNull, AllowNull] — I find it not very readable. Not to say the confusion with NotNull from System.Diagnostics.CodeAnalysis. How about adopting a convention to always using-renaming Microsoft.Scripting.Runtime.NotNullAttribute to NotNoneAttribute?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be fine with NotNone if there is a conflict.

using NotNoneAttribute = Microsoft.Scripting.Runtime.NotNullAttribute;

@BCSharp BCSharp marked this pull request as ready for review May 21, 2020 04:29
Copy link
Contributor

@slozier slozier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! A few things I noted (for other PRs if any):

  • BytesIO.getbuffer should be reworked to not return a copy
  • Does CData work? (very low priority)
  • Should MemoryView be an IPythonBuffer?

Tests/test_memoryview.py Outdated Show resolved Hide resolved
Src/IronPython/Runtime/ArrayData.cs Outdated Show resolved Hide resolved
Src/IronPython/Runtime/MemoryView.cs Outdated Show resolved Hide resolved
Src/IronPython/Runtime/MemoryView.cs Show resolved Hide resolved
Src/IronPython/Runtime/MemoryView.cs Show resolved Hide resolved
}

int newStart = _start + start * firstIndexWidth;
int newEnd = _start + stop * firstIndexWidth;
int newStep = ((long)_step * step).ClampToInt32();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can get rid of PythonOps.ClampToInt32 now? 😏

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I almost used it in 6a6841d...
It can go now, but it may come back depending on the approach to #52.

Src/IronPython/Runtime/MemoryView.cs Show resolved Hide resolved
@slozier slozier mentioned this pull request May 21, 2020
8 tasks
@BCSharp
Copy link
Member Author

BCSharp commented May 21, 2020

BytesIO.getbuffer should be reworked to not return a copy

Agreed. I think there are more parts in BytesIO that could be reworked now that we have System.Buffers.

Does `CData` work? (very low priority)

To my knowledge, CData works as good (or as bad) as before. That is, it does work except for indirect multidimensional arrays (i.e. arrays of pointers to arrays). That last one will require support for suboffsets in MemoryView — very low priority for me at the moment. I am planning simple general cleanup in CData in the coming weeks. I will also see if there are more loose ends there.

Should `MemoryView` be an `IPythonBuffer`?

Good point, but debatable. The way I see it memoryview is Python Buffer, but elevated to be a first class Python type. In the early CPython 3 versions even more so, for instance it would return None for shape if ndim == 0, just as it is mandated for Python Buffers. IPythonBuffer is not exactly CPython Buffer, but close enough, so conceptually I think that MemoryView is an IPythonBuffer.
But I don't feel strongly about it. Introducing a MemoryViewView of sorts would prevent code like

IPythonBuffer buf = memoryView.GetBuffer();

// somewhere else, perhaps in a sub-method:
if (buf is MemoryView mv) {
   // have a feast...
}

but I also don't feel strongly of adding extra complexity just to prevent it.

On my side, besides the remaining TODOs, I see the following:

  • Constructors/initializers of bytes/bytearray should be able to accept any object implementing Buffer protocol, not just bytes-like objects.
  • Slice deletion in bytearray/array.array can be simplified and made more efficient.
  • Perhaps some of the switch on typecode statements in array.array can be simplified or eliminated. On the other hand, what is there now works just fine, so it is more like a mind teaser...

Copy link
Contributor

@slozier slozier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this, feel free to merge when ready.

@BCSharp BCSharp merged commit e8028c2 into IronLanguages:master May 22, 2020
@BCSharp BCSharp deleted the buffer_protocol branch May 22, 2020 04:42
@BCSharp BCSharp linked an issue May 22, 2020 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PEP 3118 -- Revising the buffer protocol
2 participants