
Fix signature of PyArray::new or better yet replace it by PyArray::uninit #217

Closed
adamreichold opened this issue Nov 6, 2021 · 14 comments · Fixed by #220
@adamreichold (Member) commented:

As discussed in #216 (comment), PyArray::new creates an uninitialized array and should therefore be marked unsafe, as it is the caller's responsibility not to access (and for non-Copy element types, not even drop) the array until all elements are initialized.

However, we also do not seem to offer a good API for doing that element-wise initialization post facto. I think that instead of just marking the method unsafe, we should rename it to PyArray::uninit and keep it safe, but change its return type to PyArray<MaybeUninit<A>, D> and add an additional assume_init method for turning this into PyArray<A, D>. This would be similar to the approach taken by ndarray.
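
The two-phase pattern being proposed here can be illustrated with plain std types; this is only a sketch of the idea (Vec stands in for the NumPy allocation, and `filled` is an illustrative name, not part of any API):

```rust
use std::mem::MaybeUninit;

// Two-phase initialization as in ndarray's uninit/assume_init pattern:
// allocate MaybeUninit storage, write every element, then assume_init.
fn filled(n: usize) -> Vec<u32> {
    let mut buf: Vec<MaybeUninit<u32>> = (0..n).map(|_| MaybeUninit::uninit()).collect();
    for (i, slot) in buf.iter_mut().enumerate() {
        slot.write(i as u32); // safe: writing never reads the uninitialized value
    }
    // SAFETY: every element was written above.
    buf.into_iter().map(|slot| unsafe { slot.assume_init() }).collect()
}

fn main() {
    assert_eq!(filled(3), vec![0, 1, 2]);
}
```

The safe/unsafe split is the point: creating and writing the `MaybeUninit` storage is safe; only the final `assume_init` step carries the proof obligation that everything was initialized.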

@adamreichold (Member Author) commented Nov 6, 2021:

> we should rename it PyArray::uninit and keep it safe but change its return type to PyArray<MaybeUninit<A>, D> and add an additional assume_init method for turning this into PyArray<A, D>

Ah, it probably is not that simple. In contrast to ndarray, we do not drop the elements ourselves; rather, NumPy's C code does that. Hence, &PyArray<MaybeUninit<PyObject>, D> would still lead NumPy to decrement reference counts via those garbage pointers, and I fear we actually have to yield MaybeUninit<&PyArray<T, D>> instead, which will make that API significantly less ergonomic...

@adamreichold (Member Author) commented Nov 6, 2021:

@davidhewitt I am having some trouble trying to figure out the correct API here. The above does not work as just returning MaybeUninit<&PyArray<T,D>> does not really change anything (the shared reference is Copy after all).

I think the main issue is that as soon as FromPyPointer::from_owned_ptr is called to register the newly created PyArray instance with the release pool, its reference count will be decremented which will eventually call NumPy's array_dealloc on the uninitialized data.

So it seems I need some type MaybeOwned<'py, T> where T: FromPyPointer that

  • holds onto the Python<'py> token and a raw pointer *mut T (in this case the return value of PyArray_New)
  • provides access to the potentially uninitialized T, most likely via *mut T as MaybeUninit::as_mut_ptr does
  • is consumed via fn assume_init(self) -> &'py T which will register the pointer with the release pool only after initialization.

Does something like this already exist in PyO3? Is there a different idiom I should use to delay registration with the release pool?

(Of course this all means the underlying NumPy array is leaked if the initialization is never completed, but at least this would be safe.)
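
A minimal sketch of the MaybeOwned shape described above, modelled with a plain heap allocation standing in for the return value of PyArray_New. All names here are hypothetical; a real implementation would additionally hold the Python<'py> token and register the pointer with the release pool inside assume_init:

```rust
// Guard that owns a raw pointer but defers "registration" (here: re-assuming
// Box ownership) until the caller asserts the pointee is fully initialized.
struct MaybeOwned<T> {
    ptr: *mut T, // stands in for the raw pointer returned by PyArray_New
}

impl<T> MaybeOwned<T> {
    fn new(value: T) -> Self {
        MaybeOwned { ptr: Box::into_raw(Box::new(value)) }
    }

    // Raw access for element-wise initialization, like MaybeUninit::as_mut_ptr.
    fn as_mut_ptr(&mut self) -> *mut T {
        self.ptr
    }

    // Consume the guard; only now does the value become owned/droppable.
    fn assume_init(self) -> Box<T> {
        unsafe { Box::from_raw(self.ptr) }
    }
}

fn main() {
    let mut guard = MaybeOwned::new(0u32);
    unsafe { *guard.as_mut_ptr() = 42 }; // initialization step
    assert_eq!(*guard.assume_init(), 42);
}
```

Because the guard has no Drop impl, abandoning it before assume_init leaks the allocation rather than running a destructor over uninitialized data, matching the safety trade-off noted above.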

@adamreichold (Member Author) commented:

Or maybe I want to return MaybeUninit<Py<PyArray>> to avoid registration?

@adamreichold (Member Author) commented Nov 7, 2021:

> Does filling the array with null pointers help?

Yes, NumPy seems to use XDECREF internally, so null pointers would be safe. On the other hand, filling everything with zeros is exactly the overhead that using an uninitialized array is supposed to avoid.

@kngwyu Since we do not have a proper API to do mostly-safe element-wise initialization either, I would propose removing the creation of uninitialized PyArrays from the public API. We can still use it internally as an unsafe method, and external users can call PyArray::zeros or use ndarray's API and then turn the result into a PyArray. Do you think this would be a way forward? Do you know of users of this crate which depend on creating uninitialized PyArrays directly?

@davidhewitt (Member) commented:

Hmm not sure why the github timestamp for my previous comment is in the future? Never seen that before!

Sounds like something like PyArray<MaybeUninit<PyObject>, D> might still be a viable way to go, with the method still being unsafe with clear documentation that the user is responsible for initializing the full array else crashes happen?

@adamreichold (Member Author) commented:

> might still be a viable way to go

The problem is that we do not really have API to enable the user to properly initialize that array, i.e. get access to raw data pointers and such. Additionally, that PyArray instance is a bomb that we do not offer reasonable help to defuse.

@davidhewitt (Member) commented:

> The problem is that we do not really have API to enable the user to properly initialize that array, i.e. get access to raw data pointers and such. Additionally, that PyArray instance is a bomb that we do not offer reasonable help to defuse.

Agreed we would need to add .set_item() and/or ways to fill in rows from a slice.

Regarding it being a "bomb" - true, but only in the PyObject case I guess. For integer / float types there's no such danger on drop, just the danger of uninitialised reads.

Either way, if the method to create it is unsafe then it still offers users the flexibility to achieve what they want, with explicit acknowledgement they are opting into dangerous territory.

Removing PyArray::new and adding an unsafe PyArray::uninit seems like a decent compromise?

@davidhewitt (Member) commented Nov 7, 2021:

EDIT: This message has an incorrect timestamp (maybe 1 hour delayed?) not sure what GitHub did here.

This seems like a tricky one... because presumably we do want to free the underlying allocation on drop, so we do need numpy's destructor to run? To clarify, we're in a thorny situation where numpy will free the underlying array elements on drop, so we need to be sure they're valid to drop before calling the numpy destructor. I guess this is only a problem with Drop structs and also PyObject arrays. Does filling the array with null pointers help?

I'm not aware of prior art, so we might need to get creative.

@adamreichold (Member Author) commented:

> Regarding it being a "bomb" - true, but only in the PyObject case I guess. For integer / float types there's no such danger on drop, just the danger of uninitialised reads.

I think I found a useful idiom in #216 for keeping the uninitialized array alive. I just do

let ref_ = mem::ManuallyDrop::new(array.to_object(py));

before initialization to elevate its reference count and do

let _ = mem::ManuallyDrop::into_inner(ref_);

afterwards to release the spurious reference. (I added a somewhat convoluted test which does seem to verify that this works as intended.)
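
The idiom can be modelled with std types alone, with Rc's strong count standing in for the Python reference count; this is a sketch of the mechanism, not the actual #216 code:

```rust
use std::mem::ManuallyDrop;
use std::rc::Rc;

fn main() {
    // Rc's strong count models the Python reference count of the array.
    let array = Rc::new(());

    // Elevate the count for the duration of initialization; if a panic
    // occurs in between, the extra reference is leaked instead of the
    // deallocator running over uninitialized data.
    let guard = ManuallyDrop::new(Rc::clone(&array));
    assert_eq!(Rc::strong_count(&array), 2);

    // ... element-wise initialization would happen here ...

    // Release the spurious reference once initialization is complete.
    let _ = ManuallyDrop::into_inner(guard);
    assert_eq!(Rc::strong_count(&array), 1);
}
```

The key property is that ManuallyDrop suppresses the destructor on the panic path, so the worst case is a leak, which is safe, rather than a use of uninitialized memory.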

> Removing PyArray::new and adding an unsafe PyArray::uninit seems like a decent compromise?

I wonder whether keeping the signature as-is (i.e. leaving out the MaybeUninit but marking it unsafe) would be simpler, as we might end up recreating a lot of the access API for the MaybeUninit<A> where A: Element case?

In any case, I think we should first make sure #216 is in good shape before starting on the public API as it forces us to figure this internally already.

@adamreichold (Member Author) commented Nov 7, 2021:

Looking at my test case again, I wonder whether we should just make "Clone must not panic" part of Element's contract, as in the case of Py<T> cloning is just bumping the reference count, which should never panic, should it? This would leave only the iterator-based methods requiring the elevated reference counts, as the iterator itself could panic.

@adamreichold (Member Author) commented:

(Assuming Copy for all data types except DataType::Object should probably also be made explicit in the trait documentation.)

@davidhewitt (Member) commented:

> (Assuming Copy for all data types except DataType::Object should probably also be made explicit in the trait documentation.)

Could also add a const IS_COPYABLE member to the trait

@adamreichold (Member Author) commented Nov 7, 2021:

> Could also add a const IS_COPYABLE member to the trait

I don't think an Element type with DATA_TYPE != DataType::Object that is not trivially copyable makes sense, as NumPy itself will neither "drop" nor "clone" the elements in that case; i.e. it is not us imposing that constraint but NumPy itself.

@davidhewitt (Member) commented:

Ah right, I see 👍
