Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected memory management behavior #1056

Closed
etianen opened this issue Jul 20, 2020 · 21 comments
Closed

Unexpected memory management behavior #1056

etianen opened this issue Jul 20, 2020 · 21 comments
Assignees
Labels
Milestone

Comments

@etianen
Copy link

etianen commented Jul 20, 2020

I'm very excited about this library. The effort going into documentation and ergonomics is fantastic. I've been waiting to use it on stable Rust for months, and have finally started trying to integrate it into my CPU-bound Python codebase to write extension modules.

Right now, pyo3 seems to have a serious issue to me. The use of the object storage (even the lighter #887 implementation) means that this library is not zero-cost, and worse, can easily leak memory. The additional management layer that's built on top of Python's own reference counting, while ergonomic, is a big problem for my use-case.

The current situation, as I understand it:

  • GilGuard acquire overhead (minimal, and rare):

    • Acquire a Mutex on the PyObject alloc/dealloc array.
    • Allocate a Vec to copy the PyObject alloc/dealloc array out of TLS.
    • Write to 2x thread-local storage.
    • Allocate two Vec for storing pointers to owned references.
  • &PyAny owned reference creation (mostly method returns, cheap, but very frequent):

    • Write to thread-local storage.
    • Occasionally re-alloate owned reference Vecs as they grow.
    • Memory is only freed at GilGuard drop
  • PyObject drop (cheap, but frequent):

    • Read from thread-local storage.
    • Possibly acquire a Mutex on the dealloc array.
    • Possibly only free memory at next GitGuard acquire

The regular access to TLS and global mutexes are regrettable, but the memory-freeing behavior is really worrying to me. I'm not writing Python extensions because I want pretty-damn fast and unbounded memory growth. I'm going to the effort of writing a Python extension because I need as fast as possible and predictable memory usage.

The solution outlined in #885 seems perfect. A PyAny<'a> has no need for object storage, and Clone and Drop work as one would expect. I'd even go one stage further and have PyObject and Py<T> always acquire the GIL for Clone and Drop. Sure, it's slower than using the current TLS and deferred drop, but it's simpler, more predictable, and people can always use ManuallyDrop turn them back into PyAny<'a> to batch clone/drop if performance is suffering.

You could even remove the mirrored object API on PyObject and require that they're converted into PyAny<'a> to call or getattr, making PyObject's only purpose to be storing a long-term reference to a Python object between GIL acquisitions.

At that point, there's no additional global state beyond what's already provided by the Python interpreter. Things are deallocated predictably. There's a smaller API surface. The library is truly zero-cost and adds no additional complexity.

I would urge you to be bold with making breaking changes while the library is young and the userbase relatively small. It would be a trememdous shame if the most popular Rust/Python bindings were significantly slower than the equivalent C, and introduced an unexpected additional memory model.

In any case, thanks for all the work you've put in so far. :)

@kngwyu kngwyu self-assigned this Jul 20, 2020
@kngwyu
Copy link
Member

kngwyu commented Jul 20, 2020

Thanks. I think it's good timing to revisit our object model now, since recently some safety problems are pointed out aroung GILGuard.
Let me create the 2nd draft.

@etianen
Copy link
Author

etianen commented Jul 20, 2020

I'd love to read the 2nd draft. I'll shortly be using this project cautiously in production, so am excited to see where it's heading.

@davidhewitt
Copy link
Member

Just some notes I learned from my first try at PyAny<'a>:

  • The biggest issue is about API of Py<T> - e.g. in Py<PyAny<'static>> - the 'static lifetime is both required and the only one that makes sense, which is a serious papercut.
  • I had a go at solving the above using an "object type tag" e.g. Any, List which was just a type master that could go in Py<T>. This caused lots of traits to end up picking up the 'a lifetime, which is kinda annoying. Generic Associated Types, when finished, would really help here.
  • It is definitely faster than the current PyO3 implementation, but not by as much as you might think (maybe 10% last time we tried). Reliability of memory release is a nice win though.
  • Because ownership is in the struct rather than the GILPool, lifetimes work very slightly differently compared to current PyO3 (they're usually a bit shorter with PyAny<'a>).

@davidhewitt
Copy link
Member

davidhewitt commented Jul 20, 2020

I also have been thinking about our new API .into_ref() that we're planning to add to Py<T> and PyObject.

Because this gives an easy way to go from the owned reference into the gil-scoped reference, I'm also considering whether we should add new ways to create Python types with Py<T>. This would allow users to opt-out (or potentially opt-in) to GIL-scoped references depending on their preference for speed or ergonomics.

Focussing on PyList, I see two options:

  1. Add PyList::new_py() -> Py<PyList>, leave PyList::new() unchanged. This breaks less code.
  2. Change PyList::new() -> Py<PyList>, add PyList::new_ref() -> &PyList. The nice thing about this option is that it names really well alongside Py::into_ref and Py::as_ref.

@davidhewitt
Copy link
Member

Finally, I wanted to talk about PyObject and Py<T> drop behaviour. I think these have a nice sweet spot at the moment, though I'm open to changing them. There are a number of places in PyO3 where automatically acquiring the GIL could improve ergonomics. It's hard to know where to draw the line on hidden performance costs vs. ease of use.

I think we can always make these tradeoffs better learning from community members who have feedback!

@etianen
Copy link
Author

etianen commented Jul 21, 2020

To me, the imporant things to get right for this library are, in order:

  1. Sound.
  2. Predictable memory usage, with no unbounded growth.
  3. Zero-cost (fast as C)
  4. Ergonomic, well documented

Currently, 1 and 4 are pretty well nailed, but 2 and 3 are both being held back by object storage. Here's my attempt at a solution that does away with all thread locals and object storage.

The first part of the solution is already solved by the current PyAny (zero-cost wrapper around a pointer) and PyDict, PyList (zero-cost wrapper around a pointer that asserts a native type). The change I'd make is that a user never interacts with &PyAny or &PyDict directly.

Instead, these are always wrapped by either:

  • PyRef<'py, T> (e.g. PyRef<'py, PyAny> or PyRef<'py, PyDict>

    • The 'py lifetime is tied to Python<'py>, so guarantees that the GIL is held.
    • Clone and Drop perform immediate refcount incr/decr.
    • Deref impl to T for easy .call(...) and getattr(...) etc.
  • PyOwned<T> (e.g. PyOwned<PyAny> or PyOwned<PyDict>)

    • These assume that the GIL is not held.
    • Clone and Drop acquire the GIL and perform immediate refcount incr/decr, then drop the GIL again. This is not efficient, but it keeps the memory model predictable and correct. People can use .into_ref(...) (below) for efficient batch clone and drop.
    • An .into_ref(self, py: Python<'py>) method consumes the PyOwned<T> and returns a PyRef<'py, T>.
    • An .as_ref(&self, py: Python<'py>) method does not consume the PyOwned<T>, and returns a &'py T (the only time a user ever deals with a &'py T directly. I'm not 100% sure about the lifetimes here, or whether it would actually work. Maybe .clone().into_ref(py) is more correct, at the expese of a refcount.

I believe this does away with all thread locals and all object storage, and brings the performance and memory characteristics in line with the equivalent C.

I'd be happy to have a go at this myself, but I don't want to step on other people's toes, or spend time on something that isn't wanted, or has already been tried and found problematic.

@etianen
Copy link
Author

etianen commented Jul 21, 2020

Finally, I wanted to talk about PyObject and Py<T> drop behaviour. I think these have a nice sweet spot at the moment, though I'm open to changing them. There are a number of places in PyO3 where automatically acquiring the GIL could improve ergonomics. It's hard to know where to draw the line on hidden performance costs vs. ease of use.

I think we can always make these tradeoffs better learning from community members who have feedback!

My specific issue with the drop behaviour is that checking a thread-local is not zero cost, and if the GIL isn't held it leads to unbounded memory growth, and potentially a permanent memory leak if the GIL is never re-acquired.

@etianen

This comment has been minimized.

@davidhewitt
Copy link
Member

davidhewitt commented Jul 21, 2020

@etianen thanks for all the ideas. I agree that the direction I want for PyO3 is to remove as much object storage as possible.

TLDR; Yes, I like this design and have considered it myself, but I'm not sure it's the right time to do it.

I've thought many times about ideas similar to your PyOwned<T> / PyRef<'a, T> combination. Note that your proposal for PyOwned<T> is more or less exactly what the current Py<T> is, except for the drop semantics. (PyObject is very similar to Py<PyAny>, and I'd like to remove it, but haven't quite found the appropriate way to do it given how PyO3's conversion traits work currently.)

Regarding PyRef<'a, T> - again, this is an API that I've considered moving towards. It fits nicely as PyRef is already a name we use for #[pyclass], and the meaning would be essentially the same for native types (except "refcell" semantics wouldn't be needed for native types).

Why I'm not sure now is the moment.

The biggest downside of PyRef<'py, T> with current Rust, is that the GIL lifetime 'py isn't perfectly preserved over multiple calls. Consider, for example, this method definition which we would adopt if we switched to PyRef<'py, T>:

impl PyAny {
    fn call0(&'a self) -> PyRef<'a, PyAny> { ... }
}

If PyRef<'py, PyAny> derefs to &'a PyAny for the method call, then this lifetime 'a is not the original GIL lifetime 'py. Instead, 'a is the lifetime of the derefencing borrow. So as far as Rust is concerned the result of the method call is borrowed from the original receiver. This is obviously wrong!

The result of this is that a series of chained PyO3 API calls gradually see the GIL lifetime which the compiler guarantees get narrower and narrower. It's not the end of the world, but it's quite an annoying papercut.

The future solution I see

There is, fortunately, a way that I can see this being resolved in the future: arbitrary self types. If we were able to use &PyRef<'py, T> as a self type, then this problem goes away:

impl PyAny {
    fn call0<'py>(self: &PyRef<'py, Self>) -> PyRef<'py, PyAny> { ... }
}

With arbitrary self types, we really can have our cake and eat it! So this is a direction which I would like to see for PyO3, but I'm not actively pursuing it now because we need someone to champion getting the feature added to the Rust language first!

@davidhewitt
Copy link
Member

davidhewitt commented Jul 21, 2020

Note that a "hack" to fix GIL lifetimes if we move to PyRef<'py, T> without arbitrary self types is to expose an API to manually rebind the reference using an API that consumes the PyRef:

impl<T> PyRef<'_, T> {
    fn with_lifetime<'py>(self, py: Python<'py>) -> PyRef<'py, T> {
        // build new pyref using new Python token
    }
}

In the case where the reference is owned, we know that the lifetime 'a can be extended safely to any lifetime the GIL is held for, which is why the above API could work as an interim "hack".

TBH I'd be very happy to listen to the PyO3 community to see whether they would prefer migrate to this API now, at the cost of some lifetime frustration, or to wait for the more ergonomic future.

@davidhewitt
Copy link
Member

My specific issue with the drop behaviour is that checking a thread-local is not zero cost, and if the GIL isn't held it leads to unbounded memory growth, and potentially a permanent memory leak if the GIL is never re-acquired.

The alternative (acquiring the GIL) is significantly slower than checking the "not zero cost" thread-local. (You can easily verify this by modifying the Drop impl for PyObject and running the bench_pyobject benchmark.)

If we were to remove the PyObject storage in favour of decreasing the reference immediately, I'd still encourage us to check the thread-local - and then only acquire the GIL if needed.

To migrate to the case where PyObject does not have object storage on Drop, I would like to see discussion on these points:

  • Is there any risk of deadlock by doing this? (My suspicion is no; any reassurance is welcome.)

  • A quick test on my laptop suggests that the "worst case" where the GIL is not held and needs to be acquired is about 40x slower than the implementation we currently have (place in object storage). Is this overhead in a Drop impl acceptable?

    To test this, I modified bench_pyobject to:

    #[bench]
    fn drop_many_objects(b: &mut Bencher) {
        let gil = Python::acquire_gil();
        let py = gil.python();
        let objects = vec![py.None(); 1000];
        b.iter(|| {
            let objects = objects.clone();
            py.allow_threads(move || drop(objects));
        });
    }
    

    And changed the PyObject drop to always acquire the GIL then decrease the reference.

When I've looked at this in the past, I've come to the conclusion that the PyObject object storage is actually not so bad, when faced with the alternative of a fairly nasty performance drawback in the Drop implementation. To hit the case of "unbounded memory growth" with this bit of the object storage, you've need to be repeatedly cloning and dropping these PyObject instances without ever acquiring the GIL. Perhaps somebody will one day write a valid program which does this, but that seems rare!

(OTOH, I 100% agree with you that the unbounded memory growth that can occur from &PyAny with the GIL is a serious issue.)

@etianen
Copy link
Author

etianen commented Jul 22, 2020

It's clear that you've considered a lot of these issues already, so thanks for explaining! On reflection, I can see that there's still a cause for a naked &'py PyAny. The holy trinity of object references is:

  • &'py T - Borrowed pointer with GIL.
  • PyRef<'py, T> - Owned pointer with GIL.
  • PyOwned<T> - Owned pointer with no GIL (yes, effectively a Py<T>).

I think your with_lifetime() solution is workable. The alternative is to duplicate the object API (getattr, call etc) on PyRef<'py, T>. That starts getting extremely clunky though, particularly with other primitive types like PyDict. I think your with_lifetime is a good, general solution.

Why I think now is the time

It's currently impossible / very difficult to write a Python extension that deals with siginificant numbers of objects, and that is a big chunk of valuable use-cases. Using &'py T means unbounded memory growth, or requires unsafe GILPool. Using Py<T> or PyObject means thread-local overhead, which can be mitigated using Python::release(...) absolutely everywhere.

Arbitrary self-types might never land, or might land in a long time. Switching to PyRef<'py, T> is a big breaking API change, but it enables a large swath of use-cases. I vote for doing it soon! If we get arbitrary self-types in the future, the with_lifetime hack can be deprecated, but it would be a very easy migration path.

It's a small hit to ergonomics vs a 10% performance boost and no unbounded memory growth. That seems a clear win, IMO.

The drop behavior of Py<T> and PyObject

With a decent PyRef<'py, T> implementation, and if pyclass constructors could return PyRef<'py, T> instead of Py<T>, the use of Py<T> and PyObject would only be used for persisting owned pointers without the GIL. In most cases, this would mean very few of them actually exist. So their Clone and Drop behavior isn't so important. I can think of a few strategies, all with different tradeoffs:

Clone/Drop strategy Pro Con
Always acquire GIL, immediately update refcount Simple, no additional state Slow
Check for GIL using `PyGILState_Check` (or a TLS if faster), acquire GIL if needed, immediately update refcount Still simple, no additional state, hopefully fast is the GIL is held Slow if GIL not held
Check for GIL using `PyGILState_Check` (or a TLS if faster), immediately update refcount if possible, else defer until GIL next held Fast regardless of GIL state Unbounded memory growth

Whichever strategy is chosen, a programmer can use ManuallyDrop and into_ref() to perform a zero-cost drop of one or more Py<T>, so this really comes down to which strategy is the default for correctness and good-enough lazy programming. I personally find the potential for unbounded memory growth a little worrying, but I appreciate the different tradeoffs here. I'd vote for the second strategy (assuming reasonably performant), but it's not a hill I'd die on. 😛

@etianen
Copy link
Author

etianen commented Jul 22, 2020

I've largely said my piece here, but just to add: I'm very reassured that you see the &PyAny unbounded memory growth as a serious issue, and it's clear that you've considered all sorts of approaches. Thanks for your time in explaining everything.

I can't use this library in earnest until that issue is solved, but it remains an excellent project. I'm happy to help out if it makes this happen earlier, but otherwise I'll be watching the changelog eagerly!

@kngwyu
Copy link
Member

kngwyu commented Jul 26, 2020

Thank you but I'm thinking about a completely different one: PyAny<'py> and PyAnyRaw. No borrowed type.

@davidhewitt
Copy link
Member

@kngwyu would love to see some detailed notes on what you're thinking!

@davidhewitt
Copy link
Member

Just FYI; while I think we're getting close to considering 0.12 release, and shouldn't wait for this (unless someone's already close to an implementation), I'd like to see us review how we can improve this for 0.13.

@davidhewitt
Copy link
Member

Note to self - pantsbuild/pants#13526 (comment) is interesting. (Potentially Arc<NonNull<ffi::PyObject>> may be a good future representation.)

@davidhewitt davidhewitt modified the milestones: 0.17, 0.18 Jul 3, 2022
@davidhewitt davidhewitt modified the milestones: 0.18, 0.19 Dec 27, 2022
haixuanTao added a commit to haixuanTao/pyo3 that referenced this issue Jan 7, 2023
Adding a special case of memory management when writing an extension.

This is a documentation of: PyO3#1056 and PyO3#2853
davidhewitt pushed a commit to haixuanTao/pyo3 that referenced this issue Jan 10, 2023
Adding a special case of memory management when writing an extension.

This is a documentation of: PyO3#1056 and PyO3#2853
bors bot added a commit that referenced this issue Jan 10, 2023
2864: Add a section on memory management for `extension` r=davidhewitt a=haixuanTao

Adding a special case of memory management when writing an extension.

This is a documentation of: #1056 and #2853


Co-authored-by: Haixuan Xavier Tao <tao.xavier@outlook.com>
Co-authored-by: David Hewitt <1939362+davidhewitt@users.noreply.github.com>
@davidhewitt davidhewitt mentioned this issue May 18, 2023
7 tasks
@davidhewitt davidhewitt modified the milestones: 0.19, 0.20, 0.21 Jun 16, 2023
@davidhewitt
Copy link
Member

With the release of the Bound API in 0.21 (beta), there is finally a resolution to this problem. 🎉

Migrating to the Bound API will solve this problem immediately for code which needs this fixed, and the follow-up to this work is #3960 to remove the existing GIL Refs API.

@etianen
Copy link
Author

etianen commented Mar 15, 2024

Incredible, I've been watching and hoping for a long while for this to be sorted! I'm really glad it was included for 1.0. ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants