Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Program aborts when Python's garbage collector gets called from another thread and attempts to traverse an unsendable pyclass instance. #3688

Closed
JRRudy1 opened this issue Dec 21, 2023 · 8 comments · Fixed by #3689

Comments

@JRRudy1
Copy link

JRRudy1 commented Dec 21, 2023

I have created a repository providing a full breakdown and minimal reproducible example of the error
at https://github.com/JRRudy1/pyo3_gc_error. I will provide a summary below, but please check out
the repository instead as I put a lot of effort into clearly presenting and investigating the issue.

In summary, I have discovered an error, or perhaps an undocumented limitation, in the way
PyO3 handles thread-checking for "unsendable" pyclass instances as they are being traversed
by Python's garbage collector (GC). In particular, this occurs when garbage collection is triggered
from a separate thread, and the pyclasses integrate with the GC by implementing the __traverse__
magic method. The error (or limitation) results in a hard abort, and is particularly problematic
since it cannot be caught from Python using a try/except block.

The conditions and sequence of events leading to the error can be summarized as:

  1. Two (or more) instances of an "unsendable" pyclass are created from Python
  2. The objects are in a reference cycle, and the pyclass defines __traverse__/__clear__ to break it
  3. All references to them outside the cycle are dropped, so the next GC cycle should clean them up
  4. Before the GC runs automatically, it gets explicitly called from another thread (gc.collect from
    Python or GcCollect from C)
  5. When the GC calls back into Rust to traverse the objects, PyO3 detects that the calling thread is
    not the original thread and incorrectly deduces that the object was sent between threads
  6. PyO3 triggers a panic and the program aborts with a misleading error message

I have gotten reasonably familiar with PyO3's internals and may be interested in working on this,
but I would need some guidance from an "expert" with a more nuanced understanding of the
possible implications. It is possible that the limitation cannot be safely fixed, and the only solution
is to improve the error message and add a warning to the documentation.

As mentioned above, please visit https://github.com/JRRudy1/pyo3_gc_error for more information.

@davidhewitt
Copy link
Member

Hmm. This is unfortunate, but not entirely a surprise. At least we crash safely.

When the GC calls back into Rust to traverse the objects, PyO3 detects that the calling thread is
not the original thread and incorrectly deduces that the object was sent between threads

I disagree that this deduction is incorrect. From PyO3's perspective this is true; the data is being read on another thread, which violates the !Send nature of that type.

One option could be to make unsendable pyclass not support gc integration, forcing the user to choose what functionality they desire. I think this seems like a reasonable restriction, because the Python GC is multithreaded.

@JRRudy1
Copy link
Author

JRRudy1 commented Dec 21, 2023

From PyO3's perspective this is true; the data is being read on another thread, which violates the !Send nature of that type.

I was afraid you'd say that! Unfortunate indeed.

One option could be to make unsendable pyclass not support gc integration

I suppose that's fair. However the issue only arises in the fairly niche case where a GC call from another thread happens to occur while there are unsendable GC-integrated objects in a reference cycle waiting to be collected, so I'm not sure whether it would be a worthy motivation for removing functionality that works fine in most cases. But maybe it is?

I did just think of another possible solution; see this github.dev link. Apparently the ThreadCheckerImpl struct (and PyClassThreadChecker trait) already has a can_drop method that seems to check for and react to this exact problem, but it only gets called in the context of the struct being dropped. By updating the _call_traverse function to call can_drop (or a similar new method) before attempting to borrow from the cell, the error could be handled more gracefully with a more informative error message. Of course we'd need to add a higher-level method to PyCell or something that would call can_drop when appropriate, instead of calling it directly in the _call_traverse function like I did in the dev link.

@adamreichold
Copy link
Member

One other option I see is, instead of an error, to make unsendable pyclasses "invisible" to the GC when it is running on a different thread, i.e. turn __traverse__ into a no-op and only actually traverse anything if on the original thread. This would imply leaks for such objects as e.g. the home thread could already have exited and the GC would never be able to run again there. But this might be a smaller caveat than raising an error (which I think is not supported by the contract of __traverse__, i.e. we would always abort).

@davidhewitt
Copy link
Member

I making them opaque to other threads is quite a reasonable option, we can also document this caveat as part of the offering of unsendable. That's a softer form of "do not support GC integration", I guess, which is practical for truly single threaded programs 👍

That said, I think it's possible that these things might still get collected by another thread running a GC collection? E.g. if the unsendable class itself does not directly contain the cycle but is referenced from an object that does participate in a cycle. Then when the cycle gets collected, the unsendable class gets dropped by the wrong thread. IIRC we leak and warn in this situation already, as per #3176, so I think this edge case is ok but unfortunate.

(The only solution I can see to mitigate that would be to have a per-thread queue so that unsendable classes could post themselves to their owning thread instead of leaking, but I'm not sure that it's worth the complexity.)

@adamreichold
Copy link
Member

I making them opaque to other threads is quite a reasonable option, we can also document this caveat as part of the offering of unsendable. That's a softer form of "do not support GC integration", I guess, which is practical for truly single threaded programs 👍

Will prepare a PR to turn __traverse__ into a no-op on other than the original threads then.

(The only solution I can see to mitigate that would be to have a per-thread queue so that unsendable classes could post themselves to their owning thread instead of leaking, but I'm not sure that it's worth the complexity.)

I think we should definitely try to reduce global state in PyO3, we already have quite to much and I would like to avoid adding more. If something like this is desired, I would prefer to have that in downstream code which actually how threading is used.

@davidhewitt
Copy link
Member

Agreed very much so on that point 👍

@JRRudy1
Copy link
Author

JRRudy1 commented Dec 22, 2023

Wow that was fast, thank you for your effort! I like that solution and implementation, great work guys

@adamreichold
Copy link
Member

So this means you tested your PoC using the proposed change and it worked as expected?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants