Add support for nogil Python #2885

Open
wants to merge 4 commits into
base: main

Conversation

colesbury

Per the discussion in Discourse, these are the changes to PyO3 to work with the nogil proof-of-concept implementation. I've rebased these changes from 0.15.2 and done some light testing locally.

  • The primary purpose of this PR is to discuss the changes PyO3 would need to support PEP 703. Those changes would be slightly different because the PEP proposes an ABI flag like "cp312n" whereas the "nogil" fork currently looks something like "nogil39". Additionally, the reference count fields are slightly different sizes in the PEP vs. in the "nogil" fork (basically Py_ssize_t in the PEP and u32 in the fork). Otherwise, I think the changes would be similar.

  • Additionally, if you're interested, it would be great for PyO3 to support the "nogil" fork (i.e., to get this PR in a state where it can be merged). That would make it easier to support projects that depend on PyO3 in "nogil" Python (and may help the PEP too). We've done something similar in [0.29] Add configuration for nogil Python (gh-4912) cython/cython#4914. (Of course, if it ever feels like too much of a maintenance burden in the future, you should feel free to drop it.)

@adamreichold
Member

To be honest, I am highly sceptical. If I understand this correctly and PyGILState_Ensure essentially becomes a no-op, a PyO3 built against the nogil variant of CPython would be unsound due to (from the PEP):

  • C-API extensions that rely on the GIL to protect global state or object state in C code will need additional explicit locking to remain thread-safe when run without the GIL.
  • C-API extensions that use borrowed references in ways that are not safe without the GIL will need to use the equivalent new APIs that return non-borrowed references. Note that only some uses of borrowed references are a concern; only references to objects that might be freed by other threads pose an issue.

We currently rely on the GIL protecting our data structures and global state in various places and protecting them separately would add overhead to the common usage with CPython or a much higher maintenance burden to abstract away the GIL-versus-separate-locks problem. (It would also very much complicate reasoning as we would need to assume separate locks when evaluating safety.)

I think following our usual standards regarding soundness we could not release a version of the PyO3 crate based on the changes presented here.

@davidhewitt
Member

Agree with both of you. I'm definitely enthusiastic to support PEP 703, as I think that it would be a massive long-term win for Rust/Python interop if the GIL were removed.

That said, @adamreichold is absolutely right that this is a long way from mergeable / releasable as-is. At a minimum, this is going to need additional CI jobs for the nogil fork on Windows, Mac and Linux. The soundness issues are also a major concern. One which springs to mind now is GILOnceCell. I think it's good enough if we keep the PyO3 APIs externally equivalent for now and just have internal adjustments to ensure sound behaviour (e.g. maybe we can use once_cell as an internal substitute).
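For illustration only, here is a minimal sketch of what a thread-safe once-cell looks like without the GIL. It uses the standard library's `std::sync::OnceLock` (stable since Rust 1.70) as a stand-in for once_cell; the names `CACHED`, `cached_value` and `race_initializers` are made up for this example and are not PyO3 APIs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

// Counts how often the initializer actually runs.
static INIT_CALLS: AtomicUsize = AtomicUsize::new(0);

// Hypothetical cached value. In PyO3 this would be something like a cached
// type object or interned string that currently lives in a GILOnceCell.
static CACHED: OnceLock<String> = OnceLock::new();

fn cached_value() -> &'static str {
    CACHED.get_or_init(|| {
        INIT_CALLS.fetch_add(1, Ordering::SeqCst);
        String::from("expensive to compute")
    })
}

// Races several threads on the first access and reports how many times the
// initializer ran. OnceLock does its own synchronization, so no GIL is needed.
pub fn race_initializers(threads: usize) -> usize {
    let handles: Vec<_> = (0..threads)
        .map(|_| thread::spawn(|| cached_value().len()))
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    INIT_CALLS.load(Ordering::SeqCst)
}
```

The point of the sketch is only that the cell synchronizes itself; whether such a type could replace GILOnceCell behind an unchanged external API is exactly the open question here.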

With passing CI jobs and a reasonable effort to identify what would be unsound and how it's mitigated, I can see us merging this. Certainly in the long term this work would be necessary to support PEP 703, and if supporting nogil helps push the PEP, as long as the maintenance burden here isn't enormous I'd be ok with merging nogil support too.

(It would also very much complicate reasoning as we would need to assume separate locks when evaluating safety.)

As a non-optimal short-term fallback we could always have a static global mutex which all PyO3 APIs could use internally 😋 (only half serious, if there are better options we should take them).

Comment on lines +70 to +80
#[inline]
#[cfg(not(Py_NOGIL))]
pub unsafe fn PyList_FetchItem(list: *mut PyObject, index: Py_ssize_t) -> *mut PyObject {
    _Py_XNewRef(PyList_GetItem(list, index))
}

#[inline]
#[cfg(Py_NOGIL)]
pub unsafe fn PyList_FetchItem(list: *mut PyObject, index: Py_ssize_t) -> *mut PyObject {
    _PyList_FetchItem(list, index)
}
Member

@davidhewitt davidhewitt Jan 18, 2023

TBH, adding these provisional APIs and changing PyO3 to use them is a 👎 from me, the callsites which you modified should already be converting borrowed references to owned, and if they didn't, that's a separate bug.

EDIT: I reread the PEP and see now that it's necessary for thread-safety that the borrowed-to-owned conversion is done within the interpreter rather than PyO3. So I guess this would indeed be necessary.

Author

The idea behind this proposed change is that, when running without the GIL, the retrieval from the list and the acquisition of an owned reference need to be performed atomically with respect to any concurrent modifications. PEP 703 proposes PyList_FetchItem and PyDict_FetchItem, which have this behavior: they return owned references and are safe in the face of concurrent modification.
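A rough analogy in plain Rust, assuming a list guarded by a single lock (the nogil implementation may well synchronize differently): the lookup and the reference-count increment happen while the lock is held, so a concurrent `clear` cannot free the item in between. `SharedList` and its methods are hypothetical names used only for this sketch, and `Arc<String>` stands in for a reference-counted Python object.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for a list object holding reference-counted items.
pub struct SharedList {
    items: Mutex<Vec<Arc<String>>>,
}

impl SharedList {
    pub fn new(items: Vec<String>) -> Self {
        Self {
            items: Mutex::new(items.into_iter().map(Arc::new).collect()),
        }
    }

    // Looks up the item and takes a new strong reference while the lock is
    // held. The caller receives an owned reference, never a borrowed one.
    pub fn fetch_item(&self, index: usize) -> Option<Arc<String>> {
        let items = self.items.lock().unwrap();
        items.get(index).cloned()
    }

    // Drops the list's own references to all items.
    pub fn clear(&self) {
        self.items.lock().unwrap().clear();
    }
}
```

An item fetched this way stays valid even if another thread clears the list afterwards, which is the property a borrowed reference cannot offer without the GIL.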

@adamreichold
Member

I think it's good enough if we keep the PyO3 APIs externally equivalent for now and just have internal adjustments to ensure sound behaviour

I am not sure about the Python type. The token is used in downstream projects to assert that data structures are accessed only when the GIL is held. (I know that it is used in that way by rust-numpy, but I suspect that it is not the only one. "We already have to live with the GIL, so why add overhead for another synchronization scheme." is likely a thought that multiple developers had.)

So for downstream projects to remain sound, we might end up needing to do our own locking to give the Python token GIL-like semantics, i.e.

As a non-optimal short-term fallback we could always have a static global mutex which all PyO3 APIs could use internally 😋 (only half serious, if there are better options we should take them).

which might end up defeating the purpose of supporting nogil in the first place.

(My scepticism is not so much about supporting nogil, but rather about doing it with a simple compile time switch to change things behind the scenes while keeping an API which is tailored towards the GIL-based CPython world.)

@colesbury
Author

Can you point me to a few instances of using the GIL to protect other data structures, such as in rust-numpy? @davidhewitt mentioned GILOnceCell. The conversion of borrowed references in dict.rs and list.rs is another example, but that seems easy to address.

@davidhewitt
Member

davidhewitt commented Jan 18, 2023

(I know that it is used in that way by rust-numpy, but I suspect that it is not the only one. "We already have to live with the GIL, so why add overhead for another synchronization scheme." is likely a thought that multiple developers had.)

This is a very good point, and one that I hadn't considered the finer details of. I think in an ideal world this means we would want some PyO3 APIs to have external differences when running on nogil so that unsound uses would not compile. More consideration needed.

At its most crude, a nogil feature gate which e.g. removed the Python type and replaced it with a slightly different equivalent would force all pyo3 crates to consider their nogil compatibility. However, this would be a horrendous maintenance issue, so I think removing the Python type would be a non-starter.

@davidhewitt
Member

Thought - Python::from_borrowed_ptr and Py::from_borrowed_ptr are likely to be footguns if borrowed references are typically dangerous.

@adamreichold
Member

Can you point me to a few instances of using the GIL to protect other data structures, such as in rust-numpy? @davidhewitt mentioned GILOnceCell. The conversion of borrowed references in dict.rs and list.rs is another example, but that seems easy to address.

Examples are protecting cached capsule pointers in https://github.com/PyO3/rust-numpy/blob/main/src/npyffi/array.rs#L60 and https://github.com/PyO3/rust-numpy/blob/main/src/npyffi/ufunc.rs#L38 as well as access to the global state backing the dynamic borrow checking in https://github.com/PyO3/rust-numpy/blob/main/src/borrow/shared.rs#L107

But as @davidhewitt discussed above, this is not so much about changing the various usage sites to use an independent synchronization scheme, but rather about PyO3 providing a sound API contract for the Python type as all potential/upcoming/unknown users of PyO3 could be relying on the current contract.

So we either have to remove the Python type when the nogil feature is enabled so that downstream projects cannot rely on its properties, or we have to provide a contract that works both with and without the GIL. Currently, I think that would mean backing the Python type with a process-wide lock, which would be a GIL by any other name.

As for a contract that would work in both cases, I can only imagine that Python loses any assurances it had w.r.t. synchronization, but we provide a separate API to acquire a (maybe process-wide, maybe more limited) lock if required, which could be backed by the GIL if it is in use and by a PyO3-provided lock if not. This way, downstreams can use it for synchronization without additional overhead, as it is used now, but still work soundly when no GIL is backing it. Whether some downstreams should use a separate synchronization mechanism to achieve more parallelism at the cost of overhead if the GIL is in use could then be discussed on a case-by-case basis.
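To make the idea of a separate lock API concrete, here is a small sketch in plain Rust. Everything in it is hypothetical (`PROCESS_LOCK`, `LockToken`, `with_lock`, `LockProtected`); it assumes a non-reentrant process-wide mutex and only mirrors the general shape of a token-gated cell, not any existing PyO3 type.

```rust
use std::cell::RefCell;
use std::marker::PhantomData;
use std::sync::{Mutex, MutexGuard};
use std::thread;

// Process-wide lock standing in for "the GIL, or a PyO3-provided lock".
static PROCESS_LOCK: Mutex<()> = Mutex::new(());

// Hypothetical token proving that the current thread holds PROCESS_LOCK.
// The raw-pointer PhantomData makes the token neither Send nor Sync.
pub struct LockToken {
    _guard: MutexGuard<'static, ()>,
    _not_send_sync: PhantomData<*mut ()>,
}

// Runs `f` with the lock held. Not reentrant: nesting calls would deadlock.
pub fn with_lock<R>(f: impl FnOnce(&LockToken) -> R) -> R {
    let guard = PROCESS_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    let token = LockToken { _guard: guard, _not_send_sync: PhantomData };
    f(&token)
}

// A value that can only be reached while presenting a LockToken.
pub struct LockProtected<T> {
    value: T,
}

// SAFETY (assumption of this sketch): `&T` is only handed out for as long as
// a LockToken is borrowed, and the token cannot leave the locking thread.
unsafe impl<T: Send> Sync for LockProtected<T> {}

impl<T> LockProtected<T> {
    pub const fn new(value: T) -> Self {
        Self { value }
    }

    pub fn get<'a>(&'a self, _token: &'a LockToken) -> &'a T {
        &self.value
    }
}

// Global state that is not thread-safe on its own (RefCell is !Sync).
static COUNTER: LockProtected<RefCell<u64>> = LockProtected::new(RefCell::new(0));

pub fn count_from_threads(threads: u64, per_thread: u64) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..per_thread {
                    with_lock(|token| {
                        *COUNTER.get(token).borrow_mut() += 1;
                    });
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    with_lock(|token| *COUNTER.get(token).borrow())
}
```

A real design would additionally need reentrancy and a way to back the lock with the GIL when one exists; the sketch leaves both out.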

@davidhewitt
Member

davidhewitt commented Jan 19, 2023

Thinking on this more overnight, I think the assessment above is correct - we'd need the Python type to be backed by a reentrant locking mechanism to emulate the GIL behaviour for users who depend on it. Probably it would need to be per-process to avoid subtle issues with deadlocks if there were finer-grained locks, but I'm not sure.

A nogil feature could then make some breaking adjustments to PyO3 APIs which would be sufficient to guarantee soundness. (Maybe remove all ways to obtain the Python type except for a fn nogil() -> Python<'static> as a start, likely would need to go even further.) This would enable extension authors to attest support for nogil and opt-out of the GIL emulation for their extension.

I think PyO3's "borrowed references" &'py PyAny could be problematic for a future nogil feature because the 'py lifetime there is directly treated as the GIL lifetime. We may need to rework that API before proper nogil support. (That's been on the cards for a long time anyway.)

For the sake of supporting experimentation, how about the following:

  • We add a section to https://pyo3.rs/latest/parallelism to document the existence of this branch and how to use it. For now @colesbury keeps ownership of this branch and can set a policy of when to run rebases etc.
  • We document clearly the soundness risks which we are aware of, and make it clear to users that by choosing to compile against this branch they take responsibility for attesting they and their dependencies are safe in a nogil context.
  • We could add detection for nogil to pyo3-build-config and raise helpful compile errors pointing at those instructions.

I'd hope that by providing a way for experimentation in nogil that it encourages downstream packages to at least consider what they would look like in a nogil context, even if they have to declare that they don't support it for now.

@shivaylamba

@sansyrox

@davidhewitt
Member

Another obvious soundness hole is our PyCell type - this is essentially a copy of Rust's RefCell where we've allowed multithreaded access mediated by the GIL. In a nogil world we'd need to consider replacing these internally by mutexes or some other protection (maybe RwLock is most equivalent). I fear that having lots of tiny locks would have a high risk of deadlocking.

This also raises an argument for making progress on #1979 and moving #[pyclass(frozen=True)] to be the default, as frozen (i.e. immutable) pyclasses wouldn't need any locking. Eventually perhaps for a nogil feature we wouldn't offer mutable pyclasses at all. This would be one example of a breaking API change on the nogil feature which could help prevent unsound users from compiling in that mode.

@adamreichold
Member

I fear that having lots of tiny locks would have a high risk of deadlocking.

I think this case could be handled using a single static lock. Still better than a GIL proper but also local enough to reason about safety.

On the other hand, we do not need full locks here as we do not need to provide blocking behaviour at all. Just as we fail now if the cell is already borrowed, we could just check an atomic usage counter and fail if it has the wrong value, without risk of deadlocks. (The ECS hecs, for example, uses atomic borrow counts to allow multi-threaded usage without imposing locking costs onto all users. One just has to structure the systems accessing the data to not overlap by design, but there is no safety hazard besides a Result::Err/panic if they do.)
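A minimal sketch of such a non-blocking atomic borrow flag, assuming the convention that `usize::MAX` marks an exclusive borrow and any other value counts shared borrows. `AtomicBorrowFlag` is a hypothetical name; this is neither hecs's nor PyO3's actual implementation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sentinel value meaning "exclusively borrowed".
const EXCLUSIVE: usize = usize::MAX;

// Hypothetical borrow flag: 0 = unborrowed, n = n shared borrows.
pub struct AtomicBorrowFlag(AtomicUsize);

impl AtomicBorrowFlag {
    pub const fn new() -> Self {
        Self(AtomicUsize::new(0))
    }

    // Tries to register a shared borrow. Never blocks; returns false if the
    // flag is exclusively borrowed (or the counter would overflow).
    pub fn try_borrow(&self) -> bool {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            if current >= EXCLUSIVE - 1 {
                return false;
            }
            match self.0.compare_exchange_weak(
                current,
                current + 1,
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(observed) => current = observed,
            }
        }
    }

    // Tries to register an exclusive borrow. Succeeds only when unborrowed.
    pub fn try_borrow_mut(&self) -> bool {
        self.0
            .compare_exchange(0, EXCLUSIVE, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    // Must be called exactly once per successful try_borrow.
    pub fn release_shared(&self) {
        self.0.fetch_sub(1, Ordering::Release);
    }

    // Must be called exactly once per successful try_borrow_mut.
    pub fn release_exclusive(&self) {
        self.0.store(0, Ordering::Release);
    }
}
```

A conflicting borrow shows up as a `false` return (which a wrapper could turn into an error or panic) rather than as a thread waiting on a lock, so there is nothing to deadlock on.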

@mejrs
Member

mejrs commented Jan 20, 2023

The soundness issues are also a major concern. One which springs to mind now is GILOnceCell. I think it's good enough if we keep the PyO3 APIs externally equivalent for now and just have internal adjustments to ensure sound behaviour (e.g. maybe we can use once_cell as an internal substitute).

I'm generally not a fan of adding more conditional compilation, but one way to fix that is to just hide it behind #[cfg(not(NoGil))]

Eventually perhaps for a nogil feature we wouldn't offer mutable pyclasses at all.

That's the direction I'd take - rather than implement something ourselves that might be rather implicit and error prone, users should implement their own interior mutability if they need it.

@adamreichold
Member

rather than implement something ourselves that might be rather implicit and error prone, users should implement their own interior mutability if they need it.

As interacting with Python mandates shared ownership, I don't think it is reasonable to push this out to our users completely. We should at least provide a tool box of well tested generally used solutions.

If some code needs a special mechanism, they can use a frozen pyclass and add their own layer of interior mutability as you suggest. But I still think we need to provide something that works out of the box.

@davidhewitt
Member

On the other hand, we do not need full locks here as we do not need to provide blocking behaviour at all.

I considered this and am split. It would avoid deadlocks but I think with high concurrency it would be easy for users to hit racy errors and end up wanting to have some level of blocking to wait until the object can be written to.

If some code needs a special mechanism, they can use a frozen pyclass and add their own layer of interior mutability as you suggest. But I still think we need to provide something that works out of the box.

I agree that &mut self for pymethods is extremely convenient. Despite this my personal preference is to have frozen = true as the default. We could dedicate a section of the guide to opting in to mutability at the object level and the trade-offs. This would also have the effect of encouraging users to consider simpler synchronization options such as atomic datatypes or fine grained locks where the potential for deadlocking is more obvious to them and debuggable in their own code rather than PyO3 internals.

@adamreichold
Member

I agree that &mut self for pymethods is extremely convenient. Despite this my personal preference is to have frozen = true as the default.

I am not trying to argue for not changing the default. I am arguing for shipping and documenting something like PyCell so that users do not need to invent their own all the time.

It would avoid deadlocks but I think with high concurrency it would be easy for users to hit racy errors and end up wanting to have some level of blocking to wait until the object can be written to.

If frozen = true is the only approach we support without explicit additional layers like PyCell, then this could be just one more option for such a type, e.g. PyGILCell (uses the GIL if present or GIL emulation otherwise), PyAtomicCell (uses atomic borrow flags), PyMutexCell (uses a single mutex), etc.

#[inline]
#[cfg(any(py_sys_config = "Py_REF_DEBUG", Py_NOGIL))]
pub unsafe fn Py_INCREF(op: *mut PyObject) {
    Py_IncRef(op)
Member

@colesbury - I understood that for the CPython stable API there was a discussion about making refcounting details internal to CPython but it was deemed infeasible due to performance regression.

Here this patch seems to do exactly that. Are you able to comment on the estimated performance impact for extensions by changing refcounting to be through an FFI call?

Author

I think I have heard something similar, but I can't find the original source and don't know for which extensions it was deemed too great a cost.

There didn't seem to be a noticeable performance impact for the cryptography extension, but I don't know enough to make any general estimates.

@colesbury
Author

Can someone (like @adamreichold or @davidhewitt) explain how PyCell works compared to say RefCell? I've read through the code a few times, but don't have enough Rust knowledge to fully understand it. Some things I'm wondering about:

  • Which parts are statically checked?
  • Which parts are dynamically checked (where are those implemented)? (RefCell says it has "dynamically checked borrow rules")

@adamreichold
Member

adamreichold commented Jan 27, 2023

explain how PyCell works compared to say RefCell?

I'd say a good summary would be that PyCell is exactly like RefCell, but to get at a &'py PyCell<T> shared reference into the Python heap, you need to prove that you hold the GIL via a Python<'py> token. (If you just have a PyCell<T> lying around, e.g. on the stack, this difference does not matter either. Being called from Python, we start with a reference-counted Py<PyCell<T>> in PyO3 parlance instead.)

This is complicated somewhat by the borrow checking being pluggable to avoid its overhead for immutable types shared via the Python heap. You can find the various implementations, including code that looks exactly like RefCell without the value field, here: https://github.com/PyO3/pyo3/blob/main/src/pycell/impl_.rs

@adamreichold
Member

but to get at a &'py PyCell shared reference into the Python heap, you need to prove that you hold the GIL via a Python<'py> token.

Maybe this says it better: since in Python basically all objects are always shared, we can only produce shared references, and if your Rust type is fine with that (because it is immutable or already uses interior mutability), then you do not need the borrow checking.

(PyCell<T> does double duty though, it plugs in the borrow checking if required and it provides the Python object header. We could split those concerns into e.g. PyObjectHeader<RefCell<T>> and have user code call .borrow() etc. explicitly or handle going from &RefCell to RefMut via FromPyObject as discussed upthread.)

@davidhewitt
Member

The basic principle is that Rust's references, by definition, for a given piece of data allow for at most one of the following to exist at a time:

  • Any number of shared references &
  • A single unique reference &mut

(To violate this rule is UB.)

The purpose of Rust's RefCell is to allow this to be satisfied at runtime rather than by the Rust compiler's "borrow checker". To do this it has internal state which either denotes it's uniquely borrowed or otherwise records the number of shared borrows currently accessing it. This internal state is not thread-safe, so RefCell can only be used by a single thread.
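The runtime rule can be observed directly with the standard library's `RefCell`. The function name `refcell_rules` is made up for this sketch; the calls themselves (`borrow`, `try_borrow`, `try_borrow_mut`) are the real std API.

```rust
use std::cell::RefCell;

// Returns (unique borrow allowed while shared borrows exist,
//          shared borrow allowed while a unique borrow exists,
//          final length is four).
pub fn refcell_rules() -> (bool, bool, bool) {
    let cell = RefCell::new(vec![1, 2, 3]);

    // Any number of shared borrows may coexist.
    let first = cell.borrow();
    let second = cell.borrow();
    // A unique borrow is refused at runtime while shared borrows are alive.
    let unique_while_shared = cell.try_borrow_mut().is_ok();
    drop(first);
    drop(second);

    // With no outstanding borrows, a unique borrow succeeds.
    let mut unique = cell.try_borrow_mut().expect("no other borrows are alive");
    unique.push(4);
    // While it lives, even a shared borrow is refused.
    let shared_while_unique = cell.try_borrow().is_ok();
    drop(unique);

    let final_len_is_four = cell.borrow().len() == 4;
    (unique_while_shared, shared_while_unique, final_len_is_four)
}
```

The borrow counter behind these checks is a plain, non-atomic cell, which is why `RefCell` is restricted to a single thread.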

PyCell is very much the same design, with the difference being that we modify the internal state under the protection of the GIL, which makes it thread-safe. Removing the GIL would break this model, hence why we're discussing how we get some kind of locking semantics back which affect users least.

The PyCell source code is somewhat convoluted because it is designed to support Python objects including inheritance. At some point I'm likely to propose a rework which separates the object hierarchy into a PyClassObject, leaving the PyCell bit to be just about the synchronisation.

@adamreichold
Member

PyCell is very much the same design, with the difference being that we modify the internal state under the protection of the GIL, which makes it thread-safe. Removing the GIL would break this model, hence why we're discussing how we get some kind of locking semantics back which affect users least.

I would like to amend this: How this internal state is modified is exactly the same as for RefCell, i.e. PyCell: !Sync just like RefCell: !Sync. It is getting access to a shared reference &'py PyCell in the first place that is protected by the GIL, i.e. only one thread at a time is allowed to produce references (in contrast to raw pointers) into the Python heap. So if you have ownership or otherwise soundly got hold of a &'a PyCell, then the GIL does not enter the picture at all but everything is still safe due to PyCell: !Sync and hence &'a PyCell: !Send.

@colesbury
Author

Thank you both for the explanations. If I understand correctly:

  1. Every #[pyclass] annotated struct supports PyCell. The #[pyclass] attribute adds the borrow checking bits, which is essentially a Rust Cell<BorrowFlag> (except for immutable classes, which do no runtime borrow checking).
  2. The borrow flags can be modified by multiple threads. That's safe with the GIL, but would not be safe without it.

@davidhewitt mentioned the possibility of using RwLock. It seems like you could maintain the PyCell borrow checking invariants with something like an AtomicUsize (which I think is equivalent to only using the try_read/try_write functions on RwLock). That would avoid the possibility of deadlocks, but increase the potential for runtime panics (in cases where the GIL ensured only one borrowed reference was attempted at a time).

Is my understanding correct?

@colesbury
Author

I'm re-reading the thread now that I better understand PyCell and see that the above is among the things you are considering.

@adamreichold
Member

Is my understanding correct?

I think there is one issue here, namely

The borrow flags can be modified by multiple threads. That's safe with the GIL, but would not be safe without it.

PyCell: !Sync, so it cannot be shared between threads and hence its implementation does not need to care about internal synchronization. I would really like to stress that

It is getting access to a shared reference &'py PyCell in the first place that is protected by the GIL, i.e. only one thread at a time is allowed to produce references (in contrast to raw pointers) into the Python heap.

because just producing a &'a T means that the referenced value must stay valid for as long as that reference exists. And generally &'a T: Copy, meaning that shared references can be copied freely, so we really need to protect a scope of code using these references, which is why it is so convenient to bind the lifetimes to holding the GIL.

The above is also independent of PyCell. We need the same guarantees to access &'py PyDict, i.e. the referenced dictionary must stay alive for at least as long as the reference exists. And this is why our borrowed references, which ATM do not bump the reference count for every access, rely on the GIL to prevent concurrent modifications of even the reference count.

So I think that before discussing how to enable interior mutability in a nogil world - "the PyCell problem" - we need to discuss how to enable shared references to objects stored in the Python heap in the first place. If we have that, interior mutability could be solved by the usual techniques like mutexes, atomics, etc. and is mainly a question of ergonomics.

To enable us to produce a shared reference into the Python heap, I think we need to ensure that the current thread owns at least one globally visible reference to the object in question. (I suspect this is usually fulfilled automatically as it is currently hard to pass a Python object to Rust code without elevating its reference count, but I guess the nogil fork might change the globally visible part of the above requirement.)

Furthermore, we probably need to enforce a Sync bound, i.e. thread safety/internal synchronization, for all objects stored in the Python heap and accessed via Rust references. Either by just assuming/mandating this (which I think is unrealistic due to the large existing ecosystem assuming presence of the GIL), or by having a bit in the object header indicating it and refusing to produce a shared reference otherwise. (Objects implemented in Rust would then have to use suitable interior mutability techniques so that the compiler will certify them as Sync whereas native Python types will have to be implemented with internal synchronization which I think is already part of the nogil fork.)

Alternatively, if we do not want to enforce Sync in a nogil world, I think we would need a facility to lock/pin an object stored in the Python heap to the current thread, e.g. using hazard pointers or a parking lot for Python objects, so that we can continue to be sure to be the only thread which produces even a shared reference to a !Sync type.

These options are also not really exclusive: We could have a bit in the object header indicating thread safety and enforce the locking/pinning procedure only for types which do not have the sync bit set. In this approach, one might even get away with a global lock under the assumption that it is only a fallback and all performance-critical Python objects will enable the sync bit eventually.

@adamreichold
Member

It should be true for all the CPython types for which PyO3 provides safe API wrappers.

I think the problem here is PyAny which covers pure Python code using e.g. xml.etree.ElementTree internally. But if internal synchronization everywhere is the goal, then pure Python code that does not use anything not provided by CPython should be safe eventually.

@colesbury
Author

I finally understand one of the soundness issues with the atomic borrowing approach. The PyCounter example @mejrs linked to helped illustrate the issue. Thanks everyone on this PR for your patience in explaining things to me.

#[pyclass(name = "Counter")]
pub struct PyCounter {
    // Keeps track of how many calls have gone through.
    //
    // See the discussion at the end for why `Cell` is used.
    count: Cell<u64>,
...
}

You can have multiple shared &PyCounter references (in multiple threads). Absent the mutual exclusion provided by the GIL, you'd have racy updates to PyCounter.count. Atomic borrow flags aren't sufficient here because they only ensure unique mutable references, but here we have immutable &PyCounter references that still allow updating internals.

I don't think the frozen attribute addresses this, since the above PyCounter could be marked as frozen but still have the same hazards.

@adamreichold wrote:

I could not find API documentation on "Python critical section", could you post a link? Is this term specific to the PEP?

Yes, it's from the PEP: Python critical sections. (I linked to it in the previous comment, but not in the most obvious place.)

... wouldn't it still add the possibility of deadlock if e.g. thread A borrows first x then y whereas thread B borrows first y then x?

No deadlock, but I'm not sure if/how you'd model the behavior in Rust to ensure soundness. In the x/y + y/x case, one thread would temporarily release the held locks when it blocks on the lock acquisition, allowing the other thread to proceed.

The limitation is that you can only really operate on one (or in some special cases two) Python objects at a time. If you borrow x then y, then acquiring the lock for y may temporarily release the lock for x until the y reference is dropped. Although this works well for CPython internals where the GIL assumptions are only held up to module/object boundaries, it's not clear to me if this would work well in PyO3 where there may be more variability in code structure.

I've come around to the idea that the default behavior for "nogil" PyO3 should be a global static lock to emulate the GIL. I think the lock should be implemented by the above mentioned Python critical sections API.

Some things I'm still wondering about:

  • Practically, how many #[pyclass] structs are Sync or could reasonably implement Sync?
  • If all #[pyclass] structs are Sync in an extension, is lighter synchronization possible?
  • If only some #[pyclass] structs are Sync in an extension, is lighter synchronization possible?

Thanks again everyone for your patience in explaining Rust and PyO3 concepts to me.

@mejrs
Member

mejrs commented Jan 30, 2023

Practically, how many #[pyclass] structs are Sync or could reasonably implement Sync?

Most of mine are essentially "immutable data classes". These could be used as-is without any issues. Most Rust types are Send + Sync on their own. These traits are only really relevant when interior mutability is involved.

I don't think the frozen attribute addresses this, since the above PyCounter could be marked as frozen but still have the same hazards.

It doesn't on its own, but it does if we enforce that pyclasses are Send + Sync + frozen. For the implementation of PyCounter that would mean switching to an AtomicU64.
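As a sketch of that switch in plain Rust (no PyO3 attributes, since the point is only the field type): with an `AtomicU64` the struct is `Send + Sync`, and updates through shared `&Counter` references from several threads are well defined. `Counter`, `record_call` and `hammer` are illustrative names, not taken from the PyCounter example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Plain-Rust stand-in for the counter; the Cell<u64> field becomes AtomicU64.
pub struct Counter {
    count: AtomicU64,
}

impl Counter {
    pub fn new() -> Self {
        Self { count: AtomicU64::new(0) }
    }

    // Takes `&self`, like a method on a frozen class would, and returns the
    // number of calls recorded so far including this one.
    pub fn record_call(&self) -> u64 {
        self.count.fetch_add(1, Ordering::Relaxed) + 1
    }

    pub fn calls(&self) -> u64 {
        self.count.load(Ordering::Relaxed)
    }
}

// Compiles only for thread-safe types; `Cell<u64>` inside Counter would fail.
fn assert_send_sync<T: Send + Sync>() {}

pub fn hammer(threads: u64, per_thread: u64) -> u64 {
    assert_send_sync::<Counter>();
    let counter = Arc::new(Counter::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    counter.record_call();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    counter.calls()
}
```

With the original `Cell<u64>` field the `assert_send_sync` call would be rejected at compile time, which is the kind of error a `Sync` bound on pyclasses would surface.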

This is my preferred implementation, for a couple of reasons:

  • We don't actually need to implement synchronization ourselves, so this is the most "obviously correct" and easiest implementation
  • violations of these constraints are (or can be) relatively straightforward compilation errors, rather than runtime bugs/deadlocks.
  • it encourages a programming style focused on immutability, which is imo a good pattern in general
  • because we won't be needlessly borrow checking (like in the case for my immutable data classes), i think this will result in faster user libraries (this is also why I'd like to eventually make pyclasses frozen by default, and let users opt into mutating with a mutable attribute)
  • it leaves any synchronization up to the user, if they do need interior mutability. Because we don't need to hold the lock over the function/method for them, this results in smaller critical sections resulting in faster and less error prone programs.
  • any rust dependencies the user might be using are already threadsafe, so we don't need to worry about that

@mejrs
Member

mejrs commented Jan 30, 2023

If you want to see the fallout of this, you could try making some more changes:

You need to change PyClassPyO3Options's frozen field to always be Some

class_kind: kind,
options: PyClassPyO3Options::parse(input)?,
deprecations: Deprecations::new(),

and put a Sync bound on

pub struct ThreadCheckerStub<T: Send>(PhantomData<T>);

Note that this isn't enough to ensure soundness, but you should be able to see how "reasonable" this would be to users.
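
A standalone sketch of what the extra bound does at compile time may help here. The stub below copies the quoted declaration and adds `Sync`; `Plain`, `Guarded`, `Racy` and `stub_size` are hypothetical names for illustration only, not PyO3 items.

```rust
use std::marker::PhantomData;
use std::sync::Mutex;

// Same shape as the stub quoted above, with the suggested `Sync` bound added.
pub struct ThreadCheckerStub<T: Send + Sync>(PhantomData<T>);

impl<T: Send + Sync> ThreadCheckerStub<T> {
    pub fn new() -> Self {
        ThreadCheckerStub(PhantomData)
    }
}

// Hypothetical user types.
pub struct Plain {
    pub value: u64, // plain data: `Send + Sync` automatically
}

pub struct Guarded {
    pub value: Mutex<u64>, // interior mutability behind a lock: still `Sync`
}

// A type like the following would make `ThreadCheckerStub<Racy>` fail to
// compile, because `Cell<u64>` is not `Sync`:
// pub struct Racy { value: std::cell::Cell<u64> }

/// Instantiates the stub for `T`; this only compiles for `Send + Sync` types.
pub fn stub_size<T: Send + Sync>() -> usize {
    let _stub = ThreadCheckerStub::<T>::new();
    std::mem::size_of::<ThreadCheckerStub<T>>()
}
```

Types built from ordinary owned data pass unchanged, which is the "how reasonable would this be" question the experiment is meant to answer; only `Cell`, `RefCell`, `Rc` and raw-pointer fields would need attention.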

@adamreichold
Member

No deadlock, but I'm not sure if/how you'd model the behavior in Rust to ensure soundness. In the x/y + y/x case, one thread would temporarily release the held locks when it blocks on the lock acquisition allowing the other thread to proceed.

I suspect that would involve reference wrappers like RefMut<'a, T> or our own PyRefMut<'a, T> which morally are a "&'a mut", but produce only short-lived &'b mut T via the DerefMut trait, so that we have full control over when/how the reference is used and what happens when it isn't.
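
To make the reference-wrapper idea concrete, here is a minimal sketch using only the standard library. `Tracked` is a hypothetical type, not part of PyO3: it owns a long-lived `RefMut` but hands out `&mut T` only through `DerefMut`, so the wrapper observes every mutable access. A real implementation would presumably release and re-acquire a lock at that point; this sketch merely counts accesses.

```rust
use std::cell::{RefCell, RefMut};
use std::ops::{Deref, DerefMut};

/// Hypothetical wrapper: morally a `&'a mut T`, but each `&mut T` it yields
/// is short-lived and goes through `deref_mut`.
pub struct Tracked<'a, T> {
    inner: RefMut<'a, T>,
    pub mutable_accesses: usize,
}

impl<'a, T> Tracked<'a, T> {
    pub fn new(inner: RefMut<'a, T>) -> Self {
        Tracked { inner, mutable_accesses: 0 }
    }
}

impl<'a, T> Deref for Tracked<'a, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &*self.inner
    }
}

impl<'a, T> DerefMut for Tracked<'a, T> {
    fn deref_mut(&mut self) -> &mut T {
        // Hook point: the wrapper controls what happens on every mutable access.
        self.mutable_accesses += 1;
        &mut *self.inner
    }
}

/// Pushes twice through the wrapper and reports how often `deref_mut` ran.
pub fn demo() -> (usize, Vec<i32>) {
    let cell = RefCell::new(Vec::<i32>::new());
    let accesses = {
        let mut wrapped = Tracked::new(cell.borrow_mut());
        wrapped.push(1);
        wrapped.push(2);
        wrapped.mutable_accesses
    };
    (accesses, cell.into_inner())
}
```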

Practically, how many #[pyclass] structs are Sync or could reasonably implement Sync?

It doesn't on its own, but it does if we enforce that pyclasses are Send + Sync + frozen.

I think there are two directions in which a Sync bound can be applied.

One is for implementing #[pyclass]es and I agree that just having their implementations fulfil the above trait is the most reasonable approach. The Rust compiler will statically enforce this and while it may not be trivial to make implementations thread-safe, adding this trait bound will either make "blind" recompilation against a nogil PyO3 fail or verifiably correct due to static checking.

The other is for controlling what objects on the Python heap are accessed by shared references, i.e. basically whether Py<T> gets a T: Sync bound as well. If I understand things correctly, nogil is basically about adding that bound so that any PyAny can be accessed from multiple threads, e.g. from the thread pool of a Rust extension using Rayon. IMHO, the main question is whether this is a reasonable assumption to make for the whole ecosystem (of things built against the nogil fork).

As written above, personally I do not think so without reifying that property in the Python object header, i.e. having objects opt into being sync. Otherwise I fear for --without-gil becoming the next -ffast-math, i.e. a flag to speed up scientific code with little to no investigation into the consequences for the reliability of the results.[1]

I also think this ties in directly with questions like

If all #[pyclass] structs are Sync in an extension, is lighter synchronization possible?
If only some #[pyclass] structs are Sync in an extension, is lighter synchronization possible?

because I don't think this can be answered by just considering the #[pyclass] of a single extension. Mainly, because all shared references into the Python heap participate in that decision. I think we could elide the global (or per-object) lock as long as all references produced refer to objects which are thread-safe, i.e. Sync if they are #[pyclass]es but fulfilling a similar notion if they are implemented differently. (So pure Python code would always have this property when all native types it uses have it. But native extensions which are blindly rebuilt do not have it until their code is audited for thread-safety and they set the sync bit.)
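
The lock-elision rule described here can be sketched in a few lines. Everything below is hypothetical: neither CPython nor the nogil fork has a `sync_bit` in the object header, and `ObjectHeader`, `FALLBACK_LOCK` and `enter` are names invented for this illustration.

```rust
use std::sync::{Mutex, MutexGuard};

/// Hypothetical object header carrying the opt-in "sync bit".
pub struct ObjectHeader {
    pub sync_bit: bool,
}

/// Hypothetical process-wide fallback lock, standing in for a GIL-like lock.
static FALLBACK_LOCK: Mutex<()> = Mutex::new(());

/// Returns `None` when every object involved has opted into thread safety,
/// otherwise acquires the process-wide lock for the duration of the access.
pub fn enter(objects: &[&ObjectHeader]) -> Option<MutexGuard<'static, ()>> {
    if objects.iter().all(|object| object.sync_bit) {
        None
    } else {
        // A poisoned lock is still usable here because it guards no data.
        Some(FALLBACK_LOCK.lock().unwrap_or_else(|poisoned| poisoned.into_inner()))
    }
}
```

Under this scheme a blindly rebuilt extension keeps today's serialized behavior, and only audited types take the lock-free path.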

Of course, for me the converse position also holds, i.e. without something like a sync bit in the object header but using a global lock for PyO3, I would not produce shared references to anything stored in the Python heap but the #[pyclass]es I know are thread-safe. Because another native extension not participating in that locking could still concurrently operate on non-thread-safe objects in another thread. Most notably, due to the lack of stable ABI and how static globals work, this would include other extensions written using PyO3 as we do not "know" its #[pyclass]es and it would have its own instance of the global lock.

Footnotes

  1. I know of a German research institution doing a lot of policy support that ships all code for its HPC cluster using -O3 -ffast-math. No subnormal floating point number was ever seen on the premises...

@adamreichold
Member

adamreichold commented Jan 30, 2023

Maybe just for the record: I would support adding a Send + Sync bound for #[pyclass]es and making PyCell an optional layer (using e.g. atomics to fulfil the aforementioned requirement) on PyO3's main branch, gil or nogil, for all the reasons @mejrs mentions and if only to start moving the ecosystem in that direction. (Hoping that the performance improvement from not using PyCell outweighs the performance loss of using atomics instead of plain integers.)

@colesbury
Author

@adamreichold wrote:

IMHO, the main question is whether this is a reasonable assumption to make for the whole ecosystem (of things built against the nogil fork). As written above, personally I do not think so without reifying that property in the Python object header, i.e. having objects opt into being sync.

I'm not sure I understand your position. PyO3 already makes thread safety assumptions about PyAny that are not strictly true. If I understand correctly, PyAny objects can be accessed from multiple threads as long as they are within Python::with_gil (or otherwise have a Python type token.) This mostly works, but is not strictly safe. For many C extensions, there are possibilities for crashes and other undefined behavior if you access the same object from multiple threads because there are plenty of places where the GIL can be temporarily released even if the extension itself never releases the GIL.

For example:

  • Even with the GIL, in PyTorch it's unsafe to resize a torch.Tensor from multiple threads.
  • The same is true for NumPy's ndarray although it's a bit harder to trigger because of the default reference count checking. However, the reference count checking is not robust and it's still possible even with the GIL and with reference count checking.
  • The CPython classes tend to be a bit more robust due to widespread use, but similar issues still exist even in "core" types. For example, del list[:] called concurrently from multiple threads (on the same list) can crash (with the GIL) if you are particularly unlucky.

@birkenfeld
Member

Even with the GIL, in PyTorch it's unsafe to resize a torch.Tensor from multiple threads.
The same is true for NumPy's ndarray although it's a bit harder to trigger because of the default reference count checking. However, the reference count checking is not robust and it's still possible even with the GIL and with reference count checking.

From a Rust PoV, these would be considered memory safety bugs, not something that we should design our API around.

The CPython classes tend to be a bit more robust due to widespread use, but similar issues still exist even in "core" types. For example, del list[:] called concurrently from multiple threads (on the same list) can crash (with the GIL) if you are particularly unlucky.

Really? What's the mechanism for crashing here, since DELETE_SUBSCR is only a single bytecode? Insufficient memory barriers?

@mejrs
Member

mejrs commented Jan 31, 2023

PyO3 already makes thread safety assumptions about PyAny that are not strictly true. If I understand correctly, PyAny objects can be accessed from multiple threads as long as they are within Python::with_gil (or otherwise have a Python type token.)

Can you clarify? Pyo3 does not let users do this.

@adamreichold
Member

adamreichold commented Jan 31, 2023

I'm not sure I understand your position. PyO3 already makes thread safety assumptions about PyAny that are not strictly true. If I understand correctly, PyAny objects can be accessed from multiple threads as long as they are within Python::with_gil (or otherwise have a Python type token.) This mostly works, but is not strictly safe. For many C extensions, there are possibilities for crashes and other undefined behavior if you access the same object from multiple threads because there are plenty of places where the GIL can be temporarily released even if the extension itself never releases the GIL.

Indeed, not everything is fully safe yet, but that does not appear to me to be a good argument for making things worse, and I agree with @birkenfeld's

From a Rust PoV, these would be considered memory safety bugs, not something that we should design our API around.

Or if Python had something like unsafe then these PyTorch and NumPy operations would be tagged that way to indicate their additional preconditions. And ideally, we as parts of the wider Python ecosystem would find ways to make these functions safe.

For example, CPython has made significant progress in that direction in the past, e.g. when GIL is released, then

But indeed especially NumPy is still a big problem as it releases the GIL when operating on primitives without any other synchronization so our own dynamic borrow checking currently considers even pure Python code operating on NumPy arrays unsafe/trusted as it is able to produce data races without using any native code (besides NumPy itself). But as @birkenfeld said this is not something we should aim for when designing future contracts and interfaces. Ideally, we will get a cross-language protocol for borrowing the interior of large buffers like NumPy arrays and things like ndarray.resize will become safe (rust-numpy already exclusively borrows an array before allowing a safe call to resize) without adding undue overhead.

I also do understand that the nogil fork is mainly about being able to use all that existing Python software in a more scalable manner and I am not opposed to that. But I would prefer if we do not do it blindly and try to include more safety features into the approach so that e.g. the scientists using our code do not need to become experts on numerical stability or memory orderings to determine if their usage is correct. Someone said that Rust feels like doing parkour while hanging on strings and wearing protective gear and I think that this is indeed the goal we should strive for when designing our programming interfaces.

@colesbury
Author

The point of these examples is in response to:

The other is for controlling what objects on the Python heap are accessed by shared references, i.e. basically whether Py<T> gets a T: Sync bound as well. If I understand things correctly, nogil is basically about adding that bound so that any PyAny can be accessed from multiple threads, e.g. from the thread pool of a Rust extension using Rayon. IMHO, the main question is whether this is a reasonable assumption to make for the whole ecosystem (of things built against the nogil fork).

Basically, I think Py<T> should get a T: Sync bound. I think if a project is built for nogil, it's reasonable to assume it's thread safe so far as it's safe to assume these things are thread-safe today.

I think it's good and practical to put extra constraints on projects built with PyO3 (like the Sync/Send/frozen constraints discussed previously) because these can actually be enforced by the Rust type checker, but I don't think it's reasonable to try to put extra constraints on projects not using PyO3.

@birkenfeld wrote:

From a Rust PoV, these would be considered memory safety bugs, not something that we should design our API around.

I agree that PyO3 should not design its API around this. You can consider these "bugs" in the projects, but that's mostly a label of convenience. (OTOH, the del list[:] example is truly a bug upstream.)

Really? What's the mechanism for crashing here, since DELETE_SUBSCR is only a single bytecode? Insufficient memory barriers?

Nearly all the GIL thread-safety issues have the same form: you assume some invariant which is broken because some Python API call releases the GIL. The list issues are often of the form:

  1. Get the size of the list (or compute the size of the slice you are reading/deleting)
  2. Allocate a new list or Py_DECREF some obj (may release the GIL)
  3. Use the size of the list from step 1 (!!)

The general problem is that a huge number of Python API calls can potentially release the GIL. The most obvious causes are Py_DECREF, because destructors can call arbitrary code, and allocations, because they may trigger a GC, which can call arbitrary code. These issues are generally both re-entrancy and thread-safety hazards. This is essentially the same issue that can lead to GILOnceCell initialization being called multiple times, although in that case it's documented behavior, while in the list implementation it is clearly a bug.
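
The three-step pattern can be reproduced in safe, single-threaded Rust, which also shows why it is a re-entrancy hazard and not only a threading one. This is a sketch under stated assumptions: `release_item` is a hypothetical stand-in for a call such as Py_DECREF that may run arbitrary code, and a `RefCell<Vec<i32>>` stands in for the list.

```rust
use std::cell::RefCell;

/// Hypothetical stand-in for an API call that can run arbitrary code
/// (a destructor, a GC pass, or another thread once the GIL is released).
fn release_item(callback: impl FnOnce()) {
    callback();
}

/// Returns (cached length, actual length, checked read at the cached index).
pub fn stale_len_demo() -> (usize, usize, Option<i32>) {
    let list = RefCell::new(vec![10, 20, 30, 40]);

    // Step 1: read the size.
    let cached_len = list.borrow().len();

    // Step 2: call something that runs arbitrary code, which shrinks the list.
    release_item(|| list.borrow_mut().truncate(1));

    // Step 3: the cached size no longer matches. C code indexing with it would
    // read out of bounds; a checked access returns `None` instead.
    let actual_len = list.borrow().len();
    let last = list.borrow().get(cached_len - 1).copied();
    (cached_len, actual_len, last)
}
```

The usual remedy is to re-read the size after any call that may run arbitrary code, or to hold a strong reference or lock across the whole operation.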

The use of a single bytecode is a red herring. The implementations for most bytecodes are complex enough that they may end up releasing the GIL at some intermediate point.

@mejrs wrote:

Can you clarify? Pyo3 does not let users do this.

Here is the example I linked to in a previous comment. The Python environment is inherently shared so you can get the same &PyAny references in multiple threads. The lifetime of these references can overlap because the GIL is not a simple mutex; many Python API calls may implicitly temporarily release the GIL allowing threads to run in an interleaved manner.

https://gist.github.com/colesbury/b01e645546114977bff8d7babbb05f29

@adamreichold wrote:

Or if Python had something like unsafe then these PyTorch and NumPy operations would be tagged that way to indicate their additional preconditions...

I know rust-numpy provides safe wrappers, but you can also call all these operations via PyAny and I don't think you'd want to label the basic methods on PyAny as unsafe.

I also do understand that the nogil fork is mainly about being able to use all that existing Python software in a more scalable manner and I am not opposed to that. But I would prefer if we do not do it blindly and try to include more safety features into the approach...

I generally agree, but I don't think this is a blind approach any more than today's approach is blind.

@birkenfeld
Member

Allocate a new list or Py_DECREF some obj (may release the GIL)

Of course, you're right. It's a can of worms (or maybe, little snakelets).

@adamreichold
Member

adamreichold commented Jan 31, 2023

but I don't think it's reasonable to try to put extra constraints on projects not using PyO3.

Maybe to clarify my intent: I am not arguing for reifying thread safety and for the whole ecosystem to bend over backwards to protect the purity of our precious bodily fluids trait bounds. I think falling back to a process-wide lock when objects are in use that have not opted into thread safety would be a good general implementation of PyGILState_Ensure in a nogil world, not just for PyO3.

I am just not sure typical usage of the CPython API from C has enough control over references into the Python heap to enforce this, or rather to exploit it if all involved objects are indeed thread safe. I am also not sure if a flag in the object header is a sensible mechanism to achieve this. I am convinced that native code should explicitly opt into the increase in parallelism though.

I know rust-numpy provides safe wrappers, but you can also call all these operations via PyAny and I don't think you'd want to label the basic methods on PyAny as unsafe.

As written above this is really a sore spot and the reason even pure Python code using NumPy is considered unsafe/trusted. It is also a reason why we put our borrow checking into a C API compatible capsule in the hope that it might see more widespread use outside of rust-numpy.

But again, I don't think the main problem is rust-numpy's safe API being unsound strictly speaking but rather that plain Python code can produce data races without any safeguards in the first place and do so as easily as performing numerical operations like += on NumPy arrays.

...

Switching gears, I suspect we should try to refocus the discussion away from the sync bit thing. We explained our positions and I think it is alright if we do not agree for now. Especially since we have already identified multiple things in PyO3 that we do need to work on:

  • At least for nogil, PyCell<T> should enforce T: Send + Sync and use thread safe borrow flags (or even locks) or we need to emulate the GIL using a global lock. Ideally, we also decouple PyClass and PyCell and start moving towards frozen-by-default.
  • We should prototype how usage of the PEP's critical sections would look and whether reference wrappers would be sufficient to model temporarily releasing borrows. (I suspect we might actually need something like GhostToken from the ghost-cell crate.)
  • For the Python token, we need to separate its two concerns of binding lifetimes of borrows into the Python heap and being a type-level proof that a process-wide lock is held (which is currently relied on by downstream unsafe code). Probably by adding API that will fail to compile for nogil or maybe even fall back to doing the necessary locking to provide the same guarantees.
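
For the first item, a thread-safe borrow flag could be sketched as below. This is not PyO3's implementation; `BorrowFlag` and its methods are hypothetical names, and the flag only tracks state (zero means unborrowed, `usize::MAX` means exclusively borrowed, anything else counts shared borrows) without owning any data.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EXCLUSIVE: usize = usize::MAX;

/// Hypothetical atomic replacement for a `Cell`-based borrow flag.
pub struct BorrowFlag(AtomicUsize);

impl BorrowFlag {
    pub const fn new() -> Self {
        BorrowFlag(AtomicUsize::new(0))
    }

    /// Tries to register one more shared borrow.
    pub fn try_borrow(&self) -> bool {
        let mut current = self.0.load(Ordering::Relaxed);
        loop {
            // Refuse if exclusively borrowed, or if the count would collide
            // with the `EXCLUSIVE` marker.
            if current >= EXCLUSIVE - 1 {
                return false;
            }
            match self.0.compare_exchange_weak(
                current,
                current + 1,
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return true,
                Err(actual) => current = actual,
            }
        }
    }

    pub fn release_borrow(&self) {
        self.0.fetch_sub(1, Ordering::Release);
    }

    /// Tries to take the exclusive borrow; succeeds only from the unborrowed state.
    pub fn try_borrow_mut(&self) -> bool {
        self.0
            .compare_exchange(0, EXCLUSIVE, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn release_borrow_mut(&self) {
        self.0.store(0, Ordering::Release);
    }
}
```

A failed `try_borrow_mut` corresponds to the "Already borrowed" style of runtime error; with threads involved, such failures can occur under contention even in code that never failed under the GIL.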

@mejrs
Member

mejrs commented Jan 31, 2023

Here is the example I linked to in a previous comment. The Python environment is inherently shared so you can get the same &PyAny references in multiple threads. The lifetime of these references can overlap because the GIL is not a simple mutex; many Python API calls may implicitly temporarily release the GIL allowing threads to run in an interleaved manner.

https://gist.github.com/colesbury/b01e645546114977bff8d7babbb05f29

This is totally fine though, this &PyAny cannot be used while the gil is not held.

@colesbury
Author

@adamreichold wrote:

Especially since we have already identified multiple things in PyO3 that we do need to work on...

Great -- I'll try to prototype these things, but probably won't get to it immediately. (Of course, if anyone else is interested in working on those things and has the time, that would be great too.)

@adamreichold
Member

adamreichold commented Jan 31, 2023

Great -- I'll try to prototype these things, but probably won't get to it immediately.

I would advise waiting for a commitment from @davidhewitt before investing significant amounts of your time into these things though for he is the final authority on the direction of PyO3's development and I am not sure if he has had time to read through and comment on this thread yet.

@mejrs
Member

mejrs commented Jan 31, 2023

I would also recommend you wait for davidhewitt's thoughts before putting in a lot of effort, but I'd like to think we all share authority regarding direction here.

@davidhewitt
Member

Sorry for some delay from me - I've been reading the discussion and thinking a bit about this but have been AWOL for a couple days with a sick family. Better again for now.

The discussion above sounds good to me. To re-spell a couple of the above in my own thoughts:

  • Breaking API changes for nogil. Let's try to have a few of these but keep them not too horrendous, so that users are forced to consider thread-safety to some degree without going over the top.
    • I think for now it may be good to enforce PyClass: Send + Sync only when building for nogil?
    • I think we won't be able to remove the Python token with nogil as making Python C-API calls still requires attachment to a Python thread state. We could play with nogil-specific renames like Python -> PyThread or Python::with_gil -> Python::attach?
  • Separating Python's two concerns - yes agreed, I think experimental: owned objects API #1308 is the long-awaited first step on the road to get there. I have an idea how that PR could be finished off, I'll do my best to resurrect it once 0.18.1 is out the door.
  • Frozen-by-default - yes, I think we should seriously consider this for 0.19.
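
To illustrate what a token that proves attachment, as opposed to lock ownership, might look like, here is a std-only sketch. None of these names exist in PyO3 at the time of this thread: `Attached`, `attach` and `depth` are hypothetical, and a thread-local counter stands in for attaching to a Python thread state.

```rust
use std::cell::Cell;
use std::marker::PhantomData;

thread_local! {
    // Hypothetical stand-in for "this thread is attached to the interpreter".
    static DEPTH: Cell<usize> = Cell::new(0);
}

/// Hypothetical token: proves the current thread is attached for `'a`.
/// The raw-pointer marker makes it `!Send`, so it cannot leave its thread.
#[derive(Clone, Copy)]
pub struct Attached<'a> {
    _scope: PhantomData<&'a ()>,
    _not_send: PhantomData<*mut ()>,
}

/// Runs `f` with a token that is only valid inside the closure.
pub fn attach<F, R>(f: F) -> R
where
    F: for<'a> FnOnce(Attached<'a>) -> R,
{
    // Detach on scope exit, including on panic.
    struct Detach;
    impl Drop for Detach {
        fn drop(&mut self) {
            DEPTH.with(|depth| depth.set(depth.get() - 1));
        }
    }

    DEPTH.with(|depth| depth.set(depth.get() + 1));
    let _detach = Detach;
    f(Attached { _scope: PhantomData, _not_send: PhantomData })
}

/// Current attachment depth of the calling thread.
pub fn depth() -> usize {
    DEPTH.with(|depth| depth.get())
}
```

Unlike today's token, nothing here implies mutual exclusion between threads, which is why downstream unsafe code that treats the token as proof of a held lock would need a separate API.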

@adamreichold
Member

I think for now it may be good to enforce PyClass: Send + Sync only when building for nogil?

It would be nice to have something like crater for this as I would expect the breakage to be manageable if we would couple this with making PyCell use atomic borrow flags as I expect that most classes would be Send + Sync automatically then.

Put differently, if we do go for a split of PyClass and PyCell with frozen-by-default classes, then this breakage when everyone who needs interior mutability has to suddenly opt into it, is probably much more problematic than the additional trait bound of Send + Sync when they have to touch that code anyway.

Of course, there is the argument (which is a big part of the PEP as well), that forcing everything to be thread-safe is imposing that cost on applications which are not interested in multi-threading. Personally, I have to say though that if that level of efficiency is relevant, I would probably recommend moving away from an interpreted language in any case.

@davidhewitt
Member

It would be nice to have something like crater for this as I would expect the breakage to be manageable if we would couple this with making PyCell use atomic borrow flags as I expect that most classes would be Send + Sync automatically then.

A crater-like thing would be cool though I fear it'd be a lot of work to set up? I wonder if there's a way we can set up a deprecation warning to nudge users a release or two prior. (Maybe using macro hackery deep in the guts of #[pyclass].)

@DataTriny
Contributor

Hello @colesbury

Thank you for contributing to PyO3!

The project is undergoing changes on its licensing model to better align with the rest of the Rust crates ecosystem.

Before your work gets merged, please express your consent by leaving a comment on this pull request.

Thank you!

@stuhood
Contributor

stuhood commented May 5, 2023

Very excited to see discussion of adding nogil support! https://github.com/pantsbuild/pants would be very eager users.

I'm not sure I understand your position. PyO3 already makes thread safety assumptions about PyAny that are not strictly true. If I understand correctly, PyAny objects can be accessed from multiple threads as long as they are within Python::with_gil (or otherwise have a Python type token.) This mostly works, but is not strictly safe. For many C extensions, there are possibilities for crashes and other undefined behavior if you access the same object from multiple threads because there are plenty of places where the GIL can be temporarily released even if the extension itself never releases the GIL.

It's slightly off topic for this thread, but we have previously observed a case like this when using the rust-cpython crate (which has a GILProtected type which seems very similar to PyCell). The discussion there might be interesting: dgrunwald/rust-cpython#218 ... can also open an issue with pyo3 if that would help.

@thejcannon

@colesbury I just got this going in https://github.com/pantsbuild/pants which is a heavy user of Python+PyO3 in a multi-threaded application. We use Rust's tokio driving both the Rust and Python event loop, and also use "frozen" types at the boundaries, so we think this would be a big win 😄

I have a branch able to run it with an updated version of your fork of PyO3 (for some recent 0.18 fixes here) and the nogil-3.9.10 branch of your CPython fork.

So far I've seen the occasional seg fault (first one I snipped below) and at least one "Already mutably borrowed". Oh and an ocean of gc_get_refs(gc): -N outputs to stdout.

Let me know if you want the cores or anything else that would be of assistance.

@colesbury
Author

Hi @thejcannon and @stuhood - thanks for taking a look at this. I've paused work on this PR until a decision is made on PEP 703. I plan to submit it to the steering council soon, and will come back to it once there's a decision.

@thejcannon

Oh exciting! Absolute best of luck.

Hopefully they say yes and we can reap the rewards 😈

@davidhewitt
Member

https://discuss.python.org/t/a-steering-council-notice-about-pep-703-making-the-global-interpreter-lock-optional-in-cpython/30474

Sounds positive for the future of nogil, how exciting! We'd better start figuring out the details here ☺️

9 participants