
Performance: calling overhead #3827

Open

samuelcolvin opened this issue Feb 11, 2024 · 7 comments

Comments

@samuelcolvin
Contributor

samuelcolvin commented Feb 11, 2024

Continued from conversation on long-dead #1607.

I'm still seeing a significant overhead when calling

#[pyfunction]
pub fn slow_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
    obj.len()
}

vs. a more "baremetal" implementation.

Timings:

In [2]: %timeit slow_len((1, 2, 3, 4))
33.5 ns ± 0.147 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: %timeit fast_len((1, 2, 3, 4))
12.6 ns ± 0.0159 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

I see this 20-40ns overhead in calling PyO3 functions in many scenarios.

Also, my "baremetal" code still uses _PyCFunctionFastWithKeywords rather than _PyCFunctionFast, as there don't seem to be methods available to register the latter. I assume there might be further performance improvements available over my fast_len method if _PyCFunctionFast could be used?

Code
use pyo3::ffi;
use pyo3::ffi::{PyLong_FromSize_t, PyObject_Size};
use pyo3::prelude::*;
use pyo3::types::PyCFunction;

unsafe extern "C" fn fast_len(
    _self: *mut ffi::PyObject,
    args: *const *mut ffi::PyObject,
    nargs: ffi::Py_ssize_t,
    _kwnames: *mut ffi::PyObject,
) -> *mut ffi::PyObject {
    if nargs != 1 {
        // C strings must be NUL-terminated, hence the trailing `\0`.
        ffi::PyErr_SetString(
            ffi::PyExc_TypeError,
            "expected one positional argument\0".as_ptr().cast(),
        );
        return std::ptr::null_mut();
    }
    let len = PyObject_Size(*args);
    if len == -1 {
        // PyObject_Size has already set an error; replace it with our own message.
        ffi::PyErr_SetString(ffi::PyExc_TypeError, "expected a sequence\0".as_ptr().cast());
        std::ptr::null_mut()
    } else {
        PyLong_FromSize_t(len as usize)
    }
}

static FAST_LEN: pyo3::PyMethodDef = pyo3::PyMethodDef::fastcall_cfunction_with_keywords(
    "fast_len",
    pyo3::methods::PyCFunctionFastWithKeywords(fast_len),
    "",
);

#[pyfunction]
pub fn slow_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
    obj.len()
}

#[pymodule]
fn py_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(PyCFunction::internal_new(&FAST_LEN, m.into())?)?;
    m.add_function(wrap_pyfunction!(slow_len, m)?)?;
    Ok(())
}

I'm installing pyo3 with

pyo3 = {git = "https://github.com/davidhewitt/pyo3", branch="complete-py2"}

so that I can use the new Bound API.

@davidhewitt
Member
Thanks! I'll try to break this down further in the morning 👍

@samuelcolvin
Contributor Author

samuelcolvin commented Feb 11, 2024

Thanks!

My actual use case is struct/class methods, where I think there's the same overhead, but I don't know how to implement them without #[pyclass].

@alex
Contributor

alex commented Feb 12, 2024

We still construct a GIL Pool and poke and prod at it, right? It's just that it's always empty, in principle?

@adamreichold
Member

Note that we should be able to get there in 0.22, i.e. the gil-refs feature will not just silence deprecations of the old API: the old API and the pool backing it will be fully removed when that feature is disabled, as we already experimented with in #3685.

(You should be able to see the effects already by rebasing that patch and adjusting the feature name. It just isn't something we can ship yet because it is still possible to reach the unimplemented!() via the old API.

Also note we do still have the global reference count pool, i.e. handling calls to Py::clone without the GIL being held. This will not go away just via the new API and might still add overhead. I think it can go away in a nogil Python build since the more expensive cross-thread reference counting will then just be the general case for the Python reference counting itself.)
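For concreteness, the feature toggle described above might look like this in Cargo.toml. This is a sketch assuming the feature keeps its 0.21 name, gil-refs, in 0.22:

```toml
[dependencies]
# New Bound API only: with `gil-refs` left disabled, the old GIL-ref API
# and the pool backing it would be compiled out (hypothetical 0.22 behavior).
pyo3 = { version = "0.22" }

# Transitional setup keeping the old API (and its pool) available:
# pyo3 = { version = "0.22", features = ["gil-refs"] }
```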

@davidhewitt
Member

I just took a quick look at this. The timings on my machine come out very similar to @samuelcolvin's with the original code, so I'll assume my new measurements are comparable.

I applied the same patch as in the bottom section of #3787 (comment) and reran this analysis. I also made sure LTO was enabled to make inlining as aggressive as possible, so that the generated code is more similar to the "baremetal" snippet.

With that done, I measure ~17.5ns for the "slow" code above, with no changes on the user's end. So we can make a significant dent in the difference with the GIL Pool removed and with the global Py reference counting replaced by nogil.

Regardless, ~17.5ns does imply a little extra framework overhead over the "baremetal" version, but we've still cut the overall function execution time by roughly 50% from 33.5ns, so we've already made a huge impact. It's worth remembering that there are a couple of extra pieces of work which PyO3 does that are somewhat fundamental:

  • We wrap code in panic::catch_unwind to translate panics at the function boundary into Python exceptions.
  • We allow keyword arguments (you can add signature = (obj, /) above to disable them, but the underlying framework will still check that there are no keywords).
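To illustrate the first point, here is a minimal pure-Rust sketch of what the panic translation at the function boundary looks like. The names (user_len, call_boundary) are hypothetical stand-ins, not PyO3's actual generated code, and the "boundary" here is an ordinary Rust call rather than a real C FFI edge:

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical stand-in for the body of a #[pyfunction]; it panics when
// given `None`, much like an `.expect` in real user code might.
fn user_len(obj: Option<&[i32]>) -> usize {
    obj.expect("expected a sequence").len()
}

// Sketch of the boundary wrapper: catch any Rust panic and turn it into
// an error value instead of unwinding across the (notional) C boundary,
// where PyO3 would raise a Python exception instead.
fn call_boundary(obj: Option<&[i32]>) -> Result<usize, String> {
    panic::catch_unwind(AssertUnwindSafe(|| user_len(obj))).map_err(|payload| {
        // The panic payload can be a &str or a String depending on how
        // the panic was raised; fall back to a generic message otherwise.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "panic in Rust function".to_string()
        }
    })
}

fn main() {
    // Silence the default panic backtrace printing for this demo.
    panic::set_hook(Box::new(|_| {}));
    assert_eq!(call_boundary(Some(&[1, 2, 3])), Ok(3));
    assert!(call_boundary(None).is_err());
}
```

Even when the user function never panics, the catch_unwind landing pad is set up on every call, which is part of the fixed per-call cost discussed above.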

Once the main bulk of the overhead is gone and we're in this 17.5ns regime, it would be interesting to see if we can optimize further, but I'd be surprised if there was much more to be won.


I also wonder whether we could already move the global Py reference counting to a dedicated thread which attempts to wake and update counts at intervals, rather than doing this on every function call? In a Python with the GIL it will affect singlethreaded throughput a little, but once nogil is present that thread shouldn't affect throughput (and as @adamreichold says it might not be necessary at all).
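As a rough illustration of that idea, here is a self-contained sketch in plain std Rust: deltas accumulate in a shared counter (standing in for deferred Py_INCREF/Py_DECREF bookkeeping) and a dedicated thread flushes them at intervals instead of on every call. All names are hypothetical, this is not PyO3's pool, and it deliberately ignores the soundness question of updates being missed before GIL-dependent code resumes:

```rust
use std::sync::atomic::{AtomicBool, AtomicIsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Drain the pending refcount deltas and apply them in one step. In a real
// implementation this is where the GIL would be taken and the actual
// Py_INCREF/Py_DECREF calls made.
fn apply_pending(pending: &AtomicIsize, applied: &AtomicIsize) {
    let delta = pending.swap(0, Ordering::AcqRel);
    applied.fetch_add(delta, Ordering::AcqRel);
}

fn main() {
    let pending = Arc::new(AtomicIsize::new(0));
    let applied = Arc::new(AtomicIsize::new(0));
    let done = Arc::new(AtomicBool::new(false));

    // Dedicated thread waking at intervals to flush deltas, instead of
    // checking the pool on every Python -> Rust call.
    let (p, a, d) = (pending.clone(), applied.clone(), done.clone());
    let flusher = thread::spawn(move || {
        while !d.load(Ordering::Acquire) {
            apply_pending(&p, &a);
            thread::sleep(Duration::from_millis(1));
        }
        apply_pending(&p, &a); // final flush on shutdown
    });

    // Simulate clones and drops of Py<T> happening without the GIL held.
    for _ in 0..1000 {
        pending.fetch_add(1, Ordering::AcqRel);
    }
    for _ in 0..400 {
        pending.fetch_sub(1, Ordering::AcqRel);
    }

    done.store(true, Ordering::Release);
    flusher.join().unwrap();
    // Net effect is preserved regardless of when the flushes happened.
    assert_eq!(applied.load(Ordering::Acquire), 600);
}
```

The trade-off is exactly the one raised in the replies: correctness then depends on every pending delta being applied before any code that relies on the refcounts runs.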

@adamreichold
Member

I also wonder whether we could already move the global Py reference counting to a dedicated thread which attempts to wake and update counts at intervals, rather than doing this on every function call?

IIRC, we had soundness issues when those updates were missed before resuming GIL-dependent Rust code, e.g. 83f5fa2

@davidhewitt
Member

Right, yes. That would make it very difficult to do anything other than what we already do, at least until nogil comes along 👍
