
Performance: calling overhead #3827

Open

samuelcolvin opened this issue Feb 11, 2024 · 7 comments

Comments

@samuelcolvin
Contributor

samuelcolvin commented Feb 11, 2024

Continued from conversation on long-dead #1607.

I'm still seeing a significant overhead when calling

#[pyfunction]
pub fn slow_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
    obj.len()
}

vs. a more "baremetal" implementation.

Timings:

In [2]: %timeit slow_len((1, 2, 3, 4))
33.5 ns ± 0.147 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [3]: %timeit fast_len((1, 2, 3, 4))
12.6 ns ± 0.0159 ns per loop (mean ± std. dev. of 7 runs, 100,000,000 loops each)

I see this 20-40ns overhead in calling PyO3 functions in many scenarios.

Also, my "baremetal" code still uses _PyCFunctionFastWithKeywords rather than _PyCFunctionFast, as there don't seem to be methods available to register the latter. I assume there might be further performance improvements available over my fast_len method if _PyCFunctionFast could be used?

Code
use pyo3::ffi;
use pyo3::ffi::{PyLong_FromSize_t, PyObject_Size};
use pyo3::prelude::*;
use pyo3::types::PyCFunction;

unsafe extern "C" fn fast_len(
    _self: *mut ffi::PyObject,
    args: *const *mut ffi::PyObject,
    nargs: ffi::Py_ssize_t,
    _kwnames: *mut ffi::PyObject,
) -> *mut ffi::PyObject {
    if nargs != 1 {
        // C strings must be NUL-terminated, hence the trailing `\0`.
        ffi::PyErr_SetString(
            ffi::PyExc_TypeError,
            "expected one positional argument\0".as_ptr().cast(),
        );
        return std::ptr::null_mut();
    }
    let len = PyObject_Size(*args);
    if len == -1 {
        // PyObject_Size has already set an error; replace it with our own message.
        ffi::PyErr_SetString(ffi::PyExc_TypeError, "expected a sequence\0".as_ptr().cast());
        std::ptr::null_mut()
    } else {
        PyLong_FromSize_t(len as usize)
    }
}

static FAST_LEN: pyo3::PyMethodDef = pyo3::PyMethodDef::fastcall_cfunction_with_keywords(
    "fast_len",
    pyo3::methods::PyCFunctionFastWithKeywords(fast_len),
    "",
);

#[pyfunction]
pub fn slow_len(obj: &Bound<'_, PyAny>) -> PyResult<usize> {
    obj.len()
}

#[pymodule]
fn py_module(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(PyCFunction::internal_new(&FAST_LEN, m.into())?)?;
    m.add_function(wrap_pyfunction!(slow_len, m)?)?;
    Ok(())
}

I'm installing pyo3 with

pyo3 = {git = "https://github.com/davidhewitt/pyo3", branch="complete-py2"}

so that I can use the new Bound API.

@davidhewitt
Member
Thanks! I'll try to break this down further in the morning 👍

@samuelcolvin
Contributor Author

samuelcolvin commented Feb 11, 2024

Thanks!

My actual use case is struct/class methods, where I think there's the same overhead, but I don't know how to implement them without #[pyclass].

@alex
Contributor

alex commented Feb 12, 2024

We still construct a GIL Pool and poke and prod at it, right? It's just that it's always empty, in principle?

@adamreichold
Member

Note that we should be able to get there in 0.22, i.e. the gil-refs feature will not just silence deprecations of the old API: the old API and the pool backing it will be fully removed when that feature is disabled, as we already experimented with in #3685.

(You should be able to see the effects already by rebasing that patch and adjusting the feature name. It just isn't something we can ship yet because it is still possible to reach the unimplemented!() via the old API.

Also note we do still have the global reference count pool, i.e. handling calls to Py::clone without the GIL being held. This will not go away just via the new API and might still add overhead. I think it can go away in a nogil Python build since the more expensive cross-thread reference counting will then just be the general case for the Python reference counting itself.)
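For concreteness, the feature toggle described above might look like this in Cargo.toml. This is a sketch assuming the feature keeps its 0.21 name, gil-refs, in 0.22:

```toml
[dependencies]
# New Bound API only: with `gil-refs` left disabled, the old GIL-ref API
# and the pool backing it would be compiled out (hypothetical 0.22 behavior).
pyo3 = { version = "0.22" }

# Transitional setup keeping the old API (and its pool) available:
# pyo3 = { version = "0.22", features = ["gil-refs"] }
```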

@davidhewitt
Member

I just took a quick look at this. The timings on my machine come out very similar to @samuelcolvin's with the original code, so I'll assume my new measurements are comparable.

I applied the same patch as in the bottom section of #3787 (comment) and reran this analysis. I also made sure LTO was enabled to make inlining as aggressive as possible, so that the generated code is more similar to the "baremetal" snippet.

With that done, I measure ~17.5ns for the "slow" code above, with no changes on the user's end. So we can make a significant dent in the difference with the GIL Pool removed and with the global Py reference counting replaced by nogil.

Regardless, ~17.5ns does imply a little extra framework overhead over the "baremetal" version, but we've still cut the overall function execution time by roughly 50% from 33.5ns, so we've already made a huge impact. It's worth remembering that there are a couple of extra pieces of work which PyO3 does that are somewhat fundamental:

  • We wrap code in panic::catch_unwind to translate panics at the function boundary into Python exceptions.
  • We allow keyword arguments (you can add signature = (obj, /) above to disable them, but the underlying framework will still check that there are no keywords).
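To illustrate the first point, here is a minimal pure-Rust sketch of what the panic translation at the function boundary looks like. The names (user_len, call_boundary) are hypothetical stand-ins, not PyO3's actual generated code, and the "boundary" here is an ordinary Rust call rather than a real C FFI edge:

```rust
use std::panic::{self, AssertUnwindSafe};

// Hypothetical stand-in for the body of a #[pyfunction]; it panics when
// given `None`, much like an `.expect` in real user code might.
fn user_len(obj: Option<&[i32]>) -> usize {
    obj.expect("expected a sequence").len()
}

// Sketch of the boundary wrapper: catch any Rust panic and turn it into
// an error value instead of unwinding across the (notional) C boundary,
// where PyO3 would raise a Python exception instead.
fn call_boundary(obj: Option<&[i32]>) -> Result<usize, String> {
    panic::catch_unwind(AssertUnwindSafe(|| user_len(obj))).map_err(|payload| {
        // The panic payload can be a &str or a String depending on how
        // the panic was raised; fall back to a generic message otherwise.
        if let Some(s) = payload.downcast_ref::<&str>() {
            (*s).to_string()
        } else if let Some(s) = payload.downcast_ref::<String>() {
            s.clone()
        } else {
            "panic in Rust function".to_string()
        }
    })
}

fn main() {
    // Silence the default panic backtrace printing for this demo.
    panic::set_hook(Box::new(|_| {}));
    assert_eq!(call_boundary(Some(&[1, 2, 3])), Ok(3));
    assert!(call_boundary(None).is_err());
}
```

Even when the user function never panics, the catch_unwind landing pad is set up on every call, which is part of the fixed per-call cost discussed above.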

Once the main bulk of the overhead is gone and we're in this 17.5ns regime, it would be interesting to see if we can optimize further, but I'd be surprised if there was much more to be won.


I also wonder whether we could already move the global Py reference counting to a dedicated thread which attempts to wake and update counts at intervals, rather than doing this on every function call? In a Python with the GIL it will affect singlethreaded throughput a little, but once nogil is present that thread shouldn't affect throughput (and as @adamreichold says it might not be necessary at all).
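As a rough illustration of that idea, here is a self-contained sketch in plain std Rust: deltas accumulate in a shared counter (standing in for deferred Py_INCREF/Py_DECREF bookkeeping) and a dedicated thread flushes them at intervals instead of on every call. All names are hypothetical, this is not PyO3's pool, and it deliberately ignores the soundness question of updates being missed before GIL-dependent code resumes:

```rust
use std::sync::atomic::{AtomicBool, AtomicIsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Drain the pending refcount deltas and apply them in one step. In a real
// implementation this is where the GIL would be taken and the actual
// Py_INCREF/Py_DECREF calls made.
fn apply_pending(pending: &AtomicIsize, applied: &AtomicIsize) {
    let delta = pending.swap(0, Ordering::AcqRel);
    applied.fetch_add(delta, Ordering::AcqRel);
}

fn main() {
    let pending = Arc::new(AtomicIsize::new(0));
    let applied = Arc::new(AtomicIsize::new(0));
    let done = Arc::new(AtomicBool::new(false));

    // Dedicated thread waking at intervals to flush deltas, instead of
    // checking the pool on every Python -> Rust call.
    let (p, a, d) = (pending.clone(), applied.clone(), done.clone());
    let flusher = thread::spawn(move || {
        while !d.load(Ordering::Acquire) {
            apply_pending(&p, &a);
            thread::sleep(Duration::from_millis(1));
        }
        apply_pending(&p, &a); // final flush on shutdown
    });

    // Simulate clones and drops of Py<T> happening without the GIL held.
    for _ in 0..1000 {
        pending.fetch_add(1, Ordering::AcqRel);
    }
    for _ in 0..400 {
        pending.fetch_sub(1, Ordering::AcqRel);
    }

    done.store(true, Ordering::Release);
    flusher.join().unwrap();
    // Net effect is preserved regardless of when the flushes happened.
    assert_eq!(applied.load(Ordering::Acquire), 600);
}
```

The trade-off is exactly the one raised in the replies: correctness then depends on every pending delta being applied before any code that relies on the refcounts runs.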

@adamreichold
Member

I also wonder whether we could already move the global Py reference counting to a dedicated thread which attempts to wake and update counts at intervals, rather than doing this on every function call?

IIRC, we had soundness issues when those updates were missed before resuming GIL-dependent Rust code, e.g. 83f5fa2

@davidhewitt
Member

Right, yes. That would make it very difficult to do anything other than what we already do, at least until nogil comes along 👍
