PyO3 performance analysis: function overheads #1607

Closed
davidhewitt opened this issue May 14, 2021 · 18 comments

@davidhewitt
Member

davidhewitt commented May 14, 2021

This is a follow-up from #1470 focussing specifically on the #[pyfunction] call overheads and how we can reduce the performance gap with CPython.

To keep it simple, I'm starting with the no-arguments, no-return-value case. Optimizations we make for this case can benefit more complex cases.

First I'll demonstrate the sample code I'm using to measure. Second, I'll talk about the underlying code. Finally, I'll suggest improvements.


Sample code

First up, the sample code is trivial: I'm using a no-args pyfunction like this one:

#[pyfunction]
fn no_args() {}

I install the extension module from the pyo3 root dir:

cd pyo3
pip install --use-feature=in-tree-build examples/pyo3-benchmarks

I can then import and run this in ipython:

In [1]: from pyo3_benchmarks import no_args

In [2]: %timeit no_args()
67.8 ns ± 1.19 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Comparing this to a pure-python implementation:

In [3]: def no_args_py():
   ...:     pass
   ...:

In [4]: %timeit no_args_py()
43 ns ± 0.414 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

67.8ns vs 43ns ... suggests there's some work we can do even in this base case.
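For anyone wanting to reproduce this outside ipython, the same per-call measurement can be scripted with the stdlib timeit module. This is just a sketch: since pyo3_benchmarks may not be installed in your environment, a pure-Python no-op stands in for the extension function — swap in `from pyo3_benchmarks import no_args` to measure the real thing.

```python
# Scripted equivalent of the %timeit measurement above, using the
# stdlib timeit module. The pure-Python no-op stands in for the
# extension function if pyo3_benchmarks isn't installed.
import timeit

def no_args_py():
    pass

n = 1_000_000
per_call_ns = timeit.timeit(no_args_py, number=n) / n * 1e9
print(f"{per_call_ns:.1f} ns per call")
```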


Underlying implementation

So the generated code in question is this macro output:

unsafe extern "C" fn #wrapper_ident(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject) -> *mut pyo3::ffi::PyObject
{
    pyo3::callback::handle_panic(|#py| {
        #slf_module
        let _args = #py.from_borrowed_ptr::<pyo3::types::PyTuple>(_args);
        let _kwargs: Option<&pyo3::types::PyDict> = #py.from_borrowed_ptr_or_opt(_kwargs);
        #body
    })
}

I'm going to break this code into three parts:

  1. The function signature (PyObject*, PyObject*, PyObject*) -> PyObject* shows we are using the METH_VARARGS | METH_KEYWORDS calling convention. There are several more efficient calling conventions we should consider using:
    • When functions take no arguments we should use METH_NOARGS.
    • When not using the limited API (abi3 feature) we should use METH_FASTCALL | METH_KEYWORDS.
  2. The helper pyo3::callback::handle_panic sets up a GILPool for pyo3 owned references and catches panics at the ffi boundary. The biggest change to this function will come from experimental: owned objects API #1308, which will pave the way to no longer needing to create a GILPool. I don't think it's worth trying to optimize that helper before that PR makes progress.
  3. The rest of the function body. Although it's obfuscated a lot by the macro code, ultimately this is parsing the input args tuple and kwargs dict, executing the (empty) Rust function, generating a new reference to None and returning that. This is all Rust code that we can benchmark and optimize. We can accelerate the argument parsing code a lot if we use the METH_FASTCALL | METH_KEYWORDS calling convention, because we'll get a raw array of arguments and the keyword-argument names to work with instead of a Python tuple and dict.
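As a rough illustration of point 3, here's a toy Python model (hypothetical, for illustration only — the real code is generated Rust) of the checks a METH_VARARGS | METH_KEYWORDS wrapper has to run on every call, even for a no-args function:

```python
# Toy model of the per-call work a METH_VARARGS | METH_KEYWORDS wrapper
# does for a no-args function: the args tuple and kwargs dict always
# exist and must be inspected, and a reference to None is produced.
# METH_FASTCALL avoids building the tuple/dict in the first place.
def varargs_wrapper(args, kwargs):
    if len(args) != 0:
        raise TypeError("no_args() takes no positional arguments")
    if kwargs:
        raise TypeError("no_args() takes no keyword arguments")
    return None  # the empty Rust body maps to Python None

assert varargs_wrapper((), None) is None
```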

We can get an idea of the rough performance impact of each piece of code by making changes to this macro output.

First, let's make the contents of the body just return a reference to None, without handle_panic or the argument parsing code:

unsafe extern "C" fn #wrapper_ident(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject) -> *mut pyo3::ffi::PyObject
{
    use crate::IntoPyPointer;
    unsafe { pyo3::Python::assume_gil_acquired().None().into_ptr() }
}

If I re-install the module and measure again, I get this timing:

In [2]: %timeit no_args()
24.8 ns ± 0.246 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

24.8ns - that's even faster than the plain Python code! This is exciting; it shows us that we can get a lot faster. It also shows that we're spending about 43ns of time inside the Rust function body.

Ok, so let's make one more change, which is to put handle_panic back. This'll help tease apart how much is handle_panic PyO3 framework overhead, and how much is PyO3's argument parsing.

unsafe extern "C" fn #wrapper_ident(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject) -> *mut pyo3::ffi::PyObject
{
    pyo3::callback::handle_panic(|#py| {
        use crate::IntoPyPointer;
        Ok(#py.None().into_ptr())
    })
}

This results in:

In [2]: %timeit no_args()
60.7 ns ± 0.857 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

We're back up to 60ns. So the argument parsing code is about 7ns, and the remaining ~36ns is down to handle_panic.

If I tweak handle_panic to use Python::assume_gil_acquired() instead of GILPool::new().python(), I get back to 28ns. So unfortunately a lot of the slowdown is caused by creation of a GILPool, meaning we need #1308 to resolve it...

I've run out of time for now; I'll post suggested actions for interested contributors in a second message later today or tomorrow...

@davidhewitt
Member Author

davidhewitt commented May 15, 2021

Suggested Improvements

Tabulating the above, I get on my machine for the no-args case:

| Base Case | Timing |
| --- | --- |
| Python Reference | ~ 43ns |
| PyO3 Pyfunction | ~ 68ns |

| Pyfunction Part | Timing |
| --- | --- |
| Python call overhead | ~ 25ns |
| Panic handling | ~ 3ns |
| GILPool::new() | ~ 33ns |
| Argument parsing | ~ 7ns |
| Total | ~ 68ns |
  • It looks to me like the best optimization we can make is to push forward with the changes in experimental: owned objects API #1308, which will set up a new API which doesn't require us to call GILPool::new(). In the meanwhile we could try to optimize GILPool::new().

  • Another thing we can do for this no-arguments case is to write an alternative code path in the macro which generates a function using the METH_NOARGS calling convention. We already do a similar thing for #[pymethods]:

    if spec.args.is_empty() {

    It'd be worth trying to make this change and measuring if it helps.

  • A more general improvement would be to also change the macro to conditionally emit METH_FASTCALL native functions if we're on Python 3.7 and up and not using the abi3 feature. This would be a more complex change compared to METH_NOARGS support as it'll also need some adjustments to the argument-parsing code, so I'd recommend attempting METH_NOARGS first.

cc @birkenfeld - sorry this took me so long to write up! Feel free to ask for any more information if you're interested in implementing any of the above.

@alex
Contributor

alex commented May 15, 2021

Note that in addition to the METH_NOARGS special case, there's also METH_O for a single argument.

@birkenfeld
Member

That doesn't support calling by kwarg though, right?

@alex
Contributor

alex commented May 15, 2021

No, I suppose not, so it'd need to be opt-in (or an intentional backwards-incompatible change).
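For reference, the kwargs limitation is visible with a CPython builtin that already uses METH_O — abs() takes exactly one positional argument and rejects keywords outright, which is the backwards-compatibility concern with switching an existing pyfunction to METH_O:

```python
# CPython implements abs() with METH_O: one positional argument,
# no keyword support at all.
assert abs(-2) == 2

try:
    abs(x=-2)  # raises TypeError: keywords aren't accepted
except TypeError:
    rejected = True

assert rejected
```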

@alex
Contributor

alex commented May 15, 2021

I've submitted #1608, which reduces the overhead of GILPool::new().

@davidhewitt
Member Author

davidhewitt commented May 15, 2021

That doesn't support calling by kwarg though, right?

Mmmm indeed we also don't support METH_VARARGS (which allows any number of positional arguments but no keyword arguments).

No, I suppose not, so it'd need to be opt-in (or an intentional backwards incompatible changes).

The user could opt in with the positional-only-arguments separator / at the end of an #[args(...)] annotation. We could then use METH_O and METH_VARARGS as optimizations in those cases.
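For anyone unfamiliar with the separator, here's what `/` means in pure Python (3.8+ def syntax; the function name is just an example) — this is the calling contract that would make METH_O / METH_VARARGS safe optimizations:

```python
# Parameters before `/` are positional-only; calling them by keyword
# raises TypeError. An #[args(...)] annotation ending in `/` would
# give pyo3 the same guarantee.
def scale(value, factor, /):
    return value * factor

assert scale(3, 4) == 12

try:
    scale(value=3, factor=4)  # raises TypeError
except TypeError:
    kw_rejected = True

assert kw_rejected
```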

@birkenfeld
Member

The positional-only-arguments separator is /, and it's a very recent addition to Python syntax.

@davidhewitt
Member Author

Thanks, fixed my comment.

We could support the separator independently of the Python version it was added in, because we control the argument-parsing code. The issue is tracked at #1439.

@alex
Contributor

alex commented May 16, 2021

@davidhewitt how much work would it be to reproduce the table from #1607 (comment) of which components take up the time?

@davidhewitt
Member Author

Not too much at all, though I didn't script it 😄. Note that I'm now using python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()" as per your usage in #1608, which seems to give marginally lower timings than ipython.

| Pyfunction Part | Timing |
| --- | --- |
| Python call overhead | ~ 19ns |
| Panic handling | ~ 0ns |
| GILPool::new() | ~ 21ns |
| Argument parsing | ~ 3ns |
| Total | ~ 43ns |

A bit surprised to see the argument parsing and panic handling overheads reduced. It's possible that the optimizer is doing a better job of removing side-effect-free code after #1604, so I'm now attributing some of that cost (incorrectly) to GILPool::new().

Indeed I wrote a quick benchmark in #1609, which suggested that we've achieved ~ 17ns of reduced overhead in GILPool::new(), but I'm only measuring ~ 11ns here.

Either way, we're looking better! I think that experimenting with the various METH_X calling conventions may yield further benefits, and I still think #1308 can reduce GILPool overhead further.

birkenfeld added a commit to birkenfeld/pyo3 that referenced this issue May 17, 2021
As suggested in PyO3#1607.

If I ran the benchmarks correctly, this shaves only about 5ns from
the noargs case.  But since it's relatively easy to do...
@birkenfeld
Member

So I've hit a little bump while starting to implement fastcall support...

The macros for pyfunction etc. have to generate different code depending on whether we are on all(Py_3_7, not(abi3)) or not. But the proc-macro crate doesn't have the cfg defines available.

So I thought: let's just generate both versions, with the proper #[cfg] attributes to select the right one. But that doesn't work either, since the #[cfg] then ends up in the crate using PyO3, which also doesn't have the cfg defines available...

How would you solve that?

@mejrs
Member

mejrs commented May 17, 2021

Fwiw, I think that is why pyclass and pymethods have a cfg'd import and rename. It's why I had to duplicate the documentation for them.

Those cfg's work in src/lib.rs, right?

@davidhewitt
Member Author

Ugh, @mejrs is right that I used the cfg'd import in #[pyclass] and #[pymethods] to avoid making a pyo3-macros/multiple-pymethods feature (and potentially even pyo3-macros-backend/multiple-pymethods).

But that approach doesn't really scale and created documentation pain, so I'm thinking about biting the bullet and reversing that decision.

For this #[cfg(Py_3_7)] problem, I can imagine that handling all the various combinations with that import approach will get horrible.

I think the right solution is probably to introduce a pyo3-build-config crate as a root dependency, like I was considering in #1548. Then pyo3-macros-backend and pyo3 can both access the cfg vars (as well as downstream crates).

I'd been hesitating on pushing that branch I'd been playing with because I didn't really want to have a 4th crate for something that seemed like a small user convenience, but if it also makes it possible for us to make better optimizations in the macro codegen then I think it's well worth it.

I'll try to finish up and push that branch on Wednesday. Perhaps to unblock you @birkenfeld, you could just copy pyo3's build.rs into pyo3-macros-backend for now? It'll create a small amount of duplicate work in your local compilation, but at least you'll be able to use all the #[cfg] attributes to implement the conditional optimizations...

@davidhewitt
Member Author

Given that we landed a lot of improvement in 0.14 (in particular #1619, thanks @birkenfeld), I'm going to close this particular analysis as it's now quite stale.

@samuelcolvin
Contributor

I know this is closed, but I still think there's a lot more to do here.

I'm using @davidhewitt's "bound" branch and seeing a very large difference between bare-metal code and the elegant API.

@davidhewitt
Member Author

@samuelcolvin are you observing that the "bound" branch still has some overheads? That's not a surprise, as we won't actually be able to remove the steps to create and remove the GILPool until 0.22. But the data structure itself should remain empty for completely migrated code and be much closer to a no-op.

It is definitely possible there are some extra optimisations to be found.

If you can share some concrete examples of what you're measuring and what the target would be, it helps us establish what might be remaining.

@samuelcolvin
Contributor

Thanks, I'll create a new issue.

@samuelcolvin
Contributor

See #3827.
