PyO3 performance analysis: function overheads #1607

Closed
davidhewitt opened this issue May 14, 2021 · 18 comments

@davidhewitt
Member

davidhewitt commented May 14, 2021

This is a follow-up from #1470 focussing specifically on the #[pyfunction] call overheads and how we can reduce the performance gap with CPython.

To keep it simple, I'm starting with the no-arguments, no-return-value case. Optimizations we make for this case can benefit more complex cases.

First I'll demonstrate the sample code I'm using to measure. Second, I'll talk about the underlying code. Finally, I'll suggest improvements.


Sample code

First up, the sample code is trivial: I'm using a no-args pyfunction like this one:

#[pyfunction]
fn no_args() {}

I install the extension module from the pyo3 root dir:

cd pyo3
pip install --use-feature=in-tree-build examples/pyo3-benchmarks

I can then import and run this in ipython:

In [1]: from pyo3_benchmarks import no_args

In [2]: %timeit no_args()
67.8 ns ± 1.19 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Comparing this to a pure-python implementation:

In [3]: def no_args_py():
   ...:     pass
   ...:

In [4]: %timeit no_args_py()
43 ns ± 0.414 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

67.8ns vs 43ns ... suggests there's some work we can do even in this base case.
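For anyone wanting to reproduce this outside ipython, the same per-call measurement can be scripted with the stdlib timeit module. This is just a sketch: since pyo3_benchmarks may not be installed in your environment, a pure-Python no-op stands in for the extension function — swap in `from pyo3_benchmarks import no_args` to measure the real thing.

```python
# Scripted equivalent of the %timeit measurement above, using the
# stdlib timeit module. The pure-Python no-op stands in for the
# extension function if pyo3_benchmarks isn't installed.
import timeit

def no_args_py():
    pass

n = 1_000_000
per_call_ns = timeit.timeit(no_args_py, number=n) / n * 1e9
print(f"{per_call_ns:.1f} ns per call")
```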


Underlying implementation

So the generated code in question is this macro output:

unsafe extern "C" fn #wrapper_ident(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject) -> *mut pyo3::ffi::PyObject
{
    pyo3::callback::handle_panic(|#py| {
        #slf_module
        let _args = #py.from_borrowed_ptr::<pyo3::types::PyTuple>(_args);
        let _kwargs: Option<&pyo3::types::PyDict> = #py.from_borrowed_ptr_or_opt(_kwargs);
        #body
    })
}

I'm going to break this code into three parts:

  1. The function signature (PyObject*, PyObject*, PyObject*) -> PyObject* shows we are using the METH_VARARGS | METH_KEYWORDS calling convention. There are several more efficient calling conventions we should consider using:
    • When functions take no arguments we should use METH_NOARGS.
    • When not using the limited API (abi3 feature) we should use METH_FASTCALL | METH_KEYWORDS.
  2. The helper pyo3::callback::handle_panic sets up a GILPool for pyo3 owned references and catches panics at the ffi boundary. The biggest change to this function will come from experimental: owned objects API #1308, which will pave the way to no longer needing to create a GILPool. I don't think it's worth trying to optimize that helper before that PR makes progress.
  3. The rest of the function body. Although it's obfuscated a lot by the macro code, ultimately this is parsing the input args tuple and kwargs dict, executing the (empty) Rust function, generating a new reference to None and returning that. This is all Rust code that we can benchmark and optimize. We can accelerate the argument parsing code a lot if we use the METH_FASTCALL | METH_KEYWORDS calling convention, because we'll get a raw array of arguments and the keyword-argument names to work with instead of a Python tuple and dict.
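As a rough illustration of point 3, here's a toy Python model (hypothetical, for illustration only — the real code is generated Rust) of the checks a METH_VARARGS | METH_KEYWORDS wrapper has to run on every call, even for a no-args function:

```python
# Toy model of the per-call work a METH_VARARGS | METH_KEYWORDS wrapper
# does for a no-args function: the args tuple and kwargs dict always
# exist and must be inspected, and a reference to None is produced.
# METH_FASTCALL avoids building the tuple/dict in the first place.
def varargs_wrapper(args, kwargs):
    if len(args) != 0:
        raise TypeError("no_args() takes no positional arguments")
    if kwargs:
        raise TypeError("no_args() takes no keyword arguments")
    return None  # the empty Rust body maps to Python None

assert varargs_wrapper((), None) is None
```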

We can get an idea of the rough performance impact of each piece of code by making changes to this macro output.

First, let's make the contents of the body just return a reference to None, without handle_panic or the argument parsing code:

unsafe extern "C" fn #wrapper_ident(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject) -> *mut pyo3::ffi::PyObject
{
    use crate::IntoPyPointer;
    unsafe { pyo3::Python::assume_gil_acquired().None().into_ptr() }
}

If I re-install the module and measure again, I get this timing:

In [2]: %timeit no_args()
24.8 ns ± 0.246 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

24.8ns - that's even faster than the plain Python code! This is exciting; it shows us that we can get a lot faster. It also shows that we're spending about 43ns of time inside the Rust function body.

Ok, so let's make one more change, which is to put handle_panic back. This'll help tease apart how much is handle_panic PyO3 framework overhead, and how much is PyO3's argument parsing.

unsafe extern "C" fn #wrapper_ident(
    _slf: *mut pyo3::ffi::PyObject,
    _args: *mut pyo3::ffi::PyObject,
    _kwargs: *mut pyo3::ffi::PyObject) -> *mut pyo3::ffi::PyObject
{
    pyo3::callback::handle_panic(|#py| {
        use crate::IntoPyPointer;
        Ok(#py.None().into_ptr())
    })
}

This results in:

In [2]: %timeit no_args()
60.7 ns ± 0.857 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

We're back up to 60ns. So the argument parsing code is about 7ns, and the remaining ~36ns is down to handle_panic.

If I tweak handle_panic to use Python::assume_gil_acquired() instead of GILPool::new().python(), I get back to 28ns. So unfortunately a lot of the slowdown is caused by creation of a GILPool, meaning we need #1308 to resolve it...

I've run out of time for now; I'll post suggested actions for interested contributors in a second message later today or tomorrow...

@davidhewitt
Member Author

davidhewitt commented May 15, 2021

Suggested Improvements

Tabulating the above, I get on my machine for the no-args case:

| Base Case | Timing |
| --- | --- |
| Python Reference | ~ 43ns |
| PyO3 Pyfunction | ~ 68ns |

| Pyfunction Part | Timing |
| --- | --- |
| Python call overhead | ~ 25ns |
| Panic handling | ~ 3ns |
| GILPool::new() | ~ 33ns |
| Argument parsing | ~ 7ns |
| Total | ~ 68ns |
  • It looks to me like the best optimization we can make is to push forward with the changes in experimental: owned objects API #1308, which will set up a new API which doesn't require us to call GILPool::new(). In the meanwhile we could try to optimize GILPool::new().

  • Another thing we can do for this no-arguments case is to write an alternative code path in the macro which generates a function using the METH_NOARGS calling convention. We already do a similar thing for #[pymethods]:

    if spec.args.is_empty() {

    It'd be worth trying to make this change and measuring if it helps.

  • A more general improvement would be to also change the macro to conditionally emit METH_FASTCALL native functions if we're on Python 3.7 and up and not using the abi3 feature. This would be a more complex change compared to METH_NOARGS support as it'll also need some adjustments to the argument-parsing code, so I'd recommend attempting METH_NOARGS first.

cc @birkenfeld - sorry this took me so long to write up! Feel free to ask for any more information if you're interested in implementing any of the above.

@alex
Contributor

alex commented May 15, 2021

Note that in addition to the METH_NOARGS special case, there's also METH_O for a single argument.

@birkenfeld
Member

That doesn't support calling by kwarg though, right?

@alex
Contributor

alex commented May 15, 2021

No, I suppose not, so it'd need to be opt-in (or an intentional backwards-incompatible change).
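For reference, the kwargs limitation is visible with a CPython builtin that already uses METH_O — abs() takes exactly one positional argument and rejects keywords outright, which is the backwards-compatibility concern with switching an existing pyfunction to METH_O:

```python
# CPython implements abs() with METH_O: one positional argument,
# no keyword support at all.
assert abs(-2) == 2

try:
    abs(x=-2)  # raises TypeError: keywords aren't accepted
except TypeError:
    rejected = True

assert rejected
```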

@alex
Contributor

alex commented May 15, 2021

I've submitted #1608, which reduces the overhead of GILPool::new().

@davidhewitt
Member Author

davidhewitt commented May 15, 2021

That doesn't support calling by kwarg though, right?

Mmmm indeed we also don't support METH_VARARGS (which allows any number of positional arguments but no keyword arguments).

No, I suppose not, so it'd need to be opt-in (or an intentional backwards incompatible changes).

The user could opt in with the positional-only-arguments separator / at the end of an #[args(...)] annotation. We could then use METH_O and METH_VARARGS as optimizations in those cases.
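For anyone unfamiliar with the separator, here's what `/` means in pure Python (3.8+ def syntax; the function name is just an example) — this is the calling contract that would make METH_O / METH_VARARGS safe optimizations:

```python
# Parameters before `/` are positional-only; calling them by keyword
# raises TypeError. An #[args(...)] annotation ending in `/` would
# give pyo3 the same guarantee.
def scale(value, factor, /):
    return value * factor

assert scale(3, 4) == 12

try:
    scale(value=3, factor=4)  # raises TypeError
except TypeError:
    kw_rejected = True

assert kw_rejected
```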

@birkenfeld
Member

The positional-only-arguments separator is /, and it's a very recent addition to Python syntax.

@davidhewitt
Member Author

Thanks, fixed my comment.

We could support the separator independently of the Python version it was added in, because we control the argument-parsing code. The issue is tracked at #1439.

@alex
Contributor

alex commented May 16, 2021

@davidhewitt how much work would it be to reproduce the table from #1607 (comment) of which components take up the time?

@davidhewitt
Member Author

Not too much at all, though I didn't script it 😄. Note that I'm now using python3 -mtimeit -s "from pyo3_benchmarks import no_args" "no_args()" as per your usage in #1608, which seems to give marginally lower timings than ipython.

| Pyfunction Part | Timing |
| --- | --- |
| Python call overhead | ~ 19ns |
| Panic handling | ~ 0ns |
| GILPool::new() | ~ 21ns |
| Argument parsing | ~ 3ns |
| Total | ~ 43ns |

A bit surprised to see the argument parsing and panic handling overheads reduced. It's possible that the optimizer is doing a better job of removing side-effect-free code after #1604, so I'm now attributing some of that cost (incorrectly) to GILPool::new().

Indeed I wrote a quick benchmark in #1609, which suggested that we've achieved ~ 17ns of reduced overhead in GILPool::new(), but I'm only measuring ~ 11ns here.

Either way, we're looking better! I think that experimenting with the various METH_X calling conventions may yield further benefits, and I still think #1308 can reduce GILPool overhead further.

birkenfeld added a commit to birkenfeld/pyo3 that referenced this issue May 17, 2021
As suggested in PyO3#1607.

If I ran the benchmarks correctly, this shaves only about 5ns from
the noargs case.  But since it's relatively easy to do...
@birkenfeld
Member

So I've hit a little bump while starting to implement fastcall support...

The macros for pyfunction etc. have to generate different code depending on whether we are on all(Py_3_7, not(abi3)) or not. But the proc-macro crate doesn't have the cfg defines available.

So I thought: let's just generate both versions, with the proper #[cfg] attributes to select the right one. But that doesn't work either, since the #[cfg] then ends up in the crate using PyO3, which also doesn't have the cfg defines available...

How would you solve that?

@mejrs
Member

mejrs commented May 17, 2021

Fwiw, I think that is why pyclass and pymethods have a cfg'd import and rename. It's why I had to duplicate the documentation for them.

Those cfg's work in src/lib.rs, right?

@davidhewitt
Member Author

Ugh, @mejrs is right that I used the cfg'd import in #[pyclass] and #[pymethods] to avoid making a pyo3-macros/multiple-pymethods feature (and potentially even pyo3-macros-backend/multiple-pymethods).

But that approach doesn't really scale and created documentation pain, so I'm thinking about biting the bullet and reversing that decision.

For this #[cfg(Py_3_7)] problem, I can imagine that handling all the various combinations with that import approach will get horrible.

I think the right solution is probably to introduce a pyo3-build-config crate as a root dependency, like I was considering in #1548. Then pyo3-macros-backend and pyo3 can both access the cfg vars (as well as downstream crates).

I'd been hesitating on pushing that branch I'd been playing with because I didn't really want to have a 4th crate for something that seemed like a small user convenience, but if it also makes it possible for us to make better optimizations in the macro codegen then I think it's well worth it.

I'll try to finish up and push that branch on Wednesday. Perhaps to unblock you @birkenfeld, you could just copy pyo3's build.rs into pyo3-macros-backend for now? It'll create a small amount of duplicate work in your local compilation, but at least you'll be able to use all the #[cfg] attributes to implement the conditional optimizations...

@davidhewitt
Member Author

Given that we landed a lot of improvement in 0.14 (in particular #1619, thanks @birkenfeld), I'm going to close this particular analysis as it's now quite stale.

@samuelcolvin
Contributor

I know this is closed, but I still think there's a lot more to do here.

I'm using @davidhewitt's "bound" branch and seeing a very large difference between bare-metal code and the elegant API.

@davidhewitt
Member Author

@samuelcolvin are you observing that the "bound" branch still has some overheads? That's not a surprise, as we won't actually be able to remove the steps to create and remove the GILPool until 0.22. But the data structure itself should remain empty for completely migrated code and be much closer to a no-op.

It is definitely possible there are some extra optimisations to be found.

If you can share some concrete examples of what you're measuring and what the target would be, it helps us establish what might be remaining.

@samuelcolvin
Contributor

Thanks, I'll create a new issue.

@samuelcolvin
Contributor

See #3827.
