Performance issue when calling rust function in python #3787

richecr · 2024-01-31T02:31:37Z

I have created this simple function:

use pyo3::prelude::*;
use std::time::{Instant};

#[pyfunction]
fn rust_sleep() -> i32 {
    let start = Instant::now();
    let num = 1 + 1;
    let duration = start.elapsed();
    println!("{:?}", duration);
    num
}

#[pymodule]
fn pythonicsqlrust(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}

import time
import pythonicsqlrust

def main():
    s = time.time_ns()
    print(pythonicsqlrust.rust_sleep())
    e = time.time_ns()
    print(e - s)

main()

In pure Rust it takes ~60ns, but when called from Python it takes ~22350ns. Can you help me understand why this happens? (It is slower than running in pure Python.)

davidhewitt · 2024-01-31T09:32:15Z

Thanks @richecr for the report.

TLDR:

Yes, at the moment PyO3 does introduce some framework-level overheads which make a PyO3 function call just slightly slower than a Python function call. We intend to remove this. Nevertheless, this overhead is tiny, on the nanosecond scale.
Despite this, you should not be surprised if for more complex functions the cost of converting Python types into Rust types continues to show a penalty.
For real-world workloads, you should find that the execution speed of the Rust code outstrips the Python equivalent. There are many blog posts showcasing examples of this, some of which can be found on the PyO3 README.

First, I'll assume that when you say "is slower than running in pure Python" you're testing against this Python implementation:

def py_sleep():
    start = time.time_ns()
    num = 1 + 1
    duration = time.time_ns() - start
    print(duration)
    return num

There are several different factors coming into play. Let's try to break these down:

In both cases you are making a Python function call. In one case that call runs native Rust code; in the other it runs interpreted Python. Either way, there is a fundamental baseline cost to making a Python function call.
Measuring this kind of microbenchmark is really hard. There will be a lot of volatility in the measurements from hardware-level effects such as CPU branch prediction, frequency boosting, cache locality etc. Let's take a large sample and average it to get a truer measurement.
As you're printing to stdout and also reading system time, it's quite possible there will be some one-time initialization, so let's run a warmup to eliminate that.
Let's check your Rust compile settings. You should be using a --release build at a minimum.
The Rust code is doing a tiny bit of extra formatting for the output, to compute the scale of the duration and add the ns suffix. Let's make it a bit fairer by taking that out for now.

Putting that together, we end up with this code:

use pyo3::prelude::*;
use std::time::Instant;

#[pyfunction]
fn rust_sleep() -> i32 {
    let start = Instant::now();
    let num = 1 + 1;
    let duration = start.elapsed();
    println!("{}", duration.as_nanos());
    num
}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}

import time
from timeit import timeit
import pyo3_scratch


def py_sleep():
    s = time.time_ns()
    x = 1 + 1
    e = time.time_ns()
    print(e - s)
    return x


# run some warmups
pyo3_scratch.rust_sleep()
py_sleep()

# measure average duration of 1 million calls
N = 1_000_000
py = timeit("py_sleep()", setup="from __main__ import py_sleep", number=N) / N
rust = timeit("rust_sleep()", setup="from pyo3_scratch import rust_sleep", number=N) / N

# report final timings
print("py", py)
print("rust", rust)

Now, running this, I get the following output:

py 4.560636853999767e-06
rust 4.483858657999917e-06

There is still volatility in these numbers; sometimes Rust is a little slower than Python, sometimes a little faster. Overall, both report around 4.5us per call on my machine, which makes me assume the work in both languages is dominated by the system-level operations: timing measurements and writing to stdout.
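A rough way to sanity-check that assumption is to time the system-level operations on their own, without any Rust involved. The sketch below is pure Python; the helper names (`noop`, `clock_pair`, `make_clock_pair_and_print`, `per_call_ns`) are made up for illustration, and output goes to `os.devnull` rather than a real terminal, so absolute numbers will differ from the run above.

```python
import os
import time
from timeit import timeit

N = 200_000  # fewer iterations than above, just to keep the sketch quick


def noop():
    # Baseline: only the cost of a Python function call.
    return 2


def clock_pair():
    # Two clock reads around trivial work, no output.
    s = time.time_ns()
    x = 1 + 1
    e = time.time_ns()
    return e - s


def make_clock_pair_and_print(stream):
    # Same as clock_pair, plus one print() to the given stream.
    def clock_pair_and_print():
        s = time.time_ns()
        x = 1 + 1
        e = time.time_ns()
        print(e - s, file=stream)
        return x

    return clock_pair_and_print


def per_call_ns(func, number=N):
    # timeit accepts a callable; convert the total to nanoseconds per call.
    return timeit(func, number=number) / number * 1e9


with open(os.devnull, "w") as sink:
    results = {
        "noop": per_call_ns(noop),
        "clock reads": per_call_ns(clock_pair),
        "clock reads + print": per_call_ns(make_clock_pair_and_print(sink)),
    }

for name, ns in results.items():
    print(f"{name:>20}: {ns:8.1f} ns per call")
```

If the last two rows are far larger than the first, that is consistent with the clock reads and the write to stdout dominating the measurement rather than the function call itself.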

Let's try to measure the call overhead more precisely by making both of these functions into noops:

use pyo3::prelude::*;

#[pyfunction]
fn rust_sleep() {}

#[pymodule]
fn pyo3_scratch(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(rust_sleep, m)?)?;
    Ok(())
}

from timeit import timeit
import pyo3_scratch


def py_sleep():
    return


# run some warmups
pyo3_scratch.rust_sleep()
py_sleep()

# measure average duration of 1 million calls
N = 1_000_000
py = timeit("py_sleep()", setup="from __main__ import py_sleep", number=N) / N
rust = timeit("rust_sleep()", setup="from pyo3_scratch import rust_sleep", number=N) / N

print("py", py)
print("rust", rust)

Now on my machine I get less volatility, and pure-Python shows an edge:

py 1.899612299985165e-08
rust 2.5328653000087797e-08

What we see is that calling the noop Python function measures at about 19ns, and calling into Rust at about 25ns. This roughly 6ns slowdown is a fairer estimate of the overhead which PyO3 currently exhibits over pure-Python function calls.
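Since a single average can still hide run-to-run noise, one option is to repeat the whole measurement a few times and look at the minimum and the spread. This is a minimal sketch using only the standard library; `py_noop` and `summarize_ns` are illustrative names, and only the pure-Python side is measured here.

```python
import statistics
from timeit import repeat


def py_noop():
    return


def summarize_ns(func, number=200_000, rounds=5):
    """Return min, median and max per-call time in nanoseconds over several rounds."""
    totals = repeat(func, number=number, repeat=rounds)
    per_call = [t / number * 1e9 for t in totals]
    return {
        "min": min(per_call),
        "median": statistics.median(per_call),
        "max": max(per_call),
    }


stats = summarize_ns(py_noop)
print("py", {k: round(v, 1) for k, v in stats.items()})
```

The same helper could be pointed at the extension function, e.g. `summarize_ns(pyo3_scratch.rust_sleep)`, assuming the module above has been built and is importable.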

Finally, let's go a step further and estimate what PyO3 could look like with the overheads which we're working to remove in PyO3 0.21. I'll apply the following diff to current PyO3 main, which is a crude way to disable framework-level overheads we're working to remove in #3382:

diff --git a/src/impl_/trampoline.rs b/src/impl_/trampoline.rs
index 4b4eac17a..2664d7598 100644
--- a/src/impl_/trampoline.rs
+++ b/src/impl_/trampoline.rs
@@ -174,8 +174,9 @@ where
     R: PyCallbackOutput,
 {
     let trap = PanicTrap::new("uncaught panic at ffi boundary");
-    let pool = unsafe { GILPool::new() };
-    let py = pool.python();
+    // let pool = unsafe { GILPool::new() };
+    // let py = pool.python();
+    let py = unsafe { Python::assume_gil_acquired() };
     let out = panic_result_into_callback_output(
         py,
         panic::catch_unwind(move || -> PyResult<_> { body(py) }),

(NB do not attempt to apply the above diff and run this for real world code in production. Until PyO3 is fully transitioned to the new API, the GILPool is a fundamental part of correct operation of PyO3.)

This reverses the situation. For the noop function calls, once we've sorted out this framework-level overhead, calling a noop PyO3 function will be faster than calling a noop pure-Python one, by about 4.8ns on my machine:

py 1.5801341000042156e-08
rust 1.1012634000508115e-08
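As a quick check of the arithmetic, the two sets of noop timings reported above can be converted to nanoseconds and compared. The numbers are copied from the outputs in this comment; the labels and the helper name `rust_minus_py_ns` are only illustrative.

```python
# Per-call timings in seconds, copied from the two noop runs above.
measurements = {
    "noop, unmodified": {"py": 1.899612299985165e-08, "rust": 2.5328653000087797e-08},
    "noop, GILPool disabled": {"py": 1.5801341000042156e-08, "rust": 1.1012634000508115e-08},
}


def rust_minus_py_ns(py_seconds, rust_seconds):
    """Difference in nanoseconds; positive means the Rust call is slower."""
    return (rust_seconds - py_seconds) * 1e9


diffs = {}
for label, m in measurements.items():
    diffs[label] = rust_minus_py_ns(m["py"], m["rust"])
    print(
        f"{label}: py {m['py'] * 1e9:.1f} ns, rust {m['rust'] * 1e9:.1f} ns, "
        f"rust - py = {diffs[label]:+.1f} ns"
    )
```

With these figures the difference comes out at roughly +6.3 ns for the unmodified build and roughly -4.8 ns with the pool disabled.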

Paulo-21 · 2024-03-05T12:10:00Z

Hello,
I have the same performance issue, but with a much larger gap.

tic = time.perf_counter()
raw_features = af_reader_py.compute_features(af_path, page_rank, degree_centrality, in_degrees, out_degrees, 10000, 0.00001 )
print("Python wrappe the function : ", time.perf_counter()-tic, " sec")

We can see that the Rust code takes 6.8 sec, and it takes nearly 10 sec to return the output to Python.

fn compute_features(
    file_path: &str,
    page_rank: Vec<f64>,
    degree_centrality: Vec<f64>,
    in_degree: Vec<f64>,
    out_degree: Vec<f64>,
    iter: usize,
    tol: f64,
) -> PyResult<Vec<[f64; 11]>> {
    let start = Instant::now();
    let edge = reading_cnf_for_rustworkx(file_path);
    let (hcat, card, noselfatt, maxbased, gr, ) = reading_cnf_with_semantics(file_path);
    let g = petgraph::graph::DiGraph::<u32, ()>::from_edges(&edge);
    let eig = eigenvector_centrality(&g,  |_| {Ok::<f64,f64>(1.)}, Some(iter), Some(tol)).unwrap().unwrap();
    let coloring = greedy_node_color(&g);
    let mut raw_features = Vec::with_capacity(g.node_count());
    for node in 0..page_rank.len() {
        raw_features.push([
            coloring[node] as f64,page_rank[node],
            degree_centrality[node], eig[node],
            in_degree[node],out_degree[node],
            hcat[node],card[node],noselfatt[node],
            maxbased[node],gr[node]
        ]);
    }
    println!("Inside Rust {} ms", start.elapsed().as_millis());
    //.into()
    Ok(raw_features)
}

Did I do something wrong, or is it due to the PyO3 implementation?
Thanks

birkenfeld · 2024-03-05T12:28:50Z

> Did I do something wrong, or is it due to the PyO3 implementation? Thanks

Keep in mind that the Vecs you pass in and out of the function have to be converted between Python and Rust representation. How many elements are in them, roughly?

davidhewitt · 2024-03-06T00:21:48Z

@Paulo-21 further to the above, you may want to consider using rust-numpy to convert Rust matrix types directly to numpy arrays and avoid the list-of-list-of-floats. Should be significantly faster.

Paulo-21 · 2024-03-06T18:53:37Z

> > Did I do something wrong, or is it due to the PyO3 implementation? Thanks
>
> Keep in mind that the Vecs you pass in and out of the function have to be converted between Python and Rust representation. How many elements are in them, roughly?

Yes, I understand. It's roughly 2.5 million elements.

> @Paulo-21 further to the above, you may want to consider using rust-numpy to convert Rust matrix types directly to numpy arrays and avoid the list-of-list-of-floats. Should be significantly faster.

Thank you, I will give it a try!
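To get a feel for why the list-of-list-of-floats shape is expensive, here is a rough pure-Python analogy. It is not PyO3's or rust-numpy's conversion code. It only contrasts creating one Python object per element with exposing a packed buffer of doubles as a 2-D view. The sizes and helper names (`ROWS`, `COLS`, `to_list_of_lists`, `to_flat_view`) are hypothetical, and the row count is much smaller than the case discussed above.

```python
import time
from array import array

ROWS, COLS = 100_000, 11  # hypothetical size, chosen to keep the sketch quick

# A packed buffer of doubles, loosely standing in for Rust-side storage.
packed = array("d", [0.5]) * (ROWS * COLS)


def to_list_of_lists(buf, cols):
    """Materialise one Python float per element plus one list per row."""
    flat = buf.tolist()
    return [flat[i:i + cols] for i in range(0, len(flat), cols)]


def to_flat_view(buf, rows, cols):
    """Expose the same memory as a 2-D view without creating per-element objects."""
    # memoryview.cast needs a byte format on one side, hence the detour via "B".
    return memoryview(buf).cast("B").cast("d", shape=[rows, cols])


t0 = time.perf_counter()
nested = to_list_of_lists(packed, COLS)
t1 = time.perf_counter()
view = to_flat_view(packed, ROWS, COLS)
t2 = time.perf_counter()

print(f"list of lists: {(t1 - t0) * 1e3:.2f} ms")
print(f"buffer view:   {(t2 - t1) * 1e3:.4f} ms")
```

The buffer route is only loosely comparable to handing a contiguous array to numpy, but it illustrates the point: the per-element objects are where the time goes, and that cost grows with the number of elements.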

davidhewitt · 2024-10-11T20:41:16Z

With 0.22 (without the GIL Refs feature) and also on the upcoming 0.23, we now have the changes I mentioned above completed, and I consistently measure calling a noop PyO3 function as faster than calling a noop Python one.

I will close this issue. I'm sure we will find more cases to optimise in future; they can be new issues.

davidhewitt mentioned this issue Feb 13, 2024

Performance: calling overhead #3827

Closed

gi0baro mentioned this issue Feb 27, 2024

Upgrade PyO3 to 0.21 emmett-framework/granian#217

Closed

davidhewitt mentioned this issue May 11, 2024

add flag to skip reference pool mutex if the program doesn't use the pool #4174

Closed

davidhewitt closed this as completed Oct 11, 2024
