
flaky tests when importing module #3262

Closed

shaoner opened this issue Jun 20, 2023 · 9 comments

@shaoner

shaoner commented Jun 20, 2023

Bug Description

In some scenarios, when running cargo test, objects from another thread seem to get popped from the release pool.

Here is an example:

use pyo3::{
    prelude::{pyclass, pymethods, *},
    types::{PyDict, PyModule},
};

#[cfg(test)]
mod tests {
    use super::*;

    #[pyclass]
    struct MyClass {
        pub num: u8,
    }

    #[pymethods]
    impl MyClass {
        fn get_num(&self) -> u8 {
            self.num
        }
    }

    fn run(num: u8, timeout: u8) -> PyResult<()> {
        pyo3::Python::with_gil(|py| {
            let api = PyModule::new(py, "api").unwrap();
            let obj = PyCell::new(py, MyClass { num }).unwrap();
            api.add("obj", obj).unwrap();

            let top_level = PyDict::new(py);
            top_level.set_item("api", api).unwrap();
            py.run(
                "import sys; sys.modules['api'] = api",
                Some(top_level),
                None,
            )
            .unwrap();
            let code = format!(
                r#"
import api
import time

time.sleep({timeout})
assert api.obj.get_num() == {num}
"#
            );
            py.run(&code, None, None)
        })
    }

    #[test]
    fn test_race1() -> PyResult<()> {
        run(1, 0)
    }

    #[test]
    fn test_race2() -> PyResult<()> {
        run(2, 1)
    }
}
  • in test_race1, we register a new module api (via sys.modules) containing an object whose get_num() returns 1
  • in test_race2, we register a new module api (via sys.modules) containing an object whose get_num() returns 2

Now, when we run the embedded Python code, we don't always get the expected object and the assertion sometimes fails.
In test_race2, api.obj can end up referring to the object from test_race1 and return 1 instead.

You'll notice a timeout; it just makes the bug easier to trigger by having one test finish before the other.

I tried to find some similar issues, and it seems we previously had a mutex in some tests (dfbe22b) and that got fixed.

Finally, playing with the locals and globals in py.run seems to solve this, but it triggers other issues (like missing classes), so I wanted to keep this example simple.

Steps to Reproduce

If you run the example in the description a few times (in my case it fails within at most 2 runs):

while true; do cargo test || break; done

Backtrace

No response

Your operating system and version

linux

Your Python version (python --version)

python 3.11.3

Your Rust version (rustc --version)

rustc 1.68.2

Your PyO3 version

0.19

How did you install python? Did you use a virtualenv?

pyenv

Additional Info

No response

@shaoner shaoner added the bug label Jun 20, 2023
@shaoner shaoner changed the title from "flaky import tests" to "flaky tests when importing module" Jun 20, 2023
@adamreichold
Member

adamreichold commented Jun 20, 2023

Could you expand on how you came to the conclusion that

objects from another thread seem to get popped from the release pool.

The pool is used to track deferred reference count increments/decrements, but I do not understand yet how it would affect module look-up?

@shaoner
Author

shaoner commented Jun 20, 2023

So I'm not sure this is really what happens; it seems more to be about object lookup (not module lookup).
I found some previous issues similar to this one (for example here), and if I log the thread id and self in the get_num(&self) method, I see the pointer address swapped with the one from the other thread, so I figured it could be related to that release pool.

If I understand correctly, here is what could happen:

  • test_race1 adds api.obj (1) into the release pool
  • test_race2 adds api.obj (2) into the release pool
  • test_race1 finishes and, when its GILGuard is dropped, api.obj (2) is popped from the release pool instead of api.obj (1)
  • test_race2 finishes later and retrieves api.obj (1)

@adamreichold
Member

adamreichold commented Jun 20, 2023

If I understand correctly, here is what could happen:

"popping form the release pool" could only result in using an object that was already freed due to its reference count reaching zero. But this does not seem to happen in your case where all objects are valid, you just do not end up with the object you expect.

I think the most likely culprit here is that both import and time.sleep release the GIL and there is just insufficient synchronization enabling some interleaving where test_race1 sees the api module installed by test_race2.

@shaoner
Author

shaoner commented Jun 20, 2023

"popping form the release pool" could only result in using an object that was already freed

isn't freeing / using the wrong object possible?

I think the most likely culprit here is that both import and time.sleep release the GIL

Just to clarify, I also have the issue without importing time (and by replacing it with a loop so one test ends before the other)

@davidhewitt
Member

isn't freeing / using the wrong object possible?

Since #887 was released in v0.10, the release pools have been unique to each thread, so I'm highly skeptical that there's a bug here caused by a pool interaction.

Both tests are wrapped in Python::with_gil, so for there to be any issue, there have to be operations within the tests that release the GIL and allow them to race. Also, you're mutating global sys.modules state, which looks like a solid way to introduce race conditions. I wouldn't be surprised if the import sys statement released the GIL and the tests raced at that point. Are you able to reproduce if you avoid touching sys.modules?
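
For illustration, here is a rough, untested sketch of the run helper from the issue description rewritten to avoid sys.modules entirely, passing the api module through the globals of py.run instead (it assumes the same imports and the MyClass type from the snippet above; whether that fits the real use case is a separate question):

fn run_without_sys_modules(num: u8, timeout: u8) -> PyResult<()> {
    pyo3::Python::with_gil(|py| {
        let api = PyModule::new(py, "api")?;
        let obj = PyCell::new(py, MyClass { num })?;
        api.add("obj", obj)?;

        // Hand `api` to the Python code through the globals of this single
        // `py.run` call instead of the process-wide `sys.modules`.
        let globals = PyDict::new(py);
        globals.set_item("api", api)?;

        let code = format!(
            r#"
import time

time.sleep({timeout})
assert api.obj.get_num() == {num}
"#
        );
        py.run(&code, Some(globals), None)
    })
}

Since each call now owns its globals dict, the two tests no longer share any Python-level state, even if the GIL is released inside py.run.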

@shaoner
Author

shaoner commented Jun 20, 2023

release pools are now unique to each thread
got it, sorry for the confusion then

Yes, I'm not able to reproduce the issue without the sys.modules part; I hadn't realized sys.modules could introduce race conditions.

In my specific case, I still need to make the module available for imports, so I guess I could just add a static mutex as a workaround before running the Python code.
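
A minimal sketch of that workaround could look like the following (untested; SYS_MODULES_LOCK is just an illustrative name, and run is the helper from the issue description):

use std::sync::Mutex;

// Serializes the tests that install a module into the process-wide
// `sys.modules`, so they can no longer observe each other's `api` module.
static SYS_MODULES_LOCK: Mutex<()> = Mutex::new(());

#[test]
fn test_race1() -> PyResult<()> {
    let _guard = SYS_MODULES_LOCK.lock().unwrap();
    run(1, 0)
}

#[test]
fn test_race2() -> PyResult<()> {
    let _guard = SYS_MODULES_LOCK.lock().unwrap();
    run(2, 1)
}

Mixing Rust locks with code that acquires the GIL needs some care, though; see the FAQ link further down this thread.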

@adamreichold
Member

adamreichold commented Jun 20, 2023

In my specific case, I still need to make the module available for imports, so I guess I could just add a static mutex as a workaround before running the Python code.

If you want to tie in lazy initialization with GIL-based synchronization, the pyo3::sync::GILOnceCell type might be helpful.
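
For reference, an untested sketch of what that could look like for the api module from this issue (the api_module helper name is hypothetical):

use pyo3::prelude::*;
use pyo3::sync::GILOnceCell;
use pyo3::types::PyModule;

// Created and registered at most once per process; every later caller
// gets a reference to the same module.
static API_MODULE: GILOnceCell<Py<PyModule>> = GILOnceCell::new();

fn api_module(py: Python<'_>) -> &PyModule {
    API_MODULE
        .get_or_init(py, || {
            let api = PyModule::new(py, "api").expect("failed to create module");
            // ... add classes / objects to `api` here ...
            // Register it so that `import api` works from embedded Python code.
            py.import("sys")
                .and_then(|sys| sys.getattr("modules"))
                .and_then(|modules| modules.set_item("api", api))
                .expect("failed to register module");
            api.into()
        })
        .as_ref(py)
}

Note that the module is then shared by all callers, so per-test data (like the differing num values above) would have to be passed to the Python code some other way.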

@shaoner
Author

shaoner commented Jun 20, 2023

Thanks, this was helpful. I'm just surprised that the GIL can be released while running the code.

@davidhewitt
Member

Indeed, as PyO3 developers we have all had this experience, e.g. https://pyo3.rs/v0.19.0/faq#im-experiencing-deadlocks-using-pyo3-with-lazy_static-or-once_cell
