
flaky tests when importing module #3262

Closed

shaoner opened this issue Jun 20, 2023 · 9 comments

@shaoner

shaoner commented Jun 20, 2023

Bug Description

In some scenarios, when running cargo test, objects from another thread seem to get popped from the release pool.

Here is an example:

use pyo3::{
    prelude::{pyclass, pymethods, *},
    types::{PyDict, PyModule},
};

#[cfg(test)]
mod tests {
    use super::*;

    #[pyclass]
    struct MyClass {
        pub num: u8,
    }

    #[pymethods]
    impl MyClass {
        fn get_num(&self) -> u8 {
            self.num
        }
    }

    fn run(num: u8, timeout: u8) -> PyResult<()> {
        pyo3::Python::with_gil(|py| {
            let api = PyModule::new(py, "api").unwrap();
            let obj = PyCell::new(py, MyClass { num }).unwrap();
            api.add("obj", obj).unwrap();

            let top_level = PyDict::new(py);
            top_level.set_item("api", api).unwrap();
            py.run(
                "import sys; sys.modules['api'] = api",
                Some(top_level),
                None,
            )
            .unwrap();
            let code = format!(
                r#"
import api
import time

time.sleep({timeout})
assert api.obj.get_num() == {num}
"#
            );
            py.run(&code, None, None)
        })
    }

    #[test]
    fn test_race1() -> PyResult<()> {
        run(1, 0)
    }

    #[test]
    fn test_race2() -> PyResult<()> {
        run(2, 1)
    }
}
  • in test_race1, we register a new module api (via sys.modules) containing an object whose get_num() returns 1
  • in test_race2, we register a new module api (via sys.modules) containing an object whose get_num() returns 2

Now, when we run the embedded Python code, we don't always get the expected object and the assertion sometimes fails.
In test_race2, api.obj can end up referring to the object from test_race1 and return 1 instead.

You'll notice a timeout; it just makes the bug easier to trigger by having one test finish before the other.

I tried to find some similar issues, and it seems we previously had a mutex in some tests (dfbe22b) and that got fixed.

Finally, playing with the locals and globals in py.run seems to solve this, but it triggers other issues (like missing classes), so I wanted to keep this example simple.

Steps to Reproduce

If you run the example in the description a few times (in my case it fails within at most 2 runs):

while true; do cargo test || break; done

Backtrace

No response

Your operating system and version

linux

Your Python version (python --version)

python 3.11.3

Your Rust version (rustc --version)

rustc 1.68.2

Your PyO3 version

0.19

How did you install python? Did you use a virtualenv?

pyenv

Additional Info

No response

@shaoner shaoner added the bug label Jun 20, 2023
@shaoner shaoner changed the title from "flaky import tests" to "flaky tests when importing module" Jun 20, 2023
@adamreichold
Member

adamreichold commented Jun 20, 2023

Could you expand on how you came to the conclusion that

objects from another thread seem to get popped from the release pool.

The pool is used to track deferred reference count increments/decrements, but I do not understand yet how it would affect module look-up?

@shaoner
Author

shaoner commented Jun 20, 2023

So I'm not sure this is really what happens; it seems more to be about object lookup (not module lookup).
I found some previous issues similar to this one (for example here), and if I log the thread id and self in the get_num(&self) method, I see the pointer address swapped with the one from the other thread, so I figured it could be related to that release pool.

If I understand correctly, here is what could happen:

  • test_race1 adds api.obj (1) into the release pool
  • test_race2 adds api.obj (2) into the release pool
  • test_race1 finishes and, when its GILGuard is dropped, api.obj (2) is popped from the release pool instead of api.obj (1)
  • test_race2 finishes later and retrieves api.obj (1)

@adamreichold
Member

adamreichold commented Jun 20, 2023

If I understand correctly, here is what could happen:

"popping form the release pool" could only result in using an object that was already freed due to its reference count reaching zero. But this does not seem to happen in your case where all objects are valid, you just do not end up with the object you expect.

I think the most likely culprit here is that both import and time.sleep release the GIL and there is just insufficient synchronization enabling some interleaving where test_race1 sees the api module installed by test_race2.

@shaoner
Author

shaoner commented Jun 20, 2023

"popping form the release pool" could only result in using an object that was already freed

isn't freeing / using the wrong object possible?

I think the most likely culprit here is that both import and time.sleep release the GIL

Just to clarify, I also have the issue without importing time (and by replacing it with a loop so one test ends before the other)

@davidhewitt
Member

isn't freeing / using the wrong object possible?

Since #887 was released in v0.10, the release pools have been unique to each thread, so I'm highly skeptical that there's a bug here caused by a pool interaction.

Both tests are wrapped in Python::with_gil, so for there to be any issue, there have to be operations within the tests that release the GIL and allow them to race. Also, you're mutating global sys.modules state, which looks like a solid way to introduce race conditions. I wouldn't be surprised if the import sys statement released the GIL and the tests raced at that point. Are you able to reproduce if you avoid touching sys.modules?
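
For illustration, here is a rough, untested sketch of the run helper from the issue description rewritten to avoid sys.modules entirely, passing the api module through the globals of py.run instead (it assumes the same imports and the MyClass type from the snippet above; whether that fits the real use case is a separate question):

fn run_without_sys_modules(num: u8, timeout: u8) -> PyResult<()> {
    pyo3::Python::with_gil(|py| {
        let api = PyModule::new(py, "api")?;
        let obj = PyCell::new(py, MyClass { num })?;
        api.add("obj", obj)?;

        // Hand `api` to the Python code through the globals of this single
        // `py.run` call instead of the process-wide `sys.modules`.
        let globals = PyDict::new(py);
        globals.set_item("api", api)?;

        let code = format!(
            r#"
import time

time.sleep({timeout})
assert api.obj.get_num() == {num}
"#
        );
        py.run(&code, Some(globals), None)
    })
}

Since each call now owns its globals dict, the two tests no longer share any Python-level state, even if the GIL is released inside py.run.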

@shaoner
Author

shaoner commented Jun 20, 2023

release pools are now unique to each thread
got it, sorry for the confusion then

Yes, I'm not able to reproduce the issue without the sys.modules part; I hadn't realized sys.modules could introduce race conditions.

In my specific case, I still need to make the module available for imports, so I guess I could just add a static mutex as a workaround before running the Python code.
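
A minimal sketch of that workaround could look like the following (untested; SYS_MODULES_LOCK is just an illustrative name, and run is the helper from the issue description):

use std::sync::Mutex;

// Serializes the tests that install a module into the process-wide
// `sys.modules`, so they can no longer observe each other's `api` module.
static SYS_MODULES_LOCK: Mutex<()> = Mutex::new(());

#[test]
fn test_race1() -> PyResult<()> {
    let _guard = SYS_MODULES_LOCK.lock().unwrap();
    run(1, 0)
}

#[test]
fn test_race2() -> PyResult<()> {
    let _guard = SYS_MODULES_LOCK.lock().unwrap();
    run(2, 1)
}

Mixing Rust locks with code that acquires the GIL needs some care, though; see the FAQ link further down this thread.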

@adamreichold
Member

adamreichold commented Jun 20, 2023

In my specific case, I still need to make the module available for imports, so I guess I could just add a static mutex as a workaround before running the Python code.

If you want to tie in lazy initialization with GIL-based synchronization, the pyo3::sync::GILOnceCell type might be helpful.
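
For reference, an untested sketch of what that could look like for the api module from this issue (the api_module helper name is hypothetical):

use pyo3::prelude::*;
use pyo3::sync::GILOnceCell;
use pyo3::types::PyModule;

// Created and registered at most once per process; every later caller
// gets a reference to the same module.
static API_MODULE: GILOnceCell<Py<PyModule>> = GILOnceCell::new();

fn api_module(py: Python<'_>) -> &PyModule {
    API_MODULE
        .get_or_init(py, || {
            let api = PyModule::new(py, "api").expect("failed to create module");
            // ... add classes / objects to `api` here ...
            // Register it so that `import api` works from embedded Python code.
            py.import("sys")
                .and_then(|sys| sys.getattr("modules"))
                .and_then(|modules| modules.set_item("api", api))
                .expect("failed to register module");
            api.into()
        })
        .as_ref(py)
}

Note that the module is then shared by all callers, so per-test data (like the differing num values above) would have to be passed to the Python code some other way.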

@shaoner
Author

shaoner commented Jun 20, 2023

Thanks, this was helpful. I'm just surprised that the GIL can be released while running the code.

@davidhewitt
Member

Indeed, as PyO3 developers we have all had this experience, e.g. https://pyo3.rs/v0.19.0/faq#im-experiencing-deadlocks-using-pyo3-with-lazy_static-or-once_cell
