@brownbaerchen
Contributor

After some changes to Gusto, some strange deadlocks have started to appear in the pySDC tests.

In the Gusto changes, a debug output showing the max/min values of some fields was added, which I presume requires communication across space. Removing this debug information is sufficient to get these tests to pass again.

In order to make sure the space-parallelism is set up correctly, I changed the MPI tests to always use four tasks and distribute them to do space-only, time-only, or space-time parallelism. All of these tests pass without the debug output.
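As a rough illustration of the three layouts with four tasks, the sketch below computes which ranks would share a space communicator and which would share a time (ensemble) communicator. The function name `ensemble_layout` is hypothetical, and the rank-to-group mapping (contiguous ranks share a space communicator) is an assumption modelled on how I understand Firedrake's Ensemble to split the global communicator; the actual tests may arrange ranks differently.

```python
def ensemble_layout(n_tasks, space_size):
    """Group ranks into space and time communicators (hypothetical helper).

    Assumption: contiguous blocks of `space_size` ranks form one space
    communicator, and ranks with the same position inside their block
    form one time (ensemble) communicator.
    """
    if n_tasks % space_size != 0:
        raise ValueError("space_size must divide the number of tasks")

    space_groups = {}
    time_groups = {}
    for rank in range(n_tasks):
        # colour used for the space split: index of the block the rank is in
        space_groups.setdefault(rank // space_size, []).append(rank)
        # colour used for the time split: position of the rank inside its block
        time_groups.setdefault(rank % space_size, []).append(rank)

    return list(space_groups.values()), list(time_groups.values())


# Four tasks: space-only, time-only, and space-time parallelism
for space_size in (4, 1, 2):
    space, in_time = ensemble_layout(4, space_size)
    print(f"space_size={space_size}: space={space}, time={in_time}")
```

With `space_size=2`, every rank is part of a two-rank space communicator and a two-rank time communicator, which is the only one of the three layouts where both kinds of communication happen at once.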

This is definitely some fishy business. Do you know if the debug output works in Gusto without pySDC, @jshipton @tommbendall? If I add return None after this line, the pySDC tests pass with debug level output in Gusto, so I am pretty confident it is related to the remainder of this function.

Btw, I had some issues in freeing communicators at the end of the script on my laptop, so I added the Free function to the Firedrake ensemble communicator wrapper. This should not make a large difference in practice.

@tommbendall

I think the tests are hanging because the field_data.min() and field_data.max() calls that have been added involve parallel communication. Maybe these don't work in the presence of an Ensemble communicator?
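To show why a collective call hidden inside debug logging can hang, here is a toy simulation using threads in place of MPI ranks. It is not Gusto, Firedrake, or MPI code: `threading.Barrier` only stands in for a collective such as a global min/max, and `run_ranks` is a hypothetical helper. It illustrates one possible mechanism (not every rank of the communicator reaches the collective); whether that is what actually happens in these tests is not established.

```python
import threading


def run_ranks(n_ranks, ranks_that_log, timeout=0.5):
    """Simulate `n_ranks` tasks where only some reach a collective call.

    The barrier expects all `n_ranks` participants, like a collective
    over a communicator of that size. A real MPI collective would block
    forever; here a timeout is used so the script terminates.
    """
    barrier = threading.Barrier(n_ranks)
    results = {}

    def task(rank):
        if rank in ranks_that_log:
            try:
                barrier.wait(timeout=timeout)  # the "collective"
                results[rank] = "completed"
            except threading.BrokenBarrierError:
                results[rank] = "hung"
        else:
            results[rank] = "skipped"  # this rank never calls the collective

    threads = [threading.Thread(target=task, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_ranks(4, {0, 1, 2, 3}))  # every rank takes part
print(run_ranks(4, {0, 1}))        # only two of four ranks take part
```

In the second call, ranks 0 and 1 wait for partners that never arrive, which is the behaviour that would show up as a deadlock in a real MPI run.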

Our test suite does include some parallel tests and there hasn't been an issue here.

One potential quick fix might be to avoid having any logging at the DEBUG level. However, pytest runs tests with the DEBUG level by default -- with a quick internet search I couldn't see an easy way to force an individual test to be run at the INFO logging level, but that could be a potential avenue to pursue.
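A minimal sketch of that workaround using only the standard `logging` module: raise the logger's level to INFO so that anything guarded by a DEBUG check is skipped. The logger name `"gusto"` and the idea that the min/max diagnostics sit behind such a check are assumptions here, and `debug_diagnostics_enabled` is a hypothetical helper.

```python
import logging

# Assumption: the library logs through a logger with this name
logger = logging.getLogger("gusto")


def debug_diagnostics_enabled(log):
    """Return True if DEBUG-only diagnostics would run for this logger."""
    return log.isEnabledFor(logging.DEBUG)


logger.setLevel(logging.DEBUG)
enabled_at_debug = debug_diagnostics_enabled(logger)

logger.setLevel(logging.INFO)
enabled_at_info = debug_diagnostics_enabled(logger)

print(enabled_at_debug, enabled_at_info)
```

Inside a pytest test, the `caplog` fixture offers `caplog.set_level(logging.INFO, logger="gusto")` for the same purpose, restoring the previous level when the test ends.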

@tommbendall

Oh scratch what I said about pytest and logging -- I see that's exactly what you've done to get your CI passing.

@brownbaerchen
Contributor Author

> I think the tests are hanging because the field_data.min() and field_data.max() calls that have been added involve parallel communication. Maybe these don't work in the presence of an Ensemble communicator?

I also suspect that this is causing the issue. However, the ensemble communicators just generate a set of space communicators, which should be usable within Gusto and which "don't know" that they are generated from the ensemble, afaik. I am not saying that this is a bug in Gusto; it could be the way I use it. But it is definitely a bug somewhere.

> Our test suite does include some parallel tests and there hasn't been an issue here.

Confusingly, the tests with debug level pass with one or two tasks, just not with four, which makes it even more fishy.

I guess for now we can just ignore this random bug, since we don't even know if it's in pySDC, Gusto or Firedrake. Just thought you should be aware of more weirdness :D

@pancetta pancetta merged commit 10388b6 into Parallel-in-Time:master Feb 13, 2025
87 checks passed