Incorrect Number of Workers for SimpleLauncher+ClusterProvider #1647
Comments
I opened a PR on this topic, #1521, which we later closed. This issue documents the problem outside of a PR so that we can better iterate on potential solutions. I am considering altering the provider class to compute the number of available slots (following the FIXME in strategy.py) and the launcher class to determine the total number of tasks. That would require touching some core classes from Parsl, so I figured I would check in with the team before making changes. Any objections to that approach?
What do you want to do? Give the launchers an understanding of how many nodes they are launching?
Sort of. I plan to add a function to the launcher class to compute the number of slots given a task configuration. It would be a pair to the
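To make the proposal concrete, here is a rough sketch of what such a launcher method could look like. All names here (`get_slot_count`, `WrappedLauncher`, the parameter names) are illustrative assumptions, not Parsl's actual API:

```python
# Hypothetical sketch: launchers report how many worker slots a single
# launched block actually provides. Names are illustrative only.
class Launcher:
    def get_slot_count(self, tasks_per_node: int, nodes_per_block: int) -> int:
        """Number of worker slots one launched block provides."""
        raise NotImplementedError


class WrappedLauncher(Launcher):
    # Launchers like SrunLauncher/AprunLauncher replicate the worker
    # command across every node in the block.
    def get_slot_count(self, tasks_per_node, nodes_per_block):
        return tasks_per_node * nodes_per_block


class SimpleLauncher(Launcher):
    # SimpleLauncher runs the command exactly once, so only one node's
    # worth of worker slots comes up regardless of nodes_per_block.
    def get_slot_count(self, tasks_per_node, nodes_per_block):
        return tasks_per_node
```

The scaling strategy could then ask the launcher for slot counts instead of assuming every launcher replicates across nodes.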
It occurred to me recently that we could always set the parallelism to >1 to cause Parsl to submit more blocks than it thinks are necessary.
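The effect of that workaround can be sketched as follows. This mirrors the general shape of the scaling-strategy arithmetic (outstanding tasks scaled by parallelism, divided by assumed slots per block), not Parsl's exact code; the numbers are made up:

```python
import math

def blocks_needed(outstanding_tasks: int, slots_per_block: int,
                  parallelism: float) -> int:
    """Illustrative block count: tasks inflated by parallelism, divided
    by the number of slots the strategy believes each block provides."""
    return math.ceil((outstanding_tasks * parallelism) / slots_per_block)

# If the strategy thinks a block has 8 slots but SimpleLauncher only
# brings up 2, parallelism=4 compensates by requesting 4x the blocks.
```

With `parallelism=1` and 16 outstanding tasks this requests 2 blocks; with `parallelism=4` it requests 8, papering over the undercounted slots at the cost of possibly over-requesting.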
Describe the bug
Using SimpleLauncher with a ClusterProvider can cause the incorrect number of workers to be calculated.
I run into this problem when using Parsl with Apps that launch tasks onto compute nodes themselves. This causes problems in determining the number of blocks needed to satisfy a list of outstanding tasks.
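Concretely, the strategy assumes each block yields one worker pool per node, while SimpleLauncher starts the worker command only once per block. A toy illustration of the mismatch (numbers are made up):

```python
# Made-up block shape for illustration.
nodes_per_block = 8
tasks_per_node = 4

# What the scaling strategy assumes a block provides:
assumed_slots = nodes_per_block * tasks_per_node   # 32 slots

# What SimpleLauncher actually provides, since it runs the
# worker command once rather than once per node:
actual_slots = tasks_per_node                      # 4 slots
```

With this gap, the strategy believes far fewer blocks are needed than tasks actually require.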
To Reproduce
Use a ClusterProvider (e.g., CobaltProvider) and override the default launcher to be SimpleLauncher.
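A minimal configuration of that shape might look like the following sketch. The class and parameter names come from Parsl, but the values are placeholders, not a tested reproducer:

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import CobaltProvider

# Placeholder values throughout; the key point is overriding the
# provider's default launcher with SimpleLauncher.
config = Config(
    executors=[
        HighThroughputExecutor(
            provider=CobaltProvider(
                queue='debug',
                nodes_per_block=8,
                launcher=SimpleLauncher(),  # override the default launcher
            ),
        )
    ]
)
```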
Expected behavior
The logic in strategy.py determines the correct number of workers and makes allocation requests appropriately.
Environment
Distributed Environment