
Incorrect Number of Workers for SimpleLauncher+ClusterProvider #1647

Open
WardLT opened this issue Apr 20, 2020 · 4 comments

WardLT commented Apr 20, 2020

Describe the bug
Using SimpleLauncher with a ClusterProvider can cause Parsl to calculate an incorrect number of workers.

I run into this problem when using Parsl with Apps that launch tasks onto the compute nodes themselves. The miscalculation causes problems when determining how many blocks are needed to satisfy a list of outstanding tasks.

To Reproduce
Use a ClusterProvider (e.g., CobaltProvider) and override the default launcher to be SimpleLauncher
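
For concreteness, a minimal configuration along these lines might look roughly like the following sketch (queue, account, and block sizes are placeholders, not taken from the report; parameter names follow the Parsl version from around this time):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import CobaltProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label='theta_htex',
            max_workers=1,  # the app launches work onto compute nodes itself
            provider=CobaltProvider(
                queue='debug-cache-quad',   # placeholder
                account='MY_ALLOCATION',    # placeholder
                nodes_per_block=8,
                launcher=SimpleLauncher(),  # override the default launcher
                init_blocks=0,
                min_blocks=0,
                max_blocks=4,
            ),
        )
    ]
)
```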

Expected behavior
The logic in strategy.py determines the correct number of workers and makes allocation requests appropriately.

Environment

  • OS: Linux of some sort
  • Python version: 3.7
  • Parsl version: 0.9.0

Distributed Environment

  • Where are you running the Parsl script from? Theta
  • Where do you need the workers to run? Compute nodes
WardLT added the bug label Apr 20, 2020

WardLT commented Apr 20, 2020

I opened a PR on this topic, #1521, which we later closed because of my inappropriate use of isinstance.

This issue documents the problem outside of a PR so that we can iterate more easily on potential solutions.

I am considering altering the provider class to compute the number of available slots (see the FIXME in strategy.py) and the launcher class to determine the total number of tasks.

That would require touching some of Parsl's core classes, so I figured I should check in with the team before making changes. Any objections to that approach?
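
As a rough illustration of the mismatch (the numbers and arithmetic are illustrative, not the literal strategy.py code): the scaling strategy estimates a block's capacity from the node count, while SimpleLauncher only starts the worker command once.

```python
# Illustrative only: the shape of the over-estimate, not Parsl's actual code.
import math

nodes_per_block = 8
workers_per_node = 4

# What the scaling strategy assumes each block can run:
assumed_slots_per_block = nodes_per_block * workers_per_node   # 32

# What a SimpleLauncher block actually provides (the command runs on one node):
actual_slots_per_block = workers_per_node                      # 4

outstanding_tasks = 64
blocks_requested = math.ceil(outstanding_tasks / assumed_slots_per_block)  # 2
blocks_needed = math.ceil(outstanding_tasks / actual_slots_per_block)      # 16
print(blocks_requested, blocks_needed)
```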

benclifford (Collaborator) commented:

What do you want to do? Give the launchers an understanding of how many nodes they are launching?


WardLT commented Apr 21, 2020

Sort of. I plan to add a function to the launcher class to compute the number of slots given a task configuration. It would be a counterpart to the launcher's __call__ function and would not require storing more state (e.g., how many slots it is supposed to create) in the launcher itself.
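
A hypothetical sketch of that shape (the class and method names are illustrative, not existing Parsl API; the __call__ signature mirrors Parsl's launcher interface):

```python
class SimpleLauncherSketch:
    """Illustrative stand-in for parsl.launchers.SimpleLauncher, not real Parsl code."""

    def __call__(self, command: str, tasks_per_node: int, nodes_per_block: int) -> str:
        # SimpleLauncher semantics: wrap the command once, with no per-node fan-out.
        return command

    def slots_per_block(self, tasks_per_node: int, nodes_per_block: int) -> int:
        # Proposed counterpart to __call__: the command runs on a single node,
        # so a block contributes tasks_per_node slots rather than
        # tasks_per_node * nodes_per_block. No extra state is stored.
        return tasks_per_node
```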


WardLT commented Sep 19, 2023

It occurred to me recently that we could always set the parallelism to >1 to cause Parsl to submit more blocks than it thinks are necessary.
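
A sketch of what that workaround could look like on the provider (values are placeholders; parallelism is normally kept in [0, 1], so treating it as a multiplier here is a deliberate over-request):

```python
from parsl.launchers import SimpleLauncher
from parsl.providers import CobaltProvider

nodes_per_block = 8

provider = CobaltProvider(
    nodes_per_block=nodes_per_block,
    launcher=SimpleLauncher(),
    # With one worker per block, the strategy over-counts slots by roughly a
    # factor of nodes_per_block, so inflate parallelism by the same factor.
    parallelism=float(nodes_per_block),
    init_blocks=0,
    min_blocks=0,
    max_blocks=8,
)
```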

WardLT added a commit to exalearn/ExaMol that referenced this issue Sep 19, 2023
WardLT added a commit to exalearn/ExaMol that referenced this issue Sep 21, 2023
* Use parallelism to get correct number of blocks

See Parsl/parsl#1647

* Compute the solvation energy too

Also save to gzipped files. The size is starting to be notable

* Implement a multi-fidelity MPNN

Uses delta learning to predict the properties at intermediate levels

* No longer test for errors being thrown

* Only output the highest level of fidelity

Also some flake8 fixes

* Document how multi-fidelity training works

* Minor changes to the documentation

* Use a more robust relaxation technique (#108)

* Use MDMin to reduce to 0.1 eV/Ang, then BFGS

Still want to test this before we merge to main,
but fixes #106

* Use FIRE and a higher threshold for switching

* Use molecule which takes longer to optimize in test

* Use isnan and not isinf for detecting placeholders

* Switch to one scale layer per network

* Compute diff between adjacent levels, not from first

* Also fix how we compute inference

Delta between adjacent levels, not the beginning

Changed our test routine to ensure we do this right

* Initial training runs for multi-fidelity learning

* Update data loader test to handle new fixtures

* Train using subset of available data, test on all fidelities

* Minor bug: start decaying LR immediately