
Incorrect Number of Workers for SimpleLauncher+ClusterProvider #1647

Open
WardLT opened this issue Apr 20, 2020 · 4 comments

WardLT commented Apr 20, 2020

Describe the bug
Using SimpleLauncher with a ClusterProvider can cause Parsl to calculate an incorrect number of workers.

I run into this problem when using Parsl with Apps that launch tasks onto the compute nodes themselves. The miscalculation causes problems when determining how many blocks are needed to satisfy a list of outstanding tasks.

To Reproduce
Use a ClusterProvider (e.g., CobaltProvider) and override the default launcher to be SimpleLauncher
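
For concreteness, a minimal configuration along these lines might look roughly like the following sketch (queue, account, and block sizes are placeholders, not taken from the report; parameter names follow the Parsl version from around this time):

```python
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import CobaltProvider

config = Config(
    executors=[
        HighThroughputExecutor(
            label='theta_htex',
            max_workers=1,  # the app launches work onto compute nodes itself
            provider=CobaltProvider(
                queue='debug-cache-quad',   # placeholder
                account='MY_ALLOCATION',    # placeholder
                nodes_per_block=8,
                launcher=SimpleLauncher(),  # override the default launcher
                init_blocks=0,
                min_blocks=0,
                max_blocks=4,
            ),
        )
    ]
)
```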

Expected behavior
The logic in strategy.py determines the correct number of workers and makes allocation requests appropriately.

Environment

  • OS: Linux of some sort
  • Python version: 3.7
  • Parsl version: 0.9.0

Distributed Environment

  • Where are you running the Parsl script from? Theta
  • Where do you need the workers to run? Compute nodes
WardLT added the bug label Apr 20, 2020

WardLT commented Apr 20, 2020

I opened a PR on this topic, #1521, which we later closed because of my inappropriate use of isinstance.

This issue documents the problem outside of a PR so that we can iterate more easily on potential solutions.

I am considering altering the provider class to compute the number of available slots (see the FIXME in strategy.py) and the launcher class to determine the total number of tasks.

That would require touching some of Parsl's core classes, so I figured I should check in with the team before making changes. Any objections to that approach?
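
As a rough illustration of the mismatch (the numbers and arithmetic are illustrative, not the literal strategy.py code): the scaling strategy estimates a block's capacity from the node count, while SimpleLauncher only starts the worker command once.

```python
# Illustrative only: the shape of the over-estimate, not Parsl's actual code.
import math

nodes_per_block = 8
workers_per_node = 4

# What the scaling strategy assumes each block can run:
assumed_slots_per_block = nodes_per_block * workers_per_node   # 32

# What a SimpleLauncher block actually provides (the command runs on one node):
actual_slots_per_block = workers_per_node                      # 4

outstanding_tasks = 64
blocks_requested = math.ceil(outstanding_tasks / assumed_slots_per_block)  # 2
blocks_needed = math.ceil(outstanding_tasks / actual_slots_per_block)      # 16
print(blocks_requested, blocks_needed)
```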

benclifford (Collaborator) commented:

What do you want to do? Give the launchers an understanding of how many nodes they are launching?


WardLT commented Apr 21, 2020

Sort of. I plan to add a function to the launcher class to compute the number of slots given a task configuration. It would be a counterpart to the launcher's __call__ function and would not require storing more state (e.g., how many slots it is supposed to create) in the launcher itself.
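
A hypothetical sketch of that shape (the class and method names are illustrative, not existing Parsl API; the __call__ signature mirrors Parsl's launcher interface):

```python
class SimpleLauncherSketch:
    """Illustrative stand-in for parsl.launchers.SimpleLauncher, not real Parsl code."""

    def __call__(self, command: str, tasks_per_node: int, nodes_per_block: int) -> str:
        # SimpleLauncher semantics: wrap the command once, with no per-node fan-out.
        return command

    def slots_per_block(self, tasks_per_node: int, nodes_per_block: int) -> int:
        # Proposed counterpart to __call__: the command runs on a single node,
        # so a block contributes tasks_per_node slots rather than
        # tasks_per_node * nodes_per_block. No extra state is stored.
        return tasks_per_node
```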


WardLT commented Sep 19, 2023

It occurred to me recently that we could always set the parallelism to >1 to cause Parsl to submit more blocks than it thinks are necessary.
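
A sketch of what that workaround could look like on the provider (values are placeholders; parallelism is normally kept in [0, 1], so treating it as a multiplier here is a deliberate over-request):

```python
from parsl.launchers import SimpleLauncher
from parsl.providers import CobaltProvider

nodes_per_block = 8

provider = CobaltProvider(
    nodes_per_block=nodes_per_block,
    launcher=SimpleLauncher(),
    # With one worker per block, the strategy over-counts slots by roughly a
    # factor of nodes_per_block, so inflate parallelism by the same factor.
    parallelism=float(nodes_per_block),
    init_blocks=0,
    min_blocks=0,
    max_blocks=8,
)
```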

WardLT added a commit to exalearn/ExaMol that referenced this issue Sep 19, 2023
WardLT added a commit to exalearn/ExaMol that referenced this issue Sep 21, 2023
* Use parallelism to get correct number of blocks

See Parsl/parsl#1647

* Compute the solvation energy too

Also save to gzipped files. The size is starting to be notable

* Implement a multi-fidelity MPNN

Uses delta learning to predict the properties at intermediate levels

* No longer test for errors being thrown

* Only output the highest level of fidelity

Also some flake8 fixes

* Document how multi-fidelity training works

* Minor changes to the documentation

* Use a more robust relaxation technique (#108)

* Use MDMin to reduce to 0.1 eV/Ang, then BFGS

Still want to test this before we merge to main,
but fixes #106

* Use FIRE and a higher threshold for switching

* Use molecule which takes longer to optimize in test

* Use isnan and not isinf for detecting placeholders

* Switch to one scale layer per network

* Compute diff between adjacent levels, not from first

* Also fix how we compute inference

Delta between adjacent levels, not the beginning

Changed our test routine to ensure we do this right

* Initial training runs for multi-fidelity learning

* Update data loader test to handle new fixtures

* Train using subset of available data, test on all fidelities

* Minor bug: start decaying LR immediately