Make WQ scaling aware of core counts #3415

Draft: wants to merge 3 commits into master
Conversation

benclifford (Collaborator)

@svandenhaute was getting frustrated by having different sized tasks and the current parsl-vs-WQ scaling not being able to deal with that - see issue #3414. As of existing parsl master, the best we could come up with was estimating an average task size, but that does not work well when tasks are far from average (eg a period of over-sized tasks followed by a period of under-sized tasks).

This PR demonstrates changing scaling (WQ only) to pay attention to the number of cores requested by a task in its parsl_resource_specification, and to use that in the scaling calculation.
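
For context, this is the documented pattern for declaring per-task resources with the WorkQueueExecutor; the app name and values below are illustrative, and a Work Queue configuration is assumed to be loaded already:

import parsl
from parsl.app.app import python_app

# Each app declares a default resource specification, and each invocation
# can override it, so one workflow can mix small and large tasks.
@python_app
def simulate(x, parsl_resource_specification={'cores': 1, 'memory': 1000, 'disk': 1000}):
    return x * x

small = simulate(2, parsl_resource_specification={'cores': 1, 'memory': 1000, 'disk': 1000})
large = simulate(3, parsl_resource_specification={'cores': 16, 'memory': 8000, 'disk': 1000})

The scaling change in this PR would sum those per-task core requests rather than just counting tasks.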

In @svandenhaute's actual workload, the number of cores is used as a proxy for "amount of node needed" - both CPU cores and number of GPUs - and this PR is probably better developed in that direction.

This probably has some relevance to an idea I talked about with @yadudoc for his HTEX MPI prototype, where we considered how to support multiple blocks - see #3323. That issue talks about how the interchange might allocate MPI tasks to different blocks; this current PR might lead in the direction of the scaling code understanding how many blocks are needed for the currently submitted workload, so adding MPI awareness could build on the same idea (a sketch of that calculation follows below).
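
As a rough sketch of that direction (not the PR's actual diff; all names here are hypothetical), a core-aware demand estimate sums requested cores instead of multiplying a task count by an average task size:

import math

def blocks_needed(task_core_requests, cores_per_block):
    # Sum the cores requested by the outstanding tasks and divide by
    # the cores one block provides, rounding up.
    total_cores = sum(task_core_requests)
    return math.ceil(total_cores / cores_per_block)

# A burst of over-sized tasks followed by under-sized ones is counted
# directly, instead of being flattened into one misleading average:
print(blocks_needed([16, 16, 1, 1, 1, 1], cores_per_block=16))  # -> 3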

svandenhaute commented May 5, 2024

Parsl/WQ seems to hang at the end of workflow execution with this particular version of the code. Any ideas on why this could be the case? Happens for multiple providers and multiple types of apps, and systematically disappears when I switch to 2024.04.29.

Example stack trace upon Ctrl+C:

Traceback (most recent call last):
  File "/home/sandervandenhaute/micromamba/envs/psiflow_env/lib/python3.10/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/home/sandervandenhaute/micromamba/envs/psiflow_env/lib/python3.10/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/home/sandervandenhaute/micromamba/envs/psiflow_env/lib/python3.10/multiprocessing/popen_fork.py", line 43, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/sandervandenhaute/micromamba/envs/psiflow_env/lib/python3.10/multiprocessing/popen_fork.py", line 27, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt:

benclifford (Collaborator, Author)

Since 2024.04.29, there was #3407, which was part of ongoing rework in Parsl of how shutdowns happen - WQ hangs at shutdown if you didn't shut Parsl down.

Since #3165, parsl has not done this automatically at exit, because Python 3.12 doesn't allow that any more, so users (i.e. you) have more responsibility for shutting things down explicitly - eg by putting your workflow into a with block (like the README change in #3404), or, less preferably, by calling parsl.dfk().cleanup() at the end; both styles are sketched below.
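
A minimal sketch of those two styles, assuming a recent parsl where the loaded DataFlowKernel works as a context manager; the Config here is a placeholder for a real executor/provider setup:

import parsl
from parsl.config import Config

config = Config()  # placeholder: use your real WorkQueueExecutor setup

# Preferred: the with block shuts Parsl down when the workflow scope ends.
with parsl.load(config):
    pass  # define apps and wait on their futures here

# Less preferable alternative (don't combine with the above in one script):
# parsl.load(config)
# ... run the workflow ...
# parsl.dfk().cleanup()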

I'd like Work Queue and Task Vine to not hang when the process falls off the end without a parsl shutdown, but that's not where those executors are right now.

svandenhaute

Ah right, thanks!

benclifford changed the title from "sketch for sander of core-aware scaling for wq" to "Make scaling be task-size aware" on Jun 4, 2024
benclifford changed the title from "Make scaling be task-size aware" to "Make scaling aware of core counts" on Jun 4, 2024
benclifford changed the title from "Make scaling aware of core counts" to "Make WQ scaling aware of core counts" on Jun 4, 2024