
Add an MPIExecutor #3423

Merged (18 commits merged into Parsl:master, May 17, 2024)
Conversation

WardLT (Contributor) commented May 8, 2024

Description

Add MPIExecutor -- a wrapper class over HTEx which fixes or removes options irrelevant when enable_mpi_mode=True.

To do list:

  • Gather consensus on what to call "max_workers"? I've started with max_tasks, but might change that to max_tasks_per_block.
  • Write a test case ensuring the docstring and init function are not out of date (mechanism for ensuring changes to HTEx don't break MPIEx)
  • Write a test case showing MPIEx works as intended
  • Update Parsl documentation with new example

Also, whoops, I included a commit from #3418. LMK if I should bother rebasing

Changed Behaviour

No existing behavior is changed; this PR only adds new functionality.

Fixes

N/A

Type of change

Choose which options apply, and delete the ones which do not apply.

  • New feature
  • Update to human readable text: Documentation/error messages/comments
  • Code maintenance/cleanup

WardLT added 3 commits May 7, 2024 16:51
SingleNodeLauncher creates too many workers when running in MPI mode
Andrew-S-Rosen (Contributor) commented May 8, 2024

Gather consensus on what to call "max_workers"? I've started with max_tasks, but might change that to max_tasks_per_block.

I'm going to think out loud here. With the HighThroughputExecutor, I usually set max_workers_per_node=nodes_per_app * nodes_per_block [NB: this used to be max_workers before the rename to max_workers_per_node, which only complicated matters.].

So, let's consider the following scenario: 5 Slurm allocations (blocks), each reserving 4 nodes, and I want each task (App) to run via MPI over 2 nodes in the same block. This means I'd usually set max_workers_per_node to 2*4=8.

Now let's consider what happens with the same scenario but using the MPIExecutor where we call max_workers_per_node by the name max_tasks_per_block instead. The parameter max_tasks_per_block=8 doesn't seem to check out because I have 4 nodes on a block, but only 2 (not 8) tasks are running on that block.

Is there a mistake somewhere in my logic? My gut feeling is that the confusion comes from the max_workers_per_node rename. If I understand correctly, the change from max_workers to max_workers_per_node in HighThroughputExecutor was just a name change and nothing more, even though in our use case it does not really make sense to think of it as "per node". Alternatively, I could just be wrong, although I thought I had tested this out.


For full transparency, here is my "old" way in this hypothetical example:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.launchers import SimpleLauncher
from parsl.providers import SlurmProvider

max_blocks = 5
nodes_per_block = 4
nodes_per_app = 2
cores_per_node = 128

config = Config(
    strategy="htex_auto_scale",
    executors=[
        HighThroughputExecutor(
            label="test",
            max_workers=nodes_per_app * nodes_per_block,  # or, equivalently, `max_workers_per_node`
            cores_per_worker=1e-6,
            provider=SlurmProvider(
                account="MyAccountName",
                qos="debug",
                constraint="cpu",
                worker_init=f"module load vasp/6.4.1-cpu && export ASE_VASP_COMMAND='srun -N {nodes_per_app} --ntasks-per-node={cores_per_node} --cpu_bind=cores'",
                walltime="00:10:00",
                nodes_per_block=nodes_per_block,
                init_blocks=0,
                max_blocks=max_blocks,
                launcher=SimpleLauncher(),
            ),
        )
    ],
)

parsl.load(config)

WardLT (Contributor, Author) commented May 9, 2024

My understanding is that the max_workers_per_node in your example should be nodes_per_block // nodes_per_app and not nodes_per_block * nodes_per_app.

The nomenclature is weird because, normally, each node of each block gets one worker pool and each pool runs up to max_workers_per_node workers.

Only one pool gets launched per block in MPI mode (regardless of node count), which means the total number of workers available to launch equals max_workers_per_node.

Does that clarify? If so, maybe I rename max_workers_per_node to max_workers_per_block to avoid introducing another term?
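
To make the arithmetic concrete for the scenario above, here is a rough sketch (the final parameter name was still being settled at this point, so the variable names are illustrative):

nodes_per_block = 4
nodes_per_app = 2

# In MPI mode only one worker pool is launched per block, so this is also the
# number of MPI tasks that can run in a block at once:
workers_per_block = nodes_per_block // nodes_per_app  # 2, not nodes_per_app * nodes_per_block = 8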

Andrew-S-Rosen (Contributor) commented May 9, 2024

Thanks @WardLT! That makes perfect sense. I tested things out with HighThroughputExecutor, and it looks like I was misled by what was being printed in the VASP OUTCAR. This is even more justification for the need to have a dedicated MPIExecutor. 😉

As for the original question about the name, I think adding a _per qualifier is worthwhile for clarity's sake. Using max_tasks_per_block seems wise then.

Comment on lines +39 to +40
heartbeat_threshold: int = 120,
heartbeat_period: int = 30,
Contributor

Do we need to enumerate all of base class' (HighThroughputExecutor) keyword arguments if we aren't modifying the defaults? My concern is if we update the base class, then we might forget to update MPIExecutor. It seems like a safer approach might be to only explicitly have new and updated keyword arguments and then pass any remaining **kwargs to the base class. Thoughts?
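
A rough sketch of the **kwargs approach being proposed (illustrative only; the new parameter name and hard-coded options here are assumptions, not code from this PR):

from parsl.executors import HighThroughputExecutor


class MPIExecutor(HighThroughputExecutor):
    """Only spell out the new or changed options; forward the rest."""

    def __init__(self, max_workers_per_block: int = 1, **kwargs):
        super().__init__(
            enable_mpi_mode=True,                         # always on for this executor
            max_workers_per_node=max_workers_per_block,   # MPI-flavored name for the worker limit
            **kwargs,                                     # everything else goes to the base class unchanged
        )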

WardLT (Contributor, Author) May 9, 2024

Using **kwargs was my first thought too, but I wanted to try enumerating them to give better IDE support.

I made a unit test that will fail if the signature of the base class changes. Does that seem sufficient?
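
For illustration, a minimal sketch of what such a test could look like (not necessarily the test added in this PR; it assumes MPIExecutor ends up importable from parsl.executors, and the set of excluded options below is illustrative):

import inspect

from parsl.executors import HighThroughputExecutor, MPIExecutor


def test_mpi_executor_tracks_htex_signature():
    htex_kwargs = set(inspect.signature(HighThroughputExecutor.__init__).parameters)
    mpi_kwargs = set(inspect.signature(MPIExecutor.__init__).parameters)

    # Options MPIExecutor hard-codes or hides on purpose (illustrative list only)
    fixed_or_removed = {"enable_mpi_mode", "cores_per_worker", "max_workers_per_node"}

    # A keyword added to HighThroughputExecutor later will make this fail until
    # MPIExecutor is updated to forward (or deliberately exclude) it.
    assert htex_kwargs - fixed_or_removed <= mpi_kwargs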

Contributor

Sounds reasonable to me!

(I really wish IDEs supported better inspection of classes...)

benclifford (Collaborator)

no brainpower for this for the next day or so but will give some more in-depth opinions after that

encrypted: bool = False):
super().__init__(
# Hard-coded settings
cores_per_worker=1e-9, # Ensures there will always be enough workers
Collaborator

This is not "always": there is a number of workers at which this will no longer work, because 1e-9 is not 1/∞, just an approximation of it.

Contributor Author

If someone runs >1 billion MPI tasks in a single block, I'd be quite curious to know how they did it. Maybe I'll change it to "there will be an absurd number of workers"?

In any case, isn't the edge case 1/max_tasks here, since we know that there will never be more than max_tasks workers? It might be cleaner that way.
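
Roughly, the arithmetic looks like this (a sketch only, not Parsl's exact sizing logic; the core count is hypothetical):

import math

available_cores = 64   # hypothetical manager node
max_tasks = 8

# The pool size is limited by something like floor(available_cores / cores_per_worker)
# in addition to the explicit worker cap, so 1e-9 gives an enormous but finite ceiling...
ceiling_with_tiny_value = math.floor(available_cores / 1e-9)            # 64,000,000,000

# ...while 1 / max_tasks guarantees even a single core supports max_tasks workers,
# leaving the explicit worker limit to do the real capping.
ceiling_with_inverse = math.floor(available_cores / (1.0 / max_tasks))  # 512, comfortably above max_tasks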

Collaborator

Right, something like that, maybe with some other factors in there too.

The correct solution IMO is (separately) to make cores per worker be an optional limit, rather than always taking part in size computations. That way you can choose to not specify it at all, rather than "I won't change this piece of code, so I have to write the other piece of code to trick the first piece of code".

Collaborator

(the original assumption for this parameter is "of course a task needs cores on the manager node to execute" but that's not how htex-mpi does things)

benclifford (Collaborator)

For the above max_workers_per_* discussion: more generally, I'd like to push the internals of htex to talk about "per_pool" when it's relevant, although that doesn't necessarily need to be reflected in the user-facing parameters that this PR is dealing with. But as the above discussion shows, "per node" and "per block" are internally not going to be the right words to use.

benclifford reopened this May 10, 2024
WardLT (Contributor, Author) commented May 13, 2024

I added a first draft of the MPI Executor docs

@Andrew-S-Rosen and @tomdemeyere , you are a good target audience. How does it seem to you? https://parsl--3423.org.readthedocs.build/en/3423/userguide/mpi_apps.html

WardLT marked this pull request as ready for review May 13, 2024 18:59
@tomdemeyere

Nice! I am excited to try it with the modified MPI mode in #3418, but even without it, this seems much simpler and more accessible.

I see everything that I would need to launch MPI apps. Would this work with max_blocks > 1?

benclifford (Collaborator) commented May 13, 2024

@tomdemeyere there's not enough awareness of tasks having different resource requirements (number of ranks for each task) in the interchange, which dispatches tasks to (in the htex-mpi case) different blocks. So you might end up with some tasks queued in one block due to insufficient resources while another block has capacity.

That awareness was one of the targets of this project which was ultimately not funded: #3323

Andrew-S-Rosen (Contributor) left a comment

@WardLT: Awesome!! This is looking great. Some minor corrections below :)

HTEX and MPI Tasks
------------------
1. Set ``max_workers_per_block`` to maximum number of tasks to run per block of compute nodes.
This value is typically the number of nodes per task divided by the number of nodes per task.
Contributor

This value is typically the number of nodes per task divided by the number of nodes per task.

Wow, @WardLT, fanciest way to say "1" ever. 🤣 (Clearly a typo.)

------------------
1. Set ``max_workers_per_block`` to maximum number of tasks to run per block of compute nodes.
This value is typically the number of nodes per task divided by the number of nodes per task.
2. Set ``mpi_launcher`` to launcher used for your application.
Contributor

Small typo: "to [the] launcher"


Create the app by first defining a function which includes ``parsl_resource_specification`` keyword argument.
Contributor

There should be a hyperlink here to what parsl_resource_specification is because the uninitiated reader likely has no clue.



Next we define the CosmicTagger MPI application. TODO: Ask Khalid for help.
@bash_app
Contributor

This is not a @python_app but should be 😅

print(f"Stdout : {future.stdout}")
print(f"Stderr : {future.stderr}")
futures[(num_nodes, batchsize)] = future
# Resources in terms of nodes and how ranks are to be distributed are set on a per app
Contributor

Do we always need a parsl_resource_specification? I'm just thinking back to the approach used in your older videos where we never used such a thing with the HighThroughputExecutor. If it's omitted, I assume the implicit assumption is that all tasks have the same resource requirements but it will still function?

Collaborator

There has to be some default made in that case. In the workqueue case, for example, omitting a core count (or memory specification) for a task will allocate one entire worker node (whatever the relevant worker node happens to have on it).

That is one obvious default. The other is that you get 1 rank.

Contributor Author

Yes, you do always need one and - at the moment (see #3427) - Parsl will crash without it being provided as a keyword argument.
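
For readers of this thread, a minimal sketch of what that looks like in practice (hypothetical app name and command; the resource-specification keys follow the num_nodes / ranks_per_node pattern used in the docs):

from parsl import bash_app


@bash_app
def mpi_hello(parsl_resource_specification=None):
    # The worker substitutes $PARSL_MPI_PREFIX with the appropriate launcher invocation
    return '$PARSL_MPI_PREFIX hostname'


future = mpi_hello(parsl_resource_specification={"num_nodes": 2, "ranks_per_node": 2})
print(future.result())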

Contributor

Makes sense. This is not your problem to deal with so to speak, but this just makes it even more awkward that parsl_resource_specification is not a keyword argument of the decorator itself. Having to redefine all of your functions to take this (hidden) keyword argument just to parallelize things is a bit hacky and counter-intuitive.

Comment on lines 106 to 108
"num_nodes" = 2,
"ranks_per_node" = 2,
"num_ranks" = 4,
Contributor

Presumably these should be : instead of =.


Advanced: More Environment Variables
Contributor

Very selfish question unrelated to this specific line:

I am usually running VASP (via ASE) on the GPU nodes of Perlmutter. I assume one is still using a SimpleLauncher() here since it's an "internal" call to srun?

For context, a Slurm script usually looks something like the following:

#!/bin/bash
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH -G 4
#SBATCH -q debug
#SBATCH -J vasp_job
#SBATCH -t 00:30:00
#SBATCH -A matgen_g

module load vasp/6.4.1-gpu
export OMP_NUM_THREADS=8
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
export ASE_VASP_COMMAND="srun -n 4 -c 32 --cpu_bind=cores -G 4 --gpu-bind=none vasp_std"

python run_my_jobs.py

where run_my_jobs.py is some arbitrary program that in it launches a bunch of @python_apps representing one VASP task per App.

It's also a bit awkward here because the concept of ranks is mostly obscured...

Contributor Author

I think this application is possible by replacing the parts of the VASP command definition which deal with the total ranks with \$PARSL_MPI_PREFIX. Escaping the $ will (hopefully!) result in the environment variable not being resolved when ASE_VASP_COMMAND is defined, but instead when the contents of ASE_VASP_COMMAND are used to start a subprocess somewhere deep in the bowels of ASE.

export ASE_VASP_COMMAND="\$PARSL_MPI_PREFIX -c 32 --cpu_bind=cores -G 4 --gpu-bind=none vasp_std"

That said, I'm not sure if -G is the "total" number of GPUs or the number per node. You might want to switch to --gpus-per-node instead of -G.

Woof, that wasn't great.
WardLT (Contributor, Author) commented May 14, 2024

Alright. I've addressed Andrew's comments and fixed a bunch of other grammar issues.

I think this is ready for review.

@@ -208,6 +208,8 @@ def get_result(self, block: bool, timeout: float):
"""Return result and relinquish provisioned nodes"""
result_pkl = self.pending_result_q.get(block, timeout=timeout)
result_dict = pickle.loads(result_pkl)
# TODO (wardlt): If the task did not request nodes, it won't be in `self._map_tasks_to_nodes`. Causes Parsl to hang
# See Issue #3427
Collaborator

I feel like this message could be put as something more machine-readable, rather than a comment:

assert task_in in self._map_tasks_to_nodes, "You are about to experience issue #3427"

that might at least get some message somewhere at runtime (some stderr somewhere?), as well as making the message visible in source code like it is now.

Contributor Author

Yeah, I should have done that. I was leery of touching the code, but it's already broken 🤷🏼

"ranks_per_node": 2,
})
env_vars = future.result()
assert env_vars['PARSL_NUM_RANKS'] == '4'
Collaborator

the mathematician in me would use two different prime numbers here, not both 2, to get some difference in the values of the parameters (e.g. so that you're additionally testing that you aren't being allocated num_nodes * num_nodes ranks). Then CS people can be forever confused about why you're not using a power of two...

Contributor Author

Using primes is a nice trick
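
For example, a sketch of that suggestion (not the test as merged; get_env_vars stands in for the hypothetical test app quoted above):

resource_spec = {
    "num_nodes": 2,
    "ranks_per_node": 3,
}
future = get_env_vars(parsl_resource_specification=resource_spec)
env_vars = future.result()
# With distinct primes, 6 can only come from 2 * 3; with 2 and 2, a buggy
# num_nodes * num_nodes would also give 4 and the mistake would go unnoticed.
assert env_vars['PARSL_NUM_RANKS'] == '6'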

executors=[
MPIExecutor(
address=address_by_interface('bond0'),
max_workers_per_block=2, # Assuming 2 nodes per task
Collaborator

here's that use of 2 twice for different meanings I mentioned elsewhere!

WardLT (Contributor, Author) commented May 15, 2024

Thanks, @benclifford ! I think I've addressed each of your points.

WardLT requested a review from benclifford May 16, 2024 02:58
benclifford merged commit b214714 into Parsl:master May 17, 2024
6 checks passed
to spawn multi-node tasks.

Specify the maximum number of multi-node tasks to run at once using ``max_workers_per_block``.
The maximum number should be smaller than the ``nodes_per_block`` in the Provider.
Contributor

@WardLT: Just looking at this again. Should it be "The value should be less than or equal to the nodes_per_block"?

Contributor Author

Yes, it should be less than or equal to

Andrew-S-Rosen added a commit to Andrew-S-Rosen/parsl that referenced this pull request May 22, 2024
As noted in Parsl#3423 (comment), the docstring for `MPIExecutor` said `max_workers_per_block` should be less than `nodes_per_block` when in reality it should be less than or equal to.

CC @WardLT.
benclifford pushed a commit that referenced this pull request May 23, 2024
As noted in #3423 (comment), the docstring for `MPIExecutor` said `max_workers_per_block` should be less than `nodes_per_block` when in reality it should be less than or equal to.