Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make HTEX reject empty resource specification in MPI mode #3442

Merged
merged 11 commits into from
May 27, 2024

Conversation

R1j1t
Copy link
Contributor

@R1j1t R1j1t commented May 20, 2024

Description

Added a check in HighThroughputExecutor.submit in cases where parsl_resource_specification is empty when enable_mpi_mode=True

Possible changes:

  • update user doc (user doc does mention the need to configure parsl_resource_specification
  • Update exception InvalidResourceSpecification as it currently expects a set
  • fix test case?
  • Refactor test_mpi_mode_enabled.py? Many function arguments are never used.

Changed Behaviour

Parsl fails with an error message, instead of hanging on the user

Fixes

Fixes #3427

Type of change

Choose which options apply, and delete the ones which do not apply.

  • Bug fix
  • Update to human readable text: Documentation/error messages/comments

@R1j1t
Copy link
Contributor Author

R1j1t commented May 20, 2024

Also, happy to hear feedback on the PR!

@R1j1t R1j1t changed the title added bugfix and corresponding tests [BUG] empty parsl_resource_specification in mpi mode causes parsl to hang May 20, 2024
Copy link
Collaborator

@benclifford benclifford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments on specific lines

@R1j1t R1j1t requested a review from benclifford May 23, 2024 21:07
@benclifford benclifford requested a review from WardLT May 23, 2024 21:25
@benclifford
Copy link
Collaborator

ok, reviewing this again, I feel like you should take a step back and review what validate_resource_spec is meant to be doing - rather than what it actually does.

For both MPI mode and not-MPI mode, there are some keys that are required, some that are optional and some that are forbidden - different for each mode.

The current validation code in master doesn't distinguish between those two modes - and in this PR you've added in that distinction, which is probably a good thing.

And the current validation code doesn't look for which keys/combinations are required at all. Your current PR doesn't look to see which keys are required either, it just requires that something is specified.

So maybe it would be good, perhaps with help from @WardLT and @yadudoc , to figure out and write down for both modes which keys are required, optional and forbidden - based on understanding what the code does with those keys. And then rewrite validate_resource_spec to validate those rules.

@R1j1t
Copy link
Contributor Author

R1j1t commented May 24, 2024

Thanks @benclifford for the comment and it makes perfect sense to check for required parameters be passed. Let me dig in the code for the required parameters as well. I will confirm my finding with @WardLT and @yadudoc!

@benclifford
Copy link
Collaborator

another question i have is in mpi mode, there are three interacting parameters and this validate function does more than validate - it computes new values. but what happens in a user specifies all three values? (ranks per node, num nodes and num ranks?) - for example if a user specifies 2 ranks per node, 500 ranks, 3 nodes? I feel like there is something bad here that should be validated. I've also raised that question with @rjmello offline who is doing some other validation work.

Copy link
Contributor

@WardLT WardLT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @R1j1t ! I'm inclined to approve this PR as it takes a good step in the direction of making resource spec validation errors clearer for users.

Building on @benclifford , I think we can do better by deviating farther from the initial implementation. For example, we could use a data class to handle the validation logic rather than duplicate logic dataclasses already handles well. We should also move the checks from buried within HTEx to part of the app module to encourage use of the same spec across different Executors.

Those ideas and the ones Ben already has feel like things we should spread over multiple PRs. Do you agree?

PS. I'm about to leave on vacation. Reviews are going to be slower over the next few weeks.

Copy link
Collaborator

@benclifford benclifford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I made a couple more change requests- then as @WardLT is happy with the rest of this, we can merge this and other work can happen in other PRs (either from you or from whoever else works on this)

.gitignore Outdated

#deps
/.deps

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change .gitignore with changes that are not relevant to the topic of the PR: none of these changes seem relevant to the point of the PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed any changes in this file. The reason I added them was because of the following files getting created on every test execution
image

Let me open another PR for this!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok - you can also fix the MPI test perhaps. It looks like: parsl/tests/test_mpi_apps/test_mpi_mode_enabled.py uses local_setup and local_teardown when it could use local_config instead to supply the test configuration. If the test used local_config, then the test system should make those files appear under .pytest/parsltest-current like it does for many other tests.


def __init__(self, invalid_keys: Set[str]):
def __init__(self, invalid_keys: Union[Set[str], str]):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is even more weird than the set-of-strings you changed before! invalid_keys isn't even a collection of objects any more, sometimes!

To get this PR merged without too much further work, make a new exception class MissingResourceSpecification which doesn't take an invalid_keys set, and use that to describe a missing resource specification; and leave this InvalidResourceSpecification for representing incorrectly used keys.

@R1j1t
Copy link
Contributor Author

R1j1t commented May 27, 2024

For both MPI mode and not-MPI mode, there are some keys that are required, some that are optional and some that are forbidden - different for each mode.

I am not sure if I am missing something, but based on the below code num_ranks, ranks_per_node and num_nodes should be present post validate_resource_spec.

"--ntasks",
resource_spec.get("num_ranks"),
"--ntasks-per-node",
resource_spec.get("ranks_per_node"),

nodes_needed = resource_spec.get("num_nodes")
if nodes_needed:
try:
allocated_nodes = self._get_nodes(nodes_needed)
except MPINodesUnavailable:
logger.warning("Not enough resources, placing task into backlog")
self._backlog_queue.put((nodes_needed, task_package))
return
else:
resource_spec["MPI_NODELIST"] = ",".join(allocated_nodes)
self._map_tasks_to_nodes[task_package["task_id"]] = allocated_nodes
buffer = pack_res_spec_apply_message(_f, _args, _kwargs, resource_spec)
task_package["buffer"] = buffer
self.pending_task_q.put(task_package)

cc: @yadudoc

@benclifford benclifford changed the title [BUG] empty parsl_resource_specification in mpi mode causes parsl to hang Make HTEX reject empty resource specification in MPI mode May 27, 2024
@benclifford benclifford merged commit 11664da into Parsl:master May 27, 2024
6 checks passed
@R1j1t R1j1t deleted the bug/3427 branch May 27, 2024 17:26
@R1j1t
Copy link
Contributor Author

R1j1t commented May 27, 2024

Thanks @benclifford, @WardLT for the help! Looking forward to contributing to parsl :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Parsl hangs when using MPI Mode HTEx if resource specs are not found
3 participants