Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix two bugs and add support for N masters, N workers with load balancing in RadicalPilotExecutor #3060

Open
wants to merge 131 commits into
base: master
Choose a base branch
from

Conversation

AymenFJA
Copy link
Contributor

@AymenFJA AymenFJA commented Feb 8, 2024

Description

This PR:

Fixes

Fixes #3029, #3070

Type of change

Choose which options apply, and delete the ones which do not apply.

@benclifford benclifford changed the title RPEX fix termination and test issues. Fix two bugs and add support for N masters, N workers with load balancing in RadicalPilotExecutor Feb 9, 2024
"""
for fut_task in self.future_tasks.values():
if not fut_task.done():
fut_task.set_exception(exception)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe there's a very slight race condition here - can fut_task get set by another thread (eg. task completion) in the space between line 211 and line 212? If so, set_exception would raise an exception here. I've seen a few situations like that elsewhere in Parsl.

Copy link
Contributor Author

@AymenFJA AymenFJA Feb 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think so. State_Callbacks is the only one responsible for setting fut_tasks in the executor.

@benclifford
Copy link
Collaborator

ok thanks. there are a bunch of changes in this and I'm not super familiar with this bit of the code, so I'll try to use the review as a learning experience. If you're making several changes at once, it can be easier both for review and tracking down later bugs to make several smaller, single focused PRs (a rule of thumb for me is, if you start listing out multiple changes in the PR description then consider splitting it...)

@AymenFJA
Copy link
Contributor Author

ok thanks. there are a bunch of changes in this and I'm not super familiar with this bit of the code, so I'll try to use the review as a learning experience. If you're making several changes at once, it can be easier both for review and tracking down later bugs to make several smaller, single focused PRs (a rule of thumb for me is, if you start listing out multiple changes in the PR description then consider splitting it...)

@benclifford totally understandable and I agree. In the future, the PRs will be more focused. During this review please let me know if I can help you with anything vague or any questions. Thank you and sorry for the late response..

@@ -10,6 +10,7 @@ export PATH := $(CCTOOLS_INSTALL)/bin/:$(PATH)
export CCTOOLS_VERSION=7.7.2
export HYDRA_LAUNCHER=fork
export OMPI_MCA_rmaps_base_oversubscribe=yes
export RADICAL_TEMP_SANDBOX=True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this variable is not interpreted as a Python bool so it's maybe misleading to use True here in case someone thinks False would do something opposite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with that. maybe RADICAL_TEMP_SANDBOX=yes or RADICAL_TEMP_SANDBOX=1 ?

# to include the agent sandbox with the ci artifacts.
if os.environ.get("LOCAL_SANDBOX"):
pd_init['sandbox'] = self.run_dir
# set the agent dir to /temp as a fix for #3029
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this changed sandbox behaviour the only part that fixes #3029?

If so:

it's moving things into a temporary directory only when the CI tests are run (or in other situations where the RADICAL_TEMP_SANDBOX environment variable is set).

in non-test situations, this behaviour is otherwise unchanged, right? If so, does this PR stop that situation occurring for normal users who are running real applications, not our CI?

@AymenFJA

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is a very rare case that can happen when the filesystem is exhausted for some reason (maybe too many concurrent I/O or so in the CI test machine), this behavior should not happen on a regular basis.

I do understand your concerns here that this might also happen specifically on a local non-test setup (as the agent sandbox is located in $HOME), we never observed this behavior on an HPC machine as we use the shared file system to store our agent sandbox.

I think to address your concern we can make the default location for the sandbox on local setup in general inside the /tmp folder? I am open to suggestions as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I investigated this a bunch more (and have commented on what I see in other issues) to try to understand what was actually happening in the CI environment, because I like to at least have some understanding of how that environment behaves.

I can see why this is happening in the CI (due to entering a strange zone of fd number-space which radical.saga cannot deal with) and therefore why this would happen in a user workflow (when a user's workflow submitting process has opened many - around 1000 - files).

This PR avoid the misbehaving radical.saga code path by having a special non-default option to turn on a local sandbox.

But in radical-cybertools/radical.saga#885 Andrew mentions, effectively, that radical.saga would be better replaced by psij-python - that would have the same effect of removing the misbehaving codepath, in a different way.

So I tried that out in this draft PR: #3079 which doesn't actually avoid the misbehaving codepath, as far as I can tell.

But maybe this is a better approach longer term, if radical.saga is effectively deprecated in favour of psij-python?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failures from Radical pilot executor
2 participants