[not for merge] Status poller rearrangement patch stack #3295
Draft
benclifford wants to merge 16 commits into master from benc-status-refactor
Conversation
benclifford force-pushed the benc-status-refactor branch 2 times, most recently from 6ecc4a8 to 3518372 on March 26, 2024 11:16
benclifford changed the title from "[not for merge] status refactor" to "[not for merge] Status poller rearrangement patch stack" on Mar 26, 2024
benclifford force-pushed the benc-status-refactor branch 2 times, most recently from 15ad3f1 to 0460015 on March 28, 2024 16:48
benclifford force-pushed the benc-status-refactor branch 3 times, most recently from b960e4a to da14f7a on April 9, 2024 17:24
It was used once, a couple of lines below. This PR should not change behaviour.
Prior to this PR, the monitoring router exited due to receiving a WORKFLOW_INFO message with an exit_now field set to True, but only if that message was received through a specific path. This PR removes that exit_now field and makes the monitoring router exit on a multiprocessing event instead. This removes the need for the exit message to arrive through that specific path into the router, which makes message handling more consistent and opens up opportunities to feed messages into monitoring through different paths. Ongoing work has slowly been trying to make all the different monitoring message paths behave the same, with a goal of eliminating some of them, and this change also works towards that.
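The exit mechanism described above can be sketched roughly as follows. This is a hypothetical minimal illustration, not Parsl's actual router code: a thread stands in for the router process here to keep the demo self-contained, and `router_loop` and the message shape are invented names.

```python
import multiprocessing
import queue
import threading
import time


def router_loop(messages: queue.Queue, exit_event, handled: list) -> None:
    """Drain messages until the owner sets the exit event.

    No special exit_now field is needed in any message: shutdown is
    signalled out-of-band via the event, so messages can arrive through
    any path and be handled uniformly.
    """
    while not exit_event.is_set():
        try:
            msg = messages.get(timeout=0.05)
        except queue.Empty:
            continue
        handled.append(msg)  # placeholder for real message handling


msgs: queue.Queue = queue.Queue()
exit_event = multiprocessing.Event()  # works across threads and processes
handled: list = []

t = threading.Thread(target=router_loop, args=(msgs, exit_event, handled))
t.start()

msgs.put({"type": "WORKFLOW_INFO"})  # an ordinary message, no exit flag
for _ in range(200):                 # wait (up to ~2s) for it to be handled
    if handled:
        break
    time.sleep(0.01)

exit_event.set()                     # request shutdown out-of-band
t.join(timeout=5)
```

In the real router the loop would live in a separate process and the event would be passed to it at start-up; the point is only that shutdown no longer depends on a message arriving through one particular path.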
I think if I try using the dfk.monitoring.send channel, this is going to be very thread-unsafe. But I think that's already the case, because there's nothing thread-specific about who can call dfk.monitoring.send... it's just going to happen more often.
This makes this status available to the scale_in methods of BlockProviderExecutors, so it can be used as part of a scaling bugfix in a subsequent PR. This is part of ongoing work to minimise and possibly eliminate the PolledExecutorFacade class. This PR leaves accesses to those structures in the PolledExecutorFacade, with future PRs intended to evaporate away more of the PolledExecutorFacade. This should not change behaviour, as it only moves an attribute from one object to a 1:1 paired object.
…htex can now make use of that information source... so I can try to make a better implementation of PR #3232. I think this fixes #3232.
this means this monitoring code does not need to know whether a cache-refresh happened or not.

Subtleties: if someone modifies the poller dict elsewhere, this won't catch that, because it's sharing that mutable old state, so that mutator needs to send a monitoring notification itself. That's how things already are, I think: pollitem.scale_out and pollitem.scale_in already do that monitoring change.

this is probably broken - get some better testing. Or perhaps redo how monitoring deltas happen entirely?
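One way the monitoring deltas mentioned above could be redone: keep the previous status snapshot across loop iterations and diff it against the current one. This is a hypothetical sketch, not Parsl's code, and `status_delta` is an invented name.

```python
def status_delta(previous: dict, current: dict) -> dict:
    """Return only the block states that changed since the last snapshot.

    previous/current map block id -> status string. A block that appears
    in current with a different (or new) state is part of the delta;
    unchanged blocks are omitted.
    """
    return {
        block_id: state
        for block_id, state in current.items()
        if previous.get(block_id) != state
    }
```

Each loop iteration would then send the delta to monitoring and store the current snapshot as the next iteration's previous one, rather than observing a before/after state in place.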
this should not change behaviour, but is prep for the next patch, which does change behaviour. It does not change behaviour because it only moves attributes into different objects.
this should not change behaviour, as it is a code move
…ller this is a behavioural change - but what's the behavioural change? If there are multiple executors, then cache refreshes might now happen on a slightly different cadence: the 2nd executor's "now" time will be later (it will be the end time of the 1st executor's polling), which means it might now fire a bit earlier.
…, not fresh provider+simulated data

Potentially refresh the cache on every call, instead of being driven by the poller - this might change the refresh cadence. Code that assumes repeated calls to PollItem.status will be constant, except across a poll loop iteration, might break -- does anything make that assumption? For example: handle_errors vs strategize now happen with potentially different status info (which should be eventually convergent).

The monitoring loop in poll now can't see the update happening, and can't get a before/after state inside the loop, because it's decoupled now. It probably needs some more persistent (across loop iterations) storage of the previous state...

Things to test:
* What test makes sure the provider isn't polled too often? If there isn't one here, write one.
* What tests that we don't scale too much? (e.g. if we ignore the PENDING status added by scale_out to poller_mutable_status, and let the strategy keep running, then we should see excessive blocks being launched.)
* Test that monitoring delta recording works.

This breaks: pytest parsl/tests/test_scaling/test_scale_down_htex_unregistered.py --config local. I think that's because the status update hasn't happened yet at the point where it's asserted, because .status() is now cached...
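The refresh-cadence concern above, and the "provider isn't polled too often" test, can be illustrated with a minimal on-demand cache. This is an invented sketch, not Parsl's implementation: `CachedStatus`, `provider_poll`, and the TTL value are all hypothetical.

```python
import time


class CachedStatus:
    """Refresh a cached status on demand when it is older than a TTL,
    instead of having a central poller loop drive refreshes."""

    def __init__(self, provider_poll, ttl: float = 30.0):
        self._poll = provider_poll        # the expensive provider call
        self._ttl = ttl
        self._cache: dict = {}
        self._last_refresh = float("-inf")  # force refresh on first call

    def status(self) -> dict:
        now = time.monotonic()
        if now - self._last_refresh >= self._ttl:
            self._cache = self._poll()    # refresh only when stale
            self._last_refresh = now
        return self._cache


# In the spirit of "what test makes sure the provider isn't polled too
# often?": count the underlying poll calls across repeated status() calls.
calls = {"n": 0}


def fake_poll() -> dict:
    calls["n"] += 1
    return {"block1": "RUNNING"}


cached = CachedStatus(fake_poll, ttl=60.0)
for _ in range(5):
    cached.status()   # only the first call should hit the provider
```

The flip side, visible in the broken test mentioned above, is that a caller asserting right after a state change may still see the stale cached value until the TTL expires.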
…to the block provider executor, then perhaps it's easier to do local reasoning here about which info the job status poller actually wants?

Specifically, what's different about FAILED vs PENDING artificial statuses?
* FAILED is persistent beyond the update from the underlying provider.
* PENDING should be replaced by an update from the underlying provider - including, perhaps incorrectly, being entirely deleted.

That's probably the fundamental problem I have so far with unifying these tables: the different update semantics above.

To bugfix this, in addition to seeing the poller's deferred status, we also need to see the simulated status changes live (rather than cache-deferred). No one else is using simulated status, and now the job status poller is allowed to touch the internal structs, so perform the simulated status update elsewhere. The broken behaviour is that other callers (if they exist?) of executor.status no longer see any simulated status at all.

This behaviour changes when htex gets to apply the MISSING rewrite to only include jobs reported by the provider, not jobs added by simulated status. That's actually possibly desirable, because simulated status might contain more useful information that shouldn't be overwritten by MISSING?
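The two update semantics described above can be made concrete with a small merge function. This is a hypothetical illustration of the rule, not code from this patch stack: `merge_status` and the plain-string states are invented for the example.

```python
def merge_status(simulated: dict, provider: dict) -> dict:
    """Merge an artificial (simulated) status table with a provider update.

    Semantics from the commit message:
    - FAILED entries are sticky: they persist beyond provider updates.
    - PENDING entries are placeholders: a provider report for the same
      block replaces them; otherwise they are kept until one arrives.
    """
    merged = dict(provider)
    for block_id, state in simulated.items():
        if state == "FAILED":
            merged[block_id] = state          # FAILED always wins
        elif state == "PENDING" and block_id not in provider:
            merged[block_id] = state          # keep until provider reports it
    return merged


result = merge_status(
    simulated={"a": "FAILED", "b": "PENDING", "c": "PENDING"},
    provider={"a": "RUNNING", "b": "RUNNING"},
)
# "a" stays FAILED, "b" is overwritten by the provider, "c" survives as
# PENDING because the provider has not reported it yet.
```

Writing the rule down like this makes the unification problem visible: a single table can't express both behaviours unless each entry records which update semantics apply to it.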
This might change some behaviour, because it will now perform the mutation on any calls to block provider executor scale in/out, not only the ones driven by the job status poller. However, I think there are no such calls any more, and that everything is driven by the job status poller?
benclifford force-pushed the benc-status-refactor branch from da14f7a to e38ea68 on April 9, 2024 20:07
This is a patch stack with a bunch of provider status rearrangement.
The primary user-related motivation is to fix issues #3235 and #2627.
Another goal is to rationalise the handling of multiple sources of block/job status information.
As of the time of writing, this PR contains many small steps that culminate in fixing (I hope) both of those issues, but leaves the scaling code still needing cosmetic tidyup. It looks like PollItem is now a facade that mostly handles reporting things to the monitoring system, which could also be moved into the status handling executor code...
This patch stack is managed as a stgit patch stack on my (@benclifford) laptop, so don't go pushing things to this branch, because that's a hassle for me to deal with.
Because the scaling code is quite hard to comprehensively understand, and other attempts to change this code have proven very hard for everyone to review, I would like to merge this code patch-by-patch, each one as its own PR, with it being clearly defined for each one whether I expect any behaviour change, and if so what that behaviour change is, ideally accompanied by tests; and I would like reviewers to pay attention to the behaviour of each PR, rather than assuming it's probably right.
When reviewing each PR individually, this current PR should serve as end-goal context for why a change is being made. Each commit on this branch is a future-PR-in-preparation and should make sense on its own.