[not for merge] Status poller rearrangement patch stack #3295

Draft
wants to merge 16 commits into base: master

Commits on Apr 9, 2024

  1. Inline PolledExecutorFacade._should_poll_now

    It was used once, a couple of lines below.
    
    This PR should not change behaviour.
    benclifford committed Apr 9, 2024 (e3df4db)
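    Below is a tiny illustration of the kind of inlining described here; the attribute names and the poll() method are hypothetical stand-ins, not Parsl's actual code.

    ```python
    class PolledExecutorFacade:
        status_polling_interval = 5.0   # illustrative value

        def __init__(self):
            self._last_poll_time = 0.0

        # before: a predicate with exactly one caller, defined a couple of
        # lines above its use
        #
        #     def _should_poll_now(self, now):
        #         return now - self._last_poll_time >= self.status_polling_interval
        #
        # after: the condition is written inline at that single call site
        def poll(self, now):
            if now - self._last_poll_time >= self.status_polling_interval:
                self._last_poll_time = now
                print("refreshing status")   # stand-in for the real poll work
    ```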
  2. Exit monitoring router on multiprocessing event, not exit message

    Prior to this PR, the monitoring router exited due to receiving a
    WORKFLOW_INFO message with an exit_now field set to True, but only if that
    message was received through a specific path.
    
    This PR removes that exit_now field, and makes the monitoring router exit
    on a multiprocessing event. This removes the need for the exit message to
    arrive through that specific path into the router, which makes message
    handling more consistent, and opens up opportunities to feed messages into
    monitoring through different paths.
    
    Ongoing work has slowly been trying to make all the different monitoring
    message paths behave the same, with the goal of eliminating some of them;
    this change also works towards that.
    benclifford committed Apr 9, 2024 (fb7a941)
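    A minimal sketch of the exit pattern this commit moves towards; the queue-based loop and print handler below are illustrative stand-ins for the real router and its sockets, not Parsl's actual code.

    ```python
    import multiprocessing
    import queue
    import time


    def router_loop(messages, exit_event):
        """Drain messages until exit_event is set (stand-in for the router's receive loop)."""
        while not exit_event.is_set():
            try:
                msg = messages.get(timeout=1)   # short timeout so the event is re-checked
            except queue.Empty:
                continue
            print("routing message:", msg)      # stand-in for real message handling


    if __name__ == "__main__":
        q = multiprocessing.Queue()
        stop = multiprocessing.Event()
        p = multiprocessing.Process(target=router_loop, args=(q, stop))
        p.start()
        q.put(("WORKFLOW_INFO", {}))   # ordinary message: no special exit_now field needed
        time.sleep(2)                  # let the message be routed
        stop.set()                     # shutdown is signalled out-of-band, on the event
        p.join()
    ```

    Because shutdown no longer depends on a message arriving through one particular path, any sender path can feed messages into the loop without taking on responsibility for termination.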
  3. desc-sponsored monitoring untangling that fits in here

    I think that if I try using the dfk.monitoring.send channel, this is going to be very thread-unsafe.
    
    but I think that's already the case, because there's nothing thread-specific about who can call dfk.monitoring.send...
    ...it will just happen more often.
    benclifford committed Apr 9, 2024 (e2b4176)
  4. Move PolledExecutorFacade status into BlockProviderExecutor

    This makes that status available to the scale_in methods of
    BlockProviderExecutors, so it can be used as part of a scaling bugfix
    in a subsequent PR.
    
    This is part of ongoing work to minimise and possibly eliminate the
    PolledExecutorFacade class.
    
    This PR leaves accesses to those structures in the PolledExecutorFacade,
    with future PRs intended to evaporate away more of the PolledExecutorFacade.
    
    This should not change behaviour, as it only moves an attribute from one
    object to a 1:1 paired object.
    benclifford committed Apr 9, 2024 (e4b1106)
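    A rough structural sketch of the move, using hypothetical class shapes rather than Parsl's real classes: the per-block status table lives on the executor, and the facade's existing accesses read through to it.

    ```python
    from typing import Dict


    class BlockProviderExecutor:
        def __init__(self):
            # previously held by PolledExecutorFacade; the two objects are
            # paired 1:1, so moving the attribute should not change behaviour
            self._status: Dict[str, str] = {}   # block_id -> job state name


    class PolledExecutorFacade:
        def __init__(self, executor: BlockProviderExecutor):
            self._executor = executor

        @property
        def status(self) -> Dict[str, str]:
            # existing accesses stay on the facade for now, delegating to the
            # executor; later commits intend to shrink the facade further
            return self._executor._status
    ```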
  5. 4d92648
  6. at this point, scale in code can now observe poll item states, and so htex can now make use of that information source...

    so I can try to make a better implementation of PR #3232

    I think this fixes #3232
    benclifford committed Apr 9, 2024 (e4802d6)
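    A sketch of what scale-in code can do once it can observe these states; the preference for PENDING blocks below is an illustrative policy, not necessarily the exact change referenced for #3232.

    ```python
    from typing import Dict, List


    def choose_blocks_to_scale_in(status: Dict[str, str], n: int) -> List[str]:
        """Pick n blocks to terminate, preferring blocks that are still PENDING."""
        pending = [b for b, s in status.items() if s == "PENDING"]
        others = [b for b in status if b not in pending]
        return (pending + others)[:n]


    print(choose_blocks_to_scale_in({"0": "RUNNING", "1": "PENDING", "2": "RUNNING"}, 2))
    # -> ['1', '0']: the pending block is chosen before a running one
    ```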
  7. compute monitoring deltas on each poll

    this means this monitoring code does not need to know whether a cache refresh happened or not

    subtlety: if someone modifies the poller dict elsewhere, this won't catch that, because it's sharing that mutable old state,
    and so that mutator needs to send a monitoring notification themselves.

    that's how things already are, I think: pollitem.scale_out and pollitem.scale_in already do that monitoring change

    this is probably broken - needs some better testing. or perhaps redo how monitoring deltas happen entirely?
    benclifford committed Apr 9, 2024 (ad315a2)
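    A small sketch of per-poll delta computation, with hypothetical names: keep a snapshot of the last reported table and emit only the entries that changed, so the reporting code doesn't care whether a cache refresh actually happened this iteration.

    ```python
    from typing import Dict


    def status_delta(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
        """Return block_id -> new state for every block whose state changed."""
        return {b: s for b, s in current.items() if previous.get(b) != s}


    # usage inside a poll loop: 'previous' must be a snapshot (a copy), not an
    # alias of the mutable table, otherwise in-place mutation elsewhere is
    # invisible here - the subtlety the commit message warns about
    previous: Dict[str, str] = {}
    current = {"0": "PENDING", "1": "RUNNING"}
    for block_id, state in status_delta(previous, current).items():
        print("would send monitoring record:", block_id, state)
    previous = dict(current)
    ```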
  8. push cache refresh state into executor

    this should not change behaviour, but is prep for the next patch, which does change behaviour

    it does not change behaviour because it only moves attributes into different objects
    benclifford committed Apr 9, 2024 (d5bf865)
  9. move cache update code into executor

    this should not change behaviour, as it is a code move
    benclifford committed Apr 9, 2024 (fc74a91)
  10. executors get current time directly from time, not from job status poller

    this is a behavioural change - but what is the behavioural change?
    if there are multiple executors, then cache refreshes might now happen
    on a slightly different cadence: the 2nd executor's 'now' will be
    later (it will be the time at which the 1st executor's polling finished),
    which means its refresh might now fire a bit earlier
    benclifford committed Apr 9, 2024 (ff62019)
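    A sketch of the executor reading the clock itself (attribute names are hypothetical): each executor samples time.time() at the moment it is asked, instead of receiving a single 'now' captured by the job status poller.

    ```python
    import time


    class BlockProviderExecutor:
        status_polling_interval = 5.0   # seconds; illustrative value

        def __init__(self):
            self._last_poll_time = 0.0

        def _should_refresh(self) -> bool:
            # previously the job status poller handed every executor the same
            # 'now'; here each executor samples the clock when asked, so a
            # second executor sees a slightly later time than the first
            now = time.time()
            if now - self._last_poll_time >= self.status_polling_interval:
                self._last_poll_time = now
                return True
            return False
    ```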
  11. test dev for #3235

    benclifford committed Apr 9, 2024 (69ad2aa)
  12. move executor.status() API boundary to return the cached/mutated data, not fresh provider+simulated data
    
    the cache is now potentially refreshed on every call, instead of being driven by the poller - this might change the refresh cadence.
    
    code that assumes that repeated calls to PollItem.status will be constant, except across a poll loop iteration, might break -- does anything make that assumption? for example: handle_errors vs strategize now run with potentially different status info (which should be eventually convergent)
    
    the monitoring loop in poll now can't see the update happening / can't get a before/after state inside the loop, because it's decoupled now.
    it probably needs some more persistent (across loop iterations) storage of the previous state...
    
    things to test:
    
    * what test makes sure the provider isn't polled too often?
    if there isn't one, write one.
    
    * what test checks that we don't scale out too much?
    (e.g. if we ignore the PENDING status added by scale_out
    to poller_mutable_status and let the strategy keep
    running, then we should see excessive blocks being
    launched)
    
    * test that monitoring delta recording works
    
    
    this breaks:
    pytest parsl/tests/test_scaling/test_scale_down_htex_unregistered.py --config local
    
    I think it's because the status update hasn't happened yet at the point where it's being asserted, because .status() is now cached...
    benclifford committed Apr 9, 2024 (d23ecc3)
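    A sketch of the status() boundary described here, with a hypothetical provider interface: the method refreshes its own cache when it decides it is stale, and otherwise returns the cached, possibly locally mutated, table.

    ```python
    import time
    from typing import Dict


    class BlockProviderExecutor:
        status_polling_interval = 5.0   # seconds; illustrative value

        def __init__(self, provider):
            self.provider = provider            # anything with a poll() -> Dict[str, str] method
            self._status: Dict[str, str] = {}   # cached / locally mutated block states
            self._last_poll_time = 0.0

        def status(self) -> Dict[str, str]:
            now = time.time()
            if now - self._last_poll_time >= self.status_polling_interval:
                self._last_poll_time = now
                self._status.update(self.provider.poll())   # refresh from the provider
            # callers see the cached table; repeated calls within the interval
            # return the same data, which is what might surprise a test that
            # expects an immediately visible update
            return self._status


    class FakeProvider:
        def poll(self) -> Dict[str, str]:
            return {"0": "RUNNING"}


    ex = BlockProviderExecutor(FakeProvider())
    print(ex.status())   # cache is stale at start, so this refreshes from the provider
    print(ex.status())   # within the interval: same cached table, no provider poll
    ```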
  13. this is an attempt to bugfix #3235 - after moving more status stuff into the block provider executor, perhaps it's easier to do local reasoning here about which info the job status poller actually wants?
    
    specifically, what's different about FAILED vs PENDING artificial statuses?
    
    FAILED is persistent beyond the update from the underlying provider
    PENDING should be replaced by an update from the underlying provider - including, perhaps incorrectly, being entirely deleted
    
    and that's probably the fundamental problem I have so far with unifying these tables - the different update semantics above
    
    to bugfix this, in addition to seeing the poller-deferred status, we also need to see the simulated status changes live
    (rather than cache-deferred)
    
    no one else is using simulated status, and the job status poller is now allowed to touch the internal structs, so the simulated status update can be performed elsewhere
    
    broken behaviour is that other callers (if they exist?) of executor.status no longer see any simulated status at all
    
    this changes behaviour when htex gets to apply the MISSING rewrite: it now only includes jobs reported by the provider, not jobs added by simulated status. that's actually possibly desirable, because simulated status might contain more useful information that shouldn't be overwritten by MISSING?
    benclifford committed Apr 9, 2024 (d404255)
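    A sketch of the two update semantics described above, using plain dicts and state names rather than Parsl's real structures; keeping an unreported PENDING block is one possible answer to the "entirely deleted" concern, not the committed behaviour.

    ```python
    from typing import Dict


    def merge_provider_update(local: Dict[str, str], provider: Dict[str, str]) -> Dict[str, str]:
        merged = dict(provider)                   # the provider's view wins by default...
        for block_id, state in local.items():
            if state == "FAILED":
                merged[block_id] = "FAILED"       # ...but locally recorded failures persist
            elif state == "PENDING" and block_id not in provider:
                # a simulated PENDING block the provider has not reported yet:
                # keeping it, rather than deleting it, is one way to address the
                # "perhaps incorrectly, being entirely deleted" concern
                merged[block_id] = "PENDING"
        return merged


    local = {"0": "FAILED", "1": "PENDING", "2": "RUNNING"}
    provider = {"2": "COMPLETED"}
    print(merge_provider_update(local, provider))
    # -> {'2': 'COMPLETED', '0': 'FAILED', '1': 'PENDING'}
    ```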
  14. 51ea950
  15. dev

    benclifford committed Apr 9, 2024 (671e282)
  16. move mutation of poll items status table into block provider executor

    this might change some behaviour, because it will now perform the mutation on any call
    to block provider executor scale in/out, not only the ones driven by the job status poller
    - however, I think there are no such calls any more and that everything is driven by
    the job status poller?
    benclifford committed Apr 9, 2024 (e38ea68)
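    A sketch of the end state this commit describes, with hypothetical names: scale_out/scale_in on the executor mutate the executor-owned status table directly, so the mutation happens for every caller rather than only for calls driven by the job status poller.

    ```python
    from typing import Dict, List


    class BlockProviderExecutor:
        def __init__(self):
            self._status: Dict[str, str] = {}
            self._next_block = 0

        def scale_out(self, n: int) -> List[str]:
            new_blocks = []
            for _ in range(n):
                block_id = str(self._next_block)
                self._next_block += 1
                self._status[block_id] = "PENDING"    # recorded here, whoever the caller is
                new_blocks.append(block_id)
            return new_blocks

        def scale_in(self, n: int) -> List[str]:
            victims = list(self._status)[:n]
            for block_id in victims:
                self._status[block_id] = "CANCELLED"  # mutation no longer lives in the poller
            return victims
    ```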