
Scale in unregistered htex blocks #3232

Closed

wants to merge 2 commits into from

Conversation
Conversation

benclifford (Collaborator) commented Mar 11, 2024

This PR makes the htex scale-in code consider blocks which have been launched through the provider mechanism but have not registered with the interchange. This is especially relevant at shutdown time, when previously launched blocks still waiting in the queue (and so, not registered) would otherwise be abandoned by parsl shutdown.

[first commit on this PR is a test that should break; when that happens, I'll push the rest; this is to get a demonstrated test failure in CI]

Changed Behaviour

More blocks scaling in. Tidier shutdown.

Fixes

Fixes #2627

Type of change

  • Bug fix

…caled in

do this by submitting an infinite sleep worker
…at least one manager registered

How to test this? Make an htex with init_blocks=min_blocks=1 and a worker command that does something like sleep forever (it needs to not fail, because failed blocks won't get scaled in), and then shut down without any tasks. In the old behaviour that unregistered block will not be scaled in, I think, but in the new behaviour it should be.
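A minimal sketch of that scenario (the launch_cmd override and the provider parameters here are assumptions for illustration, not necessarily the exact test added in this PR):

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import LocalProvider

# Sketch only: the worker launch command never starts a real worker pool,
# so the block stays alive in the provider but never registers with the
# interchange.
config = Config(
    executors=[
        HighThroughputExecutor(
            label="htex_unregistered_block",
            launch_cmd="sleep infinity",   # alive, never registers, never fails
            provider=LocalProvider(
                init_blocks=1,
                min_blocks=1,
                max_blocks=1,
            ),
        )
    ]
)

parsl.load(config)
# no tasks submitted; at shutdown the unregistered block should now be
# scaled in (cancelled via the provider) rather than abandoned
parsl.dfk().cleanup()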
benclifford (Collaborator, Author)

Since I wrote this code a few weeks ago, there's been a change to the high throughput executor status method to introduce a "MISSING" state, and I've also gone over the general status code with @yadudoc a bit.

Most importantly, I'd like to check that this code isn't actually calling the provider status calls (e.g. invoking sbatch) unnecessarily: the current code is quite unclear about when asking for status returns some in-memory cache of blocks (and block status) versus calling out to the LRM.

There might also be opportunities for aligning this code better with the code that handles MISSING status.

yadudoc (Member) commented Mar 12, 2024

This draft PR fixes the issue described here: #3235.

benclifford (Collaborator, Author)

@yadudoc - I don't think it does? This PR doesn't fix any cache-based status-update delays. I thought we tested a different (no-PR-number) patch?

Comment on lines +751 to +752

managers = self.connected_managers()
Collaborator

Makes the earlier assignment redundant, yeah?

Comment on lines +743 to +750
for (b_id, job_status) in provider_blocks.items():

    # skip blocks that have already finished...
    if job_status.state in TERMINAL_STATES:
        continue

    if b_id not in block_info:
        block_info[b_id] = BlockInfo(tasks=0, idle=float('inf'))
Collaborator

provider_blocks is a dictionary, yeah? If so, then b_id is guaranteed to be unique through the loop. Coupled with the fact that block_info is empty when the loop starts, the if-conditional is a waste. Could merge it all into the instantiation:

block_info = {
    block_id: BlockInfo(tasks=0, idle=float('inf'))
    for block_id, job_status in provider_blocks.items()
    if job_status.state not in TERMINAL_STATES
}

Later on, block_info keys are set to zero on first use as well, so could make that structure a defaultdict:

block_info: Dict[str, BlockInfo] = defaultdict(lambda: BlockInfo(0, float('inf')))
for b_id, job_status in provider_blocks.items():
    # skip blocks that have already finished...
    if job_status.state in TERMINAL_STATES:
        continue

    # the defaultdict logic is now the authority on "empty BlockInfo" objects
    block_info[b_id]

And could remove the if-branch in the following for-loop as well.

benclifford added a commit that referenced this pull request Mar 13, 2024
parsl.provider.* status() calls can be expensive, in the sense that they
do things like execute batch system commands which system administrators
dislike being run automatically and periodically, and even in some
situations ban that behavior.

So, Parsl tries to not do those calls too often - providers can define a
status_polling_interval (which returns an interval in seconds) and the
strategy code will call (through a chain of calls) the provider status
method at most every status_polling_interval.

Prior to this PR, this was implemented in the job status poller code,
which makes the following chain of calls at most every status polling
interval:

poller/strategy iterator =>
  poll_item.poll (runs "often" - e.g. every 5 seconds, driven by the job status poller?)
    => determine if the rest of this stack should run, or otherwise use data cached in the poll_item._status attribute
      => executor.status()
        => provider.status()
          => (in the case of cluster providers, provider._status and an LRM command line callout)

There are a couple of issues related to this call stack:

* #3235 - slow update of jobs that failed to submit - when a job fails
to submit, at present it is recorded in a separate table, the
_simulated_status dictionary, in BlockProviderExecutor. Entries added to
that are not exposed to the scaling code until executor.status() is
called, which may be many strategy iterations later. In that time, the
scaling strategy may make the wrong decisions, based on incomplete
information. That is the core of issue #3235.

* #2627 - htex scale in needs to be aware of job status. Scale in code
is not always driven by the poller/strategy code; for example, it
executes at shutdown, driven by the DataFlowKernel. That scale in code
is then not in a position to make use of the cached data inside
poll_item, and it is more natural to call executor.status() to determine
status - see PR #3232. That results in calls to provider.status()
without any status_polling_interval rate limiting/caching.

This current PR moves that rate limit/cache deeper into the stack to the
BlockProviderExecutor.

The job status poller code now polls the executor on every iteration,
relying on the executor to perform rate limiting and caching.

That has the effect of exposing simulated statuses to the scaling
strategy code on each iteration, so that the delays in issue #3235 will
not be experienced.

This change also means that PR #3232 can safely call executor.status()
without exceeding the rate limit.

This PR changes when provider status gets called, because it is now a
cache that refreshes whenever status is called past the expiry time,
rather than more explicitly driven in a loop. That might change scaling
behaviour, but I think not in any significant way.

This PR changes the decision of when an executor is to be polled for
status: previously non-positive values of status_polling_interval were
used as magic values to inhibit provider status polling. Now, any
executor with a provider will be polled.

This PR makes the call stack above provider.status a little more data
driven and less effectful (the effect being load on the LRM), which
might make refactoring this stack a bit easier. However, it will also
make that stack a bit more computationally heavy - I think not by much,
because usually there are not many blocks in existence.

This PR adds a test that repeatedly calling BlockProviderExecutor.status
does not result in an excessive number of calls to the provider status
method but still updates eventually.
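For illustration only, a rough sketch of that rate-limit/cache pattern; the class and attribute names are assumptions made up for this example, not the actual BlockProviderExecutor code:

import time

# Illustrative only: cache provider status and refresh it at most once per
# status_polling_interval, however often callers ask for status.
class CachedProviderStatus:
    def __init__(self, provider, block_ids):
        self._provider = provider
        self._block_ids = block_ids
        self._cached_status = {}
        self._last_poll = float('-inf')

    def status(self):
        now = time.time()
        if now - self._last_poll >= self._provider.status_polling_interval:
            # a single callout to the LRM (e.g. squeue/qstat), no matter how
            # often status() itself is called
            statuses = self._provider.status(self._block_ids)
            self._cached_status = dict(zip(self._block_ids, statuses))
            self._last_poll = now
        return self._cached_status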
benclifford added a commit that referenced this pull request Mar 25, 2024
this is to accommodate future population of blockinfo from multiple sources

how is this change tested? it *shouldn't* change anything. but scale_in is not well tested...
benclifford added a commit that referenced this pull request Mar 25, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232
benclifford added a commit that referenced this pull request Mar 26, 2024
this is to accommodate future population of blockinfo from multiple sources

how is this change tested? it *shouldn't* change anything. but scale_in is not well tested...
benclifford added a commit that referenced this pull request Mar 26, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232
benclifford added a commit that referenced this pull request Mar 26, 2024
this is to accommodate future population of blockinfo from multiple sources

how is this change tested? it *shouldn't* change anything. but scale_in is not well tested...
benclifford added a commit that referenced this pull request Mar 26, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232
benclifford added a commit that referenced this pull request Mar 26, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232
benclifford added a commit that referenced this pull request Mar 27, 2024
this is to accommodate future population of blockinfo from multiple sources

how is this change tested? it *shouldn't* change anything. but scale_in is not well tested...
benclifford added a commit that referenced this pull request Mar 27, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Mar 27, 2024
This is to accommodate upcoming expansion of the block_info dictionary to
include data from an additional source (the list of blocks known to the
scaling code)

This change arises from a review of PR #3232, an earlier prototype of that
expansion work.

Behaviourally, this shouldn't change anything: it changes invalid
block accesses from key errors into blocks that are infinitely idle and
completely unloaded. However there are no indexes into block_info except
in code that is adding block information, so this situation should
not arise.
benclifford added a commit that referenced this pull request Mar 28, 2024
This is to accommodate upcoming expansion of the block_info dictionary to
include data from an additional source (the list of blocks known to the
scaling code)

This change arises from a review of PR #3232, an earlier prototype of that
expansion work.

Behaviourally, this shouldn't change anything: it changes invalid
block accesses from key errors into blocks that are infinitely idle and
completely unloaded. However there are no indexes into block_info except
in code that is adding block information, so this situation should
not arise.
benclifford added a commit that referenced this pull request Mar 28, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Apr 7, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Apr 8, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Apr 9, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Apr 9, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Apr 11, 2024
…htex can now make use of that information source...

so I can try to make a better implementation of PR #3232

i think this fixes 3232
benclifford added a commit that referenced this pull request Apr 12, 2024
Prior to this PR, the scale_in code for the HighThroughputExecutor would not scale in any block that had not had at least one manager register with the interchange, because it retrieves the list of blocks from the interchange. This is documented in issue #3232.

This PR makes the htex scale in code also pay attention to blocks in the status_facade list - which includes blocks that have been submitted, and blocks which have been reported by the provider mechanism.
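For illustration, a sketch of combining the two information sources when choosing scale-in candidates; BlockInfo and the manager dictionary keys follow the review comments earlier in this conversation and are assumptions, not the merged implementation:

from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BlockInfo:
    tasks: int
    idle: float

def scale_in_candidates(connected_managers, provider_blocks, terminal_states):
    # every known block starts out with no tasks and infinite idle time
    block_info = defaultdict(lambda: BlockInfo(tasks=0, idle=float('inf')))

    # blocks known on the provider / status_facade side, even if no manager
    # has registered with the interchange yet
    for block_id, job_status in provider_blocks.items():
        if job_status.state in terminal_states:
            continue
        block_info[block_id]  # materialise an empty entry

    # blocks with registered managers: accumulate real load information
    for m in connected_managers:
        info = block_info[m['block_id']]
        info.tasks += m['tasks']
        info.idle = min(info.idle, m['idle_duration'])

    return block_info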
Development

Successfully merging this pull request may close these issues.

htex termination (and other scale-in-by-quantity) does not scale in blocks that haven't registered yet
3 participants