Remove PreventRedundantTransitions rule from core task orchestration policy #8802
Some users choose to run Dask workers on ephemeral infrastructure like EC2 spot instances, as reported in #8602. Sometimes, this infrastructure disappears without giving Prefect the opportunity to gracefully handle the shutdown - instead, Dask itself will retry the task.
The abrupt shutdown, combined with Dask retrying the task without giving Prefect an opportunity to gracefully terminate it, leaves the task in a RUNNING state, which causes an error on retry:
This run cannot transition to the RUNNING state from the RUNNING state. Task run is in RUNNING state
The issue also appears to happen when running Ray on spot instances: #13013
After discussing the issue, we decided Prefect should not try to orchestrate things happening outside its control, so this PR removes the PreventRedundantTransitions rule from CoreTaskPolicy.
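To make the effect of the rule concrete, here is a minimal, self-contained sketch of the failure mode. It is a toy model, not Prefect's actual orchestration code: `StateType`, `TaskRun`, `prevent_redundant_transitions`, and `simulate_worker_crash` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class StateType(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"


# A rule inspects (current, proposed) and returns a rejection reason or None.
Rule = Callable[[StateType, StateType], Optional[str]]


def prevent_redundant_transitions(
    current: StateType, proposed: StateType
) -> Optional[str]:
    """Toy stand-in for the removed rule: reject a move into the same state."""
    if current == proposed:
        return (
            f"This run cannot transition to the {proposed.value} state "
            f"from the {current.value} state."
        )
    return None


@dataclass
class TaskRun:
    state: StateType = StateType.PENDING
    rules: List[Rule] = field(default_factory=list)

    def propose(self, proposed: StateType) -> bool:
        """Apply the transition unless a rule rejects it; True if accepted."""
        for rule in self.rules:
            if rule(self.state, proposed) is not None:
                return False
        self.state = proposed
        return True


def simulate_worker_crash(rules: List[Rule]) -> bool:
    """First attempt enters RUNNING and dies; report whether a retry may start."""
    run = TaskRun(rules=list(rules))
    # The first attempt starts, then the worker vanishes without reporting back.
    run.propose(StateType.RUNNING)
    # The executor retries on a new worker and proposes RUNNING again.
    return run.propose(StateType.RUNNING)
```

In this model, `simulate_worker_crash([prevent_redundant_transitions])` rejects the retry, while `simulate_worker_crash([])` accepts it, which mirrors the behavior change this PR aims for.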
Closes #8602
A related issue also solved by this PR is #8597, where API timeouts occasionally result in a task run trying to re-set the RUNNING state when it was already saved server-side.
Closes #8597
Example
Before applying the changes in this PR, run a task in a Dask worker, and either kill the worker manually or call os.abort() inside the worker. Dask should restart the worker and begin re-running the task.
Below is a simple test that causes random crashes but eventually succeeds. The task should crash the Dask worker on an early run and succeed on a later one:
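A reproduction along these lines might look like the following sketch. It assumes Prefect 2.x with the prefect-dask collection installed. The names `should_crash`, `build_flow`, `flaky_task`, and `crashy_flow` and the 50% crash probability are illustrative choices, not taken from this PR. The Prefect imports are deferred into `build_flow` so the crash decision can be exercised without Prefect installed.

```python
import os
import random


def should_crash(rng: random.Random, crash_probability: float = 0.5) -> bool:
    """Decide whether this attempt should kill its worker process."""
    return rng.random() < crash_probability


def build_flow():
    """Build the flow lazily so importing this module does not need Prefect."""
    # Assumption: Prefect 2.x and prefect-dask are available in the environment.
    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def flaky_task() -> str:
        if should_crash(random.Random()):
            # Kill the worker process abruptly, so Prefect gets no chance to
            # record a terminal state for this task run.
            os.abort()
        return "done"

    @flow(task_runner=DaskTaskRunner())
    def crashy_flow() -> str:
        # Submit the task to the Dask task runner and wait for its result.
        return flaky_task.submit().result()

    return crashy_flow
```

Calling `build_flow()()` would then run the flow on a temporary local Dask cluster. Because the crash is random, a run may need several Dask-level retries before the task completes, and Dask may give up if the worker dies too many times in a row.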
Before applying this PR, when the worker crashes and Dask tries to re-run the task, Prefect aborts the state transition and fails the flow run:
After applying the PR's changes, Prefect allows the task to re-enter the RUNNING state, so Dask restarts it and, when it completes, the flow finishes successfully: