Add more state transition error handling #6315
Conversation
This makes sense to me, but after thinking about the behavior from a user's perspective, I sense two different product principles that seem to be in tension here:

- A state should only ever mean one thing. In this case, the `CRASHED` state only ever means the user's code crashed (versus the user's code or our code crashed).
- We never fail the user silently. In this case, with this PR as-is, the state of the task we crashed while trying to orchestrate remains `PENDING`, but if we never fail silently for the user, it should be either `CRASHED` or a new state that means "We crashed while trying to orchestrate this."
I don't know if these are real product principles for us, though, so I'm not sure which makes the most sense. I'm trying to be more careful to elevate anything even product-related to the product folks, so I'm going to invoke @billpalombi to get some backup.
assert cleanup_step.call_count == 0

# Check the task run state
task_run_states = await models.task_run_states.read_task_run_states(
    session=session, task_run_id=task_run.id
)

assert len(task_run_states) == 1, "No transition was made"
Should these asserts stick around? They seem useful to verify the state didn't transition.
Noting from out-of-PR conversation -- this check verified that a state was written and now we don't write a state.
I'm sorry, I'm having difficulty understanding both what the current behavior is and what the proposed behavior would be. It appears as though this change is relevant to all state transitions, not just transitions to or from a particular state. Regarding the principles that you proposed, @abrookins:
@abrookins @billpalombi I do believe that an entirely new state (or other persisted object) that encompasses orchestration failures is a potentially useful thing. Perhaps an annotation on the run itself is sufficient? I do want to point out that aborted transitions are not silent. An abort response is returned to the client, which is reflected to the user with an abort message, and a similar error is logged on the server. I suspect this bit of exception handling will be used extremely rarely, possibly most commonly if a custom, untested rule has been injected into the orchestration policy.
I just discussed this change with @anticorrelator. We agree that this exception does not fail silently, because it is both raised in the server logs and included in the state transition response. We also agree that this exception does not motivate a new state name or type, because it describes a failure in the execution of our code, rather than the user's code. However, when this type of error occurs, we should make it more prominent by including it in the flow/task run logs, in addition to the Orion server logs.
Co-authored-by: Michael Adkins <michael@prefect.io>
@abrookins I think it's worth noting that the CRASHED state does not indicate a failure of the user's code. In fact, it's explicitly a failure outside of their code. Crashes indicate a failure that we cannot recover from, generally an exception in Prefect's orchestration of your code or failure of the infrastructure running the flow. These are often not distinguishable to us, since infrastructure errors appear as exceptions within our engine.
So just to be clear, this looks like strictly an improvement, and my only immediate question is about the tests.

On whether or not our behavior, in this case, fails silently: thank you for mentioning the logged errors, @anticorrelator. If those appear in the logs that users can see in our UI, then I agree that this does not fail silently.

If I imagine what my expectations would be as a user, I would want to see that this happened very clearly. Maybe I'm not studying my agent's logs. I'd want to see that the final state of the flow or task run was some kind of "error" in the UI: a visual indication that I can learn more about what went wrong by reading the flow run or task run logs.

I defer to Bill for decisions about how the product should work, ultimately. For my understanding, though, I'd like to know more about `CRASHED`. Our docs on states say this about it:
But Michael describes it differently, like this:
@billpalombi We should get aligned on what `CRASHED` means.
To clarify the user's experience here:
In summary:
(Dustin will need to confirm this is correct)
I'm not sure we should send an ABORT response here, since it is basically a "silent" exit of the flow run. We'd leave it hanging in its old state (perhaps RUNNING). It seems like we should return a 500 from the server, allowing the client to crash the run.
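Michael's suggestion can be sketched roughly like this (function name and state strings are illustrative, not Prefect's actual client code): a 5xx response from the server maps to a locally crashed run instead of leaving the run in its old state.

```python
# Illustrative sketch only: map a server-side orchestration failure (5xx)
# to a local CRASHED outcome so the run is not left hanging in RUNNING.

def resolve_state(status_code: int, current_state: str) -> str:
    if 500 <= status_code < 600:
        # The server failed while orchestrating; crash the run locally so
        # the user sees a terminal state instead of a stuck run.
        return "CRASHED"
    # Otherwise keep whatever state the server reported.
    return current_state
```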
Our discussion here about the user seeing the error message suggests that the user is running the flow, but what about flows run by deployments, etc., where the user is not actively doing anything, and instead an agent is running somewhere? Based on what Michael wrote, it sounds like the flow run would stay in RUNNING state and no error would be logged in the UI associated with the flow run -- if that's correct, that sounds "quiet" if not "silent" to me. 🤔
Well, we can continue this conversation -- perhaps in a GitHub issue? This PR helps establish better error handling, and I think we should accept it once my question about the test changes is resolved. Then as Bill said:
This seems like the content of the GitHub issue in which we take further steps to improve the visibility of this type of error. Whether more visibility includes only the error showing up in the logs associated with the flow/task run or also includes a new state or use of
Other than a little test cleanup we may need to do, this looks good. 👍
I think we should continue the conversation around observability, etc. in a GH issue.
I think the addition of server-side logging here is valuable, but I do not think we should send an ABORT. Previously, this error would be returned as a bad response and propagate to the client, resulting in a CRASHED flow run, which is a better outcome for the user. Since the flow run is left in its previous state, I think this makes the user experience for this error more confusing.
I think more discussion on how to raise this issue client-side is warranted, potentially with a new response for server-side failures of this kind. I do not believe we should simply error out, however, and we should ensure that all of the outer rules have an opportunity to clean up side effects.
@madkinsz this is correct, as discussed above. I think we should probably discuss how we want to more durably record this error on a run, but my hunch is that not cleaning up potential side effects can be actively harmful (and have wider implications beyond this one run), and that was the immediate behavior I intended to address. I didn't really want to spend more time on this, seeing as we've never encountered an exception in the before-transition hook before, but potentially we can still abort the transition (which, beyond client-side considerations, has important server-side implications) AND return the response with a 500 error code.
After a conversation with @madkinsz last night, we've decided to re-raise the orchestration exception AFTER the orchestration rule cleanup steps have been run. This will allow us to fall back into the client-side "crashing" logic if we get a 5xx response from Orion. For future improvements, we'd like to propose potentially rejecting the state and putting the run into a
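The agreed-upon "clean up, then re-raise" behavior might look roughly like the following sketch. The class and function names here are illustrative, not Prefect's actual orchestration API: run each rule's pre-transition hook, and if one raises, give the rules that already ran a chance to clean up before the exception propagates.

```python
# Hypothetical sketch of "clean up, then re-raise". On an exception in a
# pre-transition hook, rules that already entered get their cleanup step,
# and the exception then propagates so the server can return a 5xx.

class Rule:
    def before_transition(self, context: dict) -> None: ...
    def cleanup(self, context: dict) -> None: ...

def run_pre_transition(rules: list, context: dict) -> None:
    entered = []
    try:
        for rule in rules:
            rule.before_transition(context)
            entered.append(rule)
    except Exception:
        # Undo side effects of rules that already ran, innermost first.
        for rule in reversed(entered):
            rule.cleanup(context)
        raise  # surfaces as a 5xx so the client falls into crashing logic
```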
@abrookins Responding to your note above about silent failures, I do believe even if we fall into the crashing logic to return a
Summary
Handles exceptions raised during the pre-transition hook fired by an `OrchestrationRule`. These errors will immediately abort the transition without running the post-transition hook. Normally, rules that abort a transition will also fire the post-transition hook, as we assume the change in proposed state is intentional. Because the rule fails due to an exception, we will treat this as unintentional, so no more hooks will fire.

Co-authored with @madkinsz
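The hook-firing semantics described in the summary can be sketched as follows. Again, the names are illustrative only, not the actual `OrchestrationRule` API: an exception in a pre-transition hook is treated as unintentional and stops all further hooks, whereas rules that ran successfully on a normal transition still get their post-transition hook.

```python
# Sketch of the summary's semantics (hypothetical API). An exception in a
# pre-transition hook is treated as unintentional: no further hooks fire.

def orchestrate(rules: list, context: dict) -> None:
    entered = []
    for rule in rules:
        try:
            rule.before_transition(context)
        except Exception:
            context["abort"] = True
            return  # unintentional failure: skip all post-transition hooks
        entered.append(rule)
    for rule in reversed(entered):
        # Only reached when every pre-transition hook ran without raising.
        rule.after_transition(context)
```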