
Add more state transition error handling #6315

Merged: 11 commits into main on Aug 11, 2022

Conversation

@anticorrelator (Contributor) commented Aug 8, 2022

Summary

Handles exceptions raised during the pre-transition hook fired by an OrchestrationRule. These errors immediately abort the transition without running the post-transition hook. Normally, rules that abort a transition also fire the post-transition hook, since we assume the change in proposed state is intentional. Because the rule failed due to an exception, we treat the abort as unintentional, so no further hooks fire.

co-authored with @madkinsz
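To make the intended behavior concrete, here is a minimal, self-contained sketch (illustrative only: `ExampleRule` and its hooks are invented stand-ins, not Prefect's actual OrchestrationRule implementation):

```python
# Illustrative sketch only -- the names below are invented stand-ins for
# Prefect's orchestration rules, not the shipped implementation.
import logging

logger = logging.getLogger("orchestration")


class ExampleRule:
    """A rule whose post-transition hook fires only if the pre-transition
    hook completed without raising."""

    def __init__(self) -> None:
        self._pre_hook_succeeded = False

    async def before_transition(self, context: dict) -> None:
        ...  # may raise unexpectedly if the rule has a bug

    async def after_transition(self, context: dict) -> None:
        ...  # side-effect cleanup / bookkeeping

    async def enter(self, context: dict) -> dict:
        try:
            await self.before_transition(context)
            self._pre_hook_succeeded = True
        except Exception:
            # An exception in our own code: log it server-side and abort the
            # proposed transition instead of writing a new state.
            logger.exception("Error running before-transition hook")
            context["response"] = "ABORT"
        return context

    async def exit(self, context: dict) -> None:
        # An abort caused by an exception is treated as unintentional, so the
        # post-transition hook is skipped entirely.
        if self._pre_hook_succeeded:
            await self.after_transition(context)
```

The key detail is the `_pre_hook_succeeded` flag: an exception in `before_transition` aborts the transition and prevents `after_transition` from ever firing.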

@abrookins (Collaborator) left a comment

This makes sense to me, but after thinking about the behavior from a user's perspective, I sense two different product principles that seem to be in tension here:

  1. A state should only ever mean one thing. In this case, the CRASHED state only ever means the user's code crashed (versus the user's code or our code crashed).
  2. We never fail the user silently. In this case, with this PR as-is, the state of the task run that we crashed while trying to orchestrate remains PENDING; but if we never fail silently for the user, it should be either CRASHED or a new state that means "we crashed while trying to orchestrate this."

I don't know if these are real product principles for us, though, so I'm not sure which makes the most sense. I'm trying to be more careful to elevate anything even product-related to the product folks, so I'm going to invoke @billpalombi to get some backup.

assert cleanup_step.call_count == 0

# Check the task run state
task_run_states = await models.task_run_states.read_task_run_states(
    session=session, task_run_id=task_run.id
)

assert len(task_run_states) == 1, "No transition was made"

Collaborator:

Should these asserts stick around? They seem useful to verify the state didn't transition.

Collaborator:

Noting from an out-of-PR conversation: this check verified that a state was written, and now we don't write a state.

@billpalombi (Contributor)

I'm sorry, I'm having difficulty understanding both what the current behavior is and what the proposed behavior would be. It appears as though this change is relevant to all state transitions, not just transitions to/from the CRASHED state, correct?

Regarding the principles that you proposed, @abrookins:

  1. Broadly speaking, I agree that a state name should mean one thing. However, a state type can mean many things. The primary question for whether a new state type is necessary is whether we want to orchestrate the run differently based on this state. For example, "Running" and "Retrying" are different names because the semantic difference is important, but they are both the same type, RUNNING, because the orchestration rules to be applied are exactly the same (see the sketch after this list).
  2. I agree very strongly that we should not fail silently.
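To illustrate the name-versus-type distinction, a tiny generic sketch (the fields below mirror the concept, not any exact schema):

```python
# Tiny illustration of "state name vs. state type" (generic sketch; the
# fields mirror the concept rather than any exact schema).
from dataclasses import dataclass
from enum import Enum


class StateType(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    CRASHED = "CRASHED"


@dataclass
class State:
    type: StateType  # drives which orchestration rules apply
    name: str        # carries the human-facing semantics


# "Running" and "Retrying" read differently to users, but both are
# orchestrated identically because they share the RUNNING type.
running = State(type=StateType.RUNNING, name="Running")
retrying = State(type=StateType.RUNNING, name="Retrying")
```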

@anticorrelator (Contributor, Author)

@abrookins @billpalombi I do believe that an entirely new state (or other persisted object) that encompasses orchestration failures is a potentially useful thing. Perhaps an annotation on the run itself is sufficient?

I do want to point out that aborted transitions are not silent. An abort response is returned to the client, which will be reflected to the user with an abort message, and a similar error is logged on the server. I suspect this bit of exception handling will be used extremely rarely, most likely when a custom, untested rule has been injected into the orchestration policy.

@billpalombi (Contributor)

I just discussed this change with @anticorrelator. We agree that this exception does not fail silently, because it is both logged on the server and included in the state transition response. We also agree that this exception does not motivate a new state name or type, because it describes a failure in the execution of our code rather than the user's code. However, when this type of error occurs, we should make it more prominent by including it in the flow/task run logs, in addition to the Orion server logs.

@zanieb (Contributor) commented Aug 9, 2022

@abrookins I think it's worth noting that the CRASHED state does not indicate a failure of the user's code. In fact, it's explicitly a failure outside of their code. Crashes indicate a failure that we cannot recover from, generally an exception in Prefect's orchestration of your code or failure of the infrastructure running the flow. These are often not distinguishable to us since infrastructure errors appear as exceptions within our engine.

@abrookins (Collaborator)

So just to be clear, this looks like strictly an improvement, and my only immediate question is about the tests.

On whether or not our behavior, in this case, fails silently: Thank you for mentioning the logged errors, @anticorrelator. If those appear in the logs that users can see in our UI, then I agree that this does not fail silently.

If I imagine my expectations as a user, I would want to see that this happened very clearly. Maybe I'm not studying my agent's logs. I'd want to see that the final state of the flow or task run was some kind of "error" in the UI: a visual indication that I can learn more about what went wrong by reading the flow run or task run logs.

I defer to Bill for decisions about how the product should work, ultimately. For my understanding though, I'd like to know more about CRASHED and why it isn't applicable in this case. I may not understand completely!

Our docs on states say this about CRASHED:

The run did not complete because of an infrastructure issue.

But Michael describes it differently, like this:

Crashes indicate a failure that we cannot recover from, generally an exception in Prefect's orchestration of your code or failure of the infrastructure running the flow.

@billpalombi We should get aligned on what CRASHED means. "Generally an exception in Prefect's orchestration of your code" seems to match exactly what happens in this PR to my perhaps-naive eyes.

@zanieb (Contributor) commented Aug 9, 2022

To clarify the user's experience here:

  • They are running a flow which proposes a state transition
  • The server encounters an unexpected exception
  • The exception is logged on the server
  • The server returns an ABORT response with a reason populated
  • The flow run immediately exits orchestration
  • The abort message is logged to the engine logger. This appears in stderr by default. This message is not sent to the server logger.

In summary:

  • The flow run exits without changing states
  • The abort is logged in the server and client, but not attached to the flow run

(Dustin will need to confirm this is correct)

@zanieb (Contributor) commented Aug 9, 2022

I'm not sure we should send an ABORT response here, since it is basically a "silent" exit of the flow run. We'd leave it hanging in its old state (perhaps RUNNING). It seems like we should return a 500 from the server, allowing the client to crash the run.
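To make the two outcomes being weighed here concrete, a rough sketch (the `SetStateResponse` and `ResponseStatus` types below are invented for illustration and are not Prefect's client API):

```python
# Hypothetical sketch of the two client-side outcomes being debated -- not
# Prefect's engine code. The types and field names are invented here.
from dataclasses import dataclass
from enum import Enum


class ResponseStatus(Enum):
    ACCEPT = "ACCEPT"
    ABORT = "ABORT"
    SERVER_ERROR = "SERVER_ERROR"  # e.g. an HTTP 500


@dataclass
class SetStateResponse:
    status: ResponseStatus
    reason: str = ""


def handle_set_state_response(response: SetStateResponse) -> str:
    if response.status is ResponseStatus.ABORT:
        # Quiet exit: the engine logs the reason and stops orchestrating,
        # leaving the run in its previous state (e.g. RUNNING).
        print(f"Aborting orchestration: {response.reason}")
        return "exit-without-state-change"
    if response.status is ResponseStatus.SERVER_ERROR:
        # Noisy exit: the client falls into its crash-handling path and
        # marks the run CRASHED, so the failure is visible in the UI.
        print(f"Orchestration error: {response.reason}")
        return "mark-run-crashed"
    return "continue"
```

Either way the run stops; the difference is whether the failure ends up recorded on the run itself (a CRASHED state, visible in the UI) or only in server and engine logs.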

@abrookins (Collaborator) commented Aug 9, 2022

Our discussion here about the user seeing the error message suggests that the user is running the flow, but what about flows run by deployments, etc., where the user is not actively doing anything, and instead an agent is running somewhere?

Based on what Michael wrote, it sounds like the flow run would stay in RUNNING state and no error would be logged in the UI associated with the flow run -- if that's correct, that sounds "quiet" if not "silent" to me. 🤔

@abrookins (Collaborator) commented Aug 9, 2022

Well, we can continue this conversation -- perhaps in a GitHub issue? This PR helps establish better error handling, and I think we should accept it once my question about the test changes is resolved. Then as Bill said:

However, when this type of error occurs, we should make it more prominent by including it in the flow/task run logs, in addition to the Orion server logs.

This seems like the content of the GitHub issue in which we take further steps to improve the visibility of this type of error. Whether more visibility includes only the error showing up in the logs associated with the flow/task run or also includes a new state or use of CRASHED, we can decide in that issue.

@abrookins (Collaborator) left a comment

Other than a little test cleanup we may need to do, this looks good. 👍

I think we should continue the conversation around observability, etc. in a GH issue.

@zanieb (Contributor) commented Aug 9, 2022

I think the addition of server-side logging here is valuable, but I do not think we should send an ABORT. Previously, this error would be returned as a bad response and propagate to the client, resulting in a CRASHED flow run, which is a better outcome for the user. Since the flow run is now left in its previous state, I think this makes the user experience for this error more confusing.

@anticorrelator (Contributor, Author)

I think more discussion on how to raise this issue client-side is warranted, potentially with a new response for server-side failures of this kind. I do not believe we should simply error out, however; we should ensure that all of the outer rules have an opportunity to clean up side effects.

@anticorrelator (Contributor, Author) commented Aug 10, 2022

> To clarify the user's experience here:
>
>   • They are running a flow which proposes a state transition
>   • The server encounters an unexpected exception
>   • The exception is logged on the server
>   • The server returns an ABORT response with a reason populated
>   • The flow run immediately exits orchestration
>   • The abort message is logged to the engine logger. This appears in stderr by default. This message is not sent to the server logger.
>
> In summary:
>
>   • The flow run exits without changing states
>   • The abort is logged in the server and client, but not attached to the flow run
>
> (Dustin will need to confirm this is correct)

@madkinsz this is correct. As discussed above, I think we should probably discuss how we want to more durably record this error on a run, but my hunch is that not cleaning up potential side effects can be actively harmful (and have wider implications beyond this one run), and that was the immediate behavior I intended to address. I didn't really want to spend more time on this, seeing as we've never encountered an exception in the before-transition hook before, but potentially we can still abort the transition (which, beyond client-side considerations, has important server-side implications) AND return the response with a 500 error code.

@anticorrelator (Contributor, Author)

After a conversation with @madkinsz last night, we've decided to re-raise the orchestration exception AFTER the orchestration rule cleanup steps have run. This will allow us to fall back into the client-side "crashing" logic if we get a 5xx response from Orion (a rough sketch is below).

For future improvements, we'd like to propose rejecting the state and putting the run into a CRASHED state (or a different state, if we believe crashing is for infrastructure failures only) so that the run status is updated accordingly. This should satisfy all our requirements:

  • cleaning up all side effects from incomplete orchestration
  • noisily failing so the user is aware an error happened
  • producing a durable record that is intuitively visible via the UI
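A rough sketch of the reraise-after-cleanup flow described above (illustrative only; the rule objects and their `before_transition`/`cleanup` methods are hypothetical, not the merged implementation):

```python
# Illustrative sketch only -- rule and context shapes are hypothetical, not
# the merged Prefect implementation.
import logging

logger = logging.getLogger("orchestration")


async def orchestrate_with_cleanup(rules, context):
    """Apply each rule's pre-transition hook; if any hook raises, unwind the
    rules that already ran, then re-raise so the server can respond with a
    5xx and the client can fall back into its crash-handling logic."""
    entered = []
    try:
        for rule in rules:
            await rule.before_transition(context)
            entered.append(rule)
    except Exception:
        logger.exception("Error while orchestrating state transition")
        # Give every outer rule a chance to clean up its side effects before
        # surfacing the error.
        for rule in reversed(entered):
            await rule.cleanup(context)
        raise
    return context
```

Because the exception is re-raised only after every already-entered rule has unwound, the server can respond with a 5xx without leaving half-applied orchestration side effects behind.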

@anticorrelator (Contributor, Author)

@abrookins Responding to your note above about silent failures: even if we fall into the crashing logic and return a CRASHED state as Michael noted above, if we don't re-propose that state to Orion the run will still look like it's stuck. While I don't think we should spend more time on it now, I do believe that rejecting the proposed state in the event of an error and writing the appropriate state to the run would satisfy all of our requirements.

@zanieb merged commit bab867a into main on Aug 11, 2022
@zanieb deleted the add-more-state-transition-error-handling branch on Aug 11, 2022 at 17:28