
Add more state transition error handling #6315

Merged: 11 commits into main on Aug 11, 2022

Conversation

@anticorrelator (Contributor) commented Aug 8, 2022

Summary

Handles exceptions raised during the pre-transition hook fired by an OrchestrationRule. These errors immediately abort the transition without running the post-transition hook. Normally, rules that abort a transition also fire the post-transition hook, since we assume the change in proposed state is intentional. Because the rule failed due to an exception, we treat the abort as unintentional, so no further hooks fire.

co-authored with @madkinsz
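To make the intended behavior concrete, here is a minimal, self-contained sketch (illustrative only: `ExampleRule` and its hooks are invented stand-ins, not Prefect's actual OrchestrationRule implementation):

```python
# Illustrative sketch only -- the names below are invented stand-ins for
# Prefect's orchestration rules, not the shipped implementation.
import logging

logger = logging.getLogger("orchestration")


class ExampleRule:
    """A rule whose post-transition hook fires only if the pre-transition
    hook completed without raising."""

    def __init__(self) -> None:
        self._pre_hook_succeeded = False

    async def before_transition(self, context: dict) -> None:
        ...  # may raise unexpectedly if the rule has a bug

    async def after_transition(self, context: dict) -> None:
        ...  # side-effect cleanup / bookkeeping

    async def enter(self, context: dict) -> dict:
        try:
            await self.before_transition(context)
            self._pre_hook_succeeded = True
        except Exception:
            # An exception in our own code: log it server-side and abort the
            # proposed transition instead of writing a new state.
            logger.exception("Error running before-transition hook")
            context["response"] = "ABORT"
        return context

    async def exit(self, context: dict) -> None:
        # An abort caused by an exception is treated as unintentional, so the
        # post-transition hook is skipped entirely.
        if self._pre_hook_succeeded:
            await self.after_transition(context)
```

The key detail is the `_pre_hook_succeeded` flag: an exception in `before_transition` aborts the transition and prevents `after_transition` from ever firing.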

@abrookins (Collaborator) left a comment

This makes sense to me, but after thinking about the behavior from a user's perspective, I sense two different product principles that seem to be in tension here:

  1. A state should only ever mean one thing. In this case, the CRASHED state only ever means the user's code crashed (versus the user's code or our code crashed).
  2. We never fail the user silently. In this case, with this PR as-is, the state of the task run that we crashed while trying to orchestrate remains PENDING; but if we never fail silently for the user, it should be either CRASHED or a new state that means "we crashed while trying to orchestrate this."

I don't know if these are real product principles for us, though, so I'm not sure which makes the most sense. I'm trying to be more careful to elevate anything even product-related to the product folks, so I'm going to invoke @billpalombi to get some backup.

assert cleanup_step.call_count == 0

# Check the task run state
task_run_states = await models.task_run_states.read_task_run_states(
    session=session, task_run_id=task_run.id
)

assert len(task_run_states) == 1, "No transition was made"

Collaborator:

Should these asserts stick around? They seem useful to verify the state didn't transition.

Collaborator:

Noting from an out-of-PR conversation: this check verified that a state was written, and now we don't write a state.

@billpalombi (Contributor)

I'm sorry, I'm having difficulty understanding both what the current behavior is and what the proposed behavior would be. It appears as though this change is relevant to all state transitions, not just transitions to/from the CRASHED state, correct?

Regarding the principles that you proposed, @abrookins:

  1. Broadly speaking, I agree that a state name should mean one thing. However, a state type can mean many things. The primary question for whether a new state type is necessary is whether we want to orchestrate the run differently based on this state. For example, "Running" and "Retrying" are different names because the semantic difference is important, but they are both the same type, RUNNING, because the orchestration rules to be applied are exactly the same (see the sketch after this list).
  2. I agree very strongly that we should not fail silently.
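To illustrate the name-versus-type distinction, a tiny generic sketch (the fields below mirror the concept, not any exact schema):

```python
# Tiny illustration of "state name vs. state type" (generic sketch; the
# fields mirror the concept rather than any exact schema).
from dataclasses import dataclass
from enum import Enum


class StateType(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    CRASHED = "CRASHED"


@dataclass
class State:
    type: StateType  # drives which orchestration rules apply
    name: str        # carries the human-facing semantics


# "Running" and "Retrying" read differently to users, but both are
# orchestrated identically because they share the RUNNING type.
running = State(type=StateType.RUNNING, name="Running")
retrying = State(type=StateType.RUNNING, name="Retrying")
```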

@anticorrelator (Contributor, Author)

@abrookins @billpalombi I do believe that an entirely new state (or other persisted object) that encompasses orchestration failures is a potentially useful thing. Perhaps an annotation on the run itself is sufficient?

I do want to point out that aborted transitions are not silent. An abort response is returned to the client, which will be reflected to the user with an abort message, and a similar error is logged on the server. I suspect this bit of exception handling will be used extremely rarely, most likely when a custom, untested rule has been injected into the orchestration policy.

@billpalombi (Contributor)

I just discussed this change with @anticorrelator. We agree that this exception does not fail silently, because it is both logged on the server and included in the state transition response. We also agree that this exception does not motivate a new state name or type, because it describes a failure in the execution of our code rather than the user's code. However, when this type of error occurs, we should make it more prominent by including it in the flow/task run logs, in addition to the Orion server logs.

@zanieb (Contributor) commented Aug 9, 2022

@abrookins I think it's worth noting that the CRASHED state does not indicate a failure of the user's code. In fact, it's explicitly a failure outside of their code. Crashes indicate a failure that we cannot recover from, generally an exception in Prefect's orchestration of your code or failure of the infrastructure running the flow. These are often not distinguishable to us since infrastructure errors appear as exceptions within our engine.

@abrookins (Collaborator)

So just to be clear, this looks like strictly an improvement, and my only immediate question is about the tests.

On whether or not our behavior, in this case, fails silently: Thank you for mentioning the logged errors, @anticorrelator. If those appear in the logs that users can see in our UI, then I agree that this does not fail silently.

If I imagine my expectations as a user, I would want to see that this happened very clearly. Maybe I'm not studying my agent's logs. I'd want to see that the final state of the flow or task run was some kind of "error" in the UI: a visual indication that I can learn more about what went wrong by reading the flow run or task run logs.

I defer to Bill for decisions about how the product should work, ultimately. For my understanding though, I'd like to know more about CRASHED and why it isn't applicable in this case. I may not understand completely!

Our docs on states say this about CRASHED:

The run did not complete because of an infrastructure issue.

But Michael describes it differently, like this:

Crashes indicate a failure that we cannot recover from, generally an exception in Prefect's orchestration of your code or failure of the infrastructure running the flow.

@billpalombi We should get aligned on what CRASHED means. "Generally an exception in Prefect's orchestration of your code" seems to match exactly what happens in this PR to my perhaps-naive eyes.

@zanieb (Contributor) commented Aug 9, 2022

To clarify the user's experience here:

  • They are running a flow which proposes a state transition
  • The server encounters an unexpected exception
  • The exception is logged on the server
  • The server returns an ABORT response with a reason populated
  • The flow run immediately exits orchestration
  • The abort message is logged to the engine logger. This appears in stderr by default. This message is not sent to the server logger.

In summary:

  • The flow run exits without changing states
  • The abort is logged in the server and client, but not attached to the flow run

(Dustin will need to confirm this is correct)

@zanieb (Contributor) commented Aug 9, 2022

I'm not sure we should send an ABORT response here, since it is basically a "silent" exit of the flow run. We'd leave it hanging in its old state (perhaps RUNNING). It seems like we should return a 500 from the server, allowing the client to crash the run.
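To make the two outcomes being weighed here concrete, a rough sketch (the `SetStateResponse` and `ResponseStatus` types below are invented for illustration and are not Prefect's client API):

```python
# Hypothetical sketch of the two client-side outcomes being debated -- not
# Prefect's engine code. The types and field names are invented here.
from dataclasses import dataclass
from enum import Enum


class ResponseStatus(Enum):
    ACCEPT = "ACCEPT"
    ABORT = "ABORT"
    SERVER_ERROR = "SERVER_ERROR"  # e.g. an HTTP 500


@dataclass
class SetStateResponse:
    status: ResponseStatus
    reason: str = ""


def handle_set_state_response(response: SetStateResponse) -> str:
    if response.status is ResponseStatus.ABORT:
        # Quiet exit: the engine logs the reason and stops orchestrating,
        # leaving the run in its previous state (e.g. RUNNING).
        print(f"Aborting orchestration: {response.reason}")
        return "exit-without-state-change"
    if response.status is ResponseStatus.SERVER_ERROR:
        # Noisy exit: the client falls into its crash-handling path and
        # marks the run CRASHED, so the failure is visible in the UI.
        print(f"Orchestration error: {response.reason}")
        return "mark-run-crashed"
    return "continue"
```

Either way the run stops; the difference is whether the failure ends up recorded on the run itself (a CRASHED state, visible in the UI) or only in server and engine logs.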

@abrookins (Collaborator) commented Aug 9, 2022

Our discussion here about the user seeing the error message suggests that the user is running the flow, but what about flows run by deployments, etc., where the user is not actively doing anything, and instead an agent is running somewhere?

Based on what Michael wrote, it sounds like the flow run would stay in RUNNING state and no error would be logged in the UI associated with the flow run -- if that's correct, that sounds "quiet" if not "silent" to me. 🤔

@abrookins (Collaborator) commented Aug 9, 2022

Well, we can continue this conversation -- perhaps in a GitHub issue? This PR helps establish better error handling, and I think we should accept it once my question about the test changes is resolved. Then as Bill said:

However, when this type of error occurs, we should make it more prominent by including it in the flow/task run logs, in addition to the Orion server logs.

This seems like the content of the GitHub issue in which we take further steps to improve the visibility of this type of error. Whether more visibility includes only the error showing up in the logs associated with the flow/task run or also includes a new state or use of CRASHED, we can decide in that issue.

@abrookins (Collaborator) left a comment

Other than a little test cleanup we may need to do, this looks good. 👍

I think we should continue the conversation around observability, etc. in a GH issue.

@zanieb (Contributor) commented Aug 9, 2022

I think the addition of server-side logging here is valuable, but I do not think we should send an ABORT. Previously, this error would be returned as a bad response and propagate to the client, resulting in a CRASHED flow run, which is a better outcome for the user. Since the flow run is now left in its previous state, I think this makes the user experience for this error more confusing.

@anticorrelator (Contributor, Author)

I think more discussion on how to raise this issue client-side is warranted, potentially with a new response for server-side failures of this kind. I do not believe we should simply error out, however; we should ensure that all of the outer rules have an opportunity to clean up side effects.

@anticorrelator (Contributor, Author) commented Aug 10, 2022

> To clarify the user's experience here:
>
>   • They are running a flow which proposes a state transition
>   • The server encounters an unexpected exception
>   • The exception is logged on the server
>   • The server returns an ABORT response with a reason populated
>   • The flow run immediately exits orchestration
>   • The abort message is logged to the engine logger. This appears in stderr by default. This message is not sent to the server logger.
>
> In summary:
>
>   • The flow run exits without changing states
>   • The abort is logged in the server and client, but not attached to the flow run
>
> (Dustin will need to confirm this is correct)

@madkinsz this is correct. As discussed above, I think we should probably discuss how we want to more durably record this error on a run, but my hunch is that not cleaning up potential side effects can be actively harmful (and have wider implications beyond this one run), and that was the immediate behavior I intended to address. I didn't really want to spend more time on this, seeing as we've never encountered an exception in the before-transition hook before, but potentially we can still abort the transition (which, beyond client-side considerations, has important server-side implications) AND return the response with a 500 error code.

@anticorrelator (Contributor, Author)

After a conversation with @madkinsz last night, we've decided to re-raise the orchestration exception AFTER the orchestration rule cleanup steps have run. This will allow us to fall back into the client-side "crashing" logic if we get a 5xx response from Orion (a rough sketch is below).

For future improvements, we'd like to propose rejecting the state and putting the run into a CRASHED state (or a different state, if we believe crashing is for infrastructure failures only) so that the run status is updated accordingly. This should satisfy all our requirements:

  • cleaning up all side effects from incomplete orchestration
  • noisily failing so the user is aware an error happened
  • producing a durable record that is intuitively visible via the UI
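A rough sketch of the reraise-after-cleanup flow described above (illustrative only; the rule objects and their `before_transition`/`cleanup` methods are hypothetical, not the merged implementation):

```python
# Illustrative sketch only -- rule and context shapes are hypothetical, not
# the merged Prefect implementation.
import logging

logger = logging.getLogger("orchestration")


async def orchestrate_with_cleanup(rules, context):
    """Apply each rule's pre-transition hook; if any hook raises, unwind the
    rules that already ran, then re-raise so the server can respond with a
    5xx and the client can fall back into its crash-handling logic."""
    entered = []
    try:
        for rule in rules:
            await rule.before_transition(context)
            entered.append(rule)
    except Exception:
        logger.exception("Error while orchestrating state transition")
        # Give every outer rule a chance to clean up its side effects before
        # surfacing the error.
        for rule in reversed(entered):
            await rule.cleanup(context)
        raise
    return context
```

Because the exception is re-raised only after every already-entered rule has unwound, the server can respond with a 5xx without leaving half-applied orchestration side effects behind.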

@anticorrelator (Contributor, Author)

@abrookins Responding to your note above about silent failures: even if we fall into the crashing logic and return a CRASHED state as Michael noted above, if we don't re-propose that state to Orion the run will still look like it's stuck. While I don't think we should spend more time on it now, I do believe that rejecting the proposed state in the event of an error and writing the appropriate state to the run would satisfy all of our requirements.

@zanieb merged commit bab867a into main on Aug 11, 2022
@zanieb deleted the add-more-state-transition-error-handling branch on Aug 11, 2022 at 17:28