Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add handling for statement timeouts to state validation #8425

Closed
wants to merge 6 commits into from

Conversation

zanieb
Copy link
Contributor

@zanieb zanieb commented Feb 6, 2023

Closes #8435

Database query timeouts are resulting in an ABORT response from the API:

Task run '80dc471b-bd97-47d2-9f54-fb8707551b39' received abort during orchestration: Error validating state: DBAPIError("(sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.QueryCanceledError'>: canceling statement due to statement timeout") Task run is in RUNNING state.

Previously, these errors resulted in a 503 response allowing the client to retry. Since an aborted task run can crash the flow and these timeouts are happening frequently, customers are blocked by the lack of retry. It seems likely that this regression is due to a change from #8164 where error handling in state validation was fixed.

Here, we explicitly do not capture statement timeouts and instead reraise them. We should probably we doing cleanup instead of just raising an exception, but unfortunately this seems quite complicated and we should first attempt to achieve the previous behavior.

This pull request extends #8423 which adds baseline test coverage for this area.

Example

Discussion at https://prefecthq.slack.com/archives/C04F1PTMUCQ/p1675455197515479
User report at https://prefect-community.slack.com/archives/CL09KU1K7/p1675449149527989

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

@zanieb zanieb added the fix A fix for a bug in an existing feature label Feb 6, 2023
Base automatically changed from add-test-coverage to main February 6, 2023 21:21
@netlify
Copy link

netlify bot commented Feb 7, 2023

Deploy Preview for prefect-orion ready!

Name Link
🔨 Latest commit 906ca0e
🔍 Latest deploy log https://app.netlify.com/sites/prefect-orion/deploys/63e280906366550008443c1e
😎 Deploy Preview https://deploy-preview-8425--prefect-orion.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@zanieb zanieb force-pushed the raise-timeouts-in-state-validation branch from 89d6d34 to 906ca0e Compare February 7, 2023 16:47
Comment on lines +397 to +399
if self.run.state is None:
self.validated_state = None
return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previous behavior but avoids calling self.session.add(None) which throws an exception

@zanieb
Copy link
Contributor Author

zanieb commented Feb 10, 2023

Superseded by https://github.com/PrefectHQ/nebula/pull/3482

@zanieb zanieb closed this Feb 10, 2023
@ccueto36
Copy link

Superseded by https://github.com/PrefectHQ/nebula/pull/3482

Getting a 404 upon clicking this link.

@zanieb
Copy link
Contributor Author

zanieb commented Feb 10, 2023

@ccueto36 that's our internal repository — we're fixing this in Cloud but it does not need a fix in the open source.

@ccueto36
Copy link

@ccueto36 that's our internal repository — we're fixing this in Cloud but it does not need a fix in the open source.

Got it. Thanks for the clarification. Looking forward to the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix A fix for a bug in an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flow Runs fail with DBAPIError (sqlalchemy) - QueryCanceledError
2 participants