fix(aci): Retry process_delayed_workflows timeouts #95379


Merged: 4 commits into master from kcons/retry_to on Jul 14, 2025

Conversation

@kcons (Member) commented Jul 11, 2025

Timeouts are expected here and assumed transient, and the existing retry decorator and config don't support retrying ProcessingDeadlineExceeded.
So, we add a new task decorator to trigger retries on timeouts. This avoids opening the door to other BaseExceptions and keeps it opt-in as designed.
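
For context on the constraint described above: the description and the decorator's docstring indicate that ProcessingDeadlineExceeded is a BaseException rather than an ordinary Exception, which is why an Exception-only retry path never sees it. A self-contained toy illustration of that behavior (the class below is a stand-in, not Sentry's real one):

class ProcessingDeadlineExceeded(BaseException):
    """Stand-in only; like the real class, it derives from BaseException."""


def flaky_task() -> None:
    raise ProcessingDeadlineExceeded("processing deadline hit")


def run_with_plain_retry(task) -> None:
    try:
        task()
    except Exception:
        # The usual retry path catches Exception subclasses only,
        # so it never reaches this branch for the timeout above.
        print("would schedule a retry")


run_with_plain_retry(flaky_task)  # the BaseException propagates; no retry is scheduled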

@kcons kcons requested a review from a team as a code owner July 11, 2025 21:19
@github-actions github-actions bot added the Scope: Backend label Jul 11, 2025
A comment from cursor[bot] was marked as outdated.

codecov bot commented Jul 11, 2025

Codecov Report

Attention: Patch coverage is 73.91304% with 6 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines                        Patch %   Lines
src/sentry/workflow_engine/tasks/utils.py       61.53%    5 Missing ⚠️
src/sentry/sentry_apps/tasks/sentry_apps.py      0.00%    1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #95379       +/-   ##
===========================================
+ Coverage   36.78%   87.77%   +50.98%     
===========================================
  Files        9915    10509      +594     
  Lines      557022   606328    +49306     
  Branches    23720    23720               
===========================================
+ Hits       204927   532193   +327266     
+ Misses     351732    73772   -277960     
  Partials      363      363               

@kcons kcons marked this pull request as draft July 11, 2025 21:48
@kcons kcons marked this pull request as ready for review July 14, 2025 16:38
Comment on lines +92 to +96
def retry_timeouts(func):
    """
    Schedule a task retry if the function raises ProcessingDeadlineExceeded.
    This exists because the standard retry decorator doesn't allow BaseExceptions.
    """
A member commented:

If you find yourself using this frequently, we could introduce deadline approaching exceptions that happen some time before the processing deadline is crossed.

@cursor cursor bot left a comment:

Bug: Timeout Handling Fails in Retry Decorator

The retry_timeouts decorator has two issues. First, an exception raised by sentry_sdk.capture_exception() will mask the original ProcessingDeadlineExceeded and prevent retry_task() from being called. Second, if retry_task() unexpectedly doesn't raise an exception (despite its NoReturn annotation), the wrapper implicitly returns None, potentially masking the timeout.

src/sentry/workflow_engine/tasks/utils.py#L97-L105

@wraps(func)
def wrapper(*args, **kwargs):
    try:
        return func(*args, **kwargs)
    except ProcessingDeadlineExceeded:
        sentry_sdk.capture_exception(level="info")
        retry_task()
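
For illustration, a minimal sketch of one way the wrapper could address both points (best-effort reporting, plus re-raising if retry_task() ever returns), assuming the same names (wraps, func, ProcessingDeadlineExceeded, retry_task, sentry_sdk) that are in scope in the excerpt above; this is a sketch, not the fix that landed:

@wraps(func)
def wrapper(*args, **kwargs):
    try:
        return func(*args, **kwargs)
    except ProcessingDeadlineExceeded:
        try:
            # Best-effort reporting; never let a reporting error swallow the timeout.
            sentry_sdk.capture_exception(level="info")
        except Exception:
            pass
        retry_task()
        # retry_task() is annotated NoReturn; if it ever returns anyway,
        # re-raise the original timeout instead of silently returning None.
        raise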


@kcons kcons requested a review from iamrajjoshi July 14, 2025 19:27
@@ -330,7 +330,7 @@ def _process_resource_change(
     except model.DoesNotExist as e:
         # Explicitly requeue the task, so we don't report this to Sentry until
         # we hit the max number of retries.
-        return retry_task(e)
+        retry_task(e)
A member asked:

qq: why did we have to remove return from here?

kcons (Member Author) replied:

Because I redefined retry_task as officially never returning as far as the type system is concerned.
So, this return was kinda incorrect, and due to my change mypy was able to flag it.
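
A small self-contained illustration of the typing point (retry_task and Retry below are toy stand-ins, not the real helpers): once retry_task is annotated NoReturn, mypy knows the call never produces a value, which is how the old return retry_task(e) got flagged.

from typing import NoReturn


class Retry(Exception):
    """Stand-in for whatever the real retry_task raises to requeue the task."""


def retry_task(exc: Exception | None = None) -> NoReturn:
    # The real helper signals the task runner to retry; modeled here as a raise.
    raise Retry() from exc


def lookup(found: bool) -> None:
    if not found:
        # Writing "return retry_task(...)" is now redundant and flaggable, since a
        # NoReturn call can never supply a value; the bare call expresses the intent.
        retry_task(ValueError("not found"))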

@iamrajjoshi (Member) left a comment:

1 more question/thought

@@ -760,6 +761,7 @@ def repr_keys[T, V](d: dict[T, V]) -> dict[str, V]:
     ),
 )
 @retry
+@retry_timeouts
A member commented:

any thoughts on some convenience decorator to group the 2 retries together. something like @retry_including_timeouts

probably will increase adoption. there are a lot of ecosystem tasks i would use this for.
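
Loosely, such a convenience decorator could just stack the two existing ones in the same order as the diff above (@retry outermost, @retry_timeouts innermost). A hypothetical sketch, assuming retry and retry_timeouts are imported as in this PR; the name retry_including_timeouts and the idea of forwarding retry's keyword configuration are assumptions, not what any follow-up actually shipped:

def retry_including_timeouts(func=None, **retry_kwargs):
    """Hypothetical: apply the standard @retry plus @retry_timeouts in one step."""

    def decorator(f):
        # Apply retry_timeouts first so that @retry wraps it, matching the diff above.
        return retry(**retry_kwargs)(retry_timeouts(f))

    # Support both @retry_including_timeouts and @retry_including_timeouts(...).
    if func is not None:
        return decorator(func)
    return decorator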

kcons (Member Author) replied:

follow-up coming soon.

@kcons kcons merged commit 9f14e5d into master Jul 14, 2025
64 checks passed
@kcons kcons deleted the kcons/retry_to branch July 14, 2025 20:10