[CON-487] Make state machine error recovery more reliable #4239

Merged: 3 commits merged on Oct 31, 2022

Conversation

theoilie
Contributor

Description

  • Fixes onFailure not being triggered because it was registered with queue.on instead of worker.on (this is most likely why jobs were not re-enqueued properly)
  • Makes onError run the same recovery code as onFailure (onError catches "missing lock for job" errors, so we want to recover from those instead of only relying on onFailure)
  • Adds global onFailure and onError callbacks using QueueEvents. This should be more reliable than per-worker events because it uses Redis streams instead of pub-sub
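The wiring described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FailureEmitter`, `registerRecovery`, `FakeEmitter`, and the `recover` callback are hypothetical names, and the in-memory emitter stands in for BullMQ's `Worker` and `QueueEvents` objects so the sketch runs without Redis. The event names and payload shapes (`failed` and `error` on the worker, `failed` with `jobId` and `failedReason` on the global listener) are assumed to match BullMQ's.

```typescript
// Hypothetical stand-in for the part of an event emitter used here.
type Listener = (...args: any[]) => void;
interface FailureEmitter {
  on(event: string, listener: Listener): unknown;
}

type Recover = (source: string, reason: string) => void;

// Attach the same recovery routine to every failure path.
function registerRecovery(
  worker: FailureEmitter,
  queueEvents: FailureEmitter,
  recover: Recover
): void {
  // Per-worker: the job's processor threw.
  worker.on("failed", (_job: unknown, err: Error) => {
    recover("worker.failed", err.message);
  });
  // Per-worker: errors raised outside the processor, e.g. "missing lock for job".
  worker.on("error", (err: Error) => {
    recover("worker.error", err.message);
  });
  // Global: failures reported through the queue-wide event listener.
  queueEvents.on("failed", (args: { jobId: string; failedReason: string }) => {
    recover("queueEvents.failed", args.failedReason);
  });
  queueEvents.on("error", (err: Error) => {
    recover("queueEvents.error", err.message);
  });
}

// Minimal in-memory emitter so the sketch runs without Redis.
class FakeEmitter implements FailureEmitter {
  private listeners: Record<string, Listener[]> = {};
  on(event: string, listener: Listener): this {
    (this.listeners[event] = this.listeners[event] || []).push(listener);
    return this;
  }
  emit(event: string, ...args: any[]): void {
    for (const listener of this.listeners[event] || []) {
      listener(...args);
    }
  }
}

// Demo: three failure paths all reach the same recovery routine.
const recovered: string[] = [];
const worker = new FakeEmitter();
const queueEvents = new FakeEmitter();
registerRecovery(worker, queueEvents, (source, reason) => {
  recovered.push(`${source}: ${reason}`);
});

worker.emit("failed", { id: "1" }, new Error("processor threw"));
worker.emit("error", new Error("missing lock for job 1"));
queueEvents.emit("failed", { jobId: "1", failedReason: "processor threw" });
```

Because the per-worker and global listeners can both fire for the same failed job, a real recovery routine would likely need to be safe to call more than once for the same job.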

Tests

Tested 2 failure scenarios with debugger:

  1. Job throws an error in its processor - it recovers and re-enqueues
  2. Job throws an error in its onComplete callback - it's not able to recover but logs "Fatal: onComplete errored. Cron queues may be broken"

Monitoring - How will this change be monitored? Are there sufficient logs / alerts?

  • Look out for the following error log: "Fatal: onComplete errored. Cron queues may be broken"
  • Check state machine queues on /health/bull to make sure they all have active or waiting jobs
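The second check above could be automated along these lines. This is only a sketch: the `QueueCounts` shape, the `findIdleQueues` helper, and the queue names are hypothetical, and the actual response format of /health/bull is not shown in this PR, so the counts would need to be extracted from whatever that endpoint returns.

```typescript
// Hypothetical shape: job counts per state machine queue.
interface QueueCounts {
  active: number;
  waiting: number;
}

// Return the names of queues that have neither active nor waiting jobs,
// which is the condition the monitoring note says to look out for.
function findIdleQueues(counts: Record<string, QueueCounts>): string[] {
  return Object.keys(counts).filter((name) => {
    const c = counts[name];
    return c.active + c.waiting === 0;
  });
}

// Example with made-up queue names and counts.
const counts: Record<string, QueueCounts> = {
  "queue-a": { active: 1, waiting: 0 },
  "queue-b": { active: 0, waiting: 0 },
  "queue-c": { active: 0, waiting: 2 },
};
const idle = findIdleQueues(counts);
```

Any queue name returned by the helper would be a candidate for a broken cron queue and worth checking against the logs.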

@theoilie theoilie added content-node Content Node (previously known as Creator Node) and removed size/L labels Oct 31, 2022
@SidSethi
Contributor

great call on all of these changes

@theoilie theoilie merged commit 590495f into main Oct 31, 2022
@theoilie theoilie deleted the theo-con-487-bull-issues branch October 31, 2022 23:49
@AudiusProject AudiusProject deleted a comment from linear bot Sep 11, 2023
Labels
content-node Content Node (previously known as Creator Node) size/L

2 participants