[CON-487] Make state machine error recovery more reliable #4239
Description
- Fix `onFailure` not being recognized due to calling `queue.on` instead of `worker.on` (this is most likely what made jobs not re-enqueue properly)
- Make `onError` run the same recovery code as `onFailure` (`onError` catches "missing lock for job" errors, so we want to recover from those instead of relying only on `onFailure`)
- Register the `onFailure` and `onError` callbacks using `QueueEvents`. This should be more reliable than per-worker events because it uses Redis streams instead of pub-sub
Tests
Tested 2 failure scenarios with debugger:
Fatal: onComplete errored. Cron queues may be broken
Monitoring - How will this change be monitored? Are there sufficient logs / alerts?
Fatal: onComplete errored. Cron queues may be broken
Check `/health/bull` to make sure they all have active or waiting jobs
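
The two fixes in the Description (listening on the right emitter, and sending both failure-style events through one recovery path) can be illustrated with a minimal sketch. It does not use the queue library at all: `TinyEmitter`, `wireRecovery`, and the `queue` / `worker` variables are hypothetical stand-ins invented for this example, and the event names `failed` and `error` are assumptions about how the callbacks are wired.

```typescript
// Hypothetical stand-ins only: nothing here is the PR's code or a real library API.
type Listener = (...args: unknown[]) => void;

class TinyEmitter {
  private listeners = new Map<string, Listener[]>();

  on(event: string, listener: Listener): this {
    const list = this.listeners.get(event) ?? [];
    list.push(listener);
    this.listeners.set(event, list);
    return this;
  }

  // Returns true if at least one listener was called.
  emit(event: string, ...args: unknown[]): boolean {
    const list = this.listeners.get(event) ?? [];
    list.forEach((listener) => listener(...args));
    return list.length > 0;
  }
}

// Shared recovery path: both "failed" and "error" funnel into one function,
// so a lock error is handled the same way as an ordinary job failure.
function wireRecovery(
  source: TinyEmitter,
  recover: (reason: string) => void,
): void {
  source.on("failed", (reason) => recover(`failed: ${String(reason)}`));
  source.on("error", (err) => recover(`error: ${String(err)}`));
}

const queue = new TinyEmitter(); // stand-in for an object that never emits job failures
const worker = new TinyEmitter(); // stand-in for the object that does emit them

const recovered: string[] = [];

// Listener on the wrong emitter: it registers without complaint but is never called.
queue.on("failed", () => recovered.push("from queue"));

// Listener on the right emitter, covering both event types.
wireRecovery(worker, (reason) => recovered.push(reason));

worker.emit("failed", "job threw");
worker.emit("error", "Missing lock for job 1");

console.log(recovered);
```

The listener attached to `queue` is silently ignored, which matches the symptom described above: nothing fails loudly, the recovery code simply never runs.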