[CON-487] Make state machine error recovery more reliable #4239

theoilie · 2022-10-31T22:35:14Z

Fixes onFailure never being triggered because it was registered with queue.on instead of worker.on (this is most likely why jobs were not re-enqueued properly)
Makes onError run the same recovery code as onFailure (onError catches "missing lock for job" errors, so we want to recover from those too instead of relying only on onFailure)
Adds global onFailure and onError callbacks using QueueEvents. This should be more reliable than per-worker events because QueueEvents uses Redis streams instead of pub-sub
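
For readers unfamiliar with the event wiring described above, here is a minimal sketch of the idea, assuming BullMQ-style events. The helper name registerRecovery, the recover callback, and the payload shape passed to it are hypothetical and are not taken from this PR. Plain EventEmitter instances stand in for the Worker and QueueEvents objects so the sketch runs without Redis.

```javascript
const { EventEmitter } = require('events')

// Hypothetical helper (not from the PR): routes every failure signal to one
// shared recovery routine so that no path is handled differently.
function registerRecovery({ worker, queueEvents, recover }) {
  // Per-worker events: 'failed' and 'error' are emitted by the worker,
  // so the listeners must be attached to the worker, not to the queue.
  worker.on('failed', (job, error) =>
    recover({ source: 'worker.failed', jobId: job ? job.id : undefined, reason: error.message })
  )
  worker.on('error', (error) =>
    recover({ source: 'worker.error', jobId: undefined, reason: error.message })
  )
  // Global events: the 'failed' payload carries { jobId, failedReason }
  // rather than a full job object.
  queueEvents.on('failed', ({ jobId, failedReason }) =>
    recover({ source: 'queueEvents.failed', jobId, reason: failedReason })
  )
  queueEvents.on('error', (error) =>
    recover({ source: 'queueEvents.error', jobId: undefined, reason: error.message })
  )
}

// Stand-ins for a real Worker and QueueEvents (assumption: real code would
// pass BullMQ instances here instead).
const fakeWorker = new EventEmitter()
const fakeQueueEvents = new EventEmitter()
const recoveries = []

registerRecovery({
  worker: fakeWorker,
  queueEvents: fakeQueueEvents,
  recover: (info) => recoveries.push(info)
})

// Simulate the three failure paths mentioned in the description.
fakeWorker.emit('failed', { id: '1' }, new Error('processor threw'))
fakeWorker.emit('error', new Error('Missing lock for job 1'))
fakeQueueEvents.emit('failed', { jobId: '1', failedReason: 'processor threw' })
```

Because one failure can be reported on more than one path (per-worker and global), a recovery routine wired this way would presumably need to tolerate being called more than once for the same job.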

Tested 2 failure scenarios with debugger:

Job throws an error in its processor - it recovers and re-enqueues
Job throws an error in its onComplete callback - it's not able to recover, but it logs "Fatal: onComplete errored. Cron queues may be broken"
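
The two scenarios above can be modeled roughly as follows. This is a simplified, synchronous sketch for illustration only: runJobOnce, reenqueue, and fakeLogger are hypothetical names, and the real code is asynchronous and event-driven. Only the log message is taken from this description.

```javascript
const FATAL_MESSAGE = 'Fatal: onComplete errored. Cron queues may be broken'

// Hypothetical, synchronous model of one job's lifecycle (not the PR's code).
function runJobOnce({ processor, onComplete, reenqueue, logger }) {
  let result
  try {
    result = processor()
  } catch (error) {
    // Scenario 1: the processor throws, so recovery re-enqueues the job.
    reenqueue()
    return 'recovered'
  }
  try {
    onComplete(result)
    return 'completed'
  } catch (error) {
    // Scenario 2: onComplete throws. Nothing re-enqueues here, so the only
    // signal is the fatal log line.
    logger.error(FATAL_MESSAGE)
    return 'fatal'
  }
}

const reenqueued = []
const logged = []
const fakeLogger = { error: (message) => logged.push(message) }

const scenario1 = runJobOnce({
  processor: () => { throw new Error('processor failed') },
  onComplete: () => {},
  reenqueue: () => reenqueued.push('next job'),
  logger: fakeLogger
})

const scenario2 = runJobOnce({
  processor: () => 'ok',
  onComplete: () => { throw new Error('onComplete failed') },
  reenqueue: () => reenqueued.push('next job'),
  logger: fakeLogger
})
```

In this model the second scenario leaves nothing re-enqueued, which is why the fatal log line is the thing to watch for.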

Look out for the following error log: "Fatal: onComplete errored. Cron queues may be broken"
Check state machine queues on /health/bull to make sure they all have active or waiting jobs
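
The queue check above could also be expressed in code. The sketch below assumes per-queue counts shaped like the object BullMQ's queue.getJobCounts('active', 'waiting') resolves to. The function name findIdleQueues and the queue names in the example are hypothetical, and this is not how /health/bull is implemented.

```javascript
// Hypothetical helper: returns the names of queues that have neither active
// nor waiting jobs, which is the condition the check above looks for.
function findIdleQueues(countsByQueue) {
  return Object.entries(countsByQueue)
    .filter(([, counts]) => (counts.active || 0) + (counts.waiting || 0) === 0)
    .map(([name]) => name)
}

// Example input with made-up queue names and counts.
const exampleCounts = {
  'queue-a': { active: 1, waiting: 0 },
  'queue-b': { active: 0, waiting: 2 },
  'queue-c': { active: 0, waiting: 0 }
}

const idleQueues = findIdleQueues(exampleCounts)
```

An empty result would mean every queue has at least one active or waiting job.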

creator-node/src/services/stateMachineManager/stateMonitoring/index.js

SidSethi · 2022-10-31T22:42:06Z

great call on all of these changes

Make state machine error recovery more reliable

d780d8e

theoilie requested a review from SidSethi October 31, 2022 22:35

pull-request-size bot added the size/L label Oct 31, 2022

theoilie added content-node Content Node (previously known as Creator Node) and removed size/L labels Oct 31, 2022

SidSethi reviewed Oct 31, 2022

creator-node/src/services/stateMachineManager/stateMonitoring/index.js

SidSethi approved these changes Oct 31, 2022

A little cleanup

9303624

pull-request-size bot added the size/L label Oct 31, 2022

Update test

f0db3d0

theoilie merged commit 590495f into main Oct 31, 2022

theoilie deleted the theo-con-487-bull-issues branch October 31, 2022 23:49

AudiusProject deleted a comment from linear bot Sep 11, 2023