Recover empty queues, fix 1 cause of missing Bull locks #4315

theoilie · 2022-11-08T06:00:42Z

Description

Increases stability of cluster+Bull when running into unexpected failures:

- Adds a fallback setInterval to detect and recover queues that are missing a job (I was able to reliably reproduce this issue; see the Tests section)
- Simplifies the isWorkerSpecial logic: only the primary stores the special worker's ID, and it passes it as an env var, similar to isWorkerInit
- Removes active and "stuck" active jobs from more queues using cleaner, abstracted util functions
  - I found that these jobs were reproducibly causing errors on restart. Even after obliterating the queue, I saw locally that a restart would log "missing lock for job <high ID>" when there was only 1 job in the queue (meaning fewer than <high ID>), because the previous run of the server had a job with ID <high ID> that ran into some issue
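
As a rough illustration of the fallback described above, here is a minimal sketch of an interval that re-adds a job when a queue has none. It is not the code from this PR. The function names, the job payload factory, and the 60-second interval are hypothetical, and treating "waiting" jobs as non-empty is an assumption. The only Bull members used are `queue.name`, `queue.getJobCounts()`, and `queue.add()`.

```javascript
// Hypothetical sketch: names, interval, and payload are illustrative, not from the PR.

// A queue counts as "empty" when nothing is running now or scheduled to run later.
// `counts` has the shape returned by Bull's queue.getJobCounts().
// Assumption: a waiting job also means the queue is not missing its job.
function isQueueEmpty(counts) {
  const { waiting = 0, active = 0, delayed = 0 } = counts
  return waiting + active + delayed === 0
}

// Checks the queue once and re-adds a job if it has none.
// Resolves to true when a recovery job was added.
async function recoverQueueIfEmpty(queue, makeJobData, logger = console) {
  const counts = await queue.getJobCounts()
  if (!isQueueEmpty(counts)) return false
  logger.warn(`${queue.name} was empty - restarting it`)
  await queue.add(makeJobData())
  return true
}

// Fallback timer. Intended to be started by a single process (for example the
// special worker) so that two processes do not both re-add a job.
function startQueueRecovery(queue, makeJobData, intervalMs = 60000) {
  const timer = setInterval(() => {
    recoverQueueIfEmpty(queue, makeJobData).catch((e) =>
      console.error(`Failed to recover ${queue.name}: ${e.message}`)
    )
  }, intervalMs)
  timer.unref() // do not keep the process alive just for this check
  return timer
}
```

Keeping the emptiness check in a pure function makes it easy to unit test without a Redis instance; only `recoverQueueIfEmpty` needs a real or fake queue.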

Tests

I added debug logs (now removed) to verify that the correct worker is marked as special when respawning, and I did the following:

1. A up; A seed clear; A seed create-user; A seed upload-track
2. Kill the worker with ID 1 (find its pid from the logs and kill it after docker exec'ing into the container)
3. Wait a couple minutes before verifying that the following queues have either an active or delayed job: monitoring-state, c-node-endpoint-to-sp-id, and recover-orphaned-data
4. Restart the container and verify that no "missing lock for job" errors were logged
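
The respawn behavior checked in these steps could look roughly like the sketch below, which uses Node's built-in cluster module. It is an assumption-laden illustration, not the PR's implementation: the env var name `isSpecialWorker`, the choice to pass a boolean flag rather than an ID, and all function names are hypothetical. The idea is that only the primary remembers which worker is special and tells a replacement worker at fork time.

```javascript
const cluster = require('cluster')

// Hypothetical env var name; the PR only says the value is passed "as an env var".
const SPECIAL_WORKER_ENV = 'isSpecialWorker'

// Pure helper: env to pass to the replacement of a worker that just died.
function envForReplacement(deadWorkerId, specialWorkerId) {
  return deadWorkerId === specialWorkerId ? { [SPECIAL_WORKER_ENV]: 'true' } : {}
}

// Runs in a worker: it is special only if the primary said so at fork time.
function isThisWorkerSpecial(env = process.env) {
  return env[SPECIAL_WORKER_ENV] === 'true'
}

// Runs in the primary: fork workers, remember which one is special,
// and hand the role to the replacement if the special worker exits.
function startPrimary(numWorkers) {
  let specialWorkerId = null
  for (let i = 0; i < numWorkers; i++) {
    const isFirst = i === 0
    const worker = cluster.fork(isFirst ? { [SPECIAL_WORKER_ENV]: 'true' } : {})
    if (isFirst) specialWorkerId = worker.id
  }
  cluster.on('exit', (deadWorker) => {
    const env = envForReplacement(deadWorker.id, specialWorkerId)
    const replacement = cluster.fork(env)
    if (isThisWorkerSpecial(env)) specialWorkerId = replacement.id
  })
}
```

Because workers never decide for themselves whether they are special, killing worker 1 as in step 2 should leave exactly one special worker once the replacement is forked.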

Monitoring - How will this change be monitored? Are there sufficient logs / alerts?

- We should see fewer "missing lock for job" errors
- The following queues should always have 1 active or delayed job (verify at /health/bull): monitoring-state, c-node-endpoint-to-sp-id, and recover-orphaned-data
- Watch for log lines containing "was empty - restarting it" to see when a special worker respawned and recovered a queue that had no jobs running
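
For a quick manual check of the two log signals above, a small helper along these lines could count them in a dump of container logs. This is only a sketch: the two substrings come from this description, while the function name and the assumption that logs are plain newline-separated text are mine.

```javascript
// Substrings taken from the PR description; everything else here is illustrative.
const LOG_PATTERNS = {
  missingLock: 'missing lock for job',
  queueRecovered: 'was empty - restarting it'
}

// Counts how many lines of `logText` contain each pattern (case-insensitive).
function countLogEvents(logText) {
  const counts = { missingLock: 0, queueRecovered: 0 }
  for (const line of logText.split('\n')) {
    const lower = line.toLowerCase()
    for (const [key, pattern] of Object.entries(LOG_PATTERNS)) {
      if (lower.includes(pattern)) counts[key]++
    }
  }
  return counts
}
```

After this change, `missingLock` should trend down across restarts, and `queueRecovered` should be nonzero only when a queue actually lost its job.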

SidSethi

i did a first pass but i'm pretty confused about a lot of this logic tbh, i think i need to do another deeper pass

don't block on me for testing this on staging/prod (might be too much work to squash and cherry pick commit onto a prod node)

creator-node/src/monitors/MonitoringQueue.js

creator-node/src/utils/utils.ts

creator-node/src/serviceRegistry.js

creator-node/src/index.ts

creator-node/src/services/stateMachineManager/stateMonitoring/index.js

creator-node/src/services/stateMachineManager/stateReconciliation/index.js

creator-node/src/utils/clusterUtils.ts

SidSethi

nice!
still confusing, but not worth further effort rn
only thing i can think of that might immediately help things is to put a ./utils/cluster/README.md that just summarizes the different types. i know you documented across various files, but still hard to track down

creator-node/src/config.js

creator-node/src/services/stateMachineManager/index.js

creator-node/compose/env/base.env

Recover empty queues, fix 1 cause of missing Bull locks

67ad9ff

theoilie added the content-node label (Content Node, previously known as Creator Node) Nov 8, 2022

theoilie requested a review from SidSethi November 8, 2022 06:00

theoilie assigned dmanjunath Nov 8, 2022

pull-request-size bot added the size/L label Nov 8, 2022

Slight cleanup

9e19610

SidSethi reviewed Nov 8, 2022

Make things a little less confusing

9e96791

pull-request-size bot added size/XL and removed size/L labels Nov 8, 2022

Merge remote-tracking branch 'origin' into theo-bull-missing-lock

afd0964

theoilie requested a review from SidSethi November 9, 2022 06:07

theoilie added 2 commits November 9, 2022 17:54

Disable cluster for tests

f2f3034

Resolve merge conflicts with main

48e28a8

SidSethi approved these changes Nov 29, 2022

creator-node/src/config.js (outdated, resolved)

creator-node/src/services/stateMachineManager/index.js (resolved)

theoilie added 2 commits November 30, 2022 02:25

Fix merge conflicts

afc7a90

Fix flake8

6e6638f

vicky-g reviewed Nov 30, 2022

creator-node/compose/env/base.env (outdated, resolved)

theoilie added 3 commits November 30, 2022 02:59

Address feedback & fix remaining merge conflicts

4a5b649

Re-disable cluster for tests only

c6e17a1

Group cluster utils and add README

483e9ab

theoilie merged commit 6e781f0 into main Nov 30, 2022

theoilie deleted the theo-bull-missing-lock branch November 30, 2022 07:00

theoilie added a commit that referenced this pull request Dec 2, 2022

Recover empty queues, fix 1 cause of missing Bull locks (#4315)

c8135d6
