Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover empty queues, fix 1 cause of missing Bull locks #4315

Merged
merged 11 commits into from
Nov 30, 2022

Conversation

theoilie
Copy link
Contributor

@theoilie theoilie commented Nov 8, 2022

Description

Increases stability of cluster+Bull when running into unexpected failures:

  • Adds fallback setInterval to detect and recover queues that are missing a job (I was able to reliably reproduce this issue – see testing section)
  • Simplifies isWorkerSpecial logic by only making the primary store the special worker's ID and passing it as an env var, similar to isWorkerInit
  • Removes active and "stuck" active jobs from more queues using a cleaner abstracted util functions
    • I found that these jobs were reproducibly causing errors on restart. Even with obliterating the queue, I saw locally that after a restart there would be "missing lock for job <high ID>" when there was only 1 job (meaning less than ) in the queue because the previous run of the server had a job with ID <high ID> that ran into some issue

Tests

I added debug logs (now removed) to verify that the correct worker is marked as special when respawning, and I did the following:

  1. A up; A seed clear; A seed create-user; A seed upload-track
  2. Kill the worker with ID 1 (find its pid from the logs and kill it after docker exec'ing into the container)
  3. Wait a couple minutes before verifying that the following queues have either an active or delayed job: monitoring-state, c-node-endpoint-to-sp-id, and recover-orphaned-data
  4. Restart the container and verify that no "missing lock for job" errors were logged

Monitoring - How will this change be monitored? Are there sufficient logs / alerts?

  • We should see fewer "missing lock for job" errors
  • The following queues should always have 1 active or delayed job (verify at /health/bull): monitoring-state, c-node-endpoint-to-sp-id, and recover-orphaned-data
  • Monitor the following log to see when a special worker respawned and recovered from the situation where no jobs were running in a queue: was empty - restarting it

@theoilie theoilie added the content-node Content Node (previously known as Creator Node) label Nov 8, 2022
Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did a first pass but i'm pretty confused about a lot of this logic tbh, i think i need to do another deeper pass

don't block on me for testing this on staging/prod (might be too much work to squash and cherry pick commit onto a prod node)

creator-node/src/monitors/MonitoringQueue.js Outdated Show resolved Hide resolved
creator-node/src/utils/utils.ts Outdated Show resolved Hide resolved
creator-node/src/serviceRegistry.js Outdated Show resolved Hide resolved
creator-node/src/index.ts Show resolved Hide resolved
creator-node/src/utils/clusterUtils.ts Outdated Show resolved Hide resolved
@pull-request-size pull-request-size bot added size/XL and removed size/L labels Nov 8, 2022
Copy link
Contributor

@SidSethi SidSethi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!
still confusing, but not worth further effort rn
only thing i can think of that might immediately help things is to put a ./utils/cluster/README.md that just summarizes the different types. i know you documented across various files, but still hard to track down

creator-node/src/config.js Outdated Show resolved Hide resolved
@theoilie theoilie merged commit 6e781f0 into main Nov 30, 2022
@theoilie theoilie deleted the theo-bull-missing-lock branch November 30, 2022 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content-node Content Node (previously known as Creator Node) size/XL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants