[ASI-976] [REPLICA] Fix StateMachineQueue unreliability + expose stateMachineQueue status via health check #2861

SidSethi · 2022-04-06T19:42:54Z

Note: this is a copy of the previous branch, #2774. I opened a new branch because that was easier than resolving all the merge conflicts from trying to remove the commits I had previously reverted from master. Please see the previous PR for discussion.

Note: I also removed the stateMachineQueue redis lock, since it ended up creating more issues. Specifically, when a node restarts mid-job, the lock is not released. I added code to ensure it is released on each init, but for some reason that didn't work. In short, it's not worth the hassle for what is ultimately redundant logic.

Description

Remove the timestamp suffix from bull queue generation so we don't end up with hundreds of different queues in redis.
Remove use of the done() callback in queue processing. This is an incorrect pattern and should not be combined with async/await processing. (Note: I don't think this was actually causing any errors, but it's unclear.)
Track stateMachineQueueLatestJobSuccess and stateMachineQueueLatestJobStart, expose them via monitors in the health check, and configure the health check to error if the enforceStateMachineQueueHealth flag is passed.
Increase prod snapbackModuloBase to 48 (from 24) and decrease snapbackJobInterval to 30min (from 60min).
Add a timeout to peerSetManager axios requests.
Increase stateMachineQueue lockDuration from the default 30s to 2x snapbackJobInterval to ensure a job does not get prematurely marked as stalled.
Set maxStalledCount to 0 (from the default 1) to ensure stalled jobs are never re-processed.
Add stalled-job logging to all Snapback queues. (Note: even though jobs have been getting marked as stalled, that does not seem to have stopped them from processing, which is odd. Either way, with the config changes above, we should not see any more stalling.)
Change stateMachineQueue from a setTimeout with manual re-add to a cron on snapbackJobInterval. (Note: I have not confirmed whether this makes a difference, but let's see.)
Track the duration of each processStateMachineOperationDecisionTree stage and improve the logging code.
Reduce the snapback batch_clock_status request timeout from 20s to 10s.
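
For reference, the queue settings listed above can be sketched roughly as follows. This is an illustration rather than the code in this PR: `buildStateMachineQueueConfig` and the queue name `state-machine` are hypothetical, and `repeat.every` is used here as one way to express the recurring job (a `repeat.cron` expression would be the alternative). `settings.lockDuration`, `settings.maxStalledCount`, and the `repeat` job option are standard Bull options.

```javascript
// Hypothetical helper: derives Bull queue and job options from the snapback interval.
// Function name, queue name, and structure are illustrative, not taken from the PR diff.
const THIRTY_MINUTES_MS = 30 * 60 * 1000

function buildStateMachineQueueConfig(snapbackJobIntervalMs = THIRTY_MINUTES_MS) {
  return {
    // Stable name with no timestamp suffix, so restarts reuse the same queue in redis
    queueName: 'state-machine',
    queueOptions: {
      settings: {
        // Bull default is 30s; 2x the job interval keeps a long-running job
        // from being prematurely marked as stalled
        lockDuration: 2 * snapbackJobIntervalMs,
        // Bull default is 1; with 0, a stalled job is failed instead of re-processed
        maxStalledCount: 0
      }
    },
    // A repeatable job replaces the setTimeout + manual re-add loop
    jobOptions: { repeat: { every: snapbackJobIntervalMs } }
  }
}

// Usage with Bull (not executed here; requires the `bull` package and a redis instance):
//   const Queue = require('bull')
//   const { queueName, queueOptions, jobOptions } = buildStateMachineQueueConfig()
//   const queue = new Queue(queueName, { redis: { host, port }, ...queueOptions })
//   queue.on('stalled', (job) => console.log(`stateMachineQueue job ${job.id} stalled`))
//   queue.process(async (job) => { /* return or throw; no done() callback */ })
//   queue.add({}, jobOptions)

const config = buildStateMachineQueueConfig()
console.log(config.queueOptions.settings.lockDuration) // 3600000 (60 minutes)
```

Deriving both values from a single interval keeps lockDuration and the repeat schedule in step if snapbackJobInterval changes again.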

Tests

Automated CN tests should all still pass, but that doesn't help much with snapback testing.
For regression testing, I confirmed via mad-dog with manualSyncsDisabled = true that content is still being synced correctly.
The main validation here is that I've already released this to staging and have confirmed that jobs are being processed correctly.

How will this change be monitored? Are there sufficient logs?

I will wire up the new stateMachineQueueLatestJobSuccess health check field to a Pingdom alert to ensure jobs are being processed correctly.
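
As a rough sketch of what such a check might evaluate: the field names and the enforceStateMachineQueueHealth flag come from the description above, but `evaluateStateMachineQueueHealth`, the `maxHealthyDelayMs` threshold, the epoch-millisecond timestamp format, and the 200/500 status codes are assumptions made for illustration, not the actual healthCheckController.js logic.

```javascript
// Hypothetical sketch: decide whether the stateMachineQueue looks healthy based on
// how long ago a job last succeeded. Timestamps are assumed to be epoch milliseconds.
function evaluateStateMachineQueueHealth({
  stateMachineQueueLatestJobSuccess, // time of last successful job, or null if none yet
  stateMachineQueueLatestJobStart, // time the last job started, reported for visibility
  enforceStateMachineQueueHealth = false,
  maxHealthyDelayMs, // assumed threshold, e.g. a multiple of snapbackJobInterval
  now = Date.now()
}) {
  // Treat "never succeeded" as infinitely stale (a design choice in this sketch)
  const msSinceSuccess =
    stateMachineQueueLatestJobSuccess == null
      ? Infinity
      : now - stateMachineQueueLatestJobSuccess
  const healthy = msSinceSuccess <= maxHealthyDelayMs

  return {
    stateMachineQueueLatestJobSuccess,
    stateMachineQueueLatestJobStart,
    healthy,
    // Only turn staleness into an error when the caller passes the enforce flag
    statusCode: !healthy && enforceStateMachineQueueHealth ? 500 : 200
  }
}

// Example: last success was 3 hours ago, threshold is 2 hours, enforcement is on
const HOUR_MS = 60 * 60 * 1000
const now = Date.UTC(2022, 3, 6, 20, 0, 0)
const result = evaluateStateMachineQueueHealth({
  stateMachineQueueLatestJobSuccess: now - 3 * HOUR_MS,
  stateMachineQueueLatestJobStart: now - 10 * 60 * 1000,
  enforceStateMachineQueueHealth: true,
  maxHealthyDelayMs: 2 * HOUR_MS,
  now
})
console.log(result.statusCode) // 500
```

Gating the error behind the flag means ordinary health check callers still get a successful response, while an alerting probe can opt in to failing on a stale queue.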

Instances of batch_clock_status request timeouts can be found via the log message: [retrieveClockStatusesForUsersAcrossReplicaSet] Could not fetch clock values for wallets=${walletsOnReplica} on replica=${replica} ${errorMsg.toString()}

SidSethi added 27 commits March 29, 2022 00:40

v0 bull q fix + visibility

cb20f49

Merge branch 'master' into ss-reconfig-fixes

dc45788

comments

8e4d7e8

reduce interval

0b93af6

lol

021ce00

more queue logging

f2a074a

more logging

4c67c90

stall log

568c7c7

increase statemachinequeue concurrency from 1 to 3

f1f52ad

log jobId

02ace6b

logging

4f84f9a

bugfix

979d827

increase lockduration to hopefully fix stall + logcleanup

2091695

loging

d93e556

cleanup

a39ac36

nit

5e3efb0

re-print

a962dd0

add initial task

1f7881c

ensure decision tree end is always logged

f1375da

nit

b80af32

Final changes + test fixes

e4d94dd

nit change for tests

b553355

Merge branch 'master' into ss-reconfig-fixes

14e83db

Fix health check response

01db6ed

Use redis set NX for atomic locking

e0f73f4

lint

1925e2c

Merge branch 'master' into ss-reconfig-fixes-v2

0cc45bb

pull-request-size bot added the size/L label Apr 6, 2022

SidSethi changed the title ~~Ss reconfig fixes v2~~ [ASI-976] [REPLICA] Fix StateMachineQueue unreliability + expose stateMachineQueue status via health check Apr 6, 2022

SidSethi requested a review from dmanjunath April 6, 2022 19:45

SidSethi assigned vicky-g Apr 6, 2022

SidSethi marked this pull request as ready for review April 6, 2022 19:45

SidSethi mentioned this pull request Apr 6, 2022

[ASI-976] Fix StateMachineQueue unreliability + expose stateMachineQueue status via health check #2774

Closed

SidSethi added 2 commits April 6, 2022 20:43

Ensure stateMachineQueue job lock is always released on snapback start

c50df63

Remove stateMachineQueue redis lock as it is redundant + creates more issues with unreleased locks

fa6d718

dmanjunath approved these changes Apr 6, 2022

creator-node/src/components/healthCheck/healthCheckController.js (resolved review thread)

Merge branch 'master' into ss-reconfig-fixes-v2

f386ad2

SidSethi merged commit 0605556 into master Apr 6, 2022

SidSethi deleted the ss-reconfig-fixes-v2 branch April 6, 2022 21:46

SidSethi added a commit that referenced this pull request Apr 8, 2022

[ASI-976] [REPLICA] Fix StateMachineQueue unreliability + expose stateMachineQueue status via health check (#2861)

264265b

AudiusProject deleted a comment from linear bot Sep 11, 2023
