[ASI-976] [REPLICA] Fix StateMachineQueue unreliability + expose stateMachineQueue status via health check #2861

Merged: SidSethi merged 30 commits into master from ss-reconfig-fixes-v2 on Apr 6, 2022

SidSethi (Contributor) commented Apr 6, 2022

Note: this is a copy of the previous branch (#2774). I opened a new branch because that was easier than resolving all the merge conflicts while trying to remove the commits I had previously reverted from master. Please see the previous PR for discussion.

Note: I also removed the stateMachineQueue redis lock, since it ended up creating more issues. Specifically, when a node restarts mid-job, the lock is not released. I added code to ensure it is released on each init, but for some reason that didn't work. Basically, it is not worth the hassle for what is, in the end, redundant logic.

Description

  • remove the timestamp suffix from bull queue names so we don't end up with hundreds of different queues in redis
  • remove done() callback usage in queue processing. Combining done() with async-await processing is an incorrect pattern (note: I don't think this was actually causing any errors, but it's unclear)
  • track stateMachineQueueLatestJobSuccess and stateMachineQueueLatestJobStart, expose both via monitors in the health check, and configure the health check to error if the enforceStateMachineQueueHealth flag is passed
  • increase prod snapbackModuloBase to 48 (from 24) and decrease snapbackJobInterval to 30min (from 60min)
  • add a timeout to peerSetManager axios requests
  • increase stateMachineQueue lockDuration from the default 30s to 2x snapbackJobInterval to ensure a job does not get prematurely marked as stalled
  • set maxStalledCount to 0 (from the default 1) to ensure stalled jobs are never re-processed
  • add stalled job logging to all Snapback queues (note: even though jobs have been getting marked as stalled, that does not seem to have stopped them from processing, which is weird. Either way, with the above config changes we should not see any more stalling)
  • change stateMachineQueue from a setTimeout with a manual re-add to a cron on snapbackJobInterval (note: I have not confirmed whether this changes anything, but let's see)
  • track the duration of each processStateMachineOperationDecisionTree stage and improve the logging code
  • reduce the snapback batch_clock_status request timeout from 20s to 10s
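
The queue-related config changes in the list above can be sketched as plain values. This is a rough illustration rather than the code in this PR: the option names `settings.lockDuration`, `settings.maxStalledCount` and `repeat.every` mirror Bull's documented option shapes, while the queue name, the helper names, the use of `every` instead of a cron string, and the "one modulo slice per job run" reading of snapbackModuloBase are all assumptions.

```typescript
// Sketch only. Queue name and helper names are hypothetical.

interface QueueSettings {
  lockDuration: number // ms a worker holds the job lock before the job counts as stalled
  maxStalledCount: number // how many times a stalled job may be re-processed
}

interface StateMachineQueueConfig {
  queueName: string
  settings: QueueSettings
  repeat: { every: number } // ms between repeatable job runs
}

const MINUTE_MS = 60 * 1000
const HOUR_MS = 60 * MINUTE_MS

function buildStateMachineQueueConfig(
  snapbackJobIntervalMs: number
): StateMachineQueueConfig {
  return {
    // Fixed name with no timestamp suffix, so restarts reuse the same queue
    queueName: 'state-machine',
    settings: {
      // 2x the job interval, so a long-running job is not marked as stalled early
      lockDuration: 2 * snapbackJobIntervalMs,
      // 0 means a stalled job is failed instead of being picked up again
      maxStalledCount: 0
    },
    // Repeatable job on the interval, replacing setTimeout plus manual re-add
    repeat: { every: snapbackJobIntervalMs }
  }
}

// Time needed to visit every modulo slice once.
// ASSUMPTION: each job run handles exactly one slice.
function fullSweepMs(
  snapbackModuloBase: number,
  snapbackJobIntervalMs: number
): number {
  return snapbackModuloBase * snapbackJobIntervalMs
}

const config = buildStateMachineQueueConfig(30 * MINUTE_MS)
const oldSweepHours = fullSweepMs(24, 60 * MINUTE_MS) / HOUR_MS
const newSweepHours = fullSweepMs(48, 30 * MINUTE_MS) / HOUR_MS
```

With Bull, `config.settings` would go in the queue constructor options and `config.repeat` in the options passed to `queue.add`. Under the one-slice-per-run assumption, the old and new prod values both sweep all slices in the same total time, but each run covers a smaller slice and should therefore finish sooner.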

Tests

Automated CN tests should all still pass, but they don't help much with snapback testing.
For regression testing, I confirmed via mad-dog with manualSyncsDisabled = true that content is still being synced correctly.
The main validation is that I've already released this to staging and confirmed that jobs are being processed correctly.

How will this change be monitored? Are there sufficient logs?

I will wire up the new stateMachineQueueLatestJobSuccess health check field to a Pingdom alert to ensure jobs are being processed correctly.
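
As a rough illustration of how such a check could behave, the sketch below fails the health check when the latest successful job is too old. The field names and the enforceStateMachineQueueHealth flag come from this PR; everything else is an assumption, including the function name, the epoch-millisecond representation of the fields, and the `maxHealthyJobAgeMs` threshold.

```typescript
// Sketch only. Types, function name and threshold are hypothetical.

interface StateMachineQueueMonitors {
  // ASSUMPTION: epoch ms, or null if no job has been recorded yet
  stateMachineQueueLatestJobSuccess: number | null
  stateMachineQueueLatestJobStart: number | null
}

interface HealthCheckResult {
  healthy: boolean
  reason?: string
}

function checkStateMachineQueueHealth(
  monitors: StateMachineQueueMonitors,
  enforceStateMachineQueueHealth: boolean,
  maxHealthyJobAgeMs: number,
  nowMs: number
): HealthCheckResult {
  // Without the flag, queue state is reported but never fails the check
  if (!enforceStateMachineQueueHealth) {
    return { healthy: true }
  }

  const latestSuccess = monitors.stateMachineQueueLatestJobSuccess
  if (latestSuccess === null) {
    return {
      healthy: false,
      reason: 'no successful stateMachineQueue job recorded'
    }
  }

  const ageMs = nowMs - latestSuccess
  if (ageMs > maxHealthyJobAgeMs) {
    return {
      healthy: false,
      reason: `latest successful stateMachineQueue job is ${ageMs}ms old`
    }
  }

  return { healthy: true }
}
```

An external monitor such as the Pingdom alert mentioned above could then alert on an unhealthy result, or compare the raw stateMachineQueueLatestJobSuccess value against its own threshold.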

Instances of batch_clock_status request timeouts can be found via the log message: [retrieveClockStatusesForUsersAcrossReplicaSet] Could not fetch clock values for wallets=${walletsOnReplica} on replica=${replica} ${errorMsg.toString()}

SidSethi changed the title from "Ss reconfig fixes v2" to "[ASI-976] [REPLICA] Fix StateMachineQueue unreliability + expose stateMachineQueue status via health check" Apr 6, 2022
SidSethi requested a review from dmanjunath April 6, 2022 19:45
SidSethi merged commit 0605556 into master Apr 6, 2022
SidSethi deleted the ss-reconfig-fixes-v2 branch April 6, 2022 21:46
SidSethi added a commit that referenced this pull request Apr 8, 2022
AudiusProject deleted a comment from linear bot Sep 11, 2023