Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consume SecondarySyncRequestOutcomes in Snapback reconfig #1614

Merged
merged 11 commits into from
Jul 9, 2021

Conversation

SidSethi
Copy link
Contributor

@SidSethi SidSethi commented Jul 2, 2021

Description

What is the purpose of this PR? What is the current behavior? New behavior? Relevant links (e.g. Trello) and/or information pertaining to PR?

Previous PR built SecondarySyncRequestOutcomes tracking in redis.
This now consumes it in Snapback to more intelligently trigger reconfigs.

Tests

List the manual tests and repro instructions to verify that this PR works as anticipated. Include log analysis if possible.
❗ If this change impacts clients, make sure that you have tested the clients ❗

SecondarySyncRequestOutcome logging:

Executing (default): SELECT "cnodeUserUUID", "walletPublicKey", "lastLogin", "latestBlockNumber", "clock", "createdAt", "updatedAt" FROM "CNodeUsers" AS "CNodeUser" WHERE "CNodeUser"."walletPublicKey" = '0x67347bf69527853f1f1e349c0561ef3351770236';
[2021-07-09T17:19:54.377Z]  INFO: audius_creator_node/30 on 2617c5b0a12c: SnapbackSM: additionalSyncIsRequired (MANUAL): wallet 0xecd41512d8f558a76db3eabc69cab6e4399fa026 secondary http://cn4_creator-node_1:4003 primaryClock 14 secondaryClock 14
[2021-07-09T17:19:54.377Z]  INFO: audius_creator_node/30 on 2617c5b0a12c: SnapbackSM: additionalSyncIsRequired (MANUAL): wallet 0xecd41512d8f558a76db3eabc69cab6e4399fa026 secondary http://cn4_creator-node_1:4003 primaryClock 14 secondaryClock 14 || Sync completed in 9ms
[2021-07-09T17:19:54.378Z]  INFO: audius_creator_node/30 on 2617c5b0a12c: SecondarySyncHealthTracker:_recordSyncRequestOutcome || Recorded SecondarySyncRequestOutcomes-Daily:::http://cn4_creator-node_1:4003:::0xecd41512d8f558a76db3eabc69cab6e4399fa026:::MANUAL:::2021-07-09:::Success

SecondarySyncRequestOutcome success rate tracking:

[2021-07-09T17:20:00.407Z]  INFO: audius_creator_node/30 on 2617c5b0a12c: SnapbackSM: SIDTEST COMPUTEUSERSECONDARYSYNCSUCCESSRATES sec1UserSyncSuccessRate: 1 || sec2UserSyncSuccessRate: 1

❗ Reminder 💡❗:
If this PR touches a critical flow (such as Indexing, Uploads, Gateway or the Filesystem), make sure to add the requires-special-attention label. Add relevant labels as necessary.

How will this change be monitored?

For features that are critical or could fail silently please describe the monitoring/alerting being added.

creator-node/src/snapbackSM/snapbackSM.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM/snapbackSM.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM/secondarySyncHealthTracker.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM/snapbackSM.js Outdated Show resolved Hide resolved
@SidSethi SidSethi requested a review from dmanjunath July 3, 2021 04:06
@SidSethi SidSethi marked this pull request as ready for review July 8, 2021 23:47
@SidSethi SidSethi requested a review from dmanjunath July 8, 2021 23:47
@SidSethi
Copy link
Contributor Author

SidSethi commented Jul 8, 2021

@dmanjunath refactored per discussion, pls re-review

@@ -10,6 +10,7 @@
"logLevel": "debug",
"redisHost": "localhost",
"redisPort": 4379,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just re-ordering

# Sync / SnapbackSM configs
snapbackReconfigEnabled=true
snapbackJobInterval=10000 # ms
snapbackModuloBase=2
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all re-ordering, except for the new minimumSecondaryUserSyncSuccessPercent on line 51

},

// Rate limit configs
/**
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all changes are just re-ordering + grouping, except for minimumSecondaryUserSyncSuccessPercent on line 559

}
},

const Utils = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im not particularly familiar with this code structure convention. what was the benefit of grouping functions into consts, and then reassigning them to SecondarySyncHealthTracker when exported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just for logical separation, helps me understand the code separation a lot better. trying to avoid the situation in other classes where we have 20-30 functions all at the same level even though there is a clear hierarchical structure
makes no difference logically

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand how this format possibly helps with clearly defining the hierarchy. however, i have a few points against this:

  1. it's a little strange that in these conts, you have inner functions referencing the consts itself (e.x: in Utils const, the fn _getMetricsMatchingPattern() calls Utils._getAllKeysMatchingPattern(pattern)). looks like we could've just made this a static class and/or keep the prior commit version of this file
  2. it's redundant to have outer consts like Setters and Getters to group the actual fns. getters/setters should be very clearly obvious via the function name and grouped structurally in the same area of the file.
  3. this file introduces a new pattern that isn't around in the code base. I rather keep to current structures we have in the code base (no hierarchal structuring of fns, just arranging fns in the proper place and possibly introducing strong comments like // ========= GETTERS ========= to divide the file)

/**
* If either secondary has a Sync success rate for user below threshold, add it to `unhealthyReplicas` list
*/
const userSecondarySyncMetrics = await SecondarySyncHealthTracker.computeUserSecondarySyncSuccessRates(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this logic into the peerSetManager file? It would be cleaner to have the peer health determination logic in this file as I have moved any peer health calculation in my reconfig pr into this file too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is actually fine where it is. the distinction is peerSetManager is abstracted on a node level whereas this is per user per node. if i understand correctly, the unhealthyReplicas is on a per user basis below.

Copy link
Contributor Author

@SidSethi SidSethi Jul 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vicky i kinda see your point but this logic of secondarySyncHealth tracking only works from the primary's angle, and PeerSetMAnager doesn't really have a notion of that distinction right now

might make sense but given its like 8 lines, not sure if its worth doing right now

Copy link
Contributor

@dmanjunath dmanjunath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more cleanup comments but structurally 👍

doc: 'Minimum percent of failed Syncs for a user on a secondary for the secondary to be considered healthy for that user',
format: 'nat',
env: 'minimumSecondaryUserSyncSuccessPercent',
default: 75
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is likely what it's defaulted to for third party nodes, do we want it it at 50 or 75

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will set to 50 now for minimal impact rollout but the end goal for this should def be higher as this is at a per-user level

creator-node/src/snapbackSM/secondarySyncHealthTracker.js Outdated Show resolved Hide resolved
const secondary = Utils._parseSecondaryFromRedisKey(key)

if (!(secondary in secondarySyncMetrics)) {
// This case can be hit for old secondaries that have been cycled out of user's replica set - these can be safely skipped
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't the only case where secondary is not in sync metrics is if there's no data for that secondary? eg no syncs recorded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, secondarySyncMetrics would contain every current user secondary that was passed in as a param even if it had 0 data
the case where a secondary from redis is not in secondarySyncMetrics would be if it previously had data recorded for it but is no longer in current replica set

creator-node/src/snapbackSM/secondarySyncHealthTracker.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM/secondarySyncHealthTracker.js Outdated Show resolved Hide resolved
/**
* If either secondary has a Sync success rate for user below threshold, add it to `unhealthyReplicas` list
*/
const userSecondarySyncMetrics = await SecondarySyncHealthTracker.computeUserSecondarySyncSuccessRates(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is actually fine where it is. the distinction is peerSetManager is abstracted on a node level whereas this is per user per node. if i understand correctly, the unhealthyReplicas is on a per user basis below.

@SidSethi SidSethi requested a review from dmanjunath July 9, 2021 17:26
@SidSethi SidSethi added content-node Content Node (previously known as Creator Node) feature New features labels Jul 9, 2021
@SidSethi SidSethi merged commit d3e2a05 into master Jul 9, 2021
@SidSethi SidSethi deleted the ss-consume-secsynchealth-snapback branch July 9, 2021 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content-node Content Node (previously known as Creator Node) feature New features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants