Consume SecondarySyncRequestOutcomes in Snapback reconfig #1614
Conversation
@dmanjunath refactored per discussion, pls re-review

```diff
@@ -10,6 +10,7 @@
    "logLevel": "debug",
    "redisHost": "localhost",
    "redisPort": 4379,
```
just re-ordering
```
# Sync / SnapbackSM configs
snapbackReconfigEnabled=true
snapbackJobInterval=10000 # ms
snapbackModuloBase=2
```
all re-ordering, except for the new `minimumSecondaryUserSyncSuccessPercent` on line 51
```js
},

// Rate limit configs
/**
```
all changes are just re-ordering + grouping, except for `minimumSecondaryUserSyncSuccessPercent` on line 559
```js
  }
},

const Utils = {
```
I'm not particularly familiar with this code structure convention. What was the benefit of grouping functions into consts, and then reassigning them to `SecondarySyncHealthTracker` when exported?
just for logical separation, helps me understand the code separation a lot better. trying to avoid the situation in other classes where we have 20-30 functions all at the same level even though there is a clear hierarchical structure
makes no difference logically
I understand how this format possibly helps with clearly defining the hierarchy. However, I have a few points against this:
- it's a little strange that in these consts, you have inner functions referencing the const itself (e.g. in the `Utils` const, the fn `_getMetricsMatchingPattern()` calls `Utils._getAllKeysMatchingPattern(pattern)`). Looks like we could've just made this a static class and/or kept the prior commit version of this file
- it's redundant to have outer consts like `Setters` and `Getters` to group the actual fns. Getters/setters should be very clearly obvious via the function name and grouped structurally in the same area of the file.
- this file introduces a new pattern that isn't around in the code base. I'd rather keep to current structures we have in the code base (no hierarchical structuring of fns, just arranging fns in the proper place and possibly introducing strong comments like `// ========= GETTERS =========` to divide the file)
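To make the trade-off concrete, here's a minimal sketch of both structures being discussed. The function names echo the thread; the bodies are placeholder stubs standing in for the real redis-backed lookups, not the actual implementation:

```javascript
// Option A: grouping related fns into a const. Note the self-reference:
// the inner fn must call through the const itself.
const Utils = {
  _getAllKeysMatchingPattern (pattern) {
    // placeholder for a redis SCAN over keys matching `pattern`
    return [`${pattern}:a`, `${pattern}:b`]
  },
  _getMetricsMatchingPattern (pattern) {
    return Utils._getAllKeysMatchingPattern(pattern)
  }
}

// Option B: flat fns arranged under strong comment dividers,
// matching the existing code base convention.

// ========= GETTERS =========

function getAllKeysMatchingPattern (pattern) {
  // placeholder for the same redis SCAN
  return [`${pattern}:a`, `${pattern}:b`]
}

function getMetricsMatchingPattern (pattern) {
  return getAllKeysMatchingPattern(pattern)
}
```

Both options are logically equivalent; the disagreement is purely about whether the grouping lives in the object structure or in comments.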
```js
/**
 * If either secondary has a Sync success rate for user below threshold, add it to `unhealthyReplicas` list
 */
const userSecondarySyncMetrics = await SecondarySyncHealthTracker.computeUserSecondarySyncSuccessRates(
```
can we move this logic into the peerSetManager file? It would be cleaner to have the peer health determination logic there, as I moved all the peer health calculation in my reconfig PR into that file too
i think this is actually fine where it is. the distinction is peerSetManager is abstracted on a node level whereas this is per user per node. if i understand correctly, the unhealthyReplicas is on a per user basis below.
vicky i kinda see your point, but this secondarySyncHealth tracking logic only works from the primary's angle, and PeerSetManager doesn't really have a notion of that distinction right now
might make sense eventually, but given it's like 8 lines, not sure it's worth doing right now
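For context, a rough sketch of what those ~8 lines of per-user health determination could look like, assuming per-secondary success/failure counts and the configured minimum percent (names are illustrative, not the actual SnapbackSM code):

```javascript
// For each secondary in the user's metrics, compute the sync success
// rate and flag the secondary as unhealthy for this user if it falls
// below the configured threshold.
function findUnhealthySecondaries (userSecondarySyncMetrics, minimumSuccessPercent) {
  const unhealthyReplicas = []
  for (const [secondary, { successCount, failureCount }] of Object.entries(userSecondarySyncMetrics)) {
    const total = successCount + failureCount
    // No recorded syncs yet - treat the secondary as healthy for now
    if (total === 0) continue
    const successPercent = (successCount / total) * 100
    if (successPercent < minimumSuccessPercent) {
      unhealthyReplicas.push(secondary)
    }
  }
  return unhealthyReplicas
}
```

This illustrates why the logic is primary-centric: the rates are per user per secondary, which PeerSetManager's node-level abstraction doesn't currently model.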
some more cleanup comments but structurally 👍
creator-node/src/config.js
```js
doc: 'Minimum percent of successful Syncs for a user on a secondary for the secondary to be considered healthy for that user',
format: 'nat',
env: 'minimumSecondaryUserSyncSuccessPercent',
default: 75
```
this is likely what it's defaulted to for third party nodes, do we want it at 50 or 75?
will set to 50 now for minimal impact rollout but the end goal for this should def be higher as this is at a per-user level
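For illustration, a plain-JS sketch of the semantics of the new key under the agreed rollout default of 50 (this is not the actual convict schema from config.js; the helper name and env-lookup shape are hypothetical):

```javascript
// Resolve minimumSecondaryUserSyncSuccessPercent from an env-like map,
// falling back to the rollout default of 50, and validate it as a
// natural number in percent range.
function resolveSuccessPercent (env) {
  const raw = env.minimumSecondaryUserSyncSuccessPercent
  const value = raw !== undefined ? Number(raw) : 50
  if (!Number.isInteger(value) || value < 0 || value > 100) {
    throw new Error(`Invalid minimumSecondaryUserSyncSuccessPercent: ${raw}`)
  }
  return value
}
```

Keeping the default low at rollout limits blast radius; raising it later only requires an env override, no code change.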
```js
const secondary = Utils._parseSecondaryFromRedisKey(key)

if (!(secondary in secondarySyncMetrics)) {
  // This case can be hit for old secondaries that have been cycled out of user's replica set - these can be safely skipped
```
wouldn't the only case where secondary is not in sync metrics be if there's no data for that secondary? e.g. no syncs recorded?
no, `secondarySyncMetrics` would contain every current user secondary that was passed in as a param, even if it had 0 data
the case where a secondary from redis is not in `secondarySyncMetrics` would be if it previously had data recorded for it but is no longer in the current replica set
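The invariant described above can be sketched as follows (all names are hypothetical; the real code parses secondaries out of redis keys rather than taking a passed-in array of entries):

```javascript
// Pre-seed metrics for every *current* secondary, so they appear even
// with 0 data. Any redis entry whose secondary is missing from the map
// must belong to a node cycled out of the user's replica set.
function aggregateSyncData (currentSecondaries, redisEntries) {
  const secondarySyncMetrics = {}
  for (const secondary of currentSecondaries) {
    secondarySyncMetrics[secondary] = { successCount: 0, failureCount: 0 }
  }
  for (const { secondary, success } of redisEntries) {
    if (!(secondary in secondarySyncMetrics)) {
      // Old secondary cycled out of user's replica set - safe to skip
      continue
    }
    if (success) secondarySyncMetrics[secondary].successCount += 1
    else secondarySyncMetrics[secondary].failureCount += 1
  }
  return secondarySyncMetrics
}
```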
Description
What is the purpose of this PR? What is the current behavior? New behavior? Relevant links (e.g. Trello) and/or information pertaining to PR?
Previous PR built SecondarySyncRequestOutcomes tracking in redis.
This now consumes it in Snapback to more intelligently trigger reconfigs.
Tests
List the manual tests and repro instructions to verify that this PR works as anticipated. Include log analysis if possible.
❗ If this change impacts clients, make sure that you have tested the clients ❗
SecondarySyncRequestOutcome logging:
SecondarySyncRequestOutcome success rate tracking:
❗ Reminder 💡❗:
If this PR touches a critical flow (such as Indexing, Uploads, Gateway or the Filesystem), make sure to add the `requires-special-attention` label. Add relevant labels as necessary.

How will this change be monitored?
For features that are critical or could fail silently please describe the monitoring/alerting being added.