
SecondarySyncHealthTracker change recordfail #4353

Merged: 49 commits from vg-ssht-change-recordfail into main on Nov 30, 2022

Conversation

@vicky-g (Contributor) commented Nov 15, 2022

Description

Summary: This PR relieves pressure on Redis by tracking only the errors encountered on primary -> secondary syncs, recording a fine-grained count of how many times each error type was encountered per wallet, secondary, and day (a sketch of the new flow follows the change list below).

This PR includes:

  • removing old env vars / configs
  • adding a Prometheus error type
  • rewriting SecondarySyncHealthTracker to go from success rate / error rate computation -> error count
  • changing how SecondarySyncHealthTracker is consumed in various places
  • updating relevant tests
  • writing new tests (see Tests)
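
As a reference for the flow above, here is a minimal sketch of what the count-based recordFailure could look like, assuming an ioredis-style Redis client. The key layout and hash semantics (error name -> count, one hash per wallet/secondary/day) match the Redis output in the manual tests below; the function signature, client setup, and TTL are illustrative, not the PR's exact code.

import Redis from 'ioredis'

const redis = new Redis() // assumes a local Redis, as in the manual tests

// Hypothetical expiry window so per-day keys don't accumulate in Redis forever
const DAILY_KEY_TTL_SECONDS = 2 * 24 * 60 * 60

async function recordFailure(
  secondary: string, // e.g. 'http://cn2_creator-node_1:4001'
  wallet: string,
  error: string // an error name, e.g. 'failure_fetching_user_replica_set'
): Promise<void> {
  const today = new Date().toISOString().split('T')[0] // YYYY-MM-DD
  const key = `PRIMARY_TO_SECONDARY_SYNC_FAILURE::${today}::${secondary}::${wallet}`
  // Each hash field maps an error type to the number of times it was hit
  await redis.hincrby(key, error, 1)
  await redis.expire(key, DAILY_KEY_TTL_SECONDS)
}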

Tests

  • Unit tests via SecondarySyncHealthTracker.test.ts
  • Manual tests

Happy path:

  1. Created a user
  2. Uploaded a track
  3. Checked redis + logs to see flow

The wallet synced successfully, with no errors recorded in Redis.

Redis

127.0.0.1:6379> keys PRIMARY_TO_SECONDARY_SYNC_FAILURE*
(empty list or set)
127.0.0.1:6379> 

User state

User 2
wallet 0xa89416610a90a7655c432434f763f0bb7d9189ad
handle seed_221129mz8g
creator node endpoint http://cn1_creator-node_1:4000,http://cn3_creator-node_1:4002,http://cn2_creator-node_1:4001

The clocks:  {
  'http://cn1_creator-node_1:4000': 11,
  'http://cn2_creator-node_1:4001': 11,
  'http://cn3_creator-node_1:4002': 11,
  'http://cn4_creator-node_1:4003': -1
}

Sad path:

  1. Forced update replica sets to fail (part of sync flow)
  2. Created a user
  3. Uploaded a track
  4. Checked redis + logs to see flow

After uploading the track, only the user's primary clock value increased:

User 1
wallet 0xbc13d58475359394a75795e292bbe4aeb254cf18
handle seed_22112958hj
creator node endpoint http://cn3_creator-node_1:4002,http://cn2_creator-node_1:4001,http://cn4_creator-node_1:4003

The clocks:  {
  'http://cn1_creator-node_1:4000': -1,
  'http://cn2_creator-node_1:4001': 20,
  'http://cn3_creator-node_1:4002': 29,
  'http://cn4_creator-node_1:4003': 20
}

SecondarySyncHealthTracker logs highlighting that the wallet encountered the max errors allowed:

{"name":"audius_creator_node","clusterWorker":"Worker 1","hostname":"51f7210dfa94","pid":134,"level":40,"SecondarySyncHealthTracker":"computeWalletOnSecondaryExceedsMaxErrorsAllowed","msg":"Wallets on secondaries have exceeded the allowed error capacity for today: {\"0xbc13d58475359394a75795e292bbe4aeb254cf18\":{\"http://cn2_creator-node_1:4001\":\"failure_fetching_user_replica_set\",\"http://cn4_creator-node_1:4003\":\"failure_fetching_user_replica_set\"}}","time":"2022-11-29T17:26:56.452Z","v":0,"trace_id":"4ab33a571b7d03cb1f24fc9d503a80a8","span_id":"4e7bf002cb84f1ab","trace_flags":"01","resource.span.name":"issueSyncRequest.jobProcessor","resource.span.links":[{"context":{"traceId":"e8a4ceb28bfb672e51649a6406444db9","spanId":"395c9ea02aa69ce4","traceFlags":1},"attributes":{}}],"resource.span.attributed":{"code.filepath":"/usr/src/app/src/services/stateMachineManager/stateReconciliation/issueSyncRequest.jobProcessor.ts","code.function":"issueSyncRequest","spid":0,"endpoint":"http://cn3_creator-node_1:4002"},"resource.service.name":"content-node","resource.service.spid":0,"resource.service.endpoint":"http://cn3_creator-node_1:4002","logLevel":"warn"}

and

{"name":"audius_creator_node","clusterWorker":"master","hostname":"5b39db87d2af","pid":20708,"level":40,"SecondarySyncHealthTracker":"doesWalletOnSecondaryExceedMaxErrorsAllowed","wallet":"0xc3d1d7559356aac64ba2921ec266ce294dcdc981","secondary":"http://cn4_creator-node_1:4003","error":"failure_fetching_user_replica_set","msg":"Wallet encountered max errors allowed on secondary","time":"2022-11-30T01:52:17.790Z","v":0,"trace_id":"849f74275a833943fe862c41772b27a6","span_id":"5bbb5776175182fd","trace_flags":"01","resource.span.name":"_findReplicaSetUpdatesForUser","resource.span.links":[],"resource.span.attributed":{"code.filepath":"/usr/src/app/src/services/stateMachineManager/stateMonitoring/findReplicaSetUpdates.jobProcessor.ts","code.function":"_findReplicaSetUpdatesForUser","spid":0,"endpoint":"http://cn3_creator-node_1:4002"},"resource.service.name":"content-node","resource.service.spid":0,"resource.service.endpoint":"http://cn3_creator-node_1:4002","logLevel":"warn"}

Sync job failing for a wallet:

{"name":"audius_creator_node","clusterWorker":"Worker 2","hostname":"51f7210dfa94","pid":231,"queue":"find-replica-set-updates-queue","jobId":"117","level":50,"msg":"_findReplicaSetUpdatesForUser(): Secondary http://cn4_creator-node_1:4003 for user 0xbc13d58475359394a75795e292bbe4aeb254cf18 encountered too many sync errors. Marking replica as unhealthy.","time":"2022-11-29T17:27:22.657Z","v":0,"trace_id":"d9bcac02dd1b8b5e197be43b74816f61","span_id":"1eb2bd430f18de60","trace_flags":"01","resource.span.name":"_findReplicaSetUpdatesForUser","resource.span.links":[],"resource.span.attributed":{"code.filepath":"/usr/src/app/src/services/stateMachineManager/stateMonitoring/findReplicaSetUpdates.jobProcessor.ts","code.function":"_findReplicaSetUpdatesForUser","spid":0,"endpoint":"http://cn3_creator-node_1:4002"},"resource.service.name":"content-node","resource.service.spid":0,"resource.service.endpoint":"http://cn3_creator-node_1:4002","logLevel":"error"}

Redis highlighting that a wallet-secondary-date recorded the max allowed fails:

127.0.0.1:6379> keys PRIMARY_TO_SECONDARY_SYNC_FAILURE*
1) "PRIMARY_TO_SECONDARY_SYNC_FAILURE::2022-11-29::http://cn2_creator-node_1:4001::0xbc13d58475359394a75795e292bbe4aeb254cf18"
2) "PRIMARY_TO_SECONDARY_SYNC_FAILURE::2022-11-29::http://cn4_creator-node_1:4003::0xbc13d58475359394a75795e292bbe4aeb254cf18"
127.0.0.1:6379> hgetall PRIMARY_TO_SECONDARY_SYNC_FAILURE::2022-11-29::http://cn2_creator-node_1:4001::0xbc13d58475359394a75795e292bbe4aeb254cf18
1) "failure_fetching_user_replica_set"
2) "10"
127.0.0.1:6379> hgetall PRIMARY_TO_SECONDARY_SYNC_FAILURE::2022-11-29::http://cn4_creator-node_1:4003::0xbc13d58475359394a75795e292bbe4aeb254cf18
1) "failure_fetching_user_replica_set"
2) "10"

Monitoring - How will this change be monitored? Are there sufficient logs / alerts?

Check logs matching "SecondarySyncHealthTracker":"computeWalletOnSecondaryExceedsMaxErrorsAllowed" with msg "Wallets on secondaries have exceeded the allowed error capacity for today" for the wallets that exceed the max errors allowed, and logs with msg "Wallet encountered max errors allowed on secondary" for individual wallet/secondary pairs that hit the limit.

@vicky-g (Contributor, Author) commented Nov 15, 2022

ignore tests for now. will update

@theoilie (Contributor) left a comment

all looking good besides having the sync tracker as a class. that's my only big request to change here so most comments are about that or just nits

@vicky-g marked this pull request as ready for review November 29, 2022 17:59
@theoilie (Contributor) left a comment

great work - it was especially nice and reassuring to see the manual tests!

Comment on lines +3 to +15
failure_fetching_user_replica_set: 10,
failure_force_resync_check: 10,
failure_fetching_user_gateway: 10,
failure_delete_db_data: 3,
failure_delete_disk_data: 3,
failure_sync_secondary_from_primary: 10,
failure_db_transaction: 3,
failure_export_wallet: 10,
failure_import_not_consistent: 3,
failure_import_not_contiguous: 3,
failure_inconsistent_clock: 10,
failure_undefined_sync_status: 3,
default: 5
Contributor:

flagging that we'll probably want to fine-tune these and add other errors based on what we see in prod after release

Contributor (Author):

Technically these errors only get added from the recordFailure call. These fields come from the Prometheus constants, so if more errors come up, this needs a manual update.

Contributor:

agree these all seem reasonable for now and we can tweak later

@vicky-g added the content-node label (Content Node, previously known as Creator Node) Nov 29, 2022
@@ -0,0 +1,19 @@
// See prometheus.constants.js for reference on potential errors
Contributor:

maybe valuable to denote either in comment or in variable name that these are NUMBER_OF_DAILY_RETRIES
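
For illustration, one way to apply this naming suggestion (the constant name and comment below are hypothetical, not the merged code):

// Max sync failures tolerated per wallet/secondary pair per DAY, keyed by
// error type; see prometheus.constants.js for the full list of error names.
export const MAX_DAILY_SYNC_FAILURES_ALLOWED: Record<string, number> = {
  failure_fetching_user_replica_set: 10,
  // ... remaining error types as in the diff above ...
  default: 5
}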

@vicky-g merged commit b292d3c into main Nov 30, 2022
@vicky-g deleted the vg-ssht-change-recordfail branch November 30, 2022 19:43
theoilie pushed a commit that referenced this pull request Dec 2, 2022
* add flag to disable sync tracking

* disable recording sync

* update env var

* update test

* remove extra })

* remove .only

* WIP remove recordSuccess

* update tests to remove recordSuccess

* fix test

* WIP

* also wip

* wip also redis

* wip change names + redis calls

* WIP

* remove unused file

* remove personal file

* remove unused redis command

* fix old comments + update retry + refactor

* pass in serialized data but let class instance handle processing

* add question mark + remove sync type

* fix configs

* remove additional properties

* add sync undefined status response

* remove comment

* update some tests

* fix some tests

* fix test again

* fix tests again

* fix test

* remove. only

* fix tests

* fix tests

* fix some tests

* fix tests again

* remove .only

* fix test?

* write tests

* fix test

* cleanup

* remove unused vars

* move fn

* short circuit

* remove t/f map

* fix tests

* fix tests

* remove forced failure