Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SnapbackSM - processStateMachineOperation() function rewrite + peerset monitoring v0 #1469

Merged
merged 35 commits into from May 12, 2021

Conversation

SidSethi
Copy link
Contributor

@SidSethi SidSethi commented May 6, 2021

Description

What is the purpose of this PR? What is the current behavior? New behavior? Relevant links (e.g. Trello) and/or information pertaining to PR?

Add PeerSet Monitoring (skeleton code) to Snapback

  • now that Snapback supports secondary node logic as well, can add rails to also submit updateReplicaSet ops for users with unhealthy replicas (primary or secondary)
  • adds step to processStateMachineOperation() to check if any replicas in peerSet are unhealthy
  • adds simple computePeerHealth() function
  • adds fork in logic - instead of just triggering syncRequest ops for all secondary replicas that need it, first adds updateReplicaSet ops for unhealthy replicas (primary or secondary), then adds syncRequest ops for all healthy secondary replicas
  • adds a no-op updateReplicaSet() function with a log to start gathering metrics

Rewrite core SnapbackSM logic (processStateMachineOperation()):

  • core function was very bloated + unreadable; rewrote this function to untangle discrete components, break each component out into an isolated function, and more clearly document the logic
  • replaces logging with decisionTree for clarity + conciseness

Tests

List the manual tests and repro instructions to verify that this PR works as anticipated. Include log analysis if possible.
❗ If this change impacts clients, make sure that you have tested the clients ❗

  • auto tests done
  • manual tests done
  • mad-dog + manual tests give us coverage of all required flows, as long as they are run in snapbackDevModeEnabled() (this forces all sync ops to go through processStateMachineOperation())
[
  {
    "stage": "1/ BEGIN processStateMachineOperation()",
    "vals": {
      "currentModuloSlice": 0
    },
    "time": 1620659629082
  },
  {
    "stage": "2.A/ getNodeUsers():getAllNodeUsers() Success",
    "vals": {
      "requestParams": {
        "method": "get",
        "baseURL": "http://audius-disc-prov_web-server_1:5000",
        "url": "v1/full/users/content_node/all",
        "params": {
          "creator_node_endpoint": "http://cn1_creator-node_1:4000"
        }
      },
      "numAllNodeUsers": 1
    },
    "time": 1620659629122
  },
  {
    "stage": "2/ getNodeUsers() Success",
    "vals": {
      "nodeUsersLength": 1
    },
    "time": 1620659629122
  },
  {
    "stage": "3/ select nodeUsers modulo slice",
    "vals": {
      "nodeUsersSubsetLength": 1,
      "moduloBase": 1,
      "currentModuloSlice": 0
    },
    "time": 1620659629122
  },
  {
    "stage": "4/ computeContentNodePeerSet() Success",
    "vals": {
      "peerSetLength": 2
    },
    "time": 1620659629122
  },
  {
    "stage": "5/ determinePeerHealth() Success",
    "vals": {
      "unhealthyPeerSetLength": 0,
      "healthyPeerSetLength": 2,
      "unhealthyPeers": []
    },
    "time": 1620659629293
  },
  {
    "stage": "6/ Build requiredUpdateReplicaSetOps and requiredSyncRequests arrays",
    "vals": {
      "requiredUpdateReplicaSetOpsLength": 0,
      "requiredSyncRequestsLength": 2
    },
    "time": 1620659629293
  },
  {
    "stage": "7/ buildSecondaryNodesToUserWalletsMap() Success",
    "vals": {
      "numSecondaryNodes": 2
    },
    "time": 1620659629293
  },
  {
    "stage": "8/ retrieveClockStatusesForSecondaryUsersFromNodes() Success",
    "vals": {},
    "time": 1620659629340
  },
  {
    "stage": "9/ issueSyncRequests() Success",
    "vals": {
      "numSyncRequestsIssued": 2,
      "numSyncRequestErrors": 0,
      "syncRequestErrors": []
    },
    "time": 1620659629350
  },
  {
    "stage": "10/ issueUpdateReplicaSetOp() Success",
    "vals": {
      "numUpdateReplicaOpsIssued": 0
    },
    "time": 1620659629350
  },
  {
    "stage": "11/ END processStateMachineOperation()",
    "vals": {
      "currentModuloSlice": 0,
      "nextModuloSlice": 0,
      "moduloBase": 1,
      "numSyncRequestsIssued": 2,
      "numUpdateReplicaOpsIssued": 0
    },
    "time": 1620659629350
  }
]

Testing unhealthy peer:

[2021-05-10T15:47:22.313Z]  INFO: audius_creator_node/1303 on 3de557a82be8: SnapbackSM: Updating Replica Set for userID 1 & wallet 0x8a6f7421b7bb0d3f308b092a0acca217404e6e33 from old replicaSet [http://cn1_creator-node_1:4000,http://cn3_creator-node_1:4002,http://cn2_creator-node_1:4001] due to unhealthy replica http://cn2_creator-node_1:4001
[2021-05-10T15:47:22.313Z]  INFO: audius_creator_node/1303 on 3de557a82be8: SnapbackSM: Updating Replica Set for userID 2 & wallet 0x1a73ca5c096e9352f3608d20b0b47da4fbd628e7 from old replicaSet [http://cn3_creator-node_1:4002,http://cn2_creator-node_1:4001,http://cn1_creator-node_1:4000] due to unhealthy replica http://cn2_creator-node_1:4001
[2021-05-10T15:47:22.313Z]  INFO: audius_creator_node/1303 on 3de557a82be8: SnapbackSM: Updating Replica Set for userID 3 & wallet 0xfceb74949602b9a468e4f3ef6dea6fe424d63e49 from old replicaSet [http://cn1_creator-node_1:4000,http://cn3_creator-node_1:4002,http://cn2_creator-node_1:4001] due to unhealthy replica http://cn2_creator-node_1:4001
[2021-05-10T15:47:22.313Z]  INFO: audius_creator_node/1303 on 3de557a82be8: SnapbackSM: Updating Replica Set for userID 4 & wallet 0x888dafcd47217bbffd9f890f23720b4930524666 from old replicaSet [http://cn1_creator-node_1:4000,http://cn3_creator-node_1:4002,http://cn2_creator-node_1:4001] due to unhealthy replica http://cn2_creator-node_1:4001
[2021-05-10T15:47:22.313Z]  INFO: audius_creator_node/1303 on 3de557a82be8: SnapbackSM: Updating Replica Set for userID 5 & wallet 0xec6a2feac4989b731380ec356369d92fd6b5c3c1 from old replicaSet [http://cn1_creator-node_1:4000,http://cn2_creator-node_1:4001,http://cn3_creator-node_1:4002] due to unhealthy replica http://cn2_creator-node_1:4001
[
  {
    "stage": "1/ BEGIN processStateMachineOperation()",
    "vals": {
      "currentModuloSlice": 0
    },
    "time": 1620661642166
  },
  {
    "stage": "2.A/ getNodeUsers():getAllNodeUsers() Success",
    "vals": {
      "requestParams": {
        "method": "get",
        "baseURL": "http://audius-disc-prov_web-server_1:5000",
        "url": "v1/full/users/content_node/all",
        "params": {
          "creator_node_endpoint": "http://cn1_creator-node_1:4000"
        }
      },
      "numAllNodeUsers": 5
    },
    "time": 1620661642181
  },
  {
    "stage": "2/ getNodeUsers() Success",
    "vals": {
      "nodeUsersLength": 5
    },
    "time": 1620661642181
  },
  {
    "stage": "3/ select nodeUsers modulo slice",
    "vals": {
      "nodeUsersSubsetLength": 5,
      "moduloBase": 1,
      "currentModuloSlice": 0
    },
    "time": 1620661642181
  },
  {
    "stage": "4/ computeContentNodePeerSet() Success",
    "vals": {
      "peerSetLength": 2
    },
    "time": 1620661642181
  },
  {
    "stage": "5/ determinePeerHealth() Success",
    "vals": {
      "unhealthyPeerSetLength": 1,
      "healthyPeerSetLength": 1,
      "unhealthyPeers": [
        "http://cn2_creator-node_1:4001"
      ]
    },
    "time": 1620661642280
  },
  {
    "stage": "6/ Build requiredUpdateReplicaSetOps and potentialSyncRequests arrays",
    "vals": {
      "requiredUpdateReplicaSetOpsLength": 5,
      "potentialSyncRequestsLength": 4
    },
    "time": 1620661642280
  },
  {
    "stage": "7/ buildSecondaryNodesToUserWalletsMap() Success",
    "vals": {
      "numSecondaryNodes": 1
    },
    "time": 1620661642281
  },
  {
    "stage": "8/ retrieveClockStatusesForSecondaryUsersFromNodes() Success",
    "vals": {},
    "time": 1620661642310
  },
  {
    "stage": "9/ issueSyncRequests() Success",
    "vals": {
      "numSyncRequestsRequired": 0,
      "numSyncRequestsIssued": 0,
      "numSyncRequestErrors": 0,
      "syncRequestErrors": []
    },
    "time": 1620661642313
  },
  {
    "stage": "10/ issueUpdateReplicaSetOp() Success",
    "vals": {
      "numUpdateReplicaOpsIssued": 5
    },
    "time": 1620661642313
  },
  {
    "stage": "11/ END processStateMachineOperation()",
    "vals": {
      "currentModuloSlice": 0,
      "nextModuloSlice": 0,
      "moduloBase": 1,
      "numSyncRequestsIssued": 0,
      "numUpdateReplicaOpsIssued": 5
    },
    "time": 1620661642314
  }
]

❗ Reminder 💡❗:
If this PR touches a critical flow (such as Indexing, Uploads, Gateway or the Filesystem), make sure to add the requires-special-attention label. Add relevant labels as necessary.

*/
async computeContentNodePeerSet (nodeUserInfoList) {
computeContentNodePeerSet (nodeUserInfoList) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of mapping three times plus a concat, you can just add directly to the set in one iteration

@SidSethi SidSethi added content-node Content Node (previously known as Creator Node) feature New features labels May 10, 2021
@SidSethi SidSethi marked this pull request as ready for review May 10, 2021 22:11
@SidSethi SidSethi requested a review from dmanjunath May 10, 2021 22:11
Copy link
Contributor

@dmanjunath dmanjunath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great! Mostly minor cleanup stuff and renaming, nothing major. Reads so much better than the previous flow

creator-node/src/snapbackSM.js Show resolved Hide resolved
creator-node/src/snapbackSM.js Show resolved Hide resolved
creator-node/src/snapbackSM.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM.js Show resolved Hide resolved
creator-node/src/snapbackSM.js Outdated Show resolved Hide resolved
creator-node/src/snapbackSM.js Show resolved Hide resolved
Base automatically changed from ss-aud-573-peerset-calculation-monitoring to master May 11, 2021 22:07
@SidSethi SidSethi requested a review from dmanjunath May 11, 2021 22:48
@SidSethi
Copy link
Contributor Author

@dmanjunath addressed all comments - easiest way to diff would be to look at the last 3 commits

Copy link
Contributor

@dmanjunath dmanjunath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@SidSethi SidSethi merged commit a1c37fc into master May 12, 2021
@SidSethi SidSethi deleted the ss-aud-573-peerset-monitoring branch May 12, 2021 17:03
@AudiusProject AudiusProject deleted a comment from linear bot Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
content-node Content Node (previously known as Creator Node) feature New features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants