feat(anomaly): sentinel/cluster failover event tracking with slowlog correlation #142
0d04ce9 to 895320f
|
dispatchClusterFailover is never called in the new cluster state detection path, even though it exists for exactly this case and all required data (slotsAssigned, slotsFailed, knownNodes) is in the getClusterInfo response already in scope.
The pre-existing dispatchClusterFailover exists specifically to fire a webhook when the cluster state changes (ok→fail, fail→ok), carrying clusterState, previousState, slotsAssigned, slotsFailed, and knownNodes. The new cluster state detection code (lines 390–426 of anomaly.service.ts) creates the anomaly event correctly but never calls dispatchClusterFailover. All the information needed for the payload is available right there: getClusterInfo is already called and its response contains the slot counts. As a result, a fail transition fires a CRITICAL anomaly but no webhook, while the equivalent replication role demotion fires both.
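A minimal sketch of the change this comment asks for. Only dispatchClusterFailover, lastClusterState, and the payload field names come from the discussion; the ClusterInfo and WebhookDispatcher shapes, the dispatcher signature, and the helper names detectTransition and onClusterInfo are hypothetical, and the real anomaly.service.ts may be organised differently.

```typescript
// Hypothetical shapes: the real types in anomaly.service.ts may differ.
type ClusterState = "ok" | "fail";

interface ClusterInfo {
  clusterState: ClusterState;
  slotsAssigned: number;
  slotsFailed: number;
  knownNodes: number;
}

// Payload fields named in the review comment.
interface ClusterFailoverPayload extends ClusterInfo {
  previousState: ClusterState;
}

// Assumed signature; the existing dispatchClusterFailover may take other arguments.
interface WebhookDispatcher {
  dispatchClusterFailover(payload: ClusterFailoverPayload): Promise<void>;
}

// Last observed cluster state per connection.
const lastClusterState = new Map<string, ClusterState>();

// Returns a payload only when the state actually changed since the last poll.
function detectTransition(
  connectionId: string,
  info: ClusterInfo,
): ClusterFailoverPayload | null {
  const previous = lastClusterState.get(connectionId);
  lastClusterState.set(connectionId, info.clusterState);
  if (previous === undefined || previous === info.clusterState) {
    return null; // first observation or no change
  }
  return { ...info, previousState: previous };
}

// Called with the getClusterInfo response that is already in scope.
function onClusterInfo(
  connectionId: string,
  info: ClusterInfo,
  webhooks?: WebhookDispatcher,
): boolean {
  const payload = detectTransition(connectionId, info);
  if (payload === null) return false;
  // The anomaly event would be created here (omitted in this sketch).
  // The missing step: fire the webhook on ok->fail and fail->ok alike.
  webhooks?.dispatchClusterFailover(payload).catch(() => {
    /* a failed webhook should not break anomaly detection */
  });
  return true;
}
```

The dispatcher is optional in this sketch so that the detection path still works when no webhook service is injected.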
KIvanow left a comment
First, I want to thank you for your contribution, @SBALAVIGNESH123!
There is only one issue that needs to be addressed before this can be merged: https://github.com/BetterDB-inc/monitor/pull/142/changes#r3172532439
There are also some conflicts with the main branch due to other PRs being merged.
|
Thanks for the review @KIvanow! Great catch. I've addressed the feedback:
Fix: Added dispatchClusterFailover() call in the cluster state detection path (lines 421–435 of anomaly.service.ts). The webhook now fires on both ok→fail and fail→ok transitions, carrying clusterState, previousState, slotsAssigned, slotsFailed, and knownNodes from the getClusterInfo response that was already in scope.
Test: Added a new test (
Merge conflicts: Resolved. Kept both our FAILOVER_STARTED/FAILOVER_COMPLETED events and the upstream INFERENCE_SLA_BREACH event in packages/shared/src/webhooks/types.ts. |
ec621b6 to e0757e4
|
This looks great, @SBALAVIGNESH123. Could you please sign your commits so I can merge them? |
…correlation
Implements Issue BetterDB-inc#28: Sentinel/Cluster failover event tracking with slowlog correlation.
Changes:
- Added FAILOVER_STARTED and FAILOVER_COMPLETED to WebhookEventType enum (PRO tier)
- Added CLUSTER_STATE to MetricType enum
- Replication role detection tracks both directions (demotion + promotion)
- Cluster state tracking detects ok<->fail transitions via CLUSTER INFO
- dispatchClusterFailover webhook fires on cluster state transitions
- Enhanced NODE_FAILOVER correlation pattern with SLOWLOG_LAST_ID co-occurrence
- Added enrichDiagnosis() for dynamic slowlog correlation context
- Added dispatchFailoverStarted/Completed methods with license gating
- Excluded CLUSTER_STATE from z-score buffer loop
- Added lastClusterState cleanup in onConnectionRemoved()
- 9 new tests covering promotion, webhook dispatch, cluster state correlation
Closes BetterDB-inc#28
e0757e4 to 6a94c68
|
Hey @KIvanow, thanks for the heads up! I've signed the commit — it should now show as verified. Let me know if there's anything else needed for the merge! |
Summary
Implements Issue #28 — Sentinel/Cluster failover event tracking with slowlog correlation.
This PR adds end-to-end failover detection, persistence, and correlation to the anomaly detection pipeline, enabling post-incident analysis of failover events alongside slowlog spikes.
Changes
Shared Types (packages/shared)
- Added FAILOVER_STARTED and FAILOVER_COMPLETED to WebhookEventType enum
- WEBHOOK_EVENT_TIERS and PRO_EVENTS
- IWebhookEventsProService interface with dispatchFailoverStarted() and dispatchFailoverCompleted()
Anomaly Detection (proprietary/anomaly-detection)
- Added CLUSTER_STATE to MetricType enum
- master -> replica demotion: CRITICAL/DROP + failover.started webhook
- replica -> master promotion: WARNING/SPIKE + failover.completed webhook
- ok -> fail and fail -> ok transitions via CLUSTER INFO
- Excluded CLUSTER_STATE from z-score buffer loop (handled via state-change detection)
- Added lastClusterState cleanup in onConnectionRemoved()
- WebhookEventsProService (optional, PRO-only)
Correlation Engine
- Enhanced NODE_FAILOVER pattern rule with optionalMetrics: [SLOWLOG_LAST_ID, CLUSTER_STATE, OPS_PER_SEC]
- CLUSTER_STATE -> NODE_FAILOVER pattern rule
- Added enrichDiagnosis() method that dynamically includes slowlog correlation context when SLOWLOG_LAST_ID co-occurs with failover events in the same 5s window
Webhook PRO Service
- Added dispatchFailoverStarted() and dispatchFailoverCompleted() methods with license gating
Testing
- correlator.spec.ts, spike-detector.spec.ts, metric-buffer.spec.ts
- packages/shared compiles cleanly with tsc --noEmit
Closes #28
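To illustrate the slowlog correlation described under Correlation Engine, here is a rough sketch of how an enrichDiagnosis() step could work. It is not the PR's implementation: the AnomalyEvent shape, the REPLICATION_ROLE metric name, the diagnosis wording, and the reading of "same 5s window" as an absolute time difference of at most 5 seconds are all assumptions made for the example.

```typescript
// Hypothetical subset of MetricType. REPLICATION_ROLE is an assumed name;
// the other three appear in the PR description.
type MetricType =
  | "REPLICATION_ROLE"
  | "CLUSTER_STATE"
  | "SLOWLOG_LAST_ID"
  | "OPS_PER_SEC";

// Assumed event shape: metric plus a millisecond timestamp.
interface AnomalyEvent {
  metricType: MetricType;
  timestamp: number;
}

const CORRELATION_WINDOW_MS = 5000;

// Metrics treated as failover signals in this sketch.
const FAILOVER_METRICS: ReadonlySet<MetricType> = new Set<MetricType>([
  "REPLICATION_ROLE",
  "CLUSTER_STATE",
]);

// Appends slowlog context to the diagnosis when a SLOWLOG_LAST_ID anomaly
// falls within the window of any failover event; otherwise returns it unchanged.
function enrichDiagnosis(baseDiagnosis: string, events: AnomalyEvent[]): string {
  const failovers = events.filter((e) => FAILOVER_METRICS.has(e.metricType));
  const slowlogs = events.filter((e) => e.metricType === "SLOWLOG_LAST_ID");

  const correlated = slowlogs.some((s) =>
    failovers.some(
      (f) => Math.abs(s.timestamp - f.timestamp) <= CORRELATION_WINDOW_MS,
    ),
  );
  if (!correlated) return baseDiagnosis;

  const seconds = CORRELATION_WINDOW_MS / 1000;
  return `${baseDiagnosis} Slowlog activity was recorded within ${seconds}s of the failover.`;
}
```

Keeping the enrichment as a pure function over already-detected events means it can be unit-tested without a live Redis or Valkey connection.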