
feat(anomaly): sentinel/cluster failover event tracking with slowlog correlation #142

Merged
KIvanow merged 1 commit into BetterDB-inc:master from SBALAVIGNESH123:feature/28-sentinel-cluster-failover-tracking on May 12, 2026

@SBALAVIGNESH123
Contributor

Summary

Implements Issue #28 — Sentinel/Cluster failover event tracking with slowlog correlation.

This PR adds end-to-end failover detection, persistence, and correlation to the anomaly detection pipeline, enabling post-incident analysis of failover events alongside slowlog spikes.

Changes

Shared Types (packages/shared)

  • Added FAILOVER_STARTED and FAILOVER_COMPLETED to WebhookEventType enum
  • Registered both as PRO tier events in WEBHOOK_EVENT_TIERS and PRO_EVENTS
  • Extended IWebhookEventsProService interface with dispatchFailoverStarted() and dispatchFailoverCompleted()
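A minimal sketch of how the shared-type changes listed above might fit together. The event string values follow the webhook names used elsewhere in this PR (failover.started, failover.completed); the shapes of WEBHOOK_EVENT_TIERS and PRO_EVENTS, the tier names, and the FailoverPayload fields are assumptions for illustration, and a const object stands in for the enum the PR describes.

```typescript
// Stand-in for the WebhookEventType enum (a const object is used here for brevity).
const WebhookEventType = {
  FAILOVER_STARTED: "failover.started",
  FAILOVER_COMPLETED: "failover.completed",
} as const;
type WebhookEventType = (typeof WebhookEventType)[keyof typeof WebhookEventType];

// Hypothetical tier names; the real tier model is not shown in this PR.
type Tier = "community" | "pro";

// Assumed shape: each event type maps to the minimum tier that may receive it.
const WEBHOOK_EVENT_TIERS: Record<WebhookEventType, Tier> = {
  [WebhookEventType.FAILOVER_STARTED]: "pro",
  [WebhookEventType.FAILOVER_COMPLETED]: "pro",
};

// Assumed shape: the list of PRO-only events, derived from the tier map.
const PRO_EVENTS: WebhookEventType[] = (
  Object.keys(WEBHOOK_EVENT_TIERS) as WebhookEventType[]
).filter((eventType) => WEBHOOK_EVENT_TIERS[eventType] === "pro");

// Hypothetical payload; the actual fields are not listed in the PR description.
interface FailoverPayload {
  connectionId: string;
  previousRole: string;
  currentRole: string;
  timestamp: number;
}

// The two methods named in the PR description; signatures are assumed.
interface IWebhookEventsProService {
  dispatchFailoverStarted(payload: FailoverPayload): Promise<void>;
  dispatchFailoverCompleted(payload: FailoverPayload): Promise<void>;
}
```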

Anomaly Detection (proprietary/anomaly-detection)

  • Added CLUSTER_STATE to MetricType enum
  • Replication role detection now tracks both directions:
    • master -> replica demotion: CRITICAL / DROP + failover.started webhook
    • replica -> master promotion: WARNING / SPIKE + failover.completed webhook
  • Cluster state tracking: detects ok -> fail and fail -> ok transitions via CLUSTER INFO
  • Excluded CLUSTER_STATE from z-score buffer loop (handled via state-change detection)
  • Added lastClusterState cleanup in onConnectionRemoved()
  • Injected WebhookEventsProService (optional, PRO-only)
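The two-direction role detection described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in anomaly.service.ts: RoleTracker, classifyRoleChange, and the normalised Role type are hypothetical names, while the severity, anomaly type, and webhook pairings are taken from the list above.

```typescript
// Hypothetical normalised role; the PR description uses "master" and "replica".
type Role = "master" | "replica";

interface RoleTransition {
  severity: "CRITICAL" | "WARNING";
  anomalyType: "DROP" | "SPIKE";
  webhook: "failover.started" | "failover.completed";
}

// Maps a role change to the anomaly and webhook pairing listed in the PR.
function classifyRoleChange(previous: Role | undefined, current: Role): RoleTransition | null {
  // No baseline yet, or no change: nothing to report.
  if (previous === undefined || previous === current) return null;
  if (previous === "master" && current === "replica") {
    // Demotion: the node lost its master role.
    return { severity: "CRITICAL", anomalyType: "DROP", webhook: "failover.started" };
  }
  // Promotion: a replica took over as master.
  return { severity: "WARNING", anomalyType: "SPIKE", webhook: "failover.completed" };
}

// Tracks the last seen role per connection, with cleanup when a connection goes away.
class RoleTracker {
  private lastRole = new Map<string, Role>();

  observe(connectionId: string, role: Role): RoleTransition | null {
    const transition = classifyRoleChange(this.lastRole.get(connectionId), role);
    this.lastRole.set(connectionId, role);
    return transition;
  }

  onConnectionRemoved(connectionId: string): void {
    this.lastRole.delete(connectionId);
  }
}
```

Because a state change is a discrete event rather than a numeric sample, this kind of comparison sits outside a z-score buffer, which matches the exclusion of CLUSTER_STATE noted in the list.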

Correlation Engine

  • Enhanced NODE_FAILOVER pattern rule with optionalMetrics: [SLOWLOG_LAST_ID, CLUSTER_STATE, OPS_PER_SEC]
  • Added dedicated CLUSTER_STATE -> NODE_FAILOVER pattern rule
  • Added enrichDiagnosis() method that dynamically includes slowlog correlation context when SLOWLOG_LAST_ID co-occurs with failover events in the same 5s window
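One way the slowlog co-occurrence check in enrichDiagnosis() could work is sketched below. Only the 5s window and the metric names SLOWLOG_LAST_ID, CLUSTER_STATE, and OPS_PER_SEC come from the PR; REPLICATION_ROLE, the AnomalyEvent shape, the function signature, and the appended wording are assumptions.

```typescript
// REPLICATION_ROLE is a hypothetical name for the role-change metric.
type MetricType = "REPLICATION_ROLE" | "CLUSTER_STATE" | "SLOWLOG_LAST_ID" | "OPS_PER_SEC";

// Hypothetical event shape; timestamp is in milliseconds.
interface AnomalyEvent {
  metric: MetricType;
  timestamp: number;
}

// The 5s correlation window mentioned in the PR description.
const CORRELATION_WINDOW_MS = 5000;

// Appends slowlog context to a diagnosis when a slowlog anomaly falls
// within the window of any failover-related anomaly.
function enrichDiagnosis(baseDiagnosis: string, events: AnomalyEvent[]): string {
  const failovers = events.filter(
    (e) => e.metric === "REPLICATION_ROLE" || e.metric === "CLUSTER_STATE",
  );
  const slowlogs = events.filter((e) => e.metric === "SLOWLOG_LAST_ID");

  const coOccurs = failovers.some((f) =>
    slowlogs.some((s) => Math.abs(s.timestamp - f.timestamp) <= CORRELATION_WINDOW_MS),
  );

  if (!coOccurs) return baseDiagnosis;
  const seconds = CORRELATION_WINDOW_MS / 1000;
  return `${baseDiagnosis} Slowlog entries were recorded within ${seconds}s of the failover event.`;
}
```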

Webhook PRO Service

  • Added dispatchFailoverStarted() and dispatchFailoverCompleted() methods with license gating
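License gating for the two dispatch methods might look something like the sketch below. LicenseGate, Send, the payload fields, and the boolean return value are hypothetical; the PR only states that both methods exist and are gated by license.

```typescript
type FailoverEvent = "failover.started" | "failover.completed";

// Hypothetical payload; the real fields are not listed in the PR description.
interface FailoverPayload {
  connectionId: string;
  timestamp: number;
}

// Hypothetical license check.
interface LicenseGate {
  isProEnabled(): boolean;
}

// Hypothetical transport that actually delivers the webhook.
type Send = (event: FailoverEvent, payload: FailoverPayload) => Promise<void>;

class WebhookEventsProServiceSketch {
  private license: LicenseGate;
  private send: Send;

  constructor(license: LicenseGate, send: Send) {
    this.license = license;
    this.send = send;
  }

  dispatchFailoverStarted(payload: FailoverPayload): Promise<boolean> {
    return this.dispatchGated("failover.started", payload);
  }

  dispatchFailoverCompleted(payload: FailoverPayload): Promise<boolean> {
    return this.dispatchGated("failover.completed", payload);
  }

  // Resolves to false without sending when the license does not allow PRO events.
  private async dispatchGated(event: FailoverEvent, payload: FailoverPayload): Promise<boolean> {
    if (!this.license.isProEnabled()) return false;
    await this.send(event, payload);
    return true;
  }
}
```

Keeping the gate in one private helper means a new PRO event only needs a one-line public method, and the caller in the anomaly service does not need to know about licensing.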

Testing

  • 50 unit tests pass across correlator.spec.ts, spike-detector.spec.ts, metric-buffer.spec.ts
  • 8 new tests covering promotion detection, webhook dispatch, CLUSTER_STATE correlation, slowlog-failover co-occurrence diagnosis enrichment, buffer exclusion, and state cleanup
  • packages/shared compiles cleanly with tsc --noEmit
  • Zero regression on all existing tests

Closes #28

@github-actions

github-actions Bot commented Apr 30, 2026

All contributors have signed the CLA. ✅
Posted by the CLA Assistant Lite bot.

@SBALAVIGNESH123
Contributor Author

I have read the CLA Document and I hereby sign the CLA

SBALAVIGNESH123 force-pushed the feature/28-sentinel-cluster-failover-tracking branch from 0d04ce9 to 895320f on April 30, 2026 15:44
@SBALAVIGNESH123
Contributor Author

recheck

github-actions Bot added a commit that referenced this pull request Apr 30, 2026
Member

dispatchClusterFailover is never called in the new cluster state detection path, even though it exists for exactly this case and all required data (slotsAssigned, slotsFailed, knownNodes) is in the getClusterInfo response already in scope.

The pre-existing dispatchClusterFailover exists specifically to fire a webhook when the cluster state changes (ok→fail, fail→ok), carrying clusterState, previousState, slotsAssigned, slotsFailed, and knownNodes. The new cluster state detection code (lines 390–426 of anomaly.service.ts) creates the anomaly event correctly but never calls dispatchClusterFailover. All the information needed for the payload is available right there: getClusterInfo is already called and its response contains the slot counts. So a fail transition fires a CRITICAL anomaly but no webhook, while the equivalent replication role demotion fires both.
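To make the request concrete, here is a rough sketch of a cluster state check that also dispatches the webhook. It is not the code in anomaly.service.ts: ClusterStateTracker, parseClusterInfo, and the callback signature are hypothetical, and the getClusterInfo response is assumed to be reducible to the key/value fields that Redis CLUSTER INFO reports (cluster_state, cluster_slots_assigned, cluster_slots_fail, cluster_known_nodes). The payload field names are the ones listed in this comment.

```typescript
interface ClusterFailoverPayload {
  clusterState: string;
  previousState: string;
  slotsAssigned: number;
  slotsFailed: number;
  knownNodes: number;
}

// Parses raw CLUSTER INFO output ("key:value" lines) into a record.
function parseClusterInfo(raw: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const line of raw.split(/\r?\n/)) {
    const separator = line.indexOf(":");
    if (separator > 0) fields[line.slice(0, separator)] = line.slice(separator + 1).trim();
  }
  return fields;
}

class ClusterStateTracker {
  private lastClusterState = new Map<string, string>();
  // Hypothetical stand-in for the service's dispatchClusterFailover method.
  private dispatchClusterFailover: (payload: ClusterFailoverPayload) => void;

  constructor(dispatchClusterFailover: (payload: ClusterFailoverPayload) => void) {
    this.dispatchClusterFailover = dispatchClusterFailover;
  }

  // Returns the payload when the state changed (ok→fail or fail→ok), otherwise null.
  check(connectionId: string, info: Record<string, string>): ClusterFailoverPayload | null {
    const current = info["cluster_state"];
    if (current === undefined) return null;

    const previous = this.lastClusterState.get(connectionId);
    this.lastClusterState.set(connectionId, current);
    if (previous === undefined || previous === current) return null;

    const payload: ClusterFailoverPayload = {
      clusterState: current,
      previousState: previous,
      slotsAssigned: Number(info["cluster_slots_assigned"] ?? 0),
      slotsFailed: Number(info["cluster_slots_fail"] ?? 0),
      knownNodes: Number(info["cluster_known_nodes"] ?? 0),
    };
    // The step missing from the reviewed code: fire the webhook alongside the anomaly.
    this.dispatchClusterFailover(payload);
    return payload;
  }

  onConnectionRemoved(connectionId: string): void {
    this.lastClusterState.delete(connectionId);
  }
}
```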

Member

@KIvanow KIvanow left a comment

First, I want to thank you for your contribution, @SBALAVIGNESH123!

There is only one issue that needs to be addressed before this can be merged: https://github.com/BetterDB-inc/monitor/pull/142/changes#r3172532439

There are also some conflicts with the main branch due to other PRs being merged.

@SBALAVIGNESH123
Contributor Author

Thanks for the review @KIvanow! Great catch — I've addressed the feedback:

Fix: Added a dispatchClusterFailover() call in the cluster state detection path (lines 421–435 of anomaly.service.ts). The webhook now fires on both ok→fail and fail→ok transitions, carrying clusterState, previousState, slotsAssigned, slotsFailed, and knownNodes from the getClusterInfo response that was already in scope.

Test: Added a new test (dispatches cluster.failover webhook on ok→fail transition) that verifies the webhook payload is correct.

Merge conflicts: Resolved — kept both our FAILOVER_STARTED/FAILOVER_COMPLETED events and the upstream INFERENCE_SLA_BREACH event in packages/shared/src/webhooks/types.ts.

SBALAVIGNESH123 force-pushed the feature/28-sentinel-cluster-failover-tracking branch from ec621b6 to e0757e4 on May 1, 2026 08:53
@SBALAVIGNESH123
Contributor Author

recheck

@KIvanow
Member

KIvanow commented May 11, 2026

This looks great, @SBALAVIGNESH123. Could you please sign your commits so I can merge them?

…correlation

Implements Issue BetterDB-inc#28 — Sentinel/Cluster failover event tracking with slowlog correlation.

Changes:
- Added FAILOVER_STARTED and FAILOVER_COMPLETED to WebhookEventType enum (PRO tier)
- Added CLUSTER_STATE to MetricType enum
- Replication role detection tracks both directions (demotion + promotion)
- Cluster state tracking detects ok<->fail transitions via CLUSTER INFO
- dispatchClusterFailover webhook fires on cluster state transitions
- Enhanced NODE_FAILOVER correlation pattern with SLOWLOG_LAST_ID co-occurrence
- Added enrichDiagnosis() for dynamic slowlog correlation context
- Added dispatchFailoverStarted/Completed methods with license gating
- Excluded CLUSTER_STATE from z-score buffer loop
- Added lastClusterState cleanup in onConnectionRemoved()
- 9 new tests covering promotion, webhook dispatch, cluster state correlation

Closes BetterDB-inc#28
SBALAVIGNESH123 force-pushed the feature/28-sentinel-cluster-failover-tracking branch from e0757e4 to 6a94c68 on May 12, 2026 00:41
@SBALAVIGNESH123
Contributor Author

Hey @KIvanow, thanks for the heads up! I've signed the commit — it should now show as verified. Let me know if there's anything else needed for the merge!

@KIvanow KIvanow merged commit 2fa1c46 into BetterDB-inc:master May 12, 2026
1 of 2 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 12, 2026

Development

Successfully merging this pull request may close these issues.

Sentinel/Cluster failover event tracking with slowlog correlation

2 participants