feat(nutanix): add support for alerts tracking #23538
Draft
NouemanKHAL wants to merge 3 commits into master
Replace the cursor-based alert collection with a reconciliation loop
against the v4.0 unresolved-alerts API. Ship per-state lifecycle gauges
and a default monitor template:
- nutanix.alert.open — 1 while alert is unresolved + unacknowledged
- nutanix.alert.acknowledged — 1 while alert is unresolved + acknowledged
- nutanix.alert.resolved — 1 once when alert enters the resolved state
State transitions emit explicit zeros to the previous state's metric so
per-alert monitor cases recover cleanly when the alert leaves a state.
Each metric carries ext_id for monitor grouping; the metric name itself
encodes the state, so monitor queries don't need a tag filter.
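The zero-on-transition behaviour described above can be sketched as follows. This is a minimal illustration, not the check's actual code: `emit_state`, the `gauge` callable, and the `ext_id:<id>` tag format are assumptions made for the example.

```python
from typing import Callable, List, Optional

# Metric names are taken from the description above.
STATE_METRICS = {
    "open": "nutanix.alert.open",
    "acknowledged": "nutanix.alert.acknowledged",
    "resolved": "nutanix.alert.resolved",
}


def emit_state(
    gauge: Callable[[str, int, List[str]], None],
    ext_id: str,
    previous: Optional[str],
    current: str,
) -> None:
    """Emit 1 for the current state and an explicit 0 for the state just left.

    `gauge` stands in for whatever submits a gauge (hypothetical signature).
    """
    tags = [f"ext_id:{ext_id}"]  # assumed tag format
    if previous is not None and previous != current:
        # The explicit zero lets a per-alert monitor on the old metric recover.
        gauge(STATE_METRICS[previous], 0, tags)
    gauge(STATE_METRICS[current], 1, tags)
```

Without the explicit zero, the previous state's series would simply stop reporting, and a monitor grouped by ext_id would keep its last value until the data timed out.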
Lifecycle events (in addition to the metrics):
- "Alert: <title>" — created (or re-opened from resolved)
- "Alert acknowledged: <title>" — open -> acknowledged transition
- "Alert reopened: <title>" — acknowledged -> open transition
- "Alert Resolved: <title>" — resolution with resolvedTime / by /
auto_resolved metadata
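One way to read the event list above is as a mapping from state transition to event title. The sketch below is an illustration only: `lifecycle_event_title` is a hypothetical helper, and treating a newly seen alert that is already acknowledged as a creation event is an assumption.

```python
from typing import Optional


def lifecycle_event_title(
    previous: Optional[str], current: str, title: str
) -> Optional[str]:
    """Return the event title for a state transition, or None if nothing changed.

    States are "open", "acknowledged", "resolved"; `previous` is None for an
    alert that was not tracked before.
    """
    if previous == current:
        return None
    if current == "resolved":
        return f"Alert Resolved: {title}"
    if previous in (None, "resolved"):
        # Created, or re-opened from resolved (assumption: applies even if
        # the alert is first seen in the acknowledged state).
        return f"Alert: {title}"
    if previous == "open" and current == "acknowledged":
        return f"Alert acknowledged: {title}"
    if previous == "acknowledged" and current == "open":
        return f"Alert reopened: {title}"
    return None
```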
Reconciliation is the source of truth each cycle: alerts in the API but
not in the in-memory cache are new (emit open event); alerts in the cache
but absent from the API are resolved or deleted (emit resolution event,
set .open or .acknowledged to 0 and .resolved to 1); alerts in both have
their cached metadata refreshed, and ack-state transitions emit dedicated
events. No state is persisted across agent restarts: the cache is
re-derived from the API on startup, and the aggregation_key collapses
any duplicate creation events that a restart makes visible.
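The reconciliation step amounts to a set difference between the cache and the API response. A minimal sketch, assuming both sides are dicts keyed by ext_id (`reconcile` and its argument shapes are hypothetical, not the check's real signature):

```python
from typing import Dict, Set, Tuple


def reconcile(
    cache: Dict[str, dict], unresolved: Dict[str, dict]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Diff the in-memory cache against the unresolved alerts from the API.

    Returns (new, gone, still_tracked) ext_id sets. All three sets are
    computed before the cache is mutated, so loop order cannot affect them.
    """
    cached_ids, api_ids = set(cache), set(unresolved)
    new_ids = api_ids - cached_ids        # in API only: newly opened
    gone_ids = cached_ids - api_ids       # in cache only: resolved or deleted
    tracked_ids = cached_ids & api_ids    # in both: refresh metadata

    for ext_id in new_ids | tracked_ids:
        cache[ext_id] = unresolved[ext_id]
    for ext_id in gone_ids:
        del cache[ext_id]
    return new_ids, gone_ids, tracked_ids
```

Because the cache starts empty after a restart, every unresolved alert lands in the "new" set on the first cycle, which is where the duplicate creation events come from.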
Hardening:
- on transient API failure, re-emit cached gauges before re-raising so
per-alert monitors don't auto-resolve while the alert is still open.
- pre-compute new/gone/still-tracked sets before mutating _open_alerts
so loop ordering is safe.
- v4.2 fallback removed; v4.0 endpoint with $filter=isResolved eq false
is the only path. The pre-existing client-side filter remains as a
safety net.
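The first and third hardening points can be sketched together. This is an assumption-laden illustration: `fetch_with_fallback`, the `gauge` callable, the cache shape (ext_id mapped to an acknowledged flag), and the `ext_id:<id>` tag format are all hypothetical; only the metric names and the `isResolved` field come from the description.

```python
from typing import Callable, Dict, List

OPEN = "nutanix.alert.open"
ACKNOWLEDGED = "nutanix.alert.acknowledged"


def fetch_with_fallback(
    fetch_unresolved: Callable[[], List[dict]],
    cache: Dict[str, bool],
    gauge: Callable[[str, int, List[str]], None],
) -> List[dict]:
    """Fetch unresolved alerts; on failure, re-emit cached gauges and re-raise.

    `cache` maps ext_id -> acknowledged flag (assumed shape).
    """
    try:
        alerts = fetch_unresolved()
    except Exception:
        # Keep per-alert series alive so monitors do not recover by mistake.
        for ext_id, acknowledged in cache.items():
            gauge(ACKNOWLEDGED if acknowledged else OPEN, 1, [f"ext_id:{ext_id}"])
        raise
    # Client-side safety net on top of the server-side $filter.
    return [alert for alert in alerts if not alert.get("isResolved", False)]
```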
Tags added to alert events and metrics:
- ext_id, ntnx_alert_type, ntnx_alert_severity, ntnx_alert_status
(ntnx_alert_status on events only; it is redundant on metrics, where
the name encodes state)
- ntnx_originating_cluster_name, ntnx_alert_user_defined,
ntnx_alert_service (Tier 1 — distinguish federated cluster, custom
vs platform alerts, and Nutanix subsystem when present)
- ntnx_cluster_name, ntnx_alert_classification, ntnx_alert_impact,
ntnx_alert_auto_resolved (resolution events only), source-entity tags
Default monitor template at assets/monitors/alerts.json combines
nutanix.alert.open + nutanix.alert.acknowledged minus
nutanix.alert.resolved to alert on any unresolved alert (clamped to
non-negative). Auto-resolves on the resolved one-shot. Description
notes the agent-restart re-broadcast trade-off.
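The arithmetic behind the monitor can be checked by hand. The function below only mirrors the formula described above (open + acknowledged - resolved, clamped at zero); it is not the query stored in assets/monitors/alerts.json, and `unresolved_signal` is a hypothetical name.

```python
def unresolved_signal(open_value: float, acknowledged_value: float, resolved_value: float) -> float:
    """Per-alert monitor value: positive while the alert is unresolved."""
    return max(open_value + acknowledged_value - resolved_value, 0)


# Alert is open: open=1, acknowledged=0, resolved=0.
assert unresolved_signal(1, 0, 0) == 1
# Alert was acknowledged: open drops to 0, acknowledged becomes 1.
assert unresolved_signal(0, 1, 0) == 1
# Resolution cycle: both state gauges are 0 and resolved is 1 once;
# the raw value would be -1, and the clamp brings it to 0.
assert unresolved_signal(0, 0, 1) == 0
```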
Test coverage: state transitions (open<->ack, ack->resolved from each
prior state), filter-add edge case (treated as spurious resolution),
deleted-alert (_get_alert returns None) graceful fallback, empty
unresolved list cold-start, and per-tag assertions for the new Tier 1
tags. The four "complete output" alertType tests are parametrized.
conftest mock has a _filter_after helper for the time-based fixture
branches.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5e6fb22 to 563c1f3