fix(api): suppress stuck-instance alerts for hosts in maintenance mode #1828

Draft
kdhulipala-wq wants to merge 2 commits into NVIDIA:main from kdhulipala-wq:kcd-suppress-maintenance-alerts

@kdhulipala-wq kdhulipala-wq commented May 19, 2026

Description

Machines stuck in an Assigned substate can take days to resolve, so operators put them into maintenance mode to silence on-call alerts. Today, putting a machine into maintenance mode does not suppress its alerts, and on-call personnel still receive the PD alerts.

This PR makes SetMaintenance::Enable also write ExcludeFromStateMachineSla, matching what the admin-cli InternalMaintenance template already does. With that override set, state_sla() short-circuits to no_sla(), so a host in maintenance stops contributing to stuck-instance alerts regardless of which state or substate it is in.

This PR also adds a regression test (test_maintenance_suppresses_state_machine_sla_alert) that forces a host into the zero-SLA Failed state, confirms it reports above-SLA, enables maintenance, confirms the breach flips off, then disables maintenance and confirms it flips back on.
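
For reviewers who want a mental model of the behavior and of the sequence the regression test walks through, here is a minimal, self-contained sketch. It is not the code in this PR. Only the names `SetMaintenance::Enable`, `state_sla()`, `no_sla()` and the `ExcludeFromStateMachineSla` override come from the description; `Host`, `HostState`, `apply_maintenance`, `is_above_sla`, `toggle_sequence`, the `Option<Duration>` representation of an SLA, and the 30-minute SLA for `Assigned` are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Simplified stand-in for the host state; the real state machine has many
/// more states and substates.
#[derive(Clone, Copy, Debug, PartialEq)]
enum HostState {
    Assigned,
    Failed,
}

/// The two request variants named in the PR (payloads omitted here).
#[derive(Clone, Copy, Debug)]
enum SetMaintenance {
    Enable,
    Disable,
}

/// Hypothetical host record holding only what this sketch needs.
#[derive(Debug)]
struct Host {
    state: HostState,
    in_maintenance: bool,
    /// Models the ExcludeFromStateMachineSla override as a plain flag.
    exclude_from_state_machine_sla: bool,
}

/// `None` means "no SLA": the state can never be reported as breached.
fn no_sla() -> Option<Duration> {
    None
}

/// Returns the SLA for the host's current state, short-circuiting to
/// `no_sla()` when the exclusion override is set.
fn state_sla(host: &Host) -> Option<Duration> {
    if host.exclude_from_state_machine_sla {
        return no_sla();
    }
    match host.state {
        // Assumed value, for illustration only.
        HostState::Assigned => Some(Duration::from_secs(30 * 60)),
        // Zero SLA: any time spent in this state counts as a breach.
        HostState::Failed => Some(Duration::ZERO),
    }
}

/// A host is above SLA when it has spent longer in its state than allowed.
fn is_above_sla(host: &Host, time_in_state: Duration) -> bool {
    match state_sla(host) {
        Some(sla) => time_in_state > sla,
        None => false,
    }
}

/// Enable sets the maintenance marker and the SLA exclusion together;
/// Disable clears both.
fn apply_maintenance(host: &mut Host, request: SetMaintenance) {
    let enabled = matches!(request, SetMaintenance::Enable);
    host.in_maintenance = enabled;
    host.exclude_from_state_machine_sla = enabled;
}

/// Records whether the host is above SLA before maintenance, while in
/// maintenance, and after maintenance is disabled again.
fn toggle_sequence(state: HostState, time_in_state: Duration) -> [bool; 3] {
    let mut host = Host {
        state,
        in_maintenance: false,
        exclude_from_state_machine_sla: false,
    };
    let before = is_above_sla(&host, time_in_state);

    apply_maintenance(&mut host, SetMaintenance::Enable);
    assert!(host.in_maintenance);
    let during = is_above_sla(&host, time_in_state);

    apply_maintenance(&mut host, SetMaintenance::Disable);
    assert!(!host.in_maintenance);
    let after = is_above_sla(&host, time_in_state);

    [before, during, after]
}
```

Calling `toggle_sequence(HostState::Failed, Duration::from_secs(1))` from a `main` function or a `#[test]` yields `[true, false, true]` in this model: breached, suppressed while in maintenance, then breached again. The sketch puts the short-circuit in front of the per-state lookup, so the suppression does not depend on which state the host is in.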

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>

copy-pr-bot Bot commented May 19, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.
