fix(api): suppress stuck-instance alerts for hosts in maintenance mode #1828
Draft
kdhulipala-wq wants to merge 2 commits into
Signed-off-by: Krishna Dhulipala <kdhulipala@nvidia.com>
Description
Machines stuck in an Assigned substate sometimes take days to resolve, and operators put them into maintenance mode to silence on-call alerts. Currently, putting a machine into maintenance mode does not suppress these alerts, so on-call personnel still receive the PD alerts.
This PR makes `SetMaintenance::Enable` also write `ExcludeFromStateMachineSla`, matching what the admin-cli `InternalMaintenance` template has been doing, so `state_sla()` short-circuits to `no_sla()` and a host in maintenance stops contributing to stuck-instance alerts regardless of which state or substate it is in.

This PR also adds a regression test (`test_maintenance_suppresses_state_machine_sla_alert`) that forces a host into the zero-SLA `Failed` state, confirms it reports above-SLA, enables maintenance, confirms the breach flips off, then disables maintenance and confirms it flips back on.

Type of Change
Testing
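For reviewers who want a mental model of the flow the regression test exercises, here is a minimal, self-contained sketch. It is not the code in this PR. The `Host`, `State`, and `HealthOverride` types, the `is_above_sla` and `set_maintenance` helpers, the 30-minute `Assigned` SLA, and the assumption that disabling maintenance removes the exclusion again are all hypothetical stand-ins. Only the names `SetMaintenance::Enable`, `ExcludeFromStateMachineSla`, `state_sla()`, and `no_sla()` come from the description above.

```rust
use std::collections::HashSet;
use std::time::Duration;

// Hypothetical stand-ins for the repository's real types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum HealthOverride {
    Maintenance,
    ExcludeFromStateMachineSla,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Assigned,
    Failed,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SetMaintenance {
    Enable,
    Disable,
}

struct Host {
    state: State,
    time_in_state: Duration,
    overrides: HashSet<HealthOverride>,
}

impl Host {
    fn new(state: State, time_in_state: Duration) -> Self {
        Host { state, time_in_state, overrides: HashSet::new() }
    }
}

/// `None` models "no SLA": such a host can never be reported as above SLA.
fn no_sla() -> Option<Duration> {
    None
}

/// Short-circuits to `no_sla()` when the exclusion override is present.
fn state_sla(host: &Host) -> Option<Duration> {
    if host.overrides.contains(&HealthOverride::ExcludeFromStateMachineSla) {
        return no_sla();
    }
    match host.state {
        // Assumed value, for illustration only.
        State::Assigned => Some(Duration::from_secs(30 * 60)),
        // Zero SLA: any time spent in `Failed` counts as a breach.
        State::Failed => Some(Duration::ZERO),
    }
}

fn is_above_sla(host: &Host) -> bool {
    match state_sla(host) {
        Some(sla) => host.time_in_state > sla,
        None => false,
    }
}

/// Enabling maintenance also writes the SLA exclusion; disabling is assumed
/// to remove both overrides.
fn set_maintenance(host: &mut Host, op: SetMaintenance) {
    match op {
        SetMaintenance::Enable => {
            host.overrides.insert(HealthOverride::Maintenance);
            host.overrides.insert(HealthOverride::ExcludeFromStateMachineSla);
        }
        SetMaintenance::Disable => {
            host.overrides.remove(&HealthOverride::Maintenance);
            host.overrides.remove(&HealthOverride::ExcludeFromStateMachineSla);
        }
    }
}

fn main() {
    // Same sequence as the regression test: breach, suppress, breach again.
    let mut failed = Host::new(State::Failed, Duration::from_secs(1));
    assert!(is_above_sla(&failed));
    set_maintenance(&mut failed, SetMaintenance::Enable);
    assert!(!is_above_sla(&failed));
    set_maintenance(&mut failed, SetMaintenance::Disable);
    assert!(is_above_sla(&failed));

    // The motivating case: a host stuck in `Assigned` for two days.
    let mut stuck = Host::new(State::Assigned, Duration::from_secs(2 * 24 * 60 * 60));
    assert!(is_above_sla(&stuck));
    set_maintenance(&mut stuck, SetMaintenance::Enable);
    assert!(!is_above_sla(&stuck));
}
```

Placing the check at the top of `state_sla()` in this sketch is what makes the suppression independent of state or substate: the per-state match is never reached for an excluded host.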