Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove some *_over_time modifiers for alerts #8631

Merged
merged 1 commit into from Apr 15, 2021

Conversation

mrmr1993
Copy link
Member

@mrmr1993 mrmr1993 commented Apr 13, 2021

This PR removes the min_over_time/max_over_time aggregators for several alerts. The effect of these aggregators is to extend the time that an alert fires for by ${alert_timeframe} (currently 10m) after it otherwise should have stopped.

For all of the alerts changed in this PR, the issue is known to be resolved as soon as the alert stops firing.

Checklist:

  • Document code purpose, how to use it
    • Mention expected invariants, implicit constraints
  • Tests were added for the new behavior
    • Document test purpose, significance of failures
    • Test names should reflect their purpose
  • All tests pass (CI will check this if you didn't)
  • Serialized types are in stable-versioned modules
  • Does this close issues? List them:

@deepthiskumar
Copy link
Member

${alert_timeframe} is currently an hour (alert_evaluation_duration is 10m) and is used as a window for samples. The samples are then used by the aggregate functions such as max_over_time. From what I understand, the alert should be triggered once when the alert condition is met?
The alert_evaluation_duration which is 10min is the wait time when the alert actually fires after the condition is met. So if within that time frame the condition is unmet then the alert is not fired.

I'm not sure if i understand the impact. Could you give a concrete example of the issue you were seeing with one of these alerts?

@mrmr1993
Copy link
Member Author

${alert_timeframe} is currently an hour (alert_evaluation_duration is 10m) and is used as a window for samples. The samples are then used by the aggregate functions such as max_over_time. From what I understand, the alert should be triggered once when the alert condition is met?

@deepthiskumar This would be essentially saying 'if an alert fires, we want it to fire for an extra hour after it's really resolved'. This makes the alert useless for an hour, and massively inflates small blips in the metrics. This also makes several of the for: clauses redundant: if the alert fires, it will always fire for at least 1 hr > 10 mins.

I suspect these want the inverse behaviour in these alerts, using for: clauses, not *_over_time. Perhaps we also want to elongate the triggering period slightly, so that e.g. restart loops get combined into a single continuous alert, but that elongation should be way shorter, and strictly less than the for: timeout

I'm not sure if i understand the impact. Could you give a concrete example of the issue you were seeing with one of these alerts?

This triggered last night. The issue resolved before PagerDuty even fired an alert, but the min_over_time kept it triggering for 10 mins after.

@deepthiskumar
Copy link
Member

@deepthiskumar This would be essentially saying 'if an alert fires, we want it to fire for an extra hour after it's really resolved'. This makes the alert useless for an hour, and massively inflates small blips in the metrics. This also makes several of the for: clauses redundant: if the alert fires, it will always fire for at least 1 hr > 10 mins.

Hmm, yeah it'll keep firing (if you manually resolve) because sometime in the past within the alert timeframe the alert condition was met (even if it was just once). Also makes sense that the for clause will be ineffective because the alert condition stays true for the entire alert timeframe. I guess the question is if we want to be alerted for these blips because removing the *_over_time and adding for will hide them.

This triggered last night. The issue resolved before PagerDuty even fired an alert, but the min_over_time kept it triggering for 10 mins after.

This particular alert NoCoinbaseInBlocks does not have the for clause and hence triggered a pagerduty alert when the condition was met. I only see a single pagerduty entry though. Was it that it was continuously notifying the same alert?
Anyway, the plan is trigger alerts from nodes that are synced . Perhaps we could add a for clause to it if we think one block without a coinbase within 10mins is most likely not critical.

@mrmr1993 mrmr1993 merged commit 1fcbc1b into compatible Apr 15, 2021
@mrmr1993 mrmr1993 deleted the feature/alerts-no-over_time branch April 15, 2021 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants