Burnrate alerts aren't working correctly #47

Open
lswith opened this issue Feb 6, 2022 · 7 comments


lswith commented Feb 6, 2022

I have an SLO alert with a 30m short window and a 6h long window. I've set the same threshold on both.

When the SLO was breached, the alert triggered quickly (within 5m), but it took 6 hours to resolve after the service went back to normal.

I would have expected it to be resolved quickly according to https://sre.google/workbook/alerting-on-slos/

Looking into this a bit deeper, I think that the threshold values on the monitor take 6 hours to evaluate, and it might not be possible to do "Multiwindow, Multi-Burn-Rate Alerts" using Sumo Logic's monitors.

lswith changed the title from "Burnrate alerts taking too long to recover" to "Burnrate alerts aren't working correctly" on Feb 6, 2022
Author

lswith commented Feb 6, 2022

Also, I have 3 alerts associated with an SLO: 10m-1h, 30m-6h, and 6h-24h. In Prometheus, the alerts aren't duplicated because they're grouped together (as you can see in the query in the article), but in Sumo I got 3 emails per SLO while the system was down.

Author

lswith commented Feb 7, 2022

Looking into this a bit more thoroughly, it looks like the monitor is being evaluated over the long period, and if the combined_burn exceeds 1 at any time in that period, it won't resolve. This means combined_burn would have to stay at 1 or lower for the whole long period before the alert resolves.
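To make the resolution delay concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not Sumo Logic's actual evaluation logic: it assumes one combined_burn sample per minute and assumes the alert resolves only once every sample in the trailing window is 1 or lower. The function name and the sample series are hypothetical.

```python
def minute_resolved(combined_burn, window):
    """Return the index of the first minute at which a fired alert resolves.

    Assumed model: the alert is firing while any sample in the trailing
    `window` minutes exceeds 1, and resolves once none do.
    Returns None if the alert never fires or never resolves.
    """
    fired = False
    for t in range(len(combined_burn)):
        trailing = combined_burn[max(0, t - window + 1): t + 1]
        if max(trailing) > 1:
            fired = True
        elif fired:
            return t
    return None


# Hypothetical incident: 10 bad minutes (both windows exceeded), then recovery.
series = [2] * 10 + [0] * 400
last_bad_minute = 9

delay_long = minute_resolved(series, window=360) - last_bad_minute
delay_short = minute_resolved(series, window=30) - last_bad_minute
print(delay_long, delay_short)
```

Under these assumptions, evaluating over the 6h window keeps the alert open for 360 minutes after the last bad sample, while evaluating over the 30m window resolves it after 30 minutes.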

I think we might have to change the monitor to be evaluated over the short period of time, but move the calculations for the combined_burn into a scheduled search so that it can be evaluated over a period of time.

Author

lswith commented Feb 7, 2022

It looks like a scheduled search wouldn't do it, but a scheduled view would. You can pre-populate the scheduled view with the current longBurnRate, and then calculate the latestBurnRate in the monitor.

Also, I've noticed that I am using the triggers for "Warning" and "ResolvedWarning", which trip when the combined_burn exceeds 1. The "Critical" and "ResolvedCritical" triggers seem to trip when the combined_burn exceeds 2, but this will never happen, as it can be at most 2:

if (longBurnRate > 6, 1, 0) as long_burn_exceeded
| if (latestBurnRate > 6, 1, 0) as short_burn_exceeded
| long_burn_exceeded + short_burn_exceeded as combined_burn
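The same arithmetic can be checked with a small Python sketch that mirrors the query fragment above. The function name is hypothetical and the threshold of 6 is taken from the fragment; this only models the two indicator terms, not the rest of the monitor query.

```python
def combined_burn(long_burn_rate, latest_burn_rate, threshold=6):
    """Mirror of the query fragment: each exceeded window contributes 1."""
    long_burn_exceeded = 1 if long_burn_rate > threshold else 0
    short_burn_exceeded = 1 if latest_burn_rate > threshold else 0
    return long_burn_exceeded + short_burn_exceeded


# Enumerate every combination of "below threshold" (0) and "above threshold" (7).
possible_values = {combined_burn(l, s) for l in (0, 7) for s in (0, 7)}

warning_can_trip = any(v > 1 for v in possible_values)   # needs both windows exceeded
critical_can_trip = any(v > 2 for v in possible_values)  # unreachable
print(sorted(possible_values), warning_can_trip, critical_can_trip)
```

Since the sum of two 0/1 indicators tops out at 2, a "> 2" condition can never be met; a "> 1" condition is met only when both windows are exceeded.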

Author

lswith commented Feb 7, 2022

Also, looking into https://sre.google/workbook/alerting-on-slos/ more, it seems that they combine alerts based on the notification type.

For example:

expr: (
        job:slo_errors_per_request:ratio_rate1h{job="myjob"} > (14.4*0.001)
      and
        job:slo_errors_per_request:ratio_rate5m{job="myjob"} > (14.4*0.001)
      )
    or
      (
        job:slo_errors_per_request:ratio_rate6h{job="myjob"} > (6*0.001)
      and
        job:slo_errors_per_request:ratio_rate30m{job="myjob"} > (6*0.001)
      )
severity: page

This query means that both SLO alerts are combined. If either one is triggered, it will send the same email. This has the benefit that there won't be 2 notifications that the alert has been triggered, and there won't be a duplication of alerts.
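As a rough illustration of how that expression collapses the window pairs into a single paging decision, here is a Python sketch of the same boolean logic. The multipliers 14.4 and 6 and the 0.001 factor come from the query above; the function name and the dictionary of per-window error ratios are hypothetical stand-ins for the recorded series.

```python
ERROR_BUDGET = 0.001  # the 0.001 factor used in the quoted expression


def should_page(error_ratio):
    """Return True if either (long, short) window pair exceeds its burn-rate threshold.

    `error_ratio` maps a window name ("5m", "30m", "1h", "6h") to the
    observed error ratio over that window.
    """
    fast_burn = (error_ratio["1h"] > 14.4 * ERROR_BUDGET
                 and error_ratio["5m"] > 14.4 * ERROR_BUDGET)
    slow_burn = (error_ratio["6h"] > 6 * ERROR_BUDGET
                 and error_ratio["30m"] > 6 * ERROR_BUDGET)
    # One combined result means one notification, whichever pair fired.
    return fast_burn or slow_burn


ongoing = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.02}
recovered = {"5m": 0.0, "30m": 0.0, "1h": 0.02, "6h": 0.01}
print(should_page(ongoing), should_page(recovered))
```

In the `recovered` case the long windows are still above their thresholds, but the short windows have dropped to zero, so the combined condition is already false. That is the quick recovery the short window is meant to provide.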

I think it might be worthwhile updating the SLO configuration to the latest OpenSLO Spec. They have added a few objects such as "AlertPolicies" which have 1 or more "Alert Conditions". This would allow the configuration to group all of the "long/short burn rate" conditions into 1 alert.

Author

lswith commented Feb 7, 2022

Ah damn, it looks like OpenSLO's oslo doesn't support the latest OpenSLO Spec.

OpenSLO/oslo#63

Contributor

agaurav commented Feb 7, 2022

Hey @lswith, I will discuss the monitor not resolving with the monitors team and get back to you by tomorrow.
I recall it was meant to prevent frequent flapping between the alert opening and closing, but waiting for 6h defeats the purpose of a multi-window monitor.

The update is currently blocked for two reasons: 1) oslo hasn't been updated yet and 2) it doesn't support multi burn rate monitors yet.
I will discuss this with the OpenSLO team and will try to expedite it by raising a PR for oslo.

Contributor

agaurav commented Feb 8, 2022

The monitor team is working on adding a configurable resolution window for monitors. Once that is available, setting the resolve window to the short-burn period will give us the correct behaviour required for these alerts.
The ETA for this feature is the end of March.

cc: @tarunk2

agaurav self-assigned this on Feb 8, 2022
agaurav added the "enhancement" (New feature or request) label on Feb 8, 2022
agaurav added this to the v0.8 milestone on Feb 8, 2022