Alert plugins can prevent alert conditions from firing #3663
Testing:
While this isn't Graylog's fault, since it is a 3rd-party plugin, the ability of a 3rd-party plugin to break all alerting is a serious issue: "backup" notification methods will silently fail.
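To illustrate the concern, here is a minimal sketch of how a dispatcher could isolate each notification so that one failing plugin does not stop the others. The `Notification` interface, `dispatch` method and class names are hypothetical and are not Graylog's plugin API. It also only covers a plugin that throws; a plugin that hangs would additionally need a timeout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IsolatedNotifier {
    /** Hypothetical callback interface; NOT Graylog's actual plugin API. */
    interface Notification {
        String name();
        void send(String alert) throws Exception;
    }

    /**
     * Calls every notification in turn. A failure in one is logged and
     * recorded, and the loop continues with the remaining notifications.
     * Returns the names of the notifications that failed.
     */
    static List<String> dispatch(List<Notification> notifications, String alert) {
        List<String> failed = new ArrayList<>();
        for (Notification n : notifications) {
            try {
                n.send(alert);
            } catch (Exception e) {
                // Log loudly instead of failing silently.
                System.err.println("Notification '" + n.name() + "' failed: " + e);
                failed.add(n.name());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        Notification broken = new Notification() {
            public String name() { return "broken-plugin"; }
            public void send(String alert) { throw new IllegalStateException("simulated failure"); }
        };
        Notification backup = new Notification() {
            public String name() { return "backup"; }
            public void send(String alert) { System.out.println("backup delivered: " + alert); }
        };
        // The backup still runs even though the first notification throws.
        List<String> failed = dispatch(Arrays.asList(broken, backup), "stream threshold exceeded");
        System.out.println("failed notifications: " + failed);
    }
}
```

With this arrangement the failing callback shows up in the log and in the returned list, while the backup notification is still delivered.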
@avongluck-r1soft Are there any alerting or OpsGenie related exceptions in the logs of your Graylog node(s)?
So I tried to narrow the issue down, but the behavior seems erratic. When does an alert "resolve"? I noticed that editing the conditions and unchecking the "Repeat notifications (optional)" box seems to "resolve" the alert. However, the alert refuses to trigger any additional notifications after this point.
(I can re-check repeat notifications and still can't trigger any notifications 30 minutes after the last successful one.)
If I restart graylog-server (leaving repeat notifications checked) I get a flood of them after startup, then an alert is open and doesn't seem to ever resolve. Here is the alert log:
So 12 minutes later it is still considered open even though the threshold is 5 minutes.
20 minutes later, still unresolved.
Maybe the OpsGenie notification is at fault somehow and "holds" the alerts open?
Attaching a server.log which doesn't tell us much.
Ok. Here is some solid documented oddness:
Notice as of 18:12 "no notifications since 17:31".
So:
^^^ All of that testing was with the OpsGenie plugin installed. I uninstalled the OpsGenie plugin again and removed the notification, and now events seem to always happen for open alerts as expected. No errors seen in server.log up through now.
I painfully realized over the weekend at 3am that "repeat notifications" means a notification every minute while an alert is open, even if only one condition was matched. Even with that understanding, the behavior in the comment above is still out of line with the intended behavior.
Sorry to read that 😞. We have an issue with grace period and repeat notifications (#3579), and it will be fixed in an upcoming patch release.
The first screenshot shows: "Stream had 1 messages in the last 5 minutes" (which shouldn't be true any longer). Maybe "more than 0 messages" is incorrectly configured? Normally "more than 0" means 1 or more. Maybe the logic is >= 0 instead of > 0?
Nah. https://github.com/Graylog2/graylog2-server/blob/master/graylog2-server/src/main/java/org/graylog2/alerts/types/MessageCountAlertCondition.java#L184 The logic seems fine there (MORE is a strict >).
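For readers following the `>` versus `>=` question, a minimal sketch of a strict threshold comparison is shown below. It is an illustration only, not the code from `MessageCountAlertCondition.java`; the `ThresholdCheck` class, its `ThresholdType` enum and the `triggered` method are hypothetical names.

```java
public class ThresholdCheck {
    // Hypothetical enum mirroring the "more than" / "less than" options discussed above.
    enum ThresholdType { MORE, LESS }

    /** Returns true when the message count violates the threshold (strict comparison). */
    static boolean triggered(ThresholdType type, long count, long threshold) {
        switch (type) {
            case MORE:
                return count > threshold;   // "more than 0" needs at least 1 message
            case LESS:
                return count < threshold;
            default:
                throw new IllegalArgumentException("Unknown threshold type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(triggered(ThresholdType.MORE, 0, 0)); // false
        System.out.println(triggered(ThresholdType.MORE, 1, 0)); // true
    }
}
```

With a strict comparison, a stream with zero messages in the window does not satisfy "more than 0", so the condition itself would not keep the alert open.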
@avongluck-r1soft: Is this still an issue with the most recent Graylog version ( |
Given the configuration, I'd expect the alert to close after 2 minutes. However, it's been open for 15 with no signs of closing.
@avongluck-r1soft Anything interesting in the logs of your Graylog nodes at that time?
Nothing at all interesting in the logs (quiet)
I've finally found an interesting correlation with a repeatable pass situation.
This really makes no sense to me unless binding Graylog to 127.0.0.1:9000 was somehow messing up outbound networking within the Java server (hanging the notification?). We don't see any HTTP calls failing, but the erratic behavior noticeably went away once nginx was removed from in front of the web interface (i.e. we bound web_listen_uri to the private interface and killed off nginx). I've attached the proxy + Graylog config we were using.
avongluck, I have the same issue with OpsGenie, but I have no nginx in front of the Graylog server. Could you please paste the Graylog server conf so I can compare it to mine? Thanks, this issue has been driving us nuts.
The old alerting system has been replaced with the alerts and events system in the meantime. This issue shouldn't happen anymore. Please open a new issue if this is still a problem. Thank you!
We see an issue around the OpsGenie plugin for Graylog preventing all alert conditions from firing in 2.2.2 (with zero error messages).
After uninstalling the OpsGenie plugin and restarting Graylog the issue went away and alert conditions in streams started working. We've had an issue open to OpsGenie for a week now on it.