Alert triggered but message not included #2382

Closed
123dev opened this Issue Jun 18, 2016 · 5 comments

123dev commented Jun 18, 2016

Problem description

We use graylog-jira-alarmcallback to log jira tickets when an alert on a stream is raised.

Every now and then we notice that an alert is raised without the message being passed to the plugin.
We have configured the alert condition as follows:
More than 0 messages in the last 1 minute and wait at least 0 minutes until triggering a new alert.
When sending an alert, include the last 1 messages ...

[Screenshot: 2016-06-18 10_41_21-graylog web interface]

Checking the Graylog logs around the time of the problem, we see the following message:

2016-06-17_16:52:04.15041 2016-06-17 16:52:04,149 WARN : com.bidorbuy.graylog.alarmcallbacks.jira.JiraAlarmCallback - Skipping JIRA-issue MD5 generation, alarmcallback did not provide a message

We initially logged an issue with the plugin author who, after investigating, believed this is possibly a Graylog bug.
You can see the discussion here

It is worth noting that we are running more than one Graylog server in a clustered environment, in case that has any relevance.

Steps to reproduce the problem

Not easy, as it happens rarely

  • Set up and configure the graylog-jira-alarmcallback
  • Monitor the graylog logs for a message similar to this.
    WARN : com.bidorbuy.graylog.alarmcallbacks.jira.JiraAlarmCallback - Skipping JIRA-issue MD5 generation, alarmcallback did not provide a message

Environment

AWS Image

  • Graylog Version: 2.0.2 (but happens with 1.3.x as well)

Thanks

@kroepke

Member

kroepke commented Jun 20, 2016

This looks to be a problem with the message count alert condition code. It first retrieves the number of matching messages and then runs a second query to retrieve the actual messages.

For the second query it uses a RelativeRange time range object, which unfortunately re-evaluates its boundaries each time it is used. The code should really be using an AbsoluteRange with the original boundaries, or simply run the query without doing a count first.

All other alert conditions behave correctly.
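
To illustrate the race described above, here is a minimal, self-contained sketch. It does not use Graylog's classes: `SlidingRange`, `search`, and the fake clock are hypothetical stand-ins, and the in-memory list of timestamps stands in for Elasticsearch. It only shows how a range that recomputes its bounds on every call can let the count query see a message that the follow-up search no longer covers.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

public class RelativeRangeDrift {
    /** Hypothetical stand-in for a relative range: bounds are recomputed from the clock on every call. */
    static final class SlidingRange {
        private final long windowMillis;
        private final LongSupplier clock;

        SlidingRange(long windowMillis, LongSupplier clock) {
            this.windowMillis = windowMillis;
            this.clock = clock;
        }

        long from() { return clock.getAsLong() - windowMillis; }
        long to()   { return clock.getAsLong(); }
    }

    /** Returns the timestamps inside [from, to], with the bounds evaluated at call time. */
    static List<Long> search(List<Long> timestamps, SlidingRange range) {
        long from = range.from();
        long to = range.to();
        return timestamps.stream()
                .filter(t -> t >= from && t <= to)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AtomicLong now = new AtomicLong(60_000);            // fake clock, in milliseconds
        SlidingRange lastMinute = new SlidingRange(60_000, now::get);
        List<Long> messages = List.of(500L);                // one message near the start of the window

        int count = search(messages, lastMinute).size();    // first query: the count
        now.addAndGet(1_000);                               // time passes while the count runs
        List<Long> fetched = search(messages, lastMinute);  // second query: the messages

        // The condition triggers (count is 1), but the window has moved past the message.
        System.out.println("count=" + count + " fetched=" + fetched.size());
    }
}
```

With the values chosen here, the program prints `count=1 fetched=0`, which matches the symptom reported in this issue: an alert fires, but no message is handed to the callback.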

@123dev

123dev commented Jun 20, 2016

Thanks for the update Kay.

@bernd bernd self-assigned this Jul 26, 2016

@bernd bernd modified the milestones: 2.1.0, 2.x Jul 26, 2016

bernd added a commit that referenced this issue Jul 26, 2016

Fix timing issue in MessageCountAlertCondition
This condition works by first running a count in Elasticsearch and then,
if the condition triggers, a search to fetch the messages that will be
included in the check result.

Both queries use a RelativeRange object which returns a new time for
every getFrom() and getTo() call that is made. This can result in
different messages being included in the check result or no messages at
all given the count query takes a while.

The RelativeRange is now converted to an AbsoluteRange object which is
then used to run the count and search query. This makes sure the exact
same time range is used no matter how much time is in between the calls.

Refs #1704
Fixes #2382
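
The idea in this commit message can be sketched in a few lines of plain Java. This is not the code from the commit: `FixedRange`, `freeze`, and `search` are hypothetical names, and an in-memory list of timestamps stands in for Elasticsearch. The sketch assumes the essential step is reading the clock once and reusing the resulting bounds for both queries.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

public class FrozenRangeSketch {
    /** Hypothetical fixed range: bounds are captured once and never change. */
    static final class FixedRange {
        final long from;
        final long to;

        FixedRange(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Reads the clock exactly once and derives both bounds from that single reading. */
    static FixedRange freeze(long windowMillis, LongSupplier clock) {
        long now = clock.getAsLong();
        return new FixedRange(now - windowMillis, now);
    }

    /** Returns the timestamps inside the fixed interval [from, to]. */
    static List<Long> search(List<Long> timestamps, FixedRange range) {
        return timestamps.stream()
                .filter(t -> t >= range.from && t <= range.to)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        AtomicLong now = new AtomicLong(60_000);        // fake clock, in milliseconds
        List<Long> messages = List.of(500L);            // one message near the start of the window

        FixedRange range = freeze(60_000, now::get);    // bounds fixed before either query runs
        int count = search(messages, range).size();     // count query
        now.addAndGet(1_000);                           // time passes between the two queries
        List<Long> fetched = search(messages, range);   // search query reuses the same bounds

        System.out.println("count=" + count + " fetched=" + fetched.size());
    }
}
```

Here the program prints `count=1 fetched=1`: because both queries share one pair of bounds, the delay between them no longer changes which messages are returned.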
@bernd

Member

bernd commented Jul 26, 2016

A fix for this will be included in the upcoming Graylog 2.1. #2546

@bernd bernd added bug S3 P3 and removed to-verify labels Jul 26, 2016

edmundoa added a commit that referenced this issue Jul 27, 2016

Fix timing issue in MessageCountAlertCondition (#2546)

@kroepke kroepke added the triaged label Sep 21, 2016

@123dev

123dev commented Oct 5, 2017

Can this be reopened, or should a new ticket be logged?
The problem is not resolved.

I set up multiple notification plugins (Jira, Slack),
and when this problem arises (not every time, of course), none of the plugins receives the last message.
That confirms that the issue is with Graylog and not the plugin.

Thanks

@bernd

Member

bernd commented Oct 5, 2017

@123dev Please open a new issue and add details on how to reproduce this. Thank you!
