Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alerting not working if cluster contains nodes with no active inputs #6415

Closed
voxmaster opened this issue Sep 8, 2019 · 23 comments · Fixed by #6428

Comments

@voxmaster
Copy link

commented Sep 8, 2019

Graylog-debug1.log

Expected Behavior

Alert or Event must be triggered if event is configured

Current Behavior

No Alerts has been triggered if there is no other messages in streams. But after other streams (like All messages) receives logs the Alert is triggered with some latency. Feels like alerts are stuck.

Possible Solution

Steps to Reproduce (for bugs)

  1. Run graylog 3.1.1 in docker-compose
  2. Configure SysLog UDP input and custom stream
  3. Configure event definition for custom stream
  4. Send some logs that must be matched by event definition, repeat within some time

Context

Get Alert to be triggered

Your Environment

  • Graylog Version: Graylog v3.1.1+b39ee32
  • Elasticsearch Version: 6.8.2
  • MongoDB Version: 3.6.14
  • Operating System: Ubuntu 18.04.3 LTS (GNU/Linux 4.15.0-60-generic x86_64)
  • Browser version: Firefox 69
  • Debug log in attachment
@mpfz0r mpfz0r self-assigned this Sep 9, 2019
@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

Thanks for you report.
Unfortunately the debug log does not include messages from the events subsystem.
Could you bump these loglevels with curl and reproduce the bug?

curl -I -X PUT http://<user>:<pw>@127.0.0.1:9000/api/system/loggers/org.graylog.events/level/DEBUG -H'X-Requested-By: curl'

curl -I -X PUT http://<user>:<pw>@127.0.0.1:9000/api/system/loggers/org.graylog.plugins.events/level/DEBUG -H'X-Requested-By: curl'

cf: https://docs.graylog.org/en/3.1/pages/faq.html#how-do-i-enable-debug-logging-for-a-specific-plugin-or-area-of-graylog

@mpfz0r mpfz0r added the needs-input label Sep 10, 2019
@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

I have encountered similar problem, only it doesn't matter if there are other messages in the stream or not - events are simply not triggered at all.

When defining an event, at the "Filter & Aggregation" step, search query produces results in the Filter Preview, but once the definition is complete, no event gets triggered. Some events get triggered late, when graylog-server is restarted, but after that, there is no futher triggering.

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

this is the debug for event called 'test':

2019-09-10T11:22:26.397+02:00 DEBUG [EventProcessorEngine] Executing event processor <test/5d776b52d399f8400c488af6/aggregation-v1>
2019-09-10T11:22:26.437+02:00 DEBUG [EventProcessorExecutionJob] Event processor <test/5d776b52d399f8400c488af6> couldn't be executed because of a failed precondition (retry in 5000 ms)
2019-09-10T11:22:31.480+02:00 DEBUG [EventProcessorEngine] Executing event processor <test/5d776b52d399f8400c488af6/aggregation-v1>
2019-09-10T11:22:31.504+02:00 DEBUG [EventProcessorExecutionJob] Event processor <test/5d776b52d399f8400c488af6> couldn't be executed because of a failed precondition (retry in 5000 ms)

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

the 'test' event is defined as:

query "auth_status:fail"
search at 1 min interval last 1 min
create event definition if filter has results

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

The current process status implementation depends on a constant stream of messages.
Is your system mostly idle?
Could you try to start a Random HTTP message generator input and see if that helps
to push the event processing forward?

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

refs #6252

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

we have 50 msg/s average, with 300 msg/s peaks , so even if it is not a heavy loaded system, it is not idle either.

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

I see.
Could you in addition turn on debugging for org.graylog2.system.processing?
curl -I -X PUT http://<user>:<pw>@127.0.0.1:9000/api/system/loggers/org.graylog2.system.processing/level/DEBUG -H'X-Requested-By: curl'

Also logs for a longer period (5min) might be helpful.
Thank you

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

here's 10 mins of debug ....

debug.log

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

what's the output of

echo 'db.processing_status.find()' | mongo graylog

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

{ "_id" : ObjectId("5d72201bb2d3b17a3a25fee1"), "node_id" : "499a1010-96e2-4e79-8007-149185d62032", "node_lifecycle_status" : "RUNNING", "updated_at" : ISODate("2019-09-10T12:40:47.276Z"), "receive_times" : { "ingest" : ISODate("1970-01-01T00:00:00Z"), "post_processing" : ISODate("1970-01-01T00:00:00Z"), "post_indexing" : ISODate("1970-01-01T00:00:00Z") }, "input_journal" : { "uncommitted_entries" : NumberLong("9223372036854775807"), "read_messages_1m_rate" : 0, "written_messages_1m_rate" : 0 } }
{ "_id" : ObjectId("5d72215eb2d3b17a3a2698c7"), "node_id" : "b8fca46f-c009-4c91-90aa-ec4af21cd22a", "node_lifecycle_status" : "RUNNING", "updated_at" : ISODate("2019-09-10T12:40:47.094Z"), "receive_times" : { "ingest" : ISODate("2019-09-10T12:40:47.066Z"), "post_processing" : ISODate("2019-09-10T12:40:47.066Z"), "post_indexing" : ISODate("2019-09-10T12:40:45.591Z") }, "input_journal" : { "uncommitted_entries" : NumberLong(32), "read_messages_1m_rate" : 43.4064764384834, "written_messages_1m_rate" : 43.406455191880404 } }

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

The problem comes from the first entry:
"uncommitted_entries" : NumberLong("9223372036854775807") is Long.MAX_VALUE

Is 499a1010-96e2-4e79-8007-149185d62032 an idle cluster node?

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

We have 2 nodes: master does all the heavy lifting, while slave is just a webui/api server. We also have 5 'data' nodes with ES cluster and mongo replica set.

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

I can confirm that it works when we stop the process on the slave node.

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

Is there anything we can do, as a workaround, to restore events, without losing our processing/ui separation ?

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

Here's a potential workaround: On nodes that don't do any input processing,
you could init the journal read offset file to 0 (but only if it's empty)

Like so:

  • stop server
  • $ offsetfile=data/journal/graylog2-committed-read-offset ; [ -s $offsetfile ] || echo -n 0 > $offsetfile
  • start server
@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

Or disable the disk journal in the configuration altogether:
message_journal_enabled = false

@markomdinic

This comment has been minimized.

Copy link

commented Sep 10, 2019

Works with journaling explicitly disabled on the idle node.

Many thanks!

mpfz0r added a commit that referenced this issue Sep 10, 2019
If nothing was ever commited to a journal,
the uncommitedMessages gauge always reported 9223372036854775807 (Long.MAX_VALUE)

This happens because committedOffset is initialized as DEFAULT_COMMITTED_OFFSET (Long.MIN_VALUE)
(substracted by 1 underflows the long to Long.MAX_VALUE)

This breaks the idle node detection feature for the processing status,
which would exclude entries that have no messages in the journal.

Steal the code from the ThrottleStateUpdaterThread, which already deals with this edge case.

Fixes #6415
@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 10, 2019

Thanks for your help @markomdinic 👍
This will be fixed in 3.1.2

@mpfz0r mpfz0r added this to the 3.1.2 milestone Sep 11, 2019
@mpfz0r mpfz0r changed the title Alerting Not Working 3.1 if no other message in streams Alerting not working if cluster contains nodes with no active inputs Sep 11, 2019
dennisoelkers added a commit that referenced this issue Sep 11, 2019
If nothing was ever commited to a journal,
the uncommitedMessages gauge always reported 9223372036854775807 (Long.MAX_VALUE)

This happens because committedOffset is initialized as DEFAULT_COMMITTED_OFFSET (Long.MIN_VALUE)
(substracted by 1 underflows the long to Long.MAX_VALUE)

This breaks the idle node detection feature for the processing status,
which would exclude entries that have no messages in the journal.

Steal the code from the ThrottleStateUpdaterThread, which already deals with this edge case.

Fixes #6415
dennisoelkers added a commit that referenced this issue Sep 12, 2019
If nothing was ever commited to a journal,
the uncommitedMessages gauge always reported 9223372036854775807 (Long.MAX_VALUE)

This happens because committedOffset is initialized as DEFAULT_COMMITTED_OFFSET (Long.MIN_VALUE)
(substracted by 1 underflows the long to Long.MAX_VALUE)

This breaks the idle node detection feature for the processing status,
which would exclude entries that have no messages in the journal.

Steal the code from the ThrottleStateUpdaterThread, which already deals with this edge case.

Fixes #6415

(cherry picked from commit 13814e6)
mpfz0r added a commit that referenced this issue Sep 12, 2019
…`) (#6435)

If nothing was ever commited to a journal,
the uncommitedMessages gauge always reported 9223372036854775807 (Long.MAX_VALUE)

This happens because committedOffset is initialized as DEFAULT_COMMITTED_OFFSET (Long.MIN_VALUE)
(substracted by 1 underflows the long to Long.MAX_VALUE)

This breaks the idle node detection feature for the processing status,
which would exclude entries that have no messages in the journal.

Steal the code from the ThrottleStateUpdaterThread, which already deals with this edge case.

Fixes #6415

(cherry picked from commit 13814e6)
@voxmaster

This comment has been minimized.

Copy link
Author

commented Sep 14, 2019

Looks like problem is still present with new Graylog 3.1.2+9e96b08
Still have "No Alerts & Events found for the current search criteria." and not notified

curl -I -X PUT http://temp:temptemp@127.0.0.1:9000/api/system/loggers/org.graylog.events/level/DEBUG -H'X-Requested-By: curl'
curl -I -X PUT http://temp:temptemp@127.0.0.1:9000/api/system/loggers/org.graylog.plugins.events/level/DEBUG -H'X-Requested-By: curl'
curl -I -X PUT http://temp:temptemp@127.0.0.1:9000/api/system/loggers/org.graylog2.system.processing/level/DEBUG -H'X-Requested-By: curl'

And attached log and trigger rule
searchmessage
graylog1.log

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 16, 2019

@voxmaster maybe another issue then..

could you get me the output of
echo 'db.processing_status.find()' | mongo graylog

@voxmaster

This comment has been minimized.

Copy link
Author

commented Sep 16, 2019

@mpfz0r Attached output
Graylogmongo-debug1.log

Should I create another issue topic ?

@mpfz0r

This comment has been minimized.

Copy link
Member

commented Sep 17, 2019

@mpfz0r Attached output
Graylogmongo-debug1.log

Should I create another issue topic ?

@voxmaster thanks. I've created #6453

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.