Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GELF input UDP stop work #5701

Closed
ismaelpuerto opened this issue Feb 20, 2019 · 11 comments · Fixed by #6263

Comments

@ismaelpuerto
Copy link

commented Feb 20, 2019

Hello

I receive some packet gelf that causes graylog input down. The last message in graylog-server.log is:

2019-02-20T16:07:56.099+01:00 ERROR [NettyTransport] Error in Input [GELF UDP/58ebb5b9ba0e204a705a894c] (channel [id: 0xd79eba88, L:/0.0.0.0:12201]) (cause java.lang.IllegalStateException: GELF message is too short. Not even the type header would fit.)

The web interface report that input is OK, but if I see the port open in machine, I see down:

[root@machine1 ~]# ss -u -a |grep 12201
[root@machine1 ~]#

With graylog 2.5 version works fine.

My Environment

  • Graylog Version: 3.0
  • Elasticsearch Version: 6.5.4
  • MongoDB Version: 3.4.3
  • Operating System: RHEL7
  • Browser version: no apply

@dennisoelkers dennisoelkers self-assigned this Feb 20, 2019

@dennisoelkers

This comment has been minimized.

Copy link
Member

commented Feb 20, 2019

Can you reproduce this? Is this happening repeatedly? If yes, is there even a chance to capture a pcap during the time when it happens?

@ismaelpuerto

This comment has been minimized.

Copy link
Author

commented Feb 25, 2019

I found the "bug", I sent the log with netcat and this is cause that error. I switched to nxlog and now the input can't stop. Thanks

@bjoernhaeuser

This comment has been minimized.

Copy link
Contributor

commented Jul 18, 2019

This is happening for us aswell. We are getting this error message in the log:

ERROR: org.graylog2.plugin.inputs.transports.NettyTransport - Error in Input [GELF UDP/59a7d14166282b0001d46ea6] (channel [id: 0xa5cdc923, L:/0.0.0.0:22201]) (cause java.lang.IllegalStateException: GELF message is too short. Not even the type header would fit.)

Afterwards the input on the node reports as "running", but in fact accepts no new messages. After a while all Gelf UDP inputs on all nodes receive a bad package and stop; effectively not accepting any messages through UDP anymore.

We do not know which system is producing these messages and therefore it would be complicated to capture a pcap during the time.

Would be cool if the input continues to work, despite the bad package.

@svenwltr

This comment has been minimized.

Copy link

commented Jul 22, 2019

Hello,

we were able to reproduce this issue by actually sending a single byte to the UDP input. This directly causes the GELF message is too short. error message and the input is not reachable afterwards anymore.

  • Graylog Version: 3.0.1

Steps to reproduce

Verify that the UDP port is open:

$ nc -vzu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 0 bytes received in 2.01 seconds.

Send a single byte over UDP:

$ echo -en '\x1e' | nc -vu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: 1 bytes sent, 0 bytes received in 0.01 seconds.

Somehow the input does not stop immediately:

$ nc -vzu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: UDP packet sent successfully
Ncat: 1 bytes sent, 0 bytes received in 2.01 seconds.

A couple of seconds later the port is closed:

$ nc -vzu 10.10.55.54 22201
Ncat: Version 7.50 ( https://nmap.org/ncat )
Ncat: Connected to 10.10.55.54:22201.
Ncat: Connection refused.
@bernd

This comment has been minimized.

Copy link
Member

commented Jul 22, 2019

@bjoernhaeuser @svenwltr Thank you for the updated information. We will reopen the ticket so we can try to reproduce and fix it.

@bernd bernd reopened this Jul 22, 2019

@bjoernhaeuser

This comment has been minimized.

Copy link
Contributor

commented Jul 22, 2019

@bernd thank you! If there is anything else we can do, let us know. I am not familiar enough with the codebase to open a PR by myself, but let me know if there is anything I can do.

@jalogisch

This comment has been minimized.

Copy link
Member

commented Jul 23, 2019

It is important to have in mind what we highlight in the docs: http://docs.graylog.org/en/3.0/pages/gelf.html#gelf-via-udp

Please note, that the UDP-Inputs of Graylog use the SO_REUSEPORT socket option, which was introduced in Linux kernel version 3.9. So be aware, that UDP inputs will NOT work on Linux kernel versions prior to 3.9.

@der-eismann

This comment has been minimized.

Copy link

commented Jul 23, 2019

@jalogisch Our cluster environment is based on CoreOS and therefore kernel 4.19.x, so this shouldn't be an issue.

@asaf400

This comment has been minimized.

Copy link

commented Jul 29, 2019

Happens to us as well, exact same symptoms,
however, the packet to crash the listen port is actually that netcat test port command:
nc -vzu graylog-node 22201

After that the listener (all threads of it) close, and never comeback, all while the master web gui, shows listener as active on that node (and every node which has crashed...)

And since UDP doesn't have a transmission mechanism, the data is lost (since load balancer on udp cant really health check the port, and if it does, with this bug, it might just kill the server with a simple healthcheck - port check)

System - Docker:
Graylog v3.0.2+1686930
Uname -a: Linux 4.14.128-112.105.amzn2.x86_64 #1 SMP x86_64 GNU/Linux
(Using Official Docker Image)

I am somewhat glad this has been picked up by more graylog users already, And I'm not the only one experiencing it..

@bernd bernd added this to the 3.1.0 milestone Jul 29, 2019

dennisoelkers added a commit that referenced this issue Aug 8, 2019
Prevent closing of UDP channel in input for messages with bogus GELF …
…header.

Before this change, whenever a UDP message came in with a 0 or 1 byte
header (smaller than the 2 byte GELF header), the thread handling this
message throws an exception in `GELFMessage#getGELFType` which is caught
in `EnvelopeMessageHandler#exceptionCaught`. In there, the channel is
closed. This is correct for a TCP-based input, as the corresponding
TCP-connection would be closed, as each channel corresponds to a TCP
connection. This is not the case for UDP pipelines. Closing a channel of
a UDP pipeline leads to the handler/thread losing the overall connection
to the socket multiplexing. If this happens for a number of times
equaling the number of working threads, the whole input stops processing
any UDP messages as there is no thread left which is able to receive
channel events.

This change adds `EnvelopeMessageAggregationHandler#exceptionCaught`
which is only logging the exception and increasing the counter for
invalid GELF chunks, without propagating the exception to other handlers
(preventing the channel closure).

A remaining issue is that for UDP messages with no/invalid GELF headers,
the `EnvelopeMessageAggregationHandler#channelRead0` is called four
times, for correct messages only once. The first three times it is
called with a single byte buffer, the last time with a buffer containing
the complete message. It is unclear to me why this happens and leads to
the exception being logged four times, but this seems to be still way
better than causing situations where the complete input ceases to work.
Any insight how to fix this is welcome.

Fixes #5701.
@dennisoelkers

This comment has been minimized.

Copy link
Member

commented Aug 8, 2019

Thanks @ismaelpuerto, @svenwltr, @bjoernhaeuser, @asaf400, @der-eismann for reporting this and helping us to reproduce it. I was able to reproduce it and identify the root cause. A fix is prepared and will hopefully become part of 3.1.0.

@bernd bernd closed this in #6263 Aug 9, 2019

bernd added a commit that referenced this issue Aug 9, 2019
Prevent closing of UDP channel in input for messages with bogus GELF …
…header. (#6263)

Before this change, whenever a UDP message came in with a 0 or 1 byte
header (smaller than the 2 byte GELF header), the thread handling this
message throws an exception in `GELFMessage#getGELFType` which is caught
in `EnvelopeMessageHandler#exceptionCaught`. In there, the channel is
closed. This is correct for a TCP-based input, as the corresponding
TCP-connection would be closed, as each channel corresponds to a TCP
connection. This is not the case for UDP pipelines. Closing a channel of
a UDP pipeline leads to the handler/thread losing the overall connection
to the socket multiplexing. If this happens for a number of times
equaling the number of working threads, the whole input stops processing
any UDP messages as there is no thread left which is able to receive
channel events.

This change adds `EnvelopeMessageAggregationHandler#exceptionCaught`
which is only logging the exception and increasing the counter for
invalid GELF chunks, without propagating the exception to other handlers
(preventing the channel closure).

A remaining issue is that for UDP messages with no/invalid GELF headers,
the `EnvelopeMessageAggregationHandler#channelRead0` is called four
times, for correct messages only once. The first three times it is
called with a single byte buffer, the last time with a buffer containing
the complete message. It is unclear to me why this happens and leads to
the exception being logged four times, but this seems to be still way
better than causing situations where the complete input ceases to work.
Any insight how to fix this is welcome.

Fixes #5701.
@bernd

This comment has been minimized.

Copy link
Member

commented Aug 9, 2019

@ismaelpuerto @bjoernhaeuser @svenwltr @der-eismann @asaf400 We just merged the PR that is fixing this issue. It will be part of the upcoming Graylog 3.1.0 and we will also consider a backport to 3.0.x.

Thanks for the input everyone! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants
You can’t perform that action at this time.