On Solaris 10 (Illumos), setting TCP_NODELAY on a closed socket causes elasticsearch to be unresponsive #7115
Comments
Seems like in Netty it's not set by default only on Android, and it's still being set on Solaris. I would be more than happy to create a change to disable it on Solaris. Others, thoughts?
@kimchy Correct, though it is now wrapped in an exception block and is ignored if it throws an error. TCP_NODELAY should still be set on Solaris, but setting it on a closed socket throws an exception, and that exception should be ignored.
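A minimal sketch of the "wrap it and ignore the error" approach described above. This is not the Netty or Elasticsearch code; the class name `SafeSocketOptions` and the helper `trySetTcpNoDelay` are hypothetical and only illustrate the idea with the standard `java.net.Socket` API.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketException;

// Hypothetical helper, for illustration only (not the actual Netty/Elasticsearch code).
public class SafeSocketOptions {

    /**
     * Tries to set TCP_NODELAY and reports whether the option was applied.
     * A SocketException (for example, because the socket is already closed)
     * is swallowed instead of being propagated to the caller.
     */
    public static boolean trySetTcpNoDelay(Socket socket, boolean on) {
        try {
            socket.setTcpNoDelay(on);
            return true;
        } catch (SocketException e) {
            // The socket was rejected or already closed; ignore and carry on.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Socket socket = new Socket();
        socket.close(); // simulate a connection that went away before options were set
        System.out.println("option applied: " + trySetTcpNoDelay(socket, true));
        // prints "option applied: false"
    }
}
```

Returning a boolean rather than silently discarding the failure lets the caller log or count these cases if it wants to, while still keeping the accept path alive.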
Yes, the silent ignore in the exception... I was just wondering why Netty didn't disable it on Solaris by default as well. Based on your input, it seems like it should. I am reaching out to some Solaris experts on our end to see what they think, just to be doubly sure we should make this change. Thanks for bringing it up!
Okay, awesome! Thanks so much!
@letuboy btw, which Java version are you running?
and another question, if you set it to |
We're using OpenJDK 1.7.
It appears that the mere fact of calling |
@letuboy hard to tell exactly which Java version it actually is..., internal?
Allow setting the value `default` for network.tcp.no_delay and network.tcp.keep_alive so they won't be set at all, since on Solaris, setting tcpNoDelay can actually cause a failure. Relates to elastic#7115
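As a rough illustration of what a `default` value could mean for such a setting, here is a sketch under stated assumptions. The class `TcpOptionSetting` and its methods are hypothetical and are not taken from the referenced change; the only behavior assumed from the commit message is that `default` means the option is not set on the socket at all.

```java
import java.net.Socket;
import java.net.SocketException;

// Hypothetical sketch of a tri-state setting: "true", "false", or "default".
public class TcpOptionSetting {

    /** Returns null for "default" (or a missing value), meaning: do not touch the option. */
    public static Boolean parse(String value) {
        if (value == null || "default".equalsIgnoreCase(value.trim())) {
            return null;
        }
        return Boolean.parseBoolean(value.trim());
    }

    /**
     * Applies TCP_NODELAY only when an explicit value is configured.
     * Returns true if the option was set, false if it was left at the OS default.
     */
    public static boolean applyNoDelay(Socket socket, String configured) throws SocketException {
        Boolean noDelay = parse(configured);
        if (noDelay == null) {
            return false; // "default": never call setTcpNoDelay, so it cannot fail here
        }
        socket.setTcpNoDelay(noDelay);
        return true;
    }
}
```

With the value `default`, `setTcpNoDelay` is never invoked, so the platform-specific failure path is avoided entirely instead of being caught after the fact.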
I pushed #7136 to master and 1.x (upcoming 1.4) to allow to set |
We're on Elasticsearch 1.1.1 running on Illumos (Solaris 10 derivative on Joyent).
We ran into an issue today where elasticsearch became completely unresponsive after the following exception:
On Solaris, setsockopt behaves differently than on other platforms. When the socket has already been closed, it returns EINVAL, causing Java to raise an InvalidArgument exception. Apparently this happens when the client closes the connection before the server has finished its accept. Elasticsearch appears to have been doing a garbage collection around that time.
Here are a couple of references to this bug occurring in other projects:
http://bugs.java.com/view_bug.do?bug_id=6378870
https://java.net/jira/browse/GLASSFISH-5342
https://jira.atlassian.com/browse/STASH-3624
It also appears that in Netty 4.0+ this might have been fixed by: netty/netty@39357f3#diff-dbfa6a222217d4fc2c12d20ee3496eb3R50
Unfortunately, this is a bit difficult to reproduce and it only happens rarely. I'd imagine it can be reproduced by running Elasticsearch on Solaris 10 and finding a way to stall the server long enough for the client to close the connection before the server has set the socket options. Elasticsearch should then stall and stop responding to any requests (which is the behavior that we saw).
Thanks,
Paul