ES 1.4.2 random node disconnect #9212
This may be a more suitable question for the mailing list(?). How big is your cluster: node count, index count, shards and replicas, total data size, heap size, ES version? What is your config?
Right now: … Current yml config here: …
That's a lot of shards. You might be running into resource limits. Check your ulimit on filehandles.
The file limit is not the problem; it is set to … I have another cluster with an older ES version, with over 9000 shards on 3 nodes, and my nodes don't get random disconnects there.
Can you try closing some old indices for a while and seeing if it helps?
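For what it's worth, closing (rather than deleting) indices is a cheap experiment; a minimal sketch, assuming a hypothetical daily-index naming pattern and localhost as a placeholder:

```sh
# Closing keeps the data on disk but takes the shards out of the active cluster.
curl -XPOST 'localhost:9200/logs-2014.12.*/_close'
# Reopen later with:
# curl -XPOST 'localhost:9200/logs-2014.12.*/_open'
```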
Ok, I will try that with about 1/3 of my total shards and see what happens. I will do it Monday, so have a nice weekend till then :)
@Revan007 did you see any long GCs on the node? Can you set the …
Hi, I am also seeing the same issue on our 3-node cluster - I have posted a lot of details in the elasticsearch users group here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/elasticsearch/fQkhRJ6Md9c I've messed around with lowering tcp keep alive, setting fd.ping_timeout, etc., and the issue is still happening. The root cause seems to be a Netty-level timeout. The last change I tried, last night, was to increase fd.ping_timeout - the issue is still happening frequently and causing failures on the client side in our application.
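For reference, the zen fault-detection knobs mentioned above live in elasticsearch.yml; a hedged sketch (the values are illustrative, not recommendations, and the path assumes a Debian/RPM package install):

```sh
# Append fault-detection overrides to the node config; requires a node restart.
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
EOF
```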
How do I enable transport.netty debug logging? I have also read this: https://github.com/foundit/elasticsearch-transport-module - Recommended tweaks to existing settings: how can I modify these values in ES?
You can do something like the following. (This works - I wasn't sure whether logger.transport or logger.org.elasticsearch.transport was the right one - I tried both and this works.) curl -XPUT :9200/_cluster/settings -d '
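The body of the command above was truncated in the thread; a hedged reconstruction (the host is a placeholder), raising the transport logger to DEBUG as a transient cluster setting:

```sh
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "logger.transport": "DEBUG" }
}'
```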
Thank you @rangagopalan. I guess I will wait for another node failure to see the results. And here is the _nodes/stats info:
My node failed... these are the logs, in addition to the ones posted before:
@Revan007 and @bleskes - This issue seems to be faced by others as well - see https://groups.google.com/forum/#!searchin/elasticsearch/ReceiveTimeoutTransportException/elasticsearch/EvDNO_vALNg/mVzurNj1KRAJ The solution there was to reduce the shards/indices/replicas combination, since that might help the cluster nodes stats API return within the hard-coded timeout of 15 seconds (assuming that is the cause of the connection failures). I am trying it here on my 3-node cluster (we recently switched to time-based indexes, and at least temporarily for testing I was able to close a bunch of older time-based indices to reduce the total shards+replicas from 1200 to about 450). I will post again in a few hours on whether this helped eliminate/reduce the timeouts. If this works, perhaps we can also request an enhancement to make this 15-second timeout configurable for use in clusters with a larger number of shards/indices. (I believe the hard-coded limit is in the following code, but I could be mistaken: https://github.com/elasticsearch/elasticsearch/blob/3712d979519db5453bea49c34642a391c51d88b3/src/main/java/org/elasticsearch/cluster/InternalClusterInfoService.java)
@Revan007 - it seems something is causing …
@rangagopalan these stats should be very fast (pending some old solved bugs - which version are you on?). Any reason why you point at the InternalClusterInfoService? It is not used for client calls but rather for the disk threshold allocation decider. Do you have issues there?
Am using 1.4.2 - more details of my env are in the post here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/elasticsearch/fQkhRJ6Md9c No, I wasn't using the internal class in any way - I was just theorizing that, since the timeout was seen in the nodes/stats call, perhaps this 15-second timeout set in the internal cluster info service class was applicable. The test I am trying is to reduce the time for any cluster statistics calls by reducing the number of indices/replicas/shards (what I am doing may not be applicable if the stats APIs always return quickly and aren't dependent on the number of indexes). I am trying out what was posted in the other link I referred to - to see if the timeouts/disconnects stop when I reduce the total number of indexes/shards.
@bleskes. That was when I did a manual restart of that node. If I let it recover on its own, it takes like 10 minutes. Here is the data node log when I let it recover on its own:
@Revan007 what takes 10 minutes exactly? I want to make sure I understand the sequence of events. For what it's worth, it may be that these are genuine network issues. The last error in your logs shows a socket read timeout:
Hmm... it has been almost 20 hours now and no node disconnect so far. I am not sure if my change fixed it. I will wait a little longer till I can say for sure.
An update - it's been about 18 hours since I closed a bunch of older indexes to reduce the total number of shards from about 1100 to below 500. No timeouts since then.
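Should anyone want to track shard counts while experimenting, the _cat API gives a quick view; a small sketch (localhost is a placeholder, and unassigned shards have no node column, so the per-node tally is approximate):

```sh
# Total shard copies in the cluster:
curl -s 'localhost:9200/_cat/shards' | wc -l
# Rough per-node breakdown (last column is the node name when assigned):
curl -s 'localhost:9200/_cat/shards' | awk '{print $NF}' | sort | uniq -c
```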
Ok, it did happen again, but after 22 hours this time. It took ~17 minutes to disconnect the node, go to yellow state, and recover. That means that for 17 minutes, when I tried to run a search against the master, it wouldn't give me any response. Here is the complete log from the master and the data node:
Damn, I have received 2 more node timeouts since, so I guess I was just lucky before that it took more time. @rangagopalan still going ok for you after reducing the number of open shards? I don't find this a solution, because I am using 1 index per day with 9 shards and I have months of data; I will be over 3000 shards whatever I do...
I used to run into similar issues until I reduced the number of active … - Tin
@Revan007 - Yeah - things look fine here since the reduction of indexes/shards - no timeouts (going for more than 26 hours now).
@Revan007 @rangagopalan there are two things at play here - timeouts on indices/node stats, and network-level disconnects. They are two different things. On a slow file system, indices stats (and to a lesser degree node stats) can take longer to complete when you have many shards on a node. In extreme cases this can cause a timeout. This is however very different from network disconnects, which should not be affected by the number of shards. That said, it is possible that the shards put too much load on the node, causing it not to respond in time to pings from the master, which in turn causes the node to be thrown out of the cluster and the network channels to be closed. This is however not the case here, as you can see the network is closed because of a socket-level timeout (still possible under load, but less likely):
@Revan007: what do you mean exactly when you say "I tried to query search the master it wouldn't give me any response."? Did search requests to the master not return? A couple of side notes:
@bleskes I will reduce the number of shards per index to 5 soon, but I cannot reduce the number of indexes; they have to be per day because of the amount of data. So in 1 year I will still have 365 x 3 = 1095 indices (3 types of indices). The indices stats timeout of 15000 ms has to be made configurable somehow in a future ES update/fix.
@Revan007 I still think there is some confusion here, as the indices stats API doesn't have a timeout. That 15s timeout reference (which I agree should be configurable) is for internal disk-free-space monitoring calls issued from the master. I'm sorry - but it is not clear to me what exactly you mean when you say "takes 17 minutes to d/c the failed node the query still works fine." Node stats / indices stats should never be in the way of search calls - they use another thread pool. I think something else is going on, and it seems to all point at networking issues or general node load - do you monitor these?
That's OK - just make sure you need 5 shards - you might be surprised how much you can get out of a single one.
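If it helps, the shard count for new daily indices can be set once via an index template; a sketch with hypothetical names (template name, index pattern, and localhost are placeholders):

```sh
# Every new index matching "logs-*" starts with 1 primary and 1 replica.
curl -XPUT 'localhost:9200/_template/daily_single_shard' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}'
```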
Well, before, when I was querying the master node (doing a search, curl -XGET bla bla bla) when a node started to fail:
I would get no result; nothing would return. It takes like 17 minutes for the master to start spitting tons of errors and disconnect that node; then it starts to recover the shards, and when the node is reconnected I could query it and receive results again. Now I am querying the balancer node, and even if a node starts to fail, at least I am able to query during that time.
So because I have 5 nodes I should preferably have 5 shards, 1 per node. Should I understand that if I now start another 4 nodes and have 9 nodes, 1 shard per node, I should not see those disconnects anymore?
It really sounds to me like there are some networking issues causing both symptoms. Can you try monitoring that outside of ES? If you don't have any monitoring infra in place, maybe start a ping process or something between the nodes - see the sketch below.
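A minimal sketch of that kind of out-of-band watch, assuming nothing about your setup beyond shell access (the IPs and log path are placeholders):

```sh
# Log any inter-node ping loss with a timestamp, once every 5 seconds.
while true; do
  for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
    if ! ping -c 1 -W 2 "$host" > /dev/null 2>&1; then
      echo "$(date -Is) lost ping to $host"
    fi
  done
  sleep 5
done >> /var/log/es-ping-watch.log
```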
Remember you have replicas and also more indices (one per day) that should be spread around. I would try 1 shard and see if you have enough indexing/search capacity based on your data volumes. That will also give you an idea how much a shard can take, and based on that you can decide how many shards you need.
I still don't see any correlation between the number of shards and disconnects.
Just an update confirming that there is definitely some kind of problem related to the number of shards per node. We had a separate testing system that was working fine - a single node that had about 647 shards/replicas (set up with 1 replica, so there were 647 unassigned shards).
I believe (as suggested here too: https://groups.google.com/d/msg/elasticsearch/EvDNO_vALNg/BPK5yYSUFeQJ) that there is some kind of problem within Elasticsearch node communications/monitoring when, in a multi-node cluster, the number of shards/replicas per node goes above a certain number. From our experience we can say that 647 shards/replicas per node surely causes it (on our testing server), and on the production cluster I believe we saw the issue at about 400 or 450 shards/replicas per node. I think at the very least the Elasticsearch documentation should be updated to provide recommendations/guidelines on cluster sizing to avoid this kind of issue.
Ok, here is the bug. If you can't upgrade your kernel right now, you can do: sudo ethtool -K eth0 sg off Good luck!
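For anyone checking whether they are affected before applying this workaround, a hedged sketch (eth0 is an assumption; substitute your interface, and note the affected kernel range is as reported in this thread, not verified here):

```sh
uname -r                               # compare against the kernel versions discussed above
dmesg | grep -i 'rides the rocket'     # kernel log line associated with this bug
ethtool -k eth0 | grep scatter-gather  # current offload state
sudo ethtool -K eth0 sg off            # the workaround from the comment above
```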
Thanks a lot @Revan007 |
Agree. Given the vagueness of the issue I think there could be a few bugs at play and we may be experiencing different ones that manifest themselves in a similar way. Or I could just have a badly configured cluster... 😜 |
@wilb the issue's description might be vague and the various comments may have different causes, but I know for a fact that our issue is worked around by using a specific kernel.
@faxm0dem don't want you to think that I was having a dig there - just meant that this feels like it could be one of those issues that could have multiple root causes that all manifest themselves in similar ways. |
I am also having this issue and have submitted a support ticket, but have yet to resolve it. We're running ES 1.4.4 on Ubuntu 14.04 on GCE. We did not have the "rides the rocket" log lines in syslog and have tested on 3.16 and 3.19 kernels. No resolution yet. |
We upgraded to Ubuntu 14.04.02, and saw the same issue again after just about a week of runtime. :( |
I got to the bottom of what my particular issue was. Not sure it's likely to help others, but worth detailing... I use Chef to configure our cluster, one of the things that gets configured is the S3 Repo for backups. Unfortunately I wasn't doing this in an idempotent manner, which resulted in the call coming in every 30 minutes. This wasn't causing any problems initially, but the call will verify the repository by default (this can be disabled by a query string, but I wasn't doing that). Over time this appeared to become more of an expensive operation (I assume because it's an active backup repo full of data) - this looks to have been causing hosts to briefly lock and result in them temporarily dropping out of the cluster whilst the verify was taking place. Disabling the verify made things behave in a much more sane manner so I rewrote the logic to actually be idempotent. On top of this, there appears to be a point of degradation at which there is no return unless you restart your master. As I mentioned above, calls to the local node status endpoint of the master start to take 20-30 seconds to return and at this point the cluster generally struggles along with many more disconnects until you take action. |
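For anyone hitting the same thing: the query string mentioned above is the verify flag on repository registration. A hedged sketch of the idempotent re-registration (repo name, bucket, and host are hypothetical):

```sh
# Re-register the S3 repository without the expensive verification step;
# the trailing ?verify=false is the query string referred to above.
curl -XPUT 'localhost:9200/_snapshot/backups?verify=false' -d '{
  "type": "s3",
  "settings": { "bucket": "my-es-backups" }
}'
```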
At least for us, upgrading to the 3.18 kernel fixed the issue. Our cluster has now been running for 3 weeks without any node disconnect, whereas earlier we had a disconnect at least once a week. It does seem that there are multiple issues which can cause node disconnects. Maybe the improvement they made in 1.6.0 to make cluster state changes async might help.
Thank you - that looks exactly like the issue. I'd held off from doing an upgrade to 1.5.x, but it looks like I should definitely do so.
It sounds like we've arrived at the root cause of this problem, so I'm going to close the issue. If anybody disagrees, ping me to reopen. |
Not really: @wilb had an unrelated issue |
Ping @clintongormley |
Nope, no kernel messages pointing in that direction. It's definitely something that happened between the Linux versions I mentioned somewhere above.
OK, reopening |
Just FYI, we're seeing this pretty continuously with 1.6.0 on Ubuntu 14.04. Going to try some of the suggestions above. |
@bradmac downgrading the kernel should work if …
What I'm seeing is that the server node is not stuck; most requests to it are successfully processed. It's just that apparently some of the threads in the client-side thread pool are unable to connect to it. This is under lightly loaded conditions.
@bradmac different issue I'd say |
Ugh, we just had this issue happen twice in 3 hours. The first time it happened to one node in our cluster, the second time to a different node. In both cases, 1 of the 4 nodes went into a garbage-collection loop. When this happened, the other 3 nodes were disconnected from the GC-looping node and the cluster became almost entirely unresponsive. Node 75's GC loop -- this stopped on its own:
Node 75's GC loop an hour later -- went on for an hour or more:
Node 77's GC loop -- stopped when we restarted ES:
@diranged that's not surprising. you have enormous heaps (90GB!) and they're full. You need some tuning advice. I suggest asking about that on the forum: http://discuss.elastic.co/ |
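For context on why 90 GB heaps hurt: beyond roughly 30-32 GB the JVM loses compressed object pointers, and very large heaps mean very long GC pauses. A sketch of the usual ES 1.x sizing knob (the value is illustrative, not a recommendation for this cluster):

```sh
# Set before starting Elasticsearch; bin/elasticsearch reads this in ES 1.x.
export ES_HEAP_SIZE=30g
```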
If anybody is facing an issue with the following error:
exception caught on transport layer [[id: 0x32cd6d09, /172.31.6.91:38524 => /172.31.18.78:9300]], closing connection java.io.StreamCorruptedException: invalid internal transport message format, got (50,4f,53,54)
The 9300 port is used for inter-node communication and uses an internal binary protocol, so you can't use it from a browser. In order to use the 9300 port you have to use the Java API, either the Node client or the Transport client, which both understand the internal binary protocol. From a browser you should only use the 9200 port, which is the one that exposes the REST API. If you are using an Amazon load balancer to access your cluster, then you should change your instance protocol in the listener settings to 9200.
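Worth noting: (50,4f,53,54) is hex ASCII for "POST", i.e. an HTTP request arrived on the binary transport port. A quick way to see the distinction from the shell (the hostname is a placeholder):

```sh
curl -s 'http://es-node:9200/'  # 9200 speaks HTTP/REST and answers normally
curl -s 'http://es-node:9300/'  # 9300 speaks the binary transport protocol;
                                # HTTP sent here makes the node log the
                                # StreamCorruptedException shown above
```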
@pavanIntel and how is this relevant? |
Nothing further on this ticket. Closing |
I have the same problem, and I have run out of ideas. The "rides the rocket" socket bug is out of the question, since we use a recent kernel (3.16.7-ckt20-1+deb8u1). ES version 1.7.4. Debian Jessie. Java build 1.8.0_66-internal-b17. Here is the debug log:
It is true that we have a rather big cluster, but only the master disconnects, not the nodes. They communicate through a VPN tunnel; maybe somebody has another idea how to improve this. Is it normal that the nodes are queried with stats so often, i.e. every few seconds? Thank you!
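If those frequent stats calls turn out to be the master's disk-usage polling (the InternalClusterInfoService discussed earlier in this thread), the polling interval is a setting; a sketch, assuming your version accepts it as a transient cluster setting (localhost is a placeholder):

```sh
# Lengthen the disk-usage polling interval from its default (30s).
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.info.update.interval": "60s" }
}'
```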
Hey,
I have been having trouble for a while now. I am getting random node disconnects and I cannot explain why.
There is no increase in traffic (search or index) when this happens; it feels completely random to me.
I first thought it could be the aws-cloud plugin, so I removed it, used unicast, and pointed directly to my nodes' IPs, but that didn't seem to be the problem.
I changed the type of instances (now m3.2xlarge), added more instances, made many modifications in the ES yml config, and still nothing.
Changed Oracle Java from 1.7 to 1.8, changed the CMS collector to G1GC, and still nothing.
I am out of ideas... how can I get more info on what is going on?
Here are the logs I can see from the master node and the data node:
http://pastebin.com/GhKfRkaa