Cluster hangs and node disconnects due to excessive traffic on transport layer network card stopping pings #19646
Comments
The problem here is the long GCs you're getting, essentially because you're overwhelming the heap with your aggregations (probably because you're generating way too many buckets). You need to simplify your aggs structure. Imagine if you have 1,000 buckets and each has another 1,000 buckets, and each of those has another 1,000 buckets. That's a billion buckets! It's just not going to work. In fact, we've recently added a circuit breaker which will prevent such aggs from running #19394
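For illustration, this is the shape of nested aggregation being described: three levels of terms aggregations, each allowing up to 1,000 buckets, multiply to a worst case of 1,000 × 1,000 × 1,000 = one billion buckets. The field names and sizes below are hypothetical, not taken from the reporter's actual query:

```json
{
  "aggs": {
    "level_1": {
      "terms": { "field": "country", "size": 1000 },
      "aggs": {
        "level_2": {
          "terms": { "field": "city", "size": 1000 },
          "aggs": {
            "level_3": {
              "terms": { "field": "product", "size": 1000 }
            }
          }
        }
      }
    }
  }
}
```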
That would make sense if our query was generating thousands of buckets, but the first aggregation returns 2 buckets. There is GC going on on the cluster; however, doubling the capacity of the interface used for the transport layer allows the query to run and return results. If the issue were purely a GC one, then increasing the bandwidth available to the transport layer would not change the original behaviour.
8000 buckets x 1175 indices x 5 shards that have to be handled by the coordinating node. Most of this traffic is across the transport layer, not http, so I very much doubt that separating the interfaces would really help here. That said, I'll reopen this for further discussion. /cc @bleskes
This is much more than @clintongormley said, as the 8000 buckets on the lowest level need to be multiplied by the upper levels as well - assuming
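A back-of-envelope count using only the lowest-level figures quoted above; as noted, multiplying by the upper aggregation levels would make the true total larger still, so this is a lower bound:

```python
# Lower-bound bucket count the coordinating node must merge,
# using the figures quoted in the comments above.
buckets_per_shard = 8000   # lowest-level buckets per shard
indices = 1175             # indices hit by the query
shards_per_index = 5

shard_responses = indices * shards_per_index
total_buckets = buckets_per_shard * shard_responses

print(shard_responses)  # 5875 shard-level responses to merge
print(total_buckets)    # 47000000 buckets, before upper-level multiplication
```

All of those partial results travel to the coordinating node over the transport layer, which is consistent with the saturation described later in the issue.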
Agreed. All the traffic here is highly likely on the transport layer, so separation is not a big deal. Also you can try it out since you can bind the http host to another IP than the transport one. See here. All in all I think you have two issues:
You should either increase the capacity of your cluster to meet what you do with it or try to reduce the number of shards and see how it helps. Closing this again. I suggest you continue the discussion on the thread you already have going with @danielmitterdorfer on discuss.elastic.co : https://discuss.elastic.co/t/cluster-nodes-get-disconnected-and-out-of-sync-due-to-ping-timeouts-caused-by-transport-load/56505
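The http/transport split mentioned above can be expressed in elasticsearch.yml. A minimal sketch, assuming the 2.x-era `http.host` and `transport.host` settings; the IP addresses are hypothetical and would map to the em1/em2 interfaces described later in this issue:

```yaml
# elasticsearch.yml -- bind HTTP and transport to different interfaces
# (addresses are illustrative, not the reporter's actual configuration)
http.host: 192.168.1.10        # em1: REST/HTTP client traffic
transport.host: 192.168.2.10   # em2: inter-node transport traffic
```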
Elasticsearch version:
Tested on v2.3.1 & v2.3.4 of Elasticsearch
JVM version:
"version" : "1.8.0_92",
"vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
"vm_version" : "25.92-b14",
OS version:
CentOS Linux release 7.2.1511 (Core)
Description of the problem including expected versus actual behavior:
Background information
Our clusters consist of approximately 1500 or so indexes (5 shards per index), we are running a group of aggregated queries across 1175 of the available indexes. On the test system in question (although the issue affects both our development and much more powerful live environment) there is between 1.5 and 2.5 TB of data (including 1 replica per shard).
Expected
The frontend of our system issues complex queries and often runs multiple queries at once. The queries are complex with a number of aggregations. The queries normally run on the ES backend and the frontend code then renders the results to the users.
Cluster state remains Green.
Actual
The queries are started from the frontend with the cluster running and occasionally logging GC entries.
The cluster starts processing as expected. After a short amount of time one or more cluster nodes get disconnected from the cluster. This is seen in the logs on both the data node and the master instance it was attempting to communicate with.
Master log entry
Failed node log entry
Each physical host has 3 nodes running on it. 1 Master node and 2 data nodes, host awareness is set for the nodes also.
The cluster state changes to red (even though there are replicas available). Normally from this point on the cluster will respond to queries such as /_cluster/health and /_nodes, but the failed nodes will not rejoin the cluster.
If I try to use the OS command to stop nodes that have timed out, the command is ignored and just hangs. I have to kill -9 the process to stop the instance; this needs to be done on both the master and the node that failed. Usually even killing those 2 instances does not help: the cluster continues to throw errors about being unable to ping.
Even after stopping and restarting the individual instances, they still fail to connect back to the rest of the cluster, reporting timeouts to other instances' IPs, although I can ping all the IPs of the cluster.
I eventually realised that the problem is the bandwidth available on the network card used for the transport layer. Once this becomes saturated with traffic from the cluster, the pings between nodes become queued at the network interface. By the time they are processed, the other instances have already timed out the expected ping.
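One mitigation for the queued pings (treating the symptom rather than the bandwidth itself) would be to loosen zen fault detection so that nodes tolerate slower ping round-trips. A sketch, assuming the 2.x discovery.zen.fd settings; the values are illustrative, not recommendations:

```yaml
# elasticsearch.yml -- give fault-detection pings more headroom so that
# short bursts of transport saturation do not eject nodes
# (illustrative values; 2.x defaults are 30s timeout, 3 retries, 1s interval)
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 6
discovery.zen.fd.ping_interval: 5s
```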
To confirm this was the issue, I reset our cluster and modified it: rather than having 2 network cards (em1 for the http traffic and em2 for the transport layer traffic), I set up the machine so that it has
em1 - http traffic (1 Gb/s)
em2 and em3 as a bonded interface (bond0, mode 0 round robin), giving a single 2 Gb/s interface.
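A minimal sketch of such a bond on CentOS 7, using the network-scripts ifcfg format; the address and options are illustrative, not the reporter's exact configuration (each slave, em2 and em3, would additionally need its own ifcfg file with `MASTER=bond0` and `SLAVE=yes`):

```ini
# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative values)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=0 miimon=100"   ; mode 0 = round-robin, as described above
BOOTPROTO=none
IPADDR=192.168.2.10
PREFIX=24
ONBOOT=yes
```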
Running the same query as above now completed without issue and there were no errors on the cluster. This would seem like the solution; however, the number of layers of aggregation in our system can change dynamically, and even with the 2 x 1 Gb/s interfaces acting as one, adding another aggregation caused the same original problem.
Steps to reproduce:
Provide logs (if relevant):
Thoughts on a solution
In my previous experience, clusters would normally have some disk quorum device which arbitrates similar issues. I think with Elasticsearch this isn't required, but being able to define a dedicated LAN (possibly multiple LANs to allow redundancy), in the same way that you can split http and transport traffic, would remove this problem entirely. As an example, a solution such as:
em1 - HTTP Traffic
em2 - Interconnect between nodes traffic (possibly including cluster state traffic)
em3 - transport layer traffic (i.e. results of searches, indexing etc......)
In this sense, bonding or teaming of cards at the OS level would allow a user to provide resiliency, whilst the segregation of traffic would protect the cluster from saturation of the transport layer with network IO.
I also have a post on the discussion forum for the same issue: https://discuss.elastic.co/t/cluster-nodes-get-disconnected-and-out-of-sync-due-to-ping-timeouts-caused-by-transport-load/56505/
Kind Regards
Lee