Elastic Search Cluster becomes unresponsive #6409
Hi @nimahajan Are you still experiencing this problem? What version? There must be something going on in the logs. Without more info it's difficult to help.
I have the same issue of a cluster becoming unresponsive after a random amount of time:

[2014-08-19 16:48:48,390][INFO ][cluster.service ] [csliveeubs-elasticsearch01.server.lan] removed {[csliveeubap-logparser01.server.lan][81xwRvieT4ytrMO2PdJXZg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, added {[csliveeubap-logparser01.server.lan][I1t_-leETwqdSH_Xt-0Ymg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, reason: zen-disco-receive(from master [[csliveeubap-elasticsearch02.server.lan][TgTzuqrkRYay6fOyDIP66w][csliveeubap-elasticsearch02.server.lan][inet[/...:9300]]{datacenter_location=2, master=true}])

I'm running ES 1.3.1 on Debian Wheezy x64 on the following hardware setup, without the use of any plugins: 2x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz
@psych0d0g that is your client shutting down and being replaced with another client. Also, we strongly advise against using a single cluster across two data centres. It will produce instability and could well be responsible for what you are seeing. What is in the logs of your other nodes?
Exactly the same for all data-node logs.

cluster.name: cslive-elasticsearch

# Search thread pool
threadpool.search.type: fixed

# Index thread pool
threadpool.index.type: fixed
Some notes on your settings:
In v1.3.x we set
This stops Elasticsearch from exerting backpressure if it is struggling. Use
Why are you disabling the compound format? This should not be necessary, especially if you have an unbounded
You're going to overwhelm Elasticsearch with so many indexing threads. Use the default settings instead.

And back to your original issue: you say the cluster becomes unresponsive, but what do you mean by that? Do no nodes respond at all, even to a simple
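For reference, a sketch of what bounded thread-pool settings looked like in elasticsearch.yml for ES 1.x; the sizes here are illustrative assumptions, not the poster's actual values:

```yaml
# Hypothetical elasticsearch.yml fragment (ES 1.x syntax).
# A bounded queue_size lets Elasticsearch reject requests (backpressure)
# when it is struggling, instead of queueing work without limit;
# setting it to -1 makes the queue unbounded.
threadpool.search.type: fixed
threadpool.search.size: 12
threadpool.search.queue_size: 1000

threadpool.index.type: fixed
threadpool.index.size: 8
threadpool.index.queue_size: 200
```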
There is nothing else at all in any of your other logs?
The logs aren't telling me anything for the time when the cluster becomes "unresponsive".
The cluster responds with a 503 timeout:

{"error":"RemoteTransportException[[csliveeubs-elasticsearch01.server.lan][inet[/...:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ","status":503}
How big is your cluster state? What's the output of:
At this moment it's freshly restarted to bring it back to a working state ASAP, because our queues are filling up pretty fast and we need to get them empty again. I will have a look when this happens again.
Okay, here are the pending tasks; zen-disco-receive piled up like hell:
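A pile-up like this can be quantified. Below is a minimal sketch that groups pending tasks by their source string, assuming the JSON shape returned by `GET /_cluster/pending_tasks`; the sample response is made up for illustration:

```python
from collections import Counter

def summarize_pending_tasks(response):
    """Count pending cluster tasks by their source prefix,
    e.g. to see how many are zen-disco-receive joins."""
    sources = (task["source"] for task in response.get("tasks", []))
    return Counter(s.split("(")[0] for s in sources)

# Made-up sample shaped like a /_cluster/pending_tasks response.
sample = {"tasks": [
    {"insert_order": 101, "priority": "URGENT",
     "source": "zen-disco-receive(from master [Master1])", "time_in_queue": "86ms"},
    {"insert_order": 102, "priority": "URGENT",
     "source": "zen-disco-receive(from master [Master1])", "time_in_queue": "842ms"},
    {"insert_order": 103, "priority": "HIGH",
     "source": "cluster_update_settings", "time_in_queue": "12ms"},
]}

print(summarize_pending_tasks(sample))
```

A long tail of zen-disco-receive entries, as described above, points at clients joining and leaving the cluster repeatedly.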
@psych0d0g have you tried using the elasticsearch-http connector in logstash?
Not yet, since my thought was that it's not as performant as the node protocol.
So when a client node joins the cluster, it has to go through "discovery", which can take a little time. The idea is that the client node will be part of the cluster for an extended period. It looks like, for whatever reason, you are either creating lots and lots of connections or they are very short lived. You probably want to ask about your logstash config on the logstash forums, but I'd definitely try out the http connector, which is the recommended connector these days.
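As a sketch, switching Logstash from the node protocol to HTTP looked roughly like this in a 1.x-era Logstash config; the host and index pattern are illustrative assumptions:

```
output {
  # elasticsearch_http ships events over HTTP on port 9200 instead of
  # joining the cluster as a transient client node on port 9300.
  elasticsearch_http {
    host  => "csliveeubs-elasticsearch01.server.lan"  # hypothetical target node
    index => "logstash-%{+YYYY.MM.dd}"
  }
}
```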
Mhm, I just did, and it IS slow. And the problem might be a combination of both: I use 24 workers for the ES output, and I restart the Logstash instances constantly when introducing new grok filters.
Perhaps the logstash guys can give you some tuning advice. Btw, with the work that is happening in the improve_zen branch on Elasticsearch, discovery will become much quicker than it is today.
Okay, now the issue is back again; even the /_cat/ request seems to time out right now.
Here are some more, from yesterday at around the time the last log lines were indexed:

[2014-08-26 20:17:37,892][DEBUG][action.admin.cluster.node.stats] [csliveeubs-elasticsearch01.server.lan] failed to execute on node [MM0GwoWYTSqBWD3VeFiaiw]

ES has 12GB of heap space; how the heck can it eat all that precious RAM up and demand even more for its heap?
@psych0d0g I'm guessing that you are sorting or aggregating on some fields which are taking up a lot of memory? Have a look at the output of:
{"cluster_name":"cslive-elasticsearch","nodes":{"vCAZom1MQF2XZGNqqtqtew":{"timestamp":1409733531784,"name":"csliveeubs-elasticsearch02.server.lan","transport_address":"inet[/10.79.192.12:9300]","host":"csliveeubs-elasticsearch02.server.lan","ip":["inet[/10.79.192.12:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"103.7mb","memory_size_in_bytes":108769547,"evictions":0,"fields":{"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485},"@timestamp":{"memory_size":"103.5mb","memory_size_in_bytes":108570376},"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686}}}}},"eMjtqKXER4ilutu9UPwwmw":{"timestamp":1409733531779,"name":"csliveeubap-elasticsearch02.server.lan","transport_address":"inet[/10.76.30.12:9300]","host":"csliveeubap-elasticsearch02.server.lan","ip":["inet[/10.76.30.12:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"194.5kb","memory_size_in_bytes":199171,"evictions":0,"fields":{"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686},"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485}}}}},"g2nl3aZJSeqhRZMaUdpbsA":{"timestamp":1409733531783,"name":"csqaeubs-logparser02.server.lan","transport_address":"inet[/10.88.7.52:9300]","host":"csqaeubs-logparser02.server.lan","ip":["inet[/10.88.7.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"2NuW6V_LQLieGaRLlFXipQ":{"timestamp":1409733532930,"name":"csliveeubap-elasticsearch01.server.lan","transport_address":"inet[/10.76.30.11:9300]","host":"csliveeubap-elasticsearch01.server.lan","ip":["inet[/10.76.30.11:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"96.7mb","memory_size_in_bytes":101478287,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"96.5mb","memory_size_in_bytes":101220616},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}},"fQIrgXboSWyZ6IgZ19_BPw":{"timestamp":1409733531783,"name":"csqaeubs-logparser01.server.lan","transport_address":"inet[/10.88.7.51:9300]","host":"csqaeubs-logparser01.server.lan","ip":["inet[/10.88.7.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"U8TV4j2ETYCuf4Y8HW4zYQ":{"timestamp":1409733531780,"name":"csqaeubap-logparser01.server.lan","transport_address":"inet[/10.88.32.51:9300]","host":"csqaeubap-logparser01.server.lan","ip":["inet[/10.88.32.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"XiaE-n4_Rp6ZvZ8mj74D1w":{"timestamp":1409733531780,"name":"csqaeubap-logparser02.server.lan","transport_address":"inet[/10.88.32.52:9300]","host":"csqaeubap-logparser02.server.lan","ip":["inet[/10.88.32.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"krlSSP8JSneA6Lqu4xdR-A":{"timestamp":1409733531783,"name":"csliveeubs-elasticsearch01.server.lan","transport_address":"inet[/10.79.192.11:9300]","host":"csliveeubs-elasticsearch01.server.lan","ip":["inet[/10.79.192.11:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"235.8mb","memory_size_in_bytes":247358263,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"235.6mb","memory_size_in_bytes":247100592},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}}}}
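A per-node response in this shape can be ranked programmatically instead of read by eye. A minimal sketch, using a trimmed-down sample in the same shape as a `GET /_nodes/stats/indices/fielddata` response:

```python
def fielddata_by_node(stats):
    """Return (node_name, total_fielddata_bytes) pairs, largest first."""
    totals = []
    for node in stats["nodes"].values():
        fielddata = node["indices"]["fielddata"]
        totals.append((node["name"], fielddata["memory_size_in_bytes"]))
    return sorted(totals, key=lambda pair: pair[1], reverse=True)

# Trimmed sample; node IDs shortened, values taken from the response above.
sample = {"nodes": {
    "krlSSP8J": {"name": "csliveeubs-elasticsearch01.server.lan",
                 "indices": {"fielddata": {"memory_size_in_bytes": 247358263}}},
    "g2nl3aZJ": {"name": "csqaeubs-logparser02.server.lan",
                 "indices": {"fielddata": {"memory_size_in_bytes": 0}}},
}}

for name, size in fielddata_by_node(sample):
    print(f"{name}: {size / 1024**2:.1f} MB")
```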
OK, so it doesn't look like your fielddata, as that is not using much memory (or did you restart nodes in the interim?). How big are your bulk batches? Are you still using so many indexing threads? Please add the full output of these two commands (in a separate gist):
It took the cluster several minutes to respond with the requested data.

GET /_nodes:
GET /_nodes/stats:
another GET '/_nodes/stats/indices/fielddata?human&fields=*':

My batches are: 1000
Hi @psych0d0g The only things that show up as abnormal are (1) slow garbage collections and (2) the amount of RAM being used for your segments. See:

I suggest disabling bloom filters on old indices, and disabling norms (or making fields not_analyzed) on any string field which is not used for full-text search. See:
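For context, the two mitigations mentioned mapped onto index settings and mappings in ES 1.x roughly as follows; the index and field names here are illustrative assumptions:

```
# Hypothetical examples; index and field names are made up.

# Disable loading of bloom filters on an old index that is no
# longer being written to:
PUT /logstash-2014.08.01/_settings
{ "index.codec.bloom.load": false }

# In the mapping, disable norms on a string field not used for
# full-text search (or make the field not_analyzed entirely):
"message_source": {
  "type": "string",
  "index": "not_analyzed",
  "norms": { "enabled": false }
}
```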
I'm assuming that this advice worked, as we haven't heard any more, so I'll close this ticket.
Hi,
I have been facing this issue for a couple of days. Here are the details:
I have a cluster of 7 nodes: 3 master-eligible nodes with no data, and 4 data-only nodes. I have set the minimum master nodes property to 2. I use KOPF/ElasticHQ/Head for monitoring. I have set up the cluster for testing, so there is hardly any querying going on; it remains idle most of the time. After a few hours of inactivity, the plugins cannot show any details except the green status (no other details like shards etc. are shown). If I make any search query, it does not work.
e.g. http://ironman:9200/_search?q=chrome
I have to restart the master's Elasticsearch service to make it responsive again. I checked with $ tail -f /var/log/elasticsearch/avengers.log and I don't see any failures/exceptions. I had Elasticsearch 1.1 and updated to 1.2 as well, but I am still having this issue.
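For a cluster of this shape (3 master-eligible, 4 data-only nodes), the quorum setting described above corresponds to roughly this elasticsearch.yml fragment; this is a sketch, not the poster's actual config:

```yaml
# On each of the 3 dedicated master-eligible nodes:
node.master: true
node.data: false
# Quorum = (3 master-eligible nodes / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2

# On each of the 4 data-only nodes, the roles are inverted:
# node.master: false
# node.data: true
```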
Please let me know if you need more details.
Thanks,
Nilesh.