Elastic Search Cluster becomes unresponsive #6409

Closed
nimahajan opened this issue Jun 4, 2014 · 24 comments

@nimahajan

Hi,

I have been facing this issue for a couple of days. Here are the details:

I have a cluster of 7 nodes: 3 master-eligible nodes that hold no data and 4 data-only nodes. I have set the minimum master nodes property to 2, and I use KOPF/ElasticHQ/Head for monitoring. The cluster was set up for testing, so there is hardly any querying going on and it remains idle most of the time. After a few hours of inactivity, the plugins cannot show any details except the green status (no shard details etc. are shown), and any search query I make does not work.

e.g. http://ironman:9200/_search?q=chrome

I have to restart the master's Elasticsearch service to make the cluster responsive again. I checked with tail -f /var/log/elasticsearch/avengers.log and I don't see any failures/exceptions. I was on Elasticsearch 1.1 and updated to 1.2 as well, but I am still having this issue.

Please let me know if you need more details.

Thanks,
Nilesh.

@clintongormley

Hi @nimahajan

Are you still experiencing this problem? What version? There must be something going on in the logs. Without more info it's difficult to help.

@psych0d0g

I have the same issue of a cluster becoming unresponsive after a random amount of time.
I'm running 4 nodes (all of them master-eligible and data nodes) in a datacenter-aware setup with 5 shards per index and 1 replica in the other datacenter.
My log files are quiet as well when this happens; ES simply stops logging at the same time the cluster becomes unresponsive.
The last log line I get is from yesterday:

[2014-08-19 16:48:48,390][INFO ][cluster.service ] [csliveeubs-elasticsearch01.server.lan] removed {[csliveeubap-logparser01.server.lan][81xwRvieT4ytrMO2PdJXZg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, added {[csliveeubap-logparser01.server.lan][I1t_-leETwqdSH_Xt-0Ymg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, reason: zen-disco-receive(from master [[csliveeubap-elasticsearch02.server.lan][TgTzuqrkRYay6fOyDIP66w][csliveeubap-elasticsearch02.server.lan][inet[/...:9300]]{datacenter_location=2, master=true}])

I'm running ES 1.3.1 on Debian Wheezy x64, without any plugins, on the following hardware:

2x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz
24GB DDR3 ECC RAM
2.7TB RAID 5 (3x 600GB 10k SAS HDDs)

@clintongormley

@psych0d0g that is your client shutting down and being replaced with another client. Also, we strongly advise against using a single cluster across two data centres. It will produce instability and could well be responsible for what you are seeing.

What is in the logs of your other nodes?

@psych0d0g

It's exactly the same for all data-node logs.
Our datacenters are linked with a dedicated fibre connection; the ping time between them is about 0.9ms.
Additionally, here is my ES configuration:

cluster.name: cslive-elasticsearch
node.name: "csliveeubs-elasticsearch01.server.lan"
node.datacenter_location: 1
node.master: true
node.data: true
path.data: /mnt/storage
cluster.routing.allocation.awareness.attributes: datacenter_location
cluster.routing.allocation.cluster_concurrent_rebalance: 12
cluster.routing.allocation.node_concurrent_recoveries: 15
indices.recovery.concurrent_streams: 12
cluster.routing.allocation.node_initial_primaries_recoveries: 5
# Node in the other datacenter, different VLAN:
discovery.zen.ping.unicast.hosts: ...
discovery.zen.minimum_master_nodes: 2
index.number_of_replicas: 1
index.recovery.initial_shards: 2
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
replication: async
index.compound_on_flush: false
index.compound_format: false
indices.recovery.concurrent_streams: 12
indices.recovery.max_bytes_per_sec: 512mb
indices.store.throttle.max_bytes_per_sec: 512mb
action.disable_delete_all_indices: true

# Search thread pool

threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100

# Index thread pool

threadpool.index.type: fixed
threadpool.index.size: 600
threadpool.index.queue_size: 2000

@clintongormley

Some notes on your settings:

index.translog.flush_threshold_ops: 50000

In v1.3.x we set flush_threshold_ops to unbounded. You should do the same.

replication: async

This stops Elasticsearch from exerting backpressure if it is struggling. Use sync replication.

index.compound_on_flush: false
index.compound_format: false

Why are you disabling the compound format? This should not be necessary, especially if you have an unbounded flush_threshold_ops.

threadpool.index.size: 600

You're going to overwhelm Elasticsearch with so many indexing threads. Use the default settings instead.
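
For reference, here is a minimal sketch of what elasticsearch.yml could look like with those overrides removed. The remaining lines are taken from your paste above, so treat this as an assumption rather than a drop-in config:

    cluster.name: cslive-elasticsearch
    node.name: "csliveeubs-elasticsearch01.server.lan"
    node.datacenter_location: 1
    node.master: true
    node.data: true
    path.data: /mnt/storage
    cluster.routing.allocation.awareness.attributes: datacenter_location
    discovery.zen.ping.unicast.hosts: ...
    discovery.zen.minimum_master_nodes: 2
    index.number_of_replicas: 1
    action.disable_delete_all_indices: true
    # (allocation/recovery/throttle settings from your paste left unchanged)
    # removed: index.translog.flush_threshold_ops (unbounded by default in 1.3.x)
    # removed: replication: async (synchronous replication keeps backpressure)
    # removed: index.compound_on_flush / index.compound_format (keep the compound format)
    # removed: threadpool.index.* overrides (fall back to the defaults)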

And back to your original issue - you say the cluster becomes unresponsive, but what do you mean by that? Do no nodes respond at all, even to a simple GET / request?
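
For example, does a bare root request on one of your nodes still return the banner JSON promptly? Something like (hostname taken from your logs, adjust as needed):

    curl -XGET 'http://csliveeubs-elasticsearch01.server.lan:9200/'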

@clintongormley

There is nothing else at all in any of your other logs?

@psych0d0g

The logs aren't telling me anything for the time when the cluster becomes "unresponsive".
When I'm using curl, for example:

   curl -XPUT localhost:9200/_cluster/settings -d '{
            "persistent" : {
            "cluster.routing.allocation.disable_allocation" : true
            }
    }'

The cluster responds with a 503 timeout:

{"error":"RemoteTransportException[[csliveeubs-elasticsearch01.server.lan][inet[/...:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ","status":503}

@clintongormley

How big is your cluster state? What's the output of:

curl 'localhost:9200/_cat/pending_tasks?v'
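
and, as a rough way to gauge the size of the cluster state itself (simply counting the bytes of the serialised state, not an official metric):

curl -s 'localhost:9200/_cluster/state' | wc -c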

@psych0d0g

At this moment it has been freshly restarted to bring it back to a working state ASAP, because our queues are filling up pretty fast and we need to get them empty again. I will have a look when this happens again.

@psych0d0g

Okay, here are the pending tasks; zen-disco-receive piled up like hell:

https://gist.github.com/psych0d0g/8e4e0d140b9d95946df4

@clintongormley

@psych0d0g have you tried using the elasticsearch-http connector in logstash?

@psych0d0g

Not yet, since my thought was that it's not as performant as the node protocol.
Additionally, with the node protocol I don't have to care about distributing the load across the cluster myself.

@clintongormley

So when a client node joins the cluster, it has to go through "discovery" which can take a little time. The idea is that the client node will be part of the cluster for an extended period. It looks like, for whatever reason, you are either creating lots and lots of connections or they are very short lived.

You probably want to ask about your logstash config on the logstash forums, but I'd definitely try out the http connector, which is the recommended connector these days.
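
As a rough sketch, assuming the Logstash 1.4-era elasticsearch output (the host and worker count below are placeholders), the switch looks something like:

    output {
      elasticsearch {
        host     => "csliveeubs-elasticsearch01.server.lan"   # placeholder
        protocol => "http"
        workers  => 4                                         # placeholder
      }
    }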

@psych0d0g

Mhm, I just did, and it IS slow:
with the node protocol I managed to index about 20k log lines/s,
with the HTTP API I'm down to 12k/s.

@psych0d0g

And the problem might be a combination of both: I use 24 workers for the ES output, and I restart the Logstash instances constantly when introducing new grok filters.
That might be the root cause indeed.

@clintongormley

Perhaps the logstash guys can give you some tuning advice. Btw, with the work that is happening in the improve_zen branch on Elasticsearch, discovery will become much quicker than it is today.

@psych0d0g

Okay, now the issue is back again; even the /_cat/ requests seem to time out right now, so I can't check whether my pending tasks increased.
At least I get a bunch of stack traces in my log files now:
https://gist.github.com/psych0d0g/73a52ca35405236c9406

@psych0d0g

Here are some more, from yesterday at around the time the last log lines were indexed:

[2014-08-26 20:17:37,892][DEBUG][action.admin.cluster.node.stats] [csliveeubs-elasticsearch01.server.lan] failed to execute on node [MM0GwoWYTSqBWD3VeFiaiw]
org.elasticsearch.transport.RemoteTransportException: [csliveeubap-elasticsearch01.server.lan][inet[/10.76.30.11:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineException: [logstash-live-2014.08.26][4] failed to acquire searcher, source doc_stats
at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:713)
at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:653)
at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:647)
at org.elasticsearch.index.shard.service.InternalIndexShard.docStats(InternalIndexShard.java:498)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:127)
at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:209)
at org.elasticsearch.node.service.NodeService.stats(NodeService.java:156)
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:95)
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.lucene.store.AlreadyClosedException: this ReferenceManager is closed
at org.apache.lucene.search.ReferenceManager.acquire(ReferenceManager.java:98)
at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:700)
... 15 more
[2014-08-26 20:19:26,109][WARN ][cluster.action.shard ] [csliveeubs-elasticsearch01.server.lan] [logstash-live-2014.08.26][4] received shard failed for [logstash-live-2014.08.26][4], node[MM0GwoWYTSqBWD3VeFiaiw], [R], s[STARTED], indexUUID [V5aS0Y7ZRveaReA_huF-KA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]

ES has 12GB of heap space, how the heck can it eat all that precious RAM up and demand even more for its heap?

@clintongormley

@psych0d0g I'm guessing that you are sorting or aggregating on some fields which are taking up a lot of memory? Have a look at the output of:

GET '/_nodes/stats/indices/fielddata?human&fields=*'

@psych0d0g

{"cluster_name":"cslive-elasticsearch","nodes":{"vCAZom1MQF2XZGNqqtqtew":{"timestamp":1409733531784,"name":"csliveeubs-elasticsearch02.server.lan","transport_address":"inet[/10.79.192.12:9300]","host":"csliveeubs-elasticsearch02.server.lan","ip":["inet[/10.79.192.12:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"103.7mb","memory_size_in_bytes":108769547,"evictions":0,"fields":{"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485},"@timestamp":{"memory_size":"103.5mb","memory_size_in_bytes":108570376},"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686}}}}},"eMjtqKXER4ilutu9UPwwmw":{"timestamp":1409733531779,"name":"csliveeubap-elasticsearch02.server.lan","transport_address":"inet[/10.76.30.12:9300]","host":"csliveeubap-elasticsearch02.server.lan","ip":["inet[/10.76.30.12:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"194.5kb","memory_size_in_bytes":199171,"evictions":0,"fields":{"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686},"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485}}}}},"g2nl3aZJSeqhRZMaUdpbsA":{"timestamp":1409733531783,"name":"csqaeubs-logparser02.server.lan","transport_address":"inet[/10.88.7.52:9300]","host":"csqaeubs-logparser02.server.lan","ip":["inet[/10.88.7.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"2NuW6V_LQLieGaRLlFXipQ":{"timestamp":1409733532930,"name":"csliveeubap-elasticsearch01.server.lan","transport_address":"inet[/10.76.30.11:9300]","host":"csliveeubap-elasticsearch01.server.lan","ip":["inet[/10.76.30.11:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"96.7mb","memory_size_in_bytes":101478287,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"96.5mb","memory_size_in_bytes":101220616},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}},"fQIrgXboSWyZ6IgZ19_BPw":{"timestamp":1409733531783,"name":"csqaeubs-logparser01.server.lan","transport_address":"inet[/10.88.7.51:9300]","host":"csqaeubs-logparser01.server.lan","ip":["inet[/10.88.7.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"U8TV4j2ETYCuf4Y8HW4zYQ":{"timestamp":1409733531780,"name":"csqaeubap-logparser01.server.lan","transport_address":"inet[/10.88.32.51:9300]","host":"csqaeubap-logparser01.server.lan","ip":["inet[/10.88.32.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"XiaE-n4_Rp6ZvZ8mj74D1w":{"timestamp":1409733531780,"name":"csqaeubap-logparser02.server.lan","transport_address":"inet[/10.88.32.52:9300]","host":"csqaeubap-logparser02.server.lan","ip":["inet[/10.88.32.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"krlSSP8JSneA6Lqu4xdR-A":{"timestamp":1409733531783,"name":"csliveeubs-elasticsearch01.server.lan","transport_address":"inet[/10.79.192.11:9300]","host":"csliveeubs-elasticsearch01.server.lan","ip":["inet[/10.79.192.11:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"235.8mb","memory_size_in_bytes":247358263,"evictions":
0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"235.6mb","memory_size_in_bytes":247100592},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}}}}

@clintongormley

OK, so it doesn't look like your fielddata, as that is not using much memory (or did you restart nodes in the interim?).

How big are your bulk batches? Are you still using so many indexing threads? Please add the full output of these two commands (in a separate gist):

GET /_nodes
GET /_nodes/stats

@psych0d0g

It took the cluster several minutes to respond with the requested data.

GET /_nodes:
https://gist.github.com/psych0d0g/9a19380a5d5241002f55

GET /_nodes/stats:
https://gist.github.com/psych0d0g/524c8279e4bc5de161c4

another GET '/_nodes/stats/indices/fielddata?human&fields=*':
https://gist.github.com/psych0d0g/1119212c262ef2026ba7

My batches are 1000 documents.
My indexing thread overrides have been removed from the config, so I'm back to the defaults.
I'm restarting the cluster daily at the moment to get it back to performance.

@clintongormley

Hi @psych0d0g

The only things that show up as abnormal are (1) slow garbage collections and (2) the amount of RAM being used for your segments. See indices.segments.memory_in_bytes in the stats output.

I suggest disabling bloom filters on old indices, and disabling norms (or making fields not_analyzed) on any string field which is not used for full text search.
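
For example (a sketch only, using 1.3.x syntax; the index and field names below are placeholders):

    # stop loading bloom filters for an older index:
    curl -XPUT 'localhost:9200/logstash-live-2014.08.20/_settings' -d '{
      "index.codec.bloom.load": false
    }'

    # in the mapping, disable norms on analyzed string fields that are not
    # used for full text search, or map them as not_analyzed:
    "some_field": { "type": "string", "norms": { "enabled": false } }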


@clintongormley self-assigned this Sep 6, 2014
@clintongormley

I'm assuming that this advice worked, as we haven't heard any more, so I'll close this ticket.
