Elastic Search Cluster becomes unresponsive #6409

Closed
nimahajan opened this issue Jun 4, 2014 · 24 comments

@nimahajan

Hi,

I have been facing this issue for a couple of days. Here are the details:

I have a cluster of 7 nodes: 3 master-eligible nodes that hold no data and 4 data-only nodes. I have set the minimum master nodes property to 2, and I use KOPF/ElasticHQ/Head for monitoring. The cluster was set up for testing, so there is hardly any querying going on and it remains idle most of the time. After a few hours of inactivity, the plugins cannot show any details except the green status (no shard details etc. are shown), and any search query I make does not work.

e.g. http://ironman:9200/_search?q=chrome

I have to restart the master's Elasticsearch service to make the cluster responsive again. I checked with tail -f /var/log/elasticsearch/avengers.log and I don't see any failures/exceptions. I was on Elasticsearch 1.1 and updated to 1.2 as well, but I am still having this issue.

Please let me know if you need more details.

Thanks,
Nilesh.

@clintongormley

Hi @nimahajan

Are you still experiencing this problem? What version? There must be something going on in the logs. Without more info it's difficult to help.

@psych0d0g

I have the same issue of a cluster becoming unresponsive after a random amount of time.
I'm running 4 nodes (all of them master-eligible and data nodes) in a datacenter-aware setup with 5 shards per index and 1 replica in the other datacenter.
My log files are quiet as well when this happens; ES simply stops logging at the same time the cluster becomes unresponsive.
The last log line I get is from yesterday:

[2014-08-19 16:48:48,390][INFO ][cluster.service ] [csliveeubs-elasticsearch01.server.lan] removed {[csliveeubap-logparser01.server.lan][81xwRvieT4ytrMO2PdJXZg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, added {[csliveeubap-logparser01.server.lan][I1t_-leETwqdSH_Xt-0Ymg][csliveeubap-logparser01.server.lan][inet[/...:9300]]{client=true, data=false},}, reason: zen-disco-receive(from master [[csliveeubap-elasticsearch02.server.lan][TgTzuqrkRYay6fOyDIP66w][csliveeubap-elasticsearch02.server.lan][inet[/...:9300]]{datacenter_location=2, master=true}])

I'm running ES 1.3.1 on Debian Wheezy x64, without any plugins, on the following hardware:

2x Intel(R) Xeon(R) CPU E5-2430 v2 @ 2.50GHz
24GB DDR3 ECC RAM
2.7TB RAID 5 (3x 600GB 10k SAS HDDs)

@clintongormley

@psych0d0g that is your client shutting down and being replaced with another client. Also, we strongly advise against using a single cluster across two data centres. It will produce instability and could well be responsible for what you are seeing.

What is in the logs of your other nodes?

@psych0d0g

It's exactly the same for all data-node logs.
Our datacenters are linked with a dedicated fibre connection; the ping time between them is about 0.9ms.
Additionally, here is my ES configuration:

cluster.name: cslive-elasticsearch
node.name: "csliveeubs-elasticsearch01.server.lan"
node.datacenter_location: 1
node.master: true
node.data: true
path.data: /mnt/storage
cluster.routing.allocation.awareness.attributes: datacenter_location
cluster.routing.allocation.cluster_concurrent_rebalance: 12
cluster.routing.allocation.node_concurrent_recoveries: 15
indices.recovery.concurrent_streams: 12
cluster.routing.allocation.node_initial_primaries_recoveries: 5
# Node in the other datacenter, different VLAN:
discovery.zen.ping.unicast.hosts: ...
discovery.zen.minimum_master_nodes: 2
index.number_of_replicas: 1
index.recovery.initial_shards: 2
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
replication: async
index.compound_on_flush: false
index.compound_format: false
indices.recovery.concurrent_streams: 12
indices.recovery.max_bytes_per_sec: 512mb
indices.store.throttle.max_bytes_per_sec: 512mb
action.disable_delete_all_indices: true

# Search thread pool

threadpool.search.type: fixed
threadpool.search.size: 20
threadpool.search.queue_size: 100

# Index thread pool

threadpool.index.type: fixed
threadpool.index.size: 600
threadpool.index.queue_size: 2000

@clintongormley

Some notes on your settings:

index.translog.flush_threshold_ops: 50000

In v1.3.x we set flush_threshold_ops to unbounded. You should do the same.

replication: async

This stops Elasticsearch from exerting backpressure if it is struggling. Use sync replication.

index.compound_on_flush: false
index.compound_format: false

Why are you disabling the compound format? This should not be necessary, especially if you have an unbounded flush_threshold_ops.

threadpool.index.size: 600

You're going to overwhelm Elasticsearch with so many indexing threads. Use the default settings instead.
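
For reference, here is a minimal sketch of what elasticsearch.yml could look like with those overrides removed. The remaining lines are taken from your paste above, so treat this as an assumption rather than a drop-in config:

    cluster.name: cslive-elasticsearch
    node.name: "csliveeubs-elasticsearch01.server.lan"
    node.datacenter_location: 1
    node.master: true
    node.data: true
    path.data: /mnt/storage
    cluster.routing.allocation.awareness.attributes: datacenter_location
    discovery.zen.ping.unicast.hosts: ...
    discovery.zen.minimum_master_nodes: 2
    index.number_of_replicas: 1
    action.disable_delete_all_indices: true
    # (allocation/recovery/throttle settings from your paste left unchanged)
    # removed: index.translog.flush_threshold_ops (unbounded by default in 1.3.x)
    # removed: replication: async (synchronous replication keeps backpressure)
    # removed: index.compound_on_flush / index.compound_format (keep the compound format)
    # removed: threadpool.index.* overrides (fall back to the defaults)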

And back to your original issue - you say the cluster becomes unresponsive, but what do you mean by that? Do no nodes respond at all, even to a simple GET / request?
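
For example, does a bare root request on one of your nodes still return the banner JSON promptly? Something like (hostname taken from your logs, adjust as needed):

    curl -XGET 'http://csliveeubs-elasticsearch01.server.lan:9200/'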

@clintongormley

There is nothing else at all in any of your other logs?

@psych0d0g

The logs aren't telling me anything for the time when the cluster becomes "unresponsive".
When I'm using curl, for example:

   curl -XPUT localhost:9200/_cluster/settings -d '{
            "persistent" : {
            "cluster.routing.allocation.disable_allocation" : true
            }
    }'

The cluster responds with a 503 timeout:

{"error":"RemoteTransportException[[csliveeubs-elasticsearch01.server.lan][inet[/...:9300]][cluster/settings/update]]; nested: ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 30s]; ","status":503}

@clintongormley

How big is your cluster state? What's the output of:

curl 'localhost:9200/_cat/pending_tasks?v'
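
and, as a rough way to gauge the size of the cluster state itself (simply counting the bytes of the serialised state, not an official metric):

curl -s 'localhost:9200/_cluster/state' | wc -c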

@psych0d0g

At this moment it has been freshly restarted to bring it back to a working state ASAP, because our queues are filling up pretty fast and we need to get them empty again. I will have a look when this happens again.

@psych0d0g

Okay, here are the pending tasks; zen-disco-receive piled up like hell:

https://gist.github.com/psych0d0g/8e4e0d140b9d95946df4

@clintongormley

@psych0d0g have you tried using the elasticsearch-http connector in logstash?

@psych0d0g

Not yet, since my thought was that it's not as performant as the node protocol.
Additionally, with the node protocol I don't have to care about distributing the load across the cluster myself.

@clintongormley

So when a client node joins the cluster, it has to go through "discovery" which can take a little time. The idea is that the client node will be part of the cluster for an extended period. It looks like, for whatever reason, you are either creating lots and lots of connections or they are very short lived.

You probably want to ask about your logstash config on the logstash forums, but I'd definitely try out the http connector, which is the recommended connector these days.
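
As a rough sketch, assuming the Logstash 1.4-era elasticsearch output (the host and worker count below are placeholders), the switch looks something like:

    output {
      elasticsearch {
        host     => "csliveeubs-elasticsearch01.server.lan"   # placeholder
        protocol => "http"
        workers  => 4                                         # placeholder
      }
    }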

@psych0d0g

Mhm, I just did, and it IS slow:
with the node protocol I managed to index about 20k log lines/s,
with the HTTP API I'm down to 12k/s.

@psych0d0g

And the problem might be a combination of both: I use 24 workers for the ES output, and I restart the Logstash instances constantly when introducing new grok filters.
That might be the root cause indeed.

@clintongormley

Perhaps the logstash guys can give you some tuning advice. Btw, with the work that is happening in the improve_zen branch on Elasticsearch, discovery will become much quicker than it is today.

@psych0d0g

Okay, now the issue is back again; even the /_cat/ requests seem to time out right now, so I can't check whether my pending tasks increased.
At least I get a bunch of stack traces in my log files now:
https://gist.github.com/psych0d0g/73a52ca35405236c9406

@psych0d0g

Here are some more, from yesterday at around the time the last log lines were indexed:

[2014-08-26 20:17:37,892][DEBUG][action.admin.cluster.node.stats] [csliveeubs-elasticsearch01.server.lan] failed to execute on node [MM0GwoWYTSqBWD3VeFiaiw]
org.elasticsearch.transport.RemoteTransportException: [csliveeubap-elasticsearch01.server.lan][inet[/10.76.30.11:9300]][cluster/nodes/stats/n]
Caused by: org.elasticsearch.index.engine.EngineException: [logstash-live-2014.08.26][4] failed to acquire searcher, source doc_stats
at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:713)
at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:653)
at org.elasticsearch.index.shard.service.InternalIndexShard.acquireSearcher(InternalIndexShard.java:647)
at org.elasticsearch.index.shard.service.InternalIndexShard.docStats(InternalIndexShard.java:498)
at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:127)
at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:209)
at org.elasticsearch.node.service.NodeService.stats(NodeService.java:156)
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:95)
at org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:43)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:277)
at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:268)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.apache.lucene.store.AlreadyClosedException: this ReferenceManager is closed
at org.apache.lucene.search.ReferenceManager.acquire(ReferenceManager.java:98)
at org.elasticsearch.index.engine.internal.InternalEngine.acquireSearcher(InternalEngine.java:700)
... 15 more
[2014-08-26 20:19:26,109][WARN ][cluster.action.shard ] [csliveeubs-elasticsearch01.server.lan] [logstash-live-2014.08.26][4] received shard failed for [logstash-live-2014.08.26][4], node[MM0GwoWYTSqBWD3VeFiaiw], [R], s[STARTED], indexUUID [V5aS0Y7ZRveaReA_huF-KA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]

ES has 12GB of heap space, how the heck can it eat all that precious RAM up and demand even more for its heap?

@clintongormley

@psych0d0g I'm guessing that you are sorting or aggregating on some fields which are taking up a lot of memory? Have a look at the output of:

GET '/_nodes/stats/indices/fielddata?human&fields=*'

@psych0d0g

{"cluster_name":"cslive-elasticsearch","nodes":{"vCAZom1MQF2XZGNqqtqtew":{"timestamp":1409733531784,"name":"csliveeubs-elasticsearch02.server.lan","transport_address":"inet[/10.79.192.12:9300]","host":"csliveeubs-elasticsearch02.server.lan","ip":["inet[/10.79.192.12:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"103.7mb","memory_size_in_bytes":108769547,"evictions":0,"fields":{"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485},"@timestamp":{"memory_size":"103.5mb","memory_size_in_bytes":108570376},"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686}}}}},"eMjtqKXER4ilutu9UPwwmw":{"timestamp":1409733531779,"name":"csliveeubap-elasticsearch02.server.lan","transport_address":"inet[/10.76.30.12:9300]","host":"csliveeubap-elasticsearch02.server.lan","ip":["inet[/10.76.30.12:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"194.5kb","memory_size_in_bytes":199171,"evictions":0,"fields":{"_uid":{"memory_size":"119.8kb","memory_size_in_bytes":122686},"tags":{"memory_size":"74.6kb","memory_size_in_bytes":76485}}}}},"g2nl3aZJSeqhRZMaUdpbsA":{"timestamp":1409733531783,"name":"csqaeubs-logparser02.server.lan","transport_address":"inet[/10.88.7.52:9300]","host":"csqaeubs-logparser02.server.lan","ip":["inet[/10.88.7.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"2NuW6V_LQLieGaRLlFXipQ":{"timestamp":1409733532930,"name":"csliveeubap-elasticsearch01.server.lan","transport_address":"inet[/10.76.30.11:9300]","host":"csliveeubap-elasticsearch01.server.lan","ip":["inet[/10.76.30.11:9300]","NONE"],"attributes":{"datacenter_location":"bap","master":"true"},"indices":{"fielddata":{"memory_size":"96.7mb","memory_size_in_bytes":101478287,"evictions":0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"96.5mb","memory_size_in_bytes":101220616},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}},"fQIrgXboSWyZ6IgZ19_BPw":{"timestamp":1409733531783,"name":"csqaeubs-logparser01.server.lan","transport_address":"inet[/10.88.7.51:9300]","host":"csqaeubs-logparser01.server.lan","ip":["inet[/10.88.7.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"U8TV4j2ETYCuf4Y8HW4zYQ":{"timestamp":1409733531780,"name":"csqaeubap-logparser01.server.lan","transport_address":"inet[/10.88.32.51:9300]","host":"csqaeubap-logparser01.server.lan","ip":["inet[/10.88.32.51:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"XiaE-n4_Rp6ZvZ8mj74D1w":{"timestamp":1409733531780,"name":"csqaeubap-logparser02.server.lan","transport_address":"inet[/10.88.32.52:9300]","host":"csqaeubap-logparser02.server.lan","ip":["inet[/10.88.32.52:9300]","NONE"],"attributes":{"client":"true","data":"false"},"indices":{"fielddata":{"memory_size":"0b","memory_size_in_bytes":0,"evictions":0}}},"krlSSP8JSneA6Lqu4xdR-A":{"timestamp":1409733531783,"name":"csliveeubs-elasticsearch01.server.lan","transport_address":"inet[/10.79.192.11:9300]","host":"csliveeubs-elasticsearch01.server.lan","ip":["inet[/10.79.192.11:9300]","NONE"],"attributes":{"datacenter_location":"bs","master":"true"},"indices":{"fielddata":{"memory_size":"235.8mb","memory_size_in_bytes":247358263,"evictions":
0,"fields":{"tags":{"memory_size":"149.3kb","memory_size_in_bytes":152936},"@timestamp":{"memory_size":"235.6mb","memory_size_in_bytes":247100592},"_uid":{"memory_size":"102.2kb","memory_size_in_bytes":104735}}}}}}}

@clintongormley

OK, so it doesn't look like your fielddata, as that is not using much memory (or did you restart nodes in the interim?).

How big are your bulk batches? Are you still using so many indexing threads? Please add the full output of these two commands (in a separate gist):

GET /_nodes
GET /_nodes/stats

@psych0d0g

It took the cluster several minutes to respond with the requested data.

GET /_nodes:
https://gist.github.com/psych0d0g/9a19380a5d5241002f55

GET /_nodes/stats:
https://gist.github.com/psych0d0g/524c8279e4bc5de161c4

another GET '/_nodes/stats/indices/fielddata?human&fields=*':
https://gist.github.com/psych0d0g/1119212c262ef2026ba7

My batches are 1000 documents.
My indexing thread overrides have been removed from the config, so I'm back to the defaults.
I'm restarting the cluster daily at the moment to get it back to performance.

@clintongormley

Hi @psych0d0g

The only things that show up as abnormal are (1) slow garbage collections and (2) the amount of RAM being used for your segments. See indices.segments.memory_in_bytes in the stats output.

I suggest disabling bloom filters on old indices, and disabling norms (or making fields not_analyzed) on any string field which is not used for full text search.
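
For example (a sketch only, using 1.3.x syntax; the index and field names below are placeholders):

    # stop loading bloom filters for an older index:
    curl -XPUT 'localhost:9200/logstash-live-2014.08.20/_settings' -d '{
      "index.codec.bloom.load": false
    }'

    # in the mapping, disable norms on analyzed string fields that are not
    # used for full text search, or map them as not_analyzed:
    "some_field": { "type": "string", "norms": { "enabled": false } }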


@clintongormley self-assigned this Sep 6, 2014
@clintongormley

I'm assuming that this advice worked, as we haven't heard any more, so I'll close this ticket.
