Unstable ES 2.4.1 cluster due to timeout exceptions. #23251
Comments
Thanks @tarunramsinghani. The null pointer exception is indeed a potential bug, albeit a harmless one - it means that collecting disk info went wrong, and it will just be retried later. At this point we only fix major issues in 2.4, and I don't think this qualifies. I suggest you upgrade, and if it repeats we can dig further. All the other problems indicate that your cluster/network is under stress. Those recovery failures kick in once the recovery target hasn't seen any activity in the last 30m. Nodes leaving and joining can be a network issue, but is more likely caused by nodes going into long GC pauses and not responding to pings. I'm not sure what hardware you run on (i.e., whether it's a VM), but a young GC of 8s on such a 10GB heap is crazy slow. I suggest you look at the amount of data you store on those nodes and their performance. Closing this for now.
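The 30m window mentioned here corresponds to the recovery activity timeout. If the stalls are expected (e.g. a genuinely slow network rather than GC), it can reportedly be raised at runtime through the cluster settings API; a hedged sketch, where the setting name `indices.recovery.recovery_activity_timeout`, the host/port, and the 60m value are assumptions to verify against the 2.4 docs:

```shell
# Sketch only: raise the recovery inactivity timeout from the default 30m.
# Setting name, host, and value are assumptions -- check the ES 2.4 reference.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.recovery.recovery_activity_timeout": "60m"
  }
}'
```

Note this only masks the symptom; if the root cause is long GC, the recoveries will still be slow.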
Thank you for the pointers @bleskes. We're running on VMs and have set …
@bleskes We are still facing the same issue. Do you think the setting we have, i.e. ES_HEAP_NEWSIZE at ~25% of ES_HEAP_SIZE, could cause this? We added this setting in ES 1.7 to avoid OOM issues, but does it have any adverse effect on ES 2.4?
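For reference, a minimal sketch of how that ratio is typically expressed via environment variables; the concrete sizes are assumptions, chosen to roughly mirror the ~13.6gb heap with a ~3.2gb young generation visible in the GC logs below:

```shell
# Sketch only: ES_HEAP_NEWSIZE pinned to ~25% of ES_HEAP_SIZE.
# The 14g total is an assumption, not taken from the actual cluster config.
ES_HEAP_SIZE_MB=14336                          # 14g total heap
ES_HEAP_NEWSIZE_MB=$((ES_HEAP_SIZE_MB / 4))    # ~25% young generation
echo "ES_HEAP_SIZE=${ES_HEAP_SIZE_MB}m ES_HEAP_NEWSIZE=${ES_HEAP_NEWSIZE_MB}m"
```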
@tarunramsinghani I suggest you ask this question on our discuss.elastic.co forums, where we can help more. We keep GitHub for feature requests and bug reports.
Elasticsearch version: ES 2.4.1
Plugins installed: [MarvelAgent, DeleteByQuery]
JVM version: 1.8
OS version: Windows 10.0.14393
Description of the problem including expected versus actual behavior:
The ES cluster is unstable: nodes are leaving and rejoining within a few seconds, with timeout exceptions given as the reason for leaving. There are also a lot of failed-recovery exceptions due to timeouts. The cluster is in a constant cycle of this, with the cluster state bouncing between green and yellow continuously.
Steps to reproduce:
No specific steps to reproduce; just normal indexing and queries going on.
Provide logs (if relevant):
NullPointerException
[2017-02-18 21:49:15,028][WARN ][cluster ] [ES-NODE-M04] Failed to execute IndicesStatsAction for ClusterInfoUpdateJob
java.lang.NullPointerException
at org.elasticsearch.cluster.InternalClusterInfoService.buildShardLevelInfo(InternalClusterInfoService.java:414)
at org.elasticsearch.cluster.InternalClusterInfoService$4.onResponse(InternalClusterInfoService.java:360)
at org.elasticsearch.cluster.InternalClusterInfoService$4.onResponse(InternalClusterInfoService.java:354)
at org.elasticsearch.action.LatchedActionListener.onResponse(LatchedActionListener.java:41)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:394)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:363)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:335)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:327)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:158)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:124)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Shard Recovery failed exceptions
[2017-02-18 21:59:34,235][WARN ][cluster.action.shard ] [ES-NODE-M04] [codesearchshared_5_0][11] received shard failed for target shard [[codesearchshared_5_0][11], node[vdqbzQlZRfWaIeJ_PEPf9Q], relocating [FYKeY1KqTHGJJMyWdPriZA], [P], v[79], s[INITIALIZING], a[id=me8y0YPkSGe2FUX6olnnmg, rId=Je0cNKI-Qn2l4IB9zpDuvQ], expected_shard_size[119409422524]], indexUUID [RhFbhWq3RLW76DVSrEsBGA], message [failed recovery], failure [RecoveryFailedException[[codesearchshared_5_0][11]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[codesearchshared_5_0][11]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 22:00:18,747][WARN ][cluster.action.shard ] [ES-NODE-M04] [codesearchshared_3_0][13] received shard failed for target shard [[codesearchshared_3_0][13], node[MZVdp7_xTqKdeJGaUwtToQ], [R], v[76], s[INITIALIZING], a[id=aZszgKaBS2awb9bEv_X1pA], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-02-18T20:55:47.720Z], details[failed recovery, failure RecoveryFailedException[[codesearchshared_3_0][13]: Recovery failed from {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} into {ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]], expected_shard_size[27284248413]], indexUUID [Gj_cfhwtTOKOC4l3o8bF6g], message [failed recovery], failure [RecoveryFailedException[[codesearchshared_3_0][13]: Recovery failed from {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} into {ES-NODE-D17}{MZVdp7_xTqKdeJGaUwtToQ}{10.0.0.157}{10.0.0.157:9300}{faultDomain=0, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[codesearchshared_3_0][13]: Recovery failed from {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} into {ES-NODE-D17}{MZVdp7_xTqKdeJGaUwtToQ}{10.0.0.157}{10.0.0.157:9300}{faultDomain=0, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 22:05:09,129][WARN ][cluster.action.shard ] [ES-NODE-M04] [codesearchshared_8_0][0] received shard failed for target shard [[codesearchshared_8_0][0], node[n_DD9rw8R1W8q9v4EDr7Dw], relocating [MZVdp7_xTqKdeJGaUwtToQ], [R], v[73], s[INITIALIZING], a[id=GmMUVPcJQAy1eELN7zoSKQ, rId=A4asqoQ8RxeJ0sR9souRKQ], expected_shard_size[23680822913]], indexUUID [_cnaxFMEQi-rduKsxfo88w], message [failed recovery], failure [RecoveryFailedException[[codesearchshared_8_0][0]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D12}{n_DD9rw8R1W8q9v4EDr7Dw}{10.0.0.152}{10.0.0.152:9300}{faultDomain=1, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[codesearchshared_8_0][0]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D12}{n_DD9rw8R1W8q9v4EDr7Dw}{10.0.0.152}{10.0.0.152:9300}{faultDomain=1, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
On the data nodes, logs like these show up from time to time:
[2017-02-18 20:59:33,650][WARN ][indices.cluster ] [ES-NODE-D14] [[codesearchshared_5_0][11]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[codesearchshared_5_0][11]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 21:01:23,621][WARN ][indices.cluster ] [ES-NODE-D14] [[codesearchshared_9_0][6]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[codesearchshared_9_0][6]: Recovery failed from {ES-NODE-D18}{RDxrVioiR6-uv04ZpLJsTw}{10.0.0.158}{10.0.0.158:9300}{faultDomain=1, updateDomain=2, master=false} into {ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
Also, nodes are leaving and joining within a few seconds:
[2017-02-18 20:24:57,674][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-D13}{AjgJdH6oRiel4wH6dtxjNQ}{10.0.0.153}{10.0.0.153:9300}{faultDomain=0, updateDomain=2, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:25:34,139][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-D13}{AjgJdH6oRiel4wH6dtxjNQ}{10.0.0.153}{10.0.0.153:9300}{faultDomain=0, updateDomain=2, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:46:10,030][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-D15}{UPMq2sK8SsueTI5DQSaYvQ}{10.0.0.155}{10.0.0.155:9300}{faultDomain=0, updateDomain=4, master=false},{ES-NODE-Q15}{8rn1GNRhQ-2fyCLI2fIPMw}{10.0.0.65}{10.0.0.65:9300}{faultDomain=0, data=false, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:46:51,658][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-D15}{UPMq2sK8SsueTI5DQSaYvQ}{10.0.0.155}{10.0.0.155:9300}{faultDomain=0, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:46:56,460][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-Q15}{8rn1GNRhQ-2fyCLI2fIPMw}{10.0.0.65}{10.0.0.65:9300}{faultDomain=0, data=false, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:05:08,829][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:05:32,104][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:16:36,436][WARN ][indices.cluster ] [ES-NODE-D18] [[codesearchshared_4_0][17]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[codesearchshared_4_0][17]: Recovery failed from {ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false} into {ES-NODE-D18}{RDxrVioiR6-uv04ZpLJsTw}{10.0.0.158}{10.0.0.158:9300}{faultDomain=1, updateDomain=2, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 21:27:00,616][INFO ][discovery.zen ] [ES-NODE-D18] master_left [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-02-18 21:27:00,616][WARN ][discovery.zen ] [ES-NODE-D18] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ES-NODE-Q15}{8rn1GNRhQ-2fyCLI2fIPMw}{10.0.0.65}{10.0.0.65:9300}{faultDomain=0, data=false, updateDomain=4, master=false},{ES-NODE-D15}{UPMq2sK8SsueTI5DQSaYvQ}{10.0.0.155}{10.0.0.155:9300}{faultDomain=0, updateDomain=4, master=false},{ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false},{ES-NODE-D11}{Pf9J5INfRJKayPg0eBk3Lg}{10.0.0.151}{10.0.0.151:9300}{faultDomain=0, updateDomain=0, master=false},{ES-NODE-D21}{W0dBAEWFQpeiw_Rf__mIYg}{10.0.0.161}{10.0.0.161:9300}{faultDomain=0, updateDomain=1, master=false},{ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false},{ES-NODE-D12}{n_DD9rw8R1W8q9v4EDr7Dw}{10.0.0.152}{10.0.0.152:9300}{faultDomain=1, updateDomain=1, master=false},{ES-NODE-D18}{RDxrVioiR6-uv04ZpLJsTw}{10.0.0.158}{10.0.0.158:9300}{faultDomain=1, updateDomain=2, master=false},{ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false},{ES-NODE-Q12}{0up_O6UQQB6u_qNodklcqQ}{10.0.0.62}{10.0.0.62:9300}{faultDomain=1, data=false, updateDomain=1, master=false},{ES-NODE-D17}{MZVdp7_xTqKdeJGaUwtToQ}{10.0.0.157}{10.0.0.157:9300}{faultDomain=0, updateDomain=1, master=false},{ES-NODE-Q14}{y5paVrSMQGS5zTa6V0v8AQ}{10.0.0.64}{10.0.0.64:9300}{faultDomain=1, data=false, updateDomain=3, master=false},{ES-NODE-M05}{gITcXJGEQ5We9lOS4FXgxw}{10.0.0.105}{10.0.0.105:9300}{faultDomain=1, data=false, updateDomain=1, master=true},{ES-NODE-Q11}{BamR_sBsSiSM5tpGQvnd_w}{10.0.0.61}{10.0.0.61:9300}{faultDomain=0, data=false, updateDomain=0, master=false},{ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false},{ES-NODE-Q13}{LUcYT1_TQ-KbLDrQOldSow}{10.0.0.63}{10.0.0.63:9300}{faultDomain=0, data=false, 
updateDomain=2, master=false},{ES-NODE-M06}{i-LK8d_KThK_7DgcrXCNOQ}{10.0.0.106}{10.0.0.106:9300}{faultDomain=0, data=false, updateDomain=2, master=true},{ES-NODE-D13}{AjgJdH6oRiel4wH6dtxjNQ}{10.0.0.153}{10.0.0.153:9300}{faultDomain=0, updateDomain=2, master=false},}
[2017-02-18 21:27:00,632][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true},}, reason: zen-disco-master_failed ({ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true})
[2017-02-18 21:27:16,495][INFO ][cluster.service ] [ES-NODE-D18] master {new {ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}}, removed {{ES-NODE-Q13}{LUcYT1_TQ-KbLDrQOldSow}{10.0.0.63}{10.0.0.63:9300}{faultDomain=0, data=false, updateDomain=2, master=false},}, added {{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:27:36,348][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-Q13}{LUcYT1_TQ-KbLDrQOldSow}{10.0.0.63}{10.0.0.63:9300}{faultDomain=0, data=false, updateDomain=2, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:27:59,754][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4513490][93243] duration [2.3s], collections [1]/[2.4s], total [2.3s]/[1.7h], memory [9.1gb]->[6.8gb]/[13.6gb], all_pools {[young] [3.1gb]->[3.8mb]/[3.2gb]}{[survivor] [276.3mb]->[409.5mb]/[409.5mb]}{[old] [5.6gb]->[6.4gb]/[10gb]}
[2017-02-18 21:29:08,298][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4513514][93244] duration [9.8s], collections [1]/[45.2s], total [9.8s]/[1.7h], memory [9.9gb]->[2.7gb]/[13.6gb], all_pools {[young] [3.1gb]->[25.9mb]/[3.2gb]}{[survivor] [409.5mb]->[0b]/[409.5mb]}{[old] [6.4gb]->[2.7gb]/[10gb]}
[2017-02-18 21:29:08,314][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][old][4513514][225] duration [35s], collections [1]/[45.2s], total [35s]/[3.5m], memory [9.9gb]->[2.7gb]/[13.6gb], all_pools {[young] [3.1gb]->[25.9mb]/[3.2gb]}{[survivor] [409.5mb]->[0b]/[409.5mb]}{[old] [6.4gb]->[2.7gb]/[10gb]}
[2017-02-18 21:29:09,079][WARN ][cluster.service ] [ES-NODE-D18] cluster state update task [zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])] took 46s above the warn threshold of 30s
[2017-02-18 21:29:17,600][WARN ][transport ] [ES-NODE-D18] Received response for a request that has timed out, sent [120090ms] ago, timed out [90112ms] ago, action [internal:discovery/zen/fd/master_ping], node [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}], id [9350862]
[2017-02-18 21:30:18,363][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4513576][93245] duration [8s], collections [1]/[8.4s], total [8s]/[1.7h], memory [5.9gb]->[4.2gb]/[13.6gb], all_pools {[young] [3.1gb]->[3.8mb]/[3.2gb]}{[survivor] [0b]->[409.5mb]/[409.5mb]}{[old] [2.7gb]->[3.8gb]/[10gb]}
[2017-02-18 21:42:33,730][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4514303][93246] duration [1.3s], collections [1]/[1.3s], total [1.3s]/[1.7h], memory [7.4gb]->[4.5gb]/[13.6gb], all_pools {[young] [3.1gb]->[1.9mb]/[3.2gb]}{[survivor] [409.5mb]->[335.3mb]/[409.5mb]}{[old] [3.8gb]->[4.2gb]/[10gb]}
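The monitor.jvm warnings above can be screened mechanically for pauses long enough to miss the 30s master pings; a minimal sketch, with the log format inferred from the lines above:

```python
import re

# Matches monitor.jvm lines such as:
#   [gc][young][4513514][93244] duration [9.8s], collections [1]/[45.2s], ...
GC_RE = re.compile(r"\[gc\]\[(young|old)\]\S*\s+duration \[([\d.]+)(m?s)\]")

def gc_pause_seconds(line):
    """Return (collector, pause_in_seconds) for a monitor.jvm line, else None."""
    m = GC_RE.search(line)
    if m is None:
        return None
    collector, value, unit = m.group(1), float(m.group(2)), m.group(3)
    seconds = value / 1000.0 if unit == "ms" else value
    return collector, seconds
```

Applied to the lines above, the 35s old-generation pause comfortably exceeds the 30s ping timeout, which matches the master_left events in the log.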