Unstable ES 2.4.1 cluster due to timeout exceptions. #23251

Closed
tarunramsinghani opened this issue Feb 19, 2017 · 4 comments

@tarunramsinghani

Elasticsearch version: ES 2.4.1

Plugins installed: [MarvelAgent, DeleteByQuery]

JVM version: 1.8

OS version: Windows 10.0.14393

Description of the problem including expected versus actual behavior:
The ES cluster is unstable and nodes are leaving and rejoining within a few seconds. The reason they leave is timeout exceptions. There are also a lot of recovery failed exceptions due to timeouts. The cluster is in a constant cycle of doing this, with the cluster state bouncing between green and yellow continuously.
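
As a minimal illustration of the flapping (the host below is a placeholder; this is just the standard cluster health API, nothing specific to this cluster):

curl -s 'http://localhost:9200/_cluster/health?pretty'
# while the cluster is flapping, "status" alternates between green and yellow,
# "number_of_nodes" drops and recovers as nodes leave/rejoin, and
# "initializing_shards"/"unassigned_shards" stay non-zero while recoveries are retried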

Steps to reproduce:
No specific steps to reproduce; just normal indexing and queries going on.

Provide logs (if relevant):
NullPointerException:

[2017-02-18 21:49:15,028][WARN ][cluster ] [ES-NODE-M04] Failed to execute IndicesStatsAction for ClusterInfoUpdateJob
java.lang.NullPointerException
at org.elasticsearch.cluster.InternalClusterInfoService.buildShardLevelInfo(InternalClusterInfoService.java:414)
at org.elasticsearch.cluster.InternalClusterInfoService$4.onResponse(InternalClusterInfoService.java:360)
at org.elasticsearch.cluster.InternalClusterInfoService$4.onResponse(InternalClusterInfoService.java:354)
at org.elasticsearch.action.LatchedActionListener.onResponse(LatchedActionListener.java:41)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:89)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:85)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:394)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:363)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:335)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:327)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:158)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:124)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Shard recovery failed exceptions:

[2017-02-18 21:59:34,235][WARN ][cluster.action.shard ] [ES-NODE-M04] [codesearchshared_5_0][11] received shard failed for target shard [[codesearchshared_5_0][11], node[vdqbzQlZRfWaIeJ_PEPf9Q], relocating [FYKeY1KqTHGJJMyWdPriZA], [P], v[79], s[INITIALIZING], a[id=me8y0YPkSGe2FUX6olnnmg, rId=Je0cNKI-Qn2l4IB9zpDuvQ], expected_shard_size[119409422524]], indexUUID [RhFbhWq3RLW76DVSrEsBGA], message [failed recovery], failure [RecoveryFailedException[[codesearchshared_5_0][11]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[codesearchshared_5_0][11]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 22:00:18,747][WARN ][cluster.action.shard ] [ES-NODE-M04] [codesearchshared_3_0][13] received shard failed for target shard [[codesearchshared_3_0][13], node[MZVdp7_xTqKdeJGaUwtToQ], [R], v[76], s[INITIALIZING], a[id=aZszgKaBS2awb9bEv_X1pA], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-02-18T20:55:47.720Z], details[failed recovery, failure RecoveryFailedException[[codesearchshared_3_0][13]: Recovery failed from {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} into {ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]], expected_shard_size[27284248413]], indexUUID [Gj_cfhwtTOKOC4l3o8bF6g], message [failed recovery], failure [RecoveryFailedException[[codesearchshared_3_0][13]: Recovery failed from {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} into {ES-NODE-D17}{MZVdp7_xTqKdeJGaUwtToQ}{10.0.0.157}{10.0.0.157:9300}{faultDomain=0, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[codesearchshared_3_0][13]: Recovery failed from {ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false} into {ES-NODE-D17}{MZVdp7_xTqKdeJGaUwtToQ}{10.0.0.157}{10.0.0.157:9300}{faultDomain=0, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 22:05:09,129][WARN ][cluster.action.shard ] [ES-NODE-M04] [codesearchshared_8_0][0] received shard failed for target shard [[codesearchshared_8_0][0], node[n_DD9rw8R1W8q9v4EDr7Dw], relocating [MZVdp7_xTqKdeJGaUwtToQ], [R], v[73], s[INITIALIZING], a[id=GmMUVPcJQAy1eELN7zoSKQ, rId=A4asqoQ8RxeJ0sR9souRKQ], expected_shard_size[23680822913]], indexUUID [_cnaxFMEQi-rduKsxfo88w], message [failed recovery], failure [RecoveryFailedException[[codesearchshared_8_0][0]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D12}{n_DD9rw8R1W8q9v4EDr7Dw}{10.0.0.152}{10.0.0.152:9300}{faultDomain=1, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]]; ]
RecoveryFailedException[[codesearchshared_8_0][0]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D12}{n_DD9rw8R1W8q9v4EDr7Dw}{10.0.0.152}{10.0.0.152:9300}{faultDomain=1, updateDomain=1, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]

On data nodes, logs like these show up from time to time:

[2017-02-18 20:59:33,650][WARN ][indices.cluster ] [ES-NODE-D14] [[codesearchshared_5_0][11]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[codesearchshared_5_0][11]: Recovery failed from {ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false} into {ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 21:01:23,621][WARN ][indices.cluster ] [ES-NODE-D14] [[codesearchshared_9_0][6]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[codesearchshared_9_0][6]: Recovery failed from {ES-NODE-D18}{RDxrVioiR6-uv04ZpLJsTw}{10.0.0.158}{10.0.0.158:9300}{faultDomain=1, updateDomain=2, master=false} into {ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]

Also, nodes are leaving and joining within a few seconds:

[2017-02-18 20:24:57,674][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-D13}{AjgJdH6oRiel4wH6dtxjNQ}{10.0.0.153}{10.0.0.153:9300}{faultDomain=0, updateDomain=2, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:25:34,139][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-D13}{AjgJdH6oRiel4wH6dtxjNQ}{10.0.0.153}{10.0.0.153:9300}{faultDomain=0, updateDomain=2, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:46:10,030][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-D15}{UPMq2sK8SsueTI5DQSaYvQ}{10.0.0.155}{10.0.0.155:9300}{faultDomain=0, updateDomain=4, master=false},{ES-NODE-Q15}{8rn1GNRhQ-2fyCLI2fIPMw}{10.0.0.65}{10.0.0.65:9300}{faultDomain=0, data=false, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:46:51,658][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-D15}{UPMq2sK8SsueTI5DQSaYvQ}{10.0.0.155}{10.0.0.155:9300}{faultDomain=0, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 20:46:56,460][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-Q15}{8rn1GNRhQ-2fyCLI2fIPMw}{10.0.0.65}{10.0.0.65:9300}{faultDomain=0, data=false, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:05:08,829][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:05:32,104][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:16:36,436][WARN ][indices.cluster ] [ES-NODE-D18] [[codesearchshared_4_0][17]] marking and sending shard failed due to [failed recovery]
RecoveryFailedException[[codesearchshared_4_0][17]: Recovery failed from {ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false} into {ES-NODE-D18}{RDxrVioiR6-uv04ZpLJsTw}{10.0.0.158}{10.0.0.158:9300}{faultDomain=1, updateDomain=2, master=false} (no activity after [30m])]; nested: ElasticsearchTimeoutException[no activity after [30m]];
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:236)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: ElasticsearchTimeoutException[no activity after [30m]]
... 5 more
[2017-02-18 21:27:00,616][INFO ][discovery.zen ] [ES-NODE-D18] master_left [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2017-02-18 21:27:00,616][WARN ][discovery.zen ] [ES-NODE-D18] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{ES-NODE-Q15}{8rn1GNRhQ-2fyCLI2fIPMw}{10.0.0.65}{10.0.0.65:9300}{faultDomain=0, data=false, updateDomain=4, master=false},{ES-NODE-D15}{UPMq2sK8SsueTI5DQSaYvQ}{10.0.0.155}{10.0.0.155:9300}{faultDomain=0, updateDomain=4, master=false},{ES-NODE-D20}{JWWG8q9BRRSsb_f4qWVK9A}{10.0.0.160}{10.0.0.160:9300}{faultDomain=1, updateDomain=4, master=false},{ES-NODE-D11}{Pf9J5INfRJKayPg0eBk3Lg}{10.0.0.151}{10.0.0.151:9300}{faultDomain=0, updateDomain=0, master=false},{ES-NODE-D21}{W0dBAEWFQpeiw_Rf__mIYg}{10.0.0.161}{10.0.0.161:9300}{faultDomain=0, updateDomain=1, master=false},{ES-NODE-D14}{DBHABIxPRpCFAoQsK74aCA}{10.0.0.154}{10.0.0.154:9300}{faultDomain=1, updateDomain=3, master=false},{ES-NODE-D12}{n_DD9rw8R1W8q9v4EDr7Dw}{10.0.0.152}{10.0.0.152:9300}{faultDomain=1, updateDomain=1, master=false},{ES-NODE-D18}{RDxrVioiR6-uv04ZpLJsTw}{10.0.0.158}{10.0.0.158:9300}{faultDomain=1, updateDomain=2, master=false},{ES-NODE-D19}{FYKeY1KqTHGJJMyWdPriZA}{10.0.0.159}{10.0.0.159:9300}{faultDomain=0, updateDomain=3, master=false},{ES-NODE-Q12}{0up_O6UQQB6u_qNodklcqQ}{10.0.0.62}{10.0.0.62:9300}{faultDomain=1, data=false, updateDomain=1, master=false},{ES-NODE-D17}{MZVdp7_xTqKdeJGaUwtToQ}{10.0.0.157}{10.0.0.157:9300}{faultDomain=0, updateDomain=1, master=false},{ES-NODE-Q14}{y5paVrSMQGS5zTa6V0v8AQ}{10.0.0.64}{10.0.0.64:9300}{faultDomain=1, data=false, updateDomain=3, master=false},{ES-NODE-M05}{gITcXJGEQ5We9lOS4FXgxw}{10.0.0.105}{10.0.0.105:9300}{faultDomain=1, data=false, updateDomain=1, master=true},{ES-NODE-Q11}{BamR_sBsSiSM5tpGQvnd_w}{10.0.0.61}{10.0.0.61:9300}{faultDomain=0, data=false, updateDomain=0, master=false},{ES-NODE-D16}{vdqbzQlZRfWaIeJ_PEPf9Q}{10.0.0.156}{10.0.0.156:9300}{faultDomain=1, updateDomain=0, master=false},{ES-NODE-Q13}{LUcYT1_TQ-KbLDrQOldSow}{10.0.0.63}{10.0.0.63:9300}{faultDomain=0, data=false, updateDomain=2, master=false},{ES-NODE-M06}{i-LK8d_KThK_7DgcrXCNOQ}{10.0.0.106}{10.0.0.106:9300}{faultDomain=0, data=false, updateDomain=2, master=true},{ES-NODE-D13}{AjgJdH6oRiel4wH6dtxjNQ}{10.0.0.153}{10.0.0.153:9300}{faultDomain=0, updateDomain=2, master=false},}
[2017-02-18 21:27:00,632][INFO ][cluster.service ] [ES-NODE-D18] removed {{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true},}, reason: zen-disco-master_failed ({ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true})
[2017-02-18 21:27:16,495][INFO ][cluster.service ] [ES-NODE-D18] master {new {ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}}, removed {{ES-NODE-Q13}{LUcYT1_TQ-KbLDrQOldSow}{10.0.0.63}{10.0.0.63:9300}{faultDomain=0, data=false, updateDomain=2, master=false},}, added {{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:27:36,348][INFO ][cluster.service ] [ES-NODE-D18] added {{ES-NODE-Q13}{LUcYT1_TQ-KbLDrQOldSow}{10.0.0.63}{10.0.0.63:9300}{faultDomain=0, data=false, updateDomain=2, master=false},}, reason: zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])
[2017-02-18 21:27:59,754][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4513490][93243] duration [2.3s], collections [1]/[2.4s], total [2.3s]/[1.7h], memory [9.1gb]->[6.8gb]/[13.6gb], all_pools {[young] [3.1gb]->[3.8mb]/[3.2gb]}{[survivor] [276.3mb]->[409.5mb]/[409.5mb]}{[old] [5.6gb]->[6.4gb]/[10gb]}
[2017-02-18 21:29:08,298][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4513514][93244] duration [9.8s], collections [1]/[45.2s], total [9.8s]/[1.7h], memory [9.9gb]->[2.7gb]/[13.6gb], all_pools {[young] [3.1gb]->[25.9mb]/[3.2gb]}{[survivor] [409.5mb]->[0b]/[409.5mb]}{[old] [6.4gb]->[2.7gb]/[10gb]}
[2017-02-18 21:29:08,314][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][old][4513514][225] duration [35s], collections [1]/[45.2s], total [35s]/[3.5m], memory [9.9gb]->[2.7gb]/[13.6gb], all_pools {[young] [3.1gb]->[25.9mb]/[3.2gb]}{[survivor] [409.5mb]->[0b]/[409.5mb]}{[old] [6.4gb]->[2.7gb]/[10gb]}
[2017-02-18 21:29:09,079][WARN ][cluster.service ] [ES-NODE-D18] cluster state update task [zen-disco-receive(from master [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}])] took 46s above the warn threshold of 30s
[2017-02-18 21:29:17,600][WARN ][transport ] [ES-NODE-D18] Received response for a request that has timed out, sent [120090ms] ago, timed out [90112ms] ago, action [internal:discovery/zen/fd/master_ping], node [{ES-NODE-M04}{Tu9nTM4uS72ZBcfoTAJoXw}{10.0.0.104}{10.0.0.104:9300}{faultDomain=0, data=false, updateDomain=0, master=true}], id [9350862]
[2017-02-18 21:30:18,363][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4513576][93245] duration [8s], collections [1]/[8.4s], total [8s]/[1.7h], memory [5.9gb]->[4.2gb]/[13.6gb], all_pools {[young] [3.1gb]->[3.8mb]/[3.2gb]}{[survivor] [0b]->[409.5mb]/[409.5mb]}{[old] [2.7gb]->[3.8gb]/[10gb]}
[2017-02-18 21:42:33,730][WARN ][monitor.jvm ] [ES-NODE-D18] [gc][young][4514303][93246] duration [1.3s], collections [1]/[1.3s], total [1.3s]/[1.7h], memory [7.4gb]->[4.5gb]/[13.6gb], all_pools {[young] [3.1gb]->[1.9mb]/[3.2gb]}{[survivor] [409.5mb]->[335.3mb]/[409.5mb]}{[old] [3.8gb]->[4.2gb]/[10gb]}

@bleskes
Contributor

bleskes commented Feb 20, 2017

Thx @tarunramsinghani. The null pointer exception is indeed a potential bug, albeit a harmless one - it means that collecting disk info went wrong and it will just be retried later. At this point we only fix major issues in 2.4 and I don't think this qualifies. I suggest you upgrade, and if it repeats we can dig further.

All the other problems indicate that your cluster/network is under stress. Those recovery failures kick in once the recovery target hasn't seen any activity for the last 30m. Nodes leaving and joining can potentially be due to network issues, but it is more likely caused by nodes going into long GCs and not responding to pings. I'm not sure what hardware you run on (i.e., whether it's a VM), but a young GC of 8s on such a 10GB heap is crazy slow. I suggest you look at the amount of data you store on those nodes and at their performance. Closing this for now.
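
(For reference, a sketch of where the numbers in the logs above come from - these are the 2.x defaults, listed here purely for orientation as elasticsearch.yml keys, not as a suggestion to raise them; the host in the curl call is a placeholder:)

# indices.recovery.recovery_activity_timeout: 30m   <- source of "no activity after [30m]"
# discovery.zen.fd.ping_timeout: 30s                <- "each with maximum [30s] timeout"
# discovery.zen.fd.ping_retries: 3                  <- "failed to ping, tried [3] times"
# The cluster settings API only returns explicit overrides, so an empty response means the defaults apply:
curl -s 'http://localhost:9200/_cluster/settings?pretty'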

@bleskes bleskes closed this as completed Feb 20, 2017
@bittusarkar

Thank you for the pointers @bleskes. We're running on VMs and have set ES_HEAP_SIZE to 14g and ES_HEAP_NEWSIZE to 4g on all data nodes. Do you suspect the value of ES_HEAP_NEWSIZE is too high?
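
(For context, my reading of the 2.x startup scripts - bin/elasticsearch.in.sh on Linux, with the Windows .bat files appearing to do the equivalent - is that these variables translate roughly to:)

ES_HEAP_SIZE=14g     # becomes -Xms14g -Xmx14g (min and max heap pinned together)
ES_HEAP_NEWSIZE=4g   # becomes -Xmn4g (a fixed 4g young generation out of the 14g heap)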

@tarunramsinghani
Author

@bleskes We are still facing the same issue. Do you think the setting we have, i.e. ES_HEAP_NEWSIZE at ~25% of ES_HEAP_SIZE, can cause this? We added this setting in ES 1.7 to avoid OOM issues, but does it have any adverse effect on ES 2.4?
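
(A minimal way to correlate this with the GC warnings above, assuming a reachable node - the host is a placeholder:)

curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
# jvm.mem.pools.young/survivor/old show the actual generation sizes the -Xmn setting produces,
# and jvm.gc.collectors.young/old collection counts and times should line up with the
# [gc][young]/[gc][old] warnings in the node logs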

@bleskes
Contributor

bleskes commented May 10, 2017

@tarunramsinghani I suggest you ask this question on our discuss.elastic.co forums, where we can help more. We keep GitHub for features and bugs.
