
Primary shard stuck initialising #15622

Closed
robertsmarty opened this issue Dec 23, 2015 · 17 comments

@robertsmarty

Hi,

This is related to #15431, however now I have some more information. I am processing over 200GB of PA firewall logs per day. The cluster works really well until I make any changes. I just started rolling 2.1.1 into the 2.1.0 cluster, and when the shards started relocating, one of the primary shards got stuck initialising. I couldn't get it working, so I wound up trashing the whole cluster and deploying the whole thing again with just 2.1.1 (not a massive task, as we have automated deployment, so no problem there). The logs started streaming in again and then I started restoring my data, which was working well until, once again, one of the primary shards got stuck initialising. It's also an active shard, with lots of data trying to go into it. I then tried deleting the index and having it created dynamically again, however the primary shard is stuck once more and I have this error in the logs...

[2015-12-23 15:47:18,442][DEBUG][action.admin.indices.recovery] [elkrp14] [indices:monitor/recovery] failed to execute operation for shard [[pa_traffic-2015.12.23][4], node[ijdqvmZHTi66urgpuvCyUQ], [P], v[1], s[INITIALIZING], a[id=3A8c4qmfSU6cggJqOAu9bA], unassigned_info[[reason=INDEX_CREATED], at[2015-12-23T05:38:42.740Z]]]
[pa_traffic-2015.12.23][[pa_traffic-2015.12.23][4]] BroadcastShardOperationFailedException[operation indices:monitor/recovery failed]; nested: IndexNotFoundException[no such index];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [pa_traffic-2015.12.23] IndexNotFoundException[no such index]
at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:310)
at org.elasticsearch.action.admin.indices.recovery.TransportRecoveryAction.shardOperation(TransportRecoveryAction.java:102)
at org.elasticsearch.action.admin.indices.recovery.TransportRecoveryAction.shardOperation(TransportRecoveryAction.java:52)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more
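For reference, the state of the stuck shard can also be inspected directly via the cat shards and index recovery APIs — a minimal sketch, assuming the cluster listens on localhost:9200 and placeholder Shield credentials:

import requests

ES = "http://localhost:9200"                  # assumed endpoint
AUTH = ("es_admin", "password")               # placeholder Shield credentials
INDEX = "pa_traffic-2015.12.23"

# Tabular view of every shard of the index and its state
# (STARTED / INITIALIZING / UNASSIGNED).
print(requests.get(ES + "/_cat/shards/" + INDEX, params={"v": "true"}, auth=AUTH).text)

# Per-shard recovery detail for the same index: id, primary/replica, stage, type.
recovery = requests.get(ES + "/" + INDEX + "/_recovery", auth=AUTH).json()
for shard in recovery.get(INDEX, {}).get("shards", []):
    print(shard["id"], "primary" if shard["primary"] else "replica",
          shard["stage"], shard["type"])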

Any ideas?

Cheers,

Marty

@s1monw
Contributor

s1monw commented Dec 23, 2015

Let me ask some questions:

  • When you say "until I make any changes", what do you mean exactly?
  • "when the shards started relocating one of the primary shards got stuck initialising" — what does that mean?
    • How long is it in that state before you do something?
    • Are you actively indexing into that shard / index?
  • Do you have any throttling enabled?
  • Do you have any allocation decider settings in your cluster?
  • Can you share your index settings / cluster settings?
  • Is your cluster RED while all this happens?

@robertsmarty
Author

Sorry for being so general with my comments. The changes that have caused the issue in the past are cycling new nodes in and out, restoring data, and dynamic index creation when the clock ticks over to the next day.

In all circumstances I am actively indexing about 120-180k events per minute into Elasticsearch, with no throttling and no allocation decider settings. I have 5x r3.xlarge nodes behind an ELB hosted in AWS, and the nodes do not appear to be under a large amount of load. There is also plenty of disk space, with more than 400GB free on each node. I am only using this cluster to process these logs and therefore have not changed the default settings. The index has all non-analysed fields and is using doc values. There are also 5 shards and 1 replica.
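For illustration, a minimal sketch of an index template matching that description (5 shards, 1 replica, non-analysed string fields backed by doc values); the template name, index pattern, endpoint and dynamic-template rule are placeholders rather than the actual config:

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed

# Hypothetical template approximating the settings described above.
template = {
    "template": "pa_traffic-*",
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    },
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "strings_not_analyzed": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True
                    }
                }
            }]
        }
    }
}
requests.put(ES + "/_template/pa_traffic", json=template)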

I have also been monitoring rejected bulk inserts and it all looks ok. This issue seems to occur very suddenly. Usually it ends up with 4 out of the 5 primary shards started and one shard that stays in an initialising state. I've waited for up to 30 minutes before taking action.
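As a rough check, bulk rejections can be read from the nodes stats API — a minimal sketch, assuming the cluster is reachable on localhost:9200 (add Shield credentials as needed):

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed

# Per-node bulk thread pool stats; rejections here would explain dropped events.
stats = requests.get(ES + "/_nodes/stats/thread_pool").json()
for node in stats["nodes"].values():
    bulk = node["thread_pool"]["bulk"]
    print(node["name"], "active:", bulk["active"],
          "queue:", bulk["queue"], "rejected:", bulk["rejected"])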

I understand that the cluster can be in a red state while restoring, as the shards can be initialising for some time; however, it is the active shards receiving PA firewall logs from logstash that have the issue.

I have the cluster in a good state again now, but I can reproduce the issue easily. The issue has occurred under different conditions; today it was triggered by rolling 2.1.1 into the cluster and rolling out one of the 2.1.0 nodes. This caused the shards of today's index to be relocated, and one of them got stuck in the initialising state.

I'll attach the config shortly.

@robertsmarty
Author

_nodes.txt

@s1monw
Contributor

s1monw commented Dec 23, 2015

I have the cluster in a good state again now, but I can reproduce the issue easily. The issue has occurred under different conditions; today it was triggered by rolling 2.1.1 into the cluster and rolling out one of the 2.1.0 nodes. This caused the shards of today's index to be relocated, and one of them got stuck in the initialising state.

I still don't fully understand: your index is green (all shards are allocated), then you do a rolling restart and shards start to initialize on a different node, and then one of the shards is stuck in the initializing state? Do you have replicas for your index?

@robertsmarty
Author

I've just removed a node and added a new one. The cluster stayed in a yellow state until the new node was available, which makes sense, as some of the replicas would have become primaries and new replicas would be in the process of being initialised or unassigned. Once the new node was available the cluster started moving things around as it does; however, this is the current state...

pa_traffic-2015.12.23 4 r STARTED 21353381 11.6gb 10.10.60.121 elkrp7
pa_traffic-2015.12.23 4 p STARTED 21353630 11.6gb 10.10.60.58 elkrp13
pa_traffic-2015.12.23 1 p STARTED 21356538 11.4gb 10.10.60.121 elkrp7
pa_traffic-2015.12.23 1 r STARTED 21356538 11.6gb 10.10.60.46 elkrp9
pa_traffic-2015.12.23 2 p STARTED 21351154 11.4gb 10.10.60.46 elkrp9
pa_traffic-2015.12.23 2 r STARTED 21351154 11.6gb 10.10.60.146 elkrp14
pa_traffic-2015.12.23 3 p STARTED 21351399 11.4gb 10.10.60.58 elkrp13
pa_traffic-2015.12.23 3 r INITIALIZING 10.10.60.146 elkrp14
pa_traffic-2015.12.23 0 p STARTED 21354418 11.6gb 10.10.60.58 elkrp13
pa_traffic-2015.12.23 0 r STARTED 21354414 11.4gb 10.10.60.32 elkrp11
.kibana 0 p STARTED 3 28.1kb 10.10.60.46 elkrp9
.kibana 0 r UNASSIGNED
.shield_audit_log-2015.12.23 1 p STARTED 3354547 10.8gb 10.10.60.121 elkrp7
.shield_audit_log-2015.12.23 1 r STARTED 3354547 12gb 10.10.60.58 elkrp13
.shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14
.shield_audit_log-2015.12.23 2 r UNASSIGNED
.shield_audit_log-2015.12.23 0 p STARTED 3245269 12.7gb 10.10.60.121 elkrp7
.shield_audit_log-2015.12.23 0 r STARTED 3245269 11gb 10.10.60.46 elkrp9
pa_traffic-2015.12.20 4 p STARTED 23091307 12.1gb 10.10.60.146 elkrp14
pa_traffic-2015.12.20 4 r STARTED 23091307 12.1gb 10.10.60.32 elkrp11
pa_traffic-2015.12.20 1 r STARTED 23090686 12gb 10.10.60.121 elkrp7
pa_traffic-2015.12.20 1 p STARTED 23090686 12gb 10.10.60.46 elkrp9
pa_traffic-2015.12.20 2 p STARTED 23085383 12gb 10.10.60.58 elkrp13
pa_traffic-2015.12.20 2 r UNASSIGNED
pa_traffic-2015.12.20 3 p STARTED 23085349 12gb 10.10.60.58 elkrp13
pa_traffic-2015.12.20 3 r STARTED 23085349 12gb 10.10.60.46 elkrp9
pa_traffic-2015.12.20 0 p STARTED 23084229 12gb 10.10.60.121 elkrp7
pa_traffic-2015.12.20 0 r STARTED 23084229 12gb 10.10.60.58 elkrp13
.marvel-es-data 0 p STARTED 9 255.8kb 10.10.60.58 elkrp13
.marvel-es-data 0 r STARTED 9 17kb 10.10.60.46 elkrp9
pa_traffic-2015.12.21 4 p STARTED 31254221 16.7gb 10.10.60.121 elkrp7
pa_traffic-2015.12.21 4 r STARTED 31254221 16.7gb 10.10.60.46 elkrp9
pa_traffic-2015.12.21 1 p STARTED 31252701 16.8gb 10.10.60.46 elkrp9
pa_traffic-2015.12.21 1 r UNASSIGNED
pa_traffic-2015.12.21 2 p STARTED 31246488 16.8gb 10.10.60.58 elkrp13
pa_traffic-2015.12.21 2 r UNASSIGNED
pa_traffic-2015.12.21 3 r STARTED 31238741 16.8gb 10.10.60.121 elkrp7
pa_traffic-2015.12.21 3 p STARTED 31238741 16.8gb 10.10.60.58 elkrp13
pa_traffic-2015.12.21 0 p STARTED 31249916 16.8gb 10.10.60.121 elkrp7
pa_traffic-2015.12.21 0 r STARTED 31249916 16.8gb 10.10.60.46 elkrp9
pa_traffic-2015.12.22 4 p STARTED 13337219 7.1gb 10.10.60.121 elkrp7
pa_traffic-2015.12.22 4 r UNASSIGNED
pa_traffic-2015.12.22 1 r STARTED 13721140 7.4gb 10.10.60.58 elkrp13
pa_traffic-2015.12.22 1 p STARTED 13721140 7.4gb 10.10.60.46 elkrp9
pa_traffic-2015.12.22 2 r STARTED 13582002 7.3gb 10.10.60.121 elkrp7
pa_traffic-2015.12.22 2 p STARTED 13582002 7.3gb 10.10.60.46 elkrp9
pa_traffic-2015.12.22 3 p STARTED 13643624 7.3gb 10.10.60.58 elkrp13
pa_traffic-2015.12.22 3 r UNASSIGNED
pa_traffic-2015.12.22 0 p STARTED 13571877 7.3gb 10.10.60.121 elkrp7
pa_traffic-2015.12.22 0 r UNASSIGNED
.marvel-es-2015.12.23 0 p STARTED 159014 67.6mb 10.10.60.121 elkrp7
.marvel-es-2015.12.23 0 r STARTED 159015 66.5mb 10.10.60.58 elkrp13

You can see the problem here...

.shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14
.shield_audit_log-2015.12.23 2 r UNASSIGNED

This is not a Shield issue, as it also occurs with the pa_traffic indices. I'm assuming that the replica is unassigned from when I removed the node. I'm not sure why the primary would be initialising. Perhaps this may provide some insight?

[2015-12-24 09:18:24,681][DEBUG][action.admin.indices.stats] [elkrp14] [indices:monitor/stats] failed to execute operation for shard [[.shield_audit_log-2015.12.23][2], node[WubrhkLySOecswSxuP1Oyg], [P], v[15], s[INITIALIZING], a[id=MoCPzrqVSU6rsraDA1yIYQ], unassigned_info[[reason=NODE_LEFT], at[2015-12-23T23:03:37.884Z], details[node_left[WubrhkLySOecswSxuP1Oyg]]]]
[.shield_audit_log-2015.12.23][[.shield_audit_log-2015.12.23][2]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [.shield_audit_log-2015.12.23][[.shield_audit_log-2015.12.23][2]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
at org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:131)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more
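One way to tell whether an INITIALIZING primary such as [.shield_audit_log-2015.12.23][2] is genuinely stuck, rather than just recovering slowly, is to poll the index recovery API and watch whether the recovered byte counts move — a minimal sketch, assuming a localhost:9200 endpoint and placeholder Shield credentials:

import time
import requests

ES = "http://localhost:9200"            # assumed endpoint
AUTH = ("es_admin", "password")         # placeholder Shield credentials
INDEX = ".shield_audit_log-2015.12.23"
SHARD = 2

# Sample the recovery stats a few times; if the size figures never change,
# the shard is stuck rather than merely recovering slowly.
for _ in range(5):
    recovery = requests.get(ES + "/" + INDEX + "/_recovery", auth=AUTH).json()
    for shard in recovery.get(INDEX, {}).get("shards", []):
        if shard["id"] == SHARD and shard["primary"]:
            print(shard["stage"], shard["type"], shard["index"]["size"])
    time.sleep(60)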

@robertsmarty
Author

I've just come back after the break (4 days) and the cluster is looking good. I decided to kill a node and see what happens. After the new node joins I can see replicas initialising and some unassigned replicas. I'm assuming the unassigned replicas are due to the maximum number of initialising shards permitted by the current settings. After about 30 minutes or so this happens...

pa_traffic-2015.12.29 0 p INITIALIZING 10.10.60.58 elkrp13
pa_traffic-2015.12.29 0 r UNASSIGNED

And the cluster went RED. After about another 30 minutes it changed to STARTED!

pa_traffic-2015.12.28 0 p STARTED 33679290 17.8gb 10.10.60.58 elkrp13
pa_traffic-2015.12.28 0 r STARTED 33679290 17.8gb 10.10.60.46 elkrp9

However, I lost 30 mins of pa_traffic logs.
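The cap on simultaneously initialising shards speculated about above comes from the recovery and rebalance throttles, which are dynamic settings. A minimal sketch, assuming a localhost:9200 endpoint, of how they can be inspected and temporarily raised while nodes are being cycled — the values shown are illustrative, not recommendations:

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed

# Show any persistent/transient cluster settings that have been explicitly set.
print(requests.get(ES + "/_cluster/settings").text)

# Temporarily allow more concurrent recoveries per node and more recovery
# bandwidth (illustrative values only).
requests.put(ES + "/_cluster/settings", json={
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
        "indices.recovery.max_bytes_per_sec": "80mb"
    }
})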

@robertsmarty
Author

Some more info: to improve resilience I thought that adding an additional replica might help; however, after removing a node and adding another, I still ended up in this state once the new node joined the cluster...

pa_traffic-2015.12.29 3 p INITIALIZING 10.10.60.46 elkrp9
pa_traffic-2015.12.29 3 r UNASSIGNED
pa_traffic-2015.12.29 3 r UNASSIGNED

I'll go back to my original state of 1 replica so I can do more testing.
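For reference, the replica count can be changed per index without recreating it — a minimal sketch, assuming a localhost:9200 endpoint and the index named above:

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed
INDEX = "pa_traffic-2015.12.29"

# Bump to 2 replicas for the experiment above...
requests.put(ES + "/" + INDEX + "/_settings",
             json={"index": {"number_of_replicas": 2}})

# ...and back to the original 1 replica for further testing.
requests.put(ES + "/" + INDEX + "/_settings",
             json={"index": {"number_of_replicas": 1}})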

@robertsmarty
Author

Re the last comment: the primary shard is still stuck initialising after 11 hours.

@robertsmarty
Author

Here is the error in the log...

[2015-12-30 08:42:02,661][DEBUG][action.admin.indices.stats] [elkrp9] [indices:monitor/stats] failed to execute operation for shard [[pa_traffic-2015.12.29][3], node[4a3qk_8bQgeZ90ii33WwOw], [P], v[12], s[INITIALIZING], a[id=TdUeaEvyRKKfaqiui4BBwA], unassigned_info[[reason=NODE_LEFT], at[2015-12-29T11:26:49.382Z], details[node_left[zia7--P2Rs6jkxjquILX2w]]]]
[pa_traffic-2015.12.29][[pa_traffic-2015.12.29][3]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [pa_traffic-2015.12.29][[pa_traffic-2015.12.29][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
at org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:131)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more

Also many of these errors...

[2015-12-30 08:42:02,681][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,688][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,712][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,712][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,716][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,910][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,958][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:03,109][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:03,111][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.

@jasontedor
Member

The logging from TransportBroadcastByNodeAction is harmless and is slated to be removed in future patch releases of Elasticsearch via #14950. It just means that management actions (from Marvel?) are being executed against shards that aren't ready for such actions (e.g. the shard that you say is stuck initializing?).

@robertsmarty
Author

I've been looking further into the logs and I can see events showing the master not being able to communicate with all nodes, and the offending nodes reporting that they can't ping the master...

From the master...

[2015-12-30 21:55:15,665][WARN ][discovery.zen.publish ] [elkrp4] timed out waiting for all nodes to process published state [1403](timeout [30s], pending nodes: [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}, {elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true}])
[2015-12-30 21:55:15,669][WARN ][cluster.service ] [elkrp4] cluster state update task [shard-failed ([.marvel-es-data][0], node[oNTMquLMS_mK1B0jnf-Log], [R], v[42], s[INITIALIZING], a[id=LUCMR3VrQ5S8rG4V9khyHA], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T11:04:07.286Z], details[node_left[9nL6sGpGRC6YxPdWSrSYRA]]]), message [failed to perform indices:data/write/bulk[s] on replica on node {elkrp7}{oNTMquLMS_mK1B0jnf-Log}{10.10.60.250}{10.10.60.250:9300}{master=true}]] took 30s above the warn threshold of 30s
[2015-12-30 21:57:28,886][WARN ][discovery.zen.publish ] [elkrp4] timed out waiting for all nodes to process published state [1404](timeout [30s], pending nodes: [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}, {elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true}])
[2015-12-30 21:57:28,890][WARN ][cluster.service ] [elkrp4] cluster state update task [shard-started ([pa_traffic-2015.12.30][4], node[vhYUtKacQrKqBv_NDH-kTg], [R], v[13], s[INITIALIZING], a[id=qotxyLhuQRS68r1B0d2BuA], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T11:04:07.286Z], details[node_left[9nL6sGpGRC6YxPdWSrSYRA]]]), reason [after recovery (replica) from node [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}]]] took 30s above the warn threshold of 30s
[2015-12-30 22:00:49,471][DEBUG][discovery.zen.publish ] [elkrp4] failed to send cluster state to {elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}
NodeDisconnectedException[[elkrp13][10.10.60.58:9300][internal:discovery/zen/publish] disconnected]
[2015-12-30 22:00:49,471][TRACE][discovery.zen.fd ] [elkrp4] [node ] [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}] transport disconnected
[2015-12-30 22:00:49,471][DEBUG][discovery.zen.publish ] [elkrp4] failed to send cluster state to {elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}
NodeDisconnectedException[[elkrp13][10.10.60.58:9300][internal:discovery/zen/publish] disconnected]
[2015-12-30 22:00:49,473][DEBUG][discovery.zen.publish ] [elkrp4] failed to send cluster state to {elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}
NodeDisconnectedException[[elkrp13][10.10.60.58:9300][internal:discovery/zen/publish] disconnected]
[2015-12-30 22:00:49,496][DEBUG][gateway ] [elkrp4] [pa_traffic-2015.12.28][2]: delaying allocation of [[pa_traffic-2015.12.28][2], node[null], [R], v[36], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T12:00:49.471Z], details[node_left[AyRjNl8dSbiYAUFW_c96-Q]]]] for [1m]
[2015-12-30 22:00:49,496][DEBUG][gateway ] [elkrp4] [pa_traffic-2015.12.26][1]: delaying allocation of [[pa_traffic-2015.12.26][1], node[null], [R], v[43], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T12:00:49.471Z], details[node_left[AyRjNl8dSbiYAUFW_c96-Q]]]] for [1m]
[2015-12-30 22:00:49,497][INFO ][cluster.service ] [elkrp4] removed {{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true},}, reason: zen-disco-node_failed({elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}), reason transport disconnected
[2015-12-30 22:00:49,715][DEBUG][action.admin.cluster.node.info] [elkrp4] failed to execute on node [AyRjNl8dSbiYAUFW_c96-Q]
SendRequestTransportException[[elkrp13][10.10.60.58:9300][cluster:monitor/nodes/info[n]]]; nested: NodeNotConnectedException[[elkrp13][10.10.60.58:9300]

From the offending node...

[2015-12-30 22:00:54,549][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [1] out of [3]
RemoteTransportException[[elkrp4][10.10.60.236:9300][internal:discovery/zen/fd/master_ping]]; nested: IllegalStateException;
Caused by: java.lang.IllegalStateException
[2015-12-30 22:00:54,549][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [2] out of [3]
RemoteTransportException[[elkrp4][10.10.60.236:9300][internal:discovery/zen/fd/master_ping]]; nested: IllegalStateException;
Caused by: java.lang.IllegalStateException
[2015-12-30 22:00:54,550][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [3] out of [3]
RemoteTransportException[[elkrp4][10.10.60.236:9300][internal:discovery/zen/fd/master_ping]]; nested: IllegalStateException;
Caused by: java.lang.IllegalStateException
[2015-12-30 22:00:54,550][DEBUG][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], tried [3] times, each with maximum [30s] timeout
[2015-12-30 22:00:54,550][DEBUG][discovery.zen.fd ] [elkrp13] [master] stopping fault detection against master [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-12-30 22:00:54,551][INFO ][discovery.ec2 ] [elkrp13] master_left [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-12-30 22:00:54,551][TRACE][discovery.ec2 ] [elkrp13] removed [0] pending cluster states
[2015-12-30 22:00:54,551][WARN ][discovery.ec2 ] [elkrp13] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{elkrp7}{oNTMquLMS_mK1B0jnf-Log}{10.10.60.250}{10.10.60.250:9300}{master=true},{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true},{elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true},{elkrp12}{vhYUtKacQrKqBv_NDH-kTg}{10.10.60.169}{10.10.60.169:9300}{master=true},}
[2015-12-30 22:00:54,551][INFO ][cluster.service ] [elkrp13] removed {{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true},}, reason: zen-disco-master_failed ({elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true})
[2015-12-30 22:00:54,551][TRACE][discovery.zen ] [elkrp13] starting to accumulate joins

Then about 30 seconds later...

[2015-12-30 22:01:21,804][DEBUG][discovery.zen.publish ] [elkrp13] received full cluster state version 1414 with size 13087
[2015-12-30 22:01:21,804][DEBUG][discovery.zen.fd ] [elkrp13] [master] restarting fault detection against master [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], reason [new cluster state received and we are monitoring the wrong master [null]]
[2015-12-30 22:01:21,804][DEBUG][discovery.ec2 ] [elkrp13] got first state from fresh master [0ZqKG13FSJCwe9VNtql8hw]

So it looks like some communication issue is the catalyst. This only happens under certain circumstances, such as doing a restore, when a dynamic index is created, or when cycling nodes in and out. Perhaps it is caused by load? CPU, heap and disk all look fine.

You can see that node elkrp13 takes about 1ms to decide that the master is no longer available and then goes through the process of finding another, which takes about 30 seconds. After that, it discovers the same master again; however, its active primary shards remain in an INITIALIZING state.

There are about 120-180k events coming in per minute. Is it possible that this 30-second blip, combined with this volume of events, creates a situation where the cluster can't recover?
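To catch these blips as they happen, cluster health and the pending cluster state update queue can be polled while a node is being cycled — a minimal sketch, assuming a localhost:9200 endpoint and placeholder Shield credentials:

import time
import requests

ES = "http://localhost:9200"            # assumed endpoint
AUTH = ("es_admin", "password")         # placeholder Shield credentials

# Poll health and pending cluster state tasks; a flapping master shows up as a
# spike in pending tasks alongside shards stuck in initializing_shards.
for _ in range(60):
    health = requests.get(ES + "/_cluster/health", auth=AUTH).json()
    pending = requests.get(ES + "/_cluster/pending_tasks", auth=AUTH).json()["tasks"]
    print(health["status"], "initializing:", health["initializing_shards"],
          "unassigned:", health["unassigned_shards"], "pending tasks:", len(pending))
    time.sleep(10)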

@s1monw
Contributor

s1monw commented Jan 4, 2016

You might have answered it already, but when you stop/start a node or take one out, is your cluster green?
I mean, the primary is unassigned and starts initializing, and I wonder how this could happen if a replica is active. If there is no replica active for this primary, then it is expected that the primary initializes. Why it's not becoming active is a different story...

@robertsmarty
Author

Hi Simon, thanks for your reply. I just purchased a license and am now going through support via my Gold subscription. We are using a host-based firewall which is dropping idle connections after 5 minutes. I'm assuming the issue has something to do with that, although I'm currently waiting on a response regarding why the connection is idle and not kept alive via the default Elasticsearch settings. I have a meeting with support in 14 hours. We haven't got to the bottom of it yet, but I'll update this issue in case someone else experiences the same problem.
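As a rough check that the firewall really does kill idle transport connections, one can open a connection to a node's transport port with TCP keepalive enabled and see whether it survives longer than the 5-minute idle cutoff — a minimal sketch; the host, port and timings are placeholders, and the keepalive socket options are Linux-specific:

import socket
import time

HOST, PORT = "10.10.60.58", 9300     # placeholder node address / transport port
IDLE_SECONDS = 6 * 60                # stay idle longer than the 5-minute cutoff

sock = socket.create_connection((HOST, PORT))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # first probe after 60s idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # then every 60s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)      # give up after 3 failures

time.sleep(IDLE_SECONDS)
try:
    # A connection the firewall tore down with a RST fails here; one that is
    # silently blackholed may only time out later, so treat this as a rough check.
    sock.send(b"\x00")
    print("connection still usable after", IDLE_SECONDS, "seconds idle")
except OSError as exc:
    print("connection dropped while idle:", exc)
finally:
    sock.close()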

@s1monw
Contributor

s1monw commented Jan 4, 2016

thanks @robertsmarty

@robertsmarty
Author

It appears that this issue may be caused by the Trend Deep Security agent that we have installed. Profile updates are resetting the connection table and then dropping network connections...

http://esupport.trendmicro.com/solution/en-US/1096766.aspx

This issue has been fixed in Deep Security 9.5 Service Pack (SP) 1 Patch 3.

I'll test and post an update. This could also affect other customers who use this agent because of very strict security requirements.

@robertsmarty
Author

Confirmed this is caused by Trend Deep Security agent < 9.5 Service Pack (SP) 1 Patch 3 when stateful inspection is enabled.

@s1monw
Contributor

s1monw commented Jan 7, 2016

@robertsmarty glad we sorted it out & thanks for reporting back!
