
Primary shard stuck initialising #15622

Closed
robertsmarty opened this issue Dec 23, 2015 · 17 comments

@robertsmarty

Hi,

This is related to #15431, however now I have some more information. I am processing over 200GB of PA firewall logs per day. The cluster works really well until I make any changes. I just started rolling 2.1.1 into the 2.1.0 cluster, and when the shards started relocating, one of the primary shards got stuck initialising. I couldn't get it working, so I wound up trashing the whole cluster and deploying the whole thing again with just 2.1.1 (not a massive task, as we have automated deployment, so no problem there). The logs started streaming in again and then I started restoring my data, which was working well until, once again, one of the primary shards got stuck initialising. It's also an active shard, with lots of data trying to go into it. I then tried deleting the index and having it created dynamically again, however the primary shard is stuck once more and I have this error in the logs...

[2015-12-23 15:47:18,442][DEBUG][action.admin.indices.recovery] [elkrp14] [indices:monitor/recovery] failed to execute operation for shard [[pa_traffic-2015.12.23][4], node[ijdqvmZHTi66urgpuvCyUQ], [P], v[1], s[INITIALIZING], a[id=3A8c4qmfSU6cggJqOAu9bA], unassigned_info[[reason=INDEX_CREATED], at[2015-12-23T05:38:42.740Z]]]
[pa_traffic-2015.12.23][[pa_traffic-2015.12.23][4]] BroadcastShardOperationFailedException[operation indices:monitor/recovery failed]; nested: IndexNotFoundException[no such index];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [pa_traffic-2015.12.23] IndexNotFoundException[no such index]
at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:310)
at org.elasticsearch.action.admin.indices.recovery.TransportRecoveryAction.shardOperation(TransportRecoveryAction.java:102)
at org.elasticsearch.action.admin.indices.recovery.TransportRecoveryAction.shardOperation(TransportRecoveryAction.java:52)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more
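For reference, the state of the stuck shard can also be inspected directly via the cat shards and index recovery APIs — a minimal sketch, assuming the cluster listens on localhost:9200 and placeholder Shield credentials:

import requests

ES = "http://localhost:9200"                  # assumed endpoint
AUTH = ("es_admin", "password")               # placeholder Shield credentials
INDEX = "pa_traffic-2015.12.23"

# Tabular view of every shard of the index and its state
# (STARTED / INITIALIZING / UNASSIGNED).
print(requests.get(ES + "/_cat/shards/" + INDEX, params={"v": "true"}, auth=AUTH).text)

# Per-shard recovery detail for the same index: id, primary/replica, stage, type.
recovery = requests.get(ES + "/" + INDEX + "/_recovery", auth=AUTH).json()
for shard in recovery.get(INDEX, {}).get("shards", []):
    print(shard["id"], "primary" if shard["primary"] else "replica",
          shard["stage"], shard["type"])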

Any ideas?

Cheers,

Marty

@s1monw
Contributor

s1monw commented Dec 23, 2015

Let me ask some questions:

  • When you say "until I make any changes", what do you mean exactly?
  • "when the shards started relocating one of the primary shards got stuck initialising" — what does that mean?
    • How long is it in that state before you do something?
    • Are you actively indexing into that shard / index?
  • Do you have any throttling enabled?
  • Do you have any allocation decider settings in your cluster?
  • Can you share your index settings / cluster settings?
  • Is your cluster RED while all this happens?

@robertsmarty
Author

Sorry for being so general with my comments. The changes that have caused the issue in the past are cycling new nodes in and out, restoring data, and dynamic index creation when the clock ticks over to the next day.

In all circumstances I am actively indexing about 120-180k events per minute into Elasticsearch, with no throttling and no allocation decider settings. I have 5x r3.xlarge nodes behind an ELB hosted in AWS, and the nodes do not appear to be under a large amount of load. There is also plenty of disk space, with more than 400GB free on each node. I am only using this cluster to process these logs and therefore have not changed the default settings. The index has all non-analysed fields and is using doc values. There are also 5 shards and 1 replica.
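For illustration, a minimal sketch of an index template matching that description (5 shards, 1 replica, non-analysed string fields backed by doc values); the template name, index pattern, endpoint and dynamic-template rule are placeholders rather than the actual config:

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed

# Hypothetical template approximating the settings described above.
template = {
    "template": "pa_traffic-*",
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
    },
    "mappings": {
        "_default_": {
            "dynamic_templates": [{
                "strings_not_analyzed": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True
                    }
                }
            }]
        }
    }
}
requests.put(ES + "/_template/pa_traffic", json=template)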

I have also been monitoring rejected bulk inserts and it all looks ok. This issue seems to occur very suddenly. Usually it ends up with 4 out of the 5 primary shards started and one shard that stays in an initialising state. I've waited for up to 30 minutes before taking action.
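As a rough check, bulk rejections can be read from the nodes stats API — a minimal sketch, assuming the cluster is reachable on localhost:9200 (add Shield credentials as needed):

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed

# Per-node bulk thread pool stats; rejections here would explain dropped events.
stats = requests.get(ES + "/_nodes/stats/thread_pool").json()
for node in stats["nodes"].values():
    bulk = node["thread_pool"]["bulk"]
    print(node["name"], "active:", bulk["active"],
          "queue:", bulk["queue"], "rejected:", bulk["rejected"])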

I understand that the cluster can be in a red state while restoring, as the shards can be initialising for some time; however, it is the active shards receiving PA firewall logs from logstash that have the issue.

I have the cluster in a good state again now, but I can reproduce the issue easily. The issue has occurred under different conditions; today it was triggered by rolling 2.1.1 into the cluster and rolling out one of the 2.1.0 nodes. This caused the shards of today's index to be relocated, and one of them got stuck in the initialising state.

I'll attach the config shortly.

@robertsmarty
Author

_nodes.txt

@s1monw
Contributor

s1monw commented Dec 23, 2015

I have the cluster in a good state again now, but I can reproduce the issue easily. The issue has occurred under different conditions; today it was triggered by rolling 2.1.1 into the cluster and rolling out one of the 2.1.0 nodes. This caused the shards of today's index to be relocated, and one of them got stuck in the initialising state.

I still don't fully understand: your index is green (all shards are allocated), then you do a rolling restart and shards start to initialize on a different node, and then one of the shards is stuck in the initializing state? Do you have replicas for your index?

@robertsmarty
Author

I've just removed a node and added a new one. The cluster stayed in a yellow state until the new node was available, which makes sense, as some of the replicas would have become primaries and new replicas would be in the process of being initialised or unassigned. Once the new node was available the cluster started moving things around as it does; however, this is the current state...

pa_traffic-2015.12.23 4 r STARTED 21353381 11.6gb 10.10.60.121 elkrp7
pa_traffic-2015.12.23 4 p STARTED 21353630 11.6gb 10.10.60.58 elkrp13
pa_traffic-2015.12.23 1 p STARTED 21356538 11.4gb 10.10.60.121 elkrp7
pa_traffic-2015.12.23 1 r STARTED 21356538 11.6gb 10.10.60.46 elkrp9
pa_traffic-2015.12.23 2 p STARTED 21351154 11.4gb 10.10.60.46 elkrp9
pa_traffic-2015.12.23 2 r STARTED 21351154 11.6gb 10.10.60.146 elkrp14
pa_traffic-2015.12.23 3 p STARTED 21351399 11.4gb 10.10.60.58 elkrp13
pa_traffic-2015.12.23 3 r INITIALIZING 10.10.60.146 elkrp14
pa_traffic-2015.12.23 0 p STARTED 21354418 11.6gb 10.10.60.58 elkrp13
pa_traffic-2015.12.23 0 r STARTED 21354414 11.4gb 10.10.60.32 elkrp11
.kibana 0 p STARTED 3 28.1kb 10.10.60.46 elkrp9
.kibana 0 r UNASSIGNED
.shield_audit_log-2015.12.23 1 p STARTED 3354547 10.8gb 10.10.60.121 elkrp7
.shield_audit_log-2015.12.23 1 r STARTED 3354547 12gb 10.10.60.58 elkrp13
.shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14
.shield_audit_log-2015.12.23 2 r UNASSIGNED
.shield_audit_log-2015.12.23 0 p STARTED 3245269 12.7gb 10.10.60.121 elkrp7
.shield_audit_log-2015.12.23 0 r STARTED 3245269 11gb 10.10.60.46 elkrp9
pa_traffic-2015.12.20 4 p STARTED 23091307 12.1gb 10.10.60.146 elkrp14
pa_traffic-2015.12.20 4 r STARTED 23091307 12.1gb 10.10.60.32 elkrp11
pa_traffic-2015.12.20 1 r STARTED 23090686 12gb 10.10.60.121 elkrp7
pa_traffic-2015.12.20 1 p STARTED 23090686 12gb 10.10.60.46 elkrp9
pa_traffic-2015.12.20 2 p STARTED 23085383 12gb 10.10.60.58 elkrp13
pa_traffic-2015.12.20 2 r UNASSIGNED
pa_traffic-2015.12.20 3 p STARTED 23085349 12gb 10.10.60.58 elkrp13
pa_traffic-2015.12.20 3 r STARTED 23085349 12gb 10.10.60.46 elkrp9
pa_traffic-2015.12.20 0 p STARTED 23084229 12gb 10.10.60.121 elkrp7
pa_traffic-2015.12.20 0 r STARTED 23084229 12gb 10.10.60.58 elkrp13
.marvel-es-data 0 p STARTED 9 255.8kb 10.10.60.58 elkrp13
.marvel-es-data 0 r STARTED 9 17kb 10.10.60.46 elkrp9
pa_traffic-2015.12.21 4 p STARTED 31254221 16.7gb 10.10.60.121 elkrp7
pa_traffic-2015.12.21 4 r STARTED 31254221 16.7gb 10.10.60.46 elkrp9
pa_traffic-2015.12.21 1 p STARTED 31252701 16.8gb 10.10.60.46 elkrp9
pa_traffic-2015.12.21 1 r UNASSIGNED
pa_traffic-2015.12.21 2 p STARTED 31246488 16.8gb 10.10.60.58 elkrp13
pa_traffic-2015.12.21 2 r UNASSIGNED
pa_traffic-2015.12.21 3 r STARTED 31238741 16.8gb 10.10.60.121 elkrp7
pa_traffic-2015.12.21 3 p STARTED 31238741 16.8gb 10.10.60.58 elkrp13
pa_traffic-2015.12.21 0 p STARTED 31249916 16.8gb 10.10.60.121 elkrp7
pa_traffic-2015.12.21 0 r STARTED 31249916 16.8gb 10.10.60.46 elkrp9
pa_traffic-2015.12.22 4 p STARTED 13337219 7.1gb 10.10.60.121 elkrp7
pa_traffic-2015.12.22 4 r UNASSIGNED
pa_traffic-2015.12.22 1 r STARTED 13721140 7.4gb 10.10.60.58 elkrp13
pa_traffic-2015.12.22 1 p STARTED 13721140 7.4gb 10.10.60.46 elkrp9
pa_traffic-2015.12.22 2 r STARTED 13582002 7.3gb 10.10.60.121 elkrp7
pa_traffic-2015.12.22 2 p STARTED 13582002 7.3gb 10.10.60.46 elkrp9
pa_traffic-2015.12.22 3 p STARTED 13643624 7.3gb 10.10.60.58 elkrp13
pa_traffic-2015.12.22 3 r UNASSIGNED
pa_traffic-2015.12.22 0 p STARTED 13571877 7.3gb 10.10.60.121 elkrp7
pa_traffic-2015.12.22 0 r UNASSIGNED
.marvel-es-2015.12.23 0 p STARTED 159014 67.6mb 10.10.60.121 elkrp7
.marvel-es-2015.12.23 0 r STARTED 159015 66.5mb 10.10.60.58 elkrp13

You can see the problem here...

.shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14
.shield_audit_log-2015.12.23 2 r UNASSIGNED

This is not a Shield issue, as it also occurs with the pa_traffic indices. I'm assuming that the replica is unassigned from when I removed the node. I'm not sure why the primary would be initialising. Perhaps this may provide some insight?

[2015-12-24 09:18:24,681][DEBUG][action.admin.indices.stats] [elkrp14] [indices:monitor/stats] failed to execute operation for shard [[.shield_audit_log-2015.12.23][2], node[WubrhkLySOecswSxuP1Oyg], [P], v[15], s[INITIALIZING], a[id=MoCPzrqVSU6rsraDA1yIYQ], unassigned_info[[reason=NODE_LEFT], at[2015-12-23T23:03:37.884Z], details[node_left[WubrhkLySOecswSxuP1Oyg]]]]
[.shield_audit_log-2015.12.23][[.shield_audit_log-2015.12.23][2]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [.shield_audit_log-2015.12.23][[.shield_audit_log-2015.12.23][2]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
at org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:131)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more
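One way to tell whether an INITIALIZING primary such as [.shield_audit_log-2015.12.23][2] is genuinely stuck, rather than just recovering slowly, is to poll the index recovery API and watch whether the recovered byte counts move — a minimal sketch, assuming a localhost:9200 endpoint and placeholder Shield credentials:

import time
import requests

ES = "http://localhost:9200"            # assumed endpoint
AUTH = ("es_admin", "password")         # placeholder Shield credentials
INDEX = ".shield_audit_log-2015.12.23"
SHARD = 2

# Sample the recovery stats a few times; if the size figures never change,
# the shard is stuck rather than merely recovering slowly.
for _ in range(5):
    recovery = requests.get(ES + "/" + INDEX + "/_recovery", auth=AUTH).json()
    for shard in recovery.get(INDEX, {}).get("shards", []):
        if shard["id"] == SHARD and shard["primary"]:
            print(shard["stage"], shard["type"], shard["index"]["size"])
    time.sleep(60)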

@robertsmarty
Author

I've just come back after the break (4 days) and the cluster is looking good. I decided to kill a node and see what happens. After the new node joins I can see replicas initialising and some unassigned replicas. I'm assuming the unassigned replicas are due to the maximum number of initialising shards permitted by the current settings. After about 30 minutes or so this happens...

pa_traffic-2015.12.29 0 p INITIALIZING 10.10.60.58 elkrp13
pa_traffic-2015.12.29 0 r UNASSIGNED

And the cluster went RED. After about another 30 minutes it changed to STARTED!

pa_traffic-2015.12.28 0 p STARTED 33679290 17.8gb 10.10.60.58 elkrp13
pa_traffic-2015.12.28 0 r STARTED 33679290 17.8gb 10.10.60.46 elkrp9

However, I lost 30 mins of pa_traffic logs.
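The cap on simultaneously initialising shards speculated about above comes from the recovery and rebalance throttles, which are dynamic settings. A minimal sketch, assuming a localhost:9200 endpoint, of how they can be inspected and temporarily raised while nodes are being cycled — the values shown are illustrative, not recommendations:

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed

# Show any persistent/transient cluster settings that have been explicitly set.
print(requests.get(ES + "/_cluster/settings").text)

# Temporarily allow more concurrent recoveries per node and more recovery
# bandwidth (illustrative values only).
requests.put(ES + "/_cluster/settings", json={
    "transient": {
        "cluster.routing.allocation.node_concurrent_recoveries": 4,
        "indices.recovery.max_bytes_per_sec": "80mb"
    }
})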

@robertsmarty
Author

Some more info: to improve resilience I thought that adding an additional replica might help; however, after removing a node and adding another, I still ended up in this state once the new node joined the cluster...

pa_traffic-2015.12.29 3 p INITIALIZING 10.10.60.46 elkrp9
pa_traffic-2015.12.29 3 r UNASSIGNED
pa_traffic-2015.12.29 3 r UNASSIGNED

I'll go back to my original state of 1 replica so I can do more testing.
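For reference, the replica count can be changed per index without recreating it — a minimal sketch, assuming a localhost:9200 endpoint and the index named above:

import requests

ES = "http://localhost:9200"   # assumed endpoint; add Shield auth if needed
INDEX = "pa_traffic-2015.12.29"

# Bump to 2 replicas for the experiment above...
requests.put(ES + "/" + INDEX + "/_settings",
             json={"index": {"number_of_replicas": 2}})

# ...and back to the original 1 replica for further testing.
requests.put(ES + "/" + INDEX + "/_settings",
             json={"index": {"number_of_replicas": 1}})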

@robertsmarty
Author

Re the last comment: the primary shard is still stuck initialising after 11 hours.

@robertsmarty
Author

Here is the error in the log...

[2015-12-30 08:42:02,661][DEBUG][action.admin.indices.stats] [elkrp9] [indices:monitor/stats] failed to execute operation for shard [[pa_traffic-2015.12.29][3], node[4a3qk_8bQgeZ90ii33WwOw], [P], v[12], s[INITIALIZING], a[id=TdUeaEvyRKKfaqiui4BBwA], unassigned_info[[reason=NODE_LEFT], at[2015-12-29T11:26:49.382Z], details[node_left[zia7--P2Rs6jkxjquILX2w]]]]
[pa_traffic-2015.12.29][[pa_traffic-2015.12.29][3]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [pa_traffic-2015.12.29][[pa_traffic-2015.12.29][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:974)
at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:808)
at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:628)
at org.elasticsearch.action.admin.indices.stats.CommonStats.(CommonStats.java:131)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more

Also many of these errors...

[2015-12-30 08:42:02,681][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,688][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,712][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,712][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,716][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,910][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:02,958][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:03,109][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
[2015-12-30 08:42:03,111][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [connection_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.

@jasontedor
Member

The logging from TransportBroadcastByNodeAction is harmless and is slated to be removed in future patch releases of Elasticsearch via #14950. It just means that management actions (from Marvel?) are being executed against shards that aren't ready for such actions (e.g. the shard that you say is stuck initializing?).

@robertsmarty
Author

I've been looking further into the logs and I can see events showing the master not being able to communicate with all nodes, and the offending nodes reporting that they can't ping the master...

From the master...

[2015-12-30 21:55:15,665][WARN ][discovery.zen.publish ] [elkrp4] timed out waiting for all nodes to process published state [1403](timeout [30s], pending nodes: [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}, {elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true}])
[2015-12-30 21:55:15,669][WARN ][cluster.service ] [elkrp4] cluster state update task [shard-failed ([.marvel-es-data][0], node[oNTMquLMS_mK1B0jnf-Log], [R], v[42], s[INITIALIZING], a[id=LUCMR3VrQ5S8rG4V9khyHA], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T11:04:07.286Z], details[node_left[9nL6sGpGRC6YxPdWSrSYRA]]]), message [failed to perform indices:data/write/bulk[s] on replica on node {elkrp7}{oNTMquLMS_mK1B0jnf-Log}{10.10.60.250}{10.10.60.250:9300}{master=true}]] took 30s above the warn threshold of 30s
[2015-12-30 21:57:28,886][WARN ][discovery.zen.publish ] [elkrp4] timed out waiting for all nodes to process published state [1404](timeout [30s], pending nodes: [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}, {elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true}])
[2015-12-30 21:57:28,890][WARN ][cluster.service ] [elkrp4] cluster state update task [shard-started ([pa_traffic-2015.12.30][4], node[vhYUtKacQrKqBv_NDH-kTg], [R], v[13], s[INITIALIZING], a[id=qotxyLhuQRS68r1B0d2BuA], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T11:04:07.286Z], details[node_left[9nL6sGpGRC6YxPdWSrSYRA]]]), reason [after recovery (replica) from node [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}]]] took 30s above the warn threshold of 30s
[2015-12-30 22:00:49,471][DEBUG][discovery.zen.publish ] [elkrp4] failed to send cluster state to {elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}
NodeDisconnectedException[[elkrp13][10.10.60.58:9300][internal:discovery/zen/publish] disconnected]
[2015-12-30 22:00:49,471][TRACE][discovery.zen.fd ] [elkrp4] [node ] [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}] transport disconnected
[2015-12-30 22:00:49,471][DEBUG][discovery.zen.publish ] [elkrp4] failed to send cluster state to {elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}
NodeDisconnectedException[[elkrp13][10.10.60.58:9300][internal:discovery/zen/publish] disconnected]
[2015-12-30 22:00:49,473][DEBUG][discovery.zen.publish ] [elkrp4] failed to send cluster state to {elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}
NodeDisconnectedException[[elkrp13][10.10.60.58:9300][internal:discovery/zen/publish] disconnected]
[2015-12-30 22:00:49,496][DEBUG][gateway ] [elkrp4] [pa_traffic-2015.12.28][2]: delaying allocation of [[pa_traffic-2015.12.28][2], node[null], [R], v[36], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T12:00:49.471Z], details[node_left[AyRjNl8dSbiYAUFW_c96-Q]]]] for [1m]
[2015-12-30 22:00:49,496][DEBUG][gateway ] [elkrp4] [pa_traffic-2015.12.26][1]: delaying allocation of [[pa_traffic-2015.12.26][1], node[null], [R], v[43], s[UNASSIGNED], unassigned_info[[reason=NODE_LEFT], at[2015-12-30T12:00:49.471Z], details[node_left[AyRjNl8dSbiYAUFW_c96-Q]]]] for [1m]
[2015-12-30 22:00:49,497][INFO ][cluster.service ] [elkrp4] removed {{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true},}, reason: zen-disco-node_failed({elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}), reason transport disconnected
[2015-12-30 22:00:49,715][DEBUG][action.admin.cluster.node.info] [elkrp4] failed to execute on node [AyRjNl8dSbiYAUFW_c96-Q]
SendRequestTransportException[[elkrp13][10.10.60.58:9300][cluster:monitor/nodes/info[n]]]; nested: NodeNotConnectedException[[elkrp13][10.10.60.58:9300]

From the offending node...

[2015-12-30 22:00:54,549][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [1] out of [3]
RemoteTransportException[[elkrp4][10.10.60.236:9300][internal:discovery/zen/fd/master_ping]]; nested: IllegalStateException;
Caused by: java.lang.IllegalStateException
[2015-12-30 22:00:54,549][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [2] out of [3]
RemoteTransportException[[elkrp4][10.10.60.236:9300][internal:discovery/zen/fd/master_ping]]; nested: IllegalStateException;
Caused by: java.lang.IllegalStateException
[2015-12-30 22:00:54,550][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [3] out of [3]
RemoteTransportException[[elkrp4][10.10.60.236:9300][internal:discovery/zen/fd/master_ping]]; nested: IllegalStateException;
Caused by: java.lang.IllegalStateException
[2015-12-30 22:00:54,550][DEBUG][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], tried [3] times, each with maximum [30s] timeout
[2015-12-30 22:00:54,550][DEBUG][discovery.zen.fd ] [elkrp13] [master] stopping fault detection against master [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-12-30 22:00:54,551][INFO ][discovery.ec2 ] [elkrp13] master_left [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-12-30 22:00:54,551][TRACE][discovery.ec2 ] [elkrp13] removed [0] pending cluster states
[2015-12-30 22:00:54,551][WARN ][discovery.ec2 ] [elkrp13] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {{elkrp7}{oNTMquLMS_mK1B0jnf-Log}{10.10.60.250}{10.10.60.250:9300}{master=true},{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true},{elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true},{elkrp12}{vhYUtKacQrKqBv_NDH-kTg}{10.10.60.169}{10.10.60.169:9300}{master=true},}
[2015-12-30 22:00:54,551][INFO ][cluster.service ] [elkrp13] removed {{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true},}, reason: zen-disco-master_failed ({elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true})
[2015-12-30 22:00:54,551][TRACE][discovery.zen ] [elkrp13] starting to accumulate joins

Then about 30 seconds later...

[2015-12-30 22:01:21,804][DEBUG][discovery.zen.publish ] [elkrp13] received full cluster state version 1414 with size 13087
[2015-12-30 22:01:21,804][DEBUG][discovery.zen.fd ] [elkrp13] [master] restarting fault detection against master [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], reason [new cluster state received and we are monitoring the wrong master [null]]
[2015-12-30 22:01:21,804][DEBUG][discovery.ec2 ] [elkrp13] got first state from fresh master [0ZqKG13FSJCwe9VNtql8hw]

So it looks like some communication issue is the catalyst. This only happens under certain circumstances, such as doing a restore, when a dynamic index is created, or when cycling nodes in and out. Perhaps it is caused by load? CPU, heap and disk all look fine.

You can see that node elkrp13 takes about 1ms to decide that the master is no longer available and then goes through the process of finding another, which takes about 30 seconds. After that, it discovers the same master again; however, its active primary shards remain in an INITIALIZING state.

There are about 120-180k events coming in per minute. Is it possible that this 30-second blip, combined with this volume of events, creates a situation where the cluster can't recover?
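To catch these blips as they happen, cluster health and the pending cluster state update queue can be polled while a node is being cycled — a minimal sketch, assuming a localhost:9200 endpoint and placeholder Shield credentials:

import time
import requests

ES = "http://localhost:9200"            # assumed endpoint
AUTH = ("es_admin", "password")         # placeholder Shield credentials

# Poll health and pending cluster state tasks; a flapping master shows up as a
# spike in pending tasks alongside shards stuck in initializing_shards.
for _ in range(60):
    health = requests.get(ES + "/_cluster/health", auth=AUTH).json()
    pending = requests.get(ES + "/_cluster/pending_tasks", auth=AUTH).json()["tasks"]
    print(health["status"], "initializing:", health["initializing_shards"],
          "unassigned:", health["unassigned_shards"], "pending tasks:", len(pending))
    time.sleep(10)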

@s1monw
Contributor

s1monw commented Jan 4, 2016

You might have answered it already, but when you stop/start a node or take one out, is your cluster green?
I mean, the primary is unassigned and starts initializing, and I wonder how this could happen if a replica is active. If there is no replica active for this primary, then it is expected that the primary initializes. Why it's not becoming active is a different story...

@robertsmarty
Author

Hi Simon, thanks for your reply. I just purchased a license and am now going through support via my Gold subscription. We are using a host-based firewall which is dropping idle connections after 5 minutes. I'm assuming the issue has something to do with that, although I'm currently waiting on a response regarding why the connection is idle and not kept alive via the default Elasticsearch settings. I have a meeting with support in 14 hours. We haven't got to the bottom of it yet, but I'll update this issue in case someone else experiences the same problem.
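As a rough check that the firewall really does kill idle transport connections, one can open a connection to a node's transport port with TCP keepalive enabled and see whether it survives longer than the 5-minute idle cutoff — a minimal sketch; the host, port and timings are placeholders, and the keepalive socket options are Linux-specific:

import socket
import time

HOST, PORT = "10.10.60.58", 9300     # placeholder node address / transport port
IDLE_SECONDS = 6 * 60                # stay idle longer than the 5-minute cutoff

sock = socket.create_connection((HOST, PORT))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)    # first probe after 60s idle
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)   # then every 60s
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)      # give up after 3 failures

time.sleep(IDLE_SECONDS)
try:
    # A connection the firewall tore down with a RST fails here; one that is
    # silently blackholed may only time out later, so treat this as a rough check.
    sock.send(b"\x00")
    print("connection still usable after", IDLE_SECONDS, "seconds idle")
except OSError as exc:
    print("connection dropped while idle:", exc)
finally:
    sock.close()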

@s1monw
Contributor

s1monw commented Jan 4, 2016

thanks @robertsmarty

@robertsmarty
Author

It appears that this issue may be caused by the Trend Deep Security agent that we have installed. Profile updates are resetting the connection table and then dropping network connections...

http://esupport.trendmicro.com/solution/en-US/1096766.aspx

This issue has been fixed in Deep Security 9.5 Service Pack (SP) 1 Patch 3.

I'll test and post an update. This could also affect other customers who use this agent because of very strict security requirements.

@robertsmarty
Author

Confirmed this is caused by Trend Deep Security agent < 9.5 Service Pack (SP) 1 Patch 3 when stateful inspection is enabled.

@s1monw
Contributor

s1monw commented Jan 7, 2016

@robertsmarty glad we sorted it out & thanks for reporting back!
