Primary shard stuck initialising #15622
Comments
lemme ask some questions:
Sorry for being so general with my comments. The changes that have caused the issue in the past have been cycling nodes in and out, restoring data, and dynamic index creation when the clock ticks over to the next day. In all circumstances I am actively indexing about 120–180k events per minute into Elasticsearch with no throttling and no allocation decider settings.

I have 5x r3.xlarge nodes behind an ELB hosted in AWS and the nodes do not appear to be under a large amount of load. There is also plenty of disk space, with more than 400GB free on each node. I am only using this cluster to process these logs, so I have not changed the default settings. The index has all non-analysed fields and is using doc values. There are 5 shards and 1 replica. I have also been monitoring rejected bulk inserts and it all looks OK.

This issue seems to occur very suddenly. Usually it ends up with 4 out of the 5 primary shards started and one shard that stays in an initialising state. I've waited for up to 30 minutes before taking action. I understand that the cluster can be in a red state while restoring, as the shards can be initialising for some time, however it is the active shards receiving PA firewall logs from Logstash that have the issue. I have the cluster in a good state again now but I can reproduce the issue easily.

This issue has occurred under different conditions, however today it was initiated by rolling 2.1.1 into the cluster and rolling out one of the 2.1.0 nodes. This caused the shards of today's index to be relocated and one of them got stuck in the initialising state. I'll attach the config shortly.
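The bulk-rejection monitoring mentioned above can be scripted against the `_cat/thread_pool` API (available in 2.x). A minimal sketch, assuming a node reachable on localhost:9200 with no Shield auth; the sample lines below stand in for live output so the filter itself can be demonstrated.

```shell
# Live command would be (assumption: default HTTP port, auth omitted):
#   curl -s 'localhost:9200/_cat/thread_pool?h=host,bulk.active,bulk.queue,bulk.rejected'
# Flag any node reporting bulk rejections (4th column > 0).
printf '%s\n' \
  '10.10.60.121 8 0 0' \
  '10.10.60.146 8 42 137' \
| awk '$4 > 0 { print $1, "has", $4, "bulk rejections" }'
# → 10.10.60.146 has 137 bulk rejections
```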
I still don't fully understand: your index is green (all shards are allocated), then you do a rolling restart and shards start to initialize on a different node. Then one of the shards is stuck in the initializing state? Do you have replicas for your index?
I've just removed a node and added a new one. The cluster stayed in a yellow state until the new node was available, which makes sense as some of the replicas would have become primary and new replicas would be in the process of being initialised or unassigned. Once the new node was available the cluster started moving things around as it does, however this is the current state...

pa_traffic-2015.12.23 4 r STARTED 21353381 11.6gb 10.10.60.121 elkrp7

You can see the problem here...

.shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14

This is not a Shield issue as it also occurs with pa_traffic. I'm assuming that the replica is unassigned from when I removed the node. I'm not sure why the primary would be initialising. Perhaps this may provide some insight?

[2015-12-24 09:18:24,681][DEBUG][action.admin.indices.stats] [elkrp14] [indices:monitor/stats] failed to execute operation for shard [[.shield_audit_log-2015.12.23][2], node[WubrhkLySOecswSxuP1Oyg], [P], v[15], s[INITIALIZING], a[id=MoCPzrqVSU6rsraDA1yIYQ], unassigned_info[[reason=NODE_LEFT], at[2015-12-23T23:03:37.884Z], details[node_left[WubrhkLySOecswSxuP1Oyg]]]]
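A stuck shard like the one above can be spotted automatically by filtering `_cat/shards` output for anything not in the STARTED state. A minimal sketch — the sample lines are the two rows from this comment; against a live cluster you would pipe from curl instead:

```shell
# Live command would be (assumption: default HTTP port):
#   curl -s 'localhost:9200/_cat/shards'
# The 4th column of _cat/shards output is the shard state; keep non-STARTED rows.
printf '%s\n' \
  'pa_traffic-2015.12.23 4 r STARTED 21353381 11.6gb 10.10.60.121 elkrp7' \
  '.shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14' \
| awk '$4 != "STARTED"'
# → .shield_audit_log-2015.12.23 2 p INITIALIZING 10.10.60.146 elkrp14
```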
Just came back after the break (4 days) and the cluster is looking good. Decided to kill a node and see what happens. After the new node joins I can see replicas initialising and some unassigned replicas. I'm assuming the unassigned replicas are due to a maximum number of initialising shards permitted by the current settings. After about 30 mins or so this happens...

pa_traffic-2015.12.29 0 p INITIALIZING 10.10.60.58 elkrp13

And the cluster went RED. After about another 30 mins it changes to STARTED!

pa_traffic-2015.12.28 0 p STARTED 33679290 17.8gb 10.10.60.58 elkrp13

However, I lost 30 mins of pa_traffic logs.
Some more info. To improve resilience I thought that adding an additional replica might help, however after removing a node and adding another I still ended up in this state after the new node joined the cluster...

pa_traffic-2015.12.29 3 p INITIALIZING 10.10.60.46 elkrp9

I'll go back to my original state of 1 replica so I can do more testing.
Re last comment, the primary shard is stuck initialising after 11 hours. |
Here is the error in the log...

[2015-12-30 08:42:02,661][DEBUG][action.admin.indices.stats] [elkrp9] [indices:monitor/stats] failed to execute operation for shard [[pa_traffic-2015.12.29][3], node[4a3qk_8bQgeZ90ii33WwOw], [P], v[12], s[INITIALIZING], a[id=TdUeaEvyRKKfaqiui4BBwA], unassigned_info[[reason=NODE_LEFT], at[2015-12-29T11:26:49.382Z], details[node_left[zia7--P2Rs6jkxjquILX2w]]]]

Also many of these errors...

[2015-12-30 08:42:02,681][WARN ][shield.audit.index ] [elkrp9] failed to index audit event: [access_granted]. queue is full; bulk processor may not be able to keep up or has stopped indexing.
The logging from
I've been looking further into the logs and I can see events showing the master not being able to communicate with all nodes, and the offending nodes reporting that they can't ping the master...

From the master...

[2015-12-30 21:55:15,665][WARN ][discovery.zen.publish ] [elkrp4] timed out waiting for all nodes to process published state [1403](timeout [30s], pending nodes: [{elkrp13}{AyRjNl8dSbiYAUFW_c96-Q}{10.10.60.58}{10.10.60.58:9300}{master=true}, {elkrp11}{I7U0iJwYQH6emidwdosnCg}{10.10.60.32}{10.10.60.32:9300}{master=true}])

From the offending node...

[2015-12-30 22:00:54,549][TRACE][discovery.zen.fd ] [elkrp13] [master] failed to ping [{elkrp4}{0ZqKG13FSJCwe9VNtql8hw}{10.10.60.236}{10.10.60.236:9300}{master=true}], retry [1] out of [3]

Then about 30 seconds later...

[2015-12-30 22:01:21,804][DEBUG][discovery.zen.publish ] [elkrp13] received full cluster state version 1414 with size 13087

So it looks like some communication issue is the catalyst. This only happens under certain circumstances, such as doing a restore, when a dynamic index is created, or when cycling nodes in and out. Perhaps caused by load? CPU, heap and disk all look fine. You can see node elkrp13 takes 1ms to decide that the master is no longer available and then goes through the process of finding another, which takes about 30 seconds. After that, it discovers the same master again, however its active primary shards remain in an INITIALIZING state. There is about 120–180k events coming in per minute. Is it possible that this 30-second blip combined with this volume of events creates a situation where the cluster can't recover?
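The timings in these logs line up with the 2.x zen fault-detection defaults. For reference, a sketch of the relevant elasticsearch.yml settings — the values shown are the 2.x defaults, listed for illustration rather than as a recommendation to change them:

```yaml
# elasticsearch.yml — zen fault detection (2.x defaults)
discovery.zen.fd.ping_interval: 1s   # how often a node pings the master
discovery.zen.fd.ping_timeout: 30s   # wait per ping; matches the ~30s gaps above
discovery.zen.fd.ping_retries: 3     # matches "retry [1] out of [3]" in the log
```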
You might have answered it already, but when you stop/start a node or take one out, is your cluster green?
Hi Simon, thanks for your reply. I just purchased a license and am now going through support via my gold subscription. We are using a host-based firewall which is dropping idle connections after 5 minutes. I'm assuming the issue has something to do with that, although I'm currently waiting on a response regarding why the connection is idle and not kept alive via default Elasticsearch settings. I have a meeting with support in 14 hours. We haven't got to the bottom of it yet, however I'll update this issue in case someone else experiences the same problem.
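A firewall dropping idle connections after 5 minutes interacts badly with Linux's default TCP keepalive, which only starts probing after 2 hours of idle time, so an idle inter-node connection can be silently dropped long before the OS would notice. A hedged sketch of OS-level keepalive tuning that starts probing before such a timeout (values are illustrative, not from this case; apply with `sysctl -p`):

```
# /etc/sysctl.conf — begin keepalive probes well inside a 5-minute
# firewall idle timeout (illustrative values, Linux only)
net.ipv4.tcp_keepalive_time = 120    # seconds idle before the first probe
net.ipv4.tcp_keepalive_intvl = 30    # seconds between probes
net.ipv4.tcp_keepalive_probes = 4    # failed probes before the peer is declared dead
```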
thanks @robertsmarty |
It appears that this issue may be caused by the Trend Deep Security agent that we have installed. Profile updates are resetting the connection table and then dropping network connections... http://esupport.trendmicro.com/solution/en-US/1096766.aspx This issue has been fixed in Deep Security 9.5 Service Pack (SP) 1 Patch 3. I'll test and post an update. This could also affect other customers that use this agent due to very strict security requirements. |
Confirmed this is caused by Trend Deep Security agent < 9.5 Service Pack (SP) 1 Patch 3 when stateful inspection is enabled. |
@robertsmarty glad we sorted it out & thanks for reporting back! |
Hi,
This is related to #15431, however now I have some more info. I am processing over 200GB of PA firewall logs per day. The cluster works really well until I make any changes.

I just started rolling 2.1.1 into the 2.1.0 cluster and when the shards started relocating, one of the primary shards got stuck initialising. I couldn't get it working so I wound up trashing the whole cluster and deploying the whole thing again with just 2.1.1. Not a massive task as we have automated deployment, so no problem there. The logs started streaming in again and then I started restoring my data, which was working well until, again, one of the primary shards got stuck initialising. It's also an active shard, with lots of data trying to go in there.

I then tried deleting the index and having it dynamically create again, however the primary shard is stuck again and I have this error in the logs...
[2015-12-23 15:47:18,442][DEBUG][action.admin.indices.recovery] [elkrp14] [indices:monitor/recovery] failed to execute operation for shard [[pa_traffic-2015.12.23][4], node[ijdqvmZHTi66urgpuvCyUQ], [P], v[1], s[INITIALIZING], a[id=3A8c4qmfSU6cggJqOAu9bA], unassigned_info[[reason=INDEX_CREATED], at[2015-12-23T05:38:42.740Z]]]
[pa_traffic-2015.12.23][[pa_traffic-2015.12.23][4]] BroadcastShardOperationFailedException[operation indices:monitor/recovery failed]; nested: IndexNotFoundException[no such index];
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:405)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:382)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:371)
at org.elasticsearch.shield.transport.ShieldServerTransportService$ProfileSecuredRequestHandler.messageReceived(ShieldServerTransportService.java:165)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: [pa_traffic-2015.12.23] IndexNotFoundException[no such index]
at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:310)
at org.elasticsearch.action.admin.indices.recovery.TransportRecoveryAction.shardOperation(TransportRecoveryAction.java:102)
at org.elasticsearch.action.admin.indices.recovery.TransportRecoveryAction.shardOperation(TransportRecoveryAction.java:52)
at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:401)
... 8 more
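Since `indices:monitor/recovery` itself is failing with IndexNotFoundException here, a cluster-level health check is a lighter-weight way to watch whether the initialising shard count drains. A sketch, assuming the default HTTP port; the JSON below is a hand-written sample with the shape `_cluster/health` returns, not output captured from this cluster:

```shell
# Live command would be (assumption: default HTTP port):
#   curl -s 'localhost:9200/_cluster/health'
# Pull the initialising-shard count out of a sample payload.
health='{"cluster_name":"elk","status":"red","initializing_shards":1,"unassigned_shards":2}'
echo "$health" | grep -o '"initializing_shards":[0-9]*'
# → "initializing_shards":1
```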
Any ideas?
Cheers,
Marty