Shards deleted from index (and from disk) on cluster restart #2496
Comments
Hi, we fixed a similar issue (the NPE that caused problems) in the 0.20 final release, and in the latest 0.19 releases. Though it is a rare situation to hit, it seems you hit it. A good practice is to configure the gateway.recover_after_nodes setting to make sure full-cluster-restart recovery only starts once there are enough nodes in the cluster.
Just to be clear, did you fix the NPE or the indices deletion? Or both at the same time? I tried to find a specific commit for that and couldn't find one exactly. I will update to the latest 0.20 before restarting indexation, just to be sure. Update: my recover_after_nodes is set to 15. What is the recommended value for it? I also have discovery.zen.minimum_master_nodes set to 30.
@jgagnon1 we fixed the NPE, and we also fixed a corner case of the deletion. Regarding recover_after_nodes, since it only applies during cluster "startup", you can set it to a high value, something like 40 or even 50.
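For reference, the settings discussed above go in elasticsearch.yml. A sketch for a 51-node cluster like this one; the setting names are real gateway/zen options from this ES era, but the values are illustrative (only "40 or even 50" comes from the thread):

```yaml
# elasticsearch.yml -- illustrative values, tune for your own cluster size
gateway.recover_after_nodes: 50        # hold off recovery until most nodes are back
gateway.recover_after_time: 5m         # extra grace period after the threshold is met
gateway.expected_nodes: 51             # start recovery immediately once all have joined
discovery.zen.minimum_master_nodes: 26 # common recommendation: (master-eligible / 2) + 1
```

With recover_after_nodes set close to the cluster size, a full-cluster restart will not begin shard recovery (and the reallocation decisions that can go wrong) while most of the cluster is still down.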
Thank you very much. I will update my ES version and these settings before restarting anything.
I just had an issue with ES 0.20.0.RC1 that resulted in a cluster with 40 shards missing out of 250 in my production-ready index. Note that the shards are missing not only in ElasticSearch but also on the filesystem. I have 51 nodes in my cluster.
This is the second time similar event happens to me, so I decided to file a bug on that.
I will describe the timeline of events with log snippets that explain what I think happened.
14:12:00 - Cluster is in green state, everything is fine -> sending shutdown via API
14:13:02 - First cluster restart. Restarting nodes 10 at a time (I use tmux, so I do it almost simultaneously)
Note: After the restart, 8 nodes are missing from the cluster.
Logs from one of the servers that is NOT in the cluster:
14:13:13-14:14:00 - Restarting the missing nodes one by one (8 of them)
14:14:08 - NullPointerException on the master, showing unassigned shards (not the ones that are missing?)
Logs from the master:
Logs from one of the nodes that has missing shards:
The last part is repeated for each missing shard.
14:16:00 - At this point I have a red cluster with all the nodes present but 40 shards missing (51 nodes, 210 active shards). Looking at the filesystem, the shard folders are no longer there. I'm wondering what happened; everything happened rather fast.
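The green/red checks in the timeline above can be scripted before proceeding with a rolling restart. A minimal sketch: the sample dict mimics a GET /_cluster/health response (the field names match that API; the threshold logic, function name, and numbers are my own, illustrating the state at 14:16:00):

```python
# Decide whether it is safe to continue a full-cluster restart, based on
# a /_cluster/health-style response (sample data here, not a live call).

def restart_is_safe(health, expected_nodes):
    """Safe only when every expected node has rejoined, nothing is
    unassigned, and the cluster reports green."""
    return (health["number_of_nodes"] >= expected_nodes
            and health["unassigned_shards"] == 0
            and health["status"] == "green")

# Roughly the situation at 14:16:00: all 51 nodes back, 40 shards gone.
sample = {"status": "red", "number_of_nodes": 51,
          "active_shards": 210, "unassigned_shards": 40}

print(restart_is_safe(sample, expected_nodes=51))  # prints False: cluster is red
```

A check like this, run between batches, would have flagged the problem before all 51 nodes had been cycled.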