Shards deleted from index (and from disk) on cluster restart #2496

Closed · jgagnon1 opened this issue Dec 18, 2012 · 4 comments

@jgagnon1

I just had an issue with ES 0.20.0.RC1 that resulted in a cluster with 40 shards missing out of 250 on my production index. Note that the shards are missing not only in Elasticsearch but also on the filesystem. I have 51 nodes in my cluster.

This is the second time a similar event has happened to me, so I decided to file a bug about it.

I will describe the timeline of events with log snippets that explain what I think happened.

14:12:00 - Cluster is in green state, everything is fine -> sending shutdown via API
14:13:02 - First cluster restart. Restarting nodes 10 by 10 (I use tmux, so I do it almost simultaneously)
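
(For context: the "shutdown via API" above refers to the nodes shutdown API of this ES generation, since removed in later versions; the exact calls are not in the original report, but a full-cluster shutdown looked roughly like the sketch below, with host and port purely illustrative.)

    # ask all nodes in the cluster to shut down (nodes shutdown API)
    curl -XPOST 'http://localhost:9200/_shutdown'

    # or shut down only the node that receives the request, one node at a time
    curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'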

Note: After the restart, 8 nodes are missing from the cluster.

Logs from one of the servers that is NOT in the cluster:

[2012-12-18 14:13:17,942][WARN ][discovery.zen.ping.unicast] [es1b] failed to send ping to [[#zen_unicast_3#][inet[es12b.cx.wajam/10.1.16.154:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[es12b.cx.wajam/10.1.16.154:9300]][discovery/zen/unicast] request_id [0] timed out after [3750ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:342)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
[2012-12-18 14:13:17,948][WARN ][discovery.zen.ping.unicast] [es1b] failed to send ping to [[#zen_unicast_9#][inet[es18b.cx.wajam/10.1.16.160:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[es18b.cx.wajam/10.1.16.160:9300]][discovery/zen/unicast] request_id [9] timed out after [3750ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:342)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
[2012-12-18 14:13:17,971][WARN ][discovery.zen.ping.unicast] [es1b] failed to send ping to [[#zen_unicast_28#][inet[es35b.cx.wajam/10.1.16.55:9300]]]

14:13:13-14:14:00 - Restarting the nodes missing from the cluster one by one (8 of them)

14:14:08 - NullPointerException on the master, which shows unassigned shards (not the ones that are missing?)

Logs from the master:

java.lang.NullPointerException
        at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:75)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:198)
        at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:70)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:188)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:138)
        at org.elasticsearch.cluster.routing.RoutingService$1.execute(RoutingService.java:135)
        at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:223)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Logs from one of the nodes that has missing shards:

[2012-12-18 14:14:10,041][WARN ][discovery.zen            ] [es1b] received a cluster state from [[es22b][4FgfHy7vTOOgBZBPodjW2A][inet[/10.1.16.102:9300]]{master=true}] and not part of the cluster, should not happen
[2012-12-18 14:14:10,190][DEBUG][action.admin.indices.status] [es1b] [wajam][119], node[aGZztLJ0TOWe5qMaTvDsBg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@19661482]
org.elasticsearch.transport.RemoteTransportException: [es5b][inet[/10.1.13.135:9300]][indices/status/s]
Caused by: org.elasticsearch.indices.IndexMissingException: [wajam] missing
        at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
[2012-12-18 14:14:10,199][DEBUG][action.admin.cluster.node.stats] [es1b] failed to execute on node [AaYgqtpFStuwxI6fMlJs9w]
org.elasticsearch.transport.RemoteTransportException: [es8b][inet[/10.1.13.138:9300]][cluster/nodes/stats/n]
Caused by: java.lang.NullPointerException
        at org.elasticsearch.action.support.nodes.NodeOperationResponse.writeTo(NodeOperationResponse.java:66)
        at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:290)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:91)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:67)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:276)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:267)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

The last part is repeated for each missing shard.

14:16:00 - At this point I have a red cluster with all the nodes back, but 40 shards missing (51, 210). Looking at the filesystem, the shard folders are not there anymore. I'm wondering what happened; everything happened rather fast...

kimchy (Member) commented Dec 18, 2012

Hi, we fixed a similar issue (the NPE that caused problems) in the 0.20 final release, and in the latest 0.19 releases. Though it is a rare situation to hit, it seems you did hit it...

A good practice is to configure the gateway.recover_after_nodes setting to make sure that recovery after a full cluster restart only starts once there are enough nodes in the cluster.
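
A minimal sketch of the corresponding elasticsearch.yml line, with an illustrative value (sizing for this particular cluster is discussed further down):

    # gateway recovery after a full cluster restart waits until this many nodes have joined
    gateway.recover_after_nodes: 40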

@jgagnon1 (Author)

Just to be clear, did you fix the NPE or the index deletion? Or both at the same time? I tried to find a specific commit for that and couldn't find one. I will update to the latest 0.20 before restarting indexing, but I just want to be sure.

Update: My recover_after_nodes is set to 15. What is the recommended value for it? I also have discovery.zen.minimum_master_nodes set to 30.

kimchy (Member) commented Dec 18, 2012

@jgagnon1 we fixed the NPE, and then we also fixed a corner case of the deletion.

Regarding recover_after_nodes, since it only applies during cluster "startup", you can set it to a high value, something like 40 or even 50.
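
Applied to this 51-node cluster, and keeping the reporter's existing minimum_master_nodes value, the relevant elasticsearch.yml lines might look roughly like this (45 is just one point in the 40-50 range suggested above, not a definitive recommendation):

    # hold off recovery on a full cluster restart until most of the 51 nodes have rejoined
    gateway.recover_after_nodes: 45
    # as already configured by the reporter: master-eligible nodes that must be visible before a master is elected
    discovery.zen.minimum_master_nodes: 30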

@jgagnon1 (Author)

Thank you very much. I will update my ES version and these settings before restarting anything.
