Shards deleted from index (and from disk) on cluster restart #2496

Closed · jgagnon1 opened this issue Dec 18, 2012 · 4 comments

@jgagnon1

I just had an issue with ES 0.20.0.RC1 that resulted in a cluster with 40 shards missing out of 250 on my production index. Note that the shards are missing not only in Elasticsearch but also on the filesystem. I have 51 nodes in my cluster.

This is the second time a similar event has happened to me, so I decided to file a bug about it.

I will describe the timeline of events with log snippets that explain what I think happened.

14:12:00 - Cluster is in green state, everything is fine -> sending shutdown via API
14:13:02 - First cluster restart. Restarting nodes 10 by 10 (I use tmux, so I do it almost simultaneously)
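
(For context: the "shutdown via API" above refers to the nodes shutdown API of this ES generation, since removed in later versions; the exact calls are not in the original report, but a full-cluster shutdown looked roughly like the sketch below, with host and port purely illustrative.)

    # ask all nodes in the cluster to shut down (nodes shutdown API)
    curl -XPOST 'http://localhost:9200/_shutdown'

    # or shut down only the node that receives the request, one node at a time
    curl -XPOST 'http://localhost:9200/_cluster/nodes/_local/_shutdown'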

Note: After the restart, 8 nodes are missing from the cluster.

Logs from one of the servers that is NOT in the cluster:

[2012-12-18 14:13:17,942][WARN ][discovery.zen.ping.unicast] [es1b] failed to send ping to [[#zen_unicast_3#][inet[es12b.cx.wajam/10.1.16.154:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[es12b.cx.wajam/10.1.16.154:9300]][discovery/zen/unicast] request_id [0] timed out after [3750ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:342)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
[2012-12-18 14:13:17,948][WARN ][discovery.zen.ping.unicast] [es1b] failed to send ping to [[#zen_unicast_9#][inet[es18b.cx.wajam/10.1.16.160:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[es18b.cx.wajam/10.1.16.160:9300]][discovery/zen/unicast] request_id [9] timed out after [3750ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:342)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
[2012-12-18 14:13:17,971][WARN ][discovery.zen.ping.unicast] [es1b] failed to send ping to [[#zen_unicast_28#][inet[es35b.cx.wajam/10.1.16.55:9300]]]

14:13:13-14:14:00 - Restarting the nodes missing from the cluster one by one (8 of them)

14:14:08 - NullPointerException on the master, which shows unassigned shards (not the ones that are missing?)

Logs from the master:

java.lang.NullPointerException
        at org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:75)
        at org.elasticsearch.gateway.local.LocalGatewayAllocator.allocateUnassigned(LocalGatewayAllocator.java:198)
        at org.elasticsearch.cluster.routing.allocation.allocator.ShardsAllocators.allocateUnassigned(ShardsAllocators.java:70)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:188)
        at org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:138)
        at org.elasticsearch.cluster.routing.RoutingService$1.execute(RoutingService.java:135)
        at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:223)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Logs from one of the nodes that has missing shards:

[2012-12-18 14:14:10,041][WARN ][discovery.zen            ] [es1b] received a cluster state from [[es22b][4FgfHy7vTOOgBZBPodjW2A][inet[/10.1.16.102:9300]]{master=true}] and not part of the cluster, should not happen
[2012-12-18 14:14:10,190][DEBUG][action.admin.indices.status] [es1b] [wajam][119], node[aGZztLJ0TOWe5qMaTvDsBg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@19661482]
org.elasticsearch.transport.RemoteTransportException: [es5b][inet[/10.1.13.135:9300]][indices/status/s]
Caused by: org.elasticsearch.indices.IndexMissingException: [wajam] missing
        at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:244)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:152)
        at org.elasticsearch.action.admin.indices.status.TransportIndicesStatusAction.shardOperation(TransportIndicesStatusAction.java:59)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:398)
        at org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:384)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
[2012-12-18 14:14:10,199][DEBUG][action.admin.cluster.node.stats] [es1b] failed to execute on node [AaYgqtpFStuwxI6fMlJs9w]
org.elasticsearch.transport.RemoteTransportException: [es8b][inet[/10.1.13.138:9300]][cluster/nodes/stats/n]
Caused by: java.lang.NullPointerException
        at org.elasticsearch.action.support.nodes.NodeOperationResponse.writeTo(NodeOperationResponse.java:66)
        at org.elasticsearch.action.admin.cluster.node.stats.NodeStats.writeTo(NodeStats.java:290)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:91)
        at org.elasticsearch.transport.netty.NettyTransportChannel.sendResponse(NettyTransportChannel.java:67)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:276)
        at org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:267)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:268)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

The last part is repeated for each missing shard.

14:16:00 - At this point I have a red cluster with all the nodes back, but 40 shards missing (51, 210). Looking at the filesystem, the shard folders are not there anymore. I'm wondering what happened; everything happened rather fast...

kimchy (Member) commented Dec 18, 2012

Hi, we fixed a similar issue (the NPE that caused problems) in the 0.20 final release, and in the latest 0.19 releases. Though it is a rare situation to hit, it seems you did hit it...

A good practice is to configure the gateway.recover_after_nodes setting to make sure that recovery after a full cluster restart only starts once there are enough nodes in the cluster.
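
A minimal sketch of the corresponding elasticsearch.yml line, with an illustrative value (sizing for this particular cluster is discussed further down):

    # gateway recovery after a full cluster restart waits until this many nodes have joined
    gateway.recover_after_nodes: 40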

@jgagnon1 (Author)

Just to be clear, did you fix the NPE or the index deletion? Or both at the same time? I tried to find a specific commit for that and couldn't find one. I will update to the latest 0.20 before restarting indexing, but I just want to be sure.

Update: My recover_after_nodes is set to 15. What is the recommended value for it? I also have discovery.zen.minimum_master_nodes set to 30.

kimchy (Member) commented Dec 18, 2012

@jgagnon1 we fixed the NPE, and then we also fixed a corner case of the deletion.

Regarding recover_after_nodes, since it only applies during cluster "startup", you can set it to a high value, something like 40 or even 50.
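
Applied to this 51-node cluster, and keeping the reporter's existing minimum_master_nodes value, the relevant elasticsearch.yml lines might look roughly like this (45 is just one point in the 40-50 range suggested above, not a definitive recommendation):

    # hold off recovery on a full cluster restart until most of the 51 nodes have rejoined
    gateway.recover_after_nodes: 45
    # as already configured by the reporter: master-eligible nodes that must be visible before a master is elected
    discovery.zen.minimum_master_nodes: 30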

@jgagnon1 (Author)

Thank you very much. I will update my ES version and these settings before restarting anything.
