
Handling disk and file system permission issues on new index creation #19789

Closed
ppf2 opened this issue Aug 3, 2016 · 8 comments
Labels: :Distributed/Distributed, >enhancement, resiliency

@ppf2 (Member) commented Aug 3, 2016

Elasticsearch version: 2.3.2

This is an attempt to simulate a bad disk that has turned read-only.

  1. Start 2 data nodes.
  2. Set the permissions on the data path of one of the nodes (node-2) to read-only for all users (see the sketch below the shard listing).
  3. Create a new index (e.g. one with only primary shards).
  4. All shards allocated to node-2 for this new index are unassigned (as expected, since it cannot write to its file system), and the primary shards on the good node are allocated successfully:

testindex 3 p STARTED 0 130b 127.0.0.1 node1
testindex 4 p UNASSIGNED
testindex 2 p UNASSIGNED
testindex 1 p STARTED 0 130b 127.0.0.1 node1
testindex 0 p UNASSIGNED
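
For step 2, here is a minimal Java sketch of one way to flip a node's data path to read-only; the path below is hypothetical, and running chmod -R a-w on the data directory achieves the same thing:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.stream.Stream;

public class MakeDataPathReadOnly {
    public static void main(String[] args) throws IOException {
        // Hypothetical location of node-2's data path; adjust to the actual install.
        Path dataPath = Paths.get("/path/to/elasticsearch-2.3.2_node2/data");
        try (Stream<Path> paths = Files.walk(dataPath)) {
            paths.forEach(p -> {
                try {
                    // Keep read (and execute for directories) but drop all write bits,
                    // so the node can still start and read files but cannot write state.
                    String perms = Files.isDirectory(p) ? "r-xr-xr-x" : "r--r--r--";
                    Files.setPosixFilePermissions(p, PosixFilePermissions.fromString(perms));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}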

However, the master node does not recover from this scenario well. It keeps trying to load the shards onto node-2, essentially perpetually, for as long as node-2 is running with a read-only file system.

[2016-08-03 11:45:04,927][WARN ][gateway                  ] [node1] [testindex][2]: failed to list shard for shard_started on node [Cr5kuAANQAi7dShxyW83Jg]
FailedNodeException[Failed node [Cr5kuAANQAi7dShxyW83Jg]]; nested: RemoteTransportException[[node2][127.0.0.1:9301][internal:gateway/local/started_shards[n]]]; nested: ElasticsearchException[failed to load started shards]; nested: NotSerializableExceptionWrapper[access_denied_exception: /Users/User/ELK/ElasticStack_2_0/elasticsearch-2.3.2_node2/data/my-application1/nodes/0/indices/testindex/2/_state];
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$800(TransportNodesAction.java:106)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$2.handleException(TransportNodesAction.java:179)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:212)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:202)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:136)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: RemoteTransportException[[node2][127.0.0.1:9301][internal:gateway/local/started_shards[n]]]; nested: ElasticsearchException[failed to load started shards]; nested: NotSerializableExceptionWrapper[access_denied_exception: /Users/User/ELK/ElasticStack_2_0/elasticsearch-2.3.2_node2/data/my-application1/nodes/0/indices/testindex/2/_state];
Caused by: ElasticsearchException[failed to load started shards]; nested: NotSerializableExceptionWrapper[access_denied_exception: /Users/User/ELK/ElasticStack_2_0/elasticsearch-2.3.2_node2/data/my-application1/nodes/0/indices/testindex/2/_state];
    at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:154)
    at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:59)
    at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:92)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:230)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:226)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)

And these tasks keep getting added and re-added to the pending tasks queue without end:

{
  "tasks": [
    {
      "insert_order": 141469,
      "priority": "HIGH",
      "source": "cluster_reroute(async_shard_fetch)",
      "executing": true,
      "time_in_queue_millis": 1,
      "time_in_queue": "1ms"
    },
    {
      "insert_order": 141470,
      "priority": "HIGH",
      "source": "cluster_reroute(async_shard_fetch)",
      "executing": false,
      "time_in_queue_millis": 0,
      "time_in_queue": "0s"
    }
  ]
}
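
The output above comes from the pending cluster tasks API (GET _cluster/pending_tasks). A self-contained Java sketch for watching the queue repeat, assuming a node listening on the default HTTP port 9200:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WatchPendingTasks {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://127.0.0.1:9200/_cluster/pending_tasks");
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line);
                }
                // With the read-only node running, cluster_reroute(async_shard_fetch)
                // entries keep reappearing in this output.
                System.out.println(body);
            } finally {
                conn.disconnect();
            }
            Thread.sleep(1000);
        }
    }
}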

This keeps going forever until node-2 is stopped and the underlying file system issue is addressed.

Once node-2 is started back up with a writable data path, you end up with a red index, because the master does not go back and retry the allocation there:

testindex 3 p STARTED 0 130b 127.0.0.1 node1
testindex 4 p UNASSIGNED
testindex 2 p UNASSIGNED
testindex 1 p STARTED 0 130b 127.0.0.1 node1
testindex 0 p UNASSIGNED

It seems like there is an opportunity to handle this better:

  • Should the master retry forever, spewing large numbers of exceptions and performing async_shard_fetch actions over and over, or should it give up at some point?
  • Note that even though the master currently retries async_shard_fetch forever, if I fix the permission issues on node-2's data folder while this is happening, the master stops complaining about the read permission issues as if things are good now. However, the shards remain unassigned, so it does not actually go and try to start them on node-2. Even after a full cluster restart, testindex remains red with the unassigned shards.
@ppf2 ppf2 added the resiliency label Aug 3, 2016
@jasontedor (Member)

Duplicates #18417, relates #18467

@ywelsch (Contributor) commented Aug 4, 2016

There is another issue at play here:

The issue is that the index-creation context is lost after the primary shard fails to initialize on node-2 (a variation of #15241, only fixed by allocation ids in v5.0.0). This means that after the first failed attempt to initialize the primary shard, it is treated as an existing shard copy to be recovered instead of a new one (I've added some notes below on how to detect this situation). It also means that the primary shard allocator searches the nodes for a copy of the data to allocate the shard to, which triggers async_shard_fetch. As there are no on-disk copies of this shard, the shard can never be allocated. The only way to get the shard un-stuck is to either delete the index or use the reroute allocation command to force a fresh primary (which is implemented by restoring the index-creation context for that shard). In v5.0.0, the index-creation context is preserved and the primary allocates to node-1 after the failed attempt on node-2.
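
For reference, a rough sketch of the 2.x reroute allocation command mentioned above, sent as a plain HTTP request; the index/shard/node values come from the reproduction above, and the exact command body should be double-checked against the 2.3 reroute documentation before use, since allow_primary discards any existing data for that shard:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ForceFreshPrimarySketch {
    public static void main(String[] args) throws Exception {
        // Assumes a node listening on the default HTTP port; adjust index/shard/node as needed.
        URL url = new URL("http://127.0.0.1:9200/_cluster/reroute");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // 2.x-style allocate command; allow_primary accepts data loss for that shard.
        String body = "{\"commands\":[{\"allocate\":"
                + "{\"index\":\"testindex\",\"shard\":2,\"node\":\"node1\",\"allow_primary\":true}}]}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}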

How to detect that index-creation context is lost in v1.x/v2.x:

This can be seen by looking at the unassignedInfo on the unassigned primary shard, which shows something like this:

{"state":"UNASSIGNED","primary":true,"node":null,"relocating_node":null,"shard":5,"index":"some_index","version":2,"unassigned_info":{"reason":"ALLOCATION_FAILED","at":"2016-08-01T00:00:28.958Z","details":"failed to create shard, ... "}}

The index-creation context is correctly set if the unassigned_info object has "reason":"INDEX_CREATED".
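
To illustrate that distinction, a self-contained simplification (not actual Elasticsearch code): in v1.x/v2.x a fresh, empty primary is only acceptable while the unassigned reason still says INDEX_CREATED; once it has flipped to ALLOCATION_FAILED, an existing on-disk copy is required.

// Illustrative only; the enum mirrors a few of the real UnassignedInfo reasons.
public class PrimaryRecoverySketch {

    enum UnassignedReason { INDEX_CREATED, CLUSTER_RECOVERED, ALLOCATION_FAILED }

    // In the pre-5.0 model described above, only INDEX_CREATED allows starting an empty copy.
    static boolean mayStartAsFreshCopy(UnassignedReason reason) {
        return reason == UnassignedReason.INDEX_CREATED;
    }

    public static void main(String[] args) {
        // After the failed attempt on the read-only node the reason is ALLOCATION_FAILED,
        // so the allocator hunts for on-disk data that was never written and gets stuck.
        System.out.println(mayStartAsFreshCopy(UnassignedReason.ALLOCATION_FAILED)); // false
        System.out.println(mayStartAsFreshCopy(UnassignedReason.INDEX_CREATED));     // true
    }
}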

In v5.0.0, the decision whether a fresh shard copy is expected (as after index creation) or an existing copy is required (when recovering a previously started shard after a cluster restart) is no longer based on the unassigned_info but on allocation ids. I have already opened a PR (#19516) that will make it easier to see in v5.0.0 whether a primary is going to recover as a fresh initial shard copy or requires an existing on-disk shard copy.

@ywelsch (Contributor) commented Aug 4, 2016

This also shows another subtle issue (even in current master) which is related to async_shard_fetch:

Assume for simplicity an index with 1 primary shard and no replicas. The primary shard was successfully started at some point on data node X, but we have now done a full cluster restart, and node X has disk permission issues on the shard directory after the restart. When the master tries to allocate the primary after the cluster restart, it first does an async_shard_fetch, which fails hard on node X because async_shard_fetch (TransportNodesListGatewayStartedShards#nodeOperation) treats the AccessDeniedException thrown while listing the shard directory contents as a hard failure of the node. The async_shard_fetch code marks this node as failed and queues a delayed reroute (essentially another allocation attempt) to be executed after the current one. This means, however, that if the primary cannot be allocated in this round, another round is triggered that leads to shard fetching again (and again and again). Even #18467 does not help in this scenario, because we don't count these as failed attempts: after shard fetching, the code in PrimaryShardAllocator checks whether there is a node that has any data (node X is disregarded because it failed), and when no data is found it simply ignores allocation of this primary shard:

if (enoughAllocationsFound == false) {
    // we can't really allocate, so ignore it and continue
    changed |= unassignedIterator.removeAndIgnore(AllocationStatus.NO_VALID_SHARD_COPY);
}
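
A self-contained simplification (not actual Elasticsearch code) of the cycle this produces: every fetch round reports node X as failed, finds no valid copy, ignores the shard, and the delayed reroute queued by the failed fetch kicks off the same round again.

// Illustrative only: models why the reroute/fetch cycle never terminates when the single
// node that could hold the data keeps failing the directory listing.
public class EndlessFetchSketch {

    enum FetchResult { HAS_COPY, NO_COPY, NODE_FAILED /* e.g. AccessDeniedException on listing */ }

    static FetchResult fetchFromNodeX() {
        // The broken file system makes even a directory listing fail, so the whole node
        // is reported as failed rather than as "has no copy of this shard".
        return FetchResult.NODE_FAILED;
    }

    public static void main(String[] args) {
        boolean rerouteQueued = true;
        int round = 0;
        while (rerouteQueued && round < 5) {        // in the real cluster there is no bound
            round++;
            rerouteQueued = false;
            FetchResult result = fetchFromNodeX();
            if (result == FetchResult.HAS_COPY) {
                System.out.println("round " + round + ": allocating primary");
                break;
            }
            if (result == FetchResult.NODE_FAILED) {
                // Failed fetch -> node disregarded for this round and a delayed reroute queued.
                rerouteQueued = true;
            }
            // No node with data -> removeAndIgnore(NO_VALID_SHARD_COPY); nothing counts as a
            // failed allocation attempt, so retry limits like #18467 never kick in.
            System.out.println("round " + round + ": no valid shard copy, ignoring");
        }
    }
}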

@ppf2 ppf2 reopened this Aug 4, 2016
@ppf2 (Member, Author) commented Aug 4, 2016

To clarify per discussion with @ywelsch (thx):

The first issue, where the shard loses its index-creation context after the first failed allocation attempt, is solved in v5.0.0 by allocation ids (#14739).

The second issue is that async shard fetching can, in a certain situation, be triggered again and again when no existing shard copy can be found to allocate as primary. This occurs when just doing a listFiles on the shard directory during shard fetching on a data node already throws an exception. We are keeping this issue open to track that second problem.

@gmoskovicz (Contributor)

It would be great to do something when a disk goes read-only. This seems to be the default behavior in some Linux OSs when there are issues (such as corruption or problems with the mounted disk).

Also, to avoid this, could we mention in the documentation that RAID 0 could be helpful?

@lcawl lcawl added the :Distributed/Distributed label and removed the :Allocation label Feb 13, 2018
@bleskes (Contributor) commented Mar 20, 2018

@ywelsch This still seems to be an issue: failing to read a disk on a data node can lead to endless shard fetching. I am inclined to open a dedicated issue for that. Do you agree?

@ywelsch (Contributor) commented Mar 20, 2018

This could also be seen as falling under the umbrella of #18417, even if the issue technically happens before the shard is even allocated to the broken FS / node. How about closing this one and adding a comment to the linked issue?

@bleskes (Contributor) commented Mar 21, 2018

Works for me. Added a comment to #18417.

@bleskes bleskes closed this as completed Mar 21, 2018