
Handling disk and file system permission issues on new index creation #19789

Closed
ppf2 opened this issue Aug 3, 2016 · 8 comments
Labels: :Distributed/Distributed, >enhancement, resiliency

@ppf2 (Member) commented Aug 3, 2016

Elasticsearch version: 2.3.2

This is an attempt to simulate a bad disk that has turned read-only.

  1. Start 2 data nodes.
  2. Set the permissions on the data path of one of the nodes (node-2) to read-only for all users (see the sketch below the shard listing).
  3. Create a new index (e.g. one with only primary shards).
  4. All shards allocated to node-2 for this new index are unassigned (as expected, since it cannot write to its file system), and the primary shards on the good node are allocated successfully:

testindex 3 p STARTED 0 130b 127.0.0.1 node1
testindex 4 p UNASSIGNED
testindex 2 p UNASSIGNED
testindex 1 p STARTED 0 130b 127.0.0.1 node1
testindex 0 p UNASSIGNED
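
For step 2, here is a minimal Java sketch of one way to flip a node's data path to read-only; the path below is hypothetical, and running chmod -R a-w on the data directory achieves the same thing:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.stream.Stream;

public class MakeDataPathReadOnly {
    public static void main(String[] args) throws IOException {
        // Hypothetical location of node-2's data path; adjust to the actual install.
        Path dataPath = Paths.get("/path/to/elasticsearch-2.3.2_node2/data");
        try (Stream<Path> paths = Files.walk(dataPath)) {
            paths.forEach(p -> {
                try {
                    // Keep read (and execute for directories) but drop all write bits,
                    // so the node can still start and read files but cannot write state.
                    String perms = Files.isDirectory(p) ? "r-xr-xr-x" : "r--r--r--";
                    Files.setPosixFilePermissions(p, PosixFilePermissions.fromString(perms));
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}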

However, the master node does not recover from this scenario well. It keeps trying to load the shards onto node-2, essentially perpetually, for as long as node-2 is running with a read-only file system.

[2016-08-03 11:45:04,927][WARN ][gateway                  ] [node1] [testindex][2]: failed to list shard for shard_started on node [Cr5kuAANQAi7dShxyW83Jg]
FailedNodeException[Failed node [Cr5kuAANQAi7dShxyW83Jg]]; nested: RemoteTransportException[[node2][127.0.0.1:9301][internal:gateway/local/started_shards[n]]]; nested: ElasticsearchException[failed to load started shards]; nested: NotSerializableExceptionWrapper[access_denied_exception: /Users/User/ELK/ElasticStack_2_0/elasticsearch-2.3.2_node2/data/my-application1/nodes/0/indices/testindex/2/_state];
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:206)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$800(TransportNodesAction.java:106)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$2.handleException(TransportNodesAction.java:179)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleException(MessageChannelHandler.java:212)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:202)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:136)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: RemoteTransportException[[node2][127.0.0.1:9301][internal:gateway/local/started_shards[n]]]; nested: ElasticsearchException[failed to load started shards]; nested: NotSerializableExceptionWrapper[access_denied_exception: /Users/User/ELK/ElasticStack_2_0/elasticsearch-2.3.2_node2/data/my-application1/nodes/0/indices/testindex/2/_state];
Caused by: ElasticsearchException[failed to load started shards]; nested: NotSerializableExceptionWrapper[access_denied_exception: /Users/User/ELK/ElasticStack_2_0/elasticsearch-2.3.2_node2/data/my-application1/nodes/0/indices/testindex/2/_state];
    at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:154)
    at org.elasticsearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:59)
    at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:92)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:230)
    at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:226)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)

And these tasks keep getting added and re-added to the pending tasks queue without end:

{
  "tasks": [
    {
      "insert_order": 141469,
      "priority": "HIGH",
      "source": "cluster_reroute(async_shard_fetch)",
      "executing": true,
      "time_in_queue_millis": 1,
      "time_in_queue": "1ms"
    },
    {
      "insert_order": 141470,
      "priority": "HIGH",
      "source": "cluster_reroute(async_shard_fetch)",
      "executing": false,
      "time_in_queue_millis": 0,
      "time_in_queue": "0s"
    }
  ]
}
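
The output above comes from the pending cluster tasks API (GET _cluster/pending_tasks). A self-contained Java sketch for watching the queue repeat, assuming a node listening on the default HTTP port 9200:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class WatchPendingTasks {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://127.0.0.1:9200/_cluster/pending_tasks");
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    body.append(line);
                }
                // With the read-only node running, cluster_reroute(async_shard_fetch)
                // entries keep reappearing in this output.
                System.out.println(body);
            } finally {
                conn.disconnect();
            }
            Thread.sleep(1000);
        }
    }
}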

This keeps going forever until node-2 is stopped and the underlying file system issue is addressed.

Once node-2 is started back up with a writable data path, you end up with a red index, because the master does not go back and retry the allocation there:

testindex 3 p STARTED 0 130b 127.0.0.1 node1
testindex 4 p UNASSIGNED
testindex 2 p UNASSIGNED
testindex 1 p STARTED 0 130b 127.0.0.1 node1
testindex 0 p UNASSIGNED

It seems like there is an opportunity to handle this better:

  • Should the master retry forever, spewing large numbers of exceptions and performing async_shard_fetch actions over and over, or should it give up at some point?
  • Note that even though the master currently retries async_shard_fetch forever, if I fix the permission issues on node-2's data folder while this is happening, the master stops complaining about the read permission issues as if things are good now. However, the shards remain unassigned, so it does not actually go and try to start them on node-2. Even after a full cluster restart, testindex remains red with the unassigned shards.
@ppf2 ppf2 added the resiliency label Aug 3, 2016
@jasontedor (Member)

Duplicates #18417, relates #18467

@ywelsch (Contributor) commented Aug 4, 2016

There is another issue at play here:

The issue is that the index-creation context is lost after the primary shard fails to initialize on node-2 (a variation of #15241, only fixed by allocation ids in v5.0.0). This means that after the first failed attempt to initialize the primary shard, it is treated as an existing shard copy to be recovered instead of a new one (I've added some notes below on how to detect this situation). It also means that the primary shard allocator searches the nodes for a copy of the data to allocate the shard to, which triggers async_shard_fetch. As there are no on-disk copies of this shard, the shard can never be allocated. The only way to get the shard un-stuck is to either delete the index or use the reroute allocation command to force a fresh primary (which is implemented by restoring the index-creation context for that shard). In v5.0.0, the index-creation context is preserved and the primary allocates to node-1 after the failed attempt on node-2.
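
For reference, a rough sketch of the 2.x reroute allocation command mentioned above, sent as a plain HTTP request; the index/shard/node values come from the reproduction above, and the exact command body should be double-checked against the 2.3 reroute documentation before use, since allow_primary discards any existing data for that shard:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class ForceFreshPrimarySketch {
    public static void main(String[] args) throws Exception {
        // Assumes a node listening on the default HTTP port; adjust index/shard/node as needed.
        URL url = new URL("http://127.0.0.1:9200/_cluster/reroute");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // 2.x-style allocate command; allow_primary accepts data loss for that shard.
        String body = "{\"commands\":[{\"allocate\":"
                + "{\"index\":\"testindex\",\"shard\":2,\"node\":\"node1\",\"allow_primary\":true}}]}";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}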

How to detect that index-creation context is lost in v1.x/v2.x:

This can be seen by looking at the unassignedInfo on the unassigned primary shard, which shows something like this:

{"state":"UNASSIGNED","primary":true,"node":null,"relocating_node":null,"shard":5,"index":"some_index","version":2,"unassigned_info":{"reason":"ALLOCATION_FAILED","at":"2016-08-01T00:00:28.958Z","details":"failed to create shard, ... "}}

The index-creation context is correctly set if the unassigned_info object has "reason":"INDEX_CREATED".
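
To illustrate that distinction, a self-contained simplification (not actual Elasticsearch code): in v1.x/v2.x a fresh, empty primary is only acceptable while the unassigned reason still says INDEX_CREATED; once it has flipped to ALLOCATION_FAILED, an existing on-disk copy is required.

// Illustrative only; the enum mirrors a few of the real UnassignedInfo reasons.
public class PrimaryRecoverySketch {

    enum UnassignedReason { INDEX_CREATED, CLUSTER_RECOVERED, ALLOCATION_FAILED }

    // In the pre-5.0 model described above, only INDEX_CREATED allows starting an empty copy.
    static boolean mayStartAsFreshCopy(UnassignedReason reason) {
        return reason == UnassignedReason.INDEX_CREATED;
    }

    public static void main(String[] args) {
        // After the failed attempt on the read-only node the reason is ALLOCATION_FAILED,
        // so the allocator hunts for on-disk data that was never written and gets stuck.
        System.out.println(mayStartAsFreshCopy(UnassignedReason.ALLOCATION_FAILED)); // false
        System.out.println(mayStartAsFreshCopy(UnassignedReason.INDEX_CREATED));     // true
    }
}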

In v5.0.0, the decision whether a fresh shard copy is expected (as after index creation) or an existing copy is required (when recovering a previously started shard after a cluster restart) is no longer based on the unassigned_info but on allocation ids. I have already opened a PR (#19516) that will make it easier to see in v5.0.0 whether a primary is going to recover as a fresh initial shard copy or requires an existing on-disk shard copy.

@ywelsch (Contributor) commented Aug 4, 2016

This also shows another subtle issue (even in current master) which is related to async_shard_fetch:

Assume for simplicity an index with 1 primary shard and no replicas. The primary shard was successfully started at some point on data node X, but we have now done a full cluster restart, and node X has disk permission issues on the shard directory after the restart. When the master tries to allocate the primary after the cluster restart, it first does an async_shard_fetch, which fails hard on node X because async_shard_fetch (TransportNodesListGatewayStartedShards#nodeOperation) treats the AccessDeniedException thrown while listing the shard directory contents as a hard failure of the node. The async_shard_fetch code marks this node as failed and queues a delayed reroute (essentially another allocation attempt) to be executed after the current one. This means, however, that if the primary cannot be allocated in this round, another round is triggered that leads to shard fetching again (and again and again). Even #18467 does not help in this scenario, because we don't count these as failed attempts: after shard fetching, the code in PrimaryShardAllocator checks whether there is a node that has any data (node X is disregarded because it failed), and when no data is found it simply ignores allocation of this primary shard:

if (enoughAllocationsFound == false) {
    // we can't really allocate, so ignore it and continue
    changed |= unassignedIterator.removeAndIgnore(AllocationStatus.NO_VALID_SHARD_COPY);
}
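
A self-contained simplification (not actual Elasticsearch code) of the cycle this produces: every fetch round reports node X as failed, finds no valid copy, ignores the shard, and the delayed reroute queued by the failed fetch kicks off the same round again.

// Illustrative only: models why the reroute/fetch cycle never terminates when the single
// node that could hold the data keeps failing the directory listing.
public class EndlessFetchSketch {

    enum FetchResult { HAS_COPY, NO_COPY, NODE_FAILED /* e.g. AccessDeniedException on listing */ }

    static FetchResult fetchFromNodeX() {
        // The broken file system makes even a directory listing fail, so the whole node
        // is reported as failed rather than as "has no copy of this shard".
        return FetchResult.NODE_FAILED;
    }

    public static void main(String[] args) {
        boolean rerouteQueued = true;
        int round = 0;
        while (rerouteQueued && round < 5) {        // in the real cluster there is no bound
            round++;
            rerouteQueued = false;
            FetchResult result = fetchFromNodeX();
            if (result == FetchResult.HAS_COPY) {
                System.out.println("round " + round + ": allocating primary");
                break;
            }
            if (result == FetchResult.NODE_FAILED) {
                // Failed fetch -> node disregarded for this round and a delayed reroute queued.
                rerouteQueued = true;
            }
            // No node with data -> removeAndIgnore(NO_VALID_SHARD_COPY); nothing counts as a
            // failed allocation attempt, so retry limits like #18467 never kick in.
            System.out.println("round " + round + ": no valid shard copy, ignoring");
        }
    }
}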

@ppf2 ppf2 reopened this Aug 4, 2016
@ppf2 (Member, Author) commented Aug 4, 2016

To clarify per discussion with @ywelsch (thx):

The first issue, where the shard loses its index-creation context after the first failed allocation attempt, is solved in v5.0.0 by allocation ids (#14739).

The second issue is that async shard fetching can, in a certain situation, be triggered again and again when no existing shard copy can be found to allocate as primary. This occurs when just doing a listFiles on the shard directory during shard fetching on a data node already throws an exception. We are keeping this issue open to track that second problem.

@gmoskovicz (Contributor)

It would be great to do something when a disk goes read-only. This seems to be the default behavior in some Linux OSs when there are issues (such as corruption or problems with the mounted disk).

Also, to avoid this, could we mention in the documentation that RAID 0 could be helpful?

@lcawl lcawl added the :Distributed/Distributed label and removed the :Allocation label Feb 13, 2018
@bleskes (Contributor) commented Mar 20, 2018

@ywelsch This still seems to be an issue: failing to read a disk on a data node can lead to endless shard fetching. I am inclined to open a dedicated issue for that. Do you agree?

@ywelsch (Contributor) commented Mar 20, 2018

This could also be seen as falling under the umbrella of #18417, even if the issue technically happens before the shard is even allocated to the broken FS / node. How about closing this one and adding a comment to the linked issue?

@bleskes (Contributor) commented Mar 21, 2018

Works for me. Added a comment to #18417.

@bleskes bleskes closed this as completed Mar 21, 2018