Recovery: Quick cluster state processing can cause relocation finalization to fail and delete both copies #9503
Labels
blocker
:Distributed/Recovery
Anything around constructing a new shard, either from a local or a remote source.
resiliency
v1.5.0
v2.0.0-beta1
#8570 added extra protection for the case where a source shard is closed during recovery. However, it introduces a race condition: if the target shard has moved to POST_RECOVERY, and the master processes the shard started action and activates the shard before the source node completes the recovery, the source node closes the source shard, which cancels the recovery. The target node then receives the cancellation notification and deletes its local copy (still in POST_RECOVERY), so both copies are lost.
The extra close listener is not yet released but is part of the 1.5 push.
See: http://build-us-00.elasticsearch.org/job/es_core_1x_debian/3474/
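The ordering problem described above can be illustrated with a small standalone simulation. This is only a sketch of the sequence of events, not Elasticsearch code: the class `RelocationRaceSketch`, the `simulate` method, and the `ShardState` enum are hypothetical names introduced for illustration, and the model assumes the close listener cancels any recovery that the source has not yet completed.

```java
public class RelocationRaceSketch {

    // Simplified, hypothetical shard states; not the real Elasticsearch enum.
    enum ShardState { POST_RECOVERY, STARTED, CLOSED, DELETED }

    /**
     * Replays the relocation finalization in one of two orders and returns
     * the number of shard copies that survive.
     *
     * @param masterActivatesFirst true if the cluster state that activates the
     *        target reaches the source node before the source has completed
     *        the recovery (the racy ordering from the issue)
     */
    static int simulate(boolean masterActivatesFirst) {
        ShardState source = ShardState.STARTED;
        // The target has already moved to POST_RECOVERY and sent "shard started".
        ShardState target = ShardState.POST_RECOVERY;

        // In the safe ordering the source completes the recovery first.
        boolean recoveryCompletedOnSource = !masterActivatesFirst;

        // The new cluster state arrives: the target is now the active copy,
        // so the source node closes its copy of the shard.
        source = ShardState.CLOSED;

        if (!recoveryCompletedOnSource) {
            // Assumed behaviour of the close listener: closing the source
            // cancels the in-flight recovery. The target is notified and
            // deletes its local copy while still in POST_RECOVERY.
            if (target == ShardState.POST_RECOVERY) {
                target = ShardState.DELETED;
            }
        } else {
            // Recovery had already completed, so the target simply starts.
            target = ShardState.STARTED;
        }

        int surviving = 0;
        if (source == ShardState.STARTED) surviving++;
        if (target == ShardState.STARTED) surviving++;
        return surviving;
    }

    public static void main(String[] args) {
        System.out.println("source finishes first, surviving copies: " + simulate(false));
        System.out.println("master activates first, surviving copies: " + simulate(true));
    }
}
```

In this model the only difference between the two runs is whether the source finished the recovery before it applied the new cluster state, which is why quick cluster state processing is enough to trigger the failure.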