Recovery: Quick cluster state processing can cause relocation finalization to fail and delete both copies #9503

Closed
bleskes opened this issue Jan 30, 2015 · 0 comments
Labels
blocker :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. resiliency v1.5.0 v2.0.0-beta1

bleskes commented Jan 30, 2015

#8570 added some extra protection for the case where a source shard is closed during recovery. However, it also introduces a race condition: if the target shard has moved to POST_RECOVERY, and the master processes the shard started action and activates the shard before the source node completes the recovery, the source node closes the source shard, which cancels the recovery. The target node then receives the cancellation notification and deletes its local copy (still in POST_RECOVERY), so both copies are lost.
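
The ordering problem described above can be illustrated with a small toy model. This is only a sketch of the event ordering, not Elasticsearch code: the `Relocation` class, its method names, and the `run` helper are all hypothetical, and the model assumes that activating the target is what makes the source node close the source shard.

```python
from dataclasses import dataclass


@dataclass
class Relocation:
    """Toy model of a single shard relocation (hypothetical, not Elasticsearch code)."""
    source_open: bool = True
    target_copy_exists: bool = True
    target_state: str = "RECOVERING"
    recovery_done_on_source: bool = False

    def target_enters_post_recovery(self):
        # The target finishes its side and reports "shard started" to the master.
        self.target_state = "POST_RECOVERY"

    def source_completes_recovery(self):
        # The source node can only complete the recovery while the source shard is open.
        if self.source_open:
            self.recovery_done_on_source = True

    def master_activates_target(self):
        # Assumption: activating the target makes the source node close the source shard.
        self.source_open = False
        if self.recovery_done_on_source:
            self.target_state = "STARTED"
        else:
            # The close listener cancels the in-flight recovery and the target
            # deletes its local copy while it is still in POST_RECOVERY.
            self.target_copy_exists = False

    def copies_left(self):
        return int(self.source_open) + int(self.target_copy_exists)


def run(events):
    """Apply the named events in order and return the resulting model."""
    relocation = Relocation()
    for name in events:
        getattr(relocation, name)()
    return relocation


safe = run(["target_enters_post_recovery", "source_completes_recovery", "master_activates_target"])
racy = run(["target_enters_post_recovery", "master_activates_target", "source_completes_recovery"])
print(safe.copies_left(), safe.target_state)
print(racy.copies_left(), racy.target_state)
```

In the `safe` ordering one copy (the target) survives and ends up started. In the `racy` ordering, where the master is quick, no copy is left.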

The extra close listener is not yet released but is part of the 1.5 push.

See: http://build-us-00.elasticsearch.org/job/es_core_1x_debian/3474/

@bleskes bleskes added v2.0.0-beta1 resiliency v1.5.0 :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Jan 30, 2015
@s1monw s1monw added the blocker label Feb 9, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this issue Feb 27, 2015
…emantics

We keep track of the current stage of recovery using an instance of RecoveryState, which is stored on the relevant IndexShard. At the moment, changes to this object are made in many places in the code, each of which is responsible for doing so in the right order, keeping track of timers, and more. Also, the changes to shard state are decoupled from the recovery stages, which caused elastic#9503.

This PR refactors this and brings all of the changes into IndexShard. It also makes all recoveries follow the exact same stages rather than shortcutting some of them. This keeps things simple and always the same (those shortcuts didn't add anything; we ended up doing it all anyway).

Also, all timer management is now folded into RecoveryState and unit tests are added.

This closes elastic#9503 by moving the shard to post recovery only once the recovery is done (previously the two were decoupled), meaning that master promotion of the target shard to started cannot cancel the recovery.

Closes elastic#9902
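
The coupling described in this commit message can be sketched as a single method that owns both transitions. This is a minimal illustration under stated assumptions, not the actual IndexShard code: the `TargetShard` class, its method names, the `"shard-started"` message, and the `INIT`/`DONE` stage names are hypothetical; only the shard state names `RECOVERING` and `POST_RECOVERY` come from the text above.

```python
class TargetShard:
    """Toy model of a recovery target that owns its own state changes (hypothetical)."""

    def __init__(self):
        self.state = "RECOVERING"
        self.recovery_stage = "INIT"  # hypothetical stage name

    def finish_recovery(self):
        # The only place where the recovery stage and the shard state change,
        # always in the same order: recovery is marked done first, and only
        # then does the shard move to POST_RECOVERY.
        if self.state != "RECOVERING":
            raise RuntimeError("shard is not recovering")
        self.recovery_stage = "DONE"
        self.state = "POST_RECOVERY"

    def report_started_to_master(self):
        # The master can only learn about the shard once it is in POST_RECOVERY,
        # which in this model implies the recovery has already finished.
        if self.state != "POST_RECOVERY":
            raise RuntimeError("shard is not in POST_RECOVERY")
        return "shard-started"


shard = TargetShard()
try:
    shard.report_started_to_master()
    early_report_allowed = True
except RuntimeError:
    early_report_allowed = False

shard.finish_recovery()
message = shard.report_started_to_master()
```

Because the shard cannot report itself as started before `finish_recovery` has run, the master in this model has nothing to act on while the recovery is still in flight.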