
Restart recovery upon mapping changes during translog replay #11363

Closed

Conversation

bleskes
Contributor

@bleskes bleskes commented May 27, 2015

On rare occasions, the translog replay phase of recovery may require mapping changes on the target shard. This can happen when indexing on the primary introduces new mappings while the recovery is in phase 1. If the source node processes the new mapping from the master (allowing the indexing to proceed) before the target node does, and the recovery moves on to phase 2 (translog replay) before the target has caught up, the translog operations arriving on the target node may refer to mappings the target does not yet have. Since this is extremely rare, we opt for a simple fix: restart the recovery. Note that in this case the file copy phase will likely be very short, as the files are already in sync.

Restarting a recovery in such a late phase means we may need to copy segments_N files, and/or files that were quickly merged away on the target, again. This trips the write-once protection in our testing infra. To work around it I have introduced a counter in the temporary file name prefix used by the recovery code.
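For illustration, here is a minimal sketch of the prefix-counter idea; it is not the actual recovery code, and the class, field, and method names are hypothetical.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: embed a per-attempt counter in the temporary file prefix so a
// restarted recovery never reuses a temp file name from an earlier attempt, which
// would otherwise trip the write-once checks in the test infrastructure.
class RecoveryTempFileNaming {
    private final long recoveryId;                              // assumed per-recovery id
    private final AtomicInteger attempt = new AtomicInteger(1); // bumped on each restart

    RecoveryTempFileNaming(long recoveryId) {
        this.recoveryId = recoveryId;
    }

    // Called when the recovery is restarted from scratch.
    void onRestart() {
        attempt.incrementAndGet();
    }

    // e.g. "recovery.42.2.segments_3" -- unique even if segments_3 is copied twice.
    String tempFileName(String origName) {
        return "recovery." + recoveryId + "." + attempt.get() + "." + origName;
    }
}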

*** THERE IS STILL AN ONGOING ISSUE ***: Lucene will try to write the same segments_N file (which was cleaned by the recovery code) twice, triggering test failures.

Due to this issue we have decided to change the approach and use a cluster observer to retry operations once the mappings have arrived (or any other cluster state change occurs).

Closes #11281

@bleskes bleskes added the >bug, v2.0.0-beta1, review and :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source) labels May 27, 2015
@s1monw
Contributor

s1monw commented May 28, 2015

@bleskes I looked at this and I don't think we should try to restart the recovery in this hyper corner case. I think we should just fail the shard, fail the recovery, and start fresh. That makes the entire code less complicated and more strict. I don't think we should design for all these corner cases; better to start fresh.

 */
- int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations) {
+ int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations, boolean allowMappingUpdates) {
Contributor

we only call this from one place, so we don't need boolean allowMappingUpdates but can pass false directly?

Contributor Author

I felt that since performRecoveryOperation is public and allows controlling this via a parameter, it would be a better (more consistent) API to expose it here in a similar fashion. I don't feel too strongly about it though.

Contributor

remove it, it's confusing IMO
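For context, a hedged sketch of the simplification suggested in this thread: keep the two-argument signature and always pass false for allowMappingUpdates when replaying a batch, so a required mapping update surfaces as an exception instead of being applied on the recovery target. The body below is illustrative, not the actual Elasticsearch code; Engine, Translog.Operation and performRecoveryOperation come from the diff and discussion above.

// Illustrative sketch only -- assumes the surrounding class that owns
// performRecoveryOperation and the Engine / Translog.Operation types shown above.
int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations) {
    int numOps = 0;
    for (Translog.Operation operation : operations) {
        // allowMappingUpdates = false: if the op needs a mapping the target does not
        // have yet, this throws instead of updating the mapping locally
        performRecoveryOperation(engine, operation, false);
        numOps++;
    }
    return numOps;
}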

@bleskes
Contributor Author

bleskes commented Jun 4, 2015

@s1monw I pushed an update based on our discussion ... no more DelayRecoveryException

@s1monw
Contributor

s1monw commented Jun 4, 2015

LGTM. One question: if we elect a new master, will this somehow get notified so that we retry?

@bleskes
Contributor Author

bleskes commented Jun 4, 2015

I'm not sure I follow the question exactly, but if the recovery code is waiting on the observer it will retry on any change in the cluster state, master-related or not.
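For context, a hedged sketch of that behaviour using ClusterStateObserver's listener API; the retryRecovery callback and the callback bodies are illustrative stand-ins, not the actual code in this change.

// Sketch: waiting on the observer retries on the next cluster state change of any
// kind (new mappings, a new master, ...). Callback bodies are illustrative.
observer.waitForNextChange(new ClusterStateObserver.Listener() {
    @Override
    public void onNewClusterState(ClusterState state) {
        retryRecovery.run(); // any new cluster state triggers the retry
    }

    @Override
    public void onClusterServiceClose() {
        // the node is shutting down; abandon the recovery
    }

    @Override
    public void onTimeout(TimeValue timeout) {
        // no cluster state change arrived in time; fail the recovery so it can be rescheduled
    }
});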

@s1monw
Contributor

s1monw commented Jun 4, 2015

@bleskes I got confused... nevermind

@bleskes bleskes closed this in ea41ee9 Jun 4, 2015
@kevinkluge kevinkluge removed the review label Jun 4, 2015
@bleskes bleskes deleted the restart_recovery_mapping_changes branch June 4, 2015 20:30
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jun 8, 2015
elastic#11363 introduced retry logic for the case where we have to wait on a mapping update during the translog replay phase of recovery. The retry throws our recovery stats off, as it may count ops twice.
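For context, a hedged sketch of the double-counting hazard and one possible guard; the counter and rollback helper shown here are assumptions for illustration, not necessarily what the referenced commit does.

// Illustrative only: if ops are counted as they are applied and the whole batch is
// replayed once the mapping arrives, the replayed ops would be counted a second time.
// One guard is to roll the counter back before signalling the retry.
int counted = 0;
try {
    for (Translog.Operation operation : operations) {
        performRecoveryOperation(engine, operation, false); // may throw on a missing mapping
        counted++;
        translogStats.incrementRecoveredOperations();       // assumed counter on the recovery stats
    }
} catch (MapperException e) {
    translogStats.decrementRecoveredOperations(counted);    // hypothetical rollback helper
    throw e; // the caller waits for the mapping to arrive and replays the batch from the start
}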
Successfully merging this pull request may close these issues.

Translog recovery can fail due to mappings not present on recovery target