Restart recovery upon mapping changes during translog replay #11363

bleskes · 2015-05-27T06:42:11Z

On rare occasions, the translog replay phase of recovery may require mapping changes on the target shard. This can happen when indexing on the primary introduces new mappings while the recovery is in phase 1. If the source node processes the new mapping from the master (allowing indexing to proceed) before the target node does, and the recovery also moves to phase 2 (translog replay) before the target has processed it, the translog operations arriving on the target node may reference mappings the target does not yet have. Since this is extremely rare, we opt for a simple fix and restart the recovery. Note that in this case the file copy phase will likely be very short, as the files are already in sync.

Restarting recoveries in such a late phase means we may need to copy segments_N files and/or files that were quickly merged away on the target again. This trips the write-once protection in our testing infra. To work around it, I have introduced a counter in the temporary file name prefix used by the recovery code.
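The counter idea can be sketched roughly as follows. This is a minimal illustration, not the actual recovery code: the class `TempFilePrefix`, the method `next`, and the `recovery.<timestamp>.<counter>.` layout are all hypothetical names chosen for the example. It only shows why a per-attempt counter keeps a restarted recovery from writing a temporary file with a name that was already used.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper (not an Elasticsearch class): hands out a distinct
// temporary-file prefix for every recovery attempt on a shard.
final class TempFilePrefix {
    // Incremented once per attempt, so a restart never reuses a prefix.
    private final AtomicInteger attempt = new AtomicInteger();

    // The prefix layout is an assumption made for this sketch.
    String next(long recoveryStartMillis) {
        return "recovery." + recoveryStartMillis + "." + attempt.incrementAndGet() + ".";
    }

    public static void main(String[] args) {
        TempFilePrefix prefixes = new TempFilePrefix();
        // The same segments file copied in two attempts of the same recovery
        // ends up under two different temporary names.
        System.out.println(prefixes.next(42L) + "segments_3");
        System.out.println(prefixes.next(42L) + "segments_3");
    }
}
```

With a fixed prefix, both attempts would target the same temporary name, which is what a write-once check on the directory would reject.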

**** THERE IS STILL AN ONGOING ISSUE ****: Lucene will try to write the same segments_N file (which was cleaned up by the recovery code) twice, triggering test failures.

Due to this issue we have decided to change approach and use a cluster state observer to retry operations once the mappings have arrived (or any other cluster state change has occurred).
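The observer-based approach can be illustrated with a small single-threaded model. Everything in it is an assumption made for the sketch: `LocalState`, `Replayer`, `publishChange`, and `onNextChange` are hypothetical stand-ins and do not mirror the Elasticsearch classes. Operations are reduced to the name of the field they need a mapping for, and the sketch resumes at the stalled operation, whereas the real code may retry a whole batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ObserverRetrySketch {
    // Stand-in for the target node's view of the cluster state.
    static final class LocalState {
        private final Set<String> mappedFields = new HashSet<>();
        private final List<Runnable> listeners = new ArrayList<>();

        boolean hasMapping(String field) { return mappedFields.contains(field); }

        // One-shot listener, fired on the next change of any kind.
        void onNextChange(Runnable listener) { listeners.add(listener); }

        // Any change notifies waiters; newlyMappedField may be null
        // to model a change that is unrelated to mappings.
        void publishChange(String newlyMappedField) {
            if (newlyMappedField != null) mappedFields.add(newlyMappedField);
            List<Runnable> toRun = new ArrayList<>(listeners);
            listeners.clear(); // listeners may re-register while running
            toRun.forEach(Runnable::run);
        }
    }

    static final class Replayer {
        private final LocalState state;
        final List<String> applied = new ArrayList<>();
        int attempts = 0;

        Replayer(LocalState state) { this.state = state; }

        // Applies ops from index 'from'; on a missing mapping, waits for the
        // next state change and retries from the stalled op.
        void replay(List<String> opFields, int from) {
            attempts++;
            for (int i = from; i < opFields.size(); i++) {
                String field = opFields.get(i);
                if (!state.hasMapping(field)) {
                    final int resumeAt = i;
                    state.onNextChange(() -> replay(opFields, resumeAt));
                    return;
                }
                applied.add(field);
            }
        }
    }

    public static void main(String[] args) {
        LocalState state = new LocalState();
        state.publishChange("title");
        Replayer replayer = new Replayer(state);
        replayer.replay(Arrays.asList("title", "author", "title"), 0);
        state.publishChange(null);     // unrelated change: retried, still stalled
        state.publishChange("author"); // mapping arrives: replay completes
        System.out.println(replayer.applied + " after " + replayer.attempts + " attempts");
    }
}
```

Retrying on every change rather than only on mapping updates keeps the waiting logic simple: a retry that still lacks the mapping just registers itself again.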

Closes #11281

s1monw · 2015-05-28T09:39:39Z

@bleskes I looked at this and I think we should not try to restart the recovery in this hyper corner case. I think we should just fail the shard, fail the recovery, and start fresh. This makes the entire code less complicated and more strict. I think we should not design for all these corner cases and rather start fresh?

s1monw · 2015-06-01T14:20:02Z

src/main/java/org/elasticsearch/index/shard/TranslogRecoveryPerformer.java

     */
-    int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations) {
+    int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations, boolean allowMappingUpdates) {


We only call this from one place, so we don't need boolean allowMappingUpdates but can pass false directly?

bleskes · 2015-06-02

I felt that since performRecoveryOperation is public and allows controlling this via a parameter, it would be a better (more consistent) API to expose it here in a similar fashion. I don't feel too strongly about it, though.

s1monw · 2015-06-02

Remove it, it's confusing IMO.

bleskes · 2015-06-04T13:30:13Z

@s1monw I pushed an update based on our discussion ... no more DelayRecoveryException

s1monw · 2015-06-04T19:27:41Z

LGTM. One question: if we elect a new master, will this somehow get notified so that we retry?

bleskes · 2015-06-04T20:00:22Z

I'm not sure I follow the question exactly, but if the recovery code is waiting on the observer, it will retry on any change in the cluster state, master related or not.

s1monw · 2015-06-04T20:04:34Z

@bleskes I got confused... nevermind

elastic#11363 introduced retry logic for the case where we have to wait on a mapping update during the translog replay phase of recovery. The retry throws our recovery stats off, as it may count ops twice.
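How a retried batch can inflate the count is easiest to see in a toy model. The sketch below is an assumption-laden illustration and not the fix that was actually merged: `BatchStatsSketch`, `naiveCount`, `safeCount`, and `applyBatch` are hypothetical names. It contrasts counting each op as it is applied with counting only once a batch attempt has completed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class BatchStatsSketch {
    int naiveCount; // bumped per op, even if the batch is later retried
    int safeCount;  // bumped only when a whole batch attempt succeeds

    // Returns true if the whole batch was applied, false if it stopped at an
    // op that cannot be applied yet (the caller then retries the whole batch).
    boolean applyBatch(List<String> ops, Predicate<String> canApply) {
        int appliedInThisAttempt = 0;
        for (String op : ops) {
            if (!canApply.test(op)) {
                return false; // ops counted so far in naiveCount stay counted
            }
            naiveCount++;
            appliedInThisAttempt++;
        }
        safeCount += appliedInThisAttempt;
        return true;
    }

    public static void main(String[] args) {
        BatchStatsSketch stats = new BatchStatsSketch();
        List<String> batch = Arrays.asList("a", "b", "c");
        // First attempt stalls on "c" (think: its mapping has not arrived yet).
        stats.applyBatch(batch, op -> !op.equals("c"));
        // The retry applies the full batch.
        stats.applyBatch(batch, op -> true);
        System.out.println("naive=" + stats.naiveCount + " safe=" + stats.safeCount);
    }
}
```

In this model the naive counter ends at 5 for a 3-op batch, because "a" and "b" are counted in both attempts, while the per-attempt counter ends at 3.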

bleskes added >bug v2.0.0-beta1 review :Distributed/Recovery labels May 27, 2015

s1monw reviewed Jun 1, 2015
bleskes added 2 commits June 4, 2015 14:35

Moved to retry performing operations locally on the target node using an observer

a0f1a7e

revert changes to RecoveryStatus now unneeded

d45f779

bleskes closed this in ea41ee9 Jun 4, 2015

kevinkluge removed the review label Jun 4, 2015

bleskes deleted the restart_recovery_mapping_changes branch June 4, 2015 20:30

This was referenced Jun 8, 2015

Fix recovered translog ops stat counting when retrying a batch #11536

Merged

Fix MapperException detection during translog ops replay #11583

Merged
