New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restart recovery upon mapping changes during translog replay #11363
Conversation
In rare occasion, the translog replay phase of recovery may require mapping changes on the target shard. This can happen where indexing on the primary introduces new mappings while the recovery is in phase1. If the source node processes the new mapping from the master, allowing the indexing to proceed, before the target node does and the recovery moves to the phase 2 (translog replay) before as well, the translog operations arriving on the target node may miss the mapping changes. Since this is extremely rare, we opt for a simple fix and simply restart the recovery. Note that in the case the file copy phase will likely be very short as the files are already in sync. Restarting recoveries in such a late phase means we may need to copy segment_N files and/or files that were quickly merged away on the target again. This annoys the write-once protection in our testing infra. To work around it I have introduces a counter in the termpoary file name prefix used by the recovery code. **** THERE IS STILL AN ONGOING ISSUE ***: Lucene will try to write the same segment_N file (which was cleaned by the recovery code) twice triggering test failures. Closes elastic#11281
@bleskes I looked at this and I think we should not try to restart the recovery in this |
*/ | ||
int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations) { | ||
int performBatchRecovery(Engine engine, Iterable<Translog.Operation> operations, boolean allowMappingUpdates) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we only call this from one place so we don't need boolean allowMappingUpdates
but can pass false
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt that since performRecoveryOperation
is public and allows to control this via parameter, it would be better API (consistent) to expose it here in a similar fashion. Don't feel too strongly about it though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove it it's confusion IMO
@s1monw I pushed an update based on our discussion ... no more DelayRecoveryException |
LGTM, one question if we elect a new master will this somehow get notified and we retry? |
I'm not sure I follow the question exactly, but if the recovery code is wait on the observer it will retry on any change in the cluster state, master related or not. |
@bleskes I got confused... nevermind |
elastic#11363 introduced a retry logic for the case where we have to wait on a mapping update during the translog replay phase of recovery. The retry throws or recovery stats off as it may count ops twice.
In rare occasion, the translog replay phase of recovery may require mapping changes on the target shard. This can happen where indexing on the primary introduces new mappings while the recovery is in phase1. If the source node processes the new mapping from the master, allowing the indexing to proceed, before the target node does and the recovery moves to the phase 2 (translog replay) before as well, the translog operations arriving on the target node may miss the mapping changes. Since this is extremely rare, we opt for a simple fix and simply restart the recovery. Note that in the case the file copy phase will likely be very short as the files are already in sync.
Restarting recoveries in such a late phase means we may need to copy segment_N files and/or files that were quickly merged away on the target again. This annoys the write-once protection in our testing infra. To work around it I have introduces a counter in the termpoary file name prefix used by the recovery code.
**** THERE IS STILL AN ONGOING ISSUE ***: Lucene will try to write the same segment_N file (which was cleaned by the recovery code) twice triggering test failures.
Due ot this issue we have decided to change approach and use a cluster observer to retry operations once the mapping have arrived (or any other change)
Closes #11281