
Do not wait for advancement of checkpoint in recovery #39006

Merged
merged 7 commits into elastic:master from dnhatn:recovery-range-check on Feb 25, 2019

Conversation

@dnhatn (Member) commented Feb 17, 2019

With this change, we won't wait for the local checkpoint to advance to the max_seq_no before starting phase2 of peer-recovery. We also remove the sequence number range check in peer-recovery. We can safely make both changes thanks to Yannick's finding.

The replication group to be used is currently sampled after indexing into the primary (see ReplicationOperation class). This means that when initiating tracking of a new replica, we have to consider the following two cases:

  • there are operations for which the replication group has not been sampled yet. As we initiated the new replica as tracking, we know that those operations will be replicated to the new replica and follow the typical replication group semantics (e.g. marked as stale when unavailable).

  • there are operations for which the replication group has already been sampled. These operations will not be sent to the new replica. However, we know that those operations are already indexed into Lucene and the translog on the primary, as the sampling happens after that. This means that by taking a snapshot of Lucene or the translog, we will get those ops as well. What we can no longer guarantee is that all ops up to endingSeqNo are available in the snapshot (see also the comment in RecoverySourceHandler: "We need to wait for all operations up to the current max to complete, otherwise we can not guarantee that all operations in the required range will be available for replaying from the translog of the source."). This is not needed, though, as we can no longer guarantee that max seq no == local checkpoint. (A simplified sketch of the old and new behaviour follows below.)

Relates #39000
Closes #38949
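
To make the two cases above concrete, here is a minimal sketch of the old and new phase2 behaviour under a deliberately simplified model: operations are represented as bare sequence numbers and the snapshot as a sorted list. The class and method names below are invented for illustration and are not the actual `RecoverySourceHandler` code.

```java
import java.util.List;
import java.util.function.LongConsumer;

final class RecoveryPhase2Sketch {

    // Old behaviour (conceptually): refuse to start phase2 until the primary's local
    // checkpoint has caught up with max_seq_no, then insist that the snapshot contains a
    // gap-free range of operations from startingSeqNo up to endingSeqNo.
    static void sendOpsWithRangeCheck(long localCheckpoint, long maxSeqNo, long startingSeqNo,
                                      long endingSeqNo, List<Long> snapshot, LongConsumer send) {
        if (localCheckpoint < maxSeqNo) {
            throw new IllegalStateException(
                "waiting for all operations up to " + maxSeqNo + " to complete");
        }
        long expected = startingSeqNo;
        for (long seqNo : snapshot) {
            if (seqNo != expected) {
                throw new IllegalStateException("operation " + expected + " missing from snapshot");
            }
            expected++;
            send.accept(seqNo);
        }
        if (expected <= endingSeqNo) {
            throw new IllegalStateException("snapshot ends before " + endingSeqNo);
        }
    }

    // New behaviour (conceptually): replay whatever the Lucene/translog snapshot contains,
    // without waiting and without the range check. Operations that are missing from the
    // snapshot were indexed after the replica was added as tracking, so normal replication
    // delivers them to the new replica anyway.
    static void sendOpsWithoutRangeCheck(List<Long> snapshot, LongConsumer send) {
        for (long seqNo : snapshot) {
            send.accept(seqNo);
        }
    }
}
```

The point of the second method is that correctness no longer relies on the completeness of the snapshot but on the replication-group semantics described in the first bullet above.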

@dnhatn dnhatn added the >non-issue, blocker, :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.), v7.0.0, :Distributed/CCR (Issues around the Cross Cluster State Replication features), v6.7.0, v8.0.0 and v7.2.0 labels on Feb 17, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Feb 17, 2019
@dnhatn (Member, Author) commented Feb 17, 2019

@bleskes Would you please take a look? Thank you!

@DaveCTurner (Contributor) left a comment:

As per #39000 (comment), there doesn't seem to be a way for the primary to fail to replicate every operation (thanks in part to the primary/replica resync work), so it sounds like we don't need this check. But we do need tests to ensure that @ywelsch's reasoning doesn't break down in the future.

@dnhatn (Member, Author) commented Feb 19, 2019

Closing in favor of #39153.

@dnhatn dnhatn closed this Feb 19, 2019
@dnhatn dnhatn deleted the recovery-range-check branch February 19, 2019 21:11
@dnhatn dnhatn restored the recovery-range-check branch February 19, 2019 21:35
@dnhatn dnhatn changed the title from "Use target checkpoint to verify sending ops in recovery" to "Do not wait for advancement of checkpoint in recovery" on Feb 19, 2019
@dnhatn dnhatn reopened this Feb 19, 2019
@dnhatn (Member, Author) commented Feb 25, 2019

@DaveCTurner @jasontedor @ywelsch Thanks so much for reviewing.

@dnhatn dnhatn merged commit 51e7a5f into elastic:master Feb 25, 2019
@dnhatn dnhatn deleted the recovery-range-check branch February 25, 2019 15:57
dnhatn added a commit that referenced this pull request Feb 25, 2019
With this change, we won't wait for the local checkpoint to advance to
the max_seq_no before starting phase2 of peer-recovery. We also remove
the sequence number range check in peer-recovery. We can safely do these
thanks to Yannick's finding.

The replication group to be used is currently sampled after indexing
into the primary (see `ReplicationOperation` class). This means that
when initiating tracking of a new replica, we have to consider the
following two cases:

- There are operations for which the replication group has not been
sampled yet. As we initiated the new replica as tracking, we know that
those operations will be replicated to the new replica and follow the
typical replication group semantics (e.g. marked as stale when
unavailable).

- There are operations for which the replication group has already been
sampled. These operations will not be sent to the new replica.  However,
we know that those operations are already indexed into Lucene and the
translog on the primary, as the sampling is happening after that. This
means that by taking a snapshot of Lucene or the translog, we will be
getting those ops as well. What we cannot guarantee anymore is that all
ops up to `endingSeqNo` are available in the snapshot (i.e.  also see
comment in `RecoverySourceHandler` saying `We need to wait for all
operations up to the current max to complete, otherwise we can not
guarantee that all operations in the required range will be available
for replaying from the translog of the source.`). This is not needed,
though, as we can no longer guarantee that max seq no == local
checkpoint.

Relates #39000
Closes #38949

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
dnhatn added a commit that referenced this pull request Feb 25, 2019
dnhatn added a commit that referenced this pull request Feb 25, 2019
@dnhatn dnhatn added the :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.) label and removed the backport pending and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels on Feb 25, 2019
dnhatn added a commit that referenced this pull request Feb 26, 2019
Since #39006, we should either remove `testWaitForPendingSeqNo` 
or adjust it not to wait for the pending operations. This change picks 
the latter.

Relates #39006
dnhatn added a commit that referenced this pull request Feb 26, 2019
dnhatn added a commit that referenced this pull request Feb 26, 2019
dnhatn added a commit that referenced this pull request Feb 26, 2019
tlrx added a commit that referenced this pull request Feb 28, 2019
tlrx added a commit that referenced this pull request Feb 28, 2019
Before this change, closed indices were simply not replicated. It was therefore
possible to close an index and then decommission a data node without knowing
that this data node contained shards of the closed index, potentially leading to
data loss. Shards of closed indices were not completely taken into account when
balancing the shards within the cluster, or automatically replicated through shard
copies, and they were not easily movable from node A to node B using APIs like
Cluster Reroute without being fully reopened and closed again.

This commit changes the logic executed when closing an index, so that its shards
are not just removed and forgotten but are instead reinitialized and reallocated on
data nodes using an engine implementation which does not allow searching or
indexing, which has a low memory overhead (compared with searchable/indexable
opened shards), and which allows shards to be recovered from a peer or promoted
to primary when needed.

This new closing logic is built on top of the new Close Index API introduced in
6.7.0 (#37359). Some pre-closing sanity checks are executed on the shards before
closing them, and closing an index on an 8.0 cluster will reinitialize the index shards
and therefore impact the cluster health.

Some APIs have been adapted to make them work with closed indices:
- Cluster Health API
- Cluster Reroute API
- Cluster Allocation Explain API
- Recovery API
- Cat Indices
- Cat Shards
- Cat Health
- Cat Recovery

This commit contains all the following changes (most recent first):
* c6c42a1 Adapt NoOpEngineTests after #39006
* 3f9993d Wait for shards to be active after closing indices (#38854)
* 5e7a428 Adapt the Cluster Health API to closed indices (#39364)
* 3e61939 Adapt CloseFollowerIndexIT for replicated closed indices (#38767)
* 71f5c34 Recover closed indices after a full cluster restart (#39249)
* 4db7fd9 Adapt the Recovery API for closed indices (#38421)
* 4fd1bb2 Adapt more tests suites to closed indices (#39186)
* 0519016 Add replica to primary promotion test for closed indices (#39110)
* b756f6c Test the Cluster Shard Allocation Explain API with closed indices (#38631)
* c484c66 Remove index routing table of closed indices in mixed versions clusters (#38955)
* 00f1828 Mute CloseFollowerIndexIT.testCloseAndReopenFollowerIndex()
* e845b0a Do not schedule Refresh/Translog/GlobalCheckpoint tasks for closed indices (#38329)
* cf9a015 Adapt testIndexCanChangeCustomDataPath for replicated closed indices (#38327)
* b9becdd Adapt testPendingTasks() for replicated closed indices (#38326)
* 02cc730 Allow shards of closed indices to be replicated as regular shards (#38024)
* e53a9be Fix compilation error in IndexShardIT after merge with master
* cae4155 Relax NoOpEngine constraints (#37413)
* 54d110b [RCI] Adapt NoOpEngine to latest FrozenEngine changes
* c63fd69 [RCI] Add NoOpEngine for closed indices (#33903)

Relates to #33888
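
To illustrate the idea behind the closing logic described above, here is a self-contained sketch of a shard kept allocated but backed by an engine that rejects reads and writes. All names below are hypothetical and simplified; the actual implementation is the NoOpEngine referenced in the commit list.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of "an engine implementation which does not allow
// searching or indexing". These types are illustrative only; they are not the
// Elasticsearch engine API.
interface ShardEngineSketch {
    void index(String id, String source);
    String get(String id);
}

// Engine backing an open index: accepts writes and reads (kept trivially in memory here).
final class OpenShardEngineSketch implements ShardEngineSketch {
    private final Map<String, String> docs = new HashMap<>();
    @Override public void index(String id, String source) { docs.put(id, source); }
    @Override public String get(String id) { return docs.get(id); }
}

// Engine backing a closed index: the shard stays allocated (so it can still be balanced,
// recovered from a peer, or promoted to primary), but document-level operations fail.
final class ClosedShardEngineSketch implements ShardEngineSketch {
    @Override public void index(String id, String source) {
        throw new UnsupportedOperationException("index is closed: writes are rejected");
    }
    @Override public String get(String id) {
        throw new UnsupportedOperationException("index is closed: reads are rejected");
    }
}
```

Because the shard object itself still exists and holds its on-disk data, cluster-level operations such as rebalancing or primary promotion can treat it like any other shard; only per-document reads and writes are blocked, which is what keeps the memory overhead low.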
tlrx added a commit that referenced this pull request Mar 1, 2019
Backport support for replicating closed indices (#39499)
    
dnhatn added a commit that referenced this pull request Mar 4, 2019
Since #39006 we should be able to complete a peer-recovery without
waiting for pending indexing operations. Thus, the assertion in
testDoNotWaitForPendingSeqNo should be updated from false to true.

Closes #39510
dnhatn added a commit that referenced this pull request Mar 4, 2019
dnhatn added a commit that referenced this pull request Mar 4, 2019
dnhatn added a commit that referenced this pull request Mar 4, 2019
Labels
blocker, :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.), >enhancement, v6.7.0, v7.0.0-rc2, v7.2.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] IndexFollowingIT.testFollowIndex failure
7 participants