Decouple recoveries from engine flush #10624

bleskes · 2015-04-16T08:13:05Z

In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back.

To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood).

Change highlights:

Refactor Translog file management to allow for multiple files.
Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene.
A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations.
Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost.
Opening and closing the translog is now the responsibility of the IndexShard. ShadowIndexShards do not open the translog.
Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to index.translog.sync_interval (was index.gateway.local.sync)

There are a still some no commits and some open issues around the fact that ShadowIndexShards doesn't have a translog (I have some ideas for solutions but I want to discuss before making the PR even bigger). Finally testConcurrentWriteViewsAndSnapshot has a concurrency issue (test bug) I need to solve but I think we can start the review cycles. @s1monw @dakrone can you have a look?

dakrone · 2015-04-16T18:13:41Z

src/main/java/org/elasticsearch/common/util/concurrent/ReleasableLock.java

+
+    public Boolean isHeldByCurrentThread() {
+        if (holdingThreads == null) {
+            return null;


I feel weird about returning null here, if holdingThreads is null, can we just return false? Or are we signaling something special (asserts aren't enabled) with this?

just an "i don't know" semantics. I will change it to throw an exception saying it's only supported when assertions are enabled.

dakrone · 2015-04-16T20:25:52Z

src/main/java/org/elasticsearch/common/util/concurrent/ReleasableLock.java

    public ReleasableLock(Lock lock) {
        this.lock = lock;
+        boolean useHoldingThreads = false;
+        assert (useHoldingThreads = true);


Maybe you can leave a comment here about what the holdingThreads ThreadLocal is used for, that way it doesn't get accidentally changed during a cleanup at a later time.

added java docs on the holdingThreads field

s1monw · 2015-04-23T14:14:58Z

src/main/java/org/elasticsearch/index/translog/fs/SimpleFsTranslogFile.java

-                    channelReference.decRef();
-                }
+    public FsChannelImmutableReader immutableReader() throws TranslogException {
+        if (channelReference.tryIncRef() == false) {


I guess we can use the double incRef pattern here too?

not sure I follow what you mean?

see #10624 (comment)

s1monw · 2015-04-23T14:21:17Z

I like what I see a lot. Yet, I think we still need to work on the unittest end to add more basic tests, it's just a feeling but we should have more tests that just make use of these classes as they are intended. I can help here once we are closer!

bleskes · 2015-04-24T21:28:47Z

@s1monw @dakrone pushed another commit or addressing yours and @kimchy 's feedback.

s1monw · 2015-04-28T14:39:53Z

@bleskes I added some answers to your comments

bleskes · 2015-04-28T19:38:49Z

@s1monw I pushed another round. Also added a test and removed the last no commit. I think this is ready now?

s1monw · 2015-04-28T19:45:36Z

@bleskes I added some minor commetns on the commit - LGTM feel free up push once fixing

s1monw · 2015-04-29T09:35:00Z

LGTM

In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back. To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood). Change highlights: - Refactor Translog file management to allow for multiple files. - Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene. - A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations. - Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost. - Opening and closing the translog is now the responsibility of the IndexShard. ShadowIndexShards do not open the translog. - Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`) Closes elastic#10624

In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back. To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood). Change highlights: - Refactor Translog file management to allow for multiple files. - Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene. - A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations. - Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost. - IndexShard now creates the translog together with the engine. The translog is closed by the engine on close. ShadowIndexShards do not open the translog. - Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`) Closes elastic#10624

… slow recovery elastic#10624 decoupled translog flush from ongoing recoveries. In the process, the translog creation was delayed to moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, in charge of translog based flushes, starts a background checker as soon as the shard is allocated. That checker performs it's first check after 5s expected the translog to be there. However, if the file copying phase of the recovery takes >5s (likely!) or local recovery is slow, the check can run into an exception and never recover. The end result is that the translog based flush is completely disabled. Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing. Closes elastic#15814

yiguolei · 2016-04-22T03:34:34Z

@bleskes When user delete a doc during recovery, and then the recovery process will send a index request to replica, the replica will index the doc again, so that the doc will appeared again since we removed phase3

bleskes · 2016-05-11T14:23:45Z

@yiguolei sorry for the late response, but doc deletes are versioned just like any other write operation and can arrive out of order to the replicas. When the indexing operation is replayed it will not override the delete operation.

bleskes added >enhancement v2.0.0-beta1 resiliency :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Apr 16, 2015

kevinkluge added the in progress label Apr 16, 2015

bleskes added the review label Apr 16, 2015

bleskes assigned s1monw Apr 16, 2015

dakrone reviewed Apr 16, 2015
View reviewed changes

bleskes force-pushed the gen_translog branch from fa27693 to 7095b10 Compare April 16, 2015 19:17

dakrone reviewed Apr 16, 2015
View reviewed changes

s1monw reviewed Apr 23, 2015
View reviewed changes

bleskes mentioned this pull request Apr 25, 2015

FSTranslog#snapshot() can enter infinite loop #10807

Closed

dakrone added the >breaking label Apr 26, 2015

bleskes force-pushed the gen_translog branch from cd7418f to a15eaf6 Compare April 28, 2015 19:31

bleskes force-pushed the gen_translog branch from 0557843 to 3a6f7a0 Compare April 29, 2015 15:44

bleskes force-pushed the gen_translog branch from 3a6f7a0 to 20d8cb7 Compare April 29, 2015 21:21

bleskes force-pushed the gen_translog branch from 20d8cb7 to 9e0697c Compare April 30, 2015 15:10

bleskes force-pushed the gen_translog branch from 9e0697c to 45c1c9c Compare April 30, 2015 16:38

bleskes force-pushed the gen_translog branch from 45c1c9c to d596f5c Compare April 30, 2015 20:43

s1monw merged commit d596f5c into elastic:master May 5, 2015

s1monw removed in progress labels May 5, 2015

dakrone mentioned this pull request May 5, 2015

Add another test for reading a truncated translog #9966

Closed

bleskes added the release highlight label Jun 3, 2015

bleskes mentioned this pull request Jan 7, 2016

Translog base flushes can be disabled after replication relocation or slow recovery #15830

Closed

bleskes deleted the gen_translog branch May 11, 2016 14:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decouple recoveries from engine flush #10624

Decouple recoveries from engine flush #10624

bleskes commented Apr 16, 2015

dakrone Apr 16, 2015

bleskes Apr 17, 2015

dakrone Apr 16, 2015

bleskes Apr 17, 2015

s1monw Apr 23, 2015

bleskes Apr 24, 2015

s1monw Apr 28, 2015

s1monw commented Apr 23, 2015

bleskes commented Apr 24, 2015

s1monw commented Apr 28, 2015

bleskes commented Apr 28, 2015

s1monw commented Apr 28, 2015

s1monw commented Apr 29, 2015

yiguolei commented Apr 22, 2016

bleskes commented May 11, 2016

Decouple recoveries from engine flush #10624

Decouple recoveries from engine flush #10624

Conversation

bleskes commented Apr 16, 2015

dakrone Apr 16, 2015

Choose a reason for hiding this comment

bleskes Apr 17, 2015

Choose a reason for hiding this comment

dakrone Apr 16, 2015

Choose a reason for hiding this comment

bleskes Apr 17, 2015

Choose a reason for hiding this comment

s1monw Apr 23, 2015

Choose a reason for hiding this comment

bleskes Apr 24, 2015

Choose a reason for hiding this comment

s1monw Apr 28, 2015

Choose a reason for hiding this comment

s1monw commented Apr 23, 2015

bleskes commented Apr 24, 2015

s1monw commented Apr 28, 2015

bleskes commented Apr 28, 2015

s1monw commented Apr 28, 2015

s1monw commented Apr 29, 2015

yiguolei commented Apr 22, 2016

bleskes commented May 11, 2016