New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Decouple recoveries from engine flush #10624
Conversation
|
||
public Boolean isHeldByCurrentThread() { | ||
if (holdingThreads == null) { | ||
return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel weird about returning null here, if holdingThreads is null, can we just return false? Or are we signaling something special (asserts aren't enabled) with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just an "i don't know" semantics. I will change it to throw an exception saying it's only supported when assertions are enabled.
public ReleasableLock(Lock lock) { | ||
this.lock = lock; | ||
boolean useHoldingThreads = false; | ||
assert (useHoldingThreads = true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you can leave a comment here about what the holdingThreads ThreadLocal is used for, that way it doesn't get accidentally changed during a cleanup at a later time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added java docs on the holdingThreads field
channelReference.decRef(); | ||
} | ||
public FsChannelImmutableReader immutableReader() throws TranslogException { | ||
if (channelReference.tryIncRef() == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can use the double incRef pattern here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I follow what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see #10624 (comment)
I like what I see a lot. Yet, I think we still need to work on the unittest end to add more basic tests, it's just a feeling but we should have more tests that just make use of these classes as they are intended. I can help here once we are closer! |
@bleskes I added some answers to your comments |
@s1monw I pushed another round. Also added a test and removed the last no commit. I think this is ready now? |
@bleskes I added some minor commetns on the commit - LGTM feel free up push once fixing |
LGTM |
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back. To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood). Change highlights: - Refactor Translog file management to allow for multiple files. - Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene. - A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations. - Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost. - Opening and closing the translog is now the responsibility of the IndexShard. ShadowIndexShards do not open the translog. - Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`) Closes elastic#10624
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back. To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood). Change highlights: - Refactor Translog file management to allow for multiple files. - Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene. - A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations. - Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost. - Opening and closing the translog is now the responsibility of the IndexShard. ShadowIndexShards do not open the translog. - Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`) Closes elastic#10624
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back. To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood). Change highlights: - Refactor Translog file management to allow for multiple files. - Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene. - A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations. - Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost. - Opening and closing the translog is now the responsibility of the IndexShard. ShadowIndexShards do not open the translog. - Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`) Closes elastic#10624
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back. To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood). Change highlights: - Refactor Translog file management to allow for multiple files. - Translog maintains a list of referenced files, both by outstanding recoveries and files containing operations not yet committed to Lucene. - A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations. - Recovery phase3 is removed. That phase was replaying operations while preventing new writes to the engine. This is unneeded as standard indexing also send all operations from the start of the recovery to the recovering shard. Replay all ops in the view acquired in recovery start is enough to guarantee no operation is lost. - IndexShard now creates the translog together with the engine. The translog is closed by the engine on close. ShadowIndexShards do not open the translog. - Moved the ownership of translog fsyncing to the translog it self, changing the responsible setting to `index.translog.sync_interval` (was `index.gateway.local.sync`) Closes elastic#10624
… slow recovery elastic#10624 decoupled translog flush from ongoing recoveries. In the process, the translog creation was delayed to moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, in charge of translog based flushes, starts a background checker as soon as the shard is allocated. That checker performs it's first check after 5s expected the translog to be there. However, if the file copying phase of the recovery takes >5s (likely!) or local recovery is slow, the check can run into an exception and never recover. The end result is that the translog based flush is completely disabled. Note that this is mitigated but shard inactivity which triggers synced flush after 5m of no indexing. Closes elastic#15814
@bleskes When user delete a doc during recovery, and then the recovery process will send a index request to replica, the replica will index the doc again, so that the doc will appeared again since we removed phase3 |
@yiguolei sorry for the late response, but doc deletes are versioned just like any other write operation and can arrive out of order to the replicas. When the indexing operation is replayed it will not override the delete operation. |
In order to safely complete recoveries / relocations we have to keep all operation done since the recovery start at available for replay. At the moment we do so by preventing the engine from flushing and thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps on growing until the recovery is done. This is not a problem as we do need these operations but if the another recovery starts concurrently it may have an unneededly long translog to replay. Also, if we shutdown the engine for some reason at this point (like when a node is restarted) we have to recover a long translog when we come back.
To void this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold to the files they need while allowing the engine to flush and do a lucene commit (which will create a new translog files bellow the hood).
Change highlights:
index.translog.sync_interval
(wasindex.gateway.local.sync
)There are a still some no commits and some open issues around the fact that ShadowIndexShards doesn't have a translog (I have some ideas for solutions but I want to discuss before making the PR even bigger). Finally
testConcurrentWriteViewsAndSnapshot
has a concurrency issue (test bug) I need to solve but I think we can start the review cycles. @s1monw @dakrone can you have a look?