
Decouple recoveries from engine flush #10624

Merged

merged 1 commit into elastic:master from bleskes:gen_translog on May 5, 2015

Conversation

@bleskes
Contributor

bleskes commented Apr 16, 2015

In order to safely complete recoveries / relocations we have to keep all operations performed since the start of the recovery available for replay. At the moment we do so by preventing the engine from flushing, thus making sure that the operations are kept in the translog. A side effect of this is that the translog keeps growing until the recovery is done. This is not a problem per se, as we do need these operations, but if another recovery starts concurrently it may have an unnecessarily long translog to replay. Also, if we shut down the engine for some reason at this point (like when a node is restarted), we have to recover a long translog when we come back.

To avoid this, the translog is changed to be based on multiple files instead of a single one. This allows recoveries to keep hold of the files they need while allowing the engine to flush and do a Lucene commit (which will create a new translog file under the hood).

Change highlights:

  • Refactor Translog file management to allow for multiple files.
  • Translog maintains a list of referenced files, both files needed by outstanding recoveries and files containing operations not yet committed to Lucene.
  • A new Translog.View concept is introduced, allowing recoveries to get a reference to all currently uncommitted translog files plus all future translog files created until the view is closed. They can use this view to iterate over operations (a rough sketch of this follows the list).
  • Recovery phase3 is removed. That phase replayed operations while preventing new writes to the engine. This is no longer needed, as standard indexing already sends all operations performed after the start of the recovery to the recovering shard. Replaying all ops in the view acquired at recovery start is enough to guarantee no operation is lost.
  • Opening and closing the translog is now the responsibility of the IndexShard. ShadowIndexShards do not open the translog.
  • Moved the ownership of translog fsyncing to the translog itself, changing the responsible setting to index.translog.sync_interval (was index.gateway.local.sync).
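
To make the Translog.View idea concrete, here is a minimal, hypothetical sketch of the reference counting it relies on; class and method names are illustrative and not the actual Elasticsearch code:

```java
// Hypothetical sketch of the Translog.View idea: a view pins the currently
// uncommitted translog generations (and any created later) via reference
// counting, so the engine can flush and commit to Lucene while a recovery
// still replays operations from the pinned files. Names are illustrative.
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class TranslogSketch {

    static class TranslogFile {
        final long generation;
        final AtomicInteger refCount = new AtomicInteger(1); // the translog's own reference

        TranslogFile(long generation) {
            this.generation = generation;
        }

        void incRef() {
            refCount.incrementAndGet();
        }

        void decRef() {
            if (refCount.decrementAndGet() == 0) {
                // last reference released: the file can now be deleted from disk
            }
        }
    }

    private final List<TranslogFile> uncommitted = new ArrayList<>();
    private final List<View> openViews = new ArrayList<>();

    /** A recovery acquires a view over all uncommitted files plus future ones. */
    synchronized View newView() {
        List<TranslogFile> pinned = new ArrayList<>(uncommitted);
        pinned.forEach(TranslogFile::incRef);
        View view = new View(pinned);
        openViews.add(view);
        return view;
    }

    /** On engine flush / Lucene commit: roll to a new generation file. */
    synchronized void rollGeneration(long newGeneration) {
        TranslogFile file = new TranslogFile(newGeneration);
        uncommitted.add(file);
        for (View view : openViews) {
            file.incRef();
            view.pinned.add(file); // open views also see files created after they were opened
        }
    }

    /** Called once all operations up to a generation are committed to Lucene. */
    synchronized void markCommitted(long generation) {
        uncommitted.removeIf(file -> {
            if (file.generation <= generation) {
                file.decRef(); // drop the translog's reference; views may still hold theirs
                return true;
            }
            return false;
        });
    }

    class View implements Closeable {
        final List<TranslogFile> pinned;

        View(List<TranslogFile> pinned) {
            this.pinned = pinned;
        }

        @Override
        public void close() { // called when the recovery is done
            synchronized (TranslogSketch.this) {
                pinned.forEach(TranslogFile::decRef);
                openViews.remove(this);
            }
        }
    }
}
```

The key point in this sketch is that a Lucene commit only releases the translog's own reference to a generation; any open view keeps its pinned files alive until the recovery that acquired it closes the view.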

There are still some nocommits and some open issues around the fact that ShadowIndexShards don't have a translog (I have some ideas for solutions, but I want to discuss them before making the PR even bigger). Finally, testConcurrentWriteViewsAndSnapshot has a concurrency issue (a test bug) that I need to solve, but I think we can start the review cycles. @s1monw @dakrone can you have a look?

@bleskes bleskes added the >enhancement, v2.0.0-beta1, resiliency, and :Distributed/Recovery labels Apr 16, 2015
@bleskes bleskes added the review label Apr 16, 2015

public Boolean isHeldByCurrentThread() {
    if (holdingThreads == null) {
        return null;
Member

I feel weird about returning null here, if holdingThreads is null, can we just return false? Or are we signaling something special (asserts aren't enabled) with this?

Contributor Author

Just "I don't know" semantics. I will change it to throw an exception saying it's only supported when assertions are enabled.

public ReleasableLock(Lock lock) {
    this.lock = lock;
    boolean useHoldingThreads = false;
    assert (useHoldingThreads = true);
Member

Maybe you can leave a comment here about what the holdingThreads ThreadLocal is used for, that way it doesn't get accidentally changed during a cleanup at a later time.

Contributor Author

Added Javadocs on the holdingThreads field.
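
For context, a minimal sketch of the pattern under discussion; the assignment inside the assert only runs when assertions are enabled (-ea), so the holdingThreads tracking costs nothing in production. This is illustrative, not the exact ReleasableLock code:

```java
import java.util.concurrent.locks.Lock;

// Illustrative sketch (not the exact ReleasableLock code): the assignment inside
// the assert only runs when assertions are enabled, so holder tracking is free in
// production, and isHeldByCurrentThread() fails loudly when it has no data
// instead of returning null.
class ReleasableLockSketch {
    private final Lock lock;
    // Tracks whether the current thread holds this lock. Only populated when
    // assertions are enabled; see the assert trick in the constructor.
    private final ThreadLocal<Boolean> holdingThreads;

    ReleasableLockSketch(Lock lock) {
        this.lock = lock;
        boolean useHoldingThreads = false;
        assert (useHoldingThreads = true); // side effect happens only with -ea
        holdingThreads = useHoldingThreads ? ThreadLocal.withInitial(() -> false) : null;
    }

    void acquire() {
        lock.lock();
        assert setHeld(true);
    }

    void release() {
        lock.unlock();
        assert setHeld(false);
    }

    boolean isHeldByCurrentThread() {
        if (holdingThreads == null) {
            throw new UnsupportedOperationException("only supported when assertions are enabled");
        }
        return holdingThreads.get();
    }

    private boolean setHeld(boolean held) {
        holdingThreads.set(held);
        return true; // keep the assert satisfied
    }
}
```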

    channelReference.decRef();
}

public FsChannelImmutableReader immutableReader() throws TranslogException {
    if (channelReference.tryIncRef() == false) {
Contributor

I guess we can use the double incRef pattern here too?

Contributor Author

not sure I follow what you mean?
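
For context, a rough sketch of the reference-counting contract the snippet above relies on (illustrative names, not the actual implementation): tryIncRef only succeeds while at least one reference is still held, so a reader is never handed a channel that has already been closed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch of the ref-counting contract used above: tryIncRef only
// succeeds while at least one reference is still held, so a reader can never be
// handed a channel that has already been closed.
class ChannelReferenceSketch {
    private final AtomicInteger refCount = new AtomicInteger(1); // owner's initial reference

    boolean tryIncRef() {
        while (true) {
            int current = refCount.get();
            if (current <= 0) {
                return false; // already fully released; caller must fail (e.g. throw TranslogException)
            }
            if (refCount.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    void decRef() {
        if (refCount.decrementAndGet() == 0) {
            // last reference released: close the underlying file channel here
        }
    }
}
```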

@s1monw
Contributor

s1monw commented Apr 23, 2015

I like what I see a lot. Yet I think we still need to work on the unit-test end and add more basic tests; it's just a feeling, but we should have more tests that simply use these classes as they are intended. I can help here once we are closer!

@bleskes
Contributor Author

bleskes commented Apr 24, 2015

@s1monw @dakrone pushed another commit addressing yours and @kimchy's feedback.

@s1monw
Contributor

s1monw commented Apr 28, 2015

@bleskes I added some answers to your comments

@bleskes
Contributor Author

bleskes commented Apr 28, 2015

@s1monw I pushed another round. Also added a test and removed the last nocommit. I think this is ready now?

@s1monw
Contributor

s1monw commented Apr 28, 2015

@bleskes I added some minor comments on the commit - LGTM, feel free to push once they're fixed.

@s1monw
Contributor

s1monw commented Apr 29, 2015

LGTM

bleskes added a commit to bleskes/elasticsearch that referenced this pull request Apr 29, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Apr 29, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Apr 30, 2015
@s1monw s1monw merged commit d596f5c into elastic:master May 5, 2015
bleskes added a commit to bleskes/elasticsearch that referenced this pull request Jan 7, 2016
… slow recovery

elastic#10624 decoupled translog flush from ongoing recoveries. In the process, translog creation was delayed to the moment the engine is created (during recovery, after copying files from the primary). On the other side, TranslogService, which is in charge of translog-based flushes, starts a background checker as soon as the shard is allocated. That checker performs its first check after 5s, expecting the translog to be there. However, if the file-copying phase of the recovery takes >5s (likely!) or local recovery is slow, the check can run into an exception and never recover. The end result is that the translog-based flush is completely disabled.

Note that this is mitigated by shard inactivity, which triggers a synced flush after 5m of no indexing.

Closes elastic#15814
@yiguolei

@bleskes When a user deletes a doc during recovery, the recovery process will send an index request to the replica, and the replica will index the doc again, so the doc will appear again since we removed phase3.

@bleskes bleskes deleted the gen_translog branch May 11, 2016 14:22
@bleskes
Contributor Author

bleskes commented May 11, 2016

@yiguolei sorry for the late response, but doc deletes are versioned just like any other write operation and can arrive out of order to the replicas. When the indexing operation is replayed it will not override the delete operation.
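
A simplified illustration of that versioning argument (a sketch, not the actual engine code): the replayed index operation carries an older version than the delete that followed it, so the replica drops it instead of resurrecting the doc.

```java
// Simplified illustration of the versioning argument above (not the actual engine
// code): the replayed index operation carries an older version than the delete
// that followed it, so the replica ignores it instead of resurrecting the doc.
class ReplicaVersionCheckSketch {
    private long currentVersion = -1; // version of the latest operation applied for a doc id
    private boolean deleted = false;

    void apply(long opVersion, boolean isDelete) {
        if (opVersion <= currentVersion) {
            return; // out-of-order replay of an older operation: drop it
        }
        currentVersion = opVersion;
        deleted = isDelete;
    }

    boolean isDeleted() {
        return deleted;
    }
}
```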

Labels
>breaking, :Distributed/Recovery, >enhancement, release highlight, resiliency, v2.0.0-beta1
6 participants