Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore EngineClosedException during translog fysnc #12384

Merged
merged 1 commit into from Jul 21, 2015

Conversation

dakrone
Copy link
Member

@dakrone dakrone commented Jul 21, 2015

When performing an operation on a primary, the state is captured and the
operation is performed on the primary shard. The original request is then
modified to increment the version of the operation as preparation for it to
be sent to the replicas.

If the request first fails on the primary during the translog sync
(because the Engine is already closed due to shadow primaries closing
the engine on relocation), then the operation is retried on the new primary
after being modified for the replica shards. It will then fail due to the
version being incorrect (the document does not yet exist but the request
expects a version of "1").

Order of operations:

  • Request is executed against primary
  • Request is modified (version incremented) so it can be sent to replicas
  • Engine's translog is fsync'd if necessary (failing, and throwing an exception)
  • Modified request is retried against new primary

This change ignores the exception where the engine is already closed
when syncing the translog (similar to how we ignore exceptions when
refreshing the shard if the ?refresh=true flag is used).

indexShard.sync(location);
} catch (EngineClosedException e) {
// ignore, the engine is already closed and the operation is
// going to be retried
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't want it to be retried... that was the cause of the failure

@bleskes
Copy link
Contributor

bleskes commented Jul 21, 2015

Left a minor comment. LGTM.

@dakrone dakrone force-pushed the ignore-closed-engine-during-fsync branch from 7090590 to 2dfd2bd Compare July 21, 2015 21:09
When performing an operation on a primary, the state is captured and the
operation is performed on the primary shard. The original request is
then modified to increment the version of the operation as preparation
for it to be sent to the replicas.

If the request first fails on the primary during the translog sync
(because the Engine is already closed due to shadow primaries closing
the engine on relocation), then the operation is retried on the new primary
after being modified for the replica shards. It will then fail due to the
version being incorrect (the document does not yet exist but the request
expects a version of "1").

Order of operations:

- Request is executed against primary
- Request is modified (version incremented) so it can be sent to replicas
- Engine's translog is fsync'd if necessary (failing, and throwing an exception)
- Modified request is retried against new primary

This change ignores the exception where the engine is already closed
when syncing the translog (similar to how we ignore exceptions when
refreshing the shard if the ?refresh=true flag is used).
@dakrone dakrone force-pushed the ignore-closed-engine-during-fsync branch from 2dfd2bd to c286cd1 Compare July 21, 2015 21:09
@dakrone dakrone merged commit c286cd1 into elastic:master Jul 21, 2015
@@ -208,7 +209,12 @@ private void processAfter(IndexRequest request, IndexShard indexShard, Translog.
}

if (indexShard.getTranslogDurability() == Translog.Durabilty.REQUEST && location != null) {
indexShard.sync(location);
try {
indexShard.sync(location);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the wrong fix since we call this in many places ie. during delete and bulk. I think we should just rename this method into trySync and don't sync if the engine is closed. so we handle it correctly everywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I have opened another issue for this: #12603

@dakrone dakrone deleted the ignore-closed-engine-during-fsync branch August 16, 2016 22:01
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Feb 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants