Ignore EngineClosedException during translog fsync #12384
    indexShard.sync(location);
} catch (EngineClosedException e) {
    // ignore, the engine is already closed and the operation is
    // going to be retried
we don't want it to be retried... that was the cause of the failure
Left a minor comment. LGTM.
When performing an operation on a primary, the state is captured and the operation is performed on the primary shard. The original request is then modified to increment the version of the operation as preparation for it to be sent to the replicas.

If the request first fails on the primary during the translog sync (because the Engine is already closed due to shadow primaries closing the engine on relocation), then the operation is retried on the new primary after being modified for the replica shards. It will then fail due to the version being incorrect (the document does not yet exist but the request expects a version of "1").

Order of operations:

- Request is executed against primary
- Request is modified (version incremented) so it can be sent to replicas
- Engine's translog is fsync'd if necessary (failing, and throwing an exception)
- Modified request is retried against new primary

This change ignores the exception where the engine is already closed when syncing the translog (similar to how we ignore exceptions when refreshing the shard if the ?refresh=true flag is used).
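
The order of operations described in this commit message can be reproduced with a small standalone simulation. This is only a sketch, not Elasticsearch code: `RetrySketch`, `Shard`, `Request`, both exception classes, and the convention that an expected version of 0 means "no expectation" are hypothetical stand-ins chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; none of these are Elasticsearch classes.
final class RetrySketch {
    static final class EngineClosedException extends RuntimeException {}

    static final class VersionConflictException extends RuntimeException {
        VersionConflictException(String message) { super(message); }
    }

    static final class Request {
        final String id;
        long expectedVersion; // 0 means "no expectation" in this sketch
        Request(String id) { this.id = id; }
    }

    static final class Shard {
        final Map<String, Long> versions = new HashMap<>();
        boolean engineClosed;

        void index(Request request) {
            long current = versions.getOrDefault(request.id, 0L);
            if (request.expectedVersion != 0 && request.expectedVersion != current) {
                throw new VersionConflictException(
                        "expected " + request.expectedVersion + ", found " + current);
            }
            versions.put(request.id, current + 1);
            // The request is mutated so that it can be forwarded to replicas.
            request.expectedVersion = current + 1;
        }

        void sync() {
            if (engineClosed) {
                throw new EngineClosedException();
            }
        }
    }

    // Runs the sequence once; the flag toggles the behaviour proposed in this change.
    static String run(boolean ignoreClosedEngine) {
        Shard oldPrimary = new Shard();
        Shard newPrimary = new Shard(); // the document does not exist here yet
        Request request = new Request("doc-1");
        try {
            oldPrimary.index(request);      // 1. executed against the primary
                                            // 2. request now carries version 1
            oldPrimary.engineClosed = true; // relocation closes the engine
            try {
                oldPrimary.sync();          // 3. translog fsync fails
            } catch (EngineClosedException e) {
                if (!ignoreClosedEngine) {
                    throw e;
                }
            }
            return "ok";
        } catch (EngineClosedException e) {
            try {
                newPrimary.index(request);  // 4. modified request is retried
                return "ok after retry";
            } catch (VersionConflictException conflict) {
                return "version conflict: " + conflict.getMessage();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // version conflict: expected 1, found 0
        System.out.println(run(true));  // ok
    }
}
```

With the exception propagated, the retry carries the replica-bound version and is rejected by the new primary; with the exception ignored, no retry happens at all.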
@@ -208,7 +209,12 @@ private void processAfter(IndexRequest request, IndexShard indexShard, Translog.
         }

         if (indexShard.getTranslogDurability() == Translog.Durabilty.REQUEST && location != null) {
-            indexShard.sync(location);
+            try {
+                indexShard.sync(location);
I think this is the wrong fix, since we call this in many places, e.g. during delete and bulk. I think we should just rename this method to trySync and not sync if the engine is closed, so that we handle it correctly everywhere?
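
A rough sketch of what such a method could look like, for discussion only. The `Engine` interface, the `long` location, and the boolean return value are assumptions made for this example; the real `IndexShard` and `Translog.Location` types are not reproduced here, and the eventual signature may differ.

```java
// Hypothetical stand-ins for illustration; not the actual Elasticsearch types.
final class TrySyncSketch {
    static final class EngineClosedException extends RuntimeException {}

    // Assumed minimal engine contract: sync may throw if the engine is closed.
    interface Engine {
        void sync(long location);
    }

    static final class Shard {
        private final Engine engine;

        Shard(Engine engine) {
            this.engine = engine;
        }

        /**
         * Syncs the translog up to the given location if the engine is still open.
         * Returns true if the sync ran, false if the engine was already closed.
         */
        boolean trySync(long location) {
            try {
                engine.sync(location);
                return true;
            } catch (EngineClosedException e) {
                // The engine is closed, so there is nothing left to sync.
                // The operation itself has already been applied; do not fail it.
                return false;
            }
        }
    }

    public static void main(String[] args) {
        Shard open = new Shard(location -> { /* pretend to fsync */ });
        Shard closed = new Shard(location -> { throw new EngineClosedException(); });
        System.out.println(open.trySync(42L));
        System.out.println(closed.trySync(42L));
    }
}
```

Handling the closed engine inside the shard method means index, delete, and bulk callers all get the same behaviour without each adding its own try/catch.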
Sounds good, I have opened another issue for this: #12603