FSTranslog#snapshot() can enter infinite loop #10807
Comments
@bleskes I wonder if this can also cause other funky things, like large translogs?
Here is the failure that shows the bug: http://build-us-00.elastic.co/job/es_g1gc_master_metal/6078/consoleFull
Good catch! I don't think this can explain big translogs, as the translog is only closed after the engine is closed, so no writes are possible while this is ongoing. For what it's worth, this is also fixed in #10624, as the snapshot is retrieved from a view held by the recovery code. That view has its own reference to the relevant translog files, which means they cannot be closed.
If the translog is closed while a snapshot operation is in progress, we must fail the snapshot operation; otherwise we end up in an endless loop. Closes elastic#10807
@bleskes I kept the fix minimal since we are refactoring this anyway.
If the translog is closed while a snapshot operation is in progress, we must fail the snapshot operation; otherwise we end up in an endless loop. Closes #10807 Conflicts: src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
The translog might be reused across engines, which is currently a problem in the design, so we have to allow calls to `close` more than once. This moves the closed check for snapshot onto the actual file to exit the loop. Relates to #10807
If a translog is closed while a recovery is still running, we can end up in an infinite loop that keeps the shard, the store, etc. open. I bet there are more bad things that can happen here...
it manifested in exceptions like this:
and exceptions like:
All subsequent tests also fail with the delete not acked and print the same thread always sitting on a yield call in `FSTranslog`. If the translog is actually closed, `current.snapshot()` always returns `null` and we will spin forever... Phew, I am happy I finally tracked it down. It's super rare but annoying :)
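To make the failure mode concrete, here is a minimal sketch of the retry loop and of a closed check that lets it exit. It is not the actual `FsTranslog` code: `SketchTranslog`, `SketchTranslogFile`, `roll()` and the `String` snapshot token are hypothetical stand-ins, and the only behavior taken from the report is that `current.snapshot()` returns `null` once the file is closed while the caller yields and retries.

```java
/** Hypothetical stand-in for a single translog file; not the real Elasticsearch class. */
final class SketchTranslogFile {
    private volatile boolean closed;

    /** Returns a snapshot token, or null once this file has been closed. */
    String snapshot() {
        return closed ? null : "snapshot";
    }

    boolean isClosed() {
        return closed;
    }

    /** Safe to call more than once, since the translog may be closed repeatedly. */
    void close() {
        closed = true;
    }
}

/** Hypothetical stand-in for the translog that owns the current file. */
final class SketchTranslog {
    private volatile SketchTranslogFile current = new SketchTranslogFile();

    String snapshot() {
        while (true) {
            SketchTranslogFile file = current;
            String snapshot = file.snapshot();
            if (snapshot != null) {
                return snapshot;
            }
            // Without this check the loop spins forever once the translog is closed:
            // the file is closed and nothing will ever replace it.
            if (file.isClosed() && file == current) {
                throw new IllegalStateException("current translog is already closed");
            }
            // Otherwise the file was swapped out concurrently; retry with the new one.
            Thread.yield();
        }
    }

    /** Assumed rollover: install a fresh file, then close the old one. */
    void roll() {
        SketchTranslogFile old = current;
        current = new SketchTranslogFile();
        old.close();
    }

    void close() {
        current.close();
    }
}
```

In this sketch, calling `snapshot()` after `close()` throws instead of looping, while a `null` caused by a concurrent `roll()` is still retried against the new file. Checking the file rather than a flag on the translog itself keeps repeated `close` calls harmless.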