
FSTranslog#snapshot() can enter infinite loop #10807

Closed
s1monw opened this issue Apr 25, 2015 · 4 comments · Fixed by #10809
Assignees
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v1.5.2 v1.6.0 v2.0.0-beta1

Comments


s1monw commented Apr 25, 2015

If a translog is closed while a recovery is still running, we can end up in an infinite loop, keeping the shard, the store, etc. open. I bet there are more bad things that can happen here...

it manifested in exceptions like this:

 2> REPRODUCE WITH: mvn test -Pdev -Dtests.seed=89A1F19C6ECBCF0C -Dtests.class=org.elasticsearch.test.rest.Rest2Tests -Dtests.slow=true -Dtests.method="test {yaml=indices.get_field_mapping/10_basic/Get field mapping with include_defaults}" -Des.logger.level=DEBUG -Des.node.mode=local -Dtests.security.manager=true -Dtests.nightly=false -Dtests.heap.size=1024m -Dtests.jvm.argline="-server -XX:+UseG1GC -XX:+UseCompressedOops -XX:+AggressiveOpts" -Dtests.locale=tr -Dtests.timezone=Europe/Isle_of_Man -Dtests.rest.blacklist=cat.recovery/10_basic/*
FAILURE 30.4s J4 | Rest2Tests.test {yaml=indices.get_field_mapping/10_basic/Get field mapping with include_defaults} <<<
   > Throwable #1: java.lang.AssertionError: Delete Index failed - not acked

and exceptions like:

  1> [2015-04-25 16:10:50,790][DEBUG][indices                  ] [node_s1] [test_index] failed to delete index store - at least one shards is still locked
  1> org.apache.lucene.store.LockObtainFailedException: Can't lock shard [test_index][2], timed out after 0ms
  1>    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:520)
  1>    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:448)
  1>    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:392)
  1>    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:342)
  1>    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:494)
  1>    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:401)
  1>    at org.elasticsearch.indices.IndicesService.deleteIndex(IndicesService.java:443)
  1>    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndex(IndicesClusterStateService.java:845)

All subsequent tests also fail with the delete not being acked, and the thread dumps always show the same thread sitting on a yield call in FsTranslog:

  1>         at java.lang.Thread.run(Thread.java:745)
  1>    8) Thread[id=178, name=elasticsearch[node_s0][generic][T#1], state=RUNNABLE, group=TGRP-Rest2Tests]
  1>         at java.lang.Thread.yield(Native Method)
  1>         at org.elasticsearch.index.translog.fs.FsTranslog.snapshot(FsTranslog.java:362)
  1>         at org.elasticsearch.index.translog.fs.FsTranslog.snapshot(FsTranslog.java:61)
  1>         at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:845)
  1>         at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:730)
  1>         at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
  1>         at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
  1>         at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:135)

If the translog is actually closed, `current.snapshot()` always returns null and we will spin forever:

    @Override
    public FsChannelSnapshot snapshot() throws TranslogException {
        while (true) {
            FsChannelSnapshot snapshot = current.snapshot();
            if (snapshot != null) {
                return snapshot;
            }
            Thread.yield();
        }
    }
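A minimal sketch of the kind of fix the commit messages describe (fail the snapshot rather than spin when the translog is closed). The class and field names here are simplified stand-ins, not the actual Elasticsearch code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for the real TranslogException.
class TranslogException extends RuntimeException {
    TranslogException(String message) { super(message); }
}

// Stand-in for the real FsChannelSnapshot.
class FsChannelSnapshot {}

class FsTranslogSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    // null mimics current.snapshot() returning null after the file is closed
    private volatile FsChannelSnapshot current = new FsChannelSnapshot();

    void close() {
        closed.set(true);
        current = null;
    }

    FsChannelSnapshot snapshot() throws TranslogException {
        while (true) {
            // the fix: bail out with an exception instead of spinning forever
            if (closed.get()) {
                throw new TranslogException("translog is closed, cannot snapshot");
            }
            FsChannelSnapshot snapshot = current;
            if (snapshot != null) {
                return snapshot;
            }
            Thread.yield();
        }
    }
}
```

With this check, the recovery thread that was previously stuck on `Thread.yield()` fails fast and the shard lock can be released.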

phew I am happy I finally tracked it down, it's super rare but annoying :)

s1monw commented Apr 25, 2015

@bleskes I wonder if this can also cause other funky things like large translogs etc.?

s1monw commented Apr 25, 2015

here is the failure that shows the bug http://build-us-00.elastic.co/job/es_g1gc_master_metal/6078/consoleFull

bleskes commented Apr 25, 2015

Good catch! I don't think this can explain big translogs, as the translog is only closed after the engine is closed, so no writes are possible while this is ongoing. For what it's worth, this is also fixed in #10624, as the snapshot is retrieved from a view held by the recovery code. That view has its own reference to the relevant translog files, which means they cannot be closed.
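The view-based approach @bleskes mentions can be sketched with simple reference counting: the view takes its own reference on the translog file, so closing the translog does not release the underlying channel until the view is done. Names and structure here are illustrative assumptions, not the actual #10624 implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for a reference-counted translog file.
class TranslogFileSketch {
    private final AtomicInteger refs = new AtomicInteger(1); // translog's own ref
    private volatile boolean channelClosed = false;

    // A recovery view calls this to take its own reference.
    void incRef() {
        if (refs.getAndIncrement() <= 0) {
            throw new IllegalStateException("file already closed");
        }
    }

    // Both the translog and the view release via decRef; the underlying
    // channel is only closed when the last reference is dropped.
    void decRef() {
        if (refs.decrementAndGet() == 0) {
            channelClosed = true;
        }
    }

    boolean isChannelClosed() { return channelClosed; }
}
```

Because the snapshot is taken through a view that holds such a reference, `current.snapshot()` can no longer observe a closed file mid-recovery.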

s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 26, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes elastic#10807
s1monw self-assigned this Apr 26, 2015
s1monw commented Apr 26, 2015

@bleskes I kept the fix minimal since we are refactoring this anyways

s1monw added a commit that referenced this issue Apr 26, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes #10807

Conflicts:
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
s1monw added a commit that referenced this issue Apr 26, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes #10807

Conflicts:
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
s1monw added a commit that referenced this issue Apr 26, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to #10807
s1monw added v1.5.2 and removed v1.5.3 labels Apr 26, 2015
s1monw added a commit that referenced this issue Apr 26, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to #10807
s1monw added a commit that referenced this issue Apr 26, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to #10807
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes elastic#10807

Conflicts:
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to elastic#10807
clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog, :Distributed/Distributed A catch all label for anything in the Distributed Area. labels Feb 13, 2018