
FSTranslog#snapshot() can enter infinite loop #10807

Closed
s1monw opened this issue Apr 25, 2015 · 4 comments · Fixed by #10809
Assignees
Labels
>bug :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. v1.5.2 v1.6.0 v2.0.0-beta1

Comments


s1monw commented Apr 25, 2015

If a translog is closed while a recovery is still running, we can end up in an infinite loop, keeping the shard, the store, etc. open. I bet there are more bad things that can happen here...

it manifested in exceptions like this:

 2> REPRODUCE WITH: mvn test -Pdev -Dtests.seed=89A1F19C6ECBCF0C -Dtests.class=org.elasticsearch.test.rest.Rest2Tests -Dtests.slow=true -Dtests.method="test {yaml=indices.get_field_mapping/10_basic/Get field mapping with include_defaults}" -Des.logger.level=DEBUG -Des.node.mode=local -Dtests.security.manager=true -Dtests.nightly=false -Dtests.heap.size=1024m -Dtests.jvm.argline="-server -XX:+UseG1GC -XX:+UseCompressedOops -XX:+AggressiveOpts" -Dtests.locale=tr -Dtests.timezone=Europe/Isle_of_Man -Dtests.rest.blacklist=cat.recovery/10_basic/*
FAILURE 30.4s J4 | Rest2Tests.test {yaml=indices.get_field_mapping/10_basic/Get field mapping with include_defaults} <<<
   > Throwable #1: java.lang.AssertionError: Delete Index failed - not acked

and exceptions like:

  1> [2015-04-25 16:10:50,790][DEBUG][indices                  ] [node_s1] [test_index] failed to delete index store - at least one shards is still locked
  1> org.apache.lucene.store.LockObtainFailedException: Can't lock shard [test_index][2], timed out after 0ms
  1>    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:520)
  1>    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:448)
  1>    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:392)
  1>    at org.elasticsearch.env.NodeEnvironment.deleteIndexDirectorySafe(NodeEnvironment.java:342)
  1>    at org.elasticsearch.indices.IndicesService.deleteIndexStore(IndicesService.java:494)
  1>    at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:401)
  1>    at org.elasticsearch.indices.IndicesService.deleteIndex(IndicesService.java:443)
  1>    at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndex(IndicesClusterStateService.java:845)

All subsequent tests also fail with the delete not being acked, and the thread dumps always show the same thread sitting on a yield call in FsTranslog:

  1>         at java.lang.Thread.run(Thread.java:745)
  1>    8) Thread[id=178, name=elasticsearch[node_s0][generic][T#1], state=RUNNABLE, group=TGRP-Rest2Tests]
  1>         at java.lang.Thread.yield(Native Method)
  1>         at org.elasticsearch.index.translog.fs.FsTranslog.snapshot(FsTranslog.java:362)
  1>         at org.elasticsearch.index.translog.fs.FsTranslog.snapshot(FsTranslog.java:61)
  1>         at org.elasticsearch.index.engine.InternalEngine.recover(InternalEngine.java:845)
  1>         at org.elasticsearch.index.shard.IndexShard.recover(IndexShard.java:730)
  1>         at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:125)
  1>         at org.elasticsearch.indices.recovery.RecoverySource.access$200(RecoverySource.java:49)
  1>         at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:135)

If the translog is actually closed, `current.snapshot()` always returns null and we will spin forever:

    @Override
    public FsChannelSnapshot snapshot() throws TranslogException {
        while (true) {
            FsChannelSnapshot snapshot = current.snapshot();
            if (snapshot != null) {
                return snapshot;
            }
            Thread.yield();
        }
    }
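A minimal sketch of the kind of fix the commit messages describe (fail the snapshot rather than spin when the translog is closed). The class and field names here are simplified stand-ins, not the actual Elasticsearch code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Stand-in for the real TranslogException.
class TranslogException extends RuntimeException {
    TranslogException(String message) { super(message); }
}

// Stand-in for the real FsChannelSnapshot.
class FsChannelSnapshot {}

class FsTranslogSketch {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    // null mimics current.snapshot() returning null after the file is closed
    private volatile FsChannelSnapshot current = new FsChannelSnapshot();

    void close() {
        closed.set(true);
        current = null;
    }

    FsChannelSnapshot snapshot() throws TranslogException {
        while (true) {
            // the fix: bail out with an exception instead of spinning forever
            if (closed.get()) {
                throw new TranslogException("translog is closed, cannot snapshot");
            }
            FsChannelSnapshot snapshot = current;
            if (snapshot != null) {
                return snapshot;
            }
            Thread.yield();
        }
    }
}
```

With this check, the recovery thread that was previously stuck on `Thread.yield()` fails fast and the shard lock can be released.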

phew I am happy I finally tracked it down, it's super rare but annoying :)

s1monw commented Apr 25, 2015

@bleskes I wonder if this can also cause other funky things like large translogs etc.?

s1monw commented Apr 25, 2015

here is the failure that shows the bug http://build-us-00.elastic.co/job/es_g1gc_master_metal/6078/consoleFull

bleskes commented Apr 25, 2015

Good catch! I don't think this can explain big translogs, as the translog is only closed after the engine is closed, so no writes are possible while this is ongoing. For what it's worth, this is also fixed in #10624, as the snapshot is retrieved from a view held by the recovery code. That view has its own reference to the relevant translog files, which means they cannot be closed.
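The view-based approach @bleskes mentions can be sketched with simple reference counting: the view takes its own reference on the translog file, so closing the translog does not release the underlying channel until the view is done. Names and structure here are illustrative assumptions, not the actual #10624 implementation:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for a reference-counted translog file.
class TranslogFileSketch {
    private final AtomicInteger refs = new AtomicInteger(1); // translog's own ref
    private volatile boolean channelClosed = false;

    // A recovery view calls this to take its own reference.
    void incRef() {
        if (refs.getAndIncrement() <= 0) {
            throw new IllegalStateException("file already closed");
        }
    }

    // Both the translog and the view release via decRef; the underlying
    // channel is only closed when the last reference is dropped.
    void decRef() {
        if (refs.decrementAndGet() == 0) {
            channelClosed = true;
        }
    }

    boolean isChannelClosed() { return channelClosed; }
}
```

Because the snapshot is taken through a view that holds such a reference, `current.snapshot()` can no longer observe a closed file mid-recovery.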

s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 26, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes elastic#10807
s1monw self-assigned this Apr 26, 2015
s1monw commented Apr 26, 2015

@bleskes I kept the fix minimal since we are refactoring this anyways

s1monw added a commit that referenced this issue Apr 26, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes #10807

Conflicts:
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
s1monw added a commit that referenced this issue Apr 26, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes #10807

Conflicts:
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
s1monw added a commit that referenced this issue Apr 26, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to #10807
s1monw added v1.5.2 and removed v1.5.3 labels Apr 26, 2015
s1monw added a commit that referenced this issue Apr 26, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to #10807
s1monw added a commit that referenced this issue Apr 26, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to #10807
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
If the translog is closed while a snapshot operation is in progress
we must fail the snapshot operation otherwise we end up in an endless
loop.

Closes elastic#10807

Conflicts:
	src/main/java/org/elasticsearch/index/translog/fs/FsTranslog.java
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
The translog might be reused across engines, which is currently a problem
in the design, such that we have to allow calls to `close` more than once.
This moves the closed check for snapshot onto the actual file to exit the loop.

Relates to elastic#10807
clintongormley added :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Translog, :Distributed/Distributed A catch all label for anything in the Distributed Area. labels Feb 13, 2018