
Test Failure: SearchQueryIT.testIssue3177 - force merge attempted on closed index #13266

Closed
colings86 opened this issue Sep 2, 2015 · 5 comments
Labels: >test-failure (Triaged test failures from CI), v2.0.0


Build URL: http://build-us-00.elastic.co/job/elasticsearch-20-strong/311/testReport/junit/org.elasticsearch.search.query/SearchQueryIT/testIssue3177/

Cannot reproduce locally

Reproduction command:

mvn verify -Pdev -Dskip.unit.tests -pl org.elasticsearch:elasticsearch -Dtests.seed=B91C259E8901B4E2 -Dtests.class=org.elasticsearch.search.query.SearchQueryIT -Dtests.method="testIssue3177" -Des.logger.level=DEBUG -Des.node.mode=network -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=835m -Dtests.locale=pt -Dtests.timezone=Africa/Asmera

Stack trace for failure:

java.lang.AssertionError: Unexpected ShardFailures: [[test][2] failed, reason [BroadcastShardOperationFailedException[operation indices:admin/optimize failed]; nested: ForceMergeFailedEngineException[force merge failed]; nested: EngineClosedException[CurrentState[CLOSED] Closed]; ]]
Expected: <0>
     but: was <1>
    at __randomizedtesting.SeedInfo.seed([B91C259E8901B4E2:681C883A5FC3B16F]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
    at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertNoFailures(ElasticsearchAssertions.java:343)
    at org.elasticsearch.test.ESIntegTestCase.optimize(ESIntegTestCase.java:1226)
    at org.elasticsearch.search.query.SearchQueryIT.testIssue3177(SearchQueryIT.java:148)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1638)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:847)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:897)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:798)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:458)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:856)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:792)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:803)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:54)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
    at java.lang.Thread.run(Thread.java:745)
colings86 added the v2.0.0-beta1 and >test-failure (Triaged test failures from CI) labels Sep 2, 2015
brwe self-assigned this Sep 2, 2015
brwe commented Sep 2, 2015

The test fails because optimize is called with the expectation that there will be no errors, but optimize fails because a shard is relocating. Here is how:

First of all, when a shard relocates while optimize is called, the call might fail with
ForceMergeFailedEngineException[force merge failed]; nested: AlreadyClosedException[this IndexWriter is closed] if:

  • the optimize request was sent to a shard that is currently relocating
  • the request was started while the shard was still active and open but executed after it was closed (which happens very rarely)

To avoid this, the test creates an index, waits for relocations to finish, and only then calls optimize.
This is the relevant part of the code:

        createIndex("test");
        client().prepareIndex("test", "type1", "1").setSource("field1", "value1").get();
        client().prepareIndex("test", "type1", "2").setSource("field1", "value2").get();
        client().prepareIndex("test", "type1", "3").setSource("field1", "value3").get();
        waitForRelocation();
        optimize();

This assumes that after waitForRelocation() has returned, shards are no longer relocating and optimize will always succeed.

However, because the test does not wait for green status, relocations can still start after waitForRelocation() has returned successfully.
In this case, here is the cluster state after all shards have been assigned but not all of them started:

nodes: 
{node_s2}{EXoFzKQETGiqaiRDrLPbCg}{127.0.0.1}{127.0.0.1:9402}{mode=network}
{node_s0}{iUkDDsM2T-OGXFNYdljzBQ}{127.0.0.1}{127.0.0.1:9400}{mode=network}, master
{node_s1}{gW0qrQwiQpOzwgOfzKUlPg}{127.0.0.1}{127.0.0.1:9401}{mode=network}
[...]
routing_nodes:
-----node_id[gW0qrQwiQpOzwgOfzKUlPg][V]
--------[test][1], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[2], s[INITIALIZING], a[id=bdfh-y94SOi9bfw_zJ9OYQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]
--------[test][3], node[gW0qrQwiQpOzwgOfzKUlPg], [P], v[3], s[STARTED], a[id=LfWoFMG-Rr2b-T5xXcYfNQ]
--------[test][2], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[3], s[STARTED], a[id=PJUO1DYETwujMkyCh9iR3A]
--------[test][5], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[3], s[STARTED], a[id=HUnQpyvbTFycsVRZOYe87w]
--------[test][0], node[gW0qrQwiQpOzwgOfzKUlPg], [P], v[2], s[STARTED], a[id=fbzg-cteQo6A9nIqCpfKXQ]
-----node_id[iUkDDsM2T-OGXFNYdljzBQ][V]
--------[test][1], node[iUkDDsM2T-OGXFNYdljzBQ], [P], v[2], s[STARTED], a[id=B7XeUI8-TqSXKA7ZnASn0Q]
--------[test][4], node[iUkDDsM2T-OGXFNYdljzBQ], [P], v[3], s[STARTED], a[id=Xne9jkeJRZCLgxeU4ZDEUw]
--------[test][0], node[iUkDDsM2T-OGXFNYdljzBQ], [R], v[2], s[INITIALIZING], a[id=IPATKM4TTRac_duGlNgEWQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]
-----node_id[EXoFzKQETGiqaiRDrLPbCg][V]
--------[test][3], node[EXoFzKQETGiqaiRDrLPbCg], [R], v[3], s[STARTED], a[id=8BDoiJRgS0euiPu9svsgNA]
--------[test][2], node[EXoFzKQETGiqaiRDrLPbCg], [P], v[3], s[STARTED], a[id=MbLEmN88TPOnZzD7Kj4Vcg]
--------[test][4], node[EXoFzKQETGiqaiRDrLPbCg], [R], v[3], s[STARTED], a[id=rSdf8UHSQpuFZeDLbH6RNw]
--------[test][5], node[EXoFzKQETGiqaiRDrLPbCg], [P], v[3], s[STARTED], a[id=dEuJ6wkYTo-XeJAch4AVYA]
---- unassigned

tasks: (1):
131/URGENT/shard-started ([test][0], node[iUkDDsM2T-OGXFNYdljzBQ], [R], v[2], s[INITIALIZING], a[id=IPATKM4TTRac_duGlNgEWQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]), reason [after recovery (replica) from node [{node_s1}{gW0qrQwiQpOzwgOfzKUlPg}{127.0.0.1}{127.0.0.1:9401}{mode=network}]]/10.6s

Because of the uneven distribution of shards, one shard will be relocated from node_s0 to node_s1 after the wait for relocations has completed, and the subsequent optimize might then fail.

An ensureGreen() will fix the issue, but I do wonder why we start off with such an odd distribution of shards to begin with. I will investigate this.
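
For illustration, here is how that fix would look in the test above. This is a sketch only, assuming the ESIntegTestCase helpers already used in this test; ensureGreen(String...) waits until the cluster reports green for the index, i.e. every shard copy is STARTED, which waitForRelocation() alone does not guarantee:

        createIndex("test");
        client().prepareIndex("test", "type1", "1").setSource("field1", "value1").get();
        client().prepareIndex("test", "type1", "2").setSource("field1", "value2").get();
        client().prepareIndex("test", "type1", "3").setSource("field1", "value3").get();
        // Wait for green: all primaries and replicas are STARTED, so no
        // INITIALIZING replica can kick off a relocation after this point.
        ensureGreen("test");
        waitForRelocation();
        optimize();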

brwe added a commit that referenced this issue Sep 2, 2015
In this test we assume that after waitForRelocation() has returned, shards
are no longer relocating and optimize will therefore always succeed.
However, because the test does not wait for green status, relocations can
still start after waitForRelocation() has returned successfully.

see #13266 for a detailed explanation
s1monw commented Sep 2, 2015

An ensureGreen() will fix the issue, but I do wonder why we start off with such an odd distribution of shards to begin with. I will investigate this.

this allocation is a corner case that I tried to special-case in the shard allocator here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java#L647

in the last allocation round we have two shards left:

--------[test][1], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[2], s[INITIALIZING], a[id=bdfh-y94SOi9bfw_zJ9OYQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]
--------[test][0], node[iUkDDsM2T-OGXFNYdljzBQ], [R], v[2], s[INITIALIZING], a[id=IPATKM4TTRac_duGlNgEWQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]

ideally we would put both on node_id[iUkDDsM2T-OGXFNYdljzBQ], but it already has shard [test][1], so we had to put it somewhere else... maybe there is a bug in the code, or we should take another look at solving it differently.
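
For context, the reason the [test][1] replica could not be placed on that node is the invariant that a node must never hold two copies of the same shard (enforced in Elasticsearch by the same-shard allocation decider). A minimal, self-contained illustration of that check, with invented stand-in types rather than the actual allocator code:

    import java.util.List;

    final class SameShardCheckSketch {
        // Stand-in for a shard identity: index name plus shard number.
        static final class ShardId {
            final String index;
            final int id;
            ShardId(String index, int id) { this.index = index; this.id = id; }
            boolean sameAs(ShardId other) { return id == other.id && index.equals(other.index); }
        }

        // A node must not hold two copies (primary and replica, or two replicas)
        // of the same shard, so the [test][1] replica had to go elsewhere even
        // though the node was otherwise the least loaded.
        static boolean canAllocate(ShardId candidate, List<ShardId> shardsOnNode) {
            for (ShardId assigned : shardsOnNode) {
                if (assigned.sameAs(candidate)) {
                    return false;
                }
            }
            return true;
        }
    }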

brwe commented Sep 7, 2015

We (@bleskes and I) discussed further why we get a failure at all for optimize while a shard is relocating. Earlier I wrote:

The test fails because optimize is called with the expectation that there will be no errors, but optimize fails because a shard is relocating.

But actually, optimize should not fail at all just because a shard is relocating. The reason we get a failure is that the exceptions returned when the call hits a closed shard are not all handled the same way. I made a PR to fix this (#13380).
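
The gist of that fix, sketched as self-contained Java: when classifying a per-shard failure, walk the cause chain so that wrappers are classified by what they wrap. The exception classes below are stand-ins for the real ones named in this issue, and the actual check lives around TransportActions.isShardNotAvailableException(..); this is an illustration, not the PR's code:

    // Stand-ins for the engine/IndexWriter exceptions named in this issue.
    class EngineClosedException extends RuntimeException {}
    class AlreadyClosedException extends RuntimeException {}

    final class ShardFailureClassifier {
        // Walk the cause chain so that a wrapper such as
        // ForceMergeFailedEngineException[...]; nested: EngineClosedException[...]
        // is classified by its root cause rather than by the wrapper, treating
        // a closed engine or a closed IndexWriter as "shard not available".
        static boolean indicatesClosedShard(Throwable t) {
            for (Throwable cause = t; cause != null; cause = cause.getCause()) {
                if (cause instanceof EngineClosedException
                        || cause instanceof AlreadyClosedException) {
                    return true;
                }
            }
            return false;
        }
    }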

However, while discussing this we found another issue: when we call optimize in tests while shards are relocating, we might actually miss some shards, because the coordinating node for the request might not have the new addresses of the relocated shards yet, the request is not sent to initializing shards, and so on. We should fix that as well, since it might cause test failures too.

In addition, other APIs such as stats do not seem to behave as expected when shards are relocating either. @tlrx might add more on that.

Overall I now think the problem is not so much that shards are relocating unexpectedly in tests, but rather that APIs don't behave as expected while shards are relocating.

tlrx commented Sep 8, 2015

In addition, other APIs such as stats do not seem to behave as expected when shards are relocating either. @tlrx might add more on that.

I noticed that the Indices Stats API, when executed at shard level, throws a ShardNotFoundException if the shard has no routing entry (see code here).

Exceptions are then accumulated at the TransportBroadcastByNodeAction level, and shard operations that throw exceptions are simply ignored and excluded from the indices stats.
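
To make that accumulation behavior concrete, here is a minimal, self-contained sketch (invented names, not the Elasticsearch source) of the broadcast-by-node pattern tlrx describes: a per-shard operation that throws is recorded as a failure and its shard is simply left out of the aggregated stats:

    import java.util.ArrayList;
    import java.util.List;

    final class BroadcastByNodeSketch {
        static final class ShardStats {
            final int shardId;
            ShardStats(int shardId) { this.shardId = shardId; }
        }

        static final class ShardNotFoundException extends RuntimeException {
            ShardNotFoundException(int shardId) {
                super("no routing entry for shard [" + shardId + "]");
            }
        }

        // Collect stats shard by shard; a shard whose operation throws (e.g.
        // because it has no routing entry while relocating) is accumulated as
        // a failure and excluded from the results instead of failing the request.
        static List<ShardStats> collectStats(List<Integer> shardIds, List<Throwable> failures) {
            List<ShardStats> results = new ArrayList<>();
            for (int shardId : shardIds) {
                try {
                    results.add(statsForShard(shardId));
                } catch (RuntimeException e) {
                    failures.add(e); // recorded, but the shard's stats go missing
                }
            }
            return results;
        }

        // Pretend shard 2 is relocating and currently has no routing entry.
        static ShardStats statsForShard(int shardId) {
            if (shardId == 2) {
                throw new ShardNotFoundException(shardId);
            }
            return new ShardStats(shardId);
        }
    }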

brwe added a commit to brwe/elasticsearch that referenced this issue Sep 8, 2015
When we call optimize we ignore exceptions that indicate a closed shard.
However, when a shard is closed while an optimize request is in flight it
might also trigger an AlreadyClosedException from the IndexWriter when we
get the config, or a ForceMergeFailedEngineException with the EngineClosedException
wrapped inside. Because these are not identified as exceptions that indicate
a closed shard (TransportActions.isShardNotAvailableException(..)), optimize
would sometimes report failures when shards were relocating while optimize
was called and sometimes not. This caused weird test failures, see elastic#13266.
Instead, we should let EngineClosedException bubble up and also recognize
AlreadyClosedException as an indicator for a closed shard.
brwe commented Oct 5, 2015

Test was fixed. I opened #13719 for a general discussion on how different APIs deal with relocating shards.

@brwe brwe closed this as completed Oct 5, 2015