
Test Failure: SearchQueryIT.testIssue3177 - force merge attempted on closed index #13266

Closed
colings86 opened this issue Sep 2, 2015 · 5 comments
Labels: >test-failure (Triaged test failures from CI), v2.0.0


Build URL: http://build-us-00.elastic.co/job/elasticsearch-20-strong/311/testReport/junit/org.elasticsearch.search.query/SearchQueryIT/testIssue3177/

Cannot reproduce locally

Reproduction command:

mvn verify -Pdev -Dskip.unit.tests -pl org.elasticsearch:elasticsearch -Dtests.seed=B91C259E8901B4E2 -Dtests.class=org.elasticsearch.search.query.SearchQueryIT -Dtests.method="testIssue3177" -Des.logger.level=DEBUG -Des.node.mode=network -Dtests.assertion.disabled=false -Dtests.security.manager=true -Dtests.heap.size=835m -Dtests.locale=pt -Dtests.timezone=Africa/Asmera

Stack trace for failure:

java.lang.AssertionError: Unexpected ShardFailures: [[test][2] failed, reason [BroadcastShardOperationFailedException[operation indices:admin/optimize failed]; nested: ForceMergeFailedEngineException[force merge failed]; nested: EngineClosedException[CurrentState[CLOSED] Closed]; ]]
Expected: <0>
     but: was <1>
    at __randomizedtesting.SeedInfo.seed([B91C259E8901B4E2:681C883A5FC3B16F]:0)
    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
    at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertNoFailures(ElasticsearchAssertions.java:343)
    at org.elasticsearch.test.ESIntegTestCase.optimize(ESIntegTestCase.java:1226)
    at org.elasticsearch.search.query.SearchQueryIT.testIssue3177(SearchQueryIT.java:148)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1638)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:847)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:897)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:798)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:458)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:856)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$3.evaluate(RandomizedRunner.java:758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$4.evaluate(RandomizedRunner.java:792)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:803)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:42)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:39)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:54)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:55)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
    at java.lang.Thread.run(Thread.java:745)
colings86 added the v2.0.0-beta1 and >test-failure (Triaged test failures from CI) labels Sep 2, 2015
brwe self-assigned this Sep 2, 2015
brwe commented Sep 2, 2015

The test fails because optimize is called with the expectation that there will be no errors, but optimize fails because a shard is relocating. Here is how:

First of all, when a shard relocates while optimize is called, the call might fail with
ForceMergeFailedEngineException[force merge failed]; nested: AlreadyClosedException[this IndexWriter is closed] if:

  • the optimize request was sent to a shard that is currently relocating
  • the request was started while the shard was still active and open but executed after it was closed (which happens very rarely)

To avoid this, the test creates an index, waits for relocations to finish, and only then calls optimize.
This is the relevant part of the code:

        createIndex("test");
        client().prepareIndex("test", "type1", "1").setSource("field1", "value1").get();
        client().prepareIndex("test", "type1", "2").setSource("field1", "value2").get();
        client().prepareIndex("test", "type1", "3").setSource("field1", "value3").get();
        waitForRelocation();
        optimize();

This assumes that after waitForRelocation() has returned, shards are no longer relocating and optimize will always succeed.

However, because the test does not wait for green status, relocations can still start after waitForRelocation() has returned successfully.
In this case, here is the cluster state after all shards have been assigned but not all of them started:

nodes: 
{node_s2}{EXoFzKQETGiqaiRDrLPbCg}{127.0.0.1}{127.0.0.1:9402}{mode=network}
{node_s0}{iUkDDsM2T-OGXFNYdljzBQ}{127.0.0.1}{127.0.0.1:9400}{mode=network}, master
{node_s1}{gW0qrQwiQpOzwgOfzKUlPg}{127.0.0.1}{127.0.0.1:9401}{mode=network}
[...]
routing_nodes:
-----node_id[gW0qrQwiQpOzwgOfzKUlPg][V]
--------[test][1], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[2], s[INITIALIZING], a[id=bdfh-y94SOi9bfw_zJ9OYQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]
--------[test][3], node[gW0qrQwiQpOzwgOfzKUlPg], [P], v[3], s[STARTED], a[id=LfWoFMG-Rr2b-T5xXcYfNQ]
--------[test][2], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[3], s[STARTED], a[id=PJUO1DYETwujMkyCh9iR3A]
--------[test][5], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[3], s[STARTED], a[id=HUnQpyvbTFycsVRZOYe87w]
--------[test][0], node[gW0qrQwiQpOzwgOfzKUlPg], [P], v[2], s[STARTED], a[id=fbzg-cteQo6A9nIqCpfKXQ]
-----node_id[iUkDDsM2T-OGXFNYdljzBQ][V]
--------[test][1], node[iUkDDsM2T-OGXFNYdljzBQ], [P], v[2], s[STARTED], a[id=B7XeUI8-TqSXKA7ZnASn0Q]
--------[test][4], node[iUkDDsM2T-OGXFNYdljzBQ], [P], v[3], s[STARTED], a[id=Xne9jkeJRZCLgxeU4ZDEUw]
--------[test][0], node[iUkDDsM2T-OGXFNYdljzBQ], [R], v[2], s[INITIALIZING], a[id=IPATKM4TTRac_duGlNgEWQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]
-----node_id[EXoFzKQETGiqaiRDrLPbCg][V]
--------[test][3], node[EXoFzKQETGiqaiRDrLPbCg], [R], v[3], s[STARTED], a[id=8BDoiJRgS0euiPu9svsgNA]
--------[test][2], node[EXoFzKQETGiqaiRDrLPbCg], [P], v[3], s[STARTED], a[id=MbLEmN88TPOnZzD7Kj4Vcg]
--------[test][4], node[EXoFzKQETGiqaiRDrLPbCg], [R], v[3], s[STARTED], a[id=rSdf8UHSQpuFZeDLbH6RNw]
--------[test][5], node[EXoFzKQETGiqaiRDrLPbCg], [P], v[3], s[STARTED], a[id=dEuJ6wkYTo-XeJAch4AVYA]
---- unassigned

tasks: (1):
131/URGENT/shard-started ([test][0], node[iUkDDsM2T-OGXFNYdljzBQ], [R], v[2], s[INITIALIZING], a[id=IPATKM4TTRac_duGlNgEWQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]), reason [after recovery (replica) from node [{node_s1}{gW0qrQwiQpOzwgOfzKUlPg}{127.0.0.1}{127.0.0.1:9401}{mode=network}]]/10.6s

Because of the uneven distribution of shards, one shard will be relocated from node_s0 to node_s1 after the wait for relocations has completed, and the subsequent optimize might then fail.

An ensureGreen() will fix the issue, but I do wonder why we start off with such an odd distribution of shards to begin with. I will investigate this.
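
For illustration, here is how that fix would look in the test above. This is a sketch only, assuming the ESIntegTestCase helpers already used in this test; ensureGreen(String...) waits until the cluster reports green for the index, i.e. every shard copy is STARTED, which waitForRelocation() alone does not guarantee:

        createIndex("test");
        client().prepareIndex("test", "type1", "1").setSource("field1", "value1").get();
        client().prepareIndex("test", "type1", "2").setSource("field1", "value2").get();
        client().prepareIndex("test", "type1", "3").setSource("field1", "value3").get();
        // Wait for green: all primaries and replicas are STARTED, so no
        // INITIALIZING replica can kick off a relocation after this point.
        ensureGreen("test");
        waitForRelocation();
        optimize();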

brwe added a commit that referenced this issue Sep 2, 2015
In this test we assume that after waitForRelocation() has returned, shards
are no longer relocating and optimize will therefore always succeed.
However, because the test does not wait for green status, relocations can
still start after waitForRelocation() has returned successfully.

see #13266 for a detailed explanation
s1monw commented Sep 2, 2015

An ensureGreen() will fix the issue, but I do wonder why we start off with such an odd distribution of shards to begin with. I will investigate this.

this allocation is a corner case that I tried to special-case in the shard allocator here: https://github.com/elastic/elasticsearch/blob/master/core/src/main/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java#L647

in the last allocation round we have two shards left:

--------[test][1], node[gW0qrQwiQpOzwgOfzKUlPg], [R], v[2], s[INITIALIZING], a[id=bdfh-y94SOi9bfw_zJ9OYQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]
--------[test][0], node[iUkDDsM2T-OGXFNYdljzBQ], [R], v[2], s[INITIALIZING], a[id=IPATKM4TTRac_duGlNgEWQ], unassigned_info[[reason=INDEX_CREATED], at[2015-09-02T16:40:03.287Z]]

ideally we would put both on node_id[iUkDDsM2T-OGXFNYdljzBQ], but it already has shard [test][1], so we had to put it somewhere else... maybe there is a bug in the code, or we should take another look at solving it differently.
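
For context, the reason the [test][1] replica could not be placed on that node is the invariant that a node must never hold two copies of the same shard (enforced in Elasticsearch by the same-shard allocation decider). A minimal, self-contained illustration of that check, with invented stand-in types rather than the actual allocator code:

    import java.util.List;

    final class SameShardCheckSketch {
        // Stand-in for a shard identity: index name plus shard number.
        static final class ShardId {
            final String index;
            final int id;
            ShardId(String index, int id) { this.index = index; this.id = id; }
            boolean sameAs(ShardId other) { return id == other.id && index.equals(other.index); }
        }

        // A node must not hold two copies (primary and replica, or two replicas)
        // of the same shard, so the [test][1] replica had to go elsewhere even
        // though the node was otherwise the least loaded.
        static boolean canAllocate(ShardId candidate, List<ShardId> shardsOnNode) {
            for (ShardId assigned : shardsOnNode) {
                if (assigned.sameAs(candidate)) {
                    return false;
                }
            }
            return true;
        }
    }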

brwe commented Sep 7, 2015

We (@bleskes and I) discussed further why we get a failure at all for optimize while a shard is relocating. Earlier I wrote:

The test fails because optimize is called with the expectation that there will be no errors, but optimize fails because a shard is relocating.

But actually, optimize should not fail at all just because a shard is relocating. The reason we get a failure is that the exceptions returned when the call hits a closed shard are not all handled the same way. I made a PR to fix this (#13380).
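
The gist of that fix, sketched as self-contained Java: when classifying a per-shard failure, walk the cause chain so that wrappers are classified by what they wrap. The exception classes below are stand-ins for the real ones named in this issue, and the actual check lives around TransportActions.isShardNotAvailableException(..); this is an illustration, not the PR's code:

    // Stand-ins for the engine/IndexWriter exceptions named in this issue.
    class EngineClosedException extends RuntimeException {}
    class AlreadyClosedException extends RuntimeException {}

    final class ShardFailureClassifier {
        // Walk the cause chain so that a wrapper such as
        // ForceMergeFailedEngineException[...]; nested: EngineClosedException[...]
        // is classified by its root cause rather than by the wrapper, treating
        // a closed engine or a closed IndexWriter as "shard not available".
        static boolean indicatesClosedShard(Throwable t) {
            for (Throwable cause = t; cause != null; cause = cause.getCause()) {
                if (cause instanceof EngineClosedException
                        || cause instanceof AlreadyClosedException) {
                    return true;
                }
            }
            return false;
        }
    }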

However, while discussing this we found another issue: when we call optimize in tests while shards are relocating, we might actually miss some shards, because the coordinating node for the request might not have the new addresses of the relocated shards yet, the request is not sent to initializing shards, and so on. We should fix that as well, since it might cause test failures too.

In addition, other APIs such as stats do not seem to behave as expected when shards are relocating either. @tlrx might add more on that.

Overall I now think the problem is not so much that shards are relocating unexpectedly in tests, but rather that APIs don't behave as expected while shards are relocating.

tlrx commented Sep 8, 2015

In addition, other APIs such as stats do not seem to behave as expected when shards are relocating either. @tlrx might add more on that.

I noticed that the Indices Stats API, when executed at shard level, throws a ShardNotFoundException if the shard has no routing entry (see code here).

Exceptions are then accumulated at the TransportBroadcastByNodeAction level, and shard operations that throw exceptions are simply ignored and excluded from the indices stats.
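
To make that accumulation behavior concrete, here is a minimal, self-contained sketch (invented names, not the Elasticsearch source) of the broadcast-by-node pattern tlrx describes: a per-shard operation that throws is recorded as a failure and its shard is simply left out of the aggregated stats:

    import java.util.ArrayList;
    import java.util.List;

    final class BroadcastByNodeSketch {
        static final class ShardStats {
            final int shardId;
            ShardStats(int shardId) { this.shardId = shardId; }
        }

        static final class ShardNotFoundException extends RuntimeException {
            ShardNotFoundException(int shardId) {
                super("no routing entry for shard [" + shardId + "]");
            }
        }

        // Collect stats shard by shard; a shard whose operation throws (e.g.
        // because it has no routing entry while relocating) is accumulated as
        // a failure and excluded from the results instead of failing the request.
        static List<ShardStats> collectStats(List<Integer> shardIds, List<Throwable> failures) {
            List<ShardStats> results = new ArrayList<>();
            for (int shardId : shardIds) {
                try {
                    results.add(statsForShard(shardId));
                } catch (RuntimeException e) {
                    failures.add(e); // recorded, but the shard's stats go missing
                }
            }
            return results;
        }

        // Pretend shard 2 is relocating and currently has no routing entry.
        static ShardStats statsForShard(int shardId) {
            if (shardId == 2) {
                throw new ShardNotFoundException(shardId);
            }
            return new ShardStats(shardId);
        }
    }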

brwe added a commit to brwe/elasticsearch that referenced this issue Sep 8, 2015
When we call optimize we ignore exceptions that indicate a closed shard.
However, when a shard is closed while an optimize request is in flight it
might also trigger an AlreadyClosedException from the IndexWriter when we
get the config, or a ForceMergeFailedEngineException with the EngineClosedException
wrapped inside. Because these are not identified as exceptions that indicate
a closed shard (TransportActions.isShardNotAvailableException(..)), optimize
would sometimes report failures when shards were relocating while optimize
was called and sometimes not. This caused weird test failures, see elastic#13266.
Instead, we should let EngineClosedException bubble up and also recognize
AlreadyClosedException as an indicator for a closed shard.
brwe commented Oct 5, 2015

Test was fixed. I opened #13719 for a general discussion on how different APIs deal with relocating shards.

@brwe brwe closed this as completed Oct 5, 2015