
test fails with: can't resolve processed cluster state with uuid #32767

Closed
alpar-t opened this issue Aug 10, 2018 · 4 comments
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI v7.0.0-beta1

Comments

@alpar-t
Contributor

alpar-t commented Aug 10, 2018

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/2608/console

Reproduction line (does not reproduce locally)

./gradlew :server:integTest \
  -Dtests.seed=CFEB21B624409D08 \
  -Dtests.class=org.elasticsearch.cluster.ack.AckIT \
  -Dtests.method="testIndicesAliasesAcknowledgement" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-EG \
  -Dtests.timezone=Europe/Busingen

Example relevant log

12:05:51   1> [2018-08-09T14:05:49,772][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] before test
12:05:51   1> [2018-08-09T14:05:49,772][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] [AckIT#testIndicesAliasesAcknowledgement]: setting up test
12:05:51   1> [2018-08-09T14:05:49,773][WARN ][o.e.d.z.PublishClusterStateAction] [node_sm0] timed out waiting for all nodes to process published state [98] (timeout [0s], pending nodes: [{node_sd1}{SUmmm71tS4S8Rjjmu7RRzQ}{KMqfr68_QR2_o6qIfel-pw}{127.0.0.1}{127.0.0.1:37251}, {node_sc3}{D3fZVHArSOm2z7CSJQjLLQ}{pdC1isitQaSDqccMh9K1ZQ}{127.0.0.1}{127.0.0.1:36167}, {node_sd2}{jqkfT8R_Q_ihxv7ORnNGcQ}{mRSClYlNTSu04uSSA50YbQ}{127.0.0.1}{127.0.0.1:39169}])
12:05:51   1> [2018-08-09T14:05:49,786][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm0] adding template [random_index_template] for index patterns [*]
12:05:51   1> [2018-08-09T14:05:49,787][WARN ][o.e.d.z.PublishClusterStateAction] [node_sm0] timed out waiting for all nodes to process published state [99] (timeout [0s], pending nodes: [{node_sd1}{SUmmm71tS4S8Rjjmu7RRzQ}{KMqfr68_QR2_o6qIfel-pw}{127.0.0.1}{127.0.0.1:37251}, {node_sc3}{D3fZVHArSOm2z7CSJQjLLQ}{pdC1isitQaSDqccMh9K1ZQ}{127.0.0.1}{127.0.0.1:36167}, {node_sd2}{jqkfT8R_Q_ihxv7ORnNGcQ}{mRSClYlNTSu04uSSA50YbQ}{127.0.0.1}{127.0.0.1:39169}])
12:05:51   1> [2018-08-09T14:05:49,790][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] [AckIT#testIndicesAliasesAcknowledgement]: all set up test
12:05:51   1> [2018-08-09T14:05:49,790][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] using custom data_path for index: [bZFgSZDbso]
12:05:51   1> [2018-08-09T14:05:49,802][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_sm0] [test] creating index, cause [api], templates [random_index_template], shards [1]/[0], mappings []
12:05:51   1> [2018-08-09T14:05:49,803][WARN ][o.e.d.z.PublishClusterStateAction] [node_sm0] timed out waiting for all nodes to process published state [100] (timeout [0s], pending nodes: [{node_sd1}{SUmmm71tS4S8Rjjmu7RRzQ}{KMqfr68_QR2_o6qIfel-pw}{127.0.0.1}{127.0.0.1:37251}, {node_sc3}{D3fZVHArSOm2z7CSJQjLLQ}{pdC1isitQaSDqccMh9K1ZQ}{127.0.0.1}{127.0.0.1:36167}, {node_sd2}{jqkfT8R_Q_ihxv7ORnNGcQ}{mRSClYlNTSu04uSSA50YbQ}{127.0.0.1}{127.0.0.1:39169}])
12:05:51   1> [2018-08-09T14:05:49,813][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] [AckIT#testIndicesAliasesAcknowledgement]: cleaning up after test
12:05:51   1> [2018-08-09T14:05:49,815][ERROR][o.e.t.d.TestZenDiscovery ] [node_sd2] unexpected failure applying [master {node_sm0}{hGe5mTwjSEeYCIoxAs1RNQ}{AIWCpEcCT2-pAsyJJvAk6w}{127.0.0.1}{127.0.0.1:36719} committed version [100]]
12:05:51   1> java.lang.IllegalStateException: can't resolve processed cluster state with uuid [8KrcUpcjR4ulxE1zmnc2Xg], version [100]
12:05:51   1> 	at org.elasticsearch.discovery.zen.PendingClusterStatesQueue.markAsProcessed(PendingClusterStatesQueue.java:147) ~[main/:?]
12:05:51   1> 	at org.elasticsearch.discovery.zen.ZenDiscovery$3.onSuccess(ZenDiscovery.java:795) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService$SafeClusterApplyListener.onSuccess(ClusterApplierService.java:533) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:427) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:157) [main/:?]
12:05:51   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [main/:?]
12:05:51   1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [main/:?]
12:05:51   1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [main/:?]
12:05:51   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
12:05:51   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
12:05:51   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
12:05:51   1> [2018-08-09T14:05:49,815][ERROR][o.e.t.d.TestZenDiscovery ] [node_sd2] unexpected exception while failing [master {node_sm0}{hGe5mTwjSEeYCIoxAs1RNQ}{AIWCpEcCT2-pAsyJJvAk6w}{127.0.0.1}{127.0.0.1:36719} committed version [100]]
12:05:51   1> java.lang.IllegalArgumentException: can't resolve failed cluster state with uuid [8KrcUpcjR4ulxE1zmnc2Xg], version [100]
12:05:51   1> 	at org.elasticsearch.discovery.zen.PendingClusterStatesQueue.markAsFailed(PendingClusterStatesQueue.java:110) ~[main/:?]
12:05:51   1> 	at org.elasticsearch.discovery.zen.ZenDiscovery$3.onFailure(ZenDiscovery.java:807) [main/:?]
12:05:51   1> 	at org.elasticsearch.discovery.zen.ZenDiscovery$3.onSuccess(ZenDiscovery.java:797) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService$SafeClusterApplyListener.onSuccess(ClusterApplierService.java:533) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:427) [main/:?]
12:05:51   1> 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:157) [main/:?]
12:05:51   2> REPRODUCE WITH: ./gradlew :server:integTest -Dtests.seed=CFEB21B624409D08 -Dtests.class=org.elasticsearch.cluster.ack.AckIT -Dtests.method="testIndicesAliasesAcknowledgement" -Dtests.security.manager=true -Dtests.locale=ar-EG -Dtests.timezone=Europe/Busingen
12:05:51   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [main/:?]
12:05:51   1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [main/:?]
12:05:51   1> 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [main/:?]
12:05:51   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
12:05:51   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
12:05:51   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
12:05:51   1> 	Suppressed: java.lang.IllegalStateException: can't resolve processed cluster state with uuid [8KrcUpcjR4ulxE1zmnc2Xg], version [100]
12:05:51   1> 		at org.elasticsearch.discovery.zen.PendingClusterStatesQueue.markAsProcessed(PendingClusterStatesQueue.java:147) ~[main/:?]
12:05:51   1> 		at org.elasticsearch.discovery.zen.ZenDiscovery$3.onSuccess(ZenDiscovery.java:795) [main/:?]
12:05:51   1> 		at org.elasticsearch.cluster.service.ClusterApplierService$SafeClusterApplyListener.onSuccess(ClusterApplierService.java:533) [main/:?]
12:05:51   1> 		at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [main/:?]
12:05:51   1> 		at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:427) [main/:?]
12:05:51   1> 		at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:157) [main/:?]
12:05:51   1> 		at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624) [main/:?]
12:05:51   1> 		at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [main/:?]
12:05:51   1> 		at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [main/:?]
12:05:51   1> 		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
12:05:51   1> 		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
12:05:51   1> 		at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
12:05:51   1> [2018-08-09T14:05:49,819][INFO ][o.e.c.r.a.AllocationService] [node_sm0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test][0]] ...]).
12:05:51   1> [2018-08-09T14:05:49,820][WARN ][o.e.d.z.PublishClusterStateAction] [node_sm0] timed out waiting for all nodes to process published state [101] (timeout [0s], pending nodes: [{node_sd1}{SUmmm71tS4S8Rjjmu7RRzQ}{KMqfr68_QR2_o6qIfel-pw}{127.0.0.1}{127.0.0.1:37251}, {node_sc3}{D3fZVHArSOm2z7CSJQjLLQ}{pdC1isitQaSDqccMh9K1ZQ}{127.0.0.1}{127.0.0.1:36167}, {node_sd2}{jqkfT8R_Q_ihxv7ORnNGcQ}{mRSClYlNTSu04uSSA50YbQ}{127.0.0.1}{127.0.0.1:39169}])
12:05:51   1> [2018-08-09T14:05:49,823][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_sm0] [test/TdvnaRhdSu290Bl-3Dj4qg] deleting index
12:05:51   1> [2018-08-09T14:05:49,823][WARN ][o.e.d.z.PublishClusterStateAction] [node_sm0] timed out waiting for all nodes to process published state [102] (timeout [0s], pending nodes: [{node_sd1}{SUmmm71tS4S8Rjjmu7RRzQ}{KMqfr68_QR2_o6qIfel-pw}{127.0.0.1}{127.0.0.1:37251}, {node_sc3}{D3fZVHArSOm2z7CSJQjLLQ}{pdC1isitQaSDqccMh9K1ZQ}{127.0.0.1}{127.0.0.1:36167}, {node_sd2}{jqkfT8R_Q_ihxv7ORnNGcQ}{mRSClYlNTSu04uSSA50YbQ}{127.0.0.1}{127.0.0.1:39169}])
12:05:51   1> [2018-08-09T14:05:49,833][INFO ][o.e.c.m.MetaDataIndexTemplateService] [node_sm0] removing template [random_index_template]
12:05:51   1> [2018-08-09T14:05:49,833][WARN ][o.e.d.z.PublishClusterStateAction] [node_sm0] timed out waiting for all nodes to process published state [103] (timeout [0s], pending nodes: [{node_sd1}{SUmmm71tS4S8Rjjmu7RRzQ}{KMqfr68_QR2_o6qIfel-pw}{127.0.0.1}{127.0.0.1:37251}, {node_sc3}{D3fZVHArSOm2z7CSJQjLLQ}{pdC1isitQaSDqccMh9K1ZQ}{127.0.0.1}{127.0.0.1:36167}, {node_sd2}{jqkfT8R_Q_ihxv7ORnNGcQ}{mRSClYlNTSu04uSSA50YbQ}{127.0.0.1}{127.0.0.1:39169}])
12:05:51   1> [2018-08-09T14:05:49,836][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] [AckIT#testIndicesAliasesAcknowledgement]: cleaned up after test
12:05:51   1> [2018-08-09T14:05:49,836][INFO ][o.e.c.a.AckIT            ] [testIndicesAliasesAcknowledgement] after test
12:05:51 FAILURE 0.12s J2 | AckIT.testIndicesAliasesAcknowledgement <<< FAILURES!
12:05:51    > Throwable #1: java.lang.AssertionError: CreateIndexResponse failed - not acked
12:05:51    > Expected: <true>
12:05:51    >      but: was <false>
12:05:51    > 	at __randomizedtesting.SeedInfo.seed([CFEB21B624409D08:4BF121D23848E9F9]:0)
12:05:51    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
12:05:51    > 	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:108)
12:05:51    > 	at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:96)
12:05:51    > 	at org.elasticsearch.test.ESIntegTestCase.createIndex(ESIntegTestCase.java:741)
12:05:51    > 	at org.elasticsearch.cluster.ack.AckIT.testIndicesAliasesAcknowledgement(AckIT.java:214)
12:05:51    > 	at java.lang.Thread.run(Thread.java:748)

Frequency

This particular test has not failed before; however, the test logs show that "can't resolve processed cluster state with uuid" failures started on 2018-06-24 and have recurred almost every week since.

@alpar-t alpar-t added >test-failure Triaged test failures from CI v7.0.0 :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Aug 10, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch added :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. and removed :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Aug 10, 2018
@ywelsch
Contributor

ywelsch commented Aug 14, 2018

The reason these acking tests started failing on 2018-06-24 is that #30672 fixed the actual acking behavior, which previously just erroneously reported failures as successful acks. The tests are completely bogus and would never have worked if the acking implementation had not been broken since its inception. Acking cannot work in this class because it uses a publish timeout of 0, which means that cluster states can arrive out of order on nodes; in certain situations this leads to a cluster state being rejected because a newer cluster state was already fully applied on the node. One way to work around this limitation in the test might be to bring the cluster to a stable state first, and only then change the publish timeout for the one publishing round we want to check the acking for (a rough sketch of that idea follows below).

If this keeps failing frequently, we can disable the whole class (AckIT) for now.
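
Below is a minimal sketch of that workaround idea, not the actual fix that was merged. It assumes discovery.zen.publish_timeout is the dynamic setting this test class pins to 0s, and the class and method names (AckWorkaroundSketchIT, testAliasAckingWithTemporaryPublishTimeout) are hypothetical:

import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.test.ESIntegTestCase;

import static org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked;

// Hypothetical test class illustrating the suggested workaround; not part of the repository.
public class AckWorkaroundSketchIT extends ESIntegTestCase {

    public void testAliasAckingWithTemporaryPublishTimeout() throws Exception {
        createIndex("test");
        // 1. Bring the cluster to a stable state so no stale cluster states are still pending.
        ensureStableCluster(internalCluster().size());
        ensureGreen("test");

        // 2. Restore a non-zero publish timeout just for the publishing round whose acking
        //    we want to verify (assumes the dynamic "discovery.zen.publish_timeout" setting).
        assertAcked(client().admin().cluster().prepareUpdateSettings()
            .setTransientSettings(Settings.builder().put("discovery.zen.publish_timeout", "30s")));
        try {
            // 3. Only this request is expected to be reliably acked by all nodes.
            assertAcked(client().admin().indices().prepareAliases().addAlias("test", "alias1"));
        } finally {
            // 4. Reset the timeout so the rest of the suite keeps its original 0s behavior.
            client().admin().cluster().prepareUpdateSettings()
                .setTransientSettings(Settings.builder().putNull("discovery.zen.publish_timeout"))
                .get();
        }
    }
}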

jpountz added a commit that referenced this issue Aug 16, 2018
jpountz added a commit that referenced this issue Aug 16, 2018
jpountz added a commit that referenced this issue Aug 16, 2018
@jpountz
Contributor

jpountz commented Aug 16, 2018

I AwaitFix'ed this test.
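
For context, a class-level AwaitsFix mute in this test framework typically looks like the sketch below; this illustrates the mechanism and is not a copy of the exact commit that was pushed:

import org.apache.lucene.util.LuceneTestCase.AwaitsFix;
import org.elasticsearch.test.ESIntegTestCase;

// Class-level mute: the randomized test runner skips every test in the class and
// points readers at the tracking issue until the annotation is removed.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/32767")
public class AckIT extends ESIntegTestCase {
    // existing test methods unchanged
}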

jimczi pushed a commit that referenced this issue Aug 17, 2018
@ywelsch
Contributor

ywelsch commented Sep 13, 2018

Closed in favour of #33673

@ywelsch ywelsch closed this as completed Sep 13, 2018
tlrx added a commit that referenced this issue Feb 1, 2019
Those tests are broken as explained in #33673. Similar tests have
already been muted (see #32767) and this specific class has been muted
in master already (see #34053).

Closes #37478
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019