
[CI] RestartIndexFollowingIT.testFollowIndex fails #37681

Closed
costin opened this issue Jan 22, 2019 · 10 comments
Assignees: martijnvg
Labels: :Distributed/CCR Issues around the Cross Cluster State Replication features, >test-failure Triaged test failures from CI

costin (Member) commented Jan 22, 2019

An assertion failed (66 vs 96), which caused the build to fail.
There are a number of node failures / closed-channel exceptions in the logs, so this might be a side effect of an incomplete topology.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=openjdk12,nodes=virtual&&linux/195/console
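For context, the check that fails here is a busy-wait on the follower's document count. Roughly (a sketch, not the exact test source; the index name, helper names, and expectedDocCount are assumptions):

    // Sketch: wait until the follower index has replicated the expected number
    // of docs; in this run only 66 of the 96 expected docs had arrived before
    // the assertBusy timeout.
    assertBusy(() -> {
        long followerDocs = followerClient().prepareSearch("index2")
                .setSize(0).get().getHits().getTotalHits().value;
        assertThat(followerDocs, equalTo(expectedDocCount));
    });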

costin added the >test-failure (Triaged test failures from CI) and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels Jan 22, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

martijnvg self-assigned this Jan 22, 2019
martijnvg (Member) commented:

The expected number of documents did not get replicated, because the shard follow task unexpectedly failed:

[2019-01-22T14:48:10,560][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> java.lang.IllegalStateException: No node available for cluster: leader_cluster
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectedNodes.getAny(RemoteClusterConnection.java:708) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection.getAnyConnectedNode(RemoteClusterConnection.java:668) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection.getConnection(RemoteClusterConnection.java:353) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterService.getConnection(RemoteClusterService.java:380) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterAwareClient.lambda$doExecute$0(RemoteClusterAwareClient.java:54) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection.ensureConnected(RemoteClusterConnection.java:212) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterService.ensureConnected(RemoteClusterService.java:376) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterAwareClient.doExecute(RemoteClusterAwareClient.java:48) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.innerSendShardChangesRequest(ShardFollowTasksExecutor.java:241) [main/:?]
  1>    at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.sendShardChangesRequest(ShardFollowNodeTask.java:254) [main/:?]
  1>    at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.lambda$sendShardChangesRequest$3(ShardFollowNodeTask.java:277) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:662) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:835) [?:?]

The java.lang.IllegalStateException: No node available for cluster: leader_cluster is another variant of the remote cluster connection not being available. I will work on a fix.
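Not the actual change, but a minimal sketch of the direction, assuming a hypothetical shouldRetry helper in the shard follow task: connectivity problems to the remote cluster, including the "No node available" case above, get classified as retryable so the task backs off and retries instead of failing fatally.

    // Hypothetical sketch only; the class, method, and exact exception set are
    // assumptions, not the code that landed for this issue.
    import org.elasticsearch.transport.ConnectTransportException;

    final class FollowerRetryPolicySketch {
        // Decide whether a failed shard-changes request should be retried with
        // backoff rather than failing the whole shard follow task.
        static boolean shouldRetry(Exception e) {
            if (e instanceof ConnectTransportException) {
                return true; // remote node temporarily unreachable
            }
            // The "No node available for cluster: leader_cluster" failure above.
            return e instanceof IllegalStateException
                    && e.getMessage() != null
                    && e.getMessage().startsWith("No node available for cluster");
        }
    }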

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 23, 2019
connectivity to the remote connection is failing.

Relates to elastic#37681
martijnvg added a commit that referenced this issue Jan 25, 2019
…37767)

or connectivity to the remote connection is failing.

Relates to #37681
dnhatn (Member) commented Jan 31, 2019

@martijnvg I think this is fixed by #37767. Can we close it?

martijnvg (Member) commented:

Yes, it has not failed in a long time since the above commits were pushed.

martijnvg (Member) commented:

This test failed again, but this time with a different error:

FAILURE 20.0s J1 | RestartIndexFollowingIT.testFollowIndex <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <37L>
   >      but: was <23L>
   >    at __randomizedtesting.SeedInfo.seed([5B20C5B4482DC66A:B84C0A867195FFE0]:0)
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.lambda$testFollowIndex$2(RestartIndexFollowingIT.java:80)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:858)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:832)
   >    at org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.testFollowIndex(RestartIndexFollowingIT.java:79)
   >    at java.lang.Thread.run(Thread.java:748)

It looks like some docs did not replicate because shard following failed with this error:

[2019-02-08T17:07:41,475][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> java.lang.IllegalStateException: handshake failed, mismatched cluster name [Cluster [FSleTAAAQACAHlpFAAAAAA]] - {leader_cluster#127.0.0.1:45912}{hc5lTgrnQeS1p7AjUVIi7A}{127.0.0.1}{127.0.0.1:45912}
  1>    at org.elasticsearch.transport.TransportService.handshake(TransportService.java:422) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:449) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:438) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:328) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:426) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_202]
  1>    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_202]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

martijnvg (Member) commented:

I made a tweak to CcrIntegTestCase, so that a clear cluster name is returned in the above exception message.
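Presumably that tweak gives each test cluster an explicit, readable cluster.name; a minimal sketch under that assumption (the real hook in CcrIntegTestCase may differ):

    // Sketch only: name the test clusters explicitly so a mismatched-cluster-name
    // handshake failure identifies "leader_cluster"/"follower_cluster" rather than
    // a generated id like the one in the log above.
    import org.elasticsearch.common.settings.Settings;

    class ClusterNameSettingsSketch {
        static Settings leaderNodeSettings() {
            return Settings.builder().put("cluster.name", "leader_cluster").build();
        }

        static Settings followerNodeSettings() {
            return Settings.builder().put("cluster.name", "follower_cluster").build();
        }
    }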

bizybot (Contributor) commented Feb 18, 2019

Hi @martijnvg,

This failed again today:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-7&&immutable/250/

I see your changes to make debugging easier are in.
It did not reproduce locally with the following:

./gradlew :x-pack:plugin:ccr:internalClusterTest \
            -Dtests.seed=FFB9BD561495F6DF \
            -Dtests.class=org.elasticsearch.xpack.ccr.RestartIndexFollowingIT \
            -Dtests.method="testFollowIndex" \
            -Dtests.security.manager=true \
            -Dtests.locale=ar-OM \
            -Dtests.timezone=Africa/Kigali \
            -Dcompiler.java=11 \
            -Druntime.java=8

martijnvg (Member) commented:

Thanks @bizybot, I'm going to take a look at this failure.

martijnvg (Member) commented:

I think that during the test the follower cluster may end up with seed nodes pointing to itself instead of to the leader node. The remote cluster service then fails, the shard follow task fails because of that too, and not all of the expected documents get replicated.

[2019-02-18T03:14:09,139][WARN ][o.e.t.RemoteClusterConnection] [followerd3] fetching nodes from external cluster [leader_cluster] failed
  1> java.lang.IllegalStateException: handshake failed, mismatched cluster name [Cluster [follower_cluster]] - {leader_cluster#127.0.0.1:40312}{YxrFM5olRtqM4YiJF-0WSg}{127.0.0.1}{127.0.0.1:40312}
  1>    at org.elasticsearch.transport.TransportService.handshake(TransportService.java:422) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:449) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:438) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:328) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:426) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_202]
  1>    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_202]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
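The commits below clean up the remote connection before the leader cluster restarts. A rough sketch of that idea in the test, assuming CcrIntegTestCase-style helpers (followerClient(), getLeaderCluster()) and that the leader's new transport address is known as a string; the committed change may differ:

    // Sketch: drop the follower's seeds for leader_cluster before restarting the
    // leader, then point them at the leader's fresh transport address, so stale
    // seeds can never resolve to the follower itself after ports are reused.
    ClusterUpdateSettingsRequest clearSeeds = new ClusterUpdateSettingsRequest();
    clearSeeds.persistentSettings(Settings.builder().putNull("cluster.remote.leader_cluster.seeds"));
    followerClient().admin().cluster().updateSettings(clearSeeds).actionGet();

    getLeaderCluster().fullRestart();

    ClusterUpdateSettingsRequest newSeeds = new ClusterUpdateSettingsRequest();
    newSeeds.persistentSettings(Settings.builder()
            .put("cluster.remote.leader_cluster.seeds", leaderTransportAddress)); // assumed known String
    followerClient().admin().cluster().updateSettings(newSeeds).actionGet();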

martijnvg added a commit that referenced this issue Feb 26, 2019
clean remote connection prior to leader cluster restart

Relates to #37681
martijnvg (Member) commented:

The above fix seems to have resolved this failure; I have not seen it fail since.
