
[CI] RestartIndexFollowingIT.testFollowIndex fails #37681

Closed
costin opened this issue Jan 22, 2019 · 10 comments
Assignees: martijnvg
Labels: :Distributed/CCR Issues around the Cross Cluster State Replication features, >test-failure Triaged test failures from CI

costin (Member) commented Jan 22, 2019

An assertion failed (66 vs 96), which caused the build to fail.
There are a number of node failures / closed-channel exceptions in the logs, so this might be a side effect of an incomplete topology.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=openjdk12,nodes=virtual&&linux/195/console
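For context, the check that fails here is a busy-wait on the follower's document count. Roughly (a sketch, not the exact test source; the index name, helper names, and expectedDocCount are assumptions):

    // Sketch: wait until the follower index has replicated the expected number
    // of docs; in this run only 66 of the 96 expected docs had arrived before
    // the assertBusy timeout.
    assertBusy(() -> {
        long followerDocs = followerClient().prepareSearch("index2")
                .setSize(0).get().getHits().getTotalHits().value;
        assertThat(followerDocs, equalTo(expectedDocCount));
    });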

costin added the >test-failure (Triaged test failures from CI) and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels Jan 22, 2019
elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed

martijnvg self-assigned this Jan 22, 2019
martijnvg (Member) commented:

The expected number of documents did not get replicated, because the shard follow task unexpectedly failed:

[2019-01-22T14:48:10,560][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> java.lang.IllegalStateException: No node available for cluster: leader_cluster
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectedNodes.getAny(RemoteClusterConnection.java:708) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection.getAnyConnectedNode(RemoteClusterConnection.java:668) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection.getConnection(RemoteClusterConnection.java:353) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterService.getConnection(RemoteClusterService.java:380) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterAwareClient.lambda$doExecute$0(RemoteClusterAwareClient.java:54) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection.ensureConnected(RemoteClusterConnection.java:212) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterService.ensureConnected(RemoteClusterService.java:376) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterAwareClient.doExecute(RemoteClusterAwareClient.java:48) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:393) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.xpack.ccr.action.ShardFollowTasksExecutor$1.innerSendShardChangesRequest(ShardFollowTasksExecutor.java:241) [main/:?]
  1>    at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.sendShardChangesRequest(ShardFollowNodeTask.java:254) [main/:?]
  1>    at org.elasticsearch.xpack.ccr.action.ShardFollowNodeTask.lambda$sendShardChangesRequest$3(ShardFollowNodeTask.java:277) [main/:?]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:662) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1>    at java.lang.Thread.run(Thread.java:835) [?:?]

The java.lang.IllegalStateException: No node available for cluster: leader_cluster is another variant of the remote cluster connection not being available. I will work on a fix.
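Not the actual change, but a minimal sketch of the direction, assuming a hypothetical shouldRetry helper in the shard follow task: connectivity problems to the remote cluster, including the "No node available" case above, get classified as retryable so the task backs off and retries instead of failing fatally.

    // Hypothetical sketch only; the class, method, and exact exception set are
    // assumptions, not the code that landed for this issue.
    import org.elasticsearch.transport.ConnectTransportException;

    final class FollowerRetryPolicySketch {
        // Decide whether a failed shard-changes request should be retried with
        // backoff rather than failing the whole shard follow task.
        static boolean shouldRetry(Exception e) {
            if (e instanceof ConnectTransportException) {
                return true; // remote node temporarily unreachable
            }
            // The "No node available for cluster: leader_cluster" failure above.
            return e instanceof IllegalStateException
                    && e.getMessage() != null
                    && e.getMessage().startsWith("No node available for cluster");
        }
    }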

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 23, 2019
connectivity to the remote connection is failing.

Relates to elastic#37681
martijnvg added a commit that referenced this issue Jan 25, 2019
…37767)

or connectivity to the remote connection is failing.

Relates to #37681
dnhatn (Member) commented Jan 31, 2019

@martijnvg I think this is fixed by #37767. Can we close it?

martijnvg (Member) commented:

Yes, it has not failed in a long time since the above commits were pushed.

martijnvg (Member) commented:

This test failed again, but this time with a different error:

FAILURE 20.0s J1 | RestartIndexFollowingIT.testFollowIndex <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <37L>
   >      but: was <23L>
   >    at __randomizedtesting.SeedInfo.seed([5B20C5B4482DC66A:B84C0A867195FFE0]:0)
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.lambda$testFollowIndex$2(RestartIndexFollowingIT.java:80)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:858)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:832)
   >    at org.elasticsearch.xpack.ccr.RestartIndexFollowingIT.testFollowIndex(RestartIndexFollowingIT.java:79)
   >    at java.lang.Thread.run(Thread.java:748)

It looks like some docs did not replicate because shard following failed with this error:

[2019-02-08T17:07:41,475][WARN ][o.e.x.c.a.ShardFollowNodeTask] [follower0] shard follow task encounter non-retryable error
  1> java.lang.IllegalStateException: handshake failed, mismatched cluster name [Cluster [FSleTAAAQACAHlpFAAAAAA]] - {leader_cluster#127.0.0.1:45912}{hc5lTgrnQeS1p7AjUVIi7A}{127.0.0.1}{127.0.0.1:45912}
  1>    at org.elasticsearch.transport.TransportService.handshake(TransportService.java:422) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:449) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:438) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:328) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:426) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_202]
  1>    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_202]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]

martijnvg (Member) commented:

I made a tweak to CcrIntegTestCase, so that a clear cluster name is returned in the above exception message.
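Presumably that tweak gives each test cluster an explicit, readable cluster.name; a minimal sketch under that assumption (the real hook in CcrIntegTestCase may differ):

    // Sketch only: name the test clusters explicitly so a mismatched-cluster-name
    // handshake failure identifies "leader_cluster"/"follower_cluster" rather than
    // a generated id like the one in the log above.
    import org.elasticsearch.common.settings.Settings;

    class ClusterNameSettingsSketch {
        static Settings leaderNodeSettings() {
            return Settings.builder().put("cluster.name", "leader_cluster").build();
        }

        static Settings followerNodeSettings() {
            return Settings.builder().put("cluster.name", "follower_cluster").build();
        }
    }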

bizybot (Contributor) commented Feb 18, 2019

Hi @martijnvg,

This failed again today:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos-7&&immutable/250/

I see your changes to make debugging easier are in.
It did not reproduce locally with the following:

./gradlew :x-pack:plugin:ccr:internalClusterTest \
            -Dtests.seed=FFB9BD561495F6DF \
            -Dtests.class=org.elasticsearch.xpack.ccr.RestartIndexFollowingIT \
            -Dtests.method="testFollowIndex" \
            -Dtests.security.manager=true \
            -Dtests.locale=ar-OM \
            -Dtests.timezone=Africa/Kigali \
            -Dcompiler.java=11 \
            -Druntime.java=8

martijnvg (Member) commented:

Thanks @bizybot, I'm going to take a look at this failure.

martijnvg (Member) commented:

I think that during the test the follower cluster may end up with seed nodes pointing to itself instead of to the leader node. The remote cluster service then fails, the shard follow task fails because of that too, and not all of the expected documents get replicated.

[2019-02-18T03:14:09,139][WARN ][o.e.t.RemoteClusterConnection] [followerd3] fetching nodes from external cluster [leader_cluster] failed
  1> java.lang.IllegalStateException: handshake failed, mismatched cluster name [Cluster [follower_cluster]] - {leader_cluster#127.0.0.1:40312}{YxrFM5olRtqM4YiJF-0WSg}{127.0.0.1}{127.0.0.1:40312}
  1>    at org.elasticsearch.transport.TransportService.handshake(TransportService.java:422) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.lambda$collectRemoteNodes$2(RemoteClusterConnection.java:449) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:108) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.collectRemoteNodes(RemoteClusterConnection.java:438) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler.access$900(RemoteClusterConnection.java:328) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.RemoteClusterConnection$ConnectHandler$1.doRun(RemoteClusterConnection.java:426) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_202]
  1>    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_202]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
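The commits below clean up the remote connection before the leader cluster restarts. A rough sketch of that idea in the test, assuming CcrIntegTestCase-style helpers (followerClient(), getLeaderCluster()) and that the leader's new transport address is known as a string; the committed change may differ:

    // Sketch: drop the follower's seeds for leader_cluster before restarting the
    // leader, then point them at the leader's fresh transport address, so stale
    // seeds can never resolve to the follower itself after ports are reused.
    ClusterUpdateSettingsRequest clearSeeds = new ClusterUpdateSettingsRequest();
    clearSeeds.persistentSettings(Settings.builder().putNull("cluster.remote.leader_cluster.seeds"));
    followerClient().admin().cluster().updateSettings(clearSeeds).actionGet();

    getLeaderCluster().fullRestart();

    ClusterUpdateSettingsRequest newSeeds = new ClusterUpdateSettingsRequest();
    newSeeds.persistentSettings(Settings.builder()
            .put("cluster.remote.leader_cluster.seeds", leaderTransportAddress)); // assumed known String
    followerClient().admin().cluster().updateSettings(newSeeds).actionGet();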

martijnvg added a commit that referenced this issue Feb 26, 2019
clean remote connection prior to leader cluster restart

Relates to #37681
martijnvg (Member) commented:

The above fix seems to have resolved this failure; I have not seen it fail since.
