Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

Closed
cbuescher opened this issue Feb 8, 2019 · 8 comments
Closed

[CI] FollowerFailOverIT testFailOverOnFollower fails frequently #38633

cbuescher opened this issue Feb 8, 2019 · 8 comments
Assignees
Labels
:Distributed/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI v8.0.0-alpha1

Comments

@cbuescher
Copy link
Member

This was once mentioned in #33337 but since its
a different method and its not muted yet I'll open a separate issue to track it.

Latest failures eg: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+internalClusterTest/780/console

Doesn't reproduce locally, but this is the reproduction line

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=E9FBEB281249D0E0 \
  -Dtests.class=org.elasticsearch.xpack.ccr.FollowerFailOverIT \
  -Dtests.method="testFailOverOnFollower" \
  -Dtests.security.manager=true \
  -Dtests.locale=uk-UA \
  -Dtests.timezone=Asia/Jerusalem \
  -Dcompiler.java=11 \
  -Druntime.java=8

Errors in the log:

java.lang.AssertionError: timed out waiting for green state
	at org.junit.Assert.fail(Assert.java:88)
	at org.elasticsearch.xpack.CcrIntegTestCase.ensureColor(CcrIntegTestCase.java:317)
	at org.elasticsearch.xpack.CcrIntegTestCase.ensureFollowerGreen(CcrIntegTestCase.java:291)
	at org.elasticsearch.xpack.CcrIntegTestCase.ensureFollowerGreen(CcrIntegTestCase.java:286)
	at org.elasticsearch.xpack.ccr.FollowerFailOverIT.testFailOverOnFollower(FollowerFailOverIT.java:113)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
java.lang.AssertionError: [follower-index][0], node[5iKNfmaJTBinORNF0IpQ6A], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nxJxZYsLT6mCGGshzf7bMQ], unassigned_info[[reason=NODE_LEFT], at[2019-02-08T15:11:56.410Z], delayed=false, details[node_left [5iKNfmaJTBinORNF0IpQ6A]], allocation_status[no_attempt]], expected_shard_size[230] seq_no_stats mismatch
Expected: <SeqNoStats{maxSeqNo=300, localCheckpoint=71, globalCheckpoint=71}>
     but: was <SeqNoStats{maxSeqNo=300, localCheckpoint=-1, globalCheckpoint=71}>
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.test.InternalTestCluster.lambda$assertSeqNos$6(InternalTestCluster.java:1389)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:858)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:832)
	at org.elasticsearch.test.InternalTestCluster.assertSeqNos(InternalTestCluster.java:1359)
	at org.elasticsearch.xpack.CcrIntegTestCase.afterTest(CcrIntegTestCase.java:180)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.lang.AssertionError: [follower-index][0], node[5iKNfmaJTBinORNF0IpQ6A], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=nxJxZYsLT6mCGGshzf7bMQ], unassigned_info[[reason=NODE_LEFT], at[2019-02-08T15:11:56.410Z], delayed=false, details[node_left [5iKNfmaJTBinORNF0IpQ6A]], allocation_status[no_attempt]], expected_shard_size[230] seq_no_stats mismatch
Expected: <SeqNoStats{maxSeqNo=300, localCheckpoint=71, globalCheckpoint=71}>
     but: was <SeqNoStats{maxSeqNo=300, localCheckpoint=-1, globalCheckpoint=71}>
		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
		at org.junit.Assert.assertThat(Assert.java:956)
		at org.elasticsearch.test.InternalTestCluster.lambda$assertSeqNos$6(InternalTestCluster.java:1389)
		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:846)
		... 38 more
@cbuescher cbuescher added >test-failure Triaged test failures from CI :Distributed/CCR Issues around the Cross Cluster State Replication features v8.0.0 labels Feb 8, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Feb 8, 2019
cbuescher pushed a commit that referenced this issue Feb 8, 2019
cbuescher pushed a commit that referenced this issue Feb 8, 2019
cbuescher pushed a commit that referenced this issue Feb 8, 2019
@cbuescher
Copy link
Member Author

I muted this test method with #38634 on master, 7.x and 7.0

@dnhatn dnhatn self-assigned this Feb 9, 2019
dnhatn added a commit that referenced this issue Feb 12, 2019
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjust log and
unmute this test so we can collect more information.

Relates #38633
dnhatn added a commit that referenced this issue Feb 12, 2019
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjust log and
unmute this test so we can collect more information.

Relates #38633
dimitris-athanasiou pushed a commit to dimitris-athanasiou/elasticsearch that referenced this issue Feb 12, 2019
@jasontedor
Copy link
Member

@bleskes @dnhatn @DaveCTurner @martijnvg @ywelsch I think that I can explain what is occurring in this test, and what is occurring here is a blocker issue.

This issue is the result of adding recovery from remote.

With recovery from remote, we copy over the index files and there is no translog replay phase. If the leader does a flush before the follower initiates recovery from remote, after recovery from remote the follower will be fully caught up all primary shards of the follower will have empty translogs. When a replica shard of the follower attempts to recover from the primary shard of the follower, we want to replay translog from the local checkpoint of the commit, to bake a history of operations for it. Since the primary shards of the follower have empty translogs, this replay can not happen and recovery will fail. That explains the following:

RemoteTransportException[[follower3][127.0.0.1:58606][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[translog replay failed to cover required sequence numbers (required range [1:174). first missing op is [1]]; ], allocation_status[no_attempt]]

Immediately, I see two options:

  • InternalEngine#readHistoryOperations could serve operations from the Lucene history instead of the translog; however, this interplays with the need for shard history retention leases to interplay with recovery to ensure that these operations exist
  • the primary shards of the follower could build their own translog from Lucene history after remotely recovering files

I am going to mute this test for now and open a new issue marked as a blocker.

@jasontedor
Copy link
Member

I have marked this test as awaiting a fix in 6.7, 7.0, 7.x, and master.

@dnhatn
Copy link
Member

dnhatn commented Feb 16, 2019

@jasontedor Thanks for digging into this. @martijnvg and I found this issue in #38949. I am working on the fix for this - it's pretty contained. We can use the local checkpoint of the recovery target to validate the sending operations instead of tracking them locally on the recovery source.

@DaveCTurner
Copy link
Contributor

DaveCTurner commented Feb 16, 2019

Is this caused (or exacerbated) by #38904? Prior to #38904 if soft deletes were enabled we did read historical operations from Lucene during peer recovery, and I think the follower primary would have retained that history in Lucene for the duration of the recovery.

I ask because I note that the first report of this failure comes from a time before when #38904 was merged.

@jasontedor
Copy link
Member

Yes that exacerbates it.

dnhatn added a commit that referenced this issue Feb 23, 2019
There were two documents (seq=2 and seq=103) missing on the follower in
one of the failures of `testFailOverOnFollower`. I spent several hours
on that failure but could not figure out the reason. I adjust log and
unmute this test so we can collect more information.

Relates #38633
dnhatn added a commit that referenced this issue Feb 25, 2019
@dnhatn
Copy link
Member

dnhatn commented Feb 25, 2019

This should be fixed by #39006. I unmuted this test in f59b8cd.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI v8.0.0-alpha1
Projects
None yet
Development

No branches or pull requests

6 participants