
[CI] IndexFollowingIT.testFollowIndex failure #38949

Closed
iverase opened this issue Feb 15, 2019 · 6 comments · Fixed by #39006
Assignees
Labels
:Distributed/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI

Comments

@iverase
Contributor

iverase commented Feb 15, 2019

This test has failed with an assertion error. What looks relevant in the logs is the following:

  1> [2019-02-15T08:37:44,302][INFO ][o.e.c.m.MetaDataMappingService] [leaderm2] [index1/zhspcSQ3Su-DfqnT8FirZg] create_mapping [doc]
  1> [2019-02-15T08:37:44,327][INFO ][o.e.c.m.MetaDataMappingService] [follower1] [index2/48YajD45TdeuzdasRRO35A] create_mapping [doc]
  1> [2019-02-15T08:37:45,353][INFO ][o.e.c.m.MetaDataIndexStateService] [follower1] closing indices [index2/48YajD45TdeuzdasRRO35A]
  1> [2019-02-15T08:37:45,368][INFO ][o.e.c.m.MetaDataIndexStateService] [follower1] completed closing of indices [index2]
  1> [2019-02-15T08:37:45,385][WARN ][o.e.x.c.a.b.TransportBulkShardOperationsAction] [follower1] unexpected error during the primary phase for action [indices:data/write/bulk_shard_operations[s]], request [BulkShardOperationsRequest{historyUUID=N2HT1imJT2-IDpPZ0D6Mog, operations=1, maxSeqNoUpdates=-1, shardId=[index2][0], timeout=1m, index='index2', waitForActiveShards=DEFAULT}]
  1> org.elasticsearch.indices.IndexClosedException: closed
  1> 	at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:215) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]

It does not reproduce locally:

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=37D22C55D4DB9A44 \
  -Dtests.class=org.elasticsearch.xpack.ccr.IndexFollowingIT \
  -Dtests.method="testFollowIndex" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-CU \
  -Dtests.timezone=America/Argentina/Tucuman \
  -Dcompiler.java=11 \
  -Druntime.java=8

Console log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/122/console

@iverase iverase added >test-failure Triaged test failures from CI :Distributed/CCR Issues around the Cross Cluster State Replication features labels Feb 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@martijnvg martijnvg self-assigned this Feb 15, 2019
@martijnvg
Member

The shards of the follow index could not be acked (PutFollowAction.Response#isFollowIndexShardsAcked() returned false). This is the first time that I've seen this test fail at this assertion.
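
For reference, the failing step in the test is roughly the following (a condensed sketch, not the exact IndexFollowingIT code; the putFollow/followerClient helpers are the test harness's own):

```java
// Condensed sketch of the failing test step (assumed shape, not the exact test code).
// The test issues a PutFollowAction request and then asserts on the acknowledgement
// flags of the response; the second assertion is the one that tripped in CI.
final PutFollowAction.Request followRequest = putFollow("index1", "index2");
final PutFollowAction.Response response =
        followerClient().execute(PutFollowAction.INSTANCE, followRequest).actionGet();

assertTrue(response.isFollowIndexCreated());
assertTrue(response.isFollowIndexShardsAcked());   // returned false in this failure
```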

It appears this failure is caused by missing ops during file-based recovery:

[2019-02-15T08:36:57,681][WARN ][o.e.i.c.IndicesClusterStateService] [follower0] [[index2][0]] marking and sending shard failed due to [failed recovery]
  1> org.elasticsearch.indices.recovery.RecoveryFailedException: [index2][0]: Recovery failed from {follower1}{iqu96z6sTPyAZFgqumPkDA}{2Y6Xuyf3SWi11ApN6Rr_vQ}{127.0.0.1}{127.0.0.1:36877}{xpack.installed=true} in
to {follower0}{zgOtImoTSEuCUs71SNSBqA}{k5d-fAtgRN2TKv_5578QOg}{127.0.0.1}{127.0.0.1:46759}{xpack.installed=true}
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.lambda$handleException$23(TcpTransport.java:1000) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) [elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [follower1][127.0.0.1:36877][internal:index/shard/recovery/start_recovery]
  1> Caused by: java.lang.IllegalStateException: translog replay failed to cover required sequence numbers (required range [9:63). first missing op is [9]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$25(RecoverySourceHandler.java:592) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendBatch(RecoverySourceHandler.java:648) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$sendBatch$26(RecoverySourceHandler.java:634) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.lambda$indexTranslogOperations$4(RemoteRecoveryTargetHandler.java:133) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleResponse(PlainTransportFuture.java:87) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:974) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    ... 3 more
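
For context, the IllegalStateException comes from a guard at the end of translog replay in phase 2 of peer recovery. A minimal, self-contained sketch of that check (simplified names, not the actual RecoverySourceHandler code):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

// Self-contained sketch (simplified, not the actual Elasticsearch code) of the guard
// that produces this failure: after replaying translog operations in phase 2, the
// recovery source verifies that every sequence number in the required range
// [startingSeqNo, endingSeqNo) was covered and reports the first missing op otherwise.
public final class SeqNoRangeCheckSketch {

    static void ensureRangeCovered(long startingSeqNo, long endingSeqNo, List<Long> replayedSeqNos) {
        // Find the first sequence number in the required range that was never replayed.
        long firstMissing = LongStream.range(startingSeqNo, endingSeqNo)
                .filter(seqNo -> replayedSeqNos.contains(seqNo) == false)
                .findFirst()
                .orElse(-1L);
        if (firstMissing != -1L) {
            throw new IllegalStateException("translog replay failed to cover required sequence numbers"
                    + " (required range [" + startingSeqNo + ":" + endingSeqNo + ")."
                    + " first missing op is [" + firstMissing + "]");
        }
    }

    public static void main(String[] args) {
        // Mirrors the failure above: ops [9:63) were required, but the translog snapshot
        // handed to phase 2 only contained ops 10..62, so op 9 is reported as missing.
        List<Long> replayed = LongStream.range(10, 63).boxed().collect(Collectors.toList());
        try {
            ensureRangeCovered(9, 63, replayed);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // ... first missing op is [9]
        }
    }
}
```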

@tvernum, @ywelsch or @dnhatn: have we seen this IllegalStateException failure before during file-based recovery?

@jpountz
Contributor

jpountz commented Feb 15, 2019

@iverase Let's mute this test?
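
For the record, muting would look roughly like this, assuming the usual AwaitsFix annotation pointing back at this issue (the actual mute commits are in the timeline below and may differ):

```java
// Sketch of a typical mute (org.apache.lucene.util.LuceneTestCase.AwaitsFix);
// the real mute commit may have done this differently.
@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/38949")
public void testFollowIndex() throws Exception {
    // test body unchanged
}
```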

iverase added a commit that referenced this issue Feb 15, 2019
iverase added a commit that referenced this issue Feb 15, 2019
@dnhatn dnhatn self-assigned this Feb 15, 2019
@dnhatn
Member

dnhatn commented Feb 16, 2019

Thanks @martijnvg for the initial look. I have some idea of what is happening.

@martijnvg
Member

The IndexFollowingIT#testFollowIndexWithConcurrentMappingChanges test fails for the same reason as above. The follower index can't get into a green state:

[2019-02-18T13:07:38,034][INFO ][o.e.x.c.IndexFollowingIT ] [testFollowIndexWithConcurrentMappingChanges] after test
FAILURE 30.4s J1 | IndexFollowingIT.testFollowIndexWithConcurrentMappingChanges <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: timed out waiting for green state
   >    at __randomizedtesting.SeedInfo.seed([702638CC73823FFE:5754AC34DA21D6DA]:0)
   >    at org.elasticsearch.xpack.CcrIntegTestCase.ensureColor(CcrIntegTestCase.java:336)
   >    at org.elasticsearch.xpack.CcrIntegTestCase.ensureFollowerGreen(CcrIntegTestCase.java:310)
   >    at org.elasticsearch.xpack.CcrIntegTestCase.ensureFollowerGreen(CcrIntegTestCase.java:305)
   >    at org.elasticsearch.xpack.ccr.IndexFollowingIT.testFollowIndexWithConcurrentMappingChanges(IndexFollowingIT.java:269)
   >    at java.lang.Thread.run(Thread.java:748)
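
For context, ensureFollowerGreen boils down to a cluster health call against the follower cluster that waits for green. A rough sketch of the wait that times out here (simplified, assuming the test's followerClient() helper; not the actual CcrIntegTestCase.ensureColor implementation):

```java
// Rough sketch: a cluster health request against the follower cluster that waits for
// green and fails the test if the timeout elapses first.
ClusterHealthRequest healthRequest = new ClusterHealthRequest("index2")
        .waitForGreenStatus()
        .timeout(TimeValue.timeValueSeconds(30));
ClusterHealthResponse healthResponse =
        followerClient().admin().cluster().health(healthRequest).actionGet();
if (healthResponse.isTimedOut()) {
    throw new AssertionError("timed out waiting for green state");
}
```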

Error in the logs:

[2019-02-18T13:07:08,840][WARN ][o.e.c.r.a.AllocationService] [follower1] failing shard [failed shard, shard [index2][0], node[RmgnfY6vTDe-fyixPwBAsA], [R], recovery_source[peer recovery], s[INITIALIZING], 
a[id=EVB96a62RQW6pTtldBWYcA], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-02-18T13:07:08.659Z], failed_attempts[4], delayed=false, details[failed shard on node [RmgnfY6vTDe-fyixPwBAsA]: failed to perform
 indices:data/write/bulk_shard_operations[s] on replica [index2][0], node[RmgnfY6vTDe-fyixPwBAsA], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=gNA_4HuDT7S4NJfIz6RJZA], unassigned_info[[reason=ALLO
CATION_FAILED], at[2019-02-18T13:07:08.480Z], failed_attempts[3], delayed=false, details[failed shard on node [RmgnfY6vTDe-fyixPwBAsA]: failed recovery, failure RecoveryFailedException[[index2][0]: Recovery fail
ed from {follower1}{TF5Ju3gbQimfP5f5UAx1Pw}{EZ9qhrdkR-GZ-r3kGcu-QA}{127.0.0.1}{127.0.0.1:33857}{xpack.installed=true} into {follower0}{RmgnfY6vTDe-fyixPwBAsA}{k9gtij__QTqzLoW2w45oBg}{127.0.0.1}{127.0.0.1:33247}{
xpack.installed=true}]; nested: RemoteTransportException[[follower1][127.0.0.1:33857][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[translog replay failed to cover required sequen
ce numbers (required range [23:58). first missing op is [23]]; ], allocation_status[no_attempt]], failure RemoteTransportException[[follower0][127.0.0.1:33247][indices:data/write/bulk_shard_operations[s][r]]]; n
ested: ShardNotFoundException[no such shard]; ], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[index2][0]: Recovery failed from {follower1}{TF5Ju3gbQimfP5f5UAx1Pw}{
EZ9qhrdkR-GZ-r3kGcu-QA}{127.0.0.1}{127.0.0.1:33857}{xpack.installed=true} into {follower0}{RmgnfY6vTDe-fyixPwBAsA}{k9gtij__QTqzLoW2w45oBg}{127.0.0.1}{127.0.0.1:33247}{xpack.installed=true}]; nested: RemoteTransp
ortException[[follower1][127.0.0.1:33857][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[translog replay failed to cover required sequence numbers (required range [23:85). first mi
ssing op is [23]]; ], markAsStale [true]]
  1> org.elasticsearch.indices.recovery.RecoveryFailedException: [index2][0]: Recovery failed from {follower1}{TF5Ju3gbQimfP5f5UAx1Pw}{EZ9qhrdkR-GZ-r3kGcu-QA}{127.0.0.1}{127.0.0.1:33857}{xpack.installed=true} in
to {follower0}{RmgnfY6vTDe-fyixPwBAsA}{k9gtij__QTqzLoW2w45oBg}{127.0.0.1}{127.0.0.1:33247}{xpack.installed=true}
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:253) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:298) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.lambda$handleException$23(TcpTransport.java:1000) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_202]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_202]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_202]
  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [follower1][127.0.0.1:33857][internal:index/shard/recovery/start_recovery]
  1> Caused by: java.lang.IllegalStateException: translog replay failed to cover required sequence numbers (required range [23:85). first missing op is [23]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$phase2$25(RecoverySourceHandler.java:592) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.sendBatch(RecoverySourceHandler.java:648) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$sendBatch$26(RecoverySourceHandler.java:634) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.indices.recovery.RemoteRecoveryTargetHandler.lambda$indexTranslogOperations$4(RemoteRecoveryTargetHandler.java:133) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:54) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.PlainTransportFuture.handleResponse(PlainTransportFuture.java:87) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1108) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport$1.doRun(TcpTransport.java:974) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1>    ... 3 more

Error in the cluster state:

[2019-02-18T13:07:37,939][INFO ][o.e.x.c.IndexFollowingIT ] [testFollowIndexWithConcurrentMappingChanges] ensureGreen timed out, cluster state:
  1> cluster uuid: Ng9EU-PNRtKSGAho1lXeuw
  1> version: 133
  1> state uuid: MHeg6LFATgqtJNXpGR7jCQ
  1> from_diff: false
  1> meta data version: 111
  1>    coordination_metadata:
  1>       term: 1
  1>       last_committed_config: VotingConfiguration{TF5Ju3gbQimfP5f5UAx1Pw}
  1>       last_accepted_config: VotingConfiguration{TF5Ju3gbQimfP5f5UAx1Pw}
  1>       voting tombstones: []
  1>    [index2/WvAgjeByTKyKyHyfv4F3_g]: v[29], mv[26], sv[1]
  1>       0: p_term [1], isa_ids [4_9HUKWCRFabr2ctMYuJDA]
  1> metadata customs:
  1>    persistent_tasks: {"last_allocation_id":1,"tasks":[{"id":"WvAgjeByTKyKyHyfv4F3_g-0","task":{"xpack/ccr/shard_follow_task":{"params":{"remote_cluster":"leader_cluster","follow_shard_index":"index2","follo
w_shard_index_uuid":"WvAgjeByTKyKyHyfv4F3_g","follow_shard_shard":0,"leader_shard_index":"index1","leader_shard_index_uuid":"iV7aduaNSqyO9sJE9bZgOw","leader_shard_shard":0,"max_read_request_operation_count":5120
,"max_write_request_operation_count":5120,"max_outstanding_read_requests":12,"max_outstanding_write_requests":9,"max_read_request_size":"32mb","max_write_request_size":"9223372036854775807b","max_write_buffer_co
unt":2147483647,"max_write_buffer_size":"512mb","max_retry_delay":"10ms","read_poll_timeout":"10ms","headers":{}}}},"allocation_id":1,"assignment":{"executor_node":"TF5Ju3gbQimfP5f5UAx1Pw","explanation":""}}]}  
 index-graveyard: IndexGraveyard[[[index=[index2/oUo45WFIT4KD6fx44LuOug], deleteDate=2019-02-18T13:06:54.275Z], [index=[index2/Jig6XuL7S1ugGmHMbpgYZw], deleteDate=2019-02-18T13:06:55.007Z], [index=[index2/0ou9jM
_IT9Sg2LWBXoAbog], deleteDate=2019-02-18T13:06:56.962Z], [index=[follower/jblDMPVeTx2yc_6Bguu2vg], deleteDate=2019-02-18T13:07:00.029Z], [index=[follower/IYSgSyO9Su-I9uX9hKb5gQ], deleteDate=2019-02-18T13:07:03.0
73Z], [index=[index2/1On8lJ1aQuGV4y8J6I65hg], deleteDate=2019-02-18T13:07:05.700Z], [index=[index2/-Jd8S67ETySPcNWbbm2-kA], deleteDate=2019-02-18T13:07:07.622Z]]]
  1> nodes: 
  1>    {follower1}{TF5Ju3gbQimfP5f5UAx1Pw}{EZ9qhrdkR-GZ-r3kGcu-QA}{127.0.0.1}{127.0.0.1:33857}{xpack.installed=true}, local, master
  1>    {follower0}{RmgnfY6vTDe-fyixPwBAsA}{k9gtij__QTqzLoW2w45oBg}{127.0.0.1}{127.0.0.1:33247}{xpack.installed=true}
  1> routing_table (version 63):
  1> -- index [[index2/WvAgjeByTKyKyHyfv4F3_g]]
  1> ----shard_id [index2][0]
  1> --------[index2][0], node[TF5Ju3gbQimfP5f5UAx1Pw], [P], s[STARTED], a[id=4_9HUKWCRFabr2ctMYuJDA]
  1> --------[index2][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-02-18T13:07:08.840Z], failed_attempts[5], delayed=false, details[failed shard on node [RmgnfY6vTDe-fyixPwBAsA]: failed recovery, failure RecoveryFailedException[[index2][0]: Recovery failed from {follower1}{TF5Ju3gbQimfP5f5UAx1Pw}{EZ9qhrdkR-GZ-r3kGcu-QA}{127.0.0.1}{127.0.0.1:33857}{xpack.installed=true} into {follower0}{RmgnfY6vTDe-fyixPwBAsA}{k9gtij__QTqzLoW2w45oBg}{127.0.0.1}{127.0.0.1:33247}{xpack.installed=true}]; nested: RemoteTransportException[[follower1][127.0.0.1:33857][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[translog replay failed to cover required sequence numbers (required range [23:85). first missing op is [23]]; ], allocation_status[no_attempt]]
  1> routing_nodes:
  1> -----node_id[TF5Ju3gbQimfP5f5UAx1Pw][V]
  1> --------[index2][0], node[TF5Ju3gbQimfP5f5UAx1Pw], [P], s[STARTED], a[id=4_9HUKWCRFabr2ctMYuJDA]
  1> -----node_id[RmgnfY6vTDe-fyixPwBAsA][V]
  1> ---- unassigned
  1> --------[index2][0], node[null], [R], recovery_source[peer recovery], s[UNASSIGNED], unassigned_info[[reason=ALLOCATION_FAILED], at[2019-02-18T13:07:08.840Z], failed_attempts[5], delayed=false, details[failed shard on node [RmgnfY6vTDe-fyixPwBAsA]: failed recovery, failure RecoveryFailedException[[index2][0]: Recovery failed from {follower1}{TF5Ju3gbQimfP5f5UAx1Pw}{EZ9qhrdkR-GZ-r3kGcu-QA}{127.0.0.1}{127.0.0.1:33857}{xpack.installed=true} into {follower0}{RmgnfY6vTDe-fyixPwBAsA}{k9gtij__QTqzLoW2w45oBg}{127.0.0.1}{127.0.0.1:33247}{xpack.installed=true}]; nested: RemoteTransportException[[follower1][127.0.0.1:33857][internal:index/shard/recovery/start_recovery]]; nested: IllegalStateException[translog replay failed to cover required sequence numbers (required range [23:85). first missing op is [23]]; ], allocation_status[no_attempt]]
  1> customs:
  1>    restore: RestoreInProgress[[]]
  1> tasks: (0):

Build url: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/156/console

I'm going to mute this test as well.

martijnvg added a commit that referenced this issue Feb 18, 2019
martijnvg added a commit that referenced this issue Feb 18, 2019
@martijnvg
Member

This is another instance of the test failure in the previous commit: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+release-tests/462/console

martijnvg added a commit that referenced this issue Feb 18, 2019
martijnvg added a commit that referenced this issue Feb 18, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this issue Feb 18, 2019
…ate-file

* elastic/master:
  Remove tests and branches that will never execute (elastic#38772)
  also check ccr stats api return empty response in ensureNoCcrTasks()
  Add overlapping, before, after filters to intervals query (elastic#38999)
  Mute test elastic#38949
  Add remote recovery to ShardFollowTaskReplicationTests (elastic#39007)
  [ML] More advanced post-test cleanup of ML indices (elastic#39049)
  wait for shard to be allocated before executing a resume follow api
  Update track-total-hits.asciidoc
  Force kill testcluster nodes (elastic#37353)
  Make pullFixture a task dependency of resolveAllDependencies (elastic#38956)
  set minimum supported version (elastic#39043)
  Enforce Completion Context Limit (elastic#38675)
  Mute test
  Don't close caches while there might still be in-flight requests. (elastic#38958)
  Fix elastic#38623 remove xpack namespace REST API (elastic#38625)
  Add data frame feature (elastic#38934)
  Test bi-directional index following during a rolling upgrade. (elastic#38962)
  Generate mvn pom for ssl-config library (elastic#39019)
  Mute testRetentionLeaseIsRenewedDuringRecovery
talevy pushed a commit that referenced this issue Feb 20, 2019
dnhatn added a commit that referenced this issue Feb 25, 2019
With this change, we won't wait for the local checkpoint to advance to
the max_seq_no before starting phase2 of peer-recovery. We also remove
the sequence number range check in peer-recovery. We can safely do these
thanks to Yannick's finding.

The replication group to be used is currently sampled after indexing
into the primary (see `ReplicationOperation` class). This means that
when initiating tracking of a new replica, we have to consider the
following two cases:

- There are operations for which the replication group has not been
sampled yet. As we initiated the new replica as tracking, we know that
those operations will be replicated to the new replica and follow the
typical replication group semantics (e.g. marked as stale when
unavailable).

- There are operations for which the replication group has already been
sampled. These operations will not be sent to the new replica.  However,
we know that those operations are already indexed into Lucene and the
translog on the primary, as the sampling is happening after that. This
means that by taking a snapshot of Lucene or the translog, we will be
getting those ops as well. What we cannot guarantee anymore is that all
ops up to `endingSeqNo` are available in the snapshot (i.e.  also see
comment in `RecoverySourceHandler` saying `We need to wait for all
operations up to the current max to complete, otherwise we can not
guarantee that all operations in the required range will be available
for replaying from the translog of the source.`). This is not needed,
though, as we can no longer guarantee that max seq no == local
checkpoint.

Relates #39000
Closes #38949

Co-authored-by: Yannick Welsch <yannick@welsch.lu>
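
To make the two cases described in the commit message concrete, here is a small self-contained toy model (hypothetical, not Elasticsearch code) of why dropping the wait and the range check is safe:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.stream.LongStream;

// Toy model of the argument above. Every op either had its replication group sampled
// after the replica was added as tracking (it is replicated to the replica directly)
// or before (it is already in the primary's Lucene/translog, so the recovery snapshot
// contains it). The snapshot alone need not cover the whole required range, but
// snapshot + ongoing replication together cover every op.
public final class RecoveryCoverageSketch {

    public static void main(String[] args) {
        long startingSeqNo = 23;  // replica needs ops from here (cf. "required range [23:85)")
        long endingSeqNo = 85;    // max seq no on the primary when recovery starts

        // Ops still in flight when the snapshot was taken: missing from the snapshot,
        // but they reach the new replica through the replication group.
        Set<Long> replicatedDirectly = new HashSet<>();
        LongStream.range(60, endingSeqNo).forEach(replicatedDirectly::add);

        // Everything else is already indexed on the primary and shows up in the snapshot.
        Set<Long> inSnapshot = new HashSet<>();
        LongStream.range(0, endingSeqNo)
                .filter(op -> replicatedDirectly.contains(op) == false)
                .forEach(inSnapshot::add);

        // Old check: the snapshot alone must cover [startingSeqNo, endingSeqNo) -- fails here.
        boolean snapshotCoversRange =
                LongStream.range(startingSeqNo, endingSeqNo).allMatch(inSnapshot::contains);
        // New reasoning: snapshot + direct replication cover every required op -- holds.
        boolean unionCoversRange =
                LongStream.range(startingSeqNo, endingSeqNo)
                        .allMatch(op -> inSnapshot.contains(op) || replicatedDirectly.contains(op));

        System.out.println("snapshot alone covers the range:         " + snapshotCoversRange); // false
        System.out.println("snapshot + replication covers the range: " + unionCoversRange);    // true
    }
}
```
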
dnhatn added a commit that referenced this issue Feb 25, 2019
dnhatn added a commit that referenced this issue Feb 25, 2019
dnhatn added a commit that referenced this issue Feb 25, 2019