[CI] unexpected failure while replicating translog entry #38898

Closed
droberts195 opened this issue Feb 14, 2019 · 7 comments · Fixed by #39338
Assignees
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

A test failure occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/99/console; however, the root cause was that one of the nodes in the cluster suffered a fatal exception:

[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting
java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:362) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:191) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT] 
    at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:333) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:521) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1076) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The test that was running when this happened was:

./gradlew :qa:smoke-test-multinode:integTestRunner \
  -Dtests.seed=666DFE2D5892C30E \
  -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, wait for both nodes to join}" \
  -Dtests.security.manager=true \
  -Dtests.locale=mk \
  -Dtests.timezone=Europe/Malta \
  -Dcompiler.java=11 \
  -Druntime.java=8

Since an almost identical test immediately beforehand succeeded, I doubt there is anything wrong with the test itself. (The "stash dump on failure" in the Jenkins log is also confusing, as it contains the result of the previous, successful test.)

cluster_logs.zip contains the logs from the nodes in the test cluster that died with the fatal error.

@droberts195 droberts195 added >test-failure Triaged test failures from CI :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn self-assigned this Feb 15, 2019
@dnhatn
Member

dnhatn commented Feb 22, 2019

I am on it today.

dnhatn added a commit that referenced this issue Feb 23, 2019
We have tripped this assertion three times in the last two weeks. However,
it only reports "this IndexWriter is closed" without the actual cause.

```
[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting

java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
```

This change replaces the assert with an explicitly thrown AssertionError so
that the actual cause is included in future build failures.

Relates #38898
@dnhatn
Member

dnhatn commented Feb 23, 2019

In #39333 I attached the reason that the IndexWriter was closed to this assertion. If this test fails again, we will know why the IndexWriter was closed.
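For illustration, here is a minimal sketch of the pattern described in that commit (the class and method names are hypothetical, not the actual RecoveryTarget code): a bare `assert` flattens the failure into a message string, whereas throwing an `AssertionError` with the exception as its cause preserves the closing exception's own stack trace in the fatal-error log.

```java
// Hypothetical sketch of the diagnostic change; not the actual Elasticsearch code.
public class AssertionCauseSketch {

    // Before: a bare assert only carries the failure's toString(), so the
    // stack trace explaining why the IndexWriter was closed is lost.
    static void assertNoFailureLossy(Exception failure) {
        assert failure == null : "unexpected failure while replicating translog entry: " + failure;
    }

    // After: the failure is attached as the cause of the AssertionError, so
    // the fatal-error log includes a "Caused by:" section with its stack trace.
    static void assertNoFailureWithCause(Exception failure) {
        if (failure != null) {
            throw new AssertionError("unexpected failure while replicating translog entry", failure);
        }
    }

    public static void main(String[] args) {
        try {
            assertNoFailureWithCause(new IllegalStateException("this IndexWriter is closed"));
        } catch (AssertionError e) {
            e.printStackTrace(); // prints the AssertionError plus the wrapped cause
        }
    }
}
```

The second stack trace reported below (from the 8.0.0-SNAPSHOT build) shows the effect of this change: the AlreadyClosedException now appears as a "Caused by:" section rather than only in the assertion message.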

@dnhatn
Member

dnhatn commented Feb 25, 2019

Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/262/console

[2019-02-24T17:35:13,791][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#5]], exiting
java.lang.AssertionError: unexpected failure while replicating translog entry
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:366) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:191) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:335) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:521) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:695) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1591) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1487) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.engine.InternalEngine.noOp(InternalEngine.java:1456) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.noOp(IndexShard.java:815) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.markSeqNoAsNoop(IndexShard.java:807) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1338) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1313) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:360) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

@dnhatn
Member

dnhatn commented Feb 25, 2019

I opened #39338.

dnhatn added a commit that referenced this issue Feb 25, 2019
Today we do not bubble up exceptions when processing NoOps but always
treat them as document-level failures. This incorrect treatment causes
the assert_no_failure assertion in peer recovery to trip if the
IndexWriter was closed exceptionally beforehand.

Closes #38898
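As a rough illustration of that fix, here is a simplified sketch with hypothetical types (not the actual InternalEngine code, and assuming lucene-core on the classpath): engine-level failures such as AlreadyClosedException are rethrown out of NoOp processing so the shard can be failed, while only genuinely per-operation problems are recorded as document-level failures.

```java
import org.apache.lucene.store.AlreadyClosedException;

// Hypothetical, simplified sketch of the fix described above.
class NoOpSketch {

    static final class NoOpResult {
        final Exception documentFailure; // null when the no-op was applied cleanly
        NoOpResult(Exception documentFailure) { this.documentFailure = documentFailure; }
    }

    interface TombstoneWriter { // stand-in for writing the no-op tombstone document
        void addTombstone(long seqNo) throws Exception;
    }

    static NoOpResult innerNoOp(TombstoneWriter writer, long seqNo) {
        try {
            writer.addTombstone(seqNo);
            return new NoOpResult(null);
        } catch (AlreadyClosedException e) {
            // Previously this was swallowed as a document-level failure, which
            // later tripped "unexpected failure while replicating translog entry".
            throw e; // bubble up: the IndexWriter is gone, not just this operation
        } catch (Exception e) {
            return new NoOpResult(e); // genuinely document-level failure
        }
    }
}
```

The point of the distinction is that an AlreadyClosedException means the engine itself is unusable, so treating it as a per-document failure hides the real problem from the caller.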