[CI] unexpected failure while replicating translog entry #38898

Closed
droberts195 opened this issue Feb 14, 2019 · 7 comments · Fixed by #39338
Assignees
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

A test failure occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+intake/99/console; however, the root cause was that one of the nodes in the cluster suffered a fatal exception:

[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting
java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:362) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:191) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT] 
    at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:333) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:521) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1076) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:834) [?:?]

The test that was running when this happened was:

./gradlew :qa:smoke-test-multinode:integTestRunner \
  -Dtests.seed=666DFE2D5892C30E \
  -Dtests.class=org.elasticsearch.smoketest.SmokeTestMultiNodeClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=smoke_test_multinode/10_basic/cluster health basic test, wait for both nodes to join}" \
  -Dtests.security.manager=true \
  -Dtests.locale=mk \
  -Dtests.timezone=Europe/Malta \
  -Dcompiler.java=11 \
  -Druntime.java=8

Since an almost identical test immediately beforehand succeeded, I doubt there is anything wrong with the test itself. (The "stash dump on failure" in the Jenkins log is also confusing, as it contains the result of the previous, successful test.)

cluster_logs.zip contains the logs from the nodes in the test cluster that died with the fatal error.

@droberts195 droberts195 added >test-failure Triaged test failures from CI :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 14, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn self-assigned this Feb 15, 2019
@dnhatn
Member

dnhatn commented Feb 22, 2019

I am on it today.

dnhatn added a commit that referenced this issue Feb 23, 2019
We have tripped this assertion three times in the last two weeks. However,
it only reports "this IndexWriter is closed" without the actual cause.

```
[2019-02-14T11:46:31,144][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#2]], exiting

java.lang.AssertionError: unexpected failure while replicating translog entry: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
```

This change replaces the assert with an explicitly thrown AssertionError so
that the actual cause is included in future build failures.

Relates #38898
@dnhatn
Member

dnhatn commented Feb 23, 2019

In #39333 I attached the reason that the IndexWriter was closed to this assertion. If this test fails again, we will know why the IndexWriter was closed.
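For illustration, here is a minimal sketch of the pattern described in that commit (the class and method names are hypothetical, not the actual RecoveryTarget code): a bare `assert` flattens the failure into a message string, whereas throwing an `AssertionError` with the exception as its cause preserves the closing exception's own stack trace in the fatal-error log.

```java
// Hypothetical sketch of the diagnostic change; not the actual Elasticsearch code.
public class AssertionCauseSketch {

    // Before: a bare assert only carries the failure's toString(), so the
    // stack trace explaining why the IndexWriter was closed is lost.
    static void assertNoFailureLossy(Exception failure) {
        assert failure == null : "unexpected failure while replicating translog entry: " + failure;
    }

    // After: the failure is attached as the cause of the AssertionError, so
    // the fatal-error log includes a "Caused by:" section with its stack trace.
    static void assertNoFailureWithCause(Exception failure) {
        if (failure != null) {
            throw new AssertionError("unexpected failure while replicating translog entry", failure);
        }
    }

    public static void main(String[] args) {
        try {
            assertNoFailureWithCause(new IllegalStateException("this IndexWriter is closed"));
        } catch (AssertionError e) {
            e.printStackTrace(); // prints the AssertionError plus the wrapped cause
        }
    }
}
```

The second stack trace reported below (from the 8.0.0-SNAPSHOT build) shows the effect of this change: the AlreadyClosedException now appears as a "Caused by:" section rather than only in the assertion message.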

@dnhatn
Member

dnhatn commented Feb 25, 2019

Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/262/console

[2019-02-24T17:35:13,791][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-1] fatal error in thread [elasticsearch[node-1][generic][T#5]], exiting
java.lang.AssertionError: unexpected failure while replicating translog entry
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:366) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:191) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.RecoveryTarget.indexTranslogOperations(RecoveryTarget.java:335) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:521) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$TranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:480) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1077) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:681) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:695) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1591) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.0.0-snapshot-83f9835.jar:8.0.0-snapshot-83f9835 83f9835a47a00a2ec58a4cf5fc0d492497cf7898 - jpountz - 2019-01-21 13:05:58]
	at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1487) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.engine.InternalEngine.noOp(InternalEngine.java:1456) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.noOp(IndexShard.java:815) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.markSeqNoAsNoop(IndexShard.java:807) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1338) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.index.shard.IndexShard.applyTranslogOperation(IndexShard.java:1313) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$indexTranslogOperations$2(RecoveryTarget.java:360) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

@dnhatn
Member

dnhatn commented Feb 25, 2019

I opened #39338.

dnhatn added a commit that referenced this issue Feb 25, 2019
Today we do not bubble up exceptions when processing NoOps but always
treat them as document-level failures. This incorrect treatment causes
the assert_no_failure assertion in peer recovery to trip if the
IndexWriter was closed exceptionally beforehand.

Closes #38898
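As a rough illustration of that fix, here is a simplified sketch with hypothetical types (not the actual InternalEngine code, and assuming lucene-core on the classpath): engine-level failures such as AlreadyClosedException are rethrown out of NoOp processing so the shard can be failed, while only genuinely per-operation problems are recorded as document-level failures.

```java
import org.apache.lucene.store.AlreadyClosedException;

// Hypothetical, simplified sketch of the fix described above.
class NoOpSketch {

    static final class NoOpResult {
        final Exception documentFailure; // null when the no-op was applied cleanly
        NoOpResult(Exception documentFailure) { this.documentFailure = documentFailure; }
    }

    interface TombstoneWriter { // stand-in for writing the no-op tombstone document
        void addTombstone(long seqNo) throws Exception;
    }

    static NoOpResult innerNoOp(TombstoneWriter writer, long seqNo) {
        try {
            writer.addTombstone(seqNo);
            return new NoOpResult(null);
        } catch (AlreadyClosedException e) {
            // Previously this was swallowed as a document-level failure, which
            // later tripped "unexpected failure while replicating translog entry".
            throw e; // bubble up: the IndexWriter is gone, not just this operation
        } catch (Exception e) {
            return new NoOpResult(e); // genuinely document-level failure
        }
    }
}
```

The point of the distinction is that an AlreadyClosedException means the engine itself is unusable, so treating it as a per-document failure hides the real problem from the caller.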