[CI] Global checkpoint assertion violated #27970

Closed
dnhatn opened this issue Dec 23, 2017 · 0 comments · Fixed by #27972
Labels
:Distributed/Engine - Anything around managing Lucene and the Translog in an open shard.
:Distributed/Recovery - Anything around constructing a new shard, either from a local or a remote source.
>test-failure - Triaged test failures from CI

Comments


dnhatn commented Dec 23, 2017

We introduced the global checkpoint assertion in #27837, and in #27965 we set the global checkpoint in recoveries. However, the latter PR causes some tests to violate this assertion.

CI started failing recently:

01:11:09 [2017-12-23T01:10:24,178][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-0] fatal error in thread [elasticsearch[node-0][clusterApplierService#updateTask][T#1]], exiting
01:11:09 java.lang.AssertionError: global checkpoint [-2] lower than initial gcp [4]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.lambda$create$0(TranslogWriter.java:141) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.syncNeeded(TranslogWriter.java:258) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:338) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.sync(TranslogWriter.java:249) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.Translog.close(Translog.java:328) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:76) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.elasticsearch.index.engine.InternalEngine.closeNoLock(InternalEngine.java:1796) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.engine.Engine.close(Engine.java:1387) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:76) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:1196) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.IndexService.closeShard(IndexService.java:431) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.IndexService.removeShard(IndexService.java:414) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.IndexService.close(IndexService.java:274) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:554) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:285) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:219) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_151]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
01:11:09 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
01:11:09 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
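
For context, a minimal standalone sketch of the failing check (not the actual Elasticsearch classes; `SimpleTranslogWriter`, `UNASSIGNED`, and the field names are illustrative assumptions): the writer captures the initial global checkpoint when it is created and asserts that the value supplied later never drops below it. If the translog is closed while the in-memory global checkpoint is still the unassigned sentinel (-2), the assertion fires with the same message as in the log above.

```java
// Illustrative sketch only; not the real TranslogWriter API.
import java.util.function.LongSupplier;

class SimpleTranslogWriter {
    static final long UNASSIGNED = -2L;                  // sentinel: checkpoint not set yet

    private final long initialGlobalCheckpoint;          // read from the on-disk checkpoint at creation time
    private final LongSupplier globalCheckpointSupplier; // current in-memory global checkpoint

    SimpleTranslogWriter(long initialGlobalCheckpoint, LongSupplier globalCheckpointSupplier) {
        this.initialGlobalCheckpoint = initialGlobalCheckpoint;
        this.globalCheckpointSupplier = globalCheckpointSupplier;
    }

    boolean syncNeeded() {
        final long current = globalCheckpointSupplier.getAsLong();
        // The check from the stack trace: the supplied checkpoint must never fall
        // below the value captured when the writer was created.
        assert current >= initialGlobalCheckpoint
                : "global checkpoint [" + current + "] lower than initial gcp [" + initialGlobalCheckpoint + "]";
        return current != initialGlobalCheckpoint;
    }

    public static void main(String[] args) {
        long[] inMemoryCheckpoint = {UNASSIGNED};        // recovery has not set the checkpoint yet
        SimpleTranslogWriter writer = new SimpleTranslogWriter(4L, () -> inMemoryCheckpoint[0]);
        // Closing the translog before recovery sets the checkpoint calls syncNeeded()
        // while the supplier still returns -2, so (with -ea) the assertion fires:
        // "global checkpoint [-2] lower than initial gcp [4]".
        writer.syncNeeded();
    }
}
```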
dnhatn added the :Distributed/Recovery, :Sequence IDs, and >test-failure labels on Dec 23, 2017
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Dec 23, 2017
In PR elastic#27965, we set the global checkpoint from the translog in a store
recovery. However, we set it after the engine is opened. This causes the
global checkpoint assertion in TranslogWriter to be violated if we are
forced to close the engine before we set the global checkpoint. A closing
engine closes the translog, which in turn reads the current global
checkpoint; however, it is still unassigned and smaller than the initial
global checkpoint from the translog.

Closes elastic#27970
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Dec 23, 2017
Today we do not set the global checkpoint when opening an engine from
an existing store. If we are forced to close the engine before advancing
the global checkpoint, we also have to close the translog, which in turn
syncs a new checkpoint with an unassigned global checkpoint. This was not
caught until the global checkpoint assertion was introduced in PR
elastic#27837.

This commit tightens the syncNeeded conditions.

Relates elastic#27970
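
As a rough illustration of that idea, continuing the hypothetical `SimpleTranslogWriter` sketch above: one possible shape of a tightened condition would tolerate the unassigned sentinel so that closing the translog before the checkpoint has been set neither trips the assertion nor reports a needed sync. This is an assumption about the general shape of such a change, not the actual Elasticsearch code.

```java
// Hedged sketch of a tightened syncNeeded; illustrative only.
boolean syncNeeded() {
    final long current = globalCheckpointSupplier.getAsLong();
    // Tolerate the unassigned sentinel (-2) while still rejecting any assigned
    // value below the checkpoint captured at creation time.
    assert current == UNASSIGNED || current >= initialGlobalCheckpoint
            : "global checkpoint [" + current + "] lower than initial gcp [" + initialGlobalCheckpoint + "]";
    // Only report a needed sync once a real (assigned) checkpoint has moved past
    // the value captured when the writer was created.
    return current != UNASSIGNED && current != initialGlobalCheckpoint;
}
```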
ywelsch pushed a commit that referenced this issue Dec 23, 2017
In PR #27965, we set the global checkpoint from the translog in a store
recovery. However, we set it after the engine is opened. This causes the
global checkpoint assertion in TranslogWriter to be violated if we are
forced to close the engine before we set the global checkpoint. A closing
engine closes the translog, which in turn reads the current global
checkpoint; however, it is still unassigned and smaller than the initial
global checkpoint from the translog.

Closes #27970
ywelsch pushed a commit that referenced this issue Dec 23, 2017
clintongormley added the :Distributed/Engine label and removed the :Sequence IDs label on Feb 14, 2018