[CI] Global checkpoint assertion violated #27970

Closed
dnhatn opened this issue Dec 23, 2017 · 0 comments · Fixed by #27972
Labels
:Distributed/Engine - Anything around managing Lucene and the Translog in an open shard.
:Distributed/Recovery - Anything around constructing a new shard, either from a local or a remote source.
>test-failure - Triaged test failures from CI

Comments


dnhatn commented Dec 23, 2017

We introduced the global checkpoint assertion in #27837, and in #27965 we set the global checkpoint in recoveries. However, the latter PR causes some tests to violate this assertion.

CI started failing recently:

01:11:09 [2017-12-23T01:10:24,178][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node-0] fatal error in thread [elasticsearch[node-0][clusterApplierService#updateTask][T#1]], exiting
01:11:09 java.lang.AssertionError: global checkpoint [-2] lower than initial gcp [4]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.lambda$create$0(TranslogWriter.java:141) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.syncNeeded(TranslogWriter.java:258) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:338) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.TranslogWriter.sync(TranslogWriter.java:249) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.translog.Translog.close(Translog.java:328) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:76) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.elasticsearch.index.engine.InternalEngine.closeNoLock(InternalEngine.java:1796) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.engine.Engine.close(Engine.java:1387) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:89) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.apache.lucene.util.IOUtils.close(IOUtils.java:76) ~[lucene-core-7.2.0.jar:7.2.0 bca54cad5a9f6a80800944fd5bd585b68acde8c8 - jpountz - 2017-12-14 16:12:44]
01:11:09 	at org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:1196) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.IndexService.closeShard(IndexService.java:431) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.IndexService.removeShard(IndexService.java:414) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.index.IndexService.close(IndexService.java:274) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:554) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:285) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:219) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:498) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_151]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:495) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:482) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:566) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) ~[elasticsearch-6.2.0-SNAPSHOT.jar:6.2.0-SNAPSHOT]
01:11:09 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
01:11:09 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
01:11:09 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
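
For context, a minimal standalone sketch of the failing check (not the actual Elasticsearch classes; `SimpleTranslogWriter`, `UNASSIGNED`, and the field names are illustrative assumptions): the writer captures the initial global checkpoint when it is created and asserts that the value supplied later never drops below it. If the translog is closed while the in-memory global checkpoint is still the unassigned sentinel (-2), the assertion fires with the same message as in the log above.

```java
// Illustrative sketch only; not the real TranslogWriter API.
import java.util.function.LongSupplier;

class SimpleTranslogWriter {
    static final long UNASSIGNED = -2L;                  // sentinel: checkpoint not set yet

    private final long initialGlobalCheckpoint;          // read from the on-disk checkpoint at creation time
    private final LongSupplier globalCheckpointSupplier; // current in-memory global checkpoint

    SimpleTranslogWriter(long initialGlobalCheckpoint, LongSupplier globalCheckpointSupplier) {
        this.initialGlobalCheckpoint = initialGlobalCheckpoint;
        this.globalCheckpointSupplier = globalCheckpointSupplier;
    }

    boolean syncNeeded() {
        final long current = globalCheckpointSupplier.getAsLong();
        // The check from the stack trace: the supplied checkpoint must never fall
        // below the value captured when the writer was created.
        assert current >= initialGlobalCheckpoint
                : "global checkpoint [" + current + "] lower than initial gcp [" + initialGlobalCheckpoint + "]";
        return current != initialGlobalCheckpoint;
    }

    public static void main(String[] args) {
        long[] inMemoryCheckpoint = {UNASSIGNED};        // recovery has not set the checkpoint yet
        SimpleTranslogWriter writer = new SimpleTranslogWriter(4L, () -> inMemoryCheckpoint[0]);
        // Closing the translog before recovery sets the checkpoint calls syncNeeded()
        // while the supplier still returns -2, so (with -ea) the assertion fires:
        // "global checkpoint [-2] lower than initial gcp [4]".
        writer.syncNeeded();
    }
}
```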
dnhatn added the :Distributed/Recovery, :Sequence IDs, and >test-failure labels on Dec 23, 2017
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Dec 23, 2017
In PR elastic#27965, we set the global checkpoint from the translog in a store
recovery. However, we set it after the engine is opened. This causes the
global checkpoint assertion in TranslogWriter to be violated if we are
forced to close the engine before we set the global checkpoint. A closing
engine closes the translog, which in turn reads the current global
checkpoint; however, it is still unassigned and smaller than the initial
global checkpoint from the translog.

Closes elastic#27970
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Dec 23, 2017
Today we do not set the global checkpoint when opening an engine from
an existing store. If we are forced to close the engine before advancing
the global checkpoint, we also have to close the translog, which in turn
syncs a new checkpoint with an unassigned global checkpoint. This was not
caught until the global checkpoint assertion was introduced in PR
elastic#27837.

This commit tightens the syncNeeded conditions.

Relates elastic#27970
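
As a rough illustration of that idea, continuing the hypothetical `SimpleTranslogWriter` sketch above: one possible shape of a tightened condition would tolerate the unassigned sentinel so that closing the translog before the checkpoint has been set neither trips the assertion nor reports a needed sync. This is an assumption about the general shape of such a change, not the actual Elasticsearch code.

```java
// Hedged sketch of a tightened syncNeeded; illustrative only.
boolean syncNeeded() {
    final long current = globalCheckpointSupplier.getAsLong();
    // Tolerate the unassigned sentinel (-2) while still rejecting any assigned
    // value below the checkpoint captured at creation time.
    assert current == UNASSIGNED || current >= initialGlobalCheckpoint
            : "global checkpoint [" + current + "] lower than initial gcp [" + initialGlobalCheckpoint + "]";
    // Only report a needed sync once a real (assigned) checkpoint has moved past
    // the value captured when the writer was created.
    return current != UNASSIGNED && current != initialGlobalCheckpoint;
}
```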
ywelsch pushed a commit that referenced this issue Dec 23, 2017
In PR #27965, we set the global checkpoint from the translog in a store
recovery. However, we set it after the engine is opened. This causes the
global checkpoint assertion in TranslogWriter to be violated if we are
forced to close the engine before we set the global checkpoint. A closing
engine closes the translog, which in turn reads the current global
checkpoint; however, it is still unassigned and smaller than the initial
global checkpoint from the translog.

Closes #27970
ywelsch pushed a commit that referenced this issue Dec 23, 2017
clintongormley added the :Distributed/Engine label and removed the :Sequence IDs label on Feb 14, 2018