es server always restart because of reading metadata file incorrectly #37286

kkewwei · 2019-01-10T04:07:57Z

ES_VERSION: 5.6.8
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
This happens rarely. When a machine is powered off because of a hardware malfunction, the ES node stays out of the cluster for a long time. After the machine is repaired and the ES server restarts, we expect the node to load its metadata as planned, but instead it logs the errors below and shuts down. As a result, ES keeps cycling between restarting and going down.
Provide logs (if relevant):

Error injecting constructor, ElasticsearchException[java.io.IOException: failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: IOException[failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))];
at org.elasticsearch.gateway.GatewayMetaState.<init>(Unknown Source)
while locating org.elasticsearch.gateway.GatewayMetaState
for parameter 4 at org.elasticsearch.gateway.GatewayService.<init>(Unknown Source)
while locating org.elasticsearch.gateway.GatewayService
Caused by: ElasticsearchException[java.io.IOException: failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: IOException[failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))];
at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:190)
at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:334)
at org.elasticsearch.common.util.IndexFolderUpgrader.upgrade(IndexFolderUpgrader.java:90)
at org.elasticsearch.common.util.IndexFolderUpgrader.upgradeIndicesIfNeeded(IndexFolderUpgrader.java:128)
at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:91)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:49)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50)
at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:191)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:183)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:818)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:183)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:173)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:161)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:42)
at org.elasticsearch.node.Node.<init>(Node.java:499)
at org.elasticsearch.node.Node.<init>(Node.java:245)
at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233)
at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233)
at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342)
at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132)
at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123)
at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70)

elasticmachine · 2019-01-10T14:46:53Z

Pinging @elastic/es-distributed

andrershov · 2019-01-10T15:10:11Z

@kkewwei On startup, an Elasticsearch node reads its metadata from disk. If a metadata file is corrupted, node startup fails. In this case, it seems that /data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st is corrupted; it is the metadata file of the index with UUID rDD73n7GR7uw_vE0Fk1FrA. File corruption can easily be caused by the hardware malfunction that you mention in your comment.
If you were running with number of replicas > 0 (meaning the index data also resided on other nodes), you can safely remove the data folder from this node and let it join the cluster as a fresh node. Elasticsearch will redistribute shard ownership across the nodes automatically.
If it was the single node holding the index, there is not much you can do. We don't have a tool to perform metadata recovery.
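
For readers wondering what the `codec footer mismatch` in the log amounts to, here is a minimal offline sketch in Python that looks for `.st` files under `_state` directories whose trailing footer does not validate. It is not an Elasticsearch tool. It assumes the footer layout that Lucene's `CodecUtil` wrote in that era, as far as I understand it: a big-endian 4-byte magic, a 4-byte algorithm id of 0, and an 8-byte CRC32 over everything before the checksum. The function names `check_footer` and `find_corrupt_state_files` are hypothetical. The magic value is grounded in the log: `0xC02893E8` read as a signed int is -1071082520, the "expected footer" reported above.

```python
import struct
import zlib
from pathlib import Path

# Footer magic; as a signed 32-bit int this is -1071082520 (the log's "expected footer").
FOOTER_MAGIC = 0xC02893E8
# Assumed layout: 4-byte magic + 4-byte algorithm id + 8-byte CRC32, all big-endian.
FOOTER_LENGTH = 16


def check_footer(path):
    """Return None if the footer validates, otherwise a short reason string."""
    data = Path(path).read_bytes()
    if len(data) < FOOTER_LENGTH:
        return "file is shorter than a codec footer"
    magic, algorithm, stored_crc = struct.unpack(">IIQ", data[-FOOTER_LENGTH:])
    if magic != FOOTER_MAGIC:
        return f"footer mismatch: actual={magic:#010x} expected={FOOTER_MAGIC:#010x}"
    if algorithm != 0:
        return f"unknown checksum algorithm id {algorithm}"
    # The checksum covers every byte that precedes the 8-byte checksum field.
    actual_crc = zlib.crc32(data[:-8]) & 0xFFFFFFFF
    if actual_crc != stored_crc:
        return f"checksum mismatch: actual={actual_crc:#010x} stored={stored_crc:#010x}"
    return None


def find_corrupt_state_files(data_root):
    """Map each suspicious *.st file under a _state directory to its failure reason."""
    results = {}
    for path in sorted(Path(data_root).rglob("*.st")):
        if path.parent.name != "_state":
            continue
        reason = check_footer(path)
        if reason is not None:
            results[str(path)] = reason
    return results
```

Calling `find_corrupt_state_files("/data1/es_data/lc_gh_3")` on a stopped node would list candidates for inspection. It only reads files, so it does not change what the node sees on its next start.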

kkewwei · 2019-01-11T01:49:24Z

This never happened with ES 2 under any circumstances, and it does indeed arise with number of replicas = 0. Only a few files are corrupted and most files are fine, so wiping the data folder may not be a good idea given how important the data is. As a workaround, I deleted the corrupted file and the node works well. Could we improve the process by skipping corrupted files on node startup? That way, we could recover most of the data.

ywelsch · 2019-03-11T14:29:51Z

@andrershov as part of #32006, do you think it would be useful to have a command-line tool that would allow a node to recover all index metadata except corrupted ones? In particular, rewrite the cluster state manifest file to remove a "corrupted" index?

andrershov · 2019-03-11T15:18:34Z

@ywelsch I think it's a bigger discussion: not only index metadata, but also global metadata and the manifest itself could be affected. Do we want to recover from these kinds of situations?

ywelsch · 2019-03-13T08:36:07Z

I think we can treat this in a similar way to shard corruptions, for which we currently have a command-line tool. We could do a best-effort recovery of the metadata, with plenty of warnings. I don't think we should do this automatically at startup, however, but require an explicit administrative step. I also think that there are two levels of severity here: master-eligible or non-master-eligible node. When it comes to non-master-eligible nodes, temporarily removing the metadata might be relatively harmless, as this metadata will be recreated on the node when it joins the cluster. For master-eligible nodes it is trickier, as the revised metadata might now be published to other nodes, overriding the intact metadata that they might have.
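
To make the "explicit administrative step" idea concrete, here is a rough Python sketch of what such a step could look like. Everything in it is hypothetical: the function `plan_quarantine`, its `apply` and `force` parameters, and the `.corrupted` suffix are illustrations, not part of any Elasticsearch tool. It assumes the operator already has a list of corrupted metadata files and knows whether the node is master-eligible. It only reports what it would do unless asked to apply, and it refuses on a master-eligible node unless forced.

```python
from pathlib import Path


def plan_quarantine(corrupt_files, master_eligible, apply=False, force=False):
    """Return the (source, target) renames; perform them only when apply=True."""
    if master_eligible and not force:
        # On a master-eligible node the revised metadata could be published
        # to other nodes, so require an extra explicit confirmation.
        raise RuntimeError(
            "refusing to touch metadata on a master-eligible node without force=True"
        )
    plan = []
    for name in corrupt_files:
        source = Path(name)
        # Move the file aside instead of deleting it, so the step can be undone.
        target = source.with_name(source.name + ".corrupted")
        plan.append((source, target))
        if apply:
            source.rename(target)
    return plan
```

Running it once with the defaults shows the planned renames; a second call with `apply=True` carries them out while the node is stopped.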

ywelsch · 2020-07-23T07:53:18Z

The storage format for metadata has changed in recent ES versions (7.6+), and is now Lucene-based. Given how few cases of metadata corruption we have seen (due to faulty hardware), I don't see the need to build automated tooling to support this (instead it should be treated as a full node failure).

kkewwei changed the title ~~es server always restart because of reading file incorrectly~~ es server always restart because of reading metadata file incorrectly Jan 10, 2019

andrershov added the :Distributed/Store label Jan 10, 2019

andrershov self-assigned this Jan 10, 2019

andrershov removed their assignment May 1, 2019

DaveCTurner mentioned this issue Jul 12, 2019

Fail node containing ancient closed index #44264

Merged

rjernst added the Team:Distributed label May 4, 2020

ywelsch closed this as completed Jul 23, 2020
