es server always restart because of reading metadata file incorrectly #37286

Closed
kkewwei opened this issue Jan 10, 2019 · 7 comments
Labels
:Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Distributed Meta label for distributed team

Comments

@kkewwei
Contributor

kkewwei commented Jan 10, 2019

ES_VERSION: 5.6.8
JVM version : JDK1.8.0_112
OS version : linux
Description of the problem including expected versus actual behavior:
This happens rarely. When a machine is powered off because of a hardware malfunction, the ES node on it is out of the cluster for a long time. Once the machine is repaired and the ES node restarts, I expect it to load its metadata automatically, but instead it reports the errors below and shuts down. As a result, ES keeps cycling between restarting and going down.
Provide logs (if relevant):

  1. Error injecting constructor, ElasticsearchException[java.io.IOException: failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: IOException[failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))];
    at org.elasticsearch.gateway.GatewayMetaState.<init>(Unknown Source)
    while locating org.elasticsearch.gateway.GatewayMetaState
    for parameter 4 at org.elasticsearch.gateway.GatewayService.<init>(Unknown Source)
    while locating org.elasticsearch.gateway.GatewayService
    Caused by: ElasticsearchException[java.io.IOException: failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: IOException[failed to read [id:0, legacy:false, file:/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st]]; nested: CorruptStateException[org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))]; nested: CorruptIndexException[codec footer mismatch (file truncated?): actual footer=758728244 vs expected footer=-1071082520 (resource=BufferedChecksumIndexInput(SimpleFSIndexInput(path="/data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st")))];
    at org.elasticsearch.ExceptionsHelper.maybeThrowRuntimeAndSuppress(ExceptionsHelper.java:190)
    at org.elasticsearch.gateway.MetaDataStateFormat.loadLatestState(MetaDataStateFormat.java:334)
    at org.elasticsearch.common.util.IndexFolderUpgrader.upgrade(IndexFolderUpgrader.java:90)
    at org.elasticsearch.common.util.IndexFolderUpgrader.upgradeIndicesIfNeeded(IndexFolderUpgrader.java:128)
    at org.elasticsearch.gateway.GatewayMetaState.<init>(GatewayMetaState.java:91)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:49)
    at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
    at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
    at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
    at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50)
    at org.elasticsearch.common.inject.SingleParameterInjector.inject(SingleParameterInjector.java:42)
    at org.elasticsearch.common.inject.SingleParameterInjector.getAll(SingleParameterInjector.java:66)
    at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:85)
    at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:116)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:47)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:825)
    at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:43)
    at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:59)
    at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:50)
    at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:191)
    at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:183)
    at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:818)
    at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:183)
    at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:173)
    at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:161)
    at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:96)
    at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:96)
    at org.elasticsearch.common.inject.Guice.createInjector(Guice.java:70)
    at org.elasticsearch.common.inject.ModulesBuilder.createInjector(ModulesBuilder.java:42)
    at org.elasticsearch.node.Node.<init>(Node.java:499)
    at org.elasticsearch.node.Node.<init>(Node.java:245)
    at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233)
    at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233)
    at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342)
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132)
    at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123)
    at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70)
@kkewwei kkewwei changed the title es server always restart because of reading file incorrectly es server always restart because of reading metadata file incorrectly Jan 10, 2019
@andrershov andrershov added the :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. label Jan 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrershov andrershov self-assigned this Jan 10, 2019
@andrershov
Contributor

@kkewwei On startup, an Elasticsearch node reads its metadata from disk. If a metadata file is corrupted, node startup fails. In this case, it seems that /data1/es_data/lc_gh_3/nodes/0/indices/rDD73n7GR7uw_vE0Fk1FrA/_state/state-0.st is corrupted; this is the metadata file of the index with UUID rDD73n7GR7uw_vE0Fk1FrA. File corruption can easily be caused by the hardware malfunction that you mention in your comment.
If you were running with number of replicas > 0 (meaning the index data also resided on other nodes), you can safely remove the data folder from this node and let it join the cluster as a fresh node. Elasticsearch will redistribute shard ownership across the nodes automatically.
If this was the only node holding the index, there is not much you can do. We don't have a tool to perform metadata recovery.
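For readers who want to see what the "codec footer mismatch" check in the log amounts to, here is a minimal Python sketch of a Lucene-style footer check. It assumes the usual Lucene footer layout (a 4-byte magic, a 4-byte algorithm id of 0, then an 8-byte CRC32 of everything before the checksum, all big-endian). The function names `check_footer`, `append_footer` and `to_signed32` are made up for this illustration; this is not the code Elasticsearch runs and not a supported tool.

```python
import struct
import zlib

CODEC_MAGIC = 0x3FD76C17
FOOTER_MAGIC = ~CODEC_MAGIC & 0xFFFFFFFF  # bitwise complement of the header magic
FOOTER_LENGTH = 16  # int magic + int algorithm id + long checksum


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed Java-style int."""
    return struct.unpack(">i", struct.pack(">I", value))[0]


def append_footer(body):
    """Illustrative helper: append a well-formed footer to `body`."""
    prefix = body + struct.pack(">II", FOOTER_MAGIC, 0)
    return prefix + struct.pack(">Q", zlib.crc32(prefix) & 0xFFFFFFFF)


def check_footer(data):
    """Return a list of problems found in the footer of `data` (empty if none)."""
    if len(data) < FOOTER_LENGTH:
        return ["file is shorter than a %d-byte footer" % FOOTER_LENGTH]
    magic, algorithm_id, stored_crc = struct.unpack(">IIQ", data[-FOOTER_LENGTH:])
    if magic != FOOTER_MAGIC:
        # Same wording as the exception in the log; the rest of the footer
        # is meaningless if the magic is wrong, so stop here.
        return ["footer mismatch: actual footer=%d vs expected footer=%d"
                % (to_signed32(magic), to_signed32(FOOTER_MAGIC))]
    problems = []
    if algorithm_id != 0:
        problems.append("unknown checksum algorithm id %d" % algorithm_id)
    # The checksum covers every byte except the 8-byte checksum itself.
    actual_crc = zlib.crc32(data[:-8]) & 0xFFFFFFFF
    if stored_crc != actual_crc:
        problems.append("checksum mismatch: stored=%x actual=%x"
                        % (stored_crc, actual_crc))
    return problems
```

Reading a file with `pathlib.Path(path).read_bytes()` and passing the bytes to `check_footer` is enough to try it on a copy of a state file. As a sanity check against the log above, `to_signed32(FOOTER_MAGIC)` is -1071082520, the "expected footer" in the exception. The "actual footer" 758728244 is the bytes 2D 39 46 34, which is the ASCII text `-9F4`, so the position where the footer should start holds ordinary content instead. That is consistent with the "file truncated?" hint, though it does not prove how the file was damaged.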

@kkewwei
Contributor Author

kkewwei commented Jan 11, 2019

This never happened with ES 2 under any circumstances, and it does indeed arise with number of replicas = 0. Only a few files are corrupted and most files are fine, so wiping the data folder may not be a good idea given how important the data is. As a workaround, I deleted the corrupted file and the node works well. Can we improve the process by skipping corrupted files on node startup? That way we could recover most of the data.
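To find out which index state files would be affected before deleting anything, a read-only scan along these lines may help. It is only a sketch: it assumes the `nodes/*/indices/*/_state/*.st` layout seen in the log path above, it checks only the 4-byte footer magic rather than the full checksum, and it ignores global and shard-level state. The names `has_footer_magic` and `find_suspect_state_files` are hypothetical.

```python
import struct
from pathlib import Path

FOOTER_MAGIC = ~0x3FD76C17 & 0xFFFFFFFF  # Lucene footer magic, unsigned
FOOTER_LENGTH = 16  # int magic + int algorithm id + long checksum


def has_footer_magic(path):
    """Cheap check: does the file end with a footer that starts with the magic?"""
    size = path.stat().st_size
    if size < FOOTER_LENGTH:
        return False
    with path.open("rb") as f:
        f.seek(size - FOOTER_LENGTH)
        (magic,) = struct.unpack(">I", f.read(4))
    return magic == FOOTER_MAGIC


def find_suspect_state_files(data_path):
    """List index state files under `data_path` that fail the cheap check.

    Nothing is modified or deleted; the result is only a report.
    """
    root = Path(data_path)
    candidates = root.glob("nodes/*/indices/*/_state/*.st")
    return sorted(p for p in candidates if not has_footer_magic(p))
```

For the node in this issue the call would be `find_suspect_state_files("/data1/es_data/lc_gh_3")`. A file that passes this check can still have a bad checksum, so an empty result does not guarantee that startup will succeed.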

@ywelsch ywelsch added >enhancement :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. and removed :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. labels Mar 11, 2019
@ywelsch
Contributor

ywelsch commented Mar 11, 2019

@andrershov as part of #32006, do you think it would be useful to have a command-line tool that would allow a node to recover all index metadata except corrupted ones? In particular, rewrite the cluster state manifest file to remove a "corrupted" index?

@andrershov
Contributor

@ywelsch I think it's a bigger discussion: not only index metadata, but also the global metadata and the manifest itself could be affected. Do we want to recover from this kind of situation?

@ywelsch
Contributor

ywelsch commented Mar 13, 2019

I think we can treat this in a similar way to shard corruptions, for which we currently have a command-line tool. We could do a best-effort recovery of the metadata, with plenty of warnings. I don't think we should do this automatically at startup, however, but require an explicit administrative step. I also think that there are two levels of severity here, depending on whether the node is master-eligible or not. For non-master-eligible nodes, temporarily removing the metadata might be relatively harmless, because the metadata will be recreated on the node when it joins the cluster. For master-eligible nodes it is trickier, because the revised metadata might be published to other nodes, overriding the intact metadata that they might have.

@andrershov andrershov removed their assignment May 1, 2019
@rjernst rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
@ywelsch
Contributor

ywelsch commented Jul 23, 2020

The storage format for metadata has changed in recent ES versions (7.6+) and is now Lucene-based. Given how few cases of metadata corruption we have seen (all due to faulty hardware), I don't see the need to build automated tooling to support this; instead, it should be treated as a full node failure.

@ywelsch ywelsch closed this as completed Jul 23, 2020