
Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart #7430

Closed
philnate opened this issue Aug 25, 2014 · 13 comments

@philnate

We recently tried to upgrade an ES cluster from 0.2 to 1.3. The actual upgrade worked out fine, but once we restarted the whole cluster, we saw these warnings for all shards (constantly repeating):

IndexShardGatewayRecoveryException[[maki-log-2014-08-21][3] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, _checksums-1408609875350, _10f.nrm, segments.gen, _17x.nrm, _10f.tis, _17z.si, _17y.fdt, _17y.tis, _17x.fdx, _10f.frq, _17x.fdt, _17y.tii, _17x.prx, _17y.nrm, _10f.fdx, _10f.fnm, _17x.tis, _10f.tii, _10f.fdt, _17z.cfe, _17x.frq, _17x.tii, segments_j, _10f.prx, _17y.fnm, _17y.fdx, _17y.prx, _17y.frq, _17z.cfs, _17x.fnm]]; nested: FileNotFoundException[No such file [_10f.si]]; ]]
[2014-08-21 13:44:15,826][WARN ][cluster.action.shard     ] [Ghost Dancer] [maki-log-2014-08-21][3] received shard failed for [maki-log-2014-08-21][3], node[QsfMdS40Qve8PukS4er9oA], [P], s[INITIALIZING], indexUUID [_na_], reason [master [Ghost Dancer][QsfMdS40Qve8PukS4er9oA][gboanea-ThinkPad-W520][inet[/10.200.54.63:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2014-08-21 13:44:15,841][WARN ][cluster.action.shard     ] [Ghost Dancer] [maki-log-2014-08-21][4] received shard failed for [maki-log-2014-08-21][4], node[QsfMdS40Qve8PukS4er9oA], [P], s[INITIALIZING], indexUUID [_na_], reason [master [Ghost Dancer][QsfMdS40Qve8PukS4er9oA][gboanea-ThinkPad-W520][inet[/10.200.54.63:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2014-08-21 13:44:15,844][WARN ][indices.cluster          ] [Ghost Dancer] [maki-log-2014-08-21][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [maki-log-2014-08-21][1] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [maki-log-2014-08-21][1] shard allocated for local recovery (post api), should exist, but doesn't, current files: [_17o.nrm, _zv.fdx, write.lock, _zv.fdt, segments.gen, _17o.fdx, _17o.tis, _17o.fdt, _zv.fnm, _17p.cfe, _zv.prx, _zv.frq, _17n.prx, _17n.frq, _17o.tii, _17n.nrm, _17n.tii, _17o.frq, _checksums-1408609865048, _17p.cfs, segments_j, _17o.fnm, _17o.prx, _17n.tis, _17n.fdx, _17n.fdt, _zv.tis, _zv.nrm, _17n.fnm, _zv.tii, _17p.si]
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:131)
    ... 4 more
Caused by: java.io.FileNotFoundException: No such file [_zv.si]
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:173)
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
    at org.elasticsearch.index.store.DistributorDirectory.openInput(DistributorDirectory.java:130)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:375)
    at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:361)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:907)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:753)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:98)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:122)
    ... 4 more..

When we shut down the cluster a couple of minutes after bringing it up with the new version, we saw this behavior only for the newest index. After about an hour, the behavior was the same for the other indexes after a cluster restart.

We found out that the indexes are upgraded, and on shutdown nearly all segment info (*.si) files are deleted (those which have a corresponding _upgraded.si marker). The surviving .si files seemed to not have been upgraded (at least they don't have those marker files), and their content looks like one of the two dumps below:

?�l��Lucene3xSegmentInfo3.6.21�
os.version�2.6.39-300.17.2.el6uek.x86_64�osLinuxlucene.version+3.6.2 1423725 - rmuir - 2012-12-18 19:45:40�sourceflushos.archamd64
                                                                                                                                  java.versio1.7.0_51
         java.vendor�Oracle Corporation
_175.fdt�_175_upgraded.s_175.fdx_175.s_175.fn_175.ti_175.ti_175.nr_175.fr_175.prx%
1408712122907SegmentInfo��4.9�� timestamp
os.version�3.2.0-67-generic�osLinuxlucene.version+4.9.0 1604085 - rmuir - 2014-06-20 06:22:23�sourceflushos.archamd64
                                                                                                                     java.versio1.7.0_65
                                                                                                                                        java.vendor�Oracle Corporation_18n.cf_18n.cfs_18n.si�(����\%

The upgraded ones, on the other hand, contain this kind of information:

?�l��Lucene3xSegmentInfo3.6.2�2�  
                                        mergeFactor�10
os.version�2.6.39-300.17.2.el6uek.x86_64�osLinuxlucene.version+3.6.2 1423725 - rmuir - 2012-12-18 19:45:40�sourcemergeos.archamd64�mergeMaxNumSegments�-1
             java.versio1.7.0_51
                                java.vendor�Oracle Corporation
_1mx.ti_1mx.fr_1mx.pr_1mx.fd_1mx.nr_1mx.fdt_1mx.si�_1mx_upgraded.s_1mx.fn_1mx.tis%  

We could force the same behavior by triggering an optimize for a given index. By restarting one node at a time and waiting until it was fully integrated into the cluster, we were able to restore the deleted .si files from the other nodes (including the _upgraded.si marker files). Afterwards the .si files were safe and didn't get deleted.
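
(For anyone trying to reproduce this: the optimize was triggered presumably via the standard ES 1.x optimize API; a sketch, with the index name from the logs above as an example:)

curl -XPOST 'localhost:9200/maki-log-2014-08-21/_optimize'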

To me it looks like either ES or Lucene remembers to delete the _upgraded.si marker files on VM shutdown, but accidentally deletes the actual .si files as well.
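
A rough way to spot the damage on disk, under the assumption that after the 3.x-to-4.x upgrade every live segment should have its own .si file (the data path below is only an example and depends on your cluster name, index and shard):

# Example path; adjust cluster name, index and shard to your installation.
cd /var/lib/elasticsearch/elasticsearch/nodes/0/indices/maki-log-2014-08-21/3/index
# Print every segment prefix that still has files but is missing its .si
for seg in $(ls | sed -n 's/^\(_[0-9a-z]*\)[._].*/\1/p' | sort -u); do
  [ -f "${seg}.si" ] || echo "segment ${seg} has files but no ${seg}.si"
done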

@s1monw
Contributor

s1monw commented Aug 25, 2014

@philnate lemme ask a few questions:

  • did you modify any files on your filesystem?
  • do you have multiple datapaths configured?
  • did you see any other errors in the logs?
  • can you share your node-config?
  • did you change anything in the upgrade process?

@s1monw s1monw self-assigned this Aug 25, 2014
@philnate
Author

  • no
  • no
  • no, right during startup this warning is logged over and over
  • I've reproduced the issue with vanilla ES
  • as we were coming from 0.2, we did a full shutdown of the cluster, updated the binaries, and restarted the cluster; then we stopped it again, and after the restart we saw the failure

The script to reproduce the issue (just the steps; the script is slowly synchronous, but doesn't wait for ES to start...): https://gist.github.com/philnate/cfee1d171022b9eb3b23
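
For readers who don't want to open the gist, a rough sketch of those steps against a single local node (default ports assumed; the gist itself is the authoritative version):

# 1. start ES 0.20.x with the default config and index a few documents
curl -XPOST 'localhost:9200/test/doc' -d '{"field":"value"}'
# 2. stop the node (the nodes shutdown API exists in 0.20.x and 1.x)
curl -XPOST 'localhost:9200/_cluster/nodes/_local/_shutdown'
# 3. swap the binaries for 1.3.x on the same data directory, start it, wait for recovery
curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green'
# 4. stop and start once more -- the shards now fail with FileNotFoundException on .si files
curl -XPOST 'localhost:9200/_cluster/nodes/_local/_shutdown'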

Hope this helps

@philnate
Author

Updating from 0.20 to 1.0.3 works. But trying to migrate to any version newer than 1.0.x gives me the described error. So this issue seems to have been introduced with 1.1.0, whether by the newer Lucene version used or by ES itself.

@mikemccand
Contributor

This is a Lucene issue; I managed to make a standalone test exposing it: https://issues.apache.org/jira/browse/LUCENE-5907

@mikemccand
Contributor

The Lucene issue is fixed, and will be included in Lucene 4.10.

@philnate
Author

@mikemccand thank you for your fast solution. I've patched ES 1.3.2 with the fix and a quick test seemed to be ok. Will do some additional testing.

@mikemccand
Contributor

Thanks for testing @philnate

@s1monw
Contributor

s1monw commented Aug 27, 2014

@philnate thanks so much for verifying!! that is super helpful

@philnate
Author

Seems to be all fine. A colleague did some extended testing and everything was good. @s1monw do you plan to include Lucene 4.10 in some upcoming release of ES 1.3? That would allow us to upgrade directly to the latest and greatest ES, instead of migrating first to 1.0 and then to 1.3.

@clintongormley

@philnate Lucene 4.10 will be in v1.4, but not in v1.3.x

Fixed by #7584

@clintongormley clintongormley changed the title Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart Sep 11, 2014
@mikemccand
Contributor

Lucene 4.9.1 is released, and I upgraded 1.3.x to it: c998d8d

@cywjackson

It seems we faced this issue when upgrading from 1.1.2 to 1.3.6 (6-node cluster, rolling upgrade). Here is a count from one of the logs:

      1 [_gyqx_2k2.del]
      1 [_hcap_2c4.del]
   7951 [_6mef.si]
   9252 [_24gwg.si]
  11295 [_7yti.si]

so most of the errors come from those .si files.
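
(In case it helps anyone else triage their logs: a count like the one above can be produced with something along these lines; the log path is only a placeholder.)

grep -oE '\[_[0-9a-z_]+\.(si|del)\]' /var/log/elasticsearch/elasticsearch.log | sort | uniq -c | sort -n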

We also seem to be hitting issues like CorruptIndexException[checksum failed], "failed to fetch index version after copying it over", and [2015-01-07 11:03:59,051 +0000][DEBUG][index.shard.service ] [usw2a-search5-prod] [abid_v3][106] state: [RECOVERING]->[CLOSED], reason [recovery failure [IndexShardGatewayRecoveryException[[abid_v3][106] failed to fetch index version after copying it over]; nested: CorruptIndexException[[abid_v3][106] Corrupted index [corrupted_ySuuD05fT8ae37Z0ABqoMA] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=9j4trc actual=1mqj6pc resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2d4333ac)]]; ]]

There are corrupted_* files in many places, which feels very similar to #8805, but I am not sure which exact files to delete for the workaround described there ("We deleted the checksum and the marker file and ES loaded the shards again automatically"). It is very possible, as @s1monw said there, that "they might have been corrupted for a long time already but the old ES version didn't check this": we ran ES < 0.90.3 and have been upgrading to 1.0.3 and 1.1.2 (and ran into various issues along the way).

We have > 3000 shards total, so it's hard to say "which phase" or "primary/replica"... I am very curious to know exactly what the steps were behind "We solved the issue on our side by deleting the checksums as well as the corrupted file and updating the indices".

Currently I am attempting to recover from this by setting index.shard.check_on_startup: fix, but I'm not sure it is the right fix, especially given the big warning about data loss in the docs :(
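
For what it's worth, a sketch of one way to apply that setting to a single index in 1.x, assuming the usual close/update-settings/open dance works for this static setting in your version (the index name is just our example; the data-loss warning from the docs still applies):

# close the index, set the static setting, then reopen so the shards restart with it
curl -XPOST 'localhost:9200/abid_v3/_close'
curl -XPUT 'localhost:9200/abid_v3/_settings' -d '{"index.shard.check_on_startup": "fix"}'
curl -XPOST 'localhost:9200/abid_v3/_open'

Setting it in elasticsearch.yml instead applies it to every index on the node.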

@krispyjala

So what do we do when this has already happened? Is this error recoverable? Is there a way for us to recreate the missing .si file? Currently it is happening to two of my indices, but only for one shard (out of 5) in each. Even if we could somehow retain the data on the other 4 shards, that would be preferred.
