
Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart #7430

Closed
philnate opened this issue Aug 25, 2014 · 13 comments

@philnate

We recently tried to upgrade an ES cluster from 0.2 to 1.3. The actual upgrade worked out fine, but once we restarted the whole cluster, we saw these warnings for all shards (constantly repeating):

IndexShardGatewayRecoveryException[[maki-log-2014-08-21][3] shard allocated for local recovery (post api), should exist, but doesn't, current files: [write.lock, _checksums-1408609875350, _10f.nrm, segments.gen, _17x.nrm, _10f.tis, _17z.si, _17y.fdt, _17y.tis, _17x.fdx, _10f.frq, _17x.fdt, _17y.tii, _17x.prx, _17y.nrm, _10f.fdx, _10f.fnm, _17x.tis, _10f.tii, _10f.fdt, _17z.cfe, _17x.frq, _17x.tii, segments_j, _10f.prx, _17y.fnm, _17y.fdx, _17y.prx, _17y.frq, _17z.cfs, _17x.fnm]]; nested: FileNotFoundException[No such file [_10f.si]]; ]]
[2014-08-21 13:44:15,826][WARN ][cluster.action.shard     ] [Ghost Dancer] [maki-log-2014-08-21][3] received shard failed for [maki-log-2014-08-21][3], node[QsfMdS40Qve8PukS4er9oA], [P], s[INITIALIZING], indexUUID [_na_], reason [master [Ghost Dancer][QsfMdS40Qve8PukS4er9oA][gboanea-ThinkPad-W520][inet[/10.200.54.63:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2014-08-21 13:44:15,841][WARN ][cluster.action.shard     ] [Ghost Dancer] [maki-log-2014-08-21][4] received shard failed for [maki-log-2014-08-21][4], node[QsfMdS40Qve8PukS4er9oA], [P], s[INITIALIZING], indexUUID [_na_], reason [master [Ghost Dancer][QsfMdS40Qve8PukS4er9oA][gboanea-ThinkPad-W520][inet[/10.200.54.63:9300]] marked shard as initializing, but shard is marked as failed, resend shard failure]
[2014-08-21 13:44:15,844][WARN ][indices.cluster          ] [Ghost Dancer] [maki-log-2014-08-21][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [maki-log-2014-08-21][1] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:152)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [maki-log-2014-08-21][1] shard allocated for local recovery (post api), should exist, but doesn't, current files: [_17o.nrm, _zv.fdx, write.lock, _zv.fdt, segments.gen, _17o.fdx, _17o.tis, _17o.fdt, _zv.fnm, _17p.cfe, _zv.prx, _zv.frq, _17n.prx, _17n.frq, _17o.tii, _17n.nrm, _17n.tii, _17o.frq, _checksums-1408609865048, _17p.cfs, segments_j, _17o.fnm, _17o.prx, _17n.tis, _17n.fdx, _17n.fdt, _zv.tis, _zv.nrm, _17n.fnm, _zv.tii, _17p.si]
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:131)
    ... 4 more
Caused by: java.io.FileNotFoundException: No such file [_zv.si]
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:173)
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
    at org.elasticsearch.index.store.DistributorDirectory.openInput(DistributorDirectory.java:130)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:375)
    at org.apache.lucene.codecs.lucene3x.Lucene3xSegmentInfoReader.read(Lucene3xSegmentInfoReader.java:103)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:361)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:457)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:907)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:753)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:453)
    at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:98)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:122)
    ... 4 more..

When we shut down the cluster a couple of minutes after bringing it up with the new version, we saw this behavior only for the newest index. After about an hour, the behavior was the same for the other indexes after a cluster restart.

We found out that the indexes are upgraded, and on shutdown nearly all segment info (*.si) files are deleted (those which have a corresponding _upgraded.si marker). The surviving .si files seemed to not have been upgraded (at least they don't have those marker files), and their content looks like one of the two dumps below:

?�l��Lucene3xSegmentInfo3.6.21�
os.version�2.6.39-300.17.2.el6uek.x86_64�osLinuxlucene.version+3.6.2 1423725 - rmuir - 2012-12-18 19:45:40�sourceflushos.archamd64
                                                                                                                                  java.versio1.7.0_51
         java.vendor�Oracle Corporation
_175.fdt�_175_upgraded.s_175.fdx_175.s_175.fn_175.ti_175.ti_175.nr_175.fr_175.prx%
1408712122907SegmentInfo��4.9�� timestamp
os.version�3.2.0-67-generic�osLinuxlucene.version+4.9.0 1604085 - rmuir - 2014-06-20 06:22:23�sourceflushos.archamd64
                                                                                                                     java.versio1.7.0_65
                                                                                                                                        java.vendor�Oracle Corporation_18n.cf_18n.cfs_18n.si�(����\%

The upgraded ones, on the other hand, contain this kind of information:

?�l��Lucene3xSegmentInfo3.6.2�2�  
                                        mergeFactor�10
os.version�2.6.39-300.17.2.el6uek.x86_64�osLinuxlucene.version+3.6.2 1423725 - rmuir - 2012-12-18 19:45:40�sourcemergeos.archamd64�mergeMaxNumSegments�-1
             java.versio1.7.0_51
                                java.vendor�Oracle Corporation
_1mx.ti_1mx.fr_1mx.pr_1mx.fd_1mx.nr_1mx.fdt_1mx.si�_1mx_upgraded.s_1mx.fn_1mx.tis%  

We could force the same behavior by triggering an optimize for a given index. By restarting one node at a time and waiting until it was fully integrated into the cluster, we were able to restore the deleted .si files from the other nodes (including the _upgraded.si marker files). Afterwards the .si files were safe and didn't get deleted.
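
(For anyone trying to reproduce this: the optimize was triggered presumably via the standard ES 1.x optimize API; a sketch, with the index name from the logs above as an example:)

curl -XPOST 'localhost:9200/maki-log-2014-08-21/_optimize'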

To me it looks like either ES or Lucene remembers to delete the _upgraded.si marker files on VM shutdown, but accidentally deletes the actual .si files as well.
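
A rough way to spot the damage on disk, under the assumption that after the 3.x-to-4.x upgrade every live segment should have its own .si file (the data path below is only an example and depends on your cluster name, index and shard):

# Example path; adjust cluster name, index and shard to your installation.
cd /var/lib/elasticsearch/elasticsearch/nodes/0/indices/maki-log-2014-08-21/3/index
# Print every segment prefix that still has files but is missing its .si
for seg in $(ls | sed -n 's/^\(_[0-9a-z]*\)[._].*/\1/p' | sort -u); do
  [ -f "${seg}.si" ] || echo "segment ${seg} has files but no ${seg}.si"
done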

@s1monw
Contributor

s1monw commented Aug 25, 2014

@philnate lemme ask a few questions:

  • did you modify any files on your filesystem?
  • do you have multiple datapaths configured?
  • did you see any other errors in the logs?
  • can you share your node-config?
  • did you change anything in the upgrade process?

@s1monw s1monw self-assigned this Aug 25, 2014
@philnate
Author

  • no
  • no
  • no, right during startup this warning is logged over and over
  • I've reproduced the issue with vanilla ES
  • as we were coming from 0.2, we did a full shutdown of the cluster, updated the binaries, and restarted the cluster; then we stopped it again, and after the restart we saw the failure

The script to reproduce the issue (just the steps; the script is slowly synchronous, but doesn't wait for ES to start...): https://gist.github.com/philnate/cfee1d171022b9eb3b23
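
For readers who don't want to open the gist, a rough sketch of those steps against a single local node (default ports assumed; the gist itself is the authoritative version):

# 1. start ES 0.20.x with the default config and index a few documents
curl -XPOST 'localhost:9200/test/doc' -d '{"field":"value"}'
# 2. stop the node (the nodes shutdown API exists in 0.20.x and 1.x)
curl -XPOST 'localhost:9200/_cluster/nodes/_local/_shutdown'
# 3. swap the binaries for 1.3.x on the same data directory, start it, wait for recovery
curl -XGET 'localhost:9200/_cluster/health?wait_for_status=green'
# 4. stop and start once more -- the shards now fail with FileNotFoundException on .si files
curl -XPOST 'localhost:9200/_cluster/nodes/_local/_shutdown'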

Hope this helps

@philnate
Author

Updating from 0.20 to 1.0.3 works. But trying to migrate to any version newer than 1.0.x gives me the described error. So this issue seems to have been introduced with 1.1.0, whether by the newer Lucene version used or by ES itself.

@mikemccand
Contributor

This is a Lucene issue; I managed to make a standalone test exposing it: https://issues.apache.org/jira/browse/LUCENE-5907

@mikemccand
Contributor

The Lucene issue is fixed, and will be included in Lucene 4.10.

@philnate
Author

@mikemccand thank you for your fast solution. I've patched ES 1.3.2 with the fix and a quick test seemed to be ok. Will do some additional testing.

@mikemccand
Contributor

Thanks for testing @philnate

@s1monw
Contributor

s1monw commented Aug 27, 2014

@philnate thanks so much for verifying!! that is super helpful

@philnate
Author

Seems to be all fine. A colleague did some extended testing and everything was good. @s1monw do you plan to include Lucene 4.10 in some upcoming release of ES 1.3? That would allow us to upgrade directly to the latest and greatest ES, instead of migrating first to 1.0 and then to 1.3.

@clintongormley

@philnate Lucene 4.10 will be in v1.4, but not in v1.3.x

Fixed by #7584

@clintongormley clintongormley changed the title Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart Internal: Indexes unusable after upgrade from 0.2 to 1.3 and cluster restart Sep 11, 2014
@mikemccand
Contributor

Lucene 4.9.1 is released, and I upgraded 1.3.x to it: c998d8d

@cywjackson

It seems we faced this issue when upgrading from 1.1.2 to 1.3.6 (6-node cluster, rolling upgrade). Here is a count from one of the logs:

      1 [_gyqx_2k2.del]
      1 [_hcap_2c4.del]
   7951 [_6mef.si]
   9252 [_24gwg.si]
  11295 [_7yti.si]

so most of the errors come from those .si files.
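
(In case it helps anyone else triage their logs: a count like the one above can be produced with something along these lines; the log path is only a placeholder.)

grep -oE '\[_[0-9a-z_]+\.(si|del)\]' /var/log/elasticsearch/elasticsearch.log | sort | uniq -c | sort -n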

We also seem to be hitting issues like CorruptIndexException[checksum failed], "failed to fetch index version after copying it over", and [2015-01-07 11:03:59,051 +0000][DEBUG][index.shard.service ] [usw2a-search5-prod] [abid_v3][106] state: [RECOVERING]->[CLOSED], reason [recovery failure [IndexShardGatewayRecoveryException[[abid_v3][106] failed to fetch index version after copying it over]; nested: CorruptIndexException[[abid_v3][106] Corrupted index [corrupted_ySuuD05fT8ae37Z0ABqoMA] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=9j4trc actual=1mqj6pc resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2d4333ac)]]; ]]

There are corrupted_* files in many places, which feels very similar to #8805, but I am not sure which exact files to delete for the workaround described there ("We deleted the checksum and the marker file and ES loaded the shards again automatically"). It is very possible, as @s1monw said there, that "they might have been corrupted for a long time already but the old ES version didn't check this": we ran ES < 0.90.3 and have been upgrading to 1.0.3 and 1.1.2 (and ran into various issues along the way).

We have > 3000 shards total, so it's hard to say "which phase" or "primary/replica"... I am very curious to know exactly what the steps were behind "We solved the issue on our side by deleting the checksums as well as the corrupted file and updating the indices".

Currently I am attempting to recover from this by setting index.shard.check_on_startup: fix, but I'm not sure it is the right fix, especially given the big warning about data loss in the docs :(
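
For what it's worth, a sketch of one way to apply that setting to a single index in 1.x, assuming the usual close/update-settings/open dance works for this static setting in your version (the index name is just our example; the data-loss warning from the docs still applies):

# close the index, set the static setting, then reopen so the shards restart with it
curl -XPOST 'localhost:9200/abid_v3/_close'
curl -XPUT 'localhost:9200/abid_v3/_settings' -d '{"index.shard.check_on_startup": "fix"}'
curl -XPOST 'localhost:9200/abid_v3/_open'

Setting it in elasticsearch.yml instead applies it to every index on the node.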

@krispyjala

So what do we do when this has already happened? Is this error recoverable? Is there a way for us to recreate the missing .si file? Currently it is happening to two of my indices, but only for one shard (out of 5) in each. Even if we could somehow retain the data on the other 4 shards, that would be preferred.
