CorruptIndexException after upgrade from 0.20.6 to 1.4.2 #9140

Closed
gboanea opened this issue Jan 5, 2015 · 9 comments
Labels
>bug :Core/Infra/Core Core issues without another label

gboanea commented Jan 5, 2015

After upgrading from ES 0.20.6 to 1.4.2, the cluster remains in a RED state and CorruptIndexExceptions are generated:

[2015-01-05 12:57:20,933][WARN ][cluster.action.shard     ] [Bob Diamond] [index1][2] sending failed shard for [index1][2], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index1][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[index1][2] Preexisting corrupted index [corrupted_jbVtki8fRwulkaA6A3HZ4Q] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@3f592170)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2a0df6a)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=ze9sfb actual=ne6jhj resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7992625b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1om7x0v actual=1urrlkf resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1633862f)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I reproduced the error with a vanilla ES; these are the steps:

  1. Start a cluster of ES 0.20.6 nodes

  2. Index data (one index ~100MB)

  3. Cluster health

curl -XGET http://localhost:9200/_cluster/health\?pretty\=true
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
  4. Disable shard allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent":{"cluster.routing.allocation.disable_allocation”:true}}’

curl -XGET localhost:9200/_cluster/settings

{"persistent":{"cluster.routing.allocation.disable_allocation":"true"},"transient":{}}
  5. Shutdown cluster
curl -XPOST http://localhost:9200/_shutdown

{"cluster_name":"elasticsearch","nodes":{"hxt59kUgTBCN8NAB6K2KwQ":{"name":"Whitman, Debra"},"h7JQP60BShWadDNFTB94CA":{"name":"Corbo, Adrian"},"T_Qzz66MQ9uTJeEwe7Q4Ig":{"name":"Surtur"},"l8Xex8xuRnSIM1cMJgvhvA":{"name":"Shriker"}}}
  6. Upgrade the nodes to ES 1.4.2

  7. Start the cluster of ES 1.4.2 nodes

  8. Cluster health

curl -XGET http://localhost:9200/_cluster/health\?pretty\=true
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5
}
  9. Enable shard allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent":{"cluster.routing.allocation.disable_allocation":false}}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"disable_allocation":"false"}}}},"transient":{}}
  10. Cluster health
curl -XGET http://localhost:9200/_cluster/health\?pretty\=true
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 5,
  "unassigned_shards" : 5
}

Some logs from the log files:

[2015-01-05 12:42:12,579][INFO ][node                     ] [Roma] version[1.4.2], pid[65313], build[927caff/2014-12-16T14:11:12Z]
[2015-01-05 12:42:12,580][INFO ][node                     ] [Roma] initializing ...
[2015-01-05 12:42:12,584][INFO ][plugins                  ] [Roma] loaded [], sites []
[2015-01-05 12:42:15,173][INFO ][node                     ] [Roma] initialized
[2015-01-05 12:42:15,173][INFO ][node                     ] [Roma] starting ...
[2015-01-05 12:42:15,241][INFO ][transport                ] [Roma] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
[2015-01-05 12:42:15,260][INFO ][discovery                ] [Roma] elasticsearch/BvPgLtdCS8eIEFUATx2lVw
[2015-01-05 12:42:19,035][INFO ][cluster.service          ] [Roma] new_master [Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]], reason: zen-disco-join (elected_as_master)
[2015-01-05 12:42:19,051][INFO ][http                     ] [Roma] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[/127.0.0.1:9200]}
[2015-01-05 12:42:19,052][INFO ][node                     ] [Roma] started
[2015-01-05 12:42:19,505][INFO ][cluster.routing.allocation.decider] [Roma] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
[2015-01-05 12:42:19,511][INFO ][gateway                  ] [Roma] recovered [1] indices into cluster_state
[2015-01-05 12:42:25,403][INFO ][cluster.service          ] [Roma] added {[Bob Diamond][gkXgdCNvSACjIFiPAVTxXA][dw1949demum.int.demandware.com][inet[/127.0.0.1:9301]],}, reason: zen-disco-receive(join from node[[Bob Diamond][gkXgdCNvSACjIFiPAVTxXA][dw1949demum.int.demandware.com][inet[/127.0.0.1:9301]]])
[2015-01-05 12:42:32,863][INFO ][cluster.service          ] [Roma] added {[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]],}, reason: zen-disco-receive(join from node[[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]]])
[2015-01-05 12:42:40,505][INFO ][cluster.service          ] [Roma] added {[Druid][ftzmRGfARAuhjPfw9CVE5Q][dw1949demum.int.demandware.com][inet[/127.0.0.1:9303]],}, reason: zen-disco-receive(join from node[[Druid][ftzmRGfARAuhjPfw9CVE5Q][dw1949demum.int.demandware.com][inet[/127.0.0.1:9303]]])
[2015-01-05 12:56:58,560][INFO ][cluster.routing.allocation.decider] [Roma] updating [cluster.routing.allocation.disable_allocation] from [true] to [false]
[2015-01-05 12:56:59,184][WARN ][cluster.action.shard     ] [Roma] [index1][2] received shard failed for [index1][2], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[STARTED], indexUUID [_na_], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][2] Failed to transfer [54] files with total size of [4.3mb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)]; ]]
[2015-01-05 12:56:59,740][WARN ][cluster.action.shard     ] [Roma] [index1][1] received shard failed for [index1][1], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[STARTED], indexUUID [_na_], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][1] Failed to transfer [94] files with total size of [4.3mb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@47778e4f)]; ]]
[2015-01-05 12:57:09,641][WARN ][cluster.action.shard     ] [Roma] [index1][3] received shard failed for [index1][3], node[LTdV5wDIQtuJUPTl9NuuJw], [P], s[STARTED], indexUUID [_na_], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][3] Failed to transfer [114] files with total size of [4.3mb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7e6256a6)]; ]]
[2015-01-05 12:57:09,663][WARN ][cluster.action.shard     ] [Roma] [index1][2] received shard failed for [index1][2], node[LTdV5wDIQtuJUPTl9NuuJw], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index1][2] failed recovery]; nested: EngineCreationFailureException[[index1][2] failed to open reader on writer]; nested: FileNotFoundException[No such file [_in.tis]]; ]]
[2015-01-05 12:57:09,695][WARN ][cluster.action.shard     ] [Roma] [index1][2] received shard failed for [index1][2], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index1][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[index1][2] Preexisting corrupted index [corrupted_jbVtki8fRwulkaA6A3HZ4Q] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@3f592170)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2a0df6a)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=ze9sfb actual=ne6jhj resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7992625b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745
...
[2015-01-05 12:42:19,707][INFO ][node                     ] [Bob Diamond] version[1.4.2], pid[65324], build[927caff/2014-12-16T14:11:12Z]
[2015-01-05 12:42:19,707][INFO ][node                     ] [Bob Diamond] initializing ...
[2015-01-05 12:42:19,712][INFO ][plugins                  ] [Bob Diamond] loaded [], sites []
[2015-01-05 12:42:22,303][INFO ][node                     ] [Bob Diamond] initialized
[2015-01-05 12:42:22,304][INFO ][node                     ] [Bob Diamond] starting ...
[2015-01-05 12:42:22,365][INFO ][transport                ] [Bob Diamond] bound_address {inet[/127.0.0.1:9301]}, publish_address {inet[/127.0.0.1:9301]}
[2015-01-05 12:42:22,378][INFO ][discovery                ] [Bob Diamond] elasticsearch/gkXgdCNvSACjIFiPAVTxXA
[2015-01-05 12:42:25,417][INFO ][cluster.service          ] [Bob Diamond] detected_master [Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]], added {[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:42:25,423][INFO ][cluster.routing.allocation.decider] [Bob Diamond] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
[2015-01-05 12:42:25,436][INFO ][http                     ] [Bob Diamond] bound_address {inet[/127.0.0.1:9201]}, publish_address {inet[/127.0.0.1:9201]}
[2015-01-05 12:42:25,436][INFO ][node                     ] [Bob Diamond] started
[2015-01-05 12:42:32,864][INFO ][cluster.service          ] [Bob Diamond] added {[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:42:40,507][INFO ][cluster.service          ] [Bob Diamond] added {[Druid][ftzmRGfARAuhjPfw9CVE5Q][dw1949demum.int.demandware.com][inet[/127.0.0.1:9303]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:56:58,557][INFO ][cluster.routing.allocation.decider] [Bob Diamond] updating [cluster.routing.allocation.disable_allocation] from [true] to [false]
[2015-01-05 12:56:58,801][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_im.tis], length [530], checksum [3uya22], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,801][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_im.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,807][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_io.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,814][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_io.tis], length [537], checksum [ze9sfb], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,841][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_in.tii], length [1124], checksum [1om7x0v], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,906][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_in.tis], length [75185], checksum [elttvz], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,075][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_iq.tis], length [438], checksum [1wqfspo], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,081][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_iq.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,089][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_ip.tis], length [522], checksum [gn93en], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,093][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_ip.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,099][WARN ][index.engine.internal    ] [Bob Diamond] [index1][2] failed engine [corrupt file detected source: [recovery phase 1]]
org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [index1][2] Failed to transfer [54] files with total size of [4.3mb]
    at org.elasticsearch.indices.recovery.RecoverySource$1.phase1(RecoverySource.java:276)
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1116)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:654)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:137)
    at org.elasticsearch.indices.recovery.RecoverySource.access$2600(RecoverySource.java:74)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:464)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:450)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    ... 4 more
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@3f592170)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2a0df6a)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=ze9sfb actual=ne6jhj resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7992625b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
...
[2015-01-05 12:42:34,825][INFO ][node                     ] [Druid] version[1.4.2], pid[65351], build[927caff/2014-12-16T14:11:12Z]
[2015-01-05 12:42:34,826][INFO ][node                     ] [Druid] initializing ...
[2015-01-05 12:42:34,830][INFO ][plugins                  ] [Druid] loaded [], sites []
[2015-01-05 12:42:37,395][INFO ][node                     ] [Druid] initialized
[2015-01-05 12:42:37,395][INFO ][node                     ] [Druid] starting ...
[2015-01-05 12:42:37,466][INFO ][transport                ] [Druid] bound_address {inet[/127.0.0.1:9303]}, publish_address {inet[/127.0.0.1:9303]}
[2015-01-05 12:42:37,480][INFO ][discovery                ] [Druid] elasticsearch/ftzmRGfARAuhjPfw9CVE5Q
[2015-01-05 12:42:40,516][INFO ][cluster.service          ] [Druid] detected_master [Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]], added {[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]],[Bob Diamond][gkXgdCNvSACjIFiPAVTxXA][dw1949demum.int.demandware.com][inet[/127.0.0.1:9301]],[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:42:40,526][INFO ][cluster.routing.allocation.decider] [Druid] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
[2015-01-05 12:42:40,540][INFO ][http                     ] [Druid] bound_address {inet[/127.0.0.1:9203]}, publish_address {inet[/127.0.0.1:9203]}
[2015-01-05 12:42:40,540][INFO ][node                     ] [Druid] started
[2015-01-05 12:56:58,557][INFO ][cluster.routing.allocation.decider] [Druid] updating [cluster.routing.allocation.disable_allocation] from [true] to [false]
[2015-01-05 12:57:09,746][WARN ][indices.cluster          ] [Druid] [index1][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index1][1] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [index1][1] failed to open reader on writer
    at org.elasticsearch.index.engine.internal.InternalEngine.start(InternalEngine.java:321)
    at org.elasticsearch.index.shard.service.InternalIndexShard.postRecovery(InternalIndexShard.java:710)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:223)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    ... 3 more
Caused by: java.io.FileNotFoundException: No such file [_id.tis]
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:176)
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
    at org.elasticsearch.index.store.DistributorDirectory.openInput(DistributorDirectory.java:130)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:487)
    at org.apache.lucene.codecs.lucene3x.TermInfosReader.<init>(TermInfosReader.java:115)
    at org.apache.lucene.codecs.lucene3x.Lucene3xFields.newTermInfosReader(Lucene3xFields.java:142)
    at org.apache.lucene.codecs.lucene3x.Lucene3xFields.<init>(Lucene3xFields.java:88)
    at org.apache.lucene.codecs.lucene3x.Lucene3xPostingsFormat.fieldsProducer(Lucene3xPostingsFormat.java:62)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:144)
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:238)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:104)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:422)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
    at org.apache.lucene.search.SearcherManager.<init>(SearcherManager.java:89)
    at org.elasticsearch.index.engine.internal.InternalEngine.buildSearchManager(InternalEngine.java:1527)
    at org.elasticsearch.index.engine.internal.InternalEngine.start(InternalEngine.java:309)
    ... 6 more
...

rmuir commented Jan 5, 2015

I don't think 0.20.x computed a proper Adler checksum for files like .tis/.tii that aren't "append-only". This is because those formats seeked backwards and rewrote earlier bytes. I think we should just verify only the length for those extensions? The other case is segments_N, but it never wrote a checksum for that, so it's OK.
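
For illustration only, a minimal sketch of the idea above (made-up names, not the actual Elasticsearch fix): decide per extension whether the stored legacy Adler-32 can be trusted at all, and fall back to a length-only check for the non-append-only files.

import java.util.Arrays;
import java.util.List;

// Sketch only: the class name and extension list are illustrative.
public class LegacyVerificationPolicy {

    // 3.x-era extensions whose writers seek backwards and rewrite earlier bytes,
    // so a checksum accumulated while streaming cannot match the final file.
    private static final List<String> NON_APPEND_ONLY = Arrays.asList("tis", "tii");

    // True if the stored legacy checksum is meaningful for this file;
    // false means only the file length should be verified.
    public static boolean canVerifyChecksum(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            // e.g. segments_N: no checksum was ever written for it anyway
            return false;
        }
        return !NON_APPEND_ONLY.contains(fileName.substring(dot + 1));
    }
}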

rmuir commented Jan 5, 2015

I will investigate with @gboanea's procedure to see if the metadata length is reliable as well. We don't want an off-by-8, but we need to at least verify the length of the file to detect e.g. a network disconnect or other problem transferring the file.
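
Again just a sketch of that fallback (hypothetical method, not the real recovery API): when the checksum cannot be trusted, at least compare the length recorded in the source node's metadata with the bytes that actually arrived, so a truncated transfer still fails verification.

import java.io.IOException;

// Sketch only: illustrates the length-only verification discussed above.
public final class LengthOnlyVerifier {

    public static void verifyLength(String fileName, long expectedLength, long actualLength)
            throws IOException {
        if (actualLength != expectedLength) {
            // e.g. a network disconnect left a partially transferred file behind
            throw new IOException("length mismatch for [" + fileName + "]: expected="
                    + expectedLength + " actual=" + actualLength);
        }
    }
}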

clintongormley commented

@rmuir Don't know if this helps, but I just had a go at replicating this. With a single node the upgrade was smooth, but as soon as I added a second node (and thus replicas) this was easy to reproduce.

rmuir commented Jan 5, 2015

@clintongormley was the problematic file a .tis or .tii?

It should always fail with the 3.x index format (0.20.x). That's because those files were never append-only, and so the Adler checksums were incorrect. But they were used essentially only as hash values at the time, so it was no problem.
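
As a toy illustration of why those stored values end up wrong (simplified, using plain java.util.zip.Adler32): the digest is accumulated in write order, but the .tis/.tii writers later seek back and patch earlier bytes, so the digest no longer matches an Adler-32 computed over the final file contents.

import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class StreamingAdlerMismatch {
    public static void main(String[] args) {
        // Digest accumulated in write order, the way the legacy checksums were built.
        Adler32 streaming = new Adler32();
        byte[] firstPass = "0000 term data ...".getBytes(StandardCharsets.UTF_8);
        streaming.update(firstPass, 0, firstPass.length);
        // The writer later seeks back and patches the header (e.g. a term count);
        // the patch bytes may also be fed to the digest, but out of position.
        streaming.update("0042".getBytes(StandardCharsets.UTF_8));

        // What 1.4.2 recovery re-checksums: the final bytes on disk.
        Adler32 onDisk = new Adler32();
        byte[] finalFile = "0042 term data ...".getBytes(StandardCharsets.UTF_8);
        onDisk.update(finalFile, 0, finalFile.length);

        // Different values => "checksum failed (hardware problem?)" on a healthy file.
        System.out.println("stored (streaming) = " + streaming.getValue());
        System.out.println("recomputed on disk = " + onDisk.getValue());
    }
}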

clintongormley commented

@rmuir yes, both .tis and .tii

rmuir commented Jan 5, 2015

OK, let me try to make a quick patch. The issue does not impact master, only 1.x, and only Lucene 3.x (and 4.0) indexes are impacted. After 4.1, all files are append-only.

rmuir commented Jan 5, 2015

I made an untested patch here: #9142

rmuir commented Jan 5, 2015

I tested manually and the patch works (on 0.20.x you just have to index enough to trigger a merge, or turn off cfs-on-flush, so you get .tis and .tii files). I opened #9143 to fix the bigger test issue.

clintongormley commented

I believe this has been solved, but either way 0.20.6 is too old to worry about anymore.
