CorruptIndexException after upgrade from 0.20.6 to 1.4.2 #9140

Closed
gboanea opened this issue Jan 5, 2015 · 9 comments
Labels
>bug :Core/Infra/Core Core issues without another label

gboanea commented Jan 5, 2015

After upgrading from ES 0.20.6 to 1.4.2, the cluster remains in a RED state and CorruptIndexExceptions are generated:

[2015-01-05 12:57:20,933][WARN ][cluster.action.shard     ] [Bob Diamond] [index1][2] sending failed shard for [index1][2], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index1][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[index1][2] Preexisting corrupted index [corrupted_jbVtki8fRwulkaA6A3HZ4Q] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@3f592170)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2a0df6a)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=ze9sfb actual=ne6jhj resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7992625b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1om7x0v actual=1urrlkf resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1633862f)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

I reproduced the error with a vanilla ES; these are the steps:

  1. Start a cluster of ES 0.20.6 nodes

  2. Index data (one index ~100MB)

  3. Cluster health

curl -XGET http://localhost:9200/_cluster/health\?pretty\=true
{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 5,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}
  4. Disable shard allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent":{"cluster.routing.allocation.disable_allocation”:true}}’

curl -XGET localhost:9200/_cluster/settings

{"persistent":{"cluster.routing.allocation.disable_allocation":"true"},"transient":{}}
  5. Shutdown cluster
curl -XPOST http://localhost:9200/_shutdown

{"cluster_name":"elasticsearch","nodes":{"hxt59kUgTBCN8NAB6K2KwQ":{"name":"Whitman, Debra"},"h7JQP60BShWadDNFTB94CA":{"name":"Corbo, Adrian"},"T_Qzz66MQ9uTJeEwe7Q4Ig":{"name":"Surtur"},"l8Xex8xuRnSIM1cMJgvhvA":{"name":"Shriker"}}}
  6. Upgrade the nodes to ES 1.4.2

  7. Start the cluster of ES 1.4.2 nodes

  8. Cluster health

curl -XGET http://localhost:9200/_cluster/health\?pretty\=true
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5
}
  9. Enable shard allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"persistent":{"cluster.routing.allocation.disable_allocation":false}}'
{"acknowledged":true,"persistent":{"cluster":{"routing":{"allocation":{"disable_allocation":"false"}}}},"transient":{}}
  10. Cluster health
curl -XGET http://localhost:9200/_cluster/health\?pretty\=true
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 5,
  "unassigned_shards" : 5
}

Some logs from the log files:

[2015-01-05 12:42:12,579][INFO ][node                     ] [Roma] version[1.4.2], pid[65313], build[927caff/2014-12-16T14:11:12Z]
[2015-01-05 12:42:12,580][INFO ][node                     ] [Roma] initializing ...
[2015-01-05 12:42:12,584][INFO ][plugins                  ] [Roma] loaded [], sites []
[2015-01-05 12:42:15,173][INFO ][node                     ] [Roma] initialized
[2015-01-05 12:42:15,173][INFO ][node                     ] [Roma] starting ...
[2015-01-05 12:42:15,241][INFO ][transport                ] [Roma] bound_address {inet[/127.0.0.1:9300]}, publish_address {inet[/127.0.0.1:9300]}
[2015-01-05 12:42:15,260][INFO ][discovery                ] [Roma] elasticsearch/BvPgLtdCS8eIEFUATx2lVw
[2015-01-05 12:42:19,035][INFO ][cluster.service          ] [Roma] new_master [Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]], reason: zen-disco-join (elected_as_master)
[2015-01-05 12:42:19,051][INFO ][http                     ] [Roma] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[/127.0.0.1:9200]}
[2015-01-05 12:42:19,052][INFO ][node                     ] [Roma] started
[2015-01-05 12:42:19,505][INFO ][cluster.routing.allocation.decider] [Roma] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
[2015-01-05 12:42:19,511][INFO ][gateway                  ] [Roma] recovered [1] indices into cluster_state
[2015-01-05 12:42:25,403][INFO ][cluster.service          ] [Roma] added {[Bob Diamond][gkXgdCNvSACjIFiPAVTxXA][dw1949demum.int.demandware.com][inet[/127.0.0.1:9301]],}, reason: zen-disco-receive(join from node[[Bob Diamond][gkXgdCNvSACjIFiPAVTxXA][dw1949demum.int.demandware.com][inet[/127.0.0.1:9301]]])
[2015-01-05 12:42:32,863][INFO ][cluster.service          ] [Roma] added {[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]],}, reason: zen-disco-receive(join from node[[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]]])
[2015-01-05 12:42:40,505][INFO ][cluster.service          ] [Roma] added {[Druid][ftzmRGfARAuhjPfw9CVE5Q][dw1949demum.int.demandware.com][inet[/127.0.0.1:9303]],}, reason: zen-disco-receive(join from node[[Druid][ftzmRGfARAuhjPfw9CVE5Q][dw1949demum.int.demandware.com][inet[/127.0.0.1:9303]]])
[2015-01-05 12:56:58,560][INFO ][cluster.routing.allocation.decider] [Roma] updating [cluster.routing.allocation.disable_allocation] from [true] to [false]
[2015-01-05 12:56:59,184][WARN ][cluster.action.shard     ] [Roma] [index1][2] received shard failed for [index1][2], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[STARTED], indexUUID [_na_], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][2] Failed to transfer [54] files with total size of [4.3mb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)]; ]]
[2015-01-05 12:56:59,740][WARN ][cluster.action.shard     ] [Roma] [index1][1] received shard failed for [index1][1], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[STARTED], indexUUID [_na_], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][1] Failed to transfer [94] files with total size of [4.3mb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@47778e4f)]; ]]
[2015-01-05 12:57:09,641][WARN ][cluster.action.shard     ] [Roma] [index1][3] received shard failed for [index1][3], node[LTdV5wDIQtuJUPTl9NuuJw], [P], s[STARTED], indexUUID [_na_], reason [engine failure, message [corrupt file detected source: [recovery phase 1]][RecoverFilesRecoveryException[[index1][3] Failed to transfer [114] files with total size of [4.3mb]]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7e6256a6)]; ]]
[2015-01-05 12:57:09,663][WARN ][cluster.action.shard     ] [Roma] [index1][2] received shard failed for [index1][2], node[LTdV5wDIQtuJUPTl9NuuJw], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index1][2] failed recovery]; nested: EngineCreationFailureException[[index1][2] failed to open reader on writer]; nested: FileNotFoundException[No such file [_in.tis]]; ]]
[2015-01-05 12:57:09,695][WARN ][cluster.action.shard     ] [Roma] [index1][2] received shard failed for [index1][2], node[gkXgdCNvSACjIFiPAVTxXA], [P], s[INITIALIZING], indexUUID [_na_], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[index1][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[index1][2] Preexisting corrupted index [corrupted_jbVtki8fRwulkaA6A3HZ4Q] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@3f592170)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2a0df6a)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=ze9sfb actual=ne6jhj resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7992625b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745
...
[2015-01-05 12:42:19,707][INFO ][node                     ] [Bob Diamond] version[1.4.2], pid[65324], build[927caff/2014-12-16T14:11:12Z]
[2015-01-05 12:42:19,707][INFO ][node                     ] [Bob Diamond] initializing ...
[2015-01-05 12:42:19,712][INFO ][plugins                  ] [Bob Diamond] loaded [], sites []
[2015-01-05 12:42:22,303][INFO ][node                     ] [Bob Diamond] initialized
[2015-01-05 12:42:22,304][INFO ][node                     ] [Bob Diamond] starting ...
[2015-01-05 12:42:22,365][INFO ][transport                ] [Bob Diamond] bound_address {inet[/127.0.0.1:9301]}, publish_address {inet[/127.0.0.1:9301]}
[2015-01-05 12:42:22,378][INFO ][discovery                ] [Bob Diamond] elasticsearch/gkXgdCNvSACjIFiPAVTxXA
[2015-01-05 12:42:25,417][INFO ][cluster.service          ] [Bob Diamond] detected_master [Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]], added {[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:42:25,423][INFO ][cluster.routing.allocation.decider] [Bob Diamond] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
[2015-01-05 12:42:25,436][INFO ][http                     ] [Bob Diamond] bound_address {inet[/127.0.0.1:9201]}, publish_address {inet[/127.0.0.1:9201]}
[2015-01-05 12:42:25,436][INFO ][node                     ] [Bob Diamond] started
[2015-01-05 12:42:32,864][INFO ][cluster.service          ] [Bob Diamond] added {[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:42:40,507][INFO ][cluster.service          ] [Bob Diamond] added {[Druid][ftzmRGfARAuhjPfw9CVE5Q][dw1949demum.int.demandware.com][inet[/127.0.0.1:9303]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:56:58,557][INFO ][cluster.routing.allocation.decider] [Bob Diamond] updating [cluster.routing.allocation.disable_allocation] from [true] to [false]
[2015-01-05 12:56:58,801][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_im.tis], length [530], checksum [3uya22], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,801][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_im.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,807][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_io.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,814][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_io.tis], length [537], checksum [ze9sfb], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,841][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_in.tii], length [1124], checksum [1om7x0v], writtenBy [null] checksum mismatch
[2015-01-05 12:56:58,906][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_in.tis], length [75185], checksum [elttvz], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,075][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_iq.tis], length [438], checksum [1wqfspo], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,081][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_iq.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,089][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_ip.tis], length [522], checksum [gn93en], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,093][WARN ][indices.recovery         ] [Bob Diamond] [index1][2] Corrupted file detected name [_ip.tii], length [35], checksum [1vw8erc], writtenBy [null] checksum mismatch
[2015-01-05 12:56:59,099][WARN ][index.engine.internal    ] [Bob Diamond] [index1][2] failed engine [corrupt file detected source: [recovery phase 1]]
org.elasticsearch.indices.recovery.RecoverFilesRecoveryException: [index1][2] Failed to transfer [54] files with total size of [4.3mb]
    at org.elasticsearch.indices.recovery.RecoverySource$1.phase1(RecoverySource.java:276)
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1116)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:654)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:137)
    at org.elasticsearch.indices.recovery.RecoverySource.access$2600(RecoverySource.java:74)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:464)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:450)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=3uya22 actual=1rkh7qi resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@49c64a7f)
    at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
    at org.elasticsearch.index.store.Store.verify(Store.java:365)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
    at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
    ... 4 more
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@3f592170)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1vw8erc actual=1ckq9x4 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@2a0df6a)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    Suppressed: org.elasticsearch.transport.RemoteTransportException: [Phoenix][inet[/127.0.0.1:9302]][internal:index/shard/recovery/file_chunk]
    Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=ze9sfb actual=ne6jhj resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7992625b)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
        at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
...
[2015-01-05 12:42:34,825][INFO ][node                     ] [Druid] version[1.4.2], pid[65351], build[927caff/2014-12-16T14:11:12Z]
[2015-01-05 12:42:34,826][INFO ][node                     ] [Druid] initializing ...
[2015-01-05 12:42:34,830][INFO ][plugins                  ] [Druid] loaded [], sites []
[2015-01-05 12:42:37,395][INFO ][node                     ] [Druid] initialized
[2015-01-05 12:42:37,395][INFO ][node                     ] [Druid] starting ...
[2015-01-05 12:42:37,466][INFO ][transport                ] [Druid] bound_address {inet[/127.0.0.1:9303]}, publish_address {inet[/127.0.0.1:9303]}
[2015-01-05 12:42:37,480][INFO ][discovery                ] [Druid] elasticsearch/ftzmRGfARAuhjPfw9CVE5Q
[2015-01-05 12:42:40,516][INFO ][cluster.service          ] [Druid] detected_master [Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]], added {[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]],[Bob Diamond][gkXgdCNvSACjIFiPAVTxXA][dw1949demum.int.demandware.com][inet[/127.0.0.1:9301]],[Phoenix][LTdV5wDIQtuJUPTl9NuuJw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9302]],}, reason: zen-disco-receive(from master [[Roma][BvPgLtdCS8eIEFUATx2lVw][dw1949demum.int.demandware.com][inet[/127.0.0.1:9300]]])
[2015-01-05 12:42:40,526][INFO ][cluster.routing.allocation.decider] [Druid] updating [cluster.routing.allocation.disable_allocation] from [false] to [true]
[2015-01-05 12:42:40,540][INFO ][http                     ] [Druid] bound_address {inet[/127.0.0.1:9203]}, publish_address {inet[/127.0.0.1:9203]}
[2015-01-05 12:42:40,540][INFO ][node                     ] [Druid] started
[2015-01-05 12:56:58,557][INFO ][cluster.routing.allocation.decider] [Druid] updating [cluster.routing.allocation.disable_allocation] from [true] to [false]
[2015-01-05 12:57:09,746][WARN ][indices.cluster          ] [Druid] [index1][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index1][1] failed recovery
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [index1][1] failed to open reader on writer
    at org.elasticsearch.index.engine.internal.InternalEngine.start(InternalEngine.java:321)
    at org.elasticsearch.index.shard.service.InternalIndexShard.postRecovery(InternalIndexShard.java:710)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:223)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:132)
    ... 3 more
Caused by: java.io.FileNotFoundException: No such file [_id.tis]
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:176)
    at org.elasticsearch.index.store.DistributorDirectory.getDirectory(DistributorDirectory.java:144)
    at org.elasticsearch.index.store.DistributorDirectory.openInput(DistributorDirectory.java:130)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:487)
    at org.apache.lucene.codecs.lucene3x.TermInfosReader.<init>(TermInfosReader.java:115)
    at org.apache.lucene.codecs.lucene3x.Lucene3xFields.newTermInfosReader(Lucene3xFields.java:142)
    at org.apache.lucene.codecs.lucene3x.Lucene3xFields.<init>(Lucene3xFields.java:88)
    at org.apache.lucene.codecs.lucene3x.Lucene3xPostingsFormat.fieldsProducer(Lucene3xPostingsFormat.java:62)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:120)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:144)
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:238)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:104)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:422)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:112)
    at org.apache.lucene.search.SearcherManager.<init>(SearcherManager.java:89)
    at org.elasticsearch.index.engine.internal.InternalEngine.buildSearchManager(InternalEngine.java:1527)
    at org.elasticsearch.index.engine.internal.InternalEngine.start(InternalEngine.java:309)
    ... 6 more
...

rmuir commented Jan 5, 2015

I don't think 0.20.x computed a proper Adler checksum for files like .tis/.tii that aren't "append-only". This is because those formats seeked backwards and rewrote earlier bytes. I think we should just verify only the length for those extensions? The other case is segments_N, but it never wrote a checksum for that, so it's OK.
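
For illustration only, a minimal sketch of the idea above (made-up names, not the actual Elasticsearch fix): decide per extension whether the stored legacy Adler-32 can be trusted at all, and fall back to a length-only check for the non-append-only files.

import java.util.Arrays;
import java.util.List;

// Sketch only: the class name and extension list are illustrative.
public class LegacyVerificationPolicy {

    // 3.x-era extensions whose writers seek backwards and rewrite earlier bytes,
    // so a checksum accumulated while streaming cannot match the final file.
    private static final List<String> NON_APPEND_ONLY = Arrays.asList("tis", "tii");

    // True if the stored legacy checksum is meaningful for this file;
    // false means only the file length should be verified.
    public static boolean canVerifyChecksum(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            // e.g. segments_N: no checksum was ever written for it anyway
            return false;
        }
        return !NON_APPEND_ONLY.contains(fileName.substring(dot + 1));
    }
}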

rmuir commented Jan 5, 2015

I will investigate with @gboanea's procedure to see if the metadata length is reliable as well. We don't want an off-by-8, but we need to at least verify the length of the file to detect e.g. a network disconnect or other problem transferring the file.
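
Again just a sketch of that fallback (hypothetical method, not the real recovery API): when the checksum cannot be trusted, at least compare the length recorded in the source node's metadata with the bytes that actually arrived, so a truncated transfer still fails verification.

import java.io.IOException;

// Sketch only: illustrates the length-only verification discussed above.
public final class LengthOnlyVerifier {

    public static void verifyLength(String fileName, long expectedLength, long actualLength)
            throws IOException {
        if (actualLength != expectedLength) {
            // e.g. a network disconnect left a partially transferred file behind
            throw new IOException("length mismatch for [" + fileName + "]: expected="
                    + expectedLength + " actual=" + actualLength);
        }
    }
}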

clintongormley commented

@rmuir Don't know if this helps, but I just had a go at replicating this. With a single node the upgrade was smooth, but as soon as I added a second node (and thus replicas) this was easy to reproduce.

rmuir commented Jan 5, 2015

@clintongormley was the problematic file a .tis or .tii?

It should always fail with the 3.x index format (0.20.x). That's because those files were never append-only, and so the Adler checksums were incorrect. But they were used essentially only as hash values at the time, so it was no problem.
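
As a toy illustration of why those stored values end up wrong (simplified, using plain java.util.zip.Adler32): the digest is accumulated in write order, but the .tis/.tii writers later seek back and patch earlier bytes, so the digest no longer matches an Adler-32 computed over the final file contents.

import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class StreamingAdlerMismatch {
    public static void main(String[] args) {
        // Digest accumulated in write order, the way the legacy checksums were built.
        Adler32 streaming = new Adler32();
        byte[] firstPass = "0000 term data ...".getBytes(StandardCharsets.UTF_8);
        streaming.update(firstPass, 0, firstPass.length);
        // The writer later seeks back and patches the header (e.g. a term count);
        // the patch bytes may also be fed to the digest, but out of position.
        streaming.update("0042".getBytes(StandardCharsets.UTF_8));

        // What 1.4.2 recovery re-checksums: the final bytes on disk.
        Adler32 onDisk = new Adler32();
        byte[] finalFile = "0042 term data ...".getBytes(StandardCharsets.UTF_8);
        onDisk.update(finalFile, 0, finalFile.length);

        // Different values => "checksum failed (hardware problem?)" on a healthy file.
        System.out.println("stored (streaming) = " + streaming.getValue());
        System.out.println("recomputed on disk = " + onDisk.getValue());
    }
}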

clintongormley commented

@rmuir yes, both .tis and .tii

rmuir commented Jan 5, 2015

OK, let me try to make a quick patch. The issue does not impact master, only 1.x, and only Lucene 3.x (and 4.0) indexes are impacted. After 4.1, all files are append-only.

rmuir commented Jan 5, 2015

I made an untested patch here: #9142

rmuir commented Jan 5, 2015

I tested manually and the patch works (on 0.20.x you just have to index enough to trigger a merge, or turn off cfs-on-flush, so you get .tis and .tii files). I opened #9143 to fix the bigger test issue.

clintongormley commented

I believe this has been solved, but either way 0.20.6 is too old to worry about anymore.
