Data Corruption reported from ES but not Lucene #10062
Comments
Is the data sensitive? The example shard you have here seems pretty small. If you can share the data files, I can dig deeper into it.
Today we rely on the IndexOutput#toString method to print the actual resource name we are verifying. This has a bug in the 4.10.x series that leaves us with the default toString. This commit adds the filename to each corruption message for easier debugging. Relates to elastic#10062
Today we rely on the IndexOutput#toString method to print the actual resource name we are verifying. This has a bug in the 4.10.x series that leaves us with the default toString. This commit adds the filename to each corruption message for easier debugging. Relates to #10062 Conflicts: src/main/java/org/elasticsearch/index/store/VerifyingIndexOutput.java
Today we rely on the IndexOutput#toString method to print the actual resource name we are verifying. This has a bug in the 4.10.x series that leaves us with the default toString. This commit adds the filename to each corruption message for easier debugging. Relates to #10062
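The idea in these commits can be sketched roughly as follows. This is a hedged illustration, not the actual elasticsearch patch: the class and member names here (`VerifyingSketch`, `fileName`, `expectedChecksum`) are hypothetical, but the point is the same — carry the file name explicitly so a checksum mismatch names the file instead of falling back to the wrapped output's default toString().

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

// Hypothetical sketch (not the real VerifyingIndexOutput): a verifying
// writer that remembers which file it is checking, so the corruption
// message can include the file name for easier debugging.
public class VerifyingSketch {
    private final String fileName;
    private final String expectedChecksum; // base-36 Adler32 string, as in the legacy format
    private final Adler32 digest = new Adler32();

    public VerifyingSketch(String fileName, String expectedChecksum) {
        this.fileName = fileName;
        this.expectedChecksum = expectedChecksum;
    }

    public void write(byte[] b) {
        digest.update(b, 0, b.length);
    }

    public void verify() {
        String actual = Long.toString(digest.getValue(), Character.MAX_RADIX);
        if (!actual.equals(expectedChecksum)) {
            // The file name now appears in the message, which is the point of the fix.
            throw new RuntimeException("checksum failed (hardware problem?) : expected="
                    + expectedChecksum + " actual=" + actual + " (resource=" + fileName + ")");
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name and checksum, for illustration only.
        VerifyingSketch out = new VerifyingSketch("_tyc.fdt", "1pndp1");
        out.write("hello".getBytes(StandardCharsets.UTF_8));
        out.verify(); // Adler32("hello") rendered in base-36 is "1pndp1", so this passes
    }
}
```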
@brunson, if you are up for it, can you maybe share the content of the
Let me check on the contents of the index; I don't believe it's sensitive. I've got 3 (of 6) nodes with a checksum file for that shard: two are the same size and one is slightly smaller, but they all have different SHAs. Would seeing all 3 be helpful?
Yeah, please gimme all of them.
And if possible, can we get the entire shard as a zipfile?
Definitely. Can I give you a download link privately?
Sure.
After a rolling upgrade from 1.3.9 to 1.4.4, I'm left with ~270 shards across 206 (of 1082) indices that ES reports as corrupted, for the primaries and all replicas, and seems unable to recover — but the Lucene CheckIndex tool can find no problem with the underlying data.
Here's a sample log message from ES:
[2015-03-11 13:40:45,047][WARN ][cluster.action.shard ] [Murmur II] [logstash-2014.01.07][2] sending failed shard for [logstash-2014.01.07][2], node[vnlCCnbRQBiJFj-16Xvuyg], [P], s[INITIALIZING], indexUUID [eKh7osfjRfmqmE5Mbl2-Tw], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[logstash-2014.01.07][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[logstash-2014.01.07][2] Preexisting corrupted index [corrupted_JLDbR2yHQ9SVrwgY_LeOJQ] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1dclij9 actual=1wjox1n resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1e8de440)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1dclij9 actual=1wjox1n resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1e8de440)
at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
at org.elasticsearch.index.store.Store.verify(Store.java:393)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Suppressed: org.elasticsearch.transport.RemoteTransportException: [Devastator][inet[/10.226.73.178:9300]][internal:index/shard/recovery/file_chunk]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5546sn actual=15nanx1 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@75934f20)
at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
at org.elasticsearch.index.store.Store.verify(Store.java:393)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
]; ]]
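For context on the expected/actual values in the stack trace above (e.g. `expected=1dclij9`): these look like base-36 strings rather than hex, which matches the legacy checksum format — an Adler32 digest rendered with `Long.toString(value, Character.MAX_RADIX)`. A minimal sketch of how such a string can be computed (an assumption about the format, for illustration only):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;

public class LegacyChecksumDemo {
    // Render an Adler32 digest the way the log's expected/actual values
    // appear to be rendered: base-36 via Long.toString(v, Character.MAX_RADIX).
    // (Assumption about the legacy format; values like "1dclij9" fit this shape.)
    static String legacyChecksum(byte[] data) {
        Adler32 digest = new Adler32();
        digest.update(data, 0, data.length);
        return Long.toString(digest.getValue(), Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(legacyChecksum(payload)); // prints "1pndp1"
    }
}
```

So a mismatch here means the bytes that arrived during recovery hash to a different Adler32 value than the one recorded for the file — which is why the report points at the transfer/verification path even though CheckIndex is happy with the on-disk segments below.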
But Lucene finds no issue with the index:
$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -cp /usr/share/elasticsearch/lib/elasticsearch-0.90.13.jar:/usr/share/elasticsearch/lib/lucene-core-4.10.3.jar:/usr/share/elasticsearch/lib/* -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /local/mnt/elasticsearch/bait/nodes/0/indices/logstash-2014.01.07/2/index
Opening index @ /local/mnt/elasticsearch/bait/nodes/0/indices/logstash-2014.01.07/2/index
Segments file=segments_4ir numSegments=6 versions=[4.6.0 .. 4.9.0] format= userData={translog_id=1389052807442}
1 of 6: name=_tyc docCount=63647
version=4.6.0
codec=Lucene46
compound=false
numFiles=12
size (MB)=19.953
diagnostics = {timestamp=1389121243139, os=Linux, os.version=3.2.0-40-generic, mergeFactor=10, source=merge, lucene.version=4.6.0 1543363 - simon - 2013-11-19 11:05:50, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [113 fields]
test: field norms.........OK [2 fields]
test: terms, freq, prox...OK [392120 terms; 4498364 terms/docs pairs; 340461 tokens]
test: stored fields.......OK [127294 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
2 of 6: name=_11s4 docCount=17508
version=4.7.0
codec=Lucene46
compound=false
numFiles=14
size (MB)=5.576
diagnostics = {timestamp=1407350741636, os=Linux, os.version=3.2.0-40-generic, mergeFactor=10, source=merge, lucene.version=4.7.0 1570806 - simon - 2014-02-22 08:25:23, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [1 deleted docs]
test: fields..............OK [119 fields]
test: field norms.........OK [2 fields]
test: terms, freq, prox...OK [118686 terms; 1248860 terms/docs pairs; 105214 tokens]
test (ignoring deletes): terms, freq, prox...OK [118834 terms; 1249008 terms/docs pairs; 105214 tokens]
test: stored fields.......OK [35014 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
3 of 6: name=_121s docCount=11
version=4.7.0
codec=Lucene46
compound=true
numFiles=3
size (MB)=0.008
diagnostics = {timestamp=1407971871508, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.7.0 1570806 - simon - 2014-02-22 08:25:23, os.arch=amd64, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [17 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [243 terms; 660 terms/docs pairs; 0 tokens]
test: stored fields.......OK [22 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
4 of 6: name=_16a8 docCount=30
version=4.7.0
codec=Lucene46
compound=true
numFiles=3
size (MB)=0.012
diagnostics = {timestamp=1422479687275, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64, java.version=1.7.0_15, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [16 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [365 terms; 1777 terms/docs pairs; 0 tokens]
test: stored fields.......OK [60 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
5 of 6: name=_16a9 docCount=1
version=4.7.0
codec=Lucene46
compound=true
numFiles=3
size (MB)=0.004
diagnostics = {timestamp=1422479743815, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64, java.version=1.7.0_15, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [17 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [79 terms; 79 terms/docs pairs; 0 tokens]
test: stored fields.......OK [2 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
6 of 6: name=_16z7 docCount=14
version=4.9.0
codec=Lucene49
compound=true
numFiles=3
size (MB)=0.012
diagnostics = {timestamp=1426010857824, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.9.1 1625909 - mike - 2014-09-18 04:03:13, os.arch=amd64, java.version=1.7.0_15, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [21 fields]
test: field norms.........OK [1 fields]
test: terms, freq, prox...OK [281 terms; 553 terms/docs pairs; 14 tokens]
test: stored fields.......OK [28 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]
No problems were detected with this index.