
Data Corruption reported from ES but not Lucene #10062

Closed · brunson opened this issue Mar 11, 2015 · 8 comments
Labels: :Core/Infra/Core (Core issues without another label), discuss

brunson commented Mar 11, 2015

After a rolling upgrade from 1.3.9 to 1.4.4, I'm left with ~270 shards across 206 (of 1082) indices that ES reports as corrupted for the primaries and all replicas, and seems unable to recover, but the Lucene CheckIndex tool finds no problem with the underlying data.

Here's a sample log message from ES:

[2015-03-11 13:40:45,047][WARN ][cluster.action.shard ] [Murmur II] [logstash-2014.01.07][2] sending failed shard for [logstash-2014.01.07][2], node[vnlCCnbRQBiJFj-16Xvuyg], [P], s[INITIALIZING], indexUUID [eKh7osfjRfmqmE5Mbl2-Tw], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[logstash-2014.01.07][2] failed to fetch index version after copying it over]; nested: CorruptIndexException[[logstash-2014.01.07][2] Preexisting corrupted index [corrupted_JLDbR2yHQ9SVrwgY_LeOJQ] caused by: CorruptIndexException[checksum failed (hardware problem?) : expected=1dclij9 actual=1wjox1n resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1e8de440)]
org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1dclij9 actual=1wjox1n resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@1e8de440)
at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
at org.elasticsearch.index.store.Store.verify(Store.java:393)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
Suppressed: org.elasticsearch.transport.RemoteTransportException: [Devastator][inet[/10.226.73.178:9300]][internal:index/shard/recovery/file_chunk]
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=5546sn actual=15nanx1 resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@75934f20)
at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
at org.elasticsearch.index.store.Store.verify(Store.java:393)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:599)
at org.elasticsearch.indices.recovery.RecoveryTarget$FileChunkTransportRequestHandler.messageReceived(RecoveryTarget.java:536)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

]; ]]
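For context on the failure mode: during recovery, ES re-verifies each copied file against a stored legacy Adler32 checksum, and the expected=/actual= values in the trace appear to be radix-36 renderings of that checksum. Below is a minimal sketch of the underlying comparison, assuming the whole file is read into memory; this is not the actual LegacyVerification code, and the path argument is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Adler32;

// Sketch only: recompute an Adler32 over a segment file and print it
// the way the log above renders checksums (radix-36 via Character.MAX_RADIX).
public class Adler32Sketch {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0])); // hypothetical path argument
        Adler32 adler = new Adler32();
        adler.update(data, 0, data.length);
        String actual = Long.toString(adler.getValue(), Character.MAX_RADIX);
        // Recovery fails when this value differs from the checksum recorded
        // in the shard's checksums-* metadata file.
        System.out.println("adler32 = " + actual);
    }
}
```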

But Lucene finds no issue with the index:

$ /usr/lib/jvm/java-7-openjdk-amd64/bin/java -cp /usr/share/elasticsearch/lib/elasticsearch-0.90.13.jar:/usr/share/elasticsearch/lib/lucene-core-4.10.3.jar:/usr/share/elasticsearch/lib/* -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /local/mnt/elasticsearch/bait/nodes/0/indices/logstash-2014.01.07/2/index

Opening index @ /local/mnt/elasticsearch/bait/nodes/0/indices/logstash-2014.01.07/2/index

Segments file=segments_4ir numSegments=6 versions=[4.6.0 .. 4.9.0] format= userData={translog_id=1389052807442}
1 of 6: name=_tyc docCount=63647
version=4.6.0
codec=Lucene46
compound=false
numFiles=12
size (MB)=19.953
diagnostics = {timestamp=1389121243139, os=Linux, os.version=3.2.0-40-generic, mergeFactor=10, source=merge, lucene.version=4.6.0 1543363 - simon - 2013-11-19 11:05:50, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [113 fields]
test: field norms.........OK [2 fields]
test: terms, freq, prox...OK [392120 terms; 4498364 terms/docs pairs; 340461 tokens]
test: stored fields.......OK [127294 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

2 of 6: name=_11s4 docCount=17508
version=4.7.0
codec=Lucene46
compound=false
numFiles=14
size (MB)=5.576
diagnostics = {timestamp=1407350741636, os=Linux, os.version=3.2.0-40-generic, mergeFactor=10, source=merge, lucene.version=4.7.0 1570806 - simon - 2014-02-22 08:25:23, os.arch=amd64, mergeMaxNumSegments=-1, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
has deletions [delGen=1]
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK [1 deleted docs]
test: fields..............OK [119 fields]
test: field norms.........OK [2 fields]
test: terms, freq, prox...OK [118686 terms; 1248860 terms/docs pairs; 105214 tokens]
test (ignoring deletes): terms, freq, prox...OK [118834 terms; 1249008 terms/docs pairs; 105214 tokens]
test: stored fields.......OK [35014 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

3 of 6: name=_121s docCount=11
version=4.7.0
codec=Lucene46
compound=true
numFiles=3
size (MB)=0.008
diagnostics = {timestamp=1407971871508, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.7.0 1570806 - simon - 2014-02-22 08:25:23, os.arch=amd64, java.version=1.6.0_26, java.vendor=Sun Microsystems Inc.}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [17 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [243 terms; 660 terms/docs pairs; 0 tokens]
test: stored fields.......OK [22 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

4 of 6: name=_16a8 docCount=30
version=4.7.0
codec=Lucene46
compound=true
numFiles=3
size (MB)=0.012
diagnostics = {timestamp=1422479687275, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64, java.version=1.7.0_15, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [16 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [365 terms; 1777 terms/docs pairs; 0 tokens]
test: stored fields.......OK [60 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

5 of 6: name=_16a9 docCount=1
version=4.7.0
codec=Lucene46
compound=true
numFiles=3
size (MB)=0.004
diagnostics = {timestamp=1422479743815, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.7.2 1586229 - rmuir - 2014-04-10 09:00:35, os.arch=amd64, java.version=1.7.0_15, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [17 fields]
test: field norms.........OK [0 fields]
test: terms, freq, prox...OK [79 terms; 79 terms/docs pairs; 0 tokens]
test: stored fields.......OK [2 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

6 of 6: name=_16z7 docCount=14
version=4.9.0
codec=Lucene49
compound=true
numFiles=3
size (MB)=0.012
diagnostics = {timestamp=1426010857824, os=Linux, os.version=3.2.0-40-generic, source=flush, lucene.version=4.9.1 1625909 - mike - 2014-09-18 04:03:13, os.arch=amd64, java.version=1.7.0_15, java.vendor=Oracle Corporation}
no deletions
test: open reader.........OK
test: check integrity.....OK
test: check live docs.....OK
test: fields..............OK [21 fields]
test: field norms.........OK [1 fields]
test: terms, freq, prox...OK [281 terms; 553 terms/docs pairs; 14 tokens]
test: stored fields.......OK [28 total field count; avg 2 fields per doc]
test: term vectors........OK [0 total vector count; avg 0 term/freq vector fields per doc]
test: docvalues...........OK [1 docvalues fields; 0 BINARY; 1 NUMERIC; 0 SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET]

No problems were detected with this index.
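For anyone reproducing this, the CLI invocation above can also be run programmatically. A minimal sketch against the Lucene 4.x API; the class name and path argument are placeholders:

```java
import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Sketch: open the shard's index directory and run the same checks as
// the command-line CheckIndex tool (Lucene 4.x File-based API).
public class CheckShard {
    public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(new File(args[0]))) {
            CheckIndex checker = new CheckIndex(dir);
            checker.setInfoStream(System.out); // same per-segment output as the CLI
            CheckIndex.Status status = checker.checkIndex();
            System.out.println(status.clean ? "index is clean" : "index has problems");
        }
    }
}
```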

clintongormley added the discuss and :Core/Infra/Core labels Mar 16, 2015

rmuir commented Mar 16, 2015

Is the data sensitive? The example shard you have here seems pretty small. If you can share the data files, I can dig deeper into it.

s1monw added a commit to s1monw/elasticsearch that referenced this issue Mar 16, 2015
Today we rely on the IndexOutput#toString method to print the actual
resource name we are verifying. This has a bug in the 4.10.x series
that leaves us with the default toString. This commit adds the filename
to each corruption message for easier debugging.

Relates to elastic#10062
s1monw added a commit that referenced this issue Mar 16, 2015
Today we rely on the IndexOutput#toString method to print the actual
resource name we are verifying. This has a bug in the 4.10.x series
that leaves us with the default toString. This commit adds the filename
to each corruption message for easier debugging.

Relates to #10062

Conflicts:
	src/main/java/org/elasticsearch/index/store/VerifyingIndexOutput.java
s1monw added a commit that referenced this issue Mar 16, 2015
Today we rely on the IndexOutput#toString method to print the actual
resource name we are verifying. This has a bug in the 4.10.x series
that leaves us with the default toString. This commit adds the filename
to each corruption message for easier debugging.

Relates to #10062

s1monw commented Mar 16, 2015

@brunson if you are up for it, can you share the content of the checksums-* file in your index directory, or upload them somewhere?


brunson commented Mar 16, 2015

Let me check on the contents of the index, I don't believe it's sensitive.

I've got 3 (of 6) nodes with a checksum file for that shard; two are the same size and one is slightly smaller, and all three have different SHAs. Would seeing all 3 be helpful?
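(As an aside, that size/SHA comparison is easy to reproduce once the files are gathered from each node. A hedged sketch, with the checksums-* paths passed as hypothetical arguments:)

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Sketch: print size and SHA-1 for each checksums-* file collected
// from the nodes, so the three copies can be compared side by side.
public class FingerprintChecksums {
    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            Path p = Paths.get(arg); // e.g. node1/checksums-... (hypothetical)
            byte[] data = Files.readAllBytes(p);
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : sha1.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            System.out.printf("%s  %d bytes  sha1=%s%n", p, data.length, hex);
        }
    }
}
```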


s1monw commented Mar 16, 2015

Yeah, please gimme all of them.


s1monw commented Mar 16, 2015

And if possible, can we get the entire shard as a zipfile?


brunson commented Mar 16, 2015

Definitely. Can I give you a download link privately?


s1monw commented Mar 16, 2015

Sure: simon at elastic dot co


s1monw commented Mar 17, 2015

@brunson thanks so much for sending me all this information. I tried your shards on a 1.5.0 snapshot and they recovered just fine. This is very likely caused by #8587, so you will have a fix in the upcoming 1.5.0 release, which should be out very soon given that I branched today.

s1monw closed this as completed Mar 17, 2015
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
Today we rely on the IndexOutput#toString method to print the actual
resource name we are verifying. This has a bug in the 4.10.x series
that leaves us with the default toString. This commit adds the filename
to each corruption message for easier debugging.

Relates to elastic#10062
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
Today we rely on the IndexOutput#toString method to print the actual
resource name we are verifying. This has a bug in the 4.10.x series
that leaves us with the default toString. This commit adds the filename
to each corruption message for easier debugging.

Relates to elastic#10062

Conflicts:
	src/main/java/org/elasticsearch/index/store/VerifyingIndexOutput.java