
Index shard got corrupted #186

Closed
sukantasaha opened this issue Mar 2, 2015 · 15 comments · Fixed by #188
@sukantasaha

Hi,

In all our Elasticsearch clusters we use the elasticsearch-cloud-aws plugin to create snapshots on S3 on a regular basis.

Sometimes we see in our Elasticsearch log that a shard of an index got corrupted. So we tried to restore the index from a backup, and while restoring it we saw the same exception in the logs again, which follows:

```
[2015-02-25 08:18:10,824][WARN ][indices.cluster          ] [test-es-cluster-1e-data-2] [lst_p113_v_4_20140615_0000][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [lst_p113_v_4_20140615_0000][0] failed recovery
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [lst_p113_v_4_20140615_0000][0] restore failed
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
        at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
        ... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [lst_p113_v_4_20140615_0000][0] failed to restore snapshot [listening-prod6-20150224]
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:165)
        at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
        ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [lst_p113_v_4_20140615_0000][0] Failed to recover index
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:787)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:162)
        ... 5 more
Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=1lvsjli actual=3awj8p resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7266a49d)
        at org.elasticsearch.index.store.LegacyVerification$Adler32VerifyingIndexOutput.verify(LegacyVerification.java:73)
        at org.elasticsearch.index.store.Store.verify(Store.java:365)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restoreFile(BlobStoreIndexShardRepository.java:843)
        at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:784)
        ... 6 more
[2015-02-25 08:18:10,826][WARN ][cluster.action.shard     ] [test-es-cluster-1e-data-2] [lst_p113_v_4_20140615_0000][0] sending failed shard for [lst_p113_v_4_20140615_0000][0], node[shNgLjr8RlW7Zrk3P4UdPg], [P], restoring[aws-prod-elasticsearch-backup:listening-prod6-20150224], s[INITIALIZING], indexUUID [ZQKQ-6naQqeLP1Gk8IFsig], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[lst_p113_v_4_20140615_0000][0] failed recovery]; nested: IndexShardRestoreFailedException[[lst_p113_v_4_20140615_0000][0] restore failed]; nested: IndexShardRestoreFailedException[[lst_p113_v_4_20140615_0000][0] failed to restore snapshot [listening-prod6-20150224]]; nested: IndexShardRestoreFailedException[[lst_p113_v_4_20140615_0000][0] Failed to recover index]; nested: CorruptIndexException[checksum failed (hardware problem?) : expected=1lvsjli actual=3awj8p resource=(org.apache.lucene.store.FSDirectory$FSIndexOutput@7266a49d)]; ]]
```

Even when we went back to an older snapshot, we found the same exception.

So we downloaded all the segment files from S3, merged them, and checked them with org.apache.lucene.index.CheckIndex: some segments were corrupted. Running CheckIndex with -fix repaired the index, but we lost 5 GB of data.
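
For reference, the same check can be run programmatically against a shard's Lucene directory. This is a minimal sketch, assuming Lucene 4.x (the version Elasticsearch 1.4 ships with); the shard path is made up, and note that fixIndex permanently drops corrupt segments along with the documents they contain:

```java
import java.io.File;

import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class FixShard {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the shard's Lucene index; adjust to your data dir
        try (FSDirectory dir = FSDirectory.open(new File("/path/to/shard/index"))) {
            CheckIndex checker = new CheckIndex(dir);
            CheckIndex.Status status = checker.checkIndex();
            if (!status.clean) {
                // WARNING: rewrites the segments file, dropping any corrupt
                // segments and losing the documents they contained
                checker.fixIndex(status);
            }
        }
    }
}
```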

We shared this problem with the AWS team; here is what they suggested:

> We have taken a look at the ElasticSearch plugin you have identified at https://github.com/elasticsearch/elasticsearch-cloud-aws. In general, as this plugin is not something developed by AWS, our ability to provide support here may be limited. If you have not already raised this with the developers of this plugin, we would certainly do so in order to follow this up on both sides.
>
> Now, with regard to the S3 side, there is a feature available in S3 that clients can use to ensure data integrity on upload. Whenever an object is PUT to an S3 bucket, the client is able to supply the 'Content-MD5' header and an appropriate MD5sum matching the contents of the data being uploaded. S3 will verify the data against this MD5 prior to accepting an upload request. Any upload that does not match the supplied MD5 will be rejected. For reference, please see the S3 PutObject API:
>
> http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
>
> In looking at the ElasticSearch plugin, it is not clear that the client uses this feature. It is certainly not available as a configuration option. From looking at the source code (and I must note, I am not a Java developer), it does not appear this feature is used when objects are uploaded to S3 in:
>
> https://github.com/elasticsearch/elasticsearch-cloud-aws/blob/master/src/main/java/org/elasticsearch/cloud/aws/blobstore/DefaultS3OutputStream.java
>
> As a general comment, the scenario you are describing is most commonly encountered when a file is uploaded to S3 while it is open for write by an application. As such, the contents of the file may change while the upload is in progress. This can manifest in corruption, but this is not an S3 problem per se. Essentially, clients uploading to S3 should ensure the source files are not being updated at the time of an upload to ensure data integrity. S3 is an extremely reliable platform, but it can only store the bytes it receives. Supplying the Content-MD5 header, as per my suggestion above, often catches scenarios where a file is being modified while in transit, as the uploaded data typically won't match the pre-calculated MD5.
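
For illustration, supplying the Content-MD5 header with the AWS SDK for Java might look like the following minimal sketch (the bucket and key names are made up); S3 rejects the PUT if the bytes it receives do not hash to the supplied digest:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;
import com.amazonaws.util.Base64;

public class Md5Put {
    public static void main(String[] args) throws Exception {
        byte[] data = "segment bytes".getBytes(StandardCharsets.UTF_8);

        // Pre-compute the MD5 of the payload and Base64-encode it
        String contentMd5 = Base64.encodeAsString(
                MessageDigest.getInstance("MD5").digest(data));

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(data.length);
        // S3 verifies the received bytes against this digest and
        // rejects the PUT on a mismatch
        meta.setContentMD5(contentMd5);

        AmazonS3Client s3 = new AmazonS3Client();
        // "my-snapshots" / "indices/0/segment" are made-up names
        s3.putObject(new PutObjectRequest("my-snapshots", "indices/0/segment",
                new ByteArrayInputStream(data), meta));
    }
}
```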

Can you guys please have a look at this issue and suggest something?

Thanks

@sukantasaha
Author

Guys, any update?

@dadoonet dadoonet self-assigned this Mar 4, 2015
@dadoonet dadoonet added this to the 2.4.2 milestone Mar 4, 2015
@otisg

otisg commented Mar 4, 2015

Thanks for #188, @dadoonet, but why do you think this happens in the first place? Is it possible that there is a bug in snapshot creation that makes it snapshot files that are still being modified?

@dadoonet
Member

dadoonet commented Mar 4, 2015

@otisg @imotov might know if this could happen. AFAIK, it only copies immutable segments, so I don't think this could happen.

PR #188 only tries to secure the copy over the wire by checking the bytes written on both sides. But you are right: if there is a bug in snapshot creation, this won't fix it.
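
Conceptually, that kind of cross-check might look like the sketch below (not the actual #188 code; the names are illustrative): after an upload, re-read the object's metadata and compare the stored length against the bytes written locally.

```java
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class UploadCheck {
    // Compare the size S3 reports for the stored object with the number
    // of bytes the client wrote; throws if the two sides disagree.
    static void verifySameLength(AmazonS3Client s3, String bucket, String key,
                                 long bytesWrittenLocally) {
        ObjectMetadata remote = s3.getObjectMetadata(bucket, key);
        if (remote.getContentLength() != bytesWrittenLocally) {
            throw new IllegalStateException("wrote " + bytesWrittenLocally
                    + " bytes but S3 stored " + remote.getContentLength());
        }
    }
}
```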

@imotov
Contributor

imotov commented Mar 4, 2015

Which version of elasticsearch was the snapshot created with?

@sukantasaha
Author

We are using:
es version: elasticsearch 1.4.1
cloud-aws: elasticsearch-cloud-aws-2.4.0


@imotov
Contributor

imotov commented Mar 5, 2015

@sukantasaha I understand that 1.4.1 is the version that you are currently using. But I am trying to figure out which version of elasticsearch this index was first snapshotted with. You said that you are snapshotting on a regular basis. So, judging from the name, the index was created in June 2014. Is this when you created the first snapshot of this index, or was it later on? If this index was first snapshotted back then, and you were staying with the latest (more or less) available version of elasticsearch, we can assume that this index was first snapshotted by elasticsearch v1.1 or v1.2. Was this the case?

@sukantasaha
Author

This cluster was created in July 2014 with elasticsearch-1.2.1, and we may have migrated this index. We snapshot on a regular basis. Later, when we updated to elasticsearch-1.4.1-SNAPSHOT, we created a new repository and created the snapshots there.


@imotov
Contributor

imotov commented Mar 7, 2015

@sukantasaha what did you mean by "may be we migrated this index"? Could it be possible that this index was originally created with an older version of elasticsearch? I am just wondering if you are hitting elastic/elasticsearch#9140. Would it be possible for you, as an experiment, to restore this index using a newer version of elasticsearch (v1.4.3 or higher)?

@sukantasaha
Author

We did that test: we launched a new cluster with version 1.4.3 and tried to restore just that index, and we saw the same "index shard got corrupted" error.

If you want, I can do the same again and show you.


@sukantasaha
Author

May I know when the S3 plugin will be extended with the MD5 checksum, and in which version? Then there would be no more problems with snapshot restore, because we would know whether a snapshot was really successful or not.

@dadoonet
Member

MD5 could be added in 2.4.2, so it's not there yet.

@sukantasaha
Author

When can we expect it?

@dadoonet
Member

No ETA yet, and the PR has not been merged.
That means if you want to build it yourself, you need to check out the es-1.4 branch, backport the PR, and build.

@sidcarter

Is this fix now in es-1.5?

@dadoonet
Member

No. It's not.

dadoonet added a commit that referenced this issue May 20, 2015
There is a feature available in S3 that clients can use to ensure data integrity on upload. Whenever an object is PUT to an S3 bucket, the client is able to get back the Base64-encoded `MD5` and check that it matches the local `MD5`.

For reference, please see the [S3 PutObject API](http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html).

Closes #186.

(cherry picked from commit 2d4fd39)
(cherry picked from commit 3369e02)
(cherry picked from commit 1d1f5c7)
dadoonet added a commit that referenced this issue May 20, 2015
There is a feature available in S3 that clients can use to ensure data integrity on upload. Whenever an object is PUT to an S3 bucket, the client is able to get back the Base64-encoded `MD5` and check that it matches the local `MD5`.

For reference, please see the [S3 PutObject API](http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html).

Closes #186.

(cherry picked from commit 2d4fd39)
dadoonet added a commit that referenced this issue May 20, 2015
There is a feature available in S3 that clients can use to ensure data integrity on upload. Whenever an object is PUT to an S3 bucket, the client is able to get back the Base64-encoded `MD5` and check that it matches the local `MD5`.

For reference, please see the [S3 PutObject API](http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html).

Closes #186.

(cherry picked from commit 2d4fd39)
(cherry picked from commit 3369e02)
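
A minimal sketch of the verification these commits describe, assuming the AWS SDK for Java v1; it is an assumption here that `PutObjectResult.getContentMd5()` returns the Base64-encoded digest of the object S3 stored, and the bucket and key names are made up:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectResult;
import com.amazonaws.util.Base64;

public class VerifiedPut {
    public static void main(String[] args) throws Exception {
        byte[] data = "segment bytes".getBytes(StandardCharsets.UTF_8);

        // Compute the Base64-encoded MD5 of the local bytes before uploading
        String localMd5 = Base64.encodeAsString(
                MessageDigest.getInstance("MD5").digest(data));

        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(data.length);

        AmazonS3Client s3 = new AmazonS3Client();
        // "my-snapshots" / "indices/0/segment" are made-up names
        PutObjectResult result = s3.putObject("my-snapshots", "indices/0/segment",
                new ByteArrayInputStream(data), meta);

        // Fail loudly if the digest S3 reports differs from the local one,
        // i.e. the upload was corrupted in transit
        if (!localMd5.equals(result.getContentMd5())) {
            throw new IllegalStateException("MD5 mismatch: local " + localMd5
                    + " vs remote " + result.getContentMd5());
        }
    }
}
```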
@dadoonet dadoonet removed the update label May 20, 2015