Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Harden recovery for old segments #8399

Closed
wants to merge 3 commits into from
Closed

Conversation

rmuir
Copy link
Contributor

@rmuir rmuir commented Nov 8, 2014

When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated or corrupted file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

I added unit tests for the two cases covered here. PR is against 1.x since thats what i'm working with,
but I can forward-port to master once we figure this out.

When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.
@rmuir rmuir added the review label Nov 8, 2014
@rjernst
Copy link
Member

rjernst commented Nov 8, 2014

LGTM!

@s1monw
Copy link
Contributor

s1monw commented Nov 8, 2014

I didn't look very close but on a high level this looks very good to me - not sure why we haven't done this before to be honest...

/**
* verifies length for index files before lucene 4.8
*/
static class LengthVerifyingIndexOutput extends VerifyingIndexOutput {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like a FilterIndexOutput is worth putting into lucene?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, although Lucene doesn't decorate at this level. Instead, it works at OutputStream level, using JDK classes.

For example the default chain is essentially:

FilterOutputStream(
    BufferedOutputStream(
        CheckedOutputStream(
            Files.newOutputStream(...))))

@s1monw
Copy link
Contributor

s1monw commented Nov 8, 2014

left some rather cosmetic comments, LGTM though - honestly I almost classify this as a bugfix so I think we should push it to 1.4 and 1.3 too WDYT?

@rmuir
Copy link
Contributor Author

rmuir commented Nov 8, 2014

Thanks for looking @s1monw . Can you look at my comments? I am concerned that those comments are not cosmetic but are too risky for this task.

@rmuir
Copy link
Contributor Author

rmuir commented Nov 8, 2014

I tried a way that might make us both happy. I made the irrelevant forwarded methods 'final' and so on.

@s1monw
Copy link
Contributor

s1monw commented Nov 8, 2014

thanks @rmuir LGTM

@mikemccand
Copy link
Contributor

LGTM

rmuir added a commit that referenced this pull request Nov 9, 2014
When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

Closes #8399
rmuir added a commit that referenced this pull request Nov 9, 2014
When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

Closes #8399

Conflicts:
	src/main/java/org/elasticsearch/index/store/Store.java
rmuir added a commit that referenced this pull request Nov 9, 2014
When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

Closes #8399

Conflicts:
	src/main/java/org/elasticsearch/index/store/Store.java
@rmuir rmuir closed this Nov 9, 2014
rmuir added a commit that referenced this pull request Nov 9, 2014
When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

Closes #8399

Conflicts:
	src/main/java/org/elasticsearch/index/store/Store.java
	src/test/java/org/elasticsearch/index/store/StoreTest.java
@clintongormley clintongormley added :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. and removed review labels Mar 19, 2015
@clintongormley clintongormley changed the title Internal: harden recovery for old segments Harden recovery for old segments Jun 8, 2015
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

Closes elastic#8399

Conflicts:
	src/main/java/org/elasticsearch/index/store/Store.java
mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
When a lucene 4.8+ file is transferred, Store returns a VerifyingIndexOutput
that verifies both the CRC32 integrity and the length of the file.

However, for older files, problems can make it to the lucene level. This is not great
since older lucene files aren't especially strong as far as detecting issues here.

For example, if a network transfer is closed on the remote side, we might write a
truncated file... which old lucene formats may or may not detect.

The idea here is to verify old files with their legacy Adler32 checksum, plus expected
length. If they don't have an Adler32 (segments_N, jurassic elasticsearch?, its optional
as far as the protocol goes), then at least check the length.

We could improve it for segments_N, its had an embedded CRC32 forever in lucene, but this
gets trickier. Long term, we should also try to also improve tests around here, especially
backwards compat testing, we should test that detected corruptions are handled properly.

Closes elastic#8399

Conflicts:
	src/main/java/org/elasticsearch/index/store/Store.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v1.3.6 v1.4.1 v1.5.0 v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants