
Tighten up our compression framework. #11279

Merged

jpountz merged 2 commits into elastic:master from fix/simplify_compression on May 29, 2015

Conversation

@jpountz (Contributor) commented on May 21, 2015

We have a compression framework that we use internally, mainly to compress some
xcontent bytes. However, it is quite lenient: for instance, it relies on the
assumption that detection of the compression format is only ever invoked on
either some compressed xcontent bytes or some raw xcontent bytes, but nothing
checks this. In fact, we are misusing it in BinaryFieldMapper: if someone
indexes a binary field that happens to have the same header as an LZF stream,
then at read time we will try to decompress it.

This change also simplifies the API by removing block compression (only
streaming remains) and some code duplication caused by some methods accepting a
byte[] and others a BytesReference.
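
To make the BinaryFieldMapper hazard concrete, here is a minimal, self-contained sketch: the two 'Z', 'V' magic bytes are LZF's real chunk header, but the header-only check shown is a simplification for illustration, not the actual CompressorFactory code.

public class LzfHeaderCollision {
    public static void main(String[] args) {
        // Arbitrary user data stored in a binary field, which by bad luck
        // starts with LZF's two magic bytes 'Z' and 'V'.
        byte[] userBytes = { 'Z', 'V', 0x01, 0x00, 0x02, 0x2A, 0x2B };

        // A lenient, header-only detector cannot tell this apart from a real
        // LZF stream, so at read time it would try to decompress plain data.
        boolean looksLikeLzf = userBytes.length >= 2
                && userBytes[0] == 'Z' && userBytes[1] == 'V';

        System.out.println("misdetected as LZF: " + looksLikeLzf); // prints true
    }
}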

@jpountz (Contributor, Author) commented on May 21, 2015

I'm reading the description again and I've probably not been clear enough about what changed:

  • If you call CompressorFactory.compressor(ChannelBuffer) and the compression format is not detected, you will get an exception. The call site, MessageChannelHandler.messageReceived, already behaved this way; now the check is done directly in the framework.
  • If you call CompressorFactory.compressor(BytesReference) and the bytes are neither raw xcontent bytes nor compressed xcontent bytes, you will get an exception. In general there is no way to detect whether some bytes are compressed, because nothing prevents an arbitrary bytes reference from having the same header as an LZF block. So we enforce that this method only be used on xcontent bytes, which is how we already use it (with the notable exception of BinaryFieldMapper, which is buggy as described above); see the sketch after this list.
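
Here is a rough, self-contained sketch of the tightened detection contract described in the two bullets above; the header and xcontent checks are simplified stand-ins for the framework code, and the exception type is illustrative.

final class CompressorDetectionSketch {

    // LZF chunks really start with the magic bytes 'Z', 'V'.
    static boolean hasLzfHeader(byte[] bytes) {
        return bytes.length >= 2 && bytes[0] == 'Z' && bytes[1] == 'V';
    }

    // Simplified: real xcontent detection inspects the leading byte(s) for
    // JSON, SMILE, YAML or CBOR markers, not just '{'.
    static boolean looksLikeXContent(byte[] bytes) {
        return bytes.length > 0 && bytes[0] == '{';
    }

    /** "LZF" for compressed xcontent, null for raw xcontent, exception otherwise. */
    static String compressor(byte[] bytes) {
        if (hasLzfHeader(bytes)) {
            return "LZF";
        }
        if (looksLikeXContent(bytes)) {
            return null; // raw xcontent: nothing to decompress
        }
        // Neither compressed nor xcontent: fail loudly instead of guessing.
        throw new IllegalStateException("bytes are neither compressed nor xcontent");
    }
}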

@jpountz (Contributor, Author) commented on May 21, 2015

I opened #11280 to fix the BinaryFieldMapper issue.

@jpountz (Contributor, Author) commented on May 22, 2015

I pushed a new commit to also use DEFLATE (with a compression level of 3) instead of LZF. The reason is that even though LZF might be faster, we have had issues with it recently, like #7210 and #7468, and we expect that something more widely used, like DEFLATE, will protect us better from corruption in the future.
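
In plain JDK terms, "DEFLATE with a compression level of 3" boils down to something like the following; the framework wraps this in its own stream abstractions, so this is only a sketch of the setting, not the actual implementation.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class DeflateLevel3 {
    public static void main(String[] args) throws IOException {
        byte[] source = "{\"field\":\"some xcontent bytes\"}".getBytes(StandardCharsets.UTF_8);

        // Level 3 trades a little compression ratio for speed.
        Deflater deflater = new Deflater(3);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream stream = new DeflaterOutputStream(out, deflater)) {
            stream.write(source);
        }
        deflater.end();

        System.out.println(source.length + " -> " + out.size() + " bytes");
    }
}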


package org.elasticsearch.common.compress;

/** Exception indicating that we were expecting something compressed, which

@rjernst (Member) commented on this diff:

copy-pasted from NotCompressedException?

@rjernst (Member) commented on May 23, 2015

LGTM, I left a couple minor comments.

@jpountz force-pushed the fix/simplify_compression branch 2 times, most recently from 2bc10bf to e725532, on May 29, 2015 08:39
@jpountz (Contributor, Author) commented on May 29, 2015

@rjernst Thanks for the review, I pushed a new commit.

@rjernst (Member) commented on May 29, 2015

LGTM

jpountz added a commit that referenced this pull request on May 29, 2015:

Internal: tighten up our compression framework.

LZF only stays for backward-compatibility reasons and can only be read, not
written. DEFLATE is configured to use level=3, which is a nice trade-off
between speed and compression ratio, and is the same level we use for Lucene's
high-compression codec.
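
As a rough illustration of that backward-compatibility split, a read-side dispatcher along these lines keeps legacy LZF blobs readable while all new writes go through DEFLATE. The 'Z', 'V' magic is LZF's real chunk header, but the "DFL\0" marker below is an assumption for illustration, not necessarily what the framework actually writes.

import java.util.Arrays;

final class CompressionDispatchSketch {

    private static final byte[] LZF_MAGIC = { 'Z', 'V' };
    // Assumed marker for illustration only; the real DEFLATE header may differ.
    private static final byte[] DEFLATE_MAGIC = { 'D', 'F', 'L', 0 };

    static boolean startsWith(byte[] bytes, byte[] prefix) {
        return bytes.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(bytes, prefix.length), prefix);
    }

    /** Reads may encounter either format; writes should always pick DEFLATE. */
    static String codecFor(byte[] stored) {
        if (startsWith(stored, LZF_MAGIC)) {
            return "LZF (legacy, read-only)";
        }
        if (startsWith(stored, DEFLATE_MAGIC)) {
            return "DEFLATE (level 3, read/write)";
        }
        throw new IllegalStateException("unknown stored format");
    }
}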
@jpountz merged commit 0f3206e into elastic:master on May 29, 2015
@kevinkluge removed the review label on May 29, 2015
@jpountz deleted the fix/simplify_compression branch on May 29, 2015 15:23
@clintongormley changed the title from "Internal: tighten up our compression framework." to "Tighten up our compression framework." on Jun 7, 2015