Fix a very rare case of corruption in compression used for internal cluster communication. #7210

Closed

rjernst
Member

@rjernst rjernst commented Aug 8, 2014

See CorruptedCompressorTests for details on how this bug can be hit.

@rjernst rjernst changed the title from "Internal: Fix a very rare case of corruption in replication compression." to "Fix a very rare case of corruption in compression used for internal cluster communication." Aug 8, 2014
*/
public class CorruptedCompressorTests extends ElasticsearchTestCase {

public void testCorruption() throws IOException {
Member

missing @Test annotation

Contributor

@Test doesn't do anything :)

Member

I know if the method starts with test we are good anyway but why do we use the annotation all over the place then :) Either we remove it everywhere or we stick to it I'd say...

Member Author

I don't use @Test unless someone makes me, for the exact reason Robert pointed out. It is just extra characters with no benefit.
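
For readers unfamiliar with the convention being debated: a runner can select test methods by name alone, which is why the annotation is redundant here. The following is only a minimal sketch of prefix-based discovery using plain reflection. `PrefixTestDiscovery`, `ExampleTests`, and `discover` are hypothetical names, and this is not how the project's actual test runner is implemented.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of name-based test discovery; not the real runner. */
final class PrefixTestDiscovery {

    /** Stand-in test class: two methods follow the "test" prefix convention, one does not. */
    static final class ExampleTests {
        public void testCorruption() {}
        public void testRoundTrip() {}
        public void helper() {}
    }

    /** Returns the sorted names of public, void, no-arg methods whose name starts with "test". */
    static List<String> discover(Class<?> clazz) {
        List<String> names = new ArrayList<>();
        for (Method m : clazz.getDeclaredMethods()) {
            boolean looksLikeTest = m.getName().startsWith("test")
                    && Modifier.isPublic(m.getModifiers())
                    && m.getReturnType() == void.class
                    && m.getParameterCount() == 0;
            if (looksLikeTest) {
                names.add(m.getName());
            }
        }
        // Reflection does not guarantee any order, so sort for a stable result.
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // Lists the two test-prefixed methods; helper() is skipped.
        System.out.println(discover(ExampleTests.class));
    }
}
```

With a runner that behaves like this, `testCorruption()` is picked up with or without `@Test`, so the question in this thread is purely about consistency across the code base.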

@rmuir
Contributor

rmuir commented Aug 8, 2014

Please disable unsafe encode/decode completely.

* This is a fork of {@link com.ning.compress.lzf.impl.VanillaChunkEncoder} to quickly fix
* an extremely rare bug. See CorruptedCompressorTests for details on reproducing the bug.
*/
public class ElasticsearchChunkEncoder extends VanillaChunkEncoder {
Contributor

Historically, we were using "X" prefix to designate temporary implementations like this one. So, a more traditional name would be XVanillaChunkEncoder. For example 2edde35

Member Author

Ok, changed to XVanillaChunkEncoder.

@rjernst
Member Author

rjernst commented Aug 8, 2014

Ok, I think I addressed all the comments. The only unchanged thing is the license file, because I don't know which license to put in there (the original file had no license header).

@rjernst
Member Author

rjernst commented Aug 9, 2014

The PR to the compress-lzf project was merged, and a 1.0.2 release was made. I removed the X encoder and made the upgrade to 1.0.2.

@@ -1381,6 +1381,7 @@
<!-- t-digest -->
<exclude>src/main/java/org/elasticsearch/search/aggregations/metrics/percentiles/tdigest/TDigestState.java</exclude>
<exclude>src/test/java/org/elasticsearch/search/aggregations/metrics/GroupTree.java</exclude>
<exclude>src/test/java/org/elasticsearch/common/compress/lzf/XVanillaChunkEncoder.java</exclude>
Member

do we still need this with 1.0.2?

Member Author

Good catch. Removed.

internal cluster communication.

See CorruptedCompressorTests for details on how this bug can be hit.
This change also removes the ability to use the unsafe variant of
ChunkedEncoder, removing support for the compress.lzf.decoder setting.
@rmuir
Contributor

rmuir commented Aug 11, 2014

looks good, thanks Ryan.

@jpountz
Contributor

jpountz commented Aug 11, 2014

+1 as well

@rjernst
Member Author

rjernst commented Aug 11, 2014

Thanks. Pushed.

@rjernst rjernst closed this Aug 11, 2014
@rjernst rjernst added the v1.2.4 label Aug 12, 2014
@clintongormley clintongormley changed the title from "Fix a very rare case of corruption in compression used for internal cluster communication." to "Internal: Fix a very rare case of corruption in compression used for internal cluster communication." Sep 8, 2014
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Nov 11, 2014
The compression bug fixed in elastic#7210 can still strike us since we are
running BWC tests against these versions. This commit disables compression
forcefully if the compatibility version is < 1.3.2 to prevent debugging
already known issues.
@rjernst rjernst deleted the fix/compress-corruption branch January 21, 2015 23:22
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around elastic#7210

Closes elastic#9922
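
As a rough illustration of the two rules in the commit message above, the sketch below gates recovery on the source node's version. Every name here (`RecoveryCompressionGate`, `recoveryAllowed`, `forceFullRecovery`) is hypothetical, and versions are plain dotted strings for simplicity. The real logic lives in Elasticsearch's recovery code, which this does not reproduce.

```java
import java.util.Arrays;

/** Hypothetical illustration of version-gated recovery; not the actual Elasticsearch classes. */
final class RecoveryCompressionGate {

    /** Compares two dotted versions such as "1.3.2" component by component. */
    static int compareVersions(String a, String b) {
        int[] x = parse(a);
        int[] y = parse(b);
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? x[i] : 0; // missing components count as 0
            int yi = i < y.length ? y[i] : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    private static int[] parse(String version) {
        return Arrays.stream(version.split("\\.")).mapToInt(Integer::parseInt).toArray();
    }

    /** Rule 1: refuse recovery from a pre-1.3.2 source node while compression is enabled. */
    static boolean recoveryAllowed(String sourceVersion, boolean compressEnabled) {
        boolean sourceHasCompressionBug = compareVersions(sourceVersion, "1.3.2") < 0;
        return !(compressEnabled && sourceHasCompressionBug);
    }

    /** Rule 2: force a full recovery when the source node is older than 1.4.0. */
    static boolean forceFullRecovery(String sourceVersion) {
        return compareVersions(sourceVersion, "1.4.0") < 0;
    }

    public static void main(String[] args) {
        System.out.println(recoveryAllowed("1.1.1", true));   // false: old source, compression on
        System.out.println(recoveryAllowed("1.1.1", false));  // true: compression turned off
        System.out.println(forceFullRecovery("1.3.2"));       // true: source older than 1.4.0
        System.out.println(forceFullRecovery("1.6.0"));       // false
    }
}
```

Under these assumptions, a 1.1.1 source with compression enabled is rejected, which matches the failure reported later in this thread when upgrading from 1.1.1 with `indices.recovery.compress` set to true.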
s1monw added a commit to s1monw/elasticsearch that referenced this pull request Mar 2, 2015
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around elastic#7210

Closes elastic#9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java
@clintongormley clintongormley changed the title from "Internal: Fix a very rare case of corruption in compression used for internal cluster communication." to "Fix a very rare case of corruption in compression used for internal cluster communication." Jun 7, 2015
@taf2

taf2 commented Jun 18, 2015

Upgrading from 1.1.1 to 1.6.0 and noticing this output from our cluster

```
insertOrder timeInQueue priority source
      37659        27ms HIGH     shard-failed ([callers][2], node[Ko3b9KsESN68lTkPtVrHKw], relocating [4mcZCKvBRoKQJS_StGNPng], [P], s[INITIALIZING]), reason [shard failure [failed recovery][RecoveryFailedException[[callers][2]: Recovery failed from [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} into [aws_el1a][Ko3b9KsESN68lTkPtVrHKw][ip-10-55-11-211][inet[/10.55.11.211:9300]]{rack=useast1, zone=zonea, master=true} (unexpected error)]; nested: ElasticsearchIllegalStateException[Can't recovery from node [aws_el1][4mcZCKvBRoKQJS_StGNPng][ip-10-55-11-210][inet[/10.55.11.210:9300]]{rack=useast1, master=true, zone=zonea} with [indices.recovery.compress : true] due to compression bugs -  see issue #7210 for details]; ]]
```

what do we do?

@rjernst
Member Author

rjernst commented Jun 18, 2015

@taf2 Turn off compression before upgrading.

@taf2

taf2 commented Jun 18, 2015

@rjernst thanks! which kind of compression do we disable...

is it this option in

/etc/elasticsearch/elasticsearch.yml
#transport.tcp.compress: true

?

or another option?

@taf2

taf2 commented Jun 18, 2015

okay sorry it looks like we need to disable indices.recovery.compress - but is this something that needs to be disabled on all nodes in the cluster or just the new 1.6.0 node we're starting up now?

@rjernst
Member Author

rjernst commented Jun 18, 2015

All nodes in the cluster, before starting the upgrade. The problem is that old nodes with this setting enabled would use the old buggy code, which can then cause data copied between an old and a new node to become corrupted.

@taf2

taf2 commented Jun 18, 2015

excellent thank you - we have run the following on the existing cluster:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient" : {"indices.recovery.compress" : false }}'

@taf2

taf2 commented Jun 18, 2015

Thank you that did the trick!

mute pushed a commit to mute/elasticsearch that referenced this pull request Jul 29, 2015
This commit forces a full recovery if the source node is < 1.4.0 and
prevents any recoveries from pre 1.3.2 nodes if compression is enabled to
work around elastic#7210

Closes elastic#9922

Conflicts:
	src/main/java/org/elasticsearch/indices/recovery/RecoveryTarget.java