configureCompression only creates compressor once (fixes compression problems with Deflate and Snappy) #352

Merged
merged 1 commit into from
Nov 25, 2014

@jbeynon jbeynon commented Nov 22, 2014

Since 0.8.5 I had noticed that compression was broken when using Avro, but I was never bothered enough to look into it. With the latest 0.9.0 release I finally took some time and found the issue. The problem manifests itself as either an OutOfMemoryError or YARN killing tasks for going "beyond memory limits" when using Deflate or Snappy. It isn't specific to Avro; I only noticed it there because Avro silently changes GZip to Deflate.

2014-11-21 20:00:33,234 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:658)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
    at org.apache.hadoop.io.compress.snappy.SnappyCompressor.<init>(SnappyCompressor.java:82)
    at org.apache.hadoop.io.compress.SnappyCodec.createCompressor(SnappyCodec.java:147)
    at com.nicta.scoobi.core.Compression$$anonfun$getCompressor$1.apply(DataSink.scala:95)
    at com.nicta.scoobi.core.Compression$$anonfun$getCompressor$1.apply(DataSink.scala:95)
    at com.nicta.scoobi.impl.control.Exceptions$class.trye(Exceptions.scala:106)
    at com.nicta.scoobi.impl.control.Exceptions$.trye(Exceptions.scala:128)
    at com.nicta.scoobi.core.Compression$.getCompressor(DataSink.scala:95)
    at com.nicta.scoobi.core.DataSink$$anonfun$configureCompression$2.apply(DataSink.scala:55)
    at com.nicta.scoobi.core.DataSink$$anonfun$configureCompression$2.apply(DataSink.scala:54)
    at scala.Option.filter(Option.scala:181)
    at com.nicta.scoobi.core.DataSink$class.configureCompression(DataSink.scala:54)
    at com.nicta.scoobi.io.text.TextFileSink.configureCompression(TextFileSink.scala:38)
    at com.nicta.scoobi.io.text.TextFileSink.configureCompression(TextFileSink.scala:38)
    at com.nicta.scoobi.impl.plan.mscr.MscrOutputChannel$$anon$1$$anonfun$write$1.apply(OutputChannel.scala:172)
    at com.nicta.scoobi.impl.plan.mscr.MscrOutputChannel$$anon$1$$anonfun$write$1.apply(OutputChannel.scala:171)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at com.nicta.scoobi.impl.plan.mscr.MscrOutputChannel$$anon$1.write(OutputChannel.scala:171)
    at com.nicta.scoobi.core.EnvDoFn$$anon$1.emit(EnvDoFn.scala:38)

This stack trace, from explicitly using Snappy, is what helped me. Basically, the issue is that DataSink.configureCompression is called for every emit from EnvDoFn. It creates a compressor to test that the settings work and then promptly discards it. The problem is that the Deflate and Snappy compressors create off-heap buffers using ByteBuffer.allocateDirect, and this memory does not get GC'd as you'd expect. So each emit leaks 64 KB (for Snappy; I'm not sure of the buffer size for Deflate).

Anyway, so I added a simple check in DataSink so that configureCompression only actually acts once and things seem to work now. I haven't run a full regression but this is a pretty mild change.
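
For readers who want to see the shape of that check, here is a minimal, self-contained sketch. It is not the code from this PR: `FakeCompressor`, `GuardedSink`, `compressorsCreated` and `GuardedSinkDemo` are hypothetical names invented for illustration, and the 64 KB buffer size is simply the figure mentioned above. The only real API used is `java.nio.ByteBuffer.allocateDirect`. The sketch is not synchronized, so it assumes a single writer thread.

```scala
import java.nio.ByteBuffer

// Hypothetical stand-in for a Hadoop compressor: like the SnappyCompressor
// constructor in the stack trace, it allocates a direct (off-heap) buffer.
final class FakeCompressor(bufferSize: Int) {
  val buffer: ByteBuffer = ByteBuffer.allocateDirect(bufferSize)
}

// Hypothetical sink (NOT Scoobi's DataSink) showing the "act only once" guard.
final class GuardedSink(bufferSize: Int = 64 * 1024) {
  // Counter kept only so the example can show how many compressors were built.
  var compressorsCreated: Int = 0

  // The guard: flipped to true after the first successful check.
  private var configured: Boolean = false

  def configureCompression(): Unit =
    if (!configured) {
      // Build one compressor to validate the settings, then remember that
      // the check has been done so later calls allocate nothing.
      val probe = new FakeCompressor(bufferSize)
      require(probe.buffer.capacity == bufferSize)
      compressorsCreated += 1
      configured = true
    }

  // Called once per record, mirroring the per-emit call path described above.
  def emit(record: String): Unit = {
    configureCompression()
    // Writing the record is elided in this sketch.
  }
}

object GuardedSinkDemo {
  // Emits `emits` records and returns how many compressors were created.
  def run(emits: Int): Int = {
    val sink = new GuardedSink()
    (1 to emits).foreach(i => sink.emit(s"record-$i"))
    sink.compressorsCreated
  }
}
```

Without the `configured` flag, each call to `emit` would allocate a fresh direct buffer; with it, the allocation happens once per sink no matter how many records are emitted.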

By the way, this would have been much easier to diagnose with debug logging: configureCompression has a debug log in it, so you'd see thousands of messages in the logs. But I couldn't get debug output working in the latest build. No matter what I tried, I got either default logging or no logging.

@jbeynon
Author

jbeynon commented Nov 25, 2014

@etorreborre
Can you please take a look at this? It's been a major blocker in upgrading past Scoobi 0.8.3 and is necessary for anyone using Avro and compression.

@markhibberd
Collaborator

@jbeynon I have re-triggered the build, to see if it goes green this time. If so, I will merge.

@xelax
Contributor

xelax commented Nov 25, 2014

Thanks mark! and if you could publish 0.9.1 it would be awesome ;-)

markhibberd added a commit that referenced this pull request Nov 25, 2014
configureCompression only creates compressor once (fixes compression problems with Deflate and Snappy)
@markhibberd markhibberd merged commit f948d43 into NICTA:master Nov 25, 2014
@jbeynon jbeynon deleted the compress branch November 25, 2014 20:40
@etorreborre
Collaborator

Released now (sorry for not having been very reactive with the merge, thanks @markhibberd!).

@jbeynon
Author

jbeynon commented Nov 26, 2014

No worries. I know how easy it is for notifications to get lost in the noise.

@xelax
Contributor

xelax commented Nov 26, 2014

Eric and Mark, thank you!
About the release: I believe you forgot to publish the 2.10 version.
https://oss.sonatype.org/content/repositories/releases/com/nicta/scoobi_2.11/ has 0.9.1, but
it is missing from:
https://oss.sonatype.org/content/repositories/releases/com/nicta/scoobi_2.10/

@etorreborre
Collaborator

Alex, the release is now available for 2.10.
