Intel deflater + long reads data = intermittent corrupt bams #99

Open
droazen opened this issue Mar 15, 2019 · 3 comments

@droazen

commented Mar 15, 2019

@kvg reports that running the Intel deflater via GATK on long reads data intermittently produces corrupt bam outputs. His specific use case is sharding a single unaligned bam file into multiple smaller bams. Running with the JDK deflater (--use-jdk-deflater) appears to resolve the issue.

Example error when trying to read a corrupt shard (reading with htsjdk produces the same error):

$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v1/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v2/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v3/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v4/ -nr 100000
$ java -jar gatk.jar SplitSubreadsByZmw -I sharding_test.bam -O intel_compression/v5/ -nr 100000

$ samtools view intel_compression/v1/sharding_test.000002.bam > /dev/null
$ samtools view intel_compression/v2/sharding_test.000002.bam > /dev/null
[E::bgzf_read] Read block operation failed with error 2 after 30675 of 72043 bytes
[main_samview] truncated file.
$ samtools view intel_compression/v3/sharding_test.000002.bam > /dev/null
$ samtools view intel_compression/v4/sharding_test.000002.bam > /dev/null
$ samtools view intel_compression/v5/sharding_test.000002.bam > /dev/null

(Only the second attempt yields a corrupted file; runs before and after appear to be correct, despite nothing changing between steps.)
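To screen every shard from repeated runs without going through samtools, the BGZF container can be walked block by block. The following is a minimal sketch, assuming the standard BGZF layout described in the SAM/BAM specification (a gzip member with a `BC` extra subfield that holds the block size minus one). The names `check_bgzf` and `_bsize_from_extra` are hypothetical helpers written for this illustration; they are not part of GATK, htsjdk, GKL, or samtools, and the sketch only reports where a file stops being valid, not why.

```python
import struct
import zlib


def _bsize_from_extra(extra):
    """Return BSIZE (total block size minus 1) from the 'BC' subfield, or None."""
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
        if si1 == 66 and si2 == 67 and slen == 2 and pos + 6 <= len(extra):
            return struct.unpack_from("<H", extra, pos + 4)[0]
        pos += 4 + slen
    return None


def check_bgzf(path):
    """Walk all BGZF blocks in a file (hypothetical helper).

    Returns (number_of_valid_blocks, error_message_or_None).
    """
    blocks = 0
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            header = fh.read(12)
            if not header:
                return blocks, None  # clean end of file
            if len(header) < 12:
                return blocks, f"truncated block header at offset {offset}"
            id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack(
                "<BBBBIBBH", header
            )
            if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
                return blocks, f"not a BGZF block at offset {offset}"
            extra = fh.read(xlen)
            bsize = _bsize_from_extra(extra) if len(extra) == xlen else None
            if bsize is None:
                return blocks, f"missing BC subfield at offset {offset}"
            # Bytes left in this block: deflate payload + CRC32 + ISIZE.
            rest_len = bsize + 1 - 12 - xlen
            if rest_len < 8:
                return blocks, f"malformed block size at offset {offset}"
            rest = fh.read(rest_len)
            if len(rest) < rest_len:
                return blocks, (
                    f"truncated block at offset {offset}: "
                    f"got {len(rest)} of {rest_len} bytes"
                )
            try:
                data = zlib.decompress(rest[:-8], wbits=-15)  # raw deflate
            except zlib.error as exc:
                return blocks, f"inflate failed at offset {offset}: {exc}"
            crc, isize = struct.unpack("<II", rest[-8:])
            if zlib.crc32(data) != crc or len(data) != isize:
                return blocks, f"CRC32/ISIZE mismatch at offset {offset}"
            blocks += 1
```

Calling `check_bgzf("intel_compression/v2/sharding_test.000002.bam")` on each run's output would return the count of blocks that validated and, for a damaged shard, the byte offset of the first bad block, which may help narrow down which write produced it.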

There may be a bug in https://github.com/Intel-HLS/GKL/blob/master/src/main/native/compression/IntelDeflater.cc, perhaps triggered when a read spans many compressed blocks.

This bug is also tracked in the GATK repo: broadinstitute/gatk#5798

@mepowers

commented Apr 12, 2019

Hi @droazen, @kvg - Is this a blocker? Or is running with the JDK deflater a workaround?

Please feel free to tag me in future GKL issues so I'm sure to see them.

@kvg

commented Apr 12, 2019

@mepowers Running the JDK deflater does permit us to work around the issue for now. However, the JDK deflater is ~6x slower, so it is unfortunately a rather expensive workaround.

@mepowers

commented Apr 12, 2019

@kvg - thanks for the update. @droazen mentioned you have a bunch of test cases. Have you confirmed it's triggered when a read spans many compressed blocks?
