
ComputeSamTags java.lang.OutOfMemoryError: Java heap space error #312

Closed
emyli14 opened this issue Mar 20, 2020 · 11 comments


emyli14 commented Mar 20, 2020

I am trying to run the preprocessing step of gridss on a single file to test the pipeline and am hitting an Exception in thread "main" java.lang.OutOfMemoryError: Java heap space during the ComputeSamTags command. I have tried adjusting the value of --jvmheap to no avail. Can you offer any advice? Thanks.

My gridss command line is:
gridss.sh --reference $REFERENCE --output $OUTPUT --assembly $ASSEMBLY --workingdir $WORKINGDIR --blacklist $BLACKLIST --steps PreProcess $INPUT_1

The error in the output log file reads:
19:24:41.164 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/apps/gridss/2.8.0/bin/gridss-2.8.0-gridss-jar-with-dependencies.jar!/com/intel/gkl/native/libgkl_compression.so
[Thu Mar 19 19:24:41 AEDT 2020] ComputeSamTags INPUT=./output/gridss/Single/intermediate/HY5NTCCXX_4_160805_FR07885900_Homo-sapiens__R_160712_KELWIL1_BLAIRDNA_M004.dupmarked.realigned.recalibrated.bam.gridss.working/tmp.HY5NTCCXX_4_160805_FR07885900_Homo-sapiens__R_160712_KELWIL1_BLAIRDNA_M004.dupmarked.realigned.recalibrated.bam.namedsorted.bam OUTPUT=/dev/stdout ASSUME_SORTED=true SOFTEN_HARD_CLIPS=true FIX_MATE_INFORMATION=true FIX_DUPLICATE_FLAG=true FIX_SA=true FIX_MISSING_HARD_CLIP=true RECALCULATE_SA_SUPPLEMENTARY=true TAGS=[R2, Q2, MQ, SA, MC, NM] WORKING_DIR=./output/gridss/Single/intermediate TMP_DIR=[./output/gridss/Single/intermediate/HY5NTCCXX_4_160805_FR07885900_Homo-sapiens__R_160712_KELWIL1_BLAIRDNA_M004.dupmarked.realigned.recalibrated.bam.gridss.working] COMPRESSION_LEVEL=0 REFERENCE_SEQUENCE=./genome2.fa IGNORE_DUPLICATES=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Thu Mar 19 19:24:41 AEDT 2020] Executing as mcc549@c324 on Linux 4.12.14-95.32-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.8.0-gridss
INFO	2020-03-19 19:24:42	SAMFileWriterFactory	Unknown file extension, assuming BAM format when writing file: file:///dev/stdout
INFO	2020-03-19 19:24:45	ComputeSamTags	Seen many non-increasing record positions. Printing Read-names as well.
INFO	2020-03-19 19:25:13	ComputeSamTags	Processed 1,000,000 records. Elapsed time: 00:00:30s. Time for last 1,000,000: 30s. Last read position: 8:112,220,272. Last read name: ST-E00152:202:HY5NTCCXX:4:1102:31020:32127
Exception in thread "tmp.HY5NTCCXX_4_160805_FR07885900_Homo-sapiens__R_160712_KELWIL1_BLAIRDNA_M004.dupmarked.realigned.recalibrated.bam.namedsorted.bam-tags" java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
	at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:189)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.ArrayDeque.doubleCapacity(ArrayDeque.java:157)
	at java.util.ArrayDeque.addFirst(ArrayDeque.java:231)
	at au.edu.wehi.idsv.sam.SAMRecordUtil.calculateSATags(SAMRecordUtil.java:1049)
	at au.edu.wehi.idsv.sam.SAMRecordUtil.calculateTemplateTags(SAMRecordUtil.java:729)
	at au.edu.wehi.idsv.sam.TemplateTagsIterator.ensureQueue(TemplateTagsIterator.java:47)
	at au.edu.wehi.idsv.sam.TemplateTagsIterator.hasNext(TemplateTagsIterator.java:54)
	at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:165)
	... 1 more
[Thu Mar 19 19:25:40 AEDT 2020] gridss.ComputeSamTags done. Elapsed time: 0.99 minutes.
Runtime.totalMemory()=4151836672
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
	at java.util.ArrayDeque.doubleCapacity(ArrayDeque.java:157)
	at java.util.ArrayDeque.addFirst(ArrayDeque.java:231)
	at au.edu.wehi.idsv.sam.SAMRecordUtil.calculateSATags(SAMRecordUtil.java:1049)
	at au.edu.wehi.idsv.sam.SAMRecordUtil.calculateTemplateTags(SAMRecordUtil.java:729)
	at au.edu.wehi.idsv.sam.TemplateTagsIterator.ensureQueue(TemplateTagsIterator.java:47)
	at au.edu.wehi.idsv.sam.TemplateTagsIterator.hasNext(TemplateTagsIterator.java:54)
	at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:165)
	at java.lang.Thread.run(Thread.java:748)
[bam_sort_core] merging from 0 files and 8 in-memory blocks...

@d-cameron
Member

--jvmheap only affects the assembly and variant calling steps. The pre-processing steps use hardcoded heap sizes - ComputeSamTags uses 4GB. There are a few possibilities I can think of that might be causing this:

  • You're genuinely out of memory. On a human-sized genome, you'll want at least 16GB of memory. The lowest GRIDSS will run with is 12GB (4GB for SoftClipsToSplitReads, plus another 8GB for the bwa mem process that SoftClipsToSplitReads spawns). If you're dealing with small genomes, the minimum memory is much lower due to the smaller bwa index size.
    • The samtools sort step that runs in parallel with ComputeSamTags is using all your available memory. The number of threads defaults to the number of threads available on the system, and samtools sort uses ~768MB per thread. Specifying a thread count will limit the memory used by samtools.
  • Your input BAM is problematic:
    • Read names are not unique per fragment/read pair as required by the SAM specification, and too much is being loaded into memory at once. The log file includes a read name of ST-E00152:202:HY5NTCCXX:4:1102:31020:32127, so it's probably not this.
    • Big reference genome. GRIDSS caches a (2-bit encoded) copy of the reference genome in memory. If you have a large reference genome, 4GB won't be enough. Similarly, if you have a long contig, GRIDSS might OOM while it's still 2-bit encoding it (it loads the entire contig). Does genome2.fa.gridsscache exist, and if so, how big is it? (A quick check of the reference index is sketched after this list.)
    • Overly aggressive read-ahead/async IO caching. GRIDSS does IO buffering; internal caches are generally set to 4MB or 500 reads (depending on whether bytes or reads are being cached), so this should total less than 50MB. This assumption might be wrong if your data doesn't look like normal Illumina data.
    • Long read sequencing. GRIDSS isn't designed for PacBio or ONT data.
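
If you want to rule out the reference-genome possibilities, a quick look at the .fai index is usually enough. A minimal sketch, assuming genome2.fa has already been indexed with samtools faidx:

    # Five longest contigs (name, length) from the faidx index
    cut -f1,2 genome2.fa.fai | sort -k2,2nr | head -n 5
    # Total reference size in bases
    awk '{ total += $2 } END { print "total bases:", total }' genome2.fa.fai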

Potential fixes:

  • Edit gridss.sh to give ComputeSamTags more memory (see the sketch after this list)
  • Use a machine with more memory (if you've actually run out)
  • Explicitly specify thread count (if you're on a cluster or a machine with high core count)
  • Edit gridss.sh to disable async IO and buffering (delete the 4 -Dsamjdk. options around line 390 of gridss.sh)
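
Roughly what those gridss.sh edits look like (a sketch only - the exact line numbers and which -Dsamjdk. options are set depend on the gridss.sh version you have installed):

    # Around the ComputeSamTags invocation: raise the hard-coded heap, e.g.
    { $timecmd java -Xmx8g $jvm_args \
            -cp $gridss_jar gridss.ComputeSamTags \
    # ... rest of the invocation unchanged ...
    #
    # To disable async IO/buffering, remove options along these lines from the
    # jvm_args definition (around line 390 in the version discussed here):
    #   -Dsamjdk.use_async_io_read_samtools=true
    #   -Dsamjdk.use_async_io_write_samtools=true
    #   -Dsamjdk.buffer_size=4194304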


DC23 commented Mar 25, 2020

Hi, I am assisting @emyli14 by creating a Snakemake workflow to run on a compute cluster. I encountered the same OOM exception today before being directed to this issue. I can supply this additional information:

  • Preprocessing is running on a cluster node with 128G RAM and 20 cores.
  • I have tried both "--threads 20" and "--threads 5": same OOM exception
  • Prior to reading this issue, I had tried specifying "--jvmheap 60g"
  • After running to the OOM, I cannot find genome2.fa.gridsscache anywhere in the output or working directories. Is there another location I should look?

I can test your suggested modifications to gridss.sh, but I need a sys admin to create a copy of the environment module; I can't edit the version shared by all users.

Also, given your comments regarding samtools sort, do you have a suggested number of threads? How well does this scale on average?

And for running with 128G RAM, what value for jvmheap do you suggest?


DC23 commented Mar 31, 2020

OK, I have a duplicate test module set up. The relevant line for adjusting ComputeSamTags memory appears to be line 450:

			echo "$(date)	ComputeSamTags|samtools	$f" | tee -a $timinglogfile
			{ $timecmd java -Xmx4g $jvm_args \
					-cp $gridss_jar gridss.ComputeSamTags \

Not sure of a good value, so I will increase by 2G increments and test.
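
For the record, the edit itself is just the -Xmx value on that line, e.g. the first 2G increment:

			{ $timecmd java -Xmx6g $jvm_args \
					-cp $gridss_jar gridss.ComputeSamTags \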


DC23 commented Apr 3, 2020

A bit more information. I modified gridss.sh to take an extra argument, --stheap, to control the jvmheap value given to the ComputeSamTags command. Pre-processing with 10G and 2 threads on a 20-core, 128G RAM compute node ran further than before, but this time I get an IllegalStateException:

[Thu Apr 02 12:47:46 AEDT 2020] Executing as col52j@c232 on Linux 4.12.14-95.32-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.8.0-gridss
INFO	2020-04-02 12:47:47	SAMFileWriterFactory	Unknown file extension, assuming BAM format when writing file: file:///dev/stdout
INFO	2020-04-02 12:47:47	ComputeSamTags	Seen many non-increasing record positions. Printing Read-names as well.
INFO	2020-04-02 12:48:02	ComputeSamTags	Processed     1,000,000 records.  Elapsed time: 00:00:15s.  Time for last 1,000,000:   15s.  Last read position: 4:43,277,947.  Last read name: ST-E00152:206:H573MALXX:1:1102:11830:23284
INFO	2020-04-02 12:48:14	ComputeSamTags	Processed     2,000,000 records.  Elapsed time: 00:00:27s.  Time for last 1,000,000:   11s.  Last read position: 8:46,844,104.  Last read name: ST-E00152:206:H573MALXX:1:1103:24495:3805
INFO	2020-04-02 12:48:40	ComputeSamTags	Processed     3,000,000 records.  Elapsed time: 00:00:52s.  Time for last 1,000,000:   25s.  Last read position: hs37d5:27,066,726.  Last read name: ST-E00152:206:H573MALXX:1:1105:5518:41005
Exception in thread "tmp.H573MALXX_1_160713_FR07885751_Homo-sapiens__R_160712_KELWIL1_BLAIRDNA_M002.dupmarked.realigned.recalibrated.bam.namedsorted.bam-tags" java.lang.RuntimeException: java.lang.IllegalStateException: Sorry, deque too big
	at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:189)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalStateException: Sorry, deque too big
	at java.util.ArrayDeque.doubleCapacity(ArrayDeque.java:156)
	at java.util.ArrayDeque.addFirst(ArrayDeque.java:231)
	at au.edu.wehi.idsv.sam.SAMRecordUtil.calculateSATags(SAMRecordUtil.java:1049)
	at au.edu.wehi.idsv.sam.SAMRecordUtil.calculateTemplateTags(SAMRecordUtil.java:729)
	at au.edu.wehi.idsv.sam.TemplateTagsIterator.ensureQueue(TemplateTagsIterator.java:47)
	at au.edu.wehi.idsv.sam.TemplateTagsIterator.hasNext(TemplateTagsIterator.java:54)
	at au.edu.wehi.idsv.util.AsyncBufferedIterator$ReaderRunnable.run(AsyncBufferedIterator.java:165)

@d-cameron
Member

Ok, with that error message my best guess is that you don't have unique read names. Do all your SAM records have the same read name?

SAM section 1.4.1 states:

Reads/segments having identical QNAME are regarded to come from the same template. A QNAME ‘*’ indicates the information is unavailable. In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given.

GRIDSS does not support unnamed (*) reads or non-unique read names.
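
A rough way to eyeball this from the command line (a sketch; input.bam stands in for your actual BAM - a handful of records per name from mates and split reads is normal, thousands of records sharing one name is not):

    # Most frequent read names in the first few million records
    samtools view input.bam | head -n 5000000 | cut -f1 | sort | uniq -c | sort -rn | head
    # Any unnamed (*) reads?
    samtools view input.bam | head -n 5000000 | cut -f1 | grep -c '^\*$'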

@d-cameron
Member

I cannot find genome2.fa.gridsscache anywhere in the output or working directories. Is there another location I should look?

In the same location as the reference genome itself. If it's not writable, GRIDSS will just silently fail and continue, since the genome caching isn't critical.
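
e.g., using the reference path from your log (missing files simply won't be listed):

    ls -lh ./genome2.fa ./genome2.fa.fai ./genome2.fa.gridsscache 2>/dev/null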


emyli14 commented Apr 7, 2020

I have confirmed that all read names are unique and no reads are unnamed.

@d-cameron
Member

This one has me stumped. Is it possible to share the data (or a subset thereof)?

  • Is your aligner reporting secondary alignments (i.e. multiple alignments when MAPQ is low)?
  • Do any of your SAM records have HI, IH, NH, FI, FS, TC, CC or CP tags?

Coincidentally, I happen to have just rewritten the function the crash occurs in. Would you be able to rerun with the latest GRIDSS version (2.8.3)?

@d-cameron
Member

NB: tags can be checked by looking at INPUT.gridss.working/INPUT.tag_metrics
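
If that metrics file hasn't been produced yet, a direct (if slower) check along these lines also works (a sketch; input.bam is a placeholder and the pattern only approximately anchors on the tag names listed above):

    samtools view input.bam | head -n 1000000 \
        | grep -oE '(HI|IH|NH|FI|FS|TC|CC|CP):[AifZHB]' | sort | uniq -c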


emyli14 commented Apr 27, 2020

Thanks for your help and suggestions.

Happy to share a subset of the data. I've attached most of my output files to this dropbox folder.
https://www.dropbox.com/sh/v8j9hutsqdxzhas/AAC0hXwb3i78LiIWuvkEz6Ola?dl=0

None of the tag metrics you mentioned above seem to be present, and as far as I can tell there are no secondary alignments reported.

Will try to get GRIDSS version (2.8.3) running.

d-cameron pushed a commit that referenced this issue Apr 27, 2020
@d-cameron
Member

Should be fixed in the next release.
