
nphase takes a long time at GATK step #26

Open
mdondrup opened this issue Jan 12, 2024 · 4 comments

Comments

mdondrup commented Jan 12, 2024

I am now running nPhase on my real data and noticed it takes a very long time to finish. My data is yeast ONT reads with ~120X coverage and Illumina reads with ~100X coverage.

It seems nPhase is stuck in the GATK HaplotypeCaller step on the short reads, because this is the only process running (at only ~100% CPU, i.e. a single core):

michaeld  40243  40242 98 Jan02 ?        9-21:53:35 java -Dsamjdk.use_async_io_read_samtools=false 
-Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 
-jar /Home/ii/michaeld/miniconda2/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller 
-R ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna -ploidy 2 -I ./nphase_out/Kveik_sample_6/Mapped/shortReads/Kveik_sample_6.final.bam -O ./nphase_out/Kveik_sample_6/VariantCalls/shortReads/Kveik_sample_6.vcf

This is the full command line I am using to run it:

nohup nice -2 nphase pipeline --sampleName Kveik_sample_6 --longReads sample6_cleaned_trimmed.fastq.gz 
--longReadPlatform  ont --R1 ../../Illumina_seq/Fastq/6-1_S8_R1_001.fastq.gz --R2 ../../Illumina_seq/Fastq/6-1_S8_R2_001.fastq.gz  \
--reference  ../../reference_genome/GCF_000146045.2/GCF_000146045.2_R64_genomic.fna --output ./nphase_out --threads 120 >nphase_nohup.out &

The machine has 144 threads and 2 TB RAM, so resources should not be the bottleneck. I also remember that variant calling on the short reads didn't take this long when I ran GATK directly. Could some performance options be set for Java/GATK to increase the speed?

@OmarOakheart (Owner)

That's odd, maybe the --threads argument isn't being passed properly? I'm not sure what's causing this.

The quickest solution, I think, is to avoid running GATK through nPhase by providing a VCF file directly (one you generated previously by running GATK) with the following command:

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE --vcf VCF_FILE --longReadPlatform {ont,pacbio} --threads 8

It should work with any VCF generated by GATK, but filter out any INDELs first; nPhase only tries to phase SNPs.
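For the INDEL filtering, a minimal sketch using GATK's SelectVariants (file names are illustrative placeholders, not from this thread):

```shell
# Keep only SNPs before handing the VCF to nPhase; drop INDELs and other types.
# reference.fna / sample.vcf are placeholder names.
gatk SelectVariants \
  -R reference.fna \
  -V sample.vcf \
  --select-type-to-include SNP \
  -O sample.snps.vcf
```

`bcftools view -v snps sample.vcf -o sample.snps.vcf` would be an equivalent route if bcftools is already installed.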

@mdondrup (Author)

Sorry for the delay in my response. I was investigating whether I could use the precomputed VCF files. However, I noticed some differences in the workflow which may make recomputing necessary or may even explain the difference in run-time.

Firstly, I ran MarkDuplicates on the aligned bam file:

gatk --java-options "-Xmx200G -Djava.io.tmpdir=$TMPDIR" MarkDuplicatesSpark --spark-master local[60] -R $REF -I $BAM -O bam/$BASE-dedup.bam

The question is: is it acceptable or required to run MarkDuplicates for nPhase? I didn't find that step in the pipeline code.
Secondly, I generated files in gVCF format, but it might be possible to convert these.

This is how I called gatk HaplotypeCaller:

gatk --java-options "-XX:ConcGCThreads=10 -XX:ParallelGCThreads=10 -Xmx500G -Djava.io.tmpdir=$TMPDIR" \
  HaplotypeCaller --ploidy $P --native-pair-hmm-threads 60 -R $REF \
  -G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation \
  --tmp-dir $TMPDIR \
  -I bam/$BASE-dedup.bam -O vcf/$BASE.Ploidy${P}.g.vcf -ERC GVCF

I suspect that running MarkDuplicates first is what made the step much faster.

@OmarOakheart (Owner)

nPhase only uses the short reads to obtain a VCF with high-confidence variant positions, so MarkDuplicates (and any filtering steps you wish to perform) is fine. I think the gVCF files would need to be converted first, but I'm not sure; I'd have to look at the differences between the output formats and check whether my code would be affected. It may be easier to simply try it, since it would result in an error if anything's out of place.
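If conversion does turn out to be necessary, the standard GATK route from a gVCF to a genotyped VCF is GenotypeGVCFs; a sketch, reusing the variable names from the HaplotypeCaller command above (paths are illustrative):

```shell
# Convert the per-sample gVCF into a regular genotyped VCF,
# which could then be SNP-filtered and passed to nphase partial.
gatk GenotypeGVCFs \
  -R $REF \
  -V vcf/$BASE.Ploidy${P}.g.vcf \
  -O vcf/$BASE.Ploidy${P}.vcf
```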

mdondrup (Author) commented Jan 23, 2024

After some more attempts to debug this, I conclude that this is in part a technical problem with GATK and the /tmp directory setup on that specific machine (an issue registered with GATK here). The first problem is that the /tmp directory on that server is mounted noexec.
If I run HaplotypeCaller with the default settings, I get the following error messages in the log:

15:18:46.056 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/Home/ii/michaeld/miniconda2/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
15:18:46.058 WARN  NativeLibraryLoader - Unable to load libgkl_utils.so from native/libgkl_utils.so (/tmp/libgkl_utils9418239050694741169.so: /tmp/libgkl_utils9418239050694741169.so: failed to map segment from shared object: Operation not permitted)
15:18:46.058 WARN  IntelPairHmm - Intel GKL Utils not loaded
15:18:46.058 INFO  PairHMM - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported

This can be worked around by passing Java options on the command line. When running GATK through the nPhase pipeline, I set export _JAVA_OPTIONS=-Djava.io.tmpdir=/export/scratch/tmp/ before running nPhase. This is also described in the GATK manual. I have seen a few reports where AVX support could not be found even though the processor supports it.

The second part is that the number of PairHMM threads is set via --native-pair-hmm-threads, and the default seems to be 4. So I set --native-pair-hmm-threads 60, which again significantly speeds things up. Possibly, nPhase's --threads parameter could be passed down to GATK this way.
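Putting the two fixes together, a hedged sketch of the workaround when calling GATK directly (the tmpdir path and thread count come from this report; sample paths are placeholders):

```shell
# 1) Point Java's tmpdir away from a noexec /tmp so the Intel GKL
#    native libraries (AVX PairHMM) can be mapped and loaded.
export _JAVA_OPTIONS="-Djava.io.tmpdir=/export/scratch/tmp/"

# 2) Raise the PairHMM thread count from its default of 4.
gatk HaplotypeCaller \
  --native-pair-hmm-threads 60 \
  -R reference.fna -ploidy 2 \
  -I sample.final.bam \
  -O sample.vcf
```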
