
Duration and Heterozygosity #22

Open
rmormando opened this issue Sep 22, 2022 · 28 comments

Comments

@rmormando

Hi, I'm trying to run nPhase on a tetraploid yeast genome I sequenced with PacBio HiFi reads.

Here is my code:

nphase pipeline --sampleName trialRun1 --reference s288c.fa --longReads pacbioHiFireads.fastq.gz --longReadPlatform pacbio --R1 illuminaShortRead_1.fastq --R2 illuminaShortRead_2.fastq --output trialRun1Output --threads 8

The issue I'm having is that it's taking a long time to run. My computer accidentally restarted during the run, so the progress from the past day or two is gone, and before I run the command again I wanted to see if there was a flag I could add, or something I could do to the data, to make it go faster.

I saw your messages in issues #11 and #10 but was wondering whether anything has been done to the nPhase code to account for this. I also read your solution to #12, but I'm not sure how to go about it. I can run the command I wrote above, and I don't mind if it takes longer, but how much time, on average, would it take for a tetraploid yeast genome to run through nPhase?

@OmarOakheart
Owner

Before starting over from scratch, you should check whether the long read mapping, short read mapping and variant calling steps finished. If so, then you can save a lot of time by running nphase partial instead of rerunning the entire pipeline (since you only need to map and variant call once). Realistically, if it ran for an entire day I would expect at least one of the mapping steps to be done, unless you have particularly high coverage or slow processing.

If you have any output in the Phased folder, that's a good sign that all of the previous steps finished. If you're unsure how to determine whether the pre-processing steps ended, I can look into it later and give you some pretty clear ways of identifying if something was interrupted partway through.
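
In the meantime, a quick check would be something like this (the paths assume the default layout under your --output folder, so trialRun1Output in your case; adjust if yours differ):

find trialRun1Output/Phased -type f   # any non-empty output here suggests the earlier steps finished
ls -lh trialRun1Output/longReads/ trialRun1Output/variantCalled/shortReads/   # mapped long reads and the short read VCF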

@rmormando
Author

It looks like there is data in both the long read and short read mapping folders, but there are two empty folders (FastQ and Plots) in the Phased folder, which I'm guessing means it cut off right in the middle of the phasing step?

What would the command look like to just do the partial run? If you can look into that later and let me know some ways to check its progress, that would be really helpful!

@OmarOakheart
Owner

OmarOakheart commented Sep 22, 2022

If the long reads are mapped and the short reads are mapped and variant called, then you can run

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE --vcf VCF_FILE --mappedLongReads MAPPED_LONG_READ_FILE --threads 8

Where MAPPED_LONG_READ_FILE looks like ${OUTPUT_FOLDER}/longReads/${SAMPLE_NAME}.sorted.sam
and VCF_FILE looks like ${OUTPUT_FOLDER}/variantCalled/shortReads/${SAMPLE_NAME}.SNPs.vcf

If the variant calling didn't finish but the short reads were mapped, then it's

nphase partial --sampleName SAMPLE_NAME --reference REFERENCE --output OUTPUT_FOLDER --longReads LONG_READ_FILE --mappedLongReads MAPPED_LONG_READ_FILE --mappedShortReads MAPPED_SHORT_READ_FILE --threads 8

where MAPPED_SHORT_READ_FILE looks like ${OUTPUT_FOLDER}/shortReads/${SAMPLE_NAME}.final.bam
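
For example, plugging in the file names from your original pipeline command, the first form would look something like this (the .sorted.sam and .SNPs.vcf paths just follow the patterns above, so double-check them against what's actually in your output folder):

nphase partial --sampleName trialRun1 --reference s288c.fa --output trialRun1Output --longReads pacbioHiFireads.fastq.gz --vcf trialRun1Output/variantCalled/shortReads/trialRun1.SNPs.vcf --mappedLongReads trialRun1Output/longReads/trialRun1.sorted.sam --threads 8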

@rmormando
Author

Hello! I'm so sorry for the long wait. I started working through other projects, but I have finally circled back to this issue and I have some updates.

I went to run nphase partial using the mapped long read SAM and the VCF file from my earlier nphase pipeline run. After running for about two hours, this is what it printed to STDOUT:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF

And then this is what was in the log file:

COMMAND: gatk CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta

STDERR:

INFO	2022-12-13 17:07:36	CreateSequenceDictionary	Output dictionary will be written in Saccharomyces_cerevisiae.dict
17:07:36.994 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Dec 13 17:07:37 UTC 2022] CreateSequenceDictionary --REFERENCE Saccharomyces_cerevisiae.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Dec 13 17:07:37 UTC 2022] Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux 5.10.155-138.670.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
[Tue Dec 13 17:07:37 UTC 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=159383552
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: /home/ec2-user/Saccharomyces_cerevisiae.dict already exists.  Delete this file and try again, or specify a different output file.
	at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:220)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta


STDOUT:



COMMAND: gatk SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf -O ./nphase_work/OYL011/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf --select-type-to-include SNP

STDERR:

17:07:40.064 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:07:40.200 INFO  SelectVariants - ------------------------------------------------------------
17:07:40.201 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.3.0.0
17:07:40.201 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
17:07:40.201 INFO  SelectVariants - Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux v5.10.155-138.670.amzn2.x86_64 amd64
17:07:40.201 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
17:07:40.201 INFO  SelectVariants - Start Date/Time: December 13, 2022 at 5:07:39 PM UTC
17:07:40.202 INFO  SelectVariants - ------------------------------------------------------------
17:07:40.202 INFO  SelectVariants - ------------------------------------------------------------
17:07:40.202 INFO  SelectVariants - HTSJDK Version: 3.0.1
17:07:40.202 INFO  SelectVariants - Picard Version: 2.27.5
17:07:40.202 INFO  SelectVariants - Built for Spark Version: 2.4.5
17:07:40.203 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:07:40.203 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:07:40.203 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:07:40.203 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:07:40.203 INFO  SelectVariants - Deflater: IntelDeflater
17:07:40.203 INFO  SelectVariants - Inflater: IntelInflater
17:07:40.203 INFO  SelectVariants - GCS max retries/reopens: 20
17:07:40.203 INFO  SelectVariants - Requester pays: disabled
17:07:40.203 INFO  SelectVariants - Initializing engine
17:07:40.395 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/ec2-user/./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf
17:07:40.435 INFO  SelectVariants - Done initializing engine
17:07:40.488 INFO  ProgressMeter - Starting traversal
17:07:40.488 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
17:07:42.946 INFO  ProgressMeter -        chrXVI:941337              0.0                 86639        2114865.7
17:07:42.947 INFO  ProgressMeter - Traversal complete. Processed 86639 total variants in 0.0 minutes.
17:07:42.953 INFO  SelectVariants - Shutting down engine
[December 13, 2022 at 5:07:42 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=235929600
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf -O ./nphase_work/OYL011/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf --select-type-to-include SNP


STDOUT:



COMMAND: gatk CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta

STDERR:

INFO	2022-12-13 17:08:30	CreateSequenceDictionary	Output dictionary will be written in Saccharomyces_cerevisiae.dict
17:08:30.548 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Dec 13 17:08:30 UTC 2022] CreateSequenceDictionary --REFERENCE Saccharomyces_cerevisiae.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Dec 13 17:08:30 UTC 2022] Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux 5.10.155-138.670.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
[Tue Dec 13 17:08:30 UTC 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=191889408
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: /home/ec2-user/Saccharomyces_cerevisiae.dict already exists.  Delete this file and try again, or specify a different output file.
	at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:220)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta


STDOUT:



COMMAND: gatk SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf -O ./nphase_work/OYL011/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf --select-type-to-include SNP

STDERR:

17:08:33.451 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
17:08:33.594 INFO  SelectVariants - ------------------------------------------------------------
17:08:33.595 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.3.0.0
17:08:33.595 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
17:08:33.595 INFO  SelectVariants - Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux v5.10.155-138.670.amzn2.x86_64 amd64
17:08:33.595 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
17:08:33.595 INFO  SelectVariants - Start Date/Time: December 13, 2022 at 5:08:33 PM UTC
17:08:33.595 INFO  SelectVariants - ------------------------------------------------------------
17:08:33.596 INFO  SelectVariants - ------------------------------------------------------------
17:08:33.596 INFO  SelectVariants - HTSJDK Version: 3.0.1
17:08:33.596 INFO  SelectVariants - Picard Version: 2.27.5
17:08:33.596 INFO  SelectVariants - Built for Spark Version: 2.4.5
17:08:33.597 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:08:33.597 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:08:33.597 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:08:33.597 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:08:33.597 INFO  SelectVariants - Deflater: IntelDeflater
17:08:33.597 INFO  SelectVariants - Inflater: IntelInflater
17:08:33.597 INFO  SelectVariants - GCS max retries/reopens: 20
17:08:33.597 INFO  SelectVariants - Requester pays: disabled
17:08:33.597 INFO  SelectVariants - Initializing engine
17:08:33.788 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/ec2-user/./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf
17:08:33.815 INFO  SelectVariants - Done initializing engine
17:08:33.868 INFO  ProgressMeter - Starting traversal
17:08:33.873 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
17:08:36.278 INFO  ProgressMeter -        chrXVI:941337              0.0                 86639        2162371.0
17:08:36.278 INFO  ProgressMeter - Traversal complete. Processed 86639 total variants in 0.0 minutes.
17:08:36.328 INFO  SelectVariants - Shutting down engine
[December 13, 2022 at 5:08:36 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=273678336
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf -O ./nphase_work/OYL011/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf --select-type-to-include SNP


STDOUT:



COMMAND: gatk CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta

STDERR:

INFO	2022-12-13 19:22:37	CreateSequenceDictionary	Output dictionary will be written in Saccharomyces_cerevisiae.dict
19:22:37.520 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Dec 13 19:22:37 UTC 2022] CreateSequenceDictionary --REFERENCE Saccharomyces_cerevisiae.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Tue Dec 13 19:22:37 UTC 2022] Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux 5.10.155-138.670.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
[Tue Dec 13 19:22:37 UTC 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=159383552
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: /home/ec2-user/Saccharomyces_cerevisiae.dict already exists.  Delete this file and try again, or specify a different output file.
	at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:220)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta


STDOUT:



COMMAND: gatk SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf -O ./nphase_work/OYL011/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf --select-type-to-include SNP

STDERR:

19:22:40.475 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
19:22:40.628 INFO  SelectVariants - ------------------------------------------------------------
19:22:40.629 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.3.0.0
19:22:40.629 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
19:22:40.629 INFO  SelectVariants - Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux v5.10.155-138.670.amzn2.x86_64 amd64
19:22:40.629 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
19:22:40.629 INFO  SelectVariants - Start Date/Time: December 13, 2022 at 7:22:40 PM UTC
19:22:40.629 INFO  SelectVariants - ------------------------------------------------------------
19:22:40.629 INFO  SelectVariants - ------------------------------------------------------------
19:22:40.630 INFO  SelectVariants - HTSJDK Version: 3.0.1
19:22:40.630 INFO  SelectVariants - Picard Version: 2.27.5
19:22:40.630 INFO  SelectVariants - Built for Spark Version: 2.4.5
19:22:40.630 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:22:40.630 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:22:40.630 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:22:40.631 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:22:40.631 INFO  SelectVariants - Deflater: IntelDeflater
19:22:40.631 INFO  SelectVariants - Inflater: IntelInflater
19:22:40.631 INFO  SelectVariants - GCS max retries/reopens: 20
19:22:40.631 INFO  SelectVariants - Requester pays: disabled
19:22:40.631 INFO  SelectVariants - Initializing engine
19:22:40.823 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/ec2-user/./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf
19:22:40.861 INFO  SelectVariants - Done initializing engine
19:22:40.910 INFO  ProgressMeter - Starting traversal
19:22:40.917 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
19:22:43.227 INFO  ProgressMeter -        chrXVI:941337              0.0                 86639        2251338.2
19:22:43.227 INFO  ProgressMeter - Traversal complete. Processed 86639 total variants in 0.0 minutes.
19:22:43.284 INFO  SelectVariants - Shutting down engine
[December 13, 2022 at 7:22:43 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.05 minutes.
Runtime.totalMemory()=255852544
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf -O ./nphase_work/OYL011/OYL011/VariantCalls/shortReads/OYL011.SNPs.vcf --select-type-to-include SNP


STDOUT:

Oof! I know this turned into a long message, but is there anything I can do to fix this? It's doing everything BUT phasing my sample, and I don't know what I'm doing wrong. Thank you in advance!!

Other info:
version: 1.2.0
OS: AWS EC2 Linux Instance

@OmarOakheart
Owner

OmarOakheart commented Dec 14, 2022

Based on your output it looks like, for some reason, the GATK variant selection is happening twice simultaneously. Is there any possible way you somehow ran the same command line twice? I'll be checking the code to see if there's a bug no one's run into before.

What I think is most efficient, and definitely what I would do, is download the example data in https://github.com/OmarOakheart/nPhase/tree/master/example, then run it and check that it works properly (it should finish in a matter of minutes). If it does work fine, then my recommendation is to just start over with your raw data. There's still a risk that by accidentally restarting in the middle of a run you ended up with an incomplete BAM file or something of that nature, and now we're seeing weird things happen because the input is weird. Better to rule that out quickly, in my opinion.
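
If it helps, getting the example data is just a clone away. I'm not writing out the full example run command here since it depends on the file names shipped in example/, so check those first:

git clone https://github.com/OmarOakheart/nPhase.git
ls nPhase/example   # inspect the reference and read files before building the nphase command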

If you're strapped for computing time, let me know and I can try to come up with another recommendation.

@rmormando
Author

I ran through the pipeline with the example data and it went all the way through and finished fine, so I am going to go back and rerun it. I'll let you know what happens!

@rmormando
Author

Hello again! I reran the pipeline and it ended at virtually the same spot:

This was printed to STDOUT:

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF

And this is the log file:

COMMAND: gatk CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta

STDERR:

INFO	2022-12-14 17:01:49	CreateSequenceDictionary	Output dictionary will be written in Saccharomyces_cerevisiae.dict
17:01:49.650 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 14 17:01:49 UTC 2022] CreateSequenceDictionary --REFERENCE Saccharomyces_cerevisiae.fasta --TRUNCATE_NAMES_AT_WHITESPACE true --NUM_SEQUENCES 2147483647 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 14 17:01:49 UTC 2022] Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux 5.10.155-138.670.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
[Wed Dec 14 17:01:49 UTC 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=159383552
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
picard.PicardException: /home/ec2-user/Saccharomyces_cerevisiae.dict already exists.  Delete this file and try again, or specify a different output file.
	at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:220)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:309)
	at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar CreateSequenceDictionary -R Saccharomyces_cerevisiae.fasta


STDOUT:



COMMAND: ngmlr -t 8 -x ont -r Saccharomyces_cerevisiae.fasta -q OYL-011_passMinQ7.fastq -o ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.sam

STDERR:

ngmlr 0.2.7 (build: Feb 21 2022 18:53:53, start: 2022-12-14.17:01:49)
Contact: philipp.rescheneder@univie.ac.at
Opening for output (SAM): ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.sam
Reading encoded reference from Saccharomyces_cerevisiae.fasta-enc.2.ngm
Reading 12 Mbp from disk took 0.04s
Reading reference index from Saccharomyces_cerevisiae.fasta-ht-13-2.2.ngm
Reading from disk took 2.58s
Opening query file OYL-011_passMinQ7.fastq
Mapping reads...
Processed: 29 (0.97), R/S: 14.50, RL: 23375, Time: 3.00 1.50 44.00, Align: 1.00, 485, 0.93
Processed: 95 (0.96), R/S: 23.75, RL: 11878, Time: 5.31 2.42 87.53, Align: 1.00, 425, 0.94
Processed: 119 (0.96), R/S: 19.83, RL: 10906, Time: 5.33 2.11 90.13, Align: 0.99, 539, 0.93
Processed: 162 (0.96), R/S: 20.25, RL: 10456, Time: 4.51 1.75 91.50, Align: 1.00, 529, 0.93
Processed: 186 (0.96), R/S: 18.60, RL: 10979, Time: 2.69 1.34 93.94, Align: 1.00, 575, 0.94
...
Processed: 127373 (0.95), R/S: 12.59, RL: 7900, Time: 3.07 1.00 92.93, Align: 0.99, 534, 0.93
Processed: 127400 (0.95), R/S: 12.59, RL: 7898, Time: 3.19 1.00 92.81, Align: 0.99, 534, 0.93
Processed: 127402 (0.95), R/S: 12.59, RL: 7898, Time: 3.09 1.00 92.91, Align: 0.99, 534, 0.93
Processed: 127402 (0.95), R/S: 12.59, RL: 7898, Time: 3.09 1.00 92.91, Align: 0.99, 534, 0.93
Done (120626 reads mapped (94.68%), 6776 reads not mapped, 153868 lines written)(elapsed: 168m, 11 r/s)


STDOUT:


COMMAND: samtools view -t Saccharomyces_cerevisiae.fasta -@ 8 -F 260 ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.sam -o ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.pass.sam

COMMAND: samtools sort ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.pass.sam -@ 8 -o ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.sorted.header.sam

STDERR:

[bam_sort_core] merging from 0 files and 8 in-memory blocks...


STDOUT:



COMMAND: samtools view ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.sorted.header.sam -@ 8 -o ./nphase_work/ag_OYL011/Mapped/longReads/ag_OYL011.sorted.sam

COMMAND: bwa mem -t 8 -M Saccharomyces_cerevisiae.fasta OYL011_S6_L001_R1_001.fastq.gz OYL011_S6_L001_R2_001.fastq.gz > ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.noRG.sam

STDERR:

[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 326712 sequences (80000053 bp)...
[M::process] read 326680 sequences (80000010 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 145219, 5, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (199, 265, 334)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 604)
[M::mem_pestat] mean and std.dev: (268.41, 104.00)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 739)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 326712 reads in 74.035 CPU sec, 37.054 real sec
[M::process] read 327118 sequences (80000009 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 145437, 2, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (199, 265, 334)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 604)
[M::mem_pestat] mean and std.dev: (268.38, 103.73)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 739)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 326680 reads in 132.460 CPU sec, 66.358 real sec
[M::process] read 326932 sequences (80000202 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 145758, 3, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (199, 264, 333)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 601)
[M::mem_pestat] mean and std.dev: (267.74, 103.69)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 735)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 327118 reads in 294.026 CPU sec, 147.177 real sec
[M::process] read 325136 sequences (80000456 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 145482, 3, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (199, 264, 333)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 601)
[M::mem_pestat] mean and std.dev: (267.54, 103.40)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 735)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 326932 reads in 133.116 CPU sec, 66.673 real sec
[M::process] read 324740 sequences (80000309 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 144711, 3, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (200, 266, 335)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 605)
[M::mem_pestat] mean and std.dev: (269.53, 103.64)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 740)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 325136 reads in 85.287 CPU sec, 42.732 real sec
[M::process] read 325134 sequences (80000485 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 144406, 1, 3)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (201, 266, 334)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 600)
[M::mem_pestat] mean and std.dev: (269.28, 103.53)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 733)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 324740 reads in 295.813 CPU sec, 148.122 real sec
[M::process] read 325256 sequences (80000206 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 144541, 6, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (200, 266, 335)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 605)
[M::mem_pestat] mean and std.dev: (269.32, 103.70)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 740)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 325134 reads in 200.611 CPU sec, 100.449 real sec
[M::process] read 182586 sequences (45013849 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (1, 144863, 3, 1)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (200, 266, 335)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 605)
[M::mem_pestat] mean and std.dev: (269.47, 103.99)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 740)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 325256 reads in 80.425 CPU sec, 40.290 real sec
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 81287, 3, 2)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (201, 265, 335)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 603)
[M::mem_pestat] mean and std.dev: (269.58, 103.49)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 737)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 182586 reads in 133.896 CPU sec, 67.060 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 8 -M Saccharomyces_cerevisiae.fasta OYL011_S6_L001_R1_001.fastq.gz OYL011_S6_L001_R2_001.fastq.gz
[main] Real time: 718.314 sec; CPU: 1431.567 sec


STDOUT is in ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.noRG.sam

COMMAND: gatk AddOrReplaceReadGroups -I ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.noRG.sam -O ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sam -RGID ID_ag_OYL011 -RGLB LB_ag_OYL011 -RGPL ILLUMINA -RGPU PU_ag_OYL011 -RGSM SM_ag_OYL011

STDERR:

20:07:26.140 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 14 20:07:26 UTC 2022] AddOrReplaceReadGroups --INPUT ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.noRG.sam --OUTPUT ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sam --RGID ID_ag_OYL011 --RGLB LB_ag_OYL011 --RGPL ILLUMINA --RGPU PU_ag_OYL011 --RGSM SM_ag_OYL011 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 14 20:07:27 UTC 2022] Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux 5.10.155-138.670.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
INFO	2022-12-14 20:07:27	AddOrReplaceReadGroups	Created read-group ID=ID_ag_OYL011 PL=ILLUMINA LB=LB_ag_OYL011 SM=SM_ag_OYL011

INFO	2022-12-14 20:07:28	AddOrReplaceReadGroups	Seen many non-increasing record positions. Printing Read-names as well.
INFO	2022-12-14 20:08:00	AddOrReplaceReadGroups	Processed     1,000,000 records.  Elapsed time: 00:00:32s.  Time for last 1,000,000:   32s.  Last read position: chrXIII:141,583.  Last read name: M06553:49:000000000-K2W7W:1:1114:6450:6718
INFO	2022-12-14 20:08:29	AddOrReplaceReadGroups	Processed     2,000,000 records.  Elapsed time: 00:01:02s.  Time for last 1,000,000:   29s.  Last read position: chrmt:24,675.  Last read name: M06553:49:000000000-K2W7W:1:2108:6936:15882
[Wed Dec 14 20:08:53 UTC 2022] picard.sam.AddOrReplaceReadGroups done. Elapsed time: 1.47 minutes.
Runtime.totalMemory()=325058560
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar AddOrReplaceReadGroups -I ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.noRG.sam -O ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sam -RGID ID_ag_OYL011 -RGLB LB_ag_OYL011 -RGPL ILLUMINA -RGPU PU_ag_OYL011 -RGSM SM_ag_OYL011


STDOUT:

Tool returned:
0


COMMAND: samtools view -@ 8 -bT Saccharomyces_cerevisiae.fasta -q 1 ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sam -o ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.bam

COMMAND: samtools sort ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.bam -@ 8 -o ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.bam

STDERR:

[bam_sort_core] merging from 0 files and 8 in-memory blocks...


STDOUT:



COMMAND: samtools index -@ 8 ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.bam

COMMAND: gatk MarkDuplicates --REMOVE_DUPLICATES true -I ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.bam -O ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.MD.bam -M /dev/null

STDERR:

20:10:25.841 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Dec 14 20:10:25 UTC 2022] MarkDuplicates --INPUT ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.bam --OUTPUT ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.MD.bam --METRICS_FILE /dev/null --REMOVE_DUPLICATES true --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --FLOW_MODE false --FLOW_QUALITY_SUM_STRATEGY false --USE_END_IN_UNPAIRED_READS false --USE_UNPAIRED_CLIPPED_END false --UNPAIRED_END_UNCERTAINTY 0 --FLOW_SKIP_FIRST_N_FLOWS 0 --FLOW_Q_IS_KNOWN_END false --FLOW_EFFECTIVE_QUALITY_THRESHOLD 15 --ADD_PG_TAG_TO_READS true --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX false --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false
[Wed Dec 14 20:10:26 UTC 2022] Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux 5.10.155-138.670.amzn2.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.1+13-LTS; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:4.3.0.0
INFO	2022-12-14 20:10:26	MarkDuplicates	Start of doWork freeMemory: 64347464; totalMemory: 91226112; maxMemory: 2084569088
INFO	2022-12-14 20:10:26	MarkDuplicates	Reading input file and constructing read end information.
INFO	2022-12-14 20:10:26	MarkDuplicates	Will retain up to 7552786 data points before spilling to disk.
INFO	2022-12-14 20:10:33	MarkDuplicates	Read     1,000,000 records.  Elapsed time: 00:00:06s.  Time for last 1,000,000:    6s.  Last read position: chrVIII:114,308
INFO	2022-12-14 20:10:33	MarkDuplicates	Tracking 923 as yet unmatched pairs. 74 records in RAM.
INFO	2022-12-14 20:10:38	MarkDuplicates	Read     2,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:    5s.  Last read position: chrXV:76,867
INFO	2022-12-14 20:10:38	MarkDuplicates	Tracking 1537 as yet unmatched pairs. 110 records in RAM.
INFO	2022-12-14 20:10:41	MarkDuplicates	Read 2581301 records. 2353 pairs never matched.
INFO	2022-12-14 20:10:41	MarkDuplicates	After buildSortedReadEndLists freeMemory: 541550472; totalMemory: 904921088; maxMemory: 2084569088
INFO	2022-12-14 20:10:41	MarkDuplicates	Will retain up to 65142784 duplicate indices before spilling to disk.
INFO	2022-12-14 20:10:41	MarkDuplicates	Traversing read pair information and detecting duplicates.
INFO	2022-12-14 20:10:42	MarkDuplicates	Traversing fragment information and detecting duplicates.
INFO	2022-12-14 20:10:43	MarkDuplicates	Sorting list of duplicate records.
INFO	2022-12-14 20:10:43	MarkDuplicates	After generateDuplicateIndexes freeMemory: 367127064; totalMemory: 916455424; maxMemory: 2084569088
INFO	2022-12-14 20:10:43	MarkDuplicates	Marking 57060 records as duplicates.
INFO	2022-12-14 20:10:43	MarkDuplicates	Found 113 optical duplicate clusters.
INFO	2022-12-14 20:10:43	MarkDuplicates	Reads are assumed to be ordered by: coordinate
INFO	2022-12-14 20:10:56	MarkDuplicates	Writing complete. Closing input iterator.
INFO	2022-12-14 20:10:56	MarkDuplicates	Duplicate Index cleanup.
INFO	2022-12-14 20:10:56	MarkDuplicates	Getting Memory Stats.
INFO	2022-12-14 20:10:56	MarkDuplicates	Before output close freeMemory: 69763160; totalMemory: 98566144; maxMemory: 2084569088
INFO	2022-12-14 20:10:56	MarkDuplicates	Closed outputs. Getting more Memory Stats.
INFO	2022-12-14 20:10:56	MarkDuplicates	After output close freeMemory: 66857608; totalMemory: 94371840; maxMemory: 2084569088
[Wed Dec 14 20:10:56 UTC 2022] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.52 minutes.
Runtime.totalMemory()=94371840
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar MarkDuplicates --REMOVE_DUPLICATES true -I ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.bam -O ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.MD.bam -M /dev/null


STDOUT:

Tool returned:
0


COMMAND: samtools sort ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.sorted.MD.bam -@ 8 -o ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.final.bam

STDERR:

[bam_sort_core] merging from 0 files and 8 in-memory blocks...


STDOUT:



COMMAND: samtools index -@ 8 ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.final.bam

COMMAND: gatk HaplotypeCaller -R Saccharomyces_cerevisiae.fasta -ploidy 2 -I ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.final.bam -O ./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.vcf

STDERR:

20:11:17.028 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
20:11:17.179 INFO  HaplotypeCaller - ------------------------------------------------------------
20:11:17.179 INFO  HaplotypeCaller - The Genome Analysis Toolkit (GATK) v4.3.0.0
20:11:17.179 INFO  HaplotypeCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
20:11:17.180 INFO  HaplotypeCaller - Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux v5.10.155-138.670.amzn2.x86_64 amd64
20:11:17.180 INFO  HaplotypeCaller - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
20:11:17.180 INFO  HaplotypeCaller - Start Date/Time: December 14, 2022 at 8:11:16 PM UTC
20:11:17.180 INFO  HaplotypeCaller - ------------------------------------------------------------
20:11:17.180 INFO  HaplotypeCaller - ------------------------------------------------------------
20:11:17.181 INFO  HaplotypeCaller - HTSJDK Version: 3.0.1
20:11:17.181 INFO  HaplotypeCaller - Picard Version: 2.27.5
20:11:17.181 INFO  HaplotypeCaller - Built for Spark Version: 2.4.5
20:11:17.181 INFO  HaplotypeCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
20:11:17.181 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
20:11:17.181 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
20:11:17.182 INFO  HaplotypeCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
20:11:17.182 INFO  HaplotypeCaller - Deflater: IntelDeflater
20:11:17.182 INFO  HaplotypeCaller - Inflater: IntelInflater
20:11:17.182 INFO  HaplotypeCaller - GCS max retries/reopens: 20
20:11:17.182 INFO  HaplotypeCaller - Requester pays: disabled
20:11:17.182 INFO  HaplotypeCaller - Initializing engine
20:11:17.400 INFO  HaplotypeCaller - Done initializing engine
20:11:17.413 INFO  HaplotypeCallerEngine - Disabling physical phasing, which is supported only for reference-model confidence output
20:11:17.430 INFO  NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
20:11:17.440 INFO  NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
20:11:17.468 INFO  IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
20:11:17.469 INFO  IntelPairHmm - Available threads: 2
20:11:17.469 INFO  IntelPairHmm - Requested threads: 4
20:11:17.469 WARN  IntelPairHmm - Using 2 available threads, but 4 were requested
20:11:17.469 INFO  PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
20:11:17.529 INFO  ProgressMeter - Starting traversal
20:11:17.529 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Regions Processed   Regions/Minute
20:11:18.171 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position chrI:472 and possibly subsequent; at least 10 samples must have called genotypes
20:11:34.286 INFO  ProgressMeter -            chrI:3242              0.3                    20             71.6
20:11:50.691 INFO  ProgressMeter -            chrI:4953              0.6                    40             72.4
20:12:03.823 INFO  ProgressMeter -            chrI:7011              0.8                    60             77.8
20:12:25.167 INFO  ProgressMeter -            chrI:9007              1.1                    80             71.0
...
22:05:55.207 INFO  ProgressMeter -          chrmt:73003            114.6                 90990            793.8
22:06:12.662 INFO  ProgressMeter -          chrmt:77664            114.9                 91020            792.0
22:07:01.748 INFO  ProgressMeter -          chrmt:81262            115.7                 91050            786.7
22:07:17.980 INFO  ProgressMeter -          chrmt:82384            116.0                 91060            784.9
22:07:34.274 INFO  HaplotypeCaller - 15375 read(s) filtered by: MappingQualityReadFilter 
0 read(s) filtered by: MappingQualityAvailableReadFilter 
0 read(s) filtered by: MappedReadFilter 
23544 read(s) filtered by: NotSecondaryAlignmentReadFilter 
0 read(s) filtered by: NotDuplicateReadFilter 
0 read(s) filtered by: PassesVendorQualityCheckReadFilter 
0 read(s) filtered by: NonZeroReferenceLengthAlignmentReadFilter 
0 read(s) filtered by: GoodCigarReadFilter 
0 read(s) filtered by: WellformedReadFilter 
38919 total reads filtered
22:07:34.275 INFO  ProgressMeter -          chrmt:83724            116.3                 91079            783.3
22:07:34.275 INFO  ProgressMeter - Traversal complete. Processed 91079 total regions in 116.3 minutes.
22:07:34.314 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 3.627222499
22:07:34.315 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 2524.580265775
22:07:34.315 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 3223.87 sec
22:07:34.316 INFO  HaplotypeCaller - Shutting down engine
[December 14, 2022 at 10:07:34 PM UTC] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 116.29 minutes.
Runtime.totalMemory()=675282944
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar HaplotypeCaller -R Saccharomyces_cerevisiae.fasta -ploidy 2 -I ./nphase_work/ag_OYL011/Mapped/shortReads/ag_OYL011.final.bam -O ./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.vcf


STDOUT:



COMMAND: gatk SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.vcf -O ./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.SNPs.vcf --select-type-to-include SNP

STDERR:

22:07:44.455 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
22:07:45.122 INFO  SelectVariants - ------------------------------------------------------------
22:07:45.123 INFO  SelectVariants - The Genome Analysis Toolkit (GATK) v4.3.0.0
22:07:45.124 INFO  SelectVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
22:07:45.133 INFO  SelectVariants - Executing as ec2-user@ip-172-31-29-135.us-east-2.compute.internal on Linux v5.10.155-138.670.amzn2.x86_64 amd64
22:07:45.133 INFO  SelectVariants - Java runtime: OpenJDK 64-Bit Server VM v11.0.1+13-LTS
22:07:45.133 INFO  SelectVariants - Start Date/Time: December 14, 2022 at 10:07:44 PM UTC
22:07:45.133 INFO  SelectVariants - ------------------------------------------------------------
22:07:45.133 INFO  SelectVariants - ------------------------------------------------------------
22:07:45.134 INFO  SelectVariants - HTSJDK Version: 3.0.1
22:07:45.135 INFO  SelectVariants - Picard Version: 2.27.5
22:07:45.135 INFO  SelectVariants - Built for Spark Version: 2.4.5
22:07:45.135 INFO  SelectVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:07:45.135 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:07:45.135 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:07:45.136 INFO  SelectVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:07:45.136 INFO  SelectVariants - Deflater: IntelDeflater
22:07:45.136 INFO  SelectVariants - Inflater: IntelInflater
22:07:45.149 INFO  SelectVariants - GCS max retries/reopens: 20
22:07:45.149 INFO  SelectVariants - Requester pays: disabled
22:07:45.149 INFO  SelectVariants - Initializing engine
22:07:45.821 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/ec2-user/./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.vcf
22:07:45.960 INFO  SelectVariants - Done initializing engine
22:07:46.123 INFO  ProgressMeter - Starting traversal
22:07:46.124 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
22:07:54.842 INFO  ProgressMeter -        chrXVI:941337              0.1                 86639         596345.1
22:07:54.842 INFO  ProgressMeter - Traversal complete. Processed 86639 total variants in 0.1 minutes.
22:07:54.880 INFO  SelectVariants - Shutting down engine
[December 14, 2022 at 10:07:54 PM UTC] org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants done. Elapsed time: 0.18 minutes.
Runtime.totalMemory()=267386880
Using GATK jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/ec2-user/miniconda3/envs/polyploidPhasing/share/gatk4-4.3.0.0-0/gatk-package-4.3.0.0-local.jar SelectVariants -R Saccharomyces_cerevisiae.fasta --variant ./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.vcf -O ./nphase_work/ag_OYL011/VariantCalls/shortReads/ag_OYL011.SNPs.vcf --select-type-to-include SNP


STDOUT:

I'm sorry this is such a long post, but I am so confused about why it ends the way it does, consistently, every time I run it. I'm going to try running another sample today just to make sure it's not this sample that's giving it grief. Thank you for all your help so far!!!

@OmarOakheart
Owner

So, looking at the output log, you can see that a lot of lines are duplicated, which shouldn't happen, and which I presume also didn't happen with the example dataset?

Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF
Long reads successfully mapped to reference
Removing quality 0 (multimapped) reads, turning to bam and sorting it
Short reads successfully mapped to reference
Identified heterozygous SNPs in short read VCF
Extracted heterozygous SNP info based on short read VCF

Could you show me the command line you type to launch nPhase? I'm really not seeing a situation that would cause this behavior. Could you also check whether, when you launch the example dataset, it does the same thing of seemingly running the entire pipeline several times in parallel?

@rmormando
Author

I noticed that too! I thought it was weird! I'm glad I wasn't seeing things...

I used conda to activate polyploidPhasing and then ran this command (with nohup and & so it can continue running on the server without me needing to be logged in):

nohup nphase pipeline --sampleName OYL011 --reference Saccharomyces_cerevisiae.fasta --R1 OYL011_S2_L001_R1_001.fastq.gz --R2 OYL011_S2_L001_R2_001.fastq.gz --longReads OYL-011_passMinQ7.fastq --longReadPlatform ont --output ./nphase_work/ --threads 8 &

When I ran the example dataset it didn't seem to run the pipeline several times either.

@OmarOakheart
Owner

I usually use screen rather than nohup so I'm not familiar with how it proceeds. How do you get the STDOUT, did you just stay logged in? Do you also get the STDERR?

@rmormando
Author

It writes the STDOUT (and whatever prints to the command line - so I'm assuming the STDERR as well) to a txt file called nohup.out.

It should work the same way as screen? But I could try using that instead and see if that makes a difference.

@OmarOakheart
Owner

I think there's the slightest little chance that STDERR isn't redirected to nohup.out and that it's hiding an error from us that would help us understand what's going on here. I do see the documentation for nohup seems to say that STDERR should be in nohup.out but maybe doing 2>&1 & instead of just & at the end of your command line will give us more information?
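
Concretely, with the command line you posted earlier, that would look like:

nohup nphase pipeline --sampleName OYL011 --reference Saccharomyces_cerevisiae.fasta --R1 OYL011_S2_L001_R1_001.fastq.gz --R2 OYL011_S2_L001_R2_001.fastq.gz --longReads OYL-011_passMinQ7.fastq --longReadPlatform ont --output ./nphase_work/ --threads 8 2>&1 &

That way both STDOUT and STDERR should end up in nohup.out.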

Screen is a bit weird to use, if you're not familiar with it.

Sorry this is such trouble, I really don't see what could cause it to run several times in parallel. I suppose I've also never tested running nphase in the background with & but that shouldn't break anything.

I'm thinking there's an error being output that we're not seeing. The example dataset is tiny, so maybe we wouldn't see anything, but did you test it with nohup or without?

@rmormando
Author

Hi again! I ran it again with the example data using nohup and & and the log files didn't show any signs of duplications in the process. So I'm back to square one and I can't figure out what to do!

Hopefully we can figure this out together - it would be great to get it working, and if someone else runs into a problem like this, this thread could help them too!

@OmarOakheart
Owner

Okay, let's troubleshoot. First making sure we're on the same page.

Initially, the issue was that your machine restarted by accident in the middle of a run and my advice was to run nphase partial. Running nphase partial, for some reason, failed silently.

The example data works without issue, but when you tried to run the entire pipeline (in a new folder, starting over completely), it fails again silently at the same spot.

We also are seeing a weird glitch where the heterozygous SNPs are being extracted twice, as shown in the logs, but it doesn't happen with the example data.

Maybe let's look at the data then and see what's going on. Hopefully you still have the data from the last failed attempt at running nphase pipeline. Do you have a .hetPositions.SNPxLongReads.validated.tsv file in your VariantCalls/longReads folder? Does it show SNPs in chromosomes I to XVI?
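
If that file exists, here's a quick way to check the chromosome question (a sketch - it assumes your reference names its chromosomes chrI through chrXVI, so adjust the pattern to match your fasta headers, and substitute your sample name):

grep -oE "chr[IVX]+" VariantCalls/longReads/SAMPLE_NAME.hetPositions.SNPxLongReads.validated.tsv | sort | uniq -c

If some chromosomes are missing from the output, or have very few lines, that would narrow down where things went wrong.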

@rmormando
Author

Okay the weirdest thing happened...it worked???

I reran the SAME code last night with my sample and it went all the way through and gave me phased fastq files? I have no idea what was different/what changed but it went all the way through. Maybe it had something to do with the server I was on?

I'm going to use another sample now and see if it runs all the way through or stops at the same place. But everything you said was right: my machine restarted by accident in the middle of the run, trying to run nphase partial to pick up where it left off failed silently, and the example data worked fine with no duplicated processes, but when I ran it with my data the same errors came up (pretty much at the same place). Depending on what happens with the next sample, maybe it was just something to do with using & ? The only thing I did differently this time around was use 2>&1 & like you suggested.

Now that I have data I have some questions on how to interpret it. In the resulting /Phased/FastQ folder there are 834 cluster fastq files and 430 merged cluster fastq files. Is that normal for some (yeast) data? I know these are haploTIGs and not full haploTYPEs, so how do I combine them into 4 haplotypes if I think the sample is supposed to be tetraploid? Is this where I would go to the Phasing-Toolkit to combine the necessary clusters together to generate the 4 files I am looking for? My next steps with this data would be to take the 4 haplotypes, do a de novo assembly, polish with short reads, and scaffold to the reference for other downstream analyses. So first I need to combine these 834 clusters.fq into 4 fastq files - I'm just not sure how to do that.

@OmarOakheart
Owner

Haha, I mean, from my perspective it's weird that it wasn't working before! Hopefully it all runs fine again with your other sample.

Are you saying you have a total of over 1200 fastQ files? The naming convention of the fastQs being output doesn't have any particular meaning. This sounds to me like nPhase didn't do a good job at phasing and is giving you way too many haplotigs. Would you be able to share the _covVis.png output? It should be easy for me to see quickly whether your data was phased well and there's just a lot of noise, or whether the data didn't phase well at all. Then I can try to give you advice depending on the case.

@rmormando
Author

I hope so too! I have a feeling it might've been a fluke though just based on the number of fastq files it resulted in...

This is what the _covVis.png looks like:

Screenshot 2022-12-20 at 11 18 18 AM

I'm also adding in the _phasedVis.png just to show that as well:

Screenshot 2022-12-20 at 11 19 43 AM

I also ran this sample using only chromosome IX as the reference sequence (it's just the area I am focused on right now so I wanted to see if only using that specific chromosome would make it go faster and have more accurate results) and that resulted in this _covVis.png and _phasedVis.png images:

Screenshot 2022-12-20 at 11 18 56 AM

Screenshot 2022-12-20 at 11 20 53 AM

Running with only chr9 resulted in just 4 cluster fastq files and 2 merged cluster fastq files - does that mean that for this region it picked up 4 haplotypes (each cluster fastq = 1 haplotype), showing that it's tetraploid?

All in all, what does the data look like to you?

@OmarOakheart
Owner

What's your total coverage? Based on the total covVis it looks like it's around 10X? But then your chr9 looks like you have about 100X

The full run looks extremely fragmented, which makes sense if you only have 10-20X total, but if you have around 100X then maybe it's a read length distribution issue.

I find that contiguity depends mostly on three factors: coverage (I recommend 10X per haplotype minimum, so 40X for a 4n), read length distribution (generally a mean read length > 15kb will lead to good results), and heterozygosity level. The less heterozygosity you have, the more the other two parameters need to make up for it

@rmormando
Author

I believe the coverage should be about 170X - but I'm not exactly sure how to check.

I ran both using the same raw long reads (ONT). Is there a way to specify the coverage in the full run? If I were to try this again with this sample what would the command look like to take all of that into account?

@OmarOakheart
Owner

I found this in one of your outputs:

2KProcessed: 127402 (0.95), R/S: 12.59, RL: 7898, Time: 3.09 1.00 92.91, Align: 0.99, 534, 0.93 Done (120626 reads mapped (94.68%), 6776 reads not mapped, 153868 lines written)(elapsed: 168m, 11 r/s)

So that says 120 626 reads at an average of 7898 bp
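
If you want to double-check that kind of number directly from the fastq, here's a quick sketch (it assumes an uncompressed fastq and the ~12.5 Mb S. cerevisiae genome; pipe through zcat first if the file is gzipped):

awk 'NR%4==2{bases+=length($0)} END{printf "%d bases, ~%.0fX\n", bases, bases/12500000}' OYL-011_passMinQ7.fastq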

Let's say 120 000 reads * 7900 bp / 12 500 000 bp for a coverage of about 75X. So the good news is that the coverage is high, but the read length is low for a lot of these reads. I would downsample to 40 or 50 X, trying to maximize read length. You can use a tool called filtlong for that https://github.com/rrwick/Filtlong

Something like filtlong --min_length 5000 --keep_percent 60 input.fastq.gz | gzip > output.fastq.gz

Keep_percent can be a bit higher, too, or lower. Downsample, then check the mean read length (you could use nanoplot to generate those stats). If it's at least 12 kb, ideally more than 15kb, run nphase again, otherwise downsample further. I wouldn't go below 30X coverage.
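
For the stats check, something like this should do it (nanoplot_filtered is just an output folder name I'm picking):

NanoPlot --fastq output.fastq.gz --outdir nanoplot_filtered

The NanoStats.txt file in that folder reports the mean read length and total bases.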

Running it again with a better read length profile should improve the results a lot. Right now the results you show are not interpretable, too fragmented, I would assume because of the read length. Although I still find it confusing that the output is showing a coverage of only about 3X per haplotig

@rmormando
Author

Okay, I ran that command with the raw reads in Filtlong and came up with 17,754 bp as the mean read length, which comes out to ~48X coverage! I just wrote the command and started running it on my server, so hopefully by tomorrow I'll have results and we can reconvene. Thank you so much for all your help so far, I really really appreciate it. Here's to this run working!!!!!!!!

@rmormando
Author

Hello again! I reran the pipeline with the filtered reads after getting those results from Filtlong and the results are strange again...

Here is the _covVis.png:
Screenshot 2022-12-21 at 8 35 53 PM

and here is the _phasedVis.png:
Screenshot 2022-12-21 at 8 35 36 PM

This time around it also produced 471 cluster files. What are your thoughts?

@OmarOakheart
Owner

I'm very confused by these results. If I saw that as an output I would start double checking everything.
The covVis is showing data that isn't just very fragmented, but also very, very low coverage, since the Y axis is the mean coverage. That shouldn't happen with a sample that has 48X coverage.

So I would check my command line, in case I accidentally input the wrong long read file. Then I would use nanoplot to check the statistics of the long read file used as input, to check if the mean read quality, or length, or coverage are different than I expected. Then I would check the coverage of the .sam file in Mapped/LongReads to make sure that the coverage is what I expect along the entire genome.
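
For that last check, samtools coverage will print a per-chromosome table straight from the sorted sam (a sketch; substitute your actual sample name):

samtools coverage Mapped/LongReads/SAMPLE_NAME.sorted.sam

The meandepth column should be in the neighborhood of your expected ~48X on every chromosome.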

The output is showing us that the heterozygous SNPs are evenly distributed along the genome, with some small exceptions - for example, the beginning of chromosome VI appears to be homozygous (though maybe that's not true and it's just an artefact of whatever is going wrong here). So I wouldn't immediately check the short reads or the GATK variant calling. That appears to have run fairly smoothly.

If everything above is fine, then another possibility is that the long read sample and the short read sample aren't the same, and therefore nPhase is finding too few long reads that are heterozygous in the same positions as the short read data (there may even be errors in the long read data, which could account for the very low coverage). I'm not sure that this is what it would look like if that happened, but I suppose it's something to consider. Maybe you have a way of double checking that on your side; if not, you could use something like https://github.com/WGLab/NanoCaller to variant call the long reads and then compare the variants in the short read vcf and the long read vcf. Or I could try to advise you on a way to analyze the contextDepths file to determine if the majority of variants are absent or low coverage in your long read data, which would be a sign that the long read sample does not present the variants found in the short reads, either because it's a different sample or because of issues in long read quality.
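
If you go the VCF comparison route, bcftools isec is one way to pull out the shared and private variant sets (a sketch - the file names here are placeholders; the long read VCF is whatever your long read variant caller outputs):

bgzip shortReads.SNPs.vcf && tabix -p vcf shortReads.SNPs.vcf.gz
bgzip longReadCalls.vcf && tabix -p vcf longReadCalls.vcf.gz
bcftools isec -p isec_out shortReads.SNPs.vcf.gz longReadCalls.vcf.gz

In isec_out, the 0002.vcf/0003.vcf files hold the variants shared by both callsets - if those are nearly empty, that would point to mismatched samples or low long read quality.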

Finally, maybe I messed something up with my latest update to nPhase a few months ago. If it's none of the above we could test that by testing an earlier version of nPhase in a new environment.

I'm sure we can figure this out, I'm sorry using nPhase hasn't been the painless experience I was hoping it would be for everyone

@OmarOakheart
Owner

One last thing: when you tested chrIX by itself, that output looks normal (not perfect, though I think if it was run a second time with the subsampled data it would provide better results). So maybe the easiest way to figure out what's going on is to check for any differences between that run and the full-genome run? The reference sequence shouldn't be the issue - maybe you used a different sample? Or maybe there were memory issues and the full-genome run is getting killed? Though I don't believe the plots would still be generated if that was happening.

If the chrIX was done with different samples, I recommend trying them again for a full-genome run (and subsample them first), this way we would know if the problem is with nPhase or with the data.

@rmormando
Author

I completely agree - these results are confusing me as well. I went back to the command I used and made sure it pointed to the correct (filtered) long read file, which it did. I included a summary of the stats after running Nanoplot on the filtered long reads below:

Metrics	dataset
number_of_reads	34009
number_of_bases	603795240.0
median_read_length	14745.0
mean_read_length	17754.0
read_length_stdev	10530.0
n50	20650.0
mean_qual	13.8
median_qual	13.9
longest_read_(with_Q):1	106146 (15.0)
longest_read_(with_Q):2	97927 (14.1)
longest_read_(with_Q):3	97175 (10.2)
longest_read_(with_Q):4	94289 (11.4)
longest_read_(with_Q):5	89316 (11.7)
highest_Q_read_(with_length):1	19.6 (11365)
highest_Q_read_(with_length):2	19.2 (6219)
highest_Q_read_(with_length):3	19.1 (13189)
highest_Q_read_(with_length):4	19.1 (5252)
highest_Q_read_(with_length):5	19.1 (5205)
Reads >Q5:	34009 (100.0%) 603.8Mb
Reads >Q7:	34009 (100.0%) 603.8Mb
Reads >Q10:	33853 (99.5%) 598.2Mb
Reads >Q12:	28990 (85.2%) 485.8Mb
Reads >Q15:	8299 (24.4%) 118.4Mb

One thing to note is that I did not filter the short reads with Filtlong - is that something I should take into account as well? I just filtered the long reads because I felt that would be enough and the short reads would fill in the rest of the data.

I’ll have a look at the short read variants. I’ll probably end up calling variants with the long reads and comparing the two VCF files, as well as the differences between the full genome and the chrIX call. When I ran the command using just chrIX it was the same sample, so I think either something is wrong with the way it's being run on my machine or something is fundamentally wrong with the sequencing run. I wish there was a way to send you the samples so you could try the run, but your suggestions have helped a ton already, so I’m sure we’ll find a solution.

@OmarOakheart
Owner

OmarOakheart commented Dec 22, 2022

But then, if chrIX was generated using the same long read file, why aren't we seeing nearly the same level of coverage on chrIX when running the full genome? We should expect to obtain at least something similar.

You did it correctly, the short reads shouldn't be subsampled, only the long reads since they're the basis for the clustering.

You could upload the VCF generated in VariantCalls/ShortReads/ and the contextDepths file in the Overlaps folder to a google drive folder for me to take a look. You can email me a link at omaroakheart@gmail.com

I don't have access to a server to do a full run unfortunately

@rmormando
Author

Hi! I left you an email earlier today but in case it's easier to respond here I have some results with the new run I was able to finish!

image

image

I think these results are much cleaner and make some more sense to me. However, I have ~186 files in the fastq folder from the run. Does this show the total number of haplotigs/haplotypes? If I predicted this sample to be tetraploid, how do I extract 4 fastq files that represent each haplotype? What are your thoughts on these new plots from this run?

@OmarOakheart
Owner

Hey,

I've responded in detail through email, but just in case someone else sees this issue and is curious, here are some short answers:

One thing you can do now is run "nphase cleaning", which will attempt to merge together complementary haplotigs and get rid of small, inconsequential haplotigs such as the light blue haplotig at the beginning of "contig_47" in the bottom right, which has very low coverage and does not cover many heterozygous SNPs.

Each of the 186 fastQ files represents the reads associated with one haplotig. For example, chrI only has three, and the coverage plot shows that they are all roughly equally covered. I would check if there is an aneuploidy in chrI which leaves it with only three copies (and three associated haplotypes). ChrI of S. cerevisiae is usually rather messy, but if you do find evidence of aneuploidy (for example if the short read data shows evidence for a 1/3, 2/3 distribution of heterozygous SNPs), then that would be great corroborating evidence.

You should use the discordanceVis plot to help with interpretation. It will show you if there are any haplotigs which have an unexpected allele frequency distribution. We would expect the allele frequency distribution to show nearly 90% of SNPs are in agreement, with a margin of about 10% for sequencing errors since we're looking at long reads. If there's a cluster of SNPs with an allele frequency closer to 50%, this is evidence that nphase has mistakenly mixed reads from different haplotypes.

Looking at the end of chromosome 2 or chromosome 15, we can tell that nPhase has made some mistakes, since the heterozygous SNPs are all represented by only one haplotig. You should expect to see evidence of this in the discordanceVis plot. One thing you can do is take the fastQ file that corresponds to this haplotig and run nphase with it as the long read input, as sketched below, so that you might find a way to reduce the discordance levels and obtain a prediction that is coherent on that level.
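
Concretely, something like this (a sketch - the cluster file name is a placeholder for whichever haplotig fastQ covers that region, and the short reads are the same ones you used for the full run):

nphase pipeline --sampleName chrII_haplotig_rerun --reference Saccharomyces_cerevisiae.fasta --longReads Phased/FastQ/suspect_cluster.fastq --longReadPlatform ont --R1 shortReads_R1.fastq.gz --R2 shortReads_R2.fastq.gz --output ./nphase_rerun/ --threads 8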

There's a lot to be interpreted in these plots, for example, chromosome 13 shows evidence that it might have a +1 aneuploidy, making it a 5n. On the first 300kb it seems clear that there are 4 distinct haplotypes, three of which are covered at about a 1X level, and the fourth covered twice as much as the others, which is evidence there might be 5 copies.
Similarly, looking at chromosome 10, the red haplotig at the top of the phasing plot covers the start and end of the chromosome, with a gap in between. This points to a potential heterozygous SV, perhaps a translocation.
