Thousands of gridss_minimal_reproduction_data_for_error* #503

Closed
keiranmraine opened this issue Jun 15, 2021 · 17 comments

@keiranmraine

Hi,

Is there any way to suppress the generation of these files and folders when an assembly step has issues?

gridss_minimal_reproduction_data_for_error*

I understand the intent, but I don't want them generated on a normal run, only when I'm trying to debug.

I don't believe I'm doing anything unusual (2.11.1):

gridss.sh --externalaligner --reference /.../genome.fa --blacklist /.../gridss/blacklist-2011-05-04-ENCFF001TDO.bed --labels SAMPLE --assembly /.../Gridss/SAMPLE.gridss.bam --output /.../Gridss/SAMPLE.gridss.vcf.gz --workingdir /.../Gridss/gridss_tmp --threads 8 --steps assemble /.../2883676.bam

These are catastrophic when written to a non-local file system, as the folders are ~2.1GB and compress to ~3MB. We're seeing this cause a significant I/O load on our Lustre file system.

Thanks
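
To gauge how many of these packages a run has produced and how much space they take, a minimal sketch follows. It assumes the packages end up somewhere below a directory you pass in (for example the --workingdir); the function names are hypothetical and not part of GRIDSS.

```shell
# Count reproduction packages (folders and zips) below a directory.
# Assumption: the caller knows where GRIDSS wrote them; pass that directory as $1.
count_repro_packages() {
    # -prune stops find from descending into a matched folder,
    # so each package is counted once.
    find "$1" -name 'gridss_minimal_reproduction_data_for_error*' -prune \
        | wc -l | tr -d ' '
}

# Report the size in KiB of each package below a directory.
repro_package_sizes() {
    find "$1" -name 'gridss_minimal_reproduction_data_for_error*' -prune \
        -exec du -sk {} +
}

# Example (path is a placeholder):
#   count_repro_packages /path/to/gridss_tmp
#   repro_package_sizes /path/to/gridss_tmp
```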

@keiranmraine
Author

Additional observations on the jobs where I'm seeing this (assembly step, 8 CPUs):

ERROR   2021-06-15 14:52:17     TwoBitBufferedReferenceSequenceFile     Error loading reference genome from cache /.../human/GRCh37d5/genome.fa.gridsscache
java.io.InvalidClassException: au.edu.wehi.idsv.debruijn.PackedSequence; local class incompatible: stream classdesc serialVersionUID = -8769790295923840212, local class serialVersionUID = 8268042369116844383

Is each of these reproduction files due to one of these messages?

2618 x

INFO    2021-06-15 14:54:08     PositionalAssembler     Error during assembly of chromosome 1 (14 reads in graph). Attempting recovery by rebuilding assembly graph.
java.util.NoSuchElementException
        at java.base/java.util.HashMap$HashIterator.nextNode(HashMap.java:1495)
        at java.base/java.util.HashMap$KeyIterator.next(HashMap.java:1516)
...

1150 x

ERROR   2021-06-14 16:46:16     PositionalAssembler     Error assembling 2. Please raise an issue at
...
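
Tallies like the ones above can be reproduced from a GRIDSS log with grep. This is a minimal sketch: the function name is hypothetical, the log path is whatever file your job wrote its output to, and the patterns are taken from the two messages quoted in this comment.

```shell
# Count the two PositionalAssembler messages quoted above in a log file ($1).
count_assembly_errors() {
    # Recoverable errors: the assembly graph is rebuilt and the run continues.
    printf 'recovered\t%s\n' \
        "$(grep -c 'Attempting recovery by rebuilding assembly graph' "$1")"
    # Unrecoverable errors: "Error assembling <contig>".
    printf 'failed\t%s\n' \
        "$(grep -c 'PositionalAssembler[[:space:]]*Error assembling' "$1")"
}

# Example (log name is a placeholder):
#   count_assembly_errors gridss.assemble.log
```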

d-cameron pushed a commit that referenced this issue Jun 16, 2021
configurable with assembly.maximumReproductionExportPackages
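
A minimal sketch of setting this option, under stated assumptions: that GRIDSS reads overrides from a properties-style file, that gridss.sh accepts that file through a configuration option, and that a value of 0 disables the export. None of these is confirmed in this thread, so check them against `gridss.sh --help` and the release notes. The file name and function name are hypothetical.

```shell
# Write a one-line override file setting the option named in the commit above.
# $1 = output file, $2 = maximum number of reproduction packages to export.
write_gridss_override() {
    printf 'assembly.maximumReproductionExportPackages=%s\n' "$2" > "$1"
}

# Example (assumption: 0 means "export none"; verify before relying on it):
#   write_gridss_override gridss_override.properties 0
# The file would then be passed to gridss.sh via its configuration option,
# whose exact name should be confirmed with `gridss.sh --help`.
```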
@d-cameron
Member

> INFO 2021-06-15 14:54:08 PositionalAssembler Error during assembly of chromosome 1 (14 reads in graph). Attempting recovery by rebuilding assembly graph.

Are you able to send through that error package containing only 14 reads? I should be able to identify and fix the root cause from a data set that small.

@d-cameron
Member

> ERROR 2021-06-15 14:52:17 TwoBitBufferedReferenceSequenceFile Error loading reference genome from cache /.../human/GRCh37d5/genome.fa.gridsscache
> java.io.InvalidClassException: au.edu.wehi.idsv.debruijn.PackedSequence; local class incompatible: stream classdesc serialVersionUID = -8769790295923840212, local class serialVersionUID = 8268042369116844383

That's unrelated and fixed in 2.12.0. That particular error should only be a warning as it just falls back to loading from the .fasta file instead of the .gridsscache file.

@keiranmraine
Author

> That's unrelated and fixed in 2.12.0. That particular error should only be a warning as it just falls back to loading from the .fasta file instead of the .gridsscache file.

I found that rebuilding the gridsscache and img with the matching version suppressed this, but that's good to know.

@keiranmraine
Author

> Are you able to send through that error package containing only 14 reads? I should be able to identify and fix the root cause from a data set that small.

gridss_minimal_reproduction_data_for_error_1.zip

@keiranmraine
Author

Please feel free to drop me a line if you want me to try out an intermediate version. I believe I should be able to build the image easily given the recent changes to the Dockerfile.

@keiranmraine
Author

@d-cameron is there any update? Do you need further test data? Sorry to chase. A release with the new config variable (maximumReproductionExportPackages) would be a good interim solution for us, unless you see a significant issue in this test data.

@keiranmraine
Author

Is there any progress on this? If it's not possible to handle this, could a release with maximumReproductionExportPackages be made?

@d-cameron
Member

> unless you see a significant issue in this test data

Unfortunately, the file attached to the issue appears to be corrupt and I am unable to extract the reads from it. Could you attach another repro file that has the same issue?

@d-cameron
Member

It's possible that GRIDSS got killed while it was still writing the minimal reproduction file, resulting in a truncated output. Can you check to see if you have any fully intact repro files?
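
One way to check for a truncated archive is to test it with unzip before uploading. This is a minimal sketch assuming the Info-ZIP `unzip` tool is installed; the function name is hypothetical.

```shell
# Return 0 if the zip archive at $1 passes unzip's integrity test, non-zero otherwise.
is_intact_zip() {
    # -t tests the archive without extracting it, -q keeps the output quiet.
    unzip -tq "$1" >/dev/null 2>&1
}

# Example: list the packages in the current directory that pass the test.
#   for z in gridss_minimal_reproduction_data_for_error_*.zip; do
#       is_intact_zip "$z" && echo "$z"
#   done
```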

@keiranmraine
Author

gridss_minimal_reproduction_data_for_error_844.zip

I've done a round-trip check that the uploaded file can be unpacked. Sorry about that.

@d-cameron
Member

d-cameron commented Jul 19, 2021

The cause is a custom SAM tag collision with reads that carry aa tags, such as aa:Z:Nextflex-PCR-1.

In retrospect, using "aa" was not the best idea.

The workaround is to run samtools view -x aa on the input files to strip the tag out.
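
A minimal sketch of that workaround as a shell function. The `-x aa` option comes from the comment above; `-b` (write BAM) and `-o` (output file) are standard samtools view options added here so the result is a usable BAM. The function name, the DRY_RUN switch, and the file names are hypothetical.

```shell
# Strip the aa tag from every read in a BAM file.
# $1 = input BAM, $2 = output BAM. Set DRY_RUN=1 to print the command instead of running it.
strip_aa_tag() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "samtools view -b -x aa -o $2 $1"
    else
        # -x aa drops the aa tag; all other tags and fields are kept.
        samtools view -b -x aa -o "$2" "$1"
    fi
}

# Example (file names are placeholders):
#   strip_aa_tag input.bam input.noaa.bam
```

This writes a second copy of the input, so it costs roughly the size of the original BAM in extra disk space.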

@keiranmraine
Author

keiranmraine commented Jul 19, 2021

Are there any filters I can apply to the reads to minimise the size of the duplicated file while doing this?

@keiranmraine
Author

Sorry, I've just noticed you're going to handle this in code. I didn't spot the commit above.

@keiranmraine
Author

Is there any possibility of a hotfix release with this update? I understand this may be low priority for your internal use but it is exceptionally valuable to us.

@d-cameron
Member

d-cameron commented Aug 6, 2021 via email

@keiranmraine
Author

Thank you, very much appreciated
