java.lang.NumberFormatException #261
Comments
Delete all intermediate files (ie anything in a *.gridss.working directory) and try again. Some files may have already been written using your system default locale so |
What locale do you use as your system default? |
I typed in "locale", and it's showing: Thank you! |
Hello, Thank you, |
Ok, reading over this issue again, the problem is something completely different: GRIDSS is designed for joint assembly of all samples - not per-sample assembly. You should only have 1 assembly file. Merging per-sample assembly files won't work because:
In summary: GRIDSS does joint assembly: all input files are required for the assembly. |
Does joint assembly actually fail? If so, please raise an issue for that. If you really need independent per-sample assemblies for some reason you'd need to hack it together by doing joint assembly against placeholder bam (ie joint assembly, but with all samples except 1 replaced with empty bams) so the ordinals all match up, then renaming all the assembly contig names to unique names (e.g. |
Thank you for the detailed explanation. It makes a lot of sense! I actually tried calling SVs on 5 random samples using the incorrect process I mentioned, and it happened to succeed in generating a VCF, so I thought that might be a way to process my data. Yes, I tried joint assembly with multiple input bam files together. The runs either failed with the error "Error assembling scaffold..." or tried to recover forever. I will raise another issue for that soon (thank you for pointing it out). In addition, I tried the strategy you mentioned of using empty placeholder bams (according to #182 ). While preprocessing an empty bam file, I got an error message as follows: Is gridss going to generate sv.bam for the original empty bam files? Best,
Since most assembly contigs are filtered internally within gridss, the output file it's
No. Just create empty placeholders for *.sv.bam files. You'll also need to create matching index files. When I had to do something like this, I just created an One thing you need to make sure of is that the order of the input files is always same. I'll add a sanity check for this in the next release of GRIDSS. |
Thanks a lot for the suggestions! I have already started batch assembly (~10 samples per batch), with empty bam files standing in for the samples not in each batch. Hopefully they will run well. I will keep you updated about any issues that arise. Best,
Hi Daniel, When I go into each .gridss.working directory, it seems that the majority of the chunks have finished and have a .bai file, except for one or two that are still gridss.temp.batch.bam.assembly.chunk*.bam files. Right now I am using 5 samples per batch. Would it help to split the frozen batches further into 2 or 3 samples each, or do you have any other suggestions? Thank you! Yuanwen
Sorry about posting multiple questions at once. I am a little confused about how to rename the batch assembly files. Should I rename all the assembly contig names (the third column of the SAM format file) or all the read names (the first column of the SAM format file), as stated in #182 ? Much appreciated,
Are they actually at 0% CPU utilisation, or are they just taking a very long time to run? If it's the latter, it might be due to low complexity regions of your reference. You can either wait for them to finish, rerun them with more aggressive assembly timeouts (
Correct. Only the assembly contig names need to change. This prevents name collisions when you merge them together (the SAM specs require each |
They are still running with ~99.8 CPU usage. Thank you for pointing out the low complexity region issue! My genome is fungal, so I can probably run a repeat masker to find these regions first and use them as a blacklist. Or I can try adjusting safetyModePathCountThreshold and safetyModeContigsToCall as you suggested, but I am not sure what these two settings mean. They are described in PositionalAssemblyConfiguration.java as "Number of memoized paths to enter safety mode" and "Number of contigs called in safety mode", which I find difficult to understand. Does safetyModePathCountThreshold have the same function as --maxcoverage on the command line? For the batch assembly file renaming, if I understand correctly, I should rename the first column of the SAM file (the contig read name), e.g. read name asm0-1 became batch1_asm0-1. For the batches that assembled successfully, I merged the bam files and started cohort calling. Hopefully it will run well. Thank you,
Safety mode is entered when the number of active nodes in the assembly exceeds this threshold. When safety mode is entered, the next {@link #safetyModePathCountThreshold} contigs will be called then the remaining reads still in the assembly graph discarded. Safety mode is reset whenever a new read is loaded into the assembly graph (ie when assembler makes any progress and load more of the genome into the graph).
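To make the description above a bit more concrete, here is a toy model of it in Python. This is only a sketch of the behaviour as worded in this thread, not GRIDSS code: the class and method names are made up, and it assumes the number of contigs called in safety mode comes from safetyModeContigsToCall (the setting described as "Number of contigs called in safety mode"). Whether the contig that triggers safety mode counts towards that number is also an assumption.

```python
from dataclasses import dataclass


@dataclass
class SafetyModeTracker:
    """Toy model of the safety mode described above (hypothetical, not GRIDSS code)."""

    path_count_threshold: int  # stands in for safetyModePathCountThreshold
    contigs_to_call: int       # stands in for safetyModeContigsToCall (assumed)
    in_safety_mode: bool = False
    contigs_called_in_safety_mode: int = 0

    def on_read_loaded(self) -> None:
        # The assembler made progress, so safety mode is reset.
        self.in_safety_mode = False
        self.contigs_called_in_safety_mode = 0

    def on_contig_called(self, active_nodes: int) -> bool:
        """Record one called contig.

        Returns True when the reads still in the graph should be discarded.
        """
        if active_nodes > self.path_count_threshold:
            self.in_safety_mode = True
        if not self.in_safety_mode:
            return False
        self.contigs_called_in_safety_mode += 1
        return self.contigs_called_in_safety_mode >= self.contigs_to_call
```

In this model, lowering either value makes the assembler give up on a difficult region sooner, which is the trade-off being discussed for the stalled chunks.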
Correct |
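For anyone following along, the renaming confirmed above could look roughly like this. It is a minimal sketch, assuming the batch assembly has been converted to SAM text first (for example with samtools view -h) and that only the first column needs the prefix; the function name and the batch1_ prefix are just illustrations, and it does not touch any other fields or tags.

```python
import io


def prefix_contig_names(sam_in, sam_out, prefix):
    """Copy SAM text from sam_in to sam_out, adding prefix to every read name.

    Header lines (starting with '@') are copied unchanged. For alignment
    records only the first column (QNAME) is modified.
    """
    for line in sam_in:
        if line.startswith("@"):
            sam_out.write(line)
            continue
        qname, sep, rest = line.partition("\t")
        sam_out.write(prefix + qname + sep + rest)


# Small in-memory example: asm0-1 becomes batch1_asm0-1.
example = io.StringIO(
    "@HD\tVN:1.6\n"
    "asm0-1\t0\tscaffold_1\t100\t60\t5M\t*\t0\t0\tACGTA\t*\n"
)
renamed = io.StringIO()
prefix_contig_names(example, renamed, "batch1_")
```

Using a different prefix per batch keeps the names unique once the batch files are merged. The result would need to be converted back to BAM and indexed before use.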
100% of all cores, or 100% of a single core? If it's just a single core, it's probably one particular region that they're stalling on. If it's all cores, it's probably some sort of low complexity sequence/repeat, with multiple worker threads each stalling on a homolog. Would you be willing (and able) to provide a BAM containing just the reads in one of the stalled regions? Most of my work has been on mammals, so if there's an edge case where GRIDSS runs unacceptably slowly, some sample data would help me adjust the parameters or add additional checks to ensure that the worst-case runtime is not unreasonable.
Hi Daniel, Sure, I will prepare the bam file containing the reads from the stalled region soon. Thank you for taking the time to help us! I am running the cohort call on the merged assembly with ~350gb memory. It ran well for about 34hrs but then failed because there were too many split reads in one of the contigs. If I reduce --maxcoverage or safetyModePathCountThreshold, will that help to avoid this kind of region during the calling process? My understanding is that safetyModePathCountThreshold helps during assembly rather than calling; please correct me if I am wrong. Do you have any suggestions about how to avoid such regions during final calling? Thank you! Best,
Hello Daniel, I was trying to prepare a bam file containing only the reads in the stalling regions, but I am not sure which region I should pick. The reason is that when I tried assembly based on two samples (s1 and s2), it froze at scaffold_62-82; the last few lines of the log file are as below: But when I tried assembly based on just one of the two samples (s1), it froze at a different position, scaffold_969-1008: Therefore, I tried to attach the whole bam file here, but it looks like it cannot be attached. Could I send it to you by email if that is convenient? As for the question about the number of cores, some of my assembly processes are still running, and they seem to be at 100% of a single core. Thank you! Best,
Is this still an issue in v2.7.3? |
Hello,
My issue is similar to #253, with a slight difference. I tried to run gridss-2.6.0 to do cohort calling for ~70 samples, but it failed with the error:
gridss.SoftClipsToSplitReads done. Elapsed time: 470.26 minutes.
Runtime.totalMemory()=4151836672
Exception in thread "main" java.lang.NumberFormatException: For input string: "asm4-34583"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at au.edu.wehi.idsv.SplitReadHelper.getRealignmentFirstAlignedBaseReadOffset(SplitReadHelper.java:231)
at au.edu.wehi.idsv.SplitReadHelper.unclip(SplitReadHelper.java:175)
at au.edu.wehi.idsv.SplitReadHelper.replaceAlignment(SplitReadHelper.java:275)
at au.edu.wehi.idsv.SplitReadHelper.replaceAlignment(SplitReadHelper.java:260)
at au.edu.wehi.idsv.SplitReadRealigner.prepareRecordsForWriting(SplitReadRealigner.java:273)
at au.edu.wehi.idsv.SplitReadRealigner.writeCompletedAlignment(SplitReadRealigner.java:237)
at au.edu.wehi.idsv.SplitReadRealigner.mergeSupplementaryAlignment(SplitReadRealigner.java:384)
at au.edu.wehi.idsv.SplitReadRealigner.mergeSupplementaryAlignment(SplitReadRealigner.java:340)
at au.edu.wehi.idsv.SplitReadRealigner.createSupplementaryAlignments(SplitReadRealigner.java:310)
at gridss.SoftClipsToSplitReads.doWork(SoftClipsToSplitReads.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at gridss.SoftClipsToSplitReads.main(SoftClipsToSplitReads.java:118)
INFO 2019-10-10 17:34:34 IdentifyVariants
This happened after merging all the individual assembly bam files. I used the following command for this call step:
sh /bulk/yuanwen/Tools/gridss-2.6.2/scripts/gridss.sh --threads 10 --reference /bulk/yuanwen/RKQQC/20190707_SALSA/files/HiC_final/scaffolds_FINAL.fasta --output /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/files/gridss.vcf --assembly /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/files/gridss.all.bam --jar /bulk/yuanwen/Tools/gridss-2.6.2-gridss-jar-with-dependencies.jar --workingdir /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/scripts/ --jvmheap 60g --steps Call /bulk/feihe/data/Pgt_isolates/bam_YuanWen/00M063C_S59.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/00MN99C_S38.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01MN84-A-1-2_S51.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01SD80A_S13.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01TUR17A_S1.RG.DupMarked.bam ....
I tried adding export LC_ALL=C before running the commands, but still got the same error.
Then I tried merging the individual assembly bam files in batches (5 samples per batch), instead of merging the bam files of all ~70 samples together. It turned out that some batches process well, but the majority do not (failing with a similar error, for example: Exception in thread "main" java.lang.NumberFormatException: For input string: "asm7-53" ).
Could you please give me some suggestions about what might be the problem?
Thank you!
Yuanwen