java.lang.NumberFormatException #261

YuanwenGuo · 2019-10-13T23:42:19Z

Hello,

My issue is similar to #253, with a slight difference. I tried to run gridss-2.6.0 to do cohort calling for ~70 samples, but it failed with this error:
gridss.SoftClipsToSplitReads done. Elapsed time: 470.26 minutes.
Runtime.totalMemory()=4151836672
Exception in thread "main" java.lang.NumberFormatException: For input string: "asm4-34583"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at au.edu.wehi.idsv.SplitReadHelper.getRealignmentFirstAlignedBaseReadOffset(SplitReadHelper.java:231)
at au.edu.wehi.idsv.SplitReadHelper.unclip(SplitReadHelper.java:175)
at au.edu.wehi.idsv.SplitReadHelper.replaceAlignment(SplitReadHelper.java:275)
at au.edu.wehi.idsv.SplitReadHelper.replaceAlignment(SplitReadHelper.java:260)
at au.edu.wehi.idsv.SplitReadRealigner.prepareRecordsForWriting(SplitReadRealigner.java:273)
at au.edu.wehi.idsv.SplitReadRealigner.writeCompletedAlignment(SplitReadRealigner.java:237)
at au.edu.wehi.idsv.SplitReadRealigner.mergeSupplementaryAlignment(SplitReadRealigner.java:384)
at au.edu.wehi.idsv.SplitReadRealigner.mergeSupplementaryAlignment(SplitReadRealigner.java:340)
at au.edu.wehi.idsv.SplitReadRealigner.createSupplementaryAlignments(SplitReadRealigner.java:310)
at gridss.SoftClipsToSplitReads.doWork(SoftClipsToSplitReads.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at gridss.SoftClipsToSplitReads.main(SoftClipsToSplitReads.java:118)
INFO 2019-10-10 17:34:34 IdentifyVariants

This happened after merging all the individual assembly BAM files. I used the following command for the call step:
sh /bulk/yuanwen/Tools/gridss-2.6.2/scripts/gridss.sh --threads 10 --reference /bulk/yuanwen/RKQQC/20190707_SALSA/files/HiC_final/scaffolds_FINAL.fasta --output /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/files/gridss.vcf --assembly /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/files/gridss.all.bam --jar /bulk/yuanwen/Tools/gridss-2.6.2-gridss-jar-with-dependencies.jar --workingdir /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/scripts/ --jvmheap 60g --steps Call /bulk/feihe/data/Pgt_isolates/bam_YuanWen/00M063C_S59.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/00MN99C_S38.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01MN84-A-1-2_S51.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01SD80A_S13.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01TUR17A_S1.RG.DupMarked.bam ....

I tried adding export LC_ALL=C before running the command, but still got the same error.

Then I tried merging the individual assembly BAM files in batches (5 samples per batch) instead of merging the BAM files of all ~70 samples together. Some batches processed fine, but the majority did not, failing with a similar error (for example: Exception in thread "main" java.lang.NumberFormatException: For input string: "asm7-53").

Could you please give me some suggestions about what might be the problem?

Thank you!

Yuanwen

d-cameron · 2019-10-14T02:31:43Z

> Could you please give me some suggestions about what might be the problem?

Delete all intermediate files (i.e. anything in a *.gridss.working directory) and try again. Some files may already have been written using your system default locale, so export LC_ALL=C won't fix those files.
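For anyone scripting this cleanup, here is a minimal Python sketch; it is not part of GRIDSS. It assumes the intermediate files live in directories whose names end in .gridss.working somewhere under a root directory you choose. find_working_dirs and remove_working_dirs are hypothetical helper names. It defaults to a dry run so that nothing is deleted until the list has been checked.

```python
import shutil
from pathlib import Path


def find_working_dirs(root):
    """Return every directory under root whose name ends in .gridss.working."""
    return sorted(p for p in Path(root).rglob("*.gridss.working") if p.is_dir())


def remove_working_dirs(root, dry_run=True):
    """Return the working directories found; delete them unless dry_run is True."""
    found = find_working_dirs(root)
    for path in found:
        # A nested working directory may already be gone with its parent.
        if not dry_run and path.exists():
            shutil.rmtree(path)
    return found
```

Calling remove_working_dirs(root) only reports what would be removed; pass dry_run=False to delete, then rerun with LC_ALL=C exported.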

d-cameron · 2019-10-14T02:32:34Z

What locale do you use as your system default?

YuanwenGuo · 2019-10-14T02:45:31Z

I typed in "locale", and it's showing:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thank you!

YuanwenGuo · 2019-10-15T01:18:49Z

Hello,
I deleted all the intermediate files, including the *.gridss.working directories, but still no luck.

Thank you,
Yuanwen

d-cameron · 2019-10-15T01:48:57Z

> This is after merging all the individual assembly bam file, I used the following commands for this call step:
> ...
> Then I tried to merge individual assembly bam files in batch (5 samples per batch), instead of merging all ~70 samples bam files together.

Ok, reading over this issue again, the problem is something completely different: GRIDSS is designed for joint assembly of all samples - not per-sample assembly. You should only have 1 assembly file.

Merging per-sample assembly files won't work because:

- The per-sample assembly breakdown is done by ordinal, not name. Assembly support from all samples will be assigned to the first sample.
- The SAM specs require unique read names. GRIDSS assembly contig names are only unique within each file, so merging multiple assemblies together results in duplicate read names. This violation of the SAM specs causes the rather cryptic error you are getting.

In summary: GRIDSS does joint assembly: all input files are required for the assembly.
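The read name clash can be checked for directly. The sketch below is an illustration, not GRIDSS code. It assumes each assembly file has been converted to SAM text (for example with samtools view -h) and reports every read name that appears in more than one file. read_names and clashing_names are hypothetical helper names, and the toy records are made up for the example.

```python
from collections import defaultdict


def read_names(sam_lines):
    """Yield the QNAME (first column) of each alignment line, skipping @ headers."""
    for line in sam_lines:
        if line and not line.startswith("@"):
            yield line.split("\t", 1)[0]


def clashing_names(named_sources):
    """Map each read name seen in more than one source to the sources that contain it."""
    seen = defaultdict(set)
    for source, sam_lines in named_sources.items():
        for name in read_names(sam_lines):
            seen[name].add(source)
    return {name: sorted(srcs) for name, srcs in seen.items() if len(srcs) > 1}


# Toy records: both per-sample assemblies produced a contig called asm4-34583.
sample1 = ["@HD\tVN:1.6", "asm4-34583\t0\tscaffold_1\t100\t60\t50M\t*\t0\t0\t*\t*"]
sample2 = ["asm4-34583\t16\tscaffold_9\t500\t60\t50M\t*\t0\t0\t*\t*",
           "asm4-7\t0\tscaffold_2\t10\t60\t50M\t*\t0\t0\t*\t*"]
print(clashing_names({"sample1": sample1, "sample2": sample2}))
# prints {'asm4-34583': ['sample1', 'sample2']}
```

Any non-empty result means the merged file would contain duplicate read names.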

d-cameron · 2019-10-15T01:52:31Z

Does joint assembly actually fail? If so, please raise an issue for that.

If you really need independent per-sample assemblies for some reason, you'd need to hack it together: do joint assembly against placeholder BAMs (i.e. joint assembly, but with all samples except one replaced with empty BAMs) so the ordinals all match up, then rename all the assembly contig names to unique names (e.g. asm4-34583 would become sample1_asm4-34583 and sample2_asm4-34583) to prevent the name clash you are encountering.
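The renaming step might look like the following Python sketch. It assumes the assembly has been dumped to SAM text (for example with samtools view -h) and that only the read name in the first column needs the prefix. The thread does not say whether any other field or tag in a GRIDSS assembly record also embeds the contig name, so check that before relying on it. prefix_read_names is a hypothetical helper name.

```python
def prefix_read_names(sam_lines, prefix):
    """Yield SAM lines with prefix prepended to each QNAME; header lines pass through."""
    for line in sam_lines:
        if not line or line.startswith("@"):
            yield line
        else:
            qname, rest = line.split("\t", 1)
            yield f"{prefix}{qname}\t{rest}"


# Toy record made up for the example.
records = ["@SQ\tSN:scaffold_1\tLN:1000",
           "asm4-34583\t0\tscaffold_1\t100\t60\t50M\t*\t0\t0\t*\t*"]
for line in prefix_read_names(records, "sample1_"):
    print(line)
```

Using a different prefix per assembly (sample1_, sample2_, ...) keeps the names unique after merging.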

YuanwenGuo · 2019-10-15T02:34:50Z

Thank you for the detailed explanation. It makes a lot of sense!

I actually tried calling SVs on 5 random samples with the incorrect process I described, and it happened to succeed in generating a VCF. That is why I thought it might be a way to process my data.

Yes, I tried joint assembly with multiple input BAM files together. The runs either failed with the error "Error assembling scaffold..." or kept trying to recover forever. I will raise another issue for that soon (thank you for pointing it out).

In addition, I tried the strategy you mentioned of using empty placeholder BAMs (following #182). While preprocessing an empty BAM file, I got the following error message:
"Exception in thread "main" java.lang.IllegalStateException: Missing required file /bulk/yuanwen/RKQQC/20191007_gridss_cohort/scripts/empty_09ETH8-3_S85.bam.gridss.working/empty_09ETH8-3_S85.bam.sv.bam"

Is gridss going to generate sv.bam for original empty bam files?

Best,
Yuanwen

d-cameron · 2019-10-15T03:30:11Z

> I actually tried with 5 random samples to call SV with the incorrect process I mentioned, but accidentally it succeeded to generate VCF

Since most assembly contigs are filtered internally within GRIDSS, the contig names in the output file are not a continuous asm1_1, asm1_2, ... sequence: the numbering has lots of gaps in it. With only 5 files, it looks like you didn't encounter the same assembly name within a small window, so there was no crash. Your VCF results will be wrong, though: the variant breakdown will show your first sample having assembly support for pretty much everything, and no other sample having anything assembled.

> Is gridss going to generate sv.bam for original empty bam files?

No. Just create empty placeholders for *.sv.bam files. You'll also need to create matching index files.

When I had to do something like this, I just created an empty.bam and empty.bam.bai (the latter using samtools index). For each placeholder file I needed, I created symlinks to the empty bam and bai files.

One thing you need to make sure of is that the order of the input files is always the same. I'll add a sanity check for this in the next release of GRIDSS.
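A minimal Python sketch of the placeholder setup described above, under some assumptions: empty.bam and empty.bam.bai already exist (the index made with samtools index), each index sits next to its BAM with .bai appended, and symlinks are available on the filesystem. link_placeholders and batch_inputs are hypothetical helper names, and the sample file names are made up.

```python
import os
from pathlib import Path


def link_placeholders(empty_bam, empty_bai, placeholder_bams):
    """Symlink each placeholder BAM path, and its .bai, to the shared empty files."""
    for bam in map(Path, placeholder_bams):
        bam.parent.mkdir(parents=True, exist_ok=True)
        for link, target in ((bam, empty_bam), (Path(str(bam) + ".bai"), empty_bai)):
            if not link.is_symlink() and not link.exists():
                os.symlink(os.path.abspath(target), link)


def batch_inputs(all_samples, batch, placeholder_for):
    """Return one input per sample in cohort order, real BAM only for batch members."""
    return [s if s in batch else placeholder_for(s) for s in all_samples]


cohort = ["s1.bam", "s2.bam", "s3.bam"]
print(batch_inputs(cohort, {"s2.bam"}, lambda s: "empty_" + s))
# prints ['empty_s1.bam', 's2.bam', 'empty_s3.bam']
```

Deriving every batch's input list from the same cohort list is one way to keep the input order identical across runs.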


YuanwenGuo · 2019-10-16T02:43:36Z

Thanks a lot for the suggestions! I have already started batch assembly (~10 samples per batch), with empty BAM files standing in for the samples not in the batch. Hopefully they will run well. I will keep you updated about any issues that arise.

Best,
Yuanwen


YuanwenGuo · 2019-10-17T15:16:15Z

Hi Daniel,
Some of my batches finished assembly in a reasonable time (6-15 hrs), but the other batches seem to have been frozen for ~13 hrs. The last few lines of the log file are as follows:
INFO 2019-10-16 19:47:54 AssemblyEvidenceSource Completed assembly on chunk 13 (scaffold_31:1732124-scaffold_35:1698546) in 10053s (2.793 h)
INFO 2019-10-16 19:47:54 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk13.bam
INFO 2019-10-16 19:53:04 AssemblyEvidenceSource Completed assembly on chunk 11 (scaffold_24:1089863-scaffold_27:2299935) in 10518s (2.922 h)
INFO 2019-10-16 19:53:04 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk11.bam
INFO 2019-10-16 19:57:59 AssemblyEvidenceSource Completed assembly on chunk 16 (scaffold_45:1022091-scaffold_52:490468) in 9091s (2.525 h)
INFO 2019-10-16 19:57:59 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk16.bam
INFO 2019-10-16 19:58:56 AssemblyEvidenceSource Completed assembly on chunk 15 (scaffold_40:526020-scaffold_45:1022090) in 9356s (2.599 h)
INFO 2019-10-16 19:58:56 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk15.bam

When I look in each .gridss.working directory, the majority of the chunks appear to be finished and have a .bai file, but one or two are still gridss.tmp.batch*.bam.assembly.chunk*.bam files.

Right now I am using 5 samples per batch, will it help if I split each batch further into 2 or 3 samples for those frozen ones, or do you have any other suggestions?

Thank you!

Yuanwen

YuanwenGuo · 2019-10-17T19:49:01Z

Sorry for posting multiple questions at once. I am a little confused about how to rename the batch assembly files. Should I rename all the assembly contig names (the third column of the SAM file) or all the read names (the first column of the SAM file), as stated in #182?

Much appreciated,
Yuanwen

d-cameron · 2019-10-21T02:10:39Z

> Right now I am using 5 samples per batch, will it help if I split each batch further into 2 or 3 samples for those frozen ones, or do you have any other suggestions?

Are they actually at 0% CPU utilisation, or are they just taking a very long time to run? If it's the latter, it might be due to low complexity regions of your reference. You can wait for them to finish; rerun them with more aggressive assembly timeouts (assembly.positional.safetyModePathCountThreshold and assembly.positional.safetyModeContigsToCall being the two most relevant to low complexity slowness); or, if you know which regions are suspicious (e.g. the hg19 ENCODE DAC blacklist), specify them using the BLACKLIST parameter.
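If the suspicious regions are known as intervals (for example from a repeat masker), they can be written out as a BED file. The Python sketch below assumes BLACKLIST accepts a BED file, which this thread does not state, so check the GRIDSS documentation first. merge_intervals and to_bed are hypothetical helper names and the coordinates are made up.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (contig, start, end) intervals, 0-based half-open."""
    merged = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([contig, start, end])
    return [tuple(m) for m in merged]


def to_bed(intervals):
    """Render the merged intervals as BED text: contig, start, end, tab-separated."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in merge_intervals(intervals))


# Made-up intervals for illustration only.
regions = [("scaffold_2", 5, 10), ("scaffold_1", 0, 100),
           ("scaffold_1", 50, 150), ("scaffold_1", 150, 200)]
print(to_bed(regions), end="")
```

Merging first keeps the file small when the masker reports many overlapping hits.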

> I am a little bit confused about how to rename batch assembly file. Should I rename all the assembly contig names (which is third column of sam format file) or rename all the read names (first column of sam format file) as stated in #182 ?

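Only the assembly contig names need to change. In the assembly BAM these are the read names (first column), not the reference names in the third column. Renaming them prevents name collisions when you merge the files together (the SAM specs require each template (aka read pair) to have a different read name).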

YuanwenGuo · 2019-10-21T17:12:23Z

They are still running with ~99.8% CPU usage.

Thank you for pointing out the low complexity region issue! My genome is fungal, so I can probably run a repeat masker to find these regions first and use the result as a blacklist. Alternatively, I can try adjusting safetyModePathCountThreshold and safetyModeContigsToCall as you suggested, but I am not sure what these two settings mean. PositionalAssemblyConfiguration.java describes them as "Number of memoized paths to enter safety mode" and "Number of contigs called in safety mode", which I find difficult to understand. Does safetyModePathCountThreshold have the same function as --maxcoverage on the command line?

For the batch assembly file renaming, if I understand correctly: I renamed the first column of the SAM file (the contig read name), e.g. read name asm0-1 became batch1_asm0-1. I merged the successfully assembled batch BAM files and started cohort calling. Hopefully it will run well.

Thank you,
Yuanwen

d-cameron · 2019-10-21T23:55:50Z

Safety mode is entered when the number of active nodes in the assembly exceeds this threshold (safetyModePathCountThreshold). When safety mode is entered, the next safetyModeContigsToCall contigs are called, then the reads remaining in the assembly graph are discarded.

Safety mode is reset whenever a new read is loaded into the assembly graph (i.e. whenever the assembler makes any progress and loads more of the genome into the graph).
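To make the rule concrete, here is a toy Python model of it. This is not GRIDSS code and the class and method names are hypothetical. It assumes the count of contigs called in safety mode is the safetyModeContigsToCall setting, that the contig being called when the threshold is crossed counts towards it, and that the mode switches off again once the remaining reads are discarded.

```python
class SafetyModeModel:
    """Toy model of the safety-mode rule described above; not the GRIDSS implementation."""

    def __init__(self, path_count_threshold, contigs_to_call):
        self.path_count_threshold = path_count_threshold
        self.contigs_to_call = contigs_to_call
        self.remaining = None  # None means safety mode is off

    def on_read_loaded(self):
        """Loading a new read into the graph resets safety mode."""
        self.remaining = None

    def on_contig_called(self, active_nodes):
        """Return True when the reads left in the graph should be discarded."""
        if self.remaining is None and active_nodes > self.path_count_threshold:
            self.remaining = self.contigs_to_call  # enter safety mode
        if self.remaining is None:
            return False
        self.remaining -= 1
        if self.remaining > 0:
            return False
        self.remaining = None  # budget used up: discard and leave safety mode
        return True


model = SafetyModeModel(path_count_threshold=100, contigs_to_call=3)
print([model.on_contig_called(n) for n in (50, 150, 150, 150)])
# prints [False, False, False, True]
```

In this model, lowering either value makes the assembler give up on a dense region sooner, which is the sense in which the timeouts become more aggressive.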

> I renamed first column (which is contig read) of SAM file, e.g, read name asm0-1 became batch1_asm0-1

Correct

d-cameron · 2019-10-22T00:00:18Z

> They are still running with ~99.8 CPU usage.

100% of all cores, or 100% of a single core? If it's just a single core, it's probably one particular region that they're stalling on. If it's all cores, it's probably some sort of low complexity sequence/repeat, with multiple worker threads each stalling on a homolog of it.

Would you be willing (and able) to provide a BAM containing just the reads in one of the stalled regions? Most of my work has been on mammals, so if there's an edge case where GRIDSS runs unacceptably slowly, some sample data would help me adjust the parameters or add additional checks to ensure that the worst-case runtime is not unreasonable.
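One way to narrow down a stalled region is to list the chunks the log reports as completed and see which chunk numbers are absent. The Python sketch below only parses the "Completed assembly on chunk" lines in the format shown earlier in this thread. completed_chunks and missing_chunks are hypothetical helper names. The full log has to be passed in, and because the total chunk count is not in these lines, chunks numbered above the highest completed one cannot be detected.

```python
import re

COMPLETED = re.compile(
    r"Completed assembly on chunk (\d+) \((\S+):(\d+)-(\S+):(\d+)\) in (\d+)s"
)


def completed_chunks(log_lines):
    """Map chunk number to (start contig, start pos, end contig, end pos, seconds)."""
    chunks = {}
    for line in log_lines:
        m = COMPLETED.search(line)
        if m:
            n, c1, p1, c2, p2, secs = m.groups()
            chunks[int(n)] = (c1, int(p1), c2, int(p2), int(secs))
    return chunks


def missing_chunks(chunks):
    """Chunk numbers below the highest completed one that never reported completion."""
    if not chunks:
        return []
    return [n for n in range(max(chunks)) if n not in chunks]


log = [
    "INFO 2019-10-16 19:47:54 AssemblyEvidenceSource Completed assembly on chunk 13 "
    "(scaffold_31:1732124-scaffold_35:1698546) in 10053s (2.793 h)",
    "INFO 2019-10-16 19:57:59 AssemblyEvidenceSource Completed assembly on chunk 16 "
    "(scaffold_45:1022091-scaffold_52:490468) in 9091s (2.525 h)",
]
done = completed_chunks(log)
print(done[13])
# prints ('scaffold_31', 1732124, 'scaffold_35', 1698546, 10053)
```

The coordinates of the neighbouring completed chunks bracket where a missing chunk lies, which gives a candidate region to extract reads from.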

YuanwenGuo · 2019-10-22T03:14:28Z

Hi Daniel,

Sure, I will prepare the bam file containing stalled region reads soon. Thank you for taking time to help us!

I am running the cohort call on the merged assembly with ~350 GB of memory. It ran well for about 34 hrs but then failed because there were too many split reads in one of the contigs. If I reduce --maxcoverage or safetyModePathCountThreshold, will that help to avoid this kind of region during calling? My understanding is that safetyModePathCountThreshold helps during assembly rather than calling; please correct me if I am wrong. Do you have any suggestions for avoiding such regions during final calling?

Thank you!

Best,
Yuanwen

YuanwenGuo · 2019-10-23T02:32:39Z

Hello Daniel,

I was trying to prepare a BAM file containing only the reads in stalling regions, but I am not sure which region I should pick. When I ran assembly on two samples (s1 and s2), it froze at scaffold_62-82. The last few lines of the log file are below:
INFO 2019-10-17 14:59:31 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1.bam.gridss.working/gridss.tmp.batch16_sub1.bam.assembly.chunk42.bam
INFO 2019-10-17 14:59:41 AssemblyEvidenceSource Completed assembly on chunk 18 (scaffold_62:344981-scaffold_82:43790) in 2059s (34.33 min)
INFO 2019-10-17 14:59:41 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1.bam.gridss.working/gridss.tmp.batch16_sub1.bam.assembly.chunk18.bam

But when I ran assembly on just one of the two samples (s1), it froze at a different position, scaffold_969-1008:
INFO 2019-10-17 23:16:00 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1_s2.bam.gridss.working/gridss.tmp.batch16_sub1_s2.bam.assembly.chunk41.bam
INFO 2019-10-17 23:16:23 AssemblyEvidenceSource Completed assembly on chunk 42 (scaffold_969:1-scaffold_1008:1235) in 174s (2.907 min)
INFO 2019-10-17 23:16:23 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1_s2.bam.gridss.working/gridss.tmp.batch16_sub1_s2.bam.assembly.chunk42.bam

Therefore, I tried to attach the whole BAM file here, but it looks like it cannot be attached. Could I send it to you by email, if that is convenient?

As for the question about cores: some of my assembly processes are still running, and they seem to be at 100% of a single core.

Thank you!

Best,
Yuanwen

d-cameron · 2019-12-16T05:27:08Z

Is this still an issue in v2.7.3?

d-cameron pushed a commit that referenced this issue Oct 15, 2019

Creating zip file with minimal data required for reproduction of assembly errors #260 #261

48e1105

d-cameron closed this as completed Oct 17, 2019

d-cameron pushed a commit that referenced this issue Oct 17, 2019

#261 Added checks to ensure assembly and variant calling have same inputs

361e8b5

d-cameron added a commit that referenced this issue Oct 21, 2019

#261 exposing assembly read name prefix in configuration file

9a24281

d-cameron added a commit that referenced this issue Oct 21, 2019

#261 added safety mode explanation

358c00c
