java.lang.NumberFormatException #261

Closed
YuanwenGuo opened this issue Oct 13, 2019 · 18 comments
Comments

@YuanwenGuo

Hello,

My issue is similar to #253, with a slight difference. I tried to run gridss-2.6.0 to do cohort calling for ~70 samples, but it failed with the error:
gridss.SoftClipsToSplitReads done. Elapsed time: 470.26 minutes.
Runtime.totalMemory()=4151836672
Exception in thread "main" java.lang.NumberFormatException: For input string: "asm4-34583"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:580)
at java.lang.Integer.parseInt(Integer.java:615)
at au.edu.wehi.idsv.SplitReadHelper.getRealignmentFirstAlignedBaseReadOffset(SplitReadHelper.java:231)
at au.edu.wehi.idsv.SplitReadHelper.unclip(SplitReadHelper.java:175)
at au.edu.wehi.idsv.SplitReadHelper.replaceAlignment(SplitReadHelper.java:275)
at au.edu.wehi.idsv.SplitReadHelper.replaceAlignment(SplitReadHelper.java:260)
at au.edu.wehi.idsv.SplitReadRealigner.prepareRecordsForWriting(SplitReadRealigner.java:273)
at au.edu.wehi.idsv.SplitReadRealigner.writeCompletedAlignment(SplitReadRealigner.java:237)
at au.edu.wehi.idsv.SplitReadRealigner.mergeSupplementaryAlignment(SplitReadRealigner.java:384)
at au.edu.wehi.idsv.SplitReadRealigner.mergeSupplementaryAlignment(SplitReadRealigner.java:340)
at au.edu.wehi.idsv.SplitReadRealigner.createSupplementaryAlignments(SplitReadRealigner.java:310)
at gridss.SoftClipsToSplitReads.doWork(SoftClipsToSplitReads.java:95)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at gridss.SoftClipsToSplitReads.main(SoftClipsToSplitReads.java:118)
INFO 2019-10-10 17:34:34 IdentifyVariants

This is after merging all the individual assembly BAM files. I used the following command for this call step:
sh /bulk/yuanwen/Tools/gridss-2.6.2/scripts/gridss.sh --threads 10 --reference /bulk/yuanwen/RKQQC/20190707_SALSA/files/HiC_final/scaffolds_FINAL.fasta --output /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/files/gridss.vcf --assembly /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/files/gridss.all.bam --jar /bulk/yuanwen/Tools/gridss-2.6.2-gridss-jar-with-dependencies.jar --workingdir /bulk/yuanwen/RKQQC/20191008_gridss_new_combine/scripts/ --jvmheap 60g --steps Call /bulk/feihe/data/Pgt_isolates/bam_YuanWen/00M063C_S59.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/00MN99C_S38.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01MN84-A-1-2_S51.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01SD80A_S13.RG.DupMarked.bam /bulk/feihe/data/Pgt_isolates/bam_YuanWen/01TUR17A_S1.RG.DupMarked.bam ....

I tried adding export LC_ALL=C before running the commands, but still got the same error.

Then I tried to merge individual assembly bam files in batch (5 samples per batch), instead of merging all ~70 samples bam files together. It turned out that some batches process fine, but the majority do not (they fail with a similar error, for example: Exception in thread "main" java.lang.NumberFormatException: For input string: "asm7-53").

Could you please give me some suggestions about what might be the problem?

Thank you!

Yuanwen

@d-cameron
Member

Could you please give me some suggestions about what might be the problem?

Delete all intermediate files (ie anything in a *.gridss.working directory) and try again. Some files may have already been written using your system default locale so export LC_ALL=C won't fix those files.

@d-cameron
Member

What locale do you use as your system default?

@YuanwenGuo
Author

I typed in "locale", and it's showing:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thank you!

@YuanwenGuo
Author

Hello,
I deleted all intermediate files and also the *.gridss.working directory, but no luck so far.

Thank you,
Yuanwen

@d-cameron
Member

This is after merging all the individual assembly BAM files. I used the following command for this call step:
...
Then I tried to merge individual assembly bam files in batch (5 samples per batch), instead of merging all ~70 samples bam files together.

Ok, reading over this issue again, the problem is something completely different: GRIDSS is designed for joint assembly of all samples - not per-sample assembly. You should only have 1 assembly file.

Merging per-sample assembly files won't work because:

  1. The per-sample assembly breakdown is done by ordinal, not name. Assembly support from all samples will all be assigned to the first sample.

  2. The SAM specs require unique read names. GRIDSS assembly contig names are only unique for each file. Merging multiple assemblies together will result in duplicate read names. This violation of the SAM specs results in the rather cryptic error you are getting.

In summary: GRIDSS does joint assembly: all input files are required for the assembly.
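
To check whether a set of per-sample assemblies would clash in this way before merging them, something like the following could be used. It is only a sketch: it assumes the assemblies are available as SAM text lines, the function names are made up for illustration, and the records are invented examples rather than real GRIDSS output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_names(sam_lines: Iterable[str]) -> List[str]:
    """Return the QNAME (first column) of each alignment line, skipping '@' header lines."""
    return [
        line.split("\t", 1)[0]
        for line in sam_lines
        if line.strip() and not line.startswith("@")
    ]


def clashing_names(assemblies: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map every read name seen in more than one assembly to the assemblies containing it."""
    seen = defaultdict(set)
    for label, lines in assemblies.items():
        for name in read_names(lines):
            seen[name].add(label)
    return {name: sorted(labels) for name, labels in seen.items() if len(labels) > 1}


# Invented records for illustration only (labels and coordinates are hypothetical).
sample1 = ["@HD\tVN:1.6", "asm4-34583\t0\tscaffold_1\t100\t60\t50M\t*\t0\t0\t*\t*"]
sample2 = [
    "@HD\tVN:1.6",
    "asm4-34583\t0\tscaffold_2\t900\t60\t50M\t*\t0\t0\t*\t*",
    "asm7-53\t0\tscaffold_3\t10\t60\t50M\t*\t0\t0\t*\t*",
]
print(clashing_names({"sample1": sample1, "sample2": sample2}))
```

Any name reported here appears in more than one file, so a plain merge of those files would contain duplicate read names.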

@d-cameron
Member

Does joint assembly actually fail? If so, please raise an issue for that.

If you really need independent per-sample assemblies for some reason you'd need to hack it together by doing joint assembly against placeholder bam (ie joint assembly, but with all samples except 1 replaced with empty bams) so the ordinals all match up, then renaming all the assembly contig names to unique names (e.g. asm4-34583 would become sample1_asm4-34583 and sample2_asm4-34583) to prevent the name clash you are encountering.

@YuanwenGuo
Author

Thank you for the detailed explanation. It makes a lot of sense!

I actually tried calling SVs on 5 random samples with the incorrect process I mentioned, and it happened to succeed in generating a VCF. Therefore, I thought that might be a way to process my data.

Yes, I tried to do joint assembly with multiple input BAM files together. The runs either failed with the error "Error assembling scaffold..." or tried to recover forever. I will raise another issue for that soon (thank you for pointing it out).

In addition, I tried the strategy you mentioned of using empty placeholder BAMs (according to #182). During preprocessing of the empty BAM file, I got the following error message:
"Exception in thread "main" java.lang.IllegalStateException: Missing required file /bulk/yuanwen/RKQQC/20191007_gridss_cohort/scripts/empty_09ETH8-3_S85.bam.gridss.working/empty_09ETH8-3_S85.bam.sv.bam"

Is gridss going to generate sv.bam for original empty bam files?

Best,
Yuanwen

@d-cameron
Member

I actually tried calling SVs on 5 random samples with the incorrect process I mentioned, and it happened to succeed in generating a VCF.

Since most assembly contigs are filtered internally within GRIDSS, the contig names in the output file (asm1_1, asm1_2 and so on) have lots of gaps in their numbering. With only 5 files, it looks like you didn't encounter the same assembly name within a small window, so there was no crash. Your VCF results will be wrong, as the variant breakdown will show your first sample having assembly support for pretty much everything, and no other sample having anything assembled.

Is gridss going to generate sv.bam for original empty bam files?

No. Just create empty placeholders for *.sv.bam files. You'll also need to create matching index files.

When I had to do something like this, I just created an empty.bam and empty.bam.bai (the latter using samtools index). For each placeholder file I needed, I created symlinks to the empty bam and bai files.
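
The symlink approach described above could be scripted roughly as follows. This is a sketch under assumptions: link_placeholders is a made-up helper, the sample name is hypothetical, and the index is assumed to be named by appending .bai to the BAM name (as with empty.bam.bai). The demo uses zero-byte stand-ins; in real use empty.bam would need to be a valid BAM with no reads and an index built by samtools index.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List


def link_placeholders(empty_bam: Path, targets: Iterable[Path]) -> List[Path]:
    """Symlink each target path, and a matching '<target>.bai', to the shared empty files.

    Assumes empty_bam and its index (empty_bam name + '.bai') already exist.
    Paths that already exist are left untouched.
    """
    empty_bai = empty_bam.with_name(empty_bam.name + ".bai")
    created = []
    for bam in targets:
        bam.parent.mkdir(parents=True, exist_ok=True)
        bai = bam.with_name(bam.name + ".bai")
        for link, source in ((bam, empty_bam), (bai, empty_bai)):
            if os.path.lexists(link):
                continue  # never overwrite a real file or an existing link
            os.symlink(source.resolve(), link)
            created.append(link)
    return created


# Demo in a scratch directory; the zero-byte files are stand-ins, not valid BAMs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    empty = root / "empty.bam"
    empty.touch()
    (root / "empty.bam.bai").touch()
    target = root / "sampleA.bam.gridss.working" / "sampleA.bam.sv.bam"
    made = link_placeholders(empty, [target])
    print([p.name for p in made])
```

Because existing paths are skipped, rerunning the helper should not disturb files that a real preprocessing run has already produced.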

One thing you need to make sure of is that the order of the input files is always the same. I'll add a sanity check for this in the next release of GRIDSS.

@YuanwenGuo
Author

Thanks a lot for the suggestions! I have already started batch assembly (~10 samples per batch), with empty BAM files for the samples not in each batch; hopefully they will run well. I will keep you updated about any issues that arise.

Best,
Yuanwen

@YuanwenGuo
Author

Hi Daniel,
Some of my batches finished assembly in a reasonable time (6-15 hrs), but the other batches seem to have been frozen for ~13 hrs. The last few lines of the log file are as follows:
INFO 2019-10-16 19:47:54 AssemblyEvidenceSource Completed assembly on chunk 13 (scaffold_31:1732124-scaffold_35:1698546) in 10053s (2.793 h)
INFO 2019-10-16 19:47:54 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk13.bam
INFO 2019-10-16 19:53:04 AssemblyEvidenceSource Completed assembly on chunk 11 (scaffold_24:1089863-scaffold_27:2299935) in 10518s (2.922 h)
INFO 2019-10-16 19:53:04 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk11.bam
INFO 2019-10-16 19:57:59 AssemblyEvidenceSource Completed assembly on chunk 16 (scaffold_45:1022091-scaffold_52:490468) in 9091s (2.525 h)
INFO 2019-10-16 19:57:59 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk16.bam
INFO 2019-10-16 19:58:56 AssemblyEvidenceSource Completed assembly on chunk 15 (scaffold_40:526020-scaffold_45:1022090) in 9356s (2.599 h)
INFO 2019-10-16 19:58:56 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch5.bam.gridss.working/gridss.tmp.batch5.bam.assembly.chunk15.bam

When I go into each .gridss.working directory, it seems that the majority of the chunks have finished and have a .bai file, except for one or two that are still gridss.temp.batch.bam.assembly.chunk*.bam files.

Right now I am using 5 samples per batch, will it help if I split each batch further into 2 or 3 samples for those frozen ones, or do you have any other suggestions?

Thank you!

Yuanwen

@YuanwenGuo
Author

Sorry about posting multiple questions at once. I am a little bit confused about how to rename the batch assembly files. Should I rename all the assembly contig names (the third column of a SAM file) or all the read names (the first column of a SAM file), as stated in #182?

Much appreciated,
Yuanwen

@d-cameron
Member

Right now I am using 5 samples per batch, will it help if I split each batch further into 2 or 3 samples for those frozen ones, or do you have any other suggestions?

Are they actually at 0% CPU utilisation, or are they just taking a very long time to run? If it's the latter, it might be due to low complexity regions of your reference. You can either wait for them to finish, rerun them with more aggressive assembly timeouts (assembly.positional.safetyModePathCountThreshold and assembly.positional.safetyModeContigsToCall being the two most relevant to low complexity slowness), or, if you know which regions are suspicious (e.g. the hg19 ENCODE DAC blacklist), specify them using the BLACKLIST parameter.
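
If a list of suspicious regions is available, it could be collapsed into a compact blacklist file along these lines. This sketch assumes the BLACKLIST parameter takes a BED file (the format the ENCODE blacklists are distributed in); the helper names and the coordinates are invented for illustration.

```python
from typing import Iterable, List, Tuple

# (scaffold, 0-based start, end-exclusive), the coordinate convention used by BED
Interval = Tuple[str, int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals on the same scaffold."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def to_bed(intervals: Iterable[Interval]) -> str:
    """Format intervals as tab-separated BED lines."""
    return "".join(f"{chrom}\t{start}\t{end}\n" for chrom, start, end in intervals)


# Invented low complexity regions, for illustration only.
regions = [
    ("scaffold_2", 5, 10),
    ("scaffold_1", 100, 200),
    ("scaffold_1", 150, 300),
    ("scaffold_1", 300, 320),
]
print(to_bed(merge_intervals(regions)), end="")
```

Scaffolds are ordered as plain strings here, which is enough for merging; a tool that expects reference order would need the lines re-sorted.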

I am a little bit confused about how to rename the batch assembly files. Should I rename all the assembly contig names (the third column of a SAM file) or all the read names (the first column of a SAM file), as stated in #182?

Correct. Only the assembly contig names need to change. This prevents name collisions when you merge them together (the SAM specs require each template (aka read pair) to have a different read name).

@YuanwenGuo
Author

They are still running with ~99.8% CPU usage.

Thank you for pointing out the low complexity region issue! My genome is fungal, so I can probably run repeat masking to identify these regions first and use the result as a blacklist. Or I can try adjusting safetyModePathCountThreshold and safetyModeContigsToCall as you suggested, but I am not sure what these two settings mean. They are described in PositionalAssemblyConfiguration.java as "Number of memoized paths to enter safety mode" and "Number of contigs called in safety mode", which I find difficult to understand. Does safetyModePathCountThreshold have the same function as --maxcoverage on the command line?

For batch assembly file renaming, if I understand correctly, I renamed the first column (which is the contig read name) of the SAM file, e.g. read name asm0-1 became batch1_asm0-1. I merged the successfully assembled batch BAM files and started cohort calling. Hopefully it will run well.

Thank you,
Yuanwen

d-cameron added a commit that referenced this issue Oct 21, 2019
@d-cameron
Member

Safety mode is entered when the number of active nodes in the assembly exceeds safetyModePathCountThreshold. When safety mode is entered, the next safetyModeContigsToCall contigs will be called, then the remaining reads still in the assembly graph are discarded.

Safety mode is reset whenever a new read is loaded into the assembly graph (i.e. when the assembler makes any progress and loads more of the genome into the graph).

I renamed the first column (which is the contig read name) of the SAM file, e.g. read name asm0-1 became batch1_asm0-1

Correct
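
The renaming confirmed above could be done with a small filter like the one below. It is a sketch, not a tested GRIDSS workflow: prefix_read_names is a made-up helper that works on SAM text lines, the records are invented, and it assumes nothing else in the record (optional tags, for example) refers to the old name and needs updating.

```python
from typing import Iterable, Iterator


def prefix_read_names(sam_lines: Iterable[str], prefix: str) -> Iterator[str]:
    """Yield SAM lines with '<prefix>_' prepended to the read name (first column).

    Header lines (starting with '@') and blank lines are passed through unchanged.
    """
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            yield line
            continue
        name, sep, rest = line.partition("\t")
        yield f"{prefix}_{name}{sep}{rest}"


# Invented records for illustration only.
batch1 = [
    "@HD\tVN:1.6",
    "asm0-1\t0\tscaffold_1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "asm0-1\t2048\tscaffold_9\t700\t60\t20M30H\t*\t0\t0\t*\t*",
]
for renamed in prefix_read_names(batch1, "batch1"):
    print(renamed)
```

All records that share a read name receive the same prefix, so split alignments of one contig stay grouped. In practice the lines would come from a BAM converted to SAM text (for example with samtools view -h) and be converted back afterwards.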

@d-cameron
Member

They are still running with ~99.8% CPU usage.

100% of all cores, or 100% of a single core? If it's just a single core, it's probably one particular region that they're stalling on. If it's all cores, it's probably some sort of low-complexity sequence/repeat, with multiple worker threads each stalling on a different homolog.

Would you be willing (and able) to provide a BAM containing just the reads in one of the stalled regions? Most of my work has been on mammals, so if there's an edge case where GRIDSS runs unacceptably slowly, some sample data would help me adjust the parameters or add additional checks to ensure that the worst-case runtime is not unreasonable.

@YuanwenGuo
Author

Hi Daniel,

Sure, I will prepare the bam file containing stalled region reads soon. Thank you for taking time to help us!

I am running the cohort call on the merged assembly with ~350 GB of memory. It ran well for about 34 hrs but failed because of too many split reads in one of the contigs. If I reduce --maxcoverage or safetyModePathCountThreshold, will it help to avoid this kind of region during the calling process? My understanding is that safetyModePathCountThreshold helps during assembly rather than during calling; please correct me if I am wrong. Do you have any suggestions about how to avoid such regions during final calling?

Thank you!

Best,
Yuanwen

@YuanwenGuo
Author

Hello Daniel,

I was trying to prepare a BAM file containing only the reads in the stalling regions, but I am not sure which region I should pick. The reason is that when I ran assembly on two samples (s1 and s2), it froze at scaffold_62-82; the last few lines of the log file are below:
INFO 2019-10-17 14:59:31 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1.bam.gridss.working/gridss.tmp.batch16_sub1.bam.assembly.chunk42.bam
INFO 2019-10-17 14:59:41 AssemblyEvidenceSource Completed assembly on chunk 18 (scaffold_62:344981-scaffold_82:43790) in 2059s (34.33 min)
INFO 2019-10-17 14:59:41 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1.bam.gridss.working/gridss.tmp.batch16_sub1.bam.assembly.chunk18.bam

But when I tried to run assembly based on one of the two samples (s1), it froze at another position scaffold_969-1008:
INFO 2019-10-17 23:16:00 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1_s2.bam.gridss.working/gridss.tmp.batch16_sub1_s2.bam.assembly.chunk41.bam
INFO 2019-10-17 23:16:23 AssemblyEvidenceSource Completed assembly on chunk 42 (scaffold_969:1-scaffold_1008:1235) in 174s (2.907 min)
INFO 2019-10-17 23:16:23 SAMFileUtil Sorting /bulk/yuanwen/RKQQC/20191014_gridss_batch/scripts/batch16_sub1_s2.bam.gridss.working/gridss.tmp.batch16_sub1_s2.bam.assembly.chunk42.bam

Therefore, I tried to attach the whole BAM file here, but it looks like it cannot be attached. Could I send it to you by email, if that is convenient?

Regarding the number of cores: some of my assembly processes are still running, and they seem to be at 100% of a single core.

Thank you!

Best,
Yuanwen

@d-cameron
Member

d-cameron commented Dec 16, 2019

Is this still an issue in v2.7.3?
