Java NullPointerException when running virusbreakend.sh #480

Closed
helrick opened this issue Mar 30, 2021 · 8 comments

@helrick
Contributor

helrick commented Mar 30, 2021

Hi, thanks for the tool! I've been able to run GRIDSS successfully and generate the kraken database with virusbreakend-build.sh, but get an error from the virusbreakend.sh script when I run:

bsub -M 20G "virusbreakend.sh sample.gridss.masked.vcf.gz --db virusbreakenddb --jar gridss-2.11.0-gridss-jar-with-dependencies.jar --reference GRCh38.d1.vd1.fa --output sample.gridss.masked.virus.vcf.gz sample_T.bam"

First, when using gridsstools, I received an error about not being able to load shared libraries: (gridsstools: error while loading shared libraries: libbz2.so.1.0) but the script still continued and returned a Java error (see below). Since I'm on a shared cluster (without root), I wanted to avoid compiling gridsstools from source. I removed it from my path and edited the script not to exit when gridsstools isn't found (sidenote: the "2-3x time longer" error message is a bit confusing since it reads like a warning but still kills the program. Is running without gridsstools supported?).

Both with and without gridsstools though, I received the following Java error at the end of the log file (edited to remove ids and full paths):

...
INFO    2021-03-30 11:26:25     IdentifyViralTaxa       Loading taxonomy IDs of interest from sample.masked.virus.vcf.gz.host_taxids.txt
INFO    2021-03-30 11:26:25     IdentifyViralTaxa       Loaded 4049 taxonomy IDs
INFO    2021-03-30 11:26:25     IdentifyViralTaxa       Loading seqid2taxid.map from virusbreakenddb/seqid2taxid.map
INFO    2021-03-30 11:26:25     IdentifyViralTaxa       Loading NCBI taxonomy from virusbreakenddb/taxonomy/nodes.dmp
INFO    2021-03-30 11:26:31     IdentifyViralTaxa       Parsing Kraken2 report from gripss/sample.masked.virus.vcf.gz.virusbreakend.working/sample.masked.virus.vcf.gz.kraken2.report.all.txt
INFO    2021-03-30 11:26:31     IdentifyViralTaxa       Writing abridged report to sample.masked.virus.vcf.gz.virusbreakend.working/sample.masked.virus.vcf.gz.kraken2.report.viral.txt
[Tue Mar 30 11:26:31 BST 2021] gridss.kraken.IdentifyViralTaxa done. Elapsed time: 0.18 minutes.
Runtime.totalMemory()=2787115008
Exception in thread "main" java.lang.NullPointerException
        at gridss.kraken.IdentifyViralTaxa.lambda$doWork$4(IdentifyViralTaxa.java:144)
        at java.util.stream.ReferencePipeline$4$1.accept(ReferencePipeline.java:210)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfInt.evaluateSequential(ForEachOps.java:188)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.IntPipeline.forEach(IntPipeline.java:427)
        at gridss.kraken.IdentifyViralTaxa.doWork(IdentifyViralTaxa.java:145)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at gridss.kraken.IdentifyViralTaxa.main(IdentifyViralTaxa.java:230)

Let me know if I can provide any more info.

@d-cameron
Member

Is running without gridsstools supported?

No. I dropped support when I found that the performance difference was >10x when using cram inputs.

@helrick
Contributor Author

helrick commented Apr 6, 2021

Hi, great, thanks for the clarification and update. I'm re-opening this because the issue doesn't seem to be related to gridsstools. I've compiled it from source and verified that it's being used (the log contains "Found /nfs/research/icortes/helrick/bin/gridsstools" and "gridsstools version: gridsstools 1.0"), but I'm still getting the same Java error as above:

Exception in thread "main" java.lang.NullPointerException
        at gridss.kraken.IdentifyViralTaxa.lambda$doWork$4(IdentifyViralTaxa.java:144)
        at java.util.stream.ReferencePipeline$4$1.accept(ReferencePipeline.java:210)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
        at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:272)
        at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
        at java.util.stream.ForEachOps$ForEachOp$OfInt.evaluateSequential(ForEachOps.java:188)
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.util.stream.IntPipeline.forEach(IntPipeline.java:427)
        at gridss.kraken.IdentifyViralTaxa.doWork(IdentifyViralTaxa.java:145)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at gridss.kraken.IdentifyViralTaxa.main(IdentifyViralTaxa.java:230)

Let me know if it would be better to open this as a new issue.

@d-cameron d-cameron reopened this Apr 9, 2021
@d-cameron
Member

d-cameron commented Apr 9, 2021

That error is caused by the seqid2taxid.map file in your virusbreakend database not containing a sequence returned by kraken2.

This suggests there was an error building your virusbreakenddb database. What do you get when you run the following?

cd virusbreakenddb
grep -v -F -f <(cut -f 1 seqid2taxid.map | cut -f 1 -d ":") $(find library -name "*.fai" | grep -E "viral|added")

You should have 0 lines returned. That is, every viral reference in your database should also be in seqid2taxid.map.

@d-cameron
Member

If you still have all the files created by virusbreakend-build.sh (i.e., not just the files in the .tar.gz it creates), then you can run:

cd virusbreakenddb
grep -v -F -f <(cut -f 1 $(find library -name "*.fai")) seqid2taxid.map
grep -v -F -f <(cut -f 1 seqid2taxid.map | cut -f 1 -d ":") $(find library -name "*.fai")

This verifies that seqid2taxid.map matches the reference sequences in your kraken2 database (both commands should return nothing).
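
The grep commands above use fixed-string substring matching (no -x or -w), so in principle an ID that is a substring of another could hide a mismatch. A minimal sketch of an exact-match variant of the second check follows. It assumes that column 1 of each .fai file and column 1 of seqid2taxid.map (truncated at the first ":", as in the commands above) hold the same kind of sequence ID. The function name missing_from_map is hypothetical and not part of GRIDSS or kraken2.

```shell
#!/bin/sh
# Sketch only: exact-match comparison of .fai sequence names against
# seqid2taxid.map. missing_from_map is a hypothetical helper name.

# Print sequence names found in the given .fai files but absent from the map.
# Usage: missing_from_map MAP FAI [FAI...]
missing_from_map() {
    map=$1
    shift
    tmpdir=$(mktemp -d) || return 1
    # Column 1 of the map, truncated at the first ":" (mirrors the thread's command)
    cut -f 1 "$map" | cut -d ":" -f 1 | LC_ALL=C sort -u > "$tmpdir/map_ids"
    # Column 1 of every .fai file is the sequence name
    cut -f 1 "$@" | LC_ALL=C sort -u > "$tmpdir/fai_ids"
    # comm -23 prints lines that appear only in the first sorted file
    LC_ALL=C comm -23 "$tmpdir/fai_ids" "$tmpdir/map_ids"
    rm -rf "$tmpdir"
}

# Example invocation (run from inside the database directory):
#   missing_from_map seqid2taxid.map $(find library -name "*.fai")
```

As with the grep version, empty output would mean every indexed reference has an entry in seqid2taxid.map.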

@helrick
Contributor Author

helrick commented Apr 11, 2021

Yes, it looks like there are some viral references that aren't in the seqid2taxid.map file. They're all in the library/added/reKOgoDspY.fna.fai file:

> grep -v -F -f <(cut -f 1 seqid2taxid.map | cut -f 1 -d ":") $(find library -name "*.fai")
library/added/reKOgoDspY.fna.fai:LR898047.1     29338   2151442724      60      61
library/added/reKOgoDspY.fna.fai:LR898046.1     29338   2151472668      60      61
library/added/reKOgoDspY.fna.fai:LR898045.1     29338   2151502614      60      61
library/added/reKOgoDspY.fna.fai:LR898044.1     29338   2151532559      60      61
library/added/reKOgoDspY.fna.fai:LR898043.1     29338   2151562504      60      61
library/added/reKOgoDspY.fna.fai:LR898042.1     29338   2151592449      60      61
library/added/reKOgoDspY.fna.fai:LR898041.1     29338   2151622395      60      61
library/added/reKOgoDspY.fna.fai:LR898040.1     29338   2151652340      60      61
library/added/reKOgoDspY.fna.fai:LR898039.1     29338   2151682286      60      61
library/added/reKOgoDspY.fna.fai:LR898038.1     29338   2151712231      60      61
library/added/reKOgoDspY.fna.fai:LR898037.1     29338   2151742176      60      61
library/added/reKOgoDspY.fna.fai:LR898023.1     29338   2151772121      60      61
library/added/reKOgoDspY.fna.fai:LR898022.1     29338   2151802066      60      61
library/added/reKOgoDspY.fna.fai:LR898021.1     29338   2151832011      60      61
library/added/reKOgoDspY.fna.fai:LR898020.1     29338   2151861956      60      61
library/added/reKOgoDspY.fna.fai:LR898019.1     29338   2151891901      60      61
library/added/reKOgoDspY.fna.fai:LR898018.1     29338   2151921845      60      61
library/added/reKOgoDspY.fna.fai:LR898017.1     29338   2151951789      60      61
library/added/reKOgoDspY.fna.fai:LR898016.1     29338   2151981733      60      61
library/added/reKOgoDspY.fna.fai:LR898015.1     29338   2152011677      60      61
library/added/reKOgoDspY.fna.fai:LR897997.1     29338   2152041622      60      61
library/added/reKOgoDspY.fna.fai:LR897996.1     29338   2152071566      60      61
library/added/reKOgoDspY.fna.fai:LR897995.1     29338   2152101510      60      61
library/added/reKOgoDspY.fna.fai:LR897994.1     29338   2152131454      60      61
library/added/reKOgoDspY.fna.fai:LR897993.1     29338   2152161398      60      61
library/added/reKOgoDspY.fna.fai:LR897992.1     29338   2152191342      60      61
library/added/reKOgoDspY.fna.fai:LR897991.1     29338   2152221286      60      61
library/added/reKOgoDspY.fna.fai:LR897990.1     29338   2152251230      60      61
library/added/reKOgoDspY.fna.fai:LR897989.1     29338   2152281174      60      61
library/added/reKOgoDspY.fna.fai:LR897987.1     29338   2152311118      60      61
library/added/reKOgoDspY.fna.fai:LR897986.1     29338   2152341062      60      61
library/added/reKOgoDspY.fna.fai:LR897985.1     29338   2152371006      60      61
library/added/reKOgoDspY.fna.fai:LR897983.1     29338   2152400950      60      61
library/added/reKOgoDspY.fna.fai:LR897981.1     29338   2152430894      60      61
library/added/reKOgoDspY.fna.fai:LR897977.1     29338   2152460838      60      61

Is rebuilding the best bet, or is there a faster way to add these 35 references? Thanks for the help!

@d-cameron
Member

I think a rebuild is needed, as kraken2 updates the seqid2taxid.map file as part of its build process.

Technically speaking, you could edit seqid2taxid.map and add the missing references but the error is indicative of something going wrong during the build process itself (or maybe multiple builds in the same directory overwriting each other).

You didn't happen to save the build output to a log file, did you?

@suhrig

suhrig commented Jun 5, 2021

Hi Daniel,

I tried building the reference files twice, but each time I ran into the same error described in this issue. In both cases the underlying issue was that kraken2-build excluded some 15,000 sequences from the added library, while issuing the warning "... accession numbers remain unmapped, see unmapped.txt in DB directory". As a result, these sequences were present in the file library/added/reKOgoDspY.fna but missing from the file seqid2taxid.map.

I resolved it by removing all of the unmapped sequences from the file library/added/reKOgoDspY.fna, and then rebuilding the FASTA index and the sequence dictionary.
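
A minimal sketch of the filtering step, under stated assumptions: the ID list has one sequence ID per line, and each ID matches the first whitespace-delimited token of a FASTA header (without the leading ">"). Whether kraken2's unmapped.txt matches the headers in that exact form is not confirmed in this thread, so the list may need adjusting first. The function name drop_fasta_records is hypothetical. Rebuilding the FASTA index and the sequence dictionary is not shown, since the thread does not say which tools were used.

```shell
#!/bin/sh
# Sketch only: drop FASTA records whose ID appears in a list of IDs.
# drop_fasta_records is a hypothetical helper name.

# Usage: drop_fasta_records IDS_FILE FASTA > FILTERED_FASTA
drop_fasta_records() {
    awk '
        # First file: collect the IDs to drop (blank lines are ignored)
        FILENAME == ARGV[1] { if (NF) drop[$1] = 1; next }
        # Header line: strip the ">" and decide whether to keep this record
        /^>/ { id = substr($1, 2); keep = !(id in drop) }
        # Print header and sequence lines of records that are kept
        keep { print }
    ' "$1" "$2"
}

# Example invocation (file names are placeholders):
#   drop_fasta_records unmapped.txt library/added/input.fna > filtered.fna
```

Writing to a new file rather than editing in place keeps the original library available in case the ID list turns out not to match the headers.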

I looked up some of the sequences in the NCBI nucleotide database. Almost all of them seem to be fairly recent SARS-CoV-2 genomes (less than two weeks old). Could it be that kraken2 fails to map these sequences because their annotation is not complete yet? Maybe the build would succeed a few weeks from now, once the databases have been fully updated. If this is a frequent issue, it may be a good idea to add a step to the virusbreakend-build script that removes unmapped sequences automatically, so that other users do not have trouble building the index. Since they are pretty much all SARS-CoV-2 genomes, the information is probably highly redundant anyway. Have you tried building the index recently? Did you run into the same issue, or was it a smooth experience?

Regards,
Sebastian

@sbilobram

I have had a lot of trouble building or downloading a virusbreakend database. If either issue can be resolved, I can move forward.
1.) Using virusbreakend-build --db virusbreakenddb --jar path/gridss.jar, I was able to build one, but got the NullPointerException errors described above when trying to use it in a virusbreakend run. Those errors went away when I edited the seqid2taxid.map file as suggested.
But now I get this error:

INFO 2023-05-17 10:28:10 IdentifyViralTaxa Loading seqid2taxid.map from virusbreakendORGdb/seqid2taxid.map
[Wed May 17 10:28:10 PDT 2023] gridss.kraken.IdentifyViralTaxa done. Elapsed time: 0.07 minutes.
Runtime.totalMemory()=2147483648
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
        at au.edu.wehi.idsv.kraken.SeqIdToTaxIdMap.lambda$createLookup$2(SeqIdToTaxIdMap.java:18)
        at java.base/java.util.stream.Collectors.lambda$uniqKeysMapAccumulator$1(Collectors.java:178)
        at java.base/java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
        at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1654)
        at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
        at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
        at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
        at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
        at au.edu.wehi.idsv.kraken.SeqIdToTaxIdMap.createLookup(SeqIdToTaxIdMap.java:18)
        at gridss.kraken.IdentifyViralTaxa.doWork(IdentifyViralTaxa.java:124)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
        at gridss.kraken.IdentifyViralTaxa.main(IdentifyViralTaxa.java:230)

Is there a suggested way to get around this error?

2.) I tried to download the minimal reference DB from the sites mentioned in VIRUSBreakend_Readme.md, but the downloads from both sites failed multiple times, always stopping at the same point: about 28 GB into the expected 34.5 GB download.
Please check the tarball so that a download can complete.
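
Regarding the ArrayIndexOutOfBoundsException in point 1: "Index 1 out of bounds for length 1" is what one would expect if a line of seqid2taxid.map yielded only a single field when split. That is an assumption, since the parsing code in SeqIdToTaxIdMap is not shown in this thread. Hand edits can introduce such lines, for example by using spaces instead of a tab or by leaving a blank line. A minimal sketch for locating candidate lines, assuming the "seqid<TAB>taxid" layout that kraken2 writes (the function name malformed_map_lines is hypothetical):

```shell
#!/bin/sh
# Sketch only: list lines of a seqid2taxid.map that do not have two
# non-empty tab-separated fields. malformed_map_lines is a hypothetical name.

# Usage: malformed_map_lines MAP   (prints "line_number: content")
malformed_map_lines() {
    awk -F '\t' '
        # Flag lines with fewer than two fields or an empty first/second field
        NF < 2 || $1 == "" || $2 == "" { printf "%d: %s\n", FNR, $0 }
    ' "$1"
}

# Example invocation:
#   malformed_map_lines virusbreakenddb/seqid2taxid.map
```

If this prints nothing, the map is at least structurally consistent with that layout and the cause probably lies elsewhere.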
