Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some sequences labeled as classified but no taxid assigned #40

Closed
DerrickWood opened this issue Aug 11, 2018 · 8 comments
Closed

Some sequences labeled as classified but no taxid assigned #40

DerrickWood opened this issue Aug 11, 2018 · 8 comments

Comments

@DerrickWood
Copy link
Owner

See DerrickWood/kraken#127 for original submission. Example output line (with initial "C" but taxid 0) is below:

C M00963:162:000000000-AHPVV:1:1101:14752:1397 0 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

@stuber Can you give me some more information to help me track this down? What kind of Kraken 2 database was being run against here? What was the full command (starting with kraken2) that would replicate this output?

@stuber
Copy link

stuber commented Aug 11, 2018

I included what I could for this build. It includes: archaea, bacteria, viral, human, fungi, plant, protozoa and 60 additional eukaryotic genomes, which are an additional 140GB. Total database size after cleaning is 190GB. If needed I can provide a list of genomes used.

Build command:
kraken2-build --threads 60 --minimizer-spaces 0 --build --db $DBNAME

Specified "0" for minimizer-spaces because this resulted in providing specificity needed. Identifying Mycobacterium tuberculosis complex is important for my situations. Using the default minimizer-spacer value only identified tuberculosis species reads to Mycobacterium. Setting to "0" resulted in identification to tuberculosis.

Run command:
kraken2 --db ${krakenDatabase} --threads ${CPUS} --paired ${read1} ${read2} --output $name-outputkraken.txt --report $name-reportkraken.txt

@stuber
Copy link

stuber commented Aug 15, 2018

More database info.

$ kraken2-inspect --db $DBNAME
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111111111111111111111111111111111
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 21153
# Table size: 35713390518
# Table capacity: 51025972662

@DerrickWood
Copy link
Owner Author

I don't see anything there or in the code so far that should trigger the output you're seeing. I'm thinking there may be an issue with the taxonomy processing somehow, because the code to decide whether or not to print the "C" vs "U" uses internal taxids (to speed up and simplify some processing in other parts of the software), but the code to display the taxid in the output does a lookup to find the external taxid (e.g., "9606" for human). If you try to classify this particular problem fragment with the --use-names option to kraken2, what output do you get? Did you just use kraken2-build --download-taxonomy --db $DBNAME to get the taxonomy?

@stuber
Copy link

stuber commented Aug 16, 2018

With option --use-names I get:
C M00963:162:000000000-AHPVV:1:1101:14752:1397 root (taxid 0) 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

I just used kraken2-build --download-taxonomy --db $DBNAME

Currently, I do not understand how Kraken works with the taxonomy files. I've looked into it a couple times, but then decide I'm just glad it all works. I can say the taxonomy numbers, 9913 and 27592, are found in nodes.dmp, and 9913 and 27592 are chosen for some reads as the taxid, so they are usable. Also using --use-names the appropriate names are shown next to the taxid.

For my use case I'll write something to provide a taxid even when there are very few identifying kmers.

Thanks,

@DerrickWood
Copy link
Owner Author

This is still a strange case. There are a few things that point to taxonomy problems here. The 0:19 0:22 0:20 in the hit list output is an error - consecutive runs of the same LCA should be collapsed per usual run-length encoding rules. That is, that should just be 0:61. My suspicion is that, at a minimum, the 0:22 is actually a hit to some taxon (call it "X") that is being misreported as taxid 0 (due to a faulty conversion between internal and external taxids), and so X is the true classification of the fragment. For some reason, though, taxon X isn't being reported out correctly here.

Note that the output you saw with --use-names - root (taxid 0) - is incorrect for a few reasons: taxid 0 doesn't exist (root is taxid 1), and unclassified fragments should have unclassified (taxid 0) output. So that's another indication of something awry.

I would be interested in seeing the actual sequence of this fragment's reads, if you're willing to share, @stuber. Also, if you still have the seqid2taxid.map file present in your database's directory, sharing it in some form (email, FTP, whatever works) would likely be helpful here as well, along with the nodes.dmp and names.dmp files it used. Thanks for your help in bringing this to my attention and tracking down the cause.

@stuber
Copy link

stuber commented Aug 22, 2018 via email

@NBaileyNCL
Copy link

Just posting here to mention I've had a similar issue in Kraken 2 with sequences labelled with taxid 0 despite being in the -classified-out output

@jenniferlu717
Copy link
Collaborator

The newest kraken2 version has fixed this issue. If any additional problems are found, please open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants