Some sequences labeled as classified but no taxid assigned #40

DerrickWood · 2018-08-11T20:52:27Z

See DerrickWood/kraken#127 for original submission. Example output line (with initial "C" but taxid 0) is below:

C M00963:162:000000000-AHPVV:1:1101:14752:1397 0 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

@stuber Can you give me some more information to help me track this down? What kind of Kraken 2 database was being run against here? What was the full command (starting with kraken2) that would replicate this output?

The text was updated successfully, but these errors were encountered:

stuber · 2018-08-11T22:11:20Z

I included what I could for this build. It includes: archaea, bacteria, viral, human, fungi, plant, protozoa and 60 additional eukaryotic genomes, which are an additional 140GB. Total database size after cleaning is 190GB. If needed I can provide a list of genomes used.

Build command:
kraken2-build --threads 60 --minimizer-spaces 0 --build --db $DBNAME

Specified "0" for minimizer-spaces because this resulted in providing specificity needed. Identifying Mycobacterium tuberculosis complex is important for my situations. Using the default minimizer-spacer value only identified tuberculosis species reads to Mycobacterium. Setting to "0" resulted in identification to tuberculosis.

Run command:
kraken2 --db ${krakenDatabase} --threads ${CPUS} --paired ${read1} ${read2} --output $name-outputkraken.txt --report $name-reportkraken.txt

stuber · 2018-08-15T10:58:49Z

More database info.

$ kraken2-inspect --db $DBNAME
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111111111111111111111111111111111
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 21153
# Table size: 35713390518
# Table capacity: 51025972662

DerrickWood · 2018-08-16T13:28:04Z

I don't see anything there or in the code so far that should trigger the output you're seeing. I'm thinking there may be an issue with the taxonomy processing somehow, because the code to decide whether or not to print the "C" vs "U" uses internal taxids (to speed up and simplify some processing in other parts of the software), but the code to display the taxid in the output does a lookup to find the external taxid (e.g., "9606" for human). If you try to classify this particular problem fragment with the --use-names option to kraken2, what output do you get? Did you just use kraken2-build --download-taxonomy --db $DBNAME to get the taxonomy?

stuber · 2018-08-16T18:20:13Z

With option --use-names I get:
C M00963:162:000000000-AHPVV:1:1101:14752:1397 root (taxid 0) 123|123 0:89 |:| 0:15 27592:5 9913:8 0:19 0:22 0:20

I just used kraken2-build --download-taxonomy --db $DBNAME

Currently, I do not understand how Kraken works with the taxonomy files. I've looked into it a couple times, but then decide I'm just glad it all works. I can say the taxonomy numbers, 9913 and 27592, are found in nodes.dmp, and 9913 and 27592 are chosen for some reads as the taxid, so they are usable. Also using --use-names the appropriate names are shown next to the taxid.

For my use case I'll write something to provide a taxid even when there are very few identifying kmers.

Thanks,

DerrickWood · 2018-08-18T22:17:18Z

This is still a strange case. There are a few things that point to taxonomy problems here. The 0:19 0:22 0:20 in the hit list output is an error - consecutive runs of the same LCA should be collapsed per usual run-length encoding rules. That is, that should just be 0:61. My suspicion is that, at a minimum, the 0:22 is actually a hit to some taxon (call it "X") that is being misreported as taxid 0 (due to a faulty conversion between internal and external taxids), and so X is the true classification of the fragment. For some reason, though, taxon X isn't being reported out correctly here.

Note that the output you saw with --use-names - root (taxid 0) - is incorrect for a few reasons: taxid 0 doesn't exist (root is taxid 1), and unclassified fragments should have unclassified (taxid 0) output. So that's another indication of something awry.

I would be interested in seeing the actual sequence of this fragment's reads, if you're willing to share, @stuber. Also, if you still have the seqid2taxid.map file present in your database's directory, sharing it in some form (email, FTP, whatever works) would likely be helpful here as well, along with the nodes.dmp and names.dmp files it used. Thanks for your help in bringing this to my attention and tracking down the cause.

stuber · 2018-08-22T11:12:05Z

Files uploaded to FTP site. Please send me a direct email to tod.p.stuber@aphis.usda.gov<mailto:tod.p.stuber@aphis.usd.gov> and I’ll provide login credentials. Thanks, On Aug 18, 2018, at 5:17 PM, Derrick Wood <notifications@github.com<mailto:notifications@github.com>> wrote: This is still a strange case. There are a few things that point to taxonomy problems here. The 0:19 0:22 0:20 in the hit list output is an error - consecutive runs of the same LCA should be collapsed per usual run-length encoding rules. That is, that should just be 0:61. My suspicion is that, at a minimum, the 0:22 is actually a hit to some taxon (call it "X") that is being misreported as taxid 0 (due to a faulty conversion between internal and external taxids), and so X is the true classification of the fragment. For some reason, though, taxon X isn't being reported out correctly here. Note that the output you saw with --use-names - root (taxid 0) - is incorrect for a few reasons: taxid 0 doesn't exist (root is taxid 1), and unclassified fragments should have unclassified (taxid 0) output. So that's another indication of something awry. I would be interested in seeing the actual sequence of this fragment's reads, if you're willing to share, @stuber<https://github.com/stuber>. Also, if you still have the seqid2taxid.map file present in your database's directory, sharing it in some form (email, FTP, whatever works) would likely be helpful here as well, along with the nodes.dmp and names.dmp files it used. Thanks for your help in bringing this to my attention and tracking down the cause. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub<#40 (comment)>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AFjCsQogSPgXtsM0K1dm58YhO5XvjpvTks5uSJJugaJpZM4V5TkH>. This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.

NBaileyNCL · 2020-01-06T15:31:00Z

Just posting here to mention I've had a similar issue in Kraken 2 with sequences labelled with taxid 0 despite being in the -classified-out output

jenniferlu717 · 2020-04-10T05:02:47Z

The newest kraken2 version has fixed this issue. If any additional problems are found, please open a new issue.

DerrickWood mentioned this issue Aug 11, 2018

Output classifies reads but tax id is zero DerrickWood/kraken#127

Closed

jenniferlu717 added the in progress label Jan 31, 2020

jenniferlu717 closed this as completed Apr 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Some sequences labeled as classified but no taxid assigned #40

Some sequences labeled as classified but no taxid assigned #40

DerrickWood commented Aug 11, 2018

stuber commented Aug 11, 2018

stuber commented Aug 15, 2018 •

edited

Loading

DerrickWood commented Aug 16, 2018

stuber commented Aug 16, 2018

DerrickWood commented Aug 18, 2018

stuber commented Aug 22, 2018 via email

NBaileyNCL commented Jan 6, 2020

jenniferlu717 commented Apr 10, 2020

Some sequences labeled as classified but no taxid assigned #40

Some sequences labeled as classified but no taxid assigned #40

Comments

DerrickWood commented Aug 11, 2018

stuber commented Aug 11, 2018

stuber commented Aug 15, 2018 • edited Loading

DerrickWood commented Aug 16, 2018

stuber commented Aug 16, 2018

DerrickWood commented Aug 18, 2018

stuber commented Aug 22, 2018 via email

NBaileyNCL commented Jan 6, 2020

jenniferlu717 commented Apr 10, 2020

stuber commented Aug 15, 2018 •

edited

Loading