problems with classification using nt database #34

ajaybabu27 · 2018-07-27T13:27:58Z

Hi Derrick,

I tried classifying my sample reads with a Kraken 2 database containing only "nt" sequences and found that nearly all of my reads were classified at the root. Here are the first few lines of the report file:

100.00% 14552 14535 R 1 root
0.12% 17 0 R1 131567 cellular organisms
0.12% 17 17 D 2 Bacteria

I ran the same classification using the "standard database" and got results similar to those I found with Kraken 1:

100.00% 14552 0 R 1 root
100.00% 14552 0 R1 131567 cellular organisms
100.00% 14552 0 D 2 Bacteria
44.38% 6458 0 D1 1783270 FCB group
41.62% 6057 0 P 1224 Proteobacteria
14.00% 2037 7 D1 1783272 Terrabacteria group

I wonder whether some k-mers found in bacterial species are also annotated under unusual categories placed near the root of the taxonomy tree, specifically in the case of the "nt" database?
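
For readers comparing the two reports: the columns are percentage of reads in the clade, reads in the clade, reads assigned directly to that taxon, rank code, taxid, and name. The sketch below is only an illustration, not part of Kraken 2; the helper name `parse_report_line` is made up here. It parses the "nt" report lines quoted above and computes what share of reads stopped directly at the root.

```python
def parse_report_line(line):
    """Split one Kraken-style report line into its six columns.

    Works on whitespace-separated lines as pasted in this thread; the name
    column may itself contain spaces, so only the first five splits are made.
    """
    pct, clade, direct, rank, taxid, name = line.split(None, 5)
    return {
        "percent": float(pct.rstrip("%")),  # tolerate a trailing % sign
        "clade_reads": int(clade),          # reads at or below this taxon
        "direct_reads": int(direct),        # reads assigned exactly here
        "rank": rank,
        "taxid": int(taxid),
        "name": name.strip(),
    }


# The three "nt" report lines quoted in the post above.
nt_report = """\
100.00% 14552 14535 R 1 root
0.12% 17 0 R1 131567 cellular organisms
0.12% 17 17 D 2 Bacteria
"""

rows = [parse_report_line(l) for l in nt_report.splitlines() if l.strip()]
root = rows[0]
stuck_at_root = root["direct_reads"] / root["clade_reads"]
print(f"{root['direct_reads']} of {root['clade_reads']} reads "
      f"({stuck_at_root:.2%}) were assigned directly to {root['name']}")
```

The third column is the one to watch: in the "standard database" report it is 0 at the root, whereas here it accounts for almost every read.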

tseemann · 2018-07-31T01:58:16Z

"with kraken database containing only "nt" sequences "

How did you build the database?

What does kraken2-inspect say?
See http://ccb.jhu.edu/software/kraken/MANUAL.html#inspecting-a-kraken-2-databases-contents

ajaybabu27 · 2018-08-01T19:51:02Z

I built the "nt" database with the following commands:

kraken2-build --download-taxonomy --db kraken2_db_custom_20180718
kraken2-build --download-library nt --db kraken2_db_custom_20180718
kraken2-build --build --threads 36 --db kraken2_db_custom_20180718

Essentially, the database consists only of sequences from the "nt" library. If I am not wrong, "refseq" is a subset of the "nt" library, so I used only the "nt" library to build my database.

I gave up running kraken2-inspect on my database (kraken2_db_custom_20180718) because it had been running continuously for 2 days (using around 300 GB of memory and one core). Unfortunately, the script is not multi-threaded.

Have you tried classifying your metagenomic samples using the "nt" database (built the way I built it)?

tseemann · 2018-08-06T01:01:39Z

I haven't built the nt database, but am considering it. I work in microbial genomics so I only want viruses -> protists + human. I'm happy for the rest to be unclassified. My main issue even with refseq is mistakes in taxonomy at species level.

DerrickWood · 2018-08-06T01:32:42Z

@ajaybabu27 Without knowing what your sample reads are (simulated? source genomes?), it's difficult to say why they are now classified at the root, beyond that the data in those sample reads looks to come from more than one kingdom (when comparing against the whole of nt).

ajaybabu27 · 2018-08-09T18:22:49Z

@DerrickWood It is a 16S (V4) amplicon gut microbiome sample (human). I am curious to test whether Kraken 2 with the "nt" database can detect any contamination in my sequencing library (barcoded + non-barcoded reads) in addition to classifying my reads to different taxa. I have used Kraken 1 and Kraken 2 (standard db) in the past to classify the demultiplexed sample mentioned in my first query, and both gave the expected results. I find it hard to believe that almost all of the reads in this sample also map to additional kingdoms when using "nt".

Is there a way to get a lookup table of all the k-mers annotated across different taxa in a given Kraken db? That way I could debug where the k-mers from my reads would potentially map in the taxonomy tree, given the LCA annotation of each k-mer. Let me know if this would be possible.

Thanks,
Ajay.

DerrickWood · 2018-08-11T22:02:28Z

@ajaybabu27 Unfortunately, Kraken 2 doesn't make it easy to look up which k-mers are in the database. That is part of the tradeoff made to get the database size smaller in Kraken 2. You could classify the entire library and use the results to approximate such a lookup, but that is neither a trivial nor a memory-light task.

In terms of debugging, though, it might be helpful to look at the Kraken output for classifying your reads. The final tab-delimited column is a space-separated list indicating which k-mers were associated with specific LCAs. You could also try using something like BLAST (probably MEGABLAST, not BLASTN) to search some of these sequences against nt, and review the top several hits for the sequences. It's possible that nt has some poor taxonomic annotations that could confuse the assignment of these kinds of reads for Kraken.
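
As a rough illustration of that debugging step, here is a minimal sketch that tallies the last column of one Kraken 2 output line. Each token has the form `taxid:count`, where `0` means the k-mers had no database hit, `A` marks k-mers containing an ambiguous nucleotide, and `|:|` separates the two mates of a paired read. The function name `tally_lca_hits` and the example line are hypothetical, made up for this sketch rather than taken from the sample discussed here.

```python
from collections import Counter


def tally_lca_hits(kraken_line):
    """Sum k-mer counts per LCA taxid from one line of Kraken 2 output."""
    fields = kraken_line.rstrip("\n").split("\t")
    lca_field = fields[-1]  # final tab-delimited column
    hits = Counter()
    for token in lca_field.split():
        if token == "|:|":  # mate separator in paired-end output
            continue
        taxid, count = token.split(":")
        hits[taxid] += int(count)
    return hits


# Hypothetical output line: a read classified at the root (taxid 1) because
# most of its k-mers have the root as their stored LCA.
example = "C\tread_1\t1\t150\t0:10 1:90 0:6 1:10"

for taxid, kmers in tally_lca_hits(example).most_common():
    print(taxid, kmers)
```

If the reads that land on the root are dominated by `1:` tokens like this, the k-mers themselves are stored with a root-level LCA in the database, which points at the database contents rather than at the reads.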

tseemann · 2018-08-19T00:57:52Z

"It's possible that nt has some poor taxonomic annotations"

Do you have a feel for how bad it could be?

Even for RefSeq there are species-level mistakes, but nt is made up of lots of very old sequences.

rwst · 2018-08-20T05:43:12Z

Databases have problems not only with taxonomic classification but also with contamination; see Lu/Salzberg, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006277. I found out the hard way yesterday that protozoa gives false Toxoplasma and Plasmodium positives, just as the above paper describes for EuPathDB.

I think providing clean databases may need a concerted effort.

ajaybabu27 · 2018-08-21T14:03:18Z

@tseemann, I agree with @rwst. I was trying to classify non-barcoded reads from a sequencing run. At least 30% of them should be PhiX reads (estimated by directly mapping the non-barcoded reads to the PhiX genome). Yet when I classified them using the RefSeq database (Kraken 1 and Kraken 2), only 3% of reads were classified as PhiX. I suspect PhiX sequence contamination is pervasive in the NCBI RefSeq database, since it's used as a spike-in control in all Illumina-based sequencing runs. See https://standardsingenomics.biomedcentral.com/articles/10.1186/1944-3277-10-18

jenniferlu717 · 2020-04-10T04:58:42Z

Given how old this thread is, I'm going to close this issue. If you continue to have questions/concerns, please open a new issue.

Regarding testing where k-mers from the database will end up being classified: you can run the first couple of steps of the Bracken manual (Step 1a), https://ccb.jhu.edu/software/bracken/

jenniferlu717 closed this as completed Apr 10, 2020

fconstancias mentioned this issue Dec 3, 2020

kraken2 classification Toxoplasma gondii and Plasmodium vivax #376

Closed
