Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

problems with classification using nt database #34

Closed
ajaybabu27 opened this issue Jul 27, 2018 · 10 comments
Closed

problems with classification using nt database #34

ajaybabu27 opened this issue Jul 27, 2018 · 10 comments

Comments

@ajaybabu27
Copy link

Hi Derrick,

I tried classifying my sample reads with kraken database containing only "nt" sequences and found that all my reads were classified at the root. Here is the output from the first few lines of the report file:

100.00% 14552 14535 R 1 root
0.12% 17 0 R1 131567 cellular organisms
0.12% 17 17 D 2 Bacteria

I did the same classification using the "standard database" and I got similar results that I found in Kraken1:

100.00% 14552 0 R 1 root
100.00% 14552 0 R1 131567 cellular organisms
100.00% 14552 0 D 2 Bacteria
44.38% 6458 0 D1 1783270 FCB group
41.62% 6057 0 P 1224 Proteobacteria
14.00% 2037 7 D1 1783272 Terrabacteria group

I wonder if there are some kmers that are found in some bacterial species, are also annotated in weird categories placed near the root of the taxonomy tree; specifically in the case of the "nt" database ?

@tseemann
Copy link

tseemann commented Jul 31, 2018

"with kraken database containing only "nt" sequences "

How did you build the database?

What does kraken2-inspect say?
See http://ccb.jhu.edu/software/kraken/MANUAL.html#inspecting-a-kraken-2-databases-contents

@ajaybabu27
Copy link
Author

ajaybabu27 commented Aug 1, 2018

I built the "nt" database with the following commands,

kraken2-build --download-taxonomy --db kraken2_db_custom_20180718
kraken2-build --download-library nt --db kraken2_db_custom_20180718
kraken2-build --build --threads 36 --db kraken2_db_custom_20180718

Essentially the database only consisted of sequences from the "nt" library. If I am not wrong, "refseq" is a subset of "nt" library. Thus, I only used "nt" library to build my database.

I gave up running kraken2-inspect for my database (kraken2_db_custom_20180718) since it kept running continuously for 2 days (using around 300 Gb of memory and one core). Unfortunately, the script is not multi-threaded.

Have you tried classifying your metagenomic samples using the "nt" database (the way I have built it) ?

@tseemann
Copy link

tseemann commented Aug 6, 2018

I haven't built the nt database, but am considering it. I work in microbial genomics so I only want viruses -> protists + human. I'm happy for the rest to be unclassified. My main issue even with refseq is mistakes in taxonomy at species level.

@DerrickWood
Copy link
Owner

@ajaybabu27 Without knowing what your sample reads are (simulated? source genomes?), it's difficult to say why they are now classified at the root, beyond that the data in those sample reads looks to come from more than one kingdom (when comparing against the whole of nt).

@ajaybabu27
Copy link
Author

ajaybabu27 commented Aug 9, 2018

@DerrickWood It is a 16S (V4) amplicon gut microbiome sample (human). I am curious to test if Kraken2 with "nt" database can detect any contaminations in my sequencing library (barcoded+non-barcoded reads) in addition to classifying my reads to different taxa. I have used Kraken1 and Kraken2 (standard db) for classifying the above mentioned demultiplexed sample (in my first query) in the past and it gave expected results. I find it hard to believe that almost all of my reads in this sample also maps to additional kingdoms when using "nt" .

Is there a way I can get a lookup table of all the kmers annotated across different taxa in a given kraken db ? This way I could debug to see where the kmers from my reads would potentially map in the taxonomy tree given the annotations of LCA kmers to each taxa in the tree. Let me know if this would be possible.

Thanks,
Ajay.

@DerrickWood
Copy link
Owner

@ajaybabu27 Unfortunately, Kraken 2 doesn't easily allow for looking up which k-mers are in the database. That is part of the tradeoff made to get the database size smaller in Kraken 2. You could classify the entire library and then use that information to approximate that, but that isn't a trivial (nor memory-light) task.

In terms of debugging, though, it might be helpful to look at the Kraken output for classifying your reads. The final tab-delimited column is a space-separated list indicating which k-mers were associated with specific LCAs. You could also try using something like BLAST (probably MEGABLAST, not BLASTN) to search some of these sequences against nt, and review the top several hits for the sequences. It's possible that nt has some poor taxonomic annotations that could confuse the assignment of these kinds of reads for Kraken.

@tseemann
Copy link

"It's possible that nt has some poor taxonomic annotations"

Do you have a feel for how bad it could be?

Even for Refseq there are species level mistakes, but nt is made of lots of very old sequences.

@rwst
Copy link

rwst commented Aug 20, 2018

There are not only problems with taxonomic classifications of databases but also with contamination, see Lu/Salzberg, http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006277. I found out the hard way yesterday that protozoa gives false Toxoplasma and Plasmodium positives, just as described in the above paper with EuPathDB.

I think providing clean databases may need a concerted effort.

@ajaybabu27
Copy link
Author

ajaybabu27 commented Aug 21, 2018

@tseemann, I agree with @rwst. I was trying to classify non-barcoded reads from a sequencing run. It should at the least have 30% PhiX reads (estimated based on direct mapping of non-barcoded reads to PhiX genome). Yet, when I classified using RefSeq Database (Kraken 1 and Kraken 2), only 3% of reads were classified as PhiX. I suspect PhiX sequence contamination is pervasive in NCBI RefSeq database since its used as a spike-in control in all Illumina based sequencing runs. See, https://standardsingenomics.biomedcentral.com/articles/10.1186/1944-3277-10-18

@jenniferlu717
Copy link
Collaborator

Given how old this thread is, I'm going to close this issue. If you continue to have questions/concerns, please open a new issue.

Regarding testing where kmers from the database will end up classifying. you can run the first couple steps of the bracken manual (Step 1a) https://ccb.jhu.edu/software/bracken/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants