how does kraken differentiate masked genomes #122

Open
zhaoc1 opened this issue Apr 24, 2018 · 5 comments

Comments

@zhaoc1

zhaoc1 commented Apr 24, 2018

Hi,

I got the masked genomes from eupathDBclean and built a Kraken database from them (kmer = 25). Looking back at the reports, some of the kmers mapped entirely to masked sequences, e.g.

AFPV01005151 |
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Therefore, I am wondering how Kraken deals with masked (low-complexity) genome regions. How can I filter out the false-positive kmer hits that map to masked regions? Thank you!

Best,
Chunyu
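
To see how much of a library file is affected, one option is to list the FASTA records that consist only of N. The sketch below is a generic check, not part of Kraken or KrakenHLL; the function name `all_n_records` is hypothetical, and it assumes plain (uncompressed) FASTA with Unix line endings.

```shell
# Hypothetical helper: print the ID of every FASTA record whose
# sequence consists entirely of N (fully masked records).
# Assumes uncompressed FASTA with Unix line endings.
all_n_records() {
  # $1 = FASTA file
  awk '
    # Report the previous record if every base in it was N.
    function report() { if (id != "" && total > 0 && n == total) print id }
    /^>/ { report(); id = substr($1, 2); total = 0; n = 0; next }
    # Count all bases on the line, then count (and strip) the N bases.
    { total += length($0); n += gsub(/[Nn]/, "") }
    END { report() }
  ' "$1"
}
```

Running `all_n_records some_genome.fna` (file name is a placeholder) prints one sequence ID per fully masked record, which can then be compared against the IDs appearing in the report.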

@jenniferlu717
Collaborator

That shouldn't happen. Any kmers with "N" don't end up in the database.
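
As an illustration of that rule, here is a minimal sketch that counts the kmers of a sequence containing no ambiguous base. It is not Kraken's actual code; the function name `count_clean_kmers` is hypothetical, and treating every character outside A, C, G, T (either case) as ambiguous is an assumption.

```shell
# Hypothetical helper: count the k-mers in one sequence that contain
# only A, C, G, or T. Not Kraken's implementation; it only illustrates
# the rule that a k-mer containing "N" is skipped.
count_clean_kmers() {
  # $1 = sequence, $2 = k-mer length
  printf '%s\n' "$1" | awk -v k="$2" '{
    n = 0
    # Slide a window of length k over the sequence.
    for (i = 1; i + k - 1 <= length($0); i++) {
      # Keep the window only if it has no character outside ACGT.
      if (substr($0, i, k) !~ /[^ACGTacgt]/) n++
    }
    print n
  }'
}
```

Under this rule a sequence made up only of N contributes zero kmers, so a hit attributed to such a sequence would have to come from somewhere else, such as the sequence-to-taxon mapping.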

@zhaoc1
Author

zhaoc1 commented Apr 24, 2018

I actually built a KrakenHLL database on the cleaned eupathDB, and this kmer mapping shows up in my reports. Do you know why this is happening? Thanks!

@jenniferlu717
Collaborator

Can you send me both the read causing this and the Kraken output showing this match? Please email the Kraken line and the read to jennifer.717@gmail.com

@fbreitwieser
Collaborator

@zhaoc1, maybe this is a KrakenHLL issue. You made a database with nodes for sequences and genomes, right? Did you restart the building process? Maybe the mapping file is not up to date.

@zhaoc1
Author

zhaoc1 commented Apr 27, 2018

@fbreitwieser, this is how I built the KrakenHLL database after downloading the eupathDBclean files and seqid2taxid.map:

DBNAME=testDB
krakenhll-download --db $DBNAME taxonomy
cp my_eupathDB_folder/*. $DBNAME/library
cp seqid2taxid.map $DBNAME/library
krakenhll-build --db $DBNAME --taxids-for-genomes --taxids-for-sequences --kmer-len 25 --threads 8

Could you please explain at which step the mapping file could become out of date, or whether there are any other potential problems here? Thank you!
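
One way to check the mapping file against the library is to list the sequence IDs in a FASTA file that have no entry in the map. This is a rough sketch, not a KrakenHLL tool; the function name `missing_seqids` is hypothetical, and it assumes the first whitespace-separated column of seqid2taxid.map is the sequence ID and that the ID is the first word of each FASTA header.

```shell
# Hypothetical helper: print FASTA sequence IDs that have no entry
# in a seqid-to-taxid map.
# Assumptions: column 1 of the map is the sequence ID, and the ID is
# the first word of the FASTA header line (after ">").
missing_seqids() {
  # $1 = FASTA file, $2 = map file
  awk '
    # First input file (the map): remember every known sequence ID.
    FILENAME == ARGV[1] { known[$1] = 1; next }
    # Second input file (the FASTA): report headers whose ID is unknown.
    /^>/ { id = substr($1, 2); if (!(id in known)) print id }
  ' "$2" "$1"
}
```

For the build above, this could be run once per library file, for example `missing_seqids "$DBNAME/library/some_genome.fna" "$DBNAME/library/seqid2taxid.map"`, where `some_genome.fna` is a placeholder. Empty output means every sequence in that file has a mapping entry.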
