create nodes.dmp and names.dmp files for custom database #436
Comments
Hi,

It has been some time since I worked on this, so I do not remember the contents of nodes.dmp and names.dmp with 100% certainty, but I will try to help.

The nodes.dmp file should have these columns with these delimiters:

The names.dmp file should have these columns with these delimiters:

For example, if you have the following rank information for a particular species (by the way, you would also need to create your own taxids):

Your nodes.dmp file would look something like this (with tabs in between), with the taxid in the first column and the parent taxid in the second column:

Your names.dmp file would look something like this (with tabs in between):

You can change the display name in the names.dmp file to whatever you would like to see in the output. If you have more species under the Acidipila genus, then each of those species also needs its own taxid, and that taxid has to connect to the rest of your taxonomy info. For example, if you have Acidipila dinghuensis (another species under the Acidipila genus), then your names.dmp file might look like this (with tabs in between):

Your nodes.dmp file would then look something like this (with tabs in between):

You will need to write some code to create these dmp files.
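As a rough illustration of the "write some code" step, here is a minimal Python sketch that formats rows in the NCBI taxdump style, where fields are separated by tab-pipe-tab and each line ends with tab-pipe. It assumes that only the first three nodes.dmp columns (taxid, parent taxid, rank) are needed, which is worth verifying for your setup, since a full NCBI nodes.dmp carries more columns. The taxids 9000001 and 9000002 are made-up placeholders, and the function names are hypothetical.

```python
def taxdump_line(*fields):
    """Join fields NCBI-taxdump style: '\t|\t' between fields, '\t|' at line end."""
    return "\t|\t".join(str(f) for f in fields) + "\t|\n"


def nodes_row(taxid, parent_taxid, rank):
    # Assumption: only taxid, parent taxid and rank are written here.
    # A complete NCBI nodes.dmp has additional columns after the rank.
    return taxdump_line(taxid, parent_taxid, rank)


def names_row(taxid, name, name_class="scientific name"):
    # names.dmp columns: taxid, name, unique name (left empty), name class.
    return taxdump_line(taxid, name, "", name_class)


if __name__ == "__main__":
    # Placeholder taxids for a custom genus and one species beneath it.
    # The genus parent (1, the root) is only for illustration; in practice
    # it should point at the real parent taxon in your taxonomy.
    print(nodes_row(9000001, 1, "genus"), end="")
    print(nodes_row(9000002, 9000001, "species"), end="")
    print(names_row(9000001, "Acidipila"), end="")
    print(names_row(9000002, "Acidipila dinghuensis"), end="")
```

The rows produced this way would be appended to the existing nodes.dmp and names.dmp rather than replacing them, so the new taxa stay connected to the rest of the tree.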
I think the easiest way to do this is to hack the gtdb_to_taxdump script: https://github.com/nick-youngblut/gtdb_to_taxdump

Taxdump_edit also does a decent job, and again you can easily adapt the Perl script: https://github.com/guyleonard/taxdump_edit

There is also struo: https://github.com/leylabmpi/Struo and flextaxd.
@punnettsun @mw55309 - Thank you both so much for the guidance. I really appreciate you getting back to me on this. I will be working on it today.
I'm going to leave this here for anyone looking at this issue in the future. In general:
Note: genomes must be added before the build. In my experience it doesn't work retroactively. That should do it. Hope this helps someone in the future.
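Because additions have to be in place before the build, it can help to check beforehand that every taxid used in the FASTA headers actually has a row in nodes.dmp. The following is only a sketch of such a check, not part of kraken2 itself. The function names are hypothetical, and it assumes headers carry the `|kraken:taxid|N` tag and that nodes.dmp rows start with the taxid followed by a tab and a pipe.

```python
import re

# Matches the taxid tag inside a FASTA header, e.g. ">seq1|kraken:taxid|32630"
TAXID_TAG = re.compile(r"\|kraken:taxid\|(\d+)")


def header_taxids(fasta_lines):
    """Collect the taxids tagged in the FASTA header lines."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            match = TAXID_TAG.search(line)
            if match:
                ids.add(int(match.group(1)))
    return ids


def node_taxids(nodes_lines):
    """Collect the first column (taxid) of every nodes.dmp row."""
    ids = set()
    for line in nodes_lines:
        first = line.split("\t|", 1)[0].strip()
        if first:
            ids.add(int(first))
    return ids


def missing_taxids(fasta_lines, nodes_lines):
    """Return the taxids used in the FASTA headers but absent from nodes.dmp."""
    return sorted(header_taxids(fasta_lines) - node_taxids(nodes_lines))
```

Both arguments are any iterables of lines, so open file handles for a genome FASTA and for nodes.dmp can be passed directly. An empty result means every tagged taxid has a node.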
Hi all,
First, I wanted to thank the kraken2 team for making a wonderful tool for open source use.
I have 21 new genomes (in FASTA format) that I want to add to my kraken2 database. These genomes do not exist in the NCBI taxonomy. As such, I have added the requisite >sequence16|kraken:taxid|32630 syntax to the headers in all the files, where the specific number is a taxid that does not already exist in the library.
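For anyone who needs to do the same header tagging on many files, here is a small Python sketch. It assumes one taxid per genome file and that the tag goes directly after the sequence ID, as in the example header above. The function names and the taxid 9000002 are placeholders, not anything provided by kraken2.

```python
def tag_header(header, taxid):
    """Append '|kraken:taxid|<taxid>' to the sequence ID of a FASTA header."""
    header = header.rstrip("\n")
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line: %r" % header)
    if "|kraken:taxid|" in header:
        return header  # already tagged, leave untouched
    parts = header.split(None, 1)  # sequence ID, then optional description
    tagged = "%s|kraken:taxid|%d" % (parts[0], taxid)
    if len(parts) > 1:
        tagged += " " + parts[1]
    return tagged


def tag_fasta(lines, taxid):
    """Yield FASTA lines with every header tagged; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield tag_header(line, taxid) + "\n"
        else:
            yield line


if __name__ == "__main__":
    example = [">chr1 my new strain\n", "ACGTACGT\n"]
    print("".join(tag_fasta(example, 9000002)), end="")
```

Headers that already carry a tag are left alone, so running it twice on the same file should not double-tag anything.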
The genomes do have corresponding organisms in NCBI, but the specific strain does not and I don't want to override the existing strains in the database. Based on this, I assume I need to edit the nodes.dmp and names.dmp files. I have read #154, #179, & #317 but am still unclear how to proceed.
My current database was built with kraken2-build --download-library bacteria --db $DBNAME. Put simply, I want to add these 21 genomes to my bacterial database. I can successfully add them to the db with kraken2-build --add-to-library chr1.fa --db $DBNAME, but after that I don't know what to do.
How do I edit or create new nodes.dmp and names.dmp files that contain the information on my new strains? Any help would be greatly appreciated.