create nodes.dmp and names.dmp files for custom database #436

Closed
DJFast opened this issue Apr 20, 2021 · 4 comments

Comments

DJFast commented Apr 20, 2021

Hi all,

First, I wanted to thank the kraken2 team for making a wonderful tool for open source use.

I have 21 new genomes (in fasta format) that I want to add to my kraken2 database. These genomes do not exist in the NCBI taxonomy. As such, I have added the requisite >sequence16|kraken:taxid|32630 syntax to all the files, where the specific number is one that does not exist in the library.

The genomes do have corresponding organisms in NCBI, but the specific strain does not and I don't want to override the existing strains in the database. Based on this, I assume I need to edit the nodes.dmp and names.dmp files. I have read #154, #179, & #317 but am still unclear how to proceed.

My current database was built with $kraken2-build --download-library bacteria --db $DBNAME. Simply, I want to add these 21 genomes to my bacterial database. I can successfully add them to the db with $kraken2-build --add-to-library chr1.fa --db $DBNAME but after that I don't know what to do.

How do I edit or create new nodes.dmp and names.dmp files that contain the information on my new strains? Any help would be greatly appreciated.

punnettsun commented Apr 21, 2021

Hi,

It has been some time since I worked on this, so I do not remember 100% about the contents of the nodes.dmp and names.dmp, but I will try to help.
I would suggest looking at my comment in issue 126: jenniferlu717/Bracken#126
You would need to build a taxonomy tree so you can make a nodes.dmp and names.dmp file. The formats for the files are mentioned below and in the 126 issue as well.

The nodes.dmp file should have these columns with these delimiters:
"taxid\t|\tparent_taxid\t|\trank\t|\t-\t|\n"

The names.dmp file should have these columns with these delimiters:
"taxid\t|\tdisplay_name\t|\t-\t|\tscientific name\t|\n"

For example, if you have the following rank information for a particular species (by the way, you would also need to create your own taxids):
Domain: Bacteria (taxid: 1)
Phylum: Acidobacteria (taxid: 2)
Class: Acidobacteriia (taxid: 3)
Order: Acidobacteriales (taxid: 4)
Family: Acidobacteriaceae (taxid: 5)
Genus: Acidipila (taxid: 6)
Species: Acidipila rosea (taxid: 7)

Your nodes.dmp file would look something like this (tab, pipe, tab between fields), with the taxid in the first column and the parent taxid in the second column:
1 | 1 | domain | - |
2 | 1 | phylum | - |
3 | 2 | class | - |
4 | 3 | order | - |
5 | 4 | family | - |
6 | 5 | genus | - |
7 | 6 | species | - |
^ 7 is the taxid and 6 is the parent of 7. So the species (taxid 7) belongs under the genus (taxid 6).

Your names.dmp file would look something like this (tab, pipe, tab between fields):
1 | Bacteria | - | scientific name |
2 | Acidobacteria | - | scientific name |
3 | Acidobacteriia | - | scientific name |
4 | Acidobacteriales | - | scientific name |
5 | Acidobacteriaceae | - | scientific name |
6 | Acidipila | - | scientific name |
7 | Acidipila_rosea | - | scientific name |

You can change the display name in the names.dmp file to whatever you would like to see in the output.

If you have more species under the Acidipila genus, then each of them needs its own taxid as well, connected to the rest of your taxonomy. For example, if you also have Acidipila dinghuensis (another species under the Acidipila genus), then your nodes.dmp file might look like this:
1 | 1 | domain | - |
2 | 1 | phylum | - |
3 | 2 | class | - |
4 | 3 | order | - |
5 | 4 | family | - |
6 | 5 | genus | - |
7 | 6 | species | - |
8 | 6 | species | - |
^ note that taxid 8 belongs under the genus (taxid 6) since Acidipila dinghuensis belongs to the Acidipila genus.

Your names.dmp file then would look something like this:
1 | Bacteria | - | scientific name |
2 | Acidobacteria | - | scientific name |
3 | Acidobacteriia | - | scientific name |
4 | Acidobacteriales | - | scientific name |
5 | Acidobacteriaceae | - | scientific name |
6 | Acidipila | - | scientific name |
7 | Acidipila_rosea | - | scientific name |
8 | Acidipila dinghuensis | - | scientific name |
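
One way to sanity-check the parent links in the taxid / parent taxid / rank table is to walk from a leaf back up to the root. The snippet below is only a sketch: `parse_parent_table` and `lineage` are made-up helper names, and the rows are the toy Acidipila example from this comment, not real NCBI taxids.

```python
# Toy rows from the example above: taxid | parent taxid | rank | -
EXAMPLE_ROWS = """\
1 | 1 | domain | - |
2 | 1 | phylum | - |
3 | 2 | class | - |
4 | 3 | order | - |
5 | 4 | family | - |
6 | 5 | genus | - |
7 | 6 | species | - |
8 | 6 | species | - |
"""


def parse_parent_table(text):
    """Map each taxid to (parent_taxid, rank); tolerates tabs or spaces around '|'."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        table[int(fields[0])] = (int(fields[1]), fields[2])
    return table


def lineage(table, taxid):
    """Return [(taxid, rank), ...] from the given taxid up to the root.

    The root is the row whose parent is itself. A missing parent raises
    KeyError and a loop raises ValueError, so broken links are caught early.
    """
    path, seen = [], set()
    while True:
        if taxid in seen:
            raise ValueError(f"cycle detected at taxid {taxid}")
        seen.add(taxid)
        parent, rank = table[taxid]
        path.append((taxid, rank))
        if parent == taxid:
            return path
        taxid = parent


table = parse_parent_table(EXAMPLE_ROWS)
print(lineage(table, 8))
```

If every new taxid resolves to the root without an error, the tree is at least connected.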

You will need to write some code to create these dmp files.
I hope this helps. Let me know if you need further clarification.
I have not added new genomes to an already existing database, but this comment should help you create nodes & names dmp files.
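
As a rough illustration of the kind of code meant here, the sketch below turns a small lineage list into the two files' contents using the delimiters quoted earlier. It assumes the standard NCBI taxdump layout, where nodes.dmp carries taxid, parent taxid and rank, and names.dmp carries taxid, name and the "scientific name" class. `TAXA`, `dmp_row` and `build_dmp` are hypothetical names, and the taxids are the toy ones from the example.

```python
# Hypothetical input: (taxid, parent_taxid, rank, display_name) per taxon.
TAXA = [
    (1, 1, "domain", "Bacteria"),
    (2, 1, "phylum", "Acidobacteria"),
    (3, 2, "class", "Acidobacteriia"),
    (4, 3, "order", "Acidobacteriales"),
    (5, 4, "family", "Acidobacteriaceae"),
    (6, 5, "genus", "Acidipila"),
    (7, 6, "species", "Acidipila_rosea"),
]


def dmp_row(*fields):
    """Join fields with tab-pipe-tab and end the row with tab-pipe-newline."""
    return "\t|\t".join(str(field) for field in fields) + "\t|\n"


def build_dmp(taxa):
    """Return (nodes_text, names_text) for the given taxa."""
    nodes = "".join(dmp_row(taxid, parent, rank, "-") for taxid, parent, rank, _ in taxa)
    names = "".join(dmp_row(taxid, name, "-", "scientific name") for taxid, _, _, name in taxa)
    return nodes, names


nodes_text, names_text = build_dmp(TAXA)
print(nodes_text, end="")
print(names_text, end="")

# To write the files (paths are up to you):
# with open("nodes.dmp", "w") as fh:
#     fh.write(nodes_text)
# with open("names.dmp", "w") as fh:
#     fh.write(names_text)
```

When extending an existing NCBI taxonomy rather than building one from scratch, the same rows would be appended to the downloaded files, with parent taxids pointing at real NCBI nodes.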

mw55309 commented Apr 21, 2021

I think the easiest way to do this is to hack the gtdb_to_taxdump script:

https://github.com/nick-youngblut/gtdb_to_taxdump

Taxdump_edit also does a decent job and again you can easily adapt the perl script

https://github.com/guyleonard/taxdump_edit

There is also struo:

https://github.com/leylabmpi/Struo

and flextaxd:

https://github.com/FOI-Bioinformatics/flextaxd

DJFast commented Apr 21, 2021

@punnettsun @mw55309 - Thank you both so much for the guidance. I really appreciate the response. I will be working on this today. But I do really appreciate you two getting back to me on this.

DJFast commented Apr 27, 2021

I'm going to leave this here for anyone looking at this issue in the future.
It is possible to add genomes to your database, edit the names.dmp and nodes.dmp files, and have kraken map to those genomes.

In general:

  1. Make a database directory and download NCBI taxonomy >kraken2-build --download-taxonomy --db $DBNAME

  2. Download desired library >kraken2-build --download-library bacteria --db $DBNAME

  3. Edit names.dmp & nodes.dmp with taxdump_edit
    A. Assemble a list of genomes to be added
    B. Identify the parent taxon (from NCBI taxonomy), rank, and division for each. Example for a new Salmonella enterica strain:
    Salmonella_enterica-XXXXX -parent 28901 -rank subspecies -division 11 (note: genus and species must be separated
    by an underscore. A space will not work)
    C. Run taxdump_edit.pl -names names.dmp -nodes nodes.dmp -taxa NAME -parent XXX -rank NAME -division X
    (note: species = S, subspecies = S1, varietas = S2 in kraken)
    D. Take note of the ID assigned to each new taxon as it is added, e.g.
    "Your calculated TaxID = 3304349. Please use this with makeblastdb and your fasta sequences."

  4. Edit the fasta genome file headers to conform to the kraken:taxid nomenclature
    Example - >genome|kraken:taxid|3304349 Salmonella_enterica-XXXXX

  5. Add the edited fasta genome files to the kraken database > kraken2-build --add-to-library Salmonella_enterica-XXXXX.fa --db $DBNAME

  6. Build database > kraken2-build --build --db $DBNAME
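
For step 4, something along these lines could rewrite the headers in bulk. It is a minimal sketch, not part of kraken2: `tag_fasta_headers` is a made-up helper, and it assumes the taxid tag belongs in the sequence ID (the part of the header before the first space), as in the `>sequence16|kraken:taxid|32630` form quoted at the top of this issue.

```python
def tag_fasta_headers(lines, taxid):
    """Yield FASTA lines with '|kraken:taxid|<taxid>' appended to each sequence ID.

    Headers that already carry a kraken:taxid tag are passed through unchanged,
    as are all sequence lines.
    """
    for line in lines:
        if not line.startswith(">"):
            yield line
            continue
        seq_id, _, description = line[1:].rstrip("\n").partition(" ")
        if "|kraken:taxid|" in seq_id:
            yield line
            continue
        header = f">{seq_id}|kraken:taxid|{taxid}"
        if description:
            header += " " + description
        yield header + "\n"


example = [">genome Salmonella_enterica-XXXXX\n", "ACGTACGT\n"]
print("".join(tag_fasta_headers(example, 3304349)), end="")

# Usage on real files (file names here are placeholders):
# with open("in.fa") as src, open("out.fa", "w") as dst:
#     dst.writelines(tag_fasta_headers(src, 3304349))
```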

Note - genomes must be added to the library before the build. In my experience, adding them after the database is built doesn't work.
To add genomes from NCBI, simply skip to step 5 and then build.

That should do it. Hope this helps someone in the future.
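
For anyone scripting this, the kraken2-build calls from the steps above could be assembled like so. This is only a sketch: the function names and the "mydb" and file names are placeholders, and the taxonomy and header edits (steps 3 and 4) still have to happen between the two groups of commands.

```python
import subprocess


def download_commands(db, library="bacteria"):
    """Steps 1 and 2: fetch the NCBI taxonomy and the chosen library."""
    return [
        ["kraken2-build", "--download-taxonomy", "--db", db],
        ["kraken2-build", "--download-library", library, "--db", db],
    ]


def add_and_build_commands(db, fasta_files):
    """Steps 5 and 6: add each edited fasta file, then build the database."""
    commands = [
        ["kraken2-build", "--add-to-library", fasta, "--db", db]
        for fasta in fasta_files
    ]
    commands.append(["kraken2-build", "--build", "--db", db])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


for command in add_and_build_commands("mydb", ["Salmonella_enterica-XXXXX.fa"]):
    print(" ".join(command))

# run_all(download_commands("mydb"))
# ... edit names.dmp / nodes.dmp and the fasta headers here (steps 3 and 4) ...
# run_all(add_and_build_commands("mydb", ["Salmonella_enterica-XXXXX.fa"]))
```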
