create nodes.dmp and names.dmp files for custom database #436

DJFast · 2021-04-20T18:50:00Z

Hi all,

First, I wanted to thank the kraken2 team for making a wonderful tool for open source use.

I have 21 new genomes (in fasta format) that I want to add to my kraken2 database. These genomes do not exist in the NCBI taxonomy. As such, I have added the requisite >sequence16|kraken:taxid|32630 syntax to all the files, where the specific number is one that does not exist in the library.

The genomes do have corresponding organisms in NCBI, but the specific strain does not and I don't want to override the existing strains in the database. Based on this, I assume I need to edit the nodes.dmp and names.dmp files. I have read #154, #179, & #317 but am still unclear how to proceed.

My current database was built with $kraken2-build --download-library bacteria --db $DBNAME. Simply, I want to add these 21 genomes to my bacterial database. I can successfully add them to the db with $kraken2-build --add-to-library chr1.fa --db $DBNAME but after that I don't know what to do.

How do I edit or create new nodes.dmp & names.dmp files that contain the information on my new strains? Any help would be greatly appreciated.

punnettsun · 2021-04-21T04:20:16Z

Hi,

It has been some time since I worked on this, so I do not remember 100% about the contents of the nodes.dmp and names.dmp, but I will try to help.
I would suggest looking at my comment in issue 126: jenniferlu717/Bracken#126
You would need to build a taxonomy tree so you can make a nodes.dmp and names.dmp file. The formats for the files are mentioned below and in the 126 issue as well.

The names.dmp file should have these columns with these delimiters:
"taxid\t|\tdisplay_name\t|\t-\t|\tscientific name\t|\n"

The nodes.dmp file should have these columns with these delimiters:
"taxid\t|\tparent_taxid\t|\trank\t|\t-\t|\n"

For example, if you have the following rank information for a particular species (by the way, you would also need to create your own taxids):
Domain: Bacteria (taxid: 1)
Phylum: Acidobacteria (taxid: 2)
Class: Acidobacteriia (taxid: 3)
Order: Acidobacteriales (taxid: 4)
Family: Acidobacteriaceae (taxid: 5)
Genus: Acidipila (taxid: 6)
Species: Acidipila rosea (taxid: 7)

Your nodes.dmp file (the one holding taxid, parent taxid, and rank) would look something like this (tab-delimited in the real file), with taxid in the first column and parent taxid in the second column:
1 | 1 | domain | -
2 | 1 | phylum | -
3 | 2 | class | -
4 | 3 | order | -
5 | 4 | family | -
6 | 5 | genus | -
7 | 6 | species | -
^ 7 is the taxid and 6 is the parent of 7. So the species (taxid 7) belongs under the genus (taxid 6).
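
A minimal Python sketch of writing both files for this example lineage. It assumes the NCBI taxdump convention that nodes.dmp holds the taxid, parent taxid, and rank while names.dmp holds the display name, and it uses the 4-column rows quoted in this comment (real NCBI nodes.dmp rows carry more columns, so check that your Kraken2 version accepts the short form). `LINEAGE`, `render_taxdump`, and `write_taxdump` are hypothetical names made up for illustration.

```python
import os

# (taxid, parent_taxid, rank, name) for the example lineage above.
# The taxids are the made-up ones from this comment, not NCBI taxids.
LINEAGE = [
    (1, 1, "domain", "Bacteria"),
    (2, 1, "phylum", "Acidobacteria"),
    (3, 2, "class", "Acidobacteriia"),
    (4, 3, "order", "Acidobacteriales"),
    (5, 4, "family", "Acidobacteriaceae"),
    (6, 5, "genus", "Acidipila"),
    (7, 6, "species", "Acidipila rosea"),
]


def render_taxdump(taxa):
    """Return (nodes_text, names_text) for an iterable of taxon tuples."""
    nodes, names = [], []
    for taxid, parent, rank, name in taxa:
        # nodes.dmp row: taxid, parent taxid, rank, placeholder
        nodes.append(f"{taxid}\t|\t{parent}\t|\t{rank}\t|\t-\t|\n")
        # names.dmp row: taxid, display name, placeholder, name class
        names.append(f"{taxid}\t|\t{name}\t|\t-\t|\tscientific name\t|\n")
    return "".join(nodes), "".join(names)


def write_taxdump(taxa, out_dir):
    """Write nodes.dmp and names.dmp into out_dir (created if missing)."""
    nodes_text, names_text = render_taxdump(taxa)
    os.makedirs(out_dir, exist_ok=True)
    for filename, text in (("nodes.dmp", nodes_text), ("names.dmp", names_text)):
        with open(os.path.join(out_dir, filename), "w", newline="\n") as handle:
            handle.write(text)
```

Calling `write_taxdump(LINEAGE, "taxonomy")` would produce the two files in a `taxonomy` directory; the directory name is only an example.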

You can change the display name (the second column of the names.dmp file) to whatever you would like to see in the output.

If you have more species under the Acidipila genus, each one also needs its own taxid, connected to the rest of your taxonomy through its parent taxid. For example, if you also have Acidipila dinghuensis (another species under the Acidipila genus), your nodes.dmp file (taxid, parent taxid, rank) would look something like this (tab-delimited in the real file):
1 | 1 | domain | -
2 | 1 | phylum | -
3 | 2 | class | -
4 | 3 | order | -
5 | 4 | family | -
6 | 5 | genus | -
7 | 6 | species | -
8 | 6 | species | -
^ note that taxid 8 belongs under the genus (taxid 6) since Acidipila dinghuensis belongs to the Acidipila genus.
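
Before building, it can be worth checking that every taxid actually connects back to the root. The sketch below is one way to do that in Python, assuming rows in the tab-pipe-tab layout with taxid first and parent taxid second, and assuming the root is the node that is its own parent (as taxid 1 is in this example). `load_parents` and `lineage` are hypothetical helper names.

```python
def load_parents(nodes_lines):
    """Map each taxid to its parent taxid from taxid/parent/rank rows."""
    parents = {}
    for line in nodes_lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t|\t")
        parents[int(fields[0])] = int(fields[1])
    return parents


def lineage(taxid, parents):
    """Return the path from taxid up to the root (a node that is its own parent)."""
    path = [taxid]
    while True:
        current = path[-1]
        if current not in parents:
            raise ValueError(f"taxid {current} has no row of its own")
        parent = parents[current]
        if parent == current:
            return path  # reached the root
        if parent in path:
            raise ValueError(f"cycle detected at taxid {parent}")
        path.append(parent)


# The eight example rows above, rebuilt with real tab delimiters.
EXAMPLE_ROWS = [
    (1, 1, "domain"), (2, 1, "phylum"), (3, 2, "class"), (4, 3, "order"),
    (5, 4, "family"), (6, 5, "genus"), (7, 6, "species"), (8, 6, "species"),
]
example_lines = [f"{t}\t|\t{p}\t|\t{r}\t|\t-\t|\n" for t, p, r in EXAMPLE_ROWS]
example_parents = load_parents(example_lines)
print(lineage(8, example_parents))  # [8, 6, 5, 4, 3, 2, 1]
```

A row whose parent is missing from the file raises a `ValueError`, which is the kind of broken link that is easy to introduce when editing these files by hand.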

You will need to write some code to create these dmp files.
I hope this helps. Let me know if you need further clarification.
I have not added new genomes to an already existing database, but this comment should help you create nodes & names dmp files.
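
On creating your own taxids when extending an existing taxonomy: the new IDs must not collide with ones already present. A rough Python sketch, assuming that "one more than the highest taxid already in the file" is an acceptable choice (that policy is an assumption for illustration, and an ID picked this way could later be assigned by NCBI). `next_free_taxid` is a hypothetical helper name.

```python
def next_free_taxid(nodes_lines, floor=0):
    """Return one more than the largest taxid found in the given rows.

    nodes_lines is any iterable of rows whose first pipe-delimited
    field is the taxid, e.g. an open file handle.
    """
    highest = floor
    for line in nodes_lines:
        first_field = line.split("|", 1)[0].strip()
        if first_field:
            highest = max(highest, int(first_field))
    return highest + 1
```

With a file on disk this could be called as `next_free_taxid(open("nodes.dmp"))`; pass a larger `floor` to leave headroom above the existing IDs.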

mw55309 · 2021-04-21T08:06:12Z

I think the easiest way to do this is to hack the gtdb_to_taxdump script:

https://github.com/nick-youngblut/gtdb_to_taxdump

Taxdump_edit also does a decent job and again you can easily adapt the perl script

https://github.com/guyleonard/taxdump_edit

There is also struo:

https://github.com/leylabmpi/Struo

and flextaxd:

https://github.com/FOI-Bioinformatics/flextaxd

DJFast · 2021-04-21T13:09:03Z

@punnettsun @mw55309 - Thank you both so much for the guidance. I really appreciate the response. I will be working on this today. But I do really appreciate you two getting back to me on this.

DJFast · 2021-04-27T00:53:10Z

I'm going to leave this here for anyone looking at this issue in the future.
It is possible to add genomes to your database, edit the names.dmp and nodes.dmp files, and have kraken map to those genomes.

In general:

1. Make a database directory and download the NCBI taxonomy: kraken2-build --download-taxonomy --db $DBNAME
2. Download the desired library: kraken2-build --download-library bacteria --db $DBNAME
3. Edit names.dmp & nodes.dmp with taxdump_edit
   A. Assemble a list of genomes to be added
   B. Identify the parent taxon (from the NCBI taxonomy), rank, and division for each. Example for a new Salmonella enterica strain:
      Salmonella_enterica-XXXXX -parent 28901 -rank subspecies -division 11
      (note: genus and species must be separated by an underscore. A space will not work)
   C. Run taxdump_edit.pl -names names.dmp -nodes nodes.dmp -taxa NAME -parent XXX -rank NAME -division X
      (note: species = S, subspecies = S1, varietas = S2 in kraken)
   D. Take note of the ID assigned to each new taxon as it is added:
      "Your calculated TaxID = 3304349. Please use this with makeblastdb and your fasta sequences."
4. Edit the fasta genome file headers to conform to the kraken:taxid nomenclature
   Example: >genome|kraken:taxid|3304349 Salmonella_enterica-XXXXX
5. Add the edited fasta genome files to the kraken database: kraken2-build --add-to-library Salmonella_enterica-XXXXX.fa --db $DBNAME
6. Build the database: kraken2-build --build --db $DBNAME
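
The header edit can be scripted rather than done by hand. A small Python sketch, assuming every sequence in a given file gets the same taxid and that the header form is ">id|kraken:taxid|TAXID description" as in the example above. `tag_fasta_header` and `tag_fasta_lines` are hypothetical helper names.

```python
def tag_fasta_header(header, taxid):
    """Turn '>id description' into '>id|kraken:taxid|TAXID description'."""
    parts = header[1:].strip().split(None, 1)
    if not parts:
        raise ValueError("FASTA header has no sequence ID")
    if "kraken:taxid|" in parts[0]:
        return header.rstrip("\n")  # already tagged, leave as is
    tagged = f">{parts[0]}|kraken:taxid|{taxid}"
    return f"{tagged} {parts[1]}" if len(parts) == 2 else tagged


def tag_fasta_lines(lines, taxid):
    """Yield FASTA lines with every header tagged; sequence lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield tag_fasta_header(line, taxid) + "\n"
        else:
            yield line
```

For a file on disk, read the lines from the original FASTA and write the output of `tag_fasta_lines(handle, 3304349)` to a new file, then pass that new file to --add-to-library.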

Note - genomes must be added before the build. In my experience it doesn't work retroactively.
To add genomes from NCBI simply skip to step 5 and then build
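
For anyone scripting this, here is a rough Python sketch that assembles the kraken2-build calls from the steps above as argument lists. It uses only the flags already shown in this thread; the function names and the split into two phases are assumptions for illustration. The taxonomy and header edits happen between the two phases and are not automated here.

```python
import subprocess


def download_commands(db, library="bacteria"):
    """Commands for downloading the taxonomy and a reference library."""
    return [
        ["kraken2-build", "--download-taxonomy", "--db", db],
        ["kraken2-build", "--download-library", library, "--db", db],
    ]


def add_and_build_commands(db, fasta_paths):
    """Commands for adding each edited FASTA file, then building."""
    commands = [
        ["kraken2-build", "--add-to-library", path, "--db", db]
        for path in fasta_paths
    ]
    commands.append(["kraken2-build", "--build", "--db", db])
    return commands


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Keeping the build as the last command reflects the note above that genomes must be added before the build.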

That should do it. Hope this helps someone in the future.

DJFast mentioned this issue Apr 21, 2021

Custom Taxonomy .dmp files #317

Closed

DJFast closed this as completed Apr 27, 2021

meenachakra mentioned this issue Aug 11, 2023

Question about the database species bhattlab/phanta#33

Closed

annabel-dekker mentioned this issue Mar 20, 2024

Adding custom seqs to existing databases epi2me-labs/wf-metagenomics#89

Open
