The current version includes the following data sets:
- rdp_train: Fungal LSU training set version 11
- warcup: Warcup ITS training set version 2
- silva_parc: LSU Parc version 138.1
- silva_ref: LSU Ref version 138.1
- silva_nr99: LSU Ref NR99 version 138.1
- unite_nosingle: Unite with singletons set as RefS, all eukaryotes, version 8.2
- unite_single: Unite with global and 97% singletons, all eukaryotes, version 8.2
Classifications are intended to match those used by the Unite database as much as possible. This has been accomplished mostly by matching taxon names in the existing databases with those given in the proposed scheme of Tedersoo (2017). This scheme is used because it uses uses the same taxonomic ranks across all eukaryotes. Only the primary taxonomic ranks (Kingdom, Phylum, Class, Order, Family, Genus) are included. Species annotations are also not included.
For certain groups, the existing databases are not specific enough to be directly mapped, often because of taxonomic "splitting" since they were compiled, so taxonomic annotations are downloaded from NCBI. The groups for which this is done in the current version is:
- "Protista"" (Unite)
- Non-Fungi (RDP training set)
- Lactarius (RDP)
- Sebacina (RDP)
Some additional changes are made by hand using the reference/*.pre.sed
files.
These include a few updates to Tedersoo's taxonomy.
Lists of all changed taxonomy for each database are found in output/*.changes
;
more summarized changes are found in output/*.changes2
.
Download the .fasta.gz
from the most recent [release], or from the output/
directory for the most recent pre-release build.
There are preformatted versions for use with SINTAX (in USEARCH/VSEARCH)
or the RDP Naïve Bayesian Classifier as implemented in DADA2.
The header format for SINTAX is:
>tax=k:kingdom,p:phylum,c:class,o:order,f:family,g:genus
The header format for DADA2 is:
>kingdom;phylum;class;order;family;genus
In both cases, all "unknown" classifications are truncated away; both SINTAX and DADA2 can accomodate this.
Intermediate classifications which are missing are given as the nearest non-missing supertaxon, followed by "_Incertae_sedis
".
Clone the repository.
The build is managed by a combination of Snakemake and Drake.
The easiest way to manage dependencies is conda
.
snakemake --use-conda
Otherwise, you can manually install all of the packages listed in config/drake.yaml
, and then do
snakemake
An internet connection is required.
NCBI queries are much more reliable if you have an ENTREZ key saved in a file ENTREZ_KEY
in the root directory.