Update GTDB taxonomy

Pierre Chaumeil edited this page Mar 22, 2022 · 19 revisions

The taxonomy for the genome tree is in a fairly constant state of flux. As such, it needs to be updated on a regular basis. Updating the database is complicated by a few factors:

  • provided taxonomy files often do not cover all 7 canonical ranks and may not properly use binomial names for species
  • the taxonomy of representative genomes must be propagated to all genomes clustered with the representative
  • updated taxonomy files may not cover all genomes in the database as curators often focus on specific parts of the tree
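
The first complication can be spotted before any GTDB tool is run. The following is a minimal sketch, not part of GenomeTreeTk or gtdb_validation_tk: the `check_ranks` function is a hypothetical helper, and it assumes a two-column, tab-separated file (genome ID, then a taxonomy string whose ranks are separated by `;` with no surrounding spaces).

```shell
# Hypothetical helper: list genomes whose taxonomy string is incomplete.
# Assumes a 2-column TSV: genome ID <TAB> d__...;p__...;c__...;o__...;f__...;g__...;s__...
check_ranks() {
    awk -F'\t' '
        BEGIN { n = split("d__ p__ c__ o__ f__ g__ s__", prefix, " ") }
        {
            m = split($2, rank, ";")
            problem = ""
            if (m != n) {
                problem = "expected 7 ranks, found " m
            } else {
                # Each rank must start with its canonical prefix, in order.
                for (i = 1; i <= n; i++) {
                    if (index(rank[i], prefix[i]) != 1) {
                        problem = "rank " i " does not start with " prefix[i]
                        break
                    }
                }
                # A filled-in species should be a binomial ("s__Genus species").
                if (problem == "" && rank[n] != "s__" && index(rank[n], " ") == 0) {
                    problem = "species is not a binomial name"
                }
            }
            if (problem != "") print $1 "\t" problem
        }
    ' "$1"
}

# Example call (file name taken from the steps on this page):
#   check_ranks gtdb_r207_bac120_curation_taxonomy.tsv
```

An empty result means every string has seven correctly prefixed ranks. It does not replace the fill_ranks or check_file commands, which perform the actual correction and validation.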

GenomeTreeTk, gtdb_validation_tk, and gtdb_migration_tk provide commands for quickly correcting, verifying, and expanding new taxonomy strings:

  1. Run the genometreetk fill_ranks command so that all taxonomy strings cover all 7 canonical ranks.
  2. Combine the final bacterial and archaeal taxonomies into a single file:
     cat gtdb_r207_bac120_curation_taxonomy.tsv gtdb_r207_ar53_curation_taxonomy.tsv > final_taxonomy_combined.tsv
  3. Run gtdb_migration_tk propagate_curated_taxonomy to propagate the taxonomy from the final taxonomy file (using canonical ids) to all genomes in each species cluster:
     gtdb_migration_tk propagate_curated_taxonomy -t final_taxonomy_combined.tsv -m metadata_r207.tsv -o propagated_taxonomy.tsv
  4. Re-run the gtdb_validation_tk check_file command to ensure the taxonomy file is properly formatted and to identify potential issues.
  5. Insert the final taxonomy file into the GTDB using the gtdb-migration-tk add_taxonomy_to_database command:
     gtdb-migration-tk add_taxonomy_to_database --hostname watson.ace.uq.edu.au -u gtdb -d gtdb_pierre_r207 -p ecogenomicsgtdb --taxonomy_file propagated_taxonomy.tsv -m metadata_r207.tsv --truncate_taxonomy
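
Because the combined file feeds every later step, it can be worth checking it as it is built. The sketch below uses only standard Unix tools. `combine_taxonomies` is a hypothetical helper name, and the genome ID is assumed to be in the first tab-separated column. It overwrites the output file rather than appending to it, so a leftover file from an earlier run cannot leak into the result, and it prints any genome ID that appears more than once.

```shell
# Hypothetical helper: concatenate taxonomy files and report duplicate genome IDs.
# Usage: combine_taxonomies OUTPUT INPUT [INPUT...]
combine_taxonomies() {
    combined=$1
    shift
    # Overwrite (not append) so a stale output file cannot contaminate the result.
    cat "$@" > "$combined"
    # Genome IDs are assumed to be in column 1; print each ID seen more than once.
    cut -f1 "$combined" | LC_ALL=C sort | uniq -d
}

# Example call (file names taken from step 2):
#   combine_taxonomies final_taxonomy_combined.tsv \
#       gtdb_r207_bac120_curation_taxonomy.tsv gtdb_r207_ar53_curation_taxonomy.tsv
```

No output means every genome ID is unique across the bacterial and archaeal files.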

Note: This script only updates the taxonomy of the specified genomes. If the new taxonomy covers all genomes passing QC and other criteria, it may be necessary to first set the gtdb_taxonomy, gtdb_phylum, ..., gtdb_species entries to NULL in the database:

UPDATE metadata_taxonomy SET gtdb_phylum = NULL, gtdb_class = NULL, gtdb_order = NULL, gtdb_family = NULL, gtdb_genus = NULL, gtdb_species = NULL, gtdb_taxonomy = NULL;

The gtdb_domain field should never be set to NULL, as this field is used for filtering purposes. If the taxonomy is domain-specific, be careful not to accidentally wipe out the other domain (e.g., by clearing the whole database and adding back only a new bacterial taxonomy).
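
One way to guard against this is to confirm which domains a taxonomy file actually covers before clearing anything. The following is a minimal sketch using standard Unix tools. `domains_in_file` is a hypothetical helper, and it assumes the taxonomy string is the second tab-separated column and starts with the domain rank.

```shell
# Hypothetical helper: count genomes per domain in a taxonomy file.
# Assumes a 2-column TSV: genome ID <TAB> d__Domain;p__...;...
domains_in_file() {
    cut -f2 "$1" |          # keep the taxonomy string
        cut -d';' -f1 |     # keep the first rank (the domain)
        LC_ALL=C sort |
        uniq -c |
        awk '{ print $2 "\t" $1 }'   # print: domain <TAB> genome count
}

# Example call (file name taken from the steps on this page):
#   domains_in_file propagated_taxonomy.tsv
```

If only d__Bacteria is listed, for example, only bacterial entries should be cleared from the database.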

IMPORTANT: When updating the taxonomy from one release to the next, we only clear the GTDB ranks for RefSeq, GenBank, and UBA genomes. To clear all the fields for bacterial genomes:

UPDATE metadata_taxonomy
SET gtdb_phylum = NULL, gtdb_class = NULL, gtdb_order = NULL, gtdb_family = NULL, gtdb_genus = NULL, gtdb_species = NULL, gtdb_taxonomy = NULL
WHERE gtdb_domain LIKE 'd__Bacteria'
  AND id IN (
    SELECT id FROM genomes ge
    WHERE genome_source_id != 1
       OR ge.id IN (SELECT genome_id FROM genome_list_contents WHERE list_id = 479)
  );

To clear all the fields for archaeal genomes:

UPDATE metadata_taxonomy
SET gtdb_phylum = NULL, gtdb_class = NULL, gtdb_order = NULL, gtdb_family = NULL, gtdb_genus = NULL, gtdb_species = NULL, gtdb_taxonomy = NULL
WHERE gtdb_domain LIKE 'd__Archaea'
  AND id IN (
    SELECT id FROM genomes ge
    WHERE genome_source_id != 1
       OR ge.id IN (SELECT genome_id FROM genome_list_contents WHERE list_id = 479)
  );