Generating data files for GTDB website

Donovan H. Parks edited this page Jul 3, 2024 · 34 revisions

A number of data files must be created for each GTDB release and placed on the GTDB website at: https://data.ace.uq.edu.au/public/gtdb/data/releases

Propagating latest taxonomy

  • We need to combine the 2 final taxonomy files:
    cat final_taxonomy_ar122.tsv final_taxonomy_bac120.tsv > final_taxonomy_combined.tsv
  • We propagate the taxonomy from the representatives to all genomes in their clusters.
    gtdb_migration_tk propagate_curated_taxonomy -t final_taxonomy_combined.tsv -m gtdb_r202_metadata_20210413.tsv -o propagated_taxonomy.tsv
  • We push the new taxonomy to the database.
    gtdb_migration_tk add_taxonomy_to_database --hostname $hostname -u $user -d $db -p password -t propagated_taxonomy.tsv -m gtdb_r202_metadata_20210413.tsv --truncate_taxonomy
  • Make sure the number of genomes in propagated_taxonomy.tsv matches the DB.
    The line count from wc -l propagated_taxonomy.tsv should equal the number of rows returned by SELECT * from metadata_view where gtdb_species not like 's__'
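
The count check in the last step can be scripted. The sketch below is a minimal illustration only: it assumes the taxonomy file has no header line and that the database row count has already been obtained separately (for example with a COUNT(*) form of the query above). The function name check_genome_count, the db_count variable, and the demo rows are made up for this example.

```shell
# Compare a file's line count with an expected row count.
# Usage: check_genome_count <taxonomy_file> <expected_count>
check_genome_count() {
    observed=$(wc -l < "$1" | tr -d ' ')
    if [ "$observed" -eq "$2" ]; then
        echo "OK: $observed genomes"
    else
        echo "MISMATCH: file has $observed lines, expected $2" >&2
        return 1
    fi
}

# Demonstration with made-up rows; in practice the file would be
# propagated_taxonomy.tsv and the count would come from the database.
demo_file=$(mktemp)
printf 'G1\td__Bacteria;s__A\nG2\td__Archaea;s__B\n' > "$demo_file"
db_count=2   # hypothetical value standing in for the SQL row count
check_genome_count "$demo_file" "$db_count"
rm -f "$demo_file"
```

A non-zero return status on a mismatch makes it easy to stop a release script before any further files are generated.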

Prerequisites

  • GTDB database must be updated to contain the latest species clustering information
  • GTDB database must be updated to contain the latest GTDB taxonomy
  • Data in the updated database should be dumped to a TSV file, e.g.:
gtdb metadata export --format tab --output gtdb_r202_metadata_20210414.tsv
  • The path to the data directory for each genome in the GTDB should be dumped to a TSV file, e.g.:
gtdb power genome_paths --output gtdb_r202_genome_paths_20210414.tsv

Updating config.py

The config.py file in the GTDB Release Tk must be updated to reflect any changes in the path to data files.

Generating data files

Functionality for generating website data files is being moved into the GTDB Release Tk, but currently exists in a number of Python and SQL scripts. The examples below come from several GTDB releases (r89, r202, and r207); version numbers and file dates should be updated to reflect the current release.

Species Cluster file

  • We get the gtdb_clusters_de_novo.tsv from /srv/db/gtdb/metadata/release202/representatives/sp_cluster_update/u_cluster_de_novo/
  • We generate canonicals_to_ncbi.tsv:
    awk -v OFS='\t' '{ print $2,$1 }' gtdb_r207_metadata_20220322.tsv > canonicals_to_ncbi.tsv
  • We create the species cluster file:
    gtdb-release-tk sp_cluster_file data_from_db/gtdb_r202_metadata_20210414.tsv data_from_db/gtdb_clusters_de_novo.tsv data_from_db/canonicals_to_ncbi.tsv 202 sp_cluster_file
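
The awk step above swaps the first two columns of the metadata file. The sketch below shows that swap on a tiny made-up table. The column names and accession values are placeholders, not real GTDB fields, and the explicit -F'\t' is an addition here: it restricts splitting to tabs so that spaces inside a field cannot shift columns.

```shell
demo_dir=$(mktemp -d)

# Made-up excerpt; a real metadata file has many more columns.
printf 'col_a\tcol_b\tnotes\n' > "$demo_dir/metadata_demo.tsv"
printf 'G000001\tGCF_000001.1\tfree text with spaces\n' >> "$demo_dir/metadata_demo.tsv"

# Print column 2 followed by column 1, tab-separated on input and output.
awk -F'\t' -v OFS='\t' '{ print $2, $1 }' "$demo_dir/metadata_demo.tsv" > "$demo_dir/canonicals_demo.tsv"

mapping=$(cat "$demo_dir/canonicals_demo.tsv")
echo "$mapping"
rm -rf "$demo_dir"
```

Note that the header row is swapped along with the data rows, so the first line of the output is a header rather than a genome mapping.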

Taxonomy files

Bacterial and archaeal taxonomy files spanning the species representative genomes can be obtained with:

gtdb-release-tk taxonomy_files gtdb_r202_metadata_20210414.tsv sp_clusters_r202.tsv 202 taxonomy_files

This creates the bac120_taxonomy_r89.tsv and ar122_taxonomy_r89.tsv taxonomy files. GTDB user genome IDs are replaced with an NCBI genome accession where available and a UBA ID otherwise.

Reference trees

  • The input trees must already be stripped of dummy curation nodes. This can be done with the remove_dummy method of the GTDB Validation Tk.
gtdb_validation_tk remove_dummy gtdb_r202_bac120_unscaled_decorated.tree gtdb_r202_bac120_unscaled_decorated_no_dummy.tree
gtdb_validation_tk remove_dummy gtdb_r202_ar122_unscaled_decorated.tree gtdb_r202_ar122_unscaled_decorated_no_dummy.tree

The archaeal and bacterial trees used during curation must be modified to replace all GTDB user genome IDs. Reference trees for the GTDB website can be created with:

gtdb_release_tk tree_files gtdb_r202_metadata_20210414.tsv gtdb_r202_bac120_unscaled_decorated_no_dummy.tree gtdb_r202_ar122_unscaled_decorated_no_dummy.tree canonicals_to_ncbi.tsv 202 r202_temp_website/tree_files

SSU files

Three files spanning different sets of 16S rRNA sequences are placed on the GTDB website:

  • bac120_ssu_reps_r<release#>.fna: a single 16S rRNA sequence for each bacterial representative genome. The longest identified 16S rRNA sequence is selected for each representative genome.
  • ar53_ssu_reps_r<release#>.fna: a single 16S rRNA sequence for each archaeal representative genome. The longest identified 16S rRNA sequence is selected for each representative genome.
  • ssu_all_r<release#>.fna: contains all 16S rRNA sequences identified across the set of GTDB genomes passing QC.

These files can be created with:

gtdb-release-tk ssu_files gtdb_r202_metadata_20210414.tsv sp_clusters_r202.tsv gtdb_r202_genome_paths_20210414.tsv 202 ssu_files

Gene files for representative/all genomes

Information about individual marker genes, along with individual MSAs, is provided on the GTDB website. Initial versions of these files can be obtained from the GTDB:

gtdb -t 30 tree create --no_tree --no_trim --individual --prefix bac120_r207_all --taxa_filter d__Bacteria --genome_batchfile ../data_from_db/bac120_all.lst --marker_set_ids 1 --guaranteed_batchfile ../data_from_db/bac120_all.lst --output bac120_msa_marker_genes_all_r207 --classic_header

gtdb -t 30 tree create --no_tree --no_trim --individual --prefix ar53_r207_all --taxa_filter d__Archaea --genome_batchfile ../data_from_db/ar53_all.lst --marker_set_ids 19 --guaranteed_batchfile ../data_from_db/ar53_all.lst --output ar53_msa_marker_genes_all_r207 --classic_header

Once the alignment is finished:
`cd bac120_msa_marker_genes_all_r207`
`mkdir individual ; mv bac120_r207_all_PF* individual ; mv bac120_r207_all_TI* individual ; cd individual ; tar czvf bac120_msa_marker_genes_all_r207.tar.gz *`

Alternatively:
`gtdb-release-tk marker_files bac120_msa_marker_genes_all_r207 ar53_msa_marker_genes_all_r207 207 individual_gene_files`

The same operations are repeated for the representative genomes.

This produces the files:

  • bac120/ar122_msa_marker_info_r89.tsv
  • bac120/ar122_msa_individual_genes_r89.tar.gz

Protein and nucleotide files

You need to use the combined taxonomy file:

gtdb-release-tk nucleotide_files --taxonomy_file taxonomy_files/taxonomy_r207.tsv --metadata_file data_from_db/gtdb_r207_metadata_20220322.tsv --release_number 207 --genome_dirs data_from_db/gtdb_r207_genome_paths_20220322.tsv --output_dir protein_fna_reps
gtdb-release-tk protein_files --taxonomy_file taxonomy_files/taxonomy_r207.tsv --metadata_file data_from_db/gtdb_r207_metadata_20220322.tsv --release_number 207 --genome_dirs data_from_db/gtdb_r207_genome_paths_20220322.tsv --output_dir protein_faa_reps

Archive the results:

tar -cv protein_fna_reps | pigz -9 > gtdb_proteins_nt_reps_r202.tar.gz
tar -cv protein_faa_reps | pigz -9 > gtdb_proteins_aa_reps_r202.tar.gz
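
Before uploading, it can be worth confirming that an archive is intact. The sketch below builds a small throwaway archive and checks it. The directory and file names are invented, and gzip stands in for pigz so the example runs without it (both write the gzip format, so the same checks apply to the pigz output).

```shell
work_dir=$(mktemp -d)
mkdir "$work_dir/protein_demo"
printf '>gene1\nMKT\n' > "$work_dir/protein_demo/genome1.faa"

# Same tar-to-compressor pipeline as above, with gzip in place of pigz.
tar -c -f - -C "$work_dir" protein_demo | gzip -9 > "$work_dir/protein_demo.tar.gz"

# gzip -t tests the compressed stream without extracting anything.
gzip -t "$work_dir/protein_demo.tar.gz" && echo "compressed stream OK"

# tar -tzf lists the archived paths, which confirms the tar layer is readable.
archive_listing=$(tar -tzf "$work_dir/protein_demo.tar.gz")
echo "$archive_listing"
rm -rf "$work_dir"
```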

HQ genome file

Run the command

gtdb-release-tk hq_genome_file data_from_db/gtdb_r207_metadata.tsv 207 hq_genome_file

Metadata files

Run the command

gtdb-release-tk metadata_files data_from_db/gtdb_r207_metadata_20220322.tsv data_from_db/metadata_field_desc.tsv sp_cluster_file/sp_clusters_r207.tsv 207 metadata_files

LPSN urls mapping

Run the command:

gtdb-release-tk lpsn_urls /srv/db/gtdb/metadata/release207/lpsn/20210823/species_list.lst 207 lpsn_urls

QC failed

Get the qc_failed.tsv file from the directory /srv/db/gtdb/metadata/release207/representatives/sp_cluster_update/2_u_qc_genomes and run:

gtdb-release-tk qc_file data_from_db/qc_failed.tsv data_from_db/canonicals_to_ncbi.tsv 207 qc_failed

Dict file

Run the command:

gtdb-release-tk dict_file taxonomy_files/taxonomy_r207.tsv 207 dict_file

Marker genes file

  • For reps
gtdb-release-tk gene_files --taxonomy_file taxonomy_files/taxonomy_r207.tsv --genome_dirs data_from_db/gtdb_r207_genome_paths_20220322.tsv --release_number 207 --output_dir gene_files_reps --cpus 30 --metadata_file data_from_db/gtdb_r207_metadata_20220322.tsv --only_reps

Rename and archive the output:

mv ar122_202_individual_genes ar122_marker_genes_reps_r202
mv bac120_202_individual_genes bac120_marker_genes_reps_r202
tar cvzf ar122_marker_genes_reps_r202.tar.gz ar122_marker_genes_reps_r202
tar cvzf bac120_marker_genes_reps_r202.tar.gz bac120_marker_genes_reps_r202
  • For all genomes
gtdb_release_tk gene_files --taxonomy_file taxonomy_files/taxonomy_r202.tsv --genome_dirs data_from_db/gtdb_r202_genome_paths_20210414.tsv --release_number 202 --output_dir gene_files_all --cpus 30 --metadata_file data_from_db/gtdb_r202_metadata_20210414.tsv

Rename and archive the output:

mv ar122_202_individual_genes ar122_marker_genes_all_r202
mv bac120_202_individual_genes bac120_marker_genes_all_r202
tar cvzf ar122_marker_genes_all_r202.tar.gz ar122_marker_genes_all_r202
tar cvzf bac120_marker_genes_all_r202.tar.gz bac120_marker_genes_all_r202
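
The rename-and-archive steps follow the same pattern for both marker sets, so they can be written as a loop. The sketch below is one possible way to do this and runs against throwaway directories created in a temporary location. The release, scope, and work_dir variables are illustrative names, and the demo files are placeholders for the real gene_files output.

```shell
release=202          # release number used in the directory names
scope=all            # "all" or "reps", matching the two cases above

# Throwaway directories standing in for the gene_files output.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1
for prefix in ar122 bac120; do
    mkdir "${prefix}_${release}_individual_genes"
    printf '>demo\nMKT\n' > "${prefix}_${release}_individual_genes/demo.faa"
done

# Rename each output directory, then archive it under the new name.
for prefix in ar122 bac120; do
    src="${prefix}_${release}_individual_genes"
    dest="${prefix}_marker_genes_${scope}_r${release}"
    mv "$src" "$dest"
    tar czf "${dest}.tar.gz" "$dest"
done
ls
```

Setting scope=reps gives the names used for the representative genomes.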

Individual marker genes

MSA files

MSA files used to produce the GTDB reference trees are created by the gtdb tree create command. These files need to be processed. The trimmed MSAs are found in /srv/project/gtdb/release/archaea(bacteria)/pre_curation/bac120(ar53)/msa:

gtdb-release-tk msa_files gtdb_r207_bac120_concatenated.faa gtdb_r207_ar53_concatenated.faa canonicals_to_ncbi.tsv gtdb_r207_metadata_20220322.tsv 207 ../trimmed_msa_files

JSON Reference Tree

The JSON tree is used as a reference file to load the tree browser on the website. Join both taxonomy and metadata files:

cat bac120_taxonomy_r89.tsv ar122_taxonomy_r89.tsv > taxonomy_r89.tsv
cat bac120_metadata_r89.tsv ar122_metadata_r89.tsv > metadata_r89.tsv
gtdb_release_tk json_tree_file --taxonomy_file taxonomy_r89.tsv --metadata_file metadata_r89.tsv --output_dir . --release_number 89
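
A plain cat copies every line of both inputs. If the per-domain metadata files carry a header row (an assumption here; check the actual files), the combined file will contain that header twice. The sketch below shows one way to keep only the first header, using made-up file names and columns.

```shell
demo_dir=$(mktemp -d)
printf 'accession\tgtdb_taxonomy\nG1\td__Bacteria\n' > "$demo_dir/bac_demo.tsv"
printf 'accession\tgtdb_taxonomy\nG2\td__Archaea\n' > "$demo_dir/ar_demo.tsv"

# FNR == 1 is the first line of each input file; NR != 1 exempts the
# very first line overall, so only the repeated header is skipped.
awk 'FNR == 1 && NR != 1 { next } { print }' \
    "$demo_dir/bac_demo.tsv" "$demo_dir/ar_demo.tsv" > "$demo_dir/combined_demo.tsv"

combined_lines=$(wc -l < "$demo_dir/combined_demo.tsv" | tr -d ' ')
echo "$combined_lines"
cat "$demo_dir/combined_demo.tsv"
```

If the files have no header row, the plain cat commands above are already correct and this filter should not be used, since it would drop the first data line of the second file.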

Name not in literature

Export the data from the database:

SELECT release_ver,rank_domain,rank_phylum,rank_class,rank_order,rank_family,rank_genus,rank_species FROM taxon_hist WHERE release_ver not like '%NCBI%' ORDER BY replace(release_ver,'R','')::float

Export this file as allranks_allreleases.csv.

Download the latest taxonomy information from NCBI to have the latest names:

rsync ftp.ncbi.nih.gov::pub/taxonomy/taxdump.tar.gz .

gtdb_release_tk nomenclatural_check --ncbi_node_file 20210420/nodes.dmp --ncbi_name_file 20210420/names.dmp --lpsn_species_file /srv/db/gtdb/metadata/release202/lpsn/20201124/species_list.lst --output_directory test --gtdb_taxonomy ../taxonomy_files/taxonomy_r202.tsv --rank_release_file ../data_from_db/allranks_allreleases.csv

Taxonomy Comparison file

This tool tracks changes between two different taxonomy files. The genome IDs in those files are automatically changed to GenBank or UBA IDs to run the comparison. This functionality returns 10 files (5 bacterial and 5 archaeal, from phylum to genus) that can be copied into an Excel spreadsheet.

gtdb_release_tk tax_comp_files --reference_taxonomy_file data_from_db/gtdb_taxonomy_ncbi_20210419.tsv --new_taxonomy_file taxonomy_files/taxonomy_r202.tsv --output_dir compare_taxonomy/ncbi_vs_gtdb --changes_only

Plots for GTDB stats page

Plots for the GTDB stats page, using both a default and a color-blind-safe palette, can be generated with:

gtdb_release_tk all_release_plots bac120_metadata_r<#>.tsv ar53_metadata_r<#>.tsv <release_number> <output_dir>