Skip to content

Update database metadata

Pierre Chaumeil edited this page Apr 6, 2024 · 239 revisions

Metadata for NCBI genomes needs to be generated after syncing with RefSeq and GenBank. Most of this data is calculated externally to the GTDB to improve efficiency.
In general, scripts need to be run once for RefSeq and then again for GenBank.

Ideally, metadata would only be generated when required, but it some cases it is calculated de novo for all genomes for simplicity. Some data should be calculated de novo for all genomes on a regular bases. For example, 16S classification when there is a new SILVA/LTP database.

All scripts have been moved in gtdb_migration_tk.
TODO: develop a clean conda image for gtdb_migration_tk dev

Call genes on RefSeq and GenBank genomes

NCBI gene calling is very conservative. This can negatively impact the CheckM estimates and results in valid marker genes being missed. As such, genes are called de novo using Prodigal:

Prodigal version: v2.6.3

module load biolib prodigal (optional)
gtdb_migration_tk prodigal -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genome_dirs.tsv -c 30 -l logs/prodigal.log

Search and Align HMMs:

pfam_search version: v0.1
hmmer version: v3.1b2
**It is important to get this version of hmmer to get the same annotations of previous versions.

for full database

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 33.1 --hmm_db_path /srv/db/pfam/33.1/ -l logs/refseq_pfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 33.1 --hmm_db_path /srv/db/pfam/33.1/ -l logs/genbank_pfam.log

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 15.0 --hmm_db_path /srv/whitlam/bio/db/tigrfam/15.0/TIGRFAMs_15.0_HMM/tigrfam.hmm -l logs/refseq_tigrfam.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 15.0 --hmm_db_path /srv/whitlam/bio/db/tigrfam/15.0/TIGRFAMs_15.0_HMM/tigrfam.hmm -l logs/genbank_tigrfam.log

for db_lite

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 33.1_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/ -l logs/refseq_pfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db pfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 33.1_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/ -l logs/genbank_pfam_lite.log

gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv -c 30 --folder_suffix 15.0_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/tigrfam.hmm -l logs/refseq_tigrfam_lite.log
gtdb_migration_tk hmmsearch -r /srv/db/gtdb/genomes/ncbi/release214/genbank/extra_gbk_report_gcf.log --db tigrfam -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv -c 30 --folder_suffix 15.0_lite --hmm_db_path /srv/whitlam/bio/db/gtdb/marker_genes/hmms_extended_pfam33.1_tigr15/tigrfam.hmm -l logs/genbank_tigrfam_lite.log

OPTIONAL: It is possible to regenerate tophit files without recalling the genes.

gtdb_migration_tk top_hit

Generate Metadata:

generate metadata derived from nucleotide (e.g., GC) and protein (e.g., gene count) files:

gtdb_migration_tk metadata -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genome_dirs.tsv -c 30 -l logs/gtdb_metadata.log

Generate Ribosomal Metadata:

Identifies, extracts, and taxonomically classifies 16S and 23S rRNA genes in genomes against SILVA

  • check if a new version of SILVA has been released. If so, download this and update the rna_silva.py script to point to the new SILVA data and to store results in a new directory (i.e., currently rna_silva_132).
  • see Update Silva
  • need to run for 16S, 23S, and 5S rRNA genes

Current silva version: 138.1

module load blast (optional)
gtdb_migration_tk rna_silva -v <current_silva#> -r ssu -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -c 40 -l logs/rna_silva_<current_silva#>_ssu.log --rerun

Identifies, extracts, and taxonomically classifies 16S rRNA genes in genomes against the Living Tree Project (LTP)

  • check if a new version of LTP has been released. If so, download this and update the rna_ltp.py script to point to the new LTP data and to store results in a new directory (i.e., currently rna_ltp_132).
  • need to run for 16S

Current LTP version: 01_2022
Current SSU version: 138.1

gtdb_migration_tk rna_ltp -g /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv --ltp_version 132 --ssu_version 138.1 -c 30 --all
gtdb_migration_tk rna_ltp -g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv --ltp_version 132 --ssu_version 138.1 -c 30 --all

Predict transfer RNA

we use trnascan: identify tRNAs in genomes

module load trnascan-se (optional)
gtdb_migration_tk trnascan -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -c 30 
--gbk_bac_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt 
--gbk_arc_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt 
--rfq_arc_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt 
--rfq_bac_assembly_file /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt
-l logs/trnascan.log

Estimate genome quality

The quality of new genomes can be estimated with CheckM. Estimates should be put in /srv/whitlam/bio/db/gtdb/metadata/<release#>/checkm/.

  1. estimate genome quality for new genomes in RefSeq and GenBank:
module load biolib checkm (optional)
gtdb_migration_tk checkm -g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv --report /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/extra_gbk_report_gcf.log -o /srv/db/gtdb/metadata/release<release#>/checkm/genbank/ -c 40 -l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/check_genbank.log

gtdb_migration_tk checkm -g /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv --report /srv/db/gtdb/genomes/ncbi/release<release#>/refseq/report_gcf.log -o /srv/db/gtdb/metadata/release<release#>/checkm/refseq/ -c 40 -l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/check_refseq.log

OPTIONAL In between official releases, checkM may have been calculated for other releases ( i.e. in between R95 and 202, R98,200,201 have been processed) So it's important to join the CheckM output files ( checkm.profiles,checkm.qa_sh100)

mkdir /srv/db/gtdb/metadata/release<release#>/checkm/join_releases
gtdb_migration_tk join_checkm -f /srv/db/gtdb/metadata/release98/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release200/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release201/checkm/refseq/checkm.profiles.tsv /srv/db/gtdb/metadata/release202/checkm/refseq/checkm.profiles.tsv -o /srv/db/gtdb/metadata/release202/checkm/join_releases/checkm_profile_refseq.tsv -l /srv/db/gtdb/metadata/release202/checkm/join_releases/join_checkm_refseq.log

Add CheckM v2 information

This is a work in progress, CheckM 2 Completeness and Contamination have been processed for genomes in R214. We want to add them in the db Information can be found. under /srv/whitlam/projects1/gtdb/studies/checkm2/r214 For now , I have just copied the checkm2-gtdb_r214.tsv: cut -f1,2,3,4 checkm2-gtdb_r214.tsv > checkm2-metadata.tsv replace the 4 columns name in checkm2-metadata.tsv with: genome_id,checkm2_completeness,checkm2_contamination,checkm2_model I have remove the extension .fna from genome_id and change the model_used information to only Specific or General

Add new genomes to the DB

With the CheckM profile generated, new and updated genomes can be pushed to the database.

  1. Create a duplicate of the current database in PostgresQL: in watson
CREATE DATABASE <_database_name_> (ie database_name = gtdb_release76)
  1. Populate the new created database with the data from the previous version (ie gtdb_release75):
pg_dump -h <_hostname_> -p <_port_> -U <_user_> gtdb_release<old> -f ~/pg_dumps/gtdb_release<old>_duplicate.sql
psql -h <_hostname_> -p <_port_> -U <_user_> -d gtdb_release<new> -f ~/pg_dumps/gtdb_release<new>_duplicate.sql
  1. Run the commands
gtdb_migration_tk update_db 
--hostname <hostname> -u <user>-p <password> -d <database>
--checkm_profile /srv/db/gtdb/metadata/release<release#>/checkm/join_releases/checkm_profile_refseq.tsv
-g gtdb/genomes/ncbi/release<release#>/refseq/genome_dirs.tsv
--report_folder /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_log_20210826/
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_refseq_20210826.log
-f 2020-01-29 --repository refseq --cpus 20

gtdb_migration_tk update_db 
--hostname <hostname> -u <user>-p <password> -d <database> 
--checkm_profile /srv/db/gtdb/metadata/release<release#>/checkm/join_releases/checkm_profile_genbank.tsv
-g /srv/db/gtdb/genomes/ncbi/release<release#>/genbank/genome_dirs.tsv
--report_folder /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_log_20210829
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/update_db_genbank_20210829.log
-f 2020-01-29 -r genbank --cpus 20

TODO: The function _checkPathRecord needs to be reviewed. sometimes the path is not properly set during a path change

Realign updated NCBI genomes

When updating the NCBI genomes, genomes that have been updated (i.e. GCF_000000.1 -> GCF_000000.2) needs to be realigned because all aligned markers have been removed.

gtdb -r power realign_updated_genomes

Metadata

  1. import CheckM estimates; you will need to export the metadata for the new release first (gtdb metadata export):
gtdb_migration_tk update_checkm_db --hostname <hostname> -u <user> -p <password> -d <database> --checkm_profile checkm/join_releases/checkm_profile_genbank.tsv -q checkm/join_releases/checkm_qa_sh100_genbank.tsv -m metadata_r202.tsv -l logs/update_checkm_db_genbank.log

gtdb_migration_tk update_checkm_db --hostname <hostname> -u <user> -p <password> -d <database> --checkm_profile checkm/join_releases/checkm_profile_refseq.tsv -q checkm/join_releases/checkm_qa_sh100_refseq.tsv -m metadata_r202.tsv -l logs/update_checkm_db_refseq.log
  1. create tables with metadata for all genomes:
gtdb_migration_tk create_tables -g /srv/db/gtdb/genomes/ncbi/release214/genome_dirs.tsv -v 138.1 -o .

gtdb_migration_tk create_tables: creates the metadata_nt.tsv, metadata_gene.tsv, metadata_ssu_silva.tsv, metadata_lsu_silva.tsv,'metadata_lsu_5S.tsv, metadata_ssu_silva_count.tsv, metadata_lsu_silva_count.tsv, metadata_lsu_5S_count.tsv, and metadata_trna_count.tsv tables

  1. create NCBI specific metadata tables
gtdb_migration_tk parse_assemblies --rb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt --ra /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt --gb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt --ga /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt -m metadata_r#.tsv 
-o ncbi_assembly_summary.tsv

gtdb_migration_tk parse_ncbi_dir -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv -o ncbi_assembly_metadata.tsv

gtdb_migration_tk ncbi_strains --gb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt --ga /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt --rb /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt --ra /srv/db/gtdb/metadata/release<release#>/ncbi/taxonomy/assembly_summary_archaea_refseq.txt -o . -g /srv/db/gtdb/genomes/ncbi/release<release#>/genome_dirs.tsv 
-l /srv/db/gtdb/genomes/ncbi/release<release#>/logs/ncbi_strains.log
  1. import all metadata tables into the database:
UPDATE metadata_taxonomy SET ncbi_type_material_designation = NULL
gtdb_migration_tk update_metadata_db --hostname <hostname> -u <user> -p <password> -d gtdb_release202_temp -i /srv/db/gtdb/metadata/release<releasenumber>/metadata_tables/ --genome_list metadata_r202.tsv

Generate 7-rank NCBI taxonomy

NCBI taxonomy information should be placed in /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy:

Using the taxdump.tar.gz, assembly_summary_refseq.txt and assembly_summary_genbank.txt downloaded previously:

  1. Run the ncbi_taxonomy.py script to create summary files of the NCBI taxonomy file:
gtdb_migration_tk parse_ncbi_taxonomy -t 20190725/ --rb assembly_summary_bacteria_refseq.txt --ra assembly_summary_archaea_refseq.txt --gb assembly_summary_bacteria_genbank.txt --ga assembly_summary_archaea_genbank.txt -p ncbi_r95
  1. Add Information to the database:
gtdb_migration_tk update_ncbitax_db --hostname <hostname> -u <user> -p <password> -d <db> -o ncbi_r202_organism_names.tsv --genome_list /srv/home/uqpchaum/tmp_dir/migration_tk_pycharm_tmp/metadata_r202.tsv --filtered ncbi_r202_standardized.tsv --unfiltered ncbi_r202_unfiltered_taxonomy.tsv

Propagating GTDB taxonomy

The final GTDB taxonomy from the previous release should be propagated to new releases. This can be done as follows:

  1. Get final GTDB metadata from previous release:
    • Load previous version of database
    • gtdb metadata export --output gtdb_metadata_prev.csv
  2. Get initial metadata for current release:
    • Load current version of database
    • gtdb metadata export --output gtdb_metadata_cur.csv
  3. Propagate over previous taxonomy were possible:
gtdb_migration_tk propagate_gtdb_taxonomy --gtdb_metadata_prev metadata_r95.tsv --gtdb_metadata_cur metadata_r202_20201020.tsv --taxonomy_file gtdb_taxonomy_propagate.tsv --rep_file gtdb_rep_ids.txt --log propagate_taxonomy/propagate.log
python -m gtdb_migration_tk update_propagated_tax --hostname watson.ace.uq.edu.au -u gtdb -p ecogenomicsgtdb -d gtdb_pierre_r207 -t gtdb_taxonomy_propagate.tsv --rep_file gtdb_rep_ids.txt -l propagate_taxonomy/update_taxonomy.log --truncate_taxonomy --genome_list ../metadata_for_import/metadata_after_realign.tsv  -m ../metadata_for_import/metadata_after_realign.tsv

The last command marks GTDB representatives with modified NCBI Accession numbers in the database. This should be done before assigning new NCBI genomes to representatives.

Realign main marker set for new genomes

gtdb -t 50 tree create --ncbi_genomes --marker_set_ids 1,19 --no_tree --output /srv/home/uqpchaum/tmp_dir/gtdb_tree_temp_todel We need to compare compare_markers to make sure all markers are properly aligned

  • gtdb power compare_markers

Setting GTDB domain for NCBI genomes

New genomes from NCBI may not yet have a designated domain under the GTDB taxonomy. Setting the GTDB domain is critical since this field is often used for filtering (e.g., to create archaeal and bacterial specific trees). The following script propagates the domain designated within the NCBI taxonomy to the GTDB domain field:

  • gtdb_migration_tk set_gtdb_domain --hostname <hostname> -u <user> -p <password> -d <db> Check the log to see the conflict of domain
  • gtdb power taxonomy_check

Set genome categoryfield

Update the "ncbi_genome_category" field to indicate if a genome is marked as a MAG or SAG at NCBI:

 * python -m gtdb_migration_tk ncbi_genome_category --genbank_assembly_summary ncbi/taxonomy/assembly_summary_genbank.txt --refseq_assembly_summary ncbi/taxonomy/assembly_summary_refseq.txt -g /srv/db/gtdb/genomes/ncbi/release207/genome_dirs.tsv -o category_field/ncbi_genome_category.tsv -l category_field/ncbi_genome_category.log
 * gtdb_migration_tk python -m gtdb_migration_tk update_metadata_db --hostname <hostname> -u <user> -p <password> -d <db> --metadata_table category_field/ncbi_genome_category.tsv --metadata_table_desc /srv/db/gtdb/metadata/metadata_ncbi_genome_category.desc.tsv -l /srv/db/gtdb/genomes/ncbi/release207/logs/update_ncbi_genome_category.log

Get Seqcode information

Seqcode parsing

Parse LPSN, and NCBI

Select Strains

Add type material information (gtdb_type_designation, gtdb_type_designation_sources, lpsn_priority_year, gtdb_type_species_of_genus) to database:

gtdb_migration_tk update_metadata_db --hostname <server> -u gtdb -p <password> -d gtdb_pierre_r214 --metadata_table gtdb_type_strain_summary.tsv --metadata_table_desc /srv/db/gtdb/metadata/metadata_type_material.desc.tsv -l /srv/db/gtdb/genomes/ncbi/release214/logs/update_metadata_db_type_strain.log

Realign remaining marker sets for new genomes

gtdb -t 50 tree create --ncbi_genomes --marker_set_ids 11,12,13,19 --no_tree --output /srv/home/uqpchaum/tmp_dir/gtdb_tree_temp_todel We need to compare compare_markers to make sure all markers are properly aligned

  • gtdb power compare_markers

Marker sets 11, 12, 13 (these are the ribosomal sets) and marker set 19 (this is the new archaeal marker set with 53 genes). These marker sets will be used each release.

Add surveillance genomes

Add Surveillance Genomes