Sync with RefSeq GenBank

Pierre Chaumeil edited this page Jul 29, 2022 · 77 revisions

NCBI currently updates RefSeq/GenBank in odd-numbered months. We aim to update the GTDB every 3 releases. rsync is used to mirror the genome assemblies from NCBI's FTP site.

Download Latest NCBI taxonomy

NCBI taxonomy information should be downloaded on the same day that we sync with NCBI.

  1. Create the new NCBI taxonomy metadata directory:
mkdir -p /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy
  2. Download and extract the latest NCBI taxonomy database:
cd /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy
rsync ftp.ncbi.nih.gov::pub/taxonomy/taxdump.tar.gz .
mv taxdump.tar.gz taxdump_<date>.tar.gz  # e.g. taxdump_20220718.tar.gz
mkdir taxdump_<date>
tar xvzf taxdump_<date>.tar.gz -C taxdump_<date>
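
The date stamp can be filled in automatically. The following is a minimal sketch, assuming the archive has already been downloaded into the current directory. The helper name stamp_and_extract is hypothetical and not part of any GTDB tooling.

```shell
# Rename a downloaded taxdump archive with a date stamp and extract it
# into a directory of the same name.
# Hypothetical helper for illustration; not part of gtdb_migration_tk.
stamp_and_extract() {
    archive=$1                      # e.g. taxdump.tar.gz
    stamp=${2:-$(date +%Y%m%d)}     # defaults to today's date, e.g. 20220718
    mv "$archive" "taxdump_${stamp}.tar.gz" &&
    mkdir -p "taxdump_${stamp}" &&
    tar xzf "taxdump_${stamp}.tar.gz" -C "taxdump_${stamp}"
}
```

Calling stamp_and_extract taxdump.tar.gz right after the rsync step would then replace the mv, mkdir and tar commands above.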

Download the latest RefSeq and GenBank assembly data

  1. In /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy, download the four assembly summary files and concatenate them per database:
rsync ftp.ncbi.nlm.nih.gov::genomes/refseq/archaea/assembly_summary.txt assembly_summary_archaea_refseq.txt
rsync ftp.ncbi.nlm.nih.gov::genomes/refseq/bacteria/assembly_summary.txt assembly_summary_bacteria_refseq.txt
rsync ftp.ncbi.nlm.nih.gov::genomes/genbank/archaea/assembly_summary.txt assembly_summary_archaea_genbank.txt
rsync ftp.ncbi.nlm.nih.gov::genomes/genbank/bacteria/assembly_summary.txt assembly_summary_bacteria_genbank.txt
cat assembly_summary_archaea_genbank.txt assembly_summary_bacteria_genbank.txt > assembly_summary_genbank.txt
cat assembly_summary_archaea_refseq.txt assembly_summary_bacteria_refseq.txt > assembly_summary_refseq.txt

  2. Remove all genomes associated with a 'large multi-isolate project':
grep -v 'large multi-isolate project' assembly_summary_archaea_genbank.txt > assembly_summary_archaea_genbank_nolargeproject.txt
grep -v 'large multi-isolate project' assembly_summary_bacteria_genbank.txt > assembly_summary_bacteria_genbank_nolargeproject.txt
grep -v 'large multi-isolate project' assembly_summary_bacteria_refseq.txt > assembly_summary_bacteria_refseq_nolargeproject.txt
grep -v 'large multi-isolate project' assembly_summary_archaea_refseq.txt > assembly_summary_archaea_refseq_nolargeproject.txt

Then remove the first two lines (the '#' header lines) of each *_nolargeproject.txt file:

sed -i '1,2d' *_nolargeproject.txt

  3. For each filtered assembly summary file, extract the 20th column (the FTP path of each assembly):
export VERSION=<release_number>  # e.g. 213
cut -f20 assembly_summary_bacteria_refseq_nolargeproject.txt | grep 'ftp' > bac120_refseq_r$VERSION.lst
cut -f20 assembly_summary_archaea_refseq_nolargeproject.txt | grep 'ftp' > ar53_refseq_r$VERSION.lst
cut -f20 assembly_summary_bacteria_genbank_nolargeproject.txt | grep 'ftp' > bac120_genbank_r$VERSION.lst
cut -f20 assembly_summary_archaea_genbank_nolargeproject.txt | grep 'ftp' > ar53_genbank_r$VERSION.lst
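
The filtering and extraction steps above can be wrapped in one function so that each summary file is handled the same way. This is a sketch only: the helper name make_ftp_list is hypothetical, and instead of deleting the first two lines with sed it drops every line starting with '#', which is a swapped-in alternative rather than the procedure described above.

```shell
# Filter one assembly summary file and write its FTP paths to a list file.
# Hypothetical helper for illustration; not part of gtdb_migration_tk.
make_ftp_list() {
    summary=$1    # e.g. assembly_summary_bacteria_refseq.txt
    out=$2        # e.g. bac120_refseq_r$VERSION.lst
    grep -v 'large multi-isolate project' "$summary" |
        grep -v '^#' |      # drop header lines (alternative to sed '1,2d')
        cut -f20 |          # column 20 holds the FTP path
        grep 'ftp' > "$out" # skip rows without a path (e.g. "na")
}
```

For example, make_ftp_list assembly_summary_archaea_genbank.txt ar53_genbank_r$VERSION.lst would produce the same kind of list as the cut command above, without the intermediate *_nolargeproject.txt file.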

Download LPSN data.

Select Strains

Generate 7-rank NCBI taxonomy

NCBI taxonomy information should be placed in /srv/db/gtdb/metadata/<release#>/ncbi/taxonomy.

The following step uses the taxdump.tar.gz, assembly_summary_refseq.txt and assembly_summary_genbank.txt files downloaded previously:

  1. Run gtdb_migration_tk parse_ncbi_taxonomy to create summary files of the NCBI taxonomy:
mkdir standardised_taxonomy
cd standardised_taxonomy
gtdb_migration_tk parse_ncbi_taxonomy -t ../taxdump_<date>/ --rb ../assembly_summary_bacteria_refseq.txt --ra ../assembly_summary_archaea_refseq.txt --gb ../assembly_summary_bacteria_genbank.txt --ga ../assembly_summary_archaea_genbank.txt -p ncbi_r<release_number>

Update FTP directory

  1. Remove deprecated genomes from the FTP directory. Because we run the rsync step individually for each genome present in the assembly summary file, we don't know which genomes have been removed. We need to write a script that compares the genome_dirs.tsv file from the previous release with the assembly summary files / *_r$VERSION.lst to determine which genomes to remove before starting rsync.
gtdb_migration_tk clean_ftp --new_list_genomes assembly_summary_archaea_genbank_nolargeproject.txt,assembly_summary_archaea_refseq_nolargeproject.txt,assembly_summary_bacteria_genbank_nolargeproject.txt,assembly_summary_bacteria_refseq_nolargeproject.txt --ftp_genome_dir_file /srv/db/gtdb/metadata/release207/ncbi/genome_dirs_ftp.tsv --report_dir report_clean_ftp/ --taxonomy_file standardised_taxonomy/ncbi_r213_standardized.tsv
  2. Run the rsync command for each genome list (*.lst); the bacterial GenBank list is shown here:
cd /srv/db/ncbi/new_ftp_structure/
mkdir r213_logs
cat /srv/db/gtdb/metadata/release<release_number>/ncbi/taxonomy/bac120_genbank_r<release_number>.lst |parallel --eta -j20 /srv/db/ncbi/new_ftp_structure/rsync_data.sh '<(' echo {} ')' '&&' echo finished {} '>>' bac120_gbk.log
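
The comparison described above for finding deprecated genomes (previous-release genome_dirs.tsv against the new *.lst files) could look like the sketch below. It assumes, without confirmation from this page, that the first tab-separated column of genome_dirs.tsv holds the assembly accession (e.g. GCF_000026325.1) and that each line of a .lst file is an FTP path ending in <accession>_<assembly name>. The helper name deprecated_genomes is hypothetical, and gtdb_migration_tk clean_ftp may already cover this case.

```shell
# Print accessions listed in a previous genome_dirs.tsv that no longer
# appear in any of the new FTP path lists.
# Hypothetical helper; the file layouts are assumptions (see text).
deprecated_genomes() {
    old_dirs=$1    # genome_dirs.tsv from the previous release
    shift          # remaining arguments: one or more *.lst files
    new_acc=$(mktemp)
    # Reduce each FTP path to its accession:
    # .../GCF_000026325.1_ASM2632v1 -> GCF_000026325.1
    cat "$@" | sed -e 's#/*$##' -e 's#.*/##' | cut -d_ -f1,2 |
        sort -u > "$new_acc"
    # Keep accessions that are only present in the old list
    cut -f1 "$old_dirs" | sort -u | comm -23 - "$new_acc"
    rm -f "$new_acc"
}
```

Running deprecated_genomes on the previous genome_dirs.tsv and the four *_r$VERSION.lst files gives a candidate removal list that can be reviewed before rsync is started.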

Copy from NCBI Folder to GTDB Folder

Create the new <release#> folder in /srv/db/gtdb/genomes/ncbi/:

mkdir -p /srv/db/gtdb/genomes/ncbi/<release#>/refseq/
mkdir -p /srv/db/gtdb/genomes/ncbi/<release#>/genbank/

Before copying, list all records in GenBank and RefSeq FTP folders:

gtdb_migration_tk list_genomes -g /srv/db/ncbi/new_ftp_structure/genomes/all/ -o /srv/db/gtdb/metadata/<release#>/ncbi/genome_dirs_ftp.tsv

Update RefSeq

gtdb_migration_tk update_refseq --cpus 20 --ftp_refseq_directory /srv/db/ncbi/new_ftp_structure/genomes/all/ --new_refseq_directory /srv/db/gtdb/genomes/ncbi/release213/refseq/ --ftp_genome_dirs_file /srv/db/gtdb/metadata/release213/ncbi/genome_dirs_ftp.tsv --old_genome_dirs_file /srv/db/gtdb/genomes/ncbi/release207/refseq/genome_dirs.tsv --arc_assembly_summary /srv/db/gtdb/metadata/release213/ncbi/taxonomy/assembly_summary_archaea_refseq.txt --bac_assembly_summary /srv/db/gtdb/metadata/release213/ncbi/taxonomy/assembly_summary_bacteria_refseq.txt > /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/update_refseq_from_ftp_files.log


Optional step: manually curate conflicting genomes.
Rarely, NCBI lists conflicting versions of the same assembly (e.g. GCF_000026325.1_ASM2632v1 and GCF_000026325.1_ASM2632v2). In that case the log has to be updated manually. To find the conflicting assemblies:

grep 'to_curate' report_gcf.log

Go to the genome directory, clean it (remove the duplicate assembly report, copy the proper version, etc.) and update the status (unmodified, modified, ...) in report_gcf.log.


List all records in the new RefSeq folder:

gtdb_migration_tk list_genomes -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq -o /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv

Update GenBank folder

gtdb_migration_tk update_genbank \
--ftp_genbank_directory /srv/db/ncbi/new_ftp_structure/genomes/all/ \
--new_genbank_directory /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/ \
--new_ftp_genbank_dirs_file /srv/db/gtdb/metadata/release<current_release#>/ncbi/genome_dirs_ftp.tsv \
--old_genbank_genome_dirs_file /srv/db/gtdb/genomes/ncbi/release<old_release#>/genbank/genome_dirs.tsv \
--arc_assembly_summary /srv/db/gtdb/metadata/release<current_release#>/ncbi/taxonomy/assembly_summary_archaea_genbank.txt \
--bac_assembly_summary /srv/db/gtdb/metadata/release<current_release#>/ncbi/taxonomy/assembly_summary_bacteria_genbank.txt \
--cpus 30 \
--new_refseq_genome_dirs_file /srv/db/gtdb/genomes/ncbi/release<current_release#>/refseq/genome_dirs.tsv

List all records in the new GenBank folder:

gtdb_migration_tk list_genomes -g /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/ -o /srv/db/gtdb/genomes/ncbi/release<current_release#>/genbank/genome_dirs.tsv

For the next steps of the update, go to the Update database metadata page.

====================================================================================================================

How update_refseq_from_ftp.py works (TO REVIEW):

  • For each domain:
    • List the RefSeq records present in the FTP folder (they have to be qualified as latest)
    • List the RefSeq records present in the previous GTDB folder
    • If a genome is present in the FTP list but not in the old GTDB list:
      • Add the genome folder to the new gtdb folder
    • If the new genome is actually a new version of an existing genome:
      • Replace the old one with the new one
    • If a genome is not present in the FTP folder:
      • Delete the genome from GTDB
      • Update the lists that contain the deleted genomes
    • If the genome is present in both the FTP folder and the old GTDB folder:
      • Compare the checksum files of the two folders
      • If the checksums of the genomic.fna.gz and/or protein.faa.gz files differ:
        • Copy the FTP folder to the new GTDB folder
        • Unzip all gz files in the new GTDB folder
        • Update the sha sizes in the GTDB
      • If the checksums of the genomic.fna.gz and protein.faa.gz files are the same:
        • Copy the old GTDB folder to the new GTDB folder
        • Compare the GenBank files between the GTDB folder and the FTP folder:
          • If there is a change:
            • Copy the GenBank files from FTP that differ from those in the GTDB folder
            • Copy the checksum files from FTP
        • Compare the report files between the GTDB folder and the FTP folder:
          • If there is a change:
            • Copy the report files from FTP that differ from those in the GTDB folder
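
The top-level decision in the outline above can be sketched as a small shell function. This is an illustration under stated assumptions, not the code of update_refseq_from_ftp.py: it assumes both copies of a genome directory carry NCBI's md5checksums.txt file, it covers only the add / delete / changed / unchanged cases (not the new-version replacement or the GenBank and report file comparisons), and the function names and the printed labels are hypothetical.

```shell
# Print the md5 lines of the genome and protein FASTA files of one genome
# directory. Derived files such as *_cds_from_genomic.fna.gz are excluded.
seq_checksums() {
    grep -E '_(genomic\.fna|protein\.faa)\.gz$' "$1/md5checksums.txt" 2>/dev/null |
        grep -v '_from_genomic' | sort
}

# Decide what to do with one genome, given its directory in the previous
# GTDB release and its directory in the FTP mirror.
# Prints one of: add, delete, copy_ftp, copy_old (hypothetical labels).
classify_genome() {
    old_dir=$1
    ftp_dir=$2
    if [ ! -d "$ftp_dir" ]; then
        echo delete        # no longer on the FTP mirror
    elif [ ! -d "$old_dir" ]; then
        echo add           # new on the FTP mirror
    else
        old_sums=$(seq_checksums "$old_dir")
        ftp_sums=$(seq_checksums "$ftp_dir")
        if [ -n "$old_sums" ] && [ "$old_sums" = "$ftp_sums" ]; then
            echo copy_old  # sequences unchanged: reuse the old GTDB folder
        else
            echo copy_ftp  # sequences changed or checksums unavailable
        fi
    fi
}
```

When the checksum file is missing or empty on the old side, the sketch falls back to copy_ftp, on the view that re-copying from the mirror is the safer choice.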