Create GTDB SSU tree

Pierre Chaumeil edited this page May 9, 2023 · 5 revisions

  1. gtdb metadata export --format csv --output gtdb_metadata_rXX_.csv
  2. gtdb genomes ssu_export --output gtdb_rXX_.fna
  3. genometreetk ssu_tree gtdb_metadata_rXX_.csv gtdb_rXX_.fna . -c 24 --min_scaffold_length 5000
  4. genometreetk outgroup ...
  5. phylorank decorate ...

Genomes and the scaffolds containing the 16S rRNA genes are filtered to avoid erroneous genes (i.e. 16S rRNA genes that are contamination within the genome). Contaminating 16S rRNA genes are preferentially found in genomes with low estimated quality or poor assembly statistics, and on short scaffolds. A BLAST-based filtering step is also used to remove 16S rRNA genes that appear to be incongruent with the taxonomic assignment of the genome (see the GTDB manuscript for details).
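
The scaffold-length part of this filtering can be illustrated with a minimal Python sketch. It is not the genometreetk implementation: the `SsuHit` record, its field names, the genome IDs, and the choice of an inclusive threshold are all assumptions made for illustration; only the 5000 bp value comes from the `--min_scaffold_length 5000` option used above.

```python
from dataclasses import dataclass


@dataclass
class SsuHit:
    """Hypothetical record for one 16S rRNA gene found in a genome."""
    genome_id: str
    scaffold_length: int  # length in bp of the scaffold carrying the gene


def filter_by_scaffold_length(hits, min_scaffold_length=5000):
    """Keep only hits located on scaffolds of at least the given length.

    Whether genometreetk treats the threshold as inclusive is an assumption.
    """
    return [h for h in hits if h.scaffold_length >= min_scaffold_length]


# Hypothetical example data
hits = [
    SsuHit("genome_A", 120000),
    SsuHit("genome_B", 1800),   # short scaffold, more likely contamination
    SsuHit("genome_C", 5000),
]
kept = filter_by_scaffold_length(hits)
print([h.genome_id for h in kept])
```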

Using SINA

  1. get the list of reps bac120_reps.lst and the ssu file generated for the website (ssu_all_.fna)
  2. run convert_sequence.py from scripts_dev/ssu_tree/convert_sequence.py to get the list of 16S sequences longer than 700 nt for reps
  3. run SINA on this list of sequences.
    sina -i reps_ssu_non_aligned_gt700.fna -o reps_ssu_aligned_4frames_NR99_gt700.fasta --db ../../SILVA_138.1_SSURef_NR99_12_06_20_opt.arb -t all
  4. run convert_arb_file to recreate the arb metadata file and replace the genome sequences with their aligned SSU sequences (50K alignment columns long)
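
The selection in step 2 can be sketched in Python as follows. This is not the code of convert_sequence.py, which is not shown here; the function names are hypothetical, and the sketch assumes that the first whitespace-delimited token of each FASTA header is the genome ID, that the representative list holds one ID per line, and that "longer than 700" is a strict comparison.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_rep_ssu(fasta_lines, rep_ids, min_length=700):
    """Return {genome_id: sequence} for representatives with long 16S genes.

    Assumes the genome ID is the first token of the header. If a genome
    has several sequences, the last one read wins (a simplification).
    """
    kept = {}
    for header, seq in read_fasta(fasta_lines):
        genome_id = header.split()[0]
        if genome_id in rep_ids and len(seq) > min_length:
            kept[genome_id] = seq
    return kept


# Hypothetical example data
fasta = [
    ">genome_A some description",
    "A" * 400,
    "C" * 400,
    ">genome_B",
    "G" * 650,
    ">genome_C",
    "T" * 900,
]
reps = {"genome_A", "genome_B"}
result = filter_rep_ssu(fasta, reps)
print(sorted(result))
```

With real files, the same functions could be called on `open("ssu_all_.fna")` and on a set built from the lines of bac120_reps.lst, and the result written out as reps_ssu_non_aligned_gt700.fna.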