Create SILVA SSU mapping file

Donovan Parks edited this page Nov 5, 2018 · 4 revisions

A mapping file is required in order to link the GTDB to results at SILVA. This is done through a TSV file parsed by the folks at SILVA.

This file is generated as follows:

  1. Identify 16S rRNA genes in all GTDB genomes (not just the dereplicated set)
  2. Filter out 16S rRNA genes that are <1200 bp (<900 bp if archaeal), are on a contig <10 kb, have a length >2 kb, or have 10 or more ambiguous bases. I also filter out genomes that have a quality <50 or consist of >500 contigs. Overall, this removes ~50% of the ~250,000 16S rRNA genes identified.
  3. BLAST the remaining 16S genes against SILVA's Ref database.
  4. Take hits with 99% identity and 99% alignment length over the shorter of the query and subject genes. This high stringency is needed to ensure correct species assignments. Relaxing this doesn't result in appreciably more assignments.
  5. Filter out hits from the ~500 genomes marked as contaminated by ContEst16S (this list is taken directly from the EzBioCloud website), or that fail the IDTAXA tests I have developed. This filtering reduces the number of SILVA 16S rRNA genes with a GTDB taxonomy assignment by <100.
  6. If a given SILVA gene has multiple hits with incongruent GTDB assignments, do a majority vote to determine the GTDB taxonomy string. This occurs in 1,047 of 22,240 cases. These are mostly cases where one hit indicates a specific GTDB species, but another hit has no species assignment (i.e., s__). There are all sorts of reasons this situation could occur, so I think a majority vote is the safest approach.
  • validate
  • add URL

The final mapping file should be placed in /srv/home/ftp/public/gtdb/releaseXX/silva_mapping_rXX.tsv and the link to the latest SILVA mapping file updated in /srv/home/ftp/public/gtdb/. This linked file is used by the SILVA team, i.e.: