Genome Taxonomy Database (GTDB) for prokaryotes #202

Open
pragermh opened this issue Nov 3, 2020 · 21 comments
Labels
source dataset request: Requests to change or add a new source dataset to the CoL
xcol: eXtended COL checklist

@pragermh

pragermh commented Nov 3, 2020

Dataset title
Genome Taxonomy Database (GTDB)

Dataset contact & access
https://gtdb.ecogenomic.org/
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/ar122_taxonomy.tsv
https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/bac120_taxonomy.tsv
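For orientation, here is a minimal Python sketch of reading one row of these taxonomy files. It assumes each row holds a genome accession, a tab, and a semicolon-separated lineage with rank prefixes (`d__`, `p__`, ..., `s__`). The accession `GENOME_0001` and the function name are made up for illustration; the lineage is the GTDB placement of Binatus soli quoted further down in this thread.

```python
# Assumed rank prefixes used in the lineage strings.
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}

def parse_taxonomy_line(line):
    """Split one TSV row into (accession, {rank: name})."""
    accession, lineage = line.rstrip("\n").split("\t", 1)
    ranks = {}
    for part in lineage.split(";"):
        prefix, _, name = part.strip().partition("__")
        if prefix in RANK_PREFIXES and name:
            ranks[RANK_PREFIXES[prefix]] = name
    return accession, ranks

# Hypothetical accession; lineage taken from the Binatus example in this issue.
row = ("GENOME_0001\td__Bacteria;p__Desulfobacterota_B;c__Binatia;"
       "o__Binatales;f__Binataceae;g__Binatus;s__Binatus soli")
accession, ranks = parse_taxonomy_line(row)
print(accession, ranks["phylum"], ranks["species"])
```

Empty ranks (a prefix with no name after it) are simply skipped, which seems a reasonable default but is a choice of this sketch.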

Taxonomic group & CoL sector
Prokaryotes: Archaea & Bacteria

Dataset description
The Genome Taxonomy Database (GTDB) is a standardised microbial taxonomy based on genome phylogeny, primarily funded by the Australian Research Council. GTDB currently includes ca. 30,000 prokaryote species clusters based on 195,000 genomes from isolates, metagenomes and single cells from RefSeq and GenBank.
References:
Parks, D.H., et al. (2020). "A complete domain-to-species taxonomy for Bacteria and Archaea." Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.
Parks, D.H., et al. (2018). "A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life." Nature Biotechnology, 36: 996-1004, https://doi.org/10.1038/nbt.4229.

Motivation for changing/adding
GTDB is a well-defined and fast-growing taxonomy of prokaryotes: between release 89 (Aug 2019) and release 95 (Jul 2020), the numbers of included genomes and species clusters both increased by ca. 30%, while 99.77% of existing genomes were still assigned to the same species clusters. When publishing a dataset of ca. 3000 Amplicon Sequence Variants (ASVs) of Baltic Sea microbes to the Swedish GBIF node (SBDI), we furthermore found that merging GTDB into the GBIF taxonomy backbone substantially increased the taxonomic resolution for our occurrences: the share of ASVs identified at genus and family level increased from 32 to 62% and from 55 to 77%, respectively. Since the database is based on draft genomes rather than a single taxonomic marker gene, it offers a lot of flexibility in usage. For example, shotgun metagenome data can be annotated as well as 16S rRNA gene data.

Suggested by
Anders Andersson (KTH), Daniel Lundin (LnU) and Maria Prager (SU/KI), all associated with the Swedish Biodiversity Data Infrastructure (SBDI).

@pragermh added the "source dataset request" label Nov 3, 2020
@mdoering
Member

About 2/3 of GTDB names are non-Linnean, like the phyla Desulfobacterota_B and UBA3054 or the species 01-FULL-45-15b sp001822655. This occurs at all ranks:

[Screenshot, 2020-11-18: image not preserved]

I wonder how many of these names are stable enough over time to be of any relevance for users.
This also challenges name parsing.

@mdoering
Member

@thomasstjerne do you know how these non-Linnean names are generated?
Are they stable across releases?

@pragermh
Author

There is some info on placeholder names and stability here.

@mdoering
Member

For COL to include GTDB, the name parser would need to detect OTU names reliably.
It's more difficult than for BOLD or UNITE identifiers, but seems doable.
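A rough Python sketch of what such detection could look like, using only the example names quoted above. The two patterns are heuristics invented for illustration; they are not GTDB's naming rules and not the actual COL name parser logic, and they would certainly need tuning against real data.

```python
import re

# Heuristic patterns only (assumptions of this sketch).
ALPHA_SUFFIX = re.compile(r"_[A-Z]+$")  # e.g. Desulfobacterota_B
HAS_DIGIT = re.compile(r"\d")           # e.g. UBA3054, 01-FULL-45-15b, sp001822655

def looks_like_placeholder(name):
    """Return True if any whitespace-separated token of the name looks non-Linnean."""
    return any(
        ALPHA_SUFFIX.search(token) or HAS_DIGIT.search(token)
        for token in name.split()
    )

for name in ["Desulfobacterota_B", "UBA3054",
             "01-FULL-45-15b sp001822655", "Binatus soli"]:
    print(name, looks_like_placeholder(name))
```

Checking every token rather than only the first one means a placeholder epithet under a validly named genus is flagged too.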

@mdoering
Member

thanks for the FAQ links, @pragermh !

@mdoering
Member

COL is considering using the List of Prokaryotic names with Standing in Nomenclature (LPSN). It is a different taxonomy, but would at least be an up-to-date one.

@thomasstjerne

thomasstjerne commented Mar 11, 2022

LPSN is probably mostly (only?) suitable for formally described taxa (= a cultured type specimen exists). The major part of the prokaryotic data in GBIF originates from metabarcoding studies based on the 16S region. These data types will have taxonomy assigned through some classification pipeline using a reference database such as GTDB or SILVA.
When assessing the taxonomic diversity in these types of studies, it is important to understand the full diversity, not only the (small) fraction that has been formally described.
I fear that using LPSN for GBIF indexing would result in a coarser taxonomic assignment for large parts of the prokaryotic data.

GTDB states that

LPSN is used as the primary nomenclatural reference for establishing naming priorities and nomenclature types.

So LPSN probably more or less makes up the subset of formally described taxa in GTDB.

@mdoering
Member

@dhobern @olafbanki @yroskov for your attention

@dhobern

dhobern commented Mar 11, 2022

Can't we use the football LPSN for the core COL content and the rest of GTDB to extend it?

@thomasstjerne

I believe LPSN is nomenclature, whereas GTDB is a full taxonomy (phylogeny based on assembled genomes). GTDB might therefore reorganise the classification of names in LPSN quite heavily in some cases, so there could be quite a few conflicts.

@dhobern

dhobern commented Mar 12, 2022

LPSN has been reorganised to reflect consensus classification in a way that matches what we need for a full COL/GBIF species list based on published names. I am sure there will be some mismatches with molecular phylogeny, but surely that is no different from all other parts of the list.

@mdoering
Member

Indeed that seems like a plausible and consistent way forward to me. Not much different to the situation with UNITE and BOLD really.

@pragermh
Author

I am neither a taxonomist nor a prokaryote expert myself, but perhaps @erikrikarddaniel or @andand has something to add?

@thomasstjerne

Indeed that seems like a plausible and consistent way forward to me. Not much different to the situation with UNITE and BOLD really.

GTDB is quite different from BOLD or UNITE. The latter two use a short fragment (COI and ITS, respectively) to cluster sequences into "species-like" taxa (BINs, SHs). These BINs or SHs are then placed into a consensus classification that in most cases will be much like COL / GBIF.
GTDB produces the full classification (phylogeny) and may sometimes (often?) deviate from the "consensus" classification, even at high levels such as phyla.

Here is an example:
The species Binatus soli in GTDB, GBIF and LPSN

Classification in GTDB: Bacteria > Desulfobacterota_B > Binatia > Binatales > Binataceae > Binatus > Binatus soli
Classification in LPSN: Bacteria > Binatota > Binatia > Binatales > Binataceae > Binatus > Binatus soli

The phylum Binatota is not present in the latest two versions of GTDB (history).
By extending LPSN with GTDB, I guess we would get both phyla, Desulfobacterota_B and Binatota, and the species Binatus soli would be placed in the phylum Binatota.
GTDB has 4 species in the genus Binatus; we should avoid these ending up in two homonymous genera in two phyla.
Would we then move the 3 species of Binatus not known to LPSN into the phylum Binatota (along with Binatus soli)?
And what about sibling genera of Binatus?

It might be safe to conclude that the GTDB phylum Desulfobacterota_B could simply be considered a synonym of the LPSN phylum Binatota, but I imagine there could be many splits and merges, giving pro parte synonyms that would be less straightforward to deal with.

@mdoering
Member

The phylum Binatota is not present in the latest two versions of GTDB (history).
By extending LPSN with GTDB, I guess we would get both phyla, Desulfobacterota_B and Binatota, and the species Binatus soli would be placed in the phylum Binatota.
GTDB has 4 species in the genus Binatus; we should avoid these ending up in two homonymous genera in two phyla.
Would we then move the 3 species of Binatus not known to LPSN into the phylum Binatota (along with Binatus soli)?
And what about sibling genera of Binatus?

Yes, the merging procedure we are considering will prevent genera from being split across different classifications, even if these are far apart. By extending LPSN we would therefore use the Binatota placement for all Binatus species coming from GTDB. This is obviously problematic, but that problem applies to all the "extended" sources. I suppose the difference in Bacteria taxonomy is just much larger than anywhere else in the tree, so it has a bigger impact and is more visible.

@thomasstjerne

And what about sibling genera of Binatus?

Will they move to the other phylum along with Binatus?

@mdoering
Member

Not if we do it as in the GBIF builds. But it might be a good idea if we can work out how to do this.
The current thinking does not touch the higher classification, at least not above family level. So if a source has a not yet placed genus whose classification is also not represented at all, the genus will go into incertae sedis. If the kingdom is known, it goes under that kingdom, or wherever the classification snaps in. In the Binatus example the family Binataceae is the same, so all siblings would also be placed there.
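To make the snapping idea concrete, here is a small Python sketch of one possible reading of it: attach an unplaced source taxon to the lowest of its source ancestors that already exists in the base classification, and fall back to incertae sedis if none does. This is an illustration under that assumption, not the actual GBIF or COL merge code; the function names and the parent map are made up, with the LPSN placement taken from the Binatus example above.

```python
# Base (LPSN-like) classification as child -> parent, from the Binatus example.
BASE_PARENT = {
    "Binatus": "Binataceae",
    "Binataceae": "Binatales",
    "Binatales": "Binatia",
    "Binatia": "Binatota",
    "Binatota": "Bacteria",
    "Bacteria": None,
}

def snap_parent(source_ancestors, base_parent):
    """Return the lowest source ancestor known to the base, else 'incertae sedis'.

    source_ancestors is ordered from highest to lowest rank and excludes
    the taxon being placed.
    """
    for ancestor in reversed(source_ancestors):
        if ancestor in base_parent:
            return ancestor
    return "incertae sedis"

def base_lineage(name, base_parent):
    """Walk up the base classification and return the path from root to name."""
    path = []
    while name is not None:
        path.append(name)
        name = base_parent[name]
    return list(reversed(path))

# GTDB ancestors of a sibling genus of Binatus (same family in both sources).
gtdb_ancestors = ["Bacteria", "Desulfobacterota_B", "Binatia", "Binatales", "Binataceae"]
parent = snap_parent(gtdb_ancestors, BASE_PARENT)
print(parent, base_lineage(parent, BASE_PARENT))
```

Under this reading the sibling genus lands in Binataceae and thereby inherits the Binatota phylum from the base, while Desulfobacterota_B never enters the merged tree.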

@mdoering
Member

Here are some metrics from the latest data we have in CLB:
GTDB: https://www.checklistbank.org/dataset/2214/imports/52
LPSN: https://www.checklistbank.org/dataset/2015/imports

@mdoering
Member

mdoering commented Feb 29, 2024

I have created a first version of a ColDP LPSN dataset using their API here: https://www.dev.checklistbank.org/dataset/284997

The main problem is that I cannot find a way to access the classification they show on their site (parent link on top). I have contacted them and asked how to do that, let's see.

Also, they made LPSN CC BY SA!

@mdoering
Member

I have opened a dedicated issue for adding LPSN: #632

@mdoering added the "xcol" label and removed the "Taxonomy Group" label Mar 18, 2024
@mdoering
Member

@DianRHR @camiplata I have discussed with @thomasstjerne and @tobiasgf how to best integrate GTDB into the XCOL.
We want the classic LPSN in the base release of COL, but need to add GTDB OTU names on top of it to integrate with eDNA data. Our suggestion would be to add GTDB at genus level and below, maybe also including families, but nothing higher up. Could you look into any issues with that, please?
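As a sketch of what that cut-off could mean for a single GTDB lineage, the Python snippet below keeps only the ranks from family downwards. The rank names, the `KEEP_RANKS` tuple and the function are illustrative assumptions; whether families are included is exactly the open question above.

```python
# Assumption: families are included; drop "family" here to keep genus and below only.
KEEP_RANKS = ("family", "genus", "species")

def trim_lineage(ranks, keep=KEEP_RANKS):
    """Keep only the ranks that would be taken over from GTDB."""
    return {rank: ranks[rank] for rank in keep if rank in ranks}

# GTDB placement of Binatus soli as quoted earlier in this thread.
gtdb_ranks = {
    "domain": "Bacteria",
    "phylum": "Desulfobacterota_B",
    "class": "Binatia",
    "order": "Binatales",
    "family": "Binataceae",
    "genus": "Binatus",
    "species": "Binatus soli",
}
print(trim_lineage(gtdb_ranks))
```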
