Import new sources of extinct/extant information #206
Comments
Can we incorporate trees from the Fossil Calibrations database? These are few but high-quality. See a data API for details.
I suspect we can get a PaleoDB dump with extinctness information from GBIF.
More examples here: OpenTreeOfLife/feedback#288
I wonder if OpenTree should flag as fossil every IRMNG-only species that has no specific extinct/extant information (the IRMNG "sp. extant flag" is either blank or in square brackets) and whose genus is marked as EF (extant and fossil). To take an example from the previously mentioned Nautilus case (OpenTreeOfLife/feedback#288), OpenTree should be able to see that ott2867897/Nautilus-compressus is IRMNG-only (although it also has a GBIF entry via IRMNG). This species has no extant/extinct information (its "sp. extant flag" column is blank), but the genus is marked as E+F, so we know the genus contains some fossils, and this is probably one of them. Extant species such as Nautilus repertus would not be flagged, because they are explicitly marked as Extant in the "sp. extant flag" column, which is presumably accessible in the IRMNG dump file.
If you are worried about this approach marking too many species as fossil (a valid worry), then I suggest that fossil status be overridden if the species is found in NCBI (or perhaps only if NCBI has genetic data for it). In other words, OpenTree can be liberal with the fossil flag if it then uses the availability of sequence data in NCBI as a last-pass, absolute indicator of extant status (with the minor exception of the species here).
The advantage of this approach is that the number of species with some sequence in NCBI is likely to increase rapidly, so it should become a more and more comprehensive indicator of extantness, unlike e.g. IRMNG status, which I suspect will be difficult to keep correcting.
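The rule proposed in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not OpenTree's actual code: the record type `IrmngSpecies`, its field names, and the literal codes `"E"` and `"EF"` are hypothetical stand-ins for whatever the IRMNG dump really contains, and the example field values are made up for the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IrmngSpecies:
    """Hypothetical record for one species row; field names are assumptions."""
    name: str
    species_extant_flag: str  # raw "sp. extant flag" value, "" when blank
    genus_status: str         # assumed genus-level code, e.g. "E", "F" or "EF"
    irmng_only: bool          # True if IRMNG is the only independent source


def has_explicit_status(flag: str) -> bool:
    """A blank or square-bracketed flag counts as 'no specific information'."""
    flag = flag.strip()
    return bool(flag) and not flag.startswith("[")


def flag_as_fossil(sp: IrmngSpecies, in_ncbi: bool) -> bool:
    """Return True if the heuristic would flag this species as fossil."""
    if in_ncbi:
        # Last-pass override: presence in NCBI is treated as extant.
        return False
    if not sp.irmng_only:
        return False
    if has_explicit_status(sp.species_extant_flag):
        return False
    # Only genera marked as both extant and fossil trigger the flag.
    return sp.genus_status == "EF"


# Illustrative values only; the real dump rows may differ.
compressus = IrmngSpecies("Nautilus compressus", "", "EF", True)
repertus = IrmngSpecies("Nautilus repertus", "E", "EF", True)
candidates = [sp.name for sp in (compressus, repertus)
              if flag_as_fossil(sp, in_ncbi=False)]
```

Keeping the NCBI check first makes the override absolute, as suggested above; the known extinct species in NCBI would still need a separate exception list.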
Note that using NCBI as a last-pass would also correct bugs like OpenTreeOfLife/feedback#345 (accidental omission of Conolophus because of a bad homonym).
NB: a minor thing, but the NCBI approach might also help with e.g. OpenTreeOfLife/feedback#342, although this is a slightly trickier case, because the genus is marked as Extant-only on IRMNG rather than (as I think it should be) Extant+Fossil. This may be because it is a subfossil (recently extinct). There are (at least) 2 possible ways to implement what I suggested in this case:
Then do the NCBI last-pass. Implementing (2) is more conservative, but would fail to catch Daubentonia robusta. Implementing (1) would have a broader effect. If this is considered, you might want to take a random sample of species that have been changed by (2) but not (1), and assess whether it is on average an improvement.
GBIF harvests extinct/extant information from all of its sources. It does not copy this information into the taxonomy dumps, but it is present in the GBIF web site's database. The information can be retrieved through the GBIF API (https://www.gbif.org/developer/species), and this is the approach that Markus recommended when I asked him about this (many years ago now).
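As a rough sketch of what such a lookup might look like: the GBIF species API exposes a `speciesProfiles` resource per name usage, and each profile may carry an `extinct` boolean contributed by one source. The helper names below are hypothetical, the way conflicting sources are resolved (returning `None`) is an assumption rather than anything GBIF or OpenTree prescribes, and paging of long result lists is ignored.

```python
import json
from typing import Optional
from urllib.request import urlopen

GBIF_API = "https://api.gbif.org/v1"


def extinct_from_profiles(payload: dict) -> Optional[bool]:
    """Summarise the 'extinct' field across the profiles in one response.

    Returns True or False when every profile that states a value agrees,
    and None when no profile states one or the sources disagree.
    """
    values = {
        profile["extinct"]
        for profile in payload.get("results", [])
        if isinstance(profile.get("extinct"), bool)
    }
    return values.pop() if len(values) == 1 else None


def fetch_extinct(usage_key: int) -> Optional[bool]:
    """Fetch the species profiles for a GBIF usage key (network call)."""
    url = f"{GBIF_API}/species/{usage_key}/speciesProfiles"
    with urlopen(url, timeout=30) as response:
        return extinct_from_profiles(json.load(response))
```

Separating the parsing step from the network call keeps the decision rule testable offline; a bulk import would also need rate limiting and caching, which are left out here.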
@hyanwong By the way, NCBI taxonomy has quite a few extinct species - maybe only in the dozens or hundreds, but they tend to be important ones (e.g. T. rex).
Re-reading this thread... I apologise for the "indeterminate fossil status" of some species (perhaps 12k or so) in the IRMNG database. These come mainly from an import of the 2006 version of the Museum Victoria KEmu database, which contained many fossil taxon names (useful new content) but did not always state that they were fossil. I have considered dropping these from IRMNG, but in some/many cases they are useful for name resolution (correct spelling, size of genera, etc.). Since they are unlikely to be gone through manually to add the correct extant/fossil status, my hope is that one day this sector of IRMNG will become redundant, as the same information may in time be available from other sources, e.g. PBDB. For the last 5 years or so my focus has been on IRMNG genera rather than species (resource constraints), so the species content of IRMNG is gradually ageing anyway (the main reason that it is no longer included in DwCA dumps, although it is still accessible via the API). Take-home message: either use one of the possible kludges (e.g. if an ex-MV IRMNG name is not in a current "extant species" list for a generally well known group, e.g. sharks, it is quite possibly a fossil); or ignore IRMNG as a resource for species, hoping others will fill the gap (PaleoBioDB, or GBIF, which I believe incorporates the latter); or try to get resources to fix up/extend the species-level content of IRMNG (but there would probably be better places for funders to put their money at this time). So a problem without a clear solution except perhaps the passage of time...
Maybe paleodb. And I thought there were others, but can't remember.
We've been talking about this problem for a while, but there was no issue for it, so I'm creating this issue now, in particular so that I can redirect OpenTreeOfLife/treemachine#186 at it.