Import new sources of extinct/extant information #206

Open
jar398 opened this issue May 13, 2016 · 9 comments

@jar398
Member

jar398 commented May 13, 2016

Maybe PaleoDB. I thought there were others, but I can't remember them.

We've been talking about this problem for a while, but there was no issue for it, so I'm creating this issue now, in particular so that I can redirect OpenTreeOfLife/treemachine#186 at it.

@jimallman
Member

Can we incorporate trees from the Fossil Calibrations database? These are few but high-quality. See a data API for details.

@jar398
Member Author

jar398 commented Sep 13, 2016

I suspect we can get a PaleoDB dump with extinctness information from GBIF.

@jar398
Member Author

jar398 commented Oct 12, 2016

More examples here: OpenTreeOfLife/feedback#288

@hyanwong
Collaborator

hyanwong commented May 12, 2017

I wonder whether OpenTree should flag as fossil all IRMNG-only species that have no specific extinct/extant info (the IRMNG "sp. extant flag" is either blank or in square brackets) and whose genus is marked as EF (extant and fossil). To take an example from the previously mentioned Nautilus case (OpenTreeOfLife/feedback#288), OpenTree should be able to see that ott2867897/Nautilus-compressus is IRMNG-only (although it also has a GBIF entry via IRMNG). This species has no extant/extinct information (its "sp. extant flag" column is blank), but the genus is marked as EF, so we know the genus contains some fossils, and this species is probably one of them. Extant species such as Nautilus repertus would not be flagged, because they are explicitly marked as Extant in the "sp. extant flag" column, which is presumably accessible in the IRMNG dump file.

If you are worried about this approach marking too many species as fossil (a valid worry), then I suggest that the fossil status be overridden if the species is found in NCBI (or perhaps, if NCBI has genetic data for this species). In other words, OpenTree can be liberal with the fossil flag if it then uses the availability of sequence data in NCBI as a last-pass and absolute indicator of extant status (with the minor exception of the species here). The advantage of this approach is that the number of species with some sequence in NCBI is likely to increase rapidly, so NCBI should become a more and more comprehensive indicator of extantness, unlike e.g. IRMNG status, which I suspect will be difficult to keep correcting.
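A minimal sketch of the rule described above, for discussion only. The `Taxon` fields, the flag encodings ("", "E", "F", "[E]", "EF"), and the function names are assumptions made for illustration; they are not taken from the IRMNG dump format or from the OpenTree code. The `ncbi_exceptions` set stands in for the few species that are in NCBI but known to be extinct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taxon:
    name: str
    irmng_only: bool       # IRMNG is the only independent source (assumed field)
    species_flag: str      # IRMNG "sp. extant flag"; encoding assumed: "", "E", "F", "[E]"
    genus_flag: str        # genus-level status; encoding assumed: "E", "F" or "EF"
    in_ncbi: bool = False  # taxon (or sequence data for it) is present in NCBI


def has_explicit_status(flag: str) -> bool:
    """Blank or square-bracketed flags count as 'no specific info'."""
    flag = flag.strip()
    return bool(flag) and not flag.startswith("[")


def heuristic_fossil(t: Taxon) -> bool:
    """IRMNG-only, no explicit species status, genus marked extant and fossil."""
    return (
        t.irmng_only
        and not has_explicit_status(t.species_flag)
        and t.genus_flag == "EF"
    )


def final_fossil_flag(t: Taxon, prior_flag: bool = False,
                      ncbi_exceptions: frozenset = frozenset()) -> bool:
    """Apply the NCBI last pass on top of any earlier fossil flag."""
    if t.in_ncbi and t.name not in ncbi_exceptions:
        return False  # presence in NCBI overrides every fossil flag
    return prior_flag or heuristic_fossil(t)


# Illustrative values only, not read from any real dump.
compressus = Taxon("Nautilus compressus", True, "", "EF")
repertus = Taxon("Nautilus repertus", True, "E", "EF")
print(final_fossil_flag(compressus))  # True
print(final_fossil_flag(repertus))    # False
```

Keeping the NCBI check as a separate final step means it also clears fossil flags that came from elsewhere (passed in as `prior_flag`), not only those set by this heuristic.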

@hyanwong
Collaborator

hyanwong commented May 12, 2017

Note that using NCBI as a last pass would also correct bugs like OpenTreeOfLife/feedback#345 (accidental omission of Conolophus because of a bad homonym).

@hyanwong
Collaborator

hyanwong commented May 12, 2017

NB: a minor thing, but the NCBI approach might also help with e.g. OpenTreeOfLife/feedback#342, although this is a slightly trickier case, because the genus is marked as Extant-only on IRMNG rather than (as I think it should be) Extant+Fossil. This may be because the species is a subfossil (recently extinct). There are (at least) 2 possible ways to implement what I suggested in this case:

  1. Flag as fossil all IRMNG-only species that are not explicitly marked as Extant
  2. Flag as fossil all IRMNG-only species that are not explicitly marked as Extant, and whose genus is marked as Extant+Fossil

Then do the NCBI last-pass.

Implementing (2) is more conservative, but would fail to catch Daubentonia robusta. Implementing (1) would have a broader effect. If (1) is considered, you might want to take a random sample of the species that are changed by (1) but not by (2), and assess whether the change is on average an improvement.
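To make that comparison concrete, here is a rough sketch that applies both rules and draws a random sample of the species on which they disagree. The record layout (plain dicts with `name`, `irmng_only`, `species_flag`, `genus_flag`) and the flag values "E" and "EF" are assumptions for illustration, and the three records are made-up inputs, not rows from IRMNG.

```python
import random


def rule1(rec: dict) -> bool:
    """(1) IRMNG-only and not explicitly marked as Extant."""
    return rec["irmng_only"] and rec["species_flag"] != "E"


def rule2(rec: dict) -> bool:
    """(2) As rule 1, and the genus is marked Extant+Fossil."""
    return rule1(rec) and rec["genus_flag"] == "EF"


def sample_disagreements(records, k, seed=0):
    """Return up to k names, chosen at random, where the two rules differ."""
    diff = [r["name"] for r in records if rule1(r) != rule2(r)]
    rng = random.Random(seed)  # fixed seed so the review sample is repeatable
    return rng.sample(diff, min(k, len(diff)))


# Hypothetical records for illustration.
records = [
    {"name": "Daubentonia robusta", "irmng_only": True,
     "species_flag": "", "genus_flag": "E"},
    {"name": "Nautilus compressus", "irmng_only": True,
     "species_flag": "", "genus_flag": "EF"},
    {"name": "Nautilus repertus", "irmng_only": True,
     "species_flag": "E", "genus_flag": "EF"},
]
print(sample_disagreements(records, 5))  # ['Daubentonia robusta']
```

The sampled names would then be checked by hand before the NCBI last pass is applied, to judge whether the broader rule is an improvement.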

@jar398
Member Author

jar398 commented Oct 31, 2019

GBIF harvests extinct/extant information from all of its sources. It does not copy this information into the taxonomy dumps, but it is present in the GBIF web site's database. It is possible to get this information using the GBIF API (https://www.gbif.org/developer/species), and this is the approach that Markus recommended when I asked him about this (many years ago now).
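A rough sketch of what such a lookup might look like. It assumes that the species API's `speciesProfiles` endpoint returns a paged JSON object whose `results` entries may carry a boolean `extinct` field; check the API documentation before relying on this. The function names and the usage key passed to `fetch_extinct` are hypothetical.

```python
import json
from urllib.request import urlopen

GBIF_API = "https://api.gbif.org/v1"


def extinct_from_profiles(payload: dict):
    """Return True or False if all profiles that state 'extinct' agree, else None."""
    values = {
        p["extinct"]
        for p in payload.get("results", [])
        if isinstance(p.get("extinct"), bool)
    }
    if len(values) == 1:
        return values.pop()
    return None  # no information, or the sources disagree


def fetch_extinct(usage_key: int, timeout: float = 30.0):
    """Fetch the species profiles for one GBIF name usage and summarise them."""
    url = f"{GBIF_API}/species/{usage_key}/speciesProfiles"
    with urlopen(url, timeout=timeout) as resp:
        return extinct_from_profiles(json.load(resp))
```

Returning `None` when the profiles conflict leaves the decision to whatever other evidence is available, rather than letting a single source win silently.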

@jar398
Member Author

jar398 commented Oct 31, 2019

@hyanwong By the way, the NCBI taxonomy has quite a few extinct species - maybe only in the dozens or hundreds, but they tend to be important ones (e.g. T. rex).

@TonyRees

Re-reading this thread... I apologise for the "indeterminate fossil status" of some species (perhaps 12k or so) in the IRMNG database. These mainly come from an import of the 2006 version of the Museum Victoria KEmu database, which contained many fossil taxon names (useful new content) but did not always state that they were fossil. I have considered dropping these from IRMNG, but in many cases they are useful for name resolution (correct spelling, size of genera, etc.). Since they are unlikely to be gone through manually to add the correct extant/fossil status, my hope is that one day this sector of IRMNG will become redundant, as the same information may in time be available from other sources, e.g. PBDB. For the last 5 years or so my focus has been on IRMNG genera rather than species (resource constraints), so the species content of IRMNG is gradually ageing anyway (the main reason that it is no longer being included in DwCA dumps, although it is still accessible via the API).

Take-home message: either use one of the possible kludges (e.g. if an ex-MV IRMNG name is not in a current "extant species" list for a generally well-known group, e.g. sharks, it is quite possibly a fossil); or ignore IRMNG as a resource for species, hoping others will fill the gap (PaleoBioDB, or GBIF, which I believe incorporates the latter); or try to get resources to fix up and extend the species-level content of IRMNG (but there would probably be better places for funders to put their money at this time). So a problem without a clear solution except perhaps the passage of time...

4 participants