requesting support for ids in taxonomicClassification element #141

Closed
mbjones opened this issue Mar 12, 2017 · 35 comments

@mbjones
Contributor

@mbjones mbjones commented Mar 12, 2017


Author Name: Margaret O'Brien (Margaret O'Brien)
Original Redmine Issue: 1636, https://projects.ecoinformatics.org/ecoinfo/issues/1636
Original Date: 2004-07-13
Original Assignee: Matt Jones


Support for external ids in taxonomicClassification, e.g., to a code from a system like ITIS or WoRMS. Possibly this could resemble the pattern used in alternateIdentifier. It should be repeatable.

original reasoning:
We use species codes in tables of abundance and density. If the
taxonomicClassification element supported ids, then we could create a reference
from the code to its taxon. The list of taxa would probably be kept under
additionalMetadata (due to its length).

@mbjones

Contributor Author

@mbjones mbjones commented Mar 12, 2017


Original Redmine Comment
Author Name: Matt Jones (Matt Jones)
Original Date: 2004-09-02T16:38:14Z


Changing QA contact to the list for all current EML bugs so that people can
track what is happening.

@mbjones

Contributor Author

@mbjones mbjones commented Mar 12, 2017


Original Redmine Comment
Author Name: Redmine Admin (Redmine Admin)
Original Date: 2013-03-27T21:17:46Z


Original Bugzilla ID was 1636

@mbjones mbjones added this to the Unspecified milestone Mar 12, 2017
@mbjones mbjones removed the bug label Apr 22, 2017
@mobb mobb self-assigned this Jun 7, 2017
@mobb

Contributor

@mobb mobb commented Jun 7, 2017

The original justification for this feature request (internal references) isn't particularly useful. But allowing EML taxonCoverage to hold identifiers to external systems (eg, ITIS) has been discussed several times. This issue could apply to that need instead.

@cboettig

Member

@cboettig cboettig commented Jun 7, 2017

👍
Support for external taxonomic IDs would be a great addition.

It might be informative to look at what the NeXML (http://www.nexml.org/, https://dx.doi.org/10.1093%2Fsysbio%2Fsys025) standard has done for annotating taxonomic units with identifiers and classification information (some examples, though maybe too RDF-heavy to be very transparent).

@mbjones mbjones removed the Category: eml label Jul 24, 2017
@gastil

@gastil gastil commented Mar 14, 2018

Like. It would be useful to have taxonomic ID reference links (URIs, whatever) in the taxonomicClassification element, specifically for those taxa which change with inconvenient frequency or need to use a reference specific to the science domain or geography. Could this get into EML 2.2 please?

@gastil

@gastil gastil commented Mar 14, 2018

Also if we plan to someday do post-hoc external annotation of existing metadata docs, we will need an id in the taxonomy elements to point to.

@mbjones mbjones added the enhancement label Mar 14, 2018
@mbjones mbjones modified the milestones: Unspecified, EML2.2.0 Mar 14, 2018
@mbjones mbjones added the backlog label Mar 14, 2018
@mobb

Contributor

@mobb mobb commented Mar 14, 2018

There is currently an id attribute on taxonomicCoverage, but not on taxonomicClassification, which, as everyone knows, is what taxonomists like to argue about. So this is a good candidate for external annotation.

@mbjones

Contributor Author

@mbjones mbjones commented Mar 14, 2018

Sounds great to me. Should be an easy add. I moved it to the 2.2 backlog until someone claims it and wants to put the work into adding and documenting the schema elements. Any volunteers?

@mobb

Contributor

@mobb mobb commented Mar 15, 2018

Note from John Porter:
My thinking on the taxonomic element is probably simplistic, but it seems to me that simply adding a place for an identifier and its associated authority would be a major improvement. This could be done either as additional attributes in the EML taxonomicClassification, or as additional tags nested also in that section. The latter has the advantage that multiple authorities and identifiers could be included.
Alternatively, the authority attribute/tag could be linked to an ID in the taxonomicSystem (probably in classificationSystemCitation) (it too would need to be added).
Thus:

<taxonomicSystem>
  <classificationSystem>
    <classificationSystemCitation id="1">ITIS</classificationSystemCitation>
  </classificationSystem>
</taxonomicSystem>

Once defined, we can then use the authority id to provide context for an id for the particular taxonomic unit:

<taxonomicClassification id="5935086" authority="1">
  <taxonRankName>genus</taxonRankName>
  <taxonRankValue>Peromyscus</taxonRankValue>
</taxonomicClassification>

OR (if we want to accommodate multiple IDs/authorities for individual taxa -taxonomicReference could be repeated):

<taxonomicClassification>
  <taxonRankName>genus</taxonRankName>
  <taxonRankValue>Peromyscus</taxonRankValue>
  <taxonomicReference>
    <taxonAuthority>1</taxonAuthority>
    <taxonIdentifier>59302086</taxonIdentifier>
  </taxonomicReference>
</taxonomicClassification>
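
For illustration only, here is a minimal Python sketch of how a reader of such a document might resolve the second pattern, looking up each taxonAuthority value against the id on classificationSystemCitation. It assumes the proposal exactly as written in the note: neither that id attribute nor the taxonomicReference element exists in EML at this point in the thread, and the wrapping `<coverage>` root and the function name `resolve_references` are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that combines the two fragments proposed in the note.
DOC = """
<coverage>
  <taxonomicSystem>
    <classificationSystem>
      <classificationSystemCitation id="1">ITIS</classificationSystemCitation>
    </classificationSystem>
  </taxonomicSystem>
  <taxonomicClassification>
    <taxonRankName>genus</taxonRankName>
    <taxonRankValue>Peromyscus</taxonRankValue>
    <taxonomicReference>
      <taxonAuthority>1</taxonAuthority>
      <taxonIdentifier>59302086</taxonIdentifier>
    </taxonomicReference>
  </taxonomicClassification>
</coverage>
"""

def resolve_references(xml_text):
    """Return (taxon name, authority name, identifier) for every reference."""
    root = ET.fromstring(xml_text)
    # Map each citation id to the authority name it declares.
    authorities = {
        c.get("id"): (c.text or "").strip()
        for c in root.iter("classificationSystemCitation")
    }
    results = []
    for taxon in root.iter("taxonomicClassification"):
        name = taxon.findtext("taxonRankValue", default="").strip()
        # taxonomicReference is repeatable, so a taxon may yield several rows.
        for ref in taxon.findall("taxonomicReference"):
            auth_id = ref.findtext("taxonAuthority", default="").strip()
            ident = ref.findtext("taxonIdentifier", default="").strip()
            # An unknown authority id resolves to None rather than failing.
            results.append((name, authorities.get(auth_id), ident))
    return results

print(resolve_references(DOC))
```

The extra lookup step is the cost of this pattern: an identifier is only meaningful once the authority id has been resolved elsewhere in the document.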
@cboettig

Member

@cboettig cboettig commented Mar 15, 2018

I really like this proposal. While I think the id attribute use is great and would love to see id on most/all complex types, it's a little opaque; so I'd vote for the option with additional tags, which looks much more consistent with EML patterns (and not necessarily exclusive of using id in that way as well).

I think taxon ids are pretty fundamental and seem pretty ubiquitous in other ecology/evolution standards, so would really like to see these in EML; particularly given how loose the current taxonomicCoverage blocks are (i.e. users specify rank names from unconstrained list, even Dryad restricts these to rank names from Darwin Core) and how important all the coverage fields are for discovery. Getting researchers to use taxon ids instead of just species names is probably right up on my list behind doing dates in ISO 8601 even with no metadata.

Not sure I'm qualified to write XSD but could give this a stab next week?

@mobb

Contributor

@mobb mobb commented Mar 15, 2018

@cboettig - go for it! I see my name on this already, but I won't be able to get to it till next week anyway, so I can take a look after.

@mbjones mbjones removed the backlog label Mar 15, 2018
@mobb

Contributor

@mobb mobb commented Jul 5, 2018

Not sure if I'm on the right track here, but could we use the same AnnotationType (not currently supported under taxonomicClassification)? For example:

<taxonomicClassification>
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <annotation>
    <propertyURI label="dwc:taxonID">http://rs.tdwg.org/dwc/terms/taxonID</propertyURI>
    <valueURI label="11274">https://itis.gov/bogus/uri/tsn/11274</valueURI>
  </annotation>
  <annotation>
    <propertyURI label="dwc:taxonID">http://rs.tdwg.org/dwc/terms/taxonID</propertyURI>
    <valueURI label="35122">http://ncbi.nlm.nih.gov/taxonomy/35122</valueURI>
  </annotation>
</taxonomicClassification>
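
As a rough sketch of what consuming the annotation pattern could look like, the Python below pulls the property and value pairs out of the fragment. It is hypothetical throughout: annotation is not supported under taxonomicClassification at this point in the thread, the ITIS URI is the bogus placeholder from the example, and the function name `read_annotations` is invented here.

```python
import xml.etree.ElementTree as ET

# The fragment from the comment above (the ITIS URI is a placeholder).
FRAGMENT = """
<taxonomicClassification>
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <annotation>
    <propertyURI label="dwc:taxonID">http://rs.tdwg.org/dwc/terms/taxonID</propertyURI>
    <valueURI label="11274">https://itis.gov/bogus/uri/tsn/11274</valueURI>
  </annotation>
  <annotation>
    <propertyURI label="dwc:taxonID">http://rs.tdwg.org/dwc/terms/taxonID</propertyURI>
    <valueURI label="35122">http://ncbi.nlm.nih.gov/taxonomy/35122</valueURI>
  </annotation>
</taxonomicClassification>
"""

def read_annotations(xml_text):
    """Return one dict per <annotation> child of the taxon element."""
    taxon = ET.fromstring(xml_text)
    found = []
    for ann in taxon.findall("annotation"):
        prop = ann.find("propertyURI")
        value = ann.find("valueURI")
        found.append({
            "property": prop.text.strip(),
            "value_label": value.get("label"),
            "value_uri": value.text.strip(),
        })
    return found

for row in read_annotations(FRAGMENT):
    print(row["value_label"], row["value_uri"])
```

Note that the bare identifier only survives here as the label attribute; everything else depends on the value URI staying stable.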
@cboettig

Member

@cboettig cboettig commented Jul 5, 2018

@mbjones will annotation be supported on taxonomicClassification? I thought it was not.

I'm not sure that dwc:organismID is the correct term? My understanding is that would refer to a particular unique organism, i.e. the precise specimen held by a particular museum etc, and does not refer to the generic taxonomic level. I think what you're after here is dwc:taxonID, but I'm not an expert here.

To me, I think the simplest thing would be to introduce one or more <taxonID> nodes as siblings to <taxonRankName> and <taxonRankValue>. Perhaps this should take the full URI (probably ideal from a parsing perspective), or could list the identifier and authority separately:

<taxonRankName>species</taxonRankName>
<taxonRankValue>Macrocystis pyrifera</taxonRankValue>
<taxonID>http://ncbi.nlm.nih.gov/taxonomy/35122</taxonID>
<taxonID>https://itis.gov/bogus/uri/tsn/11274</taxonID>

or

<taxonRankName>species</taxonRankName>
<taxonRankValue>Macrocystis pyrifera</taxonRankValue>
<taxonID url="http://ncbi.nlm.nih.gov/taxonomy/35122">NCBI:35122</taxonID>
<taxonID url="https://itis.gov/bogus/uri/tsn/">ITIS:11274</taxonID>

Related, I wonder if it would make sense to support URIs in taxonRankName? Probably no one would pay attention, and a taxonID sorta solves this already, but it is something of a nuisance that rankName is free text and thus we get all variations of species, Species, Sp, etc.
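
To make the nuisance concrete, this is roughly the kind of clean-up a consumer has to do while rankName stays free text. It is only a sketch: the alias table is an assumption invented for the example, not a vocabulary defined by EML or Darwin Core, and `normalize_rank` is a hypothetical helper.

```python
# Hypothetical alias table; a real one would need to be agreed on somewhere.
RANK_ALIASES = {
    "sp": "species",
    "sp.": "species",
    "spp": "species",
    "spp.": "species",
}

def normalize_rank(raw):
    """Lower-case and trim a free-text rank name, then apply known aliases."""
    key = raw.strip().lower()
    # Unknown values pass through (lower-cased) instead of being guessed at.
    return RANK_ALIASES.get(key, key)

for raw in ["species", "Species", "Sp", " Genus "]:
    print(repr(raw), "->", normalize_rank(raw))
```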

@mobb

Contributor

@mobb mobb commented Jul 5, 2018

As of now, annotation is not supported under taxonomicClassification.
and you're right, it's taxonID not organismID (I corrected the example above).

Some services do not seem to support URIs; eg, I have not found one for ITIS yet (hence the bogus/tsn). The example is mainly included for completeness (and its rdf-ness).

I think overall, that an explicit id (0:many) is simpler, if taxon URIs are hard to find, and the property can be presumed to be fixed.

@twhiteaker

Contributor

@twhiteaker twhiteaker commented Jul 5, 2018

It's not as pretty as the bogus URL, but ITIS does support, e.g.,
https://itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=135122

I think storing the identifier URL (https://itis.gov/bogus/uri/tsn/11274) separately from the identifier (11274) would better support archiving since URLs may change. The identifier may become deprecated, but it should be permanent, and looking at ITIS as an example, if you hit a deprecated TSN page, they point you to the currently accepted value, as in https://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=28759. Well, I presume 28759 is deprecated.

@mobb

Contributor

@mobb mobb commented Jul 5, 2018

thanks for all the input folks! there are several patterns for attaching taxon ids that we could follow (ie, that are already used in EML). Some are already covered here. I'll gather them up and list them as options.

@cboettig

Member

@cboettig cboettig commented Jul 5, 2018

looks like @twhiteaker beat me to a "prefix" for ITIS; I agree these could be easier to find. GLOBI provides a convenient list in

Poelen, Jorrit H. (2018). Global Biotic Interactions: Taxon Graph (Version 0.3.4) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1299296

see the prefixes table.

I agree that it probably makes sense to store identifiers separate from URLs since like Tim says not all those URLs are likely to stick around. I tend to prefix identifiers to the authority, e.g. ITIS:11274 or NCBI:35122, but perhaps splitting these into more explicit "Identifier" and "Authority" bits would be more intuitive.
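
A small Python sketch of the split described above, turning a prefixed value such as ITIS:11274 into explicit authority and identifier parts. The helper name `split_taxon_id` and the decision to reject full URLs are assumptions made for the example.

```python
def split_taxon_id(prefixed):
    """Split 'AUTHORITY:identifier' into an (authority, identifier) pair."""
    authority, sep, identifier = prefixed.strip().partition(":")
    # Reject input with no colon, an empty half, or a full URL such as
    # "http://..." (there the first colon belongs to the URL scheme).
    if not sep or not authority or not identifier or identifier.startswith("//"):
        raise ValueError(f"not a prefixed identifier: {prefixed!r}")
    return authority, identifier

print(split_taxon_id("ITIS:11274"))
print(split_taxon_id(" NCBI:35122 "))
```

Keeping the two parts in separate fields avoids having every consumer repeat this parsing and its edge cases.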

Anyway, really excited to see taxonIDs supported in the future. Being able to search DataONE for one's study organism seems to be a common suggestion and it works rather poorly in the current system given the heterogeneity of specification.

@mobb

Contributor

@mobb mobb commented Jul 9, 2018

Options

I spent a little time looking at how we store external references in other parts of EML and how a couple of other taxonomic systems do it as well. I also revisited the suggestions above.

Option 1: EML's system+id attribute pair

(suggested in comment from JP back on March 15)
This would have to include the name of the system (authority) as well. see example there.

EML already uses the id attribute for internal references, and only one is allowed per element. So this
one doesn't take us very far for external refs, plus it overloads the id attribute. So not a great candidate for external refs.

We could consider adding it (an id attribute) anyway, so that it could be used for internal
references, or for annotations as additionalMetadata. That use was suggested in the original issue, and is consistent with @cboettig's comment (March 15).

Option 2: follow the userId pattern

e.g.,

<taxonomicClassification>
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <taxonId provider="ITIS">11274</taxonId>
  <taxonId provider="https://www.ncbi.nlm.nih.gov/taxonomy">35122</taxonId>
</taxonomicClassification>

I am ambivalent about putting a URL in as provider, because they aren't reliable. that goes for a URL to the organism's landing page too. I'd prefer a recommended name that a landing page could be built from -- either from

  • the provider itself (we did this with the orcid example; see src/test/resources/eml-data-paper.xml)
  • another list (eg, the prefixes table, in an earlier comment)

If the provider's ID happens to be resolvable (as ORCID's is) - that's fine, but I think we should treat it as an ID, not assume or promote urls.
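
One way to picture "a recommended name that a landing page could be built from" is a lookup from provider name to a URL template, sketched below in Python. Everything in the table is an assumption for illustration: the ITIS template is the servlet URL quoted earlier in this thread, the NCBI template follows the URL form used in the examples, and either may already be stale, which is the point of keeping the identifier itself separate. The names `URL_TEMPLATES` and `landing_page` are hypothetical.

```python
from urllib.parse import quote

# Hypothetical prefix table keyed by recommended provider name.
# URL forms are copied from earlier comments and may not stay valid.
URL_TEMPLATES = {
    "ITIS": "https://itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value={id}",
    "NCBI": "http://ncbi.nlm.nih.gov/taxonomy/{id}",
}

def landing_page(provider, identifier):
    """Build a landing-page URL, or return None for an unknown provider."""
    template = URL_TEMPLATES.get(provider)
    if template is None:
        return None
    # Percent-encode the identifier so it cannot alter the URL structure.
    return template.format(id=quote(identifier, safe=""))

print(landing_page("ITIS", "11274"))
print(landing_page("NCBI", "35122"))
print(landing_page("WoRMS", "12345"))  # not in the table, so None
```

When a provider moves its pages, only the table changes; the stored provider name and identifier stay as they were.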

The word "provider" is what phyloxml uses; some other term works there also.

Option 3: option 2 + an id for referencing internally

If we added the id-system-scope group on taxonomicClassification, then we could accommodate both in-place external identifiers (with taxonId) and ad hoc annotations (with a ref to XML in additionalMetadata).

<taxonomicClassification id="aaaaa" system="bbbbb" scope="system">
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <taxonId provider="ITIS">11274</taxonId>
  <taxonId provider="https://www.ncbi.nlm.nih.gov/taxonomy">35122</taxonId>
</taxonomicClassification>

This is attractive, but we risk people confusing the two ids

  • taxonId
  • taxonomicClassification id=""
@mbjones

Contributor Author

@mbjones mbjones commented Jul 10, 2018

Of these options, I think Option 2 makes the most sense because I don't think the current EML id and system attributes work well for external identifiers that might be repeated in different sections of the document. The challenge with option 2 is how to scope provider such that everyone using, say, ITIS, knows to use the same provider value. This is a classic problem of how to reference a vocabulary when we know that the current number and location of useful vocabularies will change over time. ITIS has been present for a long time, but its URLs have shifted (our Morpho implementation is broken for just that reason). So, I think it would be good to:

  1. be sure provider is an open field that can include new providers in the future
  2. be sure the instructions for how to construct a value for provider are clear and unambiguous -- we can test this by having several people try to follow the instructions to add an ITIS TSN and an NCBI identifier and see if they all use the same provider value. We should provide explicit advice for how to name the common taxon vocabularies in use today, including ITIS and NCBI.
  3. Be clear about what should go in the value space (e.g., whether or not the identifier is prefixed in any way).

One reason DOIs are effective is that, generally speaking, they are universally recognizable (granted, with > 20 valid serializations) and therefore software can know that they should be resolved as DOIs following current resolution protocols. Taxon URIs have not reached this status. The difference between option 2 and the annotation approach is that the annotation approach specifically forces you to draw from a vocabulary that has a known stable URI for each concept. I think using any URI that is derived from an application (e.g., ?search_topic=TSN&search_value=28759) is unlikely to be stable over even a few years -- much better to draw from a vocabulary with a defined URI space, or from a plain identifier with a provider drawn from a known list.

So, I would support either option 2 or the annotation approach.

We have these same ambiguities in other controlled fields in EML. It would be nice to not repeat those issues.

@cboettig

Member

@cboettig cboettig commented Jul 10, 2018

Agree 💯 with @mbjones. 👍 for Option 2.

Given provider is to be specified as an attribute, it seems that not introducing a prefix to the identifier would be best. Not sure about how to constrain the provider field; a URL for the provider seems like it would be the least ambiguous, otherwise it would probably need a controlled vocabulary, and that will be a pain to create and keep current. (In a more linked-data world provider would no doubt be either a URI or Organization description, but that seems like a poor fit here.)

The annotation approach leaves the URI / URL stability problem unanswered -- probably okay for NCBI and Wikidata, but probably not so much for most of the 182 authorities recognized by Global Names Resolver.

@mobb

Contributor

@mobb mobb commented Jul 10, 2018

checked in a version of option 3 (with only an id on the parent element, so it can be referenced by an annotation node). here is sample xml:

<taxonomicClassification id="taxon_MAPY">
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <taxonId provider="ITIS">11274</taxonId>
  <taxonId provider="https://www.ncbi.nlm.nih.gov/taxonomy">35122</taxonId>
</taxonomicClassification>

documentation is only basic, however.
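
Until fuller documentation exists, a short Python sketch may help show how the checked-in shape reads from the consumer side: one internal id on the parent element, plus zero or more taxonId children that each carry a provider. The function name `read_taxon` and the returned dictionary layout are assumptions for the example, not part of EML.

```python
import xml.etree.ElementTree as ET

# The sample XML from the comment above.
SAMPLE = """
<taxonomicClassification id="taxon_MAPY">
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <taxonId provider="ITIS">11274</taxonId>
  <taxonId provider="https://www.ncbi.nlm.nih.gov/taxonomy">35122</taxonId>
</taxonomicClassification>
"""

def read_taxon(xml_text):
    """Collect the internal id and every (provider, identifier) pair."""
    taxon = ET.fromstring(xml_text)
    return {
        # Internal id: the target for references and annotations.
        "internal_id": taxon.get("id"),
        "name": taxon.findtext("taxonRankValue", default="").strip(),
        # External ids: repeatable, each scoped by its provider attribute.
        "external_ids": [
            (t.get("provider"), (t.text or "").strip())
            for t in taxon.findall("taxonId")
        ],
    }

print(read_taxon(SAMPLE))
```

The two kinds of id stay distinct in the result, which is the confusion Option 3 was flagged as risking.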

@mobb mobb added the documentation label Jul 10, 2018
@mbjones

Contributor Author

@mbjones mbjones commented Jul 11, 2018

Aha, I misunderstood option 3 before. I like it. Except that provider="ITIS" just isn't going to work generally. What do you think of forcing provider to be xs:anyURI and to define it as the namespace URI for the provider (and not the provider's website URI)? This requires that each provider would need to have a definitive namespace, which is unlikely.

Also, looking at Darwin Core, we should differentiate what we want: a taxonID or a taxonConceptID, or either, or both. In general I would prefer a taxonConceptID, but these are rare in the wild, and mostly systems like ITIS provide taxonIDs, although there has been an extensive debate around where to draw the line between these (e.g., discussion of nameUsage and potentialTaxonConcept).

@cboettig

Member

@cboettig cboettig commented Jul 11, 2018

This looks pretty good to me. 👍 for provider to be xs:anyURI. I'm happy for the documentation to recommend that this be a namespace for provider (i.e. does that mean resolving to the taxon when prefixed before the taxonID?) but since that's not often going to be practical or stable it seems like we should live with any URI for the provider.

I'm pretty sure we want taxonID, though that's probably only because I can't make heads or tails of what a taxonConceptID is supposed to be. (I cannot find the discussion of nameUsage and potentialTaxonConcept you refer to, and the Darwin Core documentation is singularly unhelpful, defining a dwc:taxonConceptID as "An identifier for the taxonomic concept", which they go on to illustrate with an example: "8fa58e08-08de-4ac1-b69c-1235340b7001". Clear as mud to me.)

@mobb

Contributor

@mobb mobb commented Jul 11, 2018

The only ref to a taxonConcept I've seen is this one, where they are intended to 'stabilize the "meaning" of species identifiers'. Agree - these are not likely to be common.
https://resolver.globalnames.org/data_sources/101

pushed up these changes to eml-coverage.xsd. The documentation is still minimal, since it seems to me that the details of usage are better written up in external docs.

@mobb

Contributor

@mobb mobb commented Jul 13, 2018

added this example to the test file eml-i18n.xml

<taxonomicClassification id="taxon_MAPY">
  <taxonRankName>species</taxonRankName>
  <taxonRankValue>Macrocystis pyrifera</taxonRankValue>
  <commonName>Giant Kelp</commonName>
  <taxonId provider="ITIS">11274</taxonId>
  <taxonId provider="https://www.ncbi.nlm.nih.gov/taxonomy">35122</taxonId>
</taxonomicClassification>
mobb added a commit that referenced this issue Jul 13, 2018
@twhiteaker

Contributor

@twhiteaker twhiteaker commented Jul 17, 2018

Looks good to me

@mobb mobb added needs-review and removed in progress labels Jul 24, 2018
@mbjones

Contributor Author

@mbjones mbjones commented Nov 21, 2018

Need to still:

  • fix documentation in doc:example fields to use URIs
  • ensure our tests provide validatable examples of this provider field in use
@mobb

Contributor

@mobb mobb commented Jan 23, 2019

needed still

  • test that the provider is indeed a valid IRI
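
One possible shape for such a test, sketched in Python. It checks that a provider value is an absolute http(s) URI, which is a narrower rule chosen for this example and not what the schema enforces: as far as I understand, xs:anyURI also admits relative references, so schema validation alone would likely accept a bare value like ITIS. The function name `is_absolute_http_uri` is hypothetical.

```python
from urllib.parse import urlsplit

def is_absolute_http_uri(provider):
    """True if the value has an http(s) scheme and a host part."""
    parts = urlsplit(provider.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)

for value in ["https://www.ncbi.nlm.nih.gov/taxonomy", "ITIS", "www.itis.gov"]:
    print(value, is_absolute_http_uri(value))
```

A check like this would sit outside the XSD, for example in a test suite or a submission-time linter.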
@mobb

Contributor

@mobb mobb commented Jan 23, 2019

Documentation should contain examples of valid IRIs (urls)

@mbjones mbjones removed the in progress label Jul 25, 2019
@mbjones

Contributor Author

@mbjones mbjones commented Aug 16, 2019

Added a taxonId to the eml-sample.xml test file to ensure that validation works, which it does. So I think this new feature is complete.
