Define rules for scientific name identity #35
Comments
The word "name" by itself usually causes more confusion than clarity in these kinds of conversations. Every time we try to pin down a "clean" definition of "name", it usually isn't successful in the end. I don't think there will be much value in attempting to define a "name" in the COL+ context. Without using the word "name", I suggest there are two classes of things we need to reconcile:
Class 1: Literal Text String. These are literal strings of text (UTF-8) characters that are purported to represent the scientific label for an organism. These are what GNI indexes. Other than eliminating redundant whitespace, etc., these are the actual character strings associated with biodiversity data that we wish to use as taxonomic handles.
Class 2: Nomenclatural Objects. These are conceptual nomenclatural entities that have key properties associated with them (e.g., authorship, year, corrected spelling, combination, rank, etc.), and represent the core "things" that we hope to assign persistent reusable identifiers to, and use to cross-link biodiversity data through nomenclature and taxonomy.
For example, your item #1 in your list is a single instance of a Class-2 object, which has been represented by at least six different Class-1 literal character strings. The first string you include ("Aus bus Linnaeus 1758") is not "the" name -- it is just one of potentially many text-string variants used to represent the abstract "name". As to your questions:
If you're asking what "thing" a CoL+ identifier should be attached to -- the answer is easy. It should be attached to a specific TNU instance (i.e., not an abstract "name", but an explicit, specific usage instance of a name, which includes the set of protonyms and all the other relevant properties of the usage instance). COL is about taxon concept circumscriptions, and this will always require a more explicit anchor point than a "name", no matter how we define a "name".
Thanks Rich. I believe your Class 1 literals do not deserve an identifier, because the string already identifies itself. It is the nomenclatural entity that we are talking about here. And exactly because a "name" per se is understood in various ways, I want a clear definition of what it is we identify by a scientificNameID. I think it is mostly terminology that makes our views (GNUB / CoL+) appear different. It is the set of protonyms that I would like to identify, just as you say, though not via the set of protonym ids but with an opaque identifier of its own. I am not convinced we want anything more than a set of 3 protonyms though. In the case of a species name string that includes a subgenus, or a variety with a subspecies given: do we care about the inherent taxonomic classification? I think we should not, and should regard these as the same name. As for ambiregnal names, my hope and gut feeling are just like yours, but this needs to be tested. I also agree that we should treat chresonyms as taxon concepts. It will just be hard to detect them without a sec/sensu/auct indication. So they will likely slip into the names dataset before we know it, and we will then have to deal with them. As they will probably have (stable) ids by then, the practical solution would be to keep them as names and mark them as chresonyms. Not nice, but likely to work...
Thanks, Markus. Yes, for the Class 1 Namestrings, the UTF-8 string itself is, by definition, unique. Dima creates hashed UUID values from the strings themselves, and these can be useful for a variety of reasons as surrogate identifiers for the strings themselves. However, these are not "assigned" UUIDs, they are derived UUIDs, and as such can be thought of as simply a different rendering of the string itself. I fully agree that what we need to uniquely identify for COL+ purposes (as well as every other taxonomic purpose in biodiversity informatics) is the Class 2 nomenclatural object/entity. In the same way that "name" can be defined many different ways, I think we need to avoid the text-string components of a "name" (e.g., Genus part, species part, infraspecific part, qualifiers, authorships, year, etc.) as part of the defining properties of the nomenclatural objects. That was my main point. I certainly have no objection to issuing a single unique identifier to a "Protonym Set". The benefits in doing so are similar to the benefits that Dima has found for assigning hashed UUIDs for Class-1 text-string names. The object itself would be uniquely identified by the set of protonyms involved (in the same way that a Class-1 name string is uniquely identified by the set of UTF-8 characters it contains), but the surrogate identifier would be extremely convenient for data modelling and for performance purposes, etc. I also agree that we can limit the set to three identifiers: In other words, the only "middle" identifier is a species. Thus "Aus (Xus) bus" would be the same as "Aus bus" (both with the same two Protonyms for Aus and bus). All other properties would be derived either from the Protonyms themselves (authorship, literature citations, type specimens, objective synonymies, homonyms, etc.) or from TNUs anchored to the same Protonym sets (=Nomenclatural Object/Entity).
TNU properties would capture full classifications, spellings of each component of the names, rank information, subjective synonymies, etc. So, I think we're in agreement on this. Agreed on the need for testing ambiregnals. Yes, chresonyms will slip into the system, so we'll need a mechanism for treating them differently from homonyms. I don't see that as being a problem. We just have to accept that identifiers will be assigned to three different classes of Nomenclatural objects: 1) legitimate Nomenclatural objects/entities; 2) inadvertent duplicates of legitimate Nomenclatural objects/entities (e.g. chresonyms, when discovered as such); and 3) entries that are out of scope (e.g., derived from text strings that are not actually names of organisms). I have a very robust/elegant mechanism for dealing with these three classes of objects, which I can discuss in detail if you wish. I think that the important thing is that we seem to agree that a "name" (Nomenclatural Object/entity) can be defined as a unique set of one, two or three protonyms (in sequence*), and should be represented by its own surrogate identifier (integer, or UUID hashed from the three Protonym identifiers?). Now we just need to define "Protonym"... :-)
*The "in sequence" bit may be important for cases where "Aus bus cus" and "Aus cus bus" both exist separately (because someone made an error in determining which name is the species, and which is the subspecies). My gut feeling is that these should be regarded as separate Nomenclatural Objects/Entities, but they represent an edge case.
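To make the two kinds of derived identifier discussed above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the namespace, the function names and the protonym ids are hypothetical, and it does not claim to reproduce how GNI, GNUB or CoL+ actually derive their identifiers. It uses name-based (version 5) UUIDs, so the same input always yields the same UUID, and it keeps the protonyms in sequence so that "Aus bus cus" and "Aus cus bus" stay distinct.

```python
import json
import uuid
from typing import Sequence

# Hypothetical namespace for this sketch only; not one used by any real service.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def name_string_uuid(name_string: str) -> uuid.UUID:
    """Derive a UUID from a Class-1 literal string.

    Redundant whitespace is collapsed first, so the UUID is just another
    rendering of the (normalized) string rather than an assigned identifier.
    """
    normalized = " ".join(name_string.split())
    return uuid.uuid5(NAMESPACE, normalized)


def protonym_set_uuid(protonym_ids: Sequence[str]) -> uuid.UUID:
    """Derive a surrogate UUID for an ordered set of one to three protonym ids.

    Order matters: (genus, species, infraspecies) is not the same object as
    (genus, infraspecies, species).
    """
    if not 1 <= len(protonym_ids) <= 3:
        raise ValueError("expected one, two or three protonym identifiers")
    # JSON encoding gives an unambiguous, order-preserving key.
    key = json.dumps(list(protonym_ids))
    return uuid.uuid5(NAMESPACE, key)


# Hypothetical protonym identifiers, for illustration only.
aus_bus_cus = protonym_set_uuid(["p-aus", "p-bus", "p-cus"])
aus_cus_bus = protonym_set_uuid(["p-aus", "p-cus", "p-bus"])
print(aus_bus_cus != aus_cus_bus)
```

Under this scheme a subgenus never enters the key, so "Aus (Xus) bus" and "Aus bus" would resolve to the same two-protonym identifier, provided both strings have been linked to the same protonyms for Aus and bus.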
@deepreef, this is definitely applicable to zoological names, but in botany Aus bus cus and Aus dus cus are both accepted subspecies names associated with two different species. The same applies to all other ranks. You cannot just pick terminal epithets for botanical names.
For reference, I've put up some conceptual documentation for TaxonWorks. It was heavily influenced by ideas from Rich, and I'm almost positive the approach can capture what Rich wants to express. A somewhat pertinent point is that we actually built the tool, tested the concept, etc., i.e. we've gone far beyond conceptual discussions.
Note that the system allows for the application of unique identifiers at many different levels. In addition to the maintenance ids that are used (e.g. auto-incremented ids in an RDB), global identifiers can be tacked on to any instance of any of these data. All instances of data classes can also be cited (linked to a reference). Some of these citation instances correspond to Rich's concepts. Amending to actually answer @mdoering's questions:
Frankly, this is up to the curator. Sometimes it matters, sometimes it doesn't. Give them the relationships/object properties to express why it matters to them and it becomes less of an issue.
I very much wanted this, however the rules of nomenclature must let you interpret any name/string, so again this is up to the curator. What we hope to capture is that some curator is interpreting some name according to some nomenclatural rules. A best practice for a GSD would be to "only include names used as if they were intended to be governed by a set of rules". Again, unenforceable, and nothing can be done about it.
See Citation in TaxonWorks.
Up to the curators, but if the curators are sane, 2. Thanks for "ambiregnal" btw!
Define clear rules for what exactly makes a name the same name. A scientific name in the sense of the Clearinghouse of CoL+ has an identity and a unique, stable identifier. If possible, these identifiers should reuse ids issued by the participating nomenclators like IPNI.
A name includes its authorship. Two homonyms with different authors therefore represent two different name entities.
The same name can usually be represented by many different strings which we refer to as lexical variations. For each name a standard representation, the canonical form, exists. Lexical variations exist for various reasons. Author spelling, transliterations, epithet gender, additional infrageneric or infraspecific indications or cited species authors in infraspecific names are common reasons. Listed here are 7 distinct names with some of their string representations:
New names (sp./gen. nov.), new recombinations of the same epithet (comb. nov.), a name at a new rank (stat. nov.) or replacement names (nom. nov.) are all treated as distinct names.
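The identity rules above can be sketched in a few lines of Python. This is a toy illustration, not the Clearinghouse implementation: the `Name` class, the `VARIANTS` lookup table and all example strings other than "Aus bus Linnaeus 1758" are made up, and a real system would need a name parser rather than a hand-written table. A name is modelled as canonical form plus authorship; lexical variations point at the same name, while a homonym by another author and a recombination are distinct names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """A name entity: canonical form plus authorship (hypothetical model)."""
    canonical: str
    authorship: str


# Made-up example names.
original = Name("Aus bus", "Linnaeus, 1758")
homonym = Name("Aus bus", "Smith, 1900")           # same spelling, different author
recombination = Name("Xus bus", "(Linnaeus, 1758)")  # comb. nov. of the same epithet

# Lexical variations mapped to the name entity they represent.
VARIANTS = {
    "Aus bus Linnaeus 1758": original,
    "Aus bus L.": original,
    "Aus (Xus) bus Linnaeus, 1758": original,
    "Aus bus Smith 1900": homonym,
    "Xus bus (Linnaeus 1758)": recombination,
}


def same_name(a: str, b: str) -> bool:
    """Return True if two known strings represent the same name entity."""
    return VARIANTS[a] == VARIANTS[b]


print(same_name("Aus bus Linnaeus 1758", "Aus bus L."))
print(same_name("Aus bus Linnaeus 1758", "Aus bus Smith 1900"))
```

Strings that are not in the table raise a `KeyError` here; deciding how unknown strings are matched to a name is exactly the part that the rules in this issue still need to define.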
Open questions to be addressed:
We need to capture examples of the various cases.