CoL+: first, do no harm #37

Closed
Mesibov opened this Issue Nov 1, 2017 · 13 comments


Mesibov commented Nov 1, 2017

I haven't noticed yet any comments about poor data quality in CoL+. It's like the smudge on the kitchen window — you know it's there, you will get around to cleaning it someday but it isn't a priority job. So the window remains smudged.

The "data quality" issues I'm talking about have nothing to do with whether a name use is backed with evidence, or whether an author misspelled a name or whether the correct authority has been cited or whether a URL is correct. CoL, like GBIF and WoRMS and many other aggregations, is riddled with low-level errors, like invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. As I wrote in a recent email to sp2000, these errors render a very large number of records completely useless for digital processing and tediously difficult for human processing.

Low-level errors first appeared in the aggregations mainly because incoming data were either not audited at all or were not audited carefully enough. They're still in the aggregations because existing data are either not audited at all or are not audited carefully enough.

They will persist in CoL+ unless this project uses data migration as an opportunity to clean data. The likelihood of this happening, to judge from the overall workplan and some correspondence I've had, is close to zero.

What's worse is that CoL+ may reduce further the quality of existing data. The Hippocratic principle "primum non nocere", "first, do no harm", was ignored when backbone taxonomies first appeared, to the horror of taxonomists and collection specialists. CoL+ hopes to fix the damage with explicit linking of names.

But CoL (and other aggregators) are guilty of data-mangling at lower levels, and with no plan to check for further data-mangling with the migration to CoL+, it will happen. In addition to character encoding failures and inadequate checking for duplicates, a surprisingly common stuff-up is truncation. As an example, CoL inconsistently truncates authors. There's an absolute limit of 100 characters, but shorter strings have also been truncated. In at least two source databases I'm aware of, the truncated strings appear full-length.
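One way a hard ceiling like that shows up, sketched very roughly below (the column name and tab-separated layout are assumptions about a DwC-A taxa.txt, not the actual CoL export): authorship strings pile up at exactly one maximum length, and chopped strings tend to end on an unbalanced bracket or a dangling comma.

```python
# Rough sketch, not CoL code: look for a pile-up of authorship strings at one maximum
# length (e.g. exactly 100 characters) and for strings that end suspiciously mid-phrase.
# The column name "scientificNameAuthorship" is an assumption about the file layout.
import csv
from collections import Counter

def truncation_report(path, column="scientificNameAuthorship"):
    lengths = Counter()
    suspects = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            value = (row.get(column) or "").strip()
            if not value:
                continue
            lengths[len(value)] += 1
            # An unbalanced "(" or a trailing comma/ampersand often signals a chopped string.
            if value.count("(") > value.count(")") or value.endswith((",", "&")):
                suspects.append(value)
    longest = max(lengths, default=0)
    # Many strings sitting at exactly the maximum observed length suggests a hard field limit.
    return longest, lengths[longest], suspects
```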

Every time I point out low-level data quality issues to aggregators I get the same answers (http://iphylo.blogspot.com.au/2016/04/guest-post-10-explanations-for-messy.html), which can be summarised as "We're aware of the problem, but..." Will CoL+ be any different?


rdmpage commented Nov 2, 2017

Hi Bob, I guess I have the inevitable reactions "does it matter?" and "if so, what do we do about it?".

By "does it matter?" I mean what are the incentives to produce high-quality data (or, conversely, the consequences of producing bad data?). In other words, do people actually think this matters. I suspect that part of the problem is a lack of drivers to ensure high quality. The Apple Maps suffered from major quality problems which led to Scott Forstall losing his job - we don't have anything like situation here. Given that the primary driver for an aggregator is to have more data, if there are no metrics for quality that people care about, or no competitor to worry about, quality is unlikely to be a driver. I know we've had offline discussions about this and your a little underwhelmed by this argument, but I think the social engineering side matters.

By "what do we do about it?" I mean what things can we practically do to catch quality issues before they come up. For example, can we envisage having a test suite that is run over the data (rather like the continuous integration tests you often see on GitHub repositories represented by badges such as "build passing"). What would a test site look like in this context?


Mesibov commented Nov 2, 2017

"does it matter?"

Yes, it matters. If aggregators aren't providing high-quality data and continually upgrading the quality of the data they provide, then aggregation isn't a service to the scientific community. It's a game being played at great expense by IT bods who enjoy doing clever things with databases and the Web.

And no, the service and the game aren't compatible. When scientific users say (as I've been hearing them say) "It's crap, I'd never use it for any real science", then the data managers and developers are shown to be playing a game for themselves, full stop. Here's what I received recently from a collection curator (aggregator name deleted):

"It is also not the worst problem with AGGREGATOR data. I would strongly suggest not using AGGREGATOR data in scientific pursuits. They apply a canonical matched name to the data we provide, and regularly screw up the taxonomy. No one replies to correspondence when you make them aware of issues. They often automatically substitute the correct authorities we provide with incorrect ones, and we have no idea how or why, or a way to predict which taxa will be affected."

It also matters because the aggregators are playing their game so badly. Those low-level errors in the data just wouldn't be tolerated by a database manager in a corporate enterprise. You'd struggle to find errors like that in the collection database of a small regional library system. They're Databasing 101 failures.

"what do we do about it?"

The "test suite" you want which will "run over" existing or incoming data won't fix the data, because there are so many different ways in which aggregated data are mangled. All a "test suite" can do is identify (some) problems, which you then have to fix with appropriate tools. Many of those fixes will require going back to the data provider with a list of questions. A "test suite" won't do that. There is no point to having a "test suite" if you don't act on what it finds.

I have a whole website devoted to finding and fixing low-level problems (https://www.polydesmida.info/cookbook/index.html) using fast, reliable command-line utilities. If you want to re-invent those wheels in some other language or developer's environment, you're welcome to do so.

There are already tools (both free-standing and in collection database software) for high-level geochecking and some taxonomic checks. Like the low-level problems, these higher-level problems will be identified but not fixed.

Flagging problems in an aggregated dataset so that the end-user is alerted that Something Is Wrong is avoiding responsibility for data quality. Setting up annotation and feedback mechanisms so that maybe some user, somewhere, at some time, can tell the database manager that Something Is Wrong is also avoiding responsibility for data quality.

All aggregators do some data cleaning, which demonstrates that "what do we do about it?" isn't a meaningless question, and that aggregators do take some responsibility for data quality. CoL+ is an opportunity to take a lot more responsibility and fix both legacy errors and the errors in future incoming data streams. If there isn't a budget or time allocation for doing this in the CoL+ planning, then CoL+ is committed to offering the same garbage that its predecessors do, and is no advance on any of them.

mdoering (Collaborator) commented Nov 2, 2017

Thanks for sharing your concerns, Bob. I would be very interested in learning more about these low-level errors in the current CoL. I am convinced CoL+ should indeed do a better job in this regard, if your concern is primarily bad character encodings, truncated strings, unresolved HTML/XML entities and other artefacts introduced by the digital world.

CoL+ builds on many libraries developed in GBIF over the last decade where we face the same issues. It is obviously not possible to automatically fix all errors, but quite a bit can be corrected automatically. And much more can at least be flagged as a potential issue to look at. Flagging records with detected (potential) issues is key to data quality in the clearinghouse.

We can also detect and flag other issues based on taxonomic or nomenclatural "business rules". For example, the species "Proagomphus mansuetus Attems, 1953" from the CoL (https://www.gbif.org/species/1014654) is flagged because it apparently was published before its genus, which was first published in 2005. It looks like this is in fact a recombination and the year should have been given in brackets.
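A hedged sketch of what one such rule could look like (the genus-year lookup and field handling below are illustrative stand-ins, not the GBIF or CoL+ implementation): if a binomial's authorship year predates the year its genus was established and the authorship is not bracketed, flag it as a likely recombination missing parentheses.

```python
# Illustrative only: flag a likely recombination when the authorship year predates the
# genus and the authorship is not in parentheses. The genus-year table is a stand-in
# for a nomenclator lookup; it is not a real CoL+ or GBIF data structure.
import re

GENUS_YEAR = {"Proagomphus": 2005}  # hypothetical lookup; real data would come from a nomenclator

def flag_possible_recombination(scientific_name, authorship):
    genus = scientific_name.split()[0]
    genus_year = GENUS_YEAR.get(genus)
    year_match = re.search(r"(\d{4})", authorship)
    if not genus_year or not year_match:
        return None
    year = int(year_match.group(1))
    if year < genus_year and not authorship.strip().startswith("("):
        return (f"{scientific_name} {authorship}: authorship year {year} predates genus "
                f"{genus} ({genus_year}); likely a recombination, authorship may need brackets")
    return None

print(flag_possible_recombination("Proagomphus mansuetus", "Attems, 1953"))
```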

mdoering (Collaborator) commented Nov 2, 2017

And yes, Bob, a lot needs to be fixed in the sources and can only be flagged in aggregators. Otherwise they would not aggregate but manage the data directly (which I am a big fan of, but our community, especially taxonomists, not so much). The idea of the Clearinghouse is central to CoL+: feeding back issues and new data and suggesting corrections, but keeping the ultimate decision with the sources. Personally I am not convinced that this feedback-loop design is going to work; I've seen it fail too many times before. If that happens again (in some taxonomic groups), CoL+ will fall back to centrally/community-managed data for those groups, where things can be fixed immediately.


rdmpage commented Nov 2, 2017

@Mesibov That will teach me to be a bit flippant. By "does it matter?" I'm really asking whether the goals users might have (high-quality taxonomic data) and data providers might have (their data gets more visibility and is augmented in useful ways, e.g. links to literature) are aligned with the drivers of the project. If they are, then it "matters" in the sense that the things users and providers want are the things that determine the success of the project. Even better if users and providers have some clout, e.g. money. For example, imagine data providers pay a membership fee to join. They would then expect benefits (e.g. augmented names); if the project doesn't provide them, they don't pay.

Call me cynical, but I don't think appeals to abstract notions of "good science" actually carry much weight unless they are backed up by clear incentives. Obviously we want great data that is reliable and useful; the question is how you engineer a project that has to deliver that in order to succeed (as an aside, I don't buy that corporate data is always great, e.g. early Apple Maps, and I spend a lot of time with commercial publishing data, which is messy).

The drivers for CoL+ as far as I can see are two-fold:

  1. CoL needs a refresh, the project is struggling and carries a lot of technical debt - have you seen the database schema :O

  2. GBIF wants to hand over the responsibility of names to somebody else, preferably someone with money so GBIF doesn't have to pay.

CoL+ doesn't have drivers that align with the goals of data providers or end users. The money has been given already, the goals are 1 and 2 above. I'm not saying it can't deliver good quality data, I'm saying that there aren't compelling drivers such as "you will go out of business and lose your home if this fails".

To me it's a bit like the difference between corporate software and consumer software. Corporate software tends to be horrible to use because usability isn't a driver; avoidance of risk is (will this software crash and take my company website offline, will the company that sold it be around to support it next year). Consumer software has to be easy to use or nobody will use it, end of story. If we could figure out a way to align the goals of the data providers, the users, and the project, there is a much higher chance it will deliver.


Mesibov commented Nov 2, 2017

I'll respond to parts of both Markus' and Rod's posts by saying that the current CoL, as evidenced in the 2017 Annual Checklist (DwC-A) I've downloaded, is not so much bad science as bad informatics. No one's asked me to do a full audit of low-level errors, so I haven't. I looked at taxa.txt and references.txt and found:

  • truncated data items
  • invalid data items, like non-numeric entries and impossible years in a date (year) field
  • incorrectly formatted data items, like non-capitalised genus and subgenus names and incorrectly punctuated strings
  • incorrect use of Darwin Core fields
  • character encoding failures where characters were replaced by gibberish strings or HTML equivalents (potentially reversible failures)
  • character encoding failures where characters were replaced by question marks (not reversible, original character lost forever)
  • invisible gremlin characters ranging from ancient ANSI control characters to vertical tabs to non-breaking spaces, and not appended to data, but embedded in strings

You don't have to be a taxonomist to find and recognise these, any more than you need to be an author to find spelling and punctuation mistakes in a publication. Since the download is simply a set of text files, each of these error types can be found in a few seconds. It didn't happen, or if it did, someone asked "does it matter?".
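For what it's worth, a rough sketch of how a few of the error types in that list could be scanned for on the raw text files (the patterns are illustrative and deliberately crude, not an exhaustive or official check):

```python
# Rough, illustrative scans over the raw text of a checklist file. Each pattern targets
# one of the error types listed above; none of them is exhaustive.
import re

PATTERNS = {
    "unresolved HTML/XML entity": re.compile(r"&[a-zA-Z]+;|&#\d+;"),
    "likely mojibake (Ã, Â or the replacement character)": re.compile("[ÃÂ\uFFFD]"),
    "question mark embedded in a word (lost character)": re.compile(r"[A-Za-z]\?[A-Za-z]"),
    "control character or non-breaking space inside the data": re.compile(
        "[\x00-\x08\x0b\x0c\x0e-\x1f\u00a0]"),
}

def scan(path):
    hits = {label: 0 for label in PATTERNS}
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            for label, pattern in PATTERNS.items():
                if pattern.search(line):
                    hits[label] += 1
    return hits  # count of lines matching each pattern
```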

I've previously suggested to another aggregator that they audit incoming data and politely ask the data provider to fix provider-side errors before the data are accepted. It was suggested in reply that it was the data provider's job to make sure the data are OK, and the aggregator's job to flag (ineffectually) some of the problems before the data go live. This explains why the data published by that particular aggregator are the lowest in quality of any I've audited.

We're moving away from the original point of my first post, which is that aggregators can and do lower the quality of the data they aggregate. At a high level it's by tinkering with taxonomy (see my last post), at a low level it's by truncating data items and introducing further character encoding failures. It would be nice if CoL+ not only strenuously avoided doing harm, but also actively repaired the harm done by contributors.

If CoL+ doesn't do this, why will it bother to aggregate and publish data? Just to (1) refresh its database management and (2) help alleviate the taxonomic woes of an inadequately funded GBIF (leaving GBIF's many other data issues for an inadequately funded GBIF to fix), as Rod suggests?

If there isn't data quality management baked into CoL+ before it gets off the ground, then Rod's suggestions are perfectly acceptable. If CoL+ informatics people are content with tolerating errors that informatics people can quickly find and fix, then how can this CoL+ exercise be taken seriously by end users?

dremsen (Collaborator) commented Nov 2, 2017

It would be useful to clarify the nature of the data quality issues in order to identify the best strategies to solve them. Bob provided a starting list in his last post. Most of these appear to be textual, syntactic or nomenclatural issues, none of which have to do with the semantic qualities of a taxon assertion. Solving these problems does not need to fall solely on an aggregator, an individual GSD curator or even a nomenclator. I favor the approach I know Rod has proposed, which is a separate and much more widely curated nomenclatural system that opens up curation of objective syntactic information to a wider group, with incentives to curate these data. The goal is to provide a common mechanism to establish a single record of authority for a name that the CoL checklist, as well as any others who reference names in order to assert taxonomic views, can access and utilize.

Why is the spelling of a name, author or citation still an issue when many of these facts were captured in the original publication, then in Neave, Sherborn, and any number of other paper and digital publications? Why does any taxonomist have to go back and dig up these papers to re-verify these facts over and over again while the CoL is peppered with gremlins, control characters and gibberish? Make a system that puts the best sets of these data into a controlled but relatively open-access system. Let people argue, but ultimately verify the facts. Get the original publications linked so anyone can see for themselves, and then put services on these authority records so they are easy to use. Make them consistent across the codes so it's easy for everyone. Anything else is just politics.


Mesibov commented Nov 2, 2017

"The goal is to provide a common mechanism to establish a single record of authority for a name that the CoL checklist, as well as any others who reference names in order to assert taxonomic views, can access and utilize."

Which brings up the mid-level problem with CoL: tens of thousands of duplicate records.

Being old and cynical, I can speculate that in the time spent arguing the "politics" of aggregation in recent years, a competent digital librarian or data scientist would have fixed all the CoL issues and would be halfway through GBIF's. But neither of those aggregators employs digital librarians or data scientists, and I'm guessing that CoL+ won't employ one either.

dremsen (Collaborator) commented Nov 2, 2017

This comment has been minimized.

mjy commented Nov 2, 2017

This comment has been minimized.


Mesibov commented Nov 2, 2017

@mjy: The "tens of thousands" was based on a 2016 estimate from taxa.txt and is now (happily) incorrect. For an earlier study on beetle data I downloaded the CoL mid-2016 snapshot and found nearly 8% of all the beetle records were duplicate pairs, usually thanks to variants in taxonomic authority, as here:

7820639 urn:lsid:catalogueoflife.org:taxon:6d75e69f-e478-11e5-86e7-bc764e092680:col20160624 39 WTaxa in Species 2000 & ITIS Catalogue of Life: 26th June 2016 28021358 provisionally accepted name species Zyzzyva ochreotecta Casey, T.L. , 1922 Animalia Arthropoda Insecta Coleoptera Curculionoidea Curculionidae Zyzzyva Zyzzyva ochreotecta Casey, T.L. , 1922 Alonso-Zarazaga M.A.& Lyal C.H.C. 2010 Confidence Level 1: data from secondary source only Wtx-62669 http://www.catalogueoflife.org/annual-checklist/details/species/id/5640bc441da8e026bdab5d2cb48da8e8 false

7842495 urn:lsid:catalogueoflife.org:taxon:5ddfddcb-e478-11e5-86e7-bc764e092680:col20160624 39 WTaxa in Species 2000 & ITIS Catalogue of Life: 26th June 2016 28021358 provisionally accepted name species Zyzzyva ochreotecta Casey , 1922 Animalia Arthropoda Insecta Coleoptera Curculionoidea Curculionidae Zyzzyva Zyzzyva ochreotecta Casey , 1922 Alonso-Zarazaga M.A.& Lyal C.H.C. 2010 Confidence Level 1: data from secondary source only Wtx-100706 http://www.catalogueoflife.org/annual-checklist/details/species/id/355650daee953abd8f5899df308cc626 false

Extrapolating to the whole of the CoL dataset got to "tens of thousands". I'm pleased to see that these have been cleaned up in the 2017 Annual Checklist, e.g.

30150085 urn:lsid:catalogueoflife.org:taxon:5642e5a1-d35f-11e6-9d3f-bc764e092680:col20170225 39 WTaxa in Species 2000 & ITIS Catalogue of Life: 27th February 2017 33377184 provisionally accepted name species Zyzzyva ochreotecta Casey, 1922 Animalia Arthropoda Insecta Coleoptera Curculionoidea Curculionidae Zyzzyva Zyzzyva ochreotecta Casey, 1922 Alonso-Zarazaga M.A.& Lyal C.H.C. Oct-2016 Wtx-100706 http://www.catalogueoflife.org/annual-checklist/details/species/id/357288dd1e64331e5e4827589de57b10 false

The remaining duplicate candidates are taxonomic puzzles, e.g.

8673504 urn:lsid:catalogueoflife.org:taxon:32d1d094-d35f-11e6-9d3f-bc764e092680:col20170225 101 Systema Dipterorum in Species 2000 & ITIS Catalogue of Life: 27th February 2017 33366947 accepted name species Eudorylas loewii (Kertesz, 1900) Animalia Arthropoda Insecta Diptera Pipunculidae Eudorylas Eudorylas loewii (Kertesz, 1900) Systema Dipterorum working record 2010 Sys-66491 http://www.catalogueoflife.org/annual-checklist/details/species/id/9e787dfb4db5bc15577243bfbd27192e false

8673505 urn:lsid:catalogueoflife.org:taxon:32d1d203-d35f-11e6-9d3f-bc764e092680:col20170225 101 Systema Dipterorum in Species 2000 & ITIS Catalogue of Life: 27th February 2017 33366947 accepted name species Eudorylas loewii (Kertesz, 1872) Animalia Arthropoda Insecta Diptera Pipunculidae Eudorylas Eudorylas loewii (Kertesz, 1872) Systema Dipterorum working record 2010 Sys-66492 http://www.catalogueoflife.org/annual-checklist/details/species/id/d2a2291e510cdc5545a8ec01d09bffc0 false

8673506 urn:lsid:catalogueoflife.org:taxon:32d1d4db-d35f-11e6-9d3f-bc764e092680:col20170225 101 Systema Dipterorum in Species 2000 & ITIS Catalogue of Life: 27th February 2017 33366947 accepted name species Eudorylas loewii (Kertesz, 1903) Animalia Arthropoda Insecta Diptera Pipunculidae Eudorylas Eudorylas loewii (Kertesz, 1903) Systema Dipterorum working record 2010 Sys-66493 http://www.catalogueoflife.org/annual-checklist/details/species/id/39e0a8136c0943815e2fe6efa06be5e6 false

8673507 urn:lsid:catalogueoflife.org:taxon:32d1d371-d35f-11e6-9d3f-bc764e092680:col20170225 101 Systema Dipterorum in Species 2000 & ITIS Catalogue of Life: 27th February 2017 33366947 accepted name species Eudorylas loewii (Kertesz, 1910) Animalia Arthropoda Insecta Diptera Pipunculidae Eudorylas Eudorylas loewii (Kertesz, 1910) Systema Dipterorum working record 2010 Sys-66494 http://www.catalogueoflife.org/annual-checklist/details/species/id/fef68cd392aca98958238b809ff2c1cb false
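A rough sketch of the sort of duplicate-candidate check involved (field names here are assumptions about a DwC-A taxa.txt, not the actual CoL schema): normalise the name-plus-authorship string by stripping initials, punctuation and spacing, then group records that collapse onto the same key, so that "Casey, T.L. , 1922" and "Casey , 1922" land in one bucket.

```python
# Illustrative sketch: group rows by a normalised scientificName so that authority
# variants such as "Casey, T.L. , 1922" and "Casey , 1922" become duplicate candidates.
# Column names "scientificName" and "taxonID" are assumptions about the file layout.
import csv
import re
from collections import defaultdict

def normalise(name_with_author):
    s = name_with_author.lower()
    s = re.sub(r"\b[a-z]\.", "", s)       # drop author initials such as "T.L."
    s = re.sub(r"[^a-z0-9]+", " ", s)     # collapse punctuation and stray spacing
    return s.strip()

def duplicate_candidates(path):
    buckets = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            key = normalise(row.get("scientificName") or "")
            if key:
                buckets[key].append(row.get("taxonID"))
    return {key: ids for key, ids in buckets.items() if len(ids) > 1}
```

Candidates would still need a human eye: the Eudorylas loewii records above fall onto different keys because their years differ, so a purely string-based pass would not (and should not) merge them automatically.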

If you're asking about low-level errors generally in GSDs, then no, there's no one culprit; the errors are spread across many GSDs. That said, the prize for the highest ratio of errors per contributed record probably goes to the Reptile Database.


mjy commented Nov 2, 2017

Thanks for the clarification @Mesibov

mdoering (Collaborator) commented Mar 15, 2018

The Reptile Database is loaded with chresonyms, which I have pulled out into a distinct issue, #39, dealing with homonyms.

mdoering closed this Sep 27, 2018
