Include Scrutinizer and ScrutinizerDate in COLDP exports #3464

Closed
dhobern opened this issue Jun 24, 2023 · 21 comments
Labels
enhancement Suggest an improvement to an existing function.

Comments

@dhobern

dhobern commented Jun 24, 2023

Feature or enhancement

I need to start deriving metrics for progress in cleaning up the Global Lepidoptera Index. One of these relates to how many taxon records have been modified for each family each year. I would like to be able to assess this quickly as part of processing the COLDP exports, but there is no timestamp information in the export. The best way to do this would be to include your Updated By and Updated At values in the COLDP Taxon fields scrutinizer and scrutinizerDate, which are currently blank.

Location

Catalogue of Life (CoLDP) exports task

Screenshot, napkin sketch of interface, or conceptual description

No response

Your role

Data curator / biodiversity informatician

@dhobern added the enhancement label Jun 24, 2023
@proceps
Contributor

proceps commented Jun 24, 2023

@dhobern, we already have two ways to see these stats in TW. In Filter Nomenclature, you can search by updated date range and by the person who made the change. Add the family name with descendants to the search parameters to get the stats for a single family. We also have a specialized task called Project Activity, where you can track, for each user of the project, how many records were created or updated (not just in taxonomy but in any TW model), how much time was spent, the number of records edited per hour, etc. Please try it and let us know if this satisfies your needs.

@dhobern
Author

dhobern commented Jun 24, 2023

Thanks @proceps - yes, I realise I can do this, but I really want to start automating various metrics for datasets in ChecklistBank, and especially those that are components for COL, so I'd like to have the necessary information exposed there. Adding these two fields to the export felt like it should be a small/quick matter.

@mjy
Member

mjy commented Jun 25, 2023

@dhobern We have a "Verifier" role that would add more explicit semantics than the Housekeeping created/updated. If we created the ability to batch add this role/metadata via the TaxonName filter (and see its individual use on the radial annotator), would that be a better way to record this data? I'm hesitant to overload the semantics of housekeeping updated/created by linking them to things like Scrutinizer.

@dhobern
Author

dhobern commented Jun 25, 2023

Hmmm - I fixed around 2000 names over the weekend (combinations included) and would like to be able to develop a traffic-light-based map for the whole insect order to understand how recently each name was touched by someone - without seriously slowing myself down.

The Created/Updated housekeeping elements are really useful for anyone using TW to clean up dirty data, and I don't have time to edit yet another radial link each time I'm working my way through tidying a name. So I'm not sure the Verified flag would fit what I need (although I guess it would be nice to have such a flag I could use at the level of subgenera and up so I could easily mark sections that I believe are currently complete).

The scrutinizer and scrutinizerDate fields in COLDP are really mainly used as a modifiedBy/modifiedAt pair, so what you already have seems a perfect fit to me.
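
As a rough illustration of the traffic-light idea, the sketch below maps the date a name was last touched to a colour. The function name and the thresholds (one year for green, three years for amber) are assumptions made for this example only; they do not come from TaxonWorks, COLDP, or ChecklistBank.

```python
from datetime import date


def traffic_light(last_touched, today, green_days=365, amber_days=1095):
    """Classify a record by how recently it was touched.

    Thresholds are hypothetical defaults, not part of any standard.
    A record with no date at all is treated as the worst case.
    """
    if last_touched is None:
        return "red"
    age = (today - last_touched).days
    if age <= green_days:
        return "green"
    if age <= amber_days:
        return "amber"
    return "red"


# Example with made-up dates
print(traffic_light(date(2023, 6, 24), date(2023, 6, 26)))
```

Aggregating these colours per genus or family would give the kind of overview map described above.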

@mjy
Member

mjy commented Jun 26, 2023

@dhobern Curious: would an internal report meet your needs, or does ChecklistBank have some capabilities to do this? I.e. it seems this is going to be of use to others as well. If you have a sample plot (napkin sketch) and/or table, please share. We can make a generalized report that loops over valid family names pretty easily.

@dhobern
Author

dhobern commented Jun 26, 2023

Thanks @mjy - I couldn't generate my report internally in TaxonWorks - I need to integrate with the datasets that COL uses to replace some of the worst sections in GLI. Right now, this would be external to ChecklistBank, but COL wants to increase the internal metrics for all datasets there, so I would expect some components would migrate into the CLB functionality.

I suppose my general thought is that this is useful contextual information for many analyses and presentations of the exported data and it seems that it should be part of what a user gets when they download from TW.

@mjy
Member

mjy commented Jun 27, 2023

I would suggest COL needs to implement Housekeeping concepts. Too often something like this gets done, and it gets sloppy. This seems particularly important when we are trying to give attribution to people for what they actually did (we can't promise every updater is a scrutinizer), and when we try to record proper metadata provenance trails.

I don't mind adding this, but it's going to have big stars all over it. We literally just introduced the Georeferencer role to deal with this exact problem (we were attributing Georeferencer to people who added data, not people who did the Georeferencing).

@dhobern
Author

dhobern commented Jun 27, 2023

Thanks @mjy - let me discuss with Markus - maybe we can add optional timestamp elements to all COLDP tables and use those instead.

@mdoering

Adding modified/modifiedBy to ColDP for all records makes sense to me and is already part of the database model anyhow. I am having more problems with the scrutinizer property, which has existed since the start of COL. There hardly seem to be any sources out there that track this concept, and most often you find the housekeeping modifiedBy being used for it instead.

@mdoering

ColDP doesn't mind extra columns, so you could already include two new columns (I would probably call them modified & modifiedBy) if you want, while I work on supporting them in CLB.

@mdoering

CatalogueOfLife/coldp#73

@dhobern
Author

dhobern commented Jun 29, 2023

Thanks @mdoering and @mjy - based on this discussion, I would be thrilled if TW could include modified (date or datetime) and modifiedBy in (each of) the CSV outputs included in the TW COLDP downloads.
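
For anyone wanting to derive the per-family, per-year metric from such an export, here is a minimal sketch. It assumes a CSV with columns named `family` and `modified` (ISO 8601 date or datetime); those column names, the helper name, and the sample rows are illustrative assumptions, and the real export files may name or arrange things differently.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def modified_counts(rows, family_col="family", modified_col="modified"):
    """Count records per (family, year of last modification).

    Column names are assumptions for this sketch, not a confirmed
    description of the TW COLDP export.
    """
    counts = Counter()
    for row in rows:
        stamp = (row.get(modified_col) or "").strip()
        if not stamp:
            continue  # skip records without a modification timestamp
        year = datetime.fromisoformat(stamp).year
        counts[(row.get(family_col) or "unknown", year)] += 1
    return counts


# Made-up sample data standing in for one exported CSV file
sample = io.StringIO(
    "ID,family,modified,modifiedBy\n"
    "1,Noctuidae,2023-06-24,dhobern\n"
    "2,Noctuidae,2023-06-25T09:30:00,dhobern\n"
    "3,Geometridae,2022-11-02,smith\n"
    "4,Geometridae,,\n"
)
counts = modified_counts(csv.DictReader(sample))
for (family, year), n in sorted(counts.items()):
    print(family, year, n)
```

If the taxon file does not carry the family directly, the family would have to be resolved through the parent hierarchy first; that step is left out here.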

@mjy
Member

mjy commented Jun 30, 2023

@mdoering @dhobern Sounding better, now we have to do one better and leapfrog the oldness. We need to be able to include ORCiD or Wikidata or other global identifiers as pointers to the Person/People in question. I believe TDWG is moving forward with something like identifiedByID, so perhaps we can have updatedByID and createdByID in addition to updatedBy and createdBy (agent name)? Note I'm not necessarily advocating for also including the created concept, but it is standard Housekeeping in hopefully all "modern" approaches.

@mdoering

It is standard, but I rarely see any use for the created fields. Do you? I'm happy to also have modifiedByID, but I wonder if we can just place the ORCID in modifiedBy in case we have that. I know you don't like overloading...

@mjy
Member

mjy commented Jun 30, 2023

I do like created from a time perspective; lag/latency means a lot. We have found it useful in filtering results as well when we are trying to track down provenance-related issues.

Classic ontology-related responses, from what I've experienced, include the ID and a human-readable label, even if redundant. If we just pass ORCiD then you're going to have to look up names if you add any functionality on top of the dataset. If you're not planning to do any of that, then one field should be fine, I think.

@mdoering

I will look up ORCIDs anyway and also have to find a way to manage users, similar to how we already track local CLB users. I would actually love to only ever see ORCIDs or other resolvable identifiers instead of usernames like smith or green88 that are surely not globally unique. That is rather unlikely to exist in most systems though.

@mjy
Member

mjy commented Jun 30, 2023

> I would actually love to only ever see ORCIDs

Perfect. We'll send ORCiDs or names in that field then.
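
Since that one field may then hold either an ORCID or a plain name, a consumer has to tell the two apart. The sketch below is one possible way to do that, not TW or CLB behaviour: it accepts a bare ORCID or an `https://orcid.org/` URL and validates the final character with the ISO 7064 MOD 11-2 checksum that ORCID uses. The function names are made up for this example.

```python
import re

# Bare ORCID or full orcid.org URL; the last character may be a digit or X.
ORCID_RE = re.compile(
    r"^(?:https?://orcid\.org/)?(\d{4}-\d{4}-\d{4}-\d{3}[\dX])$"
)


def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def classify_modified_by(value):
    """Return ("orcid", id) for a valid ORCID, else ("name", value)."""
    value = value.strip()
    match = ORCID_RE.match(value)
    if match:
        orcid = match.group(1)
        digits = orcid.replace("-", "")
        if orcid_check_digit(digits[:15]) == digits[15]:
            return ("orcid", orcid)
    return ("name", value)


print(classify_modified_by("https://orcid.org/0000-0002-1825-0097"))
print(classify_modified_by("green88"))
```

Anything that fails the pattern or the checksum is treated as a name, so a mistyped ORCID degrades to a label instead of being trusted as an identifier.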

@dhobern
Author

dhobern commented Jun 30, 2023

Great - thank you both so much

@mjy closed this as completed in 6fec434 Jul 5, 2023
@dhobern
Author

dhobern commented Jul 5, 2023

@mjy - Thanks. When will we see this change in the site? I just created a small COLDP export for a genus and it seems not yet to be included.

@mjy
Member

mjy commented Jul 5, 2023

In general, once the most recent CHANGELOG shows a release number in front of the changes, they are live. We're hoping to have it live this week or early next.

@dhobern
Author

dhobern commented Jul 5, 2023

Thanks @mjy

4 participants