Data quality checks gap analysis #13

djtfmartin · 2020-01-16T11:44:56Z

We need to identify the gaps between ALA's current suite of data quality tests and GBIF's.
The follow-up to this work would be to implement any missing tests in pipelines, ideally supported in GBIF's core implementation.

A list of ALA's tests can be derived from here:

ALA Assertion codes

The GBIF pipeline equivalent is here:

GBIF Occurrence issues

and

GBIF Name usage issues
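As a rough illustration of the comparison described above, the gap can be treated as a set difference between the two code lists. This is only a sketch: `find_gaps`, the `equivalents` mapping and every `EXAMPLE_*` name are hypothetical, and the sample inputs are placeholders rather than the real ALA or GBIF lists.

```python
def find_gaps(ala_codes, gbif_issues, equivalents=None):
    """Return ALA codes that have no GBIF counterpart, sorted by name.

    `equivalents` optionally maps an ALA code to the GBIF issue name
    considered to be the same check under a different name.
    """
    equivalents = equivalents or {}
    gbif = set(gbif_issues)
    return sorted(
        code
        for code in set(ala_codes)
        if equivalents.get(code, code) not in gbif
    )


# Placeholder inputs only; the real lists come from the linked pages.
ala_codes = [
    "UNRECOGNISED_COLLECTION_CODE",
    "RECORDED_BY_UNPARSABLE",
    "EXAMPLE_SHARED_CHECK",   # hypothetical: same name on both sides
    "EXAMPLE_ALA_NAME",       # hypothetical: renamed on the GBIF side
]
gbif_issues = ["EXAMPLE_SHARED_CHECK", "EXAMPLE_GBIF_NAME"]
equivalents = {"EXAMPLE_ALA_NAME": "EXAMPLE_GBIF_NAME"}

gaps = find_gaps(ala_codes, gbif_issues, equivalents)
print(gaps)
```

Codes that come back from a comparison like this would still need manual review, since two tests with different names may check the same thing.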

RobinaSanderson · 2020-04-09T04:24:17Z

Hi @djtfmartin - Is the GBIF Fact or Measurement transform where they do their data quality checking, or is there another transform missing from our diagram: https://confluence.csiro.au/display/ALASD/Process+overview+and+issues?
I'm wondering if I should put in a dotted box on the diagram in the ALA space for data quality checks in case the outcome of this issue results in additional data quality checking. Otherwise, it can wait until we know one way or the other.

djtfmartin · 2020-04-29T06:26:30Z

Thanks @RobinaSanderson. The data quality checks are in various transforms (Location, Temporal, Taxonomy).

RobinaSanderson · 2020-04-29T06:34:59Z

Thanks, I see... Does it make sense to look at all the data quality checks as this one issue, or to split them up into separate issues for each transform if that is where they happen? Or do we need to do this task to get an overview of all the data quality work, and then have individual tasks for where we have to update/add to transforms, when we know what they are?

Tasilee · 2020-04-29T06:45:17Z

I am wondering whether we should put a lot of effort into this when there was an undertaking some time ago from GBIF, ALA and iDigBio to implement the TDWG TG2 Core Tests. Admittedly, that was under the previous administration (John and Donald at least).

And BTW, I have a table somewhere with all the tests from various agencies that I could find. I will see if I can find it. This was my start point for the TG2 work.

djtfmartin · 2020-04-29T07:08:36Z

Thanks @RobinaSanderson, I think separate tasks. We have tasks for Location #22 and Taxonomy #26, but we need a separate task for Temporal (EventProcessor in biocache-store). We also need tasks for the functionality handled in TypeStatusProcessor, BasisOfRecordProcessor and the other processors.

https://github.com/AtlasOfLivingAustralia/biocache-store/tree/master/src/main/scala/au/org/ala/biocache/processor

RobinaSanderson · 2020-04-29T21:42:39Z

Hi @djtfmartin I've created the following gap analysis issues:
EventProcessor - #55
TypeStatusProcessor - #56
BasisOfRecordProcessor - #57
These are placeholders at the moment and will need more detail.

I will take a look at the link you gave above for further processes later. Sorry, I've got another piece of work to finish today.

javier-molina · 2020-08-25T06:20:43Z

This and #125 are related activities

charvolant · 2021-04-19T04:53:22Z

Assertion codes not used (set) in biocache-store and biocache-service

BADLY_FORMED_ALTITUDE
AMBIGUOUS_NAME
PROCESSING_ERROR - DAOs check for this but it doesn't seem to be set anywhere
MEDIA_REPRESENTATIVE
MEDIA_UNREPRESENTATIVE

Assertion codes only used in biocache-service

VERIFIED

Javier: From Planning Meeting 21-Apr-21. The above are now legacy assertions not in use and it is safe not to carry to LA Pipelines implementation.

charvolant · 2021-04-21T03:56:01Z

Things potentially missing from pipelines.

Generic user assertion codes. It's unclear how or whether these should be implemented, since they're really part of the biocache-service and derived from a Cassandra table. The LocationRecord has a hasGeospatialIssue flag on it, but it is never set and no other record has a similar flag.

Javier: User assertions will continue to be managed in biocache-service, stored in Cassandra and exported to the Solr index. We are adopting the GBIF "Spatially invalid" flag, but I'm not sure if that is related to LocationRecord.hasGeospatialIssue.

COORDINATE_HABITAT_MISMATCH Believed to be deprecated

Javier: Confirmed as per https://confluence.csiro.au/display/CIU/2020-12-01+Reference+Group

DETECTED_OUTLIER Implemented with an outlier count in the index but no flag.
INFERRED_DUPLICATE_RECORD - Detected by the clustering pipeline but not flagged

Javier: Both assertions above have a way to be determined by pipelines fields hence they will not be implemented. DQ Profiles already use outlier_layer_count and duplicate_status fields.
Miles: I can't find any information on how these flags are set or how they might differ from outlier_layer_count and duplicate_status. I can say they are not currently used in the data profiles, so dropping them won't impact the profiles as they stand.
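For illustration, the two legacy assertions could be recovered from the index fields mentioned above if a consumer ever needed them. This is a sketch under assumptions: treating any `outlier_layer_count` above zero as an outlier, and the `"DUPLICATE"` value for `duplicate_status`, are both guesses not confirmed in this thread, and `derived_assertions` is a hypothetical helper.

```python
def derived_assertions(record, duplicate_values=("DUPLICATE",)):
    """Derive legacy-style assertion names from index fields.

    Assumptions (not confirmed in the thread):
    - outlier_layer_count > 0 means the record was detected as an outlier
    - duplicate_values lists the duplicate_status values that mark a duplicate
    """
    flags = []
    # A missing or empty count is treated as zero.
    if int(record.get("outlier_layer_count") or 0) > 0:
        flags.append("DETECTED_OUTLIER")
    if record.get("duplicate_status") in duplicate_values:
        flags.append("INFERRED_DUPLICATE_RECORD")
    return flags


example = {"outlier_layer_count": 2, "duplicate_status": "DUPLICATE"}
print(derived_assertions(example))
```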

COORDINATE_PRECISION_MISMATCH - see Location Processor Gaps #285

Javier: Doug will raise an issue in GBIF pipelines project and we will ask GBIF to address it.

SPECIES_OUTSIDE_EXPERT_RANGE - No expert range processing at present

Miles: I can't find any records with this assertion currently set; however, it has been generated in the past and is a useful quality metric. This field indicates whether the coordinates match the expert distribution layer for the species. There are only layers for birds and fish.
Dave: we can only implement SPECIES_OUTSIDE_EXPERT_RANGE if we do the intersect with expert distributions - and the project steering committee agreed we’d drop that functionality

NAME_NOT_IN_NATIONAL_CHECKLISTS - Believed to be deprecated

Miles: No records with this assertion; I don't know if it was ever used. You can find out which name list a name comes from by following the link to the species page, but this information is in the index (as far as I can tell).
Javier: For SPECIES_OUTSIDE_EXPERT_RANGE and NAME_NOT_IN_NATIONAL_CHECKLISTS it is not clear how they might be used or how they were added originally. It is accepted that they are deprecated in LA Pipelines project (Planning Meeting 21-Apr-21)

UNRECOGNISED_COLLECTION_CODE - Mapped by ALACollectionLookup, but a failed lookup is not flagged
UNRECOGNISED_INSTITUTION_CODE - Mapped by ALACollectionLookup, but a failed lookup is not flagged

Javier: UNRECOGNISED_COLLECTION_CODE and UNRECOGNISED_INSTITUTION_CODE will be implemented in #303
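A minimal sketch of what flagging a failed lookup might look like. The behaviour of ALACollectionLookup is not shown in this thread, so the function name, the two lookup sets and the decision to key collections on the (institution, collection) pair are all assumptions; only the Darwin Core terms `institutionCode` and `collectionCode` and the two assertion names are taken as given.

```python
def flag_unrecognised(record, known_institutions, known_collections):
    """Return assertion names for codes that fail a registry lookup.

    known_institutions: set of recognised institution codes (hypothetical)
    known_collections: set of (institutionCode, collectionCode) pairs (hypothetical)
    """
    issues = []
    inst = record.get("institutionCode")
    coll = record.get("collectionCode")
    # Only flag codes that were supplied; a missing code is a different problem.
    if inst and inst not in known_institutions:
        issues.append("UNRECOGNISED_INSTITUTION_CODE")
    if coll and (inst, coll) not in known_collections:
        issues.append("UNRECOGNISED_COLLECTION_CODE")
    return issues


institutions = {"INST_A"}
collections = {("INST_A", "COLL_1")}
print(flag_unrecognised(
    {"institutionCode": "INST_A", "collectionCode": "COLL_9"},
    institutions,
    collections,
))
```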

RECORDED_BY_UNPARSABLE - Unparsable names are not flagged by CollectorNameParser

Javier: A collector name can even be an ID, and that is valid; hence this was an opinionated assertion that is no longer valid.
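The outcomes recorded in this comment can be summarised as a small lookup table, which makes it easy to list what still needs work. The disposition labels (`deprecated`, `covered_by_field`, `raise_with_gbif`, `implement`, `dropped`) are shorthand invented for this sketch; the assignments follow the replies above.

```python
# Disposition of each assertion code, as recorded in the replies above.
DISPOSITIONS = {
    "COORDINATE_HABITAT_MISMATCH": "deprecated",
    "DETECTED_OUTLIER": "covered_by_field",           # outlier_layer_count
    "INFERRED_DUPLICATE_RECORD": "covered_by_field",  # duplicate_status
    "COORDINATE_PRECISION_MISMATCH": "raise_with_gbif",
    "SPECIES_OUTSIDE_EXPERT_RANGE": "deprecated",
    "NAME_NOT_IN_NATIONAL_CHECKLISTS": "deprecated",
    "UNRECOGNISED_COLLECTION_CODE": "implement",      # tracked in #303
    "UNRECOGNISED_INSTITUTION_CODE": "implement",     # tracked in #303
    "RECORDED_BY_UNPARSABLE": "dropped",
}


def codes_with(disposition):
    """Return the codes that share a given disposition, sorted by name."""
    return sorted(c for c, d in DISPOSITIONS.items() if d == disposition)


to_implement = codes_with("implement")
print(to_implement)
```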

javier-molina · 2021-06-10T01:53:19Z

Everything in the 21 Apr comment above has been addressed.

djtfmartin added the help wanted and PoC-Required labels Jan 16, 2020

djtfmartin added the 2-weeks label Feb 19, 2020

RobinaSanderson removed the 2-weeks-dev-task and PoC-Required labels Mar 17, 2020

javier-molina added the not-in-diagram label Mar 18, 2020

javier-molina removed the not-in-diagram label Apr 3, 2020

M-Nicholls mentioned this issue Apr 8, 2020

Assess Data Quality AtlasOfLivingAustralia/DataQuality#40

Open

djtfmartin added the effort-unclear label Sep 2, 2020

javier-molina added this to the Sprint 12 milestone Nov 26, 2020

javier-molina mentioned this issue Dec 3, 2020

Review LocationTransform and TemporalTransform #102

Closed

javier-molina modified the milestones: Sprint 12, Sprint 13 Jan 13, 2021

javier-molina modified the milestones: Sprint 13, Sprint 14 Feb 7, 2021

javier-molina modified the milestones: Sprint 14, Sprint 15 Feb 25, 2021

javier-molina modified the milestones: Sprint 15, Sprint 16 Mar 23, 2021

javier-molina assigned charvolant Apr 8, 2021

javier-molina removed this from the Sprint 16 milestone Apr 21, 2021

javier-molina added this to the Sprint 18 milestone Apr 21, 2021

javier-molina added the gap label Apr 26, 2021

This was referenced Apr 26, 2021

Deploy AVH hub to aws-biocache-quoll.ala.org.au #313

Closed

Flag unrecognised collection/institution code combination with an issue #303

Closed

javier-molina assigned javier-molina and unassigned charvolant Apr 28, 2021

javier-molina modified the milestones: Sprint 18, Sprint 19, Sprint 20 May 4, 2021

javier-molina closed this as completed Jun 24, 2021
