Data quality checks gap analysis #13

Closed
djtfmartin opened this issue Jan 16, 2020 · 10 comments

@djtfmartin
Member

djtfmartin commented Jan 16, 2020

Need to identify the gaps between ALA's current suite of data quality tests and GBIF's.
The follow-up work would be to implement any missing tests in pipelines, ideally supported in GBIF's core implementation.

A list of ALA's tests can be derived from here:

ALA Assertion codes

The GBIF pipeline equivalent is here:

GBIF Occurrence issues

and

GBIF Name usage issues
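
As a rough illustration of the comparison, the gap analysis can be sketched as a set difference over the two code lists. This is only a sketch: the function name `gap_analysis` and the sample inputs (including the `SAMPLE_*` codes) are hypothetical placeholders, not the real ALA or GBIF lists, and it matches codes by exact name only.

```python
def gap_analysis(ala_codes, gbif_codes):
    """Compare two lists of assertion/issue codes by exact name."""
    ala, gbif = set(ala_codes), set(gbif_codes)
    return {
        "ala_only": sorted(ala - gbif),    # candidates to implement in pipelines
        "gbif_only": sorted(gbif - ala),   # checks GBIF has that ALA lacks
        "shared": sorted(ala & gbif),      # already covered on both sides
    }


# Placeholder inputs for illustration only (assumption, not the real lists).
sample_ala = ["DETECTED_OUTLIER", "UNRECOGNISED_COLLECTION_CODE", "SAMPLE_SHARED_CODE"]
sample_gbif = ["SAMPLE_SHARED_CODE", "SAMPLE_GBIF_ONLY_CODE"]

report = gap_analysis(sample_ala, sample_gbif)
for bucket, codes in report.items():
    print(bucket, codes)
```

Equivalent checks may carry different names in the two systems, so a name-based diff like this would likely need a manually maintained mapping before its "only" buckets could be trusted.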

@RobinaSanderson

Hi @djtfmartin - Is the GBIF Fact or Measurement transform where they do their data quality checking, or is there another transform missing from our diagram: https://confluence.csiro.au/display/ALASD/Process+overview+and+issues?
I'm wondering if I should put in a dotted box on the diagram in the ALA space for data quality checks in case the outcome of this issue results in additional data quality checking. Otherwise, it can wait until we know one way or the other.

@djtfmartin
Member Author

Thanks @RobinaSanderson. The data quality checks are in various transforms (Location, Temporal, Taxonomy).

@RobinaSanderson

RobinaSanderson commented Apr 29, 2020

Thanks, I see... Does it make sense to look at all the data quality checks as this one issue, or to split them up into separate issues for each transform if that is where they happen? Or do we need to do this task to get an overview of all the data quality work, and then have individual tasks for where we have to update/add to transforms, when we know what they are?

@Tasilee

Tasilee commented Apr 29, 2020

I am wondering if we should put a lot of effort in when there was an undertaking some time ago from GBIF, ALA and iDigBio to implement the TDWG TG2 Core Tests. Admittedly, that was under the previous administration (John and Donald at least).

And BTW, I have a table somewhere with all the tests from various agencies that I could find. I will see if I can find it. This was my start point for the TG2 work.

@djtfmartin
Member Author

Thanks @RobinaSanderson, I think separate tasks. We have tasks for Location #22 and Taxonomy #26, but we need a separate task for Temporal (EventProcessor in biocache-store). We also need tasks for the functionality handled in TypeStatusProcessor, BasisOfRecordProcessor and the other processors.

https://github.com/AtlasOfLivingAustralia/biocache-store/tree/master/src/main/scala/au/org/ala/biocache/processor

@RobinaSanderson

Hi @djtfmartin I've created the following gap analysis issues:
EventProcessor - #55
TypeStatusProcessor - #56
BasisOfRecordProcessor - #57
These are placeholders at the moment and will need more detail.

I will take a look at the link you gave above for further processes later. Sorry, I've got another piece of work to finish today.

@javier-molina

This and #125 are related activities

@charvolant

charvolant commented Apr 19, 2021

Assertion codes not used (set) in biocache-store and biocache-service

  • BADLY_FORMED_ALTITUDE
  • AMBIGUOUS_NAME
  • PROCESSING_ERROR - DAOs check for this but it doesn't seem to be set anywhere
  • MEDIA_REPRESENTATIVE
  • MEDIA_UNREPRESENTATIVE

Assertion codes only used in biocache-service

  • VERIFIED

Javier: From Planning Meeting 21-Apr-21. The above are now legacy assertions that are not in use, and it is safe not to carry them over to the LA Pipelines implementation.
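
A minimal sketch of how the six legacy codes listed above could be excluded when building the list of assertions to carry over. The names `LEGACY_ASSERTIONS` and `codes_to_carry_over` are hypothetical and do not come from biocache-store or pipelines.

```python
# The six codes listed in this comment as unused or service-only (legacy).
LEGACY_ASSERTIONS = frozenset({
    "BADLY_FORMED_ALTITUDE",
    "AMBIGUOUS_NAME",
    "PROCESSING_ERROR",
    "MEDIA_REPRESENTATIVE",
    "MEDIA_UNREPRESENTATIVE",
    "VERIFIED",
})


def codes_to_carry_over(all_codes):
    """Return the codes that are not legacy, preserving input order."""
    return [code for code in all_codes if code not in LEGACY_ASSERTIONS]


# Example input is illustrative only.
print(codes_to_carry_over(["VERIFIED", "DETECTED_OUTLIER", "AMBIGUOUS_NAME"]))
```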

@charvolant

charvolant commented Apr 21, 2021

Things potentially missing from pipelines.

  • Generic user assertion codes. It's unclear how or whether these should be implemented, since they're really part of the biocache-service and derived from a cassandra table. The LocationRecord has a hasGeospatialIssue flag on it but it is never set and no other record has a similar flag.

Javier: User assertions will continue to be managed in biocache-service, stored in Cassandra and exported to the Solr index. We are adopting the GBIF "spatially invalid" flag, but I'm not sure if that is related to LocationRecord.hasGeospatialIssue.

  • COORDINATE_HABITAT_MISMATCH - Believed to be deprecated

Javier: Confirmed as per https://confluence.csiro.au/display/CIU/2020-12-01+Reference+Group

  • DETECTED_OUTLIER - Implemented with an outlier count in the index but no flag
  • INFERRED_DUPLICATE_RECORD - Detected by the clustering pipeline but not flagged

Javier: Both assertions above can be determined from pipelines fields, hence they will not be implemented. DQ Profiles already use the outlier_layer_count and duplicate_status fields.
Miles: I can't find any information on how these flags are set or how they might differ from outlier_layer_count and duplicate_status. I can say they are not currently used in the data profiles, so dropping them won't impact the profiles as they stand.

Javier: Doug will raise an issue in GBIF pipelines project and we will ask GBIF to address it.

  • SPECIES_OUTSIDE_EXPERT_RANGE - No expert range processing at present

Miles: I can't find any records with this assertion currently set; however, it has been generated in the past and is a useful quality metric. This field indicates whether the coordinates match the expert distribution layer for the species. There are only layers for birds and fish.
Dave: we can only implement SPECIES_OUTSIDE_EXPERT_RANGE if we do the intersect with expert distributions - and the project steering committee agreed we’d drop that functionality

  • NAME_NOT_IN_NATIONAL_CHECKLISTS - Believed to be deprecated

Miles: No records with this assertion. I don't know if it was ever used. You can find out which name list a name comes from by following the link to the species page, but this information is in the index (as far as I can tell).
Javier: For SPECIES_OUTSIDE_EXPERT_RANGE and NAME_NOT_IN_NATIONAL_CHECKLISTS it is not clear how they might be used or how they were added originally. It is accepted that they are deprecated in LA Pipelines project (Planning Meeting 21-Apr-21)

  • UNRECOGNISED_COLLECTION_CODE - Mapped by ALACollectionLookup, but a failed lookup is not flagged
  • UNRECOGNISED_INSTITUTION_CODE - Mapped by ALACollectionLookup, but a failed lookup is not flagged

Javier: UNRECOGNISED_COLLECTION_CODE and UNRECOGNISED_INSTITUTION_CODE will be implemented in #303

  • RECORDED_BY_UNPARSABLE - Unparsable names are not flagged by CollectorNameParser

Javier: A collector name can even be an ID, and that is valid; hence this was an opinionated assertion that is no longer valid.
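
To make two of the points above concrete, here is a hedged sketch of (a) deriving DETECTED_OUTLIER and INFERRED_DUPLICATE_RECORD from the outlier_layer_count and duplicate_status fields, and (b) flagging collection and institution codes that fail a lookup. It is not the pipelines implementation. The record is modelled as a plain dict, the function names are hypothetical, the `"ASSOCIATED"` status value is an assumption about how duplicates are labelled, and the `collectionCode` / `institutionCode` keys simply follow Darwin Core term names.

```python
def derived_flags(record, duplicate_statuses=frozenset({"ASSOCIATED"})):
    """Derive the two assertions from index fields instead of stored flags.

    duplicate_statuses is an ASSUMPTION: the duplicate_status values that
    should count as "this record is an inferred duplicate".
    """
    flags = []
    if (record.get("outlier_layer_count") or 0) > 0:
        flags.append("DETECTED_OUTLIER")
    if record.get("duplicate_status") in duplicate_statuses:
        flags.append("INFERRED_DUPLICATE_RECORD")
    return flags


def lookup_flags(record, known_collection_codes, known_institution_codes):
    """Flag codes that were supplied but not found in the known sets.

    A missing (None or empty) code is not flagged; only a failed lookup is.
    """
    flags = []
    collection = record.get("collectionCode")
    if collection and collection not in known_collection_codes:
        flags.append("UNRECOGNISED_COLLECTION_CODE")
    institution = record.get("institutionCode")
    if institution and institution not in known_institution_codes:
        flags.append("UNRECOGNISED_INSTITUTION_CODE")
    return flags


# Illustrative record and lookup sets (made-up values).
example = {
    "outlier_layer_count": 2,
    "duplicate_status": "ASSOCIATED",
    "collectionCode": "XYZ",
    "institutionCode": "CSIRO",
}
print(derived_flags(example))
print(lookup_flags(example, {"ABC"}, {"CSIRO"}))
```

Treating an absent code as "not flagged" is a design choice in this sketch; whether a missing code should raise a separate assertion is not settled in this thread.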

@javier-molina javier-molina removed this from the Sprint 16 milestone Apr 21, 2021
@javier-molina

Everything in the 21 Apr comment above has been addressed.


5 participants