Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate date strings when transforming MITAardvark records #124

Merged
merged 5 commits into from
Feb 14, 2024

Conversation

ghukill
Copy link
Contributor

@ghukill ghukill commented Feb 14, 2024

Purpose and background context

The primary change this PR introduces is validating date strings for MITAardvark transformations, particularly for records from OpenGeoMetadata (OGM) where dates are less controlled than when we create them.

Using validate_date(...) is inline with most other transformations in Transmogrifier. By adding this, all date related errors while indexing to Opensearch locally were avoided, allowing an additional 26k records to index. However, this means we also lost adding dates to TIMDEX records that were not immediately valid, e.g.:

  • "1993?"
  • "approximately 1569"
  • "[1910 to 1919]"

While not ideal to lose information during transformation, this seems acceptable at this time for a couple of reasons:

  • this approach keeps this transformation inline with other transformations that also employ this validation
  • while some dates are lost, it's very rarely not the only date for an OGM record
  • when we return to dates in Transmog -- possibly cleaning common patterns like ? or xxx to yyy -- we can apply that logic once and know that all transformations will benefit

Also in this PR:

  • source gisogm should also use the MITAardvark transformation as those records; MIT and OGM share transformation
  • small tweak to skip multiple download links if present in field dct_references_s from Aardvark record

How can a reviewer manually see the effects of these changes?

The new unit test in this commit is indicative of how it drops values that are not valid.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: increase in documents indexed, decrease in multiple dates per record

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
The source gisogm should also use the MITAardvark transformer, as the GeoHarvester, even for OGM records,
will produce what we are referring to as MITAardvark files.  They are Aardvark in format, but meet MIT
requirements for TIMDEX.

How this addresses that need:
* Changes defined transformer class in config

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
Why these changes are being introduced:

While technically valid for an Aardvark file to have multiple download links as part
of dct_references_s, when creating a TIMDEX record, we cannot know which link to include
as the sole and primary download link.  Single, scalar values are by far the most common and
will be used, but if an array of values is encountered, adding a Download link is skipped.

How this addresses that need:
* Ensure that value coming from dct_references_s for URI 'http://schema.org/downloadUrl' is a scalar value

Side effects of this change:
* Slight decrease in download links for transforming OGM records

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
Why these changes are being introduced:

It was determined after some end-to-end testing that a large number of Aardvark records, when transformed,
were setting dates in the TIMDEX record that Opensearch could not parse.  This was unlike other transformations
where the date strings are validated.

NOTE: it is anticipated that we will return to date validation, possibly to extend it beyond just validation
and into extraction as well (e.g. pulling valid year '1933' from date string '1933?').

How this addresses that need:
* A try/except block added to field method _range_dates to warn and continue
* All non DateRange values are validated in field method get_dates; if fail validation, they are dropped

Side effects of this change:
* Massive increase in success rate for Opensearch indexing for gisogm source
* Decrease in date strings present in final records that are not valid without manipulation

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
@ghukill ghukill marked this pull request as ready for review February 14, 2024 14:34
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good but 1 bigger picture question before approving

@@ -112,7 +112,7 @@
"gisogm": {
"name": "OpenGeoMetadata GIS Resources",
"base-url": "https://search.libraries.mit.edu/record/",
"transform-class": "transmogrifier.sources.json.aardvark.OGMAardvark",
"transform-class": "transmogrifier.sources.json.aardvark.MITAardvark",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, OGMAardvark was referenced here before being defined which we were expecting to need because of the get_source_link method including gismit, are we thinking differently about that approach now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really glad you raised this up, for a couple of reasons.

First of all, looking more closely at the get_source_link which I had not thought to check, I believe that we can update that for MITAardvark transformations to use data from dct_references_s. In this way, it also helps explain why MIT and OGM records can share the same transformer.

Every record that comes out of GeoHarvester is an "MIT" Aardvark record in the sense that, regardless of origin institution or metadata format, we have crafted the Aardvark file in a way that meets our TIMDEX needs. During that work in GeoHarvester, quite a bit of care is taken to craft the dct_references_s field which contains URLs.

The value for dct_refereces_s['http://schema.org/url'] is what the "source link" for the record should be. For MIT records this will be https://geodata.libraries.mit.edu/record/<IDENTIFIER> and for OGM records it will be an external URL that we extracted from the source metadata; gauranteed to be present or it does not get included in the harvester output.

Taking all this together, will work on another commit that:

  1. updates get_source_link to actually read data from the record
  2. will remove base_url from the gismit and gisogm configurations, as it's not needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 - just pushed this commit and have re-requested review.

Why these changes are being introduced:

Unlike other Transmog sources, which dynamically create a source link for the record
based on a base URL and some identifier, MITAardvark records are expected to have a fully
formed URL already present in the metadata.  This is true for both MIT and OGM records, where
MIT records point to geo TIMDEX UI, and OGM records point to the holding institution.

How this addresses that need:
* Updates get_source_link() method to use data from metadata
* Sets config.base_url to string 'None' for both gismit and gisogm

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
if len(url_links) == 1:
return url_links[0].url
message = "Could not locate a kind=Website link to pull the source link from."
raise ValueError(message)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, much better than what we were doing before

@ghukill ghukill merged commit 06de472 into main Feb 14, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants