Add "provider" field to TIMDEX and derive values for GIS sources #137

Merged: 1 commit, Feb 29, 2024

Conversation

jonavellecuerdo

Purpose and background context

Add a top-level field provider to the TIMDEX data model and derive values for GIS sources (i.e., gismit and gisogm via the MITAardvark transformer). This field represents the institution or organization that provides access to the resource represented in TIMDEX. This change came from the need to support a UI filter "Institution" for the GeoData website and implements the decision proposed in ADR-0004.

Note a few things:

  • This PR only adds the derivation of provider for GIS sources.
  • ADR-0004 includes suggestions for how other sources might benefit from changing their mappings to use this field, but those sources are not broken until such changes are made.
  • For the MITAardvark transformer, the field schema_provider_s is no longer mapped to publication_information.
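
As a rough illustration of the change described above, here is a minimal sketch of a record model with a top-level provider field and a derivation that reads schema_provider_s. The names TimdexRecordSketch, derive_provider, and get_optional_fields_sketch are hypothetical stand-ins, not the actual transmogrifier code, and the use of dct_publisher_sm as the remaining input to publication_information is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TimdexRecordSketch:
    """Hypothetical, pared-down stand-in for transmogrifier.models.TimdexRecord."""

    source: str
    title: str
    provider: Optional[str] = None  # the new top-level field
    publication_information: Optional[list[str]] = None


def derive_provider(aardvark_record: dict[str, Any]) -> Optional[str]:
    """Return the schema_provider_s value, or None if it is absent or blank."""
    value = aardvark_record.get("schema_provider_s")
    if isinstance(value, str) and value.strip():
        return value.strip()
    return None


def get_optional_fields_sketch(aardvark_record: dict[str, Any]) -> dict[str, Any]:
    """Build the optional fields relevant to this PR from an Aardvark record."""
    fields: dict[str, Any] = {}
    fields["provider"] = derive_provider(aardvark_record)
    # schema_provider_s is intentionally NOT added to publication_information.
    # Assumption: dct_publisher_sm is the remaining input for that field.
    publishers = aardvark_record.get("dct_publisher_sm") or []
    fields["publication_information"] = list(publishers) or None
    return fields
```

With this sketch, a record whose schema_provider_s is "GIS Lab, MIT Libraries" yields that string as provider, and the value no longer appears in publication_information.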

How can a reviewer manually see the effects of these changes?

  1. Temporarily set AWS credentials for TimdexManagers for Dev1 in your terminal.

  2. Transform gismit records

    • Run the following command from your local clone of transmogrifier:
      pipenv run transform -i s3://timdex-extract-dev-222053980223/gismit/gismit-2024-02-21-full-extracted-records-to-index.jsonl -o output/output_gismit.json -s gismit -v
      
    • View output/output_gismit.json. The provider field of every record will read "GIS Lab, MIT Libraries".
  3. Transform gisogm records

    • Run the following command from your local clone of transmogrifier:
      pipenv run transform -i s3://timdex-extract-dev-222053980223/gisogm/gisogm-2024-02-20-full-extracted-records-to-index.jsonl -o output/output_gisogm.json -s gisogm -v
      
    • View output/output_gisogm.json. All records will display the value from schema_provider_s, which derives its values from the OGM repositories config YAML file (see derivation in geo-harvester).
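
Rather than reading the output files by eye, the provider values from the steps above could be tallied with a short script. This is only a sketch: it assumes each output file is a JSON array of record objects, and the helper names load_records and count_providers are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_records(path):
    """Load transformed records, assuming the file holds a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def count_providers(records):
    """Tally provider values; records without the field are counted under None."""
    return Counter(record.get("provider") for record in records)


if __name__ == "__main__":
    for name in ("output/output_gismit.json", "output/output_gisogm.json"):
        if Path(name).exists():
            print(name, dict(count_providers(load_records(name))))
```

For gismit, the tally should contain a single key, "GIS Lab, MIT Libraries". For gisogm, it should list one key per contributing institution.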

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES; the OpenSearch field mapping in timdex-index-manager requires an update to include a "provider" field.
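
A hedged sketch of what that mapping update might involve is below. The actual layout of the mapping in timdex-index-manager is not shown in this PR, so the helper with_provider_field is hypothetical, and the keyword type is an assumption chosen because exact-match filtering (such as an "Institution" filter) typically uses it.

```python
import copy


def with_provider_field(mappings: dict, field_type: str = "keyword") -> dict:
    """Return a copy of an OpenSearch mappings body with a 'provider' property added.

    The field type defaults to 'keyword', which is an assumption for this sketch.
    """
    updated = copy.deepcopy(mappings)
    updated.setdefault("properties", {})["provider"] = {"type": field_type}
    return updated


# Illustrative input only; not the real timdex-index-manager mapping.
example = with_provider_field({"properties": {"title": {"type": "text"}}})
```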

What are the relevant tickets?

https://mitlibraries.atlassian.net/jira/software/c/projects/GDT/boards/225?selectedIssue=GDT-203

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review February 28, 2024 21:43
@ghukill ghukill left a comment

Looks good to me! The ADR was certainly the difficult part here; the code changes are minimal and make sense. Confirmed that OGM harvests will show the institution via provider field.

Why these changes are being introduced:
* The "provider" represents the institution or organization that provides
access to the resource represented in TIMDEX. The idea for this change
came from the need to support a UI filter "Institution" for GeoData.

How this addresses that need:
* Add "provider" to transmogrifier.models.TimdexRecord
* Add derivation for "provider" in MITAardvark.get_optional_fields method
* Remove mapping of "schema_provider_s" from TIMDEX field "publication_information"
* Update test for MITAardvark transformer to reflect changes

Side effects of this change:
* The OpenSearch field mapping in timdex-index-manager requires an update
to include a "provider" field.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/jira/software/c/projects/GDT/boards/225?selectedIssue=GDT-206
@jonavellecuerdo jonavellecuerdo merged commit 0fd8ea8 into main Feb 29, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the GDT-206-add-provider-field branch February 29, 2024 17:15
3 participants