Map Aardvark.dct_spatial_sm to Locations@kind="Place Name" #146

Merged
jonavellecuerdo merged 2 commits into main from GDT-217-map-spatial-subjects-to-locations-field
Mar 11, 2024


@jonavellecuerdo jonavellecuerdo commented Mar 11, 2024

Purpose and background context

Map Aardvark.dct_spatial_sm to Locations@kind="Place Name" to support the use of place names for location-based searching for GeoData.

Note: MITAardvark is the only source transformer that writes Subject@kind="Dublin Core; Spatial" (i.e., what Aardvark.dct_spatial_sm currently maps to), so the updated mapping is only applied to this transformer.

How can a reviewer manually see the effects of these changes?

  1. Temporarily set AWS credentials for TimdexManagers for Dev1 in your terminal.

  2. Transform gismit records

    • Run the following command from your local clone of transmogrifier:
      pipenv run transform -i s3://timdex-extract-dev-222053980223/gismit/gismit-2024-03-11-full-extracted-records-to-index.jsonl -o output/output_gismit.json -s gismit -v
      
    • View output/output_gismit.json. Lines 67-88 should read:
      [image: screenshot of the expected output]
  3. Transform gisogm records (a partial run, at least!)

    • Run the following command from your local clone of transmogrifier:
      pipenv run transform -i s3://timdex-extract-dev-222053980223/gisogm/gisogm-2024-03-01-full-extracted-records-to-index.jsonl -o output/output_gisogm.json -s gisogm -v
      
    • View output/output_gisogm.json. Lines 49-66 should read:
      [image: screenshot of the expected output]
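
As a complement to reading the output by eye, a small script can tally the Location kinds in a transformed file. This is only a sketch: it assumes the output file holds a single JSON array of record objects, each with an optional "locations" list of objects carrying "value" and "kind" keys. The helper names `load_records` and `count_location_kinds` are hypothetical and not part of transmogrifier.

```python
from __future__ import annotations

import json
from collections import Counter
from pathlib import Path


def load_records(path: str) -> list[dict]:
    """Load a transform output file (assumed: one JSON array of records)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_location_kinds(records: list[dict]) -> Counter:
    """Tally the 'kind' of every location across the given records."""
    kinds: Counter = Counter()
    for record in records:
        # A record may lack "locations" or carry an explicit null.
        for location in record.get("locations") or []:
            kinds[location.get("kind")] += 1
    return kinds
```

If those assumptions hold, `count_location_kinds(load_records("output/output_gismit.json"))` should report a non-zero count under "Place Name" after this change.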

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-217

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

@jonavellecuerdo jonavellecuerdo self-assigned this Mar 11, 2024
@jonavellecuerdo jonavellecuerdo marked this pull request as ready for review March 11, 2024 17:47

@ghukill ghukill left a comment

The update looks good, and I think achieves what we need, but I might propose that we lean into the post-transform "hook" or "enrichment" (name TBD) pattern that was established in #141.

While I think it's agreed that we'll want to revisit how to orchestrate these hooks, this second case of needing to do something very similar suggests there is a pattern that we can identify and abstract out.

Until then, I think we'd benefit from not locating this Subject --> Location data extraction/duplication in the MITAardvark transform only, lest we forget about it later and accidentally duplicate its functionality. Moreover, if we lean into this pattern now and define it at the Transformer level, I think it provides more concrete use cases to consider when that refactor happens in earnest.

More concretely, where we currently have:

fields = self.create_dates_and_locations_from_publishers(fields)

I might propose we add another like:

# post transform hooks and enrichments
fields = self.create_dates_and_locations_from_publishers(fields)
fields = self.create_locations_from_spatial_subjects(fields)  # <--- new

The needle I think we want to thread is: making changes now that are solid and understandable, while understanding a future refactor may combine and extend this pattern that is emerging from these first two use cases.
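
For illustration, the proposed hook might look roughly like the sketch below. It is not the merged implementation: the `Subject` and `Location` dataclasses are stand-ins for the project's record models, and the assumption that a subject's `value` is a list of strings (one place name each) is mine. The duplicate check is also an assumption about desired behavior.

```python
from __future__ import annotations

from dataclasses import dataclass


# Stand-in models; field names and types are assumptions for this sketch.
@dataclass
class Subject:
    value: list[str] | None = None
    kind: str | None = None


@dataclass
class Location:
    value: str | None = None
    kind: str | None = None


def create_locations_from_spatial_subjects(fields: dict) -> dict:
    """Copy each spatial subject value into a 'Place Name' Location."""
    if not fields.get("subjects"):
        return fields

    spatial_subjects = [
        subject
        for subject in fields["subjects"]
        if subject.kind == "Dublin Core; Spatial" and subject.value is not None
    ]

    # Start from any locations already on the record (missing or None becomes []).
    locations = list(fields.get("locations") or [])
    for subject in spatial_subjects:
        for place_name in subject.value:
            location = Location(value=place_name, kind="Place Name")
            # Skip exact duplicates so running the hook twice adds nothing new.
            if location not in locations:
                locations.append(location)

    if locations:
        fields["locations"] = locations
    return fields
```

Because the function takes and returns `fields`, it can be chained after `create_dates_and_locations_from_publishers` in the same way as the two-line snippet above.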

Why these changes are being introduced:
* Support place names for location-based searching

How this addresses that need:
* Add global Transformer method to create Location objects from spatial subjects

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-217
@jonavellecuerdo jonavellecuerdo force-pushed the GDT-217-map-spatial-subjects-to-locations-field branch from ec52a9c to 95349bd Compare March 11, 2024 19:40
@jonavellecuerdo

@ghukill Thank you for clarifying! I undid the changes to the MITAardvark transformer and implemented the solution as a post-transform hook in the transformer module.


@ghukill ghukill left a comment

General structure looks good! Thanks for going with this pattern of locating this "post hook" work together.

I left a syntactical / stylistic comment for the filtering of subjects down to spatial ones. I find list comprehensions easier to scan than filter + lambda, but open to pushback!

Comment on lines 405 to 415
if not (subjects := fields.get("subjects")):
    return fields

if not (
    spatial_subjects := list(
        filter(
            lambda subject: subject.kind == "Dublin Core; Spatial", subjects  # type: ignore[arg-type]
        )
    )
):
    return fields

@ghukill ghukill Mar 11, 2024

I think that I see the logic here, refining a list of potential subjects down using filter + a lambda, but perhaps this could be a simpler list comprehension?

# early return; covers a missing "subjects" key, a None value, or an empty list
if not fields.get("subjects"):
    return fields

# filters list, potentially to empty list which is okay
spatial_subjects = [
    subject
    for subject in fields.get("subjects", []) # defaults to empty list
    if subject.kind == "Dublin Core; Spatial"
    and subject.value is not None
]
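
One detail worth noting: the suggested comprehension is not a pure restyling of the filter + lambda version, since it also drops spatial subjects whose value is None. The sketch below shows the difference on toy data; `SimpleNamespace` objects stand in for the real subject model, which is an assumption made only to keep the example self-contained.

```python
from types import SimpleNamespace

# Toy subjects; the real model class is replaced by SimpleNamespace here.
subjects = [
    SimpleNamespace(kind="Dublin Core; Spatial", value=["Cambridge"]),
    SimpleNamespace(kind="Dublin Core; Subject", value=["Roads"]),
    SimpleNamespace(kind="Dublin Core; Spatial", value=None),
]

# Original approach: filter + lambda, matching on kind only.
original_filter = list(
    filter(lambda subject: subject.kind == "Dublin Core; Spatial", subjects)
)

# Suggested approach: list comprehension with the extra None guard.
suggested = [
    subject
    for subject in subjects
    if subject.kind == "Dublin Core; Spatial" and subject.value is not None
]
```

Here `original_filter` keeps both spatial subjects, including the one with no value, while `suggested` keeps only the one that can actually yield a place name.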


Agreed on the list comprehension; that's what we've used throughout the app, and I find it more readable as well.

@jonavellecuerdo jonavellecuerdo force-pushed the GDT-217-map-spatial-subjects-to-locations-field branch from d47375c to df3ab26 Compare March 11, 2024 20:36

@ghukill ghukill left a comment

Looks good!


@ehanson8 ehanson8 left a comment

Looks good!

@jonavellecuerdo jonavellecuerdo merged commit a019673 into main Mar 11, 2024
5 checks passed
@jonavellecuerdo jonavellecuerdo deleted the GDT-217-map-spatial-subjects-to-locations-field branch March 11, 2024 20:47