-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Normalize GBL1 and Aardvark records for OGM harvests #151
Conversation
Why these changes are being introduced: When normalizing OGM records, it was determined that we might need some repo-by-repo specific configurations for OGM repositories. The established place to do this is in the OGM config YAML, and so it makes sense to expose this config dictionary to the record classes that require it. One of the repo specific updates that will be utilized in a future commit for GBL1 and Aardvark classes are alternate strategies for extracting a meaningful external URL for repositories. This has been, optionally, added to some OGM repo configurations. How this addresses that need: * Adds ogm_config dictionary to SourceRecord class * Some OGM configs get an alteranate url extraction strategy Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
Why these changes are being introduced: When harvesting records from OGM, we have determined the only two formats we will need to support are GBL1 and Aardvark. To support OGM harvests, these classes need to normalize to "MIT Aardvark" which is the Aardvark format, just meeting MIT requirements and suitable for TIMDEX. How this addresses that need: * Completes stubbed SourceRecord classes GBL1 and Aardvark Side effects of this change: * OGM harvests are possible with new ability to normalize GBL1 and Aardvark to MITAardvark Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good but some questions and changes requested!
Why these changes are being introduced: When normalizing OGM records, a very small subset of OGM records include an array of values in the dct_references_s field for the URI 'http://schema.org/downloadUrl'. When this value is an array, and not a single value string, we cannot know which URL to select as the only download URL we will expose in TIMDEX. Therefore, despite the information loss, it is better to skip a download URL for these records. How this addresses that need: * Checks that found download value is a string URL value, not an array Side effects of this change: * A small subset of records that might have had download URLs for OGM harvests will not have them Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109
Thanks for the review @ehanson8 - I've responded with additional commits! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved but commenting to defer final approval to @jonavellecuerdo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @ghukill ! Great work here. I appreciate the detail in doc strings for field mappings that required additional steps--every time I question came to mind, there was a comment for it. 💯
I just have one clarification question/request for change (at most, to add a docstring); see my second comment regarding Aardvark._gbl_resourceClass_sm
. 🤔
|
||
https://opengeometadata.org/ogm-aardvark/#resource-class-values | ||
""" | ||
return { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you determine what key values to use? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonavellecuerdo - not a precise science here. Knowing the controlled terms we want end up with, just slowly adding instances that we encounter.
def _gbl_resourceClass_sm(self) -> list[str]: | ||
mapped_values = [] | ||
for value in self.parsed_data.get("gbl_resourceClass_sm", []): | ||
if mapped_value := self.gbl_resource_class_value_map.get( | ||
value.strip().lower() | ||
): | ||
mapped_values.append(mapped_value) # noqa: PERF401 | ||
return mapped_values if mapped_values else ["Other"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, if the original Aardvark record (in self.parsed_data
) does not have gbl_resourceClass_sm
assigned, the function would return ["Other"]
, right? Is that the expected behavior? If so, could we provide an explanation as to what "Other" means in this scenario? 🤔
It seems unlikely that we'll encounter this issue, given that gbl_resourceClass_sm is required for the Aardvark schema. Technically, if this Aardvark record was missing a value AND it was run through the schema, it would fail and throw a validation error...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct: if the original Aardvark file didn't have it, it'd get set to Other
.
As much a fan of docstrings as I am, I feel like it's okay without? I think in metadata-land it's fairly common to have a default value like Other
, particularly when the field is required, but there is no data to support it.
And to your point, we really should never encounter this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I'd be curious if any records that end up meeting this condition (i.e., returning ["Other"]
) and maybe only then revisiting whether "Aardvark" records missing this required field should be kept or ignored (I have some difficulty semantically accepting ["Other"]
as "no data" 😅). That said, it seems like a low chance that we'll encounter this case, so I think we can proceed forward!
Purpose and background context
This PR adds the ability to normalize GBL1 and Aardvark records for OGM harvests.
During this work there multiple threads, with multiple groups, that resulted in some interconnected changes. But the work can priarmily be broken down into two major chunks:
1- Expose OGM repository configurations to the
SourceRecord
class to use during normalization (commit)Normalization for OGM records sometimes requires information (e.g. common, consistent name for the institution) about the repository that is not explicitly present in the harvested metadata record. The configuration YAML is the best choice for storing and retrieving this information.
2- Normalization of OGM GBL1 and Aardvark records (commit)
This work was very similar to FGDC and ISO19139, but overall much simpler. It was determined that ALL harvested OGM records could be either GBL1 or Aardvark, making these two source classes very OGM centric.
Furthermore, GBL1 often was just renaming and re-typing field values, while many Aardvark fields were direct copies of the original record.
The bulk of this work is setting up field methods, ensuring correct types, and adequate testing; the approach is nearly identical to FGDC and ISO19139 source classes and normalization.
NOTE: It was determined that OGM harvests will not need to write source and normalized metadata to our S3 CDN bucket. This change will be handled in the next PR.
How can a reviewer manually see the effects of these changes?
As alluded to above, the normal flow of the CLI or deployed ECS task will attempt to write OGM records to S3 CDN, which is not needed or ideal.
Therefore, the best way to test OGM harvesting is still directly in python. The following performs a full harvest of:
edu.uchicago
: GBL1edu.cornell
: Aardvark, and example of explicit, alternate strategy of getting an external URL for each recordOnce harvest is complete, you can see the results of normalized files at
output/chicago_and_cornell_normalized.jsonl
.Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)