Normalize GBL1 and Aardvark records for OGM harvests #151

ghukill · 2024-02-09T21:53:43Z

Purpose and background context

This PR adds the ability to normalize GBL1 and Aardvark records for OGM harvests.

During this work there multiple threads, with multiple groups, that resulted in some interconnected changes. But the work can priarmily be broken down into two major chunks:

1- Expose OGM repository configurations to the SourceRecord class to use during normalization (commit)

Normalization for OGM records sometimes requires information (e.g. common, consistent name for the institution) about the repository that is not explicitly present in the harvested metadata record. The configuration YAML is the best choice for storing and retrieving this information.

2- Normalization of OGM GBL1 and Aardvark records (commit)

This work was very similar to FGDC and ISO19139, but overall much simpler. It was determined that ALL harvested OGM records could be either GBL1 or Aardvark, making these two source classes very OGM centric.

Furthermore, GBL1 often was just renaming and re-typing field values, while many Aardvark fields were direct copies of the original record.

The bulk of this work is setting up field methods, ensuring correct types, and adequate testing; the approach is nearly identical to FGDC and ISO19139 source classes and normalization.

NOTE: It was determined that OGM harvests will not need to write source and normalized metadata to our S3 CDN bucket. This change will be handled in the next PR.

How can a reviewer manually see the effects of these changes?

As alluded to above, the normal flow of the CLI or deployed ECS task will attempt to write OGM records to S3 CDN, which is not needed or ideal.

Therefore, the best way to test OGM harvesting is still directly in python. The following performs a full harvest of:

edu.uchicago: GBL1
edu.cornell: Aardvark, and example of explicit, alternate strategy of getting an external URL for each record

import logging
import time

from harvester.config import Config, configure_logger
from harvester.harvest.ogm import OGMHarvester

logger = logging.getLogger("harvester")
configure_logger(logger, True) # verbose
CONFIG = Config()

harvester = OGMHarvester(
    harvest_type="full",
    output_file="output/chicago_and_cornell_normalized.jsonl",
    include_repositories=[
        "edu.uchicago", # Aardvark
        "edu.cornell", # GBL1, example of explicit external URL strategy
    ],
    remove_local_repos=False
)
t0 = time.time()
results = harvester.harvest()
print(results)
print(f"total time: {time.time()-t0}")

Once harvest is complete, you can see the results of normalized files at output/chicago_and_cornell_normalized.jsonl.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/GDT-109

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: When normalizing OGM records, it was determined that we might need some repo-by-repo specific configurations for OGM repositories. The established place to do this is in the OGM config YAML, and so it makes sense to expose this config dictionary to the record classes that require it. One of the repo specific updates that will be utilized in a future commit for GBL1 and Aardvark classes are alternate strategies for extracting a meaningful external URL for repositories. This has been, optionally, added to some OGM repo configurations. How this addresses that need: * Adds ogm_config dictionary to SourceRecord class * Some OGM configs get an alteranate url extraction strategy Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109

Why these changes are being introduced: When harvesting records from OGM, we have determined the only two formats we will need to support are GBL1 and Aardvark. To support OGM harvests, these classes need to normalize to "MIT Aardvark" which is the Aardvark format, just meeting MIT requirements and suitable for TIMDEX. How this addresses that need: * Completes stubbed SourceRecord classes GBL1 and Aardvark Side effects of this change: * OGM harvests are possible with new ability to normalize GBL1 and Aardvark to MITAardvark Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109

ehanson8

Looking good but some questions and changes requested!

harvester/harvest/ogm.py

harvester/ogm_repositories_config.yaml

harvester/records/record.py

harvester/records/aardvark.py

harvester/records/gbl1.py

harvester/utils.py

tests/test_records/test_aardvark.py

Why these changes are being introduced: When normalizing OGM records, a very small subset of OGM records include an array of values in the dct_references_s field for the URI 'http://schema.org/downloadUrl'. When this value is an array, and not a single value string, we cannot know which URL to select as the only download URL we will expose in TIMDEX. Therefore, despite the information loss, it is better to skip a download URL for these records. How this addresses that need: * Checks that found download value is a string URL value, not an array Side effects of this change: * A small subset of records that might have had download URLs for OGM harvests will not have them Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/GDT-109

ghukill · 2024-02-13T14:25:32Z

Looking good but some questions and changes requested!

Thanks for the review @ehanson8 - I've responded with additional commits!

ehanson8

Approved but commenting to defer final approval to @jonavellecuerdo

harvester/records/gbl1.py

jonavellecuerdo

Hi, @ghukill ! Great work here. I appreciate the detail in doc strings for field mappings that required additional steps--every time I question came to mind, there was a comment for it. 💯

I just have one clarification question/request for change (at most, to add a docstring); see my second comment regarding Aardvark._gbl_resourceClass_sm. 🤔

jonavellecuerdo · 2024-02-14T15:53:16Z

harvester/records/record.py

+
+        https://opengeometadata.org/ogm-aardvark/#resource-class-values
+        """
+        return {


How did you determine what key values to use? 🤔

@jonavellecuerdo - not a precise science here. Knowing the controlled terms we want end up with, just slowly adding instances that we encounter.

jonavellecuerdo · 2024-02-14T16:44:14Z

harvester/records/aardvark.py

+    def _gbl_resourceClass_sm(self) -> list[str]:
+        mapped_values = []
+        for value in self.parsed_data.get("gbl_resourceClass_sm", []):
+            if mapped_value := self.gbl_resource_class_value_map.get(
+                value.strip().lower()
+            ):
+                mapped_values.append(mapped_value)  # noqa: PERF401
+        return mapped_values if mapped_values else ["Other"]


To clarify, if the original Aardvark record (in self.parsed_data) does not have gbl_resourceClass_sm assigned, the function would return ["Other"], right? Is that the expected behavior? If so, could we provide an explanation as to what "Other" means in this scenario? 🤔

It seems unlikely that we'll encounter this issue, given that gbl_resourceClass_sm is required for the Aardvark schema. Technically, if this Aardvark record was missing a value AND it was run through the schema, it would fail and throw a validation error...

Yes, that's correct: if the original Aardvark file didn't have it, it'd get set to Other.

As much a fan of docstrings as I am, I feel like it's okay without? I think in metadata-land it's fairly common to have a default value like Other, particularly when the field is required, but there is no data to support it.

And to your point, we really should never encounter this situation.

Sounds good! I'd be curious if any records that end up meeting this condition (i.e., returning ["Other"]) and maybe only then revisiting whether "Aardvark" records missing this required field should be kept or ignored (I have some difficulty semantically accepting ["Other"] as "no data" 😅). That said, it seems like a low chance that we'll encounter this case, so I think we can proceed forward!

harvester/records/gbl1.py

ghukill added 4 commits February 9, 2024 15:10

Include error type counts in harvest results

c42ce77

Merging and resolving conflicts from GDT-159

4dbea82

ghukill marked this pull request as ready for review February 12, 2024 14:23

ghukill requested review from jonavellecuerdo and ehanson8 February 12, 2024 14:23

ghukill added 2 commits February 12, 2024 11:33

Add requests to dependencies

0c5a613

Add PyYAML dependency

289cfff

ehanson8 requested changes Feb 12, 2024

View reviewed changes

ghukill added 3 commits February 12, 2024 16:21

rename ogm_config to ogm_repo_config

2596233

Refactor methods for external url strategies

cd0c9ee

ghukill requested a review from ehanson8 February 13, 2024 14:25

ehanson8 reviewed Feb 13, 2024

View reviewed changes

harvester/records/gbl1.py Show resolved Hide resolved

jonavellecuerdo requested changes Feb 14, 2024

View reviewed changes

ghukill requested review from ehanson8 and jonavellecuerdo February 14, 2024 18:20

jonavellecuerdo approved these changes Feb 14, 2024

View reviewed changes

ehanson8 approved these changes Feb 14, 2024

View reviewed changes

ghukill merged commit 2d2c49c into main Feb 14, 2024
5 checks passed

ghukill deleted the GDT-109-normalize-ogm-records branch March 26, 2024 15:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Normalize GBL1 and Aardvark records for OGM harvests #151

Normalize GBL1 and Aardvark records for OGM harvests #151

ghukill commented Feb 9, 2024 •

edited by ehanson8

Loading

ehanson8 left a comment

ghukill commented Feb 13, 2024

ehanson8 left a comment

jonavellecuerdo left a comment

jonavellecuerdo Feb 14, 2024

ghukill Feb 14, 2024

jonavellecuerdo Feb 14, 2024

ghukill Feb 14, 2024

jonavellecuerdo Feb 14, 2024

Normalize GBL1 and Aardvark records for OGM harvests #151

Normalize GBL1 and Aardvark records for OGM harvests #151

Conversation

ghukill commented Feb 9, 2024 • edited by ehanson8 Loading

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

ehanson8 left a comment

Choose a reason for hiding this comment

ghukill commented Feb 13, 2024

ehanson8 left a comment

Choose a reason for hiding this comment

jonavellecuerdo left a comment

Choose a reason for hiding this comment

jonavellecuerdo Feb 14, 2024

Choose a reason for hiding this comment

ghukill Feb 14, 2024

Choose a reason for hiding this comment

jonavellecuerdo Feb 14, 2024

Choose a reason for hiding this comment

ghukill Feb 14, 2024

Choose a reason for hiding this comment

jonavellecuerdo Feb 14, 2024

Choose a reason for hiding this comment

ghukill commented Feb 9, 2024 •

edited by ehanson8

Loading