Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Normalize GBL1 and Aardvark records for OGM harvests #151

Merged
merged 9 commits into from
Feb 14, 2024

Conversation

ghukill
Copy link
Collaborator

@ghukill ghukill commented Feb 9, 2024

Purpose and background context

This PR adds the ability to normalize GBL1 and Aardvark records for OGM harvests.

During this work there multiple threads, with multiple groups, that resulted in some interconnected changes. But the work can priarmily be broken down into two major chunks:

1- Expose OGM repository configurations to the SourceRecord class to use during normalization (commit)

Normalization for OGM records sometimes requires information (e.g. common, consistent name for the institution) about the repository that is not explicitly present in the harvested metadata record. The configuration YAML is the best choice for storing and retrieving this information.

2- Normalization of OGM GBL1 and Aardvark records (commit)

This work was very similar to FGDC and ISO19139, but overall much simpler. It was determined that ALL harvested OGM records could be either GBL1 or Aardvark, making these two source classes very OGM centric.

Furthermore, GBL1 often was just renaming and re-typing field values, while many Aardvark fields were direct copies of the original record.

The bulk of this work is setting up field methods, ensuring correct types, and adequate testing; the approach is nearly identical to FGDC and ISO19139 source classes and normalization.

NOTE: It was determined that OGM harvests will not need to write source and normalized metadata to our S3 CDN bucket. This change will be handled in the next PR.

How can a reviewer manually see the effects of these changes?

As alluded to above, the normal flow of the CLI or deployed ECS task will attempt to write OGM records to S3 CDN, which is not needed or ideal.

Therefore, the best way to test OGM harvesting is still directly in python. The following performs a full harvest of:

import logging
import time

from harvester.config import Config, configure_logger
from harvester.harvest.ogm import OGMHarvester

logger = logging.getLogger("harvester")
configure_logger(logger, True) # verbose
CONFIG = Config()

harvester = OGMHarvester(
    harvest_type="full",
    output_file="output/chicago_and_cornell_normalized.jsonl",
    include_repositories=[
        "edu.uchicago", # Aardvark
        "edu.cornell", # GBL1, example of explicit external URL strategy
    ],
    remove_local_repos=False
)
t0 = time.time()
results = harvester.harvest()
print(results)
print(f"total time: {time.time()-t0}")

Once harvest is complete, you can see the results of normalized files at output/chicago_and_cornell_normalized.jsonl.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

When normalizing OGM records, it was determined that we might need some repo-by-repo specific
configurations for OGM repositories.  The established place to do this is in the OGM config YAML,
and so it makes sense to expose this config dictionary to the record classes that require it.

One of the repo specific updates that will be utilized in a future commit for GBL1 and Aardvark
classes are alternate strategies for extracting a meaningful external URL for repositories.  This
has been, optionally, added to some OGM repo configurations.

How this addresses that need:
* Adds ogm_config dictionary to SourceRecord class
* Some OGM configs get an alteranate url extraction strategy

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
Why these changes are being introduced:
When harvesting records from OGM, we have determined the only two formats we will need to support are
GBL1 and Aardvark.  To support OGM harvests, these classes need to normalize to "MIT Aardvark" which
is the Aardvark format, just meeting MIT requirements and suitable for TIMDEX.

How this addresses that need:
* Completes stubbed SourceRecord classes GBL1 and Aardvark

Side effects of this change:
* OGM harvests are possible with new ability to normalize GBL1 and Aardvark to MITAardvark

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
@ghukill ghukill marked this pull request as ready for review February 12, 2024 14:23
Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good but some questions and changes requested!

harvester/harvest/ogm.py Outdated Show resolved Hide resolved
harvester/ogm_repositories_config.yaml Show resolved Hide resolved
harvester/records/record.py Show resolved Hide resolved
harvester/records/aardvark.py Show resolved Hide resolved
harvester/records/gbl1.py Show resolved Hide resolved
harvester/utils.py Show resolved Hide resolved
tests/test_records/test_aardvark.py Show resolved Hide resolved
Why these changes are being introduced:

When normalizing OGM records, a very small subset of OGM records include an array of
values in the dct_references_s field for the URI 'http://schema.org/downloadUrl'.  When
this value is an array, and not a single value string, we cannot know which URL to select
as the only download URL we will expose in TIMDEX.  Therefore, despite the information loss,
it is better to skip a download URL for these records.

How this addresses that need:
* Checks that found download value is a string URL value, not an array

Side effects of this change:
* A small subset of records that might have had download URLs for OGM harvests will not have them

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/GDT-109
@ghukill
Copy link
Collaborator Author

ghukill commented Feb 13, 2024

Looking good but some questions and changes requested!

Thanks for the review @ehanson8 - I've responded with additional commits!

Copy link
Contributor

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved but commenting to defer final approval to @jonavellecuerdo

harvester/records/gbl1.py Show resolved Hide resolved
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @ghukill ! Great work here. I appreciate the detail in doc strings for field mappings that required additional steps--every time I question came to mind, there was a comment for it. 💯

I just have one clarification question/request for change (at most, to add a docstring); see my second comment regarding Aardvark._gbl_resourceClass_sm. 🤔


https://opengeometadata.org/ogm-aardvark/#resource-class-values
"""
return {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you determine what key values to use? 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonavellecuerdo - not a precise science here. Knowing the controlled terms we want end up with, just slowly adding instances that we encounter.

Comment on lines +32 to +39
def _gbl_resourceClass_sm(self) -> list[str]:
mapped_values = []
for value in self.parsed_data.get("gbl_resourceClass_sm", []):
if mapped_value := self.gbl_resource_class_value_map.get(
value.strip().lower()
):
mapped_values.append(mapped_value) # noqa: PERF401
return mapped_values if mapped_values else ["Other"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify, if the original Aardvark record (in self.parsed_data) does not have gbl_resourceClass_sm assigned, the function would return ["Other"], right? Is that the expected behavior? If so, could we provide an explanation as to what "Other" means in this scenario? 🤔

It seems unlikely that we'll encounter this issue, given that gbl_resourceClass_sm is required for the Aardvark schema. Technically, if this Aardvark record was missing a value AND it was run through the schema, it would fail and throw a validation error...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct: if the original Aardvark file didn't have it, it'd get set to Other.

As much a fan of docstrings as I am, I feel like it's okay without? I think in metadata-land it's fairly common to have a default value like Other, particularly when the field is required, but there is no data to support it.

And to your point, we really should never encounter this situation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good! I'd be curious if any records that end up meeting this condition (i.e., returning ["Other"]) and maybe only then revisiting whether "Aardvark" records missing this required field should be kept or ignored (I have some difficulty semantically accepting ["Other"] as "no data" 😅). That said, it seems like a low chance that we'll encounter this case, so I think we can proceed forward!

harvester/records/gbl1.py Show resolved Hide resolved
@ghukill ghukill merged commit 2d2c49c into main Feb 14, 2024
5 checks passed
@ghukill ghukill deleted the GDT-109-normalize-ogm-records branch March 26, 2024 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants