Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 1st set of Datacite field methods #178

Merged
merged 4 commits into from
May 24, 2024

Conversation

ehanson8
Copy link
Contributor

Purpose and background context

Refactors the first set of Datacite fields as separate field methods. As this is the first of many of these PRs that DataEng will be writing/reviewing, I encourage nit-picking so we can refine the overall approach early.

How can a reviewer manually see the effects of these changes?

Run the following command to see that the Datacite transform still transforms a source file:

pipenv run transform -i tests/fixtures/datacite/datacite_records.xml -o output/datacite-transformed-records.json -s jpal

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Refactor Datacite to use field methods

How this addresses that need:
* Add comments to organize conftest.py by source
* Add additional Datacite fixtures to conftest.py
* Update datacite_record_all_fields.xml to align with other source fixtures
* Add create_datacite_source_record_stub function to Datacite test module
* Rename param xml > source_record
* Add field methods and associated private methods for alternate_titles, content_type, and contributors
* Add unit tests for new field methods
* Shift note-related code block from content_type to notes

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-284
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think it looks great!

Left a few comments, but mostly just open questions and musings given this is one of the first PRs in this series.

I'd likely just outright approve with my questions for discussion, but I was curious about one particular comment about the self.get_<field_method>() or None pattern, pondering if it might be a good practice early here to make the field methods fully encapsulated to provide the final value.

transmogrifier/sources/xml/datacite.py Outdated Show resolved Hide resolved
transmogrifier/sources/xml/datacite.py Outdated Show resolved Hide resolved
transmogrifier/sources/xml/datacite.py Show resolved Hide resolved
Comment on lines 22 to 38
def create_datacite_source_record_stub(xml_insert: str) -> BeautifulSoup:
xml_string = f"""
<records>
<record xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<metadata>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://datacite.org/schema/kernel-4"
xsi:schemaLocation="http://datacite.org/schema/kernel-4
http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
{xml_insert}
</resource>
</metadata>
</record>
</records>
"""
return BeautifulSoup(xml_string, "xml")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I still like this pattern. I like that the XML template (let's call it) is fairly complete, such that any BS4 (or lxml) logic is accurately tested, including namespaces, DOM traversal, etc.

That said, wondering if there is an opportunity to abstract this a bit for more shared code between source tests? Imagining standlone XML files for each source as the template, then a shared utility function to a) read the template file, and b) pass in whatever elements will be tested.

Might be an interesting discussion at some point. But I think this method is readily understandable, lightweight, and could be a good stepping stone to that kind of centralization once all the edge cases are known.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion and that feels like an efficiency that would serve us better earlier in the refactor than later, @jonavellecuerdo your thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea!

Copy link
Contributor Author

@ehanson8 ehanson8 May 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created a ticket and I would propose one of us pick it up soon to save future refactoring

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigned myself to the ticket!

tests/sources/xml/test_datacite.py Show resolved Hide resolved
tests/sources/xml/test_datacite.py Outdated Show resolved Hide resolved
tests/sources/xml/test_datacite.py Show resolved Hide resolved
* Shift "or None" logic to field methods and update type hinting
* Update unit tests to all use stub records
* Remove unnecessary fixtures from conftest.py
@ehanson8
Copy link
Contributor Author

@ghukill @jonavellecuerdo Pushed a new commit with the changes we discussed

* Add default value to create_datacite_source_record_stub's xml_insert param
* Replace calls of source_record_id variable with calls of get_source_record_id method inside field methods
@ehanson8
Copy link
Contributor Author

@ghukill @jonavellecuerdo Another update after our very helpful discussion this morning!

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ehanson8 Just have a couple of questions and suggestions for changes. Let me know what you think! The Datacite transform has some complex logic, and I appreciate your efforts to make these updates as clear and digestible as possible. :)

transmogrifier/sources/xml/datacite.py Outdated Show resolved Hide resolved
Comment on lines +46 to +53
# aardvark ##########################


@pytest.fixture
def aardvark_records():
return JSONTransformer.parse_source_file("tests/fixtures/aardvark_records.jsonl")


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not relevant to this transform, can it be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I set up those comments (including marc, oaidc, and timdex) to organize the existing fixtures conftest by source so I'd prefer to leave it

Comment on lines +263 to +273
alternate_titles.extend(list(cls._get_additional_titles(source_record)))
return alternate_titles or None

@classmethod
def _get_additional_titles(
cls, source_record: Tag
) -> Iterator[timdex.AlternateTitle]:
"""Get additional titles from get_main_titles method."""
for index, title in enumerate(cls.get_main_titles(source_record)):
if index > 0:
yield timdex.AlternateTitle(value=title)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the creation of the private method _get_additional_titles() brings the number of title-related methods to 4:

  • get_valid_title
  • get_main_titles
  • get_alternate_titles
  • _get_additional_titles

Because of this, I personally prefer moving the logic in _get_additional_titles() private method into get_alternate_titles() and adding a comment instead. If not, maybe rename to _get_additional_alternate_titles().

That said, I think we did kinda' discuss that get_valid_title() might be revisited at some point...

Curious what @ghukill thinks!

Copy link
Contributor

@ghukill ghukill May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising @jonavellecuerdo. Without comment on your proposal above, I see something else in these title parsings that also stood out, after some hopping back and forth.

Looks like maybe the same XML parsing is used here in get_main_titles() and then again here in get_alternate_titles()? The difference being the first is looking for strings, and the second is returning timdex.AlternateTitle instances.

I don't know what the solution is, but something is feeling a bit wonky here.

Wouldn't this result in duplicating values for "main titles" and "alternate titles"? Does that have some bearing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another one where I agree 100% but I think we're better off creating an issue to address this separately. This is getting too close to core logic that I'm hesitant to mess with it during the field method refactor. Agree or disagree with that approach?

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghukill I think the difference between get_main_titles() and get_alternate_titles is:

  • get_main_titles() retrieves values where attribute @titleType is None.
  • get_alternate_titles() retrieves values where attribute @titleType exists.
  • _get_additional_titles retrieves all values from get_main_titles(), excluding the first item in the list (if there are multiple values returned for get_main_titles(), the first value is used for timdexRecord.title.

That said, @ehanson8 you make a good point! This is stepping into the logic as opposed to the field method refactor work. I started an issue for Datacite transform logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed @ehanson8, and thanks for the eagle eyes @jonavellecuerdo! This is like the 100th time this week I've missed a not in code. I might need new glasses.

Thanks for starting the issue. It does feel like something that could get untangled and streamlined, but agree with all comments above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for creating the issue, though I would argue that title issue goes much deeper than just Datacite so you could frame it more broadly as Refactor title logic. I think other transforms may be equally messy re: titles

Comment on lines +287 to +291
else:
logger.warning(
"Datacite record %s missing required Datacite field resourceType",
cls.get_source_record_id(source_record),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that content_type is not a required field, I would opt to remove lines 287-291. The warning is confusing as it is technically "not required". This might've been related to our discussion earlier about "useful logging"? 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this should go but propose we create an issue of "evaluate logging for usefulness" to handle it globally (with other items flagged in the comments as they are encountered) and not bog down the field method refactor. The origin of this is that resourceType is required by the Datacite schema, but we've never done anything with it and it's incongruous with how transmogrifier logging has developed since. Datacite was the first transform written and the context was the RDI project, which is no longer our context

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you for clarifying! Created an issue for future work and discussion. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for creating the issue!

transmogrifier/sources/xml/datacite.py Show resolved Hide resolved
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@@ -261,7 +261,7 @@ def get_alternate_titles(cls, source_record: Tag) -> list[timdex.AlternateTitle]
]
)
alternate_titles.extend(list(cls._get_additional_titles(source_record)))
return alternate_titles
return alternate_titles or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At this time, I do like this full encapsulation, thanks for making these changes.

Relevant here and elsewhere...

Once we get into orchestration, which I think touches on the instantiation of the TIMDEXRecord instance, it might be worth considering if maybe it's okay to return an empty list? If so, and that was filtered out during serialization to JSON, then we could probably drop a bunch of these conversions to None. Just noting for the future, but again, think this is great for now, and in the spirit of the incremental approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am most likely onboard with that approach when we get there!

@ehanson8 ehanson8 merged commit 526705e into main May 24, 2024
5 checks passed
@ehanson8 ehanson8 deleted the TIMX-284-datacite-field-method-refactor branch May 24, 2024 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants