Add 1st set of Datacite field methods #178

ehanson8 · 2024-05-21T18:18:13Z

Purpose and background context

Refactors the first set of Datacite fields as separate field methods. As this is the first of many of these PRs that DataEng will be writing/reviewing, I encourage nit-picking so we can refine the overall approach early.

How can a reviewer manually see the effects of these changes?

Run the following command to see that the Datacite transform still transforms a source file:

pipenv run transform -i tests/fixtures/datacite/datacite_records.xml -o output/datacite-transformed-records.json -s jpal

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-284

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed and verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Refactor Datacite to use field methods

How this addresses that need:
* Add comments to organize conftest.py by source
* Add additional Datacite fixtures to conftest.py
* Update datacite_record_all_fields.xml to align with other source fixtures
* Add create_datacite_source_record_stub function to Datacite test module
* Rename param xml > source_record
* Add field methods and associated private methods for alternate_titles, content_type, and contributors
* Add unit tests for new field methods
* Shift note-related code block from content_type to notes

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-284

ghukill

Overall, I think it looks great!

Left a few comments, but mostly just open questions and musings given this is one of the first PRs in this series.

I'd likely just outright approve with my questions for discussion, but I was curious about one particular comment about the self.get_<field_method>() or None pattern, pondering if it might be a good practice early here to make the field methods fully encapsulated to provide the final value.

transmogrifier/sources/xml/datacite.py

ghukill · 2024-05-22T14:44:03Z

tests/sources/xml/test_datacite.py

+def create_datacite_source_record_stub(xml_insert: str) -> BeautifulSoup:
+    xml_string = f"""
+        <records>
+         <record xmlns="http://www.openarchives.org/OAI/2.0/"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
+          <metadata>
+           <resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+           xmlns="http://datacite.org/schema/kernel-4"
+           xsi:schemaLocation="http://datacite.org/schema/kernel-4
+           http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
+           {xml_insert}
+           </resource>
+          </metadata>
+         </record>
+        </records>
+        """
+    return BeautifulSoup(xml_string, "xml")


I think I still like this pattern. I like that the XML template (let's call it) is fairly complete, such that any BS4 (or lxml) logic is accurately tested, including namespaces, DOM traversal, etc.

That said, I'm wondering if there is an opportunity to abstract this a bit so that more code is shared between source tests? I'm imagining standalone XML files for each source as the template, then a shared utility function to a) read the template file, and b) pass in whatever elements will be tested.

Might be an interesting discussion at some point. But I think this method is readily understandable, lightweight, and could be a good stepping stone to that kind of centralization once all the edge cases are known.

Great suggestion, and that feels like an efficiency that would serve us better earlier in the refactor than later. @jonavellecuerdo, your thoughts?

I like the idea!

Created a ticket and I would propose one of us pick it up soon to save future refactoring

Assigned myself to the ticket!
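
For reference, the shared-template idea discussed in this thread could look roughly like the sketch below. It is only an illustration under stated assumptions: the `TEMPLATES` mapping and the `create_source_record_stub` name are hypothetical, the template is inlined rather than read from a standalone file, and `xml.etree.ElementTree` is used instead of BeautifulSoup purely to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical per-source template. In the proposal this would live in a
# standalone XML file per source; it is inlined here so the sketch runs as-is.
DATACITE_TEMPLATE = Template(
    """
    <records>
      <record xmlns="http://www.openarchives.org/OAI/2.0/">
        <metadata>
          <resource xmlns="http://datacite.org/schema/kernel-4">
            $xml_insert
          </resource>
        </metadata>
      </record>
    </records>
    """
)

# Assumed registry of templates keyed by source name.
TEMPLATES = {"datacite": DATACITE_TEMPLATE}


def create_source_record_stub(source: str, xml_insert: str = "") -> ET.Element:
    """Build a stub record by inserting the elements under test into a template."""
    xml_string = TEMPLATES[source].substitute(xml_insert=xml_insert)
    return ET.fromstring(xml_string.strip())


# Usage: the inserted elements inherit the Datacite default namespace.
ns = {"dc": "http://datacite.org/schema/kernel-4"}
stub = create_source_record_stub("datacite", "<titles><title>A title</title></titles>")
print(stub.find(".//dc:title", ns).text)
```

Because the template keeps the full record wrapper and namespaces, namespace-aware lookups are still exercised, which is the property the current per-source stub function was praised for.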

tests/sources/xml/test_datacite.py

* Shift "or None" logic to field methods and update type hinting
* Update unit tests to all use stub records
* Remove unnecessary fixtures from conftest.py

ehanson8 · 2024-05-23T12:59:57Z

@ghukill @jonavellecuerdo Pushed a new commit with the changes we discussed

* Add default value to create_datacite_source_record_stub's xml_insert param
* Replace calls of source_record_id variable with calls of get_source_record_id method inside field methods

ehanson8 · 2024-05-23T16:06:20Z

@ghukill @jonavellecuerdo Another update after our very helpful discussion this morning!

jonavellecuerdo

@ehanson8 Just have a couple of questions and suggestions for changes. Let me know what you think! The Datacite transform has some complex logic, and I appreciate your efforts to make these updates as clear and digestible as possible. :)

transmogrifier/sources/xml/datacite.py

jonavellecuerdo · 2024-05-23T19:46:47Z

tests/conftest.py

+# aardvark ##########################
+
+
+@pytest.fixture
+def aardvark_records():
+    return JSONTransformer.parse_source_file("tests/fixtures/aardvark_records.jsonl")
+
+


Not relevant to this transform, can it be removed?

I set up those comments (including marc, oaidc, and timdex) to organize the existing fixtures in conftest.py by source, so I'd prefer to leave it.

jonavellecuerdo · 2024-05-23T20:01:13Z

transmogrifier/sources/xml/datacite.py

+        alternate_titles.extend(list(cls._get_additional_titles(source_record)))
+        return alternate_titles or None
+
+    @classmethod
+    def _get_additional_titles(
+        cls, source_record: Tag
+    ) -> Iterator[timdex.AlternateTitle]:
+        """Get additional titles from get_main_titles method."""
+        for index, title in enumerate(cls.get_main_titles(source_record)):
+            if index > 0:
+                yield timdex.AlternateTitle(value=title)


Hmm, the creation of the private method _get_additional_titles() brings the number of title-related methods to 4:

get_valid_title

get_main_titles

get_alternate_titles

_get_additional_titles

Because of this, I personally prefer moving the logic from the private method _get_additional_titles() into get_alternate_titles() and adding a comment instead. If not, maybe rename it to _get_additional_alternate_titles().

That said, I think we did kinda' discuss that get_valid_title() might be revisited at some point...

Curious what @ghukill thinks!

Thanks for raising @jonavellecuerdo. Without comment on your proposal above, I see something else in these title parsings that also stood out, after some hopping back and forth.

Looks like maybe the same XML parsing is used here in get_main_titles() and then again here in get_alternate_titles()? The difference being the first is looking for strings, and the second is returning timdex.AlternateTitle instances.

I don't know what the solution is, but something is feeling a bit wonky here.

Wouldn't this result in duplicating values for "main titles" and "alternate titles"? Does that have some bearing here?

This is another one where I agree 100% but I think we're better off creating an issue to address this separately. This is getting too close to core logic that I'm hesitant to mess with it during the field method refactor. Agree or disagree with that approach?

@ghukill I think the difference between get_main_titles() and get_alternate_titles() is:

get_main_titles() retrieves values where attribute @titleType is None.

get_alternate_titles() retrieves values where attribute @titleType exists.

_get_additional_titles() retrieves all values from get_main_titles(), excluding the first item in the list (if there are multiple values returned for get_main_titles(), the first value is used for timdexRecord.title).

That said, @ehanson8 you make a good point! This is stepping into the logic as opposed to the field method refactor work. I started an issue for Datacite transform logic.

Agreed @ehanson8, and thanks for the eagle eyes @jonavellecuerdo! This is like the 100th time this week I've missed a not in code. I might need new glasses.

Thanks for starting the issue. It does feel like something that could get untangled and streamlined, but agree with all comments above.

Thanks again for creating the issue, though I would argue that title issue goes much deeper than just Datacite so you could frame it more broadly as Refactor title logic. I think other transforms may be equally messy re: titles
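
To make the distinction described in this thread concrete, here is a minimal sketch of the three title helpers. It is not the code in datacite.py: `AlternateTitle` is a hypothetical stand-in for `timdex.AlternateTitle`, setting `kind` from `titleType` is an assumption, the functions are plain functions rather than classmethods, and `xml.etree.ElementTree` replaces BeautifulSoup so the example is self-contained.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterator, Optional

NS = {"dc": "http://datacite.org/schema/kernel-4"}


@dataclass
class AlternateTitle:
    """Hypothetical stand-in for timdex.AlternateTitle."""

    value: str
    kind: Optional[str] = None


def get_main_titles(resource: ET.Element) -> list[str]:
    """Return title strings that have no titleType attribute."""
    return [
        title.text
        for title in resource.findall("dc:titles/dc:title", NS)
        if title.get("titleType") is None and title.text
    ]


def _get_additional_titles(resource: ET.Element) -> Iterator[AlternateTitle]:
    """Yield every main title after the first (the first becomes the record title)."""
    for index, title in enumerate(get_main_titles(resource)):
        if index > 0:
            yield AlternateTitle(value=title)


def get_alternate_titles(resource: ET.Element) -> Optional[list[AlternateTitle]]:
    """Return typed titles plus additional main titles, or None if there are none."""
    alternate_titles = [
        AlternateTitle(value=title.text, kind=title.get("titleType"))
        for title in resource.findall("dc:titles/dc:title", NS)
        if title.get("titleType") is not None and title.text
    ]
    alternate_titles.extend(_get_additional_titles(resource))
    return alternate_titles or None
```

In this sketch the first main title never appears among the alternate titles, so main and alternate values are not duplicated; the same `titles/title` lookup is still performed twice, which is the redundancy flagged above.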

jonavellecuerdo · 2024-05-23T20:07:43Z

transmogrifier/sources/xml/datacite.py

+        else:
+            logger.warning(
+                "Datacite record %s missing required Datacite field resourceType",
+                cls.get_source_record_id(source_record),
+            )


Given that content_type is not a required field, I would opt to remove lines 287-291. The warning is confusing as it is technically "not required". This might've been related to our discussion earlier about "useful logging"? 🤔

I agree this should go, but I propose we create an issue of "evaluate logging for usefulness" to handle it globally (with other items flagged in the comments as they are encountered) and not bog down the field method refactor. The origin of this is that resourceType is required by the Datacite schema, but we've never done anything with it, and it's incongruous with how transmogrifier logging has developed since. Datacite was the first transform written, and the context was the RDI project, which is no longer our context.

Ah, thank you for clarifying! Created an issue for future work and discussion. :)

Thank you for creating the issue!

transmogrifier/sources/xml/datacite.py

ghukill

Looks good to me!

ghukill · 2024-05-23T20:34:10Z

transmogrifier/sources/xml/datacite.py

@@ -261,7 +261,7 @@ def get_alternate_titles(cls, source_record: Tag) -> list[timdex.AlternateTitle]
            ]
        )
        alternate_titles.extend(list(cls._get_additional_titles(source_record)))
-        return alternate_titles
+        return alternate_titles or None


At this time, I do like this full encapsulation, thanks for making these changes.

Relevant here and elsewhere...

Once we get into orchestration, which I think touches on the instantiation of the TIMDEXRecord instance, it might be worth considering if maybe it's okay to return an empty list? If so, and that was filtered out during serialization to JSON, then we could probably drop a bunch of these conversions to None. Just noting for the future, but again, think this is great for now, and in the spirit of the incremental approach.

I am most likely onboard with that approach when we get there!
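
The future option floated here (field methods return an empty list, and empty values are filtered at serialization) could be sketched as follows. Everything in it is hypothetical: `Record` is a much-reduced stand-in for the TIMDEX record model, and `to_json` is an illustrative helper, not an existing transmogrifier function.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Record:
    """Hypothetical, much-reduced stand-in for the TIMDEX record model."""

    title: str
    alternate_titles: list = field(default_factory=list)
    content_type: list = field(default_factory=list)


def to_json(record: Record) -> str:
    """Serialize a record, dropping fields whose value is None or an empty list."""
    data = {
        key: value
        for key, value in asdict(record).items()
        if value is not None and value != []
    }
    return json.dumps(data)


# Usage: empty lists never reach the output, so field methods would not need
# to convert them to None first.
print(to_json(Record(title="A title")))
```

With filtering centralized like this, the `or None` conversions in individual field methods would become redundant, which is the simplification the comment anticipates.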

ehanson8 requested review from ghukill and jonavellecuerdo May 21, 2024 18:18

ghukill reviewed May 22, 2024

Updates based on discussion in PR #178

6849eae

* Shift "or None" logic to field methods and update type hinting
* Update unit tests to all use stub records
* Remove unnecessary fixtures from conftest.py

Further updates based on discussion in PR #178

bdb2c4b

* Add default value to create_datacite_source_record_stub's xml_insert param
* Replace calls of source_record_id variable with calls of get_source_record_id method inside field methods

jonavellecuerdo reviewed May 23, 2024

ghukill approved these changes May 23, 2024

jonavellecuerdo mentioned this pull request May 24, 2024

Evaluate logging for usefulness #180

Open

jonavellecuerdo self-requested a review May 24, 2024 13:58

jonavellecuerdo approved these changes May 24, 2024

Update get_alternate_titles method

861b3d0

ehanson8 merged commit 526705e into main May 24, 2024
5 checks passed

ehanson8 deleted the TIMX-284-datacite-field-method-refactor branch May 24, 2024 15:29
