Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add 2nd set of Datacite field methods #181

Merged
merged 2 commits into from
May 30, 2024

Conversation

ehanson8
Copy link
Contributor

Purpose and background context

Continue refactoring Datacite transform to use field methods.

How can a reviewer manually see the effects of these changes?

Run the following command to see that the Datacite transform still transforms a source file:

pipenv run transform -i tests/fixtures/datacite/datacite_records.xml -o output/datacite-transformed-records.json -s jpal

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed and verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:
* Continue refactoring Datacite to use field methods

How this addresses that need:
* Add field methods and associated private methods for dates, edition, file_formats, format, funding_information, identifiers, languages, links, and locations
* Update related_items code block to generate related_identifiers list that was moved into get_identifiers
* Add unit tests for new field methods

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-284
assert Datacite.get_file_formats(source_record) is None


def test_get_format_success():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The usual tests were deliberately omitted given the method's guaranteed output

Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looking real good. Left a couple of comments, but nothing major. I'm finding the file, each time, easier and easier to scan.

tests/sources/xml/test_datacite.py Show resolved Hide resolved
tests/sources/xml/test_datacite.py Show resolved Hide resolved
transmogrifier/sources/xml/datacite.py Show resolved Hide resolved
)

@classmethod
def _get_dates(cls, source_record: Tag) -> Iterator[timdex.Date]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgive me if we've already discussed this, but I keep tripping on this private method _get_dates() and it's relationship to get_dates(). What about a rename to something like _get_other_dates()? I realize that "other" isn't very exact, but I think it definitively differentiates from the higher level field method get_dates().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about get _get_datacite_dates which I think is more accurate? I could also update _get_contributors to _get_datacite_contributors since that's what originally prompted the conversation

Copy link
Contributor

@ghukill ghukill May 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks for the reminder, I thought it had come up.

I don't mean to be difficult, but even adding datacite feels odd... like we know it's datacite already. But I don't really have a better solution.

I suppose one unique property is that we're setting the date kind dynamically by the attribute dateType; wonder if that could come into play naming wise? Maybe something like _get_dates_by_datetype_attribute()? It's long... but it tells you precisely what's going on?

Or we could focus on the element used, like _get_dates_by_date_element()? or _parse_dates_from_date_elements().

Open to pushback here. If folks are okay with _get_dates() and _get_contributors() I could be too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like _get_dates_by_date_element, I think I'll go with that

transmogrifier/sources/xml/datacite.py Show resolved Hide resolved
transmogrifier/sources/xml/datacite.py Outdated Show resolved Hide resolved
transmogrifier/sources/xml/datacite.py Outdated Show resolved Hide resolved
* Rename methods for clarity
@ehanson8
Copy link
Contributor Author

@ghukill @jonavellecuerdo Pushed a new commit based on Graham's comments

@ghukill ghukill self-requested a review May 30, 2024 15:03
Copy link
Contributor

@ghukill ghukill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dig it! Thanks for the changes. At this stage, they feel really understandable. Maybe in "phase 2" we take another pass at naming, once we've seen a lot, but appreciate the changes now.

Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small change request and a question for both of you (@ghukill and @ehanson8).

transmogrifier/sources/xml/datacite.py Show resolved Hide resolved
@classmethod
def get_languages(cls, source_record: Tag) -> list[str] | None:
languages = []
if language := source_record.metadata.find("language", string=True):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non-blocking] When do we include the "_element" suffix to a variable?

This is something I've personally been getting tripped up on in the transforms I've been working on --naming conventions assigned to find/find_all method calls. These differences don't cause any problems but consistency would be nice...

@ghukill @ehanson8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was using the _element convention where element attributes are accessed in addition to element.string so it probably should be used for get_funding_information as well. To me, it doesn't feel as necessary when we're just grabbing element.string. Interested in consistency as well so I want to hear everyone's thoughts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the thought process there @ehanson8, and I agree that consistency would be nice, but it doesn't feel like a deal breaker to me with some variation. Given how common the convention is to a) find an element, and then b) use that element variable to extract a string or an attribute value, I think the _element could get pretty chatty.

It might be worth an issue? Then we could keep moving through this refactor, but make a global pass at some point.

In summary: I'm personally okay with variations, given different times in codebae, but an issue to revist could be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing! I'll apply that thought process as I work on other transforms, @ehanson8 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, created an issue

@jonavellecuerdo jonavellecuerdo self-requested a review May 30, 2024 18:14
@ehanson8 ehanson8 merged commit ff1aaeb into main May 30, 2024
5 checks passed
@ehanson8 ehanson8 deleted the TIMX-284-datacite-fmr-strikes-back branch May 30, 2024 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants