DataCite Query Pagination Support #263

collinss-jpl · 2021-10-04T23:31:16Z

🗒️ Summary

This PR adds support for DataCite's pagination scheme to the query function used to pull down existing DOI records from a DataCite server. This functionality should have been included with the original addition of DataCite support, but was overlooked.
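
For readers unfamiliar with the scheme, the page-following loop can be sketched roughly as below. This is a minimal illustration, not the code in this branch: `query_all_pages` and `fetch_page` are hypothetical names, the HTTP request is replaced by canned JSON so the sketch runs offline, and the `meta.totalPages` field is assumed from the response shape discussed in the review.

```python
import json
from typing import Callable, Dict, List


def query_all_pages(fetch_page: Callable[[int], str]) -> List[Dict]:
    """Collect the 'data' entries from every page of a paginated query.

    fetch_page is a hypothetical callable that returns the raw JSON text
    of the requested (1-based) page.
    """
    data: List[Dict] = []
    page_number = 1
    while True:
        result = json.loads(fetch_page(page_number))
        # Append current results to the full set returned
        data.extend(result["data"])
        # Assumed response field: total number of pages for this query
        total_pages = result.get("meta", {}).get("totalPages", 1)
        if page_number >= total_pages:
            break
        page_number += 1
    return data


# Stand-in for the HTTP request: two canned pages of JSON text.
FAKE_PAGES = {
    1: json.dumps({"data": [{"id": "10.1/a"}, {"id": "10.1/b"}],
                   "meta": {"totalPages": 2}}),
    2: json.dumps({"data": [{"id": "10.1/c"}],
                   "meta": {"totalPages": 2}}),
}

records = query_all_pages(FAKE_PAGES.__getitem__)
print([r["id"] for r in records])
```

Passing the page fetcher in as a callable keeps the loop independent of any particular HTTP library, which also makes it straightforward to unit test with canned responses.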

In addition to pagination support, this branch adds several other small improvements, including:

Addition of the "Radio Science" node ID to the PDS node IDs recognized by the DOI service
Better logging of warning messages from parsing of optional fields from a DataCite label
Better support for determining an appropriate node ID from a parsed DataCite label when importing to the local database

⚙️ Test Data and/or Report

A unit test has been added for the pagination support added to the DataCite web client.
test.txt

♻️ Related Issues

Fixes #261

nutjob4life

Please see my question but otherwise looks fine 👌

nutjob4life · 2021-10-04T23:44:56Z

src/pds_doi_service/core/outputs/datacite/datacite_web_client.py

+
+                # Append current results to full set returned
+                result = json.loads(datacite_response.text)
+                data.extend(result['data'])


Will there ever be a concern that data grows overly large if, say, totalPages is also very large? Should we use a generator-style pattern here instead of gathering and returning the entire kit and caboodle?

collinss-jpl · 2021-10-05

I would say this is not currently a concern, for two reasons.

First, performance was fine when I imported all 1300+ records currently available on DataCite (which is only about a page and a half of results):

INFO __main__:main DOI import complete in 23.26 seconds.
INFO __main__:main Num records found: 1390
INFO __main__:main Num records processed: 1390
INFO __main__:main Num records written: 1390
INFO __main__:main Num records skipped: 0

Second, this query function is currently used only by the local database import script, so for the time being we shouldn't have to worry about an end user of the service/UI running into a performance bottleneck because we slurp all available pages.
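
For reference, should the record count ever grow enough for this to matter, a generator-style variant along the lines suggested above might look like the following. This is a minimal sketch, not code from this branch: `iter_records` and `fetch_page` are hypothetical names, the request is faked with canned JSON, and the `meta.totalPages` field is assumed from the response shape discussed above.

```python
import json
from typing import Callable, Dict, Iterator, List


def iter_records(fetch_page: Callable[[int], str]) -> Iterator[Dict]:
    """Yield records one at a time, requesting each page only when needed."""
    page_number = 1
    while True:
        result = json.loads(fetch_page(page_number))
        yield from result["data"]
        # Assumed response field: total number of pages for this query
        if page_number >= result.get("meta", {}).get("totalPages", 1):
            return
        page_number += 1


# Record which pages are requested so the lazy behavior is visible.
calls: List[int] = []


def fake_fetch(page_number: int) -> str:
    """Hypothetical stand-in for the HTTP request."""
    calls.append(page_number)
    pages = {1: ["a", "b"], 2: ["c"]}
    return json.dumps({"data": [{"id": i} for i in pages[page_number]],
                       "meta": {"totalPages": 2}})


records = iter_records(fake_fetch)
first = next(records)  # only page 1 has been requested at this point
print(first, calls)
```

With this shape, only one page of results is held in memory at a time, at the cost of callers no longer being able to take `len()` of the result or index into it.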

nutjob4life · 2021-10-05T15:53:12Z

Thanks @collinss-jpl!

Scott Collins added 6 commits October 4, 2021 16:13

Added support for paginated results to DOIDataCiteWebClient.query_doi()

e7f08d7

Added unit test for DataCite-based query with paginated results

9816037

Added Radio Science to list of PDS nodes in node_util.py

6418d49

Updated parsing of optional fields in DOIDataCiteWebParser to provide record index numbers with warnings when an optional field cannot be parsed

80e4572

Added better support for parsing of PDS node ID from a DOI record in initialize_production_deployment.py

fad8bad

Fixed incorrect type in Doi dataclass definition

15444d0

collinss-jpl self-assigned this Oct 4, 2021

collinss-jpl requested a review from a team as a code owner October 4, 2021 23:31

nutjob4life approved these changes Oct 4, 2021

collinss-jpl merged commit 9ab673f into main Oct 5, 2021
