
Save pages returned by oa_request() and do not process further #181

Closed
wants to merge 6 commits

Conversation

rkrug

@rkrug rkrug commented Oct 20, 2023

This is inspired by #166

Due to memory limitations, it is not possible to download a large number of works (thousands or millions). This pull request introduces an argument save_pages to the function oa_request which, if provided, saves all downloaded pages in the directory it specifies. No further processing of the downloaded pages is done, as this would again lead to memory problems; instead, a vector containing the file names of the saved pages is returned.

This is at the moment a very rudimentary implementation and could be tweaked further for performance, but I don't think that is needed: a quick profiling indicated that the API calls take about 80% of the time, while saveRDS() takes only about 10%.
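As a rough sketch of the workflow this PR proposes (hypothetical: the save_pages argument is what the PR introduces and is not part of released openalexR; the oa_query() call and its arguments are illustrative):

```r
# Sketch only -- save_pages is the argument proposed in this PR,
# not a released openalexR feature.
library(openalexR)

query <- oa_query(
  entity = "works",
  from_publication_date = "2020-01-01"
)

# With save_pages set, each downloaded page would be written to disk
# and only the vector of file paths returned, keeping memory use flat.
paths <- oa_request(query, save_pages = "pages_dir")

# Pages can then be processed one at a time:
for (p in paths) {
  page <- readRDS(p)
  # ... process one page, then let it be garbage-collected
}
```

The point of the design is that no page is held in memory longer than its own processing step requires.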

@rkrug rkrug changed the title Save pages returned by oa_request() and do not process fue=rther Save pages returned by oa_request() and do not process further Oct 20, 2023
Collaborator

@yjunechoe yjunechoe left a comment


Thanks for the PR! I left a few minor (design-related) comments for your consideration (haven't tested out how the code runs yet).

R/oa_fetch.R Outdated
@@ -163,6 +163,15 @@ oa_fetch <- function(entity = if (is.null(identifier)) NULL else id_type(shorten
#' Either "cursor" for cursor paging or "page" for basic paging.
#' When used with options$sample, please set `paging = "page"`
#' to avoid duplicates.
#' @param save_pages Character.
Collaborator


Two quick thoughts about save_pages:

  1. I think the name could be a little clearer - people might confuse this with a toggle option taking T/F. Maybe something like out_dir that makes it explicit that it expects a directory path as value?

  2. The argument description should ideally be concise and document just its function. The motivation and other notes (about memory issues, etc.) could move to their own @section or @note

Author


  1. Agreed, though out_dir sounds a bit like where to save the output, which it is not. How about output_pages_to instead?
  2. OK - I will streamline it.

Collaborator


Yeah output_pages_to sounds great!

Author


Changed in last commit

R/oa_fetch.R Outdated
Collaborator

@trangdata trangdata left a comment


I'm not sure it's optimal to use saveRDS within the package.

I'm currently working on a different branch that improves paging in oa_fetch and oa_request. Let me finish some final touches, and perhaps you can help me test my implementation, @rkrug? The idea is to use basic paging for fetching individual pages. Personally, I think this change would be less disruptive than saveRDS. I will give some examples in the PR. And whether or not we move forward with output_pages_to, I think my branch would still improve the paging implementation.

Would love to hear your thoughts @rkrug @yjunechoe. Stay tuned!

@rkrug
Author

rkrug commented Oct 20, 2023

From my experience with snowballing starting from more than 3000 papers, the limitation really is memory, i.e. the memory management of R.
If one were to download one page, then reserve the memory needed for a single object holding the data from all pages (e.g. rep(ONE_PAGE_DATA_structure, n_pages)) by creating a dummy object and replacing each page in that object as it arrives, it could work, unless the final object is too large (and this would speed the process up, as only one memory allocation is needed).
But still, I guess the total memory limit kicks in sooner or later. So the fallback solution of saving each page using saveRDS would be a safety net, which one could use if a) an argument is set, or b) the object containing the data cannot be created.

Just my thoughts, and looking forward to your pull request, @trangdata
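The pre-allocation idea above could look roughly like this in base R (a sketch under the assumption that every page has a comparable structure; download_one_page() is a placeholder, not a real openalexR function):

```r
# Hypothetical sketch of the pre-allocation idea: reserve one list slot
# per page up front, then fill slots in place so the container is
# allocated only once.
n_pages <- 50
pages <- vector("list", n_pages)  # single allocation for the container

for (i in seq_len(n_pages)) {
  page <- download_one_page(i)    # placeholder for the actual API call
  pages[[i]] <- page              # replace the slot in place

  # Safety net: if memory pressure becomes a problem, fall back to disk:
  # saveRDS(page, file.path("pages_dir", sprintf("page_%04d.rds", i)))
}
```

Even with in-place filling, the final combined object must fit in memory, which is why the on-disk fallback remains useful for very large result sets.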

@@ -312,6 +322,7 @@ oa_request <- function(query_url,
per_page = 200,
paging = "cursor",
pages = NULL,
output_pages_to = NULL,
count_only = FALSE,
Author


This is no more than a proof-of-concept implementation.

@yjunechoe
Collaborator

I'll wait to comment more until we resolve our discussion elsewhere on the design-related issues, but I'll just add that I'm also a little wary of using saveRDS. It's a weird middle ground where you apply some but not all processing to the data received from OA. I think we should either dump out the JSON response exactly as we receive it, OR process it a bit further (e.g. at least make it tabular) so that we can use a more efficient storage format (even csv is better than rds).
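Dumping the raw response could be sketched as follows (hypothetical helper, not part of openalexR; assumes the httr package and that query_url returns a JSON body):

```r
# Hypothetical sketch: write the JSON body exactly as received, one file
# per page, instead of serializing parsed R objects with saveRDS().
library(httr)

save_raw_page <- function(query_url, i, out_dir) {
  resp <- GET(query_url)
  stop_for_status(resp)
  path <- file.path(out_dir, sprintf("page_%04d.json", i))
  writeLines(content(resp, as = "text", encoding = "UTF-8"), path)
  path
}
```

Storing the untouched JSON keeps the on-disk format portable (readable outside R), whereas .rds files tie the archive to R's serialization format.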

@rkrug
Author

rkrug commented Oct 23, 2023

Dumping the JSON response would make perfect sense - but probably as an RDS to save space when downloading large sets. Let's discuss this further when the other discussion has reached consensus.

@trangdata
Collaborator

I'm also a little wary of using saveRDS. It's a weird middle ground where you apply some but not all processing to the data received from OA.

@yjunechoe Not sure you saw my comment in my Improve paging control PR, but I'll reiterate it here: at the moment, I think it's out of scope for openalexR to handle any type of serialization/IO. We currently provide enough flexibility for the user to do this themselves.

In general, I think making many requests to the OpenAlex database is not a common use case, and I would generally advise against it. One can download the snapshot themselves and proceed with the filtering/processing without needing openalexR.

Therefore, my current suggestion would be to close this PR.

@yjunechoe
Collaborator

Ah, I'd missed the discussion going over there! Yes, I don't disagree with just closing this as out of scope (my comment was more of an "if we do implement this" comment).

@yjunechoe
Collaborator

yjunechoe commented Oct 23, 2023

Closing this as out of scope for the package, in favor of #183 and the continuing discussion on #182 (where we can resolve how to streamline this workflow without introducing too much maintenance burden). Apologies for my earlier confusion and thanks for putting the work to explore this direction!

@yjunechoe yjunechoe closed this Oct 23, 2023
3 participants