
Save pages returned by oa_request() and do not process further #181

Closed
wants to merge 6 commits

Conversation

rkrug

@rkrug rkrug commented Oct 20, 2023

This is inspired by #166

Due to memory limitations, it is not possible to download a large number of works (thousands or millions). This pull request introduces an argument save_pages to the function oa_request which, if provided, saves all downloaded pages in the directory it specifies. No further processing of the downloaded pages is done, as this would again lead to memory problems; instead, a vector containing the file names of the saved pages is returned.

This is at the moment a very rudimentary implementation and could be tweaked further for performance, but I don't think that is needed: a quick profiling indicated that the API calls take about 80% of the time, while saveRDS() takes only about 10%.
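As a rough sketch of the workflow this PR proposes (hypothetical: the save_pages argument is what the PR introduces and is not part of released openalexR; the oa_query() call and its arguments are illustrative):

```r
# Sketch only -- save_pages is the argument proposed in this PR,
# not a released openalexR feature.
library(openalexR)

query <- oa_query(
  entity = "works",
  from_publication_date = "2020-01-01"
)

# With save_pages set, each downloaded page would be written to disk
# and only the vector of file paths returned, keeping memory use flat.
paths <- oa_request(query, save_pages = "pages_dir")

# Pages can then be processed one at a time:
for (p in paths) {
  page <- readRDS(p)
  # ... process one page, then let it be garbage-collected
}
```

The point of the design is that no page is held in memory longer than its own processing step requires.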

@rkrug rkrug changed the title Save pages returned by oa_request() and do not process fue=rther Save pages returned by oa_request() and do not process further Oct 20, 2023
Collaborator

@yjunechoe yjunechoe left a comment


Thanks for the PR! I left a few minor (design-related) comments for your consideration (haven't tested out how the code runs yet).

R/oa_fetch.R Outdated
@@ -163,6 +163,15 @@ oa_fetch <- function(entity = if (is.null(identifier)) NULL else id_type(shorten
#' Either "cursor" for cursor paging or "page" for basic paging.
#' When used with options$sample, please set `paging = "page"`
#' to avoid duplicates.
#' @param save_pages Character.
Collaborator


Two quick thoughts about save_pages:

  1. I think the name could be a little clearer - people might confuse this with a toggle option taking T/F. Maybe something like out_dir that makes it explicit that it expects a directory path as value?

  2. The argument description should ideally be concise and document just its function. The motivation and other notes (about memory issues, etc.) could move to their own @section or @note

Author


  1. Agreed, though out_dir sounds a bit like where to save the output, which it is not. How about output_pages_to instead?
  2. OK - I will streamline it.

Collaborator


Yeah output_pages_to sounds great!

Author


Changed in last commit

R/oa_fetch.R Outdated
Collaborator

@trangdata trangdata left a comment


I'm not sure it's optimal to use saveRDS within the package.

I'm currently working on a different branch that improves paging in oa_fetch and oa_request. Let me finish some final touches, and perhaps you can help me test my implementation, @rkrug? The idea is to use basic paging for fetching individual pages. Personally, I think this change would be less disruptive than saveRDS. I will give some examples in the PR. And whether or not we move forward with output_pages_to, I think my branch would still improve the paging implementation.

Would love to hear your thoughts @rkrug @yjunechoe. Stay tuned!

@rkrug
Author

rkrug commented Oct 20, 2023

From my experience with snowballing starting from more than 3000 papers, the limitation really is memory, i.e. the memory management of R.
If one were to download one page, then reserve the memory needed for a single object holding the data from all pages (e.g. rep(ONE_PAGE_DATA_structure, n_pages)) by creating a dummy object and replacing each page in that object as it arrives, it could work, unless the final object is too large (and this would speed the process up, as only one memory allocation is needed).
But still, I guess the total memory limit kicks in sooner or later. So the fallback solution of saving each page using saveRDS would be a safety net, which one could use if a) an argument is set, or b) the object containing the data cannot be created.

Just my thoughts, and looking forward to your pull request, @trangdata
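The pre-allocation idea above could look roughly like this in base R (a sketch under the assumption that every page has a comparable structure; download_one_page() is a placeholder, not a real openalexR function):

```r
# Hypothetical sketch of the pre-allocation idea: reserve one list slot
# per page up front, then fill slots in place so the container is
# allocated only once.
n_pages <- 50
pages <- vector("list", n_pages)  # single allocation for the container

for (i in seq_len(n_pages)) {
  page <- download_one_page(i)    # placeholder for the actual API call
  pages[[i]] <- page              # replace the slot in place

  # Safety net: if memory pressure becomes a problem, fall back to disk:
  # saveRDS(page, file.path("pages_dir", sprintf("page_%04d.rds", i)))
}
```

Even with in-place filling, the final combined object must fit in memory, which is why the on-disk fallback remains useful for very large result sets.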

@@ -312,6 +322,7 @@ oa_request <- function(query_url,
per_page = 200,
paging = "cursor",
pages = NULL,
output_pages_to = NULL,
count_only = FALSE,
Author


This is no more than a proof-of-concept implementation.

@yjunechoe
Collaborator

I'll wait to comment more until we resolve our discussion elsewhere on the design-related issues, but I'll just add that I'm also a little wary of using saveRDS. It's a weird middle ground where you apply some but not all processing to the data received from OA. I think we should either dump out the JSON response exactly as we receive it, OR process it a bit further (e.g. at least make it tabular) so that we can use a more efficient storage format (even csv is better than rds).
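Dumping the raw response could be sketched as follows (hypothetical helper, not part of openalexR; assumes the httr package and that query_url returns a JSON body):

```r
# Hypothetical sketch: write the JSON body exactly as received, one file
# per page, instead of serializing parsed R objects with saveRDS().
library(httr)

save_raw_page <- function(query_url, i, out_dir) {
  resp <- GET(query_url)
  stop_for_status(resp)
  path <- file.path(out_dir, sprintf("page_%04d.json", i))
  writeLines(content(resp, as = "text", encoding = "UTF-8"), path)
  path
}
```

Storing the untouched JSON keeps the on-disk format portable (readable outside R), whereas .rds files tie the archive to R's serialization format.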

@rkrug
Author

rkrug commented Oct 23, 2023

Dumping the JSON response would make perfect sense - but probably as an RDS to save space when downloading large sets. Let's discuss this further when the other discussion has reached consensus.

@trangdata
Collaborator

I'm also a little wary of using saveRDS. It's a weird middle ground where you apply some but not all processing to the data received from OA.

@yjunechoe Not sure you saw my comment in my Improve paging control PR, but I'll reiterate it here: at the moment, I think it's out of scope for openalexR to handle any type of serialization/IO. We currently provide enough flexibility for the user to do this themselves.

In general, I think making many requests to the OpenAlex database is not a common use case, and I would generally advise against it. One can download the snapshot themselves and proceed with the filtering/processing without needing openalexR.

Therefore, my current suggestion would be to close this PR.

@yjunechoe
Collaborator

Ah, I'd missed the discussion going over there! Yes, I don't disagree with just closing this as out of scope (my comment was more of an "if we do implement this" comment).

@yjunechoe
Collaborator

yjunechoe commented Oct 23, 2023

Closing this as out of scope for the package, in favor of #183 and the continuing discussion on #182 (where we can resolve how to streamline this workflow without introducing too much maintenance burden). Apologies for my earlier confusion and thanks for putting the work to explore this direction!

@yjunechoe yjunechoe closed this Oct 23, 2023
3 participants